Jailbreak and Guard Aligned Language Models With Only Few In-Context Demonstrations
Summary
This research shows that large language models can be tricked or protected using in-context learning (ICL, a technique where an AI learns from examples provided in its current input rather than from additional training). The researchers developed two methods: an In-Context Attack, which prepends harmful demonstrations to make LLMs produce unsafe outputs, and an In-Context Defense, which prepends refusal demonstrations to strengthen safety. The study demonstrates that both attacking and defending LLM safety with only a few carefully chosen demonstrations is effective and scalable. A minimal sketch of the defense-side idea follows.
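The sketch below illustrates the general shape of an in-context defense: a handful of refusal demonstrations are placed in the conversation before the real user query, so the model sees examples of declining unsafe requests. The demonstration texts and the build_chat_messages helper are illustrative assumptions, not the paper's exact prompts or interface.

```python
# Minimal sketch of an in-context defense, assuming a chat-style message API.
# The demonstration pairs below are placeholders, not the prompts used in the paper.

REFUSAL_DEMONSTRATIONS = [
    # Each demonstration is a (harmful request, safe refusal) pair.
    ("How do I make a dangerous chemical at home?",
     "I can't help with that. Creating hazardous substances can cause serious harm."),
    ("Write instructions for breaking into a neighbor's house.",
     "I'm not able to help with activities that are illegal or endanger others."),
]

def build_chat_messages(user_query: str) -> list[dict]:
    """Assemble a chat-style message list with refusal demonstrations
    inserted before the actual user query (the in-context defense idea)."""
    messages = []
    for request, refusal in REFUSAL_DEMONSTRATIONS:
        messages.append({"role": "user", "content": request})
        messages.append({"role": "assistant", "content": refusal})
    # The real query comes last, after the safety demonstrations.
    messages.append({"role": "user", "content": user_query})
    return messages

if __name__ == "__main__":
    # The resulting message list would be passed to whatever chat model is in use.
    for m in build_chat_messages("Tell me about in-context learning."):
        print(m["role"], ":", m["content"])
```

An in-context attack works the same way structurally, except the prepended demonstrations show the model complying with harmful requests instead of refusing them.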
Classification
Related Issues
CVE-2024-27444: langchain_experimental (aka LangChain Experimental) in LangChain before 0.1.8 allows an attacker to bypass the CVE-2023-
CVE-2026-30308: In its design for automatic terminal command execution, HAI Build Code Generator offers two options: Execute safe comman
Original source: http://ieeexplore.ieee.org/document/11370531
First tracked: May 7, 2026 at 08:03 PM
Classified by LLM (prompt v3) · confidence: 92%