A one-prompt attack that breaks LLM safety alignment
Summary
Researchers discovered that Group Relative Policy Optimization (GRPO), a reinforcement learning technique normally used to align model behavior, including safety alignment, can be turned against that alignment simply by changing the reward signal. Fine-tuning a safety-aligned model on even a single harmful prompt, with responses scored on how well they fulfill the harmful request rather than refuse it, causes the model to gradually abandon its safety guidelines and produce harmful content across many categories it never encountered during the attack.
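To illustrate the mechanism, the sketch below shows the group-relative advantage step at the core of GRPO: sample a group of completions for one prompt, score each, and normalize the scores against the group mean and standard deviation. The attack described above amounts to plugging in a reward that scores fulfillment of the request instead of refusal; the function names and signatures here (sample_group, reward_fn) are illustrative assumptions, not from the source, which publishes no implementation.

# Minimal, illustrative sketch of GRPO's group-relative advantage computation.
# `sample_group` and `reward_fn` are hypothetical placeholders.
from typing import Callable, List
import statistics

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize each reward against the group mean and std (the GRPO baseline).
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

def grpo_step(prompt: str,
              sample_group: Callable[[str, int], List[str]],
              reward_fn: Callable[[str, str], float],
              group_size: int = 8) -> List[float]:
    # Sample a group of completions for the same prompt, score each one,
    # and convert the scores into relative advantages that weight the policy update.
    completions = sample_group(prompt, group_size)
    rewards = [reward_fn(prompt, c) for c in completions]
    return group_relative_advantages(rewards)

Because the update only depends on the scalar rewards, nothing in the optimization itself distinguishes a safety-preserving reward from a reversed one; the direction of training is set entirely by what reward_fn prefers.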
Classification
Affected Vendors
Related Issues
Original source: https://www.microsoft.com/en-us/security/blog/2026/02/09/prompt-attack-breaks-llm-safety/
First tracked: February 12, 2026 at 02:20 PM
Classified by LLM (prompt v3) · confidence: 92%