A one-prompt attack that breaks LLM safety alignment
Summary
Researchers discovered that Group Relative Policy Optimization (GRPO), a reinforcement learning technique normally used to align model behavior, including safety alignment, can be turned against that alignment simply by changing the reward signal. Fine-tuning a safety-aligned model on even a single harmful prompt, with responses scored on how well they fulfill the harmful request rather than refuse it, causes the model to gradually abandon its safety guidelines and produce harmful content across many categories it never encountered during the attack.
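To illustrate the mechanism, the sketch below shows the group-relative advantage step at the core of GRPO: sample a group of completions for one prompt, score each, and normalize the scores against the group mean and standard deviation. The attack described above amounts to plugging in a reward that scores fulfillment of the request instead of refusal; the function names and signatures here (sample_group, reward_fn) are illustrative assumptions, not from the source, which publishes no implementation.

# Minimal, illustrative sketch of GRPO's group-relative advantage computation.
# `sample_group` and `reward_fn` are hypothetical placeholders.
from typing import Callable, List
import statistics

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize each reward against the group mean and std (the GRPO baseline).
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

def grpo_step(prompt: str,
              sample_group: Callable[[str, int], List[str]],
              reward_fn: Callable[[str, str], float],
              group_size: int = 8) -> List[float]:
    # Sample a group of completions for the same prompt, score each one,
    # and convert the scores into relative advantages that weight the policy update.
    completions = sample_group(prompt, group_size)
    rewards = [reward_fn(prompt, c) for c in completions]
    return group_relative_advantages(rewards)

Because the update only depends on the scalar rewards, nothing in the optimization itself distinguishes a safety-preserving reward from a reversed one; the direction of training is set entirely by what reward_fn prefers.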
Classification
Affected Vendors
Related Issues
Original source: https://www.microsoft.com/en-us/security/blog/2026/02/09/prompt-attack-breaks-llm-safety/
First tracked: February 12, 2026 at 02:20 PM
Classified by LLM (prompt v3) · confidence: 92%