AI Sec Watch: A Security Intelligence Platform for AI Systems

Luu, T.J.

Prompt Injection as Role Confusion

infonewsLLM-Specific

securityresearch

Source: Simon Willison's WeblogJune 22, 2026

Summary

Researchers discovered that AI models struggle to distinguish between their own internal instructions (wrapped in tags like <system> and <think>) and untrusted user input (wrapped in <user> tags), a problem called role confusion. The models pay more attention to the writing style of text than its actual meaning, allowing attackers to craft jailbreaks (unauthorized bypasses of safety rules) by mimicking the style of internal thinking blocks. However, rewriting malicious text in a different style (called 'destyling') significantly reduced attack success rates from 61% to 10%, showing that format changes can help models better distinguish between trusted and untrusted content.

Solution / Mitigation

The source explicitly mentions 'destyling' as having material impact: 'destyling causes average attack success in our dataset to plunge from 61% to 10%.' Destyling is described as 'rewriting text in a slightly different way such that it looked less like the expected format in a role tag.' However, the source does not present this as an implemented solution or official mitigation—only as a research finding about what reduces attack effectiveness. No deployed fix, patch, or official defense mechanism is described in the text.