Breaking Instruction Hierarchy in OpenAI's gpt-4o-mini
Summary
OpenAI released gpt-4o-mini with safety improvements aimed at strengthening 'instruction hierarchy,' which is supposed to prevent users from tricking the AI into ignoring its built-in rules through commands like 'ignore all previous instructions.' However, researchers have already demonstrated bypasses of this protection, and analysis shows that system instructions (the AI's core rules) still cannot be fully trusted as a security boundary (a hard limit that stops attackers).
Classification
Affected Vendors
Related Issues
Original source: https://embracethered.com/blog/posts/2024/chatgpt-gpt-4o-mini-instruction-hierarchie-bypasses/
First tracked: February 12, 2026 at 02:20 PM
Classified by LLM (prompt v3) · confidence: 85%