AI Sec Watch: A Security Intelligence Platform for AI Systems

Luu, T.J.

Breaking Instruction Hierarchy in OpenAI's gpt-4o-mini

mediumnewsLLM-Specific

securitysafety

Source: Embrace The RedJuly 22, 2024

Summary

OpenAI released gpt-4o-mini with safety improvements aimed at strengthening 'instruction hierarchy,' which is supposed to prevent users from tricking the AI into ignoring its built-in rules through commands like 'ignore all previous instructions.' However, researchers have already demonstrated bypasses of this protection, and analysis shows that system instructions (the AI's core rules) still cannot be fully trusted as a security boundary (a hard limit that stops attackers).