AI Sec Watch: A Security Intelligence Platform for AI Systems

Luu, T.J.

Improving instruction hierarchy in frontier LLMs

safetyresearch

Mar 10, 2026

AI systems receive instructions from multiple sources (system policies, developers, users, and online data), and models must learn to prioritize the most trustworthy ones to stay safe. When models treat untrusted instructions as authoritative, they can be tricked into revealing private information, following harmful requests, or falling victim to prompt injection (hidden malicious instructions hidden in input data). OpenAI's solution uses a clear instruction hierarchy (System > developer > user > tool) and trains models with IH-Challenge, a reinforcement learning dataset designed to teach models to follow high-priority instructions even when lower-priority ones conflict with them.

Fix: OpenAI's models are trained on a clear instruction hierarchy where System instructions have highest priority, followed by developer instructions, then user instructions, then tool outputs. The company also created IH-Challenge, a reinforcement learning training dataset that generates conversations with conflicting instructions where high-priority instructions are kept simple and objectively gradable, ensuring models learn to prioritize correctly without resorting to useless shortcuts like over-refusing benign requests.

OpenAI Blog

AI Sec Watch

Latest Intel

Escape Raises $18 Million to Automate Pentesting

How to Stop AI Data Leaks: A Webinar Guide to Auditing Modern Agentic Workflows

Family of child injured in Canada school shooting sues OpenAI

Oracle earnings will show whether its expensive AI bet is starting to pay off

Improving instruction hierarchy in frontier LLMs

Meta’s deepfake moderation isn’t good enough, says Oversight Board

Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls

New ways to learn math and science in ChatGPT

OpenAI to acquire Promptfoo to strengthen AI agent security testing

You Could Be Next