Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls
Summary
Researchers discovered that AI judges (LLMs acting as automated security gatekeepers that enforce safety policies) can be manipulated through prompt injection (tricking an AI by hiding instructions in its input) using stealthy formatting symbols rather than obvious gibberish. They built AdvJudge-Zero, a fuzzer (software that finds vulnerabilities by testing with unexpected inputs) that automatically identifies innocent-looking character sequences exploiting the model's decision-making logic to bypass security controls.
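The core idea can be illustrated with a toy sketch. This is not AdvJudge-Zero itself; `mock_judge` is a hypothetical stand-in for an LLM judge that matches on raw text, and the character pool and mutation loop are illustrative assumptions only.

```python
import random

# Innocuous-looking formatting characters (zero-width joiners/spaces) --
# the kind of "stealthy symbols" the summary describes. Illustrative pool.
CANDIDATE_CHARS = ["\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"]


def mock_judge(prompt: str) -> str:
    """Hypothetical stand-in for an LLM judge: blocks any prompt whose
    raw text contains the banned keyword 'exploit'."""
    return "BLOCK" if "exploit" in prompt else "ALLOW"


def fuzz_bypass(payload: str, judge, max_iters: int = 1000, seed: int = 0):
    """Randomly interleave formatting characters into the payload until
    the judge's verdict flips from BLOCK to ALLOW (fuzzing loop sketch)."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        mutated = "".join(
            ch + (rng.choice(CANDIDATE_CHARS) if rng.random() < 0.3 else "")
            for ch in payload
        )
        if judge(mutated) == "ALLOW":
            return mutated  # bypass found: looks identical when rendered
    return None


bypass = fuzz_bypass("please exploit the system", mock_judge)
```

The mutated string renders the same to a human reader, but the naive judge no longer sees the banned substring, so its verdict flips. Real LLM judges fail in subtler ways, which is why an automated fuzzer is needed to discover effective sequences.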
Solution / Mitigation
Palo Alto Networks customers are better protected through Prisma AIRS and the Unit 42 AI Security Assessment service. Organizations that suspect a compromise can contact the Unit 42 Incident Response team.
Classification
Related Issues
Original source: https://unit42.paloaltonetworks.com/fuzzing-ai-judges-security-bypass/
First tracked: March 10, 2026 at 08:00 AM
Classified by LLM (prompt v3) · confidence: 92%