Detecting backdoored language models at scale
Summary
Researchers have released new work on detecting backdoors (hidden malicious behaviors embedded in a model's weights during training) in open-weight language models to improve trust in AI systems. A backdoored model appears normal most of the time but changes behavior when triggered by a specific input, like a hidden phrase, making detection difficult. The research explores whether backdoored models show systematic differences from clean models and whether their trigger phrases can be reliably identified.
Classification
Affected Vendors
Related Issues
Original source: https://www.microsoft.com/en-us/security/blog/2026/02/04/detecting-backdoored-language-models-at-scale/
First tracked: February 12, 2026 at 02:20 PM
Classified by LLM (prompt v3) · confidence: 92%