AI Sec Watch: A Security Intelligence Platform for AI Systems

Luu, T.J.

Detecting backdoored language models at scale

infonewsLLM-Specific

securityresearch

Source: Microsoft Security BlogFebruary 4, 2026

Summary

Researchers have released new work on detecting backdoors (hidden malicious behaviors embedded in a model's weights during training) in open-weight language models to improve trust in AI systems. A backdoored model appears normal most of the time but changes behavior when triggered by a specific input, like a hidden phrase, making detection difficult. The research explores whether backdoored models show systematic differences from clean models and whether their trigger phrases can be reliably identified.