AI Sec Watch: A Security Intelligence Platform for AI Systems

Luu, T.J.

Microsoft Develops Scanner to Detect Backdoors in Open-Weight Large Language Models

infonewsLLM-Specific

securityresearch

Source: The Hacker NewsFebruary 4, 2026

Summary

Microsoft created a lightweight scanner that can detect backdoors (hidden malicious behaviors) in open-weight LLMs (large language models that have publicly available internal parameters) by identifying three distinctive signals: a specific attention pattern when trigger phrases are present, memorized poisoning data leakage, and activation by fuzzy triggers (partial variations of trigger phrases). The scanner works without needing to retrain the model or know the backdoor details in advance, though it only functions on open-weight models and works best on trigger-based backdoors.

Solution / Mitigation

Microsoft's scanner performs detection through a three-step process: it "first extracts memorized content from the model and then analyzes it to isolate salient substrings. Finally, it formalizes the three signatures above as loss functions, scoring suspicious substrings and returning a ranked list of trigger candidates." The tool works across common GPT-style models and requires access to the model files but no additional model training or prior knowledge of the backdoor behavior.