Microsoft Develops Scanner to Detect Backdoors in Open-Weight Large Language Models
Summary
Microsoft created a lightweight scanner that can detect backdoors (hidden malicious behaviors) in open-weight LLMs (large language models that have publicly available internal parameters) by identifying three distinctive signals: a specific attention pattern when trigger phrases are present, memorized poisoning data leakage, and activation by fuzzy triggers (partial variations of trigger phrases). The scanner works without needing to retrain the model or know the backdoor details in advance, though it only functions on open-weight models and works best on trigger-based backdoors.
Solution / Mitigation
Microsoft's scanner performs detection through a three-step process: it "first extracts memorized content from the model and then analyzes it to isolate salient substrings. Finally, it formalizes the three signatures above as loss functions, scoring suspicious substrings and returning a ranked list of trigger candidates." The tool works across common GPT-style models and requires access to the model files but no additional model training or prior knowledge of the backdoor behavior.
Classification
Affected Vendors
Related Issues
Original source: https://thehackernews.com/2026/02/microsoft-develops-scanner-to-detect.html
First tracked: February 12, 2026 at 02:20 PM
Classified by LLM (prompt v3) · confidence: 92%