{"data":{"id":"c22bf130-10c7-4720-9a37-0d84a6578e30","title":"Detecting backdoored language models at scale","summary":"Researchers have released new work on detecting backdoors (hidden malicious behaviors embedded in a model's weights during training) in open-weight language models to improve trust in AI systems. A backdoored model appears normal most of the time but changes behavior when triggered by a specific input, like a hidden phrase, making detection difficult. The research explores whether backdoored models show systematic differences from clean models and whether their trigger phrases can be reliably identified.","solution":"N/A -- no mitigation discussed in source. The source mentions that traditional malware scanning tools (such as Microsoft's malware scanning solution for models in Microsoft Foundry) defend against code-based tampering, but no explicit fix, patch, or detection method is provided for model poisoning backdoors.","labels":["security","research"],"sourceUrl":"https://www.microsoft.com/en-us/security/blog/2026/02/04/detecting-backdoored-language-models-at-scale/","publishedAt":"2026-02-04T17:00:00.000Z","cveId":null,"cweIds":null,"cvssScore":null,"cvssSeverity":null,"severity":"info","attackType":["model_poisoning"],"issueType":"news","affectedPackages":null,"affectedVendors":["Microsoft"],"affectedVendorsRaw":["Microsoft","Anthropic"],"classifierModel":"claude-haiku-4-5-20251001","classifierPromptVersion":"v3","cvssVector":null,"attackVector":null,"attackComplexity":null,"privilegesRequired":null,"userInteraction":null,"exploitMaturity":null,"epssScore":null,"patchAvailable":null,"disclosureDate":null,"capecIds":null,"crossRefCount":0,"attackSophistication":"advanced","impactType":["integrity","safety"],"aiComponentTargeted":"model","llmSpecific":true,"classifierConfidence":0.92,"researchCategory":null,"atlasIds":null}}