{"data":{"id":"5972a4e1-8d9e-4429-bb81-4fdb5b1467f3","title":"VLM-Guard: Defending Jailbreaks by Monitoring Only Hundreds of Safety-Critical Neurons","summary":"Large Vision Language Models (VLMs, which are AI systems that process both images and text) are vulnerable to jailbreak attacks (attempts to trick the AI into ignoring its safety guidelines). VLM-Guard is a detection framework that identifies and monitors a small set of neurons (individual computational units, less than 0.2% of the total) that are linked to unsafe behavior, allowing it to catch jailbreak attempts without requiring model fine-tuning (adjusting the AI's internal parameters through additional training). The approach is lightweight and effective at detecting attacks while maintaining normal performance on safe inputs.","solution":"VLM-Guard detects jailbreak attacks by identifying critical neurons linked to unsafe behaviors through differential analysis of activation values. The framework monitors a compact set of just a few hundred neurons (less than 0.2% of total neurons) that are strongly correlated with harmful semantics. It operates as a training-free detector: it requires no parameter updates or model fine-tuning, which makes it suitable for practical deployment in safeguarding VLMs.","labels":["security","safety"],"sourceUrl":"http://ieeexplore.ieee.org/document/11505928","publishedAt":"2026-05-04T13:18:24.000Z","cveId":null,"cweIds":null,"cvssScore":null,"cvssSeverity":null,"severity":"info","attackType":["jailbreak"],"issueType":"research","affectedPackages":null,"affectedVendors":[],"affectedVendorsRaw":["Vision Language Models"],"classifierModel":"claude-haiku-4-5-20251001","classifierPromptVersion":"v3","cvssVector":null,"attackVector":null,"attackComplexity":null,"privilegesRequired":null,"userInteraction":null,"exploitMaturity":null,"epssScore":null,"patchAvailable":null,"disclosureDate":"2026-05-04T13:18:24.000Z","capecIds":null,"crossRefCount":0,"attackSophistication":"advanced","impactType":["safety"],"aiComponentTargeted":"model","llmSpecific":false,"classifierConfidence":0.92,"researchCategory":"peer_reviewed","atlasIds":null}}