Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
security, safety, research
Source: arXiv (cs.CR + cs.AI), February 10, 2026
Summary
This research introduces the Four-Checkpoint Framework to analyze where LLM safety mechanisms fail, organizing defenses along two axes: processing stage (input vs. output) and detection level (literal vs. intent). Testing GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro with 13 targeted evasion techniques across 3,312 test cases shows that output-stage defenses (CP3, CP4) are the weakest, at 72-79% Weighted Attack Success Rate (WASR), while input-literal defenses (CP1) are the strongest, at 13% WASR. The study also finds that traditional Binary ASR (22.6%) understates vulnerability relative to WASR (52.7%), a 2.3× gap in measured attack success.
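The gap between the two metrics comes down to how partial compliance is scored. Below is a minimal sketch, assuming WASR averages graded 0-1 compliance scores while Binary ASR counts only fully successful attacks; the paper's exact rubric and weights may differ, and the grades shown are purely illustrative.

```python
# Sketch: Binary ASR vs. a weighted ASR over graded attack outcomes.
# Assumption: each attempt is scored on a 0-1 compliance scale
# (0 = refusal, 1 = full harmful compliance).

def binary_asr(scores, threshold=1.0):
    """Fraction of attempts counted as successes only if fully compliant."""
    return sum(1 for s in scores if s >= threshold) / len(scores)

def weighted_asr(scores):
    """Mean graded compliance, crediting partial leaks as fractional successes."""
    return sum(scores) / len(scores)

# Hypothetical grades: many partial leaks, few complete jailbreaks.
grades = [0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0, 0.25]

print(f"Binary ASR:   {binary_asr(grades):.1%}")    # counts only the two 1.0 outcomes
print(f"Weighted ASR: {weighted_asr(grades):.1%}")  # partial compliance raises the rate
```

Under this kind of scoring, a model that frequently leaks partial harmful content without fully complying looks safe under Binary ASR but much weaker under WASR, which is the pattern the summary's 22.6% vs. 52.7% figures describe.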
Original source: https://arxiv.org/abs/2602.09629v1
First tracked: February 11, 2026 at 06:00 PM