Research
Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
This paper proposes a comprehensive framework for integrating Zero Trust Architecture (ZTA) into cloud-based endpoint security for critical infrastructure such as power plants, healthcare systems, and financial systems. The framework aims to address the gap in applying ZTA to endpoint management within cloud environments, treating every access request as new with no implicit trust, thereby enhancing compliance, enabling continuous protection, and reducing attack surfaces.
This research introduces "compositional reasoning attacks" where harmful queries are decomposed into fragments scattered across long contexts (up to 64k tokens), revealing that stronger reasoning capability in LLMs does not improve safety against such attacks. Testing 14 frontier LLMs showed that safety alignment degrades as context length increases, and models with better general reasoning often assemble harmful intent but fail to refuse it.
Fix: Increasing inference-time compute reduces attack success by over 50 percentage points on GPT-oss-120b model, indicating that inference-time reasoning effort is a key mitigating factor.
Arxiv (cs.CR + cs.AI)CryptoGen is a system designed to enable privacy-preserving autoregressive generation on cloud-hosted Transformer models by supporting encrypted key-value (KV) cache reuse. It combines homomorphic encryption and secret sharing to achieve near-linear scaling with 4.4x-7.6x lower per-token latency compared to existing discriminative secure inference systems when adapted for generation tasks. The system is released as an open-source library.
DyMA-Fuzz is a firmware fuzzing framework designed to address Direct Memory Access (DMA) challenges in re-hosted monolithic firmware testing. It uses runtime analysis techniques to automatically infer DMA memory access patterns and inject fuzzing data into target buffers without manual configuration. When evaluated on 94 firmware samples and 8 DMA-guarded CVE benchmarks, it achieved up to 122% higher code coverage compared to state-of-the-art tools.
This research evaluates machine learning algorithms (XGBoost, Naïve Bayes, SVC, and Random Forest) for Android malware detection using the CICMalDroid2020 dataset of dynamically obtained behavior samples. The study empirically tests the SMOTE technique for addressing class imbalance, finding that in 75% of configurations SMOTE led to performance degradation or marginal improvements with an average loss of 6.14 percentage points. Tree-based algorithms like XGBoost and Random Forest consistently outperformed others, achieving weighted recall above 94%.
Researchers discovered a security vulnerability in Mixture-of-Experts (MoE) Large Language Models where safety-critical behaviors like refusal are concentrated in a small set of experts. They developed Large Language Lobotomy (L³), a training-free attack that exploits expert routing dynamics by silencing safety-relevant experts, increasing jailbreak attack success from 7.3% to 70.4% (up to 86.3%) across eight state-of-the-art MoE LLMs while silencing fewer than 20% of layer-wise experts.
This paper systematizes 11 methodological pitfalls commonly found in Deep Reinforcement Learning for Cybersecurity (DRL4Sec) research across environment modeling, agent training, evaluation, and deployment stages. Analysis of 66 significant DRL4Sec papers from 2018-2025 reveals an average of over five pitfalls per paper, with the authors demonstrating the practical impact through controlled experiments in autonomous cyber defense, adversarial malware creation, and web security testing.
Fix: The paper provides actionable recommendations for each of the 11 identified pitfalls to support the development of more rigorous and deployable DRL-based security systems.
Arxiv (cs.CR + cs.AI)This paper addresses the vulnerability of deep learning models to score-based query attacks that craft adversarial examples using only black-box access to model outputs. The authors demonstrate that existing plug-and-play defenses can be bypassed by adaptive attacks and propose Dashed Line Defense (DLD), a post-processing method designed to withstand adaptive query strategies by introducing ambiguity in loss observations.
Fix: The paper proposes Dashed Line Defense (DLD), a plug-and-play post-processing method that introduces ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, preventing attackers from reliably analyzing and adapting their queries to disrupt the adversarial example generation process. DLD is validated through experiments on ImageNet and demonstrates effectiveness even under worst-case adaptive attacks while preserving model predicted labels.
Hybrid Retrieval-Augmented Generation (RAG) pipelines that combine vector similarity search with knowledge graph expansion create a security vulnerability where vector-retrieved seed chunks can pivot through entity links into sensitive graph neighborhoods, causing cross-tenant data leakage. The research formalizes this as Retrieval Pivot Risk (RPR) and demonstrates that naturally shared entities create cross-tenant pivot paths without requiring adversarial injection, with undefended systems showing RPR up to 0.95 and consistent leakage at pivot depth 2.
This research reveals safety vulnerabilities in Mixture-of-Experts (MoE) large language models, demonstrating that manipulating specific routers can create "unsafe routes" that convert safe outputs into harmful ones. The study introduces RoSais (Router Safety importance score) to identify critical routers and proposes F-SOUR framework, which achieves attack success rates of 0.90 and 0.98 on safety benchmarks across four MoE LLM families by exploiting routing configurations.
This research reveals that large language models (LLMs) possess "implicit memory"—the ability to encode information in their outputs and later recover it when those outputs are reintroduced as input, creating persistent information channels across supposedly independent interactions. The authors demonstrate this through "time bombs," a new class of temporal backdoors that activate only after a sequence of interactions satisfies hidden conditions, which can be induced through prompting or fine-tuning. The work discusses broader security implications including covert communication, benchmark contamination, and targeted manipulation.
Anthropic's Claude Opus 4.6 AI model discovered over 500 previously unknown high-severity security vulnerabilities in major open-source libraries including Ghostscript, OpenSC, and CGIF. The model found these flaws without task-specific tooling or specialized prompting by analyzing code like a human researcher, identifying issues such as missing bounds checks, buffer overflows, and complex vulnerabilities requiring conceptual understanding of algorithms.
Microsoft has developed a lightweight scanner to detect backdoors in open-weight large language models by analyzing three observable signals: a distinctive "double triangle" attention pattern when trigger phrases are present, memorization and leakage of poisoning data including triggers, and activation by multiple fuzzy trigger variations. The scanner requires no additional model training or prior knowledge of backdoor behavior and works across common GPT-style models, though it has limitations including inability to work on proprietary models and best effectiveness only on trigger-based backdoors with deterministic outputs.
Boris Cherny, creator of Claude Code at Anthropic, revealed his development workflow that runs 5 Claude AI agents in parallel in his terminal (and 5-10 more in browser), treating coding like a real-time strategy game rather than traditional linear programming. He exclusively uses Anthropic's slowest but smartest model, Opus 4.5, arguing that despite slower speed, it requires less human correction and steering, ultimately being faster overall.
Fix: Enforcing authorization at a single location—the graph expansion boundary—eliminates measured leakage (RPR near 0) across both corpora, all attack variants, and label forgery rates up to 10 percent, with minimal overhead. The root cause is boundary enforcement: authorization must be re-checked at the vector-to-graph transition point to prevent two individually secure retrieval components from composing into an insecure system.
Arxiv (cs.CR + cs.AI)Fix: The paper outlines defensive perspectives including safety-aware route disabling and router training as promising directions to safeguard MoE LLMs.
Arxiv (cs.CR + cs.AI)Fix: The discovered vulnerabilities have been patched by the respective maintainers. Specifically mentioned: the heap buffer overflow vulnerability in CGIF was fixed in version 0.5.1. Anthropic emphasized the importance of 'promptly patching known vulnerabilities' as a security fundamental.
The Hacker News