AI &amp; LLM Vulnerabilities

This research paper introduces a measurement framework for monitoring GPU utilization in untrusted environments to support AI governance. The framework uses four complementary primitives based on timing and memory characteristics—Proof-of-Work-inspired mechanisms, Verifiable Delay Functions, GEMM-based tensor-core measurements, and VRAM-residency tests—to detect GPU compute activity even without trusted firmware or vendor-controlled counters. The approach aims to provide compute-based telemetry that can help detect unauthorized repurposing of GPUs for model training or policy violations.

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

This paper introduces the first systematic benchmark for evaluating knowledge-extraction attacks on Retrieval-Augmented Generation (RAG) systems, which can be exploited through maliciously crafted queries to recover sensitive knowledge-base content. The benchmark consolidates fragmented research by providing a unified experimental framework covering various attack and defense strategies, retrieval embedding models, and both open- and closed-source generators across standardized datasets.

"These cameras are just like the Eye of Sauron": A Sociotechnical Threat Model for AI-Driven Smart Home Devices as Perceived by UK-Based Domestic Workers

privacyresearch

This research paper presents a study of 18 UK-based domestic workers examining privacy risks from AI-driven smart home devices in both employer-controlled homes and their own households. The study develops a sociotechnical threat model identifying how AI analytics, data logs, cross-household data flows, and employment agencies create surveillance and privacy boundary challenges for domestic workers.

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

MUZZLE is an automated agentic framework designed to evaluate the security of LLM-based web agents against indirect prompt injection attacks. The system adaptively identifies injection surfaces from agent trajectories and generates context-aware malicious instructions, successfully discovering 37 new attacks across 4 web applications that violate confidentiality, integrity, and availability properties, including novel cross-application attacks and agent-tailored phishing scenarios.

CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection

CIC-Trap4Phish is a multi-format dataset designed to improve detection of phishing and quishing attacks through malicious email attachments. The dataset covers five common file types (Word, Excel, PDF, HTML, and QR code images) and uses execution-free static feature pipelines for the first four types, while employing CNNs and lightweight language models for QR code-based phishing detection. Machine learning models including Random Forest, XGBoost, and Decision Tree demonstrated high detection accuracy across all formats.

Reverse Online Guessing Attacks on PAKE Protocols

This paper identifies a vulnerability in password-authenticated key exchange (PAKE) protocols called "reverse online guessing attacks," where an adversary validates password guesses by impersonating a server rather than a client. The attack is particularly effective in phishing, password-spraying scenarios, or applications with automated logins like WPA3-SAE, and exploits the fact that PAKE protocols lack server authentication mechanisms beyond the password itself.

StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

StealthRL is a reinforcement learning framework that uses paraphrasing attacks to evade AI-text detectors while preserving semantic meaning. The system achieves near-zero detection rates (0.001 mean TPR@1%FPR) and 99.9% attack success rate against multiple detector families, with attacks successfully transferring to unseen detector types, revealing fundamental architectural vulnerabilities in current AI-text detection systems.

Framework for Integrating Zero Trust in Cloud-Based Endpoint Security for Critical Infrastructure

This paper proposes a comprehensive framework for integrating Zero Trust Architecture (ZTA) into cloud-based endpoint security for critical infrastructure such as power plants, healthcare systems, and financial systems. The framework aims to address the gap in applying ZTA to endpoint management within cloud environments, treating every access request as new with no implicit trust, thereby enhancing compliance, enabling continuous protection, and reducing attack surfaces.

Is Reasoning Capability Enough for Safety in Long-Context Language Models?

This research introduces "compositional reasoning attacks" where harmful queries are decomposed into fragments scattered across long contexts (up to 64k tokens), revealing that stronger reasoning capability in LLMs does not improve safety against such attacks. Testing 14 frontier LLMs showed that safety alignment degrades as context length increases, and models with better general reasoning often assemble harmful intent but fail to refuse it.

CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse

CryptoGen is a system designed to enable privacy-preserving autoregressive generation on cloud-hosted Transformer models by supporting encrypted key-value (KV) cache reuse. It combines homomorphic encryption and secret sharing to achieve near-linear scaling with 4.4x-7.6x lower per-token latency compared to existing discriminative secure inference systems when adapted for generation tasks. The system is released as an open-source library.

DyMA-Fuzz: Dynamic Direct Memory Access Abstraction for Re-hosted Monolithic Firmware Fuzzing

DyMA-Fuzz is a firmware fuzzing framework designed to address Direct Memory Access (DMA) challenges in re-hosted monolithic firmware testing. It uses runtime analysis techniques to automatically infer DMA memory access patterns and inject fuzzing data into target buffers without manual configuration. When evaluated on 94 firmware samples and 8 DMA-guarded CVE benchmarks, it achieved up to 122% higher code coverage compared to state-of-the-art tools.

Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020

This research evaluates machine learning algorithms (XGBoost, Naïve Bayes, SVC, and Random Forest) for Android malware detection using the CICMalDroid2020 dataset of dynamically obtained behavior samples. The study empirically tests the SMOTE technique for addressing class imbalance, finding that in 75% of configurations SMOTE led to performance degradation or marginal improvements with an average loss of 6.14 percentage points. Tree-based algorithms like XGBoost and Random Forest consistently outperformed others, achieving weighted recall above 94%.

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Researchers discovered a security vulnerability in Mixture-of-Experts (MoE) Large Language Models where safety-critical behaviors like refusal are concentrated in a small set of experts. They developed Large Language Lobotomy (L³), a training-free attack that exploits expert routing dynamics by silencing safety-relevant experts, increasing jailbreak attack success from 7.3% to 70.4% (up to 86.3%) across eight state-of-the-art MoE LLMs while silencing fewer than 20% of layer-wise experts.

SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

researchsecurity

This paper systematizes 11 methodological pitfalls commonly found in Deep Reinforcement Learning for Cybersecurity (DRL4Sec) research across environment modeling, agent training, evaluation, and deployment stages. Analysis of 66 significant DRL4Sec papers from 2018-2025 reveals an average of over five pitfalls per paper, with the authors demonstrating the practical impact through controlled experiments in autonomous cyber defense, adversarial malware creation, and web security testing.

Fix: The paper provides actionable recommendations for each of the 11 identified pitfalls to support the development of more rigorous and deployable DRL-based security systems.

Dashed Line Defense: Plug-And-Play Defense Against Adaptive Score-Based Query Attacks

This paper addresses the vulnerability of deep learning models to score-based query attacks that craft adversarial examples using only black-box access to model outputs. The authors demonstrate that existing plug-and-play defenses can be bypassed by adaptive attacks and propose Dashed Line Defense (DLD), a post-processing method designed to withstand adaptive query strategies by introducing ambiguity in loss observations.

Fix: The paper proposes Dashed Line Defense (DLD), a plug-and-play post-processing method that introduces ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, preventing attackers from reliably analyzing and adapting their queries to disrupt the adversarial example generation process. DLD is validated through experiments on ImageNet and demonstrates effectiveness even under worst-case adaptive attacks while preserving model predicted labels.

Retrieval Pivot Attacks in Hybrid RAG: Measuring and Mitigating Amplified Leakage from Vector Seeds to Graph Expansion

Hybrid Retrieval-Augmented Generation (RAG) pipelines that combine vector similarity search with knowledge graph expansion create a security vulnerability where vector-retrieved seed chunks can pivot through entity links into sensitive graph neighborhoods, causing cross-tenant data leakage. The research formalizes this as Retrieval Pivot Risk (RPR) and demonstrates that naturally shared entities create cross-tenant pivot paths without requiring adversarial injection, with undefended systems showing RPR up to 0.95 and consistent leakage at pivot depth 2.

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

This research reveals safety vulnerabilities in Mixture-of-Experts (MoE) large language models, demonstrating that manipulating specific routers can create "unsafe routes" that convert safe outputs into harmful ones. The study introduces RoSais (Router Safety importance score) to identify critical routers and proposes F-SOUR framework, which achieves attack success rates of 0.90 and 0.98 on safety benchmarks across four MoE LLM families by exploiting routing configurations.

Stateless Yet Not Forgetful: Implicit Memory as a Hidden Channel in LLMs

This research reveals that large language models (LLMs) possess "implicit memory"—the ability to encode information in their outputs and later recover it when those outputs are reintroduced as input, creating persistent information channels across supposedly independent interactions. The authors demonstrate this through "time bombs," a new class of temporal backdoors that activate only after a sequence of interactions satisfies hidden conditions, which can be induced through prompting or fine-tuning. The work discusses broader security implications including covert communication, benchmark contamination, and targeted manipulation.

Moltbook, the Social Network for AI Agents, Exposed Real Humans’ Data

securityprivacy

2/7/2026

Moltbook, a social network for AI agents that was entirely AI-coded without its founder writing any code himself, exposed a serious security flaw that leaked thousands of users' email addresses and millions of API credentials. The vulnerability, discovered by security firm Wiz, resulted from mishandling of a private key in the site's JavaScript code and allowed complete account impersonation and access to private communications between AI agents.

Fix: Though Moltbook has now fixed the site's flaw discovered by Wiz, its critical vulnerability should serve as a cautionary tale about the security of AI-made platforms.

Wired (Security)

Claude Opus 4.6 Finds 500+ High-Severity Flaws Across Major Open-Source Libraries

securityresearchindustry

2/6/2026

Anthropic's Claude Opus 4.6 AI model discovered over 500 previously unknown high-severity security vulnerabilities in major open-source libraries including Ghostscript, OpenSC, and CGIF. The model found these flaws without task-specific tooling or specialized prompting by analyzing code like a human researcher, identifying issues such as missing bounds checks, buffer overflows, and complex vulnerabilities requiring conceptual understanding of algorithms.

Previous2 / 3Next

43 items

Timing and Memory Telemetry on GPUs for AI Governance

securitypolicyresearch

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

"These cameras are just like the Eye of Sauron": A Sociotechnical Threat Model for AI-Driven Smart Home Devices as Perceived by UK-Based Domestic Workers

privacyresearch

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection

Reverse Online Guessing Attacks on PAKE Protocols

StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Framework for Integrating Zero Trust in Cloud-Based Endpoint Security for Critical Infrastructure

Is Reasoning Capability Enough for Safety in Long-Context Language Models?

CryptoGen: Secure Transformer Generation with Encrypted KV-Cache Reuse

DyMA-Fuzz: Dynamic Direct Memory Access Abstraction for Re-hosted Monolithic Firmware Fuzzing

Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

researchsecurity

Fix: The paper provides actionable recommendations for each of the 11 identified pitfalls to support the development of more rigorous and deployable DRL-based security systems.

Dashed Line Defense: Plug-And-Play Defense Against Adaptive Score-Based Query Attacks

Retrieval Pivot Attacks in Hybrid RAG: Measuring and Mitigating Amplified Leakage from Vector Seeds to Graph Expansion