Security vulnerabilities, privacy incidents, safety concerns, and policy updates affecting LLMs and AI agents.
This research paper introduces a measurement framework for monitoring GPU utilization in untrusted environments to support AI governance. The framework uses four complementary primitives based on timing and memory characteristics—Proof-of-Work-inspired mechanisms, Verifiable Delay Functions, GEMM-based tensor-core measurements, and VRAM-residency tests—to detect GPU compute activity even without trusted firmware or vendor-controlled counters. The approach aims to provide compute-based telemetry that can help detect unauthorized repurposing of GPUs for model training or policy violations.
This paper introduces the first systematic benchmark for evaluating knowledge-extraction attacks on Retrieval-Augmented Generation (RAG) systems, which can be exploited through maliciously crafted queries to recover sensitive knowledge-base content. The benchmark consolidates fragmented research by providing a unified experimental framework covering various attack and defense strategies, retrieval embedding models, and both open- and closed-source generators across standardized datasets.
This research paper presents a study of 18 UK-based domestic workers examining privacy risks from AI-driven smart home devices in both employer-controlled homes and their own households. The study develops a sociotechnical threat model identifying how AI analytics, data logs, cross-household data flows, and employment agencies create surveillance and privacy boundary challenges for domestic workers.
MUZZLE is an automated agentic framework designed to evaluate the security of LLM-based web agents against indirect prompt injection attacks. The system adaptively identifies injection surfaces from agent trajectories and generates context-aware malicious instructions, successfully discovering 37 new attacks across 4 web applications that violate confidentiality, integrity, and availability properties, including novel cross-application attacks and agent-tailored phishing scenarios.
CIC-Trap4Phish is a multi-format dataset designed to improve detection of phishing and quishing attacks through malicious email attachments. The dataset covers five common file types (Word, Excel, PDF, HTML, and QR code images) and uses execution-free static feature pipelines for the first four types, while employing CNNs and lightweight language models for QR code-based phishing detection. Machine learning models including Random Forest, XGBoost, and Decision Tree demonstrated high detection accuracy across all formats.
This paper identifies a vulnerability in password-authenticated key exchange (PAKE) protocols called "reverse online guessing attacks," where an adversary validates password guesses by impersonating a server rather than a client. The attack is particularly effective in phishing, password-spraying scenarios, or applications with automated logins like WPA3-SAE, and exploits the fact that PAKE protocols lack server authentication mechanisms beyond the password itself.
StealthRL is a reinforcement learning framework that uses paraphrasing attacks to evade AI-text detectors while preserving semantic meaning. The system achieves near-zero detection rates (0.001 mean TPR@1%FPR) and 99.9% attack success rate against multiple detector families, with attacks successfully transferring to unseen detector types, revealing fundamental architectural vulnerabilities in current AI-text detection systems.
This paper proposes a comprehensive framework for integrating Zero Trust Architecture (ZTA) into cloud-based endpoint security for critical infrastructure such as power plants, healthcare systems, and financial systems. The framework aims to address the gap in applying ZTA to endpoint management within cloud environments, treating every access request as new with no implicit trust, thereby enhancing compliance, enabling continuous protection, and reducing attack surfaces.
This research introduces "compositional reasoning attacks" where harmful queries are decomposed into fragments scattered across long contexts (up to 64k tokens), revealing that stronger reasoning capability in LLMs does not improve safety against such attacks. Testing 14 frontier LLMs showed that safety alignment degrades as context length increases, and models with better general reasoning often assemble harmful intent but fail to refuse it.
CryptoGen is a system designed to enable privacy-preserving autoregressive generation on cloud-hosted Transformer models by supporting encrypted key-value (KV) cache reuse. It combines homomorphic encryption and secret sharing to achieve near-linear scaling with 4.4x-7.6x lower per-token latency compared to existing discriminative secure inference systems when adapted for generation tasks. The system is released as an open-source library.
DyMA-Fuzz is a firmware fuzzing framework designed to address Direct Memory Access (DMA) challenges in re-hosted monolithic firmware testing. It uses runtime analysis techniques to automatically infer DMA memory access patterns and inject fuzzing data into target buffers without manual configuration. When evaluated on 94 firmware samples and 8 DMA-guarded CVE benchmarks, it achieved up to 122% higher code coverage compared to state-of-the-art tools.
This research evaluates machine learning algorithms (XGBoost, Naïve Bayes, SVC, and Random Forest) for Android malware detection using the CICMalDroid2020 dataset of dynamically obtained behavior samples. The study empirically tests the SMOTE technique for addressing class imbalance, finding that in 75% of configurations SMOTE led to performance degradation or marginal improvements with an average loss of 6.14 percentage points. Tree-based algorithms like XGBoost and Random Forest consistently outperformed others, achieving weighted recall above 94%.
Researchers discovered a security vulnerability in Mixture-of-Experts (MoE) Large Language Models where safety-critical behaviors like refusal are concentrated in a small set of experts. They developed Large Language Lobotomy (L³), a training-free attack that exploits expert routing dynamics by silencing safety-relevant experts, increasing jailbreak attack success from 7.3% to 70.4% (up to 86.3%) across eight state-of-the-art MoE LLMs while silencing fewer than 20% of layer-wise experts.
This paper systematizes 11 methodological pitfalls commonly found in Deep Reinforcement Learning for Cybersecurity (DRL4Sec) research across environment modeling, agent training, evaluation, and deployment stages. Analysis of 66 significant DRL4Sec papers from 2018-2025 reveals an average of over five pitfalls per paper, with the authors demonstrating the practical impact through controlled experiments in autonomous cyber defense, adversarial malware creation, and web security testing.
Fix: The paper provides actionable recommendations for each of the 11 identified pitfalls to support the development of more rigorous and deployable DRL-based security systems.
Arxiv (cs.CR + cs.AI)This paper addresses the vulnerability of deep learning models to score-based query attacks that craft adversarial examples using only black-box access to model outputs. The authors demonstrate that existing plug-and-play defenses can be bypassed by adaptive attacks and propose Dashed Line Defense (DLD), a post-processing method designed to withstand adaptive query strategies by introducing ambiguity in loss observations.
Fix: The paper proposes Dashed Line Defense (DLD), a plug-and-play post-processing method that introduces ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, preventing attackers from reliably analyzing and adapting their queries to disrupt the adversarial example generation process. DLD is validated through experiments on ImageNet and demonstrates effectiveness even under worst-case adaptive attacks while preserving model predicted labels.
Hybrid Retrieval-Augmented Generation (RAG) pipelines that combine vector similarity search with knowledge graph expansion create a security vulnerability where vector-retrieved seed chunks can pivot through entity links into sensitive graph neighborhoods, causing cross-tenant data leakage. The research formalizes this as Retrieval Pivot Risk (RPR) and demonstrates that naturally shared entities create cross-tenant pivot paths without requiring adversarial injection, with undefended systems showing RPR up to 0.95 and consistent leakage at pivot depth 2.
This research reveals safety vulnerabilities in Mixture-of-Experts (MoE) large language models, demonstrating that manipulating specific routers can create "unsafe routes" that convert safe outputs into harmful ones. The study introduces RoSais (Router Safety importance score) to identify critical routers and proposes F-SOUR framework, which achieves attack success rates of 0.90 and 0.98 on safety benchmarks across four MoE LLM families by exploiting routing configurations.
This research reveals that large language models (LLMs) possess "implicit memory"—the ability to encode information in their outputs and later recover it when those outputs are reintroduced as input, creating persistent information channels across supposedly independent interactions. The authors demonstrate this through "time bombs," a new class of temporal backdoors that activate only after a sequence of interactions satisfies hidden conditions, which can be induced through prompting or fine-tuning. The work discusses broader security implications including covert communication, benchmark contamination, and targeted manipulation.
Moltbook, a social network for AI agents that was entirely AI-coded without its founder writing any code himself, exposed a serious security flaw that leaked thousands of users' email addresses and millions of API credentials. The vulnerability, discovered by security firm Wiz, resulted from mishandling of a private key in the site's JavaScript code and allowed complete account impersonation and access to private communications between AI agents.
Fix: Though Moltbook has now fixed the site's flaw discovered by Wiz, its critical vulnerability should serve as a cautionary tale about the security of AI-made platforms.
Wired (Security)Anthropic's Claude Opus 4.6 AI model discovered over 500 previously unknown high-severity security vulnerabilities in major open-source libraries including Ghostscript, OpenSC, and CGIF. The model found these flaws without task-specific tooling or specialized prompting by analyzing code like a human researcher, identifying issues such as missing bounds checks, buffer overflows, and complex vulnerabilities requiring conceptual understanding of algorithms.
Fix: Increasing inference-time compute reduces attack success by over 50 percentage points on GPT-oss-120b model, indicating that inference-time reasoning effort is a key mitigating factor.
Arxiv (cs.CR + cs.AI)Fix: Enforcing authorization at a single location—the graph expansion boundary—eliminates measured leakage (RPR near 0) across both corpora, all attack variants, and label forgery rates up to 10 percent, with minimal overhead. The root cause is boundary enforcement: authorization must be re-checked at the vector-to-graph transition point to prevent two individually secure retrieval components from composing into an insecure system.
Arxiv (cs.CR + cs.AI)Fix: The paper outlines defensive perspectives including safety-aware route disabling and router training as promising directions to safeguard MoE LLMs.
Arxiv (cs.CR + cs.AI)Fix: The discovered vulnerabilities have been patched by the respective maintainers. Specifically mentioned: the heap buffer overflow vulnerability in CGIF was fixed in version 0.5.1. Anthropic emphasized the importance of 'promptly patching known vulnerabilities' as a security fundamental.
The Hacker News