Research
Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
This paper proposes CAPID, a context-aware PII detection system for question-answering platforms that addresses the limitation of current approaches which redact all PII regardless of contextual relevance. The approach fine-tunes a locally owned small language model (SLM) to detect PII spans, classify their types, and determine contextual relevance before data is passed to LLMs, avoiding privacy concerns with closed-source models. A synthetic data generation pipeline using LLMs is introduced to create training data that captures context-dependent PII relevance across multiple domains.
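As a rough illustration of the pipeline (ours, not the authors' code), the sketch below redacts only the spans a detector marks as contextually irrelevant before the question reaches an external LLM; a toy regex detector stands in for the fine-tuned SLM.

```python
# Minimal sketch of CAPID-style context-aware redaction (not the authors' code).
# A locally hosted SLM would produce the spans; a toy regex detector stands in here.
import re
from dataclasses import dataclass

@dataclass
class PIISpan:
    start: int
    end: int
    pii_type: str
    relevant: bool  # would come from the SLM's contextual-relevance prediction

def toy_detect(question: str) -> list[PIISpan]:
    """Stand-in for the fine-tuned SLM: finds emails and phone-like numbers."""
    spans = []
    for m in re.finditer(r"[\w.]+@[\w.]+", question):
        spans.append(PIISpan(m.start(), m.end(), "EMAIL", relevant=False))
    for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", question):
        spans.append(PIISpan(m.start(), m.end(), "PHONE", relevant=False))
    return spans

def sanitize(question: str, spans: list[PIISpan]) -> str:
    """Redact only the spans the model judged irrelevant to answering the question."""
    out, cursor = [], 0
    for s in sorted(spans, key=lambda s: s.start):
        out.append(question[cursor:s.start])
        out.append(question[s.start:s.end] if s.relevant else f"[{s.pii_type}]")
        cursor = s.end
    out.append(question[cursor:])
    return "".join(out)

q = "My email is jane@example.com and my phone is 555-123-4567; why is my bill higher?"
print(sanitize(q, toy_detect(q)))
```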
This paper argues that current agentic AI architectures are fundamentally incompatible with high-stakes scientific workflows because autoregressive language models cannot deterministically separate commands from data through training alone. The authors contend that probabilistic alignment and guardrails are insufficient for authorization security, and that deterministic architectural enforcement is necessary to prevent the "Lethal Trifecta" of untrusted inputs, privileged data access, and external action capability from becoming an exploit-discovery problem.
As a remedy, the paper introduces the Trinity Defense Architecture, which enforces security through three mechanisms: action governance via a finite action calculus with reference-monitor enforcement, information-flow control via mandatory access labels that prevent cross-scope leakage, and privilege separation that isolates perception from execution.
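The sketch below is a minimal illustration of the reference-monitor idea, assuming a hypothetical finite action set and scope labels; it is not the paper's formalism.

```python
# Hedged sketch of a reference-monitor layer in the spirit of the Trinity proposal:
# a finite action set, mandatory labels on data, and a check that runs outside the model.
from dataclasses import dataclass

ALLOWED_ACTIONS = {"read_dataset", "run_analysis", "write_report"}  # finite action calculus

@dataclass(frozen=True)
class LabeledValue:
    value: str
    scope: str  # e.g. "project_A"; labels are attached by the runtime, not the model

def authorize(action: str, inputs: list[LabeledValue], session_scope: str) -> bool:
    """Deterministic check: action must be in the calculus and no input may cross scopes."""
    if action not in ALLOWED_ACTIONS:
        return False
    return all(v.scope == session_scope for v in inputs)

# The LLM only *proposes*; the monitor decides.
proposal = ("write_report", [LabeledValue("results.csv", "project_A")])
print(authorize(*proposal, session_scope="project_A"))        # True
print(authorize("send_email", [], session_scope="project_A")) # False: outside the calculus
```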
LLM4PQC is an LLM-based agentic framework designed to address bottlenecks in post-quantum cryptography (PQC) hardware design by automatically refactoring PQC reference C code into high-level synthesis (HLS)-ready and synthesizable code. The framework uses a hierarchy of verification checks including C compilation, simulation, and RTL simulation to ensure correctness, and demonstrates reduced manual effort and accelerated design-space exploration in case studies on NIST PQC reference designs.
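A simplified sketch of the verify-then-retry pattern such a framework implies, with the LLM refactoring step and the simulation/RTL rungs left as placeholders:

```python
# Sketch of a verify-then-retry loop like the one LLM4PQC describes (not the authors' code).
# The refactoring call is a placeholder for an LLM agent; later rungs of the hierarchy
# (simulation against test vectors, HLS, RTL simulation) would chain in the same pattern.
import pathlib
import subprocess
import tempfile

def llm_refactor(c_source: str, feedback: str) -> str:
    """Placeholder for the LLM agent that rewrites reference C into HLS-ready C."""
    raise NotImplementedError("call your LLM backend here")

def check_compiles(c_source: str) -> tuple[bool, str]:
    """First rung of the hierarchy: does the refactored C still compile?"""
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d) / "kernel.c"
        src.write_text(c_source)
        r = subprocess.run(["cc", "-c", str(src), "-o", str(src.with_suffix(".o"))],
                           capture_output=True, text=True)
        return r.returncode == 0, r.stderr

def refine(reference_c: str, max_rounds: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_rounds):
        candidate = llm_refactor(reference_c, feedback)
        ok, feedback = check_compiles(candidate)
        if not ok:
            continue  # feed compiler errors back to the agent on the next round
        return candidate  # passed this rung; hand off to simulation / HLS / RTL checks
    return None
```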
Spinel is a post-quantum digital signature scheme that combines the security of SPHINCS+ with a new family of algebraic hash functions based on the Tillich-Zemor paradigm over SL_n(F_p). The scheme's security relies on the hardness of navigating expander graphs over SL_n(F_p), which is believed to be resistant to quantum adversaries. The work includes empirical security evidence, integration within the SPHINCS+ framework, security analysis, parameter selection, and performance evaluation demonstrating practical feasibility.
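As a toy illustration of the Tillich-Zemor idea (SL_2 rather than SL_n, and parameters far too small for security), a message can be hashed by multiplying generator matrices selected by its bits:

```python
# Toy illustration of a Tillich–Zemor-style hash: the digest is the product of generator
# matrices chosen by the message bits, reduced mod p. Spinel's construction works over
# SL_n(F_p) with carefully chosen generators and parameters; this only shows the shape.
P = 2_147_483_647  # toy prime, nowhere near real security

A = ((1, 1), (0, 1))  # determinant 1, so products stay in SL_2(F_p)
B = ((1, 0), (1, 1))

def matmul(x, y, p=P):
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) % p for j in range(2))
        for i in range(2)
    )

def tz_hash(message: bytes) -> tuple:
    h = ((1, 0), (0, 1))  # identity
    for byte in message:
        for bit in range(8):
            h = matmul(h, A if (byte >> bit) & 1 else B)
    return h

print(tz_hash(b"hello"))
print(tz_hash(b"hellp"))  # one changed bit walks a different path in the Cayley graph
```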
QRS (Query, Review, Sanitize) is a neuro-symbolic framework that uses three autonomous agents with Large Language Models to generate CodeQL queries, validate findings through semantic reasoning, and perform automated exploit synthesis for vulnerability discovery. Unlike traditional SAST tools that rely on expert-crafted queries and predefined patterns, QRS autonomously discovers vulnerability classes beyond known patterns while reducing false positives. In testing on PyPI packages, QRS achieved 90.6% detection accuracy on 20 historical CVEs and identified 39 medium-to-high-severity vulnerabilities in the top 100 most-downloaded packages, with 5 assigned new CVEs.
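A hedged sketch of the three-agent loop, with the LLM calls stubbed out and the CodeQL invocation shown only as an assumed CLI shape:

```python
# Sketch of the QRS Query -> Review -> Sanitize loop (not the authors' implementation).
# The three agent calls are LLM placeholders; the CodeQL invocation and result parsing
# are assumptions about the surrounding tooling.
import pathlib
import subprocess
import tempfile

def query_agent(vuln_class: str) -> str:
    """Drafts CodeQL query source for a vulnerability class (placeholder LLM call)."""
    raise NotImplementedError

def review_agent(finding: dict) -> bool:
    """Semantic triage: is the flagged flow actually reachable/exploitable? (placeholder)."""
    raise NotImplementedError

def sanitize_agent(finding: dict) -> str | None:
    """Attempts a proof-of-concept exploit for a confirmed finding (placeholder)."""
    raise NotImplementedError

def run_codeql(query_source: str, database: str) -> list[dict]:
    with tempfile.TemporaryDirectory() as d:
        qpath = pathlib.Path(d) / "generated.ql"
        qpath.write_text(query_source)
        # Assumed CLI shape; adapt to your CodeQL pack layout and output format.
        subprocess.run(["codeql", "query", "run", str(qpath), f"--database={database}"],
                       check=True)
    return []  # parse the BQRS/SARIF results here

def qrs(vuln_class: str, database: str) -> list[dict]:
    confirmed = []
    for finding in run_codeql(query_agent(vuln_class), database):
        if review_agent(finding):
            finding["poc"] = sanitize_agent(finding)
            confirmed.append(finding)
    return confirmed
```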
This research investigates using large language models (LLMs) for zero-shot feature selection in malware detection as an alternative to traditional statistical methods. The study evaluates multiple LLMs (GPT-5.0, GPT-4.0, Gemini-2.5) on the EMBOD dataset against conventional feature selection methods across various classifiers. Results show that LLM-guided zero-shot feature selection achieves competitive performance with traditional methods while providing enhanced interpretability, stability, and reduced dependence on labeled data.
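A minimal sketch of the zero-shot selection idea, assuming a placeholder ranking call to an LLM and a standard scikit-learn classifier downstream:

```python
# Sketch of zero-shot, LLM-guided feature selection (the paper's idea, not its code).
# The ranking call is a placeholder for any chat-completion backend.
from sklearn.ensemble import RandomForestClassifier

def llm_rank_features(feature_names: list[str], task: str) -> list[str]:
    """Placeholder: prompt an LLM to order features by relevance to `task`,
    using only feature names/descriptions -- no labels, no statistics."""
    raise NotImplementedError

def select_and_train(X, y, feature_names: list[str], k: int = 10):
    # X is a NumPy array of shape (n_samples, n_features); y holds the labels.
    ranked = llm_rank_features(feature_names, task="Windows PE malware detection")
    keep = [feature_names.index(f) for f in ranked[:k] if f in feature_names]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X[:, keep], y)
    return clf, keep
```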
This research introduces the Four-Checkpoint Framework to analyze where LLM safety mechanisms fail by organizing defenses along processing stage (input vs. output) and detection level (literal vs. intent). Testing GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro with 13 targeted evasion techniques across 3,312 test cases reveals that output-stage defenses (CP3, CP4) are weakest at 72-79% Weighted Attack Success Rate (WASR), while input-literal defenses (CP1) are strongest at 13% WASR. The study finds that traditional Binary ASR underestimates vulnerabilities (22.6%) compared to WASR (52.7%), showing 2.3× higher actual vulnerability rates.
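The summary does not give the paper's exact weighting, but a toy calculation shows why a graded metric reads higher than a binary one once partial compliance is credited (the scores below are hypothetical):

```python
# Illustration (not the paper's exact formula): binary ASR counts only full jailbreaks,
# while a weighted rate also credits partial policy violations, so it reads higher.
# Hypothetical severity scores: 1.0 = full compliance with the harmful request,
# 0.5 = partial/indirect leakage, 0.0 = refusal.
outcomes = [1.0, 0.0, 0.5, 0.0, 0.5, 1.0, 0.0, 0.5, 0.0, 0.0]

binary_asr = sum(1 for o in outcomes if o == 1.0) / len(outcomes)
weighted_asr = sum(outcomes) / len(outcomes)

print(f"Binary ASR:   {binary_asr:.1%}")   # 20.0%
print(f"Weighted ASR: {weighted_asr:.1%}") # 35.0%
```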
This paper introduces AGMark (Attention-Guided Dynamic Watermarking), a novel watermarking framework for Large Vision-Language Models (LVLMs) that addresses limitations in existing approaches. AGMark dynamically identifies semantic-critical tokens at each decoding step using attention weights and context-aware coherence cues, while determining the proportion of protected tokens through uncertainty awareness and evidence calibration. The framework achieves at least 99.36% detection accuracy (AUC) and maintains robust attack resilience (at least 88.61% AUC) while preserving visual semantic fidelity and generation quality.
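A hedged sketch of the selective-biasing idea, using attention mass and logit entropy to gate a generic green-list watermark; AGMark's actual scoring, uncertainty, and calibration rules are more involved than this.

```python
# Hedged sketch of attention-guided, selective watermark biasing in the spirit of AGMark.
# Attention to the image and logit entropy gate whether a generic green-list bias is
# applied at this decoding step; the paper's own rules are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)

def green_list(context_hash: int, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    g = np.random.default_rng(context_hash)
    return g.random(vocab_size) < gamma  # pseudo-random green set keyed on the context

def watermark_step(logits: np.ndarray, attn_to_image: float, context_hash: int,
                   delta: float = 2.0, attn_threshold: float = 0.6) -> np.ndarray:
    """Skip the bias when the step leans heavily on visual evidence (semantic-critical)
    or the model is already confident; otherwise push mass toward the green list."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    critical = attn_to_image > attn_threshold or entropy < 1.0
    if critical:
        return logits
    biased = logits.copy()
    biased[green_list(context_hash, logits.size)] += delta
    return biased

logits = rng.normal(size=32000)
print(np.argmax(watermark_step(logits, attn_to_image=0.2, context_hash=12345)))
```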
This paper introduces a novel fingerprinting framework for protecting the intellectual property of large language models by using "refusal vectors", behavioral patterns extracted from a model's internal representations when processing harmful versus harmless prompts. The method demonstrates 100% accuracy in identifying base model families across 76 offspring models and proves robust against common modifications like finetuning, merging, and quantization. The authors propose a theoretical framework using locality-sensitive hashing and zero-knowledge proofs to transform private fingerprints into publicly verifiable, privacy-preserving artifacts.
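Conceptually, the fingerprint can be sketched as a normalized difference of mean activations compared by cosine similarity; the extraction step below is a placeholder, and the paper's verification layer (LSH plus zero-knowledge proofs) is omitted.

```python
# Sketch of a refusal-vector fingerprint (the concept from the paper, not its code).
# hidden_state(model, prompt) would run the model and return one layer's activation.
import numpy as np

def hidden_state(model, prompt: str) -> np.ndarray:
    """Placeholder: forward `prompt` through `model`, return a chosen layer's activation."""
    raise NotImplementedError

def refusal_vector(model, harmful: list[str], harmless: list[str]) -> np.ndarray:
    h_bad = np.mean([hidden_state(model, p) for p in harmful], axis=0)
    h_ok = np.mean([hidden_state(model, p) for p in harmless], axis=0)
    v = h_bad - h_ok
    return v / np.linalg.norm(v)

def same_family(v1: np.ndarray, v2: np.ndarray, threshold: float = 0.8) -> bool:
    """Cosine similarity as a simple matcher; the paper adds LSH and ZK proofs on top."""
    return float(v1 @ v2) >= threshold
```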
This paper introduces Autonomous Action Runtime Management (AARM), an open specification for securing AI-driven actions at runtime as AI systems evolve from passive assistants to autonomous agents capable of executing consequential actions. AARM defines a runtime security system that intercepts actions before execution, evaluates them against policy and intent alignment, enforces authorization decisions, and records tamper-evident receipts, addressing threats like prompt injection, confused deputy attacks, data exfiltration, and intent drift. The specification proposes four implementation architectures and aims to establish industry-wide security requirements for AI agent systems before proprietary fragmentation occurs.
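The runtime pattern the specification describes can be sketched as an interception layer with hash-chained receipts; the policy shape and names below are illustrative, not taken from the spec.

```python
# Illustrative sketch of an AARM-style runtime gate (names and policy format are ours):
# intercept an action, evaluate it, enforce the decision, and record a hash-chained receipt.
import hashlib
import json
import time

class ActionRuntime:
    def __init__(self, allowed_tools: set[str], declared_intent: str):
        self.allowed_tools = allowed_tools
        self.declared_intent = declared_intent
        self.receipts: list[dict] = []
        self._prev_hash = "0" * 64

    def _record(self, entry: dict) -> None:
        entry["prev"] = self._prev_hash           # chain each receipt to the previous one
        entry["ts"] = time.time()
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self._prev_hash = digest
        self.receipts.append(entry)

    def execute(self, tool: str, args: dict, justification: str):
        decision = "allow" if tool in self.allowed_tools else "deny"
        # A fuller implementation would also score `justification` against the declared
        # intent (intent-drift and prompt-injection checks) before deciding.
        self._record({"tool": tool, "args": args, "decision": decision,
                      "intent": self.declared_intent})
        if decision == "deny":
            raise PermissionError(f"{tool} blocked by runtime policy")
        return f"executed {tool}"

rt = ActionRuntime({"search_docs", "draft_email"}, declared_intent="summarize Q3 report")
print(rt.execute("search_docs", {"q": "Q3 revenue"}, "need figures for the summary"))
```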
This research systematically studies adversarial transferability of encoder-based attacks against large vision-language models (LVLMs), revealing that existing attacks have severely limited transferability across different LVLM architectures. The study identifies two root causes hindering transferability: inconsistent visual grounding across models and redundant semantic alignment within models. To address these limitations, the authors propose Semantic-Guided Multimodal Attack (SGMA), a framework that achieves higher transferability by directing perturbations toward semantically critical regions and disrupting cross-modal grounding at both global and local levels.
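A toy sketch of region-masked perturbation, with the alignment-disruption gradient left as a placeholder (the real attack optimizes against encoder features and cross-modal grounding at global and local levels):

```python
# Toy sketch of region-masked adversarial perturbation in the spirit of SGMA.
# The gradient function is a placeholder and the "semantic mask" is supplied by the caller;
# SGMA's actual objectives target encoder features and cross-modal grounding.
import numpy as np

def loss_gradient(image: np.ndarray) -> np.ndarray:
    """Placeholder: gradient of an alignment-disruption loss w.r.t. the input image."""
    raise NotImplementedError

def masked_pgd(image: np.ndarray, semantic_mask: np.ndarray,
               eps: float = 8 / 255, alpha: float = 2 / 255, steps: int = 10) -> np.ndarray:
    adv = image.copy()
    for _ in range(steps):
        g = loss_gradient(adv)
        adv += alpha * np.sign(g) * semantic_mask      # perturb only the critical regions
        adv = np.clip(adv, image - eps, image + eps)   # stay within the L_inf budget
        adv = np.clip(adv, 0.0, 1.0)
    return adv
```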
LLMAC is a new access control framework that uses Large Language Models to unify traditional access control methods (RBAC, ABAC, DAC) into a single comprehensive system. Using Mistral 7B trained on synthetic datasets representing complex real-world scenarios, the system achieved 98.5% accuracy, significantly outperforming traditional methods (RBAC: 14.5%, ABAC: 58.5%, DAC: 27.5%) while providing human-readable explanations for decisions.
This research paper introduces a measurement framework for monitoring GPU utilization in untrusted environments to support AI governance. The framework uses four complementary primitives based on timing and memory characteristics—Proof-of-Work-inspired mechanisms, Verifiable Delay Functions, GEMM-based tensor-core measurements, and VRAM-residency tests—to detect GPU compute activity even without trusted firmware or vendor-controlled counters. The approach aims to provide compute-based telemetry that can help detect unauthorized repurposing of GPUs for model training or policy violations.
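A simplified GEMM-timing probe of the kind the paper describes might look like the following (our sketch, assuming PyTorch; the actual primitives add verifiability and careful calibration):

```python
# Sketch of a GEMM-timing probe like the paper's tensor-core primitive (our simplification):
# a matrix multiply of known size is timed; contention from other workloads inflates latency.
import time
import torch

def gemm_probe(n: int = 4096, repeats: int = 10) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn_like(a)
    torch.matmul(a, b)                      # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

# A verifier compares the measured latency against a calibrated baseline for the claimed
# GPU model; sustained deviation suggests the device is busy with other work.
print(f"mean GEMM latency: {gemm_probe() * 1e3:.2f} ms")
```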
This paper introduces the first systematic benchmark for evaluating knowledge-extraction attacks on Retrieval-Augmented Generation (RAG) systems, which can be exploited through maliciously crafted queries to recover sensitive knowledge-base content. The benchmark consolidates fragmented research by providing a unified experimental framework covering various attack and defense strategies, retrieval embedding models, and both open- and closed-source generators across standardized datasets.
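The kind of evaluation loop such a benchmark standardizes can be sketched as follows, with the RAG system under test and the attack prompts left as placeholders:

```python
# Sketch of an extraction-attack evaluation loop of the sort the benchmark standardizes
# (our simplification; the rag_answer call and attack queries are placeholders).
from difflib import SequenceMatcher

def rag_answer(query: str) -> str:
    """Placeholder: the RAG system under test (retriever + generator)."""
    raise NotImplementedError

def recovered(chunk: str, answer: str, threshold: float = 0.6) -> bool:
    """Crude leakage check: string similarity between a knowledge-base chunk and the answer."""
    return SequenceMatcher(None, chunk.lower(), answer.lower()).ratio() >= threshold

def evaluate(attack_queries: list[str], kb_chunks: list[str]) -> float:
    leaked = set()
    for q in attack_queries:
        answer = rag_answer(q)
        for i, chunk in enumerate(kb_chunks):
            if recovered(chunk, answer):
                leaked.add(i)
    return len(leaked) / len(kb_chunks)  # fraction of the knowledge base recovered
```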
This research paper presents a study of 18 UK-based domestic workers examining privacy risks from AI-driven smart home devices in both employer-controlled homes and their own households. The study develops a sociotechnical threat model identifying how AI analytics, data logs, cross-household data flows, and employment agencies create surveillance and privacy boundary challenges for domestic workers.
MUZZLE is an automated agentic framework designed to evaluate the security of LLM-based web agents against indirect prompt injection attacks. The system adaptively identifies injection surfaces from agent trajectories and generates context-aware malicious instructions, successfully discovering 37 new attacks across 4 web applications that violate confidentiality, integrity, and availability properties, including novel cross-application attacks and agent-tailored phishing scenarios.
CIC-Trap4Phish is a multi-format dataset designed to improve detection of phishing and quishing attacks through malicious email attachments. The dataset covers five common file types (Word, Excel, PDF, HTML, and QR code images) and uses execution-free static feature pipelines for the first four types, while employing CNNs and lightweight language models for QR code-based phishing detection. Machine learning models including Random Forest, XGBoost, and Decision Tree demonstrated high detection accuracy across all formats.
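The execution-free pattern the dataset supports can be sketched with a handful of illustrative static HTML features (not the dataset's actual schema) feeding a Random Forest:

```python
# Sketch of the execution-free pipeline the dataset is built for. The features are
# illustrative, not CIC-Trap4Phish's schema: static signals from an attachment, no rendering.
import re
from sklearn.ensemble import RandomForestClassifier

def html_features(html: str) -> list[float]:
    """A few static signals often used for HTML attachments; nothing is executed."""
    return [
        float(len(html)),
        float(len(re.findall(r"<script\b", html, re.I))),
        float(len(re.findall(r"<form\b", html, re.I))),
        float(len(re.findall(r"https?://", html))),
        float("base64," in html),
    ]

def train(samples: list[str], labels: list[int]) -> RandomForestClassifier:
    X = [html_features(s) for s in samples]
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X, labels)
    return clf
```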
This paper identifies a vulnerability in password-authenticated key exchange (PAKE) protocols called "reverse online guessing attacks," in which an adversary validates password guesses by impersonating a server rather than a client. The attack is particularly effective in phishing and password-spraying scenarios and in applications with automated logins such as WPA3-SAE, and it exploits the fact that PAKE protocols have no server authentication mechanism beyond the password itself.
StealthRL is a reinforcement learning framework that uses paraphrasing attacks to evade AI-text detectors while preserving semantic meaning. The system achieves near-zero detection rates (0.001 mean TPR@1%FPR) and 99.9% attack success rate against multiple detector families, with attacks successfully transferring to unseen detector types, revealing fundamental architectural vulnerabilities in current AI-text detection systems.
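The reward shaping such a setup implies can be sketched as a trade-off between detector evasion and semantic fidelity; the detector, similarity model, and paraphraser are placeholders, and StealthRL's exact objective and training loop are in the paper.

```python
# Sketch of the reward shaping a detector-evasion RL setup implies (our formulation,
# not StealthRL's exact objective): reward evasion while penalizing meaning drift.
def detector_score(text: str) -> float:
    """Placeholder: probability the detector assigns to 'AI-generated' (0..1)."""
    raise NotImplementedError

def semantic_similarity(a: str, b: str) -> float:
    """Placeholder: e.g. cosine similarity of sentence embeddings (0..1)."""
    raise NotImplementedError

def reward(original: str, paraphrase: str, lam: float = 0.5) -> float:
    evasion = 1.0 - detector_score(paraphrase)            # reward slipping past the detector
    fidelity = semantic_similarity(original, paraphrase)  # keep the meaning intact
    return lam * evasion + (1.0 - lam) * fidelity
```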
This paper proposes a comprehensive framework for integrating Zero Trust Architecture (ZTA) into cloud-based endpoint security for critical infrastructure such as power plants, healthcare systems, and financial systems. The framework aims to address the gap in applying ZTA to endpoint management within cloud environments, treating every access request as new with no implicit trust, thereby enhancing compliance, enabling continuous protection, and reducing attack surfaces.