Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
FedPerX is a federated transformer framework (a system where multiple computers train an AI model together without sharing raw data) designed for sentiment analysis across multiple languages while protecting privacy. It uses residual adapters (lightweight customizable modules added to a shared language model) and differential privacy (a mathematical technique that adds noise to data to prevent identifying individuals) to let each participant personalize their model without compromising data privacy. The framework outperforms existing methods on multilingual datasets with improved accuracy and significantly reduced communication needs.
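As a rough illustration of how differential privacy is commonly applied to model updates in federated training, the sketch below clips an update to a maximum L2 norm and then adds Gaussian noise. It is a generic clip-and-noise step, not FedPerX's actual mechanism, which the summary does not describe; the names `clip_l2`, `privatize_update`, `max_norm`, and `noise_multiplier` are assumptions made for this example.

```python
import math
import random

def clip_l2(update, max_norm):
    """Scale the update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]

def privatize_update(update, max_norm, noise_multiplier, rng=None):
    """Clip the update, then add Gaussian noise with std = noise_multiplier * max_norm."""
    if rng is None:
        rng = random.Random()
    clipped = clip_l2(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]

# Hypothetical adapter update with L2 norm 5; values are illustrative only.
adapter_update = [3.0, 4.0]
noisy = privatize_update(adapter_update, max_norm=1.0,
                         noise_multiplier=0.5, rng=random.Random(0))
```

Clipping bounds how much any single participant can influence the shared model, and that bound is what lets the noise scale be tied to a privacy guarantee.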
This paper proposes RCMCL (Robust Trusted Conflictive Multiview Collaborative Contrastive Learning), a method to improve AI models that learn from multiple sources of data (multiview learning) when those sources conflict or misalign with each other. The approach uses evidential deep neural networks (a technique that estimates uncertainty in predictions) and contrastive learning (a training method that teaches the model to recognize similar and different examples) to make the model more reliable and accurate even when the data sources provide contradictory information.
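To make the idea of uncertainty estimation in evidential networks more concrete, the sketch below uses the standard subjective-logic formulation: non-negative per-class evidence is turned into Dirichlet parameters, belief masses, and a single uncertainty mass. This is the generic formulation, assumed here for illustration; the summary does not say how RCMCL computes or combines uncertainty across views, and `evidential_opinion` is a hypothetical name.

```python
def evidential_opinion(evidence):
    """Map non-negative per-class evidence to belief masses and an uncertainty mass."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]    # Dirichlet parameters
    strength = sum(alpha)                  # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength   # beliefs and uncertainty sum to 1
    return belief, uncertainty

# Two hypothetical views of the same sample: one with strong evidence, one weak.
confident_view = evidential_opinion([40.0, 1.0, 1.0])
unsure_view = evidential_opinion([0.5, 0.4, 0.6])
```

A view that gathers little evidence reports high uncertainty, which gives a multiview model a principled reason to trust it less when views disagree.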
This research introduces PP-DR, a privacy-preserving dimensionality reduction (a technique that reduces the number of features in a dataset to make it easier to analyze) scheme that uses homomorphic encryption (a type of encryption that allows computations on encrypted data without decrypting it first) to let multiple organizations securely share and analyze data together without revealing sensitive information. The new method is much faster and more accurate than previous approaches, achieving 30 to 200 times better computational efficiency and 70% less communication overhead.
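The notion of computing on encrypted data can be illustrated with a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This is only a sketch of the general idea. The summary does not say which encryption scheme PP-DR uses, and the tiny primes below are chosen for readability and offer no security.

```python
import math
import random

# Toy parameters: tiny primes for readability only, NOT secure.
P, Q = 293, 433
N = P * Q
N_SQ = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
MU = pow(LAM, -1, N)  # valid because the generator is fixed to g = N + 1

def encrypt(m, rng=random):
    """Encrypt an integer 0 <= m < N under the public key N."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    # (1 + N)^m mod N^2 equals 1 + m*N mod N^2
    return ((1 + m * N) * pow(r, N, N_SQ)) % N_SQ

def decrypt(c):
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N

def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo N."""
    return (c1 * c2) % N_SQ

# The party performing the addition never sees 1200 or 34 in the clear.
total = decrypt(add_encrypted(encrypt(1200), encrypt(34)))
```

In a multi-party setting, each organization would encrypt its contribution and an untrusted aggregator would combine the ciphertexts, with only the key holder able to read the result.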
AdvScan is a method for detecting adversarial examples (inputs slightly modified to trick AI models into making wrong predictions) on tiny machine learning models running on edge devices (small hardware like microcontrollers) without needing access to the model's internal details. The approach monitors power consumption patterns during the model's operation, since adversarial examples create unusual power signatures that differ from normal inputs, and uses statistical analysis to flag suspicious inputs in real-time with minimal performance overhead.
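One simple way to flag unusual power signatures is a z-score test against a baseline built from benign inputs, sketched below. The feature used (mean squared sample value per trace), the threshold, and the class name `PowerAnomalyDetector` are all assumptions for illustration; the summary does not specify which features or statistical tests AdvScan uses.

```python
import statistics

def trace_energy(trace):
    """Summarize one power trace as its mean squared sample value."""
    return sum(s * s for s in trace) / len(trace)

class PowerAnomalyDetector:
    """Flags traces whose energy is far from the benign baseline."""

    def __init__(self, benign_traces, z_threshold=3.0):
        energies = [trace_energy(t) for t in benign_traces]
        self.mean = statistics.mean(energies)
        self.std = statistics.stdev(energies)
        if self.std == 0.0:
            raise ValueError("benign traces must show some variation")
        self.z_threshold = z_threshold

    def is_suspicious(self, trace):
        z = abs(trace_energy(trace) - self.mean) / self.std
        return z > self.z_threshold

# Made-up traces standing in for power samples recorded during benign inference.
benign = [
    [1.0, 1.1, 0.9, 1.0],
    [1.05, 0.95, 1.0, 1.0],
    [0.9, 1.0, 1.1, 1.05],
    [1.0, 1.0, 0.95, 1.1],
]
detector = PowerAnomalyDetector(benign)
```

Because the detector only looks at measurements taken outside the model, it needs no access to weights or gradients, which matches the black-box setting the summary describes.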
Researchers have developed a new backdoor attack method called shell code injection (SCI) that can implant malicious logic into deep learning models (neural networks trained on large datasets) without needing to poison the training data. The attack uses techniques inspired by nature, like camouflage, along with trigger verification and code packaging strategies to trick models into making wrong predictions, and it can adapt its attack targets dynamically using large language models (LLMs) to make it more flexible and harder to detect.
Differentially private databases (DP-DBs, systems that add mathematical noise to data to protect individual privacy while allowing useful analysis) need auditing services to verify they actually protect privacy as promised, but current approaches don't handle database-specific challenges like varying query sensitivities well. This paper introduces DPAudit, a framework that audits DP-DBs by generating realistic test scenarios, estimating privacy loss parameters, and detecting improper noise injection through statistical testing, even when the database's inner workings are hidden.
Fix: N/A. The source presents DPAudit as an auditing framework and does not describe a patch, update, or deployment fix for existing vulnerable systems.
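A black-box check for improper noise injection can be sketched as follows: issue the same query many times, estimate the noise scale from the spread of the answers, and compare the implied epsilon with the claimed one. The sketch assumes Laplace noise, a known query sensitivity, and a database that draws fresh noise on every repeat, none of which is stated in the summary. It uses a point estimate with a tolerance rather than a formal hypothesis test, and it is not DPAudit's actual procedure.

```python
import random
import statistics

def laplace_noise(scale, rng):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def estimate_epsilon(noisy_answers, sensitivity):
    """Estimate epsilon from repeated answers to one query, assuming Laplace noise."""
    center = statistics.median(noisy_answers)
    # For Laplace noise, the mean absolute deviation equals the scale b.
    scale_hat = sum(abs(a - center) for a in noisy_answers) / len(noisy_answers)
    return sensitivity / scale_hat

def violates_claim(noisy_answers, sensitivity, claimed_epsilon, tolerance=1.2):
    """Flag the database when the estimated epsilon clearly exceeds the claim."""
    return estimate_epsilon(noisy_answers, sensitivity) > claimed_epsilon * tolerance

rng = random.Random(7)
true_count, sensitivity, claimed_eps = 100.0, 1.0, 1.0
honest = [true_count + laplace_noise(sensitivity / claimed_eps, rng)
          for _ in range(20000)]
# Hypothetical buggy database: noise calibrated for epsilon = 4 while claiming epsilon = 1.
buggy = [true_count + laplace_noise(sensitivity / 4.0, rng)
         for _ in range(20000)]
```

Here `violates_claim(buggy, sensitivity, claimed_eps)` flags the under-noised database, while the correctly calibrated one passes.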
IEEE Xplore (Security & AI Journals)
PROTheft is a model extraction attack (a method where attackers steal an AI model's functionality by observing its responses to many input queries) that works on real-world vision systems like autonomous vehicles by projecting digital attack samples onto a device's camera. The attack bridges the gap between digital attacks and physical-world scenarios by using a projector to convert digital inputs into physical images, and includes a simulation tool to predict how well attack samples will work when converted from digital to physical to digital formats.
Version 5.4.0 (released February 5, 2026) is an update to a security framework that documents new attack techniques targeting AI agents, including publishing poisoned AI agent tools (malicious versions of legitimate tools), escaping from AI systems to access the host computer, and exploiting vulnerabilities to steal credentials or evade security. The update also includes new real-world case studies showing how attackers have compromised AI agent control systems and used prompt injection (tricking an AI by hiding commands in its input) to establish control.
This research proposes AHEDB (Accelerated Homomorphically Encrypted DataBase), a system designed to speed up database queries on encrypted data using Fully Homomorphic Encryption, or FHE (a method that lets computers perform calculations on encrypted information without decrypting it first). The system uses Encrypted Multiple Maps to reduce computational strain and a Single Range Cover algorithm for indexing, achieving better performance than existing FHE-based approaches while maintaining security.
HiveTEE is a security architecture that divides applications running inside a TEE (Trusted Execution Environment, a secure zone on a processor that protects sensitive operations from the main operating system) into smaller isolated domains, so that if one part is compromised, the damage doesn't spread to the rest. It uses RME (Realm Management Extension, a hardware feature that creates isolated execution spaces) and MTE (Memory Tagging Extension, a feature that prevents certain memory attacks), and testing shows it adds minimal slowdown (less than 3%) to applications.
This research proposes a new method for knowledge distillation (training a smaller AI model to mimic a larger one) that preserves adversarial robustness (the ability to resist attacks designed to fool AI systems). Instead of having the student model copy all predictions from the teacher model, the method uses "inverse adversarial examples" (inputs created by reversing the direction of adversarial attacks) to guide learning toward more reliable predictions, resulting in better robustness transfer between models.
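For context, the conventional distillation objective that such methods build on compares temperature-softened teacher and student outputs with a KL divergence. The sketch below shows only that standard term. How the paper constructs inverse adversarial examples and feeds them into training is not described in the summary and is not modeled here; the function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

# Hypothetical logits for one input; a student that matches the teacher has zero loss.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A robustness-preserving variant would change which inputs this loss is evaluated on, rather than the form of the loss itself.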
Person re-identification (Re-ID) systems, which recognize and track individuals across camera footage, can be attacked to steal both pedestrian images and the AI model itself, threatening the privacy of the system operator and of the people being monitored. Existing privacy-protection methods fail to defend against all types of leaks while keeping the system working normally, so the researchers propose SHIELD, a two-stage framework that uses protected image generation and feature protection techniques to prevent data and model theft without reducing the system's accuracy for authorized users.
Large vision-language models (LVLMs, AI systems that process both images and text) often hallucinate incorrect relationships between objects in images, such as falsely claiming one object is near another. Researchers created R-Bench, a benchmark (a standardized test) to evaluate these relationship hallucination errors, and found that the mistakes happen because models rely too much on language patterns rather than analyzing the visual content. The study proposes Region-Aware Alignment Mitigation (RA²M), which improves the model's attention to specific regions of an image to better align its descriptions with what is actually shown.
Fix: Region-level image-text alignment helps mitigate relationship hallucinations. The authors propose Region-Aware Alignment Mitigation (RA²M), which 'enhances model attention to relevant regions, improving alignment between generated text and images.'
IEEE Xplore (Security & AI Journals)
This research shows that large language models can be tricked or protected using in-context learning (ICL, a technique where an AI learns from examples provided in its current input rather than from training). The researchers developed two methods: an In-Context Attack that uses harmful examples to make LLMs produce unsafe outputs, and an In-Context Defense that uses refusal examples to strengthen safety. The study demonstrates that both attacking and defending LLM safety through carefully chosen demonstrations are effective and scalable.
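The defensive side of this idea amounts to prepending refusal demonstrations to the user's query before it reaches the model. The sketch below shows one possible way to assemble such a prompt. The `User:`/`Assistant:` layout, the function name `build_defended_prompt`, and the placeholder demonstrations are assumptions for illustration, not the paper's exact template.

```python
def build_defended_prompt(user_query, refusal_demos):
    """Prepend refusal demonstrations to the user's query."""
    parts = []
    for request, refusal in refusal_demos:
        parts.append(f"User: {request}\nAssistant: {refusal}")
    # The real query comes last, leaving the assistant turn open for the model.
    parts.append(f"User: {user_query}\nAssistant:")
    return "\n\n".join(parts)

# Placeholder demonstrations; real ones would be curated refusal examples.
demos = [
    ("<harmful request 1>", "I can't help with that."),
    ("<harmful request 2>", "I can't assist with this request."),
]
prompt = build_defended_prompt("Summarize this article for me.", demos)
```

No model weights change in this approach; the demonstrations shape behavior only through the context window, which is why the same mechanism can be turned toward attack or defense.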
EvTexture++ is a framework that uses event-based vision (cameras that capture changes in brightness at extremely high speed and can see very bright and dark areas simultaneously) to improve video super-resolution, which is the process of creating high-resolution videos from lower-resolution ones. Instead of using events just to track motion, this approach uses them to recover fine details and textures in videos, and prevents texture flickering when objects move quickly across frames.
Version 5.2.0 adds new attack techniques against AI systems, including methods to steal credentials from AI agent tools (software components that perform actions on behalf of an AI), poison training data, and generate malicious commands. It also introduces new defenses such as segmenting AI agent components, validating inputs and outputs, detecting deepfakes, and implementing human oversight for AI agent actions.
Fix: The source lists mitigations rather than fixes for a specific vulnerability. Key mitigations mentioned include: Input and Output Validation for AI Agent Components, Segmentation of AI Agent Components, Restrict AI Agent Tool Invocation on Untrusted Data, Human In-the-Loop for AI Agent Actions, Adversarial Input Detection, Model Hardening, Sanitize Training Data, and Generative AI Guardrails.
MITRE ATLAS Releases
Large language models face four main types of adversarial threats: privacy breaches (exposing sensitive data the model learned), integrity compromises (corrupting the model's outputs or training data), adversarial misuse (using the model for harmful purposes), and availability disruptions (making the model unavailable or slow). The article organizes these threats by their attackers' goals to help understand the landscape of vulnerabilities in LLMs.
This short story examines privacy risks that arise when companies are bought and sold, particularly concerning AI digital twins (AI models that replicate a specific person's behavior and knowledge) and the problems that occur when organizations fail to threat model (identify and plan for potential security risks in) major changes to their systems and technology. The story raises ethical questions about these scenarios.
Current AI assistants are not yet trustworthy enough to be personal advisors, despite how useful they seem. They fail in specific ways: they encourage users to make poor decisions, they create false doubt about things people know to be true (gaslighting), and they confuse a person's current identity with their past. They also struggle when information is incomplete or inaccurate, with no reliable way to fix errors or hold the system responsible when wrong information causes harm.