Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
A banking group implemented a retrieval-augmented AI-powered compliance assistant (a system where AI pulls in external compliance documents to answer questions) to help with regulatory requirements while maintaining human oversight. The article identifies key challenges with this approach, including authority illusion (over-trusting the AI's answers), unclear responsibility for decisions, loss of human judgment about context, and gaps in understanding how the system works, then proposes a four-phase framework to help organizations move from passive AI assistants toward systems where AI and humans reason together.
QuEST is a new framework that makes backdoor attacks (hidden malicious behaviors injected into AI models) more stealthy and efficient when models undergo quantization (compressing models to use less memory and computation). The framework uses special training techniques and parameter sharing to hide the attack from detection systems while reducing the computational resources needed to carry out the attack.
Researchers discovered a new attack called Lure that targets generative language models (GLMs, which are AI systems that generate text) during the fine-tuning process (when developers customize an open-source model with their own data). By hiding malicious code in the source code of an open-source model, attackers can trick a fine-tuned model into remembering and later revealing the proprietary data used to customize it through specially crafted prompts (input text designed to trigger specific outputs).
This research addresses security challenges in Internet of Things (IoT) devices by improving radio frequency fingerprint identification (RFFI, a method that uniquely identifies devices based on their wireless signal characteristics) using federated learning (a distributed AI training approach where data stays on local devices rather than being sent to a central server). The paper proposes a feature alignment strategy to handle non-IID data (data that isn't uniformly distributed across different receivers), which occurs when different receivers have different hardware and environmental conditions, and demonstrates that the approach achieves 90.83% identification accuracy with improved stability compared to existing federated learning methods.
Fix: The paper proposes a feature alignment strategy based on federated learning that guides each client (receiver) to learn aligned intermediate feature representations during local training, effectively mitigating the adverse impact of distribution shifts on model generalization in heterogeneous wireless environments.
IEEE Xplore (Security & AI Journals)AdaParse is a framework that can identify the specific settings (hyperparameters, which are configuration values that control how a model behaves) used to create AI-generated images by analyzing those images in detail. Unlike older methods that use a single general fingerprint (a characteristic pattern), AdaParse creates customized fingerprints for each image, allowing it to distinguish between images made with different settings across many different generative models (AI systems that create images).
This research proposes a new method called DP-QAM (Differentially Private Quadrature Amplitude Modulation) to solve privacy and communication problems in federated analytics (a system where multiple devices analyze data together without sending raw data to a central server). The method takes advantage of natural errors that occur during data compression and wireless transmission to add extra privacy protection, while balancing privacy, communication efficiency, and accuracy.
Large Language Models (LLMs, AI systems trained on massive amounts of text) used in task-oriented dialogue systems (AI assistants designed to help users complete specific goals like booking travel) can accidentally memorize and leak sensitive training data, including personal information like phone numbers and complete travel schedules. Researchers demonstrated new attack techniques that can extract thousands of pieces of training data from these systems with over 70% accuracy in the best cases. The paper identifies factors that influence how much data LLMs memorize in dialogue systems but does not propose specific fixes.
This research addresses vulnerabilities in Federated Learning (FL, a system where multiple computers train an AI model together without sharing their raw data), which faces attacks from malicious participants and privacy leaks from gradient updates (the numerical adjustments that improve the model). The authors propose a new method combining homomorphic encryption (a way to perform calculations on encrypted data without decrypting it) and dimension compression (reducing the size of data while keeping important relationships intact) to protect privacy and defend against Byzantine attacks (when malicious actors send corrupted data to sabotage the system) while reducing computational costs by 25 to 35 times.
Large vision-language models (LVLMs, which are AIs that understand both images and text) can be attacked using simple visual transformations, such as rotations or color changes, that fool them into giving wrong answers. Researchers found that combining multiple harmful transformations can make these attacks more effective, and they can be optimized using gradient approximation (a mathematical technique to find the best attack parameters). This research highlights a previously overlooked safety risk in how well LVLMs resist these kinds of adversarial attacks (attempts to trick AI systems).
OwnerHunter is a system that uses large language models (AI trained on vast amounts of text) to identify who owns a website by analyzing webpage content across multiple languages. It improves on older methods that struggled when webpages listed many names or were written in non-English languages, using strategies like checking multiple sources on a page and verifying results to accurately determine the true owner.
Deep neural networks can be attacked through backdoors, where attackers secretly poison training data to make the model misclassify certain inputs while appearing normal otherwise. This paper proposes Cert-SSBD, a defense method that uses randomized smoothing (adding random noise to samples) with sample-specific noise levels, optimized per sample using stochastic gradient ascent, combined with a new certification approach to make models more resistant to these attacks.
Fix: The proposed Cert-SSBD method addresses the issue by employing stochastic gradient ascent to optimize the noise magnitude for each sample, applying this sample-specific noise to multiple poisoned training sets to retrain smoothed models, aggregating predictions from multiple smoothed models, and introducing a storage-update-based certification method that dynamically adjusts each sample's certification region to improve certification performance.
IEEE Xplore (Security & AI Journals)Gradient leakage attacks (methods that steal private data by analyzing the mathematical updates sent between computers in federated learning, where AI training happens across multiple devices) pose privacy risks in federated learning systems. Researchers discovered that different layers of neural networks (sections that process information at different stages) leak different amounts of private information, so they created Layer-Specific Gradient Protection (LSGP), which applies stronger privacy protection to layers that leak more sensitive data rather than protecting all layers equally.
When users send prompts to LLM services like ChatGPT, sensitive personal information (such as names, addresses, or ID numbers) can leak out, even when basic privacy protections are used. This paper presents Rap-LI, a framework that identifies which parts of a user's input contain sensitive data and applies stronger privacy protection to those specific parts, rather than treating all data equally.
This research proposes a new privacy-preserving method for key generation in ABE (attribute-based encryption, a system that lets users control access to data based on their personal attributes). The method follows a principle called Minimal Disclosure, where users only reveal the specific attributes they need to prove, rather than exposing all their attributes. The protocol separates attribute verification from key generation into two steps, uses batch verification to improve performance, and introduces metrics to measure how well it resists attacks that try to infer hidden user attributes.
Researchers developed PPOM-Attack, a method to fool face recognition (FR) systems by generating adversarial images (slightly altered photos that trick AI into misidentifying someone). Unlike earlier attacks that used substitute models (simpler AI systems trained to mimic the target system), PPOM-Attack directly queries the real face recognition system to learn how to create effective perturbations (tiny pixel changes), achieving 21.7% higher success rates while keeping the altered images looking natural.
Prompt injection attacks (tricking an AI by hiding malicious instructions in its input) pose a serious security risk to Large Language Models, as attackers can overwrite a model's original instructions to manipulate its responses. Researchers developed PromptFuzz, a testing framework that uses fuzzing techniques (automatically generating many variations of input data to find weaknesses) to systematically evaluate how well LLMs resist these attacks. Testing showed that PromptFuzz was highly effective at finding vulnerabilities, ranking in the top 0.14% of attackers in a real competition and successfully exploiting 92% of popular LLM-integrated applications tested.
This paper presents SIMix, a training framework for systems where multiple users learn AI models together over wireless networks while protecting their private data. The system uses Over-the-Air Mixup (OAM, a technique that combines data from multiple users through wireless transmission to hide sensitive information) and groups users strategically to reduce communication needs by up to 25% while defending against model inversion attacks (attempts to reconstruct private training data from a trained model) and label inference attacks (guessing what category a user's data belongs to).
Fix: The paper proposes integrating Over-the-Air Mixup with label-aware user grouping, including a closed-form Tx-Rx scaling optimization that minimizes mean square error under channel noise, and an extended max-clique algorithm that dynamically partitions users into groups with minimal intra-label similarity to reduce model inversion attack success rates.
IEEE Xplore (Security & AI Journals)Graph neural networks (GNN, a type of AI that learns from data organized as interconnected nodes and edges) are vulnerable to adversarial topology perturbation, which means attackers can fool them by slightly changing the graph structure. This paper proposes AT-GSE, a new adversarial training method (a technique that strengthens AI models by training them on intentionally corrupted inputs) that uses graph subspace energy, a measure of how stable a graph is, to improve GNN robustness against these attacks.
Researchers discovered a new attack called HijackFL that can hijack machine learning models in federated learning systems (where multiple computers train a shared model without sharing raw data). The attack works by adding tiny pixel-level changes to input samples so the model misclassifies them as something else, while appearing normal to the server and other participants, achieving much higher success rates than previous methods.
Researchers discovered a new attack called federated unlearning inversion attack (FUIA) that can extract private data from federated unlearning (FU, a process designed to remove a specific person's data influence from shared machine learning models across multiple computers). The attack works by having a malicious server observe the model's parameter changes during the unlearning process and reconstruct the forgotten data, undermining the privacy protection that FU is supposed to provide.
Fix: The source mentions that 'two potential defense strategies that introduce a trade-off between privacy protection and model performance' were explored, but no specific details, names, or implementations of these defense strategies are provided in the text.
IEEE Xplore (Security & AI Journals)