Research
Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
This research paper introduces an automated black-box pipeline for detecting unverbalized biases in Large Language Models (LLMs) - biases that affect model decisions but are not mentioned in their chain-of-thought reasoning traces. The pipeline uses LLM autoraters to generate and test candidate bias concepts through statistical analysis, successfully discovering both previously unknown biases (e.g., Spanish fluency, writing formality) and validating manually-identified biases (gender, race, religion) across six LLMs on three decision tasks.
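The statistical core of such a pipeline can be illustrated with a small sketch: given matched prompts that differ only in a candidate concept, compare the model's decision rates and test whether the gap is significant. This is only a minimal illustration of the testing step, not the paper's pipeline; `query_model` and the accept/reject framing are hypothetical placeholders.

```python
# Minimal sketch of the candidate-concept testing step (not the paper's pipeline).
# `query_model` is a hypothetical stand-in for the black-box decision task.
from scipy.stats import fisher_exact

def bias_test(prompts_with, prompts_without, query_model, alpha=0.01):
    """Compare acceptance rates on matched prompts that differ only in the concept."""
    accept_with = sum(query_model(p) == "accept" for p in prompts_with)
    accept_without = sum(query_model(p) == "accept" for p in prompts_without)
    table = [
        [accept_with, len(prompts_with) - accept_with],
        [accept_without, len(prompts_without) - accept_without],
    ]
    _, p_value = fisher_exact(table)  # exact test on the 2x2 contingency table
    effect = accept_with / len(prompts_with) - accept_without / len(prompts_without)
    return {"effect": effect, "p_value": p_value, "flagged": p_value < alpha}
```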
Olaf-World is a pipeline that addresses the challenge of scaling action-controllable world models, which is limited by the scarcity of action labels. The method introduces Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that learns latent actions from unlabeled video by anchoring them to temporal feature differences, enabling better zero-shot action transfer and more data-efficient adaptation compared to existing baselines.
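A rough sketch of the anchoring idea, under the assumption that latent actions are regressed onto temporal differences of frozen encoder features; the module names and the cosine-alignment loss below are illustrative, not the paper's exact Seq$\Delta$-REPA objective.

```python
# Hedged sketch: align latent actions, inferred between consecutive frames,
# with the temporal difference of frozen encoder features.
import torch
import torch.nn.functional as F

def latent_action_alignment_loss(frame_feats, action_encoder):
    """frame_feats: (B, T, D) features from a frozen video encoder.
    action_encoder: hypothetical module mapping frame pairs (B, T-1, 2D) -> (B, T-1, D)."""
    deltas = frame_feats[:, 1:] - frame_feats[:, :-1]              # (B, T-1, D) control effects
    pairs = torch.cat([frame_feats[:, :-1], frame_feats[:, 1:]], dim=-1)
    latent_actions = action_encoder(pairs)                         # inferred latent actions per transition
    # encourage each latent action to point along the feature change it should explain
    return 1.0 - F.cosine_similarity(latent_actions, deltas, dim=-1).mean()
```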
This paper proposes FEXT-DP (Federated EXplainable Trees with Differential Privacy), a federated learning system that combines decision trees for explainability with differential privacy for stronger data protection. The research demonstrates that the system achieves faster training, improved Mean Squared Error, and better explainability, but that adding differential privacy introduces a trade-off by degrading that explainability.
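For intuition, one common way differential privacy enters tree training is by perturbing aggregated leaf statistics before they are shared; the sketch below shows that generic mechanism only and is not a description of FEXT-DP's actual protocol.

```python
# Generic DP-for-trees mechanism (illustrative; not the FEXT-DP protocol).
import numpy as np

def dp_leaf_value(residual_sum, count, sensitivity, epsilon, rng=None):
    """Differentially private leaf prediction for a regression tree:
    perturb the aggregated residual sum with Laplace noise before averaging."""
    rng = rng or np.random.default_rng()
    noisy_sum = residual_sum + rng.laplace(scale=sensitivity / epsilon)
    return noisy_sum / max(count, 1)
```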
This research identifies that standard diffusion transformers fail to converge on representation encoders due to "Geometric Interference", where Euclidean flow matching forces probability paths through low-density interior space rather than along the hyperspherical manifold surface. To address this, the authors propose Riemannian Flow Matching with Jacobi Regularization (RJF), which constrains the generative process to manifold geodesics and corrects for curvature-induced error propagation, allowing a standard DiT-B architecture (131M parameters) to converge without computationally expensive width scaling and to reach an FID of 3.37 where prior methods fail to converge.
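The geometric intuition (building flow-matching targets along spherical geodesics rather than straight Euclidean chords) can be sketched with a slerp-style interpolant; this shows only the geodesic path and its velocity target, not RJF's Jacobi regularization.

```python
# Sketch of a geodesic (slerp) flow-matching target on the unit hypersphere.
import torch
import torch.nn.functional as F

def geodesic_path_and_velocity(x0, x1, t, eps=1e-6):
    """x0, x1: (B, D) points projected to the unit hypersphere; t: (B, 1) in [0, 1].
    Returns the geodesic interpolant x_t and its velocity v_t (the regression target)."""
    x0, x1 = F.normalize(x0, dim=-1), F.normalize(x1, dim=-1)
    cos = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    sin_theta = torch.sin(theta)
    x_t = (torch.sin((1 - t) * theta) * x0 + torch.sin(t * theta) * x1) / sin_theta
    v_t = theta * (-torch.cos((1 - t) * theta) * x0 + torch.cos(t * theta) * x1) / sin_theta
    return x_t, v_t
```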
This paper introduces Step-Decomposed Influence (SDI), a method for analyzing how individual training examples affect looped transformers at each recurrent iteration. SDI decomposes the TracIn influence estimation into a per-step trajectory by unrolling the recurrent computation graph, revealing when during the loop a training example matters. The authors propose a TensorSketch implementation to make SDI scalable without materializing per-example gradients.
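A heavily simplified proxy for the idea: measure, at every loop iteration, how aligned the training and test loss gradients are with respect to that iteration's hidden state. SDI proper decomposes TracIn over the unrolled parameter applications and uses TensorSketch for scalability; the function below is only a toy illustration and assumes matching input shapes.

```python
# Toy per-step influence proxy (not SDI's actual decomposition or sketching).
import torch

def per_step_alignment(looped_block, readout, x_train, y_train, x_test, y_test, steps, loss_fn):
    """Dot product between train and test loss gradients w.r.t. each iteration's hidden state."""
    def hidden_grads(x, y):
        h, hiddens = x, []
        for _ in range(steps):
            h = looped_block(h)        # same shared weights applied at every iteration
            h.retain_grad()            # keep gradients on this non-leaf hidden state
            hiddens.append(h)
        loss_fn(readout(hiddens[-1]), y).backward()
        return [s.grad.flatten() for s in hiddens]

    g_train = hidden_grads(x_train, y_train)
    g_test = hidden_grads(x_test, y_test)
    return [torch.dot(a, b).item() for a, b in zip(g_train, g_test)]
```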
This paper demonstrates that causal reasoning in video diffusion models can be separated from the iterative denoising process. The authors introduce Separable Causal Diffusion (SCD), a new architecture that decouples once-per-frame temporal reasoning (via a causal transformer encoder) from multi-step frame-wise rendering (via a lightweight diffusion decoder), improving throughput and per-frame latency while maintaining or exceeding generation quality.
Quantum-Audit is a benchmark of 2,700 questions designed to systematically evaluate language models' understanding of quantum computing concepts across core topics. Testing 26 models, the study found that top performers like Claude Opus 4.5 achieved 84% accuracy, exceeding the 74% expert average, though performance dropped notably on expert-written questions (a 12-point drop), advanced security topics (73% accuracy), and questions with false premises (below 66% accuracy).
This paper introduces Agent World Model (AWM), a fully synthetic environment generation pipeline that creates 1,000 code-driven, database-backed environments for training autonomous agents in multi-turn tool-use tasks. Each environment contains an average of 35 tools and provides more reliable state transitions than LLM-simulated environments. The authors demonstrate that reinforcement learning agents trained exclusively in these synthetic environments achieve strong out-of-distribution generalization across three benchmarks.
This paper introduces AnaBench, a large-scale benchmark with 63,178 instances across nine scientific domains for evaluating AI systems' ability to analyze complex scientific tables and figures. The authors propose Anagent, a multi-agent framework using four specialized agents (Planner, Expert, Solver, and Critic) with modular training strategies combining supervised finetuning and reinforcement learning, achieving improvements up to 13.43% in training-free settings and 42.12% with finetuning.
This paper presents CAPID, a context-aware PII detection system for question-answering systems that uses a fine-tuned small language model (SLM) to identify and classify PII while determining its contextual relevance to user queries. Unlike current approaches that redact all PII indiscriminately, CAPID preserves contextually relevant information to maintain response quality. The authors propose a synthetic data generation pipeline using LLMs to create training data and demonstrate that their fine-tuned SLM outperforms existing baselines on PII span detection, relevance determination, and type classification accuracy.
This paper introduces RLFR (Reinforcement Learning from Feature Rewards), a method that uses learned features from language models as reward functions for reinforcement learning to address open-ended tasks like hallucination reduction. The approach uses a probing framework to identify potentially hallucinated claims and trains the model to intervene and correct its outputs when uncertain about factuality. When applied to Gemma-3-12B-IT, the method resulted in a policy that is 58% less likely to hallucinate while maintaining performance on standard benchmarks.
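One way to picture a feature-based reward (illustrative only; the paper's reward construction may differ): pool hidden states over each claim span, score them with a trained hallucination probe, and reward outputs whose claims the probe considers factual. The `probe` module here is an assumed, pre-trained linear classifier.

```python
# Illustrative feature-based reward (not the paper's implementation).
import torch

def probe_reward(hidden_states, claim_spans, probe):
    """hidden_states: (T, D) token features; claim_spans: list of (start, end) token indices.
    probe: assumed pre-trained module mapping a pooled feature to a hallucination logit."""
    rewards = []
    for start, end in claim_spans:
        span_feat = hidden_states[start:end].mean(dim=0)       # pool the claim's tokens
        p_hallucinated = torch.sigmoid(probe(span_feat))       # probe outputs a logit
        rewards.append(1.0 - p_hallucinated)
    # reward the output by how confidently factual its claims look to the probe
    return torch.stack(rewards).mean() if rewards else torch.tensor(1.0)
```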
This work introduces the Vendi Novelty Score (VNS), a new approach to out-of-distribution (OOD) detection based on diversity metrics rather than confidence scores or likelihood estimates. VNS quantifies how much a test sample increases the diversity of the in-distribution feature set using Vendi Scores, providing a non-parametric, linear-time method that achieves state-of-the-art OOD detection performance across multiple image classification benchmarks while requiring only 1% of training data.
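The underlying quantity is easy to state directly: the Vendi Score is the exponential of the entropy of the normalized kernel spectrum, and the novelty score is the increase when the test sample is appended. The naive eigendecomposition below only shows the definition; the linear-time computation claimed in the paper is not reproduced here.

```python
# Definition-level sketch of the Vendi Score and the novelty increment.
import numpy as np

def vendi_score(features):
    """Vendi Score: exponential of the entropy of the normalized kernel spectrum."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = x @ x.T / len(x)                        # cosine-similarity kernel, trace 1
    eigvals = np.clip(np.linalg.eigvalsh(k), 1e-12, None)
    return np.exp(-(eigvals * np.log(eigvals)).sum())

def vendi_novelty(in_dist_feats, test_feat):
    """How much one test sample increases the diversity of the in-distribution set."""
    augmented = np.vstack([in_dist_feats, test_feat[None]])
    return vendi_score(augmented) - vendi_score(in_dist_feats)
```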
This research paper evaluates disentangled representations in music generation models that claim to separate features like structure/timbre or local/global properties for controllable synthesis. Using a probing-based framework across multiple axes (informativeness, equivariance, invariance, and disentanglement), the study examines various unsupervised disentanglement strategies including inductive biases, data augmentations, adversarial objectives, and staged training. The findings reveal inconsistencies between intended and actual semantics of embeddings, indicating that current strategies fail to produce truly disentangled representations.
WildCat is a method for compressing the attention mechanism in neural networks that reduces the quadratic computational cost to near-linear time complexity O(n^(1+o(1))). It uses a randomly pivoted Cholesky subsampling algorithm to select a small weighted coreset for attention computation, achieving super-polynomial error decay O(n^(-√log(log(n)))) while maintaining high accuracy for tasks like image generation, image classification, and language model KV cache compression.
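Randomly pivoted Cholesky itself is a known low-rank approximation routine; a minimal version is sketched below for intuition. How WildCat turns the selected coreset into an attention approximation, and its error analysis, are beyond this sketch.

```python
# Minimal randomly pivoted Cholesky (RPCholesky) sketch for landmark selection.
import numpy as np

def rp_cholesky(kernel_column, diag, rank, rng=None):
    """kernel_column(i) returns column i of the implicit PSD kernel matrix;
    diag is its diagonal. Returns pivots and a factor F with K ~= F @ F.T."""
    rng = rng or np.random.default_rng()
    n = len(diag)
    d = np.asarray(diag, dtype=float).copy()
    F = np.zeros((n, rank))
    pivots = []
    for k in range(rank):
        i = rng.choice(n, p=d / d.sum())                  # sample pivot proportional to residual diagonal
        col = kernel_column(i) - F[:, :k] @ F[i, :k]      # residual column at the pivot
        F[:, k] = col / np.sqrt(col[i])
        d = np.maximum(d - F[:, k] ** 2, 0.0)             # update residual diagonal
        pivots.append(i)
    return pivots, F
```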
This paper introduces Fine-grained Group Policy Optimization (FGO), a Reinforcement Learning algorithm designed to compress verbose Chain-of-Thought reasoning in Large Language Models by subdividing group responses and assigning weights based on length and entropy. FGO addresses limitations of Group Relative Policy Optimization (GRPO), specifically inefficient data utilization and entropy collapse, while maintaining performance across multiple reasoning benchmarks including MATH500, AIME24, AMC23, and Minerva.
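To make the weighting idea concrete (an assumption about the general flavor, not FGO's actual update rule): compute GRPO-style group-normalized advantages, then reweight each response using its length and token entropy.

```python
# Hedged sketch of length/entropy-weighted group advantages (not FGO's exact scheme).
import numpy as np

def weighted_group_advantages(rewards, lengths, entropies, alpha=0.5, beta=0.5):
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO-style normalization
    len_w = np.asarray(lengths, dtype=float) / (np.mean(lengths) + 1e-8)
    ent_w = np.asarray(entropies, dtype=float) / (np.mean(entropies) + 1e-8)
    weights = alpha * len_w + beta * ent_w                      # assumed weighting form
    return adv * weights
```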
This paper introduces a conformal prediction algorithm for instance segmentation that generates adaptive confidence sets with provable guarantees that at least one prediction has high Intersection-Over-Union (IoU) with the ground truth mask. The algorithm addresses the absence of principled uncertainty quantification in current instance segmentation models, which provide neither calibrated outputs nor guarantees about mask accuracy. The method was tested on agricultural field delineation, cell segmentation, and vehicle detection applications.
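A minimal version of the split-conformal calibration logic, assuming masks carry confidence scores and the target is "some retained mask has IoU at least tau": record, per calibration image, the best confidence among masks that truly match, and deploy a conformal quantile as the retention cutoff. The paper's score function and adaptive sets are more involved than this sketch.

```python
# Split-conformal calibration sketch for an IoU coverage guarantee (simplified).
import numpy as np

def calibrate_cutoff(cal_preds, alpha=0.1, tau=0.75):
    """cal_preds: per calibration image, a list of (confidence, iou_with_gt) pairs."""
    scores = []
    for preds in cal_preds:
        good = [conf for conf, iou in preds if iou >= tau]
        # nonconformity: negative of the best confidence among truly matching masks
        scores.append(-max(good) if good else np.inf)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return -q   # keep every mask with confidence >= this cutoff at test time
```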
This paper introduces Optimistic World Models (OWMs), a framework for efficient exploration in model-based deep reinforcement learning that addresses challenges in sparse-reward environments. OWMs incorporate optimism directly into model learning through an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes, requiring no uncertainty estimates or constrained optimization. The approach is implemented in Optimistic DreamerV3 and Optimistic STORM, demonstrating improvements in sample efficiency and cumulative return over baseline methods.
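A sketch of the shape such a loss could take, assuming a latent dynamics model and a reward head as in Dreamer-style agents: add a bonus that nudges predicted next states toward those the reward head scores highly. The exact OWM objective and its weighting are not reproduced here.

```python
# Illustrative optimistic dynamics loss (assumed form, not the paper's exact objective).
import torch.nn.functional as F

def optimistic_dynamics_loss(dynamics, reward_head, z, a, z_next, lam=0.1):
    z_pred = dynamics(z, a)
    recon = F.mse_loss(z_pred, z_next)          # standard one-step prediction loss
    optimism = reward_head(z_pred).mean()       # predicted reward of the imagined next state
    return recon - lam * optimism               # bias learning toward higher-reward transitions
```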
This paper investigates how binary autoencoders (bAE) improve black-box combinatorial optimization when combined with factorization machine with quantum annealing (FMQA). Using traveling salesman problems as a testbed, the authors demonstrate that bAE learns compact binary latent codes that better preserve neighborhood structure, reconstruct feasible solutions more accurately, produce smoother optimization landscapes with fewer local optima, and align tour distances with Hamming distances more effectively than manual binary encodings.
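One of these diagnostics is straightforward to sketch: compare pairwise Hamming distances between latent codes with gaps in the decoded tours' objective values (used here as a simple proxy for tour distance) via rank correlation.

```python
# Diagnostic sketch: does Hamming distance in code space track the objective landscape?
import numpy as np
from scipy.stats import spearmanr

def hamming_vs_objective(codes, tour_lengths):
    """codes: (N, L) 0/1 array of latent codes; tour_lengths: (N,) decoded tour lengths."""
    codes = np.asarray(codes)
    tour_lengths = np.asarray(tour_lengths, dtype=float)
    iu = np.triu_indices(len(codes), k=1)
    hamming = (codes[:, None, :] != codes[None, :, :]).sum(-1)[iu]
    obj_gap = np.abs(tour_lengths[:, None] - tour_lengths[None, :])[iu]
    corr, _ = spearmanr(hamming, obj_gap)
    return corr  # higher rank correlation suggests a smoother landscape over the code space
```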
This position paper argues that the traditional division between message-passing neural networks (MPNNs) and spectral graph neural networks (GNNs) is largely artificial and hinders progress in the field. The authors propose viewing both as different parametrizations of permutation-equivariant operators acting on graph signals, suggesting that many popular architectures are equivalent in expressive power while offering complementary strengths for analyzing different aspects of graph learning.
The RAGTIME track at TREC 2025 focuses on studying report generation from multilingual source documents, featuring a collection of Arabic, Chinese, English, and Russian news stories. The track includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR), with 125 runs submitted by 13 participating teams and track coordinators.