Research
Academic papers, new techniques, benchmarks, and theoretical findings in AI/LLM security.
This research paper introduces an automated black-box pipeline for detecting unverbalized biases in Large Language Models (LLMs) - biases that affect model decisions but are not mentioned in their chain-of-thought reasoning traces. The pipeline uses LLM autoraters to generate and test candidate bias concepts through statistical analysis, successfully discovering both previously unknown biases (e.g., Spanish fluency, writing formality) and validating manually-identified biases (gender, race, religion) across six LLMs on three decision tasks.
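The statistical core of such a pipeline can be illustrated with a small sketch: given matched prompts that differ only in a candidate concept, compare the model's decision rates and test whether the gap is significant. This is only a minimal illustration of the testing step, not the paper's pipeline; `query_model` and the accept/reject framing are hypothetical placeholders.

```python
# Minimal sketch of the candidate-concept testing step (not the paper's pipeline).
# `query_model` is a hypothetical stand-in for the black-box decision task.
from scipy.stats import fisher_exact

def bias_test(prompts_with, prompts_without, query_model, alpha=0.01):
    """Compare acceptance rates on matched prompts that differ only in the concept."""
    accept_with = sum(query_model(p) == "accept" for p in prompts_with)
    accept_without = sum(query_model(p) == "accept" for p in prompts_without)
    table = [
        [accept_with, len(prompts_with) - accept_with],
        [accept_without, len(prompts_without) - accept_without],
    ]
    _, p_value = fisher_exact(table)  # exact test on the 2x2 contingency table
    effect = accept_with / len(prompts_with) - accept_without / len(prompts_without)
    return {"effect": effect, "p_value": p_value, "flagged": p_value < alpha}
```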
Olaf-World is a pipeline that addresses the challenge of scaling action-controllable world models, which is limited by the scarcity of action labels. The method introduces Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that learns latent actions from unlabeled video by anchoring them to temporal feature differences, enabling better zero-shot action transfer and more data-efficient adaptation compared to existing baselines.
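A rough sketch of the anchoring idea, under the assumption that latent actions are regressed onto temporal differences of frozen encoder features; the module names and the cosine-alignment loss below are illustrative, not the paper's exact Seq$\Delta$-REPA objective.

```python
# Hedged sketch: align latent actions, inferred between consecutive frames,
# with the temporal difference of frozen encoder features.
import torch
import torch.nn.functional as F

def latent_action_alignment_loss(frame_feats, action_encoder):
    """frame_feats: (B, T, D) features from a frozen video encoder.
    action_encoder: hypothetical module mapping frame pairs (B, T-1, 2D) -> (B, T-1, D)."""
    deltas = frame_feats[:, 1:] - frame_feats[:, :-1]              # (B, T-1, D) control effects
    pairs = torch.cat([frame_feats[:, :-1], frame_feats[:, 1:]], dim=-1)
    latent_actions = action_encoder(pairs)                         # inferred latent actions per transition
    # encourage each latent action to point along the feature change it should explain
    return 1.0 - F.cosine_similarity(latent_actions, deltas, dim=-1).mean()
```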
This paper proposes FEXT-DP (Federated EXplainable Trees with Differential Privacy), a federated learning system that combines decision trees for explainability with differential privacy for stronger data protection. The research demonstrates that the system achieves faster training, improved Mean Squared Error, and better explainability, but that adding differential privacy introduces a trade-off by degrading that explainability.
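For intuition, one common way differential privacy enters tree training is by perturbing aggregated leaf statistics before they are shared; the sketch below shows that generic mechanism only and is not a description of FEXT-DP's actual protocol.

```python
# Generic DP-for-trees mechanism (illustrative; not the FEXT-DP protocol).
import numpy as np

def dp_leaf_value(residual_sum, count, sensitivity, epsilon, rng=None):
    """Differentially private leaf prediction for a regression tree:
    perturb the aggregated residual sum with Laplace noise before averaging."""
    rng = rng or np.random.default_rng()
    noisy_sum = residual_sum + rng.laplace(scale=sensitivity / epsilon)
    return noisy_sum / max(count, 1)
```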
This research identifies that standard diffusion transformers fail to converge on representation encoders due to "Geometric Interference", where Euclidean flow matching forces probability paths through low-density interior space rather than along the hyperspherical manifold surface. To address this, the authors propose Riemannian Flow Matching with Jacobi Regularization (RJF), which constrains the generative process to manifold geodesics and corrects for curvature-induced error propagation, allowing a standard DiT-B architecture (131M parameters) to converge without computationally expensive width scaling and to reach an FID of 3.37 where prior methods fail to converge.
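The geometric intuition (building flow-matching targets along spherical geodesics rather than straight Euclidean chords) can be sketched with a slerp-style interpolant; this shows only the geodesic path and its velocity target, not RJF's Jacobi regularization.

```python
# Sketch of a geodesic (slerp) flow-matching target on the unit hypersphere.
import torch
import torch.nn.functional as F

def geodesic_path_and_velocity(x0, x1, t, eps=1e-6):
    """x0, x1: (B, D) points projected to the unit hypersphere; t: (B, 1) in [0, 1].
    Returns the geodesic interpolant x_t and its velocity v_t (the regression target)."""
    x0, x1 = F.normalize(x0, dim=-1), F.normalize(x1, dim=-1)
    cos = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos)
    sin_theta = torch.sin(theta)
    x_t = (torch.sin((1 - t) * theta) * x0 + torch.sin(t * theta) * x1) / sin_theta
    v_t = theta * (-torch.cos((1 - t) * theta) * x0 + torch.cos(t * theta) * x1) / sin_theta
    return x_t, v_t
```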
This paper introduces Step-Decomposed Influence (SDI), a method for analyzing how individual training examples affect looped transformers at each recurrent iteration. SDI decomposes the TracIn influence estimation into a per-step trajectory by unrolling the recurrent computation graph, revealing when during the loop a training example matters. The authors propose a TensorSketch implementation to make SDI scalable without materializing per-example gradients.
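A heavily simplified proxy for the idea: measure, at every loop iteration, how aligned the training and test loss gradients are with respect to that iteration's hidden state. SDI proper decomposes TracIn over the unrolled parameter applications and uses TensorSketch for scalability; the function below is only a toy illustration and assumes matching input shapes.

```python
# Toy per-step influence proxy (not SDI's actual decomposition or sketching).
import torch

def per_step_alignment(looped_block, readout, x_train, y_train, x_test, y_test, steps, loss_fn):
    """Dot product between train and test loss gradients w.r.t. each iteration's hidden state."""
    def hidden_grads(x, y):
        h, hiddens = x, []
        for _ in range(steps):
            h = looped_block(h)        # same shared weights applied at every iteration
            h.retain_grad()            # keep gradients on this non-leaf hidden state
            hiddens.append(h)
        loss_fn(readout(hiddens[-1]), y).backward()
        return [s.grad.flatten() for s in hiddens]

    g_train = hidden_grads(x_train, y_train)
    g_test = hidden_grads(x_test, y_test)
    return [torch.dot(a, b).item() for a, b in zip(g_train, g_test)]
```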
This paper demonstrates that causal reasoning in video diffusion models can be separated from the iterative denoising process. The authors introduce Separable Causal Diffusion (SCD), a new architecture that decouples once-per-frame temporal reasoning (via a causal transformer encoder) from multi-step frame-wise rendering (via a lightweight diffusion decoder), improving throughput and per-frame latency while maintaining or exceeding generation quality.
Quantum-Audit is a benchmark of 2,700 questions designed to systematically evaluate language models' understanding of quantum computing concepts across core topics. Testing 26 models, the study found that top performers like Claude Opus 4.5 achieved 84% accuracy, exceeding the 74% expert average, though performance dropped notably on expert-written questions (a 12-point drop), advanced security topics (73% accuracy), and questions with false premises (below 66% accuracy).
This paper introduces Agent World Model (AWM), a fully synthetic environment generation pipeline that creates 1,000 code-driven, database-backed environments for training autonomous agents in multi-turn tool-use tasks. Each environment contains an average of 35 tools and provides more reliable state transitions than LLM-simulated environments. The authors demonstrate that reinforcement learning agents trained exclusively in these synthetic environments achieve strong out-of-distribution generalization across three benchmarks.
This paper introduces AnaBench, a large-scale benchmark with 63,178 instances across nine scientific domains for evaluating AI systems' ability to analyze complex scientific tables and figures. The authors propose Anagent, a multi-agent framework using four specialized agents (Planner, Expert, Solver, and Critic) with modular training strategies combining supervised finetuning and reinforcement learning, achieving improvements up to 13.43% in training-free settings and 42.12% with finetuning.
This paper presents CAPID, a context-aware PII detection system for question-answering systems that uses a fine-tuned small language model (SLM) to identify and classify PII while determining its contextual relevance to user queries. Unlike current approaches that redact all PII indiscriminately, CAPID preserves contextually relevant information to maintain response quality. The authors propose a synthetic data generation pipeline using LLMs to create training data and demonstrate that their fine-tuned SLM outperforms existing baselines on PII span detection, relevance determination, and type classification accuracy.
This paper introduces RLFR (Reinforcement Learning from Feature Rewards), a method that uses learned features from language models as reward functions for reinforcement learning to address open-ended tasks like hallucination reduction. The approach uses a probing framework to identify potentially hallucinated claims and trains the model to intervene and correct its outputs when uncertain about factuality. When applied to Gemma-3-12B-IT, the method resulted in a policy that is 58% less likely to hallucinate while maintaining performance on standard benchmarks.
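One way to picture a feature-based reward (illustrative only; the paper's reward construction may differ): pool hidden states over each claim span, score them with a trained hallucination probe, and reward outputs whose claims the probe considers factual. The `probe` module here is an assumed, pre-trained linear classifier.

```python
# Illustrative feature-based reward (not the paper's implementation).
import torch

def probe_reward(hidden_states, claim_spans, probe):
    """hidden_states: (T, D) token features; claim_spans: list of (start, end) token indices.
    probe: assumed pre-trained module mapping a pooled feature to a hallucination logit."""
    rewards = []
    for start, end in claim_spans:
        span_feat = hidden_states[start:end].mean(dim=0)       # pool the claim's tokens
        p_hallucinated = torch.sigmoid(probe(span_feat))       # probe outputs a logit
        rewards.append(1.0 - p_hallucinated)
    # reward the output by how confidently factual its claims look to the probe
    return torch.stack(rewards).mean() if rewards else torch.tensor(1.0)
```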
This work introduces the Vendi Novelty Score (VNS), a new approach to out-of-distribution (OOD) detection based on diversity metrics rather than confidence scores or likelihood estimates. VNS quantifies how much a test sample increases the diversity of the in-distribution feature set using Vendi Scores, providing a non-parametric, linear-time method that achieves state-of-the-art OOD detection performance across multiple image classification benchmarks while requiring only 1% of training data.
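The underlying quantity is easy to state directly: the Vendi Score is the exponential of the entropy of the normalized kernel spectrum, and the novelty score is the increase when the test sample is appended. The naive eigendecomposition below only shows the definition; the linear-time computation claimed in the paper is not reproduced here.

```python
# Definition-level sketch of the Vendi Score and the novelty increment.
import numpy as np

def vendi_score(features):
    """Vendi Score: exponential of the entropy of the normalized kernel spectrum."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = x @ x.T / len(x)                        # cosine-similarity kernel, trace 1
    eigvals = np.clip(np.linalg.eigvalsh(k), 1e-12, None)
    return np.exp(-(eigvals * np.log(eigvals)).sum())

def vendi_novelty(in_dist_feats, test_feat):
    """How much one test sample increases the diversity of the in-distribution set."""
    augmented = np.vstack([in_dist_feats, test_feat[None]])
    return vendi_score(augmented) - vendi_score(in_dist_feats)
```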
This research paper evaluates disentangled representations in music generation models that claim to separate features like structure/timbre or local/global properties for controllable synthesis. Using a probing-based framework across multiple axes (informativeness, equivariance, invariance, and disentanglement), the study examines various unsupervised disentanglement strategies including inductive biases, data augmentations, adversarial objectives, and staged training. The findings reveal inconsistencies between intended and actual semantics of embeddings, indicating that current strategies fail to produce truly disentangled representations.
WildCat is a method for compressing the attention mechanism in neural networks that reduces the quadratic computational cost to near-linear time complexity O(n^(1+o(1))). It uses a randomly pivoted Cholesky subsampling algorithm to select a small weighted coreset for attention computation, achieving super-polynomial error decay O(n^(-√log(log(n)))) while maintaining high accuracy for tasks like image generation, image classification, and language model KV cache compression.
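Randomly pivoted Cholesky itself is a known low-rank approximation routine; a minimal version is sketched below for intuition. How WildCat turns the selected coreset into an attention approximation, and its error analysis, are beyond this sketch.

```python
# Minimal randomly pivoted Cholesky (RPCholesky) sketch for landmark selection.
import numpy as np

def rp_cholesky(kernel_column, diag, rank, rng=None):
    """kernel_column(i) returns column i of the implicit PSD kernel matrix;
    diag is its diagonal. Returns pivots and a factor F with K ~= F @ F.T."""
    rng = rng or np.random.default_rng()
    n = len(diag)
    d = np.asarray(diag, dtype=float).copy()
    F = np.zeros((n, rank))
    pivots = []
    for k in range(rank):
        i = rng.choice(n, p=d / d.sum())                  # sample pivot proportional to residual diagonal
        col = kernel_column(i) - F[:, :k] @ F[i, :k]      # residual column at the pivot
        F[:, k] = col / np.sqrt(col[i])
        d = np.maximum(d - F[:, k] ** 2, 0.0)             # update residual diagonal
        pivots.append(i)
    return pivots, F
```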
This paper introduces Fine-grained Group Policy Optimization (FGO), a Reinforcement Learning algorithm designed to compress verbose Chain-of-Thought reasoning in Large Language Models by subdividing group responses and assigning weights based on length and entropy. FGO addresses limitations of Group Relative Policy Optimization (GRPO), specifically inefficient data utilization and entropy collapse, while maintaining performance across multiple reasoning benchmarks including MATH500, AIME24, AMC23, and Minerva.
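To make the weighting idea concrete (an assumption about the general flavor, not FGO's actual update rule): compute GRPO-style group-normalized advantages, then reweight each response using its length and token entropy.

```python
# Hedged sketch of length/entropy-weighted group advantages (not FGO's exact scheme).
import numpy as np

def weighted_group_advantages(rewards, lengths, entropies, alpha=0.5, beta=0.5):
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # GRPO-style normalization
    len_w = np.asarray(lengths, dtype=float) / (np.mean(lengths) + 1e-8)
    ent_w = np.asarray(entropies, dtype=float) / (np.mean(entropies) + 1e-8)
    weights = alpha * len_w + beta * ent_w                      # assumed weighting form
    return adv * weights
```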
This paper introduces a conformal prediction algorithm for instance segmentation that generates adaptive confidence sets with provable guarantees that at least one prediction has high Intersection-Over-Union (IoU) with the ground truth mask. The algorithm addresses the absence of principled uncertainty quantification in current instance segmentation models, which provide neither calibrated outputs nor guarantees about mask accuracy. The method was tested on agricultural field delineation, cell segmentation, and vehicle detection applications.
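A minimal version of the split-conformal calibration logic, assuming masks carry confidence scores and the target is "some retained mask has IoU at least tau": record, per calibration image, the best confidence among masks that truly match, and deploy a conformal quantile as the retention cutoff. The paper's score function and adaptive sets are more involved than this sketch.

```python
# Split-conformal calibration sketch for an IoU coverage guarantee (simplified).
import numpy as np

def calibrate_cutoff(cal_preds, alpha=0.1, tau=0.75):
    """cal_preds: per calibration image, a list of (confidence, iou_with_gt) pairs."""
    scores = []
    for preds in cal_preds:
        good = [conf for conf, iou in preds if iou >= tau]
        # nonconformity: negative of the best confidence among truly matching masks
        scores.append(-max(good) if good else np.inf)
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return -q   # keep every mask with confidence >= this cutoff at test time
```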
This paper introduces Optimistic World Models (OWMs), a framework for efficient exploration in model-based deep reinforcement learning that addresses challenges in sparse-reward environments. OWMs incorporate optimism directly into model learning through an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes, requiring no uncertainty estimates or constrained optimization. The approach is implemented in Optimistic DreamerV3 and Optimistic STORM, demonstrating improvements in sample efficiency and cumulative return over baseline methods.
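A sketch of the shape such a loss could take, assuming a latent dynamics model and a reward head as in Dreamer-style agents: add a bonus that nudges predicted next states toward those the reward head scores highly. The exact OWM objective and its weighting are not reproduced here.

```python
# Illustrative optimistic dynamics loss (assumed form, not the paper's exact objective).
import torch.nn.functional as F

def optimistic_dynamics_loss(dynamics, reward_head, z, a, z_next, lam=0.1):
    z_pred = dynamics(z, a)
    recon = F.mse_loss(z_pred, z_next)          # standard one-step prediction loss
    optimism = reward_head(z_pred).mean()       # predicted reward of the imagined next state
    return recon - lam * optimism               # bias learning toward higher-reward transitions
```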
This paper investigates how binary autoencoders (bAE) improve black-box combinatorial optimization when combined with factorization machine with quantum annealing (FMQA). Using traveling salesman problems as a testbed, the authors demonstrate that bAE learns compact binary latent codes that better preserve neighborhood structure, reconstruct feasible solutions more accurately, produce smoother optimization landscapes with fewer local optima, and align tour distances with Hamming distances more effectively than manual binary encodings.
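One of these diagnostics is straightforward to sketch: compare pairwise Hamming distances between latent codes with gaps in the decoded tours' objective values (used here as a simple proxy for tour distance) via rank correlation.

```python
# Diagnostic sketch: does Hamming distance in code space track the objective landscape?
import numpy as np
from scipy.stats import spearmanr

def hamming_vs_objective(codes, tour_lengths):
    """codes: (N, L) 0/1 array of latent codes; tour_lengths: (N,) decoded tour lengths."""
    codes = np.asarray(codes)
    tour_lengths = np.asarray(tour_lengths, dtype=float)
    iu = np.triu_indices(len(codes), k=1)
    hamming = (codes[:, None, :] != codes[None, :, :]).sum(-1)[iu]
    obj_gap = np.abs(tour_lengths[:, None] - tour_lengths[None, :])[iu]
    corr, _ = spearmanr(hamming, obj_gap)
    return corr  # higher rank correlation suggests a smoother landscape over the code space
```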
This position paper argues that the traditional division between message-passing neural networks (MPNNs) and spectral graph neural networks (GNNs) is largely artificial and hinders progress in the field. The authors propose viewing both as different parametrizations of permutation-equivariant operators acting on graph signals, suggesting that many popular architectures are equivalent in expressive power while offering complementary strengths for analyzing different aspects of graph learning.
The RAGTIME track at TREC 2025 focuses on studying report generation from multilingual source documents, featuring a collection of Arabic, Chinese, English, and Russian news stories. The track includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR), with 125 runs submitted by 13 participating teams and track coordinators.