AI benchmarks are broken. Here’s what we need instead.
Summary
Current AI benchmarks (standardized tests that measure AI performance) evaluate AI systems in isolation against human performance on specific tasks. This does not reflect how AI is actually used in organizations, where it works within teams and workflows over extended periods. The mismatch leads organizations to adopt AI systems with impressive benchmark scores that then underperform in real-world deployment; for example, FDA-approved radiology AI can create delays when integrated into hospital workflows that involve multiple specialists and decisions that evolve over time.
Solution / Mitigation
The source proposes shifting from narrow benchmarks to HAIC benchmarks (Human-AI, Context-Specific Evaluation), which assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. The source provides no implementation details, technical specifications, or concrete steps for putting this approach into practice.
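Purely as an illustration of the contrast with single-score benchmarks, a minimal sketch of what a HAIC-style evaluation record might track follows. The source gives no specification, so every class, field, and metric name here is a hypothetical assumption, not the article's method.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical sketch only: the source describes HAIC evaluation at a
# conceptual level, so this data model and these metrics are assumptions.

@dataclass
class WorkflowEvent:
    """One decision point in a deployed human-AI workflow."""
    day: int                    # day of deployment (the longitudinal axis)
    ai_suggestion: str          # what the AI recommended
    human_decision: str         # what the team actually decided
    turnaround_minutes: float   # end-to-end decision time, not model latency

@dataclass
class HAICRecord:
    """Evaluation record for one team and context over a time horizon."""
    context: str                # e.g. "radiology triage, Hospital A"
    events: list[WorkflowEvent] = field(default_factory=list)

    def agreement_rate(self) -> float:
        """Share of events where the team adopted the AI suggestion."""
        return mean(e.ai_suggestion == e.human_decision for e in self.events)

    def mean_turnaround(self) -> float:
        """Average end-to-end decision time: the workflow-level cost a
        task-isolated benchmark score does not capture."""
        return mean(e.turnaround_minutes for e in self.events)

# Usage: one record per context, accumulated over weeks of deployment.
record = HAICRecord(context="radiology triage, Hospital A")
record.events.append(WorkflowEvent(day=1, ai_suggestion="urgent",
                                   human_decision="urgent",
                                   turnaround_minutes=42.0))
print(record.agreement_rate(), record.mean_turnaround())
```

The point of the sketch is the unit of evaluation: a deployed context observed over time, rather than a single task score measured in isolation.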
Classification
Original source: https://www.technologyreview.com/2026/03/31/1134833/ai-benchmarks-are-broken-heres-what-we-need-instead/
First tracked: March 31, 2026 at 02:00 PM
Classified by LLM (prompt v3) · confidence: 85%