June 19, 2026 · Research · Benchmark

The leaderboards agents top don't predict the work they fail

A new IBM-led paper with 70-plus co-authors makes an uncomfortable argument: the benchmarks we use to rank AI agents mostly don't predict how they'll do in the real world. They call the missing property predictive validity: does winning the public leaderboard actually correlate with winning on data you haven't seen? Often, it doesn't.

The numbers are blunt. In a 149-team agent competition, the correlation between public and hidden-test rankings was rho = -0.13 on the execution track. Negative. Topping the public board was, if anything, slightly anti-correlated with being good on the hidden one. And the LLM-as-judge everyone leans on scored a Krippendorff's alpha of 0.61, well below the 0.74 to 0.82 humans hit on the same samples. The referee disagrees with itself more than people do.
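To make the rank-correlation number concrete, here is a minimal sketch of how a rho like that can be computed. It assumes the reported rho is Spearman's rank correlation, which this article does not confirm. The team scores are invented for illustration and are not from the paper, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / sqrt(vx * vy)


# Hypothetical scores for five teams (illustrative only, not from the paper).
public_scores = [0.91, 0.88, 0.85, 0.80, 0.72]
hidden_scores = [0.55, 0.61, 0.58, 0.63, 0.60]

rho = spearman_rho(public_scores, hidden_scores)
print(f"rho = {rho:.2f}")
```

A rho near +1 would mean the public board orders teams the same way the hidden test does; a value near zero or below, as in this toy data, means the public ordering tells you little or nothing about the hidden one.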

This lands right on top of Agents' Last Exam from two weeks ago, where the best agent passed only 26% of real economic tasks. Same wound, deeper cut: it's not just that agents fail real jobs, it's that the scoreboards we cheer for can't tell us which agents will. Aggregate scores collapse cost, latency, planning quality, and reasoning into one number, and the number lies under distribution shift.

Their fix is a twelve-tier framework plus required metadata (architecture, reasoning mode, retrieval strategy, verifier type), so a score is attributable instead of just impressive. It's a position paper anchored on the public AssetOpsBench, not a product. But if you're buying or shipping agents, it's the most useful thing you'll read this week, because it tells you the headline benchmark number is the least reliable part of the pitch. arXiv 2606.19704.
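One way to picture the metadata requirement is a score record that cannot be reported without its context. This is a rough sketch, not the paper's schema: the class names, the `missing_metadata` helper, and every field value are hypothetical, and only the four field names come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoreMetadata:
    # Field names follow the four items named in the article; the types are assumed.
    architecture: str
    reasoning_mode: str
    retrieval_strategy: str
    verifier_type: str


@dataclass(frozen=True)
class ReportedScore:
    benchmark: str
    score: float
    metadata: ScoreMetadata


def missing_metadata(meta):
    """Return the names of metadata fields left empty or whitespace-only."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name).strip()]


# Hypothetical entry: the score and all metadata values are placeholders.
entry = ReportedScore(
    benchmark="AssetOpsBench",
    score=0.0,
    metadata=ScoreMetadata(
        architecture="single-agent tool loop",
        reasoning_mode="",  # deliberately left blank
        retrieval_strategy="none",
        verifier_type="LLM-as-judge",
    ),
)

gaps = missing_metadata(entry.metadata)
print(gaps)
```

A leaderboard built this way could refuse any entry whose `gaps` list is non-empty, so every number arrives with the context needed to attribute it.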

https://arxiv.org/abs/2606.19704
← Previous
Builder.io wants the agent to be a real user of your app
Next β†’
Super User Daily: June 20, 2026
← Back to all articles

Comments

Loading...
>_