June 14, 2026 · Benchmark · Agents · Research

WeaveBench: can your agent survive a real workday?

Microsoft put out WeaveBench, and it goes after the gap that every flashy computer-use demo hides. Real work is long-horizon and the interfaces are messy. Most benchmarks hand an agent one clean app and a short task. WeaveBench drops the agent into authentic digital environments where it has to weave across hybrid interfaces, combining traditional GUIs with unconventional ones, on tasks that run far longer than ten-step toy demos.

That framing matters because computer-use is where the capability-versus-reality gap is widest. Models look brilliant in a controlled click-here-type-that demo and then fall apart the moment a task runs for an hour, spans five tools, and hits a UI nobody anticipated. WeaveBench is built to expose exactly that, and the long-horizon part is the whole point.

It lands right next to a run of reality-check evals we've been tracking. Agents' Last Exam showed the best agent passing only a quarter of real economic tasks. EvoArena showed agents collapsing to 40% when the environment shifts mid-task. WeaveBench adds the long-horizon plus hybrid-interface dimension. The collective message from the research side is consistent and a little deflating: agents ace the tests built for them and stumble on anything resembling an actual job.

If you're building computer-use agents, this is the kind of eval worth scoring against precisely because it's designed to make you look bad. The demos already work. The workday is the hard part. Paper at arxiv.org/abs/2606.09426.