June 14, 2026 | Research | Infrastructure | Agents

MiniMax Sparse Attention: how M3 reads a million tokens

MiniMax has published the attention architecture behind its M3 model, and it's the cleanest answer yet to a problem every agent builder hits. The cost of softmax attention scales quadratically with sequence length, so a million-token context is unaffordable at deployment scale. Yet agentic workflows, repo-scale code reasoning, and persistent memory all want the model to attend over hundreds of thousands to millions of tokens at once. The math says no. MiniMax Sparse Attention says maybe.
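To put a number on "quadratic", here is a back-of-the-envelope sketch. It only counts query-key score pairs in dense attention, per head and per layer; the function name is made up for illustration, and it says nothing about M3's real FLOP budget.

```python
def attention_score_pairs(n_tokens: int, causal: bool = True) -> int:
    """Count query-key pairs that dense attention scores for one head in one layer.

    Illustrative helper (hypothetical name). With a causal mask, token i
    attends to tokens 0..i, so the total is n * (n + 1) / 2; without a mask
    it is n * n.
    """
    if causal:
        return n_tokens * (n_tokens + 1) // 2
    return n_tokens * n_tokens


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_score_pairs(n):,} score pairs")

# One million tokens with a causal mask: 500,000,500,000 pairs per head per layer.
```

Growing the context 10x grows the pair count roughly 100x, which is the wall sparse attention is trying to get around.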

The trick, said plainly: instead of attending to everything, score the context in blocks and attend only to the blocks that matter. MSA sits on top of grouped-query attention and adds a lightweight index branch that scores key-value blocks and picks a top-k subset independently for each query group. A main branch then runs exact attention over just those selected blocks. The retrieval is group-specific and sparse, but execution stays block-level, so the GPU stays happy.
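The two-branch idea can be sketched in a few lines of NumPy. This is a toy illustration of the description above, not MiniMax's implementation: it handles a single query token (one decode step), skips causal masking, and assumes the index branch scores a block by mean-pooling separate low-dimensional index keys. The function and argument names (`block_sparse_gqa`, `idx_q`, `idx_k`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def block_sparse_gqa(q, k, v, idx_q, idx_k, block_size, top_k):
    """Toy block-sparse attention for one query token under grouped-query attention.

    q:     (G, Hg, D)  G query groups (one per KV head), Hg query heads per group
    k, v:  (G, T, D)   keys and values for each KV head; T divisible by block_size
    idx_q: (G, Di)     index-branch query per group (hypothetical projection)
    idx_k: (G, T, Di)  index-branch keys (hypothetical projection)
    Returns the attention output (G, Hg, D) and the chosen block ids (G, top_k).
    """
    G, T, D = k.shape
    n_blocks = T // block_size

    # Index branch: one score per KV block per group.
    # Assumption: a block is summarized by the mean of its index keys.
    blk_keys = idx_k.reshape(G, n_blocks, block_size, -1).mean(axis=2)
    blk_scores = np.einsum("gd,gbd->gb", idx_q, blk_keys)

    # Each group independently keeps its own top-k blocks.
    chosen = np.argsort(-blk_scores, axis=1)[:, :top_k]
    chosen = np.sort(chosen, axis=1)  # restore positional order

    # Main branch: exact softmax attention, restricted to the chosen blocks.
    kb = k.reshape(G, n_blocks, block_size, D)
    vb = v.reshape(G, n_blocks, block_size, D)
    out = np.empty_like(q)
    for g in range(G):
        ks = kb[g, chosen[g]].reshape(-1, D)  # (top_k * block_size, D)
        vs = vb[g, chosen[g]].reshape(-1, D)
        scores = q[g] @ ks.T / np.sqrt(D)     # (Hg, top_k * block_size)
        out[g] = softmax(scores) @ vs
    return out, chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G, Hg, T, D, Di = 2, 4, 64, 16, 8
    q = rng.normal(size=(G, Hg, D))
    k = rng.normal(size=(G, T, D))
    v = rng.normal(size=(G, T, D))
    idx_q = rng.normal(size=(G, Di))
    idx_k = rng.normal(size=(G, T, Di))
    out, chosen = block_sparse_gqa(q, k, v, idx_q, idx_k, block_size=8, top_k=2)
    print(out.shape, chosen)
```

Because every head in a group shares the same selected blocks, the main branch reads contiguous key-value tiles rather than scattered tokens, which is the block-level execution the design leans on. With `top_k` set to the total number of blocks, the sketch reduces to ordinary dense attention.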

The payoff comes in hard numbers. A 109B model attends to a full million tokens with per-token attention compute cut by a factor of 28.4. Custom CUDA kernels deliver 14.2x faster prefill and 7.6x faster decoding on H800s. And the design is deliberately simple, so it ports across a range of GPUs rather than needing exotic hardware.

This is the unglamorous infrastructure that makes the agent dreams actually run. Everyone wants agents with long memory and whole-codebase awareness, and nobody wants to pay for quadratic attention to get there. MSA, sitting at the number two spot on Hugging Face's paper board, is a bet that the road to long-context agents runs through smarter sparsity, not just bigger machines. Paper at arxiv.org/abs/2606.13392.