Sparse Attention Showdown: MSA, NSA, MoBA – Who Wins the Tripartite Struggle?

The race to make large language models handle millions of tokens without breaking the bank has a new contender. Last week, MiniMax unveiled M3, its latest flagship model, touting three headline features: dramatically improved coding ability, native multimodality, and, most interesting from a technical perspective, a novel sparse attention mechanism they call MiniMax Sparse Attention (MSA). The accompanying 30-page paper, which I read cover to cover, reveals that MSA is not just another incremental tweak—it’s a carefully engineered two-stage search-and-focus architecture that could reshape how we think about long-context inference.

To appreciate why MSA matters, you need to understand the fundamental bottleneck all long-context models face. Standard full attention requires the model to look back at every previous token every time it generates a new one. Each new token therefore costs O(n) work, and the total over a whole sequence grows as O(n²). With a 1-million-token context window, that quadratic cost becomes astronomically expensive in compute, while the KV cache that stores every past token’s keys and values keeps growing linearly in memory. Sparse attention offers a simple, radical escape: instead of attending to everything, the model learns to attend only to the most relevant chunks.
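To put rough numbers on that bottleneck, here is a minimal Python sketch. The function names and the model shape used for the KV-cache estimate (60 layers, 8 key-value heads, head size 128, 2 bytes per value) are illustrative assumptions, not figures from the MiniMax paper.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored while generating n tokens with causal full attention."""
    # Token t (1-indexed) looks at itself plus the t - 1 tokens before it,
    # so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


def kv_cache_bytes(n: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n tokens (linear in n)."""
    # The leading 2 counts one key vector and one value vector per token.
    return 2 * n * layers * kv_heads * head_dim * bytes_per_value


for n in (1_000, 100_000, 1_000_000):
    pairs = full_attention_pairs(n)
    # ASSUMPTION: hypothetical model shape, chosen only for illustration.
    cache_gb = kv_cache_bytes(n, layers=60, kv_heads=8, head_dim=128) / 1e9
    print(f"{n:>9,} tokens: {pairs:.2e} pairs per head per layer, "
          f"{cache_gb:.1f} GB of KV cache")
```

Multiplying the context by 1,000 multiplies the pair count by roughly 1,000,000, which is the gap sparse attention is trying to close.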

MiniMax’s path to this solution wasn’t a straight line. Their earlier MiniMax-01 and M1 models experimented with linear attention (Lightning Attention), aiming to bypass softmax entirely and achieve near-O(n) scaling. By M2, they had retreated to full attention, writing a lengthy explanation of the trade-offs. Now, with M3, they’ve settled on sparse attention as a pragmatic middle ground—less radical than linear attention, but far more efficient than full attention for extreme contexts.

What sets MSA apart from its peers, namely DeepSeek’s Native Sparse Attention (NSA) and DeepSeek Sparse Attention (DSA), and Kimi’s MoBA, is the simplicity of its design. MSA splits computation into two branches. The first, called the Index Branch, acts as a lightweight scorer. It divides the entire context into fixed 128-token chunks, scores each chunk using a single attention head with a simplified scoring mechanism (no expensive exponential in softmax, since only relative rankings matter), and picks exactly 16 chunks: the current position’s chunk plus the top 15 others. This "search" step is cheap because it scores with a single query head against one key head shared across all attention groups, whereas the main branch runs 64 heads.
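The selection step can be sketched in a few lines of plain Python. This is a toy illustration, not MiniMax’s implementation: the function name `select_chunks`, the raw dot product used as the score, and the choice to rate a chunk by its best-scoring token are all assumptions, since the article does not say how token scores are pooled per chunk.

```python
import random

CHUNK = 128   # tokens per chunk, as stated in the article
BUDGET = 16   # chunks kept per query: the current one plus the top 15


def select_chunks(query, keys, chunk=CHUNK, budget=BUDGET):
    """Return the sorted chunk indices one query would attend to.

    `query` is a list of floats; `keys` holds one index-key vector per
    token seen so far, with the current position last.
    """
    current = (len(keys) - 1) // chunk  # chunk containing the current position
    # Raw dot products only: ranking chunks does not need softmax's exponential.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # ASSUMPTION: a chunk's score is the maximum over its tokens.
    chunk_scores = {
        c: max(scores[c * chunk:(c + 1) * chunk])
        for c in range(current)  # only chunks strictly before the current one
    }
    top = sorted(chunk_scores, key=chunk_scores.get, reverse=True)[:budget - 1]
    # The current chunk is always kept, whatever its score.
    return sorted(top + [current])


random.seed(0)
dim = 8  # toy head dimension
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4096)]
query = [random.gauss(0, 1) for _ in range(dim)]
picked = select_chunks(query, keys)
print(f"{len(picked)} chunks kept out of {len(keys) // CHUNK}: {picked}")
```

When fewer than 16 chunks exist, the function simply returns all of them, so short contexts fall back to full attention.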

The second branch, the Main Branch, performs full expensive attention only on those 16 selected chunks—that’s 16 × 128 = 2,048 tokens, regardless of whether the context is 100,000 or 1,000,000. This fixed budget is the source of all savings. In practice, it’s like flipping through a book’s index to find relevant chapters before reading them in depth; but unlike a static index, MSA’s index is learned and rebuilt with every new token.
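As a rough illustration of that fixed budget, the sketch below runs ordinary softmax attention for one query, but only over the tokens of the chosen chunks. The function name `attend_selected`, the single toy head, and the hard-coded choice of "last 16 chunks" in the demo are assumptions for illustration; a real main branch would use many heads and take its chunk list from the index branch.

```python
import math
import random


def attend_selected(query, keys, values, chunk_ids, chunk=128):
    """Softmax attention for one query, restricted to the given chunks.

    Returns the output vector and the number of tokens actually attended.
    """
    n = len(keys)  # keys[-1] belongs to the current position
    token_ids = [t for c in chunk_ids
                 for t in range(c * chunk, min((c + 1) * chunk, n))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[t]))
              for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, len(token_ids)


random.seed(0)
dim = 8  # toy head dimension
for n in (4_096, 16_384):
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    query = [random.gauss(0, 1) for _ in range(dim)]
    # ASSUMPTION: pretend the index branch picked the last 16 chunks.
    chosen = list(range(n // 128 - 16, n // 128))
    _, attended = attend_selected(query, keys, values, chosen)
    print(f"context {n:>6,} tokens -> attended {attended:,}")
```

The attended count stays at 2,048 for both context lengths; only the cheap scoring step has to touch the full context.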

The tricky part is training the Index Branch. The top-k selection operation is non-differentiable, so the paper introduces a dedicated training recipe. During warm-up (the first 40 billion tokens), the model runs full attention while the Index Branch learns to mimic the main branch’s attention distribution via a KL divergence loss. Gradients from that loss are stopped from flowing back into the main network, so the index learns without disturbing it. Once sparse mode is enabled, the KL loss continues on the selected chunks, providing continuous supervision. The paper reports that the Index Branch consistently covers over 90% of the main branch’s attention weights, which suggests the cheap scoring is a faithful proxy.
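A small sketch makes the two quantities in this recipe concrete: the KL loss that pulls the index scores toward the main branch’s attention, and the coverage figure. Several things here are assumptions rather than details from the paper: the direction KL(main || index), the use of softmax to turn index scores into a distribution for the loss, matching at the token level, and the definition of coverage as the share of main-branch attention mass that falls inside the selected chunks. The code uses plain floats, so there are no gradients; in an autograd framework the target distribution would be detached to stop gradients reaching the main network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, pred) if t > 0)


def coverage(main_weights, selected, chunk):
    """Share of main-branch attention mass inside the selected chunks."""
    return sum(w for t, w in enumerate(main_weights) if t // chunk in selected)


chunk = 4  # toy chunk size so the example stays readable
main_logits = [0.1, 0.2, 3.0, 2.5, 0.0, 0.1, 0.3, 0.2, 1.9, 2.2, 0.1, 0.0]
index_logits = [0.0, 0.3, 2.0, 2.0, 0.1, 0.0, 0.2, 0.1, 1.5, 1.8, 0.2, 0.1]

target = softmax(main_logits)   # treated as a constant (no gradient) in training
pred = softmax(index_logits)    # what the index branch is trained to match
loss = kl_divergence(target, pred)
selected = {0, 2}               # hypothetical chunks picked by the index branch
print(f"KL loss {loss:.4f}, coverage {coverage(target, selected, chunk):.1%}")
```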

This two-stage design echoes a broader trend in AI efficiency: use a cheap, differentiable proxy to identify a small region of interest, then allocate expensive compute only there. We see similar patterns in neural architecture search and mixture-of-experts routing. MSA’s contribution is applying that pattern inside the attention mechanism itself, with the Index Branch as the cheap proxy and the Main Branch as the expensive step.

How does MSA stack up against its rivals? DeepSeek’s NSA, published in February 2025, also uses a coarse-to-fine strategy but with a different decomposition: it runs three parallel branches (compressed coarse-grained tokens, selected fine-grained blocks, and a sliding window over recent tokens) and blends their outputs with learned gates, leading to a more complex training setup. DeepSeek’s later DSA (V3.2) moves to token-level selection: a lightweight indexer scores every earlier token and each query attends only to the top-scoring ones, which gives finer granularity but means the indexer still has to touch every token. Kimi’s MoBA, contemporaneous with NSA, takes a routing approach borrowed from mixture-of-experts, treating each block of keys and values as an expert and selecting the top-k blocks per query. MoBA’s strength is its flexibility, but it requires careful tuning to avoid training instability.

In contrast, MSA’s fixed 16-chunk selection is simpler, more deterministic, and easier to integrate into existing Transformer architectures. The trade-off is the chunking granularity: a 128-token chunk is kept or dropped as a whole, so the budget can be spent on a chunk that holds a single relevant token, and evidence spread thinly across more than 16 chunks can be missed. However, the paper’s ablation studies show that 128-token chunks strike a near-optimal balance, and the inclusion of the current chunk ensures local coherence isn’t lost.

Sparse attention isn’t just about saving money; it’s about making long context economically viable for everyday applications.

This matters beyond academic benchmarks. A system that can cheaply process a full codebase, an entire legal document, or a year’s worth of chat history unlocks products that weren’t feasible before. For example, a customer service bot using MSA could handle a multi-hour conversation without hitting memory limits, while a code assistant could analyze an entire repository in one query.

Looking ahead, the competition among sparse attention designs will likely intensify. DeepSeek’s DSA is already production-hardened in V3.2, and Kimi’s MoBA is powering their product. MiniMax’s M3 now enters the fray with its own compelling narrative and a paper that’s refreshingly practical, with details ranging from training recipes to latency measurements. In the appendix, the paper even shows that MSA achieves perplexity nearly identical to that of full attention on long-context benchmarks while delivering up to 4× speedups during inference.

The real battle isn’t about who can make attention sparser, but who can make the sparsity intelligent enough that no one misses the full context.

As models flirt with million-token contexts, the ability to attend efficiently will become a key differentiator. MiniMax’s MSA may not be the last word, but it’s a strong entry that every AI architect should study. The next time you see a model claiming ultra-long context, ask not just how long it can go, but how it chooses what to see.