If you had peered into the research labs of the world’s top AI institutions in early May 2026, you would have spotted a curious coincidence. On May 7, ByteDance’s Seed team dropped a 99-page paper called Cola DLM. Four days later, He Kaiming’s group at MIT published a 32-page paper named ELF. One hails from the most aggressive industrial lab working on diffusion models in China; the other from a researcher whose work laid much of the foundation for modern deep learning. They started from opposite intuitions, yet landed on the same architectural insight: delay the moment of "word choice" until the very end of generation.
This convergence isn’t just a footnote in AI history. It signals a potential shift in how we build large language models, away from the strict token-by-token regime of autoregression and toward a more flexible, plan-ahead approach. To understand why, we first need to appreciate the people behind the papers. He Kaiming’s resume reads like a hall of fame: the ResNet paper, which introduced residual connections, has over 300,000 citations, and those residual connections sit inside every Transformer-based model you use today, from GPT to DeepSeek. After stints at MSRA and FAIR, he joined MIT in 2024 and now holds a joint position with Google DeepMind. ELF is his first language-model paper since moving to MIT, and it is telling that it attacks the same problem as ByteDance’s team.
The core problem both papers address is the fundamental weakness of autoregressive (AR) language models. AR generation writes with a permanent pen: once a token is emitted, it cannot be revised, and every later token is conditioned on it. This sequential, left-to-right constraint is often blamed for many of the hallucinations, reasoning failures, and brittle results these models show on complex tasks. The alternative is to think in a continuous space first, like sketching with a pencil, before committing to a final clean copy. Cola DLM and ELF achieve this by shifting "discretization" (the step that locks in a specific word) to the very last stage of the generation process. Inside the model, all reasoning happens in a continuous latent space, allowing the system to revise and refine the entire sentence structure before outputting a single token.
Why hasn’t this been done before? The idea of continuous diffusion for language isn’t new. Projects like Diffusion-LM (2022), SSD-LM, and Latent Diffusion LM tried to apply diffusion models, which excel at image generation, to text. But they consistently underperformed both AR models and discrete-token diffusion models. The prevailing view became that language is inherently discrete, so continuous diffusion simply doesn’t work. Both He Kaiming and ByteDance Seed challenged that belief. They identified the same bottleneck: previous models forced a "per-step discretization," where after each denoising step in continuous space, the representation was immediately mapped back to the vocabulary to compute a loss. This effectively turned the pencil back into a pen, preventing the model from truly iterating in continuous space.
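To make the pencil-versus-pen intuition concrete, here is a deliberately tiny sketch. It is not the method of either paper: the one-dimensional "embeddings", the fixed step size, and the `refine` rule are all made-up assumptions, and the snapping happens at generation time rather than inside a training loss. It only illustrates how snapping to the vocabulary after every step can erase small continuous refinements, while snapping once at the end lets them add up.

```python
# Hypothetical 1-D "embeddings" for a four-word vocabulary (illustration only).
VOCAB = {"cat": 0.0, "dog": 1.0, "bird": 2.0, "fish": 3.0}


def nearest_token(x):
    """Discretize: return the token whose embedding is closest to x."""
    return min(VOCAB, key=lambda tok: abs(VOCAB[tok] - x))


def refine(x, target, step=0.3):
    """One small continuous update toward `target`, without overshooting."""
    delta = target - x
    if abs(delta) <= step:
        return target
    return x + step if delta > 0 else x - step


def generate(x0, target, n_steps, snap_every_step):
    """Run n_steps refinements; optionally snap to the vocabulary after each one."""
    x = x0
    for _ in range(n_steps):
        x = refine(x, target)
        if snap_every_step:
            # Per-step discretization: the small update is rounded away.
            x = VOCAB[nearest_token(x)]
    # Delayed discretization: a single snap at the very end.
    return nearest_token(x)


per_step = generate(0.1, 3.0, n_steps=12, snap_every_step=True)
delayed = generate(0.1, 3.0, n_steps=12, snap_every_step=False)
print(per_step, delayed)
```

In this toy, the per-step variant never leaves its starting word, because each 0.3 nudge is smaller than the distance needed to cross into the next word's region. The delayed variant accumulates the same nudges and reaches the intended word.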
Cola DLM stands for Continuous Latent Diffusion Language Model. It embeds text into a continuous latent space, applies diffusion (adding and removing noise over many steps) entirely in that space, and only at the final step decodes the smooth cloud of meaning into a sequence of words. ELF (Embedded Language Flows) uses a sibling method called flow matching, which is mathematically related to diffusion but often allows faster and more direct generation. Both papers argue that delaying discretization is the key to making continuous generation work for language. They report performance that matches or beats comparably sized autoregressive models on measures such as perplexity and text generation quality, and more importantly, they show that the model can handle long-range dependencies and avoid the "error accumulation" that plagues AR models.
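The overall shape of flow-style generation in a latent space can be sketched in a few lines. What follows is a toy under loud assumptions, not ELF's or Cola DLM's implementation: the 2-D embeddings in `EMBED` are invented, and `oracle_velocity` cheats by knowing the target sentence, standing in for the trained network a real system would use. The point is only the control flow: every position starts as noise, all positions are updated together in continuous space, and `decode` is called exactly once at the end.

```python
import random

# Hypothetical 2-D token embeddings; a real model would learn these.
EMBED = {
    "the": (1.0, 0.0),
    "cat": (0.0, 1.0),
    "sat": (-1.0, 0.0),
    "down": (0.0, -1.0),
}


def decode(latents):
    """The single discretization step: map each latent to its nearest token."""
    def nearest(z):
        return min(
            EMBED,
            key=lambda w: (EMBED[w][0] - z[0]) ** 2 + (EMBED[w][1] - z[1]) ** 2,
        )
    return [nearest(z) for z in latents]


def oracle_velocity(x, t, target):
    """Velocity of the straight-line path from x to `target` at time t < 1.

    Assumption: this oracle replaces a learned velocity network.
    """
    return tuple((g - xi) / (1.0 - t) for xi, g in zip(x, target))


def sample(target_tokens, n_steps=8, seed=0):
    """Euler-integrate all positions from noise (t=0) to data (t=1), then decode."""
    rng = random.Random(seed)
    targets = [EMBED[w] for w in target_tokens]
    x = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in targets]  # pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        # Every position moves at once; nothing is committed to a word yet.
        x = [
            tuple(xi + dt * vi for xi, vi in zip(pos, oracle_velocity(pos, t, tgt)))
            for pos, tgt in zip(x, targets)
        ]
    return decode(x)


print(sample(["the", "cat", "sat", "down"]))
```

Because the oracle already knows the answer, this sketch says nothing about output quality. It only shows where the continuous updates happen and where the one-time decode sits.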
The implications are profound. If this line of research matures, we may see language models that can "think before they speak" at a much deeper level. This could reduce hallucinations, improve reasoning in multi-hop tasks, and allow models to generate text that is not merely coherent but globally consistent. Of course, there are trade-offs. Continuous generation can be more expensive than AR at inference time, because every denoising step reprocesses the whole sequence and many such steps are needed. But as hardware improves and flow-based methods (such as the flow matching used in ELF) cut the number of steps required, these costs may become acceptable. The fact that both an industrial powerhouse (ByteDance) and a legendary academic (He Kaiming) are betting on this approach suggests we are witnessing the opening of a new chapter in language model design. The river they stepped into might just become the mainstream.