GLM-5.2 Million-Context Test: It Built an 85-Page World Cup Preview That Really Worked

When I set out to test GLM-5.2’s million-token context window, I expected it to crumble under the sheer volume of information. Instead, it built an entire 85-page World Cup preview from scratch, complete with match analysis, flag icons, and group-stage predictions. The result was more coherent than most human-made reports I’ve seen.

The timing was perfect. Just hours before I ran the test, Anthropic had received a letter from the U.S. Department of Commerce demanding a halt to all foreign access to its Fable 5 and Mythos 5 models—citing national security. Anthropic responded by shutting down both models entirely, even for American users. It felt like a door slamming shut. Then, almost on cue, Zhipu AI released GLM-5.2 with a public announcement: “Frontier intelligence should not belong only to a few, nor should it be revoked by a few rules at any time.” The model was open-sourced under the MIT license the following week.

I wanted to see if this Chinese model could handle a genuinely massive, multi-step task. Asking an AI to predict match scores is trivial—any model can spit out a number. But building a structured, consistent, and fact-checked preview of 72 group-stage matches plus group overviews and a cover page is a different beast. The 2026 World Cup expansion from 32 to 48 teams meant 12 groups of 4, totalling 72 matches—not the 48 that most models would assume from training data.
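The arithmetic behind those figures is easy to check. This is a minimal sketch, assuming a single round-robin inside each group of four; the function name is my own for illustration.

```python
from math import comb

def group_stage_matches(groups: int, teams_per_group: int = 4) -> int:
    """Single round-robin: every pair of teams in a group meets exactly once."""
    return groups * comb(teams_per_group, 2)

old_format = group_stage_matches(groups=8)    # 32 teams in 8 groups
new_format = group_stage_matches(groups=12)   # 48 teams in 12 groups

# Page budget: one slide per match, one overview per group, one cover.
pages = new_format + 12 + 1

print(old_format, new_format, pages)  # 48 72 85
```

Each group of four yields 6 pairings, so 8 groups give the familiar 48 matches while 12 groups give 72.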

GLM-5.2 stopped itself midway through the task. It noticed the discrepancy: “The user mentioned 48 matches, but the 2026 World Cup has changed to 48 teams, 12 groups, each with 6 matches; that’s 72. I cannot trust my memory on this; I need to verify from authoritative sources.” Then it did exactly that, cross-checking FIFA’s official site, ESPN, and Wikipedia for group assignments and the real scores of matches already played. This kind of self-correction is rare in language models. Many would blindly follow the outdated 48-match figure and produce completely wrong output. The ability to doubt one’s own knowledge, and then act on that doubt, is the true marker of intelligence.
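I don’t know how the model implemented this check internally, but the pattern itself is simple to sketch: treat the figure in the prompt as a claim, and only accept a value that enough independent sources agree on. Everything below is hypothetical, including the `reconcile` helper and the source labels, and the numbers are placeholders typed in by hand rather than fetched from anywhere.

```python
from collections import Counter

def reconcile(claimed: int, sources: dict[str, int], min_agree: int = 2) -> tuple[int, bool]:
    """Return (value_to_use, was_corrected).

    Accept the most commonly reported value only if at least `min_agree`
    sources report it; otherwise refuse to guess.
    """
    if not sources:
        raise ValueError("no sources to verify against")
    value, votes = Counter(sources.values()).most_common(1)[0]
    if votes < min_agree:
        raise ValueError("sources disagree; cannot verify")
    return value, value != claimed

# Placeholder readings, not live data.
value, corrected = reconcile(48, {"fifa": 72, "espn": 72, "wikipedia": 72})
print(value, corrected)  # 72 True
```

The useful property is that disagreement raises an error instead of silently falling back to the remembered number.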

After confirming the correct match count, I escalated the task. I wanted not just 72 individual match slides, but also a group overview page for each of the 12 groups, plus a cover. That made it an 85-page project. Building this page by page would inevitably break—either the model would run out of context or the style would drift across pages.

GLM-5.2 took a different approach. It didn’t start writing the first page immediately. Instead, it first ran two custom skills I had attached. One, called “freud-skill,” prepared its cognitive state by anchoring its identity as a hybrid “sports broadcast director + tactical analyst.” The other, “huashu-design,” established a full design system upfront: fonts, colors, card layouts, icon placements. Only then did it divide the work into a five-layer pipeline. Its main stages were a unified data source feeding 12 sub-agents that researched each group in parallel, a batch rendering step that applied the same template to every page, and a final aggregation into a wall-of-overview.
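To make the fan-out concrete, here is a rough sketch of the stages named above: a shared data source, one worker per group, a single render function, and an aggregation step. It is not the model’s actual code. The group labels A to L, the stubbed `research_group` function, and the plain-text page format are all my assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from string import ascii_uppercase

GROUPS = list(ascii_uppercase[:12])  # "A".."L"; the labels are an assumption

def research_group(group: str, shared: dict) -> dict:
    """Hypothetical sub-agent. A real one would call the model; this stub
    only reads the shared data source and returns structured content."""
    return {"group": group, "teams": shared.get(group, [])}

def render_page(content: dict) -> str:
    """One template for every page, so the layout cannot drift."""
    teams = ", ".join(content["teams"]) or "TBD"
    return f"Group {content['group']}: {teams}"

def run_pipeline(shared: dict) -> dict:
    # Fan out: one worker per group. map() yields results in input order.
    with ThreadPoolExecutor(max_workers=len(GROUPS)) as pool:
        contents = list(pool.map(lambda g: research_group(g, shared), GROUPS))
    pages = [render_page(c) for c in contents]              # batch render
    return {"pages": pages, "overview": "\n".join(pages)}   # aggregate

result = run_pipeline({"A": ["Team 1", "Team 2"]})
print(result["pages"][0])  # Group A: Team 1, Team 2
```

Because every worker reads from the same `shared` dictionary and every page goes through the same `render_page`, the parallel step cannot introduce conflicting facts or divergent layouts.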

Crucially, it instructed the sub-agents to output only structured content in JSON, not HTML. Separating content from style is the most reliable way to lock in consistency across 85 pages, and it is a discipline many human teams still struggle with. The sub-agents explored each group’s dynamics (key players, recent form, tactical vulnerabilities) and wrote insightful one-liners like “Their defensive line is old, and they’ve won only one game against top-20 teams in the last two years.” Then the rendering engine turned that content into polished slides.
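A small sketch shows why this split works. The sub-agent emits JSON, and one shared template owns all the markup. The field names, the template, and the sample values below are hypothetical placeholders of mine, not what GLM-5.2 actually produced.

```python
import json
from html import escape

# Hypothetical schema for one match slide.
REQUIRED = ("home", "away", "venue", "insight", "predicted_score")

SLIDE_TEMPLATE = (
    '<section class="match">'
    "<h2>{home} vs {away}</h2>"
    '<p class="venue">{venue}</p>'
    '<p class="insight">{insight}</p>'
    '<p class="score">{predicted_score}</p>'
    "</section>"
)

def render_slide(raw_json: str) -> str:
    """Validate a sub-agent's JSON, then pour it into the one shared template."""
    data = json.loads(raw_json)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Escape every value so content can never inject its own markup.
    safe = {key: escape(str(data[key])) for key in REQUIRED}
    return SLIDE_TEMPLATE.format(**safe)

sample = json.dumps({
    "home": "Team A", "away": "Team B", "venue": "Stadium X",
    "insight": "Placeholder tactical note", "predicted_score": "1-0",
})
print(render_slide(sample))
```

Since sub-agents never write HTML, a style change means editing one template rather than regenerating 85 pages, and a malformed payload fails loudly before it reaches the renderer.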

The final output was not perfect. Some match predictions were overly conservative, leaning too heavily toward the favorites, and a few group overviews lacked specific stats (e.g., head-to-head records). But the structure held together. Every slide had a consistent layout: team flags, match time and venue, core players, a tactical insight, and a predicted score. For matches already played, it used the real result. I could scroll through the entire document without noticing a single style break. That is remarkable for an 85-page document generated in a single AI session.

One may argue that relying on a million-token context is overkill for this task—that a modular approach with multiple shorter prompts could achieve similar results. That’s true, but it misses the point. The test was not about whether the task could be done with other methods, but whether GLM-5.2 could handle the complexity in a single, unbroken chain of reasoning. When you give it a messy, evolving requirement, it doesn’t fracture. It pauses, corrects, and builds.

There is a broader implication here. The industry is obsessed with raw benchmark scores and synthetic test sets. But the ability to autonomously detect anomalies in instructions, verify facts against external sources, and enforce design consistency across dozens of pages is far more valuable for real-world deployment. A model that can correct its own assumptions is one step closer to being a reliable agent, not just a better parrot.

If you’re building an AI pipeline that demands long-term coherence and structured generation—like a research report, a product catalog, or a multi-section analysis—GLM-5.2’s million-token context is worth exploring. But don’t take my word for it. Try giving it a genuinely ambiguous task with multiple contradictions and see how it responds. That’s where the real test lies.

The best models don’t just remember more. They know when to doubt what they remember.