World Cup AI Model Arena: Open-Source Project Pits 9 LLMs Against Each Other

A developer named Canghe recently built and open-sourced a website that turns the 2026 World Cup into a battleground for large language models. The project, called "World Cup AI Model Arena," lets nine leading AI models compete in soccer match predictions based on real-time FIFA data. While the personal story behind its creation is relatable—nostalgia for the 2022 Qatar World Cup and a desire to blend passion for football with technical experimentation—the tool itself raises interesting questions about how well AI can simulate sports outcomes, and what happens when you turn model evaluation into a public sport.

The arena features nine models: Claude Opus 4.8, ChatGPT 5.5, Grok-4.2, Gemini-3.5, Qwen3.7-Max, DeepSeek-V4-Pro, GLM-5.1, Kimi-K2.7-Code, and MiniMax-M3. These version numbers appear to be custom labels rather than official release identifiers; for example, there is no publicly known "ChatGPT 5.5" or "Claude Opus 4.8" at this time. Such labels are possible by design: the system lets users define any API endpoint and model name, making the arena fully customizable. The website displays fixture matchups, head-to-head model comparisons, and a live leaderboard that updates as predictions are checked against actual match results.

The technical backbone is straightforward. Canghe integrated FIFA’s official API to pull team rankings, recent form, goal statistics, and fixture schedules. Each model receives the same structured prompt containing this data and must output a prediction. The predictions are then scored for accuracy, creating a transparent benchmark. The entire project was built using a technique he calls "vibe coding"—rapid development assisted by Kimi’s latest coding model, Kimi-K2.7-Code, which was also included in the arena. He even embedded this model into his own open-source product, WeSight, so users can invoke it via Claude Code or directly through the Kimi API.
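The pipeline described above (same data in, one prediction out, scored against the result) can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the `Fixture` fields, the prompt wording, the three-way "home/draw/away" answer format, and the hit-rate scoring rule are all assumptions made for the example, and the call to each model's API is left out.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_OUTCOMES = {"home", "draw", "away"}  # assumed answer format


@dataclass(frozen=True)
class Fixture:
    """Hypothetical container for the match data fed to every model."""
    home: str
    away: str
    home_rank: int
    away_rank: int
    home_form: str  # e.g. "WWDLW", most recent result last
    away_form: str


def build_prompt(fx: Fixture) -> str:
    """Render one structured prompt so every model sees identical input."""
    payload = {
        "home": {"team": fx.home, "rank": fx.home_rank, "form": fx.home_form},
        "away": {"team": fx.away, "rank": fx.away_rank, "form": fx.away_form},
    }
    return (
        "Predict the match outcome from the data below. "
        'Answer with exactly one of: "home", "draw", "away".\n'
        + json.dumps(payload, indent=2)
    )


def parse_prediction(reply: str) -> Optional[str]:
    """Normalise a model reply; return None if it is not a valid outcome."""
    answer = reply.strip().strip('"').lower()
    return answer if answer in VALID_OUTCOMES else None


def leaderboard(predictions: dict, results: list) -> list:
    """Rank models by hit rate; unparseable replies (None) count as misses."""
    scores = {
        model: sum(p == r for p, r in zip(preds, results)) / len(results)
        for model, preds in predictions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Keeping prompt construction separate from parsing and scoring means a new model can be added by supplying only its replies, which fits the configurable-endpoint design the article describes.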

What makes this project valuable is not the accuracy of the predictions themselves: sports forecasting is inherently probabilistic and susceptible to randomness, human psychology, and in-game dynamics that no static data set can capture. Rather, it serves as a practical, entertaining sandbox for evaluating LLM reasoning under identical conditions. Unlike standardized benchmarks such as MMLU or HumanEval, which test factual knowledge or code generation, this arena requires models to weigh multiple variables (team strength, historical performance, tournament context) and produce a bounded risk assessment. The winner each match day isn’t just the model that guesses correctly, but the one that consistently balances confidence with accuracy.
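One common way to score "confidence balanced with accuracy" is the Brier score, which penalises a forecast by the squared distance between its stated probabilities and what actually happened. The article does not say which metric the arena uses, so the sketch below is an assumption for illustration; the function name and the three-outcome format are hypothetical.

```python
OUTCOMES = ("home", "draw", "away")  # assumed three-way outcome space


def brier_score(probs: dict, actual: str) -> float:
    """Multi-class Brier score for one match: 0.0 is perfect, 2.0 is worst."""
    if actual not in OUTCOMES:
        raise ValueError(f"unknown outcome: {actual!r}")
    total = sum(probs.get(o, 0.0) for o in OUTCOMES)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Squared error against the one-hot vector of the real result.
    return sum(
        (probs.get(o, 0.0) - (1.0 if o == actual else 0.0)) ** 2
        for o in OUTCOMES
    )


confident = {"home": 0.8, "draw": 0.1, "away": 0.1}
hedged = {"home": 0.4, "draw": 0.3, "away": 0.3}

print(brier_score(confident, "home"))  # confident and right: low penalty
print(brier_score(confident, "away"))  # confident and wrong: high penalty
print(brier_score(hedged, "away"))     # hedged and wrong: moderate penalty
```

Under this rule, a model that is overconfident on an upset loses more than one that hedged, so averaging the score over a match day rewards calibration rather than lucky guesses.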

Other AI prediction projects for sports exist, but most are closed-source or focused on a single model. For example, during the 2018 World Cup, researchers from the Technical University of Berlin used a combination of rule-based systems and random forests to predict winners, achieving about 66% accuracy in group stages. In contrast, this arena pits cutting-edge generative models against each other with real-time feedback and full transparency. The open-source nature allows anyone to fork the code, add new models, or change the sport entirely, turning it into a reusable framework for comparison.

The development process also highlights a growing trend: "vibe coding" or AI-assisted development where the developer acts as a director rather than a typist. Canghe claims the entire arena was built in a few days using Kimi-K2.7-Code. While the quality of the output depends heavily on the developer’s clarity in prompting and ability to debug edge cases, the speed is undeniable. This approach lowers the barrier for non-specialists to create functional, even sophisticated, web applications.

However, there are limitations. The model names themselves risk misleading users into thinking these are official versions from the respective companies. And the predictive accuracy of any LLM on sports outcomes remains an open question—these models are not trained with real-time sports analytics in mind, and their "reasoning" may simply be interpolating patterns from training data that includes historical sports results. A model that performed well on one match day could fail spectacularly the next due to an upset, random injury, or weather change.

Still, the World Cup AI Model Arena succeeds as a community experiment. It combines technical craftsmanship with a genuine love for the game, and it democratizes the evaluation of AI models. For developers curious about how different LLMs reason under pressure, or for football fans who just want to see if ChatGPT can predict a Brazil upset, this open-source project offers a playful yet rigorous testing ground. The GitHub repository is available in the original article’s comments, and as Canghe notes, a star on the repo is always appreciated.

The real winner isn’t the model that predicts the most correct outcomes—it’s the one that makes you question how much weight you should ever give to a machine’s opinion about a human game.