Qwen3.7-Max: How Alibaba’s Latest Model Closes In on Global AI Giants with 35 Hours of Nonstop Coding

In a field crowded with hyperscalers and cutting-edge research labs, a single stat can shift the conversation. When Artificial Analysis, a widely respected third-party AI evaluation platform, posted its latest Intelligence Index ranking this week, one number stood out: Alibaba’s Qwen3.7-Max scored 56.6—almost 5 points higher than its predecessor, Qwen3.6-Max. The accompanying note was measured but telling: "Alibaba still trails OpenAI, Anthropic, and Google, but Qwen3.7-Max is the closest they have ever come to the frontier."

This is not just another incremental release. It lands at a moment when the entire industry is zeroing in on two critical capabilities: LLM-driven coding and long-horizon execution. While GPT-5.5, Opus 4.7, and DeepSeek V4 are all racing along these lines, Qwen3.7-Max took first place on 9 of 12 agent-oriented benchmarks. The secret sauce? A model that doesn’t just answer questions but takes commands and runs with them, for 35 hours straight, without stopping.

To understand why this matters, let’s look at the benchmarks. The 12-item evaluation suite, released by Alibaba’s Qwen team, leans heavily on agentic and reasoning benchmarks such as Terminal-Bench, SWE-bench Pro, SWE-bench Multilingual, MCP-Atlas, MCP-Mark, HLE, Apex Math, IFBench, and SuperGPQA. These are not memory tests; they measure how well a model follows instructions, uses tools, and iterates on problems. Qwen3.7-Max topped 9 of them. The three it lost (NL2Repo, ClawEval, and CoWorkBench) are real-world collaboration scenarios where Anthropic’s Opus 4.6 still holds the lead, by margins of 0.4, 5.2, and 1.0 points respectively. It’s a gap, but a shrinking one.

One score deserves special attention: IFBench instruction following at 79.1, the highest among the models compared. In user terms, this means that if you give the model a prompt with five constraints, it is more likely than its peers to satisfy all five rather than quietly dropping one. That is exactly the kind of reliability developers need when building agents.
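To make "follows all five constraints" concrete, here is a minimal sketch of strict, all-or-nothing constraint checking. It is not IFBench’s actual scoring code; the constraint names, the checks, and the sample response are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict

# A constraint is any function that inspects a response and returns True/False.
Constraint = Callable[[str], bool]


def check_constraints(response: str, constraints: Dict[str, Constraint]) -> Dict[str, bool]:
    """Return a pass/fail verdict for each named constraint."""
    return {name: check(response) for name, check in constraints.items()}


def follows_all(response: str, constraints: Dict[str, Constraint]) -> bool:
    """Strict scoring: the response passes only if every constraint holds."""
    return all(check_constraints(response, constraints).values())


# Five hypothetical constraints a prompt might impose (illustrative only).
constraints: Dict[str, Constraint] = {
    "at most 30 words": lambda r: len(r.split()) <= 30,
    "mentions 'kernel'": lambda r: "kernel" in r.lower(),
    "no exclamation marks": lambda r: "!" not in r,
    "ends with a period": lambda r: r.rstrip().endswith("."),
    "exactly two sentences": lambda r: len(re.findall(r"[.?!](?:\s|$)", r)) == 2,
}

# Hypothetical model output used to exercise the checks.
response = "The kernel was tuned for the new chip. Latency dropped after the redesign."

if __name__ == "__main__":
    for name, ok in check_constraints(response, constraints).items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print("all constraints satisfied:", follows_all(response, constraints))
```

The point of the strict `follows_all` check is that a single dropped constraint fails the whole response, which is how an agent pipeline tends to experience instruction-following errors in practice.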

Take the 35-hour kernel optimization test. The Qwen team gave the model a real AI infrastructure engineer’s task: autonomously optimize an SGLang inference kernel on the T-Head Zhenwu M890 chip, a processor the model had never seen before. The input was minimal: a task description, an SGLang Triton reference implementation, and an evaluation script. The output? 35 hours of continuous, self-directed work, involving 432 kernel evaluations and 1,158 tool calls. The final result was a 10x geometric mean speedup compared to the baseline.
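For readers unfamiliar with the metric, a geometric mean speedup averages per-workload speedup ratios multiplicatively, so one outlier workload cannot dominate the headline number. The sketch below shows the arithmetic only; the timings are invented for illustration and are not the Qwen team’s measurements, and the function name is my own.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-workload speedups (baseline time / optimized time)."""
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need two non-empty timing lists of equal length")
    # Average the logs of the ratios, then exponentiate: exp(mean(log(b / o))).
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))


# Hypothetical timings in milliseconds for three workloads (NOT real data).
baseline = [8.0, 20.0, 50.0]
optimized = [2.0, 2.0, 2.0]

if __name__ == "__main__":
    speedups = [b / o for b, o in zip(baseline, optimized)]  # 4x, 10x, 25x
    print("per-workload speedups:", speedups)
    print("arithmetic mean:", sum(speedups) / len(speedups))  # 13.0
    print("geometric mean :", round(geometric_mean_speedup(baseline, optimized), 6))  # about 10
```

In this made-up example the per-workload speedups are 4x, 10x, and 25x; their product is 1000, so the geometric mean is the cube root of 1000, or about 10x, while the arithmetic mean would report 13x.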

What’s even more telling is how other models behaved. No human interrupted them; their runs ended on their own once they went five consecutive rounds without making any tool calls, essentially giving up. Qwen3.7-Max didn’t. After 30 hours, it was still discovering new optimization points, including one critical architectural redesign. It performed two structural transitions autonomously. For context, those are the kind of deep, exploratory tasks that traditionally fall on a human AI infrastructure engineer’s plate.
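The "five consecutive rounds without any tool calls" rule is easy to picture as a harness-side stall detector. The following is a minimal sketch of that idea, assuming a simple round-based loop; the `Round` type, the `run_until_stalled` function, and the scripted model are hypothetical stand-ins, not the harness the Qwen team actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Round:
    """One model turn: free-form text plus any tool calls it issued (hypothetical shape)."""
    text: str
    tool_calls: List[str] = field(default_factory=list)


def run_until_stalled(
    next_round: Callable[[int], Round],
    max_idle_rounds: int = 5,
    max_rounds: int = 1000,
) -> int:
    """Run rounds until `max_idle_rounds` consecutive rounds make no tool calls.

    Returns the number of rounds executed (capped at `max_rounds`).
    """
    idle = 0
    for i in range(max_rounds):
        result = next_round(i)
        if result.tool_calls:
            idle = 0  # any tool call counts as progress and resets the counter
        else:
            idle += 1
            if idle >= max_idle_rounds:
                return i + 1  # the model has effectively given up
    return max_rounds


def scripted_model(i: int) -> Round:
    """Hypothetical model that works for three rounds, then only talks."""
    if i < 3:
        return Round(text="running benchmark", tool_calls=["run_eval"])
    return Round(text="I think we are done.")


if __name__ == "__main__":
    print("rounds executed:", run_until_stalled(scripted_model))
```

Under a rule like this, a model only keeps its run alive by continuing to act, which is why a 35-hour trajectory says something about persistence and not just about raw capability.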

A distinction is worth drawing here: being able to ace a benchmark and being able to explore kernel designs for 35 hours are two very different capabilities. The first is largely about pattern matching against a known answer set. The second is about trajectory learning, the model’s ability to walk through long, uncertain decision chains without losing direction. Qwen3.7-Max delivers on both.

But the most important insight, buried in Alibaba’s official blog post, is this: "The above evaluation scores come from a variety of agent harnesses. Qwen3.7-Max is not optimized for a single framework but performs stably under Claude Code, OpenClaw, Qwen Code, and custom frameworks alike." The blog’s cover image even includes mascots from Hermes Agent and other tools, hinting at broad compatibility.

This has huge engineering implications. Over the past six months, I’ve written extensively about Claude Code, Hermes Agent, and OpenClaw. Each harness has its own design philosophy: Claude Code uses tightly coupled tool use, OpenClaw leans toward a personal assistant paradigm, Hermes Agent is message-driven, and Qwen Code is lighter and more modular. Getting a single model to excel across such different harnesses means its tool-use capabilities aren’t tied to any specific pattern.

Before Qwen3.7, a common pain point for Chinese models was that they would perform well on isolated benchmarks but fall apart when swapped into a different harness. This is sometimes called the "benchmark-harness consistency problem": a model’s inability to generalize beyond the environment it was tuned in. Qwen3.7’s 12-benchmark, multi-harness evaluation is a direct answer to that issue.

For developers, the real value isn’t just a high score on a test—it’s a model you can drop into your existing workflow without spending weeks tuning.

In practice, this means Qwen3.7-Max is now a viable option for the same kind of autonomous coding tasks that previously required a dedicated Claude or GPT setup. I ran a few real-world tests myself: it handled multi-step debugging, dependency resolution, and even generated unit tests for a production-grade Python library without hesitation. I was never once forced to restart the conversation or re-explain the context—a sign of genuine long-context reliability.

As the industry moves toward agentic coding, the ability to handle long trajectories and tool diversity becomes the new scoreboard. Qwen3.7-Max stakes a claim on both. It is, without question, the strongest Chinese model yet to appear on the global stage. And for anyone building the next generation of autonomous AI tools, it just became a name to watch.

True progress in AI isn’t about beating a benchmark; it’s about building a model that doesn’t stop when the task gets hard.