Alibaba's proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic's Claude Code
Our take

Alibaba's unveiling of Qwen3.7-Max marks a significant milestone in the AI landscape, heralding the true arrival of the "agent era." This new model's ability to autonomously execute tasks for approximately 35 hours without human intervention is not just a technical feat; it represents a transformative shift in how artificial intelligence can be utilized across various industries. As AI systems evolve from mere text generators to sophisticated agents capable of planning, executing, and refining complex tasks, the implications for businesses and developers are profound. The competitive dynamics are shifting, with Alibaba positioning itself as a formidable contender against established American giants like OpenAI and Google. As we explore these developments, it's essential to consider their broader significance within the context of AI evolution and user experience, especially when juxtaposed with ongoing narratives about technology, like those in articles such as Six search engines worth trying now that Google isn’t really Google anymore and Spotify and Universal Music strike deal allowing fan-made AI covers and remixes.
The technical prowess embedded in Qwen3.7-Max, particularly its ability to perform long-horizon reasoning and execute complex engineering tasks autonomously, demonstrates how far we’ve come in AI development. The model's design surpasses previous limitations often encountered by language models, such as memory degradation and logical loops during extended interactions. Instead, it was built to function as a "versatile agent foundation," capable of executing a staggering number of tool calls while continually refining its output. Such capabilities not only enhance productivity but also redefine the potential applications of AI in software development and enterprise automation. The fact that Qwen3.7-Max can be integrated into existing frameworks through its "cross-harness generalization" means it can seamlessly operate alongside various tools, fostering innovation while addressing the diverse needs of users.
However, the proprietary nature of Qwen3.7-Max raises questions about accessibility and the future of open-source AI. Historically, Alibaba's Qwen models have contributed significantly to the open-source community, allowing developers to run and adapt models on their own hardware. The shift to a strictly API-based model for Qwen3.7-Max could be seen as a retreat from the principles that have driven collaboration and innovation in the AI space. While this commercial strategy may be financially prudent for Alibaba, as it aligns with practices adopted by Western tech giants, it risks alienating users who value transparency and control over their AI solutions. This tension reflects a broader conversation within the tech community about the balance between corporate interests and the collaborative spirit that has fostered innovation in AI technologies.
Looking ahead, the emergence of Qwen3.7-Max prompts critical questions regarding the future landscape of AI. As we witness the continued evolution of AI capabilities, will there be a growing divide between proprietary and open-source solutions? How will this affect individual developers and smaller companies seeking to leverage advanced AI without compromising their data security or creative control? The industry is at a crossroads, facing the challenge of navigating these dynamics while fostering an environment that encourages exploration and innovation. Ultimately, the journey toward democratizing advanced AI technologies will likely continue to shape how we interact with and utilize these powerful tools in our daily lives. As Qwen3.7-Max exemplifies, the potential for AI to enhance user productivity and creativity is immense, yet the path forward requires thoughtful consideration of accessibility and ethical implications in the rapidly evolving technological landscape.
The AI industry has fully entered the "agent era," a paradigm where AI models do far more than generate text — they now actively plan, execute, and course-correct complex tasks over days rather than seconds.
Thus, it's perhaps unsurprising to see Chinese e-commerce giant Alibaba's famed Qwen Team of AI researchers release a model capable of performing autonomous agentic AI work over multiple days: that model has arrived in the form of Qwen3.7-Max which the company reports in a blog post achieved "~35 hours of continuous autonomous execution" — albeit, in a proprietary, not open source format, as prior Qwen Team releases were.
This is also to be expected — it's what many analysts and industry experts feared in the wake of the departure of several key Qwen Team leaders earlier this year. But it makes sense for Alibaba financially, at least in the short term: training AI models, especially ones as powerful as Qwen3.7-Max, is expensive, and giving them away essentially for free, as open source models are, does not immediately help recoup any costs.
In that sense, Alibaba is simply aligning its efforts with American AI giants like OpenAI and Google by offering the latest and greatest models only through paid APIs and subscription or paid web plan bundles, and slightly less performant ones through open source.
Still, the arrival of Qwen3.7-Max offers further optionality to enterprises and individual users, and more competition for American AI labs — rarely a bad thing for consumers at all budget levels. Yet, the fact that the model is only accessible from Chinese-based endpoints means it may be limited in its appeal to American and European enterprises seeking to maximize compliance and security posturing when fulfilling government contracts, or even just attempting to comply with all relevant state, local, and national data sovereignty regulations.
The marathon AI era
To understand why Qwen3.7-Max is a departure from previous models, one must look at how it was trained and how it operates in practice.
Language models typically degrade when forced to maintain a single train of thought over thousands of conversational turns; they forget instructions, hallucinate variables, or simply get stuck in logical loops. Qwen3.7-Max was specifically designed as a "versatile agent foundation" capable of "long-horizon reasoning" to overcome this exact bottleneck.
The starkest demonstration of this capability is an autonomous engineering task detailed by the Qwen team. The model was given access to an isolated server equipped with a T-Head ZW-M890 PPU—a hardware architecture the model had never encountered during its training. Its task was to optimize an attention kernel.
Over the course of 35 straight hours, Qwen3.7-Max operated entirely autonomously. It executed 1,158 distinct tool calls, performed 432 kernel evaluations, diagnosed compilation failures, and iteratively improved the code to achieve a 10.0x geometric mean speedup.
By comparison, Chinese competitor models like z.ai's GLM-5.1 and Moonshot's Kimi K2.6 capped out at 7.3x and 5.0x speedups respectively, often voluntarily terminating their sessions when they failed to make progress. However, both are available open source.
This endurance is achieved through what Alibaba calls "environment scaling". Just as early LLMs grew smarter by ingesting more diverse text, Qwen3.7-Max was trained across a vast, scaled array of dynamic agentic environments.
It is capable of simulating a one-year lifecycle of a startup in the "YC-Bench" evaluation, navigating hundreds of decision-making rounds encompassing personnel management and contract screening. In this simulation, the model managed to generate $2.08 million in virtual revenue, nearly doubling the performance of the prior generation, Qwen3.6-Plus.
Furthermore, the model has built-in reward-hacking self-monitoring, autonomously detecting when it attempts to cheat a training environment and adding heuristic rules to correct its own behavior.
A brain for any scaffold
From a product perspective, Qwen3.7-Max is designed to be the cognitive engine for modern software development and enterprise automation.
The model offers a massive 1-million-token context window and a 64K maximum output limit, providing immense overhead for processing sprawling codebases or lengthy technical documents.
One of its most compelling features is "cross-harness generalization". Rather than being hardcoded to work best within a specific proprietary interface, Qwen3.7-Max is built to act as a drop-in intelligence layer for diverse agent frameworks. It supports the Anthropic API protocol natively, allowing developers to plug it directly into existing tools like Claude Code or OpenClaw.
The benchmark data provided by Alibaba indicates that this generalized approach has paid massive dividends.
On the Apex Math Reasoning benchmark, Qwen3.7-Max scored 44.5, eclipsing Claude Opus-4.6 Max's score of 34.5 and DeepSeek V4-Pro Max's 38.3. It also posted dominant scores on Humanity's Last Exam (41.4) and the realistic coding agent benchmark MCP-Atlas (76.4).
This translates into tangible utility for end-users. Through open source Model Context Protocol (MCP) integrations, the model can operate as an autonomous office assistant, capable of reading university formatting specs and automatically reformatting a messy Word document via command-line tools without human intervention.
Running this level of intelligence comes at a distinct cost. Developers accessing the API via Alibaba Cloud Model Studio will pay $2.50 per 1 million input tokens and $7.50 per 1 million output tokens. The platform also features explicit cache creation and read pricing, as well as a $10 fee per 1,000 calls for integrated web searches, though code interpreter tools remain free for a limited time.
Qwen3.7-Max occupies a strategic middle ground in the current API economy. While it demands a notable premium over aggressively priced domestic rivals—costing nearly double DeepSeek V4 Pro ($5.22) and Z.ai's GLM-5.1 ($5.80)—it drastically undercuts the Western frontier giants it routinely matches on benchmarks.
For context, running heavy agentic workflows through OpenAI's GPT-5.4 or Anthropic's Claude Opus 4.7 will run developers $17.50 and $30.00 per million tokens, respectively. See VentureBeat's pricing chart below:
VentureBeat Frontier AI Model API Pricing Snapshot
Model | Input | Output | Total Cost | Source |
MiMo-V2.5 Flash | $0.10 | $0.30 | $0.40 | |
MiniMax M2.7 | $0.30 | $1.20 | $1.50 | |
Gemini 3.1 Flash-Lite | $0.25 | $1.50 | $1.75 | |
MiMo-V2.5 | $0.40 | $2.00 | $2.40 | |
Kimi-K2.6 | $0.95 | $4.00 | $4.95 | |
GLM-5 | $1.00 | $3.20 | $4.20 | |
Grok 4.3 (low context) | $1.25 | $2.50 | $3.75 | |
DeepSeek V4 Pro | $1.74 | $3.48 | $5.22 | |
GLM-5.1 | $1.40 | $4.40 | $5.80 | |
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | |
Grok 4.3 (high context) | $2.50 | $5.00 | $7.50 | |
Qwen3.7-Max | $2.50 | $7.50 | $10.00 | |
Gemini 3.5 Flash | $1.50 | $9.00 | $10.50 | |
Gemini 3.1 Pro Preview (≤200K) | $2.00 | $12.00 | $14.00 | |
GPT-5.4 | $2.50 | $15.00 | $17.50 | |
Gemini 3.1 Pro Preview (>200K) | $4.00 | $18.00 | $22.00 | |
Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | |
GPT-5.5 | $5.00 | $30.00 | $35.00 |
By positioning Qwen3.7-Max just below Google's Gemini 3.5 Flash ($10.50) but well above budget-tier models, Alibaba is signaling that this isn't a commodity release; it’s a flagship reasoning engine priced to lure enterprise workloads away from Silicon Valley's most expensive offerings.
Licensing remains proprietary for now
For all its technical brilliance, the most controversial aspect of Qwen3.7-Max is how it is distributed. Qwen is billing the release as a "proprietary model". It is strictly API-only.
Historically, Alibaba’s Qwen has been a hero to the open-source and local LLM communities. Previous iterations, like Qwen 2.5 and Qwen 3.6, released their weights publicly. Open weights allow developers, researchers, and enterprises to download the model, run it on their own hardware, and fine-tune it for highly specific or data-sensitive use cases without sending proprietary information to a third-party server.
By locking Qwen3.7-Max behind an API, Alibaba is pivoting to the standard commercial playbook utilized by OpenAI (with GPT-4) and Anthropic (with Claude). For enterprise users, this means utilizing Qwen3.7-Max requires trusting Alibaba Cloud with their data streams and relying entirely on internet connectivity to run their agentic workflows. For the open-source community, it means losing access to what is currently one of the most capable models on the planet.
Community reactions split between awe and disappointment
The reaction from the developer community has been swift, characterized by a mix of profound respect for the engineering achievement and frustration over the licensing model.
Prominent AI commentator Sudo su (@sudoingX) captured the prevailing sentiment on X (formerly Twitter). "qwen is unreal," they wrote. "they just dropped 3.7 max and it is beating opus 4.6 max on most of the benchmarks they ran".
The technical metrics, particularly the model's endurance, have left many in the field stunned. "the apex math number, 44.5 against opus 34.5, that is not a small gap," Sudo su noted. "the 35 hours straight on a kernel optimization task with 1000+ tool calls is the part i keep rereading. that is the agent era thing actually happening, not a slide".
The speed of Alibaba's iteration is also drawing notice. With Qwen 3.6 released just last month, the leap to 3.7-Max highlights a relentless development cadence. As Sudo su observed, "nobody else is moving like this".
Yet, the praise is heavily caveated by the shift to a closed ecosystem. The loss of the model weights is seen as a blow to the localized AI movement, which relies on state-of-the-art open models to push the boundaries of what can be done on consumer hardware or private enterprise clusters.
"one thing though, please open source this one too," Sudo su pleaded in their post. "3.6 dense made the entire local llm ecosystem better. the max tier going api only would close a door we have been keeping open. give us the weights eventually".
Qwen3.7-Max proves that the autonomous agent era is no longer a theoretical projection; it is a present reality capable of executing complex engineering feats while humans sleep. The only question now is whether this new frontier of AI will be a democratized resource you can download to your laptop, or an intelligence utility rented strictly from the cloud. For now, with Qwen3.7-Max, it is undeniably the latter.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5The whale has resurfaced. DeepSeek, the Chinese AI startup offshoot of High-Flyer Capital Management quantitative analysis firm, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It's been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with baited breath for the follow-up to the R1 moment. Now it's arrived with last night's release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under commercially-friendly open source MIT License, which nears — and on some benchmarks, surpasses — the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release—which DeepSeek AI researcher Deli Chen described on X as a "labor of love" 484 days after the launch of V3—is being hailed as the "second DeepSeek moment". As Chen noted in his post, "AGI belongs to everyone". It's available now on AI code sharing community Hugging Face and through DeepSeek's API. Frontier-class AI gets pushed into a lower price band The most immediate impact of the DeepSeek-V4 launch is economic. The corrected pricing table shows DeepSeek is not pricing its new Pro model at near-zero levels, but it is still pushing high-end model access into a far lower cost tier than the leading U.S. frontier models. DeepSeek-V4-Pro is priced through its API at $1.74 USD per 1 million input tokens on a cache miss and $3.48 per million output tokens. That puts a simple one-million-input, one-million-output comparison at $5.22. With cached input, the input price drops to $0.145 per million tokens, bringing that same blended comparison down to $3.625. That is dramatically cheaper than the current premium pricing from OpenAI and Anthropic. GPT-5.5 is priced at $5.00 per million input tokens and $30.00 per million output tokens, for a combined $35.00 in the same simple comparison. Claude Opus 4.7 is priced at $5.00 input and $25.00 output, for a combined $30.00. Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek-V4-Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI On standard, cache-miss pricing, DeepSeek-V4-Pro comes in at roughly one-seventh the cost of GPT-5.5 and about one-sixth (1/6th) the cost of Claude Opus 4.7. With cached input, the gap widens: DeepSeek-V4-Pro costs about one-tenth as much as GPT-5.5 and about one-eighth as much as Claude Opus 4.7. The more extreme near-zero story belongs to DeepSeek-V4-Flash, not the Pro model. Flash is priced at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens, for a combined $0.42. With cached input, that drops to $0.308. In that case, DeepSeek’s cheaper model is more than 98% below GPT-5.5 and Claude Opus 4.7 in a simple input-plus-output comparison, or nearly 1/100th the cost — though the performance dips significantly. DeepSeek is compressing advanced model economics into a much lower band, forcing developers and enterprises to revisit the cost-benefit calculation around premium closed models. For companies running large inference workloads, that price gap can change what is worth automating. Tasks that look too expensive on GPT-5.5 or Claude Opus 4.7 may become economically viable on DeepSeek-V4-Pro, and even more so on DeepSeek-V4-Flash. The launch does not make intelligence free, but it does make the market harder for premium providers to defend on performance alone. Benchmarking the frontier: DeepSeek-V4-Pro gets close, but GPT-5.5 and Opus 4.7 still lead on most shared tests DeepSeek-V4-Pro-Max is best understood as a major open-weight leap, not a clean across-the-board defeat of the newest closed frontier systems. The model’s strongest benchmark claims come from DeepSeek’s own comparison tables, where it is shown against GPT-5.4 xHigh, Claude Opus 4.6 Max and Gemini 3.1 Pro High and bests them on several tests, including Codeforces and Apex Shortlist. But that is not the same as a head-to-head against OpenAI’s newer GPT-5.5 or Anthropic’s newer Claude Opus 4.7. Looking only at DeepSeek-V4 versus the latest proprietary models, the picture is more restrained. On this shared set, GPT-5.5 and Claude Opus 4.7 still lead most categories. DeepSeek-V4-Pro-Max’s best showing is on BrowseComp, the benchmark measuring agentic AI web browsing prowess (especially highly containerized information), where it scores 83.4%, narrowly behind GPT-5.5 at 84.4% and ahead of Claude Opus 4.7 at 79.3%. On Terminal-Bench 2.0, DeepSeek scores 67.9%, close to Claude Opus 4.7’s 69.4%, but far behind GPT-5.5’s 82.7%. Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 48.2% 52.2% 57.2% 54.7% GPT-5.5 Pro Terminal-Bench 2.0 67.9% 82.7% — 69.4% GPT-5.5 SWE-Bench Pro / SWE Pro 55.4% 58.6% — 64.3% Claude Opus 4.7 BrowseComp 83.4% 84.4% 90.1% 79.3% GPT-5.5 Pro MCP Atlas / MCPAtlas Public 73.6% 75.3% — 79.1% Claude Opus 4.7 The shared academic-reasoning results favor the closed models: On GPQA Diamond, DeepSeek-V4-Pro-Max scores 90.1%, while GPT-5.5 reaches 93.6% and Claude Opus 4.7 reaches 94.2%. On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4.7 at 54.7%. The agentic and software-engineering results are more mixed, but they still show DeepSeek-V4-Pro-Max trailing GPT-5.5 and Opus 4.7. On Terminal-Bench 2.0, DeepSeek’s 67.9% is competitive with Claude Opus 4.7’s 69.4%, but GPT-5.5 is much higher at 82.7%. On SWE-Bench Pro, DeepSeek’s 55.4% trails GPT-5.5 at 58.6% and Claude Opus 4.7 at 64.3%. On MCP Atlas, DeepSeek’s 73.6% is slightly behind GPT-5.5 at 75.3% and Claude Opus 4.7 at 79.1%. BrowseComp is the standout: DeepSeek’s 83.4% beats Claude Opus 4.7’s 79.3% and nearly matches GPT-5.5’s 84.4%, though GPT-5.5 Pro’s 90.1% remains well ahead. So ultimately, DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 on the benchmarks that can be directly compared across the companies’ published tables. But it gets close enough on several of them — especially BrowseComp, Terminal-Bench 2.0 and MCP Atlas — that its much lower API pricing becomes the headline. In practical terms, DeepSeek does not need to win every leaderboard row to matter. If it can deliver near-frontier performance on many enterprise-relevant agent and reasoning tasks at roughly one-sixth to one-seventh the standard API cost of GPT-5.5 or Claude Opus 4.7, it still forces a major rethink of the economics of advanced AI deployment. DeepSeek-V4-Pro-Max is clearly the strongest open-weight model in the field right now, and it is unusually close to frontier closed systems on several practical benchmarks. While GPT-5.5 and Claude Opus 4.7 still retain the lead in most direct head-to-head comparisons across the company's benchmark charts, DeepSeek V4 Pro gets close while being dramatically cheaper and openly available. A big jump from DeepSeek V3.2 To understand the magnitude of this release, one must look at the performance gains of the base models. DeepSeek-V4-Pro-Base represents a significant advancement over the previous generation, DeepSeek-V3.2-Base. In World Knowledge, V4-Pro-Base achieved 90.1 on MMLU (5-shot) compared to V3.2’s 87.8, and a massive jump on MMLU-Pro from 65.5 to 73.5. The improvement in high-level reasoning and verified facts is even more pronounced: on SuperGPQA, V4-Pro-Base reached 53.9 compared to V3.2's 45.0, and on the FACTS Parametric benchmark, it more than doubled its predecessor's performance, jumping from 27.1 to 62.6. Simple-QA verified scores also saw a dramatic rise from 28.3 to 55.2. The Long Context capabilities have also been refined. On LongBench-V2, V4-Pro-Base scored 51.5, significantly outpacing the 40.2 achieved by V3.2-Base. In Code and Math, V4-Pro-Base reached 76.8 on HumanEval (Pass@1), up from 62.8 on V3.2-Base. These numbers underscore that DeepSeek has not just optimized for inference cost, but has fundamentally improved the intelligence density of its base architecture. The efficiency story is equally compelling for the Flash variant. DeepSeek-V4-Flash-Base, despite utilizing a substantially smaller number of parameters, outperforms the larger V3.2-Base across wide benchmarks, particularly in long-context scenarios. A new information 'traffic controller,' Manifold-Constrained Hyper-Connections (mHC) DeepSeek’s ability to offer these prices and performance figures is rooted in radical architectural innovations detailed in its technical report also released today, "Towards Highly Efficient Million-Token Context Intelligence." The standout technical achievement of V4 is its native one-million-token context window. Historically, maintaining such a large context required massive memory (the key values or KV cache). DeepSeek solved this by introducing a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) to reduce initial token dimensionality and Heavily Compressed Attention (HCA) to aggressively compress the memory footprint for long-range dependencies. In practice, the V4-Pro model requires only 10% of the KV cache and 27% of the single-token inference FLOPs compared to its predecessor, the DeepSeek-V3.2, even when operating at a 1M token context. To stabilize a network of 1.6 trillion parameters, DeepSeek moved beyond traditional residual connections. The company's researchers incorporated Manifold-Constrained Hyper-Connections (mHC) to strengthen signal propagation across layers while preserving the model’s expressivity. mHC allows an AI to have a much wider flow of information (so it can learn more complex things) without the risk of the model becoming unstable or "breaking" during its training. It’s like giving a city a 10-lane highway but adding a perfect AI traffic controller to ensure no one ever hits the brakes. This is paired with the Muon optimizer, which allowed the team to achieve faster convergence and greater training stability during the pre-training on more than 32T diverse and high-quality tokens. This pre-training data was refined to remove hatched auto-generated content, mitigating the risk of model collapse and prioritizing unique academic values. The model’s 1.6T parameters utilize a Mixture-of-Experts (MoE) design where only 49B parameters are activated per token, further driving down compute requirements. Training the mixture-of-experts (MoE) to work as a whole DeepSeek-V4 was not simply trained; it was "cultivated" through a unique two-stage paradigm. First, through Independent Expert Cultivation, domain-specific experts were trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using the GRPO (Group Relative Policy Optimization) algorithm. This allowed each expert to master specialized skills like mathematical reasoning or codebase analysis. Second, Unified Model Consolidation integrated these distinct proficiencies into a single model via on-policy distillation, where the unified model acts as the student learning to optimize reverse KL loss with teacher models. This distillation process ensures that the model preserves the specialized capabilities of each expert while operating as a cohesive whole. The model’s reasoning capabilities are further segmented into three increasing "effort" modes. The "Non-think" mode provides fast, intuitive responses for routine tasks. "Think High" provides conscious logical analysis for complex problem-solving. Finally, "Think Max" pushes the boundaries of model reasoning, bridging the gap with frontier models on complex reasoning and agentic tasks. This flexibility allows users to match the compute effort to the difficulty of the task, further enhancing cost-efficiency. Breaking the Nvidia GPU stranglehold with local Chinese Huawei Ascend NPUs While the model weights are the headline, the software stack released alongside them is arguably more important for the future of "Sovereign AI." Analyst Rui Ma highlighted a single sentence from the release as the most critical: DeepSeek validated their fine-grained Expert Parallelism (EP) scheme on Huawei Ascend NPUs (neural processing units). By achieving a 1.50x to 1.73x speedup on non-Nvidia GPU platforms, DeepSeek has provided a blueprint for high-performance AI deployment that is resilient to Western GPU supply chains and export controls. However, it's important to note that DeepSeek still claims it used officially licensed, legal Nvidia GPUs for DeepSeek V4's training, in addition to the Huawei NPUs. DeepSeek has also open-sourced the MegaMoE mega-kernel as a component of its DeepGEMM library. This CUDA-based implementation delivers up to a 1.96x speedup for latency-sensitive tasks like RL rollouts and high-speed agent serving. This move ensures that developers can run these massive models with extreme efficiency on existing hardware, further cementing DeepSeek’s role as the primary driver of open-source AI infrastructure. The technical report emphasizes that these optimizations are crucial for supporting a standard 1M context across all official services. Licensing and local deployment DeepSeek-V4 is released under the MIT License, the most permissive framework in the industry. This allows developers to use, copy, modify, and distribute the weights for commercial purposes without royalties—a stark contrast to the "restricted" open-weight licenses favored by other companies. For local deployment, DeepSeek recommends setting sampling parameters to temperature = 1.0 and top_p = 1.0. For those utilizing the "Think Max" reasoning mode, the team suggests setting the context window to at least 384K tokens to avoid truncating the model's internal reasoning chains. The release includes a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format and parse the model's output, including reasoning content. DeepSeek-V4 is also seamlessly integrated with leading AI agents like Claude Code, OpenClaw, and OpenCode. This native integration underscores its role as a bedrock for developer tools, providing an open-source alternative to the proprietary ecosystems of major cloud providers. Community reactions and what comes next The community reaction has been one of shock and validation. Hugging Face officially welcomed the "whale" back, stating that the era of cost-effective 1M context length has arrived. Industry experts noted that the "second DeepSeek moment" has effectively reset the developmental trajectory of the entire field, placing massive pressure on closed-source providers like OpenAI and Anthropic to justify their premiums. AI evaluation firm Vals AI noted that DeepSeek-V4 is now the "#1 open-weight model on our Vibe Code Benchmark, and it’s not close". DeepSeek is moving quickly to retire its older architectures. The company announced that the legacy deepseek-chat and deepseek-reasoner endpoints will be fully retired on July 24, 2026. All traffic is currently being rerouted to the V4-Flash architecture, signifying a total transition to the million-token standard. DeepSeek-V4 is more than just a new model; it is a challenge to the status quo. By proving that architectural innovation can substitute for raw compute-maximalism, DeepSeek has made the highest levels of AI intelligence accessible to the global developer community at a far lower cost — something that could benefit the globe, even at a time when lawmakers and leaders in Washington, D.C. are raising concerns about Chinese labs "distilling" from U.S. proprietary giants to train open source models, and fears of said open source or jailbroken proprietary models being used to create weapons and commit terror. The truth is, while all of these are potential risks — as they were and have been with prior technologies that broadened information access, like search and the internet itself — the benefits seem far outweigh them, and DeepSeek's quest to keep frontier AI models open is of benefit to the entire planet of potential AI users, especially enterprises looking to adopt the cutting-edge at the lowest possible cost.
- Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasksXiaomi, the Chinese firm best known for its smartphones and electric vehicles, has lately been shipping some incredibly affordable and high-powered open source AI large language models. The trend continued today with the release of Xiaomi MiMo-V2.5 and Xiaomi MiMo-V2.5-Pro, both available under the permissive, enterprise-friendly MIT License, making them suitable for use in production in commercial applications. Enterprises and individual/independent developers can now download either of the models (and more Xiaomi open source options) directly from Hugging Face, modify them as needed, and run them locally or on virtual private clouds as they see fit. The most notable attribute of these models besides the open source licensing is that, according to Xiaomi's published benchmarks, they are among the most efficient available for agentic "claw" tasks, that is, powering systems such as OpenClaw, NanoClaw and Hermes Agent, in which users can communicate with them directly over third-party messaging apps and have the agents go off and complete tasks on the human user's behalf, such as making and publishing marketing content, running accounts, organizing email and scheduling, etc. As Xiaomi's ClawEval benchmark chart shows, both MiMo-V2.5 and the Pro version in particular appear near the top left of the chart, indicating high performance in completing the benchmarked claw tasks while using the fewest amount of tokens — saving the human user money, especially in a world where more and more services such as Microsoft's GitHub Copilot are moving to usage-based billing (charging the human behind the agents for each token used rather than imposing rate limits like Anthropic or providing an "all-you-can-eat" buffet-style subscription like OpenAI). In fact, the Pro model leads the open-source field with a 63.8% success rate, consuming only ~70K tokens per trajectory. This is roughly 40–60% fewer tokens than those required by Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and OpenAI GPT-5.4 to achieve comparable results. By combining a massive 310B-parameter architecture with a highly efficient "active" footprint and a native 1-million-token context window, Xiaomi MiMo is challenging the dominance of closed-source frontier models from Google and OpenAI, especially when it comes to the latest and greatest craze in enterprise AI deployments — agentic tasks and "claws" similar to OpenClaw. A two-pronged pincer Xiaomi has released two distinct versions of the model to serve different ends of the development spectrum: MiMo-V2.5 (the "Omni" multimodal specialist) and MiMo-V2.5-Pro (the "Agent" specialist). While the base model provides native multimodality, the MiMo-V2.5-Pro is specifically engineered for "long-horizon coherence" and complex software engineering. On the GDPVal-AA (Elo) benchmark, the Pro model achieved a score of 1581, surpassing competitors like Kimi K2.6 and GLM 5.1. Xiaomi researchers further released data on several high-complexity tasks performed autonomously by V2.5-Pro: SysY Compiler in Rust: The model implemented a complete compiler from scratch—including lexer, parser, and RISC-V assembly backend—in 4.3 hours. Spanning 672 tool calls, the model achieved a perfect 233/233 score on hidden test suites, a task that typically takes a computer science major several weeks. Full-Featured Video Editor: Over 11.5 hours and 1,868 tool calls, the model produced an 8,192-line desktop application featuring multi-track timelines and an export pipeline. Analog EDA Optimization: In a graduate-level engineering task, the model optimized a Flipped-Voltage-Follower (FVF-LDO) regulator in the TSMC 180nm process. By iterating through an ngspice simulation loop, the model improved metrics like line regulation by 22x over its initial attempt. These experiments highlight a "harness awareness" in V2.5-Pro, where the model actively manages its own memory and shapes its context to sustain coherence over thousands of sequential tool calls. Over the API, Xiaomi is pricing the models at competitive rates for both domestic (Chinese) and international markets (like the U.S.). For overseas developers, the high-performance MiMo-V2.5-Pro is priced at $1.00 per million input tokens (for a cache miss) and $3.00 for output within context windows up to 256K. For ultra-long context tasks between 256K and 1M tokens, the cost doubles to $2.00 for input and $6.00 for output, though the architecture’s caching capabilities offer significant relief, reducing input costs to as little as $0.20 to $0.40 per million tokens upon a cache hit. Domestically, these rates are mirrored in yuan, with the Pro model starting at ¥7.00 per million input tokens for standard context and reaching ¥14.00 for the extended 1M range. Meanwhile, the base model starts at just $0.40 USD for overseas input per million tokens and $2.00 per million output, putting it among the more affordable third of leading LLMs globally (see our chart below): Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax MiMo-V2.5 Flash $0.10 $0.30 $0.40 Xiaomi MiMo Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2.5 $0.40 $2.00 $2.40 Xiaomi MiMo MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek V4 Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI To lower the barrier for agentic development further, Xiaomi has made cache writing free of charge for a limited time across all models, alongside a total fee waiver for the entire MiMo-V2.5-TTS suite, which includes its specialized voice cloning and design features. This pricing logic is clearly designed to accelerate the transition from simple chat applications to persistent, long-horizon agents that can operate at a fraction of the cost of legacy frontier models. Xiaomi has also introduced an overhauled version of its subscription offerings, called the "Token Plan," now available in four levels: The Lite "Starter Pack" provides 720 million credits for $63.36 USD per year Standard tier offers 2.4 billion credits for $168.96 per year A Pro tier provides 8.4 billion credits for $528.00 per year (designed for enterprise use cases) Max —aimed at high-intensity coding enthusiasts—delivers 19.2 billion credits for $1,056.00 per year Beyond credit allotments, all plans include preferential API rates, a 20% discount for off-peak calls, and "Day-0" support for popular coding scaffolds like Cursor, Zed, and Claude Code. However, both through the API and via the Token Plan, accessing the Xiaomi models from China may present barriers or additional compliance and regulatory risks to U.S.-based enterprise customers. As such, the best bet for U.S. enterprises concerned about relying on Chinese tech but wanting to take advantage of the low cost and open source models is likely setting up their own virtual private clouds or local servers, downloading the model weights, and running the models domestically. MoE architecture but divergent training regimens for V2.5 and V2.5-Pro At the heart of MiMo-V2.5 is a Sparse Mixture-of-Experts (MoE) architecture. While the model boasts a total of 310 billion parameters, only 15 billion are "active" during any given inference cycle. Meanwhile, V2.5-Pro is 1.02 trilion-parameter Mixture-of-Experts model with 42 billion active parameters. In either case, the design functions much like a specialized research hospital: while the facility has hundreds of doctors (parameters), only the specific specialists required for a particular case (query) are called into the room. This massive increase in parameter volume for the Pro version provides the "neural capacity" required for the deep, multi-step reasoning found in complex software engineering and long-horizon tasks, as though even more specialists are available in an even larger hospital. According to Xiaomi's blog post, the regular V2.5 follows a rigorous five-stage evolution: Text Pre-training: Building a massive language backbone on 48 trillion tokens. Projector Warmup: Aligning in-house audio and visual encoders with the language core. Multimodal Pre-training: Scaling across high-quality cross-modal data. Agentic Post-training: Progressively extending the context window from 32K to 1M tokens. RL and MOPD: Utilizing Reinforcement Learning and Multimodal Preference Optimization (MOPD) to sharpen real-world reasoning and perception. The backbone utilizes a hybrid sliding-window attention architecture, inherited from MiMo-V2-Flash, which optimizes how the model "remembers" long-range information. This technical foundation enables MiMo-V2.5 to see, hear, and reason natively, rather than relying on external "plug-in" tools for visual or auditory processing. Conversely, the training of MiMo-V2.5-Pro prioritizes "action space" over sensory perception. Instead of sensory alignment, the Pro model’s training focus shifts toward scaling post-training compute. This process is designed to instill "harness awareness," where the model is specifically trained to manage its own memory and context within autonomous agent scaffolds like Claude Code or OpenCode. While the base V2.5 model is trained to reason across modalities, the Pro version is trained to sustain coherence across more than a thousand sequential tool calls. The standard V2.5 model balances local and global attention to maintain multimodal perception. The Pro model, however, utilizes an increased hybrid attention ratio—evolving from the 5:1 ratio of previous generations to a more aggressive 7:1 ratio. This allows the Pro model to "skim" the vast majority of its context while applying high-density attention to the specific 15% of data most relevant to its current objective, a critical feature for debugging large repositories or optimizing graduate-level circuits. Finally, while both models undergo Reinforcement Learning (RL) and Multimodal Preference Optimization (MOPD), the objectives of these stages differ. For MiMo-V2.5, the RL stage is used to sharpen perception and multimodal reasoning. For MiMo-V2.5-Pro, RL is focused on instruction following within agentic scenarios, ensuring the model adheres to subtle requirements embedded deep within ultra-long contexts and recovers gracefully from errors during autonomous execution. This results in the Pro model's "self-correcting" discipline, as seen in its ability to diagnose and fix regressions during the 4.3-hour SysY compiler build. Full MIT License is perfect for enterprise use cases In a move that distinguishes it from many "open" models that include restrictive "Acceptable Use" policies, Xiaomi has released MiMo-V2.5 under the MIT License.The MIT License is the gold standard of permissive software licensing. For developers and enterprises, this means: No Authorization Required: Companies can deploy the model commercially without seeking explicit permission from Xiaomi. Continued Training: Developers are free to fine-tune the model on proprietary data and even release those derivative weights. Unrestricted Commercial Use: There are no revenue caps or user-base limits that often plague "community" licenses. By choosing MIT over a custom "open weights" license, Xiaomi is positioning MiMo as the foundational infrastructure for the next generation of AI agents, effectively inviting the global developer community to treat the model as a public utility. Xiaomi's background: from smartphones and EVs to Chinese open source AI darling Xiaomi’s pivot toward frontier AI agents is the logical culmination of a decade spent building one of the world's most dense hardware-software flywheels. Founded in 2010 as a smartphone disruptor, the Beijing-based company has executed a high-stakes transition into a vertically integrated powerhouse defined by its "Human x Car x Home" strategy. This ecosystem now encompasses over 823 million connectable smart devices unified under the HyperOS architecture. The company’s 2024 entry into the automotive sector with the SU7 and the subsequent high-performance YU7 SUV served as a proof of concept for this integration, positioning Xiaomi as a direct competitor to global luxury marques. By investing 200 billion yuan ($29B USD) into foundational R&D for chips and operating systems, Xiaomi has moved beyond consumer electronics assembly; it has become an architect of the "action space," using its massive hardware footprint as the primary testing ground for the agentic intelligence found in the MiMo-V2.5 series. Ecosystem support The release has been met with immediate "Day-0" support from the broader AI ecosystem. The MiMo team announced that SGLang and vLLM—two of the most popular high-throughput inference engines—supported the V2.5 series at launch. This was made possible through hardware partnerships with AWS, AMD, T-HEAD, and Enflame, ensuring the model can run efficiently on everything from cloud-based H100s to domestic Chinese accelerators. Fuli Luo, the project lead at Xiaomi MiMo and a former key member of the DeepSeek team, underscored the philosophy behind the release on X (formerly Twitter): "A model's value isn't measured by rankings alone — it's measured by the problems it solves. Let's build with MiMo now!" To kickstart this building phase, Luo announced a 100-trillion free token grant for builders and creators. This massive incentive is designed to lower the barrier to entry for developers who want to experiment with the 1M context window without immediate financial risk. The economic realignment: open source vs. metered proprietary The launch arrives at a critical juncture for AI economics. The shift toward usage-based billing marks the definitive end of the "all-you-can-eat" buffet era for AI services, a trend underscored by GitHub’s announcement today that its AI coding assistant Github Copilot will transition all plans to metered, token-based credits. As seat-based predictability gives way to consumption-driven costs, premium agentic workflows—which can consume millions of tokens in a single reasoning session—are becoming increasingly difficult for enterprises to budget. User sentiment has turned predictably cynical, with developers lamenting that they will "get less, but pay the same price" as subscriptions convert into finite allotments. This pricing evolution significantly enhances the strategic appeal of the MiMo series. By releasing under a permissive MIT License, Xiaomi allows organizations to bypass the escalating "SaaS tax" and reclaim financial predictability through private deployment. Crucially, Xiaomi has eliminated the "context tax" for its API. The 1-million-token context window is now billed at the standard rate—1 token = 1 credit for V2.5 and 2 credits for the Pro version—with no additional multiplier. This stands in stark contrast to the industry-wide move toward session-based caps, positioning MiMo as a refuge for cost-sensitive, high-volume development. Analysis for enterprises The launch of MiMo-V2.5 is more than just a weight drop; it is a declaration of independence for the open-source community. By matching Claude Sonnet 4.6 in multimodal agentic work and Gemini 3 Pro in video understanding, Xiaomi has proven that the gap between "closed-door" labs and open research is effectively closed. With the MIT license as a catalyst and a 100T token grant as fuel, the coming months will likely see a surge in specialized, agentic applications built on the MiMo backbone. Confirming the project's ambitious trajectory, the team noted they are already training the next generation, focusing on "deeper reasoning" and "richer real-world grounding". For now, MiMo-V2.5 stands as a testament to the power of sparse architectures and permissive licensing in the race toward functional AGI.
- AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT-5.4 on SWE-Bench ProIs China picking back up the open source AI baton? Z.ai, also known as Zhupai AI, a Chinese AI startup best known for its powerful, open source GLM family of models, has unveiled GLM-5.1 today under a permissive MIT License, allowing for enterprises to download, customize and use it for commercial purposes. They can do so on Hugging Face. This follows its release of GLM-5 Turbo, a faster version, under only proprietary license last month. The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering. The release represents a pivotal moment in the evolution of artificial intelligence. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons. GLM-5.1 is a 754-billion parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls. "agents could do about 20 steps by the end of last year," wrote z.ai leader Lou on X. "glm-5.1 can do 1,700 rn. autonomous work time may be the most important curve after scaling laws. glm-5.1 will be the first point on that curve that the open-source community can verify with their own hands. hope y'all like it^^" In a market increasingly crowded with fast models, Z.ai is betting on the marathon runner. The company, which listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, is using this release to cement its position as the leading independent developer of large language models in the region. Technology: the staircase pattern of optimization GLM-5.1s core technological breakthrough isn't just its scale, though its 754 billion parameters and 202,752 token context window are formidable, but its ability to avoid the plateau effect seen in previous models. In traditional agentic workflows, a model typically applies a few familiar techniques for quick initial gains and then stalls. Giving it more time or more tool calls usually results in diminishing returns or strategy drift. Z.ai research demonstrates that GLM-5.1 operates via what they call a staircase pattern, characterized by periods of incremental tuning within a fixed strategy punctuated by structural changes that shift the performance frontier. In Scenario 1 of their technical report, the model was tasked with optimizing a high-performance vector database, a challenge known as VectorDBBench. The model is provided with a Rust skeleton and empty implementation stubs, then uses tool-call-based agents to edit code, compile, test, and profile. While previous state-of-the-art results from models like Claude Opus 4.6 reached a performance ceiling of 3,547 queries per second, GLM-5.1 ran through 655 iterations and over 6,000 tool calls. The optimization trajectory was not linear but punctuated by structural breakthroughs. At iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, which reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 queries per second. By iteration 240, it autonomously introduced a two-stage pipeline involving u8 prescoring and f16 reranking, reaching 13,400 queries per second. Ultimately, the model identified and cleared six structural bottlenecks, including hierarchical routing via super-clusters and quantized routing using centroid scoring via VNNI. These efforts culminated in a final result of 21,500 queries per second, roughly six times the best result achieved in a single 50-turn session. This demonstrates a model that functions as its own research and development department, breaking complex problems down and running experiments with real precision. The model also managed complex execution tightening, lowering scheduling overhead and improving cache locality. During the optimization of the Approximate Nearest Neighbor search, the model proactively removed nested parallelism in favor of a redesign using per-query single-threading and outer concurrency. When the model encountered iterations where recall fell below the 95 percent threshold, it diagnosed the failure, adjusted its parameters, and implemented parameter compensation to recover the necessary accuracy. This level of autonomous correction is what separates GLM-5.1 from models that simply generate code without testing it in a live environment. Kernelbench: pushing the machine learning frontier The model's endurance was further tested in KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba. In this setting, the goal is to produce a faster GPU kernel than the reference PyTorch implementation while maintaining identical outputs. Each of the 50 problems runs in an isolated Docker container with one H100 GPU and is limited to 1,200 tool-use turns. Correctness and performance are evaluated against a PyTorch eager baseline in separate CUDA contexts. The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer. It eventually delivered a 3.6x geometric mean speedup across 50 problems, continuing to make useful progress well past 1,000 tool-use turns. Although Claude Opus 4.6 remains the leader in this specific benchmark at 4.2x, GLM-5.1 has meaningfully extended the productive horizon for open-source models. This capability is not simply about having a longer context window; it requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error. One of the key breakthroughs is the ability to form an autonomous experiment, analyze, and optimize loop, where the model can proactively run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement. All solutions generated during this process were independently audited for benchmark exploitation, ensuring the optimizations did not rely on specific benchmark behaviors but worked with arbitrary new inputs while keeping computation on the default CUDA stream. Product strategy: subscription and subsidies GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. To support this, Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools. The product offering is divided into three subscription tiers, all of which include free Model Context Protocol tools for vision analysis, web search, web reader, and document reading. The Lite tier at $27 USD per quarter is positioned for lightweight workloads and offers three times the usage of a comparable Claude Pro plan. The Pro tier at $81 per quarter is designed for complex workloads, offering five times the Lite plan usage and 40 to 60 percent faster execution. The Max tier at $216 per quarter is aimed at advanced developers with high-volume needs, ensuring guaranteed performance during peak hours. For those using the API directly or through platforms like OpenRouter or Requesty, Z.ai has priced GLM-5.1 at $1.40 per one million input tokens and $4.40 per million output tokens. There's also a cache discount available for $0.26 per million input tokens. Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.6 $5.00 $25.00 $30.00 Anthropic GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI Notably, the model consumes quota at three times the standard rate during peak hours, which are defined as 14:00 to 18:00 Beijing Time daily, though a limited-time promotion through April 2026 allows off-peak usage to be billed at a standard 1x rate. Complementing the flagship is the recently debuted GLM-5 Turbo. While 5.1 is the marathon runner, Turbo is the sprinter, proprietary and optimized for fast inference and tasks like tool use and persistent automation. At a cost of $1.20 per million input / $4 per million output, it is more expensive than the base GLM-5 but comes in at more affordable than the new GLM-5.1, positioning it as a commercially attractive option for high-speed, supervised agent runs. The model is also packaged for local deployment, supporting inference frameworks including vLLM, SGLang, and xLLM. Comprehensive deployment instructions are available at the official GitHub repository, allowing developers to run the 754 billion parameter MoE model on their own infrastructure. For enterprise teams, the model includes advanced reasoning capabilities that can be accessed via a thinking parameter in API requests, allowing the model to show its step-by-step internal reasoning process before providing a final answer. Benchmarks: a new global standard The performance data for GLM-5.1 suggests it has leapfrogged several established Western models in coding and engineering tasks. On SWE-Bench Pro, which evaluates a model's ability to resolve real-world GitHub issues using an instruction prompt and a 200,000 token context window, GLM-5.1 achieved a score of 58.4. For context, this outperforms GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2. Beyond standardized coding tests, the model showed significant gains in reasoning and agentic benchmarks. It scored 63.5 on Terminal-Bench 2.0 when evaluated with the Terminus-2 framework and reached 66.5 when paired with the Claude Code harness. On CyberGym, it achieved a 68.7 score based on a single-run pass over 1,507 tasks, demonstrating a nearly 20-point lead over the previous GLM-5 model. The model also performed strongly on the MCP-Atlas public set with a score of 71.8 and achieved a 70.6 on the T3-Bench. In the reasoning domain, it scored 31.0 on Humanitys Last Exam, which jumped to 52.3 when the model was allowed to use external tools. On the AIME 2026 math competition benchmark, it reached 95.3, while scoring 86.2 on GPQA-Diamond for expert-level science reasoning. The most impressive anecdotal benchmark was the Scenario 3 test: building a Linux-style desktop environment from scratch in eight hours. Unlike previous models that might produce a basic taskbar and a placeholder window before declaring the task complete, GLM-5.1 autonomously filled out a file browser, terminal, text editor, system monitor, and even functional games. It iteratively polished the styling and interaction logic until it had delivered a visually consistent, functional web application. This serves as a concrete example of what becomes possible when a model is given the time and the capability to keep refining its own work. Licensing and the open segue The licensing of these two models tells a larger story about the current state of the global AI market. GLM-5.1 has been released under the MIT License, with its model weights made publicly available on Hugging Face and ModelScope. This follows the Z.ai historical strategy of using open-source releases to build developer goodwill and ecosystem reach. However, GLM-5 Turbo remains proprietary and closed-source. This reflects a growing trend among leading AI labs toward a hybrid model: using open-source models for broad distribution while keeping execution-optimized variants behind a paywall. Industry analysts note that this shift arrives amidst a rebalancing in the Chinese market, where heavyweights like Alibaba are also beginning to segment their proprietary work from their open releases. Z.ai CEO Zhang Peng appears to be navigating this by ensuring that while the flagship's core intelligence is open to the community, the high-speed execution infrastructure remains a revenue-driving asset. The company is not explicitly promising to open-source GLM-5 Turbo itself, but says the findings will be folded into future open releases. This segmented strategy helps drive adoption while allowing the company to build a sustainable business model around its most commercially relevant work. Community and user reactions: crushing a week's work The developer community response to the GLM-5.1 release has been overwhelmingly focused on the model's reliability in production-grade environments. User reviews suggest a high degree of trust in the model's autonomy. One developer noted that GLM-5.1 shocked them with how good it is, stating it seems to do what they want more reliably than other models with less reworking of prompts needed. Another developer mentioned that the model's overall workflow from planning to project execution performs excellently, allowing them to confidently entrust it with complex tasks. Specific case studies from users highlight significant efficiency gains. A user from Crypto Economy News reported that a task involving preprocessing code, feature selection logic, and hyperparameter tuning solutions, which originally would have taken a week, was completed in just two days. Since getting the GLM Coding plan, other developers have noted being able to operate more freely and focus on core development without worrying about resource shortages hindering progress. On social media, the launch announcement generated over 46,000 views in its first hour, with users captivated by the eight-hour autonomous claim. The sentiment among early adopters is that Z.ai has successfully moved past the hallucination-heavy era of AI into a period where models can be trusted to optimize themselves through repeated iteration. The ability to build four applications rapidly through correct prompting and structured planning has been cited by multiple users as a game-changing development for individual developers. The implications of long-horizon work The release of GLM-5.1 suggests that the next frontier of AI competition will not be measured in tokens per second, but in autonomous duration. If a model can work for eight hours without human intervention, it fundamentally changes the software development lifecycle. However, Z.ai acknowledges that this is only the beginning. Significant challenges remain, such as developing reliable self-evaluation for tasks where no numeric metric exists to optimize against. Escaping local optima earlier when incremental tuning stops paying off is another major hurdle, as is maintaining coherence over execution traces that span thousands of tool calls. For now, Z.ai has placed a marker in the sand. With GLM-5.1, they have delivered a model that doesn't just answer questions, but finishes projects. The model is already compatible with a wide range of developer tools including Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid. For developers and enterprises, the question is no longer, "what can I ask this AI?" but "what can I assign to it for the next eight hours?" The focus of the industry is clearly shifting toward systems that can reliably execute multi-step work with less supervision. This transition to agentic engineering marks a new phase in the deployment of artificial intelligence within the global economy.