AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT-5.4 on SWE-Bench Pro
Our take

Is China picking back up the open source AI baton?
Z.ai, also known as Zhupai AI, a Chinese AI startup best known for its powerful, open source GLM family of models, has unveiled GLM-5.1 today under a permissive MIT License, allowing for enterprises to download, customize and use it for commercial purposes. They can do so on Hugging Face.
This follows its release of GLM-5 Turbo, a faster version, under only proprietary license last month.
The new GLM-5.1 is designed to work autonomously for up to eight hours on a single task, marking a definitive shift from vibe coding to agentic engineering.
The release represents a pivotal moment in the evolution of artificial intelligence. While competitors have focused on increasing reasoning tokens for better logic, Z.ai is optimizing for productive horizons.
GLM-5.1 is a 754-billion parameter Mixture-of-Experts model engineered to maintain goal alignment over extended execution traces that span thousands of tool calls.
"agents could do about 20 steps by the end of last year," wrote z.ai leader Lou on X. "glm-5.1 can do 1,700 rn. autonomous work time may be the most important curve after scaling laws. glm-5.1 will be the first point on that curve that the open-source community can verify with their own hands. hope y'all like it^^"
In a market increasingly crowded with fast models, Z.ai is betting on the marathon runner. The company, which listed on the Hong Kong Stock Exchange in early 2026 with a market capitalization of $52.83 billion, is using this release to cement its position as the leading independent developer of large language models in the region.
Technology: the staircase pattern of optimization
GLM-5.1s core technological breakthrough isn't just its scale, though its 754 billion parameters and 202,752 token context window are formidable, but its ability to avoid the plateau effect seen in previous models.
In traditional agentic workflows, a model typically applies a few familiar techniques for quick initial gains and then stalls. Giving it more time or more tool calls usually results in diminishing returns or strategy drift.
Z.ai research demonstrates that GLM-5.1 operates via what they call a staircase pattern, characterized by periods of incremental tuning within a fixed strategy punctuated by structural changes that shift the performance frontier.
In Scenario 1 of their technical report, the model was tasked with optimizing a high-performance vector database, a challenge known as VectorDBBench.
The model is provided with a Rust skeleton and empty implementation stubs, then uses tool-call-based agents to edit code, compile, test, and profile. While previous state-of-the-art results from models like Claude Opus 4.6 reached a performance ceiling of 3,547 queries per second, GLM-5.1 ran through 655 iterations and over 6,000 tool calls. The optimization trajectory was not linear but punctuated by structural breakthroughs.
At iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, which reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 queries per second.
By iteration 240, it autonomously introduced a two-stage pipeline involving u8 prescoring and f16 reranking, reaching 13,400 queries per second. Ultimately, the model identified and cleared six structural bottlenecks, including hierarchical routing via super-clusters and quantized routing using centroid scoring via VNNI. These efforts culminated in a final result of 21,500 queries per second, roughly six times the best result achieved in a single 50-turn session.
This demonstrates a model that functions as its own research and development department, breaking complex problems down and running experiments with real precision.
The model also managed complex execution tightening, lowering scheduling overhead and improving cache locality. During the optimization of the Approximate Nearest Neighbor search, the model proactively removed nested parallelism in favor of a redesign using per-query single-threading and outer concurrency.
When the model encountered iterations where recall fell below the 95 percent threshold, it diagnosed the failure, adjusted its parameters, and implemented parameter compensation to recover the necessary accuracy. This level of autonomous correction is what separates GLM-5.1 from models that simply generate code without testing it in a live environment.
Kernelbench: pushing the machine learning frontier
The model's endurance was further tested in KernelBench Level 3, which requires end-to-end optimization of complete machine learning architectures like MobileNet, VGG, MiniGPT, and Mamba.
In this setting, the goal is to produce a faster GPU kernel than the reference PyTorch implementation while maintaining identical outputs. Each of the 50 problems runs in an isolated Docker container with one H100 GPU and is limited to 1,200 tool-use turns. Correctness and performance are evaluated against a PyTorch eager baseline in separate CUDA contexts.
The results highlight a significant performance gap between GLM-5.1 and its predecessors. While the original GLM-5 improved quickly but leveled off early at a 2.6x speedup, GLM-5.1 sustained its optimization efforts far longer. It eventually delivered a 3.6x geometric mean speedup across 50 problems, continuing to make useful progress well past 1,000 tool-use turns.
Although Claude Opus 4.6 remains the leader in this specific benchmark at 4.2x, GLM-5.1 has meaningfully extended the productive horizon for open-source models.
This capability is not simply about having a longer context window; it requires the model to maintain goal alignment over extended execution, reducing strategy drift, error accumulation, and ineffective trial and error. One of the key breakthroughs is the ability to form an autonomous experiment, analyze, and optimize loop, where the model can proactively run benchmarks, identify bottlenecks, adjust strategies, and continuously improve results through iterative refinement.
All solutions generated during this process were independently audited for benchmark exploitation, ensuring the optimizations did not rely on specific benchmark behaviors but worked with arbitrary new inputs while keeping computation on the default CUDA stream.
Product strategy: subscription and subsidies
GLM-5.1 is positioned as an engineering-grade tool rather than a consumer chatbot. To support this, Z.ai has integrated it into a comprehensive Coding Plan ecosystem designed to compete directly with high-end developer tools.
The product offering is divided into three subscription tiers, all of which include free Model Context Protocol tools for vision analysis, web search, web reader, and document reading.
The Lite tier at $27 USD per quarter is positioned for lightweight workloads and offers three times the usage of a comparable Claude Pro plan. The Pro tier at $81 per quarter is designed for complex workloads, offering five times the Lite plan usage and 40 to 60 percent faster execution.
The Max tier at $216 per quarter is aimed at advanced developers with high-volume needs, ensuring guaranteed performance during peak hours.
For those using the API directly or through platforms like OpenRouter or Requesty, Z.ai has priced GLM-5.1 at $1.40 per one million input tokens and $4.40 per million output tokens. There's also a cache discount available for $0.26 per million input tokens.
Model | Input | Output | Total Cost | Source |
Grok 4.1 Fast | $0.20 | $0.50 | $0.70 | |
MiniMax M2.7 | $0.30 | $1.20 | $1.50 | |
Gemini 3 Flash | $0.50 | $3.00 | $3.50 | |
Kimi-K2.5 | $0.60 | $3.00 | $3.60 | |
MiMo-V2-Pro (≤256K) | $1.00 | $3.00 | $4.00 | |
GLM-5 | $1.00 | $3.20 | $4.20 | |
GLM-5-Turbo | $1.20 | $4.00 | $5.20 | |
GLM-5.1 | $1.40 | $4.40 | $5.80 | |
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 | |
Qwen3-Max | $1.20 | $6.00 | $7.20 | |
Gemini 3 Pro | $2.00 | $12.00 | $14.00 | |
GPT-5.2 | $1.75 | $14.00 | $15.75 | |
GPT-5.4 | $2.50 | $15.00 | $17.50 | |
Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 | |
Claude Opus 4.6 | $5.00 | $25.00 | $30.00 | |
GPT-5.4 Pro | $30.00 | $180.00 | $210.00 |
Notably, the model consumes quota at three times the standard rate during peak hours, which are defined as 14:00 to 18:00 Beijing Time daily, though a limited-time promotion through April 2026 allows off-peak usage to be billed at a standard 1x rate. Complementing the flagship is the recently debuted GLM-5 Turbo.
While 5.1 is the marathon runner, Turbo is the sprinter, proprietary and optimized for fast inference and tasks like tool use and persistent automation.
At a cost of $1.20 per million input / $4 per million output, it is more expensive than the base GLM-5 but comes in at more affordable than the new GLM-5.1, positioning it as a commercially attractive option for high-speed, supervised agent runs.
The model is also packaged for local deployment, supporting inference frameworks including vLLM, SGLang, and xLLM. Comprehensive deployment instructions are available at the official GitHub repository, allowing developers to run the 754 billion parameter MoE model on their own infrastructure.
For enterprise teams, the model includes advanced reasoning capabilities that can be accessed via a thinking parameter in API requests, allowing the model to show its step-by-step internal reasoning process before providing a final answer.
Benchmarks: a new global standard
The performance data for GLM-5.1 suggests it has leapfrogged several established Western models in coding and engineering tasks.
On SWE-Bench Pro, which evaluates a model's ability to resolve real-world GitHub issues using an instruction prompt and a 200,000 token context window, GLM-5.1 achieved a score of 58.4. For context, this outperforms GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2.
Beyond standardized coding tests, the model showed significant gains in reasoning and agentic benchmarks. It scored 63.5 on Terminal-Bench 2.0 when evaluated with the Terminus-2 framework and reached 66.5 when paired with the Claude Code harness.
On CyberGym, it achieved a 68.7 score based on a single-run pass over 1,507 tasks, demonstrating a nearly 20-point lead over the previous GLM-5 model. The model also performed strongly on the MCP-Atlas public set with a score of 71.8 and achieved a 70.6 on the T3-Bench.
In the reasoning domain, it scored 31.0 on Humanitys Last Exam, which jumped to 52.3 when the model was allowed to use external tools. On the AIME 2026 math competition benchmark, it reached 95.3, while scoring 86.2 on GPQA-Diamond for expert-level science reasoning.
The most impressive anecdotal benchmark was the Scenario 3 test: building a Linux-style desktop environment from scratch in eight hours.
Unlike previous models that might produce a basic taskbar and a placeholder window before declaring the task complete, GLM-5.1 autonomously filled out a file browser, terminal, text editor, system monitor, and even functional games.
It iteratively polished the styling and interaction logic until it had delivered a visually consistent, functional web application. This serves as a concrete example of what becomes possible when a model is given the time and the capability to keep refining its own work.
Licensing and the open segue
The licensing of these two models tells a larger story about the current state of the global AI market. GLM-5.1 has been released under the MIT License, with its model weights made publicly available on Hugging Face and ModelScope.
This follows the Z.ai historical strategy of using open-source releases to build developer goodwill and ecosystem reach. However, GLM-5 Turbo remains proprietary and closed-source. This reflects a growing trend among leading AI labs toward a hybrid model: using open-source models for broad distribution while keeping execution-optimized variants behind a paywall.
Industry analysts note that this shift arrives amidst a rebalancing in the Chinese market, where heavyweights like Alibaba are also beginning to segment their proprietary work from their open releases.
Z.ai CEO Zhang Peng appears to be navigating this by ensuring that while the flagship's core intelligence is open to the community, the high-speed execution infrastructure remains a revenue-driving asset.
The company is not explicitly promising to open-source GLM-5 Turbo itself, but says the findings will be folded into future open releases. This segmented strategy helps drive adoption while allowing the company to build a sustainable business model around its most commercially relevant work.
Community and user reactions: crushing a week's work
The developer community response to the GLM-5.1 release has been overwhelmingly focused on the model's reliability in production-grade environments.
User reviews suggest a high degree of trust in the model's autonomy.
One developer noted that GLM-5.1 shocked them with how good it is, stating it seems to do what they want more reliably than other models with less reworking of prompts needed. Another developer mentioned that the model's overall workflow from planning to project execution performs excellently, allowing them to confidently entrust it with complex tasks.
Specific case studies from users highlight significant efficiency gains.
A user from Crypto Economy News reported that a task involving preprocessing code, feature selection logic, and hyperparameter tuning solutions, which originally would have taken a week, was completed in just two days. Since getting the GLM Coding plan, other developers have noted being able to operate more freely and focus on core development without worrying about resource shortages hindering progress.
On social media, the launch announcement generated over 46,000 views in its first hour, with users captivated by the eight-hour autonomous claim. The sentiment among early adopters is that Z.ai has successfully moved past the hallucination-heavy era of AI into a period where models can be trusted to optimize themselves through repeated iteration.
The ability to build four applications rapidly through correct prompting and structured planning has been cited by multiple users as a game-changing development for individual developers.
The implications of long-horizon work
The release of GLM-5.1 suggests that the next frontier of AI competition will not be measured in tokens per second, but in autonomous duration.
If a model can work for eight hours without human intervention, it fundamentally changes the software development lifecycle.
However, Z.ai acknowledges that this is only the beginning. Significant challenges remain, such as developing reliable self-evaluation for tasks where no numeric metric exists to optimize against.
Escaping local optima earlier when incremental tuning stops paying off is another major hurdle, as is maintaining coherence over execution traces that span thousands of tool calls.
For now, Z.ai has placed a marker in the sand. With GLM-5.1, they have delivered a model that doesn't just answer questions, but finishes projects. The model is already compatible with a wide range of developer tools including Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid.
For developers and enterprises, the question is no longer, "what can I ask this AI?" but "what can I assign to it for the next eight hours?"
The focus of the industry is clearly shifting toward systems that can reliably execute multi-step work with less supervision. This transition to agentic engineering marks a new phase in the deployment of artificial intelligence within the global economy.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasksXiaomi, the Chinese firm best known for its smartphones and electric vehicles, has lately been shipping some incredibly affordable and high-powered open source AI large language models. The trend continued today with the release of Xiaomi MiMo-V2.5 and Xiaomi MiMo-V2.5-Pro, both available under the permissive, enterprise-friendly MIT License, making them suitable for use in production in commercial applications. Enterprises and individual/independent developers can now download either of the models (and more Xiaomi open source options) directly from Hugging Face, modify them as needed, and run them locally or on virtual private clouds as they see fit. The most notable attribute of these models besides the open source licensing is that, according to Xiaomi's published benchmarks, they are among the most efficient available for agentic "claw" tasks, that is, powering systems such as OpenClaw, NanoClaw and Hermes Agent, in which users can communicate with them directly over third-party messaging apps and have the agents go off and complete tasks on the human user's behalf, such as making and publishing marketing content, running accounts, organizing email and scheduling, etc. As Xiaomi's ClawEval benchmark chart shows, both MiMo-V2.5 and the Pro version in particular appear near the top left of the chart, indicating high performance in completing the benchmarked claw tasks while using the fewest amount of tokens — saving the human user money, especially in a world where more and more services such as Microsoft's GitHub Copilot are moving to usage-based billing (charging the human behind the agents for each token used rather than imposing rate limits like Anthropic or providing an "all-you-can-eat" buffet-style subscription like OpenAI). In fact, the Pro model leads the open-source field with a 63.8% success rate, consuming only ~70K tokens per trajectory. This is roughly 40–60% fewer tokens than those required by Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and OpenAI GPT-5.4 to achieve comparable results. By combining a massive 310B-parameter architecture with a highly efficient "active" footprint and a native 1-million-token context window, Xiaomi MiMo is challenging the dominance of closed-source frontier models from Google and OpenAI, especially when it comes to the latest and greatest craze in enterprise AI deployments — agentic tasks and "claws" similar to OpenClaw. A two-pronged pincer Xiaomi has released two distinct versions of the model to serve different ends of the development spectrum: MiMo-V2.5 (the "Omni" multimodal specialist) and MiMo-V2.5-Pro (the "Agent" specialist). While the base model provides native multimodality, the MiMo-V2.5-Pro is specifically engineered for "long-horizon coherence" and complex software engineering. On the GDPVal-AA (Elo) benchmark, the Pro model achieved a score of 1581, surpassing competitors like Kimi K2.6 and GLM 5.1. Xiaomi researchers further released data on several high-complexity tasks performed autonomously by V2.5-Pro: SysY Compiler in Rust: The model implemented a complete compiler from scratch—including lexer, parser, and RISC-V assembly backend—in 4.3 hours. Spanning 672 tool calls, the model achieved a perfect 233/233 score on hidden test suites, a task that typically takes a computer science major several weeks. Full-Featured Video Editor: Over 11.5 hours and 1,868 tool calls, the model produced an 8,192-line desktop application featuring multi-track timelines and an export pipeline. Analog EDA Optimization: In a graduate-level engineering task, the model optimized a Flipped-Voltage-Follower (FVF-LDO) regulator in the TSMC 180nm process. By iterating through an ngspice simulation loop, the model improved metrics like line regulation by 22x over its initial attempt. These experiments highlight a "harness awareness" in V2.5-Pro, where the model actively manages its own memory and shapes its context to sustain coherence over thousands of sequential tool calls. Over the API, Xiaomi is pricing the models at competitive rates for both domestic (Chinese) and international markets (like the U.S.). For overseas developers, the high-performance MiMo-V2.5-Pro is priced at $1.00 per million input tokens (for a cache miss) and $3.00 for output within context windows up to 256K. For ultra-long context tasks between 256K and 1M tokens, the cost doubles to $2.00 for input and $6.00 for output, though the architecture’s caching capabilities offer significant relief, reducing input costs to as little as $0.20 to $0.40 per million tokens upon a cache hit. Domestically, these rates are mirrored in yuan, with the Pro model starting at ¥7.00 per million input tokens for standard context and reaching ¥14.00 for the extended 1M range. Meanwhile, the base model starts at just $0.40 USD for overseas input per million tokens and $2.00 per million output, putting it among the more affordable third of leading LLMs globally (see our chart below): Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax MiMo-V2.5 Flash $0.10 $0.30 $0.40 Xiaomi MiMo Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2.5 $0.40 $2.00 $2.40 Xiaomi MiMo MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek V4 Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI To lower the barrier for agentic development further, Xiaomi has made cache writing free of charge for a limited time across all models, alongside a total fee waiver for the entire MiMo-V2.5-TTS suite, which includes its specialized voice cloning and design features. This pricing logic is clearly designed to accelerate the transition from simple chat applications to persistent, long-horizon agents that can operate at a fraction of the cost of legacy frontier models. Xiaomi has also introduced an overhauled version of its subscription offerings, called the "Token Plan," now available in four levels: The Lite "Starter Pack" provides 720 million credits for $63.36 USD per year Standard tier offers 2.4 billion credits for $168.96 per year A Pro tier provides 8.4 billion credits for $528.00 per year (designed for enterprise use cases) Max —aimed at high-intensity coding enthusiasts—delivers 19.2 billion credits for $1,056.00 per year Beyond credit allotments, all plans include preferential API rates, a 20% discount for off-peak calls, and "Day-0" support for popular coding scaffolds like Cursor, Zed, and Claude Code. However, both through the API and via the Token Plan, accessing the Xiaomi models from China may present barriers or additional compliance and regulatory risks to U.S.-based enterprise customers. As such, the best bet for U.S. enterprises concerned about relying on Chinese tech but wanting to take advantage of the low cost and open source models is likely setting up their own virtual private clouds or local servers, downloading the model weights, and running the models domestically. MoE architecture but divergent training regimens for V2.5 and V2.5-Pro At the heart of MiMo-V2.5 is a Sparse Mixture-of-Experts (MoE) architecture. While the model boasts a total of 310 billion parameters, only 15 billion are "active" during any given inference cycle. Meanwhile, V2.5-Pro is 1.02 trilion-parameter Mixture-of-Experts model with 42 billion active parameters. In either case, the design functions much like a specialized research hospital: while the facility has hundreds of doctors (parameters), only the specific specialists required for a particular case (query) are called into the room. This massive increase in parameter volume for the Pro version provides the "neural capacity" required for the deep, multi-step reasoning found in complex software engineering and long-horizon tasks, as though even more specialists are available in an even larger hospital. According to Xiaomi's blog post, the regular V2.5 follows a rigorous five-stage evolution: Text Pre-training: Building a massive language backbone on 48 trillion tokens. Projector Warmup: Aligning in-house audio and visual encoders with the language core. Multimodal Pre-training: Scaling across high-quality cross-modal data. Agentic Post-training: Progressively extending the context window from 32K to 1M tokens. RL and MOPD: Utilizing Reinforcement Learning and Multimodal Preference Optimization (MOPD) to sharpen real-world reasoning and perception. The backbone utilizes a hybrid sliding-window attention architecture, inherited from MiMo-V2-Flash, which optimizes how the model "remembers" long-range information. This technical foundation enables MiMo-V2.5 to see, hear, and reason natively, rather than relying on external "plug-in" tools for visual or auditory processing. Conversely, the training of MiMo-V2.5-Pro prioritizes "action space" over sensory perception. Instead of sensory alignment, the Pro model’s training focus shifts toward scaling post-training compute. This process is designed to instill "harness awareness," where the model is specifically trained to manage its own memory and context within autonomous agent scaffolds like Claude Code or OpenCode. While the base V2.5 model is trained to reason across modalities, the Pro version is trained to sustain coherence across more than a thousand sequential tool calls. The standard V2.5 model balances local and global attention to maintain multimodal perception. The Pro model, however, utilizes an increased hybrid attention ratio—evolving from the 5:1 ratio of previous generations to a more aggressive 7:1 ratio. This allows the Pro model to "skim" the vast majority of its context while applying high-density attention to the specific 15% of data most relevant to its current objective, a critical feature for debugging large repositories or optimizing graduate-level circuits. Finally, while both models undergo Reinforcement Learning (RL) and Multimodal Preference Optimization (MOPD), the objectives of these stages differ. For MiMo-V2.5, the RL stage is used to sharpen perception and multimodal reasoning. For MiMo-V2.5-Pro, RL is focused on instruction following within agentic scenarios, ensuring the model adheres to subtle requirements embedded deep within ultra-long contexts and recovers gracefully from errors during autonomous execution. This results in the Pro model's "self-correcting" discipline, as seen in its ability to diagnose and fix regressions during the 4.3-hour SysY compiler build. Full MIT License is perfect for enterprise use cases In a move that distinguishes it from many "open" models that include restrictive "Acceptable Use" policies, Xiaomi has released MiMo-V2.5 under the MIT License.The MIT License is the gold standard of permissive software licensing. For developers and enterprises, this means: No Authorization Required: Companies can deploy the model commercially without seeking explicit permission from Xiaomi. Continued Training: Developers are free to fine-tune the model on proprietary data and even release those derivative weights. Unrestricted Commercial Use: There are no revenue caps or user-base limits that often plague "community" licenses. By choosing MIT over a custom "open weights" license, Xiaomi is positioning MiMo as the foundational infrastructure for the next generation of AI agents, effectively inviting the global developer community to treat the model as a public utility. Xiaomi's background: from smartphones and EVs to Chinese open source AI darling Xiaomi’s pivot toward frontier AI agents is the logical culmination of a decade spent building one of the world's most dense hardware-software flywheels. Founded in 2010 as a smartphone disruptor, the Beijing-based company has executed a high-stakes transition into a vertically integrated powerhouse defined by its "Human x Car x Home" strategy. This ecosystem now encompasses over 823 million connectable smart devices unified under the HyperOS architecture. The company’s 2024 entry into the automotive sector with the SU7 and the subsequent high-performance YU7 SUV served as a proof of concept for this integration, positioning Xiaomi as a direct competitor to global luxury marques. By investing 200 billion yuan ($29B USD) into foundational R&D for chips and operating systems, Xiaomi has moved beyond consumer electronics assembly; it has become an architect of the "action space," using its massive hardware footprint as the primary testing ground for the agentic intelligence found in the MiMo-V2.5 series. Ecosystem support The release has been met with immediate "Day-0" support from the broader AI ecosystem. The MiMo team announced that SGLang and vLLM—two of the most popular high-throughput inference engines—supported the V2.5 series at launch. This was made possible through hardware partnerships with AWS, AMD, T-HEAD, and Enflame, ensuring the model can run efficiently on everything from cloud-based H100s to domestic Chinese accelerators. Fuli Luo, the project lead at Xiaomi MiMo and a former key member of the DeepSeek team, underscored the philosophy behind the release on X (formerly Twitter): "A model's value isn't measured by rankings alone — it's measured by the problems it solves. Let's build with MiMo now!" To kickstart this building phase, Luo announced a 100-trillion free token grant for builders and creators. This massive incentive is designed to lower the barrier to entry for developers who want to experiment with the 1M context window without immediate financial risk. The economic realignment: open source vs. metered proprietary The launch arrives at a critical juncture for AI economics. The shift toward usage-based billing marks the definitive end of the "all-you-can-eat" buffet era for AI services, a trend underscored by GitHub’s announcement today that its AI coding assistant Github Copilot will transition all plans to metered, token-based credits. As seat-based predictability gives way to consumption-driven costs, premium agentic workflows—which can consume millions of tokens in a single reasoning session—are becoming increasingly difficult for enterprises to budget. User sentiment has turned predictably cynical, with developers lamenting that they will "get less, but pay the same price" as subscriptions convert into finite allotments. This pricing evolution significantly enhances the strategic appeal of the MiMo series. By releasing under a permissive MIT License, Xiaomi allows organizations to bypass the escalating "SaaS tax" and reclaim financial predictability through private deployment. Crucially, Xiaomi has eliminated the "context tax" for its API. The 1-million-token context window is now billed at the standard rate—1 token = 1 credit for V2.5 and 2 credits for the Pro version—with no additional multiplier. This stands in stark contrast to the industry-wide move toward session-based caps, positioning MiMo as a refuge for cost-sensitive, high-volume development. Analysis for enterprises The launch of MiMo-V2.5 is more than just a weight drop; it is a declaration of independence for the open-source community. By matching Claude Sonnet 4.6 in multimodal agentic work and Gemini 3 Pro in video understanding, Xiaomi has proven that the gap between "closed-door" labs and open research is effectively closed. With the MIT license as a catalyst and a 100T token grant as fuel, the coming months will likely see a surge in specialized, agentic applications built on the MiMo backbone. Confirming the project's ambitious trajectory, the team noted they are already training the next generation, focusing on "deeper reasoning" and "richer real-world grounding". For now, MiMo-V2.5 stands as a testament to the power of sparse architectures and permissive licensing in the race toward functional AGI.
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5The whale has resurfaced. DeepSeek, the Chinese AI startup offshoot of High-Flyer Capital Management quantitative analysis firm, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It's been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with baited breath for the follow-up to the R1 moment. Now it's arrived with last night's release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under commercially-friendly open source MIT License, which nears — and on some benchmarks, surpasses — the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release—which DeepSeek AI researcher Deli Chen described on X as a "labor of love" 484 days after the launch of V3—is being hailed as the "second DeepSeek moment". As Chen noted in his post, "AGI belongs to everyone". It's available now on AI code sharing community Hugging Face and through DeepSeek's API. Frontier-class AI gets pushed into a lower price band The most immediate impact of the DeepSeek-V4 launch is economic. The corrected pricing table shows DeepSeek is not pricing its new Pro model at near-zero levels, but it is still pushing high-end model access into a far lower cost tier than the leading U.S. frontier models. DeepSeek-V4-Pro is priced through its API at $1.74 USD per 1 million input tokens on a cache miss and $3.48 per million output tokens. That puts a simple one-million-input, one-million-output comparison at $5.22. With cached input, the input price drops to $0.145 per million tokens, bringing that same blended comparison down to $3.625. That is dramatically cheaper than the current premium pricing from OpenAI and Anthropic. GPT-5.5 is priced at $5.00 per million input tokens and $30.00 per million output tokens, for a combined $35.00 in the same simple comparison. Claude Opus 4.7 is priced at $5.00 input and $25.00 output, for a combined $30.00. Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek-V4-Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI On standard, cache-miss pricing, DeepSeek-V4-Pro comes in at roughly one-seventh the cost of GPT-5.5 and about one-sixth (1/6th) the cost of Claude Opus 4.7. With cached input, the gap widens: DeepSeek-V4-Pro costs about one-tenth as much as GPT-5.5 and about one-eighth as much as Claude Opus 4.7. The more extreme near-zero story belongs to DeepSeek-V4-Flash, not the Pro model. Flash is priced at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens, for a combined $0.42. With cached input, that drops to $0.308. In that case, DeepSeek’s cheaper model is more than 98% below GPT-5.5 and Claude Opus 4.7 in a simple input-plus-output comparison, or nearly 1/100th the cost — though the performance dips significantly. DeepSeek is compressing advanced model economics into a much lower band, forcing developers and enterprises to revisit the cost-benefit calculation around premium closed models. For companies running large inference workloads, that price gap can change what is worth automating. Tasks that look too expensive on GPT-5.5 or Claude Opus 4.7 may become economically viable on DeepSeek-V4-Pro, and even more so on DeepSeek-V4-Flash. The launch does not make intelligence free, but it does make the market harder for premium providers to defend on performance alone. Benchmarking the frontier: DeepSeek-V4-Pro gets close, but GPT-5.5 and Opus 4.7 still lead on most shared tests DeepSeek-V4-Pro-Max is best understood as a major open-weight leap, not a clean across-the-board defeat of the newest closed frontier systems. The model’s strongest benchmark claims come from DeepSeek’s own comparison tables, where it is shown against GPT-5.4 xHigh, Claude Opus 4.6 Max and Gemini 3.1 Pro High and bests them on several tests, including Codeforces and Apex Shortlist. But that is not the same as a head-to-head against OpenAI’s newer GPT-5.5 or Anthropic’s newer Claude Opus 4.7. Looking only at DeepSeek-V4 versus the latest proprietary models, the picture is more restrained. On this shared set, GPT-5.5 and Claude Opus 4.7 still lead most categories. DeepSeek-V4-Pro-Max’s best showing is on BrowseComp, the benchmark measuring agentic AI web browsing prowess (especially highly containerized information), where it scores 83.4%, narrowly behind GPT-5.5 at 84.4% and ahead of Claude Opus 4.7 at 79.3%. On Terminal-Bench 2.0, DeepSeek scores 67.9%, close to Claude Opus 4.7’s 69.4%, but far behind GPT-5.5’s 82.7%. Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 48.2% 52.2% 57.2% 54.7% GPT-5.5 Pro Terminal-Bench 2.0 67.9% 82.7% — 69.4% GPT-5.5 SWE-Bench Pro / SWE Pro 55.4% 58.6% — 64.3% Claude Opus 4.7 BrowseComp 83.4% 84.4% 90.1% 79.3% GPT-5.5 Pro MCP Atlas / MCPAtlas Public 73.6% 75.3% — 79.1% Claude Opus 4.7 The shared academic-reasoning results favor the closed models: On GPQA Diamond, DeepSeek-V4-Pro-Max scores 90.1%, while GPT-5.5 reaches 93.6% and Claude Opus 4.7 reaches 94.2%. On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4.7 at 54.7%. The agentic and software-engineering results are more mixed, but they still show DeepSeek-V4-Pro-Max trailing GPT-5.5 and Opus 4.7. On Terminal-Bench 2.0, DeepSeek’s 67.9% is competitive with Claude Opus 4.7’s 69.4%, but GPT-5.5 is much higher at 82.7%. On SWE-Bench Pro, DeepSeek’s 55.4% trails GPT-5.5 at 58.6% and Claude Opus 4.7 at 64.3%. On MCP Atlas, DeepSeek’s 73.6% is slightly behind GPT-5.5 at 75.3% and Claude Opus 4.7 at 79.1%. BrowseComp is the standout: DeepSeek’s 83.4% beats Claude Opus 4.7’s 79.3% and nearly matches GPT-5.5’s 84.4%, though GPT-5.5 Pro’s 90.1% remains well ahead. So ultimately, DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 on the benchmarks that can be directly compared across the companies’ published tables. But it gets close enough on several of them — especially BrowseComp, Terminal-Bench 2.0 and MCP Atlas — that its much lower API pricing becomes the headline. In practical terms, DeepSeek does not need to win every leaderboard row to matter. If it can deliver near-frontier performance on many enterprise-relevant agent and reasoning tasks at roughly one-sixth to one-seventh the standard API cost of GPT-5.5 or Claude Opus 4.7, it still forces a major rethink of the economics of advanced AI deployment. DeepSeek-V4-Pro-Max is clearly the strongest open-weight model in the field right now, and it is unusually close to frontier closed systems on several practical benchmarks. While GPT-5.5 and Claude Opus 4.7 still retain the lead in most direct head-to-head comparisons across the company's benchmark charts, DeepSeek V4 Pro gets close while being dramatically cheaper and openly available. A big jump from DeepSeek V3.2 To understand the magnitude of this release, one must look at the performance gains of the base models. DeepSeek-V4-Pro-Base represents a significant advancement over the previous generation, DeepSeek-V3.2-Base. In World Knowledge, V4-Pro-Base achieved 90.1 on MMLU (5-shot) compared to V3.2’s 87.8, and a massive jump on MMLU-Pro from 65.5 to 73.5. The improvement in high-level reasoning and verified facts is even more pronounced: on SuperGPQA, V4-Pro-Base reached 53.9 compared to V3.2's 45.0, and on the FACTS Parametric benchmark, it more than doubled its predecessor's performance, jumping from 27.1 to 62.6. Simple-QA verified scores also saw a dramatic rise from 28.3 to 55.2. The Long Context capabilities have also been refined. On LongBench-V2, V4-Pro-Base scored 51.5, significantly outpacing the 40.2 achieved by V3.2-Base. In Code and Math, V4-Pro-Base reached 76.8 on HumanEval (Pass@1), up from 62.8 on V3.2-Base. These numbers underscore that DeepSeek has not just optimized for inference cost, but has fundamentally improved the intelligence density of its base architecture. The efficiency story is equally compelling for the Flash variant. DeepSeek-V4-Flash-Base, despite utilizing a substantially smaller number of parameters, outperforms the larger V3.2-Base across wide benchmarks, particularly in long-context scenarios. A new information 'traffic controller,' Manifold-Constrained Hyper-Connections (mHC) DeepSeek’s ability to offer these prices and performance figures is rooted in radical architectural innovations detailed in its technical report also released today, "Towards Highly Efficient Million-Token Context Intelligence." The standout technical achievement of V4 is its native one-million-token context window. Historically, maintaining such a large context required massive memory (the key values or KV cache). DeepSeek solved this by introducing a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) to reduce initial token dimensionality and Heavily Compressed Attention (HCA) to aggressively compress the memory footprint for long-range dependencies. In practice, the V4-Pro model requires only 10% of the KV cache and 27% of the single-token inference FLOPs compared to its predecessor, the DeepSeek-V3.2, even when operating at a 1M token context. To stabilize a network of 1.6 trillion parameters, DeepSeek moved beyond traditional residual connections. The company's researchers incorporated Manifold-Constrained Hyper-Connections (mHC) to strengthen signal propagation across layers while preserving the model’s expressivity. mHC allows an AI to have a much wider flow of information (so it can learn more complex things) without the risk of the model becoming unstable or "breaking" during its training. It’s like giving a city a 10-lane highway but adding a perfect AI traffic controller to ensure no one ever hits the brakes. This is paired with the Muon optimizer, which allowed the team to achieve faster convergence and greater training stability during the pre-training on more than 32T diverse and high-quality tokens. This pre-training data was refined to remove hatched auto-generated content, mitigating the risk of model collapse and prioritizing unique academic values. The model’s 1.6T parameters utilize a Mixture-of-Experts (MoE) design where only 49B parameters are activated per token, further driving down compute requirements. Training the mixture-of-experts (MoE) to work as a whole DeepSeek-V4 was not simply trained; it was "cultivated" through a unique two-stage paradigm. First, through Independent Expert Cultivation, domain-specific experts were trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using the GRPO (Group Relative Policy Optimization) algorithm. This allowed each expert to master specialized skills like mathematical reasoning or codebase analysis. Second, Unified Model Consolidation integrated these distinct proficiencies into a single model via on-policy distillation, where the unified model acts as the student learning to optimize reverse KL loss with teacher models. This distillation process ensures that the model preserves the specialized capabilities of each expert while operating as a cohesive whole. The model’s reasoning capabilities are further segmented into three increasing "effort" modes. The "Non-think" mode provides fast, intuitive responses for routine tasks. "Think High" provides conscious logical analysis for complex problem-solving. Finally, "Think Max" pushes the boundaries of model reasoning, bridging the gap with frontier models on complex reasoning and agentic tasks. This flexibility allows users to match the compute effort to the difficulty of the task, further enhancing cost-efficiency. Breaking the Nvidia GPU stranglehold with local Chinese Huawei Ascend NPUs While the model weights are the headline, the software stack released alongside them is arguably more important for the future of "Sovereign AI." Analyst Rui Ma highlighted a single sentence from the release as the most critical: DeepSeek validated their fine-grained Expert Parallelism (EP) scheme on Huawei Ascend NPUs (neural processing units). By achieving a 1.50x to 1.73x speedup on non-Nvidia GPU platforms, DeepSeek has provided a blueprint for high-performance AI deployment that is resilient to Western GPU supply chains and export controls. However, it's important to note that DeepSeek still claims it used officially licensed, legal Nvidia GPUs for DeepSeek V4's training, in addition to the Huawei NPUs. DeepSeek has also open-sourced the MegaMoE mega-kernel as a component of its DeepGEMM library. This CUDA-based implementation delivers up to a 1.96x speedup for latency-sensitive tasks like RL rollouts and high-speed agent serving. This move ensures that developers can run these massive models with extreme efficiency on existing hardware, further cementing DeepSeek’s role as the primary driver of open-source AI infrastructure. The technical report emphasizes that these optimizations are crucial for supporting a standard 1M context across all official services. Licensing and local deployment DeepSeek-V4 is released under the MIT License, the most permissive framework in the industry. This allows developers to use, copy, modify, and distribute the weights for commercial purposes without royalties—a stark contrast to the "restricted" open-weight licenses favored by other companies. For local deployment, DeepSeek recommends setting sampling parameters to temperature = 1.0 and top_p = 1.0. For those utilizing the "Think Max" reasoning mode, the team suggests setting the context window to at least 384K tokens to avoid truncating the model's internal reasoning chains. The release includes a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format and parse the model's output, including reasoning content. DeepSeek-V4 is also seamlessly integrated with leading AI agents like Claude Code, OpenClaw, and OpenCode. This native integration underscores its role as a bedrock for developer tools, providing an open-source alternative to the proprietary ecosystems of major cloud providers. Community reactions and what comes next The community reaction has been one of shock and validation. Hugging Face officially welcomed the "whale" back, stating that the era of cost-effective 1M context length has arrived. Industry experts noted that the "second DeepSeek moment" has effectively reset the developmental trajectory of the entire field, placing massive pressure on closed-source providers like OpenAI and Anthropic to justify their premiums. AI evaluation firm Vals AI noted that DeepSeek-V4 is now the "#1 open-weight model on our Vibe Code Benchmark, and it’s not close". DeepSeek is moving quickly to retire its older architectures. The company announced that the legacy deepseek-chat and deepseek-reasoner endpoints will be fully retired on July 24, 2026. All traffic is currently being rerouted to the V4-Flash architecture, signifying a total transition to the million-token standard. DeepSeek-V4 is more than just a new model; it is a challenge to the status quo. By proving that architectural innovation can substitute for raw compute-maximalism, DeepSeek has made the highest levels of AI intelligence accessible to the global developer community at a far lower cost — something that could benefit the globe, even at a time when lawmakers and leaders in Washington, D.C. are raising concerns about Chinese labs "distilling" from U.S. proprietary giants to train open source models, and fears of said open source or jailbroken proprietary models being used to create weapons and commit terror. The truth is, while all of these are potential risks — as they were and have been with prior technologies that broadened information access, like search and the internet itself — the benefits seem far outweigh them, and DeepSeek's quest to keep frontier AI models open is of benefit to the entire planet of potential AI users, especially enterprises looking to adopt the cutting-edge at the lowest possible cost.