Arcee's new, open source Trinity-Large-Thinking is the rare, powerful U.S.-made AI model that enterprises can download and customize
The baton of open source AI models has been passed on between several companies over the years since ChatGPT debuted in late 2022, from Meta with its Llama family to Chinese labs like Qwen and z.ai. But lately, Chinese companies have started pivoting back towards proprietary models even as some U.S. labs like Cursor and Nvidia release their own variants of the Chinese models, leaving a question mark about who will originate this branch of technology going forward.
One answer: Arcee, a San Francisco-based lab, which this week released Trinity-Large-Thinking, a 399-billion-parameter, text-only reasoning model under the uncompromisingly open Apache 2.0 license, which allows full customization and commercial use by anyone from indie developers to large enterprises.
The release represents more than just a new set of weights on AI code sharing community Hugging Face; it is a strategic bet that "American Open Weights" can provide a sovereign alternative to the increasingly closed or restricted frontier models of 2025.
This move arrives precisely as enterprises express growing discomfort with relying on Chinese-based architectures for critical infrastructure, creating a demand for a domestic champion that Arcee intends to fill.
As Clément Delangue, co-founder and CEO of Hugging Face, told VentureBeat in a direct message on X: "The strength of the US has always been its startups so maybe they're the ones we should count on to lead in open-source AI. Arcee shows that it's possible!"
Genesis of a 30-person frontier lab
To understand the weight of the Trinity release, one must understand the lab that built it. Based in San Francisco, Arcee AI is a lean team of only 30 people.
While competitors like OpenAI and Google operate with thousands of engineers and multibillion-dollar compute budgets, Arcee has defined itself through what CTO Lucas Atkins calls "engineering through constraint".
The company first made waves in 2024 after securing a $24 million Series A led by Emergence Capital, bringing its total capital to just under $50 million. In early 2026, the team took a massive risk: they committed $20 million—nearly half their total funding—to a single 33-day training run for Trinity Large.
Utilizing a cluster of 2048 NVIDIA B300 Blackwell GPUs, which provided twice the speed of the previous Hopper generation, Arcee bet the company's future on the belief that developers needed a frontier model they could truly own.
This "bet the company" wager was a masterclass in capital efficiency, proving that a small, focused team could stand up a full pipeline and stabilize training without endless reserves.
Engineering through extreme architectural constraint
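Trinity-Large-Thinking is noteworthy for the extreme sparsity of its Mixture-of-Experts (MoE) layers. While the model houses roughly 400 billion total parameters, only about 13 billion of them are active for any given token. That is a little over 3% of the total weights; the 1.56% sparsity figure cited for the model most plausibly describes the share of experts routed per token rather than the share of parameters.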
This allows the model to possess the deep knowledge of a massive system while maintaining the inference speed and operational efficiency of a much smaller one—performing roughly 2 to 3 times faster than its peers on the same hardware. Training such a sparse model presented significant stability challenges.
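The sparsity figures are easy to sanity-check with a few lines of arithmetic. The sketch below uses the 400 billion and 13 billion parameter counts quoted above; the expert counts (4 of 256 routed per token) are a hypothetical configuration picked only because 4/256 equals 1.56%, and are not stated in the article.

```python
# Parameter counts quoted in the article.
TOTAL_PARAMS = 400e9    # total parameters
ACTIVE_PARAMS = 13e9    # parameters active for any given token


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active / total


param_share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)

# Hypothetical routing setup (NOT from the article): 4 of 256 experts per token.
HYPOTHETICAL_EXPERTS = 256
HYPOTHETICAL_TOP_K = 4
expert_share = HYPOTHETICAL_TOP_K / HYPOTHETICAL_EXPERTS

print(f"active parameter share: {param_share:.2%}")
print(f"expert routing share:   {expert_share:.2%}")
```

Under these assumptions the parameter share and the routing share are different quantities, which is why a model can route to under 2% of its experts while still touching a few percent of its weights (attention layers and any shared components run for every token).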
To prevent a few experts from becoming "winners" while others remained untrained "dead weight," Arcee developed SMEBU, or Soft-clamped Momentum Expert Bias Updates.
This mechanism ensures that experts specialize and that tokens are routed evenly among them across a general web corpus. The architecture also incorporates a hybrid attention scheme, alternating local sliding-window and global attention layers in a 3:1 ratio to maintain performance in long-context scenarios.
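The article does not spell out how SMEBU works, so the following is only an illustrative reading of the name: a per-expert routing bias that is nudged after each batch, with the nudge passed through a tanh "soft clamp" and smoothed by momentum. Every function name, hyperparameter, and the toy router here is an assumption made for illustration, not Arcee's implementation.

```python
import math
import random


def top_k_experts(scores, bias, k):
    """Pick the k experts with the highest biased score (bias affects routing only)."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]


def update_bias(bias, momentum, loads, lr=0.02, beta=0.9, scale=1.0):
    """One soft-clamped, momentum-smoothed bias update (hypothetical formulation).

    loads[i] is the number of tokens expert i received in the last batch.
    """
    mean_load = sum(loads) / len(loads)
    for i, load in enumerate(loads):
        # Positive when the expert is underused, negative when it is overused.
        error = (mean_load - load) / max(mean_load, 1e-9)
        # Soft clamp: tanh bounds the step to (-1, 1) without a hard cutoff.
        step = math.tanh(scale * error)
        # Momentum smooths noisy per-batch load measurements.
        momentum[i] = beta * momentum[i] + (1.0 - beta) * step
        bias[i] += lr * momentum[i]
    return bias, momentum


def simulate(num_experts=8, k=2, tokens_per_batch=256, batches=200, seed=0):
    """Toy router in which expert 0 starts with a built-in advantage (a "winner")."""
    rng = random.Random(seed)
    skew = [1.5] + [0.0] * (num_experts - 1)
    bias = [0.0] * num_experts
    momentum = [0.0] * num_experts
    history = []
    for _ in range(batches):
        loads = [0] * num_experts
        for _ in range(tokens_per_batch):
            scores = [rng.gauss(0.0, 1.0) + skew[i] for i in range(num_experts)]
            for expert in top_k_experts(scores, bias, k):
                loads[expert] += 1
        history.append(loads)
        bias, momentum = update_bias(bias, momentum, loads)
    return history[0], history[-1], bias


first_loads, last_loads, final_bias = simulate()
print("first batch loads:", first_loads)
print("last batch loads: ", last_loads)
```

In this toy setup the favored expert starts out absorbing far more than its share of tokens, and the bias updates gradually pull the loads back toward an even split. The soft clamp matters most early on, when the imbalance is large and an unbounded step could overshoot.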
The data curriculum and synthetic reasoning
Arcee’s partnership with fellow startup DatologyAI provided a curriculum of over 10 trillion curated tokens. However, the training corpus for the full-scale model was expanded to 20 trillion tokens, split evenly between curated web data and high-quality synthetic data.
Unlike typical imitation-based synthetic data where a smaller model simply learns to mimic a larger one, DatologyAI utilized techniques to synthetically rewrite raw web text—such as Wikipedia articles or blogs—to condense the information.
This process helped the model learn to reason over concepts and information rather than merely memorizing exact token strings.
To reduce legal and regulatory risk, considerable effort went into excluding copyrighted books and materials with unclear licensing, a choice aimed at enterprise customers who are wary of the intellectual property risks associated with mainstream LLMs.
This data-first approach allowed the model to scale cleanly while significantly improving performance on complex tasks like mathematics and multi-step agent tool use.
The pivot from yappy chatbots to reasoning agents
The defining feature of this official release is the transition from a standard "instruct" model to a "reasoning" model.
By implementing a "thinking" phase prior to generating a response—similar to the internal loops found in the earlier Trinity-Mini—Arcee has addressed the primary criticism of its January "Preview" release.
Early users of the Preview model had noted that it sometimes struggled with multi-step instructions in complex environments and could be "underwhelming" for agentic tasks.
The "Thinking" update effectively bridges this gap, enabling what Arcee calls "long-horizon agents" that can maintain coherence across multi-turn tool calls without getting "sloppy".
This reasoning process enables better context coherence and cleaner instruction following under constraint. This has direct implications for Maestro Reasoning, a 32B-parameter derivative of Trinity already being used in audit-focused industries to provide transparent "thought-to-answer" traces.
The goal was to move beyond "yappy" or inefficient chatbots toward reliable, cheap, high-quality agents that stay stable across long-running loops.
Geopolitics and the case for American open weights
The significance of Arcee’s Apache 2.0 commitment is amplified by the retreat of its primary competitors from the open-weight frontier.
Throughout 2025, Chinese research labs like Alibaba's Qwen and z.ai (aka Zhipu) set the pace for high-efficiency MoE architectures.
However, as we enter 2026, those labs have begun to shift toward proprietary enterprise platforms and specialized subscriptions, signaling a move away from pure community growth.
The fragmentation of these once-prolific teams, such as the departure of key technical leads from Alibaba's Qwen lab, has left a void at the high end of the open-weight market. In the United States, the movement has faced its own crisis.
Meta’s Llama division notably retreated from the frontier landscape following the mixed reception of Llama 4 in April 2025, which faced reports of quality issues and benchmark manipulation.
For developers who relied on the Llama 3 era of dominance, the lack of a current 400B+ open model created an urgent need for an alternative that Arcee has risen to fill.
Benchmarks and how Arcee's Trinity-Large-Thinking stacks up to other U.S. frontier open source AI model offerings
Trinity-Large-Thinking’s performance on agent-specific evaluations establishes it as a legitimate frontier contender. On PinchBench, a critical metric for evaluating model capability on autonomous agentic tasks, Trinity achieved a score of 91.9, placing it just behind the proprietary market leader, Claude Opus 4.6 (93.3).
This competitiveness is mirrored in IFBench, where Trinity’s score of 52.3 sits in a near-dead heat with Opus 4.6’s 53.1, indicating that the reasoning-first "Thinking" update has successfully addressed the instruction-following hurdles that challenged the model’s earlier preview phase.
The model’s broader technical reasoning capabilities also place it at the high end of the current open-source market. It recorded a 96.3 on AIME25, matching the high-tier Kimi-K2.5 and outstripping other major competitors like GLM-5 (93.3) and MiniMax-M2.7 (80.0).
While high-end coding benchmarks like SWE-bench Verified still show a lead for top-tier closed-source models—with Trinity scoring 63.2 against Opus 4.6’s 75.6—the massive delta in cost-per-token positions Trinity as the more viable sovereign infrastructure layer for enterprises looking to deploy these capabilities at production scale.
When it comes to other U.S. open source frontier model offerings, OpenAI's gpt-oss tops out at 120 billion parameters. Google offers Gemma (Gemma 4 was just released this week), and IBM's Granite family is also worth a mention despite its lower benchmark scores. Nvidia's Nemotron family is notable too, but it consists of fine-tuned and post-trained Qwen variants.
| Benchmark | Arcee Trinity-Large | gpt-oss-120B (High) | IBM Granite 4.0 | Google Gemma 4 |
| --- | --- | --- | --- | --- |
| GPQA-D | 76.3% | 80.1% | 74.8% | 84.3% |
| Tau2-Airline | 88.0% | 65.8%* | 68.3% | 76.9% |
| PinchBench | 91.9% | 69.0% (IFBench) | 89.1% | 93.3% |
| AIME25 | 96.3% | 97.9% | 88.5% | 89.2% |
| MMLU-Pro | 83.4% | 90.0% (MMLU) | 81.2% | 85.2% |
So how is an enterprise supposed to choose between all these?
Arcee Trinity-Large-Thinking is the premier choice for organizations building autonomous agents; its sparse 400B architecture excels at "thinking" through multi-step logic, complex math, and long-horizon tool use. By activating only a fraction of its parameters, it provides a high-speed reasoning engine for developers who need GPT-4o-level planning capabilities within a cost-effective, open-source framework.
Conversely, gpt-oss-120B serves as the optimal middle ground for enterprises that require high-reasoning performance but prioritize lower operational costs and deployment flexibility.
Because it activates only 5.1B parameters per forward pass, it is uniquely suited for technical workloads like competitive code generation and advanced mathematical modeling that must run on limited hardware, such as a single H100 GPU.
Its configurable reasoning effort—offering "Low," "Medium," and "High" modes—makes it the best fit for production environments where latency and accuracy must be balanced dynamically across different tasks.
For broader, high-throughput applications, Google Gemma 4 and IBM Granite 4.0 serve as the primary backbones. Gemma 4 offers the highest "intelligence density" for general knowledge and scientific accuracy, making it the most versatile option for R&D and high-speed chat interfaces.
Meanwhile, IBM Granite 4.0 is engineered for the "all-day" enterprise workload, utilizing a hybrid architecture that eliminates context bottlenecks for massive document processing. For businesses concerned with legal compliance and hardware efficiency, Granite remains the most reliable foundation for large-scale RAG and document analysis.
Ownership as a feature for regulated industries
In this climate, Arcee’s choice of the Apache 2.0 license is a deliberate act of differentiation. Unlike the restrictive community licenses used by some competitors, Apache 2.0 allows enterprises to truly own their intelligence stack without the "black box" biases of a general-purpose chat model.
"Developers and Enterprises need models they can inspect, post-train, host, distill, and own," Lucas Atkins noted in the launch announcement.
This ownership is critical for the "bitter lesson" of training small models: you usually need to train a massive frontier model first to generate the high-quality synthetic data and logits required to build efficient student models.
Furthermore, Arcee has released Trinity-Large-TrueBase, a raw 10-trillion-token checkpoint. TrueBase offers a rare, "unspoiled" look at foundational intelligence before instruction tuning and reinforcement learning are applied. For researchers in highly regulated industries like finance and defense, TrueBase allows for authentic audits and custom alignments starting from a clean slate.
Community verdict and the future of distillation
The response from the developer community has been largely positive, reflecting the desire for more open-weight, U.S.-made models.
On X, researchers highlighted the disruption, noting that the "insanely cheap" prices for a model of this size would be a boon for the agentic community.
On open AI model inference website OpenRouter, Trinity-Large-Preview established itself as the #1 most used open model in the U.S., serving over 80.6 billion tokens on peak days like March 1, 2026.
The proximity of Trinity-Large-Thinking to Claude Opus 4.6 on PinchBench—at 91.9 versus 93.3—is particularly striking when compared to the cost. At $0.90 per million output tokens, Trinity is approximately 96% cheaper than Opus 4.6, which costs $25 per million output tokens.
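To make the price gap concrete, here is a short sketch of the arithmetic. The two per-million-token output prices come from the article; the workload of 500 million output tokens per month is a hypothetical figure chosen for illustration, and input-token costs are ignored.

```python
# Output prices quoted in the article, in USD per million output tokens.
TRINITY_OUTPUT_PRICE = 0.90
OPUS_OUTPUT_PRICE = 25.00


def percent_cheaper(price: float, reference: float) -> float:
    """How much cheaper `price` is than `reference`, as a percentage."""
    return (reference - price) / reference * 100.0


def output_cost(tokens: float, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


savings_pct = percent_cheaper(TRINITY_OUTPUT_PRICE, OPUS_OUTPUT_PRICE)
price_ratio = OPUS_OUTPUT_PRICE / TRINITY_OUTPUT_PRICE

# Hypothetical workload (NOT from the article): 500M output tokens per month.
MONTHLY_OUTPUT_TOKENS = 500_000_000
trinity_monthly = output_cost(MONTHLY_OUTPUT_TOKENS, TRINITY_OUTPUT_PRICE)
opus_monthly = output_cost(MONTHLY_OUTPUT_TOKENS, OPUS_OUTPUT_PRICE)

print(f"Trinity is {savings_pct:.1f}% cheaper ({price_ratio:.1f}x price ratio)")
print(f"Monthly output cost: Trinity ${trinity_monthly:,.0f} vs. Opus ${opus_monthly:,.0f}")
```

At that assumed volume the difference is a few hundred dollars versus five figures per month, which is the kind of gap that changes which agentic workloads are worth running at all.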
Arcee’s strategy is now focused on bringing these pretraining and post-training lessons back down the stack. Much of the work that went into Trinity Large will now flow into the Mini and Nano models, refreshing the company's compact line with the distillation of frontier-level reasoning.
As global labs pivot toward proprietary lock-in, Arcee has positioned Trinity as a sovereign infrastructure layer that developers can finally control and adapt for long-horizon agentic workflows.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5The whale has resurfaced. DeepSeek, the Chinese AI startup offshoot of High-Flyer Capital Management quantitative analysis firm, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It's been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with baited breath for the follow-up to the R1 moment. Now it's arrived with last night's release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under commercially-friendly open source MIT License, which nears — and on some benchmarks, surpasses — the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release—which DeepSeek AI researcher Deli Chen described on X as a "labor of love" 484 days after the launch of V3—is being hailed as the "second DeepSeek moment". As Chen noted in his post, "AGI belongs to everyone". It's available now on AI code sharing community Hugging Face and through DeepSeek's API. Frontier-class AI gets pushed into a lower price band The most immediate impact of the DeepSeek-V4 launch is economic. The corrected pricing table shows DeepSeek is not pricing its new Pro model at near-zero levels, but it is still pushing high-end model access into a far lower cost tier than the leading U.S. frontier models. DeepSeek-V4-Pro is priced through its API at $1.74 USD per 1 million input tokens on a cache miss and $3.48 per million output tokens. That puts a simple one-million-input, one-million-output comparison at $5.22. With cached input, the input price drops to $0.145 per million tokens, bringing that same blended comparison down to $3.625. 
That is dramatically cheaper than the current premium pricing from OpenAI and Anthropic. GPT-5.5 is priced at $5.00 per million input tokens and $30.00 per million output tokens, for a combined $35.00 in the same simple comparison. Claude Opus 4.7 is priced at $5.00 input and $25.00 output, for a combined $30.00. Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek-V4-Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI On standard, cache-miss pricing, DeepSeek-V4-Pro comes in at roughly one-seventh the cost of GPT-5.5 and about one-sixth (1/6th) the cost of Claude Opus 4.7. With cached input, the gap widens: DeepSeek-V4-Pro costs about one-tenth as much as GPT-5.5 and about one-eighth as much as Claude Opus 4.7. The more extreme near-zero story belongs to DeepSeek-V4-Flash, not the Pro model. Flash is priced at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens, for a combined $0.42. With cached input, that drops to $0.308. In that case, DeepSeek’s cheaper model is more than 98% below GPT-5.5 and Claude Opus 4.7 in a simple input-plus-output comparison, or nearly 1/100th the cost — though the performance dips significantly. DeepSeek is compressing advanced model economics into a much lower band, forcing developers and enterprises to revisit the cost-benefit calculation around premium closed models. 
For companies running large inference workloads, that price gap can change what is worth automating. Tasks that look too expensive on GPT-5.5 or Claude Opus 4.7 may become economically viable on DeepSeek-V4-Pro, and even more so on DeepSeek-V4-Flash. The launch does not make intelligence free, but it does make the market harder for premium providers to defend on performance alone. Benchmarking the frontier: DeepSeek-V4-Pro gets close, but GPT-5.5 and Opus 4.7 still lead on most shared tests DeepSeek-V4-Pro-Max is best understood as a major open-weight leap, not a clean across-the-board defeat of the newest closed frontier systems. The model’s strongest benchmark claims come from DeepSeek’s own comparison tables, where it is shown against GPT-5.4 xHigh, Claude Opus 4.6 Max and Gemini 3.1 Pro High and bests them on several tests, including Codeforces and Apex Shortlist. But that is not the same as a head-to-head against OpenAI’s newer GPT-5.5 or Anthropic’s newer Claude Opus 4.7. Looking only at DeepSeek-V4 versus the latest proprietary models, the picture is more restrained. On this shared set, GPT-5.5 and Claude Opus 4.7 still lead most categories. DeepSeek-V4-Pro-Max’s best showing is on BrowseComp, the benchmark measuring agentic AI web browsing prowess (especially highly containerized information), where it scores 83.4%, narrowly behind GPT-5.5 at 84.4% and ahead of Claude Opus 4.7 at 79.3%. On Terminal-Bench 2.0, DeepSeek scores 67.9%, close to Claude Opus 4.7’s 69.4%, but far behind GPT-5.5’s 82.7%. 
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 48.2% 52.2% 57.2% 54.7% GPT-5.5 Pro Terminal-Bench 2.0 67.9% 82.7% — 69.4% GPT-5.5 SWE-Bench Pro / SWE Pro 55.4% 58.6% — 64.3% Claude Opus 4.7 BrowseComp 83.4% 84.4% 90.1% 79.3% GPT-5.5 Pro MCP Atlas / MCPAtlas Public 73.6% 75.3% — 79.1% Claude Opus 4.7 The shared academic-reasoning results favor the closed models: On GPQA Diamond, DeepSeek-V4-Pro-Max scores 90.1%, while GPT-5.5 reaches 93.6% and Claude Opus 4.7 reaches 94.2%. On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4.7 at 54.7%. The agentic and software-engineering results are more mixed, but they still show DeepSeek-V4-Pro-Max trailing GPT-5.5 and Opus 4.7. On Terminal-Bench 2.0, DeepSeek’s 67.9% is competitive with Claude Opus 4.7’s 69.4%, but GPT-5.5 is much higher at 82.7%. On SWE-Bench Pro, DeepSeek’s 55.4% trails GPT-5.5 at 58.6% and Claude Opus 4.7 at 64.3%. On MCP Atlas, DeepSeek’s 73.6% is slightly behind GPT-5.5 at 75.3% and Claude Opus 4.7 at 79.1%. BrowseComp is the standout: DeepSeek’s 83.4% beats Claude Opus 4.7’s 79.3% and nearly matches GPT-5.5’s 84.4%, though GPT-5.5 Pro’s 90.1% remains well ahead. So ultimately, DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 on the benchmarks that can be directly compared across the companies’ published tables. But it gets close enough on several of them — especially BrowseComp, Terminal-Bench 2.0 and MCP Atlas — that its much lower API pricing becomes the headline. In practical terms, DeepSeek does not need to win every leaderboard row to matter. 
If it can deliver near-frontier performance on many enterprise-relevant agent and reasoning tasks at roughly one-sixth to one-seventh the standard API cost of GPT-5.5 or Claude Opus 4.7, it still forces a major rethink of the economics of advanced AI deployment. DeepSeek-V4-Pro-Max is clearly the strongest open-weight model in the field right now, and it is unusually close to frontier closed systems on several practical benchmarks. While GPT-5.5 and Claude Opus 4.7 still retain the lead in most direct head-to-head comparisons across the company's benchmark charts, DeepSeek V4 Pro gets close while being dramatically cheaper and openly available. A big jump from DeepSeek V3.2 To understand the magnitude of this release, one must look at the performance gains of the base models. DeepSeek-V4-Pro-Base represents a significant advancement over the previous generation, DeepSeek-V3.2-Base. In World Knowledge, V4-Pro-Base achieved 90.1 on MMLU (5-shot) compared to V3.2’s 87.8, and a massive jump on MMLU-Pro from 65.5 to 73.5. The improvement in high-level reasoning and verified facts is even more pronounced: on SuperGPQA, V4-Pro-Base reached 53.9 compared to V3.2's 45.0, and on the FACTS Parametric benchmark, it more than doubled its predecessor's performance, jumping from 27.1 to 62.6. Simple-QA verified scores also saw a dramatic rise from 28.3 to 55.2. The Long Context capabilities have also been refined. On LongBench-V2, V4-Pro-Base scored 51.5, significantly outpacing the 40.2 achieved by V3.2-Base. In Code and Math, V4-Pro-Base reached 76.8 on HumanEval (Pass@1), up from 62.8 on V3.2-Base. These numbers underscore that DeepSeek has not just optimized for inference cost, but has fundamentally improved the intelligence density of its base architecture. The efficiency story is equally compelling for the Flash variant. 
DeepSeek-V4-Flash-Base, despite utilizing a substantially smaller number of parameters, outperforms the larger V3.2-Base across wide benchmarks, particularly in long-context scenarios. A new information 'traffic controller,' Manifold-Constrained Hyper-Connections (mHC) DeepSeek’s ability to offer these prices and performance figures is rooted in radical architectural innovations detailed in its technical report also released today, "Towards Highly Efficient Million-Token Context Intelligence." The standout technical achievement of V4 is its native one-million-token context window. Historically, maintaining such a large context required massive memory (the key values or KV cache). DeepSeek solved this by introducing a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) to reduce initial token dimensionality and Heavily Compressed Attention (HCA) to aggressively compress the memory footprint for long-range dependencies. In practice, the V4-Pro model requires only 10% of the KV cache and 27% of the single-token inference FLOPs compared to its predecessor, the DeepSeek-V3.2, even when operating at a 1M token context. To stabilize a network of 1.6 trillion parameters, DeepSeek moved beyond traditional residual connections. The company's researchers incorporated Manifold-Constrained Hyper-Connections (mHC) to strengthen signal propagation across layers while preserving the model’s expressivity. mHC allows an AI to have a much wider flow of information (so it can learn more complex things) without the risk of the model becoming unstable or "breaking" during its training. It’s like giving a city a 10-lane highway but adding a perfect AI traffic controller to ensure no one ever hits the brakes. This is paired with the Muon optimizer, which allowed the team to achieve faster convergence and greater training stability during the pre-training on more than 32T diverse and high-quality tokens. 
This pre-training data was refined to remove hatched auto-generated content, mitigating the risk of model collapse and prioritizing unique academic values. The model’s 1.6T parameters utilize a Mixture-of-Experts (MoE) design where only 49B parameters are activated per token, further driving down compute requirements. Training the mixture-of-experts (MoE) to work as a whole DeepSeek-V4 was not simply trained; it was "cultivated" through a unique two-stage paradigm. First, through Independent Expert Cultivation, domain-specific experts were trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using the GRPO (Group Relative Policy Optimization) algorithm. This allowed each expert to master specialized skills like mathematical reasoning or codebase analysis. Second, Unified Model Consolidation integrated these distinct proficiencies into a single model via on-policy distillation, where the unified model acts as the student learning to optimize reverse KL loss with teacher models. This distillation process ensures that the model preserves the specialized capabilities of each expert while operating as a cohesive whole. The model’s reasoning capabilities are further segmented into three increasing "effort" modes. The "Non-think" mode provides fast, intuitive responses for routine tasks. "Think High" provides conscious logical analysis for complex problem-solving. Finally, "Think Max" pushes the boundaries of model reasoning, bridging the gap with frontier models on complex reasoning and agentic tasks. This flexibility allows users to match the compute effort to the difficulty of the task, further enhancing cost-efficiency. Breaking the Nvidia GPU stranglehold with local Chinese Huawei Ascend NPUs While the model weights are the headline, the software stack released alongside them is arguably more important for the future of "Sovereign AI." 
Analyst Rui Ma highlighted a single sentence from the release as the most critical: DeepSeek validated their fine-grained Expert Parallelism (EP) scheme on Huawei Ascend NPUs (neural processing units). By achieving a 1.50x to 1.73x speedup on non-Nvidia GPU platforms, DeepSeek has provided a blueprint for high-performance AI deployment that is resilient to Western GPU supply chains and export controls. However, it's important to note that DeepSeek still claims it used officially licensed, legal Nvidia GPUs for DeepSeek V4's training, in addition to the Huawei NPUs. DeepSeek has also open-sourced the MegaMoE mega-kernel as a component of its DeepGEMM library. This CUDA-based implementation delivers up to a 1.96x speedup for latency-sensitive tasks like RL rollouts and high-speed agent serving. This move ensures that developers can run these massive models with extreme efficiency on existing hardware, further cementing DeepSeek’s role as the primary driver of open-source AI infrastructure. The technical report emphasizes that these optimizations are crucial for supporting a standard 1M context across all official services. Licensing and local deployment DeepSeek-V4 is released under the MIT License, the most permissive framework in the industry. This allows developers to use, copy, modify, and distribute the weights for commercial purposes without royalties—a stark contrast to the "restricted" open-weight licenses favored by other companies. For local deployment, DeepSeek recommends setting sampling parameters to temperature = 1.0 and top_p = 1.0. For those utilizing the "Think Max" reasoning mode, the team suggests setting the context window to at least 384K tokens to avoid truncating the model's internal reasoning chains. The release includes a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format and parse the model's output, including reasoning content. 
DeepSeek-V4 is also seamlessly integrated with leading AI agents like Claude Code, OpenClaw, and OpenCode. This native integration underscores its role as a bedrock for developer tools, providing an open-source alternative to the proprietary ecosystems of major cloud providers. Community reactions and what comes next The community reaction has been one of shock and validation. Hugging Face officially welcomed the "whale" back, stating that the era of cost-effective 1M context length has arrived. Industry experts noted that the "second DeepSeek moment" has effectively reset the developmental trajectory of the entire field, placing massive pressure on closed-source providers like OpenAI and Anthropic to justify their premiums. AI evaluation firm Vals AI noted that DeepSeek-V4 is now the "#1 open-weight model on our Vibe Code Benchmark, and it’s not close". DeepSeek is moving quickly to retire its older architectures. The company announced that the legacy deepseek-chat and deepseek-reasoner endpoints will be fully retired on July 24, 2026. All traffic is currently being rerouted to the V4-Flash architecture, signifying a total transition to the million-token standard. DeepSeek-V4 is more than just a new model; it is a challenge to the status quo. By proving that architectural innovation can substitute for raw compute-maximalism, DeepSeek has made the highest levels of AI intelligence accessible to the global developer community at a far lower cost — something that could benefit the globe, even at a time when lawmakers and leaders in Washington, D.C. are raising concerns about Chinese labs "distilling" from U.S. proprietary giants to train open source models, and fears of said open source or jailbroken proprietary models being used to create weapons and commit terror. 
The truth is, while all of these are potential risks — as they were and have been with prior technologies that broadened information access, like search and the internet itself — the benefits seem far outweigh them, and DeepSeek's quest to keep frontier AI models open is of benefit to the entire planet of potential AI users, especially enterprises looking to adopt the cutting-edge at the lowest possible cost.
- OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0After months of rumors and reports that OpenAI was developing a new, more powerful AI large language model for use in ChatGPT and through its application programming interface (API), allegedly codenamed "Spud" internally, the company has today unveiled its latest offering under the more formal name GPT-5.5. And to likely no one's surprise, it's hardly a "potato" in the disparaging sense of the word: GPT-5.5 retakes the lead for OpenAI in generally available LLMs, coming ahead of rivals Anthropic's and Google's latest public offerings, and even beating the private Anthropic Claude Mythos Preview model narrowly on one benchmark (essentially a statistical tie). "It’s definitely our strongest model yet on coding, both measured by benchmarks and based on the feedback that we’ve gotten from trusted partners, as well as our own experience," explained Amelia "Mia" Glaese, VP of Research at OpenAI, in a video call with journalists ahead of the launch earlier today. OpenAI positions GPT-5.5 as a fundamental redesign of how intelligence interacts with a computer's operating system and professional software stacks. "What is really special about this model is how much more it can do with less guidance," said OpenAI co-founder and president Greg Brockman on the same call. "It’s way more intuitive to use. It can look at an unclear problem and figure out what needs to happen next." Brockman proceeded to emphasize the areas in which users can expect to see gains from using GPT-5.5 compared to OpenAI's prior state-of-the-art model, GPT-5.4, which remains available (for now) to users and enterprises at half the API cost of its new successor. "It’s extremely good at coding," Brockman said of GPT-5.5. "It’s also great at broader computer work, computer use, scientific research—these kinds of applications that are very intelligent bottlenecks." 
OpenAI CEO and co-founder Sam Altman also weighed in on the launch and the company's philosophy in a post on X, writing, in part: "We want our users to have access to the best technology and for everyone to have equal opportunity."
The model is available in two variants: GPT-5.5 and GPT-5.5 Pro, distinguished by the latter offering enhanced precision and specialized logic for handling the most rigorous cognitive demands. While the standard version serves as the versatile flagship for general intelligence tasks, the Pro model is architected specifically for high-stakes environments such as legal research, data science, and advanced business analytics where accuracy is paramount. This premium tier provides noticeably more comprehensive and better-structured responses, supported by specialized latency optimizations that ensure high-quality performance during complex, multi-step workflows.
Unfortunately for third-party software developers, API access is not yet available for either GPT-5.5 or GPT-5.5 Pro and will be coming "very soon," according to the company's announcement blog post. "API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale," OpenAI writes.
For the time being, GPT-5.5 is available only to paying subscribers on the ChatGPT Plus ($20 monthly), Pro ($100-$200 monthly), Business, and Enterprise plans, with GPT-5.5 Pro access starting at the Pro tier and upwards.
A focus on agency
At the core of GPT-5.5 is a focus on "agentic" performance, specifically in coding, computer use, and scientific research. Unlike its predecessors, which often required granular, step-by-step prompting to avoid "hallucinating" a path forward, GPT-5.5 is designed to handle messy, multi-part tasks autonomously. It excels at researching online, debugging complex codebases, and moving between documents and spreadsheets without human intervention.
One of the most significant technical leaps is the model's efficiency. While larger models typically suffer from increased latency, GPT-5.5 matches the per-token latency of the previous GPT-5.4 while delivering a higher level of intelligence. This was achieved through a deep hardware-software co-design. OpenAI served GPT-5.5 on NVIDIA GB200 and GB300 NVL72 systems, utilizing custom heuristic algorithms, written by the AI itself, to partition and balance work across GPU cores. This optimization reportedly increased token generation speeds by over 20%.
For high-stakes reasoning, the "GPT-5.5 Thinking" mode in ChatGPT provides smarter, more concise answers by allowing the model more internal "compute time" to verify its own assumptions before responding. This capability is particularly visible in the model’s performance on "Expert-SWE," an internal OpenAI benchmark for long-horizon coding tasks with a median human completion time of 20 hours. GPT-5.5 notably outperformed GPT-5.4 on this metric while using significantly fewer tokens.
Benchmarks show OpenAI has retaken the lead in most powerful publicly available LLM over Claude Opus 4.7 (but the unreleased Mythos still outperforms it)
The market for leading U.S.-made frontier models has become an increasingly tight race between OpenAI, Anthropic, and Google.
Exactly a week ago to the day, OpenAI rival Anthropic released Opus 4.7, its most powerful generally available model, to the public, taking over the leaderboard in terms of the number of third-party benchmark tests in which it has the lead.
Yet today, GPT-5.5 has surpassed it and even Anthropic's heavily restricted, more powerful model Claude Mythos Preview, albeit only on one benchmark, Terminal-Bench 2.0, which tests "a model's ability to navigate and complete tasks in a sandboxed terminal environment." GPT-5.5 achieved 82.7% accuracy on Terminal-Bench 2.0, easily surpassing Opus 4.7 (69.4%) and narrowly beating the Mythos Preview (82.0%).
However, in multidisciplinary reasoning without tools, the landscape is more competitive. On Humanity's Last Exam without tools, GPT-5.5 Pro scored 43.1%, trailing behind Opus 4.7 (46.9%) and Mythos Preview (56.8%).

| Benchmark | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro | Mythos Preview* |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7 | 69.4 | 68.5 | 82.0 |
| Expert-SWE (Internal) | 73.1 | - | - | - |
| GDPval (wins or ties) | 84.9 | 80.3 | 67.3 | - |
| OSWorld-Verified | 78.7 | 78.0 | - | 79.6 |
| Toolathlon | 55.6 | - | 48.8 | - |
| BrowseComp | 84.4 | 79.3 | 85.9 | 86.9 |
| FrontierMath Tier 1–3 | 51.7 | 43.8 | 36.9 | - |
| FrontierMath Tier 4 | 35.4 | 22.9 | 16.7 | - |
| CyberGym | 81.8 | 73.1 | - | 83.1 |
| Tau2-bench Telecom (original prompts) | 98.0 | - | - | - |
| OfficeQA Pro | 54.1 | 43.6 | 18.1 | - |
| Investment Banking Modeling Tasks (Internal) | 88.5 | - | - | - |
| MMMU Pro (no tools) | 81.2 | - | 80.5 | - |
| MMMU Pro (with tools) | 83.2 | - | - | - |
| GeneBench | 25.0 | - | - | - |
| BixBench | 80.5 | - | - | - |
| Capture-the-Flags challenge tasks (Internal) | 88.1 | - | - | - |
| ARC-AGI-2 (Verified) | 85.0 | 75.8 | 77.1 | - |
| SWE-bench Pro (Public) | 58.6 | 64.3 | 54.2 | 77.8 |

This suggests that while OpenAI is winning on "computer use" and "agency," other models may still hold an edge in pure, zero-shot academic knowledge.
It is important to clarify that Mythos Preview is not a generally available product; Anthropic has classified it as a strategic defensive asset due to its high cybersecurity risks, restricting its access to a small, limited audience of trusted partners and government agencies. Because Mythos is excluded from broad commercial use, the primary market competition remains between GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7.
So when it comes to models that the general public can access, GPT-5.5 has retaken the crown for OpenAI, achieving the state-of-the-art across 14 benchmarks compared to 4 for Claude Opus 4.7 and 2 for Google Gemini 3.1 Pro. It dominates in agentic computer use, economic knowledge work (GDPval), specialized cybersecurity (CyberGym), and complex mathematics (FrontierMath).
In comparison, Claude Opus 4.7 leads on software engineering and reasoning without tools, while Gemini 3.1 Pro leads in three categories, specifically excelling in academic reasoning and financial analysis.
Increased costs for users
The shift in intelligence comes with a significant price increase for API developers, according to material OpenAI shared ahead of the model's public release. OpenAI has effectively doubled the entry price for its flagship model compared to the previous generation, and set the price of the most cutting-edge variant of the model, GPT-5.5 Pro, at six times that:

| Model | Input price (per 1M tokens) | Output price (per 1M tokens) |
|---|---|---|
| GPT-5.4 | $2.50 | $15.00 |
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5.5 Pro | $30.00 | $180.00 |

To mitigate these costs, OpenAI emphasizes that GPT-5.5 is more "token efficient," meaning it uses fewer tokens to complete the same task compared to GPT-5.4. For users requiring speed over depth, OpenAI also introduced a Fast mode in Codex, which generates tokens 1.5x faster but at a 2.5x price premium.
The "mini" and "nano" tiers seen in the GPT-5.4 era (priced at $0.75 and $0.20 per 1M input tokens respectively) currently have no GPT-5.5 equivalent, though the company notes that GPT-5.5 is rolling out to all subscription tiers, including Plus, Pro, and Enterprise.
Licensing and the 'cyber-permissive' frontier
OpenAI’s approach to safety and licensing for GPT-5.5 introduces a novel concept: Trusted Access for Cyber. Because the model is now capable of identifying and patching advanced security vulnerabilities, OpenAI has implemented stricter "cyber-risk classifiers" for general users.
For legitimate security professionals, however, OpenAI is offering a specialized "cyber-permissive" license. This program allows verified defenders (those responsible for critical infrastructure like power grids or water supplies) to use models like GPT-5.4-Cyber or unrestricted versions of GPT-5.5 with fewer refusals for security-related prompts.
This dual-use framework acknowledges that while AI can accelerate cyber defense, it can also be weaponized. Under OpenAI’s Preparedness Framework, GPT-5.5 is classified as "High" risk for biological and cybersecurity capabilities. To manage this, API deployments currently require different safeguards than the consumer-facing ChatGPT, and OpenAI is working with government partners to ensure these tools are used to strengthen, not undermine, digital resilience.
Initial reactions: losing access feels like having a 'limb amputated'
The early feedback from power users and engineers suggests that GPT-5.5 has crossed a psychological threshold in AI utility. For developers, the model's ability to maintain "conceptual clarity" across massive codebases is its standout feature.
"The first coding model I've used that has serious conceptual clarity," noted Dan Shipper, CEO of Every. Shipper tested the model by asking it to debug a complex system failure that had previously required a team of human engineers to rewrite; GPT-5.5 produced the same fix autonomously.
Similarly, Pietro Schirano, CEO of MagicPath, described a "step change" in performance when the model successfully merged a branch with hundreds of refactor changes into a main branch in a single, 20-minute pass.
Perhaps the most visceral reaction came from an anonymous engineer at NVIDIA, who had early access to the model: "Losing access to GPT-5.5 feels like I've had a limb amputated".
This sentiment is echoed in the scientific community. Derya Unutmaz, a professor at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a dataset of 28,000 genes, producing a report in minutes that would have normally taken his team months. Brandon White, CEO of Axiom Bio, went further, stating that if OpenAI continues this pace, "the foundations of drug discovery will change by the end of the year".
GPT-5.5 is more than an incremental update; it is a tool designed for a world where humans delegate entire workflows rather than single prompts. While the costs are higher and the safety guardrails tighter, the performance gains in agentic work suggest that AI is finally moving from the chat box and into the operating system.
Perhaps most astonishingly of all, OpenAI is not even nearing the limits of scaling, in which models are trained on more and more GPUs, according to researchers at the company. "We actually still have headroom to train significantly smarter models than this," said OpenAI chief scientist Jakub Pachocki.
- Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful generally available LLM
Anthropic is publicly releasing its most powerful large language model yet, Claude Opus 4.7, today, as it continues to keep an even more powerful successor, Mythos, restricted to a small number of external enterprise partners for cybersecurity testing and patching vulnerabilities in the software said enterprises use (which Mythos exposed rapidly).
The big headlines are that Opus 4.7 exceeds its most direct rivals (OpenAI's GPT-5.4, released in early March 2026, scarcely more than a month ago; and Google's latest flagship model Gemini 3.1 Pro from February) on key benchmarks including agentic coding, scaled tool-use, agentic computer use, and financial analysis.
But also, it's notable how tight the race is getting: on directly comparable benchmarks, Opus 4.7 only leads GPT-5.4 by 7-4.
It currently leads the market on the GDPVal-AA knowledge work evaluation with an Elo score of 1753, surpassing both GPT-5.4 (1674) and Gemini 3.1 Pro (1314).
Yet, the model does not represent a "clean sweep" across all categories. Competitors like GPT-5.4 and Gemini 3.1 Pro still hold the lead in specific domains such as agentic search, where GPT-5.4 scores 89.3% compared to Opus 4.7’s 79.3%, as well as in multilingual Q&A and raw terminal-based coding.
This positioning defines Opus 4.7 not as a unilateral victor in all AI tasks, but as a specialized powerhouse optimized for the reliability and long-horizon autonomy required by the burgeoning agentic economy.
Claude Opus 4.7 is available today across all major cloud platforms, including Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry, with API pricing held steady at $5/$25 per million tokens.
Improvement in hard sciences and agentic workflows
Claude Opus 4.7 is a direct evolution of the Opus 4.6 architecture, but its performance delta is most visible in the "hard" sciences of agentic workflows: software engineering and complex document reasoning. At its core, the model has been re-tuned to exhibit what Anthropic describes as "rigor".
This isn't just marketing parlance; it refers to the model’s new ability to devise its own verification steps before reporting a task as complete. For example, in internal tests, the model was observed building a Rust-based text-to-speech engine from scratch and then independently feeding its own generated audio through a separate speech recognizer to verify the output against a Python reference. This level of autonomous self-correction is designed to reduce the "hallucination loops" that often plagued earlier iterations of agentic software.
The most significant architectural upgrade is the move to high-resolution multimodal support. Opus 4.7 can now process images up to 2,576 pixels on their longest edge, roughly 3.75 megapixels. This represents a three-fold increase in resolution compared to previous iterations. For developers building "computer-use" agents that must navigate dense, high-DPI interfaces or for analysts extracting data from intricate technical diagrams, this change effectively removes the "blurry vision" ceiling that previously limited autonomous navigation. This visual acuity is reflected in benchmarks from XBOW, where the model jumped from a 54.5% success rate in visual-acuity tests to 98.5%.
On the benchmark front, Opus 4.7 has claimed the top spot in several critical categories:
Knowledge Work (GDPVal-AA): It achieved an Elo score of 1753, notably outperforming GPT-5.4 (1674) and Gemini 3.1 Pro (1314).
Agentic Coding (SWE-bench Pro): The model resolved 64.3% of tasks, compared to 53.4% for its predecessor.
Graduate-Level Reasoning (GPQA Diamond): It reached 94.2%, maintaining parity with the industry's most advanced models while improving on its internal consistency.
Visual Reasoning (arXiv Reasoning): With tools, the model scored 91.0%, a meaningful jump from the 84.7% seen in Opus 4.6.
Crucially, Anthropic warns that this increased precision requires a shift in how users approach prompting. Opus 4.7 follows instructions literally. While older models might "read between the lines" and interpret ambiguous prompts loosely, Opus 4.7 executes the exact text provided. This means that legacy prompt libraries may require re-tuning to avoid unexpected results caused by the model’s strict adherence to the letter of the request.
Controlling the 'thinking' budget
The "agentic" nature of Opus 4.7 (its tendency to pause, plan, and verify) comes with a trade-off in token consumption and latency. To address this, Anthropic is introducing a new "effort" parameter. Users can now select an xhigh (extra high) effort level, positioned between high and max, allowing for more granular control over the depth of reasoning the model applies to a specific problem. Internal data shows that while max effort yields the highest scores (approaching 75% on coding tasks), the xhigh setting provides a compelling sweet spot between performance and token expenditure.
To manage the costs associated with these more "thoughtful" runs, the Claude API is introducing "task budgets" in public beta. This allows developers to set a hard ceiling on token spend for autonomous agents, ensuring that a long-running debugging session doesn't result in an unexpected bill. These product changes signal a maturing market where AI is no longer a novelty but a production line item that requires fiscal and operational guardrails.
Furthermore, Opus 4.7 utilizes an updated tokenizer that improves text processing efficiency, though it can increase the token count of certain inputs by 1.0–1.35x.
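For teams trying to gauge what the tokenizer change could mean for a bill, the arithmetic can be sketched roughly as follows. This is an illustration only: the prices are the $5/$25 per million input/output tokens quoted in this article, while the helper name `run_cost`, the token counts, and the choice to apply the multiplier to input tokens alone are assumptions made for the example, not Anthropic's billing logic.

```python
# Rough cost estimate for one agent run under the updated tokenizer.
# Prices are the $5 / $25 per 1M input/output tokens quoted in this article.
# The helper name and the token counts below are hypothetical.

INPUT_PRICE_PER_M = 5.00    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 25.00  # USD per 1M output tokens


def run_cost(input_tokens: int, output_tokens: int,
             tokenizer_multiplier: float = 1.0) -> float:
    """Return the USD cost of a run.

    Assumption: the tokenizer multiplier scales only the input token count.
    """
    if not 1.0 <= tokenizer_multiplier <= 1.35:
        raise ValueError("multiplier outside the 1.0-1.35x range cited in the article")
    scaled_input = input_tokens * tokenizer_multiplier
    total = scaled_input * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Same hypothetical workload priced at both ends of the cited range.
baseline = run_cost(2_000_000, 200_000)            # multiplier 1.0
worst_case = run_cost(2_000_000, 200_000, 1.35)    # multiplier 1.35
print(f"baseline: ${baseline:.2f}, worst case: ${worst_case:.2f}")
```

With these made-up token counts the run goes from $15.00 to $18.50, an increase of about 23%. The real effect depends on how input-heavy a workload is and on how much extra output the chosen effort level produces.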
Within the Claude Code environment, the update brings a new /ultrareview command. Unlike standard code reviews that look for syntax errors, /ultrareview is designed to simulate a senior human reviewer, flagging subtle design flaws and logic gaps. Additionally, "auto mode," a setting where Claude can make autonomous decisions without constant permission prompts, has been extended to Max plan users.
Licensing, safety, and the "cyber" divide
Anthropic continues to walk a narrow line regarding cybersecurity. The recent announcement of the aforementioned cybersecurity partnership around Mythos with external industry partners, known as "Project Glasswing," highlighted the dual-use risks of high-capability models.
Consequently, while the flagship Mythos Preview model remains restricted, Opus 4.7 serves as the testbed for new automated safeguards. The model includes systems designed to detect and block requests that suggest high-risk cyberattacks, such as automated vulnerability exploitation.
To bridge the gap for the security industry, Anthropic is launching the Cyber Verification Program. This allows legitimate professionals (vulnerability researchers, penetration testers, and red-teamers) to apply for access to use Opus 4.7’s capabilities for defensive purposes. This "verified user" model suggests a future where the most capable AI features are not universally available, but gated behind professional credentials and compliance frameworks.
In cybersecurity vulnerability reproduction (CyberGym), Opus 4.7 maintains a 73.1% success rate, trailing Mythos Preview's 83.1% but leading GPT-5.4's 66.3%.
Initial reactions from industry partners reveal quantifiable improvements in production enterprise workflows
Early testimonials from enterprise customers shared by Anthropic indicate a tangible shift in how Opus 4.7 is perceived compared with 4.6, going from "impressed by the tech" to "relying on the output".
Clarence Huang, VP of Technology at Intuit, noted that the model’s ability to "catch its own logical faults during the planning phase" is a game-changer for velocity. This sentiment was echoed by Replit President Michele Catasta, who stated that the model achieved higher quality at a lower cost for tasks like log analysis and bug hunting, adding, "It really feels like a better coworker".
Other specific reactions included:
Cognition (Devin): CEO Scott Wu reported that Opus 4.7 can work coherently "for hours" and pushes through difficult problems that previously caused models to stall.
Notion: Sarah Sachs, AI Lead, highlighted a 14% improvement in multi-step workflows and a 66% reduction in tool-calling errors, making the agent feel like a "true teammate".
Factory Droids: Leo Tchourakov observed that the model carries work through to validation steps rather than "stopping halfway," a common complaint with previous frontier models.
Harvey: Niko Grupen, Head of Applied Research, noted the model's 90.9% score on BigLaw Bench, highlighting its "noticeably smarter handling of ambiguous document editing tasks".
Perhaps the most telling reaction came from Aj Orbach, CEO of a dashboard-building firm, who remarked on the model’s "design taste," noting that its choices for data-rich interfaces were of a quality he would "actually ship".
Should enterprises immediately upgrade to Opus 4.7?
For enterprise leaders, Claude Opus 4.7 represents a shift from generative AI as a "creative assistant" to a "reliable operative." But importantly, it is not a "clean win" for every use case. Instead, it is a decisive upgrade for teams building autonomous agents or complex software systems.
The primary value proposition is the model's new capability for self-verification and rigor; it no longer just generates an answer but creates internal tests to verify that the answer is correct before responding.
This reliability makes it a superior choice for long-horizon engineering tasks where the cost of human supervision is the primary bottleneck.
However, an immediate, wholesale migration from Opus 4.6 requires caution. The model's increased literalism in instruction following means that prompts engineered to be "loose" or conversational with previous versions may now produce unexpected or overly rigid results.
Furthermore, enterprises must prepare for a significant increase in operational costs. Opus 4.7 uses an updated tokenizer that can increase input token counts by 1.0–1.35x, and its tendency to "think harder" at high effort levels results in higher output token consumption. For legacy applications where prompts are fragile and margins are thin, a phased rollout with significant re-tuning is recommended.
Where it puts Anthropic in the AI race
This release arrives at a paradoxical moment for Anthropic. Financially, the company is an undisputed juggernaut, with venture capital firms reportedly extending investment offers at a staggering $800 billion valuation, more than double its $380 billion Series G valuation from February 2026. This momentum is fueled by explosive growth, with the company’s annual run-rate revenue skyrocketing to $30 billion in April 2026, driven largely by enterprise adoption and the success of Claude Code.
Yet, this commercial success is being contested by intense regulatory and technical friction. Anthropic is currently embroiled in a high-stakes legal battle with the U.S. Department of War (DoW), which recently labeled the company a "supply chain risk" after Anthropic refused to allow its models to be used for mass surveillance or fully autonomous lethal weapons. While a San Francisco judge initially blocked the designation, a federal appeals panel recently denied Anthropic’s bid to stay the blacklisting, leaving the company excluded from lucrative defense contracts during an active military conflict.
Simultaneously, Anthropic is fending off a growing rebellion from its most loyal power users. Despite the company's "market leader" status, developers have flooded GitHub and X with accusations of "AI shrinkflation," claiming that the preceding Opus 4.6 model and Claude Code product have been quietly degraded. Users report that recent versions are more prone to exploration loops, memory loss, and ignored instructions, leading some to describe the newly released Claude Code desktop app as "unpolished" and unbefitting a firm with a near-trillion-dollar valuation.
Opus 4.7 is Anthropic's attempt to silence these critics by proving that "deep thinking" can be paired with the rigorous execution that its enterprise clients now demand.
Ultimately, Opus 4.7 is a model defined by its discipline. In a market where models are often incentivized to be "helpful" to a fault (sometimes hallucinating answers to please the user), Opus 4.7 marks a return to rigor. By allowing users to control effort, set budgets, and verify outputs, Anthropic is moving closer to the goal of a truly autonomous digital labor force. For the engineering teams at Replit, Notion, and beyond, the shift from "watching the AI work" to "managing the AI's results" has officially begun.