How DeepSeek’s radical architecture is shattering Silicon Valley's token moat
Our take

DeepSeek’s recent decision to make its 75% price cut on the V4 Pro model permanent marks a significant turning point in the AI landscape. This move not only disrupts the pricing structures of established Silicon Valley giants like OpenAI and Anthropic but also challenges the prevailing notion that high-quality AI solutions must come with exorbitant costs. As highlighted in our analysis of developments like "Has the hunt for AI compute uncovered the next Cerebras?" and "Visa invests in Replit to power agentic payments for developers," the shift in pricing dynamics is indicative of a broader trend towards more accessible, innovative, and competitive AI solutions.
DeepSeek's V4 Pro model is not just cheaper; it boasts performance metrics that rival those of its pricier counterparts, achieving substantial efficiencies through its unique hardware-software architecture. With input costs seven times lower and output costs 17 times cheaper than offerings from competitors, this architecture signals a seismic shift in how enterprises can approach AI integration. The lightweight V4 Flash model further exemplifies this cost-efficiency, making it 10x to 25x cheaper than entry-level alternatives. By addressing the cost barriers that have historically limited AI adoption, DeepSeek is effectively democratizing access to advanced AI capabilities, allowing organizations of various sizes to leverage powerful tools without the traditional financial constraints.
This disruption is particularly relevant given the growing scrutiny that companies like OpenAI and Anthropic are facing regarding their return on investment. With rising operational costs tied to high token usage and the increasing burden of justifying these expenses, as noted by major players such as Uber and Airbnb, the implications of DeepSeek’s cost leadership could lead to a significant realignment in enterprise AI strategies. The shift towards open-weight architectures, as demonstrated by DeepSeek's offerings, may compel other firms to rethink their approaches, especially if they wish to remain competitive in a market that is increasingly favoring cost-effective solutions.
Looking ahead, the question becomes: how will Silicon Valley respond to this challenge? The traditional reliance on proprietary models and extensive cloud infrastructures may no longer be sustainable as enterprises seek to optimize costs without sacrificing quality. The potential for a bifurcation in the AI market is palpable, where high-end, specialized applications coexist alongside more accessible, open models that cater to high-volume needs. As more companies experiment with these models, we may witness a shift in the competitive landscape that redefines the standards of AI utility and accessibility.
In conclusion, DeepSeek’s aggressive pricing strategy signifies not just a tactical move but a strategic challenge to the status quo in AI. As enterprises increasingly prioritize both performance and cost-effectiveness, we must closely monitor how legacy providers adjust their offerings in response. The unfolding dynamics could reshape not only the AI market but also broader business practices around technology adoption, potentially leading to a more diverse and competitive ecosystem that prioritizes innovation and efficiency over entrenched interests.
DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s frontier labs.
The reduction on DeepSeek V4 Pro directly undercuts comparable Western models used as workhorses for enterprise production. It is 7x cheaper on inputs and 17x cheaper on outputs than Anthropic’s Claude Sonnet or OpenAI’s GPT 5.5-Med, while the lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x.
The price cuts are enabled by a series of hardware-software innovations, especially around cache, that make DeepSeek's models radically more efficient to run. When hosted natively in China, DeepSeek’s cache-read pricing is a whopping 87x cheaper than Western clouds — a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture.
DeepSeek V4 Pro’s performance is ranked almost on par with Western frontier models, hitting 80.6% on coding-agent tasks via the SWE-bench Verified leaderboard and an elite reasoning score of 87.5 on the advanced MMLU-Pro technical index. Both V4 Pro and V4 Flash — a hyper-optimized speedy version for developers — are open-weight and issued under a permissive MIT license. This gives enterprises complete flexibility over deployment. This dual-model strategy allows technical teams to route their heaviest, multi-step autonomous agent workloads to the lightning-fast Flash model, while reserving the heavy Pro model for deep reasoning tasks, drastically lowering costs at a time when budget concerns have grown considerably.
This also comes at a time when the closed Western labs, in particular OpenAI and Anthropic, face intense return-on-investment scrutiny for their multi-billion-dollar general-purpose hardware infrastructure investments.
This deflationary collapse will not affect all Silicon Valley labs equally, signaling a permanent bifurcation of the enterprise AI market. While a premium, deterministic tier will endure for mission-critical engineering workflows, the high-volume background agentic layer is being completely commoditized by open weights. Ultimately, it creates a much more dangerous exposure for OpenAI — whose revenue mix relies heavily on general-purpose commodity API streams — than for software-insulated peers like Anthropic.
The token cost crisis
Uber says it burned through its entire 2026 budget for Claude Code and Cursor in just the first four months of the year; its COO said that the cost related to high token usage by some of its engineers was getting “harder to justify” without better products to show for it. Airbnb's Brian Chesky said last year that while the company uses OpenAI's latest models, it doesn't rely on them heavily in production, favoring faster, cheaper alternatives like Alibaba's Qwen. And in the latest episode of VentureBeat’s podcast Beyond the Pilot, Pinterest CTO Matt Madrigal confirmed that the company went all-in on an open-source AI strategy, post-training Alibaba’s open Qwen model on the company’s proprietary "taste graph" to drive Pinterest’s assistant and achieving frontier-like quality at a 90% reduction in costs. DeepSeek’s subsequent price drop makes the potential cost differences even greater.
Geopolitical headwinds and compliance defenses
Widespread enterprise adoption of Chinese models faces massive geopolitical headwinds in the West. For highly regulated U.S. giants in finance, healthcare, and defense, getting comfortable with DeepSeek will take time.
Even though an open-weights architecture under an MIT license allows a company to self-host the model locally and prevent active data exfiltration to foreign servers, corporate compliance boards remain deeply paranoid over software supply chain risks, potential hidden backdoors, and the legal threat of sudden federal sanctions.
Smaller, more nimble software teams, on the other hand, face far less bureaucratic gridlock. Free from multi-month security review cycles, these fast-moving organizations view the immediate 75% infrastructure savings as a massive competitive edge worth deploying right now.
The OpenRouter clearinghouse: mapping global token traffic
Take the token usage metrics on OpenRouter, a leading public proxy for which models are most popular among developers. OpenRouter gives developers an easy way to compare and deploy models, and while its data is by no means a full proxy for real model popularity, it confirms this structural migration is already taking place within company data pipelines. DeepSeek's V4 Flash model has captured the No. 1 position on the OpenRouter leaderboard over the past week, surging 48% in token usage. Its advanced counterpart, V4 Pro, sits at No. 6. DeepSeek’s top three models processed nearly 6 trillion tokens on OpenRouter over the past week, giving it a huge lead over competitors. For example, OpenAI’s premium model, GPT-5.5, has slipped down to No. 15 at 470B tokens.
It’s not clear exactly how much of the world’s token traffic is on OpenRouter. Conservative estimates put it at about 3%. It does not show the massive amounts of tokens being served by the APIs offered directly to developers by companies like Anthropic, OpenAI and Google. But recent estimates suggest OpenRouter processes between 15 and 40% of each of OpenAI’s and Google’s token usage, and growing, making it a significant indicator of relative trends regardless of the exact percentage it represents.
While skeptics often dismiss aggregator traffic as an indie developer signal rather than a reflection of Fortune 500 IT spend, the corporate pipeline reality is shifting. An infrastructure analysis by a leading venture capital firm, Andreessen Horowitz, revealed that enterprise production environments deploy a median of 14 different models simultaneously to price-route workloads and avoid single-vendor lock-in. This structural architecture shift is why OpenRouter recently secured a massive $113 million Series B funding round backed directly by the big enterprise data and software vendors that serve corporate America — including ServiceNow Ventures, Snowflake Ventures, Databricks Ventures, Nvidia's NVentures, and Google’s CapitalG. Stripe also cited OpenRouter’s enterprise customers in its decision to partner closely with the company.
That’s why DeepSeek’s surge on this leaderboard is so eye-opening. DeepSeek itself offers an API directly to developers, and so it too delivers more token traffic than what OpenRouter lets on.
Beyond chatbots: the rise of multi-step autonomous agents
The DeepSeek spike on OpenRouter indicates a deeper structural shift in how automated software architectures consume machine intelligence. Technical teams are moving beyond trivial, single-turn chatbots and starting to deploy more sophisticated autonomous agents that persist for hours at a time, recursively looping through codebases and data lakes. Their huge number of tool calls, and their continuous rereading of long context histories, mean AI token consumption grows far faster than the number of steps an agent takes.
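To make the token math concrete, here is a minimal sketch, assuming a hypothetical agent that resends its entire history as input on every step and appends a fixed number of new tokens each time. The function name and all numbers are illustrative assumptions, not vendor figures. Under this simple model, cumulative input tokens grow roughly with the square of the step count.

```python
def cumulative_input_tokens(steps: int, system_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens sent when every step rereads the full history."""
    total = 0
    history = system_tokens
    for _ in range(steps):
        total += history             # the whole context is sent as input again
        history += tokens_per_step   # tool call and result are appended to the context
    return total


# Hypothetical agent: 2,000-token system prompt, 1,000 new tokens per step.
short_run = cumulative_input_tokens(steps=10, system_tokens=2_000, tokens_per_step=1_000)
long_run = cumulative_input_tokens(steps=100, system_tokens=2_000, tokens_per_step=1_000)
print(short_run, long_run)
```

In this toy setup, running ten times as many steps sends roughly 79 times as many input tokens, which is why long-lived agents lean so heavily on cached context.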
Running these recursive loops on closed, premium Western APIs quickly creates unsustainable infrastructure costs. While corporate tech teams spent last year experimenting freely with early, single-turn prototypes without worrying about budgets, the onset of token-prolific autonomous agents has triggered an enterprise line-item crisis. VentureBeat's Q1 2026 research, which surveyed enterprise users at organizations with over 100 employees (n=65, in the U.S. software, finance and healthcare industries), confirms the shift: “Cost per token or licensing model” jumped from 25.4% in January to 36.7% in March, trailing only raw performance as the primary selection criterion for enterprise buyers.
DeepSeek target-optimized its weights for this specific trend of agentic high-token use. It has locked in on a standard input cost of $0.435 per million tokens and a standard output rate of $0.87 per million tokens, alongside a rock-bottom prefix-cached read cost of $0.003625 per million.
It's this third cost item — for cache — which is arguably the most significant. “If you measure how all of these agents now are using tokens, 80 to 90% of the tokens are cache-read tokens,” said Val Bercovici, Chief AI Officer at WEKA, a company that provides fast storage for much of this cache. “Which means that [that price] is almost by far the most important price, making the others irrelevant — nearly a rounding error. So what DeepSeek did is not just say we're going to be 5% cheaper, 10% cheaper, 20% cheaper. They're like 87x cheaper on that cache-read price with DeepSeek V4 Pro. So that's really set the industry on notice.”
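A rough way to see how the three prices interact is to cost out a single workload. The sketch below uses the per-million-token prices quoted above; the workload size and the 85% cache-read, 10% fresh input, 5% output split are hypothetical assumptions chosen to sit inside Bercovici's 80 to 90% range.

```python
# Prices in dollars per million tokens, as quoted in the article for DeepSeek V4 Pro.
INPUT_PRICE = 0.435
OUTPUT_PRICE = 0.87
CACHE_READ_PRICE = 0.003625


def workload_cost(total_tokens: int, cache_read_share: float, output_share: float) -> float:
    """Dollar cost of a workload split into cache reads, fresh input and output."""
    cache_tokens = total_tokens * cache_read_share
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - cache_tokens - output_tokens
    return (
        cache_tokens * CACHE_READ_PRICE
        + input_tokens * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000


# Hypothetical 1-billion-token agent run, with and without cache hits.
with_cache = workload_cost(1_000_000_000, cache_read_share=0.85, output_share=0.05)
no_cache = workload_cost(1_000_000_000, cache_read_share=0.0, output_share=0.05)
print(f"with cache: ${with_cache:.2f}, without cache: ${no_cache:.2f}")
```

Under these assumptions the bill drops from about $457 to about $90, and the 850 million cached tokens account for only about $3 of the total.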
The infrastructure coup: decoupling HBM from context
DeepSeek's core innovations are around hardware-software alignment. This is where we get a little technical.
Western frontier labs like OpenAI have prioritized performance at all costs, investing billions into uncompressed "dense" neural architectures. DeepSeek, by contrast, has systematically sought to extract maximum intelligence from lower-grade hardware, given that it has lacked access to Nvidia’s top-tier GPUs. By pioneering deep software optimizations as early as its V2 architectures in 2024, the lab engineered a series of four interconnected hardware-software alignment breakthroughs that decoupled a model's operational context from expensive computing overhead:
Breakthrough 1: Sequence dimension compression via CSA and HCA
The transformer architecture that most LLMs use is bottlenecked by something called the Key-Value (KV) cache. As an agent executes long, multi-step sessions, historical context keys clog the high-bandwidth memory (HBM) on the GPU, causing severe latency spikes and an expensive infrastructure tax.
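The size of an uncompressed KV cache follows from a standard back-of-the-envelope formula: one key vector and one value vector per token, per layer, per KV head. The sketch below applies it to a hypothetical model shape; the layer count, head count, head dimension and precision are illustrative assumptions, not the configuration of any model named in this article.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """Uncompressed KV cache size: a key and a value vector per token, layer and KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape at a 1-million-token context, stored in 16-bit precision.
size = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                      seq_len=1_000_000, bytes_per_value=2)
size_gb = size / 1e9
print(f"{size_gb:.1f} GB of KV cache")
```

Because the size scales linearly with sequence length, a context that is comfortable at 8,000 tokens can outgrow a single GPU's memory at 1 million.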
DeepSeek resolved this structural bottleneck by introducing a hybrid attention mechanism — documented in the DeepSeek V4 Architecture Paper — that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to cut overall KV-cache usage by a massive 90% across its 1-million-token context window.
While traditional models try to keep a unique memory log for every individual word, DeepSeek compresses the rows of its memory cache. CSA acts as a local filter, condensing small windows of text into concise, indexable blocks so the model doesn't sweat the fine-grained details. HCA acts as an aggressive global index, crushing massive spans of text deep within a session's history into high-density summaries. By interleaving these layers, DeepSeek shrinks millions of memory rows down to a fraction of their size.
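The idea of compressing cache rows at two different granularities can be illustrated with a toy example. This is only a sketch: the article does not specify the compression operator, so plain mean pooling is used here as a stand-in, and the function name and block sizes are hypothetical.

```python
from typing import List

Vector = List[float]


def compress_blocks(rows: List[Vector], block_size: int) -> List[Vector]:
    """Replace each run of `block_size` cached rows with their element-wise mean."""
    compressed = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        dim = len(block[0])
        compressed.append([sum(r[i] for r in block) / len(block) for i in range(dim)])
    return compressed


# Toy cache: 8 token rows with 2 features each.
cache = [[float(t), float(t) * 2.0] for t in range(8)]
local_summary = compress_blocks(cache, block_size=2)    # light compression: 8 rows -> 4
global_summary = compress_blocks(cache, block_size=8)   # heavy compression: 8 rows -> 1
print(len(cache), len(local_summary), len(global_summary))
```

The trade-off is the one the article describes: larger blocks save more memory but blur more fine-grained detail.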
Breakthrough 2: Native memory offloading via Multi-head Latent Attention (MLA)
Using something called Multi-head Latent Attention (MLA), DeepSeek strips the active memory footprint of its context history down to a fraction of standard models. It achieves this by running a physical division of labor between hardware chips. While traditional models force expensive GPUs to hold a session's entire history, DeepSeek’s architecture keeps only the tiny, highly compressed search index tags (the Keys) on the GPU. Meanwhile, it offloads the heavy data payloads (the Values) entirely into cheaper system memory and local storage tiers. Once the GPU handles the high-speed matching to find relevant data, it calls the values from storage only on an as-needed basis.
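The division of labor described above can be sketched as a two-tier lookup: score compact keys in fast memory, then fetch only the winning values from a slower store. This is a toy illustration of the article's description, not DeepSeek's implementation; the class, its fields and the dot-product scoring are all assumptions.

```python
from typing import Dict, List, Tuple


class TieredCache:
    """Toy cache: small keys stay in 'fast' memory, bulky values live in a 'slow' store."""

    def __init__(self) -> None:
        self.fast_keys: List[Tuple[int, List[float]]] = []   # stand-in for GPU memory
        self.slow_values: Dict[int, str] = {}                # stand-in for system memory or SSD
        self.slow_reads = 0

    def add(self, token_id: int, key: List[float], value: str) -> None:
        self.fast_keys.append((token_id, key))
        self.slow_values[token_id] = value

    def lookup(self, query: List[float], top_k: int) -> List[str]:
        # Score every key in the fast tier with a dot product against the query.
        scored = sorted(
            self.fast_keys,
            key=lambda item: sum(q * k for q, k in zip(query, item[1])),
            reverse=True,
        )
        results = []
        for token_id, _ in scored[:top_k]:
            self.slow_reads += 1          # only the best matches touch the slow tier
            results.append(self.slow_values[token_id])
        return results


cache = TieredCache()
cache.add(0, [1.0, 0.0], "intro")
cache.add(1, [0.0, 1.0], "stack trace")
cache.add(2, [0.7, 0.7], "config file")
hits = cache.lookup(query=[0.0, 1.0], top_k=2)
print(hits, cache.slow_reads)
```

Three entries are scored, but only two values are ever read from the slow tier, which is the property that keeps the expensive memory footprint small.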
DeepSeek’s architecture is so different that it is straining the inference engines that load an AI model's weights into GPU memory so it is ready for prompting. The three most popular engines (Nvidia's TensorRT-LLM, UC Berkeley's SGLang and the widely used vLLM) “are all being stretched to keep up with being able to offer it, which is not normal,” explains WEKA's Bercovici. "Every other open model has had some similarity to other open models. This one from DeepSeek is just built different."
DeepSeek's software engineering means its massive 1.6-trillion-parameter model requires an astonishingly tiny 5.48 GB of HBM to hold a 1-million-token context loop in production, according to calculations by an analyst using hardware modeling benchmarks. For comparison, smaller models built on standard grouped-query attention (GQA), such as Qwen3-235B-A22B, consume up to 89 GB of HBM under the exact same context load.
Model (architecture) | Active HBM needed at 1M context | Context length | Multi-step cached economics |
DeepSeek V4-Pro (1.6T MoE) | 5.48 GB | 1,000,000 tokens | 80% to 90% of workflow tokens |
Qwen3-235B-A22B (GQA Standard) | 89.00 GB | 1,000,000 tokens | Subject to steep hardware tax |
GPT-5.5 / Claude 4.7-class (Western Frontier / MoE) | 180+ GB | 1,000,000 tokens | Prohibitive premium infrastructure tax |
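The gap in the table is easier to appreciate as a ratio and as a count of concurrent sessions. The sketch below uses the two exact HBM figures from the table; the 100 GB budget reserved for context is a hypothetical number, and model weights are deliberately ignored.

```python
# Active HBM per 1M-token context, in GB, taken from the table above.
HBM_PER_CONTEXT_GB = {
    "DeepSeek V4-Pro": 5.48,
    "Qwen3-235B-A22B": 89.00,
}


def hbm_ratio(model: str, baseline: str) -> float:
    """How many times more HBM `model` needs per context than `baseline`."""
    return HBM_PER_CONTEXT_GB[model] / HBM_PER_CONTEXT_GB[baseline]


def contexts_per_budget(model: str, budget_gb: float) -> int:
    """How many full 1M-token contexts fit in an HBM budget (weights ignored)."""
    return int(budget_gb // HBM_PER_CONTEXT_GB[model])


ratio = hbm_ratio("Qwen3-235B-A22B", "DeepSeek V4-Pro")
# Hypothetical 100 GB of HBM set aside purely for context.
deepseek_contexts = contexts_per_budget("DeepSeek V4-Pro", 100.0)
qwen_contexts = contexts_per_budget("Qwen3-235B-A22B", 100.0)
print(f"{ratio:.1f}x, {deepseek_contexts} vs {qwen_contexts} contexts")
```

On these figures, the same memory budget holds 18 long-running sessions instead of one, which is where the per-token serving cost advantage comes from.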
DeepSeek’s extreme compression of the KV cache down to 5.48 GB of HBM is also a calculated geopolitical strategy to bypass U.S. export bans on top-tier Nvidia GPUs. By reducing the need for HBM and Nvidia’s CUDA ecosystem, DeepSeek’s software design allows frontier AI to run efficiently on domestic, lower-cost, and unsanctioned Chinese storage tiers like NAND flash, commodity SSDs, and LPDDR memory (produced by domestic giants like YMTC and CXMT).
Breakthrough 3: Ultra-low footprint inference via FP4 quantization-aware training (QAT)
To keep compute costs low over massive context windows, DeepSeek moved away from the old approach of scanning bulky, uncompressed numbers every time the model searches its memory. Instead, as detailed in the DeepSeek V4 Technical Report, the architecture applies an advanced form of data compression during training, directly on the active pathways it uses to find information.
This compression slashes memory demands to deliver a 2x hardware speedup, yet it maintains a near-flawless 99.7% accuracy in how the system targets and indexes specific data blocks. This engineering win allows enterprise workflows to process massive, multi-step agent tasks smoothly while keeping an exceptional 83.5% retrieval accuracy on extreme, million-token "needle-in-a-haystack" benchmarks—eliminating performance lags without draining expensive GPU power.
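Quantization-aware training generally works by rounding values to the low-precision grid during the forward pass, so the model learns to tolerate the rounding error. The sketch below shows that rounding step only, assuming the common E2M1 4-bit float layout and a single per-tensor scale; both are assumptions, since the article does not say which FP4 variant or scaling scheme DeepSeek uses.

```python
from typing import List

# Magnitudes representable in the E2M1 4-bit float format (an assumed FP4 variant).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(values: List[float]) -> List[float]:
    """Round values to a scaled FP4 grid, then map them back to full precision."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / FP4_GRID[-1]            # the largest input maps to the top grid point
    out = []
    for v in values:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return out


weights = [0.12, -0.6, 0.33, 1.2]
quantized = fake_quantize_fp4(weights)
print(quantized)
```

Each value now needs only 4 bits plus a shared scale, at the cost of small rounding errors such as 0.12 becoming roughly 0.1.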
Breakthrough 4: Ultra-scale training stability via manifold-constrained hyper-connections (mHC)
Training a 1.6-trillion-parameter model carries instability risk: signals passing through its many data pathways can cascade out of control and crash the run. DeepSeek resolved this with a framework called Manifold-Constrained Hyper-Connections (mHC), which uses a balancing routine to force the model's internal data tables to always sum to one, a mathematical safety valve that lets complex data move through deep networks without runaway spikes.
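One standard balancing routine of this kind is Sinkhorn-Knopp normalization, which alternately rescales the rows and columns of a positive matrix until both sum to one. The sketch below shows that routine on a small example; the article does not name the routine, so treating it as Sinkhorn-Knopp is an assumption, and the matrix values are made up.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn_balance(matrix: Matrix, iterations: int = 50) -> Matrix:
    """Alternately normalize rows and columns of a positive matrix so both sum to about 1."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iterations):
        for r in range(n_rows):                       # make each row sum to one
            row_sum = sum(m[r])
            m[r] = [x / row_sum for x in m[r]]
        for c in range(n_cols):                       # make each column sum to one
            col_sum = sum(m[r][c] for r in range(n_rows))
            for r in range(n_rows):
                m[r][c] /= col_sum
    return m


# Hypothetical 3x3 mixing matrix with positive entries.
mixing = [[4.0, 1.0, 1.0],
          [1.0, 3.0, 2.0],
          [2.0, 1.0, 5.0]]
balanced = sinkhorn_balance(mixing)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(balanced[r][c] for r in range(3)) for c in range(3)]
print(row_sums, col_sums)
```

A matrix balanced this way redistributes a signal across pathways without amplifying its overall size, which is the intuition behind the "safety valve" description.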
The infrastructure pivot: rebuilding corporate plumbing
DeepSeek’s architectural cache efficiency alters the underlying unit economics for the cloud platforms hosting these models. On developer aggregators like OpenRouter, where third-party providers routinely offer advanced endpoints at a loss to capture developer mindshare, this hardware-software decoupling changes the balance sheet. DeepSeek's extremely low costs likely allow it to turn a profit, at least when serving the model in China, Bercovici said.
This transformation in provider-side unit economics is mirrored on the buy-side, which shows a structural change happening across enterprise IT budgets. VentureBeat's Q1 2026 AI Infrastructure and Compute tracker survey — which tracks enterprise technology buyers at organizations with over 100 employees (n=53 in January, n=39 in February) across software, financial services, healthcare, and manufacturing sectors — revealed that enterprise adoption of custom, self-managed inference stacks utilizing open-source frameworks like Triton, vLLM, Ray, and Kubernetes surged from 11.3% to 17.9%. Because these software layers allow corporate engineering teams to deploy open-weights architectures natively across their own clusters, they act as an operational escape hatch from closed cloud ecosystems.
This software shift is paired with an aggressive hardware migration: enterprise workloads moving to specialized, inference-first AI clouds like CoreWeave, Lambda, and Crusoe grew from 30.2% to 35.9% in the latest survey window. These infrastructure metrics indicate that corporate technology leaders are no longer just prototyping with open alternatives; they are actively laying down the physical plumbing required to host architectures like DeepSeek V4 independently, increasingly pricing away the premium markup of Western API gatekeepers.
The strategic split for Western labs
This baseline cost reduction could soon fracture the competitive field in Silicon Valley by rewriting expectations for labs attempting to earn a return on massive infrastructure investments.
For now, though, the Silicon Valley music is unlikely to stop anytime soon. Anthropic remains on an extraordinary enterprise trajectory, driven by widespread adoption of Claude Code and its codebase-aware terminal execution. For enterprise engineering teams, paying a premium for Anthropic's deterministic accuracy makes perfect sense for core production software development. Yet even an elite frontier lab scaling at this pace must watch DeepSeek with caution: an open-weights architecture under an MIT license offering near-frontier utility at a 75% cost reduction places downward pricing pressure on the high-volume operational layers of any multi-agent system.
The primary structural margin squeeze may land more squarely on OpenAI, despite its aggressive pivot toward a multi-cloud footprint. To support its staggering consumer and API token volumes, OpenAI fundamentally altered its historic seven-year exclusive alliance with Microsoft, unbundling its distribution so it can serve models across Azure, Oracle, AWS, and Google Cloud. Yet this multi-cloud strategy, while providing raw capacity at scale, leaves the company intensely exposed to infrastructure commodity pressure.
Unlike Anthropic, which has successfully insulated its margins by embedding its models into premium, high-utility software environments like Claude Code, a massive portion of OpenAI's enterprise revenue relies on high-volume, general-purpose API token streams. To be fair, Western labs have already begun quietly retreating from this territory — aggressively launching deep batch API discounts, prompt caching features, and lightweight entry models to stem the bleed. Yet this tactical retreat only reinforces the structural crisis: Silicon Valley is actively conceding the high-volume commodity layer because they know they cannot defend its margins. When those exact same automated background workflows can be handled natively by highly intelligent open weights like DeepSeek V4, defending a premium price point for raw cloud text completion ceases to be a defensible strategy.
More significantly, unlike OpenAI or Anthropic, DeepSeek has much less interest in urgently building consumer wrappers or locking developers into subscription frameworks. Instead, DeepSeek is positioned for a longer-term ecosystem play. Supported by a massive state-backed funding round led by China’s "Big Fund" — which has pushed the startup's targeted valuation into the $10 billion to $45 billion range — the lab’s more likely objective is to prove the viability of a self-sufficient, independent Chinese AI hardware stack that could one day be worth up to $10 trillion.
Premium deterministic tier (Anthropic / OpenAI / Google) | High-volume agentic tier (DeepSeek / open ecosystems) |
Core Codebase Refactoring | Recursive Multi-Agent Loops |
Strict Corporate Compliance & Guardrails | Prefix-Cached Autonomous Tool Swarms |
Mission-Critical Financial/Legal Precision | Massive Real-Time Ingestion Logs |
High CapEx / R&D Premium Margins | Bare-Metal / Optimized HBM Economics |
The operational division between Western labs and models like DeepSeek V4 Pro is already showing up. Financial company Ramp benchmarked automated cybersecurity agent swarms and showed that while DeepSeek V4 Pro completely flatlines on the most complex security logic, it achieves a flawless 100% detection rate on high-volume baseline tasks like cloud configuration triage, significantly outperforming OpenAI’s GPT-5.5 (44%). For an enterprise CISO, the strategy is clear: offload the high-volume token burn of routine background noise to cheap open weights, and reserve premium frontier models strictly for the high-level reasoning required to catch the most sophisticated flaws.
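In code, this kind of tiering often reduces to a small routing function in front of the model clients. The sketch below is one hypothetical way to express it; the `Task` fields, the complexity score, the threshold and the tier names are all assumptions rather than anything Ramp or a vendor has published.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    complexity: float   # hypothetical 0-1 score from an upstream classifier
    est_tokens: int


def route(task: Task, complexity_threshold: float = 0.8) -> str:
    """Send routine, high-volume work to the cheap tier and hard reasoning to the premium tier."""
    if task.complexity >= complexity_threshold:
        return "premium-frontier"
    return "open-weights"


tasks = [
    Task("cloud config triage", complexity=0.2, est_tokens=5_000_000),
    Task("novel exploit chain analysis", complexity=0.95, est_tokens=200_000),
]
assignments = {t.name: route(t) for t in tasks}
print(assignments)
```

In practice the threshold would be tuned against benchmark results like the ones above, so that the premium tier only sees the tasks where it measurably outperforms.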
The enterprise verdict
For IT operations directors and data pipeline managers, the choice to migrate to an open architecture like DeepSeek V4-Pro is a smart governance decision. The open model gives companies total architecture control, allowing them to host it on-premise or via any specialized cloud layer they choose. Crucially, it provides enterprise infrastructure leads with a strategic operational fallback that closed vendors can’t match: the power to download raw model weights and execute them privately for zero marginal token cost if public cloud pricing or API access conditions change.
The assumption that closed frontier labs hold a permanent monopoly on useful enterprise reasoning has collapsed. While engineering directors will continue to pay a premium to protect specialized, deterministic workflows, the financial foundation of the frontier lab model has fundamentally shifted. By diverting the immense, day-to-day token volume of recursive background agents onto highly optimized, open-source clusters, enterprise teams are starving proprietary clouds of their highest-margin fuel. Silicon Valley’s multi-billion dollar token moat didn't just narrow — it was completely drained from the bottom up.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5The whale has resurfaced. DeepSeek, the Chinese AI startup offshoot of High-Flyer Capital Management quantitative analysis firm, became a near-overnight sensation globally in January 2025 with the release of its open source R1 model that matched proprietary U.S. giants. It's been an epoch in AI since then, and while DeepSeek has released several updates to that model and its other V3 series, the international AI and business community has been largely waiting with baited breath for the follow-up to the R1 moment. Now it's arrived with last night's release of DeepSeek-V4, a 1.6-trillion-parameter Mixture-of-Experts (MoE) model available free under commercially-friendly open source MIT License, which nears — and on some benchmarks, surpasses — the performance of the world’s most advanced closed-source systems at approximately 1/6th the cost over the application programming interface (API). This release—which DeepSeek AI researcher Deli Chen described on X as a "labor of love" 484 days after the launch of V3—is being hailed as the "second DeepSeek moment". As Chen noted in his post, "AGI belongs to everyone". It's available now on AI code sharing community Hugging Face and through DeepSeek's API. Frontier-class AI gets pushed into a lower price band The most immediate impact of the DeepSeek-V4 launch is economic. The corrected pricing table shows DeepSeek is not pricing its new Pro model at near-zero levels, but it is still pushing high-end model access into a far lower cost tier than the leading U.S. frontier models. DeepSeek-V4-Pro is priced through its API at $1.74 USD per 1 million input tokens on a cache miss and $3.48 per million output tokens. That puts a simple one-million-input, one-million-output comparison at $5.22. With cached input, the input price drops to $0.145 per million tokens, bringing that same blended comparison down to $3.625. 
That is dramatically cheaper than the current premium pricing from OpenAI and Anthropic. GPT-5.5 is priced at $5.00 per million input tokens and $30.00 per million output tokens, for a combined $35.00 in the same simple comparison. Claude Opus 4.7 is priced at $5.00 input and $25.00 output, for a combined $30.00. Model Input Output Total Cost Source Grok 4.1 Fast $0.20 $0.50 $0.70 xAI MiniMax M2.7 $0.30 $1.20 $1.50 MiniMax Gemini 3 Flash $0.50 $3.00 $3.50 Google Kimi-K2.5 $0.60 $3.00 $3.60 Moonshot MiMo-V2-Pro (≤256K) $1.00 $3.00 $4.00 Xiaomi MiMo GLM-5 $1.00 $3.20 $4.20 Z.ai GLM-5-Turbo $1.20 $4.00 $5.20 Z.ai DeepSeek-V4-Pro $1.74 $3.48 $5.22 DeepSeek GLM-5.1 $1.40 $4.40 $5.80 Z.ai Claude Haiku 4.5 $1.00 $5.00 $6.00 Anthropic Qwen3-Max $1.20 $6.00 $7.20 Alibaba Cloud Gemini 3 Pro $2.00 $12.00 $14.00 Google GPT-5.2 $1.75 $14.00 $15.75 OpenAI GPT-5.4 $2.50 $15.00 $17.50 OpenAI Claude Sonnet 4.5 $3.00 $15.00 $18.00 Anthropic Claude Opus 4.7 $5.00 $25.00 $30.00 Anthropic GPT-5.5 $5.00 $30.00 $35.00 OpenAI GPT-5.4 Pro $30.00 $180.00 $210.00 OpenAI On standard, cache-miss pricing, DeepSeek-V4-Pro comes in at roughly one-seventh the cost of GPT-5.5 and about one-sixth (1/6th) the cost of Claude Opus 4.7. With cached input, the gap widens: DeepSeek-V4-Pro costs about one-tenth as much as GPT-5.5 and about one-eighth as much as Claude Opus 4.7. The more extreme near-zero story belongs to DeepSeek-V4-Flash, not the Pro model. Flash is priced at $0.14 per million input tokens on a cache miss and $0.28 per million output tokens, for a combined $0.42. With cached input, that drops to $0.308. In that case, DeepSeek’s cheaper model is more than 98% below GPT-5.5 and Claude Opus 4.7 in a simple input-plus-output comparison, or nearly 1/100th the cost — though the performance dips significantly. DeepSeek is compressing advanced model economics into a much lower band, forcing developers and enterprises to revisit the cost-benefit calculation around premium closed models. 
For companies running large inference workloads, that price gap can change what is worth automating. Tasks that look too expensive on GPT-5.5 or Claude Opus 4.7 may become economically viable on DeepSeek-V4-Pro, and even more so on DeepSeek-V4-Flash. The launch does not make intelligence free, but it does make the market harder for premium providers to defend on performance alone. Benchmarking the frontier: DeepSeek-V4-Pro gets close, but GPT-5.5 and Opus 4.7 still lead on most shared tests DeepSeek-V4-Pro-Max is best understood as a major open-weight leap, not a clean across-the-board defeat of the newest closed frontier systems. The model’s strongest benchmark claims come from DeepSeek’s own comparison tables, where it is shown against GPT-5.4 xHigh, Claude Opus 4.6 Max and Gemini 3.1 Pro High and bests them on several tests, including Codeforces and Apex Shortlist. But that is not the same as a head-to-head against OpenAI’s newer GPT-5.5 or Anthropic’s newer Claude Opus 4.7. Looking only at DeepSeek-V4 versus the latest proprietary models, the picture is more restrained. On this shared set, GPT-5.5 and Claude Opus 4.7 still lead most categories. DeepSeek-V4-Pro-Max’s best showing is on BrowseComp, the benchmark measuring agentic AI web browsing prowess (especially highly containerized information), where it scores 83.4%, narrowly behind GPT-5.5 at 84.4% and ahead of Claude Opus 4.7 at 79.3%. On Terminal-Bench 2.0, DeepSeek scores 67.9%, close to Claude Opus 4.7’s 69.4%, but far behind GPT-5.5’s 82.7%. 
Benchmark DeepSeek-V4-Pro-Max GPT-5.5 GPT-5.5 Pro, where shown Claude Opus 4.7 Best result among these GPQA Diamond 90.1% 93.6% — 94.2% Claude Opus 4.7 Humanity’s Last Exam, no tools 37.7% 41.4% 43.1% 46.9% Claude Opus 4.7 Humanity’s Last Exam, with tools 48.2% 52.2% 57.2% 54.7% GPT-5.5 Pro Terminal-Bench 2.0 67.9% 82.7% — 69.4% GPT-5.5 SWE-Bench Pro / SWE Pro 55.4% 58.6% — 64.3% Claude Opus 4.7 BrowseComp 83.4% 84.4% 90.1% 79.3% GPT-5.5 Pro MCP Atlas / MCPAtlas Public 73.6% 75.3% — 79.1% Claude Opus 4.7 The shared academic-reasoning results favor the closed models: On GPQA Diamond, DeepSeek-V4-Pro-Max scores 90.1%, while GPT-5.5 reaches 93.6% and Claude Opus 4.7 reaches 94.2%. On Humanity’s Last Exam without tools, DeepSeek scores 37.7%, behind GPT-5.5 at 41.4%, GPT-5.5 Pro at 43.1% and Claude Opus 4.7 at 46.9%. With tools enabled, DeepSeek rises to 48.2%, but still trails GPT-5.5 at 52.2%, GPT-5.5 Pro at 57.2% and Claude Opus 4.7 at 54.7%. The agentic and software-engineering results are more mixed, but they still show DeepSeek-V4-Pro-Max trailing GPT-5.5 and Opus 4.7. On Terminal-Bench 2.0, DeepSeek’s 67.9% is competitive with Claude Opus 4.7’s 69.4%, but GPT-5.5 is much higher at 82.7%. On SWE-Bench Pro, DeepSeek’s 55.4% trails GPT-5.5 at 58.6% and Claude Opus 4.7 at 64.3%. On MCP Atlas, DeepSeek’s 73.6% is slightly behind GPT-5.5 at 75.3% and Claude Opus 4.7 at 79.1%. BrowseComp is the standout: DeepSeek’s 83.4% beats Claude Opus 4.7’s 79.3% and nearly matches GPT-5.5’s 84.4%, though GPT-5.5 Pro’s 90.1% remains well ahead. So ultimately, DeepSeek-V4-Pro-Max does not appear to dethrone GPT-5.5 or Claude Opus 4.7 on the benchmarks that can be directly compared across the companies’ published tables. But it gets close enough on several of them — especially BrowseComp, Terminal-Bench 2.0 and MCP Atlas — that its much lower API pricing becomes the headline. In practical terms, DeepSeek does not need to win every leaderboard row to matter. 
If it can deliver near-frontier performance on many enterprise-relevant agent and reasoning tasks at roughly one-sixth to one-seventh the standard API cost of GPT-5.5 or Claude Opus 4.7, it still forces a major rethink of the economics of advanced AI deployment.

DeepSeek-V4-Pro-Max is clearly the strongest open-weight model in the field right now, and it is unusually close to frontier closed systems on several practical benchmarks. While GPT-5.5 and Claude Opus 4.7 still retain the lead in most direct head-to-head comparisons across the companies' benchmark charts, DeepSeek V4 Pro gets close while being dramatically cheaper and openly available.

A big jump from DeepSeek V3.2

To understand the magnitude of this release, one must look at the performance gains of the base models. DeepSeek-V4-Pro-Base represents a significant advancement over the previous generation, DeepSeek-V3.2-Base. On world-knowledge benchmarks, V4-Pro-Base achieved 90.1 on MMLU (5-shot) compared to V3.2’s 87.8, and made a much larger jump on MMLU-Pro, from 65.5 to 73.5. The improvement in high-level reasoning and verified facts is even more pronounced: on SuperGPQA, V4-Pro-Base reached 53.9 compared to V3.2's 45.0, and on the FACTS Parametric benchmark, it more than doubled its predecessor's performance, jumping from 27.1 to 62.6. SimpleQA Verified scores also rose sharply, from 28.3 to 55.2.

The Long Context capabilities have also been refined. On LongBench-V2, V4-Pro-Base scored 51.5, significantly outpacing the 40.2 achieved by V3.2-Base. In Code and Math, V4-Pro-Base reached 76.8 on HumanEval (Pass@1), up from 62.8 on V3.2-Base. These numbers underscore that DeepSeek has not just optimized for inference cost, but has fundamentally improved the intelligence density of its base architecture.

The efficiency story is equally compelling for the Flash variant.
DeepSeek-V4-Flash-Base, despite utilizing a substantially smaller number of parameters, outperforms the larger V3.2-Base across a wide range of benchmarks, particularly in long-context scenarios.

A new information 'traffic controller,' Manifold-Constrained Hyper-Connections (mHC)

DeepSeek’s ability to offer these prices and performance figures is rooted in radical architectural innovations detailed in its technical report also released today, "Towards Highly Efficient Million-Token Context Intelligence."

The standout technical achievement of V4 is its native one-million-token context window. Historically, maintaining such a large context required massive memory for the key-value (KV) cache. DeepSeek solved this by introducing a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) to reduce initial token dimensionality and Heavily Compressed Attention (HCA) to aggressively compress the memory footprint for long-range dependencies. In practice, the V4-Pro model requires only 10% of the KV cache and 27% of the single-token inference FLOPs compared to its predecessor, DeepSeek-V3.2, even when operating at a 1M token context.

To stabilize a network of 1.6 trillion parameters, DeepSeek moved beyond traditional residual connections. The company's researchers incorporated Manifold-Constrained Hyper-Connections (mHC) to strengthen signal propagation across layers while preserving the model’s expressivity. mHC allows an AI to have a much wider flow of information (so it can learn more complex things) without the risk of the model becoming unstable or "breaking" during its training. It’s like giving a city a 10-lane highway but adding a perfect AI traffic controller to ensure no one ever hits the brakes.

This is paired with the Muon optimizer, which allowed the team to achieve faster convergence and greater training stability during pre-training on more than 32T diverse and high-quality tokens.
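
To give a sense of why the KV cache dominates long-context memory, here is a back-of-the-envelope sketch. It uses the textbook sizing formula for plain multi-head attention (one key and one value vector per layer, per KV head, per token), not DeepSeek's actual cache layout, and every dimension below (60 layers, 8 KV heads, head size 128, 2-byte values) is an invented illustration rather than a V3.2 or V4 specification. The 10% factor is the only figure taken from the article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Textbook KV-cache size for plain multi-head attention.

    The leading factor of 2 covers one key and one value vector per cached token.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dense baseline at a 1M-token context (illustrative numbers only).
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)

# The article's claim: V4-Pro needs about 10% of its predecessor's KV cache.
compressed = baseline // 10

print(f"baseline:  {baseline / 1e9:.1f} GB")
print(f"10% of it: {compressed / 1e9:.1f} GB")
```

With these invented dimensions the baseline comes to roughly 246 GB for a single million-token sequence, which is why a tenfold reduction matters more for serving cost than the raw parameter count might suggest.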
The pre-training data was refined to remove batched auto-generated content, mitigating the risk of model collapse and prioritizing unique academic values. The model’s 1.6T parameters utilize a Mixture-of-Experts (MoE) design where only 49B parameters are activated per token, further driving down compute requirements.

Training the mixture-of-experts (MoE) to work as a whole

DeepSeek-V4 was not simply trained; it was "cultivated" through a unique two-stage paradigm. First, through Independent Expert Cultivation, domain-specific experts were trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) using the GRPO (Group Relative Policy Optimization) algorithm. This allowed each expert to master specialized skills like mathematical reasoning or codebase analysis. Second, Unified Model Consolidation integrated these distinct proficiencies into a single model via on-policy distillation, where the unified model acts as the student and learns by minimizing a reverse KL loss against the teacher models. This distillation process ensures that the model preserves the specialized capabilities of each expert while operating as a cohesive whole.

The model’s reasoning capabilities are further segmented into three increasing "effort" modes. The "Non-think" mode provides fast, intuitive responses for routine tasks. "Think High" provides conscious logical analysis for complex problem-solving. Finally, "Think Max" pushes the boundaries of model reasoning, bridging the gap with frontier models on complex reasoning and agentic tasks. This flexibility allows users to match the compute effort to the difficulty of the task, further enhancing cost-efficiency.

Breaking the Nvidia GPU stranglehold with local Chinese Huawei Ascend NPUs

While the model weights are the headline, the software stack released alongside them is arguably more important for the future of "Sovereign AI."
Analyst Rui Ma highlighted a single sentence from the release as the most critical: DeepSeek validated its fine-grained Expert Parallelism (EP) scheme on Huawei Ascend NPUs (neural processing units). By achieving a 1.50x to 1.73x speedup on non-Nvidia GPU platforms, DeepSeek has provided a blueprint for high-performance AI deployment that is resilient to Western GPU supply chains and export controls. However, it's important to note that DeepSeek still claims it used officially licensed, legal Nvidia GPUs for DeepSeek V4's training, in addition to the Huawei NPUs.

DeepSeek has also open-sourced the MegaMoE mega-kernel as a component of its DeepGEMM library. This CUDA-based implementation delivers up to a 1.96x speedup for latency-sensitive tasks like RL rollouts and high-speed agent serving. This move ensures that developers can run these massive models with extreme efficiency on existing hardware, further cementing DeepSeek’s role as the primary driver of open-source AI infrastructure. The technical report emphasizes that these optimizations are crucial for supporting a standard 1M context across all official services.

Licensing and local deployment

DeepSeek-V4 is released under the MIT License, one of the most permissive licenses in the industry. This allows developers to use, copy, modify, and distribute the weights for commercial purposes without royalties, a stark contrast to the "restricted" open-weight licenses favored by other companies.

For local deployment, DeepSeek recommends setting sampling parameters to temperature = 1.0 and top_p = 1.0. For those utilizing the "Think Max" reasoning mode, the team suggests setting the context window to at least 384K tokens to avoid truncating the model's internal reasoning chains. The release includes a dedicated encoding folder with Python scripts demonstrating how to encode messages in OpenAI-compatible format and parse the model's output, including reasoning content.
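
For readers unfamiliar with the two sampling knobs, the sketch below shows what temperature and top_p do to a next-token distribution. It is a generic, dependency-free illustration of temperature scaling and nucleus (top-p) filtering, not code from DeepSeek's release; `nucleus_probs` and `sample_token` are hypothetical names chosen for this example, and the three-token logit vector is made up.

```python
import math
import random


def nucleus_probs(logits, temperature=1.0, top_p=1.0):
    """Return (token_index, probability) pairs that survive top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_p >= 1.0:
        kept = order  # top_p = 1.0 disables the filter entirely
    else:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break

    # Renormalize so the surviving probabilities sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return [(i, probs[i] / kept_mass) for i in kept]


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    pairs = nucleus_probs(logits, temperature, top_p)
    indices = [i for i, _ in pairs]
    weights = [p for _, p in pairs]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
print(nucleus_probs(toy_logits, temperature=1.0, top_p=1.0))
print(nucleus_probs(toy_logits, temperature=1.0, top_p=0.5))
```

With temperature = 1.0 and top_p = 1.0, both adjustments are no-ops: every token stays in play and the model samples from its unmodified softmax distribution. Lowering top_p trims the unlikely tail before sampling.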
DeepSeek-V4 is also seamlessly integrated with leading AI agents like Claude Code, OpenClaw, and OpenCode. This native integration underscores its role as a bedrock for developer tools, providing an open-source alternative to the proprietary ecosystems of major cloud providers.

Community reactions and what comes next

The community reaction has been one of shock and validation. Hugging Face officially welcomed the "whale" back, stating that the era of cost-effective 1M context length has arrived. Industry experts noted that the "second DeepSeek moment" has effectively reset the developmental trajectory of the entire field, placing massive pressure on closed-source providers like OpenAI and Anthropic to justify their premiums. AI evaluation firm Vals AI noted that DeepSeek-V4 is now the "#1 open-weight model on our Vibe Code Benchmark, and it’s not close."

DeepSeek is moving quickly to retire its older architectures. The company announced that the legacy deepseek-chat and deepseek-reasoner endpoints will be fully retired on July 24, 2026. All traffic is currently being rerouted to the V4-Flash architecture, signifying a total transition to the million-token standard.

DeepSeek-V4 is more than just a new model; it is a challenge to the status quo. By proving that architectural innovation can substitute for raw compute-maximalism, DeepSeek has made the highest levels of AI intelligence accessible to the global developer community at a far lower cost. That could benefit the globe, even at a time when lawmakers and leaders in Washington, D.C. are raising concerns about Chinese labs "distilling" from U.S. proprietary giants to train open source models, and voicing fears that such open source or jailbroken proprietary models could be used to create weapons and commit acts of terror.
The truth is, while all of these are potential risks (as they have been with prior technologies that broadened information access, like search and the internet itself), the benefits seem to far outweigh them, and DeepSeek's quest to keep frontier AI models open is of benefit to the entire planet of potential AI users, especially enterprises looking to adopt the cutting edge at the lowest possible cost.
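
As a footnote to the training recipe described above, the idea that gives GRPO its name can be shown in a few lines. In the published GRPO formulation, several responses are sampled for the same prompt and each response's reward is normalized against the group's mean and standard deviation, which removes the need for a separate learned value model. The sketch below shows only that normalization step. The function name, the epsilon guard and the choice of population standard deviation are illustrative assumptions, and nothing here is taken from DeepSeek-V4's training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; real implementations may differ
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that beat their group's average receive a positive advantage and the rest a negative one. When every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.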
- Anthropic finally beat OpenAI in business AI adoption, but 3 big threats could erase its lead

For the first time since the AI race began, more American businesses are paying for Anthropic's Claude than for OpenAI's ChatGPT. Adoption of Anthropic rose 3.8% in April to 34.4% of businesses, according to the May 2026 release of the Ramp AI Index. OpenAI's adoption fell 2.9% to 32.3%. Overall AI adoption among businesses rose 0.2 percentage points to 50.6%.

The crossover (published Tuesday by Ramp, the corporate card and finance automation platform that tracks spending patterns across more than 50,000 U.S. businesses) marks the culmination of a yearlong surge by Anthropic that few in the industry predicted. Anthropic has quadrupled its business adoption over the past year, while OpenAI grew its business adoption by only 0.3%.

But the same report that crowns a new market leader also warns that Anthropic's position may be more fragile than it appears, threatened by escalating costs, compute constraints, and the very token-based pricing model that has fueled the company's extraordinary revenue growth.

How Anthropic went from a niche player to the most popular AI model in corporate America

To appreciate the scale of the shift, consider where the two companies stood a year ago. In April 2025, OpenAI commanded roughly 32% of business AI adoption according to Ramp's underlying data, while Anthropic stood at under 8%. OpenAI had built an early, commanding lead as the consumer default: ChatGPT was where most people first encountered AI, and that momentum carried into corporate purchasing decisions.

Anthropic's path was different. The company was popular early on with the earliest adopters: engineers, AI evangelists, the technical vanguard inside organizations. As Ramp lead economist Ara Kharazian noted in the March 2026 edition of the index, Anthropic leveraged that early-adopter base to go mainstream.
By February, Anthropic was winning about 70% of head-to-head matchups against OpenAI among businesses purchasing AI services for the first time, a complete reversal of the trends observed in 2025.

The trajectory is visible in Ramp's underlying data. The company's adoption figures show Anthropic climbing from 0.03% of businesses in June 2023 to 7.94% by April 2025, then rocketing to 34.44% by April 2026. OpenAI, meanwhile, peaked near 36.5% in mid-2025 and has been slowly declining since.

The engine behind much of this growth is a single product: Claude Code, the company's agentic AI coding tool, which has become the fastest-growing product in Anthropic's history. A recent analysis estimated that 4% of all GitHub public commits worldwide were being authored by Claude Code, double the percentage from just one month prior.

Business Insider reported in April that the crossover was imminent. A Ramp spokesperson told the outlet that "at the current pace, Anthropic is on track to surpass OpenAI within the next two months," noting that it already led "among early adopters, including VC-backed companies, and in key sectors like software, finance, and professional services." That prediction proved accurate almost to the day.

AI adoption reaches a workplace tipping point, but the productivity revolution hasn't arrived yet

The Ramp data on business spending finds its complement in a separate workforce survey that underscores just how deeply AI has embedded itself into American economic life. For the first time in Gallup's measurement, half of employed American adults say they use AI in their role at least a few times a year, up from 46% the previous quarter. Frequent use is also increasing, with 13% of employees now saying they use AI daily and 28% reporting they use it a few times a week or more.

But the Gallup data, based on a February 2026 survey of 23,717 U.S.
employees, also suggests that the benefits of AI remain concentrated at the level of individual tasks rather than organizational transformation. Only about one in 10 employees in AI-adopting organizations strongly agree that artificial intelligence has transformed how work gets done. That finding is consistent with firm-level studies across the U.S., U.K., Germany, and Australia showing chief executives reporting minimal broad productivity effects from AI over the past three years, a notable gap between the hype cycle and operational reality.

The Ramp methodology captures a different but complementary signal. Where Gallup asks employees whether they use AI, Ramp measures whether their employer is writing checks for it. The index counts corporate card and invoice-based payments, identifying firms as AI adopters if they have a positive transaction amount for an AI product or service in a given month. As Ramp's methodology page notes, its results likely underestimate actual adoption because many employees use free AI tools or personal accounts for work tasks.

Taken together, the two datasets paint a picture of AI that is ubiquitous in the American workplace but has not yet delivered on its promise to fundamentally transform how organizations operate.

Why Anthropic's biggest threat might be the success of its own best-selling product

Perhaps the most striking aspect of Ramp's analysis is its refusal to declare a lasting winner. Kharazian identified three specific risks facing Anthropic even as the company takes the lead, and the most serious one stems from a structural tension baked into the company's business model. Anthropic makes more money when businesses purchase more tokens, meaning the company is incentivized to drive users toward more expensive models even when cheaper ones are sufficient. This dynamic is already creating budget crises at major enterprises.
Uber's CTO revealed that the company spent its entire 2026 AI budget in just four months, largely on Claude Code and Cursor, with engineers reporting monthly API costs between $500 and $2,000 per person. Adoption jumped from 32% to 84% of Uber engineers in a matter of months, and about 70% of committed code at Uber now comes from AI. The Uber case is a microcosm of a broader tension: Claude Code works, perhaps too well. When a productivity tool becomes so valuable that an organization's $3.4 billion R&D operation can't afford to keep the lights on, the resulting cost scrutiny could push enterprises toward cheaper alternatives.

At the same time, quality and reliability have suffered under the weight of demand. In recent weeks, users have experienced frequent outages, rate limits, and increasing dissatisfaction with Claude's results. Anthropic has responded by resetting usage limits and by striking a compute deal with SpaceX to access more than 300 megawatts of new capacity at the Colossus 1 data center in Memphis. CEO Dario Amodei said the company saw "80x growth per year in revenue and usage" for Q1 2026, when it had only planned for 10x. And Ramp economist Rafael Hajjar found that Anthropic's latest model update would triple token costs for any prompt that includes an image, a change that seems at odds with the company's already-acute cost and compute problems.

Open-source models and OpenAI's Codex could quickly erode Anthropic's narrow lead

The Ramp report points to competitive dynamics that could reshape the market within months. Some of the fastest-growing vendors on Ramp's platform in April were AI inference platforms that give companies access to cheap, open-source models, offering enterprises a way to get "good enough" AI at a fraction of the cost, particularly for routine tasks that don't require frontier model capabilities.

OpenAI's Codex presents an even more direct threat.
By most measures, it is a strong product that does many of the same tasks as Claude Code at a lower price point, and the switching cost between models is minimal. Uber itself is already testing Codex as a hedge, a move that could preview a broader pattern across enterprise tech.

OpenAI also retains enormous structural advantages. ChatGPT reached 900 million weekly active users by March 2026, dwarfing Claude's consumer footprint. Enterprise revenue now makes up more than 40% of OpenAI's total and is on track to reach parity with consumer revenue by the end of 2026. And OpenAI's $122 billion funding round, closed in March at an $852 billion valuation, gives it vast resources to compete on pricing, capacity, and product development.

Anthropic is not standing still on distribution. AWS recently launched Claude Platform on AWS, giving enterprises direct access to Anthropic's native platform through existing AWS credentials, billing, and access controls, a move that lowers procurement friction considerably. Anthropic has also announced compute agreements totaling billions of dollars with Amazon, Google, Microsoft, Nvidia, and others, though much of that capacity won't come online until late 2026 or 2027. Anthropic is reportedly in talks to raise another $50 billion at a valuation approaching $900 billion.

The unlikely reason businesses are choosing Claude over cheaper alternatives

Beneath the spending data and market share charts lies a more intriguing question: Why are businesses choosing Anthropic over a cheaper, comparably performing alternative? Kharazian explored this in his March analysis. Claude Code and OpenAI's Codex are roughly comparable products: on certain benchmarks, Codex is arguably better, and it's also cheaper. Yet Anthropic can't meet its own demand. Every plan still has usage limits and rate caps. The company is actively turning away revenue because it doesn't have the compute to serve it.
Despite charging more for roughly equivalent performance, Anthropic's demand is growing. Kharazian suggested the answer might be cultural. Earlier this year, Anthropic refused to agree to the Pentagon's terms of use for Claude, resulting in a blacklisting by the Department of Defense. OpenAI stepped in to offer its services in Anthropic's place. In the wake of that episode, users rallied around Anthropic, and Claude temporarily surpassed ChatGPT on the App Store. The question, Kharazian wrote, is whether choosing an AI model is becoming less like an enterprise procurement decision and "more like the green bubble/blue bubble distinction in iMessage: a signal of identity as much as a choice of technology." That observation may sound absurd for an enterprise software category. But Ramp's data tells a story that pure economics cannot fully explain. In a market where the products perform similarly, where the cheaper option is arguably better on benchmarks, and where switching costs are negligible, something other than spreadsheet logic is driving the biggest shift in AI market share since the industry began. As Kharazian noted in his report: "We have never seen a software industry as dynamic, where newcomers can disrupt market leaders in a matter of months, and where the pace of development overrides the typical forces of vendor stickiness." That dynamism cuts both ways. The same forces that propelled a company from 8% to 34% market share in twelve months could just as easily work in reverse. Anthropic's two-point lead was earned in the most volatile software market in modern history — and in this market, the distance between the throne and the floor has never been shorter.
- Miami startup Subquadratic claims 1,000x AI efficiency gain with SubQ model; researchers demand independent proof.

A little-known Miami-based startup called Subquadratic emerged from stealth on Tuesday with a sweeping claim: that it has built the first large language model to fully escape the mathematical constraint that has defined, and limited, every major AI system since 2017.

The company claims its first model, SubQ 1M-Preview, is the first LLM built on a fully subquadratic architecture, one where compute grows linearly with context length. If that claim holds, it would be a genuine inflection point in how AI systems scale. At 12 million tokens, the company says, its architecture reduces attention compute by almost 1,000 times compared to other frontier models, a figure that, if validated independently, would dwarf the efficiency gains of any existing approach.

The company is also launching three products into private beta: an API exposing the full context window, a command-line coding agent called SubQ Code, and a search tool called SubQ Search. It has raised $29 million in seed funding from investors including Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, and early investors in Anthropic, OpenAI, Stripe, and Brex. The New Stack reported that the raise values the company at $500 million.

The numbers Subquadratic is publishing are extraordinary. The reaction from the AI research community has been, to put it mildly, mixed, ranging from genuine curiosity to open accusations of vaporware. Understanding why requires understanding what the company claims to have solved, and why so many prior attempts to solve the same problem have fallen short.

The quadratic scaling problem has shaped the economics of the entire AI industry

Every transformer-based AI model, which includes virtually every frontier system from OpenAI, Anthropic, Google, and others, relies on an operation called "attention."
Every token is compared against every other token, so as inputs grow, the number of interactions, and the compute required to process them, scales quadratically. In plain terms: double the input size, and the cost doesn't double. It quadruples.

This relationship has shaped what gets built and what doesn't. The industry standard is 128,000 tokens for many AI models and up to 1 million tokens for frontier cloud models such as Claude Sonnet 4.7 and Gemini 3.1 Pro. Even at those sizes, the cost of processing long inputs becomes punishing.

The industry built an elaborate stack of workarounds to cope. RAG systems use a search engine to pull a small number of relevant results before sending them to the model, because sending the full corpus isn't feasible. Developers layer retrieval pipelines, chunking strategies, prompt engineering techniques, and multi-agent orchestration systems on top of models, all to route around the fundamental constraint that the model itself can't efficiently process everything at once.

Subquadratic's argument is that these workarounds are expensive, brittle, and ultimately limiting. As CTO Alexander Whedon told SiliconANGLE in an interview, "I used to manually curate prompts and retrieval systems and evals and conditional logic to chain together the workflows. And I think that that is kind of a waste of human intelligence and also limiting to the product quality."

Subquadratic's fix is deceptively simple: stop doing the math that doesn't matter

The company's approach, called Subquadratic Sparse Attention or SSA, is built on a straightforward premise: most of the token-to-token comparisons in standard attention are wasted compute. Instead of comparing every token to every other token, SSA learns to identify which comparisons actually matter and computes attention only over those positions. Crucially, the selection is content-dependent: the model decides where to look based on meaning, not on fixed positional patterns.
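
A toy sketch may help make the contrast concrete. The code below computes attention for a single query two ways: densely over every position, and sparsely over only the k positions whose keys score highest against that query. It illustrates content-dependent selection in general, not Subquadratic's SSA, whose selection mechanism the article does not describe. In particular, this toy selector still scores every key to find the top k, so it saves work only in the softmax and value-mixing step and is not itself subquadratic. All function names and the top-k rule are assumptions made for the example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, positions):
    """Softmax attention for one query, restricted to the given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in positions]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, positions):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out


def dense_attention(query, keys, values):
    """Standard attention: the query looks at every position."""
    return attend(query, keys, values, list(range(len(keys))))


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k positions whose keys best match this query."""
    # Toy selector: scores every key, so selection alone is still linear per query.
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    return attend(query, keys, values, chosen), chosen


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, chosen = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
print(chosen, output)
```

Because the chosen positions depend on the query's content, a different query over the same keys can select entirely different positions, which is the property the article contrasts with fixed positional patterns. Setting k equal to the sequence length reproduces dense attention exactly.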
Content-dependent selection allows the model to retrieve specific information from arbitrary positions across a very long context without paying the quadratic tax.

The practical payoff scales with context length, exactly the inverse of the problem it's trying to solve. According to the company's technical blog, SSA achieves a 7.2x prefill speedup over dense attention at 128,000 tokens, rising to 52.2x at 1 million tokens. As Whedon put it: "If you double the input size with quadratic scaling laws, you need four times the compute; with linear scaling laws, you need just twice."

The company says it trained the model in three stages (pretraining, supervised fine-tuning, and a reinforcement learning stage specifically targeting long-context retrieval failures), teaching the model to aggressively use distant context rather than defaulting to nearby information, a subtle failure mode that quietly degrades performance in existing systems.

Three benchmarks paint a strong picture, but what they leave out may matter more

On the surface, SubQ's benchmark numbers are competitive with or superior to models built by organizations spending billions of dollars. On SWE-Bench Verified, it scored 81.8% compared to Opus 4.6's 80.8% and DeepSeek 4.0 Pro's 80.0%. On RULER at 128,000 tokens, a standard benchmark for reasoning over extended inputs, SubQ scored 95%, edging out Claude Opus 4.6 at 94.8%. On MRCR v2, a demanding test of multi-hop retrieval across long contexts, SubQ posted a third-party verified score of 65.9%, compared with Claude Opus 4.7 at 32.2%, GPT-5.5 at 74%, and Gemini 3.1 Pro at 26.3%.

But several details warrant scrutiny. The benchmark selection is narrow: exactly three tests, all emphasizing long-context retrieval and coding, the precise tasks SubQ is designed for. Broader evaluations across general reasoning, math, multilingual performance, and safety have not been published. The company says a comprehensive model card is "coming soon."
According to The New Stack, each benchmark model was run only once due to high inference cost, and the SWE-Bench margin is, as the company's own paper acknowledges, "harness as much as model." In benchmark methodology, single runs without confidence intervals leave room for variance.

There is also a significant gap between SubQ's research results and its production model. On MRCR v2, the company reported a research score of 83, but the third-party verified production model scored 65.9. That 17-point gap between the lab result and the shipping product is notable and largely unexplained.

Subquadratic also told SiliconANGLE that on the RULER 128K benchmark, SubQ scored 95% accuracy at a cost of $8, compared with 94% accuracy and about $2,600 for Claude Opus, a remarkable cost claim. But the company has not publicly disclosed specific API pricing, making it impossible to independently verify the cost-per-task comparisons.

The AI research community's verdict ranges from 'genuine breakthrough' to 'AI Theranos'

Within hours of the announcement, the AI research community erupted into a debate that crystallized around a single question: Is this real? AI commentator Dan McAteer captured the binary mood in a widely shared post: "SubQ is either the biggest breakthrough since the Transformer... or it's AI Theranos." The comparison to the infamous blood-testing fraud company may be unfair, but it reflects the scale of the claims being made.

Skeptics zeroed in on several pressure points. Prominent AI engineer Will Depue initially noted that SubQ is "almost surely a sparse attention finetune of Kimi or DeepSeek," referring to existing open-source models. Whedon confirmed this on X, writing that the company is "using weights from open-source models as a starting point, as a function of our funding and maturity as a company."
Depue later escalated his criticism, writing that the company's O(n) scaling claims and the speedup numbers "don't seem to line up" and calling the communication "either incredibly poorly communicated or just not real."

Others raised structural questions. One developer noted that if SubQ truly reduces compute by 1,000x and costs less than 5% of Opus, the company should have no trouble serving it at scale, so why gate access through an early-access program? Developer Stepan Goncharov called the benchmarks "very interesting cherry-picked benchmarks," while another commenter described them as "suspiciously perfect."

But not everyone was dismissive. AI researcher John Rysana pushed back on the Theranos framing, writing that the work is "just subquadratic attention done well which is very meaningful for long context workloads," and that "odds of it being BS are extremely low." Linus Ekenstam, a tech commentator, said he was "extremely intrigued to see the real-world implications" particularly for complex AI-powered software.

Magic.dev made strikingly similar claims two years ago, and then went quiet

Perhaps the most pointed critique of SubQ's launch comes not from its specific claims but from recent history. Magic.dev announced a 100-million-token context-window model, LTM-2-mini, in August 2024, with a claimed 1,000x efficiency advantage, and raised roughly $500 million on the strength of those claims. As of early 2026, there is no public evidence of LTM-2-mini being used outside Magic.

The parallels are uncomfortable. Both companies claimed massive context windows. Both touted roughly 1,000x efficiency gains. Both targeted software engineering as their primary use case. And both launched with limited external access.

The broader research landscape reinforces the caution.
Kimi Linear, DeepSeek Sparse Attention, Mamba, and RWKV all promised subquadratic scaling, and all faced the same problem: architectures that achieve linear complexity in theory often underperform quadratic attention on downstream benchmarks at frontier scale, or they end up hybrid, mixing subquadratic layers with standard attention and losing the pure scaling benefits. A widely cited LessWrong analysis argued that these approaches "are all better thought of as 'incremental improvement number 93595 to the transformer architecture'" because practical implementations remain quadratic and "only improve attention by a constant factor."

Subquadratic is well aware of this history. Its own technical blog specifically addresses each prior approach (fixed-pattern sparse attention, state space models, hybrid architectures, and DeepSeek Sparse Attention) and argues that SSA avoids their tradeoffs. Whether it actually does remains an empirical question that only independent evaluation can settle.

A five-time founder, a former Meta engineer, and $29 million to prove the doubters wrong

The team behind the claims matters in evaluating them. CEO Justin Dangel is a five-time founder and CEO with a track record across health tech, insurtech, and consumer goods, and his companies have scaled to hundreds of employees, attracted institutional backing, and reached liquidity. CTO Alexander Whedon previously worked as a software engineer at Meta and served as Head of Generative AI at TribeAI, where he led over 40 enterprise AI implementations. The team includes 11 PhD researchers with backgrounds from Meta, Google, Oxford, Cambridge, ByteDance, and Adobe.

That is a credible collection of talent for an architecture-level research effort. But neither co-founder has published foundational AI research, and the company has not yet released a peer-reviewed paper. The technical report is listed as "coming soon."

The funding profile is unusual for a company making frontier AI claims.
Subquadratic raised $29 million at a reported $500 million valuation, a steep price for a seed-stage company with no publicly available model, no peer-reviewed research, and no disclosed revenue. The investor base, led by Tinder co-founder Mateen and former SoftBank partner Villamizar, skews toward consumer tech and growth investing rather than deep technical AI research. The company is not open-sourcing its weights but plans to offer training tools for enterprises to do their own post-training, and has set a 50-million-token context window target for Q4.
The real test for SubQ isn't benchmarks: it's whether the math survives independent scrutiny
Strip away the marketing language and the social media drama, and the underlying question Subquadratic is asking is genuinely important: Can AI systems break free of quadratic scaling without sacrificing the quality that makes them useful?
The stakes are enormous. If attention can be made truly linear without degrading retrieval and reasoning, the economics of AI shift fundamentally. Enterprise applications that today require elaborate retrieval pipelines (processing entire codebases, contracts, regulatory filings, medical records) become single-pass operations. The billions of dollars currently spent on RAG infrastructure, context management, and agentic orchestration become partially redundant.
Whedon's willingness to engage publicly with technical criticism, posting a technical blog post within hours of pushback, suggests a team that understands it needs to show its work, not just describe it. And to its credit, the company acknowledged openly that it builds on open-source foundations and that its model is smaller than those at the major labs.
Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them reliably make use of all that information.
The gap between a nominal context window and a functional one, between what a model accepts and what it reliably reasons over, remains one of the most important unsolved problems in AI. Subquadratic says it has closed that gap. If independent evaluation confirms that claim, the implications would ripple far beyond a single startup's valuation. If it doesn't, the company joins a growing list of long-context promises that sounded revolutionary on launch day and unremarkable six months later.
In computing, every fundamental constraint eventually falls. When it does, the breakthrough never comes from the direction the industry expected. The question hanging over Subquadratic is whether a team of 11 PhDs and a $29 million seed round actually found the answer that has eluded organizations spending thousands of times more, or whether they just found a better way to describe the problem.
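For readers who want to see why critics separate a constant-factor improvement from a genuine change in scaling, the distinction can be sketched with a back-of-the-envelope cost model. This is only an illustration under stated assumptions: the function names, the 16x factor, and the 4,096 units of per-token work are hypothetical values chosen for the example, not figures from Subquadratic, Magic.dev, or any other lab.

```python
# Back-of-the-envelope cost model. Every constant here is an illustrative
# assumption, not a measurement of any real model.

def quadratic_cost(n: int) -> int:
    """Full attention: every token interacts with every other token."""
    return n * n

def constant_factor_cost(n: int, factor: int = 16) -> int:
    """Still quadratic, only cheaper by a fixed (hypothetical) factor."""
    return n * n // factor

def linear_cost(n: int, per_token: int = 4096) -> int:
    """Linear scaling: each token does a fixed (hypothetical) amount of work."""
    return n * per_token

def speedups(n: int) -> tuple[float, float]:
    """Return (constant-factor speedup, linear speedup) over full attention."""
    q = quadratic_cost(n)
    return q / constant_factor_cost(n), q / linear_cost(n)

if __name__ == "__main__":
    for n in (131_072, 1_048_576, 16_777_216):
        cf, lin = speedups(n)
        print(f"n={n:>11,}  constant-factor={cf:6.1f}x  linear={lin:8.1f}x")
```

Under these assumptions the constant-factor speedup stays at 16x at every context length, while the linear model's advantage grows from 32x at about 131,000 tokens to 4,096x at about 16.8 million. A single headline speedup figure therefore says little on its own; what independent evaluators would need to see is how the speedup changes as context length grows.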