Anthropic releases Claude Opus 4.7, narrowly retaking the lead for most powerful generally available LLM
Our take

Anthropic is publicly releasing its most powerful large language model yet, Claude Opus 4.7, today, even as it keeps an even more powerful successor, Mythos, restricted to a small number of external enterprise partners, who are using it for cybersecurity testing and for patching the vulnerabilities it rapidly exposed in the software those enterprises rely on.
The big headlines are that Opus 4.7 exceeds its most direct rivals — OpenAI's GPT-5.4, released in early March 2026, scarcely more than a month ago; and Google's latest flagship model Gemini 3.1 Pro from February — on key benchmarks including agentic coding, scaled tool-use, agentic computer use, and financial analysis.
But it is also notable how tight the race is getting: across directly comparable benchmarks, Opus 4.7 leads GPT-5.4 on only seven, versus four where GPT-5.4 keeps the edge.
It currently leads the market on the GDPVal-AA knowledge work evaluation with an Elo score of 1753, surpassing both GPT-5.4 (1674) and Gemini 3.1 Pro (1314).
Yet, the model does not represent a "clean sweep" across all categories.
Competitors like GPT-5.4 and Gemini 3.1 Pro still hold the lead in specific domains such as agentic search, where GPT-5.4 scores 89.3% compared to Opus 4.7’s 79.3%, as well as in multilingual Q&A and raw terminal-based coding.
This positioning defines Opus 4.7 not as an across-the-board winner in every AI task, but as a specialized powerhouse optimized for the reliability and long-horizon autonomy required by the burgeoning agentic economy.
Claude Opus 4.7 is available today across all major cloud platforms, including Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry, with API pricing held steady at $5/$25 per million tokens.
Improvement in hard sciences and agentic workflows
Claude Opus 4.7 is a direct evolution of the Opus 4.6 architecture, but its performance gains are most visible in the "hard" end of agentic workflows: software engineering and complex document reasoning.
At its core, the model has been re-tuned to exhibit what Anthropic describes as "rigor". This isn't just marketing parlance; it refers to the model’s new ability to devise its own verification steps before reporting a task as complete.
For example, in internal tests, the model was observed building a Rust-based text-to-speech engine from scratch and then independently feeding its own generated audio through a separate speech recognizer to verify the output against a Python reference.
This level of autonomous self-correction is designed to reduce the "hallucination loops" that plagued earlier iterations of agentic software.
The most significant architectural upgrade is the move to high-resolution multimodal support. Opus 4.7 can now process images up to 2,576 pixels on their longest edge—roughly 3.75 megapixels.
This represents a three-fold increase in resolution compared to previous iterations. For developers building "computer-use" agents that must navigate dense, high-DPI interfaces or for analysts extracting data from intricate technical diagrams, this change effectively removes the "blurry vision" ceiling that previously limited autonomous navigation.
This visual acuity is reflected in benchmarks from XBOW, where the model jumped from a 54.5% success rate in visual-acuity tests to 98.5%.
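For teams building vision-driven agents, the practical implication is that screenshots and diagrams should be sized to that ceiling before they are sent. The snippet below is a minimal, illustrative sketch: it assumes only the 2,576-pixel long-edge limit described above and uses the Pillow imaging library, not any Anthropic-specific API.

```python
from PIL import Image

MAX_LONG_EDGE = 2_576  # stated Opus 4.7 ceiling for an image's longest edge

def fit_to_opus_limit(path: str, out_path: str) -> None:
    """Downscale an image so its longest edge is at most MAX_LONG_EDGE pixels.

    Images already within the limit are re-saved at their original size.
    """
    img = Image.open(path)
    if max(img.size) > MAX_LONG_EDGE:
        # thumbnail() resizes in place and preserves the aspect ratio
        img.thumbnail((MAX_LONG_EDGE, MAX_LONG_EDGE))
    img.save(out_path)

fit_to_opus_limit("dashboard_screenshot.png", "dashboard_screenshot_resized.png")
```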
On the benchmark front, Opus 4.7 has claimed the top spot in several critical categories:
- Knowledge Work (GDPVal-AA): It achieved an Elo score of 1753, notably outperforming GPT-5.4 (1674) and Gemini 3.1 Pro (1314).
- Agentic Coding (SWE-bench Pro): The model resolved 64.3% of tasks, compared to 53.4% for its predecessor.
- Graduate-Level Reasoning (GPQA Diamond): It reached 94.2%, maintaining parity with the industry's most advanced models while improving on its internal consistency.
- Visual Reasoning (arXiv Reasoning): With tools, the model scored 91.0%, a meaningful jump from the 84.7% seen in Opus 4.6.
Crucially, Anthropic warns that this increased precision requires a shift in how users approach prompting. Opus 4.7 follows instructions literally. While older models might "read between the lines" and interpret ambiguous prompts loosely, Opus 4.7 executes the exact text provided. This means that legacy prompt libraries may require re-tuning to avoid unexpected results caused by the model’s strict adherence to the letter of the request.
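As a hypothetical illustration of the re-tuning involved (these prompts are invented for this article, not drawn from Anthropic's documentation), an instruction that older models would interpret charitably may need to be spelled out in full:

```python
# Illustrative only: a prompt that relied on older models "reading between
# the lines" versus one rewritten for Opus 4.7's literal instruction following.
legacy_prompt = "Clean up this module."

literal_prompt = (
    "Remove unused imports and dead code from this module. "
    "Do not rename public functions or change behavior. "
    "Return the full updated file, followed by a bullet list of every change made."
)
```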
Controlling the 'thinking' budget
The "agentic" nature of Opus 4.7—its tendency to pause, plan, and verify—comes with a trade-off in token consumption and latency.
To address this, Anthropic is introducing a new "effort" parameter. Users can now select an xhigh (extra high) effort level, positioned between high and max, allowing for more granular control over the depth of reasoning the model applies to a specific problem.
Internal data shows that while max effort yields the highest scores (approaching 75% on coding tasks), the xhigh setting provides a compelling sweet spot between performance and token expenditure.
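The exact request syntax may differ once the parameter ships broadly. The sketch below uses the Anthropic Python SDK and assumes, hypothetically, that the effort level is passed as a top-level `effort` field on the Messages API and that the model is addressable as `claude-opus-4-7`; both are placeholders rather than confirmed identifiers.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical: "claude-opus-4-7" and the `effort` field are illustrative
# placeholders, not confirmed names from Anthropic's documentation.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4_096,
    effort="xhigh",  # assumed spelling of the new extra-high effort level
    messages=[
        {"role": "user", "content": "Refactor this module and verify the tests still pass."}
    ],
)
print(response.content[0].text)
```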
To manage the costs associated with these more "thoughtful" runs, the Claude API is introducing "task budgets" in public beta. This allows developers to set a hard ceiling on token spend for autonomous agents, ensuring that a long-running debugging session doesn't result in an unexpected bill.
These product changes signal a maturing market where AI is no longer a novelty but a production line item that requires fiscal and operational guardrails.
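Until the task-budget beta's exact parameters are documented for a given deployment, the same guardrail can be approximated on the client side. The sketch below is a hypothetical pattern, not Anthropic's API: it tallies the usage reported on each response and halts an agent loop once a hard token ceiling is hit.

```python
TOKEN_BUDGET = 500_000  # hard ceiling for one autonomous session (illustrative)

def run_with_budget(client, steps, budget=TOKEN_BUDGET):
    """Run successive agent steps until the cumulative token budget is exhausted.

    `client` is an anthropic.Anthropic() instance; `steps` is an iterable of
    message lists. This is a client-side stand-in for the task-budget beta,
    not the feature itself; the model ID below is a hypothetical placeholder.
    """
    spent = 0
    results = []
    for messages in steps:
        response = client.messages.create(
            model="claude-opus-4-7",  # placeholder model ID
            max_tokens=4_096,
            messages=messages,
        )
        spent += response.usage.input_tokens + response.usage.output_tokens
        results.append(response)
        if spent >= budget:
            print(f"Token budget reached ({spent:,} tokens); halting the agent.")
            break
    return results
```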
Furthermore, Opus 4.7 utilizes an updated tokenizer that improves text processing efficiency, though it can increase the token count of certain inputs by 1.0–1.35x.
Within the Claude Code environment, the update brings a new /ultrareview command. Unlike standard code reviews that look for syntax errors, /ultrareview is designed to simulate a senior human reviewer, flagging subtle design flaws and logic gaps.
Additionally, "auto mode"—a setting where Claude can make autonomous decisions without constant permission prompts—has been extended to Max plan users.
Licensing, safety, and the "cyber" divide
Anthropic continues to walk a narrow line regarding cybersecurity. The recent announcement of the aforementioned Mythos cybersecurity partnership with external industry partners, known as "Project Glasswing," highlighted the dual-use risks of high-capability models.
Consequently, while the flagship Mythos Preview model remains restricted, Opus 4.7 serves as the testbed for new automated safeguards. The model includes systems designed to detect and block requests that suggest high-risk cyberattacks, such as automated vulnerability exploitation.
To bridge the gap for the security industry, Anthropic is launching the Cyber Verification Program. This allows legitimate professionals—vulnerability researchers, penetration testers, and red-teamers—to apply for access to use Opus 4.7’s capabilities for defensive purposes.
This "verified user" model suggests a future where the most capable AI features are not universally available, but gated behind professional credentials and compliance frameworks.
In cybersecurity vulnerability reproduction (CyberGym), Opus 4.7 maintains a 73.1% success rate, trailing Mythos Preview's 83.1% but leading GPT-5.4's 66.3%.
Initial reactions from industry partners reveal quantifiable improvements in production enterprise workflows
Early testimonials from enterprise customers, shared by Anthropic, indicate a tangible shift in how the model is perceived between Opus 4.6 and 4.7: from being "impressed by the tech" to "relying on the output".
Clarence Huang, VP of Technology at Intuit, noted that the model’s ability to "catch its own logical faults during the planning phase" is a game-changer for velocity.
This sentiment was echoed by Replit President Michele Catasta, who stated that the model achieved higher quality at a lower cost for tasks like log analysis and bug hunting, adding, "It really feels like a better coworker".
Other specific reactions included:
- Cognition (Devin): CEO Scott Wu reported that Opus 4.7 can work coherently "for hours" and pushes through difficult problems that previously caused models to stall.
- Notion: Sarah Sachs, AI Lead, highlighted a 14% improvement in multi-step workflows and a 66% reduction in tool-calling errors, making the agent feel like a "true teammate".
- Factory Droids: Leo Tchourakov observed that the model carries work through to validation steps rather than "stopping halfway," a common complaint with previous frontier models.
- Harvey: Niko Grupen, Head of Applied Research, noted the model’s 90.9% score on BigLaw Bench, highlighting its "noticeably smarter handling of ambiguous document editing tasks".
Perhaps the most telling reaction came from Aj Orbach, CEO of a dashboard-building firm, who remarked on the model’s "design taste," noting that its choices for data-rich interfaces were of a quality he would "actually ship".
Should enterprises immediately upgrade to Opus 4.7?
For enterprise leaders, Claude Opus 4.7 represents a shift from generative AI as a "creative assistant" to a "reliable operative."
But importantly, it is not a "clean win" for every use case.
Instead, it is a decisive upgrade for teams building autonomous agents or complex software systems. The primary value proposition is the model's new capability for self-verification and rigor; it no longer just generates an answer but creates internal tests to verify that the answer is correct before responding. This reliability makes it a superior choice for long-horizon engineering tasks where the cost of human supervision is the primary bottleneck.
However, an immediate, wholesale migration from Opus 4.6 requires caution. The model's increased literalism in instruction following means that prompts engineered to be "loose" or conversational with previous versions may now produce unexpected or overly rigid results.
Furthermore, enterprises must prepare for a significant increase in operational costs. Opus 4.7 uses an updated tokenizer that can increase input token counts by 1.0–1.35x, and its tendency to "think harder" at high effort levels results in higher output token consumption.
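As a back-of-the-envelope check, the published $5/$25-per-million-token pricing combined with the 1.0-1.35x tokenizer range bounds the input-side impact; the workload figures in the sketch below are illustrative assumptions, not measurements.

```python
INPUT_PRICE = 5.00 / 1_000_000    # USD per input token (published Opus 4.7 rate)
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token

def monthly_cost(input_tokens, output_tokens, tokenizer_factor=1.0):
    """Estimate monthly spend; tokenizer_factor models the 1.0-1.35x input inflation."""
    return (input_tokens * tokenizer_factor * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE)

# Illustrative workload: 2B input tokens and 400M output tokens per month.
baseline = monthly_cost(2_000_000_000, 400_000_000, tokenizer_factor=1.0)
worst_case = monthly_cost(2_000_000_000, 400_000_000, tokenizer_factor=1.35)
print(f"Baseline:   ${baseline:,.0f}/month")    # $20,000
print(f"Worst case: ${worst_case:,.0f}/month")  # $23,500
```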
For legacy applications where prompts are fragile and margins are thin, a phased rollout with significant re-tuning is recommended.
Where it puts Anthropic in the AI race
This release arrives at a paradoxical moment for Anthropic. Financially, the company is an undisputed juggernaut, with venture capital firms reportedly extending investment offers at a staggering $800 billion valuation—more than double its $380 billion Series G valuation from February 2026.
This momentum is fueled by explosive growth, with the company’s annual run-rate revenue skyrocketing to $30 billion in April 2026, driven largely by enterprise adoption and the success of Claude Code.
Yet, this commercial success is being contested by intense regulatory and technical friction. Anthropic is currently embroiled in a high-stakes legal battle with the U.S. Department of War (DoW), which recently labeled the company a "supply chain risk" after Anthropic refused to allow its models to be used for mass surveillance or fully autonomous lethal weapons.
While a San Francisco judge initially blocked the designation, a federal appeals panel recently denied Anthropic’s bid to stay the blacklisting, leaving the company excluded from lucrative defense contracts during an active military conflict.
Simultaneously, Anthropic is fending off a growing rebellion from its most loyal power users. Despite the company's "market leader" status, developers have flooded GitHub and X with accusations of "AI shrinkflation," claiming that the preceding Opus 4.6 model and Claude Code product have been quietly degraded.
Users report that recent versions are more prone to exploration loops, memory loss, and ignored instructions, leading some to describe the newly released Claude Code desktop app as "unpolished" and unbefitting a firm with a near-trillion-dollar valuation. Opus 4.7 is Anthropic's attempt to silence these critics by proving that "deep thinking" can be paired with the rigorous execution that its enterprise clients now demand.
Ultimately, Opus 4.7 is a model defined by its discipline. In a market where models are often incentivized to be "helpful" to a fault—sometimes hallucinating answers to please the user—Opus 4.7 marks a return to rigor. By allowing users to control effort, set budgets, and verify outputs, Anthropic is moving closer to the goal of a truly autonomous digital labor force. For the engineering teams at Replit, Notion, and beyond, the shift from "watching the AI work" to "managing the AI's results" has officially begun.
Related Articles
- OpenAI's GPT-5.5 is here, and it's no potato: narrowly beats Anthropic's Claude Mythos Preview on Terminal-Bench 2.0
- Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5
- Opus 4.7 vs Opus 4.6: Should You Switch? (Analytics Vidhya)