Mystery solved: Anthropic reveals changes to Claude's harnesses and operating instructions likely caused degradation
Our take

For several weeks, a growing chorus of developers and AI power users claimed that Anthropic’s flagship models were losing their edge. Users across GitHub, X, and Reddit reported a phenomenon they described as "AI shrinkflation"—a perceived degradation where Claude seemed less capable of sustained reasoning, more prone to hallucinations, and increasingly wasteful with tokens.
Critics pointed to a measurable shift in behavior, alleging that the model had moved from a "research-first" approach to a lazier, "edit-first" style that could no longer be trusted for complex engineering.
While the company initially pushed back against claims of "nerfing" the model to manage demand, the mounting evidence from high-profile users and third-party benchmarks created a significant trust gap.
Today, Anthropic addressed these concerns directly, publishing a technical post-mortem that identified three separate product-layer changes responsible for the reported quality issues.
"We take reports about degradation very seriously," reads Anthropic's blog post on the matter. "We never intentionally degrade our models, and we were able to immediately confirm that our API and inference layer were unaffected."
Anthropic says it has resolved the issues by reverting the reasoning-effort change and the verbosity prompt, and by fixing the caching bug in v2.1.116.
The mounting evidence of degradation
The controversy gained momentum in early April 2026, fueled by detailed technical analyses from the developer community. Stella Laurenzo, a Senior Director in AMD’s AI group, published an exhaustive audit on GitHub covering 6,852 Claude Code session files and more than 234,000 tool calls, arguing that performance had measurably declined from her earlier sessions.
Her findings suggested that Claude’s reasoning depth had fallen sharply, leading to reasoning loops and a tendency to choose the "simplest fix" rather than the correct one.
This anecdotal frustration was seemingly validated by third-party benchmarks. BridgeMind reported that Claude Opus 4.6’s accuracy had dropped from 83.3% to 68.3% in their tests, causing its ranking to plummet from No. 2 to No. 10.
Although some researchers argued these specific benchmark comparisons were flawed due to inconsistent testing scopes, the narrative that Claude had become "dumber" became a viral talking point. Users also reported that usage limits were draining faster than expected, leading to suspicions that Anthropic was intentionally throttling performance to manage surging demand.
The causes
In its post-mortem blog post, Anthropic clarified that while the underlying model weights had not regressed, three specific changes to the "harness" surrounding the models had inadvertently hampered their performance:
Default Reasoning Effort: On March 4, Anthropic changed the default reasoning effort for Claude Code from high to medium to address UI latency issues. The change was intended to keep the interface from appearing "frozen" while the model thought, but it caused a noticeable drop in intelligence on complex tasks.
A Caching Logic Bug: Shipped on March 26, a caching optimization meant to prune old "thinking" from idle sessions contained a critical bug. Instead of clearing the thinking history once after an hour of inactivity, it cleared the history on every subsequent turn, causing the model to lose its "short-term memory" and become repetitive or forgetful (see the sketch after this list).
System Prompt Verbosity Limits: On April 16, Anthropic added instructions to the system prompt to keep text between tool calls under 25 words and final responses under 100 words. This attempt to reduce verbosity in Opus 4.7 backfired, causing a 3% drop on coding quality evaluations.
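To make the caching failure mode concrete, here is a minimal, hypothetical sketch of how that kind of idle-session pruning can misfire. The session object, field names, and timeout logic are our illustration of the bug class Anthropic describes, not Anthropic's actual code:

```python
import time

IDLE_TTL_SECONDS = 60 * 60  # prune "thinking" blocks after an hour of inactivity

class Session:
    def __init__(self):
        self.thinking_history = []      # prior thinking blocks: the model's short-term memory
        self.last_active = time.time()  # timestamp of the last user turn

    def on_turn_buggy(self, now: float) -> None:
        # BUG: last_active is never refreshed, so once an hour has elapsed
        # the condition holds on EVERY later turn and the history is wiped
        # again and again -- the model turns repetitive and forgetful.
        if now - self.last_active > IDLE_TTL_SECONDS:
            self.thinking_history.clear()

    def on_turn_fixed(self, now: float) -> None:
        # FIX: prune at most once per idle period, then reset the idle clock
        # so an actively used session keeps its thinking history intact.
        if now - self.last_active > IDLE_TTL_SECONDS:
            self.thinking_history.clear()
        self.last_active = now
```

In the buggy variant, a session that once sat idle for an hour loses its accumulated thinking on every turn thereafter; resetting the idle clock confines the pruning to a single event, matching Anthropic's description of the intended behavior.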
Impact and future safeguards
The quality issues extended beyond the Claude Code CLI, affecting the Claude Agent SDK and Claude Cowork, though the Claude API was not impacted.
Anthropic admitted that these changes made the model appear to have "less intelligence," which they acknowledged was not the experience users should expect.
To regain user trust and prevent future regressions, Anthropic is implementing several operational changes:
Internal Dogfooding: A larger share of internal staff will be required to use the exact public builds of Claude Code to ensure they experience the product as users do.
Enhanced Evaluation Suites: The company will now run a broader suite of per-model evaluations and "ablations" for every system prompt change to isolate the impact of specific instructions (a simplified sketch follows this list).
Tighter Controls: New tooling has been built to make prompt changes easier to audit, and model-specific changes will be strictly gated to their intended targets.
Subscriber Compensation: To account for the token waste and performance friction caused by these bugs, Anthropic has reset usage limits for all subscribers as of April 23.
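For readers unfamiliar with the term, a prompt "ablation" compares the same evaluation suite with and without a single candidate instruction. The harness below is a hypothetical, stubbed-out sketch of the idea — the rule text echoes the verbosity instruction described above, and the model call is faked, since a real harness would grade actual Claude outputs:

```python
import random

BASE_PROMPT = "You are a careful coding agent."
CANDIDATE_RULE = "Keep text between tool calls under 25 words."  # instruction under test

def run_task(task_id: int, system_prompt: str) -> bool:
    """Stand-in for running one eval task against the model and grading it."""
    random.seed(f"{task_id}:{system_prompt}")  # deterministic fake outcome
    return random.random() < 0.80

def pass_rate(system_prompt: str, n_tasks: int = 100) -> float:
    return sum(run_task(i, system_prompt) for i in range(n_tasks)) / n_tasks

control = pass_rate(BASE_PROMPT)
variant = pass_rate(BASE_PROMPT + "\n" + CANDIDATE_RULE)
print(f"control: {control:.1%}  variant: {variant:.1%}  delta: {variant - control:+.1%}")
```

Holding everything fixed except one instruction is what makes a regression like the 3% coding-quality drop attributable to that specific instruction, rather than to the model or the rest of the prompt.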
The company intends to use its new @ClaudeDevs account on X, along with GitHub threads, to explain the reasoning behind future product decisions and maintain a more transparent dialogue with its developer base.
Related Articles
- Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back
- Anthropic cuts off the ability to use Claude subscriptions with OpenClaw and third-party AI agents
- Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma
- Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful generally available LLM