14 things Claude Opus 4.7 actually does better than Opus 4.6 and Sonnet 4.6

Nick Zarzycki

Anthropic released Claude Opus 4.7 on April 16, positioning it as a focused upgrade for teams building agents, long-running coding workflows, and vision-based tools rather than a broad leaderboard sweep. The model handles images up to 2,576 pixels on the long edge (about 3.75 megapixels), reads cluttered screens at 98.5% accuracy versus 54.5% for Opus 4.6, and shows measurable gains in tool use, error recovery, and long-context coherence. Key improvements target production failure modes like infinite loops, tool failures, and hallucination on missing data, with a notable jump on SWE-bench Pro (53.4% to 64.3%).

TL;DR

  • Opus 4.7 processes images up to 3.75 megapixels, more than 3x the resolution of earlier Claude models, with 98.5% accuracy on visual-acuity tests versus 54.5% for Opus 4.6
  • Tool-use accuracy reaches 77.3%, with a 14% real-world performance lift over Opus 4.6, a third of the tool errors, and stronger recovery from tool failures
  • Coding performance jumps to 64.3% on SWE-bench Pro (up from 53.4%), beating GPT-5.4 at 57.7% and solving four previously unsolved tasks
  • Model stays coherent across long sessions, resists infinite loops, admits when data is missing, and maintains context across multi-day projects

Why it matters

Opus 4.7 targets the operational pain points that block production AI systems: vision quality for screen reading, tool reliability for agent workflows, and reasoning stability for long-running tasks. These are not headline benchmarks but the failure modes that make or break deployed systems. The improvements suggest Anthropic is optimizing for real-world agent and automation use cases rather than chasing raw capability metrics.

Business relevance

For teams building internal agents, RPA tools, and document analysis systems, Opus 4.7 removes friction that currently requires expensive workarounds like image preprocessing, tool retry logic, and session management. The gains in tool accuracy and error recovery directly reduce operational overhead and improve system reliability without requiring architectural changes. Teams working with enterprise documents and long-context workflows can ship more ambitious automation with less engineering effort.

Key implications

  • Vision improvements enable new use cases in screen reading and diagram analysis that were previously unreliable, lowering the barrier to building visual agents
  • Tool-use gains and error recovery suggest Anthropic is building toward more autonomous agent systems that can handle real-world tool failures without human intervention
  • Long-context coherence and session memory improvements position Opus 4.7 for multi-day investigation and analysis workflows, expanding the scope of tasks AI can handle unsupervised
  • The focus on failure modes over raw benchmarks indicates a shift toward production-grade reliability as a competitive differentiator in the LLM market

What to watch

Monitor whether Opus 4.7's improvements in tool use and error recovery translate to measurable gains in agent adoption and automation deployment rates. Watch for competitive responses from OpenAI and Google on vision quality and tool reliability, as these are becoming table stakes for production AI systems. Track whether the long-context improvements enable new categories of multi-day analysis and investigation workflows that were previously impractical.

April 2026

14 things Claude Opus 4.7 actually does better than Opus 4.6 and Sonnet 4.6

The headline numbers are real, but the interesting improvements are the ones that change what teams can ship, not just what they can score.

Anthropic shipped Opus 4.7 on April 16, and the early benchmark dust is settling into a clearer picture. This is not a model that sweeps every leaderboard. It is a focused upgrade aimed at a specific kind of user: the one building agents, long-running coding workflows, and tools that read the real world through pixels. Below are fourteen places where Opus 4.7 does something meaningfully better than its predecessors, with the numbers behind each claim.

The Fourteen
01

Handles much sharper images

Opus 4.7 can process images up to 2,576 pixels on the long edge, which works out to about 3.75 megapixels. That is more than three times what earlier Claude models could see. If your work involves screenshots, diagrams, or photos with fine detail, this is the upgrade that changes what is possible.
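
The arithmetic is easy to sanity-check. A minimal sketch in Python; the aspect ratios below are our assumption for illustration, since only the long-edge figure is published:

```python
# A 2,576-pixel long edge lands near 3.75 megapixels only for a wide
# frame. The aspect ratios here are assumptions, not a published spec.
def megapixels(long_edge: int, aspect_w: int, aspect_h: int) -> float:
    short_edge = long_edge * aspect_h / aspect_w
    return long_edge * short_edge / 1_000_000

print(f"16:9 -> {megapixels(2576, 16, 9):.2f} MP")  # ~3.73 MP
print(f"4:3  -> {megapixels(2576, 4, 3):.2f} MP")   # ~4.98 MP
```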

02

Actually reads cluttered screens

On XBOW's visual-acuity testing, Opus 4.7 scored 98.5% compared to 54.5% for Opus 4.6. For anyone building agents that read busy dashboards or dense web pages, one of the biggest headaches just went away.

03

Operates real software interfaces better

On OSWorld-Verified, a benchmark that tests how well AI can operate actual desktop applications, Opus 4.7 climbed from 72.7% to 78%. That puts it ahead of GPT-5.4, which sits at 75%.

04

Picks the right tool more often

Opus 4.7 leads the MCP-Atlas tool-use benchmark at 77.3%, beating Opus 4.6 at 75.8% and Sonnet 4.6. For anyone connecting Claude to outside tools and services, this is the number that matters most.

05

Keeps going when a tool breaks

Real-world testing shows a 14% performance lift over Opus 4.6 while using fewer tokens and producing a third of the tool errors. Opus 4.7 keeps working through tool failures that used to bring earlier models to a full stop.
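
What "keeps going" looks like is partly client-side: when a tool raises, you hand the failure back to the model as a tool result instead of aborting, and the model decides how to recover. A minimal sketch using the standard Messages API tool-use flow; the model ID and the failing fetch_url tool are stand-ins:

```python
import anthropic

client = anthropic.Anthropic()

def fetch_url(url: str) -> str:
    # Hypothetical tool standing in for any real integration.
    raise TimeoutError("upstream service did not respond")

tools = [{
    "name": "fetch_url",
    "description": "Fetch a web page and return its text.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}]

messages = [{"role": "user", "content": "Summarize https://example.com"}]
response = client.messages.create(
    model="claude-opus-4-7",  # assumed API model ID
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# Report tool failures back via is_error instead of crashing the run,
# so the model can retry, switch tools, or say what is missing -- the
# recovery behavior this item measures.
for block in response.content:
    if block.type == "tool_use":
        try:
            content, is_error = fetch_url(**block.input), False
        except Exception as exc:
            content, is_error = f"{type(exc).__name__}: {exc}", True
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": content,
            "is_error": is_error,
        }]})
```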

06

Stops getting stuck in loops

A model that spins forever on one out of every 18 queries burns compute and blocks users. Opus 4.7 posts the highest quality-per-tool-call ratio Genspark has measured, with loop resistance that production teams call the most important improvement in the release.
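
Until now, teams typically bolted that resistance on themselves. A minimal sketch of the kind of client-side guard Opus 4.7 makes less necessary; every name here is ours, not part of any Anthropic API:

```python
from collections import Counter

class LoopGuard:
    """Abort an agent run when the same tool call keeps repeating.

    A client-side workaround for the loop behavior described above;
    nothing here is part of any Anthropic API.
    """

    def __init__(self, max_repeats: int = 3):
        self.seen = Counter()
        self.max_repeats = max_repeats

    def check(self, tool_name: str, tool_input: dict) -> None:
        # Key on the tool name plus a stable rendering of its input.
        key = (tool_name, repr(sorted(tool_input.items())))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError(
                f"{tool_name} repeated {self.seen[key]} times with "
                "identical input; aborting probable infinite loop"
            )
```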

07

Runs for hours without falling apart

Opus 4.7 stays coherent across long sessions, pushes through hard problems instead of quitting, and handles deep investigation work that earlier models could not be trusted to finish.

The interesting improvements are not the benchmark wins. They are the failure modes that quietly disappear.
08

Solves harder coding problems

On SWE-bench Pro, the tougher multi-language version of the coding benchmark, Opus 4.7 jumped from 53.4% to 64.3%. That puts it ahead of GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%.

09

Cracks problems nothing else could

On GitHub's 93-task coding benchmark, Opus 4.7 resolved 13% more tasks than Opus 4.6. Four of those tasks had never been solved by Opus 4.6 or Sonnet 4.6.

10

Handles the command line better

Opus 4.7 passes Terminal Bench tasks that earlier Claude models had failed, including a tricky concurrency bug that Opus 4.6 could not figure out.

11

Admits when data is missing

Opus 4.7 tells you when information is not there instead of filling the gap with a confident guess. It also resists "dissonant data" traps that still catch Opus 4.6. This is a big deal for anyone doing analysis where a wrong answer is worse than no answer.

12

Reads enterprise documents more accurately

On Databricks' OfficeQA Pro document question-answering tests, Opus 4.7 makes 21% fewer errors than Opus 4.6 when working from source documents.

13

Remembers across sessions

Opus 4.7 is meaningfully better at using files as memory. It keeps track of important notes across long, multi-day projects and uses them to pick up new tasks faster, with less context needed up front.
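
One way a harness might wire that up, since the launch post describes the improvement in terms of files rather than a dedicated API; the file path and helper names below are ours:

```python
from pathlib import Path

NOTES = Path("project_notes.md")  # hypothetical memory file

def load_memory() -> str:
    # Warm-start a new session with notes from earlier ones.
    return NOTES.read_text() if NOTES.exists() else "No prior notes."

def save_memory(entry: str) -> None:
    # Append whatever the model flagged as worth remembering.
    with NOTES.open("a") as f:
        f.write(entry.rstrip() + "\n")

system_prompt = (
    "You are resuming a multi-day project. Notes from earlier sessions:\n\n"
    + load_memory()
)
```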

14

Gives developers finer control

Opus 4.7 adds a new xhigh effort level that sits between high and max, so you can dial in the tradeoff between reasoning depth and speed. It also launches task budgets in public beta, which let developers cap token spend and tell Claude how to prioritize work across long runs. Neither control exists on Opus 4.6 or Sonnet 4.6.
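
The article names the controls but not the API shape, so treat the field names below as guesses inferred from its description; only the Messages API call itself is standard:

```python
import anthropic

client = anthropic.Anthropic()

# Sketch only: "effort" and "task_budget" are assumed field names for
# the xhigh level and the task-budget beta, passed through extra_body
# precisely because they are not confirmed SDK parameters.
response = client.messages.create(
    model="claude-opus-4-7",  # assumed API model ID
    max_tokens=2048,
    extra_body={
        "effort": "xhigh",  # the new level between high and max
        "task_budget": {
            "max_tokens": 50_000,
            "priority": "finish_core_tasks_first",
        },
    },
    messages=[{"role": "user", "content": "Refactor the auth module."}],
)
```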

The bottom line

Opus 4.7 is not a universal upgrade. Teams with tight token budgets, simple chat workloads, or prompts tuned carefully for Opus 4.6 should test before switching. But for anyone building agents that live in the real world, read screens, call tools, and run for hours at a time, this is the first Claude release that feels like it was built for the job rather than adapted to it.

References & Sources

  1. Anthropic. "Introducing Claude Opus 4.7." anthropic.com/news/claude-opus-4-7 (April 16, 2026). Primary source for image resolution specs, customer testing quotes, memory improvements, and effort-level controls.
  2. Vellum AI. "Claude Opus 4.7 Benchmarks Explained." vellum.ai/blog/claude-opus-4-7-benchmarks-explained (April 16, 2026). Source for SWE-bench Pro, OSWorld-Verified, MCP-Atlas, and comparative scores against GPT-5.4 and Gemini 3.1 Pro.
  3. BenchLM. "Claude Opus 4.7 vs Claude Sonnet 4.6: AI Benchmark Comparison." benchlm.ai/compare/claude-opus-4-7-vs-claude-sonnet-4-6 (April 2026). Source for context window comparison and aggregate benchmark leaderboards.
  4. Anthropic. "Models Overview." platform.claude.com/docs/en/about-claude/models/overview (accessed April 2026). API model IDs, pricing, and platform availability.
  5. XBOW. Visual-acuity benchmark results for autonomous penetration testing, cited in Anthropic's Opus 4.7 launch post.
  6. Notion. Tool-use and implicit-need test results, cited in Anthropic's Opus 4.7 launch post.
  7. Genspark. Loop resistance and quality-per-tool-call measurements, cited in Anthropic's Opus 4.7 launch post.
  8. Cognition (Devin). Long-horizon autonomy evaluations, cited in Anthropic's Opus 4.7 launch post.
  9. Warp. Terminal Bench and concurrency bug resolution, cited in Anthropic's Opus 4.7 launch post.
  10. Hex. Dissonant-data trap and missing-data reporting evaluations, cited in Anthropic's Opus 4.7 launch post.
  11. Databricks. OfficeQA Pro error rate measurements, cited in Anthropic's Opus 4.7 launch post.
  12. GitHub. 93-task coding benchmark results, cited in Anthropic's Opus 4.7 launch post.
  13. Evolink. "Claude Opus 4.7 vs Claude Opus 4.6: Pricing, API Changes, and Migration Guide." evolink.ai/blog/claude-opus-4-7-vs-claude-opus-4-6 (April 2026). Source for tokenizer changes and migration notes.