I Ran Opus 4.6 as My Daily Driver for Months. Now Opus 4.7 Is Here.

I'm not a benchmarker. I don't run eval suites or score models on coding competitions. What I do is use Claude every single day — in VS Code, in Make.com automations, for writing, for building apps, for debugging. I ran Opus 4.6 as my primary model for months across real builds. Now Opus 4.7 has dropped. Here's my honest 4.6 baseline — and what I'm watching as I move to the new version.

My Context: How I Actually Use This Model

Before getting into the specifics, it's worth being clear about what "daily driver" means in my case. I'm not a software engineer. I'm a working dad with a full-time job who builds AI-powered web tools in the margins — evenings, early mornings, the occasional weekend block. My use of Opus 4.6 spans three modes:

Claude Code in VS Code — active development sessions on my personal site, including building and iterating on apps, cross-file edits, debugging, and structural work.
Make.com automation builds — designing multi-module scenarios, writing Claude API prompts that get embedded in Make.com HTTP modules, debugging automation failures.
Claude Chat and Cowork — writing, planning, research, and sprint-style building sessions.

This perspective is different from a developer benchmarking model performance on code generation tasks. It's a practitioner perspective — what does this model actually do for someone building real things with real constraints?

Where Opus 4.6 Is Genuinely Excellent

Codebase Comprehension

The thing that impresses me most about Opus 4.6 in Claude Code is how well it holds a large codebase in mind. My site has 40+ HTML files, multiple app subdirectories, a reference document, and several Make.com scenario descriptions embedded in markdown. When I start a session and point Claude at the project, it reads across files and maintains a coherent picture of the whole thing.

The 40-file webhook security update I've written about is the clearest example. That task required reading every HTML file in the repo, identifying specific patterns, and making consistent changes across all of them. That kind of broad-context, multi-file reasoning is where Opus genuinely shines over smaller models.
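To make "identifying specific patterns" across every HTML file concrete, here is a minimal sketch of the kind of repo-wide scan such a task starts with. It is not the actual update: the function name is hypothetical, and the pattern it looks for (a hard-coded Make.com webhook URL in page source) is an assumption for illustration.

```python
import re
from pathlib import Path

# Assumed pattern: a hard-coded Make.com webhook URL left in page source.
WEBHOOK_PATTERN = re.compile(r"https://hook\.[a-z0-9]+\.make\.com/[A-Za-z0-9]+")


def find_webhook_urls(root: str) -> dict[str, list[str]]:
    """Return {relative file path: [matching URLs]} for every HTML file under root."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*.html")):
        matches = WEBHOOK_PATTERN.findall(path.read_text(encoding="utf-8"))
        if matches:
            # Keys are POSIX-style paths relative to the repo root.
            hits[path.relative_to(root).as_posix()] = matches
    return hits
```

A report like this is mostly useful as a before-and-after check: run it once to see which files need the change, then again afterward to confirm nothing was missed.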

Long-Form Generation That Holds Structure

For generating long, structured HTML — like a full blog post with a defined template, specific CSS classes, embedded Chart.js data, and a constrained format — Opus 4.6 is remarkably consistent. It follows the template. It uses the right class names. It doesn't drift into its own structure halfway through. That consistency is what makes the blog-post factory viable at all.
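"Uses the right class names" is something a script can check. The sketch below is one way to do it with Python's standard-library HTML parser. It is an illustration under assumptions, not my actual tooling: the helper names and the class names in the usage example are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears on any tag."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute can hold several space-separated names.
                self.classes.update(value.split())


def missing_classes(html: str, required: set[str]) -> set[str]:
    """Return the required class names that never appear in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return required - collector.classes
```

For example, `missing_classes(page_html, {"post-body", "chart-wrap"})` returns an empty set when both (hypothetical) classes are present, and the set of absent names otherwise.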

Honest Reasoning

One of the things I've come to genuinely rely on: Opus tells me when something is uncertain. It doesn't confidently generate wrong answers the way smaller models sometimes do. When I ask it to do something it's not sure about — a specific Make.com configuration, the exact syntax for a GitHub API call — it'll flag the uncertainty and suggest verifying. That's the right behavior. It saves debugging time.

~6 months of daily Opus 4.6 use across real builds: 10 live apps, a 16-module Make.com automation, 40+ blog posts, and a rebuilt site. This is a practitioner take, not a benchmark.

Where It Still Needs Direction

Vague Prompts Produce Vague Output

This is consistent and predictable: when I'm imprecise, Opus is imprecise back. If I say "make the app better," I get something that's different but not necessarily what I wanted. If I say "add a loading spinner that appears between submit and response, positioned centered below the submit button, using the existing --cyan CSS variable," I get exactly that.

This isn't a weakness of Opus specifically — it's a property of working with any AI model. But it means the skill of prompt clarity matters. The model amplifies precision; it doesn't compensate for vagueness.

Root Cause vs. Symptom Fixes

Occasionally, when debugging, Opus will fix the symptom rather than the root cause — patch the line that throws the error rather than understanding why the error is being thrown. It's not frequent, but it happens enough that I've learned to ask explicitly: "What's causing this, not just where the error appears?" That reframe usually produces a better diagnosis.

Context Limit Management

Long sessions degrade. This is known, and it's not unique to Opus — it's a fundamental property of context windows. But it shows up practically: a session that's been running for 2+ hours with heavy file reads will start producing outputs that drift slightly from established conventions. The solution is the reference document system — saving state to files so new sessions start fresh rather than trying to maintain coherence in a single long conversation.
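The idea of saving state to a file can be sketched in a few lines. This is a simplified illustration, not the actual reference document format: the function name, the file layout, and the two sections (conventions and next steps) are assumptions.

```python
from datetime import date
from pathlib import Path


def save_session_state(path: str, conventions: list[str], next_steps: list[str]) -> str:
    """Write a short plain-text handoff note that a fresh session can read first."""
    lines = [f"Session state ({date.today().isoformat()})", "", "Conventions:"]
    lines += [f"- {item}" for item in conventions]
    lines += ["", "Next steps:"]
    lines += [f"- {item}" for item in next_steps]
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text
```

What matters is that the conventions live in a file rather than in the conversation, so a new session can be pointed at the note instead of relying on a two-hour context to remember them.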

My Evaluation Framework — Now That 4.7 Has Dropped

When I wrote the first draft of this post, Opus 4.7 was still unreleased. It has since dropped — which means the "watching for" list below has become a test suite. These are the five things I'm actively evaluating as I build with the new version. Benchmark scores are the first thing everyone publishes. They're the least interesting thing to me. Here's what I actually test:

Capability	What I'm testing	Why it matters to me
Multi-file consistency	Cross-repo edits that touch 20+ files without drifting from conventions	The larger my site gets, the more this matters
Long-session stability	Convention adherence at hour 2 vs. hour 0	Long sessions are where I do my most complex work
Structured output reliability	HTML generation that passes the eval harness on first pass more often	Fewer eval failures = faster publishing pipeline
Root cause diagnosis	Debugging sessions where I don't need to ask "but what's actually causing it"	Saves back-and-forth on complex bugs
Make.com prompt quality	Claude API prompts embedded in automations that need to produce structured output reliably	The automation pipeline depends on consistent API responses

The honest version of this list: I want less prompt engineering overhead for the same output quality. Opus 4.6 was good enough that I could get great results, but only with clear, specific, structured prompts. That's still the right discipline with any model. But if 4.7 holds quality at lower prompt precision, that's the improvement I'll feel every session.
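For the structured-output rows in the table above, "reliable" can be turned into a pass/fail check on each API response. Here is a minimal sketch, assuming the automation expects a JSON object with a known set of keys. The function name and the key names in the usage note are hypothetical, and this is not a Make.com feature.

```python
import json


def validate_response(raw: str, required_keys: set[str]) -> tuple[bool, str]:
    """Check that a model response is a JSON object containing the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = sorted(required_keys - set(data))
    if missing:
        return False, "missing keys: " + ", ".join(missing)
    return True, "ok"
```

Running this over a batch of saved responses from each model version, for instance with `required_keys={"title", "summary"}`, gives a simple first-pass success rate to compare.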

Opus 4.7 has landed. The evaluation framework above is now my active test suite — multi-file consistency, long-session stability, structured output reliability, root cause diagnosis, and Make.com prompt quality. More to follow as I build.

What I Tell People Who Are Just Starting

The question I get most often from people who've tried Claude and felt underwhelmed: "Is it really as good as people say?"

My answer: it depends entirely on how you use it. The ceiling of what these models can do is much higher than most people's first sessions with them. The limiting factor is almost always the quality of the input — the precision of the prompt, the richness of the context, the clarity of the success criteria. Give it a well-formed problem with clear parameters and good context, and the output is genuinely remarkable. Give it a vague question and expect magic, and you'll be disappointed.

That's why I built the reference document, the eval harness, the CLAUDE.md file — not because the model needs all of that to produce output, but because the model produces dramatically better output when those things are in place. You're not working around weaknesses. You're removing friction that would limit any intelligent collaborator. That's true of 4.6, and I expect it to be true of 4.7.

If you want to understand what I've actually built with this system, the full inventory is here. The list still surprises me when I read it back.