The Week I'm Writing This
Anthropic filed to go public this week. S-1 submitted to the SEC, valuation near $1 trillion. The company that makes my primary working tool is about to be on the public markets with earnings calls, quarterly guidance, and institutional investors watching token economics.
The same week, they deprecated Opus 4.1 with a hard retirement date of August 5 and pointed everyone toward Opus 4.8. If you've built anything on 4.1, you have two months to move it.
I've been sitting on a draft of this post for a while, waiting for enough new model news to make the timing right. This is that moment. I ran Opus 4.6 as my daily driver for six months — before the IPO filing, before the deprecation cascade. What I want to write down isn't a model comparison. It's what six months of daily use with a single model actually teaches you — and why the framework I built matters more than the version number I built it on.
My Context: How I Actually Use This Model
Before getting into the specifics, it's worth being clear about what "daily driver" means in my case. I'm not a software engineer. I'm a working dad with a full-time job who builds AI-powered web tools in the margins — evenings, early mornings, the occasional weekend block. My use of Opus 4.6 spans three modes:
- Claude Code in VS Code — active development sessions on my personal site, including building and iterating on apps, cross-file edits, debugging, and structural work.
- Make.com automation builds — designing multi-module scenarios, writing Claude API prompts that get embedded in Make.com HTTP modules, debugging automation failures.
- Claude Chat and Cowork — writing, planning, research, and sprint-style building sessions.
This is a different perspective from a developer benchmarking model performance on code generation tasks. It's a practitioner's perspective: what does this model actually do for someone building real things under real constraints?
Where Opus 4.6 Is Genuinely Excellent
Codebase Comprehension
The thing that impresses me most about Opus 4.6 in Claude Code is how well it holds a large codebase in mind. My site has 40+ HTML files, multiple app subdirectories, a reference document, several Make.com scenario descriptions embedded in markdown — and when I start a session and point Claude at the project, it reads across files and maintains a coherent picture of the whole thing.
The 40-file webhook security update I've written about is the clearest example. That task required reading every HTML file in the repo, identifying specific patterns, and making consistent changes across all of them. That kind of broad-context, multi-file reasoning is where Opus genuinely shines over smaller models.
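To make that kind of task concrete, here is a minimal Python sketch of the first step: finding every HTML file that contains a given pattern. It is an illustration, not the script used for the actual update. The function name `find_webhook_references` and the URL pattern are assumptions, since the post doesn't say what the real pattern looked like.

```python
import re
from pathlib import Path

# Assumption: the pattern of interest is a hard-coded webhook URL on a
# "hook.*" host. The real pattern from the update is not given in the post.
WEBHOOK_PATTERN = re.compile(r"https://hook\.[a-z0-9.]+/[A-Za-z0-9]+")


def find_webhook_references(root):
    """Return {relative_path: [matched URLs]} for every HTML file under root."""
    root = Path(root)
    hits = {}
    for path in sorted(root.rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        matches = WEBHOOK_PATTERN.findall(text)
        if matches:
            # Store paths relative to the repo root so the report is readable.
            hits[path.relative_to(root).as_posix()] = matches
    return hits
```

A report like this is also a useful check after the edit: run it again and confirm the list of hits is empty, or contains only the files you expected to keep.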
Long-Form Generation That Holds Structure
For generating long, structured HTML — like a full blog post with a defined template, specific CSS classes, embedded Chart.js data, and a constrained format — Opus 4.6 is remarkably consistent. It follows the template. It uses the right class names. It doesn't drift into its own structure halfway through. That consistency is what makes the blog-post factory viable at all.
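Template consistency is also something you can check mechanically. The sketch below shows one way such a check could look, using only Python's standard library. The class names and the "exactly one h1" rule are hypothetical examples, not the actual rules in my eval harness.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collects every CSS class name and counts start tags by name."""

    def __init__(self):
        super().__init__()
        self.classes = set()
        self.tag_counts = {}

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def check_post(html, required_classes):
    """Return a list of problems; an empty list means the post passed."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    problems = []
    missing = sorted(set(required_classes) - collector.classes)
    if missing:
        problems.append("missing classes: " + ", ".join(missing))
    # Hypothetical template rule: each post has exactly one <h1>.
    if collector.tag_counts.get("h1", 0) != 1:
        problems.append("expected exactly one <h1>")
    return problems
```

Returning a list of problems rather than a single pass/fail flag means a failed post can go straight back to the model with the specific complaints attached.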
Honest Reasoning
One of the things I've come to genuinely rely on: Opus tells me when something is uncertain. It doesn't confidently generate wrong answers the way smaller models sometimes do. When I ask it to do something it's not sure about — a specific Make.com configuration, the exact syntax for a GitHub API call — it'll flag the uncertainty and suggest verifying. That's the right behavior. It saves debugging time.
Where It Still Needs Direction
Vague Prompts Produce Vague Output
This is consistent and predictable: when I'm imprecise, Opus is imprecise back. If I say "make the app better," I get something that's different but not necessarily what I wanted. If I say "add a loading spinner that appears between submit and response, positioned centered below the submit button, using the existing --cyan CSS variable," I get exactly that.
This isn't a weakness of Opus specifically — it's a property of working with any AI model. But it means the skill of prompt clarity matters. The model amplifies precision; it doesn't compensate for vagueness.
Root Cause vs. Symptom Fixes
Occasionally, when debugging, Opus will fix the symptom rather than the root cause — patch the line that throws the error rather than understanding why the error is being thrown. It's not frequent, but it happens enough that I've learned to ask explicitly: "What's causing this, not just where the error appears?" That reframe usually produces a better diagnosis.
Context Limit Management
Long sessions degrade. This is known, and it's not unique to Opus — it's a fundamental property of context windows. But it shows up practically: a session that's been running for 2+ hours with heavy file reads will start producing outputs that drift slightly from established conventions. The solution is the reference document system — saving state to files so new sessions start fresh rather than trying to maintain coherence in a single long conversation.
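Saving state to a file can be as simple as appending a short, structured entry at the end of each session. The following is a rough Python sketch under my own assumptions: the function name, the entry layout, and the markdown headings are all hypothetical, and a real reference document would likely hold more than this.

```python
from pathlib import Path


def append_session_notes(ref_path, session_label, conventions, next_steps):
    """Append one session entry to the reference file and return the entry text."""
    lines = [f"## Session: {session_label}", "", "Conventions in force:"]
    lines += [f"- {item}" for item in conventions]
    lines += ["", "Next steps:"]
    lines += [f"- {item}" for item in next_steps]
    entry = "\n".join(lines) + "\n"
    # Append mode creates the file if it does not exist yet.
    # The leading blank line keeps entries visually separated.
    with Path(ref_path).open("a", encoding="utf-8") as handle:
        handle.write("\n" + entry)
    return entry
```

A new session can then start by reading this file, so the conventions come from disk rather than from whatever is left of a two-hour conversation.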
My Evaluation Framework — Built to Transfer
When I started writing this post, I was tracking what I hoped would improve in the next Opus release. By the time I'm publishing it, the landscape has moved twice — 4.7 landed, then Anthropic began pointing people toward 4.8 as older versions reach deprecation. Benchmark scores are the first thing everyone publishes. They're the least interesting thing to me. Here's what I actually test, regardless of version:
| Capability | What I'm testing | Why it matters to me |
|---|---|---|
| Multi-file consistency | Cross-repo edits that touch 20+ files without drifting from conventions | The larger my site gets, the more this matters |
| Long-session stability | Convention adherence at hour 2 vs. hour 0 | Long sessions are where I do my most complex work |
| Structured output reliability | How often generated HTML passes the eval harness on the first attempt | Fewer eval failures = faster publishing pipeline |
| Root cause diagnosis | Debugging sessions where I don't need to ask "but what's actually causing it" | Saves back-and-forth on complex bugs |
| Make.com prompt quality | Claude API prompts embedded in automations that need to produce structured output reliably | The automation pipeline depends on consistent API responses |
The honest version of this list: what I wanted from newer models was less prompt engineering overhead for the same output quality. Opus 4.6 required clear, specific, structured prompts to perform consistently, and that's still true of every version I've tested since. What I'm watching for in 4.8 is whether the same precision buys more: more consistent structured output, fewer eval failures, better root-cause diagnosis. If it does, I'll feel it every session. If it doesn't, the framework still holds regardless of what the release notes say.
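"Consistent structured output" is measurable. As a hedged illustration, this Python sketch checks whether a model response is a JSON object with the fields an automation expects. The function name and the field names in the usage note are hypothetical; the post doesn't specify what the Make.com scenarios actually expect, and Make.com itself would do this check with its own modules rather than Python.

```python
import json


def validate_response(raw_text, required_fields):
    """Check a model response against a simple field spec.

    required_fields maps each expected key to a Python type.
    Returns a list of problems; an empty list means the response is usable.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for key, expected_type in required_fields.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}: expected {expected_type.__name__}")
    return problems
```

For example, `validate_response(text, {"title": str, "tags": list})` returns an empty list for a well-formed response. Counting how often that happens over the same set of prompts gives a number that can be compared across model versions.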
What I Tell People Who Are Just Starting
The question I get most often from people who've tried Claude and felt underwhelmed: "Is it really as good as people say?"
My answer: it depends entirely on how you use it. The ceiling of what these models can do is much higher than most people's first sessions with them. The limiting factor is almost always the quality of the input — the precision of the prompt, the richness of the context, the clarity of the success criteria. Give it a well-formed problem with clear parameters and good context, and the output is genuinely remarkable. Give it a vague question and expect magic, and you'll be disappointed.
That's why I built the reference document, the eval harness, the CLAUDE.md file — not because the model needs all of that to produce output, but because the model produces dramatically better output when those things are in place. You're not working around weaknesses. You're removing friction that would limit any intelligent collaborator. That was true of 4.6. It'll be true of 4.8. It'll be true of whatever version Anthropic ships after their IPO road show wraps up.
If you want to understand what I've actually built with this system, the full inventory is here. The list still surprises me when I read it back.