The Problem With AI-Generated Content
Here's the thing nobody talks about when they describe using AI to generate blog posts: the output is usually pretty good, but it's almost never right on the first pass. Not wrong in obvious ways — wrong in the subtle ways that matter. The GA4 analytics tag has the wrong ID. The newsletter webhook URL is from a previous project. The table of contents has four items but there are only three h2 sections in the article. A CSS attribute on the article content div makes the whole post invisible on mobile.
None of these are content failures. They're structural failures — the kind that an automated system can catch but a human reading the post might not notice until someone emails to say the newsletter signup button doesn't work.
I ran into all of these, and more, while building out the blog on this site. The solution I ended up with wasn't just "be more careful." It was an eval harness: a set of automated assertions that run every time a post is generated, catch the failures, and fix them before anything goes to staging.
I'm not a developer. I don't write Python. The eval system runs in actual Python, in a sandboxed environment, on my computer. I built it in one session with Claude. This is that story.
What the Blog-Post Skill Does
The blog-post skill is a system I built with Claude Code — a structured prompt-plus-process that takes a two-sentence brief and produces a complete, publish-ready HTML file. Here's the sequence:
- Input. I give it a topic, a target slug, and a category tag. That's the entire brief.
- Voice analysis. Claude reads a set of voice guidelines derived from my existing published posts. The goal is that the generated post sounds like me, not like a generic AI blog.
- Template load. It loads a canonical 400-line HTML template that includes every required structural element: GA4 tag, correct newsletter webhook, share buttons with the right URL, related post cards, the affiliate disclosure, the table of contents scaffold.
- Content generation. Claude writes the post content and drops it into the template. Real paragraphs, real h2 sections, real stat callouts — not lorem ipsum filler.
- Eval harness. Ten automated assertions run against the generated HTML. If anything fails, Claude identifies the problem, fixes the file, and re-runs the assertions.
- Save to staging. The verified file lands in the staging folder, gitignored and ready for review before it ever touches the live site.
The whole sequence takes a few minutes. What comes out the other side is a post I can read, approve, and publish — not a draft I need to spend 45 minutes cleaning up.
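To make the shape of step five concrete, here's a minimal sketch of the gate as a bounded retry loop. This is illustrative only: the real skill is a prompt-and-process orchestrated by Claude Code, which plays the fix role itself by editing the file, and every helper name here is hypothetical.

```python
from typing import Callable

def eval_gate(
    generate: Callable[[], str],
    run_assertions: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    max_attempts: int = 3,
) -> str:
    """Generate a post, then loop check-fix-recheck until every
    assertion passes, or give up after max_attempts."""
    html = generate()                    # steps 1-4: voice, template, content
    for _ in range(max_attempts):
        failures = run_assertions(html)  # step 5: the ten checks
        if not failures:
            return html                  # verified: ready for staging
        html = fix(html, failures)       # in reality, Claude edits the file
    raise RuntimeError(f"still failing after {max_attempts} attempts: {failures}")
```

The property that matters is the last line: nothing leaves the loop without passing every check.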
The 10 Assertions — In Detail
This is the part people find surprising. The eval harness isn't a vibe check or a "does this look right" review. It's a set of concrete, testable conditions that either pass or fail. Here's what the 10 assertions actually check:
- GA4 tag is present. Confirms the Google Analytics tag ID G-M0J8NEXYGB appears in the file.
- Newsletter webhook URL is correct. Verifies the specific Make.com webhook URL is the right one, not a stale value from a previous project.
- Affiliate link is accurate. Checks that the Make.com affiliate link uses my referral code, not a plain link.
- TOC item count matches h2 section count. Counts the table-of-contents list items and the actual h2 headings in the article — they must match.
- At least two stat callouts present. Confirms the post has a minimum of two of the blue stat callout blocks that I use to highlight key figures.
- Exactly three related post cards. The related section at the bottom of every post has exactly three cards — no more, no fewer.
- No data-fade bug on article content div. A specific CSS attribute that caused post content to be invisible on mobile — checks that this attribute is absent.
- Share buttons have correct post URL. Verifies the Twitter and LinkedIn share buttons link to the right URL for this specific post, not a placeholder or previous post's URL.
- Footer affiliate disclaimer present. Confirms the standard affiliate disclosure line is in the footer.
- Meta description is populated. Checks that the og:description and meta description tags are not empty or default placeholder text.
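For a sense of what checks like these look like in practice, here's a minimal sketch of a few of them in Python. It assumes BeautifulSoup for parsing, and the selectors (nav.toc, div.stat-callout, article-content) are stand-ins; the real template's class names may differ.

```python
from bs4 import BeautifulSoup

GA4_ID = "G-M0J8NEXYGB"

def run_assertions(html: str) -> list[str]:
    """Return a list of failure messages; an empty list means all checks pass."""
    soup = BeautifulSoup(html, "html.parser")
    failures = []

    # Assertion 1: the GA4 tag ID must appear somewhere in the file.
    if GA4_ID not in html:
        failures.append("GA4 tag ID missing")

    # Assertion 4: TOC item count must match h2 section count.
    toc_items = soup.select("nav.toc li")        # selector is a stand-in
    h2_headings = soup.select("article h2")
    if len(toc_items) != len(h2_headings):
        failures.append(
            f"TOC has {len(toc_items)} items, article has "
            f"{len(h2_headings)} h2 sections"
        )

    # Assertion 5: at least two stat callouts.
    if len(soup.select("div.stat-callout")) < 2:  # class name is a stand-in
        failures.append("fewer than two stat callouts")

    # Assertion 7: the data-fade attribute must be absent from the content div.
    content = soup.find("div", class_="article-content")  # class is a stand-in
    if content is not None and content.has_attr("data-fade"):
        failures.append("data-fade present on article content div")

    return failures
```

Each check returns a plain-language failure message, which is exactly what the fix step needs to locate the problem on the next pass.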
Each one of these exists because something broke without it. The data-fade bug caused invisible content on mobile for two posts before I caught it. The wrong newsletter webhook URL meant subscribers from an early post were going into the wrong list. The mismatched TOC count was a recurring issue in early generated posts.
The eval harness is a catalog of my own mistakes, automated into a gate.
How I Built It Without Writing Python
This is the part that still feels a little surreal. The assertions run in actual Python — not a JavaScript scraper, not a regex check in a text editor, but a real Python script that parses HTML, counts elements, and checks string values. I do not write Python. I have never written Python.
What I did: I described to Claude Code, in plain language, what I wanted the eval system to do. "Check that the GA4 tag is present. Count the TOC items and h2 headings and confirm they match. Verify the newsletter webhook URL is the correct one." Claude wrote the Python script and ran it in a sandboxed environment. When the first pass had a parsing issue with one of the checks, Claude read the error, fixed the script, and re-ran it.
One session. One working eval harness. The script has run against every post I've generated since, and it has caught real failures on at least a third of them — failures that would have shipped to the live site if the gate didn't exist.
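To show how a script like this acts as a gate, here's a hypothetical standalone version of the simplest check, wired up the way a CI step would be: it exits nonzero on failure, so nothing broken moves forward. The filename and structure are mine, not the real script's.

```python
#!/usr/bin/env python3
"""Minimal illustrative gate: run one assertion against a generated
post and exit nonzero if it fails. The real script runs all ten."""
import sys
from pathlib import Path

def main() -> int:
    if len(sys.argv) != 2:
        print("usage: eval_post.py <post.html>", file=sys.stderr)
        return 2
    html = Path(sys.argv[1]).read_text(encoding="utf-8")
    if "G-M0J8NEXYGB" not in html:
        print("FAIL: GA4 tag ID missing", file=sys.stderr)
        return 1
    print("PASS: GA4 tag present")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A nonzero exit code is the signal that the file needs another fix pass before it's allowed into staging.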
This is what I mean when I say Claude Code opens up a new category of what's feasible. I couldn't have built this eval system without it. I don't have the language to write it, and I wouldn't have had the time to learn. With Claude Code, I described the goal and got a working system. The barrier between "I want this" and "this exists" is almost entirely gone.
What This System Actually Unlocks
The obvious benefit is speed — I can go from "I want a post about X" to a verified draft in minutes instead of hours. But that's not the most important thing it unlocks.
The more important unlock is confidence. Because the eval runs, I trust the output. I'm not doing a manual checklist every time I publish. I'm not second-guessing whether the GA4 tag made it in. The system handles that, and I can focus on whether the content is actually good: the part that requires my judgment.
It also unlocks volume. With 40+ posts on this site now, producing each one manually wasn't going to scale. The factory model means I can ship more content without adding more time to the process — which, for someone building in evening blocks around a full-time job and a family, is the only kind of scaling that's actually available to me.
The system feeds into a bigger picture: the same kind of automation logic powers my 16-module Make.com scenario that publishes a site tracker dashboard every week without any input from me. The blog factory is one part of a broader system for running this site largely on autopilot.
If you want to understand how the whole thing fits together — the reference doc that gives Claude persistent context, the voice guidelines, the staging workflow — that's in How I Give Claude Memory Across Every Session. That's the foundation the factory is built on.
What It Doesn't Solve
The eval harness catches structural problems. It does not catch bad writing. It does not check whether the advice in a post is actually correct. It does not verify that the examples I used are accurate. That part still requires me — reading the post, fact-checking the claims, making sure the voice actually sounds right rather than just passing the voice-guidelines check.
The factory produces a solid structural draft. The human layer is still necessary to make it a post worth reading. I've found the balance works well: Claude handles the scaffolding and structure, I handle the editorial judgment. Neither is trying to do the other's job.
And the eval itself is a living document. As I find new failure modes, I add new assertions. The Python script has grown beyond the original 10 checks as I've added a handful more over time. It's not a static artifact; it evolves with the system it's guarding.
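As a hypothetical example of what adding one looks like: suppose a post shipped with a canonical link pointing at the wrong slug. The new check is a few lines appended to the script. The function and selector below are mine, not from the real harness.

```python
def check_canonical_link(soup, slug: str, failures: list[str]) -> None:
    """Hypothetical new assertion: the canonical link must point at this
    post's slug. Designed to slot into run_assertions sketched earlier."""
    link = soup.select_one('link[rel="canonical"]')
    if link is None or slug not in link.get("href", ""):
        failures.append("canonical link missing or pointing at the wrong slug")
```

Each new mistake becomes a permanent check, and the gate gets a little harder to slip past.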