2026-03-28 · Note

Write the eval before the prompt

We write evals before prompts. Quality drops, then it jumps past where vibe checks ever reach.

On our System tier we write a tiny eval set before we touch the prompt. Twenty examples, two columns: input and expected behavior. The prompt looks worse at first because you finally see what it does wrong.

Then it gets much better because you iterate against a number instead of a feeling. McKinsey's 2024 state of AI survey found high performers were far more likely to use systematic testing for gen AI than laggards.

“If you do not measure it, you do not know if you shipped the same bug twice.”
Shreya Shankar · UC Berkeley, evaluation research

“Teams that treat prompts as code and evals as tests ship faster after the first painful week.”
Hamels Mu · LangChain, production eval guidance 2024

The eval is the spec

If you cannot write the eval, you do not know what you are building. Expected behavior can be an exact string, a regex, a JSON shape, or a rubric scored by a stronger model. Pick the strictest check that still passes real inputs.
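A minimal sketch of the first three check types in Python. The function name `check_output`, the `kind` labels, and the comma-separated key list for the JSON shape are hypothetical choices for illustration. Rubric scoring is left out because it needs a model call.

```python
import json
import re


def check_output(output: str, kind: str, expected: str) -> bool:
    """Return True if `output` satisfies the expected behavior.

    `kind` is a hypothetical label: "exact", "regex", or "json_keys".
    """
    if kind == "exact":
        # Strictest check: the whole string must match, ignoring outer whitespace.
        return output.strip() == expected.strip()
    if kind == "regex":
        # Looser: the pattern must appear somewhere in the output.
        return re.search(expected, output) is not None
    if kind == "json_keys":
        # JSON shape (assumed form): `expected` lists required top-level keys.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        required = {key.strip() for key in expected.split(",")}
        return isinstance(data, dict) and required.issubset(data)
    raise ValueError(f"unknown check kind: {kind}")
```

Trying "exact" first and loosening only when real inputs fail for harmless reasons is one way to find the strictest check that still passes.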

Store evals in git next to the prompt. Review diffs like application code.

Start with twenty rows

Cover happy path, edge cases, and known failure modes from stakeholders. Include at least three examples where the old manual process failed.

Name a pass threshold

We use 90% on blocking scenarios before we call a Founder MVP done for AI features. System tier adds regression alerts on every merge.
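As a rough sketch, the gate can be a few lines of Python. The row fields `blocking` and `passed` are assumed names, and failing the gate when there are no blocking rows is a design choice of this example.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of checks that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def gate(rows: list[dict], threshold: float = 0.90) -> bool:
    """True when blocking scenarios meet the threshold.

    Each row is assumed to look like {"blocking": bool, "passed": bool}.
    Non-blocking rows are ignored here. An eval set with no blocking rows
    fails the gate, so an empty set cannot pass by accident.
    """
    blocking = [row["passed"] for row in rows if row["blocking"]]
    return pass_rate(blocking) >= threshold
```

With ten blocking rows, nine passes clears a 90% threshold and eight does not.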

What the first week feels like

Day 1: baseline prompt scores 55%. Day 3: you rewrite retrieval and score 72%. Day 5: prompt and guardrails land at 91%. Without the eval, you would have shipped at 72% and called it good.

This matches what we saw on DocPulse: retrieval tuning moved citation accuracy more than any model swap.

Tools we actually use

A spreadsheet or CSV is fine to start. We promote to a small script in CI that runs the eval set against staging. No vendor lock-in required.
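A small script of this kind might look like the sketch below. It assumes a CSV with `input` and `expected` columns and a `call_model` function you supply that sends the input to staging and returns text; both names are hypothetical. The substring check stands in for whatever check fits your product.

```python
import csv
import io
from typing import Callable


def run_evals(csv_text: str, call_model: Callable[[str], str]) -> list[dict]:
    """Run every CSV row through `call_model` and record pass or fail."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        output = call_model(row["input"])
        results.append({
            "input": row["input"],
            "output": output,
            # Placeholder check: expected text appears in the output.
            "passed": row["expected"].lower() in output.lower(),
        })
    return results


def exit_code(results: list[dict], threshold: float = 0.90) -> int:
    """0 when the pass rate meets the threshold, 1 otherwise (fails the CI job)."""
    if not results:
        return 1
    rate = sum(r["passed"] for r in results) / len(results)
    return 0 if rate >= threshold else 1
```

In CI you would read the CSV from the repo, pass the text to `run_evals`, and hand the result of `exit_code` to `sys.exit`.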

For voice and agent flows, record transcripts as eval inputs. Replay them after every policy change.

Human review still matters

Evals catch regressions. Humans catch tone, safety, and domain nuance. Use a review queue for low-confidence outputs, like we did on Skanda Billing OCR.
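One possible shape for such a queue, sketched in Python. The `confidence` field, its 0 to 1 scale, and the 0.8 cutoff are assumptions for illustration, not values from any of our projects.

```python
from dataclasses import dataclass


@dataclass
class Output:
    text: str
    confidence: float  # assumed to be in [0.0, 1.0]; how you obtain it is up to you


def route(outputs: list[Output], min_confidence: float = 0.8) -> tuple[list[Output], list[Output]]:
    """Split outputs into (auto_approved, needs_review) by confidence."""
    auto_approved: list[Output] = []
    needs_review: list[Output] = []
    for item in outputs:
        if item.confidence >= min_confidence:
            auto_approved.append(item)
        else:
            needs_review.append(item)
    return auto_approved, needs_review
```

Anything in the second list goes to a human before it reaches the customer.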

Bring this to your project

Ask for an eval plan in your Day 1 scope. If your vendor cannot describe how they will measure success, that is a red flag.

Our AI Readiness Audit add-on includes an eval template and build-vs-buy notes before you commit to a full build.

FAQ

How many eval examples do I need?

Twenty is enough to start. Expand when you find new failure modes in production. One hundred covered most SMB pilots we saw in 2025.

Do evals replace user testing?

No. Evals run on every change. User testing validates the whole experience. You need both.

What if expected output is subjective?

Use a rubric and score with a stronger model or a human labeler. Still better than no baseline.

Can I eval RAG and agents the same way?

Same spreadsheet mindset. RAG evals check citations and recall. Agent evals check tool calls and final state. Combine both when your product uses both.
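To make the two checks concrete, here is a hedged Python sketch. The function names, the use of document IDs for citations, and the exact-order comparison of tool calls are all assumptions. Your product may need looser matching.

```python
def citation_recall(cited: set[str], relevant: set[str]) -> float:
    """Share of the relevant source IDs that the answer actually cited."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(cited & relevant) / len(relevant)


def agent_passed(
    tool_calls: list[str],
    expected_calls: list[str],
    final_state: dict,
    expected_state: dict,
) -> bool:
    """True when the agent made the expected calls in order and ended in the expected state.

    Only the keys named in `expected_state` are compared, so extra state is allowed.
    """
    calls_ok = tool_calls == expected_calls
    state_ok = all(final_state.get(key) == value for key, value in expected_state.items())
    return calls_ok and state_ok
```

A product that uses both can store one row per scenario and require both checks to pass.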

Does KatalyzU include evals in every AI build?

Founder MVP includes a minimal set for the core path. System tier includes regression runs in CI and tracing.

What is a good pass rate?

90% on must-not-fail scenarios before launch. Iterate on nice-to-have cases after you ship.

Want this kind of work for your product? Start a project or see our services.
