On the System tier, we now write a tiny eval set before touching the prompt. Twenty examples, two columns: input and expected behaviour.
First the prompt gets worse, because we can finally see what it is doing wrong. Then it gets much better, because we can iterate against a number instead of a vibe.
Take-away: the eval is the spec. If you cannot write the eval, you do not yet know what you are building.
Want this kind of work for your own product? Start a project.
Start a project