Lessons·May 9, 2026·11 min read

Build an LLM Eval Pipeline in 2 Days, Not 2 Weeks

Most teams ship AI features without eval. They flip a coin every PR. A small eval set built right takes two days and pays back forever — here is the minimum viable version.

Ikki

Last verified · May 9, 2026


Why most teams skip eval

Eval is the part everyone agrees they should do and almost nobody does. We've shipped enough AI products to know exactly why.

It feels like cold-start work: an empty CSV, a vague spec, no immediate user-visible win. Compared to "ship the next feature" it always loses the priority fight. By the time someone notices the agent is regressing, eight weeks of changes have piled up and nobody can isolate which PR caused the drop.

This article is the minimum viable eval pipeline. Two engineering days. 50–100 examples. One CI hook. Designed to ship before you've shipped the thing it's evaluating, and to be cheap enough to run on every PR.

The point is not to build a perfect eval framework. The point is to stop flying blind.

The shape of a useful eval set

A common mistake is to over-engineer the eval set on day one. You don't need 5,000 examples. You don't need a synthetic data pipeline. You don't need MTEB-quality scoring. You need:

50–100 real-traffic examples (or, if you haven't launched, hand-curated ones that represent the user questions you expect)
Ground truth answers — written by you, or your domain expert, or both
A scoring function — sometimes rule-based, sometimes a small LLM-as-judge call
A runner script that takes a model + prompt version, runs the set, and emits a single number plus a per-example breakdown

That's it. The whole thing fits in one file at first.

Where to get the examples

Three sources, in order of preference:

1. Real production logs. If you've shipped, this is the gold standard. Pull 200 random conversation transcripts (or RAG queries, or whatever your unit is). Drop the ones that aren't representative. Keep the 50–100 that look most like steady-state usage. The advantage: the eval set looks like the actual traffic.

2. Logged-and-anonymized customer support tickets. Same logic, slightly noisier. If users email "the agent didn't understand X", X is what you want in your eval. Anonymize before you go further (PII in the eval set is a privacy mistake you'll regret).

3. Hand-curated by a domain expert. When you haven't launched yet. The expert sits with you and brainstorms 50 questions across the user journey. Quality varies — they will under-cover the long tail of weird questions real users ask. Plan to refresh this set within the first month of production.
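The anonymization step mentioned for source 2 can start as a simple redaction pass. The sketch below is a hypothetical helper (`redactPii` and both patterns are illustrative, not from any library) that masks email addresses and phone-like digit runs. Treat it as a first filter under those assumptions: names, postal addresses, and account numbers still need a manual review, and the phone pattern will also mask other long digit runs such as timestamps.

```typescript
// Hypothetical helper: masks obvious PII before a transcript enters the eval set.
// The patterns are deliberately simple and will miss names, addresses, and IDs.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

function redactPii(text: string): string {
  // Emails first, so digits inside an address are not mistaken for a phone number.
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(PHONE_PATTERN, "[PHONE]");
}
```

Running every transcript through a pass like this before it is written to the examples file keeps raw PII out of the repository history.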

The trap to avoid: synthesizing the eval set with the same LLM you're testing. If you ask GPT-4 to generate 100 questions and then test how well GPT-4 answers them, you've built a self-referential eval. The numbers look great. They don't generalize. Synthetic generation is fine for augmenting a real set, not replacing it.

Writing ground-truth answers

For each example, you need either an answer or a way to score the answer.

For factual questions (RAG over a known corpus, document Q&A, internal tools), write the answer. One sentence per question is usually enough. Cite the source if relevant.

For classification or routing tasks (which tool to call, which intent matches), the ground truth is a label or label + arguments. This is the easiest eval format.

For open-ended generation (writing assistance, summarization, conversation), the ground truth is a rubric: the qualities the answer should have. ("Includes a date. Cites the policy doc. Doesn't promise a callback.") The judge model checks the rubric.

We typically use a JSON file for this:

[
  {
    "id": "001",
    "input": "When does my subscription auto-renew?",
    "expected": {
      "must_include": ["renewal date", "amount"],
      "must_not_include": ["promise of refund", "specific time of day"],
      "tone": "informative",
      "format": "short paragraph"
    },
    "tags": ["billing", "factual"]
  },
  {
    "id": "002",
    "input": "I want to cancel.",
    "expected": {
      "tool_call": "respond_redirect",
      "tool_args": { "category": "cancellation" }
    },
    "tags": ["routing", "billing"]
  }
]

Tags matter. They let you slice the score later: "we regressed on routing this PR" is more actionable than "we regressed somewhere".
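Slicing by tag can be a few lines in the runner. The sketch below assumes a per-example result with a boolean `passed` field; the `ExampleResult` shape and the `scoreByTag` name are illustrative, not part of the format above.

```typescript
// Assumed per-example result: `id` and `tags` come from the example file,
// `passed` is a hypothetical field written by the runner.
interface ExampleResult {
  id: string;
  tags: string[];
  passed: boolean;
}

// Returns the pass rate for each tag. An example counts toward every tag it carries.
function scoreByTag(results: ExampleResult[]): Record<string, number> {
  const totals: Record<string, { passed: number; count: number }> = {};
  for (const result of results) {
    for (const tag of result.tags) {
      const entry = totals[tag] ?? (totals[tag] = { passed: 0, count: 0 });
      entry.count += 1;
      if (result.passed) entry.passed += 1;
    }
  }
  const scores: Record<string, number> = {};
  for (const [tag, entry] of Object.entries(totals)) {
    scores[tag] = entry.passed / entry.count;
  }
  return scores;
}
```

Because an example can carry several tags, the per-tag scores are overlapping slices and will not average out to the overall score.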

LLM-as-judge — the prompts that don't drift

For open-ended outputs, you'll use another LLM as a judge. The trap: judge prompts that drift in their own scoring criteria over time. Three principles for stable judges:

1. Make the judge a classifier, not a writer. Don't ask "score this answer 0–10". Ask three to five binary or low-cardinality questions. Aggregate to a number outside the LLM:

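// Example values only; replace with the phrases your agent must never use.
const FORBIDDEN_PHRASES = ["guarantee", "promise"];

// Classifier-style judge prompt: every answer is binary or low-cardinality.
// Send it together with the user's question and the agent's response.
const judgePrompt = `Given the user's question and the agent's response, answer:

1. Does the response contain factually wrong information? (yes/no)
2. Does the response answer the user's actual question? (fully/partially/no)
3. Does the response include any forbidden phrases? (yes/no)
   Forbidden: ${FORBIDDEN_PHRASES.join(', ')}

Return JSON: {"factually_wrong": bool, "answers": "fully"|"partially"|"no", "forbidden": bool}`;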

This is much more stable than asking for a score. Each binary answer is easier to verify and less prone to drift.

2. Use a different model from the one being evaluated. If you're testing Claude, judge with GPT-4o. If you're testing Gemini, judge with Claude. The two models' biases differ, so the eval is more honest. Self-judging is biased toward "this looks right because I would have written it this way".

3. Pin the judge model. When Anthropic ships a new Claude version, don't auto-upgrade the judge. Pin the exact model ID. The day you upgrade the judge, your scores can shift even though nothing about the system under test changed.
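For the first principle, "aggregate to a number outside the LLM" might look like the sketch below. It assumes the judge replies with the JSON shape requested in the prompt above; the weighting (hard failures score zero, partial answers earn half credit) is an illustrative choice, not a recommendation from this article.

```typescript
type JudgeVerdict = {
  factually_wrong: boolean;
  answers: "fully" | "partially" | "no";
  forbidden: boolean;
};

// Parses the judge's raw reply. Returns null on malformed output instead of
// guessing, so the runner can count it as a judge failure.
function parseVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { factually_wrong, answers, forbidden } = data as Record<string, unknown>;
  if (typeof factually_wrong !== "boolean" || typeof forbidden !== "boolean") return null;
  if (answers !== "fully" && answers !== "partially" && answers !== "no") return null;
  return { factually_wrong, answers, forbidden };
}

// Assumed weighting: wrong facts or forbidden phrases zero the example.
function verdictToScore(verdict: JudgeVerdict): number {
  if (verdict.factually_wrong || verdict.forbidden) return 0;
  if (verdict.answers === "fully") return 1;
  if (verdict.answers === "partially") return 0.5;
  return 0;
}
```

Keeping the weights in code means they can change without touching the judge prompt, so judge behavior and score definition drift independently.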

Rule-based scoring for what you can measure

Not every check needs an LLM. The cheapest eval signals are deterministic:

String contains for must-include / must-not-include.
JSON shape for tool-call schemas. Did the model call the right tool with the right argument types?
Length bounds ("response between 50 and 500 chars").
Tone proxies: presence of forbidden words ("guarantee", "promise"), max number of exclamation marks, no all-caps.
Structured output validation: if the agent should return JSON, parse it and reject on failure.

Rule-based checks take about a millisecond each and cost essentially nothing. Layer them on top of LLM-as-judge for the things judges over-complicate. We typically get 30–60% of our score signal from deterministic checks and the rest from a single judge call per example.
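A few of the deterministic checks listed above, sketched as plain functions. The `must_include` and `must_not_include` fields mirror the example JSON format; `min_chars`, `max_chars`, and the function names are assumptions made for illustration, as is the case-insensitive matching.

```typescript
// Hypothetical expectation shape for rule-based checks.
interface RuleExpectation {
  must_include?: string[];
  must_not_include?: string[];
  min_chars?: number; // assumed field name for the lower length bound
  max_chars?: number; // assumed field name for the upper length bound
}

// Returns human-readable failures; an empty array means every check passed.
function runRuleChecks(response: string, expected: RuleExpectation): string[] {
  const failures: string[] = [];
  const text = response.toLowerCase(); // case-insensitive matching is an assumption

  for (const phrase of expected.must_include ?? []) {
    if (!text.includes(phrase.toLowerCase())) failures.push(`missing "${phrase}"`);
  }
  for (const phrase of expected.must_not_include ?? []) {
    if (text.includes(phrase.toLowerCase())) failures.push(`forbidden "${phrase}"`);
  }
  if (expected.min_chars !== undefined && response.length < expected.min_chars) {
    failures.push(`shorter than ${expected.min_chars} chars`);
  }
  if (expected.max_chars !== undefined && response.length > expected.max_chars) {
    failures.push(`longer than ${expected.max_chars} chars`);
  }
  return failures;
}

// Structured-output validation: true only if the raw string parses as JSON.
function parsesAsJson(raw: string): boolean {
  try {
    JSON.parse(raw);
    return true;
  } catch {
    return false;
  }
}
```

Returning the list of failures rather than a bare boolean gives the per-example breakdown something readable to show when a check trips.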

CI integration — the one job that catches everything

The whole point of building eval is that it runs on every PR, blocking or non-blocking, and you see the score before you merge.

A minimal GitHub Actions job (the same steps port to GitLab CI in its own syntax):

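eval:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    # Assumption: pnpm comes from corepack; swap in your own pnpm setup step if needed
    - run: corepack enable
    - run: pnpm install
    # Run the eval set against the system under test and write per-example results
    - run: pnpm tsx scripts/eval.ts --set=v1 --output=eval-results.json
      env:
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # for judge
    # Keep the raw results so a reviewer can inspect individual examples
    - uses: actions/upload-artifact@v4
      with:
        name: eval-results
        path: eval-results.json
    # Diff against the baseline from main and report on the PR
    - run: pnpm tsx scripts/eval-compare.ts --baseline=main --current=eval-results.json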

The runner script does three things: load examples, run them through the system under test, write per-example results. The compare script diffs the current run against the baseline (the last green main) and posts a comment on the PR with the score, the regressions, and the wins.
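The core of a compare script can be small. The sketch below assumes each run has been reduced to a map from example id to pass/fail; `PassMap`, `compareRuns`, and `formatComment` are hypothetical names, and posting the comment to the PR is left out.

```typescript
// Assumed result shape: example id -> whether the example passed.
type PassMap = Record<string, boolean>;

interface Comparison {
  score: number; // fraction of current examples that passed
  delta: number; // current score minus baseline score
  regressions: string[]; // passed on baseline, fail now
  wins: string[]; // failed on baseline, pass now
}

function passRate(results: PassMap): number {
  const values = Object.values(results);
  if (values.length === 0) return 0;
  return values.filter(Boolean).length / values.length;
}

function compareRuns(baseline: PassMap, current: PassMap): Comparison {
  const regressions: string[] = [];
  const wins: string[] = [];
  for (const [id, passed] of Object.entries(current)) {
    if (!(id in baseline)) continue; // new example: nothing to compare against
    if (baseline[id] && !passed) regressions.push(id);
    if (!baseline[id] && passed) wins.push(id);
  }
  const score = passRate(current);
  return { score, delta: score - passRate(baseline), regressions, wins };
}

// Builds the one-line summary for the PR comment.
function formatComment(c: Comparison): string {
  const pct = Math.round(c.score * 100);
  const deltaPts = Math.round(c.delta * 100);
  const sign = deltaPts >= 0 ? "+" : "";
  return `eval score: ${pct}% (${sign}${deltaPts}% vs main, ${c.regressions.length} regressions, ${c.wins.length} wins)`;
}
```

Listing regression ids alongside the headline number is what makes the comment actionable: a reviewer can open the artifact and read the exact examples that flipped.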

Two design choices that pay off:

Non-blocking by default. If the eval is slow or flaky in its first weeks, a blocking check will frustrate the team and they'll disable it. Non-blocking with a clear PR comment ("eval score: 87% (-3% vs main, 4 regressions)") gets you 90% of the value. You can flip to blocking after two weeks of stability.

Keep cost predictable. Run the eval on every PR, but cap the cost. 100 examples × $0.01 per call = $1 per run. At 10 PRs/day that's $10/day, or about $300/month. Don't let the eval set creep to 5,000 examples until you have a budget for it. If it grows, run a random sample of 100 on each PR and the full set on main only.
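Both the cost arithmetic and the per-PR sampling are easy to put in code. The sketch below is one possible approach under stated assumptions: the function names are hypothetical, and seeding the sample (for instance from the PR number) is an added design choice so that re-running the same PR scores the same subset.

```typescript
// Back-of-envelope monthly cost: examples x cost per call x runs per day x days.
function estimateMonthlyCost(
  examples: number,
  costPerCall: number,
  runsPerDay: number,
  daysPerMonth = 30,
): number {
  return examples * costPerCall * runsPerDay * daysPerMonth;
}

// Small seeded PRNG (mulberry32) so the same seed always picks the same subset.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Picks up to `n` items without replacement using a partial Fisher-Yates shuffle.
function sampleExamples<T>(items: readonly T[], n: number, seed: number): T[] {
  const pool = [...items];
  const rand = mulberry32(seed);
  const k = Math.min(n, pool.length);
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    const tmp = pool[i]!;
    pool[i] = pool[j]!;
    pool[j] = tmp;
  }
  return pool.slice(0, k);
}
```

With the numbers from the paragraph above, `estimateMonthlyCost(100, 0.01, 10)` comes to about 300. A sampled PR score is noisier than the full-set score on main, so small deltas on a sample deserve less weight.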

The drift dashboard

A score on every PR is good. A trend over time is better. We log every eval run to a simple time-series table:

| run_id | timestamp | git_sha | score | model | prompt_version | by_tag |
| ------ | --------- | ------- | ----- | ----- | -------------- | ------ |
| ...    | ...       | ...     | 0.87  | sonnet-4-6 | v3 | { billing: 0.91, routing: 0.82 } |

Plot score over timestamp, color by prompt_version. You can spot:

Slow drift when the model vendor silently shifts behavior (rare but real).
Cliffs when a prompt change tanked a tag.
Recoveries so you know which PR fixed the regression.

This is the dashboard senior engineers refer to when someone says "is the agent better or worse than last quarter?" Without it, you're answering by feel.
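Cliffs can also be flagged automatically from the same table. The sketch below assumes rows shaped like the columns above with ISO 8601 timestamps; `findCliffs` and the 0.05 threshold are illustrative choices, not values from this article.

```typescript
interface EvalRun {
  run_id: string;
  timestamp: string; // ISO 8601, so string order matches time order
  git_sha: string;
  score: number;
  model: string;
  prompt_version: string;
  by_tag: Record<string, number>;
}

interface Cliff {
  git_sha: string; // the run where the drop appeared
  tag: string;
  drop: number; // previous tag score minus current tag score
}

// Flags every tag whose score fell by at least `threshold` between consecutive runs.
function findCliffs(runs: EvalRun[], threshold = 0.05): Cliff[] {
  const ordered = [...runs].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
  const cliffs: Cliff[] = [];
  for (let i = 1; i < ordered.length; i++) {
    const prev = ordered[i - 1]!;
    const curr = ordered[i]!;
    for (const [tag, score] of Object.entries(curr.by_tag)) {
      const before = prev.by_tag[tag];
      if (before === undefined) continue; // tag is new in this run
      const drop = before - score;
      if (drop >= threshold) cliffs.push({ git_sha: curr.git_sha, tag, drop });
    }
  }
  return cliffs;
}
```

Comparing per tag rather than on the aggregate score catches the case where one slice collapses while the overall number barely moves.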

The eval-set gaming trap

You've built the eval. The score is 73%. You change the prompt and now it's 89%. Did you actually improve the system?

Maybe. Maybe you over-fit the prompt to the eval set.

Three habits to avoid this:

1. Hold out a test split. Keep 20% of your examples in a test set the team doesn't see during prompt iteration. Run it monthly. If the held-out score diverges significantly from the score on the examples you iterate against, you've started gaming.

2. Refresh the eval set quarterly. Pull 20 new real-traffic examples, retire 20 old ones. The eval set should follow drift in production traffic, not the other way around.

3. Don't optimize for the score directly. Optimize for the failure modes it surfaces. If example 7 now passes because you added a string match for "renewal date", but the agent doesn't actually understand the renewal date concept, you've moved the score and not the system.
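One way to implement the hold-out split from the first habit is to derive it from a hash of the example id, so an example never moves between sets as the file grows. This is a sketch under assumptions: the helper names are hypothetical, and the hash is an FNV-1a-style function over UTF-16 code units, chosen only because it needs no imports.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// An example is held out when its hash bucket (0..99) falls below the percentage.
function isHeldOut(id: string, holdoutPercent = 20): boolean {
  return fnv1a(id) % 100 < holdoutPercent;
}

// Splits examples into the set used during prompt iteration and the held-out set.
function splitExamples<T extends { id: string }>(
  examples: T[],
  holdoutPercent = 20,
): { dev: T[]; holdout: T[] } {
  const dev: T[] = [];
  const holdout: T[] = [];
  for (const example of examples) {
    (isHeldOut(example.id, holdoutPercent) ? holdout : dev).push(example);
  }
  return { dev, holdout };
}
```

On a set of 50 to 100 examples the held-out share will only be roughly 20%, so it is worth checking the actual counts once after the first split.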

Tools

You don't need a framework on day one. A folder with examples.json, a runner.ts, a judge.ts, and a compare.ts is enough.

If you do want a framework:

Ragas — RAG-specific, faithfulness and answer relevancy metrics out of the box. Solid for vanilla RAG; opinionated for everything else.
DeepEval — broader scope, plays well with pytest and typical Python test runners.
PromptFoo — config-driven, good if you need to A/B prompts cheaply across providers.
Roll your own — what we usually end up with after a month, because frameworks lock you into their score model and you eventually want different rubrics per agent.

Start with the simplest thing that gives you a number. Replace it when it stops being enough.

What to invest in once the basics work

After a few months of running the basic pipeline, three additions become worth the investment:

1. Human spot-checks. Once a week, review 10 random examples by hand. The judge is biased; you'll catch judge failures only if you eyeball outputs occasionally.

2. Curated regression set. When a real user reports a bug, add it to the eval set. From then on, a regression on that exact failure mode shows up in the next eval run instead of in production. Across a year of production this is the most valuable layer of eval: a database of "things that went wrong, and we made sure they can't again".

3. Per-tenant slices in multi-tenant SaaS. Org-A's queries can get worse while Org-B's improve. The aggregate score hides this. Slice by tenant for the agents that matter.

When eval doesn't pay back

A short list of cases where building eval is overkill:

Pure proof-of-concept demos that won't ship to production.
Single-use generation tasks (a one-time data migration, a one-shot report). Just spot-check by hand.
Ultra-low-stakes outputs (a "cool emoji of the day" agent). The cost of being wrong is too low to justify the work.

For everything else — anything you're going to ship and rely on — eval is the cheapest insurance available.

Closing thoughts

The mental shift here is to stop treating eval as an academic exercise and start treating it like a unit test for AI features. You don't unit-test by hand-running the code; you don't eval by hand-reading transcripts. Build the runner once. Run it on every PR. Look at the dashboard.

If you've shipped an AI feature without eval and you're not sure how to retrofit one without three weeks of work, get in touch. Two days is usually enough to get the first useful score.

