2026-06-02 · 7 min

Shipping LLM features that survive production

Most LLM features work in the demo and break in the real world. Here is the evaluation layer that keeps them honest — and how to measure 'correct' before you ship.

There is a reliable pattern in AI projects: the prototype is dazzling and the production system is embarrassing. The same prompt that nailed every example in the demo starts missing obvious cases the moment real users touch it. The model did not get worse. Your input distribution changed: the demo ran on examples you chose, and production runs on whatever people actually type.

Real input is adversarial without trying to be. People are sarcastic ("oh sure, I'll solve world hunger by Friday"). They are vague ("I'll look into it"). They write in three languages in one message. They bury the signal in small talk. A keyword matcher dies instantly here, but so does a naive prompt that was only ever tested on clean examples.
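To make the keyword-matcher failure concrete, here is a minimal sketch. The trigger phrases, the French message, and the expected labels are my own illustrative assumptions, not the rules or data from the Slack feature.

```python
import re

# Hypothetical keyword rule: flag any message with a first-person future phrase.
COMMITMENT_PATTERN = re.compile(r"\b(i'll|i will|i'm going to)\b", re.IGNORECASE)


def keyword_detect(message: str) -> bool:
    """Return True if the message contains one of the trigger phrases."""
    return bool(COMMITMENT_PATTERN.search(message))


# (message, expected) pairs. The expected labels are assumptions for illustration.
messages = [
    ("I'll review the PR", True),
    ("oh sure, I'll solve world hunger by Friday", False),  # sarcasm
    ("Je le ferai demain", True),  # French: "I will do it tomorrow"
]

for text, expected in messages:
    got = keyword_detect(text)
    status = "ok" if got == expected else "WRONG"
    print(f"{status:5} got={got!s:5} expected={expected!s:5} {text}")
```

The matcher gets the clean example right, fires on the sarcastic one, and misses the French one entirely. Adding more keywords moves the failures around without removing them.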

Define 'correct' before you build

The first mistake is starting from the prompt. Start from the definition of correct. For a commitment-detection feature I built into Slack, "correct" meant: does this message contain a promise the speaker is accountable for, by a time or trigger? That definition is doing real work — it tells you that "someone should review the PR" is not a commitment, but "I'll review the PR" is.

Write the definition down. Turn it into labelled examples. Now you have something a system can be measured against instead of a feeling you can argue about.
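One way to write it down is as plain data next to the code. This is a sketch of a possible shape, not the format I used: the `LabelledExample` class, the category names, and the label on the vague example are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# The written definition, kept in one place so prompts and reviewers share it.
DEFINITION = (
    "A message is a commitment if it contains a promise the speaker is "
    "accountable for, by a time or trigger."
)


@dataclass(frozen=True)
class LabelledExample:
    text: str
    is_commitment: bool
    category: str   # failure mode this example probes (hypothetical names)
    rationale: str  # why the label follows from the definition


EXAMPLES = [
    LabelledExample("I'll review the PR", True, "baseline",
                    "The speaker owns the promise."),
    LabelledExample("someone should review the PR", False, "no_owner",
                    "Nobody is accountable."),
    LabelledExample("I'll look into it", False, "vague",
                    "Assumed label: no concrete promise is made."),
]


def coverage_by_category(examples: list[LabelledExample]) -> Counter:
    """Count examples per category, to spot failure modes with no coverage."""
    return Counter(example.category for example in examples)


print(coverage_by_category(EXAMPLES))
```

Keeping a rationale beside each label is a design choice: when two people disagree about a label, the disagreement is usually about the definition, and the rationale shows where.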

The evaluation layer

The single highest-leverage component in a production LLM feature is not the prompt — it is the layer that judges the prompt's output. Two patterns carry most of the weight:

  • LLM-as-judge. A second model call independently scores each detection against the definition. It does not generate; it evaluates. This catches a large class of failures the generating model is blind to, because the judge is not invested in its own answer.
  • An eval suite. A fixed set of hand-written edge cases, organised by failure mode, that you run on every prompt change. Mine for the Slack feature had 46 cases across 8 categories: sarcasm, conditionals, passive-aggressive tone, multilingual, vague language, short ambiguous replies, and more. The system landed at 87% accuracy on that suite, and, crucially, I could say 87% rather than "it feels good."
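The two patterns above can be sketched together. This is a minimal sketch under stated assumptions, not the production code: `ModelFn` is a hypothetical "prompt in, text out" interface, the prompts are placeholders, and the two fake models are stand-ins so the example runs without an API key. The four cases and their labels are illustrative.

```python
from collections import defaultdict
from typing import Callable, NamedTuple

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, raw text out

DEFINITION = "A commitment is a promise the speaker is accountable for, by a time or trigger."


class Case(NamedTuple):
    text: str
    expected: bool
    category: str


def detect(generator: ModelFn, message: str) -> bool:
    """First call: the generating prompt proposes a detection."""
    reply = generator(f"{DEFINITION}\nMessage: {message}\nAnswer YES or NO.")
    return reply.strip().upper().startswith("YES")


def judge_accepts(judge: ModelFn, message: str) -> bool:
    """Second call: score the detection against the definition. No generation."""
    reply = judge(
        f"{DEFINITION}\nMessage: {message}\n"
        "A detector flagged this as a commitment. Answer PASS or FAIL."
    )
    return reply.strip().upper().startswith("PASS")


def pipeline(generator: ModelFn, judge: ModelFn, message: str) -> bool:
    # A detection only counts if the judge independently agrees.
    return detect(generator, message) and judge_accepts(judge, message)


def run_suite(predict: Callable[[str], bool], cases: list[Case]):
    """Return overall accuracy plus the missed messages grouped by category."""
    misses: dict[str, list[str]] = defaultdict(list)
    correct = 0
    for case in cases:
        if predict(case.text) == case.expected:
            correct += 1
        else:
            misses[case.category].append(case.text)
    return correct / len(cases), dict(misses)


# Stand-in models so the sketch runs offline. Real ones would call an LLM.
def fake_generator(prompt: str) -> str:
    return "YES" if "I'll" in prompt else "NO"


def fake_judge(prompt: str) -> str:
    return "FAIL" if "world hunger" in prompt else "PASS"


CASES = [
    Case("I'll review the PR", True, "baseline"),
    Case("someone should review the PR", False, "no_owner"),
    Case("oh sure, I'll solve world hunger by Friday", False, "sarcasm"),
    Case("I'll look into it", False, "vague"),  # assumed label
]

accuracy, misses = run_suite(lambda m: pipeline(fake_generator, fake_judge, m), CASES)
print(f"accuracy={accuracy:.0%} misses={misses}")
```

With these stand-ins the judge vetoes the sarcastic false positive, and the vague case still slips through, so the report shows both the number and the category that needs work.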

Accuracy is a number, not a vibe

Once the eval suite exists, every prompt tweak becomes a measurable experiment. Did that clever new instruction help? Run the suite. Accuracy went from 84% to 81%? Revert it. Without the suite you are tuning by anecdote, and anecdote is how LLM features quietly rot.
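The keep-or-revert decision can be a few lines of code. This is a sketch with made-up per-case results, assuming each suite run yields one pass/fail boolean per case in a fixed order; the function names and the five-case data are hypothetical.

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of suite cases the prompt got right."""
    return sum(results) / len(results)


def decide(baseline: list[bool], candidate: list[bool], min_gain: float = 0.0) -> str:
    """Keep the new prompt only if it beats the baseline by more than min_gain."""
    delta = accuracy(candidate) - accuracy(baseline)
    return "keep" if delta > min_gain else "revert"


def regressions(case_ids: list[str], baseline: list[bool], candidate: list[bool]) -> list[str]:
    """Cases the old prompt passed and the new prompt fails."""
    return [
        case_id
        for case_id, old, new in zip(case_ids, baseline, candidate)
        if old and not new
    ]


# Illustrative results for a five-case suite (made-up data).
case_ids = ["c1", "c2", "c3", "c4", "c5"]
baseline_run = [True, True, True, True, False]
candidate_run = [True, True, False, True, False]

print(decide(baseline_run, candidate_run))
print(regressions(case_ids, baseline_run, candidate_run))
```

A tie counts as "revert" here on the assumption that an instruction which does not move the number is only added complexity. The regression list matters as much as the delta: a change can hold accuracy steady while trading one failure mode for another.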

This is also what lets you ship with confidence to a non-technical founder. Hand them the evaluation tool, let them test detection quality in real time, and the conversation shifts from "do you trust the AI" to "here is the number, here are the cases it misses, here is what closing that gap costs."

The shape that works

  1. Define correct, in writing, with labelled examples.
  2. Build the generating prompt.
  3. Build a judge that scores output against the definition.
  4. Build an eval suite of edge cases grouped by failure mode.
  5. Treat every change as an experiment against the suite.

None of this is exotic. It is the difference between an AI feature that demos well and one that survives contact with real users — and it is mostly engineering discipline, not model magic.


I build LLM features with this evaluation discipline baked in, plus the backend they run on. If you have an AI feature that works in the demo and you need it to work in production, tell me about it.