The first time you test an AI feature

You send the same input twice. You get two different outputs. Your old QA mind screams "flaky test!" — but it is not flaky. It is the feature working as designed. Welcome to QA for AI applications.

🧪 The mindset shift

You are no longer asking "is this answer correct?" You are asking "is this answer good enough, often enough?" That is a statistical question, not a binary one.

From test cases to eval sets

A traditional QA test case has an input and an expected output. An eval has an input and one or more scoring functions that grade the output. The scoring function can be a string match, a regex, a JSON schema check, a similarity score, or another LLM acting as judge.

A good eval set has three slices:

Golden examples: classic happy-path requests with a clear "right" answer.
Edge cases: empty input, ambiguous input, multilingual input, very long input.
Refusal cases: requests the system should decline (out of scope, unsafe, prompt injection).

The new QA workflow

Before a prompt change ships, the engineer runs the eval set locally and posts the diff (pass rate, regression list) in the PR.
CI re-runs the eval set on every merge, with a threshold. A 2% drop in pass rate blocks the deploy.
Production samples — say 1% of real traffic — get graded async by an LLM judge plus a weekly human spot-check.
Failures feed back into the eval set, so we never regress on a real bug.

What human testers still do best

Automation grades known cases at scale. Humans find new cases. The most valuable thing my team does in AI QA is sit with real users for an hour and watch them break the system in ways no eval would predict.

Automated evals catch	Human testers catch
Regression after a prompt change	Confusing tone or phrasing
Schema violations	Cultural or domain misunderstandings
Cost or latency drift	New jailbreak techniques
Refusal coverage	"This is technically right but feels wrong"

QA tip: Treat your eval set as living documentation of the product's intended behaviour. When a stakeholder asks "what should the assistant do when…?" the answer should be in the eval set, not in a Slack thread from six months ago.

The cost reality

Eval runs cost money — every example is an LLM call. Budget for it. A practical pattern: run a small smoke eval on every PR, the full set on every merge to main, the long-tail set nightly.

Conclusion

Eval-driven QA does not replace human judgement. It scales the boring part so testers can spend their time on the high-leverage human work. For the broader QA evolution see The Future of QA and for the architectural side see Designing AI Applications That Survive Production.

Eval-Driven QA: How to Test AI Features Without Going Crazy

The first time you test an AI feature

🧪 The mindset shift

From test cases to eval sets

The new QA workflow

What human testers still do best

The cost reality

Conclusion

Sanjana

Want to ship something like this on your product?

Table of Contents

Related Insights

Building RESTful APIs with NestJS

Thinking in AI: From Deterministic to Probabilistic Systems

The AI Pair Programmer: Coding in the Age of LLMs

Related Insights

Building RESTful APIs with NestJS

Thinking in AI: From Deterministic to Probabilistic Systems

The AI Pair Programmer: Coding in the Age of LLMs