Back to Insights
QAAITestingEvaluation

Eval-Driven QA: How to Test AI Features Without Going Crazy

Sanjana
Written bySanjana
28 March 2026
5 min read
Eval-Driven QA: How to Test AI Features Without Going Crazy

The first time you test an AI feature

You send the same input twice. You get two different outputs. Your old QA mind screams "flaky test!" — but it is not flaky. It is the feature working as designed. Welcome to QA for AI applications.

🧪 The mindset shift

You are no longer asking "is this answer correct?" You are asking "is this answer good enough, often enough?" That is a statistical question, not a binary one.

From test cases to eval sets

A traditional QA test case has an input and an expected output. An eval has an input and one or more scoring functions that grade the output. The scoring function can be a string match, a regex, a JSON schema check, a similarity score, or another LLM acting as judge.

A good eval set has three slices:

  • Golden examples: classic happy-path requests with a clear "right" answer.
  • Edge cases: empty input, ambiguous input, multilingual input, very long input.
  • Refusal cases: requests the system should decline (out of scope, unsafe, prompt injection).

The new QA workflow

  1. Before a prompt change ships, the engineer runs the eval set locally and posts the diff (pass rate, regression list) in the PR.
  2. CI re-runs the eval set on every merge, with a threshold. A 2% drop in pass rate blocks the deploy.
  3. Production samples — say 1% of real traffic — get graded async by an LLM judge plus a weekly human spot-check.
  4. Failures feed back into the eval set, so we never regress on a real bug.

What human testers still do best

Automation grades known cases at scale. Humans find new cases. The most valuable thing my team does in AI QA is sit with real users for an hour and watch them break the system in ways no eval would predict.

Automated evals catch Human testers catch
Regression after a prompt change Confusing tone or phrasing
Schema violations Cultural or domain misunderstandings
Cost or latency drift New jailbreak techniques
Refusal coverage "This is technically right but feels wrong"

QA tip: Treat your eval set as living documentation of the product's intended behaviour. When a stakeholder asks "what should the assistant do when…?" the answer should be in the eval set, not in a Slack thread from six months ago.

The cost reality

Eval runs cost money — every example is an LLM call. Budget for it. A practical pattern: run a small smoke eval on every PR, the full set on every merge to main, the long-tail set nightly.

Conclusion

Eval-driven QA does not replace human judgement. It scales the boring part so testers can spend their time on the high-leverage human work. For the broader QA evolution see The Future of QA and for the architectural side see Designing AI Applications That Survive Production.


Sanjana

QA Lead

Specializes in Quality Assurance and User Experience testing, ensuring software meets the highest standards of usability.

View full profile →