📊Evals — The FAQ Every AI PM Needs

Evals are how you know whether your AI product actually works. They are also the discipline junior AI teams skip most often.

ai · quality · evals

Why it matters

Without evals, you're shipping AI features based on vibes. With evals, you can quantify quality, iterate against a number, and catch regressions before users do. PMs who run real eval programs build dramatically better AI products.

The core idea

An eval is a test for AI quality. A great eval suite has 100+ inputs covering normal cases, edge cases, and failure modes, with automated or human-scored outputs. The PM owns the eval definition; engineering owns the infrastructure. Evals are run before every prompt or model change, in CI, and on production samples.

What is an eval?

A single eval is an input + expected behavior. Examples:

Input: "What's our refund policy?" Expected: cites the refund doc, mentions 30-day window, suggests contact email.
Input: "Can I get a refund for a 6-month-old purchase?" Expected: politely declines (outside window), offers alternative.
Input: "Refund my order." Expected: doesn't actually issue refund (no agentic action), explains process.

An eval suite is a collection of these — typically 100+ — that exercises the full surface of your AI feature.
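As a rough sketch, the three refund examples above could be stored as plain data. The `EvalCase` class, its field names, and the tag values are hypothetical, chosen for illustration; many teams keep the same information in a JSON or CSV file instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval: an input plus a description of the expected behavior."""
    case_id: str
    user_input: str
    expected: str                # human-readable description of "good"
    tags: tuple[str, ...] = ()   # hypothetical labels, e.g. "normal", "edge"


# The three refund examples from the text, expressed as cases.
SUITE = [
    EvalCase(
        "refund-policy",
        "What's our refund policy?",
        "Cites the refund doc, mentions 30-day window, suggests contact email.",
        ("normal",),
    ),
    EvalCase(
        "refund-outside-window",
        "Can I get a refund for a 6-month-old purchase?",
        "Politely declines (outside window), offers alternative.",
        ("edge",),
    ),
    EvalCase(
        "refund-action-request",
        "Refund my order.",
        "Doesn't issue a refund (no agentic action), explains process.",
        ("edge", "safety"),
    ),
]


def cases_with_tag(suite: list[EvalCase], tag: str) -> list[EvalCase]:
    """Select the slice of the suite that carries a given tag."""
    return [case for case in suite if tag in case.tags]


edge_ids = [case.case_id for case in cases_with_tag(SUITE, "edge")]
```

Tagging cases up front makes it easy to report pass rates per slice (normal, edge, adversarial) rather than one blended number.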

Why evals matter

Without them:

You ship and hope
Regressions go undetected until users complain
You can't compare model A vs model B
Prompt iterations are based on vibe, not signal

With them:

Quality is quantified (87% pass rate)
Iterations are measurable (prompt v3 → 92%)
Model swaps are evaluated (GPT-5 vs Claude → A/B on eval suite)
Regressions caught in CI
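A minimal sketch of how "quantified" and "measurable" might look in practice, assuming each run produces a mapping from case id to pass/fail. The function names and the result values are made up for illustration.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed, from a case_id -> passed mapping."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs (two prompts or two models) over the cases they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "baseline_rate": pass_rate({k: baseline[k] for k in shared}),
        "candidate_rate": pass_rate({k: candidate[k] for k in shared}),
        # Cases that went from fail to pass.
        "fixed": [k for k in shared if not baseline[k] and candidate[k]],
        # Cases that went from pass to fail: the regressions.
        "broken": [k for k in shared if baseline[k] and not candidate[k]],
    }


# Made-up results for two prompt versions over the same four cases.
prompt_v2 = {"c1": True, "c2": False, "c3": True, "c4": False}
prompt_v3 = {"c1": True, "c2": True, "c3": False, "c4": True}

report = compare(prompt_v2, prompt_v3)
print(report)
```

The per-case `broken` list is worth keeping next to the headline rate: an overall pass rate can go up while individual cases quietly regress.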

How to build an eval suite

Step 1: Collect inputs. Draw from real user queries (anonymized), edge cases your team brainstorms, and adversarial cases (prompt injection attempts, weird inputs).

Step 2: Define expected behavior per input. Could be:

Exact match (rare)
Contains certain keywords or facts
Doesn't contain certain things (PII, harmful content)
Structured criteria scored by LLM judge
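The "contains" and "doesn't contain" checks above are simple enough to sketch directly. This is an illustrative version, not a complete PII detector: the helper names are hypothetical, and the email regex is a deliberately simple stand-in for whatever PII rules your product needs.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def contains_all(output: str, required: list[str]) -> bool:
    """True if every required phrase appears (case-insensitive)."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required)


def contains_none(output: str, forbidden: list[str]) -> bool:
    """True if no forbidden phrase appears (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in forbidden)


def leaks_email(output: str, allowed: frozenset[str] = frozenset()) -> bool:
    """True if the output contains an email address not on the allow-list."""
    return any(match not in allowed for match in EMAIL_RE.findall(output))


answer = "Refunds are available within the 30-day window. Contact support@example.com."

has_facts = contains_all(answer, ["30-day", "support@example.com"])
no_false_claim = contains_none(answer, ["your refund has been issued"])
leaked = leaks_email(answer, allowed=frozenset({"support@example.com"}))
```

Checks like these are cheap and deterministic, so they suit objective criteria; anything that needs judgment about tone or completeness belongs with the structured criteria.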

Step 3: Pick scoring method.

Automated rule-based. Fast, deterministic, brittle for nuanced outputs.
LLM-as-judge. Use a second model to score. Faster than humans, more flexible than rules, but noisier.
Human scoring. Highest quality, expensive and slow. Use for golden set / periodic calibration.

Most production setups: LLM-as-judge for the bulk, human scoring on a sample for calibration, rules for objective checks.
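A minimal LLM-as-judge sketch, under stated assumptions: the rubric text is invented for the refund example, and `call_model` is a placeholder for whatever LLM client you use (no real API is called here). The one design choice worth copying is that unparseable judge output counts as a failure rather than a silent pass.

```python
import json
from typing import Callable

# Hypothetical rubric for the refund example; write yours per feature.
RUBRIC = """You are grading a customer-support answer.
Criteria:
1. Mentions the 30-day refund window.
2. Does not claim a refund was issued.
Reply with JSON only: {"pass": true or false, "reason": "<one sentence>"}"""


def build_judge_prompt(user_input: str, answer: str) -> str:
    """Combine the rubric, the user input, and the answer being graded."""
    return f"{RUBRIC}\n\nUser input:\n{user_input}\n\nAnswer to grade:\n{answer}"


def judge(user_input: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """Score one answer. `call_model` is a placeholder for your LLM client."""
    raw = call_model(build_judge_prompt(user_input, answer))
    try:
        verdict = json.loads(raw)
        return {"pass": bool(verdict["pass"]), "reason": str(verdict.get("reason", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # A judge reply we cannot parse is flagged for review, never passed.
        return {"pass": False, "reason": "judge output could not be parsed"}


def fake_model(prompt: str) -> str:
    """Stand-in so the sketch runs offline; swap in a real API call."""
    return '{"pass": true, "reason": "Mentions the window and issues nothing."}'


verdict = judge("What's our refund policy?", "Refunds within 30 days.", fake_model)
```

Passing the model call in as a function keeps the scoring logic testable without network access and makes it easy to use a different model for judging than for generation.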

Step 4: Run and iterate. Eval suite runs before every prompt change. Quality must not regress.
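One way "quality must not regress" could be enforced is a small gate that turns two pass rates into an exit code. This is a sketch: the `tolerance` parameter is an assumption added here to absorb judge noise, and the numbers are illustrative.

```python
def gate(baseline_rate: float, candidate_rate: float, tolerance: float = 0.0) -> int:
    """Return an exit code: 0 allows the change, 1 blocks it."""
    if candidate_rate + tolerance < baseline_rate:
        print(f"FAIL: pass rate fell from {baseline_rate:.1%} to {candidate_rate:.1%}")
        return 1
    print(f"OK: pass rate {candidate_rate:.1%} (baseline {baseline_rate:.1%})")
    return 0


# Illustrative numbers; in CI these would come from the stored baseline
# and the fresh run, and the script would end with sys.exit(exit_code).
exit_code = gate(baseline_rate=0.87, candidate_rate=0.92)
```

With LLM-judged scores, a small nonzero tolerance can stop random judge variation from blocking harmless changes, at the cost of letting equally small real regressions through.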

LLM-as-judge pitfalls

LLM-as-judge is the most common eval method. It works, but it has known failure modes:

Bias toward longer answers
Self-preference (e.g., Claude rating Claude's output higher than GPT's)
Inconsistency on subjective criteria

Mitigations: use a different model than the one being evaluated, write very specific rubrics, validate against human scores periodically.
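"Validate against human scores" can be made concrete by measuring how often the judge agrees with human labels on the same cases. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) for pass/fail labels; the label lists are made up for illustration.

```python
def agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Raw agreement and Cohen's kappa between human and judge pass/fail labels."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n   # share of cases the humans passed
    p_judge = sum(judge) / n   # share of cases the judge passed
    # Agreement expected by chance if the two raters were independent.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Kappa is undefined when chance agreement is 1; treat that as perfect.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"raw_agreement": observed, "cohens_kappa": kappa}


# Made-up labels for ten cases.
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True, True, False]

stats = agreement(human_labels, judge_labels)
print(stats)
```

Raw agreement alone can flatter a judge when most cases pass, which is why the chance-corrected number is worth tracking alongside it.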

See the related concept "Why LLM Judges Fail (and How to Fix Them)" for the full deep dive.

Eval cadence

Pre-merge: every prompt or model change runs the full suite in CI. Block the merge on regression.
Production sampling: 1-5% of production traffic gets evaluated continuously. Catches drift.
Weekly review: eval scores are part of the team's weekly review, so trends stay visible.
Quarterly refresh: add new evals based on production failures and new use cases.
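For the production sampling item, one common approach (an assumption here, not something the text prescribes) is to hash each request id into a bucket, so the same request is always either in or out of the sample. The function name and the 2% default are illustrative.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Rough check of the sampling rate over synthetic request ids.
sampled = sum(in_eval_sample(f"req-{i}", rate=0.05) for i in range(20000))
print(f"sampled {sampled} of 20000 requests")
```

Hash-based selection needs no shared state between servers and makes a sampled request reproducible when someone wants to investigate it later.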

The size question

How many evals? Depends on the surface:

Narrow feature (e.g., subject line generation): 50-100 evals.
Broad feature (e.g., customer support chat): 500-1000.
Production agent system: 1000+, structured by capability.

Don't aim for 10,000 evals on day 1. Start with 50, learn what fails, expand.

Key frameworks

LLM-as-judge

Use one model to score another's output. Standard approach in 2026.

Golden set

Small subset of evals (20-50) hand-scored by humans, used to calibrate LLM judges.

Eval-driven development

Write the eval before the prompt, then iterate the prompt against the eval score. It is the test-driven development (TDD) of AI.
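The loop can be sketched in a few lines: the cases exist first, then each prompt version is scored against them. Everything here is a toy under stated assumptions: `fake_generate` stands in for a real model call, and the cases use a simple "must contain" check rather than a full rubric.

```python
from typing import Callable

Case = tuple[str, str]  # (user input, phrase the answer must contain)


def score_prompt(prompt: str, cases: list[Case],
                 generate: Callable[[str, str], str]) -> float:
    """Pass rate of one prompt over the eval cases."""
    passed = sum(must.lower() in generate(prompt, user_input).lower()
                 for user_input, must in cases)
    return passed / len(cases)


# 1. The eval is written first.
CASES: list[Case] = [
    ("What's our refund policy?", "30-day"),
    ("Refund my order.", "process"),
]


# 2. Stand-in for the model so the sketch runs offline (hypothetical behavior).
def fake_generate(prompt: str, user_input: str) -> str:
    answer = "Refunds are available within the 30-day window."
    if "explain the process" in prompt.lower():
        answer += " Here is the process to request one."
    return answer


# 3. Prompt versions are iterated against the score.
PROMPTS = {
    "v1": "Answer refund questions.",
    "v2": "Answer refund questions and explain the process.",
}
scores = {name: score_prompt(text, CASES, fake_generate)
          for name, text in PROMPTS.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```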

Real-world examples

Anthropic

Eval suites as competitive moat

Anthropic's investment in evals is framed here as a structural advantage: its internal eval suites are said to be far larger than competitors', which makes model improvement more reliable. AI-native PM teams are adopting the same pattern as a core engineering discipline.

Go deeper — recommended reading

Evals FAQ

Aakash Gupta · Product Growth

Why 90% of LLM Judges Fail — And How PMs Can Fix Them

Aakash Gupta · Product Growth

AI Testing for Product Managers

Aakash Gupta · Product Growth

Interview questions

Walk me through how you'd build an eval suite for an AI customer support chatbot.

ai-pm · senior

Five-step build over 2-3 weeks.

Step 1: Inventory. Pull 500+ real customer support questions from the last 6 months. Anonymize. Categorize by topic and difficulty.

Step 2: Sample. Pick 100-200 representative + edge cases. Include adversarial ones (prompt injection, off-topic, abusive language).

Step 3: Define expected behavior per input. For each, write what 'good' looks like — covers facts X and Y, cites source Z, escalates if [condition]. Use structured rubric so scoring is consistent.

Step 4: Build scoring infra. Use LLM-as-judge (with a different model than the one in production) for the bulk. Human-score a 20-eval golden set and use it to calibrate the judge. Add rule-based checks for objective criteria (no PII in the output, valid JSON).

Step 5: Wire into CI. Every prompt or model change triggers full eval run. Regression blocks merge. Production sampling continuously evaluates 1-5% of live traffic.

I'd start with 100 evals and grow to 500 over the first quarter as patterns of production failure emerge. The discipline of catching regressions before users see them is the single biggest quality lever.

Related concepts

⚖️Why LLM Judges Fail (and How to Fix Them)

LLM-as-judge is now the default eval method. Most implementations are unreliable. Here's why and what to do about it.

📝Prompt Engineering in 2026

The patterns that work with current frontier models. Less about clever tricks, more about clear instructions and good examples.