Evals: The FAQ Every AI PM Needs
Evals are how you know whether your AI product actually works. They are also the discipline junior AI teams skip most often.
Without evals, you're shipping AI features based on vibes. With evals, you can quantify quality, iterate against a number, and catch regressions before users do. PMs who run real eval programs build dramatically better AI products.
An eval is a test for AI quality. A great eval suite has 100+ inputs covering normal cases, edge cases, and failure modes, with automated or human-scored outputs. The PM owns the eval definition; engineering owns the infrastructure. Evals are run before every prompt or model change, in CI, and on production samples.
What is an eval?
A single eval is an input + expected behavior. Examples:
- Input: "What's our refund policy?" Expected: cites the refund doc, mentions 30-day window, suggests contact email.
- Input: "Can I get a refund for a 6-month-old purchase?" Expected: politely declines (outside window), offers alternative.
- Input: "Refund my order." Expected: doesn't actually issue refund (no agentic action), explains process.
An eval suite is a collection of these, typically 100+, that exercises the full surface of your AI feature.
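One way to make "input + expected behavior" concrete is a small record type. The sketch below is illustrative only: the `EvalCase` class and its field names are assumptions, not part of any specific eval framework, and the forbidden phrase in the third case is a made-up example.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One eval: an input plus a description of the expected behavior (hypothetical schema)."""
    input_text: str
    must_include: list[str] = field(default_factory=list)      # facts the answer should mention
    must_not_include: list[str] = field(default_factory=list)  # phrases the answer must avoid
    rubric: str = ""                                            # free-text criteria for a human or LLM judge


# A tiny suite mirroring the three refund examples above.
SUITE = [
    EvalCase(
        "What's our refund policy?",
        must_include=["30-day"],
        rubric="Cites the refund doc and suggests the contact email.",
    ),
    EvalCase(
        "Can I get a refund for a 6-month-old purchase?",
        rubric="Politely declines (outside window) and offers an alternative.",
    ),
    EvalCase(
        "Refund my order.",
        must_not_include=["refund has been issued"],  # assumed phrasing of an unwanted agentic action
        rubric="Explains the process without issuing a refund.",
    ),
]
```

Keeping objective checks (`must_include`, `must_not_include`) separate from the free-text `rubric` makes it easier to route each case to rules or to a judge later.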
Why evals matter
Without them:
- You ship and hope
- Regressions go undetected until users complain
- You can't compare model A vs model B
- Prompt iterations are based on vibe, not signal
With them:
- Quality is quantified (87% pass rate)
- Iterations are measurable (prompt v3 → 92%)
- Model swaps are evaluated (GPT-5 vs Claude → A/B on eval suite)
- Regressions caught in CI
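To show what "quantified" and "measurable" can look like in practice, here is a minimal sketch that computes a pass rate and diffs two runs case by case. The case IDs and results are invented for illustration; `pass_rate` and `compare_runs` are hypothetical helper names.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results.values()) / len(results)


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List the cases a candidate fixed or regressed relative to the baseline."""
    shared = baseline.keys() & candidate.keys()
    return {
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
    }


# Invented results for two prompt versions on the same four cases.
prompt_v2 = {"refund-policy": True, "late-refund": False, "refund-action": True, "injection": False}
prompt_v3 = {"refund-policy": True, "late-refund": True, "refund-action": False, "injection": False}

diff = compare_runs(prompt_v2, prompt_v3)
print(f"v2: {pass_rate(prompt_v2):.0%}, v3: {pass_rate(prompt_v3):.0%}, diff: {diff}")
```

In this made-up example both versions score the same overall, yet one case was fixed and another regressed, which is why a per-case diff is worth reading alongside the headline number.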
How to build an eval suite
Step 1: Collect inputs. Draw from real user queries (anonymized), edge cases your team brainstorms, and adversarial cases (prompt injection attempts, weird inputs).
Step 2: Define expected behavior per input. Could be:
- Exact match (rare)
- Contains certain keywords or facts
- Doesn't contain certain things (PII, harmful content)
- Structured criteria scored by LLM judge
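The first three kinds of expected behavior can be checked with plain rules. The following is a rough sketch, assuming case-insensitive substring matching is good enough for keywords and using a US Social Security number pattern as one stand-in example of PII; real PII detection needs far more than one regex.

```python
import re

# Assumed example of a PII pattern: US SSN-like strings such as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def exact_match(output: str, expected: str) -> bool:
    """Strict equality after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def contains_all(output: str, required: list[str]) -> bool:
    """True if every required keyword or fact appears (case-insensitive)."""
    lowered = output.lower()
    return all(item.lower() in lowered for item in required)


def contains_none(output: str, forbidden: list[str]) -> bool:
    """True if no forbidden phrase appears (case-insensitive)."""
    lowered = output.lower()
    return not any(item.lower() in lowered for item in forbidden)


def has_ssn(output: str) -> bool:
    """True if the output contains an SSN-like string."""
    return SSN_RE.search(output) is not None


answer = "Refunds are accepted within a 30-day window. Email support for help."
print(contains_all(answer, ["30-day", "refund"]), contains_none(answer, ["guaranteed"]), has_ssn(answer))
```

Criteria that cannot be reduced to checks like these (tone, helpfulness, whether an alternative was offered) are the ones that go to a structured rubric and a judge.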
Step 3: Pick scoring method.
- Automated rule-based. Fast, deterministic, brittle for nuanced outputs.
- LLM-as-judge. Use a second model to score. Faster than humans, more flexible than rules, but noisier.
- Human scoring. Highest quality, expensive and slow. Use for golden set / periodic calibration.
Most production setups: LLM-as-judge for the bulk, human scoring on a sample for calibration, rules for objective checks.
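The layering described above (rules for objective checks, a judge for the rest) might be wired together roughly as follows. This is a sketch under assumptions: the judge is passed in as a plain callable so no particular model API is implied, and `stub_judge` is a trivial stand-in used only to make the example runnable.

```python
from typing import Callable

# (user_input, output, rubric) -> pass/fail. A real judge would call a model here.
Judge = Callable[[str, str, str], bool]


def score_case(user_input: str, output: str, rubric: str,
               forbidden: list[str], judge: Judge) -> dict:
    """Run cheap deterministic rules first; fall through to the judge only if they pass."""
    for phrase in forbidden:
        if phrase.lower() in output.lower():
            return {"passed": False, "scored_by": "rule", "reason": f"contains {phrase!r}"}
    return {"passed": judge(user_input, output, rubric), "scored_by": "judge", "reason": ""}


def stub_judge(user_input: str, output: str, rubric: str) -> bool:
    """Placeholder judge: passes if the output mentions the rubric keyword."""
    return rubric.lower() in output.lower()


blocked = score_case("Refund my order.", "Your refund has been issued.",
                     "process", ["refund has been issued"], stub_judge)
judged = score_case("Refund my order.", "Here is how the refund process works.",
                    "process", [], stub_judge)
print(blocked["scored_by"], judged["scored_by"])
```

Recording which scorer produced each verdict (`scored_by`) makes it straightforward to pull a sample of judge-scored cases for human calibration later.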
Step 4: Run and iterate. The eval suite runs before every prompt change, and quality must not regress.
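"Quality must not regress" can be enforced with a small gate that compares the candidate's pass rate to the stored baseline. The sketch below is one possible shape; the `tolerance` parameter is an assumption added to absorb judge noise and is not something the text above prescribes.

```python
def regression_gate(baseline_rate: float, candidate_rate: float, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 if the candidate holds the line, 1 if it regresses.

    `tolerance` is an assumed allowance for scoring noise, e.g. 0.02 for two points.
    """
    if candidate_rate >= baseline_rate - tolerance:
        return 0
    print(f"Regression: {candidate_rate:.1%} is below baseline {baseline_rate:.1%}")
    return 1


# In a CI job the result would typically be passed to sys.exit().
print(regression_gate(0.87, 0.92))                  # improvement passes the gate
print(regression_gate(0.92, 0.87))                  # drop fails the gate
print(regression_gate(0.90, 0.89, tolerance=0.02))  # small dip within tolerance passes
```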
LLM-as-judge pitfalls
LLM-as-judge is the most common eval method. It works, but it has known failure modes:
- Bias toward longer answers
- Self-preference (e.g., Claude rating Claude's output higher than GPT's)
- Inconsistency on subjective criteria
Mitigations: use a different model than the one being evaluated, write very specific rubrics, validate against human scores periodically.
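Validating a judge against human scores usually comes down to measuring agreement on the same set of cases. Here is a minimal sketch using raw agreement and Cohen's kappa (which corrects for agreement expected by chance); the labels are invented, and the choice of kappa as the metric is an assumption rather than something the text mandates.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of cases where the judge's pass/fail matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for binary pass/fail labels."""
    observed = agreement(human, judge)
    n = len(human)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Probability both say "pass" by chance plus probability both say "fail" by chance.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters gave one identical label to every case
    return (observed - expected) / (1 - expected)


# Invented pass/fail labels for a 10-case golden set.
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True, True, False]

print(f"agreement={agreement(human_labels, judge_labels):.2f}, "
      f"kappa={cohens_kappa(human_labels, judge_labels):.2f}")
```

If agreement drops between calibration rounds, that is a signal to tighten the rubric or revisit the judge model before trusting its scores.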
See LLM Judges Fail concept for the full deep-dive.
Eval cadence
- Pre-merge (CI): every prompt or model change runs the full suite. Block the merge on any regression.
- Production sampling: 1-5% of production traffic gets evaluated continuously. Catches drift.
- Weekly review: eval scores in the team's review. Trends visible.
- Quarterly refresh: add new evals based on production failures and new use cases.
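For the production-sampling item above, one common approach is to hash a request ID into a bucket so the same request is always either in or out of the sample. This is a sketch, assuming each request carries a stable string ID; the `should_sample` name and the 2% default are illustrative choices within the 1-5% range.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests for evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Rough check of the sampled fraction over synthetic request IDs.
sampled = sum(should_sample(f"req-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 requests")
```

Hashing rather than calling a random number generator keeps the decision reproducible, so a sampled request can be re-evaluated later with a different judge or rubric.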
The size question
How many evals? Depends on the surface:
- Narrow feature (e.g., subject line generation): 50-100 evals.
- Broad feature (e.g., customer support chat): 500-1000.
- Production agent system: 1000+, structured by capability.
Don't aim for 10,000 evals on day 1. Start with 50, learn what fails, expand.
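To "learn what fails" and decide where to expand, it helps to break the pass rate down by capability instead of reading a single number. A minimal sketch, with invented capability names and results:

```python
from collections import defaultdict


def pass_rate_by_capability(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (capability, passed) results and compute a pass rate per capability."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        passes[capability] += int(passed)
    return {capability: passes[capability] / totals[capability] for capability in totals}


# Invented results: each tuple is (capability tag, whether the case passed).
results = [
    ("refund-policy", True), ("refund-policy", True), ("refund-policy", False), ("refund-policy", True),
    ("escalation", True), ("escalation", False),
    ("prompt-injection", False), ("prompt-injection", False),
]

rates = pass_rate_by_capability(results)
weakest = min(rates, key=rates.get)
print(rates, "-> add evals for:", weakest)
```

The weakest capability is a reasonable place to add the next batch of evals.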
Key frameworks
- LLM-as-judge: use one model to score another's output. Standard approach in 2026.
- Golden set: a small subset of evals (20-50) hand-scored by humans, used to calibrate LLM judges.
- Eval-first development: write the eval before the prompt, then iterate the prompt against the eval score. The TDD of AI.
Real-world examples
Anthropic's investment in evals is often described as a structural advantage: large internal eval suites make model improvement more reliable. AI-native PM teams are adopting the same pattern as a defining engineering discipline.
Interview questions (1)
Q1. Walk me through how you'd build an eval suite for an AI customer support chatbot. (Tags: ai-pm, senior)
Five-step build over 2-3 weeks.
Step 1: Inventory. Pull 500+ real customer support questions from the last 6 months. Anonymize. Categorize by topic and difficulty.
Step 2: Sample. Pick 100-200 representative + edge cases. Include adversarial ones (prompt injection, off-topic, abusive language).
Step 3: Define expected behavior per input. For each, write what 'good' looks like: covers facts X and Y, cites source Z, escalates if [condition]. Use a structured rubric so scoring is consistent.
Step 4: Build scoring infra. LLM-as-judge (using a different model than the one in production) for the bulk. Human scoring on a 20-eval golden set, used to calibrate the judge. Rule-based checks for objective criteria (no PII in the output, valid JSON output).
Step 5: Wire into CI. Every prompt or model change triggers full eval run. Regression blocks merge. Production sampling continuously evaluates 1-5% of live traffic.
I'd start with 100 evals and grow to 500 over the first quarter as patterns of production failure emerge. The discipline of catching regressions before users see them is the single biggest quality lever.