๐ง Everything You Need to Know about AI (for PMs)
The foundational vocabulary and mental model. If you can speak fluently about LLMs, RAG, agents, evals, and the cost stack, you're already ahead of 80% of PMs.
Every PM in 2026 needs AI literacy. Engineers and designers will speak this language whether you do or not; if you can't keep up, you become the slow one in the room. The vocabulary is learnable in a weekend.
AI for product = LLMs, RAG, fine-tuning, agents, MCP, evals. You don't need to be able to build the models; you need to understand the design space they create: latency / cost / quality tradeoffs, the failure modes (hallucination, instruction-drift), and the UX patterns that hide them. Build the mental model once and it pays dividends for years.
The vocabulary
LLM (Large Language Model). A neural network trained on huge text corpora that predicts the next token given a context. ChatGPT, Claude, Gemini are interfaces wrapping LLMs.
Token. The unit of text the LLM operates on โ roughly 4 characters or 3/4 of a word. Pricing and context windows are measured in tokens.
Context window. How much text the model can consider at once. Claude Opus 4.7 has a 200K-1M token context; GPT-5 similar. Bigger = more capable but more expensive per call.
Prompt. The text input to the LLM. System prompt = instructions to the model. User prompt = the user's query.
Inference. A single call to an LLM. Costs money (per token).
RAG (Retrieval-Augmented Generation). Pull relevant docs into the prompt before asking the LLM, so it has fresh/private context. Dominant pattern for production AI products.
Fine-tuning. Further-training a base model on your domain data. More expensive than RAG, sometimes better quality. Less common in 2026 because base models are now strong enough.
Embeddings. Numerical vectors representing text meaning. Used for similarity search in RAG pipelines.
Vector database. A DB optimized for embedding similarity search. Pinecone, Weaviate, pgvector are common.
Agent. An LLM with the ability to call tools (functions, APIs) in a loop until it accomplishes a goal. The defining UX pattern of 2025-26.
Tool use / function calling. The protocol that lets an LLM call external tools (search, code execution, API calls).
MCP (Model Context Protocol). Anthropic's open standard for agents to talk to tools and data sources. Becoming the de facto agent integration layer in 2026.
Evals. The test suite for an AI product. Quantifies how well the AI is doing on real tasks. The discipline that separates production-ready AI from demos.
Hallucination. When the model produces plausible-sounding but factually wrong output. The single biggest UX challenge for AI products.
Temperature. A parameter controlling output randomness. Low (0-0.3) = consistent/deterministic. High (0.7-1.0) = creative/varied.
The cost stack
Knowing the cost stack lets you make informed product decisions.
- GPT-5 / Claude Opus 4.7: $5-15 per million input tokens, $25-75 per million output. Premium tier.
- Mid-tier models: $1-3 per million tokens. Most production use.
- Open-source models hosted: can be ~$0.20-1 per million tokens.
- Embedding models: ~$0.02-0.10 per million tokens.
A typical AI feature call (5K input + 2K output) on a frontier model costs ~$0.10-0.30 per call. Multiply by usage volume to see if your product is sustainable.
The product design space
When designing an AI feature, you're choosing on three axes:
- Quality โ how good must the output be? Customer support response (high) vs autocomplete suggestion (medium) vs background classification (lower).
- Latency โ how fast must it respond? Chat (3-5s tolerable) vs autocomplete (must be <300ms) vs background task (minutes fine).
- Cost per call โ what's the unit economics?
Most AI product decisions reduce to picking the point in this 3-space and engineering toward it. Frontier model + sophisticated RAG = high quality, high latency, high cost. Smaller model + simple prompt = lower everything.
The failure modes
- Hallucination. Confident wrong answers. Mitigated by RAG, evals, and UX patterns (cite sources, allow easy correction).
- Instruction drift. Long conversations where the model loses the original instructions.
- Latency variability. Same call can take 2s or 30s. Plan UX around the worst case.
- Cost blowups. A bug or a malicious user can 100x your token usage overnight.
- Stochasticity. Same input doesn't always produce the same output. Tests must account.
The UX patterns that hide the failures
- Streaming. Show output as it generates. Hides latency.
- Citations. Show sources. Lets users verify and reduces perceived hallucination.
- Confidence indicators. "I'm not sure about this" is better than confident wrong answers.
- Easy correction. "Regenerate," "edit response," thumbs up/down. Lets users repair.
- Human-in-the-loop on high-stakes actions. Don't let the AI delete the database. Confirm.
What PMs should do next
- Use Claude or ChatGPT daily for real work for a month. Build intuition.
- Read OpenAI / Anthropic / Cohere docs end-to-end. Free PhD.
- Build one toy AI feature with a friend. The understanding from doing > reading.
- Set up evals for one feature you care about. The discipline scales.
Key frameworks
Every AI design choice is picking a point in this 3-dimensional space.
Three ways to inject domain knowledge into an LLM. Pick based on data freshness, cost, and quality needs.
Write the eval before the feature. Quantify quality. Iterate against the score.
Real-world examples
Most senior AI PMs in 2026 use Claude (Sonnet, Opus) and ChatGPT (GPT-5) as their daily thinking tools. The fluency comes from use. PMs who don't reach for these tools daily will fall behind those who do.
Go deeper โ recommended reading
Interview questions (3)
Q1Explain RAG vs fine-tuning vs prompt engineering. When would you use each?ai-pmmidโผ
Three different ways to inject domain knowledge into an LLM.
Prompt engineering. Put the knowledge in the prompt itself. Fastest, simplest, no infrastructure. Use when the knowledge is small enough to fit in the context window and changes infrequently.
RAG (Retrieval-Augmented Generation). Store domain knowledge in a vector DB; retrieve relevant chunks at query time and inject into the prompt. Use when you have a large or constantly-updating corpus (docs, knowledge base, support tickets). This is the dominant pattern for production AI in 2026.
Fine-tuning. Further-train the base model on your domain data. Higher cost (training + per-token inference is slightly cheaper than frontier models if you self-host), longer to set up. Use when you need consistent behavior or specialized capabilities that prompt + RAG can't achieve.
Default to prompting first, RAG when you have a corpus, fine-tuning only when both fall short. Most teams over-invest in fine-tuning when RAG would have been enough.
Q2Your AI feature is hallucinating frequently. How do you fix it without retraining the model?ai-pmseniorโผ
Five layers, in order of leverage:
- Tighten the prompt. Add explicit instructions: 'If you don't know, say so' and 'Only use the provided context.' Many hallucinations come from vague prompts.
- Add RAG. Ground the model in real data. The model hallucinates when it doesn't have facts; give it facts.
- Add citations. Make the model cite its sources. Hallucinations are easier to detect when sourced.
- Add evals. Build a test suite of inputs with known correct outputs. Quantify hallucination rate. Iterate against the metric.
- UX safety net. Confidence indicators, easy correction, human-in-the-loop on high-stakes actions. Even with all the above, errors will happen โ design the UX to handle them gracefully.
Retraining is expensive and slow. 80% of hallucination fixes are in the prompt + RAG layer. Save retraining for when you've exhausted the easier wins.
Q3Walk me through the cost analysis of an AI feature.ai-pmmidโผ
Three inputs to model:
Per-call cost. Tokens in ร input price + tokens out ร output price. For Claude Opus 4.7: ~$15/M input, $75/M output. A 5K-in, 2K-out call = ~$0.23.
Calls per user per period. Active user makes how many AI calls daily/monthly? Estimate from beta usage or competitor analogs.
User base ร engagement = total monthly inference cost.
Then add the RAG cost: embedding generation, vector DB hosting (~$50-500/mo for most products), retrieval overhead.
Compare to ARPU or willingness-to-pay. If your AI feature costs $5/user/month and your product's ARPU is $10/month, the feature consumes 50% of revenue โ usually not viable.
Three levers if costs are too high: (1) switch to a smaller model for the easy cases (intent routing), (2) cache aggressively (semantic cache hits common queries), (3) tighter prompts (fewer tokens). All three combined can drop cost 5-10x without quality loss.