Lesson 02 — Tokens, Context & the Cost of a Call

FDE skill · size and price a solution before you build it

🎧 Listen to this lesson · ~6 min · narrated audiobook edition

Here's a moment you will live many times as an FDE: a customer describes a feature — "summarize every support ticket as it comes in" — and looks at you. Can we do it, what will it cost, will it be fast enough? Answering that on the spot takes exactly three numbers: input tokens, output tokens, and the context window. This lesson gives you all three, then you'll make your first real API call and read them off the meter yourself.

Tokens are the meter

From Lesson 01 you know the model reads and writes tokens — chunks of text, usually part of a word. Rough rule for English prose: 1 token ≈ ¾ of a word, so 1,000 tokens is roughly 750 words.¹ Everything is denominated in tokens: pricing, speed, and limits. Words don't matter; tokens do.

And tokenization isn't uniform. Code, JSON, and rare words split into more tokens per word than plain prose — a config blob can cost triple what an equivalent-length sentence does. That's why an FDE never estimates from word counts: you count tokens (the API has a free endpoint for exactly this — it's Part 1 of your lab).

What a call costs

Pricing is per million tokens, and here's the part people miss: output tokens cost about 5× more than input tokens. Reading is cheap; generating is expensive — every output token is one pass of the next-token loop.¹ Current Claude pricing² as of mid-2026:

Model	Tier	Input $/1M	Output $/1M	Context window
Claude Opus 4.8	Most capable	$5.00	$25.00	1M tokens
Claude Sonnet 5	Balanced	$3.00	$15.00	1M tokens
Claude Haiku 4.5	Fast & cheap	$1.00	$5.00	200K tokens

Worked example — the scoping math Customer support bot on Opus 4.8: each conversation sends ~2,000 input tokens (system prompt + ticket + history) and generates ~500 output tokens.

Input: 2,000 × $5/1M = $0.010 · Output: 500 × $25/1M = $0.0125 → ~2.3¢ per conversation.
At 10,000 conversations/day: ≈ $225/day ≈ $6,800/month. Now you can discuss whether Haiku at ~$1,350/month is good enough for this task. That conversation is FDE work.

The context window is the model's working memory

The context window caps how much the model can consider at once — prompt plus answer. Modern Claude models take 1M tokens (~2,000 pages); Haiku takes 200K.² Two practical consequences: anything the model must "know" has to fit, and since input tokens cost money, stuffing the window is a cost decision too. "Just paste in all 10 years of customer docs" is rarely the right answer — that instinct is what leads to RAG, a few lessons from now.

Latency: the answer length is the clock

Reading input is fast and parallel; generating output happens one token at a time through the loop from Lesson 01. So response time is dominated by output length, not input length. Two levers every FDE uses: cap or prompt for shorter outputs, and stream the response so the user sees words immediately — the time to first token is what a user feels, and streaming makes a 20-second answer feel instant.

🧪 Lab: your first API call

Time to touch the metal. The lab file is in your workspace at labs/0002-first-api-call.py. One-time setup, then run it:

# one-time setup (PowerShell)
pip install anthropic
$env:ANTHROPIC_API_KEY = "sk-ant-..."   # from console.anthropic.com

python labs/0002-first-api-call.py

Three parts, each proving something from this lesson with a tight feedback loop: Part 1 counts tokens on four strings — watch JSON cost ~2× the tokens of prose. Part 2 makes a real call and reads response.usage — the exact meter you'll use to price customer features — then computes the dollar cost. Part 3 streams a response and measures time-to-first-token versus total time, so you see that generation, not reading, is the slow part. The whole lab costs under a cent.

Check yourself — scope the feature

Same drill as Lesson 01: customer scenarios, and you diagnose from the mechanism. Don't scroll up. Wrong picks stay live.

Scenario A

A customer wants 100-page contracts summarized into one paragraph each. Costs are too high in the pilot. Which lever cuts cost the most?

Reduce the input you send per request
Cap the output length of the summary
Batch many contracts into fewer calls
Stream responses instead of waiting

Scenario B

Users complain the assistant "hangs" for 15 seconds on detailed answers, though short answers feel fine. What's the primary cause?

The model reads the input too slowly
Latency grows with generated output tokens
The account is hitting its rate limits
The context window is completely full

Scenario C

To save money, a team moves a workflow to Haiku and pastes an entire knowledge base (~350K tokens) into every prompt. Calls now fail. Why?

Haiku is too weak to process the text
The team exceeded its monthly budget
The tokenizer cannot split that much text
The prompt exceeds Haiku's context window

Primary source — read this

Chip Huyen — AI Engineering, Ch. 2 (Understanding Foundation Models)

The canonical book for this exact career path. Chapter 2 covers tokens, sampling, and the model-as-a-service economics behind this lesson. The companion repo (linked) is free; the book itself is worth owning for the whole journey.

Your one tangible win Give yourself the test: a customer wants ticket summarization at 50,000 tickets/month, ~1,500 input + 300 output tokens each. You can now estimate the monthly cost on two models, say which of input or output dominates, and check it fits the context window — before writing a line of code. That's a scoping conversation you can lead.

I'm your teacher — ask me anything. Lab won't run? API key trouble? Curious why output tokens cost more, or what "prompt caching" means on the pricing page? Ask in chat — debugging your first API call together is exactly what I'm for.

References

Chip Huyen, AI Engineering (O'Reilly, 2025) — tokens, sampling, and inference economics.
Anthropic, Claude models & pricing documentation — model tiers, per-token pricing, and context windows (verified July 2026).