Lesson 02 ยท Foundations
The three numbers behind every scoping conversation โ plus your first real API call.
Here's a moment you will live many times as an FDE: a customer describes a feature โ "summarize every support ticket as it comes in" โ and looks at you. Can we do it, what will it cost, will it be fast enough? Answering that on the spot takes exactly three numbers: input tokens, output tokens, and the context window. This lesson gives you all three, then you'll make your first real API call and read them off the meter yourself.
From Lesson 01 you know the model reads and writes tokens โ chunks of text, usually part of a word. Rough rule for English prose: 1 token โ ยพ of a word, so 1,000 tokens is roughly 750 words.1 Everything is denominated in tokens: pricing, speed, and limits. Words don't matter; tokens do.
And tokenization isn't uniform. Code, JSON, and rare words split into more tokens per word than plain prose โ a config blob can cost triple what an equivalent-length sentence does. That's why an FDE never estimates from word counts: you count tokens (the API has a free endpoint for exactly this โ it's Part 1 of your lab).
Pricing is per million tokens, and here's the part people miss: output tokens cost about 5ร more than input tokens. Reading is cheap; generating is expensive โ every output token is one pass of the next-token loop.1 Current Claude pricing2 as of mid-2026:
| Model | Tier | Input $/1M | Output $/1M | Context window |
|---|---|---|---|---|
| Claude Opus 4.8 | Most capable | $5.00 | $25.00 | 1M tokens |
| Claude Sonnet 5 | Balanced | $3.00 | $15.00 | 1M tokens |
| Claude Haiku 4.5 | Fast & cheap | $1.00 | $5.00 | 200K tokens |
The context window caps how much the model can consider at once โ prompt plus answer. Modern Claude models take 1M tokens (~2,000 pages); Haiku takes 200K.2 Two practical consequences: anything the model must "know" has to fit, and since input tokens cost money, stuffing the window is a cost decision too. "Just paste in all 10 years of customer docs" is rarely the right answer โ that instinct is what leads to RAG, a few lessons from now.
Reading input is fast and parallel; generating output happens one token at a time through the loop from Lesson 01. So response time is dominated by output length, not input length. Two levers every FDE uses: cap or prompt for shorter outputs, and stream the response so the user sees words immediately โ the time to first token is what a user feels, and streaming makes a 20-second answer feel instant.
Time to touch the metal. The lab file is in your workspace at
labs/0002-first-api-call.py. One-time setup, then run it:
# one-time setup (PowerShell)
pip install anthropic
$env:ANTHROPIC_API_KEY = "sk-ant-..." # from console.anthropic.com
python labs/0002-first-api-call.py
Three parts, each proving something from this lesson with a tight feedback loop:
Part 1 counts tokens on four strings โ watch JSON cost ~2ร the tokens of prose.
Part 2 makes a real call and reads response.usage โ the exact meter
you'll use to price customer features โ then computes the dollar cost.
Part 3 streams a response and measures time-to-first-token versus total time,
so you see that generation, not reading, is the slow part. The whole lab costs under a cent.
Same drill as Lesson 01: customer scenarios, and you diagnose from the mechanism. Don't scroll up. Wrong picks stay live.
Scenario A
A customer wants 100-page contracts summarized into one paragraph each. Costs are too high in the pilot. Which lever cuts cost the most?
Scenario B
Users complain the assistant "hangs" for 15 seconds on detailed answers, though short answers feel fine. What's the primary cause?
Scenario C
To save money, a team moves a workflow to Haiku and pastes an entire knowledge base (~350K tokens) into every prompt. Calls now fail. Why?