Lesson 01 · Foundations
One idea, and almost everything you'll build follows from it.
You're going to spend your career building reliable systems on top of these models for customers. You cannot make something reliable if you don't know what it is. So we start with the one mental model that the rest of the course hangs off: the one that lets you look at a weird model behavior at a client site and say "of course it did that," instead of being surprised.
An LLM is a function that takes some text and outputs a probability distribution over what the next token should be.1 That's it. That is the entire machine.
A token is just a chunk of text, usually part of a word. To generate a full answer, the model runs a loop you already understand as an engineer:
text = "The capital of France is"
loop:
    dist = model(text)        # a probability for every possible next token
    next = sample(dist)       # pick one, e.g. " Paris"
    text = text + next        # append it
    if next == END: break     # stop token
Predict a token, glue it on, feed the whole thing back in, predict again. This loop is called autoregression, and running it fast is all that happens when you call ChatGPT or Claude.2 There is no database lookup, no reasoning module, no fact-checker. One loop.
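If you want to run that loop yourself, here is a minimal Python sketch. Everything in it is a stand-in: `TOY_TABLE`, `toy_model`, `sample`, `generate`, and the `<END>` marker are hypothetical names, and the "model" is a hand-written table that looks only at the last token, whereas a real LLM conditions on the whole text. The shape of the loop is the point, not the model.

```python
import random

END = "<END>"  # hypothetical stop token

# Hypothetical toy "model": a hand-written table keyed on the last token only.
TOY_TABLE = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {END: 1.0},
}

def toy_model(tokens):
    """Return a probability for each candidate next token."""
    return TOY_TABLE.get(tokens[-1], {END: 1.0})

def sample(dist, rng):
    """Pick one token at random, weighted by its probability."""
    candidates = list(dist)
    weights = [dist[tok] for tok in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(prompt_tokens, rng, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):    # cap the loop so it always terminates
        dist = toy_model(tokens)   # a probability for every candidate next token
        nxt = sample(dist, rng)    # pick one
        if nxt == END:             # stop token: do not append it
            break
        tokens.append(nxt)         # glue it on and go around again
    return "".join(tokens)

print(generate(["The", " capital", " of", " France", " is"], random.Random()))
```

Run it a few times. Most runs end in " Paris.", and now and then one ends in " Lyon.", because the table gives " Lyon" a small but nonzero probability.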
autocomplete(text) → text, trained on a large fraction of the internet.
Where do the probabilities come from? During pretraining, the model read an enormous amount of text and adjusted billions of parameters until it got good at guessing the next token. Those frozen numbers are a lossy compression of the patterns in that text, not a lookup table of facts.1
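To make "parameters that encode patterns, not facts" concrete, here is a deliberately tiny stand-in: a bigram counter. This is not how an LLM is trained (real pretraining adjusts neural-network weights by gradient descent, not by counting), and the corpus and the names `train_bigram` and `next_word_dist` are made up for illustration. It only shows the idea of turning text into numbers that predict what comes next.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which. The counts play the role of parameters."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_dist(counts, prev):
    """Turn the counts for one context word into a probability distribution."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}                  # this context never appeared in training
    return {word: n / total for word, n in following.items()}

# Hypothetical three-sentence "training set".
corpus = [
    "the capital of France is Paris",
    "the capital of Italy is Rome",
    "the capital of France is Paris",
]
params = train_bigram(corpus)
print(next_word_dist(params, "is"))
```

After "is", this toy model puts about two thirds of its probability on "Paris" and one third on "Rome". Nothing in `params` says "France maps to Paris"; it only records that "Paris" often follows "is". That is lossy compression of patterns in miniature.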
Here's the payoff. You don't have to memorize the model's quirks as a list; you can derive them from "it samples plausible next tokens from compressed text patterns."
The loop always produces a plausible continuation. Plausible ≠ true. If a customer asks for a policy number the model never saw, the most plausible next tokens still look like a policy number, so it confidently invents one.3 Hallucination isn't a defect bolted on; it's what next-token prediction does when it lacks the fact. (Grounding it in real data, known as RAG, is a later lesson, and now you'll know exactly why it works.)
The model outputs a distribution; sample() picks from it. Turn the temperature up and picks get more random; down toward zero and they get near-deterministic. This is why an LLM call is not a pure function the way your other code is, a fact that will shape how you test and eval it.
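Here is a small sketch of what the temperature knob does, assuming the common scheme in which the model's raw scores (logits) are divided by the temperature before a softmax turns them into probabilities. The logit values and the name `apply_temperature` are hypothetical, and real APIs typically treat a temperature of exactly zero as "always take the top token" rather than dividing by zero.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Requires temperature > 0."""
    scaled = [z / temperature for z in logits.values()]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

# Hypothetical raw scores for three candidate next tokens.
logits = {" Paris": 4.0, " Lyon": 2.0, " Nice": 1.0}

for t in (0.1, 1.0, 2.0):
    dist = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in dist.items()})

# Sampling from the same distribution five times can give different picks.
dist = apply_temperature(logits, 1.0)
picks = random.Random().choices(list(dist), weights=list(dist.values()), k=5)
print(picks)
```

At a low temperature nearly all of the probability lands on " Paris"; as the temperature rises, the other tokens get a bigger share. The last two lines are the non-pure-function part: same input, same distribution, potentially different output each run.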
The only thing the model sees is the text in front of it: the context window. It doesn't "remember" your last message unless you resend it. Every bit of customer context the model needs must be in the input. "Chat history" is just your app re-pasting the conversation each turn.
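The re-pasting is easy to see in code. In this sketch, `stateless_model` is a hypothetical stand-in for an LLM call (a real one would return generated text), and the `Chat` class and the order number are invented for the example. The memory lives in your app's `history` list, and the model only ever receives the prompt built from it.

```python
def stateless_model(prompt):
    """Hypothetical stand-in for an LLM call. It can only use what is in `prompt`."""
    if "1234" in prompt:
        return "Order 1234 is on its way."
    return "Which order do you mean?"

class Chat:
    def __init__(self):
        self.history = []   # lives in your app, not in the model

    def send(self, user_message):
        self.history.append(("user", user_message))
        # Rebuild the full transcript and send all of it, every turn.
        prompt = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = stateless_model(prompt)
        self.history.append(("assistant", reply))
        return reply

# Without history, the follow-up question has nothing to refer to.
print(stateless_model("user: Where is it?"))

# With history re-pasted, the order number is in the prompt again.
chat = Chat()
chat.send("My order number is 1234.")
print(chat.send("Where is it?"))
```

The first call gets "Which order do you mean?" and the second gets "Order 1234 is on its way.", even though the final user message is identical. The only difference is what the app put in the prompt.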
Output is conditioned entirely on input tokens, so changing the prompt changes the distribution. This isn't the model being fussy; it's the mechanism. It's also why prompt engineering is a real lever, not folklore (Lesson 03).
Retrieval beats re-reading. Don't scroll up: for each customer scenario, name the root cause from the model itself. Wrong picks stay live; try again.
Scenario A
A customer's support bot cited a refund policy, clause number and all, that does not exist in their handbook. What happened?
Scenario B
You send the identical prompt twice and get two different answers. The customer calls it "unreliable." What's actually going on?
Scenario C
A client insists the bot "forgot" a detail from ten messages ago, even though it handled it fine earlier. Most likely cause?