Lesson 01 — What an LLM Actually Is

FDE skill · predict failure before you build

You're going to spend your career building reliable systems on top of these models for customers. You cannot make something reliable if you don't know what it is. So we start with the one mental model that the rest of the course hangs off — the one that lets you look at a weird model behavior at a client site and say "of course it did that," instead of being surprised.

The whole thing, in one sentence

An LLM is a function that takes some text and outputs a probability distribution over what the next token should be.¹ That's it. That is the entire machine.

A token is a chunk of text: often a short whole word, otherwise a piece of a longer one. To generate a full answer, the model runs a loop you already understand as an engineer:

text = "The capital of France is"
while True:
    dist = model(text)          # a probability for every possible next token
    next_token = sample(dist)   # pick one, e.g. " Paris"
    if next_token == END:       # stop token: the answer is complete
        break
    text = text + next_token    # otherwise append it and go around again

Predict a token, glue it on, feed the whole thing back in, predict again. This loop is called autoregression, and running it fast is the core of what happens when you call ChatGPT or Claude.² Inside the model there is no database lookup, no reasoning module, no fact-checker. One loop.
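
To make the loop concrete, here is a minimal runnable sketch in Python. Everything in it is a stand-in: TOY_MODEL is a hand-written table playing the role of the model, it looks only at the last token (a real model conditions on the whole text), and the tokens and probabilities are invented for illustration.

```python
import random

END = "<END>"

# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
TOY_MODEL = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {END: 1.0},
}

def model(tokens):
    """Return a probability for each candidate next token, given the tokens so far."""
    return TOY_MODEL[tokens[-1]]

def sample(dist, rng):
    """Pick one token at random, weighted by its probability."""
    candidates = list(dist)
    weights = [dist[token] for token in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(prompt_tokens, rng, max_tokens=20):
    """Run the predict / append / repeat loop until the stop token appears."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = sample(model(tokens), rng)
        if next_token == END:
            break
        tokens.append(next_token)
    return "".join(tokens)

prompt = ["The", " capital", " of", " France", " is"]
print(generate(prompt, random.Random()))
```

Run it a few times: most runs end in " Paris.", and roughly one in ten ends in " Lyon.", because the loop only ever asks "what is a plausible next token?" and never "is this true?".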

The engineer's reframe: it's autocomplete(text) → text, trained on a large fraction of the internet. Where do the probabilities come from? During pretraining, the model read an enormous amount of text and adjusted billions of parameters until it got good at guessing the next token. Those numbers, frozen once training ends, are a lossy compression of the patterns in that text, not a lookup table of facts.¹
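
One way to feel what "patterns, not a lookup table" means is a toy in which training is just counting which word follows which. This is a deliberately swapped-in technique, not how LLMs are trained (they fit a neural network with gradient descent), and the two-sentence corpus is made up. The names train_bigram and sequence_probability are illustrative.

```python
from collections import Counter, defaultdict

CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

def train_bigram(sentences):
    """Count which word follows which, then turn the counts into probabilities."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: n / sum(followers.values()) for word, n in followers.items()}
        for prev, followers in counts.items()
    }

def sequence_probability(params, sentence):
    """Multiply next-word probabilities along a sentence (0.0 if a step was never seen)."""
    words = sentence.split()
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= params.get(prev, {}).get(nxt, 0.0)
    return prob

params = train_bigram(CORPUS)
print(params["the"])  # four possible followers, each seen once
print(sequence_probability(params, "the cat sat on the rug"))  # never in CORPUS
```

The table no longer contains either training sentence, only the statistics of what follows what. That is why "the cat sat on the rug", which was never in the corpus, still gets a nonzero probability: the toy kept the pattern and lost the record.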

Why an FDE should care: four behaviors fall out for free

Here's the payoff. You don't have to memorize the model's quirks as a list — you can derive them from "it samples plausible next tokens from compressed text patterns."

1. It hallucinates — and that's not a bug

The loop always produces a plausible continuation. Plausible ≠ true. If a customer asks for a policy number the model never saw, the most plausible next tokens still look like a policy number — so it confidently invents one.³ Hallucination isn't a defect bolted on; it's what next-token prediction does when it lacks the fact. (Grounding it in real data — RAG — is a later lesson, and now you'll know exactly why it works.)
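
The policy-number case can be sketched as a toy. Assume a model that has learned the shape of a policy number but has no specific number for this customer, so every digit is equally plausible. The format (POL- plus six digits), KNOWN_POLICIES, and both function names are hypothetical.

```python
import random
import re

DIGITS = "0123456789"
KNOWN_POLICIES = {"POL-104233", "POL-778120"}  # hypothetical real records

def next_token_dist(text):
    """Toy model: it knows the format, not the fact, so each digit gets probability 0.1."""
    return {digit: 1 / len(DIGITS) for digit in DIGITS}

def complete_policy_number(prompt, rng, n_digits=6):
    """Keep sampling plausible next characters until the number looks complete."""
    text = prompt
    for _ in range(n_digits):
        dist = next_token_dist(text)
        text += rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return text

answer = complete_policy_number("Your policy number is POL-", random.Random())
number = answer.rsplit(" ", 1)[-1]
print(answer)
print("well-formed:", bool(re.fullmatch(r"POL-\d{6}", number)))
print("exists in records:", number in KNOWN_POLICIES)
```

The distribution always sums to 1 over some set of tokens, so the loop always has something to emit. Unless "I don't know" is itself a likely continuation, the output is a well-formed number that almost certainly matches no real record.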

2. Same prompt, different answers

The model outputs a distribution; sample() picks from it. Turn the temperature up and picks get more random; down toward zero and they get near-deterministic. This is why an LLM call is not a pure function the way your other code is — a fact that will shape how you test and eval it.
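
A small sketch of what temperature does to the distribution before sample() runs. The tokens and raw scores are invented, and treating temperature 0 as "always take the top token" is an assumed convention here, not a statement about any particular provider's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy, i.e. all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

TOKENS = [" Paris", " Lyon", " Nice"]
LOGITS = [4.0, 2.0, 1.0]  # hypothetical raw scores for the next token

for temperature in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(LOGITS, temperature)
    print(temperature, {token: round(p, 3) for token, p in zip(TOKENS, probs)})
```

As the temperature rises, probability mass moves from " Paris" toward the alternatives, so repeated calls with the same prompt diverge more often. That is the knob behind "same prompt, different answers".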

3. It has no memory between calls

The only thing the model sees is the text in front of it — the context window. It doesn't "remember" your last message unless you resend it. Every bit of customer context the model needs must be in the input. "Chat history" is just your app re-pasting the conversation each turn.
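
Here is a minimal sketch of "chat history is just re-pasting". call_model is a hypothetical stub standing in for a real API call, written so the example runs offline: it can only "know" the name if the name appears in the prompt it was handed. The Chat class and the "User:" / "Assistant:" layout are illustrative assumptions as well.

```python
def call_model(prompt):
    """Hypothetical stateless model call: it sees only `prompt` and keeps nothing."""
    if "My name is Dana" in prompt:
        return "Your name is Dana."
    return "I don't know your name."

class Chat:
    """The app, not the model, owns the memory: it resends the transcript every turn."""

    def __init__(self):
        self.history = []

    def send(self, user_message):
        self.history.append(f"User: {user_message}")
        prompt = "\n".join(self.history) + "\nAssistant:"
        reply = call_model(prompt)
        self.history.append(f"Assistant: {reply}")
        return reply

# Two independent calls: the second one has no idea the first ever happened.
call_model("User: My name is Dana.\nAssistant:")
print(call_model("User: What is my name?\nAssistant:"))

# Same questions through the app layer, which re-pastes the conversation.
chat = Chat()
chat.send("My name is Dana.")
print(chat.send("What is my name?"))
```

The only difference between the two runs is what text was in the input. Anything your app drops from that text is, from the model's point of view, something that never happened.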

4. Wording matters more than you'd expect

Output is conditioned entirely on input tokens, so changing the prompt changes the distribution. This isn't the model being fussy — it's the mechanism. It's also why prompt engineering is a real lever, not folklore (Lesson 03).

Optional intuition (skip if you like): No math needed today. The only quantitative idea worth holding: the output isn't one answer, it's a ranked list of likelihoods over ~100k possible tokens, and the model commits to just one before moving on. Everything downstream is that bet, repeated.

Check yourself — diagnose the failure

Retrieval beats re-reading. Don't scroll up: for each customer scenario, name the root cause in terms of how the model itself works.

Scenario A

A customer's support bot cited a refund policy — clause number and all — that does not exist in their handbook. What happened?

It generated the most plausible tokens
It queried the wrong internal database
It served an outdated cached response
A setting was configured incorrectly

Scenario B

You send the identical prompt twice and get two different answers. The customer calls it "unreliable." What's actually going on?

The model retrained between the requests
The model guessed your intent both times
It samples from a probability distribution
It remembered the previous answer given

Scenario C

A client insists the bot "forgot" a detail from ten messages ago, even though it handled it fine earlier. Most likely cause?

The detail was deleted from storage
It fell outside the context window
The temperature setting was too high
The tokenizer merged the two messages

Primary source — watch this

Andrej Karpathy — "A Busy Person's Introduction to LLMs" (~1 hr)

The clearest no-math explanation of the machine you just met, from one of the field's best teachers. Watch the first half this week. When you're ready to go deeper, his 3.5-hour "Deep Dive into LLMs" covers the full training stack.

Your one tangible win: you can now take any surprising LLM behavior a customer reports and trace it back to "it samples plausible next tokens from compressed text patterns." That single sentence is the foundation for prompting, RAG, agents, and evals, all coming up.

References

¹ Chip Huyen, AI Engineering: Building Applications with Foundation Models (O'Reilly, 2025). Foundation models and the language-model mechanism.
² Andrej Karpathy, "Deep Dive into LLMs like ChatGPT" (2025). Inference and the autoregressive loop.
³ Andrej Karpathy, "A Busy Person's Introduction to LLMs" (2023). Hallucination as a property of next-token prediction.