- Foundation model
- A large model trained on a broad sweep of data that can be adapted to many tasks
without being retrained from scratch. LLMs are the text kind; there are also image, audio, and multimodal ones.
As an FDE you build on these, you don't train them.
- LLM — large language model
- A foundation model for text. Under the hood it is a next-token
predictor: given some text, it outputs a probability distribution over what comes next.
- Token
- The unit an LLM actually reads and writes — usually a chunk of a word, not a whole word.
For example, "tokenization" might be split into 3 tokens: token, iz, ation.
Prices, context limits, and speed are all measured in tokens, not words.
- Tokenization
- The step that chops input text into tokens before the model sees it
(and stitches output tokens back into text). This is why the model can be weird about spelling,
rare words, and counting characters.
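To make the "tokenization" example concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `tokenize` are hypothetical and chosen only so the example splits the way the glossary describes; real tokenizers (for instance byte-pair encoding) learn their vocabulary from data and split text differently.

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"token", "iz", "ation", "s"}

def tokenize(text, vocab):
    """Greedy longest-match split of text into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenization", VOCAB))  # ['token', 'iz', 'ation']
```

The model only ever sees the pieces, not the individual letters, which is one way to picture why character counting is hard for it.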
- Next-token prediction — autoregression
- The core loop: predict the next token, append it to the input, predict again, repeat.
Everything an LLM does is this loop running fast. The single most important mental model
in the course.
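The loop can be sketched in a few lines. The lookup table `NEXT` and the functions `predict` and `generate` are made up for illustration; a real LLM computes the distribution with a neural network over the whole context, not a table keyed on the last token.

```python
# Toy "model": maps the last token to a distribution over next tokens.
# The probabilities are invented for illustration.
NEXT = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}

def predict(tokens):
    """Return a probability distribution over the next token."""
    return NEXT.get(tokens[-1], {"<end>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = predict(tokens)                # 1. predict
        next_token = max(dist, key=dist.get)  # 2. pick (greedy here)
        if next_token == "<end>":
            break
        tokens.append(next_token)             # 3. append, then repeat
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Picking the most likely token every time is only one option; sampling from the distribution is the more common choice in practice.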
- Parameters — weights
- The billions of numbers, fixed after training, that encode the patterns the model learned.
They don't change when you use the model — they are the "compressed" version of its training text.
- Pretraining
- The expensive phase where the model learns to predict next tokens over a huge corpus of text.
This is where its "knowledge" comes from — and why it's frozen at a training cutoff.
- Inference
- Actually running the trained model to generate output — the part you do as a builder,
via an API call. Costs money and time per token.
- Context window
- The maximum amount of text (in tokens) the model can consider at once —
your prompt plus its answer. It has no memory outside this window; anything the model
should "know" for a call must be inside it.
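Because prompt and answer share one window, a common chore is trimming old chat history so the call still fits. The sketch below assumes you already have a token count per message; `trim_history` and its parameters are hypothetical names, not part of any particular API.

```python
def trim_history(message_token_counts, context_window, reserved_for_output):
    """Return how many of the most recent messages fit in the window.

    message_token_counts: token count of each message, oldest first.
    reserved_for_output: tokens set aside for the model's answer.
    """
    budget = context_window - reserved_for_output
    used = 0
    kept = 0
    # Walk from newest to oldest, keeping messages while they fit.
    for count in reversed(message_token_counts):
        if used + count > budget:
            break
        used += count
        kept += 1
    return kept

# Three messages of 500, 300 and 200 tokens; a 1000-token window with
# 400 tokens reserved for the answer leaves room for the newest two.
print(trim_history([500, 300, 200], 1000, 400))  # 2
```

Anything trimmed this way is simply gone from the model's point of view for that call.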
- Prompt
- The input text you give the model. Because output is conditioned entirely on the input,
the prompt is your main control surface — small wording changes shift the output.
- Sampling — temperature
- Because the model outputs a distribution, it picks the next token by sampling from it.
"Temperature" tunes how random that pick is: low = more deterministic/repetitive,
high = more varied/creative. This is why the same prompt can give different answers.
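A minimal sketch of temperature, assuming the common formulation in which the model's raw scores (logits) are divided by the temperature before being turned into probabilities with a softmax. The tokens and scores below are invented, and `apply_temperature` and `sample` are illustrative names.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by a temperature greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng=random):
    """Pick one token at random, weighted by the tempered distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["blue", "grey", "green"]
logits = [2.0, 1.0, 0.0]  # hypothetical raw scores
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
print(sample(tokens, logits, temperature=1.0))
```

At low temperature nearly all the probability sits on the top token; at high temperature the distribution flattens and less likely tokens get picked more often.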
- Input & output tokens
- The two sides of the billing meter. Input tokens are everything you send (prompt, history,
documents); output tokens are what the model generates. Output tokens typically cost several times more per token
(~5× is a common figure, but the ratio varies by provider and model) and dominate latency, because each one is a pass of the next-token loop.
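The billing arithmetic is simple enough to sketch. The prices below are hypothetical placeholders, not any provider's actual rates, and `call_cost` is an illustrative helper.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one call; prices are quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = call_cost(input_tokens=10_000, output_tokens=500,
                 input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0375
```

Here the 10,000 input tokens cost $0.03 and the 500 output tokens cost $0.0075, so a long prompt can still dominate the bill even though each output token is pricier.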
- Streaming
- Receiving the response token-by-token as it's generated, instead of waiting for the whole
answer. Changes nothing about cost — only about when the user starts seeing words. The default
for anything user-facing.
- Time to first token — TTFT
- How long before the first piece of the answer appears. Roughly the time the model spends
reading your input, plus any network and queueing delay. What a user feels as responsiveness; with streaming, it is the only wait before words start appearing.
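One way to see the difference between TTFT and total time is to time a stream. The sketch below uses `fake_stream`, a made-up generator with simulated delays standing in for a real streaming API; `measure` would work the same way on any iterator of text chunks.

```python
import time

def fake_stream(tokens, first_delay=0.05, per_token_delay=0.01):
    """Stand-in for a streaming response, with simulated delays."""
    time.sleep(first_delay)          # simulates the model reading the input
    for tok in tokens:
        yield tok
        time.sleep(per_token_delay)  # simulates generating the next token

def measure(stream):
    """Return (time to first token, total time, full text) for a stream."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        pieces.append(tok)
    total = time.perf_counter() - start
    return ttft, total, "".join(pieces)

ttft, total, text = measure(fake_stream(["Hel", "lo", ", ", "world"]))
print(f"TTFT {ttft:.3f}s, total {total:.3f}s, text {text!r}")
```

Without streaming the user waits for `total`; with streaming they start reading after `ttft`, even though the cost and the total time are unchanged.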
- Hallucination
- When the model produces fluent, confident text that is false. Not a bug bolted on — a direct
consequence of next-token prediction: it generates plausible
continuations, with no built-in check for truth. Managing this is core FDE work.