Understand how LLMs work to engineer with them better
For engineers who use GPT APIs but don’t trust the “it’s magic” answer.
Large Language Models (LLMs) like GPT‑4 or Claude aren’t magical. They’re predictive engines trained to guess the next token in a sequence — that’s it.
1. Next-token prediction
Every output — chat, code, essay — is a sequence of guesses.
logits = model(tokens)           # forward pass over everything generated so far
probs = softmax(logits[-1])      # distribution over the vocabulary for the next position
next_token = sample(probs)       # pick one token (greedy, top-k, temperature, ...)
Each new token depends on all previous ones. If the model drifts, the problem started earlier in the sequence.
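Strung together, those three lines are the entire generation loop. Here is a minimal, self-contained sketch of autoregressive sampling; the toy model, vocabulary size, and prompt below are purely illustrative stand-ins, not a real LLM API:

import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 100                     # toy vocabulary (assumption for illustration)

def model(tokens):
    # Stand-in for a real LLM: returns one row of logits per input position.
    return rng.normal(size=(len(tokens), VOCAB_SIZE))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(probs):
    return int(rng.choice(len(probs), p=probs))

tokens = [1, 42, 7]                  # the prompt, already tokenized
for _ in range(10):                  # generate 10 tokens, one at a time
    logits = model(tokens)           # forward pass over everything so far
    probs = softmax(logits[-1])      # next-token distribution
    tokens.append(sample(probs))     # the new token becomes part of the context
print(tokens)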
Analogy: Think of an LLM as an engineer typing code with autocomplete turned on — one token at a time, no global plan.
2. Tokens are not words
Models don’t read “words,” they read subword tokens.
"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]
- Common words ≈ 1 token
- Rare or technical words = multiple tokens
- Cost and latency scale with token count
👉 Always check token length before sending prompts.
len(tokenizer.encode("Your prompt"))
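For OpenAI-style models you can do this with the tiktoken library. A quick sketch (the encoding name is an assumption; match it to your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # encoding used by GPT-4-era models
ids = enc.encode("antidisestablishmentarianism")
print(len(ids))                                 # number of tokens, not characters
print([enc.decode([i]) for i in ids])           # the subword pieces the model actually sees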
3. Attention: How the model “thinks”
Each token decides which earlier tokens matter using self‑attention.
Query × Key → Attention weights → Weighted sum of Values
That’s the heart of the Transformer architecture.
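In code, single-head scaled dot-product self-attention looks roughly like this. A NumPy sketch only; real models add multiple heads, causal masking, and many stacked layers:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: learned projection matrices
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token "looks at" every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per row
    return weights @ V                               # each token's output is a weighted sum of values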
During generation:
- First token = slow: the whole prompt is processed at once, with attention over every pair of tokens (O(n²))
- Later tokens = faster: keys and values are cached, so each step only attends to what is already stored (≈ O(n) per token)
Result: initial delay, then smooth streaming.
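That speedup comes from the KV cache. A toy sketch of one decoding step (single head, NumPy, deliberately simplified):

import numpy as np

d = 64                                   # per-head dimension (illustrative)
cache_k, cache_v = [], []                # grows by one entry per generated token

def decode_step(q, k_new, v_new):
    # Append the new token's key/value, then attend over the whole cache.
    cache_k.append(k_new)
    cache_v.append(v_new)
    K = np.stack(cache_k)                # keys computed once, reused on every later step
    V = np.stack(cache_v)
    scores = K @ q / np.sqrt(d)          # one dot product per cached position: O(t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over all past positions
    return weights @ V                   # context vector for the new token

Each step does one dot product per cached position instead of recomputing attention for the entire sequence, which is why streaming feels smooth after the first token.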
4. Context Window: The model's memory limit
The context window (e.g., 128k tokens) defines how much the model can “see” at once. Anything beyond it is cut off, and even inside it, tokens buried deep in a long prompt tend to lose influence.
Fix it:
- Summarize before Q&A
- Use RAG to load only relevant chunks
- Keep critical info near the end — recency bias helps
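A rough sketch of the simplest of these fixes: keep a running token count and drop the oldest turns first when the budget runs out. Token counting uses tiktoken; the budget value and message format are assumptions:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(system_prompt, messages, budget=8000):
    # Always keep the system prompt; add messages newest-first until the budget is spent.
    used = len(enc.encode(system_prompt))
    kept = []
    for msg in reversed(messages):
        cost = len(enc.encode(msg))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system_prompt] + list(reversed(kept))

Dropping from the front keeps the most recent turns intact, which is exactly where the recency bias works in your favor.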
Key learnings
LLMs are not reasoning machines — they’re compression‑based next‑token predictors with limited memory. Understanding that boundary makes you a better builder.