Understand how LLMs work to engineer them better

Param Harrison
2 min read

For engineers who use GPT APIs but don’t trust the “it’s magic” answer.

Large Language Models (LLMs) like GPT‑4 or Claude aren’t magical. They’re predictive engines trained to guess the next token in a sequence — that’s it.

1. Next-token prediction

Every output — chat, code, essay — is a sequence of guesses.

import torch

logits = model(tokens)                      # score every vocab entry at each position
probs = torch.softmax(logits[-1], dim=-1)   # the last position gives the next-token distribution
next_token = torch.multinomial(probs, 1)    # sample one token id (argmax would be greedy)

Each new token depends on all previous ones. If the model drifts, the problem started earlier in the sequence.
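
To make that concrete, here is a minimal runnable sketch of the generation loop, using Hugging Face transformers with the small gpt2 checkpoint as a stand-in (any causal LM behaves the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("LLMs generate text", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(ids).logits                          # scores at every position
    next_id = logits[0, -1].argmax()                    # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)  # append it and go again
print(tok.decode(ids[0]))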

Analogy: Think of an LLM as an engineer typing code with autocomplete turned on — one token at a time, no global plan.

2. Tokens are not words

Models don’t read “words”; they read subword tokens.

"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]

  • Common words ≈ 1 token
  • Rare or technical words = multiple tokens
  • Cost and latency scale with token count

👉 Always check token length before sending prompts.

len(tokenizer.encode("Your prompt"))   # use the tokenizer that matches your model (e.g. tiktoken)
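
To see the split itself, tiktoken can decode each token id back to its text piece. The exact split depends on the tokenizer, so it may differ from the illustration above:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("antidisestablishmentarianism")
print(len(ids))                            # several tokens, not one
print([enc.decode([i]) for i in ids])      # the actual subword pieces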

3. Attention: How the model "thinks"

Each token decides which earlier tokens matter using self‑attention.

Query × Key → Attention weights → Weighted sum of Values

That’s the heart of the Transformer architecture.
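
In code, that pipeline is only a few lines. A minimal single-head sketch in PyTorch, with illustrative shapes and no masking or batching:

import math
import torch

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) projections of the token embeddings
    scores = Q @ K.T / math.sqrt(K.size(-1))   # Query x Key, scaled
    weights = torch.softmax(scores, dim=-1)    # attention weights, each row sums to 1
    return weights @ V                         # weighted sum of Values

x = torch.randn(5, 16)     # 5 tokens, 16-dim embeddings
out = attention(x, x, x)   # self-attention: Q, K, V all come from the same tokens

In a real Transformer this runs per head, with a causal mask so a token can only attend to earlier positions.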

During generation:

  • First token = slow: the whole prompt is processed in one pass (prefill, O(n²))
  • Later tokens = fast: earlier keys and values are cached, ≈ O(n) per new token

Result: initial delay, then smooth streaming.
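
A toy PyTorch sketch of why the cache helps: each step projects only the newest token and appends its key and value, instead of recomputing them for the entire sequence (weights and dimensions here are made up for illustration):

import math
import torch

d = 8
W_q, W_k, W_v = torch.randn(3, d, d)   # stand-in projection weights

cache_K = torch.empty(0, d)            # grows by one row per generated token
cache_V = torch.empty(0, d)

for step in range(5):
    x = torch.randn(1, d)                        # embedding of the newest token only
    cache_K = torch.cat([cache_K, x @ W_k])      # append its key
    cache_V = torch.cat([cache_V, x @ W_v])      # append its value
    w = torch.softmax((x @ W_q) @ cache_K.T / math.sqrt(d), dim=-1)
    out = w @ cache_V                            # attends over all cached positions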

4. Context window: The model's memory limit

The context window (e.g., 128k tokens) is a hard cap on how much the model can “see” at once. Even within that cap, information buried deep in a long prompt often gets less attention in practice.

Fix it:

  • Summarize before Q&A
  • Use RAG to load only relevant chunks
  • Keep critical info near the end, since recency bias helps (see the sketch below)
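
A minimal sketch of the token-budget idea behind those fixes: keep only the most recent messages that fit, assuming tiktoken and a chat-style message list (the helper name and budget are illustrative):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def trim_history(messages, budget=8000):
    # keep the most recent messages that fit the token budget
    kept, used = [], 0
    for msg in reversed(messages):                # walk newest to oldest
        n = len(enc.encode(msg["content"]))
        if used + n > budget:
            break                                 # everything older is dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))                   # back to chronological order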

Key learnings

LLMs are not reasoning machines — they’re compression‑based next‑token predictors with limited memory. Understanding that boundary makes you a better builder.
