LLM Basics: How Machines Think (and Don't)

Param Harrison
7 min read

The big idea: it's all about prediction

If you saw the sentence: The dog chased the ___ — what word comes next?

You probably thought ball, cat, or squirrel. You didn't know the answer for sure; you predicted it based on common patterns you've seen in language before.

That's exactly what an LLM does. It's a prediction machine. It predicts the most likely next word (or "token") based on all the text it has learned from.

But if it's just guessing, how does it seem so smart? The answer lies in how it guesses, and in the settings we can control.

How we talk to an LLM

In the coding world, you don't just "talk" to an LLM. You send it a structured request, often called an API call. Think of it as filling out a form for the LLM.

# 1. Import the necessary library
from openai import OpenAI

# 2. Initialize the connection client
#    (In a real app, the API key is loaded securely)
client = OpenAI(api_key="...") 

# 3. Create the chat completion request
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Specify the model
    messages=[            # Define the message history
        {"role": "user", "content": "Explain what an LLM is in one sentence."}
    ],
    max_tokens=150        # Set a maximum token limit for the reply
)

# 4. Extract and print the text content of the reply
answer = response.choices[0].message.content
print(answer)

The magic isn't just in the prompt. It's in the settings you can add to that request. The most important one is "temperature."

Temperature: the creativity dial

"Temperature" is a setting that controls how creative or random the LLM's predictions are.

graph TD
    A[Temperature Dial] --> B{LLM's Behavior}
    B --> C[0.0: Predictable, Factual]
    B --> D[0.7: Balanced, General Chat]
    B --> E[1.5+: Creative, Unpredictable]

  • Low Temperature (e.g., 0.0): The LLM will always pick the most obvious, safest next word. It's boring, predictable, and great for facts or code.
  • High Temperature (e.g., 1.5): The LLM takes more risks, picking less common words. This makes it highly creative and imaginative, but also more likely to go off-topic.

Here's how we'd add that setting to our code:

# This request asks for a creative slogan
# by turning the temperature up.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a catchy slogan for a coffee shop."}
    ],
    temperature=1.2,  # <-- Set the "creativity dial" high
    max_tokens=20
)

If you ran this, you'd get a different slogan almost every time. If you set temperature=0.0, you'd probably get the same slogan every single time.
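To see the dial in action, here's a minimal sketch that reuses the client and model from the first example and runs the same prompt three times at a low and a high temperature. At 0.0 the three slogans should come back nearly identical; at 1.2 they should all differ.

# Compare a low and a high temperature on the same prompt.
# Assumes the `client` object created earlier in this post.
for temp in [0.0, 1.2]:
    print(f"--- temperature={temp} ---")
    for _ in range(3):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": "Write a catchy slogan for a coffee shop."}
            ],
            temperature=temp,
            max_tokens=20,
        )
        print(response.choices[0].message.content)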

Tokens: the LLM's word pieces

LLMs don't actually see "words." They see tokens.

Think of tokens as "word pieces." For English, 1 token is about 0.75 words.
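To make that concrete: a 1,500-word English article works out to roughly 2,000 tokens (1,500 ÷ 0.75).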

graph TD
    A["Human Language: 'Hello world!'"] --> B[Tokenization]
    B --> C["'Hello' (1 token)"]
    B --> D["' world' (1 token)"]
    B --> E["'!' (1 token)"]

This matters for two big reasons: cost and limits.

  1. Cost: You pay for every token. Both the tokens you send in (your prompt) and the tokens you get back (the answer).

  2. Limits: Every LLM has a "context window," or a maximum number of tokens it can remember at one time.

Different strings "cost" different amounts of tokens. Long words and other languages are often "more expensive." A library like tiktoken lets you count tokens before you send them.

import tiktoken

# Get the encoder for a specific model
encoding = tiktoken.encoding_for_model("gpt-4o")

# 'encoding.encode' turns text into a list of token numbers
tokens_hello = encoding.encode("Hello")
tokens_long_word = encoding.encode("Antidisestablishmentarianism")
tokens_chinese = encoding.encode("人工智能") # "Artificial Intelligence"

print(f"'Hello': {len(tokens_hello)} tokens")
print(f"'Antidisestablishmentarianism': {len(tokens_long_word)} tokens")
print(f"'人工智能': {len(tokens_chinese)} tokens")

# Example output (exact counts depend on the model's tokenizer):
# 'Hello': 1 tokens
# 'Antidisestablishmentarianism': 5 tokens
# '人工智能': 6 tokens

This is why an LLM can feel "smarter" or "cheaper" in English: its tokenizer was built mostly from English text, so English compresses into fewer tokens, and the model itself saw far more English during training.

Context windows: the LLM's short-term memory

The context window is the LLM's entire short-term memory. It's the maximum number of tokens (your prompt + its answer) it can handle at once.

graph TD
    A["Your Input (e.g., 2,000 Tokens)"] --- B["LLM's Brain"]
    C["LLM's Reply (e.g., 1,000 Tokens)"] --- B
    
    subgraph TOTAL["Total Memory Used: 3,000 Tokens"]
        A
        C
    end

    D{"Context Window Limit (e.g., 4,000 Tokens)"}
    TOTAL -- "Must Be Less Than" --> D

If your conversation (all your prompts and all its answers) gets longer than this limit, the LLM starts to "forget" the beginning of the conversation.

This is the single biggest challenge in using LLMs. You can't just ask it to "summarize this 500-page book" by pasting the whole book, because the book is probably 200,000 tokens, but the LLM's memory (context window) might only be 8,000 or 128,000 tokens.
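A practical habit is to count tokens before you send a request and check that they fit. Here's a rough sketch using tiktoken; the 128,000-token limit and the fits_in_context helper are illustrative assumptions, so check your model's documented context window.

import tiktoken

# Example numbers only -- check your model's documented limits.
CONTEXT_WINDOW = 128_000
RESERVED_FOR_REPLY = 1_000  # leave room for the answer

encoding = tiktoken.encoding_for_model("gpt-4o")

def fits_in_context(prompt):
    # True if the prompt plus the reserved reply budget fits the window
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + RESERVED_FOR_REPLY <= CONTEXT_WINDOW

print(fits_in_context("Summarize this book: ..."))  # True for short text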

The cost equation

Using LLMs isn't free. The cost is calculated very simply:

Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)

A key insight: Output tokens (the LLM's answer) are almost always more expensive than input tokens (your prompt). It "costs" more for the LLM to think than to listen.

Here's the logic for a cost-calculating function:

import tiktoken

# Count tokens with the same tiktoken encoder we used earlier
def count_tokens(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# A simple function to estimate the cost of one LLM call.
def estimate_cost(input_text, output_text):
    # 1. Define example prices (per 1 MILLION tokens)
    INPUT_PRICE_PER_1M_TOKENS = 0.15  # $0.15
    OUTPUT_PRICE_PER_1M_TOKENS = 0.60 # $0.60 (4x more expensive!)
    
    # 2. Count the tokens in the prompt and in the reply
    input_tokens = count_tokens(input_text)
    output_tokens = count_tokens(output_text)
    
    # 3. Calculate the cost for each part
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M_TOKENS
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M_TOKENS
    
    # 4. Add them up
    total_cost = input_cost + output_cost
    return total_cost
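Calling it looks like this; the reply string is made up for illustration, and the dollar figure depends entirely on the example prices above.

prompt = "Explain what an LLM is in one sentence."
reply = "An LLM is a model that predicts the next token based on patterns in its training data."

print(f"Estimated cost: ${estimate_cost(prompt, reply):.8f}")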

When LLMs fail (common glitches)

LLMs are amazing, but they are not perfect. They have very predictable failure modes.

1. Hallucinations (making stuff up)

An LLM's job is to predict the next word. It does not know what is true or false. A hallucination is when the LLM confidently generates a plausible-sounding but completely false statement.

If you ask it:

"Tell me about the 2023 Nobel Prize winner in Astrobotany."

It won't say, "That's not a real prize." It will invent a person, a university, and their "groundbreaking research" because those words statistically follow the pattern of your question.

2. Bad at math

LLMs are text-prediction machines, not calculators. They can reproduce simple math (like 2 + 2 = 4) because they've seen that text in their training data. But they can't reliably do math.

If you ask it:

"What is 234 * 567?"

It is very likely to give you the wrong answer. It's just predicting what a number looks like in that position, not actually performing the calculation.
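The usual fix is to keep arithmetic out of the model entirely and do it in ordinary code, letting the LLM handle the language around the numbers. A trivial sketch of the idea:

# Don't ask the LLM to multiply. Compute the number in code,
# then (if needed) hand the exact result to the LLM to explain.
a, b = 234, 567
result = a * b  # 132678, exact every time
print(f"{a} * {b} = {result}")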

Key takeaways

  • LLMs are predictors, not thinkers. They just guess the next most likely word.
  • Temperature is your "creativity dial." Low for facts, high for fiction.
  • Tokens are the "word pieces" you pay for. Everything has a cost.
  • Context Windows are the LLM's "short-term memory." This is their biggest limitation.
  • LLMs Hallucinate and are bad at math. Never trust them with facts or numbers without checking.

For more on building production AI systems, check out our AI Engineering Bootcamp.
