How Large Language Models Actually Work
TL;DR
LLMs use transformer architecture with self-attention to process token sequences in parallel, enabling them to learn complex language patterns from massive datasets.
Large language models power everything from code completion to document summarization. But most developers interact with them as black boxes. Understanding the internals — transformers, attention, and tokenization — gives you a practical edge when building AI-powered systems.
Tokenization: How Text Becomes Numbers
LLMs do not process raw text. They operate on tokens — numeric representations of text fragments. Most modern models use Byte Pair Encoding (BPE), which splits text into subword units based on frequency in the training data.
Common words like “the” map to a single token. Rare or compound words get split into multiple tokens. This is why token counts differ from word counts.
```python
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")
text = "Tokenization splits text into subword units."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode each token to see the splits
for token_id in tokens:
    print(f"  {token_id} -> '{enc.decode([token_id])}'")
```
Running this produces output like:

```
Token IDs: [3947, 2065, 27374, 6529, 1495, 1139, 1207, 1178, 8316, 13]
Token count: 10
```
Tokenization directly affects cost and context limits. Every API call is billed per token, and every model has a maximum context window measured in tokens — not words.
Why Subword Tokenization Matters
Subword tokenization strikes a balance between character-level and word-level approaches. It handles unseen words gracefully by decomposing them into known subword pieces, while keeping common words as single units for efficiency.
For developers, the practical implication is straightforward: always check token counts before sending requests. A 1,000-word document might be 1,300 tokens in one model and 1,500 in another.
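A budget check like the one described above can be sketched in a few lines. The helper names and the numbers here are illustrative, not from any SDK; in practice you would get `prompt_tokens` from your tokenizer and the limits from your model's documentation:

```python
def fits_context(prompt_tokens, context_window, max_output_tokens):
    """Return True if the prompt plus the reserved output budget fits."""
    return len(prompt_tokens) + max_output_tokens <= context_window

def truncate_to_budget(prompt_tokens, context_window, max_output_tokens):
    """Keep the most recent tokens that fit within the remaining budget."""
    budget = context_window - max_output_tokens
    return prompt_tokens[-budget:] if len(prompt_tokens) > budget else prompt_tokens
```

Truncating from the front (keeping the most recent tokens) is one common policy; summarizing or chunking the overflow are alternatives.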
The Transformer Architecture
The transformer is the architecture behind every major LLM. Introduced in the 2017 paper “Attention Is All You Need,” it replaced recurrent neural networks (RNNs) by processing entire sequences in parallel rather than one token at a time.
A transformer consists of stacked layers, each containing two core components:
- Self-attention mechanism — determines how each token relates to every other token in the sequence
- Feed-forward network — applies learned transformations to each token independently
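The two components above can be sketched in a few lines of NumPy. This is a deliberately stripped-down single block: real transformers add layer normalization, dropout, bias terms, multiple heads, and positional information, all omitted here for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each token's query is scored against every token's key.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def feed_forward(X, W1, W2):
    # Applied to each token position independently (ReLU nonlinearity).
    return np.maximum(0.0, X @ W1) @ W2

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    # Residual connections around each sublayer; layer norm omitted for brevity.
    X = X + self_attention(X, Wq, Wk, Wv)
    return X + feed_forward(X, W1, W2)
```

Stacking dozens of these blocks, each with its own learned weights, gives the model its depth.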
Self-Attention: The Core Innovation
Self-attention computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). The attention score between two tokens is the dot product of one token’s query with another’s key, scaled and passed through a softmax function.
In code terms, the attention calculation looks like this (using SciPy's `softmax`, which the original snippet left undefined):

```python
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    # Scale by sqrt(d_k) to keep dot products from growing with dimension.
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return np.matmul(weights, V)
```
This mechanism lets the model learn contextual relationships. The word “bank” attends differently to surrounding tokens in “river bank” versus “bank account” — and the model learns these distinctions from data.
Multi-Head Attention
Rather than computing a single attention pattern, transformers use multi-head attention — multiple attention computations running in parallel. Each head can learn different types of relationships: syntactic dependencies, semantic similarity, positional patterns.
A model with 16 attention heads examines 16 different relationship patterns simultaneously for every token in the sequence.
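The head-splitting idea can be sketched as follows, assuming the feature dimension divides evenly by the number of heads. Production implementations differ: they use separate learned projections per head plus a final output projection, and batch the heads into a single tensor operation rather than looping:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate."""
    seq_len, d_model = Q.shape
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        Qh, Kh, Vh = Q[:, s], K[:, s], V[:, s]
        # Each head computes its own attention pattern over the sequence.
        scores = Qh @ Kh.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ Vh)
    return np.concatenate(outputs, axis=-1)
```

Because each head works in its own subspace, one head can track syntax while another tracks topical similarity, without interfering with each other.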
Comparing Model Architectures
Not all LLMs use the same transformer variant. The choice of architecture affects what tasks the model handles well.
| Architecture | Examples | Strengths | Limitations |
|---|---|---|---|
| Decoder-only | GPT-4, Claude, Llama | Text generation, conversation, code | Unidirectional context |
| Encoder-only | BERT, RoBERTa | Classification, embeddings, NER | Cannot generate text |
| Encoder-decoder | T5, BART | Translation, summarization | Higher computational cost |
Most production LLMs today are decoder-only transformers. They predict the next token autoregressively — generating output one token at a time, with each new token conditioned on all previous tokens.
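The autoregressive loop itself is simple; all the complexity lives in the model that scores the next token. A sketch with greedy decoding, where `logits_fn` is a stand-in for a real model's forward pass:

```python
import numpy as np

def generate(logits_fn, prompt_tokens, max_new_tokens):
    """Greedy autoregressive decoding: each step conditions on all tokens so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)             # scores over the vocabulary
        tokens.append(int(np.argmax(logits)))  # pick the most likely next token
    return tokens
```

Real inference stacks add sampling strategies (temperature, top-p), stop conditions, and KV caching so earlier tokens are not recomputed at every step.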
From Architecture to Behavior
The architecture alone does not explain why LLMs produce coherent text. That comes from training — exposing the model to massive datasets and optimizing it to predict the next token accurately.
Key training stages include:
- Pretraining — learning language patterns from billions of tokens of text
- Supervised fine-tuning — learning to follow instructions from curated examples
- RLHF / preference optimization — aligning outputs with human preferences
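The pretraining objective behind the first stage is next-token cross-entropy: maximize the probability the model assigns to each token that actually came next. A minimal NumPy sketch of the loss computation:

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Average cross-entropy of predicting each next token.

    logits: (seq_len, vocab) scores; target_ids: the actual next tokens.
    """
    # Softmax over the vocabulary (numerically stabilized).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Probability the model assigned to each correct next token.
    target_probs = probs[np.arange(len(target_ids)), target_ids]
    return -np.log(target_probs).mean()
```

Training repeatedly nudges the weights to lower this loss across billions of token positions; everything else the model does emerges from that single pressure.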
The model does not “understand” language in the human sense. It learns statistical patterns over token sequences that produce remarkably coherent and useful outputs.
Practical Implications for Developers
Understanding these internals changes how you work with LLMs:
- Token limits are hard constraints. Structure your inputs to fit within context windows.
- Attention is quadratic. Doubling the input length roughly quadruples the compute cost.
- Order matters. In practice, models often recall information near the start and end of a prompt better than content buried in the middle, so place critical context at the edges — typically last.
- Temperature controls randomness. Lower values (0.0–0.3) produce focused, near-deterministic output; higher values (0.7–1.0) increase variety.
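The last point is concrete enough to sketch: temperature works by scaling logits before the softmax, which sharpens or flattens the sampling distribution. A simplified version of the sampling step inference engines perform:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index; lower temperature sharpens the distribution."""
    if temperature == 0:
        return int(np.argmax(logits))  # deterministic: always the top token
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # stabilized softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

At temperature 2.0 the same logits yield far more varied choices than at 0.2, which is exactly the determinism/variety trade-off described above.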
FAQ
What is a transformer in machine learning?
A transformer is a neural network architecture that uses self-attention mechanisms to process input sequences in parallel, rather than sequentially like RNNs. This parallelism enables transformers to train on much larger datasets and capture long-range dependencies in text more effectively.
How does tokenization work in LLMs?
Tokenization breaks text into subword units using algorithms like Byte Pair Encoding (BPE). Common words become single tokens while rare words are split into smaller pieces. This approach balances vocabulary size with the ability to represent any input text, including words the model has not seen during training.