How Large Language Models Actually Work

by Tomáš
4 min read

TL;DR

LLMs use transformer architecture with self-attention to process token sequences in parallel, enabling them to learn complex language patterns from massive datasets.

Large language models power everything from code completion to document summarization. But most developers interact with them as black boxes. Understanding the internals — transformers, attention, and tokenization — gives you a practical edge when building AI-powered systems.

Tokenization: How Text Becomes Numbers

LLMs do not process raw text. They operate on tokens — numeric representations of text fragments. Most modern models use Byte Pair Encoding (BPE), which splits text into subword units based on frequency in the training data.

Common words like “the” map to a single token. Rare or compound words get split into multiple tokens. This is why token counts differ from word counts.

from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")
text = "Tokenization splits text into subword units."

tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode each token to see the splits
for token_id in tokens:
    print(f"  {token_id} -> '{enc.decode([token_id])}'")

Running this produces output like:

Token IDs: [3947, 2065, 27374, 6529, 1495, 1139, 1207, 1178, 8316, 13]
Token count: 10

Tokenization directly affects cost and context limits. Every API call is billed per token, and every model has a maximum context window measured in tokens — not words.

Why Subword Tokenization Matters

Subword tokenization strikes a balance between character-level and word-level approaches. It handles unseen words gracefully by decomposing them into known subword pieces, while keeping common words as single units for efficiency.

For developers, the practical implication is straightforward: always check token counts before sending requests. A 1,000-word document might be 1,300 tokens in one model and 1,500 in another.
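A minimal way to apply that advice is a pre-flight check: count tokens with the same encoder you will be billed against and compare the count against the model's context limit. The sketch below is encoder-agnostic; with tiktoken you would pass `enc.encode`, and the whitespace split used here is only a dependency-free stand-in, not a real tokenizer:

```python
def count_tokens(encode, text):
    """Count tokens using any encoder function that maps text -> token list."""
    return len(encode(text))

def fits_in_context(encode, text, limit):
    """Check whether a prompt fits before sending it to the API."""
    return count_tokens(encode, text) <= limit

# With tiktoken you would pass enc.encode here; str.split is a stand-in
# so the sketch runs without extra dependencies.
fake_encode = str.split
print(fits_in_context(fake_encode, "a short prompt", limit=10))  # True
```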

The Transformer Architecture

The transformer is the architecture behind every major LLM. Introduced in 2017, it replaced recurrent neural networks (RNNs) by processing entire sequences in parallel rather than one token at a time.

A transformer consists of stacked layers, each containing two core components:

  1. Self-attention mechanism — determines how each token relates to every other token in the sequence
  2. Feed-forward network — applies learned transformations to each token independently

Self-Attention: The Core Innovation

Self-attention computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). The attention score between two tokens is the dot product of one token’s query with another’s key, scaled and passed through a softmax function.

In code terms, the attention calculation looks like this:

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]                          # key/query dimensionality
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)  # scaled dot products
    weights = softmax(scores, axis=-1)         # attention weights per token
    return np.matmul(weights, V)               # weighted sum of value vectors

This mechanism lets the model learn contextual relationships. The word “bank” attends differently to surrounding tokens in “river bank” versus “bank account” — and the model learns these distinctions from data.
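To see the mechanism run end to end, here is a self-contained toy example (definitions repeated so the snippet runs on its own; the random 4-token, 8-dimensional Q, K, V are illustrative, not real model activations). This variant also returns the attention weights so you can inspect them:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return np.matmul(weights, V), weights  # return weights for inspection

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional vectors
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # (4, 8): one context-mixed vector per token
print(weights.sum(axis=-1))  # each row of weights sums to 1 (up to float error)
```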

Multi-Head Attention

Rather than computing a single attention pattern, transformers use multi-head attention — multiple attention computations running in parallel. Each head can learn different types of relationships: syntactic dependencies, semantic similarity, positional patterns.

A model with 16 attention heads examines 16 different relationship patterns simultaneously for every token in the sequence.
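A minimal sketch of that idea, with one big simplification: real transformers apply learned projection matrices per head for Q, K, and V plus a final output projection, all omitted here so only the split-attend-concatenate structure remains:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, num_heads):
    """Split the model dimension into heads, attend within each head in
    parallel, then concatenate the per-head outputs back together."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Simplification: use raw slices of X as Q, K, V (no learned projections).
        Qh = Kh = Vh = X[:, h * d_head:(h + 1) * d_head]
        scores = Qh @ Kh.T / np.sqrt(d_head)
        heads.append(softmax(scores, axis=-1) @ Vh)
    return np.concatenate(heads, axis=-1)

X = np.random.default_rng(1).normal(size=(4, 32))  # 4 tokens, d_model = 32
out = multi_head_attention(X, num_heads=16)        # 16 heads of size 2 each
print(out.shape)  # (4, 32): same shape as the input
```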

Comparing Model Architectures

Not all LLMs use the same transformer variant. The choice of architecture affects what tasks the model handles well.

Architecture      Examples               Strengths                             Limitations
Decoder-only      GPT-4, Claude, Llama   Text generation, conversation, code   Unidirectional context
Encoder-only      BERT, RoBERTa          Classification, embeddings, NER       Cannot generate text
Encoder-decoder   T5, BART               Translation, summarization            Higher computational cost

Most production LLMs today are decoder-only transformers. They predict the next token autoregressively — generating output one token at a time, with each new token conditioned on all previous tokens.
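The autoregressive loop can be sketched in a few lines. The `toy_next_token_logits` function below is a made-up stand-in for a real model's forward pass; only the loop structure (predict, pick a token, append, repeat) reflects how decoder-only models actually generate:

```python
import numpy as np

def toy_next_token_logits(tokens, vocab_size=10):
    """Stand-in for a real model's forward pass: deterministic toy logits
    derived from the current sequence (illustration only)."""
    rng = np.random.default_rng(sum(tokens))
    return rng.normal(size=vocab_size)

def generate(prompt_tokens, steps):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = toy_next_token_logits(tokens)
        next_token = int(np.argmax(logits))  # greedy decoding
        tokens.append(next_token)            # condition the next step on it
    return tokens

out = generate([1, 2, 3], steps=5)
print(out)  # prompt plus 5 newly generated token IDs
```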

From Architecture to Behavior

The architecture alone does not explain why LLMs produce coherent text. That comes from training — exposing the model to massive datasets and optimizing it to predict the next token accurately.

Key training stages include:

  • Pretraining — learning language patterns from billions of tokens of text
  • Supervised fine-tuning — learning to follow instructions from curated examples
  • RLHF / preference optimization — aligning outputs with human preferences

The model does not “understand” language in the human sense. It learns statistical patterns over token sequences that produce remarkably coherent and useful outputs.
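The pretraining objective behind the first stage can be made concrete: at each position, the model predicts a distribution over the vocabulary, and the loss is the average negative log-probability of the true next token. A toy version with made-up logits and a three-token vocabulary:

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Average cross-entropy of the true next tokens under the model's
    predicted distributions. logits: (seq_len, vocab_size)."""
    # Stable log-softmax: shift by the row max before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

logits = np.array([[2.0, 0.5, 0.1],    # position 0 favors token 0
                   [0.2, 3.0, 0.4]])   # position 1 favors token 1
print(next_token_loss(logits, [0, 1]))  # low loss: predictions match targets
print(next_token_loss(logits, [2, 0]))  # higher loss: predictions miss
```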

Practical Implications for Developers

Understanding these internals changes how you work with LLMs:

  • Token limits are hard constraints. Structure your inputs to fit within context windows.
  • Attention is quadratic. Doubling the input length roughly quadruples the compute cost.
  • Order matters. Decoder-only models often weight recent tokens more heavily, so placing critical context near the end of the prompt can help.
  • Temperature controls randomness. Lower values (0.0–0.3) produce near-deterministic output; higher values (0.7–1.0) increase variety.
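The temperature point is easy to see numerically: sampling divides the logits by the temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it. A small sketch (the logits here are made up):

```python
import numpy as np

def temperature_probs(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    scaled -= scaled.max()  # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

def sample_with_temperature(logits, temperature, rng):
    probs = temperature_probs(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
print(temperature_probs(logits, 0.1))  # sharply peaked on the top token
print(temperature_probs(logits, 1.5))  # much flatter distribution
print(sample_with_temperature(logits, 1.0, np.random.default_rng(0)))
```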

FAQ

What is a transformer in machine learning?

A transformer is a neural network architecture that uses self-attention mechanisms to process input sequences in parallel, rather than sequentially like RNNs. This parallelism enables transformers to train on much larger datasets and capture long-range dependencies in text more effectively.

How does tokenization work in LLMs?

Tokenization breaks text into subword units using algorithms like Byte Pair Encoding (BPE). Common words become single tokens while rare words are split into smaller pieces. This approach balances vocabulary size with the ability to represent any input text, including words the model has not seen during training.
