LLM Fundamentals
This post uses many abbreviations (BPE, FFN, RoPE, etc.). See the LLM Abbreviations Glossary for a quick reference.
What Is an LLM?#
At its core, an LLM is a next-token predictor.
Given a sequence of tokens, it predicts the most probable next token. By repeating this (autoregressive generation), it produces coherent text.
Input: "The cat sat on the"
Model: P("mat")=0.23, P("floor")=0.18, P("roof")=0.07, ...
Pick: "mat"
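In code, a single step of that loop is just "compute a distribution, pick a token". A toy sketch, with hand-written probabilities standing in for the model:

```python
# One greedy generation step. The probabilities are hand-written here;
# the rest of this post explains how a real model computes them.
probs = {"mat": 0.23, "floor": 0.18, "roof": 0.07}   # P(next | "The cat sat on the")
next_token = max(probs, key=probs.get)               # greedy pick
print(next_token)                                    # "mat"
```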
Here’s the high-level pipeline — every concept in this post maps to one of these stages:
"The cat sat" ← raw text
│
▼
[The] [cat] [sat] ← Tokenization (§1)
│
▼
[0.12, -0.5, ...] ← Token Embedding (§2)
[0.78, 0.3, ...]
[0.45, -0.1, ...]
│
▼
┌─────────────────┐
│ Transformer │ ← Self-Attention (§3) + FFN (Feed-Forward Network) (§4)
│ × N layers │
└────────┬────────┘
│
▼
[0.01, 0.23, 0.18..] ← Logits → Probabilities
│
▼
"on" ← Sampling (§5)
NLP (Natural Language Processing) Evolution#
| Era | Period | Approach | Limitation |
|---|---|---|---|
| Rule-based | 1950s-1980s | Hand-crafted patterns (ELIZA) | Can’t generalize |
| Statistical | 1990s-2000s | N-grams, HMMs (Hidden Markov Models) | Sparse, no long-range context |
| Embeddings | 2013-2014 | Word2Vec, GloVe | Static (one vector per word regardless of context) |
| Seq2Seq (Sequence-to-Sequence) | 2014-2017 | RNN (Recurrent Neural Network) / LSTM (Long Short-Term Memory) + attention | Sequential, slow, vanishing gradients |
| Transformer | 2017 | Self-attention, parallelizable | The breakthrough → modern LLMs |
Where Does the Transformer Come From?#
Machine Learning (ML)
└── Deep Learning (DL) ← ML with multi-layer neural networks
└── Neural Network (NN) ← inspired by biological neurons
├── CNN (Convolutional Neural Network) ← images
├── RNN (Recurrent Neural Network) ← sequences (text, audio)
│ └── LSTM / GRU ← better at long sequences
└── Transformer ← replaced RNN for NLP (2017)
└── LLM (Large Language Model) ← Transformer at massive scale
- Machine Learning: algorithms that learn patterns from data (instead of hand-coded rules)
- Deep Learning: ML using neural networks with many layers — “deep” = many layers
- Neural Network: layers of interconnected nodes, loosely inspired by brain neurons. Each node applies:
output = activation(weights · input + bias)
- Transformer: a specific neural network architecture from the 2017 paper “Attention Is All You Need”. Replaced RNNs by using self-attention instead of sequential processing — this made it parallelizable and far more scalable
- LLM: a Transformer trained on massive text data (billions to trillions of tokens) with billions of parameters
1. Tokenization#
LLMs don’t read characters or words — they read tokens (subword units).
Why Not Characters or Words?#
| Strategy | Vocabulary Size | Problem |
|---|---|---|
| Characters | ~256 | Sequences too long, no semantic meaning per unit |
| Words | ~500K+ | Vocabulary explodes, can’t handle typos or new words |
| Subwords (BPE, Byte Pair Encoding) | 32K-128K | Best trade-off: manageable vocab, handles unseen words |
BPE (Byte Pair Encoding) — Step by Step#
BPE builds a vocabulary by iteratively merging the most frequent character pairs. The result is a vocabulary table that maps each subword to an integer ID.
Step 1 — Build the vocabulary (done once, during tokenizer training):
Training corpus: "low lower lowest lowly"
Iteration 0: Start with characters
l o w _ l o w e r _ l o w e s t _ l o w l y
Iteration 1: Most frequent pair: (l, o) → merge into "lo"
lo w _ lo w e r _ lo w e s t _ lo w l y
Iteration 2: Most frequent pair: (lo, w) → merge into "low"
low _ low e r _ low e s t _ low l y
Iteration 3: Most frequent pair: (low, e) → merge into "lowe"
low _ lowe r _ lowe s t _ low l y
... continue until vocabulary size is reached
This produces a vocabulary — every character and merged subword gets an integer ID:
ID: 0 1 2 3 4 5 6 7 8 9 10 11
Token: "l" "o" "w" "e" "r" "s" "t" "y" "_" "lo" "low" "lowe"
Step 2 — Tokenize input (done every time text is fed to the model):
Apply the learned merge rules to split input text into known subwords, then look up their IDs:
"lowest" → ["lowe", "s", "t"] → IDs: [11, 5, 6]
↑ in vocab ↑ "st" not in vocab, falls back to "s"=5 + "t"=6
"lowly" → ["low", "l", "y"] → IDs: [10, 0, 7]
↑ in vocab ↑ "ly" not in vocab, falls back to "l"=0 + "y"=7
"low" → ["low"] → IDs: [10]
↑ exact match
The key rule: if a subword isn’t in the vocabulary, BPE falls back to smaller known pieces, all the way down to individual characters (which are always in the vocab).
In real tokenizers the vocabulary is much larger (32K-128K entries), built from massive training corpora instead of 4 words. Subwords like “ness”, “ing”, “tion” would all be merged and have their own IDs.
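Here is a minimal BPE sketch in Python that learns the merges from the four-word corpus and then encodes new words. It is an illustration only: real tokenizers (tiktoken, SentencePiece) work at the byte level, handle word boundaries, and are heavily optimized.

```python
# Minimal BPE: learn merge rules from a tiny corpus, then apply them.
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    symbols = list(word)
    for a, b in merges:                            # apply merges in learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = train_bpe(["low", "lower", "lowest", "lowly"], num_merges=3)
print(merges)                    # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(encode("lowest", merges))  # ['lowe', 's', 't']
```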
The output of tokenization is always a 1D array of integer IDs — this is what the model receives:
"The cat sat on" → ["The", " cat", " sat", " on"] → [464, 2368, 3520, 319]
human-readable tokens actual model input
Tokenization Artifacts#
This is why LLMs make surprising mistakes:
"strawberry" → ["straw", "berry"] Can't easily count letters — "r" is split across tokens
"123 + 456" → ["123", " +", " 456"] Digits grouped arbitrarily → arithmetic errors
"GPT" → ["G", "PT"] Acronyms split unpredictably
📚 References:
- Let’s build the GPT Tokenizer by Andrej Karpathy — full implementation walk-through
- Tiktokenizer — interactive tool to see how text gets tokenized
- SentencePiece — language-agnostic tokenizer used by Llama, T5
2. Token Embedding#
After tokenization, each token is a discrete ID (e.g., “cat” → 2368). But neural networks need continuous vectors. Embedding maps each token ID to a dense vector.
How It Works#
The model has a learned embedding matrix E of shape [vocab_size × d_model]. Each row is a token’s embedding.
Vocabulary: "the"=0 "cat"=1 "sat"=2 "on"=3 ...
┌ ┐
Embedding │ 0.12 -0.50 0.33 0.78 │ ← "the" (ID 0)
Matrix E = │ 0.78 0.30 -0.12 0.45 │ ← "cat" (ID 1)
(vocab × d) │ 0.45 -0.10 0.67 -0.23 │ ← "sat" (ID 2)
│ 0.33 0.89 0.11 0.56 │ ← "on" (ID 3)
│ ... │
└ ┘
Input: "the cat sat" → IDs: [0, 1, 2]
Lookup: E[0] = [0.12, -0.50, 0.33, 0.78] ← embedding for "the"
E[1] = [0.78, 0.30, -0.12, 0.45] ← embedding for "cat"
E[2] = [0.45, -0.10, 0.67, -0.23] ← embedding for "sat"
This is just a table lookup, not a computation.
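A sketch of the lookup in NumPy, with toy sizes (real models use a vocabulary of tens of thousands of tokens and a d_model in the thousands):

```python
# Embedding lookup is plain row indexing into a learned matrix.
import numpy as np

vocab_size, d_model = 8, 4                   # toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))   # learned during training

token_ids = np.array([0, 1, 2])              # "the", "cat", "sat"
embeddings = E[token_ids]                    # one row per token
print(embeddings.shape)                      # (3, 4)
```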
Where Do These Vectors Come From?#
The embedding matrix is randomly initialized before training, then learned via backpropagation — the same way all other weights in the model are updated.
Before training: E["the"] = [0.52, -0.11, 0.87, 0.03] ← random initialization
After training: E["the"] = [0.12, -0.50, 0.33, 0.78] ← learned from data
During training, the model’s goal is to predict the next token. If adjusting E["the"] helps the model make better predictions, backpropagation will nudge those values accordingly. After billions of training examples, the vectors converge to encode meaningful patterns about each token’s usage.
Why Embeddings Work#
As a result, tokens that appear in similar contexts end up with similar vectors. This creates a semantic space:
cosine_similarity("king", "queen") ≈ 0.85 (both royalty)
cosine_similarity("king", "pizza") ≈ 0.12 (unrelated)
cosine_similarity("cat", "dog") ≈ 0.78 (both pets)
The famous example: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
In real models, d_model is large: GPT-3 uses 12288 dimensions, Llama 3 (405B) uses 16384.
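Cosine similarity itself is one line of NumPy. The vectors below are made up for the example; the point is the pattern, not the exact numbers:

```python
# Cosine similarity between embedding vectors (toy, hand-picked vectors).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat   = np.array([0.78, 0.30, -0.12, 0.45])
dog   = np.array([0.70, 0.35, -0.05, 0.40])
pizza = np.array([-0.60, 0.10, 0.90, -0.20])

print(cosine_similarity(cat, dog))     # high: similar contexts
print(cosine_similarity(cat, pizza))   # low: unrelated
```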
Positional Encoding#
Self-attention treats input as a set — it has no notion of token order. “cat sat on mat” and “mat on sat cat” would produce the same attention scores. We must inject position information.
Final input = Token Embedding + Positional Encoding
Token "cat" at position 2:
token_emb = [0.78, 0.30, -0.12, 0.45] ← what the token is
pos_enc(2) = [0.91, -0.42, 0.14, -0.99] ← where the token is
input = [1.69, -0.12, 0.02, -0.54] ← sum
| Method | How It Works | Used By |
|---|---|---|
| Sinusoidal | Fixed sin/cos functions at different frequencies | Original Transformer (2017) |
| Learned | Trainable position vectors (one per position) | GPT-1, GPT-2 |
| RoPE (Rotary Position Embedding) | Rotates Q/K vectors by position-dependent angle | Llama, DeepSeek, most modern LLMs |
RoPE is dominant today because it encodes relative positions (distance between tokens matters more than absolute position) and supports extrapolation to longer sequences than seen in training.
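For reference, here is the sinusoidal variant from the original Transformer, which is the easiest to write down. This is not RoPE; RoPE rotates the Q/K vectors inside attention rather than adding a vector to the embedding:

```python
# Sinusoidal positional encoding (original Transformer, 2017).
import numpy as np

def sinusoidal_pos_enc(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims get sin
    pe[:, 1::2] = np.cos(angle)                    # odd dims get cos
    return pe

token_emb = np.random.default_rng(0).normal(size=(3, 8))   # 3 tokens, d_model=8
x = token_emb + sinusoidal_pos_enc(3, 8)                   # position-aware input
print(x.shape)                                             # (3, 8)
```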
📚 References:
- The Illustrated Word2Vec by Jay Alammar — visual intuition for embeddings
- Word Embeddings by Lena Voita — beginner-friendly course
- RoPE explanation by EleutherAI — technical deep dive into Rotary Position Embeddings
3. Self-Attention#
This is the core mechanism that makes Transformers work. It lets each token decide how much to attend to every other token in the sequence.
Intuition#
Consider: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? A human instantly knows “it” = “animal”. Self-attention learns this by letting “it” compute a high attention score with “animal”.
Q, K, V — Query, Key, Value#
Each token is projected into three vectors via learned weight matrices:
| Vector | Role | Analogy |
|---|---|---|
| Q (Query) | “What am I looking for?” | A search query |
| K (Key) | “What do I contain?” | A document title / tag |
| V (Value) | “What information do I provide?” | The actual document content |
For each token:
Q = token_embedding × W_Q (W_Q is a learned weight matrix)
K = token_embedding × W_K
V = token_embedding × W_V
Dry Run — 3 Tokens, d_k = 4#
Input: ["The", "cat", "sat"]
Step 1: Project into Q, K, V (via learned weights — simplified here):
Q = [[1, 0, 1, 0], K = [[1, 1, 0, 0], V = [[1, 0, 1, 0],
[0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 0, 1],
[1, 1, 0, 0]] [1, 0, 1, 0]] [1, 1, 0, 0]]
Step 2: Compute attention scores = Q · Kᵀ
Q · Kᵀ = [[1×1+0×1+1×0+0×0, 1×0+0×0+1×1+0×1, 1×1+0×0+1×1+0×0], [[1, 1, 2],
[0×1+1×1+0×0+1×0, 0×0+1×0+0×1+1×1, 0×1+1×0+0×1+1×0], = [1, 1, 0],
[1×1+1×1+0×0+0×0, 1×0+1×0+0×1+0×1, 1×1+1×0+0×1+0×0]] [2, 0, 1]]
Step 3: Scale by √d_k = √4 = 2 (prevents softmax saturation)
Scaled = [[0.5, 0.5, 1.0],
[0.5, 0.5, 0.0],
[1.0, 0.0, 0.5]]
Step 4: Softmax (each row sums to 1.0)
Weights = [[0.27, 0.27, 0.45], ← "The" attends most to "sat"
[0.38, 0.38, 0.23], ← "cat" attends equally to "The" and itself
[0.51, 0.19, 0.31]] ← "sat" attends most to "The"
Step 5: Weighted sum of V vectors
Output[0] = 0.27×V["The"] + 0.27×V["cat"] + 0.45×V["sat"]
= 0.27×[1,0,1,0] + 0.27×[0,1,0,1] + 0.45×[1,1,0,0]
= [0.72, 0.72, 0.27, 0.27]
Each output token is now a context-aware blend of all tokens, weighted by relevance.
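The entire dry run fits in a few lines of NumPy; the numbers match the steps above up to rounding:

```python
# Scaled dot-product attention, reproducing the dry run.
import numpy as np

Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
K = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], dtype=float)
V = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

d_k = Q.shape[-1]
scores  = Q @ K.T / np.sqrt(d_k)   # [[0.5, 0.5, 1.0], [0.5, 0.5, 0.0], [1.0, 0.0, 0.5]]
weights = softmax(scores)          # rows: [0.27, 0.27, 0.45], [0.38, 0.38, 0.23], [0.51, 0.19, 0.31]
output  = weights @ V              # row 0: [0.73, 0.73, 0.27, 0.27] (the text above rounds the weights first)
print(weights.round(2))
print(output.round(2))
```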
Multi-Head Attention#
Instead of a single attention operation, the model runs h parallel attention heads (GPT-3: h=96, Llama 3 8B: h=32). Each head has its own W_Q, W_K, W_V — they learn different relationships:
Head 1: might learn syntactic structure ("sat" → "cat" as subject)
Head 2: might learn positional proximity ("sat" → "on" as next word)
Head 3: might learn semantic similarity ("cat" → "animal")
...
MultiHead = Concat(head_1, ..., head_h) × W_O
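A sketch of the shape bookkeeping, with two heads, random weights, and tiny dimensions:

```python
# Multi-head attention: h independent heads, concatenated, then projected by W_O.
import numpy as np

d_model, h, seq_len = 8, 2, 3
d_head = d_model // h
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def one_head(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V             # each head has its own projections
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V   # (seq_len, d_head)

x = rng.normal(size=(seq_len, d_model))
heads = [one_head(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
         for _ in range(h)]
W_O = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_O          # Concat(head_1..head_h) × W_O
print(out.shape)                                    # (3, 8)
```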
Causal Mask (Decoder-Only)#
For text generation, a token must NOT attend to future tokens (it hasn’t generated them yet). The causal mask enforces this:
The cat sat on
The [ ✓ ✗ ✗ ✗ ] ✓ = can attend
cat [ ✓ ✓ ✗ ✗ ] ✗ = masked (-∞ before softmax)
sat [ ✓ ✓ ✓ ✗ ]
on [ ✓ ✓ ✓ ✓ ]
Applied by setting masked positions to -∞ before softmax → they become 0 probability.
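A sketch of the mask in NumPy, with random scores standing in for Q·Kᵀ/√d_k:

```python
# Causal mask: future positions become -inf before the softmax, hence 0 after it.
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))                  # pretend Q·Kᵀ/√d_k
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                                        # block attention to the future

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(scores).round(2))   # upper triangle is all zeros
```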
📚 References:
- Attention? Attention! by Lilian Weng — evolution of attention mechanisms
- The Illustrated Transformer by Jay Alammar — the best visual explanation
- Visual intro to Transformers by 3Blue1Brown — animated explanation
- LLM Visualization by Brendan Bycroft — interactive 3D visualization
4. Transformer Block#
A single Transformer block does one round of “gather context, then think”:
- Attention: “which tokens are relevant to each other?” (gather context)
- FFN: “given that context, what knowledge applies here?” (think and refine)
But one round isn’t enough — just like you can’t fully understand a complex sentence in one pass. So LLMs stack many blocks (GPT-3: 96, Llama 3 8B: 32, Llama 3 405B: 126), each refining the token representations further:
Input: "The bank by the river was steep"
Block 1-2: syntax — "bank" is a noun, "was" links to "bank"
Block 3-10: semantics — "bank" + "river" → this means riverbank, not financial bank
Block 11-30: reasoning — combines all context to predict what comes next
After block 1, the vector for “bank” still mostly means “bank (ambiguous)”. After block 10, it means “riverbank” — because attention pulled in context from “river”, and the FFN transformed that blended signal into a more specific representation.
Here’s one block:
Input (embeddings)
│
▼
┌───────────────────┐
│ Multi-Head │
│ Self-Attention │ ← "which tokens are relevant to each other?"
└─────────┬──────────┘
│
┌────▼────┐
│ Add & │ ← residual connection: output = attention(x) + x
│ LayerNorm│ prevents vanishing gradients in deep networks
└────┬─────┘
│
┌─────────▼──────────┐
│ FFN (Feed-Forward │
│ Network) │ ← "what to do with that information?"
│ │
│ up = x × W₁ │ d_model → 4×d_model (expand)
│ act = GeLU(up) │ GeLU (Gaussian Error Linear Unit) non-linearity
│ down = act × W₂ │ 4×d_model → d_model (compress)
└─────────┬──────────┘
│
┌────▼────┐
│ Add & │ ← another residual connection
│ LayerNorm│
└────┬─────┘
│
▼
Output (to next layer)
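Here is a sketch of one block's forward pass with random weights, following the diagram. It uses single-head attention and a LayerNorm without learned scale/shift to stay short; real blocks use multi-head attention and learned normalization parameters:

```python
# One Transformer block (post-LayerNorm layout, as in the diagram).
import numpy as np

d_model, d_ff, seq_len = 8, 32, 3
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def gelu(x):   # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)        # causal mask
    return softmax(scores) @ V

def ffn(x, W1, W2):
    return gelu(x @ W1) @ W2                        # expand (d -> 4d), GeLU, compress (4d -> d)

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(seq_len, d_model))             # token + position embeddings
x = layer_norm(x + attention(x, W_Q, W_K, W_V))     # Add & LayerNorm
x = layer_norm(x + ffn(x, W1, W2))                  # Add & LayerNorm
print(x.shape)                                      # (3, 8), ready for the next block
```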
Attention vs FFN: Who Does What?#
Think of it as a team meeting:
- Attention = the discussion phase — everyone shares information, and each person decides who to listen to. It operates across tokens (token-to-token communication).
- FFN = the thinking phase — each person goes back to their desk and processes what they heard, using their own knowledge. It operates within each token independently (no cross-token communication).
Concrete example with "Paris is the capital of ___":
After Attention:
The vector for "___" now contains blended information from
"Paris", "capital", "of" — it knows the context.
After FFN:
The FFN processes that blended vector and activates the
stored knowledge: "capital of Paris → France"
The vector for "___" is now transformed to strongly
predict "France" as the next token.
| Component | Operates On | What It Does | Where Knowledge Lives |
|---|---|---|---|
| Attention | Between tokens | Routes and blends information | Learns which tokens matter to each other |
| FFN | Each token independently | Transforms and refines | Stores factual knowledge in its weight matrices |
The FFN’s weight matrices (W₁, W₂) are enormous — they make up ~2/3 of the model’s total parameters. Research shows that specific neurons in FFN layers fire for specific facts (e.g., “Paris → France”), which is why FFN is often called the model’s “memory”.
Residual Connections#
Without residuals, gradients vanish in deep networks (96+ layers). The residual path provides a “gradient highway”:
output = LayerNorm(x + Sublayer(x))
↑
gradient flows directly through here
Architecture Variants#
| Type | Attention Pattern | Examples | Best For |
|---|---|---|---|
| Encoder-only | Bidirectional (sees all tokens) | BERT (Bidirectional Encoder Representations from Transformers), RoBERTa | Classification, NER (Named Entity Recognition), embeddings |
| Decoder-only | Causal (sees only past tokens) | GPT (Generative Pre-trained Transformer), Llama, Claude | Text generation, chat, code |
| Encoder-Decoder | Encoder: bidirectional, Decoder: causal + cross-attention | T5, BART | Translation, summarization |
Modern LLMs are almost exclusively decoder-only — simpler, scales better, and the causal mask enables autoregressive generation.
📚 References:
- nanoGPT by Andrej Karpathy — reimplement GPT from scratch in 2 hours
- Transformer Circuits Thread by Anthropic — deep analysis of what each component learns
5. Text Generation (Sampling)#
After the final Transformer block, the model outputs a logit (raw score) for every token in the vocabulary. How we convert logits into a chosen token is the sampling strategy.
From Logits to Probabilities#
To keep the math verifiable, let’s use a simplified vocabulary of just 4 tokens:
Tokens: ["on", "the", "mat", "in"]
Logits: [2.0, 1.0, 0.5, 0.0] ← raw scores from the model
Softmax: P(token) = e^logit / Σe^all_logits
e^2.0 = 7.39 e^1.0 = 2.72 e^0.5 = 1.65 e^0.0 = 1.00
sum = 7.39 + 2.72 + 1.65 + 1.00 = 12.76
P("on") = 7.39 / 12.76 = 0.58
P("the") = 2.72 / 12.76 = 0.21
P("mat") = 1.65 / 12.76 = 0.13
P("in") = 1.00 / 12.76 = 0.08
sum = 1.00 ✓
In real models, the softmax is over the entire vocabulary (32K-128K tokens), so each individual probability is much smaller. We use 4 tokens here to keep the math clear.
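The same computation in plain Python:

```python
# Softmax over a 4-token vocabulary (reproduces the numbers above).
import math

tokens = ["on", "the", "mat", "in"]
logits = [2.0, 1.0, 0.5, 0.0]

exps  = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
for token, p in zip(tokens, probs):
    print(f'P("{token}") = {p:.2f}')   # 0.58, 0.21, 0.13, 0.08
```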
Greedy Decoding#
Always pick the highest probability token. Deterministic but often repetitive:
P("on")=0.58, P("the")=0.21, P("mat")=0.13, P("in")=0.08
→ Always picks "on"
Temperature#
Temperature reshapes the probability distribution before sampling:
New logits = original logits / T
| Temperature | Effect | Distribution |
|---|---|---|
| T → 0 | Deterministic (greedy) | All probability on top token |
| T = 1.0 | Original distribution | Balanced |
| T > 1.0 | Flatter, more random | Spreads probability to unlikely tokens |
Dry run with logits [2.0, 1.0, 0.5, 0.0] for tokens ["on", "the", "mat", "in"]:
T = 0.5 (focused):
logits/T = [4.0, 2.0, 1.0, 0.0]
e^4.0=54.60 e^2.0=7.39 e^1.0=2.72 e^0.0=1.00 sum=65.71
softmax → P("on")=0.83 P("the")=0.11 P("mat")=0.04 P("in")=0.02
→ almost always picks "on"
T = 1.0 (original):
logits/T = [2.0, 1.0, 0.5, 0.0]
softmax → P("on")=0.58 P("the")=0.21 P("mat")=0.13 P("in")=0.08
→ usually "on", sometimes "the"
T = 2.0 (creative):
logits/T = [1.0, 0.5, 0.25, 0.0]
e^1.0=2.72 e^0.5=1.65 e^0.25=1.28 e^0.0=1.00 sum=6.65
softmax → P("on")=0.41 P("the")=0.25 P("mat")=0.19 P("in")=0.15
→ more diverse, might pick "mat" or "in"
Notice how temperature flattens or sharpens the distribution:
"on" "the" "mat" "in"
T = 0.5 → 0.83 0.11 0.04 0.02 ← concentrated
T = 1.0 → 0.58 0.21 0.13 0.08 ← balanced
T = 2.0 → 0.41 0.25 0.19 0.15 ← flattened
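The same dry run as code:

```python
# Temperature scaling: divide the logits by T, then apply softmax.
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]            # "on", "the", "mat", "in"
for T in (0.5, 1.0, 2.0):
    probs = softmax([l / T for l in logits])
    print(T, [round(p, 2) for p in probs])
# 0.5 [0.83, 0.11, 0.04, 0.02]
# 1.0 [0.58, 0.21, 0.13, 0.08]
# 2.0 [0.41, 0.25, 0.19, 0.15]
```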
Top-k Sampling#
Only consider the k most probable tokens, then renormalize:
Original: P("on")=0.58, P("the")=0.21, P("mat")=0.13, P("in")=0.08
Top-k=2: keep only top 2, zero out the rest
P("on")=0.58, P("the")=0.21 → renormalize (divide by 0.58+0.21=0.79):
P("on")=0.73, P("the")=0.27
→ sample from these 2 only
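A sketch of the same procedure in Python:

```python
# Top-k sampling: keep the k most probable tokens, renormalize, sample.
import random

def top_k_sample(tokens, probs, k):
    ranked = sorted(zip(tokens, probs), key=lambda tp: tp[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    kept_tokens  = [t for t, _ in ranked]
    kept_weights = [p / total for _, p in ranked]    # renormalized
    return random.choices(kept_tokens, weights=kept_weights)[0]

tokens = ["on", "the", "mat", "in"]
probs  = [0.58, 0.21, 0.13, 0.08]
print(top_k_sample(tokens, probs, k=2))   # "on" (~73%) or "the" (~27%)
```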
Problem: top-k is fixed. If the model is very confident (top token is 0.95), k=50 still considers 50 tokens. If uncertain, k=2 might cut off good options.
Top-p (Nucleus Sampling)#
Keep the smallest set of tokens whose cumulative probability reaches p:
Sorted: P("on")=0.58, P("the")=0.21, P("mat")=0.13, P("in")=0.08
Top-p=0.7:
"on" → cumulative: 0.58 (< 0.7, keep)
"the" → cumulative: 0.79 (≥ 0.7, keep this last one, stop)
→ sample from {"on", "the"} (2 tokens)
Top-p=0.95:
"on" → cumulative: 0.58 (< 0.95, keep)
"the" → cumulative: 0.79 (< 0.95, keep)
"mat" → cumulative: 0.92 (< 0.95, keep)
"in" → cumulative: 1.00 (≥ 0.95, keep this last one, stop)
→ sample from all 4 tokens
Top-p adapts — when the model is confident, fewer tokens pass the threshold; when uncertain, more do.
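And the corresponding top-p sketch, using the same toy distribution:

```python
# Top-p (nucleus) sampling: keep the smallest set whose cumulative
# probability reaches p, renormalize, sample.
import random

def top_p_sample(tokens, probs, p):
    ranked = sorted(zip(tokens, probs), key=lambda tp: tp[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:              # this token crossed the threshold; stop
            break
    total = sum(prob for _, prob in kept)
    return random.choices([t for t, _ in kept],
                          weights=[prob / total for _, prob in kept])[0]

tokens = ["on", "the", "mat", "in"]
probs  = [0.58, 0.21, 0.13, 0.08]
print(top_p_sample(tokens, probs, p=0.7))    # sampled from {"on", "the"}
print(top_p_sample(tokens, probs, p=0.95))   # sampled from all 4 tokens
```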
Practical Advice#
| Use Case | Recommended Settings |
|---|---|
| Code generation | T=0.0 (deterministic) or T=0.2, top-p=0.1 |
| Factual Q&A | T=0.3, top-p=0.3 |
| Creative writing | T=0.8-1.0, top-p=0.9 |
| Brainstorming | T=1.0-1.2, top-p=0.95 |
General rule: adjust temperature or top-p, not both. They do similar things and combining them can produce unpredictable results.
📚 References:
- LLM Settings — Temperature, Top-p by Prompt Engineering Guide
- Decoding Strategies in LLMs by Maxime Labonne — visual guide with code
- How to generate text by Hugging Face — comprehensive sampling guide
6. Putting It All Together#
Let’s trace "The cat sat" through the entire pipeline to predict the next token:
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: TOKENIZE │
│ "The cat sat" → ["The", "cat", "sat"] → IDs: [464, 2368, 3520]│
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: EMBED │
│ E[464] = [0.12, -0.50, 0.33, ...] (d=4096 in real models) │
│ E[2368] = [0.78, 0.30, -0.12, ...] │
│ E[3520] = [0.45, -0.10, 0.67, ...] │
│ │
│ + Positional Encoding (RoPE, Rotary Position Embedding) │
│ → position-aware embeddings │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: TRANSFORMER × N layers │
│ │
│ Layer 1: │
│ Attention: "sat" attends to "cat" (subject) and "The" │
│ FFN (Feed-Forward Network): transforms each token's repr. │
│ │
│ Layer 2-N: │
│ Builds increasingly abstract representations │
│ Early layers: syntax Middle: semantics Late: task-level │
│ │
│ Final representation for last token "sat": │
│ h = [1.34, -0.89, 2.11, 0.56, ...] │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 4: PROJECT TO VOCABULARY │
│ logits = h × W_vocab (4096 → 32000 vocabulary logits) │
│ │
│ logits: ["on"=2.0, "the"=1.0, "mat"=0.5, "in"=0.0, ...] │
└─────────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 5: SAMPLE │
│ softmax → P("on")=0.58, P("the")=0.21, P("mat")=0.13, ... │
│ with T=1.0, top-p=0.9 → sample → "on" │
│ │
│ Append "on" to sequence → "The cat sat on" │
│ Repeat from Step 1 with new sequence... │
└─────────────────────────────────────────────────────────────────┘
This loop continues until:
- A special <EOS> (End of Sequence) token is generated
- A maximum length limit is reached
- The caller stops the generation
That’s it. Every LLM — from GPT-4 to Llama to Claude — follows this same fundamental pipeline.
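The whole loop in code, with a toy tokenizer and a toy model standing in for Steps 1-4 (ToyTokenizer and toy_model are made-up stand-ins for this sketch, not a real API):

```python
# End-to-end generation loop: tokenize, predict, sample, append, repeat.
import math
import random

VOCAB = ["<EOS>", "The", "cat", "sat", "on", "the", "mat"]

class ToyTokenizer:
    eos_id = 0
    def encode(self, text):
        return [VOCAB.index(w) for w in text.split()]
    def decode(self, ids):
        return " ".join(VOCAB[i] for i in ids)

def toy_model(ids):
    # Stand-in for Steps 2-4: returns one logit per vocabulary entry.
    # Here it simply prefers the "next" word in VOCAB, then <EOS>.
    logits = [0.0] * len(VOCAB)
    nxt = ids[-1] + 1
    logits[nxt if nxt < len(VOCAB) else 0] = 5.0
    return logits

def generate(prompt, tokenizer, model, max_new_tokens=10, temperature=1.0):
    ids = tokenizer.encode(prompt)                     # Step 1: tokenize
    for _ in range(max_new_tokens):
        logits = model(ids)                            # Steps 2-4: embed, transform, project
        scaled = [l / temperature for l in logits]     # Step 5: temperature + softmax
        m = max(scaled)
        exps = [math.exp(l - m) for l in scaled]
        probs = [e / sum(exps) for e in exps]
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        if next_id == tokenizer.eos_id:                # stop on <EOS>
            break
        ids.append(next_id)                            # append and repeat
    return tokenizer.decode(ids)

print(generate("The cat sat", ToyTokenizer(), toy_model))   # usually "The cat sat on the mat"
```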
What’s Next#
This post covered how an LLM works internally — the architecture. But building a useful LLM involves many more stages. Here’s the full picture:
┌──────────────────────┐
│ 1. Architecture │ ← this post (fundamentals)
│ (Transformer) │
└──────────┬───────────┘
│ architecture is what gets trained
▼
┌──────────────────────┐
│ 2. Pre-Training │ ← train on massive data (next-token prediction)
│ │ adjusts all the weights we discussed:
│ │ embedding matrix, W_Q/W_K/W_V, FFN weights
└──────────┬───────────┘
│ produces a base model (can complete text, but not chat)
▼
┌──────────────────────────────────────────────────────────┐
│ 3-5. Post-Training (treats model mostly as a black box) │
│ │
│ 3. Post-Training Datasets — curate instruction data │
│ 4. SFT (Supervised Fine-Tuning) — teach it to chat │
│ 5. Preference Alignment (RLHF/DPO) — make it safe/helpful│
└──────────┬───────────────────────────────────────────────┘
│ produces a chat model (e.g., ChatGPT, Claude)
▼
┌──────────────────────────────────────────────────────────┐
│ 6-8. Ship & Improve (also treats model as a black box) │
│ │
│ 6. Evaluation — measure quality (benchmarks, human) │
│ 7. Quantization — compress for deployment (INT8/INT4) │
│ 8. New Trends — merging, multimodal, reasoning │
└──────────────────────────────────────────────────────────┘
Steps 1 → 2 are tightly coupled — pre-training directly adjusts the weights inside the architecture we covered (embeddings, attention, FFN). But steps 3-8 largely treat the model as a black box: feed it data, train, evaluate, compress. You don’t need to understand self-attention to do SFT or run benchmarks.
That said, the fundamentals help you understand why things work: why certain prompts are more effective (attention patterns), why models hallucinate (FFN knowledge retrieval limits), why quantization can degrade quality (weight precision matters for subtle knowledge stored in FFN).
References#
- Attention Is All You Need — Vaswani et al., 2017 (the original Transformer paper)
- The Illustrated Transformer — Jay Alammar
- Visual intro to Transformers — 3Blue1Brown
- LLM Visualization — Brendan Bycroft (interactive 3D)
- nanoGPT — Andrej Karpathy (build GPT from scratch)
- Let’s build the GPT Tokenizer — Andrej Karpathy
- Decoding Strategies in LLMs — Maxime Labonne
- Transformer Circuits Thread — Anthropic
- Understanding Encoder and Decoder LLMs — Sebastian Raschka