Large Language Models (LLMs) have revolutionized how we interact with AI. But how do they actually work? Stanford's CME 295: Transformers and Large Language Models course, taught by twin brothers Afshine and Shervine Amidi, provides one of the most accessible yet rigorous introductions to this technology.

These are my notes from Lecture 1, summarizing the key concepts from NLP basics to the Transformer architecture that powers models like GPT, BERT, and Claude.

Who Are the Instructors?

Afshine and Shervine Amidi are twin brothers with impressive ML backgrounds:

  • Both studied at Centrale Paris (France)
  • Afshine went to MIT, Shervine to Stanford (ICME Master's)
  • Industry experience: Uber → Google → Netflix
  • Currently working on LLMs at Netflix
  • Creators of the popular VIP Cheat Sheets on GitHub

Their practical industry experience combined with academic rigor makes this course particularly valuable for developers looking to understand LLMs beyond surface-level explanations.


🎯 Understanding NLP: The Foundation

Natural Language Processing (NLP) is the field of computing with text. At a high level, NLP tasks can be classified into three buckets:

1. Classification Tasks

You have an input text and want to predict a single label.

| Task | Description | Example |
|---|---|---|
| Sentiment Analysis | Determine if text is positive, negative, or neutral | Movie review → "Positive" |
| Intent Detection | Identify what a user wants to do | "Set alarm for 7am" → "create_alarm" |
| Language Detection | Identify the language of text | "Bonjour" → French |
| Topic Modeling | Categorize text by topic | News article → "Sports" |

2. Multi-Classification Tasks

You have an input text and want to predict multiple labels, typically one per token.

| Task | Description | Example |
|---|---|---|
| Named Entity Recognition (NER) | Label specific words with categories | "Paris is beautiful" → Paris: LOCATION |
| Part-of-Speech Tagging | Label grammatical function | "The cat sat" → DET, NOUN, VERB |
| Dependency Parsing | Identify grammatical relationships | Subject-verb-object structures |

3. Generation Tasks

Text in, text out: the output length is variable.

| Task | Description | Example |
|---|---|---|
| Machine Translation | Convert text between languages | English → French |
| Question Answering | Generate answers to questions | ChatGPT, Claude |
| Summarization | Condense long text | Article → 2-sentence summary |
| Code Generation | Generate code from descriptions | "Sort a list" → Python code |

📊 Evaluation Metrics: How Do We Measure Success?

Different tasks require different metrics. Here's how we evaluate NLP models:

Classification Metrics

Accuracy = (Correct Predictions) / (Total Predictions)

Precision = (True Positives) / (True Positives + False Positives)
           "Of all positive predictions, how many were correct?"

Recall = (True Positives) / (True Positives + False Negatives)
        "Of all actual positives, how many did we catch?"

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
          "Harmonic mean of precision and recall"

Why multiple metrics? Consider a dataset with 99% positive labels. A model that always predicts "positive" would have 99% accuracy but be useless. Precision and recall, computed on the minority class, reveal this flaw.
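
To make the imbalance argument concrete, here is a minimal, self-contained Python sketch (not from the lecture) that scores a degenerate classifier; the metrics are computed with the rare class treated as the positive class, which is exactly what exposes the flaw that accuracy hides:

# Toy 99/1 imbalanced dataset and a classifier that always predicts the majority class.
labels      = [0] * 99 + [1]     # 99 majority examples, 1 rare example
predictions = [0] * 100          # model always predicts the majority class

tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
fp = sum(y == 0 and p == 1 for y, p in zip(labels, predictions))
fn = sum(y == 1 and p == 0 for y, p in zip(labels, predictions))

accuracy  = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
precision = tp / (tp + fp) if tp + fp else 0.0   # no positive predictions at all
recall    = tp / (tp + fn) if tp + fn else 0.0
f1        = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)

print(accuracy, precision, recall, f1)   # 0.99, 0.0, 0.0, 0.0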

Generation Metrics

| Metric | Description | Direction |
|---|---|---|
| BLEU | Bilingual Evaluation Understudy: measures n-gram overlap with a reference | Higher = Better |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation: a suite of recall-oriented metrics for summarization | Higher = Better |
| Perplexity | How "surprised" the model is by the text it evaluates | Lower = Better |

The problem with BLEU and ROUGE: they require reference texts (labels), which are expensive to create. Modern approaches use reference-free metrics powered by LLMs themselves.


✂️ Tokenization: Breaking Text into Pieces

Before a model can process text, we need to convert it into discrete units called tokens. There are three main approaches:

Word-Level Tokenization

Split text by words:

"A cute teddy bear" → ["A", "cute", "teddy", "bear"]

Pros:

  • Simple and intuitive
  • Each token has clear meaning

Cons:

  • Large vocabulary size
  • "bear" and "bears" are completely different tokens
  • High Out-of-Vocabulary (OOV) risk

Subword-Level Tokenization

Split using common subword units:

"bears" → ["bear", "s"]
"running" → ["run", "ning"]

Pros:

  • Leverages word roots
  • Lower OOV risk
  • Balances vocabulary size

Cons:

  • Longer sequences than word-level
  • Requires training a tokenizer (BPE, WordPiece, etc.)

Character-Level Tokenization

Split into individual characters:

"cute" → ["c", "u", "t", "e"]

Pros:

  • Robust to misspellings
  • Very small vocabulary
  • Zero OOV risk

Cons:

  • Very long sequences
  • Hard to capture meaning at character level
  • Slow computation

Comparison Table

| Approach | Vocabulary Size | Sequence Length | OOV Risk | Use Case |
|---|---|---|---|---|
| Word | Large (100K+) | Short | High | Simple tasks |
| Subword | Medium (30-50K) | Medium | Low | Most LLMs |
| Character | Tiny (~100) | Very Long | None | Spelling tasks |

Modern LLMs typically use subword tokenization (like BPE or SentencePiece) with vocabulary sizes of 30,000-100,000 tokens.
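
To make the subword idea concrete, here is a toy byte-pair-encoding-style sketch in Python (illustrative only, not the exact algorithm behind any production tokenizer): start from characters and repeatedly merge the most frequent adjacent symbol pair.

from collections import Counter

def pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with the merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a sequence of characters.
words = {tuple("bear"): 5, tuple("bears"): 3, tuple("beard"): 1}
for step in range(4):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
# After a few merges, "bear" becomes a single unit and "bears" splits as ["bear", "s"].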


🔢 Word Representation: From Text to Numbers

Models understand numbers, not text. We need to represent tokens numerically.

One-Hot Encoding: The Naive Approach

Assign each token a unique vector with a single 1:

Vocabulary: [soft, teddy_bear, book]

soft       = [1, 0, 0]
teddy_bear = [0, 1, 0]
book       = [0, 0, 1]

The Problem: All vectors are orthogonal. "soft" and "teddy_bear" have zero similarity, even though teddy bears are soft!

Cosine Similarity

We measure similarity between vectors using cosine similarity:

cos(A, B) = (A · B) / (||A|| × ||B||)
  • Similarity = 1: Vectors point in same direction
  • Similarity = 0: Vectors are orthogonal (independent)
  • Similarity = -1: Vectors point in opposite directions

What we want:

  • Similar concepts β†’ High similarity (teddy_bear ↔ soft)
  • Unrelated concepts β†’ Low similarity (teddy_bear ↔ book)

What one-hot gives us:

  • Everything has 0 similarity with everything else

This is why we need learned embeddings.
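
A small numpy sketch makes the contrast explicit; the dense vectors below are made-up illustrative values, not trained embeddings:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words has similarity 0.
soft, teddy_bear, book = np.eye(3)
print(cosine(soft, teddy_bear))   # 0.0
print(cosine(teddy_bear, book))   # 0.0

# Made-up dense embeddings: similarity now reflects meaning rather than identity.
soft_e       = np.array([0.9, 0.1])
teddy_bear_e = np.array([0.8, 0.3])
book_e       = np.array([-0.2, 0.9])
print(cosine(soft_e, teddy_bear_e))   # high (~0.97)
print(cosine(teddy_bear_e, book_e))   # low  (~0.14)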


🧠 Word2vec: Learning Meaningful Embeddings

Word2vec (2013) was a breakthrough in learning word representations. The key insight: use a proxy task to learn embeddings.

The Proxy Task Concept

Instead of directly defining what makes a good embedding, we:

  1. Define a task that requires understanding language
  2. Train a model on that task
  3. Extract the learned representations

If a model can predict surrounding words, it must have learned something meaningful about language.

Two Approaches

Continuous Bag of Words (CBOW):

  • Input: Context words (surrounding words)
  • Output: Target word (center word)
  • "Predict the word from its context"

Skip-gram:

  • Input: Target word (center word)
  • Output: Context words (surrounding words)
  • "Predict the context from the word"

Training Walkthrough

Let's trace through a simple example:

Sentence: "A cute teddy bear is reading"
Task: Predict next word

Step 1: Take word "A", represent as one-hot vector

Input: [1, 0, 0, 0, 0, 0]  (vocabulary size = 6)

Step 2: Multiply by weight matrix W₁ (size: V × d)

Hidden layer h = W₁ᵀ × input
Result: h = [0.2, 0.9]  (dimension d = 2)

Step 3: Multiply by weight matrix W₂ (size: d × V)

Output = softmax(W₂ᵀ × h)
Result: [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]

Step 4: Compare with true next word "cute" = [0, 1, 0, 0, 0, 0]

Step 5: Compute loss (cross-entropy) and backpropagate

Step 6: Repeat for all words in corpus until convergence
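
The walkthrough above maps directly onto a few lines of numpy. Below is a minimal sketch (not from the lecture) of one training step; the vocabulary indices for "A" and "cute" are assumptions for illustration, and the shapes mirror Steps 1-5:

import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 2                                 # vocabulary size and embedding dimension
W1 = rng.normal(scale=0.1, size=(V, d))     # input embedding matrix (V x d)
W2 = rng.normal(scale=0.1, size=(d, V))     # output projection (d x V)

x = np.zeros(V); x[0] = 1.0                 # one-hot for "A" (index 0, assumed)
y = np.zeros(V); y[1] = 1.0                 # one-hot for "cute" (index 1, assumed)

# Forward pass (Steps 1-4)
h = W1.T @ x                                # hidden layer = row 0 of W1
logits = W2.T @ h
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
loss = -np.log(probs[1])                    # cross-entropy against "cute"

# Backward pass and one SGD update (Step 5)
dlogits = probs - y                         # gradient of softmax + cross-entropy
dW2 = np.outer(h, dlogits)
dW1 = np.outer(x, W2 @ dlogits)
lr = 0.1
W1 -= lr * dW1; W2 -= lr * dW2

print(loss, W1[0])                          # W1[0] is the (updated) embedding of "A"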

The Magic: Hidden Layer IS the Embedding

After training, the weight matrix W₁ contains our embeddings:

To get embedding for word i:
embedding(word_i) = W₁[i, :]  (row i of W₁)

The resulting embeddings capture semantic relationships:

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin

Typical Dimensions

  • Vocabulary size (V): 10,000 - 100,000+
  • Embedding dimension (d): 100 - 768+

🔄 Sequence Models: RNN and LSTM

Word2vec gives us token embeddings, but they're context-independent. The word "bank" has the same embedding whether it means "river bank" or "money bank".

Recurrent Neural Networks (RNN)

RNNs process sequences one token at a time, maintaining a hidden state that captures the sequence so far.

For each time step t:
    h_t = f(h_{t-1}, x_t)
    y_t = g(h_t)

Architecture:

Input:  x₁ → x₂ → x₃ → x₄ → x₅
         ↓    ↓    ↓    ↓    ↓
Hidden: h₁ → h₂ → h₃ → h₄ → h₅
         ↓    ↓    ↓    ↓    ↓
Output: y₁   y₂   y₃   y₄   y₅
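
In code, this recurrence is just a loop that threads the hidden state through time. A minimal numpy sketch, assuming a vanilla tanh cell for f and a linear readout for g (the lecture leaves both abstract):

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    # Vanilla (Elman) RNN: one tanh cell unrolled over the sequence.
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                                # sequential: step t needs h from step t-1
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h_t = f(h_{t-1}, x_t)
        ys.append(W_hy @ h + b_y)               # y_t = g(h_t)
    return ys, h

# Toy shapes: 5 input vectors of dimension 4, hidden size 3, output size 2.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]
ys, h_last = rnn_forward(
    xs,
    W_xh=rng.normal(size=(3, 4)), W_hh=rng.normal(size=(3, 3)),
    W_hy=rng.normal(size=(2, 3)), b_h=np.zeros(3), b_y=np.zeros(2),
)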

Pros:

  • Word order matters
  • Context-aware representations
  • Can handle variable-length sequences

Cons:

  • Sequential processing = slow (can't parallelize)
  • Vanishing gradient problem

The Vanishing Gradient Problem

When backpropagating through many time steps:

∂L/∂h₁ = ∂L/∂h_T × ∂h_T/∂h_{T-1} × ... × ∂h₂/∂h₁

If each gradient term is < 1, the product becomes tiny. The model "forgets" early tokens.

If each term is > 1, gradients explode. Training becomes unstable.
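
A quick back-of-the-envelope calculation shows how fast this goes wrong over 100 time steps:

# The product of many per-step gradient factors either vanishes or explodes.
shrinking = 0.9 ** 100    # ~2.7e-5: the signal from early tokens is essentially gone
growing   = 1.1 ** 100    # ~1.4e+4: gradients blow up and training destabilizes
print(shrinking, growing)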

LSTM: Long Short-Term Memory

LSTMs add a cell state to help remember important information:

Components:
- h_t: Hidden state (short-term)
- c_t: Cell state (long-term)
- Gates: Forget, Input, Output (control information flow)

LSTMs mitigate vanishing gradients but don't eliminate them. Long-range dependencies (1000+ tokens) remain challenging.

Why RNNs Fell Out of Favor

  1. Sequential processing: Can't parallelize across time steps
  2. Long-range dependencies: Still struggle with very long contexts
  3. Slow training: Each token depends on all previous tokens

This led to the development of attention mechanisms.


👁️ The Attention Mechanism: Direct Connections

The core idea: instead of passing information through a chain of hidden states, create direct connections between any two positions.

Motivation

Consider translating: "A cute teddy bear is reading" → French

When generating "ours" (bear), we want to look directly at "teddy bear" in the input, not hope the information survived through the RNN chain.

How Attention Works

Attention creates a weighted combination of all input positions:

For each output position:
1. Compare (query) with all input positions (keys)
2. Get similarity scores
3. Weight the input values by these scores
4. Sum to get context vector

Query, Key, Value (Q, K, V)

This terminology comes from database lookups:

  • Query (Q): What you're looking for
  • Key (K): Index for each item
  • Value (V): Actual content of each item
Example: Looking up "teddy bear" in a dictionary

Query: "teddy bear"
Keys: ["a", "cute", "teddy bear", "is", "reading"]
Values: [emb_a, emb_cute, emb_teddy, emb_is, emb_reading]

1. Compare query with each key → similarity scores
2. Highest similarity: "teddy bear" key
3. Return corresponding value: emb_teddy

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
  • QKᵀ: Dot product of queries and keys (similarity scores)
  • √d_k: Scaling factor (prevents large dot products)
  • softmax: Converts to probability distribution
  • × V: Weighted sum of values
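
As a concrete reference, here is a minimal numpy implementation of this formula (a sketch, not tied to any particular library):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax over the keys
    return weights @ V                                 # weighted sum of values

# Toy self-attention: Q, K, V all come from the same 5-token sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(X, X, X)            # shape (5, 4)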

🤖 The Transformer Architecture

The Transformer (2017, "Attention Is All You Need") revolutionized NLP by using only attention, with no recurrence.

Key Innovation: Self-Attention

Instead of attention between encoder and decoder, apply attention within a sequence:

Input: "A cute teddy bear is reading"

For the word "teddy bear":
- Query: teddy bear's representation
- Keys: all words' representations
- Values: all words' representations

Result: teddy bear's representation enriched with context
        (e.g., "cute" contributes because teddy bears are cute)

This is self-attention: the sequence attends to itself.

Architecture Overview

┌─────────────────┐      ┌─────────────────┐
│     ENCODER     │      │     DECODER     │
├─────────────────┤      ├─────────────────┤
│                 │      │    Linear +     │
│ Multi-Head      │      │    Softmax      │
│ Self-Attention  │      ├─────────────────┤
│       ↓         │      │  Feed Forward   │
│ Feed Forward    │      ├─────────────────┤
│                 │ ──►  │ Cross-Attention │ ◄── Q from decoder
├─────────────────┤ K,V  │                 │     K,V from encoder
│ Input Embedding │      │ Masked Self-    │
│ + Positional    │      │ Attention       │
│   Encoding      │      ├─────────────────┤
└─────────────────┘      │ Output Embedding│
        ↑                │ + Positional    │
   Source Text           │   Encoding      │
   (English)             └─────────────────┘
                                 ↑
                            Target Text
                            (French)

Encoder

The encoder processes the input sequence:

  1. Input Embedding: Convert tokens to vectors
  2. Positional Encoding: Add position information (sine/cosine functions)
  3. Multi-Head Self-Attention: Each token attends to all tokens
  4. Feed-Forward Network: Additional transformation
  5. Repeat N times: Stack N encoder layers

Output: Context-aware representations for all input tokens.
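
Step 2's positional encoding is worth making concrete. A numpy sketch of the standard sinusoidal formulation from the original paper, which interleaves sines and cosines at geometrically spaced frequencies:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=8, d_model=16)   # added elementwise to the token embeddings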

Decoder

The decoder generates the output sequence:

  1. Output Embedding + Positional Encoding
  2. Masked Self-Attention: Each token attends only to previous tokens (causal)
  3. Cross-Attention: Attend to encoder outputs
    • Query: from decoder
    • Key, Value: from encoder
  4. Feed-Forward Network
  5. Repeat N times
  6. Linear + Softmax: Predict next token probability

Multi-Head Attention

Instead of one attention operation, run h parallel attention heads:

MultiHead(Q, K, V) = Concat(head₁, head₂, ..., head_h) × W_O

where head_i = Attention(Q × W_Q^i, K × W_K^i, V × W_V^i)

Why multiple heads?

  • Different heads can learn different relationship types
  • One head might focus on syntax, another on semantics
  • Similar to multiple filters in CNNs

Typical values: h = 8 or 12 heads.
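
Putting the pieces together, here is a rough sketch of multi-head self-attention. The projection matrices are randomly initialized here purely for illustration (in a real model W_Q^i, W_K^i, W_V^i, and W_O are learned), and the per-head dimension is an assumption:

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (same formula as earlier).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, heads, d_k, rng):
    # Run `heads` independent attention heads and project the concatenation back to d_model.
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        head_outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.normal(size=(heads * d_k, d_model))
    return np.concatenate(head_outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                  # 5 tokens, d_model = 16
out = multi_head_self_attention(X, heads=8, d_k=2, rng=rng)   # shape (5, 16)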

Why Self-Attention Beats RNNs

| Aspect | RNN | Transformer |
|---|---|---|
| Long-range dependencies | Difficult (vanishing gradients) | Direct connections |
| Parallelization | Sequential (slow) | Fully parallel |
| Computation per layer | O(n × d²) | O(n² × d) |
| Maximum path length | O(n) | O(1) |

For sequences up to ~2000 tokens, Transformers are faster and more effective.

Label Smoothing

A training trick used in the original paper:

Instead of hard labels [1, 0, 0, 0]:

[0.9, 0.033, 0.033, 0.033]

Why? In language, there's often more than one correct next word. "What a great ___" could be "day", "idea", "book", etc. Label smoothing prevents overconfidence.
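
A minimal sketch of one common label-smoothing formulation, chosen here because it reproduces the numbers above (epsilon = 0.1 spread uniformly over the other classes):

import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # Keep (1 - epsilon) on the true class and spread epsilon over the other V - 1 classes.
    V = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (V - 1)

print(smooth_labels(np.array([1.0, 0.0, 0.0, 0.0])))
# [0.9, 0.0333, 0.0333, 0.0333]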


🔬 End-to-End Walkthrough

Let's trace a complete forward pass for translation:

Task: Translate "A cute teddy bear is reading" to French

Step 1: Tokenization

Input tokens: [BOS, "A", "cute", "teddy", "bear", "is", "reading", EOS]

Step 2: Embedding + Positional Encoding

For each token position i:

x_i = TokenEmbedding(token_i) + PositionalEncoding(i)

Result: Matrix X of shape (sequence_length × d_model)

Step 3: Encoder Processing

Self-Attention:

Q = X × W_Q    (queries)
K = X × W_K    (keys)
V = X × W_V    (values)

Attention_output = softmax(QKᵀ / √d_k) × V

This is done h times (multi-head), concatenated, and projected:

MultiHead_output = Concat(head₁, ..., head_h) × W_O

Feed-Forward:

FFN(x) = ReLU(x × W₁ + b₁) × W₂ + b₂

Repeat N times.

Result: Encoded representations for all input tokens.

Step 4: Decoder Generation

Start with BOS token:

Decoder input: [BOS]

Masked Self-Attention: (only looks at previous tokens)

For position 1: attend only to BOS

Cross-Attention:

Q = decoder hidden state
K, V = encoder outputs

β†’ "What parts of the English sentence should I look at?"

Feed-Forward + Linear + Softmax:

→ Probability distribution over French vocabulary
→ Select: "Un" (or sample from distribution)

Step 5: Autoregressive Generation

Decoder input: [BOS, "Un"]
→ Predict: "ours"

Decoder input: [BOS, "Un", "ours"]
→ Predict: "en"

... continue until EOS is predicted

Final output: "Un ours en peluche mignon lit"
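
The generation loop above can be summarized as a short greedy-decoding sketch. The `model` callable and its interface here are hypothetical placeholders standing in for the decoder stack, not a real library API:

def greedy_decode(model, encoder_output, bos_id, eos_id, max_len=50):
    # Hypothetical interface: `model` returns next-token probabilities given the
    # encoder output and the tokens generated so far.
    tokens = [bos_id]
    for _ in range(max_len):
        probs = model(encoder_output, tokens)                     # distribution over target vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)   # greedy pick (could also sample)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens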


🎯 Key Takeaways

  1. NLP tasks fall into classification, multi-classification, and generation categories

  2. Tokenization converts text to discrete units; subword is the modern standard

  3. Embeddings are learned representations that capture semantic meaning

  4. Word2vec showed that proxy tasks (predicting context) yield meaningful embeddings

  5. RNNs process sequences but suffer from vanishing gradients

  6. Attention creates direct connections, solving long-range dependency issues

  7. Transformers use self-attention exclusively, enabling parallelization and better performance

  8. The Encoder creates context-aware input representations

  9. The Decoder generates output autoregressively, using masked self-attention and cross-attention

  10. Multi-head attention allows learning multiple types of relationships


📚 References


These notes are based on Stanford's CME 295 Lecture 1, taught by Afshine and Shervine Amidi. The course continues with deeper dives into training, fine-tuning, and applications of LLMs.