Large Language Models (LLMs) have revolutionized how we interact with AI. But how do they actually work? Stanford's CME 295: Transformers and Large Language Models course, taught by twin brothers Afshine and Shervine Amidi, provides one of the most accessible yet rigorous introductions to this technology.

These are my notes from Lecture 1, summarizing the key concepts from NLP basics to the Transformer architecture that powers models like GPT, BERT, and Claude.

Who Are the Instructors?

Afshine and Shervine Amidi are twin brothers with impressive ML backgrounds:

  • Both studied at Centrale Paris (France)
  • Afshine went to MIT, Shervine to Stanford (ICME Master's)
  • Industry experience: Uber → Google → Netflix
  • Currently working on LLMs at Netflix
  • Creators of the popular VIP Cheat Sheets on GitHub

Their practical industry experience combined with academic rigor makes this course particularly valuable for developers looking to understand LLMs beyond surface-level explanations.


🎯 Understanding NLP: The Foundation

Natural Language Processing (NLP) is the field of computing with text. At a high level, NLP tasks can be classified into three buckets:

1. Classification Tasks

You have an input text and want to predict a single label.

| Task | Description | Example |
|---|---|---|
| Sentiment Analysis | Determine if text is positive, negative, or neutral | Movie review → "Positive" |
| Intent Detection | Identify what a user wants to do | "Set alarm for 7am" → "create_alarm" |
| Language Detection | Identify the language of text | "Bonjour" → French |
| Topic Modeling | Categorize text by topic | News article → "Sports" |

2. Multi-Classification Tasks

You have an input text and want to predict multiple labels, typically one per token.

| Task | Description | Example |
|---|---|---|
| Named Entity Recognition (NER) | Label specific words with categories | "Paris is beautiful" → Paris: LOCATION |
| Part-of-Speech Tagging | Label grammatical function | "The cat sat" → DET, NOUN, VERB |
| Dependency Parsing | Identify grammatical relationships | Subject-verb-object structures |

3. Generation Tasks

Text in, text out: the output length is variable.

| Task | Description | Example |
|---|---|---|
| Machine Translation | Convert text between languages | English → French |
| Question Answering | Generate answers to questions | ChatGPT, Claude |
| Summarization | Condense long text | Article → 2-sentence summary |
| Code Generation | Generate code from descriptions | "Sort a list" → Python code |

📊 Evaluation Metrics: How Do We Measure Success?

Different tasks require different metrics. Here's how we evaluate NLP models:

Classification Metrics

Accuracy = (Correct Predictions) / (Total Predictions)

Precision = (True Positives) / (True Positives + False Positives)
           "Of all positive predictions, how many were correct?"

Recall = (True Positives) / (True Positives + False Negatives)
        "Of all actual positives, how many did we catch?"

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
          "Harmonic mean of precision and recall"

Why multiple metrics? Consider a dataset with 99% positive labels. A model that always predicts "positive" would have 99% accuracy but be useless. Precision and recall, computed on the minority class, reveal this flaw.
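
To make the imbalance argument concrete, here is a minimal, self-contained Python sketch (not from the lecture) that scores a degenerate classifier; the metrics are computed with the rare class treated as the positive class, which is exactly what exposes the flaw that accuracy hides:

# Toy 99/1 imbalanced dataset and a classifier that always predicts the majority class.
labels      = [0] * 99 + [1]     # 99 majority examples, 1 rare example
predictions = [0] * 100          # model always predicts the majority class

tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
fp = sum(y == 0 and p == 1 for y, p in zip(labels, predictions))
fn = sum(y == 1 and p == 0 for y, p in zip(labels, predictions))

accuracy  = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
precision = tp / (tp + fp) if tp + fp else 0.0   # no positive predictions at all
recall    = tp / (tp + fn) if tp + fn else 0.0
f1        = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)

print(accuracy, precision, recall, f1)   # 0.99, 0.0, 0.0, 0.0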

Generation Metrics

| Metric | Description | Direction |
|---|---|---|
| BLEU | Bilingual Evaluation Understudy: measures n-gram overlap with a reference | Higher = Better |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation: a suite of recall-oriented metrics for summarization | Higher = Better |
| Perplexity | How "surprised" the model is by the text it evaluates | Lower = Better |

The problem with BLEU and ROUGE: they require reference texts (labels), which are expensive to create. Modern approaches use reference-free metrics powered by LLMs themselves.


✂️ Tokenization: Breaking Text into Pieces

Before a model can process text, we need to convert it into discrete units called tokens. There are three main approaches:

Word-Level Tokenization

Split text by words:

"A cute teddy bear" → ["A", "cute", "teddy", "bear"]

Pros:

  • Simple and intuitive
  • Each token has clear meaning

Cons:

  • Large vocabulary size
  • "bear" and "bears" are completely different tokens
  • High Out-of-Vocabulary (OOV) risk

Subword-Level Tokenization

Split using common subword units:

"bears" → ["bear", "s"]
"running" → ["run", "ning"]

Pros:

  • Leverages word roots
  • Lower OOV risk
  • Balances vocabulary size

Cons:

  • Longer sequences than word-level
  • Requires training a tokenizer (BPE, WordPiece, etc.)

Character-Level Tokenization

Split into individual characters:

"cute" → ["c", "u", "t", "e"]

Pros:

  • Robust to misspellings
  • Very small vocabulary
  • Zero OOV risk

Cons:

  • Very long sequences
  • Hard to capture meaning at character level
  • Slow computation

Comparison Table

| Approach | Vocabulary Size | Sequence Length | OOV Risk | Use Case |
|---|---|---|---|---|
| Word | Large (100K+) | Short | High | Simple tasks |
| Subword | Medium (30-50K) | Medium | Low | Most LLMs |
| Character | Tiny (~100) | Very Long | None | Spelling tasks |

Modern LLMs typically use subword tokenization (like BPE or SentencePiece) with vocabulary sizes of 30,000-100,000 tokens.
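
To make the subword idea concrete, here is a toy byte-pair-encoding-style sketch in Python (illustrative only, not the exact algorithm behind any production tokenizer): start from characters and repeatedly merge the most frequent adjacent symbol pair.

from collections import Counter

def pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with the merged symbol.
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a sequence of characters.
words = {tuple("bear"): 5, tuple("bears"): 3, tuple("beard"): 1}
for step in range(4):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
# After a few merges, "bear" becomes a single unit and "bears" splits as ["bear", "s"].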


🔢 Word Representation: From Text to Numbers

Models understand numbers, not text. We need to represent tokens numerically.

One-Hot Encoding: The Naive Approach

Assign each token a unique vector with a single 1:

Vocabulary: [soft, teddy_bear, book]

soft       = [1, 0, 0]
teddy_bear = [0, 1, 0]
book       = [0, 0, 1]

The Problem: All vectors are orthogonal. "soft" and "teddy_bear" have zero similarity, even though teddy bears are soft!

Cosine Similarity

We measure similarity between vectors using cosine similarity:

cos(A, B) = (A · B) / (||A|| × ||B||)
  • Similarity = 1: Vectors point in same direction
  • Similarity = 0: Vectors are orthogonal (independent)
  • Similarity = -1: Vectors point in opposite directions

What we want:

  • Similar concepts β†’ High similarity (teddy_bear ↔ soft)
  • Unrelated concepts β†’ Low similarity (teddy_bear ↔ book)

What one-hot gives us:

  • Everything has 0 similarity with everything else

This is why we need learned embeddings.
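
A small numpy sketch makes the contrast explicit; the dense vectors below are made-up illustrative values, not trained embeddings:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words has similarity 0.
soft, teddy_bear, book = np.eye(3)
print(cosine(soft, teddy_bear))   # 0.0
print(cosine(teddy_bear, book))   # 0.0

# Made-up dense embeddings: similarity now reflects meaning rather than identity.
soft_e       = np.array([0.9, 0.1])
teddy_bear_e = np.array([0.8, 0.3])
book_e       = np.array([-0.2, 0.9])
print(cosine(soft_e, teddy_bear_e))   # high (~0.97)
print(cosine(teddy_bear_e, book_e))   # low  (~0.14)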


🧠 Word2vec: Learning Meaningful Embeddings

Word2vec (2013) was a breakthrough in learning word representations. The key insight: use a proxy task to learn embeddings.

The Proxy Task Concept

Instead of directly defining what makes a good embedding, we:

  1. Define a task that requires understanding language
  2. Train a model on that task
  3. Extract the learned representations

If a model can predict surrounding words, it must have learned something meaningful about language.

Two Approaches

Continuous Bag of Words (CBOW):

  • Input: Context words (surrounding words)
  • Output: Target word (center word)
  • "Predict the word from its context"

Skip-gram:

  • Input: Target word (center word)
  • Output: Context words (surrounding words)
  • "Predict the context from the word"

Training Walkthrough

Let's trace through a simple example:

Sentence: "A cute teddy bear is reading"
Task: Predict next word

Step 1: Take word "A", represent as one-hot vector

Input: [1, 0, 0, 0, 0, 0]  (vocabulary size = 6)

Step 2: Multiply by weight matrix W₁ (size: V × d)

Hidden layer h = W₁ᵀ × input
Result: h = [0.2, 0.9]  (dimension d = 2)

Step 3: Multiply by weight matrix W₂ (size: d × V)

Output = softmax(W₂ᵀ × h)
Result: [0.2, 0.4, 0.1, 0.1, 0.1, 0.1]

Step 4: Compare with true next word "cute" = [0, 1, 0, 0, 0, 0]

Step 5: Compute loss (cross-entropy) and backpropagate

Step 6: Repeat for all words in corpus until convergence
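
The walkthrough above maps directly onto a few lines of numpy. Below is a minimal sketch (not from the lecture) of one training step; the vocabulary indices for "A" and "cute" are assumptions for illustration, and the shapes mirror Steps 1-5:

import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 2                                 # vocabulary size and embedding dimension
W1 = rng.normal(scale=0.1, size=(V, d))     # input embedding matrix (V x d)
W2 = rng.normal(scale=0.1, size=(d, V))     # output projection (d x V)

x = np.zeros(V); x[0] = 1.0                 # one-hot for "A" (index 0, assumed)
y = np.zeros(V); y[1] = 1.0                 # one-hot for "cute" (index 1, assumed)

# Forward pass (Steps 1-4)
h = W1.T @ x                                # hidden layer = row 0 of W1
logits = W2.T @ h
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
loss = -np.log(probs[1])                    # cross-entropy against "cute"

# Backward pass and one SGD update (Step 5)
dlogits = probs - y                         # gradient of softmax + cross-entropy
dW2 = np.outer(h, dlogits)
dW1 = np.outer(x, W2 @ dlogits)
lr = 0.1
W1 -= lr * dW1; W2 -= lr * dW2

print(loss, W1[0])                          # W1[0] is the (updated) embedding of "A"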

The Magic: Hidden Layer IS the Embedding

After training, the weight matrix W₁ contains our embeddings:

To get embedding for word i:
embedding(word_i) = W₁[i, :]  (row i of W₁)

The resulting embeddings capture semantic relationships:

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin

Typical Dimensions

  • Vocabulary size (V): 10,000 - 100,000+
  • Embedding dimension (d): 100 - 768+

🔄 Sequence Models: RNN and LSTM

Word2vec gives us token embeddings, but they're context-independent. The word "bank" has the same embedding whether it means "river bank" or "money bank".

Recurrent Neural Networks (RNN)

RNNs process sequences one token at a time, maintaining a hidden state that captures the sequence so far.

For each time step t:
    h_t = f(h_{t-1}, x_t)
    y_t = g(h_t)

Architecture:

Input:  x₁ → x₂ → x₃ → x₄ → x₅
         ↓    ↓    ↓    ↓    ↓
Hidden: h₁ → h₂ → h₃ → h₄ → h₅
         ↓    ↓    ↓    ↓    ↓
Output: y₁   y₂   y₃   y₄   y₅
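
In code, this recurrence is just a loop that threads the hidden state through time. A minimal numpy sketch, assuming a vanilla tanh cell for f and a linear readout for g (the lecture leaves both abstract):

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    # Vanilla (Elman) RNN: one tanh cell unrolled over the sequence.
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                                # sequential: step t needs h from step t-1
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h_t = f(h_{t-1}, x_t)
        ys.append(W_hy @ h + b_y)               # y_t = g(h_t)
    return ys, h

# Toy shapes: 5 input vectors of dimension 4, hidden size 3, output size 2.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]
ys, h_last = rnn_forward(
    xs,
    W_xh=rng.normal(size=(3, 4)), W_hh=rng.normal(size=(3, 3)),
    W_hy=rng.normal(size=(2, 3)), b_h=np.zeros(3), b_y=np.zeros(2),
)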

Pros:

  • Word order matters
  • Context-aware representations
  • Can handle variable-length sequences

Cons:

  • Sequential processing = slow (can't parallelize)
  • Vanishing gradient problem

The Vanishing Gradient Problem

When backpropagating through many time steps:

∂L/∂h₁ = ∂L/∂h_T × ∂h_T/∂h_{T-1} × ... × ∂h₂/∂h₁

If each gradient term is < 1, the product becomes tiny. The model "forgets" early tokens.

If each term is > 1, gradients explode. Training becomes unstable.
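
A quick back-of-the-envelope calculation shows how fast this goes wrong over 100 time steps:

# The product of many per-step gradient factors either vanishes or explodes.
shrinking = 0.9 ** 100    # ~2.7e-5: the signal from early tokens is essentially gone
growing   = 1.1 ** 100    # ~1.4e+4: gradients blow up and training destabilizes
print(shrinking, growing)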

LSTM: Long Short-Term Memory

LSTMs add a cell state to help remember important information:

Components:
- h_t: Hidden state (short-term)
- c_t: Cell state (long-term)
- Gates: Forget, Input, Output (control information flow)

LSTMs mitigate vanishing gradients but don't eliminate them. Long-range dependencies (1000+ tokens) remain challenging.

Why RNNs Fell Out of Favor

  1. Sequential processing: Can't parallelize across time steps
  2. Long-range dependencies: Still struggle with very long contexts
  3. Slow training: Each token depends on all previous tokens

This led to the development of attention mechanisms.


👁️ The Attention Mechanism: Direct Connections

The core idea: instead of passing information through a chain of hidden states, create direct connections between any two positions.

Motivation

Consider translating: "A cute teddy bear is reading" → French

When generating "ours" (bear), we want to look directly at "teddy bear" in the input, not hope the information survived through the RNN chain.

How Attention Works

Attention creates a weighted combination of all input positions:

For each output position:
1. Compare (query) with all input positions (keys)
2. Get similarity scores
3. Weight the input values by these scores
4. Sum to get context vector

Query, Key, Value (Q, K, V)

This terminology comes from database lookups:

  • Query (Q): What you're looking for
  • Key (K): Index for each item
  • Value (V): Actual content of each item
Example: Looking up "teddy bear" in a dictionary

Query: "teddy bear"
Keys: ["a", "cute", "teddy bear", "is", "reading"]
Values: [emb_a, emb_cute, emb_teddy, emb_is, emb_reading]

1. Compare query with each key → similarity scores
2. Highest similarity: "teddy bear" key
3. Return corresponding value: emb_teddy

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
  • QKᵀ: Dot product of queries and keys (similarity scores)
  • √d_k: Scaling factor (prevents large dot products)
  • softmax: Converts to probability distribution
  • × V: Weighted sum of values
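
As a concrete reference, here is a minimal numpy implementation of this formula (a sketch, not tied to any particular library):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax over the keys
    return weights @ V                                 # weighted sum of values

# Toy self-attention: Q, K, V all come from the same 5-token sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(X, X, X)            # shape (5, 4)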

🤖 The Transformer Architecture

The Transformer (2017, "Attention Is All You Need") revolutionized NLP by using only attention, with no recurrence.

Key Innovation: Self-Attention

Instead of attention between encoder and decoder, apply attention within a sequence:

Input: "A cute teddy bear is reading"

For the word "teddy bear":
- Query: teddy bear's representation
- Keys: all words' representations
- Values: all words' representations

Result: teddy bear's representation enriched with context
        (e.g., "cute" contributes because teddy bears are cute)

This is self-attention: the sequence attends to itself.

Architecture Overview

┌─────────────────┐      ┌─────────────────┐
│     ENCODER     │      │     DECODER     │
├─────────────────┤      ├─────────────────┤
│                 │      │    Linear +     │
│ Multi-Head      │      │    Softmax      │
│ Self-Attention  │      ├─────────────────┤
│       ↓         │      │  Feed Forward   │
│ Feed Forward    │      ├─────────────────┤
│                 │ ──►  │ Cross-Attention │ ◄── Q from decoder
├─────────────────┤ K,V  │                 │     K,V from encoder
│ Input Embedding │      │ Masked Self-    │
│ + Positional    │      │ Attention       │
│   Encoding      │      ├─────────────────┤
└─────────────────┘      │ Output Embedding│
        ↑                │ + Positional    │
   Source Text           │   Encoding      │
   (English)             └─────────────────┘
                                 ↑
                            Target Text
                            (French)

Encoder

The encoder processes the input sequence:

  1. Input Embedding: Convert tokens to vectors
  2. Positional Encoding: Add position information (sine/cosine functions)
  3. Multi-Head Self-Attention: Each token attends to all tokens
  4. Feed-Forward Network: Additional transformation
  5. Repeat N times: Stack N encoder layers

Output: Context-aware representations for all input tokens.
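
Step 2's positional encoding is worth making concrete. A numpy sketch of the standard sinusoidal formulation from the original paper, which interleaves sines and cosines at geometrically spaced frequencies:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=8, d_model=16)   # added elementwise to the token embeddings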

Decoder

The decoder generates the output sequence:

  1. Output Embedding + Positional Encoding
  2. Masked Self-Attention: Each token attends only to previous tokens (causal)
  3. Cross-Attention: Attend to encoder outputs
    • Query: from decoder
    • Key, Value: from encoder
  4. Feed-Forward Network
  5. Repeat N times
  6. Linear + Softmax: Predict next token probability

Multi-Head Attention

Instead of one attention operation, run h parallel attention heads:

MultiHead(Q, K, V) = Concat(head₁, head₂, ..., head_h) × W_O

where head_i = Attention(Q × W_Q^i, K × W_K^i, V × W_V^i)

Why multiple heads?

  • Different heads can learn different relationship types
  • One head might focus on syntax, another on semantics
  • Similar to multiple filters in CNNs

Typical values: h = 8 or 12 heads.
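
Putting the pieces together, here is a rough sketch of multi-head self-attention. The projection matrices are randomly initialized here purely for illustration (in a real model W_Q^i, W_K^i, W_V^i, and W_O are learned), and the per-head dimension is an assumption:

import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (same formula as earlier).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, heads, d_k, rng):
    # Run `heads` independent attention heads and project the concatenation back to d_model.
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        head_outputs.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.normal(size=(heads * d_k, d_model))
    return np.concatenate(head_outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                  # 5 tokens, d_model = 16
out = multi_head_self_attention(X, heads=8, d_k=2, rng=rng)   # shape (5, 16)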

Why Self-Attention Beats RNNs

| Aspect | RNN | Transformer |
|---|---|---|
| Long-range dependencies | Difficult (vanishing gradients) | Direct connections |
| Parallelization | Sequential (slow) | Fully parallel |
| Computation per layer | O(n × d²) | O(n² × d) |
| Maximum path length | O(n) | O(1) |

For sequences up to ~2000 tokens, Transformers are faster and more effective.

Label Smoothing

A training trick used in the original paper:

Instead of hard labels [1, 0, 0, 0]:

[0.9, 0.033, 0.033, 0.033]

Why? In language, there's often more than one correct next word. "What a great ___" could be "day", "idea", "book", etc. Label smoothing prevents overconfidence.
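
A minimal sketch of one common label-smoothing formulation, chosen here because it reproduces the numbers above (epsilon = 0.1 spread uniformly over the other classes):

import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # Keep (1 - epsilon) on the true class and spread epsilon over the other V - 1 classes.
    V = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (V - 1)

print(smooth_labels(np.array([1.0, 0.0, 0.0, 0.0])))
# [0.9, 0.0333, 0.0333, 0.0333]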


🔬 End-to-End Walkthrough

Let's trace a complete forward pass for translation:

Task: Translate "A cute teddy bear is reading" to French

Step 1: Tokenization

Input tokens: [BOS, "A", "cute", "teddy", "bear", "is", "reading", EOS]

Step 2: Embedding + Positional Encoding

For each token position i:

x_i = TokenEmbedding(token_i) + PositionalEncoding(i)

Result: Matrix X of shape (sequence_length × d_model)

Step 3: Encoder Processing

Self-Attention:

Q = X × W_Q    (queries)
K = X × W_K    (keys)
V = X × W_V    (values)

Attention_output = softmax(QKᵀ / √d_k) × V

This is done h times (multi-head), concatenated, and projected:

MultiHead_output = Concat(head₁, ..., head_h) × W_O

Feed-Forward:

FFN(x) = ReLU(x × W₁ + b₁) × W₂ + b₂

Repeat N times.

Result: Encoded representations for all input tokens.

Step 4: Decoder Generation

Start with BOS token:

Decoder input: [BOS]

Masked Self-Attention: (only looks at previous tokens)

For position 1: attend only to BOS

Cross-Attention:

Q = decoder hidden state
K, V = encoder outputs

β†’ "What parts of the English sentence should I look at?"

Feed-Forward + Linear + Softmax:

→ Probability distribution over French vocabulary
→ Select: "Un" (or sample from distribution)

Step 5: Autoregressive Generation

Decoder input: [BOS, "Un"]
→ Predict: "ours"

Decoder input: [BOS, "Un", "ours"]
→ Predict: "en"

... continue until EOS is predicted

Final output: "Un ours en peluche mignon lit"
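
The generation loop above can be summarized as a short greedy-decoding sketch. The `model` callable and its interface here are hypothetical placeholders standing in for the decoder stack, not a real library API:

def greedy_decode(model, encoder_output, bos_id, eos_id, max_len=50):
    # Hypothetical interface: `model` returns next-token probabilities given the
    # encoder output and the tokens generated so far.
    tokens = [bos_id]
    for _ in range(max_len):
        probs = model(encoder_output, tokens)                     # distribution over target vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)   # greedy pick (could also sample)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens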


🎯 Key Takeaways

  1. NLP tasks fall into classification, multi-classification, and generation categories

  2. Tokenization converts text to discrete units; subword is the modern standard

  3. Embeddings are learned representations that capture semantic meaning

  4. Word2vec showed that proxy tasks (predicting context) yield meaningful embeddings

  5. RNNs process sequences but suffer from vanishing gradients

  6. Attention creates direct connections, solving long-range dependency issues

  7. Transformers use self-attention exclusively, enabling parallelization and better performance

  8. The Encoder creates context-aware input representations

  9. The Decoder generates output autoregressively, using masked self-attention and cross-attention

  10. Multi-head attention allows learning multiple types of relationships


📚 References


These notes are based on Stanford's CME 295 Lecture 1, taught by Afshine and Shervine Amidi. The course continues with deeper dives into training, fine-tuning, and applications of LLMs.