Large Language Models (LLMs) have revolutionized how we interact with AI. But how do they actually work? Stanford’s CME 295: Transformers and Large Language Models course, taught by twin brothers Afshine and Shervine Amidi, provides one of the most accessible yet rigorous introductions to this technology.
These are my notes from Lecture 1, summarizing the key concepts from NLP basics to the Transformer architecture that powers models like GPT, BERT, and Claude.
Who Are the Instructors?
Afshine and Shervine Amidi are twin brothers with impressive ML backgrounds:
- Both studied at Centrale Paris (France)
- Afshine went to MIT, Shervine to Stanford (ICME Master’s)
- Industry experience: Uber → Google → Netflix
- Currently working on LLMs at Netflix
- Creators of the popular VIP Cheat Sheets on GitHub
Their practical industry experience combined with academic rigor makes this course particularly valuable for developers looking to understand LLMs beyond surface-level explanations.
Understanding NLP: The Foundation
Natural Language Processing (NLP) is the field of getting computers to work with and understand text. At a high level, NLP tasks can be classified into three buckets:
1. Classification Tasks
You have an input text and want to predict a single label.
| Task | Description | Example |
|---|---|---|
| Sentiment Analysis | Determine if text is positive, negative, or neutral | Movie review → “Positive” |
| Intent Detection | Identify what a user wants to do | “Set alarm for 7am” → “create_alarm” |
| Language Detection | Identify the language of text | “Bonjour” → French |
| Topic Modeling | Categorize text by topic | News article → “Sports” |
2. Multi-Classification Tasks
You have an input text and want to predict a label for each token (sequence labeling).
| Task | Description | Example |
|---|---|---|
| Named Entity Recognition (NER) | Label specific words with categories | “Paris is beautiful” → Paris: LOCATION |
| Part-of-Speech Tagging | Label grammatical function | “The cat sat” → DET, NOUN, VERB |
| Dependency Parsing | Identify grammatical relationships | Subject-verb-object structures |
3. Generation Tasks
Text in, text out, where the output length is variable.
| Task | Description | Example |
|---|---|---|
| Machine Translation | Convert text between languages | English → French |
| Question Answering | Generate answers to questions | ChatGPT, Claude |
| Summarization | Condense long text | Article → 2-sentence summary |
| Code Generation | Generate code from descriptions | “Sort a list” → Python code |
Evaluation Metrics: How Do We Measure Success?
Different tasks require different metrics. Here’s how we evaluate NLP models:
Classification Metrics
| Metric | Description |
|---|---|
| Accuracy | Fraction of all predictions that are correct |
| Precision | Of the items predicted positive, the fraction that are actually positive |
| Recall | Of the actual positives, the fraction the model finds |
| F1 Score | Harmonic mean of precision and recall |
Why multiple metrics? Consider a dataset with 99% positive labels. A model that always predicts “positive” would have 99% accuracy but be useless. Precision and recall on the rare negative class reveal this flaw.
Generation Metrics
| Metric | Description | Direction |
|---|---|---|
| BLEU | Bilingual Evaluation Understudy: measures n-gram overlap with reference translations | Higher = Better |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation: a suite of overlap metrics used for summarization | Higher = Better |
| Perplexity | How “surprised” the model is by the text, i.e., the exponential of the average negative log-likelihood per token | Lower = Better |
The problem with BLEU and ROUGE: they require reference texts (labels), which are expensive to create. Modern approaches use reference-free metrics powered by LLMs themselves.
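Of the three, perplexity is the easiest to compute directly from model probabilities. A minimal sketch of the standard definition (exponential of the average negative log-likelihood per token), using made-up token probabilities:

```python
import numpy as np

def perplexity(token_probs):
    """Exponential of the average negative log-probability assigned to each token."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# Probabilities the model assigned to each observed token (made-up values)
print(perplexity([0.5, 0.25, 0.5]))   # ~2.52; a perfect model would score 1.0
```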
Tokenization: Breaking Text into Pieces
Before a model can process text, we need to convert it into discrete units called tokens. There are three main approaches:
Word-Level Tokenization
Split text by words:
`"A cute teddy bear is reading"` → `["A", "cute", "teddy", "bear", "is", "reading"]`
Pros:
- Simple and intuitive
- Each token has clear meaning
Cons:
- Large vocabulary size
- “bear” and “bears” are completely different tokens
- High Out-of-Vocabulary (OOV) risk
Subword-Level Tokenization
Split using common subword units:
For example (WordPiece-style notation): `"bears"` → `["bear", "##s"]`, `"reading"` → `["read", "##ing"]`
Pros:
- Leverages word roots
- Lower OOV risk
- Balances vocabulary size
Cons:
- Longer sequences than word-level
- Requires training a tokenizer (BPE, WordPiece, etc.)
Character-Level Tokenization
Split into individual characters:
`"bear"` → `["b", "e", "a", "r"]`
Pros:
- Robust to misspellings
- Very small vocabulary
- Zero OOV risk
Cons:
- Very long sequences
- Hard to capture meaning at character level
- Slow computation
Comparison Table
| Approach | Vocabulary Size | Sequence Length | OOV Risk | Use Case |
|---|---|---|---|---|
| Word | Large (100K+) | Short | High | Simple tasks |
| Subword | Medium (30-50K) | Medium | Low | Most LLMs |
| Character | Tiny (~100) | Very Long | None | Spelling tasks |
Modern LLMs typically use subword tokenization (like BPE or SentencePiece) with vocabulary sizes of 30,000-100,000 tokens.
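To make the subword idea concrete, here is a minimal, self-contained sketch of the core loop of BPE training: repeatedly merge the most frequent adjacent pair of symbols. The toy corpus and number of merges are made up for illustration; real tokenizers add byte-level handling, special tokens, and many other details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol-tuples -> frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word initially split into characters
corpus = {tuple("bear"): 10, tuple("bears"): 6, tuple("beard"): 3}
for _ in range(3):                          # learn 3 merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, "->", list(corpus))
```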
Word Representation: From Text to Numbers
Models understand numbers, not text. We need to represent tokens numerically.
One-Hot Encoding: The Naive Approach
Assign each token a unique vector with a single 1:
- soft → [1, 0, 0, …, 0]
- teddy_bear → [0, 1, 0, …, 0]
- book → [0, 0, 1, …, 0]
The Problem: All vectors are orthogonal. “Soft” and “teddy_bear” have zero similarity, even though teddy bears are soft!
Cosine Similarity
We measure similarity between vectors using cosine similarity:
cos(u, v) = (u · v) / (‖u‖ ‖v‖)
- Similarity = 1: Vectors point in same direction
- Similarity = 0: Vectors are orthogonal (independent)
- Similarity = -1: Vectors point in opposite directions
What we want:
- Similar concepts → High similarity (teddy_bear and soft)
- Unrelated concepts → Low similarity (teddy_bear and book)
What one-hot gives us:
- Everything has 0 similarity with everything else
This is why we need learned embeddings.
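A quick numeric check of both claims; the dense “embedding” values below are made up purely to illustrate the contrast with one-hot vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: every pair of distinct tokens is orthogonal
soft       = np.array([1.0, 0.0, 0.0])
teddy_bear = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(soft, teddy_bear))                    # 0.0

# Made-up dense embeddings: related concepts can point in similar directions
soft_emb       = np.array([0.8, 0.1, 0.3])
teddy_bear_emb = np.array([0.7, 0.2, 0.4])
print(round(cosine_similarity(soft_emb, teddy_bear_emb), 3))  # ~0.98
```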
Word2vec: Learning Meaningful Embeddings
Word2vec (2013) was a breakthrough in learning word representations. The key insight: use a proxy task to learn embeddings.
The Proxy Task Concept
Instead of directly defining what makes a good embedding, we:
- Define a task that requires understanding language
- Train a model on that task
- Extract the learned representations
If a model can predict surrounding words, it must have learned something meaningful about language.
Two Approaches
Continuous Bag of Words (CBOW):
- Input: Context words (surrounding words)
- Output: Target word (center word)
- “Predict the word from its context”
Skip-gram:
- Input: Target word (center word)
- Output: Context words (surrounding words)
- “Predict the context from the word”
Training Walkthrough
Let’s trace through a simple example:
Corpus: "A cute teddy bear is reading" → vocabulary {A, cute, teddy, bear, is, reading}, so V = 6.
Step 1: Take word “A”, represent as one-hot vector
"A" → [1, 0, 0, 0, 0, 0]
Step 2: Multiply by weight matrix W₁ (size: V × d)

hidden = one_hot("A") · W₁ → a 1 × d vector (the row of W₁ corresponding to "A")

Step 3: Multiply by weight matrix W₂ (size: d × V)

scores = hidden · W₂ → a 1 × V vector, passed through a softmax to get a probability for every vocabulary word
Step 4: Compare with true next word “cute” = [0, 1, 0, 0, 0, 0]
Step 5: Compute loss (cross-entropy) and backpropagate
Step 6: Repeat for all words in corpus until convergence
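A minimal numpy sketch of one forward pass through these steps. The weights are randomly initialized here and the gradient update itself is omitted, so this only illustrates the shapes and the loss computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup matching the walkthrough: V = 6 tokens, d-dimensional embeddings
vocab = ["A", "cute", "teddy", "bear", "is", "reading"]
V, d = len(vocab), 4                      # d is tiny here for readability
W1 = rng.normal(scale=0.1, size=(V, d))   # input weights  (V x d)
W2 = rng.normal(scale=0.1, size=(d, V))   # output weights (d x V)

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

# Steps 1-3: forward pass for the training pair ("A" -> "cute")
x = one_hot(vocab.index("A"), V)                 # one-hot input
h = x @ W1                                       # hidden layer = row of W1
scores = h @ W2                                  # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()    # softmax

# Steps 4-5: cross-entropy loss against the true next word "cute"
target = vocab.index("cute")
loss = -np.log(probs[target])
print(round(float(loss), 3))

# After training (step 6), W1[i] is the embedding of vocab[i]
embedding_of_A = W1[vocab.index("A")]
```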
The Magic: Hidden Layer IS the Embedding
After training, the weight matrix W₁ contains our embeddings:

embedding(token i) = row i of W₁, a d-dimensional vector
The resulting embeddings capture semantic relationships:
- cos(embedding(teddy_bear), embedding(soft)) is high, while cos(embedding(teddy_bear), embedding(book)) is low
- the classic example: embedding(king) − embedding(man) + embedding(woman) ≈ embedding(queen)
Typical Dimensions
- Vocabulary size (V): 10,000 - 100,000+
- Embedding dimension (d): 100 - 768+
Sequence Models: RNN and LSTM
Word2vec gives us token embeddings, but they’re context-independent. The word “bank” has the same embedding whether it means “river bank” or “money bank”.
Recurrent Neural Networks (RNN)
RNNs process sequences one token at a time, maintaining a hidden state that captures the sequence so far.
h_t = tanh(W_x · x_t + W_h · h_{t−1} + b)

Architecture:

Tokens x₁, x₂, x₃, … are fed in one at a time: h₁ = f(h₀, x₁), h₂ = f(h₁, x₂), h₃ = f(h₂, x₃), and so on, with each hidden state carried forward to the next step.
Pros:
- Word order matters
- Context-aware representations
- Can handle variable-length sequences
Cons:
- Sequential processing = slow (can’t parallelize)
- Vanishing gradient problem
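A minimal numpy sketch of the recurrence above, with illustrative sizes and random weights; note that the loop over tokens is inherently sequential, which is exactly what prevents parallelization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16                      # illustrative sizes
Wx = rng.normal(scale=0.1, size=(d_in, d_hidden))
Wh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b  = np.zeros(d_hidden)

def rnn_forward(inputs):
    """Process tokens one at a time; h_t = tanh(x_t Wx + h_{t-1} Wh + b)."""
    h = np.zeros(d_hidden)
    states = []
    for x_t in inputs:                      # sequential: step t needs step t-1
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        states.append(h)
    return states

sequence = rng.normal(size=(6, d_in))       # 6 made-up token embeddings
states = rnn_forward(sequence)
print(len(states), states[-1].shape)        # 6 (16,)
```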
The Vanishing Gradient Problem
When backpropagating through many time steps:
∂L/∂h₁ = ∂L/∂h_T × (∂h_T/∂h_{T−1}) × (∂h_{T−1}/∂h_{T−2}) × … × (∂h₂/∂h₁)
If each gradient term is < 1, the product becomes tiny. The model “forgets” early tokens.
If each term is > 1, gradients explode. Training becomes unstable.
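A two-line numeric illustration of why this matters for long sequences (the per-step factors 0.9 and 1.1 are made up):

```python
# Backprop through T steps multiplies roughly T per-step gradient factors together
T = 100
print(0.9 ** T)   # ~2.7e-05: the signal from early tokens vanishes
print(1.1 ** T)   # ~1.4e+04: gradients explode and training destabilizes
```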
LSTM: Long Short-Term Memory
LSTMs add a cell state to help remember important information:
A forget gate decides what to erase from the cell state, an input gate decides what new information to write, and an output gate decides what to expose as the hidden state: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t and h_t = o_t ⊙ tanh(c_t).
LSTMs mitigate vanishing gradients but don’t eliminate them. Long-range dependencies (1000+ tokens) remain challenging.
Why RNNs Fell Out of Favor
- Sequential processing: Can’t parallelize across time steps
- Long-range dependencies: Still struggle with very long contexts
- Slow training: Each token depends on all previous tokens
This led to the development of attention mechanisms.
The Attention Mechanism: Direct Connections
The core idea: instead of passing information through a chain of hidden states, create direct connections between any two positions.
Motivation
Consider translating: “A cute teddy bear is reading” → French
When generating “ours” (bear), we want to look directly at “teddy bear” in the input, rather than hope the information survived the trip through the RNN chain.
How Attention Works
Attention creates a weighted combination of all input positions:
output = Σ_j α_j · v_j, where the weights α_j come from a softmax over query-key similarity scores and sum to 1
Query, Key, Value (Q, K, V)
This terminology comes from database lookups:
- Query (Q): What you’re looking for
- Key (K): Index for each item
- Value (V): Actual content of each item
In a database lookup, a query matches exactly one key and returns that key’s value. Attention is a soft lookup: the query is compared against every key, and the result is a weighted mix of all the values, weighted by how well each key matches.
Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- QKα΅: Dot product of queries and keys (similarity scores)
- √d_k: Scaling factor (prevents large dot products)
- softmax: Converts to probability distribution
- × V: Weighted sum of values
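A minimal numpy implementation of the formula above, with illustrative shapes and random inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8                         # illustrative sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                     # (5, 8) (5, 5)
```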
The Transformer Architecture
The Transformer (2017, “Attention is All You Need”) revolutionized NLP by relying on attention alone, with no recurrence.
Key Innovation: Self-Attention
Instead of attention between encoder and decoder, apply attention within a sequence:
Q, K, and V are all computed from the same sequence X: Q = X·W^Q, K = X·W^K, V = X·W^V, so every token builds its query, key, and value from its own representation.
This is self-attention: the sequence attends to itself.
Architecture Overview
Input tokens → [Encoder × N] → encoded representations → [Decoder × N, attending to the encoder output] → Linear + Softmax → output token probabilities
Encoder
The encoder processes the input sequence:
- Input Embedding: Convert tokens to vectors
- Positional Encoding: Add position information (sine/cosine functions; sketched below)
- Multi-Head Self-Attention: Each token attends to all tokens
- Feed-Forward Network: Additional transformation
- Repeat N times: Stack N encoder layers
Output: Context-aware representations for all input tokens.
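The sinusoidal positional encoding from step 2 can be sketched as follows; shapes are illustrative, and the formula PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(…) is the one from the original paper:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=512)      # one row per token position
print(pe.shape)                                       # (6, 512)
```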
Decoder
The decoder generates the output sequence:
- Output Embedding + Positional Encoding
- Masked Self-Attention: Each token attends only to previous tokens (causal; see the mask sketch below)
- Cross-Attention: Attend to encoder outputs
- Query: from decoder
- Key, Value: from encoder
- Feed-Forward Network
- Repeat N times
- Linear + Softmax: Predict next token probability
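The causal mask behind the decoder’s masked self-attention can be built as a matrix of −∞ values above the diagonal, added to the attention scores before the softmax. A minimal sketch with made-up scores:

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular mask: position i may not attend to positions j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4))                 # pretend attention scores
masked = scores + causal_mask(4)          # future positions become -inf
print(masked)                             # -inf above the diagonal -> softmax weight 0
```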
Multi-Head Attention
Instead of one attention operation, run h parallel attention heads:
MultiHead(Q, K, V) = Concat(head₁, …, head_h) · W^O, where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Why multiple heads?
- Different heads can learn different relationship types
- One head might focus on syntax, another on semantics
- Similar to multiple filters in CNNs
Typical values: h = 8 or 12 heads.
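A self-contained numpy sketch of the split, attend, concatenate, project pattern; in a real model the per-head projection matrices are learned parameters stored once, not resampled on every call as in this toy version:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, h, d_model):
    """Split d_model across h heads, attend per head, concat, project."""
    d_head = d_model // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.normal(scale=0.1, size=(h * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

X = rng.normal(size=(6, 64))                            # 6 tokens, d_model = 64 (illustrative)
print(multi_head_attention(X, h=8, d_model=64).shape)   # (6, 64)
```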
Why Self-Attention Beats RNNs
| Aspect | RNN | Transformer |
|---|---|---|
| Long-range dependencies | Difficult (vanishing gradients) | Direct connections |
| Parallelization | Sequential (slow) | Fully parallel |
| Computation per layer | O(n × d²) | O(n² × d) |
| Maximum path length | O(n) | O(1) |
For sequences up to ~2000 tokens, Transformers are faster and more effective.
Label Smoothing
A training trick used in the original paper:
Instead of hard labels [1, 0, 0, 0]:
use softened targets such as [0.925, 0.025, 0.025, 0.025] (smoothing ε = 0.1 spread uniformly over the 4 classes)
Why? In language, there’s often more than one correct next word. “What a great ___” could be “day”, “idea”, “book”, etc. Label smoothing prevents overconfidence.
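A one-function sketch of the standard recipe, blending the hard one-hot target with a uniform distribution over the vocabulary:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Blend the hard target with a uniform distribution over the K classes."""
    K = one_hot.shape[-1]
    return (1 - eps) * one_hot + eps / K

print(smooth_labels(np.array([1.0, 0.0, 0.0, 0.0])))   # [0.925 0.025 0.025 0.025]
```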
End-to-End Walkthrough
Let’s trace a complete forward pass for translation:
Task: Translate “A cute teddy bear is reading” to French
Step 1: Tokenization
"A cute teddy bear is reading" → ["A", "cute", "teddy", "bear", "is", "reading"], each token mapped to its integer ID in the vocabulary
Step 2: Embedding + Positional Encoding
For each token position i:
x_i = Embedding(token_i) + PositionalEncoding(i)
Result: Matrix X of shape (sequence_length Γ d_model)
Step 3: Encoder Processing
Self-Attention:
Q = X·W^Q, K = X·W^K, V = X·W^V, then Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
This is done h times (multi-head), concatenated, and projected:
MultiHead(X) = Concat(head₁, …, head_h) · W^O
Feed-Forward:
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂, applied to each position independently
Repeat N times.
Result: Encoded representations for all input tokens.
Step 4: Decoder Generation
Start with BOS token:
Decoder input so far: [BOS]
Masked Self-Attention: (only looks at previous tokens)
Scores for future positions are set to −∞ before the softmax, so each position attends only to itself and earlier positions.
Cross-Attention:
Queries come from the decoder states; keys and values come from the encoder outputs, letting every decoder position look at the full source sentence.
Feed-Forward + Linear + Softmax:
The final linear layer + softmax produce a probability for every token in the target vocabulary; the most likely token, here “Un”, is emitted.
Step 5: Autoregressive Generation
- [BOS] → “Un”
- [BOS, Un] → “ours”
- [BOS, Un, ours] → “en”
- … keep feeding the generated tokens back in until the model emits EOS
Final output: “Un ours en peluche mignon lit”
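The loop above amounts to greedy decoding. A short sketch, assuming a hypothetical model(src_tokens, tgt_tokens) callable that returns next-token probabilities over the target vocabulary:

```python
import numpy as np

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    """Repeatedly pick the most likely next token until EOS (or a length cap)."""
    tgt = [bos_id]
    for _ in range(max_len):
        probs = model(src_tokens, tgt)       # distribution over the target vocabulary
        next_id = int(np.argmax(probs))      # greedy choice of the next token
        if next_id == eos_id:
            break
        tgt.append(next_id)
    return tgt[1:]                           # drop the BOS token
```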
Key Takeaways
- NLP tasks fall into classification, multi-classification, and generation categories
- Tokenization converts text to discrete units; subword is the modern standard
- Embeddings are learned representations that capture semantic meaning
- Word2vec showed that proxy tasks (predicting context) yield meaningful embeddings
- RNNs process sequences but suffer from vanishing gradients
- Attention creates direct connections, solving long-range dependency issues
- Transformers use self-attention exclusively, enabling parallelization and better performance
- The Encoder creates context-aware input representations
- The Decoder generates output autoregressively, using masked self-attention and cross-attention
- Multi-head attention allows learning multiple types of relationships
References
- Attention Is All You Need (2017) – Original Transformer paper
- Word2vec Paper (2013) – Efficient estimation of word representations
- Stanford CME 295 Course Website – Course materials
- YouTube: Stanford CME 295 Lecture 1 – Original lecture video
- Super Study Guide: Transformers & LLMs – Course textbook by Afshine and Shervine Amidi
These notes are based on Stanford’s CME 295 Lecture 1, taught by Afshine and Shervine Amidi. The course continues with deeper dives into training, fine-tuning, and applications of LLMs.