第15章:Decoder-only GPT — 自回归生成 | Chapter 15: Decoder-only GPT — Autoregressive Generation
阶段定位 | Stage: 第三阶段(Transformer 核心)| Stage 3 (Transformer Core)
预计学时 | Duration: 5~7 小时 | 5-7 hours
---
学习目标 | Learning Objectives
中文:
- 真正理解 Decoder-only 架构的设计逻辑
- 掌握 Causal Mask 的实现和作用
- 理解自回归(Autoregressive)生成的本质
- 掌握 Next Token Prediction 的训练目标
English:
- Truly understand the design logic of Decoder-only architecture
- Master the implementation and role of Causal Mask
- Understand the essence of autoregressive generation
- Master the Next Token Prediction training objective
---
15.1 为什么 Decoder-only 成为主流?| Why Did Decoder-only Become Mainstream?
中文解释
历史演进:
- 2017: Transformer(Encoder-Decoder)— 用于翻译
- 2018: BERT(Encoder-only)— 用于理解
- 2018: GPT-1(Decoder-only)— 用于生成
- 2020+: GPT-3, LLaMA, Qwen... 全是 Decoder-only
Decoder-only 的优势:
- 结构简单,只有一个堆栈
- 天然适合生成任务
- 训练效率高(无需 encoder-decoder 对齐)
- 缩放效果好(Scaling Laws)
English Explanation
Historical evolution:
- 2017: Transformer (Encoder-Decoder) — for translation
- 2018: BERT (Encoder-only) — for understanding
- 2018: GPT-1 (Decoder-only) — for generation
- 2020+: GPT-3, LLaMA, Qwen... all Decoder-only
Advantages of Decoder-only:
- Simple structure, only one stack
- Naturally suited for generation tasks
- Efficient to train (no encoder-decoder alignment needed)
- Good scaling properties (Scaling Laws)
---
15.2 Causal Mask — 因果掩码 | Causal Mask
中文解释
核心约束:当前 token 只能看到自己和之前的 token,不能偷看未来
实现方式:下三角矩阵
位置: 0    1    2    3    4
    ┌────┬────┬────┬────┬────┐
  0 │ 1  │ 0  │ 0  │ 0  │ 0  │ ← "我"只能看自己
    ├────┼────┼────┼────┼────┤
  1 │ 1  │ 1  │ 0  │ 0  │ 0  │ ← "喜欢"能看"我"和自己
    ├────┼────┼────┼────┼────┤
  2 │ 1  │ 1  │ 1  │ 0  │ 0  │ ← "猫"能看前三个
    ├────┼────┼────┼────┼────┤
  3 │ 1  │ 1  │ 1  │ 1  │ 0  │
    ├────┼────┼────┼────┼────┤
  4 │ 1  │ 1  │ 1  │ 1  │ 1  │
    └────┴────┴────┴────┴────┘
English Explanation
Core constraint: the current token can attend only to itself and earlier tokens; it cannot peek at future tokens
Implementation: Lower triangular matrix
Pos:  0    1    2    3    4
    ┌────┬────┬────┬────┬────┐
  0 │ 1  │ 0  │ 0  │ 0  │ 0  │ ← "I" can see only itself
    ├────┼────┼────┼────┼────┤
  1 │ 1  │ 1  │ 0  │ 0  │ 0  │ ← "like" can see "I" and itself
    ├────┼────┼────┼────┼────┤
  2 │ 1  │ 1  │ 1  │ 0  │ 0  │ ← "cats" can see the first three positions
    ├────┼────┼────┼────┼────┤
  3 │ 1  │ 1  │ 1  │ 1  │ 0  │
    ├────┼────┼────┼────┼────┤
  4 │ 1  │ 1  │ 1  │ 1  │ 1  │
    └────┴────┴────┴────┴────┘
代码案例 | Code Example
import torch

def create_causal_mask(seq_len):
    """创建因果掩码 | Create causal mask"""
    # torch.tril: 保留下三角(含对角线)| Keep lower triangle (including diagonal)
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

# 可视化 | Visualize
seq_len = 5
mask = create_causal_mask(seq_len)
print("Causal Mask:")
print(mask)

# 在 Attention 中使用 | Use in Attention
def apply_causal_mask(scores):
    """
    scores: (batch, heads, seq, seq)
    将未来位置设为 -inf | Set future positions to -inf
    """
    seq_len = scores.size(-1)
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, device=scores.device))
    # 将 mask=0 的位置设为 -inf | Set mask=0 positions to -inf
    # (seq, seq) 的掩码会广播到 batch 和 heads 维度 | The (seq, seq) mask broadcasts over batch and heads
    scores = scores.masked_fill(causal_mask == 0, float('-inf'))
    return scores

# 测试 | Test
scores = torch.randn(1, 1, 5, 5)
masked_scores = apply_causal_mask(scores)
print("\nOriginal scores:")
# decimals 以关键字参数传入 | Pass decimals as a keyword argument
print(scores[0, 0].round(decimals=2))
print("\nMasked scores (after softmax, future positions get 0 weight):")
weights = torch.softmax(masked_scores, dim=-1)
print(weights[0, 0].round(decimals=4))
---
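On the masking step in 15.2: why are future positions filled with -inf rather than 0? The following is a minimal, framework-free sketch of a single attention row. The scores are made-up illustrative values and `softmax` is a helper defined here, not a library call. Because exp(-inf) = 0, masked positions receive exactly zero weight, whereas filling with 0 would still leave them a positive weight.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the token at position 1 (it may see positions 0 and 1).
scores = [0.5, 1.5, 2.0, -1.0, 0.3]
visible = [1, 1, 0, 0, 0]  # row 1 of the 5x5 lower-triangular mask

neg_inf = float("-inf")
masked_with_inf = [s if keep else neg_inf for s, keep in zip(scores, visible)]
masked_with_zero = [s if keep else 0.0 for s, keep in zip(scores, visible)]

weights_inf = softmax(masked_with_inf)
weights_zero = softmax(masked_with_zero)

print("fill with -inf:", [round(w, 4) for w in weights_inf])
print("fill with 0   :", [round(w, 4) for w in weights_zero])
```

With the -inf fill, the last three weights are exactly zero and the first two sum to 1. With the 0 fill, the "future" positions still take part in the weighted sum, which is the leak the mask is meant to prevent.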
15.3 自回归生成 | Autoregressive Generation
中文解释
核心逻辑:一次生成一个 token
输入: "今天天气"
Step 1: 预测下一个 token
    "今天天气" → 模型 → "很"
Step 2: 将生成的 token 加入输入
    "今天天气很" → 模型 → "好"
Step 3: 继续
    "今天天气很好" → 模型 → "。"
输出: "今天天气很好。"
English Explanation
Core logic: Generate one token at a time
Input: "The weather today"
Step 1: Predict next token
    "The weather today" → model → "is"
Step 2: Add generated token to input
    "The weather today is" → model → "nice"
Step 3: Continue
    "The weather today is nice" → model → "."
Output: "The weather today is nice."
代码案例:贪心解码 | Code Example: Greedy Decoding
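The three steps above can first be traced with a toy stand-in for the model. This sketch uses a hypothetical lookup table (`NEXT_TOKEN`, invented for this illustration) that maps the last token to the next one. A real model conditions on the whole prefix and returns logits over the vocabulary, so only the loop structure carries over.

```python
# Hypothetical next-token table standing in for a trained model.
NEXT_TOKEN = {"today": "is", "is": "nice", "nice": ".", ".": "<eos>"}

def toy_model(tokens):
    """Return the next token given the sequence so far (looks only at the last token)."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)   # Step 1: predict the next token
        if next_token == "<eos>":        # stop at the end-of-sequence marker
            break
        tokens.append(next_token)        # Step 2: feed it back as input
    return tokens

output = generate(["The", "weather", "today"])
print(" ".join(output))
```

The loop has the same shape as the decoding functions in this section: call the model on the sequence so far, pick one token, append it, and repeat until an end marker or a length limit is reached.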
import torch

@torch.no_grad()
def greedy_generate(model, start_tokens, max_length=50, eos_token_id=0):
    """
    贪心解码:每次选概率最高的 token | Greedy decoding: always pick highest probability token
    start_tokens: (1, seq),仅支持 batch=1 | batch size 1 only
    eos_token_id: 假设 0 是 EOS | Assumes 0 is EOS
    """
    model.eval()
    tokens = start_tokens.clone()
    for _ in range(max_length):
        # 前向传播 | Forward
        logits = model(tokens)  # (1, seq, vocab)
        # 取最后一个位置的 logits | Take logits at last position
        next_logits = logits[0, -1, :]  # (vocab,)
        # 贪心:选概率最高的 | Greedy: pick highest probability
        next_token = torch.argmax(next_logits).unsqueeze(0).unsqueeze(0)  # (1, 1)
        # 拼接 | Concatenate
        tokens = torch.cat([tokens, next_token], dim=1)
        # 如果生成了结束符,停止 | If EOS generated, stop
        if next_token.item() == eos_token_id:
            break
    return tokens

# 温度采样 | Temperature Sampling
@torch.no_grad()
def temperature_sample(model, start_tokens, max_length=50, temperature=1.0):
    """
    温度采样:控制随机性 | Temperature sampling: control randomness
    temperature < 1 更确定,> 1 更随机 | temperature < 1 is sharper, > 1 is more random
    """
    model.eval()
    tokens = start_tokens.clone()
    for _ in range(max_length):
        logits = model(tokens)
        next_logits = logits[0, -1, :] / temperature
        # 计算概率 | Compute probabilities
        probs = torch.softmax(next_logits, dim=-1)
        # 采样 | Sample
        next_token = torch.multinomial(probs, num_samples=1).unsqueeze(0)  # (1, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

# Top-k 采样 | Top-k Sampling
@torch.no_grad()
def top_k_sample(model, start_tokens, max_length=50, k=50, temperature=1.0):
    """
    Top-k 采样:只从概率最高的 k 个 token 中采样 | Top-k: only sample from top k tokens
    """
    model.eval()
    tokens = start_tokens.clone()
    for _ in range(max_length):
        logits = model(tokens)
        next_logits = logits[0, -1, :] / temperature
        # k 不能超过词表大小 | k cannot exceed the vocabulary size
        k = min(k, next_logits.size(-1))
        # 取 top-k | Take top-k
        top_k_logits, top_k_indices = torch.topk(next_logits, k)
        top_k_probs = torch.softmax(top_k_logits, dim=-1)
        # 在 top-k 中采样 | Sample among top-k
        sampled_idx = torch.multinomial(top_k_probs, num_samples=1)
        # 映射回词表中的 token id | Map back to the vocabulary token id
        next_token = top_k_indices[sampled_idx].unsqueeze(0)  # (1, 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
---
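To see what dividing the logits by `temperature` does in the sampling functions of 15.3, here is a framework-free sketch on three made-up logits. The values and the `softmax_with_temperature` helper are illustrative assumptions. A temperature below 1 sharpens the distribution toward the top token, a temperature above 1 flattens it, and keeping only the top-k logits before the softmax removes the tail entirely.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

# Top-k with k=2: keep the two largest logits, renormalize over them only.
k = 2
top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
top_k_probs = softmax_with_temperature([logits[i] for i in top_k])
print("top-2 token ids:", top_k, "probs:", [round(p, 3) for p in top_k_probs])
```

As the temperature approaches 0 the distribution concentrates on the largest logit, so sampling behaves more and more like greedy decoding.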
15.4 Next Token Prediction — 训练目标 | Next Token Prediction — Training Objective
中文解释
训练方式:预测下一个 token
输入:"今天天气",标签:"天天气很",即标签是输入向左错开一位(每个位置的标签就是它的下一个 token)
Loss 计算:
- 模型输出:["今", "天", "天", "气"] 每个位置的 logits
- 标签:["天", "天", "气", "很"]
- 每个位置预测下一个 token 的 CrossEntropy Loss
English Explanation
Training method: Predict the next token
Input: "The weather"; labels: "weather is". The labels are the input shifted left by one position, so each position's label is the token that follows it.
Loss computation:
- Model output: logits for each position of ["The", "weather"]
- Labels: ["weather", "is"]
- Cross-entropy loss for predicting the next token at each position
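The pairing of positions and labels can be made concrete with a small framework-free sketch. The token list and the `make_training_pairs` helper are illustrative and not part of any library. Under the causal mask, position i sees tokens 0..i and is trained to predict token i+1.

```python
def make_training_pairs(tokens):
    """Split one sequence into model inputs and next-token labels."""
    inputs = tokens[:-1]   # every token except the last
    labels = tokens[1:]    # every token except the first
    return inputs, labels

tokens = ["The", "weather", "is"]
inputs, labels = make_training_pairs(tokens)

for i, label in enumerate(labels):
    context = inputs[: i + 1]  # what position i can see under the causal mask
    print(f"position {i}: context={context} -> label={label!r}")
```

A single sequence of length n therefore yields n-1 training signals in one forward pass, one per position.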
代码案例 | Code Example
import torch
import torch.nn as nn

def compute_loss(model, input_ids):
    """
    计算 Next Token Prediction Loss | Compute Next Token Prediction Loss
    input_ids: (batch, seq),输入 token IDs | input token IDs
    """
    # 前向传播 | Forward
    logits = model(input_ids)  # (batch, seq, vocab)
    # 目标:每个位置预测下一个 token | Target: predict next token at each position
    # logits[:, :-1, :] 预测 input_ids[:, 1:] | logits[:, :-1, :] predicts input_ids[:, 1:]
    logits_for_pred = logits[:, :-1, :]  # (batch, seq-1, vocab)
    targets = input_ids[:, 1:]  # (batch, seq-1)
    # 计算 loss | Compute loss
    loss = nn.functional.cross_entropy(
        logits_for_pred.reshape(-1, logits_for_pred.size(-1)),
        targets.reshape(-1),
        ignore_index=0,  # 忽略 padding token | Ignore padding token
    )
    return loss

# 示例 | Example
vocab_size = 1000
seq_len = 10
input_ids = torch.randint(1, vocab_size, (4, seq_len))  # (batch, seq)

# 模拟模型 | Simulate model
class SimpleModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 128)
        self.head = nn.Linear(128, vocab_size)

    def forward(self, x):
        return self.head(self.emb(x))

model = SimpleModel(vocab_size)
loss = compute_loss(model, input_ids)
print(f"Loss: {loss.item():.4f}")
---
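As a sanity check on the loss printed by the 15.4 example: a model that spreads probability uniformly over the vocabulary has a cross-entropy of ln(vocab_size) per token, so an untrained model should typically report a loss in that neighborhood (random initial logits usually push it somewhat higher). The sketch below only evaluates that baseline for the `vocab_size = 1000` used in the example, with the standard library alone.

```python
import math

vocab_size = 1000  # same value as in the example

# Cross-entropy of a uniform prediction: -ln(1 / V) = ln(V)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"Uniform-guess loss for V={vocab_size}: {uniform_loss:.4f}")
```

A loss far above this baseline at initialization, or one that never drops below it during training, is usually a sign that the inputs and targets are misaligned.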
15.5 KV Cache — 推理加速的关键 | KV Cache — Key to Inference Acceleration
中文解释
问题:自回归生成时,每次都要重新计算所有位置的 K 和 V
解决:缓存之前计算好的 K 和 V
生成第5个 token 时:
- 需要 Q_5 @ K_{1:5}.T
- K_{1:4} 已经在生成前4个 token 时算过了
- 只需计算 K_5(V_5 同理),然后拼接到缓存中
English Explanation
Problem: In autoregressive generation, K and V for all positions are recomputed every time
Solution: Cache previously computed K and V
When generating the 5th token:
- Need Q_5 @ K_{1:5}.T
- K_{1:4} was already computed when generating first 4 tokens
- Only K_5 (and likewise V_5) needs to be computed, then appended to the cache
代码案例 | Code Example
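Before the implementation, a rough count of the savings. This is a sketch under simplifying assumptions: one K/V projection per token position per layer, prompt processing ignored, and `n_new` a hypothetical number of generated tokens. Without a cache, step t recomputes K and V for all t positions, so the total grows quadratically. With a cache, each position is projected once.

```python
def kv_projections_without_cache(n_new):
    """Step t recomputes K/V for positions 1..t, for t = 1..n_new."""
    return sum(t for t in range(1, n_new + 1))  # = n_new * (n_new + 1) / 2

def kv_projections_with_cache(n_new):
    """Each position's K/V is computed once and then reused."""
    return n_new

for n_new in (5, 100, 1000):
    no_cache = kv_projections_without_cache(n_new)
    cached = kv_projections_with_cache(n_new)
    print(f"n={n_new}: without cache={no_cache}, with cache={cached}")
```

The attention dot products against all cached keys are still needed at every step. The cache removes only the repeated projections, at the price of the memory that holds K and V.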
import torch

class KVCache:
    """KV Cache 实现 | KV Cache implementation"""

    def __init__(self, num_layers, num_heads, max_batch_size, max_seq_len, d_head):
        """
        预分配缓存空间 | Pre-allocate cache space
        """
        self.num_layers = num_layers
        # 每个 layer 有 K 和 V 两个缓存 | Each layer has K and V caches
        self.k_cache = torch.zeros(num_layers, max_batch_size, num_heads, max_seq_len, d_head)
        self.v_cache = torch.zeros(num_layers, max_batch_size, num_heads, max_seq_len, d_head)
        # 已缓存的 token 数 | Number of tokens already cached
        self.seq_len = 0

    def update(self, layer_idx, k, v):
        """
        更新缓存 | Update cache
        k, v: (batch, heads, 1, d_head),新计算的 K/V | Newly computed K/V
        batch 需与 max_batch_size 一致(或为 1)| batch must match max_batch_size (or be 1)
        """
        self.k_cache[layer_idx, :, :, self.seq_len, :] = k.squeeze(2)
        self.v_cache[layer_idx, :, :, self.seq_len, :] = v.squeeze(2)

    def get(self, layer_idx):
        """
        获取缓存的 K/V(含刚写入的当前位置)| Get cached K/V (including the position just written)
        返回: (batch, heads, seq_len + 1, d_head) | Returns: (batch, heads, seq_len + 1, d_head)
        """
        k = self.k_cache[layer_idx, :, :, :self.seq_len + 1, :]
        v = self.v_cache[layer_idx, :, :, :self.seq_len + 1, :]
        return k, v

    def increment(self):
        """
        增加序列长度 | Increment sequence length
        所有层处理完当前 token 后调用一次 | Call once after all layers have processed the current token
        """
        self.seq_len += 1

# 使用 KV Cache 的 Attention | Attention with KV Cache
def attention_with_kv_cache(Q, K_new, V_new, kv_cache, layer_idx):
    """
    Q: (batch, heads, 1, d_head),当前 token 的 Query | Query of the current token
    K_new, V_new: (batch, heads, 1, d_head),当前 token 的 K/V | K/V of the current token
    """
    # 更新缓存 | Update cache
    kv_cache.update(layer_idx, K_new, V_new)
    # 获取所有缓存的 K/V | Get all cached K/V
    K_all, V_all = kv_cache.get(layer_idx)
    # K_all: (batch, heads, seq_len + 1, d_head)
    # 计算 attention,按 sqrt(d_head) 缩放 | Compute attention, scaled by sqrt(d_head)
    # 无需 causal mask:缓存中只有当前及之前的位置 | No causal mask needed: the cache holds only current and earlier positions
    scores = torch.matmul(Q, K_all.transpose(-2, -1)) / (Q.size(-1) ** 0.5)  # (B, H, 1, seq)
    weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(weights, V_all)  # (B, H, 1, d_head)
    return output
---
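For the KV Cache of 15.5, a small framework-free check that caching changes the cost but not the result. Everything in this sketch is an illustrative assumption: tokens are 2-dimensional vectors, `project_k` and `project_v` are made-up stand-ins for the learned K/V projections, and a single head with no batch dimension is used. The attention output for the newest token is computed twice, once by re-projecting every token at every step and once by appending to a cache, and the two results are compared.

```python
import math

calls = {"recompute": 0, "cached": 0}  # token positions projected per strategy

def project_k(x):
    """Stand-in for the learned key projection (hypothetical fixed weights)."""
    return [0.5 * x[0] + 0.1 * x[1], -0.3 * x[0] + 0.8 * x[1]]

def project_v(x):
    """Stand-in for the learned value projection (hypothetical fixed weights)."""
    return [x[0] - x[1], 0.2 * x[0] + 0.4 * x[1]]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]  # toy token embeddings

outputs_recompute, outputs_cached = [], []
k_cache, v_cache = [], []

for t in range(len(tokens)):
    q = tokens[t]  # use the embedding itself as the query, for simplicity

    # Without cache: re-project every token seen so far.
    keys = [project_k(x) for x in tokens[: t + 1]]
    values = [project_v(x) for x in tokens[: t + 1]]
    calls["recompute"] += t + 1
    outputs_recompute.append(attend(q, keys, values))

    # With cache: project only the newest token and append.
    k_cache.append(project_k(tokens[t]))
    v_cache.append(project_v(tokens[t]))
    calls["cached"] += 1
    outputs_cached.append(attend(q, k_cache, v_cache))

print("same outputs:", outputs_recompute == outputs_cached)
print("projected positions:", calls)
```

Both strategies feed `attend` the same keys and values, so the outputs match. Only the number of projections differs, and the gap widens as the sequence grows.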
本章总结 | Chapter Summary
中文:
- Decoder-only 是现代大模型的标准架构
- Causal Mask 保证每个位置只能看到自己和之前的 token,不会偷看未来
- 自回归 = 一次生成一个 token
- 训练目标 = Next Token Prediction
- KV Cache 是推理加速的关键优化
English:
- Decoder-only is the standard architecture for modern LLMs
- Causal Mask ensures each position attends only to itself and earlier tokens, never to future ones
- Autoregressive = generate one token at a time
- Training objective = Next Token Prediction
- KV Cache is the key optimization for inference acceleration
---
课后练习 | Homework
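- Causal Mask:手写因果掩码,验证它确实阻止了未来信息的泄露 | Causal Mask: implement a causal mask by hand and verify that it blocks leakage of future information
- 生成策略:实现 greedy、temperature、top-k 三种生成策略,比较输出差异 | Generation strategies: implement greedy, temperature, and top-k decoding, and compare their outputs
- 训练循环:实现一个完整的 GPT 训练循环(forward + loss + backward)| Training loop: implement a complete GPT training loop (forward + loss + backward)
- KV Cache:在已有 Attention 基础上加入 KV Cache,验证推理速度提升 | KV Cache: add a KV Cache to an existing Attention implementation and verify the inference speedup
- 序列长度实验:测试不同序列长度下,有/无 KV Cache 的推理时间差异 | Sequence-length experiment: measure inference time with and without a KV Cache at different sequence lengths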