Chapter 26: Transformer & LLM Pre-training

Stage: Stage 6, Unsupervised Learning and Generative AI
Duration: 5-6 hours

---

Learning Objectives

Master the complete Transformer architecture: Encoder + Decoder
Understand Self-Attention, Cross-Attention, FFN, residual connections, and LayerNorm
Understand why positional encoding is necessary and the mathematical form of sinusoidal encoding
Master the difference between the GPT (autoregressive) and BERT (bidirectional) pre-training objectives
Understand the core idea of the pretrain-finetune paradigm and the Scaling Law
Know the pre-training pipeline and data engineering of modern LLMs

---

26.1 Transformer Architecture

Explanation

Encoder-Decoder structure

Input → [Encoder × N] → [Decoder × N] → Output

Inside the Encoder

Input embedding + positional encoding
  → Self-Attention → Add & Norm
  → FFN → Add & Norm

Inside the Decoder

Output embedding + positional encoding
  → Masked Self-Attention → Add & Norm
  → Cross-Attention → Add & Norm
  → FFN → Add & Norm

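Key design choices

Component	Role
Self-Attention	Information exchange among positions within the encoder or within the decoder
Cross-Attention	Decoder queries attend to encoder outputs (e.g. source-target alignment in translation)
FFN	Applies the same nonlinear transformation to each position independently
Residual connection	Mitigates vanishing gradients, allowing deep stacks
LayerNorm	Stabilizes training; used instead of BatchNorm because it normalizes each token on its own and so suits variable-length sequences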
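To make the encoder-decoder data flow concrete, here is a minimal sketch that uses PyTorch's built-in nn.Transformer rather than a hand-written stack. The layer counts and sizes are illustrative assumptions, and the inputs are random stand-ins for already-embedded sequences (nn.Transformer itself adds no token embeddings or positional encodings). It also checks the effect of the causal mask in the decoder: perturbing the last target position should leave the outputs at earlier positions unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, src_len, tgt_len = 64, 10, 7  # illustrative sizes

# PyTorch's built-in encoder-decoder stack (self-attention, cross-attention, FFN, Add & Norm).
model = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=256, dropout=0.1, batch_first=True,
)
model.eval()  # disable dropout so repeated calls are comparable

src = torch.randn(2, src_len, d_model)  # stand-in for embedded source tokens
tgt = torch.randn(2, tgt_len, d_model)  # stand-in for embedded (shifted) target tokens

# Additive causal mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

with torch.no_grad():
    out = model(src, tgt, tgt_mask=causal_mask)

    # Perturb only the last target position and run the model again.
    tgt_changed = tgt.clone()
    tgt_changed[:, -1] += 1.0
    out_changed = model(src, tgt_changed, tgt_mask=causal_mask)

print(out.shape)  # (batch, tgt_len, d_model)
# Outputs before the perturbed position are expected to match up to floating-point error.
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```

The encoder sees the whole source at once, while each decoder position sees only earlier target positions plus the full encoder output through cross-attention.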

English Explanation

Encoder: bidirectional self-attention (understands the input).
Decoder: causal self-attention + cross-attention (generates the output).

---

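26.2 Positional Encoding

Explanation

The problem

Self-Attention is permutation-equivariant: without positional information, reordering the input tokens merely reorders the outputs in the same way.

Each word in "I love you" and "you love I" receives exactly the same representation, just at a different position!

The model has no notion of word order.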
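The order-blindness described above can be checked numerically. The following is a minimal sketch with a single attention head and random weights; the three input vectors are hypothetical stand-ins for the embeddings of "I", "love", "you", and the function name self_attention is chosen here for illustration.

```python
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d = 8
x = torch.randn(3, d, dtype=torch.float64)  # stand-in embeddings for "I", "love", "you"
w_q, w_k, w_v = (torch.randn(d, d, dtype=torch.float64) for _ in range(3))

perm = torch.tensor([2, 1, 0])  # reorder to "you love I"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the input only permutes the output rows: each word gets the same vector.
# Expected to be True up to floating-point error.
print(torch.allclose(out[perm], out_perm))
```

Because each word's output vector is identical in both orderings, nothing downstream can tell the two sentences apart unless position is injected into the input.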

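Sinusoidal positional encoding

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Properties

Each position has a unique encoding
Relative positions are linearly representable: for a fixed offset k, PE(pos+k) is a linear function of PE(pos)
It is defined for every position, so in principle it can extrapolate to sequences longer than those seen in training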
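The formula and the linearity property can be verified with a short sketch. The helper names sinusoidal_pe and shift_by are introduced here for illustration. The check relies on the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on the offset k and the pair's frequency, not on pos.

```python
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int):
    """Return the (max_len, d_model) encoding table and the per-pair frequencies."""
    pos = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)  # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float64)       # 0, 2, 4, ...
    freq = torch.exp(-math.log(10000.0) * two_i / d_model)         # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions
    return pe, freq

def shift_by(pe_row, freq, k: int):
    """Apply the fixed linear map (one 2x2 rotation per sin/cos pair) that advances a position by k."""
    s, c = pe_row[0::2], pe_row[1::2]
    sin_k, cos_k = torch.sin(freq * k), torch.cos(freq * k)
    out = torch.empty_like(pe_row)
    out[0::2] = s * cos_k + c * sin_k  # sin(a + b) = sin a cos b + cos a sin b
    out[1::2] = c * cos_k - s * sin_k  # cos(a + b) = cos a cos b - sin a sin b
    return out

pe, freq = sinusoidal_pe(max_len=50, d_model=64)
# PE(5 + 3) should equal the rotation of PE(5) by offset 3, up to floating-point error.
print(torch.allclose(shift_by(pe[5], freq, k=3), pe[8]))
```

The same shift_by map works for any starting position, which is what "linearly representable" means here.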

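Learned positional encoding

Many models (e.g. GPT, BERT) replace the sinusoidal encoding with a learned embedding:

position_embedding = nn.Embedding(max_len, d_model)

This often works as well or better in practice, but it cannot extrapolate to sequences longer than max_len, because no embedding exists for those positions.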
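As a minimal sketch of how a learned position embedding is typically combined with token embeddings, consider the module below. The class name, vocabulary size, and max_len are assumptions made for illustration, not values taken from any particular model.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Token embedding plus one learned vector per absolute position (illustrative)."""

    def __init__(self, vocab_size=100, max_len=16, d_model=64):
        super().__init__()
        self.max_len = max_len
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, token_ids):
        seq_len = token_ids.shape[1]
        if seq_len > self.max_len:
            # The table has no row for positions >= max_len, so longer inputs cannot be encoded.
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.max_len}")
        positions = torch.arange(seq_len, device=token_ids.device)  # (seq_len,)
        # Position vectors broadcast over the batch dimension.
        return self.token_embedding(token_ids) + self.position_embedding(positions)

embed = LearnedPositionalEmbedding()
tokens = torch.randint(0, 100, (2, 10))
print(embed(tokens).shape)  # (batch, seq_len, d_model)
```

The explicit length check makes the extrapolation limit visible: a sequence of 17 tokens has no embedding for its last position.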

English Explanation

Sinusoidal PE: unique per position; PE(pos+k) is a linear function of PE(pos).

Learned PE: often works as well or better, but cannot extrapolate beyond max_len.

---

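26.3 GPT vs BERT

Explanation

Property	GPT	BERT
Architecture	Decoder-only	Encoder-only
Attention	Causal (attends only to the left)	Bidirectional (attends to everything)
Pre-training objective	Next-token prediction	Masked language modeling (MLM)
Typical use	Text generation	Text understanding
Representative models	GPT-3/4, LLaMA	BERT, RoBERTa

GPT's autoregressive pre-training

Input: "The cat sat"
Target: predict "on" → predict "the" → predict "mat"
Loss: -log P(on | The cat sat) - log P(the | The cat sat on) - ...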
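The autoregressive loss can be sketched as a shifted cross-entropy. The logits here are random stand-ins for the output of a causal language model, and the vocabulary size and the helper name next_token_loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, batch, seq_len = 50, 2, 6  # illustrative sizes
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
# Stand-in for a causal LM's output: one logit vector per position.
logits = torch.randn(batch, seq_len, vocab_size)

def next_token_loss(logits, token_ids):
    """Average of -log P(token[t+1] | tokens[:t+1]) over all predicted positions."""
    pred = logits[:, :-1]      # position t predicts ...
    target = token_ids[:, 1:]  # ... the token at position t+1
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target.reshape(-1))

loss = next_token_loss(logits, token_ids)
print(loss.item())
```

In training, every position contributes a term in a single forward pass; the causal mask is what keeps position t from seeing the token it is asked to predict.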

BERT's MLM pre-training

Input: "The [MASK] sat on the mat"
Target: predict [MASK] = "cat"
Loss: -log P(cat | The [MASK] sat on the mat)
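A minimal sketch of the MLM loss follows. It assumes a hypothetical vocabulary in which id 0 plays the role of [MASK], fixes the masked positions by hand instead of sampling them (BERT masks about 15% of tokens at random), and uses random logits as a stand-in for the encoder output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, mask_id = 50, 0  # hypothetical vocabulary; id 0 stands in for [MASK]
token_ids = torch.randint(1, vocab_size, (2, 8))

# Positions to mask, fixed here for reproducibility.
is_masked = torch.zeros_like(token_ids, dtype=torch.bool)
is_masked[0, 1] = True
is_masked[1, 4] = True

inputs = token_ids.masked_fill(is_masked, mask_id)  # what the encoder sees
labels = token_ids.masked_fill(~is_masked, -100)    # -100 marks positions to ignore

# Stand-in for the output of a bidirectional encoder applied to `inputs`.
logits = torch.randn(2, 8, vocab_size)

# Only masked positions contribute to the loss.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())
```

Unlike the autoregressive loss, only the masked positions produce a training signal, but each prediction can use context on both sides.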

BERT also includes a next sentence prediction (NSP) task, but it was later found to bring little benefit, and RoBERTa dropped it.

English Explanation

GPT: autoregressive next-token prediction. BERT: masked language model.

---

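26.4 Scaling Law

Explanation

Core finding

Model performance (test loss) follows a power law in each of three factors, provided the other two are not the bottleneck:

Loss ∝ C^(-α),  Loss ∝ N^(-β),  Loss ∝ D^(-γ)

C: compute (FLOPs)
N: number of model parameters
D: amount of training data (tokens)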
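A power law becomes a straight line in log-log space, which is how such exponents are usually estimated. The sketch below fits Loss = a · N^(-β) to synthetic data; the constants (a = 10, β = 0.08) and the noise level are made up for illustration and are not measurements from any real model.

```python
import numpy as np

# Synthetic (made-up) losses that follow Loss = a * N^(-beta) with a little noise.
rng = np.random.default_rng(0)
true_a, true_beta = 10.0, 0.08
n_params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = true_a * n_params ** (-true_beta) * np.exp(rng.normal(0.0, 0.005, n_params.size))

# log L = log a - beta * log N, so a degree-1 fit in log space recovers beta and a.
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
beta_hat, a_hat = -slope, np.exp(intercept)
print(f"fitted beta = {beta_hat:.3f}, fitted a = {a_hat:.2f}")

# Extrapolate the fitted line to a larger (hypothetical) model.
print(f"predicted loss at N = 1e11: {a_hat * 1e11 ** (-beta_hat):.3f}")
```

The extrapolation step is the practical payoff: small runs are used to predict the loss of a model too expensive to train speculatively.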

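Why it matters

Scaling up model size, data, and compute together improves performance predictably
This is the empirical basis for the "scale works wonders" approach of the large-model era
GPT-3, GPT-4, and LLaMA all put the Scaling Law into practice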

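Findings of the Chinchilla paper

The 2022 Chinchilla paper from DeepMind showed that:

Earlier large models (such as GPT-3) were under-trained: too little data for their size
Compute-optimal ratio: training tokens ≈ 20 × parameters
For example, a 70B-parameter model should be trained on about 1.4T tokens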
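The 20-tokens-per-parameter rule is simple arithmetic, sketched below. The compute estimate uses the common approximation C ≈ 6 · N · D, which is an assumption brought in here and is not stated in this chapter; both function names are illustrative.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal token count: about 20 tokens per parameter."""
    return tokens_per_param * n_params

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation C ≈ 6 * N * D (an assumption, not from this chapter)."""
    return 6.0 * n_params * n_tokens

for name, n in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    d = chinchilla_optimal_tokens(n)
    print(f"{name}: {d / 1e12:.2f}T tokens, ~{approx_training_flops(n, d):.2e} FLOPs")
```

By this rule a 175B-parameter model would call for roughly 3.5T tokens, which is the sense in which earlier large models are described as under-trained.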

English Explanation

Scaling Law: performance predictably improves with compute, parameters, and data.

Chinchilla: optimal data ≈ 20× parameters.

---

26.5 Full Implementation

Code Example

python

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention: Q, K, and V are all projected from the same input x."""

    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension of each head

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection after merging heads

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # Project, then split into heads: (batch, seq, d_model) -> (batch, heads, seq, d_k)
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: (batch, heads, seq, seq)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 get -inf, i.e. zero weight after softmax
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)
        # Merge heads back: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.W_o(out)

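class TransformerBlock(nn.Module):
    """One Post-LN Transformer block: self-attention and FFN, each followed by Add & Norm."""

    def __init__(self, d_model=64, num_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network, applied to each position independently
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: self-attention, then residual connection and LayerNorm
        attn_out = self.attn(x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: FFN, then residual connection and LayerNorm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x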

# ========== Test ==========
print("=" * 50)
print("Transformer Block Test")
print("=" * 50)

x = torch.randn(2, 10, 64)  # batch=2, seq=10, dim=64
block = TransformerBlock(d_model=64, num_heads=4)
out = block(x)
print(f"Input:  {x.shape}")
print(f"Output: {out.shape}")

# Causal mask
def create_causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend only to positions j <= i
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq)

mask = create_causal_mask(10)
out_masked = block(x, mask=mask)
print(f"With causal mask: {out_masked.shape}")
print("Transformer Block = Attention + FFN + Residual + LayerNorm")

# Count parameters
total = sum(p.numel() for p in block.parameters())
print(f"\nParameters per block: {total:,}")
print("GPT-3 (96 layers, d=12288, heads=96): ~175B parameters")
print("LLaMA-2 7B (32 layers, d=4096, heads=32): ~7B parameters")

Output:

==================================================
Transformer Block Test
==================================================
Input:  torch.Size([2, 10, 64])
Output: torch.Size([2, 10, 64])
With causal mask: torch.Size([2, 10, 64])
Transformer Block = Attention + FFN + Residual + LayerNorm

Parameters per block: 49,984
GPT-3 (96 layers, d=12288, heads=96): ~175B parameters
LLaMA-2 7B (32 layers, d=4096, heads=32): ~7B parameters
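The parameter count of one block with d_model=64 and d_ff=256 can also be derived in closed form, which gives a check on what block.parameters() reports. The sketch below assumes the same layout as the block in this section (biased Linear layers and two LayerNorms); the function name is illustrative. The last lines assume d_ff = 4 × d_model to show that 96 such blocks at GPT-3's width account for most of the quoted ~175B parameters, with embeddings not counted.

```python
def transformer_block_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one block with biased Linear layers and two LayerNorms."""
    attention = 4 * (d_model * d_model + d_model)  # W_q, W_k, W_v, W_o with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # scale and shift per LayerNorm
    return attention + ffn + layer_norms

print(f"{transformer_block_params(64, 256):,}")  # → 49,984

# With d_ff = 4 * d_model, the count is roughly 12 * d_model^2 per block.
gpt3_blocks = 96 * transformer_block_params(12288, 4 * 12288)
print(f"{gpt3_blocks / 1e9:.1f}B parameters in 96 blocks of width 12288")
```

The quadratic dependence on d_model is why widening a model grows its parameter count much faster than adding layers.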

---

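26.6 Common Pitfalls

1. "BERT is better than GPT"

Neither is better in absolute terms; each fits different scenarios:

Classification / understanding tasks → BERT
Generation / dialogue tasks → GPT
Current trend: Decoder-only models (the GPT architecture) can also handle understanding tasks after instruction tuning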

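2. Pre-trained ≠ ready to use

A pre-trained model has only learned language modeling; it does not reliably follow instructions or stick to a required format. It typically has to go through SFT (supervised fine-tuning) and RLHF before it becomes a chat assistant like ChatGPT.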

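3. Positional encoding is not a cure-all

Sinusoidal positional encoding can extrapolate in principle, but in practice it extrapolates poorly (attention distributions degrade on long sequences). Modern methods such as ALiBi, RoPE, and NTK-aware scaling (an extension of RoPE) specifically target long-sequence extrapolation.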
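To give a flavor of how one of these methods works, here is a minimal sketch of an ALiBi-style attention bias. Instead of adding position vectors to the input, it adds a penalty proportional to the query-key distance to the attention scores before softmax. The slope schedule assumes the number of heads is a power of two, and the function name alibi_bias is chosen here for illustration.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi-style additive attention bias of shape (num_heads, seq_len, seq_len)."""
    # Head-specific slopes: 2^(-8/H), 2^(-16/H), ... (assumes H is a power of two).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # Distance i - j for keys at or before the query (j <= i); 0 elsewhere.
    distance = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0)
    return -slopes.view(-1, 1, 1) * distance  # farther keys get a larger penalty

bias = alibi_bias(num_heads=4, seq_len=6)
print(bias[0])  # bias matrix of the first head
```

The bias would be added to the attention scores together with a causal mask. Since it depends only on relative distance, it is defined for any sequence length, which is the property that makes this family of methods attractive for extrapolation.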

---

Chapter Summary

Transformer = Self-Attention + FFN + Residual + LayerNorm
Encoder: bidirectional understanding; Decoder: causal generation
Positional encoding injects sequence order (sinusoidal or learned)
GPT: autoregressive pre-training, generative model
BERT: MLM pre-training, understanding model
Scaling Law: loss falls as a power law in compute / parameters / data
Chinchilla: optimal training tokens ≈ 20 × parameters
After pre-training, SFT + RLHF are needed to obtain a chat model

---

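Homework

Positional encoding property: prove that PE(pos+k) can be expressed as a linear function of PE(pos).

Causal mask implementation: implement Masked Self-Attention with a boolean mask and verify causality.

Parameter comparison: compare the parameter counts and computational complexity (per step and over the full sequence) of a Transformer and an RNN.

Pre-training objectives: implement the loss functions for GPT's next-token prediction and BERT's MLM.

Scaling Law verification: train small models (with varying parameter counts and data sizes) and verify the power-law relationship between loss and N / D.

Long-sequence extrapolation: compare sinusoidal PE, RoPE, and ALiBi when extrapolating beyond the training length.

Compute estimation: estimate the total FLOPs and GPU hours needed to train GPT-3 (175B parameters, 300B tokens).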