Chapter 7: Autograd, the Soul of PyTorch
Stage: Phase 2, PyTorch and Training Systems
Duration: 6~8 hours
---
Learning Objectives
- Truly understand what loss.backward() actually does
- Master the concept of the computational graph
- Understand the meaning of requires_grad and grad_fn
- Be able to manually trace and verify gradient computations
---
7.1 What is Autograd?
Autograd = automatic differentiation = automatic gradient computation
In deep learning:
- You need the gradient of the loss with respect to millions of parameters
- Deriving those gradients by hand is impractical and error-prone
- Autograd records a computation graph during the forward pass, then backpropagates through it to compute every gradient
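As a rough sanity check of what "automatic gradient computation" means, the sketch below compares the gradient returned by autograd with a central finite-difference estimate. The function f, the input values, and the step size eps are illustrative choices, not part of the chapter.

```python
import torch

def f(x):
    # Illustrative scalar function: f(x) = sum(x_i^3)
    return (x ** 3).sum()

# float64 keeps the finite-difference estimate numerically stable
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.clone()  # analytic answer is 3 * x_i^2

# Central differences on an untracked view of the same values
x0 = x.detach()
eps = 1e-6
numeric_grad = torch.zeros_like(x0)
for i in range(x0.numel()):
    step = torch.zeros_like(x0)
    step[i] = eps
    numeric_grad[i] = (f(x0 + step) - f(x0 - step)) / (2 * eps)

print(f"autograd: {autograd_grad}")
print(f"numeric:  {numeric_grad}")
print(f"match: {torch.allclose(autograd_grad, numeric_grad, atol=1e-6)}")
```

The finite-difference loop needs two function evaluations per input element, which is why it only works as a check on tiny inputs; autograd obtains all partial derivatives from a single backward pass.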
Core Concepts
Computational graph:
input x ──→ [op1] ──→ intermediate z ──→ [op2] ──→ output y
- x is a leaf tensor created with requires_grad=True; its grad_fn is None
- z and y are produced by operations, so each carries a grad_fn that records the op that created it
Forward pass: compute from x to y
Backward pass: compute gradients from y back to x
---
7.2 requires_grad: Controlling Gradients
Code Example
import torch
# requires_grad defaults to False
x = torch.tensor([2.0, 3.0])
print(f"x.requires_grad: {x.requires_grad}")  # False
# Enable gradient tracking at creation time
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"x.requires_grad: {x.requires_grad}")  # True
# Or enable it in place with .requires_grad_()
x = torch.tensor([2.0, 3.0])
x.requires_grad_(True)
print(f"x.requires_grad: {x.requires_grad}")  # True
# Operations on a tracked tensor are recorded automatically
y = x * 2
print(f"y.requires_grad: {y.requires_grad}")  # True
print(f"y.grad_fn: {y.grad_fn}")  # MulBackward0: records the multiplication
z = y.sum()
print(f"z.requires_grad: {z.requires_grad}")  # True
print(f"z.grad_fn: {z.grad_fn}")  # SumBackward0
Leaf Nodes vs Non-Leaf Nodes
import torch
x = torch.tensor([2.0], requires_grad=True)  # leaf node
y = x * 3  # non-leaf node
z = y + 1  # non-leaf node
# Ask autograd to keep the gradient of the intermediate tensor y.
# retain_grad() must be called before backward().
y.retain_grad()
print(f"x.is_leaf: {x.is_leaf}")  # True: created by the user
print(f"y.is_leaf: {y.is_leaf}")  # False: produced by an operation
print(f"z.is_leaf: {z.is_leaf}")  # False
# By default only leaf tensors keep their gradient after backward()
z.backward()
print(f"x.grad: {x.grad}")  # tensor([3.])
print(f"y.grad (retained): {y.grad}")  # tensor([1.]), kept because of retain_grad()
print(f"z.grad: {z.grad}")  # None: non-leaf without retain_grad() (PyTorch also emits a UserWarning)
---
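One detail the leaf-node example can leave unclear is that is_leaf does not mean "requires a gradient". As a small sketch of documented PyTorch behavior, tensors that do not require gradients are also reported as leaves, and detach() turns a non-leaf result into a leaf that is cut off from the graph. The variable names are illustrative.

```python
import torch

data = torch.tensor([1.0, 2.0])                    # plain input, not tracked
w = torch.tensor([0.5, 0.5], requires_grad=True)   # trainable leaf
out = data * w                                     # non-leaf, has a grad_fn
cut = out.detach()                                 # same values, no graph history

for name, t in [("data", data), ("w", w), ("out", out), ("cut", cut)]:
    print(f"{name}: is_leaf={t.is_leaf}, requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")

out.sum().backward()
print(f"w.grad: {w.grad}")        # d(sum(data * w))/dw equals data
print(f"data.grad: {data.grad}")  # None: data never asked for a gradient
```

In practice only leaves with requires_grad=True (such as model parameters) end up with a populated .grad.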
7.3 backward(): Backpropagation
Code Example: Gradient of a Simple Function
import torch
# f(x) = x^2; find the derivative at x = 3
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2  # f(x) = x^2
y.backward()  # dy/dx = 2x = 6
print("f(x) = x^2")
print(f"x = {x.item()}")
print(f"y = {y.item()}")
print(f"dy/dx = {x.grad.item()}")  # 6.0
# Manual check:
# df/dx = 2x = 2*3 = 6 ✓
Gradient of a Multivariate Function
import torch
# f(x, y) = x^2 + 2y; find the partial derivatives
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = x**2 + 2*y  # f = x^2 + 2y
z.backward()
print("f(x,y) = x^2 + 2y")
print(f"∂f/∂x = {x.grad.item()}")  # 2x = 6
print(f"∂f/∂y = {y.grad.item()}")  # 2
# Manual check:
# ∂f/∂x = 2x = 6 ✓
# ∂f/∂y = 2 ✓
Gradient with Vector Input
import torch
# f(x) = ||x||^2 = sum(x_i^2)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.pow(2).sum()  # 1 + 4 + 9 = 14
y.backward()
print(f"x: {x.detach()}")  # detach() is the recommended replacement for .data
print(f"y: {y.item()}")
print(f"∇f: {x.grad}")  # [2, 4, 6]: each element is 2*x_i
# Manual check:
# ∂f/∂x_i = 2*x_i = [2, 4, 6] ✓
---
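The examples in Section 7.3 all call backward() on a scalar. When the output is a vector, backward() needs a gradient argument of the same shape, which autograd uses as the vector in a vector-Jacobian product. A minimal sketch, with the weighting vector v chosen purely for illustration:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # vector output, shape (3,)

raised = False
try:
    y.backward()  # no implicit gradient exists for a non-scalar output
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Rebuild the output and pass an explicit weighting vector.
# x.grad then holds v^T J, where J = diag(2 * x) is the Jacobian of y.
v = torch.tensor([1.0, 0.1, 0.01])
y = x ** 2
y.backward(gradient=v)
print(f"x.grad: {x.grad}")  # v * 2x
```

Passing torch.ones_like(y) as the gradient gives the same result as calling y.sum().backward().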
7.4 Visualizing the Computation Graph
Code Example
import torch
# Build a multi-layer computation
x = torch.tensor(2.0, requires_grad=True)
# Layer 1
a = x * 2  # a = 2x
b = a + 3  # b = 2x + 3
# Layer 2
c = b ** 2  # c = (2x + 3)^2
# Layer 3
d = c / 2  # d = (2x + 3)^2 / 2
e = d + 1  # e = (2x + 3)^2 / 2 + 1
print("Computation graph nodes:")
print(f"x.grad_fn: {x.grad_fn}")  # None: leaf node
print(f"a.grad_fn: {a.grad_fn}")  # MulBackward0
print(f"b.grad_fn: {b.grad_fn}")  # AddBackward0
print(f"c.grad_fn: {c.grad_fn}")  # PowBackward0
print(f"d.grad_fn: {d.grad_fn}")  # DivBackward0
print(f"e.grad_fn: {e.grad_fn}")  # AddBackward0
e.backward()
print(f"\nFinal gradient: de/dx = {x.grad.item()}")
# Manual verification:
# e = ((2x + 3)^2) / 2 + 1
# de/dx = (2*(2x+3)*2) / 2 = 2*(2x+3) = 2*(7) = 14
print(f"Manual check: 2*(2*2+3) = {2*(2*2+3)}")
---
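Printing each grad_fn by hand works for a short chain like the one in Section 7.4. As a hedged sketch, the same graph can also be walked programmatically through grad_fn.next_functions, which links each backward node to the nodes that produced its inputs. The helper name walk is hypothetical, not a PyTorch API.

```python
import torch

def walk(node, depth=0, names=None):
    """Depth-first walk over the autograd nodes reachable from `node`."""
    if names is None:
        names = []
    if node is None:  # constant inputs (e.g. the literal 3) have no node
        return names
    names.append(type(node).__name__)
    print("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        walk(parent, depth + 1, names)
    return names

x = torch.tensor(2.0, requires_grad=True)
e = ((x * 2 + 3) ** 2) / 2 + 1  # same computation as a, b, c, d, e
names = walk(e.grad_fn)
```

The walk starts at the node of the last operation and ends at an AccumulateGrad node, which is where the gradient for the leaf x gets written into x.grad.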
7.5 Gradient Accumulation and Zeroing
Important: PyTorch gradients accumulate by default!
Each backward() call adds to .grad instead of overwriting it. For a tensor whose gradient is 1, this means:
- First backward() → grad = 1
- Second backward() → grad = 2 (accumulated!)
- So in a training loop you must clear the gradients, for example with optimizer.zero_grad(), before each new backward pass
Code Example
import torch
x = torch.tensor(2.0, requires_grad=True)
# First backward pass
y = x ** 2
y.backward()
print(f"After 1st backward: grad = {x.grad.item()}")  # 4.0
# Second backward pass (without clearing)
y2 = x ** 2
y2.backward()
print(f"After 2nd backward: grad = {x.grad.item()}")  # 8.0: accumulated!
# Correct approach: zero the gradient, then recompute
x.grad.zero_()  # zero the gradient in place
y3 = x ** 2
y3.backward()
print(f"After zero + backward: grad = {x.grad.item()}")  # 4.0: correct!
---
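Section 7.5 mentions optimizer.zero_grad() but its example clears the gradient by hand. The sketch below shows where the call usually sits in a training loop, using torch.optim.SGD. The loss function, learning rate, and step count are illustrative assumptions.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(5):
    optimizer.zero_grad()   # clear the gradient left over from the previous step
    loss = (w - 3.0) ** 2   # illustrative loss with its minimum at w = 3
    loss.backward()         # w.grad = 2 * (w - 3)
    optimizer.step()        # w <- w - lr * w.grad
    print(f"step {step}: w = {w.item():.4f}, loss = {loss.item():.4f}")
```

Leaving out the zero_grad() call would make each step use the sum of all previous gradients, so w would overshoot instead of moving steadily toward 3.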
7.6 Disabling Gradient Tracking
Code Example
import torch
x = torch.tensor(2.0, requires_grad=True)
# Method 1: no_grad, recommended for inference
with torch.no_grad():
    y = x * 2
print(f"y.requires_grad: {y.requires_grad}")  # False
# Method 2: detach, returns an untracked tensor that shares the same data (not a copy)
y = (x * 2).detach()
print(f"y.requires_grad: {y.requires_grad}")  # False
# Method 3: torch.inference_mode (PyTorch 1.9+)
with torch.inference_mode():
    y = x * 2
print(f"y.requires_grad: {y.requires_grad}")  # False
# Use cases:
# - Skipping gradient computation during inference/evaluation
# - Copying parameters to the CPU
# - Values that do not need gradients
---
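Another common use of no_grad is updating parameters by hand. An in-place update on a leaf tensor that requires gradients raises a RuntimeError unless it runs with tracking disabled. A minimal sketch of one manual gradient-descent step, with an illustrative loss and learning rate:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
lr = 0.1

loss = (w * 2.0 - 4.0) ** 2  # illustrative loss with its minimum at w = 2
loss.backward()              # w.grad = 2 * (2w - 4) * 2 = -8 at w = 1

with torch.no_grad():
    w -= lr * w.grad         # in-place update that autograd does not record
w.grad.zero_()               # clear the gradient before the next iteration

print(f"w = {w.item():.4f}")
print(f"is_leaf={w.is_leaf}, requires_grad={w.requires_grad}")
```

Because the update happens in place under no_grad, w stays a leaf that requires gradients and can be used in the next forward pass unchanged.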
7.7 Autograd and Attention
Code Example
import torch
# Use PyTorch autograd to get the gradients of an attention layer
d_model = 4
seq_len = 2
# Learnable weights
W_q = torch.randn(d_model, d_model, requires_grad=True)
W_k = torch.randn(d_model, d_model, requires_grad=True)
W_v = torch.randn(d_model, d_model, requires_grad=True)
# Input
X = torch.randn(seq_len, d_model)
# Forward pass
Q = X @ W_q
K = X @ W_k
V = X @ W_v
scores = Q @ K.T / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)
output = weights @ V
# Simulated loss
loss = output.sum()
# Backward pass
loss.backward()
print("Gradient check:")
print(f"W_q.grad shape: {W_q.grad.shape}")  # (4, 4): has a gradient!
print(f"W_k.grad shape: {W_k.grad.shape}")
print(f"W_v.grad shape: {W_v.grad.shape}")
print(f"X.grad: {X.grad}")  # None: the input does not require gradients
# Conclusion: PyTorch automatically computed gradients for all learnable parameters in Attention!
---
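Checking gradient shapes shows that gradients exist, not that they are correct. One way to verify the values is torch.autograd.gradcheck, which compares autograd's result against finite differences. The sketch below wraps the same attention computation in a hypothetical helper named attention and uses float64 inputs, which gradcheck expects for its default tolerances.

```python
import torch

def attention(X, W_q, W_k, W_v):
    # Same single-head attention as in Section 7.7, wrapped in a function
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / (W_q.shape[1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ V

torch.manual_seed(0)
d_model, seq_len = 4, 2
X = torch.randn(seq_len, d_model, dtype=torch.float64)
params = [
    torch.randn(d_model, d_model, dtype=torch.float64, requires_grad=True)
    for _ in range(3)
]

# Raises an error on mismatch; returns True when analytic and
# numerical gradients agree for every input that requires grad.
ok = torch.autograd.gradcheck(attention, (X, *params))
print(f"gradcheck passed: {ok}")
```

Only the three weight matrices are checked here, since X does not require gradients.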
Chapter Summary
- Autograd automatically builds the computation graph and backpropagates through it to compute gradients
- requires_grad=True marks tensors that need gradients
- backward() triggers backpropagation, computing gradients from the output back to the inputs
- Gradients accumulate by default; call zero_grad() at the start of each iteration
- Use no_grad() during inference to save memory and computation
---
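Homework
- Basic differentiation: use PyTorch to compute the derivative of f(x) = sin(x^2) at x = 2, and compare it with a manual calculation
- Multivariate gradient: compute the gradient of f(x,y,z) = x*y + y*z + z*x at (1,2,3)
- Gradient zeroing: write a loop that simulates 5 iterations, correctly zeroing and computing the gradient each time
- Chain rule: build a 5-layer computation graph, derive the gradient formula by hand, then verify it with PyTorch
- Attention gradients: implement single-head attention in PyTorch and print the gradient shape of every parameter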