Chapter 7: Autograd, the Soul of PyTorch

Stage: Phase 2, PyTorch and Training Systems | Duration: 6-8 hours

---

Learning Objectives

Understand what loss.backward() actually does
Understand the computational graph and how autograd builds it
Know what requires_grad and grad_fn mean
Be able to trace and verify gradient computations by hand

---

7.1 What is Autograd?

Explanation

Autograd = automatic differentiation: PyTorch computes the gradients for you.

In deep learning:

You need the gradient of the loss with respect to millions of parameters
Deriving those gradients by hand is impractical and error-prone
Autograd records the operations of the forward pass in a computation graph, then walks that graph backward, applying the chain rule, to compute every gradient
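
To make the "millions of parameters" point concrete, here is a minimal sketch. The layer size, the batch size, and the squared-output loss are illustrative assumptions chosen only so that the parameter count is about one million.

```python
import torch

torch.manual_seed(0)

# A single linear layer with about one million parameters (sizes are illustrative)
model = torch.nn.Linear(1000, 1000)
n_params = sum(p.numel() for p in model.parameters())

# Forward pass on a random batch, reduced to a scalar loss
batch = torch.randn(8, 1000)
loss = model(batch).pow(2).mean()

# One backward() call fills .grad for every parameter
loss.backward()
all_have_grad = all(
    p.grad is not None and p.grad.shape == p.shape
    for p in model.parameters()
)

print(f"parameters: {n_params}")              # 1001000 (weights plus biases)
print(f"all have gradients: {all_have_grad}")
```

A single backward pass produces a gradient for each of the 1,001,000 values, which is the work autograd saves you from doing by hand.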

Core Concepts

Computational graph:

  input x ──→ [op1] ──→ intermediate z ──→ [op2] ──→ output y
     ↑                         ↑                         ↑
 requires_grad=True    grad_fn (records op1)     grad_fn (records op2)

 Forward: compute from x to y
 Backward: compute gradients from y back to x

---

7.2 requires_grad: Controlling Gradients

Code Example

import torch

# requires_grad defaults to False
x = torch.tensor([2.0, 3.0])
print(f"x.requires_grad: {x.requires_grad}")   # False

# Enable gradient tracking at creation time
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"x.requires_grad: {x.requires_grad}")   # True

# Or enable it in place with .requires_grad_()
x = torch.tensor([2.0, 3.0])
x.requires_grad_(True)
print(f"x.requires_grad: {x.requires_grad}")   # True

# Operations on x are tracked automatically
y = x * 2
print(f"y.requires_grad: {y.requires_grad}")   # True
print(f"y.grad_fn: {y.grad_fn}")   # MulBackward0: records the multiplication

z = y.sum()
print(f"z.requires_grad: {z.requires_grad}")   # True
print(f"z.grad_fn: {z.grad_fn}")   # SumBackward0
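
The tracking rule can be stated more precisely: a result requires gradients as soon as at least one of its inputs does. The short sketch below illustrates this with two hypothetical tensors, `a` (tracked) and `b` (untracked).

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)   # tracked
b = torch.tensor([3.0, 4.0])                        # not tracked

tracked = a + b      # one tracked input is enough
untracked = b * 2    # no tracked input, so nothing is recorded

print(tracked.requires_grad, tracked.grad_fn)       # True, with an AddBackward0 node
print(untracked.requires_grad, untracked.grad_fn)   # False None
```

This is why plain input data can be mixed freely with learnable weights: only the paths that touch a tracked tensor end up in the graph.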

Leaf Nodes vs Non-Leaf Nodes

import torch

x = torch.tensor([2.0], requires_grad=True)   # leaf node
y = x * 3                                       # non-leaf node
z = y + 1                                       # non-leaf node

print(f"x.is_leaf: {x.is_leaf}")    # True: created directly by the user
print(f"y.is_leaf: {y.is_leaf}")    # False: produced by an operation
print(f"z.is_leaf: {z.is_leaf}")    # False

# By default only leaf tensors keep their .grad after backward()
z.backward()
print(f"x.grad: {x.grad}")    # tensor([3.])
print(f"y.grad: {y.grad}")    # None (PyTorch also warns), because y is a non-leaf

# To keep an intermediate gradient, call retain_grad() BEFORE backward().
# The graph is rebuilt here because the first backward() freed its buffers.
y = x * 3
y.retain_grad()
z2 = y * 2
z2.backward()
print(f"y.grad (retained): {y.grad}")   # tensor([2.])
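
Two corner cases of the leaf rule are easy to trip over, so a small sketch may help. The tensor names are hypothetical, and the dtype conversion stands in for any operation applied right after creation.

```python
import torch

# A tensor that does not require grad is always a leaf, even if an op produced it
plain = torch.tensor([1.0]) * 2
print(plain.is_leaf, plain.requires_grad)    # True False

# Pitfall: a conversion applied after requires_grad=True is itself a tracked op,
# so the result is a non-leaf and its .grad stays None after backward()
w_nonleaf = torch.randn(3, requires_grad=True).double()
print(w_nonleaf.is_leaf)                     # False

# Creating the tensor directly in its final dtype keeps it a leaf
w_leaf = torch.randn(3, dtype=torch.float64, requires_grad=True)
print(w_leaf.is_leaf)                        # True

w_leaf.sum().backward()
print(w_leaf.grad)                           # a gradient of ones
```

If a parameter's .grad is unexpectedly None after backward(), checking is_leaf is a quick first diagnostic.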

---

7.3 backward(): Backpropagation

Code Example: Gradient of a Simple Function

import torch

# f(x) = x^2, find the derivative at x = 3
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2          # f(x) = x^2

y.backward()        # dy/dx = 2x = 6
print("f(x) = x^2")
print(f"x = {x.item()}")
print(f"y = {y.item()}")
print(f"dy/dx = {x.grad.item()}")   # 6.0

# Check by hand:
# df/dx = 2x = 2*3 = 6 ✓
Gradient of a Multivariate Function

import torch

# f(x, y) = x^2 + 2y, find the partial derivatives
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = x**2 + 2*y      # f = x^2 + 2y

z.backward()

print("f(x,y) = x^2 + 2y")
print(f"∂f/∂x = {x.grad.item()}")   # 2x = 6
print(f"∂f/∂y = {y.grad.item()}")   # 2

# Check by hand:
# ∂f/∂x = 2x = 6 ✓
# ∂f/∂y = 2 ✓
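
backward() is not the only way to get these partial derivatives. As a sketch, the same function can be differentiated with torch.autograd.grad, which returns the gradients directly instead of writing them into .grad.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = x**2 + 2*y

# Returns one gradient per input, in the order the inputs are given
dz_dx, dz_dy = torch.autograd.grad(z, (x, y))

print(dz_dx.item(), dz_dy.item())   # 6.0 2.0
print(x.grad, y.grad)               # None None: .grad is left untouched
```

This form is convenient when you want a gradient as a value, for example inside a larger computation, without touching the accumulated .grad fields.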

Gradient with Vector Input

import torch

# f(x) = ||x||^2 = sum(x_i^2)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.pow(2).sum()   # 1 + 4 + 9 = 14

y.backward()

print(f"x: {x.detach()}")
print(f"y: {y.item()}")
print(f"∇f: {x.grad}")   # [2, 4, 6]: each element is 2*x_i

# Check by hand:
# ∂f/∂x_i = 2*x_i = [2, 4, 6] ✓
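
The example above reduces to a scalar with sum() before calling backward(). If the output is itself a vector, backward() needs an explicit `gradient` argument. The sketch below shows that case, assuming a vector of ones as the weighting.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x.pow(2)            # vector output, shape (3,)

# y.backward() with no argument would raise an error here, because an implicit
# gradient can only be created for scalar outputs.
# Passing a vector v computes the vector-Jacobian product v^T J.
v = torch.ones_like(y)
y.backward(gradient=v)

print(x.grad)           # tensor([2., 4., 6.]), same as y.sum().backward()
```

With v set to ones, the result equals the gradient of y.sum(), which is why summing first is the usual shortcut.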

---

7.4 Visualizing the Computation Graph

Code Example

import torch

# Build a multi-step computation
x = torch.tensor(2.0, requires_grad=True)

# Layer 1
a = x * 2           # a = 2x
b = a + 3           # b = 2x + 3

# Layer 2
c = b ** 2          # c = (2x + 3)^2

# Layer 3
d = c / 2           # d = (2x + 3)^2 / 2

e = d + 1           # e = (2x + 3)^2 / 2 + 1

print("Computation graph nodes:")
print(f"x.grad_fn: {x.grad_fn}")    # None: x is a leaf node
print(f"a.grad_fn: {a.grad_fn}")    # MulBackward0
print(f"b.grad_fn: {b.grad_fn}")    # AddBackward0
print(f"c.grad_fn: {c.grad_fn}")    # PowBackward0
print(f"d.grad_fn: {d.grad_fn}")    # DivBackward0
print(f"e.grad_fn: {e.grad_fn}")    # AddBackward0

e.backward()
print(f"\nFinal gradient: de/dx = {x.grad.item()}")   # 14.0

# Manual verification:
# e = ((2x + 3)^2) / 2 + 1
# de/dx = (2*(2x+3)*2) / 2 = 2*(2x+3) = 2*(7) = 14
print(f"Manual check: 2*(2*2+3) = {2*(2*2+3)}")
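
Printing each grad_fn by hand does not scale to larger graphs. As a rough sketch, the graph can instead be walked through the `next_functions` attribute of each backward node. The helper name `walk` is hypothetical, and the output is a plain indented list rather than a real drawing.

```python
import torch

def walk(fn, depth=0, lines=None):
    """Collect the names of backward nodes, depth-first, starting at fn."""
    if lines is None:
        lines = []
    if fn is not None:
        lines.append("  " * depth + type(fn).__name__)
        # Each entry is (node, input_index); node is None for untracked inputs
        for next_fn, _ in fn.next_functions:
            walk(next_fn, depth + 1, lines)
    return lines

x = torch.tensor(2.0, requires_grad=True)
e = ((x * 2 + 3) ** 2) / 2 + 1

graph = walk(e.grad_fn)
print("\n".join(graph))
```

The listing starts at the last operation and ends at an AccumulateGrad node, which is the point where the gradient is written into the leaf tensor x.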

---

7.5 Gradient Accumulation and Zeroing

Explanation

Important: PyTorch accumulates gradients by default. Each backward() call adds to the existing .grad instead of overwriting it.

For example, if one backward pass contributes a gradient of 1:

After the first backward(), grad = 1
After the second backward(), grad = 2 (accumulated)
So in a training loop, clear the gradients with optimizer.zero_grad() once per iteration, before calling backward()

Code Example

import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass
y = x ** 2
y.backward()
print(f"After 1st backward: grad = {x.grad.item()}")   # 4.0

# Second backward pass (without clearing)
y2 = x ** 2
y2.backward()
print(f"After 2nd backward: grad = {x.grad.item()}")   # 8.0: accumulated

# Correct way: zero the gradient, then recompute
x.grad.zero_()   # zero the gradient in place
y3 = x ** 2
y3.backward()
print(f"After zero + backward: grad = {x.grad.item()}")   # 4.0: correct
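
In practice the zeroing is done by the optimizer inside the training loop. The sketch below shows the usual order of calls on a toy problem. The data, the learning rate, and the step count are illustrative assumptions.

```python
import torch

# Fit a single weight w so that w * x matches y = 3 * x (toy data)
x = torch.tensor([1.0, 2.0, 3.0])
y = 3.0 * x
w = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.05)

for step in range(100):
    optimizer.zero_grad()               # clear gradients left from the previous step
    loss = ((w * x - y) ** 2).mean()
    loss.backward()                     # write fresh gradients into w.grad
    optimizer.step()                    # update w using w.grad

print(f"w = {w.item():.4f}")            # close to 3.0
```

Removing the zero_grad() line makes each step use the sum of all earlier gradients, which typically destabilizes training.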

---

7.6 Disabling Gradient Tracking

Code Example

import torch

x = torch.tensor(2.0, requires_grad=True)

# Method 1: no_grad, the usual choice for inference and evaluation
with torch.no_grad():
    y = x * 2
    print(f"y.requires_grad: {y.requires_grad}")   # False

# Method 2: detach, returns a tensor cut off from the graph
# (it shares memory with the original; it is not an independent copy)
y = (x * 2).detach()
print(f"y.requires_grad: {y.requires_grad}")   # False

# Method 3: torch.inference_mode (PyTorch 1.9+), a stricter variant of no_grad
with torch.inference_mode():
    y = x * 2
    print(f"y.requires_grad: {y.requires_grad}")   # False

# Use cases:
# - Skipping gradient computation during inference/evaluation
# - Copying parameters to the CPU
# - Values that do not need gradients
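
One detail of detach() is worth checking directly: the detached tensor shares memory with the original, so an independent copy needs an explicit clone(). The sketch below illustrates this with hypothetical names `d` and `c`.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# detach() returns a tensor that shares memory with x but is cut off from the graph
d = x.detach()
shares_memory = d.data_ptr() == x.data_ptr()
print(shares_memory)            # True

# An independent, untracked copy needs an explicit clone()
c = x.detach().clone()
c[0] = 100.0
print(x)                        # x is unchanged

# Inside no_grad nothing is recorded, so the result has no grad_fn
with torch.no_grad():
    y = x * 2
print(y.grad_fn)                # None
```

Because of the shared memory, in-place edits to a detached tensor also change the original, which is rarely what you want for a tensor that is still part of a graph.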

---

7.7 Autograd and Attention

Code Example

import torch

# Use PyTorch autograd to get the gradients of an attention layer

d_model = 4
seq_len = 2

# Learnable weights
W_q = torch.randn(d_model, d_model, requires_grad=True)
W_k = torch.randn(d_model, d_model, requires_grad=True)
W_v = torch.randn(d_model, d_model, requires_grad=True)

# Input
X = torch.randn(seq_len, d_model)

# Forward pass
Q = X @ W_q
K = X @ W_k
V = X @ W_v

scores = Q @ K.T / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)
output = weights @ V

# Stand-in for a real loss
loss = output.sum()

# Backward pass
loss.backward()

print("Gradient check:")
print(f"W_q.grad shape: {W_q.grad.shape}")   # torch.Size([4, 4])
print(f"W_k.grad shape: {W_k.grad.shape}")   # torch.Size([4, 4])
print(f"W_v.grad shape: {W_v.grad.shape}")   # torch.Size([4, 4])
print(f"X.grad: {X.grad}")                    # None: the input does not require grad

# Conclusion: PyTorch automatically computed gradients for all learnable parameters in Attention!
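
Checking shapes shows that gradients exist, not that they are right. As a sketch, the gradient for W_v can also be derived by hand and compared with autograd. The derivation assumes the same stand-in loss, output.sum(), and the fixed seed is only there for reproducibility.

```python
import torch

torch.manual_seed(0)
d_model, seq_len = 4, 2

W_q = torch.randn(d_model, d_model, requires_grad=True)
W_k = torch.randn(d_model, d_model, requires_grad=True)
W_v = torch.randn(d_model, d_model, requires_grad=True)
X = torch.randn(seq_len, d_model)

scores = (X @ W_q) @ (X @ W_k).T / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)
output = weights @ (X @ W_v)
output.sum().backward()

# Hand derivation for loss = output.sum():
#   dL/d(output) = ones, dL/dV = weights^T @ ones, dL/dW_v = X^T @ dL/dV
with torch.no_grad():
    ones = torch.ones(seq_len, d_model)
    manual_grad_W_v = X.T @ (weights.T @ ones)

matches = torch.allclose(W_v.grad, manual_grad_W_v, atol=1e-6)
print(f"W_v.grad matches hand derivation: {matches}")
```

W_v is the easy case because it sits after the softmax. The hand derivations for W_q and W_k have to pass through the softmax Jacobian, which is exactly the bookkeeping autograd takes off your hands.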

---

Chapter Summary

Autograd builds the computation graph during the forward pass and backpropagates through it to compute gradients
requires_grad=True marks the tensors that need gradients
backward() triggers backpropagation, computing gradients from the output back to the inputs
Gradients accumulate by default; call zero_grad() once per iteration, before backward()
Use no_grad() during inference to save memory and computation

---

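Homework

Basic differentiation: use PyTorch to compute the derivative of f(x) = sin(x^2) at x = 2 and compare it with a hand calculation
Multivariate gradient: compute the gradient of f(x, y, z) = x*y + y*z + z*x at (1, 2, 3)
Gradient zeroing: write a loop that simulates 5 iterations, zeroing and recomputing the gradient correctly each time
Chain rule: build a 5-layer computation graph, derive the gradient formula by hand, then verify it with PyTorch
Attention gradients: implement single-head attention in PyTorch and print the gradient shape of every parameter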