第17章:Diffusion 基础 — 从噪声中学习 | Chapter 17: Diffusion Basics — Learning from Noise
阶段定位 | Stage: 第四阶段 — Diffusion 预计学时 | Duration: 6~8 小时
---
学习目标 | Learning Objectives
中文:
- 理解 Diffusion 模型的核心思想:学习去噪
- 掌握 Forward Diffusion(加噪)和 Reverse Diffusion(去噪)的过程
- 理解 Noise Prediction Network 的训练目标
- 建立 Diffusion = 高维空间中的逐步去噪 的直觉
English:
- Understand core idea of Diffusion models: learning to denoise
- Master Forward Diffusion (adding noise) and Reverse Diffusion (denoising) processes
- Understand training objective of Noise Prediction Network
- Build intuition that Diffusion = gradual denoising in high-dimensional space
---
17.1 Diffusion 的本质 | The Essence of Diffusion
中文解释
Diffusion 模型不是直接生成图片,而是学习"去噪"的过程。
核心思想类比:
sculptor(雕塑家):不是从石头中"挖"出雕像,
而是逐步打磨,从粗糙到精细。
Diffusion:不是直接生成图片,
而是从纯噪声中逐步"去噪",恢复出图片。English Explanation
Diffusion models don't directly generate images; they learn the "denoising" process.
Core idea analogy:
A sculptor doesn't "carve out" a statue from stone,
but gradually polishes from rough to fine.
Diffusion: doesn't directly generate an image,
but gradually "denoises" from pure noise to recover an image.与 Transformer 的对比 | Comparison with Transformer
| 特性 | Transformer (GPT) | Diffusion (SD) |
|---|---|---|
| 生成方式 | 自回归(逐 token) | 并行(逐步去噪) |
| 输入 | 文本 token | 噪声图片 |
| 输出 | 下一个 token | 去噪后的图片 |
| 核心操作 | Attention | UNet 去噪 |
| 条件控制 | Prompt | Prompt via Cross-Attention |
| Feature | Transformer (GPT) | Diffusion (SD) |
|---|---|---|
| Generation | Autoregressive (token by token) | Parallel (gradual denoising) |
| Input | Text tokens | Noisy image |
| Output | Next token | Denoised image |
| Core operation | Attention | UNet denoising |
| Conditioning | Prompt | Prompt via Cross-Attention |
---
17.2 Forward Diffusion — 加噪过程 | Forward Diffusion — Adding Noise
中文解释
Forward Diffusion = 给图片逐步添加高斯噪声,直到变成纯噪声
公式:
x_t = √(1-β_t) × x_{t-1} + √β_t × ε
其中:| where:
- x_0: 原始图片 | original image
- x_T: 纯噪声(T 很大时)| pure noise (when T is large)
- β_t: 噪声调度(从小到大)| noise schedule (small to large)
- ε: 标准高斯噪声 | standard Gaussian noise重参数化(更高效的计算):
x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε
其中 ᾱ_t = ∏_{i=1}^t (1-β_i) | where ᾱ_t = product of (1-β_i)English Explanation
Forward Diffusion = Gradually add Gaussian noise to image until it becomes pure noise
Formula:
x_t = √(1-β_t) × x_{t-1} + √β_t × ε
where:
- x_0: original image
- x_T: pure noise (when T is large)
- β_t: noise schedule (small to large)
- ε: standard Gaussian noiseReparameterization (more efficient computation):
x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε
where ᾱ_t = ∏_{i=1}^t (1-β_i)代码案例 | Code Example
import torch
import torch.nn.functional as F
class ForwardDiffusion:
"""前向扩散过程 | Forward diffusion process"""
def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
self.num_timesteps = num_timesteps
# 线性噪声调度 | Linear noise schedule
self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
# 预计算 ᾱ_t | Precompute ᾱ_t
self.alphas = 1.0 - self.betas
self.alphas_cumprod = torch.cumprod(self.alphas, dim=0) # ᾱ_t
self.alphas_cumprod_prev = F.pad(self.alphas_cumprod[:-1], (1, 0), value=1.0)
# 预计算扩散过程需要的值 | Precompute values needed for diffusion
self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
def add_noise(self, x_0, t, noise=None):
"""
在任意时刻 t 加噪 | Add noise at arbitrary timestep t
x_0: (batch, channels, height, width) — 原始图片 | Original image
t: (batch,) — 时间步 | Timestep
"""
if noise is None:
noise = torch.randn_like(x_0)
# 获取对应时间步的系数 | Get coefficients for corresponding timestep
sqrt_alpha = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
sqrt_one_minus_alpha = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
# x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε
x_t = sqrt_alpha * x_0 + sqrt_one_minus_alpha * noise
return x_t, noise
# 测试 | Test
diffusion = ForwardDiffusion(num_timesteps=1000)
# 模拟一张图片 | Simulate an image
x_0 = torch.randn(4, 3, 32, 32) # 4张 32x32 的 RGB 图片
# 不同时间步的加噪结果 | Noise addition at different timesteps
for t_val in [0, 250, 500, 750, 999]:
t = torch.tensor([t_val] * 4)
x_t, noise = diffusion.add_noise(x_0, t)
print(f"t={t_val:4d}: mean={x_t.mean():.3f}, std={x_t.std():.3f}")
# 输出趋势:| Output trend:
# t= 0: 接近原始图片 | Close to original
# t= 999: 接近标准正态 | Close to standard normal---
17.3 Reverse Diffusion — 去噪过程 | Reverse Diffusion — Denoising
中文解释
Reverse Diffusion = 从纯噪声逐步恢复图片
如果我们知道每步添加的噪声,就可以逆向去噪:
x_{t-1} = (x_t - √β_t × ε_θ(x_t, t)) / √(1-β_t)但现实中我们不知道噪声,所以训练一个神经网络来预测:
ε_θ(x_t, t) ≈ ε(真实噪声)English Explanation
Reverse Diffusion = Gradually recover image from pure noise
If we knew the noise added at each step, we could reverse it:
x_{t-1} = (x_t - √β_t × ε_θ(x_t, t)) / √(1-β_t)But in reality we don't know the noise, so we train a neural network to predict it:
ε_θ(x_t, t) ≈ ε (true noise)代码案例:采样过程 | Code Example: Sampling Process
import torch
class ReverseDiffusion:
"""反向扩散(采样)| Reverse diffusion (sampling)"""
def __init__(self, diffusion_model, forward_diffusion):
self.model = diffusion_model # 噪声预测网络 | Noise prediction network
self.fd = forward_diffusion
@torch.no_grad()
def sample(self, shape, device='cpu'):
"""
从纯噪声生成图片 | Generate image from pure noise
shape: (batch, channels, height, width)
"""
# 从纯噪声开始 | Start from pure noise
x_t = torch.randn(shape, device=device)
# 逐步去噪 | Gradual denoising
for t in reversed(range(self.fd.num_timesteps)):
t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
# 预测噪声 | Predict noise
predicted_noise = self.model(x_t, t_batch)
# 计算去噪后的 x_{t-1} | Compute denoised x_{t-1}
alpha = self.fd.alphas[t]
alpha_cumprod = self.fd.alphas_cumprod[t]
beta = self.fd.betas[t]
# 去噪公式 | Denoising formula
x_t = (x_t - beta / torch.sqrt(1 - alpha_cumprod) * predicted_noise) / torch.sqrt(alpha)
# 添加少量噪声(最后一步除外)| Add small noise (except last step)
if t > 0:
noise = torch.randn_like(x_t)
sigma = torch.sqrt(beta)
x_t = x_t + sigma * noise
return x_t---
17.4 训练目标 — 噪声预测 | Training Objective — Noise Prediction
中文解释
训练目标:让网络预测 Forward Diffusion 中添加的噪声
Loss = MSE(ε_θ(x_t, t), ε)
其中:| where:
- ε_θ(x_t, t): 网络预测的噪声 | Network predicted noise
- ε: 实际添加的噪声(Forward 时使用)| Actually added noise (used in Forward)
- x_t: 加噪后的图片 | Noisy imageEnglish Explanation
Training objective: Make network predict the noise added in Forward Diffusion
Loss = MSE(ε_θ(x_t, t), ε)
where:
- ε_θ(x_t, t): noise predicted by network
- ε: actually added noise (used in Forward)
- x_t: noisy image代码案例:训练循环 | Code Example: Training Loop
import torch
import torch.nn as nn
class SimpleUNet(nn.Module):
"""简化的 UNet(噪声预测网络)| Simplified UNet (noise prediction network)"""
def __init__(self, in_channels=3, time_emb_dim=256):
super().__init__()
# 时间步嵌入 | Time step embedding
self.time_embed = nn.Sequential(
nn.Linear(1, time_emb_dim),
nn.SiLU(),
nn.Linear(time_emb_dim, time_emb_dim),
)
# 简化的编码器-解码器 | Simplified encoder-decoder
self.encoder = nn.Sequential(
nn.Conv2d(in_channels, 64, 3, padding=1),
nn.ReLU(),
nn.Conv2d(64, 64, 3, padding=1),
)
self.decoder = nn.Sequential(
nn.Conv2d(64, 64, 3, padding=1),
nn.ReLU(),
nn.Conv2d(64, in_channels, 3, padding=1),
)
def forward(self, x, t):
"""
x: (batch, channels, H, W) — 加噪图片 | Noisy image
t: (batch,) — 时间步 | Timestep
"""
# 时间嵌入 | Time embedding
t_emb = self.time_embed(t.float().unsqueeze(-1) / 1000)
# 简化的前向(实际 UNet 更复杂)| Simplified forward (real UNet is more complex)
h = self.encoder(x)
# 这里应该融合时间嵌入 | Time embedding should be fused here
out = self.decoder(h)
return out
# ========== 训练循环 | Training Loop ==========
def train_diffusion_model(model, dataloader, diffusion, num_epochs=10):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
for epoch in range(num_epochs):
total_loss = 0
for batch_idx, (x_0, _) in enumerate(dataloader):
batch_size = x_0.size(0)
# 随机选择时间步 | Randomly select timesteps
t = torch.randint(0, diffusion.num_timesteps, (batch_size,))
# 加噪 | Add noise
x_t, noise = diffusion.add_noise(x_0, t)
# 预测噪声 | Predict noise
predicted_noise = model(x_t, t)
# 计算 loss | Compute loss
loss = criterion(predicted_noise, noise)
# 优化 | Optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(dataloader)
print(f"Epoch {epoch}: Loss = {avg_loss:.4f}")
print("Diffusion 训练目标:| Training objective:")
print(" 输入:加噪图片 x_t 和时间步 t")
print(" 输出:预测的噪声 ε_θ")
print(" Loss: MSE(预测噪声, 真实噪声)")
print(" Input: noisy image x_t and timestep t")
print(" Output: predicted noise ε_θ")
print(" Loss: MSE(predicted noise, true noise)")---
17.5 Diffusion 的直观理解 | Intuitive Understanding of Diffusion
中文解释
Diffusion 可以看作高维空间中的梯度下降:
高维空间中的"概率地形":| "Probability landscape" in high-dimensional space:
噪声图片在概率低的地方(山谷外)| Noisy images are in low-probability areas (outside valleys)
真实图片在概率高的地方(山谷底)| Real images are in high-probability areas (valley bottom)
去噪过程 = 从随机位置向"真实图片流形"滑动
Denoising process = sliding from random positions toward "real image manifold"English Explanation
Diffusion can be seen as gradient descent in high-dimensional space:
"Probability landscape" in high-dimensional space:
Noisy images are in low-probability areas (outside valleys)
Real images are in high-probability areas (valley bottom)
Denoising process = sliding from random positions toward "real image manifold"---
本章总结 | Chapter Summary
中文:
- Diffusion = 学习从噪声中恢复图片
- Forward:x_0 → x_T(加噪)| Forward: x_0 → x_T (add noise)
- Reverse:x_T → x_0(去噪)| Reverse: x_T → x_0 (denoise)
- 训练目标:预测 Forward 中添加的噪声
- UNet 是噪声预测网络的标准选择
English:
- Diffusion = learning to recover images from noise
- Forward: x_0 → x_T (add noise)
- Reverse: x_T → x_0 (denoise)
- Training objective: predict noise added in Forward
- UNet is the standard choice for noise prediction network
---
课后练习 | Homework
- Forward Diffusion:实现加噪过程,可视化同一图片在不同 t 下的噪声程度
- 噪声预测:实现简化的噪声预测网络,训练一个小数据集
- 采样过程:实现完整的采样循环,从纯噪声生成图片
- 调度器实验:比较 linear 和 cosine 两种噪声调度器的差异
- 思考题:为什么 Diffusion 训练时只需要一步加噪,而不需要像推理那样循环 T 步?