Chapter 22: Attention Mechanism

Stage: Stage 5, Sequence Models and Attention | Duration: 4-5 hours

---

Learning Objectives

Understand how Attention removes the RNN information bottleneck
Master the mathematical definitions of Q/K/V and scaled dot-product attention
Understand the difference between Self-Attention and Cross-Attention
Implement a complete Self-Attention in NumPy
Understand why Attention can be computed in parallel
Master the causal constraint in Masked Self-Attention

---

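22.1 RNN Information Bottleneck

Detailed Explanation

The Problem

An RNN encoder compresses the entire input into a single fixed-size vector:

"I love you very much" → [0.2, -0.5, 0.8, ...]  (fixed at 256 dimensions)

For long sentences a lot of information is lost, much like summarizing a 500-page book in one sentence.

The Attention Solution

At every step the decoder "looks" directly at all inputs and dynamically decides where to focus:

Decoder step 1 (generating "我"): attends mainly to "I"
Decoder step 2 (generating "爱"): attends mainly to "love"
Decoder step 3 (generating "你"): attends mainly to "you"

Each decoder step has its own distribution of attention weights.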
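The per-step weighting described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the encoder states are hand-constructed orthogonal vectors, and the names `enc_states`, `attend`, and `src_tokens` are assumptions introduced only for this example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

src_tokens = ["I", "love", "you", "very", "much"]
d = 8
# Hypothetical encoder states: one scaled one-hot vector per source token,
# chosen so that the example is deterministic and easy to read
enc_states = 3.0 * np.eye(len(src_tokens), d)

def attend(dec_state, enc_states):
    """Return (context, weights) for one decoder step using dot-product scores."""
    scores = enc_states @ dec_state      # one score per source token
    weights = softmax(scores)            # distribution over source tokens
    context = weights @ enc_states       # weighted sum of encoder states
    return context, weights

# Assume the decoder state at step t resembles the t-th encoder state
for t in range(3):
    context, weights = attend(enc_states[t], enc_states)
    top = src_tokens[int(weights.argmax())]
    print(f"step {t + 1}: weights = {weights.round(3)}, top = {top}")
```

Each step produces a different weight distribution over the same five source tokens, so the context vector is no longer a single shared summary of the sentence.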

English Explanation

RNN bottleneck: all information is compressed into one fixed-size vector.

Attention solution: at each step the decoder looks at all inputs and weights them dynamically.

---

22.2 Scaled Dot-Product Attention

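Detailed Explanation

Core Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

What Q/K/V Mean

Symbol	Full name	Meaning	Analogy
Q	Query	What I am looking for	Search box input
K	Key	What I have	Document title/tags
V	Value	The actual content	Document body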

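Computation Steps

Step 1: scores = Q @ K^T                   → similarity between each query and every key
Step 2: weights = softmax(scores / √d_k)   → normalize each row into a probability distribution
Step 3: output = weights @ V               → weighted sum of the values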
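The three steps can be traced by hand on a tiny input. The sketch below uses one query, two keys, and d_k = 2; the specific numbers are made up for illustration.

```python
import numpy as np

Q = np.array([[1.0, 0.0]])                 # one query, d_k = 2
K = np.array([[1.0, 0.0], [0.0, 1.0]])     # two keys
V = np.array([[10.0, 0.0], [0.0, 10.0]])   # two values

d_k = Q.shape[-1]

# Step 1: similarity between the query and each key -> [[1, 0]]
scores = Q @ K.T

# Step 2: scale, then softmax over the keys (row-wise)
scaled = scores / np.sqrt(d_k)
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

# Step 3: weighted sum of the values
output = weights @ V

print("scores :", scores)
print("weights:", weights.round(4))   # roughly [[0.6698, 0.3302]]
print("output :", output.round(3))    # roughly [[6.698, 3.302]]
```

The query matches the first key more closely, so the output is pulled toward the first value, but the second value still contributes about a third of the result.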

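Why Divide by √d_k?

When d_k is large, the entries of QK^T have a large variance, because each dot product sums d_k random terms. If the components of q and k are independent with zero mean and unit variance, the dot product has variance d_k; for d_k=64 that is a variance of 64 (standard deviation 8). As a result:

Some scores become very large and others very small
softmax saturates (the largest score gets a weight close to 1, the rest close to 0)
Gradients become tiny, which makes training difficult

After dividing by √d_k the variance returns to ≈1, the softmax distribution is smoother, and gradients stay healthy.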
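The variance argument can be checked empirically. The sketch below assumes q and k have independent standard-normal components (the same assumption behind the variance ≈ d_k estimate); the sample size and the set of d_k values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000
results = {}

for d_k in (4, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)                    # one dot product per sample
    results[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(f"d_k={d_k:>3}  var(q·k)={results[d_k][0]:8.2f}  "
          f"var(q·k/√d_k)={results[d_k][1]:.2f}")

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Saturation: ten scores with standard deviation 16 (= √256), before and after scaling
scores = rng.standard_normal(10) * 16
max_unscaled = softmax(scores).max()
max_scaled = softmax(scores / 16).max()
print(f"largest softmax weight: unscaled={max_unscaled:.3f}, scaled={max_scaled:.3f}")
```

The unscaled variance grows roughly in proportion to d_k, while the scaled variance stays near 1. Dividing the scores by a factor greater than 1 can never increase the largest softmax weight, which is why scaling pulls the distribution away from saturation.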

English Explanation

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Divide by √d_k: prevents softmax saturation when d_k is large.

---

22.3 Self-Attention vs Cross-Attention

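Detailed Explanation

Self-Attention

Q, K, and V all come from the same sequence:

Every word in the sentence "looks at" every word, itself included
"The cat sat" → "The" looks at "cat" and "sat" to decide how to interpret itself

Used in: inside the encoder, and inside the decoder (causal version).

Cross-Attention

Q comes from the decoder; K/V come from the encoder:

The decoder queries the encoder's information
When generating "我", it looks up the encoder's representation of "I"

Used in: connecting the encoder and the decoder.

Comparison

Property	Self-Attention	Cross-Attention
Q source	Same sequence	Decoder
K/V source	Same sequence	Encoder
Mask	Causal mask optional	No causal mask
Use	Encoding and understanding	Translation/generation alignment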
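The difference in where Q and K/V come from shows up directly in the shapes. The following is a minimal sketch with random stand-ins for the encoder outputs and decoder states; `enc_out`, `dec_hid`, and the projection matrices are hypothetical names, and the sequence lengths are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d_model = 5, 3, 8

enc_out = rng.standard_normal((n_src, d_model))   # hypothetical encoder outputs
dec_hid = rng.standard_normal((n_tgt, d_model))   # hypothetical decoder states

# Random projections, scaled so the projected vectors keep roughly unit variance
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q = dec_hid @ W_q          # queries come from the decoder: (n_tgt, d_model)
K = enc_out @ W_k          # keys come from the encoder:    (n_src, d_model)
V = enc_out @ W_v          # values come from the encoder:  (n_src, d_model)

weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_tgt, n_src)
context = weights @ V                                # (n_tgt, d_model)

print("weights shape:", weights.shape)
print("context shape:", context.shape)
```

The weight matrix is rectangular: one row per target position and one column per source position. In Self-Attention the same computation yields a square matrix because both sides are the same sequence.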

English Explanation

Self-Attention: Q, K, V from same sequence. Cross-Attention: Q from decoder, K/V from encoder.

---

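22.4 Masked Self-Attention

Detailed Explanation

The Problem

When generating the t-th word, the decoder must not "peek" at later words:

When generating "love", it must not see "you"

Solution: Causal Mask

Before the softmax, set the scores of future positions to -∞:

scores = Q @ K^T
scores_masked = scores + mask  (mask: -∞ strictly above the diagonal, 0 elsewhere)
weights = softmax(scores_masked / √d_k)

Mask matrix (3×3 example):

[[0, -∞, -∞],
 [0,  0, -∞],
 [0,  0,  0]]

This ensures that position i can only see information from positions ≤ i.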
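One way to check the causal constraint is to change the future tokens and confirm that earlier outputs do not move. The sketch below assumes the simplest setting, Q=K=V=X with no projections; `causal_self_attention` is a name made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_self_attention(X):
    """Self-attention with Q=K=V=X and a causal mask (illustrative only)."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    # -inf strictly above the diagonal, 0 on and below it
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    weights = softmax(scores + mask)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))

# Same sequence, but with the last two tokens changed
X_changed = X.copy()
X_changed[2:] += 5.0

out, w = causal_self_attention(X)
out_changed, _ = causal_self_attention(X_changed)

print("future weights are exactly zero:", bool(np.all(np.triu(w, k=1) == 0)))
print("positions 0-1 unchanged:", np.allclose(out[:2], out_changed[:2]))
print("positions 2-3 unchanged:", np.allclose(out[2:], out_changed[2:]))
```

Positions 0 and 1 produce identical outputs in both runs because their rows of the weight matrix put zero weight on positions 2 and 3.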

English Explanation

Causal mask: set future positions to -∞ before softmax.

---

22.5 Complete Implementation: Self-Attention

Code Example

python

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating for numerical stability
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (seq_len, d_k)
    mask: (seq_len, seq_len), optional additive causal mask
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    if mask is not None:
        scores = scores + mask  # positions holding a large negative value are masked out

    weights = softmax(scores, axis=-1)  # normalize over the keys
    output = weights @ V
    return output, weights

# ========== Self-Attention test ==========
np.random.seed(1)
seq_len, d = 4, 8
X = np.random.randn(seq_len, d)

# Version without projections (Q=K=V=X)
Q, K, V = X, X, X
out, attn_weights = scaled_dot_product_attention(Q, K, V)

print("=" * 50)
print("Self-Attention test")
print("=" * 50)
print(f"Input X shape: {X.shape}")
print("Attention weight matrix:")
print(attn_weights.round(3))
print(f"\nOutput shape: {out.shape}")
print(f"Row-sum check: {attn_weights.sum(axis=1).round(3)} (each row should be ≈1)")
print("\nObservation: each output position is a weighted average of all inputs")

# ========== With separate projections ==========
print("\n" + "=" * 50)
print("Self-Attention with projections")
print("=" * 50)

W_q = np.random.randn(d, d) * 0.01
W_k = np.random.randn(d, d) * 0.01
W_v = np.random.randn(d, d) * 0.01

Q_proj = X @ W_q
K_proj = X @ W_k
V_proj = X @ W_v

out_proj, attn_proj = scaled_dot_product_attention(Q_proj, K_proj, V_proj)
print(f"Output shape after projection: {out_proj.shape}")

# ========== Causal Masked Self-Attention ==========
print("\n" + "=" * 50)
print("Causal Masked Self-Attention")
print("=" * 50)

# Build the upper-triangular mask (-1e9 stands in for -inf)
mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
print(f"Mask matrix:\n{mask}")

out_masked, attn_masked = scaled_dot_product_attention(Q, K, V, mask=mask)
print("\nAttention weights after masking:")
print(attn_masked.round(3))
print("Observation: the weights above the diagonal in every row are ≈0 (blocked by the mask)")

# Verify causality
for i in range(seq_len):
    for j in range(i+1, seq_len):
        assert attn_masked[i, j] < 0.01, f"position {i} should not see position {j}"
print("\n✓ Causality check passed: each position sees only itself and earlier positions")

Output (shapes, row sums, and the zero pattern follow from the code; the individual weight values are illustrative):

==================================================
Self-Attention test
==================================================
Input X shape: (4, 8)
Attention weight matrix:
[[0.289 0.252 0.241 0.218]
 [0.248 0.268 0.245 0.239]
 [0.237 0.246 0.267 0.25 ]
 [0.226 0.234 0.247 0.293]]

Output shape: (4, 8)
Row-sum check: [1. 1. 1. 1.] (each row should be ≈1)

Observation: each output position is a weighted average of all inputs

==================================================
Self-Attention with projections
==================================================
Output shape after projection: (4, 8)

==================================================
Causal Masked Self-Attention
==================================================
Mask matrix:
[[0.0e+00 -1.0e+09 -1.0e+09 -1.0e+09]
 [0.0e+00  0.0e+00 -1.0e+09 -1.0e+09]
 [0.0e+00  0.0e+00  0.0e+00 -1.0e+09]
 [0.0e+00  0.0e+00  0.0e+00  0.0e+00]]

Attention weights after masking:
[[1.    0.    0.    0.   ]
 [0.504 0.496 0.    0.   ]
 [0.335 0.331 0.334 0.   ]
 [0.252 0.249 0.25  0.249]]
Observation: the weights above the diagonal in every row are ≈0 (blocked by the mask)

✓ Causality check passed: each position sees only itself and earlier positions

---

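22.6 Attention Visualization and Interpretation

Detailed Explanation

What the Attention Weight Matrix Means

For a sentence, entry A[i,j] of the attention weight matrix is how much position i attends to position j.

Typical Patterns

Pattern	Description	Example
Strong diagonal	Each word attends mainly to itself	Basic self-attention
Coreference resolution	"it" attends to "cat"	The cat sat... it was tired
Syntactic relation	Verb attends to its subject	sat → cat
Long-range dependency	First and last words attend to each other	Subject and predicate in a long sentence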
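Reading A[i,j] as "row i attends to column j" is easier with token labels attached. The sketch below renders a weight matrix as a text table and reports the most-attended token for each row. The matrix values are hypothetical and not taken from a trained model; `render` and `describe_attention` are names made up for this example.

```python
import numpy as np

tokens = ["The", "cat", "sat"]

# Hypothetical attention weights (each row sums to 1); not from a trained model
A = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.60, 0.25],
    [0.05, 0.55, 0.40],
])

def render(A, tokens, width=6):
    """Format the weight matrix as a text table: rows are queries, columns are keys."""
    header = " " * width + "".join(f"{t:>{width}}" for t in tokens)
    rows = [
        f"{tokens[i]:>{width}}" + "".join(f"{w:>{width}.2f}" for w in A[i])
        for i in range(len(tokens))
    ]
    return "\n".join([header] + rows)

def describe_attention(A, tokens):
    """For each query token, return (query, most-attended key, weight)."""
    return [
        (tokens[i], tokens[int(np.argmax(A[i]))], float(A[i].max()))
        for i in range(len(tokens))
    ]

print(render(A, tokens))
for query, key, weight in describe_attention(A, tokens):
    print(f"{query} -> {key} ({weight:.2f})")
```

In this made-up matrix the first two rows follow the strong-diagonal pattern, while the last row shows the verb "sat" attending mostly to its subject "cat".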

English Explanation

Attention patterns: diagonal (self-focus), coreference (it→cat), syntactic (verb→subject), long-range dependencies.

---

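22.7 Common Pitfalls

1. Attention is not "memory"

Attention is a weighted average, not an exact memory lookup. It cannot store information and retrieve it exactly; it only mixes values according to relevance.

2. Self-Attention has no position information

Pure Self-Attention is permutation-equivariant: if you shuffle the word order, the outputs are shuffled in the same way, and each word's output vector is unchanged. It must be combined with positional encoding to capture sequence order.

3. Q/K/V do not have to be different

The simplest Self-Attention can use Q=K=V=X. Separate projections (W_q, W_k, W_v) are only there to increase expressive power.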
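Pitfall 2 can be verified numerically. The sketch below uses the projection-free case Q=K=V=X from pitfall 3 and an arbitrary permutation; `self_attention` is a name made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X):
    """Unmasked self-attention with Q=K=V=X and no positional encoding."""
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d))
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
perm = np.array([2, 0, 3, 1])        # an arbitrary reordering of the 4 tokens

out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Shuffling the input rows shuffles the output rows in exactly the same way
print("same vectors, reordered:", np.allclose(out_shuffled, out[perm]))
```

Each token receives the same output vector regardless of where it sits in the sequence, so without positional encoding the model cannot distinguish one word order from another.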

---

Chapter Summary

RNNs have an information bottleneck; Attention gives the decoder direct access to all inputs
Attention = softmax(QK^T / √d_k) V
Dividing by √d_k prevents softmax saturation and keeps gradients healthy
Self-Attention: Q/K/V come from the same sequence; used for encoding and understanding
Cross-Attention: Q comes from the decoder, K/V from the encoder; used for translation alignment
A causal mask keeps the decoder from peeking at future tokens
All positions are computed at once, so Attention is naturally parallelizable (vs. the sequential computation of an RNN)
Self-Attention has no position information by itself and needs positional encoding

---

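Homework

Attention complexity: derive the time complexity O(n²·d) and space complexity O(n²) of Attention and compare them with the O(n·d²) of an RNN. Why is Attention slower when n is large?

Causal mask implementation: implement Masked Self-Attention with a boolean mask (instead of -inf) and compare the results of the two approaches.

Multi-Head Attention: implement Multi-Head Attention, splitting d_model=512 into 8 heads with d_k=64 each. Observe how the attention patterns differ across heads.

Attention visualization: after training on a translation task, visualize the encoder-decoder attention weight matrix and observe the alignment patterns.

Why Transformers replaced RNNs: compare in detail from three angles: parallelism (O(1) vs O(n) sequential steps), long-range dependencies (O(1) vs O(n) path length), and scalability (stacking more layers keeps improving results).

Sparse Attention: read about sparse Attention variants such as Longformer and BigBird and how they reduce the O(n²) complexity.

Cross-Attention implementation: implement the Cross-Attention of an Encoder-Decoder architecture and verify that the dimensions match when Q comes from the decoder and K/V come from the encoder.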