第11章：Position Encoding — 给模型顺序感 | Chapter 11: Position Encoding — Giving the Model a Sense of Order

阶段定位 | Stage: 第三阶段：Transformer 核心 (Stage 3: Transformer Core)
预计学时 | Duration: 4~6 小时 (4-6 hours)

---

学习目标 | Learning Objectives

中文：

理解为什么 Transformer 本身不知道 token 的顺序
掌握 Sinusoidal、RoPE、ALiBi 三种位置编码的原理和实现
理解绝对位置编码与相对位置编码的区别
能够手写位置编码并可视化

English:

Understand why a Transformer has no built-in notion of token order
Understand the principles and implementation of Sinusoidal, RoPE, and ALiBi position encodings
Understand the difference between absolute and relative position encoding
Implement position encodings from scratch and visualize them

---

11.1 为什么需要位置编码？| Why Position Encoding?

中文解释

Attention 的根本问题：没有位置信息

Attention(Q, K, V) 对输入顺序是置换等变的（permutation equivariant）：
打乱输入 token 的顺序，每个 token 的输出向量不变，只是跟着被同样地打乱。

没有位置信息时，"猫抓老鼠" 和 "老鼠抓猫" 由同一组 token 构成，Attention 无法区分两者！

解决方案：给每个位置添加位置信息

English Explanation

Fundamental problem of Attention: no position information

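Attention(Q, K, V) is permutation equivariant:
shuffling the input tokens only shuffles the outputs in the same way, and each token's output vector is unchanged.

Without position information, "Cat catches mouse" and "Mouse catches cat" contain the same set of tokens, so Attention cannot tell them apart!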

Solution: Add position information to each position
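
The symmetry described above can be checked numerically. The following is a minimal sketch, not part of the original chapter: it assumes a stripped-down single-head self-attention with identity Q/K/V projections, and the name `self_attention` and the tensor sizes are illustrative. Learned projections would not change the result, because they act on each token independently.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    """Single-head self-attention with identity projections and no position info."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(5, 16, dtype=torch.float64)   # 5 tokens, 16 dims
perm = torch.randperm(5)                      # a random reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Reordering the input rows only reorders the output rows in the same way.
is_equivariant = torch.allclose(out[perm], out_perm, atol=1e-6)
print(f"Permutation equivariant: {is_equivariant}")
```

Because each token's output is the same regardless of where it sits, nothing downstream can recover the original order unless position is injected somewhere.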

---

11.2 Sinusoidal 位置编码 | Sinusoidal Position Encoding

中文解释

原始 Transformer 论文的方法

公式：

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

核心思想：

用不同频率的正弦/余弦函数编码位置
频率随维度增加而降低
模型可以从这些波中推断相对位置

English Explanation

Method from the original Transformer paper

Formula:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Core idea:

Encode position using sine/cosine functions of different frequencies
Frequency decreases as dimension increases
Model can infer relative positions from these waves
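
As a small worked example of "frequency decreases as dimension increases", the sketch below computes the wavelength (in positions) of a few sin/cos pairs directly from the formula. The choice of `d_model = 512` and of the sampled pair indices is only illustrative.

```python
import math

d_model = 512
pair_indices = [0, 64, 128, 255]   # i in the formula; pair i uses dims 2i and 2i+1

wavelengths = []
for i in pair_indices:
    # The angle is pos / 10000^(2i/d_model), so one full period takes
    # 2*pi * 10000^(2i/d_model) positions.
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    wavelengths.append(wavelength)
    print(f"pair i={i:3d} (dims {2 * i}, {2 * i + 1}): wavelength = {wavelength:.1f} positions")
```

The first pair completes a cycle in about 2π positions, while the last pairs change so slowly that they barely move over a typical sequence. Together they act somewhat like the fast and slow hands of a clock.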

代码案例 | Code Example

import torch
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """
    Sinusoidal 位置编码 | Sinusoidal position encoding
    """
    pe = torch.zeros(seq_len, d_model)
    
    # pos: (seq_len, 1)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    
    # div_term: (d_model/2,)
    # 1 / 10000^(2i/d_model) = exp(2i * -log(10000) / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() *
        (-math.log(10000.0) / d_model)
    )
    
    # 偶数维用 sin | Even dims use sin
    pe[:, 0::2] = torch.sin(position * div_term)
    # 奇数维用 cos | Odd dims use cos
    pe[:, 1::2] = torch.cos(position * div_term)
    
    return pe

# 测试 | Test
seq_len = 100
d_model = 512

pe = sinusoidal_position_encoding(seq_len, d_model)
print(f"PE shape: {pe.shape}")   # (100, 512)
print(f"PE[0, :8]: {pe[0, :8].round(decimals=4)}")
print(f"PE[1, :8]: {pe[1, :8].round(decimals=4)}")

# 可视化 | Visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.imshow(pe.numpy(), cmap='viridis', aspect='auto')
plt.colorbar(label='Value')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('Sinusoidal Position Encoding')
plt.savefig('sinusoidal_pe.png', dpi=150)
print("Saved visualization")

特性分析 | Property Analysis

import torch

pe = sinusoidal_position_encoding(100, 512)

# 特性1：每个位置有唯一编码 | Property 1: Each position has unique encoding
print(f"PE[0] == PE[1]: {torch.allclose(pe[0], pe[1])}")   # False

# 特性2：相对位置可推断 | Property 2: Relative positions can be inferred
# PE(pos+k) 是 PE(pos) 的线性变换，变换矩阵只依赖偏移 k | PE(pos+k) is a linear transform of PE(pos); the matrix depends only on the offset k
# 这来自 sin/cos 的和角公式 | This follows from the angle-addition identities for sin/cos

# 特性3：有界于 [-1, 1] | Property 3: Bounded in [-1, 1]
print(f"Max: {pe.max():.4f}, Min: {pe.min():.4f}")
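
Property 2 is stated above only in comments, so here is a hedged numerical check. The sketch rebuilds the encoding in float64 (the helper name `sinusoidal_pe` and the sizes are illustrative) and applies, to each (sin, cos) pair, a 2×2 rotation whose angle depends only on the offset `k`.

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal encoding in float64; also returns the per-pair frequencies."""
    position = torch.arange(seq_len, dtype=torch.float64).unsqueeze(1)
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * inv_freq)
    pe[:, 1::2] = torch.cos(position * inv_freq)
    return pe, inv_freq

seq_len, d_model, k = 100, 16, 7
pe, inv_freq = sinusoidal_pe(seq_len, d_model)
cos_k, sin_k = torch.cos(k * inv_freq), torch.sin(k * inv_freq)

def shift_by_k(pe_row):
    """Apply the offset-k linear map to one encoding row (no knowledge of pos needed)."""
    s, c = pe_row[0::2], pe_row[1::2]
    out = torch.empty_like(pe_row)
    out[0::2] = s * cos_k + c * sin_k   # sin(a + b) = sin a cos b + cos a sin b
    out[1::2] = c * cos_k - s * sin_k   # cos(a + b) = cos a cos b - sin a sin b
    return out

shift_holds = all(
    torch.allclose(shift_by_k(pe[p]), pe[p + k]) for p in range(seq_len - k)
)
print(f"PE(pos + k) == M_k @ PE(pos) for all pos: {shift_holds}")
```

The same map works for every starting position, which is what lets a model express "k steps later" as a fixed linear operation.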

---

11.3 Learnable Position Embedding | 可学习的位置嵌入

中文解释

BERT/GPT 的方法：直接让模型学习位置嵌入

与 Sinusoidal 的区别：

Sinusoidal：固定公式，不可学习
Learnable：可训练参数

English Explanation

BERT/GPT approach: Let the model directly learn position embeddings

Difference from Sinusoidal:

Sinusoidal: Fixed formula, not learnable
Learnable: Trainable parameters

代码案例 | Code Example

import torch
import torch.nn as nn

class LearnablePositionEncoding(nn.Module):
    """可学习的位置嵌入 | Learnable position embedding"""
    
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
    
    def forward(self, x):
        """
        x: (batch, seq, d_model)
        """
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        pos_emb = self.position_embedding(positions)   # (seq, d_model)
        return x + pos_emb.unsqueeze(0)   # 广播加到输入上 | Broadcast add to input

# 使用 | Usage
pos_enc = LearnablePositionEncoding(max_seq_len=512, d_model=768)
X = torch.randn(2, 100, 768)
X_with_pos = pos_enc(X)
print(f"Output shape: {X_with_pos.shape}")
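
One practical consequence of learning the table is worth seeing directly: the embedding has exactly `max_seq_len` rows, so it adds parameters and has no entry for longer sequences. The sketch below illustrates this with a bare `nn.Embedding`. The sizes mirror the usage above, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 512, 768
position_embedding = nn.Embedding(max_seq_len, d_model)

# A learned table costs max_seq_len * d_model trainable parameters;
# the sinusoidal encoding costs none.
num_params = sum(p.numel() for p in position_embedding.parameters())
print(f"Trainable position parameters: {num_params}")

# Position index 512 has no row in the table, so the lookup raises.
try:
    position_embedding(torch.tensor([max_seq_len]))
    lookup_failed = False
except (IndexError, RuntimeError):
    lookup_failed = True
print(f"Lookup beyond max_seq_len failed: {lookup_failed}")
```

This hard length limit is one motivation for the formula-based schemes in the following sections.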

---

11.4 RoPE — 旋转位置编码 | RoPE (Rotary Position Embedding)

中文解释

RoPE = 通过旋转矩阵编码位置信息

核心思想：

不直接加位置向量
而是将 Q/K 向量按位置角度旋转
自然地保持相对位置关系

优势：

外推性较好：注意力分数只依赖相对位置；但远超训练长度时，通常仍需位置插值、NTK 缩放等技巧
相对位置编码，更自然地融入 Attention

English Explanation

RoPE = Encode position information via rotation matrices

Core idea:

Instead of directly adding position vectors
Rotate Q/K vectors by position-dependent angles
Naturally preserve relative position relationships

Advantages:

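Fairly good extrapolation: attention scores depend only on relative offsets, though going well beyond the training length usually still needs techniques such as position interpolation or NTK-aware scaling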
Relative position encoding, more naturally integrated into Attention
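
The relative-position claim can be verified with a few lines. In this sketch (the helper `rotate` and the chosen positions are illustrative), a query and a key are rotated by their own absolute positions, and the resulting dot product is compared across position pairs that share the same offset.

```python
import torch

def rotate(x, pos, inv_freq):
    """Rotate each (even, odd) pair of a 1-D vector by the angle pos * inv_freq."""
    angles = pos * inv_freq
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
d_head = 8
inv_freq = 1.0 / (10000 ** (torch.arange(0, d_head, 2, dtype=torch.float64) / d_head))
q = torch.randn(d_head, dtype=torch.float64)
k = torch.randn(d_head, dtype=torch.float64)

# Same offset (2) at very different absolute positions.
score_near = rotate(q, 3, inv_freq) @ rotate(k, 1, inv_freq)
score_far = rotate(q, 50, inv_freq) @ rotate(k, 48, inv_freq)

same_score = torch.allclose(score_near, score_far)
print(f"Score depends only on the offset: {same_score}")
```

Each vector is rotated by its absolute position, yet the attention score only sees the difference between the two rotations. That is why RoPE is usually described as a relative scheme.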

代码案例 | Code Example

import torch
import torch.nn as nn

class RoPE(nn.Module):
    """旋转位置编码 | Rotary Position Embedding"""
    
    def __init__(self, d_head, max_seq_len=2048):
        super().__init__()
        self.d_head = d_head
        
        # 预计算旋转角度 | Precompute rotation angles
        inv_freq = 1.0 / (10000 ** (torch.arange(0, d_head, 2).float() / d_head))
        positions = torch.arange(max_seq_len)
        angles = torch.einsum('i,j->ij', positions.float(), inv_freq)   # (seq, d_head/2)
        
        # 注册为 buffer（不参与训练）| Register as buffer (not trained)
        self.register_buffer('cos', torch.cos(angles))
        self.register_buffer('sin', torch.sin(angles))
    
    def forward(self, x, seq_len):
        """
        x: (..., seq, d_head)
        应用旋转 | Apply rotation
        """
        # 将 x 分成偶数维和奇数维 | Split x into even and odd dims
        x1 = x[..., ::2]    # 偶数维 | Even dims
        x2 = x[..., 1::2]   # 奇数维 | Odd dims
        
        # 对每一对 (x1, x2) 应用旋转 | Apply rotation to each (x1, x2) pair
        # [x1'; x2'] = [[cos, -sin], [sin, cos]] @ [x1; x2]
        cos = self.cos[:seq_len]   # (seq, d_head/2)
        sin = self.sin[:seq_len]
        
        rotated_x1 = x1 * cos - x2 * sin
        rotated_x2 = x1 * sin + x2 * cos
        
        # 交错合并 | Interleave
        rotated = torch.stack([rotated_x1, rotated_x2], dim=-1).flatten(-2)
        
        return rotated

# 在 Attention 中使用 RoPE | Using RoPE in Attention
def apply_rope_to_qk(Q, K, rope):
    """
    Q, K: (batch, heads, seq, d_head)
    """
    seq_len = Q.size(2)
    Q = rope(Q, seq_len)
    K = rope(K, seq_len)
    return Q, K

# 测试 | Test
rope = RoPE(d_head=64)
Q = torch.randn(2, 8, 128, 64)
K = torch.randn(2, 8, 128, 64)
Q_rot, K_rot = apply_rope_to_qk(Q, K, rope)
print(f"Q_rot shape: {Q_rot.shape}")   # (2, 8, 128, 64)

---

11.5 ALiBi — 线性偏置注意力 | ALiBi (Attention with Linear Biases)

中文解释

ALiBi = 给注意力分数加上一个与距离相关的负偏置

核心思想：

不需要位置编码向量
直接在注意力分数上减去 m × 距离
越远的 token，注意力分数越低

优势：

实现极其简单
外推性极好
训练稳定性好

English Explanation

ALiBi = Add a distance-dependent negative bias to attention scores

Core idea:

No position encoding vectors needed
Directly subtract m × distance from attention scores
Farther tokens get lower attention scores

Advantages:

Extremely simple implementation
Excellent extrapolation
Good training stability
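
To see what the bias does before reading the full implementation, the sketch below builds the bias matrix for a single head and a short sequence. The slope value 0.5 and the all-zero raw scores are arbitrary illustrative choices. A causal mask is included because ALiBi was introduced for causal language models.

```python
import torch

seq_len, slope = 5, 0.5   # slope m: an arbitrary illustrative value

positions = torch.arange(seq_len)
distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs()   # |i - j|
bias = -slope * distance                                              # (seq, seq)

# With all raw scores equal (zeros), only the bias and the mask shape the weights.
scores = torch.zeros(seq_len, seq_len) + bias
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
weights = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(bias)
print(weights[-1])   # last query: weight grows toward the most recent key
```

A larger slope makes a head focus more sharply on nearby tokens, while a smaller slope lets it look further back. This is why each head gets its own slope.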

代码案例 | Code Example

import math
import torch

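def get_alibi_slopes(num_heads):
    """计算每个 head 的斜率 | Calculate slope for each head"""
    def slopes_power_of_2(n):
        # 几何序列：首项 = 公比 = 2^(-8/n) | Geometric sequence: first term = ratio = 2^(-8/n)
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]

    closest_power_of_2 = 2 ** (num_heads.bit_length() - 1)
    slopes = slopes_power_of_2(closest_power_of_2)
    if closest_power_of_2 != num_heads:
        # head 数不是 2 的幂：从 2n 的序列中隔一个取一个来补足
        # Non-power-of-2 head count: fill in with every other slope of the 2n sequence
        extra = slopes_power_of_2(2 * closest_power_of_2)[0::2]
        slopes += extra[: num_heads - closest_power_of_2]

    return torch.tensor(slopes)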

def apply_alibi(scores, num_heads):
    """
    scores: (batch, heads, seq, seq)
    应用 ALiBi 偏置 | Apply ALiBi bias
    """
    batch, heads, seq, _ = scores.shape
    
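    # 距离矩阵 |i - j|（对称形式；因果模型中再配合 causal mask）
    # Distance matrix |i - j| (symmetric form; combine with a causal mask in causal models)
    positions = torch.arange(seq, device=scores.device)
    distance = positions.unsqueeze(0) - positions.unsqueeze(1)   # (seq, seq)
    distance = distance.abs().unsqueeze(0).unsqueeze(0)   # (1, 1, seq, seq)
    
    # 每个 head 的斜率 | Slope per head
    slopes = get_alibi_slopes(num_heads).to(scores.device).view(1, heads, 1, 1)
    
    # 加上负偏置（等价于减去 m × 距离）| Add the negative bias (i.e. subtract m × distance)
    bias = -slopes * distance   # 非正数，越远越负 | Non-positive, more negative for farther tokens
    return scores + bias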

# 测试 | Test
scores = torch.randn(2, 8, 10, 10)
scores_alibi = apply_alibi(scores, num_heads=8)
print(f"Original scores range: [{scores.min():.2f}, {scores.max():.2f}]")
print(f"ALiBi scores range: [{scores_alibi.min():.2f}, {scores_alibi.max():.2f}]")

---

11.6 三种编码方式对比 | Comparison of Three Methods

特性	Sinusoidal	RoPE	ALiBi
类型	绝对位置	相对位置	相对位置
参数	固定公式	固定公式	固定公式
实现复杂度	中	中	低
外推性	一般	好	极好
训练稳定性	好	好	极好
代表模型	Original Transformer	LLaMA, Qwen	BLOOM, MPT

Feature	Sinusoidal	RoPE	ALiBi
Type	Absolute	Relative	Relative
Parameters	Fixed formula	Fixed formula	Fixed formula
Complexity	Medium	Medium	Low
Extrapolation	Fair	Good	Excellent
Training stability	Good	Good	Excellent
Representative models	Original Transformer	LLaMA, Qwen	BLOOM, MPT

---

本章总结 | Chapter Summary

中文：

Transformer 需要位置编码，因为 Attention 本身无位置感知
Sinusoidal：经典方法，频率编码位置
Learnable：BERT/GPT 使用，直接学习
RoPE：通过旋转编码，外推性好
ALiBi：最简单，直接加偏置，外推性最好
现代大模型（LLaMA/Qwen）普遍使用 RoPE

English:

Transformers need position encoding because Attention itself has no position awareness
Sinusoidal: Classic method, frequency encodes position
Learnable: Used by BERT/GPT, directly learned
RoPE: Encodes via rotation, good extrapolation
ALiBi: Simplest, directly adds bias, best extrapolation
Modern LLMs (LLaMA/Qwen) generally use RoPE

---

课后练习 | Homework

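中文：

Sinusoidal 实现：手写 sinusoidal PE，可视化前 50 个位置的前 64 维
RoPE 实现：实现 RoPE，验证旋转前后向量的模长不变
ALiBi 实现：实现 ALiBi，观察不同 head 的斜率分布
外推性测试：用训练长度 128 的模型，测试长度 256 的推理效果（三种方法对比）
思考题：为什么 Sinusoidal/Learnable 位置编码只加在输入端就够了？RoPE 和 ALiBi 又为什么要在每一层 Attention 中应用？

English:

Sinusoidal implementation: Implement sinusoidal PE from scratch and visualize the first 64 dimensions of the first 50 positions
RoPE implementation: Implement RoPE and verify that rotation leaves each vector's norm unchanged
ALiBi implementation: Implement ALiBi and inspect how the slopes are distributed across heads
Extrapolation test: Take a model trained at length 128 and evaluate inference at length 256 (compare the three methods)
Discussion: Why is it enough to add sinusoidal/learnable position encodings only at the input, while RoPE and ALiBi are applied in every attention layer?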