第27章:LoRA — 低秩微调 | Chapter 27: LoRA — Low-Rank Adaptation
阶段定位 | Stage: 第五阶段 — 源码与魔改 预计学时 | Duration: 5~7 小时
---
学习目标 | Learning Objectives
中文:
- 理解 LoRA 的核心思想:低秩分解
- 掌握 LoRA 的实现和训练
- 理解为什么 LoRA 能显著减少微调参数量
- 能够使用 peft 库进行 LoRA 微调
English:
- Understand core idea of LoRA: low-rank decomposition
- Master LoRA implementation and training
- Understand why LoRA significantly reduces fine-tuning parameters
- Be able to use peft library for LoRA fine-tuning
---
27.1 为什么需要 LoRA?| Why LoRA?
中文解释
全量微调的问题:
- LLaMA-7B 有 70 亿参数
- 全量微调需要 140GB 显存(FP16 × 2 for Adam)
- 大多数人没有这样的硬件
LoRA 的解决:
- 只训练少量"适配器"参数
- 冻结原始模型权重
- 7B 模型可能只需要训练 1000 万参数(0.1%)
English Explanation
Problems with full fine-tuning:
- LLaMA-7B has 7 billion parameters
- Full fine-tuning needs 140GB VRAM (FP16 × 2 for Adam)
- Most people don't have such hardware
LoRA solution:
- Only train a small number of "adapter" parameters
- Freeze original model weights
- 7B model may only need 10M parameters (0.1%)
---
27.2 LoRA 核心公式 | LoRA Core Formula
中文解释
核心思想:权重更新是低秩的
原始权重:W_0 (d × k)
微调后的权重:W = W_0 + ΔW
LoRA 假设:ΔW 是低秩的,可以分解为两个矩阵的乘积
LoRA assumes: ΔW is low-rank, can be factorized into two matrices
ΔW = B × A
其中:| where:
B: (d × r) — r << min(d, k)
A: (r × k)
训练时:W_0 冻结,只训练 A 和 B
During training: W_0 frozen, only A and B trainedEnglish Explanation
Core idea: Weight updates are low-rank
Original weight: W_0 (d × k)
Fine-tuned weight: W = W_0 + ΔW
LoRA assumes: ΔW is low-rank, can be factorized into two matrices
ΔW = B × A
where:
B: (d × r) — r << min(d, k)
A: (r × k)
During training: W_0 frozen, only A and B trained---
27.3 LoRA 完整实现 | Complete LoRA Implementation
代码案例 | Code Example
import torch
import torch.nn as nn
import math
class LoRALayer(nn.Module):
"""LoRA 层 | LoRA layer"""
def __init__(self, in_features, out_features, rank=8, lora_alpha=16):
super().__init__()
self.rank = rank
self.lora_alpha = lora_alpha
self.scaling = lora_alpha / rank
# 低秩分解矩阵 | Low-rank decomposition matrices
self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
# 初始化:A 用高斯,B 用零 | Initialize: A with Gaussian, B with zeros
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(self, x):
"""
x @ (B @ A) * scaling
"""
# x: (B, seq, in_features)
# lora_A: (in_features, rank)
# lora_B: (rank, out_features)
return (x @ self.lora_A @ self.lora_B) * self.scaling
class LinearWithLoRA(nn.Module):
"""带 LoRA 的线性层 | Linear layer with LoRA"""
def __init__(self, linear_layer, rank=8, lora_alpha=16):
super().__init__()
self.linear = linear_layer
self.lora = LoRALayer(
linear_layer.in_features,
linear_layer.out_features,
rank=rank,
lora_alpha=lora_alpha,
)
# 冻结原始权重 | Freeze original weights
for param in self.linear.parameters():
param.requires_grad = False
def forward(self, x):
# 原始输出 + LoRA 适配 | Original output + LoRA adaptation
return self.linear(x) + self.lora(x)
# ========== 应用到 Transformer | Apply to Transformer ==========
def add_lora_to_model(model, target_modules=["q_proj", "v_proj"], rank=8):
"""
给模型的指定模块添加 LoRA | Add LoRA to specified modules of model
"""
lora_layers = []
for name, module in model.named_modules():
if any(target in name for target in target_modules):
if isinstance(module, nn.Linear):
# 找到父模块 | Find parent module
parent_name = ".".join(name.split(".")[:-1])
child_name = name.split(".")[-1]
if parent_name:
parent = model.get_submodule(parent_name)
else:
parent = model
# 替换为 LoRA 版本 | Replace with LoRA version
lora_layer = LinearWithLoRA(module, rank=rank)
setattr(parent, child_name, lora_layer)
lora_layers.append(lora_layer)
return model, lora_layers
# 测试 | Test
linear = nn.Linear(768, 768)
linear_lora = LinearWithLoRA(linear, rank=8)
x = torch.randn(2, 10, 768)
output = linear_lora(x)
print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
# 参数量对比 | Parameter comparison
orig_params = sum(p.numel() for p in linear.parameters())
lora_params = sum(p.numel() for p in linear_lora.lora.parameters())
print(f"\nOriginal params: {orig_params:,}")
print(f"LoRA params: {lora_params:,}")
print(f"Ratio: {lora_params/orig_params*100:.2f}%")---
27.4 使用 PEFT 库 | Using PEFT Library
代码案例 | Code Example
# pip install peft
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
# 加载基础模型 | Load base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
# 配置 LoRA | Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # rank
lora_alpha=32, # scaling = alpha / r
lora_dropout=0.05,
target_modules=["c_attn", "c_proj"], # 要添加 LoRA 的模块 | Modules to add LoRA
bias="none",
)
# 应用 LoRA | Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 输出示例:| Output example:
# trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
# 训练(只训练 LoRA 参数)| Train (only LoRA parameters)
# optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)---
27.5 合并 LoRA 权重 | Merging LoRA Weights
代码案例 | Code Example
# 训练完成后,可以将 LoRA 合并到原始权重 | After training, can merge LoRA into original weights
def merge_lora_weights(linear_lora):
"""
将 LoRA 权重合并到原始线性层 | Merge LoRA weights into original linear layer
"""
# ΔW = B @ A * scaling
delta_W = (linear_lora.lora.lora_B @ linear_lora.lora.lora_A.t()) * linear_lora.lora.scaling
# W_merged = W_0 + ΔW
linear_lora.linear.weight.data += delta_W.t()
return linear_lora.linear
# PEFT 库中的合并 | Merge in PEFT
# model = model.merge_and_unload()---
本章总结 | Chapter Summary
中文:
- LoRA = 低秩分解 + 冻结原权重
- rank 控制参数量,通常 4-64
- alpha 控制缩放,通常 2×rank
- PEFT 库让 LoRA 微调极其简单
- 训练后可合并权重,推理无额外开销
English:
- LoRA = low-rank decomposition + freeze original weights
- rank controls parameter count, typically 4-64
- alpha controls scaling, typically 2×rank
- PEFT library makes LoRA fine-tuning extremely simple
- Weights can be merged after training, no inference overhead
---
课后练习 | Homework
- LoRA 实现:完整实现 LoRA 层,验证低秩假设
- 参数计算:计算不同 rank 下 LoRA 的参数量和压缩比
- PEFT 微调:用 PEFT 对 GPT-2 进行指令微调
- 权重合并:实现 LoRA 权重合并,验证合并后输出一致
- 对比实验:对比 LoRA rank=4/8/16/64 的微调效果