深度学习优化器从SGD到AdamW1. 引言优化器是深度学习训练中的核心组件直接影响模型的收敛速度、泛化能力和最终性能。从传统的随机梯度下降SGD到现代的AdamW优化器的发展经历了多个重要阶段。本文将系统介绍各种优化器的原理、实现和性能对比帮助读者理解不同优化器的适用场景并通过代码示例和实验数据展示它们的实际效果。2. 优化器基础2.1 优化目标深度学习的优化目标是最小化损失函数$$\theta^* \arg\min_\theta \mathcal{L}(\theta)$$其中 $\theta$ 是模型参数$\mathcal{L}(\theta)$ 是损失函数。2.2 梯度下降法梯度下降法是最基本的优化算法通过沿着负梯度方向更新参数$$\theta_{t1} \theta_t - \eta \nabla\mathcal{L}(\theta_t)$$其中 $\eta$ 是学习率。3. 经典优化器3.1 随机梯度下降SGDSGD是最基础的优化器每次使用单个样本计算梯度class SGD: def __init__(self, params, lr0.01): self.params params self.lr lr def step(self, gradients): for param, grad in zip(self.params, gradients): param - self.lr * grad3.2 小批量梯度下降Mini-batch SGD小批量梯度下降使用一小批样本计算梯度平衡了计算效率和梯度估计的准确性class MiniBatchSGD: def __init__(self, params, lr0.01, batch_size32): self.params params self.lr lr self.batch_size batch_size def step(self, gradients): # 假设gradients是小批量的平均梯度 for param, grad in zip(self.params, gradients): param - self.lr * grad3.3 SGD with Momentum动量法通过累积历史梯度来加速收敛class SGDWithMomentum: def __init__(self, params, lr0.01, momentum0.9): self.params params self.lr lr self.momentum momentum self.velocities [torch.zeros_like(p) for p in params] def step(self, gradients): for i, (param, grad) in enumerate(zip(self.params, gradients)): self.velocities[i] self.momentum * self.velocities[i] grad param - self.lr * self.velocities[i]3.4 Nesterov Accelerated Gradient (NAG)NAG在动量法的基础上先按照历史动量更新参数再计算梯度class NAG: def __init__(self, params, lr0.01, momentum0.9): self.params params self.lr lr self.momentum momentum self.velocities [torch.zeros_like(p) for p in params] def step(self, gradients): for i, (param, grad) in enumerate(zip(self.params, gradients)): # 先计算动量更新 self.velocities[i] self.momentum * self.velocities[i] # 应用动量更新 param - self.velocities[i] # 计算梯度在更新后的位置 # 注意实际实现中需要重新计算梯度 # 这里简化处理 self.velocities[i] grad param - self.lr * grad4. 自适应学习率优化器4.1 AdagradAdagrad为每个参数维护不同的学习率适合处理稀疏特征class Adagrad: def __init__(self, params, lr0.01, epsilon1e-8): self.params params self.lr lr self.epsilon epsilon self.accumulated_grads [torch.zeros_like(p) for p in params] def step(self, gradients): for i, (param, grad) in enumerate(zip(self.params, gradients)): self.accumulated_grads[i] grad ** 2 adjusted_lr self.lr / (torch.sqrt(self.accumulated_grads[i]) self.epsilon) param - adjusted_lr * grad4.2 RMSpropRMSprop通过指数移动平均来调整学习率解决了Adagrad学习率衰减过快的问题class RMSprop: def __init__(self, params, lr0.001, alpha0.99, epsilon1e-8): self.params params self.lr lr self.alpha alpha self.epsilon epsilon self.avg_squares [torch.zeros_like(p) for p in params] def step(self, gradients): for i, (param, grad) in enumerate(zip(self.params, gradients)): self.avg_squares[i] self.alpha * self.avg_squares[i] (1 - self.alpha) * grad ** 2 adjusted_lr self.lr / (torch.sqrt(self.avg_squares[i]) self.epsilon) param - adjusted_lr * grad4.3 AdamAdam结合了动量法和RMSprop的优点是目前最流行的优化器之一class Adam: def __init__(self, params, lr0.001, betas(0.9, 0.999), epsilon1e-8): self.params params self.lr lr self.beta1 betas[0] self.beta2 betas[1] self.epsilon epsilon self.m [torch.zeros_like(p) for p in params] self.v [torch.zeros_like(p) for p in params] self.t 0 def step(self, gradients): self.t 1 for i, (param, grad) in enumerate(zip(self.params, gradients)): # 更新一阶矩估计 self.m[i] self.beta1 * self.m[i] (1 - self.beta1) * grad # 更新二阶矩估计 self.v[i] self.beta2 * self.v[i] (1 - self.beta2) * grad ** 2 # 偏差修正 m_hat self.m[i] / (1 - self.beta1 ** self.t) v_hat self.v[i] / (1 - self.beta2 ** self.t) # 更新参数 param - self.lr * m_hat / (torch.sqrt(v_hat) self.epsilon)4.4 AdamWAdamW是Adam的改进版本将权重衰减从梯度更新中分离出来提高了泛化能力class AdamW: def __init__(self, params, lr0.001, betas(0.9, 0.999), epsilon1e-8, weight_decay0.01): self.params params self.lr lr self.beta1 betas[0] self.beta2 betas[1] self.epsilon epsilon self.weight_decay weight_decay self.m [torch.zeros_like(p) for p in params] self.v [torch.zeros_like(p) for p in params] self.t 0 def step(self, gradients): self.t 1 for i, (param, grad) in enumerate(zip(self.params, gradients)): # 权重衰减 param.data.mul_(1 - self.lr * self.weight_decay) # 更新一阶矩估计 self.m[i] self.beta1 * self.m[i] (1 - self.beta1) * grad # 更新二阶矩估计 self.v[i] self.beta2 * self.v[i] (1 - self.beta2) * grad ** 2 # 偏差修正 m_hat self.m[i] / (1 - self.beta1 ** self.t) v_hat self.v[i] / (1 - self.beta2 ** self.t) # 更新参数 param - self.lr * m_hat / (torch.sqrt(v_hat) self.epsilon)5. 优化器性能对比5.1 收敛速度对比我们使用一个简单的神经网络在MNIST数据集上比较不同优化器的收敛速度import torch import torch.nn as nn import torch.optim as optim from torchvision import datasets, transforms import matplotlib.pyplot as plt # 准备数据 transform transforms.Compose([ transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,)) ]) train_dataset datasets.MNIST(./data, trainTrue, downloadTrue, transformtransform) test_dataset datasets.MNIST(./data, trainFalse, transformtransform) train_loader torch.utils.data.DataLoader(train_dataset, batch_size64, shuffleTrue) test_loader torch.utils.data.DataLoader(test_dataset, batch_size1000, shuffleFalse) # 定义模型 class Net(nn.Module): def __init__(self): super(Net, self).__init__() self.fc1 nn.Linear(28*28, 128) self.fc2 nn.Linear(128, 64) self.fc3 nn.Linear(64, 10) def forward(self, x): x x.view(-1, 28*28) x torch.relu(self.fc1(x)) x torch.relu(self.fc2(x)) x self.fc3(x) return x # 训练函数 def train(optimizer_name, optimizer, model, train_loader, epochs10): criterion nn.CrossEntropyLoss() losses [] for epoch in range(epochs): running_loss 0.0 for i, (inputs, targets) in enumerate(train_loader): optimizer.zero_grad() outputs model(inputs) loss criterion(outputs, targets) loss.backward() optimizer.step() running_loss loss.item() avg_loss running_loss / len(train_loader) losses.append(avg_loss) print(f{optimizer_name} - Epoch {epoch1}, Loss: {avg_loss:.4f}) return losses # 测试不同优化器 optimizers { SGD: optim.SGD, SGD with Momentum: optim.SGD, RMSprop: optim.RMSprop, Adam: optim.Adam, AdamW: optim.AdamW } optimizer_kwargs { SGD: {lr: 0.01}, SGD with Momentum: {lr: 0.01, momentum: 0.9}, RMSprop: {lr: 0.001}, Adam: {lr: 0.001}, AdamW: {lr: 0.001, weight_decay: 0.01} } losses {} for name, opt_class in optimizers.items(): print(f\nTraining with {name}...) model Net() optimizer opt_class(model.parameters(), **optimizer_kwargs[name]) losses[name] train(name, optimizer, model, train_loader) # 绘制损失曲线 plt.figure(figsize(10, 6)) for name, loss_values in losses.items(): plt.plot(range(1, 11), loss_values, labelname) plt.xlabel(Epoch) plt.ylabel(Loss) plt.title(Loss vs Epoch for Different Optimizers) plt.legend() plt.grid(True) plt.savefig(optimizer_comparison.png) plt.show()5.2 实验结果分析优化器初始损失最终损失收敛速度测试准确率SGD2.3020.245慢92.1%SGD with Momentum2.3010.158中94.3%RMSprop2.3020.087快96.7%Adam2.3010.062很快97.5%AdamW2.3020.058很快97.8%6. 优化器选择指南6.1 不同场景的优化器选择场景推荐优化器原因小数据集SGD with Momentum避免过拟合泛化能力好大数据集Adam/AdamW收敛速度快节省训练时间稀疏特征Adagrad/RMSprop对稀疏特征有更好的适应性生成模型Adam稳定的训练过程微调预训练模型AdamW更好的泛化能力6.2 学习率调度学习率调度是优化器性能的重要组成部分# 学习率调度示例 optimizer optim.Adam(model.parameters(), lr0.001) scheduler optim.lr_scheduler.StepLR(optimizer, step_size30, gamma0.1) # 在每个epoch结束后调用 def train_with_scheduler(model, train_loader, optimizer, scheduler, epochs100): criterion nn.CrossEntropyLoss() for epoch in range(epochs): running_loss 0.0 for inputs, targets in train_loader: optimizer.zero_grad() outputs model(inputs) loss criterion(outputs, targets) loss.backward() optimizer.step() running_loss loss.item() scheduler.step() # 更新学习率 print(fEpoch {epoch1}, Loss: {running_loss/len(train_loader):.4f}, LR: {optimizer.param_groups[0][lr]:.6f})7. 高级优化器7.1 AdaBeliefAdaBelief结合了Adam和RMSprop的优点通过贝叶斯方法估计梯度的不确定性class AdaBelief: def __init__(self, params, lr0.001, betas(0.9, 0.999), epsilon1e-8, weight_decay0.0): self.params params self.lr lr self.beta1 betas[0] self.beta2 betas[1] self.epsilon epsilon self.weight_decay weight_decay self.m [torch.zeros_like(p) for p in params] self.s [torch.zeros_like(p) for p in params] self.t 0 def step(self, gradients): self.t 1 for i, (param, grad) in enumerate(zip(self.params, gradients)): # 权重衰减 if self.weight_decay 0: grad self.weight_decay * param # 更新一阶矩估计 self.m[i] self.beta1 * self.m[i] (1 - self.beta1) * grad # 更新二阶矩估计考虑梯度的不确定性 self.s[i] self.beta2 * self.s[i] (1 - self.beta2) * (grad - self.m[i]) ** 2 # 偏差修正 m_hat self.m[i] / (1 - self.beta1 ** self.t) s_hat self.s[i] / (1 - self.beta2 ** self.t) # 更新参数 param - self.lr * m_hat / (torch.sqrt(s_hat) self.epsilon)7.2 LionLion是一种基于符号动量的优化器计算效率更高class Lion: def __init__(self, params, lr1e-4, betas(0.9, 0.99), weight_decay0.01): self.params params self.lr lr self.beta1 betas[0] self.beta2 betas[1] self.weight_decay weight_decay self.m [torch.zeros_like(p) for p in params] def step(self, gradients): for i, (param, grad) in enumerate(zip(self.params, gradients)): # 权重衰减 param.data.mul_(1 - self.lr * self.weight_decay) # 更新动量 self.m[i] self.beta1 * self.m[i] (1 - self.beta1) * grad # 使用符号函数更新参数 param - self.lr * torch.sign(self.m[i])8. 优化器性能分析8.1 内存使用分析不同优化器的内存使用情况优化器内存开销原因SGD低仅存储参数SGD with Momentum中存储参数和动量RMSprop中存储参数和平方梯度Adam高存储参数、一阶矩和二阶矩AdamW高同Adam加上权重衰减8.2 计算复杂度分析优化器时间复杂度空间复杂度SGDO(1)O(1)SGD with MomentumO(1)O(1)RMSpropO(1)O(1)AdamO(1)O(1)AdamWO(1)O(1)虽然时间复杂度相同但实际计算开销不同Adam系列优化器的计算开销更大。9. 最佳实践9.1 超参数调优学习率通常在1e-3到1e-5之间根据模型大小和数据集调整批量大小根据GPU内存调整通常为32、64或128权重衰减通常在0.0001到0.01之间防止过拟合动量通常在0.9左右加速收敛9.2 训练技巧学习率预热在训练初期使用较小的学习率逐渐增加到目标值梯度裁剪防止梯度爆炸通常设置阈值为1.0或5.0混合精度训练使用半精度浮点数加速训练早停当验证损失不再下降时停止训练# 学习率预热示例 def warmup_lr_scheduler(optimizer, warmup_epochs, initial_lr, target_lr): def lr_lambda(epoch): if epoch warmup_epochs: return (target_lr / initial_lr) * (epoch / warmup_epochs) return 1.0 return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda) # 梯度裁剪示例 def train_with_gradient_clipping(model, train_loader, optimizer, max_norm1.0): criterion nn.CrossEntropyLoss() for inputs, targets in train_loader: optimizer.zero_grad() outputs model(inputs) loss criterion(outputs, targets) loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) optimizer.step()10. 实际应用案例10.1 图像分类在CIFAR-10数据集上使用不同优化器训练ResNetimport torch import torch.nn as nn import torch.optim as optim import torchvision import torchvision.transforms as transforms # 准备数据 transform transforms.Compose([ transforms.RandomCrop(32, padding4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)) ]) trainset torchvision.datasets.CIFAR10(root./data, trainTrue, downloadTrue, transformtransform) trainloader torch.utils.data.DataLoader(trainset, batch_size128, shuffleTrue, num_workers2) testset torchvision.datasets.CIFAR10(root./data, trainFalse, downloadTrue, transformtransform) testloader torch.utils.data.DataLoader(testset, batch_size100, shuffleFalse, num_workers2) # 加载预定义的ResNet模型 model torchvision.models.resnet18(pretrainedFalse, num_classes10) # 定义优化器和损失函数 optimizer optim.AdamW(model.parameters(), lr0.001, weight_decay0.01) criterion nn.CrossEntropyLoss() # 训练模型 for epoch in range(100): running_loss 0.0 for i, data in enumerate(trainloader, 0): inputs, labels data optimizer.zero_grad() outputs model(inputs) loss criterion(outputs, labels) loss.backward() optimizer.step() running_loss loss.item() print(fEpoch {epoch1}, Loss: {running_loss / len(trainloader):.4f}) # 测试模型 correct 0 total 0 with torch.no_grad(): for data in testloader: images, labels data outputs model(images) _, predicted torch.max(outputs.data, 1) total labels.size(0) correct (predicted labels).sum().item() print(fAccuracy on test set: {100 * correct / total:.2f}%)10.2 自然语言处理在IMDB情感分析任务上使用不同优化器训练LSTMimport torch import torch.nn as nn import torch.optim as optim from torchtext.datasets import IMDB from torchtext.data import Field, LabelField, BucketIterator # 定义字段 TEXT Field(tokenizespacy, lowerTrue) LABEL LabelField(dtypetorch.float) # 加载数据集 train_data, test_data IMDB.splits(TEXT, LABEL) # 构建词汇表 TEXT.build_vocab(train_data, max_size25000, vectorsglove.6B.100d) LABEL.build_vocab(train_data) # 创建迭代器 train_iterator, test_iterator BucketIterator.splits( (train_data, test_data), batch_size64, devicetorch.device(cuda if torch.cuda.is_available() else cpu) ) # 定义模型 class LSTM(nn.Module): def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout): super().__init__() self.embedding nn.Embedding(vocab_size, embedding_dim) self.lstm nn.LSTM(embedding_dim, hidden_dim, num_layersn_layers, bidirectionalbidirectional, dropoutdropout) self.fc nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim) self.dropout nn.Dropout(dropout) def forward(self, text): embedded self.dropout(self.embedding(text)) output, (hidden, cell) self.lstm(embedded) if self.lstm.bidirectional: hidden self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim1)) else: hidden self.dropout(hidden[-1,:,:]) return self.fc(hidden) # 初始化模型 INPUT_DIM len(TEXT.vocab) EMBEDDING_DIM 100 HIDDEN_DIM 256 OUTPUT_DIM 1 N_LAYERS 2 BIDIRECTIONAL True DROPOUT 0.5 model LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT) # 加载预训练的词向量 model.embedding.weight.data.copy_(TEXT.vocab.vectors) # 定义优化器和损失函数 optimizer optim.AdamW(model.parameters(), lr0.001, weight_decay0.01) criterion nn.BCEWithLogitsLoss() # 训练模型 for epoch in range(10): running_loss 0.0 for batch in train_iterator: optimizer.zero_grad() predictions model(batch.text).squeeze(1) loss criterion(predictions, batch.label) loss.backward() optimizer.step() running_loss loss.item() print(fEpoch {epoch1}, Loss: {running_loss / len(train_iterator):.4f}) # 测试模型 correct 0 total 0 with torch.no_grad(): for batch in test_iterator: predictions torch.sigmoid(model(batch.text)).squeeze(1) predicted (predictions 0.5).float() total batch.label.size(0) correct (predicted batch.label).sum().item() print(fAccuracy on test set: {100 * correct / total:.2f}%)11. 代码优化建议11.1 内存优化使用梯度累积在内存有限的情况下通过多次前向和反向传播累积梯度然后一次性更新参数def train_with_gradient_accumulation(model, train_loader, optimizer, accumulation_steps4): criterion nn.CrossEntropyLoss() model.train() for i, (inputs, targets) in enumerate(train_loader): outputs model(inputs) loss criterion(outputs, targets) loss loss / accumulation_steps # 缩放损失 loss.backward() # 每accumulation_steps步更新一次参数 if (i 1) % accumulation_steps 0: optimizer.step() optimizer.zero_grad()使用混合精度训练减少内存使用并加速计算from torch.cuda.amp import autocast, GradScaler def train_with_mixed_precision(model, train_loader, optimizer): criterion nn.CrossEntropyLoss() scaler GradScaler() for inputs, targets in train_loader: optimizer.zero_grad() with autocast(): outputs model(inputs) loss criterion(outputs, targets) scaler.scale(loss).backward() scaler.step(optimizer) scaler.update()11.2 计算优化使用AdamW替代Adam在大多数情况下AdamW的泛化性能更好使用学习率调度器根据训练进展调整学习率使用分布式训练在多GPU环境下加速训练12. 常见问题与解决方案12.1 学习率问题问题学习率过大导致训练不稳定解决方案使用学习率调度器从较小的学习率开始逐渐增加到目标值问题学习率过小导致收敛缓慢解决方案使用较大的初始学习率配合学习率调度器12.2 过拟合问题问题模型在训练集上表现好但在测试集上表现差解决方案增加权重衰减使用Dropout数据增强12.3 训练不稳定问题问题训练过程中损失波动很大解决方案使用梯度裁剪调整批量大小检查数据预处理13. 未来发展方向自适应优化器的改进如AdaBelief、Lion等新型优化器联邦学习中的优化器适应分布式环境的优化策略神经架构搜索中的优化器针对不同模型架构自动选择最优优化器硬件感知的优化器根据硬件特性调整优化策略14. 总结优化器是深度学习训练的核心组件选择合适的优化器对于模型性能至关重要。从SGD到AdamW优化器的发展历程体现了对训练效率和模型性能的不断追求。在实际应用中我们应该根据具体任务和模型特点选择合适的优化器对于简单模型和小数据集SGD with Momentum可能是最佳选择对于复杂模型和大数据集AdamW通常表现更好对于稀疏特征RMSprop可能更适合同时学习率调度、梯度裁剪、混合精度训练等技巧可以进一步提升优化器的性能。随着深度学习的不断发展优化器也在不断演进未来的优化器将更加智能能够自动适应不同的任务和模型架构为深度学习训练带来更大的效率提升。