Solutions for PyTorch DistributedDataParallel multi-GPU training producing worse results
Setting up data shuffling for DDP
When using DDP you have to pass a sampler to the DataLoader: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but according to PyTorch's DistributedSampler implementation:
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
The seed used to generate the random indices depends on the current epoch, so you have to set the epoch manually during training to get a genuinely different shuffle each epoch:
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
Worse results with DDP when the batch size is increased
Large batch size:
Theoretical advantage:
The influence of noise in the data may be smaller, so it may be easier to approach the optimum.
Disadvantages and problems:
It lowers the variance of the gradients. (In theory, for convex optimization problems a lower gradient variance gives better optimization; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)
For non-convex problems the loss function has many local optima. With a small batch size the gradient noise can help the optimizer jump out of a local optimum, whereas with a large batch size it may get stuck there.
Solution:
Warmup
Using a very large learning rate right from the start can keep training from converging. The idea of warmup is to start with a small learning rate, increase it gradually during the early phase of training until it reaches the base learning rate, and then continue with another decay schedule such as CosineAnnealingLR (a usage sketch follows the scheduler code below).
# copied from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau

class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    # hand the warmed-up base lrs over to the follow-up scheduler exactly once
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
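A minimal usage sketch of the scheduler above, assuming a 5-epoch warmup that multiplies the base lr by 8 and then hands off to CosineAnnealingLR; the model, optimizer and epoch counts are placeholders:

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)                       # placeholder model; use your own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

cosine = CosineAnnealingLR(optimizer, T_max=95)      # decay used once warmup is finished
scheduler = GradualWarmupScheduler(optimizer, multiplier=8, total_epoch=5, after_scheduler=cosine)

for epoch in range(100):
    optimizer.step()                                 # stand-in for one epoch of real training
    scheduler.step()
    print(epoch, optimizer.param_groups[0]['lr'])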
Pitfalls of distributed multi-GPU training with DistributedDataParallel
I spent some time looking into multi-GPU training over the past few days. I expected it to be easy, but there were plenty of pitfalls to work through one by one. Distributed training generally comes in two flavours: single-machine multi-GPU and multi-machine multi-GPU.
There are two main implementations:
1. DataParallel: Parameter Server mode, with one GPU acting as the reducer. Extremely simple to use: a single line of code.
Because DataParallel is based on the Parameter Server approach, load imbalance is a real problem; with larger models (e.g. bert-large) the reducer GPU can use 3-4 GB more memory than the others.
2. DistributedDataParallel: the officially recommended, newer approach. It uses all-reduce and was designed mainly for multi-machine multi-GPU training, but it also works on a single machine.
Why distributed training?
You can use several GPUs, so training is faster overall.
You can use a larger batch size.
In some cases distributed training gives better results.
The rest of this post covers:
Single machine, multi-GPU: DataParallel (most common, simplest)
Single machine, multi-GPU: DistributedDataParallel (more advanced); multiple machines, multi-GPU: DistributedDataParallel (most advanced)
How to launch training
Saving and loading models
Notes and caveats
I. Single machine, multi-GPU (DataParallel)
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel()
model = model.to(device)
model = DataParallel(model)
# or: model = DataParallel(model, device_ids=[0, 1, 2, 3])
It really is that simple: all you need to add is one line, model = DataParallel(model).
II. Multi-machine multi-GPU and single-machine multi-GPU (DistributedDataParallel)
It is best to read the notes in section V before modifying your code, to avoid mysterious bugs. The training code is changed as follows.
opt.local_rank must be parsed as a command-line argument near the top of the script; see the notes below.
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch

# Initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

model = yourModel()  # your own model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # wrap the model
    # model = torch.nn.parallel.DistributedDataParallel(model,
    #                                                   device_ids=[opt.local_rank],
    #                                                   output_device=opt.local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[opt.local_rank])
device = torch.device(opt.local_rank)
model.to(device)

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training,
                      img_size=opt.img_size, normalized_labels=True)  # your own dataset code
world_size = torch.cuda.device_count()
datasampler = DistributedSampler(dataset, num_replicas=world_size, rank=opt.local_rank)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler
)  # just add the sampler argument to your existing DataLoader
.....
During training, move the data to CUDA:
imgs = imgs.to(device)
targets = targets.to(device)
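To tie this back to the shuffle section at the top, a per-epoch sketch continuing the snippet above (it assumes a hypothetical opt.epochs and that the loader yields (imgs, targets) batches) calls set_epoch before every epoch:

for epoch in range(opt.epochs):
    datasampler.set_epoch(epoch)       # re-seed the DistributedSampler so each epoch shuffles differently
    for imgs, targets in dataloader:
        imgs = imgs.to(device)
        targets = targets.to(device)
        # ... forward pass, loss, backward pass, optimizer.step() ...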
III. How to launch training
1. DataParallel
Train as usual, i.e.
python3 train.py
2. DistributedDataParallel
Training has to be launched through torch.distributed.launch. For the usual single-node case:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES selects the GPUs to use, and --nproc_per_node is the number of GPUs per node; normally you launch one process per available GPU.
Multi-node:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 train.py
# two nodes, launched on node 0
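For reference, a hedged sketch of launching on both nodes; the master address and port are placeholder values that must point at node 0 and be reachable from node 1:

# on node 0 (the master)
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
# on node 1
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py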
If the launch succeeds, a startup message is printed for each GPU, one per process.
IV. Saving and loading models
The options a and b below correspond to each other: if you save with a, load with a.
1. Saving
a. Save only the parameters:
torch.save(model.module.state_dict(), path)
b. Save the parameters together with the network:
torch.save(model.module, path)
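Under DDP every process runs the same script, so without a guard each GPU process writes its own copy of the checkpoint. A common pattern (a sketch reusing opt.local_rank and path from above) is to save from rank 0 only:

if opt.local_rank == 0:
    torch.save(model.module.state_dict(), path)   # written once instead of once per GPU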
2. Loading
a. Loading pretrained weights for multi-GPU training:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
When loading on a single GPU you have to tell torch.load which device to map the model to. Whether 'cuda:0' is right depends on whether the model was trained on GPU 0 or GPU 1; otherwise you get: RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device. Adjust it to your own setup:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
b. Loading the full model on a single GPU.
Again, you have to specify which GPU to read the model onto:
model = torch.load(opt.weights_path, map_location="cuda:0")
I have not yet managed to get multi-GPU loading of a pretrained model to work with method b.
V. Notes and caveats
1. Add .module after model
After building the model, wrap it with the parallel wrapper and move the model and its parameters to the GPU. Note that if you need to modify a submodule or read an attribute of the wrapped model, you must go through .module, otherwise you will get an error. For example:
model.img_size has to become model.module.img_size
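A short sketch of the same point, assuming a model with a hypothetical img_size attribute:

model = DataParallel(MyModel())               # or torch.nn.parallel.DistributedDataParallel(...)
print(model.module.img_size)                  # access attributes through .module
unwrapped = model.module                      # recover the underlying nn.Module
torch.save(model.module.state_dict(), path)   # state_dict keys are saved without the 'module.' prefix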
2. .cuda versus .to(device)
device is something you set yourself; if a bare .cuda() call causes trouble, move things to the corresponding device instead (see the sketch after this list):
model (e.g. model.to(device))
input (traditionally wrapped in Variable, e.g. input = Variable(input).to(device); Variable is no longer required in recent PyTorch)
target (likewise traditionally wrapped in Variable)
nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))
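A compact sketch of the list above, moving the model, the inputs/targets and the loss module to the same device (MyModel, dataloader, imgs and targets are placeholders):

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
criterion = torch.nn.CrossEntropyLoss().to(device)
for imgs, targets in dataloader:
    imgs, targets = imgs.to(device), targets.to(device)   # no Variable wrapper needed
    loss = criterion(model(imgs), targets)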
3. The args.local_rank argument
When training is launched with torch.distributed.launch, the launcher passes an args.local_rank argument to every process, so the training code has to parse it; you can also obtain the process id via torch.distributed.get_rank().
parser.add_argument("--local_rank", type=int, default=-1, help="local process rank passed by torch.distributed.launch")
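Putting it together, a minimal sketch (using the argument name above; dist.get_rank() only works after init_process_group has been called):

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local process rank passed by torch.distributed.launch")
opt = parser.parse_args()

dist.init_process_group(backend="nccl")   # the launcher sets the env:// rendezvous variables
print("local_rank:", opt.local_rank, "global rank:", dist.get_rank())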
The above is based on my personal experience; I hope it gives you a useful reference.