Solutions for PyTorch DistributedDataParallel multi-GPU training producing worse results
Setting up data shuffling for DDP
When using DDP you have to pass a sampler to the DataLoader: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but according to PyTorch's DistributedSampler implementation:
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
The seed used to generate the random indices depends on the current epoch, so you have to set the epoch manually during training to get a genuinely different shuffle each epoch:
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
Worse results with DDP when the batch size is increased
Large batch size:
Theoretical advantage:
The influence of noise in the data may be smaller, so it may be easier to approach the optimum.
Disadvantages and problems:
It lowers the variance of the gradients. (In theory, for convex optimization problems a lower gradient variance gives better optimization; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)
For non-convex problems the loss function has many local optima. With a small batch size the gradient noise can help the optimizer jump out of a local optimum, whereas with a large batch size it may get stuck there.
Solution:
Warmup
Using a very large learning rate right from the start can keep training from converging. The idea of warmup is to start with a small learning rate, increase it gradually during the early phase of training until it reaches the base learning rate, and then continue with another decay schedule such as CosineAnnealingLR (a usage sketch follows the scheduler code below).
# copied from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau

class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    # hand the warmed-up base lrs over to the follow-up scheduler exactly once
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
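A minimal usage sketch of the scheduler above, assuming a 5-epoch warmup that multiplies the base lr by 8 and then hands off to CosineAnnealingLR; the model, optimizer and epoch counts are placeholders:

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)                       # placeholder model; use your own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

cosine = CosineAnnealingLR(optimizer, T_max=95)      # decay used once warmup is finished
scheduler = GradualWarmupScheduler(optimizer, multiplier=8, total_epoch=5, after_scheduler=cosine)

for epoch in range(100):
    optimizer.step()                                 # stand-in for one epoch of real training
    scheduler.step()
    print(epoch, optimizer.param_groups[0]['lr'])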
Pitfalls of distributed multi-GPU training with DistributedDataParallel
I spent some time looking into multi-GPU training over the past few days. I expected it to be easy, but there were plenty of pitfalls to work through one by one. Distributed training generally comes in two flavours: single-machine multi-GPU and multi-machine multi-GPU.
There are two main implementations:
1. DataParallel: Parameter Server mode, with one GPU acting as the reducer. Extremely simple to use: a single line of code.
Because DataParallel is based on the Parameter Server approach, load imbalance is a real problem; with larger models (e.g. bert-large) the reducer GPU can use 3-4 GB more memory than the others.
2. DistributedDataParallel: the officially recommended, newer approach. It uses all-reduce and was designed mainly for multi-machine multi-GPU training, but it also works on a single machine.
Why distributed training?
You can use several GPUs, so training is faster overall.
You can use a larger batch size.
In some cases distributed training gives better results.
The rest of this post covers:
Single machine, multi-GPU: DataParallel (most common, simplest)
Single machine, multi-GPU: DistributedDataParallel (more advanced); multiple machines, multi-GPU: DistributedDataParallel (most advanced)
How to launch training
Saving and loading models
Notes and caveats
I. Single machine, multi-GPU (DataParallel)
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel()
model = model.to(device)
model = DataParallel(model)
# or: model = DataParallel(model, device_ids=[0, 1, 2, 3])
It really is that simple: all you need to add is one line, model = DataParallel(model).
II. Multi-machine multi-GPU and single-machine multi-GPU (DistributedDataParallel)
It is best to read the notes in section V before modifying your code, to avoid mysterious bugs. The training code is changed as follows.
opt.local_rank must be parsed as a command-line argument near the top of the script; see the notes below.
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch

# Initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

model = yourModel()  # your own model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # wrap the model
    # model = torch.nn.parallel.DistributedDataParallel(model,
    #                                                   device_ids=[opt.local_rank],
    #                                                   output_device=opt.local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[opt.local_rank])
device = torch.device(opt.local_rank)
model.to(device)

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training,
                      img_size=opt.img_size, normalized_labels=True)  # your own dataset code
world_size = torch.cuda.device_count()
datasampler = DistributedSampler(dataset, num_replicas=world_size, rank=opt.local_rank)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler
)  # just add the sampler argument to your existing DataLoader
.....
During training, move the data to CUDA:
imgs = imgs.to(device)
targets = targets.to(device)
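To tie this back to the shuffle section at the top, a per-epoch sketch continuing the snippet above (it assumes a hypothetical opt.epochs and that the loader yields (imgs, targets) batches) calls set_epoch before every epoch:

for epoch in range(opt.epochs):
    datasampler.set_epoch(epoch)       # re-seed the DistributedSampler so each epoch shuffles differently
    for imgs, targets in dataloader:
        imgs = imgs.to(device)
        targets = targets.to(device)
        # ... forward pass, loss, backward pass, optimizer.step() ...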
III. How to launch training
1. DataParallel
Train as usual, i.e.
python3 train.py
2. DistributedDataParallel
Training has to be launched through torch.distributed.launch. For the usual single-node case:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES selects the GPUs to use, and --nproc_per_node is the number of GPUs per node; normally you launch one process per available GPU.
Multi-node:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 train.py
# two nodes, launched on node 0
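For reference, a hedged sketch of launching on both nodes; the master address and port are placeholder values that must point at node 0 and be reachable from node 1:

# on node 0 (the master)
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=29500 train.py
# on node 1
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=29500 train.py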
If the launch succeeds, a startup message is printed for each GPU, one per process.
IV. Saving and loading models
The options a and b below correspond to each other: if you save with a, load with a.
1. Saving
a. Save only the parameters:
torch.save(model.module.state_dict(), path)
b. Save the parameters together with the network:
torch.save(model.module, path)
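Under DDP every process runs the same script, so without a guard each GPU process writes its own copy of the checkpoint. A common pattern (a sketch reusing opt.local_rank and path from above) is to save from rank 0 only:

if opt.local_rank == 0:
    torch.save(model.module.state_dict(), path)   # written once instead of once per GPU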
2. Loading
a. Loading pretrained weights for multi-GPU training:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
When loading on a single GPU you have to tell torch.load which device to map the model to. Whether 'cuda:0' is right depends on whether the model was trained on GPU 0 or GPU 1; otherwise you get: RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device. Adjust it to your own setup:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
b. Loading the full model on a single GPU.
Again, you have to specify which GPU to read the model onto:
model = torch.load(opt.weights_path, map_location="cuda:0")
I have not yet managed to get multi-GPU loading of a pretrained model to work with method b.
V. Notes and caveats
1. Add .module after model
After building the model, wrap it with the parallel wrapper and move the model and its parameters to the GPU. Note that if you need to modify a submodule or read an attribute of the wrapped model, you must go through .module, otherwise you will get an error. For example:
model.img_size has to become model.module.img_size
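A short sketch of the same point, assuming a model with a hypothetical img_size attribute:

model = DataParallel(MyModel())               # or torch.nn.parallel.DistributedDataParallel(...)
print(model.module.img_size)                  # access attributes through .module
unwrapped = model.module                      # recover the underlying nn.Module
torch.save(model.module.state_dict(), path)   # state_dict keys are saved without the 'module.' prefix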
2. .cuda versus .to(device)
device is something you set yourself; if a bare .cuda() call causes trouble, move things to the corresponding device instead (see the sketch after this list):
model (e.g. model.to(device))
input (traditionally wrapped in Variable, e.g. input = Variable(input).to(device); Variable is no longer required in recent PyTorch)
target (likewise traditionally wrapped in Variable)
nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))
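A compact sketch of the list above, moving the model, the inputs/targets and the loss module to the same device (MyModel, dataloader, imgs and targets are placeholders):

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
criterion = torch.nn.CrossEntropyLoss().to(device)
for imgs, targets in dataloader:
    imgs, targets = imgs.to(device), targets.to(device)   # no Variable wrapper needed
    loss = criterion(model(imgs), targets)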
3. The args.local_rank argument
When training is launched with torch.distributed.launch, the launcher passes an args.local_rank argument to every process, so the training code has to parse it; you can also obtain the process id via torch.distributed.get_rank().
parser.add_argument("--local_rank", type=int, default=-1, help="local process rank passed by torch.distributed.launch")
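Putting it together, a minimal sketch (using the argument name above; dist.get_rank() only works after init_process_group has been called):

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local process rank passed by torch.distributed.launch")
opt = parser.parse_args()

dist.init_process_group(backend="nccl")   # the launcher sets the env:// rendezvous variables
print("local_rank:", opt.local_rank, "global rank:", dist.get_rank())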
The above is based on my personal experience; I hope it gives you a useful reference.