The num_workers parameter of DataLoader in PyTorch

Quick conclusion
On Windows, it is recommended to set num_workers to 0; on Linux there is no need to worry about this.
1 Problem description
In a previous task, the num_workers parameter came up when using PyG's data-loading classes such as DataLoader and ClusterLoader. Trying different values for this parameter in different system environments produced different results:
Colab (Linux), set to 12
A warning message appears, indicating that the num_workers setting is unreasonable, but execution continues:
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 12 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary
Windows, set to 1
An error is raised.
So I wanted to figure out why the error occurs.
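To make the scenario concrete, here is a minimal sketch that can trigger the same warning. It uses a plain TensorDataset rather than the PyG loaders from the original task, and 12 is simply the value used above; both are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Placeholder dataset standing in for the PyG data used in the original task.
    dataset = TensorDataset(torch.randn(1024, 8))

    # On a machine with fewer than 12 usable logical CPUs (such as the Colab
    # instance above), iterating this loader prints the UserWarning about
    # excessive worker creation; on Windows it may instead fail at runtime.
    loader = DataLoader(dataset, batch_size=32, num_workers=12)

    for batch in loader:
        break
```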
2 Investigation
So I read the source code, tracing back to the parent class; everything concerning num_workers is defined in that piece of source.
2.1 num_workers
The docstring contains a description of the num_workers parameter.
That is, num_workers specifies how many subprocesses are created for data loading; if it is set to 0, data is loaded in the main process. As we know, the number of worker subprocesses that can run effectively is limited by the machine's CPU cores and threads, so setting a large num_workers does not mean that many workers can actually run in parallel.
That is why the source defines the method check_worker_number_rationality(), which checks whether the num_workers setting is reasonable.
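The actual PyTorch method is discussed in the next subsection; as a standalone, simplified sketch of the idea behind that check (not the verbatim library code), it could look like this:

```python
import os
import warnings

def check_worker_number_rationality(num_workers: int) -> None:
    """Warn if num_workers exceeds the logical CPUs this process may use."""
    if hasattr(os, "sched_getaffinity"):        # Linux: respect the affinity mask
        max_num_worker_suggest = len(os.sched_getaffinity(0))
    else:                                       # Windows / others: logical CPU count
        max_num_worker_suggest = os.cpu_count() or 1

    if num_workers > max_num_worker_suggest:
        warnings.warn(
            f"This DataLoader will create {num_workers} worker processes in total, "
            f"but the suggested max number of workers on this system is "
            f"{max_num_worker_suggest}; excessive worker creation might make the "
            f"DataLoader slow or even freeze."
        )
```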
2.2 The warning message on Colab
The warning message on Colab comes from this check_worker_number_rationality() method. The beginning of its source explains why the rationality check is performed:
This function check whether the dataloader's worker number is rational based on current system's resource. Current rule is that if the number of workers this Dataloader will create is bigger than the number of logical cpus that is allowed to use, than we will pop up a warning to let user pay attention.
eg. If current system has 2 physical CPUs with 16 cores each. And each core support 2 threads, then the total logical cpus here is 2 * 16 * 2 = 64. Let's say current DataLoader process can use half of them which is 32, then the rational max number of worker that initiated from this process is 32. Now, let's say the created DataLoader has num_works = 40, which is bigger than 32. So the warning message is triggered to notify the user to lower the worker number if necessary.
That is, the num_workers setting should not exceed max_num_worker_suggest, the number of threads the CPU can run. When num_workers exceeds that number, the warning message is triggered by the corresponding source code.
That code determines max_num_worker_suggest differently on Linux and on Windows: on Linux it is obtained directly via len(os.sched_getaffinity(0)), while on Windows it is obtained via os.cpu_count() (which returns the number of logical processors, i.e. threads, not physical cores). On the machine tested above this value was 8, the number of logical processors of its CPU.
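These two queries can be tried directly in an interpreter; os.sched_getaffinity exists only on Linux, so a guard is needed:

```python
import os

# Linux only: logical CPUs this process is actually allowed to use
# (respects taskset / cgroup restrictions).
if hasattr(os, "sched_getaffinity"):
    print("sched_getaffinity:", len(os.sched_getaffinity(0)))

# Works everywhere, including Windows: total number of logical processors
# (threads), not physical cores.
print("os.cpu_count():", os.cpu_count())
```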
At this point we know that the warning message on Colab appeared because the num_workers we set exceeded max_num_worker_suggest, i.e. the number of CPU threads.
But in the earlier Windows attempt, the num_workers value did not exceed the number of worker processes the CPU supports, and even if it had, that would only produce a warning message rather than an error. So we keep looking for the source of the Windows error.
2.3 The error message on Windows
Following the Windows error message, a RuntimeError, we keep reading the source and locate the relevant part.
When a worker process is not killed within the specified time, a RuntimeError is raised.
At this step we can pinpoint where the error comes from, but we still do not know what causes the worker process to fail.
Searching other people's notes on the internet turned up the following explanation:
On Windows, a FileMapping object can only be released after all related processes have closed it.
When multiprocessing is enabled, a child process creates the FileMapping and the main process then opens it. Later, when the child process tries to release it, the reference count is non-zero because the parent process still holds a reference, so it cannot be released at that moment. The current code, however, does not provide a chance to close it again later when that becomes possible.
A related explanation also appears in the corresponding PyTorch pull request:
The memory leak is caused by the difference in using FileMapping (mmap) on Windows. On Windows, FileMapping objects should be closed by all related processes and then it can be released. And there's no other way to explicitly delete it. (Like shm_unlink)
When multiprocessing is on, the child process will create a FileMapping and then the main process will open it. After that, at some time, the child process will try to release it but its reference count is non-zero so it cannot be released at that time. But the current code does not provide a chance to let it close again when possible.
This PR targets #5590.
Current Progress:
The memory leak when num_worker=1 should be solved. However, further work has to be done for more workers.
In other words, when running on Windows, num_workers should currently be set to 0 to make sure things run.
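A minimal sketch of the resulting cross-platform setup (the dataset, batch size, and Linux worker count are placeholders); note that whenever num_workers > 0 on Windows, the usual `if __name__ == "__main__"` guard is also required, because worker processes are started via spawn:

```python
import sys
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader() -> DataLoader:
    dataset = TensorDataset(torch.randn(256, 10))          # placeholder dataset
    # 0 on Windows: load data in the main process and avoid the FileMapping
    # release problem described above; a small positive number is fine on Linux.
    num_workers = 0 if sys.platform.startswith("win") else 4
    return DataLoader(dataset, batch_size=32, num_workers=num_workers)

if __name__ == "__main__":
    for batch in make_loader():
        pass
```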
