【pytorch】Learning PyTorch DataLoader Data Loading (Part 1)
DataLoader
DataLoader converts your data into Tensors and then iterates over them efficiently. It greatly simplifies the data-reading pipeline and makes training ("alchemy", as the community calls it) more convenient.
I. A simple example first:
1. Import the modules:
import torch
import torch.utils.data as Data
torch.manual_seed(1)
2. Generate the torch data
x = torch.linspace(1,10,10)
y = torch.linspace(10,1,10)
3. Wrap the generated data in a Dataset and a DataLoader
BATCH_SIZE = 5
torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2
)
4. Iterate over the data with the DataLoader
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [ 4. 6. 7. 10. 8.] | batch y: [7. 5. 4. 1. 3.]
Epoch: 0 | Step: 1 | batch x: [5. 3. 2. 1. 9.] | batch y: [ 6. 8. 9. 10. 2.]
Epoch: 1 | Step: 0 | batch x: [ 4. 2. 5. 6. 10.] | batch y: [7. 9. 6. 5. 1.]
Epoch: 1 | Step: 1 | batch x: [3. 9. 1. 8. 7.] | batch y: [ 8. 2. 10. 3. 4.]
Epoch: 2 | Step: 0 | batch x: [ 4. 10. 9. 8. 7.] | batch y: [7. 1. 2. 3. 4.]
Epoch: 2 | Step: 1 | batch x: [6. 1. 2. 5. 3.] | batch y: [ 5. 10. 9. 6. 8.]
II. When the data length is not divisible by the batch size
In the toy example above, batch_size = 5 and the data length is 10, so exactly two steps exhaust the data. What if batch_size = 8? We find that on the second step of each epoch only 2 samples are left:
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2
)
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [10. 2. 9. 5. 6. 4. 8. 7.] | batch y: [1. 9. 2. 6. 5. 7. 3. 4.]
Epoch: 0 | Step: 1 | batch x: [1. 3.] | batch y: [10. 8.]
Epoch: 1 | Step: 0 | batch x: [7. 2. 8. 9. 6. 5. 3. 1.] | batch y: [ 4. 9. 3. 2. 5. 6. 8. 10.]
Epoch: 1 | Step: 1 | batch x: [10. 4.] | batch y: [1. 7.]
Epoch: 2 | Step: 0 | batch x: [ 1. 6. 3. 7. 10. 8. 4. 2.] | batch y: [10. 5. 8. 4. 1. 3. 7. 9.]
Epoch: 2 | Step: 1 | batch x: [9. 5.] | batch y: [2. 6.]
We can see that the final step of each epoch yields only the remaining 2 samples (10 - 8).
If we do not want that leftover batch, we set drop_last=True when constructing the DataLoader:
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    drop_last=True
)
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [ 6. 5. 7. 3. 8. 10. 9. 2.] | batch y: [5. 6. 4. 8. 3. 1. 2. 9.]
Epoch: 1 | Step: 0 | batch x: [ 1. 10. 5. 2. 4. 6. 9. 8.] | batch y: [10. 1. 6. 9. 7. 5. 2. 3.]
Epoch: 2 | Step: 0 | batch x: [3. 4. 1. 8. 6. 5. 2. 7.] | batch y: [ 8. 7. 10. 3. 5. 6. 9. 4.]
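As a quick sanity check (this snippet is a sketch reusing torch_dataset from above, not part of the original text), the number of steps per epoch can also be read directly from len(loader):

# Without drop_last the incomplete batch is kept: ceil(10 / 8) = 2 steps per epoch.
loader_keep = Data.DataLoader(dataset=torch_dataset, batch_size=8, shuffle=True)
print(len(loader_keep))    # 2

# With drop_last=True the incomplete batch is discarded: floor(10 / 8) = 1 step per epoch.
loader_drop = Data.DataLoader(dataset=torch_dataset, batch_size=8, shuffle=True, drop_last=True)
print(len(loader_drop))    # 1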
III. About Dataset and DataLoader
We have seen that PyTorch loads data mainly through Dataset and DataLoader, so here is a brief introduction to each.
Dataset
class Dataset(object):
    """An abstract class representing a Dataset.

    All other datasets should subclass it. All subclasses should override
    ``__len__``, that provides the size of the dataset, and ``__getitem__``,
    supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
The example code above used TensorDataset, which is a subclass of Dataset.
class TensorDataset(Dataset):
    """Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
There are three member functions in total: __init__, __getitem__, and __len__. __init__ handles initialization, __getitem__ returns a single sample (note: one sample at a time), and __len__ returns the length of the dataset.
This raises a question: in the toy example above we simply created data of length 10 and handed it to the Dataset. In practice the amount of data is usually huge; loading it all into memory at once would blow up the memory, so assigning it to a Dataset directly is basically impossible. We therefore need to write our own Dataset subclass, which will be covered later.
What we need to understand is that to build our own subclass we only have to follow TensorDataset and implement the three member functions __init__, __getitem__, and __len__ (see the sketch below).
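A minimal sketch of such a subclass, assuming the samples live in separate files on disk (the file list and the load_sample helper are hypothetical, for illustration only): it keeps only the file paths in memory and reads each sample on demand in __getitem__.

import torch.utils.data as Data

class LazyFileDataset(Data.Dataset):
    """Keeps only a list of file paths in memory; reads one sample per __getitem__ call."""

    def __init__(self, file_paths):
        self.file_paths = file_paths           # cheap to hold, even for huge datasets

    def __getitem__(self, index):
        path = self.file_paths[index]
        sample, label = load_sample(path)      # hypothetical helper that reads a single file
        return sample, label

    def __len__(self):
        return len(self.file_paths)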
DataLoader
DataLoader's parameter list:
dataset: a Dataset (or subclass) instance to load data from
batch_size: the size of each batch
shuffle: whether to shuffle the data
sampler: defines the strategy for drawing samples from the dataset; usually the default is fine
num_workers: number of worker subprocesses used to load data; 0 means the data is loaded in the main process
collate_fn: merges individual samples from the Dataset into a batch (a padding sketch follows this list)
drop_last: whether to drop the last incomplete batch
timeout: if positive, how long to wait when collecting a batch from the workers; if the batch is not collected within this time, it is given up. This value should always be >= 0. Default: 0
pin_memory: page-locked memory; generally set to True when training on a GPU and False on a CPU.
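As an illustration of collate_fn (referenced in the list above), here is a sketch that pads variable-length 1-D sequences into a single batch tensor; the padding use case is an assumption for illustration, not something from the original text.

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs returned by Dataset.__getitem__
    seqs, labels = zip(*batch)
    # stack variable-length 1-D tensors into a single padded 2-D tensor
    padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)
    return padded, torch.tensor(labels)

# loader = Data.DataLoader(dataset, batch_size=4, collate_fn=pad_collate)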
pin_memory means page-locked memory. When a DataLoader is created with pin_memory=True, the Tensors it produces are placed in page-locked host memory, so copying them to GPU memory is faster.
Host memory exists in two forms: page-locked and pageable. The contents of page-locked memory are never swapped out to the host's virtual memory (i.e., to disk), whereas pageable memory can be moved to virtual memory when host RAM runs low. GPU memory, by contrast, is entirely page-locked.
When the machine has plenty of RAM you can set pin_memory=True. If the system stalls or swap usage becomes heavy, set pin_memory=False. Because pin_memory depends on the hardware, and the PyTorch developers cannot assume every user has a high-end machine, pin_memory defaults to False.
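A minimal sketch of the usual pattern, assuming a CUDA device is available (the training-loop body is omitted): pinned batches can be copied to the GPU asynchronously with non_blocking=True.

loader = Data.DataLoader(torch_dataset, batch_size=2, pin_memory=True)
device = torch.device('cuda')
for batchX, batchY in loader:
    # copies from pinned host memory can overlap with GPU computation
    batchX = batchX.to(device, non_blocking=True)
    batchY = batchY.to(device, non_blocking=True)
    # ... forward / backward pass here ...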
class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with :attr:`batch_size`,
            :attr:`shuffle`, :attr:`sampler`, and :attr:`drop_last`.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: ``0``)
        collate_fn (callable, optional): merges a list of samples to form a mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy tensors
            into CUDA pinned memory before returning them. If your data elements
            are a custom type, or your ``collate_fn`` returns a batch that is a custom type,
            see the example below.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: ``False``)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: ``0``)
        worker_init_fn (callable, optional): If not ``None``, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: ``None``)

    .. note:: When ``num_workers != 0``, the corresponding worker processes are created each
              time the iterator for the DataLoader is obtained (as in when you call
              ``enumerate(dataloader, 0)``).
              At this point, the dataset, ``collate_fn`` and ``worker_init_fn`` are passed to each
              worker, where they are used to access and initialize data based on the indices
              queued up from the main process. This means that dataset access together with
              its internal IO, transforms and collation runs in the worker, while any
              shuffle randomization is done in the main process which guides loading by assigning
              indices to load. Workers are shut down once the end of the iteration is reached.

              Since workers rely on Python multiprocessing, worker launch behavior is different
              on Windows compared to Unix. On Unix fork() is used as the default
              multiprocessing start method, so child workers typically can access the dataset and
              Python argument functions directly through the cloned address space. On Windows,
              another interpreter is launched which runs your main script, followed by the internal
              worker function that receives the dataset, collate_fn and other arguments
              through Pickle serialization.

              This separate serialization means that you should take two steps to ensure you
              are compatible with Windows while using workers
              (this also works equally well on Unix):

              - Wrap most of your main script's code within ``if __name__ == '__main__':`` block,
                to make sure it doesn't run again (most likely generating error) when each worker
                process is launched. You can place your dataset and DataLoader instance creation
                logic here, as it doesn't need to be re-executed in workers.
              - Make sure that ``collate_fn``, ``worker_init_fn`` or any custom dataset code
                is declared as a top level def, outside of that ``__main__`` check. This ensures
                they are available in workers as well
                (this is needed since functions are pickled as references only, not bytecode).

              By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long generated
              by main process using its RNG. However, seeds for other libraries
              may be duplicated upon initializing workers (e.g., NumPy), causing
              each worker to return identical random numbers. (See
              :ref:`dataloader-workers-random-seed` section in FAQ.) You may
              use :func:`torch.initial_seed()` to access the PyTorch seed for
              each worker in :attr:`worker_init_fn`, and use it to set other
              seeds before data loading.

    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                 unpicklable object, e.g., a lambda function.

    The default memory pinning logic only recognizes Tensors and maps and iterables
    containing Tensors. By default, if the pinning logic sees a batch that is a custom type
    (which will occur if you have a ``collate_fn`` that returns a custom batch type),
    or if each element of your batch is a custom type, the pinning logic will not
    recognize them, and it will return that batch (or those elements)
    without pinning the memory. To enable memory pinning for custom batch or data types,
    define a ``pin_memory`` method on your custom type(s).

    Example::

        class SimpleCustomBatch:
            def __init__(self, data):
                transposed_data = list(zip(*data))
                self.inp = torch.stack(transposed_data[0], 0)
                self.tgt = torch.stack(transposed_data[1], 0)

            def pin_memory(self):
                self.inp = self.inp.pin_memory()
                self.tgt = self.tgt.pin_memory()
                return self

        def collate_wrapper(batch):
            return SimpleCustomBatch(batch)

        inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
        tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
        dataset = TensorDataset(inps, tgts)

        loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                            pin_memory=True)

        for batch_ndx, sample in enumerate(loader):
            print(sample.inp.is_pinned())
            print(sample.tgt.is_pinned())
    """
    __initialized = False

    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,
                 batch_sampler=None, num_workers=0, collate_fn=default_collate,
                 pin_memory=False, drop_last=False, timeout=0,
                 worker_init_fn=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last
        self.timeout = timeout
        self.worker_init_fn = worker_init_fn

        if timeout < 0:
            raise ValueError('timeout option should be non-negative')

        if batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            self.batch_size = None
            self.drop_last = None

        if sampler is not None and shuffle:
            raise ValueError('sampler option is mutually exclusive with '
                             'shuffle')

        if self.num_workers < 0:
            raise ValueError('num_workers option cannot be negative; '
                             'use num_workers=0 to disable multiprocessing.')

        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    sampler = RandomSampler(dataset)
                else:
                    sampler = SequentialSampler(dataset)
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)

        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.__initialized = True
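Following the note above about per-worker seeds, here is a sketch of a worker_init_fn that reseeds NumPy in every worker so that random augmentations differ between workers (the use of NumPy here is an assumed scenario, not from the original text):

import numpy as np
import torch
import torch.utils.data as Data

def seed_worker(worker_id):
    # PyTorch has already set this worker's seed to base_seed + worker_id;
    # reuse it so NumPy does not draw identical numbers in every worker.
    np.random.seed(torch.initial_seed() % 2**32)

# loader = Data.DataLoader(dataset, batch_size=8, num_workers=2,
#                          worker_init_fn=seed_worker)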