【pytorch】Learning PyTorch DataLoader Data Loading (Part 1)
DataLoader
DataLoader converts your data into Tensors and then iterates over them efficiently. It greatly simplifies the data-reading pipeline and makes training ("alchemy", as the community calls it) more convenient.
I. A simple example first:
1. Import the modules:
import torch
import torch.utils.data as Data
torch.manual_seed(1)
2. Generate the torch data
x = torch.linspace(1,10,10)
y = torch.linspace(10,1,10)
3. Wrap the generated data in a Dataset and a DataLoader
BATCH_SIZE = 5
torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2
)
4. Iterate over the data with the DataLoader
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [ 4. 6. 7. 10. 8.] | batch y: [7. 5. 4. 1. 3.]
Epoch: 0 | Step: 1 | batch x: [5. 3. 2. 1. 9.] | batch y: [ 6. 8. 9. 10. 2.]
Epoch: 1 | Step: 0 | batch x: [ 4. 2. 5. 6. 10.] | batch y: [7. 9. 6. 5. 1.]
Epoch: 1 | Step: 1 | batch x: [3. 9. 1. 8. 7.] | batch y: [ 8. 2. 10. 3. 4.]
Epoch: 2 | Step: 0 | batch x: [ 4. 10. 9. 8. 7.] | batch y: [7. 1. 2. 3. 4.]
Epoch: 2 | Step: 1 | batch x: [6. 1. 2. 5. 3.] | batch y: [ 5. 10. 9. 6. 8.]
II. When the data length is not divisible by the batch size
In the toy example above, batch_size = 5 and the data length is 10, so exactly two steps exhaust the data. What if batch_size = 8? We find that on the second step of each epoch only 2 samples are left:
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2
)
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [10. 2. 9. 5. 6. 4. 8. 7.] | batch y: [1. 9. 2. 6. 5. 7. 3. 4.]
Epoch: 0 | Step: 1 | batch x: [1. 3.] | batch y: [10. 8.]
Epoch: 1 | Step: 0 | batch x: [7. 2. 8. 9. 6. 5. 3. 1.] | batch y: [ 4. 9. 3. 2. 5. 6. 8. 10.]
Epoch: 1 | Step: 1 | batch x: [10. 4.] | batch y: [1. 7.]
Epoch: 2 | Step: 0 | batch x: [ 1. 6. 3. 7. 10. 8. 4. 2.] | batch y: [10. 5. 8. 4. 1. 3. 7. 9.]
Epoch: 2 | Step: 1 | batch x: [9. 5.] | batch y: [2. 6.]
We can see that the final step of each epoch yields only the remaining 2 samples (10 - 8).
If we do not want that leftover batch, we set drop_last=True when constructing the DataLoader:
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    drop_last=True
)
for epoch in range(3):
    for step, (batchX, batchY) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
              batchX.numpy(), '| batch y: ', batchY.numpy())
Output:
Epoch: 0 | Step: 0 | batch x: [ 6. 5. 7. 3. 8. 10. 9. 2.] | batch y: [5. 6. 4. 8. 3. 1. 2. 9.]
Epoch: 1 | Step: 0 | batch x: [ 1. 10. 5. 2. 4. 6. 9. 8.] | batch y: [10. 1. 6. 9. 7. 5. 2. 3.]
Epoch: 2 | Step: 0 | batch x: [3. 4. 1. 8. 6. 5. 2. 7.] | batch y: [ 8. 7. 10. 3. 5. 6. 9. 4.]
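As a quick sanity check (this snippet is a sketch reusing torch_dataset from above, not part of the original text), the number of steps per epoch can also be read directly from len(loader):

# Without drop_last the incomplete batch is kept: ceil(10 / 8) = 2 steps per epoch.
loader_keep = Data.DataLoader(dataset=torch_dataset, batch_size=8, shuffle=True)
print(len(loader_keep))    # 2

# With drop_last=True the incomplete batch is discarded: floor(10 / 8) = 1 step per epoch.
loader_drop = Data.DataLoader(dataset=torch_dataset, batch_size=8, shuffle=True, drop_last=True)
print(len(loader_drop))    # 1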
III. About Dataset and DataLoader
We have seen that PyTorch loads data mainly through Dataset and DataLoader, so here is a brief introduction to each.
Dataset
class Dataset(object):
    """An abstract class representing a Dataset.

    All other datasets should subclass it. All subclasses should override
    ``__len__``, that provides the size of the dataset, and ``__getitem__``,
    supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
The example code above used TensorDataset, which is a subclass of Dataset.
class TensorDataset(Dataset):
    """Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
There are three member functions in total: __init__, __getitem__, and __len__. __init__ handles initialization, __getitem__ returns a single sample (note: one sample at a time), and __len__ returns the length of the dataset.
This raises a question: in the toy example above we simply created data of length 10 and handed it to the Dataset. In practice the amount of data is usually huge; loading it all into memory at once would blow up the memory, so assigning it to a Dataset directly is basically impossible. We therefore need to write our own Dataset subclass, which will be covered later.
What we need to understand is that to build our own subclass we only have to follow TensorDataset and implement the three member functions __init__, __getitem__, and __len__ (see the sketch below).
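A minimal sketch of such a subclass, assuming the samples live in separate files on disk (the file list and the load_sample helper are hypothetical, for illustration only): it keeps only the file paths in memory and reads each sample on demand in __getitem__.

import torch.utils.data as Data

class LazyFileDataset(Data.Dataset):
    """Keeps only a list of file paths in memory; reads one sample per __getitem__ call."""

    def __init__(self, file_paths):
        self.file_paths = file_paths           # cheap to hold, even for huge datasets

    def __getitem__(self, index):
        path = self.file_paths[index]
        sample, label = load_sample(path)      # hypothetical helper that reads a single file
        return sample, label

    def __len__(self):
        return len(self.file_paths)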
DataLoader
DataLoader's parameter list:
dataset: a Dataset (or subclass) instance to load data from
batch_size: the size of each batch
shuffle: whether to shuffle the data
sampler: defines the strategy for drawing samples from the dataset; usually the default is fine
num_workers: number of worker subprocesses used to load data; 0 means the data is loaded in the main process
collate_fn: merges individual samples from the Dataset into a batch (a padding sketch follows this list)
drop_last: whether to drop the last incomplete batch
timeout: if positive, how long to wait when collecting a batch from the workers; if the batch is not collected within this time, it is given up. This value should always be >= 0. Default: 0
pin_memory: page-locked memory; generally set to True when training on a GPU and False on a CPU.
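As an illustration of collate_fn (referenced in the list above), here is a sketch that pads variable-length 1-D sequences into a single batch tensor; the padding use case is an assumption for illustration, not something from the original text.

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs returned by Dataset.__getitem__
    seqs, labels = zip(*batch)
    # stack variable-length 1-D tensors into a single padded 2-D tensor
    padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)
    return padded, torch.tensor(labels)

# loader = Data.DataLoader(dataset, batch_size=4, collate_fn=pad_collate)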
pin_memory means page-locked memory. When a DataLoader is created with pin_memory=True, the Tensors it produces are placed in page-locked host memory, so copying them to GPU memory is faster.
Host memory exists in two forms: page-locked and pageable. The contents of page-locked memory are never swapped out to the host's virtual memory (i.e., to disk), whereas pageable memory can be moved to virtual memory when host RAM runs low. GPU memory, by contrast, is entirely page-locked.
When the machine has plenty of RAM you can set pin_memory=True. If the system stalls or swap usage becomes heavy, set pin_memory=False. Because pin_memory depends on the hardware, and the PyTorch developers cannot assume every user has a high-end machine, pin_memory defaults to False.
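A minimal sketch of the usual pattern, assuming a CUDA device is available (the training-loop body is omitted): pinned batches can be copied to the GPU asynchronously with non_blocking=True.

loader = Data.DataLoader(torch_dataset, batch_size=2, pin_memory=True)
device = torch.device('cuda')
for batchX, batchY in loader:
    # copies from pinned host memory can overlap with GPU computation
    batchX = batchX.to(device, non_blocking=True)
    batchY = batchY.to(device, non_blocking=True)
    # ... forward / backward pass here ...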
class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with :attr:`batch_size`,
            :attr:`shuffle`, :attr:`sampler`, and :attr:`drop_last`.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: ``0``)
        collate_fn (callable, optional): merges a list of samples to form a mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy tensors
            into CUDA pinned memory before returning them. If your data elements
            are a custom type, or your ``collate_fn`` returns a batch that is a custom type,
            see the example below.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: ``False``)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: ``0``)
        worker_init_fn (callable, optional): If not ``None``, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: ``None``)

    .. note:: When ``num_workers != 0``, the corresponding worker processes are created each
              time the iterator for the DataLoader is obtained (as in when you call
              ``enumerate(dataloader, 0)``).
              At this point, the dataset, ``collate_fn`` and ``worker_init_fn`` are passed to each
              worker, where they are used to access and initialize data based on the indices
              queued up from the main process. This means that dataset access together with
              its internal IO, transforms and collation runs in the worker, while any
              shuffle randomization is done in the main process which guides loading by assigning
              indices to load. Workers are shut down once the end of the iteration is reached.

              Since workers rely on Python multiprocessing, worker launch behavior is different
              on Windows compared to Unix. On Unix fork() is used as the default
              multiprocessing start method, so child workers typically can access the dataset and
              Python argument functions directly through the cloned address space. On Windows,
              another interpreter is launched which runs your main script, followed by the internal
              worker function that receives the dataset, collate_fn and other arguments
              through Pickle serialization.

              This separate serialization means that you should take two steps to ensure you
              are compatible with Windows while using workers
              (this also works equally well on Unix):

              - Wrap most of your main script's code within ``if __name__ == '__main__':`` block,
                to make sure it doesn't run again (most likely generating error) when each worker
                process is launched. You can place your dataset and DataLoader instance creation
                logic here, as it doesn't need to be re-executed in workers.
              - Make sure that ``collate_fn``, ``worker_init_fn`` or any custom dataset code
                is declared as a top level def, outside of that ``__main__`` check. This ensures
                they are available in workers as well
                (this is needed since functions are pickled as references only, not bytecode).

              By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long generated
              by main process using its RNG. However, seeds for other libraries
              may be duplicated upon initializing workers (e.g., NumPy), causing
              each worker to return identical random numbers. (See
              :ref:`dataloader-workers-random-seed` section in FAQ.) You may
              use :func:`torch.initial_seed()` to access the PyTorch seed for
              each worker in :attr:`worker_init_fn`, and use it to set other
              seeds before data loading.

    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                 unpicklable object, e.g., a lambda function.

    The default memory pinning logic only recognizes Tensors and maps and iterables
    containing Tensors. By default, if the pinning logic sees a batch that is a custom type
    (which will occur if you have a ``collate_fn`` that returns a custom batch type),
    or if each element of your batch is a custom type, the pinning logic will not
    recognize them, and it will return that batch (or those elements)
    without pinning the memory. To enable memory pinning for custom batch or data types,
    define a ``pin_memory`` method on your custom type(s).

    Example::

        class SimpleCustomBatch:
            def __init__(self, data):
                transposed_data = list(zip(*data))
                self.inp = torch.stack(transposed_data[0], 0)
                self.tgt = torch.stack(transposed_data[1], 0)

            def pin_memory(self):
                self.inp = self.inp.pin_memory()
                self.tgt = self.tgt.pin_memory()
                return self

        def collate_wrapper(batch):
            return SimpleCustomBatch(batch)

        inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
        tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
        dataset = TensorDataset(inps, tgts)

        loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                            pin_memory=True)

        for batch_ndx, sample in enumerate(loader):
            print(sample.inp.is_pinned())
            print(sample.tgt.is_pinned())
    """
    __initialized = False

    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,
                 batch_sampler=None, num_workers=0, collate_fn=default_collate,
                 pin_memory=False, drop_last=False, timeout=0,
                 worker_init_fn=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last
        self.timeout = timeout
        self.worker_init_fn = worker_init_fn

        if timeout < 0:
            raise ValueError('timeout option should be non-negative')

        if batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
            self.batch_size = None
            self.drop_last = None

        if sampler is not None and shuffle:
            raise ValueError('sampler option is mutually exclusive with '
                             'shuffle')

        if self.num_workers < 0:
            raise ValueError('num_workers option cannot be negative; '
                             'use num_workers=0 to disable multiprocessing.')

        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    sampler = RandomSampler(dataset)
                else:
                    sampler = SequentialSampler(dataset)
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)

        self.sampler = sampler
        self.batch_sampler = batch_sampler
        self.__initialized = True
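Following the note above about per-worker seeds, here is a sketch of a worker_init_fn that reseeds NumPy in every worker so that random augmentations differ between workers (the use of NumPy here is an assumed scenario, not from the original text):

import numpy as np
import torch
import torch.utils.data as Data

def seed_worker(worker_id):
    # PyTorch has already set this worker's seed to base_seed + worker_id;
    # reuse it so NumPy does not draw identical numbers in every worker.
    np.random.seed(torch.initial_seed() % 2**32)

# loader = Data.DataLoader(dataset, batch_size=8, num_workers=2,
#                          worker_init_fn=seed_worker)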