DataLoader

class DataLoader(dataset, sampler=None, transform=None, collator=None, num_workers=0, timeout=0, preload=False, parallel_stream=False)

Provides a convenient way to iterate over a given dataset. The process is as follows:

```mermaid
flowchart LR
    Dataset.__len__ -- Sampler --> Indices
    batch_size -- Sampler --> Indices
    Indices -- Dataset.__getitem__ --> Samples
    Samples -- Transform + Collator --> mini-batch
```

DataLoader combines a Dataset with a Sampler, a Transform and a Collator, making it easy to fetch minibatches from a dataset continuously. See Use Data to build the input pipeline for more details.
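For illustration, here is a minimal sketch of that pipeline. It assumes the stock megengine.data helpers (ArrayDataset and RandomSampler); everything outside the documented DataLoader signature is example scaffolding, not part of this class:

```python
import numpy as np
from megengine.data import DataLoader, RandomSampler
from megengine.data.dataset import ArrayDataset

# A toy dataset: Dataset.__len__ / Dataset.__getitem__ over two arrays.
dataset = ArrayDataset(
    np.random.random((100, 32, 32, 3)).astype(np.float32),
    np.random.randint(0, 10, size=(100,)).astype(np.int32),
)

# The sampler turns __len__ and batch_size into lists of indices.
sampler = RandomSampler(dataset, batch_size=8)
dataloader = DataLoader(dataset, sampler=sampler)

# Each iteration fetches samples by index and collates them into a minibatch.
for batch_data, batch_labels in dataloader:
    print(batch_data.shape, batch_labels.shape)  # (8, 32, 32, 3) (8,)
    break
```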

Parameters:
  • dataset (Dataset) – the dataset from which to load minibatches.

  • sampler (Optional[Sampler]) – defines the strategy used to sample data from the dataset. If None, the dataset is sampled sequentially, one element at a time.

  • transform (Optional[Transform]) – defines the transform strategy applied to a sampled batch.

  • collator (Optional[Collator]) – defines the strategy for merging a transformed batch into a minibatch.

  • num_workers (int) – the number of sub-processes used to load, transform and collate batches. 0 means everything runs in a single process. Default: 0. See the sketch after this list for a multi-worker example.

  • timeout (int) – if positive, the timeout in seconds for collecting a batch from the workers. Default: 0 (no timeout)

  • preload (bool) – whether to enable the dataloader's preloading strategy. When enabled, the dataloader preloads one batch into device memory to speed up the whole training process. Default: False

  • parallel_stream (bool) – whether to split the workload across all workers when the dataset is a StreamDataset and num_workers > 0. When enabled, each worker collects data from a different portion of the stream in order to speed up the whole loading process. See the streamdataset-example section for more details. Default: False
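The worker- and timeout-related parameters combine naturally with a transform pipeline. Below is a hedged sketch; Compose, Normalize and ToMode are the stock transforms in megengine.data.transform, but check your version's signatures:

```python
import numpy as np
from megengine.data import DataLoader, RandomSampler
from megengine.data.dataset import ArrayDataset
from megengine.data.transform import Compose, Normalize, ToMode

# 64 fake HWC images with integer labels (example scaffolding).
dataset = ArrayDataset(
    np.random.random((64, 32, 32, 3)).astype(np.float32),
    np.random.randint(0, 10, size=(64,)).astype(np.int32),
)

dataloader = DataLoader(
    dataset,
    sampler=RandomSampler(dataset, batch_size=8),
    transform=Compose([
        Normalize(mean=0.5, std=0.5),  # normalize pixel values
        ToMode("CHW"),                 # HWC layout -> CHW layout
    ]),
    num_workers=4,   # load, transform and collate in 4 sub-processes
    timeout=10,      # raise if collecting a batch takes more than 10 s
)

for batch_data, batch_labels in dataloader:
    assert batch_data.shape == (8, 3, 32, 32)
    break
```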

The effects of enabling preload:

  • All elements in a map, list or tuple will be converted to Tensor by preloading, so you will get Tensor instead of the original NumPy array or Python built-in data structure.

  • Tensors’ host-to-device copies and device kernel execution will be overlapped, which improves training speed at the cost of higher device memory usage (one extra batch is kept in device memory). This feature saves the most time when your NN training time per iteration is short or when the host PCIe bandwidth available to each device is low.
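A hedged sketch of the first point, using the same example scaffolding as above: with preload=True, batch elements come back as Tensor rather than numpy.ndarray.

```python
import numpy as np
from megengine import Tensor
from megengine.data import DataLoader, SequentialSampler
from megengine.data.dataset import ArrayDataset

dataset = ArrayDataset(
    np.zeros((16, 3, 8, 8), dtype=np.float32),
    np.zeros((16,), dtype=np.int32),
)

loader = DataLoader(
    dataset,
    sampler=SequentialSampler(dataset, batch_size=4),
    preload=True,  # one batch is staged in device memory ahead of use
)

batch_data, _ = next(iter(loader))
# With preload enabled, the loader yields Tensors, not numpy arrays.
assert isinstance(batch_data, Tensor)
```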