DataLoaders in PyTorch
PyTorch provides `Dataset` and `DataLoader` to handle data loading, batching, shuffling, and parallel processing. This chapter shows how to load built‑in datasets and custom data.
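Before turning to real datasets, the core idea can be sketched with a minimal in-memory example. The sketch below uses `TensorDataset` to pair feature and label tensors (the sample data here is made up for illustration), then wraps it in a `DataLoader` to get shuffled mini-batches.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical in-memory data: 100 samples, 4 features each, binary labels.
features = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,))

dataset = TensorDataset(features, labels)  # pairs features[i] with labels[i]
loader = DataLoader(dataset, batch_size=16, shuffle=True)

xb, yb = next(iter(loader))  # one mini-batch
print(xb.shape)  # torch.Size([16, 4])
print(yb.shape)  # torch.Size([16])
```

The same pattern — a `Dataset` wrapped by a `DataLoader` — applies to everything that follows.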
Built‑in Datasets (torchvision)
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # mean and std for MNIST's single channel
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)
Custom Dataset
import pandas as pd
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # All columns except the last are features; the last column is the label.
        x = torch.tensor(row.iloc[:-1].to_numpy(), dtype=torch.float32)
        y = torch.tensor(row.iloc[-1], dtype=torch.long)
        return x, y
DataLoader Parameters
- batch_size: number of samples per batch.
- shuffle: reshuffle the data at the start of every epoch (set True for training).
- num_workers: number of subprocesses that load data in parallel (increase to speed up loading).
- drop_last: drop the final batch when it has fewer than batch_size samples (useful for models that require a fixed batch shape).
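The effect of drop_last can be seen by counting batches. The sketch below uses a made-up `TensorDataset` of 100 samples, so with batch_size=32 the last batch would hold only 4 samples.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 100 samples; 100 = 3 * 32 + 4, so the last batch is incomplete.
data = TensorDataset(torch.randn(100, 8), torch.zeros(100))

loader = DataLoader(data, batch_size=32, shuffle=True, drop_last=False)
print(len(loader))  # 4 batches (32 + 32 + 32 + 4)

loader = DataLoader(data, batch_size=32, shuffle=True, drop_last=True)
print(len(loader))  # 3 batches; the incomplete batch of 4 is dropped
```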
Two Minute Drill
- `Dataset` defines how to access samples.
- `DataLoader` batches, shuffles, and parallelizes.
- Use torchvision for standard datasets (MNIST, CIFAR, ImageNet).
- Custom datasets require `__len__` and `__getitem__`.
Need more clarification?
Drop us an email at career@quipoinfotech.com
