You spent days tuning your model architecture. You got a GPU. You optimised your training loop. But your training is still slow and your GPU sits at 30% utilisation most of the time. What is going on?

In most cases the answer is the same: your model is waiting for data. The GPU is ready and hungry for the next batch, but the data pipeline is still loading, decoding and preprocessing images from disk. The bottleneck is not the model at all — it is the data feeding the model.

This guide explains how to build a data pipeline that is fast enough to keep your GPU busy the whole time. We will cover preprocessing strategies, caching techniques and how to stream datasets that are too large to fit in memory — all in plain English.

What Is a Data Pipeline

A data pipeline is everything that happens between your raw data sitting on disk and a clean batch arriving at your model. For an image classification task, the pipeline might look like this:

  • Read the image file from disk
  • Decode it from JPEG or PNG format into a pixel array
  • Resize it to the correct dimensions (for example 224 by 224 pixels)
  • Randomly flip or rotate it for data augmentation
  • Normalise the pixel values to a range the model expects
  • Stack several of these images into a batch
  • Send the batch to the GPU

Every single one of these steps takes time. If your pipeline takes 50 milliseconds per batch and your GPU can process a batch in 20 milliseconds, your GPU is idle for 30 milliseconds every cycle. That is 60% wasted GPU time.
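The arithmetic above can be sketched in a few lines. Note the 50 ms and 20 ms figures are the illustrative numbers from this example, not measurements:

```python
# GPU utilisation when the pipeline is the bottleneck.
# Illustrative numbers from the example above, not measurements.
pipeline_ms = 50  # time to load and preprocess one batch
gpu_ms = 20       # time for the GPU to process one batch

# The GPU can only start when a batch arrives, so each cycle
# takes as long as the slower of the two stages.
cycle_ms = max(pipeline_ms, gpu_ms)
utilisation = gpu_ms / cycle_ms
idle_fraction = 1 - utilisation

print(f'GPU utilisation: {utilisation:.0%}')   # 40%
print(f'Wasted GPU time: {idle_fraction:.0%}') # 60%
```

If the two numbers were reversed (pipeline at 20 ms, GPU at 50 ms), utilisation would be 100% and the pipeline would no longer be the bottleneck.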

ℹ️ The goal of pipeline optimisation: make sure data arrives at the GPU faster than the GPU can consume it, so the GPU is always busy and never waiting. The pipeline should be the fastest part of your system, not the slowest.

The Data Bottleneck Problem

Think of your training loop like a factory assembly line. The GPU is the main machine at the end of the line that does the expensive work. Before the GPU can do anything, workers must bring it the raw materials — in this case, preprocessed batches of data.

If the workers are slow, the main machine sits idle. No matter how powerful your GPU is, it can only work as fast as the data arrives. This is the bottleneck problem, and it is extremely common.

The three main causes of a slow data pipeline are reading from disk slowly, doing expensive preprocessing on every single epoch, and loading data on a single CPU thread while the GPU waits.


Preprocessing Strategies

Preprocessing means transforming your raw data into the format your model expects. There are two ways to do this: offline (do it once before training starts) and online (do it on the fly as training runs).

Offline Preprocessing — Do the Work Once

Offline preprocessing means you run all your transformations once, save the results to disk, and then load those already-processed files during training. You pay the preprocessing cost only once, no matter how many training epochs you run.

This is the best approach for transformations that are always the same — like resizing images, tokenising text, or extracting audio features. If you train for 50 epochs, offline preprocessing saves you 49 epochs worth of redundant work.

Python — offline preprocessing, save results to disk once
from PIL import Image
import numpy as np
import os

# Run this ONCE before training starts
def preprocess_dataset(raw_dir, processed_dir, target_size=(224, 224)):
    os.makedirs(processed_dir, exist_ok=True)
    for filename in os.listdir(raw_dir):
        save_path = os.path.join(processed_dir, filename.replace('.jpg', '.npy'))
        # Skip if we already processed this file
        if os.path.exists(save_path):
            continue
        img = Image.open(os.path.join(raw_dir, filename)).convert('RGB')
        img = img.resize(target_size)
        # Save as numpy array — loads much faster than JPEG
        np.save(save_path, np.array(img))

# Call once
preprocess_dataset('data/raw', 'data/processed')

# Now during training, just load the .npy files
# np.load('data/processed/img_001.npy') is 3 to 10x faster than
# opening and decoding 'data/raw/img_001.jpg' every epoch

Online Preprocessing — For Random Augmentations

Some preprocessing must happen online — meaning every time you load a sample. Data augmentation is the main example. Random flips, rotations, colour jitter and crops need to be different on every epoch so the model sees variety. You cannot precompute these in advance.

Python — torchvision transforms for online augmentation
from torchvision import transforms

# Training transforms — random augmentation applied live each epoch
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),   # converts PIL image to tensor
    transforms.Normalize(    # normalise pixel values
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms — no random augmentation, just resize and normalise
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

Why Normalisation Matters

Normalisation rescales your input values to a small consistent range (usually between 0 and 1, or with a mean of 0 and a standard deviation of 1). This is not just about speed — it makes training much more stable and often significantly improves accuracy.

Without normalisation, a pixel with value 255 is 255 times larger than a pixel with value 1. This huge range makes it hard for the model to learn. After normalisation, all values are on the same scale and the model learns much faster.
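A quick numeric sketch of the two common schemes, using plain NumPy and made-up pixel values:

```python
import numpy as np

# A tiny fake image: raw pixel values in the range 0 to 255
pixels = np.array([0.0, 64.0, 128.0, 255.0])

# Scheme 1 — scale to the range [0, 1]
scaled = pixels / 255.0
print(scaled)  # values now between 0.0 and 1.0

# Scheme 2 — standardise to mean 0, standard deviation 1
standardised = (pixels - pixels.mean()) / pixels.std()
print(standardised.mean())  # ~0.0
print(standardised.std())   # ~1.0
```

The torchvision Normalize transform shown earlier is scheme 2 applied per colour channel, with mean and std values precomputed over the training set.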


PyTorch DataLoader — The Right Way to Load Data

PyTorch's DataLoader is the standard tool for feeding data to a model. It handles batching, shuffling, and most importantly it can load data in parallel using multiple CPU workers so the GPU never has to wait.

Writing a Clean Dataset Class

Python — a clean PyTorch Dataset class
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os

class ImageDataset(Dataset):
    def __init__(self, image_dir, labels, transform=None):
        self.image_paths = sorted(os.listdir(image_dir))
        self.image_dir = image_dir
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        path = os.path.join(self.image_dir, self.image_paths[idx])
        image = Image.open(path).convert('RGB')
        label = self.labels[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

Multiple Workers — Load Data in Parallel

By default (num_workers=0) the DataLoader loads data in the main process. This means loading is sequential — it finishes one image before starting the next. By setting num_workers to a higher number, you tell PyTorch to spawn multiple worker processes that load data in parallel, so batches are ready much sooner.

Python — DataLoader with workers and pinned memory
# Slow — no workers, loads one image at a time in the main process
slow_loader = DataLoader(
    dataset,
    batch_size=32
)

# Fast — multiple workers load data in parallel
fast_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,           # 4 parallel worker processes
    pin_memory=True,         # keeps data in pinned memory for faster GPU transfer
    shuffle=True,            # shuffle order every epoch
    persistent_workers=True  # keep workers alive between epochs
)

# How many workers to use?
# A safe starting point is the number of CPU cores divided by 2
import os
num_workers = os.cpu_count() // 2
print(f'Recommended workers: {num_workers}')
ℹ️ pin_memory explained: normal RAM and GPU memory are separate. Moving data from one to the other requires a copy through a staging area. pin_memory=True tells PyTorch to store batches in a special "pinned" region of RAM that transfers directly to the GPU much faster. Always enable it when training on GPU.

Prefetching — Load the Next Batch Before You Need It

Prefetching means loading the next batch of data while the model is still processing the current one. Instead of waiting for the model to finish, then loading the next batch, loading and computing happen at the same time — in parallel.

Python — a simple prefetcher that overlaps loading and training
import torch

class DataPrefetcher:
    """Loads the next batch to GPU while the current one is being used."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.next_data = None
        self._prefetch()  # load the first batch right away

    def _prefetch(self):
        try:
            self.next_data = next(self.loader)
        except StopIteration:
            self.next_data = None
            return
        # Move next batch to GPU in the background
        with torch.cuda.stream(self.stream):
            self.next_data = [t.to(self.device, non_blocking=True)
                              for t in self.next_data]

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data = self.next_data
        self._prefetch()  # start loading the batch after next
        return data

Caching — Remember What You Already Computed

Caching means saving something you computed so you do not have to compute it again. In a data pipeline, there are two kinds of cache: a disk cache (saved as files) and a memory cache (stored in RAM).

Caching to Disk

Saving preprocessed data as files is the most common form of caching. The first time you load a sample you do all the expensive work (decode, resize, normalise) and save the result. Every time after that, you just load the already-processed file — which is much faster.

Python — disk caching inside a Dataset class
import torch
import numpy as np
import os
from PIL import Image
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    def __init__(self, image_paths, labels, cache_dir='cache', transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.cache_dir = cache_dir
        self.transform = transform
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        cache_path = os.path.join(self.cache_dir, f'{idx}.npy')
        if os.path.exists(cache_path):
            # Cache hit — load the already-processed array
            img_array = np.load(cache_path)
        else:
            # Cache miss — do the work and save for next time
            img = Image.open(self.image_paths[idx]).convert('RGB')
            img_array = np.array(img.resize((224, 224)))
            np.save(cache_path, img_array)
        tensor = torch.from_numpy(img_array).permute(2, 0, 1).float() / 255.0
        if self.transform:
            tensor = self.transform(tensor)
        return tensor, self.labels[idx]

Caching in Memory

If your whole dataset fits in RAM, you can load everything into memory once at the start and never touch the disk again during training. This is the fastest option possible but only works if your dataset is small enough.

Python — load entire dataset into RAM on startup
import torch
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class MemoryCachedDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        self.labels = labels
        self.transform = transform
        print('Loading all images into memory...')
        self.images = []
        for path in image_paths:
            img = Image.open(path).convert('RGB').resize((224, 224))
            self.images.append(np.array(img))
        print(f'Loaded {len(self.images)} images into RAM.')

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No disk access at all — data comes straight from RAM
        tensor = torch.from_numpy(self.images[idx]).permute(2, 0, 1).float() / 255.0
        if self.transform:
            tensor = self.transform(tensor)
        return tensor, self.labels[idx]
Which cache to use: if your dataset fits in RAM, use memory caching — it is the fastest. If it does not fit in RAM but your preprocessing is expensive, use disk caching. If your preprocessing is fast (just a simple resize), skip the cache and preprocess on the fly with multiple workers.
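A back-of-the-envelope check for the first question, whether the decoded dataset even fits in RAM, might look like this (the image count and resolution below are hypothetical, so substitute your own numbers):

```python
# Rough memory footprint of a decoded image dataset.
# Hypothetical numbers — substitute your own dataset's values.
num_images = 100_000
height, width, channels = 224, 224, 3
bytes_per_value = 1  # uint8 pixels after decoding

total_bytes = num_images * height * width * channels * bytes_per_value
total_gb = total_bytes / (1024 ** 3)

print(f'Decoded dataset size: {total_gb:.1f} GB')
# ~14 GB: fits in RAM on a 32 GB machine, so memory caching is viable;
# on an 8 GB machine, fall back to disk caching instead.
```

Remember to leave headroom for the model, the framework and the batches in flight; a dataset that barely fits is better served by disk caching.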

Streaming Large Datasets

Some datasets are so large they cannot fit on your hard drive at all. A dataset with 50 million images would take terabytes of storage. You cannot download it all before training starts, and you cannot cache it. The solution is streaming.

Streaming means loading data one small piece at a time as you need it. Instead of downloading the whole dataset first, you download just the next batch right before you need it. You always have a small amount in memory and the rest stays on a remote server.

Think of it like a river. You do not store the whole river in a bucket. The water just flows past you and you take a cup at a time as you need it.

Streaming with HuggingFace Datasets

Python — streaming a large dataset from HuggingFace
from datasets import load_dataset

# Without streaming — this would download the WHOLE dataset first.
# For a very large dataset that could take hours or run out of disk.
# dataset = load_dataset('wikipedia', '20220301.en')

# With streaming=True — downloads just what you need, when you need it
streamed = load_dataset('wikipedia', '20220301.en', streaming=True)

# Iterate over it — each example loads as you reach it
for example in streamed['train'].take(10):  # take just 10 examples
    print(example['title'])

# You can also shuffle a streaming dataset with a buffer: this loads 1000
# examples into a buffer, shuffles them, and feeds you one at a time
shuffled = streamed['train'].shuffle(buffer_size=1000, seed=42)

Efficient Pipelines with TensorFlow tf.data

TensorFlow has a powerful pipeline API called tf.data. Its most important feature is .prefetch(), which automatically overlaps data loading with model training — exactly what we want.

Python — tf.data pipeline with prefetching and parallel mapping
import tensorflow as tf

# Define the per-file preprocessing first, so .map() can reference it
def load_and_preprocess(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [224, 224])
    img = img / 255.0
    return img

# Build a tf.data pipeline
dataset = (
    tf.data.Dataset.list_files('data/train/*.jpg')
    .map(
        load_and_preprocess,
        num_parallel_calls=tf.data.AUTOTUNE  # TF decides the best number
    )
    .cache()                     # cache after first epoch
    .shuffle(1000)               # shuffle with a buffer
    .batch(32)                   # group into batches
    .prefetch(tf.data.AUTOTUNE)  # load next batch while training current one
)
ℹ️ AUTOTUNE: passing tf.data.AUTOTUNE to num_parallel_calls and prefetch lets TensorFlow automatically tune these values based on your hardware at runtime. This is almost always better than picking a fixed number yourself.

Quick Reference Table

Technique             | Best For                                    | Speedup        | Downside
Offline preprocessing | Fixed transforms that repeat each epoch     | Large          | Extra disk space needed
Multiple num_workers  | All datasets                                | 2 to 4x        | More CPU and RAM usage
pin_memory            | Training on GPU                             | Moderate       | Slightly more RAM
Prefetching           | All GPU training                            | Significant    | Slightly more memory
Memory caching        | Small datasets that fit in RAM              | Fastest option | Requires enough RAM
Disk caching          | Large datasets with expensive preprocessing | Good           | Disk space and first-pass time
Streaming             | Datasets too large to store locally         | Manageable     | Network dependent, no random access

⚡ Key Takeaways
  • A slow data pipeline is one of the most common reasons a GPU is underutilised. The goal is to make data arrive faster than the GPU can consume it.
  • Offline preprocessing means doing your fixed transforms once and saving the results to disk. This saves you from repeating the same work every epoch.
  • Online preprocessing (done live each epoch) is necessary for random augmentations like flips and colour jitter, because they need to be different each time.
  • Always normalise your inputs. It makes training faster and more stable, and is almost never optional for real models.
  • Set num_workers to at least half your CPU core count so worker processes load data in parallel. This one change alone can often eliminate the data bottleneck on a fast machine.
  • Use pin_memory=True and persistent_workers=True in your DataLoader when training on GPU. They reduce data transfer overhead with barely any downside.
  • Prefetching overlaps data loading with model training. Always use it. In PyTorch set it up with a prefetcher class. In TensorFlow use .prefetch(tf.data.AUTOTUNE).
  • Use memory caching if the dataset fits in RAM. Use disk caching if preprocessing is expensive. Use streaming if the dataset is too large to store at all.
  • Stack these techniques. The biggest wins come from combining offline preprocessing, multiple workers, pinned memory and prefetching together.