You spent days tuning your model architecture. You got a GPU. You optimised your training loop. But your training is still slow and your GPU sits at 30% utilisation most of the time. What is going on?
In most cases the answer is the same: your model is waiting for data. The GPU is ready and hungry for the next batch, but the data pipeline is still loading, decoding and preprocessing images from disk. The bottleneck is not the model at all — it is the data feeding the model.
This guide explains how to build a data pipeline that is fast enough to keep your GPU busy the whole time. We will cover preprocessing strategies, caching techniques and how to stream datasets that are too large to store locally, all in plain English.
What Is a Data Pipeline?
A data pipeline is everything that happens between your raw data sitting on disk and a clean batch arriving at your model. For an image classification task, the pipeline might look like this:
- Read the image file from disk
- Decode it from JPEG or PNG format into a pixel array
- Resize it to the correct dimensions (for example 224 by 224 pixels)
- Randomly flip or rotate it for data augmentation
- Normalise the pixel values to a range the model expects
- Stack several of these images into a batch
- Send the batch to the GPU
Every single one of these steps takes time. If your pipeline takes 50 milliseconds per batch and your GPU can process a batch in 20 milliseconds, then even with loading and compute fully overlapped, the GPU sits idle for 30 milliseconds out of every 50-millisecond cycle. That is 60% wasted GPU time.
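Here is what those steps look like in code. This is a minimal sketch using torchvision; the file name is hypothetical and the normalisation statistics are the common ImageNet values, not something this guide prescribes.

```python
from PIL import Image
from torchvision import transforms

# The pipeline steps above as a torchvision transform chain.
# Batching and the GPU transfer come later, via the DataLoader.
pipeline = transforms.Compose([
    transforms.Resize((224, 224)),       # resize to the model's input size
    transforms.RandomHorizontalFlip(),   # random augmentation
    transforms.ToTensor(),               # pixel array -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # common ImageNet stats
                         std=[0.229, 0.224, 0.225]),  # (an assumption)
])

img = Image.open("cat.jpg")              # read + decode (hypothetical file)
x = pipeline(img)                        # tensor of shape (3, 224, 224)
```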
The Data Bottleneck Problem
Think of your training loop like a factory assembly line. The GPU is the main machine at the end of the line that does the expensive work. Before the GPU can do anything, workers must bring it the raw materials — in this case, preprocessed batches of data.
If the workers are slow, the main machine sits idle. No matter how powerful your GPU is, it can only work as fast as the data arrives. This is the bottleneck problem, and it is extremely common.
The three main causes of a slow data pipeline are:
- Reading from disk slowly
- Repeating expensive preprocessing on every single epoch
- Loading data with a single CPU worker while the GPU waits
Preprocessing Strategies
Preprocessing means transforming your raw data into the format your model expects. There are two ways to do this: offline (do it once before training starts) and online (do it on the fly as training runs).
Offline Preprocessing — Do the Work Once
Offline preprocessing means you run all your transformations once, save the results to disk, and then load those already-processed files during training. You pay the preprocessing cost only once, no matter how many training epochs you run.
This is the best approach for transformations that are always the same, like resizing images, tokenising text, or extracting audio features. If you train for 50 epochs, offline preprocessing saves you 49 epochs' worth of redundant work.
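As a sketch, an offline preprocessing script can be a short loop that resizes every image once and writes the result to disk. The directory names here are placeholders.

```python
from pathlib import Path
from PIL import Image

src = Path("raw_images")      # placeholder input directory
dst = Path("resized_images")  # placeholder output directory
dst.mkdir(exist_ok=True)

# Pay the resize cost once, before training ever starts.
for path in src.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    img = img.resize((224, 224))
    img.save(dst / path.name, quality=95)
```

During training you then point your dataset at the preprocessed folder and skip the resize entirely.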
Online Preprocessing — For Random Augmentations
Some preprocessing must happen online — meaning every time you load a sample. Data augmentation is the main example. Random flips, rotations, colour jitter and crops need to be different on every epoch so the model sees variety. You cannot precompute these in advance.
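A typical online augmentation chain in torchvision might look like the sketch below; the specific transforms and parameters are illustrative, not a recommendation.

```python
from torchvision import transforms

# Each call draws fresh random parameters, which is exactly
# why these transforms cannot be precomputed and cached.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(degrees=15),
])
```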
Why Normalisation Matters
Normalisation rescales your input values to a small consistent range (usually between 0 and 1, or with a mean of 0 and a standard deviation of 1). This is not just about speed — it makes training much more stable and often significantly improves accuracy.
Without normalisation, a pixel with value 255 is 255 times larger than a pixel with value 1. This huge range makes it hard for the model to learn. After normalisation, all values are on the same scale and the model learns much faster.
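As a tiny worked example, here is what the two common schemes do to raw pixel values (the mean and standard deviation here are illustrative):

```python
import torch

pixels = torch.tensor([0.0, 128.0, 255.0])  # raw 8-bit pixel values

scaled = pixels / 255.0              # [0, 1] range  -> tensor([0.0000, 0.5020, 1.0000])
standardised = (scaled - 0.5) / 0.5  # ~zero mean    -> tensor([-1.0000, 0.0039, 1.0000])
```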
PyTorch DataLoader — The Right Way to Load Data
PyTorch's DataLoader is the standard tool for feeding data to a model. It handles batching, shuffling, and most importantly it can load data in parallel using multiple CPU workers so the GPU never has to wait.
Writing a Clean Dataset Class
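A Dataset class tells PyTorch how to fetch a single sample. Keep __getitem__ cheap, because it runs once per sample per epoch. Here is a minimal sketch; ImageDataset, the folder layout and the label list are assumptions for illustration, not a fixed API.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageDataset(Dataset):
    """Loads (image, label) pairs from a folder of JPEGs."""

    def __init__(self, image_dir, labels, transform=None):
        self.paths = sorted(Path(image_dir).glob("*.jpg"))
        self.labels = labels  # assumed to align with the sorted paths
        self.transform = transform or transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img), self.labels[idx]
```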
Multiple Workers — Load Data in Parallel
By default (num_workers=0) the DataLoader loads data in the main process, one sample at a time: it finishes one image before starting the next. Setting num_workers higher tells PyTorch to spawn that many worker processes that load and preprocess data in parallel, so batches are ready much sooner.
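A sketch of a DataLoader configured this way, reusing the hypothetical ImageDataset from above (the batch size, worker count and dummy labels are illustrative):

```python
from torch.utils.data import DataLoader

labels = [0] * 1000  # dummy labels, just to keep the sketch self-contained
dataset = ImageDataset("resized_images", labels)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # parallel worker processes
    pin_memory=True,          # page-locked memory for faster CPU -> GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```

A reasonable starting point is a worker count around half your CPU core count; then adjust while watching GPU utilisation.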
Prefetching — Load the Next Batch Before You Need It
Prefetching means loading the next batch of data while the model is still processing the current one. Instead of waiting for the model to finish, then loading the next batch, loading and computing happen at the same time — in parallel.
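In PyTorch, a DataLoader with num_workers > 0 already prefetches batches on the CPU side (controlled by its prefetch_factor argument, which defaults to 2 per worker). To also overlap the CPU-to-GPU copy with compute, you can wrap the loader in a small prefetcher class. The sketch below is one common pattern, not a standard PyTorch API, and it assumes a CUDA device and a loader created with pin_memory=True.

```python
import torch

class CUDAPrefetcher:
    """Copies the next batch to the GPU on a side stream
    while the model computes on the current batch."""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            inputs, targets = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):  # copy on the side stream
            self.next_batch = (
                inputs.to(self.device, non_blocking=True),
                targets.to(self.device, non_blocking=True),
            )

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make sure the async copy has finished before the batch is used.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()  # start copying the next batch immediately
        return batch

# Usage: for inputs, targets in CUDAPrefetcher(loader): ...
```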
Caching — Remember What You Already Computed
Caching means saving something you computed so you do not have to compute it again. In a data pipeline, there are two kinds of cache: a disk cache (saved as files) and a memory cache (stored in RAM).
Caching to Disk
Saving preprocessed data as files is the most common form of caching. The first time you load a sample you do all the expensive work (decode, resize, normalise) and save the result. Every time after that, you just load the already-processed file — which is much faster.
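A disk cache can be a handful of lines. This sketch assumes each preprocessed sample can be saved as a PyTorch tensor file; the cache directory and function names are placeholders.

```python
from pathlib import Path

import torch

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def load_sample(idx, expensive_preprocess):
    cached = CACHE_DIR / f"{idx}.pt"
    if cached.exists():
        return torch.load(cached)       # fast path: every epoch after the first
    sample = expensive_preprocess(idx)  # slow path: first epoch only
    torch.save(sample, cached)
    return sample
```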
Caching in Memory
If your whole dataset fits in RAM, you can load everything into memory once at the start and never touch the disk again during training. This is the fastest option possible but only works if your dataset is small enough.
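A memory cache can wrap any existing dataset: materialise every sample once in __init__, and __getitem__ becomes a plain list lookup. A minimal sketch:

```python
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Preprocesses the whole base dataset up front and serves it from RAM."""

    def __init__(self, base_dataset):
        self.samples = [base_dataset[i] for i in range(len(base_dataset))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```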
Streaming Large Datasets
Some datasets are so large they cannot fit on your hard drive at all. A dataset with 50 million images would take terabytes of storage. You cannot download it all before training starts, and you cannot cache it. The solution is streaming.
Streaming means loading data one small piece at a time as you need it. Instead of downloading the whole dataset first, you download just the next batch right before you need it. You always have a small amount in memory and the rest stays on a remote server.
Think of it like a river. You do not store the whole river in a bucket. The water just flows past you and you take a cup at a time as you need it.
Streaming with HuggingFace Datasets
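With HuggingFace Datasets, streaming is a single flag: passing streaming=True to load_dataset returns an iterable that downloads samples on the fly instead of fetching the whole dataset first. The dataset name below is illustrative (imagenet-1k is gated and needs authentication).

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k", split="train", streaming=True)
ds = ds.shuffle(buffer_size=10_000, seed=42)  # approximate shuffle via a buffer

# No full download happens; samples arrive over the network as you iterate.
for sample in ds.take(4):  # .take() caps the stream for a quick look
    print(sample["label"])
```

Because there is no random access, streamed datasets are shuffled approximately through a fixed-size buffer rather than with a true global shuffle.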
Efficient Pipelines with TensorFlow tf.data
TensorFlow has a powerful pipeline API called tf.data. Its most important feature is .prefetch(), which automatically overlaps data loading with model training — exactly what we want.
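A minimal tf.data sketch is below; the file pattern and decode logic are placeholders for your own data.

```python
import tensorflow as tf

def preprocess(path):
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, [224, 224]) / 255.0  # resize + scale to [0, 1]
    return img

ds = (
    tf.data.Dataset.list_files("resized_images/*.jpg")
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # overlap loading with training
)
```

AUTOTUNE tells TensorFlow to pick the parallelism and buffer sizes at runtime based on observed throughput.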
Quick Reference Table
| Technique | Best For | Speedup | Downside |
|---|---|---|---|
| Offline preprocessing | Fixed transforms that repeat each epoch | Large | Extra disk space needed |
| Multiple workers (num_workers) | All datasets | 2 to 4x | More CPU and RAM usage |
| pin_memory | Training on GPU | Moderate | Slightly more RAM |
| Prefetching | All GPU training | Significant | Slightly more memory |
| Memory caching | Small datasets that fit in RAM | Fastest option | Requires enough RAM |
| Disk caching | Large datasets with expensive preprocessing | Good | Disk space and first-pass time |
| Streaming | Datasets too large to store locally | Makes training possible at all | Network dependent, no random access |
Key Takeaways
- A slow data pipeline is one of the most common reasons a GPU is underutilised. The goal is to make data arrive faster than the GPU can consume it.
- Offline preprocessing means doing your fixed transforms once and saving the results to disk. This saves you from repeating the same work every epoch.
- Online preprocessing (done live each epoch) is necessary for random augmentations like flips and colour jitter, because they need to be different each time.
- Always normalise your inputs. It makes training faster and more stable, and is almost never optional for real models.
- Set num_workers to at least half your CPU core count to load data in parallel. This one change can eliminate your data bottleneck completely on a fast machine.
- Use pin_memory=True and persistent_workers=True in your DataLoader when training on GPU. They reduce data transfer overhead with barely any downside.
- Prefetching overlaps data loading with model training. Always use it. In PyTorch, DataLoader workers already prefetch on the CPU side, and a small prefetcher class (like the sketch above) overlaps the GPU copy too. In TensorFlow use .prefetch(tf.data.AUTOTUNE).
- Use memory caching if the dataset fits in RAM. Use disk caching if preprocessing is expensive. Use streaming if the dataset is too large to store at all.
- Stack these techniques. The biggest wins come from combining offline preprocessing, multiple workers, pinned memory and prefetching together.
