You spent days tuning your model architecture. You got a GPU. You optimised your training loop. But your training is still slow and your GPU sits at 30% utilisation most of the time. What is going on?

In most cases the answer is the same: your model is waiting for data. The GPU is ready and hungry for the next batch, but the data pipeline is still loading, decoding and preprocessing images from disk. The bottleneck is not the model at all — it is the data feeding the model.

This guide explains how to build a data pipeline that is fast enough to keep your GPU busy the whole time. We will cover preprocessing strategies, caching techniques and how to stream datasets that are too large to fit in memory — all in plain English.

What Is a Data Pipeline

A data pipeline is everything that happens between your raw data sitting on disk and a clean batch arriving at your model. For an image classification task, the pipeline might look like this:

  • Read the image file from disk
  • Decode it from JPEG or PNG format into a pixel array
  • Resize it to the correct dimensions (for example 224 by 224 pixels)
  • Randomly flip or rotate it for data augmentation
  • Normalise the pixel values to a range the model expects
  • Stack several of these images into a batch
  • Send the batch to the GPU

Every single one of these steps takes time. If your pipeline takes 50 milliseconds per batch and your GPU can process a batch in 20 milliseconds, your GPU is idle for 30 milliseconds every cycle. That is 60% wasted GPU time.
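The arithmetic above can be sketched in a few lines. Note the 50 ms and 20 ms figures are the illustrative numbers from this example, not measurements:

```python
# GPU utilisation when the pipeline is the bottleneck.
# Illustrative numbers from the example above, not measurements.
pipeline_ms = 50  # time to load and preprocess one batch
gpu_ms = 20       # time for the GPU to process one batch

# The GPU can only start when a batch arrives, so each cycle
# takes as long as the slower of the two stages.
cycle_ms = max(pipeline_ms, gpu_ms)
utilisation = gpu_ms / cycle_ms
idle_fraction = 1 - utilisation

print(f'GPU utilisation: {utilisation:.0%}')   # 40%
print(f'Wasted GPU time: {idle_fraction:.0%}') # 60%
```

If the two numbers were reversed (pipeline at 20 ms, GPU at 50 ms), utilisation would be 100% and the pipeline would no longer be the bottleneck.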

ℹ️ The goal of pipeline optimisation: make sure data arrives at the GPU faster than the GPU can consume it, so the GPU is always busy and never waiting. The pipeline should be the fastest part of your system, not the slowest.

The Data Bottleneck Problem

Think of your training loop like a factory assembly line. The GPU is the main machine at the end of the line that does the expensive work. Before the GPU can do anything, workers must bring it the raw materials — in this case, preprocessed batches of data.

If the workers are slow, the main machine sits idle. No matter how powerful your GPU is, it can only work as fast as the data arrives. This is the bottleneck problem, and it is extremely common.

The three main causes of a slow data pipeline are reading from disk slowly, doing expensive preprocessing on every single epoch, and loading data on a single CPU thread while the GPU waits.


Preprocessing Strategies

Preprocessing means transforming your raw data into the format your model expects. There are two ways to do this: offline (do it once before training starts) and online (do it on the fly as training runs).

Offline Preprocessing — Do the Work Once

Offline preprocessing means you run all your transformations once, save the results to disk, and then load those already-processed files during training. You pay the preprocessing cost only once, no matter how many training epochs you run.

This is the best approach for transformations that are always the same — like resizing images, tokenising text, or extracting audio features. If you train for 50 epochs, offline preprocessing saves you 49 epochs worth of redundant work.

Python — offline preprocessing, save results to disk once
from PIL import Image
import numpy as np
import os

# Run this ONCE before training starts
def preprocess_dataset(raw_dir, processed_dir, target_size=(224, 224)):
    os.makedirs(processed_dir, exist_ok=True)
    for filename in os.listdir(raw_dir):
        save_path = os.path.join(processed_dir, filename.replace('.jpg', '.npy'))
        # Skip if we already processed this file
        if os.path.exists(save_path):
            continue
        img = Image.open(os.path.join(raw_dir, filename)).convert('RGB')
        img = img.resize(target_size)
        # Save as numpy array — loads much faster than JPEG
        np.save(save_path, np.array(img))

# Call once
preprocess_dataset('data/raw', 'data/processed')

# Now during training, just load the .npy files
# np.load('data/processed/img_001.npy') is 3 to 10x faster than
# opening and decoding 'data/raw/img_001.jpg' every epoch

Online Preprocessing — For Random Augmentations

Some preprocessing must happen online — meaning every time you load a sample. Data augmentation is the main example. Random flips, rotations, colour jitter and crops need to be different on every epoch so the model sees variety. You cannot precompute these in advance.

Python — torchvision transforms for online augmentation
from torchvision import transforms

# Training transforms — random augmentation applied live each epoch
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),   # converts PIL image to tensor
    transforms.Normalize(    # normalise pixel values
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms — no random augmentation, just resize and normalise
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

Why Normalisation Matters

Normalisation rescales your input values to a small consistent range (usually between 0 and 1, or with a mean of 0 and a standard deviation of 1). This is not just about speed — it makes training much more stable and often significantly improves accuracy.

Without normalisation, a pixel with value 255 is 255 times larger than a pixel with value 1. This huge range makes it hard for the model to learn. After normalisation, all values are on the same scale and the model learns much faster.
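A quick numeric sketch of the two common schemes, using plain NumPy and made-up pixel values:

```python
import numpy as np

# A tiny fake image: raw pixel values in the range 0 to 255
pixels = np.array([0.0, 64.0, 128.0, 255.0])

# Scheme 1 — scale to the range [0, 1]
scaled = pixels / 255.0
print(scaled)  # values now between 0.0 and 1.0

# Scheme 2 — standardise to mean 0, standard deviation 1
standardised = (pixels - pixels.mean()) / pixels.std()
print(standardised.mean())  # ~0.0
print(standardised.std())   # ~1.0
```

The torchvision Normalize transform shown earlier is scheme 2 applied per colour channel, with mean and std values precomputed over the training set.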


PyTorch DataLoader — The Right Way to Load Data

PyTorch's DataLoader is the standard tool for feeding data to a model. It handles batching, shuffling, and most importantly it can load data in parallel using multiple CPU workers so the GPU never has to wait.

Writing a Clean Dataset Class

Python — a clean PyTorch Dataset class
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os

class ImageDataset(Dataset):
    def __init__(self, image_dir, labels, transform=None):
        self.image_paths = sorted(os.listdir(image_dir))
        self.image_dir = image_dir
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        path = os.path.join(self.image_dir, self.image_paths[idx])
        image = Image.open(path).convert('RGB')
        label = self.labels[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

Multiple Workers — Load Data in Parallel

By default (num_workers=0) the DataLoader loads data in the main process. This means loading is sequential — it finishes one image before starting the next. By setting num_workers to a higher number, you tell PyTorch to spawn multiple worker processes that load data in parallel, so batches are ready much sooner.

Python — DataLoader with workers and pinned memory
# Slow — no workers, loads one image at a time in the main process
slow_loader = DataLoader(
    dataset,
    batch_size=32
)

# Fast — multiple workers load data in parallel
fast_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,           # 4 parallel worker processes
    pin_memory=True,         # keeps data in pinned memory for faster GPU transfer
    shuffle=True,            # shuffle order every epoch
    persistent_workers=True  # keep workers alive between epochs
)

# How many workers to use?
# A safe starting point is the number of CPU cores divided by 2
import os
num_workers = os.cpu_count() // 2
print(f'Recommended workers: {num_workers}')
ℹ️ pin_memory explained: normal RAM and GPU memory are separate. Moving data from one to the other requires a copy through a staging area. pin_memory=True tells PyTorch to store batches in a special "pinned" region of RAM that transfers directly to the GPU much faster. Always enable it when training on GPU.

Prefetching — Load the Next Batch Before You Need It

Prefetching means loading the next batch of data while the model is still processing the current one. Instead of waiting for the model to finish, then loading the next batch, loading and computing happen at the same time — in parallel.

Python — a simple prefetcher that overlaps loading and training
import torch

class DataPrefetcher:
    """Loads the next batch to GPU while the current one is being used."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.next_data = None
        self._prefetch()  # load the first batch right away

    def _prefetch(self):
        try:
            self.next_data = next(self.loader)
        except StopIteration:
            self.next_data = None
            return
        # Move next batch to GPU in the background
        with torch.cuda.stream(self.stream):
            self.next_data = [t.to(self.device, non_blocking=True)
                              for t in self.next_data]

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data = self.next_data
        self._prefetch()  # start loading the batch after next
        return data

Caching — Remember What You Already Computed

Caching means saving something you computed so you do not have to compute it again. In a data pipeline, there are two kinds of cache: a disk cache (saved as files) and a memory cache (stored in RAM).

Caching to Disk

Saving preprocessed data as files is the most common form of caching. The first time you load a sample you do all the expensive work (decode, resize, normalise) and save the result. Every time after that, you just load the already-processed file — which is much faster.

Python — disk caching inside a Dataset class
import torch
import numpy as np
import os
from PIL import Image
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    def __init__(self, image_paths, labels, cache_dir='cache', transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.cache_dir = cache_dir
        self.transform = transform
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        cache_path = os.path.join(self.cache_dir, f'{idx}.npy')
        if os.path.exists(cache_path):
            # Cache hit — load the already-processed array
            img_array = np.load(cache_path)
        else:
            # Cache miss — do the work and save for next time
            img = Image.open(self.image_paths[idx]).convert('RGB')
            img_array = np.array(img.resize((224, 224)))
            np.save(cache_path, img_array)
        tensor = torch.from_numpy(img_array).permute(2, 0, 1).float() / 255.0
        if self.transform:
            tensor = self.transform(tensor)
        return tensor, self.labels[idx]

Caching in Memory

If your whole dataset fits in RAM, you can load everything into memory once at the start and never touch the disk again during training. This is the fastest option possible but only works if your dataset is small enough.

Python — load entire dataset into RAM on startup
import torch
import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class MemoryCachedDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        self.labels = labels
        self.transform = transform
        print('Loading all images into memory...')
        self.images = []
        for path in image_paths:
            img = Image.open(path).convert('RGB').resize((224, 224))
            self.images.append(np.array(img))
        print(f'Loaded {len(self.images)} images into RAM.')

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No disk access at all — data comes straight from RAM
        tensor = torch.from_numpy(self.images[idx]).permute(2, 0, 1).float() / 255.0
        if self.transform:
            tensor = self.transform(tensor)
        return tensor, self.labels[idx]
Which cache to use: if your dataset fits in RAM, use memory caching — it is the fastest. If it does not fit in RAM but your preprocessing is expensive, use disk caching. If your preprocessing is fast (just a simple resize), skip the cache and preprocess on the fly with multiple workers.
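A back-of-the-envelope check for the first question, whether the decoded dataset even fits in RAM, might look like this (the image count and resolution below are hypothetical, so substitute your own numbers):

```python
# Rough memory footprint of a decoded image dataset.
# Hypothetical numbers — substitute your own dataset's values.
num_images = 100_000
height, width, channels = 224, 224, 3
bytes_per_value = 1  # uint8 pixels after decoding

total_bytes = num_images * height * width * channels * bytes_per_value
total_gb = total_bytes / (1024 ** 3)

print(f'Decoded dataset size: {total_gb:.1f} GB')
# ~14 GB: fits in RAM on a 32 GB machine, so memory caching is viable;
# on an 8 GB machine, fall back to disk caching instead.
```

Remember to leave headroom for the model, the framework and the batches in flight; a dataset that barely fits is better served by disk caching.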

Streaming Large Datasets

Some datasets are so large they cannot fit on your hard drive at all. A dataset with 50 million images would take terabytes of storage. You cannot download it all before training starts, and you cannot cache it. The solution is streaming.

Streaming means loading data one small piece at a time as you need it. Instead of downloading the whole dataset first, you download just the next batch right before you need it. You always have a small amount in memory and the rest stays on a remote server.

Think of it like a river. You do not store the whole river in a bucket. The water just flows past you and you take a cup at a time as you need it.

Streaming with HuggingFace Datasets

Python — streaming a large dataset from HuggingFace
from datasets import load_dataset

# Without streaming — this would download the WHOLE dataset first.
# For a very large dataset that could take hours or run out of disk.
# dataset = load_dataset('wikipedia', '20220301.en')

# With streaming=True — downloads just what you need, when you need it
streamed = load_dataset('wikipedia', '20220301.en', streaming=True)

# Iterate over it — each example loads as you reach it
for example in streamed['train'].take(10):  # take just 10 examples
    print(example['title'])

# You can also shuffle a streaming dataset with a buffer: this loads 1000
# examples into a buffer, shuffles them, and feeds you one at a time
shuffled = streamed['train'].shuffle(buffer_size=1000, seed=42)

Efficient Pipelines with TensorFlow tf.data

TensorFlow has a powerful pipeline API called tf.data. Its most important feature is .prefetch(), which automatically overlaps data loading with model training — exactly what we want.

Python — tf.data pipeline with prefetching and parallel mapping
import tensorflow as tf

# Define the per-file preprocessing first, so .map() can reference it
def load_and_preprocess(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [224, 224])
    img = img / 255.0
    return img

# Build a tf.data pipeline
dataset = (
    tf.data.Dataset.list_files('data/train/*.jpg')
    .map(
        load_and_preprocess,
        num_parallel_calls=tf.data.AUTOTUNE  # TF decides the best number
    )
    .cache()                     # cache after first epoch
    .shuffle(1000)               # shuffle with a buffer
    .batch(32)                   # group into batches
    .prefetch(tf.data.AUTOTUNE)  # load next batch while training current one
)
ℹ️ AUTOTUNE: passing tf.data.AUTOTUNE to num_parallel_calls and prefetch lets TensorFlow automatically tune these values based on your hardware at runtime. This is almost always better than picking a fixed number yourself.

Quick Reference Table

Technique             | Best For                                    | Speedup        | Downside
Offline preprocessing | Fixed transforms that repeat each epoch     | Large          | Extra disk space needed
Multiple num_workers  | All datasets                                | 2 to 4x        | More CPU and RAM usage
pin_memory            | Training on GPU                             | Moderate       | Slightly more RAM
Prefetching           | All GPU training                            | Significant    | Slightly more memory
Memory caching        | Small datasets that fit in RAM              | Fastest option | Requires enough RAM
Disk caching          | Large datasets with expensive preprocessing | Good           | Disk space and first-pass time
Streaming             | Datasets too large to store locally         | Manageable     | Network dependent, no random access

⚡ Key Takeaways
  • A slow data pipeline is one of the most common reasons a GPU is underutilised. The goal is to make data arrive faster than the GPU can consume it.
  • Offline preprocessing means doing your fixed transforms once and saving the results to disk. This saves you from repeating the same work every epoch.
  • Online preprocessing (done live each epoch) is necessary for random augmentations like flips and colour jitter, because they need to be different each time.
  • Always normalise your inputs. It makes training faster and more stable, and is almost never optional for real models.
  • Set num_workers to at least half your CPU core count so worker processes load data in parallel. This one change alone can often eliminate the data bottleneck on a fast machine.
  • Use pin_memory=True and persistent_workers=True in your DataLoader when training on GPU. They reduce data transfer overhead with barely any downside.
  • Prefetching overlaps data loading with model training. Always use it. In PyTorch set it up with a prefetcher class. In TensorFlow use .prefetch(tf.data.AUTOTUNE).
  • Use memory caching if the dataset fits in RAM. Use disk caching if preprocessing is expensive. Use streaming if the dataset is too large to store at all.
  • Stack these techniques. The biggest wins come from combining offline preprocessing, multiple workers, pinned memory and prefetching together.