You trained a machine learning model. It works well. The accuracy is good. But when you deploy it and real users start making requests, it is slow. Each prediction takes too long. Your server costs are too high. Users are waiting.
This is one of the most common problems in production ML. Building a model that works is one challenge. Making that same model run fast and cheaply at scale is a completely different challenge.
This guide covers the four most impactful techniques for speeding up model inference: batching, GPU acceleration, pruning, and quantisation. Each is explained from scratch with simple analogies and working Python code.
What Is Inference
When you train a model, you are teaching it. You feed it thousands of examples and it slowly adjusts its internal numbers until it gets good at making predictions.
When you run inference, you are using the trained model. You give it new input — a photo, a sentence, a row of data — and it gives you a prediction back. This is what happens every time a user interacts with your deployed model.
Training happens once and can take hours or days. Inference happens constantly and must be fast. A user asking your app to classify an image expects an answer in milliseconds, not seconds.
Why Models Feel Slow
A machine learning model is essentially a very large collection of numbers called weights, and a set of mathematical operations that transform your input using those weights to produce an output. A small model might have millions of weights. A large model can have billions.
Every time you run inference, all of those mathematical operations happen. The more weights, the more operations, the longer it takes. Three main things make inference slow:
- Model size — too many weights, too many operations to run
- Hardware mismatch — running on a CPU when a GPU would be far faster for this kind of maths
- Underutilisation — processing one request at a time when the hardware can handle many at once
Each of the techniques in this guide targets one or more of these three problems.
Batching Inputs
How Batching Works
Imagine you are a chef making pizza. You could bake one pizza at a time, wait for it to finish, then bake the next one. Or you could put ten pizzas in the oven at once and cook them all in the same amount of time it would take to cook one.
That is exactly what batching does for model inference. Instead of running your model on one input at a time, you collect a group of inputs together and run the model once on the whole group. The model processes them all in parallel, and you get all the results back at once.
This works especially well on GPUs, because a GPU is built to do many operations at the same time. Running a batch of 32 images through a model on a GPU takes roughly the same time as running one image — but you get 32 results instead of one.
Batching in Python with PyTorch
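As a sketch of the idea, using a small hypothetical classifier (the layer sizes here are arbitrary), here is the one-at-a-time approach next to the batched one:

```python
import torch
import torch.nn as nn

# A small hypothetical classifier; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

inputs = [torch.randn(1, 128) for _ in range(32)]

# One at a time: 32 separate forward passes.
with torch.no_grad():
    one_by_one = [model(x) for x in inputs]

# Batched: stack into a single (32, 128) tensor and run one forward pass.
batch = torch.cat(inputs, dim=0)
with torch.no_grad():
    batched = model(batch)

print(batched.shape)  # torch.Size([32, 10])
```

The results are the same either way; only the number of forward passes changes. In a real service you would collect incoming requests for a few milliseconds and batch whatever has arrived, trading a little latency for much higher throughput.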
GPU Acceleration
CPU vs GPU — What Is the Difference
A CPU (the main processor in your computer) is very good at complex tasks that need to be done one at a time in a specific order. It has a small number of very powerful cores — usually 8 to 16 on a modern machine.
A GPU (a graphics card) was originally designed to render video game graphics. It has thousands of smaller, simpler cores that are all running at the same time. It is not as smart as a CPU core, but it can do thousands of simple maths operations in parallel.
Machine learning is basically just a huge amount of simple maths done in parallel — multiplying and adding matrices over and over. This maps perfectly onto what a GPU does. A task that takes 10 seconds on a CPU can sometimes take 0.2 seconds on a GPU.
Moving Your Model to the GPU in PyTorch
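A minimal sketch of the pattern, again with a hypothetical toy model. The key point is that the model's weights and the input data must end up on the same device:

```python
import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.to(device)  # move the weights to the device
model.eval()

inputs = torch.randn(32, 128).to(device)  # the data must be on the same device

with torch.no_grad():
    outputs = model(inputs)

# Move results back to the CPU for NumPy or plain Python post-processing.
predictions = outputs.argmax(dim=1).cpu()
```

If the model is on the GPU and the input is still on the CPU (or vice versa), PyTorch raises a device-mismatch error rather than silently copying the data for you.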
Model Pruning — Removing the Parts That Do Not Help
When a neural network learns, many of its weights end up being very close to zero. These near-zero weights do almost nothing to the output — they are not contributing to the model's accuracy in any meaningful way. Pruning removes them.
Think of it like editing a book. After the first draft, you read through and remove all the sentences that do not actually add anything. The story is the same, the meaning is the same, but the book is shorter and easier to read quickly. Pruning does this to a neural network.
After pruning, the model has fewer active connections. This means fewer operations at inference time, which means faster predictions and a smaller file on disk.
Pruning in Python with PyTorch
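A minimal sketch using `torch.nn.utils.prune` on a hypothetical toy model; `l1_unstructured` zeroes the fraction of weights with the smallest absolute values:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical toy model; in practice you would prune a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
layer = model[0]

# Zero the 40% of this layer's weights with the smallest absolute values.
prune.l1_unstructured(layer, name="weight", amount=0.4)

# Pruning works through a mask; measure the resulting sparsity.
sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"sparsity: {sparsity:.0%}")  # sparsity: 40%

# Make it permanent: remove the mask and bake the zeros into the weights.
prune.remove(layer, "weight")
```

Always re-check accuracy after pruning; heavier pruning usually needs a round of fine-tuning to recover it. Note also that unstructured pruning leaves the weight matrices dense (just full of zeros), so the actual speedup depends on your runtime exploiting that sparsity.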
Quantisation — Using Smaller Numbers
By default, most neural network weights are stored as 32-bit floating point numbers (called float32). Each weight takes up 4 bytes of memory. Quantisation means switching to smaller number formats — like 8-bit integers (int8) — which use only 1 byte each.
Think of it like converting a high-resolution photo to a slightly lower resolution. You lose a tiny bit of detail, but the file is four times smaller and loads much faster. For most models, the accuracy difference is barely noticeable but the speed improvement is significant.
Quantisation gives you three benefits at once: the model file is smaller, it loads faster, and inference is faster because 8-bit maths is cheaper than 32-bit maths on most hardware.
Quantisation in Python with PyTorch
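A sketch using dynamic quantisation, the simplest variant: weights are stored as int8 and activations are quantised on the fly at runtime. The toy model is hypothetical:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic quantisation: Linear weights become int8, activations are
# quantised on the fly. It targets CPU inference on Linear/LSTM-heavy models.
quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantised(torch.randn(1, 128))

print(out.shape)  # torch.Size([1, 10])
```

PyTorch also offers static quantisation (calibrated on sample data) and quantisation-aware training, which recover a little more accuracy at the cost of more setup work.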
Turn Off Gradient Tracking at Inference Time
During training, PyTorch keeps track of all the mathematical steps it took to produce each output. It needs this information to update the model's weights. This bookkeeping uses extra memory and time.
At inference time, you are not training — you are just making predictions, so none of that bookkeeping is needed. Turning it off with torch.no_grad() is one of the simplest wins available. Pair it with model.eval(), which switches layers such as dropout and batch normalisation into their inference behaviour.
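The whole pattern, on a hypothetical toy model, is two lines around the forward pass:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()  # inference behaviour for layers like dropout and batchnorm

x = torch.randn(8, 128)

with torch.no_grad():  # no computation graph is recorded
    out = model(x)

print(out.requires_grad)  # False: no gradient bookkeeping attached
```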
TorchScript and ONNX — Exporting for Speed
When you run a regular PyTorch model, Python itself adds some overhead to every operation. You can remove this overhead entirely by compiling the model into a format that runs without Python involved at all.
Two popular options are TorchScript (stays in the PyTorch ecosystem) and ONNX (a universal format that works with many different runtimes including TensorRT, OpenVINO and CoreML).
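A sketch of the TorchScript route on a hypothetical toy model; torch.jit.trace records the operations run on an example input and compiles them into a graph that no longer needs Python to execute:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

example = torch.randn(1, 128)

# Trace the model: run it once on an example input and record the
# operations into a standalone, Python-free graph.
scripted = torch.jit.trace(model, example)

# The saved file can be loaded from C++ or Python without the model's code.
scripted.save("model_ts.pt")

with torch.no_grad():
    out = scripted(example)
```

The ONNX route is analogous via torch.onnx.export(model, example, "model.onnx"); depending on your PyTorch version, the exporter may need the onnx and onnxscript packages installed.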
Technique Comparison
| Technique | Typical Speedup | Accuracy Loss | Effort |
|---|---|---|---|
| model.eval() and no_grad() | 20 to 40% | None | 2 lines of code |
| Batching inputs | 5 to 20x on GPU | None | Low |
| Move to GPU | 10 to 50x | None | Low (one .to(device) call) |
| Quantisation (int8) | 2 to 4x | Less than 1% | Low to Medium |
| Pruning (40 to 60% sparsity) | 1.5 to 3x | Small — needs fine-tuning | Medium |
| TorchScript export | 1.5 to 2x | None | Low |
| ONNX Runtime | 2 to 5x | None | Medium |
Key Takeaways
- Inference is when you use your trained model to make predictions. It must be fast because it runs every time a user makes a request.
- Models are slow because of three things: too many weights, running on the wrong hardware, or processing one input at a time when the hardware can handle many at once.
- Always call model.eval() and use torch.no_grad() for inference. This turns off training-only features and stops bookkeeping you do not need. It is free performance.
- Batching groups multiple inputs together and runs them through the model at once. On a GPU this can be 5 to 20 times faster than processing one input at a time.
- GPUs have thousands of cores that run simple maths in parallel. Use .to(device) to move both your model and your input data to the GPU. Always use the same device for both.
- Pruning removes near-zero weights that contribute almost nothing to predictions. Always test accuracy after pruning and use iterative pruning with fine-tuning for best results.
- Quantisation switches weights from 32-bit floats to 8-bit integers. The model becomes 3 to 4 times smaller and inference becomes faster with minimal accuracy impact.
- TorchScript and ONNX compile your model into formats that run without Python overhead. ONNX Runtime in particular can be 2 to 5 times faster than standard PyTorch on CPU.
- Stack the techniques. Start with eval() and no_grad(), then add batching, then the GPU, then quantisation. The gains compound.
