Deep learning models are wildly effective at solving complex problems using unstructured data. Unfortunately, these models are also (usually) big, slow to run, and resource intensive. This blog post explores options to reduce the size, inference time, and computational footprint for trained deep learning models in production. 

We’ll focus on inference that uses CPUs rather than GPUs. CPUs can be a good option when you’re trying to minimize cost. But running inference on CPUs can be around an order of magnitude slower than on GPUs. Effective model compression can significantly speed up run time in a CPU deployment setting.

Furthermore, we’ll focus mainly on transformer models and PyTorch. However, the topics we’ll cover also generally apply across deep learning models and frameworks.

But wait, shouldn’t the machine learning engineer, software dev, data engineer, DevOps team, or IT team be responsible for figuring out how to deploy my models into production? Why do I need to understand model compression?

As a data scientist, your primary job is to deliver business value by using data. So more often than not, you’ll be involved in every aspect of the problem you’re working on. Your scope ranges from data collection to model development to defining meaningful metrics to interacting with business partners to weighing deployment options. You don’t need to become an expert in data pipelines or autoscaling (or, for that matter, medicine, law, e-commerce, finance, or whatever field your customer occupies), but you do need to understand enough about different parts of the process to anticipate how each decision will affect the solution you’re building.

To do your job well, you need to understand the basic choices around model compression. The information you find here will help you weigh in on the options and the potential tradeoffs that your team should consider.

TL;DR: For most middle-of-the-road scenarios, I generally recommend an approach that balances speed and compute gains with ease of implementation. With the currently available tools and libraries, you’ll get the most bang for your buck by simply performing ONNX serialization and quantization on your models, and potentially using a smaller model architecture like DistilBERT if the performance hit is acceptable.

What is model compression?

Model compression refers to the practice of modifying model loading or storage to decrease the resource usage and/or improve the latency for the model. For deep learning models, we should ask ourselves three basic questions:

  • Disk space: How much hard disk space does the model consume?
  • Memory usage: When the model is loaded into memory (RAM or GPU memory), how much memory is consumed?
  • Latency: How long does it take to load the model into memory? After a model is loaded, how much time is required to perform inference on a sample or batch?

Knowing the answers to these questions can help your team determine requirements for the infrastructure that serves the models. These questions can also help you assess opportunities for parallelization, autoscaling, and so on.
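As a rough illustration of how these three questions can be measured, here is a minimal sketch that uses only the Python standard library. The `profile_model` helper and the toy pickled "model" are hypothetical stand-ins, not the profiler used later in this post. Note that `tracemalloc` only sees memory allocated through Python, so for a real PyTorch model you would want a process-level memory profiler instead.

```python
import os
import pickle
import tempfile
import time
import tracemalloc


def profile_model(path, load_fn, predict_fn, sample):
    """Report disk size, load time, inference time, and peak Python memory."""
    disk_bytes = os.path.getsize(path)

    tracemalloc.start()
    start = time.perf_counter()
    model = load_fn(path)
    load_seconds = time.perf_counter() - start

    start = time.perf_counter()
    prediction = predict_fn(model, sample)
    inference_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "disk_mib": disk_bytes / 2**20,
        "peak_memory_mib": peak_bytes / 2**20,
        "load_seconds": load_seconds,
        "inference_seconds": inference_seconds,
        "prediction": prediction,
    }


# Toy stand-in for a real model: a pickled list of weights and a dot product.
def load_weights(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def dot_product(weights, features):
    return sum(w * x for w, x in zip(weights, features))


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump([0.5] * 10_000, f)
    report = profile_model(model_path, load_weights, dot_product, [1.0] * 10_000)

print(report)
```

Separating load time from inference time matters because the two scale differently: loading is paid once per process, whereas inference is paid on every request.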

Additionally, consider how any compression technique will affect the model predictions. Ideally, model predictions shouldn’t change. But most of the techniques that we’ll discuss change the inference results. Those changes affect the metrics (for example, accuracy, precision, and recall) of the trained model. As data scientists, we can measure important metrics and determine whether any degradation is worth the gain in speed, size, and so on.

Baseline exercise

Before we dive into model compression techniques, let’s establish a baseline. We’ll load a RoBERTa transformer model available through Hugging Face. We’ll use the model to classify the first few sentences of my favorite book, Alice in Wonderland. Check out the sample code if you’d like to follow along. 

Throughout this exercise, we’ll use a neat memory profiler to track RAM consumption and runtime:[^1]

The total memory consumption for this model is close to 1,200 MiB. The total runtime is roughly 25 seconds, with the inference step requiring most of the time. The saved model requires about 500 MB in hard drive space. 

Now let’s look at some model compression techniques and how they affect resource usage and latency.

[^1] The actual runtime and memory consumption depend on the system setup (for example, CPU clock speed and multithreading) and model parameters (for example, batch size and maximum number of tokens). To facilitate a fair comparison, I am keeping these factors fixed when applicable. The point is the comparison between techniques rather than the raw numbers. For the record, I’m using a MacBook Pro with a 2 GHz quad-core Intel Core i5 processor. You can check out parameter definitions, code, and results in the code repository.

Types of model compression

You should understand these three common model compression techniques:

  • Serialization
  • Quantization
  • Pruning

We’ll explore each technique and test it in our Alice in Wonderland example.


Serialization

Serialization encodes a model in a format that can be stored on a disk and reloaded (deserialized) when it’s needed. If a model is serialized by using a common format, serialization can also facilitate interoperability, so the model can be deployed on a system that’s different from the development environment.

The most common serialization approach for deep learning models is the Open Neural Network Exchange (ONNX) format. Models that are saved by using the ONNX format can be run by using ONNX Runtime. ONNX Runtime also provides tools for model quantization, which we’ll explore in the next section.

Let’s convert our RoBERTa model to the ONNX format and then rerun the memory profiler:

Neither the peak memory consumption nor the model size is reduced. But the runtime for CPU inference has decreased dramatically from 20 seconds to 7 seconds.

PyTorch provides native serialization by using TorchScript. Serialization takes two forms: scripting and tracing. During tracing, sample input is fed into the trained model and is followed (traced) through the model computation graph, which is then frozen. At the time of this writing, only tracing is supported for transformer models, so we’ll use this serialization method.
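To make the idea of tracing concrete, here is a toy sketch in plain Python. It is not TorchScript; the `Traced`, `trace`, and `replay` names are hypothetical. It only illustrates the principle: run the model once on a sample input, record the operations that were executed, and freeze that recording as the graph.

```python
class Traced:
    """A value that records every arithmetic operation applied to it."""

    def __init__(self, value, graph):
        self.value = value
        self.graph = graph

    def _record(self, op, constant, result):
        self.graph.append((op, constant))
        return Traced(result, self.graph)

    def __mul__(self, constant):
        return self._record("mul", constant, self.value * constant)

    def __add__(self, constant):
        return self._record("add", constant, self.value + constant)


def trace(fn, sample_input):
    """Run fn once on sample_input and return the frozen list of recorded ops."""
    graph = []
    fn(Traced(sample_input, graph))
    return tuple(graph)


def replay(graph, x):
    """Execute a frozen graph on a new input."""
    for op, constant in graph:
        x = x * constant if op == "mul" else x + constant
    return x


def model(x):
    y = x * 2
    # Data-dependent branch: tracing only sees the path the sample input takes.
    if y.value > 0:
        return y + 1
    return y + 100


graph = trace(model, 3.0)
print(graph)
print(replay(graph, 5.0))
print(replay(graph, -5.0))  # follows the frozen "+ 1" path, not "+ 100"
```

The last call shows the main caveat of tracing in general: any control flow that depends on the input is fixed to whatever path the sample input took.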

Through ONNX serialization, we’ve significantly reduced inference time. But we haven’t made much of a dent in model size and RAM usage. The techniques we’ll cover next will help reduce the memory footprint and hard disk footprint of large transformer models.


Quantization

Deep learning frameworks generally use 32-bit floating point numbers. This precision makes sense during training but might be unnecessary during inference. Quantization describes the process of converting floating point numbers to lower-precision integers, typically 8-bit.
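The arithmetic behind this conversion can be sketched in a few lines of plain Python. This is a minimal illustration of one common scheme (a single scale and zero point for all values), assuming 8-bit signed integers; the `quantize` and `dequantize` helpers are hypothetical, and real frameworks differ in the details.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with a single scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Make sure the representable range includes zero.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the stored integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)

print("integers:", q)
print("restored:", restored)
print("largest error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each stored value now needs 1 byte instead of 4, at the price of a small rounding error that is bounded by the scale.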

Two types of quantization can be applied after a model has been trained: dynamic or static.[^2] During dynamic quantization, model weights are stored by using their 8-bit representations. By contrast, activations are quantized at the time of compute but are stored as full-precision floats. For static quantization, both weights and activations are stored by using 8-bit precision. 

Compared with static quantization, dynamic quantization might improve accuracy at the cost of increased runtime. For a more in-depth tutorial on model quantization, see Practical Quantization in PyTorch.
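One way to see where that accuracy difference can come from is to look at how the range for activations is chosen. The sketch below is a simplified illustration with made-up numbers and a hypothetical `int8_round_trip` helper: static quantization fixes the range ahead of time from calibration data, while dynamic quantization measures it on each batch at inference time.

```python
def int8_round_trip(values, max_abs):
    """Symmetric int8 quantization with a given range, then dequantization."""
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Static: the activation range is fixed ahead of time from calibration data.
calibration_batch = [-1.0, 0.2, 0.9]
static_range = max(abs(v) for v in calibration_batch)

# At inference time, a batch arrives with a larger value than calibration saw.
activations = [-0.5, 0.1, 3.0]
static_result = int8_round_trip(activations, static_range)

# Dynamic: the range is measured on the actual batch, at extra runtime cost.
dynamic_range = max(abs(v) for v in activations)
dynamic_result = int8_round_trip(activations, dynamic_range)

print("static error: ", max_error(activations, static_result))
print("dynamic error:", max_error(activations, dynamic_result))
```

In this toy case the static range clips the unexpected value of 3.0, while the dynamic range adapts to it. The extra pass over each batch to find the range is the runtime cost.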

For transformer models, dynamic quantization is generally recommended because: 

  • It’s simpler and more accurate than static quantization. 
  • Because of the model’s larger size (number of weights), the biggest gains for transformers can be made by reducing the footprint of the weights. 

Let’s apply dynamic quantization to our model and measure the effects. We’ll try out dynamic quantization both through PyTorch and through ONNX Runtime.

Both quantization methods reduce inference time compared to their larger counterparts. Additionally, hard disk space usage has been reduced by more than 50 percent. 
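A back-of-the-envelope calculation suggests why a reduction of this size is plausible. The parameter counts below are my own approximate assumptions for a base-size RoBERTa model (roughly 125 million parameters in total), and the sketch assumes that only the linear layers are quantized while the embeddings stay in 32-bit floats. File-format overhead is ignored.

```python
def model_size_mb(float32_params, int8_params=0):
    """Approximate on-disk size: 4 bytes per float32 value, 1 byte per int8 value."""
    return (float32_params * 4 + int8_params * 1) / 1e6


# Assumed, approximate parameter counts for a base-size RoBERTa model.
embedding_params = 39_000_000  # assumed to stay in float32
linear_params = 86_000_000     # assumed to be quantized to int8

original = model_size_mb(embedding_params + linear_params)
quantized = model_size_mb(embedding_params, int8_params=linear_params)

print(f"original:  {original:.0f} MB")
print(f"quantized: {quantized:.0f} MB")
print(f"reduction: {1 - quantized / original:.0%}")
```

Under these assumptions, the estimate lands close to the roughly 500 MB baseline and to a reduction of slightly more than half. A tool that also quantizes the embeddings would shrink the file further.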

Based on these results, quantization seems like a no-brainer: Why would you not do this? But before we decide to deploy a quantized model, we need to consider one more factor: how these changes affect the model predictions.

[^2] A third type of quantization is called quantization-aware training (QAT). During training, weights and activations are “fake quantized.” That is, computations use floating point numbers, but float values are rounded to mimic 8-bit integer values for forward and backward passes. Because quantization occurs as part of model training, QAT is less likely to degrade metrics. However, QAT is somewhat more complicated to implement. I recommend starting with a simple approach like dynamic quantization and then moving on to QAT only if you observe a degradation that’s unacceptable for your use case.


How quantization affects metrics

Because quantization modifies the weights of a trained network, it will inevitably affect the scores that the final layer of the model produces. Particularly if a score is close to a decision threshold, a quantized model can yield predictions that are different from those that the original, unquantized model produces. 

Before you deploy a model that has been quantized, it’s a good idea to run inference on a test set, measure metrics, and examine examples where the predicted output has changed from that of the original model.

Let’s examine how quantization might affect model performance. For this example, we’ll consider the task of sentiment analysis by using cardiffnlp/twitter-roberta-base-sentiment. This RoBERTa model is available through Hugging Face. It has been trained on Twitter excerpts. In the model, tweets can be assigned to one of three sentiment categories: negative, positive, or neutral. 

For both the original model and quantized models, we’ll perform inference on 100 examples. We’ll generate a confusion matrix and measure accuracy.

The general performance of the model appears relatively unaffected. But quantization has changed some of the predictions. In this situation, we should examine some of the examples that showed altered predictions before deploying the quantized model.
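Here is a minimal sketch of that kind of comparison in plain Python. The texts, labels, and predictions are hypothetical placeholders for real model outputs, and the helper names are my own.

```python
from collections import Counter


def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def confusion_matrix(labels, predictions):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(labels, predictions))


def changed_examples(texts, original_preds, quantized_preds):
    """Return the examples whose predicted label differs between models."""
    return [
        (text, before, after)
        for text, before, after in zip(texts, original_preds, quantized_preds)
        if before != after
    ]


# Hypothetical data standing in for real model outputs.
texts = ["great day", "meh", "awful service", "not bad"]
labels = ["positive", "neutral", "negative", "positive"]
original_preds = ["positive", "neutral", "negative", "positive"]
quantized_preds = ["positive", "neutral", "negative", "neutral"]

print("original accuracy: ", accuracy(labels, original_preds))
print("quantized accuracy:", accuracy(labels, quantized_preds))
print(confusion_matrix(labels, quantized_preds))
for row in changed_examples(texts, original_preds, quantized_preds):
    print("changed:", row)
```

Listing the changed examples alongside the aggregate metrics is useful because two models can have the same accuracy while disagreeing on individual inputs.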


Pruning

Pruning refers to the practice of ignoring or discarding “unimportant” weights in a trained model. But how do you determine which weights are unimportant? Here are three methods that can identify unimportant weights:

Magnitude pruning

The magnitude pruning method identifies unimportant weights by considering their absolute values. Weights that have a low absolute value have little effect on the values that are passed through the model. 

Magnitude pruning is mostly effective for models that are trained from scratch for a specific task, because in that case the values of the weights reflect their importance for that task. Oftentimes, however, you’ll want to use the weights from a pretrained model as a starting point and fine-tune them for a specific dataset and task, a process known as transfer learning. In a transfer learning scenario, the values of the weights also reflect the task that was used to pretrain the network, so a weight’s magnitude says less about its importance for the fine-tuning task. For this reason, magnitude pruning isn’t a good fit for scenarios that involve transfer learning.
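The selection rule for magnitude pruning can be sketched in a few lines. This is a minimal illustration on a flat list of made-up weights, with a hypothetical `magnitude_prune` helper; real implementations operate on tensors and often prune layer by layer.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```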

Movement pruning

During movement pruning, weights that shrink in absolute value during training are discarded. This approach is well-suited for transfer learning because the movement of weights from large to small demonstrates that they were unimportant (actually, counterproductive) for the fine-tuning task. 

In their 2020 paper, Movement Pruning: Adaptive Sparsity by Fine-Tuning, Sanh et al. demonstrated that movement pruning applied to a fine-tuned BERT model adapted better to the end task than magnitude pruning. Movement pruning retained 95 percent of the original performance on natural language inference and question answering while using only 5 percent of the encoder’s weights.
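To contrast this with magnitude pruning, here is a deliberately simplified sketch that follows the description above: it compares weights before and after fine-tuning and discards those that moved toward zero. The `movement_prune` helper and the numbers are hypothetical. The method in the paper is more involved: it learns importance scores during fine-tuning rather than comparing two snapshots.

```python
def movement_prune(pretrained, fine_tuned):
    """Zero out fine-tuned weights that moved toward zero during fine-tuning."""
    return [
        0.0 if abs(after) < abs(before) else after
        for before, after in zip(pretrained, fine_tuned)
    ]


pretrained = [0.8, -0.05, 0.3, -0.9]
fine_tuned = [0.6, -0.20, 0.35, -0.1]
pruned = movement_prune(pretrained, fine_tuned)
print(pruned)
```

Notice that the first weight is discarded even though it is still the largest after fine-tuning, while the second weight is kept even though it is small. Magnitude pruning would make the opposite choice for both.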

Pruning attention heads

One differentiating architectural feature of transformer models is the employment of multiple parallel attention “heads.” In their 2019 paper, Are Sixteen Heads Really Better than One?, Michel et al. showed that models trained by using many heads can be pruned at inference time to include only a subset of the attention heads without significantly affecting model metrics. 

Let’s test the impact of pruning N percent of heads. Because we’re interested in measuring only resource usage, we’ll prune heads randomly. For a real-world application, however, you should prune heads based on their relative importance.

For our RoBERTa model, we need to prune most of the heads to see a noticeable effect on resource usage and inference time.
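The random selection can be sketched as follows. The `random_heads_to_prune` helper is hypothetical, and the 12 layers with 12 heads each are an assumption about a base-size RoBERTa encoder. The result is a dictionary that maps each layer index to a list of head indices, which is the shape that the `prune_heads` method on Hugging Face models accepts.

```python
import random


def random_heads_to_prune(num_layers, num_heads, fraction, seed=0):
    """Pick a random subset of heads per layer, keeping at least one head."""
    rng = random.Random(seed)
    per_layer = min(round(num_heads * fraction), num_heads - 1)
    return {
        layer: sorted(rng.sample(range(num_heads), per_layer))
        for layer in range(num_layers)
    }


# Assumed shape of a base-size RoBERTa encoder: 12 layers, 12 heads each.
heads = random_heads_to_prune(num_layers=12, num_heads=12, fraction=0.9)
print(heads[0])
```

Selecting per layer, and always leaving at least one head, avoids removing every head from a layer. For a real application you would replace the random choice with an importance score per head.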

Change the model architecture

You might ask, “To reduce the size or increase the speed of a model, why not just use a smaller model?” Although this solution isn’t a model compression technique per se, it’s worth discussing because it can reduce resource usage and improve latency.

One potential drawback of decreasing model size is that it can substantially degrade model performance. A solution called knowledge distillation addresses this shortcoming by using a student-teacher approach. That is, during training, a smaller model (“student”) is trained to mimic the behavior of a larger model or ensemble of models (“teacher”). 

This methodology was applied to transformers in 2019 to produce a distilled version of BERT called DistilBERT. The authors showed that DistilBERT retained about 97 percent of BERT’s performance on the GLUE benchmark while being 40 percent smaller.

To train the student model, the authors used a triple loss function:

  • A standard masked language modeling (MLM) objective
  • Distillation loss (the similarity between output probability distribution of student model and teacher model)
  • Cosine distance (the similarity between student hidden states and teacher hidden states)

The first objective (MLM) is the same objective function that’s used to train a standard BERT model. The other two objectives (distillation loss and cosine distance), however, encourage the student model to yield intermediate and final results that are similar to those of the larger teacher model.
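The three terms can be sketched in plain Python for a single masked token. This is a minimal illustration, not the authors’ training code: the function names, the temperature of 2.0, and the equal weighting of the three terms are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(student_logits, true_token):
    """Cross-entropy between the student's prediction and the masked token."""
    return -math.log(softmax(student_logits)[true_token])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_distance(student_hidden, teacher_hidden):
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norms = math.hypot(*student_hidden) * math.hypot(*teacher_hidden)
    return 1.0 - dot / norms


def triple_loss(student_logits, teacher_logits, true_token,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    terms = (
        mlm_loss(student_logits, true_token),
        distillation_loss(student_logits, teacher_logits),
        cosine_distance(student_hidden, teacher_hidden),
    )
    return sum(w * term for w, term in zip(weights, terms))


loss = triple_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 0.2, -2.0],
    true_token=0,
    student_hidden=[0.1, 0.9, -0.3],
    teacher_hidden=[0.2, 1.0, -0.2],
)
print(loss)
```

The temperature softens both distributions so that the student also learns from the relative probabilities the teacher assigns to incorrect tokens, not only from its top prediction.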

We can apply the student-teacher approach to our own trained model, but this task is not trivial. Alternatively, we can use a DistilBERT-flavor foundation model as a starting point and fine-tune it on our data.

Let’s test the resource usage and latency of DistilRoBERTa compared with the larger RoBERTa model:

By using a DistilRoBERTa model, we’ve reduced size, latency, and RAM usage even more than we did by pruning 90 percent of attention heads.

Put it all together

We’ve examined multiple model compression methods individually. But you’ll probably want to combine techniques for the biggest impact. Based on our experience at Prolego, most middle-of-the-road applications benefit from considering these techniques in the following order:

  1. Use ONNX serialization.
  2. Additionally apply quantization by using ONNX.
  3. Use DistilRoBERTa architecture (and serialize and quantize by using ONNX).
  4. Additionally prune attention heads.

The following chart shows the results of applying these techniques sequentially.

The results show that ONNX serialization substantially decreased inference time without changing model predictions. Quantization decreased hard disk usage and further reduced inference time. 

As we saw previously, we can expect quantization to somewhat affect model predictions, but likely not substantially. You could use a smaller architecture like DistilRoBERTa to further reduce hard disk usage and inference time. However, this technique requires model training or retraining by using a different architecture, which can substantially affect model metrics. The extent of the effect likely depends on the data, so you would need to test with a specific use case and dataset to determine the effect. 

Of the techniques we’ve covered, I would recommend pruning only in extreme situations where it’s critical to eke out slightly better performance. Although pruning yields some gains beyond the DistilRoBERTa model alone, you’ll spend much more effort devising an intelligent strategy for choosing which heads to prune than you would swapping in a DistilBERT foundation model for a BERT model.


I hope you’ve found this guide both practical and useful for identifying where to start with model compression. This analysis focused mainly on CPU inference (the cheaper option). For nearly real-time inference, you might prefer to run inference on GPU-enabled instances. 

In the meantime, happy modeling!
