Optimizing Models

With more models being deployed in a single app, and the models themselves growing larger, it is critical to optimize each model's memory footprint. To deploy models on devices such as the iPhone, you often need to optimize them to use less storage space, consume less power, and reduce latency during inference.

Optimization is a rapidly evolving field of machine learning. There are several ways to optimize a deep learning model. For example, you can distill a large model into a smaller one, train a smaller and more efficient architecture from scratch, customize the model architecture for a specific task and hardware, and so on. This section focuses specifically on techniques that achieve a smaller model by compressing its weights, and optionally its activations. In particular, it covers optimizations that take a model with float-precision weights and transform those weights into smaller, approximate, lossy representations, allowing you to trade off task accuracy against overall model size and on-device performance.
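
To make the idea of a lossy weight representation concrete, the following is a minimal NumPy sketch of 8-bit symmetric linear quantization of a toy weight tensor. The tensor and its values are made up for illustration; this is not the coremltools API, only the underlying arithmetic.

```python
import numpy as np

# Toy float32 weight tensor (values are illustrative only).
w = np.random.randn(1024).astype(np.float32)

# 8-bit symmetric linear quantization: map floats to int8 with a single scale.
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to see the lossy approximation that replaces the original weights.
w_approx = w_int8.astype(np.float32) * scale

print("storage: %d bytes -> %d bytes" % (w.nbytes, w_int8.nbytes))  # 4x smaller than float32
print("max abs error:", np.abs(w - w_approx).max())
```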

In this section you learn about model compression techniques that are compatible with the Core ML runtime and Apple hardware, and at what stages in your model deployment workflow you can compress your models. You also learn about the trade-offs, the APIs in Core ML Tools that help you achieve these optimizations, and the kind of memory savings and latency impact you can expect from different techniques.

Types of Compression

There are many ways to compress the weights and activations of a neural network model, and for each kind of compression there are different algorithms to achieve it. This is an extremely active area of research and development. Most of these methods result in weights that can be represented in one of the three formats that Core ML supports: palettized weights backed by a lookup table, sparse weights, and weights linearly quantized to 8 bits or fewer. See the overview for each of these formats for details and examples.
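
As a purely conceptual illustration of one of these formats, the sketch below palettizes a toy weight tensor: it clusters the float values into a small lookup table (LUT) and stores a low-bit index per weight. It uses scikit-learn's KMeans for clustering and does not reflect how Core ML actually stores palettized weights or the coremltools API.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy float32 weight tensor (shape and values are illustrative only).
weights = np.random.randn(256, 128).astype(np.float32)

# 4-bit palettization: cluster all weight values into 2**4 = 16 centroids.
n_bits = 4
kmeans = KMeans(n_clusters=2**n_bits, n_init=1).fit(weights.reshape(-1, 1))

lut = kmeans.cluster_centers_.flatten()    # lookup table: 16 float values
indices = kmeans.labels_.astype(np.uint8)  # 4-bit index per weight (held in uint8 here)

# Reconstructed (lossy) weights: each value is replaced by its centroid.
reconstructed = lut[indices].reshape(weights.shape)
print("max abs error:", np.abs(weights - reconstructed).max())
```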

When to Compress

You can either directly compress a Core ML model, or compress a model in the source framework during training and then convert it. While the former is quicker and does not require data, the latter can preserve accuracy better by fine-tuning with data. To find out more about the two workflows, see Optimization Workflow.

How to Compress

You can compress the model in your source framework and then convert it, or use the recommended workflow: the coremltools.optimize.coreml.* APIs for data-free compression, or the coremltools.optimize.torch.* APIs to compress with data and fine-tuning. To learn more about how to use these APIs, see API Overview.
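
For instance, here is a minimal sketch of the data-free path using the coremltools.optimize.coreml APIs, assuming coremltools 7 or later and an existing mlprogram model; the model path is hypothetical.

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load an existing mlprogram (ML package) model. The path is hypothetical.
mlmodel = ct.models.MLModel("MyModel.mlpackage")

# Configure 6-bit palettization for all supported ops in the model.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)
config = cto.OptimizationConfig(global_config=op_config)

# Compress the weights without any calibration or training data.
compressed_mlmodel = cto.palettize_weights(mlmodel, config)
compressed_mlmodel.save("MyModel_palettized.mlpackage")
```

The coremltools.optimize.torch.* path instead follows a prepare / fine-tune / finalize pattern on the PyTorch model and requires training data; see API Overview for details.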

Learn More about Accuracy and Performance

The accuracy of the compressed model depends not only on the type of model and the task for which it is trained, but also on the compression ratio. To learn more about the impact of model compression methods, see Accuracy and Performance.

Software Availability of Optimizations

| OS versions | Optimizations | Core ML model type | coremltools API |
| --- | --- | --- | --- |
| iOS15 or lower, macOS12 or lower | palettization, 8-bit quantization | neuralnetwork | ct.models.neural_networks.quantization_utils.* |
| iOS16, macOS13 | palettization, sparsity, 8-bit quantization | mlprogram | ct.optimize.* |
| iOS17, macOS14 | iOS16/macOS13 optimizations + 8-bit activation quantization, runtime memory & latency improvements | mlprogram | ct.optimize.* |

You may also find it useful to view the WWDC 2023 session that provides an overview of these optimizations.