Use Core ML to integrate machine learning models into your app. Core ML provides a unified representation for all models. Your app uses Core ML APIs and user data to make predictions, and to train or fine-tune models, all on the user’s device.

Core ML optimizes on-device performance by leveraging the CPU, GPU, and Neural Engine while minimizing its memory footprint and power consumption. Running a model strictly on the user’s device removes any need for a network connection, which helps keep the user’s data private and your app responsive.

Compressing Neural Network Weights

The coremltools package includes a utility to compress the weights of a Core ML neural network model. Weight compression reduces the space occupied by the model. However, the precision of the intermediate tensors and the compute precision of the ops are not altered.


For neural networks only

You can use the weight quantization utility to quantize weights in a neural network to 8 bits or less. For ML programs, use the compression utilities described in Compressing ML Program Weights.

Quantization refers to the process of reducing the number of bits that represent a number. The fewer the bits, the greater the risk of degrading the model's accuracy. The loss in accuracy varies with the model.

By default, the coremltools converters produce a model with weights in floating-point 32 bit (float 32) precision. The weights can be quantized to 16 bits, 8 bits, 7 bits, and so on down to 1 bit. The intermediate tensors are kept in float precision (float 32 or float 16 depending on execution unit), while the weights are dequantized at runtime to match the precision of the intermediate tensors. Quantizing from float 32 to float 16 provides up to 2x savings in storage and generally does not affect the model's accuracy.
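The storage savings follow from simple arithmetic on the bit width. The sketch below estimates weight storage for a hypothetical layer of one million weights. `weight_storage_bytes` is an illustrative helper, not part of coremltools, and it ignores the small overhead of any per-layer scale, bias, or lookup-table values.

```python
import math

def weight_storage_bytes(num_weights, nbits):
    """Approximate bytes needed to store num_weights values at nbits each."""
    return math.ceil(num_weights * nbits / 8)

num_weights = 1_000_000  # hypothetical layer size, chosen for illustration
fp32_bytes = weight_storage_bytes(num_weights, 32)

for nbits in (32, 16, 8, 4, 1):
    size = weight_storage_bytes(num_weights, nbits)
    ratio = fp32_bytes / size
    print(f"{nbits:>2} bits: {size:>9} bytes ({ratio:.0f}x reduction vs. float 32)")
```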

The quantize_weights function handles all quantization modes and options:

from coremltools.models.neural_network import quantization_utils

# allowed values of nbits = 16, 8, 7, 6, ...., 1
quantized_model = quantization_utils.quantize_weights(model, nbits)

For a full list of supported arguments, see quantize_weights in the API Reference. The following examples demonstrate some of these arguments.

Quantize to float 16 weights

Quantizing to float 16 halves the model's disk size and is the safest quantization option, since it generally does not affect the model's accuracy:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# load full precision model
model_fp32 = ct.models.MLModel('model.mlmodel')

model_fp16 = quantization_utils.quantize_weights(model_fp32, nbits=16)

Quantize to 1-8 bits

Quantizing to 8 bits reduces the disk size to one fourth of the float 32 model. However, it may affect model accuracy, so you should always evaluate the quantized model on test data. Depending on the model type, you may be able to quantize to fewer than 8 bits without losing accuracy.

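# quantize to 8 bit using linear mode (the default)
model_8bit = quantization_utils.quantize_weights(model_fp32, nbits=8)

# quantize to 8 bit using LUT kmeans mode
model_8bit = quantization_utils.quantize_weights(model_fp32, nbits=8,
                             quantization_mode="kmeans_lut")

# quantize to 8 bit using linear_symmetric mode
model_8bit = quantization_utils.quantize_weights(model_fp32, nbits=8,
                             quantization_mode="linear_symmetric")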

When you set nbits to a value between 1 and 8, you can choose one of the following quantization modes:

  • linear: The default mode, which uses linear quantization for weights with a scale and bias term.
  • linear_symmetric: Symmetric quantization, with only a scale term.
  • kmeans_lut: Uses a k-means clustering algorithm to construct a lookup table (LUT) quantization of weights.
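The difference between the two linear modes can be sketched in plain Python. This is a generic illustration of scale-and-bias versus scale-only quantization, not the coremltools implementation: the function names are hypothetical, and the exact integer ranges and rounding rules the library uses may differ.

```python
def quantize_linear(weights, nbits):
    """Scale and bias: w ~= scale * q + bias, with q in [0, 2**nbits - 1]."""
    levels = 2 ** nbits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant weights
    bias = w_min
    q = [round((w - bias) / scale) for w in weights]
    return q, scale, bias

def quantize_linear_symmetric(weights, nbits):
    """Scale only: w ~= scale * q, with q in [-(2**(nbits-1) - 1), 2**(nbits-1) - 1]."""
    if nbits < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed range")
    q_max = 2 ** (nbits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale, bias=0.0):
    """Recover approximate float weights from quantized integers."""
    return [scale * v + bias for v in q]

weights = [-0.8, -0.1, 0.0, 0.35, 1.2]

q_lin, scale_lin, bias_lin = quantize_linear(weights, nbits=8)
restored_lin = dequantize(q_lin, scale_lin, bias_lin)

q_sym, scale_sym = quantize_linear_symmetric(weights, nbits=8)
restored_sym = dequantize(q_sym, scale_sym)

print("linear max error:   ", max(abs(a - b) for a, b in zip(weights, restored_lin)))
print("symmetric max error:", max(abs(a - b) for a, b in zip(weights, restored_sym)))
```

The scale-and-bias form spends its whole integer range on the interval between the smallest and largest weight, while the symmetric form keeps zero exactly representable at the cost of a slightly coarser step when the weights are not centered on zero.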

Try these different algorithms with your model, as some may work better than others depending on the model type.
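One way to compare modes is to measure how far each quantized model's outputs drift from the full-precision outputs on the same test inputs. The sketch below computes a worst-case signal-to-noise ratio. `predict_fp32` and `predict_quantized` are hypothetical stand-ins written as plain Python functions; in practice you would replace them with the prediction calls of your own models.

```python
import math

def signal_to_noise_db(reference, candidate):
    """SNR in dB between two equal-length output vectors; higher means closer."""
    signal = sum(r * r for r in reference)
    noise = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)

def compare_predictions(predict_reference, predict_candidate, samples):
    """Return the lowest (worst-case) SNR over a list of test inputs."""
    return min(
        signal_to_noise_db(predict_reference(x), predict_candidate(x))
        for x in samples
    )

# Hypothetical stand-ins for the full-precision and quantized models.
def predict_fp32(x):
    return [0.5 * v + 0.25 for v in x]

def predict_quantized(x):
    # Simulate quantization error by rounding outputs to one decimal place.
    return [round(0.5 * v + 0.25, 1) for v in x]

samples = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
worst_snr = compare_predictions(predict_fp32, predict_quantized, samples)
print(f"worst-case SNR: {worst_snr:.1f} dB")
```

Running the same comparison for each quantization mode and bit width gives a like-for-like basis for choosing between them, alongside whatever task-level accuracy metric the model is normally judged by.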

Quantization options

The following options enable you to experiment with the quantization scheme so that you can find one that works best with your model.

Custom LUT function

By default, the k-means algorithm is used to find the lookup table (LUT). However, you can provide a custom function to compute the LUT by setting quantization_mode = "custom_lut".
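Conceptually, a lookup-table function maps each weight to an index into a small table of representative values. The sketch below builds a uniformly spaced table instead of a k-means one, purely for illustration. The name `uniform_lut`, its `(nbits, weights)` arguments, and its `(lut, indices)` return value are assumptions made for this example; check the quantize_weights entry in the API Reference for the exact signature and array types the library expects before passing in a function like this.

```python
def uniform_lut(nbits, weights):
    """Build a 2**nbits-entry table spaced evenly between the min and max
    weight, and map each weight to the index of its nearest table entry."""
    entries = 2 ** nbits
    w_min, w_max = min(weights), max(weights)
    step = (w_max - w_min) / (entries - 1)
    lut = [w_min + i * step for i in range(entries)]
    if step == 0:
        # All weights are identical, so every weight maps to entry 0.
        return lut, [0] * len(weights)
    indices = [
        min(entries - 1, max(0, round((w - w_min) / step)))
        for w in weights
    ]
    return lut, indices

weights = [-1.0, -0.4, 0.1, 0.45, 1.0]
lut, indices = uniform_lut(2, weights)  # 2 bits gives a 4-entry table
restored = [lut[i] for i in indices]

print("table:   ", lut)
print("indices: ", indices)
print("restored:", restored)
```

Only the table and the per-weight indices need to be stored, which is where the space saving comes from: each weight costs nbits instead of 32.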

Control which layers are quantized

By default, all the layers that have weight parameters are quantized. However, the model accuracy may be sensitive to certain layers, which shouldn't be quantized. You can choose to skip quantization for certain layers and experiment as follows:

  • Use the AdvancedQuantizedLayerSelector class, which lets you set simple properties such as layer types and weight count. For example:
# Example: 8-bit symmetric linear quantization skipping bias,
# batchnorm, depthwise-convolution, and convolution layers
# with fewer than 4 channels or 4096 elements
from coremltools.models.neural_network.quantization_utils import (
    AdvancedQuantizedLayerSelector,
    quantize_weights,
)

selector = AdvancedQuantizedLayerSelector(
    skip_layer_types=['batchnorm', 'bias', 'depthwiseConv'],
    minimum_conv_kernel_channels=4,
    minimum_conv_weight_count=4096,
)

quantized_model = quantize_weights(model, nbits=8,
                                   quantization_mode='linear_symmetric',
                                   selector=selector)

For a list of all the layer types in the Core ML neural network model, see the NeuralNetworkLayer section in the Core ML Format reference for NeuralNetwork.

For finer control, you can write a custom rule to skip (or not skip) quantizing a layer by extending the QuantizedLayerSelector class:

# Example: 8-bit linear quantization skipping the layer named 'dense_2'
from coremltools.models.neural_network.quantization_utils import (
    QuantizedLayerSelector,
    quantize_weights,
)

class MyLayerSelector(QuantizedLayerSelector):

    def __init__(self):
        super(MyLayerSelector, self).__init__()

    def do_quantize(self, layer, **kwargs):
        # Defer to the base class first, then exclude 'dense_2' by name.
        ret = super(MyLayerSelector, self).do_quantize(layer)
        if not ret or layer.name == 'dense_2':
            return False
        return True

selector = MyLayerSelector()
quantized_model = quantize_weights(
    model,
    nbits=8,
    quantization_mode='linear',
    selector=selector,
)

