Post-Training Quantization

You can linearly quantize the weights of your Core ML model by using the linear_quantize_weights method as follows:

```python
import coremltools.optimize.coreml as cto

# `model` is assumed to be an already-loaded Core ML model (MLModel).
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)

compressed_8_bit_model = cto.linear_quantize_weights(model, config=config)
```

Quantize Activations Plus Weights

If you want to quantize the activations in addition to the weights, use Training-Time Quantization.

The linear_quantize_weights method iterates over the weights of the model. Weights whose size (number of elements) exceeds the specified weight_threshold are quantized to the 8-bit range according to the mode set in OpLinearQuantizerConfig. The mode defaults to linear_symmetric, which uses only per-channel scales and no zero-points. You can instead choose the linear mode, which also uses a zero-point and may give slightly better accuracy.
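To make the difference between the two modes concrete, here is a minimal pure-Python sketch of quantizing a single channel of weights to the 8-bit range. It illustrates the general idea only and is not the coremltools implementation: the function names, the [-127, 127] and [-128, 127] integer ranges, and the sample values are all assumptions chosen for illustration.

```python
def quantize_symmetric(channel):
    """Scale-only quantization: the zero-point is fixed at 0 (illustrative)."""
    max_abs = max(abs(w) for w in channel)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in channel]
    return q, scale, 0


def quantize_affine(channel):
    """Scale plus zero-point: maps [min, max] onto [-128, 127] (illustrative)."""
    lo, hi = min(channel), max(channel)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in channel]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def max_error(channel, quantizer):
    """Largest absolute round-trip error for the given quantizer."""
    q, scale, zero_point = quantizer(channel)
    restored = dequantize(q, scale, zero_point)
    return max(abs(a - b) for a, b in zip(channel, restored))


# Hypothetical channel whose values are all non-negative (a skewed range).
channel = [0.0, 0.1, 0.5, 1.0]
sym_err = max_error(channel, quantize_symmetric)
aff_err = max_error(channel, quantize_affine)
```

For a skewed channel like this one, the zero-point variant spends the whole 8-bit range on the values that actually occur, so its worst-case rounding error comes out smaller than that of the symmetric variant. For weights that are roughly centered on zero, the two behave much more alike.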

For options on how to set different quantization configs for different weights in the same network, see Customizing Ops to Compress.
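As a rough sketch of what such a per-op setup can look like, OptimizationConfig accepts op_type_configs and op_name_configs alongside global_config. The op name "conv_1" and the threshold values below are hypothetical placeholders, and the exact behavior should be checked against the API Reference for your installed coremltools version.

```python
import coremltools.optimize.coreml as cto

# Default for every op that has no more specific entry.
global_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)

# Use the zero-point ("linear") mode for all ops of type "linear".
linear_config = cto.OpLinearQuantizerConfig(mode="linear", weight_threshold=512)

config = cto.OptimizationConfig(
    global_config=global_config,
    op_type_configs={"linear": linear_config},
    # Mapping an op name to None leaves that op uncompressed.
    # "conv_1" is a hypothetical op name; use a name from your own model.
    op_name_configs={"conv_1": None},
)

# `model` is assumed to be an already-loaded Core ML model (MLModel).
# compressed_model = cto.linear_quantize_weights(model, config=config)
```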

For more details on the parameters available in the config, see OpLinearQuantizerConfig and linear_quantize_weights in the API Reference.

If your model's accuracy drops considerably after weight quantization, or if your model is fully resident on the Neural Engine and you want to check for further latency gains, consider quantizing both the weights and the activations using [Training-Time Quantization](doc: data-dependent-quantization).