Post-Training Quantization
You can linearly quantize the weights of your Core ML model by using the linear_quantize_weights
method as follows:
```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load the model to compress (the file name here is just a placeholder)
model = ct.models.MLModel("model.mlpackage")

op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)
compressed_8_bit_model = cto.linear_quantize_weights(model, config=config)
```
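The result is a regular Core ML model object, so as a quick follow-up you can save it out like any other model; the file name below is only an example:

```python
# Save the weight-quantized model as a new package (file name is illustrative)
compressed_8_bit_model.save("model_quantized_8bit.mlpackage")
```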
Quantize Activations Plus Weights
If you want to quantize the activations in addition to the weights, use Training-Time Quantization.
The linear_quantize_weights method iterates over the weights of the model. Weights whose sizes are above the specified weight_threshold are quantized to the 8-bit range according to the mode specified in OpLinearQuantizerConfig. The mode defaults to linear_symmetric, which uses only per-channel scales and no zero-points. You can also choose the linear mode, which uses a zero-point as well and may yield slightly better accuracy, as shown in the sketch below.
For options on how to set different quantization configs for different weights in the same network, see Customizing Ops to Compress.
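As a rough sketch of that kind of customization (not taken from that page), OptimizationConfig also accepts per-op-type and per-op-name overrides. The op name "conv_1" below is hypothetical, and mapping an entry to None is assumed here to leave the matching ops uncompressed:

```python
import coremltools.optimize.coreml as cto

global_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)

# Hypothetical override: quantize the op named "conv_1" with a zero-point ("linear" mode)
conv1_config = cto.OpLinearQuantizerConfig(mode="linear", weight_threshold=512)

config = cto.OptimizationConfig(
    global_config=global_config,
    op_name_configs={"conv_1": conv1_config},  # per-op-name override (op name is hypothetical)
    op_type_configs={"gather": None},          # assumption: None skips compression for this op type
)
compressed_model = cto.linear_quantize_weights(model, config=config)
```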
For more details on the parameters available in the config, see OpLinearQuantizerConfig, OptimizationConfig, and linear_quantize_weights in the API Reference.
If your model's accuracy drops considerably after quantizing its weights, or if your model runs fully on the Neural Engine and you want to see whether you can get further latency gains, consider quantizing both the weights and the activations using [Training-Time Quantization](doc:data-dependent-quantization).