All converters in the coremltools package return a Core ML MLModel object, which you can save as a model (.mlmodel) file. To reduce the size of the .mlmodel file, coremltools provides utilities for performing post-training quantization for the weight parameters. This quantization is applicable only for neural network models.
Weight Quantization Availability
The weight quantization utilities on this page, which are used for quantizing weights to 8 bits or less, are available only with the Neural Network backend, not with the ML Programs target. When converting to ML Programs, you can produce a model with Float 32 weights or Float 16 weights. For details, see Typed Execution.
The starting point for quantization is the .mlmodel file or the MLModel object returned by the convert call. By default, the converters produce an MLModel with weights in floating point 32 (FP32) bit precision. The weights can be quantized to 16 bits, 8 bits, 7 bits, and so on down to 1 bit.
Bits and accuracy
The lower the number of bits, the greater the chance of degrading the model's accuracy. The loss in accuracy varies with the model.
The quantization utilities described on this page only affect the weights of the model. The intermediate tensors are kept in float precision (float 32 or float 16 depending on execution unit), while the weights are dequantized at runtime, if required, to match the precision of the intermediate tensors.
The quantize_weights function handles all quantization modes and options:

from coremltools.models.neural_network import quantization_utils

# allowed values of nbits = 16, 8, 7, 6, ..., 1
quantized_model = quantization_utils.quantize_weights(model, nbits)
Quantizing to FP16, which halves the model's disk size, is the safest quantization option because it generally does not affect the model's accuracy:
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# load full precision model
model_fp32 = ct.models.MLModel('model.mlmodel')

model_fp16 = quantization_utils.quantize_weights(model_fp32, nbits=16)
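To get a feel for why 16-bit quantization is usually safe, the following minimal sketch rounds a few weight values to half precision and back using Python's standard `struct` module (format code `e`). It does not use coremltools, and the weight values are made up for illustration; it only shows the size of the rounding error that FP16 storage introduces.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Made-up sample weights, all within the normal half-precision range
weights = [0.1, -0.25, 0.333333, 1.5, -2.71828]
roundtrip = [to_fp16_and_back(w) for w in weights]

# Largest relative rounding error across the sample
max_rel_error = max(abs(r - w) / abs(w) for w, r in zip(weights, roundtrip))

print(roundtrip)
print(f"max relative error: {max_rel_error:.2e}")
```

For values in the normal half-precision range, the relative rounding error stays at or below 2^-11 (about 0.05%). Weights with a magnitude above roughly 65504 cannot be represented in FP16 at all, which is one case where this option would not be safe.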
Quantizing to 8 bits reduces the disk size to one fourth of the FP32 model. However, it may affect model accuracy, so you should always test the model after quantization, using test data. Depending on the model type, you may be able to quantize to fewer than 8 bits without losing accuracy.
from coremltools.models.neural_network.quantization_utils import quantize_weights

# quantize to 8 bit using linear mode
model_8bit = quantize_weights(model_fp32, nbits=8)

# quantize to 8 bit using LUT kmeans mode
model_8bit = quantize_weights(model_fp32, nbits=8,
                              quantization_mode="kmeans_lut")

# quantize to 8 bit using linear_symmetric mode
model_8bit = quantize_weights(model_fp32, nbits=8,
                              quantization_mode="linear_symmetric")
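The size ratios mentioned above (one half for 16 bits, one fourth for 8 bits) follow from simple arithmetic on the weight storage. The sketch below is a rough estimate only: it assumes that weights dominate the file size and are tightly bit-packed, and it ignores per-layer overhead such as scale and bias terms or lookup tables. The helper `weight_storage_bytes` and the parameter count are hypothetical.

```python
def weight_storage_bytes(num_weights: int, nbits: int) -> int:
    """Bytes needed to store num_weights values at nbits each, rounded up.

    Ignores per-layer overhead (scale/bias terms, lookup tables, metadata).
    """
    return (num_weights * nbits + 7) // 8


num_weights = 1_000_000  # hypothetical model with one million weights
fp32_bytes = weight_storage_bytes(num_weights, 32)

for nbits in (16, 8, 4):
    size = weight_storage_bytes(num_weights, nbits)
    print(f"{nbits:>2} bits: {size:>9,} bytes ({size / fp32_bytes:.3f} of FP32)")
```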
When you set nbits to a value between 1 and 8, you can choose one of the following quantization modes:
- linear: The default mode, which uses linear quantization for weights with a scale and bias term.
- linear_symmetric: Symmetric quantization, with only a scale term.
- kmeans_lut: Uses a k-means clustering algorithm to construct a lookup table quantization of weights.
Try these different algorithms with your model, as some may work better than others depending on the model type.
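The difference between the linear and linear_symmetric modes can be sketched in plain Python. This is an illustration of the general technique, not the coremltools implementation: the function names, the choice of integer ranges, and the sample weights are assumptions, and the library's exact rounding and range conventions may differ.

```python
def quantize_linear(weights, nbits):
    """Affine quantization: w is approximated by scale * q + bias."""
    levels = 2 ** nbits - 1                  # q ranges over 0..levels
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:                         # constant weights: any scale works
        scale = 1.0
    q = [round((w - w_min) / scale) for w in weights]
    return q, scale, w_min                   # the bias is the minimum weight


def quantize_symmetric(weights, nbits):
    """Symmetric quantization: w is approximated by scale * q (no bias)."""
    if nbits < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed range")
    q_max = 2 ** (nbits - 1) - 1             # q ranges over -q_max..q_max
    scale = max(abs(w) for w in weights) / q_max
    if scale == 0.0:
        scale = 1.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale, bias=0.0):
    """Recover approximate float weights, as a runtime would before compute."""
    return [scale * v + bias for v in q]


weights = [-1.0, -0.2, 0.0, 0.45, 2.0]       # made-up sample weights

q_lin, scale_lin, bias_lin = quantize_linear(weights, nbits=8)
q_sym, scale_sym = quantize_symmetric(weights, nbits=8)

err_lin = max(abs(a - b)
              for a, b in zip(weights, dequantize(q_lin, scale_lin, bias_lin)))
err_sym = max(abs(a - b)
              for a, b in zip(weights, dequantize(q_sym, scale_sym)))

print(f"linear:           max error {err_lin:.5f} (step {scale_lin:.5f})")
print(f"linear_symmetric: max error {err_sym:.5f} (step {scale_sym:.5f})")
```

In this sketch the symmetric mode has to cover the full range from -max|w| to +max|w| even though the sample weights are skewed toward positive values, so its step size is larger than in the affine mode. That is one reason the modes can behave differently on the same model.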
The following options enable you to experiment with the quantization scheme so that you can find one that works best with your model.
By default, the k-means algorithm is used to find the lookup table (LUT). However, you can provide a custom function to compute the LUT by setting quantization_mode = "custom_lut".
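As an illustration of what a custom lookup-table function might compute, the sketch below builds a table of evenly spaced values and maps each weight to its nearest entry. The name `uniform_lut`, its `(nbits, weights) -> (lut, indices)` shape, and the plain-list inputs are assumptions made for this sketch; check the quantize_weights API reference for the exact callable signature and the keyword used to pass it before wiring a function like this into coremltools.

```python
def uniform_lut(nbits, weights):
    """Hypothetical LUT builder: 2**nbits evenly spaced entries from min to max.

    Returns the table and, for each weight, the index of its nearest entry.
    """
    n_entries = 2 ** nbits
    w_min, w_max = min(weights), max(weights)
    step = (w_max - w_min) / (n_entries - 1)
    lut = [w_min + i * step for i in range(n_entries)]
    if step == 0.0:                          # all weights are equal
        return lut, [0] * len(weights)
    # Nearest entry index, clamped to the valid table range
    indices = [min(round((w - w_min) / step), n_entries - 1) for w in weights]
    return lut, indices


weights = [-0.8, -0.1, 0.0, 0.3, 0.9]        # made-up sample weights
lut, indices = uniform_lut(2, weights)       # 2 bits: a 4-entry table

# Dequantization is a plain table lookup
reconstructed = [lut[i] for i in indices]

print("lut:          ", [round(v, 4) for v in lut])
print("indices:      ", indices)
print("reconstructed:", [round(v, 4) for v in reconstructed])
```

With a lookup table, only the small integer indices and the table itself need to be stored, so the table entries are free to sit wherever they represent the weights best; a k-means table places them at cluster centers instead of at even intervals.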
By default, all the layers that have weight parameters are quantized. However, the model accuracy may be sensitive to certain layers, which shouldn't be quantized. You can choose to skip quantization for certain layers and experiment as follows:
- Use the AdvancedQuantizedLayerSelector class, which lets you set simple properties such as layer types and weight count. For example:
# Example: 8-bit symmetric linear quantization skipping bias,
# batchnorm, depthwise-convolution, and convolution layers
# with fewer than 4 channels or 4096 elements
from coremltools.models.neural_network.quantization_utils import (
    AdvancedQuantizedLayerSelector,
    quantize_weights,
)

selector = AdvancedQuantizedLayerSelector(
    skip_layer_types=['batchnorm', 'bias', 'depthwiseConv'],
    minimum_conv_kernel_channels=4,
    minimum_conv_weight_count=4096
)

quantized_model = quantize_weights(model,
                                   nbits=8,
                                   quantization_mode='linear_symmetric',
                                   selector=selector)
- For finer control, you can write a custom rule to skip (or not skip) quantizing a layer by extending the QuantizedLayerSelector class:
# Example: 8-bit linear quantization skipping the layer with name 'dense_2'
from coremltools.models.neural_network.quantization_utils import (
    QuantizedLayerSelector,
    quantize_weights,
)


class MyLayerSelector(QuantizedLayerSelector):

    def __init__(self):
        super(MyLayerSelector, self).__init__()

    def do_quantize(self, layer, **kwargs):
        # Respect the base-class decision, and additionally skip 'dense_2'
        ret = super(MyLayerSelector, self).do_quantize(layer)
        if not ret or layer.name == 'dense_2':
            return False
        return True


selector = MyLayerSelector()
quantized_model = quantize_weights(
    mlmodel,
    nbits=8,
    quantization_mode='linear',
    selector=selector
)