Compressing ML Program Weights

The Core ML Tools package includes utilities to compress the weights of a Core ML model in the ML program format. Weight compression reduces the space occupied by the model. However, the precision of the intermediate tensors and the compute precision of the ops are not altered — at runtime weights are decompressed into float precision and all computation uses float precision.

📘

Neural Network Compression

To compress a neural network, see Compressing Neural Network Weights.

Follow these steps:

  1. Convert the PyTorch or TensorFlow model to an ML program using convert(). For instructions, see Unified Conversion API.

    Conversion produces an ML program model with weights in float 16 precision by default, or in float 32 precision if you use compute_precision=ct.precision.FLOAT32 as described in Typed Execution.

  2. Choose one of the following utilities to compress the weights (an end-to-end sketch combining both steps appears after the accuracy note below):

    • affine_quantize_weights: Apply linear quantization to produce 8-bit weights. This process provides up to 2x savings in storage when starting from a float 16 model, or up to 4x savings when starting from a float 32 model:

      model_compressed = ct.compression_utils.affine_quantize_weights(model)
      
    • palettize_weights: Use a linear histogram or k-means clustering algorithm to represent the weights in a lookup table (LUT). This process can compress weights into 1, 2, 4, 6, or 8 bits:

      model_compressed = ct.compression_utils.palettize_weights(model)
      
    • sparsify_weights: Represent zero-value weights efficiently. Use this process if the original model uses a lot of zero-value weights:

      model_compressed = ct.compression_utils.sparsify_weights(model)
      

🚧

Model Accuracy

Compressing a model can affect its accuracy. After using any of these utilities, you should verify the numerical performance of the model using a validation dataset.
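
The following end-to-end sketch combines both steps. The traced torchvision MobileNetV2 is only a placeholder assumption; any traced PyTorch model works the same way, and the output file name is arbitrary:

import torch
import torchvision
import coremltools as ct

# Trace a placeholder PyTorch model (substitute your own traced model).
torch_model = torchvision.models.mobilenet_v2().eval()
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(torch_model, example_input)

# Step 1: convert to an ML program; weights are stored in float 16 by default.
model = ct.convert(traced_model,
                   inputs=[ct.TensorType(shape=example_input.shape)],
                   convert_to="mlprogram")

# Step 2: compress the weights and save the smaller model.
model_compressed = ct.compression_utils.affine_quantize_weights(model)
model_compressed.save("model_compressed.mlpackage")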

Use Affine Quantization

Quantization refers to the process of reducing the number of bits that represent a number. Affine quantization, also known as linear quantization, achieves this by mapping the range of float values to a quantized range, such as the 8-bit integer range [0, 255], and interpolating linearly. This is expressed by the following mathematical equation:

w_unquantized = scale * (w_quantized - zero_point)

where w_unquantized and scale are of type float, and w_quantized and zero_point (also called quantization bias, or offset) are of type unsigned 8-bit integer. The scale and zero point values are computed so that the minimum and maximum float values are mapped to 0 and 255, respectively.
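
As an illustration only (the exact formulae coremltools uses are described in the API reference linked below), the "linear" mapping can be sketched in NumPy with a hypothetical weight tensor:

import numpy as np

# Hypothetical float weights, used only to illustrate the mapping.
w = np.array([-1.0, -0.5, 0.0, 1.5, 2.0], dtype=np.float32)

# "linear" mode: map [w.min(), w.max()] onto the uint8 range [0, 255].
scale = (w.max() - w.min()) / 255.0
zero_point = np.uint8(np.round(-w.min() / scale))

w_quantized = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantizing with the equation above approximately recovers w.
w_unquantized = scale * (w_quantized.astype(np.float32) - zero_point)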

Use the affine_quantize_weights API to convert an ML program that uses float-precision weights into a compressed version that uses 8-bit weights. This function provides up to 2x savings in storage when starting from a float 16 model, or up to 4x savings when starting from a float 32 model.

This utility computes the scales and zero points for each weight value, and converts the float weight values that are stored using const operations in the MIL program into constexpr_affine_dequantize ops that store the uint8 weights along with the scales and zero points. It uses linear symmetric interpolation ("linear_symmetric" mode) by default, or you can specify linear interpolation ("linear" mode).

For a complete description of this function and the formulae used to modify the weights, see the API reference for the compression_utils.affine_quantize_weights method.

Linear Interpolation Mode

Linear interpolation ("linear" mode) maps the minimum and maximum (min/max) of a floating-point range to the integer range [0, 255] using a zero point and a scale factor.

The following example uses the function to compress a model using linear interpolation mode:

import coremltools as ct

compressed_model = ct.compression_utils.affine_quantize_weights(model, 
                                                                mode="linear")

Linear Symmetric Interpolation Mode

With the linear symmetric interpolation option (the default "linear_symmetric" mode), rather than mapping the exact min/max of the floating-point range to the range [0, 255], the maximum absolute value between the min/max is picked. The negative of this value is treated as the min, and these new min/max values are mapped to the range [0, 254]. This results in a zero point value of 127. The floating-point range is symmetric with respect to zero, and so is the quantized range.

The following example uses the function to compress a model using the default linear symmetric interpolation mode:

import coremltools as ct

compressed_model = ct.compression_utils.affine_quantize_weights(model)
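
To see why the zero point lands at 127, here is a small numeric sketch of the symmetric mapping with hypothetical weights; it is illustrative only and not the library's exact implementation:

import numpy as np

# Hypothetical float weights, used only to illustrate the symmetric mapping.
w = np.array([-1.0, -0.5, 0.0, 1.5, 2.0], dtype=np.float32)

# Pick the largest absolute value, so the range [-2.0, 2.0] maps onto [0, 254].
max_abs = np.abs(w).max()
scale = (2 * max_abs) / 254.0
zero_point = np.uint8(127)          # 0.0 always maps to 127

w_quantized = np.clip(np.round(w / scale) + zero_point, 0, 254).astype(np.uint8)
w_unquantized = scale * (w_quantized.astype(np.float32) - zero_point)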

Use a Lookup Table

Use palettize_weights to construct a lookup table (LUT) to compress a floating-point ML program by reducing the number of distinct weight values.

A LUT can be used to map an integer index to the floating-point weight values. An nbit LUT has 2^nbits entries. For example, a float weight vector such as {0.3, 0.3, 0.5, 0.5} can be compressed using a 1-bit LUT to {0: 0.3, 1: 0.5}. In this case the float vector is replaced with a 1-bit vector {0,0,1,1}.

The palettize_weights function discretizes the values of all weights in the ML program and constructs the LUT according to the algorithm you specify as mode. The float values are then converted to nbit values, and the LUT is saved alongside each weight. The const ops storing weight values are replaced by constexpr_lut_to_dense ops.

At runtime, the LUT and the nbit values are used to reconstruct the float weight values, which are then used to perform the float operations for the weights.
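
The reconstruction step amounts to an indexing operation. A minimal NumPy sketch of the 1-bit example above:

import numpy as np

# The 1-bit LUT and indices for the weight vector {0.3, 0.3, 0.5, 0.5}.
lut = np.array([0.3, 0.5], dtype=np.float32)       # 2^1 = 2 entries
indices = np.array([0, 0, 1, 1], dtype=np.uint8)   # one 1-bit index per weight

# At runtime the float weights are reconstructed by looking up each index.
weights = lut[indices]                              # [0.3, 0.3, 0.5, 0.5]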

Specify how the LUT is constructed by choosing one of the following as the mode:

  • "uniform" (default): The LUT is generated by a linear histogram, which is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or "bins") and the representation displays how many values fall into each bin. Linear histograms have one bin at even intervals, such as one bin per integer.

  • "kmeans": The LUT is generated by k-means clustering, a method of vector quantization that groups similar data points together to discover underlying patterns by using a fixed number (k) of clusters in a dataset. A cluster refers to a collection of data points aggregated together because of certain similarities.

  • "unique": The LUT is generated by unique values in the weights. The weights are assumed to be on a discrete lattice but stored in a float data type. This parameter identifies the weights and converts them into the palettized representation.

Consider the following example of "uniform" mode:

  • nbits = 2
  • mode = "uniform"
  • weight = [0.11, 0.19, 0.3, 0.08, 0.0, 0.02]

The weight can be converted to the following:

  • A palette with indices [0, 1, 2, 3] (2 bits). The indices are a byte array.
  • The data range [0.0, 0.3] is divided into 4 partitions linearly, which is [0.0, 0.1, 0.2, 0.3].
  • The LUT would be [0.0, 0.1, 0.2, 0.3].
  • The weight is rounded to [0.1, 0.2, 0.3, 0.1, 0.0, 0.0], and represented in the palette as indices [01b, 10b, 11b, 01b, 00b, 00b].
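
Here is a NumPy sketch that reproduces this arithmetic; it only illustrates the "uniform" scheme and is not how the library stores the result:

import numpy as np

weight = np.array([0.11, 0.19, 0.3, 0.08, 0.0, 0.02], dtype=np.float32)
nbits = 2

# The LUT spans [weight.min(), weight.max()] with 2^nbits evenly spaced entries.
lut = np.linspace(weight.min(), weight.max(), 2 ** nbits)   # [0.0, 0.1, 0.2, 0.3]

# Each weight maps to the index of its nearest LUT entry.
indices = np.argmin(np.abs(weight[:, None] - lut[None, :]), axis=1)
# indices       -> [1, 2, 3, 1, 0, 0]
# lut[indices]  -> [0.1, 0.2, 0.3, 0.1, 0.0, 0.0]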

The following example uses the utility to compress a model using k-means clustering:

compressed_model = ct.compression_utils.palettize_weights(model, 
                                                          nbits=4, 
                                                          mode="kmeans")

For a complete description of this utility, see palettize_weights.

Use Sparse Representation

Use sparsify_weights to compress a floating-point ML program by representing zero-value weights efficiently. Sparse representation is more efficient than a dense one if the model is trained with pruning techniques so that a lot of weights have zero values.

The sparsified weights are stored using a bit mask. For example, if the weight values are {0, 0, 0, 0, 0, 0, 0, 56.3}, the sparse representation contains a bit mask with ones in the locations where the value is non-zero: 00000001b. This is accompanied by the non-zero data, which in this case is a size-1 vector containing the value {56.3}.

For example, given the following:

  • weight = [0.3, 0, 0, 0.5, 0, 0]
  • non_zero_data, bit_mask = sparsify(weight)

The resulting sparse representation is:

  • non_zero_data = [0.3, 0.5]
  • bit_mask = "100100"
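
A minimal NumPy sketch of how such a mask and non-zero vector could be derived (illustrative only):

import numpy as np

weight = np.array([0.3, 0, 0, 0.5, 0, 0], dtype=np.float32)

mask = weight != 0                  # [True, False, False, True, False, False]
non_zero_data = weight[mask]        # [0.3, 0.5]
bit_mask = "".join("1" if m else "0" for m in mask)   # "100100"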

Choose the scheme to sparsify the model by specifying one of the following as the mode:

"threshold_based" (default): All the absolute weight values that are smaller than threshold are changed to 0, and the tensor is stored in a sparse format. For example, given the following:

  • weight = [0.3, -0.2, -0.01, 0.05]
  • threshold = 0.03

The sparsified weight would be [0.3, -0.2, 0, 0.05].

"percentile_based": Sparsify the weight with a constant sparsity percentile, which is target_percentile. Where n = floor(size_of_weight_tensor * target_percentile), the n lowest absolute weight values are changed to 0.

For example, given the following:

  • weight = [0.3, -0.2, -0.01, 0.05]

  • target_percentile = 0.75

The sparsified weight would be [0.3, 0, 0, 0].
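
Here is a NumPy sketch that reproduces the arithmetic of both mode examples above; it is illustrative only and not the library implementation:

import numpy as np

weight = np.array([0.3, -0.2, -0.01, 0.05], dtype=np.float32)

# threshold_based: zero out entries whose absolute value is below the threshold.
threshold = 0.03
threshold_sparsified = np.where(np.abs(weight) < threshold, 0.0, weight)
# -> [0.3, -0.2, 0.0, 0.05]

# percentile_based: zero out the n smallest-magnitude entries,
# where n = floor(weight.size * target_percentile).
target_percentile = 0.75
n = int(np.floor(weight.size * target_percentile))
order = np.argsort(np.abs(weight))       # indices, smallest magnitude first
percentile_sparsified = weight.copy()
percentile_sparsified[order[:n]] = 0.0
# -> [0.3, 0.0, 0.0, 0.0]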

The following example uses the utility to compress a model using threshold_based mode:

compressed_model = ct.compression_utils.sparsify_weights(model, 
                                                         mode="threshold_based", 
                                                         threshold=0.01)

For a complete description of this utility, see sparsify_weights.

Control Operations Whose Weights Are Compressed

You can optionally control which operations have their weights compressed by providing the op_selector argument to the compression method. Specify an op_selector function that receives an object of type const operation as input:

ct.compression_utils.affine_quantize_weights(model, 
                                             op_selector=op_selector_function)

The op_selector function must return a bool: True to compress the const op, or False to leave it unchanged. See the following examples:

  • All constants in the network are compressed:

    def op_selector(const_op):
        return True
    
  • Only constants with tensor.size > 2048 are compressed:

    def op_selector(const_op):
        return const_op.val.val.size > 2048
    
  • Compress the constant only if it is the weight of a convolution layer
    and tensor.size > 2048:

    def op_selector(const_op):
        return (const_op.val.val.size > 2048
                and const_op.val.child_ops[0].op_type == "conv"
                and const_op.val == const_op.val.child_ops[0].weight)
    

When creating a custom op_selector function, the following attributes are useful:

  • const_op.val.val: The numpy array holding the value of the const.
  • const_op.val.child_ops: A list of ops into which this constant is feeding.
  • const_op.val.child_ops[i].op_type: The string corresponding to the op type of the i-th child op.
  • const_op.val.child_ops[i].name: The string corresponding to the name of the i-th child op.
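
Putting these together, the following sketch passes a custom selector to one of the utilities. The selector here is only an example, and any of the compression utilities above accept op_selector:

import coremltools as ct

# Example selector: only compress large convolution weights.
def op_selector(const_op):
    return (const_op.val.val.size > 2048
            and const_op.val.child_ops[0].op_type == "conv")

compressed_model = ct.compression_utils.palettize_weights(model,
                                                          nbits=6,
                                                          mode="kmeans",
                                                          op_selector=op_selector)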

If op_selector is not provided, the default behavior is used, in which weights with more than 2048 elements are compressed:

def op_selector(const_op):
    return const_op.val.val.size > 2048