Palettization, also referred to as weight clustering, compresses a model by clustering the model's
float weights, and creating a lookup table (LUT) of centroids, and then storing the original weight values with indices pointing to the entries in the LUT.
Weights with similar values are grouped together and represented using the value of the cluster centroid they belong to, as shown in the following figure. The original weight matrix is converted to an index table in which each element points to the corresponding cluster center.
The previous figure shows 4 clusters. Each weight is represented using 2 bits, achieving 8x compression over
float16. The number of unique cluster centers used to represent the weights is equal to 2 to the power of
n is the number of bits used for palettization. Thus, 4-bit palettization means you can have 16 clusters.
Since palettization reduces the size of each weight value, the amount of data to be moved is reduced during prediction. This can lead to benefits in memory-bottlenecked models. Note that this latency advantage is available only when palettized weights are loaded and are decompressed "just in time" of computation. Starting with
iOS17/macOS14, this is more likely to happen for models running primarily on the Neural Engine backend.
Palettized weight representations for Core ML
mlprogrammodels is available from
Updated 4 months ago