# Quantization Overview

Quantization refers to the process of reducing the number of bits that represent a number. This is achieved by casting the values from a `float` type to an `integer` type that uses fewer bits. Linear quantization, also known as affine quantization, achieves this by mapping the range of float values to a quantized range, such as the 8-bit integer range [-128, 127], and interpolating linearly. This is expressed by the following equations:

```
# process of dequantizing weights:
w_unquantized = scale * (w_quantized - zero_point)
# process of quantizing weights (clip to the quantized range, e.g. [-128, 127] for int8):
w_quantized = clip(round(w_unquantized / scale) + zero_point, q_min, q_max)
```

where `w_unquantized` and `scale` are of type float, and `w_quantized` and `zero_point` (also called quantization bias, or offset) are of type either `int8` or `uint8`.
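The two operations above can be sketched in NumPy as follows (a minimal illustration with hypothetical helper names, not the Core ML Tools API):

```python
import numpy as np

def quantize(w_unquantized, scale, zero_point, q_min=-128, q_max=127):
    """Affine-quantize a float array to int8 (illustrative sketch)."""
    w_q = np.round(w_unquantized / scale) + zero_point
    return np.clip(w_q, q_min, q_max).astype(np.int8)

def dequantize(w_quantized, scale, zero_point):
    """Recover approximate float values from the quantized weights."""
    return scale * (w_quantized.astype(np.float32) - zero_point)

w = np.array([-1.0, 0.0, 0.3, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0
w_q = quantize(w, scale, zero_point)
w_hat = dequantize(w_q, scale, zero_point)  # close to w, up to rounding error
```

Note that dequantization only recovers the original values up to rounding error, which is the source of the accuracy loss introduced by quantization.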

### Symmetric Quantization

When quantization is performed, constraining the `zero_point` to be zero is referred to as *symmetric* quantization. In this case, the quantization and dequantization operations are further simplified. This is the default mode used by Core ML Tools.
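In the symmetric case, a scale can be derived from the maximum absolute weight value, and dequantization reduces to a single multiply. A sketch (the helper name is hypothetical):

```python
import numpy as np

def symmetric_quantize(w, q_max=127):
    """Symmetric int8 quantization: zero_point is fixed at 0 (illustrative sketch)."""
    scale = np.abs(w).max() / q_max
    w_q = np.clip(np.round(w / scale), -q_max, q_max).astype(np.int8)
    return w_q, scale

w = np.array([-2.0, -0.4, 1.2, 2.0], dtype=np.float32)
w_q, scale = symmetric_quantize(w)
# with zero_point = 0, dequantization is just a multiply:
w_hat = scale * w_q.astype(np.float32)
```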

### Per-Channel Scales

Rather than computing a single float `scale` value for the whole tensor (a mode called `per-tensor`), it is typical to instead use a scale factor for each outer dimension (also referred to as the `output channel` dimension) of the weight tensor. This is commonly termed having `per-channel` scales. Using a vector of scales in this manner (the default for Core ML Tools) reduces the overall quantization error.
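Computing per-channel scales amounts to reducing over every axis except the output-channel axis. A sketch assuming the output channel is axis 0 (the helper name is hypothetical):

```python
import numpy as np

def per_channel_scales(w, q_max=127):
    """One symmetric scale per output channel (assumed to be axis 0) -- a sketch."""
    # reduce over every axis except the output-channel axis
    axes = tuple(range(1, w.ndim))
    return np.abs(w).max(axis=axes) / q_max

w = np.array([[0.1, -0.25], [10.0, -4.0]], dtype=np.float32)
scales = per_channel_scales(w)          # one scale per row
w_q = np.round(w / scales[:, None]).astype(np.int8)
```

Because each row gets its own scale, the small-magnitude row is not forced to share a scale with the large-magnitude row, which is why per-channel scales reduce quantization error.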

### Activation Quantization

Unlike pruning or palettization, which compress only weights, 8-bit quantization can also quantize the activations of the network, each with their own scale factors.

Activations are quantized using `per-tensor` mode. During training-time quantization, the values of intermediate activations are observed, and their max and min values are used to compute the quantization scales, which are then used during inference. Quantizing the intermediate tensors may help the inference of networks that are bottlenecked by memory bandwidth due to large activations.
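The min/max observation step described above can be sketched as an asymmetric scale and zero-point computation over a batch of observed activation values (the helper name and the sample values are made up for illustration):

```python
import numpy as np

def activation_qparams(observed, q_min=-128, q_max=127):
    """Derive a per-tensor scale and zero_point from observed min/max (sketch)."""
    a_min, a_max = float(observed.min()), float(observed.max())
    scale = (a_max - a_min) / (q_max - q_min)
    zero_point = int(round(q_min - a_min / scale))
    return scale, zero_point

# activation values observed during training for one intermediate tensor
acts = np.array([-1.0, 0.0, 2.0, 3.0], dtype=np.float32)
scale, zero_point = activation_qparams(acts)
```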

## Impact on Model Size

Since 8 bits are used to represent weights, quantization can reduce your model size by 50% compared to `float16` storage precision. In practice, the exact compression ratio will be slightly less than 2, since some extra memory needs to be allocated to store the `per-channel` scale values.
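A back-of-the-envelope calculation for a single weight matrix makes this concrete (the layer shape is illustrative, and one `float16` scale per output channel is assumed):

```python
# a hypothetical linear layer with a 4096 x 4096 weight matrix
out_ch, in_ch = 4096, 4096

fp16_bytes = out_ch * in_ch * 2        # 2 bytes per float16 weight
int8_bytes = out_ch * in_ch * 1        # 1 byte per int8 weight
scale_bytes = out_ch * 2               # one float16 scale per output channel

ratio = fp16_bytes / (int8_bytes + scale_bytes)
print(f"compression ratio: {ratio:.4f}")  # slightly below 2
```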

## Impact on Latency and Compute Unit Considerations

Since quantization reduces the size of each weight value, the amount of data to be moved is reduced during prediction. This can lead to benefits with memory-bottlenecked models.

This latency advantage is available only when quantized weights are loaded from memory and decompressed "just in time" for computation. Starting with `iOS17/macOS14`, this is more likely to happen for models running primarily on the Neural Engine backend.

Quantizing the activations may further ease this memory pressure and may lead to more gains when compared to weight-only quantization. However, with activation quantization, you may observe a considerable slowdown in inference for the compute units (CPU and sometimes GPU) that employ load-time weight decompression, since activations are not known at load time, and they need to be decompressed at runtime, slowing down the inference. Therefore it is recommended to use activation quantization only when your model is fully or mostly running on the Neural Engine.

## Feature Availability

8-bit quantized weight representations for Core ML `mlprogram` models are available from `iOS16/macOS13/watchOS9/tvOS16` onwards. Activation quantization is available from `iOS17/macOS14/watchOS10/tvOS17`.
