Quantization

Introduction

Quantization is a technique that represents the weights and activations with lower-precision data types, such as 8-bit integer (int8), instead of the usual 32-bit floating point (float32). It brings several benefits:

  • Less memory storage
  • Less computational cost
  • Operations like matrix multiplication run much faster
  • Makes it possible to run models on embedded devices, which sometimes only support integer data types

Theory

Quantization goes from a high-precision representation (usually the regular 32-bit floating point) for weights and activations to a lower-precision data type. Besides the storage data type, the accumulation data type also matters: it specifies the type of the result of accumulating (adding, multiplying, etc.) values of the data type in question.

For example, let’s consider two int8 values A = 127 and B = 127, and define C as their sum. The result, 254, is much bigger than the biggest value representable in int8, which is 127. Hence the need for a larger-precision accumulation data type, to avoid a huge precision loss that would make the whole quantization process useless.
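
As a minimal illustration with NumPy (chosen just to show the arithmetic; the same issue arises in any framework):

```python
import numpy as np

A = np.array([127], dtype=np.int8)
B = np.array([127], dtype=np.int8)

# Accumulating in int8 overflows: 254 does not fit, so the value wraps around.
print(A + B)                      # [-2]

# Accumulating in a wider data type (here int32) keeps the exact result.
print(A.astype(np.int32) + B)     # [254]
```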

Quantization

The two most common quantization cases are float32 → float16 and float32 → int8.

Quantization to float16

Quantization to float16 is straightforward since both data types follow the same representation scheme. The questions to ask yourself are:

  • Does my operation have a float16 implementation?
  • Does my hardware support float16? For instance, Intel CPUs have supported float16 as a storage type, but computation is done after converting to float32.
  • Is my operation sensitive to lower precision? Very small values can underflow to zero in float16, and very big values can overflow to infinity (see the small sketch after this list).
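
A quick way to see the precision question in practice (a small NumPy sketch; the thresholds mentioned in the comments are properties of the IEEE float16 format):

```python
import numpy as np

# float16 has a much narrower range than float32: the smallest positive normal
# value is about 6.1e-5 and the largest finite value is 65504.
small = np.float32(1e-12)   # e.g. a tiny epsilon-like constant
big = np.float32(1e5)

print(small.astype(np.float16))   # 0.0 -> underflows to zero
print(big.astype(np.float16))     # inf -> overflows to infinity
```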

Quantization to int8

Only 256 values can be represented in int8, while float32 can represent a very wide range of values. The idea is to find the best way to project our range [a, b] of float32 values to the int8 space.

Let’s consider a float x in [a, b]. Then we can write the following quantization scheme, also called the affine quantization scheme:

x = S * (x_q - Z)

where:

  • x_q is the quantized int8 value associated with x
  • S and Z are the quantization parameters:
    • S is the scale, and is a positive float32
    • Z is called the zero-point: it is the int8 value corresponding to the value 0 in the float32 realm. Being able to represent the value 0 exactly is important because it is used everywhere throughout machine learning models.
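
For example, with illustrative values S = 0.05 and Z = -10, the quantized value x_q = 30 maps back to the float value x = 0.05 * (30 - (-10)) = 2.0.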

The quantized value x_q of x in [a, b] can be computed as follows:

x_q = round(x / S + Z)

Float32 values outside of the [a, b] range are clipped to the closest representable value, so for any floating-point number x:

x_q = clip(round(x / S + Z), round(a / S + Z), round(b / S + Z))
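
Putting the pieces together, here is a minimal NumPy sketch of the affine scheme. The concrete range [a, b] = [0, 2.55] and the mapping onto the full signed int8 range [-128, 127] are illustrative choices, not the only possible convention:

```python
import numpy as np

QMIN, QMAX = -128, 127  # int8 range used in this illustration

def affine_params(a, b):
    """Pick S and Z so that the float range [a, b] maps onto [QMIN, QMAX]."""
    S = np.float32((b - a) / (QMAX - QMIN))  # scale: a positive float32
    Z = int(round(QMIN - a / S))             # zero-point: int8 value for float 0
    return S, Z

def quantize(x, S, Z):
    # x_q = clip(round(x / S + Z), ...); here we clip directly to the int8 limits,
    # which is where [a, b] is mapped by this choice of S and Z.
    return np.clip(np.round(x / S + Z), QMIN, QMAX).astype(np.int8)

def dequantize(x_q, S, Z):
    # x ≈ S * (x_q - Z)
    return S * (x_q.astype(np.float32) - Z)

a, b = 0.0, 2.55
S, Z = affine_params(a, b)   # S ≈ 0.01, Z = -128

x = np.array([0.0, 0.5, 1.0, 2.0, 2.55, 3.0], dtype=np.float32)  # 3.0 lies outside [a, b]
x_q = quantize(x, S, Z)
print(x_q)                     # [-128  -78  -28   72  127  127]; 3.0 is clipped
print(dequantize(x_q, S, Z))   # ≈ [0.  0.5  1.  2.  2.55  2.55], up to rounding
```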