Introduction
Technique that respresents the weights and activations with low precision data types like 8-bit integer (int8
) instead of the usual 32-bit floating point (float32
).
- Less memory storage
- Les computational cost
- Operations like matrix multiplication much faster
- Allows to run models on embedded devices, which sometimes only support integer data types
Theory
Going from high-precision representation (usually the regular 32-bit floating point) for weights and activations to a lower precision data type. The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc.) values of the data type in question.
For example, let’s consider two int8
values A = 127, B = 127, and let’s define C as the sum of A and B. Here the result is much bigger than the biggest representable value in int8
, which is 127. Hence the need for a larger precision data type to avoid a huge precision loss that would make the whole quantization process useless.
Quantization
The two most common quantization cases are float32
→ float16
and float32
→ int8
.
Quantization to float16
Straightforward since both data types follow the same representation scheme. The questions to ask yourself are:
- Does my operation have a
float16
implementation? - Does my hardware support
float16
? For instance, Intel CPUs have been supportingfloat16
as a storage type, but computation is done after converting tofloat32
. - Is my operation sensitive to lower precision? The same applies for big values?
Quantization to int8
Only 256 values can be represented in int8
, while float32
can represent a very wide range of values. The idea is to find the best way to project our rage [a, b] of float32
values to the int8
space.
Let’s consider a float x in [a, b], then we can write the following quantization scheme, also called the affine quantization scheme:
where:
- is the quantized int8 value associated to x
- and are the quantization parameters:
- is the scale, and is a positive
float32
- is called the zero-point, it is the
int8
value corresponding to the value 0 in thefloat32
realm. This is important to be able to represent exactly the value 0 because it is used everywhere throughout machine learning models.
The quantized value of of in [a, b] can be computed as follows:
And float32
values of outside of the [a, b] range are clipped to the closest representable value, so for any floating-point number x: