6. Quantization
The theory of quantization is based on: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Paper link: https://arxiv.org/abs/1712.05877
This chapter introduces the quantization design of TPU-MLIR, focusing on the application of the paper in practical quantization.
6.1. Basic Concepts
INT8 quantization is divided into symmetric and asymmetric quantization. Symmetric quantization is a special case of asymmetric quantization; usually the former gives better performance, while the latter gives better accuracy.
6.1.1. Asymmetric Quantization
As shown in the figure (Asymmetric quantization), asymmetric quantization maps values in the range [min, max] to the fixed-point interval [-128, 127] or [0, 255].
The quantization formula from INT8 to float is \(r = S \times (q - Z)\),
where r is the real value of type float and q is the quantized value of type INT8 or UINT8.
S denotes the scale, which is a float; Z is the zero point, which is of type INT8.
When quantizing to INT8, qmax = 127 and qmin = -128; for UINT8, qmax = 255 and qmin = 0.
The quantization formula from float to INT8 is \(q = \mathrm{clamp}\big(\mathrm{round}(r / S) + Z,\ q_{min},\ q_{max}\big)\).
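As an illustration, a minimal NumPy sketch of these two formulas (the function names are only for illustration, not TPU-MLIR APIs):

```python
import numpy as np

def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    # float -> INT8: q = clamp(round(r / S) + Z, qmin, qmax)
    q = np.round(np.asarray(r, dtype=np.float32) / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # INT8 -> float: r = S * (q - Z)
    return scale * (np.asarray(q, dtype=np.float32) - zero_point)
```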
6.1.2. Symmetric Quantization
Symmetric quantization is a special case of asymmetric quantization with Z = 0. The formula is \(r = S \times q\) and \(q = \mathrm{round}(r / S)\).
The range of Tensor is [-threshold, threshold].
For activation, usually \(S = threshold / 128\).
For weight, usually \(S = threshold / 127\).
In the case of UINT8, the Tensor range is [0, threshold], in which case \(S = threshold / 255.0\).
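As a small worked example (illustrative numbers only): if an activation Tensor has threshold = 4.0, then \(S = 4.0 / 128 = 0.03125\); the real value \(r = 1.0\) quantizes to \(q = \mathrm{round}(1.0 / 0.03125) = 32\), and dequantizing gives back \(32 \times 0.03125 = 1.0\).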
6.2. Scale Conversion
The formula in the paper is \(M = 2^{-n} M_0\), where \(M_0\) is a fixed-point multiplier and \(n\) is a right-shift amount.
In other words, the floating-point Scale can be converted to a Multiplier and an rshift: \(Scale \approx \frac{Multiplier}{2^{rshift}}\).
For example, see the sketch below.
The more bits the Multiplier supports, the closer it approximates Scale, but the worse the performance. Therefore, chips generally use a 32-bit or an 8-bit Multiplier.
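A minimal Python sketch of one possible conversion (the function name and rounding details are assumptions for illustration, not the exact TPU-MLIR implementation):

```python
import math

def scale_to_multiplier_rshift(scale, bits=32):
    # Approximate scale as multiplier / 2**rshift with a `bits`-bit multiplier.
    mantissa, exponent = math.frexp(scale)      # scale = mantissa * 2**exponent, mantissa in [0.5, 1)
    multiplier = int(round(mantissa * (1 << (bits - 1))))
    rshift = (bits - 1) - exponent
    if multiplier == (1 << (bits - 1)):         # rounding overflowed one bit: renormalize
        multiplier >>= 1
        rshift -= 1
    return multiplier, rshift

# scale_to_multiplier_rshift(0.00830078, bits=8)  -> (68, 13); 68 / 2**13 == 0.00830078
# scale_to_multiplier_rshift(0.00830078, bits=32) -> a much finer 32-bit approximation
```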
6.3. Quantization Derivation
Using the quantization formulas, we can derive the INT8 computation corresponding to each OP.
Both symmetric and asymmetric quantization are used for activations; for weights, generally only symmetric quantization is used.
6.3.1. Convolution
The expression for Convolution can be abbreviated as \(Y = X_{(n,ic,ih,iw)}\times W_{(oc,ic,kh,kw)} + B_{(1,oc,1,1)}\).
Substituting it into the INT8 quantization formula, the derivation is as follows:
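In sketch form, substituting \(r = S(q - Z)\) for each tensor and assuming symmetric weight quantization (\(Z_w = 0\)) with the bias quantized as \(S_b = S_x \times S_w\), \(Z_b = 0\) (as in the paper):

\[
\begin{aligned}
S_y(q_y - Z_y) &= S_x(q_x - Z_x) \times S_w q_w + S_x S_w q_b \\
q_y &= \frac{S_x \times S_w}{S_y}\left[(q_x - Z_x)\, q_w + q_b\right] + Z_y \\
&\approx \left(\mathrm{Multiplier} \times \left[(q_x - Z_x)\, q_w + q_b\right]\right) \gg \mathrm{rshift} + Z_y
\end{aligned}
\]

with the sums over the \(ic \times kh \times kw\) reduction dimension implied, and \(\frac{S_x S_w}{S_y}\) converted to Multiplier and rshift as in (Scale Conversion).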
In particular, for asymmetric quantization, Pad is filled with Zx.
In the symmetric case, Pad is filled with 0 (both Zx and Zy are 0).
In PerAxis (or PerChannel) quantization, each OC of the Filter is quantized separately; the derivation formula remains unchanged, but there will be OC Multipliers and rshifts.
6.3.2. InnerProduct
The expression and derivation are the same as for (Convolution).
6.3.3. Add
The expression for addition is: \(Y = A + B\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
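In sketch form, substituting \(r = S(q - Z)\) for each tensor:

\[
\begin{aligned}
S_y(q_y - Z_y) &= S_a(q_a - Z_a) + S_b(q_b - Z_b) \\
q_y &= \frac{S_a}{S_y}(q_a - Z_a) + \frac{S_b}{S_y}(q_b - Z_b) + Z_y
\end{aligned}
\]

where \(\frac{S_a}{S_y}\) and \(\frac{S_b}{S_y}\) are each converted to a Multiplier and rshift.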
How Add is implemented on the TPU depends on the specific TPU instructions.
The symmetric method here uses INT16 as the intermediate buffer.
The asymmetric method first dequantizes to float, does the addition, and then requantizes to INT8.
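A minimal NumPy sketch of the symmetric idea (assuming \(S_a/S_y\) and \(S_b/S_y\) have already been converted to integer multipliers ma, mb with a common rshift; the real instruction sequence is chip-specific):

```python
import numpy as np

def add_int8_symmetric(qa, qb, ma, mb, rshift):
    # Multiply each INT8 input by its integer multiplier and accumulate in a
    # wider intermediate (int32 here to keep the sketch overflow-safe; the TPU
    # uses INT16), then shift back down and saturate to INT8.
    acc = qa.astype(np.int32) * ma + qb.astype(np.int32) * mb
    round_half = (1 << (rshift - 1)) if rshift > 0 else 0
    out = (acc + round_half) >> rshift          # rounding right shift
    return np.clip(out, -128, 127).astype(np.int8)
```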
6.3.4. AvgPool
The expression for average pooling can be abbreviated as \(Y_i = \frac{\sum_{j=1}^{k}{X_j}}{k},\ k = kh \times kw\).
Substituting it into the INT8 quantization formula, the derivation is as follows:
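In sketch form, substituting \(r = S(q - Z)\):

\[
\begin{aligned}
S_y(q_{y_i} - Z_y) &= \frac{1}{k}\sum_{j=1}^{k} S_x(q_{x_j} - Z_x) \\
q_{y_i} &= \frac{S_x}{S_y \times k}\sum_{j=1}^{k} q_{x_j} - \frac{S_x}{S_y} Z_x + Z_y
\end{aligned}
\]

where \(\frac{S_x}{S_y \times k}\) is converted to a Multiplier and rshift; in the symmetric case \(Z_x = Z_y = 0\) and only the scaled sum remains.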
6.3.5. LeakyReLU
The expression of LeakyReLU can be abbreviated as: \(Y = \begin{cases} X, & if \ X \geq 0\\ \alpha X, & if \ X < 0 \end{cases}\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
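In sketch form, substituting \(r = S(q - Z)\):

\[
q_y = \begin{cases} \frac{S_x}{S_y}(q_x - Z_x) + Z_y, & if \ X \geq 0 \\ \alpha \frac{S_x}{S_y}(q_x - Z_x) + Z_y, & if \ X < 0 \end{cases}
\]

where \(\alpha \times \frac{S_x}{S_y}\) on the negative branch is converted to a Multiplier and rshift.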
In INT8 symmetric quantization: \(S_y=\frac{threshold_y}{128}, S_x=\frac{threshold_x}{128}\). In INT8 asymmetric quantization: \(S_y = \frac{max_y - min_y}{255}, S_x = \frac{max_x - min_x}{255}\). After BackwardCalibration, \(max_y = max_x, min_y = min_x, threshold_y = threshold_x\), so Sx/Sy = 1.
In the symmetric case, both Zx and Zy are 0.
6.3.6. Pad
The expression of Pad can be abbreviated as: \(Y = \begin{cases} X, \ origin\ location \\ value, \ padded\ location \end{cases}\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
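In sketch form, substituting \(r = S(q - Z)\):

\[
q_y = \begin{cases} \frac{S_x}{S_y}(q_x - Z_x) + Z_y, & origin \ location \\ \mathrm{round}\left(\frac{value}{S_y}\right) + Z_y, & padded \ location \end{cases}
\]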
After BackwardCalibration, \(max_y = max_x, min_y = min_x, threshold_y = threshold_x\), so Sx/Sy = 1.
In the symmetric case, both Zx and Zy are 0, so the padded value is round(value/Sy). With asymmetric quantization, the padded value is round(value/Sy + Zy).
6.3.7. PReLU
The expression of PReLU can be abbreviated as: \(Y_i = \begin{cases} X_i, if \ X_i \geq 0\\ \alpha_i X_i, if \ X_i < 0 \end{cases}\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
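In sketch form, substituting \(r = S(q - Z)\):

\[
q_{y_i} = \begin{cases} \frac{S_x}{S_y}(q_{x_i} - Z_x) + Z_y, & if \ X_i \geq 0 \\ \alpha_i \frac{S_x}{S_y}(q_{x_i} - Z_x) + Z_y, & if \ X_i < 0 \end{cases}
\]

where each per-channel \(\alpha_i \times \frac{S_x}{S_y}\) is quantized into its own Multiplier, sharing one rshift.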
After BackwardCalibration, \(max_y = max_x, min_y = min_x, threshold_y = threshold_x\), so Sx/Sy = 1.
There are oc Multipliers and one rshift. With symmetric quantization, Zx and Zy are both 0.