4. Overall Design
4.1. Layered
TPU-MLIR divides the compilation of a network model into two layers.
- Top Dialect
The chip-independent layer, covering graph optimization, quantization, inference, etc.
- Tpu Dialect
The chip-dependent layer, covering weight reordering, operator slicing, address assignment, inference, etc.
The overall flow is shown in Fig. 4.1 (TPU-MLIR overall process), where the model is converted step by step into the final instructions by a sequence of passes. This section describes what each pass does in the Top layer and the Tpu layer; the following chapters explain the key points of each pass in detail. A minimal, illustrative sketch of the two-layer flow follows the figure.
![TPU-MLIR overall process](_images/flow.png)
Fig. 4.1 TPU-MLIR overall process
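To make the layering concrete, here is a minimal, hypothetical Python sketch of the step between the two layers. All names and data structures here are illustrative only; the actual TPU-MLIR implementation is written in C++ on top of MLIR dialects.

```python
# Hypothetical sketch (not the real TPU-MLIR API): Top passes run on
# chip-independent IR, then a lowering step rewrites each op into its
# Tpu counterpart, after which chip-dependent passes run.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                      # e.g. "top.Conv" or "tpu.Conv"
    attrs: dict = field(default_factory=dict)

def lower_to_tpu(ops, mode="F16"):
    """Stand-in for convert-top-to-tpu: swap the dialect prefix.
    An INT8 mode would additionally quantize (see Section 4.2)."""
    return [Op("tpu." + op.name[len("top."):], dict(op.attrs, mode=mode))
            for op in ops]

graph = [Op("top.Conv"), Op("top.Relu")]
print([op.name for op in lower_to_tpu(graph)])  # ['tpu.Conv', 'tpu.Relu']
```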
4.2. Top Pass
- shape-infer
Perform shape inference and constant folding.
- canonicalize
Graph optimizations tied to specific ops, such as merging Relu into Conv, shape merging, etc.
- extra-optimize
Apply extra patterns, such as computing FLOPs, removing unused outputs, etc.
- chip-assign
Assign the target chip (e.g., bm1684x, cv183x) and adjust the Top MLIR accordingly; for example, all cv18xx input types are set to F32.
- import-calibration-table
Import the calibration table and assign min/max values to all ops for later quantization.
- chip-top-optimize
Apply chip-specific optimizations to Top ops.
- convert-top-to-tpu
Lower Top ops to Tpu ops. For F32/F16/BF16 modes, a Top op is normally converted directly to the corresponding Tpu op; for INT8, quantization is required (a simplified sketch follows this list).
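The following is a simplified Python sketch of how the calibrated min/max values drive INT8 lowering, assuming symmetric per-tensor quantization with threshold = max(|min|, |max|) and scale = threshold / 127. This illustrates the general technique only; the actual INT8 lowering in TPU-MLIR is more involved.

```python
import numpy as np

def symmetric_scale(t_min: float, t_max: float) -> float:
    """Derive a symmetric per-tensor scale from calibrated min/max."""
    threshold = max(abs(t_min), abs(t_max))
    return threshold / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Map float values to INT8: q = clamp(round(x / scale), -128, 127)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: min/max as they might appear in one calibration table entry.
x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
scale = symmetric_scale(t_min=-0.8, t_max=1.2)
print(dequantize(quantize_int8(x, scale), scale))  # close to x
```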
4.3. Tpu Pass
- canonicalize
Graph optimizations tied to specific ops, such as merging consecutive Requant ops.
- strip-io-quant
If enabled, input and output types are kept quantized; otherwise they are F32.
- chip-tpu-optimize
Apply chip-specific optimizations to Tpu ops.
- weight-reorder
Reorder the weights of individual ops according to chip characteristics, e.g., the filter and bias of convolution.
- subnet-divide
Split the network into subnets according to whether ops run on the TPU or the CPU; if all operators run on the TPU, there is only one subnet.
- op-reorder
Reorder ops so that each op is placed close to its users.
- layer-group
Slice the network so that as many ops as possible are computed consecutively in local memory.
- address-assign
Assign global-memory addresses to the ops that need them (a simplified sketch follows this list).
- codegen
Use the Builder module to generate the final model in FlatBuffers format.
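As an illustration of address assignment, here is a minimal sketch assuming a greedy bump allocator with fixed alignment. A real allocator would also consider tensor lifetimes so that non-overlapping tensors can reuse addresses; the helper names below are hypothetical.

```python
# Minimal, hypothetical sketch of global-memory address assignment.
def align_up(x: int, alignment: int) -> int:
    return (x + alignment - 1) // alignment * alignment

def assign_addresses(tensor_sizes: dict, base: int = 0, alignment: int = 64) -> dict:
    """Greedy bump allocation: place each tensor right after the previous one,
    rounding each offset up to the alignment boundary."""
    addrs, offset = {}, base
    for name, size in tensor_sizes.items():
        addrs[name] = offset
        offset = align_up(offset + size, alignment)
    return addrs

print(assign_addresses({"conv1_out": 100, "relu1_out": 100, "fc_out": 10}))
# {'conv1_out': 0, 'relu1_out': 128, 'fc_out': 256}
```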
4.4. Other Passes
Some optional passes, not shown in the diagram, provide special functions.
- fuse-preprocess
Fuse image preprocessing into the model (a simplified sketch follows this list).
- post-handle
Fuse postprocessing into the model; only SSD and YOLO are supported.
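To illustrate what fusing preprocessing means, here is a hedged Python sketch: typical image preprocessing normalizes raw uint8 pixels as (x - mean) * scale, and after fusion the deployed model consumes raw pixels and performs this arithmetic as its first op. The mean/scale values below are illustrative only; real values come from the model's preprocessing configuration.

```python
import numpy as np

# Illustrative per-channel values (ImageNet-style), not taken from TPU-MLIR.
MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
SCALE = np.array([0.0171, 0.0175, 0.0174], dtype=np.float32)

def fused_preprocess(raw: np.ndarray) -> np.ndarray:
    """What the model's first op computes once preprocessing is fused:
    it consumes raw HWC uint8 pixels instead of pre-normalized floats."""
    return (raw.astype(np.float32) - MEAN) * SCALE

raw_image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(fused_preprocess(raw_image).shape)  # (224, 224, 3)
```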