4. Overall Design
4.1. Layered
TPU-MLIR divides the compilation of a network model into two layers.
- Top Dialect
The chip-independent layer, covering graph optimization, quantization, inference, etc.
- Tpu Dialect
The chip-dependent layer, covering weight reordering, operator slicing, address assignment, inference, etc.
The overall flow is shown in Fig. 4.1 (TPU-MLIR overall process), where the model is converted step by step into the final instructions by a sequence of passes. This section describes what each pass does in the Top layer and the Tpu layer; the following chapters explain the key points of each pass in detail. A minimal, illustrative sketch of the two-layer flow follows the figure.
![TPU-MLIR overall process](_images/flow.png)
Fig. 4.1 TPU-MLIR overall process
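To make the layering concrete, here is a minimal, hypothetical Python sketch of the step between the two layers. All names and data structures here are illustrative only; the actual TPU-MLIR implementation is written in C++ on top of MLIR dialects.

```python
# Hypothetical sketch (not the real TPU-MLIR API): Top passes run on
# chip-independent IR, then a lowering step rewrites each op into its
# Tpu counterpart, after which chip-dependent passes run.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                      # e.g. "top.Conv" or "tpu.Conv"
    attrs: dict = field(default_factory=dict)

def lower_to_tpu(ops, mode="F16"):
    """Stand-in for convert-top-to-tpu: swap the dialect prefix.
    An INT8 mode would additionally quantize (see Section 4.2)."""
    return [Op("tpu." + op.name[len("top."):], dict(op.attrs, mode=mode))
            for op in ops]

graph = [Op("top.Conv"), Op("top.Relu")]
print([op.name for op in lower_to_tpu(graph)])  # ['tpu.Conv', 'tpu.Relu']
```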
4.2. Top Pass
- shape-infer
Perform shape inference and constant folding.
- canonicalize
Graph optimizations tied to specific ops, such as merging Relu into Conv, shape merging, etc.
- extra-optimize
Apply extra patterns, such as computing FLOPs, removing unused outputs, etc.
- chip-assign
Assign the target chip (e.g., bm1684x, cv183x) and adjust the Top MLIR accordingly; for example, all cv18xx input types are set to F32.
- import-calibration-table
Import the calibration table and assign min/max values to all ops for later quantization.
- chip-top-optimize
Apply chip-specific optimizations to Top ops.
- convert-top-to-tpu
Lower Top ops to Tpu ops. For F32/F16/BF16 modes, a Top op is normally converted directly to the corresponding Tpu op; for INT8, quantization is required (a simplified sketch follows this list).
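The following is a simplified Python sketch of how the calibrated min/max values drive INT8 lowering, assuming symmetric per-tensor quantization with threshold = max(|min|, |max|) and scale = threshold / 127. This illustrates the general technique only; the actual INT8 lowering in TPU-MLIR is more involved.

```python
import numpy as np

def symmetric_scale(t_min: float, t_max: float) -> float:
    """Derive a symmetric per-tensor scale from calibrated min/max."""
    threshold = max(abs(t_min), abs(t_max))
    return threshold / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Map float values to INT8: q = clamp(round(x / scale), -128, 127)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: min/max as they might appear in one calibration table entry.
x = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
scale = symmetric_scale(t_min=-0.8, t_max=1.2)
print(dequantize(quantize_int8(x, scale), scale))  # close to x
```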
4.3. Tpu Pass
- canonicalize
Graph optimizations tied to specific ops, such as merging consecutive Requant ops.
- strip-io-quant
If enabled, input and output types are kept quantized; otherwise they are F32.
- chip-tpu-optimize
Apply chip-specific optimizations to Tpu ops.
- weight-reorder
Reorder the weights of individual ops according to chip characteristics, e.g., the filter and bias of convolution.
- subnet-divide
Split the network into subnets according to whether ops run on the TPU or the CPU; if all operators run on the TPU, there is only one subnet.
- op-reorder
Reorder ops so that each op is placed close to its users.
- layer-group
Slice the network so that as many ops as possible are computed consecutively in local memory.
- address-assign
Assign global-memory addresses to the ops that need them (a simplified sketch follows this list).
- codegen
Use the Builder module to generate the final model in FlatBuffers format.
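As an illustration of address assignment, here is a minimal sketch assuming a greedy bump allocator with fixed alignment. A real allocator would also consider tensor lifetimes so that non-overlapping tensors can reuse addresses; the helper names below are hypothetical.

```python
# Minimal, hypothetical sketch of global-memory address assignment.
def align_up(x: int, alignment: int) -> int:
    return (x + alignment - 1) // alignment * alignment

def assign_addresses(tensor_sizes: dict, base: int = 0, alignment: int = 64) -> dict:
    """Greedy bump allocation: place each tensor right after the previous one,
    rounding each offset up to the alignment boundary."""
    addrs, offset = {}, base
    for name, size in tensor_sizes.items():
        addrs[name] = offset
        offset = align_up(offset + size, alignment)
    return addrs

print(assign_addresses({"conv1_out": 100, "relu1_out": 100, "fc_out": 10}))
# {'conv1_out': 0, 'relu1_out': 128, 'fc_out': 256}
```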
4.4. Other Passes
Some optional passes, not shown in the diagram, provide special functions.
- fuse-preprocess
Fuse image preprocessing into the model (a simplified sketch follows this list).
- post-handle
Fuse postprocessing into the model; only SSD and YOLO are supported.
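To illustrate what fusing preprocessing means, here is a hedged Python sketch: typical image preprocessing normalizes raw uint8 pixels as (x - mean) * scale, and after fusion the deployed model consumes raw pixels and performs this arithmetic as its first op. The mean/scale values below are illustrative only; real values come from the model's preprocessing configuration.

```python
import numpy as np

# Illustrative per-channel values (ImageNet-style), not taken from TPU-MLIR.
MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
SCALE = np.array([0.0171, 0.0175, 0.0174], dtype=np.float32)

def fused_preprocess(raw: np.ndarray) -> np.ndarray:
    """What the model's first op computes once preprocessing is fused:
    it consumes raw HWC uint8 pixels instead of pre-normalized floats."""
    return (raw.astype(np.float32) - MEAN) * SCALE

raw_image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(fused_preprocess(raw_image).shape)  # (224, 224, 3)
```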