TPU Architecture

The following is the architecture diagram of Sophon TPU.

../_images/tpu.svg

Sophon TPU is a multi-core architecture. Each core is called a Neural network Processing Unit (NPU). There is an independent local storage and many kinds of Execution Units (EU) in each NPU. All NPUs complete a operation in the form of Single Instruction Multiple Data (SIMD). The data needs to be copied from system memory (usually global memory) to NPU local memory by GDMA before it can be accessed by EUs. CDMA, GDMA and BDC can run parallel.

The copying of data between system memory and local memory should be reduced as much as possible, so as to effectively give full play to the performance of TPU.

For BM1684,

Memory Types

The TPU contains the following memory types

  • System Memory
    • Global Memory: Off-chip memory (DDR).

    • L2-SRAM: On-chip memory,it can be used as intermediate cache.

    • DTCM: Cache space of ARM9.

  • Local Memory: On-chip memory,it is mainly used to store the data for BDC.

GDMA can handle data copying between the above four kinds of memories, ARM9 can also access all of them. It should be noted that ARM9 is a 32-bit processor, the addresses of local memory, L2-SRAM and DTCM can be accessed directly by using 32-bit addresses, among them, DTCM is the fastest, but the address and of global memory are beyond the 32-bit representation range, so it can only be accessed indirectly after mapping (This function is rarely used).

For BM1684,

  • The size of global memory is 12GB.

  • The size of local memory is 32MB.

  • The size of L2-SRAM is 4MB, obtained by calling okk_l2_sram_size() on device.

  • The size of DTCM is 512KB, obtained by calling okk_dtcm_size() on device.

Heterogeneous Programming

OKKernel programming includes the following two parts

  • Host: In SoC mode, it refers to the ARM A53 processor on the chip, and in PCIe mode, it refers to the host used by user.

  • Device: SoC and PCIe are both the on-chip microprocessors (ARM9).

User compiles the developed codes to form new firmware, and then updates the firmware of device by the new one.

../_images/heterogeneous_prepare.svg

The above preparation job only needs to be done once, because kernel functions have been registered on device with the firmware update.

To the device where kernel functions have been registered, user can launch kernel from host.

../_images/heterogeneous_run.svg

Device

On device, it is used to parse the command sent from host and call registered atomic functions. For atomic functions can be used on device, see Atomic Functions. For more programming details, see Programming on Device.

The code on device can only be written in C language and compiled by ARM 32-bit compiler. The compiling result are two binary files, device_tcm.bin and device_ddr.bin (that can be loaded to device by calling okkernel_load_firmware() on host).

Due to the limited space of ITCM which is fast, some low-frequency codes need to be placed in the DDR by naming file with prefix ok_device_.

Host

On host, the steps to be implemented include loading binary files, sending commands to device and executing them synchronously or asynchronously and waiting for execution to complete. The APIs are as follows.

bm_status_t okkernel_load_firmware(bm_handle_t handle, const char *firmware_tcm, const char *firmware_ddr)

Load or reload firmware to device.

Parameters
  • handle – Handle of the device.

  • firmware_tcm – Path to the firmware file to load to ITCM.

  • firmware_ddr – Path to the firmware file to load to DDR.

Returns

Status of launching kernel function, BM_SUCCESS means succeeded, otherwise, some errors caught.

bm_status_t okkernel_launch_sync(bm_handle_t handle, const char *func_name, const void *args, unsigned int size)

Launch the kernel function on device synchronously.

Parameters
  • handle – Handle of the device.

  • func_name – Name of the kernel function to launch on device.

  • args – Pointer to the user-discript data package.

  • size – Size of the user-discript data package in bytes.

Returns

Status of launching kernel function, BM_SUCCESS means succeeded, otherwise, some errors caught.

bm_status_t okkernel_launch_async(bm_handle_t handle, const char *func_name, const void *args, unsigned int size)

Launch the kernel function on device asynchronously.

Parameters
  • handle – Handle of the device.

  • func_name – Name of the kernel function to launch on device.

  • args – Pointer to the user-discript data package.

  • size – Size of the user-discript data package in bytes.

Returns

Status of launching kernel function, BM_SUCCESS means succeeded, otherwise, some errors caught.

bm_status_t okkernel_sync(bm_handle_t handle)

Synchronize the device.

Parameters

handle – Handle of the device.

Returns

Status of launching kernel function, BM_SUCCESS means succeeded, otherwise, some errors caught.

For more programming details, see Programming on Host.