Programming on Device¶

Kernel Function¶

An acceptable kernel function is required to be in the following form

void my_kernel_func(const void *) {
    ...
}

having only one argument with type const void* and no returning. The pointer usually points to a user-descript structure containing inputs. The structure is the bridge between host and device. For example,

typedef struct {
    ...
} __attribute__((packed)) my_data_t;

void my_kernel_func(const void *args) {
    my_data_t *data = (my_data_t *)args;
    ...
}

In most cases, __attribute__((packed)) is necessary, because the structure is compiled by ARM9 32-bit compiler for device, and X86 or aarch64 64-bit compiler for host, if unpacked, the size and parsing way of the structure may differ between device and host. To avoid potential dangers, __attribute__((packed)) should not be ignored.

Register Kernel Function¶

A kernel function needs to be registered so that it can be used in runtime. For example,

void my_kernel_func(const void *) {
    ...
}

OKKERNEL_FUNC_REGISTER(my_kernel_func);

the kernel function is registered by macro OKKERNEL_FUNC_REGISTER. There is a map for storing paires of the kernel function and its name, and the name is the key-value that should be unique.

Hello World on Device¶

typedef struct {
    int year;
    int month;
    int day;
} __attribute__((packed)) date_t;

void hello_world(const void *args) {
    date_t *param = (date_t *)args;
    OKKERNEL_LOG("Hello World! Today is %d/%d/%d.\n", param->month, param->day, param->year);
}

OKKERNEL_FUNC_REGISTER(hello_world);

Example of Using GDMA and BDC¶

The following codes show how to use GDMA and BDC functions in kernel function.

typedef struct {
    unsigned long long output_addr;
    unsigned long long input_addr;
    int N, C, H, W;
} __attribute__((packed)) param_t;

void plus_one_0(const void *args) {
    param_t *param = (param_t *)args;
    dim4 shape = {.n = param->N, .c = param->C, .h = param->H, .w = param->W};
    dim4 stride;
    // The output and input are in the aligned layout.
    okk_128_byte_aligned_stride_for_32bit(&stride, 0, &shape);
    unsigned int tensor_size = stride.n * shape.n * sizeof(float);
    OKKERNEL_ASSERT(tensor_size * 2 <= okk_local_mem_size_per_npu());
    // Determine addresses of output and input.
    local_addr_t output_addr = 0;
    local_addr_t input_addr = tensor_size;
    // Initialize.
    okk_initialize();
    // Copy input from global to local.
    okk_gdma_32bit_cpy_S2L(input_addr, param->input_addr, &shape, NULL, NULL);
    // Calculate output = input + 1.
    okk_bdc_add_C(output_addr, input_addr, 1.f, &shape, NULL, NULL);
    // Copy output from local to global.
    okk_gdma_32bit_cpy_L2S(param->output_addr, output_addr, &shape, NULL, NULL);
    // Synchronize.
    okk_poll();
}

OKKERNEL_FUNC_REGISTER(plus_one_0);

This kernel function performs addition of 1.f and the elements of the input tensor with shape (N, C, H, W) and data type fp32, and the output tensor stores the result.

The detailed process is as follows

Initialize to use GDMA and BDC functions.
Copy the input tensor from global memory to local memory by GDMA.
Perform the addition by BDC.
Copy the the output tensor from local memory to global memory by GDMA.
Synchronize to make all GDMA and BDC functions done.

In the above codes, OKKERNEL_ASSERT(tensor_size * 2 <= okk_local_mem_size_per_npu()) is for checking if local memory is exceeded. Such security checks really help sometimes.

GDMA and BDC Run Parallel¶

Since GDMA and BDC functions can run parallel, making good use of this feature can improve performance.

Supposing N is even, the following codes show how to make use of running GDMA and BDC functions parallel.

void plus_one_1(const void *args) {
    param_t *param = (param_t *)args;
    dim4 shape = {.n = param->N / 2, .c = param->C, .h = param->H, .w = param->W};
    dim4 stride;
    // The output and input are in the aligned layout.
    okk_128_byte_aligned_stride_for_32bit(&stride, 0, &shape);
    unsigned int tensor_size_local = stride.n * shape.n * sizeof(float);
    unsigned int tensor_size_global = shape.n * shape.c * shape.h * shape.w * sizeof(float);
    OKKERNEL_ASSERT(tensor_size_local * 4 <= okk_local_mem_size_per_npu());
    // Determine addresses of output and input (ping-pong buffers).
    local_addr_t output_addr[2] = {0, tensor_size_local};
    local_addr_t input_addr[2] = {tensor_size_local * 2, tensor_size_local * 3};
    // Initialize.
    okk_initialize();
    ////////////////////////////////////////////////////////////////////
    // Step 0
    // Copy the first part of input from global to local.
    okk_gdma_32bit_cpy_S2L(input_addr[0], param->input_addr, &shape, NULL, NULL);
    ////////////////////////////////////////////////////////////////////
    // Step 1
    // Start parallel.
    okk_parallel_start();
    // Copy the second part of input from global to local.
    okk_gdma_32bit_cpy_S2L(input_addr[1], param->input_addr + tensor_size_global, &shape, NULL, NULL);
    // Calculate output = input + 1 for the first part.
    okk_bdc_add_C(output_addr[0], input_addr[0], 1.f, &shape, NULL, NULL);
    // End parallel.
    okk_parallel_end();
    ////////////////////////////////////////////////////////////////////
    // Step 2
    // Start parallel.
    okk_parallel_start();
    // Calculate output = input + 1 for the second part.
    okk_bdc_add_C(output_addr[1], input_addr[1], 1.f, &shape, NULL, NULL);
    // Copy the first part of output from local to global.
    okk_gdma_32bit_cpy_L2S(param->output_addr, output_addr[0], &shape, NULL, NULL);
    // End parallel.
    okk_parallel_end();
    ////////////////////////////////////////////////////////////////////
    // Step 3
    // Copy the second part of output from local to global.
    okk_gdma_32bit_cpy_L2S(param->output_addr + tensor_size_global, output_addr[1], &shape, NULL, NULL);
    // Synchronize.
    okk_poll();
}

OKKERNEL_FUNC_REGISTER(plus_one_1);

There is a pipeline in the process consisting of four steps as follows

Step 0:

Copy the first part of input from global memory to local memory by GDMA.

Step 1:

Start GDMA and BDC parallel mode.
Copy the second part of input from global memory to local memory by GDMA.
Perform the addition for the first part by BDC.
End GDMA and BDC parallel mode.

Step 2:

Start GDMA and BDC parallel mode.
Perform the addition for the second part by BDC.
Copy the first part of output from local memory to global memory by GDMA.
End GDMA and BDC parallel mode.

Step 3:

Copy the second part of output from local memory to global memory by GDMA.

GDMA and BDC functions run parallel in step 1 and step 2. Note that okk_parallel_end() in step 1 and okk_parallel_start() in step 2 can not be offset, because output_addr[0] is referred by both step 1 for BDC and step 2 for GDMA, and okk_parallel_end() in step 1 is to guarantee that BDC is completed in step 1 before GDMA begins in step 2.

When GDMA and BDC run parallel, if the cost of GDMA is balanced with BDC, one will be covered up by the other. At this time, the performance is brought into full play, but sometimes more local memory is cost to build up ping-pong buffers for the pipeline.

Divide Tensor Into Parts¶

If a tensor is too big to be stored in local memory, it should be divided into many parts. An appropriate number of parts is preferred, it can not only make full use of local memory, but also make the steps as few as possible.

The following codes show how to find such a number of parts and what the pipeline becomes.

#define DIV_UP(a, b) (((a) - 1) / (b) + 1)

void plus_one_2(const void *args) {
    param_t *param = (param_t *)args;
    dim4 shape_one_batch = {.n = 1, .c = param->C, .h = param->H, .w = param->W};
    dim4 stride_one_batch, stride;
    // Calculate number of working batches.
    okk_128_byte_aligned_stride_for_32bit(&stride_one_batch, 0, &shape_one_batch);
    unsigned int tensor_size_one_batch_local = stride_one_batch.n * sizeof(float);
    OKKERNEL_ASSERT(tensor_size_one_batch_local * 4 <= okk_local_mem_size_per_npu());
    int M = okk_local_mem_size_per_npu() / 4 / tensor_size_one_batch_local;
    dim4 shape = {.n = M, .c = param->C, .h = param->H, .w = param->W};
    // The output and input are in the aligned layout.
    okk_128_byte_aligned_stride_for_32bit(&stride, 0, &shape);
    unsigned int tensor_size_local = shape.n * stride.n * sizeof(float);
    unsigned int tensor_size_global = shape.n * shape.c * shape.h * shape.w * sizeof(float);
    // Determine addresses of output and input (ping-pong buffers).
    local_addr_t output_addr[2] = {0, tensor_size_local};
    local_addr_t input_addr[2] = {tensor_size_local * 2, tensor_size_local * 3};
    // Get the number of parts and shape of the last part.
    int S = DIV_UP(param->N, M);
    dim4 shape_last = {.n = param->N - (S - 1) * M, .c = param->C, .h = param->H, .w = param->W};
    // Initialize.
    okk_initialize();
    // Step 0 ~ Step S + 1
    for (int i = 0; i < S + 2; ++i) {
        // Start parallel.
        okk_parallel_start();
        // Copy part i of input from global to local.
        if (i < S)
            okk_gdma_32bit_cpy_S2L(input_addr[i % 2], param->input_addr + i * tensor_size_global, i == S - 1 ? &shape_last : &shape, NULL, NULL);
        // Calculate output = input + 1 for part i - 1.
        if (i > 0 && i < S + 1)
            okk_bdc_add_C(output_addr[(i - 1) % 2], input_addr[(i - 1) % 2], 1.f, i - 1 == S - 1 ? &shape_last : &shape, NULL, NULL);
        // Copy part i - 2 of output from local to global.
        if (i > 1)
            okk_gdma_32bit_cpy_L2S(param->output_addr + (i - 2) * tensor_size_global, output_addr[(i - 2) % 2], i - 2 == S - 1 ? &shape_last : &shape, NULL, NULL);
        // End parallel.
        okk_parallel_end();
    }
    // Synchronize.
    okk_poll();
}

OKKERNEL_FUNC_REGISTER(plus_one_2);

The number of batches is determined to be M in the above codes, and the tensor is divided into S = ceil(N / M) parts. The working shape is (M, C, H, W) in each step except the last one, since N may be not divisable by M, the working shape in the last step is (N - (S - 1) * M, C, H, W).

This pipeline is built up by S + 2 steps as follows (The number in the block is the part index).

In step K, the first job is copying the Kth part of input from global memory to local memory, the second job is performing addtion for the (K-1)th part, and the third job is copying the (K-2)th part of output from local memory to global memory. The first and the third are both of GDMA kind, so they run serially, but the second is of BDC kind, it will run parallel with the other two.

The tensor can be divided according to arbitary dimension and their combinations. A recommended order is N-dimension, H-dimension, W-dimension and C-dimension. If according to C-dimension, decreasing the number of channels per NPU is the key.

Reshape¶

If N is small while C, H and W are large, the above plus_one_0, plus_one_1 or plus_one_2 can not handle. This can be solved by reshaping the tensor as follows.

void plus_one_3(const void *args) {
    param_t *param = (param_t *)args;
    // Calculate the length of input.
    unsigned long long len = param->N * param->C * param->H * param->W;
    if (len < okk_npu_num())
        plus_one_0(args);
    else {
        unsigned long long L = len;
        param_t param_reshape = {.output_addr = param->output_addr, .input_addr = param->input_addr};
        // Reshape.
        param_reshape.C = okk_npu_num();
        L /= param_reshape.C;
        param_reshape.H = 1;
        param_reshape.W = L < 32 ? L : 32;
        L /= param_reshape.W;
        param_reshape.N = L;
        plus_one_2(&param_reshape);
        // Deal with the tail if it exists.
        L = param_reshape.N * param_reshape.C * param_reshape.H * param_reshape.W;
        if (L < len) {
            param_reshape.output_addr += (len - L) * sizeof(float);
            param_reshape.input_addr += (len - L) * sizeof(float);
            param_reshape.N = len - L;
            param_reshape.C = 1;
            param_reshape.H = 1;
            param_reshape.W = 1;
            plus_one_3(&param_reshape);
        }
    }
}

OKKERNEL_FUNC_REGISTER(plus_one_3);

There are two advantages to make C exactly equal to the number of NPUs, first, all NPUs will be used, and second, the number of channels per NPU is just 1, no waste of local memory.

There are two advantages to make H and W exactly equal to 1 and 32, first, the input and output tensor are in the 128-Byte Aligned Layout, the C stride is ceil(H * W / 32) * 32 = H * W, no waste of local memory, and second, 32 is exactly twice the number of EUs (for BM1684), it will make full use of the execution units.