6.2. C/C++ Programming Details

In this chapter, the YOLOV5 detection algorithm from sophon-demo is used as an example to explain the interface calls and the precautions for each step.

Note

Sample code path: sophon-demo/sample/YOLOV5

Because the SDK supports multiple interface styles, a single concise example cannot cover them all. This example program is therefore developed with the combination of OpenCV decoding + BMCV image preprocessing, a combination that balances efficiency and simplicity.

We introduce the algorithm in the order of execution:

  1. Load the bmodel

  2. Preprocessing

  3. Inference

  4. Post-processing

  5. Matters needing attention

6.2.1. Load bmodel

...

BMNNContext(BMNNHandlePtr handle, const char* bmodel_file) : m_handlePtr(handle) {

    bm_handle_t hdev = m_handlePtr->handle();

    // init bmruntime context
    m_bmrt = bmrt_create(hdev);
    if (NULL == m_bmrt) {
        std::cout << "bmrt_create() failed!" << std::endl;
        exit(-1);
    }

    // load bmodel from file
    if (!bmrt_load_bmodel(m_bmrt, bmodel_file)) {
        std::cout << "load bmodel(" << bmodel_file << ") failed" << std::endl;
    }

    load_network_names();

}

...

void load_network_names() {

    const char **names;
    int num;

    // get network info
    num = bmrt_get_network_number(m_bmrt);
    bmrt_get_network_names(m_bmrt, &names);

    for (int i = 0; i < num; ++i) {
        m_network_names.push_back(names[i]);
    }

    free(names);
}

...

BMNNNetwork(void *bmrt, const std::string& name) : m_bmrt(bmrt) {
    m_handle = static_cast<bm_handle_t>(bmrt_get_bm_handle(bmrt));

    // get model info by model name
    m_netinfo = bmrt_get_network_info(bmrt, name.c_str());

    m_max_batch = -1;
    std::vector<int> batches;
    for (int i = 0; i < m_netinfo->stage_num; i++) {
        batches.push_back(m_netinfo->stages[i].input_shapes[0].dims[0]);
        if (m_max_batch < batches.back()) {
            m_max_batch = batches.back();
        }
    }
    m_batches.insert(batches.begin(), batches.end());
    m_inputTensors = new bm_tensor_t[m_netinfo->input_num];
    m_outputTensors = new bm_tensor_t[m_netinfo->output_num];
    for (int i = 0; i < m_netinfo->input_num; ++i) {

        // get data type
        m_inputTensors[i].dtype = m_netinfo->input_dtypes[i];
        m_inputTensors[i].shape = m_netinfo->stages[0].input_shapes[i];
        m_inputTensors[i].st_mode = BM_STORE_1N;
        m_inputTensors[i].device_mem = bm_mem_null();
    }

...

}

...

The usage of these functions is relatively simple and fixed; users can refer to the BMRUNTIME Development Reference Manual for more details. The one point worth emphasizing is the model name string: at runtime, a model is identified solely by its name, which is fixed when the bmodel is compiled, and the algorithm program must be developed around this name. For example, when invoking the inference interface, the model name is passed as an input parameter and used by the runtime as the index to look up the corresponding model; a wrong name will cause inference to fail.
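As an illustration, here is a minimal sketch of failing early on a wrong name; the name "yolov5s" is a hypothetical placeholder, and the real name is the one baked into your bmodel. bmrt_get_network_info() returns NULL when the requested name does not exist in the loaded bmodel:

// minimal sketch: look up the network by name and fail early if it is missing
// "yolov5s" is a placeholder; use the name compiled into your bmodel
const bm_net_info_t* netinfo = bmrt_get_network_info(m_bmrt, "yolov5s");
if (NULL == netinfo) {
    std::cout << "network yolov5s not found in bmodel!" << std::endl;
    exit(-1);
}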

6.2.2. Preprocessing

6.2.2.1. Preprocessing initialization

During pre-processing initialization, appropriate bm_image objects need to be created in advance to store intermediate results. This avoids the overhead of repeatedly allocating and releasing memory and improves the efficiency of the algorithm. The specific code is as follows:

...

int aligned_net_w = FFALIGN(m_net_w, 64);
int strides[3] = {aligned_net_w, aligned_net_w, aligned_net_w};
for (int i = 0; i < max_batch; i++) {

    // init bm images for storing resized results
    auto ret = bm_image_create(m_bmContext->handle(), m_net_h, m_net_w,
        FORMAT_RGB_PLANAR,
        DATA_TYPE_EXT_1N_BYTE,
        &m_resized_imgs[i], strides);
    assert(BM_SUCCESS == ret);
}
bm_image_alloc_contiguous_mem(max_batch, m_resized_imgs.data());

// bm images for storing inference inputs
bm_image_data_format_ext img_dtype = DATA_TYPE_EXT_FLOAT32;   // FP32

if (tensor->get_dtype() == BM_INT8) {   // INT8
    img_dtype = DATA_TYPE_EXT_1N_BYTE_SIGNED;
}

auto ret = bm_image_create_batch(m_bmContext->handle(), m_net_h, m_net_w,
    FORMAT_RGB_PLANAR,
    img_dtype,
    m_converto_imgs.data(), max_batch);
assert(BM_SUCCESS == ret);

...

Unlike the bm_image_create() function, which creates a single bm_image object, bm_image_create_batch() creates a set of bm_image objects whose number is given by the last parameter, batch, and the data fields of this set of objects are physically contiguous. Physically contiguous memory is a special requirement of the hardware accelerator. In the destructor, bm_image_destroy_batch() can be used to free the memory.
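A minimal sketch of the matching cleanup in the destructor, assuming it simply mirrors the allocations above (member names follow the demo code):

// images allocated with bm_image_alloc_contiguous_mem(): free the contiguous
// block first, then destroy each bm_image wrapper
bm_image_free_contiguous_mem(max_batch, m_resized_imgs.data());
for (int i = 0; i < max_batch; i++) {
    bm_image_destroy(m_resized_imgs[i]);
}

// images created with bm_image_create_batch(): released in a single call
bm_image_destroy_batch(m_converto_imgs.data(), max_batch);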

This example algorithm supports both pictures and videos as inputs. In the main() function of main.cpp, we take video as an example. The details are as follows:

6.2.2.2. Open the video stream

...

// open stream
cv::VideoCapture cap(input_url, cv::CAP_ANY, dev_id);
if (!cap.isOpened()) {
    std::cout << "open stream " << input_url << " failed!" << std::endl;
    exit(1);
}

// get resolution
int w = int(cap.get(cv::CAP_PROP_FRAME_WIDTH));
int h = int(cap.get(cv::CAP_PROP_FRAME_HEIGHT));
std::cout << "resolution of input stream: " << h << "," << w << std::endl;

...

The above code is almost identical to the standard OpenCV workflow for opening a video; the Sophon-specific addition is the dev_id parameter, which specifies the device used for hardware decoding.

6.2.2.3. Decode video frames

...

// get one mat
cv::Mat img;
if (!cap.read(img)) { // check
    std::cout << "Read frame failed or end of file!" << std::endl;
    exit(1);
}

std::vector<cv::Mat> images;
images.push_back(img);

...

6.2.2.4. Converting Mat to bm_image

Since both the BMCV preprocessing interfaces and network inference take bm_image objects as input, the decoded video frame needs to be converted to a bm_image object. After inference is complete, it is released with the bm_image_destroy() interface. It is important to note that no memory copy occurs during this conversion.

...

// mat -> bm_image
CV_Assert(0 == cv::bmcv::toBMI((cv::Mat&)images[i], &image1, true));

...

// destroy
bm_image_destroy(image1);

...

6.2.2.5. Preprocessing

The bmcv_image_vpp_convert_padding() function uses the VPP hardware resource and is the key to accelerating preprocessing; it requires the padding_attr parameter. The bmcv_image_convert_to() function is used for linear transformations and requires the converto_attr parameter.

...

// set padding_attr
bmcv_padding_atrr_t padding_attr;
memset(&padding_attr, 0, sizeof(padding_attr));
padding_attr.dst_crop_sty = 0;
padding_attr.dst_crop_stx = 0;
padding_attr.padding_b = 114;
padding_attr.padding_g = 114;
padding_attr.padding_r = 114;
padding_attr.if_memset = 1;
if (isAlignWidth) {
    padding_attr.dst_crop_h = images[i].rows * ratio;
    padding_attr.dst_crop_w = m_net_w;

    int ty1 = (int)((m_net_h - padding_attr.dst_crop_h) / 2);
    padding_attr.dst_crop_sty = ty1;
    padding_attr.dst_crop_stx = 0;
} else {
    padding_attr.dst_crop_h = m_net_h;
    padding_attr.dst_crop_w = images[i].cols * ratio;

    int tx1 = (int)((m_net_w - padding_attr.dst_crop_w) / 2);
    padding_attr.dst_crop_sty = 0;
    padding_attr.dst_crop_stx = tx1;
}

// do not crop
bmcv_rect_t crop_rect{0, 0, image1.width, image1.height};

auto ret = bmcv_image_vpp_convert_padding(m_bmContext->handle(), 1, image_aligned, &m_resized_imgs[i],
    &padding_attr, &crop_rect);

...

// set converto_attr
float input_scale = input_tensor->get_scale();
input_scale = input_scale * (float)1.0 / 255;
bmcv_convert_to_attr converto_attr;
converto_attr.alpha_0 = input_scale;
converto_attr.beta_0 = 0;
converto_attr.alpha_1 = input_scale;
converto_attr.beta_1 = 0;
converto_attr.alpha_2 = input_scale;
converto_attr.beta_2 = 0;

// do converto
ret = bmcv_image_convert_to(m_bmContext->handle(), image_n, converto_attr, m_resized_imgs.data(), m_converto_imgs.data());

// attach to tensor
if (image_n != max_batch) image_n = m_bmNetwork->get_nearest_batch(image_n);
bm_device_mem_t input_dev_mem;
bm_image_get_contiguous_device_mem(image_n, m_converto_imgs.data(), &input_dev_mem);
input_tensor->set_device_mem(&input_dev_mem);
input_tensor->set_shape_by_dim(0, image_n);  // set real batch number

...
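The get_nearest_batch() call above selects a batch size that the loaded bmodel actually supports for the current number of images. A plausible sketch of such a helper, assuming m_batches is the std::set<int> filled from the stage shapes in the BMNNNetwork constructor shown earlier:

// hypothetical sketch: return the smallest supported batch size >= n
int get_nearest_batch(int n) {
    auto it = m_batches.lower_bound(n);  // first supported batch >= n
    assert(it != m_batches.end());       // n must not exceed m_max_batch
    return *it;
}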

6.2.3. Inference

The output of the preprocessing stage is the input of the inference stage; once the input data is ready, inference can be carried out.

...

ret = m_bmNetwork->forward();

...

6.2.4. Post-processing

The post-processing process varies from model to model and is mostly CPU-executed code, which is not covered in detail here. It should be noted that BMCV also provides some interfaces that can be used for acceleration, such as bmcv_sort, bmcv_nms, etc. For other cases that require hardware acceleration, TPUKernel can be used for custom development as needed.
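Before the CPU code can run, output data must first be copied from device memory to system memory. A minimal sketch, assuming direct access to the raw output bm_tensor_t (the demo wraps this step in its BMNNTensor helper class):

// copy one FP32 output tensor from device memory to host memory
bm_tensor_t* out = &m_outputTensors[0];
size_t bytes = bmrt_tensor_bytesize(out);
std::vector<float> host_out(bytes / sizeof(float));
bm_memcpy_d2s_partial(m_handle, host_out.data(), out->device_mem, bytes);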

This is a brief description of the YOLOV5 example; see the module documentation for a more detailed description of the interfaces involved.

6.2.5. Summary of considerations for algorithm development

Based on what has been discussed above, we summarize some considerations as follows:

  • Attention should be paid to video decoding:

Note

YUV is supported as the format for caching original decoded frames; it can be enabled through the cap.set() interface after decoding, as in the hedged sketch below. It is not covered in this example; for details, see the Decoding module section in this chapter.
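The property name CAP_PROP_OUTPUT_YUV below is an assumption for illustration only; verify the exact constant against the Decoding module documentation:

// hypothetical: ask the capture to keep decoded frames in YUV format
cap.set(cv::CAP_PROP_OUTPUT_YUV, 1.0);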

  • Attention should be paid to the preprocessing process:

Note

  1. Preprocessing operates on bm_image objects, which are analogous to OpenCV Mat objects.

  2. The scale step in preprocessing is for INT8 models: the scale coefficient, generated during quantization, is multiplied into the data before it is fed to inference.

  3. To allocate physically contiguous memory for multiple bm_image objects, use bm_image_create_batch().

  4. Resize uses the bilinear interpolation algorithm by default. For details, see the BMCV interface description.

  • Attention should be paid to the inference process:

Note

  1. Inference takes place in device memory, so input data must be stored in the input tensors' device memory before inference, and result data is likewise stored in the output tensors' device memory after inference.

  2. A batch size of 4 is recommended for performance optimization.