5.1.1. Interpretation of terms
Term |
Interpretation |
BM1684 |
The third-generation tensor processor launched by SOPHGO for the field of deep learning. |
BM1684X |
The fourth-generation tensor processor launched by SOPHGO for the field of deep learning. |
TPU |
On-chip neural network processing unit. |
SOC mode |
A working mode, the SDK runs on A53 AARCH64 platform, and TPU is used as the platform bus device. |
PCIe mode |
A working mode, SDK runs on the X86 platform, BM1684 and BM1684X are at deep learning computing accelerator cards in PCIe interface. |
Drivers |
Drivers are the channels through which the API interface accesses the hardware. |
Gmem |
DDR memory on card for TPU acceleration. |
F |
FAULT, fault status. |
N/A |
This parameter is not supported. |
5.1.2. Introduction of bm-smi
The bm-smi tool displays device status information in the form of interface or text, such as device temperature, fan speed; it can also enable, disable or set some device functions, such as led and ecc.
Bm-smi functions mainly include:
Viewing device parameters and runtime status.
Viewing device working mode(PCIe/SOC).
Viewing physical board ID.
Viewing the device chip ID, and the PCIe bus ID where it is located.
Viewing device temperature and power consumption.
Viewing whether the device ECC is enabled and the number of corrections.
Viewing the total number and usage of gmem.
Viewing the usage of tpu.
Viewing the work frequency information of the device.
Viewing the size of gmem occupied by each process during running.
Viewing device fan status.
Modifying board parameters.
Disable or enable ECC.
Turn on or turn off led indicator light on the board.
Performing the recovery operation on the fault device.
The following table lists the device information that bm-smi can obtain and the support information in PCIe and SOC modes:
Device Information |
PCIe Mode |
SOC Mode |
Displaying time and date |
Supported |
Supported |
SDK Version |
Supported |
Supported |
Driver Version |
Supported |
Supported |
Physical board id number |
Supported |
Supported |
Tpu device number |
Supported |
Supported |
Board name |
Supported |
Supported |
Board status |
Supported |
Supported |
Board temperature |
Supported |
Not supported |
Chip temperature |
Supported |
Not supported |
Board power consumption |
Supported |
Not supported |
Module power consumption |
Supported |
Not supported |
Module voltage |
Supported |
Not supported |
Supply current |
Supported |
Not supported |
Whether DDR ECC is enabled |
Supported |
Not supported |
DDR enabled, number of correction times |
Supported |
Not supported |
Board serial number |
Supported |
Not supported |
domain:b:d.f in PCIe mode |
Supported |
Not supported |
PCIe or SOC mode |
Supported |
Supported |
Minimum work frequency |
Supported |
Supported |
Maximum work frequency |
Supported |
Supported |
Current work frequency |
Supported |
Supported |
Maximum power consumption of board |
Supported |
Not supported |
Working current of module |
Supported |
Not supported |
Byte numbers of gmem totals and used |
Supported |
Supported |
Instantaneous usage of tpu |
Supported |
Supported |
Fan speed |
Supported |
Not supported |
Byte numbers of gmem occupied by each process (or thread) |
Supported |
Not supported |
Text mode |
Supported |
Not supported |
Parameters |
Supported |
Only file, lms and loop are supported |
Figure 1 shows the display status of SC5+ (three-core)/SC5H/SC5P (eight-core), each card is separated by =======, the board attributes are displayed on the left, and the state of a single chip is displayed on the right and middle.
bm-smi is an executable file that does not depend on other dynamic libraries, and it is located under /opt/sophon/libsophon-current/bin directory. The above figure is a schematic diagram about the execution of bm-smi.
5.1.3. Meanings of parameters
The meaning of each part is introduced one by one below:
Fri Aug 7 14:18:57 2020: time and date when bm-smi is executed. This is just an example, and the actual execution may be different from what is shown here.
SDK Version: 2.3.2: version number of sdk. This is just an example, and the actual execution may be different from what is shown here.
Driver Version: 2.3.2: version number of the driver. This is just an example, and the actual execution may be different from what is shown here.
Card: physical board id number.
Name: board name.
Mode: PCIe or SOC mode.
SN: board serial number (total of 17 bits).
TPU: device number of tpu.
BoardT: board temperature.
chipT: chip temperature.
TPU_P: power consumption of TPU module.
TPU_V: voltage of TPU module.
ECC: whether DDR ECC is enabled.
CorrectNum: the number of correction times if DDR is enabled.
Tpu-Util: instantaneous usage of tpu.
12V_ATX: 12V board supply current.
MaxP: maximum power consumption of board.
boardP: board power consumption.
Minclk: minimum work frequency of tpu.
Maxclk: maximum work frequency of tpu.
Fan: fan speed, N/A means the card has no fan, and F means there is a failure in the fan.
Bus-ID: domain:b:d.f in PCIe mode.
Status: board status. Active means active status; and Fault means fault status.
Curclk: current work frequency of tpu. The color of the displayed value varies according to the current work frequency. 550M (bm1684) or 1000M (bm1684x) is displayed in white, 75M in red, and other frequencies in yellow; red and yellow are used to indicate to the user that the current work frequency is not the maximum work frequency. Displaying different colors are only available in version 2.1.0 and above.
TPU_C: work current of tpu module.
Memory-Usage: byte numbers of gmem totals and used. The 106M indicates the memory size of the VPU firmware by default. The memory on the board may be distributed in different address spaces. All the memory we allocate is continuous address, and because of the different size of each allocation, it will lead to fragmentation of the memory, so the usage may not reach 100%.
The figure below shows the byte numbers of gmem occupied by each process (or thread) of each device:
Notices:
Because our board can be used by multiple users for multiple tasks at the same time, so theoretically, there can be unlimited processes creating unlimited handles to apply for global memory. You can use the up and down arrow keys and the page up and down keys to view the information of gmem occupied by all processes, save them as files through tag, which also contains all process information.
For the gmem information occupied by the process, each line displays the gmem corresponding to a handle created by the process. If the process creates multiple handles, the gmem information occupied by each handle is displayed on a separate line.
5.1.4. Specific operation methods and parameters
The parameters supported by bm-smi include:
dev (which dev is selected to query, 0xff is for all.) type: int32, default: 255
Used to select the parameters of which device to query or modify, and all devices are selected by default.
This function is not supported in SOC mode.
ecc (ECC on DDR is on or off)
type: string default: ""Used to configure Enable and Disable of DDR ECC, and the example is as follows:
bm-smi --dev=0x0 --ecc=on bm-smi --dev=0x0 --ecc=offWhen executing this command, do not allow any process to use this device. After setting, restart the host to make it valid.
When executing this command, please do not use it with other parameters, for example:
bm-smi --dev=0x0 --ecc=on --file=~/a.txt # the --file=~/a.txt may be ignored. This command can only execute ecc-related actions.If the dev parameter is not specified, the operation will be performed on all devices by default.
This function is not supported in SOC mode.
file (target file to save smi log)
type: string default: ""The device status can be redirected to a text with the following method:
bm-smi --dev=0x0 --file=./bm-smi.logThis function is supported in SOC mode.
led (pcie card LED status: on/off/blink)
type: string default: "on"Used to configure on and off of the board LED, and the examples are as below:
bm-smi --dev=0x0 --led=on bm-smi --dev=0x0 --led=offNote: This function support on/off/blink on SC5+ and SC5P, on/off on SC5H, does not support other board types. For the SC5+ board, only the first chip can control the status of LED. SC5P has 8 LEDs, each device corresponds to one LED, and each LED supports setting status separately.
This function is not supported in SOC mode.
lms (sample interval in loop mode.) type: int32 default: 500
Used to set the time interval for querying the device status when running bm-smi, the default is 500ms, and the minimum value of this parameter is 300ms. This function is supported in SOC mode.
loop (true is for loop mode, false is for only once mode.) type: bool, default: true
Used to set single mode or cycle mode when running bm-smi, and the cycle mode is set by default. In single mode, bm-smi exits after querying the device status once; in cycle mode, it queries the device status repeatedly according to lms. Examples are as follows:
bm-smi --loop bm-smi --noloopThis function is supported in SOC mode.
recovery, the use method is as follows: when x function of a certain device is found to be faulty, the user removes all services from this card, reaching to a state where no upper-layer services and applications use this card, and the following is executed:
bm-smi --dev=0x(0/1/2/3…..) --recoveryThe three-core card SC5+ and the eight-core card SC5P only support the recovery of the entire card. Recoverying any device on the card can led to the entire card recoveried, so you need to stop the tasks on the entire card during recovery.
Note: Do not perform this operation when the board card is working normally. Some servers do not support this function, and executing this function will cause the server to restart. Currently known servers that do not support this function are Dell R740, Dell R940, Inspur 5468 and Sugon X785-G30.
This function is not supported in SOC mode.
opmode and opval, operation method is: select the mode and mode value of bm-smi execution, compatible with the previous tags, for example:
bm-smi --opmode=display #bm-smi have the same effects. bm-smi --opmode=ecc --opval=on #bm-smi --ecc=on have the same effects. And so on for other tags.At present, opmode has four operation modes: display, ecc (enable), led (indicator), and recovery. Subsequent new functions will be used in this way. In order to take care of users which have the operating habits of the old version , the operation method of the old version is still supported in the new version. (Note: At present, only when the opmode is ecc or led, you need to use it with opval to assign values).
We added 2.5.0 display mode with the memory monitor display for heap and vpu, and the operation method is:
bm-smi --opmode=display_memory_detailMeanwhile, the usage display of vpu and jpu is also added, and the operation method is as follows:
bm-smi --opmode=display_util_detail
5.1.5. Introduction of text mode
The output of bm-smi is a simple graphical interface, which describes the status of the board. In order to meet the needs of some users for text information (easy to use scripts to parse some parameters), it supports text mode (SOC mode does not support text mode), and the operation method is as follows:
bm-smi --start_dev=0 --last_dev=2 --text_format
1684-SC5+ PCIE chip0: 0 000:01:00.0 Active 56C 55C 2W 615mV OFF N/A 0% 75M 550M 550M 3.3A 0MB 7086MB
1684-SC5+ PCIE chip1: 1 000:01:00.1 Active 56C 55C 2W 613mV OFF N/A 0% 75M 550M 550M 4.1A 0MB 7086MB
1684-SC5+ PCIE chip2: 2 000:01:00.2 Active 54C 53C 1W 615mV OFF N/A 0% 75M 550M 550M 2.6A 0MB 7086MB
A line of text information output by the above command divided into three areas:
First area:
1684-SC5+ PCIE chip0: 0 000:01:00.0 Active 56C 55C 2W 615mV OFF N/A 0% 75M 550M 550M 3.3A 0MB 7086MB
Status of the 0th chip on the three-core card, 1684-SC5+ PCIE chip0:
TPU Bus-ID Status boardT chipT TPU_P TPU_V ECC CorrectN Tpu-Util Minclk Maxclk Curclk TPU_C Memory-Usage
Second area:
1684-SC5+ PCIE chip1: 1 000:01:00.1 Active 56C 55C 2W 613mV OFF N/A 0% 75M 550M 550M 4.2A 0MB 7086MB
Status of the 1st chip on the three-chip card, 1684-SC5+ PCIE chip1:
The following information corresponds in sequence to parameters in bm-smi:
TPU Bus-ID Status boardT chipT TPU_P TPU_V ECC CorrectN Tpu-Util Minclk Maxclk Curclk TPU_C Memory-Usage
Third area:
1684-SC5+ PCIE chip2: 2 000:01:00.2 Active 54C 53C 1W 615mV OFF N/A 0% 75M 550M 550M 2.6A 0MB 7086MB
Status of the 2nd chip on the three-chip card, 1684-SC5+ PCIE chip2:
The following information corresponds in sequence to parameters in bm-smi:
TPU Bus-ID Status boardT chipT TPU_P TPU_V ECC CorrectN Tpu-Util Minclk Maxclk Curclk TPU_C Memory-Usage
Notes:
1. --start_dev=0 --last_dev=2 indicates the device numbers corresponding to the 0th chip and the last chip of a certain card displayed in bm-smi;
2. --start_dev --last_dev --text_format should be used together.
5.1.6. help information of bm-smi :
5.1.6.1. help information of bm-smi under PCIe mode
bm-smi --help
bm-smi: command line brew
usage: bm-smi [--ecc=on/off] [--file=/xx/yy.txt] [--dev=0/1...][--start_dev=x] [--last_dev=y] [--text_format] [--lms=500] [--recovery] [-loop] [--led=on/off/blink]
ecc:
set ecc status, default is off
file:
the target file to save smi log, default is empty.
dev:
which device to be selected to query, default is all.
start_dev:
the first device to be selected to query, must chip0 of one card, default is invalid.
last_dev:
the last device to be selected to query, default is invalid.
lms:
how many ms of the sample interval, default is 500.
loop:
if -loop (default): smi sample device every lms ms.
if -noloop: smi sample device only once.
recovery:
recovery dev from fault to active status.
text_format:
if true only display attr value from start_dev to last_dev.
led:
pcie card LED status: on/off/blink.
New usage: bm-smi [--opmode=display/ecc/led/recovery][--opval=on/off/...] [--file=/xx/yy.txt][--dev=0/1...] [--start_dev=x] [--last_dev=y] [--text_format][--lms=500] [-loop]
opmode(default null):
choose different mode,example:display, ecc, led, recovery
display: means open bm-smi window and check info, use like ./bm-smi
ecc: means enable or disable ecc, collocation opval=on/off
led: means modify led status, collocation opval=on/blink/off
recovery: means recovery dev from fault to active status.
opval(default null):
set mode value, use with opmode!
off: for led/ecc
on: for led/ecc
blink: for led
other flags have same usage, Both usage can be used!
bm-smi supports all parameters listed above in help in PCIe mode.
5.1.6.2. help information of bm-smi in SOC mode
bm-smi --help
bm-smi: command line brew
usage: bm-smi [--opmode=display] [--file=/xx/yy.txt] [--lms=500] [-loop]
opmode:
SOC mode only use display for bm-smi.
file:
the target file to save smi log, default is empty.
lms:
how many ms of the sample interval, default is 500.
loop:
if -loop (default): smi sample device every lms ms.
if -noloop: smi sample device only once.
SOC mode only supports opmode=display, file, lms and loop parameters, and other parameters are invalid.
5.1.7. bm-smi used in SOC mode
In PCIe mode, bm-smi supports all the above functions; and in SOC mode, bm-smi interface supports the functions shown in Fig.2. N/A indicates this function is not supported; the supported parameters are only opmode=display, file, lms and loop.
In SOC mode, bm-smi operation method: run bm-smi directly after logging in soc.
bm-smi or bm-smi --opmode=display