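Aerial System Scripts
System Configuration Validation Script
The release package includes a script that checks and displays the key system configuration settings that matter when running the Aerial cuBB SDK.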
$ pip3 install psutil
$ cd $cuBB_SDK/cuPHY/util/cuBB_system_checks
$ sudo -E python3 ./cuBB_system_checks.py
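The output of cuBB_system_checks.py may differ slightly between bare-metal and container environments. The script helps retrieve software component versions and the hardware configuration. Refer to the release manifest in the cuBB Release Notes to confirm that the correct software component versions are installed. The following is sample output from a bare-metal platform: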
# To get the system or ptp info, the command has to run on the host.
$ sudo -E python3 ./cuBB_system_checks.py --sys
-----General--------------------------------------
Hostname : smc-gh-01
IP address : 192.168.1.100
Linux distro : "Ubuntu 22.04.3 LTS"
Linux kernel version : 6.5.0-1019-nvidia
-----System---------------------------------------
Manufacturer : Supermicro
Product Name : ARS-111GL-NHR
Base Board Manufacturer : Supermicro
Base Board Product Name : G1SMH-G
Chassis Manufacturer : Supermicro
Chassis Type : Other
Chassis Height : 1 U
Processor : Grace A02
Max Speed : Unknown
Current Speed : 3402 MHz
$ sudo -E python3 ./cuBB_system_checks.py
-----General--------------------------------------
Hostname : smc-gh-01
IP address : 192.168.1.100
Linux distro : "Ubuntu 22.04.3 LTS"
Linux kernel version : 6.5.0-1019-nvidia
-----Kernel Command Line--------------------------
Audit subsystem : audit=0
Clock source : N/A
HugePage count : hugepages=32
HugePage size : hugepagesz=512M
CPU idle time management : idle=poll
Max Intel C-state : N/A
Intel IOMMU : N/A
IOMMU : N/A
Isolated CPUs : N/A
Corrected errors : N/A
Adaptive-tick CPUs : nohz_full=4-47
Soft-lockup detector disable : nosoftlockup
Max processor C-state : processor.max_cstate=0
RCU callback polling : rcu_nocb_poll
No-RCU-callback CPUs : rcu_nocbs=4-47
TSC stability checks : tsc=reliable
-----CPU------------------------------------------
CPU cores : 72
Thread(s) per CPU core : 1
CPU MHz: : N/A
CPU sockets : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS : N/A
cuBB_SDK : N/A
-----Memory---------------------------------------
HugePage count : 32
Free HugePages : 31
HugePage size : 524288 kB
Shared memory size : 240G
-----Nvidia GPUs----------------------------------
GPU driver version : 555.42.02
CUDA version : 12.5
GPU0
GPU product name : NVIDIA GH200 480GB
GPU persistence mode : Enabled
Current GPU temperature : 36 C
GPU clock frequency : 1980 MHz
Max GPU clock frequency : 1980 MHz
GPU PCIe bus id : 00000009:01:00.0
-----GPUDirect topology---------------------------
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS 0-71 0 1
NIC0 SYS X PIX SYS SYS
NIC1 SYS PIX X SYS SYS
NIC2 SYS SYS SYS X PIX
NIC3 SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
-----Mellanox NICs--------------------------------
NIC0
NIC product name : BlueField3
NIC part number : 900-9D3B6-00CV-A_Ax
NIC PCIe bus id : /dev/mst/mt41692_pciconf1
NIC FW version : 32.41.1000
FLEX_PARSER_PROFILE_ENABLE : 4
PROG_PARSE_GRAPH : True(1)
ACCURATE_TX_SCHEDULER : True(1)
CQE_COMPRESSION : AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE : True(1)
NIC1
NIC product name : BlueField3
NIC part number : 900-9D3B6-00CV-A_Ax
NIC PCIe bus id : /dev/mst/mt41692_pciconf0
NIC FW version : 32.41.1000
FLEX_PARSER_PROFILE_ENABLE : 4
PROG_PARSE_GRAPH : True(1)
ACCURATE_TX_SCHEDULER : True(1)
CQE_COMPRESSION : AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE : True(1)
-----Mellanox NIC Interfaces----------------------
Interface0
Name : aerial00
Network adapter : mlx5_0
PCIe bus id : 0000:01:00.0
Ethernet address : 94:6d:ae:c7:62:00
Operstate : up
MTU : 1514
RX flow control : off
TX flow control : off
PTP hardware clock : 0
QoS Priority trust state : pcp
PCIe MRRS : 4096 bytes
Interface1
Name : aerial01
Network adapter : mlx5_1
PCIe bus id : 0000:01:00.1
Ethernet address : 94:6d:ae:c7:62:01
Operstate : up
MTU : 1500
RX flow control : off
TX flow control : off
PTP hardware clock : 1
QoS Priority trust state : pcp
PCIe MRRS : 512 bytes
Interface2
Name : aerial02
Network adapter : mlx5_2
PCIe bus id : 0002:01:00.0
Ethernet address : 94:6d:ae:c7:6b:80
Operstate : down
MTU : 1500
RX flow control : on
TX flow control : on
PTP hardware clock : 2
QoS Priority trust state : pcp
PCIe MRRS : 512 bytes
Interface3
Name : aerial03
Network adapter : mlx5_3
PCIe bus id : 0002:01:00.1
Ethernet address : 94:6d:ae:c7:6b:81
Operstate : down
MTU : 1500
RX flow control : on
TX flow control : on
PTP hardware clock : 3
QoS Priority trust state : pcp
PCIe MRRS : 512 bytes
-----Linux PTP------------------------------------
● ptp4l.service - Precision Time Protocol (PTP) service
Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-06-05 21:42:18 UTC; 6h ago
Docs: man:ptp4l
Process: 4267 ExecStartPre=ethtool --set-priv-flags aerial01 tx_port_ts on (code=exited, status=0/SUCCESS)
Process: 4386 ExecStartPre=ethtool -A aerial01 rx off tx off (code=exited, status=0/SUCCESS)
Main PID: 4508 (ptp4l)
Tasks: 1 (limit: 146899)
Memory: 8.2M
CPU: 17.936s
CGroup: /system.slice/ptp4l.service
└─4508 /usr/sbin/ptp4l -f /etc/ptp.conf
Jun 06 03:45:21 smc-gh-01 ptp4l[4508]: [21807.308] rms 2 max 5 freq -1855 +/- 11 delay -96 +/- 0
Jun 06 03:45:22 smc-gh-01 ptp4l[4508]: [21808.308] rms 3 max 6 freq -1848 +/- 10 delay -96 +/- 0
Jun 06 03:45:23 smc-gh-01 ptp4l[4508]: [21809.308] rms 2 max 4 freq -1851 +/- 9 delay -96 +/- 1
Jun 06 03:45:24 smc-gh-01 ptp4l[4508]: [21810.308] rms 2 max 4 freq -1851 +/- 8 delay -97 +/- 1
Jun 06 03:45:25 smc-gh-01 ptp4l[4508]: [21811.308] rms 3 max 6 freq -1864 +/- 13 delay -96 +/- 0
Jun 06 03:45:26 smc-gh-01 ptp4l[4508]: [21812.308] rms 2 max 5 freq -1860 +/- 10 delay -96 +/- 0
Jun 06 03:45:27 smc-gh-01 ptp4l[4508]: [21813.308] rms 2 max 5 freq -1852 +/- 10 delay -97 +/- 0
Jun 06 03:45:28 smc-gh-01 ptp4l[4508]: [21814.308] rms 3 max 5 freq -1858 +/- 12 delay -96 +/- 1
Jun 06 03:45:29 smc-gh-01 ptp4l[4508]: [21815.308] rms 3 max 5 freq -1849 +/- 10 delay -97 +/- 0
Jun 06 03:45:30 smc-gh-01 ptp4l[4508]: [21816.308] rms 3 max 5 freq -1850 +/- 13 delay -97 +/- 0
● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
Loaded: loaded (/lib/systemd/system/phc2sys.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-06-05 21:42:20 UTC; 6h ago
Docs: man:phc2sys
Process: 4529 ExecStartPre=sleep 2 (code=exited, status=0/SUCCESS)
Main PID: 4873 (sh)
Tasks: 2 (limit: 146899)
Memory: 2.1M
CPU: 1min 14.399s
CGroup: /system.slice/phc2sys.service
├─4873 /bin/sh -c "taskset -c 47 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial00 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
└─4878 /usr/sbin/phc2sys -s /dev/ptp0 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
Jun 06 03:45:20 smc-gh-01 phc2sys[4878]: [21806.453] CLOCK_REALTIME rms 8 max 20 freq +8730 +/- 44 delay 512 +/- 0
Jun 06 03:45:21 smc-gh-01 phc2sys[4878]: [21807.469] CLOCK_REALTIME rms 8 max 20 freq +8758 +/- 36 delay 512 +/- 0
Jun 06 03:45:22 smc-gh-01 phc2sys[4878]: [21808.486] CLOCK_REALTIME rms 7 max 19 freq +8740 +/- 44 delay 512 +/- 3
Jun 06 03:45:23 smc-gh-01 phc2sys[4878]: [21809.502] CLOCK_REALTIME rms 7 max 18 freq +8749 +/- 35 delay 512 +/- 0
Jun 06 03:45:24 smc-gh-01 phc2sys[4878]: [21810.519] CLOCK_REALTIME rms 7 max 16 freq +8744 +/- 35 delay 512 +/- 0
Jun 06 03:45:25 smc-gh-01 phc2sys[4878]: [21811.535] CLOCK_REALTIME rms 8 max 21 freq +8722 +/- 55 delay 512 +/- 0
Jun 06 03:45:26 smc-gh-01 phc2sys[4878]: [21812.552] CLOCK_REALTIME rms 9 max 23 freq +8750 +/- 61 delay 512 +/- 2
Jun 06 03:45:28 smc-gh-01 phc2sys[4878]: [21813.570] CLOCK_REALTIME rms 8 max 20 freq +8749 +/- 49 delay 512 +/- 2
Jun 06 03:45:29 smc-gh-01 phc2sys[4878]: [21814.589] CLOCK_REALTIME rms 6 max 18 freq +8735 +/- 29 delay 512 +/- 2
Jun 06 03:45:30 smc-gh-01 phc2sys[4878]: [21815.608] CLOCK_REALTIME rms 7 max 18 freq +8762 +/- 40 delay 512 +/- 3
-----Software Packages----------------------------
cmake : N/A
docker /usr/bin : 26.1.3
gcc /usr/bin : 11.4.0
git-lfs /usr/bin : 3.0.2
MOFED : N/A
meson : N/A
ninja : N/A
ptp4l /usr/sbin : 3.1.1-3
-----Loaded Kernel Modules------------------------
GDRCopy : gdrdrv
GPUDirect RDMA : N/A
Nvidia : nvidia
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 0
VM zone reclaim mode : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------
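The report above is plain text made of section banners followed by "key : value" lines, so it can be post-processed when a setup needs to be checked automatically. The following is a minimal sketch, not part of the SDK: the function names parse_report and hugepages_consistent are hypothetical, and the parser assumes the layout shown in the sample output. Repeated sub-blocks within one section (for example NIC0 and NIC1) share key names, so in this flat sketch the last one wins.

```python
import re

# A banner looks like "-----Memory---------------------------------------".
SECTION_RE = re.compile(r"^-{5}(?P<name>[^-].*?)-+$")


def parse_report(text):
    """Group 'key : value' lines under the most recent section banner."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        banner = SECTION_RE.match(line)
        if banner:
            current = sections.setdefault(banner.group("name"), {})
            continue
        # Skip anything before the first banner and lines without a separator
        # (topology rows, legends, systemd status text, log lines).
        if current is None or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        current[key.strip()] = value.strip()
    return sections


def hugepages_consistent(sections):
    """Compare the hugepages= kernel argument with the allocated page count."""
    requested = sections.get("Kernel Command Line", {}).get("HugePage count", "")
    allocated = sections.get("Memory", {}).get("HugePage count", "")
    # "hugepages=32".partition("=")[2] == "32"; "N/A" yields an empty string.
    return allocated != "" and requested.partition("=")[2] == allocated


# Excerpt copied from the sample output above.
SAMPLE = """\
-----General--------------------------------------
Hostname : smc-gh-01
-----Kernel Command Line--------------------------
HugePage count : hugepages=32
HugePage size : hugepagesz=512M
-----Memory---------------------------------------
HugePage count : 32
Free HugePages : 31
"""

report = parse_report(SAMPLE)
print(report["General"]["Hostname"])                    # smc-gh-01
print(report["Kernel Command Line"]["HugePage size"])   # hugepagesz=512M
print(hugepages_consistent(report))                     # True
```

In practice the text would come from the script's captured standard output rather than from an embedded string.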
Checking the NIC Status
To query the Mellanox NIC firmware settings that were initialized with the script above, use the following command:
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\
\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\
\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\
\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
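The query result above can also be compared against a reference set of values. The following is a minimal sketch under stated assumptions: the EXPECTED table is simply copied from the sample output shown here and is not an authoritative requirements list (check the cuBB Release Notes for your release), and the helper names parse_mlxconfig and find_mismatches are hypothetical.

```python
import re

# Reference values copied from the sample mlxconfig output above.
# Assumption: these are illustrative, not an official requirements list.
EXPECTED = {
    "FLEX_PARSER_PROFILE_ENABLE": "4",
    "PROG_PARSE_GRAPH": "True(1)",
    "ACCURATE_TX_SCHEDULER": "True(1)",
    "CQE_COMPRESSION": "AGGRESSIVE(1)",
    "REAL_TIME_CLOCK_ENABLE": "True(1)",
    "LINK_TYPE_P1": "ETH(2)",
    "LINK_TYPE_P2": "ETH(2)",
}


def parse_mlxconfig(text):
    """Collect 'NAME VALUE' pairs from the text of an mlxconfig query."""
    settings = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only two-column lines whose first column looks like a
        # parameter name (upper-case letters, digits, underscores).
        if len(fields) == 2 and re.fullmatch(r"[A-Z0-9_]+", fields[0]):
            settings[fields[0]] = fields[1]
    return settings


def find_mismatches(settings, expected=EXPECTED):
    """Return {name: (expected, actual)} for every value that differs."""
    return {
        name: (want, settings.get(name))
        for name, want in expected.items()
        if settings.get(name) != want
    }


# Excerpt copied from the sample output above.
SAMPLE = """\
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
"""

print(find_mismatches(parse_mlxconfig(SAMPLE)))  # {}
```

An empty dictionary means every listed parameter matches the reference table; a missing parameter is reported with an actual value of None.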
To check the current status of a NIC port, use this command:
$ sudo mlxlink -d /dev/mst/mt41692_pciconf0
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00003ff2 (200G_2X,200G_4X,100G_1X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x000017f2 (200G_4X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 32.41.1000
amBER Version : 3.2
MFT Version : mft 4.28.0-92
Alternatively, you can use the system configuration validation script to obtain the complete list of configuration settings.
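When the port state needs to be checked from a script, the mlxlink output shown earlier can be reduced to a simple up/down answer. The following is a minimal sketch that assumes the "Field : Value" layout of that sample output; the helper names parse_mlxlink and link_is_up are hypothetical and are not part of the mlxlink tool.

```python
def parse_mlxlink(text):
    """Collect 'Field : Value' pairs from mlxlink text output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # ignore section titles and '-----' rules
            info[key.strip()] = value.strip()
    return info


def link_is_up(info):
    """True when the port reports State Active and Physical state LinkUp."""
    return info.get("State") == "Active" and info.get("Physical state") == "LinkUp"


# Excerpt copied from the sample mlxlink output earlier in this section.
SAMPLE = """\
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)
"""

info = parse_mlxlink(SAMPLE)
print(link_is_up(info))   # True
print(info["Speed"])      # 200G
```

The two fields tested here are the ones that read Active and LinkUp in the sample; other fields such as Speed or FEC can be inspected through the same dictionary.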