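Getting Started with nvdebug#
The NVIDIA® NVDebug tool, nvdebug, can run on the server platform or on a remote client machine. The binary is available for x86_64 and arm64-SBSA systems and collects the following information:
- Out-of-band (OOB) BMC logs and information for troubleshooting server issues
- Logs from the host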
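Requirements#
Requirement | Client Host | Server Host |
---|---|---|
Linux-based operating system: Linux kernel 4.4 or later (4.15 or later recommended) | X | X |
GNU C Library glibc-2.7 or later | X | X |
Operating system: Ubuntu 18.04 or later (Ubuntu 22.04 recommended) | X | X |
Python 3.10 | X | X |
 | X | X |
 | X | X |
Server device under test (DUT) that is accessible through the BMC, using Redfish and IPMI-over-LAN from the client host | X | X |
 | X | |
BMC management and server host management networks are in the same subnet | X | |
NVSwitch tray hosts require NVOS version 2.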
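nvdebug Command-Line Interface#
The high-level syntax of the nvdebug command supports collecting debug logs over OOB.
You can run the tool in either of the following ways:
- From a remote machine that can access the BMC and the host.
- Directly on the host, if the host can access the BMC.
If the host IP is passed through the configuration file or on the command-line interface (CLI) with -I/--hostip, nvdebug assumes that the tool is running on a remote machine. Otherwise, nvdebug assumes that the tool is running on the host and collects the host logs locally.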
语法#
$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>
Mandatory options:
-i/--ip is the BMC IP address.
-u/--user is BMC username with administrative privileges.
-p/--password is BMC administrative user password.
-t/--platform is the platform type of the DUT, and it accepts DGX, HGX-HMC, arm64, x86_64, and NVSwitch.
Additional credentials:
-r/--sshuser is BMC SSH username.
-w/--sshpass is BMC SSH password.
-R/--rfuser is BMC Redfish username.
-W/--rfpass is BMC Redfish password.
Host options:
-I/--hostip is the Host IP Address.
If the IP address is not provided, the tool assumes it is running on the host machine.
-U/--hostuser is the Host username with administrative privileges.
-H/--hostpass is the Host password.
Additional options:
-b/--baseboard <baseboard> is the baseboard type, such as Hopper-HGX-8-GPU and Blackwell-HGX-8-GPU.
-C/--config <file path> is the path to the config file. The default is ./config.yaml.
-d/--dutconfig <dut config path> is the path to the DUT specific config file.
The default path is ./dut_config.yaml.
-c/--common collects the common logs using the included common.json file.
-v/--verbose displays the detailed output and error messages.
-o/--outdir <output dir> is the output directory where the output is generated.
The default location is /tmp.
-P/--port <fw_port> is the port number that will be used for forwarding.
The --port variable applies only to HGX-Baseboard based platforms,
and the default value is 18888.
--local enables Local Execution mode.
-z/--skipzip skips zipping individual DUT folders.
Log collection options:
-S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
-g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
that is supported on the current platform.
Only one collector group can be specified.
-j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
and tools as defined by the user.
The -S and -g options cannot be used together.
Utility options:
-h/--help and --version are standalone options, and -l/--list requires the platform
type to be specified using -t/--platform.
--parse <log dump> parses an nvdebug log dump and decodes the binary data.
-h/--help provides information about tool usage.
--version displays the current version of the tool.
-l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists log collectors that are supported by platform
with their collector IDs (CID). If a type is passed, it will only list log collectors
of that type. The -l/--list option requires the target platform type to be specified with -t/--platform.
By default, if option -c is not included, the nvdebug tool will collect logs based on the common.json
and platform_xyz.json files. At the end of the run, the tool will generate the output log xyz.zip
file in the directory specified by the -o option. If no directory is provided, the log
will be generated in the /tmp directory.
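To illustrate how these options combine, the following Python sketch assembles an nvdebug argument list and rejects the one combination that the usage text forbids (-S together with -g). The helper name build_nvdebug_args and its parameter names are hypothetical and not part of the tool; only the flags themselves are taken from the usage text above.

```python
from typing import List, Optional, Sequence

# Log groups accepted by -g/--loggroup, as listed in the usage text.
LOG_GROUPS = {"Redfish", "IPMI", "SSH", "Host", "HealthCheck"}


def build_nvdebug_args(
    bmc_ip: str,
    bmc_user: str,
    bmc_pass: str,
    platform: str,
    host_ip: Optional[str] = None,
    cids: Sequence[str] = (),
    log_group: Optional[str] = None,
    out_dir: Optional[str] = None,
) -> List[str]:
    """Return an argv list for nvdebug (hypothetical helper, not shipped with the tool)."""
    if cids and log_group:
        raise ValueError("-S and -g cannot be used together")
    if log_group is not None and log_group not in LOG_GROUPS:
        raise ValueError(f"unknown log group: {log_group}")

    # Mandatory options come first.
    args = ["nvdebug", "-i", bmc_ip, "-u", bmc_user, "-p", bmc_pass, "-t", platform]
    if host_ip:
        # Passing -I makes nvdebug assume it runs on a remote machine.
        args += ["-I", host_ip]
    if cids:
        args += ["-S", *cids]
    if log_group:
        args += ["-g", log_group]
    if out_dir:
        args += ["-o", out_dir]
    return args
```

The returned list can be handed to subprocess.run(args, check=True). Building a list instead of a single string avoids shell quoting problems with passwords that contain special characters.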
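Configuration Files#
The NVDebug tool has two configuration files in the same folder as the executable:
- DUT configuration file: dut_config.yaml by default.
- NVDebug-specific configuration file: config.yaml by default.
These files can be used to provide additional, optional configuration data. If a parameter is provided both on the CLI and in a configuration file, the value provided on the CLI takes precedence.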
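The precedence rule can be pictured with a short Python sketch. This is not the tool's implementation; the function resolve_options and the option keys outdir and port are hypothetical and only illustrate "a CLI value wins when it was supplied".

```python
from typing import Any, Dict, Optional


def resolve_options(cli: Dict[str, Optional[Any]], config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two option sources; a CLI value wins when it was actually supplied."""
    merged = dict(config)
    for key, value in cli.items():
        if value is not None:  # None means "not given on the command line"
            merged[key] = value
    return merged


# Hypothetical values: the config file sets both options, the CLI overrides one.
config_file = {"outdir": "/tmp", "port": 18888}
cli_args = {"outdir": "/var/log/nvdebug", "port": None}
print(resolve_options(cli_args, config_file))
# {'outdir': '/var/log/nvdebug', 'port': 18888}
```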
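HGX B200 8-GPU Example#
To communicate with the HGX baseboard, you need BMC SSH credentials to set up an SSH tunnel through the BMC. By default, the SSH credentials are assumed to be the same as the BMC credentials. To use different credentials, specify the -r and -w CLI options for the SSH username and password, respectively.
$ nvdebug -i $BMCIP -u $BMCUSER -p $BMCPASS -r $SSHUSER -w $SSHPASS -t HGX-HMC -P port_num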
Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT dut-1
hgx-b200-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-b200-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-b200-node2: [12:28:13] BMC IP: XXXX
Log collection has started for dut-1
hgx-b200-node2: [12:45:43] Log collection is now complete
hgx-b200-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-b200-node2 completed.
The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
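The tool sets up the SSH tunnel automatically on the specified port, which defaults to 18888. To use an existing SSH tunnel, do not set up the SSH tunnel in the configuration file, as shown in the following dut_config file: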
hgx-b200-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  TUNNEL_TCP_PORT: "port_num"
  SETUP_PORT_FORWARDING: false
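Before relying on an existing tunnel, it can help to confirm that something is listening on the tunnel port. The sketch below is an optional pre-check, not part of nvdebug. It assumes that the existing tunnel listens on the local machine (127.0.0.1) at the port given as TUNNEL_TCP_PORT; the function name tunnel_port_is_open is hypothetical.

```python
import socket


def tunnel_port_is_open(port: int = 18888, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connection only shows that a listener exists on that port. It does not prove that the listener is an SSH tunnel to the intended BMC.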
After you configure the NVDebug tool, run the nvdebug command.
Note
The host BMC must support port forwarding.
Sample output
$ nvdebug
Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT hgx-b200-node2
hgx-b200-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-b200-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-b200-node2: [12:28:13] BMC IP: XXXX
Log collection has started for hgx-b200-node2
hgx-b200-node2: [12:45:43] Log collection is now complete
hgx-b200-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-b200-node2 completed.
The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
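When archiving several runs, the start time of a run can be recovered from the log directory or zip file name. The following sketch assumes that the name follows the pattern seen in the sample output, nvdebug_logs_DD_MM_YYYY_HH_MM_SS; this pattern is inferred from the samples here, not from a documented guarantee, and the function name log_start_time is hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Pattern inferred from the sample output: nvdebug_logs_DD_MM_YYYY_HH_MM_SS
_LOG_NAME = re.compile(r"nvdebug_logs_(\d{2}_\d{2}_\d{4}_\d{2}_\d{2}_\d{2})")


def log_start_time(text: str) -> Optional[datetime]:
    """Extract the run timestamp from a path or output line, or None if absent."""
    match = _LOG_NAME.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d_%m_%Y_%H_%M_%S")


print(log_start_time("Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46"))
# 2024-09-30 12:27:46
```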
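DGX Platform Example#
To list the collectors that are available on the DGX platform, specify the -l option to list the log collectors and the -t DGX option to select the DGX platform:
$ nvdebug -l -t DGX
Sample output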
Redfish
CID Collector Name Log Location
R8 firmware_inventory Redfish_R8_firmware_inventory.json
R9 firmware_inventory_expand_query Redfish_R9_firmware_inventory_expand_query.json
R10 chassis_info Redfish_R10_chassis_info.json
R11 chassis_expand_query Redfish_R11_chassis_expand_query.json
R12 system_info Redfish_R12_system_info.json
R13 system_expand_query Redfish_R13_system_expand_query.json
R14 manager_info Redfish_R14_manager_info.json
R15 manager_expand_query Redfish_R15_manager_expand_query.json
R17 dgx_manager_oem_log_dump Redfish_R17_dgx_oem_dump_{manager_id}_{task_id}.tar.xz
R18 telemetry_metric_reports Redfish_R18_report_{metric_report}.json
R19 chassis_thermal_metrics Redfish_R19_chassis_{chassis}_thermal_metrics.json
R20 firmware_inventory_table Redfish_R20_firmware_inventory_table.txt
R22 task_details Redfish_R22_task_{task_id}.json
R23 nvlink_oob_logs Redfish_R23_NVLINK_OOB_Log_{id}.json
R25 additional_oob_logs Redfish_R25_OOB_Log_{id}.json
R26 chassis_certificates Redfish_R26_chassis_{chassis_id}_certificate.json
R29 background_copy_status Redfish_R29_{chassis_id}_copy_status.json
R30 software_inventory Redfish_R30_software_inventory
R32 system_post_codes Redfish_R32_system_post_codes
IPMI
CID Collector Name Log Location
I1 mc_info IPMI_I1_mc_info.txt
I2 lan_info IPMI_I2_lan_info.txt
I3 session_info IPMI_I3_session_info.txt
I4 fru_info IPMI_I4_fru_info.txt
I5 sdr_info IPMI_I5_sdr_info.txt
I6 sel_info IPMI_I6_sel_info.txt
I7 sensor_list IPMI_I7_sensor_list.txt
I8 sel_list IPMI_I8_sel_list.txt
I9 sel_raw_dump IPMI_I9_sel_raw_dump.txt
I10 chassis_status IPMI_I10_chassis_status.txt
I11 chassis_restart_cause IPMI_I11_chassis_restart_cause.txt
I12 user_list IPMI_I12_user_list.txt
I13 channel_info IPMI_I13_channel_info.txt
I14 sdr_elist IPMI_I14_sdr_elist.txt
SSH
CID Collector Name Log Location
S2 bmc_dmesg BMC_SSH_S2_bmc_dmesg.txt
S3 network_info BMC_SSH_S3_network_info/...
S5 bmc_list_kernel_modules BMC_SSH_S5_bmc_list_kernel_modules.txt
S8 bmc_mem_cpu_utilization BMC_SSH_S8_bmc_mem_cpu_utilization/...
S11 uptime BMC_SSH_S11_uptime.txt
S12 fpga_register_table BMC_SSH_S12_fpga_register_table.txt
S13 hmc_boot_status BMC_SSH_S13_hmc_boot_status.txt
S15 bmc_power_status BMC_SSH_S15_bmc_power_status/...
Host
CID Collector Name Log Location
H1 node_dmesg Host_H1_node_dmesg.tar.gz
H2 node_lspci Host_H2_node_lspci*.txt
H3 node_smbios Host_H3_dmidecode*.txt
H4 node_lshw Host_H4_lshw*.txt
H5 node_nvidia_smi Host_H5_nvidia-smi*.txt
H6 node_kern_log Host_H6_node_kern_log.tar.gz
H7 node_crash_dump Host_H7_node_crash_dump.tar.gz
H8 node_nvme_list Host_H8_nvme_list_-v.txt
H9 node_fabric_manager_log Host_H9_fabricmanager.log
H10 node_nvflash_log Host_H10_nvflash_--check_-i_{num}.txt
H11 nvidia_bug_report Host_H11_nvidia_bug_report_op.log.gz
H15 node_subnet_manager Host_H15_node_subnet_manager/
H16 one_diag_dump Host_H16_one_diag_dump/
H17 node_nvme_log_dump Host_H17_nvos_tech_support_dump/
HealthCheck
CID Collector Name Log Location
C1 out_of_band_health_check HealthCheck_C1_out_of_band_health_check.json
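To script against this listing, the output can be turned into a lookup table keyed by CID. The sketch below assumes that the real output is whitespace-delimited as in the sample above: a one-word group line, a header line, then rows of CID, collector name, and log location. The function name parse_collector_list is hypothetical.

```python
from typing import Dict, Tuple


def parse_collector_list(text: str) -> Dict[str, Tuple[str, str, str]]:
    """Map each CID to (group, collector name, log location)."""
    collectors: Dict[str, Tuple[str, str, str]] = {}
    group = ""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:
            # A single word such as "Redfish" or "IPMI" starts a new group.
            group = fields[0]
        elif len(fields) == 3:
            # Collector rows have exactly three fields; the header line
            # "CID Collector Name Log Location" has five and is skipped.
            cid, name, location = fields
            collectors[cid] = (group, name, location)
    return collectors
```

For example, parse_collector_list(output)["I1"] would give the group, name, and file name for the mc_info collector, which is convenient when choosing CIDs to pass to -S.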
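Redfish Collectors#
To run only specific collectors, specify the -S option with their CIDs. The following example collects the firmware inventory (R8), the IPMI management controller information (I1), and the system information (R12).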
nvdebug -i <bmc_ip> -u <bmc_user> -p <bmc_pass> ... -t DGX -v -S R8 I1 R12
Sample output
Log directory created at /tmp/nvdebug_logs_06_11_2024_15_40_27
Starting a collection for DUT dut-1
dut-1: [15:40:34] All preflight checks passed
dut-1: [15:40:34] Identified system as Model: DGXB200, Partno: 965-24387-0002-003, Serialno:1660224000069
dut-1: [15:40:34] User provided platform type: DGX
dut-1: [15:40:34] BMC IP: XXXX
Log collection has started for dut-1
dut-1: [15:40:34]
dut-1: [15:40:34] #####################################
dut-1: [15:40:34]
dut-1: [15:40:34] Collecting custom logs:
dut-1: [15:40:34]
dut-1: [15:40:34] Log collection was initiated for: r8_firmware_inventory
dut-1: [15:40:36] Log collection for r8_firmware_inventory took 0m 1.71s
dut-1: [15:40:36] Log collection was initiated for: r12_system_info
dut-1: [15:40:36] Log collection for r12_system_info took 0m 0.06s
dut-1: [15:40:36] Log collection was initiated for: i1_mc_info
dut-1: [15:40:36] Log collection for i1_mc_info took 0m 0.14s
dut-1: [15:40:36] Log collection is now complete
dut-1: [15:40:36] Log collection took 0m 2.16s
DUT dut-1 completed.
Log zip created at /tmp/nvdebug_logs_06_11_2024_15_40_27.zip
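The per-collector timing lines in verbose output can be summarized with a small parser, for example to spot slow collectors. The sketch below assumes the line format shown in the sample output ("Log collection for NAME took Xm Y.YYs"); the function name collector_durations is hypothetical.

```python
import re
from typing import Dict

# Matches per-collector lines only; the final "Log collection took ..." line
# has no "for NAME" part and is therefore not matched.
_TOOK = re.compile(r"Log collection for (\S+) took (\d+)m ([\d.]+)s")


def collector_durations(output: str) -> Dict[str, float]:
    """Return the duration in seconds for each collector found in the output."""
    durations: Dict[str, float] = {}
    for match in _TOOK.finditer(output):
        name, minutes, seconds = match.groups()
        durations[name] = int(minutes) * 60 + float(seconds)
    return durations
```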
To run the Redfish log collectors, specify the -g option with the Redfish log group:
$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t DGX -g Redfish
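IPv6 Configuration#
By default, the nvdebug tool uses IPv4. For IPv6, set IP_NETWORK to ipv6 in the DUT configuration. When you provide IPv6 addresses for the BMC or the host, do not use square brackets.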
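Addresses copied from URLs often arrive wrapped in square brackets. As a minimal sketch, the following helper strips the brackets and validates the result with Python's standard ipaddress module before the value is written to the DUT configuration or passed on the CLI. The function name normalize_ipv6 is hypothetical and not part of nvdebug.

```python
import ipaddress


def normalize_ipv6(address: str) -> str:
    """Strip optional square brackets and validate the result as IPv6."""
    bare = address.strip()
    if bare.startswith("[") and bare.endswith("]"):
        bare = bare[1:-1]
    # Raises ValueError if the text is not a valid IPv6 address.
    return str(ipaddress.IPv6Address(bare))


print(normalize_ipv6("[2001:db8::10]"))
# 2001:db8::10
```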