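Getting Started with nvdebug

The NVIDIA® NVDebug tool, nvdebug, runs on the server platform or on a remote client machine. The binary supports x86_64 and arm64-SBSA systems and collects the following information:

  • Out-of-band (OOB) BMC logs and information for troubleshooting server issues

  • Logs from the host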

Requirements

Requirements for the Client Host and the Server Host

Requirement                                                   Client Host   Server Host

Linux-based operating system: Linux kernel 4.4 or later            X             X
(4.15 or later is recommended)
GNU C Library glibc-2.7 or later                                   X             X
Operating system: Ubuntu 18.04 or later                            X             X
(Ubuntu 22.04 is recommended)
Python 3.10                                                        X             X
ipmitool 1.8.18 or later                                           X             X
sshpass command                                                    X             X
Server device under test (DUT) accessible through the BMC          X             X
from the client host, using Redfish and IPMI-over-LAN

nvme-cli tool

X

The BMC management network and the server host management network
are on the same subnet.

X

NVSwitch tray hosts require NVOS version 2.
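The command-line prerequisites listed above can be checked before a run. The following is a minimal sketch and is not part of nvdebug: the function name `check_requirements` is hypothetical, and it assumes that the nvme-cli package installs an executable named `nvme`. It only reports whether each command is on `PATH`, along with the running Python version and kernel release; it does not verify the ipmitool or glibc versions.

```python
import platform
import shutil
import sys

# Executables named in the requirements. "nvme" is assumed to be the
# command installed by the nvme-cli package.
REQUIRED_COMMANDS = ("ipmitool", "sshpass", "nvme")


def check_requirements(commands=REQUIRED_COMMANDS):
    """Return a report of which prerequisites are visible on this machine."""
    # True when the executable can be found on PATH, False otherwise.
    report = {cmd: shutil.which(cmd) is not None for cmd in commands}
    # The requirements list Python 3.10; record the running version so it
    # can be compared against that entry.
    report["python_version"] = "{}.{}".format(*sys.version_info[:2])
    # Kernel release string, for comparison with the 4.4 minimum.
    report["kernel_release"] = platform.release()
    return report


if __name__ == "__main__":
    for name, value in check_requirements().items():
        print(f"{name}: {value}")
```

Run the script on both the client host and the server host, because most requirements apply to both machines.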

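nvdebug Command-Line Interface

The high-level syntax of the nvdebug command supports collecting debug logs over OOB.

You can run the tool in either of the following ways:

  • From a remote machine that can access both the BMC and the host.

  • Directly on the host, if the host can access the BMC.

If the host IP address is passed through the configuration file or on the command line (CLI) with -I/--hostip, nvdebug assumes that it is running on a remote machine. Otherwise, nvdebug assumes that it is running on the host and collects the host logs locally.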
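The rule above can be expressed as a small decision function. This is an illustrative sketch only, not nvdebug's own code: the function name `execution_mode` and its two parameters are hypothetical stand-ins for the value of -I/--hostip and the host IP read from the configuration file.

```python
from typing import Optional


def execution_mode(cli_hostip: Optional[str] = None,
                   config_hostip: Optional[str] = None) -> str:
    """Return "remote" when a host IP is supplied, otherwise "local"."""
    # A host IP from either source means the tool is treated as running
    # on a remote machine; with no host IP, host logs are collected locally.
    hostip = cli_hostip or config_hostip
    return "remote" if hostip else "local"
```

For example, `execution_mode("192.0.2.5")` returns `"remote"`, and `execution_mode()` returns `"local"`.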

Syntax

$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>

Mandatory options:
    -i/--ip is the BMC IP address.
    -u/--user is BMC username with administrative privileges.
    -p/--password is BMC administrative user password.
    -t/--platform is the platform type of the DUT, and it accepts DGX, HGX-HMC, arm64, x86_64, and NVSwitch.

Additional credentials:
    -r/--sshuser is BMC SSH username.
    -w/--sshpass is BMC SSH password.
    -R/--rfuser is BMC Redfish username.
    -W/--rfpass is BMC Redfish password.

Host options:
    -I/--hostip is the Host IP Address.
     If the IP address is not provided, the tool assumes it is running on the host machine.
    -U/--hostuser is the Host username with administrative privileges.
    -H/--hostpass is the Host password.

Additional options:
    -b/--baseboard <baseboard> is the baseboard type, such as Hopper-HGX-8-GPU and Blackwell-HGX-8-GPU.
    -C/--config <file path> is the path to the config file. The default is ./config.yaml.
    -d/--dutconfig <dut config path> is the path to the DUT specific config file.
     The default path is ./dut_config.yaml.
    -c/--common collects the common logs using the included common.json file.
    -v/--verbose displays the detailed output and error messages.
    -o/--outdir <output dir> is the output directory where the output is generated.
     The default location is /tmp.
    -P/--port <fw_port> is the port number that will be used for forwarding.
     The --port variable applies only to HGX-Baseboard based platforms,
     and the default value is 18888.
    --local enables Local Execution mode.
    -z/--skipzip skips zipping individual DUT folders.


Log collection options:
    -S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
    -g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
     that is supported on the current platform.
     Only one collector group can be specified.
    -j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
     and tools as defined by the user.
     The -S and -g options cannot be used together.

Utility options:
    -h/--help and --version are standalone options, and -l/--list requires the platform
     type to be specified using -t/--platform.
    --parse <log dump> parses an nvdebug log dump and decodes the binary data.
    -h/--help provides information about tool usage.
    --version displays the current version of the tool.
    -l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists log collectors that are supported by the platform
     with their collector IDs (CID). If a type is passed, it will only list log collectors
     of that type. The -l/--list option requires the target platform type to be specified with -t/--platform.

    By default, if option -c is not included, the nvdebug tool will collect logs based on the common.json
    and platform_xyz.json files. At the end of the run, the tool will generate the output log xyz.zip
    file in the directory specified by the -o option. If no directory is provided, the log
    will be generated in the /tmp directory.
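When nvdebug is driven from a script, it can help to assemble the argument list in one place and reject invalid option combinations before the tool runs. The following is a minimal sketch that uses only the options documented above. The function name `build_nvdebug_command` is hypothetical, and the IP address and credentials in the usage example are placeholders.

```python
import shlex
from typing import List, Optional, Sequence

# Log groups accepted by -g/--loggroup, as listed in the help text.
LOG_GROUPS = ("Redfish", "IPMI", "SSH", "Host", "HealthCheck")


def build_nvdebug_command(ip: str, user: str, password: str, platform: str,
                          cids: Optional[Sequence[str]] = None,
                          loggroup: Optional[str] = None,
                          outdir: Optional[str] = None,
                          verbose: bool = False) -> List[str]:
    """Build an nvdebug argument list from the documented options."""
    if cids and loggroup:
        # The help text states that -S and -g cannot be used together.
        raise ValueError("-S/--cids and -g/--loggroup cannot be combined")
    if loggroup is not None and loggroup not in LOG_GROUPS:
        raise ValueError(f"unknown log group: {loggroup}")
    # Mandatory options: BMC IP, BMC user, BMC password, platform type.
    cmd = ["nvdebug", "-i", ip, "-u", user, "-p", password, "-t", platform]
    if verbose:
        cmd.append("-v")
    if outdir:
        cmd += ["-o", outdir]
    if cids:
        cmd += ["-S", *cids]
    if loggroup:
        cmd += ["-g", loggroup]
    return cmd


if __name__ == "__main__":
    # Placeholder address and credentials; the command is printed, not run.
    cmd = build_nvdebug_command("192.0.2.10", "admin", "secret", "DGX",
                                cids=["R8", "I1", "R12"], verbose=True)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems with passwords that contain special characters.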

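Configuration Files

The NVDebug tool has two configuration files in the same folder as the executable:

  • The DUT configuration file: dut_config.yaml by default

  • The NVDebug-specific configuration file: config.yaml by default

These files can supply additional, optional configuration data. If a parameter is provided both on the CLI and in a configuration file, the value provided on the CLI takes precedence.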
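The precedence rule can be illustrated with a short merge function. This is a sketch of the stated behavior, not nvdebug's implementation: the function name `resolve_options` is hypothetical, and representing CLI values under the same key names as the DUT configuration file (for example `BMC_IP`) is an assumption made for the example.

```python
from typing import Any, Dict, Optional


def resolve_options(config_values: Dict[str, Any],
                    cli_values: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Merge config-file values with CLI values; CLI values win."""
    merged = dict(config_values)  # copy so the input is left unchanged
    for key, value in cli_values.items():
        # None stands for "option not given on the command line".
        if value is not None:
            merged[key] = value
    return merged


if __name__ == "__main__":
    config = {"BMC_IP": "192.0.2.10", "BMC_USERNAME": "bmc_user"}
    cli = {"BMC_IP": "192.0.2.20", "BMC_USERNAME": None}
    # BMC_IP comes from the CLI; BMC_USERNAME falls back to the config file.
    print(resolve_options(config, cli))
```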

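HGX B200 8-GPU Example

To communicate with the HGX baseboard, you need BMC SSH credentials to set up an SSH tunnel through the BMC. By default, the SSH credentials are assumed to be the same as the BMC credentials. To use different credentials, specify the -r and -w CLI options for the SSH username and password, respectively.

$ nvdebug -i $BMCIP -u $BMCUSER -p $BMCPASS -r $SSHUSER -w $SSHPASS -t HGX-HMC -P port_num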

Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT dut-1
hgx-b200-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-b200-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-b200-node2: [12:28:13] BMC IP: XXXX

Log collection has started for dut-1
hgx-b200-node2: [12:45:43] Log collection is now complete
hgx-b200-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-b200-node2 completed.

The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.

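The tool sets up the SSH tunnel automatically on the specified port, which defaults to 18888. To use an existing SSH tunnel instead, disable SSH tunnel setup in the configuration file, as shown in the following dut_config file: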

hgx-b200-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  TUNNEL_TCP_PORT: "port_num"

  SETUP_PORT_FORWARDING: false

After you configure the NVDebug tool, run the nvdebug command.

Note

The host BMC must support port forwarding.

Example output

$ nvdebug

Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT hgx-b200-node2
hgx-b200-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-b200-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-b200-node2: [12:28:13] BMC IP: XXXX

Log collection has started for hgx-b200-node2
hgx-b200-node2: [12:45:43] Log collection is now complete
hgx-b200-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-b200-node2 completed.

The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
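In the sample output, the zip file name is the log directory name with a .zip suffix. A wrapper script can use this to locate the archive after a run. The following sketch assumes that the "Log directory created at" line keeps the format shown in the sample; the function name `expected_zip_path` is hypothetical.

```python
import re
from typing import Optional

# Matches the first line of the sample output and captures the directory path.
LOG_DIR_PATTERN = re.compile(r"^Log directory created at (\S+)[ \t]*$",
                             re.MULTILINE)


def expected_zip_path(output: str) -> Optional[str]:
    """Return the zip path implied by the 'Log directory created at' line."""
    match = LOG_DIR_PATTERN.search(output)
    if match is None:
        return None  # the line was not found in the captured output
    # In the sample output, the archive is the directory name plus ".zip".
    return match.group(1).rstrip("/") + ".zip"
```

The function returns `None` when the line is missing, so a caller can tell a failed run apart from a completed one.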

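DGX Platform Example

To list the collectors that are available on a DGX platform, specify the -l option to list the log collectors and the -t DGX option to select the DGX platform: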

$ nvdebug -l -t DGX

Example output

Redfish
  CID    Collector Name                          Log Location
   R8    firmware_inventory                      Redfish_R8_firmware_inventory.json
   R9    firmware_inventory_expand_query         Redfish_R9_firmware_inventory_expand_query.json
  R10    chassis_info                            Redfish_R10_chassis_info.json
  R11    chassis_expand_query                    Redfish_R11_chassis_expand_query.json
  R12    system_info                             Redfish_R12_system_info.json
  R13    system_expand_query                     Redfish_R13_system_expand_query.json
  R14    manager_info                            Redfish_R14_manager_info.json
  R15    manager_expand_query                    Redfish_R15_manager_expand_query.json
  R17    dgx_manager_oem_log_dump                Redfish_R17_dgx_oem_dump_{manager_id}_{task_id}.tar.xz
  R18    telemetry_metric_reports                Redfish_R18_report_{metric_report}.json
  R19    chassis_thermal_metrics                 Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20    firmware_inventory_table                Redfish_R20_firmware_inventory_table.txt
  R22    task_details                            Redfish_R22_task_{task_id}.json
  R23    nvlink_oob_logs                         Redfish_R23_NVLINK_OOB_Log_{id}.json
  R25    additional_oob_logs                     Redfish_R25_OOB_Log_{id}.json
  R26    chassis_certificates                    Redfish_R26_chassis_{chassis_id}_certificate.json
  R29    background_copy_status                  Redfish_R29_{chassis_id}_copy_status.json
  R30    software_inventory                      Redfish_R30_software_inventory
  R32    system_post_codes                       Redfish_R32_system_post_codes

IPMI
  CID    Collector Name                          Log Location
   I1    mc_info                                 IPMI_I1_mc_info.txt
   I2    lan_info                                IPMI_I2_lan_info.txt
   I3    session_info                            IPMI_I3_session_info.txt
   I4    fru_info                                IPMI_I4_fru_info.txt
   I5    sdr_info                                IPMI_I5_sdr_info.txt
   I6    sel_info                                IPMI_I6_sel_info.txt
   I7    sensor_list                             IPMI_I7_sensor_list.txt
   I8    sel_list                                IPMI_I8_sel_list.txt
   I9    sel_raw_dump                            IPMI_I9_sel_raw_dump.txt
  I10    chassis_status                          IPMI_I10_chassis_status.txt
  I11    chassis_restart_cause                   IPMI_I11_chassis_restart_cause.txt
  I12    user_list                               IPMI_I12_user_list.txt
  I13    channel_info                            IPMI_I13_channel_info.txt
  I14    sdr_elist                               IPMI_I14_sdr_elist.txt

SSH
  CID    Collector Name                          Log Location
   S2    bmc_dmesg                               BMC_SSH_S2_bmc_dmesg.txt
   S3    network_info                            BMC_SSH_S3_network_info/...
   S5    bmc_list_kernel_modules                 BMC_SSH_S5_bmc_list_kernel_modules.txt
   S8    bmc_mem_cpu_utilization                 BMC_SSH_S8_bmc_mem_cpu_utilization/...
  S11    uptime                                  BMC_SSH_S11_uptime.txt
  S12    fpga_register_table                     BMC_SSH_S12_fpga_register_table.txt
  S13    hmc_boot_status                         BMC_SSH_S13_hmc_boot_status.txt
  S15    bmc_power_status                        BMC_SSH_S15_bmc_power_status/...

Host
  CID    Collector Name                          Log Location
   H1    node_dmesg                              Host_H1_node_dmesg.tar.gz
   H2    node_lspci                              Host_H2_node_lspci*.txt
   H3    node_smbios                             Host_H3_dmidecode*.txt
   H4    node_lshw                               Host_H4_lshw*.txt
   H5    node_nvidia_smi                         Host_H5_nvidia-smi*.txt
   H6    node_kern_log                           Host_H6_node_kern_log.tar.gz
   H7    node_crash_dump                         Host_H7_node_crash_dump.tar.gz
   H8    node_nvme_list                          Host_H8_nvme_list_-v.txt
   H9    node_fabric_manager_log                 Host_H9_fabricmanager.log
  H10    node_nvflash_log                        Host_H10_nvflash_--check_-i_{num}.txt
  H11    nvidia_bug_report                       Host_H11_nvidia_bug_report_op.log.gz
  H15    node_subnet_manager                     Host_H15_node_subnet_manager/
  H16    one_diag_dump                           Host_H16_one_diag_dump/
  H17    node_nvme_log_dump                      Host_H17_nvos_tech_support_dump/

HealthCheck
  CID    Collector Name                          Log Location
   C1    out_of_band_health_check                HealthCheck_C1_out_of_band_health_check.json
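The listing can be turned into a lookup table, for example to validate CIDs before passing them to -S. The following sketch assumes that the output keeps the layout shown above: a group heading on its own line, a column-header row that starts with CID, and three whitespace-separated columns per collector. The names `Collector` and `parse_collector_list` are hypothetical.

```python
from typing import Dict, NamedTuple


class Collector(NamedTuple):
    group: str
    name: str
    log_location: str


def parse_collector_list(text: str) -> Dict[str, Collector]:
    """Map each CID to its group, collector name, and log location."""
    collectors: Dict[str, Collector] = {}
    group = ""
    for line in text.splitlines():
        fields = line.split(None, 2)
        if not fields or fields[0] == "CID":
            continue  # blank line or column-header row
        if len(fields) == 1:
            group = fields[0]  # group heading such as "Redfish" or "IPMI"
        elif len(fields) == 3:
            cid, name, location = fields
            collectors[cid] = Collector(group, name, location.strip())
    return collectors
```

With the parsed result, a check such as `"R8" in collectors` confirms that a collector is supported on the platform before a collection starts.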

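Redfish Collectors

To run only specific collectors, specify the -S option. The following example collects the firmware inventory (R8), the IPMI management controller information (I1), and the system information (R12).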

nvdebug -i <bmc_ip> -u <bmc_user> -p <bmc_pass> ... -t DGX -v -S R8 I1 R12

Example output

Log directory created at /tmp/nvdebug_logs_06_11_2024_15_40_27
Starting a collection for DUT dut-1
dut-1: [15:40:34] All preflight checks passed
dut-1: [15:40:34] Identified system as Model: DGXB200, Partno: 965-24387-0002-003, Serialno:1660224000069
dut-1: [15:40:34] User provided platform type: DGX
dut-1: [15:40:34] BMC IP: XXXX
Log collection has started for dut-1
dut-1: [15:40:34]
dut-1: [15:40:34] #####################################
dut-1: [15:40:34]
dut-1: [15:40:34] Collecting custom logs:
dut-1: [15:40:34]
dut-1: [15:40:34] Log collection was initiated for: r8_firmware_inventory
dut-1: [15:40:36] Log collection for r8_firmware_inventory took 0m 1.71s
dut-1: [15:40:36] Log collection was initiated for: r12_system_info
dut-1: [15:40:36] Log collection for r12_system_info took 0m 0.06s
dut-1: [15:40:36] Log collection was initiated for: i1_mc_info
dut-1: [15:40:36] Log collection for i1_mc_info took 0m 0.14s
dut-1: [15:40:36] Log collection is now complete
dut-1: [15:40:36] Log collection took 0m 2.16s
DUT dut-1 completed.
Log zip created at /tmp/nvdebug_logs_06_11_2024_15_40_27.zip
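The per-collector timing lines in the verbose output can be extracted to find slow collectors. The following sketch assumes the "Log collection for <name> took <M>m <S>s" format shown in the sample; the function name `collector_durations` is hypothetical. The overall "Log collection took" line has no collector name and is deliberately not matched.

```python
import re
from typing import Dict

# Captures the collector name, whole minutes, and (fractional) seconds.
TIMING_PATTERN = re.compile(
    r"Log collection for (\S+) took (\d+)m (\d+(?:\.\d+)?)s")


def collector_durations(output: str) -> Dict[str, float]:
    """Return the duration of each collector in seconds."""
    durations: Dict[str, float] = {}
    for name, minutes, seconds in TIMING_PATTERN.findall(output):
        durations[name] = int(minutes) * 60 + float(seconds)
    return durations
```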

To run all Redfish log collectors, specify the -g option with the Redfish log group:

$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t DGX -g Redfish

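IPv6 Configuration

By default, the nvdebug tool uses IPv4. To use IPv6, set IP_NETWORK to ipv6 in the DUT configuration. Do not use square brackets when you provide IPv6 addresses for the BMC or the host.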
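IPv6 addresses copied from URLs often carry square brackets, so a script that generates the DUT configuration may want to strip them first. The following sketch uses the Python standard library `ipaddress` module; the function name `normalize_ipv6` is hypothetical. As a side effect, the address is returned in its compressed lowercase form.

```python
import ipaddress


def normalize_ipv6(address: str) -> str:
    """Return an IPv6 address without square brackets, or raise ValueError."""
    stripped = address.strip()
    if stripped.startswith("[") and stripped.endswith("]"):
        stripped = stripped[1:-1]  # drop URL-style brackets
    # IPv6Address raises AddressValueError (a ValueError subclass)
    # when the text is not a valid IPv6 address.
    return str(ipaddress.IPv6Address(stripped))


if __name__ == "__main__":
    # 2001:db8::/32 is the IPv6 documentation prefix.
    print(normalize_ipv6("[2001:db8::1]"))
```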