Installing Tools on Grace Hopper MGX System#
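This chapter describes how to install the required kernel, drivers, and tools on the host. This is a one-time installation and can be skipped if the system has already been configured.
In the following sequence of steps, the target host is a Supermicro Grace Hopper MGX system.
Depending on the release, tools installed in this section may need to be upgraded in the Installing and Upgrading Aerial cuBB section.
Once everything is installed and updated, refer to the cuBB Quick Start Guide for how to use Aerial cuBB.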
Supermicro Grace Hopper MGX Configuration#
Supermicro Server SKU: ARS-111GL-NHR (Config 2)

Top view

Rear view

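Cable Connection#
Host OS Internet Connection#
The BF3 NICs are reserved for fronthaul and backhaul connections. For host OS internet connectivity, use a USB-to-Ethernet adapter connected to a rear USB port.
E2E Test Connection#
To run end-to-end tests with an O-RU, BF3 fronthaul port #0 or port #1 must be connected to the fronthaul switch. Make sure PTP is configured to use the port connected to the fronthaul switch. The following figure shows a typical E2E connection in the O-RAN LLS-C3 topology.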

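cuBB Test Connection#
To run cuBB end-to-end tests with TestMAC and the RU emulator, pair an R750 RU emulator with the Grace Hopper MGX system. The BF3 NIC (P/N: 900-9D3B6-00CV-AA0) should be installed in slot 7 of the R750 server, as shown in the following figure.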

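To set up the R750 RU emulator, follow the instructions in Installing Tools on Dell R750. Because the R750 RU emulator has no GPU, Installing CUDA Driver can be skipped. Note that the PCI addresses of the BF3 ports on the R750 RU emulator are ca:00.0 and ca:00.1.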
$ lshw -c network -businfo
Bus info Device Class Description
==========================================================
pci@0000:04:00.0 eno8303 network NetXtreme BCM5720 Gigabit Etherne
pci@0000:04:00.1 eno8403 network NetXtreme BCM5720 Gigabit Etherne
pci@0000:ca:00.0 aerial00 network MT43244 BlueField-3 integrated Co
pci@0000:ca:00.1 aerial01 network MT43244 BlueField-3 integrated Co
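A Mellanox 200GbE direct attach copper (DAC) cable is required to connect the Grace Hopper MGX and the R750 RU emulator when running more than 10 cells. A 100GbE DAC cable should be able to support 10C 59c BFP9, but cannot be used for 20C 60c BFP9.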

To run the RU emulator on R750 + BF3, update the RU emulator yaml as follows:
# For RU Emulator on R750 system
sed -i "s/ul_core_list.*/ul_core_list: [5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43]/" $RU_YAML
sed -i "s/dl_core_list.*/dl_core_list: [4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42]/" $RU_YAML
sed -i "s/aerial_fh_split_rx_tx_mempool.*/aerial_fh_split_rx_tx_mempool: 1/" $RU_YAML
sed -i "s/low_priority_core.*/low_priority_core: 45/" $RU_YAML
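The two core lists above are every second core from 5 to 43 (uplink) and from 4 to 42 (downlink). As a minimal sketch, assuming GNU `seq` is available, the lists can be generated instead of typed by hand. The helper name `core_list` is hypothetical and not part of the Aerial tooling.

```shell
#!/bin/sh
# core_list FIRST LAST: print every second core from FIRST to LAST
# as a YAML flow sequence, for example [5,7,9].
# Hypothetical helper for illustration; relies on GNU seq.
core_list() {
    printf '[%s]\n' "$(seq -s, "$1" 2 "$2")"
}

# Usage (RU_YAML must point at the RU emulator config file):
#   sed -i "s/ul_core_list.*/ul_core_list: $(core_list 5 43)/" "$RU_YAML"
#   sed -i "s/dl_core_list.*/dl_core_list: $(core_list 4 42)/" "$RU_YAML"
```

Generating the lists keeps the uplink and downlink ranges easy to adjust if a different core layout is needed.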
System Firmware Upgrade#
During the first boot, log in to the BMC to check the firmware inventory. Go to Dashboard -> Maintenance -> Firmware Management -> Inventory to view the current firmware versions.

The following table lists the minimum required versions. If your system has older firmware, upgrade it to the versions below or newer.
Component | Firmware Version | Firmware File Name |
---|---|---|
BMC | 1.02.01 (20231103) | BMC_SCMAST2600-ROT20-2501MS_20231103_01.02.01_STDsp.bin |
BIOS | 1.0 (20231026) | BIOS_G1SMH-G-1D31_20231026_1.0_STDsp.bin |
FPGA | 0.8A | FPGA_MBD-G1SMH-G-10XX1D31_20231018_00.8A.XX_STDsp.bin |
VBIOS | 96.00.84.00.02 | g530_0206_888__9600840002-prod.fwpkg |
EROT | 1.03.0114.0000-n01 | cec1736-ecfw-01.03.0114.0000-n01-rel-prod.fwpkg |
CPLD Motherboard Misc | V0B | CPLD_XO3-GP03E0-10XX03E0_20231020_0B.XX.XX_STDsp.jed |
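The recommended firmware update order is:
1. Power off the host
2. Update BMC
3. Update CPLD Motherboard Misc
4. Update CPU ERoT
5. Update FPGA
6. AC power cycle
7. Update BIOS
8. Update VBIOS
9. Reboot or power cycle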
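To update the firmware of a specific component, go to Dashboard -> Maintenance -> Firmware Management -> Update, then select the component icon -> Next -> Select File -> Upload -> Update. For example, select BMC and its firmware file, as shown below.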

Non-BMC firmware updates are queued in the task list and applied on the next boot.

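Installing Ubuntu 22.04 Server#
Download the Ubuntu Server 22.04 ISO image for ARM-based systems from https://ubuntu.com/download/server/arm. Before installing the OS, prepare a bootable USB drive containing the OS image, or configure virtual media in the BMC for remote installation. Also verify that a USB-to-Ethernet adapter is connected to a rear USB port for host internet access.
There are two ways to configure virtual media. One is to share the OS ISO image through a Windows network share or a Samba share on Linux. Then go to BMC Dashboard -> Configuration -> Virtual Media and enter the virtual media connection information, including the share host IP, image path, user name, and password. After saving the connection information, click the link icon to connect.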

The other way is to select the "Virtual Media" icon in the remote console and mount the OS ISO image to the virtual CD/DVD drive.

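After the virtual media is configured and connected, reboot the system. Press F11 to enter the BIOS boot menu, then select UEFI: USB CD/DVD Drive to boot from the virtual media.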

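Launch the SOL console from the BMC Remote Control menu. The SOL console is required to complete the Ubuntu OS installation.
Note
The Ubuntu 22.04.3 installation media does not include a patch required to fix an issue in the ast driver, which is used to interface with the BMC. Without this patch, output on the onboard display port and the remote console is distorted, so the OS installation must be completed on the SOL console. The fix is included in the NVIDIA-optimized Ubuntu kernel. After that kernel is installed, the onboard display and the BMC remote console output are normal again.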

When the GRUB menu appears on the SOL console, select Ubuntu Server with the HWE Kernel to install the Ubuntu Server OS.

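Follow the Ubuntu installation process and note the following selections:
Continue in rich mode
Continue without updating
Ubuntu Server
Install OpenSSH server
When the installation finishes, the console shows Install complete and Reboot now. Reboot the system and check the following items.
Check that the system time is correct to avoid apt update errors.
Run the following commands to set the date and time via NTP once (this does not enable the NTP service):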
sudo apt-get install ntpdate
sudo ntpdate -s pool.ntp.org
Check that the OS detects the GPU and the NICs by using the following commands:
$ lspci | grep -i nvidia
# GH200 GPU
0009:01:00.0 3D controller: NVIDIA Corporation Device 2342 (rev a1)
$ lspci | grep -i mellanox
# The first BF3 NIC (Fronthaul NIC)
0000:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0000:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0000:01:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
# The second BF3 NIC (Backhaul NIC)
0002:01:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0002:01:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0002:01:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
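The detection check above can also be scripted. The following is a minimal sketch, assuming the same inventory as the example output (one GH200 GPU and two dual-port BF3 NICs, so four ConnectX-7 Ethernet functions). The helper name `check_pci_inventory` and the expected counts are assumptions for illustration.

```shell
#!/bin/sh
# check_pci_inventory: read `lspci` output on stdin and confirm that exactly
# one NVIDIA 3D controller and four ConnectX-7 Ethernet functions are listed.
# Hypothetical helper; the expected counts mirror the example output above.
check_pci_inventory() {
    awk '
        /3D controller: NVIDIA/                      { gpu++ }
        /Ethernet controller: Mellanox.*ConnectX-7/  { nic++ }
        END {
            printf "gpu=%d nic_ports=%d\n", gpu, nic
            # Exit status 0 only when both counts match
            exit !(gpu == 1 && nic == 4)
        }'
}

# Usage:
#   lspci | check_pci_inventory || echo "unexpected PCI inventory"
```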
Change the hostname with the following command:
$ sudo hostnamectl set-hostname NEW_HOSTNAME
To show the GRUB menu during boot, create /etc/default/grub.d/menu.cfg with the following content:
$ cat <<"EOF" | sudo tee /etc/default/grub.d/menu.cfg
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=5
GRUB_TERMINAL="console serial"
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_SERIAL_COMMAND="$GRUB_SERIAL_COMMAND serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1"
EOF
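Configuring Network Interfaces#
The following installation steps require an internet connection. Make sure the netplan configuration is correct for your local network.
Network interface names can change after a reboot. To keep them stable across reboots, create a persistent network link file for each interface under /etc/systemd/network.
To find the MAC addresses of the BlueField-3 NICs, run lshw to inspect the network devices and look for the ConnectX-7 entries.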
$ sudo apt-get install jq -y
$ sudo lshw -json -C network | jq '.[] | "\(.product), MAC: \(.serial)"' | grep "ConnectX-7"
"MT43244 BlueField-3 integrated ConnectX-7 network controller, MAC: 94:6d:ae:ww:ww:ww"
"MT43244 BlueField-3 integrated ConnectX-7 network controller, MAC: 94:6d:ae:xx:xx:xx"
"MT43244 BlueField-3 integrated ConnectX-7 network controller, MAC: 94:6d:ae:yy:yy:yy"
"MT43244 BlueField-3 integrated ConnectX-7 network controller, MAC: 94:6d:ae:zz:zz:zz"
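Create a file in /etc/systemd/network/ for each interface, using the desired interface name and the MAC address found in the previous step.
Note
The rest of this document assumes that the aerial00 and aerial01 interfaces are the ones connected to the RU emulator for cuBB testing or to the fronthaul switch for E2E testing, and that aerial00 is the interface used for PTP.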
$ sudo nano /etc/systemd/network/20-aerial00.link
[Match]
MACAddress=94:6d:ae:ww:ww:ww
[Link]
Name=aerial00
$ sudo nano /etc/systemd/network/20-aerial01.link
[Match]
MACAddress=94:6d:ae:xx:xx:xx
[Link]
Name=aerial01
$ sudo nano /etc/systemd/network/20-aerial02.link
[Match]
MACAddress=94:6d:ae:yy:yy:yy
[Link]
Name=aerial02
$ sudo nano /etc/systemd/network/20-aerial03.link
[Match]
MACAddress=94:6d:ae:zz:zz:zz
[Link]
Name=aerial03
To apply the changes:
$ sudo netplan apply
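The four link files differ only in the interface name and MAC address, so they can be written by a small function instead of an editor. This is a sketch under stated assumptions: the helper name `write_link_file` is hypothetical, and the optional third argument exists only so the function can be tried against a scratch directory before writing to /etc/systemd/network as root.

```shell
#!/bin/sh
# write_link_file NAME MAC [DIR]: create DIR/20-NAME.link, pinning the
# interface name NAME to the given MAC address.
# Hypothetical helper; DIR defaults to /etc/systemd/network (needs root there).
write_link_file() (
    name=$1
    mac=$2
    dir=${3:-/etc/systemd/network}
    printf '[Match]\nMACAddress=%s\n[Link]\nName=%s\n' "$mac" "$name" \
        > "$dir/20-$name.link"
)

# Usage (replace the MAC with the value reported by lshw):
#   write_link_file aerial00 94:6d:ae:ww:ww:ww
```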
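Disabling Auto Upgrade#
Edit the /etc/apt/apt.conf.d/20auto-upgrades system file and change the "1" to "0" on both lines. This prevents the installed low-latency kernel version from being changed unintentionally by a later software upgrade.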
$ sudo nano /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
Disable the fwupd-refresh timer to prevent fwupdmgr from automatically checking for updates.
$ sudo systemctl mask fwupd-refresh.timer
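For unattended provisioning, the edit to 20auto-upgrades can be made without an interactive editor. The following is a minimal sketch, assuming GNU `sed`; the helper name `disable_auto_upgrades` is hypothetical, and the optional file argument is there so the substitution can be checked on a copy first.

```shell
#!/bin/sh
# disable_auto_upgrades [FILE]: set both APT::Periodic switches to "0".
# Hypothetical helper; FILE defaults to the path named in this section.
# Uses GNU sed (-i for in-place editing, -E for extended regex).
disable_auto_upgrades() (
    file=${1:-/etc/apt/apt.conf.d/20auto-upgrades}
    sed -i -E \
        's/^(APT::Periodic::(Update-Package-Lists|Unattended-Upgrade)) "[0-9]+";/\1 "0";/' \
        "$file"
)

# Usage (as root):
#   disable_auto_upgrades
```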
Installing the NVIDIA-Optimized Ubuntu Kernel#
Run the following commands to install the NVIDIA-optimized Ubuntu kernel.
$ sudo apt update
# NOTE: This will install the specific kernel version, not the latest NVIDIA optimized kernel.
$ sudo apt install -y linux-image-6.5.0-1019-nvidia-64k
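Then update GRUB to change the default boot kernel. The version in the menu entry must match the kernel version installed by the previous command: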
# Update grub to change the default boot kernel
$ sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-1019-nvidia-64k"/' /etc/default/grub
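Configuring Linux Kernel Command-line#
Make sure the iommu.passthrough=y kernel parameter is not passed to the kernel. This parameter prevents the GPU driver from loading, so it must be removed if present.
Check whether the parameter is present by running the following command (no output means it is not set):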
$ grep iommu.passthrough=y /proc/cmdline
If the parameter is present, find the file that contains it and remove it. For example:
$ grep -rns iommu.passthrough /etc/default/grub*
# Remove iommu.passthrough=y from the found file
$ sudo sed -i 's/ iommu.passthrough=y//' /etc/default/<found file>
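To set kernel command-line parameters, edit the GRUB_CMDLINE_LINUX parameter in the grub file /etc/default/grub.d/cmdline.cfg and append or update the parameters described below. The following kernel parameters are optimized for GH200. To write them to the grub file automatically, enter this command: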
$ cat <<"EOF" | sudo tee /etc/default/grub.d/cmdline.cfg
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX pci=realloc=off pci=pcie_bus_safe default_hugepagesz=512M hugepagesz=512M hugepages=48 tsc=reliable processor.max_cstate=0 audit=0 idle=poll rcu_nocb_poll nosoftlockup irqaffinity=0 isolcpus=managed_irq,domain,4-64 nohz_full=4-64 rcu_nocbs=4-64 earlycon module_blacklist=nouveau acpi_power_meter.force_cap_on=y numa_balancing=disable init_on_alloc=0 preempt=none"
EOF
Note
The hugepage size is 512 MB, which is optimized for the 64k page size kernel on ARM.
Applying the Changes and Rebooting to Load the Kernel#
$ sudo update-grub
$ sudo reboot
After rebooting, enter these commands to verify that the kernel and its command-line parameters are configured correctly:
$ uname -r
6.5.0-1019-nvidia-64k
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.5.0-1019-nvidia-64k root=/dev/mapper/ubuntu--vg-ubuntu--lv ro pci=realloc=off pci=pcie_bus_safe default_hugepagesz=512M hugepagesz=512M hugepages=48 tsc=reliable processor.max_cstate=0 audit=0 idle=poll rcu_nocb_poll nosoftlockup irqaffinity=0 isolcpus=managed_irq,domain,4-64 nohz_full=4-64 rcu_nocbs=4-64 earlycon module_blacklist=nouveau acpi_power_meter.force_cap_on=y numa_balancing=disable init_on_alloc=0 preempt=none
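Comparing a long command line by eye is error-prone. As a sketch, the check can be reduced to listing any expected parameter that is absent. The helper name `missing_params` is hypothetical; it treats the command line as whitespace-separated tokens and matches each parameter exactly.

```shell
#!/bin/sh
# missing_params FILE PARAM...: print each PARAM that is not an exact,
# whitespace-delimited token of the kernel command line stored in FILE.
# Hypothetical helper; prints nothing when every parameter is present.
missing_params() (
    file=$1
    shift
    for p in "$@"; do
        # One token per line, then an exact fixed-string whole-line match
        tr ' ' '\n' < "$file" | grep -qxF -- "$p" || printf '%s\n' "$p"
    done
)

# Usage:
#   missing_params /proc/cmdline hugepagesz=512M hugepages=48 \
#       isolcpus=managed_irq,domain,4-64 nohz_full=4-64 rcu_nocbs=4-64 preempt=none
```

Exact token matching matters here: a substring search would accept `hugepages=4` when the command line actually contains `hugepages=48`.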
Enter this command to check whether hugepages are enabled:
$ grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 48
HugePages_Free: 48
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 524288 kB
Hugetlb: 25165824 kB
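The Hugetlb figure follows directly from the boot parameters: 48 pages of 524288 kB each give 48 × 524288 = 25165824 kB, which is 24 GiB. A minimal sketch that recomputes this product from a meminfo file is shown below; the helper name `hugetlb_kb` is hypothetical.

```shell
#!/bin/sh
# hugetlb_kb [FILE]: multiply HugePages_Total by Hugepagesize (kB) as read
# from FILE (default /proc/meminfo) and print the reserved total in kB.
# Hypothetical helper for illustration.
hugetlb_kb() {
    awk '/^HugePages_Total:/ { n = $2 }
         /^Hugepagesize:/    { kb = $2 }
         END { printf "%d\n", n * kb }' "${1:-/proc/meminfo}"
}

# Usage:
#   hugetlb_kb        # compare with the Hugetlb line of /proc/meminfo
```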
Installing Dependency Packages#
Enter these commands to install the prerequisite packages:
$ sudo apt-get update
$ sudo apt-get install -y build-essential linux-headers-$(uname -r) dkms unzip linuxptp pv apt-utils net-tools
Installing DOCA OFED and Mellanox Firmware Tools on the Host#
Check whether MOFED is already installed on the host system.
$ ofed_info -s
OFED-internal-23.10-1.1.9:
If MOFED is present, uninstall it.
$ sudo /usr/sbin/ofed_uninstall.sh
Enter the following commands to install DOCA OFED.
# Install DOCA OFED
$ wget https://www.mellanox.com/downloads/DOCA/DOCA_v2.7.0/host/doca-host_2.7.0-204000-24.04-ubuntu2204_arm64.deb
$ sudo dpkg -i doca-host_2.7.0-204000-24.04-ubuntu2204_arm64.deb
$ sudo apt update
$ sudo apt install -y doca-ofed
# To check what version of OFED you have installed
$ ofed_info -s
OFED-internal-24.04-0.6.6:
Enter the following commands to install Mellanox Firmware Tools.
# Install Mellanox Firmware Tools
$ export MFT_VERSION=4.28.0-92
$ wget https://www.mellanox.com/downloads/MFT/mft-$MFT_VERSION-arm64-deb.tgz
$ tar xvf mft-$MFT_VERSION-arm64-deb.tgz
$ sudo mft-$MFT_VERSION-arm64-deb/install.sh
$ sudo mst version
mst, mft 4.28.0-92, built on Apr 25 2024, 15:22:48. Git SHA Hash: N/A
$ sudo mst start
# check NIC PCIe bus addresses and network interface names
$ sudo mst status -v
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
BlueField3(rev:1) /dev/mst/mt41692_pciconf1.1 0002:01:00.1 mlx5_3 net-aerial03 0
BlueField3(rev:1) /dev/mst/mt41692_pciconf1 0002:01:00.0 mlx5_2 net-aerial02 0
BlueField3(rev:1) /dev/mst/mt41692_pciconf0.1 0000:01:00.1 mlx5_1 net-aerial01 0
BlueField3(rev:1) /dev/mst/mt41692_pciconf0 0000:01:00.0 mlx5_0 net-aerial00 0
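The PCI-to-interface mapping in this table is what later steps rely on (for example, aerial00 at 0000:01:00.0). As a sketch, assuming the column layout shown above (PCI address in the third column, `net-` prefixed interface name in the fifth), the mapping can be extracted as plain pairs. The helper name `pci_to_netdev` is hypothetical.

```shell
#!/bin/sh
# pci_to_netdev: read `mst status -v` output on stdin and print one
# "PCI_ADDRESS INTERFACE" pair per BlueField port, sorted by PCI address.
# Hypothetical helper; assumes the column order shown in the example above.
pci_to_netdev() {
    awk '$1 ~ /^BlueField/ && $5 ~ /^net-/ {
             sub(/^net-/, "", $5)   # drop the "net-" prefix
             print $3, $5
         }' | sort
}

# Usage:
#   sudo mst status -v | pci_to_netdev
```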
Enter the following command to check the link status of port 0:
# Here is an example if the port 0 of fronthaul NIC is connected to another server or switch via a 200GbE DAC cable.
$ sudo mlxlink -d 0000:01:00.0
Operational Info
----------------
State : Active
Physical state : LinkUp
Speed : 200G
Width : 4x
FEC : Standard_RS-FEC - (544,514)
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed (Ext.) : 0x00003ff2 (200G_2X,200G_4X,100G_1X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Supported Cable Speed (Ext.) : 0x000017f2 (200G_4X,100G_2X,100G_4X,50G_1X,50G_2X,40G,25G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed
Tool Information
----------------
Firmware Version : 32.39.2048
amBER Version : 2.22
MFT Version : mft 4.26.1-3
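Installing CUDA Driver#
If an older version of the driver is installed on the system, use the following commands to unload the current driver modules and uninstall the old driver: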
# Unload the current driver modules
$ for m in $(lsmod | awk "/^[^[:space:]]*(nvidia|nv_|gdrdrv)/ {print \$1}"); do echo Unload $m...; sudo rmmod $m; done
# Remove the driver if it was installed by runfile installer before.
$ sudo /usr/bin/nvidia-uninstall
Create the driver module configuration with the following recommended settings:
$ cat <<EOF | sudo tee /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="RMNvLinkDisableLinks=0x3FFFF;"
EOF
Run the following commands to install the NVIDIA open-source GPU kernel driver (OpenRM).
# Install NVIDIA GPU driver 560.35.03 to run Aerial L1 in non-MIG mode.
$ wget https://us.download.nvidia.com/XFree86/aarch64/560.35.03/NVIDIA-Linux-aarch64-560.35.03.run
$ sudo sh NVIDIA-Linux-aarch64-560.35.03.run --silent -m kernel-open
# Verify that the driver is loaded successfully
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 480GB On | 00000009:01:00.0 Off | 0 |
| N/A 36C P0 115W / 900W | 1MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
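Installing GDRCopy Driver#
Run the following commands to install the GDRCopy driver. If an older version is installed on the system, remove the old driver first.
Warning
The GDRCopy driver must be installed after the CUDA driver.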
# Check the installed GDRCopy driver version
$ apt list --installed | grep gdrdrv-dkms
# Remove the driver, if you have the older version installed.
$ sudo apt purge gdrdrv-dkms
$ sudo apt autoremove
# Install GDRCopy driver
$ wget https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.2/ubuntu22_04/aarch64/gdrdrv-dkms_2.4-1_arm64.Ubuntu22_04.deb
$ sudo dpkg -i gdrdrv-dkms_2.4-1_arm64.Ubuntu22_04.deb
Installing Docker CE#
The full official instructions for installing Docker CE are available at https://docs.docker.net.cn/engine/install/ubuntu/#install-docker-engine. The following instructions are one supported way to install Docker CE.
Warning
The CUDA driver must be installed before Docker CE or nvidia-container-toolkit for them to work correctly.
$ sudo apt-get update
$ sudo apt-get install -y ca-certificates curl gnupg
$ sudo install -m 0755 -d /etc/apt/keyrings
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
$ sudo chmod a+r /etc/apt/keyrings/docker.gpg
$ echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
$ sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
$ sudo docker run --rm hello-world
Installing NVIDIA Container Toolkit#
Locate and follow the nvidia-container-toolkit installation instructions.
Alternatively, use the following instructions to install nvidia-container-toolkit. Version 1.16.2 is supported.
Warning
The CUDA driver must be installed before Docker CE or nvidia-container-toolkit for them to work correctly.
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
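Note
If nvidia-container-toolkit is already installed on an existing system, run nvidia-ctk --version to check the version. If the version is lower than 1.16.2, run the following commands to upgrade to the current version: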
$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1
$ sudo apt update
$ sudo apt-get install -y nvidia-container-toolkit
$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.16.2
commit: a5a5833c14a15fd9c86bcece85d5ec6621b65652
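The "lower than 1.16.2" decision can be scripted with a version-aware comparison. This is a sketch, assuming GNU `sort -V` and the `CLI version` line format shown in the example output; the helper names `ctk_version` and `version_ge` are hypothetical.

```shell
#!/bin/sh
# ctk_version: read `nvidia-ctk --version` output on stdin and print the
# version number from the "CLI version" line.
ctk_version() {
    awk '/CLI version/ { print $NF; exit }'
}

# version_ge A B: succeed when dotted version A is greater than or equal
# to B. Relies on GNU sort -V (version sort); hypothetical helper.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   v=$(nvidia-ctk --version | ctk_version)
#   version_ge "$v" 1.16.2 || echo "nvidia-container-toolkit $v needs an upgrade"
```

A plain string comparison would rank 1.9.0 above 1.16.2, which is why the sketch uses version sort.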
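Updating BF3 BFB Image and NIC Firmware#
Note
The following instructions are specifically for the BF3 NIC (OPN: 900-9D3B6-00CV-A; PSID: MT_0000000884).
With the following BFB image, there is no need to switch to DPU mode.
This BFB image updates the NIC firmware automatically.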
# Enable MST
$ sudo mst start
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt41692_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 01
/dev/mst/mt41692_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0002:01:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
Chip revision is: 01
# Download the BF3 BFB image
$ wget https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-2.7.0-33_24.04_ubuntu-22.04_prod.bfb
# Update the BFB image of the 1st BF3
$ sudo bfb-install -r rshim0 -b bf-bundle-2.7.0-33_24.04_ubuntu-22.04_prod.bfb
# Update the BFB image of the 2nd BF3
$ sudo bfb-install -r rshim1 -b bf-bundle-2.7.0-33_24.04_ubuntu-22.04_prod.bfb
Pushing bfb
1.41GiB 0:01:24 [17.1MiB/s] [ <=>]
Collecting BlueField booting status. Press Ctrl+C to stop…
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDDQ adjustment complete
INFO[BL2]: VDDQ: 1120 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Secured
INFO[BL31]: VDD: 851 mV
ERR[BL31]: MB timeout
INFO[BL31]: runtime
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled
INFO[BL31]: Partial NIC
INFO[BL31]: power capping disabled
INFO[UEFI]: exit Boot Service
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Installing OS image
INFO[MISC]: Ubuntu installation completed
WARN[MISC]: Skipping BMC components upgrade.
INFO[MISC]: Updating NIC firmware...
INFO[MISC]: NIC firmware update done
INFO[MISC]: Installation finished
# Wait 10 minutes to ensure the card initializes properly after the BFB installation
$ sleep 600
# NOTE: Requires a full power cycle from host with cold boot
# Verify NIC FW version after reboot
$ sudo mst start
$ sudo flint -d /dev/mst/mt41692_pciconf0 q
Image type: FS4
FW Version: 32.41.1000
FW Release Date: 28.4.2024
Product Version: 32.41.1000
Rom Info: type=UEFI Virtio net version=21.4.13 cpu=AMD64,AARCH64
type=UEFI Virtio blk version=22.4.13 cpu=AMD64,AARCH64
type=UEFI version=14.34.12 cpu=AMD64,AARCH64
type=PXE version=3.7.400 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 946dae0300f5aa8e 38
Base MAC: 946daef5aa8e 38
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000884
Security Attributes: secure-fw
Run the following commands to configure the BF3 NIC:
# Setting BF3 port to Ethernet mode (not Infiniband)
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=2
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P2=2
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set INTERNAL_CPU_MODEL=1
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set INTERNAL_CPU_PAGE_SUPPLIER=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set INTERNAL_CPU_ESWITCH_MANAGER=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set INTERNAL_CPU_IB_VPORT0=EXT_HOST_PF
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set CQE_COMPRESSION=1
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set PROG_PARSE_GRAPH=1
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set FLEX_PARSER_PROFILE_ENABLE=4
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set REAL_TIME_CLOCK_ENABLE=1
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set EXP_ROM_VIRTIO_NET_PXE_ENABLE=0
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set EXP_ROM_VIRTIO_NET_UEFI_ARM_ENABLE=0
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set EXP_ROM_VIRTIO_NET_UEFI_x86_ENABLE=0
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set EXP_ROM_VIRTIO_BLK_UEFI_ARM_ENABLE=0
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set EXP_ROM_VIRTIO_BLK_UEFI_x86_ENABLE=0
# NOTE: Requires a full power cycle from host with cold boot
# Verify that the NIC FW changes have been applied
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
INTERNAL_CPU_MODEL EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0 EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE DISABLED(1)
FLEX_PARSER_PROFILE_ENABLE 4
PROG_PARSE_GRAPH True(1)
ACCURATE_TX_SCHEDULER True(1)
CQE_COMPRESSION AGGRESSIVE(1)
REAL_TIME_CLOCK_ENABLE True(1)
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
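The query output can be checked against the intended settings automatically. The sketch below assumes the output format shown above, where each line is a parameter name followed either by a bare number or by a symbolic name with the numeric value in parentheses, such as `ETH(2)`. The helper name `check_mlxconfig` is hypothetical, and expected values are given numerically.

```shell
#!/bin/sh
# check_mlxconfig NAME=NUMBER...: read `mlxconfig ... q` output on stdin and
# compare the numeric value of each NAME with the expected NUMBER.
# Prints MISMATCH/MISSING lines and exits non-zero on any difference.
# Hypothetical helper; assumes the "NAME SYMBOL(N)" or "NAME N" line format.
check_mlxconfig() {
    awk -v want="$*" '
        BEGIN {
            n = split(want, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                expect[kv[1]] = kv[2]
            }
        }
        ($1 in expect) {
            val = $2
            # "ETH(2)" -> "2"; a bare "4" is left unchanged
            if (match(val, /\([0-9]+\)/))
                val = substr(val, RSTART + 1, RLENGTH - 2)
            seen[$1] = 1
            if (val != expect[$1]) {
                printf "MISMATCH %s: got %s, want %s\n", $1, val, expect[$1]
                bad = 1
            }
        }
        END {
            for (k in expect)
                if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
            exit bad + 0
        }'
}

# Usage:
#   sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | check_mlxconfig \
#       LINK_TYPE_P1=2 LINK_TYPE_P2=2 CQE_COMPRESSION=1 FLEX_PARSER_PROFILE_ENABLE=4
```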
Installing ptp4l and phc2sys#
Enter these commands to configure PTP4L, assuming the aerial00 NIC interface and CPU core 41 are used for PTP:
$ cat <<EOF | sudo tee /etc/ptp.conf
[global]
dataset_comparison G.8275.x
G.8275.defaultDS.localPriority 128
maxStepsRemoved 255
logAnnounceInterval -3
logSyncInterval -4
logMinDelayReqInterval -4
G.8275.portDS.localPriority 128
network_transport L2
domainNumber 24
tx_timestamp_timeout 30
slaveOnly 1
clock_servo pi
step_threshold 1.0
egressLatency 28
pi_proportional_const 4.65
pi_integral_const 0.1
[aerial00]
announceReceiptTimeout 3
delay_mechanism E2E
network_transport L2
EOF
$ cat <<EOF | sudo tee /lib/systemd/system/ptp4l.service
[Unit]
Description=Precision Time Protocol (PTP) service
Documentation=man:ptp4l
After=network.target
[Service]
Restart=always
RestartSec=5s
Type=simple
ExecStartPre=ifconfig aerial00 up
ExecStartPre=ethtool --set-priv-flags aerial00 tx_port_ts on
ExecStartPre=ethtool -A aerial00 rx off tx off
ExecStartPre=ifconfig aerial01 up
ExecStartPre=ethtool --set-priv-flags aerial01 tx_port_ts on
ExecStartPre=ethtool -A aerial01 rx off tx off
ExecStart=taskset -c 41 /usr/sbin/ptp4l -f /etc/ptp.conf
[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart ptp4l.service
$ sudo systemctl enable ptp4l.service
One server becomes the master clock, as shown below:
$ sudo systemctl status ptp4l.service
● ptp4l.service - Precision Time Protocol (PTP) service
Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-08-30 01:25:57 UTC; 2min 16s ago
Docs: man:ptp4l
Main PID: 3404 (ptp4l)
Tasks: 1 (limit: 598789)
Memory: 2.6M
CPU: 126ms
CGroup: /system.slice/ptp4l.service
└─3404 /usr/sbin/ptp4l -f /etc/ptp.conf
Aug 30 01:25:57 r750-01 ptp4l[3404]: [14.291] port 0: INITIALIZING to LISTENING on INIT_COMPLETE
Aug 30 01:25:57 r750-01 ptp4l[3404]: [14.291] port 1: link down
Aug 30 01:25:57 r750-01 ptp4l[3404]: [14.291] port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Aug 30 01:25:57 r750-01 ptp4l[3404]: [14.323] selected local clock a088c2.fffe.47be40 as best master
Aug 30 01:25:57 r750-01 ptp4l[3404]: [14.323] port 1: assuming the grand master role
Aug 30 01:26:56 r750-01 ptp4l[3404]: [73.338] port 1: link up
Aug 30 01:26:56 r750-01 ptp4l[3404]: [73.368] port 1: FAULTY to LISTENING on INIT_COMPLETE
Aug 30 01:26:57 r750-01 ptp4l[3404]: [73.860] port 1: LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
Aug 30 01:26:57 r750-01 ptp4l[3404]: [73.860] selected local clock a088c2.fffe.47be40 as best master
Aug 30 01:26:57 r750-01 ptp4l[3404]: [73.860] port 1: assuming the grand master role
The other server becomes the secondary (follower) clock, as shown below:
$ sudo systemctl status ptp4l.service
● ptp4l.service - Precision Time Protocol (PTP) service
Loaded: loaded (/lib/systemd/system/ptp4l.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-08-30 01:29:33 UTC; 47s ago
Docs: man:ptp4l
Process: 1509 ExecStartPre=ifconfig aerial00 up (code=exited, status=0/SUCCESS)
Process: 3069 ExecStartPre=ethtool --set-priv-flags aerial00 tx_port_ts on (code=exited, status=0/SUCCESS)
Process: 3755 ExecStartPre=ethtool -A aerial00 rx off tx off (code=exited, status=0/SUCCESS)
Process: 3822 ExecStartPre=ifconfig aerial01 up (code=exited, status=0/SUCCESS)
Process: 3827 ExecStartPre=ethtool --set-priv-flags aerial01 tx_port_ts on (code=exited, status=0/SUCCESS)
Process: 3862 ExecStartPre=ethtool -A aerial01 rx off tx off (code=exited, status=0/SUCCESS)
Main PID: 3870 (ptp4l)
Tasks: 1 (limit: 73247)
Memory: 9.2M
CPU: 183ms
CGroup: /system.slice/ptp4l.service
└─3870 /usr/sbin/ptp4l -f /etc/ptp.conf
Aug 30 01:30:12 aerial-mgx-cg1-01 ptp4l[3870]: [107.479] rms 3 max 6 freq +9551 +/- 12 delay -94 +/- 0
Aug 30 01:30:13 aerial-mgx-cg1-01 ptp4l[3870]: [108.479] rms 3 max 6 freq +9556 +/- 10 delay -94 +/- 0
Aug 30 01:30:14 aerial-mgx-cg1-01 ptp4l[3870]: [109.479] rms 3 max 4 freq +9552 +/- 13 delay -94 +/- 0
Aug 30 01:30:15 aerial-mgx-cg1-01 ptp4l[3870]: [110.479] rms 3 max 6 freq +9556 +/- 12 delay -94 +/- 1
Aug 30 01:30:16 aerial-mgx-cg1-01 ptp4l[3870]: [111.479] rms 3 max 7 freq +9558 +/- 14 delay -94 +/- 0
Aug 30 01:30:17 aerial-mgx-cg1-01 ptp4l[3870]: [112.479] rms 4 max 7 freq +9567 +/- 12 delay -94 +/- 0
Aug 30 01:30:18 aerial-mgx-cg1-01 ptp4l[3870]: [113.479] rms 3 max 5 freq +9569 +/- 7 delay -94 +/- 0
Aug 30 01:30:19 aerial-mgx-cg1-01 ptp4l[3870]: [114.479] rms 3 max 6 freq +9574 +/- 8 delay -94 +/- 1
Aug 30 01:30:20 aerial-mgx-cg1-01 ptp4l[3870]: [115.479] rms 3 max 5 freq +9577 +/- 9 delay -94 +/- 0
Aug 30 01:30:21 aerial-mgx-cg1-01 ptp4l[3870]: [116.479] rms 4 max 7 freq +9583 +/- 12 delay -94 +/- 0
Enter these commands to turn off NTP:
$ sudo timedatectl set-ntp false
$ timedatectl
Local time: Fri 2024-08-30 01:30:36 UTC
Universal time: Fri 2024-08-30 01:30:36 UTC
RTC time: Fri 2024-08-30 01:30:36
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
NTP service: inactive
RTC in local TZ: no
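Running PHC2SYS as a Service
PHC2SYS synchronizes the system clock to the PTP hardware clock (PHC) on the NIC.
Specify the network interface used for PTP and the system clock as the slave clock.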
# If more than one instance is already running, kill the existing
# PHC2SYS sessions.
# Command used can be found in /lib/systemd/system/phc2sys.service
# Update the ExecStart line to the following
$ cat <<EOF | sudo tee /lib/systemd/system/phc2sys.service
[Unit]
Description=Synchronize system clock or PTP hardware clock (PHC)
Documentation=man:phc2sys
Requires=ptp4l.service
After=ptp4l.service
[Service]
Restart=always
RestartSec=5s
Type=simple
# Gives ptp4l a chance to stabilize
ExecStartPre=sleep 2
ExecStart=/bin/sh -c "taskset -c 41 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial00 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
[Install]
WantedBy=multi-user.target
EOF
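The ExecStart line derives the PHC device from `ethtool -T`: the line `PTP Hardware Clock: 2` has the clock index as its fourth field, which is appended to `/dev/ptp`. A minimal sketch of that lookup, as a standalone function that reads the `ethtool -T` output on stdin, is shown below. The helper name `phc_device` is hypothetical; unlike the one-liner in the service file, it prints nothing when the interface reports no numeric clock index.

```shell
#!/bin/sh
# phc_device: read `ethtool -T IFACE` output on stdin and print the matching
# PTP hardware clock device, for example /dev/ptp2.
# Hypothetical helper; prints nothing if no numeric clock index is reported.
phc_device() {
    awk '/^PTP Hardware Clock:/ && $4 ~ /^[0-9]+$/ { print "/dev/ptp" $4 }'
}

# Usage:
#   ethtool -T aerial00 | phc_device
```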
After changing the PHC2SYS configuration file, run the following commands:
$ sudo systemctl daemon-reload
$ sudo systemctl restart phc2sys.service
# Set to start automatically on reboot
$ sudo systemctl enable phc2sys.service
# check that the service is active and has converged to a low rms value (<30) and that the correct NIC has been selected (aerial00):
$ sudo systemctl status phc2sys.service
● phc2sys.service - Synchronize system clock or PTP hardware clock (PHC)
Loaded: loaded (/lib/systemd/system/phc2sys.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-08-30 01:31:35 UTC; 18min ago
Docs: man:phc2sys
Process: 3871 ExecStartPre=sleep 2 (code=exited, status=0/SUCCESS)
Main PID: 4006 (sh)
Tasks: 2 (limit: 73247)
Memory: 6.0M
CPU: 3.628s
CGroup: /system.slice/phc2sys.service
├─4006 /bin/sh -c "taskset -c 41 /usr/sbin/phc2sys -s /dev/ptp\$(ethtool -T aerial00 | grep PTP | awk '{print \$4}') -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256"
└─4012 /usr/sbin/phc2sys -s /dev/ptp2 -c CLOCK_REALTIME -n 24 -O 0 -R 256 -u 256
Aug 30 01:48:09 aerial-mgx-c1-01 phc2sys[4012]: [1184.489] CLOCK_REALTIME rms 8 max 22 freq +5522 +/- 47 delay 480 +/- 0
Aug 30 01:48:10 aerial-mgx-c1-01 phc2sys[4012]: [1185.505] CLOCK_REALTIME rms 7 max 19 freq +5542 +/- 30 delay 480 +/- 2
Aug 30 01:48:11 aerial-mgx-c1-01 phc2sys[4012]: [1186.521] CLOCK_REALTIME rms 7 max 19 freq +5530 +/- 36 delay 480 +/- 0
Aug 30 01:48:12 aerial-mgx-c1-01 phc2sys[4012]: [1187.537] CLOCK_REALTIME rms 7 max 19 freq +5534 +/- 43 delay 480 +/- 2
Aug 30 01:48:13 aerial-mgx-c1-01 phc2sys[4012]: [1188.553] CLOCK_REALTIME rms 9 max 22 freq +5557 +/- 64 delay 480 +/- 0
Aug 30 01:48:14 aerial-mgx-c1-01 phc2sys[4012]: [1189.569] CLOCK_REALTIME rms 9 max 23 freq +5516 +/- 52 delay 480 +/- 0
Aug 30 01:48:15 aerial-mgx-c1-01 phc2sys[4012]: [1190.586] CLOCK_REALTIME rms 7 max 19 freq +5538 +/- 32 delay 480 +/- 0
Aug 30 01:48:16 aerial-mgx-c1-01 phc2sys[4012]: [1191.602] CLOCK_REALTIME rms 7 max 19 freq +5534 +/- 27 delay 480 +/- 0
Aug 30 01:48:17 aerial-mgx-c1-01 phc2sys[4012]: [1192.618] CLOCK_REALTIME rms 8 max 18 freq +5538 +/- 42 delay 480 +/- 0
Aug 30 01:48:18 aerial-mgx-c1-01 phc2sys[4012]: [1193.634] CLOCK_REALTIME rms 8 max 20 freq +5547 +/- 47 delay 480 +/- 0
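The convergence criterion mentioned above (rms below 30) can be checked from the log rather than by eye. The sketch below assumes the log line format shown in the example, where the value follows the literal token `rms`; the helper name `max_rms` is hypothetical.

```shell
#!/bin/sh
# max_rms: read ptp4l or phc2sys log lines on stdin and print the largest
# value that follows an "rms" token (0 if no such token is found).
# Hypothetical helper for illustration.
max_rms() {
    awk '{
             for (i = 1; i < NF; i++)
                 if ($i == "rms") {
                     v = $(i + 1) + 0
                     if (v > m) m = v
                 }
         }
         END { print m + 0 }'
}

# Usage (checks the most recent 20 log lines of the service):
#   [ "$(journalctl -u phc2sys.service -n 20 --no-pager | max_rms)" -lt 30 ] \
#       && echo "phc2sys converged"
```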
Verify that the system clock is synchronized:
$ timedatectl
Local time: Fri 2024-08-30 01:48:25 UTC
Universal time: Fri 2024-08-30 01:48:25 UTC
RTC time: Fri 2024-08-30 01:48:25
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
NTP service: inactive
RTC in local TZ: no
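Setting Up the Boot Configuration Service#
Create the /usr/local/bin directory, then create the /usr/local/bin/nvidia.sh file to run commands on every reboot.
Note
The "nvidia-smi lgc" command expects exactly one GPU device (-i 0) and must be modified if the system uses more than one GPU. The mode must be set to 1 for GH200 so that it can use the maximum clock frequency; otherwise, with the default mode of 0, it is limited to 1830 MHz.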
$ cat <<"EOF" | sudo tee /usr/local/bin/nvidia.sh
#!/bin/bash
mst start
nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1) --mode=1
nvidia-smi -mig 0
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
EOF
Create a system service file so that the script is run after the network interfaces are up.
$ cat <<EOF | sudo tee /lib/systemd/system/nvidia.service
[Unit]
After=network.target
[Service]
ExecStart=/usr/local/bin/nvidia.sh
[Install]
WantedBy=default.target
EOF
Create a system service file to run nvidia-persistenced at startup.
Note
This file was created following the sample in /usr/share/doc/NVIDIA_GLX-1.0/samples/nvidia-persistenced-init.tar.bz2.
$ cat <<EOF | sudo tee /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
EOF
Then set the file permissions, reload the systemd daemon, enable the services, restart the services on first installation, and check their status:
$ sudo chmod 744 /usr/local/bin/nvidia.sh
$ sudo chmod 664 /lib/systemd/system/nvidia.service
$ sudo chmod 664 /lib/systemd/system/nvidia-persistenced.service
$ sudo systemctl daemon-reload
$ sudo systemctl enable nvidia-persistenced.service
$ sudo systemctl enable nvidia.service
$ sudo systemctl restart nvidia.service
$ sudo systemctl restart nvidia-persistenced.service
$ sudo systemctl status nvidia.service
$ sudo systemctl status nvidia-persistenced.service
The output of the two status commands should look like the following:
$ sudo systemctl status nvidia.service
○ nvidia.service
Loaded: loaded (/lib/systemd/system/nvidia.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2024-06-07 20:11:55 UTC; 2s ago
Process: 3300619 ExecStart=/usr/local/bin/nvidia.sh (code=exited, status=0/SUCCESS)
Main PID: 3300619 (code=exited, status=0/SUCCESS)
CPU: 1.091s
Jun 07 20:11:54 server nvidia.sh[3300620]: Loading MST PCI module - Success
Jun 07 20:11:54 server nvidia.sh[3300620]: [warn] mst_pciconf is already loaded, skipping
Jun 07 20:11:54 server nvidia.sh[3300620]: Create devices
Jun 07 20:11:55 server nvidia.sh[3300620]: Unloading MST PCI module (unused) - Success
Jun 07 20:11:55 server nvidia.sh[3302599]: GPU clocks set to "(gpuClkMin 1980, gpuClkMax 1980)" for GPU 00000009:01:00.0
Jun 07 20:11:55 server nvidia.sh[3302599]: All done.
Jun 07 20:11:55 server nvidia.sh[3302600]: Disabled MIG Mode for GPU 00000009:01:00.0
Jun 07 20:11:55 server nvidia.sh[3302600]: All done.
Jun 07 20:11:55 server systemd[1]: nvidia.service: Deactivated successfully.
Jun 07 20:11:55 server systemd[1]: nvidia.service: Consumed 1.091s CPU time.
$ sudo systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-06-05 21:42:17 UTC; 1 day 22h ago
Main PID: 1858 (nvidia-persiste)
Tasks: 1 (limit: 146899)
Memory: 36.5M
CPU: 2.353s
CGroup: /system.slice/nvidia-persistenced.service
└─1858 /usr/bin/nvidia-persistenced
Jun 05 21:42:15 server systemd[1]: Starting NVIDIA Persistence Daemon...
Jun 05 21:42:15 server nvidia-persistenced[1858]: Started (1858)
Jun 05 21:42:17 server systemd[1]: Started NVIDIA Persistence Daemon.
Running Aerial on Grace Hopper#
The default MGX CG1 configuration files in the Aerial source code are:
cuPHY-CP/cuphycontroller/config/cuphycontroller_F08_CG1.yaml
cuPHY-CP/cuphycontroller/config/l2_adapter_config_F08_CG1.yaml
Pass F08_CG1 to the cuphycontroller_scf executable to select them.