DGX Software Stack
NVIDIA DGX Software Packages
The following tables list the packages installed as part of the DGX software stack, grouped by metapackage name.
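nvidia-system-core

| Package name | Description |
|---|---|
| cuda-compute-repo | CUDA compute repository configuration files. |
| dgx-release | Package that updates the DGX OS release information. |
| dgx-repo | DGX repository configuration files. |
| hpc-sdk-repo | NVIDIA HPC SDK repository configuration files. |
| msecli | Micron Storage Executive CLI. |
| nv-common-apis | Installs scripts commonly used on NVIDIA systems. |
| nv-cpu-governor | Sets the CPU governor to performance mode. |
| nv-env-paths | Configures the PATH variable. |
| nv-grubmenu | Makes the GRUB menu visible. |
| nv-grubserial | Displays the GRUB menu on the serial console. |
| nv-iommu | Enables the IOMMU in passthrough mode; enables intel_iommu on systems with Emerald Rapids CPUs. |
| nv-ipmi-devintf | Loads the ipmi_devintf module. |
| nv-limits | Raises the file limits. |
| nv-update-disable | Disables operating system update prompts. |
| nvgpu-services-list | Lists all GPU-related services. |
| nvidia-acs-disable | Disables the PCIe ACS feature. |
| nvidia-crashdump | NVIDIA crash dump policy. |
| nvidia-disable-init-on-alloc | Disables zeroing of heap memory on allocation. |
| nvidia-disable-numa-balancing | Disables automatic page-fault NUMA memory balancing. |
| nvidia-earlycon | Sets earlycon with no options. |
| nvidia-enable-power-meter-cap | Enables the power capping feature in the ACPI power meter. |
| nvidia-esm-hook-epilogue | NVIDIA package that clarifies the ESM policy. |
| nvidia-fs-loader | Loads the nvidia-fs module. |
| nvidia-ipmisol | Enables IPMI Serial over LAN. |
| nvidia-kernel-defaults | Default sysctl kernel settings for DGX. |
| nvidia-mig-manager | NVIDIA MIG partition editor and systemd service. |
| nvidia-pci-bridge-power | Sets PCI bridge power control to on. |
| nvidia-pci-realloc | Forces PCI resource reallocation. |
| nvidia-raid-config | DGX RAID configuration. |
| nvidia-redfish-config | Configures the Redfish host interface. |
| nvidia-relaxed-ordering-gpu | Configures PCIe relaxed ordering. |
| nvidia-relaxed-ordering-nvme | Configures PCIe relaxed ordering. |
| nvidia-repo-keys | Adds keys to the apt trusted.gpg database. |
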
nvidia-system-utils

| Package name | Description |
|---|---|
| nv-persistence-mode | Enables persistence mode. |
| nvidia-conf-cachefilesd | systemd settings for cachefilesd. |
| nvidia-fs-loader | Loads the nvidia-fs module. |
| nvidia-logrotate | NVIDIA log rotation policy. |
| nvidia-motd | Custom motd files for NVIDIA platforms. |
| nvsm | REST API service for DGX system management. |

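nvidia-system-mlnx-drivers

| Package name | Description |
|---|---|
| doca-ofed | The doca-ofed metapackage. |
| doca-repo | DOCA repository configuration files. |
| mlnx-nfsrdma-dkms | DKMS support for the NFS RDMA kernel module. |
| mlnx-nvme-dkms | DKMS support for the nvme kernel module. |
| mlnx-pxe-setup | Provides a script to enable PXE boot with Mellanox network adapters. |
| nvidia-ib-umad-loader | Loads the ib_umad module. |
| nvidia-mlnx-config | Configures MLNX devices. |
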
DGX Kernel Parameters

| Parameter name | Description | Package | Location |
|---|---|---|---|
| crashkernel | Amount of memory for the crash dump. | nvidia-crashdump | /etc/default/grub.d/ipmisol.cfg |
| console=ttyS[0-1],115200n8 | Sets the console to serial port 0 or 1 at 115200 baud, no parity, and 8 data bits. For dgx-h100 and dgx-h800: console=ttyS0,115200n8. Other system types: console=ttyS1,115200n8. | nvidia-ipmisol | kernel cmdline |
| net.ipv4.conf.all.arp_announce = 2 | Always use the best local address for this target. | nvidia-kernel-defaults | /etc/sysctl.d/20-nvidia-defaults.conf |
| net.ipv4.conf.default.arp_announce = 2 | Always use the best local address for this target. | nvidia-kernel-defaults | /etc/sysctl.d/20-nvidia-defaults.conf |
| net.ipv4.conf.all.arp_ignore = 1 | Reply only to ARP requests received on the interface that holds the target IP address. | nvidia-kernel-defaults | /etc/sysctl.d/20-nvidia-defaults.conf |
| net.ipv4.conf.default.arp_ignore = 1 | Reply only to ARP requests received on the interface that holds the target IP address. | nvidia-kernel-defaults | /etc/sysctl.d/20-nvidia-defaults.conf |
| setpci -d ::207 68.w=5000:f000 | Sets the MaxReadReq size of all network (2) InfiniBand (07) devices to 4KB. | nvidia-mlnx-config | /etc/systemd/system/nvidia-mlnx-config.service |
| setpci -d ::207 68.w | Sets the MaxReadReq size of all network (2) InfiniBand (07) devices to 4KB. | nvidia-mlnx-config | /etc/systemd/system/nvidia-mlnx-config.service |
| NVreg_EnablePCIERelaxedOrderingMode=1 | Sets a registry key that enables PCIe relaxed ordering in the GPU. | nvidia-relaxed-ordering-gpu | /etc/modprobe.d/nvidia-relaxed-ordering.conf |

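The `setpci -d ::207 68.w=5000:f000` entry uses the `value:mask` form, in which only the bits set in the mask are changed. The following minimal sketch shows why that write yields a 4KB MaxReadReq. It assumes that offset 0x68 is the PCIe Device Control register on these adapters, which the table does not state; in that register, bits 14:12 encode the maximum read request size as 128 bytes shifted left by the field value. The starting register value is hypothetical.

```python
# Sketch: effect of a setpci-style "value:mask" write on a 16-bit register,
# and decoding of the MaxReadReq field.
# Assumption: offset 0x68 is the PCIe Device Control register on the device.

def apply_masked_write(current: int, value: int, mask: int) -> int:
    """Return the register after a VALUE:MASK write; only masked bits change."""
    return ((current & ~mask) | (value & mask)) & 0xFFFF

def max_read_request_bytes(device_control: int) -> int:
    """Decode Max_Read_Request_Size (bits 14:12) as 128 << field bytes."""
    field = (device_control >> 12) & 0x7
    return 128 << field

# Hypothetical starting value whose MaxReadReq field is 2 (512 bytes).
before = 0x2936
after = apply_masked_write(before, 0x5000, 0xF000)

print(hex(before), max_read_request_bytes(before))  # 0x2936 512
print(hex(after), max_read_request_bytes(after))    # 0x5936 4096
```

The low 12 bits of the register are left untouched, which is the reason for writing through a mask instead of overwriting the whole word.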
DGX Platform JSON Configuration
{
"dgx_a800":
{
"PlatformType": "DGX A800",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "2048M,high"
},
"dgx_a100":
{
"PlatformType": "DGX A100",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h100":
{
"PlatformType": "DGX H100",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h200":
{
"PlatformType": "DGX H200",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h800":
{
"PlatformType": "DGX H100",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_b200":
{
"PlatformType": "DGX B200",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"CrashdumpMem": "2048M,high"
},
"dgxstation_a100":
{
"PlatformType": "DGXSTATION A100",
"ConfigureDGXA100Raid": "False",
"ConfigureDGXStationA100Raid": "True",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"HasRedfishIntf": "True",
"UsesNetplan": "False",
"EnablePowerMeterCap": "False",
"UsesNetworkManager": "True",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "True",
"NeedsInitialNvidiaXconfig": "True",
"NeedsAdaptiveNvidiaXconfig": "True",
"NeedsContainerdOverride": "True",
"UsesFabricManager": "False",
"IsDgxServerType": "False",
"IsDgxDesktopType": "True",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
},
"dgxstation_a800":
{
"PlatformType": "DGXSTATION A800",
"ConfigureDGXA100Raid": "False",
"ConfigureDGXStationA100Raid": "True",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"HasRedfishIntf": "True",
"UsesNetplan": "False",
"EnablePowerMeterCap": "False",
"UsesNetworkManager": "True",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "True",
"NeedsInitialNvidiaXconfig": "True",
"NeedsAdaptiveNvidiaXconfig": "True",
"NeedsContainerdOverride": "True",
"UsesFabricManager": "False",
"IsDgxServerType": "False",
"IsDgxDesktopType": "True",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
}
}
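Every property in the configuration above is stored as a string, including the booleans ("True" and "False"), and some keys exist only for certain platforms (for example, UsesNetplan appears only under the DGX Station entries). The sketch below shows one way a script might read such a file. The helper names `platform_flag` and `platform_value` are hypothetical, and the embedded sample is an abbreviated excerpt of the entries above, not the full file.

```python
import json

# Abbreviated excerpt in the same shape as the platform configuration above.
SAMPLE = """
{
  "dgx_h100": {
    "PlatformType": "DGX H100",
    "GpuRelaxedOrdering": "False",
    "IPMIDefaultSerialTTY": "ttyS0",
    "PciRealloc": "off",
    "CrashdumpMem": "1G-:1024M"
  },
  "dgx_a100": {
    "PlatformType": "DGX A100",
    "GpuRelaxedOrdering": "True",
    "IPMIDefaultSerialTTY": "ttyS1",
    "PciRealloc": "",
    "CrashdumpMem": "1G-:1024M"
  }
}
"""

def platform_flag(config: dict, platform: str, key: str, default: bool = False) -> bool:
    """Read a "True"/"False" string property as a bool (hypothetical helper).

    Keys that a platform does not define fall back to `default`.
    """
    value = config.get(platform, {}).get(key)
    if value is None:
        return default
    return value.strip().lower() == "true"

def platform_value(config: dict, platform: str, key: str, default: str = "") -> str:
    """Read a plain string property such as IPMIDefaultSerialTTY (hypothetical helper)."""
    return config.get(platform, {}).get(key, default)

config = json.loads(SAMPLE)
print(platform_flag(config, "dgx_a100", "GpuRelaxedOrdering"))      # True
print(platform_flag(config, "dgx_h100", "GpuRelaxedOrdering"))      # False
print(platform_value(config, "dgx_h100", "IPMIDefaultSerialTTY"))   # ttyS0
```

Comparing against the string "true" explicitly avoids the common mistake of calling `bool("False")`, which evaluates to `True` in Python because the string is non-empty.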
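DGX Platform JSON Configuration Definitions

| Name | Definition |
|---|---|
| PlatformType | Printable string representation of the platform type (for example, DGX H100). |
| ConfigureDGXA100Raid | Used by |
| ConfigureDGXStationA100Raid | Used to create a RAID array with a DGX Station A100-like disk arrangement: single U.2, no RAID. |
| NVMERelaxedOrdering | The package installs |
| GpuRelaxedOrdering | The nvidia-relaxed-ordering-gpu package calls this function to change GPU driver settings according to the platform. |
| EnablePowerMeterCap | The nvidia-enable-power-meter-cap package configures the server to enable power capping in the ACPI power meter on Grace-based platforms. Set it to |
| BMCPasswordMinLength | The nvidia-oem-config-plugins package creates oem-config screens such as EULA and BMC, which use this property to set the BMC password requirements during ISO installation. |
| BMCPasswordMaxLength | The nvidia-oem-config-plugins package creates oem-config screens such as EULA and BMC, which use this property to set the BMC password requirements during ISO installation. |
| BMCPasswordSupportsZerofill | The nvidia-oem-config-plugins package creates oem-config screens such as EULA and BMC, which use this property to set the BMC password requirements during ISO installation. |
| BMCPasswordComplexityReq | The nvidia-oem-config-plugins package creates oem-config screens such as EULA and BMC, which use this property to set the BMC password requirements during ISO installation. |
| NVSMAlertsSupported | NVSM is supported only on DGX platforms. If NVSM is installed, nvidia-motd changes the message of the day to display NVSM alerts. |
| NeedsMRRSConfig | nvidia-mlnx-config uses this property to run mlxconfig and apply various PCI settings. Only on DGX A100, DGX A800, and DGX2, it sets the MaxReadReq size of all network (2) InfiniBand (07) devices to 4KB. |
| NeedsAccBytesTuning | nvidia-mlnx-config uses this property to run mlxconfig and apply various PCI settings. |
| IPMIDefaultSerialTTY | Sets the default IPMI serial console port in the GRUB kernel parameters. |
| NeedsOEMXconfigOverride | For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig creates the oem-config override service. |
| NeedsInitialNvidiaXconfig | For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig calls nvidia xconfig and creates an empty initial configuration. |
| NeedsAdaptiveNvidiaXconfig | For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig calls nvidia xconfig and creates an empty initial configuration. |
| NeedsContainerdOverride | The nv-docker-gpus package checks this for dgxstation_a100 or dgxstation_a800. In these cases, the package restricts nvidia docker to 3D controller class GPUs. |
| NeedsOemConfigPostActNetplanApply | nvidia-oem-config-postact checks for DCS and DCS legacy platforms, which forces “netplan apply” after the OEM ISO installation. |
| UsesFabricManager | For DGX2 through DGX B200 platforms, the dgx-release-upgrade package checks this to install the correct nvidia-fabricmanager package for the GPU driver. |
| IsDgxServerType | During a release upgrade, dgx-release-upgrade checks this to install packages only for DGX servers. |
| IsDgxDesktopType | During a release upgrade, dgx-release-upgrade checks this to install packages only for DGX workstations (DGX Station A100 and so on). |
| NeedsDisableNumaBalance | On Grace-based platforms, configures the system to disable automatic page-fault NUMA memory balancing. |
| NeedsIommuPt | For AMD Rome platforms, sets |
| NeedsDisableInitOnAlloc | Sets the option |
| NeedsEarlycon | Sets the option |
| PciRealloc | Determines whether to set grub to |
| CrashdumpMem | The kdump service uses this value to reserve crash kernel memory for each kernel. The minimum crash kernel size may vary with the hardware and machine specifications. |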
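The two CrashdumpMem values that appear in the platform configuration, `1G-:1024M` and `2048M,high`, match the forms accepted by the Linux `crashkernel=` boot parameter: a `range:size` list, where the size applies when installed RAM falls within the range, and a fixed size with a `,high` suffix. The sketch below interprets both forms. It assumes the value is passed to `crashkernel=` unchanged, which this section does not state explicitly, and the function names are hypothetical.

```python
# Sketch: interpret crashkernel-style reservation strings.
# Assumption: CrashdumpMem is handed to the kernel's crashkernel= parameter as-is.

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_size(text: str) -> int:
    """Convert a size such as '1024M' or '1G' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)

def crashkernel_reservation(spec: str, ram_bytes: int) -> int:
    """Return the bytes reserved by `spec` on a machine with `ram_bytes` of RAM."""
    spec = spec.split("@", 1)[0]  # an optional @offset does not affect the size
    if ":" not in spec:
        # Fixed form: "size", "size,high", or "size,low".
        return parse_size(spec.split(",", 1)[0])
    # Range form: "start-[end]:size[,start-[end]:size...]".
    for entry in spec.split(","):
        ram_range, size = entry.split(":", 1)
        start, _, end = ram_range.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None
        # The start of a range is inclusive and the end is exclusive.
        if ram_bytes >= low and (high is None or ram_bytes < high):
            return parse_size(size)
    return 0  # no range matched, so nothing is reserved

two_tib = 2 << 40
print(crashkernel_reservation("1G-:1024M", two_tib) // (1 << 20))   # 1024
print(crashkernel_reservation("2048M,high", two_tib) // (1 << 20))  # 2048
```

Read this way, `1G-:1024M` reserves 1024 MB on any system with at least 1 GB of RAM, while `2048M,high` always requests 2048 MB and lets the kernel place it in high memory.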