Slurm Setup

Update the interface name on the slogin nodes.
% device use slogin-01
1. If slogin-01 does not have the expected interface name, update the interface name.
  % use networkdevicename
  % set networkdevicename new-name

Assign the MAC address to the slogin node.

device use slogin-01
set mac <MAC address>

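Power on and install the slogin nodes.
Run the bcm-install-slurm script.
Use the following parameters:
1. --bcm-media: the installation source. It can be a USB device or the path to an .iso file.
2. -A: run the script in air-gapped mode.
3. --ignore-ha-errors: append this if CMHA is set up but there are failover ping errors.
4. --ignore-missing-login-node: append this if there is only one slogin node.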
  bcm-install-slurm -A --bcm-media <path to installer image or usb device to mount>
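The choice of optional flags can be sketched as a small helper that assembles the command line before anything is run. This is only an illustration: `build_slurm_install_cmd` is a hypothetical function name, and the flags are the ones listed above. It prints the command rather than executing it, so the result can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: assemble the bcm-install-slurm command line from the options in this guide.
# build_slurm_install_cmd is a hypothetical helper, not part of BCM.
# Arguments: <media path> [ignore_ha: yes|no] [single_login: yes|no]
build_slurm_install_cmd() {
    local media="$1"
    local ignore_ha="${2:-no}"
    local single_login="${3:-no}"
    local cmd=(bcm-install-slurm -A --bcm-media "${media}")

    if [ "${ignore_ha}" = "yes" ]; then
        cmd+=(--ignore-ha-errors)
    fi
    if [ "${single_login}" = "yes" ]; then
        cmd+=(--ignore-missing-login-node)
    fi

    # Print only; paths containing spaces would need quoting before reuse.
    printf '%s\n' "${cmd[*]}"
}

# Example (the .iso path is a placeholder):
# build_slurm_install_cmd /mnt/bcm.iso yes no
```

Printing instead of executing keeps the helper safe to try on any machine; the printed line can then be pasted on the head node.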
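Before provisioning the DGX nodes, confirm that the slurmd file exists in the DGX image, and create it if it does not.
DGX A100 and DGX H100 systems require the same file. This example is for DGX H100 systems. NCCL tests that use PMIX have been observed to require this file.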
cat /cm/images/dgx-os-6.3-h100-image/etc/sysconfig/slurmd
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
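The "create it if it does not exist" step could be scripted roughly as follows. This is a minimal sketch: `ensure_slurmd_sysconfig` is a hypothetical helper name, and the image root is passed as an argument so the function can be rehearsed against a scratch directory before it touches a real image.

```shell
#!/usr/bin/env bash
# Sketch: create etc/sysconfig/slurmd inside a software image only if it is missing.
# ensure_slurmd_sysconfig is a hypothetical helper, not a BCM tool.
ensure_slurmd_sysconfig() {
    local image_root="$1"
    local target="${image_root}/etc/sysconfig/slurmd"

    # Leave an existing file untouched.
    if [ -e "${target}" ]; then
        echo "exists: ${target}"
        return 0
    fi

    mkdir -p "$(dirname "${target}")"
    # File content as shown in this guide.
    cat > "${target}" <<'EOF'
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
EOF
    echo "created: ${target}"
}

# Example (head node, image path as used in this guide):
# ensure_slurmd_sysconfig /cm/images/dgx-os-6.3-h100-image
```

The function never overwrites an existing file, so running it twice is harmless.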

Reboot the slogin and compute nodes.

cmsh
device
reboot -c slogin
reboot -c dgx-h100

To simplify configuration, remove the slurm-client configuration overlay and rename the slurm-client-gpu overlay to take over that name.

cmsh
configurationoverlay
remove slurm-client
commit
use slurm-client-gpu
set name slurm-client
commit

For DGX A100 systems, clear the Type value and set the correct core affinity for each GPU entry to get maximum performance.

cmsh
configurationoverlay
use slurm-client
roles
use slurmclient
genericresources
use gpu0
clear type
set cores 48-63,176-191
use gpu1
clear type
set cores 48-63,176-191
use gpu2
clear type
set cores 16-31,144-159
use gpu3
clear type
set cores 16-31,144-159
use gpu4
clear type
set cores 112-127,240-255
use gpu5
clear type
set cores 112-127,240-255
use gpu6
clear type
set cores 80-95,208-223
use gpu7
clear type
set cores 80-95,208-223
commit
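Typing eight near-identical stanzas by hand is error-prone, so the command list can also be generated. The sketch below rests on an assumption that should be checked against `nvidia-smi topo -m` on the actual system: each GPU pair shares one 16-core block, and the second range is the first shifted by 128 (the sibling hardware threads on a 128-core, 256-thread DGX A100). `a100_gpu_cores` and `a100_gres_commands` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: print the cmsh commands that set GPU core affinity on DGX A100.
# Assumption: second core range = first core range + 128 (SMT siblings).

# Print the core list for one GPU index (0-7).
a100_gpu_cores() {
    local gpu="$1"
    # First core of the 16-core block for each GPU pair: gpu0/1, gpu2/3, gpu4/5, gpu6/7.
    local starts=(48 16 112 80)
    local start="${starts[$((gpu / 2))]}"
    local end=$((start + 15))
    echo "${start}-${end},$((start + 128))-$((end + 128))"
}

# Print the full command sequence for the slurm-client overlay.
a100_gres_commands() {
    local gpu
    echo "configurationoverlay"
    echo "use slurm-client"
    echo "roles"
    echo "use slurmclient"
    echo "genericresources"
    for gpu in 0 1 2 3 4 5 6 7; do
        echo "use gpu${gpu}"
        echo "clear type"
        echo "set cores $(a100_gpu_cores "${gpu}")"
    done
    echo "commit"
}

# Example: write the commands to a file, review them, then run them in cmsh.
# a100_gres_commands > /tmp/a100-gres.cmsh
```

Reviewing the generated file before feeding it to cmsh keeps a human check between the assumption and the cluster configuration.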

For DGX H100 systems, the generic resources are set to autodetect.

Use this script.

cmsh
wlm
set gpuautodetect nvml
commit
configurationoverlay
use slurm-client
roles
use slurmclient
set gpuautodetect nvml
commit
genericresources
foreach * (remove)
commit
add autodetected-gpus
set name gpu
set count 8
set addtogresconfig yes
commit

Note

addtogresconfig defaults to YES, so it does not need to be set explicitly.

This should produce output like the following.

[vikingbcmhead-01->configurationoverlay*[slurm-client*]->roles*[slurmclient*]->genericresources*[autodetected-gpus]]% ls
Alias (key)        Name     Type     Count    File
------------------ -------- -------- -------- ----------------
autodetected-gpus  gpu      H100     8

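BCM updates the gres.conf file automatically. These settings match what the various scripts and tools in the NVIDIA ecosystem expect, which maximizes the compatibility of this environment with them.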
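Once the nodes are back up, one way to confirm that Slurm sees the expected GPU count is to filter the node list. The following is a rough sketch, not part of the BCM tooling: `nodes_with_wrong_gpu_count` is a hypothetical helper that reads `<node> <gres>` pairs on standard input, and it assumes each node reports a single gpu entry such as `gpu:8` or `gpu:H100:8`.

```shell
#!/usr/bin/env bash
# Sketch: list nodes whose reported GPU count differs from the expected value.
# Reads "<node> <gres>" lines on stdin; assumes one gpu entry per node.
nodes_with_wrong_gpu_count() {
    local expected="${1:-8}"
    awk -v want="${expected}" '
        {
            count = 0
            gres = $2
            # Drop a trailing "(...)" suffix such as socket information.
            sub(/\(.*/, "", gres)
            n = split(gres, parts, ":")
            # The count is the last colon-separated field of a gpu entry.
            if (parts[1] == "gpu") count = parts[n] + 0
            if (count != want) print $1
        }'
}

# Example on a cluster with Slurm running:
# sinfo -N -h -o "%N %G" | nodes_with_wrong_gpu_count 8
```

An empty result means every listed node reports the expected number of GPUs.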

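If the /home directory is not mounted on the nodes, increase the number of retries. Because of a race condition between the bond0 interface coming up and /home being mounted, /home is sometimes left unmounted. Increasing the number of retries should resolve this issue.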
```
category
use dgx-h100
fsmounts
use /home
set mountoptions "x-systemd.mount-timeout=150,defaults,_netdev,retry=5,vers=3"
```
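To tell whether a node was affected by the race, it helps to check for the mount directly. The helper below is a minimal sketch with a hypothetical name, `is_mounted`; it scans a mounts table (by default `/proc/mounts`) for an exact mount point match and reports the result through its exit status.

```shell
#!/usr/bin/env bash
# Sketch: succeed if <mountpoint> appears in the mounts table, fail otherwise.
# Usage: is_mounted <mountpoint> [mounts-file]
is_mounted() {
    local target="$1"
    local mounts_file="${2:-/proc/mounts}"
    # Field 2 of each line in the mounts table is the mount point.
    awk -v t="${target}" '$2 == t { found = 1 } END { exit !found }' "${mounts_file}"
}

# Example, run on a compute node:
# if is_mounted /home; then echo "/home is mounted"; else echo "/home is NOT mounted"; fi
```

The optional second argument exists only so the check can be tried against a saved copy of a mounts table.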

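Pod setup can leave stale repositories behind in air-gapped environments. In that case, the repository files below must be fixed manually on the login nodes.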

cd /etc/apt/sources.list.d/

Disable the local repository and restore the other repository files by renaming them.

mv local.list local.list.disabled
mv cm.disabled cm.list
mv cm-ml.disabled cm-ml.list
mv /etc/apt/sources.disabled /etc/apt/sources.list
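Because a login node may hold only some of these files, the renames can be wrapped so that missing files are skipped instead of raising errors. This is a sketch under assumptions: `rename_if_present` and `fix_stale_apt_sources` are hypothetical helper names, and the apt directory is a parameter so the sequence can be rehearsed on a copy.

```shell
#!/usr/bin/env bash
# Sketch: apply the apt source renames from this section, skipping missing files.
# Both function names are hypothetical helpers, not part of BCM.

rename_if_present() {
    local src="$1"
    local dst="$2"
    if [ -e "${src}" ]; then
        mv "${src}" "${dst}"
        echo "renamed: ${src} -> ${dst}"
    else
        echo "skipped (not found): ${src}"
    fi
}

fix_stale_apt_sources() {
    # Defaults to /etc/apt; pass another directory to rehearse the change.
    local apt_dir="${1:-/etc/apt}"
    local list_dir="${apt_dir}/sources.list.d"

    rename_if_present "${list_dir}/local.list"      "${list_dir}/local.list.disabled"
    rename_if_present "${list_dir}/cm.disabled"     "${list_dir}/cm.list"
    rename_if_present "${list_dir}/cm-ml.disabled"  "${list_dir}/cm-ml.list"
    rename_if_present "${apt_dir}/sources.disabled" "${apt_dir}/sources.list"
}

# Example, run as root on a login node:
# fix_stale_apt_sources
```

Each rename is reported on its own line, which leaves a simple record of what changed on the node.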