H200 Node Configuration
Download the H200 tar file to the /root directory on the head node.
wget https://support2.brightcomputing.com/h200-parker/DGXOS-6.3.1-H200.tar.gz -P /root
Use cm-create-image to add the H200 image to cmsh. When prompted, enter c to continue.
cm-create-image --fromarchive /root/DGXOS-6.3.1-H200.tar.gz --imagename dgx-6.3.1-h200-image --skipdist

Running validate base tar........................ [ OK ]

Running sanity check............................. [ OK ]

Running unpack base tar.......................... [ OK ]
 ******************** IMPORTANT ****************************
 Please confirm that the base distribution repositories for
 the software image are enabled. For instructions on how to
 enable repositories for your software image, please refer
 the administrator's manual.


 Image creation can be resumed in one of the following ways:
 -----------------------------------------------------------
 1. Enter 'e' to exit, and configure repositories.
    Then, restart program with the -d (--fromdir) option.
    cm-create-image -d /cm/images/dgx-6.3.1-h200-image -n dgx-6.3.1-h200-image

 2. Open a new console, and configure repositories.
    Then enter 'c' on this console, to continue software
    image creation.

 ***********************************************************

Continue(c)/Exit(e)? c


Finalize base distribution....................... [ OK ]

Copying cm repo files............................ [ OK ]

Validating repo configuration.................... [ OK ]

Finalizing image services........................ [ OK ]

Installing CM packages........................... [ OK ]

Finalizing cluster services...................... [ OK ]

Copying cluster certificate to image............. [ OK ]

Adding/Updating software image................... [ OK ]
In cmsh, go to softwareimage mode and verify that the H200 image has been created.
cmsh
softwareimage
ls

Name (key)                        Path (key)                                   Kernel version      Nodes
--------------------------------- -------------------------------------------- ------------------- --------
dgx-6.3.1-h200-image              /cm/images/dgx-6.3.1-h200-image              5.15.0-1063-nvidia  0
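The same check can be run non-interactively from the head node. The following is a minimal sketch, assuming cmsh is on the PATH and that the listing keeps the image name in its first column, as in the output shown for this step. The helper name image_listed is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Sketch: confirm that the H200 image appears in the cmsh software image listing.
IMAGE="dgx-6.3.1-h200-image"

# Succeeds if any row read from stdin has the image name as its first column.
# (image_listed is a hypothetical helper, not part of cmsh.)
image_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only query the cluster manager when cmsh is actually available.
if command -v cmsh >/dev/null 2>&1; then
  if cmsh -c "softwareimage; ls" | image_listed "$IMAGE"; then
    echo "image $IMAGE is registered"
  else
    echo "image $IMAGE not found" >&2
  fi
fi
```

Matching on the first column rather than on the whole line avoids a false positive from the Path column, which also contains the image name.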
Add the bonding kernel module to the H200 image.
cmsh
softwareimage
use dgx-6.3.1-h200-image
kernelmodules
add bonding
commit
In category mode, clone dgx-h100 to create dgx-h200, and set its software image to the H200 image.
cmsh
category
clone dgx-h100 dgx-h200
use dgx-h200
set softwareimage dgx-6.3.1-h200-image
commit
Assign the dgx-h200 category to the H200 nodes and power them on.
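This last step can be sketched as a small script that drives cmsh in device mode, one node at a time. The node names dgx-h200-01 and dgx-h200-02 are hypothetical placeholders, and the APPLY variable is an invention of this sketch: by default the script only prints the cmsh invocations so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: assign the dgx-h200 category to each H200 node and power it on.
CATEGORY="dgx-h200"
# Hypothetical node names; replace with the H200 hostnames defined in cmsh.
NODES="dgx-h200-01 dgx-h200-02"

# Build the semicolon-separated command string passed to "cmsh -c" for one node.
# Arguments: node name, category name.
node_cmds() {
  printf 'device; use %s; set category %s; commit; power on' "$1" "$2"
}

for node in $NODES; do
  cmds=$(node_cmds "$node" "$CATEGORY")
  if [ "${APPLY:-0}" = "1" ]; then
    # Apply the change on the head node.
    cmsh -c "$cmds"
  else
    # Dry run (default): show what would be executed.
    echo "cmsh -c \"$cmds\""
  fi
done
```

Run the script as is to review the commands, then rerun it with APPLY=1 on the head node to apply them.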