H200 Node Configuration

  1. Download the H200 tar file to the /root directory on the head node.

    wget https://support2.brightcomputing.com/h200-parker/DGXOS-6.3.1-H200.tar.gz -P /root
    
  2. Use cm-create-image to add the H200 image to cmsh. When prompted, enter c to continue image creation.

    cm-create-image --fromarchive /root/DGXOS-6.3.1-H200.tar.gz --imagename dgx-6.3.1-h200-image --skipdist

    Running validate base tar........................ [  OK  ]

    Running sanity check............................. [  OK  ]

    Running unpack base tar.......................... [  OK  ]
        ******************** IMPORTANT ****************************
        Please confirm that the base distribution repositories for
        the software image are enabled. For instructions on how to
        enable repositories for your software image, please refer
        the administrator's manual.


        Image creation can be resumed in one of the following ways:
        -----------------------------------------------------------
        1. Enter 'e' to exit, and configure repositories.
            Then, restart program with the -d (--fromdir) option.
            cm-create-image -d /cm/images/dgx-6.3.1-h200-image -n dgx-6.3.1-h200-image

        2. Open a new console, and configure repositories.
            Then enter 'c' on this console, to continue software
            image creation.

        ***********************************************************

    Continue(c)/Exit(e)? c


    Finalize base distribution....................... [  OK  ]

    Copying cm repo files............................ [  OK  ]

    Validating repo configuration.................... [  OK  ]

    Finalizing image services........................ [  OK  ]

    Installing CM packages........................... [  OK  ]

    Finalizing cluster services...................... [  OK  ]

    Copying cluster certificate to image............. [  OK  ]

    Adding/Updating software image................... [  OK  ]
    
  3. In cmsh, go to softwareimage mode and verify that the H200 image was created.

    cmsh
    softwareimage
    ls

    Name (key)                        Path (key)                                   Kernel version      Nodes
    --------------------------------- -------------------------------------------- ------------------- --------
    dgx-6.3.1-h200-image              /cm/images/dgx-6.3.1-h200-image              5.15.0-1063-nvidia  0
    
  4. Add the bonding kernel module to the H200 image.

    cmsh
    softwareimage
    use dgx-6.3.1-h200-image
    kernelmodules
    add bonding
    commit
    
  5. In category mode, clone dgx-h100 to a new category named dgx-h200, and set its software image to the H200 image.

    cmsh
    category
    clone dgx-h100 dgx-h200
    use dgx-h200
    set softwareimage dgx-6.3.1-h200-image
    commit
    
  6. Assign the dgx-h200 category to the H200 nodes and power them on.
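
The last step is given without commands. The following is a minimal dry-run sketch of one way to script it, not the documented procedure. It assumes the category is named dgx-h200 as in the previous step, and that the node names passed on the command line (for example dgx-h200-01, a hypothetical name) already exist in cmsh device mode. The helper name node_cmds and the APPLY variable are illustrative only. The `power on` form relies on cmsh acting on the currently selected device; check `help power` in device mode on your release before running it.

```shell
#!/bin/sh
# Dry-run sketch for the final step. CATEGORY is the category created in the
# previous step; node names are hypothetical placeholders given by the caller.
CATEGORY="${CATEGORY:-dgx-h200}"

# Print the cmsh command sequence for one node: select it in device mode,
# assign the category, commit, then power on the selected node.
node_cmds() {
    printf 'device; use %s; set category %s; commit; power on\n' "$1" "$CATEGORY"
}

# Print each cmsh invocation by default; run it only when APPLY=1 is set.
for node in "$@"; do
    if [ "${APPLY:-0}" = "1" ]; then
        cmsh -c "$(node_cmds "$node")"
    else
        echo "cmsh -c \"$(node_cmds "$node")\""
    fi
done
```

Saved under a name of your choice and run with one or more node names as arguments, the script only prints the commands it would issue. Setting APPLY=1 in the environment sends them to cmsh on the head node.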