Network Operator

  1. Verify that the NVIDIA Mellanox OFED package version on the DGX systems matches the version listed in the DGX OS release notes.

      # cmsh
      % device
      % pexec -c dgx-a100 -j "ofed_info -s"
      [dgx01..dgx04]
      MLNX_OFED_LINUX-23.10-0.5.5.0:
    

    The correct InfiniBand interfaces used in the compute fabric must be identified and their operational state checked. As noted earlier, mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are used, and they should be verified to be in working order. Each interface on every node should report State: Active, Physical state: LinkUp, and Link layer: InfiniBand.

  2. Verify that the interfaces are working correctly using the following command; a sketch for mapping the device names to their netdev names follows the output.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do ibstat -d mlx5_${i} | grep -i \"mlx5_\\|state\\|infiniband\"; done"
      [dgx01..dgx04]
      CA 'mlx5_0'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_2'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_6'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_8'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
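
    The SriovNetworkNodePolicy files created later refer to these ports by their netdev names (ibp12s0, ibp75s0, ibp141s0, and ibp186s0). A quick way to confirm which mlx5_X device backs which netdev name is sketched below; it assumes the ibdev2netdev utility shipped with MLNX_OFED is available on the DGX nodes.

      [basepod-head1->device]% pexec -c dgx-a100 -j "ibdev2netdev"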
    
  3. Check the SRIOV interface status.

    1. NUM_OF_VFS should be set to 8.

    2. SRIOV_EN should be True(1).

    3. LINK_TYPE_P1 should be IB(1).

    In this example, only LINK_TYPE_P1 is set correctly. The other values must be set in the next step.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""
      [dgx01..dgx04]
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
    
  4. Enable SRIOV and set NUM_OF_VFS to 8 for each interface.

    Because LINK_TYPE_P1 is already set correctly, only the other two values are set below; the pending values can be re-checked with the sketch after the output.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} -y set SRIOV_EN=1 NUM_OF_VFS=8; done"
      [dgx01..dgx04]
      Starting MST (Mellanox Software Tools) driver set
      Loading MST PCI module - Success
      [warn] mst_pciconf is already loaded, skipping
      Create devices
      Unloading MST PCI module (unused) - Success

      Device #1:
      ----------

      Device type:    ConnectX6
      Name:           MCX653105A-HDA_Ax
      Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
      Device:         /dev/mst/mt4123_pciconf0

      Configurations:                              Next Boot       New
              SRIOV_EN                            False(0)        True(1)
              NUM_OF_VFS                          0               8

      Apply new Configuration? (y/n) [n] : y
      Applying... Done!
      -I- Please reboot machine to load new configurations.
      . . . some output omitted . . .
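
    Optionally, before rebooting, the pending values can be confirmed by rerunning the query from step 3 (a sketch; mlxconfig reports the next-boot values):

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|NUM_OF_VFS\""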
    
  5. Reboot the DGX nodes to load the new configuration.

    % reboot -c dgx-a100
    
  6. Wait for the DGX nodes to come back UP before continuing to the next step.

      % list -c dgx-a100 -f hostname:20,category:10,ip:20,status:10
      hostname (key)       category   ip                   status
      -------------------- ---------- -------------------- ----------
      dgx01                dgx-a100   10.184.71.11         [   UP   +
      dgx02                dgx-a100   10.184.71.12         [   UP   +
      dgx03                dgx-a100   10.184.71.13         [   UP   +
      dgx04                dgx-a100   10.184.71.14         [   UP   +
    
  7. Configure eight SRIOV VFs on the InfiniBand ports; a read-back check is sketched after the command.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do echo 8 > /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"
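
    To confirm that the VFs were created, the same sysfs files can be read back (a sketch; each device should now report 8):

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do cat /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"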
    
  8. On the primary head node, load the Kubernetes environment module. A quick sanity check is sketched after the command.

    # module load kubernetes/default/1.27.11-150500.1.1
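
    A minimal sanity check, assuming the module configures kubectl and its kubeconfig for this cluster:

      # module list
      # kubectl get nodes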
    
  9. Add the Network Operator Helm repo and update the local repo cache; an optional check of the repo contents follows the output.

      # helm repo add nvidia-networking https://mellanox.github.io/network-operator
      "nvidia-networking" has been added to your repositories

      # helm repo update
      Hang tight while we grab the latest from your chart repositories...
      ...Successfully got an update from the "nvidia-networking" chart repository
      ...Successfully got an update from the "prometheus-community" chart repository
      ...Successfully got an update from the "nvidia" chart repository
      Update Complete. ⎈Happy Helming!⎈
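
    Optionally, confirm that the chart is visible in the updated repo cache (a sketch; the chart name network-operator is assumed):

      # helm search repo nvidia-networking/network-operator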
    
  10. Create the ./network-operator directory.

    # mkdir ./network-operator
    
  11. Create the values.yaml file used for the Helm installation of the Network Operator; an installation sketch follows the file contents.

      # vi ./network-operator/values.yaml

      nfd:
        enabled: true
      sriovNetworkOperator:
        enabled: true

      # NicClusterPolicy CR values:
      deployCR: true
      ofedDriver:
        deploy: false
      rdmaSharedDevicePlugin:
        deploy: false
      sriovDevicePlugin:
        deploy: false

      secondaryNetwork:
        deploy: true
        multus:
          deploy: true
        cniPlugins:
          deploy: true
        ipamPlugin:
          deploy: true
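
    The install command itself is not shown in this section; what follows is a minimal sketch of how such a values file is typically consumed, assuming the release is installed into a network-operator namespace (adjust the release name, namespace, and chart version for your deployment):

      # helm install network-operator nvidia-networking/network-operator \
          -n network-operator --create-namespace \
          -f ./network-operator/values.yaml --wait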
    
  12. Create the sriov-ib-network-node-policy.yaml file.

      # vi ./network-operator/sriov-ib-network-node-policy.yaml

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp12s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp12s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp12s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp75s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp75s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp75s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp141s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp141s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp141s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp186s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp186s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp186s0
    
  13. Create the sriovibnetwork.yaml file.

      # vi ./network-operator/sriovibnetwork.yaml

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp12s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.1.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp12s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp75s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.2.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp75s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibpi141s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.3.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp141s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp186s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.4.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp186s0
        linkState: enable
        networkNamespace: default
    
  14. Deploy the configuration files; a verification sketch follows the output.

      # kubectl apply -f ./network-operator/sriov-ib-network-node-policy.yaml
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp12s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp75s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp141s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp186s0 created

      # kubectl apply -f ./network-operator/sriovibnetwork.yaml
      sriovibnetwork.sriovnetwork.openshift.io/ibp12s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibp75s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibpi141s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibp186s0 created
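
    To verify that the policies and networks were accepted, the created objects and the advertised node resources can be inspected (a sketch; the resource names come from the files above, and the NetworkAttachmentDefinitions are expected in the default namespace because of networkNamespace):

      # kubectl get sriovnetworknodepolicies -n network-operator
      # kubectl get sriovibnetworks -n network-operator
      # kubectl get network-attachment-definitions -n default
      # kubectl describe node dgx01 | grep nvidia.com/resibp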
    
  15. Deploy the mpi-operator; a readiness check is sketched after the output.

      # kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
      namespace/mpi-operator created
      customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
      serviceaccount/mpi-operator created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
      clusterrole.rbac.authorization.k8s.io/mpi-operator created
      clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
      deployment.apps/mpi-operator created
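
    Before submitting MPIJobs, confirm that the operator pod is running (a sketch):

      # kubectl get pods -n mpi-operator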
    
  16. Copy the Network Operator /opt/cni/bin directory to /cm/shared, where the head node will access it.

      # ssh dgx01
      # cp -r /opt/cni/bin /cm/shared/dgx_opt_cni_bin
      # exit
    
  17. Create the network-validation.yaml file and run a simple validation test.

      # vi network-operator/network-validation.yaml

      apiVersion: v1
      kind: Pod
      metadata:
        name: network-validation-pod
      spec:
        containers:
          - name: network-validation-pod
            image: docker.io/deepops/nccl-tests:latest
            imagePullPolicy: IfNotPresent
            command:
              - sh
              - -c
              - sleep inf
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            resources:
              requests:
                nvidia.com/resibp75s0: "1"
                nvidia.com/resibp186s0: "1"
                nvidia.com/resibp12s0: "1"
                nvidia.com/resibp141s0: "1"
              limits:
                nvidia.com/resibp75s0: "1"
                nvidia.com/resibp186s0: "1"
                nvidia.com/resibp12s0: "1"
                nvidia.com/resibp141s0: "1"
    
  18. Apply the network-validation.yaml file.

      # kubectl apply -f ./network-operator/network-validation.yaml
      pod/network-validation-pod created
    

    If the pod runs successfully without any errors, the network validation test has passed.
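
    A minimal way to confirm this from the head node (the pod name comes from the manifest above):

      # kubectl get pod network-validation-pod
      # kubectl describe pod network-validation-pod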

  19. Run a multi-node NCCL test.

    The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking, which are fundamental to many AI/ML training and deep learning applications. A successful multi-node NCCL test is a good indicator that multi-node MPI and NCCL communication between GPUs is operating correctly. Create the nccl_test.yaml file in the ./network-operator directory.

      # vi ./network-operator/nccl_test.yaml

      apiVersion: kubeflow.org/v2beta1
      kind: MPIJob
      metadata:
        name: nccltest
      spec:
        slotsPerWorker: 8
        runPolicy:
          cleanPodPolicy: Running
        mpiReplicaSpecs:
          Launcher:
            replicas: 1
            template:
              spec:
                containers:
                  - image: docker.io/deepops/nccl-tests:latest
                    name: nccltest
                    imagePullPolicy: IfNotPresent
                    command:
                      - sh
                      - "-c"
                      - |
                        /bin/bash << 'EOF'

                        mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_ALGO=RING -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl self,tcp -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 172.29.0.0/16 /nccl_tests/build/all_reduce_perf -b 8 -e 4G -f2 -g 1

                        EOF
          Worker:
            replicas: 4
            template:
              metadata:
              spec:
                containers:
                  - image: docker.io/deepops/nccl-tests:latest
                    name: nccltest
                    imagePullPolicy: IfNotPresent
                    securityContext:
                      capabilities:
                        add: ["IPC_LOCK"]
                    resources:
                      limits:
                        nvidia.com/resibp12s0: "1"
                        nvidia.com/resibp75s0: "1"
                        nvidia.com/resibp141s0: "1"
                        nvidia.com/resibp186s0: "1"
                        nvidia.com/gpu: 8
    
  20. Run the nccl_test.yaml file.

      # kubectl apply -f ./network-operator/nccl_test.yaml
      mpijob.kubeflow.org/nccltest created
      root@basepod-head1:~#
      # kubectl get pods
      NAME                      READY   STATUS    RESTARTS   AGE
      nccltest-launcher-9pp28   1/1     Running   0          3m6s
      nccltest-worker-0         1/1     Running   0          3m6s
      nccltest-worker-1         1/1     Running   0          3m6s
      nccltest-worker-2         1/1     Running   0          3m6s
      nccltest-worker-3         1/1     Running   0          3m6s
    

    To view the logs, run kubectl logs nccltest-launcher-<ID>. An example is shown below, followed by a sketch for extracting the bandwidth summary.

    (Image: network-operator-1.png — sample output of kubectl logs for the nccltest launcher pod.)
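
    In place of the screenshot, a hedged example of pulling the result summary from the launcher logs (the pod name is taken from the kubectl get pods output above; the grep pattern assumes the standard nccl-tests summary lines):

      # kubectl logs nccltest-launcher-9pp28 | grep -i "bus bandwidth"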