Network Operator

  1. Verify that the NVIDIA Mellanox OFED package version on the DGX systems matches the version listed in the DGX OS release notes.

      # cmsh
      % device
      % pexec -c dgx-a100 -j "ofed_info -s"
      [dgx01..dgx04]
      MLNX_OFED_LINUX-23.10-0.5.5.0:
    

    The correct InfiniBand interfaces used in the compute fabric must be identified and their operational state checked. As noted earlier, mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are used, and they should be verified to be in working order. Each interface on every node should report State: Active, Physical state: LinkUp, and Link layer: InfiniBand.

  2. Verify that the interfaces are working correctly using the following command; a sketch for mapping the device names to their netdev names follows the output.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do ibstat -d mlx5_${i} | grep -i \"mlx5_\\|state\\|infiniband\"; done"
      [dgx01..dgx04]
      CA 'mlx5_0'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_2'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_6'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
      CA 'mlx5_8'
                      State: Active
                      Physical state: LinkUp
                      Link layer: InfiniBand
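
    The SriovNetworkNodePolicy files created later refer to these ports by their netdev names (ibp12s0, ibp75s0, ibp141s0, and ibp186s0). A quick way to confirm which mlx5_X device backs which netdev name is sketched below; it assumes the ibdev2netdev utility shipped with MLNX_OFED is available on the DGX nodes.

      [basepod-head1->device]% pexec -c dgx-a100 -j "ibdev2netdev"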
    
  3. Check the SRIOV interface status.

    1. NUM_OF_VFS should be set to 8.

    2. SRIOV_EN should be True(1).

    3. LINK_TYPE_P1 should be IB(1).

    In this example, only LINK_TYPE_P1 is set correctly. The other values must be set in the next step.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|LINK_TYPE\\|NUM_OF_VFS\""
      [dgx01..dgx04]
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
              NUM_OF_VFS                          0
              SRIOV_EN                            False(0)
              LINK_TYPE_P1                        IB(1)
    
  4. Enable SRIOV and set NUM_OF_VFS to 8 for each interface.

    Because LINK_TYPE_P1 is already set correctly, only the other two values are set below; the pending values can be re-checked with the sketch after the output.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} -y set SRIOV_EN=1 NUM_OF_VFS=8; done"
      [dgx01..dgx04]
      Starting MST (Mellanox Software Tools) driver set
      Loading MST PCI module - Success
      [warn] mst_pciconf is already loaded, skipping
      Create devices
      Unloading MST PCI module (unused) - Success

      Device #1:
      ----------

      Device type:    ConnectX6
      Name:           MCX653105A-HDA_Ax
      Description:    ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
      Device:         /dev/mst/mt4123_pciconf0

      Configurations:                              Next Boot       New
              SRIOV_EN                            False(0)        True(1)
              NUM_OF_VFS                          0               8

      Apply new Configuration? (y/n) [n] : y
      Applying... Done!
      -I- Please reboot machine to load new configurations.
      . . . some output omitted . . .
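
    Optionally, before rebooting, the pending values can be confirmed by rerunning the query from step 3 (a sketch; mlxconfig reports the next-boot values):

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do mst start; mlxconfig -d /dev/mst/mt4123_pciconf${i} q; done | grep -e \"SRIOV_EN\\|NUM_OF_VFS\""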
    
  5. Reboot the DGX nodes to load the new configuration.

    % reboot -c dgx-a100
    
  6. Wait for the DGX nodes to come back UP before continuing to the next step.

      % list -c dgx-a100 -f hostname:20,category:10,ip:20,status:10
      hostname (key)       category   ip                   status
      -------------------- ---------- -------------------- ----------
      dgx01                dgx-a100   10.184.71.11         [   UP   +
      dgx02                dgx-a100   10.184.71.12         [   UP   +
      dgx03                dgx-a100   10.184.71.13         [   UP   +
      dgx04                dgx-a100   10.184.71.14         [   UP   +
    
  7. Configure eight SRIOV VFs on the InfiniBand ports; a read-back check is sketched after the command.

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do echo 8 > /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"
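
    To confirm that the VFs were created, the same sysfs files can be read back (a sketch; each device should now report 8):

      [basepod-head1->device]% pexec -c dgx-a100 -j "for i in 0 2 6 8; do cat /sys/class/infiniband/mlx5_${i}/device/sriov_numvfs; done"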
    
  8. On the primary head node, load the Kubernetes environment module. A quick sanity check is sketched after the command.

    # module load kubernetes/default/1.27.11-150500.1.1
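
    A minimal sanity check, assuming the module configures kubectl and its kubeconfig for this cluster:

      # module list
      # kubectl get nodes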
    
  9. Add the Network Operator Helm repo and update the local repo cache; an optional check of the repo contents follows the output.

      # helm repo add nvidia-networking https://mellanox.github.io/network-operator
      "nvidia-networking" has been added to your repositories

      # helm repo update
      Hang tight while we grab the latest from your chart repositories...
      ...Successfully got an update from the "nvidia-networking" chart repository
      ...Successfully got an update from the "prometheus-community" chart repository
      ...Successfully got an update from the "nvidia" chart repository
      Update Complete. ⎈Happy Helming!⎈
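
    Optionally, confirm that the chart is visible in the updated repo cache (a sketch; the chart name network-operator is assumed):

      # helm search repo nvidia-networking/network-operator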
    
  10. Create the ./network-operator directory.

    # mkdir ./network-operator
    
  11. Create the values.yaml file used for the Helm installation of the Network Operator; an installation sketch follows the file contents.

      # vi ./network-operator/values.yaml

      nfd:
        enabled: true
      sriovNetworkOperator:
        enabled: true

      # NicClusterPolicy CR values:
      deployCR: true
      ofedDriver:
        deploy: false
      rdmaSharedDevicePlugin:
        deploy: false
      sriovDevicePlugin:
        deploy: false

      secondaryNetwork:
        deploy: true
        multus:
          deploy: true
        cniPlugins:
          deploy: true
        ipamPlugin:
          deploy: true
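
    The install command itself is not shown in this section; what follows is a minimal sketch of how such a values file is typically consumed, assuming the release is installed into a network-operator namespace (adjust the release name, namespace, and chart version for your deployment):

      # helm install network-operator nvidia-networking/network-operator \
          -n network-operator --create-namespace \
          -f ./network-operator/values.yaml --wait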
    
  12. Create the sriov-ib-network-node-policy.yaml file.

      # vi ./network-operator/sriov-ib-network-node-policy.yaml

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp12s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp12s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp12s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp75s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp75s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp75s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp141s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp141s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp141s0

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: ibp186s0
        namespace: network-operator
      spec:
        deviceType: netdevice
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        nicSelector:
          vendor: "15b3"
          pfNames: ["ibp186s0"]
        linkType: ib
        isRdma: true
        numVfs: 8
        priority: 90
        resourceName: resibp186s0
    
  13. Create the sriovibnetwork.yaml file.

      # vi ./network-operator/sriovibnetwork.yaml

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp12s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.1.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp12s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp75s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.2.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp75s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibpi141s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.3.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp141s0
        linkState: enable
        networkNamespace: default

      ---
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp186s0
        namespace: network-operator
      spec:
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.4.0/24",
            "log_file": "/var/log/whereabouts.log",
            "log_level": "info"
          }
        resourceName: resibp186s0
        linkState: enable
        networkNamespace: default
    
  14. Deploy the configuration files; a verification sketch follows the output.

      # kubectl apply -f ./network-operator/sriov-ib-network-node-policy.yaml
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp12s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp75s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp141s0 created
      sriovnetworknodepolicy.sriovnetwork.openshift.io/ibp186s0 created

      # kubectl apply -f ./network-operator/sriovibnetwork.yaml
      sriovibnetwork.sriovnetwork.openshift.io/ibp12s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibp75s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibpi141s0 created
      sriovibnetwork.sriovnetwork.openshift.io/ibp186s0 created
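
    To verify that the policies and networks were accepted, the created objects and the advertised node resources can be inspected (a sketch; the resource names come from the files above, and the NetworkAttachmentDefinitions are expected in the default namespace because of networkNamespace):

      # kubectl get sriovnetworknodepolicies -n network-operator
      # kubectl get sriovibnetworks -n network-operator
      # kubectl get network-attachment-definitions -n default
      # kubectl describe node dgx01 | grep nvidia.com/resibp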
    
  15. Deploy the mpi-operator; a readiness check is sketched after the output.

      # kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
      namespace/mpi-operator created
      customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
      serviceaccount/mpi-operator created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
      clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
      clusterrole.rbac.authorization.k8s.io/mpi-operator created
      clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
      deployment.apps/mpi-operator created
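
    Before submitting MPIJobs, confirm that the operator pod is running (a sketch):

      # kubectl get pods -n mpi-operator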
    
  16. Copy the Network Operator /opt/cni/bin directory to /cm/shared, where the head node will access it.

      # ssh dgx01
      # cp -r /opt/cni/bin /cm/shared/dgx_opt_cni_bin
      # exit
    
  17. Create the network-validation.yaml file and run a simple validation test.

      # vi network-operator/network-validation.yaml

      apiVersion: v1
      kind: Pod
      metadata:
        name: network-validation-pod
      spec:
        containers:
          - name: network-validation-pod
            image: docker.io/deepops/nccl-tests:latest
            imagePullPolicy: IfNotPresent
            command:
              - sh
              - -c
              - sleep inf
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]
            resources:
              requests:
                nvidia.com/resibp75s0: "1"
                nvidia.com/resibp186s0: "1"
                nvidia.com/resibp12s0: "1"
                nvidia.com/resibp141s0: "1"
              limits:
                nvidia.com/resibp75s0: "1"
                nvidia.com/resibp186s0: "1"
                nvidia.com/resibp12s0: "1"
                nvidia.com/resibp141s0: "1"
    
  18. Apply the network-validation.yaml file.

      # kubectl apply -f ./network-operator/network-validation.yaml
      pod/network-validation-pod created
    

    If the pod runs successfully without any errors, the network validation test has passed.
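
    A minimal way to confirm this from the head node (the pod name comes from the manifest above):

      # kubectl get pod network-validation-pod
      # kubectl describe pod network-validation-pod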

  19. Run a multi-node NCCL test.

    The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking, which are fundamental to many AI/ML training and deep learning applications. A successful multi-node NCCL test is a good indicator that multi-node MPI and NCCL communication between GPUs is operating correctly. Create the nccl_test.yaml file in the ./network-operator directory.

      # vi ./network-operator/nccl_test.yaml

      apiVersion: kubeflow.org/v2beta1
      kind: MPIJob
      metadata:
        name: nccltest
      spec:
        slotsPerWorker: 8
        runPolicy:
          cleanPodPolicy: Running
        mpiReplicaSpecs:
          Launcher:
            replicas: 1
            template:
              spec:
                containers:
                  - image: docker.io/deepops/nccl-tests:latest
                    name: nccltest
                    imagePullPolicy: IfNotPresent
                    command:
                      - sh
                      - "-c"
                      - |
                        /bin/bash << 'EOF'

                        mpirun --allow-run-as-root -np 4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_ALGO=RING -x NCCL_IB_DISABLE=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl self,tcp -mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 172.29.0.0/16 /nccl_tests/build/all_reduce_perf -b 8 -e 4G -f2 -g 1

                        EOF
          Worker:
            replicas: 4
            template:
              metadata:
              spec:
                containers:
                  - image: docker.io/deepops/nccl-tests:latest
                    name: nccltest
                    imagePullPolicy: IfNotPresent
                    securityContext:
                      capabilities:
                        add: ["IPC_LOCK"]
                    resources:
                      limits:
                        nvidia.com/resibp12s0: "1"
                        nvidia.com/resibp75s0: "1"
                        nvidia.com/resibp141s0: "1"
                        nvidia.com/resibp186s0: "1"
                        nvidia.com/gpu: 8
    
  20. Run the nccl_test.yaml file.

      # kubectl apply -f ./network-operator/nccl_test.yaml
      mpijob.kubeflow.org/nccltest created
      root@basepod-head1:~#
      # kubectl get pods
      NAME                      READY   STATUS    RESTARTS   AGE
      nccltest-launcher-9pp28   1/1     Running   0          3m6s
      nccltest-worker-0         1/1     Running   0          3m6s
      nccltest-worker-1         1/1     Running   0          3m6s
      nccltest-worker-2         1/1     Running   0          3m6s
      nccltest-worker-3         1/1     Running   0          3m6s
    

    To view the logs, run kubectl logs nccltest-launcher-<ID>. An example is shown below, followed by a sketch for extracting the bandwidth summary.

    (Image: network-operator-1.png — sample output of kubectl logs for the nccltest launcher pod.)
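
    In place of the screenshot, a hedged example of pulling the result summary from the launcher logs (the pod name is taken from the kubectl get pods output above; the grep pattern assumes the standard nccl-tests summary lines):

      # kubectl logs nccltest-launcher-9pp28 | grep -i "bus bandwidth"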