Create the cluster policy instance

Added in version 2.0.

Create the cluster policy using the web console

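  1. In the OpenShift Container Platform web console, select Operators > Installed Operators from the side menu, then click NVIDIA GPU Operator.

  2. Select the ClusterPolicy tab, then click Create ClusterPolicy.

    Note

    The platform assigns the default name gpu-cluster-policy.

  3. You can use this screen to customize the ClusterPolicy, but the default settings are sufficient to get the GPUs configured and running.

  4. Click Create.

  5. At this point, the GPU Operator proceeds to install all the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take a while, so wait at least 10 to 20 minutes before starting any troubleshooting.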
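The wait in step 5 can also be scripted from the CLI. The following is a minimal sketch rather than part of the original procedure. It assumes `oc` is already logged in to the cluster and that the ClusterPolicy reports its state in the `.status.state` field, which is what the console shows as State:ready. The function name `wait_for_clusterpolicy` and its default timeout and polling interval are hypothetical.

```shell
# Hypothetical helper: poll the ClusterPolicy until it reports "ready".
# Arguments: [name] [timeout in seconds] [polling interval in seconds]
wait_for_clusterpolicy() {
  local name="${1:-gpu-cluster-policy}"
  local timeout="${2:-1200}"   # 20 minutes, the upper end of the suggested wait
  local interval="${3:-30}"
  local elapsed=0
  local state=""

  while [ "$elapsed" -lt "$timeout" ]; do
    # Assumption: the state is exposed at .status.state on the ClusterPolicy.
    state="$(oc get clusterpolicy "$name" -o jsonpath='{.status.state}' 2>/dev/null || true)"
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy $name is ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "ClusterPolicy $name not ready after ${timeout}s (last state: ${state:-unknown})" >&2
  return 1
}
```

Calling `wait_for_clusterpolicy gpu-cluster-policy 1200 30` would poll every 30 seconds for up to 20 minutes and return a non-zero status if the policy never reaches the ready state, which is a reasonable point to begin troubleshooting.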

Verify the cluster policy

When the installation succeeds, the status of the newly deployed NVIDIA GPU Operator's ClusterPolicy, gpu-cluster-policy, changes to State:ready.

_images/os-on-bm-cluster1.png

To verify from the CLI that the GPUs are available on the nodes, run:

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'

This lists each node along with the number of GPUs it exposes to Kubernetes.

Example output

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
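On larger clusters it can help to reduce this table to a single summary line. The sketch below is an illustration only: the function name `summarize_gpus` is hypothetical, and it assumes the two-column layout shown above, where nodes without GPUs report `<none>`.

```shell
# Hypothetical helper: read the two-column node/GPU table on stdin and print
# how many nodes expose GPUs and how many GPUs they expose in total.
summarize_gpus() {
  # Skip the header row; count only rows whose second column is a number,
  # so nodes reporting <none> are ignored.
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { nodes++; gpus += $2 }
       END { printf "gpu_nodes=%d total_gpus=%d\n", nodes, gpus }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu' | summarize_gpus
```

For the example output above, this would print `gpu_nodes=2 total_gpus=2`.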

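Verify the successful installation of the NVIDIA GPU Operator

Verify that the NVIDIA GPU Operator installed successfully, as shown below.

Run the following command to view the new pods and DaemonSets: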

$ oc get pods,daemonset -n nvidia-gpu-operator
NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          94m
pod/gpu-feature-discovery-hlpgs                                       1/1     Running     0          91m
pod/gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          94m
pod/nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          91m
pod/nvidia-cuda-validator-8fx22                                       0/1     Completed   0          86m
pod/nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          91m
pod/nvidia-dcgm-k7tz6                                                 1/1     Running     0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          91m
pod/nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          91m
pod/nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          91m
pod/nvidia-operator-validator-kjznr                                   1/1     Running     0          91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master   1/1     Running     0          94m

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
daemonset.apps/gpu-feature-discovery                          1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     91m
daemonset.apps/nvidia-container-toolkit-daemonset             1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                         91m
daemonset.apps/nvidia-dcgm                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                      91m
daemonset.apps/nvidia-dcgm-exporter                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             91m
daemonset.apps/nvidia-device-plugin-daemonset                 1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                             91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true   91m
daemonset.apps/nvidia-mig-manager                             0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               91m
daemonset.apps/nvidia-node-status-exporter                    1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      91m
daemonset.apps/nvidia-operator-validator                      1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                                                                        91m
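Rather than reading the DaemonSet table by eye, the READY and DESIRED columns can be compared in a script. This is a sketch under stated assumptions, not an official check: the function names `unready_daemonsets` and `check_gpu_operator_daemonsets` are hypothetical, and the column order is assumed to match the output above (NAME, DESIRED, CURRENT, READY, and so on).

```shell
# Hypothetical helper: read `oc get daemonset --no-headers` output on stdin and
# print the name of every DaemonSet whose READY count differs from DESIRED.
unready_daemonsets() {
  awk '$2 != $4 { print $1 }'
}

# Hypothetical wrapper: run the comparison against a namespace.
check_gpu_operator_daemonsets() {
  local ns="${1:-nvidia-gpu-operator}"
  local listing
  local unready

  # Stop early if the listing itself fails or comes back empty.
  listing="$(oc get daemonset -n "$ns" --no-headers)" || return 1
  if [ -z "$listing" ]; then
    echo "No DaemonSets found in $ns" >&2
    return 1
  fi

  unready="$(printf '%s\n' "$listing" | unready_daemonsets)"
  if [ -n "$unready" ]; then
    echo "DaemonSets not fully ready in $ns:" >&2
    echo "$unready" >&2
    return 1
  fi
  echo "All DaemonSets in $ns report READY equal to DESIRED"
}
```

A DaemonSet such as nvidia-mig-manager, which shows 0 desired and 0 ready in the output above, passes this check because the two counts match.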

The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.

Note

When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version equals <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS-0> (48.84.202110270303-0 in the output above). The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
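To make the naming scheme concrete, the sketch below splits a driver DaemonSet name into its parts using shell parameter expansion. The function name `parse_driver_daemonset_name` and the output labels are hypothetical. It applies to DaemonSet names only, because pod names carry an additional UUID suffix.

```shell
# Hypothetical helper: split nvidia-driver-daemonset-<RHCOS-version> into the
# OCP part, the RHEL part, and the remaining date-based build part.
parse_driver_daemonset_name() {
  local name="$1"
  local version="${name#nvidia-driver-daemonset-}"

  # If the prefix was not removed, this is not a driver DaemonSet name.
  if [ "$version" = "$name" ]; then
    echo "not a driver DaemonSet name: $name" >&2
    return 1
  fi

  local ocp="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"    # text after the first dot
  local rhel="${rest%%.*}"      # text before the second dot
  local build="${rest#*.}"      # text after the second dot

  echo "rhcos=$version ocp=$ocp rhel=$rhel build=$build"
}
```

For the DaemonSet in the output above, `parse_driver_daemonset_name nvidia-driver-daemonset-48.84.202110270303-0` prints `rhcos=48.84.202110270303-0 ocp=48 rhel=84 build=202110270303-0`.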