Create the cluster policy instance
Added in version 2.0.
Create the cluster policy using the web console
In the OpenShift Container Platform web console, select Operators > Installed Operators from the side menu, then click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy.
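Note
The platform assigns the default name gpu-cluster-policy.
You can use this screen to customize the ClusterPolicy, but the default settings are sufficient to get the GPU configured and running.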
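Click Create.
At this point, the GPU Operator installs all the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take a while, so wait at least 10-20 minutes before starting any troubleshooting.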
Verify the cluster policy
When the installation succeeds, the status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State:ready.
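The same check can be scripted from the CLI. The following is a minimal sketch, not part of the official procedure: it assumes the ClusterPolicy reports its state at `.status.state` and that `oc` is already logged in to the cluster. The function names `get_policy_state` and `wait_for_ready` are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helper: read the state of the default ClusterPolicy.
# Assumes the state is exposed at .status.state.
get_policy_state() {
  oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'
}

# Poll until the state is "ready".
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts.
wait_for_ready() {
  attempts="$1"
  interval="$2"
  i=0
  state=""
  while [ "$i" -lt "$attempts" ]; do
    # Treat a failed query as "no state yet" rather than aborting.
    state="$(get_policy_state 2>/dev/null || true)"
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "ClusterPolicy not ready (last state: ${state:-unknown})" >&2
  return 1
}
```

For example, `wait_for_ready 40 30` polls every 30 seconds for up to 20 minutes, which matches the upper end of the waiting period suggested above.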
To verify from the CLI that the GPUs are available to the nodes, run:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
This lists each node and the number of GPUs it has available to Kubernetes.
Example output
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
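To turn that listing into a single number, the output can be piped through a small filter. This is only a sketch: `sum_gpus` is a hypothetical helper name, and it assumes the two-column layout shown in the example output, with the GPU count in the second column.

```shell
# Hypothetical helper: sum the GPUs column of the custom-columns listing
# read from stdin. Only rows whose second field is a plain integer are
# counted, so the header row and "<none>" rows are ignored.
sum_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}
```

A possible invocation is `oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu' | sum_gpus`. For the example output above, the two worker nodes with one GPU each give a total of 2.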
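Verify the successful installation of the NVIDIA GPU Operator
Verify that the NVIDIA GPU Operator installed successfully, as shown here.
Run the following command to view the new pods and daemonsets: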
$ oc get pods,daemonset -n nvidia-gpu-operator
NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          94m
pod/gpu-feature-discovery-hlpgs                                       1/1     Running     0          91m
pod/gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          94m
pod/nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          91m
pod/nvidia-cuda-validator-8fx22                                       0/1     Completed   0          86m
pod/nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          91m
pod/nvidia-dcgm-k7tz6                                                 1/1     Running     0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          91m
pod/nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          91m
pod/nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          91m
pod/nvidia-operator-validator-kjznr                                   1/1     Running     0          91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master   1/1     Running     0          94m

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/gpu-feature-discovery                          1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   91m
daemonset.apps/nvidia-container-toolkit-daemonset             1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true   91m
daemonset.apps/nvidia-dcgm                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true   91m
daemonset.apps/nvidia-dcgm-exporter                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true   91m
daemonset.apps/nvidia-device-plugin-daemonset                 1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true   91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true   91m
daemonset.apps/nvidia-mig-manager                             0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true   91m
daemonset.apps/nvidia-node-status-exporter                    1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true   91m
daemonset.apps/nvidia-operator-validator                      1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true   91m
The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.
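In the listing above, every pod is either Running or Completed. A quick way to spot exceptions is to filter the pod listing by its STATUS column. The sketch below assumes the default column order of `oc get pods` (NAME, READY, STATUS, ...) and that the header row is suppressed with `--no-headers`; `unhealthy_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read a headerless pod listing from stdin and print
# the name of every pod whose STATUS (third column) is neither
# "Running" nor "Completed". Prints nothing when all pods look healthy.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}
```

A possible invocation is `oc get pods -n nvidia-gpu-operator --no-headers | unhealthy_pods`. Empty output suggests that all pods reached one of the two expected states. It does not check that the READY counts are complete.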
Note
When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version equals <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS-0>. The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
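The naming scheme can be illustrated by splitting a DaemonSet name back into its parts. This is a sketch only: `parse_driver_daemonset` is a hypothetical helper, and it assumes the RHCOS version consists of exactly three dot-separated fields, as in the name nvidia-driver-daemonset-48.84.202110270303-0 from the example output.

```shell
# Hypothetical helper: split a driver DaemonSet name into the fields of
# its RHCOS version. Expects a DaemonSet name, not a pod name (a pod name
# carries an additional trailing suffix).
parse_driver_daemonset() {
  name="$1"
  version="${name#nvidia-driver-daemonset-}"   # strip the fixed prefix
  ocp="${version%%.*}"                         # first field: OCP XY
  rest="${version#*.}"
  rhel="${rest%%.*}"                           # second field: RHEL XY
  build="${rest#*.}"                           # remainder: date plus "-0"
  printf 'ocp=%s rhel=%s build=%s\n' "$ocp" "$rhel" "$build"
}
```

Calling `parse_driver_daemonset nvidia-driver-daemonset-48.84.202110270303-0` prints `ocp=48 rhel=84 build=202110270303-0`.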