Troubleshooting#

This section covers errors that users might encounter while performing various checks during the installation of the NVIDIA GPU Operator on an OpenShift Container Platform cluster.

Node Feature Discovery Checks#

  1. Verify that the Node Feature Discovery has been created.

    $ oc get NodeFeatureDiscovery -n openshift-nfd
    
    NAME           AGE
    nfd-instance   4h11m
    

    Note

    If this is empty, you must create the Node Feature Discovery custom resource (CR). A quick health check of the NFD Operator is sketched after these steps.

  2. Ensure that nodes with GPUs are present. In this example, the check is performed for NVIDIA GPUs, which use the PCI ID 10de.

    $ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
    
    NAME                           STATUS   ROLES    AGE     VERSION
    ip-10-0-133-209.ec2.internal   Ready    worker   4h13m   v1.21.1+9807387
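
If either check returns nothing, it can help to confirm that the NFD Operator itself is healthy and that the expected PCI feature labels were applied. The commands below are an illustrative sketch; the node name is taken from the example output above and will differ in your cluster.

    # Confirm the NFD operands are running.
    $ oc get pods -n openshift-nfd

    # Inspect the PCI feature labels that NFD applied to a GPU node
    # (10de is the NVIDIA PCI vendor ID).
    $ oc describe node ip-10-0-133-209.ec2.internal | grep pci-10de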
    

GPU Operator Checks#

  1. Verify that the custom resource definition (CRD) has been deployed.

    $ oc get crd/clusterpolicies.nvidia.com
    
    NAME                         CREATED AT
    clusterpolicies.nvidia.com   2021-09-02T10:33:50Z
    

    Note

    If this is missing, the installation of the cluster policy was not successful (see the Operator installation checks after these steps).

  2. Verify that the cluster policy has been deployed.

    $ oc get clusterpolicy
    
    NAME                  AGE
    gpu-cluster-policy    8m25s
    

    Note

    If this is missing, create the custom resource (CR). For more information, see Create the ClusterPolicy Instance.

  3. Verify that the Operator is running.

    $ oc get pods -n nvidia-gpu-operator -lapp=gpu-operator
    
    NAME                            READY   STATUS    RESTARTS   AGE
    gpu-operator-6b8b8c5fd9-zcs9r   1/1     Running   0          3h55m
    

    Note

    If ImagePullBackOff is reported, the NVIDIA registry might be down (see the additional check after these steps). If CrashLoopBackOff is reported, review the Operator logs:

    $ oc logs -f -n nvidia-gpu-operator -lapp=gpu-operator
    
    2021-11-17T12:08:33.913Z  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
    2021-11-17T12:08:33.941Z  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-mig-parted-config", "Namespace": "nvidia-gpu-operator"}
    2021-11-17T12:08:33.958Z  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-gpu-clients", "Namespace": "nvidia-gpu-operator"}
    2021-11-17T12:08:33.964Z  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"SecurityContextConstraints": "nvidia-mig-manager", "Namespace": "default"}
    2021-11-17T12:08:33.992Z  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator", "name": "nvidia-mig-manager"}
    2021-11-17T12:08:33.992Z  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-mig-manager"}
    2021-11-17T12:08:33.992Z  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
    2021-11-17T12:08:33.992Z  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 0}
    2021-11-17T12:08:33.992Z  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-mig-manager", "status": "ready"}
    2021-11-17T12:08:34.061Z  INFO    controllers.ClusterPolicy       ClusterPolicy is ready.
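
If the CRD or the cluster policy is missing, the Operator installation itself may not have completed. As an illustrative check (not part of the original output), inspect the Operator's Subscription and ClusterServiceVersion in the nvidia-gpu-operator namespace.

    # The ClusterServiceVersion (CSV) phase should be "Succeeded".
    $ oc get csv -n nvidia-gpu-operator

    # The Subscription shows the channel and the installed CSV.
    $ oc get subscription -n nvidia-gpu-operator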
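
For an ImagePullBackOff, the pod events usually contain the exact image pull error. A minimal sketch using oc describe follows; the pod name is taken from the example output above and will differ in your cluster.

    # The Events section at the end of the output reports image pull failures.
    $ oc describe pod gpu-operator-6b8b8c5fd9-zcs9r -n nvidia-gpu-operator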
    

Validate the GPU Stack#

The GPU Operator validates the stack with the nvidia-device-plugin-validator and the nvidia-cuda-validator pods. If they report the status Completed, the stack works as expected.

$ oc get po -n nvidia-gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          125m
gpu-feature-discovery-hlpgs                                       1/1     Running     0          122m
gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          125m
nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          122m
nvidia-cuda-validator-8fx22                                       0/1     Completed   0          117m
nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          122m
nvidia-dcgm-k7tz6                                                 1/1     Running     0          122m
nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          122m
nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          117m
nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          122m
nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          122m
nvidia-operator-validator-kjznr                                   1/1     Running     0          122m
  1. Verify the cuda validator logs.

    $ oc logs -f nvidia-cuda-validator-8fx22 -n nvidia-gpu-operator
    
    cuda workload validation is successful
    
  2. Verify the nvidia-device-plugin-validator logs.

    $ oc logs nvidia-device-plugin-validator-87zdl -n nvidia-gpu-operator
    
    device-plugin workload validation is successful
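
Beyond the validator pods, two further manual checks can confirm that the stack is usable: the GPU node should advertise the nvidia.com/gpu resource, and nvidia-smi should run inside the driver container. This is an illustrative sketch; the node and pod names are taken from the example outputs above and will differ in your cluster.

    # The Allocatable section should list nvidia.com/gpu with a non-zero count.
    $ oc describe node ip-10-0-133-209.ec2.internal | grep -A10 Allocatable

    # Run nvidia-smi inside the driver container of the driver daemonset pod.
    $ oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-48.84.202110270303-0-9df9j \
        -c nvidia-driver-ctr -- nvidia-smi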
    

Validate the NVIDIA Driver Deployment#

This is an illustrative example of how to determine whether the deployment of the Operator is proceeding as expected.

  1. List the pods deployed in the nvidia-gpu-operator namespace.

    $ oc get pods -n nvidia-gpu-operator
    
    NAME                                               READY   STATUS             RESTARTS   AGE
    gpu-feature-discovery-hlpgs                        0/1     Init:0/1           0          53m
    gpu-operator-8dc8d6648-jzhnr                       0/1     Init:0/1           0          53m
    nvidia-container-toolkit-daemonset-z2wh7           0/1     Init:0/1           0          53m
    nvidia-cuda-validator-8fx22                        0/1     Init:0/1           0          53m
    nvidia-dcgm-exporter-ds9xd                         0/1     Init:0/2           0          53m
    nvidia-dcgm-k7tz6                                  0/1     Init:0/1           0          53m
    nvidia-device-plugin-daemonset-nqxmc               0/1     Init:0/1           0          53m
    nvidia-device-plugin-validator-87zd                0/1     Init:0/1           0          53m
    nvidia-driver-daemonset-48.84.202110270303-0-9df9j 0/1     CrashLoopBackOff   13         53m
    nvidia-node-status-exporter-7bhdk                  1/1     Init:0/1           0          53m
    nvidia-operator-validator-kjznr                    0/1     Init:0/4           0          53m
    

    The Init status indicates that the driver pod is not ready. In this example, the driver pod is in the CrashLoopBackOff state. This, combined with a RESTARTS count of 13, indicates a problem (a quick way to list only the unhealthy pods is sketched after these steps).

  2. Verify the main console page.

    [Screenshot: OpenShift Container Platform web console alerts (ocp_main_console_alerts.png)]

    The first alert shows that the nvidia driver could not be deployed correctly.

  3. Check the logs of the NVIDIA driver main container.

    $ oc logs -f nvidia-driver-daemonset-48.84.202110270303-0-9df9j -n nvidia-gpu-operator -c nvidia-driver-ctr
    

    In the logs, the following line indicates an entitlement issue:

    FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed
    

    This message can be related to an unsuccessful deployment of the Driver Toolkit. To confirm that the Driver Toolkit was deployed successfully, follow the guidance in Validate the OpenShift Driver Toolkit. If you see this message, one workaround is to edit the gpu-cluster-policy YAML that was created in the OpenShift Container Platform console and set use_ocp_driver_toolkit to false (a CLI sketch of this change follows these steps).

    Set up the entitlement. For more information, see Enabling a Cluster-wide Entitlement.
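
When many pods are stuck as in step 1, it can help to list only the unhealthy ones. This is an illustrative sketch, not part of the original output.

    # Show only pods whose STATUS is not Running or Completed
    # (for example Init, CrashLoopBackOff, or ImagePullBackOff).
    $ oc get pods -n nvidia-gpu-operator | grep -vE 'Running|Completed'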
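
The workaround from step 3 can also be applied from the CLI. The sketch below assumes that use_ocp_driver_toolkit sits under spec.driver in the ClusterPolicy; verify the exact path in the console-generated YAML (for example with oc edit) before patching.

    # Edit the cluster policy interactively and set use_ocp_driver_toolkit to false ...
    $ oc edit clusterpolicy gpu-cluster-policy

    # ... or apply the change with a merge patch (field path assumed, see above).
    $ oc patch clusterpolicy/gpu-cluster-policy --type merge \
        -p '{"spec": {"driver": {"use_ocp_driver_toolkit": false}}}'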

Validate the OpenShift Driver Toolkit#

Verify that the OpenShift Driver Toolkit was deployed successfully.

  1. Check the logs to ensure that the Driver Toolkit was deployed successfully.

    $ oc logs $(oc get pods -l app=gpu-operator -oname -n nvidia-gpu-operator) -n nvidia-gpu-operator
    

    The following output indicates a problem with missing dependencies.

    2021-11-15T11:43:51.626Z   INFO    controllers.ClusterPolicy       ocpHasDriverToolkitImageStream: driver-toolkit imagestream not found    {"Name": "driver-toolkit", "Namespace": "openshift"}
    2021-11-15T11:43:51.626Z   INFO    controllers.ClusterPolicy       WARNING OpenShift Driver Toolkit requested      {"hasCompatibleNFD": true, "hasDriverToolkitImageStream": false}
    2021-11-15T11:43:51.626Z   INFO    controllers.ClusterPolicy       WARNING OpenShift Driver Toolkit        {"enabled": false}
    ...
    2021-11-15T11:43:52.048Z   INFO    controllers.ClusterPolicy       WARNING: OpenShift DriverToolkit was requested but could not be enabled (dependencies missing)
    

    Checking the alerts in the OpenShift Container Platform web console shows the alert GPUOperatorOpenshiftDriverToolkitEnabledImageStreamMissing. This screenshot provides additional evidence that the OpenShift version needs to be upgraded.

    [Screenshot: GPUOperatorOpenshiftDriverToolkitEnabledImageStreamMissing alert in the web console (driver_toolkit_alert.png)]

    The log entries and the information from the alert indicate that this version of OpenShift does not support a successful deployment of the OpenShift Driver Toolkit. For more details on the dependencies between OpenShift Container Platform versions and the OpenShift Driver Toolkit, see Installation and Upgrade Overview on OpenShift. A quick CLI check for the missing imagestream is sketched below.
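
Because the log reports that the driver-toolkit imagestream was not found in the openshift namespace, a quick way to confirm the missing dependency from the CLI is to query the imagestream and the cluster version directly. This is an illustrative sketch, not part of the original output.

    # The imagestream must exist in the openshift namespace for the
    # Driver Toolkit to be usable.
    $ oc get imagestream driver-toolkit -n openshift

    # Check the installed OpenShift version; older releases do not ship
    # the driver-toolkit imagestream.
    $ oc get clusterversion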