GKE (Google Kubernetes Engine)#

概述#

版本 3.1 中新增。

Google Kubernetes Engine (GKE) 提供了一个托管环境，用于使用 Google 基础设施部署、管理和扩展容器化应用程序。NVIDIA AI Enterprise 是 NVIDIA AI 平台的端到端软件，支持在 GKE 上运行。GKE 环境由多台机器（特别是 Compute Engine 实例）组合在一起，形成一个集群。本指南详细介绍了如何在具有 NVIDIA GPU 加速节点的 GKE 集群上部署和运行 NVIDIA AI Enterprise。

注意

NVIDIA Terraform 模块提供了一种简便的方法来部署托管 Kubernetes 集群，当与受支持的操作系统和 GPU Operator 版本一起使用时，这些集群可以由 NVIDIA AI Enterprise 支持。

先决条件#

通过 BYOL 或私有报价获得的 NVIDIA AI Enterprise 许可证
安装 Helm
安装 Google Cloud CLI

注意

NVIDIA 建议通过下载 linux google-cloud-cli-xxx.x.x-linux-x86_64.tar.gz 文件来安装 Google Cloud CLI。如果使用不同的操作系统发行版，请继续按照 Google 的说明操作以获取所需的操作系统。
具有 Google Kubernetes Engine 管理员角色和 Kubernetes Engine 集群管理员角色的 Google Cloud 帐户，有关更多信息，请参阅 Google 的 IAM 策略
Ubuntu 节点

创建 GKE 集群#

运行以下命令以安装 GKE 组件。

./google-cloud-sdk/bin/gcloud components install beta
./google-cloud-sdk/bin/gcloud components install kubectl
./google-cloud-sdk/bin/gcloud components update

运行以下命令以创建 GKE 集群。

./google-cloud-sdk/bin/gcloud beta container --project <Google-Project-ID> clusters create <GKE-Cluster-Name> --zone us-west1-a --release-channel "regular" --machine-type "n1-standard-4" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "UBUNTU_CONTAINERD" --disk-type "pd-standard" --disk-size "1000" --no-enable-intra-node-visibility --metadata disable-legacy-endpoints=true --max-pods-per-node "110" --num-nodes "1" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --no-enable-intra-node-visibility --default-max-pods-per-node "110" --no-enable-master-authorized-networks --tags=nvidia-ingress-all

注意

<Google-Project-ID>：您将在 Google 控制台仪表板“设置”中找到项目 ID，并进行相应的更新。
如果网络名称不是“default”，则从控制台获取网络信息，并将以下值附加到相应的网络和子网名称。

--network "<Google Kubernetes Network Name>" --subnetwork "<Google Kubernetes SubNetwork Name>"

Example:

--network "projects/<GKE-Project-ID>/global/networks/<GKE-Network-Name>" --subnetwork "projects/<GKE-Project-ID>/regions/us-west1/subnetworks/<GKE-Network-Name>"

运行以下命令以获取 kubeconfig 凭据到本地系统。

$ export USE_GKE_GCLOUD_AUTH_PLUGIN=True

./google-cloud-sdk/bin/gcloud container clusters get-credentials <GKE-Cluster-Name> --zone us-west1-a

运行以下命令以验证节点信息

kubectl get nodes -o wide

示例输出结果

NAME                                                STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
gke-<GKE-Cluster-Name>-default-pool-db9e3df9-r0jf   Ready    <none>   5m15s   v1.25.6   192.168.50.108   13.57.187.63   Ubuntu 20.04.6 LTS   5.15.0-1033-gke   containerd://1.6.12

创建资源以在 GKE 上安装 GPU Operator#

使用以下命令创建资源配额文件。

cat <<EOF | tee resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-operator-quota
namespace: gpu-operator
spec:
hard:
    pods: 100
scopeSelector:
    matchExpressions:
    - operator: In
    scopeName: PriorityClass
    values:
    - system-node-critical
    - system-cluster-critical
EOF

运行以下命令以在 GKE 集群上创建命名空间和资源配额到命名空间。

./google-cloud-sdk/bin/kubectl create ns gpu-operator

./google-cloud-sdk/bin/kubectl apply -f resourcequota.yaml

./google-cloud-sdk/bin/kubectl get ResourceQuota -n gpu-operator

示例输出结果

NAME                  AGE   REQUEST                                                                                                                              LIMIT
gke-resource-quotas   24s   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 0/1500, services: 0/500
gpu-operator-quota    21s   pods: 0/100

有关更多信息，请参阅 GKE 的配额和限制概述

使用以下命令验证 GKE 上的默认 Pod 安全策略。

./google-cloud-sdk/bin/kubectl get psp

示例输出结果

Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME                    PRIV    CAPS   SELINUX    RUNASUSER   FSGROUP    SUPGROUP   READONLYROOTFS   VOLUMES
gce.gke-metrics-agent   false          RunAsAny   RunAsAny    RunAsAny   RunAsAny   false            hostPath,secret,configMap

部署 GPU Operator#

现在集群和适当的资源已创建，可以安装 NVIDIA GPU Operator

首先，我们将访问我们的 NGC API 密钥。

登录您的 NGC 帐户并生成新的 API 密钥或找到您现有的 API 密钥。请参阅附录的访问 NGC部分。

生成用于访问目录的 API 密钥

接下来，您必须生成一个 API 密钥，该密钥将允许您访问 NGC 目录。

导航到右上角的用户帐户图标，然后选择设置。

选择 获取 API 密钥 以打开“设置”>“API 密钥”页面。

选择 生成 API 密钥 以生成您的 API 密钥。

选择确认以生成密钥，并从页面底部复制它。NGC 不会保存您的密钥，因此请将其存储在安全的地方。

注意

生成新的 API 密钥会使先前生成的密钥失效。

添加 Helm 仓库并使用以下命令更新。

helm repo add nvidia https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
helm repo update

在“gpu-operator”命名空间上使用您的 NGC API 密钥创建一个 NGC Secret，如下所示。

./google-cloud-sdk/bin/kubectl create secret docker-registry ngc-secret  \
--docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken \
--docker-password=<NGC-API-KEY> \
--docker-email=<NGC-email> -n gpu-operator

创建一个空的 gridd.conf 文件，然后使用 NVIDIA vGPU 许可证令牌文件创建一个 configmap，如下所示

./google-cloud-sdk/bin/kubectl create configmap licensing-config   -n gpu-operator --from-file=./client_configuration_token.tok --from-file=./gridd.conf

注意

configmap 将查找文件 client_configuration_token.tok，如果您的令牌采用不同的形式，例如 client_configuration_token_date_xx_xx.tok，请运行以下命令

mv client_configuration_token_date_xx_xx.tok client_configuration_token.tok

从 NGC 目录安装带有许可证令牌和驱动程序仓库的 GPU Operator。

helm install gpu-operator nvidia/gpu-operator-3-0 --version 22.9.1  --set driver.repository=nvcr.io/nvaie,driver.licensingConfig.configMapName=licensing-config,psp.enabled=true --namespace gpu-operator

重要提示

确保您具有 Kubernetes Engine 管理员或 Kubernetes Engine 集群管理员的正确角色，以便安装 GPU Operator。如果您缺少所需的权限，请参阅创建 IAM 策略。

安装完成后，请至少等待 5 分钟，并验证所有 Pod 是否都正在运行或已完成，如下所示。

./google-cloud-sdk/bin/kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-fzgv9                                   1/1     Running     0          6m1s
gpu-operator-69f476f875-w4hwr                                 1/1     Running     0          6m29s
gpu-operator-node-feature-discovery-master-84c7c7c6cf-hxlk4   1/1     Running     0          6m29s
gpu-operator-node-feature-discovery-worker-86bbx              1/1     Running     0          6m29s
nvidia-container-toolkit-daemonset-c7k5p                      1/1     Running     0          6m
nvidia-cuda-validator-qjcsf                                   0/1     Completed   0          59s
nvidia-dcgm-exporter-9tggn                                    1/1     Running     0          6m
nvidia-device-plugin-daemonset-tpx9z                          1/1     Running     0          6m
nvidia-device-plugin-validator-gz85d                          0/1     Completed   0          44s
nvidia-driver-daemonset-jwzx8                                 1/1     Running     0          6m9s
nvidia-operator-validator-qj57n                               1/1     Running     0          6m

验证 GPU Operator 安装#

使用以下命令验证 NVIDIA GPU 驱动程序是否已加载。

./google-cloud-sdk/bin/kubectl exec -it nvidia-driver-daemonset-jwzx8 -n gpu-operator -- nvidia-smi

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
Tue Feb 14 22:24:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.60.13    Driver Version: 520.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    17W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

注意

nvidia-driver-daemonset-xxxxx 在您自己的环境中会有所不同，以上命令用于验证 NVIDIA vGPU 驱动程序。

使用以下命令验证 NVIDIA vGPU 许可证信息

./google-cloud-sdk/bin/kubectl exec -it nvidia-driver-daemonset-jwzx8 -n gpu-operator-resources -- nvidia-smi -q

运行示例 NVIDIA AI Enterprise 容器#

创建一个 docker-regirty Secret。这将用于自定义 yaml 文件中，以从 NGC 目录拉取容器。

./google-cloud-sdk/bin/kubectl create secret docker-registry regcred --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default

创建一个自定义 yaml 文件以部署 NVIDIA AI Enterprise 容器并运行示例训练代码。

nano pytoch-mnist.yaml

将以下内容粘贴到文件中并保存

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch-container
          image: nvcr.io/nvaie/pytorch-2-0:22.02-nvaie-2.0-py3
          command:
            - python
          args:
            - /workspace/examples/upstream/mnist/main.py
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred

检查 Pod 的状态。

1./google-cloud-sdk/bin/kubectl get pods

查看示例 mnist 训练作业的输出。

./google-cloud-sdk/bin/kubectl logs -l app=pytorch-mnist

输出将类似于这样。

~$ ./google-cloud-sdk/bin/kubectl logs -l app=pytorch-mnist
Train Epoch: 7 [55680/60000 (93%)]      Loss: 0.040756
Train Epoch: 7 [56320/60000 (94%)]      Loss: 0.028230
Train Epoch: 7 [56960/60000 (95%)]      Loss: 0.019917
Train Epoch: 7 [57600/60000 (96%)]      Loss: 0.005957
Train Epoch: 7 [58240/60000 (97%)]      Loss: 0.003768
Train Epoch: 7 [58880/60000 (98%)]      Loss: 0.277371
Train Epoch: 7 [59520/60000 (99%)]      Loss: 0.115487


Test set: Average loss: 0.0270, Accuracy: 9913/10000 (99%)

删除 GKE 集群#

运行以下命令以删除 GKE 集群

./google-cloud-sdk/bin/gcloud  beta container clusters delete <cluster-name>  --zone <zone-name>