Deploying the Cloud Native Service Add-on Pack

Amazon EKS

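  1. The following steps assume that the requirements in the Requirements section have been met and that an Amazon EKS cluster has been set up by following the steps in the previous section.

  2. Ensure that a kubeconfig is available and is set either through the KUBECONFIG environment variable or at the user's default location (~/.kube/config).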
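
    One way to confirm which kubeconfig will be picked up is sketched below. The helper name resolve_kubeconfig is hypothetical and not part of the installer; it only mirrors the lookup order described in this step.

    ```shell
    # Hypothetical helper: print the kubeconfig path(s) that would be used.
    # KUBECONFIG may hold a colon-separated list; it is printed unchanged.
    resolve_kubeconfig() {
      printf '%s\n' "${KUBECONFIG:-$HOME/.kube/config}"
    }

    # Example usage (needs kubectl and a reachable cluster):
    #   resolve_kubeconfig
    #   kubectl cluster-info
    ```

    If kubectl cluster-info fails, fix the kubeconfig before continuing, since the later steps rely on it.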

  3. Ensure that an FQDN and a wildcard DNS entry are available for the K8s cluster you created and that both resolve.
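
    A wildcard record can be spot-checked by resolving an arbitrary name beneath it. The sketch below assumes a Linux host with getent; the helper name wildcard_probe_host and the cnpack-probe label are made up for illustration.

    ```shell
    # Hypothetical helper: turn a wildcard record such as "*.example.com"
    # into a concrete hostname that the wildcard should cover.
    wildcard_probe_host() {
      host=${1#\*.}                      # strip a leading "*." if present
      printf 'cnpack-probe.%s\n' "$host"
    }

    # Example usage (needs working DNS):
    #   getent hosts "$(wildcard_probe_host '*.my-cluster.my-domain.com')"
    ```

    If getent prints nothing and returns a non-zero status, the wildcard entry is not resolving from this host.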

  4. Download the NVIDIA Cloud Native Service Add-on Pack from the Enterprise Catalog onto the instance you provisioned.

    ngc registry resource download-version "nvaie/nvidia_cnpack:0.4.0"
    

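    Note

    If you still need to install and set up the NGC CLI with your API key, do so before downloading the resource. See the NGC CLI documentation for instructions.

  5. Navigate to the installer's directory using the following command: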

    cd nvidia_cnpack_v*
    
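  6. Create a configuration file for the installation using the following template. The template is a minimal configuration file. For full details on all available configuration options, refer to the Advanced Usage section of the Appendix.

    Note

    Make sure to change the wildcardDomain field to match the DNS FQDN and wildcard record described in the Requirements section.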

    cat > config.yaml <<EOF
    apiVersion: v1alpha1
    kind: NvidiaPlatform
    spec:
      platform:
        wildcardDomain: "*.my-cluster.my-domain.com"
        externalPort: 443
        eks:
          region: us-west-2
      certManager:
        enabled: true
        awsPCA:
          enabled: true
          commonName: "<your common name used to enable AWS Private CA>"
          domainName: "<your commonName used to enable AWS Private CA>"
          arn: "<ARN of the AWS Private CA>"
      prometheus:
        enabled: true
        awsRemoteWrite:
          url: "<Remote write url for Amazon Managed Prometheus>"
          arn: "<IAM Role for Amazon managed Prometheus>"
      grafana:
        enabled: false
      keycloak:
        enabled: true
        databaseStorage:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1G
          storageClassName: gp2
          volumeMode: Filesystem
      postgres:
        enabled: true
      fluentbit:
        enabled: true
      elastic:
        enabled: true
      ingress:
        enabled: true
    
    EOF
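
    The template contains placeholders in angle brackets that must be replaced with real values. As a precaution before running the installer, a check along the following lines can flag any that remain; the helper name find_placeholders is hypothetical.

    ```shell
    # Hypothetical helper: list lines of a config file that still contain
    # template placeholders of the form <...>. Prints nothing when none remain.
    find_placeholders() {
      grep -n '<[^>]*>' "$1"
    }

    # Example usage:
    #   if find_placeholders config.yaml; then
    #     echo "Replace the placeholders above before running the installer." >&2
    #   fi
    ```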
    
  7. Make the installer executable with the following command:

    chmod +x ./nvidia-cnpack_Linux_x86_64
    
  8. Run the following command on the instance to set up the NVIDIA Cloud Native Service Add-on Pack:

    ./nvidia-cnpack_Linux_x86_64 create -f config.yaml
    
  9. Once the installation is complete, check that all pods are healthy with the following command:

    kubectl get pods -A
    

    The output should look similar to the following:

    NAMESPACE           NAME                                                              READY   STATUS      RESTARTS      AGE
    kube-system         aws-node-hcn49                                                    1/1     Running     0             20d
    kube-system         coredns-769569fd5d-8pfsr                                          1/1     Running     0             20d
    kube-system         coredns-769569fd5d-cpf29                                          1/1     Running     0             20d
    kube-system         ebs-csi-controller-7c5f746989-9kjrj                               6/6     Running     0             20d
    kube-system         ebs-csi-controller-7c5f746989-fzzlw                               6/6     Running     0             20d
    kube-system         ebs-csi-node-f9bqp                                                3/3     Running     0             20d
    kube-system         kube-proxy-t8ttt                                                  1/1     Running     0             20d
    nvidia-monitoring   elastic-operator-0                                                1/1     Running     1 (14d ago)   14d
    nvidia-monitoring   grafana-deployment-6fdf95b986-8d2sh                               1/1     Running     0             14d
    nvidia-monitoring   nvidia-fluentbit-aws-for-fluent-bit-ljf7j                         1/1     Running     0             17d
    nvidia-monitoring   nvidia-grafana-grafana-operator-66d597fcdb-q88k7                  1/1     Running     0             17d
    nvidia-monitoring   nvidia-prometheus-kube-pro-operator-87cbfd57d-mlm6j               1/1     Running     0             17d
    nvidia-monitoring   prometheus-nvidia-prometheus-kube-pro-prometheus-0                2/2     Running     0             17d
    nvidia-platform     nvidia-certmanager-cert-manager-754dbf54cd-wnfmd                  1/1     Running     0             17d
    nvidia-platform     nvidia-certmanager-cert-manager-cainjector-68b7b69c6f-nrfpf       1/1     Running     0             17d
    nvidia-platform     nvidia-certmanager-cert-manager-webhook-557978b4fc-tsc69          1/1     Running     0             17d
    nvidia-platform     nvidia-ingress-kubernetes-ingress-j4zgh                           1/1     Running     0             17d
    nvidia-platform     nvidia-keycloak-0                                                 1/1     Running     1 (17d ago)   17d
    nvidia-platform     nvidia-keycloak-1                                                 1/1     Running     0             17d
    nvidia-platform     nvidia-keycloak-backup-hk5n-qnrgj                                 0/1     Completed   0             17d
    nvidia-platform     nvidia-keycloak-instance1-mrbl-0                                  4/4     Running     0             17d
    nvidia-platform     nvidia-keycloak-instance1-pt9t-0                                  4/4     Running     0             17d
    nvidia-platform     nvidia-keycloak-instance1-schr-0                                  4/4     Running     0             17d
    nvidia-platform     nvidia-keycloak-repo-host-0                                       2/2     Running     0             17d
    nvidia-platform     nvidia-platform-aws-privateca-issuer-55b676666d-h6nlw             1/1     Running     0             17d
    nvidia-platform     pgo-64cdcfff78-np8nb                                              1/1     Running     0             17d
    nvidia-platform     pgo-upgrade-6776d6894-gjcn9                                       1/1     Running     0             17d
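
    On a busy cluster it can be easier to list only the pods that need attention. The sketch below filters the same kubectl output; the helper name unhealthy_pods is hypothetical, and it treats a pod as healthy when its STATUS is Completed, or when it is Running with all containers ready.

    ```shell
    # Hypothetical helper: read `kubectl get pods -A --no-headers` output on
    # stdin and print only the pods that do not look healthy.
    # Column 3 is READY (ready/total), column 4 is STATUS.
    unhealthy_pods() {
      awk '{
        if ($4 == "Completed") next          # finished jobs are fine
        split($3, r, "/")
        if ($4 != "Running" || r[1] != r[2]) print
      }'
    }

    # Example usage (needs kubectl and a reachable cluster):
    #   kubectl get pods -A --no-headers | unhealthy_pods
    ```

    Empty output means every pod passed this check.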
    
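  10. As part of the installation, the installer creates the nvidia-platform and nvidia-monitoring namespaces, which contain most of the components and information needed to interact with the deployed services.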