Deploying the Cloud Native Service Add-on Pack
Amazon EKS
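Ensure that a kubeconfig is available and set either through the KUBECONFIG environment variable or in your user's default location (~/.kube/config).
Ensure that an FQDN and a wildcard DNS entry exist for the K8S cluster you created, and that both resolve.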
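The two prerequisites above can be checked before continuing. The following is a minimal sketch and not part of the installer: the function names (`kubeconfig_path`, `wildcard_probe_host`, `preflight`) and the `preflight` host label are hypothetical, and it assumes `getent` is available on the instance.

```shell
# Print the kubeconfig path that would be consulted first.
# KUBECONFIG may hold a colon-separated list; only the first entry is reported.
kubeconfig_path() {
    if [ -n "${KUBECONFIG:-}" ]; then
        printf '%s\n' "${KUBECONFIG%%:*}"
    else
        printf '%s\n' "$HOME/.kube/config"
    fi
}

# Turn a wildcard record such as "*.my-cluster.my-domain.com" into a concrete
# host name beneath it. The "preflight" label is an arbitrary choice.
wildcard_probe_host() {
    printf 'preflight.%s\n' "${1#\*.}"
}

# Succeeds if the kubeconfig file is readable and the probe host resolves.
# Assumption: getent exists on this system (glibc-based Linux).
preflight() {
    cfg=$(kubeconfig_path)
    [ -r "$cfg" ] || { echo "kubeconfig not readable: $cfg" >&2; return 1; }
    host=$(wildcard_probe_host "$1")
    getent hosts "$host" >/dev/null || { echo "cannot resolve: $host" >&2; return 1; }
    echo "preflight ok: $cfg, $host"
}
```

A call such as `preflight '*.my-cluster.my-domain.com'` would then report the first failing prerequisite. Resolving one arbitrary name under the wildcard only shows that the wildcard record answers; it does not confirm that it points at the cluster.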
Download the NVIDIA Cloud Native Service Add-on Pack from the Enterprise Catalog onto the instance you provisioned:
ngc registry resource download-version "nvaie/nvidia_cnpack:0.4.0"
Note
If you still need to install and set up the NGC CLI with your API key, do so by autoloading the resource. Instructions can be found here.
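Because the download command depends on the NGC CLI being installed, it may help to confirm that the binary is on the PATH first. This is a small sketch; `require_cmd` is a hypothetical helper name, and it checks only that the command exists, not that an API key has been configured.

```shell
# Report whether a command is on PATH; returns non-zero if it is missing.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}
```

For example, `require_cmd ngc && require_cmd kubectl` could be run before the download step so that a missing tool is reported up front.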
Navigate to the installer's directory with the following command:
cd nvidia_cnpack_v*
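Create a config file for the installation using the following template, which represents a minimal configuration. For full details on all available configuration options, refer to the Advanced Usage section of the Appendix.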
Note
Make sure to change the wildcardDomain field to match the DNS FQDN and wildcard record described in the requirements section.
cat > config.yaml <<EOF
apiVersion: v1alpha1
kind: NvidiaPlatform
spec:
  platform:
    wildcardDomain: "*.my-cluster.my-domain.com"
    externalPort: 443
    eks:
      region: us-west-2
  certManager:
    enabled: true
    awsPCA:
      enabled: true
      commonName: "<your common name used to enable AWS Private CA>"
      domainName: "<your commonName used to enable AWS Private CA>"
      arn: "<ARN of the AWS Private CA>"
  prometheus:
    enabled: true
    awsRemoteWrite:
      url: "<Remote write url for Amazon Managed Prometheus>"
      arn: "<IAM Role for Amazon managed Prometheus>"
  grafana:
    enabled: false
  keycloak:
    enabled: true
    databaseStorage:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1G
      storageClassName: gp2
      volumeMode: Filesystem
  postgres:
    enabled: true
  fluentbit:
    enabled: true
  elastic:
    enabled: true
  ingress:
    enabled: true
EOF
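Since the template ships with a placeholder domain and several `<...>` values, a quick text check before running the installer can catch a forgotten edit. The sketch below is an illustration only: `wildcard_domain` and `check_config` are hypothetical names, the matching is plain `sed`/`grep` rather than a YAML parser, and it assumes the value is double-quoted as in the template.

```shell
# Extract the first double-quoted wildcardDomain value from a config file.
wildcard_domain() {
    sed -n 's/^[[:space:]]*wildcardDomain:[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Fail if wildcardDomain is missing, still the template value, or not a
# wildcard, or if any "<...>" placeholder from the template is left in place.
check_config() {
    domain=$(wildcard_domain "$1")
    case "$domain" in
        '')
            echo "wildcardDomain not set in $1" >&2; return 1 ;;
        '*.my-cluster.my-domain.com')
            echo "wildcardDomain is still the template placeholder" >&2; return 1 ;;
        '*.'?*)
            ;;
        *)
            echo "wildcardDomain should start with '*.': $domain" >&2; return 1 ;;
    esac
    if grep -q '"<[^"]*>"' "$1"; then
        echo "unfilled <...> placeholders remain in $1" >&2
        return 1
    fi
    echo "config looks filled in: $domain"
}
```

Running `check_config config.yaml` would return non-zero until both the domain and the AWS-specific values have been replaced. It does not validate that the values are correct for your account.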
Make the installer executable with the following command:
chmod +x ./nvidia-cnpack_Linux_x86_64
Run the following command on the instance to set up the NVIDIA Cloud Native Service Add-on Pack:
./nvidia-cnpack_Linux_x86_64 create -f config.yaml
Once the installation is complete, check that all pods are healthy with the following command:
kubectl get pods -A
The output should look similar to the following:
NAMESPACE           NAME                                                          READY   STATUS      RESTARTS      AGE
kube-system         aws-node-hcn49                                                1/1     Running     0             20d
kube-system         coredns-769569fd5d-8pfsr                                      1/1     Running     0             20d
kube-system         coredns-769569fd5d-cpf29                                      1/1     Running     0             20d
kube-system         ebs-csi-controller-7c5f746989-9kjrj                           6/6     Running     0             20d
kube-system         ebs-csi-controller-7c5f746989-fzzlw                           6/6     Running     0             20d
kube-system         ebs-csi-node-f9bqp                                            3/3     Running     0             20d
kube-system         kube-proxy-t8ttt                                              1/1     Running     0             20d
nvidia-monitoring   elastic-operator-0                                            1/1     Running     1 (14d ago)   14d
nvidia-monitoring   grafana-deployment-6fdf95b986-8d2sh                           1/1     Running     0             14d
nvidia-monitoring   nvidia-fluentbit-aws-for-fluent-bit-ljf7j                     1/1     Running     0             17d
nvidia-monitoring   nvidia-grafana-grafana-operator-66d597fcdb-q88k7              1/1     Running     0             17d
nvidia-monitoring   nvidia-prometheus-kube-pro-operator-87cbfd57d-mlm6j           1/1     Running     0             17d
nvidia-monitoring   prometheus-nvidia-prometheus-kube-pro-prometheus-0            2/2     Running     0             17d
nvidia-platform     nvidia-certmanager-cert-manager-754dbf54cd-wnfmd              1/1     Running     0             17d
nvidia-platform     nvidia-certmanager-cert-manager-cainjector-68b7b69c6f-nrfpf   1/1     Running     0             17d
nvidia-platform     nvidia-certmanager-cert-manager-webhook-557978b4fc-tsc69      1/1     Running     0             17d
nvidia-platform     nvidia-ingress-kubernetes-ingress-j4zgh                       1/1     Running     0             17d
nvidia-platform     nvidia-keycloak-0                                             1/1     Running     1 (17d ago)   17d
nvidia-platform     nvidia-keycloak-1                                             1/1     Running     0             17d
nvidia-platform     nvidia-keycloak-backup-hk5n-qnrgj                             0/1     Completed   0             17d
nvidia-platform     nvidia-keycloak-instance1-mrbl-0                              4/4     Running     0             17d
nvidia-platform     nvidia-keycloak-instance1-pt9t-0                              4/4     Running     0             17d
nvidia-platform     nvidia-keycloak-instance1-schr-0                              4/4     Running     0             17d
nvidia-platform     nvidia-keycloak-repo-host-0                                   2/2     Running     0             17d
nvidia-platform     nvidia-platform-aws-privateca-issuer-55b676666d-h6nlw         1/1     Running     0             17d
nvidia-platform     pgo-64cdcfff78-np8nb                                          1/1     Running     0             17d
nvidia-platform     pgo-upgrade-6776d6894-gjcn9                                   1/1     Running     0             17d
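With this many pods, scanning the listing by eye is error-prone. One possible approach is to filter the listing so that only problem pods are printed. In the sketch below, `unhealthy_pods` is a hypothetical helper name; it treats a pod as healthy when its STATUS is Completed, or when it is Running with all containers ready, which matches the sample output above but is an assumption about what "healthy" should mean for your cluster.

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print each pod
# that is neither Completed nor Running with every container ready.
# Columns: $1 NAMESPACE, $2 NAME, $3 READY (ready/total), $4 STATUS.
unhealthy_pods() {
    awk '{
        if ($4 == "Completed") next
        split($3, ready, "/")
        if ($4 != "Running" || ready[1] != ready[2])
            print $1 "/" $2 " " $3 " " $4
    }'
}
```

Usage would be `kubectl get pods -A --no-headers | unhealthy_pods`; empty output means every pod passed the check.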
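As part of the installation, the installer creates the nvidia-platform and nvidia-monitoring namespaces, which contain most of the components and information required to interact with the deployed services.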
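To see what the installer placed in those two namespaces, the resources can be listed per namespace. This is a sketch under assumptions: `list_cnpack_resources` is a hypothetical name, the choice of resource types (pods, services, ingress, secrets) is illustrative rather than a statement of what the pack creates, and the `KUBECTL` variable exists only so the commands can be previewed.

```shell
# KUBECTL can be overridden (for example with `echo`) to preview the commands
# instead of running them against the cluster.
KUBECTL=${KUBECTL:-kubectl}

# List a few common resource types in each namespace created by the installer.
list_cnpack_resources() {
    for ns in nvidia-platform nvidia-monitoring; do
        echo "== $ns =="
        "$KUBECTL" get pods,services,ingress,secrets -n "$ns" || return 1
    done
}
```

Calling `list_cnpack_resources` prints one section per namespace and stops at the first failing `kubectl` call.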