Setting Up Amazon EKS#
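This section describes how to set up an Amazon EKS instance supported by NVIDIA AI Enterprise, together with the related Amazon Web Services, so that the Cloud Native Service Add-On Pack can be deployed on the cluster and integrated with those services.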
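Prerequisites#
Note
The following steps must be performed with an AWS account that has administrator privileges.
First, provision an EKS instance that meets the minimum cluster version below. Use the hardware specifications from the AI Workflows documentation and follow the instructions in the NVIDIA AI Enterprise Cloud Guide.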
Minimum cluster version: 1.23
Minimum Cloud Native Service Add-On Pack version: 0.4.0
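The minimum-version requirement above can be checked in a script. The following is a minimal sketch, not part of the official tooling: `version_at_least` is a hypothetical helper name, and it assumes `sort -V` (version sort from GNU coreutils) is available on your system.

```shell
#!/usr/bin/env bash
# version_at_least is a hypothetical helper, not part of any AWS or NVIDIA tool.
# It succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort, GNU coreutils) is available.
version_at_least() {
    local actual="$1" minimum="$2"
    local lowest
    lowest=$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# Example with literal values; in practice the first argument could come from
#   aws eks describe-cluster --name <cluster-name> --query 'cluster.version' --output text
if version_at_least "1.25" "1.23"; then
    echo "cluster version meets the 1.23 minimum"
fi
```

Version sort is used instead of a plain string comparison so that a value such as 1.9 is correctly treated as older than 1.23.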
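After the cluster is created, make sure you can access it through the kubeconfig on your system and through eksctl. Use the following command to retrieve the cluster name:
$ aws eks list-clusters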
You should see output similar to the following:
"clusters": [
    "<cluster-name>"
]
Note this cluster name; you will refer to it in the remaining steps.
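If you have not yet created an NGC API key, create one and make sure you have access to the Enterprise Catalog.
Once you have an NGC API key, install and configure the NGC CLI if you have not already done so, following the NGC CLI setup instructions.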
EKS Configuration#
Follow the steps below to create and configure the components required to deploy and integrate the Cloud Native Service Add-On Pack.
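IAM OIDC Provider#
An IAM OIDC provider is required to manage the service accounts used to configure the EKS cluster and Amazon managed services.
To create the IAM OIDC provider, run the following command with your cluster name:
$ eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve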
You should see output similar to the following:
2023-04-04 13:55:08 [ℹ] will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 [✔] created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
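The IAM trust policies later in this guide reference the OIDC provider by ARN, built from the account ID and the cluster's issuer URL with the `https://` scheme removed. The following sketch shows that mapping; `oidc_provider_arn` is a hypothetical helper name, and the account ID and issuer URL are placeholder values.

```shell
#!/usr/bin/env bash
# oidc_provider_arn is a hypothetical helper for illustration.
# $1: 12-digit AWS account ID, $2: OIDC issuer URL of the cluster.
oidc_provider_arn() {
    local account_id="$1" issuer_url="$2"
    # The provider path is the issuer URL without its https:// scheme.
    local provider="${issuer_url#https://}"
    printf 'arn:aws:iam::%s:oidc-provider/%s\n' "$account_id" "$provider"
}

# Placeholder values; the issuer URL would normally come from
#   aws eks describe-cluster --name <cluster-name> --query 'cluster.identity.oidc.issuer' --output text
oidc_provider_arn "111122223333" "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE"
```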
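Storage Configuration#
A storage class must be available on the EKS cluster to provision the Cloud Native Service Add-On Pack. This guide uses the gp2 storage class. For details on enabling the gp2 storage class, see the AWS documentation.
After the gp2 storage class is created, the following configuration is required to create a service account that can create persistent volumes with the gp2 storage class on the cluster.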
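First, use the following command to retrieve your AWS account ID. The later commands in this section refer to this value as <cluster-id>.
$ aws sts get-caller-identity --query 'Account' --output text
You should see output similar to the following:
298485221437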
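Next, use the following command to create the IAM role for the EBS CSI driver add-on's service account, with the AmazonEBSCSIDriverPolicy attached. Because --role-only is passed, only the IAM role is created; the Kubernetes service account itself is not created or modified. To keep the role name unique and avoid conflicts with existing roles, append a timestamp to the role name.
$ eksctl create iamserviceaccount \
    --cluster <cluster-name> \
    --region <region> \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve \
    --role-only \
    --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> \
    --force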
You should see output similar to the following:
2023-04-04 15:18:05 [ℹ] 1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
2023-04-04 15:18:05 [!] serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
2023-04-04 15:18:05 [ℹ] 1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
2023-04-04 15:18:05 [ℹ] building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ] deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:05 [ℹ] waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
2023-04-04 15:18:35 [ℹ] waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
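The timestamp suffix for the role name can be generated in the shell rather than typed by hand. This is a minimal sketch under stated assumptions: `timestamped_role_name` is a hypothetical helper, and the `YYYYMMDDHHMMSS` format is only one possible choice. The length check reflects the 64-character limit on IAM role names.

```shell
#!/usr/bin/env bash
# timestamped_role_name is a hypothetical helper for illustration.
# It appends a UTC timestamp (YYYYMMDDHHMMSS) to the given prefix.
timestamped_role_name() {
    local prefix="$1"
    local name
    name="${prefix}_$(date -u +%Y%m%d%H%M%S)"
    # IAM role names are limited to 64 characters.
    if [ "${#name}" -gt 64 ]; then
        echo "role name too long: ${name}" >&2
        return 1
    fi
    printf '%s\n' "$name"
}

ROLE_NAME=$(timestamped_role_name "AmazonEKS_EBS_CSI_DriverRole")
echo "$ROLE_NAME"
```

The resulting `ROLE_NAME` value could then be passed to `--role-name` and reused when building the role ARN for the add-on.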
Use the following command to confirm that the IAM service account was created:
$ eksctl get iamserviceaccount --cluster <cluster-name>
You should see output similar to the following:
NAMESPACE     NAME                    ROLE ARN
kube-system   ebs-csi-controller-sa   arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
Confirm that the IAM role in the AWS console matches the image below.
Next, use the following command to create the add-on for your cluster with the service account role:
$ eksctl create addon \
    --name aws-ebs-csi-driver \
    --cluster <cluster-name> \
    --service-account-role-arn arn:aws:iam::<cluster-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> \
    --force
You should see output similar to the following:
2023-04-04 15:29:49 [ℹ] Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:29:49 [ℹ] using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
2023-04-04 15:29:49 [ℹ] creating addon
Use the following command to confirm that the add-on was created:
$ eksctl get addon --cluster <cluster-name>
You should see output similar to the following:
2023-04-04 15:30:20 [ℹ] Kubernetes version "1.25" in use by cluster "<cluster-name>"
2023-04-04 15:30:20 [ℹ] getting all addons
2023-04-04 15:30:21 [ℹ] to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
NAME                 VERSION              STATUS     ISSUES   IAMROLE                                                                     UPDATE AVAILABLE   CONFIGURATION VALUES
aws-ebs-csi-driver   v1.17.0-eksbuild.1   CREATING   0        arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
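The sample output above shows the add-on in the CREATING state, so a script may want to wait until it reports ACTIVE before continuing. The following is a minimal polling sketch; `wait_for_value` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary choices.

```shell
#!/usr/bin/env bash
# wait_for_value is a hypothetical helper for illustration.
# It runs a command repeatedly until its output equals the expected value.
# $1: expected output, $2: maximum attempts, $3: seconds between attempts,
# remaining arguments: the command to run.
wait_for_value() {
    local expected="$1" attempts="$2" delay="$3"
    shift 3
    local i current
    for ((i = 1; i <= attempts; i++)); do
        # A failing command is treated as "no value yet" rather than an error.
        current=$("$@" 2>/dev/null) || current=""
        if [ "$current" = "$expected" ]; then
            return 0
        fi
        sleep "$delay"
    done
    echo "gave up after ${attempts} attempts; last value: ${current}" >&2
    return 1
}

# Possible usage against a real cluster (requires AWS credentials):
#   wait_for_value ACTIVE 30 10 \
#       aws eks describe-addon --cluster-name <cluster-name> \
#       --addon-name aws-ebs-csi-driver --query 'addon.status' --output text
```

The AWS call is left as a comment so the sketch can be read and tested without credentials.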
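Amazon Web Services Integration#
When installing CNPack on EKS, you can optionally configure certain components of the cluster and connect them to AWS central services. The AWS services that can currently be configured are: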
FluentBit
Prometheus
Cert-manager
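Configuration guides for these services are provided below. For an example deployment configuration that connects the Cloud Native Service Add-On Pack to these services, see the next section.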
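AWS CloudWatch#
To enable FluentBit to send logs to AWS CloudWatch, the CloudWatchAgentServerPolicy policy must be attached to the IAM role of your cluster nodes. This policy allows FluentBit to deliver logs to AWS CloudWatch.
To attach the policy, open the AWS console and go to the IAM > Roles page.
Select the role used by your node group, then go to Add permissions > Attach policies and attach CloudWatchAgentServerPolicy.
Repeat these steps for your other node groups.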
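The console steps above could also be scripted with the AWS CLI when there are several node groups. The sketch below is an assumption-laden illustration rather than the documented procedure: `attach_cloudwatch_policy` and `DRY_RUN` are hypothetical names, the role names are placeholders, and dry-run mode (the default here) only prints the commands.

```shell
#!/usr/bin/env bash
# The role names below are placeholders; replace them with your node group roles.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# attach_cloudwatch_policy is a hypothetical helper for illustration.
# It attaches the policy to every role name passed as an argument.
attach_cloudwatch_policy() {
    local role
    for role in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "aws iam attach-role-policy --role-name ${role} --policy-arn ${POLICY_ARN}"
        else
            aws iam attach-role-policy --role-name "$role" --policy-arn "$POLICY_ARN"
        fi
    done
}

attach_cloudwatch_policy "<node-group-1-role-name>" "<node-group-2-role-name>"
```

Setting `DRY_RUN=0` would run the commands for real, which requires credentials with permission to modify IAM roles.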
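AWS Private CA#
To enable the AWS Private CA issuer on the cluster, first set up the AWS Private CA issuer from the AWS console.
After the CA is created, follow the steps below to create an IAM policy that grants access to the Amazon PCA resource (ARN).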
Create a file named iam-pca-policy.json with the following contents:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "awspcaissuer",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:acm-pca:<region>:<your-aws-account-id>:certificate-authority/<arn-of-pca>"
    }
  ]
}
Create an IAM policy named AWSPCAIssuerIAMPolicy from the iam-pca-policy.json file:
aws iam create-policy \
    --policy-name AWSPCAIssuerIAMPolicy \
    --policy-document file://iam-pca-policy.json
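This new PCA IAM policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, must be attached to the node role of the Kubernetes node pool. You can do this from the AWS console, or by running the following command as an administrator. Note that --role-name takes the name of the node role, not its ARN.
aws iam attach-role-policy \
    --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
    --role-name <your-node-role-name>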
Update the certManager configuration in the add-on pack's configuration YAML to the following:
certManager:
  enabled: true
  awsPCA:
    enabled: true
    commonName: "<your common name specific to your AWS Private CA>"
    domainName: "<the domain name specific to your AWS Private CA>"
    arn: "<ARN of the AWS Private CA>"
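The iam-pca-policy.json file from the first step of this section can also be generated from a shell variable, which avoids hand-editing the ARN inside the JSON. This is a minimal sketch: `render_pca_policy` is a hypothetical helper, and the ARN passed to it is a placeholder for the full ARN of your AWS Private CA.

```shell
#!/usr/bin/env bash
# render_pca_policy is a hypothetical helper for illustration.
# $1: full ARN of the AWS Private CA (a placeholder is used below).
render_pca_policy() {
    local ca_arn="$1"
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "awspcaissuer",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Effect": "Allow",
      "Resource": "${ca_arn}"
    }
  ]
}
EOF
}

# Write the policy document to the file name used in this section.
render_pca_policy "<ARN of the AWS Private CA>" > iam-pca-policy.json
```

The same ARN value is the one expected in the `arn` field of the `certManager.awsPCA` configuration.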
AWS Managed Prometheus#
Create an AWS Managed Prometheus workspace from the AWS console.
Once the Prometheus instance is ready, set up the IAM role required for AWS Managed Prometheus by creating an IAM helper script from the following template.
Note
Make sure to change the CLUSTER_NAME field to match the cluster name of your EKS cluster.
aws-managed-prometheus-iam-setup.sh

#!/bin/bash -e

# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


CLUSTER_NAME="<cluster-name>" # Add name of your cluster or this will fail
SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring" # Matches CNPack Deployment
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator # Matches CNPack Deployment
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role # Fine to attach to existing role
SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy # Update this policy
#
# Set up a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC Idp.
#
cat <<EOF > TrustPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
        }
      }
    }
  ]
}
EOF
#
# Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
#
cat <<EOF > PermissionPolicyIngest.json
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
      "Action": [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata"
      ],
      "Resource": "*"
    }
  ]
}
EOF

function getRoleArn() {
  OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)

  # Check for an expected exception
  if [[ $? -eq 0 ]]; then
    echo $OUTPUT
  elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
    echo ""
  else
    >&2 echo $OUTPUT
    return 1
  fi
}

#
# Create the IAM Role for ingest with the above trust policy
#
SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ];
then
  #
  # Create the IAM role for service account
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
  --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
  --assume-role-policy-document file://TrustPolicy.json \
  --query "Role.Arn" --output text)
  #
  # Create an IAM permission policy
  #
  SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
  --policy-document file://PermissionPolicyIngest.json \
  --query 'Policy.Arn' --output text)
  #
  # Attach the required IAM policies to the IAM role created above
  #
  aws iam attach-role-policy \
  --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
  --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
else
  echo "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN IAM role for ingest already exists"
fi
echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN
#
# EKS cluster hosts an OIDC provider with a public discovery endpoint.
# Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts.
# Doing this with eksctl is the easier and best approach.
#
eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
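Because the helper script fails when CLUSTER_NAME is left unedited, a small guard near the top of the script can catch that case before any AWS call is made. This is an optional sketch, not part of the original template: `require_cluster_name` is a hypothetical function, and the cluster name in the example is made up.

```shell
#!/usr/bin/env bash
# require_cluster_name is a hypothetical guard for illustration.
# It fails when the value is empty or still an unedited <placeholder>.
require_cluster_name() {
    local name="$1"
    if [ -z "$name" ] || [[ "$name" == "<"*">" ]]; then
        echo "CLUSTER_NAME is not set to a real cluster name" >&2
        return 1
    fi
}

# Example with a made-up cluster name.
CLUSTER_NAME="my-eks-cluster"
require_cluster_name "$CLUSTER_NAME" && echo "using cluster ${CLUSTER_NAME}"
```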
Run the helper script you created. You should see output similar to the following:
$ ./aws-managed-prometheus-iam-setup.sh
arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
2023-03-31 12:57:48 [ℹ] IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"
Update the add-on pack's configuration YAML file according to the following template.
prometheus:
  enabled: true
  awsRemoteWrite:
    url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
    arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"
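Make sure the url field is set to the remote write URL of your AWS Managed Prometheus workspace.
Make sure the arn field is set to the ARN of the IAM role created by the IAM helper script above, or to the ARN of an existing IAM role. For more information on creating IAM roles, see the AWS documentation.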
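The remote write URL follows the pattern shown in the template, so it can be assembled from the region and the workspace ID. The sketch below is illustrative only: `amp_remote_write_url` is a hypothetical helper, and the region and workspace ID are placeholder values.

```shell
#!/usr/bin/env bash
# amp_remote_write_url is a hypothetical helper for illustration.
# $1: AWS region, $2: workspace ID of the AWS Managed Prometheus workspace.
amp_remote_write_url() {
    local region="$1" workspace_id="$2"
    printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' \
        "$region" "$workspace_id"
}

# Placeholder values for illustration only.
amp_remote_write_url "us-west-2" "ws-EXAMPLE"
```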