Setting Up Amazon EKS

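This section describes how to set up an Amazon EKS instance supported by NVIDIA AI Enterprise, together with the related Amazon Web Services, so that the Cloud Native Service Add-On Pack can be deployed on it and integrated with those services.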

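Prerequisites

Note

The following steps must be performed with an AWS account that has administrator privileges.

  1. First, using the hardware specifications in the AI Workflows documentation, provision an EKS instance by following the instructions in the NVIDIA AI Enterprise Cloud Guide. The instance must meet the following minimum versions.

    • Minimum cluster version: 1.23

    • Minimum Cloud Native Service Add-On Pack version: 0.4.0

  2. After the cluster is created, make sure you can access it from your system through both kubeconfig and eksctl.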

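  3. Retrieve the cluster name with the following command.

    $ aws eks list-clusters
    

    You should see output similar to the following.

    {
        "clusters": [
            "<cluster-name>"
        ]
    }
    

    Note this cluster name; the remaining steps refer to it.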

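  4. If you have not already created an NGC API Key, create one, and make sure you have access to the Enterprise Catalog.

  5. After creating the NGC API Key, if you have not already installed and configured the NGC CLI, follow the instructions in the NGC CLI documentation.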
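
Before continuing, it can help to confirm that the command-line tools used in this guide are on your PATH. The following is a minimal sketch, not part of the official procedure: the helper `missing_tools` is hypothetical, `ngc` is assumed to be the NGC CLI binary name, and `kubectl` is assumed to be the client you use with your kubeconfig.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given tool that is NOT on PATH.
missing_tools() {
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || printf '%s\n' "$t"
    done
}

# Tools this guide relies on ("ngc" and "kubectl" are assumptions, see above).
REQUIRED_TOOLS="aws eksctl kubectl ngc"

# Report each tool without aborting, so the check is safe to re-run.
for tool in $REQUIRED_TOOLS; do
    if [ -z "$(missing_tools "$tool")" ]; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Once every tool is reported as found, `aws eks list-clusters` and `eksctl get cluster` are reasonable first checks that your credentials can reach the cluster.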

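EKS Configuration

Follow the steps below to create and configure the components required to deploy and integrate the Cloud Native Service Add-On Pack.

IAM OIDC Provider

An IAM OIDC provider is required to manage the service accounts used to configure the EKS cluster and the Amazon managed services.

To create the IAM OIDC provider, run the following command with your cluster name.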

eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve

You should see output similar to the following.

2023-04-04 13:55:08 []  will create IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
2023-04-04 13:55:08 []  created IAM Open ID Connect provider for cluster "<cluster-name>" in "<region>"
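
To double-check the association, you can compare the cluster's OIDC issuer with the providers registered in IAM. The sketch below assumes the same `describe-cluster` query that the Prometheus helper script later in this guide uses; the function names are hypothetical, and the issuer URL in the example call is made up.

```shell
#!/bin/sh
# Strip the scheme from an OIDC issuer URL, leaving the form used in IAM provider ARNs.
oidc_provider_path() {
    printf '%s\n' "$1" | sed -e 's|^https://||'
}

# Hypothetical wrapper (defined but not run here): succeeds if the cluster's
# issuer appears among the OIDC providers registered in IAM.
oidc_is_associated() {
    cluster="$1"
    issuer=$(aws eks describe-cluster --name "$cluster" \
        --query "cluster.identity.oidc.issuer" --output text) || return 1
    aws iam list-open-id-connect-providers --output text \
        | grep -qF "$(oidc_provider_path "$issuer")"
}

# Example with a made-up issuer URL.
oidc_provider_path "https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0123456789"
```

With valid credentials, `oidc_is_associated <cluster-name> && echo associated` would report whether the provider created above is visible to IAM.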

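Storage Configuration

A storage class must be available on the EKS cluster to provision storage for the Cloud Native Service Add-On Pack. This guide uses the gp2 storage class. For details on how to enable the gp2 storage class, refer to the AWS documentation.

Once the gp2 storage class exists, the following configuration creates a service account that can create persistent volumes with the gp2 storage class on the cluster.

  1. First, retrieve your AWS account ID with the following command. Later steps use it in role ARNs, where it appears as <cluster-id>.

    $ aws sts get-caller-identity --query 'Account' --output text
    

    You should see output similar to the following.

    298485221437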
    
  2. Next, use the following command to create the IAM role with the EBS CSI Driver policy for the service account of the EBS CSI driver add-on. To keep the role name unique and avoid conflicts with existing roles, append a timestamp to the role name.

    $ eksctl create iamserviceaccount --cluster <cluster-name> --region <region> --name ebs-csi-controller-sa --namespace kube-system --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
    

    You should see output similar to the following.

    2023-04-04 15:18:05 []  1 iamserviceaccount (kube-system/ebs-csi-controller-sa) was included (based on the include/exclude rules)
    2023-04-04 15:18:05 [!]  serviceaccounts in Kubernetes will not be created or modified, since the option --role-only is used
    2023-04-04 15:18:05 []  1 task: { create IAM role for serviceaccount "kube-system/ebs-csi-controller-sa" }
    2023-04-04 15:18:05 []  building iamserviceaccount stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 []  deploying stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:05 []  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    2023-04-04 15:18:35 []  waiting for CloudFormation stack "eksctl-<cluster-name>-addon-iamserviceaccount-kube-system-ebs-csi-controller-sa"
    

    Confirm that the service account was created with the following command.

    $ eksctl get iamserviceaccount --cluster <cluster-name>
    

    You should see output similar to the following.

    NAMESPACE    NAME                   ROLE ARN
    kube-system  ebs-csi-controller-sa  arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
    

    Confirm in the AWS console that the IAM role was created as expected.

  3. Next, create the EBS CSI driver add-on for your cluster with the following command, passing the ARN of the service account role created above. Here <cluster-id> is the account ID retrieved in step 1.

    $ eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn arn:aws:iam::<cluster-id>:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp> --force
    

    You should see output similar to the following.

    2023-04-04 15:29:49 []  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:29:49 []  using provided ServiceAccountRoleARN "arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>"
    2023-04-04 15:29:49 []  creating addon
    

    Confirm that the add-on was created with the following command.

    $ eksctl get addon --cluster <cluster-name>
    

    You should see output similar to the following.

    2023-04-04 15:30:20 []  Kubernetes version "1.25" in use by cluster "<cluster-name>"
    2023-04-04 15:30:20 []  getting all addons
    2023-04-04 15:30:21 []  to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
    NAME                VERSION             STATUS    ISSUES  IAMROLE                                                                   UPDATE AVAILABLE  CONFIGURATION VALUES
    aws-ebs-csi-driver  v1.17.0-eksbuild.1  CREATING  0       arn:aws:iam::298485221437:role/AmazonEKS_EBS_CSI_DriverRole_<timestamp>
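
The same timestamped role name has to be used in both the `eksctl create iamserviceaccount` and `eksctl create addon` commands above, so it is convenient to compute it once. The following is a minimal sketch: the helper names and the timestamp format are assumptions, and the account ID is the sample value shown in this guide.

```shell
#!/bin/sh
# Build a unique role name by appending a UTC timestamp (the format is an assumption).
ebs_csi_role_name() {
    printf 'AmazonEKS_EBS_CSI_DriverRole_%s\n' "$(date -u +%Y%m%d%H%M%S)"
}

# Compose an IAM role ARN from an AWS account ID and a role name.
role_arn() {
    printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

ROLE_NAME=$(ebs_csi_role_name)
# 298485221437 is the sample account ID from this guide; substitute your own.
ROLE_ARN=$(role_arn "298485221437" "$ROLE_NAME")

echo "role name: $ROLE_NAME"
echo "role ARN:  $ROLE_ARN"
```

You would then pass `$ROLE_NAME` to `--role-name` and `$ROLE_ARN` to `--service-account-role-arn`. IAM role names are limited to 64 characters, which this naming scheme stays well under.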
    

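Amazon Web Services Integration

When installing CNPack on EKS, you can optionally configure certain components of the cluster and connect them to AWS managed services. The components that can currently be connected to AWS services are:

  • FluentBit

  • Prometheus

  • Cert-manager

Configuration guides for these services are provided below. For example deployment configurations that connect the Cloud Native Service Add-On Pack to these services, see the next section.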

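AWS CloudWatch

To let FluentBit send logs to AWS CloudWatch, the CloudWatchAgentServerPolicy policy must be attached to the IAM role of your cluster's nodes. This policy allows FluentBit to deliver logs to AWS CloudWatch.

  1. To attach the policy, open the AWS console and go to the IAM > Roles page.

    Search for node.

    Select the role for your node group.

  2. Go to Add permissions > Attach policies.

    Search for CloudWatchAgentServerPolicy.

    Select its checkbox, then click "Add permissions". The policy should then appear in the role's list of permissions.

  3. Repeat these steps for each of your other node groups.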
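
If you prefer the command line to the console, the same attachment can be sketched with `aws iam attach-role-policy`. This is an illustrative alternative rather than part of the original procedure: the function name and the role names are placeholders, and the script only prints the commands unless `DRY_RUN` is set to 0.

```shell
#!/bin/sh
# AWS-managed policy for the CloudWatch agent (managed policies use "aws" as the account field).
CW_POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# Print (DRY_RUN=1, the default) or run the attach command for each node role NAME given.
attach_cloudwatch_policy() {
    for role in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "aws iam attach-role-policy --role-name $role --policy-arn $CW_POLICY_ARN"
        else
            aws iam attach-role-policy --role-name "$role" --policy-arn "$CW_POLICY_ARN"
        fi
    done
}

# Placeholder role names, one per node group; replace with your own.
attach_cloudwatch_policy "example-nodegroup-1-role" "example-nodegroup-2-role"
```

Review the printed commands first, then re-run with `DRY_RUN=0` exported to apply them.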

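AWS Private CA

To enable the AWS Private CA issuer on the cluster, first set up an AWS Private CA from the AWS console.

After the CA is created, create an IAM policy that grants access to the Private CA resource (identified by its ARN) by following the steps below.

  1. Create a file named pca-iam-policy.json with the following contents.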

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "awspcaissuer",
          "Action": [
            "acm-pca:DescribeCertificateAuthority",
            "acm-pca:GetCertificate",
            "acm-pca:IssueCertificate"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:acm-pca:<region>:<your-aws-account-id>:certificate-authority/<arn-of-pca>"
        }
      ]
    }
    
  2. Create an IAM policy named AWSPCAIssuerIAMPolicy.

    aws iam create-policy \
     --policy-name AWSPCAIssuerIAMPolicy \
     --policy-document file://pca-iam-policy.json
    
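  3. The new PCA IAM policy, arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy, must be attached to the node role of the Kubernetes node pool. You can do this from the AWS console or by running the following command as an administrator. Note that --role-name takes the name of the role, not its ARN.

    aws iam attach-role-policy \
    --policy-arn=arn:aws:iam::<your-aws-account-id>:policy/AWSPCAIssuerIAMPolicy \
    --role-name <your-node-role-name>
    
  4. Update the certManager section in the configuration YAML of the add-on pack to the following.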

    certManager:
      enabled: true
      awsPCA:
        enabled: true
        commonName: "<your common name specific to your AWS Private CA>"
        domainName: "<the domain name specific to your AWS Private CA>"
        arn: "<ARN of the AWS Private CA>"
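
The policy file from step 1 can also be generated from shell variables, which avoids hand-editing the resource ARN. The sketch below is illustrative only: the region, account ID, and `PCA_ID` values are placeholders, and the file name matches the one passed to `aws iam create-policy` in step 2.

```shell
#!/bin/sh
# Placeholder values; replace with your region, account ID, and CA identifier.
AWS_REGION="us-west-2"
AWS_ACCOUNT_ID="298485221437"
PCA_ID="00000000-0000-0000-0000-000000000000"
PCA_ARN="arn:aws:acm-pca:${AWS_REGION}:${AWS_ACCOUNT_ID}:certificate-authority/${PCA_ID}"

# Write the policy document; the unquoted EOF lets the shell expand ${PCA_ARN}.
cat <<EOF > pca-iam-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "awspcaissuer",
      "Action": [
        "acm-pca:DescribeCertificateAuthority",
        "acm-pca:GetCertificate",
        "acm-pca:IssueCertificate"
      ],
      "Effect": "Allow",
      "Resource": "${PCA_ARN}"
    }
  ]
}
EOF

echo "wrote pca-iam-policy.json for ${PCA_ARN}"
```

The same `PCA_ARN` value is what the `arn` field under `awsPCA` in the certManager configuration expects.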
    

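AWS Managed Prometheus

  1. Create an AWS Managed Prometheus workspace from the AWS console.

  2. Once the Prometheus workspace is ready, create an IAM helper script from the following template to set up the IAM role that AWS Managed Prometheus requires.

    Note

    Make sure to change the CLUSTER_NAME field to match the name of your EKS cluster.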

    aws-managed-prometheus-iam-setup.sh

    #!/bin/bash -e

    # Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

    # Permission is hereby granted, free of charge, to any person obtaining a copy of this
    # software and associated documentation files (the "Software"), to deal in the Software
    # without restriction, including without limitation the rights to use, copy, modify,
    # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
    # permit persons to whom the Software is furnished to do so.

    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
    # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
    # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
    # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
    # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
    # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    CLUSTER_NAME="<cluster-name>" # Add name of your cluster or this will fail
    SERVICE_ACCOUNT_NAMESPACE="nvidia-monitoring" # Matches CNPack Deployment
    AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
    SERVICE_ACCOUNT_AMP_INGEST_NAME=nvidia-prometheus-kube-pro-operator # Matches CNPack Deployment
    SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE=amp-iamproxy-ingest-role # Fine to attach to existing role
    SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY=AMPIngestPolicy # Update this policy
    #
    # Set up a trust policy designed for a specific combination of K8s service account and namespace to sign in from a Kubernetes cluster which hosts the OIDC Idp.
    #
    cat <<EOF > TrustPolicy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "${OIDC_PROVIDER}:sub": "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_AMP_INGEST_NAME}"
            }
          }
        }
      ]
    }
    EOF
    #
    # Set up the permission policy that grants ingest (remote write) permissions for all AMP workspaces
    #
    cat <<EOF > PermissionPolicyIngest.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "aps:RemoteWrite",
            "aps:GetSeries",
            "aps:GetLabels",
            "aps:GetMetricMetadata"
          ],
          "Resource": "*"
        }
      ]
    }
    EOF

    # Print the ARN of the named role, or an empty string if the role does not exist.
    function getRoleArn() {
        OUTPUT=$(aws iam get-role --role-name $1 --query 'Role.Arn' --output text 2>&1)

        # Check for an expected exception
        if [[ $? -eq 0 ]]; then
            echo $OUTPUT
        elif [[ -n $(grep "NoSuchEntity" <<< $OUTPUT) ]]; then
            echo ""
        else
            >&2 echo $OUTPUT
            return 1
        fi
    }

    #
    # Create the IAM Role for ingest with the above trust policy
    #
    SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(getRoleArn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE)
    if [ "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN" = "" ];
    then
        #
        # Create the IAM role for service account
        #
        SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN=$(aws iam create-role \
            --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
            --assume-role-policy-document file://TrustPolicy.json \
            --query "Role.Arn" --output text)
        #
        # Create an IAM permission policy
        #
        SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN=$(aws iam create-policy --policy-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_POLICY \
            --policy-document file://PermissionPolicyIngest.json \
            --query 'Policy.Arn' --output text)
        #
        # Attach the required IAM policies to the IAM role created above
        #
        aws iam attach-role-policy \
            --role-name $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE \
            --policy-arn $SERVICE_ACCOUNT_IAM_AMP_INGEST_ARN
    else
        echo "$SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN IAM role for ingest already exists"
    fi
    echo $SERVICE_ACCOUNT_IAM_AMP_INGEST_ROLE_ARN
    #
    # EKS cluster hosts an OIDC provider with a public discovery endpoint.
    # Associate this IdP with AWS IAM so that the latter can validate and accept the OIDC tokens issued by Kubernetes to service accounts.
    # Doing this with eksctl is the easier and best approach.
    #
    eksctl utils associate-iam-oidc-provider --cluster $CLUSTER_NAME --approve
    
  3. Run the helper script you created. You should see output similar to the following.

    $ ./aws-managed-prometheus-iam-setup.sh
    arn:aws:iam::298485221437:role/amp-iamproxy-ingest-role
    2023-03-31 12:57:48 []  IAM Open ID Connect provider is associated with cluster "<cluster-name>" in "<region>"
    
  4. Update the configuration YAML file of the add-on pack according to the following template.

    prometheus:
      enabled: true
      awsRemoteWrite:
        url: "https://aps-workspaces.<region>.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write"
        arn: "arn:aws:iam::<your-aws-account-id>:role/amp-iamproxy-ingest-role"
    
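    • Set the url field to the remote write URL of your AWS Managed Prometheus workspace.

    • Make sure the arn field is set to the ARN of the IAM role created by the IAM helper script above (the ARN the script prints), or to that of an existing IAM role. For more information on creating IAM roles, refer to the AWS documentation.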
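
As an illustration of how the two fields are composed, the sketch below builds the remote write URL and role ARN from a region, workspace ID, and account ID, then prints the corresponding configuration fragment. The function names and all values are placeholders; the workspace ID shown is made up, and the fragment's structure simply mirrors the template in step 4.

```shell
#!/bin/sh
# Compose the remote write URL from a region and an AMP workspace ID.
amp_remote_write_url() {
    printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' "$1" "$2"
}

# Compose the ARN of the ingest role created by the helper script.
amp_ingest_role_arn() {
    printf 'arn:aws:iam::%s:role/amp-iamproxy-ingest-role\n' "$1"
}

# Placeholder values; replace with your own region, workspace ID, and account ID.
AWS_REGION="us-west-2"
WORKSPACE_ID="ws-00000000-0000-0000-0000-000000000000"
AWS_ACCOUNT_ID="298485221437"

# Print a configuration fragment in the shape of the template above.
cat <<EOF
prometheus:
  enabled: true
  awsRemoteWrite:
    url: "$(amp_remote_write_url "$AWS_REGION" "$WORKSPACE_ID")"
    arn: "$(amp_ingest_role_arn "$AWS_ACCOUNT_ID")"
EOF
```

The workspace ID can be read from the AWS console, or listed with `aws amp list-workspaces` if your AWS CLI version includes the `amp` commands.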