Getting Started with RAPIDS and Kubernetes#

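Added in version 4.0.

This guide walks through setting up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster. By the end of the guide, you will be able to run a sample Apache Spark application on NVIDIA GPUs in a Kubernetes cluster.

This is a quick start guide that uses default settings, which may differ from those of your cluster.

Kubernetes requires a Docker image to run Spark. Typically, everything needed is in the Docker image: Spark, the RAPIDS Accelerator for Spark jar, and the discovery script.

You can find other supported base CUDA images on the CUDA Docker Hub page. The source Dockerfiles are in a GitLab repository and can be used to build the Docker image from scratch, starting from an OS base image.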

Prerequisites#

Ubuntu 22.04

Spark 3.4.0

Upstream Kubernetes Version 1.25

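Docker installed on the client machine

A Docker registry that the Kubernetes cluster can access

RAPIDS Accelerator

See the appendix for access

NGC API key

See Generating Your NGC API Key

This guide uses the Cloud Native Stack (CNS) GitHub installation guide to build the Kubernetes cluster. To install without CNS, follow the Kubernetes installation instructions to create a Kubernetes cluster with NVIDIA GPU support.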
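Before continuing, it can help to confirm that the cluster advertises GPUs to the scheduler. The sketch below assumes the nodes expose the `nvidia.com/gpu` resource, which matches the `spark.executor.resource.gpu.vendor=nvidia.com` setting used later in this guide. The function name `list_gpu_nodes` is only illustrative.

```shell
# Print each node together with its number of allocatable NVIDIA GPUs.
# Assumption: GPUs are exposed under the "nvidia.com/gpu" resource name.
list_gpu_nodes() {
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage (requires access to the cluster):
#   list_gpu_nodes
```

A node that shows `<none>` in the GPU column does not currently offer GPUs to pods.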

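Docker Image Preparation#

On the client machine where Docker is installed, download the following packages and scripts as shown below. This guide uses Apache Spark 3.4. Note that the accelerator currently supports only Scala 2.12.

The following bash commands install a local copy of Apache Spark and set up the directory used to build the Docker image: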

mkdir -p ~/spark-rapids/spark
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -zxvf spark-3.4.0-bin-hadoop3.tgz -C ~/spark-rapids/spark --strip-components 1

cd ~/spark-rapids

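Copy the RAPIDS Accelerator .jar file into the current working directory. See the Accessing the NVIDIA AI Enterprise RAPIDS Accelerator section for how to pull the .jar file.

A sample Dockerfile is provided below: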

# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
ARG spark_uid=185

# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub

# Install tini and java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends tini openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first either download Apache Spark 3.1+ from
# https://spark.apache.ac.cn/downloads.html or build and make a Spark distribution following the
# instructions in https://spark.apache.ac.cn/docs/3.1.2/building-spark.html (see
# https://nvda.org.cn/spark-rapids/docs/download.html for other supported versions).  If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY spark /opt/spark
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/kubernetes/tests /opt/spark/tests

COPY rapids-4-spark_2.12-*.jar /opt/spark/jars

RUN apt-get update && \
    apt-get install -y python-is-python3 python3-pip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}

The following directory structure is assumed:

$ ls ~/spark-rapids

Dockerfile  rapids-4-spark_2.12-23.08.1.jar   spark
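The Dockerfile copies several paths out of this directory, so a missing file only surfaces as a failed `COPY` during the build. As a rough pre-flight check, the sketch below tests for the paths named in the sample Dockerfile. `check_build_context` is a hypothetical helper, not part of Spark or the RAPIDS Accelerator.

```shell
# Check that a directory contains what the sample Dockerfile copies.
# Prints one line per missing item and returns non-zero if any is missing.
check_build_context() {
    ctx=$1
    missing=0
    for path in Dockerfile spark/bin/spark-submit \
        spark/kubernetes/dockerfiles/spark/entrypoint.sh spark/kubernetes/tests; do
        if [ ! -e "$ctx/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # The Dockerfile copies rapids-4-spark_2.12-*.jar, so at least one file must match.
    if ! ls "$ctx"/rapids-4-spark_2.12-*.jar >/dev/null 2>&1; then
        echo "missing: rapids-4-spark_2.12-*.jar"
        missing=1
    fi
    return "$missing"
}

# Usage:
#   check_build_context ~/spark-rapids && echo "build context looks complete"
```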

Build the image from the Dockerfile and push it to NGC (optional):

export IMAGE_NAME=nvcr.io/<your-registry-name>/<container-name>:<tag>
docker build . -f Dockerfile -t $IMAGE_NAME
docker push $IMAGE_NAME

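Note

Pushing to NGC is optional. See the NGC Private Registry User Guide for setup instructions.

Pulling the Docker Image from NGC to Kubernetes#

Update the Kubernetes credentials for NGC by filling in <ngc-secret-token> and <email> in the following command: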

kubectl create secret docker-registry ngc-secret --docker-server="nvcr.io/nv-nvaie-tme" --docker-username='$oauthtoken' --docker-password='<ngc-secret-token>' --docker-email='<email>'

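Note

Creating the secret fails if a secret with the same name already exists in the Kubernetes configuration. See Generating Your NGC API Key.

Important

This secret is required for authentication. Note that the secret is named ngc-secret; it is passed later in spark.kubernetes.container.image.pullSecrets.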
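Because creating the secret fails when one with the same name already exists, one possible workaround is to render the secret client-side and pipe it to `kubectl apply`, which creates or updates it. This is a sketch, not the method prescribed by this guide. `NGC_API_KEY` and `NGC_EMAIL` are assumed environment variables, and `apply_ngc_secret` is an illustrative name.

```shell
# Create or update the image pull secret without failing if it already exists.
# Assumption: NGC_API_KEY and NGC_EMAIL are set in the environment.
apply_ngc_secret() {
    kubectl create secret docker-registry ngc-secret \
        --docker-server="nvcr.io/nv-nvaie-tme" \
        --docker-username='$oauthtoken' \
        --docker-password="$NGC_API_KEY" \
        --docker-email="$NGC_EMAIL" \
        --dry-run=client -o yaml | kubectl apply -f -
}

# Usage:
#   NGC_API_KEY=<ngc-secret-token> NGC_EMAIL=<email> apply_ngc_secret
```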

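Running a Spark Application in the Kubernetes Cluster#

When CNS creates the cluster, role-based access control (RBAC) is enabled by default. Follow the steps at https://spark.apache.ac.cn/docs/latest/running-on-kubernetes.html#rbac to create a service account for pulling protected images. The relevant commands are: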

kubectl create serviceaccount spark

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Accordingly, an additional parameter is added to spark-submit:

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
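To check that the role binding took effect before submitting a job, `kubectl auth can-i` can impersonate the service account. The sketch below assumes the `spark` service account in the `default` namespace created above. `check_spark_rbac` is an illustrative name.

```shell
# Ask the API server whether the "spark" service account in the "default"
# namespace may create pods. Prints "yes" or "no"; exits non-zero on "no".
check_spark_rbac() {
    kubectl auth can-i create pods \
        --as=system:serviceaccount:default:spark \
        --namespace=default
}

# Usage:
#   check_spark_rbac
```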

Submitting a Simple Test Job#

This simple job tests whether the RAPIDS plugin can be found.

The following is an example spark-submit job:

export SPARK_HOME=~/spark-rapids/spark
export K8SMASTER=k8s://<ip>:<port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
export NGC_SECRET_NAME=ngc-secret

$SPARK_HOME/bin/spark-submit \
    --master $K8SMASTER \
    --deploy-mode cluster  \
    --name examplejob \
    --class org.apache.spark.examples.SparkPi \
    --driver-memory 2G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.memory=4G \
    --conf spark.executor.cores=1 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
    --conf spark.executor.resource.gpu.vendor=nvidia.com \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.kubernetes.namespace=$SPARK_NAMESPACE  \
    --conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME  \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=$IMAGE_NAME \
    --conf spark.kubernetes.container.image.pullSecrets=$NGC_SECRET_NAME \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 10000

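The 10000 at the end of the spark-submit command is an input that specifies the number of iterations. Note that local:// means the jar file is located inside the Docker image. Because this is cluster mode, the Spark driver runs in a pod in Kubernetes. While the job is running, the driver and executor pods are visible: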

$ kubectl get pods

NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-d11075782f399fd7-exec-1   1/1     Running   0          9s
exampledriver                      1/1     Running   0          15s
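Instead of re-running `kubectl get pods` by hand, a script can poll the driver pod's phase until it reaches a terminal state. The following is a minimal sketch. The helper name `wait_for_driver`, the 5-second interval, and the default of 120 attempts are arbitrary choices for illustration.

```shell
# Poll a driver pod until its phase is Succeeded or Failed.
# $1 = pod name, $2 = maximum number of polls (default 120).
# Returns 0 on success, 1 on failure, 2 on timeout.
wait_for_driver() {
    pod=$1
    attempts=${2:-120}
    while [ "$attempts" -gt 0 ]; do
        phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null)
        case "$phase" in
            Succeeded) echo "$pod succeeded"; return 0 ;;
            Failed)    echo "$pod failed"; return 1 ;;
        esac
        attempts=$((attempts - 1))
        sleep 5
    done
    echo "timed out waiting for $pod"
    return 2
}

# Usage:
#   wait_for_driver "$SPARK_DRIVER_NAME"
```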

To view the Spark driver logs, use the following command:

kubectl logs $SPARK_DRIVER_NAME

If the job runs successfully, the log output contains the computed value of pi:

Pi is roughly 3.1406957034785172
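When the check is scripted, the result line can be pulled out of the driver log with a small filter. This sketch assumes the SparkPi example prints its result as a line starting with `Pi is roughly`. `extract_pi` is an illustrative helper name.

```shell
# Read driver log text on stdin and print the number from the SparkPi result line.
# Returns non-zero if no such line is present.
extract_pi() {
    sed -n 's/^Pi is roughly \([0-9.]*\).*$/\1/p' | grep .
}

# Usage:
#   kubectl logs "$SPARK_DRIVER_NAME" | extract_pi
```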

Note

ClassNotFoundException is a common error when the Spark driver cannot find the RAPIDS Accelerator jar. It results in an exception like the following:

Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
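One way to investigate this error is to list the jars directory inside the image and look for the RAPIDS Accelerator jar. The sketch below assumes the image was built from the sample Dockerfile, which copies the jar to /opt/spark/jars. `image_has_rapids_jar` is an illustrative name.

```shell
# Print the RAPIDS Accelerator jars found inside an image.
# Overrides the image entrypoint so that only "ls" runs.
# Returns non-zero if no matching jar is listed.
image_has_rapids_jar() {
    docker run --rm --entrypoint ls "$1" /opt/spark/jars | grep 'rapids-4-spark'
}

# Usage:
#   image_has_rapids_jar "$IMAGE_NAME"
```

If nothing is printed, the jar was probably not in the build directory when the image was built.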

To view the Spark driver UI while the job is running, first expose the driver UI port:

kubectl port-forward $SPARK_DRIVER_NAME 4040:4040

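Note

You may need to enable ssh with port forwarding to access the UI from a remote machine. For example, run ssh -L 4040:localhost:4040 nvidia@<cluster-ip>

Then open a web browser at the exposed port to view the Spark driver UI page: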

http://localhost:4040
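Without a browser, the forwarded port can also be probed from the command line. The sketch below queries the Spark monitoring REST endpoint `/api/v1/applications` on the forwarded port. The function name `driver_ui_applications` and the default port argument are illustrative.

```shell
# Fetch the application list (JSON) from the driver UI's REST API.
# $1 = local port forwarded to the driver (default 4040).
driver_ui_applications() {
    curl --silent --fail "http://localhost:${1:-4040}/api/v1/applications"
}

# Usage (while the port-forward is active):
#   driver_ui_applications
```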

To kill the Spark job:

$SPARK_HOME/bin/spark-submit --kill $SPARK_NAMESPACE:$SPARK_DRIVER_NAME --master $K8SMASTER

To delete the driver pod:

kubectl delete pod $SPARK_DRIVER_NAME

The driver pod must be deleted before the same driver pod name can be reused.
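When resubmitting repeatedly, the cleanup can be made safe to run even if no earlier driver pod exists. This is a sketch using `kubectl delete` with `--ignore-not-found`. `remove_driver_pod` is a hypothetical helper name.

```shell
# Delete a driver pod if it exists; succeed quietly if it does not.
# $1 = pod name, $2 = namespace (default "default").
remove_driver_pod() {
    kubectl delete pod "$1" --namespace "${2:-default}" --ignore-not-found
}

# Usage, before running spark-submit again:
#   remove_driver_pod "$SPARK_DRIVER_NAME" "$SPARK_NAMESPACE"
```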