Running NeMo Curator on Kubernetes#
The following example demonstrates how to run NeMo Curator with NVIDIA GPUs on a Kubernetes cluster, using PersistentVolumeClaims as the storage option.
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Prerequisites#
- Kubernetes cluster
- kubectl: the Kubernetes cluster CLI. Contact your Kubernetes cluster administrator for how to set up your kubectl KUBECONFIG (see the quick check after this list).
- A ReadWriteMany StorageClass (set up by the Kubernetes cluster administrator)
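To confirm that kubectl can reach the cluster, a quick sanity check (generic kubectl usage, not specific to NeMo Curator) is:
# Show which context kubectl is using
kubectl config current-context
# List the cluster's nodes to confirm connectivity
kubectl get nodes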
Storage#
To run NeMo Curator, we need to set up storage to upload and store the input files, as well as any processed outputs.
Here is an example of how to create a dynamic PV from a StorageClass set up by your cluster administrator. Replace STORAGE_CLASS=<...> with the name of your StorageClass.
This example requests 150Gi of space. Adjust that number for your workloads, and note that not all storage provisioners support volume resizing.
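If you are unsure which StorageClasses are available, you can list them first (a standard kubectl query):
kubectl get storageclass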
STORAGE_CLASS=<...>
PVC_NAME=nemo-workspace
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      # Requesting enough storage for a few experiments
      storage: 150Gi
EOF
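You can then check that the claim was created (with a WaitForFirstConsumer StorageClass it may remain Pending until a pod mounts it; this check is generic kubectl usage):
kubectl get pvc ${PVC_NAME}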
Note
The storage class must support ReadWriteMany, since multiple pods may need to access the PVC to read and write concurrently.
Set Up the PVC Busybox Helper Pod#
A busybox container is convenient for inspecting the PVC and for copying data to and from it. Some of the examples below assume you have this pod running in order to copy data into and out of the PVC.
PVC_NAME=nemo-workspace
MOUNT_PATH=/nemo-workspace
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nemo-workspace-busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: workspace
      mountPath: ${MOUNT_PATH}
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
EOF
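Once the pod is Running, you can use it to inspect the (initially empty) workspace:
# List the contents of the PVC root
kubectl exec nemo-workspace-busybox -- ls -la /nemo-workspace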
Feel free to delete this pod if it is no longer needed; while idle it should consume minimal resources:
kubectl delete pod nemo-workspace-busybox
Set Up Docker Secrets#
A Kubernetes Secret needs to be created on the k8s cluster to authenticate with the NGC private registry. If you have not done so already, get an NGC key from ngc.nvidia.com. Then create a secret on the k8s cluster (replace <NGC KEY HERE> with your NGC key; note that if your key contains any special characters, you may need to wrap it in single quotes (') so that k8s can parse it correctly):
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
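You can confirm the secret was created, without printing the key, via a standard kubectl query:
kubectl get secret ngc-registry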
Set Up the Python Environment#
The environment for running the provided scripts in this example does not need the full nemo_curator package, so you can create a virtual environment containing only the required packages as follows:
python3 -m venv venv
source venv/bin/activate
pip install 'dask_kubernetes>=2024.4.1'
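As a quick sanity check (not part of the original instructions), confirm that the package imports cleanly and meets the minimum version:
python3 -c 'import dask_kubernetes; print(dask_kubernetes.__version__)'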
Upload Data to the PVC#
To copy into the nemo-workspace PVC, we will use kubectl exec. You could also use kubectl cp, but exec has fewer surprises when dealing with compressed files:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# This copies $LOCAL_WORKSPACE/my_dataset to /my_dataset within the PVC.
# Change my_dataset to the directory or file you wish to upload.
( cd $LOCAL_WORKSPACE; tar cf - my_dataset | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
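To verify the upload landed where expected, you can list it through the busybox pod (a generic verification step):
kubectl exec nemo-workspace-busybox -- ls -la /nemo-workspace/my_dataset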
Note
See Download and Extract Text for an example of how to download local data that can then be uploaded to the PVC following the instructions above.
Create a Dask Cluster#
Use the create_dask_cluster.py script to create a CPU or GPU Dask cluster.
Note
If you are creating another Dask cluster with the same --name <name>, first delete it via:
kubectl delete daskcluster <name>
# Creates a CPU Dask cluster with 1 worker
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 1 \
--image nvcr.io/nvidian/bignlp-train:nemofw-nightly \
--image_pull_secret ngc-registry \
--pvcs nemo-workspace:/nemo-workspace
#╭───────────────────── Creating KubeCluster 'rapids-dask' ─────────────────────╮
#│ │
#│ DaskCluster Running │
#│ Scheduler Pod Running │
#│ Scheduler Service Created │
#│ Default Worker Group Created │
#│ │
#│ ⠧ Getting dashboard URL │
#╰──────────────────────────────────────────────────────────────────────────────╯
#cluster = KubeCluster(rapids-dask, 'tcp://127.0.0.1:61757', workers=2, threads=510, memory=3.94 TiB)
# Creates a GPU Dask cluster with 2 workers with 1 GPU each
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 2 \
--n_gpus_per_worker 1 \
--image nvcr.io/nvidian/bignlp-train:nemofw-nightly \
--image_pull_secret ngc-registry \
--pvcs nemo-workspace:/nemo-workspace
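If you are curious what a script like create_dask_cluster.py has to do, the following is a minimal, illustrative sketch using the dask_kubernetes operator API installed above (assuming dask_kubernetes>=2024.4.1); the actual script in the NeMo Curator repository may differ:
# Illustrative sketch only -- not the actual create_dask_cluster.py.
# Builds a DaskCluster spec, attaches the image pull secret and the
# nemo-workspace PVC, then creates the cluster via the Dask operator.
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(
    name="rapids-dask",
    image="nvcr.io/nvidian/bignlp-train:nemofw-nightly",
    n_workers=2,
    # For a GPU cluster, request one GPU per worker
    resources={"limits": {"nvidia.com/gpu": "1"}},
)

# make_cluster_spec returns a plain dict, so the scheduler and worker pod
# specs can be edited to add the pull secret and mount the PVC.
for role in ("scheduler", "worker"):
    pod = spec["spec"][role]["spec"]
    pod["imagePullSecrets"] = [{"name": "ngc-registry"}]
    pod["volumes"] = [{
        "name": "workspace",
        "persistentVolumeClaim": {"claimName": "nemo-workspace"},
    }]
    pod["containers"][0]["volumeMounts"] = [
        {"name": "workspace", "mountPath": "/nemo-workspace"},
    ]

cluster = KubeCluster(custom_cluster_spec=spec)
print(cluster)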
After creating the cluster, you should be able to proceed once you have confirmed that the scheduler and workers are all in a Running state:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME"
# NAME READY STATUS RESTARTS AGE
# rapids-dask-default-worker-587238cf2c-7d685f4d75-k6rnq 1/1 Running 0 57m
# rapids-dask-default-worker-f8ff963886-5577fff76b-qmvcd 1/1 Running 3 (52m ago) 57m
# rapids-dask-scheduler-654799869d-9bw4z 1/1 Running 0 57m
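To watch progress in the Dask dashboard, you can port-forward the scheduler service created by the operator (the service name below assumes the operator's usual <name>-scheduler convention; adjust if yours differs):
# Forward the Dask dashboard to http://localhost:8787
kubectl port-forward svc/rapids-dask-scheduler 8787:8787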
(Option 1) Run an Existing Module#
Here is an example of running the existing gpu_exact_dedup Curator module. The arguments and the script name will need to change according to the module you wish to run:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
# Starts an interactive shell session in the scheduler pod
kubectl exec -it $SCHEDULER_POD -- bash
########################
# Inside SCHEDULER_POD #
########################
# Run the following inside the interactive shell to launch script in the background and
# tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
# The command line flags will need to be replaced with whatever the module script accepts.
# Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
# to somewhere under /nemo-workspace.
mkdir -p /nemo-workspace/curator/{output,log,profile}
# Write logs to script.log and to a log file with a date suffix
LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
(
echo "Writing to: $LOGS"
gpu_exact_dedup \
--input-data-dirs /nemo-workspace/my_dataset \
--output-dir /nemo-workspace/curator/output \
--hash-method md5 \
--log-dir /nemo-workspace/curator/log \
--num-files -1 \
--files-per-partition 1 \
--profile-path /nemo-workspace/curator/profile \
--log-frequency 250 \
--scheduler-address localhost:8786 \
2>&1
echo "Finished!"
) | tee $LOGS &
# At this point, feel free to disconnect the shell via Ctrl+D or simply
exit
At this point, you can tail the logs and look for Finished! in /nemo-workspace/curator/script.log:
# Command will follow the logs of the running module (Press ctrl+C to close)
kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
# Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-15-52-31
# Computing hashes for /nemo-workspace/my_dataset
# id _hashes
# 0 cc-2023-14-0397113620 91b77eae49c10a65d485ac8ca18d6c43
# 1 cc-2023-14-0397113621 a266f0794cc8ffbd431823e6930e4f80
# 2 cc-2023-14-0397113622 baee533e2eddae764de2cd6faaa1286c
# 3 cc-2023-14-0397113623 87dd52a468448b99078f97e76f528eab
# 4 cc-2023-14-0397113624 a17664daf4f24be58e0e3a3dcf81124a
# Finished!
(Option 2) Run a Custom Module#
In this example, we demonstrate how to run a NeMo Curator module that you have defined locally.
Since your Curator module may depend on a different version of Curator than the one in the container, we need to build a custom image with your code installed:
# Clone your repo. This example uses the official repo
git clone https://github.com/NVIDIA/NeMo-Curator.git NeMo-Curator-dev
# Checkout specific ref. This example uses a commit in the main branch
git -C NeMo-Curator-dev checkout fc167a6edffd38a55c333742972a5a25b901cb26
# Example NeMo base image. Change it according to your requirements
BASE_IMAGE=nvcr.io/nvidian/bignlp-train:nemofw-nightly
docker build -t nemo-curator-custom ./NeMo-Curator-dev -f - <<EOF
FROM ${BASE_IMAGE}
COPY ./ /NeMo-Curator-dev/
RUN pip install -e /NeMo-Curator-dev
EOF
# Then push this image to your registry: Change <private-registry>/<image>:<tag> accordingly
docker tag nemo-curator-custom <private-registry>/<image>:<tag>
docker push <private-registry>/<image>:<tag>
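Before pushing, you can optionally smoke-test the image locally (a generic check, not part of the original instructions):
# Verify that the editable install of NeMo Curator is importable inside the image
docker run --rm nemo-curator-custom python3 -c 'import nemo_curator; print(nemo_curator.__file__)'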
Note
When using a custom image, you may need to create a different secret, unless you pushed to a public registry:
# Fill in <private-registry>/<username>/<password>
kubectl create secret docker-registry my-private-registry --docker-server=<private-registry> --docker-username=<username> --docker-password=<password>
With this new secret, you can create your new Dask cluster:
# Fill in <private-registry>/<image>:<tag>
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 2 \
--n_gpus_per_worker 1 \
--image <private-registry>/<image>:<tag> \
--image_pull_secret my-private-registry \
--pvcs nemo-workspace:/nemo-workspace
After the Dask cluster has been deployed, you can proceed to run your module. In this example, we use the NeMo-Curator/nemo_curator/scripts/find_exact_duplicates.py module, but you can find other templates under NeMo-Curator/examples:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
# Starts an interactive shell session in the scheduler pod
kubectl exec -it $SCHEDULER_POD -- bash
########################
# Inside SCHEDULER_POD #
########################
# Run the following inside the interactive shell to launch script in the background and
# tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
# The command line flags will need to be replaced with whatever the module script accepts.
# Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
# to somewhere under /nemo-workspace.
mkdir -p /nemo-workspace/curator/{output,log,profile}
# Write logs to script.log and to a log file with a date suffix
LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
(
echo "Writing to: $LOGS"
# Recall that /NeMo-Curator-dev was copied and installed in the Dockerfile above
python3 -u /NeMo-Curator-dev/nemo_curator/scripts/find_exact_duplicates.py \
--input-data-dirs /nemo-workspace/my_dataset \
--output-dir /nemo-workspace/curator/output \
--hash-method md5 \
--log-dir /nemo-workspace/curator/log \
--files-per-partition 1 \
--profile-path /nemo-workspace/curator/profile \
--log-frequency 250 \
--scheduler-address localhost:8786 \
2>&1
echo "Finished!"
) | tee $LOGS &
# At this point, feel free to disconnect the shell via Ctrl+D or simply
exit
At this point, you can tail the logs and look for Finished! in /nemo-workspace/curator/script.log:
# Command will follow the logs of the running module (Press ctrl+C to close)
kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
# Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-20-52-07
# Reading 2 files
# /NeMo-Curator-dev/nemo_curator/modules/exact_dedup.py:157: UserWarning: Output path f/nemo-workspace/curator/output/_exact_duplicates.parquet already exists and will be overwritten
# warnings.warn(
# Finished!
Delete the Dask Cluster#
Once you are done with the Dask cluster you created, you can delete it to release its resources:
# Where <name> is the flag passed to create_dask_cluster.py. Example: `--name <name>`
kubectl delete daskcluster <name>
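You can confirm that the cluster and its pods are gone with standard kubectl queries:
kubectl get daskclusters
kubectl get pods -l "dask.org/cluster-name=<name>"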
Download Data from the PVC#
To download data from the PVC, you can use the nemo-workspace-busybox pod created earlier:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# Tar will fail if LOCAL_WORKSPACE doesn't exist
mkdir -p $LOCAL_WORKSPACE
# Copy file in PVC at /nemo-workspace/foobar.txt to local file-system at $LOCAL_WORKSPACE/nemo-workspace/foobar.txt
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/foobar.txt | tar xf - -C $LOCAL_WORKSPACE
# Copy directory in PVC at /nemo-workspace/fizzbuzz to local file-system at $LOCAL_WORKSPACE/nemo-workspace/fizzbuzz
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/fizzbuzz | tar xf - -C $LOCAL_WORKSPACE