Running NeMo Curator on Kubernetes#
The following example demonstrates how to run NeMo Curator with NVIDIA GPUs on a Kubernetes cluster, using PersistentVolumeClaims as the storage option.
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Prerequisites#
- Kubernetes cluster
- kubectl: the Kubernetes cluster CLI. Contact your Kubernetes cluster administrator for how to set up your kubectl KUBECONFIG (see the quick check after this list).
- A ReadWriteMany StorageClass (set up by the Kubernetes cluster administrator)
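To confirm that kubectl can reach the cluster, a quick sanity check (generic kubectl usage, not specific to NeMo Curator) is:
# Show which context kubectl is using
kubectl config current-context
# List the cluster's nodes to confirm connectivity
kubectl get nodes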
Storage#
To run NeMo Curator, we need to set up storage to upload and store the input files, as well as any processed outputs.
Here is an example of how to create a dynamic PV from a StorageClass set up by your cluster administrator. Replace STORAGE_CLASS=<...> with the name of your StorageClass.
This example requests 150Gi of space. Adjust that number for your workloads, and note that not all storage provisioners support volume resizing.
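If you are unsure which StorageClasses are available, you can list them first (a standard kubectl query):
kubectl get storageclass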
STORAGE_CLASS=<...>
PVC_NAME=nemo-workspace
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      # Requesting enough storage for a few experiments
      storage: 150Gi
EOF
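You can then check that the claim was created (with a WaitForFirstConsumer StorageClass it may remain Pending until a pod mounts it; this check is generic kubectl usage):
kubectl get pvc ${PVC_NAME}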
Note
The storage class must support ReadWriteMany, since multiple pods may need to access the PVC to read and write concurrently.
Set Up the PVC Busybox Helper Pod#
A busybox container is convenient for inspecting the PVC and for copying data to and from it. Some of the examples below assume you have this pod running in order to copy data into and out of the PVC.
PVC_NAME=nemo-workspace
MOUNT_PATH=/nemo-workspace
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nemo-workspace-busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: workspace
      mountPath: ${MOUNT_PATH}
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
EOF
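Once the pod is Running, you can use it to inspect the (initially empty) workspace:
# List the contents of the PVC root
kubectl exec nemo-workspace-busybox -- ls -la /nemo-workspace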
Feel free to delete this pod if it is no longer needed; while idle it should consume minimal resources:
kubectl delete pod nemo-workspace-busybox
Set Up Docker Secrets#
A Kubernetes Secret needs to be created on the k8s cluster to authenticate with the NGC private registry. If you have not done so already, get an NGC key from ngc.nvidia.com. Then create a secret on the k8s cluster (replace <NGC KEY HERE> with your NGC key; note that if your key contains any special characters, you may need to wrap it in single quotes (') so that k8s can parse it correctly):
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
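You can confirm the secret was created, without printing the key, via a standard kubectl query:
kubectl get secret ngc-registry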
Set Up the Python Environment#
The environment for running the provided scripts in this example does not need the full nemo_curator package, so you can create a virtual environment containing only the required packages as follows:
python3 -m venv venv
source venv/bin/activate
pip install 'dask_kubernetes>=2024.4.1'
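As a quick sanity check (not part of the original instructions), confirm that the package imports cleanly and meets the minimum version:
python3 -c 'import dask_kubernetes; print(dask_kubernetes.__version__)'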
Upload Data to the PVC#
To copy into the nemo-workspace PVC, we will use kubectl exec. You could also use kubectl cp, but exec has fewer surprises when dealing with compressed files:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# This copies $LOCAL_WORKSPACE/my_dataset to /my_dataset within the PVC.
# Change my_dataset to the directory or file you wish to upload.
( cd $LOCAL_WORKSPACE; tar cf - my_dataset | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
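To verify the upload landed where expected, you can list it through the busybox pod (a generic verification step):
kubectl exec nemo-workspace-busybox -- ls -la /nemo-workspace/my_dataset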
Note
See Download and Extract Text for an example of how to download local data that can then be uploaded to the PVC following the instructions above.
Create a Dask Cluster#
Use the create_dask_cluster.py script to create a CPU or GPU Dask cluster.
Note
If you are creating another Dask cluster with the same --name <name>, first delete it via:
kubectl delete daskcluster <name>
# Creates a CPU Dask cluster with 1 worker
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 1 \
--image nvcr.io/nvidian/bignlp-train:nemofw-nightly \
--image_pull_secret ngc-registry \
--pvcs nemo-workspace:/nemo-workspace
#╭───────────────────── Creating KubeCluster 'rapids-dask' ─────────────────────╮
#│ │
#│ DaskCluster Running │
#│ Scheduler Pod Running │
#│ Scheduler Service Created │
#│ Default Worker Group Created │
#│ │
#│ ⠧ Getting dashboard URL │
#╰──────────────────────────────────────────────────────────────────────────────╯
#cluster = KubeCluster(rapids-dask, 'tcp://127.0.0.1:61757', workers=2, threads=510, memory=3.94 TiB)
# Creates a GPU Dask cluster with 2 workers with 1 GPU each
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 2 \
--n_gpus_per_worker 1 \
--image nvcr.io/nvidian/bignlp-train:nemofw-nightly \
--image_pull_secret ngc-registry \
--pvcs nemo-workspace:/nemo-workspace
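If you are curious what a script like create_dask_cluster.py has to do, the following is a minimal, illustrative sketch using the dask_kubernetes operator API installed above (assuming dask_kubernetes>=2024.4.1); the actual script in the NeMo Curator repository may differ:
# Illustrative sketch only -- not the actual create_dask_cluster.py.
# Builds a DaskCluster spec, attaches the image pull secret and the
# nemo-workspace PVC, then creates the cluster via the Dask operator.
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(
    name="rapids-dask",
    image="nvcr.io/nvidian/bignlp-train:nemofw-nightly",
    n_workers=2,
    # For a GPU cluster, request one GPU per worker
    resources={"limits": {"nvidia.com/gpu": "1"}},
)

# make_cluster_spec returns a plain dict, so the scheduler and worker pod
# specs can be edited to add the pull secret and mount the PVC.
for role in ("scheduler", "worker"):
    pod = spec["spec"][role]["spec"]
    pod["imagePullSecrets"] = [{"name": "ngc-registry"}]
    pod["volumes"] = [{
        "name": "workspace",
        "persistentVolumeClaim": {"claimName": "nemo-workspace"},
    }]
    pod["containers"][0]["volumeMounts"] = [
        {"name": "workspace", "mountPath": "/nemo-workspace"},
    ]

cluster = KubeCluster(custom_cluster_spec=spec)
print(cluster)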
After creating the cluster, you should be able to proceed once you have confirmed that the scheduler and workers are all in a Running state:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME"
# NAME READY STATUS RESTARTS AGE
# rapids-dask-default-worker-587238cf2c-7d685f4d75-k6rnq 1/1 Running 0 57m
# rapids-dask-default-worker-f8ff963886-5577fff76b-qmvcd 1/1 Running 3 (52m ago) 57m
# rapids-dask-scheduler-654799869d-9bw4z 1/1 Running 0 57m
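To watch progress in the Dask dashboard, you can port-forward the scheduler service created by the operator (the service name below assumes the operator's usual <name>-scheduler convention; adjust if yours differs):
# Forward the Dask dashboard to http://localhost:8787
kubectl port-forward svc/rapids-dask-scheduler 8787:8787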
(Option 1) Run an Existing Module#
Here is an example of running the existing gpu_exact_dedup Curator module. The arguments and the script name will need to change according to the module you wish to run:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
# Starts an interactive shell session in the scheduler pod
kubectl exec -it $SCHEDULER_POD -- bash
########################
# Inside SCHEDULER_POD #
########################
# Run the following inside the interactive shell to launch script in the background and
# tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
# The command line flags will need to be replaced with whatever the module script accepts.
# Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
# to somewhere under /nemo-workspace.
mkdir -p /nemo-workspace/curator/{output,log,profile}
# Write logs to script.log and to a log file with a date suffix
LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
(
echo "Writing to: $LOGS"
gpu_exact_dedup \
--input-data-dirs /nemo-workspace/my_dataset \
--output-dir /nemo-workspace/curator/output \
--hash-method md5 \
--log-dir /nemo-workspace/curator/log \
--num-files -1 \
--files-per-partition 1 \
--profile-path /nemo-workspace/curator/profile \
--log-frequency 250 \
--scheduler-address localhost:8786 \
2>&1
echo "Finished!"
) | tee $LOGS &
# At this point, feel free to disconnect the shell via Ctrl+D or simply
exit
At this point, you can tail the logs and look for Finished! in /nemo-workspace/curator/script.log:
# Command will follow the logs of the running module (Press ctrl+C to close)
kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
# Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-15-52-31
# Computing hashes for /nemo-workspace/my_dataset
# id _hashes
# 0 cc-2023-14-0397113620 91b77eae49c10a65d485ac8ca18d6c43
# 1 cc-2023-14-0397113621 a266f0794cc8ffbd431823e6930e4f80
# 2 cc-2023-14-0397113622 baee533e2eddae764de2cd6faaa1286c
# 3 cc-2023-14-0397113623 87dd52a468448b99078f97e76f528eab
# 4 cc-2023-14-0397113624 a17664daf4f24be58e0e3a3dcf81124a
# Finished!
(Option 2) Run a Custom Module#
In this example, we demonstrate how to run a NeMo Curator module that you have defined locally.
Since your Curator module may depend on a different version of Curator than the one in the container, we need to build a custom image with your code installed:
# Clone your repo. This example uses the official repo
git clone https://github.com/NVIDIA/NeMo-Curator.git NeMo-Curator-dev
# Checkout specific ref. This example uses a commit in the main branch
git -C NeMo-Curator-dev checkout fc167a6edffd38a55c333742972a5a25b901cb26
# Example NeMo base image. Change it according to your requirements
BASE_IMAGE=nvcr.io/nvidian/bignlp-train:nemofw-nightly
docker build -t nemo-curator-custom ./NeMo-Curator-dev -f - <<EOF
FROM ${BASE_IMAGE}
COPY ./ /NeMo-Curator-dev/
RUN pip install -e /NeMo-Curator-dev
EOF
# Then push this image to your registry: Change <private-registry>/<image>:<tag> accordingly
docker tag nemo-curator-custom <private-registry>/<image>:<tag>
docker push <private-registry>/<image>:<tag>
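Before pushing, you can optionally smoke-test the image locally (a generic check, not part of the original instructions):
# Verify that the editable install of NeMo Curator is importable inside the image
docker run --rm nemo-curator-custom python3 -c 'import nemo_curator; print(nemo_curator.__file__)'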
Note
When using a custom image, you may need to create a different secret, unless you pushed to a public registry:
# Fill in <private-registry>/<username>/<password>
kubectl create secret docker-registry my-private-registry --docker-server=<private-registry> --docker-username=<username> --docker-password=<password>
With this new secret, you can create your new Dask cluster:
# Fill in <private-registry>/<image>:<tag>
python create_dask_cluster.py \
--name rapids-dask \
--n_workers 2 \
--n_gpus_per_worker 1 \
--image <private-registry>/<image>:<tag> \
--image_pull_secret my-private-registry \
--pvcs nemo-workspace:/nemo-workspace
After the Dask cluster has been deployed, you can proceed to run your module. In this example, we use the NeMo-Curator/nemo_curator/scripts/find_exact_duplicates.py module, but you can find other templates under NeMo-Curator/examples:
# Set DASK_CLUSTER_NAME to the value of --name
DASK_CLUSTER_NAME=rapids-dask
SCHEDULER_POD=$(kubectl get pods -l "dask.org/cluster-name=$DASK_CLUSTER_NAME,dask.org/component=scheduler" -o name)
# Starts an interactive shell session in the scheduler pod
kubectl exec -it $SCHEDULER_POD -- bash
########################
# Inside SCHEDULER_POD #
########################
# Run the following inside the interactive shell to launch script in the background and
# tee the logs to the /nemo-workspace PVC that was mounted in for persistence.
# The command line flags will need to be replaced with whatever the module script accepts.
# Recall that the PVC is mounted at /nemo-workspace, so any outputs should be written
# to somewhere under /nemo-workspace.
mkdir -p /nemo-workspace/curator/{output,log,profile}
# Write logs to script.log and to a log file with a date suffix
LOGS="/nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.$(date +%y_%m_%d-%H-%M-%S)"
(
echo "Writing to: $LOGS"
# Recall that /NeMo-Curator-dev was copied and installed in the Dockerfile above
python3 -u /NeMo-Curator-dev/nemo_curator/scripts/find_exact_duplicates.py \
--input-data-dirs /nemo-workspace/my_dataset \
--output-dir /nemo-workspace/curator/output \
--hash-method md5 \
--log-dir /nemo-workspace/curator/log \
--files-per-partition 1 \
--profile-path /nemo-workspace/curator/profile \
--log-frequency 250 \
--scheduler-address localhost:8786 \
2>&1
echo "Finished!"
) | tee $LOGS &
# At this point, feel free to disconnect the shell via Ctrl+D or simply
exit
At this point, you can tail the logs and look for Finished! in /nemo-workspace/curator/script.log:
# Command will follow the logs of the running module (Press ctrl+C to close)
kubectl exec -it $SCHEDULER_POD -- tail -f /nemo-workspace/curator/script.log
# Writing to: /nemo-workspace/curator/script.log /nemo-workspace/curator/script.log.24_03_27-20-52-07
# Reading 2 files
# /NeMo-Curator-dev/nemo_curator/modules/exact_dedup.py:157: UserWarning: Output path f/nemo-workspace/curator/output/_exact_duplicates.parquet already exists and will be overwritten
# warnings.warn(
# Finished!
Delete the Dask Cluster#
Once you are done with the Dask cluster you created, you can delete it to release its resources:
# Where <name> is the flag passed to create_dask_cluster.py. Example: `--name <name>`
kubectl delete daskcluster <name>
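You can confirm that the cluster and its pods are gone with standard kubectl queries:
kubectl get daskclusters
kubectl get pods -l "dask.org/cluster-name=<name>"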
Download Data from the PVC#
To download data from the PVC, you can use the nemo-workspace-busybox pod created earlier:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# Tar will fail if LOCAL_WORKSPACE doesn't exist
mkdir -p $LOCAL_WORKSPACE
# Copy file in PVC at /nemo-workspace/foobar.txt to local file-system at $LOCAL_WORKSPACE/nemo-workspace/foobar.txt
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/foobar.txt | tar xf - -C $LOCAL_WORKSPACE
# Copy directory in PVC at /nemo-workspace/fizzbuzz to local file-system at $LOCAL_WORKSPACE/nemo-workspace/fizzbuzz
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/fizzbuzz | tar xf - -C $LOCAL_WORKSPACE