Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous releases or on features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Quickstart with NeMo-Run#

This guide explains how to run any of the supported NeMo 2.0 recipes using NeMo-Run. In this tutorial, we will take a pretraining recipe and a fine-tuning recipe and try running them locally, as well as remotely on a Slurm-based cluster. Let's get started.

Please read the NeMo-Run README for a high-level overview of NeMo-Run.

Minimum Requirements#

This tutorial requires a minimum of 1 NVIDIA GPU with at least 48GB of memory for fine-tuning, and 2 NVIDIA GPUs with at least 48GB of memory each for pretraining (although pretraining can also be done on a single GPU, or on GPUs with less memory, by further reducing the model size). Each section can be followed independently based on your needs. You will also need to run this tutorial inside a NeMo container with the dev tag.
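
For reference, here is one possible way to start such a container with Docker. This is a minimal sketch, not part of the original instructions: the flags and the mounted path are assumptions and may need to be adapted to your environment; the image tag matches the nvcr.io/nvidia/nemo:dev image used later in this tutorial.

# Minimal sketch, assuming Docker with the NVIDIA Container Toolkit is installed
docker run --rm -it --gpus all \
    --shm-size=8g \
    -v $(pwd):/workspace/nemo-run \
    nvcr.io/nvidia/nemo:dev bash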

Pretraining#

For this pretraining quickstart, we will use a relatively small model. We will begin with the Nemotron 3 4B pretraining recipe and walk through the steps required to configure and launch pretraining.

As stated in the requirements, this tutorial was run on a node with 2 GPUs (RTX 5880, 48GB of memory each). If you intend to run on just 1 GPU, or on GPUs with less memory, change the configuration to match your host. For example, you can reduce num_layers or hidden_size in the model config to make it fit on a single GPU.

Set Up the Prerequisites#

Run the following commands to set up your workspace and files:

# Check GPU access
nvidia-smi

# Create and go to workspace
mkdir -p /workspace/nemo-run
cd /workspace/nemo-run

# Create a python file to run pre-training
touch nemotron_pretraining.py

Configure the Recipe#

Important

In any script you write, make sure you wrap your code in an if __name__ == "__main__": block. For details, see the documentation on working with scripts in NeMo 2.0.

Configure the recipe in nemotron_pretraining.py:

import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100, # Setting a small value for the quickstart
    )

    # Add overrides here

    return recipe

Here, the recipe variable holds a configured run.Partial object. You can read more details about the configuration system in NeMo-Run here. For those familiar with NeMo 1.0-style YAML configs, this recipe is just a Pythonic version of a YAML config file for pretraining.
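
Because the recipe is an ordinary Python object, you can inspect it before launching anything. A minimal sketch (the nested attribute paths shown are the same ones used elsewhere in this tutorial):

# Inspect the configured recipe without executing it
recipe = configure_recipe()

# Print the fully resolved configuration tree
print(recipe)

# Nested attributes can be read like regular Python attributes
print(recipe.trainer.num_nodes, recipe.trainer.devices)
print(recipe.model.config.num_layers)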

Override Attributes#

You can set overrides on its attributes just like on a normal Python object. So, if you want to change val_check_interval, you can override it after defining your recipe by setting:

recipe.trainer.val_check_interval = 100

Note

An important thing to keep in mind is that at this stage you are only configuring your task; the underlying code is not being executed at this point.
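
Other attributes can be overridden in the same way. The attribute paths below are illustrative assumptions based on typical NeMo 2.0 recipes; print your recipe to confirm they exist in your version before relying on them:

# Illustrative overrides (assumed attribute paths; verify against your recipe)
recipe.trainer.max_steps = 200          # train for more steps
recipe.data.global_batch_size = 16      # adjust the global batch size
recipe.optim.config.lr = 3e-4           # change the learning rate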

Swap Recipes#

Recipes in NeMo 2.0 are easy to swap out. For example, if you want to swap the Nemotron recipe with a Llama 3 recipe, you can simply run the following:

recipe = llm.llama3_8b.pretrain_recipe(
    dir="/checkpoints/llama3", # Path to store checkpoints
    name="llama3_pretraining",
    num_nodes=nodes,
    num_gpus_per_node=gpus_per_node,
)

Once you have configured your final recipe, you are ready to move on to the execution stage.

Execute Locally#

  1. First, we will execute locally using torchrun. To do that, we will define a LocalExecutor as shown below:

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

To learn more about NeMo-Run executors, see the execution guide.

  2. Next, we will combine the recipe and the executor to launch the pretraining run:

def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()

The full code for nemotron_pretraining.py looks like this:

import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100, # Setting a small value for the quickstart
    )

    recipe.trainer.val_check_interval = 100
    return recipe

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

# This condition is necessary for the script to be compatible with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()

You can then just run the file with:

python nemotron_pretraining.py
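
If you only want to validate the configuration without launching training, NeMo-Run also supports a dry run. The snippet below is a sketch and assumes that the dryrun flag is available on run.run in your installed NeMo-Run version:

# Inside run_pretraining(), resolve and print the task without executing it
run.run(recipe, executor=executor, dryrun=True, name="nemotron3_4b_pretraining")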

Here is a recording showing all the steps above, up to the point where pretraining starts:

Change the Number of GPUs#

Let's look at how to change the configuration to run on just 1 GPU instead of 2. All you need to do is change the configuration in run_pretraining, as shown below:

def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    # Change to 1 GPU

    # Change executor params
    executor.ntasks_per_node = 1
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Change recipe params

    # The default number of layers comes from the recipe in nemo where num_layers is 32
    # Ref: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/nemotron.py
    # To run on 1 GPU without TP, we can reduce the number of layers to 8 by setting recipe.model.config.num_layers = 8
    recipe.model.config.num_layers = 8
    # We also need to set TP to 1, since we had used 2 for 2 GPUs.
    recipe.trainer.strategy.tensor_model_parallel_size = 1
    # Lastly, we need to set devices to 1 in the trainer.
    recipe.trainer.devices = 1

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

Execute on a Slurm Cluster#

One of the benefits of NeMo-Run is that it lets you easily scale from local execution to a remote Slurm-based cluster. Next, let's see how to launch the same pretraining recipe on a Slurm cluster.

Note

Each cluster may have a different setup. It is recommended that you contact your cluster administrators for the specific details.

  1. First, we will define a Slurm executor:

from typing import Optional

def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    devices: int,
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    if not (user and host and remote_job_dir and account and partition and nodes and devices):
        raise RuntimeError(
            "Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function."
        )

    mounts = []
    # Custom mounts are defined here.
    if custom_mounts:
        mounts.extend(custom_mounts)

    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    # This defines the slurm executor.
    # We connect to the executor via the tunnel defined by user, host and remote_job_dir.
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir, # This is where the results of the run will be stored by default.
            # identity="/path/to/identity/file" OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password.
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        mem="0",
        exclusive=True,
        gres="gpu:8",
        packager=run.Packager(),
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor

  2. Next, you can simply replace the local executor with the Slurm executor, as shown below:

def run_pretraining_with_slurm():
    recipe = configure_recipe(nodes=1, gpus_per_node=8)
    executor = slurm_executor(
        user="", # TODO: Set the username you want to use
        host="", # TODO: Set the host of your cluster
        remote_job_dir="", # TODO: Set the directory on the cluster where you want to save results
        account="", # TODO: Set the account for your cluster
        partition="", # TODO: Set the partition for your cluster
        container_image="", # TODO: Set the container image you want to use for your job
        # custom_mounts=[], # TODO: Set any custom mounts
        # custom_env_vars={}, TODO: Set any custom env vars
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
    )

    run.run(recipe, executor=executor, detach=True, name="nemotron3_4b_pretraining")

  3. Finally, you can run it as follows:

if __name__ == "__main__":
    run_pretraining_with_slurm()

python nemotron_pretraining.py

Since we have set detach=True, the process will exit after scheduling the job on the cluster and will print information about the directories and commands you can use to manage the run/experiment.

Fine-Tuning#

One of the main benefits of NeMo-Run is that it decouples configuration from execution, which allows us to reuse predefined executors and simply change the recipe. For the purposes of this tutorial, we include the executor definitions so that this section can be followed on its own.

Set Up the Prerequisites#

Run the following commands to set up your Hugging Face token so that the model can be automatically converted from Hugging Face:

mkdir -p /tokens

# Fetch Huggingface token and export it.
# See https://huggingface.co/docs/hub/en/security-tokens for instructions.
export HF_TOKEN="hf_your_token" # Change this to your Huggingface token

# Save token to /tokens/huggingface
echo "$HF_TOKEN" > /tokens/huggingface
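
As an alternative (an assumption on our part, not one of the original steps), you can skip writing the token to a file and instead pass it directly through the executor's environment variables once the executor is defined later in this section; huggingface_hub honors the standard HF_TOKEN variable:

# Hypothetical alternative: pass the token via the executor instead of a file
executor.env_vars["HF_TOKEN"] = "hf_your_token"  # assumes huggingface_hub reads HF_TOKEN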

Configure the Recipe#

In this quickstart, we will fine-tune the Llama 3 8B model from Hugging Face on a single GPU. To do this, we need to follow two steps:

  1. Convert the checkpoint from Hugging Face to NeMo.

  2. Run fine-tuning using the checkpoint converted in step 1.

We will accomplish this using a NeMo-Run experiment, which lets you define these two tasks and execute them sequentially with ease. We will create a new file, nemotron_finetuning.py, in the same directory. For the fine-tuning configuration, we will use the Llama3 8B fine-tuning recipe. This recipe uses LoRA, which allows it to fit on 1 GPU (this example uses a GPU with 48GB of memory).

Let's first define the configuration for both tasks:

import nemo_run as run
from nemo.collections import llm

def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )

def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning", # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe

For details on overriding more of the default attributes, see Overrides.
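
As with the pretraining recipe, you can tweak the fine-tuning recipe before launching it. The PEFT-related attribute names below are assumptions based on the default LoRA configuration used by this recipe; print the recipe to verify them in your version:

# Illustrative LoRA overrides (assumed attribute names; verify against your recipe)
recipe.peft.dim = 16        # LoRA rank
recipe.peft.alpha = 32      # LoRA scaling factor

# Training-loop overrides work the same way as in the pretraining recipe
recipe.optim.config.lr = 1e-4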

Execute Locally#

Execution should be fairly straightforward, since we reuse the local executor (the definition is included here again for reference). Next, we will define the experiment and launch it. Here is what it looks like:

def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(import_ckpt, executor=run.LocalExecutor(), name="import_from_hf") # We don't need torchrun for the checkpoint conversion
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True) # This will run the tasks sequentially and stream the logs

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()

The full file looks like this:

import nemo_run as run
from nemotron_pretraining import local_executor_torchrun

from nemo.collections import llm


def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )


def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning",  # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Async checkpointing doesn't work with PEFT
    recipe.trainer.strategy.ckpt_async_save = False

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(
            import_ckpt, executor=run.LocalExecutor(), name="import_from_hf"
        )  # We don't need torchrun for the checkpoint conversion
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True)  # This will run the tasks sequentially and stream the logs


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()

Here is a recording showing all the steps above, up to the point where fine-tuning starts:

Use a NeMo 2.0 Pretrained Checkpoint as the Base#

If you already have a checkpoint pretrained with NeMo 2.0 and want to use it as the starting point for fine-tuning, instead of a Hugging Face checkpoint, you can do the following:

def run_finetuning():
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)
    finetune.resume.restore_config.path = "/path/to/pretrained/NeMo-2/checkpoint"

    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True)  # This will run the tasks sequentially and stream the logs

Execute on a Slurm Cluster with More Nodes#

You can reuse the Slurm executor from above. The experiment can then be configured like this:

Note

For multi-node training, the import_ckpt configuration should write to a shared filesystem that is accessible by all nodes in the cluster. You can control the default cache location by setting the NEMO_HOME environment variable.

def run_finetuning_on_slurm():
    import_ckpt = configure_checkpoint_conversion()

    # This will make finetuning run on 2 nodes with 8 GPUs each.
    recipe = configure_finetuning_recipe(gpus_per_node=8, nodes=2)
    executor = slurm_executor(
        ...
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
        ...
    )
    executor.env_vars["NEMO_HOME"] = "/path/to/a/shared/filesystem"

    # Importing checkpoint always requires only 1 node and 1 task per node
    import_executor = executor.clone()
    import_executor.nodes = 1
    import_executor.ntasks_per_node = 1
    # Set this env var for model download from huggingface
    import_executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning-slurm") as exp:
        exp.add(import_ckpt, executor=import_executor, name="import_from_hf")
        exp.add(recipe, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True)
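
As with the pretraining example, you can then guard the call and run the script. This mirrors the earlier pattern and is shown here as a sketch; it assumes the functions above, including slurm_executor, are defined in (or imported into) nemotron_finetuning.py:

if __name__ == "__main__":
    run_finetuning_on_slurm()

python nemotron_finetuning.py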