Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on a previous release or on features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Quickstart with NeMo-Run#
This tutorial explains how to run any of the supported NeMo 2.0 recipes using NeMo-Run. We will take a pretraining recipe and a fine-tuning recipe and try running them locally, as well as remotely on a Slurm-based cluster. Let's get started.
Please read the NeMo-Run README for a high-level overview of NeMo-Run.
Minimum Requirements#
This tutorial requires a minimum of 1 NVIDIA GPU with at least 48GB of memory for fine-tuning, and 2 NVIDIA GPUs with at least 48GB of memory each for pretraining (although pretraining can also be done on a single GPU, or on GPUs with less memory, by further reducing the model size). Each section can be followed independently based on your needs. You will also need to run this tutorial inside a NeMo container with the dev tag.
Pretraining#
For this pretraining quickstart, we will use a relatively small model. We will begin with the Nemotron 3 4B pretraining recipe and walk through the steps required to configure and launch pretraining.
As stated in the requirements, this tutorial was run on a node with 2 GPUs (each an RTX 5880 with 48GB of memory). If you intend to run on just 1 GPU, or on GPUs with less memory, change the configuration to match your host. For example, you can reduce num_layers or hidden_size in the model config so that it fits on a single GPU.
Set Up the Prerequisites#
Run the following commands to set up your workspace and files:
# Check GPU access
nvidia-smi
# Create and go to workspace
mkdir -p /workspace/nemo-run
cd /workspace/nemo-run
# Create a python file to run pre-training
touch nemotron_pretraining.py
Configure the Recipe#
Important
In any script you write, make sure you wrap your code in an if __name__ == "__main__": block. See Working with Scripts in NeMo 2.0 for details.
Configure the recipe inside nemotron_pretraining.py:
import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100, # Setting a small value for the quickstart
    )

    # Add overrides here

    return recipe
Here, the recipe variable holds a configured run.Partial object. To learn more about the configuration system in NeMo-Run, read about it here. For those familiar with NeMo 1.0-style YAML configurations, this recipe is just a Pythonic version of a YAML config file for pretraining.
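Because the recipe is just a configured Python object, you can inspect it before anything is executed. A minimal sketch, using configure_recipe from above (the exact printed form depends on your NeMo-Run version):

# Sketch: inspect the configured recipe; nothing is executed here.
recipe = configure_recipe()
print(recipe.trainer.max_steps)  # -> 100, as set in configure_recipe
print(recipe)  # prints the nested configuration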
Override Attributes#
You can set overrides on the recipe's attributes just like on a regular Python object. So, if you want to change the val_check_interval, you can override it after defining your recipe by setting:
recipe.trainer.val_check_interval = 100
Note
An important thing to remember is that you are only configuring your task at this stage; the underlying code is not executed yet.
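For instance, you can chain several such overrides before anything runs. A short sketch (the trainer attributes below follow PyTorch Lightning's Trainer; the optimizer path is an assumption about this recipe's structure):

recipe.trainer.log_every_n_steps = 10
# Assumption: pretrain recipes expose the optimizer settings at recipe.optim.config
recipe.optim.config.lr = 3e-4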
Swap Recipes#
The recipes in NeMo 2.0 are easy to swap out. For example, if you want to swap the Nemotron recipe for a Llama 3 recipe, you can simply run the following:
recipe = llm.llama3_8b.pretrain_recipe(
    dir="/checkpoints/llama3", # Path to store checkpoints
    name="llama3_pretraining",
    num_nodes=nodes,
    num_gpus_per_node=gpus_per_node,
)
Once you have configured your final recipe, you are ready to move to the execution stage.
Execute Locally#
First, we will execute locally using torchrun. In order to do that, we will define a LocalExecutor as shown below:
def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor
To learn more about NeMo-Run executors, see the execution guide.
Next, we will combine the recipe and the executor to launch the pretraining run:
def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)
    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()
The full code for nemotron_pretraining.py looks like this:
import nemo_run as run

from nemo.collections import llm


def configure_recipe(nodes: int = 1, gpus_per_node: int = 2):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=100, # Setting a small value for the quickstart
    )

    recipe.trainer.val_check_interval = 100
    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)
    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")


# This condition is necessary for the script to be compatible with Python's multiprocessing module.
if __name__ == "__main__":
    run_pretraining()
You can then just run the file using:
python nemotron_pretraining.py
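If you want to see what would be launched without actually starting training, NeMo-Run supports a dry run. A sketch, assuming the dryrun flag on run.run is available in your NeMo-Run version:

# Sketch: inspect the final task without executing it (assumes `dryrun`
# is supported by your nemo_run version).
run.run(configure_recipe(), executor=local_executor_torchrun(), dryrun=True)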
Here is a recording that shows all the steps above, leading up to the start of pretraining:
Change the Number of GPUs#
Let's see how we can change the configuration to run on just 1 GPU instead of 2. All you need to do is change the configuration in run_pretraining, as shown below:
def run_pretraining():
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    # Change to 1 GPU

    # Change executor params
    executor.ntasks_per_node = 1
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Change recipe params
    # The default number of layers comes from the recipe in nemo where num_layers is 32
    # Ref: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/nemotron.py
    # To run on 1 GPU without TP, we can reduce the number of layers to 8 by setting recipe.model.config.num_layers = 8
    recipe.model.config.num_layers = 8
    # We also need to set TP to 1, since we had used 2 for 2 GPUs.
    recipe.trainer.strategy.tensor_model_parallel_size = 1
    # Lastly, we need to set devices to 1 in the trainer.
    recipe.trainer.devices = 1

    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")
Execute on a Slurm Cluster#
One of the benefits of NeMo-Run is that it allows you to easily scale from local to remote Slurm-based clusters. Next, let's see how we can launch the same pretraining recipe on a Slurm cluster.
Note
Each cluster may have different settings. It is recommended that you reach out to your cluster administrators for the specific details.
First, we will define a Slurm executor:
from typing import Optional

def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    devices: int,
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    if not (user and host and remote_job_dir and account and partition and nodes and devices):
        raise RuntimeError(
            "Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function."
        )

    mounts = []
    # Custom mounts are defined here.
    if custom_mounts:
        mounts.extend(custom_mounts)

    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    # This defines the slurm executor.
    # We connect to the executor via the tunnel defined by user, host and remote_job_dir.
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir, # This is where the results of the run will be stored by default.
            # identity="/path/to/identity/file" OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password.
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        mem="0",
        exclusive=True,
        gres="gpu:8",
        packager=run.Packager(),
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor
Next, you can just replace the local executor with the Slurm executor, like below:
def run_pretraining_with_slurm():
    recipe = configure_recipe(nodes=1, gpus_per_node=8)
    executor = slurm_executor(
        user="", # TODO: Set the username you want to use
        host="", # TODO: Set the host of your cluster
        remote_job_dir="", # TODO: Set the directory on the cluster where you want to save results
        account="", # TODO: Set the account for your cluster
        partition="", # TODO: Set the partition for your cluster
        container_image="", # TODO: Set the container image you want to use for your job
        # container_mounts=[], TODO: Set any custom mounts
        # custom_env_vars={}, TODO: Set any custom env vars
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
    )

    run.run(recipe, executor=executor, detach=True, name="nemotron3_4b_pretraining")
Finally, you can run it as follows:
if __name__ == "__main__":
    run_pretraining_with_slurm()
python nemotron_pretraining.py
Since we have set detach=True, the process will exit after scheduling the job on the cluster, and it will print information about the directories and commands for managing the run/experiment.
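To check on a detached run later, you can reconnect to the experiment from another session. A sketch based on NeMo-Run's experiment management API (the experiment id below is hypothetical; use the one printed at launch):

import nemo_run as run

# Sketch: reattach to a previously launched experiment by its id.
exp = run.Experiment.from_id("nemotron3_4b_pretraining_1234567890")  # hypothetical id
exp.status()  # overall status of the jobs on the cluster
exp.logs("nemotron3_4b_pretraining")  # logs for the named task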
Fine-Tuning#
One of the main benefits of NeMo-Run is that it decouples configuration from execution, allowing us to reuse predefined executors and simply change the recipe. For the purposes of this tutorial, we will include the executor definitions so that this section can be followed independently.
Set Up the Prerequisites#
Run the following commands to set up your Hugging Face token, which enables automatic conversion of the model from Hugging Face.
mkdir -p /tokens
# Fetch Huggingface token and export it.
# See https://huggingface.co/docs/hub/en/security-tokens for instructions.
export HF_TOKEN="hf_your_token" # Change this to your Huggingface token
# Save token to /tokens/huggingface
echo "$HF_TOKEN" > /tokens/huggingface
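Optionally, you can sanity-check the saved token from Python before launching (a small sketch; Hugging Face user access tokens normally begin with hf_):

# Sketch: verify the token file that HF_TOKEN_PATH will later point to.
from pathlib import Path

token = Path("/tokens/huggingface").read_text().strip()
assert token.startswith("hf_"), "Unexpected token format in /tokens/huggingface"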
Configure the Recipe#
In this quickstart, we will fine-tune a Llama 3 8B model from Hugging Face on a single GPU. To achieve this, we need to follow two steps:
1. Convert the checkpoint from Hugging Face to NeMo.
2. Run fine-tuning using the checkpoint converted in step 1.
We will accomplish this using a NeMo-Run experiment, which allows you to define these two tasks and execute them sequentially with ease. We will create a new file, nemotron_finetuning.py, in the same directory. For the fine-tuning configuration, we will use the Llama 3 8B fine-tuning recipe. This recipe uses LoRA, which makes it possible to fit the run on 1 GPU (this example uses a GPU with 48GB of memory).
Let's first define the configuration for the two tasks:
import nemo_run as run

from nemo.collections import llm


def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )


def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning", # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Async checkpointing doesn't work with PEFT
    recipe.trainer.strategy.ckpt_async_save = False

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe
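The recipe above uses LoRA by default, and you can inspect or adjust the PEFT settings just like any other attribute. A sketch, under the assumption that the fine-tuning recipe exposes its PEFT configuration as recipe.peft:

# Sketch: inspect the PEFT (LoRA) part of the fine-tuning recipe.
recipe = configure_finetuning_recipe()
print(recipe.peft)  # assumption: the configured LoRA settings live here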
For details on overriding more of the default attributes, refer to Overrides.
Execute Locally#
Execution should be pretty straightforward, since we will reuse the local executor (the definition is included here for reference). Next, we will define the experiment and launch it. Here is what it looks like:
def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)
    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(import_ckpt, executor=run.LocalExecutor(), name="import_from_hf") # We don't need torchrun for the checkpoint conversion
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True) # This will run the tasks sequentially and stream the logs

# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()
The full file looks like this:
import nemo_run as run

from nemo.collections import llm


def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )


def configure_finetuning_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama3_8b.finetune_recipe(
        dir="/checkpoints/llama3_finetuning", # Path to store checkpoints
        name="llama3_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    recipe.trainer.max_steps = 100
    recipe.trainer.num_sanity_val_steps = 0

    # Async checkpointing doesn't work with PEFT
    recipe.trainer.strategy.ckpt_async_save = False

    # Need to set this to 1 since the default is 2
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.val_check_interval = 100

    # This is currently required for LoRA/PEFT
    recipe.trainer.strategy.ddp = "megatron"

    return recipe


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor


def run_finetuning():
    import_ckpt = configure_checkpoint_conversion()
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)
    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    # Set this env var for model download from huggingface
    executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(
            import_ckpt, executor=run.LocalExecutor(), name="import_from_hf"
        ) # We don't need torchrun for the checkpoint conversion
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True) # This will run the tasks sequentially and stream the logs


# Wrap the call in an if __name__ == "__main__": block to work with Python's multiprocessing module.
if __name__ == "__main__":
    run_finetuning()
Here is a recording that shows all the steps above, leading up to the start of fine-tuning:
Use a NeMo 2.0 Pretrained Checkpoint as the Base#
If you already have a pretrained checkpoint from NeMo 2.0 and would like to use it as the starting point for fine-tuning instead of a Hugging Face checkpoint, you can do the following:
def run_finetuning():
    finetune = configure_finetuning_recipe(nodes=1, gpus_per_node=1)
    finetune.resume.restore_config.path = "/path/to/pretrained/NeMo-2/checkpoint"
    executor = local_executor_torchrun(nodes=finetune.trainer.num_nodes, devices=finetune.trainer.devices)
    executor.env_vars["CUDA_VISIBLE_DEVICES"] = "0"

    with run.Experiment("llama3-8b-peft-finetuning") as exp:
        exp.add(finetune, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True) # This will run the tasks sequentially and stream the logs
Execute on a Slurm Cluster with More Nodes#
You can reuse the Slurm executor from above. The experiment can then be configured like this:
Note
The import_ckpt configuration should write to a shared filesystem accessible by all nodes of the cluster for multi-node training. You can control the default cache location by setting the NEMO_HOME environment variable.
def run_finetuning_on_slurm():
    import_ckpt = configure_checkpoint_conversion()

    # This will make finetuning run on 2 nodes with 8 GPUs each.
    recipe = configure_finetuning_recipe(gpus_per_node=8, nodes=2)
    executor = slurm_executor(
        ...
        nodes=recipe.trainer.num_nodes,
        devices=recipe.trainer.devices,
        ...
    )
    executor.env_vars["NEMO_HOME"] = "/path/to/a/shared/filesystem"

    # Importing the checkpoint always requires exactly 1 node and 1 task per node
    import_executor = executor.clone()
    import_executor.nodes = 1
    import_executor.ntasks_per_node = 1

    # Set this env var for model download from huggingface
    import_executor.env_vars["HF_TOKEN_PATH"] = "/tokens/huggingface"

    with run.Experiment("llama3-8b-peft-finetuning-slurm") as exp:
        exp.add(import_ckpt, executor=import_executor, name="import_from_hf")
        exp.add(recipe, executor=executor, name="peft_finetuning")
        exp.run(sequential=True, tail_logs=True)
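As with the earlier scripts, wrap the call in an if __name__ == "__main__": block when running this from a file:

if __name__ == "__main__":
    run_finetuning_on_slurm()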