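Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions, or on features not yet available in 2.0, refer to the NeMo 24.07 documentation.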

NeMo 2.0#

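In NeMo 1.0, the main interface for configuring experiments is YAML files. This approach lets you set up experiments declaratively, but it limits flexibility and programmatic control. NeMo 2.0 moves to Python-based configuration, which offers several advantages: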

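More flexibility and control over the configuration.
Better integration with IDEs for code completion and type checking.
Easier programmatic extension and customization of configurations.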
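
To make these advantages concrete, here is a minimal sketch of Python-based configuration that uses only the standard library. `TinyModelConfig` and its fields are hypothetical names chosen for illustration and are not part of the NeMo API. The sketch only shows the general idea: a typed Python object can be validated at construction time and varied programmatically, which is awkward to express in a static YAML file.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TinyModelConfig:
    """Hypothetical config object for illustration; not a NeMo class."""

    num_layers: int = 6
    hidden_size: int = 384
    num_attention_heads: int = 6

    def __post_init__(self) -> None:
        # Validation runs when the object is built, before any job is launched.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")


base = TinyModelConfig()

# Derive variants in a loop instead of copying and editing YAML files.
sweep = [replace(base, num_layers=n) for n in (6, 12, 24)]

for cfg in sweep:
    print(cfg)
```

Because the fields are typed, an IDE can complete field names and flag a misspelled or mistyped argument before the script runs.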

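By adopting PyTorch Lightning's modular abstractions, NeMo 2.0 makes it easy for users to adapt the framework to their specific use cases and to experiment with different configurations. This section gives an overview of what is new in NeMo 2.0 and includes a migration guide with step-by-step instructions for moving models from NeMo 1.0 to NeMo 2.0.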

Install NeMo 2.0#

Installation instructions for NeMo 2.0 can be found in the Getting Started guide.

Quickstart#

Important

In any script you write, make sure to wrap your code in an if __name__ == "__main__": block. For details, see Working with Scripts in NeMo 2.0.
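
The effect of the guard can be seen with a small standard-library sketch. One common reason for this kind of guard is that worker processes started by Python's multiprocessing module can re-import the launching script; whether that is the exact mechanism in a given NeMo setup is an assumption here. The script text, the file name `guarded_script.py`, and the helper `run_as` are all hypothetical.

```python
import runpy
import tempfile
from pathlib import Path

# A tiny stand-in for a training script that uses the guard.
SCRIPT = '''
events = []

def train():
    events.append("train() called")

if __name__ == "__main__":
    train()
'''


def run_as(name: str) -> list:
    """Execute SCRIPT with __name__ set to `name` and return its recorded events."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "guarded_script.py"
        path.write_text(SCRIPT)
        namespace = runpy.run_path(str(path), run_name=name)
    return namespace["events"]


imported = run_as("guarded_script")  # what a process re-importing the file sees
executed = run_as("__main__")        # what `python guarded_script.py` sees

print("imported:", imported)
print("executed:", executed)
```

When the file is loaded under any name other than `__main__`, the guarded call is skipped, so re-importing the script does not start training a second time.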

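Below is an example of running a simple training loop with NeMo 2.0. The example uses the train API from the NeMo Framework LLM collection. Once you have set up your environment using the instructions above, you are ready to run this training script.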

import torch
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig

if __name__ == "__main__":
    seq_length = 2048
    global_batch_size = 16

    ## setup the dummy dataset
    data = llm.MockDataModule(seq_length=seq_length, global_batch_size=global_batch_size)

    ## initialize a small GPT model
    gpt_config = llm.GPTConfig(
        num_layers=6,
        hidden_size=384,
        ffn_hidden_size=1536,
        num_attention_heads=6,
        seq_length=seq_length,
        init_method_std=0.023,
        hidden_dropout=0.1,
        attention_dropout=0.1,
        layernorm_epsilon=1e-5,
        make_vocab_size_divisible_by=128,
    )
    model = llm.GPTModel(gpt_config, tokenizer=data.tokenizer)

    ## initialize the strategy
    strategy = nl.MegatronStrategy(
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        pipeline_dtype=torch.bfloat16,
    )

    ## setup the optimizer
    opt_config = OptimizerConfig(
        optimizer='adam',
        lr=6e-4,
        bf16=True,
    )
    opt = nl.MegatronOptimizerModule(config=opt_config)

    trainer = nl.Trainer(
        devices=1, ## you can change the number of devices to suit your setup
        max_steps=50,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    )

    nemo_logger = nl.NeMoLogger(
        log_dir="test_logdir", ## logs and checkpoints will be written here
    )

    llm.train(
        model=model,
        data=data,
        trainer=trainer,
        log=nemo_logger,
        tokenizer='data',
        optim=opt,
    )
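
As a quick sanity check on the numbers in the script above, the following sketch derives a few quantities from its hyperparameters. It is plain arithmetic and does not call NeMo. It assumes that every sample contains exactly `seq_length` tokens and that each training step consumes one global batch.

```python
# Values copied from the quickstart script above.
seq_length = 2048
global_batch_size = 16
max_steps = 50
hidden_size = 384
ffn_hidden_size = 1536
num_attention_heads = 6

# The attention heads must evenly divide the hidden dimension.
assert hidden_size % num_attention_heads == 0, "heads must evenly divide hidden_size"

head_dim = hidden_size // num_attention_heads     # channels per attention head
ffn_ratio = ffn_hidden_size // hidden_size        # MLP expansion factor
tokens_per_step = seq_length * global_batch_size  # tokens in one global batch
total_tokens = tokens_per_step * max_steps        # tokens processed over the run

print(f"head_dim={head_dim}, ffn_ratio={ffn_ratio}")
print(f"tokens_per_step={tokens_per_step:,}, total_tokens={total_tokens:,}")
```

With these settings the run processes about 1.6 million tokens, which is why it works as a smoke test rather than a meaningful training run.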

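NeMo 2.0 also seamlessly supports scaling to thousands of GPUs using NeMo-Run. For examples of launching large-scale experiments with NeMo-Run, see the NeMo-Run Quickstart.

Note

If you are an existing NeMo 1.0 user and want to use your NeMo 1.0 dataset in place of the MockDataModule in the example, see the data migration guide for instructions.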

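Extend Quickstart with NeMo-Run#

While the NeMo-Run Quickstart covers how to configure NeMo 2.0 experiments using NeMo-Run, you are not required to use NeMo-Run's configuration system. In fact, you can take the Python script from the Quickstart above and launch it directly on a remote cluster using NeMo-Run. For more details about NeMo-Run, see the NeMo-Run GitHub repository and the hello_scripts example. Below, we walk through how to do this.

Prerequisites#

Save the script above as train.py in your working directory.
Install NeMo-Run using the following command: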

pip install git+https://github.com/NVIDIA/NeMo-Run.git

The steps below assume that train.py is saved in your current working directory.
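
If you want to verify both prerequisites before launching, a small check like the one below can help. It is a sketch that uses only the standard library; the function name `check_prerequisites` is hypothetical, and the check only confirms that the `nemo_run` package can be located, not that it works on your system.

```python
import importlib.util
from pathlib import Path


def check_prerequisites(workdir: str = ".") -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not (Path(workdir) / "train.py").is_file():
        problems.append(f"train.py not found in {Path(workdir).resolve()}")
    # find_spec only locates the package; it does not import it.
    if importlib.util.find_spec("nemo_run") is None:
        problems.append("nemo_run is not installed (see the pip command above)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```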

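Launch the Experiment Locally#

Local here means your local workstation. It can be a venv on your workstation or an interactive NeMo Docker container.

Create a new file named run.py with the following contents: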

import os
import nemo_run as run

if __name__ == "__main__":
    training_job = run.Script(
        inline="""
# This string will get saved to a sh file and executed with bash
# Run any preprocessing commands

# Run the training command
python train.py

# Run any post processing commands
"""
    )

    # Run it locally
    executor = run.LocalExecutor()

    with run.Experiment("nemo_2.0_training_experiment", log_level="INFO") as exp:
        exp.add(training_job, executor=executor, tail_logs=True, name="training")
        # Add more jobs as needed

        # Run the experiment
        exp.run(detach=False)

Then, launch the experiment using the following command:

python run.py
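
Since the inline argument in run.py is an ordinary string that is saved to a shell file and executed with bash, you can also assemble it in Python. The sketch below is one possible way to do that; `build_inline_script` is a hypothetical helper, and the pre- and post-processing commands are placeholders rather than steps that NeMo-Run requires.

```python
def build_inline_script(
    train_cmd: str,
    pre: tuple[str, ...] = (),
    post: tuple[str, ...] = (),
) -> str:
    """Join shell commands into one bash script body that stops at the first failure."""
    lines = ["set -e  # abort if any command fails", *pre, train_cmd, *post]
    return "\n".join(lines) + "\n"


inline = build_inline_script(
    "python train.py",
    pre=("nvidia-smi",),       # placeholder preprocessing command
    post=("ls test_logdir",),  # placeholder post-processing command
)
print(inline)
# In run.py this string would replace the literal: run.Script(inline=inline)
```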

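Launch the Experiment on Slurm#

Writing an extra script just to launch locally is not very useful. So let's see how to extend run.py to launch the job on any executor supported by NeMo-Run. In this tutorial, we use the Slurm executor.

Note

Each cluster may have a different setup. We recommend contacting your cluster administrator for specific details.

You can define a function to configure your Slurm executor as follows: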

# Add this to run.py, which already imports os and nemo_run as run.
from typing import Optional


def slurm_executor(
    user: str,
    host: str,
    remote_job_dir: str,
    account: str,
    partition: str,
    nodes: int,
    devices: int,
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "nvcr.io/nvidia/nemo:dev",
    retries: int = 0,
) -> run.SlurmExecutor:
    if not (user and host and remote_job_dir and account and partition and nodes and devices):
        raise RuntimeError(
            "Please set user, host, remote_job_dir, account, partition, nodes and devices args for using this function."
        )

    mounts = []
    # Custom mounts are defined here.
    if custom_mounts:
        mounts.extend(custom_mounts)

    # Env vars for jobs are configured here
    env_vars = {
        "TRANSFORMERS_OFFLINE": "1",
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
        "NVTE_FUSED_ATTN": "0",
    }
    if custom_env_vars:
        env_vars |= custom_env_vars

    # This will package the train.py script in the current working directory to the remote cluster.
    # If you are inside a git repo, you can also use https://github.com/NVIDIA/NeMo-Run/blob/main/src/nemo_run/core/packaging/git.py.
    # If the script already exists on your container and you call it with the absolute path, you can also just use `run.Packager()`.
    packager = run.PatternPackager(include_pattern="train.py", relative_path=os.getcwd())

    # This defines the slurm executor.
    # We connect to the executor via the tunnel defined by user, host and remote_job_dir.
    executor = run.SlurmExecutor(
        account=account,
        partition=partition,
        tunnel=run.SSHTunnel(
            user=user,
            host=host,
            job_dir=remote_job_dir, # This is where the results of the run will be stored by default.
            # identity="/path/to/identity/file" OPTIONAL: Provide path to the private key that can be used to establish the SSH connection without entering your password.
        ),
        nodes=nodes,
        ntasks_per_node=devices,
        gpus_per_node=devices,
        mem="0",
        exclusive=True,
        gres="gpu:8",
        packager=packager,
    )

    executor.container_image = container_image
    executor.container_mounts = mounts
    executor.env_vars = env_vars
    executor.retries = retries
    executor.time = time

    return executor
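
A note on custom_env_vars in the function above: the `|=` operator performs an in-place dictionary union, so any key you pass replaces the default with the same name and any new key is added. The short sketch below shows that behavior in isolation. It uses a subset of the defaults, and `MY_EXPERIMENT_TAG` is a hypothetical variable name.

```python
defaults = {
    "TRANSFORMERS_OFFLINE": "1",
    "NCCL_NVLS_ENABLE": "0",
}
custom_env_vars = {
    "NCCL_NVLS_ENABLE": "1",      # overrides the default above
    "MY_EXPERIMENT_TAG": "demo",  # hypothetical variable, added as a new key
}

env_vars = dict(defaults)    # copy so the defaults stay untouched
env_vars |= custom_env_vars  # in-place dict union, requires Python 3.9+

print(env_vars)
```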

Then, simply replace the executor in run.py with:

executor = slurm_executor(...) # pass in args relevant to your cluster

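You can now run the file with the same command, and it will launch your job on the cluster. Similarly, you can define multiple Slurm executors for multiple Slurm clusters and use them interchangeably, or use any other executor supported by NeMo-Run.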
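
One way to keep several clusters interchangeable is to store the arguments for each one in a named profile and look them up at launch time. The sketch below is only an illustration: the profile names, user, hosts, paths, accounts, and partitions are all placeholders, and `executor_kwargs` is a hypothetical helper that prepares keyword arguments for the slurm_executor function defined above.

```python
# Hypothetical cluster profiles; every value here is a placeholder.
CLUSTERS = {
    "cluster_a": dict(
        user="alice",
        host="login.cluster-a.example.com",
        remote_job_dir="/scratch/alice/runs",
        account="my_account",
        partition="batch",
        nodes=1,
        devices=8,
    ),
    "cluster_b": dict(
        user="alice",
        host="login.cluster-b.example.com",
        remote_job_dir="/work/alice/runs",
        account="other_account",
        partition="gpu",
        nodes=2,
        devices=8,
        time="04:00:00",
    ),
}


def executor_kwargs(name: str, **overrides) -> dict:
    """Look up a profile by name and apply per-run overrides without mutating it."""
    if name not in CLUSTERS:
        raise KeyError(f"unknown cluster {name!r}; choose from {sorted(CLUSTERS)}")
    return {**CLUSTERS[name], **overrides}


kwargs = executor_kwargs("cluster_b", nodes=4)
print(kwargs)
# In run.py: executor = slurm_executor(**kwargs)
```

Keeping the profiles as plain dictionaries means the rest of run.py does not change when you switch clusters.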

Where to Find NeMo 2.0#

Currently, the code for NeMo 2.0 can be found in the following main locations within the NeMo GitHub repository:

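LLM collection: This is the first collection to adopt the NeMo 2.0 APIs. It provides implementations of common language models using NeMo 2.0. Currently, the collection supports the following models:
- GPT
- Llama
- Mixtral
- Nemotron
- Mamba2 and hybrid models
- T5
NeMo 2.0 LLM recipes: Provide comprehensive recipes for pretraining and fine-tuning large language models. With NeMo-Run, recipes can be easily configured and modified for specific use cases.
NeMo Lightning: Provides custom PyTorch Lightning-compatible objects that make it possible to train Megatron Core-based models with PTL in a modular way. NeMo 2.0 uses these objects to train models simply and efficiently.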

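The LLM collection supports pretraining, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). For more information on each model, see the model-specific documentation linked above.

Long-context recipes are also supported through context parallelism. For more information on the available long-context recipes, see the long-context documentation.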
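
As a rough illustration of why context parallelism helps with long sequences, the sketch below computes how many tokens of one sequence each GPU would hold. It assumes the sequence is divided evenly across the context-parallel group. The helper name and the example sizes are illustrative, and actual recipes may impose further divisibility constraints.

```python
def tokens_per_cp_rank(seq_length: int, context_parallel_size: int) -> int:
    """Tokens of a single sequence held by each rank, assuming an even split."""
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be at least 1")
    if seq_length % context_parallel_size != 0:
        raise ValueError("seq_length must be divisible by context_parallel_size")
    return seq_length // context_parallel_size


# Example: a 65,536-token sequence at several context-parallel sizes.
for cp_size in (1, 2, 4, 8):
    print(f"cp_size={cp_size}: {tokens_per_cp_rank(65536, cp_size)} tokens per GPU")
```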

Inference via TRT-LLM is coming soon.

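Additional Resources#

The Feature Guide explores the main features of NeMo 2.0 in depth. Refer to it for details on each feature.
For users familiar with NeMo 1.0, the Migration Guide explains how to migrate your experiments from NeMo 1.0 to NeMo 2.0.
NeMo 2.0 Recipes contains additional examples of launching large-scale runs using NeMo 2.0 and NeMo-Run.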

Known Issues#

TRT-LLM support will be added to NeMo 2.0 in a future release.
Instructions for converting NeMo 1.0 checkpoints to the NeMo 2.0 format are coming soon.