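Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.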

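NeMo Models#

Basics#

NeMo models contain everything needed to train and reproduce conversational AI models:

Neural network architectures
Datasets/data loaders
Data preprocessing/postprocessing
Data augmentors
Optimizers and schedulers
Tokenizers
Language models

NeMo uses Hydra to configure both NeMo models and the PyTorch Lightning Trainer.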

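Note

Every NeMo model has an example configuration file and training script, which can be found in the NeMo examples.

The end result of using NeMo, PyTorch Lightning, and Hydra is that NeMo models all have the same look and feel and are fully compatible with the PyTorch ecosystem.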

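Pretrained#

NeMo comes with many pretrained models for each of our collections: ASR, NLP, and TTS. Every pretrained NeMo model can be downloaded and used with the from_pretrained() method.

As an example, we can instantiate QuartzNet with the following: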

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

To see all available pretrained models for a specific NeMo model, use the list_available_models() method:

nemo_asr.models.EncDecCTCModel.list_available_models()

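For detailed information on the available pretrained models, refer to the collections documentation.

Training#

NeMo leverages PyTorch Lightning for model training. PyTorch Lightning lets NeMo decouple the conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and build complex AI applications without having to rewrite boilerplate code for PyTorch training.

When using PyTorch Lightning, NeMo users get the following training features automatically:

Multi-GPU/multi-node
Mixed precision
Model checkpointing
Logging
Early stopping
And more

The two main aspects of the Lightning API are the LightningModule and the Trainer.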

PyTorch Lightning `LightningModule`#

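Every NeMo model is a LightningModule, which is itself a torch.nn.Module. This means that NeMo models are compatible with the PyTorch ecosystem and can be plugged into existing PyTorch workflows.

Creating a NeMo model is similar to any other PyTorch workflow. We start by initializing our model architecture, then define the forward pass: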

class TextClassificationModel(NLPModel, Exportable):
    ...
    def __init__(self, cfg: DictConfig, trainer: Trainer = None):
        """Initializes the BERTTextClassifier model."""
        ...
        super().__init__(cfg=cfg, trainer=trainer)

        # instantiate a BERT based encoder
        self.bert_model = get_lm_model(
            config_file=cfg.language_model.config_file,
            config_dict=cfg.language_model.config,
            vocab_file=cfg.tokenizer.vocab_file,
            trainer=trainer,
            cfg=cfg,
        )

        # instantiate the FFN for classification
        self.classifier = SequenceClassifier(
            hidden_size=self.bert_model.config.hidden_size,
            num_classes=cfg.dataset.num_classes,
            num_layers=cfg.classifier_head.num_output_layers,
            activation='relu',
            log_softmax=False,
            dropout=cfg.classifier_head.fc_dropout,
            use_transformer_init=True,
            idx_conditioned_on=0,
        )

def forward(self, input_ids, token_type_ids, attention_mask):
    """
    No special modification required for Lightning, define it as you normally would
    in the `nn.Module` in vanilla PyTorch.
    """
    hidden_states = self.bert_model(
        input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask
    )
    logits = self.classifier(hidden_states=hidden_states)
    return logits

The LightningModule organizes PyTorch code so that all NeMo models have a similar look and feel. For example, the training logic can be found in training_step:

def training_step(self, batch, batch_idx):
    """
    Lightning calls this inside the training loop with the data from the training dataloader
    passed in as `batch`.
    """
    # forward pass
    input_ids, input_type_ids, input_mask, labels = batch
    logits = self.forward(input_ids=input_ids, token_type_ids=input_type_ids, attention_mask=input_mask)

    train_loss = self.loss(logits=logits, labels=labels)

    lr = self._optimizer.param_groups[0]['lr']

    self.log('train_loss', train_loss)
    self.log('lr', lr, prog_bar=True)

    return {
        'loss': train_loss,
        'lr': lr,
    }

The validation logic, in turn, can be found in validation_step:

def validation_step(self, batch, batch_idx):
    """
    Lightning calls this inside the validation loop with the data from the validation dataloader
    passed in as `batch`.
    """
    if self.testing:
        prefix = 'test'
    else:
        prefix = 'val'

    input_ids, input_type_ids, input_mask, labels = batch
    logits = self.forward(input_ids=input_ids, token_type_ids=input_type_ids, attention_mask=input_mask)

    val_loss = self.loss(logits=logits, labels=labels)

    preds = torch.argmax(logits, axis=-1)

    tp, fn, fp, _ = self.classification_report(preds, labels)

    return {'val_loss': val_loss, 'tp': tp, 'fn': fn, 'fp': fp}

PyTorch Lightning then handles all of the boilerplate code needed for training. Virtually any aspect of training can be customized through PyTorch Lightning hooks, plugins, and callbacks, or by overriding methods.
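
To make the division of labor concrete, here is a minimal pure-Python sketch of the hook pattern. It is not PyTorch Lightning code: ToyModule and ToyTrainer are hypothetical stand-ins, and the "model" is a single weight fitted with hand-written gradient descent. The point is only that the module defines what happens for one batch, while the trainer owns the loops.

```python
class ToyModule:
    """Stand-in for a LightningModule: it only defines per-batch logic."""

    def __init__(self):
        self.weight = 0.0

    def training_step(self, batch, batch_idx):
        # One gradient-descent step on the squared error (weight * x - y) ** 2.
        x, y = batch
        error = self.weight * x - y
        self.weight -= 0.1 * 2 * error * x
        return {"loss": error ** 2}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": (self.weight * x - y) ** 2}


class ToyTrainer:
    """Stand-in for a Trainer: it owns the loops and calls the module's hooks."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs

    def fit(self, module, train_batches, val_batches):
        history = []
        for _ in range(self.max_epochs):
            for idx, batch in enumerate(train_batches):
                module.training_step(batch, idx)
            losses = [
                module.validation_step(batch, idx)["val_loss"]
                for idx, batch in enumerate(val_batches)
            ]
            history.append(sum(losses) / len(losses))
        return history  # mean validation loss per epoch


data = [(1.0, 2.0), (2.0, 4.0)]  # samples of y = 2 * x
module = ToyModule()
history = ToyTrainer(max_epochs=20).fit(module, data, data)
print(f"learned weight: {module.weight:.3f}")
```

Swapping in a different module changes what is trained without touching the loop, which is the same separation the real Trainer provides on a much larger scale.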

For more domain-specific information, refer to the documentation for the relevant collection (ASR, NLP, TTS).

PyTorch Lightning Trainer#

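Since every NeMo model is a LightningModule, we can automatically take advantage of the PyTorch Lightning Trainer. Every NeMo example training script uses the Trainer object to fit the model.

First, instantiate the model and trainer, then call .fit: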

# We first instantiate the trainer based on the model configuration.
# See the model configuration documentation for details.
trainer = pl.Trainer(**cfg.trainer)

# Then pass the model configuration and trainer object into the NeMo model
model = TextClassificationModel(cfg.model, trainer=trainer)

# Now we can train by calling .fit
trainer.fit(model)

# Or we can run the test loop on test data by calling
trainer.test(model=model)

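All trainer flags can be set from the NeMo configuration.

Configuration#

Hydra is an open-source Python framework that simplifies configuration for complex applications that must bring together many different software libraries. Conversational AI model training is a great example of such an application. To train a conversational AI model, we must be able to configure:

Neural network architectures
Training and optimization algorithms
Data preprocessing/postprocessing
Data augmentation
Experiment logging/visualization
Model checkpointing

For an introduction to using Hydra, refer to the Hydra Tutorials.

With Hydra, we can configure everything needed for NeMo through three interfaces:

Command line (CLI)
Configuration files (YAML)
Dataclasses (Python)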

YAML#

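NeMo provides YAML configuration files for all of our example training scripts. YAML files make it easy to experiment with different model and training configurations.

Every NeMo example YAML has the same underlying configuration structure: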

trainer
exp_manager
model

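The model configuration always contains train_ds, validation_ds, test_ds, and optim. Model architectures, however, can vary across domains. Refer to the documentation of specific collections (LLM, ASR, etc.) for detailed information on model architecture configuration.

A NeMo configuration file should look similar to the following: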

# PyTorch Lightning Trainer configuration
# any argument of the Trainer object can be set here
trainer:
    devices: 1 # number of gpus per node
    accelerator: gpu
    num_nodes: 1 # number of nodes
    max_epochs: 10 # how many training epochs to run
    val_check_interval: 1.0 # run validation after every epoch

# Experiment logging configuration
exp_manager:
    exp_dir: /path/to/my/nemo/experiments
    name: name_of_my_experiment
    create_tensorboard_logger: True
    create_wandb_logger: True

# Model configuration
# model network architecture, train/val/test datasets, data augmentation, and optimization
model:
    train_ds:
        manifest_filepath: /path/to/my/train/manifest.json
        batch_size: 256
        shuffle: True
    validation_ds:
        manifest_filepath: /path/to/my/validation/manifest.json
        batch_size: 32
        shuffle: False
    test_ds:
        manifest_filepath: /path/to/my/test/manifest.json
        batch_size: 32
        shuffle: False
    optim:
        name: novograd
        lr: .01
        betas: [0.8, 0.5]
        weight_decay: 0.001
    # network architecture can vary greatly depending on the domain
    encoder:
        ...
    decoder:
        ...

CLI#

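With NeMo and Hydra, every aspect of model training can be modified from the command line. This is extremely helpful for running lots of experiments on compute clusters or for quickly testing parameters during development.

All NeMo examples come with instructions on how to run the training/inference script from the command line.

With Hydra, arguments are set using the = operator: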

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \
    model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \
    trainer.devices=2 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50

We can use the + operator to add arguments from the CLI:

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \
    model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \
    trainer.devices=2 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    +trainer.fast_dev_run=true

We can use the ~ operator to remove configurations:

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \
    model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \
    ~model.test_ds \
    trainer.devices=2 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    +trainer.fast_dev_run=true

We can specify configuration files using the --config-path and --config-name flags:

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    --config-path=conf/quartznet \
    --config-name=quartznet_15x5 \
    model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \
    model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \
    ~model.test_ds \
    trainer.devices=2 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    +trainer.fast_dev_run=true
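
The three operators above can be illustrated with a small pure-Python sketch that applies override strings to a nested dictionary. This is not Hydra's implementation: apply_overrides and _walk are hypothetical helpers, values are kept as plain strings rather than parsed into types, and the strictness checks (rejecting = on a missing key and + on an existing key) are choices made for this sketch.

```python
import copy


def _walk(config, dotted_path):
    """Return the parent dict and the final key for a dotted path such as 'a.b.c'."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node[key]
    return node, leaf


def apply_overrides(config, overrides):
    """Apply override strings to a copy of a nested dict and return the copy."""
    result = copy.deepcopy(config)
    for item in overrides:
        if item.startswith("~"):
            # "~a.b" removes an existing key.
            node, leaf = _walk(result, item[1:])
            del node[leaf]
            continue
        adding = item.startswith("+")
        path, value = item.lstrip("+").split("=", 1)
        node, leaf = _walk(result, path)
        if adding and leaf in node:
            raise KeyError(f"'{path}' already exists; drop the '+' to change it")
        if not adding and leaf not in node:
            raise KeyError(f"'{path}' does not exist; prefix it with '+' to add it")
        node[leaf] = value  # values stay strings in this sketch
    return result


base = {
    "trainer": {"devices": "1", "max_epochs": "10"},
    "model": {"test_ds": {"batch_size": "32"}},
}
updated = apply_overrides(
    base,
    ["trainer.devices=2", "+trainer.fast_dev_run=true", "~model.test_ds"],
)
print(updated)
```

Working on a deep copy keeps the base configuration reusable across several experiments that differ only in their overrides.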

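Dataclasses#

Dataclasses allow NeMo to ship model configurations as part of the NeMo library and also enable pure Python configuration of NeMo models. With Hydra, dataclasses can be used to create structured configs for the conversational AI application.

As an example, refer to the code block below for an Attention Is All You Need machine translation model. The model configuration can be instantiated and modified like any Python dataclass.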

from nemo.collections.nlp.models.machine_translation.mt_enc_dec_config import AAYNBaseConfig

cfg = AAYNBaseConfig()

# modify the number of layers in the encoder
cfg.encoder.num_layers = 8

# modify the training batch size
cfg.train_ds.tokens_in_batch = 8192

Note

Configuration with Hydra always has the following precedence: CLI > YAML > Dataclass.
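
The precedence rule can be pictured as layered merging, where later layers win. The sketch below uses only the standard library and is an illustration rather than Hydra's actual merge logic: OptimConfig, ModelConfig, and deep_merge are hypothetical names, and the YAML and CLI layers are written as plain dictionaries instead of being parsed from real input.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class OptimConfig:
    name: str = "novograd"
    lr: float = 0.01


@dataclass
class ModelConfig:
    optim: OptimConfig = field(default_factory=OptimConfig)
    batch_size: int = 32


def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


dataclass_defaults = asdict(ModelConfig())                # lowest priority
yaml_values = {"optim": {"lr": 0.02}, "batch_size": 256}  # as if loaded from YAML
cli_values = {"optim": {"lr": 0.0004}}                    # as if parsed from the CLI

final = deep_merge(deep_merge(dataclass_defaults, yaml_values), cli_values)
print(final)
```

Here the learning rate comes from the CLI layer, the batch size from the YAML layer, and the optimizer name from the dataclass default.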

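Optimization#

Optimizers and learning rate schedules are configurable across all NeMo models and have their own namespace. Here is a sample YAML configuration for a Novograd optimizer with a cosine annealing learning rate schedule.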

optim:
    name: novograd
    lr: 0.01

    # optimizer arguments
    betas: [0.8, 0.25]
    weight_decay: 0.001

    # scheduler setup
    sched:
        name: CosineAnnealing

        # Optional arguments
        max_steps: -1 # computed at runtime or explicitly set here
        monitor: val_loss
        reduce_on_plateau: false

        # scheduler config override
        warmup_steps: 1000
        warmup_ratio: null
        min_lr: 1e-9

Note

NeMo examples have optimizer and scheduler configurations for every NeMo model.

Optimizers can be configured from the CLI as well:

# Train with the adam optimizer, change the learning rate, and modify betas.
python examples/asr/asr_ctc/speech_to_text_ctc.py \
    --config-path=conf/quartznet \
    --config-name=quartznet_15x5 \
    ... \
    model.optim.name=adam \
    model.optim.lr=.0004 \
    model.optim.betas=[.8,.5]

Optimizers#

name corresponds to the lowercase name of the optimizer. To view a list of available optimizers, run:

from nemo.core.optim.optimizers import AVAILABLE_OPTIMIZERS

for name, opt in AVAILABLE_OPTIMIZERS.items():
    print(f'name: {name}, opt: {opt}')

name: sgd, opt: <class 'torch.optim.sgd.SGD'>
name: adam, opt: <class 'torch.optim.adam.Adam'>
name: adamw, opt: <class 'torch.optim.adamw.AdamW'>
name: adadelta, opt: <class 'torch.optim.adadelta.Adadelta'>
name: adamax, opt: <class 'torch.optim.adamax.Adamax'>
name: adagrad, opt: <class 'torch.optim.adagrad.Adagrad'>
name: rmsprop, opt: <class 'torch.optim.rmsprop.RMSprop'>
name: rprop, opt: <class 'torch.optim.rprop.Rprop'>
name: novograd, opt: <class 'nemo.core.optim.novograd.Novograd'>

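Optimizer Params#

Optimizer params can vary between optimizers, but the lr param is required for all optimizers. To see the available params for an optimizer, we can look at its corresponding dataclass.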

from nemo.core.config.optimizers import NovogradParams

print(NovogradParams())

NovogradParams(lr='???', betas=(0.95, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False, luc=False, luc_trust=0.001, luc_eps=1e-08)

'???' indicates that the lr argument is required.
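
A rough way to picture the required-field marker is a dataclass whose default is the '???' string, plus a check that reports any field still holding it. The sketch below is standard-library Python only: ToyOptimizerParams and missing_fields are hypothetical names, and the real validation is performed by the configuration library rather than by a helper like this one.

```python
from dataclasses import dataclass, fields

MISSING = "???"  # the same marker string that appears in the printed dataclass


@dataclass
class ToyOptimizerParams:
    lr: object = MISSING       # required: must be set by the user
    weight_decay: float = 0.0  # optional: has a usable default


def missing_fields(params):
    """Return the names of fields that still hold the required-value marker."""
    return [f.name for f in fields(params) if getattr(params, f.name) == MISSING]


print(missing_fields(ToyOptimizerParams()))         # lr has not been set
print(missing_fields(ToyOptimizerParams(lr=0.01)))  # nothing is missing
```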

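Register Optimizer#

To register a new optimizer to be used with NeMo, run:

nemo.core.optim.optimizers.register_optimizer(name: str, optimizer: torch.optim.optimizer.Optimizer, optimizer_params: OptimizerParams)

Checks whether the optimizer name already exists in the registry and, if it does not, adds it.

This allows custom optimizers to be added and called by name during instantiation.

Parameters:

name – Name of the optimizer. Will be used as the key to retrieve the optimizer.
optimizer – Optimizer class
optimizer_params – The parameters of the optimizer, as a dataclass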
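
The registry idea can be sketched in a few lines of plain Python. This is an illustration of the pattern, not NeMo's code: ToySGD, ToySGDParams, register, and build are hypothetical names, and raising an error on a duplicate name is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ToySGDParams:
    lr: float = 0.1


class ToySGD:
    """Placeholder optimizer class that just stores its hyperparameters."""

    def __init__(self, lr):
        self.lr = lr


_REGISTRY = {}


def register(name, optimizer_cls, params_cls):
    """Add an optimizer under `name` unless that name is already taken."""
    if name in _REGISTRY:
        raise ValueError(f"'{name}' is already registered")
    _REGISTRY[name] = (optimizer_cls, params_cls)


def build(name, **overrides):
    """Look up an optimizer by name and instantiate it from its params dataclass."""
    optimizer_cls, params_cls = _REGISTRY[name]
    params = params_cls(**overrides)
    return optimizer_cls(**vars(params))


register("toy_sgd", ToySGD, ToySGDParams)
optimizer = build("toy_sgd", lr=0.5)
print(type(optimizer).__name__, optimizer.lr)
```

Pairing each class with a params dataclass lets a configuration file select the optimizer by name while defaults and argument checking stay in one place.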

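Learning Rate Schedulers#

Learning rate schedulers can be optionally configured under the optim.sched namespace.

name corresponds to the name of the learning rate schedule. To view a list of available schedulers, run: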

from nemo.core.optim.lr_scheduler import AVAILABLE_SCHEDULERS

for name, opt in AVAILABLE_SCHEDULERS.items():
    print(f'name: {name}, schedule: {opt}')

name: WarmupPolicy, schedule: <class 'nemo.core.optim.lr_scheduler.WarmupPolicy'>
name: WarmupHoldPolicy, schedule: <class 'nemo.core.optim.lr_scheduler.WarmupHoldPolicy'>
name: SquareAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.SquareAnnealing'>
name: CosineAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.CosineAnnealing'>
name: NoamAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.NoamAnnealing'>
name: WarmupAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.WarmupAnnealing'>
name: InverseSquareRootAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.InverseSquareRootAnnealing'>
name: SquareRootAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.SquareRootAnnealing'>
name: PolynomialDecayAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.PolynomialDecayAnnealing'>
name: PolynomialHoldDecayAnnealing, schedule: <class 'nemo.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing'>
name: StepLR, schedule: <class 'torch.optim.lr_scheduler.StepLR'>
name: ExponentialLR, schedule: <class 'torch.optim.lr_scheduler.ExponentialLR'>
name: ReduceLROnPlateau, schedule: <class 'torch.optim.lr_scheduler.ReduceLROnPlateau'>
name: CyclicLR, schedule: <class 'torch.optim.lr_scheduler.CyclicLR'>

Scheduler Params#

To see the available params for a scheduler, we can look at its corresponding dataclass:

from nemo.core.config.schedulers import CosineAnnealingParams

print(CosineAnnealingParams())

CosineAnnealingParams(last_epoch=-1, warmup_steps=None, warmup_ratio=None, min_lr=0.0)
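
To give a feel for how warmup_steps and min_lr shape the schedule, here is a generic linear-warmup plus cosine-decay function in plain Python. It is a textbook formulation written for illustration and may differ in detail from NeMo's CosineAnnealing class; warmup_cosine_lr and its exact warmup formula are assumptions of this sketch.

```python
import math


def warmup_cosine_lr(step, base_lr, max_steps, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly towards base_lr during the warmup phase.
        return base_lr * (step + 1) / (warmup_steps + 1)
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 5, 10, 55, 100):
    lr = warmup_cosine_lr(step, base_lr=0.01, max_steps=100, warmup_steps=10)
    print(f"step {step:3d}: lr = {lr:.6f}")
```

The rate rises during the first warmup_steps steps, peaks at base_lr, and then follows half a cosine wave down to min_lr at max_steps.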

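Register Scheduler#

To register a new scheduler to be used with NeMo, run:

nemo.core.optim.lr_scheduler.register_scheduler(name: str, scheduler: torch.optim.lr_scheduler._LRScheduler, scheduler_params: SchedulerParams)

Checks whether the scheduler name already exists in the registry and, if it does not, adds it.

This allows custom schedulers to be added and called by name during instantiation.

Parameters:

name – Name of the scheduler. Will be used as the key to retrieve the scheduler.
scheduler – Scheduler class (inherits from _LRScheduler)
scheduler_params – The parameters of the scheduler, as a dataclass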

Save and Restore#

NeMo models all come with .save_to and .restore_from methods.

Save#

To save a NeMo model, run:

model.save_to('/path/to/model.nemo')

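Everything needed to use the trained model is packaged and saved in the .nemo file. For example, in the NLP domain, .nemo files include the necessary tokenizer models and/or vocabulary files.

Note

A .nemo file is simply an archive like any other .tar file.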
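
Because the file is an ordinary tar archive, the standard tarfile module can list what is inside it. The sketch below builds a small stand-in archive so that it runs without a real model; list_archive is a hypothetical helper, and the member names model_config.yaml and model_weights.ckpt are placeholders chosen for this example rather than a guaranteed layout.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_archive(path):
    """Return the sorted names of the regular files stored in a tar archive."""
    with tarfile.open(str(path), "r:*") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "toy_model.nemo"

    # Write a stand-in archive with two placeholder members.
    placeholder_members = [
        ("model_config.yaml", b"optim:\n  lr: 0.01\n"),
        ("model_weights.ckpt", b"\x00"),
    ]
    with tarfile.open(str(archive_path), "w") as archive:
        for name, payload in placeholder_members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))

    contents = list_archive(archive_path)

print(contents)
```

Pointing list_archive at a real .nemo file should work the same way, since the "r:*" mode lets tarfile detect compression automatically.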

Restore#

To restore a NeMo model, run:

# Here, you should usually use the class of the model, or simply use ModelPT.restore_from() for simplicity.
model.restore_from('/path/to/model.nemo')

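When using the PyTorch Lightning Trainer, a PyTorch Lightning checkpoint is created. These checkpoints are mainly used within NeMo to auto-resume training. Since NeMo models are LightningModules, the PyTorch Lightning method load_from_checkpoint is available. Note that load_from_checkpoint won't necessarily work out of the box for all models, as some models require more artifacts than just the checkpoint to be restored. For these models, the user will have to override load_from_checkpoint if they want to use it.

It is highly recommended to use restore_from to load NeMo models.

Restore with Modified Config#

Sometimes it may be necessary to modify a model (or its sub-components) before restoring it. A common case is when the model's internal config must be updated for various reasons (such as deprecation, newer versioning, or support for a new feature). As long as the model has the same parameters as the original config, the parameters can once again be restored safely.

In NeMo, the model's internal config is preserved as part of the .nemo file. This config is used during restoration and, as shown below, we can update it before restoring the model.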

# When restoring a model, you should generally use the class of the model
# Obtain the config (as an OmegaConf object)
config = model_class.restore_from('/path/to/model.nemo', return_config=True)
# OR
config = model_class.from_pretrained('name_of_the_model', return_config=True)

# Modify the config as needed
config.x.y = z

# Restore the model from the updated config
model = model_class.restore_from('/path/to/model.nemo', override_config_path=config)
# OR
model = model_class.from_pretrained('name_of_the_model', override_config_path=config)

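Register Artifacts#

Restoring conversational AI models can be complicated because it requires more than just the checkpoint weights; additional information is also needed to use the model. NeMo models can save additional artifacts in the .nemo file by calling .register_artifact. When restoring NeMo models using .restore_from or .from_pretrained, any artifacts that were registered are available automatically.

As an example, consider an NLP model that requires a trained tokenizer model. The tokenizer model file can be automatically added to the .nemo file with the following: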

self.encoder_tokenizer = get_nmt_tokenizer(
    ...
    tokenizer_model=self.register_artifact(config_path='encoder_tokenizer.tokenizer_model',
                                           src='/path/to/tokenizer.model',
                                           verify_src_exists=True),
)

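By default, .register_artifact always returns a path. If the model is being restored from a .nemo file, then that path points to the artifact in the .nemo file. Otherwise, .register_artifact returns the local path specified by the user.

config_path is the artifact key. It usually corresponds to a model configuration but does not have to. The model config that is packaged with the .nemo file is updated according to the config_path key. In the above example, the model config will have: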

encoder_tokenizer:
    ...
    tokenizer_model: nemo:4978b28103264263a03439aaa6560e5e_tokenizer.model

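src is the path to the artifact, and the basename of that path is used when packaging the artifact in the .nemo file. Each artifact has a hash prepended to the basename of src in the .nemo file. This prevents collisions between artifacts that share the same basename (for example, when there are two or more tokenizers, both called tokenizer.model). The resulting .nemo file will then contain the following file: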

4978b28103264263a03439aaa6560e5e_tokenizer.model

If verify_src_exists is set to False, then the artifact is optional. This means that .register_artifact will return None if src cannot be found.
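
The naming scheme and the optional-artifact behavior can be mimicked with a short standard-library sketch. It is not NeMo's implementation: packed_name and this register_artifact are hypothetical stand-ins, the random prefix comes from uuid4 as an assumption, and raising FileNotFoundError for a missing required artifact is a choice made here.

```python
import os
import tempfile
import uuid


def packed_name(src):
    """Prefix the basename with a random 32-character hex string to avoid collisions."""
    return f"{uuid.uuid4().hex}_{os.path.basename(src)}"


def register_artifact(registry, config_path, src, verify_src_exists=True):
    """Record an artifact under `config_path`; return its path, or None if optional and absent."""
    if not os.path.exists(src):
        if verify_src_exists:
            raise FileNotFoundError(src)
        return None
    registry[config_path] = packed_name(src)
    return src


registry = {}
with tempfile.TemporaryDirectory() as tmp:
    # Two artifacts with the same basename in different directories.
    for side in ("encoder", "decoder"):
        os.makedirs(os.path.join(tmp, side))
        path = os.path.join(tmp, side, "tokenizer.model")
        open(path, "w").close()
        register_artifact(registry, f"{side}_tokenizer.tokenizer_model", path)

    # An optional artifact that does not exist is skipped instead of failing.
    optional = register_artifact(
        registry, "lm.vocab_file", os.path.join(tmp, "missing.txt"), verify_src_exists=False
    )

for key, name in registry.items():
    print(key, "->", name)
```

Both tokenizers keep their readable basename, while the random prefix keeps the two packed names distinct.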

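Push to Hugging Face Hub#

NeMo models can be pushed to the Hugging Face Hub with the push_to_hf_hub() method. This method performs the same actions as save_to() and then uploads the model to the Hugging Face Hub. It offers an additional pack_nemo_file argument that lets the user upload either the single packed .nemo file or the individual files that comprise it. This is useful for large language models with a massive number of parameters, where a single NeMo file could exceed the maximum upload size of the Hugging Face Hub.

Upload a Model to the Hub#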

token = "<HF TOKEN>" or None
pack_nemo_file = True  # False will upload multiple files that comprise the NeMo file onto HF Hub; Generally useful for LLMs

model.push_to_hf_hub(
   repo_id=repo_id, pack_nemo_file=pack_nemo_file, token=token,
)

Use a Custom Model Card Template for the Hub#

# Override the default model card
template = """ <Your own custom template>
# {model_name}
"""
kwargs = {"model_name": "ABC", "repo_id": "nvidia/ABC_XYZ"}
model_card = model.generate_model_card(template=template, template_kwargs=kwargs, type="hf")

model.push_to_hf_hub(
    repo_id=repo_id, token=token, model_card=model_card
)

# Write your own model card class
class MyModelCard:
  def __init__(self, model_name):
    self.model_name = model_name

  def __repr__(self):
    template = """This is the {model_name} model""".format(model_name=self.model_name)
    return template

model.push_to_hf_hub(
    repo_id=repo_id, token=token, model_card=MyModelCard("ABC")
)

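Nested NeMo Models#

In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate a language model into an ASR model to use in the decoding process and improve accuracy, or use a hybrid ASR-TTS model to generate audio from text on the fly to train or fine-tune the ASR model.

There are three ways to instantiate child models inside parent models:

Use a subconfig directly
Use a .nemo checkpoint path to load the child model
Use a pretrained NeMo model

To register a child model, use the register_nemo_submodule method of the parent model. This method adds the child model to the specified model attribute. During serialization, it correctly handles child artifacts and stores the child model's configuration in the parent model's config_field.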

from typing import Optional

from nemo.core.classes import ModelPT

class ChildModel(ModelPT):
    ...  # implement necessary methods

class ParentModel(ModelPT):
    def __init__(self, cfg, trainer=None):
        super().__init__(cfg=cfg, trainer=trainer)

        # optionally annotate type for IDE autocompletion and type checking
        self.child_model: Optional[ChildModel]
        if cfg.get("child_model") is not None:
            # load directly from config
            # either if config provided initially, or automatically
            # after model restoration
            self.register_nemo_submodule(
                name="child_model",
                config_field="child_model",
                model=ChildModel(self.cfg.child_model, trainer=trainer),
            )
        elif cfg.get('child_model_path') is not None:
            # load from .nemo model checkpoint
            # while saving, config will be automatically assigned/updated
            # in cfg.child_model
            self.register_nemo_submodule(
                name="child_model",
                config_field="child_model",
                model=ChildModel.restore_from(self.cfg.child_model_path, trainer=trainer),
            )
        elif cfg.get('child_model_name') is not None:
            # load from pretrained model
            # while saving, config will be automatically assigned/updated
            # in cfg.child_model
            self.register_nemo_submodule(
                name="child_model",
                config_field="child_model",
                model=ChildModel.from_pretrained(self.cfg.child_model_name, trainer=trainer),
            )
        else:
            self.child_model = None

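Profiling#

NeMo offers users two options for profiling: Nsys and CUDA memory profiling. These two options allow users to debug performance issues as well as memory issues such as memory leaks.

To enable Nsys profiling, add the following options to the model config: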

nsys_profile:
    enabled: True
    start_step: 10  # Global batch to start profiling
    end_step: 10 # Global batch to end profiling
    ranks: [0] # Global rank IDs to profile
    gen_shape: False # Generate model and kernel details including input shapes

Finally, run the model training script with:

nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python ./examples/...

Refer to the Nsight Systems User Guide for more options.

To enable CUDA memory profiling, add the following options to the model config:

memory_profile:
    enabled: True
    start_step: 10  # Global batch to start profiling
    end_step: 10 # Global batch to end profiling
    rank: 0 # Global rank ID to profile
    output_path: None # Path to store the profile output file

Then invoke your NeMo script without any changes to the invocation command.