Experiment Manager#
The NeMo Framework Experiment Manager leverages PyTorch Lightning for model checkpointing, TensorBoard logging, and logging to Weights and Biases, DLLogger, and MLFlow. The Experiment Manager is included by default in all NeMo example scripts.
To use the Experiment Manager, call exp_manager and pass in the PyTorch Lightning Trainer:
exp_dir = exp_manager(trainer, cfg.get("exp_manager", None))
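For context, here is a minimal sketch of where this call typically sits in a training script. The config path/name, the imports, and the commented-out model class are placeholders following the layout of typical NeMo example scripts, not a specific NeMo example:

# Minimal sketch of a training script that uses exp_manager.
# NOTE: config_path/config_name and MyModel are placeholders.
import pytorch_lightning as pl

from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager


@hydra_runner(config_path="conf", config_name="config")
def main(cfg):
    trainer = pl.Trainer(**cfg.trainer)
    exp_dir = exp_manager(trainer, cfg.get("exp_manager", None))  # attaches loggers and checkpointing
    # model = MyModel(cfg=cfg.model, trainer=trainer)  # placeholder model class
    # trainer.fit(model)


if __name__ == "__main__":
    main()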
The Experiment Manager can be configured with YAML using Hydra:
exp_manager:
    exp_dir: /path/to/my/experiments
    name: my_experiment_name
    create_tensorboard_logger: True
    create_checkpoint_callback: True
Optionally, launch TensorBoard to view the training results in exp_dir, which by default is set to ./nemo_experiments:
tensorboard --bind_all --logdir nemo_experiments
If create_checkpoint_callback is set to True, NeMo automatically creates checkpoints during training using PyTorch Lightning's ModelCheckpoint. The ModelCheckpoint can be configured via YAML or CLI:
exp_manager:
    ...
    # configure the PyTorch Lightning ModelCheckpoint using checkpoint_callback_params
    # any ModelCheckpoint argument can be set here

    # save the best checkpoints based on this metric
    checkpoint_callback_params.monitor=val_loss

    # choose how many total checkpoints to save
    checkpoint_callback_params.save_top_k=5
Resume Training#
To automatically resume training, configure the exp_manager. This feature is important for long training runs that may be interrupted or shut down before the program has finished. To auto-resume training, set the following parameters via YAML or CLI:
exp_manager:
    ...
    # resume training if checkpoints already exist
    resume_if_exists: True
    # to start training with no existing checkpoints
    resume_ignore_no_checkpoint: True
    # by default experiments will be versioned by datetime
    # we can set our own version with
    exp_manager.version: my_experiment_version
Experiment Loggers#
In addition to TensorBoard, NeMo also supports Weights and Biases, MLFlow, DLLogger, ClearML, and NeptuneLogger. To use these loggers, set the following via YAML or ExpManagerConfig.
Weights and Biases (WandB)#
exp_manager:
    ...
    create_checkpoint_callback: True
    create_wandb_logger: True
    wandb_logger_kwargs:
        name: ${name}
        project: ${project}
        entity: ${entity}
        <Add any other arguments supported by WandB logger here>
MLFlow#
exp_manager:
    ...
    create_checkpoint_callback: True
    create_mlflow_logger: True
    mlflow_logger_kwargs:
        experiment_name: ${name}
        tags:
            <Any key:value pairs>
        save_dir: './mlruns'
        prefix: ''
        artifact_location: None
        # provide run_id if resuming a previously started run
        run_id: Optional[str] = None
DLLogger#
exp_manager:
    ...
    create_checkpoint_callback: True
    create_dllogger_logger: True
    dllogger_logger_kwargs:
        verbose: False
        stdout: False
        json_file: "./dllogger.json"
ClearML#
exp_manager:
    ...
    create_checkpoint_callback: True
    create_clearml_logger: True
    clearml_logger_kwargs:
        project: None  # name of the project
        task: None  # optional name of task
        connect_pytorch: False
        model_name: None  # optional name of model
        tags: None  # Should be a list of str
        log_model: False  # log model to clearml server
        log_cfg: False  # log config to clearml server
        log_metrics: False  # log metrics to clearml server
Neptune#
exp_manager:
    ...
    create_checkpoint_callback: True
    create_neptune_logger: false
    neptune_logger_kwargs:
        project: ${project}
        name: ${name}
        prefix: train
        log_model_checkpoints: false  # set to True if checkpoints need to be pushed to Neptune
        tags: null  # can specify as an array of strings in yaml array format
        description: null
        <Add any other arguments supported by Neptune logger here>
Exponential Moving Average#
NeMo supports keeping an exponential moving average (EMA) of the model parameters. This can be useful for improving model generalization and stability. To use EMA, set the following parameters via YAML or ExpManagerConfig (a conceptual sketch of the update rule follows the example below).
exp_manager:
    ...
    # use exponential moving average for model parameters
    ema:
        enabled: True  # False by default
        decay: 0.999  # decay rate
        cpu_offload: False  # If EMA parameters should be offloaded to CPU to save GPU memory
        every_n_steps: 1  # How often to update EMA weights
        validate_original_weights: False  # Whether to use original weights for validation calculation or EMA weights
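Conceptually, EMA maintains a shadow copy of each parameter that, every every_n_steps optimizer steps, moves toward the current weights by the decay factor. The following is only an illustration of the update rule, not NeMo's internal implementation:

# Illustrative EMA update (not NeMo's internal code): the shadow weight keeps
# `decay` of its previous value and mixes in (1 - decay) of the freshly optimized weight.
import torch

decay = 0.999
param = torch.randn(4)        # a model parameter after an optimizer step
ema_param = param.clone()     # shadow EMA copy

ema_param = decay * ema_param + (1.0 - decay) * param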
Hydra Multi-Run with NeMo#
When training neural networks, it is common to perform a hyperparameter search to improve the model's performance on validation data. However, manually preparing a grid of experiments and managing all checkpoints and their metrics can be tedious. To simplify these tasks, NeMo integrates with Hydra Multi-Run support, providing a unified way to run a set of experiments directly from a config.
This framework has a few limitations, listed below:
- All experiments are assumed to run on a single GPU; multi-GPU for a single run is not supported at this time (which also means model-parallel models are not currently supported).
- NeMo Multi-Run currently supports only grid search over a set of hyperparameters. Support for advanced hyperparameter search strategies will be added in the future.
- **NeMo Multi-Run requires one or more GPUs** to run and will not work without GPU devices.
Configuration Setup#
To enable NeMo Multi-Run, we first update the YAML config with some information to let Hydra know that we expect to run multiple experiments from this one config:
# Required for Hydra launch of hyperparameter search via multirun
defaults:
  - override hydra/launcher: nemo_launcher

# Hydra arguments necessary for hyperparameter optimization
hydra:
  # Helper arguments to ensure all hyper parameter runs are from the directory that launches the script.
  sweep:
    dir: "."
    subdir: "."

  # Define all the hyper parameters here
  sweeper:
    params:
      # Place all the parameters you wish to search over here (corresponding to the rest of the config)
      # NOTE: Make sure that there are no spaces between the commas that separate the config params !
      model.optim.lr: 0.001,0.0001
      model.encoder.dim: 32,64,96,128
      model.decoder.dropout: 0.0,0.1,0.2

  # Arguments to the process launcher
  launcher:
    num_gpus: -1  # Number of gpus to use. Each run works on a single GPU.
    jobs_per_gpu: 1  # If each GPU has large memory, you can run multiple jobs on the same GPU for faster results (until OOM).
Next, we set up the config for the Experiment Manager. When we perform a hyperparameter search, each run may take some time to finish. We want to avoid the situation where a run ends prematurely (for example, due to an OOM error or a timeout on the machine) and all experiments have to be re-executed. We therefore set up the Experiment Manager config so that every experiment has a unique "key" whose value corresponds to a single resumable experiment.
Let's look at how to set up such a unique "key" via the experiment name: simply append all the hyperparameter arguments to the experiment name, as shown below:
exp_manager:
    exp_dir: null  # Can be set by the user.

    # Add a unique name for all hyper parameter arguments to allow continued training.
    # NOTE: It is necessary to add all hyperparameter arguments to the name !
    # This ensures successful restoration of model runs in case HP search crashes.
    name: ${name}-lr-${model.optim.lr}-adim-${model.adapter.dim}-sd-${model.adapter.adapter_strategy.stochastic_depth}

    ...

    checkpoint_callback_params:
        ...
        save_top_k: 1  # Dont save too many .ckpt files during HP search
        always_save_nemo: True  # saves the checkpoints as nemo files for fast checking of results later
    ...

    # We highly recommend use of an experiment tracking tool to gather all the experiments in one location
    create_wandb_logger: True
    wandb_logger_kwargs:
        project: "<Add some project name here>"

    # HP Search may crash due to various reasons, best to attempt continuation in order to
    # resume from where the last failure case occurred.
    resume_if_exists: true
    resume_ignore_no_checkpoint: true
Running a NeMo Multi-Run Configuration#
Once the config has been updated, we can run it just like any other Hydra script, with one special flag (-m):
python script.py --config-path=ABC --config-name=XYZ -m \
    trainer.max_steps=5000 \  # Any additional arg after -m will be passed to all the runs generated from the config !
    ...
Tips and Tricks#
This section provides recommendations for working with the Experiment Manager.
Conserving Disk Space for a Large Number of Experiments#
Some models may have a very large number of parameters, making it prohibitively expensive to save many checkpoints on physical storage drives. For example, if you use the Adam optimizer, each PyTorch Lightning ".ckpt" file will be roughly three times the size of the model parameters alone. This can add up quickly across multiple runs.
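As a rough back-of-the-envelope illustration of that factor of three (assuming fp32 weights plus Adam's two per-parameter moment buffers; the 1B parameter count is just an example, not a value from the docs):

# Rough size estimate for a PyTorch Lightning ".ckpt" when training with Adam.
# Assumptions: fp32 (4-byte) weights and two fp32 Adam moment buffers per parameter.
num_params = 1_000_000_000            # example: a 1B-parameter model
weights_gb = num_params * 4 / 1e9     # ~4 GB of model weights
ckpt_gb = weights_gb * 3              # weights + exp_avg + exp_avg_sq ~= 12 GB per .ckpt
print(f"weights ~{weights_gb:.0f} GB, checkpoint ~{ckpt_gb:.0f} GB")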
In the config above, we explicitly set save_top_k: 1 and always_save_nemo: True. This limits the number of ".ckpt" files to one and additionally saves a NeMo file that contains only the model parameters, without the optimizer state. The NeMo file can be restored immediately for further work.
We can conserve even more storage space by using a NeMo utility function to automatically delete the ".ckpt" or NeMo file once the training run has finished. This is sufficient if you collect results in an experiment tracking tool and can simply rerun the best config after the search is complete.
# Import `clean_exp_ckpt` along with exp_manager
from nemo.utils.exp_manager import clean_exp_ckpt, exp_manager
@hydra_runner(...)
def main(cfg):
    ...

    # Keep track of the experiment directory
    exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None))

    ... add any training code here as needed ...

    # Add following line to end of the training script
    # Remove PTL ckpt file, and potentially also remove .nemo file to conserve storage space.
    clean_exp_ckpt(exp_log_dir, remove_ckpt=True, remove_nemo=False)
Debugging Multi-Run Scripts#
When running Hydra scripts, you may occasionally run into configuration issues that crash the program. In NeMo Multi-Run, a crash in any single run will not crash the entire program. Instead, the error is noted and execution proceeds to the next job. Once all jobs are complete, the errors are raised in the order in which they occurred, and the program crashes with the first error's stack trace.
To debug NeMo Multi-Run, we suggest commenting out the entire set of hyperparameters inside sweeper.params and instead running a single experiment with that config, which will immediately raise the error.
Experiment Name Cannot Be Parsed by Hydra#
Sometimes our hyperparameters include PyTorch Lightning trainer arguments, such as the number of steps, the number of epochs, and whether to use gradient accumulation. When we attempt to add these as keys to the Experiment Manager's name, Hydra may complain that trainer.xyz cannot be resolved.
A simple solution is to finalize the Hydra config before calling exp_manager(), as follows:
from omegaconf import OmegaConf

@hydra_runner(...)
def main(cfg):
    # Make any changes as necessary to the config
    cfg.xyz.abc = uvw

    # Finalize the config; OmegaConf.resolve() resolves all interpolations in place
    OmegaConf.resolve(cfg)

    # Carry on as normal by calling trainer and exp_manager
    trainer = pl.Trainer(**cfg.trainer)
    exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None))
    ...
ExpManagerConfig#
class nemo.utils.exp_manager.ExpManagerConfig(
    explicit_log_dir: str | None = None,
    exp_dir: str | None = None,
    name: str | None = None,
    version: str | None = None,
    use_datetime_version: bool | None = True,
    resume_if_exists: bool | None = False,
    resume_past_end: bool | None = False,
    resume_ignore_no_checkpoint: bool | None = False,
    resume_from_checkpoint: str | None = None,
    create_tensorboard_logger: bool | None = True,
    summary_writer_kwargs: typing.Dict[typing.Any, typing.Any] | None = None,
    create_wandb_logger: bool | None = False,
    wandb_logger_kwargs: typing.Dict[typing.Any, typing.Any] | None = None,
    create_mlflow_logger: bool | None = False,
    mlflow_logger_kwargs: nemo.utils.loggers.mlflow_logger.MLFlowParams | None = <factory>,
    create_dllogger_logger: bool | None = False,
    dllogger_logger_kwargs: nemo.utils.loggers.dllogger.DLLoggerParams | None = <factory>,
    create_clearml_logger: bool | None = False,
    clearml_logger_kwargs: nemo.utils.loggers.clearml_logger.ClearMLParams | None = <factory>,
    create_neptune_logger: bool | None = False,
    neptune_logger_kwargs: typing.Dict[typing.Any, typing.Any] | None = None,
    create_checkpoint_callback: bool | None = True,
    checkpoint_callback_params: nemo.utils.exp_manager.CallbackParams | None = <factory>,
    create_early_stopping_callback: bool | None = False,
    early_stopping_callback_params: nemo.utils.exp_manager.EarlyStoppingParams | None = <factory>,
    create_preemption_callback: bool | None = True,
    files_to_copy: typing.List[str] | None = None,
    log_step_timing: bool | None = True,
    log_delta_step_timing: bool | None = False,
    step_timing_kwargs: nemo.utils.exp_manager.StepTimingParams | None = <factory>,
    log_local_rank_0_only: bool | None = False,
    log_global_rank_0_only: bool | None = False,
    disable_validation_on_resume: bool | None = True,
    ema: nemo.utils.exp_manager.EMAParams | None = <factory>,
    max_time_per_run: str | None = None,
    seconds_to_sleep: float = 5,
    create_straggler_detection_callback: bool | None = False,
    straggler_detection_params: nemo.utils.exp_manager.StragglerDetectionParams | None = <factory>,
    create_fault_tolerance_callback: bool | None = False,
    fault_tolerance: nemo.utils.exp_manager.FaultToleranceParams | None = <factory>,
    log_tflops_per_sec_per_gpu: bool | None = True,
)
Bases: object
Experiment Manager config for validation of passed arguments.
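Since ExpManagerConfig is a dataclass, it can also be used as an OmegaConf structured config to validate a user-supplied exp_manager section before it is passed to exp_manager. The following is only a small sketch; the override values shown are made up for illustration:

# Sketch: validate exp_manager settings against ExpManagerConfig via OmegaConf.
from omegaconf import OmegaConf

from nemo.utils.exp_manager import ExpManagerConfig

# Build a typed base config, then merge user overrides into it.
base = OmegaConf.structured(ExpManagerConfig(exp_dir="/path/to/my/experiments", name="my_experiment_name"))
overrides = OmegaConf.create({"create_wandb_logger": True, "resume_if_exists": True})
cfg = OmegaConf.merge(base, overrides)  # unknown keys or wrong types raise validation errors
print(OmegaConf.to_yaml(cfg))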