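Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.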

NeMo Audio Configuration Files

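This section describes the NeMo configuration file setup that is specific to models in the audio collection. For general information about setting up and running experiments that is common to all NeMo models (for example, Experiment Manager and PyTorch Lightning trainer parameters), see the NeMo Models section.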

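The model section of a NeMo audio configuration file generally requires information about the dataset(s) being used, parameters for any augmentation being performed, and the model architecture specification.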

Example configuration files for all NeMo audio models can be found in the config directory of the examples.

NeMo Dataset Configuration

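Training, validation, and test parameters are specified in the model.train_ds, model.validation_ds, and model.test_ds sections of the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate or duration of the loaded audio examples. Some fields can be left out and specified on the command line at runtime. Refer to the Dataset Processing Classes section of the API for a list of dataset classes and their respective parameters. Example training, validation, and test datasets can be configured as follows: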

model:
  sample_rate: 16000
  skip_nan_grad: false

  train_ds:
    manifest_filepath: ???
    input_key: audio_filepath # key of the input signal path in the manifest
    target_key: target_filepath # key of the target signal path in the manifest
    target_channel_selector: 0 # target signal is the first channel from files in target_key
    audio_duration: 4.0 # in seconds, audio segment duration for training
    random_offset: true # if the file is longer than audio_duration, use random offset to select a subsegment
    min_duration: ${model.train_ds.audio_duration}
    batch_size: 64 # batch size may be increased based on the available memory
    shuffle: true
    num_workers: 8
    pin_memory: true

  validation_ds:
    manifest_filepath: ???
    input_key: audio_filepath # key of the input signal path in the manifest
    target_key: target_filepath # key of the target signal path in the manifest
    target_channel_selector: 0 # target signal is the first channel from files in target_key
    batch_size: 64 # batch size may be increased based on the available memory
    shuffle: false
    num_workers: 4
    pin_memory: true

  test_ds:
    manifest_filepath: ???
    input_key: audio_filepath # key of the input signal path in the manifest
    target_key: target_filepath # key of the target signal path in the manifest
    target_channel_selector: 0 # target signal is the first channel from files in target_key
    batch_size: 1 # batch size may be increased based on the available memory
    shuffle: false
    num_workers: 4
    pin_memory: true

More information about online augmentation can be found in the example configuration.
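The `input_key` and `target_key` settings in the dataset sections above name fields of each manifest entry. The following is a minimal sketch of a matching manifest, assuming the manifest is a JSON-lines file (one JSON object per line) and that each entry also carries a `duration` field in seconds. The file names, durations, and helper functions are hypothetical, and the `min_duration` filter only illustrates what the setting implies; it is not NeMo's implementation.

```python
import json
import os
import tempfile

# Hypothetical entries: file names and durations are placeholders.
entries = [
    {"audio_filepath": "noisy/utt_0001.wav", "target_filepath": "clean/utt_0001.wav", "duration": 5.2},
    {"audio_filepath": "noisy/utt_0002.wav", "target_filepath": "clean/utt_0002.wav", "duration": 3.1},
]


def write_manifest(path, items):
    """Write one JSON object per line (assumed JSON-lines manifest layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_manifest(path):
    """Read the manifest back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def filter_min_duration(items, min_duration):
    """Keep entries that are at least min_duration seconds long (illustrative only)."""
    return [item for item in items if item["duration"] >= min_duration]


manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
write_manifest(manifest_path, entries)

loaded = read_manifest(manifest_path)
# With audio_duration: 4.0 and min_duration tied to it, the 3.1 s entry is dropped.
kept = filter_min_duration(loaded, min_duration=4.0)
print(len(loaded), len(kept))
```

The resulting path is what would be passed as `model.train_ds.manifest_filepath`.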

Lhotse Dataset Configuration

Lhotse CutSet

An example training dataset in Lhotse CutSet format can be configured as follows:

train_ds:
  use_lhotse: true # enable Lhotse data loader
  cuts_path: ??? # path to Lhotse cuts manifest with input signals and the corresponding target signals (target signals should be in the custom "target_recording" field)
  truncate_duration: 4.00 # truncate audio to 4 seconds
  truncate_offset_type: random # if the file is longer than truncate_duration, use random offset to select a subsegment
  batch_size: 64 # batch size may be increased based on the available memory
  shuffle: true
  num_workers: 8
  pin_memory: true

Lhotse CutSet with Online Augmentation

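An example training dataset in Lhotse CutSet format that uses online augmentation with room impulse response (RIR) convolution and additive noise can be configured as follows: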

train_ds:
  use_lhotse: true # enable Lhotse data loader
  cuts_path: ??? # path to Lhotse cuts manifest with speech signals for augmentation (including custom "target_recording" field with the same signals)
  truncate_duration: 4.00 # truncate audio to 4 seconds
  truncate_offset_type: random # if the file is longer than truncate_duration, use random offset to select a subsegment
  batch_size: 64 # batch size may be increased based on the available memory
  shuffle: true
  num_workers: 8
  pin_memory: true
  rir_enabled: true # enable room impulse response augmentation
  rir_path: ??? # path to Lhotse recordings manifest with room impulse response signals
  noise_path: ??? # path to Lhotse cuts manifest with noise signals

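A configuration file with Lhotse online augmentation can be found in the example configuration. More information about online augmentation can be found in the tutorial notebook.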
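The `truncate_duration` and `truncate_offset_type: random` settings in the Lhotse configurations above select a fixed-length subsegment from longer files. The sketch below illustrates that idea on sample indices only. The function `select_segment` is hypothetical, the `"start"` option is an assumption added for contrast, and none of this is the Lhotse implementation.

```python
import random


def select_segment(num_samples, sample_rate, truncate_duration, offset_type="random", rng=None):
    """Return (start, end) sample indices of the segment to keep.

    Files shorter than truncate_duration are returned whole.
    """
    rng = rng or random.Random()
    segment_len = int(round(truncate_duration * sample_rate))
    if num_samples <= segment_len:
        return 0, num_samples

    max_start = num_samples - segment_len
    if offset_type == "random":
        start = rng.randint(0, max_start)  # inclusive on both ends
    elif offset_type == "start":
        start = 0
    else:
        raise ValueError(f"unsupported offset_type: {offset_type}")
    return start, start + segment_len


# A 10-second file at 16 kHz, truncated to 4 seconds at a random offset.
start, end = select_segment(10 * 16000, 16000, 4.0, rng=random.Random(0))
print(start, end, (end - start) / 16000)
```

Because the offset is drawn again each time a file is visited, the model sees different 4-second windows of the same recording across epochs.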

Lhotse Shar

An example training dataset in Lhotse shar format can be configured as follows:

train_ds:
  shar_path: ???
  use_lhotse: true
  truncate_duration: 4.00 # truncate audio to 4 seconds
  truncate_offset_type: random
  batch_size: 8 # batch size may be increased based on the available memory
  shuffle: true
  num_workers: 8
  pin_memory: true

A configuration file using the Lhotse shar format can be found in the example configuration.

Model Architecture Configuration

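Each configuration file should describe the model architecture used for the experiment. An example of a simple predictive model configuration is shown below: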

model:
  type: predictive
  sample_rate: 16000
  skip_nan_grad: false
  num_outputs: 1
  normalize_input: true # normalize the input signal to 0dBFS

  train_ds:
    manifest_filepath: ???
    input_key: noisy_filepath
    target_key: clean_filepath
    audio_duration: 2.00 # trim audio to 2 seconds
    random_offset: true
    normalization_signal: input_signal
    batch_size: 8 # batch size may be increased based on the available memory
    shuffle: true
    num_workers: 8
    pin_memory: true

  validation_ds:
    manifest_filepath: ???
    input_key: noisy_filepath
    target_key: clean_filepath
    batch_size: 8
    shuffle: false
    num_workers: 4
    pin_memory: true

  encoder:
    _target_: nemo.collections.audio.modules.transforms.AudioToSpectrogram
    fft_length: 510 # Number of subbands in the STFT = fft_length // 2 + 1 = 256
    hop_length: 128
    magnitude_power: 0.5
    scale: 0.33

  decoder:
    _target_: nemo.collections.audio.modules.transforms.SpectrogramToAudio
    fft_length: ${model.encoder.fft_length}
    hop_length: ${model.encoder.hop_length}
    magnitude_power: ${model.encoder.magnitude_power}
    scale: ${model.encoder.scale}

  estimator:
    _target_: nemo.collections.audio.parts.submodules.ncsnpp.SpectrogramNoiseConditionalScoreNetworkPlusPlus
    in_channels: 1 # single-channel noisy input
    out_channels: 1 # single-channel estimate
    num_res_blocks: 3 # increased number of res blocks
    pad_time_to: 64 # pad to 64 frames for the time dimension
    pad_dimension_to: 0 # no padding in the frequency dimension

  loss:
    _target_: nemo.collections.audio.losses.MSELoss # computed in the time domain

  metrics:
    val:
      sisdr: # output SI-SDR
        _target_: torchmetrics.audio.ScaleInvariantSignalDistortionRatio

  optim:
    name: adam
    lr: 1e-4
    # optimizer arguments
    betas: [0.9, 0.999]
    weight_decay: 0.0

The complete configuration file can be found in the example configuration.
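The encoder and estimator settings in the predictive model configuration above determine the size of the spectrogram the estimator sees. The arithmetic below is a rough sketch using the values from that configuration. The subband count follows the comment on `fft_length`; the frame count assumes a centered STFT (one frame per hop plus one), and `pad_time_to` is assumed to mean rounding the number of frames up to the next multiple of 64. Both assumptions should be checked against the actual modules.

```python
import math

# Values taken from the example configuration.
sample_rate = 16000
audio_duration = 2.0  # seconds, train_ds.audio_duration
fft_length = 510
hop_length = 128
pad_time_to = 64

# Number of subbands, as stated in the config comment.
num_subbands = fft_length // 2 + 1

# Number of samples in one training segment.
num_samples = int(audio_duration * sample_rate)

# Assumption: centered STFT, so frames = 1 + num_samples // hop_length.
num_frames = 1 + num_samples // hop_length

# Assumption: pad_time_to rounds the frame count up to a multiple of 64.
padded_frames = math.ceil(num_frames / pad_time_to) * pad_time_to

print(num_subbands, num_samples, num_frames, padded_frames)
```

Under these assumptions a 2-second segment maps to a 256 x 251 spectrogram that is padded to 256 x 256 before entering the estimator.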

Fine-tuning Configuration

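All scripts support easy fine-tuning by partially or fully loading pretrained weights from a checkpoint into the currently instantiated model. Note that the currently instantiated model should have parameters that match the pretrained checkpoint so that the weights can be loaded properly.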

Pretrained weights can be provided in the following ways:

Providing a path to a NeMo model (via init_from_nemo_model)
Providing the name of a pretrained NeMo model, which will be downloaded from the cloud (via init_from_pretrained_model)
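The requirement that model parameters match the checkpoint can be pictured as a comparison of parameter names and shapes. The sketch below is only an illustration of partial loading in plain Python: the function `select_matching_weights` and the parameter names are hypothetical, and NeMo's actual loading logic may differ.

```python
def select_matching_weights(model_shapes, checkpoint_shapes):
    """Split checkpoint parameters into those that can be loaded and those that cannot.

    A parameter is loadable only if the model has a parameter with the same
    name and the same shape.
    """
    loadable, skipped = [], []
    for name, shape in checkpoint_shapes.items():
        if model_shapes.get(name) == shape:
            loadable.append(name)
        else:
            skipped.append(name)
    return loadable, skipped


# Hypothetical parameter names and shapes.
model_shapes = {
    "estimator.conv.weight": (64, 1, 3, 3),
    "estimator.out.weight": (1, 64, 1, 1),
}
checkpoint_shapes = {
    "estimator.conv.weight": (64, 1, 3, 3),  # same shape: can be loaded
    "estimator.out.weight": (2, 64, 1, 1),   # different shape: cannot be loaded
}

loadable, skipped = select_matching_weights(model_shapes, checkpoint_shapes)
print("loadable:", loadable)
print("skipped:", skipped)
```

In this picture, changing a setting such as `out_channels` in the configuration changes a parameter shape, so the corresponding pretrained weights no longer fit the instantiated model.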

Training from Scratch

A model can be trained from scratch using the following command:

python examples/audio/audio_to_audio_train.py \
    --config-path=<path to dir of configs> \
    --config-name=<name of config without .yaml> \
    model.train_ds.manifest_filepath="<path to manifest file>" \
    model.validation_ds.manifest_filepath="<path to manifest file>" \
    trainer.devices=1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50

Fine-tuning via a NeMo Model

A model can be fine-tuned from an existing NeMo model using the following command:

python examples/audio/audio_to_audio_train.py \
    --config-path=<path to dir of configs> \
    --config-name=<name of config without .yaml> \
    model.train_ds.manifest_filepath="<path to manifest file>" \
    model.validation_ds.manifest_filepath="<path to manifest file>" \
    trainer.devices=1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    +init_from_nemo_model="<path to .nemo model file>"

Fine-tuning via a NeMo Pretrained Model Name

A model can be fine-tuned from a pretrained NeMo model using the following command:

python examples/audio/audio_to_audio_train.py \
    --config-path=<path to dir of configs> \
    --config-name=<name of config without .yaml> \
    model.train_ds.manifest_filepath="<path to manifest file>" \
    model.validation_ds.manifest_filepath="<path to manifest file>" \
    trainer.devices=1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=50 \
    +init_from_pretrained_model="<name of pretrained checkpoint>"
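
The three commands above differ only in the final override. As a minimal sketch, the same command lines can be assembled programmatically, which avoids shell line-continuation mistakes. The helper `build_train_command` and the example paths (`conf`, `my_config`, `train.json`, `val.json`, `model.nemo`) are hypothetical; only the script path and the override names come from the commands above. The leading `+` on `init_from_nemo_model` follows the commands above and indicates a key that is added to the configuration rather than overridden.

```python
import shlex


def build_train_command(config_path, config_name, overrides, init_from_nemo_model=None):
    """Assemble the training command as an argument list."""
    args = [
        "python",
        "examples/audio/audio_to_audio_train.py",
        f"--config-path={config_path}",
        f"--config-name={config_name}",
    ]
    # Overrides of keys that already exist in the configuration.
    args += [f"{key}={value}" for key, value in overrides.items()]
    # Optional key appended to the configuration for fine-tuning.
    if init_from_nemo_model is not None:
        args.append(f"+init_from_nemo_model={init_from_nemo_model}")
    return args


cmd = build_train_command(
    config_path="conf",        # hypothetical directory of configs
    config_name="my_config",   # hypothetical config name, without .yaml
    overrides={
        "model.train_ds.manifest_filepath": "train.json",
        "model.validation_ds.manifest_filepath": "val.json",
        "trainer.devices": 1,
        "trainer.accelerator": "gpu",
        "trainer.max_epochs": 50,
    },
    init_from_nemo_model="model.nemo",  # hypothetical checkpoint path
)

# Print a shell-safe version of the command; the list form can be passed to subprocess.run.
print(shlex.join(cmd))
```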