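Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions, or on features not yet available in 2.0, see the NeMo 24.07 documentation.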

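NeMo Speaker Recognition Configuration Files

This page covers NeMo configuration file settings that are specific to speaker recognition models. For general information about how to set up and run experiments that is common to all NeMo models (for example, the experiment manager and PyTorch Lightning trainer parameters), see the NeMo Models page.

The model section of a NeMo speaker recognition configuration file generally requires information about the dataset(s) being used, the preprocessor for audio files, parameters for any augmentation being performed, and the model architecture specification. The sections on this page cover each of these in more detail.

Example configuration files for all speaker-related scripts can be found in the config directory of the examples: {NEMO_ROOT}/examples/speaker_tasks/recognition/conf.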

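Dataset Configuration

Training, validation, and test parameters are specified using the train_ds, validation_ds, and test_ds sections of the configuration file, respectively. Depending on the task, you may have parameters that specify the sample rate of the audio files, the maximum duration to consider for each audio file, whether to shuffle the dataset, and so on. You can also leave fields such as manifest_filepath blank and specify them on the command line at runtime.

Any initialization parameter accepted by the Dataset class used in the experiment can be set in the configuration file.

An example TitaNet training and validation configuration could look like the following ({NEMO_ROOT}/examples/speaker_tasks/recognition/conf/titanet-large.yaml):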

model:
  train_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: null   # finds labels based on manifest file
    batch_size: 32
    trim_silence: False
    shuffle: True

  validation_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: null   # Keep null, to match with labels extracted during training
    batch_size: 32
    shuffle: False    # No need to shuffle the validation data
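
The comment on labels above says that labels are found from the manifest file. The following is a minimal sketch of that idea, not NeMo's actual implementation. It assumes a manifest with one JSON object per line and the field names audio_filepath, duration, and label; the paths and speaker names are hypothetical.

```python
import json


def extract_labels(manifest_lines):
    """Collect the sorted set of speaker labels found in manifest lines."""
    labels = set()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        labels.add(entry["label"])  # assumed field name for the speaker label
    return sorted(labels)


# Hypothetical manifest content: one JSON object per line.
manifest_lines = [
    '{"audio_filepath": "/data/spk1/a.wav", "duration": 3.2, "label": "spk1"}',
    '{"audio_filepath": "/data/spk2/b.wav", "duration": 2.7, "label": "spk2"}',
    '{"audio_filepath": "/data/spk1/c.wav", "duration": 4.1, "label": "spk1"}',
]

labels = extract_labels(manifest_lines)
# Map each label to a class index; validation reuses the same mapping.
label_to_id = {label: idx for idx, label in enumerate(labels)}
print(labels, label_to_id)
```

Sorting the labels keeps the class indices stable across runs, which is why the validation section should not define its own label list.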

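If you want to use tarred datasets, see Datasets Configuration.

Preprocessor Configuration

The preprocessor computes MFCC or mel spectrogram features, which are given as inputs to the model. For details on how to write this section, see Preprocessor Configuration.

Augmentation Configurations

For TitaNet training, we use on-the-fly augmentation with MUSAN and RIR impulses, configured through the noise augmentor section.

The following example sets up MUSAN augmentation with audio files taken from the manifest path, and with the minimum and maximum SNR specified by min_snr_db and max_snr_db, respectively. This section can be added to the train_ds part of the model section.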

model:
  ...
  train_ds:
    ...
    augmentor:
      noise:
        manifest_path: /path/to/musan/manifest_file
        prob: 0.2  # probability to augment the incoming batch audio with augmentor data
        min_snr_db: 5
        max_snr_db: 15

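For more details, see the nemo.collections.asr.parts.preprocessing.perturb.AudioAugmentor API section.

Model Architecture Configurations

Each configuration file should describe the model architecture used for the experiment. Models in the NeMo ASR collection need an encoder section and a decoder section, with the _target_ field specifying the module to use for each.

The following sections go into more detail about the specific configuration of each model architecture.

For more information about the TitaNet encoder model, see the Models page.

Decoder Configurations

After features have been computed by the TitaNet encoder, they are passed to the decoder, which computes the embeddings and then the log probabilities used for model training.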
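
To make the meaning of prob, min_snr_db, and max_snr_db concrete, here is a minimal sketch of noise mixing at a target SNR. It is an illustration under stated assumptions, not NeMo's augmentor code: the helper names power, mix_at_snr, and maybe_augment are hypothetical, signals are plain Python lists, and the SNR is drawn uniformly between the two bounds.

```python
import math
import random


def power(signal):
    """Mean power of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


def maybe_augment(speech, noise, prob, min_snr_db, max_snr_db, rng):
    """With probability prob, mix in noise at an SNR drawn from [min, max] dB."""
    if rng.random() >= prob:
        return speech, None  # left unchanged
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    mixed, _ = mix_at_snr(speech, noise, snr_db)
    return mixed, snr_db


rng = random.Random(0)
# Synthetic stand-ins for a speech clip and a noise clip (0.1 s at 16 kHz).
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

augmented, used_snr = maybe_augment(speech, noise, 0.2, 5, 15, rng)
print("augmented" if used_snr is not None else "unchanged", used_snr)
```

A lower SNR means louder noise relative to the speech, so the range 5 to 15 dB spans moderately to mildly noisy conditions.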

model:
  ...
  decoder:
    _target_: nemo.collections.asr.modules.SpeakerDecoder
    feat_in: *enc_feat_out
    num_classes: 7205  # Total number of classes in voxceleb1,2 training manifest file
    pool_mode: attention # xvector, attention
    emb_sizes: 192 # number of intermediate emb layers. can be comma separated for additional layers like 512,512
    angular: true # if true then loss will be changed to angular softmax loss and consider scale and margin from loss section else train with cross-entropy loss

  loss:
    scale: 30
    margin: 0.2
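
The effect of scale and margin when angular is true can be sketched as follows. This assumes an additive angular margin formulation, in which the margin is added to the angle between the embedding and the target class weight before scaling; the exact loss used by NeMo may differ in detail. The function name angular_margin_loss and the example cosine values are hypothetical.

```python
import math


def angular_margin_loss(cosines, target, scale=30.0, margin=0.2):
    """Loss for one example, given cosine similarities to every class weight.

    The target class logit is scale * cos(theta + margin); all other
    logits are scale * cos(theta). The result is the negative
    log-softmax of the target logit.
    """
    logits = []
    for idx, cos_theta in enumerate(cosines):
        if idx == target:
            # Clamp before acos to guard against rounding outside [-1, 1].
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * cos_theta)
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Hypothetical cosine similarities for a 3-class problem; class 0 is correct.
cosines = [0.6, 0.4, -0.1]
with_margin = angular_margin_loss(cosines, target=0, scale=30.0, margin=0.2)
without_margin = angular_margin_loss(cosines, target=0, scale=30.0, margin=0.0)
print(with_margin, without_margin)
```

The margin makes the target class harder to satisfy, so the same embedding incurs a larger loss than without it, which pushes embeddings of the same speaker closer together in angle.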