
NeMo Speech Intent Classification and Slot Filling Configuration Files

This page covers NeMo configuration file setup that is specific to models in the Speech Intent Classification and Slot Filling collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g., Experiment Manager and PyTorch Lightning trainer parameters), see the NeMo Models page.

Dataset Configuration

Dataset configuration for Speech Intent Classification and Slot Filling models is mostly the same as for standard ASR training, described here. One exception is that use_start_end_token must be set to True.

An example of the training and validation configuration should look similar to the following:

model:
  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: false
    use_start_end_token: true
    trim_silence: false
    max_duration: 11.0
    min_duration: 0.0
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 8
    pin_memory: true
    use_start_end_token: true
    min_duration: 8.0

Preprocessor Configuration

The preprocessor computes the MFCC or mel spectrogram features that are given as input to the model. For details on how to write this section, refer to the Preprocessor Configuration section.
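As a rough reference, a minimal mel-spectrogram preprocessor sketch is shown below. The specific values (number of features, window size, dither, and so on) are illustrative assumptions rather than prescribed settings, and should follow the Preprocessor Configuration section and your encoder's expected input size.

model:
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: ${model.sample_rate}
    normalize: "per_feature"
    window_size: 0.025   # 25 ms analysis window
    window_stride: 0.01  # 10 ms hop
    window: "hann"
    features: 80         # number of mel bins; referenced by ${model.preprocessor.features}
    n_fft: 512
    dither: 0.00001
    pad_to: 0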

Augmentation Configuration

NeMo ASR has a few on-the-fly spectrogram augmentation options, which can be specified in the configuration file using the augmentor and spec_augment sections. For details on how to write this section, refer to the Augmentation Configuration section.
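For instance, a hedged sketch of a spec_augment section could look like the following; the mask counts and widths here are illustrative defaults rather than tuned values.

model:
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2     # number of frequency masks
    time_masks: 5     # number of time masks
    freq_width: 27    # maximum width of each frequency mask (in mel bins)
    time_width: 0.05  # maximum width of each time mask, as a fraction of the utterance length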

Model Architecture Configuration

The encoder of the model is a Conformer-large model without the text decoder, and it can be initialized from a pretrained checkpoint. The decoder is a Transformer model, together with additional embedding and classifier modules.

An example configuration for the model can be:

pretrained_encoder:
  name: stt_en_conformer_ctc_large  # which model to use to initialize the encoder; set to null if not using any. Only used to initialize training, not when resuming from a checkpoint.
  freeze: false  # whether to freeze the encoder during training.

model:
  sample_rate: 16000
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1 # you may set it if you need different output size other than the default d_model
    n_layers: 17  # SSL Conformer-large has only 17 layers
    d_model: 512

    # Sub-sampling params
    subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be a power of 2
    subsampling_conv_channels: -1 # -1 sets it to d_model

    # Reduction parameters: Can be used to add another subsampling layer at a given position.
    # Having a 2x reduction will speed up training and inference while keeping a similar WER.
    # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup.
    reduction: null # pooling, striding, or null
    reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder
    reduction_factor: 1

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 8 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [-1, -1] # -1 means unlimited context
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_pre_encoder: 0.1 # The dropout used before the encoder
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  embedding:
    _target_: nemo.collections.asr.modules.transformer.TransformerEmbedding
    vocab_size: -1
    hidden_size: ${model.encoder.d_model}
    max_sequence_length: 512
    num_token_types: 1
    embedding_dropout: 0.0
    learn_positional_encodings: false

  decoder:
    _target_: nemo.collections.asr.modules.transformer.TransformerDecoder
    num_layers: 3
    hidden_size: ${model.encoder.d_model}
    inner_size: 2048
    num_attention_heads: 8
    attn_score_dropout: 0.0
    attn_layer_dropout: 0.0
    ffn_dropout: 0.0

  classifier:
    _target_: nemo.collections.common.parts.MultiLayerPerceptron
    hidden_size: ${model.encoder.d_model}
    num_classes: -1
    num_layers: 1
    activation: 'relu'
    log_softmax: true

Loss Configuration

By default, the loss function is the negative log-likelihood loss, where optional label smoothing (default 0.0) can be applied with the following configuration:

loss:
  label_smoothing: 0.0

Inference Configuration

During inference, three types of sequence generation strategies can be applied: greedy search, beam search, and top-k search.

sequence_generator:
  type: greedy  # choices=[greedy, topk, beam]
  max_sequence_length: ${model.embedding.max_sequence_length}
  temperature: 1.0  # for top-k sampling
  beam_size: 1  # K for top-k sampling, N for beam search
  len_pen: 0  # for beam-search
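For example, to switch from greedy decoding to beam search, a sketch like the following could be used; the beam size and length penalty values are illustrative and should be tuned on validation data.

sequence_generator:
  type: beam  # use beam search instead of greedy decoding
  max_sequence_length: ${model.embedding.max_sequence_length}
  temperature: 1.0  # only used for top-k sampling
  beam_size: 4      # N for beam search
  len_pen: 0.5      # length penalty applied during beam search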