NeMo Speech Intent Classification and Slot Filling Configuration Files#
This page covers NeMo configuration file settings that are specific to models in the Speech Intent Classification and Slot Filling collection. For general information on how to set up and run experiments that is common to all NeMo models (e.g., Experiment Manager and PyTorch Lightning trainer parameters), see the NeMo Models page.
Dataset Configuration#
The dataset configuration for Speech Intent Classification and Slot Filling models is mostly the same as for standard ASR training, described here. One exception is that use_start_end_token must be set to True.
An example of the train and validation configurations should look similar to the following:
model:
  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: false
    use_start_end_token: true
    trim_silence: false
    max_duration: 11.0
    min_duration: 0.0
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 8
    pin_memory: true
    use_start_end_token: true
    min_duration: 8.0
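When tarred audio datasets are used instead of individual audio files, the tarred-related fields above are filled in. The sketch below is illustrative only; the shard paths and manifest path are hypothetical and not taken from this page:

model:
  train_ds:
    is_tarred: true
    # hypothetical sharded tarfiles and the manifest produced alongside them
    tarred_audio_filepaths: /data/slurp_tarred/audio_{0..127}.tar
    manifest_filepath: /data/slurp_tarred/tarred_audio_manifest.json
    shuffle_n: 2048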
Preprocessor Configuration#
The preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to the model. For details on how to write this section, refer to Preprocessor Configuration.
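As a quick reference, a mel-spectrogram preprocessor section commonly looks like the following. This is a minimal sketch using typical AudioToMelSpectrogramPreprocessor settings; the values are illustrative rather than taken from this page:

model:
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: ${model.sample_rate}
    normalize: per_feature
    window_size: 0.025 # 25 ms analysis window
    window_stride: 0.01 # 10 ms hop
    features: 80 # number of mel bins; the encoder's feat_in below resolves ${model.preprocessor.features} to this value
    n_fft: 512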
Augmentation Configuration#
NeMo ASR has a few on-the-fly spectrogram augmentation options that can be specified in the config file via the augmentor and spec_augment sections. For details on how to write this section, refer to Augmentation Configuration.
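For example, a spec_augment section typically looks like the sketch below; the class and fields are standard NeMo ASR options, but the values here are illustrative rather than taken from this page:

model:
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # number of frequency masks
    time_masks: 10 # number of time masks
    freq_width: 27 # maximum width of each frequency mask
    time_width: 0.05 # maximum width of each time mask, as a fraction of the sequence length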
Model Architecture Configuration#
The encoder of the model is a Conformer-large model without a text decoder, and it can be initialized from a pretrained checkpoint. The decoder is a Transformer model, with additional embedding and classifier modules.
An example configuration for the model could be:
pretrained_encoder:
  name: stt_en_conformer_ctc_large # which model to use to initialize the encoder; set to null if not using any. Only used to initialize training, not used when resuming from a checkpoint.
  freeze: false # whether to freeze the encoder during training

model:
  sample_rate: 16000

  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1 # you may set it if you need an output size different from the default d_model
    n_layers: 17 # SSL Conformer-large has only 17 layers
    d_model: 512

    # Sub-sampling params
    subsampling: striding # vggnet or striding; vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be power of 2
    subsampling_conv_channels: -1 # -1 sets it to d_model

    # Reduction parameters: can be used to add another subsampling layer at a given position.
    # Having a 2x reduction will speed up training and inference while keeping a similar WER.
    # Adding it at the end gives the best WER, while adding it at the beginning gives the best speedup.
    reduction: null # pooling, striding, or null
    reduction_position: null # Encoder block index, or -1 for subsampling at the end of the encoder
    reduction_factor: 1

    # Feed-forward module's params
    ff_expansion_factor: 4

    # Multi-headed attention module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 8 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [-1, -1] # -1 means unlimited context
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer modules
    dropout_pre_encoder: 0.1 # The dropout used before the encoder
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  embedding:
    _target_: nemo.collections.asr.modules.transformer.TransformerEmbedding
    vocab_size: -1
    hidden_size: ${model.encoder.d_model}
    max_sequence_length: 512
    num_token_types: 1
    embedding_dropout: 0.0
    learn_positional_encodings: false

  decoder:
    _target_: nemo.collections.asr.modules.transformer.TransformerDecoder
    num_layers: 3
    hidden_size: ${model.encoder.d_model}
    inner_size: 2048
    num_attention_heads: 8
    attn_score_dropout: 0.0
    attn_layer_dropout: 0.0
    ffn_dropout: 0.0

  classifier:
    _target_: nemo.collections.common.parts.MultiLayerPerceptron
    hidden_size: ${model.encoder.d_model}
    num_classes: -1
    num_layers: 1
    activation: 'relu'
    log_softmax: true
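For instance, to fine-tune while keeping the pretrained encoder weights fixed, or to train the encoder from scratch, only the pretrained_encoder section needs to change. The variants below are illustrative sketches, not recommendations from this page:

# keep the pretrained encoder frozen during training
pretrained_encoder:
  name: stt_en_conformer_ctc_large
  freeze: true

# or train the encoder from scratch, without any pretrained initialization
# pretrained_encoder:
#   name: null
#   freeze: false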
Loss Configuration#
By default, the loss function is the negative log-likelihood loss, and optional label smoothing (default 0.0) can be applied via the following configuration:
loss:
  label_smoothing: 0.0
Inference Configuration#
During inference, three types of sequence generation strategies can be applied: greedy search, beam search, and top-k search.
sequence_generator:
  type: greedy # choices=[greedy, topk, beam]
  max_sequence_length: ${model.embedding.max_sequence_length}
  temperature: 1.0 # for top-k sampling
  beam_size: 1 # K for top-k sampling, N for beam search
  len_pen: 0 # for beam-search
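For example, switching from greedy decoding to beam search only requires changing the type and the beam-related fields; the values below are illustrative:

sequence_generator:
  type: beam # choices=[greedy, topk, beam]
  max_sequence_length: ${model.embedding.max_sequence_length}
  beam_size: 4 # N for beam search
  len_pen: 1.0 # length penalty for beam search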