How to Fine-Tune a Riva ASR Acoustic Model with NVIDIA NeMo#

This tutorial walks you through fine-tuning an NVIDIA Riva Parakeet acoustic model with NVIDIA NeMo.

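NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding (NLU) services, such as:

Automatic speech recognition (ASR).
Text-to-speech synthesis (TTS).
A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model with NeMo.
To understand the basics of the Riva ASR APIs, refer to Getting started with Riva ASR in Python.

For more information about Riva, refer to the Riva developer documentation.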

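NeMo (Neural Modules)#

NVIDIA NeMo is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models with a simple Python interface. For information about setting up NeMo, refer to the instructions on the NeMo GitHub page.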

"""
You can run this tutorial either locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to set up in Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3 jq
!pip install text-unidecode
!pip install "matplotlib>=3.3.2"
!pip install Cython
!pip3 install --no-cache-dir huggingface-hub==0.23.2

## Install NeMo
BRANCH = 'v1.23.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option,
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()
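
Before moving on, it can help to confirm that the command-line tools installed above are actually on the PATH, since the preprocessing step later calls `sox` directly. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and is not part of NeMo or Riva.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # sox, ffmpeg and jq are the tools requested by the apt-get command above.
    missing = missing_tools(["sox", "ffmpeg", "jq"])
    print("Missing tools:", missing or "none")
```

An empty result means every listed tool was found; any name that is returned should be installed before continuing.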

Fine-Tuning an ASR Model with NeMo#

Download the Data#

In this tutorial, we will use the popular AN4 dataset. Let's download it.

! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, extract the dataset and move it to the correct directory.

import os
DATA_DIR = os.getcwd()
os.environ["DATA_DIR"] = DATA_DIR
! tar -xvf an4_sphere.tar.gz
! mv an4 $DATA_DIR

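Preprocessing#

This step converts the .sph audio files to .wav files and builds a manifest ("metadata") file for the training set and for the test set. The data loaders use these manifests during training and testing.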

import json, librosa, os, glob
import subprocess


source_data_dir = f"{DATA_DIR}/an4"
target_data_dir = f"{DATA_DIR}/an4_converted"

def an4_build_manifest(transcripts_path, manifest_path, target_wavs_dir):
    """Build an AN4 manifest from a given transcript file."""
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(') + 1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(target_wavs_dir, file_id + '.wav')

                duration = librosa.core.get_duration(filename=audio_path)

                # Write the metadata to the manifest
                metadata = {"audio_filepath": audio_path, "duration": duration, "text": transcript}
                json.dump(metadata, fout)
                fout.write('\n')

"""Process AN4 dataset."""
if not os.path.exists(source_data_dir):
    link = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    raise ValueError(
        f"Data not found at `{source_data_dir}`. Please download the AN4 dataset from `{link}` "
        f"and extract it into the folder specified by the `source_data_dir` argument."
    )

# Convert SPH files to WAV files
sph_list = glob.glob(os.path.join(source_data_dir, '**/*.sph'), recursive=True)
target_wavs_dir = os.path.join(target_data_dir, 'wavs')
if not os.path.exists(target_wavs_dir):
    print(f"Creating directories for {target_wavs_dir}.")
    os.makedirs(target_wavs_dir)

for sph_path in sph_list:
    wav_path = os.path.join(target_wavs_dir, os.path.splitext(os.path.basename(sph_path))[0] + '.wav')
    cmd = ["sox", sph_path, wav_path]
    subprocess.run(cmd, check=True)

# Build AN4 manifests
train_transcripts = os.path.join(source_data_dir, 'etc/an4_train.transcription')
train_manifest = os.path.join(target_data_dir, 'train_manifest.json')
an4_build_manifest(train_transcripts, train_manifest, target_wavs_dir)

test_transcripts = os.path.join(source_data_dir, 'etc/an4_test.transcription')
test_manifest = os.path.join(target_data_dir, 'test_manifest.json')
an4_build_manifest(test_transcripts, test_manifest, target_wavs_dir)
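
The manifest builder above extracts the transcript and file ID from each line by slicing around the first `(`. As an illustration of the same parsing step, here is a hedged sketch that uses a regular expression instead and raises an error on lines that do not match the expected `<s> transcript </s> (fileID)` layout. The names `parse_an4_line` and `manifest_record` are hypothetical and do not come from NeMo.

```python
import json
import re

# Matches "<s> transcript </s> (fileID)"; the <s> and </s> markers are optional.
_AN4_LINE = re.compile(
    r"^\s*(?:<s>)?\s*(?P<text>.*?)\s*(?:</s>)?\s*\((?P<file_id>[^()]+)\)\s*$"
)


def parse_an4_line(line):
    """Return (lowercased transcript, file ID) parsed from one transcript line."""
    match = _AN4_LINE.match(line)
    if match is None:
        raise ValueError(f"Unexpected transcript line: {line!r}")
    return match.group("text").lower(), match.group("file_id")


def manifest_record(audio_path, duration, transcript):
    """Serialize one manifest entry as a single JSON line (without newline)."""
    return json.dumps(
        {"audio_filepath": audio_path, "duration": duration, "text": transcript}
    )
```

Each manifest line is a standalone JSON object with the `audio_filepath`, `duration`, and `text` keys written by `an4_build_manifest`.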

Let's listen to a sample audio file.

# change path of the file here
import os
import IPython.display as ipd
path = os.environ["DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)
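
Besides listening to a file, it can be useful to check its sample rate, channel count, and duration, because ASR configurations typically assume a fixed sample rate (commonly 16 kHz; treat that value as an assumption to verify against your config). The sketch below reads the header of a PCM WAV file with the standard-library `wave` module. The helper name `describe_wav` is hypothetical.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration
```

Calling `describe_wav(path)` with the `path` variable defined above returns the three values for the sample file.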

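Tokenizer#

If the training dataset is small (<50 hours), we recommend using the tokenizer from the pretrained model itself. This tutorial explains how to fine-tune the Parakeet model with the tokenizer from the pretrained model. If the vocabulary changes, or if you want to train your own tokenizer, you can use the NeMo tokenizer training script and then fine-tune the model on your data with the hybrid model training script. For more details, refer to the asr-finetune-conformer-nemo tutorial.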
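
To apply the 50-hour guideline, you need the total duration of the training data. The following sketch sums the `duration` field of a manifest written in the JSON-lines format produced during preprocessing. The function names and the `threshold_hours` parameter are hypothetical; only the 50-hour figure comes from the text above.

```python
import json


def manifest_hours(manifest_path):
    """Return the total audio duration of a manifest, in hours."""
    total_seconds = 0.0
    with open(manifest_path, "r", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # Skip blank lines.
            total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 3600.0


def should_reuse_pretrained_tokenizer(hours, threshold_hours=50.0):
    """Apply the rule of thumb: reuse the tokenizer for small datasets."""
    return hours < threshold_hours
```

For example, `should_reuse_pretrained_tokenizer(manifest_hours(train_manifest))` evaluates the guideline for the training manifest built earlier.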

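Training the Parakeet-Hybrid Model#

Hybrid RNNT-CTC models are a group of models that have both an RNNT decoder and a CTC decoder. Training a hybrid model speeds up the convergence of the CTC model and lets you use a single model that works as both a CTC model and an RNNT model. This category can be used with any of the ASR models. Hybrid models place the two decoders, CTC and RNNT, on top of the encoder.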
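
As an illustration of how two decoders can share one training objective, the sketch below interpolates the two losses with a single weight. This is an assumption about the general form of a hybrid loss, not a statement of NeMo's exact implementation; the function name `hybrid_loss` and the default weight of 0.3 are hypothetical and should be checked against the configuration you train with.

```python
def hybrid_loss(rnnt_loss, ctc_loss, ctc_loss_weight=0.3):
    """Interpolate an RNNT loss and a CTC loss with one weight in [0, 1]."""
    if not 0.0 <= ctc_loss_weight <= 1.0:
        raise ValueError("ctc_loss_weight must be between 0 and 1")
    # A weight of 0 trains only the RNNT branch; a weight of 1 only the CTC branch.
    return (1.0 - ctc_loss_weight) * rnnt_loss + ctc_loss_weight * ctc_loss
```

Because both branches backpropagate into the same encoder, the CTC decoder benefits from the gradients of the RNNT objective, which is consistent with the faster CTC convergence described above.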

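NeMo uses .yaml files to configure training parameters. You can edit the configuration file directly or override individual values from the command line. For example, to change the number of epochs and the learning rate, add trainer.max_epochs=100 and model.optim.lr=0.02 to the training command.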
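
To make the effect of dotted overrides concrete, the following sketch applies `key=value` strings to a nested dictionary. It is a simplified illustration only and is not how the configuration system is implemented: in particular, it treats the `+` and `++` prefixes identically, whereas the real command-line syntax distinguishes between adding a new key and adding or overriding one. The function name `apply_overrides` is hypothetical.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.lstrip("+").split("=", 1)
        try:
            # Interpret numbers, booleans, and quoted strings as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # Fall back to the raw string, e.g. "adamw".
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return result
```

With a config of `{"trainer": {"max_epochs": 1}}`, the override `trainer.max_epochs=100` replaces the nested value and leaves the original dictionary untouched.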

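The following example command uses the speech_to_text_finetune.py script in the examples folder to train/fine-tune the Parakeet-Hybrid ASR model for 1 epoch. For other ASR models such as Citrinet and Conformer, you can find the corresponding configuration files under examples/asr/conf/ in the NeMo GitHub repository.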

# To fully train the model, you'll need to increase trainer.max_epochs from 1.
# Empirical evidence suggests that around 200 epochs should suffice.
NEMO_DIR = 'FIX_ME/path/to/NeMo'
! git clone -b $BRANCH https://github.com/NVIDIA/NeMo $NEMO_DIR
!python $NEMO_DIR/examples/asr/speech_to_text_finetune.py \
        --config-path="../asr/conf/fastconformer/hybrid_transducer_ctc/" --config-name=fastconformer_hybrid_transducer_ctc_bpe \
        +init_from_pretrained_model=stt_en_fastconformer_hybrid_large_pc \
        ++model.train_ds.manifest_filepath="$DATA_DIR/an4_converted/train_manifest.json" \
        ++model.validation_ds.manifest_filepath="$DATA_DIR/an4_converted/test_manifest.json" \
        ++model.optim.sched.d_model=1024 \
        ++trainer.devices=1 \
        ++trainer.max_epochs=1 \
        ++trainer.precision=bf16 \
        ++model.optim.name="adamw" \
        ++model.optim.lr=0.1 \
        ++model.optim.weight_decay=0.001 \
        ++model.optim.sched.warmup_steps=100 \
        ++exp_manager.version=test \
        ++exp_manager.use_datetime_version=False \
        ++exp_manager.exp_dir=$DATA_DIR/checkpoints

nemo_file_path = os.path.join(DATA_DIR, 'checkpoints/FastConformer-Hybrid-Transducer-CTC-BPE/test/checkpoints/FastConformer-Hybrid-Transducer-CTC-BPE.nemo')
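
The training command above sets `model.optim.lr=0.1` together with `model.optim.sched.d_model` and `model.optim.sched.warmup_steps`. Those two scheduler options suggest a Noam-style schedule, in which the configured learning rate is a scale factor rather than the rate actually applied at each step. The sketch below evaluates the textbook Noam formula under that assumption; the library's scheduler may differ in details such as a minimum learning rate, and the function name `noam_lr` is hypothetical.

```python
def noam_lr(step, base_lr=0.1, d_model=1024, warmup_steps=100):
    """Learning rate at `step` under the textbook Noam schedule."""
    step = max(step, 1)  # Avoid dividing by zero at step 0.
    # Linear warmup until `warmup_steps`, then inverse square-root decay.
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * scale
```

With the values from the command, the rate peaks at step 100 at roughly 3.1e-4, which is far smaller than the nominal 0.1.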

ASR Evaluation#

Now that we have trained a model, we need to check how well it performs.

!python $NEMO_DIR/examples/asr/speech_to_text_eval.py \
    model_path=$nemo_file_path \
    dataset_manifest=$DATA_DIR/an4_converted/test_manifest.json \
    output_filename=./test_manifest_predictions.json \
    batch_size=32 \
    amp=True
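
The evaluation script reports word error rate (WER). To make the metric concrete, here is a self-contained sketch that computes WER from reference and hypothesis strings with a word-level edit distance. It illustrates the metric only and is not the script's implementation; the function names are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Return the word-level Levenshtein distance between two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Return total word errors divided by total reference words."""
    errors = sum(word_edit_distance(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("References contain no words")
    return errors / words
```

If the predictions file stores each hypothesis next to the reference `text` (for example in a field such as `pred_text`, which is an assumption to verify by opening the file), the pairs can be read from it line by line and passed to `corpus_wer`.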

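ASR Model Export#

With NeMo, you can also export your model in a format that can be deployed with NVIDIA Riva, a high-performance application framework for multi-modal conversational AI services using GPUs. The same command used for exporting to ONNX can be used here; the only small change is the export_format setting in the spec file. The model above was trained with a hybrid loss, which means it combines a CTC model and an RNNT model. To use it in Riva, we need to extract a specific decoder, which we can do with the convert_nemo_asr_hybrid_to_ctc.py script.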

ctc_model_path = os.path.join(DATA_DIR, 'checkpoints/FastConformer-Hybrid-Transducer-CTC-BPE/test/checkpoints/FastConformer-CTC-BPE_ctc.nemo')
! python3 $NEMO_DIR/examples/asr/asr_hybrid_transducer_ctc/helpers/convert_nemo_asr_hybrid_to_ctc.py -i $nemo_file_path -o $ctc_model_path -t ctc

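Installing the Packages#

We will now install the NeMo and nemo2riva packages. nemo2riva is available on NVIDIA NGC. Make sure the NGC CLI is installed before running the following commands.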

!pip install nvidia-pyindex
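# __riva_version__ must be defined (set to the Riva release you want to download) before running the next command.
!ngc registry resource download-version "nvidia/riva/riva_quickstart:"$__riva_version__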
!pip install nemo2riva
!pip install protobuf==3.20.0

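Converting to Riva#

Convert the CTC model extracted above to the .riva format. We will use --key=nemotoriva to set the encryption key. When generating .riva models for production, choose a different encryption key value.

To deploy the RNNT model in Riva, use: nemo2riva --out $riva_file_path --key=nemotoriva --format nemo $rnnt_file_path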

riva_file_path = ctc_model_path[:-5]+".riva"
!nemo2riva --key=nemotoriva --onnx-opset 18 --out $riva_file_path $ctc_model_path
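
The output path above is derived by slicing off the last five characters of the input path, which silently produces a wrong name if the input does not end in `.nemo`. As a hedged alternative, the sketch below derives the same path with `pathlib` and fails loudly on an unexpected extension. The helper name `riva_path_for` is hypothetical.

```python
from pathlib import Path


def riva_path_for(nemo_path):
    """Return the sibling `.riva` path for a `.nemo` checkpoint path."""
    path = Path(nemo_path)
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo file, got: {nemo_path}")
    return str(path.with_suffix(".riva"))
```

For a valid input, `riva_path_for(ctc_model_path)` yields the same value as `ctc_model_path[:-5] + ".riva"`.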

What's Next?#

You can use NeMo to build custom models for your own applications and deploy them with NVIDIA Riva! Refer to the Conformer-CTC deployment tutorial.
