How to Fine-Tune a Riva ASR Acoustic Model with NVIDIA NeMo

This tutorial walks you through fine-tuning an NVIDIA Riva ASR acoustic model with NVIDIA NeMo.

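Important: If you plan to fine-tune the ASR acoustic model with the same tokenizer the model was trained with, skip this tutorial and refer to the "Sub-word Encoding CTC Model" section of the NeMo ASR Language Finetuning tutorial, starting from the "Load pre-trained model" subsection.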

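NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding (NLU) services, such as:

Automated speech recognition (ASR).
Text-to-speech synthesis (TTS).
A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.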

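In this tutorial, we fine-tune a Riva ASR acoustic model with NeMo.
To learn the basics of the Riva ASR APIs, refer to Getting started with Riva ASR in Python.

For more information about Riva, refer to the Riva developer documentation.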

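NeMo (Neural Modules)

NVIDIA NeMo is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models through a simple Python interface. For setup instructions, refer to the NeMo GitHub instructions.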

"""
You can run this tutorial either locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to set up in Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
# Quote the requirement so the shell does not treat ">=3.3.2" as an output redirection
!pip install "matplotlib>=3.3.2"
!pip install Cython

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option, 
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

Fine-Tuning an ASR Model with NeMo

Downloading the Data

In this tutorial, we use the popular AN4 dataset. Let's download it.

! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, untar the dataset and move it to the correct directory.

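import os
DATA_DIR = os.getcwd()
os.environ["DATA_DIR"] = DATA_DIR
! tar -xvf an4_sphere.tar.gz
# With DATA_DIR set to the current directory, the archive already extracts into place
# and `mv` only reports that source and destination are the same file.
# The move matters only if you point DATA_DIR somewhere else.
! mv an4 $DATA_DIR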

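Pre-Processing

This step converts the .sph files to .wav files and organizes the data into training and test sets. It also generates a manifest ("metadata") file for each set, which the data loader consumes during training and testing.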

import json, librosa, os, glob
import subprocess


source_data_dir = f"{DATA_DIR}/an4"
target_data_dir = f"{DATA_DIR}/an4_converted"

def an4_build_manifest(transcripts_path, manifest_path, target_wavs_dir):
    """Build an AN4 manifest from a given transcript file."""
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(') + 1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(target_wavs_dir, file_id + '.wav')

                duration = librosa.core.get_duration(filename=audio_path)

                # Write the metadata to the manifest
                metadata = {"audio_filepath": audio_path, "duration": duration, "text": transcript}
                json.dump(metadata, fout)
                fout.write('\n')

"""Process AN4 dataset."""
if not os.path.exists(source_data_dir):
    link = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    raise ValueError(
        f"Data not found at `{source_data_dir}`. Please download the AN4 dataset from `{link}` "
        f"and extract it into the folder specified by the `source_data_dir` argument."
    )

# Convert SPH files to WAV files
sph_list = glob.glob(os.path.join(source_data_dir, '**/*.sph'), recursive=True)
target_wavs_dir = os.path.join(target_data_dir, 'wavs')
if not os.path.exists(target_wavs_dir):
    print(f"Creating directories for {target_wavs_dir}.")
    os.makedirs(target_wavs_dir)

for sph_path in sph_list:
    wav_path = os.path.join(target_wavs_dir, os.path.splitext(os.path.basename(sph_path))[0] + '.wav')
    cmd = ["sox", sph_path, wav_path]
    subprocess.run(cmd, check=True)

# Build AN4 manifests
train_transcripts = os.path.join(source_data_dir, 'etc/an4_train.transcription')
train_manifest = os.path.join(target_data_dir, 'train_manifest.json')
an4_build_manifest(train_transcripts, train_manifest, target_wavs_dir)

test_transcripts = os.path.join(source_data_dir, 'etc/an4_test.transcription')
test_manifest = os.path.join(target_data_dir, 'test_manifest.json')
an4_build_manifest(test_transcripts, test_manifest, target_wavs_dir)
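
Before moving on, it can be worth checking that the manifests were written as expected. The following is a minimal sketch, not part of the original tutorial: `summarize_manifest` is a hypothetical helper that reads a JSON-lines manifest, verifies that each entry carries the three keys written by `an4_build_manifest`, and reports the utterance count and total audio duration.

```python
import json

def summarize_manifest(manifest_path):
    """Return (number of utterances, total duration in seconds) for a JSON-lines manifest."""
    count, total_seconds = 0, 0.0
    with open(manifest_path, "r", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # Each entry is expected to carry the keys written by an4_build_manifest.
            missing = {"audio_filepath", "duration", "text"} - entry.keys()
            if missing:
                raise KeyError(f"Manifest entry is missing keys: {sorted(missing)}")
            count += 1
            total_seconds += float(entry["duration"])
    return count, total_seconds
```

For example, `summarize_manifest(train_manifest)` returns the number of training utterances and their combined length in seconds.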

Let's listen to a sample audio file.

# change path of the file here
import os
import IPython.display as ipd
path = os.environ["DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)

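Training

Creating the Tokenizer

Before training, we need to create a tokenizer, because this ASR model uses word-piece (subword) encoding. Character-based models do not need one, since their vocabulary consists of single characters only. We can create the tokenizer with NeMo's process_asr_text_tokenizer.py script, which generates the subword vocabulary used for training. The vocabulary size (vocab_size) should match the vocabulary size in the ASR model. We will clone the NeMo GitHub repository to use the scripts and examples it provides.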

# clone NeMo locally
NEMO_DIR = 'FIX_ME/path/to/NeMo'
! git clone https://github.com/NVIDIA/NeMo $NEMO_DIR

# create the tokenizer
!python $NEMO_DIR/scripts/tokenizers/process_asr_text_tokenizer.py \
         --manifest=$DATA_DIR/an4_converted/train_manifest.json \
         --data_root=$DATA_DIR/an4 \
         --vocab_size=128 \
         --tokenizer=spe \
         --spe_type=unigram
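
When picking `vocab_size`, one quick sanity check is that the vocabulary has room for at least every distinct character in the training transcripts; a smaller vocabulary could not cover the text. The sketch below is an illustration under that assumption, not part of the NeMo script. `transcript_charset` and `vocab_size_is_plausible` are hypothetical helper names, and the check ignores any special tokens the tokenizer reserves.

```python
import json

def transcript_charset(manifest_path):
    """Collect the set of distinct characters used in the manifest transcripts."""
    chars = set()
    with open(manifest_path, "r", encoding="utf-8") as fin:
        for line in fin:
            if line.strip():
                chars.update(json.loads(line)["text"])
    return chars

def vocab_size_is_plausible(manifest_path, vocab_size):
    """True if vocab_size leaves room for every distinct character (special tokens ignored)."""
    return vocab_size >= len(transcript_charset(manifest_path))
```

With the manifest built earlier, `vocab_size_is_plausible(train_manifest, 128)` checks the value passed to `--vocab_size` above.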

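Training Conformer-CTC

NeMo uses YAML (.yaml) files to configure training parameters. You can edit the config file directly or override values from the command line. For example, to change the number of epochs and the learning rate, append trainer.max_epochs=100 and model.optim.lr=0.02 to the training command.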
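
To make the idea of a command-line override concrete, here is a small sketch of how a dotted `key=value` argument maps onto a nested configuration. It is an illustration in plain Python, not how NeMo or its configuration library actually implements overrides. `apply_override` is a hypothetical helper, and the key paths mirror those used in the training command in this section.

```python
def apply_override(config, override):
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create intermediate sections as needed
    # Best-effort typing: try int, then float, otherwise keep the raw string.
    for cast in (int, float):
        try:
            node[leaf] = cast(raw_value)
            break
        except ValueError:
            continue
    else:
        node[leaf] = raw_value
    return config

# Toy stand-in for a training config; real configs have many more fields.
config = {"trainer": {"max_epochs": 1}, "model": {"optim": {"lr": 2.0}}}
for item in ["trainer.max_epochs=100", "model.optim.lr=0.02"]:
    apply_override(config, item)
print(config)
```

Each dot in the key descends one level in the config, which is why optimizer settings for the model are addressed as `model.optim.*` while trainer settings sit under `trainer.*`.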

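The example command below uses the speech_to_text_ctc_bpe.py script in the examples folder to train/fine-tune a Conformer-CTC ASR model for 1 epoch. For other ASR models, such as Citrinet, you can find the corresponding config files under examples/asr/conf/ in the NeMo GitHub repository.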

# To fully train the model from scratch, you'll need to increase trainer.max_epochs from 1.
# Empirical evidence suggests that around 200 epochs should suffice.
!python $NEMO_DIR/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/conformer/ --config-name=conformer_ctc_bpe \
    +init_from_pretrained_model=stt_en_conformer_ctc_large \
    model.train_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
    model.validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
    model.tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v128 \
    trainer.devices=1 \
    trainer.max_epochs=1 \
    model.optim.name="adamw" \
    model.optim.lr=1.0 \
    model.optim.weight_decay=0.001 \
    model.optim.sched.warmup_steps=2000 \
    ++exp_manager.exp_dir=$DATA_DIR/checkpoints \
    ++exp_manager.version=test \
    ++exp_manager.use_datetime_version=False

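# Checkpoints are written under exp_manager.exp_dir, which was set to $DATA_DIR/checkpoints above
!ls $DATA_DIR/checkpoints/Conformer-CTC-BPE/test/checkpoints/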

nemo_file_path = os.path.join(DATA_DIR, 'checkpoints/Conformer-CTC-BPE/test/checkpoints/Conformer-CTC-BPE.nemo')

ASR Evaluation

Now that we have trained a model, we need to check how well it performs.

!python $NEMO_DIR/examples/asr/speech_to_text_eval.py \
    model_path=$nemo_file_path \
    dataset_manifest=$DATA_DIR/an4_converted/test_manifest.json \
    output_filename=./test_manifest_predictions.json \
    batch_size=32 \
    amp=True
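
The usual metric for this kind of evaluation is the word error rate (WER): the number of word-level substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below shows that definition in plain Python. It is an illustration only and not the implementation used by the evaluation script; `edit_distance` and `word_error_rate` are hypothetical helper names.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]

def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words if words else 0.0
```

To apply it to the predictions file, you would pair each manifest entry's `text` with its predicted transcript. The field name that holds the prediction depends on the script version, so inspect one line of the output file before relying on a specific key.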

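ASR Model Export

With NeMo, you can also export your model to a format that can be deployed with NVIDIA Riva, a high-performance application framework for multimodal conversational AI services using GPUs. The same command used for exporting to ONNX can be used here; the only small change is the export_format setting in the spec file.

Installing the Packages

We will now install the nemo2riva package; NeMo itself was installed in the setup step at the start of this tutorial. nemo2riva is available on NVIDIA NGC. Make sure the NGC CLI is installed before running the commands below.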

from version import __riva_version__
print(__riva_version__)
!pip install nvidia-pyindex
!ngc registry resource download-version "nvidia/riva/riva_quickstart:"$__riva_version__
!pip install nemo2riva
!pip install protobuf==3.20.0

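Converting to Riva

Convert the fine-tuned .nemo model to the .riva format. We use --key=nemotoriva to set the encryption key. Choose a different encryption key value when generating .riva models for production.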

riva_file_path = nemo_file_path[:-5]+".riva"
!nemo2riva --out {riva_file_path} --key=nemotoriva {nemo_file_path}
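
The slice `nemo_file_path[:-5]` works because `.nemo` is exactly five characters long. As a slightly more defensive alternative, the output path can be derived from the file extension instead. This is a minimal sketch, and `riva_path_for` is a hypothetical helper rather than part of nemo2riva.

```python
from pathlib import Path

def riva_path_for(nemo_path):
    """Return the sibling .riva path for a .nemo checkpoint path."""
    nemo_path = Path(nemo_path)
    if nemo_path.suffix != ".nemo":
        # Fail early instead of silently producing an oddly named output file.
        raise ValueError(f"Expected a .nemo file, got: {nemo_path}")
    return str(nemo_path.with_suffix(".riva"))
```

`riva_path_for(nemo_file_path)` yields the same value as the slicing expression for a path ending in `.nemo`, and raises an error for anything else.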

What's Next?

You can use NeMo to build custom models for your own applications and deploy them with NVIDIA Riva! Refer to the Conformer-CTC deployment tutorial.
