How to Improve Accuracy on Noisy Speech by Fine-Tuning the Acoustic Model (Conformer-CTC) in the Riva ASR Pipeline

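This tutorial walks you through some of the advanced customization features of the Riva ASR pipeline by fine-tuning the acoustic model (Conformer-CTC). These customizations can improve accuracy in specific speech scenarios, such as background noise and differing acoustic environments.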

NVIDIA Riva Overview

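NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services, such as: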

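Automated speech recognition (ASR)
Text-to-speech synthesis (TTS)
A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.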

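In this tutorial, we show how to augment your training data with background noise and use it to fine-tune the acoustic model (Conformer-CTC), improving accuracy on audio that contains background noise.
To understand the basics of the Riva ASR APIs, refer to Getting Started with Riva ASR in Python.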

For more information about Riva, refer to the Riva product page and documentation.

Data Preprocessing

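Fine-tuning requires audio data with background noise. If you already have such data, you can use it directly.
In this tutorial, we use the AN4 dataset and augment it with noise data from the Room Impulse Response and Noise Database, which is available from the OpenSLR database.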

We use NVIDIA NeMo for the data preprocessing steps.

NVIDIA NeMo Overview

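NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train conversational AI models on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. For more information about NeMo, refer to the NeMo product page and documentation. The open-source NeMo repository is available on GitHub.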

Requirements and Setup for Data Preprocessing

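We use NVIDIA NeMo for this data preprocessing step. This tutorial provides the code needed to clone the NeMo GitHub repository and install the NeMo Python module in our recommended virtual environment. You might, however, find it more convenient to install and run NeMo through NVIDIA's PyTorch or NeMo Docker containers. Pulling either image requires access to NGC. Refer to the linked instructions to set up a suitable Docker container.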

Downloading and Processing the AN4 Dataset

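AN4 is a small dataset recorded and distributed by Carnegie Mellon University (CMU). It consists of recordings of people spelling out addresses, names, and similar items. Information about this dataset is available on the official CMU site.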

Let's download the AN4 dataset tar file.

# Install the necessary dependencies
!pip install wget
!apt-get install -y sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
# Quote the requirement so the shell does not treat ">=" as a redirect
!pip install "matplotlib>=3.3.2"
!pip install Cython

# Import the necessary dependencies.
import wget
import glob
import os
import subprocess
import tarfile

# This is the working directory for this part of the tutorial. 
working_dir = 'am_finetuning/'
!mkdir -p $working_dir

# The AN4 directory will be created in `data_dir`. It is currently set to the `working_dir`.
data_dir = os.path.abspath(working_dir)

# Download the AN4 dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `an4_path` which points to the downloaded AN4 dataset
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"AN4 dataset downloaded at: {an4_path}")
else:
    print("AN4 dataset tarfile already exists. Proceed to the next step.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

Now, let's untar the tar file to obtain the dataset audio files in .sph format. We then use the SoX library to convert the .sph files to 16 kHz .wav files.

if not os.path.exists(data_dir + '/an4/'):
    # Untar
    with tarfile.open(an4_path) as tar:
        tar.extractall(path=data_dir)
    print("Completed untarring the AN4 tarfile")
    # Convert .sph to .wav (using sox)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        # Convert to a 16 kHz .wav file; passing a list avoids shell quoting issues
        subprocess.call(["sox", sph_path, "-r", "16000", wav_path])
    print("Finished converting the .sph files to .wav files")
else:
    print("AN4 dataset directory already exists. Proceed to the next step.")
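As a quick check that the conversion produced what the rest of the tutorial expects, the sketch below reads a .wav header with Python's standard `wave` module and reports the channel count, sample rate, and duration. The helper name `wav_info` is an illustrative assumption, not part of this tutorial or of NeMo, and it assumes uncompressed PCM .wav files.

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        # Duration is the number of frames divided by frames per second
        duration = wav.getnframes() / float(sample_rate)
    return channels, sample_rate, duration
```

Calling `wav_info(wav_path)` on one of the converted AN4 clips should report a sample rate of 16000 if the conversion succeeded.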

Next, let's build the manifest files for the AN4 dataset. A manifest file is a .json file that maps each .wav clip to its corresponding text.

Each entry in the AN4 dataset's manifest .json file follows this template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}
Example: {"audio_filepath": "/tutorials/am_finetuning/an4/wav/an4_clstk/fash/an251-fash-b.wav", "duration": 1.0, "text": "yes"}

# Import the necessary libraries.
import json
import subprocess

# Method to build a manifest.
def build_manifest(transcripts_path, manifest_path, wav_path):
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(')-1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(')+1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(
                    data_dir, wav_path,
                    file_id[file_id.find('-')+1 : file_id.rfind('-')],
                    file_id + '.wav')

                duration = float(subprocess.check_output(
                      "soxi -D {0}".format(audio_path), shell=True))
                #duration = WAVE(filename=audio_path).info.length

                # Write the metadata to the manifest
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript
                }
                
                fout.write(json.dumps(metadata) + '\n')
                
                
# Building the manifest files.
print("***Building manifest files***")

# Building manifest files for the training data
train_transcripts = data_dir + '/an4/etc/an4_train.transcription'
train_manifest = data_dir + '/an4/train_manifest.json'
if not os.path.isfile(train_manifest):
    build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')
    print("Training manifest created at", train_manifest)
else:
    print("Training manifest already exists at", train_manifest)

# Building manifest files for the test data
test_transcripts = data_dir + '/an4/etc/an4_test.transcription'
test_manifest = data_dir + '/an4/test_manifest.json'
if not os.path.isfile(test_manifest):
    build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')
    print("Test manifest created at", test_manifest)
else:
    print("Test manifest already exists at", test_manifest)

print("***Done***")
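Once the manifests are written, a short sanity check can catch malformed entries before they reach later steps. The sketch below counts the entries in a manifest and sums their durations. The function name `summarize_manifest` is hypothetical, and the required keys are taken from the manifest template shown earlier.

```python
import json

def summarize_manifest(manifest_path):
    """Return (number of entries, total duration in seconds) for a JSON-lines manifest."""
    required_keys = {"audio_filepath", "duration", "text"}
    num_entries = 0
    total_duration = 0.0
    with open(manifest_path, "r") as fin:
        for line_number, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = required_keys - entry.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            num_entries += 1
            total_duration += float(entry["duration"])
    return num_entries, total_duration
```

For example, `summarize_manifest(train_manifest)` returns the number of training clips and their combined length in seconds.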

Downloading and Processing the Background Noise Dataset

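For background noise, we use the background noise samples from the Room Impulse Response and Noise Database, which is available from the OpenSLR database. For each 30-second isotropic noise sample in the dataset, we use the first 15 seconds for training and the last 15 seconds for evaluation.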
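The split described above can be expressed as a pair of manifest-style entries that point at the same file but select different windows. The sketch below is illustrative only: `split_noise_entry` is a hypothetical helper, and it assumes that `offset` and `duration` are both given in seconds and that each clip is 30 seconds long.

```python
def split_noise_entry(audio_filepath, total_duration=30.0):
    """Return (train_entry, test_entry) covering the first and second half of a noise clip."""
    half = total_duration / 2.0
    # The first half starts at the beginning of the clip
    train_entry = {"audio_filepath": audio_filepath, "duration": half, "offset": 0.0, "text": "-"}
    # The second half starts where the first half ends
    test_entry = {"audio_filepath": audio_filepath, "duration": half, "offset": half, "text": "-"}
    return train_entry, test_entry
```

Keeping the two windows disjoint means the noise heard during evaluation is never the same audio the model was trained against.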

First, let's download the dataset.

# Download the background noise dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `noise_path` which points to the downloaded background noise dataset.

if not os.path.exists(data_dir + '/rirs_noises.zip'):
    slr28_url = 'https://www.openslr.org/resources/28/rirs_noises.zip'
    noise_path = wget.download(slr28_url, data_dir)
    print("Background noise dataset download complete.")
else:
    print("Background noise dataset already exists. Proceed to the next step.")
    noise_path = data_dir + '/rirs_noises.zip'

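Now, we unzip the .zip file, which gives us the dataset audio files as 8-channel .wav files sampled at 16 kHz. The format and sample rate suit our purposes, but we need to convert these files to mono to match the files in the AN4 dataset. Fortunately, the SoX library provides tools for that as well.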

Note: The conversion takes several minutes.

# Extract noise data
from zipfile import ZipFile
if not os.path.exists(data_dir + '/RIRS_NOISES'):
    try:
        with ZipFile(noise_path, "r") as zipObj:
            zipObj.extractall(data_dir)
            print("Extracting noise data complete")
        # Convert 8-channel audio files to mono-channel
        wav_list = glob.glob(data_dir + '/RIRS_NOISES/**/*.wav', recursive=True)
        for wav_path in wav_list:
            mono_wav_path = wav_path[:-4] + '_mono.wav'
            cmd = f"sox {wav_path} {mono_wav_path} remix 1"
            subprocess.call(cmd, shell=True)
        print("Finished converting the 8-channel noise data .wav files to mono-channel")
    except Exception:
        print("Not extracting. Extracted noise data might already exist.")
else: 
    print("Extracted noise data already exists. Proceed to the next step.")
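The `remix 1` effect in the `sox` command above builds a mono file from the first input channel. For readers who want to see what that amounts to, here is a standard-library sketch that does the same for a single file. The helper name `extract_first_channel` is an assumption for illustration, and the code only handles plain PCM .wav files that Python's `wave` module can open, which may not cover every file in the dataset.

```python
import wave

def extract_first_channel(src_path, dst_path):
    """Write channel 1 of a PCM .wav file to a new mono .wav file.

    Returns the number of channels found in the source file.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Samples are interleaved per frame, so the first channel is the
    # first `sample_width` bytes of every frame.
    frame_size = n_channels * sample_width
    mono = b"".join(
        frames[i:i + sample_width] for i in range(0, len(frames), frame_size)
    )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono)
    return n_channels
```

The sample rate and sample width are copied unchanged, so only the channel count differs between the input and output files.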

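Next, let's build the manifest files for the noise data. A manifest file is a .json file that maps each .wav clip to its corresponding text; for noise clips, the text is the placeholder "-".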

Each entry in the noise data's manifest .json file follows this template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "offset": <offset value>, "text": "-"}
Example: {"audio_filepath": "/tutorials/am_finetuning/RIRS_NOISES/real_rirs_isotropic_noises/RVB2014_type1_noise_largeroom1_1_mono.wav", "duration": 30.0, "offset": 0, "text": "-"}

import json
iso_path = os.path.join(data_dir,"RIRS_NOISES/real_rirs_isotropic_noises")
iso_noise_list = os.path.join(iso_path, "noise_list")

# Edit the noise_list file so that it lists the *_mono.wav files instead of the original *.wav files
with open(iso_noise_list) as f:
    if '_mono.wav' in f.read():
        print(f"{iso_noise_list} has already been processed")
    else:
        # Escape the dot so sed matches a literal ".wav" rather than any character followed by "wav"
        cmd = f"sed -i 's|\\.wav|_mono.wav|g' {iso_noise_list}"
        subprocess.call(cmd, shell=True)
        print(f"Finished processing {iso_noise_list}")

# Create the manifest files from noise files
def process_row(row, offset, duration):
  try:
    entry = {}
    wav_f = row['wav_filename']
    newfile = wav_f
    duration = subprocess.check_output('soxi -D {0}'.format(newfile), shell=True)
    entry['audio_filepath'] = newfile
    entry['duration'] = float(duration)
    entry['offset'] = offset
    entry['text'] = row['transcript']
    return entry
  except Exception as e:
    wav_f = row['wav_filename']
    newfile = wav_f
    print(f"Error processing {newfile} file!!!")
    
train_rows = []
test_rows = []

with open(iso_noise_list,"r") as in_f:
    for line in in_f:
        row = {}
        data = line.rstrip().split()
        row['wav_filename'] = os.path.join(data_dir,data[-1])
        row['transcript'] = "-"
        train_rows.append(process_row(row, 0 , 15))
        test_rows.append(process_row(row, 15 , 15))

# Writing manifest files
def write_manifest(manifest_file, manifest_lines):
    with open(manifest_file, 'w') as fout:
      for m in manifest_lines:
        fout.write(json.dumps(m) + '\n')
      print("Writing manifest file to", manifest_file, "complete")

# Writing training and test manifest files
test_noise_manifest  = os.path.join(data_dir, "test_noise_manifest.json")
train_noise_manifest = os.path.join(data_dir, "train_noise_manifest.json")
if not os.path.exists(test_noise_manifest):
    write_manifest(test_noise_manifest, test_rows)
else:
    print('Test noise manifest file already exists. Proceed to the next step.')
if not os.path.exists(train_noise_manifest):
    write_manifest(train_noise_manifest, train_rows)
else:
    print('Train noise manifest file already exists. Proceed to the next step.')

Creating the Noise-Augmented Dataset

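Finally, let's create a noise-augmented dataset by adding noise to the AN4 dataset with the add_noise.py NeMo script. This script generates the noise-augmented audio clips as well as the manifest files.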

Each entry in the noise-augmented data's manifest file follows this template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}
Example: {"audio_filepath": "/tutorials/am_finetuning/noise_data/train_manifest/train_noise_0db/an251-fash-b.wav", "duration": 1.0, "text": "yes"}

Setup

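Install the NeMo Python module and clone the NeMo GitHub repository locally. For the rest of this tutorial, we use scripts from the NeMo repository, and these scripts require the NeMo Python module in order to run.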

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

# Clone NeMo locally
nemo_dir = os.path.join(os.getcwd(), 'NeMo')
!git clone https://github.com/NVIDIA/NeMo $nemo_dir

Training Dataset

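Let's create a noise-augmented training dataset from the AN4 training dataset. We use the NeMo script to add noise at several SNRs (signal-to-noise ratios) ranging from 0 to 15 dB. Note that an SNR of 0 dB means that the noise and the signal in a given audio file are at equal levels.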

final_data_dir = os.path.join(data_dir, 'noise_data')

train_manifest = os.path.join(data_dir, 'an4/train_manifest.json')
test_manifest  = os.path.join(data_dir, 'an4/test_manifest.json')

train_noise_manifest = os.path.join(data_dir, 'train_noise_manifest.json')
test_noise_manifest  = os.path.join(data_dir, 'test_noise_manifest.json')

!python $nemo_dir/scripts/dataset_processing/add_noise.py \
    --input_manifest=$train_manifest \
    --noise_manifest=$train_noise_manifest \
    --snrs 0 5 10 15 \
    --out_dir=$final_data_dir

The above script generates one .json manifest file per SNR value, that is, one manifest file each for 0, 5, 10, and 15 dB SNR.

First, let's give these manifest files less cumbersome names.

noisy_train_manifest_files = os.listdir(os.path.join(final_data_dir, 'manifests'))
for filename in noisy_train_manifest_files:
    new_filename = filename.replace('train_manifest_train_noise_manifest', 'noisy_train_manifest')
    new_filepath = os.path.join(final_data_dir, 'manifests', new_filename)
    filepath = os.path.join(final_data_dir, 'manifests', filename)
    os.rename(filepath, new_filepath)

Now, let's combine the manifests for all of the noise-augmented training data into a single manifest.

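# Match only the per-SNR manifests so that re-running this cell does not feed the combined file back into itself
!cat $final_data_dir/manifests/noisy_train_manifest_*db.json > $final_data_dir/manifests/noisy_train_manifest.json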

print("Combined manifest for noise-augmented training dataset created at", final_data_dir + "/manifests/noisy_train_manifest.json")
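To make the SNR values used in this section concrete, the sketch below computes how much a noise clip would need to be scaled so that mixing it with a speech clip yields a target SNR. This is a textbook calculation based on RMS levels, shown as an illustration; it is not necessarily how add_noise.py computes its mix internally, and `noise_gain_for_snr` is a hypothetical helper.

```python
def noise_gain_for_snr(signal_rms, noise_rms, snr_db):
    """Return the factor to multiply noise samples by so the mix has the requested SNR."""
    if signal_rms <= 0 or noise_rms <= 0:
        raise ValueError("RMS values must be positive")
    # SNR(dB) = 20 * log10(signal_rms / (gain * noise_rms)), solved for gain
    return signal_rms / (noise_rms * 10 ** (snr_db / 20.0))

# Gains for the SNR values used for the training set, assuming equal starting levels
for snr_db in (0, 5, 10, 15):
    print(snr_db, noise_gain_for_snr(1.0, 1.0, snr_db))
```

At 0 dB the gain is 1.0 when the two clips start at the same level, and the gain shrinks as the target SNR rises, so higher SNR values correspond to quieter noise relative to the speech.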

Test Dataset

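Let's create a noise-augmented evaluation dataset from the AN4 test dataset by adding noise at 5 dB with the same NeMo add_noise.py script that we used to augment the training dataset.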

!python $nemo_dir/scripts/dataset_processing/add_noise.py \
    --input_manifest=$test_manifest \
    --noise_manifest=$test_noise_manifest \
    --snrs=5 \
    --out_dir=$final_data_dir

print("Noise-augmented testing dataset created at", final_data_dir+"/test_manifest")

As before, let's give the manifest file for the noise-augmented test data a less cumbersome name.

noisy_test_manifest_files = glob.glob(os.path.join(final_data_dir, 'manifests/test*'))
for filepath in noisy_test_manifest_files:
    # glob returns full paths, so build the new name from the base name only
    new_filename = os.path.basename(filepath).replace('test_manifest_test_noise_manifest', 'noisy_test_manifest')
    new_filepath = os.path.join(final_data_dir, 'manifests', new_filename)
    os.rename(filepath, new_filepath)
    
print("Manifest for noise-augmented test dataset created at", final_data_dir + "/manifests/noisy_test_manifest_5db.json")

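The noise-augmented training manifest and data are created at {working_dir}/noise_data/manifests/noisy_train_manifest.json and {working_dir}/noise_data/train_manifest, respectively.
The noise-augmented test manifest and data are created at {working_dir}/noise_data/manifests/noisy_test_manifest_5db.json and {working_dir}/noise_data/test_manifest, respectively.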

Fine-Tuning the ASR Model

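To fine-tune the ASR model with the augmented dataset we just created, you can continue with the follow-on fine-tuning tutorial. In that case, make sure to reset the manifest and dataset file paths appropriately when you call the NeMo tokenization, training, and evaluation scripts.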
