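Fine-Tuning Text to Speech with NeMo#

NeMo Toolkit is a Python-based AI toolkit for training and customizing purpose-built pretrained AI models with your own data.

Transfer learning extracts learned features from an existing neural network and carries them over to a new one. It is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent AI apps and services can bring their own data to fine-tune pretrained models instead of going through the hassle of training from scratch.

Let's see this in action with a speech synthesis use case!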

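Text to Speech#

Text to Speech (TTS) is often the last step in building a conversational AI model. A TTS model converts text into audible speech. The main objective is to synthesize reasonable and natural speech for a given text. Since there are no universal standards for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained.

In this tutorial we look at two models: FastPitch for spectrogram generation and HiFiGAN as the vocoder.

Let's Dig In: TTS Using NeMo#

This notebook assumes that you are already familiar with TTS training using NeMo, as described in the text-to-speech-training notebook, and that you have a pretrained TTS model.

After installing NeMo, the next step is to set up the paths for saving data and results. NeMo can be used with Docker containers or virtual environments.

Replace each FIXME variable with the required path, written as a string enclosed in quotes ("").

IMPORTANT NOTE: Here, we map the directories in which we save the data, specs, results, and cache. Configure them for your specific case so that these directories are correctly visible to the Docker container. Make sure this tutorial is located in the NeMo folder.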

Package Installation and File Imports#

We will first install all the necessary packages.

! pip install "numba>=0.53"
! pip install librosa
! pip install soundfile
! pip install tqdm

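Install the following packages only if you want to export the model to the .riva format; otherwise you can skip them. We will now install NeMo and nemo2riva. nemo2riva is available on NGC. Make sure the NGC CLI is installed before running the following commands.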

!pip install nvidia-pyindex
!pip install "nemo_toolkit[all]"
!ngc registry resource download-version "nvidia/riva/riva_quickstart:2.8.1"
!pip install "riva_quickstart_v2.8.1/nemo2riva-2.8.1-py3-none-any.whl"
!pip install protobuf==3.20.0
# Installing pynini separately
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/nemo_text_processing/install_pynini.sh \
&& bash install_pynini.sh

We will now download all the relevant files from NeMo.

! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/tts/ljspeech/get_data.py
    
! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/tts/extract_sup_data.py
! mkdir -p ljspeech && cd ljspeech \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/tts/ljspeech/ds_conf/ds_for_fastpitch_align.yaml \
&& cd ..
    
# additional files
!wget https://raw.githubusercontent.com/nvidia/NeMo/main/examples/tts/fastpitch_finetune.py

!mkdir -p conf \
&& cd conf \
&& wget https://raw.githubusercontent.com/nvidia/NeMo/main/examples/tts/conf/fastpitch_align_v1.05.yaml \
&& cd ..

!mkdir -p tts_dataset_files && cd tts_dataset_files \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/cmudict-0.7b_nv22.10 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/heteronyms-052722 \
&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/nemo_text_processing/text_normalization/en/data/whitelist/tts.tsv \
&& cd ..
            
! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/tts/generate_mels.py
    
! wget https://raw.githubusercontent.com/nvidia/NeMo/main/examples/tts/hifigan_finetune.py

Set Relevant Paths#

# NOTE: The following paths are set from the perspective of the NeMo Docker.

import os
from pathlib import Path

# The data is saved here
DATA_DIR = FIXME
RESULTS_DIR = FIXME

! mkdir -p {DATA_DIR}
! mkdir -p {RESULTS_DIR}

os.environ["DATA_DIR"] = DATA_DIR
os.environ["RESULTS_DIR"] = RESULTS_DIR

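Data#

For the rest of this notebook, it is assumed that you have:

- Pretrained FastPitch and HiFiGAN models that were trained on LJSpeech sampled at 22.05 kHz

If you are not using a TTS model trained on LJSpeech at the correct sampling rate, make sure you have the original data, including the wav files and a .json manifest file. If you have a TTS model that was not trained at 22.05 kHz, make sure you set the correct sampling rate and FFT parameters.

For the rest of this notebook, we use a subset of audio samples from the Hi-Fi TTS dataset, adding up to about one minute. This dataset is for demo purposes only. For a high-quality model, we recommend at least 30 minutes of audio. If you want to record your own dataset, you can follow the Guidelines to Record a TTS Dataset at Home. Sample scripts for downloading and preprocessing the datasets supported by NeMo can be found in the NeMo repository under scripts/dataset_processing/tts.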
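To check a dataset against the 30-minute recommendation, the durations in a manifest can be summed. The following is a minimal sketch, assuming a JSON-lines manifest whose `duration` field is in seconds; the helper name `total_duration_minutes` is hypothetical and not part of NeMo.

```python
import json


def total_duration_minutes(manifest_lines):
    """Sum the "duration" field (assumed to be seconds) over JSON-lines entries."""
    total_seconds = 0.0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 60.0


# Usage with two in-memory entries; for a real file, pass an open file object instead.
example = [
    '{"audio_filepath": "a.wav", "text": "hello", "duration": 2.4}',
    '{"audio_filepath": "b.wav", "text": "world", "duration": 3.6}',
]
minutes = total_duration_minutes(example)
print(f"{minutes:.2f} minutes of audio")
```

Because the function accepts any iterable of lines, the same call works for `open(manifest_path)`.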

Let's first download and preprocess the original LJSpeech dataset, and set a variable that points to its .json manifest file as the original data.

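Preprocessing#

This step downloads the audio-to-text file lists from NVIDIA and generates the manifest files. If you use your own dataset, you must generate three files: ljs_audio_text_train_manifest.json, ljs_audio_text_val_manifest.json, and ljs_audio_text_test_manifest.json. These files correspond to your train / validation / test split. In each file, the number of lines should equal the number of samples in that split, and for a single-speaker dataset each line should look like this: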

{"audio_filepath": "path_to_audio_file", "text": "text_of_the_audio", "duration": duration_of_the_audio}

For a multi-speaker dataset, each line should look like this:

{"audio_filepath": "path_to_audio_file", "text": "text_of_the_audio", "duration": duration_of_the_audio, "speaker": speaker_id}

An example line is:

{"audio_filepath": "actressinhighlife_01_bowen_0001.flac", "text": "the pleasant season did my heart employ", "duration": 2.4}
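A quick format check can catch malformed manifest lines before training starts. This is a minimal sketch, not a NeMo utility: the helper name `check_manifest_line` and the rule that `duration` must be a positive number are assumptions made for illustration.

```python
import json

# Keys taken from the single-speaker format shown above.
REQUIRED_KEYS = {"audio_filepath", "text", "duration"}


def check_manifest_line(line, multi_speaker=False):
    """Return a list of problems found in one manifest line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    required = REQUIRED_KEYS | ({"speaker"} if multi_speaker else set())
    problems = [f"missing key: {key}" for key in sorted(required - set(record))]
    if "duration" in record:
        duration = record["duration"]
        if not (isinstance(duration, (int, float)) and duration > 0):
            problems.append("duration must be a positive number")
    return problems


line = ('{"audio_filepath": "actressinhighlife_01_bowen_0001.flac", '
        '"text": "the pleasant season did my heart employ", "duration": 2.4}')
print(check_manifest_line(line))
print(check_manifest_line(line, multi_speaker=True))
```

The second call reports the missing `speaker` key, which is only required for multi-speaker manifests.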

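We will now download the audio and manifest files, convert them to the format above, and normalize the text. For LJSpeech, these steps are implemented in the NeMo script scripts/dataset_processing/tts/ljspeech/get_data.py. Please be patient; this step is expected to take some time.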

! python get_data.py --data-root {DATA_DIR}

import os

original_data_json = os.path.join(os.environ["DATA_DIR"], "LJSpeech-1.1/train_manifest.json")
os.environ["original_data_json"] = original_data_json

Now let's download the Hi-Fi TTS audio samples and place the data in DATA_DIR. Create a manifest file named manifest.json and copy the contents of dev.json and train.json into it.
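The manifest can also be assembled programmatically. The sketch below assumes dev.json and train.json are JSON-lines files in the same directory; the helper name `concat_manifests` is hypothetical, and the usage runs on small stand-in files rather than the real dataset.

```python
import json
import tempfile
from pathlib import Path


def concat_manifests(input_paths, output_path):
    """Append every non-empty JSON line from the inputs to one output manifest."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # fail early on a malformed entry
                    out.write(line + "\n")
                    count += 1
    return count


# Stand-in files; in the notebook these would be dev.json and train.json
# inside the untarred Hi-Fi TTS sample directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "dev.json").write_text('{"audio_filepath": "d.wav", "text": "a", "duration": 1.0}\n')
(tmp / "train.json").write_text('{"audio_filepath": "t.wav", "text": "b", "duration": 2.0}\n')
written = concat_manifests([tmp / "dev.json", tmp / "train.json"], tmp / "manifest.json")
print(written)
```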

import os

# Name of the untarred Hi-Fi TTS audio samples directory.
finetune_data_name = FIXME
# Absolute path of finetuning dataset from the perspective of NeMo container
finetune_data_path = os.path.join(os.environ["DATA_DIR"], finetune_data_name)

os.environ["finetune_data_name"] = finetune_data_name

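Now that you have downloaded the data, let's make sure the audio clips have the same sampling frequency as the clips used to train the pretrained model. For the course of this notebook, NVIDIA recommends using a model trained on the LJSpeech dataset. The sampling rate of this model is 22.05 kHz.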

import soundfile
import librosa
import json
import os

def resample_audio(input_file_path, output_path, target_sampling_rate=22050):
    """Resample a single audio file.
    
    Args:
        input_file_path (str): Path to the input audio file.
        output_path (str): Path to the output audio file.
        target_sampling_rate (int): Sampling rate for output audio file.
        
    Returns:
        No explicit returns
    """
    if not input_file_path.endswith(".wav"):
        raise NotImplementedError("Loading only implemented for wav files.")
    if not os.path.exists(input_file_path):
        raise FileNotFoundError(f"Cannot find input file at {input_file_path}")
    audio, sampling_rate = librosa.load(
      input_file_path,
      sr=target_sampling_rate
    )
    # Filtering out empty audio files.
    if librosa.get_duration(y=audio, sr=sampling_rate) == 0:
        print(f"0 duration audio file encountered at {input_file_path}")
        return None
    filename = os.path.basename(input_file_path)
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    soundfile.write(
        os.path.join(output_path, filename),
        audio,
        samplerate=target_sampling_rate,
        format="wav"
    )
    return filename
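Before resampling a whole dataset, it can help to see which clips actually differ from the target rate. The following sketch reads the rate from the WAV header using only the standard library. It assumes plain PCM WAV input, and the helper names `wav_sampling_rate` and `needs_resampling` are hypothetical.

```python
import os
import tempfile
import wave


def wav_sampling_rate(path):
    """Read the sampling rate from a PCM WAV header without decoding the audio."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sampling_rate=22050):
    """True when the file's sampling rate differs from the target rate."""
    return wav_sampling_rate(path) != target_sampling_rate


# Usage: write one second of 16-bit mono silence at 44.1 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00\x00" * 44100)

print(wav_sampling_rate(demo_path), needs_resampling(demo_path))
```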

from tqdm.notebook import tqdm

relative_path = f"{finetune_data_name}/clips_resampled"
resampled_manifest_file = os.path.join(
    os.environ["DATA_DIR"],
    f"{finetune_data_name}/manifest_resampled.json"
)
input_manifest_file = os.path.join(
    os.environ["DATA_DIR"],
    f"{finetune_data_name}/manifest.json"
)
sampling_rate = 22050
output_path = os.path.join(os.environ["DATA_DIR"], relative_path)

# Resampling the audio clip.
with open(input_manifest_file, "r") as finetune_file:
    with open(resampled_manifest_file, "w") as resampled_file:
        for line in tqdm(finetune_file.readlines()):
            data = json.loads(line)
            filename = resample_audio(
                os.path.join(
                    os.environ["DATA_DIR"],
                    finetune_data_name,
                    data["audio_filepath"]
                ),
                output_path,
                target_sampling_rate=sampling_rate
            )
            if not filename:
                print(f"Skipping clip {data['audio_filepath']} from training dataset")
                continue
            data["audio_filepath"] = os.path.join(
                os.environ["DATA_DIR"],
                relative_path, filename
            )
            resampled_file.write(f"{json.dumps(data)}\n")

assert resampled_file.closed, "Output file wasn't closed properly"
assert finetune_file.closed, "Input file wasn't closed properly"

# Splitting the dataset to train and val set.
! cat $finetune_data_path/manifest_resampled.json | tail -n 2 > $finetune_data_path/manifest_val.json
! cat $finetune_data_path/manifest_resampled.json | head -n -2 > $finetune_data_path/manifest_train.json

from pathlib import Path

finetune_data_json = os.path.join(os.environ["DATA_DIR"], f'{finetune_data_name}/manifest_train.json')
os.environ["finetune_data_json"] = finetune_data_json
os.environ["finetune_val_data_json"] = os.path.join(os.environ["DATA_DIR"], f'{finetune_data_name}/manifest_val.json')
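The `head -n -2` form used for the split relies on GNU `head`, which is not available on every platform. A portable Python equivalent might look like the sketch below; the function name `split_manifest_lines` is hypothetical.

```python
def split_manifest_lines(lines, num_val=2):
    """Hold out the last `num_val` entries for validation, keep the rest for training."""
    lines = [line for line in lines if line.strip()]
    if not 0 < num_val < len(lines):
        raise ValueError("num_val must be positive and smaller than the manifest size")
    return lines[:-num_val], lines[-num_val:]


# Usage with placeholder lines; real input would come from manifest_resampled.json.
train_lines, val_lines = split_manifest_lines(["a\n", "b\n", "c\n", "d\n", "e\n"])
print(len(train_lines), len(val_lines))
```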

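The first step is to create a json manifest that contains data from both the original data and the fine-tuning data. Since the original data is much larger than the fine-tuning data, we merge the fine-tuning data with a sample of the original data. We can do this as follows: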

import random
import json

def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)
            
            
def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")
            
            
def dataset_merge(original_manifest, finetune_manifest, num_records_original=50):
    original_ds = list(json_reader(original_manifest))
    finetune_ds = list(json_reader(finetune_manifest))
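    # Cap the sample size so that a small original manifest does not raise ValueError.
    original_ds = random.sample(original_ds, min(num_records_original, len(original_ds)))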
    merged_ds = original_ds + finetune_ds
    random.shuffle(merged_ds)
    return merged_ds

merged_ds = dataset_merge(os.environ["original_data_json"], os.environ["finetune_data_json"])

os.environ["merged_data_json"] = f"{DATA_DIR}/{finetune_data_name}/merged_train.json"
json_writer(os.environ["merged_data_json"], merged_ds)
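Because the merge draws a random sample, two runs generally produce different training manifests. If reproducibility matters, one option is a seeded local generator, as in this sketch; the helper name `sample_original` and the seed value are illustrative assumptions, not part of the notebook.

```python
import random


def sample_original(original_records, num_records, seed=0):
    """Draw a reproducible sample, capped at the number of available records."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    count = min(num_records, len(original_records))
    return rng.sample(original_records, count)


# Placeholder records standing in for entries of the original manifest.
original = [{"audio_filepath": f"LJ{i:03d}.wav"} for i in range(200)]
first = sample_original(original, 50, seed=1234)
second = sample_original(original, 50, seed=1234)
print(len(first), first == second)
```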

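Getting Pitch Statistics#

Training FastPitch requires you to set two values for pitch extraction:

avg: the average used to normalize the pitch
std: the standard deviation used to normalize the pitch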
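To make the role of these two values concrete, the sketch below computes them for a toy pitch contour and applies the usual (pitch - avg) / std normalization. It assumes that unvoiced frames are marked with 0.0 and excluded from the statistics, and it uses the population standard deviation; the exact estimator NeMo uses may differ, and both helper names are hypothetical.

```python
import statistics


def pitch_stats(pitch_values):
    """Mean and population standard deviation over voiced frames (non-zero pitch)."""
    voiced = [p for p in pitch_values if p > 0.0]
    return statistics.fmean(voiced), statistics.pstdev(voiced)


def normalize_pitch(pitch_values, avg, std):
    """Apply (pitch - avg) / std to voiced frames; unvoiced frames stay at 0."""
    return [(p - avg) / std if p > 0.0 else 0.0 for p in pitch_values]


contour = [0.0, 100.0, 200.0, 0.0]  # Hz per frame; 0.0 marks an unvoiced frame
avg, std = pitch_stats(contour)
normalized = normalize_pitch(contour, avg, std)
print(avg, std, normalized)
```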

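We can compute the pitch for the training data with scripts/dataset_processing/tts/extract_sup_data.py and extract pitch statistics with the NeMo script scripts/dataset_processing/tts/compute_speaker_stats.py. We downloaded extract_sup_data.py earlier in this tutorial. Let's use it to get pitch_mean and pitch_std.

First, we extract the pitch supplementary data with extract_sup_data.py. This script works together with the yaml config file ds_for_fastpitch_align, which we also downloaded above. To make it work for your dataset, simply change manifest_filepath to the path of your manifest. The sup_data_path argument determines where the supplementary data is stored.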

sup_data_path = f'{finetune_data_path}/sup_data_path'
pitch_stats_path = f'{finetune_data_path}/pitch_stats.json'

# The script extract_sup_data.py writes the pitch mean and pitch std in the commandline. We will parse it to get the pitch mean and std
cmd_str_list = !python extract_sup_data.py --config-path "ljspeech" manifest_filepath={os.environ["merged_data_json"]} sup_data_path={sup_data_path}
# Select only the line that contains PITCH_MEAN
cmd_str = [c for c in cmd_str_list if "PITCH_MEAN" in c][0]

# Extract pitch mean and std from the commandline
pitch_mean_str = cmd_str.split(',')[0]
pitch_mean = float(pitch_mean_str.split('=')[1])
pitch_std_str = cmd_str.split(',')[1]
pitch_std = float(pitch_std_str.split('=')[1])
pitch_mean, pitch_std
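Splitting on `,` and `=` works as long as the output line has exactly the expected shape. A regular expression is somewhat more tolerant of log prefixes and extra whitespace. The sketch below assumes the line looks like `PITCH_MEAN=<number>, PITCH_STD=<number>`, which is inferred from the parsing code above rather than from the script itself; `parse_pitch_stats` is a hypothetical helper.

```python
import re

_NUMBER = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
_PITCH_RE = re.compile(rf"PITCH_MEAN\s*=\s*{_NUMBER}.*?PITCH_STD\s*=\s*{_NUMBER}")


def parse_pitch_stats(output_lines):
    """Return (pitch_mean, pitch_std) from the first line that contains both values."""
    for line in output_lines:
        match = _PITCH_RE.search(line)
        if match:
            return float(match.group(1)), float(match.group(2))
    raise ValueError("no line containing PITCH_MEAN and PITCH_STD was found")


# Assumed output shape; the surrounding log line is made up for illustration.
sample_output = ["[NeMo I] some log line", "PITCH_MEAN=152.3, PITCH_STD=64.0"]
print(parse_pitch_stats(sample_output))
```

Raising an explicit error when no line matches makes a failed extraction easier to diagnose than an `IndexError` from list indexing.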

Set pitch_mean and pitch_std based on the results from the cell above.

os.environ["pitch_mean"] = str(pitch_mean)
os.environ["pitch_std"] = str(pitch_std)

print(f"pitch mean: {pitch_mean}")
print(f"pitch std: {pitch_std}")

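Fine-Tuning#

We are now ready to fine-tune our TTS pipeline. To do so, you need to fine-tune FastPitch. For best results, you should also fine-tune HiFiGAN.

Here we use the pretrained checkpoints from NGC: FastPitch and HiFiGAN.

Fine-Tuning FastPitch#

We need some additional files from NeMo to run fine-tuning on FastPitch; we downloaded them earlier in this tutorial. In NeMo, the fastpitch_finetune.py script and its configs are located in the examples section (examples/tts).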

!(python fastpitch_finetune.py --config-name=fastpitch_align_v1.05.yaml \
  train_dataset={os.environ["merged_data_json"]} \
  validation_datasets={os.environ["finetune_val_data_json"]} \
  sup_data_path={sup_data_path} \
  phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.10 \
  heteronyms_path=tts_dataset_files/heteronyms-052722 \
  whitelist_path=tts_dataset_files/tts.tsv \
  exp_manager.exp_dir={os.environ["RESULTS_DIR"]} \
  +init_from_pretrained_model="tts_en_fastpitch" \
  +trainer.max_steps=1000 \
  ~trainer.max_epochs \
  trainer.check_val_every_n_epoch=10 \
  model.train_ds.dataloader_params.batch_size=24 \
  model.validation_ds.dataloader_params.batch_size=24 \
  model.n_speakers=1 \
  model.pitch_mean={os.environ["pitch_mean"]} model.pitch_std={os.environ["pitch_std"]} \
  model.optim.lr=2e-4 \
  ~model.optim.sched \
  model.optim.name=adam \
  trainer.devices=1 \
  trainer.strategy=null \
  +model.text_tokenizer.add_blank_at=true \
)

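Let's take a closer look at the training command:

--config-name=fastpitch_align_v1.05.yaml
- We first tell the script which config file to use.
train_dataset={os.environ["merged_data_json"]} validation_datasets={os.environ["finetune_val_data_json"]} sup_data_path={sup_data_path}
- We tell the script which manifest files to train and evaluate on, and where the supplementary data is located (if it is not provided, it is computed and saved during training).
phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.10 heteronyms_path=tts_dataset_files/heteronyms-052722 whitelist_path=tts_dataset_files/tts.tsv
- We tell the script where phoneme_dict_path, heteronyms_path, and whitelist_path are located. These are the additional files we downloaded earlier; they are used to preprocess the data.
exp_manager.exp_dir={os.environ["RESULTS_DIR"]}
- Where we want to save our log files, tensorboard files, checkpoints, and more.
+init_from_pretrained_model="tts_en_fastpitch"
- We tell the script which pretrained checkpoint to fine-tune from.
+trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=10
- For this experiment, we tell the script to train for 1000 training steps/iterations rather than specifying a number of epochs to run. Since the config file specifies max_epochs, we need to remove it using ~trainer.max_epochs.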
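model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24
- Set the batch sizes for the training and validation data loaders.
model.n_speakers=1
- The number of speakers in the data. There is only one for now, but we will revisit this parameter later in the notebook.
model.pitch_mean={os.environ["pitch_mean"]} model.pitch_std={os.environ["pitch_std"]}
- For the new speaker, we need to define new pitch hyperparameters for better audio quality.
- As an example, model.pitch_mean=152.3 model.pitch_std=64.0 model.pitch_fmin=30 model.pitch_fmax=512 work for speaker 9017 from the Hi-Fi TTS dataset.
- If you are using a custom dataset, running the script python <NeMo_base>/scripts/dataset_processing/tts/extract_sup_data.py manifest_filepath=<your_manifest_path> will precompute the supplementary data and print these pitch statistics.
- fmin and fmax are hyperparameters of librosa's pyin function. We recommend tweaking them only if the speaker is in a noisy environment, so that background noise is not predicted to be speech.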
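model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam
- For fine-tuning, we lower the learning rate.
- We use a fixed learning rate of 2e-4.
- We switch from the lamb optimizer to the adam optimizer.
trainer.devices=1 trainer.strategy=null
- For this notebook, we default to 1 GPU, which means that we do not need DDP.
- If you have the compute resources, feel free to scale this up to the number of free GPUs you have available.
- If you plan to train on multiple GPUs, remove the trainer.strategy=null part.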
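The arguments in the command follow Hydra's override grammar, which NeMo's example scripts use for configuration: a plain `key=value` overrides an existing entry, a `+` prefix adds a new key, `++` adds or overrides, and `~` deletes a key. The sketch below only classifies these simple prefix forms for illustration; it is not Hydra's parser, and `describe_override` is a hypothetical helper.

```python
def describe_override(override):
    """Classify one Hydra-style command-line override by its prefix."""
    if override.startswith("++"):
        key, _, value = override[2:].partition("=")
        return ("add_or_override", key, value)
    if override.startswith("+"):
        key, _, value = override[1:].partition("=")
        return ("add", key, value)
    if override.startswith("~"):
        key, _, value = override[1:].partition("=")
        return ("delete", key, value or None)
    key, _, value = override.partition("=")
    return ("override", key, value)


# Three overrides taken from the FastPitch command above.
for arg in ["+trainer.max_steps=1000", "~trainer.max_epochs", "model.optim.lr=2e-4"]:
    print(describe_override(arg))
```

This explains why `max_steps` needs a `+` (the key is not in the config file) while `max_epochs` needs a `~` (the key is in the config file and must be removed).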

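Fine-Tuning HiFiGAN#

To get the best audio from HiFiGAN, we need to fine-tune it:

- on the new speaker
- using mel spectrograms from our fine-tuned FastPitch model

Let's first generate mels from our FastPitch model and save them to a new .json manifest for use with HiFiGAN. We can generate the mels using the generate_mels.py file from NeMo.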

fastpitch_checkpoint = FIXME
mel_dir = f"{finetune_data_path}/mels"
! mkdir -p {mel_dir}

!(python generate_mels.py \
  --fastpitch-model-ckpt {fastpitch_checkpoint} \
  --input-json-manifests {os.environ["merged_data_json"]} \
  --output-json-manifest-root {mel_dir} \
 )

Fine-Tuning HiFiGAN#

Now let's fine-tune HiFiGAN. HiFiGAN fine-tuning can be done with the NeMo script examples/tts/hifigan_finetune.py and the configs in examples/tts/conf/hifigan.

Create a small validation dataset for HiFiGAN fine-tuning:

hifigan_full_ds = f"{finetune_data_path}/mels/merged_full_mel.json"
hifigan_train_ds = f"{finetune_data_path}/mels/merged_train_mel.json"
hifigan_val_ds = f"{finetune_data_path}/mels/merged_val_mel.json"

! cat {hifigan_train_ds} > {hifigan_full_ds}
! cat {hifigan_full_ds} | tail -n 2 > {hifigan_val_ds}
! cat {hifigan_full_ds} | head -n -2 > {hifigan_train_ds}

Run the following command to fine-tune HiFiGAN:

!(python examples/tts/hifigan_finetune.py \
--config-name=hifigan.yaml \
model.train_ds.dataloader_params.batch_size=32 \
model.max_steps=1000 \
model.optim.lr=0.00001 \
~model.optim.sched \
train_dataset={hifigan_train_ds} \
validation_datasets={hifigan_val_ds} \
exp_manager.exp_dir={os.environ["RESULTS_DIR"]} \
+init_from_pretrained_model=tts_hifigan \
trainer.check_val_every_n_epoch=10 \
model/train_ds=train_ds_finetune \
model/validation_ds=val_ds_finetune)

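TTS Inference#

As mentioned earlier, since there are no universal standards for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained. Therefore, the NeMo toolkit provides no evaluate functionality for TTS, only infer functionality.

Generate Spectrogram and Audio#

The first step of inference is generating the spectrogram. This is a numpy array (saved as a .npy file) for a sentence, which a vocoder can convert into speech. We use the FastPitch model we just trained to generate the spectrogram.

Update the hifigan_checkpoint variable with the path to the HiFiGAN checkpoint you want to use.

Let's load the two models, FastPitch and HiFiGAN, for inference.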

from nemo.collections.tts.models import FastPitchModel, HifiGanModel

hifigan_checkpoint = FIXME
vocoder = HifiGanModel.load_from_checkpoint(hifigan_checkpoint)
vocoder = vocoder.eval().cuda()
spec_model = FastPitchModel.load_from_checkpoint(fastpitch_checkpoint)
spec_model.eval().cuda()

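Let's create a helper method that runs inference on a given string input. For multi-speaker inference, the same method can be used by passing the speaker ID as a parameter.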

import torch

def infer(spec_gen_model, vocoder_model, str_input, speaker=None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Args:
        spec_gen_model: Spectrogram generator model (FastPitch in our case)
        vocoder_model: Vocoder model (HiFiGAN in our case)
        str_input: Text input for the synthesis
        speaker: Speaker ID
    
    Returns:
        spectrogram and waveform of the synthesized audio.
    """
    with torch.no_grad():
        parsed = spec_gen_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

# Path to test manifest file (.json)
test_records_path = FIXME
test_records = list(json_reader(test_records_path))
new_speaker_id = FIXME

%matplotlib inline

for test_record in test_records:
    print("Real validation audio")
    ipd.display(ipd.Audio(test_record['audio_filepath'], rate=22050))
    # Manifest durations are stored in seconds.
    duration_secs = test_record['duration']
    if 'speaker' in test_record:
        speaker_id = test_record['speaker']
    else:
        speaker_id = new_speaker_id
    print(f"SYNTHESIZED | Duration: {duration_secs} secs | Text: {test_record['text']}")
    spec, audio = infer(spec_model, vocoder, test_record['text'], speaker=speaker_id)
    ipd.display(ipd.Audio(audio, rate=22050))
    imshow(spec, origin="lower", aspect="auto")
    plt.show()
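To keep synthesized clips for later comparison, the audio can also be written to disk. The following sketch uses only the standard library and assumes mono float samples in the range [-1.0, 1.0]; the helper name `save_wav` is hypothetical, and a short sine tone stands in for the array returned by `infer`.

```python
import array
import math
import os
import sys
import tempfile
import wave


def save_wav(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = array.array("h", (int(s * 32767) for s in clipped))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Usage: 0.1 s of a 440 Hz tone stands in for synthesized audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
out_path = os.path.join(tempfile.mkdtemp(), "synth.wav")
num_written = save_wav(out_path, tone)
print(num_written)
```

If the `audio` array carries a leading batch dimension, pass the first item (for example `audio[0]`) rather than the whole array.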

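Debug#

The data provided is only meant as a sample for understanding how fine-tuning works in NeMo. To generate better speech quality, we recommend recording at least 30 minutes of audio and increasing the number of fine-tuning steps from the current 1000 to 5000 for both models (trainer.max_steps for FastPitch and model.max_steps for HiFiGAN in the commands above).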
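When changing the step budget, it helps to know roughly how many epochs it covers, since validation in the commands above is scheduled by epoch. The sketch below is a back-of-the-envelope estimate that assumes no gradient accumulation and that the last partial batch is kept; the manifest size of 240 is a made-up example, and `steps_to_epochs` is a hypothetical helper.

```python
import math


def steps_to_epochs(max_steps, num_samples, batch_size, num_devices=1):
    """Estimate how many epochs `max_steps` optimizer steps cover."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
    return max_steps / steps_per_epoch


# Hypothetical manifest size; substitute the number of lines in your training manifest.
num_samples = 240
for steps in (1000, 5000):
    print(steps, steps_to_epochs(steps, num_samples, batch_size=24))
```

With check_val_every_n_epoch=10, dividing the estimated epoch count by 10 gives the approximate number of validation runs.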

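TTS Model Export#

You can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs!

Export to Riva#

Executing the code snippets in the cells below generates .riva model files for the spectrogram generator and vocoder models that were trained in the preceding cells. These models are required to build a complete Text-to-Speech pipeline.

Convert to Riva#

Convert the models to the .riva format. We use the encryption key nemotoriva here. Change this key when generating .riva models for production.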

hifigan_nemo_file_path = FIXME
hifigan_riva_file_path = hifigan_nemo_file_path[:-5]+".riva"
fastpitch_nemo_file_path = FIXME
fastpitch_riva_file_path = fastpitch_nemo_file_path[:-5]+".riva"

!nemo2riva --out {fastpitch_riva_file_path} --key=nemotoriva {fastpitch_nemo_file_path}
!nemo2riva --out {hifigan_riva_file_path} --key=nemotoriva {hifigan_nemo_file_path}
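Slicing off the last five characters assumes the input path really ends in `.nemo`. A slightly more defensive way to derive the output path is sketched below with `pathlib`; the helper name `riva_path_for` and the example file name are hypothetical.

```python
from pathlib import Path


def riva_path_for(nemo_file_path):
    """Return the sibling path with the .nemo suffix replaced by .riva."""
    path = Path(nemo_file_path)
    if path.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {path.name}")
    return path.with_suffix(".riva")


print(riva_path_for("results/FastPitch.nemo"))  # hypothetical file name
```

The explicit suffix check turns a silently wrong output name, for example when a .ckpt path is passed by mistake, into an immediate error.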

What's Next?#

You can use NeMo to build custom models for your own applications and deploy them to NVIDIA Riva! To try deploying these models to Riva, use tts-deploy.ipynb as a quick sample.
