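Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions, or on features not yet available in 2.0, refer to the NeMo 24.07 documentation.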

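Checkpoints

There are two main ways to load pretrained checkpoints in NeMo, as introduced in Checkpoints:

  • Use the restore_from() method to load a local checkpoint file (.nemo), or

  • Use the from_pretrained() method to download a checkpoint from NGC and set it up.

Note that these instructions apply to loading fully trained checkpoints for evaluation or fine-tuning. To resume an unfinished training experiment, use the Experiment Manager instead by setting the resume_if_exists flag to True.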
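The resume setting mentioned above can be sketched as a small configuration for the Experiment Manager. This is a minimal sketch rather than the canonical setup: build_exp_manager_cfg is a hypothetical helper, the exp_dir and name values are placeholders, and resume_ignore_no_checkpoint is an extra Experiment Manager option included on the assumption that the very first run should start from scratch instead of failing when no checkpoint exists yet.

```python
def build_exp_manager_cfg(exp_dir, name):
    """Return Experiment Manager settings that resume an unfinished run."""
    return {
        "exp_dir": exp_dir,                   # where logs and checkpoints are written
        "name": name,                         # experiment name
        "resume_if_exists": True,             # pick up the latest checkpoint if one exists
        "resume_ignore_no_checkpoint": True,  # assumption: do not fail on the first run
    }


# Hypothetical placeholder values for illustration only.
cfg = build_exp_manager_cfg("./nemo_experiments", "fastpitch_finetune")

# With NeMo installed and a configured trainer, the settings would typically be applied as:
#   from nemo.utils.exp_manager import exp_manager
#   exp_manager(trainer, cfg)
```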

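Local Checkpoints

  • Save model checkpoints: NeMo automatically saves the final model checkpoint with the .nemo suffix. You can also save any model checkpoint manually with model.save_to(<checkpoint_path>.nemo).

  • Load model checkpoints: To load a checkpoint saved at <path/to/checkpoint/file.nemo>, use the restore_from() method as shown below, where <MODEL_BASE_CLASS> is the TTS model class of the original checkpoint.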

import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
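A .nemo file is, to the best of our understanding, a tar archive that bundles the model configuration with its weights, so it can be inspected before it is restored. The sketch below validates a checkpoint path and lists the archive members using only the standard library. check_nemo_path and list_nemo_members are hypothetical helper names, not part of NeMo.

```python
import tarfile
from pathlib import Path


def check_nemo_path(path):
    """Return the path as a Path object if it looks like an existing .nemo file."""
    p = Path(path)
    if p.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {p.name}")
    if not p.is_file():
        raise FileNotFoundError(str(p))
    return p


def list_nemo_members(path):
    """List the file names stored inside a .nemo archive (assumed to be a tar file)."""
    p = check_nemo_path(path)
    # "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(p, "r:*") as archive:
        return sorted(archive.getnames())
```

Calling list_nemo_members("<path/to/checkpoint/file.nemo>") before restore_from() is a cheap way to catch a truncated download or a wrong file type early.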

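NGC Pretrained Checkpoints

The NGC NeMo Text to Speech collection aggregates model cards that contain detailed information about checkpoints of various models trained on various datasets. The tables in Checkpoints list part of the TTS models available from NGC, including speech/text aligners, acoustic models, and vocoders.

Load Model Checkpoints

These models can be accessed through the from_pretrained() method of a TTS model class. In general, you can load any of them with code in the following format: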

import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.from_pretrained(model_name="<MODEL_NAME>")

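where <MODEL_NAME> is the value in the Model Name column of the tables in Checkpoints. These names are predefined in each model's list_available_models() member function. For example, the available NGC FastPitch model names can be found as follows: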

In [1]: import nemo.collections.tts as nemo_tts

In [2]: nemo_tts.models.FastPitchModel.list_available_models()
Out[2]:
[PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch,
    description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is ARPABET-based.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch_ipa,
    description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is IPA-based.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch_multispeaker,
    description=This model is trained on HiFITTS sampled at 44100Hz with and can be used to generate male and female English voices with an American accent.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_de_fastpitch_singlespeaker,
    description=This model is trained on a single male speaker data in OpenSLR Neutral German Dataset sampled at 22050Hz and can be used to generate male German voices.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.10.0/files/tts_de_fastpitch_align.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_de_fastpitch_multispeaker_5,
    description=This model is trained on 5 speakers in HUI-Audio-Corpus-German clean subset sampled at 44100Hz with and can be used to generate male and female German voices.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 )]

From the key-value pair pretrained_model_name=tts_en_fastpitch above, you can take the model name tts_en_fastpitch and load it by running:

model = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")

To list the models available for a specific base class programmatically, use the list_available_models() method:

nemo_tts.models.<MODEL_BASE_CLASS>.list_available_models()
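Because each entry returned by list_available_models() exposes fields such as pretrained_model_name and location (as seen in the output above), the list can be filtered with ordinary Python. The sketch below uses a stand-in ModelInfo tuple so that it runs without NeMo. ModelInfo and find_models are hypothetical names, and the assumption is that real PretrainedModelInfo objects expose the same attributes.

```python
from collections import namedtuple

# Stand-in for NeMo's PretrainedModelInfo, limited to the two fields used here.
ModelInfo = namedtuple("ModelInfo", ["pretrained_model_name", "location"])


def find_models(infos, substring):
    """Return the names of all models whose name contains the given substring."""
    return [i.pretrained_model_name for i in infos if substring in i.pretrained_model_name]


# Entries copied from the FastPitch listing above.
sample = [
    ModelInfo(
        "tts_en_fastpitch",
        "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo",
    ),
    ModelInfo(
        "tts_en_fastpitch_ipa",
        "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo",
    ),
    ModelInfo(
        "tts_de_fastpitch_singlespeaker",
        "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.10.0/files/tts_de_fastpitch_align.nemo",
    ),
]

english = find_models(sample, "tts_en_")
print(english)
```

With NeMo installed, the result of nemo_tts.models.FastPitchModel.list_available_models() could be passed in place of sample.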

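Inference and Audio Generation

NeMo TTS supports both cascaded and end-to-end models for synthesizing audio. Most of the steps are the same, except that a cascaded model needs an additional vocoder model to be loaded before it can generate audio. The following code snippet demonstrates the steps for generating an audio sample from text input with the cascaded FastPitch and HiFiGAN models. For the detailed implementation of the model classes, refer to the NeMo TTS Collection API.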

import soundfile as sf
import nemo.collections.tts as nemo_tts

# Load the mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")
# Load the vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan")

# Generate audio: text -> tokens -> mel spectrogram -> waveform
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav.
# The output is batched, so write the first (and only) waveform in the batch.
# Both checkpoints used here are 22050Hz models.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
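The snippet above writes the file at 22050Hz because both tts_en_fastpitch and tts_en_hifigan are 22050Hz models. When other checkpoints are combined, the spectrogram generator and the vocoder should agree on the sampling rate, and the same rate should be used when writing the file. A minimal sketch of that check follows, with rates copied from the tables in NGC TTS Models. SAMPLE_RATES and paired_sample_rate are hypothetical names, not NeMo APIs.

```python
# Sampling rates in Hz, copied from the model tables in this document.
SAMPLE_RATES = {
    "tts_en_fastpitch": 22050,
    "tts_en_hifigan": 22050,
    "tts_en_fastpitch_multispeaker": 44100,
    "tts_en_hifitts_hifigan_ft_fastpitch": 44100,
}


def paired_sample_rate(spec_generator_name, vocoder_name):
    """Return the shared sampling rate, or raise if the two models disagree."""
    gen_rate = SAMPLE_RATES[spec_generator_name]
    voc_rate = SAMPLE_RATES[vocoder_name]
    if gen_rate != voc_rate:
        raise ValueError(
            f"{spec_generator_name} ({gen_rate}Hz) does not match "
            f"{vocoder_name} ({voc_rate}Hz)"
        )
    return gen_rate


rate = paired_sample_rate("tts_en_fastpitch", "tts_en_hifigan")
print(rate)  # 22050
```

The returned value can replace the hardcoded 22050 in the sf.write() call.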

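Fine-Tuning on Different Datasets

Multiple TTS tutorials are provided in the tutorials/tts/ directory. Most of them demonstrate how to instantiate a pretrained model and how to prepare it for fine-tuning on a dataset in the same or a different language, with the same or a different speaker.

NGC TTS Models

This section lists all NeMo TTS models released in the NGC NeMo Text to Speech collection. You can download the model checkpoints you are interested in using either of the following: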

  • wget '<CHECKPOINT_URL_IN_THE_TABLE>'

  • curl -LO '<CHECKPOINT_URL_IN_THE_TABLE>'
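The same download can be done from Python with the standard library. The sketch below derives the local file name from the last path segment of the checkpoint URL and fetches the file with urllib.request.urlretrieve. checkpoint_filename and download_checkpoint are hypothetical helper names, and error handling and retries are left out.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def checkpoint_filename(url):
    """Return the file name at the end of a checkpoint URL."""
    return posixpath.basename(urlparse(url).path)


def download_checkpoint(url, dest_dir="."):
    """Download a checkpoint into dest_dir and return the local path."""
    target = os.path.join(dest_dir, checkpoint_filename(url))
    urllib.request.urlretrieve(url, target)
    return target


# URL copied from the tts_en_fastpitch row of the model tables.
url = "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo"
print(checkpoint_filename(url))  # tts_en_fastpitch_align.nemo
```

The path returned by download_checkpoint(url) can then be passed to restore_from() as described in Local Checkpoints.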

Speech/Text Aligners

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_radtts_aligner | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.aligner.AlignerModel | tts_en_radtts_aligner | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_radtts_aligner/versions/ARPABET_1.11.0/files/Aligner.nemo |
| en-US | tts_en_radtts_aligner_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.aligner.AlignerModel | tts_en_radtts_aligner | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_radtts_aligner/versions/IPA_1.13.0/files/Aligner.nemo |

Mel-Spectrogram Generators

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Symbols | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_fastpitch | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_fastpitch | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo |
| en-US | tts_en_fastpitch_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_fastpitch | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo |
| en-US | tts_en_fastpitch_multispeaker | HiFiTTS | 44100Hz | 10 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo |
| en-US | tts_en_lj_mixertts | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | tts_en_lj_mixertts | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_mixertts/versions/1.6.0/files/tts_en_lj_mixertts.nemo |
| en-US | tts_en_lj_mixerttsx | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | tts_en_lj_mixerttsx | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_mixerttsx/versions/1.6.0/files/tts_en_lj_mixerttsx.nemo |
| en-US | RAD-TTS | TBD | TBD | TBD | ARPABET | nemo.collections.tts.models.radtts.RadTTSModel | TBD | |
| en-US | tts_en_tacotron2 | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.tacotron2.Tacotron2Model | tts_en_tacotron2 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_tacotron2/versions/1.10.0/files/tts_en_tacotron2.nemo |
| de-DE | tts_de_fastpitch_multispeaker_5 | HUI Audio Corpus German | 44100Hz | 5 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitch_multispeaker_5 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2102 | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_fastpitch_thorstens2102.nemo |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2210 | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_fastpitch_thorstens2210.nemo |
| es | tts_es_fastpitch_multispeaker | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_es_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_es_multispeaker_fastpitchhifigan/versions/1.15.0/files/tts_es_fastpitch_multispeaker.nemo |
| zh-CN | tts_zh_fastpitch_sfspeech | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | pinyin | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_zh_fastpitch_hifigan_sfspeech | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_fastpitch_sfspeech.nemo |

Vocoders

| Locale | Model Name | Spectrogram Generator | Dataset | Sampling Rate | #Spk | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_hifigan | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo |
| en-US | tts_en_lj_hifigan_ft_mixertts | Mixer-TTS | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_lj_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_hifigan/versions/1.6.0/files/tts_en_lj_hifigan_ft_mixertts.nemo |
| en-US | tts_en_lj_hifigan_ft_mixerttsx | Mixer-TTS-X | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_lj_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_hifigan/versions/1.6.0/files/tts_en_lj_hifigan_ft_mixerttsx.nemo |
| en-US | tts_en_hifitts_hifigan_ft_fastpitch | FastPitch | HiFiTTS | 44100Hz | 10 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_hifitts_hifigan_ft_fastpitch.nemo |
| en-US | tts_en_lj_univnet | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | tts_en_lj_univnet | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_univnet/versions/1.7.0/files/tts_en_lj_univnet.nemo |
| en-US | tts_en_libritts_univnet | librosa.filters.mel | LibriTTS | 24000Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | tts_en_libritts_univnet | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_libritts_univnet/versions/1.7.0/files/tts_en_libritts_multispeaker_univnet.nemo |
| en-US | tts_en_waveglow_88m | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.waveglow.WaveGlowModel | tts_en_waveglow_88m | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_waveglow_88m/versions/1.0.0/files/tts_waveglow.nemo |
| de-DE | tts_de_hui_hifigan_ft_fastpitch_multispeaker_5 | FastPitch | HUI Audio Corpus German | 44100Hz | 5 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitch_multispeaker_5 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_hui_hifigan_ft_fastpitch_multispeaker_5.nemo |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2102 | FastPitch | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_hifigan_thorstens2102.nemo |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2210 | FastPitch | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_hifigan_thorstens2210.nemo |
| es | tts_es_hifigan_ft_fastpitch_multispeaker | FastPitch | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_es_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_es_multispeaker_fastpitchhifigan/versions/1.15.0/files/tts_es_hifigan_ft_fastpitch_multispeaker.nemo |
| zh-CN | tts_zh_hifigan_sfspeech | FastPitch | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_zh_fastpitch_hifigan_sfspeech | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_hifigan_sfspeech.nemo |

End-to-End Models

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_lj_vits | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.vits.VitsModel | tts_en_lj_vits | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_vits/versions/1.13.0/files/vits_ljspeech_fp16_full.nemo |
| en-US | tts_en_hifitts_vits | HiFiTTS | 44100Hz | 10 | IPA | nemo.collections.tts.models.vits.VitsModel | tts_en_hifitts_vits | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_hifitts_vits/versions/r1.15.0/files/vits_en_hifitts.nemo |

Codec Models

| Model Name | Dataset | Sampling Rate | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|
| audio_codec_16khz_small | Libri-Light | 16000Hz | nemo.collections.tts.models.AudioCodecModel | audio_codec_16khz_small | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/audio_codec_16khz_small/versions/v1/files/audio_codec_16khz_small.nemo |
| mel_codec_22khz_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_22khz_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_22khz_medium/versions/v1/files/mel_codec_22khz_medium.nemo |
| mel_codec_44khz_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_44khz_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_44khz_medium/versions/v1/files/mel_codec_44khz_medium.nemo |
| mel_codec_22khz_fullband_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_22khz_fullband_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_22khz_fullband_medium/versions/v1/files/mel_codec_22khz_fullband_medium.nemo |
| mel_codec_44khz_fullband_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_44khz_fullband_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_44khz_fullband_medium/versions/v1/files/mel_codec_44khz_fullband_medium.nemo |
| nvidia/low-frame-rate-speech-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | audio_codec_low_frame_rate_22khz | http://hugging-face.cn/nvidia/low-frame-rate-speech-codec-22khz/resolve/main/low-frame-rate-speech-codec-22khz.nemo |
| nvidia/audio-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | audio-codec-22khz | http://hugging-face.cn/nvidia/audio-codec-22khz/resolve/main/audio-codec-22khz.nemo |
| nvidia/audio-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | audio-codec-44khz | http://hugging-face.cn/nvidia/audio-codec-44khz/resolve/main/audio-codec-44khz.nemo |
| nvidia/mel-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel-codec-22khz | http://hugging-face.cn/nvidia/mel-codec-22khz/resolve/main/mel-codec-22khz.nemo |
| nvidia/mel-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel-codec-44khz | http://hugging-face.cn/nvidia/mel-codec-44khz/resolve/main/mel-codec-44khz.nemo |