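Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions, or on features not yet available in 2.0, see the NeMo 24.07 documentation.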

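Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR), also known as Speech To Text (STT), refers to the problem of automatically transcribing spoken language. You can use NeMo to transcribe speech in more than 14 languages using open-source pretrained models, or train your own ASR model.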

Transcribe speech with 3 lines of code

After installing NeMo, you can transcribe an audio file as follows:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
transcript = asr_model.transcribe(["path/to/audio_file.wav"])

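Obtain timestamps

NeMo ASR models can also return character (token), word, or segment timestamps.

Currently, timestamps are supported for Parakeet models with all decoder types (CTC/RNNT/TDT). Support for AED models will be added soon.

There are two ways to obtain timestamps: (1) pass the timestamps=True flag to the transcribe method, or (2) for finer control, update the decoding configuration to specify the timestamp type (char, word, segment) and set the segment separators or word separator used for segment- and word-level timestamps.

With the timestamps=True flag, you can get timestamps for each character in the transcription as follows: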

# import nemo_asr and instantiate asr_model as above
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-110m")

# specify flag `timestamps=True`
hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], timestamps=True)

# by default, timestamps are enabled for char, word and segment level
word_timestamps = hypotheses[0][0].timestep['word'] # word level timestamps for first sample
segment_timestamps = hypotheses[0][0].timestep['segment'] # segment level timestamps
char_timestamps = hypotheses[0][0].timestep['char'] # char level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

# segment level timestamps (if model supports Punctuation and Capitalization, segment level timestamps are displayed based on punctuation otherwise complete transcription is considered as a single segment)
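Segment timestamps of this shape map naturally onto subtitle files. The following is a minimal sketch, not part of NeMo, that turns a list of dicts with 'start', 'end' (in seconds) and 'segment' keys, as printed in the loop above, into SRT subtitle text. The helper names `to_srt_time` and `segments_to_srt` and the sample values are hypothetical.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segment_timestamps) -> str:
    """Build SRT text from dicts with 'start', 'end' and 'segment' keys."""
    blocks = []
    for index, stamp in enumerate(segment_timestamps, start=1):
        start = to_srt_time(stamp["start"])
        end = to_srt_time(stamp["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{stamp['segment']}\n")
    return "\n".join(blocks)


# Hypothetical values standing in for real model output.
example_segments = [
    {"start": 0.0, "end": 1.52, "segment": "Hello there."},
    {"start": 1.6, "end": 3.04, "segment": "How are you?"},
]
print(segments_to_srt(example_segments))
```

In practice you would pass `segment_timestamps` from the example above instead of `example_segments`.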

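For finer control over timestamps, you can update the decoding configuration to specify the timestamp type (char, word, segment) and set the segment separators or word separator for segment- and word-level timestamps, as follows: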

# import nemo_asr and instantiate asr_model as above
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

# update decoding config to preserve alignments and compute timestamps
# if necessary, also update the segment separators or word separator for segment- and word-level timestamps
# (the config keys below are spelled `segment_seperators` and `word_seperator`)
from omegaconf import OmegaConf, open_dict
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.preserve_alignments = True
    decoding_cfg.compute_timestamps = True
    decoding_cfg.segment_seperators = [".", "?", "!"]
    decoding_cfg.word_seperator = " "
    asr_model.change_decoding_strategy(decoding_cfg)

# specify flag `return_hypotheses=True`
hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], return_hypotheses=True)

# if hypotheses form a tuple (from RNNT), extract just the "best" hypotheses
if isinstance(hypotheses, tuple) and len(hypotheses) == 2:
    hypotheses = hypotheses[0]

timestamp_dict = hypotheses[0].timestep # extract timesteps from hypothesis of first (and only) audio file
print("Hypothesis contains following timestep information :", list(timestamp_dict.keys()))

# For a FastConformer model, you can display the word timestamps as follows:
# 80ms is duration of a timestep at output of the Conformer
time_stride = 8 * asr_model.cfg.preprocessor.window_stride

word_timestamps = timestamp_dict['word']
segment_timestamps = timestamp_dict['segment']

for stamp in word_timestamps:
    start = stamp['start_offset'] * time_stride
    end = stamp['end_offset'] * time_stride
    word = stamp['char'] if 'char' in stamp else stamp['word']

    print(f"Time : {start:0.2f} - {end:0.2f} - {word}")

for stamp in segment_timestamps:
    start = stamp['start_offset'] * time_stride
    end = stamp['end_offset'] * time_stride
    segment = stamp['segment']

    print(f"Time : {start:0.2f} - {end:0.2f} - {segment}")
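The two loops above repeat the same offset-to-seconds conversion. As a sketch of that arithmetic in isolation, the helper below multiplies each offset by the timestep duration and returns plain tuples. The function name `offsets_to_seconds` and the sample offsets are hypothetical, and the 10 ms window stride is an assumption chosen to match the 80 ms timestep mentioned in the comments above.

```python
def offsets_to_seconds(stamps, time_stride, text_key):
    """Return (start_s, end_s, text) tuples from offset-based timestamp dicts."""
    rows = []
    for stamp in stamps:
        start = stamp["start_offset"] * time_stride
        end = stamp["end_offset"] * time_stride
        rows.append((start, end, stamp[text_key]))
    return rows


# Assumed value: a 10 ms window stride with 8x subsampling gives 80 ms per timestep.
window_stride = 0.01
time_stride = 8 * window_stride

# Hypothetical offsets standing in for timestamp_dict['word'].
example_words = [
    {"word": "hello", "start_offset": 5, "end_offset": 9},
    {"word": "world", "start_offset": 11, "end_offset": 16},
]

for start, end, word in offsets_to_seconds(example_words, time_stride, "word"):
    print(f"{start:0.2f} - {end:0.2f} - {word}")
```

The same helper covers segment timestamps by passing "segment" as `text_key`.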

Transcribe speech via the command line

You can also transcribe speech from the command line using the following script, for example:

python <path_to_NeMo>/examples/asr/transcribe_speech.py \
    pretrained_name="stt_en_fastconformer_transducer_large" \
    audio_dir=<path_to_audio_dir> # path to dir containing audio files to transcribe

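The script saves all transcriptions to a JSONL file in which each line corresponds to one audio file in <audio_dir>. This file follows the format NeMo commonly uses for saving model predictions and for storing input data for training and evaluation. You can learn more about the format NeMo uses for these files (which we call "manifest files") here.

You can also list the files to transcribe in a manifest file and pass it with the argument dataset_manifest=<path to manifest specifying audio files to transcribe> instead of audio_dir.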
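As an illustration of the one-JSON-object-per-line layout, here is a minimal sketch that writes a manifest for a list of WAV files and reads a JSONL file back. The field names `audio_filepath` and `duration` are assumptions not stated on this page, so check the manifest format documentation before relying on them. The helper names and the generated `silence.wav` file are hypothetical.

```python
import json
import tempfile
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Read the duration of a PCM WAV file using only the standard library."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, one line per audio file."""
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        for audio_path in audio_paths:
            entry = {
                "audio_filepath": str(audio_path),  # assumed field name
                "duration": wav_duration_seconds(audio_path),  # assumed field name
            }
            manifest.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo: create one second of 16 kHz mono silence, then write and read a manifest.
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = Path(tmp_dir) / "silence.wav"
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)
    manifest_path = Path(tmp_dir) / "manifest.json"
    write_manifest([wav_path], manifest_path)
    entries = read_jsonl(manifest_path)

print(entries)
```

The `read_jsonl` helper makes no assumption about field names, so it can also be used to inspect the JSONL file that the transcription script produces.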

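Incorporate a language model (LM) to improve ASR transcriptions

Using a language model to help choose words that are more likely to be spoken in a sentence often improves transcription accuracy.

Even a simple N-gram LM can give a good improvement in transcription accuracy.

After training an N-gram LM, you can use it to transcribe audio as follows:

Install the OpenSeq2Seq beam search decoding and KenLM libraries using the install_beamsearch_decoders script.
Perform transcription using the eval_beamsearch_ngram script: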

python eval_beamsearch_ngram.py nemo_model_file=<path to the .nemo file of the model> \
    input_manifest=<path to the evaluation JSON manifest file> \
    kenlm_model_file=<path to the binary KenLM model> \
    beam_width=[<list of the beam widths, separated with commas>] \
    beam_alpha=[<list of the beam alphas, separated with commas>] \
    beam_beta=[<list of the beam betas, separated with commas>] \
    preds_output_folder=<optional folder to store the predictions> \
    probs_cache_file=null \
    decoding_mode=beamsearch_ngram \
    decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"

See more information about LM decoding here.
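Because beam_width, beam_alpha, and beam_beta each accept a list, the number of decoding runs can grow quickly. The sketch below enumerates every combination of the three lists, under the assumption (not stated on this page) that the script evaluates each combination. The function name `beam_search_grid` and the sample values are hypothetical.

```python
from itertools import product


def beam_search_grid(beam_widths, beam_alphas, beam_betas):
    """Enumerate every (beam_width, beam_alpha, beam_beta) combination."""
    return [
        {"beam_width": width, "beam_alpha": alpha, "beam_beta": beta}
        for width, alpha, beta in product(beam_widths, beam_alphas, beam_betas)
    ]


# Hypothetical values chosen only to illustrate the grid size.
grid = beam_search_grid([64, 128], [0.5, 1.0], [0.0, 1.5])
print(f"{len(grid)} combinations")
for params in grid[:3]:
    print(params)
```

Counting the combinations up front is a cheap way to estimate how long a sweep will take before launching it.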

Use real-time transcription

NeMo can transcribe speech in real time. We provide tutorial notebooks for Cache Aware Streaming and Buffered Streaming.

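Try different ASR models

NeMo offers a variety of open-source pretrained ASR models that differ in model architecture:

encoder architecture (FastConformer, Conformer, Citrinet, etc.),
decoder architecture (Transducer, CTC, and a hybrid of the two),
model size (small, medium, large, etc.).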

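Pretrained models also differ in:

language (English, Spanish, etc., including some multilingual and code-switching models),
whether the output text contains punctuation and capitalization.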

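NeMo ASR checkpoints can be found on HuggingFace or NGC. All models released by the NeMo team can be found on NGC, and some of them are also available on HuggingFace.

All NeMo ASR checkpoints open-sourced by the NeMo team follow this naming convention: stt_{language}_{encoder name}_{decoder name}_{model size}{_optional descriptor}.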
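To make the naming convention concrete, here is a minimal sketch that splits a checkpoint name into its components. `CheckpointName` and `parse_checkpoint_name` are hypothetical helpers, not part of NeMo, and the sketch assumes that no component other than the optional descriptor contains an underscore, which may not hold for every published name.

```python
from typing import NamedTuple, Optional


class CheckpointName(NamedTuple):
    language: str
    encoder: str
    decoder: str
    size: str
    descriptor: Optional[str]


def parse_checkpoint_name(name: str) -> CheckpointName:
    """Split a name of the form stt_{language}_{encoder}_{decoder}_{size}{_descriptor}."""
    parts = name.split("_")
    if len(parts) < 5 or parts[0] != "stt":
        raise ValueError(f"{name!r} does not follow the stt_* naming convention")
    # Anything after the size is treated as the optional descriptor.
    descriptor = "_".join(parts[5:]) or None
    return CheckpointName(parts[1], parts[2], parts[3], parts[4], descriptor)


print(parse_checkpoint_name("stt_en_fastconformer_transducer_large"))
```

Names that do not start with stt_, such as those carrying a HuggingFace user prefix, raise a ValueError in this sketch rather than being guessed at.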

You can load a checkpoint automatically using the ASRModel.from_pretrained() class method, for example:

import nemo.collections.asr as nemo_asr
# model will be fetched from NGC
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
# if model name is prepended with "nvidia/", the model will be fetched from huggingface
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_fastconformer_transducer_large")
# you can also load open-sourced NeMo models released by other HF users using:
# asr_model = nemo_asr.models.ASRModel.from_pretrained("<HF username>/<model name>")

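For more documentation on loading checkpoints, the full list of models, and their benchmark scores, see here.

More information about ASR model architectures in NeMo is also available here.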

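Try NeMo ASR transcription in your browser

You can try transcription with a NeMo ASR model without leaving your browser by using the HuggingFace Space embedded below.

This HuggingFace Space uses Canary-1B, the latest ASR model from NVIDIA NeMo. At the time of its release, it topped the HuggingFace OpenASR Leaderboard.

Canary-1B is a multilingual, multitask model that supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish), as well as translation between English and the other 3 supported languages.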

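ASR tutorial notebooks

Hands-on speech recognition tutorial notebooks can be found under the ASR tutorials folder. If you are new to NeMo, consider trying the ASR with NeMo tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the notebook's GitHub page in Colab.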

ASR model configuration

For documentation on the model-specific configuration files for nemo_asr, see the Configuration Files section.

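Preparing ASR datasets

NeMo includes preprocessing scripts for several common ASR datasets. The Datasets section contains instructions for running these scripts. It also includes a guide to creating your own NeMo-compatible dataset if you have your own data.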

NeMo ASR documentation

For more information, see the other sections of the ASR documentation in the left-hand menu or in the list below.