自定义机器人#

选择 LLM 模型#

ACE Agent 允许您选择 LLM 模型来控制流程并生成响应。您可以在机器人配置中使用来自 OpenAI、NIM 托管或本地 NIM 模型的模型。查看 LLM 模型配置部分以获取完整的支持模型列表。

在机器人配置文件中，您可以将 main 模型类型的引擎更新为您的模型提供商，并添加您选择的模型。

models:
- type: main
    engine: openai
    model: gpt-4-turbo

以下模型已使用 Colang 2.0-beta 版本进行测试。

OpenAI 模型

gpt-3.5-turbo-instruct
gGpt-3.5-turbo
gpt-4-turbo
gpt-4o
gpt-4o-mini

NIM 模型

meta/llama3-8b-instruct
meta/llama3-70b-instruct
meta/llama-3.1-8b-instruct
meta/llama-3.1-70b-instruct

使用通过 NIM 部署的本地 LLM 模型#

在本教程中，我们将展示如何利用使用 NVIDIA NIM for Large Language Models 部署的 LLM 模型（使用 vLLM 或 TensorRT-LLM 优化），并更新示例股票机器人以使用本地部署的 LLM 模型。

您可以按照 Llama3 8B Instruct NIM 中的说明，使用 A100 或 H100 GPU 在本地部署 meta/llama3-8b-instruct 模型。

更新快速入门资源目录中 ./samples/stock_bot 中的 stock_bot_config.yaml 文件，以使用本地部署的 meta/llama3-8b-instruct LLM 模型。

models:
  - type: main
    engine: nim
    model: meta/llama3-8b-instruct
    parameters:
      stop: ["\n"]
      max_tokens: 100
      base_url: "http://0.0.0.0:8000/v1"  # Use this to use NIM model

按照快速入门指南中的步骤部署示例股票机器人。您可以在 build.nvidia.com 上探索其他用于部署的 LLM NIM。

注意

如果您正在使用 Triton 部署具有 TensorRT LLM 优化的 LLM 模型，您可能会在 NIM 部署期间看到端口冲突，ACE Agent 也使用 Triton 或 Riva Skills 服务器部署。更新了 NIM 部署命令，不使用 --host 网络，并根据需要公开 OpenAI 端口和其他端口。

创建新的自定义动作#

基于 Colang 的 ACE Agent 机器人使用动作来完成诸如意图生成、机器人消息生成、调用插件服务器等任务。ACE Agent 允许您创建自己的自定义动作，并覆盖 ACE Agent 定义的一些自定义动作。

在机器人目录中创建一个名为 actions.py 的新文件。这是一个特殊文件，在机器人启动期间初始化。此处定义的任何动作都会在 Colang 中注册，并可在 Colang 流中使用。

创建一个简单的动作，用于检查问题是否包含阻止列表中的任何单词。使用以下内容更新 actions.py。

from nemoguardrails.actions.actions import action
from typing import Dict, Any

BLOCK_LIST = ["stupid", "moron"]

@action(name="isBlockWordPresentAction")
async def check_block_list(context: Dict[str, Any] = {}):
    question = context.get("last_user_transcript")
    if any(word in question for word in BLOCK_LIST):
        return True
    return False

更新与 user queried about stocks 相关的流程，以在示例股票机器人中调用新创建的 block_word_present 自定义动作，并根据该动作的响应做出决定。

flow stock faq
    global $last_user_transcript
    user queried about stocks
    $should_block = await isBlockWordPresentAction()
    if $should_block
        bot say "Please do not use blocked words"
    else
        $retrieval_results = await RetrieveRelevantChunksAction()
        $response = ..."{$last_user_transcript}. You can take context from following section: {$retrieval_results}. Enclose the response in quotes."
        bot say "{$response}"

提出诸如 Is it stupid to invest in an IPO? 之类的问题将导致回退响应 Please do not use blocked words，而没有阻止词的问题将被接受。

同样，也可以创建可以接受来自 Colang 流的参数的自定义动作。

使用自定义 NLP 模型#

在本节中，我们将重点介绍如何部署自定义 NLP 模型。

使用 NLP 服务器部署自定义 NLP 模型#

NLP 服务器允许您轻松部署自定义 NLP 模型（例如 Hugging Face 模型），并将其与对话管道集成。它主要依赖于 @model_api 和 @pytriton 装饰器函数。

使用 @model_api 装饰器自定义模型集成#

让我们来看一个示例，说明如何为 /nlp/model/generate_embedding 端点集成自定义模型推理，该端点返回给定查询的嵌入。

熟悉 NLP 服务器 Swagger 上 /nlp/model/generate_embedding 端点的所需请求和响应架构，以用于任何现有部署。
创建一个 Python 函数，该函数可以采用与请求架构类似格式的输入，进行模型推理，并以与响应架构类似格式返回输出。
对于此示例，让我们通过返回给定查询的随机嵌入来模拟模型推理，您的 Python 函数应类似于以下内容
import numpy as np async def random_embeddings(input_request): """ Generate Random Embedding """ return {"queries": input_request.queries, "embeddings": np.random.uniform(low=0, high=1, size=(2, 768)).tolist()}
NLP 服务器公开 @model_api 装饰器，用于将推理函数映射到内部的特定 API 端点。服务器在 API 请求期间使用 model_name 和 model_version 的组合作为唯一标识符，以执行所需的推理函数。

将 @model_api 装饰器添加到随机嵌入生成推理函数。

import numpy as np
from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/generate_embedding", model_name="random_embedding", model_version="1")
async def random_embeddings(input_request):
    """
    Generate Random Embedding
    """
    return {"queries": input_request.queries,
            "embeddings": np.random.uniform(low=0, high=1, size=(2, 768)).tolist()}

启动集成了随机嵌入推理客户端的 NLP 服务器。

aceagent nlp-server deploy --custom_model_dir random_embedding.py

（可选）您可以在 model_config.yaml 文件中指定客户端模块。

model_servers:
- name: custom
    nlp_models:
        - random_embedding.py # Absolute or relative path from model_config.yaml

要使用客户端启动 NLP 服务器，请运行

aceagent nlp-server deploy --config model_config.yaml

通过在 NLP 服务器 Swagger 上使用 model_name 为 "random_embedding" 查询 /nlp/model/generate_embedding 端点或使用以下 CURL 命令来验证更改

curl -X 'POST' \
    'http://0.0.0.0:9003/nlp/model/generate_embedding' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "queries": [
        "Random query"
    ],
    "model_name": "random_embedding"
}'

使用 @pytriton 装饰器自定义模型集成#

我们可以按照上一节中的步骤，使用 @model_api 装饰器集成任何自定义模型客户端。如果我们的自定义模型客户端正在加载模型作为 @model_api 函数或 Python 模块的一部分，并且 NLP 服务器正在使用多个工作进程运行，则模型将在所有工作进程上分别加载。多次加载模型将导致更高的内存使用率和性能下降。

对于生产用例，其目的是获得更高的吞吐量，我们建议将 GPU 和繁重的处理代码卸载到模型服务器（例如 NVIDIA Triton Inference Server），并且推理客户端应该是使用异步支持的轻量级实现。诸如 NVIDIA Triton Inference Server 之类的模型服务器旨在更好地处理 GPU 和 CPU 资源，并允许我们批量并行请求。@pytriton 装饰器允许您将繁重的计算卸载到 Triton，并避免多次加载模型。

要利用 @pytriton 装饰器部署 Hugging Face 的 facebook/bart-large-mnli 模型，请执行以下步骤。

创建一个示例推理客户端。这是 Hugging Face 的 facebook/bart-large-mnli 模型所必需的。

from transformers import pipeline
CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]

input_queries = ["I love cricket", "Lets watch movie today"]
classification_result = CLASSIFIER(input_queries, LABELS)
result_labels = [res["labels"][0] for res in classification_result]
print(result_labels)

以以下格式排列您的推理客户端。我们将使用 PyTriton 在 NLP 服务器上托管模型。您可能需要通过遵循 PyTriton GitHub 存储库中的快速入门步骤来熟悉 PyTriton。

from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
...

# Creating Inference function
@batch
def infer_fn(**inputs: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    ...
    return outputs

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="<custom_model_name>",
    infer_func=infer_fn,
    inputs=[
        ...
    ],
    outputs=[
        ...
    ],
    config=ModelConfig(...)
)

# Serving model
triton.serve()

将我们的推理客户端转换为 PyTriton 兼容格式。

import numpy as np
from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor

triton = Triton()

# Model Initialization / Loading Code
from transformers import pipeline
CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]

# Creating Inference function
@batch
def infer_fn(queries: np.ndarray):
    # Inference for batch of inputs using already loaded model and returning outputs
    input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
    classification_result = CLASSIFIER(input_queries, LABELS)
    result_labels = [res["labels"][0] for res in classification_result]
    return {"labels": np.char.encode(result_labels, "utf-8")}

# Connecting inference callable with Triton Inference Server
triton.bind(
    model_name="facebook-bart-large-mnli",
    infer_func=infer_fn,
    inputs=[
        Tensor(name="queries", dtype=bytes, shape=(-1,)),
    ],
    outputs=[
        Tensor(name="labels", dtype=bytes, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=4)
)

# Serving model
triton.serve()

通过运行以下命令来测试您的代码

在本地安装 PyTriton。
pip install -U "nvidia-pytriton<0.4.0"
将代码保存在 Python 文件中并运行
python custom_model.py
您应该看到 NVIDIA Triton Inference Server 托管在 localhost:8000。您可以使用以下代码与模型交互
import numpy as np
from pytriton.client import ModelClient

with ModelClient("localhost:8000", "facebook-bart-large-mnli") as client:
    result_dict = client.infer_batch(np.char.encode([["Lets watch movie today"]], "utf-8"))

print(result_dict)

将上述代码集成到 NLP 服务器中，以便通过 @pytriton 装饰器进行更轻松和自动化的部署。即使对于多个工作进程，@pytriton 装饰器函数也仅在启动期间执行一次，因此我们不会在 GPU 内存中多次加载模型。

import numpy as np
from pytriton.triton import Triton
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from nlp_server.decorators import pytriton

# @pytriton decorator
@pytriton()
def custom_pytriton_model(triton: Triton):
    # Model Initialization / Loading Code
    from transformers import pipeline
    CLASSIFIER = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
    LABELS = ["travel","cooking","dancing","sport","music","entertainment","festival","movie","literature"]

    # Creating Inference function
    @batch
    def infer_fn(queries: np.ndarray):
        # Inference for batch of inputs using already loaded model and returning outputs
        input_queries=np.char.decode(queries.astype("bytes"), "utf-8").tolist()
        classification_result = CLASSIFIER(input_queries, LABELS)
        result_labels = [res["labels"][0] for res in classification_result]
        return {"labels": np.char.encode(result_labels, "utf-8")}

    # Connecting inference callable with Triton Inference Server
    triton.bind(
        model_name="facebook-bart-large-mnli",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="queries", dtype=bytes, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="labels", dtype=bytes, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=4)
    )

创建 @model_api 装饰器函数以覆盖您选择的 NLP 服务器 API 端点。使用此模型，我们将覆盖 /nlp/model/text_classification 端点。

import numpy as np
from pytriton.client import ModelClient
from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/text_classification", model_name="facebook-bart-large-mnli", model_type="triton")
def bart_mnli_model_api(input_request):
    # NLP Server embed model metadata in the function meta as model_info attribute, you can access url for Triton server using bart_mnli_model_api.model_info.url
    with ModelClient(
        url=f"grpc://{bart_mnli_model_api.model_info.url}",
        model_name=bart_mnli_model_api.model_info.model_name,
        model_version=bart_mnli_model_api.model_info.model_version,
    ) as client:
        result_dict = client.infer_batch(np.array([[np.char.encode(input_request.query, "utf-8")]]))
    return {"class_name": result_dict["labels"][0].decode("utf-8"), "score": 1}

将 @pytriton 和 @model_api 客户端都保存在单个 Python 文件中，然后启动 NLP 服务器。

source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --custom_model_dir <custom_model.py>

（可选）在 model_config.yaml 文件中指定客户端模块。

model_servers:
  - name: custom
    nlp_models:
      - <custom_model.py> # Absolute or relative path from model_config.yaml

要使用客户端启动 NLP 服务器，请运行

source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml run -it --workdir $PWD --build nlp-server aceagent nlp-server deploy --config model_config.yaml

通过在 NLP 服务器 Swagger 上使用 model_name 为 "facebook-bart-large-mnli" 查询 /nlp/model/text_classification 端点或使用以下 CURL 命令来验证更改

curl -X 'POST' \
  'https://:9003/nlp/model/text_classification' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "I love cricket",
  "model_name": "facebook-bart-large-mnli"
}'

使用词语提升自定义 ASR 识别#

词语提升允许您在请求时偏向 ASR 以识别感兴趣的特定词语或特定领域的词语；通过在解码声学模型的输出时给它们更高的分数。

有关词语提升的更多信息，请参阅 Riva Word Boosting 文档。

对于我们的示例机器人，让我们添加一个 speech_config.yaml 文件。

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml

创建一个词语提升文件 asr_words_to_boost.txt，其中包含要提升的词语以及提升分数。

samples/
└── stock_bot
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml

{
 "comment": "speech_context can have multiple entries. Each entry has single boost value and multiple phrases.",
 "speech_context": [
  {
  "boost": 40,
  "phrases": [
    "Nvidia"
  ]
  }
 ]
}

将此词语提升文件的路径添加到 speech_config.yaml 文件的 ASR 组件中。

riva_asr:
  RivaASR:
    word_boost_file_path: "/workspace/config/asr_words_to_boost.txt"

使用上述 ASR 自定义部署机器人。

自定义 ASR 识别以处理长停顿#

人类语音中的停顿可以发挥几个重要功能，例如允许说话者整理思路、仔细选择词语，并让听众有机会处理所说的内容。在会话语音中，研究表明，单词之间的大多数停顿都是短（0.20 秒）、中（0.60 秒）和长（超过 1 秒）的。

ASR 识别模型需要等待一段时间的静音才能检测用户音频中的话语结束 (EOU)。Riva ASR 使用 800 毫秒的静音作为默认 EOU，这意味着如果在单词之间观察到超过 800 毫秒的静音，则用户转录可能会分成多个查询。设置较高的 EOU 可能会对增加机器人响应延迟产生负面影响。根据机器人的用例，您可以更新 speech_config.yaml 中的 EOU 值，以调整静音以处理较长的停顿。

riva_asr:
   RivaASR:
       server: "localhost:50051"
       # final transcript silence threshold
       endpointing_stop_history: 800 #  End of User Utterance (EOU)

作为人类，我们会在对方说话时开始思考。为了获得人类水平的用户体验，我们需要在收到转录后立即开始处理转录，并在转录中添加更多单词时重新评估我们的响应。LLM 和 RAG 示例机器人也使用了这项技术，也称为 2 次通过话语结束 EOU。我们在第一次通过（240 毫秒）静音后开始处理用户转录，如果在第一次通过之后但在第二次通过 EOU（800 毫秒）之前检测到用户音频，则停止 LLM 和 TTS 处理，并使用新的转录重新触发管道。

riva_asr:
   RivaASR:
       server: "localhost:50051"
       # final transcript silence threshold
       endpointing_stop_history: 800 #  Second pass End of User Utterance (EOU)
       endpointing_stop_history_eou: 240 # First pass End of User Utterance (EOU)

使用 IPA 自定义 TTS 发音#

提供 IPA 发音文件使您能够使用自定义发音调整 TTS 模型，以用于某些特定领域的单词或未按预期发音的单词。模型将指定的单词使用此发音，同时合成相同音频。

要在 TTS 中使用 IPA 映射，我们需要创建一个字典文件，其中包含单词及其 IPA 发音。

在机器人配置中创建一个 IPA 字典文件 ipa.dict。

samples/
└── stock_bot
    ├── ipa.dict
    ├── asr_words_to_boost.txt
    ├── bot_config.yaml
    ├── flows.co
    └── speech_config.yaml

添加以下示例 IPA 发音，可以在此处添加任何自定义单词（单词必须为大写字母）。例如
GPU<SPACE><SPACE>'dʒi'pi'ju
将此 IPA 文件的路径添加到 speech_config.yaml 文件的 TTS 组件中。
riva_tts: RivaTTS: ipa_dict: "/workspace/config/ipa.dict"
使用上述 TTS 自定义部署机器人。

使用第三方文本转语音 (TTS) 解决方案#

ACE Agent 管道支持 Riva TTS 作为默认选项。

对于语音机器人，您可能想要自定义语音响应的语音。您可以训练自己的 TTS 模型、克隆 TTS 语音或使用任何第三方提供商。在本示例中，我们将展示如何集成 ElevenLabs 文本转语音 API。默认情况下，我们使用 NVIDIA Riva TTS 模型。

在 Quick Start 资源中的 deploy/docker/dockerfiles/nlp_server.Dockerfile 中存在的 NLP Server dockerfile 中添加所需的依赖项。
############################## # Install custom dependencies ############################## RUN pip install elevenlabs==1.4.1

注册 ElevenLabs - Generative AI Text to Speech & Voice Cloning 并获取 API 密钥。在 Quick Start 目录中的 .env 文件中添加 ElevenLabs API 密钥，变量名为 ELEVENLABS_API_KEY，以便可以将其传递到 NLP 服务器容器。

from elevenlabs.client import ElevenLabs
from elevenlabs import Voice, VoiceSettings

client = ElevenLabs(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )

audio_stream = client.generate(
        text=input_request.text,
        voice="Brian",
        model=input_request.model_name,
        stream=True,
        optimize_streaming_latency=3,
        output_format="pcm_44100"
    )

使用 @model-api 装饰器覆盖 API 端点；由于 NLP 服务器公开了 /speech/text_to_speech API 端点。请参阅 NLP 服务器自定义客户端的完整代码。

import io
import os
import functools
from elevenlabs.client import ElevenLabs
from elevenlabs import Voice
from dataclasses import dataclass
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

from nlp_server.decorators import model_api


def do_chunking(audio_stream, min_chunk_size=4410):
    buffer = b""
    for chunk in audio_stream:
        buffer += chunk
        if len(buffer) >= min_chunk_size:
            yield buffer
            buffer=b""
    if len(buffer) != 0:
        yield buffer

@dataclass
class TTSRequest:
    text: str
    voice_name: str
    model_name: str = ""
    model_version: str = ""
    language_code: str = "en-US"
    sample_rate_hz: int = 44100

@model_api(
    endpoint="/speech/text_to_speech",
    model_name=["eleven_monolingual_v1", "eleven_multilingual_v1", "eleven_multilingual_v2", "eleven_turbo_v2", "eleven_turbo_v2_5"],
)

async def eleven_tts(input_request: TTSRequest):
    client = ElevenLabs(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )

    audio_stream = client.generate(
        text=input_request.text,
        voice=input_request.voice_name,
        model=input_request.model_name,
        stream=True,
        optimize_streaming_latency=3,
        output_format="pcm_44100"
    )

    return StreamingResponse(do_chunking(audio_stream))

为了试用自定义 TTS，您可以使用 Quick Start 资源的 ./bots 目录中存在的任何示例机器人，或作为教程一部分创建的自定义机器人。将代码保存在 ./bots 目录中名为 elevenlabs_tts.py 的 Python 文件中。
samples/ └── stock_bot ├── bot_config.yaml ├── main.co └── elevenlabs_tts.py

如果您没有现有的 model_config.yaml，那么让我们创建 model_config.yaml。

samples/
└── stock_bot
    ├── bot_config.yaml
    ├── main.co
    ├── elevenlabs_tts.py
    └── model_config.yaml

添加自定义 TTS 客户端和 Riva ASR 模型。

model_servers:
    - name: riva
        speech_models:
        - nvidia/ace/rmir_asr_parakeet_1-1b_en_us_str_vad:2.17.0
        url: localhost:8001
    - name: custom
        nlp_models:
        - elevenlabs_tts.py

为了将第三方 TTS 与聊天控制器集成，我们需要在 speech_config.yaml 中为 TTS 添加一些参数。如果您尚未创建 speech_config.yaml，那么让我们在示例机器人目录中创建它。
samples/ └── stock_bot ├── bot_config.yaml ├── main.co ├── elevenlabs_tts.py └── model_config.yaml └── speech_config.yaml

在 speech_config.yaml 文件中添加以下参数。

riva_tts:
  RivaTTS:
    tts_mode: "http"
    voice_name: "Brian"
    server: "http://0.0.0.0:9003/speech/text_to_speech"
    language: "en-US"
    ipa_dict: ""
    sample_rate: 44100
    model_name: "eleven_monolingual_v1"

使用上述 TTS 自定义部署机器人。

如果尚未设置，请设置 OPENAI 密钥。

export OPENAI_API_KEY=...

export BOT_PATH="samples/stock_bot"
source deploy/docker/docker_init.sh
docker compose -f deploy/docker/docker-compose.yml up model-utils-speech
docker compose -f deploy/docker/docker-compose.yml up speech-event-bot --build