Troubleshooting

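Installing the ACE Agent Wheel in Python 3.11

Q: When I try to install the aceagent wheel on Python 3.11, the installation fails with an error about building the wheel for annoy. How do I fix this?

A: If the wheel for annoy cannot be built, install the Python 3.11 development package with the following commands and recreate the virtual environment.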

sudo apt-get install python3.11-dev
python3.11 -m venv ace
source ace/bin/activate

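Installing the ACE Agent Wheel in Python 3.12

Q: How do I create a virtual environment with Python 3.12 to install the aceagent wheel?

A: Use the following commands to create a virtual environment with Python 3.12.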

sudo apt-get install python3.12-distutils python3.12-dev python3.12-venv

python3.12 -m ensurepip --upgrade
python3.12 -m venv ace
source ace/bin/activate

Building the Helm Chart Using UCS Tools

Q: Why do I get the following error when building an application with UCS tools?

AppBuilder - ERROR - Failed to find a suitable version for microservice dependency 'ucf.svc.riva.speech-skills' that matches ...

A: Generate an NGC_PERSONAL_KEY (or NVIDIA API key) with the org set to nv-ucf, then run:

export NGC_PERSONAL_KEY=nvapi-...

ucf_app_builder_cli registry repo set-api-key -a "${NGC_PERSONAL_KEY}"

Then try building the application with UCS tools again.

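Web UI Is Stuck on the “connecting to server…” Message

Q: Why is the Web UI stuck on the message connecting to server…?

A: The first time the Web UI runs, it precomputes cached values. This can take up to 5 minutes; once it finishes, the Web UI should load within a few seconds. If you see this message, wait a few minutes and try again. If the problem persists, make sure the UI server is running and, if you are accessing the UI from a remote browser, that port 7007 is not blocked by a firewall.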
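To tell a blocked port apart from a UI that is still warming up, a quick TCP probe from the machine that runs the browser can help. The following is a minimal Python sketch, not part of ACE Agent; the function name is_port_open is hypothetical, and the host you pass in is whatever address you type into the browser.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, is_port_open("<your-ip>", 7007) returning False suggests that the UI server is down or that something between the browser and the server is dropping the connection. A True result only shows that the port accepts connections, so the UI may still be precomputing its cache.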

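Bot Responds with “I have encountered some technical issue!”

Q: Why does the bot respond with I have encountered some technical issue!?

A: Most sample bots and tutorials use this message as a fallback when a Colang error occurs or an undefined Colang flow is started. Check the logs directory created in the current working directory. Common causes include:

Syntax errors in the Colang files.

Errors in Action or Plugin calls.

For the LLM sample bots, if the local LLM is not up or the API key is incorrect, ExternalLLMAction fails and the Colang flow returns this fallback message.

If an external API call fails and you want to show a different error message, use an if/else statement to check whether the Action call succeeded.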
$price = await InvokeFulfillmentAction(request_type="get", endpoint="/stock/get_stock_price", company_name=$company_name)
if not $price
    bot say "Could not find the stock price!"
else
    bot say "Stock price of {$company_name} is {$price}"
To change the error message for all failures, edit main.co and update the message in the following flow activations.
activate notification of undefined flow start "I have encountered some technical issue!"
activate notification of colang errors "I have encountered some technical issue!"

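Why Am I Not Getting Any Bot Response

Q: Why is there no bot response in either text or voice modality?

A: If no bot response is received, one of the following is likely the cause.

The Chat Engine container is not up. Check the chat-engine container logs for the exit reason. Most likely this indicates a problem in the bot configuration.

If you did not provide the correct bot configuration path through BOT_PATH or the UCS application parameters, you might observe the following error: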
No bot config files found in the provided directory! Please provide a yaml file with name "bot: <bot_name>"
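The Chat Engine has not received any user query. Check the chat-controller logs to confirm whether an ASR request was received. Check the Riva Skills logs for errors. Make sure the Riva Speech server is up and the ASR models are deployed.

If you get text responses but no voice response, check that the chat-controller container is up and that its logs show no errors. Check that the configured TTS provider is working and reports no errors.

Check the Redis UMIM bus for the specific stream_id to confirm whether the user utterance and the bot utterance were received.

If the logs show no errors and no bot utterance arrives in Redis, check whether the Colang bot configuration is expected to return a response for the given query, and review the NeMo Guardrails logs in the log directory of the current working directory.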

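Bot Responds with “Error in getting response from Chat Engine”

Q: Why do I get the bot response Error in getting response from Chat Engine?

A: This error is observed in the Chat Engine Server architecture and the Plugin Server architecture when the chat-controller microservice cannot call the Chat Engine or Plugin Server endpoint, or when the request fails because of a timeout. Check that the dialog_manager server URL is configured correctly in the speech_config.yaml file, and confirm that the Chat Engine and Plugin Server endpoints are up and reachable.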

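Unable to Deploy the Local NIM LLM Model

Q: Why does my local LLM deployment container nemollm-inference-microservice exit?

A: Check the logs of the NIM container. If other processes are already running on the target GPU device during startup, you might observe an error. Running Riva Speech models on the same GPU device may work if they are started after the NIM container is up, but running other models on the same GPU device is not recommended, to avoid OOM errors.

If NGC_CLI_API_KEY is not set correctly, you might see the error message Error: NGC_API_KEY is not set in the log file.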
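Before starting the containers, it can save a restart to confirm that the key variables are actually present in the shell that launches them. The sketch below is a hedged Python illustration; the helper name missing_keys is hypothetical, and checking both NGC_CLI_API_KEY and NGC_API_KEY is an assumption based only on the two names that appear in the text above.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_keys(names: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or blank in env (default: os.environ)."""
    env = os.environ if env is None else env
    # A value made only of whitespace is treated the same as an unset one.
    return [name for name in names if not env.get(name, "").strip()]
```

Calling missing_keys(["NGC_CLI_API_KEY"]) in the same shell session that runs the deployment returns an empty list when the variable is set. It does not verify that the key itself is valid.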

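RAG Server Is Not Responding Correctly

Q: Why do I get the bot response No response generated from LLM, make sure your query is relavent to the ingested document. from the RAG bot?

A: Make sure you have uploaded the relevant documents to the RAG server, either through the Playground at http://<your-ip>:3001/kb or with the document ingestion available from http://<your-ip>:8081/docs.

If you observe errors while uploading documents, review the logs of the rag-application-text-chatbot-langchain container. Commonly observed errors include:

Hosted LLM and embedding models are used, but a correct NVIDIA_API_KEY is not set.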

ERROR:example:Failed to ingest document due to exception [401] Unauthorized
invalid response from UAM
Please check or regenerate your API key.
ERROR:RAG.src.chain_server.server:Error from POST /documents endpoint. Ingestion of file: /tmp/gradio/cf9b8b2a9072611545f0b2dc20e454edb82650bf2bfc93e8567d803dcc0e49b7/2022 Delta Dental FAQs.pdf failed with error: Failed to upload document. Please upload an unstructured text document.
INFO:     172.21.0.6:40894 - "POST /documents HTTP/1.1" 500 Internal Server Error

Local embedding and NIM LLM microservices are used, but the services are not ready or have exited.

requests.exceptions.ConnectionError: HTTPConnectionPool(host='nemollm-embedding', port=8000): Max retries exceeded with url: /v1/models (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x742c0280ccd0>: Failed to resolve 'nemollm-embedding' ([Errno -3] Temporary failure in name resolution)"))

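Bot Responds with “Sorry I could not connect to the RAG endpoint”

Q: Why do I get the bot response Sorry I could not connect to the RAG endpoint from the RAG bot?

A: This error is observed when the RAG server is not up, the URL is configured incorrectly, or the RAG server returns an empty response. Check the chat-engine-event-speech container logs for the following error: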

  File "/usr/local/lib/python3.10/dist-packages/chat_engine/policies/actions/colang2_actions.py", line 149, in perform_fulfillment_call
    logger.warning("Could not connect to fulfillment endpoint=%s. Error %e", url, e)
Message: 'Could not connect to fulfillment endpoint=%s. Error %e'
Arguments: ('https://:9002/rag/chat', ClientPayloadError("Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>"))

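Review the plugin-server container logs and the plugin_config.yaml file to make sure the correct RAG_SERVER_URL is configured. If the RAG server is unreachable or not up, you will observe the following error: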

aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8081 ssl:default [Multiple exceptions: [Errno 111] Connect call failed ('::1', 8081, 0, 0), [Errno 111] Connect call failed ('127.0.0.1', 8081)]
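The two excerpts above hint at URL problems that are easy to check by eye but also easy to miss: the chat-engine log shows the endpoint as https://:9002/rag/chat, with nothing before the port, and the plugin-server log shows a connection attempt to localhost:8081. Whether the empty host is a redaction or an unset value is not stated here. The following Python sketch, with the hypothetical helper name check_endpoint_url, flags both patterns in a configured URL using only the standard library.

```python
from typing import List
from urllib.parse import urlsplit


def check_endpoint_url(url: str) -> List[str]:
    """Return human-readable problems found in url (empty list if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if not parts.hostname:
        # urlsplit reports no hostname for URLs such as "https://:9002/path".
        problems.append("host part is empty")
    elif parts.hostname in ("localhost", "127.0.0.1", "::1"):
        # Inside a container, loopback refers to that container itself.
        problems.append("host is a loopback address")
    return problems
```

A loopback host is not always wrong. It is flagged because a URL such as http://localhost:8081 only works when the RAG server is reachable on the loopback interface of the process making the call.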

Using NIM hosted LLM and embedding models
Using local LLM and embedding models

Commonly observed errors in the rag-application-text-chatbot-langchain container logs include:

Milvus is not up.

WARNING:pymilvus.decorators:[query] retry:75, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host>
ERROR:pymilvus.decorators:RPC error: [query], <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host", grpc_status:14, created_time:"2024-10-24T08:18:14.14407801+00:00"}"
>>, message=Retry run out of 75 retry times, message=failed to connect to all addresses; last error: UNKNOWN: ipv4:172.22.0.6:19530: Failed to connect to remote host: No route to host)>, <Time:{'RPC start': '2024-10-24 08:14:43.305372', 'RPC error': '2024-10-24 08:18:14.144369'}>
ERROR:RAG.src.chain_server.utils:Error occurred while retrieving documents: <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:

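The local NIM LLM is not up. Check the logs of nemollm-inference-microservice. For more information, see the Unable to Deploy the Local NIM LLM Model troubleshooting section.
The local NIM embedding model is not up. Check the logs of nemo-retriever-embedding-microservice. If NGC_CLI_API_KEY is not set correctly, you might observe the error message Error: NGC_API_KEY is not set.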
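Each log excerpt in the RAG sections carries a distinctive substring, so a saved log can be triaged mechanically before reading it in full. The sketch below is a rough Python aid under stated assumptions: the substrings are copied from the excerpts on this page and may differ in other versions, the cause strings only restate the explanations given above, and the names SIGNATURES and likely_causes are hypothetical.

```python
from typing import Iterable, List

# (substring seen in the excerpts, cause given in the text), checked in order.
SIGNATURES = [
    ("[401] Unauthorized", "NVIDIA_API_KEY is missing or incorrect"),
    ("NameResolutionError", "a local embedding or NIM LLM service is not ready or has exited"),
    # gRPC UNAVAILABLE is generic; in these logs it appears in pymilvus errors.
    ("StatusCode.UNAVAILABLE", "Milvus is not up"),
    ("ClientConnectorError", "the RAG server is unreachable or not up"),
]


def likely_causes(lines: Iterable[str]) -> List[str]:
    """Return the distinct causes whose signature appears in lines, in order found."""
    causes: List[str] = []
    for line in lines:
        for needle, cause in SIGNATURES:
            if needle in line and cause not in causes:
                causes.append(cause)
    return causes
```

One way to use it is to save a container's output to a file, for example with docker logs, and pass the open file object to likely_causes. An empty result means none of these particular signatures were found, not that the log is free of errors.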

Unable to Pull Docker Images

Q: Why can't I pull images in a Docker deployment? I get the following error:

Unable to find image 'nvcr.io/nvidia/riva/riva-speech:2.17.0-servicemaker' locally
model-utils-speech | docker: Error response from daemon: Head "https://nvcr.io/v2/nvidia/riva/riva-speech/manifests/2.17.0-servicemaker": unauthorized:

A: Log in to the nvcr.io Docker registry to pull the ACE Agent and Riva Skills containers.

export NGC_CLI_API_KEY=<your-api-key>
echo ${NGC_CLI_API_KEY} | docker login nvcr.io --username '$oauthtoken' --password-stdin

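Kubernetes Pods Are Failing

Q: Why are my Kubernetes pods failing with a Failed to pull image error?

A: Make sure imagePullSecrets is set correctly in the UCS application and that the Kubernetes secret is configured correctly. Delete and recreate the ngc-docker-reg-secret Docker registry Kubernetes secret.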

export NGC_CLI_API_KEY=...

kubectl delete secret ngc-docker-reg-secret
kubectl create secret docker-registry ngc-docker-reg-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password="${NGC_CLI_API_KEY}"

Why Is the Speech Model Deployment Failing

Q: When I deploy the speech models with the docker compose -f deploy/docker/docker-compose.yml up model-utils-speech command, why does it fail with the following log?

model-utils-speech  | Waiting for Riva server to load all models...retrying in 10 seconds

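A: This usually indicates that the Riva Speech server container has exited with an error. Check the Riva Speech server container logs. The most commonly observed issues are:

Out of memory - The GPU does not have enough free VRAM for the Riva server deployment.
Port conflict - Another service is already using one of the ports exposed by the Riva Speech server: 8000, 8001, 8002, or 50051.
TensorRT conversion failure - An error occurred during TensorRT conversion for a specific model, and that model failed to load. Check the TensorRT conversion output in the model-utils-speech container logs to find the root cause.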
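For the port conflict case, one rough way to see which of the four ports is already taken is to try binding each one on the deployment host before starting the Riva Speech server. This is a minimal Python sketch, not an ACE Agent or Riva tool; the names RIVA_PORTS and ports_in_use are hypothetical, and a bind failure can also have other causes, such as insufficient permissions.

```python
import socket
from typing import Iterable, List

RIVA_PORTS = (8000, 8001, 8002, 50051)  # ports named in the text above


def ports_in_use(ports: Iterable[int], host: str = "0.0.0.0") -> List[int]:
    """Return the ports that cannot be bound on host, i.e. are likely taken."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # Binding fails with OSError when another socket holds the port.
                sock.bind((host, port))
            except OSError:
                busy.append(port)
    return busy
```

Running ports_in_use(RIVA_PORTS) while the Riva Speech server is stopped should return an empty list on a host with no conflict. Any port it returns is worth checking against the other services running on that machine.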