Deploying Hermes-2-Pro-Llama-3-8B Model with Triton Inference Server#

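Hermes-2-Pro-Llama-3-8B is an advanced language model developed by NousResearch. It is an enhancement of Meta-Llama-3-8B, fine-tuned on the OpenHermes 2.5 dataset together with NousResearch's newly introduced, in-house function-calling and JSON-mode datasets. As a result, the model performs well both on general conversational tasks and on specialized functions such as structured JSON output and function calling, which makes it a versatile tool for a wide range of applications.

The model is available for download from Hugging Face.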

TensorRT-LLM is NVIDIA's recommended solution for running large language models (LLMs) on NVIDIA GPUs. For more information, see the TensorRT-LLM project and Triton's TensorRT-LLM backend.

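Note: If some parts of this tutorial do not work, there may be a version mismatch between the tutorials and tensorrtllm_backend repositories. Refer to llama.md for more detailed modifications if necessary. If you are familiar with Python, you can also try the High-level API for the LLM workflow.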

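Prerequisite: TensorRT-LLM Backend#

This tutorial requires the TensorRT-LLM backend repository. For the best user experience, we recommend using the latest release tag of tensorrtllm_backend and the latest Triton Server container.

To clone the TensorRT-LLM backend repository, run the following set of commands: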

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git  --branch <release branch>
# Update the submodules
cd tensorrtllm_backend
# Install git-lfs if needed
apt-get update && apt-get install git-lfs -y --no-install-recommends
git lfs install
git submodule update --init --recursive

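Launch the Triton TensorRT-LLM Container#

Launch the Triton docker container with the TensorRT-LLM backend. For simplicity, we mount tensorrtllm_backend to /tensorrtllm_backend and the Hermes model to /Hermes-2-Pro-Llama-3-8B in the docker container. Create an engines folder outside docker so that the engines can be reused in future runs. Make sure to replace <xx.yy> with the version of Triton that you want to use.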

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/Hermes/repo>:/Hermes-2-Pro-Llama-3-8B \
    -v </path/to/engines>:/engines \
    nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3

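Alternatively, if you want to build a specialized container, you can build Triton Server with the TensorRT-LLM backend by following the build instructions for the TensorRT-LLM backend.

Don't forget to allow GPU usage when you launch the container.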

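Create Engines for Each Model [skip this step if you already have an engine]#

TensorRT-LLM requires each model to be compiled for the configuration you need before it can run. For this reason, you need to create a TensorRT-LLM engine before you run the model on Triton Server for the first time.

The Triton Server TensorRT-LLM container comes with the TensorRT-LLM package preinstalled, so you can build engines inside the Triton container. Follow these steps: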

HF_LLAMA_MODEL=/Hermes-2-Pro-Llama-3-8B
UNIFIED_CKPT_PATH=/tmp/ckpt/hermes/8b/
ENGINE_DIR=/engines
CONVERT_CHKPT_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py
python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${HF_LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
            --remove_input_padding enable \
            --gpt_attention_plugin float16 \
            --context_fmha enable \
            --gemm_plugin float16 \
            --output_dir ${ENGINE_DIR} \
            --paged_kv_cache enable \
            --max_batch_size 4
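Because this step can be skipped when an engine already exists, a quick check of the engine directory can be handy. The sketch below is only illustrative: the helper name engine_dir_looks_built is hypothetical, and it assumes that trtllm-build writes a config.json plus one rank<N>.engine file per GPU rank into the output directory, which may differ between TensorRT-LLM versions.

```python
from pathlib import Path


def engine_dir_looks_built(engine_dir: str) -> bool:
    """Return True if the directory appears to hold a built engine.

    Assumption (not stated in this tutorial): a finished build leaves
    a config.json and at least one rank<N>.engine file in the directory.
    """
    root = Path(engine_dir)
    has_config = (root / "config.json").is_file()
    has_engine = any(root.glob("rank*.engine"))
    return has_config and has_engine


if __name__ == "__main__":
    # /engines is the ENGINE_DIR used in the build commands above.
    if engine_dir_looks_built("/engines"):
        print("An engine seems to be present; the build step can be skipped.")
    else:
        print("No engine found; run the build step.")
```

If the check reports no engine even though the build succeeded, inspect the directory contents and adjust the file names in the sketch.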

Optional: You can test the output of the model with run.py, located in the TensorRT-LLM examples folder.

 python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=${ENGINE_DIR} --max_output_len 28 --tokenizer_dir ${HF_LLAMA_MODEL} --input_text "What is ML?"

You should expect the following response:

Input [Text 0]: "<|begin_of_text|>What is ML?"
Output [Text 0 Beam 0]: "
Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."

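Serving with Triton#

The last step is to create a Triton-readable model. You can find a template of a model that uses in-flight batching in tensorrtllm_backend/all_models/inflight_batcher_llm. To run our model, you will need to:

Copy over the in-flight batcher model repository: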

cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.

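Modify config.pbtxt for the preprocessing, postprocessing, and processing steps. The following script performs a minimal configuration to run tritonserver. If you want optimal performance or custom parameters, read the details in the documentation and perf_best_practices: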

# preprocessing
TOKENIZER_DIR=/Hermes-2-Pro-Llama-3-8B/
TOKENIZER_TYPE=auto
DECOUPLED_MODE=false
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
# Value is in microseconds: it is passed to max_queue_delay_microseconds below
MAX_QUEUE_DELAY_MS=10000
TRTLLM_BACKEND=python
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRTLLM_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
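Each command above passes a comma-separated list of key:value pairs that are written into the matching placeholders of a config.pbtxt. The sketch below illustrates that idea and shows one way to spot placeholders that were left unfilled. It is not the fill_template.py implementation: the function names are hypothetical, and it assumes that placeholders take the form ${name}.

```python
import re
from string import Template

# Assumed placeholder syntax: ${name}
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def parse_substitutions(spec: str) -> dict:
    """Parse 'key:value,key:value' into a dict.

    Each pair is split on its first colon only, so values may
    themselves contain colons.
    """
    result = {}
    for item in spec.split(","):
        if ":" in item:
            key, value = item.split(":", 1)
            result[key] = value
    return result


def fill_placeholders(template_text: str, spec: str) -> str:
    """Replace known ${name} placeholders and leave unknown ones untouched."""
    return Template(template_text).safe_substitute(parse_substitutions(spec))


def unfilled_placeholders(text: str) -> list:
    """Return the sorted names of placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(text)))


if __name__ == "__main__":
    sample = "max_batch_size: ${triton_max_batch_size}\ncount: ${preprocessing_instance_count}"
    filled = fill_placeholders(sample, "triton_max_batch_size:4")
    print(filled)
    print("still unfilled:", unfilled_placeholders(filled))
```

Running unfilled_placeholders over the text of each config.pbtxt after the commands above gives a quick indication of whether any parameter was missed.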

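Launch Tritonserver:

Note: This tutorial is prepared for serving a TensorRT-LLM model on a single GPU. If the engine was built for a single GPU, use --world_size=1 in the following command. If the engine requires multiple GPUs, make sure that --world_size specifies the exact number of GPUs the engine requires.

Use the launch_triton_server.py script. It launches multiple instances of tritonserver with MPI.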

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm

You should expect the following response:

...
I0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001
I0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000
I0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
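Instead of watching the log for these lines, readiness can also be checked programmatically. The following is a minimal sketch that queries the server's HTTP readiness endpoint, /v2/health/ready, on the HTTP port shown in the log above (8000). The helper names and the default timeout are illustrative choices, not part of this tutorial.

```python
import http.client
import urllib.request


def ready_url(base_url: str = "http://localhost:8000") -> str:
    """Build the readiness URL from the server's HTTP base URL."""
    return base_url.rstrip("/") + "/v2/health/ready"


def server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200.

    Connection errors, timeouts, and non-200 responses all count as not ready.
    """
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


if __name__ == "__main__":
    print("ready" if server_ready() else "not ready")
```

Loading the engine can take some time, so calling server_ready in a loop with a short sleep between attempts may be more practical than a single probe.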

To stop Triton Server inside the container, run:

pkill tritonserver

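Send an Inference Request#

You can test the results of the run with either the inflight_batcher_llm_client.py script or the generate endpoint.

Using the inflight_batcher_llm_client.py script:

First, let's start the Triton SDK container: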

# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tensorrtllm_backend/inflight_batcher_llm/client:/tensorrtllm_client \
    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

Then install the additional dependencies for the script and run it:

pip3 install transformers sentencepiece
python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 28 --tokenizer-dir /Hermes-2-Pro-Llama-3-8B --text "What is ML?"

You should expect the following response:

...
Input: What is ML?
Output beam 0:
ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation.
...

Using the generate endpoint:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

You should expect the following response:

{"context_logits":0.0,...,"text_output":"What is ML?\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed."}
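The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above: the URL, the request fields, and the text_output response field are taken from that example, while the function names and the timeout are illustrative assumptions.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_request(prompt: str, max_tokens: int = 50,
                           pad_id: int = 2, end_id: int = 2) -> dict:
    """Build the JSON body with the same fields as the curl example."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "pad_id": pad_id,
        "end_id": end_id,
    }


def extract_text(response_body: str) -> str:
    """Pull the generated text out of the JSON response."""
    return json.loads(response_body)["text_output"]


def generate(prompt: str, url: str = GENERATE_URL, timeout: float = 60.0) -> str:
    """POST the prompt to the generate endpoint and return the generated text."""
    data = json.dumps(build_generate_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read().decode("utf-8"))
```

With the server running, generate("What is ML?") sends the same request as the curl command and returns only the text_output field.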

References#

For more examples, feel free to refer to the end-to-end workflow for running llama.