Testing the TensorRT-LLM backend#

The tests in this CI directory can be run manually to provide extensive test coverage.

Run QA tests#

Run the tests inside the Triton container.

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash -x ./test.sh

Run e2e/benchmark_core_model for benchmarking#

These two tests are run as part of the L0_backend_trtllm test. The instructions below describe how to run them manually.

Generate the model repository#

Follow the instructions in the "Create the model repository" section to prepare the model repository.

Modify the model configuration#

Follow the instructions in the "Modify the model configuration" section to adjust the model configuration as needed.

End-to-end test#

The end-to-end test script sends requests to the deployed ensemble model.

The ensemble model is composed of three models: preprocessing, tensorrt_llm, and postprocessing.

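"preprocessing": This model performs tokenization, that is, it converts prompts (strings) into input_ids (lists of integers).
"tensorrt_llm": This model is a wrapper around the TensorRT-LLM model and performs inference.
"postprocessing": This model performs de-tokenization, that is, it converts output_ids (lists of integers) into outputs (strings).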

The end-to-end latency is the total latency across all three parts of the ensemble model.

cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>

Expected output

[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
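
To compare runs, it can help to normalize the total latency by the number of prompts. The following is a minimal sketch and not part of the repository's tooling: it assumes the log lines have exactly the format shown in the expected output above, and `mean_latency_ms` is a hypothetical helper name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive the mean per-prompt latency from the test script's log lines.
# Assumption: the log format matches the expected output shown above.

mean_latency_ms() {
    # Reads log text on stdin; prints total latency divided by prompt count.
    # Exits with status 1 if either value is missing from the input.
    awk '
        /Start benchmarking on/ { prompts = $(NF - 1) }  # "... on 125 prompts."
        /Total Latency:/        { total = $(NF - 1) }    # "... 11099.243 ms"
        END {
            if (prompts > 0 && total != "") printf "%.3f\n", total / prompts
            else exit 1
        }
    '
}

# Sample log copied from the expected output above.
sample_log='[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms'

printf '%s\n' "$sample_log" | mean_latency_ms   # prints 88.794
```

In practice the function could be fed from a saved log file or a pipe instead of the inline sample. The result is only a mean over the whole run and says nothing about the latency of any individual request.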

benchmark_core_model#

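The benchmark_core_model script sends requests directly to the deployed tensorrt_llm model. The benchmark_core_model latency therefore reflects the inference latency of TensorRT-LLM alone and excludes the pre-/post-processing latency, which is usually handled by a third-party library such as HuggingFace.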

cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>

Expected output

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms

Note that the expected outputs in this document are for reference only; the actual performance numbers depend on the GPU you are using.
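
Because the end-to-end test includes pre- and post-processing while benchmark_core_model does not, the gap between the two total latencies gives a rough estimate of the overhead outside the tensorrt_llm model. The sketch below works through that arithmetic with the reference numbers from this document. It assumes both runs used the same dataset and hardware, and `overhead_ms` is a hypothetical helper name, not something provided by the repository.

```shell
#!/usr/bin/env bash
# Sketch: estimate the latency spent outside the tensorrt_llm model by
# subtracting the core-model total from the end-to-end total.
# Assumption: both totals come from comparable runs (same dataset, same GPU).

overhead_ms() {
    # $1: end-to-end total latency in ms
    # $2: benchmark_core_model total latency in ms
    awk -v e2e="$1" -v core="$2" 'BEGIN {
        if (e2e <= 0) exit 1          # avoid dividing by zero
        diff = e2e - core
        printf "%.3f ms (%.2f%% of end-to-end)\n", diff, 100 * diff / e2e
    }'
}

# Reference totals from the expected outputs in this document.
overhead_ms 11099.243 10213.462   # prints 885.781 ms (7.98% of end-to-end)
```

This figure is an approximation: it also absorbs run-to-run variance and any ensemble scheduling cost, so it should be treated as an indication rather than a measurement of tokenization time.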