Deploy Phi-3 Models with Triton and TRT-LLM#
This guide walks through building a Phi-3 model with TRT-LLM and deploying it with Triton Inference Server. It also shows how to run benchmarks with GenAI-Perf to measure the model's throughput and latency.
This guide was tested on A100 80GB SXM4 and H100 80GB PCIe. It is confirmed to work with Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct (see the support matrix for the full list), using TRT-LLM v0.11 and Triton Inference Server 24.07.
Build and Test the TRT-LLM Engine#
Reference: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
Retrieve and launch the Docker container (optional)
# Pre-install the environment using the NVIDIA Container Toolkit to avoid manual environment configuration
docker run --rm --ipc=host --runtime=nvidia --gpus '"device=0"' --entrypoint /bin/bash -it nvidia/cuda:12.4.1-devel-ubuntu22.04
Install TensorRT-LLM
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
# Install TensorRT-LLM (v0.11.0)
pip3 install tensorrt_llm==0.11.0 --extra-index-url https://pypi.nvidia.com
# Check installation
python3 -c "import tensorrt_llm"
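As an optional extra sanity check, you can also print the installed version; a minimal sketch, assuming the package exposes __version__ (it does in recent releases):
# should print 0.11.0
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"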
Clone the TRT-LLM repository, which contains the Phi-3 conversion script
git clone -b v0.11.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/phi/
# only needed if you want to test the summarize.py example
# if so, pin tensorrt_llm==0.11.0 in requirements.txt
# pip install -r requirements.txt
Build the TRT-LLM Engine#
Download Phi-3-mini-4k-instruct
git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
Convert the weights from HF Transformers format to TensorRT-LLM format
python3 ./convert_checkpoint.py \
--model_dir ./Phi-3-mini-4k-instruct \
--output_dir ./phi-checkpoint \
--dtype float16
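Optionally, verify the conversion before building; assuming the converter writes a config.json plus per-rank safetensors weights, a directory listing should show them:
# expect config.json and rank0.safetensors
ls ./phi-checkpoint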
Build the TensorRT engine
# Build a float16 engine using a single GPU and HF weights.
# Enable several TensorRT-LLM plugins to increase runtime performance. This also helps with build time.
# --tp_size and --pp_size set the tensor- and pipeline-parallel shard counts
trtllm-build \
--checkpoint_dir ./phi-checkpoint \
--output_dir ./phi-engine \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 1024 \
--max_seq_len 2048 \
--tp_size 1 \
--pp_size 1
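Optionally, confirm the build artifacts; with --tp_size 1 and --pp_size 1 there should be a single rank engine next to its config:
# expect config.json and rank0.engine
ls ./phi-engine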
Run the model
python3 ../run.py --engine_dir ./phi-engine \
--max_output_len 500 \
--tokenizer_dir ./Phi-3-mini-4k-instruct \
--input_text "How do I count to nine in French?"
Summarization test with the Phi model
The TensorRT-LLM Phi model can be tested by summarizing articles from the cnn_dailymail dataset. For each summary, the script computes the ROUGE scores and uses the ROUGE-1 score to validate the implementation. The script can also run the same summarization with the HF Phi model.
# Run the summarization task using a TensorRT-LLM model and a single GPU.
python3 ../summarize.py --engine_dir ./phi-engine \
--hf_model_dir ./Phi-3-mini-4k-instruct \
--batch_size 1 \
--test_trt_llm \
--test_hf \
--data_type fp16 \
--check_accuracy \
--tensorrt_llm_rouge1_threshold=20
Deploy with Triton Inference Server#
Copy the engine files from the Docker container to the host
# In another terminal instance, before exiting the current container
docker cp <container_id>:<path_in_container> <path_on_host>
# For example
docker cp 452ee1c1d8a1:/TensorRT-LLM/examples/phi/phi-engine /home/user/phi-engine
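If the container ID is not at hand, it can be looked up from the host first:
# list running containers to find the TensorRT-LLM container ID
docker ps --format "{{.ID}}  {{.Image}}"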
Copy the compiled model into the skeleton repository that uses the TRT-LLM backend
# After exiting the TensorRT-LLM Docker container
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../phi-engine/* all_models/inflight_batcher_llm/tensorrt_llm/1/
Modify the configuration files in the model repository
The following configuration files need to be updated:
ensemble/config.pbtxt
postprocessing/config.pbtxt
preprocessing/config.pbtxt
tensorrt_llm/config.pbtxt
tensorrt_llm/1/config.json
Update ensemble/config.pbtxt#
python3 tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/ensemble/config.pbtxt \
triton_max_batch_size:128
Update postprocessing/config.pbtxt#
python3 tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
tokenizer_type:auto,\
tokenizer_dir:../Phi-3-mini-4k-instruct,\
triton_max_batch_size:128,\
postprocessing_instance_count:2
Update preprocessing/config.pbtxt#
python3 tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
tokenizer_type:auto,\
tokenizer_dir:../Phi-3-mini-4k-instruct,\
triton_max_batch_size:128,\
preprocessing_instance_count:2
Update tensorrt_llm/config.pbtxt#
python3 tools/fill_template.py --in_place \
all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
decoupled_mode:true,\
engine_dir:/opt/all_models/inflight_batcher_llm/tensorrt_llm/1,\
max_tokens_in_paged_kv_cache:,\
batch_scheduler_policy:guaranteed_completion,\
kv_cache_free_gpu_mem_fraction:0.2,\
max_num_sequences:4,\
triton_backend:tensorrtllm,\
triton_max_batch_size:128,\
max_queue_delay_microseconds:10,\
max_beam_width:1,\
batching_strategy:inflight_fused_batching
# manually open tensorrt_llm/config.pbtxt and change the CPU instance count to > 1
# unfortunately this value is hard-coded and cannot be updated with the above script
# instance_group [
# {
# count: 2
# kind : KIND_CPU
# }
# ]
Max Tokens in the Paged KV Cache#
This is only required for Phi-3-mini-128k-instruct; for Phi-3-mini-4k-instruct this parameter does not need to be modified.
To accommodate the 128k context, remove the following from tensorrt_llm/config.pbtxt - this lets the KV cache manager determine the maximum number of tokens. If you prefer not to remove it, you can instead set maxTokensInPagedKvCache large enough (e.g. 4096) to complete at least one sequence (i.e. it must be greater than beam_width * tokensPerBlock * maxBlocksPerSeq).
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "4096"
}
}
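For illustration only, with assumed values (not taken from this guide) of beam_width=1, tokensPerBlock=64, and maxBlocksPerSeq=32, the lower bound works out as follows, and a setting of 4096 safely exceeds it:
# hypothetical sizing check: beam_width * tokensPerBlock * maxBlocksPerSeq
echo $(( 1 * 64 * 32 ))   # 2048; e.g. 4096 is comfortably above this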
Update tensorrt_llm/1/config.json#
In the engine config (tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/config.json), add the following under plugin_config:
"use_context_fmha_for_generation": false
# for example:
"plugin_config": {
    "dtype": "float16",
    "bert_attention_plugin": "auto",
    "streamingllm": false,
    "use_context_fmha_for_generation": false,
    ...
}
The edits above need to be made manually with your editor of choice. When you are done, make sure your working directory is ~/tensorrtllm_backend.
Remove tensorrt_llm_bls
# Recommended to remove the BLS directory if not needed
rm -rf all_models/inflight_batcher_llm/tensorrt_llm_bls/
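Before launching, a quick grep can flag any ${...} template placeholders still left unfilled in the model repository (a convenience check, not part of the original workflow; unfilled placeholders can cause model load failures):
grep -rn '\${' all_models/inflight_batcher_llm/ --include=config.pbtxt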
Download the model repository
# for tokenizer
git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
Launch Triton Inference Server (trtllm-python-py3)
docker run -it --rm --gpus all --network host --shm-size=1g \
-v $(pwd)/all_models:/opt/all_models \
-v $(pwd)/scripts:/opt/scripts \
-v $(pwd)/Phi-3-mini-4k-instruct:/opt/Phi-3-mini-4k-instruct \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# Launch Server
python3 ../scripts/launch_triton_server.py --model_repo ../all_models/inflight_batcher_llm --world_size 1
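Once the logs report the models as READY, liveness can be verified from another terminal on the host via Triton's standard health endpoint:
# expect HTTP 200 when all models are loaded
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready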
Send a request
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
"text_input": "A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?",
"parameters": {
"max_tokens": 256,
"bad_words":[""],
"stop_words":[""]
}
}' | jq
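Because decoupled mode is enabled, responses can also be streamed token by token; a minimal sketch against Triton's generate_stream endpoint, using the same payload shape plus "stream": true:
curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d \
'{
"text_input": "How do I count to nine in French?",
"parameters": {
"max_tokens": 64,
"stream": true,
"bad_words":[""],
"stop_words":[""]
}
}'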
Benchmark with GenAI-Perf#
Launch the Triton Inference Server SDK container (py3-sdk)
export RELEASE="24.07"
docker run -it --net=host --gpus '"device=0"' nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
Download the Phi-3 tokenizer
Log in to Hugging Face (with a user access token) to fetch the Phi-3 tokenizer. This step is not required, but it helps interpret the token metrics from prompts and responses. If you skip it, be sure to remove the --tokenizer flag from the GenAI-Perf command below.
git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
pip install huggingface_hub
huggingface-cli login --token hf_***
Run GenAI-Perf
export INPUT_SEQUENCE_LENGTH=128
export OUTPUT_SEQUENCE_LENGTH=128
export CONCURRENCY=25
genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--random-seed 123 \
--synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--concurrency $CONCURRENCY \
--tokenizer microsoft/Phi-3-mini-4k-instruct \
--measurement-interval 4000 \
--url localhost:8001
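To trace a full throughput-latency curve instead of a single operating point, the same command can be wrapped in a loop over concurrency levels (a sketch; the flags are unchanged from above):
# sweep concurrency; each run reports throughput and latency percentiles
for CONCURRENCY in 1 2 4 8 16 32; do
genai-perf \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
--random-seed 123 \
--synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
--synthetic-input-tokens-stddev 0 \
--streaming \
--output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
--output-tokens-stddev 0 \
--output-tokens-mean-deterministic \
--concurrency $CONCURRENCY \
--tokenizer microsoft/Phi-3-mini-4k-instruct \
--measurement-interval 4000 \
--url localhost:8001
done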
For more details on performance benchmarking with GenAI-Perf, refer to the GenAI-Perf documentation.
Reference Configurations#
All configuration files under /tensorrtllm_backend/all_models/inflight_batcher_llm are shown below.
ensemble/config.pbtxt
# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "ensemble"
platform: "ensemble"
max_batch_size: 128
input [
{
name: "text_input"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "decoder_text_input"
data_type: TYPE_STRING
dims: [ 1 ]
optional: true
},
{
name: "image_input"
data_type: TYPE_FP16
dims: [ 3, 224, 224 ]
optional: true
},
{
name: "max_tokens"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "bad_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "stop_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_k"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "top_p"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "length_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "stream"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "embedding_bias_words"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "embedding_bias_weights"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
}
]
output [
{
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map {
key: "QUERY"
value: "text_input"
}
input_map {
key: "DECODER_QUERY"
value: "decoder_text_input"
}
input_map {
key: "IMAGE"
value: "image_input"
}
input_map {
key: "REQUEST_OUTPUT_LEN"
value: "max_tokens"
}
input_map {
key: "BAD_WORDS_DICT"
value: "bad_words"
}
input_map {
key: "STOP_WORDS_DICT"
value: "stop_words"
}
input_map {
key: "EMBEDDING_BIAS_WORDS"
value: "embedding_bias_words"
}
input_map {
key: "EMBEDDING_BIAS_WEIGHTS"
value: "embedding_bias_weights"
}
input_map {
key: "END_ID"
value: "end_id"
}
input_map {
key: "PAD_ID"
value: "pad_id"
}
input_map {
key: "PROMPT_EMBEDDING_TABLE"
value: "prompt_embedding_table"
}
output_map {
key: "REQUEST_INPUT_LEN"
value: "_REQUEST_INPUT_LEN"
}
output_map {
key: "INPUT_ID"
value: "_INPUT_ID"
}
output_map {
key: "REQUEST_DECODER_INPUT_LEN"
value: "_REQUEST_DECODER_INPUT_LEN"
}
output_map {
key: "DECODER_INPUT_ID"
value: "_DECODER_INPUT_ID"
}
output_map {
key: "REQUEST_OUTPUT_LEN"
value: "_REQUEST_OUTPUT_LEN"
}
output_map {
key: "STOP_WORDS_IDS"
value: "_STOP_WORDS_IDS"
}
output_map {
key: "BAD_WORDS_IDS"
value: "_BAD_WORDS_IDS"
}
output_map {
key: "EMBEDDING_BIAS"
value: "_EMBEDDING_BIAS"
}
output_map {
key: "OUT_END_ID"
value: "_PREPROCESSOR_END_ID"
}
output_map {
key: "OUT_PAD_ID"
value: "_PREPROCESSOR_PAD_ID"
}
output_map {
key: "OUT_PROMPT_EMBEDDING_TABLE"
value: "out_prompt_embedding_table"
}
},
{
model_name: "tensorrt_llm"
model_version: -1
input_map {
key: "input_ids"
value: "_INPUT_ID"
}
input_map {
key: "decoder_input_ids"
value: "_DECODER_INPUT_ID"
}
input_map {
key: "input_lengths"
value: "_REQUEST_INPUT_LEN"
}
input_map {
key: "decoder_input_lengths"
value: "_REQUEST_DECODER_INPUT_LEN"
}
input_map {
key: "request_output_len"
value: "_REQUEST_OUTPUT_LEN"
}
input_map {
key: "end_id"
value: "_PREPROCESSOR_END_ID"
}
input_map {
key: "pad_id"
value: "_PREPROCESSOR_PAD_ID"
}
input_map {
key: "embedding_bias"
value: "_EMBEDDING_BIAS"
}
input_map {
key: "runtime_top_k"
value: "top_k"
}
input_map {
key: "runtime_top_p"
value: "top_p"
}
input_map {
key: "temperature"
value: "temperature"
}
input_map {
key: "len_penalty"
value: "length_penalty"
}
input_map {
key: "repetition_penalty"
value: "repetition_penalty"
}
input_map {
key: "min_length"
value: "min_length"
}
input_map {
key: "presence_penalty"
value: "presence_penalty"
}
input_map {
key: "frequency_penalty"
value: "frequency_penalty"
}
input_map {
key: "random_seed"
value: "random_seed"
}
input_map {
key: "return_log_probs"
value: "return_log_probs"
}
input_map {
key: "return_context_logits"
value: "return_context_logits"
}
input_map {
key: "return_generation_logits"
value: "return_generation_logits"
}
input_map {
key: "beam_width"
value: "beam_width"
}
input_map {
key: "streaming"
value: "stream"
}
input_map {
key: "prompt_embedding_table"
value: "out_prompt_embedding_table"
}
input_map {
key: "prompt_vocab_size"
value: "prompt_vocab_size"
}
input_map {
key: "stop_words_list"
value: "_STOP_WORDS_IDS"
}
input_map {
key: "bad_words_list"
value: "_BAD_WORDS_IDS"
}
output_map {
key: "output_ids"
value: "_TOKENS_BATCH"
}
output_map {
key: "sequence_length"
value: "_SEQUENCE_LENGTH"
}
output_map {
key: "cum_log_probs"
value: "_CUM_LOG_PROBS"
}
output_map {
key: "output_log_probs"
value: "_OUTPUT_LOG_PROBS"
}
output_map {
key: "context_logits"
value: "_CONTEXT_LOGITS"
}
output_map {
key: "generation_logits"
value: "_GENERATION_LOGITS"
}
output_map {
key: "batch_index"
value: "_BATCH_INDEX"
}
},
{
model_name: "postprocessing"
model_version: -1
input_map {
key: "TOKENS_BATCH"
value: "_TOKENS_BATCH"
}
input_map {
key: "CUM_LOG_PROBS"
value: "_CUM_LOG_PROBS"
}
input_map {
key: "OUTPUT_LOG_PROBS"
value: "_OUTPUT_LOG_PROBS"
}
input_map {
key: "CONTEXT_LOGITS"
value: "_CONTEXT_LOGITS"
}
input_map {
key: "GENERATION_LOGITS"
value: "_GENERATION_LOGITS"
}
input_map {
key: "SEQUENCE_LENGTH"
value: "_SEQUENCE_LENGTH"
}
input_map {
key: "BATCH_INDEX"
value: "_BATCH_INDEX"
}
output_map {
key: "OUTPUT"
value: "text_output"
}
output_map {
key: "OUT_OUTPUT_LOG_PROBS"
value: "output_log_probs"
}
output_map {
key: "OUT_CUM_LOG_PROBS"
value: "cum_log_probs"
}
output_map {
key: "OUT_CONTEXT_LOGITS"
value: "context_logits"
}
output_map {
key: "OUT_GENERATION_LOGITS"
value: "generation_logits"
}
output_map {
key: "OUT_BATCH_INDEX"
value: "batch_index"
}
}
]
}
postprocessing/config.pbtxt
# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "postprocessing"
backend: "python"
max_batch_size: 128
input [
{
name: "TOKENS_BATCH"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "SEQUENCE_LENGTH"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "CUM_LOG_PROBS"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
},
{
name: "OUTPUT_LOG_PROBS"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
},
{
name: "CONTEXT_LOGITS"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
},
{
name: "GENERATION_LOGITS"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
optional: true
},
{
name: "BATCH_INDEX"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
}
]
output [
{
name: "OUTPUT"
data_type: TYPE_STRING
dims: [ -1 ]
},
{
name: "OUT_CUM_LOG_PROBS"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "OUT_OUTPUT_LOG_PROBS"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "OUT_CONTEXT_LOGITS"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "OUT_GENERATION_LOGITS"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "OUT_BATCH_INDEX"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
parameters {
key: "tokenizer_dir"
value: {
string_value: "../Phi-3-mini-4k-instruct"
}
}
parameters {
key: "skip_special_tokens"
value: {
string_value: "${skip_special_tokens}"
}
}
instance_group [
{
count: 4
kind: KIND_CPU
}
]
preprocessing/config.pbtxt
# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "preprocessing"
backend: "python"
max_batch_size: 128
input [
{
name: "QUERY"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "DECODER_QUERY"
data_type: TYPE_STRING
dims: [ 1 ]
optional: true
},
{
name: "IMAGE"
data_type: TYPE_FP16
dims: [ 3, 224, 224 ]
optional: true
},
{
name: "REQUEST_OUTPUT_LEN"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "BAD_WORDS_DICT"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "STOP_WORDS_DICT"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "EMBEDDING_BIAS_WORDS"
data_type: TYPE_STRING
dims: [ -1 ]
optional: true
},
{
name: "EMBEDDING_BIAS_WEIGHTS"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
},
{
name: "END_ID"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "PAD_ID"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
},
{
name: "PROMPT_EMBEDDING_TABLE"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "INPUT_ID"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "REQUEST_INPUT_LEN"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "DECODER_INPUT_ID"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "REQUEST_DECODER_INPUT_LEN"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "BAD_WORDS_IDS"
data_type: TYPE_INT32
dims: [ 2, -1 ]
},
{
name: "STOP_WORDS_IDS"
data_type: TYPE_INT32
dims: [ 2, -1 ]
},
{
name: "EMBEDDING_BIAS"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "REQUEST_OUTPUT_LEN"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "OUT_END_ID"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "OUT_PAD_ID"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "OUT_PROMPT_EMBEDDING_TABLE"
data_type: TYPE_FP16
dims: [ -1, -1 ]
}
]
parameters {
key: "tokenizer_dir"
value: {
string_value: "../Phi-3-mini-4k-instruct"
}
}
parameters {
key: "add_special_tokens"
value: {
string_value: "${add_special_tokens}"
}
}
parameters {
key: "visual_model_path"
value: {
string_value: "${visual_model_path}"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "${engine_dir}"
}
}
instance_group [
{
count: 4
kind: KIND_CPU
}
]
tensorrt_llm/config.pbtxt
# Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 128
model_transaction_policy {
decoupled: true
}
dynamic_batching {
preferred_batch_size: [ 128 ]
max_queue_delay_microseconds: 10
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
reshape: { shape: [ ] }
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_acceptance_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_reset_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "early_stopping"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# the unique task ID for the given LoRA.
# To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
# The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
# If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached.
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
# weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
# where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
# each of the in / out tensors are first flattened and then concatenated together in the format above.
# D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
# module identifier (same size as the first dimension of lora_weights)
# See LoraModule::ModuleType for model id mapping
#
# "attn_qkv": 0 # compbined qkv adapter
# "attn_q": 1 # q adapter
# "attn_k": 2 # k adapter
# "attn_v": 3 # v adapter
# "attn_dense": 4 # adapter for the dense layer in attention
# "mlp_h_to_4h": 5 # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
# "mlp_4h_to_h": 6 # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
# "mlp_gate": 7 # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
#
# last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
instance_group [
{
count: 4
kind : KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "1"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/opt/all_models/inflight_batcher_llm/tensorrt_llm/1"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "${encoder_engine_dir}"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: ""
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "${max_attention_window_size}"
}
}
parameters: {
key: "sink_token_length"
value: {
string_value: "${sink_token_length}"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "guaranteed_completion"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "0.2"
}
}
parameters: {
key: "kv_cache_host_memory_bytes"
value: {
string_value: "${kv_cache_host_memory_bytes}"
}
}
parameters: {
key: "kv_cache_onboard_blocks"
value: {
string_value: "${kv_cache_onboard_blocks}"
}
}
# enable_trt_overlap is deprecated and doesn't have any effect on the runtime
# parameters: {
# key: "enable_trt_overlap"
# value: {
# string_value: "${enable_trt_overlap}"
# }
# }
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "${exclude_input_in_output}"
}
}
parameters: {
key: "cancellation_check_period_ms"
value: {
string_value: "${cancellation_check_period_ms}"
}
}
parameters: {
key: "stats_check_period_ms"
value: {
string_value: "${stats_check_period_ms}"
}
}
parameters: {
key: "iter_stats_max_iterations"
value: {
string_value: "${iter_stats_max_iterations}"
}
}
parameters: {
key: "request_stats_max_iterations"
value: {
string_value: "${request_stats_max_iterations}"
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "${enable_kv_cache_reuse}"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "${normalize_log_probs}"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "${enable_chunked_context}"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "${gpu_device_ids}"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
string_value: "${lora_cache_optimal_adapter_size}"
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
string_value: "${lora_cache_max_adapter_size}"
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
string_value: "${lora_cache_gpu_memory_fraction}"
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
string_value: "${lora_cache_host_memory_bytes}"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "${decoding_mode}"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
parameters: {
key: "medusa_choices"
value: {
string_value: "${medusa_choices}"
}
}
parameters: {
key: "gpu_weights_percent"
value: {
string_value: "${gpu_weights_percent}"
}
}