OpenAI-Compatible Frontend for Triton Inference Server (Beta)#
[!NOTE] The OpenAI-Compatible API is currently in BETA. Its features and functionality are subject to change as we collect feedback. We're excited to hear any thoughts you have and which features you'd like to see!
Prerequisites#
Docker + NVIDIA Container Runtime
A correctly configured HF_TOKEN for access to HuggingFace models. The current examples and tests primarily use the meta-llama/Meta-Llama-3.1-8B-Instruct model, but you can manually bring your own models and adjust accordingly (a quick HF_TOKEN sanity check is sketched below).
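A simple way to confirm HF_TOKEN is set before launching any of the containers below, using only the Python standard library; this check is illustrative and not part of the official setup:
import os
import sys

# Illustrative sanity check (not part of the official setup): confirm HF_TOKEN
# is set so gated models such as meta-llama/Meta-Llama-3.1-8B-Instruct can be
# downloaded inside the container (docker run -e HF_TOKEN forwards it).
token = os.environ.get("HF_TOKEN")
if not token:
    sys.exit("HF_TOKEN is not set; export it before starting the container.")
print(f"HF_TOKEN is set ({len(token)} characters) and will be forwarded to the container.")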
VLLM#
Launch the container and install dependencies
Mounts ~/.cache/huggingface so downloaded models are re-used across runs, containers, etc. Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3
Launch the OpenAI-compatible Triton Inference Server
cd /opt/tritonserver/python/openai
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
Example output
...
+-----------------------+---------+--------+
| Model | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1 | READY | <- Correct Model Loaded in Triton
+-----------------------+---------+--------+
...
Found model: name='llama-3.1-8b-instruct', backend='vllm'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [126]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
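Once Uvicorn reports that it is running, you can verify the model is available before sending inference requests. The sketch below assumes the frontend exposes the standard OpenAI GET /v1/models route on the default port 9000 used in this guide, and uses only the Python standard library:
import json
import urllib.request

# List the models served by the OpenAI-compatible frontend.
# Assumes the standard OpenAI `GET /v1/models` route on the default port 9000.
with urllib.request.urlopen("http://localhost:9000/v1/models") as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model.get("id"))  # expect: llama-3.1-8b-instruct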
Send a /v1/chat/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output
{
"id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message":
{
"content": "This is only a test.",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1727679085,
"model": "llama-3.1-8b-instruct",
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
}
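If you would rather issue the same request from Python without installing any client libraries, here is a standard-library sketch of the equivalent POST; the payload fields mirror the curl example above:
import json
import urllib.request

MODEL = "llama-3.1-8b-instruct"

# Same /v1/chat/completions request as the curl example, sent with urllib.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say this is a test!"}],
}
request = urllib.request.Request(
    "http://localhost:9000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    response = json.load(resp)

print(response["choices"][0]["message"]["content"])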
Send a /v1/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"prompt": "Machine learning is"
}' | jq
Example output
{
"id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": " a field of computer science that focuses on developing algorithms that allow computers to learn from"
}
],
"created": 1727679266,
"model": "llama-3.1-8b-instruct",
"system_fingerprint": null,
"object": "text_completion",
"usage": null
}
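The /v1/completions route can also be driven with the OpenAI python client introduced later in this guide; a minimal sketch, assuming the openai package is installed (pip install openai):
from openai import OpenAI

# Assumes `pip install openai` and the frontend listening on port 9000.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="llama-3.1-8b-instruct",
    prompt="Machine learning is",
)
print(completion.choices[0].text)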
Benchmark with genai-perf
MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
genai-perf profile \
--model ${MODEL} \
--tokenizer ${TOKENIZER} \
--service-kind openai \
--endpoint-type chat \
--url localhost:9000 \
--streaming
Example output
2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │ 71.66 │ 64.32 │ 86.52 │ 76.13 │ 74.92 │ 73.26 │
│ Inter token latency (ms) │ 18.47 │ 18.25 │ 18.72 │ 18.67 │ 18.61 │ 18.53 │
│ Request latency (ms) │ 348.00 │ 274.60 │ 362.27 │ 355.41 │ 352.29 │ 350.66 │
│ Output sequence length │ 15.96 │ 12.00 │ 16.00 │ 16.00 │ 16.00 │ 16.00 │
│ Input sequence length │ 549.66 │ 548.00 │ 551.00 │ 550.00 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │ 45.84 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request throughput (per sec) │ 2.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
2024-10-14 22:44 [INFO] genai_perf.export_data.json_exporter:62 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json
2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
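genai-perf writes its results under the artifacts/ directory shown in the log above. If you want to post-process them, a small sketch that loads the exported JSON without assuming anything about its internal layout:
import json
from pathlib import Path

# Path taken from the genai-perf log output above; adjust it to your run.
export_path = Path(
    "artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json"
)

with export_path.open() as f:
    results = json.load(f)

# Print the top-level sections to see which metrics were exported.
for key in results:
    print(key)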
Use the OpenAI python client directly
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:9000/v1",
api_key="EMPTY",
)
model = "llama-3.1-8b-instruct"
completion = client.chat.completions.create(
model=model,
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{"role": "user", "content": "What are LLMs?"},
],
max_tokens=256,
)
print(completion.choices[0].message.content)
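Streaming works through the same client; a minimal sketch, assuming the frontend honors the standard OpenAI stream parameter (the same streaming path exercised by genai-perf --streaming above):
from openai import OpenAI

# Streaming chat completion; assumes the standard OpenAI `stream` parameter.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What are LLMs?"}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta of the assistant message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()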
Run the tests (NOTE: the server should not be running; the tests will handle starting/stopping the server as necessary)
cd /opt/tritonserver/python/openai/
pip install -r requirements-test.txt
pytest -v tests/
TensorRT-LLM#
Prepare your model repository for a TensorRT-LLM model, build the engine, etc. You can try any of the available options, such as the Triton CLI or the TRT-LLM backend quickstart.
Launch the container
Mounts ~/.cache/huggingface so downloaded models are re-used across runs, containers, etc. Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
-e TRTLLM_ORCHESTRATOR=1 \
nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
Install dependencies inside the container
# Install python bindings for tritonserver and tritonfrontend
pip install /opt/tritonserver/python/triton*.whl
# Install application requirements
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
Launch the OpenAI server
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository path/to/models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
Send a /v1/chat/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
# MODEL should be the client-facing model name in your model repository for a pipeline like TRT-LLM.
# For example, this could also be "ensemble", or something like "gpt2" if generated from Triton CLI
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output
{
"id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "It looks like you're testing the system!",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1728948689,
"model": "llama-3-8b-instruct",
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
}
The other examples should be the same as for vLLM, except that you should set MODEL="tensorrt_llm_bls" or MODEL="ensemble" everywhere applicable, as shown in the example request above.
KServe Frontends#
To support serving requests through both the OpenAI-Compatible and KServe Predict v2 frontends against the same running Triton Inference Server, the tritonfrontend python bindings are also included in this application for optional use.
You can opt in to these additional frontends, assuming tritonfrontend is installed, with --enable-kserve-frontends like below
python3 openai_frontend/main.py \
--model-repository tests/vllm_models \
--tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-kserve-frontends
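With the KServe frontends enabled, the same running server can also be reached over the KServe Predict v2 protocol. As a quick check, the sketch below polls the standard v2 readiness endpoint, assuming the KServe HTTP frontend is listening on its default port 8000 (check the --help output below for the actual defaults):
import urllib.request

# Assumes the KServe HTTP frontend is listening on its default port 8000.
# /v2/health/ready is the standard KServe Predict v2 readiness endpoint.
url = "http://localhost:8000/v2/health/ready"
with urllib.request.urlopen(url) as resp:
    print(f"{url} -> HTTP {resp.status}")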
See python3 openai_frontend/main.py --help for more information on the available arguments and default values.
For more information on the tritonfrontend python bindings, see the docs located here.
Model Parallelism Support#
[x] vLLM (EngineArgs)
ex: Setting tensor_parallel_size: 2 in model.json (see the sketch after this list)
[x] TensorRT-LLM (Orchestrator Mode)
Set the following environment variable:
export TRTLLM_ORCHESTRATOR=1
[ ] TensorRT-LLM (Leader Mode)
Not currently supported
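As referenced in the vLLM item above, tensor parallelism for vLLM is configured through the EngineArgs fields in the model's model.json. A minimal illustrative sketch; the field values are placeholders, not a recommended configuration:
import json

# Illustrative only: EngineArgs for a vLLM model.json with tensor parallelism.
# Keep whatever your existing model.json already contains and only add or
# adjust tensor_parallel_size; the model value below is a placeholder.
engine_args = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "tensor_parallel_size": 2,
}
print(json.dumps(engine_args, indent=2))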