OpenAI-Compatible Frontend for Triton Inference Server (Beta)#
[!NOTE] The OpenAI-Compatible API is currently in BETA. Its features and functionality are subject to change as we collect feedback. We're excited to hear any thoughts you have and which features you'd like to see!
Prerequisites#
Docker + NVIDIA Container Runtime
A correctly configured HF_TOKEN for access to HuggingFace models. The current examples and tests primarily use the meta-llama/Meta-Llama-3.1-8B-Instruct model, but you can manually bring your own models and adjust accordingly (a quick HF_TOKEN sanity check is sketched below).
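A simple way to confirm HF_TOKEN is set before launching any of the containers below, using only the Python standard library; this check is illustrative and not part of the official setup:
import os
import sys

# Illustrative sanity check (not part of the official setup): confirm HF_TOKEN
# is set so gated models such as meta-llama/Meta-Llama-3.1-8B-Instruct can be
# downloaded inside the container (docker run -e HF_TOKEN forwards it).
token = os.environ.get("HF_TOKEN")
if not token:
    sys.exit("HF_TOKEN is not set; export it before starting the container.")
print(f"HF_TOKEN is set ({len(token)} characters) and will be forwarded to the container.")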
VLLM#
Launch the container and install dependencies
Mounts ~/.cache/huggingface so downloaded models are re-used across runs, containers, etc. Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3
Launch the OpenAI-compatible Triton Inference Server
cd /opt/tritonserver/python/openai
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
Example output
...
+-----------------------+---------+--------+
| Model | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1 | READY | <- Correct Model Loaded in Triton
+-----------------------+---------+--------+
...
Found model: name='llama-3.1-8b-instruct', backend='vllm'
[WARNING] Adding CORS for the following origins: ['http://localhost']
INFO: Started server process [126]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
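Once Uvicorn reports that it is running, you can verify the model is available before sending inference requests. The sketch below assumes the frontend exposes the standard OpenAI GET /v1/models route on the default port 9000 used in this guide, and uses only the Python standard library:
import json
import urllib.request

# List the models served by the OpenAI-compatible frontend.
# Assumes the standard OpenAI `GET /v1/models` route on the default port 9000.
with urllib.request.urlopen("http://localhost:9000/v1/models") as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model.get("id"))  # expect: llama-3.1-8b-instruct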
Send a /v1/chat/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output
{
"id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message":
{
"content": "This is only a test.",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1727679085,
"model": "llama-3.1-8b-instruct",
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
}
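If you would rather issue the same request from Python without installing any client libraries, here is a standard-library sketch of the equivalent POST; the payload fields mirror the curl example above:
import json
import urllib.request

MODEL = "llama-3.1-8b-instruct"

# Same /v1/chat/completions request as the curl example, sent with urllib.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say this is a test!"}],
}
request = urllib.request.Request(
    "http://localhost:9000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    response = json.load(resp)

print(response["choices"][0]["message"]["content"])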
Send a /v1/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"prompt": "Machine learning is"
}' | jq
Example output
{
"id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": " a field of computer science that focuses on developing algorithms that allow computers to learn from"
}
],
"created": 1727679266,
"model": "llama-3.1-8b-instruct",
"system_fingerprint": null,
"object": "text_completion",
"usage": null
}
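The /v1/completions route can also be driven with the OpenAI python client introduced later in this guide; a minimal sketch, assuming the openai package is installed (pip install openai):
from openai import OpenAI

# Assumes `pip install openai` and the frontend listening on port 9000.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="llama-3.1-8b-instruct",
    prompt="Machine learning is",
)
print(completion.choices[0].text)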
Benchmark with genai-perf
MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
genai-perf profile \
--model ${MODEL} \
--tokenizer ${TOKENIZER} \
--service-kind openai \
--endpoint-type chat \
--url localhost:9000 \
--streaming
Example output
2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time to first token (ms) │ 71.66 │ 64.32 │ 86.52 │ 76.13 │ 74.92 │ 73.26 │
│ Inter token latency (ms) │ 18.47 │ 18.25 │ 18.72 │ 18.67 │ 18.61 │ 18.53 │
│ Request latency (ms) │ 348.00 │ 274.60 │ 362.27 │ 355.41 │ 352.29 │ 350.66 │
│ Output sequence length │ 15.96 │ 12.00 │ 16.00 │ 16.00 │ 16.00 │ 16.00 │
│ Input sequence length │ 549.66 │ 548.00 │ 551.00 │ 550.00 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │ 45.84 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request throughput (per sec) │ 2.87 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
2024-10-14 22:44 [INFO] genai_perf.export_data.json_exporter:62 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json
2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
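genai-perf writes its results under the artifacts/ directory shown in the log above. If you want to post-process them, a small sketch that loads the exported JSON without assuming anything about its internal layout:
import json
from pathlib import Path

# Path taken from the genai-perf log output above; adjust it to your run.
export_path = Path(
    "artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json"
)

with export_path.open() as f:
    results = json.load(f)

# Print the top-level sections to see which metrics were exported.
for key in results:
    print(key)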
Use the OpenAI python client directly
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:9000/v1",
api_key="EMPTY",
)
model = "llama-3.1-8b-instruct"
completion = client.chat.completions.create(
model=model,
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{"role": "user", "content": "What are LLMs?"},
],
max_tokens=256,
)
print(completion.choices[0].message.content)
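Streaming works through the same client; a minimal sketch, assuming the frontend honors the standard OpenAI stream parameter (the same streaming path exercised by genai-perf --streaming above):
from openai import OpenAI

# Streaming chat completion; assumes the standard OpenAI `stream` parameter.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What are LLMs?"}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta of the assistant message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()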
Run the tests (NOTE: the server should not be running; the tests will handle starting/stopping the server as necessary)
cd /opt/tritonserver/python/openai/
pip install -r requirements-test.txt
pytest -v tests/
TensorRT-LLM#
Prepare your model repository for a TensorRT-LLM model, build the engine, etc. You can try any of the available options, such as the Triton CLI or the TRT-LLM backend quickstart.
Launch the container
Mounts ~/.cache/huggingface so downloaded models are re-used across runs, containers, etc. Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
-e TRTLLM_ORCHESTRATOR=1 \
nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
Install dependencies inside the container
# Install python bindings for tritonserver and tritonfrontend
pip install /opt/tritonserver/python/triton*.whl
# Install application requirements
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
Launch the OpenAI server
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository path/to/models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
Send a /v1/chat/completions request
Note the use of jq is optional, but provides a nicely formatted output for the JSON responses.
# MODEL should be the client-facing model name in your model repository for a pipeline like TRT-LLM.
# For example, this could also be "ensemble", or something like "gpt2" if generated from Triton CLI
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output
{
"id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "It looks like you're testing the system!",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1728948689,
"model": "llama-3-8b-instruct",
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
}
The other examples should be the same as for vLLM, except that you should set MODEL="tensorrt_llm_bls" or MODEL="ensemble" everywhere applicable, as shown in the example request above.
KServe Frontends#
To support serving requests through both the OpenAI-Compatible and KServe Predict v2 frontends against the same running Triton Inference Server, the tritonfrontend python bindings are also included in this application for optional use.
You can opt in to these additional frontends, assuming tritonfrontend is installed, with --enable-kserve-frontends like below
python3 openai_frontend/main.py \
--model-repository tests/vllm_models \
--tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-kserve-frontends
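With the KServe frontends enabled, the same running server can also be reached over the KServe Predict v2 protocol. As a quick check, the sketch below polls the standard v2 readiness endpoint, assuming the KServe HTTP frontend is listening on its default port 8000 (check the --help output below for the actual defaults):
import urllib.request

# Assumes the KServe HTTP frontend is listening on its default port 8000.
# /v2/health/ready is the standard KServe Predict v2 readiness endpoint.
url = "http://localhost:8000/v2/health/ready"
with urllib.request.urlopen(url) as resp:
    print(f"{url} -> HTTP {resp.status}")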
See python3 openai_frontend/main.py --help for more information on the available arguments and default values.
For more information on the tritonfrontend python bindings, see the docs located here.
Model Parallelism Support#
[x] vLLM (EngineArgs)
ex: Setting tensor_parallel_size: 2 in model.json (see the sketch after this list)
[x] TensorRT-LLM (Orchestrator Mode)
Set the following environment variable:
export TRTLLM_ORCHESTRATOR=1
[ ] TensorRT-LLM (Leader Mode)
Not currently supported
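As referenced in the vLLM item above, tensor parallelism for vLLM is configured through the EngineArgs fields in the model's model.json. A minimal illustrative sketch; the field values are placeholders, not a recommended configuration:
import json

# Illustrative only: EngineArgs for a vLLM model.json with tensor parallelism.
# Keep whatever your existing model.json already contains and only add or
# adjust tensor_parallel_size; the model value below is a placeholder.
engine_args = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "tensor_parallel_size": 2,
}
print(json.dumps(engine_args, indent=2))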