# OpenAI-Compatible Frontend for Triton Inference Server (Beta)

> [!NOTE]
> The OpenAI-Compatible API is currently in BETA. Its features and functionality are subject to change as we collect feedback. We're excited to hear any thoughts you have and what features you'd like to see!

## Pre-requisites

  1. Docker + NVIDIA Container Runtime

  2. A correctly configured HF_TOKEN for access to HuggingFace models.
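
For example, a minimal sanity check of both prerequisites might look like the following sketch; the token value is a placeholder, and the GPU check assumes the NVIDIA Container Toolkit is configured for Docker:

# Make the HuggingFace token available to later docker run commands (placeholder value)
export HF_TOKEN=<your_huggingface_token>

# Confirm Docker can access the GPUs through the NVIDIA runtime
docker run --rm --gpus all ubuntu nvidia-smi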

## VLLM

  1. Launch the container and install dependencies:

  • Mounts ~/.cache/huggingface to re-use downloaded models across runs, containers, etc.

  • Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.

docker run -it --net=host --gpus all --rm \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN \
  nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3

  2. Launch the OpenAI-compatible Triton Inference Server:

cd /opt/tritonserver/python/openai

# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
Example output:
...
+-----------------------+---------+--------+
| Model                 | Version | Status |
+-----------------------+---------+--------+
| llama-3.1-8b-instruct | 1       | READY  | <- Correct Model Loaded in Triton
+-----------------------+---------+--------+
...
Found model: name='llama-3.1-8b-instruct', backend='vllm'
[WARNING] Adding CORS for the following origins: ['https://127.0.0.1']
INFO:     Started server process [126]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully

  3. Send a /v1/chat/completions request:

  • Note the use of jq is optional, but provides a nicely formatted output for JSON responses.

MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output:
{
  "id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message":
      {
        "content": "This is only a test.",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1727679085,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}
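
Streaming responses can be requested by adding the standard OpenAI stream field to the same request; a minimal sketch (the -N flag just disables curl's output buffering), assuming streaming is enabled for the model:

MODEL="llama-3.1-8b-instruct"
curl -s -N http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "stream": true
}'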

  4. Send a /v1/completions request:

  • Note the use of jq is optional, but provides a nicely formatted output for JSON responses.

MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "prompt": "Machine learning is"
}' | jq
Example output:
{
  "id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " a field of computer science that focuses on developing algorithms that allow computers to learn from"
    }
  ],
  "created": 1727679266,
  "model": "llama-3.1-8b-instruct",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": null
}
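
Common OpenAI sampling fields can also be passed in the request body; a sketch using max_tokens and temperature, assuming the backend honors these fields as in the OpenAI API (values are illustrative):

MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "prompt": "Machine learning is",
  "max_tokens": 64,
  "temperature": 0.7
}' | jq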

  5. Benchmark with genai-perf:

  • To install genai-perf in this container, see the instructions here

  • Or try using genai-perf from the SDK container

MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
genai-perf profile \
  --model ${MODEL} \
  --tokenizer ${TOKENIZER} \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:9000 \
  --streaming
Example output:
2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
                              NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time to first token (ms) │  71.66 │  64.32 │  86.52 │  76.13 │  74.92 │  73.26 │
│          Inter token latency (ms) │  18.47 │  18.25 │  18.72 │  18.67 │  18.61 │  18.53 │
│              Request latency (ms) │ 348.00 │ 274.60 │ 362.27 │ 355.41 │ 352.29 │ 350.66 │
│            Output sequence length │  15.96 │  12.00 │  16.00 │  16.00 │  16.00 │  16.00 │
│             Input sequence length │ 549.66 │ 548.00 │ 551.00 │ 550.00 │ 550.00 │ 550.00 │
│ Output token throughput (per sec) │  45.84 │    N/A │    N/A │    N/A │    N/A │    N/A │
│      Request throughput (per sec) │   2.87 │    N/A │    N/A │    N/A │    N/A │    N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
2024-10-14 22:44 [INFO] genai_perf.export_data.json_exporter:62 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.json
2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv

  6. Use the OpenAI python client directly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="EMPTY",
)

model = "llama-3.1-8b-instruct"
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "What are LLMs?"},
    ],
    max_tokens=256,
)

print(completion.choices[0].message.content)
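
To discover which models the frontend is serving, a minimal sketch, assuming the standard OpenAI /v1/models listing endpoint is implemented:

curl -s http://localhost:9000/v1/models | jq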

  7. Run tests (NOTE: the server should not already be running; the tests will handle starting/stopping the server as needed):

cd /opt/tritonserver/python/openai/
pip install -r requirements-test.txt

pytest -v tests/
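
To iterate on a subset of the tests, the usual pytest selection flags can be used; the name pattern below is only an illustration:

# Run only tests whose names match a pattern, stopping at the first failure
pytest -v -x -k "chat" tests/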

## TensorRT-LLM

  1. Prepare your model repository for serving a TensorRT-LLM model, build the engines, etc. You can try any of the following options:

  2. Launch the container:

  • Mounts ~/.cache/huggingface to re-use downloaded models across runs, containers, etc.

  • Sets the HF_TOKEN environment variable to access gated models; make sure this is set in your local environment if needed.

docker run -it --net=host --gpus all --rm \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN \
  -e TRTLLM_ORCHESTRATOR=1 \
  nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3

  3. Install dependencies inside the container:

# Install python bindings for tritonserver and tritonfrontend
pip install /opt/tritonserver/python/triton*.whl

# Install application requirements
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt

  4. Launch the OpenAI server:

# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository path/to/models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct

  5. Send a /v1/chat/completions request:

  • Note the use of jq is optional, but provides a nicely formatted output for JSON responses.

# MODEL should be the client-facing model name in your model repository for a pipeline like TRT-LLM.
# For example, this could also be "ensemble", or something like "gpt2" if generated from Triton CLI
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
Example output:
{
  "id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "It looks like you're testing the system!",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": null
    }
  ],
  "created": 1728948689,
  "model": "llama-3-8b-instruct",
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}

The other examples should be the same as the vLLM ones, except that you should set MODEL="tensorrt_llm_bls" or MODEL="ensemble" wherever applicable, as shown in the example request above.
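
For example, the /v1/completions request from the vLLM section becomes the following sketch, assuming your model repository exposes a client-facing model named ensemble:

MODEL="ensemble"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "prompt": "Machine learning is"
}' | jq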

## KServe Frontends

To support serving requests through both the OpenAI-Compatible and KServe Predict v2 frontends to the same running Triton Inference Server, the tritonfrontend python bindings are also included in this application for optional use.

You can opt in to these additional frontends, assuming tritonfrontend is installed, with --enable-kserve-frontends like below:

python3 openai_frontend/main.py \
  --model-repository tests/vllm_models \
  --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-kserve-frontends
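
With the KServe frontends enabled, KServe Predict v2 requests can be sent to the same running server alongside OpenAI requests; a minimal sketch, assuming the KServe HTTP frontend is listening on Triton's default port 8000 and the vLLM example model is loaded:

# Readiness check over the KServe HTTP/REST API (default port 8000 assumed)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready

# Model metadata for the same model served through the OpenAI frontend
curl -s http://localhost:8000/v2/models/llama-3.1-8b-instruct | jq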

For more information on the available arguments and default values, see python3 openai_frontend/main.py --help.

For more information on the tritonfrontend python bindings, see the docs here.

## Model Parallelism Support