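Deploying a Llama2-7B Model with Triton and vLLM#

The vLLM backend uses vLLM to run inference. For more about vLLM, see the vLLM documentation; for more about the backend, see the vLLM backend repository.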

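Pre-build Instructions#

This tutorial uses the Llama2-7B HuggingFace model with pre-trained weights. Follow the pre-build instructions in README.md, which also links to guides for running Llama with other backends.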

Installation#

The Triton vLLM container can be pulled from NGC and started with the following command:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v $PWD/llama2vllm:/opt/tritonserver/model_repository/llama2vllm \
    nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3

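This creates a /opt/tritonserver/model_repository folder inside the container that contains the llama2vllm model, mounted from the local llama2vllm directory. The model weights themselves are pulled from HuggingFace.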
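The mounted llama2vllm directory needs to hold the model's Triton configuration before the container starts. As a rough sketch, assuming the layout used by the vLLM backend samples (a `config.pbtxt` next to a `1/model.json` that carries the vLLM engine arguments), the directory could be generated like this. The helper `write_model_dir`, the HuggingFace model ID `meta-llama/Llama-2-7b-hf`, and the `gpu_memory_utilization` value are assumptions for illustration and are not stated in this tutorial.

```python
import json
from pathlib import Path

# Assumed contents, modeled on the vLLM backend samples; adjust to your setup.
CONFIG_PBTXT = '''backend: "vllm"
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
'''

ENGINE_ARGS = {
    "model": "meta-llama/Llama-2-7b-hf",  # assumed HuggingFace model ID
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,  # assumed fraction of GPU memory given to vLLM
}


def write_model_dir(root: Path) -> Path:
    """Create <root>/llama2vllm with config.pbtxt and 1/model.json (hypothetical helper)."""
    model_dir = root / "llama2vllm"
    version_dir = model_dir / "1"
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(CONFIG_PBTXT)
    (version_dir / "model.json").write_text(json.dumps(ENGINE_ARGS, indent=2))
    return model_dir
```

Calling `write_model_dir(Path.cwd())` before `docker run` would produce a `./llama2vllm` directory matching the `-v $PWD/llama2vllm:...` mount above.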

Once inside the container, install huggingface-cli and log in with your own credentials.

pip install --upgrade huggingface_hub
huggingface-cli login --token <your huggingface access token>

Serving with Triton#

You can then run tritonserver as usual:

tritonserver --model-repository model_repository

The server has started successfully when you see the following output in the console:

I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
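Instead of watching the console, readiness can also be probed over HTTP. The sketch below assumes the server exposes the standard `/v2/health/ready` route on the HTTP port shown in the log (8000); the function names `ready_url` and `is_server_ready` are made up for this example.

```python
import urllib.request


def ready_url(base_url: str) -> str:
    """Return the readiness URL for a Triton HTTP endpoint."""
    return base_url.rstrip("/") + "/v2/health/ready"


def is_server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

A startup script could call `is_server_ready()` in a loop with a short sleep until it returns `True`, then begin sending requests.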

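Sending Requests via the `generate` Endpoint#

As a quick check that the server is working, you can send a test request to the generate endpoint. For more information about the generate endpoint, see the Triton documentation.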

$ curl -X POST localhost:8000/v2/models/llama2vllm/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
    "model_name":"llama2vllm",
    "model_version":"1",
    "text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
  }
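The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above (same URL, same JSON body) and reads the `text_output` field seen in the response; the function names `build_generate_request` and `generate` are illustrative, not part of any Triton client library.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama2vllm", base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": 0},
    }
    return urllib.request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=60.0):
    """Send the prompt to a running server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["text_output"]
```

With the server running, `generate("What is Triton Inference Server?")` would return the `text_output` string from the JSON response.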

Sending Requests via the Triton Client#

The Triton vLLM backend repository has a samples folder containing an example client.py for testing the Llama2 model.

pip3 install tritonclient[all]
# Assumes tritonserver is already running
$ git clone https://github.com/triton-inference-server/vllm_backend.git
$ cd vllm_backend/samples
$ python3 client.py -m llama2vllm

The steps above should produce a results.txt file with the following content:

Hello, my name is
I am a 20 year old student from the Netherlands. I am currently

=========

The most dangerous animal is
The most dangerous animal is the one that is not there.
The most dangerous

=========

The capital of France is
The capital of France is Paris.
The capital of France is Paris. The

=========

The future of AI is
The future of AI is in the hands of the people who use it.

=========
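To inspect the output programmatically, results.txt can be split into one block per request. The sketch below assumes the file uses the `=========` separator lines exactly as in the sample output above; `split_results` is a hypothetical helper and not part of the sample client.

```python
def split_results(text: str, separator: str = "=========") -> list[str]:
    """Split results.txt content into one stripped block per request."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip() == separator:
            # A separator line closes the block collected so far.
            block = "\n".join(current).strip()
            if block:
                blocks.append(block)
            current = []
        else:
            current.append(line)
    # Keep any trailing text that is not followed by a separator.
    tail = "\n".join(current).strip()
    if tail:
        blocks.append(tail)
    return blocks
```

For example, `split_results(open("results.txt").read())` yields a list of strings, each containing one prompt together with its generated continuation.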
