Constrained Decoding with Triton Inference Server#

This tutorial focuses on constrained decoding, an important technique for ensuring that the outputs generated by large language models (LLMs) adhere to strict formatting requirements — requirements that may be difficult or expensive to achieve through fine-tuning alone.

Table of Contents#

- Introduction to Constrained Decoding
- Prerequisite: Hermes-2-Pro-Llama-3-8B
- Structured Generation via Prompt Engineering
- Enforcing Output Format via External Libraries
  - LM Format Enforcer
  - Outlines

Introduction to Constrained Decoding#

Constrained decoding is a powerful technique used in natural language processing and various AI applications to guide and control a model's output. By imposing specific constraints, this method ensures that generated outputs conform to predefined criteria, such as length, format, or content restrictions. This capability is essential in contexts where compliance with rules is non-negotiable, such as generating valid code snippets, structured data, or grammatically correct sentences.
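
To make the mechanism concrete before diving into the tooling: virtually all constrained-decoding implementations share one core trick. At each decoding step, the logits of tokens that would violate the constraint are masked to negative infinity before the next token is chosen. The toy sketch below illustrates this with a hypothetical 4-token vocabulary; all names and values are illustrative.

import torch

def constrained_greedy_step(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Pick the highest-scoring token among those the constraint allows."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                     # allowed tokens keep their scores
    return int(torch.argmax(logits + mask))     # everything else becomes -inf

logits = torch.tensor([2.0, 0.5, 1.7, -1.0])    # scores over a toy 4-token vocabulary
print(constrained_greedy_step(logits, allowed_ids=[1, 3]))  # -> 1; token 0 is masked out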

In recent advancements, some models have been fine-tuned to incorporate these constraints inherently. Such models are designed to integrate constraints seamlessly during generation, reducing the need for extensive post-processing. By doing so, they improve the efficiency and accuracy of tasks that demand strict adherence to predefined rules. This built-in capability makes them particularly valuable in applications like automated content creation, data validation, and real-time language translation, where precision and reliability are paramount.

This tutorial is based on Hermes-2-Pro-Llama-3-8B, which already supports JSON-structured outputs. For detailed instructions on deploying the Hermes-2-Pro-Llama-3-8B model with Triton Inference Server and the TensorRT-LLM backend, please refer to this tutorial. In this case, the structure and quality of the generated output can be controlled through prompt engineering. To explore this path, refer to the Structured Generation via Prompt Engineering section of this tutorial.

For scenarios where the model itself is not fine-tuned for constrained decoding, or when more precise control over the output is needed, dedicated libraries like LM Format Enforcer and Outlines offer robust solutions. These libraries provide tools to enforce specific constraints on model outputs, allowing developers to tailor the generation process to precise requirements. By leveraging such libraries, users gain greater control over the output, ensuring it aligns exactly with the desired criteria, whether that involves maintaining a particular format, adhering to content guidelines, or ensuring syntactic correctness. In this tutorial, we'll show how to use LM Format Enforcer and Outlines in your workflow.

Prerequisite: Hermes-2-Pro-Llama-3-8B#

Before proceeding, please make sure that you have successfully deployed the Hermes-2-Pro-Llama-3-8B model with Triton Inference Server and the TensorRT-LLM backend following these steps.

Structured Generation via Prompt Engineering#

First, let's start the Triton SDK container:

# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tutorials:/tutorials \
    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

The provided client script uses the pydantic library, which is not included in the SDK container. Make sure to install it before proceeding:

pip install pydantic

Example 1#

For a fine-tuned model, we can enable JSON mode by simply composing the system prompt as follows:

You are a helpful assistant that answers in JSON.

Please refer to client.py for the full prompt composition logic.
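
As a rough sketch of what that composition amounts to (client.py is the reference implementation): Hermes-2-Pro uses the ChatML chat template, so the system message is prepended to the user prompt roughly as shown below. In practice, prefer tokenizer.apply_chat_template to hand-rolling the string.

system_prompt = "You are a helpful assistant that answers in JSON."
user_prompt = "Give me information about Harry Potter and the Order of Phoenix"

# Hand-rolled ChatML-style prompt for illustration only; client.py is the
# authoritative implementation.
prompt = (f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
          f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
          f"<|im_start|>assistant\n")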

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt

You should receive the following response:

...
assistant
{
  "title": "Harry Potter and the Order of Phoenix",
  "book_number": 5,
  "author": "J.K. Rowling",
  "series": "Harry Potter",
  "publication_date": "June 21, 2003",
  "page_count": 766,
  "publisher": "Arthur A. Levine Books",
  "genre": [
    "Fantasy",
    "Adventure",
    "Young Adult"
  ],
  "awards": [
    {
      "award_name": "British Book Award",
      "category": "Children's Book of the Year",
      "year": 2004
    }
  ],
  "plot_summary": "Harry Potter and the Order of Phoenix is the fifth book in the Harry Potter series. In this installment, Harry returns to Hogwarts School of Witchcraft and Wizardry for his fifth year. The Ministry of Magic is in denial about the return of Lord Voldemort, and Harry finds himself battling against the

Example 2#

Optionally, we can also restrict the output to a specific schema. For example, in client.py we use the pydantic library to define the following answer format:

from pydantic import BaseModel

class AnswerFormat(BaseModel):
    title: str
    year: int
    director: str
    producer: str
    plot: str

...

prompt += "Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema>".format(
                schema=AnswerFormat.model_json_schema())
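
For reference, AnswerFormat.model_json_schema() produces a plain JSON Schema dictionary that gets pasted between the <schema> tags; for the model above it looks roughly like this (abbreviated):

{'properties': {'title': {'title': 'Title', 'type': 'string'},
                'year': {'title': 'Year', 'type': 'integer'},
                ...
                'plot': {'title': 'Plot', 'type': 'string'}},
 'required': ['title', 'year', 'director', 'producer', 'plot'],
 'title': 'AnswerFormat', 'type': 'object'}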

Let's try it out:

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt --use-schema

You should receive the following response:

 ...
assistant
{
  "title": "Harry Potter and the Order of Phoenix",
  "year": 2007,
  "director": "David Yates",
  "producer": "David Heyman",
  "plot": "Harry Potter and his friends must protect Hogwarts from a threat when the Ministry of Magic is taken over by Lord Voldemort's followers."
}

Enforcing Output Format via External Libraries#

In this section of the tutorial, we'll show how to impose constraints on LLMs that are not fine-tuned for constrained decoding. We'll use LM Format Enforcer and Outlines, both of which offer robust solutions.

Reference implementations for both libraries are provided in the utils.py script, which also defines the output format AnswerFormat:

class WandFormat(BaseModel):
    wood: str
    core: str
    length: float

class AnswerFormat(BaseModel):
    name: str
    house: str
    blood_status: str
    occupation: str
    alive: str
    wand: WandFormat

Prerequisite: Common Setup#

Make sure you have successfully deployed the Hermes-2-Pro-Llama-3-8B model with Triton Inference Server and the TensorRT-LLM backend following these steps.

[!IMPORTANT] Make sure that the tutorials folder is mounted to /tutorials when you start the Docker container.

After a successful setup, you should have the /opt/tritonserver/inflight_batcher_llm folder and be able to run a couple of inference requests (e.g., those provided in Example 1 and Example 2).

We'll be making some adjustments to the model files, so if you have a server running, you can stop it with:

pkill tritonserver

Logits Post-Processor#

Both libraries work by limiting the set of tokens allowed at each generation step. In TensorRT-LLM, users can define a custom logits post-processor that masks out the logits of tokens that should never be used in the current generation step.
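
Concretely, a logits post-processor is a callable that edits the logits of a request in place. The sketch below shows the general shape, assuming the (req_id, logits, ids, stream_ptr, client_id) callback signature used by the reference implementations in artifacts/utils.py; verify it against the TensorRT-LLM version you have installed.

import torch

class AllowListLogitsProcessor:
    """Minimal sketch: only tokens in `allowed_ids` may ever be generated.
    Assumes the callback signature used in artifacts/utils.py."""

    def __init__(self, allowed_ids: list[int]):
        self.allowed_ids = list(allowed_ids)

    def __call__(self, req_id, logits, ids, stream_ptr, client_id):
        # Apply the mask on the generation CUDA stream, so that sampling,
        # which reads the logits on that stream, sees the masked values.
        with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
            mask = torch.full_like(logits, float("-inf"))
            mask[..., self.allowed_ids] = 0.0
            logits += mask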

For TensorRT-LLM models deployed via the python backend (i.e., when triton_backend is set to python in tensorrt_llm/config.pbtxt, Triton's python backend uses model.py to serve your TensorRT-LLM model), the custom logits processor should be specified as part of the Executor's configuration (logits_post_processor_map) during model initialization. Below is a sample for reference.

...

+ executor_config.logits_post_processor_map = {
+            "<custom_logits_processor_name>": custom_logits_processor
+           }
self.executor = trtllm.Executor(model_path=...,
                                model_type=...,
                                executor_config=executor_config)
...

Additionally, if you'd like to enable the logits post-processor for individual requests, you can do so via an extra input parameter. For example, in this tutorial we'll add logits_post_processor_name to inflight_batcher_llm/tensorrt_llm/config.pbtxt:

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  ...
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
- }
+ },
+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }
]
...

and handle it in the execute function in inflight_batcher_llm/tensorrt_llm/1/model.py:

def execute(self, requests):
    """`execute` must be implemented in every Python model. `execute`
    function receives a list of pb_utils.InferenceRequest as the only
    argument. This function is called when an inference is requested
    for this model.
    Parameters
    ----------
    requests : list
      A list of pb_utils.InferenceRequest
    Returns
    -------
    list
      A list of pb_utils.InferenceResponse. The length of this list must
      be the same as `requests`
    """
    ...

    for request in requests:
        response_sender = request.get_response_sender()
        if get_input_scalar_by_name(request, 'stop'):
            self.handle_stop_request(request.request_id(), response_sender)
        else:
            try:
                converted = convert_request(request,
                                            self.exclude_input_from_output,
                                            self.decoupled)
+               logits_post_processor_name = get_input_tensor_by_name(request, 'logits_post_processor_name')
+               if logits_post_processor_name is not None:
+                   converted.logits_post_processor_name = logits_post_processor_name.item().decode('utf-8')
            except Exception as e:
            ...

In this tutorial, we deploy the Hermes-2-Pro-Llama-3-8B model as part of an ensemble. This means a request is first processed by the ensemble model, which routes it through the pre-processing model, the tensorrt_llm model, and finally post-processing. This sequence, along with the input and output mappings, is defined in inflight_batcher_llm/ensemble/config.pbtxt. Therefore, we also need to update inflight_batcher_llm/ensemble/config.pbtxt so that the ensemble model properly passes the additional input parameter through to the tensorrt_llm model:

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  ...
  {
      name: "embedding_bias_weights"
      data_type: TYPE_FP32
      dims: [ -1 ]
      optional: true
- }
+ },
+ {
+   name: "logits_post_processor_name"
+   data_type: TYPE_STRING
+   dims: [ -1 ]
+   optional: true
+ }
]
output [
    ...
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
    ...
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      ...
      input_map {
        key: "bad_words_list"
        value: "_BAD_WORDS_IDS"
      }
+     input_map {
+       key: "logits_post_processor_name"
+       value: "logits_post_processor_name"
+     }
      output_map {
        key: "output_ids"
        value: "_TOKENS_BATCH"
      }
      ...
    }
    ...

If you are following along with this tutorial, make sure the same changes are incorporated into the corresponding files of your /opt/tritonserver/inflight_batcher_llm repository.

Tokenizer#

Both LM Format Enforcer and Outlines need access to the tokenizer at initialization time. In this tutorial, we'll expose the tokenizer via an inflight_batcher_llm/tensorrt_llm/config.pbtxt parameter:

parameters: {
  key: "tokenizer_dir"
  value: {
    string_value: "/Hermes-2-Pro-Llama-3-8B"
  }
}

Simply append it to the end of inflight_batcher_llm/tensorrt_llm/config.pbtxt.
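
A processor can then pick this parameter up from the parsed model config at initialization time and load the tokenizer with transformers. A minimal sketch (the actual wiring into get_executor_config is shown in the sections below):

from transformers import AutoTokenizer

def load_tokenizer(model_config: dict):
    # `model_config` is the parsed config.pbtxt that Triton's python backend
    # hands to the model at initialization time.
    tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
    return AutoTokenizer.from_pretrained(tokenizer_dir)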

Repository Setup#

We've already provided sample implementations for both LM Format Enforcer and Outlines in artifacts/utils.py. Make sure you've copied it into /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/lib via:

mkdir -p inflight_batcher_llm/tensorrt_llm/1/lib
cp /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py inflight_batcher_llm/tensorrt_llm/1/lib/

Finally, let's install all the required libraries:

pip install pydantic lm-format-enforcer outlines setuptools

LM Format Enforcer#

To use LM Format Enforcer, make sure inflight_batcher_llm/tensorrt_llm/1/model.py contains the following changes:

...
import tensorrt_llm.bindings.executor as trtllm

+ from lib.utils import LMFELogitsProcessor, AnswerFormat

...

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    ...

    def get_executor_config(self, model_config):
+       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
+       logits_processor = LMFELogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
        kwargs = {
            "max_beam_width":
            get_parameter(model_config, "max_beam_width", int),
            "scheduler_config":
            self.get_scheduler_config(model_config),
            "kv_cache_config":
            self.get_kv_cache_config(model_config),
            "enable_chunked_context":
            get_parameter(model_config, "enable_chunked_context", bool),
            "normalize_log_probs":
            get_parameter(model_config, "normalize_log_probs", bool),
            "batching_type":
            convert_batching_type(get_parameter(model_config,
                                                "gpt_model_type")),
            "parallel_config":
            self.get_parallel_config(model_config),
            "peft_cache_config":
            self.get_peft_cache_config(model_config),
            "decoding_config":
            self.get_decoding_config(model_config),
+            "logits_post_processor_map":{
+                LMFELogitsProcessor.PROCESSOR_NAME: logits_processor
+            }
        }
        kwargs = {k: v for k, v in kwargs.items() if v is not None}
        return trtllm.ExecutorConfig(**kwargs)
...
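
Under the hood, a processor like LMFELogitsProcessor builds a TokenEnforcer from the tokenizer and the JSON schema, and queries it at every step for the token ids that may legally come next; everything else gets masked. A condensed sketch of that flow, assuming the public lm-format-enforcer API (artifacts/utils.py is the reference implementation):

from pydantic import BaseModel
from transformers import AutoTokenizer
from lmformatenforcer import JsonSchemaParser, TokenEnforcer
from lmformatenforcer.integrations.transformers import build_token_enforcer_tokenizer_data

class AnswerFormat(BaseModel):  # abbreviated; the full schema lives in utils.py
    name: str
    house: str

tokenizer = AutoTokenizer.from_pretrained("/Hermes-2-Pro-Llama-3-8B")
tokenizer_data = build_token_enforcer_tokenizer_data(tokenizer)
enforcer = TokenEnforcer(tokenizer_data, JsonSchemaParser(AnswerFormat.model_json_schema()))

# Given the token ids seen so far (the first call establishes the prompt),
# LMFE returns the ids allowed next; the processor masks every other token.
prompt_ids = tokenizer.encode("Who is Harry Potter?")
allowed_next = enforcer.get_allowed_tokens(prompt_ids)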

Send an Inference Request#

First, let's start the Triton SDK container:

# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tutorials/:/tutorials \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

The provided client script uses the pydantic library, which is not included in the SDK container. Make sure to install it before proceeding:

pip install pydantic

Option 1. Use the provided client script#

First, let's send a standard request without enforcing the JSON answer format:

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100

You should receive the following response:

Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and

Now, let's specify logits_post_processor_name in the request:

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "lmfe"

This time, the expected response looks as follows:

Who is Harry Potter?
		{
			"name": "Harry Potter",
			"occupation": "Wizard",
			"house": "Gryffindor",
			"wand": {
				"wood": "Holly",
				"core": "Phoenix feather",
				"length": 11
			},
			"blood_status": "Pure-blood",
			"alive": "Yes"
		}

As we can see, the schema defined in utils.py is respected. Note that LM Format Enforcer lets the LLM control the order of the generated fields, so re-ordering of fields is allowed.

Option 2. Use the generate endpoint#

First, let's send a standard request without enforcing the JSON answer format:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

You should receive the following response:

{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"}

Now, let's specify logits_post_processor_name in the request:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "lmfe"}'

This time, the expected response looks as follows:

{"context_logits":0.0,...,"text_output":"Who is Harry Potter?  \t\t\t\n\t\t{\n\t\t\t\"name\": \"Harry Potter\",\n\t\t\t\"occupation\": \"Wizard\",\n\t\t\t\"house\": \"Gryffindor\",\n\t\t\t\"wand\": {\n\t\t\t\t\"wood\": \"Holly\",\n\t\t\t\t\"core\": \"Phoenix feather\",\n\t\t\t\t\"length\": 11\n\t\t\t},\n\t\t\t\"blood_status\": \"Pure-blood\",\n\t\t\t\"alive\": \"Yes\"\n\t\t}\n\n\t\t\n\n\n\n\t\t\n"}

Outlines#

To use Outlines, make sure inflight_batcher_llm/tensorrt_llm/1/model.py contains the following changes:

...
import tensorrt_llm.bindings.executor as trtllm

+ from lib.utils import OutlinesLogitsProcessor, AnswerFormat

...

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    ...

    def get_executor_config(self, model_config):
+       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
+       logits_processor = OutlinesLogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
        kwargs = {
            "max_beam_width":
            get_parameter(model_config, "max_beam_width", int),
            "scheduler_config":
            self.get_scheduler_config(model_config),
            "kv_cache_config":
            self.get_kv_cache_config(model_config),
            "enable_chunked_context":
            get_parameter(model_config, "enable_chunked_context", bool),
            "normalize_log_probs":
            get_parameter(model_config, "normalize_log_probs", bool),
            "batching_type":
            convert_batching_type(get_parameter(model_config,
                                                "gpt_model_type")),
            "parallel_config":
            self.get_parallel_config(model_config),
            "peft_cache_config":
            self.get_peft_cache_config(model_config),
            "decoding_config":
            self.get_decoding_config(model_config),
+            "logits_post_processor_map":{
+                OutlinesLogitsProcessor.PROCESSOR_NAME: logits_processor
+            }
        }
        kwargs = {k: v for k, v in kwargs.items() if v is not None}
        return trtllm.ExecutorConfig(**kwargs)
...
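
Unlike LM Format Enforcer's incremental parser, Outlines compiles the JSON schema into a regular expression and then into a finite-state machine over the tokenizer's vocabulary, which is why the generated fields follow the schema's order (see the note further below). A sketch of the first step, assuming a recent Outlines release (the helper has been renamed across versions, e.g. build_regex_from_object in older ones):

import json
from pydantic import BaseModel
from outlines.fsm.json_schema import build_regex_from_schema  # name varies by version

class AnswerFormat(BaseModel):  # abbreviated; the full schema lives in utils.py
    name: str
    house: str

# `regex` matches exactly the JSON documents that satisfy AnswerFormat;
# Outlines then turns it into a token-level FSM used to mask logits.
regex = build_regex_from_schema(json.dumps(AnswerFormat.model_json_schema()))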

Send an Inference Request#

First, let's start the Triton SDK container:

# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tutorials/:/tutorials \
    nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

The provided client script uses the pydantic library, which is not included in the SDK container. Make sure to install it before proceeding:

pip install pydantic

Option 1. Use the provided client script#

First, let's send a standard request without enforcing the JSON answer format:

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100

You should receive the following response:

Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and

Now, let's specify logits_post_processor_name in the request:

python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "outlines"

This time, the expected response looks as follows:

Who is Harry Potter?{ "name": "Harry Potter","house": "Gryffindor","blood_status": "Pure-blood","occupation": "Wizards","alive": "No","wand": {"wood": "Holly","core": "Phoenix feather","length": 11 }}

As we can see, the schema defined in utils.py is respected. Note that, unlike LM Format Enforcer, Outlines does not let the LLM re-order fields: generation follows the field order of the schema definition.

Option 2. Use the generate endpoint#

First, let's send a standard request without enforcing the JSON answer format:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

You should receive the following response:

{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"}

Now, let's specify logits_post_processor_name in the request:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "outlines"}'

This time, the expected response looks as follows:

{"context_logits":0.0,...,"text_output":"Who is Harry Potter?{ \"name\": \"Harry Potter\",\"house\": \"Gryffindor\",\"blood_status\": \"Pure-blood\",\"occupation\": \"Wizards\",\"alive\": \"No\",\"wand\": {\"wood\": \"Holly\",\"core\": \"Phoenix feather\",\"length\": 11 }}"}