Python Backend#

The Triton backend for Python. The goal of the Python backend is to let you serve models written in Python with Triton Inference Server without having to write any C++ code.

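User Documentation#

Python Backend
Business Logic Scripting
Interoperability and GPU Support
Frameworks
- PyTorch
  - PyTorch Determinism
- TensorFlow
  - TensorFlow Determinism
Custom Metrics
Examples
Running with Inferentia
Logging
Development with VSCode
Reporting problems, asking questions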

Quick Start#

Run the Triton Inference Server container.

docker run --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:<xx.yy>-py3

Replace <xx.yy> with the Triton version (e.g. 21.05).

Inside the container, clone the Python backend repository.

git clone https://github.com/triton-inference-server/python_backend -b r<xx.yy>

Install the example model.

cd python_backend
mkdir -p models/add_sub/1/
cp examples/add_sub/model.py models/add_sub/1/model.py
cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt

Start the Triton server.

tritonserver --model-repository `pwd`/models

On the host machine, start the client container.

docker run -ti --net host nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk /bin/bash

In the client container, clone the Python backend repository.

git clone https://github.com/triton-inference-server/python_backend -b r<xx.yy>

Run the example client.

python3 python_backend/examples/add_sub/client.py

Building from Source#

Requirements

cmake >= 3.17
numpy
rapidjson-dev
libarchive-dev
zlib1g-dev

pip3 install numpy

On Ubuntu or Debian, you can install rapidjson, libarchive, and zlib with the following command:

sudo apt-get install rapidjson-dev libarchive-dev zlib1g-dev

Build the Python backend. Replace <GIT_BRANCH_NAME> with the GitHub branch that you want to compile. For release branches it should be r<xx.yy> (e.g. r21.06).

mkdir build
cd build
cmake -DTRITON_ENABLE_GPU=ON -DTRITON_BACKEND_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_COMMON_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_CORE_REPO_TAG=<GIT_BRANCH_NAME> -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
make install

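The following required Triton repositories are pulled and used in the build. If the CMake variables below are not specified, the "main" branch of those repositories is used. <GIT_BRANCH_NAME> should be the same as the Python backend repository branch that you are trying to compile.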

triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=<GIT_BRANCH_NAME>
triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=<GIT_BRANCH_NAME>
triton-inference-server/core: -DTRITON_CORE_REPO_TAG=<GIT_BRANCH_NAME>

Set -DCMAKE_INSTALL_PREFIX to the location where Triton Server is installed. In the released containers, this location is /opt/tritonserver.

Copy the example model and configuration:

mkdir -p models/add_sub/1/
cp examples/add_sub/model.py models/add_sub/1/model.py
cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt

Start Triton Server:

/opt/tritonserver/bin/tritonserver --model-repository=`pwd`/models

Use the client app to perform inference:

python3 examples/add_sub/client.py

Usage#

To use the Python backend, you need to create a Python file with a structure similar to the following:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model
        assuming the server was not started with
        `--disable-auto-complete-config`. Implementing this function is
        optional. Not implementing `auto_complete_config` has no effect.
        This function can be used to set `max_batch_size`, `input` and `output`
        properties of the model using `set_max_batch_size`, `add_input`, and
        `add_output`. These properties will allow Triton to load the model with
        minimal model configuration in absence of a configuration file. This
        function returns the `pb_utils.ModelConfig` object with these
        properties. You can use the `as_dict` function to gain read-only access
        to the `pb_utils.ModelConfig` object. The `pb_utils.ModelConfig` object
        being returned from here will be used as the final configuration for
        the model.

        Note: The Python interpreter used to invoke this function will be
        destroyed upon returning from this function and as a result none of the
        objects created here will be available in the `initialize`, `execute`,
        or `finalize` functions.

        Parameters
        ----------
        auto_complete_model_config : pb_utils.ModelConfig
          An object containing the existing model configuration. You can build
          upon the configuration given by this object when setting the
          properties for this model.

        Returns
        -------
        pb_utils.ModelConfig
          An object containing the auto-completed model configuration
        """
        inputs = [{
            'name': 'INPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4],
            # this parameter will set `INPUT0` as an optional input
            'optional': True
        }, {
            'name': 'INPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]
        outputs = [{
            'name': 'OUTPUT0',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }, {
            'name': 'OUTPUT1',
            'data_type': 'TYPE_FP32',
            'dims': [4]
        }]

        # Demonstrate the usage of `as_dict`, `add_input`, `add_output`,
        # `set_max_batch_size`, and `set_dynamic_batching` functions.
        # Store the model configuration as a dictionary.
        config = auto_complete_model_config.as_dict()
        input_names = []
        output_names = []
        for input in config['input']:
            input_names.append(input['name'])
        for output in config['output']:
            output_names.append(output['name'])

        for input in inputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_input` will check for conflicts and
            # raise errors if an input with the same name already exists in
            # the configuration but has different data_type or dims property.
            if input['name'] not in input_names:
                auto_complete_model_config.add_input(input)
        for output in outputs:
            # The name checking here is only for demonstrating the usage of
            # `as_dict` function. `add_output` will check for conflicts and
            # raise errors if an output with the same name already exists in
            # the configuration but has different data_type or dims property.
            if output['name'] not in output_names:
                auto_complete_model_config.add_output(output)

        auto_complete_model_config.set_max_batch_size(0)

        # To enable a dynamic batcher with default settings, you can use
        # auto_complete_model_config set_dynamic_batching() function. It is
        # commented in this example because the max_batch_size is zero.
        #
        # auto_complete_model_config.set_dynamic_batching()

        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them.
        # Reusing the same pb_utils.InferenceResponse object for multiple
        # requests may result in segmentation faults. You should avoid storing
        # any of the input Tensors in the class attributes as they will be
        # overridden in subsequent inference requests. You can make a copy of
        # the underlying NumPy array and store it if it is required.
        for request in requests:
            # Perform inference on the request and append it to responses
            # list...
            ...  # placeholder so the loop body is syntactically valid

        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

Every Python backend can implement four main functions:

`auto_complete_config`#

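auto_complete_config is called only once when loading the model, assuming the server was not started with --disable-auto-complete-config.

Implementing this function is optional. If auto_complete_config is not implemented, nothing is auto-completed. This function can be used to set the max_batch_size, dynamic_batching, input, and output properties of the model using set_max_batch_size, set_dynamic_batching, add_input, and add_output. These properties allow Triton to load the model with a minimal model configuration in the absence of a configuration file. This function returns the pb_utils.ModelConfig object with these properties. You can use the as_dict function to gain read-only access to the pb_utils.ModelConfig object. The pb_utils.ModelConfig object returned from here is used as the final configuration for the model.

In addition to the minimal properties, you can also set model_transaction_policy through auto_complete_config using set_model_transaction_policy. For example: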

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
      ...
      transaction_policy = {"decoupled": True}
      auto_complete_model_config.set_model_transaction_policy(transaction_policy)
      ...

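Note: The Python interpreter used to invoke this function is destroyed when the function returns, so none of the objects created here are available in the initialize, execute, or finalize functions.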

`initialize`#

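initialize is called once when the model is being loaded. Implementing initialize is optional. initialize lets you perform any necessary initialization before execution. In the initialize function, you are given an args variable. args is a Python dictionary whose keys and values are both strings. The table below lists the keys available in the args dictionary and their descriptions:

Key	Description
model_config	A JSON string containing the model configuration
model_instance_kind	A string containing the model instance kind
model_instance_device_id	A string containing the model instance device ID
model_repository	Model repository path
model_version	Model version
model_name	Model name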
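Because every value in args is a string, numeric fields and the model configuration have to be converted before use. The sketch below shows one way to do that outside of Triton. The helper name summarize_initialize_args and the sample values are assumptions made for illustration; only the dictionary keys come from the table above.

```python
import json


def summarize_initialize_args(args):
    """Extract a few commonly needed values from the `args` dictionary."""
    # `model_config` arrives as a JSON string, so it must be parsed first.
    model_config = json.loads(args["model_config"])
    return {
        "name": args["model_name"],
        # All values are strings, so numeric fields need explicit conversion.
        "version": int(args["model_version"]),
        "instance_kind": args["model_instance_kind"],
        "device_id": int(args["model_instance_device_id"]),
        "max_batch_size": model_config.get("max_batch_size", 0),
        "output_names": [o["name"] for o in model_config.get("output", [])],
    }


# Hand-written sample values, for illustration only.
sample_args = {
    "model_config": json.dumps({
        "name": "add_sub",
        "max_batch_size": 0,
        "output": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}],
    }),
    "model_instance_kind": "CPU",
    "model_instance_device_id": "0",
    "model_repository": "/models/add_sub",
    "model_version": "1",
    "model_name": "add_sub",
}

summary = summarize_initialize_args(sample_args)
print(summary)
```

Inside a real model, the same parsing would typically happen at the top of initialize, with the results stored on self for later use in execute.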

`execute`#

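The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function you are given a list of InferenceRequest objects. There are two modes in which you can implement this function. The mode you choose should depend on your use case, that is, on whether or not you want to return decoupled responses from this model.

Default Mode#

This is the most generic way to implement your model, and it requires the execute function to return exactly one response per request. In this mode, your execute function must therefore return a list of InferenceResponse objects that has the same length as requests. The workflow in this mode is:

1. The execute function receives a batch of pb_utils.InferenceRequest as a length N array.
2. Perform inference on each pb_utils.InferenceRequest and append the corresponding pb_utils.InferenceResponse to the response list.
3. Return the response list.
   - The length of the returned response list must be N.
   - Each element in the list should be the response for the corresponding element in the request array.
   - Each element must contain a response (a response can be either output tensors or an error); an element cannot be None.

Triton checks that these requirements on the response list are satisfied and, if they are not, returns an error response for all inference requests. Upon return from the execute function, all tensor data associated with the InferenceRequest objects passed to the function is deleted, so the Python model should not retain InferenceRequest objects.

Starting from 24.06, models may choose to send the response using the InferenceResponseSender, as illustrated in Decoupled Mode. Since the model is in default mode, it must send exactly one response per request. The pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL flag must be sent either with the response or as a flag-only response afterward.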
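The requirements on the returned list can be mirrored in a small self-check that a model author might run in unit tests. This is only an illustrative sketch: check_default_mode_responses is a hypothetical helper, not part of the Triton API, and it does not reproduce how Triton performs its own validation.

```python
def check_default_mode_responses(requests, responses):
    """Raise ValueError if `responses` breaks the default-mode rules."""
    if not isinstance(responses, list):
        raise ValueError("execute must return a list of responses")
    if len(responses) != len(requests):
        raise ValueError(
            f"expected {len(requests)} responses, got {len(responses)}")
    for index, response in enumerate(responses):
        if response is None:
            raise ValueError(f"response {index} is None")


# Plain strings stand in for request and response objects here.
requests = ["req-0", "req-1", "req-2"]
check_default_mode_responses(requests, ["resp-0", "resp-1", "resp-2"])

try:
    check_default_mode_responses(requests, ["resp-0", None, "resp-2"])
except ValueError as err:
    print(err)  # response 1 is None
```

Note that rescheduled requests, described under Request Rescheduling, are an exception to the "no None" rule and would need to be excluded from such a check.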

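Error Handling#

If an error occurs for one of the requests, you can use the TritonError object to set the error message for that specific request. Below is an example of setting an error for an InferenceResponse object: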

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def execute(self, requests):
        responses = []

        for request in requests:
            if an_error_occurred:
              # If there is an error, there is no need to pass the
              # "output_tensors" to the InferenceResponse. The "output_tensors"
              # that are passed in this case will be ignored.
              responses.append(pb_utils.InferenceResponse(
                error=pb_utils.TritonError("An Error Occurred")))

        return responses

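Starting from 23.09, pb_utils.TritonError may be constructed with an optional Triton error code as the second parameter. For example:

pb_utils.TritonError("The file is not found", pb_utils.TritonError.NOT_FOUND)

If no code is specified, pb_utils.TritonError.INTERNAL is used by default.

Supported error codes: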

pb_utils.TritonError.UNKNOWN
pb_utils.TritonError.INTERNAL
pb_utils.TritonError.NOT_FOUND
pb_utils.TritonError.INVALID_ARG
pb_utils.TritonError.UNAVAILABLE
pb_utils.TritonError.UNSUPPORTED
pb_utils.TritonError.ALREADY_EXISTS
pb_utils.TritonError.CANCELLED (since 23.10)

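Request Cancellation Handling#

One or more requests may be cancelled by the client during execution. Starting from 23.10, request.is_cancelled() returns whether the request has been cancelled. For example: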

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    ...

    def execute(self, requests):
        responses = []

        for request in requests:
            if request.is_cancelled():
                responses.append(pb_utils.InferenceResponse(
                    error=pb_utils.TritonError("Message", pb_utils.TritonError.CANCELLED)))
            else:
                ...

        return responses

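Although checking for request cancellation is optional, it is recommended to check for cancellation at strategic stages of request execution so that execution can be terminated early when its response is no longer needed.

Decoupled Mode#

This mode allows the user to send multiple responses for a request, or no response at all. A model may also send responses out of order relative to the order in which the request batches are executed. Such models are called decoupled models. To use this mode, the transaction policy in the model configuration must be set to decoupled.

In decoupled mode, the model must use an InferenceResponseSender object per request to keep creating and sending any number of responses for that request. The workflow in this mode may look like this:

- The execute function receives a batch of pb_utils.InferenceRequest as a length N array.
- Iterate through each pb_utils.InferenceRequest and perform the following steps for each pb_utils.InferenceRequest object:
  1. Get the InferenceResponseSender object for the InferenceRequest using InferenceRequest.get_response_sender().
  2. Create and populate the pb_utils.InferenceResponse to be sent back.
  3. Use InferenceResponseSender.send() to send the above response. If this is the last response for the request, pass pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL as the flags argument to InferenceResponseSender.send(). Otherwise, return to step 2 to send the next response.
- The return value of the execute function in this mode should be None.

Similar to the above, if an error occurs for one of the requests, you can use the TritonError object to set the error message for that specific request. After setting the error on a pb_utils.InferenceResponse object, use InferenceResponseSender.send() to send the response with the error back to the user.

Starting from 23.10, request cancellation can be checked directly on the InferenceResponseSender object using response_sender.is_cancelled(). Even if the request has been cancelled, the TRITONSERVER_RESPONSE_COMPLETE_FINAL flag must still be sent at the end of the response.

Use Cases#

The decoupled mode is powerful and supports various other use cases:

- If the model should not send any response for a request, call InferenceResponseSender.send() with no response but with the flags argument set to pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL.
- The model can also send responses out of order relative to the order in which it received the requests.
- The request data and the InferenceResponseSender object can be passed to a separate thread in the model. This means the main caller thread can exit from the execute function, and the model can still continue generating responses for as long as it holds the InferenceResponseSender object.

The decoupled examples demonstrate the full power of what can be achieved with the decoupled API. Read Decoupled Backends and Models for more details on how to host a decoupled model.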
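The thread hand-off pattern can be sketched without a running server. In the sketch below, RecordingSender is a hypothetical stand-in for InferenceResponseSender that only records calls, COMPLETE_FINAL is a placeholder for pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, and plain strings stand in for responses. Only the send(response, flags=...) call shape follows the text above.

```python
import threading

# Placeholder for pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL; the numeric
# value is illustrative only.
COMPLETE_FINAL = 1


class RecordingSender:
    """Hypothetical stand-in for InferenceResponseSender; records calls."""

    def __init__(self):
        self.sent = []

    def send(self, response=None, flags=0):
        self.sent.append((response, flags))


def stream_responses(sender, count):
    """Send `count` responses, then a flags-only final message."""
    for i in range(count):
        sender.send(f"response-{i}")
    sender.send(flags=COMPLETE_FINAL)


def start_workers(senders, count=3):
    """Start one worker thread per sender and return without waiting."""
    threads = [
        threading.Thread(target=stream_responses, args=(sender, count))
        for sender in senders
    ]
    for thread in threads:
        thread.start()
    return threads


senders = [RecordingSender(), RecordingSender()]
for thread in start_workers(senders):
    thread.join()  # joined here only so the result can be inspected
print(senders[0].sent)
```

In a real decoupled model, execute would start the threads and return None; the workers keep sending until each one has delivered the final flag.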

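Async Execute#

Starting from 24.04, async def execute(self, requests): is supported for decoupled Python models. Its coroutine is executed by an AsyncIO event loop that is shared with the other requests executing in the same model instance. The next request for the model instance can start executing while the current request is waiting.

This is useful for minimizing the number of model instances for models that spend the majority of their time waiting, because AsyncIO can execute requests concurrently. To take full advantage of the concurrency, it is vital that the async execute function does not block the event loop from making progress while it is waiting, for example while downloading over the network.

Notes

- The model should not modify the running event loop, as this might cause unexpected issues.
- The server/backend does not control how many requests are added to the event loop by a model instance.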
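The effect of not blocking the event loop can be seen in a small standalone sketch. It does not use the Triton API: handle_request is a hypothetical coroutine, and asyncio.sleep stands in for non-blocking I/O such as a network download.

```python
import asyncio


async def handle_request(request_id, events):
    events.append(f"start {request_id}")
    # `asyncio.sleep` yields control to the event loop while waiting.
    # A blocking call such as `time.sleep` here would stall every other
    # request sharing the loop.
    await asyncio.sleep(0.01)
    events.append(f"end {request_id}")


async def main():
    events = []
    await asyncio.gather(*(handle_request(i, events) for i in range(3)))
    return events


events = asyncio.run(main())
print(events)
```

All three "start" entries are recorded before any "end" entry, which shows that the requests overlap on a single event loop instead of running one after another.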

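Request Rescheduling#

Starting from 23.11, the Python backend supports request rescheduling. By calling the set_release_flags function on the request object with the flag pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE, you can reschedule the request for further execution in a future batch. This feature is useful for handling iterative sequences.

The model configuration must enable iterative sequence batching in order to use the request rescheduling API: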

sequence_batching {
  iterative_sequence : true
}

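For non-decoupled models, there can be only one response for each request. Since the rescheduled request is the same as the original, you must append a None object to the response list for the rescheduled request. For example: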

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    ...

    def execute(self, requests):
        responses = []

        for request in requests:
            # Explicitly reschedule the first request
            if self.idx == 0:
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )
                responses.append(None)
                self.idx += 1
            else:
                responses.append(inference_response)

        return responses

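For decoupled models, the request must be rescheduled before returning from the execute function. Below is an example of a decoupled model using request rescheduling. This model takes one input tensor, an INT32 [ 1 ] input named "IN", and produces an output tensor "OUT" with the same shape as the input tensor. The input value indicates the total number of responses to be generated, and the output value indicates the number of remaining responses. For example, if the request input has value 2, the model will:

- Send a response with value 1.
- Release the request with the RESCHEDULE flag.
- When executing on the same request again, send the last response with value 0.
- Release the request with the ALL flag.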

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    ...

    def execute(self, requests):
        responses = []

        for request in requests:
            in_input = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()

            if self.reset_flag:
                self.remaining_response = in_input[0]
                self.reset_flag = False

            response_sender = request.get_response_sender()

            self.remaining_response -= 1

            out_output = pb_utils.Tensor(
                "OUT", np.array([self.remaining_response], np.int32)
            )
            response = pb_utils.InferenceResponse(output_tensors=[out_output])

            if self.remaining_response <= 0:
                response_sender.send(
                    response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL
                )
                self.reset_flag = True
            else:
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )
                response_sender.send(response)

        return None

`finalize`#

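Implementing finalize is optional. This function lets you perform any necessary clean-up before the model is unloaded from the Triton server.

You can look at the add_sub example, which contains a complete example of implementing all of these functions for a Python model that adds and subtracts the inputs given to it. After implementing all the necessary functions, save this file as model.py.

Model Config File#

Every Python Triton model must provide a config.pbtxt file describing the model configuration. To use this backend, you must set the backend field of your model's config.pbtxt file to python. You should not set the platform field of the configuration.

Your model directory should look like this: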

models
└── add_sub
    ├── 1
    │   └── model.py
    └── config.pbtxt
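The layout above can be created with a few lines of Python. This is a minimal sketch: scaffold_python_model is a hypothetical helper, the generated config.pbtxt contains only the name and backend fields, and model.py is left as an empty placeholder. A real configuration would typically also declare inputs and outputs unless they are supplied through auto_complete_config.

```python
import tempfile
from pathlib import Path


def scaffold_python_model(repo_root, model_name, version=1):
    """Create <repo_root>/<model_name>/{config.pbtxt,<version>/model.py}."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)

    # `backend` must be "python"; `platform` is intentionally left unset.
    config = f'name: "{model_name}"\nbackend: "python"\n'
    (model_dir / "config.pbtxt").write_text(config)

    # Empty placeholder; the real file must define `TritonPythonModel`.
    (version_dir / "model.py").touch()
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    scaffold_python_model(tmp, "add_sub")
    created = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
    print(created)
```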

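Inference Request Parameters#

You can retrieve the parameters associated with an inference request using the inference_request.parameters() function. This function returns a JSON string in which the keys are the keys of the parameters object and the values are the values of the parameters field. Note that you need to parse this string with json.loads to convert it to a dictionary.

Starting from 23.11, parameters can be provided to an InferenceRequest object during construction. The parameters should be a dictionary of key-value pairs, where keys are str and values are bool, int, or str.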

request = pb_utils.InferenceRequest(parameters={"key": "value"}, ...)

You can read more about inference request parameters in the parameters extension documentation.
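Since the parameters arrive as a JSON string, a model usually parses them once per request and merges them with its own defaults. The sketch below shows that step in isolation. read_request_parameters and the parameter names (temperature, stream, max_tokens, top_k) are assumptions made for illustration; in a model, the input string would come from inference_request.parameters().

```python
import json


def read_request_parameters(parameters_json, defaults=None):
    """Parse the JSON string of request parameters into a dictionary."""
    merged = dict(defaults or {})
    # An empty string is treated as "no parameters" in this sketch.
    if parameters_json:
        merged.update(json.loads(parameters_json))
    return merged


# Hand-written sample of what the JSON string might look like.
raw = '{"temperature": "0.5", "stream": true, "max_tokens": 16}'
params = read_request_parameters(raw, defaults={"stream": False, "top_k": 1})
print(params)
```

Values keep the types they have in the JSON string, so a number sent as a string (as "temperature" is here) still needs an explicit conversion.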

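Managing Python Runtime and Libraries#

The Python backend shipped in the NVIDIA GPU Cloud containers uses Python 3.10. The Python backend can use the libraries that exist in the current Python environment. These libraries can be installed in a virtualenv, a conda environment, or the global system Python. They are used only if the Python version matches the Python version of the Python backend's stub executable. For example, if you install a set of libraries in a Python 3.9 environment and your Python backend stub was compiled with Python 3.10, these libraries will not be available in your Python model served by Triton. You would need to compile the stub executable with Python 3.9 using the instructions in the Building Custom Python Backend Stub section.

Building Custom Python Backend Stub#

Important Note: You only need to compile a custom Python backend stub if the Python version is different from Python 3.10, which is shipped by default in the Triton containers.

The Python backend uses a stub process to connect your model.py file to the Triton C++ core. This stub process dynamically links to a specific libpython<X>.<Y>.so version. If you intend to use a Python interpreter whose version differs from that of the default Python backend stub, you need to compile your own Python backend stub by following the steps below:

Install the following packages:

cmake
rapidjson and libarchive (instructions for installing these packages on Ubuntu or Debian are in the Building from Source section)

Make sure that the expected Python version is available in your environment.

If you are using conda, make sure to activate the environment with conda activate <conda-env-name>. Note that you do not have to use conda and can install Python however you wish. The Python backend relies on pybind11 to find the correct Python version. If you notice that the correct Python version is not picked up, you can read more about how pybind11 decides which Python to use.

Clone the Python backend repository and compile the Python backend stub (replace <GIT_BRANCH_NAME> with the branch name that you want to use; for release branches it should be r<xx.yy>):

git clone https://github.com/triton-inference-server/python_backend -b <GIT_BRANCH_NAME>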
cd python_backend
mkdir build && cd build
cmake -DTRITON_ENABLE_GPU=ON -DTRITON_BACKEND_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_COMMON_REPO_TAG=<GIT_BRANCH_NAME> -DTRITON_CORE_REPO_TAG=<GIT_BRANCH_NAME> -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
make triton-python-backend-stub

Now you have a Python backend stub built for your Python version. You can verify this using ldd:

ldd triton_python_backend_stub
...
libpython3.6m.so.1.0 => /home/ubuntu/envs/miniconda3/envs/python-3-6/lib/libpython3.6m.so.1.0 (0x00007fbb69cf3000)
...

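In addition to the library shown above, many other shared libraries are printed. However, it is important to see libpython<major>.<minor>m.so.1.0 in the list of linked shared libraries. If you use a different Python version, you should see that version instead. You need to copy triton_python_backend_stub to the model directory of each model that should use the custom Python backend stub. For example, if you have model_a in your model repository, the folder structure should look like this: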

models
|-- model_a
    |-- 1
    |   |-- model.py
    |-- config.pbtxt
    `-- triton_python_backend_stub

Note the location of triton_python_backend_stub in the directory structure above.
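The ldd check described earlier can also be scripted, for example to compare the stub's Python version against an environment before deploying. The sketch below parses text that has already been captured from ldd; linked_python_version and its regular expression are illustrative assumptions, not part of the Python backend.

```python
import re

# Matches entries such as "libpython3.6m.so.1.0" or "libpython3.10.so.1.0".
LIBPYTHON_PATTERN = re.compile(r"libpython(\d+)\.(\d+)[a-z]*\.so")


def linked_python_version(ldd_output):
    """Return (major, minor) of the first libpython entry, or None."""
    match = LIBPYTHON_PATTERN.search(ldd_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "libpython3.6m.so.1.0 => /home/ubuntu/envs/miniconda3/envs/"
    "python-3-6/lib/libpython3.6m.so.1.0 (0x00007fbb69cf3000)"
)
print(linked_python_version(sample))  # (3, 6)
```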

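Creating Custom Execution Environments#

If you want to create a tar file that contains all your Python dependencies, or you want to use a different Python environment for each Python model, you need to create a custom execution environment in the Python backend. Currently, the Python backend supports conda-pack for this purpose. conda-pack ensures that your conda environment is portable. You can create a tar file for your conda environment using the conda-pack command: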

conda-pack
Collecting packages...
Packing environment at '/home/iman/miniconda3/envs/python-3-6' to 'python-3-6.tar.gz'
[########################################] | 100% Completed |  4.5s

Important Note: Before installing packages in your conda environment, make sure that you have exported the PYTHONNOUSERSITE environment variable:

export PYTHONNOUSERSITE=True

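If this variable is not exported and similar packages are installed outside your conda environment, your tar file may not contain all the dependencies required for an isolated Python environment.

Alternatively, the Python backend also supports unpacked conda execution environments, provided that the path points to an activation script that sets up the conda environment. To do this, the execution environment can first be packed using conda-pack and then unpacked, or created using conda create -p. In this case, the conda activation script is located at $path_to_conda_pack/lib/python<your.python.version>/site-packages/conda_pack/scripts/posix/activate. This speeds up the server loading time for models.

After creating the packed file from the conda environment, or creating a conda environment with a custom activation script, you need to tell the Python backend to use that environment for your model. You can do this by adding the following lines to the config.pbtxt file: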

name: "model_a"
backend: "python"

...

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/home/iman/miniconda3/envs/python-3-6/python3.6.tar.gz"}
}

It is also possible to provide the execution environment path relative to the model folder in the model repository:

name: "model_a"
backend: "python"

...

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python3.6.tar.gz"}
}

In this case, python3.6.tar.gz should be placed in the model folder, and the model repository should look like this:

models
|-- model_a
|   |-- 1
|   |   `-- model.py
|   |-- config.pbtxt
|   |-- python3.6.tar.gz
|   `-- triton_python_backend_stub

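In the example above, $$TRITON_MODEL_DIRECTORY is resolved to $pwd/models/model_a.

To speed up the loading time of model_a, you can unpack the conda environment in the model folder with the following steps: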

mkdir -p $pwd/models/model_a/python3.6
tar -xvf $pwd/models/model_a/python3.6.tar.gz -C $pwd/models/model_a/python3.6

You can then change EXECUTION_ENV_PATH to point to the unpacked directory:

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python3.6"}
}

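This is useful if you want to use S3, GCS, or Azure and you do not have access to the absolute path of the execution environment stored in the cloud object storage service.

Important Notes#

- The version of the Python interpreter in the execution environment must match the version of triton_python_backend_stub.
- If you do not want to use a different Python interpreter, you can skip building a custom Python backend stub. In this case you only need to pack your environment using conda-pack and provide the path to the tar file in the model config. However, the previous note still applies, and the version of the Python interpreter inside the conda environment must match the Python version of the stub used by the Python backend. The default version of the stub is Python 3.10.
- You can share a single execution environment across multiple models. You need to provide the path to the tar file in the EXECUTION_ENV_PATH of the config.pbtxt of every model that should use the execution environment.
- If $$TRITON_MODEL_DIRECTORY is used in EXECUTION_ENV_PATH, the final EXECUTION_ENV_PATH must not escape from $$TRITON_MODEL_DIRECTORY, because the behavior of accessing anywhere outside $$TRITON_MODEL_DIRECTORY is undefined.
- If a non-$$TRITON_MODEL_DIRECTORY EXECUTION_ENV_PATH is used, only local file system paths are currently supported. The behavior of using cloud paths is undefined.
- If you need to compile the Python backend stub, it is recommended that you compile it in the official Triton NGC containers. Otherwise, your compiled stub may use dependencies that are not available in the Triton container you are using for deployment. For example, compiling the Python backend stub on an OS other than Ubuntu 22.04 can lead to unexpected errors.
- If you encounter the "GLIBCXX_3.4.30 not found" error at runtime, we recommend upgrading your conda version and installing libstdcxx-ng=12 by running conda install -c conda-forge libstdcxx-ng=12 -y. If this does not resolve the issue, feel free to open an issue on the GitHub issue page, following the instructions provided there.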
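The rule that a $$TRITON_MODEL_DIRECTORY-based path must not escape the model directory can be checked before deployment. The following is a minimal sketch of such a check and not how Triton itself resolves the value: resolve_execution_env_path is a hypothetical helper, and it assumes POSIX-style local paths.

```python
import posixpath

PLACEHOLDER = "$$TRITON_MODEL_DIRECTORY"


def resolve_execution_env_path(value, model_dir):
    """Expand the placeholder and reject paths that leave `model_dir`."""
    if not value.startswith(PLACEHOLDER):
        # Paths without the placeholder are returned unchanged.
        return value
    model_dir = posixpath.normpath(model_dir)
    resolved = posixpath.normpath(model_dir + value[len(PLACEHOLDER):])
    inside = resolved == model_dir or resolved.startswith(model_dir + "/")
    if not inside:
        raise ValueError(f"{value!r} escapes the model directory")
    return resolved


print(resolve_execution_env_path(
    "$$TRITON_MODEL_DIRECTORY/python3.6.tar.gz", "/models/model_a"))
# /models/model_a/python3.6.tar.gz

try:
    resolve_execution_env_path(
        "$$TRITON_MODEL_DIRECTORY/../shared/env", "/models/model_a")
except ValueError as err:
    print(err)
```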

Error Handling#

If an error occurs that affects the initialize, execute, or finalize function of the Python model, you can raise a TritonModelException. The example below shows how to do error handling in finalize:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...

    def finalize(self):
      if error_during_finalize:
        raise pb_utils.TritonModelException(
          "An error occurred during finalize.")

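Managing Shared Memory#

Starting from 21.04, the Python backend uses shared memory to connect the user's code to Triton. Note that this change is completely transparent and does not require any change to existing user model code.

By default, the Python backend allocates 1 MB for each model instance. It then grows the shared memory region in 1 MB chunks whenever an increase is required. You can configure the default shared memory size used by each model instance with the shm-default-byte-size flag. The amount of shared memory growth can be configured using shm-growth-byte-size.

You can also configure the timeout used for connecting the Triton main process to the Python backend stubs using stub-timeout-seconds. The default value is 30 seconds.

The config values described above can be passed to Triton using the --backend-config flag: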

/opt/tritonserver/bin/tritonserver --model-repository=`pwd`/models --backend-config=python,<config-key>=<config-value>
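When several of these settings are used together, the flag has to be repeated once per key. The sketch below builds the arguments from a dictionary; python_backend_config_flags is a hypothetical helper, and the sizes and timeout are arbitrary example values, not recommendations.

```python
def python_backend_config_flags(settings):
    """Turn {key: value} into a list of --backend-config arguments."""
    return [
        f"--backend-config=python,{key}={value}"
        for key, value in settings.items()
    ]


flags = python_backend_config_flags({
    "shm-default-byte-size": 4 * 1024 * 1024,  # initial size, in bytes
    "shm-growth-byte-size": 2 * 1024 * 1024,   # growth step, in bytes
    "stub-timeout-seconds": 60,
})
print(" ".join(["tritonserver", "--model-repository=/models", *flags]))
```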

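In addition, if you are running Triton inside a Docker container, you need to set the --shm-size flag appropriately for the size of your inputs and outputs. The default value for the docker run command is 64MB, which is very small.

Multiple Model Instance Support#

The Python interpreter uses a global lock known as the GIL. Because of the GIL, multiple threads cannot run simultaneously in the same Python interpreter: each thread must acquire the GIL when accessing Python objects, which serializes all the operations. To work around this, the Python backend spawns a separate process for each model instance. This is in contrast to how other Triton backends such as ONNXRuntime, TensorFlow, and PyTorch handle multiple instances. Increasing the instance count for those backends creates additional threads instead of spawning separate processes.

Running Multiple Instances of Triton Server#

Starting from 24.04, the Python backend uses a UUID to generate unique names for its shared memory regions so that multiple instances of the server can run at the same time without any conflicts.

If you are using a Python backend released before 24.04, you need to specify different shm-region-prefix-name values using the --backend-config flag to avoid conflicts between the shared memory regions. For example: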

# Triton instance 1
tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix1

# Triton instance 2
tritonserver --model-repository=/models --backend-config=python,shm-region-prefix-name=prefix2

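Note that hangs caused by such conflicts only occur if /dev/shm is shared between the two instances of the server. If you run the servers in different containers that do not share this location, you do not need to specify shm-region-prefix-name.

Business Logic Scripting#

Triton's ensemble feature supports many use cases where multiple models are composed into a pipeline (or, more generally, a DAG, directed acyclic graph). However, there are many other use cases that are not supported because, as part of the model pipeline, they require loops, conditionals (if-then-else), data-dependent control flow, and other custom logic to be intermixed with model execution. We call this combination of custom logic and model executions Business Logic Scripting (BLS).

Starting from 21.08, you can implement BLS in your Python model. A new set of utility functions allows you to execute inference requests on other models being served by Triton as part of executing your Python model. Note that BLS should only be used inside the execute function and is not supported in the initialize or finalize methods. The example below shows how to use this feature: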

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...
    def execute(self, requests):
      ...
      # Create an InferenceRequest object. `model_name`,
      # `requested_output_names`, and `inputs` are the required arguments and
      # must be provided when constructing an InferenceRequest object. Make
      # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
      # objects.
      inference_request = pb_utils.InferenceRequest(
          model_name='model_name',
          requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
          inputs=[<pb_utils.Tensor object>])

      # `pb_utils.InferenceRequest` supports request_id, correlation_id,
      # model version, timeout and preferred_memory in addition to the
      # arguments described above.
      # Note: Starting from the 24.03 release, the `correlation_id` parameter
      # supports both string and unsigned integer values.
      # These arguments are optional. An example containing all the arguments:
      # inference_request = pb_utils.InferenceRequest(model_name='model_name',
      #   requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
      #   inputs=[<list of pb_utils.Tensor objects>],
      #   request_id="1", correlation_id=4, model_version=1, flags=0, timeout=5,
      #   preferred_memory=pb_utils.PreferredMemory(
      #     pb_utils.TRITONSERVER_MEMORY_GPU, # or pb_utils.TRITONSERVER_MEMORY_CPU
      #     0))

      # Execute the inference_request and wait for the response
      inference_response = inference_request.exec()

      # Check if the inference response has an error
      if inference_response.has_error():
          raise pb_utils.TritonModelException(
            inference_response.error().message())
      else:
          # Extract the output tensors from the inference response.
          output1 = pb_utils.get_output_tensor_by_name(
            inference_response, 'REQUESTED_OUTPUT_1')
          output2 = pb_utils.get_output_tensor_by_name(
            inference_response, 'REQUESTED_OUTPUT_2')

          # Decide the next steps for model execution based on the received
          # output tensors. It is possible to use the same output tensors
          # for the final inference response too.

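In addition to the inference_request.exec function, which lets you execute blocking inference requests, inference_request.async_exec lets you perform asynchronous inference requests. This can be useful when you do not need the result of the inference immediately. Using the async_exec function, it is possible to have multiple in-flight inference requests and wait for the responses only when needed. The example below shows how to use async_exec: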

import triton_python_backend_utils as pb_utils
import asyncio


class TritonPythonModel:
    ...

    # You must add the Python 'async' keyword to the beginning of `execute`
    # function if you want to use `async_exec` function.
    async def execute(self, requests):
      ...
      # Create an InferenceRequest object. `model_name`,
      # `requested_output_names`, and `inputs` are the required arguments and
      # must be provided when constructing an InferenceRequest object. Make
      # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
      # objects.
      inference_request = pb_utils.InferenceRequest(
          model_name='model_name',
          requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
          inputs=[<pb_utils.Tensor object>])

      infer_response_awaits = []
      for i in range(4):
        # async_exec function returns an
        # [Awaitable](https://docs.pythonlang.cn/3/library/asyncio-task.html#awaitables)
        # object.
        infer_response_awaits.append(inference_request.async_exec())

      # Wait for all of the inference requests to complete.
      infer_responses = await asyncio.gather(*infer_response_awaits)

      for infer_response in infer_responses:
        # Check if the inference response has an error
        if infer_response.has_error():
            raise pb_utils.TritonModelException(
              infer_response.error().message())
        else:
            # Extract the output tensors from the inference response.
            output1 = pb_utils.get_output_tensor_by_name(
              infer_response, 'REQUESTED_OUTPUT_1')
            output2 = pb_utils.get_output_tensor_by_name(
              infer_response, 'REQUESTED_OUTPUT_2')

            # Decide the next steps for model execution based on the received
            # output tensors.

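A complete example of sync and async BLS in the Python backend is included in the Examples section.

Using BLS with Decoupled Models#

Starting from 23.03, you can execute inference requests on decoupled models in both default mode and decoupled mode. When the decoupled parameter is set to True, the exec and async_exec functions return an iterator over the inference responses returned by the decoupled model. When the decoupled parameter is set to False, the exec and async_exec functions return a single response, as shown in the example above. In addition, you can set a timeout in microseconds through the 'timeout' parameter in the InferenceRequest constructor. If the request times out, the request responds with an error. The default value of 'timeout' is 0, which means that the request has no timeout.

Additionally, starting from 23.04, you can choose a specific device on which to receive the output tensors from BLS calls. This is done by setting the optional preferred_memory parameter in the InferenceRequest constructor. To do this, create a PreferredMemory object, specify preferred_memory_type as either TRITONSERVER_MEMORY_GPU or TRITONSERVER_MEMORY_CPU, and specify preferred_device_id as an integer, to indicate the memory type and device ID on which you want to receive the output tensors. If you do not specify the preferred_memory parameter, the output tensors are allocated on the same device where they were received from the model to which the BLS call is made.

The example below shows how to use this feature: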

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    ...
    def execute(self, requests):
      ...
      # Create an InferenceRequest object. `model_name`,
      # `requested_output_names`, and `inputs` are the required arguments and
      # must be provided when constructing an InferenceRequest object. Make
      # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
      # objects.
      inference_request = pb_utils.InferenceRequest(
          model_name='model_name',
          requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
          inputs=[<pb_utils.Tensor object>])

      # `pb_utils.InferenceRequest` supports request_id, correlation_id,
      # model version, timeout and preferred_memory in addition to the
      # arguments described above.
      # Note: Starting from the 24.03 release, the `correlation_id` parameter
      # supports both string and unsigned integer values.
      # These arguments are optional. An example containing all the arguments:
      # inference_request = pb_utils.InferenceRequest(model_name='model_name',
      #   requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
      #   inputs=[<list of pb_utils.Tensor objects>],
      #   request_id="1", correlation_id="ex-4", model_version=1, flags=0, timeout=5,
      #   preferred_memory=pb_utils.PreferredMemory(
      #     pb_utils.TRITONSERVER_MEMORY_GPU, # or pb_utils.TRITONSERVER_MEMORY_CPU
      #     0))

      # Execute the inference_request and wait for the response. Here we are
      # running a BLS request on a decoupled model, hence setting the parameter
      # 'decoupled' to 'True'.
      inference_responses = inference_request.exec(decoupled=True)

      for inference_response in inference_responses:
        # Check if the inference response has an error
        if inference_response.has_error():
            raise pb_utils.TritonModelException(
              inference_response.error().message())

        # For some models, it is possible that the last response is empty
        if len(inference_response.output_tensors()) > 0:
          # Extract the output tensors from the inference response.
          output1 = pb_utils.get_output_tensor_by_name(
            inference_response, 'REQUESTED_OUTPUT_1')
          output2 = pb_utils.get_output_tensor_by_name(
            inference_response, 'REQUESTED_OUTPUT_2')

          # Decide the next steps for model execution based on the received
          # output tensors. It is possible to use the same output tensors
          # for the final inference response too.

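In addition to the inference_request.exec(decoupled=True) function, which lets you execute blocking inference requests on decoupled models, inference_request.async_exec(decoupled=True) lets you perform asynchronous inference requests. This can be useful when you do not need the result of the inference immediately. Using the async_exec function, it is possible to have multiple in-flight inference requests and wait for the responses only when needed. The example below shows how to use async_exec: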

import triton_python_backend_utils as pb_utils
import asyncio


class TritonPythonModel:
  ...

    # You must add the Python 'async' keyword to the beginning of `execute`
    # function if you want to use `async_exec` function.
    async def execute(self, requests):
      ...
      # Create an InferenceRequest object. `model_name`,
      # `requested_output_names`, and `inputs` are the required arguments and
      # must be provided when constructing an InferenceRequest object. Make
      # sure to replace `inputs` argument with a list of `pb_utils.Tensor`
      # objects.
      inference_request = pb_utils.InferenceRequest(
          model_name='model_name',
          requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
          inputs=[<pb_utils.Tensor object>])

      infer_response_awaits = []
      for i in range(4):
        # async_exec function returns an
        # [Awaitable](https://docs.pythonlang.cn/3/library/asyncio-task.html#awaitables)
        # object.
        infer_response_awaits.append(
          inference_request.async_exec(decoupled=True))

      # Wait for all of the inference requests to complete.
      async_responses = await asyncio.gather(*infer_response_awaits)

      for infer_responses in async_responses:
        for infer_response in infer_responses:
          # Check if the inference response has an error
          if infer_response.has_error():
              raise pb_utils.TritonModelException(
                infer_response.error().message())

          # For some models, it is possible that the last response is empty
          if len(infer_response.output_tensors()) > 0:
              # Extract the output tensors from the inference response.
              output1 = pb_utils.get_output_tensor_by_name(
                infer_response, 'REQUESTED_OUTPUT_1')
              output2 = pb_utils.get_output_tensor_by_name(
                infer_response, 'REQUESTED_OUTPUT_2')

              # Decide the next steps for model execution based on the received
              # output tensors.

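A complete example of sync and async BLS for decoupled models is included in the Examples section.

Starting from the 22.04 release, the lifetime of BLS output tensors has been improved so that a tensor is automatically deallocated once it is no longer needed in your Python model. This can increase the number of BLS requests you can execute in your model without running into out-of-GPU-memory or out-of-shared-memory errors.

Note: Async BLS is not supported on Python 3.6 or lower because the async keyword and asyncio.run were introduced in Python 3.7.

Model Loading API#

Starting from the 23.07 release, you can use the model loading API to load the models required by your BLS model. The model loading API is equivalent to the Triton C API for loading models, which is documented in tritonserver.h. The following example shows how to use the model loading API: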

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.model_name="onnx_model"
        # Check if the model is ready, and load the model if it is not ready.
        # You can specify the model version in string format. The version is
        # optional, and if not provided, the server will choose a version based
        # on the model and internal policy.
        if not pb_utils.is_model_ready(model_name=self.model_name,
                                       model_version="1"):
            # Load the model from the model repository
            pb_utils.load_model(model_name=self.model_name)

            # Load the model with an optional override model config in JSON
            # representation. If provided, this config will be used for
            # loading the model.
            config = "{\"backend\":\"onnxruntime\", \"version_policy\":{\"specific\":{\"versions\":[1]}}}"
            pb_utils.load_model(model_name=self.model_name, config=config)

            # Load the model with optional override files. The override files are
            # specified as a dictionary where the key is the file path (with
            # "file:" prefix) and the value is the file content as bytes. The
            # files will form the model directory that the model will be loaded
            # from. If specified, 'config' must be provided to be the model
            # configuration of the override model directory.
            with open('models/onnx_int32_int32_int32/1/model.onnx', 'rb') as file:
                data = file.read()
            files = {"file:1/model.onnx": data}
            pb_utils.load_model(model_name=self.model_name,
                                config=config, files=files)

    def execute(self, requests):
        # Execute the model
        ...
        # If the model is no longer needed, you can unload it. You can also
        # specify whether the dependents of the model should also be unloaded by
        # setting the 'unload_dependents' parameter to True. The default value
        # is False. Need to be careful when unloading the model as it can affect
        # other model instances or other models that depend on it.
        pb_utils.unload_model(model_name=self.model_name,
                              unload_dependents=True)

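Note that the model loading API is only supported when the server is running in explicit model control mode. In addition, the model loading API should only be used once the server is running, which means the BLS model should not be loaded during server startup. You can use the various client endpoints to load the model after the server has started. The model loading API is currently not supported in the auto_complete_config and finalize functions.

Using BLS with Stateful Models#

Stateful models require additional flags to be set in the inference request to indicate the start and end of a sequence. The flags argument of the pb_utils.InferenceRequest object can be used to indicate whether the request is the first or the last request in the sequence. The following example indicates that the request starts a sequence: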

inference_request = pb_utils.InferenceRequest(model_name='model_name',
  requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
  inputs=[<list of pb_utils.Tensor objects>],
  request_id="1", correlation_id=4,
  flags=pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START)

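To indicate the end of a sequence, use the pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END flag. If a request both starts and ends a sequence (that is, the sequence has only one request), use the bitwise OR operator to enable both flags: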

flags = pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START | pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END
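For sequences with more than one request, the same flags can be chosen per position. The sketch below is only an illustration: `sequence_flags` is a hypothetical helper, not part of `pb_utils`, and the flag constants are passed in as arguments so the snippet runs outside Triton. The numeric values used in the demo are stand-ins; inside a model you would pass `pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START` and `pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END`.

```python
def sequence_flags(index, total, start_flag, end_flag):
    """Return the flags value for request `index` of a `total`-request sequence."""
    if total < 1 or not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    flags = 0
    if index == 0:
        # First request of the sequence.
        flags |= start_flag
    if index == total - 1:
        # Last request of the sequence (may also be the first).
        flags |= end_flag
    return flags


# Stand-in values so the sketch runs outside Triton (assumption);
# in a model, pass the pb_utils constants instead.
START, END = 1, 2
three_step = [sequence_flags(i, 3, START, END) for i in range(3)]
single = sequence_flags(0, 1, START, END)
```

The result would be passed as the `flags` argument of `pb_utils.InferenceRequest`, together with the same `correlation_id` for every request in the sequence.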

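Limitations#

You need to make sure that the inference requests performed as part of your model do not create a circular dependency. For example, if model A performs an inference request on itself and no other model instance is ready to execute that request, the model will block on the inference execution forever.
Async BLS is not supported when running a Python model in decoupled mode.

Interoperability and GPU Support#

Starting from the 21.09 release, the Python backend supports DLPack for zero-copy transfer of Python backend tensors to other frameworks. The following methods have been added to the pb_utils.Tensor object for this purpose:

`pb_utils.Tensor.to_dlpack() -> PyCapsule`#

This method can be called on an existing instantiated tensor to convert it to DLPack. The following snippet shows how it works with PyTorch: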

from torch.utils.dlpack import from_dlpack
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

  def execute(self, requests):
    ...
    input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")

    # We have converted a Python backend tensor to a PyTorch tensor without
    # making any copies.
    pytorch_tensor = from_dlpack(input0.to_dlpack())

`pb_utils.Tensor.from_dlpack() -> Tensor`#

This static method can be used to create a Tensor object from the DLPack encoding of a tensor. For example:

from torch.utils.dlpack import to_dlpack
import torch
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

  def execute(self, requests):
    ...
    pytorch_tensor = torch.tensor([1, 2, 3], device='cuda')

    # Create a Python backend tensor from the DLPack encoding of a PyTorch
    # tensor.
    input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(pytorch_tensor))

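The Python backend also allows tensors that implement the __dlpack__ and __dlpack_device__ interface to be converted to Python backend tensors. For example:

input0 = pb_utils.Tensor.from_dlpack("INPUT0", pytorch_tensor)

This method only supports contiguous tensors in C order. If the tensor is not C-order contiguous, an exception is raised.

For Python models with input or output tensors of type BFloat16 (BF16), the as_numpy() method is not supported, and the from_dlpack and to_dlpack methods must be used instead.

`pb_utils.Tensor.is_cpu() -> bool`#

This function can be used to check whether a tensor is placed in CPU memory.

Input Tensor Device Placement#

By default, the Python backend moves all input tensors to CPU before providing them to the Python model. Starting from the 21.09 release, you can change this default behavior. When FORCE_CPU_ONLY_INPUT_TENSORS is set to "no", Triton does not move input tensors to CPU for the Python model. Instead, Triton provides the input tensors in either CPU or GPU memory, depending on how those tensors were last used. You cannot predict which memory will be used for each input tensor, so your Python model must be able to handle tensors in both CPU and GPU memory. To enable this setting, add the following to the parameters section of the model configuration: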

parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: {string_value:"no"}}
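With this setting, a model has to branch on where each input lives. The following is a minimal sketch of that dispatch, using only the `is_cpu()`, `as_numpy()`, and `to_dlpack()` methods described above. `tensor_to_framework` and `FakeTensor` are hypothetical names introduced for illustration; the fake class exists only so the snippet runs outside Triton.

```python
def tensor_to_framework(tensor, from_cpu, from_gpu):
    """Convert a pb_utils.Tensor-like object, whichever memory it lives in."""
    if tensor.is_cpu():
        # CPU tensors can be read as NumPy arrays.
        return from_cpu(tensor.as_numpy())
    # GPU tensors are handed over through DLPack without a copy.
    return from_gpu(tensor.to_dlpack())


class FakeTensor:
    """Hypothetical stand-in for pb_utils.Tensor (illustration only)."""

    def __init__(self, on_cpu):
        self._on_cpu = on_cpu

    def is_cpu(self):
        return self._on_cpu

    def as_numpy(self):
        return "numpy-array"

    def to_dlpack(self):
        return "dlpack-capsule"


cpu_result = tensor_to_framework(
    FakeTensor(True), lambda a: ("cpu", a), lambda c: ("gpu", c))
gpu_result = tensor_to_framework(
    FakeTensor(False), lambda a: ("cpu", a), lambda c: ("gpu", c))
```

In a PyTorch-based model, `from_cpu` could be `torch.from_numpy` and `from_gpu` could be `torch.utils.dlpack.from_dlpack`, so that both paths end in a PyTorch tensor.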

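Frameworks#

Because Python backend models can use most Python packages, users commonly use deep learning frameworks such as PyTorch in their model.py implementation. This section documents some notes and frequently asked questions about this workflow.

Note

Using a deep learning framework or package in a Python backend model is not necessarily the same as using the corresponding Triton backend implementation. For example, the PyTorch backend is different from a Python backend model that uses import torch. If you see significantly different results between a model executed by the framework (for example, PyTorch) and a Python backend model running the same framework, the first things to check are that the framework versions and the input/output preparation are the same.

PyTorch#

For a simple example of using PyTorch in a Python backend model, see the AddSubNet PyTorch example.

PyTorch Determinism#

When running PyTorch code, you may notice slight differences in output values across runs or across servers, depending on hardware, system load, driver, or even batch size. These differences are generally related to which CUDA kernels are selected to execute the operations, based on the factors above.

For most intents and purposes, these differences are not large enough to affect a model's final prediction. To understand where they come from, see the PyTorch documentation on reproducibility.

On Ampere devices and later, there is an optimization for FP32 operations called TensorFloat32 (TF32). This optimization typically improves overall performance at the cost of a minor loss of precision, and that loss is likewise acceptable for most model predictions. For more information on TF32 in PyTorch and how to enable or disable it as needed, see the PyTorch documentation on TF32.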
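As a minimal sketch of toggling TF32 from a Python backend model, the snippet below sets the two PyTorch flags that control TF32 for matrix multiplications and cuDNN convolutions. `set_tf32` is a hypothetical helper name, and the import guard is only there so the snippet also runs where PyTorch is not installed. Where to call it (for example, in `initialize`) is an assumption, not something this document prescribes.

```python
try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def set_tf32(enabled):
    """Toggle TF32 for matmuls and cuDNN convolutions.

    Returns False if PyTorch is not installed, True otherwise.
    """
    if torch is None:
        return False
    torch.backends.cuda.matmul.allow_tf32 = enabled
    torch.backends.cudnn.allow_tf32 = enabled
    return True


# Favor full FP32 precision over speed.
applied = set_tf32(False)
```

Disabling TF32 reduces one source of numeric difference between devices; it does not by itself make results bit-for-bit reproducible.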

TensorFlow#

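TensorFlow Determinism#

As with PyTorch determinism above, TensorFlow outputs can vary slightly depending on factors such as hardware, system configuration, or batch size, because of the library's internal CUDA kernel selection process. For more information on improving the determinism of outputs in TensorFlow, see the TensorFlow documentation.

Custom Metrics#

Starting from the 23.05 release, you can use the Custom Metrics API to register and collect custom metrics in the initialize, execute, and finalize functions of your Python model. The Custom Metrics API is the Python equivalent of the custom metrics support in the Triton C API. You take ownership of the custom metrics created through the API and must manage their lifetime. Note that if you want to delete the custom metric objects explicitly, a MetricFamily object should be deleted only after all the Metric objects under it have been deleted.

The following example shows how to use this feature: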

import time

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
      # Create a MetricFamily object to report the latency of the model
      # execution. The 'kind' parameter must be either 'COUNTER',
      # 'GAUGE' or 'HISTOGRAM'.
      self.metric_family = pb_utils.MetricFamily(
          name="preprocess_latency_ns",
          description="Cumulative time spent pre-processing requests",
          kind=pb_utils.MetricFamily.COUNTER
      )

      # Create a Metric object under the MetricFamily object. The 'labels'
      # is a dictionary of key-value pairs.
      self.metric = self.metric_family.Metric(
        labels={"model" : "model_name", "version" : "1"}
      )

    def execute(self, requests):
      responses = []

      for request in requests:
        # Pre-processing - time it to capture latency
        start_ns = time.time_ns()
        self.preprocess(request)
        end_ns = time.time_ns()

        # Update metric to track cumulative pre-processing latency
        self.metric.increment(end_ns - start_ns)

        ...

        print("Cumulative pre-processing latency:", self.metric.value())

      return responses

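See the custom_metrics example for a complete demonstration of the Custom Metrics API in a Python model.

Examples#

To use the Triton Python client in these examples, you need to install the Triton Python Client Library. The Python client for each example is in its client.py file.

AddSub in NumPy#

The AddSub NumPy example requires no dependencies. The Quick Start section explains how to use this model. You can find the files in examples/add_sub.

AddSubNet in PyTorch#

To use this model, you need to install PyTorch. We recommend the pip method described on the PyTorch website. Make sure that PyTorch is available in the same Python environment as the other dependencies. Alternatively, you can create a Python execution environment. You can find the files for this example in examples/pytorch.

AddSub in JAX#

The JAX example shows how to serve JAX in Triton using the Python backend. You can find the complete example instructions in examples/jax.

Business Logic Scripting#

The BLS example needs the dependencies required by both of the examples above. You can find the complete example instructions in examples/bls and examples/bls_decoupled.

Preprocessing#

The preprocessing example shows how to use the Python backend for model preprocessing. You can find the complete example instructions in examples/preprocessing.

Decoupled Models#

The decoupled model examples show how to develop and serve decoupled models in Triton using the Python backend. You can find the complete example instructions in examples/decoupled.

Model Instance Kind#

The Triton model configuration lets users specify a kind in the instance group settings. A Python backend model can be written to respect the kind setting and so control whether a model instance executes on CPU or GPU.

The model instance kind example demonstrates how to achieve this in your Python model.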
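A rough sketch of respecting the kind setting is shown below. It assumes that the `args` dictionary passed to `initialize` carries the `model_instance_kind` and `model_instance_device_id` entries as strings. `select_device` is a hypothetical helper, and the returned `"cuda:<id>"` / `"cpu"` strings follow PyTorch's device naming as an illustration only.

```python
def select_device(args):
    """Pick a device string from the instance-group kind in `args`."""
    kind = args.get("model_instance_kind", "CPU")
    if kind == "GPU":
        # Assumption: the device ID arrives as a string such as "0".
        return "cuda:" + str(args.get("model_instance_device_id", "0"))
    return "cpu"


gpu_device = select_device(
    {"model_instance_kind": "GPU", "model_instance_device_id": "1"})
cpu_device = select_device({"model_instance_kind": "CPU"})
```

A model could compute the device once in `initialize`, store it on `self`, and move its weights and inputs to that device in `execute`.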

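Auto-complete Config#

The auto-complete config example demonstrates how to use the auto_complete_config function to define a minimal model configuration when a configuration file is not available. You can find the complete example instructions in examples/auto_complete.

Custom Metrics#

This example shows how to use the Custom Metrics API in the Python backend. You can find the complete example instructions in examples/custom_metrics.

Running with Inferentia#

See the README.md in the python_backend/inferentia subfolder.

Logging#

Starting from the 22.09 release, your Python model can log information using the following methods: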

import triton_python_backend_utils as pb_utils

class TritonPythonModel:

  def execute(self, requests):
    ...
    logger = pb_utils.Logger
    logger.log_info("Info Msg!")
    logger.log_warn("Warning Msg!")
    logger.log_error("Error Msg!")
    logger.log_verbose("Verbose Msg!")

Note: The logger can be defined and used in the following class methods:

initialize
execute
finalize

Log messages can also be sent with the log level specified explicitly:

# log-level options: INFO, WARNING, ERROR, VERBOSE
logger.log("Specific Msg!", logger.INFO)

If no log level is specified, this method logs the message at INFO level.
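When the level is only known at run time (for example, read from a model parameter), the explicit-level form can be wrapped as in the sketch below. `log_at` and `FakeLogger` are hypothetical names used for illustration; the fake logger and its placeholder level values exist only so the snippet runs outside Triton. In a model you would pass `pb_utils.Logger` instead.

```python
def log_at(logger, message, level_name="INFO"):
    """Log `message` at the named level via logger.log(message, level)."""
    allowed = ("INFO", "WARNING", "ERROR", "VERBOSE")
    if level_name not in allowed:
        raise ValueError(f"unknown log level: {level_name}")
    logger.log(message, getattr(logger, level_name))


class FakeLogger:
    """Hypothetical stand-in for pb_utils.Logger; level values are placeholders."""

    INFO, WARNING, ERROR, VERBOSE = "INFO", "WARNING", "ERROR", "VERBOSE"

    def __init__(self):
        self.records = []

    def log(self, message, level=None):
        self.records.append((level, message))


fake = FakeLogger()
log_at(fake, "started")
log_at(fake, "failed", "ERROR")
```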

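Note that the Triton server's settings determine which log messages appear in the server log. For example, if a model attempts to log a verbose-level message but Triton is not set to log verbose-level messages, the message will not appear in the server log. For more information on Triton's log settings and how to adjust them dynamically, see Triton's logging extension documentation.

Adding Custom Parameters in the Model Configuration#

If your model requires custom parameters in its configuration, you can specify them in the parameters section of the model config. For example: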

parameters {
  key: "custom_key"
  value: {
    string_value: "custom_value"
  }
}

You can now access this parameter through the args argument of the initialize function:

import json

def initialize(self, args):
    print(json.loads(args['model_config'])['parameters'])
    # Should print {'custom_key': {'string_value': 'custom_value'}}
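Since every custom parameter is wrapped in a `string_value` entry, it can be convenient to flatten the section into a plain dictionary. The helper below is a minimal sketch of that: `custom_parameters` is a hypothetical name, and `example_args` is hand-written to mimic the shape printed above rather than taken from a running server.

```python
import json


def custom_parameters(args):
    """Flatten the model config 'parameters' section to {key: string_value}."""
    model_config = json.loads(args["model_config"])
    parameters = model_config.get("parameters", {})
    return {key: entry["string_value"] for key, entry in parameters.items()}


# Hand-written args mimicking what initialize receives (illustrative only).
example_args = {
    "model_config": json.dumps(
        {"parameters": {"custom_key": {"string_value": "custom_value"}}}
    )
}
params = custom_parameters(example_args)
```

Values always arrive as strings, so numeric or boolean parameters need an explicit conversion in the model.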

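Development with VSCode#

The repository includes a .devcontainer folder containing a Dockerfile and a devcontainer.json file to help you develop the Python backend using Visual Studio Code.

To build the backend, run the "Build Python Backend" task in the VSCode tasks. This builds the Python backend and installs the artifacts in /opt/tritonserver/backends/python.

Reporting Problems, Asking Questions#

We appreciate any feedback, questions, or bug reports regarding this project. When you need help with code, follow the process outlined in the Stack Overflow (https://stackoverflow.com/help/mcve) document. Make sure posted examples are:

Minimal: use as little code as possible that still reproduces the same problem.
Complete: provide all the parts needed to reproduce the problem. Check whether you can strip external dependencies and still show the problem. The less time we spend reproducing problems, the more time we have to fix them.
Verifiable: test the code you are about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request or question.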
