Sequence Extension

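This document describes Triton's sequence extension. The sequence extension allows Triton to support stateful models that expect a sequence of related inference requests.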

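An inference request indicates that it is part of a sequence by setting the "sequence_id" parameter in the request, and marks the start and end of the sequence with the "sequence_start" and "sequence_end" parameters.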

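Because this extension is supported, Triton reports "sequence" in the extensions field of its Server Metadata. If the "sequence_id" parameter supports string values, Triton may additionally report "sequence(string_id)" in the extensions field of the Server Metadata.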
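
A client can check for these entries before sending sequence requests. The following is a minimal sketch, assuming the server metadata has already been fetched and parsed into a Python dict with an `extensions` list. The function name `sequence_support` and the example payload are illustrative and not part of Triton.

```python
def sequence_support(server_metadata: dict) -> dict:
    """Report which sequence features a parsed server metadata dict advertises."""
    extensions = set(server_metadata.get("extensions", []))
    return {
        # Basic sequence extension.
        "sequence": "sequence" in extensions,
        # String-valued sequence IDs.
        "string_id": "sequence(string_id)" in extensions,
    }


# Hypothetical payload shaped like a parsed server metadata response.
example_metadata = {
    "name": "triton",
    "extensions": ["sequence", "sequence(string_id)"],
}
print(sequence_support(example_metadata))
```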

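"sequence_id": A string or uint64 value that identifies the sequence to which a request belongs. All inference requests that belong to the same sequence must use the same sequence ID. A sequence ID of 0 or "" indicates that the inference request is not part of a sequence.
"sequence_start": A boolean value. If set to true in a request, the request is the first request in the sequence. If not set, or set to false, the request is not the first request in the sequence. If set, the "sequence_id" parameter must be set to a non-zero or non-empty string value.
"sequence_end": A boolean value. If set to true in a request, the request is the last request in the sequence. If not set, or set to false, the request is not the last request in the sequence. If set, the "sequence_id" parameter must be set to a non-zero or non-empty string value.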
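
The rules above can be checked on the client side before a request is sent. This is a minimal sketch rather than Triton code. The helper name `sequence_parameters` and its keyword arguments are assumptions made for illustration.

```python
from typing import Union

SequenceId = Union[int, str]


def sequence_parameters(sequence_id: SequenceId = 0,
                        start: bool = False,
                        end: bool = False) -> dict:
    """Build the request "parameters" object, enforcing the sequence rules."""
    # A sequence ID of 0 or "" means the request is not part of a sequence.
    in_sequence = sequence_id not in (0, "")
    if (start or end) and not in_sequence:
        raise ValueError(
            "sequence_start/sequence_end require a non-zero or non-empty sequence_id")
    # Integer IDs must fit in a uint64.
    if isinstance(sequence_id, int) and not 0 <= sequence_id < 2**64:
        raise ValueError("integer sequence_id must fit in a uint64")

    parameters = {}
    if in_sequence:
        parameters["sequence_id"] = sequence_id
    if start:
        parameters["sequence_start"] = True
    if end:
        parameters["sequence_end"] = True
    return parameters


print(sequence_parameters(42, start=True))
```

Leaving `sequence_start` and `sequence_end` out when they are false relies on the rule that an unset flag is treated the same as false.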

HTTP/REST

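The following example shows how a request is marked as part of a sequence. In this case the sequence_start and sequence_end parameters are not used, which means that this request is neither the start nor the end of the sequence.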

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "parameters" : { "sequence_id" : 42 },
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0"
    }
  ]
}
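
As an illustration of how the start and end flags fit around a request like the one above, the sketch below builds the JSON bodies for a three-request sequence. It only constructs and prints the bodies; each one would be sent as the body of a POST to the infer endpoint shown above. The helper name `infer_body` and the input values are hypothetical.

```python
import json


def infer_body(data, sequence_id, start=False, end=False):
    """Build one inference request body that belongs to a sequence."""
    parameters = {"sequence_id": sequence_id}
    if start:
        parameters["sequence_start"] = True
    if end:
        parameters["sequence_end"] = True
    return {
        "parameters": parameters,
        "inputs": [
            {"name": "input0", "shape": [2, 2], "datatype": "UINT32", "data": data}
        ],
        "outputs": [{"name": "output0"}],
    }


# Illustrative input data for three consecutive requests in sequence 42.
steps = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
bodies = [
    infer_body(data, 42, start=(i == 0), end=(i == len(steps) - 1))
    for i, data in enumerate(steps)
]
for body in bodies:
    print(json.dumps(body))
```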

The following example uses a v4 UUID string as the value of the "sequence_id" parameter.

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
  "parameters" : { "sequence_id" : "e333c95a-07fc-42d2-ab16-033b1a566ed5" },
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0"
    }
  ]
}
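
One way a client might produce such an ID is with Python's standard `uuid` module, as sketched below. This assumes the server reports "sequence(string_id)"; the helper name `new_sequence_id` is illustrative.

```python
import uuid


def new_sequence_id() -> str:
    """Return a random v4 UUID string suitable for use as a sequence ID."""
    return str(uuid.uuid4())


sequence_id = new_sequence_id()
# Parameters for the first request of a new sequence with a string ID.
parameters = {"sequence_id": sequence_id, "sequence_start": True}
print(parameters)
```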

GRPC

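In addition to supporting the sequence parameters described above, the GRPC API adds a streaming version of the inference API so that a sequence of inference requests can be sent over the same GRPC stream. Requests that specify a sequence_id are not required to use this streaming API, and requests that do not specify a sequence_id may also use it. The ModelInferRequest message is the same as for the ModelInfer API. The ModelStreamInferResponse message is shown below.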

service GRPCInferenceService
{
  …

  // Perform inference using a specific model with GRPC streaming.
  rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse) {}
}

// Response message for ModelStreamInfer.
message ModelStreamInferResponse
{
  // The message describing the error. The empty message
  // indicates the inference was successful without errors.
  string error_message = 1;

  // Holds the results of the request.
  ModelInferResponse infer_response = 2;
}
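
The error convention in ModelStreamInferResponse (an empty error_message means success) can be handled on the client roughly as follows. This is a sketch that uses a plain dataclass as a stand-in for the generated protobuf class, so it runs without any GRPC dependency. The names `StreamResponse` and `split_stream` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


@dataclass
class StreamResponse:
    """Stand-in for ModelStreamInferResponse, not the generated protobuf class."""
    error_message: str = ""
    infer_response: Any = None


def split_stream(responses: Iterable[StreamResponse]) -> Tuple[List[Any], List[str]]:
    """Separate successful results from error messages in a response stream."""
    results, errors = [], []
    for response in responses:
        if response.error_message:
            # A non-empty message describes the error for this request.
            errors.append(response.error_message)
        else:
            # An empty message means the inference succeeded.
            results.append(response.infer_response)
    return results, errors


# Illustrative stream: two successful responses and one failure.
stream = [
    StreamResponse(infer_response={"id": "1"}),
    StreamResponse(error_message="inference failed"),
    StreamResponse(infer_response={"id": "3"}),
]
results, errors = split_stream(stream)
print(results, errors)
```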