Decoupled Model Examples

In this section, we demonstrate an end-to-end example of developing and serving a decoupled model with the Python backend.

repeat_model.py and square_model.py demonstrate how to write a decoupled model where each request can generate zero to many responses. Both files are heavily commented to describe each function call. These example models are designed to show the flexibility available to decoupled models and should in no way be used in production. They circumvent the restriction imposed by the instance count and allow multiple requests to be in flight even for a single instance. In a real deployment, the model should not allow the caller thread to return from execute until that instance is ready to handle another set of requests.
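For orientation, the core pattern that both models build on is sketched below. This is a simplified illustration rather than the actual repeat_model.py or square_model.py (the tensor names IN and OUT are assumed here), and unlike the examples it sends its responses in-line before returning from execute instead of handing the work to a separate thread. Each request exposes a response sender, which the model uses to emit any number of responses followed by a final-completion flag.

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # In decoupled mode execute() returns None; every response travels
        # through the per-request response sender instead.
        for request in requests:
            sender = request.get_response_sender()
            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
            for value in in_tensor.as_numpy():
                out_tensor = pb_utils.Tensor("OUT", np.array([value], dtype=np.int32))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            # Signal that this request will receive no further responses.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None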

Deploying the Decoupled Models

  1. Create the model repository:

mkdir -p models/repeat_int32/1
mkdir -p models/square_int32/1

# Copy the Python models and their config files
cp examples/decoupled/repeat_model.py models/repeat_int32/1/model.py
cp examples/decoupled/repeat_config.pbtxt models/repeat_int32/config.pbtxt
cp examples/decoupled/square_model.py models/square_int32/1/model.py
cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt
  2. Start the tritonserver:

tritonserver --model-repository `pwd`/models
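Once the server is up, you can optionally confirm that both models loaded successfully. Below is a minimal sketch using the Triton HTTP client, assuming the server is listening on the default localhost:8000:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The server reports ready once the models in the repository are loaded.
assert client.is_server_ready()
for model_name in ("repeat_int32", "square_int32"):
    assert client.is_model_ready(model_name), f"{model_name} failed to load"
print("both decoupled models are ready")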

Running Inference on the Repeat Model

Send inference requests to the repeat model using repeat_client.py:

python3 examples/decoupled/repeat_client.py

You should see output similar to the following:

stream started...
async_stream_infer
model_name: "repeat_int32"
id: "0"
inputs {
  name: "IN"
  datatype: "INT32"
  shape: 4
}
inputs {
  name: "DELAY"
  datatype: "UINT32"
  shape: 4
}
inputs {
  name: "WAIT"
  datatype: "UINT32"
  shape: 1
}
outputs {
  name: "OUT"
}
outputs {
  name: "IDX"
}
raw_input_contents: "\004\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000"
raw_input_contents: "\001\000\000\000\002\000\000\000\003\000\000\000\004\000\000\000"
raw_input_contents: "\005\000\000\000"

enqueued request 0 to stream...
infer_response {
  model_name: "repeat_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "IDX"
    datatype: "UINT32"
    shape: 1
  }
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\000\000\000\000"
  raw_output_contents: "\004\000\000\000"
}

infer_response {
  model_name: "repeat_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "IDX"
    datatype: "UINT32"
    shape: 1
  }
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\001\000\000\000"
  raw_output_contents: "\002\000\000\000"
}

infer_response {
  model_name: "repeat_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "IDX"
    datatype: "UINT32"
    shape: 1
  }
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\002\000\000\000"
  raw_output_contents: "\000\000\000\000"
}

infer_response {
  model_name: "repeat_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "IDX"
    datatype: "UINT32"
    shape: 1
  }
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\003\000\000\000"
  raw_output_contents: "\001\000\000\000"
}

PASS: repeat_int32
stream stopped...

Note how a single request generated 4 responses.
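The client receives these responses over a single gRPC stream. The sketch below is a simplified illustration of the pattern used by repeat_client.py, not the script itself; the input values are taken from the log above and the server address is assumed to be the default localhost:8001.

import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Triton invokes this once for every response arriving on the stream.
    responses.put(error if error is not None else result)


responses = queue.Queue()
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(callback, responses))

# A single request asking repeat_int32 to emit one response per IN element.
inputs = [
    grpcclient.InferInput("IN", [4], "INT32"),
    grpcclient.InferInput("DELAY", [4], "UINT32"),
    grpcclient.InferInput("WAIT", [1], "UINT32"),
]
inputs[0].set_data_from_numpy(np.array([4, 2, 0, 1], dtype=np.int32))
inputs[1].set_data_from_numpy(np.array([1, 2, 3, 4], dtype=np.uint32))
inputs[2].set_data_from_numpy(np.array([5], dtype=np.uint32))

client.async_stream_infer(model_name="repeat_int32", inputs=inputs)

# The one request above produces four responses, one per IN element.
for _ in range(4):
    result = responses.get()
    if isinstance(result, Exception):
        raise result
    print(result.as_numpy("IDX"), result.as_numpy("OUT"))

client.stop_stream()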

Running Inference on the Square Model

Send inference requests to the square model using square_client.py:

python3 examples/decoupled/square_client.py

You should see output similar to the following:

stream started...
async_stream_infer
model_name: "square_int32"
id: "0"
inputs {
  name: "IN"
  datatype: "INT32"
  shape: 1
}
outputs {
  name: "OUT"
}
raw_input_contents: "\004\000\000\000"

enqueued request 0 to stream...
async_stream_infer
model_name: "square_int32"
id: "1"
inputs {
  name: "IN"
  datatype: "INT32"
  shape: 1
}
outputs {
  name: "OUT"
}
raw_input_contents: "\002\000\000\000"

enqueued request 1 to stream...
async_stream_infer
model_name: "square_int32"
id: "2"
inputs {
  name: "IN"
  datatype: "INT32"
  shape: 1
}
outputs {
  name: "OUT"
}
raw_input_contents: "\000\000\000\000"

enqueued request 2 to stream...
async_stream_infer
model_name: "square_int32"
id: "3"
inputs {
  name: "IN"
  datatype: "INT32"
  shape: 1
}
outputs {
  name: "OUT"
}
raw_input_contents: "\001\000\000\000"

enqueued request 3 to stream...
infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\004\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "1"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\002\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\004\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "3"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\001\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "1"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\002\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\004\000\000\000"
}

infer_response {
  model_name: "square_int32"
  model_version: "1"
  id: "0"
  outputs {
    name: "OUT"
    datatype: "INT32"
    shape: 1
  }
  raw_output_contents: "\004\000\000\000"
}

PASS: square_int32
stream stopped...

Note how the responses are delivered out of order relative to the requests. The id field can be used to trace each generated response back to its originating request.
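A minimal sketch of such a callback, assuming the tritonclient.grpc streaming API used above; it groups every response under the id of the request that produced it:

from collections import defaultdict

responses_by_request = defaultdict(list)


def callback(result, error):
    # Responses may arrive in any order; the id carried by each response
    # identifies the request that produced it.
    if error is not None:
        print("stream error:", error)
        return
    request_id = result.get_response().id
    responses_by_request[request_id].append(result.as_numpy("OUT"))

# Pass this callback to client.start_stream(callback=callback) and give each
# async_stream_infer call a distinct request_id, as in the log above.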