Decoupled Model Examples
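This section walks through an end-to-end example of developing and serving decoupled models in the Python backend.
repeat_model.py and square_model.py demonstrate how to write a decoupled model, in which each request can generate zero or more responses. Both files are heavily commented to describe each function call. These example models are meant to show the flexibility of decoupled models and should never be used in production. They circumvent the restriction imposed by the instance count and allow multiple requests to be in flight even for a single instance. In a real deployment, the model should not let the caller thread return from execute until that instance is ready to handle another set of requests.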
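The in-flight behavior described above can be illustrated without Triton. The sketch below is a simplified simulation, not the code in repeat_model.py or square_model.py: `FakeResponseSender`, `execute`, and `_respond` are hypothetical names rather than the Triton Python backend API, and the one-thread-per-request approach is an assumption about how a model might let `execute` return before its responses have been sent.

```python
import threading


class FakeResponseSender:
    """Hypothetical stand-in for a response sender; it only records what is sent."""

    def __init__(self):
        self.sent = []
        self.closed = threading.Event()

    def send(self, value=None, final=False):
        if value is not None:
            self.sent.append(value)
        if final:
            # Marks the end of the response stream for this request.
            self.closed.set()


def _respond(value, sender):
    # Send `value` responses, each carrying `value`, then close the stream.
    for _ in range(value):
        sender.send(value)
    sender.send(final=True)


def execute(requests):
    """Start one worker thread per request and return without waiting for them."""
    threads = []
    for value, sender in requests:
        worker = threading.Thread(target=_respond, args=(value, sender))
        worker.start()
        threads.append(worker)
    # Returning here is what lets several requests be in flight at once.
    return threads


senders = [FakeResponseSender() for _ in range(4)]
threads = execute(list(zip([4, 2, 0, 1], senders)))
for worker in threads:
    worker.join()

results = [sender.sent for sender in senders]
print(results)
```

Each simulated request ends with a final marker even when it produces no responses, which is how a request can yield zero responses and still be completed.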
Deploying the Decoupled Models
Create the model repository:
mkdir -p models/repeat_int32/1
mkdir -p models/square_int32/1
# Copy the Python models
cp examples/decoupled/repeat_model.py models/repeat_int32/1/model.py
cp examples/decoupled/repeat_config.pbtxt models/repeat_int32/config.pbtxt
cp examples/decoupled/square_model.py models/square_int32/1/model.py
cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt
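Before starting the server, it can help to confirm that the copy commands above produced the expected layout, with a config.pbtxt in each model directory and a model.py in each version directory. The helper below is a minimal sketch; `missing_files` is a hypothetical name, and the demo builds a throwaway repository in a temporary directory rather than touching `models/`.

```python
import tempfile
from pathlib import Path


def missing_files(repo_root, model_names):
    """Return the expected repository files that are absent, relative to repo_root."""
    root = Path(repo_root)
    expected = []
    for name in model_names:
        expected.append(root / name / "config.pbtxt")
        expected.append(root / name / "1" / "model.py")
    return [path.relative_to(root).as_posix() for path in expected if not path.is_file()]


# Demo: a temporary repository in which only repeat_int32 is complete.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "models"
    (repo / "repeat_int32" / "1").mkdir(parents=True)
    (repo / "repeat_int32" / "config.pbtxt").touch()
    (repo / "repeat_int32" / "1" / "model.py").touch()
    demo_missing = missing_files(repo, ["repeat_int32", "square_int32"])

print(demo_missing)
```

Against the real repository, `missing_files("models", ["repeat_int32", "square_int32"])` should return an empty list once all four files have been copied.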
Start tritonserver:
tritonserver --model-repository `pwd`/models
Running Inference on the Repeat Model:
Use repeat_client.py to send an inference request to the repeat model.
python3 examples/decoupled/repeat_client.py
You should see output similar to the following:
stream started...
async_stream_infer
model_name: "repeat_int32"
id: "0"
inputs {
name: "IN"
datatype: "INT32"
shape: 4
}
inputs {
name: "DELAY"
datatype: "UINT32"
shape: 4
}
inputs {
name: "WAIT"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
}
outputs {
name: "IDX"
}
raw_input_contents: "\004\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000"
raw_input_contents: "\001\000\000\000\002\000\000\000\003\000\000\000\004\000\000\000"
raw_input_contents: "\005\000\000\000"
enqueued request 0 to stream...
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\000\000\000\000"
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\001\000\000\000"
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
raw_output_contents: "\000\000\000\000"
}
infer_response {
model_name: "repeat_int32"
model_version: "1"
id: "0"
outputs {
name: "IDX"
datatype: "UINT32"
shape: 1
}
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\003\000\000\000"
raw_output_contents: "\001\000\000\000"
}
PASS: repeat_int32
stream stopped...
Notice that a single request generates 4 responses.
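The raw_input_contents and raw_output_contents fields in the log above are byte strings printed with octal escapes. The sketch below decodes them with Python's `struct` module, assuming the tensors are serialized as little-endian 4-byte integers (an assumption that is consistent with the values shown); `decode` is a hypothetical helper name.

```python
import struct


def decode(raw, fmt="i"):
    """Unpack raw bytes as little-endian 4-byte integers ('i' signed, 'I' unsigned)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}{fmt}", raw))


# The IN tensor of the request (datatype INT32, shape 4).
in_tensor = decode(b"\004\000\000\000\002\000\000\000\000\000\000\000\001\000\000\000")

# (IDX, OUT) byte pairs from the four responses, in the order they appear in the log.
responses = [
    (b"\000\000\000\000", b"\004\000\000\000"),
    (b"\001\000\000\000", b"\002\000\000\000"),
    (b"\002\000\000\000", b"\000\000\000\000"),
    (b"\003\000\000\000", b"\001\000\000\000"),
]
decoded = [(decode(idx, "I")[0], decode(out)[0]) for idx, out in responses]

print(in_tensor)
print(decoded)
```

Read this way, each response carries one element of the IN tensor in OUT, with IDX giving that element's position.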
Running Inference on the Square Model:
Use square_client.py to send inference requests to the square model.
python3 examples/decoupled/square_client.py
You should see output similar to the following:
stream started...
async_stream_infer
model_name: "square_int32"
id: "0"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\004\000\000\000"
enqueued request 0 to stream...
async_stream_infer
model_name: "square_int32"
id: "1"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\002\000\000\000"
enqueued request 1 to stream...
async_stream_infer
model_name: "square_int32"
id: "2"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\000\000\000\000"
enqueued request 2 to stream...
async_stream_infer
model_name: "square_int32"
id: "3"
inputs {
name: "IN"
datatype: "INT32"
shape: 1
}
outputs {
name: "OUT"
}
raw_input_contents: "\001\000\000\000"
enqueued request 3 to stream...
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "1"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "3"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\001\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "1"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\002\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
infer_response {
model_name: "square_int32"
model_version: "1"
id: "0"
outputs {
name: "OUT"
datatype: "INT32"
shape: 1
}
raw_output_contents: "\004\000\000\000"
}
PASS: square_int32
stream stopped...
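Notice that the responses are delivered out of order relative to the requests. The id field can be used to trace each response back to the request that produced it.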
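Tracing responses by id can be sketched as a simple grouping step. The example below uses the decoded (id, OUT) pairs from the log above in arrival order; `group_by_request` is a hypothetical helper, and the data is copied from the log rather than read from a live stream.

```python
from collections import defaultdict

# (request id, OUT value) pairs in the order the responses arrived in the log.
arrivals = [("0", 4), ("1", 2), ("0", 4), ("3", 1), ("1", 2), ("0", 4), ("0", 4)]

# IN value sent with each request id, as shown in the request messages.
sent_inputs = {"0": 4, "1": 2, "2": 0, "3": 1}


def group_by_request(pairs):
    """Collect response values under the id of the request that produced them."""
    grouped = defaultdict(list)
    for request_id, value in pairs:
        grouped[request_id].append(value)
    return dict(grouped)


grouped = group_by_request(arrivals)
for request_id, value in sent_inputs.items():
    print(request_id, value, grouped.get(request_id, []))
```

In this log the number of responses per request equals its IN value, so request 2 (IN = 0) has no entry at all. A client therefore cannot assume that every request yields at least one response.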