Triton Server (tritonfrontend) Bindings (Beta)
The tritonfrontend python package is a set of bindings to Triton's existing frontends implemented in C++. Currently, tritonfrontend supports starting up the KServeHttp and KServeGrpc frontends. Used in combination with Triton's Python in-process API (tritonserver) and tritonclient, these bindings extend the ability to use Triton's full feature set with just a few lines of Python code.
Let's walk through a simple example.
First, we need to load the desired models and start the server with tritonserver.
import tritonserver

# Constructing path to Model Repository
model_path = "server/src/python/examples/example_model_repository"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
)
server = tritonserver.Server(server_options).start(wait_until_ready=True)
Note: model_path may need to be edited depending on your setup.
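Optionally, before attaching any frontends, you can confirm that the repository loaded as expected through the in-process API. The following is a minimal sketch; server.ready() and server.models() are assumed from the tritonserver in-process API and may differ across versions.
# Hedged sketch: inspect the in-process server before starting any frontends.
print("Server ready:", server.ready())
print("Loaded models:", server.models())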
Now, start the respective services with tritonfrontend:
from tritonfrontend import KServeHttp, KServeGrpc, Metrics
http_options = KServeHttp.Options(thread_count=5)
http_service = KServeHttp(server, http_options)
http_service.start()
# Default options (if none provided)
grpc_service = KServeGrpc(server)
grpc_service.start()
# Can start metrics service as well
metrics_service = Metrics(server)
metrics_service.start()
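At this point, the frontends listen on Triton's default ports. As a quick sanity check, you can query the health and metrics endpoints directly; this is a minimal sketch assuming the default HTTP port 8000, the default metrics port 8002, and the third-party requests package.
import requests

# KServe v2 readiness endpoint served by the HTTP frontend (default port 8000)
ready = requests.get("http://localhost:8000/v2/health/ready")
print("HTTP frontend ready:", ready.status_code == 200)

# Prometheus-format metrics served by the Metrics frontend (default port 8002)
metrics = requests.get("http://localhost:8002/metrics")
print(metrics.text[:300])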
Finally, with the services running, we can use tritonclient or a simple curl command to send requests to and receive responses from the frontends (a raw-HTTP sketch appears near the end of the listing below).
import tritonclient.http as httpclient
import numpy as np # Use version numpy < 2
model_name = "identity" # output == input
url = "localhost:8000"
# Create a Triton client
client = httpclient.InferenceServerClient(url=url)
# Prepare input data
input_data = np.array([["Roger Roger"]], dtype=object)
# Create input and output objects
inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]
# Set the data for the input tensor
inputs[0].set_data_from_numpy(input_data)
results = client.infer(model_name, inputs=inputs)
# Get the output data
output_data = results.as_numpy("OUTPUT0")
# Print results
print("[INFERENCE RESULTS]")
print("Output data:", output_data)
# Stop respective services and server.
metrics_service.stop()
http_service.stop()
grpc_service.stop()
server.stop()
Additionally, tritonfrontend provides context manager support, so the two previous steps (starting the services and sending requests) can also be accomplished as follows:
from tritonfrontend import KServeHttp
import tritonclient.http as httpclient
import numpy as np  # Use version numpy < 2

with KServeHttp(server) as http_service:
    # The identity model returns an exact duplicate of the input data as output
    model_name = "identity"
    url = "localhost:8000"
    # Create a Triton client
    with httpclient.InferenceServerClient(url=url) as client:
        # Prepare input data
        input_data = np.array(["Roger Roger"], dtype=object)
        # Create input and output objects
        inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]
        # Set the data for the input tensor
        inputs[0].set_data_from_numpy(input_data)
        # Perform inference
        results = client.infer(model_name, inputs=inputs)
        # Get the output data
        output_data = results.as_numpy("OUTPUT0")
        # Print results
        print("[INFERENCE RESULTS]")
        print("Output data:", output_data)

server.stop()
With this workflow, you avoid having to explicitly stop each service once the client requests have finished.
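The same pattern extends to the other frontends. Below is a hedged sketch that assumes KServeGrpc and Metrics expose the same context-manager behavior as KServeHttp (starting on entry and stopping on exit); verify this against your installed tritonfrontend version.
from tritonfrontend import KServeHttp, KServeGrpc, Metrics

# Assumption: each frontend starts on entry and stops on exit, so all three
# are torn down automatically when the block ends.
with KServeHttp(server) as http_service, \
     KServeGrpc(server) as grpc_service, \
     Metrics(server) as metrics_service:
    ...  # exchange requests with tritonclient as in the examples above

server.stop()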