Statistics Extension

This document describes Triton's statistics extension. The statistics extension enables the reporting of per-model (per-version) statistics, which provide aggregate information about all activity that has occurred for a specific model (version) since Triton started. Because this extension is supported, Triton reports "statistics" in the extensions field of its Server Metadata.

HTTP/REST

In all JSON schemas shown in this document, $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.

Triton exposes the statistics endpoint at the following URL. The model name portion of the URL is optional; if not provided, Triton returns statistics for all versions of all models. If a specific model is given in the URL, the versions portion of the URL is optional; if not provided, Triton returns statistics for all versions of the specified model.

GET v2/models[/${MODEL_NAME}[/versions/${MODEL_VERSION}]]/stats
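
A minimal sketch of querying this endpoint with Python's requests package, assuming a Triton server listening on the default HTTP port 8000; the model name "densenet_onnx" is hypothetical:

import requests

# Statistics for all versions of a hypothetical model; drop the model (and
# version) path segments, i.e. GET v2/models/stats, to cover every model.
resp = requests.get("http://localhost:8000/v2/models/densenet_onnx/stats")

if resp.status_code == 200:
    stats = resp.json()  # a $stats_model_response object
    for model_stat in stats["model_stats"]:
        print(model_stat["name"], model_stat.get("version"),
              model_stat["inference_count"])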

Statistics Response JSON Object

A successful statistics request is indicated by a 200 HTTP status code. The response object, identified as $stats_model_response, is returned in the HTTP body for every successful statistics request.

$stats_model_response =
{
  "model_stats" : [ $model_stat, ... ]
}

Each $model_stat object gives the statistics for a specific model and version. The $version field is optional for servers that do not support versions.

$model_stat =
{
  "name" : $string,
  "version" : $string #optional,
  "last_inference" : $number,
  "inference_count" : $number,
  "execution_count" : $number,
  "inference_stats" : $inference_stats,
  "response_stats" : { $string : $response_stats, ... },
  "batch_stats" : [ $batch_stats, ... ],
  "memory_usage" : [ $memory_usage, ...]
}
  • "name" : The name of the model.

  • "version" : The version of the model.

  • "last_inference" : The timestamp of the last inference request made for this model, as milliseconds since the epoch.

  • "inference_count" : The cumulative count of successful inference requests made for this model. Each inference in a batched request is counted as an individual inference. For example, if a client sends a single inference request with batch size 64, "inference_count" will be incremented by 64. Similarly, if a client sends 64 individual requests each with batch size 1, "inference_count" will be incremented by 64. The "inference_count" value does not include cache hits.

  • "execution_count" : The cumulative count of successful inference executions performed for the model. When dynamic batching is enabled, a single model execution can perform inferencing for more than one inference request. For example, if a client sends 64 individual requests each with batch size 1 and the dynamic batcher batches them into a single large batch for model execution, "execution_count" will be incremented by 1. If, on the other hand, the dynamic batcher is not enabled, each of the 64 individual requests is executed independently and "execution_count" will be incremented by 64. The "execution_count" value does not include cache hits. See the sketch after this list for deriving an average batch size from these two counts.

  • "inference_stats" : The aggregate statistics for the model. For example, "inference_stats" : "success" indicates the number of successful inference requests for the model.

  • "response_stats" : The aggregate response statistics for the model. For example, { "key" : { "response_stats" : "success" } } indicates the aggregate statistics of successful responses at "key" for the model, where "key" identifies each response generated by the model across different requests. For example, given a model that generates three responses, the keys could be "0", "1", and "2", identifying the three responses in order.

  • "batch_stats" : The aggregate statistics for each different batch size that is executed for the model. The batch statistics indicate how many actual model executions were performed and show differences due to different batch sizes (for example, larger batches typically take longer to compute).

  • "memory_usage" : The memory usage detected during model loading, which may be used to estimate the memory to be released once the model is unloaded. Note that this estimate is inferred from profiling tools and the framework's memory patterns, so it is advisable to run experiments to understand the scenarios in which the reported memory usage can be relied on. As a starting point, the GPU memory usage of models in the ONNX Runtime and TensorRT backends is usually aligned.
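
The counts above can be combined into simple derived metrics. A minimal sketch, assuming model_stat is one entry of "model_stats" parsed from the JSON response:

def average_batch_size(model_stat: dict) -> float:
    # The ratio of inferences to model executions gives the average number of
    # inferences handled per execution, i.e. the effective batch size achieved
    # by dynamic batching.
    executions = model_stat["execution_count"]
    return model_stat["inference_count"] / executions if executions else 0.0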

$inference_stats =
{
  "success" : $duration_stat,
  "fail" : $duration_stat,
  "queue" : $duration_stat,
  "compute_input" : $duration_stat,
  "compute_infer" : $duration_stat,
  "compute_output" : $duration_stat,
  "cache_hit": $duration_stat,
  "cache_miss": $duration_stat
}
  • "success" : The count and cumulative duration of all successful inference requests. The "success" count and cumulative duration include cache hits.

  • "fail" : The count and cumulative duration of all failed inference requests.

  • "queue" : The count and cumulative duration that inference requests wait in scheduling or other queues. The "queue" count and cumulative duration include cache hits.

  • "compute_input" : The count and cumulative duration to prepare input tensor data as required by the model framework/backend. For example, this duration should include the time to copy input tensor data to the GPU. The "compute_input" count and cumulative duration do not include cache hits.

  • "compute_infer" : The count and cumulative duration to execute the model. The "compute_infer" count and cumulative duration do not include cache hits.

  • "compute_output" : The count and cumulative duration to extract the output tensor data produced by the model framework/backend. For example, this duration should include the time to copy output tensor data from the GPU. The "compute_output" count and cumulative duration do not include cache hits.

  • "cache_hit" : The count of response cache hits and the cumulative duration to look up and extract output tensor data from the response cache on a cache hit. For example, this duration should include the time to copy output tensor data from the response cache to the response object.

  • "cache_miss" : The count of response cache misses and the cumulative duration to look up and insert output tensor data into the response cache on a cache miss. For example, this duration should include the time to copy output tensor data from the response object to the response cache. A sketch deriving per-request averages and a cache hit rate from these fields follows this list.
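
A minimal sketch deriving per-request averages and a cache hit rate, assuming inference_stats is the parsed "inference_stats" object of a $model_stat:

def avg_ms(stat: dict) -> float:
    # Convert a $duration_stat (cumulative nanoseconds over a count) into an
    # average duration in milliseconds.
    return stat["ns"] / stat["count"] / 1e6 if stat["count"] else 0.0

def summarize(inference_stats: dict) -> dict:
    hits = inference_stats["cache_hit"]["count"]
    misses = inference_stats["cache_miss"]["count"]
    lookups = hits + misses
    return {
        "avg_queue_ms": avg_ms(inference_stats["queue"]),
        "avg_compute_infer_ms": avg_ms(inference_stats["compute_infer"]),
        "cache_hit_rate": hits / lookups if lookups else 0.0,
    }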

$response_stats =
{
  "compute_infer" : $duration_stat,
  "compute_output" : $duration_stat,
  "success" : $duration_stat,
  "fail" : $duration_stat,
  "empty_response" : $duration_stat,
  "cancel" : $duration_stat
}
  • "compute_infer" : The count and cumulative duration to compute a response.

  • "compute_output" : The count and cumulative duration to extract the output tensors of a computed response.

  • "success" : The count and cumulative duration of successful inferences. The duration is the sum of the infer and output durations.

  • "fail" : The count and cumulative duration of failed inferences. The duration is the sum of the infer and output durations.

  • "empty_response" : The count and cumulative duration of inferences with an empty/no response. The duration is the infer duration.

  • "cancel" : The count and cumulative duration of inference cancellations. The duration covers cleaning up the resources held by the cancelled inference request. A sketch summarizing these statistics per response key follows this list.
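
A minimal sketch summarizing the decoupled response statistics per response key, assuming response_stats is the parsed "response_stats" map of a $model_stat:

def response_summary(response_stats: dict) -> dict:
    summary = {}
    for key, stats in response_stats.items():
        successes = stats["success"]["count"]
        failures = stats["fail"]["count"]
        total = successes + failures
        summary[key] = {
            "success_rate": successes / total if total else 0.0,
            "empty_responses": stats["empty_response"]["count"],
            "cancellations": stats["cancel"]["count"],
        }
    return summary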

$batch_stats =
{
  "batch_size" : $number,
  "compute_input" : $duration_stat,
  "compute_infer" : $duration_stat,
  "compute_output" : $duration_stat
}
  • "batch_size" : The size of the batch.

  • "count" : The number of times the batch size was executed on the model. A single model execution performs inferencing for the entire request batch and can perform inferencing for multiple requests if dynamic batching is enabled. A sketch summarizing these statistics per batch size follows this list.

  • "compute_input" : The count and cumulative duration to prepare input tensor data as required by the model framework/backend with the given batch size. For example, this duration should include the time to copy input tensor data to the GPU.

  • "compute_infer" : The count and cumulative duration to execute the model with the given batch size.

  • "compute_output" : The count and cumulative duration to extract the output tensor data produced by the model framework/backend with the given batch size. For example, this duration should include the time to copy output tensor data from the GPU.
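
A minimal sketch reporting, for each batch size, how many executions used it and the average model-execution latency, assuming batch_stats is the parsed "batch_stats" array of a $model_stat:

def batch_summary(batch_stats: list) -> dict:
    summary = {}
    for entry in batch_stats:
        infer = entry["compute_infer"]  # a $duration_stat
        executions = infer["count"]
        summary[entry["batch_size"]] = {
            "executions": executions,
            "avg_infer_ms": infer["ns"] / executions / 1e6 if executions else 0.0,
        }
    return summary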

The $duration_stat object reports a count and a total time. This format can be sampled to determine not only long-running averages but also incremental averages between sample points; a sketch of the incremental computation follows the field descriptions below.

$duration_stat =
{
  "count" : $number,
  "ns" : $number
}
  • "count" : The number of times the statistic was collected.

  • "ns" : The total duration of the statistic, in nanoseconds.
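
A minimal sketch of the incremental computation: given two samples of the same $duration_stat (for example "compute_infer") taken at different times, the average duration between the sample points is the delta in "ns" over the delta in "count".

def incremental_avg_ns(earlier: dict, later: dict) -> float:
    # Both arguments are $duration_stat objects sampled from the same statistic.
    delta_count = later["count"] - earlier["count"]
    delta_ns = later["ns"] - earlier["ns"]
    return delta_ns / delta_count if delta_count else 0.0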

$memory_usage =
{
  "type" : $string,
  "id" : $number,
  "byte_size" : $number
}
  • "type" : The type of memory; the value can be "CPU", "CPU_PINNED", or "GPU".

  • "id" : The id of the memory, typically used with "type" to identify the device that hosts the memory.

  • "byte_size" : The byte size of the memory. A sketch aggregating the reported usage per device follows this list.
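
A minimal sketch aggregating the reported usage per device, assuming memory_usage is the parsed "memory_usage" array of a $model_stat:

from collections import defaultdict

def memory_by_device(memory_usage: list) -> dict:
    totals = defaultdict(int)
    for entry in memory_usage:
        # A key such as ("GPU", 0) identifies GPU device 0.
        totals[(entry["type"], entry["id"])] += entry["byte_size"]
    return dict(totals)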

Statistics Response JSON Error Object

A failed statistics request will be indicated by an HTTP error status (typically 400). The HTTP body must contain the $repository_statistics_error_response object.

$repository_statistics_error_response =
{
  "error": $string
}
  • "error" : The descriptive message for the error. A sketch surfacing this message from a failed request follows.
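
A minimal sketch surfacing this error object, continuing the requests example from the HTTP/REST section above:

if resp.status_code != 200:
    # The body of a failed statistics request is a
    # $repository_statistics_error_response object.
    print("statistics request failed:", resp.json()["error"])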

GRPC

For the statistics extension, Triton implements the following API:

service GRPCInferenceService
{
  …

  // Get the cumulative statistics for a model and version.
  rpc ModelStatistics(ModelStatisticsRequest)
          returns (ModelStatisticsResponse) {}
}

The ModelStatistics API returns model statistics. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelStatistics are shown below, followed by a short client sketch.

message ModelStatisticsRequest
{
  // The name of the model. If not given returns statistics for all
  // models.
  string name = 1;

  // The version of the model. If not given returns statistics for
  // all model versions.
  string version = 2;
}

message ModelStatisticsResponse
{
  // Statistics for each requested model.
  repeated ModelStatistics model_stats = 1;
}
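
A minimal client sketch, assuming the tritonclient Python GRPC client is installed, a Triton server is listening on the default GRPC port 8001, and a hypothetical model name:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Empty strings request statistics for all models and all versions; the call
# returns a ModelStatisticsResponse protobuf message as defined above.
stats = client.get_inference_statistics(model_name="densenet_onnx",
                                        model_version="")

for model_stat in stats.model_stats:
    print(model_stat.name, model_stat.version, model_stat.inference_count)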

The statistics messages are:

// Statistic recording a cumulative duration metric.
message StatisticDuration
{
  // Cumulative number of times this metric occurred.
  uint64 count = 1;

  // Total collected duration of this metric in nanoseconds.
  uint64 ns = 2;
}

// Statistics for a specific model and version.
message ModelStatistics
{
  // The name of the model.
  string name = 1;

  // The version of the model.
  string version = 2;

  // The timestamp of the last inference request made for this model,
  // as milliseconds since the epoch.
  uint64 last_inference = 3;

  // The cumulative count of successful inference requests made for this
  // model. Each inference in a batched request is counted as an
  // individual inference. For example, if a client sends a single
  // inference request with batch size 64, "inference_count" will be
  // incremented by 64. Similarly, if a client sends 64 individual
  // requests each with batch size 1, "inference_count" will be
  // incremented by 64. The "inference_count" value DOES NOT include cache hits.
  uint64 inference_count = 4;

  // The cumulative count of the number of successful inference executions
  // performed for the model. When dynamic batching is enabled, a single
  // model execution can perform inferencing for more than one inference
  // request. For example, if a client sends 64 individual requests each
  // with batch size 1 and the dynamic batcher batches them into a single
  // large batch for model execution then "execution_count" will be
  // incremented by 1. If, on the other hand, the dynamic batcher is not
  // enabled for that model, each of the 64 individual requests is executed
  // independently and "execution_count" will be incremented by 64.
  // The "execution_count" value DOES NOT include cache hits.
  uint64 execution_count = 5;

  // The aggregate statistics for the model.
  InferStatistics inference_stats = 6;

  // The aggregate statistics for each different batch size that is
  // executed in the model. The batch statistics indicate how many actual
  // model executions were performed and show differences due to different
  // batch size (for example, larger batches typically take longer to compute).
  repeated InferBatchStatistics batch_stats = 7;

  // The memory usage detected during model loading, which may be
  // used to estimate the memory to be released once the model is unloaded. Note
  // that the estimate is inferred from profiling tools and the framework's
  // memory patterns, so it is advisable to perform experiments to understand
  // the scenarios in which the reported memory usage can be relied on. As a
  // starting point, the GPU memory usage for models in the ONNX Runtime and
  // TensorRT backends is usually aligned.
  repeated MemoryUsage memory_usage = 8;

  // The key and value pairs for all decoupled responses statistics. The key is
  // a string identifying a set of response statistics aggregated together (i.e.
  // index of the response sent). The value is the aggregated response
  // statistics.
  map<string, InferResponseStatistics> response_stats = 9;
}

// Inference statistics.
message InferStatistics
{
  // Cumulative count and duration for successful inference
  // request. The "success" count and cumulative duration includes
  // cache hits.
  StatisticDuration success = 1;

  // Cumulative count and duration for failed inference
  // request.
  StatisticDuration fail = 2;

  // The count and cumulative duration that inference requests wait in
  // scheduling or other queues. The "queue" count and cumulative
  // duration includes cache hits.
  StatisticDuration queue = 3;

  // The count and cumulative duration to prepare input tensor data as
  // required by the model framework / backend. For example, this duration
  // should include the time to copy input tensor data to the GPU.
  // The "compute_input" count and cumulative duration do not account for
  // requests that were a cache hit. See the "cache_hit" field for more
  // info.
  StatisticDuration compute_input = 4;

  // The count and cumulative duration to execute the model.
  // The "compute_infer" count and cumulative duration do not account for
  // requests that were a cache hit. See the "cache_hit" field for more
  // info.
  StatisticDuration compute_infer = 5;

  // The count and cumulative duration to extract output tensor data
  // produced by the model framework / backend. For example, this duration
  // should include the time to copy output tensor data from the GPU.
  // The "compute_output" count and cumulative duration do not account for
  // requests that were a cache hit. See the "cache_hit" field for more
  // info.
  StatisticDuration compute_output = 6;

  // The count of response cache hits and cumulative duration to lookup
  // and extract output tensor data from the Response Cache on a cache
  // hit. For example, this duration should include the time to copy
  // output tensor data from the Response Cache to the response object.
  // On cache hits, Triton does not need to go to the model/backend
  // for the output tensor data, so the "compute_input", "compute_infer",
  // and "compute_output" fields are not updated. Assuming the response
  // cache is enabled for a given model, a cache hit occurs for a
  // request to that model when the request metadata (model name,
  // model version, model inputs) hashes to an existing entry in the
  // cache. On a cache miss, the request hash and response output tensor
  // data is added to the cache. See response cache docs for more info:
  // https://github.com/triton-inference-server/server/blob/main/docs/response_cache.md
  StatisticDuration cache_hit = 7;

  // The count of response cache misses and cumulative duration to lookup
  // and insert output tensor data from the computed response into the cache.
  // For example, this duration should include the time to copy
  // output tensor data from the response object to the Response Cache.
  // Assuming the response cache is enabled for a given model, a cache
  // miss occurs for a request to that model when the request metadata
  // does NOT hash to an existing entry in the cache. See the response
  // cache docs for more info:
  // https://github.com/triton-inference-server/server/blob/main/docs/response_cache.md
  StatisticDuration cache_miss = 8;
}

// Statistics per decoupled response.
message InferResponseStatistics
{
  // The count and cumulative duration to compute a response.
  StatisticDuration compute_infer = 1;

  // The count and cumulative duration to extract the output tensors of a
  // response.
  StatisticDuration compute_output = 2;

  // The count and cumulative duration for successful responses.
  StatisticDuration success = 3;

  // The count and cumulative duration for failed responses.
  StatisticDuration fail = 4;

  // The count and cumulative duration for empty responses.
  StatisticDuration empty_response = 5;
}

// Inference batch statistics.
message InferBatchStatistics
{
  // The size of the batch.
  uint64 batch_size = 1;

  // The count and cumulative duration to prepare input tensor data as
  // required by the model framework / backend with the given batch size.
  // For example, this duration should include the time to copy input
  // tensor data to the GPU.
  StatisticDuration compute_input = 2;

  // The count and cumulative duration to execute the model with the given
  // batch size.
  StatisticDuration compute_infer = 3;

  // The count and cumulative duration to extract output tensor data
  // produced by the model framework / backend with the given batch size.
  // For example, this duration should include the time to copy output
  // tensor data from the GPU.
  StatisticDuration compute_output = 4;
}

// Memory usage.
message MemoryUsage
{
  // The type of memory, the value can be "CPU", "CPU_PINNED", "GPU".
  string type = 1;

  // The id of the memory, typically used with "type" to identify
  // a device that hosts the memory.
  int64 id = 2;

  // The byte size of the memory.
  uint64 byte_size = 3;
}