NIM Operator Observability
Configuration
Annotate the NIM Operator metrics service so that Prometheus scrapes its metrics:
$ kubectl annotate -n nim-operator svc k8s-nim-operator-metrics-service prometheus.io/scrape=true
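The annotation only has an effect if Prometheus runs a scrape job that honors it. The fragment below is a minimal sketch of such a job, not the configuration shipped with any particular Prometheus installation. The job name is hypothetical, and the HTTPS, bearer token, and TLS settings are assumptions that mirror the ServiceMonitor example in this section. Many Helm-based installations already include an equivalent job.

```yaml
# prometheus.yml fragment (sketch). The job name is hypothetical.
scrape_configs:
  - job_name: annotated-service-endpoints
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - nim-operator
    # Assumption: the metrics endpoint is served over HTTPS and expects a
    # service account bearer token, as in the ServiceMonitor example.
    scheme: https
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      # Keep only endpoints whose Service carries prometheus.io/scrape=true.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

If your installation already defines a job that keys on `prometheus.io/scrape`, no additional scrape configuration should be needed.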
Alternatively, if your Prometheus instance discovers scrape targets from ServiceMonitor custom resources, apply a manifest like the following example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-operator
  labels:
    app: nim-operator
    release: prometheus # This value can differ according to your Prometheus installation.
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    scrapeTimeout: 10s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    matchNames:
    - nim-operator
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-nim-operator
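Metrics
| Metric name | Type | Description |
|---|---|---|
| | gauge | Specifies the number of NIM cache resources in each state. |
| | gauge | Specifies the number of NIM cache resources in each state. |
| | gauge | Specifies the number of NIM pipeline resources in each state. |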
In addition to the custom resource metrics in the preceding table, the Operator also emits the following common Kubernetes controller metrics:
controller_runtime_active_workers
controller_runtime_max_concurrent_reconciles
controller_runtime_reconcile_errors_total
controller_runtime_reconcile_panics_total
controller_runtime_reconcile_time_seconds
controller_runtime_reconcile_total
controller_runtime_terminal_reconcile_errors_total
controller_runtime_webhook_panics_total
workqueue_adds_total
workqueue_depth
workqueue_longest_running_processor_seconds
workqueue_queue_duration_seconds
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds
For information about the common metrics, refer to https://book.kubebuilder.io/reference/metrics-reference.
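As an illustration of how the controller metrics listed above can be put to use, the following PrometheusRule is a minimal sketch that alerts on persistent reconcile errors and on a growing work queue. It assumes the Prometheus Operator is installed and that scraped series carry a `namespace="nim-operator"` label. The resource name, alert names, thresholds, and durations are hypothetical placeholders and are not part of the NIM Operator.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nim-operator-alerts   # hypothetical name
  namespace: nim-operator
  labels:
    release: prometheus       # Assumption: must match your Prometheus rule selector.
spec:
  groups:
  - name: nim-operator.controller
    rules:
    # Fires when any controller keeps returning reconcile errors.
    - alert: NIMOperatorReconcileErrors   # hypothetical alert name
      expr: |
        sum by (controller) (
          rate(controller_runtime_reconcile_errors_total{namespace="nim-operator"}[5m])
        ) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Controller {{ $labels.controller }} is reporting reconcile errors."
    # Fires when a work queue stays deep; the threshold of 10 is a placeholder.
    - alert: NIMOperatorWorkqueueBacklog  # hypothetical alert name
      expr: |
        max by (name) (workqueue_depth{namespace="nim-operator"}) > 10
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Work queue {{ $labels.name }} has a sustained backlog."
```

Tune the thresholds and `for` durations to your workload before relying on these alerts.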