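Managing NIM Services#

About NIM Services#

A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com. You create and delete NIM service resources to manage NVIDIA NIM microservices.

Refer to the following sample manifest: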

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

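Refer to the following table for information about the commonly modified fields:

Field	Description	Default Value
`spec.annotations`	Specifies user-supplied annotations to add to the pod.	None
`spec.authSecret` (required)	Specifies the name of a generic secret that contains NGC_API_KEY.	None
`spec.env`	Specifies environment variables to set in the NIM microservice container.	None
`spec.expose.ingress.enabled`	When set to `true`, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Specify the Ingress specification in the `spec.expose.ingress.spec` field. If you have an ingress controller, values like the example that follows this table configure an ingress for the `v1/chat/completions` endpoint.	`false`
`spec.expose.service.port` (required)	Specifies the network port number for the NIM microservice. A commonly used value is `8000`.	None
`spec.expose.service.type`	Specifies the type of Kubernetes service to create for the NIM microservice.	`ClusterIP`
`spec.groupID`	Specifies the group for the pod. This value sets the `runAsGroup` and `fsGroup` fields in the security context of the pod.	`2000`
`spec.image`	Specifies the repository, tag, pull policy, and pull secrets for the container image.	None
`spec.labels`	Specifies user-supplied labels to add to the pod.	None
`spec.metrics.enabled`	When set to `true`, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the `spec.metrics.serviceMonitor` field.	`false`
`spec.resources`	Specifies the resource requirements for the pod.	None
`spec.replicas`	Specifies the desired number of pods in the replica set for the NIM microservice.	`1`
`spec.runtimeClassName`	Specifies the runtime class name for the pod.	None
`spec.scale.enabled`	When set to `true`, the Operator creates a Kubernetes horizontal pod autoscaler (HPA) for the NIM microservice. Specify the HPA specification in the `spec.scale.hpa` field. The `spec.scale.hpa` field supports the following subfields: `minReplicas`, `maxReplicas`, `metrics`, and `behavior`. These subfields correspond to the fields of the same names in the horizontal pod autoscaler resource specification.	`false`
`spec.storage.nimCache`	Specifies the name of the NIM cache that holds the cached model profiles for the NIM microservice. Specify a value for the `name` subfield and, optionally, for the `profile` subfield. This field takes precedence over the `spec.storage.pvc` field.	None
`spec.storage.pvc`	If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has the following subfields: `create`, `name`, `size`, `storageClass`, `volumeAccessMode`, and `subPath`. To have the Operator create a PVC for the model profiles, specify `pvc.create: true`. Refer to Example: Create a PVC Instead of Using a NIM Cache.	None
`spec.storage.readOnly`	When set to `true`, the Operator mounts the PVC from the `pvc` or `nimCache` specification as read-only.	`false`
`spec.tolerations`	Specifies the tolerations for the pod.	None
`spec.userID`	Specifies the user ID for the pod. This value sets the `runAsUser` field in the security context of the pod.	`1000`

Example `spec.expose` values that configure an ingress for the `v1/chat/completions` endpoint:

ingress:
  enabled: true
  spec:
    ingressClassName: nginx
    rules:
      - host: demo.nvidia.example.com
        http:
          paths:
            - backend:
                service:
                  name: meta-llama3-8b-instruct
                  port:
                    number: 8000
              path: /v1/chat/completions
              pathType: Prefix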
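
As a rough illustration of how several of the optional fields in the table might be combined, the following fragment sets labels, annotations, an environment variable, a toleration, and the security context IDs. It is a sketch, not a tested configuration: it assumes that `spec.env` and `spec.tolerations` follow the standard Kubernetes environment variable and toleration shapes and that `spec.labels` and `spec.annotations` are string maps. The label, annotation, and variable names are hypothetical placeholders.

```yaml
# Fragment of a NIMService spec; merge it into a complete manifest.
# All names and values below are illustrative placeholders.
spec:
  labels:
    example.com/team: platform          # hypothetical label
  annotations:
    example.com/owner: ml-infra         # hypothetical annotation
  env:
    - name: EXAMPLE_SETTING             # hypothetical variable name
      value: "enabled"
  tolerations:                          # assumes the standard Kubernetes toleration shape
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  userID: 1000                          # table default; sets runAsUser
  groupID: 2000                         # table default; sets runAsGroup and fsGroup
```

Omitting `userID` and `groupID` should have the same effect as the values shown, because they match the defaults listed in the table.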

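Prerequisites#

Optional: Add a NIM cache resource for the NIM microservice. If you created a NIM cache resource, specify its name in the spec.storage.nimCache.name field.

If you prefer to have the service download the model to storage, refer to Example: Create a PVC Instead of Using a NIM Cache for a sample manifest.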
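
The sample manifests on this page reference a secret named ngc-api-secret through the spec.authSecret field, which must name a generic secret that contains NGC_API_KEY. If that secret does not exist yet, a manifest along the following lines could create it. This is a minimal sketch: the nim-service namespace is taken from the commands on this page, and the key value is a placeholder that you must supply.

```yaml
# Generic (Opaque) secret that holds the NGC API key.
# The name matches spec.authSecret in the sample manifests.
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret
  namespace: nim-service        # assumed namespace, as used in the commands on this page
type: Opaque
stringData:
  NGC_API_KEY: <your-ngc-api-key>   # placeholder; do not commit real keys
```

The image pull secret, ngc-secret, is a separate registry credential and is not covered by this sketch.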

Procedure#

Create a file, such as service-all.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: 1.3.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-embedqa-e5-v5
spec:
  image:
    repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
    tag: 1.3.1
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-embedqa-e5-v5
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
    tag: 1.3.1
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Apply the manifest:

$ kubectl apply -n nim-service -f service-all.yaml

Optional: View information about the NIM services:

$ kubectl describe nimservices.apps.nvidia.com -n nim-service

Partial output

...
Conditions:
 Last Transition Time:  2024-08-12T19:09:43Z
 Message:               Deployment is ready
 Reason:                Ready
 Status:                True
 Type:                  Ready
 Last Transition Time:  2024-08-12T19:09:43Z
 Message:
 Reason:                Ready
 Status:                False
 Type:                  Failed
State:                  Ready

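Verification#

Start a pod that has access to the curl command. Substitute any pod that has this command and meets your organization's security requirements: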
```
$ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
```
After the pod starts, you are connected to the ash shell in the pod.
Connect to the chat completions endpoint on the NIM for LLMs container:

curl -X "POST" \
 'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [
        {
          "content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
          "role": "user"
        }],
        "top_p": 1,
        "n": 1,
        "max_tokens": 1024,
        "stream": false,
        "frequency_penalty": 0.0,
        "stop": ["STOP"]
      }'

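The command connects to the meta-llama3-8b-instruct service in the nim-service namespace, addressed as meta-llama3-8b-instruct.nim-service. The command also specifies the model to use, meta/llama3-8b-instruct. Replace these values if you use a different service name, namespace, or model.

Press Ctrl+D to exit and delete the pod.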

Configuring Horizontal Pod Autoscaling#

Prerequisites#

Prometheus is installed. NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts with commands like the following example:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

$ helm install prometheus prometheus-community/prometheus --namespace prometheus --create-namespace

If you do not have a default storage class, add the following command-line arguments:

--set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
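
If you prefer a Helm values file to repeated --set arguments, the two storage class settings above map onto the following structure. This is a sketch that only restates the --set paths as YAML. The file name in the usage note is hypothetical.

```yaml
# Equivalent of:
#   --set server.persistentVolume.storageClass=<storage-class>
#   --set alertmanager.persistence.storageClass=<storage-class>
server:
  persistentVolume:
    storageClass: <storage-class>
alertmanager:
  persistence:
    storageClass: <storage-class>
```

You would pass the file to the helm install command with the -f option, for example -f prometheus-values.yaml.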

Prometheus Adapter is installed. NVIDIA development and testing used the same Prometheus Community Kubernetes Helm Charts with a command like the following example:

$ helm install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace prometheus-adapter \
    --create-namespace \
    --set prometheus.url=http://prometheus-server.prometheus.svc.cluster.local \
    --set prometheus.port="80"

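Autoscaling NIM for LLMs#

NVIDIA NIM for LLMs provides several service metrics. For information about the metrics, refer to Observability in the NVIDIA NIM for LLMs documentation.

Annotate the service resource related to NIM for LLMs: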
```
$ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true
```
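Prometheus might require several minutes to begin collecting metrics from the service.

Optional: Confirm that Prometheus collects the metrics.

If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc.

You can also query Prometheus Adapter: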

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq .

Example output

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "nim-service",
        "name": "meta-llama3-8b-instruct",
        "apiVersion": "/v1"
      },
      "metricName": "gpu_cache_usage_perc",
      "timestamp": "2024-09-12T15:14:20Z",
      "value": "0",
      "selector": null
    }
  ]
}

Create a file, such as service-hpa.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Object
        object:
          metric:
            name: gpu_cache_usage_perc
          describedObject:
            apiVersion: v1
            kind: Service
            name: meta-llama3-8b-instruct
          target:
            type: Value
            value: "0.5"

Apply the manifest:

$ kubectl apply -n nim-service -f service-hpa.yaml

Optional: Confirm that the horizontal pod autoscaler resource is created:

$ kubectl get hpa -n nim-service

Example output

NAME                      REFERENCE                            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct   0/500m    1         2         1          40s
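
The spec.scale.hpa field also accepts a behavior subfield that corresponds to the field of the same name in the horizontal pod autoscaler specification. As a hedged sketch, the following fragment slows scale-down so that a brief dip in the metric does not remove a replica immediately. The window and policy values are illustrative assumptions, not recommendations from NVIDIA, and the metrics list from the preceding manifest is omitted for brevity.

```yaml
# Fragment of a NIMService spec; merge it with the metrics list shown earlier.
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 2
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # illustrative: wait 5 minutes before scaling down
          policies:
            - type: Pods
              value: 1                      # illustrative: remove at most one pod
              periodSeconds: 60             # per 60-second period
```

A longer stabilization window can be a reasonable trade-off for GPU-backed pods, where starting a replacement replica is slow.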

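Autoscaling Embedding and Reranking Services#

Embedding and reranking services do not expose service metrics. To scale these services, you can monitor the pod-level metrics that the NVIDIA DCGM Exporter provides. For the default metric names, refer to metrics-config.yaml in the GitHub repository.

Optional: Confirm that Prometheus collects the metrics.

If you have access to the Prometheus dashboard, search for a metric such as DCGM_FI_DEV_FB_USED.

You can also query Prometheus Adapter:

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/pods/*/DCGM_FI_DEV_FB_USED" | jq .

Partial output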

 {
   "describedObject": {
     "kind": "Pod",
     "namespace": "nim-service",
     "name": "nv-embedqa-e5-v5-78cb9874c4-ghpmc",
     "apiVersion": "/v1"
   },
   "metricName": "DCGM_FI_DEV_GPU_UTIL",
   "timestamp": "2024-09-12T16:16:53Z",
   "value": "0",
   "selector": null
 }

Create a file, such as service-hpa-dcgm.yaml, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-embedqa-e5-v5
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-embedqa-e5-v5
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_FB_USED
          target:
            type: AverageValue
            averageValue: "1000"
---
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-rerankqa-mistral-4b-v3
spec:
  image:
    repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
    tag: 1.0.4
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nv-rerankqa-mistral-4b-v3
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
  scale:
    enabled: true
    hpa:
      maxReplicas: 2
      minReplicas: 1
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_FB_USED
          target:
            type: AverageValue
            averageValue: "1000"

Apply the manifest:

$ kubectl apply -n nim-service -f service-hpa-dcgm.yaml

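Sample Manifests#

Example: Create a PVC Instead of Using a NIM Cache#

As an alternative to creating a NIM cache resource to download and cache NIM model profiles, you can have the Operator create a PVC. The NIM service then downloads and runs a NIM model profile.

Create and apply a manifest like the following example: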

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

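Example: Air-Gapped Environment#

For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.

Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as in an air-gapped environment, you must specify the PVC name.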
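
For the manual PVC step, a standard Kubernetes PersistentVolumeClaim manifest along these lines could serve as a starting point. It is a sketch under assumptions: the size and access mode are borrowed from the PVC example earlier on this page, the nim-service namespace is taken from the commands on this page, and the storage class name is a placeholder. Adjust all of them for your models and cluster.

```yaml
# Manually created PVC that holds the transferred model profile files.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <existing-pvc-name>          # reference this name in spec.storage.pvc.name
  namespace: nim-service             # assumed namespace
spec:
  accessModes:
    - ReadWriteMany                  # assumption, borrowed from the earlier PVC example
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 10Gi                  # assumption; size it for your model profiles
```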

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
    readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Deleting a NIM Service#

To delete a NIM service, perform the following steps.

View the NIM service custom resources:

$ kubectl get nimservices.apps.nvidia.com -A

Example output

NAMESPACE     NAME                        STATUS   AGE
nim-service   meta-llama3-8b-instruct     Ready    2024-08-12T17:16:05Z

Delete the custom resource:

$ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct

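If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine whether the Operator created the PVC by running a command like the following example: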

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
   -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'

Example output

meta-llama3-8b-instruct: true

Next Steps#

Deploy an application that uses the NIM services, such as the sample RAG application.