Getting Started

Prerequisites

Check the support matrix to make sure that you have a supported hardware and software stack.

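NGC Authentication

Generate an API Key

An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.

When you create an NGC API personal key, make sure that at least "NGC Catalog" is selected from the "Services Included" dropdown. You can include more services if the key will be reused for other purposes.

Note

Personal keys let you configure an expiration date, revoke or delete the key with an action button, and rotate the key as needed. For more information about key types, refer to the NGC User Guide.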

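Export the API Key

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable, so that the appropriate models and resources are downloaded when the NIM starts.

If you are not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal: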

export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.
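As one way to keep the key out of shell startup files, the value can live in a file that only your user can read. The following is a minimal sketch rather than an official NIM utility: the default path `~/.ngc_api_key` and the function name `load_ngc_api_key` are assumptions chosen for illustration.

```shell
# Hypothetical default location; point NGC_API_KEY_FILE at any private file.
NGC_API_KEY_FILE="${NGC_API_KEY_FILE:-$HOME/.ngc_api_key}"

# Read the key from the file and export it for later docker commands.
load_ngc_api_key() {
  if [ ! -r "$NGC_API_KEY_FILE" ]; then
    echo "Cannot read key file: $NGC_API_KEY_FILE" >&2
    return 1
  fi
  NGC_API_KEY="$(cat "$NGC_API_KEY_FILE")"
  export NGC_API_KEY
}
```

Create the file once and restrict it with `chmod 600` so that other users cannot read it, then call `load_ngc_api_key` in each new shell session.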

Docker Login to NGC

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry using the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

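Use $oauthtoken as the username and NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key rather than a username and password. The single quotes around it in the command keep the shell from expanding it as a variable.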
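If the variable is empty, the login fails with a less obvious error, so it can help to check it first. The sketch below wraps the same `docker login` command in a guard; the function name `ngc_docker_login` is hypothetical and not part of any NVIDIA tooling.

```shell
# Log in to nvcr.io, but refuse to run when NGC_API_KEY is missing.
ngc_docker_login() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before logging in." >&2
    return 1
  fi
  # Single quotes keep $oauthtoken literal; the key is passed on stdin
  # so that it does not appear in the process list.
  printf '%s\n' "$NGC_API_KEY" |
    docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```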

Launch the NIM

The following command launches a Docker container for the llama-3.2-nv-rerankqa-1b-v2 model.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-3.2-nv-rerankqa-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.3.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

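| Flag | Description |
| --- | --- |
| `-it` | `--interactive` + `--tty` (see the Docker docs) |
| `--rm` | Delete the container after it stops (see the Docker docs) |
| `--name=llama-3.2-nv-rerankqa-1b-v2` | Give the NIM container a name for bookkeeping (here `llama-3.2-nv-rerankqa-1b-v2`). Use any preferred value. |
| `--runtime=nvidia` | Ensure that the NVIDIA drivers are accessible in the container. |
| `--gpus all` | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs. |
| `--shm-size=16GB` | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled. |
| `-e NGC_API_KEY` | Provide the container with the token needed to download the appropriate models and resources from NGC. See above. |
| `-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"` | Mount a cache directory from your system (`~/.cache/nim` here) inside the NIM (defaults to `/opt/nim/.cache`), so that follow-up runs can reuse downloaded models and artifacts. |
| `-u $(id -u)` | Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models into your local cache directory. |
| `-p 8000:8000` | Forward the port where the NIM server is published inside the container so that it can be reached from the host system. The left-hand side of `:` is the host system ip:port (`8000` here), and the right-hand side is the container port where the NIM server is published (defaults to `8000`). |
| `$IMG_NAME` | Name and version of the NIM container from NGC. The NIM server starts automatically if no argument is provided after this. |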

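If you encounter permission mismatches when downloading models into your local cache directory, add the -u $(id -u) option to the docker run call so that the container runs under your current identity.

If you are running on a host with different types of GPUs, use the --gpus argument of docker run to select GPUs of the same type. For example, --gpus '"device=0,2"'. The device IDs 0 and 2 are examples only; replace them with the appropriate values for your system. You can find the device IDs by running nvidia-smi. For more information, refer to GPU Enumeration.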
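On a mixed-GPU host, the device list can be derived from the `nvidia-smi` output instead of being typed by hand. The helper below is a sketch under assumptions: `gpus_matching` is a hypothetical name, and it expects the `index, name` lines that `nvidia-smi --query-gpu=index,name --format=csv,noheader` prints.

```shell
# Build a --gpus value that selects every GPU whose name contains a substring.
# Reads "index, name" lines on stdin and prints, for example, "device=0,2"
# (including the double quotes that docker expects around the device list).
gpus_matching() {
  awk -F', ' -v want="$1" '
    index($2, want) > 0 { ids = (ids == "" ? $1 : ids "," $1) }
    END { if (ids != "") printf "\"device=%s\"\n", ids }
  '
}

# Example (the GPU name "A100" is illustrative):
#   GPU_ARG="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpus_matching "A100")"
#   docker run --gpus "$GPU_ARG" ...
```

The helper prints nothing when no GPU name matches, so check that the result is non-empty before passing it to docker run.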

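GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are currently not supported.

Run Inference

Note: It may take a few seconds from the time the Docker container starts until it is ready and begins accepting requests.

Confirm that the service is ready to handle inference requests: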

curl -X 'GET' 'http://127.0.0.1:8000/v1/health/ready'

If the service is ready, you receive a response like the following:

{"ready":true}
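Because the container needs some time before it accepts requests, a script can poll the readiness endpoint instead of sleeping for a fixed interval. The following sketch assumes the endpoint and response body shown above; the function names `wait_until` and `nim_ready` and the retry values in the example are hypothetical.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
  wu_max="$1"; wu_delay="$2"; shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_max" ]; do
    if "$@"; then
      return 0
    fi
    wu_try=$((wu_try + 1))
    sleep "$wu_delay"
  done
  return 1
}

# Succeeds once the readiness endpoint reports {"ready":true}.
nim_ready() {
  curl -sf 'http://127.0.0.1:8000/v1/health/ready' | grep -q '"ready": *true'
}

# Example: poll every 5 seconds, up to 60 times.
#   wait_until 60 5 nim_ready && echo "NIM is ready"
```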

Send a ranking request to the /v1/ranking endpoint:

curl -X "POST" \
  "http://127.0.0.1:8000/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
  "query": {"text": "which way did the traveler go?"},
  "passages": [
    {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
    {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
    {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
    {"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
  ],
  "truncate": "END"
}'
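When the query and passages come from variables rather than a hand-written string, the request body has to be assembled with correct quoting. The sketch below builds a body with the same fields as the request above using only `printf` and `sed`. The helper names are hypothetical, and the escaping covers only backslashes and double quotes, so a JSON tool such as jq would be the more robust choice for arbitrary text.

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Control characters such as newlines are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: build_ranking_payload <model> <query> <passage> [passage...]
build_ranking_payload() {
  rp_model="$(json_escape "$1")"
  rp_query="$(json_escape "$2")"
  shift 2
  rp_passages=""
  for rp_text in "$@"; do
    if [ -n "$rp_passages" ]; then
      rp_passages="${rp_passages},"
    fi
    rp_passages="${rp_passages}{\"text\":\"$(json_escape "$rp_text")\"}"
  done
  printf '{"model":"%s","query":{"text":"%s"},"passages":[%s],"truncate":"END"}\n' \
    "$rp_model" "$rp_query" "$rp_passages"
}

# Example:
#   curl -X POST 'http://127.0.0.1:8000/v1/ranking' \
#     -H 'Content-Type: application/json' \
#     -d "$(build_ranking_payload nvidia/llama-3.2-nv-rerankqa-1b-v2 \
#           'which way did the traveler go?' 'first passage' 'second passage')"
```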

For more information, refer to the API examples.

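Deploying on Multiple GPUs

The NIM deploys a single model on however many GPUs you specify and make visible inside the Docker container. If you do not specify the number of GPUs, the NIM defaults to one GPU. When multiple GPUs are used, Triton distributes inference requests across the GPUs to keep them evenly utilized.

Use the docker run --gpus command-line argument to specify the number of GPUs that are available for deployment.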

Example using all GPUs:
```
  docker run --gpus all ...
```
Example using two GPUs:
```
  docker run --gpus 2 ...
```
Example using specific GPUs:
```
  docker run --gpus '"device=1,2"' ...
```

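Download NIM Models to Cache

If model assets must be fetched in advance, for example for an air-gapped system, you can download the assets to the NIM cache without starting the server. To do so, first run list-model-profiles to determine the desired profile, and then run download-to-cache with that profile, as shown in the following example. For details, refer to Optimization.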

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-3.2-nv-rerankqa-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.3.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# List NIM model profiles and select the most appropriate one for your use case
docker run -it --rm --name=$CONTAINER_NAME \
  -e NIM_CPU_ONLY=1 \
  -u $(id -u) \
  $IMG_NAME list-model-profiles

export NIM_MODEL_PROFILE=<selected profile>

# Start the NIM container with a command to download the model to the cache
docker run -it --rm --name=$CONTAINER_NAME \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_CPU_ONLY=1 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME download-to-cache --profiles $NIM_MODEL_PROFILE

# Start the NIM container in an airgapped environment and serve the model
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus=all \
  --shm-size=16GB \
  --network=none \
  -v $LOCAL_NIM_CACHE:/mnt/nim-cache:ro \
  -u $(id -u) \
  -e NIM_CACHE_PATH=/mnt/nim-cache \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME

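By default, the download-to-cache command downloads the most appropriate model assets for the detected GPU. To override this behavior and download a specific model, set the NIM_MODEL_PROFILE environment variable when launching the container. Use the list-model-profiles command available inside the NIM container to list all profiles. For more details, refer to Optimization.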
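Before starting the container with no network access, it can be worth confirming that the download step actually populated the cache. The check below is a deliberately coarse sketch: the layout of the NIM cache is not described here, so the hypothetical helper `cache_has_assets` only verifies that the directory contains at least one file.

```shell
# Succeeds if the directory exists and contains at least one regular file.
cache_has_assets() {
  [ -d "$1" ] && [ -n "$(find "$1" -type f 2>/dev/null | head -n 1)" ]
}

# Example:
#   cache_has_assets "$LOCAL_NIM_CACHE" ||
#     echo "Cache is empty; run download-to-cache first." >&2
```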

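Stopping the Container

The following commands stop the running Docker container and then remove it. If the container was started with --rm, Docker removes it automatically when it stops, so docker rm is only needed for containers started without that flag.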

docker stop $CONTAINER_NAME
docker rm $CONTAINER_NAME
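In a cleanup script, the two commands can be made tolerant of a container that is already stopped or already removed. This is a sketch with a hypothetical function name, `stop_nim`; it relies only on `docker inspect`, `docker stop`, and `docker rm`.

```shell
# Stop the named container if it is running, then remove it if it still exists.
stop_nim() {
  sn_name="$1"
  if [ "$(docker inspect -f '{{.State.Running}}' "$sn_name" 2>/dev/null)" = "true" ]; then
    docker stop "$sn_name"
  fi
  # A container started with --rm is already gone here, so ignore that error.
  docker rm "$sn_name" >/dev/null 2>&1 || true
}

# Example: stop_nim "$CONTAINER_NAME"
```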