Air Gap Deployment of NVIDIA NIM for LLMs

NVIDIA NIM for large language models (LLMs) supports serving models on air-gapped systems (also known as air wall, air gapping, or disconnected networks). Before using this documentation, review all of the prerequisites and instructions in the Getting Started guide, and see Serving Models from Local Assets.

Air Gap Deployment (Offline Cache Route)

If NIM detects a previously loaded profile in the cache, it serves that profile from the cache. After you download a profile to the cache with download-to-cache, you can transfer the cache to an air-gapped system and run NIM there without any internet connection and without any connection to the NGC registry.
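
The cache itself is populated on an internet-connected host before being copied over. The following is a minimal sketch of that step, assuming NGC_API_KEY, IMG_NAME, and LOCAL_NIM_CACHE are set as described in Serving Models from Local Assets; the exact image tag and downloaded profiles depend on your deployment.

# On a connected host: populate the local cache that is later copied to the
# air-gapped system (illustrative invocation; downloads the optimal profile)
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  $IMG_NAME \
  download-to-cache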

To see this in action, do not provide an NGC_API_KEY, as shown in the following examples.

# Create an example air-gapped directory where the downloaded NIM will be deployed
export AIR_GAP_NIM_CACHE=~/.cache/air-gap-nim-cache
mkdir -p "$AIR_GAP_NIM_CACHE"

# Transport the downloaded NIM to an air-gapped directory
cp -r "$LOCAL_NIM_CACHE"/* "$AIR_GAP_NIM_CACHE"

# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct

# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct

# Choose an LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:latest"

# Assuming the command run prior was `download-to-cache`, downloading the optimal profile
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

# Assuming the command run prior was `download-to-cache --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b`
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_PROFILE=09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b \
  -v "$AIR_GAP_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
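
With the container running entirely from the transferred cache, you can verify that no NGC connectivity is required by querying the local endpoint exposed through the -p 8000:8000 mapping above, for example:

# Readiness check against the locally published port
curl -s http://0.0.0.0:8000/v1/health/ready

# List the model served from the air-gapped cache
curl -s http://0.0.0.0:8000/v1/models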

Air Gap Deployment (Local Model Directory Route)

Another option for the air gap route is to deploy a model repository created with the create-model-store command inside the NIM container, which creates a repository for a single model, as shown in the following example.

create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /path/to/model-repository
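
The create-model-store command runs inside the NIM container. A minimal sketch of invoking it from the host, assuming $IMG_NAME and $LOCAL_NIM_CACHE are set as in the listing below and the target profile has already been downloaded to the cache (the host mount path is illustrative):

# Run create-model-store inside the NIM container, writing the repository
# to a host directory mounted at /model-repository
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v /path/to/model-repository:/model-repository \
  -u $(id -u) \
  $IMG_NAME \
  create-model-store --profile 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b --model-store /model-repository
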
# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama-3.1-8B-instruct

# The repository name from the previous ngc registry image list command
Repository=nim/meta/llama-3.1-8b-instruct

# Choose an LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:latest"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

export MODEL_REPO=/path/to/model-repository
export NIM_SERVED_MODEL_NAME=my-model

# Note: For the vLLM backend, specify the following environment variables
# to set the required parallel sizes. The default values are 1.
export NIM_TENSOR_PARALLEL_SIZE=<required_value>
export NIM_PIPELINE_PARALLEL_SIZE=<required_value>

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME=/model-repo \
  -e NIM_SERVED_MODEL_NAME \
  -e NIM_TENSOR_PARALLEL_SIZE \
  -e NIM_PIPELINE_PARALLEL_SIZE \
  -v $MODEL_REPO:/model-repo \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
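
Once the container is up, requests go to the name set in NIM_SERVED_MODEL_NAME (my-model in this example). A minimal test request against the OpenAI-compatible endpoint, with an illustrative prompt:

# Query the model under its served name
curl -s http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'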