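Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions, or on features not yet available in 2.0, refer to the NeMo 24.07 documentation.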

Video NeVA

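Model Introduction

Video NeVA adds support for the video modality to NeVA by representing a video as multiple image frames.

Only minor changes to the MegatronNevaModel class were needed to support pretraining on video input data.

The conversion of video input into a series of images is done in the TarOrFolderVideoLoader class using Decord, which provides convenient video slicing methods.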
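To make the idea of representing a video as a handful of frames concrete, here is a minimal sketch of how frame indices might be chosen. It is not the TarOrFolderVideoLoader implementation: the function name select_frame_indices is hypothetical, and evenly spaced sampling is an assumption made for illustration. The num_frames and splice_single_frame arguments mirror the configuration options of the same names.

```python
def select_frame_indices(total_frames, num_frames=8, splice_single_frame=None):
    """Pick which frames represent a video (hypothetical helper, not NeMo code)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")

    # Single-frame mode: keep only the frame at the requested position.
    if splice_single_frame is not None:
        positions = {
            "first": 0,
            "middle": total_frames // 2,
            "last": total_frames - 1,
        }
        if splice_single_frame not in positions:
            raise ValueError("splice_single_frame must be first, middle, or last")
        return [positions[splice_single_frame]]

    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames == 1:
        return [0]

    # Assumption: sample evenly from the first frame to the last frame.
    last = total_frames - 1
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]


print(select_frame_indices(100, num_frames=8))
print(select_frame_indices(100, splice_single_frame="middle"))
```

The resulting indices could then be passed to a video reader such as Decord to fetch the actual frames.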

Video NeVA Configuration

data:
  media_type: video
  splice_single_frame: null
  num_frames: 8
  image_token_len: 256
  image_folder: null
  video_folder: null

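media_type: When set to video, NeVA's data loader performs additional preprocessing steps to represent the input video data as a series of image frames.
splice_single_frame: Can be set to first, middle, or last. When set, only the single frame at that position in the video is selected.
image_token_len: The NeVA data loader computes image_token_len from the height and width of the preprocessed image frames and the patch size of the CLIP model in use. For example, with 224 x 224 frames and a patch size of 14: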

image_token_len = (224 // 14) * (224 // 14) = 16 * 16 = 256

num_frames: The number of image frames used to represent the video.
video_folder: The directory where the video files are located. It follows the same format as NeVA's image_folder.
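The image_token_len arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as stated in this section; the function name compute_image_token_len is hypothetical and is not part of the NeMo API.

```python
def compute_image_token_len(height, width, patch_size):
    """Number of patch tokens for one preprocessed frame (hypothetical helper)."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    # Each frame is split into a grid of patch_size x patch_size patches;
    # integer division drops any partial patch at the border.
    return (height // patch_size) * (width // patch_size)


# 224 x 224 frames with a patch size of 14, as in the example above.
print(compute_image_token_len(224, 224, 14))  # 256
```

If the frame size or the CLIP patch size changes, image_token_len in the data configuration would need to change with it.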

Video NeVA Inference

Run neva_evaluation.py, located in NeMo/examples/multimodal/multimodal_llm/neva, to generate inference results from a Video NeVA model. Video NeVA currently supports both image and video inference: set the inference.media_type property in NeMo/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml to image or video, and set inference.media_base_path to the corresponding media path.

Inference with a Pretrained Projector and Base LM Model

Example inference script execution

To run video inference:

CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
--config-path=/path/to/conf/ \
--config-name=neva_inference.yaml \
tensor_model_parallel_size=4 \
pipeline_model_parallel_size=1 \
neva_model_file=/path/to/projector/checkpoint \
base_model_file=/path/to/base/lm/checkpoint \
trainer.devices=4 \
trainer.precision=bf16 \
prompt_file=/path/to/prompt/file \
inference.media_base_path=/path/to/videos \
inference.media_type=video \
output_file=/path/for/output/file/ \
inference.temperature=0.2 \
inference.top_k=0 \
inference.top_p=0.9 \
inference.greedy=False \
inference.add_BOS=False \
inference.all_probs=False \
inference.repetition_penalty=1.2 \
inference.insert_media_token=right \
inference.tokens_to_generate=256 \
quantization.algorithm=awq \
quantization.enable=False

Example format of a .jsonl prompt_file

{"video": "video_test.mp4", "text": "Can you describe the scene?", "category": "conv", "question_id": 0}
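A prompt file in this format can be generated programmatically. The sketch below serializes records with the same four keys as the example line, one JSON object per line. The helper names build_prompt_record and dump_prompts are hypothetical, and whether every key is required by neva_evaluation.py is not confirmed here.

```python
import json


def build_prompt_record(video, text, question_id, category="conv"):
    """Build one prompt entry with the keys shown in the example line."""
    return {
        "video": video,
        "text": text,
        "category": category,
        "question_id": question_id,
    }


def dump_prompts(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


records = [build_prompt_record("video_test.mp4", "Can you describe the scene?", 0)]
jsonl_text = dump_prompts(records)
print(jsonl_text, end="")

# To save it for use as prompt_file (the file name is only an example):
# with open("prompts.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```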

Input video file: video_test.mp4

Output

<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

<extra_id_1>User
Can you describe the scene?<video>
<extra_id_1>Assistant
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
CLEAN RESPONSE: Hand with a robot arm

Inference with a Fine-Tuned Video NeVA Model (No Need to Specify the Base LM)

Example inference script execution

To run video inference:

CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
--config-path=/path/to/conf/ \
--config-name=neva_inference.yaml \
tensor_model_parallel_size=4 \
pipeline_model_parallel_size=1 \
neva_model_file=/path/to/video/neva/model \
trainer.devices=4 \
trainer.precision=bf16 \
prompt_file=/path/to/prompt/file \
inference.media_base_path=/path/to/videos \
inference.media_type=video \
output_file=/path/for/output/file/ \
inference.temperature=0.2 \
inference.top_k=0 \
inference.top_p=0.9 \
inference.greedy=False \
inference.add_BOS=False \
inference.all_probs=False \
inference.repetition_penalty=1.2 \
inference.insert_media_token=right \
inference.tokens_to_generate=256 \
quantization.algorithm=awq \
quantization.enable=False

Evaluation with Mixtral as a Judge

Run mixtral_eval.py, located under NeMo/examples/multimodal/multimodal_llm/neva, to call the Mixtral API and score the responses generated by two models. Here we use llava-bench-in-the-wild as an example.

Setup

Before running the script, set up an NGC API key so that you can call the foundation models hosted on NVIDIA NGC. After creating an NGC account, log in and click Get API Key. Save the key.

Download the Dataset

First, download the llava-bench-in-the-wild dataset:

git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild

Then download rule.json.

Note that the answer files in llava-bench-in-the-wild consist of lines of JSON strings:

{"question_id": 0, "prompt": "What is the name of this famous sight in the photo?", "answer_id": "TeyehNxHw5j8naXfEWaxWd", "model_id": "gpt-4-0314", "metadata": {}, "text": "The famous sight in the photo is Diamond Head."}

You can also use your own response file, for example:

{"response_id": 0, "response": "The famous sight in the photo is Diamond Head."}

Both formats are supported.
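To illustrate how the two formats relate, the sketch below reads a line in either format and returns an (id, response text) pair. It is an illustration only and does not reproduce how mixtral_eval.py parses its inputs. The function name extract_response and the fallback to the line index are assumptions.

```python
import json


def extract_response(line, index):
    """Return (id, response text) from a line in either format (hypothetical helper)."""
    record = json.loads(line)
    if "text" in record:
        # llava-bench-in-the-wild answer file format.
        return record.get("question_id", index), record["text"]
    if "response" in record:
        # Custom response file format.
        return record.get("response_id", index), record["response"]
    raise ValueError("line has neither a 'text' nor a 'response' field")


answer_line = '{"question_id": 0, "model_id": "gpt-4-0314", "text": "The famous sight in the photo is Diamond Head."}'
response_line = '{"response_id": 0, "response": "The famous sight in the photo is Diamond Head."}'

print(extract_response(answer_line, 0))
print(extract_response(response_line, 0))
```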

Evaluation

Install the required package:

pip install shortuuid

Now you can run the script:

# --question-file:   the question file
# --responses-list:  two answer files / response files
# --answers-dir:     where to save the answers
# --context-file:    the context file
# --output:          the generated Mixtral reviews for the two models
API_TOKEN=nvapi-<the API key you just saved> python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/mixtral_eval.py --model-name-list gpt bard --media-type image \
    --question-file llava-bench-in-the-wild/questions.jsonl \
    --responses-list llava-bench-in-the-wild/answers_gpt4.jsonl llava-bench-in-the-wild/bard_0718.jsonl \
    --answers-dir ./ \
    --context-file llava-bench-in-the-wild/context.jsonl \
    --output ./output.json

You will see results like the following:

all 84.8 72.4
llava_bench_complex 77.0 69.0
llava_bench_conv 91.8 77.1
llava_bench_detail 91.3 73.2
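If you want to post-process these numbers, the printed lines can be parsed into a dictionary. The sketch below assumes each line is a category name followed by whitespace-separated numbers. What each column means is not stated in this section, so the helper (parse_score_table, a hypothetical name) simply keeps the numbers in their printed order.

```python
def parse_score_table(text):
    """Map each category name to its list of scores, in printed order."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        scores[parts[0]] = [float(value) for value in parts[1:]]
    return scores


sample = """all 84.8 72.4
llava_bench_complex 77.0 69.0
llava_bench_conv 91.8 77.1
llava_bench_detail 91.3 73.2
"""

for category, values in parse_score_table(sample).items():
    print(category, values)
```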

Note that you should delete the output.json file whenever you start a new comparison.
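The cleanup step can be scripted so that it is not forgotten between runs. This is a minimal sketch, assuming the review file is the one passed via --output (./output.json in the command above). The helper name reset_review_file is hypothetical.

```python
from pathlib import Path


def reset_review_file(path="output.json"):
    """Delete a stale review file; return True if a file was removed."""
    review_file = Path(path)
    if review_file.exists():
        review_file.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = reset_review_file("output.json")
    print("removed stale output.json" if removed else "no output.json to remove")
```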