Using Reward Models

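In addition to chat and completion models, NIM LLM supports deploying large language reward models. A reward model is typically used to score the output of another large language model, either to further fine-tune that model or to filter synthetically generated datasets.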

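To send text to a reward model, use the chat/completions endpoint just as you would for any other type of model. Include the prompt that was used to generate the text as the first user content, and include the model's response as the assistant content. The reward model scores the provided response, taking into account the query that produced it. For example: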

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

original_query = "I am going to Paris, what should I see?"
original_response = "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city ..."

messages = [
    {"role": "user", "content": original_query},
    {"role": "assistant", "content": original_response}
]

response = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-reward",
    messages=messages,
    stream=False
)

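The response from NIM contains attribute and score pairs in the message content, whereas a regular chat completion model returns the text it generated. The attributes on which a response is scored are specific to each reward model. Reward models trained on the HelpSteer dataset, such as nemotron-4-340b, score responses on the following metrics: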

Helpfulness
Correctness
Coherence
Complexity
Verbosity

You can use this response in downstream applications. For example, you might want to parse the scores into a Python dictionary:

# The message content is a comma-separated list of "attribute:score" pairs
response_content = response.choices[0].message.content
reward_pairs = [pair.split(":") for pair in response_content.split(",")]
# Strip whitespace from attribute names and convert each score to a float
reward_dict = {attribute.strip(): float(score) for attribute, score in reward_pairs}
print(reward_dict)
# Prints:
# {'helpfulness': 1.2578125, 'correctness': 0.43359375, 'coherence': 3.34375, 'complexity': 0.045166015625, 'verbosity': 0.6953125}
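Filtering a synthetically generated dataset, one of the uses mentioned at the top of this page, could build on the same parsing step. The following is a minimal sketch, not part of NIM: the helper names `parse_reward_scores` and `filter_by_score`, the choice of `helpfulness` as the filter attribute, and the threshold of `1.0` are all hypothetical and chosen only for illustration. It assumes the message content uses the comma-separated `attribute:score` format shown above.

```python
from typing import Dict, List, Tuple


def parse_reward_scores(content: str) -> Dict[str, float]:
    """Parse 'attribute:score,attribute:score' text into a dict of floats."""
    scores: Dict[str, float] = {}
    for pair in content.split(","):
        if not pair.strip():
            continue  # tolerate a trailing comma or empty segment
        # Split on the last colon so the score is always the right-hand part
        attribute, _, score = pair.rpartition(":")
        scores[attribute.strip()] = float(score)
    return scores


def filter_by_score(
    samples: List[Tuple[str, str]],
    attribute: str = "helpfulness",  # hypothetical default, pick what suits your data
    threshold: float = 1.0,          # hypothetical cutoff, tune for your model
) -> List[str]:
    """Keep samples whose reward score for `attribute` is at least `threshold`.

    Each entry in `samples` pairs a sample identifier with the raw message
    content returned by the reward model for that sample.
    """
    kept = []
    for sample, content in samples:
        scores = parse_reward_scores(content)
        # A missing attribute is treated as failing the threshold
        if scores.get(attribute, float("-inf")) >= threshold:
            kept.append(sample)
    return kept
```

In practice, the `content` strings would come from `response.choices[0].message.content` for each prompt and response pair you score. A suitable attribute and threshold depend on the reward model and on your data, so treat the defaults here as placeholders.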