Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous releases or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Synthetic Data Generation#

Synthetic data generation has become increasingly useful in large language model (LLM) training. It is used in pretraining, fine-tuning, and evaluation. Synthetically generated data can be useful for adapting LLMs to low-resource languages or domains, and for performing knowledge distillation from other models, among other purposes. There are many ways to construct synthetic data generation pipelines, drawing on numerous LLMs and classical filters.

NeMo Curator has a simple, easy-to-use set of tools that let you use prebuilt synthetic generation pipelines or build your own. Any model inference service that uses the OpenAI API is compatible with the synthetic data generation module, allowing you to generate data from any model. In addition, NeMo Curator can interface with NeMo's Export and Deploy module, which allows you to host your own model for LLM inference.

NeMo Curator offers prebuilt synthetic data generation pipelines for supervised fine-tuning (SFT) and preference data, which were used to generate the training data for Nemotron-4 340B. It now also supports the pipelines used to generate Nemotron-CC. In addition, you can seamlessly integrate filtering and deduplication steps into your synthetic data pipeline using the other modules available in NeMo Curator.

Connect to an LLM Service#

NeMo Curator supports connecting to OpenAI API-compatible services and NeMo Deploy services. Despite its name, the OpenAI API is used to query models across many platforms, not just OpenAI's own models. The following code demonstrates how to connect to build.nvidia.com and query Mixtral 8x7B Instruct using NeMo Curator and the OpenAI API.

from openai import OpenAI
from nemo_curator import OpenAIClient

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>",
)
client = OpenAIClient(openai_client)
responses = client.query_model(
    model="mistralai/mixtral-8x7b-instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        }
    ],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
)
print(responses[0])
# Output:
# A GPU with numbers in flight, Brings joy to programmers late at night.
# With parallel delight, Solving problems, so bright,
# In the realm of computing, it's quite a sight!

Deploy an LLM Inference Service#

The OpenAI API is great for accessing models hosted externally behind a simple API. However, these services are often rate limited, and you may run into those limits if you are generating a lot of synthetic data. An alternative to accessing externally hosted models is to deploy an LLM inference service yourself. If you want to host your own model, we recommend using NeMo's Export and Deploy module to ensure you get the best performance.

Assuming you have deployed a model named "mistralai/mixtral-8x7b-instruct-v0.1" on your local machine by following the NeMo deployment guide, you can run the same query with the following code:

from nemo.deploy.nlp import NemoQueryLLM
from nemo_curator import NemoDeployClient
from nemo_curator.synthetic import Mixtral8x7BFormatter

model = "mistralai/mixtral-8x7b-instruct-v0.1"
nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model)
client = NemoDeployClient(nemo_client)
responses = client.query_model(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        }
    ],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
    conversation_formatter=Mixtral8x7BFormatter(),
)
print(responses[0])
# Output:
# A GPU with numbers in flight, Brings joy to programmers late at night.
# With parallel delight, Solving problems, so bright,
# In the realm of computing, it's quite a sight!

Let's focus on the main differences here.

  • nemo_client = NemoQueryLLM(url="localhost:8000", model_name=model). This initialization requires you to specify the model's name. NemoQueryLLM is built primarily for querying a single LLM, but NeMo Curator lets you change the model queried on your local server with each request.

  • conversation_formatter=Mixtral8x7BFormatter(). LLMs take a tokenized string of text as input, not a list of conversation turns. Therefore, during the alignment process each LLM uses a conversation format to convert a conversation into a single string. For Mixtral-8x7B-Instruct-v0.1, the format looks like this:

    <s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

    Services that use the OpenAI API perform this formatting on the backend. In contrast, since NeMo Deploy allows you to run any model you want, you need to specify which conversation format to use when making the request.

    NeMo Curator provides prebuilt conversation formatters for Mixtral-8x7B-Instruct-v0.1 and Nemotron-4 340B, named Mixtral8x7BFormatter and NemotronFormatter, respectively.

Note

The OpenAI API backend may format conversations for you automatically. Depending on your synthetic data generation process, this may lead to incorrect results. Refer to your service's documentation to find out what prompt format it follows.

Query a Reward Model#

Reward models can be used to score conversations between a user and an assistant. Instead of continuing the user's prompt with a text response, a reward model returns a mapping of categories to scores. These scores can then be used to filter datasets for higher quality. The following code demonstrates how to query the Nemotron-4 340B reward model in NeMo Curator:

from openai import OpenAI
from nemo_curator import OpenAIClient

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>",
)
client = OpenAIClient(openai_client)

model = "nvidia/nemotron-4-340b-reward"

messages = [
    {"role": "user", "content": "I am going to Paris, what should I see?"},
    {
        "role": "assistant",
        "content": "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city ...",
    },
]

rewards = client.query_reward_model(messages=messages, model=model)
print(rewards)
# {
# "helpfulness": 1.6171875
# "correctness": 1.6484375
# "coherence": 3.3125
# "complexity": 0.546875
# "verbosity": 0.515625
# }
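
As a sketch of how these scores might be used for filtering, the snippet below keeps only conversations whose helpfulness score clears a threshold. The threshold value and the conversations list here are illustrative assumptions, not part of NeMo Curator:

# Hypothetical filtering sketch: keep only conversations whose helpfulness
# score clears a manually chosen threshold (1.5 here is an arbitrary
# illustration, not a recommended value).
conversations = [messages]  # in practice, many conversations to score
filtered = [
    conv
    for conv in conversations
    if client.query_reward_model(messages=conv, model=model)["helpfulness"] >= 1.5
]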

For more details on the reward categories, refer to the Nemotron-4 340B Technical Report.

Customize the Nemotron-4 340B Pipeline#

Nemotron-4 340B is an LLM released by NVIDIA that synthetically generated 98% of the data used for its supervised fine-tuning and preference fine-tuning. NeMo Curator includes prebuilt functions that allow you to follow the same process with the same prompt templates, and you can customize the pipeline to fit your use case.

Generate Synthetic Prompts#

Prompt generation is the process of synthetically generating the first line of a conversation between a user and an assistant. This is also called "openline" generation. Nemotron-4 340B's generation, based on the UltraChat dataset, used four different pipelines for generating open Q&A, writing, closed Q&A, and math and coding prompts. NeMo Curator wraps all of Nemotron-4 340B's synthetic data generation methods in nemo_curator.synthetic.NemotronGenerator.

We will dive into all of the methods it provides in the following sections, but here is a small example that establishes a pattern you will see across all of the functions:

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronGenerator(client)

n_macro_topics = 20
model = "mistralai/mixtral-8x7b-instruct-v0.1"
model_kwargs = {
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 1024,
}

responses = generator.generate_macro_topics(
    n_macro_topics=n_macro_topics, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# 1. Climate Change and Sustainable Living
# 2. Space Exploration and the Universe
# ...

This example is similar to OpenAIClient.query_model. We specify the model we are using as before, along with additional keyword arguments that control the model's generation. The generator.generate_macro_topics function queries the LLM and asks it to generate a list of topics about the world. There is also an additional prompt_template parameter, which defaults to the one used in Nemotron-4 340B but can be changed if desired. The responses variable will hold a list of responses, of which there is only one unless n > 1 is specified in model_kwargs.
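
As a minimal sketch of requesting multiple completions per prompt, you can pass n through model_kwargs; this assumes the backing service honors the OpenAI "n" parameter, which not every endpoint does:

# Request three completions for the same prompt (assumes the service
# supports the OpenAI "n" parameter; otherwise a single response is returned).
responses = generator.generate_macro_topics(
    n_macro_topics=n_macro_topics,
    model=model,
    model_kwargs={"n": 3, "temperature": 0.7},
)
print(len(responses))  # 3 -- one string per completion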

The output of generate_macro_topics will be a single string response that contains a list of topics. Many of the LLM responses in the Nemotron pipelines contain lists like this. Therefore, NemotronGenerator provides a helper function that attempts to convert an LLM response into a Python list of strings.

responses = generator.generate_macro_topics(
    n_macro_topics=n_macro_topics, model=model, model_kwargs=model_kwargs
)

topic_list = generator.convert_response_to_yaml_list(
    responses[0], model=model, model_kwargs=model_kwargs
)
print(topic_list[0])
# Output:
# Climate Change and Sustainable Living

This helper function prompts the LLM to convert the previous response into YAML format, then attempts to parse it. If the parsing fails, a YamlConversionError is raised. topic_list is not guaranteed to have a length of 20. In the end-to-end pipelines we'll see later, NeMo Curator raises a YamlConversionError if the expected list length does not match the received list length, but this function does not perform that check.
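
A minimal sketch of guarding the conversion step is shown below. The import path is an assumption about where YamlConversionError lives; adjust it to match your version of NeMo Curator:

# Guard the conversion step and validate the parsed list length ourselves.
# The import path for YamlConversionError is an assumption.
from nemo_curator.synthetic import YamlConversionError

try:
    topic_list = generator.convert_response_to_yaml_list(
        responses[0], model=model, model_kwargs=model_kwargs
    )
except YamlConversionError:
    topic_list = []  # discard responses that cannot be parsed

if len(topic_list) != n_macro_topics:
    print(f"Expected {n_macro_topics} topics but parsed {len(topic_list)}")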

With these examples covered, let's look at how to exactly replicate the Nemotron-4 340B synthetic data generation pipeline in NeMo Curator. For a more in-depth explanation of each step, please refer to the Nemotron-4 340B Technical Report.

Generate Open Q&A Prompts#

Open Q&A prompt generation follows these steps:

  1. Generate a list of macro topics about the world.

  2. Generate a list of subtopics related to each macro topic.

  3. Create a list of questions relating to the previously generated topics. Additional topics can also be specified manually.

  4. Revise the questions to be more detailed.

With NeMo Curator, you can perform each step as follows:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
macro_topic_responses = generator.generate_macro_topics(
    n_macro_topics=20, model=model
)
macro_topics_list = ... # Parse responses manually or with convert_response_to_yaml_list

subtopic_responses = generator.generate_subtopics(
    macro_topic=macro_topics_list[0], n_subtopics=5, model=model
)
subtopic_list = ... # Parse responses manually or with convert_response_to_yaml_list

topics = macro_topics_list + subtopic_list

question_responses = generator.generate_open_qa_from_topic(
    topic=topics[0], n_openlines=10, model=model
)
questions = ... # Parse responses manually or with convert_response_to_yaml_list

revised_questions_responses = generator.revise_open_qa(
    openline=questions[0], n_revisions=5, model=model
)
revised_questions = ... # Parse responses manually or with convert_response_to_yaml_list

You can run an end-to-end pipeline with all of these steps using NemotronGenerator.run_open_qa_pipeline.

open_qa_questions = generator.run_open_qa_pipeline(
    n_macro_topics=20,
    n_subtopics=5,
    n_openlines=10,
    n_revisions=5,
    model=model,
    ignore_conversion_failure=True,
)

print(open_qa_questions[0])
# Output:
# What are some effective sources of renewable energy?

This function runs all of the previous steps together. It attempts to automatically convert the LLM responses into Python lists using convert_response_to_yaml_list. Setting ignore_conversion_failure=True discards responses that cannot be converted instead of raising an error. However, an error is still thrown if the first step of the pipeline cannot be parsed successfully.

Generate Writing Prompts#

Writing prompt generation follows these steps:

  1. Generate tasks to write an email, essay, etc. about a topic.

  2. Revise the tasks to be more detailed.

With NeMo Curator, you can perform each step as follows:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
writing_tasks_responses = generator.generate_writing_tasks(
    topic="Climate Change and Sustainable Living",
    text_material_type="Poems",
    n_openlines=5,
    model=model,
)
writing_tasks_list = ... # Parse responses manually or with convert_response_to_yaml_list

revised_writing_tasks_responses = generator.revise_writing_tasks(
    openline=writing_tasks_list[0], n_revisions=5, model=model
)
revised_writing_tasks = ...  # Parse responses manually or with convert_response_to_yaml_list

You can run an end-to-end pipeline with all of these steps using NemotronGenerator.run_writing_pipeline.

writing_tasks = generator.run_writing_pipeline(
    topics=[
        "Climate Change and Sustainable Living",
        "Space Exploration and the Universe",
        ...,
    ],
    text_material_types=["Poems", "Essays", ...],
)

print(writing_tasks[0])
# Output:
# Write a poem about the most effective sources of renewable energy.

This function runs all of the previous steps together. It attempts to automatically convert the LLM responses into Python lists using convert_response_to_yaml_list. With ignore_conversion_failure=True, responses that cannot be converted are discarded instead of raising an error. However, an error is still thrown if the first step of the pipeline cannot be parsed successfully.

Generate Closed Q&A Prompts#

Closed Q&A prompt generation is simple, with only one step:

  1. Given a document, generate some questions about it.

With NeMo Curator, you can perform this step as follows:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
closed_qa_responses = generator.generate_closed_qa_instructions(
    document="Four score and seven years ago...",
    n_openlines=5,
    model=model,
)
closed_qa_questions = ...  # Parse responses manually or with convert_response_to_yaml_list

You can run an end-to-end pipeline that repeats this process for many documents using NemotronGenerator.run_closed_qa_pipeline.

closed_qa_questions = generator.run_closed_qa_pipeline(
    documents=["Four score and seven years ago...", ...],
    n_openlines=5,
    model=model,
)

print(closed_qa_questions[0])
# Output:
# (0, "Which President of the United States gave this speech?")

This function generates n_openlines questions for each document supplied. It attempts to automatically convert the LLM responses into Python lists using convert_response_to_yaml_list. Setting ignore_conversion_failure=True discards responses that cannot be converted instead of raising an error. Unlike the other pipelines, this pipeline returns a tuple of each question along with the index of the document it belongs to. This ensures that you can map each question back to its respective document even if questions are discarded when ignore_conversion_failure==True.
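
The following is a small sketch in plain Python (not a NeMo Curator API) that groups the returned (document index, question) tuples back to their source documents:

# Group the (document_index, question) tuples returned by
# run_closed_qa_pipeline back to the documents they were generated from.
from collections import defaultdict

documents = ["Four score and seven years ago..."]  # same list passed to the pipeline
questions_by_document = defaultdict(list)
for doc_index, question in closed_qa_questions:
    questions_by_document[doc_index].append(question)

for doc_index, questions in questions_by_document.items():
    print(documents[doc_index], questions)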

Generate Math Prompts#

Math prompt generation follows these steps:

  1. Generate math macro topics targeted at a specific school level.

  2. Generate subtopics for each macro topic.

  3. Generate a math problem for each topic. Additional topics can also be specified manually.

With NeMo Curator, you can perform each step as follows:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
macro_topic_responses = generator.generate_math_macro_topics(
    n_macro_topics=20,
    school_level="university",
    model=model
)
macro_topics_list = ... # Parse responses manually or with convert_response_to_yaml_list

subtopic_responses = generator.generate_math_subtopics(
    macro_topic=macro_topics_list[0],
    n_subtopics=5,
    model=model
)
subtopic_list = ... # Parse responses manually or with convert_response_to_yaml_list

topics = macro_topics_list + subtopic_list

question_responses = generator.generate_math_problem(
    topic=topics[0],
    n_openlines=10,
    model=model
)
questions = ...  # Parse responses manually or with convert_response_to_yaml_list

You can run an end-to-end pipeline with all of these steps using NemotronGenerator.run_math_pipeline.

math_questions = generator.run_math_pipeline(
    n_macro_topics=20,
    school_level="university",
    n_subtopics=5,
    n_openlines=10,
    model=model,
)
print(math_questions[0])
# Output:
# Prove that the square root of 2 is irrational.

This function runs all of the previous steps together. It attempts to automatically convert the LLM responses into Python lists using convert_response_to_yaml_list. Setting ignore_conversion_failure=True discards responses that cannot be converted instead of raising an error. However, an error is still thrown if the first step of the pipeline cannot be parsed successfully.

Generate Coding Prompts#

The coding generation pipeline is similar to the math generation pipeline. Specifically, Python-related prompt generation follows these steps:

  1. Generate macro topics relating to Python.

  2. Generate subtopics for each macro topic.

  3. Generate a Python coding problem for each topic. Additional topics can also be specified manually.

With NeMo Curator, each step can be performed as follows:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
macro_topic_responses = generator.generate_python_macro_topics(
    n_macro_topics=20,
    model=model
)
macro_topics_list = ... # Parse responses manually or with convert_response_to_yaml_list

subtopic_responses = generator.generate_python_subtopics(
    macro_topic=macro_topics_list[0],
    n_subtopics=5,
    model=model
)
subtopic_list = ... # Parse responses manually or with convert_response_to_yaml_list

topics = macro_topics_list + subtopic_list

question_responses = generator.generate_python_problem(
    topic=topics[0],
    n_openlines=10,
    model=model
)
questions = ...  # Parse responses manually or with convert_response_to_yaml_list

You can run an end-to-end pipeline with all of these steps using NemotronGenerator.run_python_pipeline.

python_questions = generator.run_python_pipeline(
    n_macro_topics=20,
    n_subtopics=5,
    n_openlines=10,
    model=model,
)
print(python_questions[0])
# Output:
# Demonstrate how to write a for loop in Python.

This function runs all of the previous steps together. It attempts to automatically convert the LLM responses into Python lists using convert_response_to_yaml_list. Setting ignore_conversion_failure=True discards responses that cannot be converted instead of raising an error. However, an error is still thrown if the first step of the pipeline cannot be parsed successfully.

Change the Prompt Templates#

Each of the steps above uses a prompt template that is filled in with the number of topics/openlines and any other information required by the step. In this context, a prompt template is a string with placeholders.

For example, here is the default prompt template for NemotronGenerator.generate_writing_tasks:

DEFAULT_WRITING_TASK_PROMPT_TEMPLATE = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.'

A complete collection of prompt templates is available in nemo_curator.synthetic.prompts. You can swap prompt templates so long as the placeholders match the required function arguments. For example, the default prompt template for generating Python questions from a topic is PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE, but it can be changed as follows:

from nemo_curator.synthetic import PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE

model = "mistralai/mixtral-8x7b-instruct-v0.1"
macro_topic_responses = generator.generate_python_macro_topics(
    n_macro_topics=20,
    model=model
)
macro_topics_list = ... # Parse responses manually or with convert_response_to_yaml_list

subtopic_responses = generator.generate_python_subtopics(
    macro_topic=macro_topics_list[0],
    n_subtopics=5,
    model=model
)
subtopic_list = ... # Parse responses manually or with convert_response_to_yaml_list

topics = macro_topics_list + subtopic_list

question_responses = generator.generate_python_problem(
    topic=topics[0],
    n_openlines=10,
    model=model,
    prompt_template=PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE,
)
questions = ...  # Parse responses manually or with convert_response_to_yaml_list

You can supply your own prompt template with additional placeholders, and NeMo Curator will insert their values correctly as long as they are specified in the function's prompt_kwargs.

For example, you can define a prompt template that generates macro topics with an exception:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
my_prompt_template = "Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible, but do not include anything relating to {exception}"
macro_topic_responses = generator.generate_macro_topics(
    n_macro_topics=5,
    model=model,
    prompt_template=my_prompt_template,
    prompt_kwargs={
        "exception": "illegal activities",
    },
)

Generate Dialogue#

After generating and mixing prompts with the methods above, you can synthesize dialogues. In a dialogue, the LLM plays the role of both the user and the assistant. The NemotronGenerator.generate_dialogue method provides an easy way to do this.

model = "mistralai/mixtral-8x7b-instruct-v0.1"
dialogue = generator.generate_dialogue(
    openline="Write a poem about the moon.",
    user_model=model,
    assistant_model=model,
    n_user_turns=3,
)
print(dialogue)
# Output:
# [{"role": "user", "content": "Write a poem about the moon."},
# {"role": "assistant", "content": "..."},
# ...]

n_user_turns=3 specifies that there will be 3 user turns in the dialogue, each of which is followed by 1 assistant turn. Therefore, the total number of turns (and the length of the returned list) will always be 2*n_user_turns. Getting the LLM to play the role of the assistant is easy, since that is its primary function.

In order to imitate the user, the following special prompt template is used:

DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words."

conversation = [
    {"role": "user", "content": "Write a poem about the moon."},
    {"role": "assistant", "content": "..."},
    ...,
]
conversation_history = ""
for turn in conversation:
    # Append a newline so consecutive turns don't run together
    conversation_history += f"{turn['role'].capitalize()}: {turn['content']}\n"

prompt = DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE.format(
    conversation_history=conversation_history
)

Generate Synthetic Two-Turn Prompts#

Nemotron-4 340B used two-turn prompts for its preference data. In this context, a two-turn prompt is a conversation consisting of a user turn, an assistant turn, and a final user turn. Here is an example:

conversation = [
    {"role": "user", "content": "Write a poem about the moon."},
    {"role": "assistant", "content": "The moon is bright. It shines at night."},
    {"role": "user", "content": "Can you make the poem longer?"},
]

Generating a two-turn prompt is easy with NeMo Curator and NemotronGenerator.generate_two_turn_prompt.

model = "mistralai/mixtral-8x7b-instruct-v0.1"
dialogue = generator.generate_two_turn_prompt(
    openline="Write a poem about the moon.",
    user_model=model,
    assistant_model=model,
)
print(dialogue)
# Output:
# conversation = [
#    {"role": "user", "content": "Write a poem about the moon."},
#    {"role": "assistant", "content": "The moon is bright. It shines at night."},
#    {"role": "user", "content": "Can you make the poem longer?"},
#]

User imitation follows the same format described in the dialogue generation section.

Classify Entities#

In addition to generating data, it can be helpful to use an LLM to classify small amounts of data. Nemotron-4 340B used an LLM to classify Wikipedia entities to determine whether they relate to math or Python programming.

NeMo Curator provides two simple functions for classifying math and Python entities:

model = "mistralai/mixtral-8x7b-instruct-v0.1"
math_classification_responses = generator.classify_math_entity(
    entity="Set theory",
    model=model,
)
print(math_classification_responses[0])
# Output:
# Yes ...

python_classification_responses = generator.classify_python_entity(
    entity="Recipes for blueberry pie",
    model=model,
)
print(python_classification_responses[0])
# Output:
# No ...

Asynchronous Generation#

All of the code so far has sent requests to the LLM service synchronously. This can be very inefficient, since many requests can be sent simultaneously in most pipelines. Therefore, NeMo Curator provides an asynchronous alternative using OpenAI's async API.

from openai import AsyncOpenAI
from nemo_curator import AsyncOpenAIClient
from nemo_curator.synthetic import AsyncNemotronGenerator

openai_client = AsyncOpenAI(
    base_url="https://integrate.api.nvidia.com/v1", api_key="<insert NVIDIA API key>"
)
client = AsyncOpenAIClient(openai_client)
generator = AsyncNemotronGenerator(client, max_concurrent_requests=10)

n_macro_topics = 20
model = "mistralai/mixtral-8x7b-instruct-v0.1"
model_kwargs = {
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 1024,
}

responses = await generator.generate_macro_topics(
    n_macro_topics=n_macro_topics, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# 1. Climate Change and Sustainable Living
# 2. Space Exploration and the Universe
# ...

As you can see, the asynchronous modules have the same interface as the synchronous modules. The only exception is that a max_concurrent_requests parameter can be supplied to the constructor of AsyncNemotronGenerator as a form of rate limiting if your service is rate limited.
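
Note that generate_macro_topics is awaited above, so it must run inside an async context. In a plain script (as opposed to a notebook, where top-level await usually works), a minimal sketch is to wrap the calls in an async function and drive it with asyncio.run:

import asyncio

# Wrap the awaited calls in a coroutine and run it from synchronous code.
async def main():
    responses = await generator.generate_macro_topics(
        n_macro_topics=n_macro_topics, model=model, model_kwargs=model_kwargs
    )
    print(responses[0])

asyncio.run(main())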

Customize the Nemotron-CC Pipeline#

Nemotron-CC is an open, large, high-quality English Common Crawl dataset that enables pretraining highly accurate LLMs over both short and long token horizons.

You can use the Nemotron-CC pipeline collection to rewrite reference documents into different formats and styles. For example, you can rephrase short sentences written in plain wording into technical, scholarly prose (like Wikipedia), or distill rambling paragraphs into concise bulleted lists.

NeMo Curator provides two versions of each pipeline:

  • Synchronous: nemo_curator.synthetic.NemotronCCGenerator

  • Asynchronous: nemo_curator.synthetic.AsyncNemotronCCGenerator

Rewrite to Wikipedia Style#

Use the NemotronCCGenerator.rewrite_to_wikipedia_style method to rewrite a document into a style that is similar to Wikipedia in terms of line spacing, punctuation, and style.

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronCCGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 512,
}

responses = generator.rewrite_to_wikipedia_style(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# The lunar surface has a high albedo, which means it reflects a significant amount of sunlight.

Generate Diverse QA Pairs#

Use the NemotronCCGenerator.generate_diverse_qa method to generate a list of diverse QA pairs from a document.

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronCCGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 600,
}

responses = generator.generate_diverse_qa(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# Question: What is the moon made of?
# Answer: The moon is made of rock and dust.

Postprocessor#

You can optionally use the NemotronCCDiverseQAPostprocessor class to reformat the output.

import pandas as pd
from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.datasets import DocumentDataset
from nemo_curator.synthetic import NemotronCCGenerator, NemotronCCDiverseQAPostprocessor

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 600,
}
responses = generator.generate_diverse_qa(document=document, model=model, model_kwargs=model_kwargs)
postprocessor = NemotronCCDiverseQAPostprocessor(text_field="text", response_field="diverse_qa_response")
dataset = DocumentDataset.from_pandas(pd.DataFrame({"text": document, "diverse_qa_response": responses}))

# This postprocessor will sample a random number of QA pairs up to max_num_pairs.
# If a tokenizer is provided, the number of QA pairs will be sampled from at least
# 1 and at most floor(max_num_pairs * num_tokens / 150).
# Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.
# The generated QA pairs are shuffled and then appended to the original text.
cleaned_dataset = postprocessor(dataset)

first_entry = cleaned_dataset.df.head(1)
print(first_entry["diverse_qa_response"])
# Output:
# The moon is bright. It shines at night. Question: What is the moon made of? Answer: The moon is made of rock and dust.

Generate a Knowledge List#

Use the NemotronCCGenerator.generate_knowledge_list method to generate a list of knowledge from a document.

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronCCGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 600,
}

responses = generator.generate_knowledge_list(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# - The moon is made of rock and dust.
# - The moon is the only natural satellite of the Earth.
# ...

Postprocessor#

You can optionally use the NemotronCCKnowledgeListPostprocessor class to reformat the output.

import pandas as pd
from openai import OpenAI

from nemo_curator import OpenAIClient
from nemo_curator.datasets import DocumentDataset
from nemo_curator.synthetic import NemotronCCGenerator, NemotronCCKnowledgeListPostprocessor

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 600,
}

responses = generator.generate_knowledge_list(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# - The moon is made of rock and dust.
# - The moon is the only natural satellite of the Earth.
# ...

postprocessor = NemotronCCKnowledgeListPostprocessor(text_field="knowledge_list_response")
dataset = DocumentDataset.from_pandas(pd.DataFrame({"knowledge_list_response": responses}))

# This postprocessor removes formatting artifacts
# such as bullet point prefixes ("- ") and extra indentation from each line,
# ensuring that the final output is a clean, uniformly formatted list of knowledge items.
# The processing includes skipping any initial non-bullet line and merging related lines
# to reconstruct multi-line questions or answers.
cleaned_dataset = postprocessor(dataset)

first_entry = cleaned_dataset.df.head(1)
print(first_entry["knowledge_list_response"])
# Output:
# The moon is made of rock and dust.
# The moon is the only natural satellite of the Earth.

Distill a Document#

Use the NemotronCCGenerator.distill method to make a document more concise.

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronCCGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = "The moon is bright. It shines at night."
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 1600,
}

responses = generator.distill(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# The moon is bright at night.

Extract Knowledge#

Use the NemotronCCGenerator.extract_knowledge method to extract knowledge from a document.

from openai import OpenAI
from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronCCGenerator

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
client = OpenAIClient(openai_client)
generator = NemotronCCGenerator(client)

document = ("The moon is bright. It shines at night. I love the moon. I first saw it up"
           " close through a telescope in 1999 at a sleepover.")
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 1400,
}

responses = generator.extract_knowledge(
    document=document, model=model, model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# The moon is a reflective body visible from the Earth at night.

Combine Synthetic Data Generation with Other NeMo Curator Modules#

Unlike the rest of NeMo Curator, synthetic data generation runs independently of Dask. This is due to the scale difference between the modules. Synthetic data is usually generated on the order of 100,000 samples, while pretraining datasets operate at the scale of 1,000,000,000+ samples. Spinning up a Dask cluster is usually not needed at this scale. However, you may want to deduplicate or filter your responses with NeMo Curator. For example, topics may end up duplicated, and sending duplicate topics as queries to an LLM wastes valuable resources. We recommend using DocumentDataset.from_pandas and DocumentDataset.to_pandas to transition between workflows that require the other NeMo Curator modules.

For example, you could do something like this:

import pandas as pd
from nemo_curator.datasets import DocumentDataset

# Initialize client, etc.

model = "mistralai/mixtral-8x7b-instruct-v0.1"
macro_topic_responses = generator.generate_macro_topics(
    n_macro_topics=20, model=model
)
macro_topics_list = ... # Parse responses manually or with convert_response_to_yaml_list

subtopic_responses = generator.generate_subtopics(
    macro_topic=macro_topics_list[0], n_subtopics=5, model=model
)
subtopic_list = ... # Parse responses manually or with convert_response_to_yaml_list

df = pd.DataFrame({"topics": subtopic_list})
dataset = DocumentDataset.from_pandas(df)

# Deduplicate/filter with NeMo Curator

filtered_topics = dataset.to_pandas()["topics"].to_list()

# Continue with synthetic data generation pipeline