合成数据#

class nemo_curator.synthetic.NemotronGenerator( llm_client: LLMClient, )#

提供了一系列用于生成合成数据的方法，这些方法在 Nemotron-4 340B 技术报告 (https://arxiv.org/abs/2406.11704v1) 中进行了描述，并受到 UltraChat 论文 (https://arxiv.org/abs/2305.14233) 的启发

classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs={}, ) → List[str]#

提示 LLM 分类实体是否与数学相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 分类实体是否与 Python 相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

通过查询 LLM 将 LLM 的响应转换为字符串列表 :param llm_response: LLM 的原始未格式化响应 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有 {llm_response} 参数，该参数将使用此函数中传递的 llm_response 值填充。
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

从原始 LLM 响应中解析的元素列表

generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于参考文档生成封闭式问答问题列表 :param document: 生成问题时要使用的文档 :param n_openlines: 每个文档要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - document: 将使用此函数中传递的文档填充 - n_openlines: 将使用此函数中传递的 n_openlines 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 基于给定的 openline 生成对话。LLM 将交替模拟用户和助手。 :param openline: 将构成第一个用户回合的 openline。 :param user_model: 将模拟用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
n_user_turns – 要经历的用户回合数。openline 算作 1 个用户回合。因此，如果有 3 个用户回合，则 2 个将由模拟用户的 LLM 生成。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于世界的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于数学的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数或像 “five” 这样的字符串。 :param school_level: 数学问题应针对的学校级别。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充 - school_level: 将使用此函数中传递的 school_level 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成数学问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines: 将使用此函数中传递的 n_subtopics 填充 - topic: 将使用此函数中传递的主题填充在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与数学宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成开放式问答问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = '列出 {n_macro_topics} 个 Python 语言中的重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于 Python 编程语言的宏观主题列表。 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数，也可以是像 “five” 这样的字符串。 :param model: 应该用于生成宏观主题的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_python_problem( topic: str, n_openlines: str | int, model: str, language='Python', prompt_template: str = '生成 {n_openlines} 个 {language} 编码问题，这些问题与 “{topic}” 相关。这些问题应该适合刚学习 “{topic}” 的初学者。你的答案应该是一个问题列表。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成编码问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成响应的模型的名称。

参数:

language – 生成这些问题时要面向的编程语言。
prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic - language：将填充在此函数中传递的 language 在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '列出 Python 语言中与 “{macro_topic}” 相关的 {n_subtopics} 个重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与 Python 宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '你能生成 {n_subtopics} 个全面的主题，这些主题涵盖 {macro_topic} 的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 生成作为助手和用户的响应，基于给定的开放行。“用户 -> 助手 -> 用户” 的对话形式 :param openline: 将构成第一个用户回合的开放行。 :param user_model: 将模仿用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = '你能生成 {n_openlines} 个任务吗？每个任务都需要创建一个与 {topic} 相关的 “{text_material_type}”。每个任务应该简洁明了，并且只包含一到两句话。这些任务应该尽可能多样化。你的答案应该是一个任务列表。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题和文档类型生成写作任务列表 :param topic: 要为其生成写作任务的主题。 :param text_material_type: 问题应要求生成的文档类型（例如，“电子邮件”，“诗歌”） :param n_openlines: 每个主题和文本材料对要生成的任务数量。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - topic：将填充在此函数中传递的 topic - text_material_type：将填充在此函数中传递的 text_material_type - n_openlines：将填充在此函数中传递的 n_openlines
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = '问题： {openline}\n\n你能修改上面的问题以包含更多上下文或细节吗？修改后的问题可以是以下任何一种：\n1. 为原始问题添加一些上下文。上下文可以说明问题的重要性，解释背景知识，或添加其他合理的信息。\n2. 将问题更改为不同的格式或风格，例如，祈使句，答案的长度要求等。\n3. 需要详细说明特定主题或讨论某个观点的加长问题。\n4. 任何其他相关的问题或陈述。\n\n修改后的问题应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的问题或陈述在一个列表中。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改开放式问答问题给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为问题生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = '任务： {openline}\n\n你能修改上面的任务以包含更详细的要求吗？这些要求可以是以下任何一种：\n1. 要求详细说明特定主题或讨论某个观点。\n2. 要求包含一些示例、数据点或参考资料。\n3. 要求遵循特定的格式或风格，例如，不超过 300 个单词，包括特定单词等。\n4. 任何其他合理的请求，以使任务更详细。\n\n修改后的任务应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的任务在一个列表中。让这些任务尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改写作任务给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为任务生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

run_closed_qa_pipeline( documents: List[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = '文本： {document}\n\n根据上面的文本，你能提出 {n_openlines} 个问题或任务吗？它们可以是以下任何一种：\n1. 询问文本中的某些信息；\n2. 总结、释义或解释文本；\n3. 编写与文本类似的内容；\n4. 任何其他与文本相关的合理请求。\n\n让问题或任务尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[Tuple[int, str]]#

运行一个管道，用于为对话自动生成封闭式问答开放行 :param documents: 要为其生成封闭式问答问题的文档列表 :param n_openlines: 每个文档要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

closed_qa_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - document：将填充在此函数中传递的文档列表的一个元素。不能将其他参数传递给此提示模板。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

一个列表，其中包含成对的元素，第一个元素表示用于生成文档列表中问题的文档索引，第二个元素表示合成生成的封闭式问答提示。示例：[(0, “总结此文档”), …]

run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖 {school_level} 教授的数学知识吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', subtopic_prompt_template: str = '列出 {n_subtopics} 个数学主题，这些主题涵盖 “{macro_topic}” 的各个方面。你的答案应该是一个主题列表。让这些主题尽可能多样化。', math_problem_prompt_template: str = '生成 {n_openlines} 个与 “{topic}” 相关或可以使用 “{topic}” 解决的数学问题。你的答案应该是一个问题列表。让它们尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个管道，用于为对话自动生成数学问题 :param n_macro_topics: 要生成的宏观主题的数量。 :param school_level: 生成宏观主题时要面向的学校级别。 :param n_subtopics: 每个宏观主题要生成的子主题的数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics - school_level：将填充在此函数中传递的 school_level。不能将其他参数传递给此提示模板。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
math_problem_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - topic：将填充生成的主题。不能将其他参数传递给此提示模板。在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的数学提示列表

run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖我们日常生活、世界和科学的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。例如，1. 食物和饮料。\n2. 科技。\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的开放式问答题目的管道 :param n_macro_topics: 要生成宏观主题的数量 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param n_openlines: 每个主题要生成的问题数量。 :param n_revisions: 每个原始问题要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
open_qa_from_topics_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充生成的topic。此提示模板不能传递其他参数。
revise_open_qa_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充生成的开放式问答题目。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的开放式问答题目的列表

run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的 Python 问题的管道 :param n_macro_topics: 要生成的宏观主题的数量。 :param n_subtopics: 每个宏观主题要生成的子主题数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
python_problem_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - language：将填充 “Python”。 - topic：将填充生成的topic。此提示模板不能传递其他参数。在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的 Python 提示的列表

run_writing_pipeline( topics: List[str], text_material_types: List[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[str]#

运行一个用于自动生成对话的写作任务题目的管道 :param topics: 要为其生成任务的主题列表 :param text_material_types: 写作材料类型列表，例如“Essay”或“Blog post” :param n_openlines: 每个（topic，text_material_type）对要生成的任务数量。 :param n_revisions: 每个原始任务要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充在此函数中传递的 topics 列表中的一个元素。 - text_material_type：将填充在此函数中传递的 text_material_types 列表中的一个元素。此提示模板不能传递其他参数。
revise_writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充在管道中生成的写作任务之一。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

合成生成的写作任务提示的列表

class nemo_curator.synthetic.AsyncNemotronGenerator( llm_client: AsyncLLMClient, logger: LoggerAdapter | str = './', max_concurrent_requests: int | None = None, )#

提供了一系列用于生成合成数据的方法，这些方法在 Nemotron-4 340B 技术报告 (https://arxiv.org/abs/2406.11704v1) 中进行了描述，并受到 UltraChat 论文 (https://arxiv.org/abs/2305.14233) 的启发

async classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs={}, ) → List[str]#

提示 LLM 分类实体是否与数学相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 分类实体是否与 Python 相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

通过查询 LLM 将 LLM 的响应转换为字符串列表 :param llm_response: LLM 的原始未格式化响应 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有 {llm_response} 参数，该参数将使用此函数中传递的 llm_response 值填充。
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

从原始 LLM 响应中解析的元素列表

async generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于参考文档生成封闭式问答问题列表 :param document: 生成问题时要使用的文档 :param n_openlines: 每个文档要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - document: 将使用此函数中传递的文档填充 - n_openlines: 将使用此函数中传递的 n_openlines 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 基于给定的 openline 生成对话。LLM 将交替模拟用户和助手。 :param openline: 将构成第一个用户回合的 openline。 :param user_model: 将模拟用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
n_user_turns – 要经历的用户回合数。openline 算作 1 个用户回合。因此，如果有 3 个用户回合，则 2 个将由模拟用户的 LLM 生成。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

async generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于世界的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于数学的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数或像 “five” 这样的字符串。 :param school_level: 数学问题应针对的学校级别。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充 - school_level: 将使用此函数中传递的 school_level 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成数学问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines: 将使用此函数中传递的 n_subtopics 填充 - topic: 将使用此函数中传递的主题填充在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与数学宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成开放式问答问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = '列出 {n_macro_topics} 个 Python 语言中的重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于 Python 编程语言的宏观主题列表。 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数，也可以是像 “five” 这样的字符串。 :param model: 应该用于生成宏观主题的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_problem( topic: str, n_openlines: str | int, model: str, language='Python', prompt_template: str = '生成 {n_openlines} 个 {language} 编码问题，这些问题与 “{topic}” 相关。这些问题应该适合刚学习 “{topic}” 的初学者。你的答案应该是一个问题列表。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成编码问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成响应的模型的名称。

参数:

language – 生成这些问题时要面向的编程语言。
prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic - language：将填充在此函数中传递的 language 在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '列出 Python 语言中与 “{macro_topic}” 相关的 {n_subtopics} 个重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与 Python 宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '你能生成 {n_subtopics} 个全面的主题，这些主题涵盖 {macro_topic} 的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 生成作为助手和用户的响应，基于给定的开放行。“用户 -> 助手 -> 用户” 的对话形式 :param openline: 将构成第一个用户回合的开放行。 :param user_model: 将模仿用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

async generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = '你能生成 {n_openlines} 个任务吗？每个任务都需要创建一个与 {topic} 相关的 “{text_material_type}”。每个任务应该简洁明了，并且只包含一到两句话。这些任务应该尽可能多样化。你的答案应该是一个任务列表。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题和文档类型生成写作任务列表 :param topic: 要为其生成写作任务的主题。 :param text_material_type: 问题应要求生成的文档类型（例如，“电子邮件”，“诗歌”） :param n_openlines: 每个主题和文本材料对要生成的任务数量。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - topic：将填充在此函数中传递的 topic - text_material_type：将填充在此函数中传递的 text_material_type - n_openlines：将填充在此函数中传递的 n_openlines
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = '问题： {openline}\n\n你能修改上面的问题以包含更多上下文或细节吗？修改后的问题可以是以下任何一种：\n1. 为原始问题添加一些上下文。上下文可以说明问题的重要性，解释背景知识，或添加其他合理的信息。\n2. 将问题更改为不同的格式或风格，例如，祈使句，答案的长度要求等。\n3. 需要详细说明特定主题或讨论某个观点的加长问题。\n4. 任何其他相关的问题或陈述。\n\n修改后的问题应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的问题或陈述在一个列表中。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改开放式问答问题给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为问题生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = '任务： {openline}\n\n你能修改上面的任务以包含更详细的要求吗？这些要求可以是以下任何一种：\n1. 要求详细说明特定主题或讨论某个观点。\n2. 要求包含一些示例、数据点或参考资料。\n3. 要求遵循特定的格式或风格，例如，不超过 300 个单词，包括特定单词等。\n4. 任何其他合理的请求，以使任务更详细。\n\n修改后的任务应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的任务在一个列表中。让这些任务尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改写作任务给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为任务生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async run_closed_qa_pipeline( documents: List[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = '文本： {document}\n\n根据上面的文本，你能提出 {n_openlines} 个问题或任务吗？它们可以是以下任何一种：\n1. 询问文本中的某些信息；\n2. 总结、释义或解释文本；\n3. 编写与文本类似的内容；\n4. 任何其他与文本相关的合理请求。\n\n让问题或任务尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[Tuple[int, str]]#

运行一个管道，用于为对话自动生成封闭式问答开放行 :param documents: 要为其生成封闭式问答问题的文档列表 :param n_openlines: 每个文档要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

closed_qa_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - document：将填充在此函数中传递的文档列表的一个元素。不能将其他参数传递给此提示模板。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

一个列表，其中包含成对的元素，第一个元素表示用于生成文档列表中问题的文档索引，第二个元素表示合成生成的封闭式问答提示。示例：[(0, “总结此文档”), …]

async run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖 {school_level} 教授的数学知识吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', subtopic_prompt_template: str = '列出 {n_subtopics} 个数学主题，这些主题涵盖 “{macro_topic}” 的各个方面。你的答案应该是一个主题列表。让这些主题尽可能多样化。', math_problem_prompt_template: str = '生成 {n_openlines} 个与 “{topic}” 相关或可以使用 “{topic}” 解决的数学问题。你的答案应该是一个问题列表。让它们尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个管道，用于为对话自动生成数学问题 :param n_macro_topics: 要生成的宏观主题的数量。 :param school_level: 生成宏观主题时要面向的学校级别。 :param n_subtopics: 每个宏观主题要生成的子主题的数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics - school_level：将填充在此函数中传递的 school_level。不能将其他参数传递给此提示模板。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
math_problem_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - topic：将填充生成的主题。不能将其他参数传递给此提示模板。在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的数学提示列表

async run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖我们日常生活、世界和科学的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。例如，1. 食物和饮料。\n2. 科技。\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的开放式问答题目的管道 :param n_macro_topics: 要生成宏观主题的数量 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param n_openlines: 每个主题要生成的问题数量。 :param n_revisions: 每个原始问题要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
open_qa_from_topics_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充生成的topic。此提示模板不能传递其他参数。
revise_open_qa_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充生成的开放式问答题目。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的开放式问答题目的列表

async run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的 Python 问题的管道 :param n_macro_topics: 要生成的宏观主题的数量。 :param n_subtopics: 每个宏观主题要生成的子主题数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
python_problem_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - language：将填充 “Python”。 - topic：将填充生成的topic。此提示模板不能传递其他参数。在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的 Python 提示的列表

async run_writing_pipeline( topics: List[str], text_material_types: List[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[str]#

运行一个用于自动生成对话的写作任务题目的管道 :param topics: 要为其生成任务的主题列表 :param text_material_types: 写作材料类型列表，例如“Essay”或“Blog post” :param n_openlines: 每个（topic，text_material_type）对要生成的任务数量。 :param n_revisions: 每个原始任务要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充在此函数中传递的 topics 列表中的一个元素。 - text_material_type：将填充在此函数中传递的 text_material_types 列表中的一个元素。此提示模板不能传递其他参数。
revise_writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充在管道中生成的写作任务之一。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

合成生成的写作任务提示的列表

class nemo_curator.synthetic.NemotronCCGenerator( llm_client: LLMClient, )#

提供了一系列用于生成 Nemotron-CC 论文中描述的合成数据的方法 (https://arxiv.org/abs/2412.02595)。

distill( document: str, model: str, prompt_template: str = '你的任务是阅读并改述所提供的文本，并遵循以下指示:\n- 目标是创建原始文本的浓缩但准确且信息丰富的版本，而不是简单的摘要。\n- 捕捉并保留原始文本中的关键信息、核心概念、重要价值、事实细节，同时使其更具可读性和可访问性。\n- 保留技术术语、专业词汇和复杂概念。\n- 保留示例、推理过程的解释和支持性证据，以保持文本的深度和背景。\n- 仅包含原始文本中存在的信息。不要添加新的或未经证实的声明。\n- 以纯文本格式书写文本，不进行格式化。\n\n这是文本:\n{document}\n\n任务:\n在仔细阅读上述文本后，按照指示以高质量和清晰的英语改述它。你的回复以 "Paraphrased Text:" 开头。', system_prompt: str = '你是一个人工智能助手。你仔细提供准确、真实、周到、细致的答案，并且擅长推理。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

Distills the essential content from a document.

参数:

document (str) – The input document text to distill.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for distillation. Defaults to DISTILL_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_DISTILL_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

Return type:

List[str]

extract_knowledge( document: str, model: str, prompt_template: str = "你的任务是根据以下指示改写所提供文本中的知识。\n- 使用易于理解和高质量的英语句子（如教科书和维基百科中的句子）将文本改写为一个或多个段落。\n- 重点关注人文科学、社会科学、自然科学、技术、工程、数学、法律和法学、商业、管理、艺术、教育、农业科学、政治和历史等学科的内容。\n- 忽略不包含有用的事实或知识的内容。\n- 保留示例、推理过程的解释和支持性证据，以保持文本的深度和背景。\n- 不要添加或更改细节。仅重述文本中已有的内容。\n- 以纯文本格式书写。\n- 不要添加标题、副标题、注释或评论。\n\n文本:\n{document}\n\n任务:\n按照指示，将上述文本中的事实和知识改写为一个或多个段落。", system_prompt: str = '一个好奇的用户和人工智能助手之间的对话。助手对问题给出有帮助、详细和礼貌的回答。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

Extracts knowledge from the provided document.

参数:

document (str) – The input document text from which to extract knowledge.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for knowledge extraction. Defaults to EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

Return type:

List[str]

generate_diverse_qa( document: str, model: str, prompt_template: str = '任务:\n阅读文本，提出问题并回答问题。\n\n遵循以下指示:\n1. 提出需要不同认知技能或涵盖文本不同方面的各种问题。\n2. 以各种形式提出问题，例如:\n - 需要确定陈述是对还是错的是/否问题。\n - 以什么、如何、何时、何地、为什么和谁等词开头的开放式问题。\n - 提供两个或多个选项供选择的多项选择题。在问题中包含选项。\n - 比较两个数量或对象的比较问题，并确定它们之间的关系。\n - 测试理解和分析文本能力的阅读理解问题。\n - 测试解决数学、物理或逻辑问题的能力的问题解决问题。\n3. 重点关注询问有关文本中的事实信息、重要知识或具体细节的问题。\n4. 使用清晰简洁的语言编写问题和答案。\n5. 使用纯文本。不要使用 Markdown。\n6. 每个问题和答案对都应在单独的一行上。用 "Question:" 标记问题，用 "Answer:" 标记答案。\n\n文本:\n{document}\n\n任务:\n阅读上述文本后，按照指示提出最多 8 个问题并提供正确答案。以以下格式给出你的回复:\n\n以下是基于所提供文本的问题和答案:\n- Question: [第一个问题] Answer: [第一个答案]\n- Question: [第二个问题] Answer: [第二个答案]\n....', system_prompt: str = '一个好奇的用户和人工智能助手之间的对话。助手对问题给出有帮助、详细和礼貌的回答。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

Generates diverse QA pairs from the provided document.

参数:

document (str) – The input document text used to generate QA pairs.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for generating QA pairs. Defaults to DIVERSE_QA_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

Return type:

List[str]

generate_knowledge_list( document: str, model: str, prompt_template: str = '回顾文本并提取关键信息。遵循以下指示:\n- 仔细阅读上述文本，并提供从文本中提取的事实信息、具体细节、核心概念以及重要数字和统计数据的简洁而有组织的列表。\n- 确保每个要点都清晰、具体，并得到原始文本的支持。\n- 确保提取的文本信息密集，并且更容易学习。\n- 不要添加标题或副标题。\n\n文本:\n{document}\n\n任务:\n按照指示，从上述文本中提取事实信息、具体细节和核心概念。', system_prompt: str = '一个好奇的用户和人工智能助手之间的对话。助手对问题给出有帮助、详细和礼貌的回答。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

Generates a list of knowledge items from the provided document.

参数:

document (str) – The input document text to process.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for generating a knowledge list. Defaults to KNOWLEDGE_LIST_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

Return type:

List[str]

rewrite_to_wikipedia_style( document: str, model: str, prompt_template: str = '对于以下段落，请给我一个高质量英语的、像维基百科句子一样的多样化改述。在单独的一行上以 "Here is a paraphrased version:" 开头你的答案。\n\n文本: {document}', system_prompt: str = '一个好奇的用户和人工智能助手之间的对话。助手对问题给出有帮助、详细和礼貌的回答。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

Rewrites a document into a Wikipedia-style narrative.

参数:

document (str) – The input document text to rewrite.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for rewriting. Defaults to WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

Return type:

List[str]

class nemo_curator.synthetic.NemotronCCDiverseQAPostprocessor( tokenizer: transformers.AutoTokenizer | None = None, text_field: str = 'text', response_field: str = 'response', max_num_pairs: int = 1, prefix: str = '以下是基于所提供文本的问题和答案:', )#

Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.

The generated QA pairs are shuffled and then appended to the original text.

call( dataset: DocumentDataset, ) → DocumentDataset#

Performs an arbitrary operation on a dataset

参数:: dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.synthetic.NemotronCCKnowledgeListPostprocessor(text_field: str = 'text')#

Processes and cleans the output generated by the Nemotron-CC Knowledge List pipeline.

This class is responsible for postprocessing raw text responses produced by the Nemotron-CC Knowledge List generation pipeline. It removes formatting artifacts such as bullet point prefixes (”- “) and extra indentation from each line, ensuring that the final output is a clean, uniformly formatted list of knowledge items. The processing includes skipping any initial non-bullet line and merging related lines to reconstruct multi-line questions or answers.

call( dataset: DocumentDataset, ) → DocumentDataset#

Performs an arbitrary operation on a dataset

参数:: dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.synthetic.AsyncNemotronGenerator( llm_client: AsyncLLMClient, logger: LoggerAdapter | str = './', max_concurrent_requests: int | None = None, )#

提供了一系列用于生成合成数据的方法，这些方法在 Nemotron-4 340B 技术报告 (https://arxiv.org/abs/2406.11704v1) 中进行了描述，并受到 UltraChat 论文 (https://arxiv.org/abs/2305.14233) 的启发

async classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs={}, ) → List[str]#

提示 LLM 分类实体是否与数学相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 分类实体是否与 Python 相关 :param entity: 要分类的实体 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - entity: 将使用此函数中传递的实体填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

通过查询 LLM 将 LLM 的响应转换为字符串列表 :param llm_response: LLM 的原始未格式化响应 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有 {llm_response} 参数，该参数将使用此函数中传递的 llm_response 值填充。
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

从原始 LLM 响应中解析的元素列表

async generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于参考文档生成封闭式问答问题列表 :param document: 生成问题时要使用的文档 :param n_openlines: 每个文档要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - document: 将使用此函数中传递的文档填充 - n_openlines: 将使用此函数中传递的 n_openlines 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 基于给定的 openline 生成对话。LLM 将交替模拟用户和助手。 :param openline: 将构成第一个用户回合的 openline。 :param user_model: 将模拟用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
n_user_turns – 要经历的用户回合数。openline 算作 1 个用户回合。因此，如果有 3 个用户回合，则 2 个将由模拟用户的 LLM 生成。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

async generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于世界的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于数学的宏观主题列表 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数或像 “five” 这样的字符串。 :param school_level: 数学问题应针对的学校级别。 :param model: 应使用的模型名称以生成宏观主题。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充 - school_level: 将使用此函数中传递的 school_level 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成数学问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines: 将使用此函数中传递的 n_subtopics 填充 - topic: 将使用此函数中传递的主题填充在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与数学宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成开放式问答问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数。 :param model: 应使用的模型名称以生成响应。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = '列出 {n_macro_topics} 个 Python 语言中的重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成关于 Python 编程语言的宏观主题列表。 :param n_macro_topics: 要生成的宏观主题的数量。可以是像 5 这样的整数，也可以是像 “five” 这样的字符串。 :param model: 应该用于生成宏观主题的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics: 将使用此函数中传递的 n_macro_topics 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_problem( topic: str, n_openlines: str | int, model: str, language='Python', prompt_template: str = '生成 {n_openlines} 个 {language} 编码问题，这些问题与 “{topic}” 相关。这些问题应该适合刚学习 “{topic}” 的初学者。你的答案应该是一个问题列表。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题生成编码问题列表 :param topic: 要为其生成问题的主题。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成响应的模型的名称。

参数:

language – 生成这些问题时要面向的编程语言。
prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_subtopics - topic：将填充在此函数中传递的 topic - language：将填充在此函数中传递的 language 在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '列出 Python 语言中与 “{macro_topic}” 相关的 {n_subtopics} 个重要概念。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与 Python 宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = '你能生成 {n_subtopics} 个全面的主题，这些主题涵盖 {macro_topic} 的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 生成与宏观主题相关的子主题列表 :param macro_topic: 要为其生成子主题的宏观主题。 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_subtopics: 将使用此函数中传递的 n_subtopics 填充 - macro_topic: 将使用此函数中传递的 macro_topic 填充
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict = {}, user_model_kwargs: dict = {}, assistant_model_kwargs: dict = {}, ) → List[dict]#

提示 LLM 生成作为助手和用户的响应，基于给定的开放行。“用户 -> 助手 -> 用户” 的对话形式 :param openline: 将构成第一个用户回合的开放行。 :param user_model: 将模仿用户的模型。

参数:

assistant_model – 将模拟助手的模型必须在构造函数中传递的 LLMClient 中可用。
prompt_template – 模拟用户时要使用的提示的格式字符串。它必须具有以下参数： - converstation_history: 将使用到目前为止的对话的格式化历史记录填充。在 nemo_curator.synthetic 中找到的一些示例模板包括： - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
user_model_kwargs – 应传递给用户的 LLMClient.query_model 调用的任何其他关键字参数。
assistant_model_kwargs – 应传递给助手的 LLMClient.query_model 调用的任何其他关键字参数。

返回:

用户和助手之间的对话

async generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = '你能生成 {n_openlines} 个任务吗？每个任务都需要创建一个与 {topic} 相关的 “{text_material_type}”。每个任务应该简洁明了，并且只包含一到两句话。这些任务应该尽可能多样化。你的答案应该是一个任务列表。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 基于主题和文档类型生成写作任务列表 :param topic: 要为其生成写作任务的主题。 :param text_material_type: 问题应要求生成的文档类型（例如，“电子邮件”，“诗歌”） :param n_openlines: 每个主题和文本材料对要生成的任务数量。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - topic：将填充在此函数中传递的 topic - text_material_type：将填充在此函数中传递的 text_material_type - n_openlines：将填充在此函数中传递的 n_openlines
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = '问题： {openline}\n\n你能修改上面的问题以包含更多上下文或细节吗？修改后的问题可以是以下任何一种：\n1. 为原始问题添加一些上下文。上下文可以说明问题的重要性，解释背景知识，或添加其他合理的信息。\n2. 将问题更改为不同的格式或风格，例如，祈使句，答案的长度要求等。\n3. 需要详细说明特定主题或讨论某个观点的加长问题。\n4. 任何其他相关的问题或陈述。\n\n修改后的问题应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的问题或陈述在一个列表中。让它们尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改开放式问答问题给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为问题生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = '任务： {openline}\n\n你能修改上面的任务以包含更详细的要求吗？这些要求可以是以下任何一种：\n1. 要求详细说明特定主题或讨论某个观点。\n2. 要求包含一些示例、数据点或参考资料。\n3. 要求遵循特定的格式或风格，例如，不超过 300 个单词，包括特定单词等。\n4. 任何其他合理的请求，以使任务更详细。\n\n修改后的任务应包含两句话、三句话或四句话。你应该生成 {n_revisions} 个修改后的任务在一个列表中。让这些任务尽可能多样化。', prompt_kwargs: dict = {}, model_kwargs: dict = {}, ) → List[str]#

提示 LLM 修改写作任务给定的次数 :param openline: 要修改的开放行 :param n_revisions: 为任务生成的修订次数。 :param model: 应该用于生成响应的模型的名称。

参数:

prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - openline：将填充在此函数中传递的 openline - n_revisions：将填充在此函数中传递的 n_revisions
prompt_kwargs – 应传递给提示模板的任何其他关键字参数。默认模板不需要任何参数。
model_kwargs – 应传递给 LLMClient.query_model 调用的任何其他关键字参数。

返回:

来自 LLM 的响应列表。仅当 model_kwargs 中设置了 n > 1 时，列表长度才大于 1。

async run_closed_qa_pipeline( documents: List[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = '文本： {document}\n\n根据上面的文本，你能提出 {n_openlines} 个问题或任务吗？它们可以是以下任何一种：\n1. 询问文本中的某些信息；\n2. 总结、释义或解释文本；\n3. 编写与文本类似的内容；\n4. 任何其他与文本相关的合理请求。\n\n让问题或任务尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[Tuple[int, str]]#

运行一个管道，用于为对话自动生成封闭式问答开放行 :param documents: 要为其生成封闭式问答问题的文档列表 :param n_openlines: 每个文档要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

closed_qa_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - document：将填充在此函数中传递的文档列表的一个元素。不能将其他参数传递给此提示模板。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

一个列表，其中包含成对的元素，第一个元素表示用于生成文档列表中问题的文档索引，第二个元素表示合成生成的封闭式问答提示。示例：[(0, “总结此文档”), …]

async run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖 {school_level} 教授的数学知识吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。', subtopic_prompt_template: str = '列出 {n_subtopics} 个数学主题，这些主题涵盖 “{macro_topic}” 的各个方面。你的答案应该是一个主题列表。让这些主题尽可能多样化。', math_problem_prompt_template: str = '生成 {n_openlines} 个与 “{topic}” 相关或可以使用 “{topic}” 解决的数学问题。你的答案应该是一个问题列表。让它们尽可能多样化。', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个管道，用于为对话自动生成数学问题 :param n_macro_topics: 要生成的宏观主题的数量。 :param school_level: 生成宏观主题时要面向的学校级别。 :param n_subtopics: 每个宏观主题要生成的子主题的数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型的名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics - school_level：将填充在此函数中传递的 school_level。不能将其他参数传递给此提示模板。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
math_problem_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_openlines：将填充在此函数中传递的 n_openlines - topic：将填充生成的主题。不能将其他参数传递给此提示模板。在 nemo_curator.synthetic 中找到的一些示例模板包括： - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的数学提示列表

async run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = '你能生成 {n_macro_topics} 个全面的主题，这些主题涵盖我们日常生活、世界和科学的各个方面吗？你的答案应该是一个主题列表。让这些主题尽可能多样化。例如，1. 食物和饮料。\n2. 科技。\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的开放式问答题目的管道 :param n_macro_topics: 要生成宏观主题的数量 :param n_subtopics: 每个宏观主题要生成的子主题数量 :param n_openlines: 每个主题要生成的问题数量。 :param n_revisions: 每个原始问题要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
open_qa_from_topics_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充生成的topic。此提示模板不能传递其他参数。
revise_open_qa_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充生成的开放式问答题目。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的开放式问答题目的列表

async run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, additional_macro_topics: List[str] = [], additional_subtopics: List[str] = [], ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → List[str]#

运行一个用于自动生成对话的 Python 问题的管道 :param n_macro_topics: 要生成的宏观主题的数量。 :param n_subtopics: 每个宏观主题要生成的子主题数量。 :param n_openlines: 每个主题要生成的问题数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

macro_topic_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_macro_topics：将填充在此函数中传递的 n_macro_topics。此提示模板不能传递其他参数。
subtopic_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - n_subtopics：将填充在此函数中传递的 n_subtopics - macro_topic：将填充生成的宏观主题。不能将其他参数传递给此提示模板。
python_problem_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - language：将填充 “Python”。 - topic：将填充生成的topic。此提示模板不能传递其他参数。在 nemo_curator.synthetic 中找到的一些示例模板包括： - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据
combine_topics – 如果为 True，则在生成开放行时将宏观主题与子主题混合。如果为 False，则仅使用子主题。

返回:

合成生成的 Python 提示的列表

async run_writing_pipeline( topics: List[str], text_material_types: List[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = '以下文档包含一个项目列表。将项目列表解析为字符串的 yaml 列表。不要解析文档的任何其他部分。你的响应不应有额外的格式，只需字符串的 yaml 列表。\n\n {llm_response}', base_model_kwargs: dict = {}, conversion_model_kwargs: dict = {}, ignore_conversion_failure: bool = False, ) → List[str]#

运行一个用于自动生成对话的写作任务题目的管道 :param topics: 要为其生成任务的主题列表 :param text_material_types: 写作材料类型列表，例如“Essay”或“Blog post” :param n_openlines: 每个（topic，text_material_type）对要生成的任务数量。 :param n_revisions: 每个原始任务要生成的修订数量。 :param model: 应该用于生成所有响应的模型名称。

参数:

writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_openlines：将填充在此函数中传递的 n_openlines。 - topic：将填充在此函数中传递的 topics 列表中的一个元素。 - text_material_type：将填充在此函数中传递的 text_material_types 列表中的一个元素。此提示模板不能传递其他参数。
revise_writing_task_prompt_template – 要使用的提示的格式字符串。它必须具有以下参数： - n_revisions：将填充在此函数中传递的 n_revisions。 - openline：将填充在管道中生成的写作任务之一。此提示模板不能传递其他参数。
yaml_conversion_prompt_template – 要使用的提示的格式字符串。它必须包含以下参数： - llm_response：将填充管道每个阶段的原始 LLM 响应。不能将其他参数传递给此提示模板。
base_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的正常阶段。
conversion_model_kwargs – 应该传递给 LLMClient.query_model 调用的任何其他关键字参数，用于管道的 yaml 转换阶段。
ignore_conversion_failure – 如果可能，忽略 yaml 转换失败，并丢弃尝试转换的数据

返回:

合成生成的写作任务提示的列表

class nemo_curator.synthetic.NemotronFormatter#

static format_conversation(conv: List[dict]) → str#

格式化用户和助手之间的对话，使用此处描述的 Nemotron 340B 格式： https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct :param conv: 用户和助手之间的对话

返回:: 格式化为文本的对话

class nemo_curator.synthetic.Mixtral8x7BFormatter#

static format_conversation( conv: List[dict], ) → str#

格式化用户和助手之间的对话，使用此处描述的 Mixtral-8x7B 格式： https://hugging-face.cn/mistralai/Mixtral-8x7B-Instruct-v0.1 :param conv: 用户和助手之间的对话

返回:: 格式化为文本的对话

class nemo_curator.synthetic.NoFormat#

class nemo_curator.synthetic.YamlConversionError(message)#