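Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.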

Classifiers#

class nemo_curator.classifiers.DomainClassifier( filter_by: List[str] | None = None, batch_size: int = 256, text_field: str = 'text', pred_column: str = 'domain_pred', prob_column: str | None = None, max_chars: int = 2000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

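DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (http://hugging-face.cn/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.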

filter_by#

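The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type: list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The field in the dataset that should be classified.

Type: str

pred_column#

The column name where predictions will be stored. Defaults to "domain_pred".

Type: str

prob_column#

The column name where prediction probabilities will be stored. Defaults to None.

Type: str, optional

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional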
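
As a rough illustration of how `max_chars`, `filter_by`, `text_field`, and `pred_column` interact, the sketch below truncates each document before classification and then keeps only rows whose predicted label is in the filter list. It is plain Python over a list of dicts, not the library's implementation; `classify_and_filter`, `toy_predict`, and the label names are hypothetical stand-ins for the real model and pipeline.

```python
from typing import Callable, Dict, List, Optional


def classify_and_filter(
    rows: List[Dict[str, str]],
    predict: Callable[[str], str],
    filter_by: Optional[List[str]] = None,
    text_field: str = "text",
    pred_column: str = "domain_pred",
    max_chars: int = 2000,
) -> List[Dict[str, str]]:
    """Mimic the documented parameter semantics on an in-memory list of rows."""
    output = []
    for row in rows:
        # Only the first max_chars characters are shown to the model.
        label = predict(row[text_field][:max_chars])
        if filter_by is None or label in filter_by:
            # Copy the row and store the prediction under pred_column.
            output.append({**row, pred_column: label})
    return output


def toy_predict(text: str) -> str:
    # Hypothetical stand-in for the model: a keyword rule, not real inference.
    return "Sports" if "match" in text else "Finance"


rows = [{"text": "The match ended 2-1."}, {"text": "Markets rallied today."}]
kept = classify_and_filter(rows, toy_predict, filter_by=["Sports"])
print(kept)
```

With `filter_by=None` both rows would be returned, each with its predicted label attached.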

class nemo_curator.classifiers.MultilingualDomainClassifier( filter_by: List[str] | None = None, batch_size: int = 256, text_field: str = 'text', pred_column: str = 'domain_pred', prob_column: str | None = None, max_chars: int = 2000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

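MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (http://hugging-face.cn/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.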

filter_by#

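The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type: list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The field in the dataset that should be classified.

Type: str

pred_column#

The column name where predictions will be stored. Defaults to "domain_pred".

Type: str

prob_column#

The column name where prediction probabilities will be stored. Defaults to None.

Type: str, optional

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional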
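
The `max_mem_gb` default rule can be written out as a small function. This is a minimal sketch of the documented behavior only; `resolve_max_mem_gb` is a hypothetical helper, and the available memory is passed in as a plain number rather than queried from the GPU.

```python
from typing import Optional


def resolve_max_mem_gb(max_mem_gb: Optional[int], available_gpu_mem_gb: float) -> float:
    """Return the memory budget in GB following the documented default rule."""
    if max_mem_gb is not None:
        # An explicit value is used as given.
        return max_mem_gb
    # Documented fallback: available GPU memory minus 4 GB of headroom.
    return available_gpu_mem_gb - 4


print(resolve_max_mem_gb(None, 80))  # 76
print(resolve_max_mem_gb(16, 80))    # 16
```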

class nemo_curator.classifiers.QualityClassifier( filter_by: List[str] | None = None, batch_size: int = 256, text_field: str = 'text', pred_column: str = 'quality_pred', prob_column: str = 'quality_prob', max_chars: int = 6000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

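QualityClassifier is a specialized classifier designed for quality assessment tasks, utilizing the NemoCurator Quality Classifier DeBERTa model (http://hugging-face.cn/nvidia/quality-classifier-deberta). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.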

filter_by#

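The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type: list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The field in the dataset that should be classified.

Type: str

pred_column#

The column name where predictions will be stored. Defaults to "quality_pred".

Type: str

prob_column#

The column name where prediction probabilities will be stored. Defaults to "quality_prob".

Type: str

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 6000.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional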

class nemo_curator.classifiers.FineWebEduClassifier( batch_size: int = 256, text_field: str = 'text', pred_column: str = 'fineweb-edu-score', int_column='fineweb-edu-score-int', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

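FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (http://hugging-face.cn/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.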

batch_size#

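The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The column name containing the text data to be classified. Defaults to "text".

Type: str

pred_column#

The column name where prediction scores will be stored. Defaults to "fineweb-edu-score".

Type: str

int_column#

The column name where prediction scores, rounded to integers, will be stored. Defaults to "fineweb-edu-score-int".

Type: str

max_chars#

The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional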
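
The relationship between `pred_column` and `int_column` can be sketched as follows. The documentation only states that the integer column holds the rounded score; clamping to the 0 to 5 range first is an assumption borrowed from common FineWeb-Edu usage, and `to_int_score` is a hypothetical helper, not a library function.

```python
def to_int_score(score: float, low: int = 0, high: int = 5) -> int:
    """Clamp a raw score to [low, high] (assumed range) and round it to an integer."""
    clamped = max(low, min(score, high))
    # Note: Python's round() sends exact halves to the nearest even integer.
    return int(round(clamped))


raw_scores = [-0.3, 1.2, 3.7, 6.1]
row_values = [
    {"fineweb-edu-score": s, "fineweb-edu-score-int": to_int_score(s)}
    for s in raw_scores
]
print([r["fineweb-edu-score-int"] for r in row_values])  # [0, 1, 4, 5]
```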

class nemo_curator.classifiers.FineWebMixtralEduClassifier( batch_size: int = 1024, text_field: str = 'text', pred_column: str = 'fineweb-mixtral-edu-score', int_column: str = 'fineweb-mixtral-edu-score-int', quality_label_column: str = 'fineweb-mixtral-edu-score-label', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

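FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (http://hugging-face.cn/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.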

batch_size#

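The number of samples per batch for inference. Defaults to 1024.

Type: int

text_field#

The column name containing the text data to be classified. Defaults to "text".

Type: str

pred_column#

The column name where prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score".

Type: str

int_column#

The column name where prediction scores, rounded to integers, will be stored. Defaults to "fineweb-mixtral-edu-score-int".

Type: str

quality_label_column#

The column name where the quality label will be stored: scores >= 2.5 are labeled "high_quality", and all other scores are labeled "low_quality". Defaults to "fineweb-mixtral-edu-score-label".

Type: str

max_chars#

The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional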
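
The labeling rule for `quality_label_column` is a simple threshold on the raw score. The sketch below restates it in plain Python; `quality_label` is a hypothetical helper written for illustration, not part of the library.

```python
def quality_label(score: float, threshold: float = 2.5) -> str:
    """Map a raw education score to the documented quality label."""
    # Scores at or above the threshold count as high quality.
    return "high_quality" if score >= threshold else "low_quality"


scores = [0.8, 2.5, 4.1]
labels = [quality_label(s) for s in scores]
print(labels)  # ['low_quality', 'high_quality', 'high_quality']
```

The same threshold is documented for FineWebNemotronEduClassifier, so the helper applies to both label columns.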

class nemo_curator.classifiers.FineWebNemotronEduClassifier( batch_size: int = 1024, text_field: str = 'text', pred_column: str = 'fineweb-nemotron-edu-score', int_column: str = 'fineweb-nemotron-edu-score-int', quality_label_column: str = 'fineweb-nemotron-edu-score-label', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

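FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (http://hugging-face.cn/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.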

batch_size#

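The number of samples per batch for inference. Defaults to 1024.

Type: int

text_field#

The column name containing the text data to be classified. Defaults to "text".

Type: str

pred_column#

The column name where prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score".

Type: str

int_column#

The column name where prediction scores, rounded to integers, will be stored. Defaults to "fineweb-nemotron-edu-score-int".

Type: str

quality_label_column#

The column name where the quality label will be stored: scores >= 2.5 are labeled "high_quality", and all other scores are labeled "low_quality". Defaults to "fineweb-nemotron-edu-score-label".

Type: str

max_chars#

The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional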

class nemo_curator.classifiers.AegisClassifier( aegis_variant: str = 'nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0', token: str | bool | None = None, filter_by: List[str] | None = None, batch_size: int = 64, text_field: str = 'text', pred_column: str = 'aegis_pred', raw_pred_column: str = '_aegis_raw_pred', keep_raw_pred: bool = False, max_chars: int = 6000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

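NVIDIA's AEGIS safety classifier is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of Llama Guard, based on Llama2-7B and trained on Nvidia's content safety dataset, the Aegis Content Safety Dataset, which covers Nvidia's broad taxonomy of 13 critical safety risk categories. See the paper for more information: https://arxiv.org/abs/2404.05993

To use this AEGIS classifier, users must first get access to Llama Guard on HuggingFace here: http://hugging-face.cn/meta-llama/LlamaGuard-7b. Afterwards, they should set up a user access token and pass that token into the constructor of this classifier.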
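
One way to keep the access token out of source code is to read it from an environment variable and assemble the constructor arguments from it. The sketch below assumes the token is stored in `HF_TOKEN` (an assumption; any variable name works) and uses a hypothetical helper, `build_aegis_kwargs`. The keyword names and default values are taken from the signature above.

```python
import os
from typing import Any, Dict, Optional


def build_aegis_kwargs(env_var: str = "HF_TOKEN") -> Dict[str, Any]:
    """Collect constructor keyword arguments, reading the token from the environment."""
    token: Optional[str] = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of at model download time.
        raise RuntimeError(f"Set {env_var} to a Hugging Face user access token first.")
    return {
        "aegis_variant": "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
        "token": token,
        "text_field": "text",
        "pred_column": "aegis_pred",
    }
```

The resulting dictionary is intended to be unpacked into the constructor, for example `AegisClassifier(**build_aegis_kwargs())`.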

class nemo_curator.classifiers.InstructionDataGuardClassifier( token: str | bool | None = None, batch_size: int = 64, text_field: str = 'text', pred_column: str = 'is_poisoned', prob_column: str = 'instruction_data_guard_poisoning_score', max_chars: int = 6000, autocast: bool = True, device_type: str = 'cuda', max_mem_gb: int | None = None, )#

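Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks. These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used. For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain "secret" prompts are given.

The pretrained model used by this class is called NemoCurator Instruction Data Guard. It can be found on Hugging Face here: http://hugging-face.cn/nvidia/instruction-data-guard.

IMPORTANT: This model is specifically designed for and tested on English-language instruction-response datasets. Performance on non-English content has not been validated.

The model analyzes text data and assigns a poisoning probability score from 0 to 1, where higher scores indicate a greater likelihood of poisoning. It is specifically trained to detect various types of LLM poisoning trigger attacks in English instruction-response datasets.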

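Model Capabilities:
- Trained on multiple known poisoning attack patterns
- Demonstrated strong zero-shot detection capabilities on novel attacks
- Particularly effective at identifying trigger patterns in partially poisoned datasets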

Dataset Format: The model expects instruction-response style text data. For example: "Instruction: {instruction}. Input: {input_}. Response: {response}."
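
Records stored as separate fields can be rendered into that layout before classification. This is a minimal sketch; `format_sample` and the sample record are hypothetical and only follow the template quoted above.

```python
def format_sample(instruction: str, input_: str, response: str) -> str:
    """Render one record in the instruction-response layout quoted above."""
    return f"Instruction: {instruction}. Input: {input_}. Response: {response}."


record = {
    "instruction": "Translate to French",
    "input_": "Good morning",
    "response": "Bonjour",
}
text = format_sample(**record)
print(text)  # Instruction: Translate to French. Input: Good morning. Response: Bonjour.
```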

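Usage Recommendations:
1. Apply to English instruction-response datasets
2. Manually review positively flagged samples (3-20 random samples recommended)
3. Look for patterns in flagged content to identify potential trigger words
4. Clean the dataset based on identified patterns rather than relying solely on scores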
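
The manual review step can be supported by drawing a small random sample of flagged rows. The sketch below works on a plain list of dicts and assumes the flag is stored as a boolean under the default `is_poisoned` column; `sample_flagged_for_review` is a hypothetical helper.

```python
import random
from typing import Any, Dict, List


def sample_flagged_for_review(
    rows: List[Dict[str, Any]],
    pred_column: str = "is_poisoned",
    max_samples: int = 20,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Pick up to max_samples positively flagged rows at random for manual review."""
    flagged = [row for row in rows if row[pred_column]]
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    return rng.sample(flagged, min(max_samples, len(flagged)))


# Toy data: every third row is flagged, giving 30 flagged rows out of 90.
rows = [{"text": f"sample {i}", "is_poisoned": i % 3 == 0} for i in range(90)]
review_set = sample_flagged_for_review(rows)
print(len(review_set))  # 20
```

Reading the sampled texts side by side makes recurring phrases easier to spot than inspecting scores alone.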

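Note: False positives are expected. The model works best as part of a broader data quality assessment strategy rather than as a standalone filter.

Technical Details: Built on NVIDIA's AEGIS safety classifier, which is a parameter-efficient, instruction-tuned version of Llama Guard (Llama2-7B). Access to the base Llama Guard model on HuggingFace (http://hugging-face.cn/meta-llama/LlamaGuard-7b) is required via a user access token.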

class nemo_curator.classifiers.ContentTypeClassifier( filter_by: List[str] | None = None, batch_size: int = 256, text_field: str = 'text', pred_column: str = 'content_pred', prob_column: str | None = None, max_chars: int = 5000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

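ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. The pretrained model used by this class is called NemoCurator Content Type Classifier DeBERTa. It can be found on Hugging Face here: http://hugging-face.cn/nvidia/content-type-classifier-deberta. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.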

filter_by#

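The classes to filter the dataset by. If None, all classes will be included. Defaults to None.

Type: list[str], optional

batch_size#

The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The field in the dataset that should be classified.

Type: str

pred_column#

The column name where predictions will be stored. Defaults to "content_pred".

Type: str

prob_column#

The column name where prediction probabilities will be stored. Defaults to None.

Type: str, optional

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 5000.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional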

class nemo_curator.classifiers.PromptTaskComplexityClassifier( batch_size: int = 256, text_field: str = 'text', max_chars: int = 2000, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None, )#

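PromptTaskComplexityClassifier is a multi-headed model that classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions, which are ensembled to create an overall complexity score. Further information on the taxonomies can be found on the NemoCurator Prompt Task and Complexity Hugging Face page: http://hugging-face.cn/nvidia/prompt-task-and-complexity-classifier. This class is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.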

batch_size#

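The number of samples per batch for inference. Defaults to 256.

Type: int

text_field#

The field in the dataset that should be classified.

Type: str

max_chars#

The maximum number of characters in each document to consider for classification. Defaults to 2000.

Type: int

device_type#

The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".

Type: str

autocast#

Whether to use mixed precision for faster inference. Defaults to True.

Type: bool

max_mem_gb#

The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Type: int, optional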
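
To make the idea of ensembling per-dimension complexity scores concrete, the sketch below combines six scores with a weighted sum. The page does not state the ensembling method, the dimension names, or the weights, so everything here is an assumption: the names `dim_a` through `dim_f` and the equal weights are placeholders, and `ensemble_complexity` is a hypothetical helper rather than the model's actual formula.

```python
from typing import Dict


def ensemble_complexity(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores into one value with a weighted sum."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical dimension names and equal weights, for illustration only.
dims = ["dim_a", "dim_b", "dim_c", "dim_d", "dim_e", "dim_f"]
weights = {name: 1 / 6 for name in dims}
scores = {name: 0.5 for name in dims}
overall = ensemble_complexity(scores, weights)
print(round(overall, 3))  # 0.5
```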