Sensitive Information Detection with Natural Language Processing (NLP) Example - NVIDIA Docs

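This example illustrates how to use Morpheus to automatically detect Sensitive Information (SI) in network packets by utilizing a Natural Language Processing (NLP) neural network and Triton Inference Server.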

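Supported Environments

Environment	Supported	Notes
Conda	✔
Morpheus Docker Container	✔	Requires launching Triton on the host
Morpheus Release Container	✔	Requires launching Triton on the host
Dev Container	✔	Requires using the `dev-triton-start` script and replacing `--server_url=localhost:8000` with `--server_url=triton:8000`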

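Background

The goal of this example is to identify potentially sensitive information in network packet data as quickly as possible, in order to limit exposure and take corrective action. Sensitive information is a broad term, but it can be generalized to any data that should be guarded from unauthorized access. Credit card numbers, passwords, authorization keys, and emails are all examples of sensitive information.

In this example, we will use the NLP SI detection model provided by Morpheus. This model can detect 10 different categories of sensitive information: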

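Address
Bank Account Number
Credit Card Number
Email Address
Government ID Number
Personal Name
Password
Phone Number
Secret Key (a.k.a. Private Key)
User ID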

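The Dataset

The dataset that this workflow was designed to process is PCAP, or packet capture, data that has been serialized into JSON format. Several different applications are capable of capturing this type of network traffic. Each packet contains information about its source, destination, timestamp, and body, among other things. For example, below is a single packet from an HTTP POST request to cumulusnetworks.com: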

{
  "timestamp": 1616380971990,
  "host_ip": "10.188.40.56",
  "data_len": "309",
  "data": "POST /simpledatagen/ HTTP/1.1\r\nHost: echo.gtc1.netqdev.cumulusnetworks.com\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 73\r\nContent-Type: application/json\r\n\r\n",
  "src_mac": "04:3f:72:bf:af:74",
  "dest_mac": "b4:a9:fc:3c:46:f8",
  "protocol": "6",
  "src_ip": "10.20.16.248",
  "dest_ip": "10.244.0.59",
  "src_port": "50410",
  "dest_port": "80",
  "flags": "24"
}

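In this example, we will use a simulated PCAP dataset that is known to contain SI from each of the 10 categories the model was trained on. The dataset is located at examples/data/pcap_dump.jsonlines. It is in the .jsonlines format, which means each line holds one JSON object. To parse this data, it must be ingested, split by line into individual JSON objects, and then parsed. Morpheus handles all of this.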
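The line-by-line parsing described above can be sketched in plain Python. This is only an illustration of the .jsonlines format, not how Morpheus implements its file source; the helper name `iter_jsonlines` and the two sample records are made up for the example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonlines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Two made-up records standing in for lines of pcap_dump.jsonlines.
sample = io.StringIO(
    '{"timestamp": 1616380971990, "src_ip": "10.20.16.248", "dest_port": "80"}\n'
    '{"timestamp": 1616380971991, "src_ip": "10.20.16.248", "dest_port": "80"}\n'
)
packets = list(iter_jsonlines(sample))
print(len(packets), packets[0]["src_ip"])  # prints: 2 10.20.16.248
```

To try this on the real dataset, pass an open file handle for examples/data/pcap_dump.jsonlines instead of the `io.StringIO` sample.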

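Pipeline Architecture

The pipeline we will use in this example is a simple feed-forward linear pipeline, where the data from each stage flows on to the next. Simple linear pipelines with no custom stages, like this example, can be configured via the Morpheus CLI or using the Python library. In this example we will use the Morpheus CLI.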
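As a conceptual illustration of a feed-forward linear pipeline, the sketch below chains plain Python functions so that each stage's output becomes the next stage's input. It does not use the Morpheus API; `run_linear_pipeline` and the three stand-in stages are hypothetical names chosen for this example.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List

Stage = Callable[[Any], Any]


def run_linear_pipeline(stages: Iterable[Stage], item: Any) -> Any:
    """Pass `item` through each stage in order; each output is the next input."""
    return reduce(lambda value, stage: stage(value), stages, item)


# Hypothetical stand-in stages; the real stages are configured via the CLI.
def deserialize(line: str) -> dict:
    return {"data": line}


def preprocess(msg: dict) -> dict:
    # Crude whitespace tokenization, only to have something to pass along.
    return {**msg, "tokens": msg["data"].lower().split()}


def count_tokens(msg: dict) -> dict:
    return {**msg, "n_tokens": len(msg["tokens"])}


stages: List[Stage] = [deserialize, preprocess, count_tokens]
result = run_linear_pipeline(stages, "POST /simpledatagen/ HTTP/1.1")
print(result["n_tokens"])  # prints: 3
```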

Below is a visualization of the pipeline showing all stages and data types as the data flows from one stage to the next.

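Setup

This example utilizes the Triton Inference Server to perform inference. The neural network model is provided in the models/sid-models directory of the Morpheus repo.

Launching Triton

From the Morpheus repo root directory, run the following command to launch Triton and load the sid-minibert model: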

docker run --rm -ti --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 \
   nvcr.io/nvidia/morpheus/morpheus-tritonserver-models:24.10 \
   tritonserver --model-repository=/models/triton-model-repo \
   --exit-on-error=false --model-control-mode=explicit \
   --load-model sid-minibert-onnx

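This will launch Triton and load only the sid-minibert-onnx model. This model has been configured with a max batch size of 32 and uses dynamic batching for increased performance.

Once Triton has loaded the model, the following should be displayed: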

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| sid-minibert-onnx | 1       | READY  |
+-------------------+---------+--------+

Note: If this is not present in the output, check the Triton log for any error messages related to loading the model.
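Besides reading the console table, readiness can also be checked programmatically. The sketch below queries the model-readiness endpoint of Triton's HTTP API (`/v2/models/<name>/ready`) using only the Python standard library. The helper names `model_ready_url` and `is_model_ready` are illustrative, and the address assumes the port mapping from the docker command above.

```python
import urllib.error
import urllib.request


def model_ready_url(server_url: str, model_name: str) -> str:
    """Build the model-readiness URL for a Triton HTTP endpoint."""
    return f"http://{server_url}/v2/models/{model_name}/ready"


def is_model_ready(server_url: str, model_name: str, timeout: float = 2.0) -> bool:
    """Return True if Triton answers HTTP 200 for the model's readiness check."""
    url = model_ready_url(server_url, model_name)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server unreachable, request timed out, or the model is not ready.
        return False


if __name__ == "__main__":
    print(is_model_ready("localhost:8000", "sid-minibert-onnx"))
```

In the Dev container, the server address would be `triton:8000` rather than `localhost:8000`, matching the note in the supported environments table.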

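Running the Pipeline

With the Morpheus CLI, an entire pipeline can be configured and run without writing any code. Using the morpheus run pipeline-nlp command, we build the pipeline by specifying each stage's name and configuration on the command line. The output of each stage becomes the input of the next.

The following is the complete command to build and launch the pipeline. Each new line represents a new stage. The comment above each stage explains why the stage was added and why it is configured this way.

From the Morpheus repo root directory, run: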

# Launch Morpheus printing debug messages
morpheus --log_level=DEBUG \
   `# Run a pipeline with a model batch size of 32 (Must match Triton config)` \
   run --pipeline_batch_size=1024 --model_max_batch_size=32 \
   `# Specify a NLP pipeline with 256 sequence length (Must match Triton config)` \
   pipeline-nlp --model_seq_length=256 \
   `# 1st Stage: Read from file` \
   from-file --filename=examples/data/pcap_dump.jsonlines \
   `# 2nd Stage: Deserialize batch DataFrame into ControlMessages` \
   deserialize \
   `# 3rd Stage: Preprocessing converts the input data into BERT tokens` \
   preprocess --vocab_hash_file=data/bert-base-uncased-hash.txt --do_lower_case=True --truncation=True \
   `# 4th Stage: Send messages to Triton for inference. Specify the model loaded in Setup` \
   inf-triton --model_name=sid-minibert-onnx --server_url=localhost:8000 --force_convert_inputs=True \
   `# 5th Stage: Monitor stage prints throughput information to the console` \
   monitor --description "Inference Rate" --smoothing=0.001 --unit inf \
   `# 6th Stage: Add results from inference to the messages` \
   add-class \
   `# 7th Stage: Filtering removes any messages that did not detect SI` \
   filter --filter_source=TENSOR \
   `# 8th Stage: Convert from objects back into strings` \
   serialize --exclude '^_ts_' \
   `# 9th Stage: Write out the JSON lines to the detections.jsonlines file` \
   to-file --filename=detections.jsonlines --overwrite

If successful, the following should be displayed:

Configuring Pipeline via CLI
Loaded labels file. Current labels: [['address', 'bank_acct', 'credit_card', 'email', 'govt_id', 'name', 'password', 'phone_num', 'secret_keys', 'user']]
Starting pipeline via CLI... Ctrl+C to Quit
Config:
{
  "ae": null,
  "class_labels": [
    "address",
    "bank_acct",
    "credit_card",
    "email",
    "govt_id",
    "name",
    "password",
    "phone_num",
    "secret_keys",
    "user"
  ],
  "debug": false,
  "edge_buffer_size": 128,
  "feature_length": 256,
  "fil": null,
  "log_config_file": null,
  "log_level": 10,
  "mode": "NLP",
  "model_max_batch_size": 32,
  "num_threads": 8,
  "pipeline_batch_size": 1024
}
CPP Enabled: True
====Registering Pipeline====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Pipeline====
Added source: <from-file-0; FileSourceStage(filename=examples/data/pcap_dump.jsonlines, iterative=False, file_type=FileTypes.Auto, repeat=1, filter_null=True)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-1; DeserializeStage()>
  └─ morpheus.MessageMeta -> morpheus.ControlMessage
Added stage: <preprocess-nlp-2; PreprocessNLPStage(vocab_hash_file=/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/data/bert-base-uncased-hash.txt, truncation=True, do_lower_case=True, add_special_tokens=False, stride=-1)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <inference-3; TritonInferenceStage(model_name=sid-minibert-onnx, server_url=localhost:8000, force_convert_inputs=True, use_shared_memory=False)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <monitor-4; MonitorStage(description=Inference Rate, smoothing=0.001, unit=inf, delayed_start=False, determine_count_fn=None)>
  └─ morpheus.ControlMessage-> morpheus.ControlMessage
Added stage: <add-class-5; AddClassificationsStage(threshold=0.5, labels=[], prefix=)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <serialize-6; SerializeStage(include=[], exclude=['^_ts_'], fixed_columns=True)>
  └─ morpheus.ControlMessage -> morpheus.MessageMeta
Added stage: <to-file-7; WriteToFileStage(filename=detections.jsonlines, overwrite=True, file_type=FileTypes.Auto)>
  └─ morpheus.MessageMeta -> morpheus.MessageMeta
====Building Pipeline Complete!====
Starting! Time: 1656352480.541071
Inference Rate[Complete]: 93085inf [00:07, 12673.63inf/s]
====Pipeline Complete====
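The add-class and filter stages in the command above turn per-label model scores into flags and then drop messages without any detection. The following is a simplified sketch of that idea, not the Morpheus implementation. It assumes one score per label in the order of the class labels printed in the config, uses the threshold of 0.5 shown in the log, and the scores themselves are made up.

```python
from typing import List

LABELS = ["address", "bank_acct", "credit_card", "email", "govt_id",
          "name", "password", "phone_num", "secret_keys", "user"]


def add_class(message: dict, probs: List[float], threshold: float = 0.5) -> dict:
    """Return a copy of `message` with a 1/0 flag per label (score above threshold -> 1)."""
    flags = {label: int(p > threshold) for label, p in zip(LABELS, probs)}
    return {**message, **flags}


def has_detection(message: dict) -> bool:
    """True if at least one label flag is set."""
    return any(message[label] == 1 for label in LABELS)


# Made-up scores for two packets; only the first crosses the threshold (secret_keys).
scored = [
    ({"src_port": "51374"}, [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.3, 0.0, 0.9, 0.1]),
    ({"src_port": "50410"}, [0.0] * 10),
]
kept = [m for m in (add_class(msg, p) for msg, p in scored) if has_detection(m)]
print(len(kept), kept[0]["secret_keys"])  # prints: 1 1
```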

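The output file detections.jsonlines will contain the original PCAP messages with the following additional fields added:

address
bank_acct
credit_card
email
govt_id
name
password
phone_num
secret_keys
user

The value of each of these fields will be 1, indicating a detection, or 0, indicating no detection. An example row with a detection is: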

{
  "timestamp": 1616381019580,
  "host_ip": "10.188.40.56",
  "data_len": "129",
  "data": "\"{\\\"X-Postmark-Server-Token\\\": \\\"76904958 O7FWqd9p TzIBfSYk\\\"}\"",
  "src_mac": "04:3f:72:bf:af:74",
  "dest_mac": "b4:a9:fc:3c:46:f8",
  "protocol": "6",
  "src_ip": "10.20.16.248",
  "dest_ip": "10.244.0.60",
  "src_port": "51374",
  "dest_port": "80",
  "flags": "24",
  "is_pii": false,
  "address": 0,
  "bank_acct": 0,
  "credit_card": 0,
  "email": 0,
  "govt_id": 0,
  "name": 0,
  "password": 0,
  "phone_num": 0,
  "secret_keys": 1,
  "user": 0
}
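To get a quick overview of the results, the per-label flags can be tallied across the output file. The sketch below is a minimal example using only the standard library; the function name `count_detections` and the two sample rows are illustrative, while the label names are taken from the output shown above.

```python
import json
from collections import Counter
from typing import Iterable

LABELS = ["address", "bank_acct", "credit_card", "email", "govt_id",
          "name", "password", "phone_num", "secret_keys", "user"]


def count_detections(lines: Iterable[str]) -> Counter:
    """Count, per label, how many JSON Lines records have that label flagged."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for label in LABELS:
            if record.get(label):  # 1 (or true) means a detection
                counts[label] += 1
    return counts


# Made-up rows standing in for lines of detections.jsonlines.
sample = [
    '{"src_port": "51374", "secret_keys": 1, "password": 0}',
    '{"src_port": "51380", "secret_keys": 1, "password": 1}',
]
totals = count_detections(sample)
print(totals["secret_keys"], totals["password"])  # prints: 2 1
```

For the real output, pass an open file handle instead of the sample list, for example `with open("detections.jsonlines") as f: count_detections(f)`.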