NV-Ingest Command Line (CLI)

After installing the Python dependencies, you can use the nv-ingest-cli tool.

nv-ingest-cli --help
Usage: nv-ingest-cli [OPTIONS]

Options:
  --batch_size INTEGER            Batch size (must be >= 1).  [default: 10]
  --doc PATH                      Add a new document to be processed (supports
                                  multiple).
  --dataset PATH                  Path to a dataset definition file.
  --client [REST|REDIS|KAFKA]     Client type.  [default: REDIS]
  --client_host TEXT              DNS name or URL for the endpoint.
  --client_port INTEGER           Port for the client endpoint.
  --client_kwargs TEXT            Additional arguments to pass to the client.
  --concurrency_n INTEGER         Number of inflight jobs to maintain at one
                                  time.  [default: 10]
  --document_processing_timeout INTEGER
                                  Timeout when waiting for a document to be
                                  processed.  [default: 10]
  --dry_run                       Perform a dry run without executing actions.
  --output_directory PATH         Output directory for results.
  --log_level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Log level.  [default: INFO]
  --shuffle_dataset               Shuffle the dataset before processing.
                                  [default: True]
  --task TEXT                     Task definitions in JSON format, allowing multiple tasks to be configured by repeating this option.
                                  Each task must be specified with its type and corresponding options in the '[task_id]:{json_options}' format.

                                  Example:
                                    --task 'split:{"split_by":"page", "split_length":10}'
                                    --task 'extract:{"document_type":"pdf", "extract_text":true}'
                                    --task 'extract:{"document_type":"pdf", "extract_method":"doughnut"}'
                                    --task 'extract:{"document_type":"pdf", "extract_method":"unstructured_io"}'
                                    --task 'extract:{"document_type":"docx", "extract_text":true, "extract_images":true}'
                                    --task 'store:{"content_type":"image", "store_method":"minio", "endpoint":"minio:9000"}'
                                    --task 'store:{"content_type":"image", "store_method":"minio", "endpoint":"minio:9000", "text_depth": "page"}'
                                    --task 'caption:{}'

                                  Tasks and Options:
                                  - split: Divides documents according to specified criteria.
                                      Options:
                                      - split_by (str): Criteria ('page', 'size', 'word', 'sentence'). No default.
                                      - split_length (int): Segment length. No default.
                                      - split_overlap (int): Segment overlap. No default.
                                      - max_character_length (int): Maximum segment character count. No default.
                                      - sentence_window_size (int): Sentence window size. No default.

                                  - extract: Extracts content from documents, customizable per document type.
                                      Can be specified multiple times for different 'document_type' values.
                                      Options:
                                      - document_type (str): Document format ('pdf', 'docx', 'pptx', 'html', 'xml', 'excel', 'csv', 'parquet'). Required.
                                      - text_depth (str): Depth at which text parsing occurs ('document', 'page'), additional text_depths are partially supported and depend on the specified extraction method ('block', 'line', 'span')
                                      - extract_method (str): Extraction technique. Defaults are smartly chosen based on 'document_type'.
                                      - extract_text (bool): Enables text extraction. Default: False.
                                      - extract_images (bool): Enables image extraction. Default: False.
                                      - extract_tables (bool): Enables table extraction. Default: False.

                                  - store: Stores any images extracted from documents.
                                      Options:
                                      - structured (bool):  Flag to write extracted charts and tables to object store. Default: True.
                                      - images (bool): Flag to write extracted images to object store. Default: False.
                                      - store_method (str): Storage type ('minio', ). Required.

                                  - caption: Attempts to extract captions for images extracted from documents. Note: this is not generative, but rather a
                                      simple extraction.
                                      Options:
                                        N/A

                                  - dedup: Identifies and optionally filters duplicate images in extraction.
                                      Options:
                                        - content_type (str): Content type to deduplicate ('image')
                                        - filter (bool): When set to True, duplicates will be filtered, otherwise, an info message will be added.

                                  - filter: Identifies and optionally filters images above or below scale thresholds.
                                      Options:
                                        - content_type (str): Content type to deduplicate ('image')
                                        - min_size: (Union[float, int]): Minimum allowable size of extracted image.
                                        - max_aspect_ratio: (Union[float, int]): Maximum allowable aspect ratio of extracted image.
                                        - min_aspect_ratio: (Union[float, int]): Minimum allowable aspect ratio of extracted image.
                                        - filter (bool): When set to True, duplicates will be filtered, otherwise, an info message will be added.

                                  Note: The 'extract_method' automatically selects the optimal method based on 'document_type' if not explicitly stated.
  --version                       Show version.
  --help                          Show this message and exit.
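
The '[task_id]:{json_options}' format described in the help text can also be produced programmatically, which avoids hand-quoting JSON on the shell. The following is a minimal Python sketch; the helper name format_task is hypothetical and is not part of nv-ingest-cli.

```python
import json
import shlex


def format_task(task_id: str, options: dict) -> str:
    """Return a task definition in the '[task_id]:{json_options}' form."""
    # json.dumps emits valid JSON (double quotes, lowercase true/false).
    return f"{task_id}:{json.dumps(options)}"


tasks = [
    format_task("split", {"split_by": "page", "split_length": 10}),
    format_task("extract", {"document_type": "pdf", "extract_text": True}),
]

# shlex.quote wraps each value so it survives shell parsing unchanged.
for task in tasks:
    print("--task", shlex.quote(task))
```

Generating the JSON with json.dumps rather than by hand guards against common mistakes such as single-quoted keys or Python-style True/False literals.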

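Examples of Submitting Documents to the nv-ingest-ms-runtime Service

Each of the following commands can be run from the host machine or from within the nv-ingest-ms-runtime container.

Host: nv-ingest-cli ...
Container: nv-ingest-cli ...

Submit a simple text file with no splitting.

Note: You will receive a response containing a single document, which is the entire text file. This is mostly a no-op, but the returned data will be wrapped in the appropriate metadata structure.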

nv-ingest-cli \
  --doc ./data/test.pdf \
  --client_host=localhost \
  --client_port=7670

Submit a PDF file with only a splitting task.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

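Submit a PDF file with splitting and extraction tasks.

Note: (TODO) This currently works only for pdfium, doughnut, and Unstructured.io. haystack, Adobe, and LlamaParse have existing workflows but have not yet been fully converted to use our unified metadata schema.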

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
  --task='split' \
  --client_host=localhost \
  --client_port=7670
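
When commands like these are driven from a script, the arguments can be assembled as a list so that no shell quoting is needed. The sketch below assumes nv-ingest-cli is on the PATH and that the service listens on localhost:7670 as in the examples here; build_command is a hypothetical helper, and the subprocess call is left commented out so the block runs without a live service.

```python
import json
import subprocess


def build_command(doc, tasks, host="localhost", port=7670, output_dir=None):
    """Assemble an nv-ingest-cli argument list using flags from the help text."""
    cmd = ["nv-ingest-cli", "--doc", doc]
    if output_dir is not None:
        cmd += ["--output_directory", output_dir]
    for task_id, options in tasks:
        # Each task is passed as its own --task option.
        cmd += ["--task", f"{task_id}:{json.dumps(options)}"]
    cmd += ["--client_host", host, "--client_port", str(port)]
    return cmd


cmd = build_command(
    "./data/test.pdf",
    [("extract", {"document_type": "pdf", "extract_method": "pdfium"})],
    output_dir="./processed_docs",
)
print(cmd)

# Uncomment to submit the job (requires a running nv-ingest-ms-runtime service):
# subprocess.run(cmd, check=True)
```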

Submit a dataset for processing.

nv-ingest-cli \
  --dataset dataset.json \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Submit a PDF file with an extraction task and upload the extracted images to MinIO.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --task='store:{"endpoint":"minio:9000","access_key":"minioadmin","secret_key":"minioadmin"}' \
  --client_host=localhost \
  --client_port=7670

Command-Line Dataset Creation with Enumeration and Sampling

gen_dataset.py

python ./src/util/gen_dataset.py --source_directory=./data --size=1GB --sample pdf=60 --sample txt=40 --output_file \
  dataset.json --validate-output

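This script samples files from a specified source directory according to defined proportions and a total size target. It offers options for caching the file list, writing out the sampled file list, and validating the output.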
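
To make the sampling idea concrete, here is a minimal Python sketch of size-targeted, proportion-based sampling with replacement. It is an illustration under stated assumptions, not the code in gen_dataset.py: the function name sample_by_size, the (path, size) input format, and the stopping rule are all hypothetical.

```python
import random


def sample_by_size(files_by_type, proportions, total_bytes, seed=0):
    """Sample files per type until each type's byte budget is reached.

    files_by_type: {"pdf": [(path, size_in_bytes), ...], ...}
    proportions:   {"pdf": 60, "txt": 40}  (percent of total_bytes)
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sampled = []
    for file_type, percent in proportions.items():
        # Skip empty files so the loop below always makes progress.
        candidates = [f for f in files_by_type.get(file_type, []) if f[1] > 0]
        if not candidates:
            continue  # nothing to sample for this type
        budget = total_bytes * percent / 100
        used = 0
        while used < budget:
            # Sampling with replacement: the same file may be picked again.
            path, size = rng.choice(candidates)
            sampled.append(path)
            used += size
    return sampled


files = {"pdf": [("a.pdf", 400), ("b.pdf", 200)], "txt": [("c.txt", 100)]}
picked = sample_by_size(files, {"pdf": 60, "txt": 40}, total_bytes=1000)
print(picked)
```

In this sketch each type may overshoot its budget by at most one file, since the loop stops only after the budget has been met.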

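Options

--source_directory: Specifies the path to the source directory that will be scanned for files to sample.
Type: String
Required: Yes
Example: --source_directory ./data
--size: Defines the total size of the files to sample. You can use the suffixes KB, MB, and GB.
Type: String
Required: Yes
Example: --size 500MB
--sample: Specifies a file type and its proportion of the total size. Can be used multiple times for different file types.
Type: String
Required: No
Multiple: Yes
Example: --sample pdf=40 --sample txt=60
--cache_file: If provided, caches the scanned file list as a JSON file at this path.
Type: String
Required: No
Example: --cache_file ./file_list_cache.json
--output_file: If provided, writes the sampled file list as a JSON file at this path.
Type: String
Required: No
Example: --output_file ./sampled_files.json
--validate-output: If set, the script re-validates the output_file JSON file and logs the total bytes for each file type.
Type: Flag
Required: No
--log-level: Sets the logging level ('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'). Defaults to 'INFO'.
Type: Choice
Required: No
Example: --log-level DEBUG
--with-replacement: Samples with replacement, so a file can be selected more than once.
Type: Flag
Default: True (if omitted, sampling is done with replacement)
Usage example: Pass --with-replacement to enable sampling with replacement, or omit it to use the default behavior. Use --no-with-replacement to disable it and sample without replacement.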
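
The --size and --sample values are plain strings, so any code consuming them has to convert them first. The sketch below shows one way to do that; parse_size and parse_sample are hypothetical helpers, and treating KB, MB, and GB as powers of 1024 is an assumption (the script may use powers of 1000 instead).

```python
import re

# Assumption: binary multiples. Swap in 1000-based values if needed.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as '500MB' or '1GB' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def parse_sample(specs):
    """Convert ['pdf=60', 'txt=40'] into {'pdf': 60.0, 'txt': 40.0}."""
    proportions = {}
    for spec in specs:
        file_type, _, value = spec.partition("=")
        proportions[file_type.strip()] = float(value)
    return proportions


print(parse_size("1GB"))
print(parse_sample(["pdf=60", "txt=40"]))
```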

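The script performs a sampling process that respects the specified size and type proportions, generates a detailed file list, and provides options for caching and validation to support efficient data handling and integrity checks.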

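Command-line interface for the Image Viewer application, which displays images from a JSON file in a paginated view. Each image is resized for uniform display, and users can browse the images with the "Next" and "Previous" buttons.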

image_viewer.py

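--file_path: Specifies the path to the JSON file containing the images. The JSON file should contain a list of objects, each with an "image" field holding a base64-encoded string of the image data.
Type: String
Required: Yes

Usage example: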

--file_path "/path/to/your/images.json"
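
For reference, a JSON file in the shape described above can be written and read back with the Python standard library alone. This is a sketch: the file name images.json and the placeholder bytes are made up, and a real file would hold encoded image data (for example PNG bytes) rather than a short byte string.

```python
import base64
import json
import os
import tempfile

# Placeholder payload standing in for real image bytes (hypothetical).
raw_bytes = b"\x89PNG\r\n\x1a\n-not-a-real-image-"

# A list of objects, each with a base64-encoded "image" field.
records = [{"image": base64.b64encode(raw_bytes).decode("ascii")}]

path = os.path.join(tempfile.mkdtemp(), "images.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(records, handle)

# Reading the file back and decoding recovers the original bytes.
with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)
decoded = base64.b64decode(loaded[0]["image"])
print(len(loaded), decoded == raw_bytes)
```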