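Quickstart Guide
To get started using NVIDIA-Ingest, you need to do a few things:
Step 1: Starting containers
This example demonstrates how to use the provided docker-compose.yaml to start all needed services with a few commands.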
NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
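If preferred, you can also start services one by one, or run on Kubernetes via our Helm chart. There are also additional environment variables you may want to configure.
- Clone the repository with Git: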
git clone https://github.com/nvidia/nv-ingest
- Change directory to the cloned repository:
cd nv-ingest
- Generate an API key and authenticate with NGC by using the `docker login` command:
# This is required to access pre-built containers and NIM microservices
$ docker login nvcr.io
Username: $oauthtoken
Password: <Your Key>
During the early access (EA) phase, you must apply for early access at [https://developer.nvidia.com/nemo-microservices-early-access/join](https://developer.nvidia.com/nemo-microservices-early-access/join). When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key.
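- Create a .env file that contains your NGC API key and the following paths. For more information, see the environment variable configuration documentation.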
# Container images must access resources from NGC.
NGC_API_KEY=<key to download containers from NGC>
NIM_NGC_API_KEY=<key to download model files after containers start>
NVIDIA_BUILD_API_KEY=<key to use NIMs that are hosted on build.nvidia.com>
DATASET_ROOT=<PATH_TO_THIS_REPO>/data
NV_INGEST_ROOT=<PATH_TO_THIS_REPO>
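Before starting the services, it can help to confirm that the .env file no longer contains template placeholders. The following is a minimal sketch, not part of NV-Ingest: the helper names `parse_env` and `missing_keys` are hypothetical, and treating all five variables as required is an assumption made for illustration.

```python
import re

# Variable names copied from the .env listing above. Treating every one of
# them as required is an assumption of this sketch.
REQUIRED_KEYS = (
    "NGC_API_KEY",
    "NIM_NGC_API_KEY",
    "NVIDIA_BUILD_API_KEY",
    "DATASET_ROOT",
    "NV_INGEST_ROOT",
)

# A value such as <PATH_TO_THIS_REPO>/data still contains a template placeholder.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent, empty, or still a placeholder."""
    return [
        key
        for key in required
        if not values.get(key) or PLACEHOLDER.search(values[key])
    ]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns the names that still need a real value.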
As configured by default in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml#L52), the DePlot NIM is on a dedicated GPU. All other NIMs and the NV-Ingest container itself share a second GPU. This avoids DePlot and other NIMs competing for VRAM on the same device. Change the `CUDA_VISIBLE_DEVICES` pinnings as desired for your system within [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml).
- Before running the docker compose command, make sure NVIDIA is set as the default container runtime by using the following command:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
- Start all services:
docker compose --profile retrieval up
By default, we have [configured log levels to be verbose](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml). It's possible to observe service startup proceeding. You will notice a lot of log messages. Disable verbose logging by configuring `NIM_TRITON_LOG_VERBOSE=0` for each NIM in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml).
- When all services have fully started, `nvidia-smi` should show processes like the following:
# If it's taking > 1m for `nvidia-smi` to return, the bus will likely be busy setting up the models.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1352957 C tritonserver 762MiB |
| 1 N/A N/A 1322081 C /opt/nim/llm/.venv/bin/python3 63916MiB |
| 2 N/A N/A 1355175 C tritonserver 478MiB |
| 2 N/A N/A 1367569 C ...s/python/triton_python_backend_stub 12MiB |
| 3 N/A N/A 1321841 C python 414MiB |
| 3 N/A N/A 1352331 C tritonserver 478MiB |
| 3 N/A N/A 1355929 C ...s/python/triton_python_backend_stub 424MiB |
| 3 N/A N/A 1373202 C tritonserver 414MiB |
+---------------------------------------------------------------------------------------+
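To compare GPU usage against the listing above without reading it by eye, the process rows can be summed per GPU. This is a minimal sketch that parses the text table in the layout shown here; the helper name `gpu_memory_by_device` and the regular expression are illustrative assumptions, and the `nvidia-smi` layout can differ between driver versions.

```python
import re
from collections import defaultdict

# Matches process rows such as:
# | 0 N/A N/A 1352957 C tritonserver 762MiB |
ROW = re.compile(
    r"^\|\s*(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>.+?)\s+(?P<mem>\d+)MiB\s*\|$"
)


def gpu_memory_by_device(table: str) -> dict[int, int]:
    """Sum per-process GPU memory (MiB) for each GPU index in the table."""
    totals = defaultdict(int)
    for line in table.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and separator rows do not match and are skipped
            totals[int(match["gpu"])] += int(match["mem"])
    return dict(totals)
```

Passing the captured output of `nvidia-smi` to `gpu_memory_by_device` gives one total per GPU index, which makes it easier to see whether the DePlot NIM really has a device to itself.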
- Use `docker ps` to observe the started containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f2f86615ea5 nvcr.io/nvidia/nemo-microservices/nv-ingest:24.12 "/opt/conda/bin/tini…" 35 seconds ago Up 33 seconds 0.0.0.0:7670->7670/tcp, :::7670->7670/tcp nv-ingest-nv-ingest-ms-runtime-1
de44122c6ddc otel/opentelemetry-collector-contrib:0.91.0 "/otelcol-contrib --…" 14 hours ago Up 24 seconds 0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, :::8888-8889->8888-8889/tcp, 0.0.0.0:13133->13133/tcp, :::13133->13133/tcp, 55678/tcp, 0.0.0.0:32849->9411/tcp, :::32848->9411/tcp, 0.0.0.0:55680->55679/tcp, :::55680->55679/tcp nv-ingest-otel-collector-1
02c9ab8c6901 nvcr.io/nvidia/nemo-microservices/cached:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 24 seconds 0.0.0.0:8006->8000/tcp, :::8006->8000/tcp, 0.0.0.0:8007->8001/tcp, :::8007->8001/tcp, 0.0.0.0:8008->8002/tcp, :::8008->8002/tcp nv-ingest-cached-1
d49369334398 nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8012->8000/tcp, :::8012->8000/tcp, 0.0.0.0:8013->8001/tcp, :::8013->8001/tcp, 0.0.0.0:8014->8002/tcp, :::8014->8002/tcp nv-ingest-embedding-1
508715a24998 nvcr.io/nvidia/nemo-microservices/nv-yolox-structured-images-v1:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp nv-ingest-yolox-1
5b7a174a0a85 nvcr.io/nvidia/nemo-microservices/deplot:1.0.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 33 seconds 0.0.0.0:8003->8000/tcp, :::8003->8000/tcp, 0.0.0.0:8004->8001/tcp, :::8004->8001/tcp, 0.0.0.0:8005->8002/tcp, :::8005->8002/tcp nv-ingest-deplot-1
430045f98c02 nvcr.io/nvidia/nemo-microservices/paddleocr:0.2.0 "/opt/nvidia/nvidia_…" 14 hours ago Up 24 seconds 0.0.0.0:8009->8000/tcp, :::8009->8000/tcp, 0.0.0.0:8010->8001/tcp, :::8010->8001/tcp, 0.0.0.0:8011->8002/tcp, :::8011->8002/tcp nv-ingest-paddle-1
8e587b45821b grafana/grafana "/run.sh" 14 hours ago Up 33 seconds 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp grafana-service
aa2c0ec387e2 redis/redis-stack "/entrypoint.sh" 14 hours ago Up 33 seconds 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 8001/tcp nv-ingest-redis-1
bda9a2a9c8b5 openzipkin/zipkin "start-zipkin" 14 hours ago Up 33 seconds (healthy) 9410/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp nv-ingest-zipkin-1
ac27e5297d57 prom/prometheus:latest "/bin/prometheus --w…" 14 hours ago Up 33 seconds 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp nv-ingest-prometheus-1
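The same check can be scripted by comparing the names Docker reports against the names in the listing above. This is a sketch under assumptions: the expected names are copied from this example output and may differ on your system (for instance with a different compose project name), and the helper names are hypothetical.

```python
import subprocess

# Container names copied from the `docker ps` listing above; adjust them
# if your compose project uses different names.
EXPECTED = [
    "nv-ingest-nv-ingest-ms-runtime-1",
    "nv-ingest-otel-collector-1",
    "nv-ingest-cached-1",
    "nv-ingest-embedding-1",
    "nv-ingest-yolox-1",
    "nv-ingest-deplot-1",
    "nv-ingest-paddle-1",
    "grafana-service",
    "nv-ingest-redis-1",
    "nv-ingest-zipkin-1",
    "nv-ingest-prometheus-1",
]


def missing_containers(running_names, expected=EXPECTED) -> list[str]:
    """Return expected container names that are not in running_names."""
    running = set(running_names)
    return [name for name in expected if name not in running]


def running_container_names() -> list[str]:
    """Ask the docker CLI for the names of running containers."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Calling `missing_containers(running_container_names())` on the host returns an empty list when every expected container is up.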
NV-Ingest is in early access (EA) mode, meaning the codebase gets frequent updates. To build an updated NV-Ingest service container with the latest changes, you can run `docker compose build`. After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step.
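Step 2: Installing Python dependencies
You can interact with the NV-Ingest service from the host, or by using `docker exec` to run commands inside the NV-Ingest container.
To interact from the host, you need a Python environment with the client dependencies installed: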
# conda not required but makes it easy to create a fresh Python environment
conda create --name nv-ingest-dev python=3.10
conda activate nv-ingest-dev
cd client
pip install .
Interacting from the host depends on the appropriate port being exposed from the nv-ingest container to the host as defined in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml#L141). If you prefer, you can disable exposing that port and interact with the NV-Ingest service directly from within its container. To interact from within the container, run `docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash`. You'll be in the `/workspace` directory with `DATASET_ROOT` from the .env file mounted at `./data`. The pre-activated `morpheus` conda environment has all the Python client libraries pre-installed and you should see `(morpheus) root@aba77e2a4bde:/workspace#`. From that bash prompt, you can run the nv-ingest-cli and Python examples described below.
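Because host-side access depends on the exposed port, a quick TCP check can tell you whether the port is reachable before you submit jobs. The following is a minimal sketch using only the standard library; the function names and the timeout values are assumptions, and an open port only shows that something is listening, not that the service has finished loading.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, total_timeout: float = 900.0,
                  interval: float = 5.0) -> bool:
    """Poll host:port until it accepts connections or total_timeout elapses."""
    deadline = time.monotonic() + total_timeout
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the default configuration described here, `wait_for_port("localhost", 7670)` would poll the port used by the examples in this guide; the 900-second default is only a guess loosely based on the 10-15 minute first-start time mentioned earlier.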
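Step 3: Ingesting documents
You can submit jobs programmatically in Python or by using the nv-ingest-cli tool.
In the following examples, we perform text, chart, table, and image extraction:
- `extract_text`: Uses PDFium to find and extract text from pages.
- `extract_images`: Uses PDFium to extract images.
- `extract_tables`: Uses YOLOX to find tables and charts. Uses PaddleOCR for table extraction, and Deplot and CACHED for chart extraction.
- `extract_charts`: (Optional) Enables or disables Deplot and CACHED for chart extraction.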
`extract_tables` controls extraction for both tables and charts. You can optionally disable chart extraction by setting `extract_charts` to false.
In Python
More documentation and examples are available in the repository.
import logging
import time

from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.util.file_processing.extract import extract_file_content

logger = logging.getLogger("nv_ingest_client")

file_name = "data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)

# A JobSpec is an object that defines a document and how it should
# be processed by the nv-ingest service.
job_spec = JobSpec(
    document_type=file_type,
    payload=file_content,
    source_id=file_name,
    source_name=file_name,
    extended_options={
        "tracing_options": {
            "trace": True,
            "ts_send": time.time_ns(),
        }
    },
)

# Configure the desired extraction modes here. Multiple extraction
# methods can be defined for a single JobSpec.
extract_task = ExtractTask(
    document_type=file_type,
    extract_text=True,
    extract_images=True,
    extract_tables=True,
)
job_spec.add_task(extract_task)

# Create the client and inform it about the JobSpec we want to process.
client = NvIngestClient(
    message_client_hostname="localhost",  # Host where nv-ingest-ms-runtime is running
    message_client_port=7670,  # REST port, defaults to 7670
)
job_id = client.add_job(job_spec)
client.submit_job(job_id, "morpheus_task_queue")
result = client.fetch_job_result(job_id, timeout=60)
print(f"Got {len(result)} results")
Using the nv-ingest-cli
More nv-ingest-cli examples are available in the repository.
nv-ingest-cli \
--doc ./data/multimodal_test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true"}' \
--client_host=localhost \
--client_port=7670
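The `--task` value in the command above is a task name followed by a colon and a JSON object. When you script many runs, building that string with `json.dumps` avoids hand-written quoting mistakes. This is a sketch only: `build_task_arg` is a hypothetical helper, and the flags and option values are copied from the command above rather than taken from the full CLI reference.

```python
import json
import shlex


def build_task_arg(task_name: str, options: dict) -> str:
    """Return a --task value in the name:{json} form used above."""
    return f"{task_name}:{json.dumps(options)}"


# Option values mirror the CLI example above, including the string "true".
options = {
    "document_type": "pdf",
    "extract_method": "pdfium",
    "extract_tables": "true",
    "extract_images": "true",
}

command = [
    "nv-ingest-cli",
    "--doc", "./data/multimodal_test.pdf",
    "--output_directory", "./processed_docs",
    f"--task={build_task_arg('extract', options)}",
    "--client_host=localhost",
    "--client_port=7670",
]

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The `command` list can also be passed directly to `subprocess.run`, which sidesteps shell quoting entirely.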
You should see output that indicates the document processing status, followed by a breakdown of the time spent during job execution:
INFO:nv_ingest_client.nv_ingest_cli:Processing 1 documents.
INFO:nv_ingest_client.nv_ingest_cli:Output will be written to: ./processed_docs
Processing files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.47s/file, pages_per_sec=0.29]
INFO:nv_ingest_client.cli.util.processing:dedup_images: Avg: 1.02 ms, Median: 1.02 ms, Total Time: 1.02 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:dedup_images_channel_in: Avg: 1.44 ms, Median: 1.44 ms, Total Time: 1.44 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor: Avg: 0.66 ms, Median: 0.66 ms, Total Time: 0.66 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:docx_content_extractor_channel_in: Avg: 1.09 ms, Median: 1.09 ms, Total Time: 1.09 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images: Avg: 0.84 ms, Median: 0.84 ms, Total Time: 0.84 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:filter_images_channel_in: Avg: 7.75 ms, Median: 7.75 ms, Total Time: 7.75 ms, Total % of Trace Computation: 0.07%
INFO:nv_ingest_client.cli.util.processing:job_counter: Avg: 2.13 ms, Median: 2.13 ms, Total Time: 2.13 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:job_counter_channel_in: Avg: 2.05 ms, Median: 2.05 ms, Total Time: 2.05 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:metadata_injection: Avg: 14.48 ms, Median: 14.48 ms, Total Time: 14.48 ms, Total % of Trace Computation: 0.14%
INFO:nv_ingest_client.cli.util.processing:metadata_injection_channel_in: Avg: 0.22 ms, Median: 0.22 ms, Total Time: 0.22 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor: Avg: 10332.97 ms, Median: 10332.97 ms, Total Time: 10332.97 ms, Total % of Trace Computation: 99.45%
INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor_channel_in: Avg: 0.44 ms, Median: 0.44 ms, Total Time: 0.44 ms, Total % of Trace Computation: 0.00%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor: Avg: 1.19 ms, Median: 1.19 ms, Total Time: 1.19 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor_channel_in: Avg: 0.98 ms, Median: 0.98 ms, Total Time: 0.98 ms, Total % of Trace Computation: 0.01%
INFO:nv_ingest_client.cli.util.processing:redis_source_network_in: Avg: 12.27 ms, Median: 12.27 ms, Total Time: 12.27 ms, Total % of Trace Computation: 0.12%
INFO:nv_ingest_client.cli.util.processing:redis_task_sink_channel_in: Avg: 2.16 ms, Median: 2.16 ms, Total Time: 2.16 ms, Total % of Trace Computation: 0.02%
INFO:nv_ingest_client.cli.util.processing:redis_task_source: Avg: 8.00 ms, Median: 8.00 ms, Total Time: 8.00 ms, Total % of Trace Computation: 0.08%
INFO:nv_ingest_client.cli.util.processing:Unresolved time: 82.82 ms, Percent of Total Elapsed: 0.79%
INFO:nv_ingest_client.cli.util.processing:Processed 1 files in 10.47 seconds.
INFO:nv_ingest_client.cli.util.processing:Total pages processed: 3
INFO:nv_ingest_client.cli.util.processing:Throughput (Pages/sec): 0.29
INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.10
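In the breakdown above, `pdf_content_extractor` accounts for nearly all of the traced time. For longer logs, the per-stage lines can be parsed to find the dominant stage automatically. The following is a minimal sketch that assumes the log line format shown here; the function names are hypothetical and the format may change between releases.

```python
import re

# Matches lines such as:
# INFO:...:pdf_content_extractor: Avg: 10332.97 ms, Median: 10332.97 ms,
#   Total Time: 10332.97 ms, Total % of Trace Computation: 99.45%
STAGE = re.compile(
    r":(?P<stage>\w+): Avg: (?P<avg>[\d.]+) ms, Median: (?P<median>[\d.]+) ms, "
    r"Total Time: (?P<total>[\d.]+) ms, "
    r"Total % of Trace Computation: (?P<pct>[\d.]+)%"
)


def parse_stage_times(log_text: str) -> dict[str, float]:
    """Map each stage name to its total time in milliseconds."""
    return {m["stage"]: float(m["total"]) for m in STAGE.finditer(log_text)}


def slowest_stage(log_text: str):
    """Return (stage, total_ms) for the slowest stage, or None if none found."""
    times = parse_stage_times(log_text)
    if not times:
        return None
    return max(times.items(), key=lambda item: item[1])
```

Lines that do not follow the per-stage format, such as the throughput summary, are ignored.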
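Step 4: Inspecting and using results
After the ingestion steps above have completed, you should be able to find `text` and `image` subfolders inside your processed docs folder. Each contains JSON-formatted extracted content and metadata.
When processing has completed, you will have separate result files for text and image data: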
ls -R processed_docs/
processed_docs/:
image structured text
processed_docs/image:
multimodal_test.pdf.metadata.json
processed_docs/structured:
multimodal_test.pdf.metadata.json
processed_docs/text:
multimodal_test.pdf.metadata.json
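To get a quick overview of what was extracted, the result files can be loaded with the standard `json` module. This sketch assumes that each `*.metadata.json` file holds a JSON list of objects and that each object may carry a `document_type` key; that structure is an assumption made for illustration, so check it against the metadata schema documentation. The function name `summarize_results` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(output_dir: str) -> dict[str, Counter]:
    """Count entries per document_type in each <subfolder>/*.metadata.json file."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*/*.metadata.json")):
        entries = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(entries, list):
            entries = [entries]  # tolerate a single top-level object
        summary[str(path.relative_to(output_dir))] = Counter(
            entry.get("document_type", "unknown") if isinstance(entry, dict) else "unknown"
            for entry in entries
        )
    return summary
```

For the folder listed above, `summarize_results("processed_docs")` returns one counter per result file, keyed by its path relative to the output folder.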
First, install tkinter by running the following commands. Choose the commands for your operating system.
- For Ubuntu/Debian Linux:
sudo apt-get update
sudo apt-get install python3-tk
- For Fedora/RHEL Linux:
sudo dnf install python3-tkinter
- For macOS using Homebrew:
brew install python-tk
Then, run the following command to execute the script that inspects the extracted images:
python src/util/image_viewer.py --file_path ./processed_docs/image/multimodal_test.pdf.metadata.json
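Beyond inspecting the results, you can read them into retrieval pipelines such as [llama-index](https://github.com/NVIDIA/nv-ingest/blob/main/examples/llama_index_multimodal_rag.ipynb) or [langchain](https://github.com/NVIDIA/nv-ingest/blob/main/examples/langchain_multimodal_rag.ipynb). Also, check out our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NV-Ingest.
Repository structure
Beyond the relevant documentation, examples, and other links above, the following describes the contents of this repository's folders:
- .github: GitHub repository configuration files
- ci: Scripts used to build the NV-Ingest container and other packages
- client: Documentation and source code for the nv-ingest-cli utility
- config: Various .yaml files that define the OTEL and Prometheus configurations
- data: Sample PDFs provided for testing convenience
- docker: Scripts used by the nv-ingest docker container
- docs: Various READMEs that describe deployment, the metadata schema, and authentication and telemetry setup
- examples: Example notebooks, scripts, and longer-form tutorial content
- helm: Documentation for deploying NV-Ingest to a Kubernetes cluster via Helm chart
- skaffold: Skaffold configuration
- src: Source code for the NV-Ingest pipelines and service
- tests: Unit tests for NV-Ingest
Notices
Third-party license notice
If configured to do so, this project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use: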
https://pypi.ac.cn/project/pdfservices-sdk/
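- `INSTALL_ADOBE_SDK`:
  - Description: If set to `true`, the Adobe SDK is installed in the container at launch time. This is required if you want to use the Adobe extraction service for PDF decomposition. Review the license agreement for pdfservices-sdk before enabling this option.
Contributing
We require that all contributors "sign off" on their commits. This certifies that the contribution is your original work, or that you have the right to submit it under the same license or a compatible license.
Any contribution that contains commits that are not signed off will not be accepted.
To sign off on a commit, use the --signoff (or -s) option when committing your changes: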
$ git commit -s -m "Add cool feature."
This appends the following to your commit message:
Signed-off-by: Your Name <your@email.com>
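Since unsigned commits are not accepted, it can be useful to check a commit message for the sign-off trailer before pushing. The following is a minimal sketch, not the project's actual enforcement mechanism: the helper name `has_signoff` and the loose email pattern are assumptions.

```python
import re

# A line of the form "Signed-off-by: Your Name <your@email.com>".
# The email pattern is deliberately loose; it only looks for "x@y" in brackets.
SIGNOFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(commit_message: str) -> bool:
    """Return True if the message contains a Signed-off-by trailer line."""
    return SIGNOFF.search(commit_message) is not None
```

The message of the most recent commit can be obtained with `git log -1 --format=%B` and passed to `has_signoff`.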
Full text of the DCO:
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open-source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.