Commit 91e312d

docs: add opendataloader_pdf integration page
Adds a complete integration page for the **OpenDataLoader PDF** (`langchain-opendataloader-pdf`) document loader, including installation, initialization, quick start, and parameter documentation.

**Type:** New documentation page

- GitHub issue: https://github.com/opendataloader-project/opendataloader-pdf/issues
- Feature PR: https://github.com/opendataloader-project/opendataloader-pdf/pulls

- [x] I have read the [contributing guidelines](README.md)
- [x] I have tested my changes locally using `docs dev`
- [x] All code examples have been tested and work correctly
- [x] I have used **root relative** paths for internal links
- [ ] I have updated navigation in `src/docs.json` if needed
- [ ] I have gotten approval from the relevant reviewers
- [ ] (Internal team members only / optional) I have created a preview deployment using the [Create Preview Branch workflow](https://github.com/langchain-ai/docs/actions/workflows/create-preview-branch.yml)

- New files: `docs/src/oss/python/integrations/document_loaders/opendataloader_pdf.mdx` and `docs/src/oss/python/integrations/providers/opendataloader_pdf.mdx`
- Also appended `OpenDataLoader PDF` to the loader index (`index.mdx`) and `all_providers.mdx` for discoverability.
1 parent 28f5d56 commit 91e312d

File tree

4 files changed

+128
-0
lines changed

src/oss/python/integrations/document_loaders/index.mdx

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -67,6 +67,7 @@ The below document loaders allow you to load PDF documents.
6767
| [Upstage Document Parse Loader](/oss/integrations/document_loaders/upstage) | Load PDF files using UpstageDocumentParseLoader | Package |
6868
| [Docling](/oss/integrations/document_loaders/docling) | Load PDF files using Docling | Package |
6969
| [UnDatasIO](/oss/integrations/document_loaders/undatasio) | Load PDF files using UnDatasIO | Package |
70+
| [OpenDataLoader PDF](/oss/integrations/document_loaders/opendataloader_pdf) | Load PDF files using OpenDataLoader PDF | Package |
7071

7172

7273
### Cloud Providers
@@ -258,6 +259,7 @@ The below document loaders allow you to load data from common data formats.
258259
<Card title="Notion DB" icon="link" href="/oss/integrations/document_loaders/notion" arrow="true" cta="View guide" />
259260
<Card title="Nuclia" icon="link" href="/oss/integrations/document_loaders/nuclia" arrow="true" cta="View guide" />
260261
<Card title="Obsidian" icon="link" href="/oss/integrations/document_loaders/obsidian" arrow="true" cta="View guide" />
262+
<Card title="OpenDataLoader PDF" icon="link" href="/oss/integrations/document_loaders/opendataloader_pdf" arrow="true" cta="View guide" />
261263
<Card title="Open Document Format (ODT)" icon="link" href="/oss/integrations/document_loaders/odt" arrow="true" cta="View guide" />
262264
<Card title="Open City Data" icon="link" href="/oss/integrations/document_loaders/open_city_data" arrow="true" cta="View guide" />
263265
<Card title="Oracle Autonomous Database" icon="link" href="/oss/integrations/document_loaders/oracleadb_loader" arrow="true" cta="View guide" />
src/oss/python/integrations/document_loaders/opendataloader_pdf.mdx

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
---
2+
title: OpenDataLoader PDF
3+
---
4+
5+
**Safe, Open, High-Performance — PDF for AI**
6+
7+
[OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
8+
9+
It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
10+
Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
11+
AI-safety filtering is enabled by default: it automatically filters likely prompt-injection content embedded in PDFs (for example, hidden or off-page text) to reduce downstream risk.
12+
13+
## Overview
14+
15+
### Integration details
16+
17+
| Class | Package | Local | Serializable | JS support |
18+
| :--- | :--- | :---: | :---: | :---: |
19+
| [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) | [langchain-opendataloader-pdf](https://pypi.org/project/langchain-opendataloader-pdf/) ||||
20+
21+
### Loader features
22+
23+
| Source | Document Lazy Loading | Native Async Support |
24+
| :---: | :---: | :---: |
25+
| OpenDataLoaderPDFLoader |||
26+
27+
The `OpenDataLoaderPDFLoader` component enables you to parse PDFs into structured `Document` objects.
28+
29+
## Requirements
30+
- Python >= 3.9
31+
- Java 11 or newer available on the system `PATH`
32+
- opendataloader-pdf >= 1.1.1
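
The Java requirement above can be checked before the loader is invoked. The following is a minimal sketch using only the Python standard library; the helper names `find_java` and `java_version_banner` are hypothetical and are not part of `langchain-opendataloader-pdf`. It does not parse the version number, it only reports what `java -version` prints.

```python
import shutil
import subprocess
from typing import Optional


def find_java() -> Optional[str]:
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path: str) -> str:
    """Return the text printed by `java -version` (the JVM writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("Java was not found on PATH; install Java 11 or newer first.")
else:
    # Inspect the banner manually to confirm the version is 11 or newer.
    print(java_version_banner(java_path))
```

Running this once in the target environment gives an early, readable message when Java is missing, before any PDF is processed.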
33+
34+
## Installation
35+
```bash
36+
pip install -U langchain-opendataloader-pdf
37+
```
38+
39+
## Quick start
40+
```python
41+
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
42+
43+
loader = OpenDataLoaderPDFLoader(
44+
    file_path=["path/to/document.pdf", "path/to/folder"],
45+
    format="text"
46+
)
47+
documents = loader.load()
48+
49+
for doc in documents:
50+
    print(doc.metadata, doc.page_content[:80])
51+
```
52+
53+
## Parameters
54+
55+
| Parameter | Type | Required | Default | Description |
56+
|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
57+
| `file_path` | `List[str]` | ✅ Yes || One or more PDF file paths or directories to process. |
58+
| `format` | `str` | No | `None` | Output formats (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
59+
| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
60+
| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
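
As an illustration of how the parameters in this table fit together, here is a small sketch that collects them into keyword arguments and flags filter names that the table does not list. `build_loader_kwargs` and `DOCUMENTED_SAFETY_FILTERS` are hypothetical names introduced for this example; the table presents the filter names as examples, so the set may be incomplete and the check is only a guard against typos.

```python
from typing import Any, Dict, List, Optional

# Filter names copied from the Parameters table. The table lists them as
# examples ("e.g."), so this set may be incomplete.
DOCUMENTED_SAFETY_FILTERS = {"all", "hidden-text", "off-page", "tiny", "hidden-ocg"}


def build_loader_kwargs(
    file_path: List[str],
    format: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: gather loader arguments and reject unfamiliar filter names."""
    if not file_path:
        raise ValueError("file_path must contain at least one path")
    unknown = sorted(set(content_safety_off or []) - DOCUMENTED_SAFETY_FILTERS)
    if unknown:
        raise ValueError(f"Filter names not listed in the docs: {unknown}")
    return {
        "file_path": file_path,
        "format": format,
        "quiet": quiet,
        "content_safety_off": content_safety_off,
    }


kwargs = build_loader_kwargs(
    ["path/to/document.pdf"],
    format="markdown",
    quiet=True,
    content_safety_off=["hidden-text", "tiny"],
)
print(kwargs)

# With the package installed, the arguments can then be passed on:
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# documents = OpenDataLoaderPDFLoader(**kwargs).load()
```

Disabling a safety filter lets the corresponding content through to downstream consumers, so it is worth limiting `content_safety_off` to the filters you actually need to turn off.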
61+
62+
## Additional Resources
63+
64+
- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
65+
- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
66+
- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
67+
- [OpenDataLoader PDF Homepage](https://opendataloader.org/)

src/oss/python/integrations/providers/all_providers.mdx

Lines changed: 8 additions & 0 deletions
@@ -1990,6 +1990,14 @@ Browse the complete collection of integrations available for Python. LangChain P
19901990
>
19911991
GPT models and comprehensive AI platform.
19921992
</Card>
1993+
1994+
<Card
1995+
title="OpenDataLoader PDF"
1996+
href="/oss/integrations/providers/opendataloader_pdf"
1997+
icon="link"
1998+
>
1999+
Safe, Open, High-Performance — PDF for AI
2000+
</Card>
19932001

19942002
<Card
19952003
title="OpenGradient"
src/oss/python/integrations/providers/opendataloader_pdf.mdx

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
---
2+
title: OpenDataLoader PDF
3+
---
4+
5+
> **Safe, Open, High-Performance — PDF for AI**
6+
7+
> [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
8+
>
9+
> It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
10+
> Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
11+
> AI-safety filtering is enabled by default: it automatically filters likely prompt-injection content embedded in PDFs (for example, hidden or off-page text) to reduce downstream risk.
12+
13+
## Requirements
14+
- Python >= 3.9
15+
- Java 11 or newer available on the system `PATH`
16+
- opendataloader-pdf >= 1.1.1
17+
18+
## Installation
19+
```bash
20+
pip install -U langchain-opendataloader-pdf
21+
```
22+
23+
## Quick start
24+
```python
25+
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
26+
27+
loader = OpenDataLoaderPDFLoader(
28+
    file_path=["path/to/document.pdf", "path/to/folder"],
29+
    format="text"
30+
)
31+
documents = loader.load()
32+
33+
for doc in documents:
34+
    print(doc.metadata, doc.page_content[:80])
35+
```
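
Because `file_path` in the quick start mixes a file and a folder, it can help to preview which PDFs such a list covers before loading. The sketch below is a hypothetical helper (`expand_pdf_paths`) built on `pathlib`; searching folders recursively is an assumption of this example and is not documented behaviour of the loader, which accepts directories directly.

```python
from pathlib import Path
from typing import Iterable, List


def expand_pdf_paths(paths: Iterable[str]) -> List[str]:
    """Expand files and directories into a sorted, de-duplicated list of PDF paths."""
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Recursive search is an assumption of this sketch, not loader behaviour.
            for candidate in path.rglob("*"):
                if candidate.is_file() and candidate.suffix.lower() == ".pdf":
                    found.add(str(candidate))
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(str(path))
        # Paths that do not exist or are not PDFs are skipped silently.
    return sorted(found)


# Preview the files behind the quick-start arguments (placeholder paths).
for pdf in expand_pdf_paths(["path/to/document.pdf", "path/to/folder"]):
    print(pdf)
```

The resulting list is only a preview for logging or sanity checks; the original mixed list can still be passed to `file_path` unchanged.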
36+
37+
## Parameters
38+
39+
| Parameter | Type | Required | Default | Description |
40+
|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
41+
| `file_path` | `List[str]` | ✅ Yes || One or more PDF file paths or directories to process. |
42+
| `format` | `str` | No | `None` | Output formats (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
43+
| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
44+
| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
45+
46+
## Additional Resources
47+
48+
- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
49+
- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
50+
- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
51+
- [OpenDataLoader PDF Homepage](https://opendataloader.org/)
