| 1 | +--- |
| 2 | +title: OpenDataLoader PDF |
| 3 | +--- |
| 4 | + |
| 5 | +**Safe, Open, High-Performance — PDF for AI** |
| 6 | + |
| 7 | +[OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML that is ready to feed into modern AI stacks (LLMs, vector search, and RAG). |
| 8 | + |
| 9 | +It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query. |
| 10 | +Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets. |
| 11 | +AI safety filtering is enabled by default and automatically filters out likely prompt-injection content embedded in PDFs to reduce downstream risk. |
| 12 | + |
| 13 | +## Overview |
| 14 | + |
| 15 | +### Integration details |
| 16 | + |
| 17 | +| Class | Package | Local | Serializable | JS support | |
| 18 | +| :--- | :--- | :---: | :---: | :---: | |
| 19 | +| [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) | [langchain-opendataloader-pdf](https://pypi.org/project/langchain-opendataloader-pdf/) | ✅ | ❌ | ❌ | |
| 20 | + |
| 21 | +### Loader features |
| 22 | + |
| 23 | +| Source | Document Lazy Loading | Native Async Support | |
| 24 | +| :---: | :---: | :---: | |
| 25 | +| OpenDataLoaderPDFLoader | ✅ | ❌ | |
| 26 | + |
| 27 | +The `OpenDataLoaderPDFLoader` component enables you to parse PDFs into structured `Document` objects. |
| 28 | + |
| 29 | +## Requirements |
| 30 | +- Python >= 3.9 |
| 31 | +- Java 11 or newer available on the system `PATH` |
| 32 | +- opendataloader-pdf >= 1.1.1 |
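The requirements above can be sanity-checked before installing. The following is a minimal sketch using only the Python standard library; `check_requirements` is a hypothetical helper name, not part of the package, and it only confirms that a `java` executable is on the `PATH` (it does not verify that the Java version is 11 or newer).

```python
import shutil
import sys


def check_requirements(min_python=(3, 9)):
    """Return a dict describing whether each prerequisite appears to be met.

    Hypothetical helper for illustration; not part of langchain-opendataloader-pdf.
    """
    return {
        # Compare the running interpreter against the documented minimum.
        "python": sys.version_info >= min_python,
        # Only checks that a `java` executable is discoverable on PATH,
        # not which Java version it is.
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `java_on_path` is reported as missing, running `java -version` in a shell is a quick way to confirm whether Java is installed at all and which version it is.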
| 33 | + |
| 34 | +## Installation |
| 35 | +```bash |
| 36 | +pip install -U langchain-opendataloader-pdf |
| 37 | +``` |
| 38 | + |
| 39 | +## Quick start |
| 40 | +```python |
| 41 | +from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader |
| 42 | + |
| 43 | +loader = OpenDataLoaderPDFLoader( |
| 44 | +    file_path=["path/to/document.pdf", "path/to/folder"], |
| 45 | +    format="text", |
| 46 | +) |
| 47 | +documents = loader.load() |
| 48 | + |
| 49 | +for doc in documents: |
| 50 | +    print(doc.metadata, doc.page_content[:80]) |
| 51 | +``` |
| 52 | + |
| 53 | +## Parameters |
| 54 | + |
| 55 | +| Parameter | Type | Required | Default | Description | |
| 56 | +|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------| |
| 57 | +| `file_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. | |
| 58 | +| `format` | `str` | No | `None` | Output format (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). | |
| 59 | +| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. | |
| 60 | +| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). | |
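To show how the parameters in the table combine, here is a small sketch that validates a set of arguments and collects them into keyword arguments for the loader. `build_loader_kwargs` is a hypothetical helper, and the sets of accepted values are assumptions copied from the "e.g." examples in the table; the loader itself may accept other values.

```python
from typing import Any, Dict, List, Optional

# Values taken from the "e.g." lists in the parameter table. The real
# loader may accept more, so treat these sets as assumptions.
KNOWN_FORMATS = {"json", "html", "markdown", "text"}
KNOWN_SAFETY_FILTERS = {"all", "hidden-text", "off-page", "tiny", "hidden-ocg"}


def build_loader_kwargs(
    file_path: List[str],
    format: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return loader kwargs."""
    if not file_path:
        raise ValueError("file_path must contain at least one path")
    if format is not None and format not in KNOWN_FORMATS:
        raise ValueError(f"unrecognised format: {format!r}")
    unknown = set(content_safety_off or []) - KNOWN_SAFETY_FILTERS
    if unknown:
        raise ValueError(f"unrecognised safety filters: {sorted(unknown)}")
    return {
        "file_path": list(file_path),
        "format": format,
        "quiet": quiet,
        "content_safety_off": content_safety_off,
    }


# Markdown output, quiet logging, and only the hidden-text filter disabled.
kwargs = build_loader_kwargs(
    ["path/to/document.pdf"],
    format="markdown",
    quiet=True,
    content_safety_off=["hidden-text"],
)
# With the package installed, the result could be passed straight through:
#   from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
#   documents = OpenDataLoaderPDFLoader(**kwargs).load()
```

Disabling a safety filter turns off part of the default prompt-injection protection, so it is probably best limited to documents from trusted sources.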
| 61 | + |
| 62 | +## Additional resources |
| 63 | + |
| 64 | +- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf) |
| 65 | +- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/) |
| 66 | +- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf) |
| 67 | +- [OpenDataLoader PDF Homepage](https://opendataloader.org/) |