PdfGptIndexer was featured at the top of Hacker News!

PdfGptIndexer is a tool for indexing and querying PDF documents using OpenAI embeddings and FAISS (Facebook AI Similarity Search). It implements a RAG (Retrieval Augmented Generation) system that lets you ask questions about your PDF documents in natural language. Because embeddings are computed once and stored locally, each query searches a prebuilt index instead of reprocessing the PDFs.
PdfGptIndexer consists of two main components: the indexer and the chatbot.
The indexer processes your PDF documents and creates a searchable vector database:
- Extract Text: Uses PyMuPDF to extract text from all PDF files in a folder
- Chunk Text: Splits documents into manageable chunks (1000 characters with 200-character overlap) using LangChain's RecursiveCharacterTextSplitter
- Generate Embeddings: Creates vector embeddings for each chunk using OpenAI's text-embedding-ada-002 model
- Store Locally: Saves the embeddings in a FAISS index on disk for fast retrieval
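The chunking step above can be illustrated with a minimal sketch. It uses a plain fixed-size sliding window, not LangChain's RecursiveCharacterTextSplitter, which additionally tries to break on separators such as paragraph and line boundaries. The function name `chunk_text` is hypothetical and not taken from the project.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`.

    Simplified stand-in for a recursive splitter: it cuts at fixed offsets
    and ignores paragraph or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.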
The chatbot provides an intelligent interface to query your indexed documents:
- Load Index: Loads the pre-computed FAISS vector index from disk
- Semantic Search: Converts your question into an embedding and finds the top 3 most similar document chunks
- Display Matches: Shows you the similarity scores and text snippets from matched documents
- Generate Answer: Uses GPT-4 to synthesize a coherent answer based on the retrieved context
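The retrieval and answer steps above can be sketched without any external services. The example below is an illustration under assumptions: it ranks chunks by brute-force cosine similarity as a stand-in for the FAISS lookup (the metric the project's index actually uses is not stated here), the vectors are toy values, not real embeddings, and the prompt wording in `build_prompt` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Assemble retrieved chunks and the question into one prompt string (hypothetical wording)."""
    joined = "\n\n".join(contexts)
    return f"Answer using only this context:\n\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy data: in the real tool these vectors would come from the embedding model.
    chunks = ["FAISS stores vectors.", "PyMuPDF extracts text.",
              "GPT-4 writes the answer.", "Unrelated note."]
    vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
    matches = top_k([1.0, 0.1, 0.0], vectors)
    for index, score in matches:
        print(f"{score:.3f}  {chunks[index]}")
    print(build_prompt("What stores vectors?", [chunks[i] for i, _ in matches]))
```

The resulting prompt string is what would be sent to the chat model; only that final call and the embedding of the question need the OpenAI API.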
Storing embeddings locally provides several key benefits:
- Speed: Retrieval is faster because document embeddings are pre-computed; only the question itself is embedded at query time
- Offline Access: After initial creation, document embeddings are read from local disk; only embedding the question and generating the answer require calls to the OpenAI API
- Cost Savings: Compute embeddings once and reuse them, saving on API costs
- Scalability: Makes it feasible to work with large document collections that would be expensive to process in real-time
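The compute-once, reuse-later idea behind these benefits can be shown with a small sketch. It is not the project's code: a JSON file stands in for the FAISS index, and `embed` is a caller-supplied function standing in for the OpenAI embedding call. The name `load_or_build_embeddings` is hypothetical.

```python
import json
import os
from typing import Callable, Dict, List


def load_or_build_embeddings(
    chunks: List[str],
    embed: Callable[[str], List[float]],
    cache_path: str,
) -> Dict[str, List[float]]:
    """Return a chunk -> vector mapping, calling `embed` only if no cache file exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)  # reuse vectors computed on an earlier run

    # First run: pay the embedding cost once, then persist the result.
    vectors = {chunk: embed(chunk) for chunk in chunks}
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(vectors, fh)
    return vectors
```

On the second and later runs the embedding function is never called, which is where the speed and cost savings come from. This sketch does not detect changed or newly added chunks; the cache file must be deleted to rebuild.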
- Python 3.8 or higher
- OpenAI API key
Clone the repository:
git clone https://github.com/raghavan/PdfGptIndexer.git
cd PdfGptIndexer
Install dependencies:
pip install -r requirements.txt
Or install manually:
pip install langchain langchain-openai langchain-community langchain-text-splitters openai pymupdf faiss-cpu python-dotenv tiktoken
Create a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_openai_api_key_here
Place your PDF files in the pdf/ folder (or any folder of your choice).
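To show how a key stored in .env typically reaches a script, here is a simplified, standard-library-only sketch. The project lists python-dotenv among its dependencies, which presumably does this job; the two helpers below (`load_env_file`, `require_api_key`) are hypothetical stand-ins and handle only plain KEY=VALUE lines.

```python
import os


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE pairs from a file into os.environ without overriding existing values."""
    if not os.path.exists(path):
        return
    with open(path, "r", encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def require_api_key() -> str:
    """Return OPENAI_API_KEY or fail early with a readable message."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")
    return key
```

Checking for the key before any PDF processing starts avoids discovering a missing key only after text extraction has finished.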
Run the indexer to process your PDFs and create the vector database:
python indexer.py
Or specify a custom PDF folder:
python indexer.py /path/to/your/pdfs
Or specify both custom PDF folder and index location:
python indexer.py /path/to/your/pdfs /path/to/save/index
What happens:
- Extracts text from all PDFs in the folder
- Creates text chunks with metadata
- Generates embeddings using OpenAI
- Saves the FAISS index to faiss_index/ (or your specified location)
Note: You only need to run this once, or when you add new PDFs to your collection.
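The three invocation forms above (no arguments, a PDF folder, or a PDF folder plus an index location) map naturally onto two optional positional arguments. The sketch below shows one way to express that with argparse; it is an assumption about the interface, not the actual contents of indexer.py, which may read its arguments differently. The defaults `pdf` and `faiss_index` follow the folder names mentioned in this README.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `[pdf_folder] [index_path]`, both optional, with README defaults."""
    parser = argparse.ArgumentParser(description="Index PDFs into a local vector store")
    parser.add_argument("pdf_folder", nargs="?", default="pdf",
                        help="folder containing the PDF files")
    parser.add_argument("index_path", nargs="?", default="faiss_index",
                        help="folder where the index is saved")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Indexing PDFs from {args.pdf_folder} into {args.index_path}")
```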
Start the interactive chatbot:
python chatbot.py
Or specify a custom index location:
python chatbot.py /path/to/your/index