PB Assistant is a Django-based application designed to assist with managing and querying knowledge related to Planetary Boundaries. It leverages a vector database (PostgreSQL with pgvector) for efficient similarity search, a robust PDF text extraction service (Grobid) for ingesting scientific documents, and local Large Language Models (LLMs) via Ollama for question-answering capabilities. The goal is to provide a powerful tool for researchers to explore and interact with a curated knowledge base of scientific literature.
The PB Assistant project consists of several key components:
- Django Backend: The core application, handling web requests, data management, and orchestration of other services.
- PostgreSQL with pgvector: The primary database, extended with the pgvector extension for storing and querying high-dimensional vector embeddings.
- Grobid (in Docker): An external service used for extracting structured text and metadata from PDF documents.
- Ollama (in Docker): An external service providing a local environment for running various LLMs for inference.
- Text Processing Pipeline: A set of Django apps and services (textprocessing) responsible for embedding text, ingesting PDFs, and performing QA chain operations.
- Frontend: A web interface built with Django templates for user interaction, search queries, and displaying results.
At a high level, PDF documents are processed by the text pipeline, their content is converted into vector embeddings, and both the metadata and embeddings are stored in the PostgreSQL database. User queries are then sent through the Django backend to find relevant document chunks, which are provided as context to an LLM via Ollama to synthesize an answer.
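The retrieval step described above hinges on vector similarity between the query embedding and stored chunk embeddings. The following is a minimal sketch of that ranking step using toy 3-dimensional vectors; the actual embedding model, vector dimensions, and pgvector query operators used by the project are not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """Rank stored chunks by similarity to the query embedding."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = [
    {"text": "ocean acidification trends", "embedding": [0.9, 0.1, 0.0]},
    {"text": "climate change drivers",     "embedding": [0.1, 0.9, 0.0]},
    {"text": "freshwater use limits",      "embedding": [0.0, 0.2, 0.9]},
]
query = [0.1, 0.8, 0.1]  # pretend this embeds "what drives climate change?"
print(top_k(query, chunks, k=1)[0]["text"])  # → climate change drivers
```

In the real application, pgvector performs this ranking inside PostgreSQL rather than in Python; the top-ranked chunks are then passed as context to the LLM.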
This guide provides a complete walkthrough for installing, configuring, and running the PB Assistant application on your local machine.
Before starting, ensure you have the following installed:
- Python 3.10+
- Docker & Docker Compose
- Git (optional, for cloning the repository)
You will also need at least 8GB of RAM and 10GB of free disk space for the Docker images, models, and PDF documents.
This project uses a Python virtual environment to manage dependencies.
From the project root, create a new virtual environment:
python -m venv .venv
Activate it:
- Windows: .venv\Scripts\activate
- macOS / Linux: source .venv/bin/activate
To verify that the environment is active, run which python (on macOS/Linux) or where python (on Windows). The output should point to the Python interpreter inside your .venv directory.
Inside the activated virtual environment, upgrade pip and then install the required packages from requirements.txt:
python -m pip install --upgrade pip
pip install -r requirements.txt
The project's external dependencies (PostgreSQL, Grobid, and Ollama) are managed via Docker Compose.
Create a .env file in the project root. This file holds local environment variables and is ignored by Git. The Django settings are configured to load this file using python-dotenv.
Fill it with the following content, replacing placeholder values as needed:
POSTGRES_DB=your_db_name
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
GROBID_URL=http://localhost:8070
OLLAMA_BASE_URL=http://localhost:11434
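As a rough illustration of what python-dotenv does with this file, the loading logic can be sketched with the standard library alone. Note that load_env_file is a hypothetical helper for illustration, not the project's actual settings code, which uses python-dotenv directly.

```python
import os

def load_env_file(path):
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments ignored."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault mirrors dotenv's default: real env vars win over the file.
            os.environ.setdefault(key.strip(), value.strip())

# Django settings would then read the values with sensible fallbacks, e.g.:
# DATABASES["default"]["HOST"] = os.getenv("POSTGRES_HOST", "localhost")
```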
To start the services in the background (detached mode), run:
docker-compose up -d
Allow a minute or two for all services to initialize. You can check the status of the containers with docker-compose ps.
To pull the LLM models you want to use, execute the ollama pull command directly within the running container using docker-compose exec. You can find the service name (e.g., ollama) in your docker-compose.yaml file.
For example:
docker-compose exec ollama ollama pull llama3
docker-compose exec ollama ollama pull mistral
With the database container running, you can now set up the database schema.
This repository includes the initial database schema as Django migration files. To apply them, simply run:
python manage.py migrate
This command will set up all the necessary tables and enable the pgvector extension.
If you modify a model in PB_Assistant/models.py, you will need to generate a new database migration. The instructions below are for this purpose and are not needed for initial setup.
- Creating the pgvector Extension (first time on a new DB): If you were starting a new project from scratch without the existing migration files, you would first need an empty migration to enable the pgvector extension:
python manage.py makemigrations --empty PB_Assistant --name create_pgvector_extension
You would then edit the generated file to contain migrations.RunSQL("CREATE EXTENSION IF NOT EXISTS vector;"). Since this is already done in 0001_create_pgvector_extension.py, you do not need to do this now.
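For reference, such a migration file typically looks like the sketch below. This is an illustrative reconstruction, not necessarily the exact contents of the repository's 0001_create_pgvector_extension.py; in particular, the reverse_sql line is an assumption.

```python
# PB_Assistant/migrations/0001_create_pgvector_extension.py (illustrative sketch)
from django.db import migrations

class Migration(migrations.Migration):
    initial = True
    dependencies = []
    operations = [
        migrations.RunSQL(
            sql="CREATE EXTENSION IF NOT EXISTS vector;",
            # Optional: lets `migrate PB_Assistant zero` undo the extension.
            reverse_sql="DROP EXTENSION IF EXISTS vector;",
        ),
    ]
```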
- Generating Model Migrations: After changing a model, create a new migration with:
python manage.py makemigrations
- Applying Migrations: To apply any new or pending migrations, run:
python manage.py migrate
To access the Django admin panel, you need to create a superuser:
python manage.py createsuperuser
Follow the prompts to set up your username and password. You can log in to the admin panel later at http://127.0.0.1:8000/admin/.
The application requires PlanetaryBoundary objects to be created before importing PDFs.
To add initial data, enter the Django interactive shell:
python manage.py shell
Once in the shell, import the model and create a new instance:
from PB_Assistant.models import PlanetaryBoundary
PlanetaryBoundary.objects.create(name='Climate Change', short_name='cc')
Repeat this for each boundary you want to track. Alternatively, you can add them from the admin panel after starting the application. Type exit() to leave the shell.
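If you track several boundaries, a loop in the same shell session avoids repetition. The names and short names below are hypothetical examples; use whichever boundaries your project actually tracks. This snippet only runs inside python manage.py shell, where the project's Django settings are loaded.

```python
from PB_Assistant.models import PlanetaryBoundary

# Hypothetical subset of boundaries -- adjust to your own list.
BOUNDARIES = [
    ("Climate Change", "cc"),
    ("Ocean Acidification", "oa"),
    ("Freshwater Use", "fw"),
]

for name, short_name in BOUNDARIES:
    # get_or_create makes the snippet safe to re-run without duplicates.
    PlanetaryBoundary.objects.get_or_create(name=name, short_name=short_name)
```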
Import PDF files belonging to a particular boundary using the import_pdfs command:
python manage.py import_pdfs --folder path/to/your/pdfs --boundary cc
Where:
- --folder: The path to the folder containing your PDF documents.
- --boundary: The short_name of the PlanetaryBoundary to associate the PDFs with.
Finally, run the Django development server:
python manage.py runserver
Open your browser to http://127.0.0.1:8000/.
We welcome contributions to the PB Assistant project! To contribute:
- Fork the repository and clone it to your local machine.
- Set up your local development environment by following the instructions in this README.md.
- Create a new branch for your feature or bug fix (git checkout -b feature/my-new-feature).
- Make your changes, following the existing code style and conventions.
- Write or update tests for your changes.
- Ensure all tests pass.
- Submit a pull request with a clear description of your changes.
This project is currently unlicensed. It is highly recommended to add a LICENSE file to the root of the repository to specify the terms under which this software can be used, modified, and distributed. Popular open-source licenses include MIT, Apache 2.0, or GPL.
- Port Conflicts: If docker-compose up or runserver fails with an error about a port already being in use, another service on your machine is using that port.
  - For the Django server, run it on a different port: python manage.py runserver 8001.
  - For Docker services, check which port is in use and either stop the conflicting service or change the port mapping in docker-compose.yaml and your .env file.
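To check whether a given port is already taken before picking an alternative, a quick standard-library probe works on any platform. The port_in_use helper below is a hypothetical convenience for illustration, not part of the project.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when a server is listening on that port.
        return s.connect_ex((host, port)) == 0

print(port_in_use(8000))  # True if the Django dev server (or anything else) holds port 8000
```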
- Docker Issues: If docker-compose commands fail, ensure the Docker Desktop application is running. On Linux, you may encounter permission errors if your user is not in the docker group; add yourself with sudo usermod -aG docker $USER, then log out and back in.