A web application that provides an AI-powered chatbot interface for dataset discovery, using Google Gemini API on the backend and a React-based frontend.
- Prerequisites
- Setup
- Database Setup
- Running the Application
- Data Processing Pipeline
- Deployment
- API Documentation
- Environment Configuration
- Python: 3.11 or higher
- Node.js: 18.x or higher (for frontend development)
- Google API Key for Gemini
- Google Cloud Platform Account (for BigQuery and Vertex AI)
- UV package manager (for backend environment & dependencies)
- Docker & Docker Compose (optional, for containerized deployment)
git clone https://github.com/INCF/knowledge-space-agent.git
cd knowledge-space-agent
- Windows:
pip install uv
- macOS/Linux: Follow the official guide: https://docs.astral.sh/uv/getting-started/installation/
Create a file named `.env` in the project root based on `.env.template`. You can choose between two authentication modes:

Option 1: Google API Key (Recommended for development)
- Set `GOOGLE_API_KEY` in your `.env` file.

Option 2: Vertex AI (Recommended for production)
- Configure Google Cloud credentials and Vertex AI settings as shown in `.env.template`.

Note: Do not commit `.env` to version control.
# Create a virtual environment using UV
uv venv
# Activate it:
# On Windows (cmd):
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
With the virtual environment activated:
uv sync
cd frontend
npm install
- Install Google Cloud CLI and Authenticate:

# Install Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

# Initialize and authenticate
gcloud init
gcloud auth application-default login

Configuration details for BigQuery and Vertex AI services are provided in the `.env.template` file.
In one terminal, from the project root with the virtual environment active:
uv run main.py
- By default, this will start the backend server on port 8000. Adjust configuration if you need a different port.
In another terminal:
cd frontend
npm start
- This will start the React development server, typically on http://localhost:5000.
Open your browser to:
http://localhost:5000
The frontend will communicate with the backend at port 8000.
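To confirm the backend is up before opening the UI, you can query its health endpoint (documented under API Documentation below); a minimal check, assuming the default port 8000:

```python
import requests

# Simple liveness probe against the locally running backend.
print(requests.get("http://localhost:8000/health", timeout=10).json())
```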
- Docker and Docker Compose installed
- `.env` file configured with required environment variables
To build and start both the backend and frontend in containers:
docker-compose up --build
Frontend → http://localhost:3000
Backend health → http://localhost:8000/api/health
Backend only:
docker build -t knowledge-space-backend ./backend
docker run -p 8000:8000 --env-file .env knowledge-space-backend
Frontend only:
docker build -t knowledge-space-frontend ./frontend
docker run -p 3000:3000 knowledge-space-frontend
This repository provides a set of Python scripts and modules to ingest, clean, and enrich neuroscience metadata from Google Cloud Storage, as well as scrape identifiers and references from linked resources.
- Elasticsearch Scraping: The `ksdata_scraping.py` script harvests raw dataset records directly from our Elasticsearch cluster and writes them to GCS. It uses a Point-In-Time (PIT) scroll to page through each index safely, authenticating via credentials stored in your environment.
- GCS I/O: Download raw JSON lists from `gs://ks_datasets/raw_dataset/...` and upload preprocessed outputs to `gs://ks_datasets/preprocessed_data/...`.
- HTML Cleaning: Strip or convert embedded HTML (e.g. `<a>` tags) into plain text or Markdown.
- URL Extraction: Find and dedupe all links in descriptions and metadata for later retrieval.
- Chunk Construction: Build semantic "chunks" by concatenating fields (title, description, context labels, etc.) for downstream vectorization; see the sketch after this list.
- Metadata Filters: Assemble structured metadata dictionaries (`species`, `region`, `keywords`, `identifier1…n`, etc.) for each record.
- Per-Datasource Preprocessing: Each data source has its own preprocessing script (e.g. `scr_017041_dandi.py`, `scr_006274_neuroelectro_ephys.py`) saved in `gs://ks_datasets/preprocessed_data/`.
- Extensible Configs: Easily add new datasources by updating GCS paths and field mappings.
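The chunk and metadata-filter construction can be pictured roughly as follows (a simplified sketch; the `build_chunk` helper and field names are illustrative, not the actual per-datasource mappings):

```python
def build_chunk(record: dict) -> tuple[str, dict]:
    # Concatenate descriptive fields into a single text "chunk" for vectorization.
    chunk_text = " | ".join(
        part
        for part in (
            record.get("title", ""),
            record.get("description", ""),
            " ".join(record.get("context_labels", [])),
        )
        if part
    )
    # Structured metadata filters kept alongside the chunk.
    metadata = {
        "species": record.get("species"),
        "region": record.get("region"),
        "keywords": record.get("keywords", []),
        "identifiers": record.get("identifiers", []),
    }
    return chunk_text, metadata


# Toy usage:
chunk, meta = build_chunk({
    "title": "Motor cortex recordings",
    "description": "Extracellular recordings from mouse M1.",
    "species": "Mus musculus",
})
```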
To update the vector store with new datasets from Knowledge Space, run:
python data_processing/full_pipeline.py
The script performs a complete data processing workflow:
- Scrapes all data - Runs preprocessing scripts to collect data from Knowledge Space datasources
- Generates hashes - Creates unique hash-based datapoint IDs for all chunks
- Matches BigQuery datapoint IDs - Queries existing data to find what's already processed
- Selects new/unique data - Identifies only new chunks that need processing
- Creates embeddings - Generates vector embeddings for new chunks only
- Upserts to vector store - Uploads new embeddings to Vertex AI Matching Engine
- Inserts to BigQuery - Stores new chunk metadata and content
This completes the update process with only new data, avoiding reprocessing existing content.
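The hash-based selection of new data (steps 2–4 above) can be sketched roughly like this (illustrative only; `datapoint_id` and `select_new_chunks` are hypothetical names, not the actual functions in `full_pipeline.py`):

```python
import hashlib


def datapoint_id(chunk_text: str) -> str:
    # Stable content hash so the same chunk always maps to the same
    # datapoint ID across pipeline runs.
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], existing_ids: set[str]) -> dict[str, str]:
    # Keep only chunks whose IDs are not already in BigQuery; embeddings are then
    # generated and upserted to the vector store for these new chunks only.
    new_chunks: dict[str, str] = {}
    for text in chunks:
        dp_id = datapoint_id(text)
        if dp_id not in existing_ids:
            new_chunks[dp_id] = text
    return new_chunks
```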
- VM: Debian/Ubuntu server with Docker & Docker Compose installed
- Firewall: Open ports 80 and 443 (http-server, https-server tags on GCP)
- DNS: Domain pointing to your server's external IP
- SSL: Caddy will auto-provision Let's Encrypt certificates
- Clean Previous Deployments:

cd ~/knowledge-space-agent || true

# Stop current stack
sudo docker compose down || true

# Clean Docker cache and old images
sudo docker system prune -af
sudo docker builder prune -af

# Optional: Clear HF model cache (will re-download on first use)
sudo docker volume rm knowledge-space-agent_hf_cache 2>/dev/null || true

# Stop host nginx if installed
sudo systemctl stop nginx || true
sudo systemctl disable nginx || true
- Create Required Configuration Files:

Environment file: Create `.env` based on `.env.template` with your specific values.

Caddy configuration (`Caddyfile`):

your-domain.com, www.your-domain.com {
    reverse_proxy frontend:80
    encode gzip
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    }
}

Frontend Nginx: The nginx configuration is already provided in `frontend/nginx.conf`.
- Deploy Stack:

cd ~/knowledge-space-agent
sudo docker compose up -d --build
sudo docker compose ps
- Verify Deployment:

# Check services are running
sudo docker compose ps

# Test local endpoints
curl -I http://127.0.0.1/
curl -sS http://127.0.0.1/api/health

# Test public HTTPS
curl -I https://your-domain.com/
curl -sS https://your-domain.com/api/health
View logs:
sudo docker compose logs -f backend
sudo docker compose logs -f frontend
sudo docker compose logs -f caddy
Update and redeploy:
git pull
sudo docker compose up -d --build
Status check:
sudo docker compose ps
Backend unhealthy:
sudo docker inspect -f '{{json .State.Health}}' knowledge-space-agent-backend-1
502/504 errors:
sudo docker exec -it knowledge-space-agent-frontend-1 sh -c 'wget -S -O- http://backend:8000/health'
DNS issues:
dig +short your-domain.com
curl -s -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip
- Development: http://localhost:8000
- Production: https://your-domain.com
GET /
- Description: Root endpoint, returns service status
- Response:
{ "message": "KnowledgeSpace AI Backend is running", "version": "2.0.0" }
GET /health
- Description: Basic health check for Docker/load balancers
- Response:
{ "status": "healthy", "timestamp": "2024-01-01T12:00:00.000Z", "service": "knowledge-space-agent-backend", "version": "2.0.0" }
GET /api/health
- Description: Detailed health check with component status
- Response:
{ "status": "healthy", "version": "2.0.0", "components": { "vector_search": "enabled|disabled", "llm": "enabled|disabled", "keyword_search": "enabled" }, "timestamp": "2024-01-01T12:00:00.000Z" }
POST /api/chat
- Description: Send a query to the neuroscience assistant
- Request Body:
{ "query": "Find datasets about motor cortex recordings", "session_id": "optional-session-id", "reset": false }
- Response:
{ "response": "I found several datasets related to motor cortex recordings...", "metadata": { "process_time": 2.5, "session_id": "default", "timestamp": "2024-01-01T12:00:00.000Z", "reset": false } }
POST /api/session/reset
- Description: Clear conversation history for a session
- Request Body:
{ "session_id": "session-to-reset" }
- Response:
{ "status": "ok", "session_id": "session-to-reset", "message": "Session cleared" }
504 Gateway Timeout
{
"detail": "Request timed out. Please try with a simpler query."
}
500 Internal Server Error
{
"response": "Error: [error description]",
"metadata": {
"error": true,
"session_id": "session-id"
}
}
For required environment variables, see `.env.template` in the project root.
- Environment: Make sure `.env` is present before starting the backend.
- Ports: If ports 5000 or 8000 are in use, adjust scripts/configuration accordingly.
- UV Commands: `uv venv` creates the virtual environment; `uv sync` installs dependencies as defined in your project's config.
- Troubleshooting:
  - Verify Python version (`python --version`) and that dependencies installed correctly.
  - Ensure the `.env` file syntax is correct (no extra quotes).
  - For frontend issues, check Node.js version (`node --version`) and logs in terminal.