😭 GraphRAG is good and powerful, but the official implementation is difficult/painful to read or hack.
😊 This project provides a smaller, faster, cleaner GraphRAG, while maintaining the core functionality (benchmark).
🎁 Clean, readable codebase with well-documented components
👌 Small yet portable (faiss, neo4j, ollama...), asynchronous, and fully typed
🚀 Advanced features: GASL for graph queries, QA generation for training data, query-aware prompts
- Installation
- Quick Start
- How It Works
- Query Modes
- Components & Extensions
- Advanced Features
- Documentation
- Projects Using nano-graphrag
- Contributing
Install from PyPI:
pip install nano-graphrag
Or install from source:
git clone https://github.com/gusye1234/nano-graphrag
cd nano-graphrag
pip install -e .
Requirements:
- Python >= 3.9.11
- OpenAI API key (or use alternative LLMs)
Set up your API key:
export OPENAI_API_KEY="sk-..."
Download sample data:
curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
Build and query a knowledge graph:
from nano_graphrag import GraphRAG, QueryParam
# Initialize GraphRAG
graph_func = GraphRAG(working_dir="./dickens")
# Insert documents (builds knowledge graph ONCE)
with open("./book.txt") as f:
graph_func.insert(f.read())
# Query using different modes (uses the SAME graph)
# Local mode: Fast, entity-focused retrieval
answer = graph_func.query(
"What are the top themes in this story?",
param=QueryParam(mode="local")
)
# Global mode: Comprehensive, community-based analysis (DEFAULT)
answer = graph_func.query(
"What are the top themes in this story?",
param=QueryParam(mode="global")
)
# Naive mode: Simple vector search, no graph traversal
rag_naive = GraphRAG(working_dir="./dickens", enable_naive_rag=True)
answer = rag_naive.query(
"What are the top themes in this story?",
param=QueryParam(mode="naive")
)
Next run: GraphRAG automatically reloads from working_dir—no need to rebuild!
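Because everything is persisted under working_dir, later runs can skip insert entirely. A minimal sketch of a second run, reusing the ./dickens cache from above:

```python
# Second run: reuse the graph persisted in ./dickens (no insert needed).
from nano_graphrag import GraphRAG, QueryParam

graph_func = GraphRAG(working_dir="./dickens")  # reloads existing storage
answer = graph_func.query(
    "Who is the protagonist of the story?",
    param=QueryParam(mode="local"),
)
print(answer)
```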
# All methods have async versions
await graph_func.ainsert(documents)
await graph_func.aquery(query)
# Insert multiple documents at once
graph_func.insert(["TEXT1", "TEXT2", "TEXT3"])
# Incremental insert (no duplicates, uses MD5 hash)
with open("./book.txt") as f:
book = f.read()
half = len(book) // 2
graph_func.insert(book[:half])
graph_func.insert(book[half:])  # No duplication!
There is one unified pipeline for building knowledge graphs:
Documents
↓
[1. Chunking] Split into manageable pieces
↓
[2. Entity Extraction] LLM extracts entities & relationships
↓
[3. Graph Construction] Build knowledge graph
↓
[4. Community Detection] Find clusters (Leiden/Louvain)
↓
[5. Community Reports] Generate summaries of each cluster
↓
Knowledge Graph (ready for querying)
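In code, the whole pipeline is triggered by a single insert call, and queries only read the finished graph. A minimal sketch:

```python
from nano_graphrag import GraphRAG, QueryParam

rag = GraphRAG(working_dir="./cache")

# insert() runs all five stages above: chunking, entity extraction,
# graph construction, community detection, and community reports.
with open("./book.txt") as f:
    rag.insert(f.read())

# query() only reads the stored knowledge graph; nothing is rebuilt.
print(rag.query("Summarize the main topics.", param=QueryParam(mode="global")))
```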
Key Features:
- Entity Extraction: LLM-powered entity and relationship extraction
- Dynamic Entity Types: Query-aware entity type generation
- Community Detection: Leiden/Louvain algorithms for graph clustering
- Incremental Updates: MD5-based deduplication for efficient updates
Once you have a knowledge graph, you can query it in three different ways:
| Mode | Speed | Accuracy | Uses Graph? | Best For |
|---|---|---|---|---|
| Local | Fast ⚡ | High | ✅ Yes (traversal) | Specific entities, direct relationships |
| Global | Slower 🐢 | Highest | ✅ Yes (communities) | Broad themes, comprehensive analysis |
| Naive | Fastest ⚡⚡ | Medium | ❌ No | Simple facts, baseline comparison |
answer = graph_func.query("Who collaborated with Einstein?", param=QueryParam(mode="local"))Process:
- Vector search to find relevant entities
- Graph traversal to get entity neighborhoods (1-2 hops)
- Retrieve source text chunks
- LLM generates answer from local context
Use when: Asking about specific entities or direct relationships
answer = graph_func.query("What are the main research themes?", param=QueryParam(mode="global"))Process:
- Uses pre-computed community structure
- Retrieves community reports (summaries of each cluster)
- Map-reduce: Analyze each community, then synthesize
- LLM generates comprehensive answer
Use when: Asking about overall themes, patterns, or broad topics
Note: This is the DEFAULT mode
# Must enable during initialization
rag = GraphRAG(working_dir="./cache", enable_naive_rag=True)
answer = rag.query("What is X?", param=QueryParam(mode="naive"))Process:
- Simple vector similarity search on text chunks
- No graph traversal or community analysis
- LLM generates answer from top-K chunks
Use when: Simple lookups, baseline comparisons, or when graph isn't needed
# Build graph ONCE
rag = GraphRAG(working_dir="./cache", enable_naive_rag=True)
rag.insert(documents)
# Query the SAME graph in different ways
local_answer = rag.query("question", param=QueryParam(mode="local"))
global_answer = rag.query("question", param=QueryParam(mode="global"))
naive_answer = rag.query("question", param=QueryParam(mode="naive"))
Supported LLM providers:
| Provider | Status | Documentation |
|---|---|---|
| OpenAI | ✅ Built-in | Default (gpt-4o, gpt-4o-mini) |
| Azure OpenAI | ✅ Built-in | .env.example.azure |
| Amazon Bedrock | ✅ Built-in | Example |
| DeepSeek | 📘 Example | Example |
| Ollama | 📘 Example | Example |
| Custom | ✅ Supported | Guide |
Supported embedding models:
| Model | Status | Documentation |
|---|---|---|
| OpenAI | ✅ Built-in | Default (text-embedding-3-small) |
| Amazon Bedrock | ✅ Built-in | Example |
| Sentence-transformers | 📘 Example | Example |
| Custom | ✅ Supported | Guide |
Supported vector databases:
| Database | Status | Documentation |
|---|---|---|
| NanoVectorDB | ✅ Built-in | Default (lightweight) |
| HNSW | ✅ Built-in | Example |
| Milvus Lite | 📘 Example | Example |
| FAISS | 📘 Example | Example |
| Qdrant | 📘 Example | Example |
Supported graph storage backends:
| Storage | Status | Documentation |
|---|---|---|
| NetworkX | ✅ Built-in | Default (in-memory + GraphML) |
| Neo4j | ✅ Built-in | Guide |
GASL is a domain-specific language for LLM-driven graph analysis with hypothesis-driven traversal (HDT).
Traditional RAG: Query → Vector Search → Context → Answer (limited coverage, no exploration)
GASL: Query → Hypothesis → Plan → Execute → Evaluate → Refine → Answer (complete coverage, systematic)
- 🧠 LLM-Driven Planning: Natural language queries → executable graph operations
- 🔄 Hypothesis-Driven Traversal: Iterative exploration with refinement
- 📊 Rich Command Set: 30+ commands for graph analysis
- 💾 State Management: Persistent state across commands
- 🔍 Provenance Tracking: Trace results to source documents
python gasl_main.py \
--working-dir /path/to/graph \
--query "Create a histogram of how often author names appear" \
--max-iterations 5
Generated Plan (by LLM):
{
"hypothesis": "Author names are in PERSON entity descriptions",
"commands": [
"FIND nodes with entity_type=PERSON AS authors",
"PROCESS authors with instruction: Extract author name from description AS names",
"ADD_FIELD authors field: author_name = names",
"COUNT authors field author_name AS histogram"
]
}
Command categories:
- Discovery: DECLARE, FIND, SELECT, SET
- Processing: PROCESS, CLASSIFY, UPDATE, COUNT
- Graph Navigation: GRAPHWALK, GRAPHCONNECT, SUBGRAPH, GRAPHPATTERN
- Data Combination: JOIN, MERGE, COMPARE
- Object Creation: CREATE_NODES, CREATE_EDGES, GENERATE
Generate high-quality reasoning questions from knowledge graphs for training data (e.g., models with <think> tokens).
Most sophisticated generator with diversity tracking and quality filtering:
python generate_reasoning_qa.py \
--working-dir /path/to/graph/graphrag_cache \
--num-questions 100 \
--min-quality-score 7
Features:
- 🎯 Topic Diversity: Tracks last 20 topics, avoids repetition
- ⭐ Quality Filtering: Only questions scoring ≥7/10 pass
- 🧬 Multiple Reasoning Types: Mechanistic, comparative, causal, predictive
- 🔬 Scientific Rigor: Self-contained, no "the text" references
Example Output:
{
"question": "Which mechanism best explains how antibiotic resistance emerges?",
"choices": {"A": "...", "B": "...", "C": "...", "D": "...", "E": "...", "F": "...", "G": "...", "H": "..."},
"answer": "B",
"reasoning_type": "mechanistic",
"quality_score": 9
}
Multi-Hop QA: Chain facts across graph paths
python generate_multihop_qa.py --working-dir /path/to/graph --num-questions 50 --path-length 2
Synthesis QA: Integrate information from multiple sources
python generate_synthesis_qa.py --working-dir /path/to/graph --num-questions 30
Logic Puzzle QA: Constraint satisfaction problems
python generate_logic_puzzle_qa.py --working-dir /path/to/graph --num-questions 20
All prompts and processing are optimized based on your specific query:
from nano_graphrag.prompt_system import QueryAwarePromptSystem, set_prompt_system
# Set up query-aware prompts
prompt_system = QueryAwarePromptSystem(llm_func=your_llm)
set_prompt_system(prompt_system)
# Now entity types and extraction adapt to your queries!
Features:
- Dynamic Entity Types: Generated based on query + content
- Optimized Prompts: LLM optimizes prompts for specific queries
- Content-Adaptive: Processing tailored to document domain
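A hedged end-to-end sketch of combining this with a normal build (the exact wiring between the prompt system and GraphRAG may differ; your_llm below is a placeholder for your own async completion function):

```python
from nano_graphrag import GraphRAG, QueryParam
from nano_graphrag.prompt_system import QueryAwarePromptSystem, set_prompt_system

async def your_llm(prompt, system_prompt=None, history_messages=[], **kwargs) -> str:
    # Placeholder: call your real LLM API here.
    return "..."

# Assumption: registering the prompt system before building is sufficient.
set_prompt_system(QueryAwarePromptSystem(llm_func=your_llm))

rag = GraphRAG(working_dir="./cache")
with open("./papers.txt") as f:
    rag.insert(f.read())  # entity types and extraction now adapt to the content
answer = rag.query("Which mechanisms are discussed?", param=QueryParam(mode="local"))
```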
- 📖 Architecture Guide - System design, components, data flow
- 🔧 GASL Guide - Complete GASL reference with examples
- 📝 QA Generation Guide - Generate training data from graphs
- 🔌 Storage Backends - Configure KV, vector, and graph storage
- 🌐 Neo4j Integration - Use Neo4j as graph backend
- ❓ FAQ - Frequently asked questions
- 🗺️ Roadmap - Future development plans
- 🤝 Contributing - Contribution guidelines
- 📊 Benchmarks - Performance comparisons
async def my_llm_complete(prompt, system_prompt=None, history_messages=[], **kwargs) -> str:
hashing_kv = kwargs.pop("hashing_kv", None) # Optional cache
response = await your_llm_api(prompt, **kwargs)
return response
graph_func = GraphRAG(
best_model_func=my_llm_complete,
best_model_max_token_size=8192,
best_model_max_async=16
)
import numpy as np
from nano_graphrag._utils import wrap_embedding_func_with_attrs
@wrap_embedding_func_with_attrs(embedding_dim=384, max_token_size=512)
async def my_embedding(texts: list[str]) -> np.ndarray:
return your_model.encode(texts)
graph_func = GraphRAG(
embedding_func=my_embedding,
embedding_batch_num=32
)
from nano_graphrag.base import BaseVectorStorage
class MyVectorDB(BaseVectorStorage):
async def upsert(self, data): ...
async def query(self, query, top_k): ...
graph_func = GraphRAG(vector_db_storage_cls=MyVectorDB)
# Set environment variables (see .env.example.azure)
graph_func = GraphRAG(
working_dir="./cache",
using_azure_openai=True
)
# See examples/using_ollama_as_llm.py
from examples.using_ollama_as_llm import ollama_model_complete, ollama_embedding
graph_func = GraphRAG(
best_model_func=ollama_model_complete,
cheap_model_func=ollama_model_complete,
embedding_func=ollama_embedding
)
- Medical Graph RAG - Graph RAG for Medical Data
- LightRAG - Simple and Fast Retrieval-Augmented Generation
- fast-graphrag - Adaptive RAG system
- HiRAG - Hierarchical Knowledge RAG
❤️ PRs are welcome if your project uses nano-graphrag!
Differences from the official GraphRAG:
- nano-graphrag does not implement the covariates feature from the original GraphRAG
- Global search differs from the original: it considers the top-K communities (default: 512) instead of running map-reduce over all communities
  - Control this with QueryParam(global_max_consider_community=512)
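For example, to change how many communities global search considers (a sketch using the parameter above; 256 is just an illustrative value):

```python
from nano_graphrag import GraphRAG, QueryParam

rag = GraphRAG(working_dir="./cache")
answer = rag.query(
    "What are the major themes across all documents?",
    param=QueryParam(mode="global", global_max_consider_community=256),
)
```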
Contributions are welcome! Read the Contributing Guide before submitting PRs.
Areas for contribution:
- Additional storage backends (Pinecone, Weaviate, etc.)
- More LLM providers
- Performance optimizations
- Documentation improvements
- Bug fixes and tests
If you use nano-graphrag in your research, please cite:
@software{nano-graphrag,
title = {nano-graphrag: A Simple, Easy-to-Hack GraphRAG Implementation},
author = {Gusye},
year = {2024},
url = {https://github.com/gusye1234/nano-graphrag}
}
MIT License - see LICENSE for details
- 💬 Discord
- 🐛 Issues
- 📢 Discussions
⭐ Star this repo if you find it useful!
Looking for a multi-user RAG solution? Check out memobase