SEEDtk/nano-graphrag

A simple, easy-to-hack GraphRAG implementation


Why nano-graphrag?

😭 GraphRAG is good and powerful, but the official implementation is difficult/painful to read or hack.

😊 This project provides a smaller, faster, cleaner GraphRAG while maintaining the core functionality (see the Benchmark section below).

🎁 Clean, readable codebase with well-documented components

👌 Small yet portable (faiss, neo4j, ollama...), asynchronous, and fully typed

🚀 Advanced features: GASL for graph queries, QA generation for training data, query-aware prompts


Table of Contents

  • Installation
  • Quick Start
  • How It Works
  • Query Modes
  • Components & Extensions
  • Advanced Features
  • Documentation
  • Configuration Examples
  • Benchmark
  • Known Limitations
  • Contributing
  • Citation
  • License

Installation

From PyPI (Stable)

pip install nano-graphrag

From Source (Latest)

git clone https://github.com/gusye1234/nano-graphrag
cd nano-graphrag
pip install -e .

Requirements


Quick Start

Basic Usage

Set up your API key:

export OPENAI_API_KEY="sk-..."

Download sample data:

curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt

Build and query a knowledge graph:

from nano_graphrag import GraphRAG, QueryParam

# Initialize GraphRAG
graph_func = GraphRAG(working_dir="./dickens")

# Insert documents (builds knowledge graph ONCE)
with open("./book.txt") as f:
    graph_func.insert(f.read())

# Query using different modes (uses the SAME graph)
# Local mode: Fast, entity-focused retrieval
answer = graph_func.query(
    "What are the top themes in this story?",
    param=QueryParam(mode="local")
)

# Global mode: Comprehensive, community-based analysis (DEFAULT)
answer = graph_func.query(
    "What are the top themes in this story?",
    param=QueryParam(mode="global")
)

# Naive mode: Simple vector search, no graph traversal
rag_naive = GraphRAG(working_dir="./dickens", enable_naive_rag=True)
answer = rag_naive.query(
    "What are the top themes in this story?",
    param=QueryParam(mode="naive")
)

On the next run, GraphRAG automatically reloads the existing graph from working_dir; there is no need to rebuild.
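
A minimal sketch of the reload pattern: point at the same working_dir and query directly.

from nano_graphrag import GraphRAG, QueryParam

# Same working_dir as before: the existing graph is loaded from disk
rag = GraphRAG(working_dir="./dickens")
answer = rag.query("Who is Scrooge?", param=QueryParam(mode="local"))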

Async Support

# All methods have async versions
await graph_func.ainsert(documents)
await graph_func.aquery(query)
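
If you are running a script rather than a notebook, a minimal asyncio entry point looks like this (a sketch assuming the same arguments as the synchronous calls above):

import asyncio

from nano_graphrag import GraphRAG, QueryParam

async def main():
    rag = GraphRAG(working_dir="./dickens")
    with open("./book.txt") as f:
        await rag.ainsert(f.read())
    answer = await rag.aquery(
        "What are the top themes in this story?",
        param=QueryParam(mode="global"),
    )
    print(answer)

asyncio.run(main())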

Batch Operations

# Insert multiple documents at once
graph_func.insert(["TEXT1", "TEXT2", "TEXT3"])

# Incremental insert (no duplicates, uses MD5 hash)
with open("./book.txt") as f:
    book = f.read()
    half = len(book) // 2
    graph_func.insert(book[:half])
    graph_func.insert(book[half:])  # No duplication!
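
The deduplication key is a hash of the content itself, so re-inserting identical text is a no-op. A rough sketch of the idea (the store here is a hypothetical stand-in, not the library's internal KV storage):

import hashlib

store = {}  # hypothetical key-value store keyed by content hash

def insert_once(text: str) -> None:
    key = "doc-" + hashlib.md5(text.encode()).hexdigest()
    if key in store:
        return  # identical content was already inserted; skip it
    store[key] = text

insert_once("TEXT1")
insert_once("TEXT1")  # second call is a no-op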

How It Works

Graph Construction (One Way)

There is one unified pipeline for building knowledge graphs:

Documents
    ↓
[1. Chunking] Split into manageable pieces
    ↓
[2. Entity Extraction] LLM extracts entities & relationships
    ↓
[3. Graph Construction] Build knowledge graph
    ↓
[4. Community Detection] Find clusters (Leiden/Louvain)
    ↓
[5. Community Reports] Generate summaries of each cluster
    ↓
Knowledge Graph (ready for querying)
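
For intuition on step 4, community detection is standard graph clustering. A tiny self-contained illustration using networkx's Louvain implementation (networkx >= 3.0; this is not nano-graphrag's own code):

import networkx as nx

# Toy entity graph with two loosely connected clusters
G = nx.Graph()
G.add_edges_from([
    ("Scrooge", "Marley"), ("Marley", "Ghost"),
    ("Cratchit", "Tiny Tim"), ("Tiny Tim", "Emily"),
    ("Ghost", "Cratchit"),  # weak bridge between the clusters
])

communities = nx.community.louvain_communities(G, seed=42)
print(communities)  # a list of node sets, one per detected community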

Key Features:

  • Entity Extraction: LLM-powered entity and relationship extraction
  • Dynamic Entity Types: Query-aware entity type generation
  • Community Detection: Leiden/Louvain algorithms for graph clustering
  • Incremental Updates: MD5-based deduplication for efficient updates

Query Modes

Once you have a knowledge graph, you can query it in three different ways:

Comparison Table

| Mode | Speed | Accuracy | Uses Graph? | Best For |
| --- | --- | --- | --- | --- |
| Local | Fast ⚡ | High | ✅ Yes (traversal) | Specific entities, direct relationships |
| Global | Slower 🐢 | Highest | ✅ Yes (communities) | Broad themes, comprehensive analysis |
| Naive | Fastest ⚡⚡ | Medium | ❌ No | Simple facts, baseline comparison |

How Each Mode Works

1. Local Mode (Fast, Entity-Focused)

answer = graph_func.query("Who collaborated with Einstein?", param=QueryParam(mode="local"))

Process:

  1. Vector search to find relevant entities
  2. Graph traversal to get entity neighborhoods (1-2 hops)
  3. Retrieve source text chunks
  4. LLM generates answer from local context

Use when: Asking about specific entities or direct relationships
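
To make step 2 above concrete, here is an illustrative sketch of 1-hop neighborhood expansion on a networkx graph (not the library's actual retrieval code):

import networkx as nx

def one_hop_context(graph: nx.Graph, seed_entities: list[str]) -> set[str]:
    # Expand each vector-search hit to its immediate graph neighborhood
    context = set(seed_entities)
    for entity in seed_entities:
        if entity in graph:
            context.update(graph.neighbors(entity))
    return context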

2. Global Mode (Comprehensive, Community-Based)

answer = graph_func.query("What are the main research themes?", param=QueryParam(mode="global"))

Process:

  1. Uses pre-computed community structure
  2. Retrieves community reports (summaries of each cluster)
  3. Map-reduce: Analyze each community, then synthesize
  4. LLM generates comprehensive answer

Use when: Asking about overall themes, patterns, or broad topics

Note: This is the DEFAULT mode
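
A hedged sketch of the map-reduce step; community_reports and llm are placeholders, and the prompts are illustrative rather than the library's:

def global_answer(query: str, community_reports: list[str], llm) -> str:
    # Map: ask what each community summary contributes to the query
    partials = [
        llm(f"Community summary:\n{report}\n\nWhat does this say about: {query}?")
        for report in community_reports
    ]
    # Reduce: synthesize the partial answers into one final response
    joined = "\n---\n".join(partials)
    return llm(f"Synthesize these partial answers into one answer to '{query}':\n{joined}")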

3. Naive Mode (Simple Vector Search)

# Must enable during initialization
rag = GraphRAG(working_dir="./cache", enable_naive_rag=True)
answer = rag.query("What is X?", param=QueryParam(mode="naive"))

Process:

  1. Simple vector similarity search on text chunks
  2. No graph traversal or community analysis
  3. LLM generates answer from top-K chunks

Use when: Simple lookups, baseline comparisons, or when graph isn't needed
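
The retrieval step reduces to top-K cosine similarity over chunk embeddings; a minimal numpy sketch (illustrative, not the library's code):

import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunks: list[str], k: int = 5) -> list[str]:
    # Cosine similarity between the query and every chunk embedding
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]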

Switching Between Modes

# Build graph ONCE
rag = GraphRAG(working_dir="./cache", enable_naive_rag=True)
rag.insert(documents)

# Query the SAME graph in different ways
local_answer = rag.query("question", param=QueryParam(mode="local"))
global_answer = rag.query("question", param=QueryParam(mode="global"))
naive_answer = rag.query("question", param=QueryParam(mode="naive"))

Components & Extensions

LLM Providers

| Provider | Status | Documentation |
| --- | --- | --- |
| OpenAI | ✅ Built-in | Default (gpt-4o, gpt-4o-mini) |
| Azure OpenAI | ✅ Built-in | .env.example.azure |
| Amazon Bedrock | ✅ Built-in | Example |
| DeepSeek | 📘 Example | Example |
| Ollama | 📘 Example | Example |
| Custom | ✅ Supported | Guide |

Embedding Models

| Model | Status | Documentation |
| --- | --- | --- |
| OpenAI | ✅ Built-in | Default (text-embedding-3-small) |
| Amazon Bedrock | ✅ Built-in | Example |
| Sentence-transformers | 📘 Example | Example |
| Custom | ✅ Supported | Guide |

Vector Databases

| Database | Status | Documentation |
| --- | --- | --- |
| NanoVectorDB | ✅ Built-in | Default (lightweight) |
| HNSW | ✅ Built-in | Example |
| Milvus Lite | 📘 Example | Example |
| FAISS | 📘 Example | Example |
| Qdrant | 📘 Example | Example |

Graph Storage

| Storage | Status | Documentation |
| --- | --- | --- |
| NetworkX | ✅ Built-in | Default (in-memory + GraphML) |
| Neo4j | ✅ Built-in | Guide |

Advanced Features

GASL (Graph Analysis & Scripting Language)

GASL is a domain-specific language for LLM-driven graph analysis with hypothesis-driven traversal (HDT).

Why GASL?

Traditional RAG: Query → Vector Search → Context → Answer (limited coverage, no exploration)

GASL: Query → Hypothesis → Plan → Execute → Evaluate → Refine → Answer (complete coverage, systematic)
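
In rough pseudocode, the loop looks like this (a sketch of the control flow only; llm and execute are placeholder callables, not GASL's actual implementation):

def hdt_loop(query: str, state: dict, llm, execute, max_iterations: int = 5) -> dict:
    hypothesis = llm(f"Form a hypothesis about the graph for: {query}")
    for _ in range(max_iterations):
        plan = llm(f"Write GASL commands to test: {hypothesis}")
        state = execute(plan, state)  # run the commands against the graph, update state
        verdict = llm(f"Does {state} answer '{query}'? If not, what is missing?")
        if verdict.startswith("answered"):
            break  # hypothesis confirmed; stop refining
        hypothesis = llm(f"Refine the hypothesis given: {verdict}")
    return state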

Key Features

  • 🧠 LLM-Driven Planning: Natural language queries → executable graph operations
  • 🔄 Hypothesis-Driven Traversal: Iterative exploration with refinement
  • 📊 Rich Command Set: 30+ commands for graph analysis
  • 💾 State Management: Persistent state across commands
  • 🔍 Provenance Tracking: Trace results to source documents

Quick Example

python gasl_main.py \
  --working-dir /path/to/graph \
  --query "Create a histogram of how often author names appear" \
  --max-iterations 5

Generated Plan (by LLM):

{
  "hypothesis": "Author names are in PERSON entity descriptions",
  "commands": [
    "FIND nodes with entity_type=PERSON AS authors",
    "PROCESS authors with instruction: Extract author name from description AS names",
    "ADD_FIELD authors field: author_name = names",
    "COUNT authors field author_name AS histogram"
  ]
}

Core GASL Commands

  • Discovery: DECLARE, FIND, SELECT, SET
  • Processing: PROCESS, CLASSIFY, UPDATE, COUNT
  • Graph Navigation: GRAPHWALK, GRAPHCONNECT, SUBGRAPH, GRAPHPATTERN
  • Data Combination: JOIN, MERGE, COMPARE
  • Object Creation: CREATE_NODES, CREATE_EDGES, GENERATE

👉 Full GASL Guide


QA Generation

Generate high-quality reasoning questions from knowledge graphs for use as training data (e.g., for reasoning models that emit <think> tokens).

Reasoning QA (Recommended)

Most sophisticated generator with diversity tracking and quality filtering:

python generate_reasoning_qa.py \
  --working-dir /path/to/graph/graphrag_cache \
  --num-questions 100 \
  --min-quality-score 7

Features:

  • 🎯 Topic Diversity: Tracks last 20 topics, avoids repetition
  • Quality Filtering: Only questions scoring ≥7/10 pass (see the sketch below)
  • 🧬 Multiple Reasoning Types: Mechanistic, comparative, causal, predictive
  • 🔬 Scientific Rigor: Self-contained, no "the text" references
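
The first two features above combine into a simple gate; a hedged sketch (the 20-topic window and the 7/10 threshold come from the list above, everything else is illustrative):

from collections import deque

recent_topics = deque(maxlen=20)  # remember the last 20 topics to avoid repetition

def keep(candidate: dict) -> bool:
    # candidate is a hypothetical dict with "topic" and "quality_score" keys
    if candidate["topic"] in recent_topics:
        return False  # too close to a recent question
    if candidate["quality_score"] < 7:
        return False  # quality gate: only >= 7/10 passes
    recent_topics.append(candidate["topic"])
    return True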

Example Output:

{
  "question": "Which mechanism best explains how antibiotic resistance emerges?",
  "choices": {"A": "...", "B": "...", "C": "...", "D": "...", "E": "...", "F": "...", "G": "...", "H": "..."},
  "answer": "B",
  "reasoning_type": "mechanistic",
  "quality_score": 9
}

Other QA Types

Multi-Hop QA: Chain facts across graph paths

python generate_multihop_qa.py --working-dir /path/to/graph --num-questions 50 --path-length 2

Synthesis QA: Integrate information from multiple sources

python generate_synthesis_qa.py --working-dir /path/to/graph --num-questions 30

Logic Puzzle QA: Constraint satisfaction problems

python generate_logic_puzzle_qa.py --working-dir /path/to/graph --num-questions 20

👉 Full QA Generation Guide


Query-Aware Processing

All prompts and processing are optimized based on your specific query:

from nano_graphrag.prompt_system import QueryAwarePromptSystem, set_prompt_system

# Set up query-aware prompts
prompt_system = QueryAwarePromptSystem(llm_func=your_llm)
set_prompt_system(prompt_system)

# Now entity types and extraction adapt to your queries!

Features:

  • Dynamic Entity Types: Generated from the query and document content (see the sketch below)
  • Optimized Prompts: LLM optimizes prompts for specific queries
  • Content-Adaptive: Processing tailored to document domain
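
A hedged sketch of what dynamic entity-type generation might look like (llm is a placeholder callable; this is not the prompt_system's actual code):

def dynamic_entity_types(query: str, sample_text: str, llm) -> list[str]:
    # Ask the LLM which entity types matter for this query and corpus
    prompt = (
        f"Query: {query}\n"
        f"Sample text: {sample_text[:500]}\n"
        "List the five most useful entity types to extract, comma-separated."
    )
    return [t.strip() for t in llm(prompt).split(",")]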

Documentation

Core Documentation

Additional Resources

  • FAQ - Frequently asked questions
  • 🗺️ Roadmap - Future development plans
  • 🤝 Contributing - Contribution guidelines
  • 📊 Benchmarks - Performance comparisons

Configuration Examples

Custom LLM

async def my_llm_complete(prompt, system_prompt=None, history_messages=[], **kwargs) -> str:
    hashing_kv = kwargs.pop("hashing_kv", None)  # Optional cache
    response = await your_llm_api(prompt, **kwargs)
    return response

graph_func = GraphRAG(
    best_model_func=my_llm_complete,
    best_model_max_token_size=8192,
    best_model_max_async=16
)

Custom Embedding

import numpy as np

from nano_graphrag._utils import wrap_embedding_func_with_attrs

@wrap_embedding_func_with_attrs(embedding_dim=384, max_token_size=512)
async def my_embedding(texts: list[str]) -> np.ndarray:
    # your_model is a placeholder for any encoder returning (len(texts), 384) vectors
    return your_model.encode(texts)

graph_func = GraphRAG(
    embedding_func=my_embedding,
    embedding_batch_num=32
)

Custom Storage

from nano_graphrag.base import BaseVectorStorage

class MyVectorDB(BaseVectorStorage):
    async def upsert(self, data): ...         # store embeddings for {id: {"content": ...}} records
    async def query(self, query, top_k): ...  # return the top_k entries most similar to query

graph_func = GraphRAG(vector_db_storage_cls=MyVectorDB)

Azure OpenAI

# Set environment variables (see .env.example.azure)
graph_func = GraphRAG(
    working_dir="./cache",
    using_azure_openai=True
)

Ollama (Local LLM)

# See examples/using_ollama_as_llm.py
from examples.using_ollama_as_llm import ollama_model_complete, ollama_embedding

graph_func = GraphRAG(
    best_model_func=ollama_model_complete,
    cheap_model_func=ollama_model_complete,
    embedding_func=ollama_embedding
)

Benchmark


Projects Using nano-graphrag

❤️ PRs are welcome if your project uses nano-graphrag!


Known Limitations

  • nano-graphrag does not implement the covariates feature from the original GraphRAG
  • Global search differs from original: uses top-K communities (default: 512) instead of map-reduce over all communities
    • Control with QueryParam(global_max_consider_community=512)

Contributing

Contributions are welcome! Read the Contributing Guide before submitting PRs.

Areas for contribution:

  • Additional storage backends (Pinecone, Weaviate, etc.)
  • More LLM providers
  • Performance optimizations
  • Documentation improvements
  • Bug fixes and tests

Citation

If you use nano-graphrag in your research, please cite:

@software{nano-graphrag,
  title = {nano-graphrag: A Simple, Easy-to-Hack GraphRAG Implementation},
  author = {Gusye},
  year = {2024},
  url = {https://github.com/gusye1234/nano-graphrag}
}

License

MIT License - see LICENSE for details


Community


⭐ Star this repo if you find it useful!

Looking for a multi-user RAG solution? Check out memobase
