A Python tool for extracting, analyzing, and managing metadata from Markdown-based knowledge bases. The processor parses Markdown files to extract tags, headings, links, and other structured information, supporting advanced knowledge management workflows.
- Extracts metadata, tags, and structural elements from Markdown files
- Modular architecture for analyzers, extractors, and enrichers
- Easily extensible for new metadata types or processing logic
- Modern command-line interface with rich terminal UI
- Interactive mode for guided workflows
- Real-time file watching and continuous processing
- Comprehensive test suite
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/knowledgebase-processor.git
  cd knowledgebase-processor
  ```

- Install Poetry (if not already installed):

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Install dependencies:

  ```bash
  poetry install
  ```
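To confirm the install, the package should be importable from Poetry's environment. A minimal sanity check (the module path `knowledgebase_processor.cli` is taken from the module invocation shown later in this README):

```python
# Run inside the project: poetry run python check_install.py
import importlib

# An ImportError here means dependencies were not installed correctly.
cli = importlib.import_module("knowledgebase_processor.cli")
print(f"OK: imported {cli.__name__}")
```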
The Knowledge Base Processor provides a modern command-line interface with two aliases: `kb` and `kbp`.
```bash
# Initialize a new knowledge base in the current directory
kb init

# Process documents in the current directory
kb scan

# Search for content
kb search "todo items"

# Process and sync to SPARQL endpoint in one command
kb publish --endpoint http://localhost:3030/kb

# Enter interactive mode (just run kb without arguments)
kb
```

Configure the processor for your documents:
```bash
kb init                 # Interactive setup
kb init ~/Documents     # Initialize specific directory
kb init --name "My KB"  # Set project name
```

Process documents and extract knowledge entities:
```bash
kb scan                          # Scan current directory
kb scan ~/Documents              # Scan specific directory
kb scan --pattern "*.md"         # Only process Markdown files
kb scan --watch                  # Watch for changes
kb scan --sync --endpoint <url>  # Process + sync to SPARQL
```

Search your processed knowledge base:
```bash
kb search "machine learning"     # Full-text search
kb search --type todo "project"  # Search specific entity types
kb search --tag important        # Search by tags
```

Process and sync to SPARQL endpoint in one command:
```bash
kb publish                   # Use default endpoint
kb publish --endpoint <url>  # Specify endpoint
kb publish --watch           # Continuous publishing mode
kb publish --graph <uri>     # Specify named graph
```

Sync already processed data to SPARQL endpoint:
```bash
kb sync                   # Sync to default endpoint
kb sync --endpoint <url>  # Specify endpoint
kb sync --clear           # Clear endpoint before sync
```
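Once data is synced, the triples can be queried directly from the endpoint. A minimal sketch using the SPARQLWrapper library (the `/query` path is an assumption based on a default Fuseki setup; adjust it to match your endpoint):

```python
# pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumes a Fuseki-style endpoint; change the URL to match your configuration.
sparql = SPARQLWrapper("http://localhost:3030/kb/query")
sparql.setReturnFormat(JSON)

# Count all triples in the store as a quick sanity check after `kb sync`.
sparql.setQuery("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
results = sparql.query().convert()
print("Triples in store:", results["results"]["bindings"][0]["n"]["value"])
```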
Display knowledge base statistics and status:

```bash
kb status             # Show current status
kb status --detailed  # Show detailed statistics
```

View and manage configuration:
```bash
kb config show                # Display current config
kb config set endpoint <url>  # Set SPARQL endpoint
kb config reset               # Reset to defaults
```

Run `kb` without any arguments to enter interactive mode with a guided interface:
```bash
kb
```

Generate RDF/TTL files during processing:
```bash
kb scan --rdf-output ./rdf_output
```

Watch for file changes and automatically process:
```bash
kb scan --watch
kb publish --watch
```
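Conceptually, watch mode keeps a filesystem observer running and re-processes files as they change. An illustrative sketch with the watchdog library (the general pattern, not the processor's actual implementation):

```python
# pip install watchdog
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class MarkdownHandler(FileSystemEventHandler):
    """React to Markdown changes; a real handler would re-run extraction."""

    def on_modified(self, event):
        if not event.is_directory and event.src_path.endswith(".md"):
            print(f"Re-processing {event.src_path}")

observer = Observer()
observer.schedule(MarkdownHandler(), path=".", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)  # keep the main thread alive while the observer works
except KeyboardInterrupt:
    observer.stop()
observer.join()
```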
```bash
# Run CLI as a module
python -m knowledgebase_processor.cli --help
```

Run all tests using the provided script:
```bash
poetry run python scripts/run_tests.py
```

Or use pytest directly:
```bash
poetry run pytest
poetry run pytest tests/cli/  # Test CLI specifically
```

The processor uses a service-oriented architecture with clear separation between:
- CLI Layer: User interface and command handling
- Service Layer: Business logic and orchestration
- Data Layer: Document processing and persistence
See ARCHITECTURE_V2.md for detailed architecture documentation.
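As a rough illustration of that layering (every class and method name below is hypothetical, not the processor's actual API):

```python
# Illustrative only: hypothetical names showing the CLI -> service -> data flow.
from pathlib import Path

class DocumentStore:                      # Data layer: persistence
    def save(self, path: Path, metadata: dict) -> None:
        print(f"persisting metadata for {path}")

class ScanService:                        # Service layer: orchestration
    def __init__(self, store: DocumentStore) -> None:
        self.store = store

    def scan(self, directory: Path) -> int:
        count = 0
        for md_file in directory.glob("*.md"):
            self.store.save(md_file, {"size": md_file.stat().st_size})
            count += 1
        return count

def cli_scan(directory: str) -> None:     # CLI layer: argument handling only
    n = ScanService(DocumentStore()).scan(Path(directory))
    print(f"Processed {n} document(s)")

if __name__ == "__main__":
    cli_scan(".")
```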
The processor can be configured via:
- Command-line arguments (highest priority)
- Configuration file (`.kbp/config.yaml`)
- Environment variables
- Default values
Example configuration file:
```yaml
knowledge_base:
  path: /path/to/documents
  patterns:
    - "*.md"
    - "*.markdown"
sparql:
  endpoint: http://localhost:3030/kb
  graph: http://example.org/kb
processing:
  batch_size: 100
  parallel: true
```

The processor handles wikilinks (`[[A wikilink]]`) and extracts them as relationships between documents.
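For intuition, wikilink extraction amounts to matching the double-bracket pattern and recording a source-to-target relationship. A minimal sketch (illustrative, not the processor's actual extractor):

```python
import re

# Matches [[Target]] and [[Target|Display text]]-style wikilinks.
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def extract_wikilinks(markdown: str) -> list[str]:
    """Return the link targets referenced by a Markdown document."""
    return [m.group(1).strip() for m in WIKILINK_RE.finditer(markdown)]

text = "See [[Project Notes]] and [[2024 Plan|the plan]] for details."
print(extract_wikilinks(text))  # ['Project Notes', '2024 Plan']
```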
Fork the repository, create a feature branch, and submit a pull request. Please ensure all tests pass before submitting.
[Add your license information here]