A production-ready system for managing and semantically clustering Costa Rican dichos (proverbs) using NLP techniques.
dichos_processing/
├── core_data/ # Core data and database
│   ├── dichos_normalized.db # Main SQLite database (301 dichos, 13 clusters)
│   ├── data/ # Original data sources
│   └── dichos_enhanced_batches.tsv
├── core_utilities/ # Essential utilities
│   ├── database_utils.py # Database interaction utilities
│   ├── parse_whatsapp_chat.py # WhatsApp chat export parser
│   ├── process_dichos.py # Dicho processing and duplicate detection
│   ├── enrich_dichos.py # LLM enrichment and metadata generation
│   └── insert_dichos.py # Database insertion with relationships
├── requirements.txt # Python package dependencies
├── core_algorithms/ # Core NLP and clustering algorithms
│   ├── nlp_semantic_clustering.py # Main clustering algorithm
│   └── multi_cluster_assignment_method.py # Multi-cluster assignment logic
├── documentation/ # Complete system documentation
│   ├── DATABASE_MAINTENANCE_GUIDE.md # Step-by-step maintenance procedures
│   ├── ENHANCED_CLUSTERS_SUMMARY.md # Current cluster state and descriptions
│   ├── ESSENTIAL_FILES_SUMMARY.md # Project overview and database schema
│   └── PROJECT_CLEANUP_SUMMARY.md # Cleanup operation summary
├── database_queries/ # Essential SQL queries
│   └── optimized_multi_cluster_queries.sql
├── venv/ # Python virtual environment
└── README.md # This file

# Activate virtual environment
source venv/bin/activate
# Install dependencies (if needed)
pip install -r requirements.txt
# Verify packages are installed
pip list | grep -E "(sentence-transformers|pandas|numpy|plotly)"

from core_utilities.database_utils import DatabaseManager
# Connect to database
db = DatabaseManager('core_data/dichos_normalized.db')
# Query current state
clusters = db.execute_query("SELECT * FROM clusters")
print(f"System has {len(clusters)} semantic clusters")

# Check cluster overview
cat documentation/ENHANCED_CLUSTERS_SUMMARY.md

- Total Dichos: 301 Costa Rican proverbs
- Semantic Clusters: 13 meaningful categories
- Cluster Assignment: Up to 3 clusters per dicho
- Database Schema: Fully optimized and clean
- NLP Model: Sentence Transformers (all-MiniLM-L6-v2)
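
The "up to 3 clusters per dicho" rule can be illustrated with a minimal sketch. It assumes each dicho and each cluster centroid is already an embedding vector (all-MiniLM-L6-v2 produces 384-dimensional vectors; toy 3-dimensional vectors are used here) and that assignment is by cosine similarity with a cutoff. The function name `assign_clusters`, the `min_similarity` threshold, and the cluster names are hypothetical and are not taken from multi_cluster_assignment_method.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar.
    return dot / (norm_a * norm_b)

def assign_clusters(embedding, centroids, max_clusters=3, min_similarity=0.5):
    """Return up to max_clusters (name, similarity) pairs, best match first.

    centroids maps a cluster name to its centroid vector. The threshold and
    the centroid representation are assumptions made for this sketch.
    """
    scored = [(name, cosine_similarity(embedding, vec))
              for name, vec in centroids.items()]
    scored = [(name, score) for name, score in scored if score >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_clusters]

# Toy vectors standing in for real sentence embeddings (hypothetical clusters).
centroids = {
    "patience": [1.0, 0.0, 0.0],
    "work": [0.0, 1.0, 0.0],
    "luck": [0.0, 0.0, 1.0],
    "caution": [0.7, 0.7, 0.0],
}
dicho_embedding = [0.9, 0.4, 0.1]
assignments = assign_clusters(dicho_embedding, centroids)
for name, score in assignments:
    print(f"{name}: {score:.2f}")
```

A dicho that clears the threshold for fewer than three clusters simply receives fewer assignments, which is one way to read "up to 3".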
- Follow the complete guide: documentation/DATABASE_MAINTENANCE_GUIDE.md
- Use core algorithms: Scripts in core_algorithms/
- Update database: Via utilities in core_utilities/
- process_new_whatsapp_dichos.py: Complete pipeline for new WhatsApp dichos
- core_utilities/parse_whatsapp_chat.py: WhatsApp chat parsing
- core_utilities/process_dichos.py: Dicho cleaning and duplicate detection
- core_utilities/enrich_dichos.py: LLM enrichment and metadata (⚠️ requires external LLM)
- core_utilities/insert_dichos.py: Database insertion with relationships
- core_algorithms/nlp_semantic_clustering.py: Reclustering with new data

The enrichment step requires external LLM interaction and cannot run locally.
See documentation/LLM_WORKFLOW_GUIDE.md for complete workflow details.
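
As an illustration of the cleaning and duplicate-detection step, the sketch below normalizes each dicho (lowercasing, stripping accents and punctuation, collapsing whitespace) and groups entries whose normalized form is identical. This is an assumption about how such a step could work, not the logic in core_utilities/process_dichos.py; the helper names `normalize_dicho` and `find_duplicates` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def normalize_dicho(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    return " ".join(no_punct.split())

def find_duplicates(dichos):
    """Group dichos sharing a normalized form; return only groups of two or more."""
    groups = defaultdict(list)
    for dicho in dichos:
        groups[normalize_dicho(dicho)].append(dicho)
    return [group for group in groups.values() if len(group) > 1]

# Illustrative Spanish proverbs, not taken from the database.
candidates = [
    "Más vale pájaro en mano que cien volando.",
    "mas vale pajaro en mano que cien volando",
    "Al mal tiempo, buena cara.",
]
duplicates = find_duplicates(candidates)
print(duplicates)
```

Exact matching on the normalized form only catches trivial variants; near-duplicates with different wording would need a similarity measure. Note that NFKD accent stripping also folds ñ into n, which may or may not be desirable.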
- sentence-transformers: NLP embeddings
- pandas: Data manipulation
- numpy: Numerical operations
- plotly: Visualizations (optional)
- sqlite3: Database operations (built-in)
- Python: 3.12+
- Memory: 4GB+ RAM for NLP operations
- Storage: 2GB+ free space
- No GPU required: CPU-only processing
- Semantic search of dichos by meaning
- Cluster-based navigation through proverb categories
- Multi-language support (Spanish + English)
- Cultural context and usage examples
- Linguistic analysis of Costa Rican proverbs
- Cultural studies and folklore research
- Language learning resource development
- Semantic similarity analysis
| File | Purpose | Status |
|---|---|---|
| DATABASE_MAINTENANCE_GUIDE.md | Complete maintenance procedures | Complete |
| ENHANCED_CLUSTERS_SUMMARY.md | Current cluster state | Current |
| ESSENTIAL_FILES_SUMMARY.md | System overview | Reference |
| PROJECT_CLEANUP_SUMMARY.md | Cleanup summary | Historical |

- Logical grouping by function and purpose
- Clear separation of concerns
- Easy navigation for new developers
- Maintenance-friendly structure
- Always backup database before major changes
- Test scripts in development environment first
- Follow maintenance guide step-by-step
- Validate results after each operation
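
For the backup step, one option is the online backup API in Python's built-in sqlite3 module, which copies a database consistently even while another connection has it open. The sketch below is an assumption about how a backup could be taken; the function name `backup_database` and the timestamped file naming are hypothetical, and the maintenance guide may prescribe a different procedure. The demo runs against a throwaway database in a temporary directory.

```python
import sqlite3
import tempfile
from datetime import datetime
from pathlib import Path

def backup_database(db_path, backup_dir):
    """Copy db_path into backup_dir under a timestamped name; return the new path."""
    source_path = Path(db_path)
    if not source_path.is_file():
        raise FileNotFoundError(f"Database not found: {source_path}")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target_path = target_dir / f"{source_path.stem}_{stamp}.db"

    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)  # Connection.backup exists since Python 3.7
    finally:
        target.close()
        source.close()
    return target_path

# Demo on a temporary database so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.db"
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO clusters (name) VALUES ('demo')")
    conn.commit()
    conn.close()

    copy_path = backup_database(demo_db, Path(tmp) / "backups")
    check = sqlite3.connect(copy_path)
    copied_rows = check.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
    check.close()
    print(copy_path.name, copied_rows)
```

For the real database the call would look like `backup_database("core_data/dichos_normalized.db", "backups")`, where the `backups` directory name is again hypothetical.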
- Check documentation in documentation/ folder
- Review maintenance guide for step-by-step procedures
- Use database queries in database_queries/ for troubleshooting
- Verify environment with utilities in core_utilities/

- Adding new dichos: See maintenance guide
- Reclustering data: Use NLP clustering script
- Database queries: Reference SQL examples
- System updates: Follow documented procedures
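
To give a sense of what a multi-cluster query looks like, here is a self-contained sketch against an in-memory SQLite database. The schema (a `dichos` table, a `clusters` table, and a `dicho_clusters` junction table with an `assignment_rank` column) is hypothetical and chosen only to illustrate a many-to-many layout. The actual schema is described in documentation/ESSENTIAL_FILES_SUMMARY.md, and the real queries live in database_queries/optimized_multi_cluster_queries.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dichos (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Hypothetical junction table: one row per (dicho, cluster) assignment.
    CREATE TABLE dicho_clusters (
        dicho_id INTEGER REFERENCES dichos(id),
        cluster_id INTEGER REFERENCES clusters(id),
        assignment_rank INTEGER CHECK (assignment_rank BETWEEN 1 AND 3),
        PRIMARY KEY (dicho_id, cluster_id)
    );
    INSERT INTO dichos VALUES (1, 'dicho A'), (2, 'dicho B');
    INSERT INTO clusters VALUES (1, 'patience'), (2, 'work');
    INSERT INTO dicho_clusters VALUES (1, 1, 1), (1, 2, 2), (2, 2, 1);
""")

# Count how many dichos are assigned to each cluster, largest first.
rows = conn.execute("""
    SELECT c.name, COUNT(dc.dicho_id) AS dicho_count
    FROM clusters AS c
    LEFT JOIN dicho_clusters AS dc ON dc.cluster_id = c.id
    GROUP BY c.id
    ORDER BY dicho_count DESC, c.name
""").fetchall()
conn.close()
print(rows)
```

Because a dicho can belong to several clusters, the per-cluster counts can sum to more than the number of dichos. The `LEFT JOIN` keeps clusters that currently have no assignments in the result.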
- Web interface for dicho management
- API endpoints for application integration
- Advanced analytics and reporting
- Multi-language expansion beyond Spanish/English
- Cluster management for growing collections
- Performance optimization for large datasets
- Backup and recovery procedures
- Monitoring and alerting systems
This system represents a production-ready semantic clustering solution for Costa Rican dichos, with comprehensive documentation and maintenance procedures for ongoing operations.