rsgoodwin/dichos_processing

DICHOS PROCESSING SYSTEM

Costa Rican Proverbs Semantic Clustering Database

A production-ready system for managing Costa Rican dichos (proverbs) in a SQLite database and clustering them semantically with Sentence Transformers embeddings.


🏗️ PROJECT STRUCTURE

dichos_processing/
├── 📊 core_data/                    # Core data and database
│   ├── dichos_normalized.db         # Main SQLite database (301 dichos, 13 clusters)
│   └── data/                        # Original data sources
│       └── dichos_enhanced_batches.tsv
├── 🔧 core_utilities/               # Essential utilities
│   ├── database_utils.py            # Database interaction utilities
│   ├── parse_whatsapp_chat.py       # WhatsApp chat export parser
│   ├── process_dichos.py            # Dicho processing and duplicate detection
│   ├── enrich_dichos.py             # LLM enrichment and metadata generation
│   └── insert_dichos.py             # Database insertion with relationships
├── 📋 requirements.txt              # Python package dependencies
├── 🧠 core_algorithms/              # Core NLP and clustering algorithms
│   ├── nlp_semantic_clustering.py   # Main clustering algorithm
│   └── multi_cluster_assignment_method.py  # Multi-cluster assignment logic
├── 📚 documentation/                # Complete system documentation
│   ├── DATABASE_MAINTENANCE_GUIDE.md    # Step-by-step maintenance procedures
│   ├── ENHANCED_CLUSTERS_SUMMARY.md     # Current cluster state and descriptions
│   ├── ESSENTIAL_FILES_SUMMARY.md       # Project overview and database schema
│   └── PROJECT_CLEANUP_SUMMARY.md       # Cleanup operation summary
├── 🗄️ database_queries/             # Essential SQL queries
│   └── optimized_multi_cluster_queries.sql
├── 🐍 venv/                         # Python virtual environment
└── README.md                        # This file

🚀 QUICK START

1. Environment Setup:

# Activate virtual environment
source venv/bin/activate

# Install dependencies (if needed)
pip install -r requirements.txt

# Verify packages are installed
pip list | grep -E "(sentence-transformers|pandas|numpy|plotly)"

2. Database Access:

from core_utilities.database_utils import DatabaseManager

# Connect to database
db = DatabaseManager('core_data/dichos_normalized.db')

# Query current state
clusters = db.execute_query("SELECT * FROM clusters")
print(f"System has {len(clusters)} semantic clusters")
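
If `DatabaseManager` cannot be imported (for example, when a script runs from outside the repository root), the same read-only check can be made with the standard-library `sqlite3` module. The following is a minimal sketch, not the project's own utility: the helper name `fetch_all` is hypothetical, and the demo builds a throwaway in-memory `clusters` table with invented columns rather than assuming the real schema.

```python
import sqlite3


def fetch_all(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[dict]:
    """Run a read query and return each row as a plain dict."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    cursor = conn.execute(sql, params)
    return [dict(row) for row in cursor.fetchall()]


# Demo on an in-memory database; the column names here are illustrative only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, name TEXT)")
demo.executemany("INSERT INTO clusters (name) VALUES (?)", [("wisdom",), ("work",)])
demo_clusters = fetch_all(demo, "SELECT * FROM clusters ORDER BY id")
print(f"Demo database has {len(demo_clusters)} clusters")
```

Against the real data, the demo connection would be replaced with `sqlite3.connect('core_data/dichos_normalized.db')`.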

3. View Current Clusters:

# Check cluster overview
cat documentation/ENHANCED_CLUSTERS_SUMMARY.md

📊 CURRENT SYSTEM STATE

  • Total Dichos: 301 Costa Rican proverbs
  • Semantic Clusters: 13 meaningful categories
  • Cluster Assignment: Up to 3 clusters per dicho
  • Database Schema: Normalized SQLite schema (described in documentation/ESSENTIAL_FILES_SUMMARY.md)
  • NLP Model: Sentence Transformers (all-MiniLM-L6-v2)
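
The "up to 3 clusters per dicho" rule can be pictured as ranking clusters by cosine similarity to a dicho's embedding and keeping the best few. The sketch below illustrates that idea only; it is not the logic in core_algorithms/multi_cluster_assignment_method.py, which is not reproduced here. The function names, the 0.3 similarity cutoff, and the toy 3-dimensional vectors are all assumptions (real all-MiniLM-L6-v2 embeddings have 384 dimensions).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(embedding, centroids, max_clusters=3, min_similarity=0.3):
    """Return up to `max_clusters` (cluster_id, similarity) pairs, best first.

    `min_similarity` is a hypothetical cutoff, not a value taken from the project.
    """
    scored = [(cid, cosine(embedding, vec)) for cid, vec in centroids.items()]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_clusters]


# Toy centroids keyed by hypothetical cluster ids.
toy_centroids = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
    "d": [1.0, 1.0, 0.0],
}
assignments = assign_clusters([0.9, 0.4, 0.0], toy_centroids)
print([cid for cid, _ in assignments])
```

A cutoff keeps weakly related clusters out even when fewer than three qualify, which is why the result can be shorter than `max_clusters`.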

🔄 MAINTENANCE OPERATIONS

Adding New Dichos:

  1. Follow the complete guide: documentation/DATABASE_MAINTENANCE_GUIDE.md
  2. Use core algorithms: Scripts in core_algorithms/
  3. Update database: Via utilities in core_utilities/

Key Maintenance Scripts:

  • process_new_whatsapp_dichos.py: Complete pipeline for new WhatsApp dichos
  • core_utilities/parse_whatsapp_chat.py: WhatsApp chat parsing
  • core_utilities/process_dichos.py: Dicho cleaning and duplicate detection
  • core_utilities/enrich_dichos.py: LLM enrichment and metadata (⚠️ requires external LLM)
  • core_utilities/insert_dichos.py: Database insertion with relationships
  • core_algorithms/nlp_semantic_clustering.py: Reclustering with new data
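
To make the duplicate-detection step more concrete, here is one common approach: reduce each dicho to a normalized key (lowercase, accents and punctuation removed, whitespace collapsed) and skip candidates whose key is already known. This is a hedged sketch, not the method used by core_utilities/process_dichos.py; the function names and sample sentences are illustrative assumptions.

```python
import re
import unicodedata


def normalize_dicho(text: str) -> str:
    """Build a comparison key: no accents, no punctuation, lowercase, single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^\w\s]", " ", without_accents.casefold())
    return " ".join(letters_only.split())


def find_new_dichos(candidates: list[str], existing: list[str]) -> list[str]:
    """Return candidates whose normalized form is not already known."""
    seen = {normalize_dicho(text) for text in existing}
    fresh = []
    for text in candidates:
        key = normalize_dicho(text)
        if key and key not in seen:
            seen.add(key)  # also catches repeats within the same batch
            fresh.append(text)
    return fresh


existing = ["Más vale pájaro en mano que cien volando."]
incoming = [
    "mas vale pajaro en mano que cien volando",
    "Camarón que se duerme, se lo lleva la corriente.",
    "¡Camarón que se duerme se lo lleva la corriente!",
]
new_dichos = find_new_dichos(incoming, existing)
print(new_dichos)
```

Exact-key matching only catches spelling and punctuation variants. Note also that stripping combining marks maps "ñ" to "n", which is acceptable for a comparison key but not for stored text.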

⚠️ Important: LLM Workflow

The enrichment step requires external LLM interaction and cannot run locally. See documentation/LLM_WORKFLOW_GUIDE.md for complete workflow details.


📋 REQUIREMENTS

Python Packages:

  • sentence-transformers: NLP embeddings
  • pandas: Data manipulation
  • numpy: Numerical operations
  • plotly: Visualizations (optional)
  • sqlite3: Database operations (part of the Python standard library; nothing to install)

System Requirements:

  • Python: 3.12+
  • Memory: 4GB+ RAM for NLP operations
  • Storage: 2GB+ free space
  • No GPU required: CPU-only processing
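
The requirements above can be checked from Python as well as with `pip list`. The following is a small sketch under stated assumptions: the function names are hypothetical, and the module list simply mirrors the packages named in this README (note that `sentence-transformers` is imported as `sentence_transformers`).

```python
import importlib.util
import sys

REQUIRED_MODULES = ["sentence_transformers", "pandas", "numpy"]  # import names
OPTIONAL_MODULES = ["plotly"]
MIN_PYTHON = (3, 12)


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def environment_report() -> dict:
    """Summarize whether the interpreter and packages meet the listed requirements."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "missing_required": missing_modules(REQUIRED_MODULES),
        "missing_optional": missing_modules(OPTIONAL_MODULES),
    }


report = environment_report()
print(report)
```

`find_spec` only confirms that a module can be located; it does not import it, so the check stays fast even for heavy packages.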

🎯 USE CASES

For Application Development:

  • Semantic search of dichos by meaning
  • Cluster-based navigation through proverb categories
  • Multi-language support (Spanish + English)
  • Cultural context and usage examples

For Research:

  • Linguistic analysis of Costa Rican proverbs
  • Cultural studies and folklore research
  • Language learning resource development
  • Semantic similarity analysis

📚 DOCUMENTATION INDEX

| File                          | Purpose                         | Status        |
|-------------------------------|---------------------------------|---------------|
| DATABASE_MAINTENANCE_GUIDE.md | Complete maintenance procedures | ✅ Complete   |
| ENHANCED_CLUSTERS_SUMMARY.md  | Current cluster state           | ✅ Current    |
| ESSENTIAL_FILES_SUMMARY.md    | System overview                 | ✅ Reference  |
| PROJECT_CLEANUP_SUMMARY.md    | Cleanup summary                 | ✅ Historical |

🔧 DEVELOPMENT NOTES

File Organization:

  • Logical grouping by function and purpose
  • Clear separation of concerns
  • Easy navigation for new developers
  • Maintenance-friendly structure

Best Practices:

  • Always backup database before major changes
  • Test scripts in development environment first
  • Follow maintenance guide step-by-step
  • Validate results after each operation
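
For the backup step, Python's `sqlite3` module provides an online backup API (`Connection.backup`) that copies a database consistently. The sketch below is one possible way to use it; the function name, the timestamped file naming, and the backup directory are assumptions, not part of this repository.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(source_path: str, backup_dir: str) -> Path:
    """Copy a SQLite database to a timestamped file using the online backup API."""
    source_file = Path(source_path)
    if not source_file.is_file():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(f"Database not found: {source_path}")

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = target_dir / f"{source_file.stem}_{stamp}.db"

    source = sqlite3.connect(source_file)
    destination = sqlite3.connect(target)
    try:
        source.backup(destination)  # page-by-page copy of the whole database
    finally:
        destination.close()
        source.close()
    return target
```

A call such as `backup_database("core_data/dichos_normalized.db", "backups")` would be run before reclustering or bulk inserts; the `backups` directory name is only an example.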

🆘 SUPPORT

For Maintenance Issues:

  1. Check documentation in documentation/ folder
  2. Review maintenance guide for step-by-step procedures
  3. Use database queries in database_queries/ for troubleshooting
  4. Verify environment with utilities in core_utilities/

Common Operations:

  • Adding new dichos: See maintenance guide
  • Reclustering data: Use NLP clustering script
  • Database queries: Reference SQL examples
  • System updates: Follow documented procedures

📈 FUTURE ENHANCEMENTS

Planned Features:

  • Web interface for dicho management
  • API endpoints for application integration
  • Advanced analytics and reporting
  • Multi-language expansion beyond Spanish/English

Scalability Considerations:

  • Cluster management for growing collections
  • Performance optimization for large datasets
  • Backup and recovery procedures
  • Monitoring and alerting systems

This system represents a production-ready semantic clustering solution for Costa Rican dichos, with comprehensive documentation and maintenance procedures for ongoing operations.
