DICHOS PROCESSING SYSTEM

Costa Rican Proverbs Semantic Clustering Database

A production-ready system for managing and semantically clustering Costa Rican dichos (proverbs) using NLP techniques.

🏗️ PROJECT STRUCTURE

dichos_processing/
├── 📊 core_data/                    # Core data and database
│   ├── dichos_normalized.db         # Main SQLite database (301 dichos, 13 clusters)
│   └── data/                        # Original data sources
│       └── dichos_enhanced_batches.tsv
├── 🔧 core_utilities/               # Essential utilities
│   ├── database_utils.py            # Database interaction utilities
│   ├── parse_whatsapp_chat.py       # WhatsApp chat export parser
│   ├── process_dichos.py            # Dicho processing and duplicate detection
│   ├── enrich_dichos.py             # LLM enrichment and metadata generation
│   └── insert_dichos.py             # Database insertion with relationships
├── 📋 requirements.txt               # Python package dependencies
├── 🧠 core_algorithms/              # Core NLP and clustering algorithms
│   ├── nlp_semantic_clustering.py   # Main clustering algorithm
│   └── multi_cluster_assignment_method.py  # Multi-cluster assignment logic
├── 📚 documentation/                 # Complete system documentation
│   ├── DATABASE_MAINTENANCE_GUIDE.md    # Step-by-step maintenance procedures
│   ├── ENHANCED_CLUSTERS_SUMMARY.md     # Current cluster state and descriptions
│   ├── ESSENTIAL_FILES_SUMMARY.md       # Project overview and database schema
│   └── PROJECT_CLEANUP_SUMMARY.md       # Cleanup operation summary
├── 🗄️ database_queries/             # Essential SQL queries
│   └── optimized_multi_cluster_queries.sql
├── 🐍 venv/                         # Python virtual environment
└── README.md                        # This file

🚀 QUICK START

1. Environment Setup:

# Activate virtual environment
source venv/bin/activate

# Install dependencies (if needed)
pip install -r requirements.txt

# Verify packages are installed
pip list | grep -E "(sentence-transformers|pandas|numpy|plotly)"

2. Database Access:

from core_utilities.database_utils import DatabaseManager

# Connect to database
db = DatabaseManager('core_data/dichos_normalized.db')

# Query current state
clusters = db.execute_query("SELECT * FROM clusters")
print(f"System has {len(clusters)} semantic clusters")
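If `DatabaseManager` is not importable (for example, when exploring the database from a scratch script), the same kind of query can be run with the standard-library `sqlite3` module. This is a minimal sketch, not the project's own utility: `fetch_all` is a hypothetical helper name, and only the `clusters` table name is taken from the query above.

```python
import sqlite3
from contextlib import closing


def fetch_all(db_path: str, sql: str, params: tuple = ()) -> list:
    """Run a query against a read-only connection and return all rows.

    Hypothetical helper for illustration; it is not part of core_utilities.
    """
    # mode=ro opens the file read-only, so an exploratory query cannot modify it.
    with closing(sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)) as conn:
        # sqlite3.Row allows access by column name as well as by position.
        conn.row_factory = sqlite3.Row
        return conn.execute(sql, params).fetchall()
```

Calling `fetch_all('core_data/dichos_normalized.db', 'SELECT * FROM clusters')` should then return one row per cluster, assuming the database file is present at that path.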

3. View Current Clusters:

# Check cluster overview
cat documentation/ENHANCED_CLUSTERS_SUMMARY.md

📊 CURRENT SYSTEM STATE

Total Dichos: 301 Costa Rican proverbs
Semantic Clusters: 13 meaningful categories
Cluster Assignment: Up to 3 clusters per dicho
Database Schema: Fully optimized and clean
NLP Model: Sentence Transformers (all-MiniLM-L6-v2)
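
The "up to 3 clusters per dicho" rule can be illustrated with a small sketch. It assumes each cluster is represented by a centroid vector and that a dicho joins the clusters it is most similar to by cosine similarity; the actual logic lives in core_algorithms/multi_cluster_assignment_method.py and may differ. The `MIN_SIMILARITY` cutoff and the function names are assumptions made for illustration only.

```python
import math

MAX_CLUSTERS = 3      # from the system state above: up to 3 clusters per dicho
MIN_SIMILARITY = 0.3  # hypothetical cutoff, chosen only for this example


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(embedding, centroids,
                    max_clusters=MAX_CLUSTERS, min_similarity=MIN_SIMILARITY):
    """Return up to max_clusters (cluster_id, similarity) pairs, best match first."""
    scored = [(cid, cosine(embedding, vec)) for cid, vec in centroids.items()]
    # Drop weak matches so a dicho is not forced into unrelated clusters.
    scored = [pair for pair in scored if pair[1] >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_clusters]
```

With real data the vectors would be 384-dimensional embeddings from all-MiniLM-L6-v2 rather than hand-written lists.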

🔄 MAINTENANCE OPERATIONS

Adding New Dichos:

Follow the complete guide: documentation/DATABASE_MAINTENANCE_GUIDE.md
Use core algorithms: Scripts in core_algorithms/
Update database: Via utilities in core_utilities/

Key Maintenance Scripts:

process_new_whatsapp_dichos.py (repository root): Complete pipeline for new WhatsApp dichos
core_utilities/parse_whatsapp_chat.py: WhatsApp chat parsing
core_utilities/process_dichos.py: Dicho cleaning and duplicate detection
core_utilities/enrich_dichos.py: LLM enrichment and metadata (⚠️ requires external LLM)
core_utilities/insert_dichos.py: Database insertion with relationships
core_algorithms/nlp_semantic_clustering.py: Reclustering with new data
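
The cleaning and duplicate-detection step can be pictured roughly as follows. This is a hedged sketch using only the standard library; it is not the code in core_utilities/process_dichos.py, and the function names and the 0.9 similarity threshold are assumptions.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    return " ".join(no_punct.split())


def find_duplicates(candidates, existing, threshold=0.9):
    """Map each candidate to the first existing dicho it closely matches."""
    normalized_existing = [(item, normalize(item)) for item in existing]
    matches = {}
    for candidate in candidates:
        key = normalize(candidate)
        for original, norm in normalized_existing:
            # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
            if SequenceMatcher(None, key, norm).ratio() >= threshold:
                matches[candidate] = original
                break
    return matches
```

Stripping accents also turns "ñ" into "n", which is acceptable for matching near-identical spellings but means the normalized form should never be stored in place of the original text.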

⚠️ Important: LLM Workflow

The enrichment step requires external LLM interaction and cannot run locally. See documentation/LLM_WORKFLOW_GUIDE.md for complete workflow details.

📋 REQUIREMENTS

Python Packages:

sentence-transformers: NLP embeddings
pandas: Data manipulation
numpy: Numerical operations
plotly: Visualizations (optional)
sqlite3: Database operations (Python standard library, no installation needed)

System Requirements:

Python: 3.12+
Memory: 4GB+ RAM for NLP operations
Storage: 2GB+ free space
No GPU required: CPU-only processing
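
The requirements above can also be checked from Python rather than with `pip list`. The sketch below is an illustrative helper, not part of the repository; it assumes the packages are imported under their usual module names (`sentence_transformers` for sentence-transformers).

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 12)
# pip package name -> importable module name
REQUIRED = {
    "sentence-transformers": "sentence_transformers",
    "pandas": "pandas",
    "numpy": "numpy",
}
OPTIONAL = {"plotly": "plotly"}


def missing_modules(modules):
    """Return the pip names whose modules cannot be found in this environment."""
    return [name for name, module in modules.items() if find_spec(module) is None]


def check_environment():
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    problems += [f"missing required package: {n}" for n in missing_modules(REQUIRED)]
    problems += [f"missing optional package: {n}" for n in missing_modules(OPTIONAL)]
    return problems
```

Printing the result of `check_environment()` inside the activated virtual environment gives a quick readiness report.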

🎯 USE CASES

For Application Development:

Semantic search of dichos by meaning
Cluster-based navigation through proverb categories
Multi-language support (Spanish + English)
Cultural context and usage examples
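
Semantic search by meaning typically reduces to ranking stored embeddings by similarity to a query embedding. The following is a minimal, dependency-free sketch under that assumption; the function names and the in-memory `index` structure are hypothetical, and the toy vectors stand in for real all-MiniLM-L6-v2 embeddings.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=5):
    """Return the top_k (similarity, dicho_text) pairs, most similar first.

    index is a list of (dicho_text, embedding) pairs held in memory.
    """
    scored = ((cosine(query_vec, emb), text) for text, emb in index)
    return heapq.nlargest(top_k, scored)
```

In a real application the query vector would come from encoding the user's text with the same Sentence Transformers model used to embed the dichos, so that both live in the same vector space.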

For Research:

Linguistic analysis of Costa Rican proverbs
Cultural studies and folklore research
Language learning resource development
Semantic similarity analysis

📚 DOCUMENTATION INDEX

File	Purpose	Status
`DATABASE_MAINTENANCE_GUIDE.md`	Complete maintenance procedures	✅ Complete
`ENHANCED_CLUSTERS_SUMMARY.md`	Current cluster state	✅ Current
`ESSENTIAL_FILES_SUMMARY.md`	System overview	✅ Reference
`PROJECT_CLEANUP_SUMMARY.md`	Cleanup summary	✅ Historical

🔧 DEVELOPMENT NOTES

File Organization:

Logical grouping by function and purpose
Clear separation of concerns
Easy navigation for new developers
Maintenance-friendly structure

Best Practices:

Always backup database before major changes
Test scripts in development environment first
Follow maintenance guide step-by-step
Validate results after each operation
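
For the backup step, one option is the online backup API in Python's `sqlite3` module, which copies a database safely even if another connection has it open. This is a sketch under assumptions: the `backups` directory name and the timestamped file naming are illustrative choices, not project conventions.

```python
import sqlite3
from contextlib import closing
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir="backups"):
    """Copy a SQLite database to a timestamped file and return the new path."""
    source_path = Path(db_path)
    if not source_path.is_file():
        # Without this check, sqlite3.connect would silently create an empty file.
        raise FileNotFoundError(f"database not found: {source_path}")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = target_dir / f"{source_path.stem}_{stamp}{source_path.suffix}"
    with closing(sqlite3.connect(source_path)) as src, \
            closing(sqlite3.connect(target)) as dst:
        src.backup(dst)  # page-by-page copy via SQLite's backup API
    return target
```

Running `backup_database('core_data/dichos_normalized.db')` before reclustering or bulk inserts leaves a restorable copy alongside the working database.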

🆘 SUPPORT

For Maintenance Issues:

Check documentation in documentation/ folder
Review maintenance guide for step-by-step procedures
Use database queries in database_queries/ for troubleshooting
Verify environment with utilities in core_utilities/
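
As an example of a troubleshooting query, the check below looks for dichos assigned to more clusters than the documented limit of 3. The schema is an assumption: `dicho_clusters` with `dicho_id` and `cluster_id` columns is a hypothetical junction table, so the names should be adjusted to match the real schema described in ESSENTIAL_FILES_SUMMARY.md.

```python
import sqlite3

MAX_CLUSTERS_PER_DICHO = 3  # documented limit: up to 3 clusters per dicho


def over_assigned(conn, limit=MAX_CLUSTERS_PER_DICHO):
    """Return (dicho_id, cluster_count) rows that exceed the limit.

    Assumes a hypothetical junction table dicho_clusters(dicho_id, cluster_id).
    """
    sql = """
        SELECT dicho_id, COUNT(*) AS cluster_count
        FROM dicho_clusters
        GROUP BY dicho_id
        HAVING COUNT(*) > ?
        ORDER BY dicho_id
    """
    return conn.execute(sql, (limit,)).fetchall()
```

An empty result means every dicho respects the limit; any rows returned point to assignments worth reviewing after a reclustering run.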

Common Operations:

Adding new dichos: See maintenance guide
Reclustering data: Use NLP clustering script
Database queries: Reference SQL examples
System updates: Follow documented procedures

📈 FUTURE ENHANCEMENTS

Planned Features:

Web interface for dicho management
API endpoints for application integration
Advanced analytics and reporting
Multi-language expansion beyond Spanish/English

Scalability Considerations:

Cluster management for growing collections
Performance optimization for large datasets
Backup and recovery procedures
Monitoring and alerting systems

This system represents a production-ready semantic clustering solution for Costa Rican dichos, with comprehensive documentation and maintenance procedures for ongoing operations.
