____       _ _        _                    
|  _ \  ___| | |_ __ _| |    ___ _ __  ___ 
| | | |/ _ \ | __/ _` | |   / _ \ '_ \/ __|
| |_| |  __/ | || (_| | |__|  __/ | | \__ \
|____/ \___|_|\__\__,_|_____\___|_| |_|___/
                                        
DeltaLens is a powerful tool for comparing large datasets using the power of DuckDB as the comparison engine. It supports data transformations, automated field-level matching, and detailed comparison reporting.
flowchart LR
    Old_System:::external@{ shape: lin-cyl, label: "Old System" }
    New_System:::external@{ shape: lin-cyl, label: "New System" }
    Data_puller:::external@{ shape: subproc, label: "Data Exporter" }
    Old_System-->Data_puller
    New_System-->Data_puller
    Data_puller-->Trades_1
    Data_puller-->Trades_2
    Trades_1:::external@{ shape: doc, label: "new_system_trades.csv" }
    Trades_2:::external@{ shape: doc, label: "lagecy_system_trades.csv" }
    Config@{ shape: doc, label: "config.json" }
    Config-->TableQueryGenerator
    Trades_2 -->|load|DuckDB
    Trades_1 -->|load|DuckDB
    subgraph DeltaLens.py
        TableQueryGenerator@{ shape: subproc, label: "QueryGenerator" }
        TableQueryGenerator-->|generate compare queries|DuckDB
        DuckDB@{ shape: lin-cyl, label: "DuckDB" }
        DuckDB-->Exporter
        Exporter@{ shape: subproc, label: "Exporter" }
       
    end
    Exporter-->|export|Sqlite
    Exporter-->|export|Results_csv
    Sqlite@{ shape: lin-cyl, label: "results.sqlite" }
    Results_csv@{ shape: doc, label: "results.csv" }
    classDef external fill:#F8F8F8
    - Compare CSV datasets with configurable primary keys
 - Apply SQL transformations to data before comparison
 - Generate detailed field-level match statistics
 - Export results to SQLite and CSV for analysis
 - Support for larger than memory datasets
 - Support for reference datasets
 - Docker support for containerized execution
 - CLI and Python API interfaces
 - Data Pipeline and CI/CD friendly
 
Install from PyPI:
pip install delta-lenssee data_compare_legislators.ipynb for example.
# Basic comparison
deltalens --config data/compare.config.json --run-name daily_compare
# Full options
deltalens \
  --config data/compare.config.json \
  --run-name daily_compare \
  --output-dir ./results \
  --persistent \
  --continue-on-error \
  --export-sqlite \
  --export-csv \
  --export-sampling-threshold 5000 \
  --export-mismatches-only \
  --log-level DEBUGOr Pull the Docker image:
docker run unclepaul84/deltalens:latestDeltaLens can be used interactively in Jupyter notebooks for data comparison analysis. See data_compare_legislators.ipynb
Create a compare.config.json file:
{  
    "defaults":{},
    "entities": [
        {
            "entityName":"trade",
            "leftSide": {
                "title": "legacy",
                "inputFile":"data/legacy_system_trades.csv"
                
                
            },
            "rightSide": {
                "title": "new",
                "inputFile":"data/new_system_trades.csv"
            },
            "primaryKeys": ["trade_id"]
        
        }     
    ]
}
| Variable | Description | Default | 
|---|---|---|
DELTALENS_CONFIG | 
Path to config file | compare.config.json | 
DELTALENS_RUN_NAME | 
Name for comparison run | compare_YYYY-MM-DD | 
DELTALENS_OUTPUT_DIR | 
Output directory | . | 
DELTALENS_PERSISTENT | 
Use persistent storage | false | 
DELTALENS_EXPORT_SQLITE | 
Export to SQLite | true | 
DELTALENS_EXPORT_SAMPLING_THRESHOLD | 
rowcount at which to start sampling | 10000 | 
DELTALENS_EXPORT_CSV | 
export to gzipped csv | true | 
DELTALENS_EXPORT_MISMATCHES_ONLY | 
Export mismatched rows only | true | 
The tool generates several output files:
[run_name].duckdb: DuckDB database with comparison results (if persistent mode enabled)[run_name].sqlite: SQLite export of comparison results (if enabled)
Resulting Tables include:
entity_compare_results: Overall comparison summary[entity]_compare: Detailed record-level comparison[entity]_compare_field_summary: Field-level match statistics
# Run with docker-compose
docker-compose up
# Run with custom arguments
docker-compose run deltalens --run-name custom_run --log-level DEBUGDeltaLens includes a script to generate sample trade data for testing and demonstration purposes.
The script creates two CSV files with randomized trade data:
legacy_system_trades.csv: Original trade data with modificationsnew_system_trades.csv: Copy of original data with known differences
cd data
# Generate sample data (creates 2GB files by default)
python create_test_datasets.py# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run tests
pytest -v
# Run tests with coverage
pytest --cov=delta_lens -vMIT License
- Fork the repository
 - Create a feature branch
 - Submit a pull request