|
1 | | -<p align="left"> |
| 1 | +<p align="center"> |
2 | 2 | <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a> |
3 | 3 | </p> |
4 | 4 |
|
5 | | -<h1 align="left"> |
6 | | -data-diff: compare datasets fast, within or across SQL databases |
7 | | -</h1> |
| 5 | +<h2 align="center"> |
| 6 | +data-diff: Compare datasets fast, within or across SQL databases |
8 | 7 |
|
| 8 | + |
| 9 | +</h2> |
9 | 10 | <br> |
10 | 11 |
|
| 12 | +# Use Cases |
| 13 | + |
| 14 | +## Data Migration & Replication Testing |
| 15 | +Compare source to target and check for discrepancies when moving data between systems: |
| 16 | +- Migrating to a new data warehouse (e.g., Oracle > Snowflake) |
| 17 | +- Converting SQL to a new transformation framework (e.g., stored procedures > dbt) |
| 18 | +- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift) |
| 19 | + |
| 20 | + |
| 21 | +## Data Development Testing |
| 22 | +Test SQL code and preview changes by comparing development/staging environment data to production: |
| 23 | +1. Make a change to some SQL code |
| 24 | +2. Run the SQL code to create a new dataset |
| 25 | +3. Compare the dataset with its production version or another iteration |
| 26 | + |
| 27 | + <p align="left"> |
| 28 | + <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" /> |
| 29 | + </p> |
| 30 | + |
| 31 | +<details> |
| 32 | +<summary> data-diff integrates with dbt Core to seamlessly compare local development to production datasets |
| 33 | + |
| 34 | + </summary> |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +</details> |
| 39 | + |
| 40 | +> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing) |
| 41 | +
|
| 42 | +:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)** |
| 43 | + |
| 44 | +**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)** |
| 45 | + |
| 46 | +Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode) |
| 47 | + |
| 48 | +Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support |
| 49 | + |
| 50 | + |
11 | 51 | # How it works |
12 | 52 |
|
13 | 53 | When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison: |
14 | 54 |
|
15 | | -## joindiff |
| 55 | +## `joindiff` |
16 | 56 | - Recommended for comparing data within the same database |
17 | 57 | - Uses the outer join operation to diff the rows as efficiently as possible within the same database |
18 | 58 | - Fully relies on the underlying database engine for computation |
19 | 59 | - Requires both datasets to be queryable with a single SQL query |
20 | 60 | - Time complexity approximates that of a JOIN operation and is largely independent of the number of differences in the dataset |
21 | 61 |
|
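The outer-join idea described for `joindiff` can be sketched as follows. This is a minimal illustration, not the actual `data-diff` implementation: it uses SQLite from the Python standard library as a stand-in database, emulates a full outer join with two `LEFT JOIN`s, and the function name `join_diff` and the table names are hypothetical. It assumes both tables share a key column and at least one compared column.

```python
import sqlite3


def join_diff(conn, left, right, key, columns):
    """Return (sign, key, *columns) rows that differ between two tables.

    '-' marks a row of `left` that is missing or different in `right`;
    '+' marks a row of `right` that is missing or different in `left`.
    Hypothetical helper for illustration; identifiers are interpolated
    directly, so only pass trusted table and column names.
    """
    # Null-safe inequality; `IS NOT` used this way is SQLite syntax.
    changed = " OR ".join(f"l.{c} IS NOT r.{c}" for c in columns)
    l_cols = ", ".join(f"l.{c}" for c in [key, *columns])
    r_cols = ", ".join(f"r.{c}" for c in [key, *columns])
    sql = f"""
        SELECT '-' AS sign, {l_cols}
        FROM {left} AS l LEFT JOIN {right} AS r ON l.{key} = r.{key}
        WHERE r.{key} IS NULL OR {changed}
        UNION ALL
        SELECT '+' AS sign, {r_cols}
        FROM {right} AS r LEFT JOIN {left} AS l ON l.{key} = r.{key}
        WHERE l.{key} IS NULL OR {changed}
        ORDER BY 2, 1
    """
    # The database engine does all of the comparison work in one query.
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_prod (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users_dev (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users_prod VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO users_dev VALUES (1, 'alice'), (2, 'bobby'), (4, 'dave');
""")
diff = join_diff(conn, "users_prod", "users_dev", "id", ["name"])
for row in diff:
    print(row)
```

Because the whole comparison is a single query, both tables must be reachable from one connection, which matches the requirement listed above that both datasets be queryable with a single SQL query.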
22 | | -## hashdiff |
| 62 | +## `hashdiff` |
23 | 63 | - Recommended for comparing datasets across different databases |
24 | 64 | - Can also be helpful in diffing very large tables with few expected differences within the same database |
25 | 65 | - Employs a divide-and-conquer algorithm based on hashing and binary search |
@@ -52,61 +92,32 @@ data-diff \ |
52 | 92 | Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference. |
53 | 93 |
|
54 | 94 |
|
55 | | -# Use cases |
56 | | - |
57 | | -## Data Migration & Replication Testing |
58 | | -Compare source to target and check for discrepancies when moving data between systems: |
59 | | -- Migrating to a new data warehouse (e.g., Oracle > Snowflake) |
60 | | -- Converting SQL to a new transformation framework (e.g., stored procedures > dbt) |
61 | | -- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift) |
62 | | - |
63 | | - |
64 | | -## Data Development Testing |
65 | | -Test SQL code and preview changes by comparing development/staging environment data to production: |
66 | | -1. Make a change to some SQL code |
67 | | -2. Run the SQL code to create a new dataset |
68 | | -3. Compare the dataset with its production version or another iteration |
69 | | - |
70 | | - <p align="left"> |
71 | | - <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" /> |
72 | | - </p> |
73 | | - |
74 | | -`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets. |
75 | | - |
76 | | -:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)** |
77 | | - |
78 | | -**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)** |
79 | | - |
80 | | -Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode) |
81 | | - |
82 | | -Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support |
83 | | - |
84 | 95 | # Supported databases |
85 | 96 |
|
86 | 97 |
|
87 | 98 | | Database | Status | Connection string | |
88 | 99 | |---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------| |
89 | | -| PostgreSQL >=10 | 💚 | `postgresql://<user>:<password>@<host>:5432/<database>` | |
90 | | -| MySQL | 💚 | `mysql://<user>:<password>@<hostname>:5432/<database>` | |
91 | | -| Snowflake | 💚 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` | |
92 | | -| BigQuery | 💚 | `bigquery://<project>/<dataset>` | |
93 | | -| Redshift | 💚 | `redshift://<username>:<password>@<hostname>:5439/<database>` | |
94 | | -| Oracle | 💛 | `oracle://<username>:<password>@<hostname>/database` | |
95 | | -| Presto | 💛 | `presto://<username>:<password>@<hostname>:8080/<database>` | |
96 | | -| Databricks | 💛 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` | |
97 | | -| Trino | 💛 | `trino://<username>:<password>@<hostname>:8080/<database>` | |
98 | | -| Clickhouse | 💛 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` | |
99 | | -| Vertica | 💛 | `vertica://<username>:<password>@<hostname>:5433/<database>` | |
100 | | -| DuckDB | 💛 | | |
| 100 | +| PostgreSQL >=10 | 🟢 | `postgresql://<user>:<password>@<host>:5432/<database>` | |
| 101 | +| MySQL | 🟢 | `mysql://<user>:<password>@<hostname>:5432/<database>` | |
| 102 | +| Snowflake | 🟢 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` | |
| 103 | +| BigQuery | 🟢 | `bigquery://<project>/<dataset>` | |
| 104 | +| Redshift | 🟢 | `redshift://<username>:<password>@<hostname>:5439/<database>` | |
| 105 | +| Oracle | 🟡 | `oracle://<username>:<password>@<hostname>/database` | |
| 106 | +| Presto | 🟡 | `presto://<username>:<password>@<hostname>:8080/<database>` | |
| 107 | +| Databricks | 🟡 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` | |
| 108 | +| Trino | 🟡 | `trino://<username>:<password>@<hostname>:8080/<database>` | |
| 109 | +| Clickhouse | 🟡 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` | |
| 110 | +| Vertica | 🟡 | `vertica://<username>:<password>@<hostname>:5433/<database>` | |
| 111 | +| DuckDB | 🟡 | | |
101 | 112 | | ElasticSearch | 📝 | | |
102 | 113 | | Planetscale | 📝 | | |
103 | 114 | | Pinot | 📝 | | |
104 | 115 | | Druid | 📝 | | |
105 | 116 | | Kafka | 📝 | | |
106 | 117 | | SQLite | 📝 | | |
107 | 118 |
|
108 | | -* 💚: Implemented and thoroughly tested. |
109 | | -* 💛: Implemented, but not thoroughly tested yet. |
| 119 | +* 🟢: Implemented and thoroughly tested. |
| 120 | +* 🟡: Implemented, but not thoroughly tested yet. |
110 | 121 | * ⏳: Implementation in progress. |
111 | 122 | * 📝: Implementation planned. Contributions welcome. |
112 | 123 |
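The connection strings in the table above follow the usual URI shape (`scheme://user:password@host:port/path`). As a rough sketch of how the placeholders map to components, the snippet below splits one with Python's standard `urllib.parse`. It does not show how `data-diff` itself parses these strings; the helper name `describe_connection` and the example host and credentials are hypothetical.

```python
from urllib.parse import urlsplit


def describe_connection(uri):
    """Split a connection URI into its main parts (password left out)."""
    parts = urlsplit(uri)
    return {
        "driver": parts.scheme,              # e.g. "postgresql"
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,                  # None if the URI has no port
        "database": parts.path.lstrip("/"),  # path without the leading slash
    }


# Made-up values filling the <user>, <password>, <host>, <database> slots.
info = describe_connection("postgresql://alice:secret@db.example.com:5432/analytics")
print(info)
```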
|
|