python/sources/mysql_cdc/README.md
Lines changed: 126 additions & 3 deletions
@@ -1,6 +1,14 @@
1
1
# MySQL CDC
2
2
3
-
This connector demonstrates how to capture changes to a MySQL database table (using CDC) and publish the change events to a Kafka topic using MySQL binary log replication.
3
+
This connector demonstrates how to capture changes to a MySQL database table (using CDC) and publish the change events to a Kafka topic using MySQL binary log replication. It features **persistent binlog position tracking** to ensure exactly-once processing and automatic recovery after restarts.
4
+
5
+
## Key Features
6
+
7
+
- **Persistent Binlog Position**: Automatically saves and resumes from the last processed binlog position
8
+
- **Exactly-Once Processing**: No data loss during application restarts or failures
9
+
- **Initial Snapshot**: Optionally capture existing data before starting CDC
10
+
- **Automatic Recovery**: Seamlessly resume processing after interruptions
11
+
- **Change Buffering**: Batches changes for efficient Kafka publishing
4
12
5
13
## How to run
6
14
@@ -11,8 +19,7 @@ This connector demonstrates how to capture changes to a MySQL database table (us
11
19
12
20
## Environment variables
13
21
14
-
The connector uses the following environment variables:
15
-
22
+
### Required MySQL Connection
16
23
- **output**: Name of the output topic to write into.
17
24
- **MYSQL_HOST**: The IP address or fully qualified domain name of your MySQL server.
18
25
- **MYSQL_PORT**: The port number to use for communication with the server (default: 3306).
@@ -22,11 +29,127 @@ The connector uses the following environment variables:
22
29
- **MYSQL_SCHEMA**: The name of the schema/database for CDC (same as MYSQL_DATABASE).
23
30
- **MYSQL_TABLE**: The name of the table for CDC.
24
31
32
+
### Optional Configuration
33
+
- **MYSQL_SNAPSHOT_HOST**: MySQL host for the initial snapshot (defaults to MYSQL_HOST if not set). Use this to take the initial snapshot from a different MySQL instance (e.g., a read replica).
34
+
- **INITIAL_SNAPSHOT**: Set to "true" to perform an initial snapshot of existing data (default: false).
35
+
- **SNAPSHOT_BATCH_SIZE**: Number of rows to process in each snapshot batch (default: 1000).
36
+
- **FORCE_SNAPSHOT**: Set to "true" to force a snapshot even if one has already completed (default: false).
37
+
38
+
### State Management
39
+
- **Quix__State__Dir**: Directory for storing application state including binlog positions (default: "state").
40
+
41
+
## Binlog Position Persistence
42
+
43
+
The connector automatically tracks the MySQL binlog position and saves it to disk after successful Kafka delivery. This ensures:
44
+
45
+
- **No data loss** during application restarts
46
+
- **Exactly-once processing** of database changes
47
+
- **Automatic resumption** from the last processed position
48
+
49
+
Position files are stored in: `{STATE_DIR}/binlog_position_{schema}_{table}.json`
50
+
51
+
Example position file:
52
+
```json
53
+
{
54
+
"log_file": "mysql-bin.000123",
55
+
"log_pos": 45678,
56
+
"timestamp": 1704067200.0,
57
+
"readable_time": "2024-01-01 00:00:00 UTC"
58
+
}
59
+
```
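
As a rough illustration of the position file described above, the following Python sketch writes and reads a file with the same name pattern and fields. The function names (`position_path`, `save_position`, `load_position`) and the write-to-temp-then-rename approach are assumptions made for this example, not the connector's confirmed implementation.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone


def position_path(state_dir, schema, table):
    # Mirrors the documented pattern: {STATE_DIR}/binlog_position_{schema}_{table}.json
    return os.path.join(state_dir, f"binlog_position_{schema}_{table}.json")


def save_position(state_dir, schema, table, log_file, log_pos):
    """Write the position to a temp file, then rename it over the real file."""
    os.makedirs(state_dir, exist_ok=True)
    now = time.time()
    payload = {
        "log_file": log_file,
        "log_pos": log_pos,
        "timestamp": now,
        "readable_time": datetime.fromtimestamp(now, tz=timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S UTC"
        ),
    }
    # Temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_name, position_path(state_dir, schema, table))


def load_position(state_dir, schema, table):
    """Return (log_file, log_pos), or None when no position has been saved yet."""
    try:
        with open(position_path(state_dir, schema, table)) as fh:
            data = json.load(fh)
    except FileNotFoundError:
        return None
    return data["log_file"], data["log_pos"]
```

Renaming a fully written temp file means a crash during the write should leave the previous position file intact rather than a half-written one.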
60
+
61
+
## Initial Snapshot
62
+
63
+
Enable initial snapshot to capture existing table data before starting CDC:
64
+
65
+
```env
66
+
INITIAL_SNAPSHOT=true
67
+
SNAPSHOT_BATCH_SIZE=1000
68
+
MYSQL_SNAPSHOT_HOST=replica.mysql.example.com # Optional: use read replica
69
+
```
70
+
71
+
The initial snapshot:
72
+
- Processes data in configurable batches to avoid memory issues
73
+
- Sends snapshot records with `"kind": "snapshot_insert"` to distinguish from real inserts
74
+
- Marks completion to avoid re-processing on restart
75
+
- Can be forced to re-run with `FORCE_SNAPSHOT=true`
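
The batching behaviour listed above can be sketched in Python as follows. This is a minimal illustration, not the connector's code: `fetch_batch` is a hypothetical callable standing in for a `LIMIT`/`OFFSET` style query, and the `"data"` field name is an assumption (only the `"kind": "snapshot_insert"` marker comes from this README).

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]


def snapshot_events(
    fetch_batch: Callable[[int, int], List[Row]],
    batch_size: int = 1000,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield snapshot events one batch at a time so memory use stays bounded."""
    offset = 0
    while True:
        rows = fetch_batch(offset, batch_size)
        if not rows:
            break
        # Tag each row so consumers can tell snapshot rows from live inserts.
        yield [{"kind": "snapshot_insert", "data": row} for row in rows]
        if len(rows) < batch_size:
            break  # a short batch means the table is exhausted
        offset += len(rows)


# Usage with an in-memory stand-in for the table:
table = [{"id": i} for i in range(5)]
batches = list(snapshot_events(lambda offset, limit: table[offset:offset + limit], 2))
```

With five rows and a batch size of two, the generator yields three batches, the last one containing a single event.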
76
+
25
77
## Requirements / Prerequisites
26
78
27
79
- A MySQL Database with binary logging enabled.
28
80
- Set `log-bin=mysql-bin` and `binlog-format=ROW` in MySQL configuration.
29
81
- MySQL user with `REPLICATION SLAVE` and `REPLICATION CLIENT` privileges.
82
+
- For initial snapshot: `SELECT` privilege on the target table.
83
+
84
+
### MySQL Configuration Example
85
+
```ini
86
+
[mysqld]
87
+
server-id = 1
88
+
log_bin = /var/log/mysql/mysql-bin.log
89
+
binlog_expire_logs_seconds = 864000
90
+
max_binlog_size = 100M
91
+
binlog-format = ROW
92
+
binlog_row_metadata = FULL
93
+
binlog_row_image = FULL
94
+
```
95
+
96
+
### MySQL User Permissions
97
+
```sql
98
+
-- Create replication user
99
+
CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'secure_password';
100
+
101
+
-- Grant replication privileges
102
+
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
103
+
104
+
-- Grant select for initial snapshot
105
+
GRANT SELECT ON your_database.your_table TO 'cdc_user'@'%';
106
+
107
+
FLUSH PRIVILEGES;
108
+
```
109
+
110
+
## Change Event Format
111
+
112
+
### INSERT/Snapshot Insert
113
+
```json
114
+
{
115
+
"kind": "insert", // or "snapshot_insert" for initial snapshot
This application implements MySQL CDC using MySQL binary log replication.
3
+
This application implements MySQL CDC using MySQL binary log replication with **persistent binlog position tracking** for exactly-once processing and automatic recovery.
4
+
5
+
## Key Features
6
+
7
+
- **Persistent Binlog Position**: Automatically saves and resumes from the last processed binlog position
8
+
- **Exactly-Once Processing**: No data loss during application restarts or failures
9
+
- **Initial Snapshot**: Optionally capture existing data before starting CDC
10
+
- **Automatic Recovery**: Seamlessly resume processing after interruptions
11
+
- **Change Buffering**: Batches changes for efficient Kafka publishing
12
+
- **State Management**: Integrated state persistence for production reliability
4
13
5
14
## Prerequisites
6
15
7
16
1. **MySQL Configuration**: Your MySQL server must have binary logging enabled with ROW format:
8
17
```ini
9
18
# Add to MySQL configuration file (my.cnf or my.ini)
10
-
log-bin=mysql-bin
11
-
binlog-format=ROW
12
-
server-id=1
19
+
[mysqld]
20
+
server-id = 1
21
+
log_bin = /var/log/mysql/mysql-bin.log
22
+
binlog_expire_logs_seconds = 864000
23
+
max_binlog_size = 100M
24
+
binlog-format = ROW
25
+
binlog_row_metadata = FULL
26
+
binlog_row_image = FULL
13
27
```
14
28
15
29
2. **MySQL User Permissions**: The MySQL user needs REPLICATION SLAVE and REPLICATION CLIENT privileges:
16
30
```sql
17
-
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'your_user'@'%';
18
-
GRANT SELECT ON your_database.your_table TO 'your_user'@'%';
31
+
-- Create replication user
32
+
CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'secure_password';
33
+
34
+
-- Grant replication privileges for CDC
35
+
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
36
+
37
+
-- Grant select for initial snapshot (if using snapshot feature)
38
+
GRANT SELECT ON your_database.your_table TO 'cdc_user'@'%';
39
+
19
40
FLUSH PRIVILEGES;
20
41
```
21
42
22
43
## Environment Variables
23
44
24
45
Set the following environment variables:
25
46
26
-
### MySQL Connection
47
+
### Required MySQL Connection
27
48
- `MYSQL_HOST` - MySQL server hostname (e.g., localhost)
28
49
- `MYSQL_PORT` - MySQL server port (default: 3306)
29
50
- `MYSQL_USER` - MySQL username
@@ -32,7 +53,16 @@ Set the following environment variables:
32
53
- `MYSQL_SCHEMA` - MySQL database name (same as MYSQL_DATABASE)
33
54
- `MYSQL_TABLE` - Table name to monitor for changes
34
55
35
-
### Kafka Output (unchanged)
56
+
### Optional Configuration
57
+
- `MYSQL_SNAPSHOT_HOST` - MySQL host for initial snapshot (defaults to MYSQL_HOST). Use this to snapshot from a read replica
58
+
- `INITIAL_SNAPSHOT` - Set to "true" to perform initial snapshot (default: false)
59
+
- `SNAPSHOT_BATCH_SIZE` - Rows per snapshot batch (default: 1000)
60
+
- `FORCE_SNAPSHOT` - Set to "true" to force re-snapshot (default: false)
61
+
62
+
### State Management
63
+
- `Quix__State__Dir` - Directory for storing state files (default: "state")
64
+
65
+
### Kafka Output
36
66
- `output` - Kafka topic name for publishing changes
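
To make the defaults listed above concrete, here is a minimal Python sketch of reading a subset of these variables. The `CdcConfig` dataclass and `load_config` function are hypothetical names introduced for illustration; only the variable names and defaults come from this README, and the real application may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # The README documents "true" as the enabling value; anything else is off.
    return value.strip().lower() == "true"


@dataclass(frozen=True)
class CdcConfig:
    host: str
    port: int
    snapshot_host: str
    initial_snapshot: bool
    snapshot_batch_size: int
    force_snapshot: bool
    state_dir: str
    output_topic: str


def load_config(env: Mapping[str, str] = os.environ) -> CdcConfig:
    host = env["MYSQL_HOST"]  # required: raises KeyError when missing
    return CdcConfig(
        host=host,
        port=int(env.get("MYSQL_PORT", "3306")),
        # Falls back to the main host when no snapshot host is configured.
        snapshot_host=env.get("MYSQL_SNAPSHOT_HOST") or host,
        initial_snapshot=_as_bool(env.get("INITIAL_SNAPSHOT", "false")),
        snapshot_batch_size=int(env.get("SNAPSHOT_BATCH_SIZE", "1000")),
        force_snapshot=_as_bool(env.get("FORCE_SNAPSHOT", "false")),
        state_dir=env.get("Quix__State__Dir", "state"),
        output_topic=env["output"],  # required
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation, for example `load_config({"MYSQL_HOST": "db", "output": "cdc-changes-topic"})`.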
37
67
38
68
## Example .env file
@@ -41,16 +71,80 @@ Set the following environment variables:
41
71
# MySQL Connection
42
72
MYSQL_HOST=localhost
43
73
MYSQL_PORT=3306
44
-
MYSQL_USER=replication_user
45
-
MYSQL_PASSWORD=your_password
74
+
MYSQL_USER=cdc_user
75
+
MYSQL_PASSWORD=secure_password
46
76
MYSQL_DATABASE=your_database
47
77
MYSQL_SCHEMA=your_database
48
78
MYSQL_TABLE=your_table
49
79
80
+
# Optional: Use read replica for initial snapshot
81
+
MYSQL_SNAPSHOT_HOST=replica.mysql.example.com
82
+
83
+
# Initial Snapshot Configuration
84
+
INITIAL_SNAPSHOT=true
85
+
SNAPSHOT_BATCH_SIZE=1000
86
+
FORCE_SNAPSHOT=false
87
+
88
+
# State Management
89
+
Quix__State__Dir=./state
90
+
50
91
# Kafka Output
51
92
output=cdc-changes-topic
52
93
```
53
94
95
+
## Binlog Position Persistence
96
+
97
+
The application automatically tracks MySQL binlog positions and persists them to disk:
98
+
99
+
### How it works:
100
+
1. **Position Tracking**: Records current binlog file and position during processing
101
+
2. **Automatic Saving**: Saves position after successful Kafka delivery
102
+
3. **Recovery**: Automatically resumes from last saved position on restart
103
+
4. **Exactly-Once**: Ensures no data loss or duplication
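
The ordering in the steps above (publish first, then save the position) can be sketched in Python as below. This is an illustration under stated assumptions rather than the application's actual loop: `publish_batch` and `save_position` are hypothetical callables, each change is modelled as a `(log_file, log_pos, payload)` tuple, and `publish_batch` is assumed to raise if delivery fails.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Change = Tuple[str, int, Dict]  # (log_file, log_pos, payload)


def _flush(
    buffer: List[Change],
    publish_batch: Callable[[List[Dict]], None],
    save_position: Callable[[str, int], None],
) -> None:
    # Deliver first; if this raises, the position below is never advanced.
    publish_batch([payload for _, _, payload in buffer])
    log_file, log_pos, _ = buffer[-1]
    save_position(log_file, log_pos)
    buffer.clear()


def process_changes(
    changes: Iterable[Change],
    publish_batch: Callable[[List[Dict]], None],
    save_position: Callable[[str, int], None],
    buffer_size: int = 100,
) -> None:
    """Buffer changes, publish them in batches, and persist the last position."""
    buffer: List[Change] = []
    for change in changes:
        buffer.append(change)
        if len(buffer) >= buffer_size:
            _flush(buffer, publish_batch, save_position)
    if buffer:
        _flush(buffer, publish_batch, save_position)
```

In this sketch, a crash after `publish_batch` returns but before `save_position` completes would replay that batch on restart, so the loop on its own gives at-least-once delivery. Avoiding duplicates would need something extra, such as idempotent consumers, which this example does not show.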