
Commit 5a68b3b

Sinks API docs (#439)
Co-authored-by: Tun <[email protected]>
1 parent ad576dd commit 5a68b3b

File tree

9 files changed: +387 -36 lines changed


docs/api-reference/sinks.md

Whitespace-only changes.

docs/build/build.py

Lines changed: 8 additions & 0 deletions

```diff
@@ -107,6 +107,14 @@
             "quixstreams.context",
         ]
     },
+    "sinks.md": {
+        k: None
+        for k in [
+            "quixstreams.sinks.influxdb3",
+            "quixstreams.sinks.csv",
+            "quixstreams.sinks.base.sink",
+        ]
+    },
 }
 
 # Go over all modules and assign them to doc files
```

docs/connectors/sinks/README.md

Lines changed: 86 additions & 0 deletions

# Sinks (beta)

In many stream processing use cases the results need to be written to external destinations to be shared with other subsystems.

Quix Streams provides a sink API to achieve that.

An example using InfluxDB Sink:

```python
from quixstreams import Application
from quixstreams.sinks.influxdb3 import InfluxDB3Sink

app = Application(broker_address="localhost:9092")
topic = app.topic("numbers-topic")

# Initialize InfluxDB3Sink
influx_sink = InfluxDB3Sink(
    token="<influxdb-access-token>",
    host="<influxdb-host>",
    organization_id="<influxdb-org>",
    database="<influxdb-database>",
    measurement="numbers",
    fields_keys=["number"],
    tags_keys=["tag"]
)

sdf = app.dataframe(topic)
# Do some processing here ...
# Sink data to InfluxDB
sdf.sink(influx_sink)
```

## Sinks Are Destinations

When `.sink()` is called on a StreamingDataFrame instance, it marks the end of the processing pipeline, and the StreamingDataFrame can't be changed anymore.

Make sure you call `StreamingDataFrame.sink()` as the last operation.
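
For example, here is a minimal sketch (assuming incoming JSON values with a `number` field, and using the `CSVSink` described below) where all transformations are applied first and the sink is attached as the final step:

```python
from quixstreams import Application
from quixstreams.sinks.csv import CSVSink

app = Application(broker_address="localhost:9092")
sdf = app.dataframe(app.topic("input-topic"))

# Apply all transformations before attaching the sink ...
sdf = sdf.apply(lambda value: {"doubled": value["number"] * 2})

# ... and attach the sink last: the pipeline ends here,
# and `sdf` should not be modified after this call.
sdf.sink(CSVSink(path="output.csv"))
```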

## Supported Sinks

Currently, Quix Streams provides these sinks out of the box:

- [CSV Sink](csv-sink.md) - a simple CSV sink that writes data to a single CSV file.
- [InfluxDB 3 Sink](influxdb3-sink.md) - a sink to write data to InfluxDB 3.

It's also possible to implement your own custom sinks.
Please see the [Creating a Custom Sink](custom-sinks.md) page to learn how to do that.

## Performance Considerations

Since `BatchingSink` implementations accumulate data in memory, they will increase memory usage.

If the batches become large enough, they can also put additional load on the destination and decrease the overall throughput.

To adjust the number of messages that are batched and written in one go, you may provide a `commit_every` parameter to the `Application`.
It limits the amount of data processed and written to sinks during a single checkpoint.
Note that it limits only the number of incoming messages, not the number of records being written to sinks.

**Example:**

```python
from quixstreams import Application
from quixstreams.sinks.influxdb3 import InfluxDB3Sink

# Commit the checkpoints after processing 1000 messages or after a 5 second interval has elapsed (whichever is sooner).
app = Application(
    broker_address="localhost:9092",
    commit_interval=5.0,
    commit_every=1000,
)
topic = app.topic('numbers-topic')
sdf = app.dataframe(topic)

# Create an InfluxDB sink that batches data between checkpoints.
influx_sink = InfluxDB3Sink(
    token="<influxdb-access-token>",
    host="<influxdb-host>",
    organization_id="<influxdb-org>",
    database="<influxdb-database>",
    measurement="numbers",
    fields_keys=["number"],
    tags_keys=["tag"]
)

# The sink will write to InfluxDB across all assigned partitions.
sdf.sink(influx_sink)
```

docs/connectors/sinks/csv-sink.md

Lines changed: 58 additions & 0 deletions

# CSV Sink

A basic sink to write processed data to a single CSV file.

It's meant to be used mostly for local debugging.

## How To Use CSV Sink

To use a CSV sink, you need to create an instance of `CSVSink` and pass
it to the `StreamingDataFrame.sink()` method:

```python
from quixstreams import Application
from quixstreams.sinks.csv import CSVSink

app = Application(broker_address="localhost:9092")
topic = app.topic("input-topic")

# Initialize a CSV sink with a file path
csv_sink = CSVSink(path="file.csv")

sdf = app.dataframe(topic)
# Do some processing here ...
# Sink data to a CSV file
sdf.sink(csv_sink)
```

## How the CSV Sink Works

`CSVSink` is a batching sink.
It batches processed records in memory per topic partition, and writes them to the file when a checkpoint is committed.

The output file format is the following:

```
key,value,timestamp,topic,partition,offset
b'afd7e8ab-4af5-4322-8417-dbfc7a0d7694',"{""number"": 0}",1722945524540,numbers-10k-keys,0,0
b'557bae7f-14b6-46c4-abc3-12f232b54c8e',"{""number"": 1}",1722945524546,numbers-10k-keys,0,1
```

## Serialization Formats

By default, `CSVSink` serializes record keys by calling `str()` on them, and message values with `json.dumps()`.

To use your own serializers, pass `key_serializer` and `value_serializer` to `CSVSink`:

```python
import json
from quixstreams.sinks.csv import CSVSink

# Initialize a CSVSink with a file path
csv_sink = CSVSink(
    path="file.csv",
    # Define custom serializers for keys and values here.
    # The callables must accept one argument for key/value, and return a string.
    key_serializer=lambda key: json.dumps(key),
    value_serializer=lambda value: str(value),
)
```

## Delivery Guarantees

`CSVSink` provides at-least-once guarantees, and the resulting CSV file may contain duplicated rows of data if there were errors during processing.

docs/connectors/sinks/custom-sinks.md

Lines changed: 83 additions & 0 deletions

# Creating a Custom Sink

Quix Streams provides basic facilities to implement custom sinks for external destinations (currently in beta).

To create a new sink, extend and implement the following Python base classes:

- `quixstreams.sinks.base.sink.BaseSink` - the parent interface for all sinks.
  `StreamingDataFrame.sink()` accepts implementations of this class.

- `quixstreams.sinks.base.sink.BatchingSink` - a base class for batching sinks that need to batch data before writing it to the external destination.
  Check out [InfluxDB3Sink](influxdb3-sink.md) and [CSVSink](csv-sink.md) for example implementations of batching sinks.

Here is the code of the `BaseSink` class for reference:

```python
import abc
from typing import Any, List, Tuple

# `HeaderValue` is a type defined in the quixstreams package.


class BaseSink(abc.ABC):
    """
    This is a base class for all sinks.

    Subclass and implement its methods to create your own sink.

    Note that sinks are currently in beta, and their design may change over time.
    """

    @abc.abstractmethod
    def flush(self, topic: str, partition: int):
        """
        This method is triggered by the Checkpoint class when it commits.

        You can use `flush()` to write the batched data to the destination (in case of
        a batching sink), or confirm the delivery of the previously sent messages
        (in case of a streaming sink).

        If flush() fails, the checkpoint will be aborted.
        """

    @abc.abstractmethod
    def add(
        self,
        value: Any,
        key: Any,
        timestamp: int,
        headers: List[Tuple[str, HeaderValue]],
        topic: str,
        partition: int,
        offset: int,
    ):
        """
        This method is triggered on every new record sent to this sink.

        You can use it to accumulate batches of data before sending them outside, or
        to send results right away in a streaming manner and confirm a delivery later
        on flush().
        """

    def on_paused(self, topic: str, partition: int):
        """
        This method is triggered when the sink is paused due to backpressure, when
        the `SinkBackpressureError` is raised.

        Here you can react to backpressure events.
        """
```
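
As an illustration, here is a minimal sketch of a custom sink that subclasses `BaseSink` and appends records to a local file whenever a checkpoint commits. It is not part of the library; the class name, file path, and buffering strategy are arbitrary choices made for this example:

```python
import json
from typing import Any, List, Tuple

from quixstreams.sinks.base.sink import BaseSink


class JSONLinesSink(BaseSink):
    """A hypothetical sink that writes records to a JSON Lines file."""

    def __init__(self, path: str):
        super().__init__()
        self._path = path
        # Buffer records per (topic, partition) until the checkpoint commits
        self._buffers: dict = {}

    def add(
        self,
        value: Any,
        key: Any,
        timestamp: int,
        headers: List[Tuple[str, Any]],
        topic: str,
        partition: int,
        offset: int,
    ):
        # Accumulate records in memory; they are written on flush()
        self._buffers.setdefault((topic, partition), []).append(
            {"key": str(key), "value": value, "timestamp": timestamp, "offset": offset}
        )

    def flush(self, topic: str, partition: int):
        # Write the accumulated batch when the checkpoint is committed.
        # If this raises, the checkpoint will be aborted and the data re-processed.
        batch = self._buffers.pop((topic, partition), [])
        with open(self._path, "a") as f:
            for record in batch:
                f.write(json.dumps(record) + "\n")
```

Such a sink could then be attached like any other: `sdf.sink(JSONLinesSink(path="output.jsonl"))`.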

## Sinks Workflow

During processing, sinks perform the following operations:

1. When a new record arrives, the application calls the `BaseSink.add()` method.
   At this point, the sink implementation can decide what to do with the new record.
   For example, `BatchingSink` will add the record to an in-memory batch.
   Other sinks may write the data straight away.

2. When the current checkpoint is committed, the app calls `BaseSink.flush()`.
   For example, `BatchingSink` will write the accumulated data during `flush()`.
    1. If the destination cannot accept new data, sinks can raise a special exception, `SinkBackpressureError(topic, partition, retry_after)`, specifying the timeout after which the writes should be retried (see the sketch after this list).
    2. The application reacts to `SinkBackpressureError` by pausing the corresponding topic partition for the given time and seeking the partition offset back to the beginning of the checkpoint.
    3. When the timeout elapses, the app resumes consuming from this partition, re-processes the data, and tries to sink it again.

3. If any of the sinks fail during `flush()`, the application aborts the checkpoint, and the data will be re-processed.
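
For illustration, here is a hedged sketch of step 2.1 — a custom sink whose `flush()` signals backpressure. Only the name and call signature of `SinkBackpressureError` come from the workflow above; its import path is an assumption, as are the omitted write helpers:

```python
# NOTE: the import path of SinkBackpressureError is assumed for this sketch;
# check the quixstreams.sinks package in your installed version.
from quixstreams.sinks.base.exceptions import SinkBackpressureError
from quixstreams.sinks.base.sink import BaseSink


class MyDatabaseSink(BaseSink):  # hypothetical custom sink
    def add(self, value, key, timestamp, headers, topic, partition, offset):
        ...  # accumulate the record in memory (omitted)

    def flush(self, topic: str, partition: int):
        try:
            ...  # write the accumulated batch to the destination (omitted)
        except ConnectionError:
            # The destination cannot accept new data right now: ask the application
            # to pause this topic partition and retry after 30 seconds.
            raise SinkBackpressureError(topic=topic, partition=partition, retry_after=30.0)
```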

docs/connectors/sinks/influxdb3-sink.md

Lines changed: 115 additions & 0 deletions

# InfluxDB v3 Sink

InfluxDB is an open source time series database for metrics, events, and real-time analytics.

Quix Streams provides a sink to write processed data to InfluxDB v3.

>***NOTE***: This sink only supports InfluxDB v3. Versions 1 and 2 are not supported.

## How To Use the InfluxDB Sink

To sink data to InfluxDB, you need to create an instance of `InfluxDB3Sink` and pass
it to the `StreamingDataFrame.sink()` method:

```python
from quixstreams import Application
from quixstreams.sinks.influxdb3 import InfluxDB3Sink

app = Application(broker_address="localhost:9092")
topic = app.topic("numbers-topic")

# Initialize InfluxDB3Sink
influx_sink = InfluxDB3Sink(
    token="<influxdb-access-token>",
    host="<influxdb-host>",
    organization_id="<influxdb-org>",
    database="<influxdb-database>",
    measurement="numbers",
    fields_keys=["number"],
    tags_keys=["tag"]
)

sdf = app.dataframe(topic)
# Do some processing here ...
# Sink data to InfluxDB
sdf.sink(influx_sink)
```

## How the InfluxDB Sink Works

`InfluxDB3Sink` is a batching sink.
It batches processed records in memory per topic partition, and writes them to the InfluxDB instance when a checkpoint has been committed.

Under the hood, it transforms data to the InfluxDB format and writes the processed records in batches.

### What data can be sent to InfluxDB?

`InfluxDB3Sink` can accept only dictionary values.

If the record values are not dicts, you need to convert them to dicts using `StreamingDataFrame.apply()` before sinking.
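
For example, a minimal sketch (assuming the topic carries plain numeric values) that wraps each value in a dict before sinking:

```python
sdf = app.dataframe(topic)

# InfluxDB3Sink accepts only dict values, so wrap plain numbers into a dict first.
sdf = sdf.apply(lambda value: {"number": value})

sdf.sink(influx_sink)
```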

The structure of the data written to InfluxDB is defined by the `fields_keys` and `tags_keys` parameters provided to the sink class.

- `fields_keys` - a list of keys to be used as "fields" when writing to InfluxDB.
  If present, its keys cannot overlap with any in `tags_keys`.
  If empty, the whole record value will be used.
  The fields' values can only be strings, floats, integers, or booleans.

- `tags_keys` - a list of keys to be used as "tags" when writing to InfluxDB.
  If present, its keys cannot overlap with any in `fields_keys`.
  These keys will be popped from the value dictionary automatically because InfluxDB doesn't allow the same keys to be in both tags and fields.
  If empty, no tags will be sent.
  >***NOTE***: The InfluxDB client always converts tag values to strings.

To learn more about schema design and data types in InfluxDB, please read the [InfluxDB schema design recommendations](https://docs.influxdata.com/influxdb/cloud-serverless/write-data/best-practices/schema-design/).

## Delivery Guarantees

`InfluxDB3Sink` provides at-least-once guarantees, and the same records may be written multiple times in case of errors during processing.

## Backpressure Handling

The InfluxDB sink automatically handles events when the database cannot accept new data due to write limits.

When this happens, the application drops the accumulated in-memory batch and pauses the corresponding topic partition for the timeout duration returned by the InfluxDB API (it returns an HTTP error with a 429 status code and a `Retry-After` header with a timeout).
When the timeout expires, the app automatically resumes the partition to re-process the data and sink it again.

## Configuration

`InfluxDB3Sink` accepts the following configuration parameters:

- `token` - InfluxDB access token.

- `host` - InfluxDB host in the format "https://<host>".

- `organization_id` - InfluxDB organization ID.

- `database` - a database name.

- `measurement` - a measurement name, required.

- `fields_keys` - a list of keys to be used as "fields" when writing to InfluxDB.
  See [What data can be sent to InfluxDB](#what-data-can-be-sent-to-influxdb) for more info.

- `tags_keys` - a list of keys to be used as "tags" when writing to InfluxDB.
  See [What data can be sent to InfluxDB](#what-data-can-be-sent-to-influxdb) for more info.

- `time_key` - a key to be used as "time" when writing to InfluxDB.
  By default, the record timestamp will be used with millisecond time precision.
  When using a custom key, you may need to adjust the `time_precision` setting to match.

- `time_precision` - the time precision to use when writing to InfluxDB.
  Default - `ms`.

- `include_metadata_tags` - if True, includes the record's key, topic, and partition as tags.
  Default - `False`.

- `batch_size` - the number of records to write to InfluxDB in one request.
  Note that it only affects the size of one write request, not the number of records flushed on each checkpoint.
  Default - `1000`.

- `enable_gzip` - if True, enables gzip compression for writes.
  Default - `True`.

- `request_timeout_ms` - the HTTP request timeout in milliseconds.
  Default - `10000`.

- `debug` - if True, prints debug logs from the InfluxDB client.
  Default - `False`.
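
As a reference, here is a hedged sketch of a sink configured with several of the optional parameters above (the values are placeholders chosen for the example, not recommendations):

```python
from quixstreams.sinks.influxdb3 import InfluxDB3Sink

influx_sink = InfluxDB3Sink(
    token="<influxdb-access-token>",
    host="<influxdb-host>",
    organization_id="<influxdb-org>",
    database="<influxdb-database>",
    measurement="numbers",
    fields_keys=["number"],
    tags_keys=["tag"],
    # Use a custom value key as the record time and match its precision.
    time_key="event_time",
    time_precision="ms",
    # Add the record's key, topic, and partition as extra tags.
    include_metadata_tags=True,
    # Write at most 500 records per HTTP request, with gzip and a 15 s timeout.
    batch_size=500,
    enable_gzip=True,
    request_timeout_ms=15000,
    debug=False,
)
```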

mkdocs.yml

Lines changed: 6 additions & 0 deletions

```diff
@@ -41,6 +41,12 @@ nav:
   - Managing Kafka Topics: advanced/topics.md
   - Using Producer & Consumer: advanced/producer-consumer-lowlevel.md
   - Connecting to Quix Cloud: quix-platform.md
+  - 'Connectors [beta]':
+    - Sinks:
+      - 'connectors/sinks/README.md'
+      - CSV Sink: connectors/sinks/csv-sink.md
+      - InfluxDB v3 Sink: connectors/sinks/influxdb3-sink.md
+      - Creating a Custom Sink: connectors/sinks/custom-sinks.md
   - Upgrading Guide:
     - Upgrading from Quix Streams v0.5: upgrading-legacy.md
```
