19 changes: 18 additions & 1 deletion materialized_aggregations.mdx
@@ -82,7 +82,7 @@ The following table lists the supported aggregations along with some notes.
|`count` | |
|`std` | Standard deviation. Requires at least 2 values. |
|`var` | Variance. Same requirements as `std`. |
-|`approx_count_distinct` | An approximation of the cardinality of non-null data. |
+|`approx_count_distinct` | An approximation of the cardinality of non-null data. Uses [Apache DataSketches CPC](https://datasketches.apache.org/docs/CPC/CpcSketches.html). |

These aggregations can be applied to DataFrame features that represent a [has-many](/docs/has-many) join relationship
between two feature classes. Typically, these joins can be defined using a join key, like in our previous example:
@@ -153,6 +153,23 @@ class User:

---

## Approximate Count Distinct

The `approx_count_distinct` aggregation estimates the number of unique non-null values in your data
using the [Compressed Probabilistic Counting (CPC) sketch](https://datasketches.apache.org/docs/CPC/CpcSketches.html)
algorithm from Apache DataSketches.

### Why use approximate count distinct?

Computing exact distinct counts for large datasets can be memory-intensive and slow, especially for materialized
aggregations where you need to track uniqueness across many time buckets. The CPC sketch algorithm provides:

- **Memory efficiency**: Memory use is bounded by the sketch size instead of growing with the number of unique values
- **Mergeable sketches**: Partial aggregates from different buckets can be combined without rescanning the underlying data
- **High accuracy**: Estimates have low relative error (typically < 2% for reasonable sketch sizes), and larger sketches are more accurate
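
To make the "mergeable sketches" property concrete, here is a minimal pure-Python sketch of the idea. It uses a K-Minimum-Values (KMV) sketch as a simple stand-in, not the CPC algorithm itself and not Chalk's implementation. The `KMVSketch` class, its method names, and the default `k` are hypothetical and chosen only for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative stand-in, NOT CPC)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (stored negated) of the k smallest hashes
        self._members = set()  # same hashes, for duplicate detection

    @staticmethod
    def _hash(value):
        # Map a value to a deterministic pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, value):
        self._add_hash(self._hash(value))

    def merge(self, other):
        """Return a new sketch summarizing the union of both inputs."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# Two time buckets with overlapping (hypothetical) user IDs.
bucket_a, bucket_b = KMVSketch(), KMVSketch()
for user_id in range(0, 12_000):
    bucket_a.update(user_id)
for user_id in range(8_000, 20_000):
    bucket_b.update(user_id)

window = bucket_a.merge(bucket_b)
print(f"estimated distinct users: {window.estimate():.0f} (true: 20000)")
```

Merging the two bucket sketches gives the same estimate as a single sketch built over all the raw values, so each bucket can be summarized once and then combined for any window that covers it. CPC offers the same merge property with a more compact representation.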

---

## How do I use materialized aggregations with Chalk?

Users can materialize a feature aggregation in Chalk by supplying the [`materialization`](/api-docs#windowed.materialization)