Skip to content

Support Variance, Standard Deviation, Covariance, and Correlation Decomposition #1613

@shangyian

Description

@shangyian

Summary

Add decomposition support for statistical aggregation functions, enabling pre-aggregation of variance, standard deviation, covariance, and correlation metrics in materialized cubes.

Implementation

Variance & Standard Deviation (Single-column)

Mathematical basis: VAR_POP = E[X²] - E[X]²

Function Components Combine Formula
VAR_POP(x) sum, sum_sq, count (sum_sq/n) - (sum/n)²
VAR_SAMP(x) sum, sum_sq, count (n*sum_sq - sum²) / (n*(n-1))
STDDEV_POP(x) sum, sum_sq, count SQRT(VAR_POP)
STDDEV_SAMP(x) sum, sum_sq, count SQRT(VAR_SAMP)

Component definitions:

ComponentDef("_sum", "SUM", "SUM")           # sum of x
ComponentDef("_sum_sq", "SUM(POWER({}, 2))", "SUM")  # sum of x²
ComponentDef("_count", "COUNT", "SUM")       # count

Covariance (Two-column)

Mathematical basis: COVAR_POP = E[XY] - E[X]*E[Y]

Function Components Combine Formula
COVAR_POP(x, y) sum_x, sum_y, sum_xy, count (sum_xy/n) - (sum_x/n)*(sum_y/n)
COVAR_SAMP(x, y) sum_x, sum_y, sum_xy, count (n*sum_xy - sum_x*sum_y) / (n*(n-1))

Component definitions:

ComponentDef("_sum_x", "SUM({0})", "SUM", arg_index=0)
ComponentDef("_sum_y", "SUM({1})", "SUM", arg_index=1)
ComponentDef("_sum_xy", "SUM({0} * {1})", "SUM", arg_index=None)  # product of args
ComponentDef("_count", "COUNT({0})", "SUM", arg_index=0)

Correlation (Two-column)

Mathematical basis: CORR = COVAR(X,Y) / (STDDEV(X) * STDDEV(Y))

Function Components Notes
CORR(x, y) sum_x, sum_y, sum_x_sq, sum_y_sq, sum_xy, count 6 components

Component definitions:

ComponentDef("_sum_x", "SUM({0})", "SUM", arg_index=0)
ComponentDef("_sum_y", "SUM({1})", "SUM", arg_index=1)
ComponentDef("_sum_x_sq", "SUM(POWER({0}, 2))", "SUM", arg_index=0)
ComponentDef("_sum_y_sq", "SUM(POWER({1}, 2))", "SUM", arg_index=1)
ComponentDef("_sum_xy", "SUM({0} * {1})", "SUM", arg_index=None)
ComponentDef("_count", "COUNT({0})", "SUM", arg_index=0)

Combine formula:

numerator = n * sum_xy - sum_x * sum_y
var_x = n * sum_x_sq - sum_x²
var_y = n * sum_y_sq - sum_y²
CORR = numerator / SQRT(var_x * var_y)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions