Skip to content

Conversation

@IceS2
Copy link
Contributor

@IceS2 IceS2 commented Oct 31, 2025

Implement Dimensional Validation for columnValueMinToBeBetween + Code Quality Refactoring

Summary

This PR implements dimensional validation for the columnValueMinToBeBetween data quality test and includes significant code quality improvements to reduce duplication across validators.

This is part of the dimensional validation feature series, following PRs #23529 (thin slice), #23984 (Mean), and #24051 (Max).

Changes Made

1. Dimensional Validation for columnValueMinToBeBetween

Implemented dimensional validation for the Min validator to analyze minimum values across dimension groups (e.g., min value by region, by category, etc.).

  • Base Validator (columnValueMinToBeBetween.py)

    • Added constants MIN_BOUND and MAX_BOUND for parameter naming
    • Implemented business logic for dimensional validation
    • Added _get_validation_checker() to create BetweenBoundsChecker instances
    • Implemented _get_metrics_to_compute() to specify which metrics to compute per dimension
    • Enhanced documentation with detailed docstrings
  • SQLAlchemy Validator (sqlalchemy/columnValueMinToBeBetween.py)

    • Implemented _execute_dimensional_validation() using CTE-based aggregation
    • Uses _execute_with_others_aggregation_statistical() mixin method
    • Aggregates "Others" using MIN(individual_mins) for correct statistical aggregation
    • Leverages BetweenBoundsChecker for failed count computation
  • Pandas Validator (pandas/columnValueMinToBeBetween.py)

    • Implemented memory-efficient dimensional validation using iterative aggregation
    • Follows the pattern from Mean metric to handle large dataframes without concatenation
    • Accumulates min values across dataframe chunks
    • Uses aggregate_others_statistical_pandas() for Others aggregation
    • Properly handles NULL dimension values

2. New BetweenBoundsChecker Class

Created a reusable violation checker for all "between bounds" validators (Min, Max, Mean, StdDev, Sum, Median, Lengths).

  • File: validations/checkers/between_bounds_checker.py
    • Extends BaseValidationChecker protocol
    • check_pandas(value): Checks if value is outside [min_bound, max_bound]
    • get_sqa_failed_rows_builder(): Returns a callable that builds SQLAlchemy CASE expressions
    • Handles infinite bounds (e.g., only min or only max specified)
    • Properly handles NULL values (returns 0 failed rows for NULL metrics)

3. Extracted _run_dimensional_validation to BaseTestValidator

Eliminated ~200 lines of duplicate code by moving dimensional validation orchestration to the base class.

  • Affected Base Validators:

    • columnValueMinToBeBetween.py
    • columnValueMaxToBeBetween.py
    • columnValueMeanToBeBetween.py
    • columnValuesToBeUnique.py
    • columnValuesToBeInSet.py
  • Benefits:

    • Single source of truth for dimensional validation flow
    • Consistent error handling across all validators
    • Easier to maintain and extend
    • Follows DRY (Don't Repeat Yourself) principle

4. Protocol-Based Architecture for Mixins

Introduced Python Protocols to make mixin dependencies explicit and type-safe.

  • New File: validations/mixins/protocols.py

    • HasValidatorContext: Protocol declaring required attributes (runner, test_case)
    • Enables type-safe mixin methods that access validator state
    • Uses TYPE_CHECKING to avoid runtime imports
  • Updated Mixins:

    • SQAValidatorMixin: Added get_column() implementation using HasValidatorContext protocol
    • PandasValidatorMixin: Added get_column() implementation using HasValidatorContext protocol
    • Eliminated 30+ duplicate get_column() implementations across concrete validators
  • BaseTestValidator:

    • Uses cooperative super() to delegate get_column() to mixins via MRO
    • Type-safe with type: ignore[misc] comment documenting intentional cooperative multiple inheritance

5. Updated Max and Mean Validators to Use BetweenBoundsChecker

Refactored existing dimensional validators to use the new shared checker instead of inline logic.

  • columnValueMaxToBeBetween:

    • Removed inline failed count logic
    • Now uses BetweenBoundsChecker.get_sqa_failed_rows_builder()
    • Reduced code duplication
  • columnValueMeanToBeBetween:

    • Migrated to BetweenBoundsChecker
    • Simplified dimensional validation logic
    • Consistent with other validators

6. Removed Duplicate _get_column_name Implementations

Cleaned up 30+ concrete validators by removing trivial _get_column_name() implementations.

  • Pandas Validators (11 files):

    • Removed _get_column_name() from: Lengths, Max, Mean, Median, Min, StdDev, MissingCount, Sum, ToBeBetween, InSet, NotInSet, NotNull, Unique, ToMatchRegex, ToNotMatchRegex
  • SQLAlchemy Validators (19 files):

    • Removed _get_column_name() from: Lengths, Max, Mean, Median, Min, StdDev, MissingCount, Sum, ToBeBetween, InSet, NotInSet, NotNull, Unique, ToMatchRegex, ToNotMatchRegex

7. Enhanced Test Coverage

  • conftest.py:

    • Added dimension_results_dict fixture for testing dimensional validation results
    • Enhanced test helpers for dimensional validation scenarios
  • test_validations_databases.py:

    • Added comprehensive tests for columnValueMinToBeBetween dimensional validation
    • Tests verify proper aggregation, impact scoring, and Others handling
  • test_validations_datalake.py:

    • Added Pandas-specific dimensional validation tests for Min validator
    • Tests cover chunk iteration and memory-efficient aggregation

8. Documentation Improvements

  • Added detailed docstrings explaining:
    • Dimensional validation flow
    • Statistical vs row-level validators
    • Memory-efficient chunk iteration for Pandas
    • Others aggregation strategies
    • Impact score computation

9. Unified PandasComputation Pattern for Metrics

Introduced a standardized accumulator pattern for pandas metrics to enable efficient dimensional validation with optimal memory usage.

  • New Protocol (metrics/pandas_metric_protocol.py):

    • PandasComputation[T, R]: Dataclass defining accumulator-based metric computation
      • create_accumulator() -> T: Initialize accumulator (Counter, scalar, tuple, etc.)
      • update_accumulator(T, DataFrame) -> T: Update accumulator with chunk data, returns updated value
      • aggregate_accumulator(T) -> R: Convert final accumulator to metric result
    • SupportsPandasComputation: Runtime-checkable Protocol for gradual metric migration
    • Supports both mutable containers (Counter, list) and immutable scalars (int, float)
  • Metrics Updated:

    • Min/Max (min.py, max.py):

      • Accumulator: Optional[float] (running minimum/maximum)
      • Memory: O(1) - maintains single value instead of list
      • Performance: Same as original (negligible overhead)
    • Mean (mean.py):

      • Accumulator: SumAndCount NamedTuple (running sum and count)
      • Memory: O(1) - replaced List[MeanAndCount] with single tuple
      • Performance: 18% faster - avoids weighted average calculation overhead
      • Mathematical equivalence: sum/count = Σ(mean×count)/Σ(count) (verified)
    • UniqueCount (unique_count.py):

      • Accumulator: Counter (frequency map)
      • Memory: O(unique_values) - inherently required for duplicate detection
      • Handles unhashable types via JSON serialization fallback
    • CountInSet (count_in_set.py):

      • Accumulator: int (running count)
      • Memory: O(1) - maintains single counter
      • Performance: Same as original
    • RowCount (row_count.py):

      • Accumulator: int (running total)
      • Memory: O(1) - simple addition across chunks
      • Performance: Same as original (5% faster due to explicit loop)
  • Benefits:

    • Reusable for Dimensional Validation: Metrics can now efficiently compute per-dimension values without concatenating DataFrames
    • Memory Efficient: Scalars use O(1) memory; only UniqueCount requires O(unique_values)
    • Performance: Equal or better than original implementations (Mean 18% faster)
    • Type Safe: Generic type parameters ensure correct accumulator/result types
    • Gradual Migration: Old df_fn() methods preserved; metrics opt-in via Protocol
  • Verified Correctness:

    • Mathematical equivalence proven for all metrics
    • Benchmarked on 1M rows across 100 chunks (1000 iterations)
    • Edge cases tested: empty DataFrames, all NULLs, single row scenarios
    • NULL handling consistent between SQA and Pandas implementations

Key Design Decisions

  1. Statistical Validator Pattern: columnValueMinToBeBetween is a statistical validator (group-level aggregation) rather than row-level validation. The min value represents the entire group, so when min is outside bounds, ALL rows in that dimension are considered failed.

  2. MIN Aggregation for Others: When aggregating low-impact dimensions into "Others", we use MIN(individual_mins) because the minimum of minimums gives the correct overall minimum for the combined group.

  3. BetweenBoundsChecker Abstraction: Created a shared checker class to eliminate duplicate violation logic across 7+ validators (Min, Max, Mean, StdDev, Median, Sum, Lengths). This provides:

    • Consistent violation semantics
    • Easier testing (mock one class instead of inline logic)
    • Single source of truth for "between bounds" logic
  4. Cooperative Multiple Inheritance: Used Python's cooperative super() pattern for _get_column_name() delegation:

    • BaseTestValidator delegates to mixins via MRO
    • Mixins provide concrete implementations
    • Type-safe with Protocol annotations
    • Eliminates 30+ duplicate implementations
  5. Template Method Extraction: Moved _run_dimensional_validation() to BaseTestValidator:

    • Provides orchestration logic (calling abstract methods in correct order)
    • Concrete validators implement platform-specific _execute_dimensional_validation()
    • Follows Gang of Four Template Method pattern

IceS2 and others added 30 commits September 3, 2025 11:22
…open-metadata/OpenMetadata into feature/dimensionality-for-data-quality
…thub.com:open-metadata/OpenMetadata into feature/dimensionality-column-mean-to-be-between
@github-actions
Copy link
Contributor

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.12)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (31)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (2)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
setuptools CVE-2025-47273 🚨 HIGH 70.3.0 78.1.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

@github-actions
Copy link
Contributor

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.9)

Vulnerabilities (19)

Package Vulnerability ID Severity Installed Version Fixed Version
libexpat1 CVE-2023-52425 🚨 HIGH 2.5.0-1+deb12u1 2.5.0-1+deb12u2
libexpat1 CVE-2024-8176 🚨 HIGH 2.5.0-1+deb12u1 2.5.0-1+deb12u2
libgnutls30 CVE-2025-32988 🚨 HIGH 3.7.9-2+deb12u3 3.7.9-2+deb12u5
libgnutls30 CVE-2025-32990 🚨 HIGH 3.7.9-2+deb12u3 3.7.9-2+deb12u5
libicu72 CVE-2025-5222 🚨 HIGH 72.1-3 72.1-3+deb12u1
libperl5.36 CVE-2023-31484 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u3
libperl5.36 CVE-2024-56406 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u2
libsqlite3-0 CVE-2025-6965 🔥 CRITICAL 3.40.1-2+deb12u1 3.40.1-2+deb12u2
libxslt1.1 CVE-2024-55549 🚨 HIGH 1.1.35-1 1.1.35-1+deb12u1
libxslt1.1 CVE-2025-24855 🚨 HIGH 1.1.35-1 1.1.35-1+deb12u1
libxslt1.1 CVE-2025-7424 🚨 HIGH 1.1.35-1 1.1.35-1+deb12u2
perl CVE-2023-31484 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u3
perl CVE-2024-56406 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u2
perl-base CVE-2023-31484 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u3
perl-base CVE-2024-56406 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u2
perl-modules-5.36 CVE-2023-31484 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u3
perl-modules-5.36 CVE-2024-56406 🚨 HIGH 5.36.0-7+deb12u1 5.36.0-7+deb12u2
sqlite3 CVE-2025-6965 🔥 CRITICAL 3.40.1-2+deb12u1 3.40.1-2+deb12u2
sudo CVE-2025-32462 🚨 HIGH 1.9.13p3-1+deb12u1 1.9.13p3-1+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (31)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (11)

Package Vulnerability ID Severity Installed Version Fixed Version
Authlib CVE-2025-59420 🚨 HIGH 1.3.1 1.6.4
Authlib CVE-2025-61920 🚨 HIGH 1.3.1 1.6.5
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiomysql CVE-2025-62611 🚨 HIGH 0.2.0 0.3.0
apache-airflow-providers-common-sql CVE-2025-30473 🚨 HIGH 1.21.0 1.24.1
deepdiff CVE-2025-58367 🔥 CRITICAL 7.0.1 8.6.1
redshift-connector CVE-2025-5279 🚨 HIGH 2.1.5 2.1.7
setuptools CVE-2024-6345 🚨 HIGH 65.5.1 70.0.0
setuptools CVE-2025-47273 🚨 HIGH 65.5.1 78.1.1
setuptools CVE-2025-47273 🚨 HIGH 70.3.0 78.1.1
tornado CVE-2025-47287 🚨 HIGH 6.4.2 6.5

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants