Feature/dimensionality column min to be between #24116
Open
IceS2 wants to merge 46 commits into main from feature/dimensionality-column-min-to-be-between
…open-metadata/OpenMetadata into feature/dimensionality-for-data-quality
…thub.com:open-metadata/OpenMetadata into feature/dimensionality-column-mean-to-be-between
🛡️ TRIVY SCAN RESULT 🛡️ Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (2)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/extended_sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/lineage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️ Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| libexpat1 | CVE-2023-52425 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libexpat1 | CVE-2024-8176 | 🚨 HIGH | 2.5.0-1+deb12u1 | 2.5.0-1+deb12u2 |
| libgnutls30 | CVE-2025-32988 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libgnutls30 | CVE-2025-32990 | 🚨 HIGH | 3.7.9-2+deb12u3 | 3.7.9-2+deb12u5 |
| libicu72 | CVE-2025-5222 | 🚨 HIGH | 72.1-3 | 72.1-3+deb12u1 |
| libperl5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| libperl5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| libsqlite3-0 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| libxslt1.1 | CVE-2024-55549 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-24855 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u1 |
| libxslt1.1 | CVE-2025-7424 | 🚨 HIGH | 1.1.35-1 | 1.1.35-1+deb12u2 |
| perl | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-base | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-base | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| perl-modules-5.36 | CVE-2023-31484 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u3 |
| perl-modules-5.36 | CVE-2024-56406 | 🚨 HIGH | 5.36.0-7+deb12u1 | 5.36.0-7+deb12u2 |
| sqlite3 | CVE-2025-6965 | 🔥 CRITICAL | 3.40.1-2+deb12u1 | 3.40.1-2+deb12u2 |
| sudo | CVE-2025-32462 | 🚨 HIGH | 1.9.13p3-1+deb12u1 | 1.9.13p3-1+deb12u2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Java
Vulnerabilities (31)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (11)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Authlib | CVE-2025-59420 | 🚨 HIGH | 1.3.1 | 1.6.4 |
| Authlib | CVE-2025-61920 | 🚨 HIGH | 1.3.1 | 1.6.5 |
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| aiomysql | CVE-2025-62611 | 🚨 HIGH | 0.2.0 | 0.3.0 |
| apache-airflow-providers-common-sql | CVE-2025-30473 | 🚨 HIGH | 1.21.0 | 1.24.1 |
| deepdiff | CVE-2025-58367 | 🔥 CRITICAL | 7.0.1 | 8.6.1 |
| redshift-connector | CVE-2025-5279 | 🚨 HIGH | 2.1.5 | 2.1.7 |
| setuptools | CVE-2024-6345 | 🚨 HIGH | 65.5.1 | 70.0.0 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 65.5.1 | 78.1.1 |
| setuptools | CVE-2025-47273 | 🚨 HIGH | 70.3.0 | 78.1.1 |
| tornado | CVE-2025-47287 | 🚨 HIGH | 6.4.2 | 6.5 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO
No Vulnerabilities Found
Implement Dimensional Validation for columnValueMinToBeBetween + Code Quality Refactoring
Summary
This PR implements dimensional validation for the `columnValueMinToBeBetween` data quality test and includes significant code quality improvements to reduce duplication across validators.

This is part of the dimensional validation feature series, following PRs #23529 (thin slice), #23984 (Mean), and #24051 (Max).
Changes Made
1. Dimensional Validation for columnValueMinToBeBetween
Implemented dimensional validation for the Min validator to analyze minimum values across dimension groups (e.g., min value by region, by category, etc.).
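As a rough illustration of what "min value by dimension" means, the following sketch groups a small pandas DataFrame by one dimension column and checks each group's minimum against the bounds. The data, column names, bounds, and the `min_by_dimension` helper are all hypothetical and are not taken from the PR's code.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US", "APAC"],
        "price": [12.0, 7.5, 3.0, 9.0, 15.0],
    }
)

MIN_BOUND, MAX_BOUND = 5.0, 20.0  # assumed test parameters


def min_by_dimension(frame: pd.DataFrame, dimension: str, column: str) -> pd.DataFrame:
    """Compute the minimum of `column` per dimension value and flag out-of-bounds groups."""
    result = (
        frame.groupby(dimension)[column]
        .agg(min_value="min", row_count="count")
        .reset_index()
    )
    # A group passes when its minimum lies inside [MIN_BOUND, MAX_BOUND].
    result["passed"] = result["min_value"].between(MIN_BOUND, MAX_BOUND)
    # Group-level check: when the minimum is out of bounds, every row in the group counts as failed.
    result["failed_rows"] = result["row_count"].where(~result["passed"], 0)
    return result


report = min_by_dimension(df, "region", "price")
print(report)
```

Here only the `US` group fails, because its minimum of 3.0 is below the assumed lower bound, and both of its rows are counted as failed.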
Base Validator (`columnValueMinToBeBetween.py`)
- `MIN_BOUND` and `MAX_BOUND` for parameter naming
- `_get_validation_checker()` to create `BetweenBoundsChecker` instances
- `_get_metrics_to_compute()` to specify which metrics to compute per dimension

SQLAlchemy Validator (`sqlalchemy/columnValueMinToBeBetween.py`)
- `_execute_dimensional_validation()` using CTE-based aggregation
- `_execute_with_others_aggregation_statistical()` mixin method
- `MIN(individual_mins)` for correct statistical aggregation
- `BetweenBoundsChecker` for failed count computation

Pandas Validator (`pandas/columnValueMinToBeBetween.py`)
- `aggregate_others_statistical_pandas()` for Others aggregation

2. New BetweenBoundsChecker Class
Created a reusable violation checker for all "between bounds" validators (Min, Max, Mean, StdDev, Sum, Median, Lengths).
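A minimal sketch of the idea, assuming a checker that only needs a lower and an upper bound. The class name `SimpleBetweenBoundsChecker`, its `violates` method, and the handling of missing values are illustrative assumptions; the PR's actual `BetweenBoundsChecker` may differ.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimpleBetweenBoundsChecker:
    """Hypothetical stand-in for a between-bounds violation checker."""

    min_bound: float = -math.inf
    max_bound: float = math.inf

    def violates(self, value: Optional[float]) -> bool:
        """Return True when `value` falls outside [min_bound, max_bound]."""
        if value is None or math.isnan(value):
            # Assumption: a missing statistic is not treated as a violation here.
            return False
        return value < self.min_bound or value > self.max_bound


# The same checker instance can serve any statistic (min, max, mean, ...).
checker = SimpleBetweenBoundsChecker(min_bound=5.0, max_bound=20.0)
for name, stat in {"min": 3.0, "max": 18.0, "mean": 20.0}.items():
    print(name, checker.violates(stat))
```

Because the bounds are inclusive in this sketch, a value equal to either bound is not a violation.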
- `validations/checkers/between_bounds_checker.py`
- `BaseValidationChecker` protocol
- `check_pandas(value)`: Checks if value is outside `[min_bound, max_bound]`
- `get_sqa_failed_rows_builder()`: Returns a callable that builds SQLAlchemy CASE expressions

3. Extracted _run_dimensional_validation to BaseTestValidator
Eliminated ~200 lines of duplicate code by moving dimensional validation orchestration to the base class.
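The general shape of this kind of extraction is the template method pattern: the base class owns the loop and error handling, and each validator supplies only the per-dimension step. The sketch below uses hypothetical class names and a simplified result format; only the method name `_execute_dimensional_validation` is borrowed from the PR, and its real signature is not shown in the description.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseValidatorSketch(ABC):
    """Hypothetical base class; only the orchestration lives here."""

    def __init__(self, dimensions: List[str]) -> None:
        self.dimensions = dimensions

    def run_dimensional_validation(self) -> Dict[str, List[dict]]:
        """Template method: shared loop and error handling for every validator."""
        results: Dict[str, List[dict]] = {}
        for dimension in self.dimensions:
            try:
                results[dimension] = self._execute_dimensional_validation(dimension)
            except Exception as exc:
                # Assumption: one failing dimension should not abort the others.
                results[dimension] = [{"error": str(exc)}]
        return results

    @abstractmethod
    def _execute_dimensional_validation(self, dimension: str) -> List[dict]:
        """Hook implemented by each concrete validator."""


class MinValidatorSketch(BaseValidatorSketch):
    def __init__(self, dimensions: List[str], rows: List[dict]) -> None:
        super().__init__(dimensions)
        self.rows = rows

    def _execute_dimensional_validation(self, dimension: str) -> List[dict]:
        # Track the running minimum of "value" for each dimension value.
        mins: Dict[str, float] = {}
        for row in self.rows:
            key = row[dimension]
            mins[key] = min(mins.get(key, row["value"]), row["value"])
        return [{"dimension_value": k, "min": v} for k, v in sorted(mins.items())]


rows = [
    {"region": "EU", "value": 4.0},
    {"region": "EU", "value": 2.0},
    {"region": "US", "value": 1.0},
]
validator = MinValidatorSketch(["region", "missing"], rows)
results = validator.run_dimensional_validation()
print(results)
```

A Max or Mean validator would subclass the same base and override only the hook, which is where the duplication disappears.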
Affected Base Validators:
- `columnValueMinToBeBetween.py`
- `columnValueMaxToBeBetween.py`
- `columnValueMeanToBeBetween.py`
- `columnValuesToBeUnique.py`
- `columnValuesToBeInSet.py`

Benefits:
4. Protocol-Based Architecture for Mixins
Introduced Python Protocols to make mixin dependencies explicit and type-safe.
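To show how a Protocol can document what a mixin expects from its host class, here is a small sketch. The names `HasContextSketch`, `ColumnLookupMixin`, `BaseSketch`, and the `column_name` attribute are hypothetical; the PR's `HasValidatorContext` declares `runner` and `test_case`, and its real mixins are not reproduced here.

```python
from typing import Any, Dict, Protocol


class HasContextSketch(Protocol):
    """Attributes a mixin expects its host class to provide (hypothetical names)."""

    runner: Dict[str, Any]
    column_name: str


class ColumnLookupMixin:
    def get_column(self: HasContextSketch) -> Any:
        # Annotating `self` with the Protocol lets a type checker verify that the
        # mixin is only combined with classes exposing `runner` and `column_name`.
        return self.runner[self.column_name]


class BaseSketch:
    def __init__(self, runner: Dict[str, Any], column_name: str) -> None:
        self.runner = runner
        self.column_name = column_name

    def get_column(self) -> Any:
        # Cooperative delegation: the next class in the MRO (the mixin) does the work.
        return super().get_column()  # type: ignore[misc]


class ConcreteValidator(BaseSketch, ColumnLookupMixin):
    pass


validator = ConcreteValidator(runner={"price": [1.0, 2.0]}, column_name="price")
print(validator.get_column())
```

The method resolution order here is `ConcreteValidator`, `BaseSketch`, `ColumnLookupMixin`, `object`, so the `super()` call in `BaseSketch` reaches the mixin's implementation.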
New File: `validations/mixins/protocols.py`
- `HasValidatorContext`: Protocol declaring required attributes (`runner`, `test_case`)
- `TYPE_CHECKING` to avoid runtime imports

Updated Mixins:
- `SQAValidatorMixin`: Added `get_column()` implementation using `HasValidatorContext` protocol
- `PandasValidatorMixin`: Added `get_column()` implementation using `HasValidatorContext` protocol
- `get_column()` implementations across concrete validators

BaseTestValidator:
- `super()` to delegate `get_column()` to mixins via MRO
- `type: ignore[misc]` comment documenting intentional cooperative multiple inheritance

5. Updated Max and Mean Validators to Use BetweenBoundsChecker
Refactored existing dimensional validators to use the new shared checker instead of inline logic.
columnValueMaxToBeBetween:
- `BetweenBoundsChecker.get_sqa_failed_rows_builder()`

columnValueMeanToBeBetween:
- `BetweenBoundsChecker`

6. Removed Duplicate _get_column_name Implementations
Cleaned up 30+ concrete validators by removing trivial `_get_column_name()` implementations.

Pandas Validators (11 files):
- `_get_column_name()` from: Lengths, Max, Mean, Median, Min, StdDev, MissingCount, Sum, ToBeBetween, InSet, NotInSet, NotNull, Unique, ToMatchRegex, ToNotMatchRegex

SQLAlchemy Validators (19 files):
- `_get_column_name()` from: Lengths, Max, Mean, Median, Min, StdDev, MissingCount, Sum, ToBeBetween, InSet, NotInSet, NotNull, Unique, ToMatchRegex, ToNotMatchRegex

7. Enhanced Test Coverage
conftest.py:
- `dimension_results_dict` fixture for testing dimensional validation results

test_validations_databases.py:
- `columnValueMinToBeBetween` dimensional validation

test_validations_datalake.py:
8. Documentation Improvements
9. Unified PandasComputation Pattern for Metrics
Introduced a standardized accumulator pattern for pandas metrics to enable efficient dimensional validation with optimal memory usage.
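The accumulator idea can be sketched as three callables: create an empty accumulator, fold each DataFrame chunk into it, and convert the final accumulator into a result. The `ComputationSketch` class, its `run` helper, the `value` column, and the field names of `SumAndCount` below are assumptions made for illustration and do not reproduce the PR's `PandasComputation` definition.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, NamedTuple, Optional, TypeVar

import pandas as pd

T = TypeVar("T")
R = TypeVar("R")


@dataclass(frozen=True)
class ComputationSketch(Generic[T, R]):
    """Hypothetical accumulator-based metric: create, update per chunk, aggregate."""

    create_accumulator: Callable[[], T]
    update_accumulator: Callable[[T, pd.DataFrame], T]
    aggregate_accumulator: Callable[[T], R]

    def run(self, chunks: Iterable[pd.DataFrame]) -> R:
        acc = self.create_accumulator()
        for chunk in chunks:
            # Only the accumulator is kept between chunks, never the chunk itself.
            acc = self.update_accumulator(acc, chunk)
        return self.aggregate_accumulator(acc)


class SumAndCount(NamedTuple):
    total: float
    count: int


def update_min(acc: Optional[float], chunk: pd.DataFrame) -> Optional[float]:
    values = chunk["value"].dropna()
    if values.empty:
        return acc
    chunk_min = float(values.min())
    # The minimum of per-chunk minimums equals the overall minimum.
    return chunk_min if acc is None else min(acc, chunk_min)


def update_mean(acc: SumAndCount, chunk: pd.DataFrame) -> SumAndCount:
    values = chunk["value"].dropna()
    return SumAndCount(acc.total + float(values.sum()), acc.count + int(values.count()))


min_metric = ComputationSketch(lambda: None, update_min, lambda acc: acc)
mean_metric = ComputationSketch(
    lambda: SumAndCount(0.0, 0),
    update_mean,
    lambda acc: acc.total / acc.count if acc.count else None,
)

chunks = [
    pd.DataFrame({"value": [4.0, 8.0]}),
    pd.DataFrame({"value": [1.0, None, 7.0]}),
]
print(min_metric.run(chunks))   # 1.0
print(mean_metric.run(chunks))  # 5.0
```

Carrying a running sum and count, rather than averaging per-chunk means, is what keeps the mean correct when chunks have different sizes.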
New Protocol (`metrics/pandas_metric_protocol.py`):
- `PandasComputation[T, R]`: Dataclass defining accumulator-based metric computation
- `create_accumulator() -> T`: Initialize accumulator (Counter, scalar, tuple, etc.)
- `update_accumulator(T, DataFrame) -> T`: Update accumulator with chunk data, returns updated value
- `aggregate_accumulator(T) -> R`: Convert final accumulator to metric result
- `SupportsPandasComputation`: Runtime-checkable Protocol for gradual metric migration

Metrics Updated:

Min/Max (`min.py`, `max.py`):
- `Optional[float]` (running minimum/maximum)

Mean (`mean.py`):
- `SumAndCount` NamedTuple (running sum and count)
- `List[MeanAndCount]` with single tuple
- `sum/count = Σ(mean×count)/Σ(count)` (verified)

UniqueCount (`unique_count.py`):
- `Counter` (frequency map)

CountInSet (`count_in_set.py`):
- `int` (running count)

RowCount (`row_count.py`):
- `int` (running total)

Benefits:
- `df_fn()` methods preserved; metrics opt-in via Protocol

Verified Correctness:
Key Design Decisions
Statistical Validator Pattern: `columnValueMinToBeBetween` is a statistical validator (group-level aggregation) rather than row-level validation. The min value represents the entire group, so when min is outside bounds, ALL rows in that dimension are considered failed.

MIN Aggregation for Others: When aggregating low-impact dimensions into "Others", we use `MIN(individual_mins)` because the minimum of minimums gives the correct overall minimum for the combined group.

BetweenBoundsChecker Abstraction: Created a shared checker class to eliminate duplicate violation logic across 7+ validators (Min, Max, Mean, StdDev, Median, Sum, Lengths). This provides:

Cooperative Multiple Inheritance: Used Python's cooperative `super()` pattern for `_get_column_name()` delegation:

Template Method Extraction: Moved `_run_dimensional_validation()` to BaseTestValidator:
- `_execute_dimensional_validation()`