Vignette refresh #642
base: main
Changes from all commits
c48d6ff
6b0bd4a
cc370da
8303045
4e9cb7f
dd456e2
2c5f1e5
4cae4aa
18a5c5b
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| --- | ||
| title: "Fundamentals of Data Validation with Pointblank" | ||
| output: html_document | ||
| --- | ||
|
|
||
| ```{r} | ||
| #| label: setup | ||
| #| message: false | ||
| #| warning: false | ||
| #| include: false | ||
| library(pointblank) | ||
| ``` | ||
|
|
||
|
|
||
| This article provides an overview of the core data validation features in pointblank. | ||
| It introduces the key concepts and shows examples of the main functionality, giving you a foundation for using the package effectively. | ||
|
|
||
| ## Validation Rules | ||
|
|
||
| pointblank's core functionality revolves around validation steps, which are individual checks that verify different aspects of your data. | ||
| These steps are created by calling validation functions. | ||
| When combined with `create_agent()` they create a comprehensive validation plan for your data. | ||
|
|
||
| Here's an example of a validation that incorporates three different validation methods: | ||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gt(columns = a, value = 0) %>% | ||
| rows_distinct() %>% | ||
| col_exists(columns = date) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| This example showcases how you can combine different types of validations in a single validation | ||
| plan: | ||
|
|
||
| - a column value validation with `col_vals_gt()` | ||
| - a row-based validation with `rows_distinct()` | ||
| - a table structure validation with `col_exists()` | ||
|
|
||
| Most validation methods share common parameters that enhance their flexibility and power. | ||
| These shared parameters create a consistent interface across all validation steps while allowing you to customize validation behavior for specific needs. | ||
|
|
||
| The next few sections take you through the most important ways in which you can customize your validation plans. | ||
|
|
||
| ## Column Selection Patterns | ||
|
|
||
| You can apply the same validation logic to multiple columns at once through use of column selection patterns (used in the `columns` argument). | ||
| This reduces repetitive code and makes your validation plans more maintainable. | ||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gte(columns = c(c, d), value = 0) %>% | ||
| col_vals_not_null(columns = starts_with("d")) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| This technique is particularly valuable when working with wide datasets containing many similarly-structured columns or when applying standard quality checks across an entire table. | ||
| Details about the column selection helpers can be found in the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) package. | ||
| Making use of column selection patterns also ensures consistency in how validation rules are applied across related data columns. | ||
|
|
||
| To validate row-wise relationships between columns, you can use the `vars()` function to reference columns. | ||
| With this you can, for example, validate that values in one column are greater (or less) than values in another column. | ||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gte(columns = c(c, d), value = vars(a)) %>% | ||
| col_vals_between(columns = a, left = 0, right = vars(c)) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| ## Preprocessing | ||
|
|
||
| Preprocessing (with the `preconditions` argument) allows you to transform or modify your data before applying validation checks, enabling you to validate derived or modified data without altering the original dataset. | ||
| There is no need to create multiple validation plans for different transformations of the original data. | ||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gt( | ||
| columns = a_transformed, | ||
| value = 5, | ||
| preconditions = ~ . %>% dplyr::mutate(a_transformed = a * 2) | ||
| ) %>% | ||
| col_vals_lt( | ||
| columns = d, | ||
| value = 1000, | ||
| preconditions = ~ . %>% dplyr::filter(date > "2016-01-15") | ||
| ) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| Preprocessing enables validation of transformed data without modifying your original dataset, making it ideal for checking derived metrics, or validating normalized values. | ||
| This approach keeps your validation code clean while allowing for sophisticated data quality checks on calculated results. | ||
|
|
||
| More complex preprocessing can be applied through custom functions, rather than inlined via anonymous functions as shown above. | ||
| You can also use the `preconditions` argument to subset your data to specific rows before applying validation checks. | ||
| However, a concise way of doing this is illustrated in the next section. | ||
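As a minimal sketch of the custom-function approach mentioned above: the helper name `double_a` and the object name `agent_fn` are hypothetical and not part of the vignette. The sketch assumes that `preconditions` accepts a plain function that takes the table and returns the transformed table.

```r
library(pointblank)

# Hypothetical helper (not from the vignette): adds a doubled copy of `a`
double_a <- function(tbl) {
  dplyr::mutate(tbl, a_transformed = a * 2)
}

agent_fn <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = a_transformed,
    value = 5,
    # Pass the named function instead of an inline formula
    preconditions = double_a
  ) %>%
  interrogate()

agent_fn
```

A named function like this can be reused across several validation steps and tested on its own, which is harder to do with an inline formula.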
|
|
||
| ## Segmentation | ||
|
|
||
| Segmentation (through the `segments` argument) allows you to validate data across different groups, enabling you to identify segment-specific quality issues that might be hidden in aggregate analyses. | ||
|
|
||
| You can segment | ||
|
|
||
| - by all unique values in a column, e.g., `segments = vars(f)` | ||
| - by only specific values in a column, e.g., `segments = f ~ c("low", "high")` | ||
| - by multiple columns, e.g., `segments = list(vars(f), a ~ c(1, 2))` | ||
|
|
||
| You can also segment in conjunction with preprocessing, allowing you to segment based on derived or modified data. | ||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gt( | ||
| columns = d, | ||
| value = 100, | ||
| preconditions = . %>% | ||
| dplyr::mutate(a_category = dplyr::if_else(a > 5, "high", "low")), | ||
| segments = vars(a_category) | ||
| ) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
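The other segmentation forms listed above could be sketched as follows. This is an illustration only: the object name `agent_seg` is made up here, and the sketch relies on the `f` column of `small_table` containing the values `"low"`, `"mid"`, and `"high"`.

```r
library(pointblank)

agent_seg <- create_agent(tbl = small_table) %>%
  # Segment by every unique value found in column `f`
  col_vals_gt(columns = d, value = 100, segments = vars(f)) %>%
  # Segment only by the "low" and "high" values of column `f`
  col_vals_gt(columns = d, value = 100, segments = f ~ c("low", "high")) %>%
  interrogate()

agent_seg
```

Each segment becomes its own validation step in the report, so a single call can expand into several rows of the results table.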
|
|
||
| ## Thresholds | ||
|
|
||
| Thresholds (set through the `actions` argument) provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues. | ||
|
|
||
| Thresholds can be set for three different levels: warn, stop, and notify (via `warn_at`, `stop_at`, and `notify_at` in `action_levels()`). | ||
| They can be specified as either relative proportions of failing test units or absolute numbers of failing test units. | ||
|
Do you want to add a mention that you can access information about if certain thresholds were met by looking at the
I think the mention of the x-list fits better here than in the schema vignette, so adding it here! |
||
|
|
||
| ```{r} | ||
| agent <- create_agent(tbl = small_table) %>% | ||
| col_vals_gt( | ||
| columns = vars(a), | ||
| value = 1, | ||
| actions = action_levels(warn_at = 0.1, stop_at = 0.2, notify_at = 0.3) | ||
| ) %>% | ||
| col_vals_lt( | ||
| columns = vars(c), | ||
| value = 10, | ||
| actions = action_levels(warn_at = 1, stop_at = 2) | ||
| ) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| Apart from using the agent's results table to visually inspect the outcomes of your validation steps, you can also programmatically access information about the results through the so-called x-list, e.g., which validation steps crossed the stop threshold. | ||
|
|
||
| ```{r} | ||
| x_list <- get_agent_x_list(agent) | ||
|
|
||
| x_list$stop | ||
| ``` | ||
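Beyond `stop`, the x-list carries further per-step components. The sketch below rebuilds a one-step agent so that it stands on its own; the object names `agent_thr` and `x_list_thr` are illustrative, and the components shown (`f_failed`, `n_failed`, `warn`, `notify`) are the ones assumed to be relevant here rather than a complete list.

```r
library(pointblank)

agent_thr <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = a,
    value = 1,
    actions = action_levels(warn_at = 0.1, stop_at = 0.2, notify_at = 0.3)
  ) %>%
  interrogate()

x_list_thr <- get_agent_x_list(agent_thr)

# Fraction and count of failing test units, one element per step
x_list_thr$f_failed
x_list_thr$n_failed

# One logical value per step: was the respective threshold reached?
x_list_thr$warn
x_list_thr$stop
x_list_thr$notify
```

Since these are plain vectors, they can be used in ordinary control flow, for example to halt a script when any element of `x_list_thr$stop` is `TRUE`.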
|
|
||
|
|
||
| ## Conclusion | ||
|
|
||
| The features covered in this vignette—column selection patterns, preprocessing, segmentation, and thresholds—form the foundation of pointblank's flexible validation system. | ||
| By combining these capabilities, you can create sophisticated validation workflows that adapt to your specific data quality requirements. | ||
| Whether you're validating simple column constraints or complex multi-step transformations across different data segments, pointblank provides the tools to build robust, maintainable validation pipelines that scale with your data and organizational needs. | ||
| These patterns enable you to catch data quality issues early and implement systematic approaches to data validation across your projects. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,199 @@ | ||
| --- | ||
| title: "Schema Validation with pointblank" | ||
| output: html_document | ||
| --- | ||
|
|
||
| ```{r} | ||
| #| label: setup | ||
| #| message: false | ||
| #| warning: false | ||
| #| include: false | ||
| library(pointblank) | ||
| ``` | ||
|
|
||
| Schema validation ensures your data has the expected structure before you analyze it. This vignette shows how to use pointblank's `col_schema()` and `col_schema_match()` functions to validate column names, types, and ordering. | ||
|
|
||
| ## Why Schema Validation Matters | ||
|
|
||
| Data pipelines often fail silently when the structure of incoming data changes unexpectedly. A column might be renamed, a data type might shift from integer to character, or new columns might appear. Schema validation catches these structural issues early, before they propagate through your analysis workflow and cause downstream errors. | ||
|
|
||
| Unlike content validation (which checks the values inside your data), schema validation focuses on the "shape" of your data — the column names, their types, and their arrangement. This makes it an essential first line of defense when working with external data sources, APIs that evolve over time, or databases where schema changes happen independently of your analysis code. | ||
|
|
||
| ## The Basics | ||
|
|
||
| The core principle for schema validation with pointblank is to create a schema definition with `col_schema()` and then use `col_schema_match()` to validate a table against that schema. | ||
|
|
||
| ```{r} | ||
| #| label: basics | ||
|
|
||
| tbl <- dplyr::tibble( | ||
| a = 1:5, | ||
| b = letters[1:5] | ||
| ) | ||
|
|
||
| # define the schema | ||
| schema <- col_schema( | ||
| a = "integer", | ||
| b = "character" | ||
| ) | ||
|
|
||
| # validate the schema | ||
| agent <- create_agent(tbl) %>% | ||
| col_schema_match(schema) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
|
|
||
| ## Creating Schema Definitions with `col_schema()` | ||
|
|
||
| Writing out the schema manually is often the most straightforward approach, especially for smaller tables or when you have a clear understanding of the expected structure. For larger datasets or when working with existing tables, extracting the schema from a reference table can save time and ensure accuracy. | ||
|
|
||
| ```{r} | ||
| #| label: schema-from-reference | ||
|
|
||
| mock_reference <- game_revenue[1:10, ] | ||
| schema_gr <- col_schema(mock_reference) | ||
|
|
||
| agent <- create_agent(game_revenue) %>% | ||
| col_schema_match(schema_gr) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
|
This actually fails 😳 I've opened #657 for it.
Thanks for catching this bug and creating the issue! |
||
| ``` | ||
|
|
||
| The default is to define the schema in R types like `"numeric"` or `"character"`. You can use such a schema to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases. While it may be convenient to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can use the `.db_col_types` argument to define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R. | ||
|
|
||
| ```{r} | ||
| #| label: types-sql | ||
hfrick marked this conversation as resolved.
|
||
|
|
||
| library(duckdb) | ||
|
|
||
| con <- dbConnect(duckdb()) | ||
|
|
||
| sales <- dplyr::tibble( | ||
| amount = c(100, 200, 300), | ||
| customer_name = c("Alice", "Bianca", "Charlie"), | ||
| sale_date = as.POSIXct(c("2023-01-01", "2023-01-02", "2023-01-03")) | ||
| ) | ||
|
|
||
| dbWriteTable(con, "sales_data", sales) | ||
|
|
||
| sales_db <- dplyr::tbl(con, "sales_data") | ||
|
|
||
| schema_sql <- col_schema( | ||
| amount = "REAL", | ||
| customer_name = "TEXT", | ||
| sale_date = "DATE", | ||
| .db_col_types = "sql" | ||
| ) | ||
|
|
||
| agent <- create_agent(sales_db) %>% | ||
|
I get an error creating this agent.
This is a weird one. When I render the vignette, it works. When I send the code chunk, it works. When I send the code line-by-line into the console, I also get that error. I've tried it in Positron and RStudio. @rich-iannone do you have any idea what might be going on here?
I opened #658 for this |
||
| col_schema_match(schema_sql) %>% | ||
| interrogate() | ||
|
|
||
| dbDisconnect(con) | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
|
|
||
| ## Matching Schemas with `col_schema_match()` | ||
|
|
||
| By default, pointblank is strict in the validations it performs, ensuring that the target table matches the schema exactly. However, you can relax these constraints to allow for more flexibility in your validation process. | ||
|
|
||
| - With `complete = FALSE` you can allow extra columns in the target table that are not defined in the schema. | ||
| - With `in_order = FALSE` you can allow the column order to differ between the schema and the target table. | ||
| - With `is_exact = FALSE` you can allow partial type matching and even skip type matching if you only want to validate the column names. | ||
|
|
||
| Let's look at an example of partial type matching. If we write the schema for the `sales` data frame from above as follows, the default strict validation fails. To make that very obvious, we set `stop_at = 1` in the agent's `actions`. Actions are commonly a way to trigger downstream effects (like sending an email notification) but here we simply use them to turn the color on the left-hand side of the validation report red. | ||
|
|
||
| ```{r} | ||
| #| label: type-matching-strict | ||
|
|
||
| schema <- col_schema( | ||
| amount = "numeric", | ||
| customer_name = "character", | ||
| sale_date = "POSIXct" | ||
| ) | ||
|
|
||
| agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>% | ||
|
I would omit the
I agree that they could be distracting. My challenge is that the default settings result in a table that does not make it very obvious when a step fails: the stripe on the left-hand side of the table is green-ish, not red. See #627 for an example (and Maëlle also getting confused by this). Alternatively, I could use
oh, I like the red line for illustrating the point. how about also explicitly including the
I also want to illustrate that |
||
| col_schema_match(schema) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| This is because the `sale_date` column has two classes and thus the schema and table do not match exactly. | ||
|
This is a surprise! When you defined the After reading through the examples, I think we should be more explicit about how to handle types of dates because pointblank can get very picky. I would like to see us call out explicitly that date handling can be tricky and here's how to work with it. If we set
I'll look into the date question some more but in the meantime:
Maybe this is better placed in its own little piece of documentation? #659 |
||
|
|
||
| ```{r} | ||
| #| label: sale-date-class | ||
|
|
||
| class(sales$sale_date) | ||
| ``` | ||
|
|
||
| However, if we relax the validation to allow partial type matching, it passes. | ||
|
|
||
| ```{r} | ||
| #| label: type-matching-partial | ||
|
|
||
| agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>% | ||
hfrick marked this conversation as resolved.
|
||
| col_schema_match(schema, is_exact = FALSE) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| <!-- You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table. --> | ||
| <!-- This is useful when you want to validate the presence of a column without enforcing a specific type or the column --> | ||
|
|
||
| ```{r} | ||
| #| label: type-matching-null | ||
| #| include: false | ||
|
|
||
| schema <- col_schema( | ||
| amount = "numeric", | ||
| customer_name = "character", | ||
| sale_date = NULL | ||
| ) | ||
|
|
||
| agent <- create_agent(sales[, 1:2], actions = action_levels(stop_at = 1)) %>% | ||
hfrick marked this conversation as resolved.
|
||
| col_schema_match(schema, is_exact = FALSE) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| If you want to maintain a strict validation, also of the `sale_date` column, you can define the schema with all its classes. | ||
|
|
||
| ```{r} | ||
| #| label: type-matching-full | ||
|
|
||
| schema <- col_schema( | ||
| amount = "numeric", | ||
| customer_name = "character", | ||
| sale_date = c("POSIXct", "POSIXt") | ||
| ) | ||
|
|
||
| agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>% | ||
| col_schema_match(schema, is_exact = TRUE) %>% | ||
| interrogate() | ||
|
|
||
| agent | ||
| ``` | ||
|
|
||
| In general, relaxing the strictness of the validation is useful when you need to validate only a subset of the table. For example, you only work with a subset of columns or you don't mind if the table contains -- or gains in future -- additional columns that are not part of your schema. | ||
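The subset case described above might look like the following sketch. The table `tbl_wide`, the schema `schema_subset`, and the agent name `agent_relaxed` are made up for illustration; the schema names only two of the three columns and lists them in a different order than the table.

```r
library(pointblank)

# Illustrative table (not from the vignette) with three columns
tbl_wide <- dplyr::tibble(
  a = 1:3,
  b = letters[1:3],
  c = c(0.1, 0.2, 0.3)
)

# Schema covering only a subset of the columns, in a different order
schema_subset <- col_schema(
  b = "character",
  a = "integer"
)

agent_relaxed <- create_agent(tbl_wide) %>%
  # Allow columns missing from the schema and ignore column order
  col_schema_match(schema_subset, complete = FALSE, in_order = FALSE) %>%
  interrogate()

agent_relaxed
```

With the default `complete = TRUE` and `in_order = TRUE`, the same schema would be expected to fail against this table, because column `c` is not described and the order differs.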
|
|
||
|
Consider adding another section here for "Locating Failed Schema Matches".
That information should definitely go somewhere! I've been planning to put it into a third "Take action" vignette. I've been considering putting a mention of the x-list here, but it just feels a little too awkward. If I add it to this example directly, the x-list will show only one class for the date column as you pointed out. More than that, this makes me want something similar to step reports in the Python version: https://posit-dev.github.io/pointblank/user-guide/step-reports.html The R version doesn't have anything like that yet, does it?
Not step reports yet :( But that would be a huge feature in the R version. Any interest in taking that on yourself? BTW, not every validation type in py-pointblank has an associated step report yet (I think |
||
|
|
||
| ## Best Practices | ||
|
|
||
| To wrap up, here are some best practices for schema validation with pointblank: | ||
|
|
||
| - Define schemas early: get everyone involved on the same page early in your data workflow. | ||
| - Check schemas early: catch structural issues before they propagate. | ||
| - Choose your schema creation method: do you have a reference table or do you want to define the schema manually? | ||
| - Be deliberate about strictness: use strict validation for critical data components and flexible validation for additional or evolving data components. | ||
| - Reuse schemas: create schema definitions that can be reused across multiple validation contexts. The schema can be written into the agent and the agent saved as a YAML file, making it easier to share. See the [YAML](https://rstudio.github.io/pointblank/reference/col_schema_match.html#yaml) section of `col_schema_match()` for an example. | ||
| - Version control schemas: as your data evolves, maintain versions of your schemas to track changes. When the agent with its `col_schema_match()` step is saved as a YAML file (see point above), it can easily be managed with a version control system. | ||
| - Make use of `action_levels()` to set thresholds for actions. If the schema validation fails, trigger a `stop` action, which can then be used to trigger other downstream effects (e.g., an email notification, termination of a data processing pipeline). | ||
This one leans heavily on the user guide for the python version, but adapted for the R version