diff --git a/vignettes/fundamentals.Rmd b/vignettes/fundamentals.Rmd
new file mode 100644
index 000000000..a7eb9cf58
--- /dev/null
+++ b/vignettes/fundamentals.Rmd
@@ -0,0 +1,171 @@
+---
+title: "Fundamentals of Data Validation with pointblank"
+output: html_document
+---
+
+```{r}
+#| label: setup
+#| message: false
+#| warning: false
+#| include: false
+library(pointblank)
+```
+
+This article provides an overview of the core data validation features in pointblank.
+It introduces the key concepts and shows examples of the main functionality, giving you a foundation for using the package effectively.
+
+## Validation Rules
+
+pointblank's core functionality revolves around validation steps, which are individual checks that verify different aspects of your data.
+These steps are created by calling validation functions.
+When combined with `create_agent()`, they form a comprehensive validation plan for your data.
+
+Here's an example of a validation plan that incorporates three different validation methods:
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gt(columns = a, value = 0) %>%
+  rows_distinct() %>%
+  col_exists(columns = date) %>%
+  interrogate()
+
+agent
+```
+
+This example showcases how you can combine different types of validations in a single validation plan:
+
+- a column value validation with `col_vals_gt()`
+- a row-based validation with `rows_distinct()`
+- a table structure validation with `col_exists()`
+
+Most validation methods share common parameters that enhance their flexibility and power.
+These shared parameters create a consistent interface across all validation steps while allowing you to customize validation behavior for specific needs.
+
+The next few sections take you through the most important ways in which you can customize your validation plans.
+
+## Column Selection Patterns
+
+You can apply the same validation logic to multiple columns at once through the use of column selection patterns (used in the `columns` argument).
+This reduces repetitive code and makes your validation plans more maintainable.
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gte(columns = c(c, d), value = 0) %>%
+  col_vals_not_null(columns = starts_with("d")) %>%
+  interrogate()
+
+agent
+```
+
+This technique is particularly valuable when working with wide datasets containing many similarly structured columns or when applying standard quality checks across an entire table.
+Details about the column selection helpers can be found in the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) package.
+Making use of column selection patterns also ensures consistency in how validation rules are applied across related data columns.
+
+To validate row-wise relationships between columns, you can use the `vars()` function to reference columns.
+With this you can, for example, validate that values in one column are greater (or less) than values in another column.
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gte(columns = c(c, d), value = vars(a)) %>%
+  col_vals_between(columns = a, left = 0, right = vars(c)) %>%
+  interrogate()
+
+agent
+```
+
+## Preprocessing
+
+Preprocessing (with the `preconditions` argument) allows you to transform or modify your data before applying validation checks, enabling you to validate derived or modified data without altering the original dataset.
+There is no need to create multiple validation plans for different transformations of the original data.
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gt(
+    columns = a_transformed,
+    value = 5,
+    preconditions = ~ . %>% dplyr::mutate(a_transformed = a * 2)
+  ) %>%
+  col_vals_lt(
+    columns = d,
+    value = 1000,
+    preconditions = ~ . %>% dplyr::filter(date > "2016-01-15")
+  ) %>%
+  interrogate()
+
+agent
+```
+
+Preprocessing enables validation of transformed data without modifying your original dataset, making it ideal for checking derived metrics or validating normalized values.
+This approach keeps your validation code clean while allowing for sophisticated data quality checks on calculated results.
+
+More complex preprocessing can be applied through custom functions, rather than inlined via anonymous functions as shown above.
+You can also use the `preconditions` argument to subset your data to specific rows before applying validation checks.
+However, a more concise way of doing this is illustrated in the next section.
+
+## Segmentation
+
+Segmentation (through the `segments` argument) allows you to validate data across different groups, enabling you to identify segment-specific quality issues that might be hidden in aggregate analyses.
+
+You can segment
+
+- by all unique values in a column, e.g., `segments = vars(f)`
+- by only specific values in a column, e.g., `segments = f ~ c("low", "high")`
+- by multiple columns, e.g., `segments = list(vars(f), a ~ c(1, 2))`
+
+You can also segment in conjunction with preprocessing, allowing you to segment based on derived or modified data.
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gt(
+    columns = d,
+    value = 100,
+    preconditions = ~ . %>%
+      dplyr::mutate(a_category = dplyr::if_else(a > 5, "high", "low")),
+    segments = vars(a_category)
+  ) %>%
+  interrogate()
+
+agent
+```
+
+## Thresholds
+
+Thresholds (set through the `actions` argument) provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues.
+
+Thresholds can be set for three different levels: warnings (`warn_at`), errors (`stop_at`), and critical notifications (`notify_at`).
+They can be specified as either relative proportions of failing test units or absolute numbers of failing test units.
+
+```{r}
+agent <- create_agent(tbl = small_table) %>%
+  col_vals_gt(
+    columns = vars(a),
+    value = 1,
+    actions = action_levels(warn_at = 0.1, stop_at = 0.2, notify_at = 0.3)
+  ) %>%
+  col_vals_lt(
+    columns = vars(c),
+    value = 10,
+    actions = action_levels(warn_at = 1, stop_at = 2)
+  ) %>%
+  interrogate()
+
+agent
+```
+
+Apart from using the agent's results table to visually inspect the outcomes of your validation steps, you can also programmatically access information about the results through the so-called x-list, e.g., to see which validation steps crossed the stop threshold.
+
+```{r}
+x_list <- get_agent_x_list(agent)
+
+x_list$stop
+```
+
+## Conclusion
+
+The features covered in this vignette (column selection patterns, preprocessing, segmentation, and thresholds) form the foundation of pointblank's flexible validation system.
+By combining these capabilities, you can create sophisticated validation workflows that adapt to your specific data quality requirements.
+Whether you're validating simple column constraints or complex multi-step transformations across different data segments, pointblank provides the tools to build robust, maintainable validation pipelines that scale with your data and organizational needs.
+These patterns enable you to catch data quality issues early and implement systematic approaches to data validation across your projects.
diff --git a/vignettes/schema.Rmd b/vignettes/schema.Rmd
new file mode 100644
index 000000000..e8269e011
--- /dev/null
+++ b/vignettes/schema.Rmd
@@ -0,0 +1,199 @@
+---
+title: "Schema Validation with pointblank"
+output: html_document
+---
+
+```{r}
+#| label: setup
+#| message: false
+#| warning: false
+#| include: false
+library(pointblank)
+```
+
+Schema validation ensures your data has the expected structure before you analyze it. This vignette shows how to use pointblank's `col_schema()` and `col_schema_match()` functions to validate column names, types, and ordering.
+
+## Why Schema Validation Matters
+
+Data pipelines often fail silently when the structure of incoming data changes unexpectedly. A column might be renamed, a data type might shift from integer to character, or new columns might appear. Schema validation catches these structural issues early, before they propagate through your analysis workflow and cause downstream errors.
+
+Unlike content validation (which checks the values inside your data), schema validation focuses on the "shape" of your data: the column names, their types, and their arrangement. This makes it an essential first line of defense when working with external data sources, APIs that evolve over time, or databases where schema changes happen independently of your analysis code.
+
+## The Basics
+
+The core principle for schema validation with pointblank is to create a schema definition with `col_schema()` and then use `col_schema_match()` to validate a table against that schema.
+
+```{r}
+#| label: basics
+
+tbl <- dplyr::tibble(
+  a = 1:5,
+  b = letters[1:5]
+)
+
+# define the schema
+schema <- col_schema(
+  a = "integer",
+  b = "character"
+)
+
+# validate the schema
+agent <- create_agent(tbl) %>%
+  col_schema_match(schema) %>%
+  interrogate()
+
+agent
+```
+
+## Creating Schema Definitions with `col_schema()`
+
+Writing out the schema manually is often the most straightforward approach, especially for smaller tables or when you have a clear understanding of the expected structure. For larger datasets or when working with existing tables, extracting the schema from a reference table can save time and ensure accuracy.
+
+```{r}
+#| label: schema-from-reference
+
+mock_reference <- game_revenue[1:10, ]
+schema_gr <- col_schema(.tbl = mock_reference)
+
+agent <- create_agent(game_revenue) %>%
+  col_schema_match(schema_gr) %>%
+  interrogate()
+
+agent
+```
+
+The default is to define the schema in R types like `"numeric"` or `"character"`, and such a schema can be used to validate any of the tables pointblank supports: not just data frames in R but also tables in databases. While it may be convenient to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can use the `.db_col_types` argument to define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R.
+
+```{r}
+#| label: types-sql
+
+library(duckdb)
+
+con <- dbConnect(duckdb())
+
+sales <- dplyr::tibble(
+  amount = c(100, 200, 300),
+  customer_name = c("Alice", "Bianca", "Charlie"),
+  sale_date = as.POSIXct(c("2023-01-01", "2023-01-02", "2023-01-03"))
+)
+
+dbWriteTable(con, "sales_data", sales)
+
+sales_db <- dplyr::tbl(con, "sales_data")
+
+schema_sql <- col_schema(
+  amount = "REAL",
+  customer_name = "TEXT",
+  sale_date = "DATE",
+  .db_col_types = "sql"
+)
+
+agent <- create_agent(sales_db) %>%
+  col_schema_match(schema_sql) %>%
+  interrogate()
+
+dbDisconnect(con)
+
+agent
+```
+
+## Matching Schemas with `col_schema_match()`
+
+By default, pointblank is strict in the validations it performs, ensuring that the target table matches the schema exactly. However, you can relax these constraints to allow for more flexibility in your validation process.
+
+- With `complete = FALSE` you can allow extra columns in the target table that are not defined in the schema.
+- With `in_order = FALSE` you can allow the column order to differ between the schema and the target table.
+- With `is_exact = FALSE` you can allow partial type matching and even skip type matching if you only want to validate the column names.
+
+Let's look at an example of partial type matching. If we write the schema for the `sales` data frame from above as follows, the default strict validation fails. To make that very obvious, we set `stop_at = 1` in the agent's `actions`. Actions are commonly a way to trigger downstream effects (like sending an email notification), but here we simply use them to turn the color on the left-hand side of the validation report red.
+
+```{r}
+#| label: type-matching-strict
+
+schema <- col_schema(
+  amount = "numeric",
+  customer_name = "character",
+  sale_date = "POSIXct"
+)
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(schema) %>%
+  interrogate()
+
+agent
+```
+
+This is because the `sale_date` column has two classes, so the schema and the table do not match exactly.
+
+```{r}
+#| label: sale-date-class
+
+class(sales$sale_date)
+```
+
+However, if we relax the validation to allow partial type matching, it passes.
+
+```{r}
+#| label: type-matching-partial
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(schema, is_exact = FALSE) %>%
+  interrogate()
+
+agent
+```
+
+```{r}
+#| label: type-matching-null
+#| include: false
+
+schema <- col_schema(
+  amount = "numeric",
+  customer_name = "character",
+  sale_date = NULL
+)
+
+agent <- create_agent(sales[, 1:2], actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(schema, is_exact = FALSE) %>%
+  interrogate()
+
+agent
+```
+
+If you want to maintain strict validation of the `sale_date` column as well, you can define the schema with all of its classes.
+
+```{r}
+#| label: type-matching-full
+
+schema <- col_schema(
+  amount = "numeric",
+  customer_name = "character",
+  sale_date = c("POSIXct", "POSIXt")
+)
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(schema, is_exact = TRUE) %>%
+  interrogate()
+
+agent
+```
+
+In general, relaxing the strictness of the validation is useful when you need to validate only a subset of the table. For example, you might only work with a subset of columns, or you might not mind if the table contains (or later gains) additional columns that are not part of your schema.
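+
+To make this concrete, here is a rough sketch of such a relaxed check, reusing the `sales` table from above. The schema below (`schema_partial` is just an illustrative name) covers only two of the three columns and lists them out of order, so it is expected to pass only because `complete = FALSE` and `in_order = FALSE` are set.
+
+```{r}
+#| label: relaxed-matching
+
+# a partial schema: only two of the three columns, in a different order
+schema_partial <- col_schema(
+  customer_name = "character",
+  amount = "numeric"
+)
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(
+    schema_partial,
+    complete = FALSE,  # tolerate the extra `sale_date` column
+    in_order = FALSE   # tolerate the different column order
+  ) %>%
+  interrogate()
+
+agent
+```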
+
+## Best Practices
+
+To wrap up, here are some best practices for schema validation with pointblank:
+
+- Define schemas early: bring everyone involved onto the same page early in your data workflow.
+- Check schemas early: validate the table structure at the start of your workflow so that structural issues are caught before they propagate.
+- Choose your schema creation method: do you have a reference table or do you want to define the schema manually?
+- Be deliberate about strictness: use strict validation for critical data components and flexible validation for additional or evolving data components.
+- Reuse schemas: create schema definitions that can be reused across multiple validation contexts. The schema can be written into the agent and the agent saved as a YAML file, making it easier to share. See the [YAML](https://rstudio.github.io/pointblank/reference/col_schema_match.html#yaml) section of `col_schema_match()` for an example.
+- Version control schemas: as your data evolves, maintain versions of your schemas to track changes. When a `col_schema_match()` step is saved as a YAML file (see the point above), it can easily be managed with a version control system.
+- Set action thresholds: make use of `action_levels()` to set thresholds for actions. If the schema validation fails, trigger a `stop` action, which can then be used to trigger other downstream effects (e.g., an email notification or the termination of a data processing pipeline); a minimal sketch of this pattern follows below.
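+
+As an illustration of the last point, here is a minimal sketch of how a schema check could gate a data processing pipeline. It reuses the `sales` table and the full schema from above; the wiring via `get_agent_x_list()` is just one possible approach, and in this vignette the check passes, so the `stop()` branch is never reached.
+
+```{r}
+#| label: pipeline-gate
+
+schema <- col_schema(
+  amount = "numeric",
+  customer_name = "character",
+  sale_date = c("POSIXct", "POSIXt")
+)
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
+  col_schema_match(schema) %>%
+  interrogate()
+
+# check whether any validation step crossed the `stop` threshold
+x_list <- get_agent_x_list(agent)
+
+if (any(x_list$stop)) {
+  # in a real pipeline, downstream processing would be halted here
+  stop("Schema validation failed: the table structure has changed.")
+}
+```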