vignettes/fundamentals.Rmd (+171 lines, new file)
[Collaborator Author] This one leans heavily on the user guide for the Python version, but adapted for the R version.

---
title: "Fundamentals of Data Validation with Pointblank"
output: html_document
---

```{r}
#| label: setup
#| message: false
#| warning: false
#| include: false
library(pointblank)
```


This article provides an overview of the core data validation features in pointblank.
It introduces the key concepts and shows examples of the main functionality, giving you a foundation for using the package effectively.

## Validation Rules

pointblank's core functionality revolves around validation steps, which are individual checks that verify different aspects of your data.
These steps are created by calling validation functions.
When combined with `create_agent()`, they form a comprehensive validation plan for your data.

Here's an example of a validation that incorporates three different validation methods:

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(columns = a, value = 0) %>%
  rows_distinct() %>%
  col_exists(columns = date) %>%
  interrogate()

agent
```

This example showcases how you can combine different types of validations in a single validation plan:

- a column value validation with `col_vals_gt()`
- a row-based validation with `rows_distinct()`
- a table structure validation with `col_exists()`

Most validation methods share common parameters that enhance their flexibility and power.
These shared parameters create a consistent interface across all validation steps while allowing you to customize validation behavior for specific needs.

The next few sections take you through the most important ways in which you can customize your validation plans.

## Column Selection Patterns

You can apply the same validation logic to multiple columns at once through use of column selection patterns (used in the `columns` argument).
This reduces repetitive code and makes your validation plans more maintainable.

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gte(columns = c(c, d), value = 0) %>%
  col_vals_not_null(columns = starts_with("d")) %>%
  interrogate()

agent
```

This technique is particularly valuable when working with wide datasets containing many similarly-structured columns or when applying standard quality checks across an entire table.
Details about the column selection helpers can be found in the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) package.
Making use of column selection patterns also ensures consistency in how validation rules are applied across related data columns.
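
Other tidyselect helpers work the same way. As an illustrative sketch (assuming your installed pointblank version supports full tidyselect semantics in `columns`, including `where()`), the following checks every numeric column of `small_table` at once:

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_not_null(columns = where(is.numeric)) %>%
  interrogate()

agent
```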

To validate row-wise relationships between columns, you can use the `vars()` function to reference columns.
With this you can, for example, validate that values in one column are greater (or less) than values in another column.

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gte(columns = c(c, d), value = vars(a)) %>%
  col_vals_between(columns = a, left = 0, right = vars(c)) %>%
  interrogate()

agent
```

## Preprocessing

Preprocessing (with the `preconditions` argument) allows you to transform or modify your data before applying validation checks, enabling you to validate derived or modified data without altering the original dataset.
There is no need to create multiple validation plans for different transformations of the original data.

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = a_transformed,
    value = 5,
    preconditions = ~ . %>% dplyr::mutate(a_transformed = a * 2)
  ) %>%
  col_vals_lt(
    columns = d,
    value = 1000,
    preconditions = ~ . %>% dplyr::filter(date > "2016-01-15")
  ) %>%
  interrogate()

agent
```

Preprocessing enables validation of transformed data without modifying your original dataset, making it ideal for checking derived metrics or validating normalized values.
This approach keeps your validation code clean while allowing for sophisticated data quality checks on calculated results.

More complex preprocessing can be applied through custom functions, rather than inlined via anonymous functions as shown above (a minimal sketch follows below).
You can also use the `preconditions` argument to subset your data to specific rows before applying validation checks.
However, a more concise way of doing this is illustrated in the next section.
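
Here is that sketch: a named helper function (`add_ratio()`, a hypothetical example) can be passed to `preconditions` in place of an inline formula:

```{r}
# a hypothetical helper that derives a ratio column before validation
add_ratio <- function(x) {
  dplyr::mutate(x, ratio = d / c)
}

agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = ratio,
    value = 0,
    preconditions = add_ratio
  ) %>%
  interrogate()

agent
```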

## Segmentation

Segmentation (through the `segments` argument) allows you to validate data across different groups, enabling you to identify segment-specific quality issues that might be hidden in aggregate analyses.

You can segment:

- by all unique values in a column, e.g., `segments = vars(f)`
- by only specific values in a column, e.g., `segments = f ~ c("low", "high")`
- by multiple columns, e.g., `segments = list(vars(f), a ~ c(1, 2))`
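
For example, to run the same check separately on just the `"low"` and `"high"` segments of column `f` (a minimal sketch using `small_table`):

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = d,
    value = 100,
    segments = f ~ c("low", "high")
  ) %>%
  interrogate()

agent
```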

You can also segment in conjunction with preprocessing, allowing you to segment based on derived or modified data.

```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = d,
    value = 100,
    preconditions = ~ . %>%
      dplyr::mutate(a_category = dplyr::if_else(a > 5, "high", "low")),
    segments = vars(a_category)
  ) %>%
  interrogate()

agent
```

## Thresholds

Thresholds (set through the `actions` argument) provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues.

Thresholds can be set for three different severity levels: warnings (`warn`), errors (`stop`), and critical notifications (`notify`).
They can be specified as either relative proportions of failing test units or absolute numbers of failing test units.

[Review comment] Do you want to add a mention that you can access information about whether certain thresholds were met by looking at the x-list `$warn`, `$stop`, and `$notify` values?

[Collaborator Author] I think the mention of the x-list fits better here than in the schema vignette, so adding it here!



```{r}
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = vars(a),
    value = 1,
    actions = action_levels(warn_at = 0.1, stop_at = 0.2, notify_at = 0.3)
  ) %>%
  col_vals_lt(
    columns = vars(c),
    value = 10,
    actions = action_levels(warn_at = 1, stop_at = 2)
  ) %>%
  interrogate()

agent
```

Apart from using the agent's results table to visually inspect the outcomes of your validation steps, you can also programmatically access information about the results through the so-called x-list, e.g., which validation steps crossed the stop threshold.

```{r}
x_list <- get_agent_x_list(agent)

x_list$stop
```
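
The `warn` and `notify` thresholds can be inspected in the same way:

```{r}
# one logical value per validation step: TRUE where the threshold was crossed
x_list$warn
x_list$notify
```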


## Conclusion

The features covered in this vignette—column selection patterns, preprocessing, segmentation, and thresholds—form the foundation of pointblank's flexible validation system.
By combining these capabilities, you can create sophisticated validation workflows that adapt to your specific data quality requirements.
Whether you're validating simple column constraints or complex multi-step transformations across different data segments, pointblank provides the tools to build robust, maintainable validation pipelines that scale with your data and organizational needs.
These patterns enable you to catch data quality issues early and implement systematic approaches to data validation across your projects.
vignettes/schema.Rmd (+199 lines, new file)
---
title: "Schema Validation with pointblank"
output: html_document
---

```{r}
#| label: setup
#| message: false
#| warning: false
#| include: false
library(pointblank)
```

Schema validation ensures your data has the expected structure before you analyze it. This vignette shows how to use pointblank's `col_schema()` and `col_schema_match()` functions to validate column names, types, and ordering.

## Why Schema Validation Matters

Data pipelines often fail silently when the structure of incoming data changes unexpectedly. A column might be renamed, a data type might shift from integer to character, or new columns might appear. Schema validation catches these structural issues early, before they propagate through your analysis workflow and cause downstream errors.

Unlike content validation (which checks the values inside your data), schema validation focuses on the "shape" of your data — the column names, their types, and their arrangement. This makes it an essential first line of defense when working with external data sources, APIs that evolve over time, or databases where schema changes happen independently of your analysis code.

## The Basics

The core principle of schema validation with pointblank is to create a schema definition with `col_schema()` and then use `col_schema_match()` to validate a table against that schema.

```{r}
#| label: basics

tbl <- dplyr::tibble(
  a = 1:5,
  b = letters[1:5]
)

# define the schema
schema <- col_schema(
  a = "integer",
  b = "character"
)

# validate the schema
agent <- create_agent(tbl) %>%
  col_schema_match(schema) %>%
  interrogate()

agent
```


## Creating Schema Definitions with `col_schema()`

Writing out the schema manually is often the most straightforward approach, especially for smaller tables or when you have a clear understanding of the expected structure. For larger datasets or when working with existing tables, extracting the schema from a reference table can save time and ensure accuracy.

```{r}
#| label: schema-from-reference

mock_reference <- game_revenue[1:10, ]
schema_gr <- col_schema(mock_reference)

agent <- create_agent(game_revenue) %>%
  col_schema_match(schema_gr) %>%
  interrogate()

agent
```

[Collaborator Author] This actually fails 😳 I've opened #657 for it.

[Member] Thanks for catching this bug and creating the issue!

By default, a schema is defined with R types like `"numeric"` or `"character"`, and you can use it to validate any of the table types pointblank supports: not just data frames in R but also tables in databases. While it may be convenient to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can use the `.db_col_types` argument to define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R.

```{r}
#| label: types-sql

library(duckdb)

con <- dbConnect(duckdb())

sales <- dplyr::tibble(
  amount = c(100, 200, 300),
  customer_name = c("Alice", "Bianca", "Charlie"),
  sale_date = as.POSIXct(c("2023-01-01", "2023-01-02", "2023-01-03"))
)

dbWriteTable(con, "sales_data", sales)

sales_db <- dplyr::tbl(con, "sales_data")

schema_sql <- col_schema(
  amount = "REAL",
  customer_name = "TEXT",
  sale_date = "DATE",
  .db_col_types = "sql"
)

agent <- create_agent(sales_db) %>%
  col_schema_match(schema_sql) %>%
  interrogate()

dbDisconnect(con)

agent
```

[Review comment] I get an error creating this agent.

    Error in `if (grepl("sql server|sqlserver", tbl_src_details)) ...`:
    ! argument is of length zero
    Traceback:
        ▆
     1. └─pointblank::create_agent(sales_db)
     2.   └─pointblank:::get_tbl_information(tbl = tbl)
     3.     └─pointblank:::get_tbl_information_dbi(tbl)

[Collaborator Author] This is a weird one. When I render the vignette, it works. When I send the code chunk, it works. When I send the code line-by-line into the console, I also get that error. I've tried it in Positron and RStudio. @rich-iannone do you have any idea what might be going on here?

[Collaborator Author] I opened #658 for this.


## Matching Schemas with `col_schema_match()`

By default, pointblank is strict in the validations it performs, ensuring that the target table matches the schema exactly. However, you can relax these constraints to allow for more flexibility in your validation process.

- With `complete = FALSE` you can allow extra columns in the target table that are not defined in the schema.
- With `in_order = FALSE` you can allow the column order to differ between the schema and the target table.
- With `is_exact = FALSE` you can allow partial type matching, or even skip type matching entirely if you only want to validate the column names.
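
As a quick sketch of the first two options (reusing `schema` from the basics section; `tbl_extra` is a hypothetical table with an extra column and a different column order), relaxing both constraints lets such a table pass:

```{r}
#| label: relaxed-matching

# a hypothetical table: reordered columns plus an extra column `c`
tbl_extra <- dplyr::tibble(
  b = letters[1:5],
  c = runif(5),
  a = 1:5
)

agent <- create_agent(tbl_extra) %>%
  col_schema_match(schema, complete = FALSE, in_order = FALSE) %>%
  interrogate()

agent
```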

Let's look at an example of partial type matching. If we write the schema for the `sales` data frame from above as follows, the default strict validation fails. To make that very obvious, we set `stop_at = 1` in the agent's `actions`. Action levels are commonly used to trigger downstream effects (like sending an email notification), but here we simply use them to turn the stripe on the left-hand side of the validation report red.

```{r}
#| label: type-matching-strict

schema <- col_schema(
  amount = "numeric",
  customer_name = "character",
  sale_date = "POSIXct"
)

agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
  col_schema_match(schema) %>%
  interrogate()

agent
```

[Review comment] I would omit the action_levels here since this is about the type matching.

[Collaborator Author (@hfrick), Aug 28, 2025] I agree that they could be distracting. My challenge is that the default settings result in a table that does not make it very obvious when a step fails: the stripe on the left-hand side of the table is green-ish, not red. See #627 for an example (and Maëlle also getting confused by this). Alternatively, I could use interrogate(progress = TRUE) so that we get the console output. Would that be better?

[Review comment] Oh, I like the red line for illustrating the point. How about also explicitly including the is_exact = FALSE argument for added clarity? You could also just add a note that commonly action_levels are used to trigger downstream effects.

[Collaborator Author] I also want to illustrate that is_exact = TRUE is the default, so I talk about that in the text and only set the argument when it differs from the default.

This is because the `sale_date` column has two classes and thus the schema and table do not match exactly.

[Review comment] This is a surprise! When you defined the sales df, you specified the sale_date column with as.POSIXct, so it's a big surprise that POSIXt shows up as one of the class types. And if you check the x_list$col_types from the agent, it says that sale_date is only POSIXct. Without doing class(sales$sale_date), one would never discover this anomaly.

After reading through the examples, I think we should be more explicit about how to handle types of dates because pointblank can get very picky. I would like to see us call out explicitly that date handling can be tricky and here's how to work with it. If we set is_exact = FALSE to permit flexibility in dates, does this also allow other types that aren't exact to pass (e.g., can an expected numeric pass if it is an integer now)?

[Collaborator Author] I'll look into the date question some more, but in the meantime: is_exact = FALSE does not allow integers to pass as numeric; it only loosens strictness for those columns which are defined as NULL in the schema.

[Collaborator Author] Maybe this is better placed in its own little piece of documentation? #659


```{r}
#| label: sale-date-class

class(sales$sale_date)
```

However, if we relax the validation to allow partial type matching, it passes.

```{r}
#| label: type-matching-partial

agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
  col_schema_match(schema, is_exact = FALSE) %>%
  interrogate()

agent
```

<!-- You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table. -->
<!-- This is useful when you want to validate the presence of a column without enforcing a specific type, or when the column may be missing entirely. -->

```{r}
#| label: type-matching-null
#| include: false

schema <- col_schema(
  amount = "numeric",
  customer_name = "character",
  sale_date = NULL
)

agent <- create_agent(sales[, 1:2], actions = action_levels(stop_at = 1)) %>%
  col_schema_match(schema, is_exact = FALSE) %>%
  interrogate()

agent
```

If you want to maintain strict validation of the `sale_date` column as well, you can define the schema with all of its classes.

```{r}
#| label: type-matching-full

schema <- col_schema(
  amount = "numeric",
  customer_name = "character",
  sale_date = c("POSIXct", "POSIXt")
)

agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%
  col_schema_match(schema, is_exact = TRUE) %>%
  interrogate()

agent
```

In general, relaxing the strictness of the validation is useful when you need to validate only a subset of the table -- for example, when you only work with a subset of columns, or when you don't mind if the table contains (or gains in the future) additional columns that are not part of your schema.

[Review comment] Consider adding another section here for "Locating Failed Schema Matches". The agent doesn't report which columns do not pass the col_schema_match() validation, it just indicates that the entire table failed the column schema match. If you need to figure out which columns were noncompliant, generate the x_list and look at col_names and col_types, e.g.,

```r
x_list <- get_agent_x_list(agent)

# input df column names
x_list$col_names

# input df column types
x_list$col_types
```

[Collaborator Author] That information should definitely go somewhere! I've been planning to put it into a third "Take action" vignette.

I've been considering putting a mention of the x-list here, but it just feels a little too awkward. If I add it to this example directly, the x-list will show only one class for the date column as you pointed out. More than that, this makes me want something similar to step reports in the Python version: https://posit-dev.github.io/pointblank/user-guide/step-reports.html The R version doesn't have anything like that yet, does it?

[Member] Not step reports yet :( But that would be a huge feature in the R version. Any interest in taking that on yourself?

BTW, not every validation type in py-pointblank has an associated step report yet (I think specially() and 2-3 others are missing these reports). So this is still early stuff in some ways.



## Best Practices

To wrap up, here are some best practices for schema validation with pointblank:

- Define schemas early: get everyone involved on the same page early in your data workflow.
- Check schemas early: validate incoming data against the schema at the start of your workflow to catch structural issues before they propagate.
- Choose your schema creation method: do you have a reference table to extract the schema from, or do you want to define the schema manually?
- Be deliberate about strictness: use strict validation for critical data components and flexible validation for additional or evolving data components.
- Reuse schemas: create schema definitions that can be reused across multiple validation contexts. The schema can be written into the agent and the agent saved as a YAML file, making it easier to share (see the sketch after this list). See the [YAML](https://rstudio.github.io/pointblank/reference/col_schema_match.html#yaml) section of `col_schema_match()` for an example.
- Version control schemas: as your data evolves, maintain versions of your schemas to track changes. When the agent with its `col_schema_match()` step is saved as a YAML file (see the point above), it can easily be managed with a version control system.
- Make use of `action_levels()` to set thresholds for actions. If the schema validation fails, trigger a `stop` action, which can then be used to trigger other downstream effects (e.g., an email notification or termination of a data processing pipeline).
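
As a minimal sketch of the reuse and version-control points (note the assumptions: writing an agent to YAML requires creating it with a table-prep formula such as `tbl = ~ sales` rather than a materialized table, and the file name here is just an example):

```{r}
#| label: yaml-sketch
#| eval: false

# create the agent with a table-prep formula so it can be written to YAML
agent <- create_agent(tbl = ~ sales) %>%
  col_schema_match(schema) %>%
  interrogate()

# write the agent, including its schema step, to a YAML file ...
yaml_write(agent, filename = "schema-validation.yml")

# ... and read it back in later (e.g., in a scheduled pipeline)
agent_restored <- yaml_read_agent("schema-validation.yml")
```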