Vignette refresh #642

hfrick · 2025-08-21T14:19:08Z

As discussed, I'm chipping away at a refresh of the vignettes. Overall, I'm aiming to have

a vignette covering the core functionality of the various validation rules and how they can be combined into a validation plan/agent
a vignette focused on schema validation aka "validating the fundamentals" of the shape of the data before validating the content of the data
a vignette centered around "Taking action" covering getting notified, stopping automation, and inspecting results
and possibly one around tailoring the validation report to stakeholders/people who understand the context of the data well but may not be data-focused themselves

I'm gonna keep this PR as a draft for now but your comments on the first two vignettes would be very welcome already!

hfrick · 2025-08-21T14:23:52Z

vignettes/fundamentals.Rmd

This one leans heavily on the user guide for the python version, but adapted for the R version

hfrick · 2025-08-21T14:26:29Z

vignettes/schema.Rmd

+You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table. 
+<!-- This is useful when you want to validate the presence of a column without enforcing a specific type or the column -->


Is there a way to check that a column exists but not bothering with the type? I wasn't expecting the NULL to allow the column to be missing.

There is but it's hacky, not well-explained, and so should be improved in the future:

small_table %>% expect_col_schema_match( schema = col_schema( date_time = "POSIXct", date = "Date", a = NULL, # Column exists but type is ignored b = NULL, # Column exists but type is ignored f = "character", e = "logical" ), complete = FALSE, in_order = FALSE, is_exact = FALSE # Required for NULL to work )

This seems more like a side effect of exact type-matching and isn't very good API design.

I think your example is passing more because complete = FALSE and your schema is missing columns c and d.

Only using is_exact = FALSE does not turn b = NULL into a check that b exists:

library(pointblank) # baseline: passes data.frame(a = 1:2) |> col_schema_match(col_schema(a = "integer")) #> a #> 1 1 #> 2 2 # add b to data frame and to schema as NULL and strict check fails as it should data.frame(a = 1:2, b = 1:2) |> col_schema_match(col_schema(a = "integer", b = NULL)) #> Error: Failure to validate that column schemas match. #> The `col_schema_match()` validation failed beyond the absolute threshold level (1). #> * failure level (1) >= failure threshold (1) # relaxing `is_exact` allows the check to pass data.frame(a = 1:2, b = 1:2) |> col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE) #> a b #> 1 1 1 #> 2 2 2 # but it still passes when b is missing from the data frame # i.e. it's not a check for existence data.frame(a = 1:2) |> col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE) #> a #> 1 1 #> 2 2

^{Created on 2025-08-22 with reprex v2.1.1}

Thanks for checking how these options interact. Definitely need to just make NULL ignore the column type check (but still check column existence), regardless of the options!

vignettes/schema.Rmd

rich-iannone · 2025-08-22T15:45:29Z

I've just read through the vignettes more carefully and I think they are both well written!

rich-iannone

Everything looks good! I think we need to leave out the part about checking columns w/o column types/classes until we fix it in the codebase (I'll create an issue for that). Once that's implemented the vignette could be revised in a separate PR to put that example back in (it's a valuable usage example!).

hfrick · 2025-08-26T13:40:55Z

Sounds good! I've opened the issue: #645

hfrick · 2025-08-26T14:31:20Z

@kmasiello It would be great to have your eyes on this as well! 🙌

vignettes/schema.Rmd

kmasiello · 2025-08-26T16:09:21Z

vignettes/schema.Rmd

+agent
+```
+
+The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R.


I don't expect the average user to know what a tbl_dbi object is. Better to refer to this as a "database table."

How can the user know the available types to choose from? Do we recommend using typeof(), class(), other?

How can the user know the list of available SQL types? is there a reference we can provide?

Suggested change

The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R.

The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can use the `.db_col_types` argument to define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R.

kmasiello · 2025-08-26T16:56:49Z

vignettes/schema.Rmd

+  .db_col_types = "sql"
+)
+
+agent <- create_agent(sales_db) %>%


I get an error creating this agent.

Error in `if (grepl("sql server|sqlserver", tbl_src_details)) ...`: ! argument is of length zero Hide Traceback ▆ 1. └─pointblank::create_agent(sales_db) 2. └─pointblank:::get_tbl_information(tbl = tbl) 3. └─pointblank:::get_tbl_information_dbi(tbl)

This is a weird one. When I render the vignette, it works. When I send the code chunk, it works. When I send the code line-by-line into the console, I also get that error. I've tried it in Positron and RStudio. @rich-iannone do you have any idea what might be going on here?

I opened #658 for this

kmasiello · 2025-08-26T18:10:06Z

vignettes/schema.Rmd

+  sale_date = "POSIXct"
+)
+
+agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>%


I would omit the action_levels here since this is about the type matching.

I agree that they could be distracting. My challenge is that the default settings result in a table that does not make it very obvious when a step fails: the stripe on the left-hand side of the table is green-ish, not red. See #627 for an example (and Maëlle also getting confused by this).

Alternatively, I could use interrogate(progress = TRUE) so that we get the console output. Would that be better?

oh, I like the red line for illustrating the point. how about also explicitly including the is_exact = FALSE argument for added clarity. You could also just add a note that commonly action_levels are used to trigger downstream effects

I also want to illustrate that is_exact = TRUE is the default, so I talk about that in the text and only set the argument when it differs from the default.

kmasiello · 2025-08-26T18:21:49Z

vignettes/schema.Rmd

+agent
+```
+
+This is because the `sale_date` column has two classes and thus the schema and table do not match exactly. 


This is a surprise! When you defined the sales df, you specified the sale_date column with as.POSIXct so it's a big surprise that POSIXt shows up as one of the class types. And if you check the x_list$col_types from the agent, it says that the sale_date is only POSIXct. Without doing class(sales$sale_date), one would never discover this anomaly.

After reading through the examples, I think we should be more explicit about how to handle types of dates because pointblank can get very picky. I would like to see us call out explicitly that date handling can be tricky and here's how to work with it. If we set is_exact = FALSE to permit flexibility in dates, does this also allow other types that aren't exact to pass (e.g., can an expected numeric pass if it is an integer now?)

I'll look into the date question some more but in the meantime: is_exact = FALSE does not allow integers to pass as numeric, it only loosens strictness for those columns which are defined as NULL in the schema.

Maybe this is better placed in its own little piece of documentation? #659

vignettes/schema.Rmd

vignettes/validation-custom.Rmd

kmasiello · 2025-08-26T19:16:47Z

vignettes/fundamentals.Rmd

+Thresholds (set through the `actions` argument) provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues.
+
+Thresholds can be set for three different levels: warnings, errors, and critical notifications. 
+They can be specified as either relative proportions of failing test units or absolute numbers of failing test units.


Do you want to add a mention that you can access information about if certain thresholds were met by looking at the x_list $warn, $stop, and $notify values?

I think the mention of the x-list fits better here than in the schema vignette, so adding it here!

vignettes/validation-custom.Rmd

Co-authored-by: Katie Masiello <[email protected]>

hfrick · 2025-10-15T12:12:32Z

vignettes/schema.Rmd

+agent <- create_agent(game_revenue) %>%
+  col_schema_match(schema_gr) %>%
+  interrogate()
+agent


This actually fails 😳 I've opened #657 for it.

Thanks for catching this bug and creating the issue!

hfrick added 2 commits August 1, 2025 11:23

draft for vignette on building a custom validation plan

c48d6ff

Draft of vignette on schema validation

6b0bd4a

hfrick commented Aug 21, 2025

View reviewed changes

vignettes/fundamentals.Rmd

Copy link

Collaborator Author

hfrick Aug 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one leans heavily on the user guide for the python version, but adapted for the R version

hfrick commented Aug 21, 2025

View reviewed changes

vignettes/schema.Rmd Show resolved Hide resolved

hfrick commented Aug 21, 2025

View reviewed changes

vignettes/schema.Rmd Outdated Show resolved Hide resolved

hfrick requested a review from rich-iannone August 21, 2025 14:38

rich-iannone requested changes Aug 22, 2025

View reviewed changes

Merge branch 'main' into vignette-refresh

cc370da

hfrick added 3 commits August 26, 2025 14:45

add SQL types

8303045

Switch to duckdb

4e9cb7f

comment out example with NULL as the type

dd456e2

kmasiello reviewed Aug 26, 2025

View reviewed changes

Apply suggestions from code review

2c5f1e5

Co-authored-by: Katie Masiello <[email protected]>

hfrick commented Oct 15, 2025

View reviewed changes

hfrick added 2 commits October 16, 2025 15:35

address review suggestions

4cae4aa

rename

18a5c5b

		You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table.
		<!-- This is useful when you want to validate the presence of a column without enforcing a specific type or the column -->

Vignette refresh #642

Are you sure you want to change the base?

Vignette refresh #642

Uh oh!

Conversation

hfrick commented Aug 21, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hfrick Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rich-iannone commented Aug 22, 2025

Uh oh!

rich-iannone left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hfrick commented Aug 26, 2025

Uh oh!

hfrick commented Aug 26, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hfrick Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

hfrick Aug 22, 2025 •

edited

Loading

rich-iannone left a comment •

edited

Loading

hfrick Aug 28, 2025 •

edited

Loading