-
Notifications
You must be signed in to change notification settings - Fork 60
Vignette refresh #642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Vignette refresh #642
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one leans heavily on the user guide for the python version, but adapted for the R version
vignettes/schema.Rmd
Outdated
| You can relax the validation further by allowing `NULL` types in the schema, which means that the column can be of any type or even missing from the table. | ||
| <!-- This is useful when you want to validate the presence of a column without enforcing a specific type or the column --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to check that a column exists but not bothering with the type? I wasn't expecting the NULL to allow the column to be missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is but it's hacky, not well-explained, and so should be improved in the future:
small_table %>%
expect_col_schema_match(
schema = col_schema(
date_time = "POSIXct",
date = "Date",
a = NULL, # Column exists but type is ignored
b = NULL, # Column exists but type is ignored
f = "character",
e = "logical"
),
complete = FALSE,
in_order = FALSE,
is_exact = FALSE # Required for NULL to work
)This seems more like a side effect of exact type-matching and isn't very good API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your example is passing more because complete = FALSE and your schema is missing columns c and d.
Only using is_exact = FALSE does not turn b = NULL into a check that b exists:
library(pointblank)
# baseline: passes
data.frame(a = 1:2) |>
col_schema_match(col_schema(a = "integer"))
#> a
#> 1 1
#> 2 2
# add b to data frame and to schema as NULL and strict check fails as it should
data.frame(a = 1:2, b = 1:2) |>
col_schema_match(col_schema(a = "integer", b = NULL))
#> Error: Failure to validate that column schemas match.
#> The `col_schema_match()` validation failed beyond the absolute threshold level (1).
#> * failure level (1) >= failure threshold (1)
# relaxing `is_exact` allows the check to pass
data.frame(a = 1:2, b = 1:2) |>
col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#> a b
#> 1 1 1
#> 2 2 2
# but it still passes when b is missing from the data frame
# i.e. it's not a check for existence
data.frame(a = 1:2) |>
col_schema_match(col_schema(a = "integer", b = NULL), is_exact = FALSE)
#> a
#> 1 1
#> 2 2Created on 2025-08-22 with reprex v2.1.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking how these options interact. Definitely need to just make NULL ignore the column type check (but still check column existence), regardless of the options!
|
I've just read through the vignettes more carefully and I think they are both well written! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good! I think we need to leave out the part about checking columns w/o column types/classes until we fix it in the codebase (I'll create an issue for that). Once that's implemented the vignette could be revised in a separate PR to put that example back in (it's a valuable usage example!).
|
Sounds good! I've opened the issue: #645 |
|
@kmasiello It would be great to have your eyes on this as well! 🙌 |
vignettes/schema.Rmd
Outdated
| agent | ||
| ``` | ||
|
|
||
| The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect the average user to know what a tbl_dbi object is. Better to refer to this as a "database table."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the user know the available types to choose from? Do we recommend using typeof(), class(), other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the user know the list of available SQL types? is there a reference we can provide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R. | |
| The default is to define the schema in R types like `"numeric"` or `"character"` and you can use it to validate any of the tables pointblank supports, so not just data frames in R but also tables in databases such as `tbl_dbi` objects. While it may be convienent to define the schema in R types, note that this requires the data to be pulled into R first, which may not be efficient for large datasets. Alternatively, you can use the `.db_col_types` argument to define the schema in SQL types (like `BIGINT` and `VARCHAR`) and validate directly against the SQL table without pulling data into R. |
| .db_col_types = "sql" | ||
| ) | ||
|
|
||
| agent <- create_agent(sales_db) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get an error creating this agent.
Error in `if (grepl("sql server|sqlserver", tbl_src_details)) ...`:
! argument is of length zero
Hide Traceback
▆
1. └─pointblank::create_agent(sales_db)
2. └─pointblank:::get_tbl_information(tbl = tbl)
3. └─pointblank:::get_tbl_information_dbi(tbl)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a weird one. When I render the vignette, it works. When I send the code chunk, it works. When I send the code line-by-line into the console, I also get that error. I've tried it in Positron and RStudio. @rich-iannone do you have any idea what might be going on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #658 for this
| sale_date = "POSIXct" | ||
| ) | ||
|
|
||
| agent <- create_agent(sales, actions = action_levels(stop_at = 1)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would omit the action_levels here since this is about the type matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that they could be distracting. My challenge is that the default settings result in a table that does not make it very obvious when a step fails: the stripe on the left-hand side of the table is green-ish, not red. See #627 for an example (and Maëlle also getting confused by this).
Alternatively, I could use interrogate(progress = TRUE) so that we get the console output. Would that be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I like the red line for illustrating the point. how about also explicitly including the is_exact = FALSE argument for added clarity. You could also just add a note that commonly action_levels are used to trigger downstream effects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also want to illustrate that is_exact = TRUE is the default, so I talk about that in the text and only set the argument when it differs from the default.
| agent | ||
| ``` | ||
|
|
||
| This is because the `sale_date` column has two classes and thus the schema and table do not match exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a surprise! When you defined the sales df, you specified the sale_date column with as.POSIXct so it's a big surprise that POSIXt shows up as one of the class types. And if you check the x_list$col_types from the agent, it says that the sale_date is only POSIXct. Without doing class(sales$sale_date), one would never discover this anomaly.
After reading through the examples, I think we should be more explicit about how to handle types of dates because pointblank can get very picky. I would like to see us call out explicitly that date handling can be tricky and here's how to work with it. If we set is_exact = FALSE to permit flexibility in dates, does this also allow other types that aren't exact to pass (e.g., can an expected numeric pass if it is an integer now?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into the date question some more but in the meantime: is_exact = FALSE does not allow integers to pass as numeric, it only loosens strictness for those columns which are defined as NULL in the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is better placed in its own little piece of documentation? #659
| Thresholds (set through the `actions` argument) provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues. | ||
|
|
||
| Thresholds can be set for three different levels: warnings, errors, and critical notifications. | ||
| They can be specified as either relative proportions of failing test units or absolute numbers of failing test units. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add a mention that you can access information about if certain thresholds were met by looking at the x_list $warn, $stop, and $notify values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the mention of the x-list fits better here than in the schema vignette, so adding it here!
Co-authored-by: Katie Masiello <[email protected]>
| agent <- create_agent(game_revenue) %>% | ||
| col_schema_match(schema_gr) %>% | ||
| interrogate() | ||
| agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually fails 😳 I've opened #657 for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this bug and creating the issue!
As discussed, I'm chipping away at a refresh of the vignettes. Overall, I'm aiming to have
I'm gonna keep this PR as a draft for now but your comments on the first two vignettes would be very welcome already!