diff --git a/.github/workflows/bookdown.yaml b/.github/workflows/bookdown.yaml
index 507e34c4..b3d715d2 100644
--- a/.github/workflows/bookdown.yaml
+++ b/.github/workflows/bookdown.yaml
@@ -34,7 +34,7 @@ jobs:
- uses: r-lib/actions/setup-r-dependencies@v2
- name: Build site
- run: Rscript -e 'bookdown::render_book("index.Rmd", quiet = TRUE)'
+ run: Rscript -e 'bookdown::render_book("index.Rmd", output_format = bookdown::html_book(keep_md = TRUE), quiet = TRUE)'
- name: Deploy to Netlify
if: contains(env.isExtPR, 'false')
@@ -52,3 +52,8 @@ jobs:
NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
timeout-minutes: 1
+
+ - uses: actions/upload-artifact@v1
+ with:
+ name: _book
+ path: _book/
diff --git a/.gitignore b/.gitignore
index 52e8a9df..3296b97f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,7 +6,7 @@
_book
_main.*
libs
-figures
+figures/*
_bookdown_files
figures/introduction-cricket-plot-1.svg
figures/introduction-descr-examples-1.pdf
@@ -19,3 +19,6 @@ figures/tidyverse-interaction-plots-1.svg
extras/iowa_highway.shx
extras/iowa_highway.shp
files_for_print*
+tmwr-to-ch9*
+extras/iowa_highway.zip
+extras/iowa_highway/iowa_highway.shp
diff --git a/01-software-modeling.Rmd b/01-software-modeling.Rmd
index 6e5c25b9..4e59114b 100644
--- a/01-software-modeling.Rmd
+++ b/01-software-modeling.Rmd
@@ -7,7 +7,6 @@ knitr::opts_chunk$set(fig.path = "figures/")
library(tidyverse)
library(gridExtra)
library(tibble)
-library(kableExtra)
data(ames, package = "modeldata")
```
@@ -66,7 +65,7 @@ For example, large scale measurements of RNA have been possible for some time us
An early method for evaluating such issues were probe-level models, or PLMs [@bolstad2004]. A statistical model would be created that accounted for the known differences in the data, such as the chip, the RNA sequence, the type of sequence, and so on. If there were other, unknown factors in the data, these effects would be captured in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When a problem did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe off the chip and rescan, repeat the sample, etc.). Figure \@ref(fig:software-descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different color values; areas that are darker are where the signal intensity was larger than the model expects while the lighter color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel exhibits an undesirable artifact in the middle of the chip.
-```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', dev = "png", fig.height = 8, warning = FALSE, message = FALSE}
+```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', fig.height = 8, warning = FALSE, message = FALSE}
load("RData/plm_resids.RData")
resid_cols <- RColorBrewer::brewer.pal(8, "Set1")[1:2]
@@ -255,28 +254,12 @@ monolog <-
"Model Evaluation", "2",
"Let’s drop K-NN from the model list. "
)
-if (knitr::is_html_output()) {
- tab <-
- monolog %>%
+monolog %>%
dplyr::select(Thoughts, Activity) %>%
- kable(
+ knitr::kable(
caption = "Hypothetical inner monologue of a model developer.",
label = "inner-monologue"
- ) %>%
- kable_styling() %>%
- column_spec(2, width = "25%") %>%
- column_spec(1, width = "75%", italic = TRUE)
-} else {
- tab <-
- monolog %>%
- dplyr::select(Thoughts, Activity) %>%
- kable(
- caption = "Hypothetical inner monologue of a model developer.",
- label = "inner-monologue"
- ) %>%
- kable_styling()
-}
-tab
+ )
```
## Chapter Summary {#software-summary}
diff --git a/03-base-r.Rmd b/03-base-r.Rmd
index 1815dea7..05c64eca 100644
--- a/03-base-r.Rmd
+++ b/03-base-r.Rmd
@@ -4,7 +4,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
data(crickets, package = "modeldata")
library(tidyverse)
-library(kableExtra)
```
Before describing how to use tidymodels for applying tidy data principles to building models with R, let's review how models are created, trained, and used in the core R language (often called "base R"). This chapter is a brief illustration of core language conventions that are important to be aware of even if you never use base R for models at all. This chapter is not exhaustive, but it provides readers (especially those new to R) the basic, most commonly used motifs.
@@ -75,7 +74,7 @@ rate ~ temp + species
Species is not a quantitative variable; in the data frame, it is represented as a factor column with levels `"O. exclamationis"` and `"O. niveus"`. The vast majority of model functions cannot operate on nonnumeric data. For species, the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values. In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is `"O. exclamationis"` and a value of one when the data correspond to `"O. niveus"`. The underlying formula machinery automatically converts these values for the data set used to create the model, as well as for any new data points (for example, when the model is used for prediction).
:::rmdnote
-Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Section \@ref(dummies).
+Suppose there were five species instead of two. The model formula would automatically add four additional columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Chapter \@ref(recipes).
:::
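
For example, a minimal sketch of inspecting this encoding with base R's `model.matrix()` (assuming the `crickets` data frame loaded in this chapter's setup chunk):

```r
# Inspect the design matrix produced by the formula; the reference level of
# species ("O. exclamationis") is absorbed into the intercept column.
head(model.matrix(rate ~ temp + species, data = crickets))
```
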
The model formula `rate ~ temp + species` creates a model with different y-intercepts for each species; the slopes of the regression lines could be different for each species as well. To accommodate this structure, an interaction term can be added to the model. This can be specified in a few different ways, and the most basic uses the colon:
@@ -199,7 +198,7 @@ For the most part, practitioners' understanding of what the formula does is domi
(temp + species)^2
```
-Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Section \@ref(dummies)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.
+Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Chapter \@ref(recipes)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.
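
As a small sketch of this expansion (again assuming the `crickets` data from the setup chunk), the design matrix column names show the main effects, the species indicator, and their product:

```r
# The expanded formula creates an indicator column for species and multiplies
# it by temp to form the interaction column.
colnames(model.matrix(rate ~ (temp + species)^2, data = crickets))
```
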
:::rmdwarning
This is an important point that will come up multiple times in this text, especially when we discuss more complex feature engineering in Chapter \@ref(recipes) and beyond. The formula in R has some limitations, and our approaches to overcoming them contend with all three aspects.
@@ -246,14 +245,11 @@ prob_tbl <-
)
prob_tbl %>%
- kable(
+ knitr::kable(
caption = "Heterogeneous argument names for different modeling functions.",
label = "probability-args",
escape = FALSE
- ) %>%
- kable_styling(full_width = FALSE) %>%
- column_spec(1, monospace = ifelse(prob_tbl$Function == "various", FALSE, TRUE)) %>%
- column_spec(3, monospace = TRUE)
+ )
```
Note that the last example has a custom function to make predictions instead of using the more common `predict()` interface (the generic `predict()` method). This lack of consistency is a barrier to day-to-day usage of R for modeling.
@@ -396,7 +392,7 @@ conflict_prefer("filter", winner = "dplyr")
For convenience, `r pkg(tidymodels)` contains a function that captures most of the common naming conflicts that we might encounter:
-```{r base-r-clonflicts}
+```{r base-r-conflicts}
tidymodels_prefer(quiet = FALSE)
```
diff --git a/04-ames.Rmd b/04-ames.Rmd
index 59b4af96..6ce9d319 100644
--- a/04-ames.Rmd
+++ b/04-ames.Rmd
@@ -40,6 +40,20 @@ data(ames, package = "modeldata")
dim(ames)
```
+Figure \@ref(fig:ames-map) shows the locations of the properties in Ames; we will revisit these locations in the next section.
+
+```{r ames-map}
+#| out.width = "100%",
+#| echo = FALSE,
+#| warning = FALSE,
+#| fig.cap = "Property locations in Ames, IA.",
+#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
+# See file extras/ames_sf.R
+knitr::include_graphics("premade/ames_plain.png")
+```
+
+The void of data points in the center of Ames corresponds to Iowa State University.
+
## Exploring Features of Homes in Ames
Let's start our exploratory data analysis by focusing on the outcome we want to predict: the last sale price of the house (in USD). We can create a histogram to see the distribution of sale prices in Figure \@ref(fig:ames-sale-price-hist).
@@ -92,16 +106,16 @@ Despite these drawbacks, the models used in this book use the log transformation
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
```
-Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map in Figure \@ref(fig:ames-map).
+Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, Figure \@ref(fig:ames-chull) reproduces the data from Figure \@ref(fig:ames-map) and adds convex hulls around the data from each neighborhood.
-```{r ames-map}
+```{r ames-chull}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
-#| fig.cap = "Neighborhoods in Ames, IA",
-#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
+#| fig.cap = "Neighborhoods in Ames represented using a convex hull",
+#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map with colored regions that show the locations of neighborhoods. Show neighborhoods overlap and a few are nested within other neighborhoods."
# See file extras/ames_sf.R
-knitr::include_graphics("premade/ames.png")
+knitr::include_graphics("premade/ames_chull.png")
```
We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to the campus of Iowa State University where there are no residential houses. Second, while there are a number of adjacent neighborhoods, others are geographically isolated. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
diff --git a/05-data-spending.Rmd b/05-data-spending.Rmd
index 6639a04f..4b42b125 100644
--- a/05-data-spending.Rmd
+++ b/05-data-spending.Rmd
@@ -28,7 +28,7 @@ The other portion of the data is placed into the _test set_. This is held in res
How should we conduct this split of the data? The answer depends on the context.
:::
-Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary in Section \@ref(ames-summary) that prepared the Ames data set:
+Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary at the end of Chapter \@ref(ames):
```{r ames-split, message = FALSE, warning = FALSE}
library(tidymodels)
@@ -106,13 +106,13 @@ The proportion of data that should be allocated for splitting is highly dependen
When describing the goals of data splitting, we singled out the test set as the data that should be used to properly evaluate of model performance on the final model(s). This begs the question: "How can we tell what is best if we don't measure performance until the test set?"
-It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Section \@ref(overfitting-bad).] To combat this issue, a small validation set of data were held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
+It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Chapter \@ref(tuning).] To combat this issue, a small validation set of data was held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
:::rmdnote
Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
:::
-Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
+Validation sets are discussed more in Chapter \@ref(resampling) as a special case of _resampling_ methods that are used on the training set.
## Multilevel Data
diff --git a/06-fitting-models.Rmd b/06-fitting-models.Rmd
index f03178cf..84b741ff 100644
--- a/06-fitting-models.Rmd
+++ b/06-fitting-models.Rmd
@@ -2,7 +2,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(kknn)
-library(kableExtra)
library(tidyr)
tidymodels_prefer()
@@ -123,7 +122,6 @@ lm_xy_fit
[^fitxy]: What are the differences between `fit()` and `fit_xy()`? The `fit_xy()` function always passes the data as is to the underlying model function. It will not create dummy/indicator variables before doing so. When `fit()` is used with a model specification, this almost always means that dummy variables will be created from qualitative predictors. If the underlying function requires a matrix (like glmnet), it will make them. However, if the underlying function uses a formula, `fit()` just passes the formula to that function. We estimate that 99% of modeling functions using formulas make dummy variables. The other 1% include tree-based methods that do not require purely numeric predictors. See Section \@ref(workflow-encoding) for more about using formulas in tidymodels.
-
Not only does `r pkg(parsnip)` enable a consistent model interface for different packages, it also provides consistency in the model arguments. It is common for different functions that fit the same model to have different argument names. Random forest model functions are a good example. Three commonly used arguments are the number of trees in the ensemble, the number of predictors to randomly sample with each split within a tree, and the number of data points required to make a split. For three different R packages implementing this algorithm, those arguments are shown in Table \@ref(tab:rand-forest-args).
```{r, models-rf-arg-names, echo = FALSE, results = "asis"}
@@ -139,23 +137,21 @@ arg_info <-
get_from_env("rand_forest_args") %>%
select(engine, parsnip, original) %>%
full_join(arg_info, by = "parsnip") %>%
- mutate(package = ifelse(engine == "spark", "sparklyr", engine))
+ mutate(package = ifelse(engine == "spark", "sparklyr", engine)) %>%
+ mutate_at(c("parsnip", "original"), glue::backtick)
arg_info %>%
select(package, `Argument Type`, original) %>%
- # mutate(original = paste0("", original, "")) %>%
pivot_wider(
id_cols = c(`Argument Type`),
values_from = c(original),
names_from = c(package)
) %>%
- kable(
+ knitr::kable(
caption = "Example argument names for different random forest functions.",
label = "rand-forest-args",
escape = FALSE
- ) %>%
- kable_styling() %>%
- column_spec(2:4, monospace = TRUE)
+ )
```
In an effort to make argument specification less painful, `r pkg(parsnip)` uses common argument names within and between packages. Table \@ref(tab:parsnip-args) shows, for random forests, what `r pkg(parsnip)` models use.
@@ -164,14 +160,11 @@ In an effort to make argument specification less painful, `r pkg(parsnip)` uses
arg_info %>%
select(`Argument Type`, parsnip) %>%
distinct() %>%
- # mutate(parsnip = paste0("", parsnip, "")) %>%
- kable(
+ knitr::kable(
caption = "Random forest argument names used by parsnip.",
label = "parsnip-args",
escape = FALSE
- ) %>%
- kable_styling(full_width = FALSE) %>%
- column_spec(2, monospace = TRUE)
+ )
```
Admittedly, this is one more set of arguments to memorize. However, when other types of models have the same argument types, these names still apply. For example, boosted tree ensembles also create a large number of tree-based models, so `trees` is also used there, as is `min_n`, and so on.
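
As a sketch of how these names carry over (the engine here is chosen only for illustration):

```r
# The harmonized argument names apply to other tree ensembles as well.
boost_tree(trees = 500, min_n = 5) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
```
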
@@ -219,7 +212,7 @@ lm_form_fit %>% extract_fit_engine() %>% vcov()
```
:::rmdwarning
-Never pass the `fit` element of a `r pkg(parsnip)` model to a model prediction function, i.e., use `predict(lm_form_fit)` but *do not* use `predict(lm_form_fit$fit)`. If the data were preprocessed in any way, incorrect predictions will be generated (sometimes, without errors). The underlying model's prediction function has no idea if any transformations have been made to the data prior to running the model. See Section \@ref(parsnip-predictions) for more on making predictions.
+Never pass the `fit` element of a `r pkg(parsnip)` model to a model prediction function, i.e., use `predict(lm_form_fit)` but *do not* use `predict(lm_form_fit$fit)`. If the data were preprocessed in any way, incorrect predictions will be generated (sometimes, without errors). The underlying model's prediction function has no idea if any transformations have been made to the data prior to running the model. See the next section for more on making predictions.
:::
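
A brief sketch of the safe pattern described in this warning, assuming the `lm_form_fit` object and the `ames_test` data from earlier in the chapter:

```r
# Predict with the parsnip object so that any preprocessing is applied.
predict(lm_form_fit, new_data = ames_test %>% slice(1:3))
# Avoid predict(lm_form_fit$fit, ...); it bypasses parsnip and can silently
# produce incorrect predictions.
```
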
One issue with some existing methods in base R is that the results are stored in a manner that may not be the most useful. For example, the `summary()` method for `lm` objects can be used to print the results of the model fit, including a table with parameter values, their uncertainty estimates, and p-values. These particular results can also be saved:
@@ -293,11 +286,10 @@ tribble(
"probability (2 classes)", "numeric matrix (2nd level only)",
"probability (3+ classes)", "3D numeric array (all levels)",
) %>%
- kable(
+ knitr::kable(
caption = "Different return values for glmnet prediction types.",
label = "predict-types"
- ) %>%
- kable_styling(full_width = FALSE)
+ )
```
Additionally, the column names of the results contain coded values that map to a vector called `lambda` within the glmnet model object. This excellent statistical method can be discouraging to use in practice because of all of the special cases an analyst might encounter that require additional code to be useful.
@@ -313,12 +305,11 @@ tribble(
"conf_int", ".pred_lower, .pred_upper",
"pred_int", ".pred_lower, .pred_upper"
) %>%
- kable(
+ mutate_all(glue::backtick) %>%
+ knitr::kable(
caption = "The tidymodels mapping of prediction types and column names.",
label = "predictable-column-names",
- ) %>%
- kable_styling(full_width = FALSE) %>%
- column_spec(1:2, monospace = TRUE)
+ )
```
The third rule regarding the number of rows in the output is critical. For example, if any rows of the new data contain missing values, the output will be padded with missing results for those rows.
diff --git a/07-the-model-workflow.Rmd b/07-the-model-workflow.Rmd
index 3ef47baf..4f684b4e 100644
--- a/07-the-model-workflow.Rmd
+++ b/07-the-model-workflow.Rmd
@@ -2,17 +2,14 @@
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(workflowsets)
-library(kableExtra)
library(censored)
-library(survival)
tidymodels_prefer()
source("ames_snippets.R")
```
# A Model Workflow {#workflows}
-In the previous chapter, we discussed the `r pkg(parsnip)` package, which can be used to define and fit the model. This chapter introduces a new concept called a _model workflow_. The purpose of this concept (and the corresponding tidymodels `workflow()` object) is to encapsulate the major pieces of the modeling process (discussed in Section \@ref(model-phases)). The workflow is important in two ways. First, using a workflow concept encourages good methodology since it is a single point of entry to the estimation components of a data analysis. Second, it enables the user to better organize projects. These two points are discussed in the following sections.
-
+In the previous chapter, we discussed the `r pkg(parsnip)` package, which can be used to define and fit the model. This chapter introduces a new concept called a _model workflow_. The purpose of this concept (and the corresponding tidymodels `workflow()` object) is to encapsulate the major pieces of the modeling process (previously discussed in Chapter \@ref(software-modeling)). The workflow is important in two ways. First, using a workflow concept encourages good methodology since it is a single point of entry to the estimation components of a data analysis. Second, it enables the user to better organize their projects. These two points are discussed in the following sections.
## Where Does the Model Begin and End? {#begin-model-end}
@@ -42,7 +39,7 @@ In other software, such as Python or Spark, similar collections of steps are cal
Binding together the analytical components of data analysis is important for another reason. Future chapters will demonstrate how to accurately measure performance, as well as how to optimize structural parameters (i.e., model tuning). To correctly quantify model performance on the training set, Chapter \@ref(resampling) advocates using resampling methods. To do this properly, no data-driven parts of the analysis should be excluded from validation. To this end, the workflow must include all significant estimation steps.
-To illustrate, consider principal component analysis (PCA) signal extraction. We'll talk about this more in Section \@ref(example-steps) as well as Chapter \@ref(dimensionality); PCA is a way to replace correlated predictors with new artificial features that are uncorrelated and capture most of the information in the original set. The new features could be used as the predictors, and least squares regression could be used to estimate the model parameters.
+To illustrate, consider principal component analysis (PCA) signal extraction. We'll talk about this more in Chapter \@ref(recipes) as well as Chapter \@ref(dimensionality); PCA is a way to replace correlated predictors with new artificial features that are uncorrelated and capture most of the information in the original set. The new features could be used as the predictors and least squares regression could be used to estimate the model parameters.
There are two ways of thinking about the model workflow. Figure \@ref(fig:bad-workflow) illustrates the _incorrect_ method: to think of the PCA preprocessing step, as _not being part of the modeling workflow_.
@@ -99,7 +96,7 @@ lm_wflow <-
lm_wflow
```
-Workflows have a `fit()` method that can be used to create the model. Using the objects created in Section \@ref(models-summary):
+Workflows have a `fit()` method that can be used to create the model. Using the objects created in the summary at the end of Chapter \@ref(models):
```{r workflows-form-fit}
lm_fit <- fit(lm_wflow, ames_train)
@@ -112,7 +109,7 @@ We can also `predict()` on the fitted workflow:
predict(lm_fit, ames_test %>% slice(1:3))
```
-The `predict()` method follows all of the same rules and naming conventions that we described for the `r pkg(parsnip)` package in Section \@ref(parsnip-predictions).
+The `predict()` method follows all of the same rules and naming conventions that we described for the `r pkg(parsnip)` package in Chapter \@ref(models).
Both the model and preprocessor can be removed or updated:
@@ -153,13 +150,13 @@ When the model is fit, the specification assembles these data, unaltered, into a
fit(lm_wflow, ames_train)
```
-If you would like the underlying modeling method to do what it would normally do with the data, `add_variables()` can be a helpful interface. As we will see in Section \@ref(special-model-formulas), it also facilitates more complex modeling specifications. However, as we mention in the next section, models such as `glmnet` and `xgboost` expect the user to make indicator variables from factor predictors. In these cases, a recipe or formula interface will typically be a better choice.
+If you would like the underlying modeling method to do what it would normally do with the data, `add_variables()` can be a helpful interface. As we will see later in this chapter, it also facilitates more complex modeling specifications. However, as we mention in the next section, models such as `glmnet` and `xgboost` expect the user to make indicator variables from factor predictors. In these cases, a recipe or formula interface will typically be a better choice.
In the next chapter, we will look at a more powerful preprocessor (called a _recipe_) that can also be added to a workflow.
## How Does a `workflow()` Use the Formula? {#workflow-encoding}
-Recall from Section \@ref(formula) that the formula method in R has multiple purposes (we will discuss this further in Chapter \@ref(recipes)). One of these is to properly encode the original data into an analysis-ready format. This can involve executing inline transformations (e.g., `log(x)`), creating dummy variable columns, creating interactions or other column expansions, and so on. However, many statistical methods require different types of encodings:
+Recall from Chapter \@ref(base-r) that the formula method in R has multiple purposes (we will discuss this further in Chapter \@ref(recipes)). One of these is to properly encode the original data into an analysis-ready format. This can involve executing inline transformations (e.g., `log(x)`), creating dummy variable columns, creating interactions or other column expansions, and so on. However, many statistical methods require different types of encodings:
* Most packages for tree-based models use the formula interface but *do not* encode the categorical predictors as dummy variables.
@@ -287,7 +284,7 @@ location_models
location_models$fit[[1]]
```
-We use a `r pkg(purrr)` function here to map through our models, but there is an easier, better approach to fit workflow sets that will be introduced in Section \@ref(workflow-set).
+We use a `r pkg(purrr)` function here to map through our models, but there is an easier, better approach to fit workflow sets that will be introduced in Chapter \@ref(compare).
:::rmdnote
In general, there's a lot more to workflow sets! While we've covered the basics here, the nuances and advantages of workflow sets won't be illustrated until Chapter \@ref(workflow-sets).
@@ -321,7 +318,7 @@ collect_metrics(final_lm_res)
collect_predictions(final_lm_res) %>% slice(1:5)
```
-We'll see more about `last_fit()` in action and how to use it again in Section \@ref(bean-models).
+We'll see more about `last_fit()` in action and how to use it again in Chapter \@ref(dimensionality).
## Chapter Summary {#workflows-summary}
diff --git a/08-feature-engineering.Rmd b/08-feature-engineering.Rmd
index 8906e23e..264ba01d 100644
--- a/08-feature-engineering.Rmd
+++ b/08-feature-engineering.Rmd
@@ -1,7 +1,6 @@
```{r engineering-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
-library(kableExtra)
tidymodels_prefer()
@@ -41,7 +40,7 @@ Different models have different preprocessing requirements and some, such as tre
In this chapter, we introduce the [`r pkg(recipes)`](https://recipes.tidymodels.org/) package that you can use to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets. The `r pkg(recipes)` package is, like `r pkg(parsnip)` for models, one of the core tidymodels packages.
-This chapter uses the Ames housing data and the R objects created in the book so far, as summarized in Section \@ref(workflows-summary).
+This chapter uses the Ames housing data and the R objects created in the book so far, as summarized at the end of Chapter \@ref(workflows).
## A Simple `recipe()` for the Ames Housing Data
@@ -61,7 +60,7 @@ Suppose that an initial ordinary linear regression model were fit to these data.
lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)
```
-When this function is executed, the data are converted from a data frame to a numeric _design matrix_ (also called a _model matrix_) and then the least squares method is used to estimate parameters. In Section \@ref(formula) we listed the multiple purposes of the R model formula; let's focus only on the data manipulation aspects for now. What this formula does can be decomposed into a series of steps:
+When this function is executed, the data are converted from a data frame to a numeric _design matrix_ (also called a _model matrix_) and then the least squares method is used to estimate parameters. In Chapter \@ref(base-r) we listed the multiple purposes of the R model formula; let's focus only on the data manipulation aspects for now. What the formula above does can be decomposed into a series of steps:
1. Sale price is defined as the outcome while neighborhood, gross living area, the year built, and building type variables are all defined as predictors.
@@ -71,7 +70,7 @@ When this function is executed, the data are converted from a data frame to a nu
As mentioned in Chapter \@ref(base-r), the formula method will apply these data manipulations to any data, including new data, that are passed to the `predict()` function.
-A recipe is also an object that defines a series of steps for data processing. Unlike the formula method inside a modeling function, the recipe defines the steps via `step_*()` functions without immediately executing them; it is only a specification of what should be done. Here is a recipe equivalent to the previous formula that builds on the code summary in Section \@ref(splitting-summary):
+A recipe is also an object that defines a series of steps for data processing. Unlike the formula method inside a modeling function, the recipe defines the steps via `step_*()` functions without immediately executing them; it is only a specification of what should be done. Here is a recipe equivalent to the formula above that builds on the code summary at the end of Chapter \@ref(splitting):
```{r engineering-ames-simple-recipe}
library(tidymodels) # Includes the recipes package
@@ -91,7 +90,7 @@ Let's break this down:
1. `step_log()` declares that `Gr_Liv_Area` should be log transformed.
-1. `step_dummy()` specifies which variables should be converted from a qualitative format to a quantitative format, in this case, using dummy or indicator variables. An indicator or dummy variable is a binary numeric variable (a column of ones and zeroes) that encodes qualitative information; we will dig deeper into these kinds of variables in Section \@ref(dummies).
+1. `step_dummy()` specifies which variables should be converted from a qualitative format to a quantitative format, in this case, using dummy or indicator variables. An indicator or dummy variable is a binary numeric variable (a column of ones and zeroes) that encodes qualitative information; we will dig deeper into these kinds of variables later in this chapter.
The function `all_nominal_predictors()` captures the names of any predictor columns that are currently factor or character (i.e., nominal) in nature. This is a `r pkg(dplyr)`-like selector function similar to `starts_with()` or `matches()` but that can only be used inside of a recipe.
@@ -161,7 +160,7 @@ lm_fit %>%
```
:::rmdnote
-Tools for using (and debugging) recipes outside of workflow objects are described in Section \@ref(recipe-functions).
+Tools for using (and debugging) recipes outside of workflow objects are described in Chapter \@ref(dimensionality).
:::
## How Data Are Used by the `recipe()`
@@ -230,11 +229,10 @@ recipe(~Bldg_Type, data = ames_train) %>%
bake(ames_train) %>%
slice(show_rows) %>%
arrange(`Raw Data`) %>%
- kable(
+ knitr::kable(
caption = 'Illustration of binary encodings (i.e., dummy variables) for a qualitative predictor.',
label = "dummy-vars"
- ) %>%
- kable_styling(full_width = FALSE)
+ )
```
@@ -401,7 +399,7 @@ The [`r pkg(themis)`](https://themis.tidymodels.org/) package has recipe steps t
```
:::rmdwarning
-Only the training set should be affected by these techniques. The test set or other holdout samples should be left as-is when processed using the recipe. For this reason, all of the subsampling steps default the `skip` argument to have a value of `TRUE` (Section \@ref(skip-equals-true)).
+Only the training set should be affected by these techniques. The test set or other holdout samples should be left as-is when processed using the recipe. For this reason, all of the subsampling steps default the `skip` argument to have a value of `TRUE`.
:::
Other step functions are row-based as well: `step_filter()`, `step_sample()`, `step_slice()`, and `step_arrange()`. In almost all uses of these steps, the `skip` argument should be set to `TRUE`.
@@ -465,7 +463,7 @@ At the time of this writing, the step functions in the `r pkg(recipes)` and `r p
## Tidy a `recipe()`
-In Section \@ref(tidiness-modeling), we introduced the `tidy()` verb for statistical objects. There is also a `tidy()` method for recipes, as well as individual recipe steps. Before proceeding, let's create an extended recipe for the Ames data using some of the new steps we've discussed in this chapter:
+In Chapter \@ref(base-r), we introduced the `tidy()` verb for statistical objects. There is also a `tidy()` method for recipes, as well as individual recipe steps. Before proceeding, let's create an extended recipe for the Ames data using some of the new steps we've discussed in this chapter:
```{r engineering-lm-extended-recipe}
ames_rec <-
diff --git a/09-judging-model-effectiveness.Rmd b/09-judging-model-effectiveness.Rmd
index 3da00e35..539a7fb1 100644
--- a/09-judging-model-effectiveness.Rmd
+++ b/09-judging-model-effectiveness.Rmd
@@ -1,7 +1,6 @@
```{r performance-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
-library(kableExtra)
tidymodels_prefer()
source("ames_snippets.R")
load("RData/lm_fit.RData")
@@ -16,7 +15,7 @@ ad_folds <- vfold_cv(ad_data, repeats = 5)
Once we have a model, we need to know how well it works. A quantitative approach for estimating effectiveness allows us to understand the model, to compare different models, or to tweak the model to improve performance. Our focus in tidymodels is on empirical validation; this usually means using data that were not used to create the model as the substrate to measure effectiveness.
:::rmdwarning
-The best approach to empirical validation involves using _resampling_ methods that will be introduced in Chapter \@ref(resampling). In this chapter, we will motivate the need for empirical validation by using the test set. Keep in mind that the test set can only be used once, as explained in Section \@ref(splitting-methods).
+The best approach to empirical validation involves using _resampling_ methods that will be introduced in Chapter \@ref(resampling). In this chapter, we will motivate the need for empirical validation by using the test set. Keep in mind that the test set can only be used once, as explained in Chapter \@ref(splitting).
:::
When judging model effectiveness, your decision about which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best sub-model. Choosing the wrong metric can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig:performance-reg-metrics) demonstrates the difference between the two.
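
For instance, a sketch of computing both metrics with `r pkg(yardstick)`, using a hypothetical tibble `pred_results` with `observed` and `predicted` columns:

```r
# Both metrics follow the same data/truth/estimate interface.
reg_metrics <- metric_set(rmse, rsq)
reg_metrics(pred_results, truth = observed, estimate = predicted)
```
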
@@ -117,7 +116,7 @@ In the remainder of this chapter, we will discuss general approaches for evaluat
## Regression Metrics
-Recall from Section \@ref(parsnip-predictions) that tidymodels prediction functions produce tibbles with columns for the predicted values. These columns have consistent names, and the functions in the `r pkg(yardstick)` package that produce performance metrics have consistent interfaces. The functions are data frame-based, as opposed to vector-based, with the general syntax of:
+Recall from Chapter \@ref(models) that tidymodels prediction functions produce tibbles with columns for the predicted values. These columns have consistent names, and the functions in the `r pkg(yardstick)` package that produce performance metrics have consistent interfaces. The functions are data frame-based, as opposed to vector-based, with the general syntax of:
```r
function(data, truth, ...)
@@ -126,7 +125,7 @@ function(data, truth, ...)
where `data` is a data frame or tibble and `truth` is the column with the observed outcome values. The ellipses or other arguments are used to specify the column(s) containing the predictions.
-To illustrate, let's take the model from Section \@ref(recipes-summary). This model `lm_wflow_fit` combines a linear regression model with a predictor set supplemented with an interaction and spline functions for longitude and latitude. It was created from a training set (named `ames_train`). Although we do not advise using the test set at this juncture of the modeling process, it will be used here to illustrate functionality and syntax. The data frame `ames_test` consists of `r nrow(ames_test)` properties. To start, let's produce predictions:
+To illustrate, let's take the model from the very end of Chapter \@ref(recipes). This model `lm_wflow_fit` combines a linear regression model with a predictor set supplemented with an interaction and spline functions for longitude and latitude. It was created from a training set (named `ames_train`). Although we do not advise using the test set at this juncture of the modeling process, it will be used here to illustrate functionality and syntax. The data frame `ames_test` consists of `r nrow(ames_test)` properties. To start, let's produce predictions:
```{r performance-predict-ames}
@@ -355,7 +354,8 @@ The groupings also translate to the `autoplot()` methods, with results shown in
hpc_cv %>%
group_by(Resample) %>%
roc_curve(obs, VF, F, M, L) %>%
- autoplot()
+ autoplot() +
+ theme(legend.position = "none")
```
```{r grouped-roc-curves, ref.label = "performance-multi-class-roc-grouped"}
diff --git a/10-resampling.Rmd b/10-resampling.Rmd
index dad37006..b701db37 100644
--- a/10-resampling.Rmd
+++ b/10-resampling.Rmd
@@ -2,7 +2,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(doMC)
-library(kableExtra)
library(tidyr)
tidymodels_prefer()
registerDoMC(cores = parallel::detectCores())
@@ -27,7 +26,7 @@ In order to fully appreciate the value of resampling, let's first take a look th
## The Resubstitution Approach {#resampling-resubstition}
-When we measure performance on the same data that we used for training (as opposed to new data or testing data), we say we have *resubstituted* the data. Let's again use the Ames housing data to demonstrate these concepts. Section \@ref(recipes-summary) summarizes the current state of our Ames analysis. It includes a recipe object named `ames_rec`, a linear model, and a workflow using that recipe and model called `lm_wflow`. This workflow was fit on the training set, resulting in `lm_fit`.
+When we measure performance on the same data that we used for training (as opposed to new data or testing data), we say we have *resubstituted* the data. Let's again use the Ames housing data to demonstrate these concepts. The end of Chapter \@ref(recipes) summarizes the current state of our Ames analysis. It includes a recipe object named `ames_rec`, a linear model, and a workflow using that recipe and model called `lm_wflow`. This workflow was fit on the training set, resulting in `lm_fit`.
For a comparison to this linear model, we can also fit a different type of model. _Random forests_ are a tree ensemble method that operates by creating a large number of decision trees from slightly different versions of the training set [@breiman2001random]. This collection of trees makes up the ensemble. When predicting a new sample, each ensemble member makes a separate prediction. These are averaged to create the final ensemble prediction for the new data point.
@@ -116,14 +115,11 @@ For both models, Table \@ref(tab:rmse-results) summarizes the RMSE estimate for
```{r resampling-rmse-table, echo = FALSE, results = "asis"}
all_res %>%
- mutate(object = paste0("", object, "")) %>%
- kable(
+ knitr::kable(
caption = "Performance statistics for training and test sets.",
label = "rmse-results",
escape = FALSE
- ) %>%
- kable_styling(full_width = FALSE) %>%
- add_header_above(c(" ", "RMSE Estimates" = 2))
+ )
```
Notice that the linear regression model is consistent between training and testing, because of its limited complexity.^[It is possible for a linear model to nearly memorize the training set, like the random forest model did. In the `ames_rec` object, change the number of spline terms for `longitude` and `latitude` to a large number (say 1,000). This would produce a model fit with a very small resubstitution RMSE and a test set RMSE that is much larger.]
@@ -173,7 +169,7 @@ Cross-validation is a well established resampling method. While there are a numb
knitr::include_graphics("premade/three-CV.svg")
```
-The color of the symbols in Figure \@ref(fig:cross-validation-allocation) represents their randomly assigned folds. Stratified sampling is also an option for assigning folds (previously discussed in Section \@ref(splitting-methods)).
+The color of the symbols in Figure \@ref(fig:cross-validation-allocation) represents their randomly assigned folds. Stratified sampling is also an option for assigning folds (previously discussed in Chapter \@ref(splitting)).
For three-fold cross-validation, the three iterations of resampling are illustrated in Figure \@ref(fig:cross-validation). For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.
@@ -213,8 +209,7 @@ To manually retrieve the partitioned data, the `analysis()` and `assessment()` f
ames_folds$splits[[1]] %>% analysis() %>% dim()
```
-The `r pkg(tidymodels)` packages, such as [`r pkg(tune)`](https://tune.tidymodels.org/), contain high-level user interfaces so that functions like `analysis()` are not generally needed for day-to-day work. Section \@ref(resampling-performance) demonstrates a function to fit a model over these resamples.
-
+The `r pkg(tidymodels)` packages, such as [`r pkg(tune)`](https://tune.tidymodels.org/), contain high-level user interfaces so that functions like `analysis()` are not generally needed for day-to-day work. Later in this chapter we demonstrate functions to fit a model over these resamples.
There are a variety of cross-validation variations; we'll go through the most important ones.
@@ -273,7 +268,7 @@ mc_cv(ames_train, prop = 9/10, times = 20)
### Validation sets {#validation}
-In Section \@ref(what-about-a-validation-set), we briefly discussed the use of a validation set, a single partition that is set aside to estimate performance separate from the test set. When using a validation set, the initial available data set is split into a training set, a validation set, and a test set (see Figure \@ref(fig:three-way-split)).
+In Chapter \@ref(splitting), we briefly discussed the use of a validation set, a single partition that is set aside to estimate performance separate from the test set. When using a validation set, the initial available data set is split into a training set, a validation set, and a test set (see Figure \@ref(fig:three-way-split)).
```{r three-way-split}
#| echo = FALSE,
@@ -449,7 +444,7 @@ collect_metrics(rf_res)
These are the resampling estimates averaged over the individual replicates. To get the metrics for each resample, use the option `summarize = FALSE`.
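
For example, a short sketch using the `rf_res` object created above:

```r
# Return the metrics for each individual resample instead of the averages.
collect_metrics(rf_res, summarize = FALSE)
```
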
-Notice how much more realistic the performance estimates are than the resubstitution estimates from Section \@ref(resampling-resubstition)!
+Notice how much more realistic the performance estimates are than the resubstitution estimates from earlier in the chapter!
To obtain the assessment set predictions:
diff --git a/11-comparing-models.Rmd b/11-comparing-models.Rmd
index 1ded3a55..656653cd 100644
--- a/11-comparing-models.Rmd
+++ b/11-comparing-models.Rmd
@@ -5,7 +5,6 @@ library(corrr)
library(doMC)
library(tidyposterior)
library(rstanarm)
-library(kableExtra)
library(tidyr)
library(forcats)
registerDoMC(cores = parallel::detectCores())
@@ -27,7 +26,7 @@ In either case, the result is a collection of resampled summary statistics (e.g.
## Creating Multiple Models with Workflow Sets {#workflow-set}
-In Section \@ref(workflow-sets-intro) we described the idea of a workflow set where different preprocessors and/or models can be combinatorially generated. In Chapter \@ref(resampling), we used a recipe for the Ames data that included an interaction term as well as spline functions for longitude and latitude. To demonstrate more with workflow sets, let's create three different linear models that add these preprocessing steps incrementally; we can test whether these additional terms improve the model results. We'll create three recipes then combine them into a workflow set:
+In Chapter \@ref(workflows) we described the idea of a workflow set where different preprocessors and/or models can be combinatorially generated. In Chapter \@ref(resampling), we used a recipe for the Ames data that included an interaction term as well as spline functions for longitude and latitude. To demonstrate more with workflow sets, let's create three different linear models that add these preprocessing steps incrementally; we can test whether these additional terms improve the model results. We'll create three recipes then combine them into a workflow set:
```{r compare-workflow-set}
library(tidymodels)
@@ -139,8 +138,8 @@ These correlations are high, and indicate that, across models, there are large w
```{r compare-rsq-plot, eval=FALSE}
rsq_indiv_estimates %>%
mutate(wflow_id = reorder(wflow_id, .estimate)) %>%
- ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id)) +
- geom_line(alpha = .5, lwd = 1.25) +
+ ggplot(aes(x = wflow_id, y = .estimate, group = id, color = id, lty = id)) +
+ geom_line(alpha = .8, lwd = 1.25) +
theme(legend.position = "none")
```
@@ -207,12 +206,11 @@ rsq_indiv_estimates %>%
) %>%
select(`Y = rsq` = rsq, model, X1, X2, X3, id) %>%
slice(1:6) %>%
- kable(
+ knitr::kable(
caption = "Model performance statistics as a data set for analysis.",
label = "model-anova-data",
escape = FALSE
- ) %>%
- kable_styling(full_width = FALSE)
+ )
```
The `X1`, `X2`, and `X3` columns in the table are indicators for the values in the `model` column. Their order was defined in the same way that R would define them, alphabetically ordered by `model`.
diff --git a/12-tuning-parameters.Rmd b/12-tuning-parameters.Rmd
index e3969079..8207fe7e 100644
--- a/12-tuning-parameters.Rmd
+++ b/12-tuning-parameters.Rmd
@@ -80,11 +80,11 @@ In some cases, preprocessing techniques require tuning:
Some classical statistical models also have structural parameters:
- * In binary regression, the logit link is commonly used (i.e., logistic regression). Other link functions, such as the probit and complementary log-log, are also available [@Dobson99]. This example is described in more detail in the Section \@ref(what-to-optimize).
+ * In binary regression, the logit link is commonly used (i.e., logistic regression). Other link functions, such as the probit and complementary log-log, are also available [@Dobson99]. This example is described in more detail in the next section.
* Non-Bayesian longitudinal and repeated measures models require a specification for the covariance or correlation structure of the data. Options include compound symmetric (a.k.a. exchangeable), autoregressive, Toeplitz, and others [@littell2000modelling].
-A counterexample where it is inappropriate to tune a parameter is the prior distribution required for Bayesian analysis. The prior encapsulates the analyst's belief about the distribution of a quantity before evidence or data are taken into account. For example, in Section \@ref(tidyposterior), we used a Bayesian ANOVA model, and we were unclear about what the prior should be for the regression parameters (beyond being a symmetric distribution). We chose a t-distribution with one degree of freedom for the prior since it has heavier tails; this reflects our added uncertainty. Our prior beliefs should not be subject to optimization. Tuning parameters are typically optimized for performance whereas priors should not be tweaked to get "the right results."
+A counterexample where it is inappropriate to tune a parameter is the prior distribution required for Bayesian analysis. The prior encapsulates the analyst's belief about the distribution of a quantity before evidence or data are taken into account. For example, in Chapter \@ref(compare), we used a Bayesian ANOVA model, and we were unclear about what the prior should be for the regression parameters (beyond being a symmetric distribution). We chose a t-distribution with one degree of freedom for the prior since it has heavier tails; this reflects our added uncertainty. Our prior beliefs should not be subject to optimization. Tuning parameters are typically optimized for performance whereas priors should not be tweaked to get "the right results."
:::rmdwarning
Another (perhaps more debatable) counterexample of a parameter that does _not_ need to be tuned is the number of trees in a random forest or bagging model. This value should instead be chosen to be large enough to ensure numerical stability in the results; tuning it cannot improve performance as long as the value is large enough to produce reliable results. For random forests, this value is typically in the thousands while the number of trees needed for bagging is around 50 to 100.
@@ -105,7 +105,7 @@ To demonstrate, consider the classification data shown in Figure \@ref(fig:two-c
#| fig.cap = "An example two-class classification data set with two predictors",
#| fig.alt = "An example two-class classification data set with two predictors. The two predictors have a moderate correlation and there is some locations of separation between the classes."
ggplot(training_set, aes(x = A, y = B, color = Class, pch = Class)) +
- geom_point(alpha = 0.7) +
+ geom_point(alpha = 0.8) +
coord_equal() +
labs(x = "Predictor A", y = "Predictor B", color = NULL, pch = NULL) +
scale_color_manual(values = c("#CC6677", "#88CCEE"))
@@ -259,7 +259,7 @@ link_grids <-
link_grids %>%
ggplot(aes(x = A, y = B)) +
geom_point(data = testing_set, aes(color = Class, pch = Class),
- alpha = 0.7, show.legend = FALSE) +
+ alpha = 0.8, show.legend = FALSE) +
geom_contour(aes( z = .pred_Class1, lty = link), breaks = 0.5, color = "black") +
scale_color_manual(values = c("#CC6677", "#88CCEE")) +
coord_equal() +
@@ -278,13 +278,13 @@ Metric optimization is thoroughly discussed by @thomas2020problem who explore se
## The consequences of poor parameter estimates {#overfitting-bad}
-Many tuning parameters modulate the amount of model complexity. More complexity often implies more malleability in the patterns that a model can emulate. For example, as shown in Section \@ref(spline-functions), adding degrees of freedom in a spline function increases the intricacy of the prediction equation. While this is an advantage when the underlying motifs in the data are complex, it can also lead to overinterpretation of chance patterns that would not reproduce in new data. _Overfitting_ is the situation where a model adapts too much to the training data; it performs well for the data used to build the model but poorly for new data.
+Many tuning parameters modulate the amount of model complexity. More complexity often implies more malleability in the patterns that a model can emulate. For example, as shown in Chapter \@ref(recipes), adding degrees of freedom in a spline function increases the intricacy of the prediction equation. While this is an advantage when the underlying motifs in the data are complex, it can also lead to overinterpretation of chance patterns that would not reproduce in new data. _Overfitting_ is the situation where a model adapts too much to the training data; it performs well for the data used to build the model but poorly for new data.
:::rmdwarning
Since tuning model parameters can increase model complexity, poor choices can lead to overfitting.
:::
-Recall the single layer neural network model described in Section \@ref(tuning-parameter-examples). With a single hidden unit and sigmoidal activation functions, a neural network for classification is, for all intents and purposes, just logistic regression. However, as the number of hidden units increases, so does the complexity of the model. In fact, when the network model uses sigmoidal activation units, @cybenko1989approximation showed that the model is a universal function approximator as long as there are enough hidden units.
+Recall the single layer neural network model described in the first section of this chapter. With a single hidden unit and sigmoidal activation functions, a neural network for classification is, for all intents and purposes, just logistic regression. However, as the number of hidden units increases, so does the complexity of the model. In fact, when the network model uses sigmoidal activation units, @cybenko1989approximation showed that the model is a universal function approximator as long as there are enough hidden units.
We fit neural network classification models to the same two-class data from the previous section, varying the number of hidden units. Using the area under the ROC curve as a performance metric, the effectiveness of the model on the training set increases as more hidden units are added. The network model thoroughly and meticulously learns the training set. If the model judges itself on the training set ROC value, it prefers many hidden units so that it can nearly eliminate errors.
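A minimal sketch of this experiment is shown below. The `training_set` object and the `.pred_Class1` column name follow this chapter's conventions; the specific numbers of hidden units and epochs are arbitrary choices for illustration.

```{r tuning-hidden-units-sketch, eval = FALSE}
library(tidymodels)

# Apparent (training set) ROC AUC for single-layer networks of increasing size
map_dfr(c(2, 5, 10, 20), function(units) {
  nnet_fit <-
    mlp(hidden_units = units, epochs = 100) %>%
    set_engine("nnet") %>%
    set_mode("classification") %>%
    fit(Class ~ A + B, data = training_set)

  augment(nnet_fit, training_set) %>%
    roc_auc(Class, .pred_Class1) %>%
    mutate(hidden_units = units)
})
```

Because these statistics are computed on the same data used to fit each network, they reward complexity and should not be used to choose the number of hidden units.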
@@ -345,7 +345,7 @@ te_plot <-
) %>%
ggplot(aes(x = A, y = B)) +
geom_point(data = testing_set, aes(color = Class, pch = Class),
- alpha = 0.5, show.legend = FALSE) +
+ alpha = 0.7, show.legend = FALSE) +
geom_contour(aes( z = .pred_Class1), breaks = 0.5, color = "black") +
scale_color_manual(values = c("#CC6677", "#88CCEE")) +
facet_wrap(~ label, nrow = 1) +
@@ -367,7 +367,7 @@ tr_plot <-
) %>%
ggplot(aes(x = A, y = B)) +
geom_point(data = training_set, aes(color = Class, pch = Class),
- alpha = 0.5, show.legend = FALSE) +
+ alpha = 0.7, show.legend = FALSE) +
geom_contour(aes( z = .pred_Class1), breaks = 0.5, color = "black") +
scale_color_manual(values = c("#CC6677", "#88CCEE")) +
facet_wrap(~ label, nrow = 1) +
@@ -449,13 +449,13 @@ Examples of these strategies are discussed in detail in the next two chapters. B
We've already dealt with quite a number of arguments that correspond to tuning parameters for recipe and model specifications in previous chapters. It is possible to tune:
-* the threshold for combining neighborhoods into an "other" category (with argument name `threshold`) discussed in Section \@ref(dummies)
+* the threshold for combining neighborhoods into an "other" category (with argument name `threshold`) discussed in Chapter \@ref(recipes)
-* the number of degrees of freedom in a natural spline (`deg_free`, Section \@ref(spline-functions))
+* the number of degrees of freedom in a natural spline (`deg_free`, Chapter \@ref(recipes))
-* the number of data points required to execute a split in a tree-based model (`min_n`, Section \@ref(create-a-model))
+* the number of data points required to execute a split in a tree-based model (`min_n`, Chapter \@ref(models))
-* the amount of regularization in penalized models (`penalty`, Section \@ref(create-a-model))
+* the amount of regularization in penalized models (`penalty`, Chapter \@ref(models))
For `r pkg(parsnip)` model specifications, there are two kinds of parameter arguments. *Main arguments* are those that are most often optimized for performance and are available in multiple engines. The main tuning parameters are top-level arguments to the model specification function. For example, the `rand_forest()` function has main arguments `trees`, `min_n`, and `mtry` since these are most frequently specified or optimized.
@@ -470,7 +470,7 @@ rand_forest(trees = 2000, min_n = 10) %>% # <- main arguments
The main arguments use a harmonized naming system to remove inconsistencies across engines while engine-specific arguments do not.
:::
-How can we signal to tidymodels functions which arguments should be optimized? Parameters are marked for tuning by assigning them a value of `tune()`. For the single layer neural network used in Section \@ref(overfitting-bad), the number of hidden units is designated for tuning using:
+How can we signal to tidymodels functions which arguments should be optimized? Parameters are marked for tuning by assigning them a value of `tune()`. For the single layer neural network used earlier in this chapter, the number of hidden units is designated for tuning using:
```{r tuning-mlp-units}
neural_net_spec <-
@@ -494,7 +494,7 @@ extract_parameter_set_dials(neural_net_spec)
The results show a value of `nparam[+]`, indicating that the number of hidden units is a numeric parameter.
-There is an optional identification argument that associates a name with the parameters. This can come in handy when the same kind of parameter is being tuned in different places. For example, with the Ames housing data from Section \@ref(resampling-summary), the recipe encoded both longitude and latitude with spline functions. If we want to tune the two spline functions to potentially have different levels of smoothness, we call `step_ns()` twice, once for each predictor. To make the parameters identifiable, the identification argument can take any character string:
+There is an optional identification argument that associates a name with the parameters. This can come in handy when the same kind of parameter is being tuned in different places. For example, with the Ames housing data example from the end of Chapter \@ref(resampling), the recipe encoded both longitude and latitude with spline functions. If we want to tune the two spline functions to potentially have different levels of smoothness, we call `step_ns()` twice, once for each predictor. To make the parameters identifiable, the identification argument can take any character string:
```{r tuning-id}
ames_rec <-
diff --git a/13-grid-search.Rmd b/13-grid-search.Rmd
index edaca8bd..46292d9c 100644
--- a/13-grid-search.Rmd
+++ b/13-grid-search.Rmd
@@ -56,7 +56,7 @@ mlp_spec <-
set_mode("classification")
```
-The argument `trace = 0` prevents extra logging of the training process. As shown in Section \@ref(tuning-params-tidymodels), the `extract_parameter_set_dials()` function can extract the set of arguments with unknown values and sets their `r pkg(dials)` objects:
+The argument `trace = 0` prevents extra logging of the training process. As shown in Chapter \@ref(tuning), the `extract_parameter_set_dials()` function can extract the set of arguments with unknown values and set their `r pkg(dials)` objects:
```{r grid-mlp-param}
mlp_param <- extract_parameter_set_dials(mlp_spec)
@@ -95,7 +95,7 @@ mlp_param %>%
There are techniques for creating regular grids that do not use all possible values of each parameter set. These _fractional factorial designs_ [@BHH] could also be used. To learn more, consult the CRAN Task View for experimental design.^[https://CRAN.R-project.org/view=ExperimentalDesign]
:::rmdwarning
-Regular grids can be computationally expensive to use, especially when there are a medium-to-large number of tuning parameters. This is true for many models but not all. As discussed in Section \@ref(efficient-grids) below, there are many models whose tuning time _decreases_ with a regular grid!
+Regular grids can be computationally expensive to use, especially when there are a medium-to-large number of tuning parameters. This is true for many models but not all. As discussed further in this chapter, there are many models whose tuning time _decreases_ with a regular grid!
:::
One advantage to using a regular grid is that the relationships and patterns between the tuning parameters and the model metrics are easily understood. The factorial nature of these designs allows for examination of each parameter separately with little confounding between parameters.
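For reference, here is a sketch of building such a grid with `grid_regular()`; it assumes that `mlp_param` holds the `hidden_units`, `penalty`, and `epochs` parameters tuned in this chapter, and the numbers of levels are arbitrary:

```{r grid-regular-sketch, eval = FALSE}
# Crossing 3 x 2 x 2 levels yields a 12-candidate regular grid
mlp_param %>%
  grid_regular(levels = c(hidden_units = 3, penalty = 2, epochs = 2))
```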
@@ -165,7 +165,7 @@ Space-filling designs can be very effective at representing the parameter space.
## Evaluating the Grid {#evaluating-grid}
-To choose the best tuning parameter combination, each candidate set is assessed using data that were not used to train that model. Resampling methods or a single validation set work well for this purpose. The process (and syntax) closely resembles the approach in Section \@ref(resampling-performance) that used the `fit_resamples()` function from the `r pkg(tune)` package.
+To choose the best tuning parameter combination, each candidate set is assessed using data that were not used to train that model. Resampling methods or a single validation set work well for this purpose. The process (and syntax) closely resembles the approach in Chapter \@ref(resampling) that used the `fit_resamples()` function from the `r pkg(tune)` package.
After resampling, the user selects the most appropriate candidate parameter set. It might make sense to choose the empirically best parameter combination or bias the choice towards other aspects of the model fit, such as simplicity.
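For example, assuming a tuning result called `mlp_reg_tune` (a hypothetical name for the output of `tune_grid()`), the `r pkg(tune)` package has helpers for ranking and selecting candidates:

```{r show-select-best-sketch, eval = FALSE}
# Top candidates ranked by the resampled area under the ROC curve
show_best(mlp_reg_tune, metric = "roc_auc", n = 5)

# The single numerically best combination, e.g., to pass to finalize_workflow()
select_best(mlp_reg_tune, metric = "roc_auc")
```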
@@ -226,7 +226,7 @@ mlp_param <-
In `step_pca()`, using zero PCA components is a shortcut to skip the feature extraction. In this way, the original predictors can be directly compared to the results that include PCA components.
:::
-The `tune_grid()` function is the primary function for conducting grid search. Its functionality is very similar to `fit_resamples()` from Section \@ref(resampling-performance), although it has additional arguments related to the grid:
+The `tune_grid()` function is the primary function for conducting grid search. Its functionality is very similar to `fit_resamples()`, although it has additional arguments related to the grid:
* `grid`: An integer or data frame. When an integer is used, the function creates a space-filling design with `grid` number of candidate parameter combinations. If specific parameter combinations exist, the `grid` parameter is used to pass them to the function.
@@ -463,7 +463,7 @@ Even though we fit the model with and without the submodel prediction trick, thi
### Parallel processing
-As previously mentioned in Section \@ref(parallel), parallel processing is an effective method for decreasing execution time when resampling models. This advantage conveys to model tuning via grid search, although there are additional considerations.
+As previously mentioned in Chapter \@ref(resampling), parallel processing is an effective method for decreasing execution time when resampling models. This advantage conveys to model tuning via grid search, although there are additional considerations.
Let's consider two different parallel processing schemes.
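As a sketch of the mechanics (the worker count is arbitrary), a parallel backend is registered once per session and the scheme is then chosen via the `parallel_over` option of the control object:

```{r parallel-over-sketch, eval = FALSE}
library(doMC)
registerDoMC(cores = 10)

# Parallelize across resamples only...
ctrl_resamples <- control_grid(parallel_over = "resamples")

# ...or across every combination of resamples and tuning parameter candidates
ctrl_everything <- control_grid(parallel_over = "everything")
```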
@@ -655,6 +655,7 @@ First, let's consider the raw execution times in Figure \@ref(fig:parallel-times
#| fig.alt = "Execution times for model tuning versus the number of workers using different delegation schemes. The diagonal black line indicates a linear speedup where the addition of a new worker process has maximal effect. The 'everything' scheme shows that the benefits decrease after three or four workers, especially when there is expensive preprocessing. The 'resamples' scheme has almost linear speedups across all tasks."
load("extras/parallel_times/xgb_times.RData")
+
ggplot(times, aes(x = num_cores, y = elapsed, color = parallel_over, shape = parallel_over)) +
geom_point(size = 2) +
geom_line() +
@@ -699,7 +700,7 @@ ggplot(times, aes(x = num_cores, y = speed_up, color = parallel_over, shape = pa
The best speed-ups, for these data, occur when `parallel_over = "resamples"` and when the computations are expensive. However, in the latter case, remember that the previous analysis indicates that the overall model fits are slower.
-What is the benefit of using the submodel optimization method in conjunction with parallel processing? The C5.0 classification model shown in Section \@ref(submodel-trick) was also run in parallel with ten workers. The parallel computations took 13.3 seconds for a `r round(100.147/13.265, 1)`-fold speed-up (both runs used the submodel optimization trick). Between the submodel optimization trick and parallel processing, there was a total `r round(3734.249/13.265, 0)`-fold speed-up over the most basic grid search code.
+What is the benefit of using the submodel optimization method in conjunction with parallel processing? The C5.0 classification model shown earlier in this chapter was also run in parallel with ten workers. The parallel computations took 13.3 seconds for a `r round(100.147/13.265, 1)`-fold speed-up (both runs used the submodel optimization trick). Between the submodel optimization trick and parallel processing, there was a total `r round(3734.249/13.265, 0)`-fold speed-up over the most basic grid search code.
:::rmdwarning
Overall, note that the increased computational savings will vary from model to model and are also affected by the size of the grid, the number of resamples, etc. A very computationally efficient model may not benefit as much from parallel processing.
@@ -791,10 +792,12 @@ remaining <-
mlp_sfd_race %>%
collect_metrics() %>%
dplyr::filter(n == 10)
+
+remaining_text <- cli::pluralize("{nrow(remaining)} remain{?s/}.")
```
-As an example, in the multilayer perceptron tuning process with a regular grid explored in this chapter, what would the results look like after only the first three folds? Using techniques similar to those shown in Chapter \@ref(compare), we can fit a model where the outcome is the resampled area under the ROC curve and the predictor is an indicator for the parameter combination. The model takes the resample-to-resample effect into account and produces point and interval estimates for each parameter setting. The results of the model are one-sided 95% confidence intervals that measure the loss of the ROC value relative to the currently best performing parameters, as shown in Figure \@ref(fig:racing-process).
+As an example, in the multilayer perceptron tuning process with a regular grid explored in this chapter, what would the results look like after only the first three folds? Using techniques similar to those shown in Chapter \@ref(compare), we can fit a model where the outcome is the resampled area under the ROC curve and the predictor is an indicator for the parameter combination. The model takes the resample-to-resample effect into account and produces point and interval estimates for each parameter setting. The results of the model are one-sided 95% confidence intervals that measure the loss of the ROC value relative to the currently best performing parameters.
```{r racing-process}
#| echo = FALSE,
@@ -803,10 +806,9 @@ As an example, in the multilayer perceptron tuning process with a regular grid e
#| fig.height = 5,
#| out.width = "80%",
#| fig.cap = "The racing process for 20 tuning parameters and 10 resamples",
-#| fig.alt = "An illustration of the racing process for 20 tuning parameters and 10 resamples. The analysis is conducted at the first, third, and last resample. As the number of resamples increases, the confidence intervals show some model configurations that do not have confidence intervals that overlap with zero. These are excluded from subsequent resamples."
+#| fig.alt = "The racing process for 20 tuning parameters and 10 resamples. The analysis is conducted at the first, third, and last resample. As the number of resamples increases, the confidence intervals show some model configurations that do not have confidence intervals that overlap with zero. These are excluded from subsequent resamples."
full_att <- attributes(mlp_sfd_race)
-
race_details <- NULL
for(iter in 1:10) {
@@ -824,7 +826,6 @@ for(iter in 1:10) {
race_details,
finetune:::test_parameters_gls(tmp) %>% mutate(iter = iter))
}
-
race_details <-
race_details %>%
mutate(
@@ -835,73 +836,34 @@ race_details <-
decision = ifelse(pass & estimate == 0, "best", decision)
) %>%
mutate(
+ .config = factor(.config),
+ .config = format(as.integer(.config)),
+ .config = paste("config", .config),
.config = factor(.config),
.config = reorder(.config, estimate),
decision = factor(decision, levels = c("best", "retain", "discard"))
)
race_cols <- c(best = "blue", retain = "black", discard = "grey")
-
iter_three <- race_details %>% dplyr::filter(iter == 3)
-
-iter_three %>%
+race_details %>%
+ filter(iter %in% c(1, 3, 10)) %>%
+ mutate(iter = paste("resamples:", format(iter))) %>%
ggplot(aes(x = -estimate, y = .config)) +
- geom_vline(xintercept = 0, lty = 2, color = "green") +
- geom_point(size = 2, aes(color = decision)) +
- geom_errorbarh(aes(xmin = -estimate, xmax = -upper, color = decision), height = .3, show.legend = FALSE) +
+ geom_vline(xintercept = 0, lty = 2, col = "green") +
+ geom_point(size = 2, aes(col = decision, pch = decision)) +
+ geom_errorbarh(aes(xmin = -estimate, xmax = -upper, col = decision), height = .3, show.legend = FALSE) +
labs(x = "Loss of ROC AUC", y = NULL) +
- scale_colour_manual(values = race_cols)
+ scale_colour_manual(values = race_cols) +
+ facet_wrap(~iter) +
+ theme(legend.position = "top")
```
-Any parameter set whose confidence interval includes zero would lack evidence that its performance is not statistically different from the best results. We retain `r sum(iter_three$upper >= 0)` settings; these are resampled more. The remaining `r sum(iter_three$upper < 0)` submodels are no longer considered.
+Figure \@ref(fig:racing-process) shows the results at several iterations in the process. In the panel for the first iteration, the points are individual ROC AUC values; as iterations progress, the points become averages of the resampled ROC statistics.
-```{r grid-mlp-racing-anim, include = FALSE, dev = 'png'}
-race_ci_plots <- function(x, iters = max(x$iter)) {
-
- x_rng <- extendrange(c(-x$estimate, -x$upper))
-
- for (i in 1:iters) {
- if (i < 3) {
- ttl <- paste0("Iteration ", i, ": burn-in")
- } else {
- ttl <- paste0("Iteration ", i, ": testing")
- }
- p <-
- x %>%
- dplyr::filter(iter == i) %>%
- ggplot(aes(x = -estimate, y = .config, color = decision)) +
- geom_vline(xintercept = 0, color = "green", lty = 2) +
- geom_point(size = 2) +
- labs(title = ttl, y = "", x = "Loss of ROC AUC") +
- scale_color_manual(values = c(best = "blue", retain = "black", discard = "grey"),
- drop = FALSE) +
- scale_y_discrete(drop = FALSE) +
- xlim(x_rng) +
- theme_bw() +
- theme(legend.position = "top")
-
- if (i >= 3) {
- p <- p + geom_errorbar(aes(xmin = -estimate, xmax = -upper), width = .3)
- }
-
- print(p)
- }
- invisible(NULL)
-}
-av_capture_graphics(
- race_ci_plots(race_details),
- output = "race_results.mp4",
- width = 720,
- height = 720,
- res = 120,
- framerate = 1/3
-)
-```
+On the third iteration, the leading model configuration has changed and the algorithm computes one-sided confidence intervals. Any parameter set whose confidence interval includes zero would lack evidence that its performance is not statistically different from the best results. We retain `r sum(iter_three$upper < 0)` settings; these are resampled more. The remaining `r sum(iter_three$upper >= 0)` submodels are no longer considered.
-
+The process continues to resample configurations that remain and the statistical analysis repeats with the current results. More submodels may be removed from consideration. Prior to the final resample, almost all submodels are eliminated and, at the last iteration, only `r remaining_text`^[See @kuhn2014futility for more details on the computational aspects of this approach.]
-The process continues for each resample; after the next set of performance metrics, a new model is fit to these statistics, and more submodels are potentially discarded.^[See @kuhn2014futility for more details on the computational aspects of this approach.]
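For reference, a racing run along these lines produces an object like the `mlp_sfd_race` result used for the figure above. This is only a sketch: `mlp_wflow` and `cell_folds` are assumed names for the workflow and resamples, and the options shown are illustrative rather than the exact settings used here.

```{r racing-sketch, eval = FALSE}
library(finetune)

set.seed(99)
mlp_sfd_race <-
  mlp_wflow %>%
  tune_race_anova(
    resamples = cell_folds,
    grid = 20,
    metrics = metric_set(roc_auc),
    control = control_race(verbose_elim = TRUE)  # log interim eliminations
  )
```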
:::rmdwarning
Racing methods can be more efficient than basic grid search as long as the interim analysis is fast and some parameter settings have poor performance. It also is most helpful when the model does _not_ have the ability to exploit submodel predictions.
diff --git a/14-iterative-search.Rmd b/14-iterative-search.Rmd
index b3cebdab..4c217be7 100644
--- a/14-iterative-search.Rmd
+++ b/14-iterative-search.Rmd
@@ -3,7 +3,6 @@ knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(finetune)
library(patchwork)
-library(kableExtra)
library(av)
library(doMC)
registerDoMC(cores = parallel::detectCores(logical = TRUE))
@@ -39,11 +38,11 @@ We use the same data on cell characteristics as the previous chapter for illustr
## A Support Vector Machine Model {#svm}
-We once again use the cell segmentation data, described in Section \@ref(evaluating-grid), for modeling, with a support vector machine (SVM) model to demonstrate sequential tuning methods. See @apm for more information on this model. The two tuning parameters to optimize are the SVM cost value and the radial basis function kernel parameter $\sigma$. Both parameters can have a profound effect on the model complexity and performance.
+We once again use the cell segmentation data, described in Chapter \@ref(grid-search), for modeling, with a support vector machine (SVM) model to demonstrate sequential tuning methods. See @apm for more information on this model. The two tuning parameters to optimize are the SVM cost value and the radial basis function kernel parameter $\sigma$. Both parameters can have a profound effect on the model complexity and performance.
The SVM model uses a dot product and, for this reason, it is necessary to center and scale the predictors. Like the multilayer perceptron model, this model would benefit from the use of PCA feature extraction. However, we will not use this third tuning parameter in this chapter so that we can visualize the search process in two dimensions.
-Along with the previously used objects (shown in Section \@ref(grid-summary)), the tidymodels objects `svm_rec`, `svm_spec`, and `svm_wflow` define the model process:
+Along with the previously used objects (shown in the summary of Chapter \@ref(grid-search)), the tidymodels objects `svm_rec`, `svm_spec`, and `svm_wflow` define the model process:
```{r iterative-svm-defs, message = FALSE}
library(tidymodels)
@@ -134,12 +133,10 @@ collect_metrics(svm_initial) %>%
select(ROC = mean, cost, rbf_sigma) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
- kable(
+ knitr::kable(
caption = "Resampling statistics used as the initial substrate to the Gaussian process model.",
label = "initial-gp-data"
- ) %>%
- kableExtra::kable_styling(full_width = FALSE) %>%
- kableExtra::add_header_above(c("outcome" = 1, "predictors" = 2))
+ )
```
Gaussian process models are specified by their mean and covariance functions, although the latter has the most effect on the nature of the GP model. The covariance function is often parameterized in terms of the input values (denoted as $x$). As an example, a commonly used covariance function is the squared exponential^[This equation is also the same as the _radial basis function_ used in kernel methods, such as the SVM model that is currently being used. This is a coincidence; this covariance function is unrelated to the SVM tuning parameter that we are using. ] function:
@@ -169,12 +166,10 @@ tmp %>%
mutate(variance = variance^2) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
- kable(
+ knitr::kable(
caption = "Two example tuning parameters considered for further sampling.",
label = "tuning-candidates"
- ) %>%
- kableExtra::kable_styling(full_width = FALSE) %>%
- kableExtra::add_header_above(c(" " = 1, "GP Prediction of ROC AUC" = 2))
+ )
```
:::rmdnote
@@ -331,12 +326,10 @@ small_pred %>%
  select(`Parameter Value` = x, Mean = .mean, `Std Dev` = .sd, `Expected Improvement` = exp_imp) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
- kable(
+ knitr::kable(
caption = "Expected improvement for the two candidate tuning parameters.",
label = "two-exp-improve"
- ) %>%
- kableExtra::kable_styling(full_width = FALSE) %>%
- kableExtra::add_header_above(c(" " = 1, "Predictions" = 3))
+ )
```
When expected improvement is computed across the range of the tuning parameter, the recommended point to sample is much closer to 0.25 than 0.10, as shown in Figure \@ref(fig:expected-improvement).
@@ -437,7 +430,7 @@ The `control` argument now uses the results of `control_bayes()`. Some helpful a
* `verbose` is a logical that will print logging information as the search proceeds.
-Let's use the first SVM results from Section \@ref(svm) as the initial substrate for the Gaussian process model. Recall that, for this application, we want to maximize the area under the ROC curve. Our code is:
+Let's use the first SVM results from the beginning of this chapter as the initial substrate for the Gaussian process model. Recall that, for this application, we want to maximize the area under the ROC curve. Our code is:
```{r iterative-cells-bo, eval = FALSE}
ctrl <- control_bayes(verbose = TRUE)
@@ -573,26 +566,199 @@ autoplot(svm_bo, type = "performance")
An additional type of plot uses `type = "parameters"`, which shows the parameter values over iterations.
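For example, using the `svm_bo` object from above:

```{r bo-param-plot-sketch, eval = FALSE}
# Parameter values (cost and rbf_sigma) plotted against the search iteration
autoplot(svm_bo, type = "parameters")
```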
-The animation below visualizes the results of the search. The black $\times$ values show the starting values contained in `svm_initial`. The top-left blue panel shows the predicted mean value of the area under the ROC curve. The red panel on the top-right displays the predicted variation in the ROC values while the bottom plot visualizes the expected improvement. In each panel, darker colors indicate less attractive values (e.g., small mean values, large variation, and small improvements).
-
-```{r iterative-bo-progress, include = FALSE}
-av_capture_graphics(
- make_bo_animation(gp_candidates, svm_bo),
- output = "bo_search.mp4",
- width = 760,
- height = 760,
- res = 100,
- vfilter = 'framerate=fps=10',
- framerate = 1/3
-)
+```{r iterative-bo-calcs, include = FALSE}
+bo_path <-
+ svm_bo %>%
+ collect_metrics() %>%
+ select(cost, rbf_sigma, .iter, mean)
+
+x_rng <- c(4.546377230472e-08, .2)
+y_rng <- c(0.000580667536622422, 53.8173705762377)
+
+initial <-
+ bo_path %>%
+ filter(.iter == 0)
+best_init <-
+ initial %>%
+ arrange(desc(mean)) %>%
+ slice(1)
+srch <-
+ best_init %>%
+ bind_rows(
+ bo_path %>%
+ filter(.iter > 0)
+ ) %>%
+ mutate(
+ next_cost = dplyr::lead(cost),
+ next_rbf_sigma = dplyr::lead(rbf_sigma)
+ )
+bo_base <-
+ bo_path %>%
+ ggplot(aes(x = rbf_sigma, y = cost)) +
+ # geom_raster(aes(fill = .mean)) +
+ scale_x_log10(labels = fmt_dcimals(2), limits = x_rng) +
+ scale_y_continuous(trans = "log2", labels = fmt_dcimals(2), limits = y_rng) +
+ geom_point(data = initial, col = "black", pch = 1) +
+ theme_bw() +
+ theme(
+ panel.grid.minor.x = element_blank(),
+ panel.grid.minor.y = element_blank(),
+ axis.text.y = element_text(size = 8),
+ axis.text.x = element_text(size = 8)
+ ) +
+ coord_fixed(ratio = 1/2.5)
+first_5 <- bo_base
+max_iter <- 5
+for (iter in 0:max_iter) {
+ first_5 <-
+ first_5 +
+ geom_segment(
+ data = srch %>% slice(iter + 1),
+ aes(xend = next_rbf_sigma, yend = next_cost),
+ arrow = grid::arrow(length = unit(0.04, "inches"), type = "closed"),
+ alpha = 1/2
+ )
+}
+first_5 <-
+ first_5 +
+ ggtitle("First 5 iterations")
+first_11 <- bo_base
+max_iter <- 11
+for (iter in 0:max_iter) {
+ first_11 <-
+ first_11 +
+ geom_segment(
+ data = srch %>% slice(iter + 1),
+ aes(xend = next_rbf_sigma, yend = next_cost),
+ arrow = grid::arrow(length = unit(0.04, "inches"), type = "closed"),
+ alpha = 1/2
+ )
+}
+first_11 <-
+ first_11 +
+ ggtitle("First 11 iterations") +
+ ylab(NULL) +
+ theme(
+ axis.title.y = element_blank(),
+ axis.text.y = element_blank(),
+ axis.ticks.y = element_blank()
+ )
+all_bo <- bo_base
+max_iter <- max(srch$.iter)
+for (iter in 0:max_iter) {
+ all_bo <-
+ all_bo +
+ geom_segment(
+ data = srch %>% slice(iter + 1),
+ aes(xend = next_rbf_sigma, yend = next_cost),
+ arrow = grid::arrow(length = unit(0.04, "inches"), type = "closed"),
+ alpha = 1/2
+ )
+}
+all_bo <-
+ all_bo +
+ ggtitle("All iterations") +
+ ylab(NULL) +
+ theme(
+ axis.title.y = element_blank(),
+ axis.text.y = element_blank(),
+ axis.ticks.y = element_blank()
+ )
+surf_mean <-
+ gp_candidates %>%
+ filter(.iter == 11) %>%
+ ggplot(aes(x = rbf_sigma, y = cost)) +
+ geom_raster(aes(fill = .mean)) +
+ scale_x_log10(labels = fmt_dcimals(2), limits = x_rng) +
+ scale_y_continuous(trans = "log2", labels = fmt_dcimals(2), limits = y_rng) +
+ scale_fill_distiller(palette = "Blues") +
+ theme_bw() +
+ theme(
+ legend.position = "none",
+ panel.grid.minor.x = element_blank(),
+ panel.grid.minor.y = element_blank(),
+ axis.text.y = element_text(size = 8),
+ axis.text.x = element_text(size = 8)
+ ) +
+ labs(title = "Mean") +
+ coord_fixed(ratio = 1/2.5)
+surf_sd <-
+ gp_candidates %>%
+ filter(.iter == 11) %>%
+ ggplot(aes(x = rbf_sigma, y = cost)) +
+ geom_raster(aes(fill = -.sd)) +
+ scale_x_log10(labels = fmt_dcimals(2), limits = x_rng) +
+ scale_y_continuous(trans = "log2", labels = fmt_dcimals(2), limits = y_rng) +
+ scale_fill_distiller(palette = "Reds") +
+ labs(title = "Variance") +
+ coord_fixed(ratio = 1/2.5) +
+ ylab(NULL) +
+ theme_bw() +
+ theme(
+ legend.position = "none",
+ axis.title.y = element_blank(),
+ axis.text.y = element_blank(),
+ axis.ticks.y = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ panel.grid.minor.y = element_blank(),
+ axis.text.x = element_text(size = 8)
+ )
+surf_impr <-
+ gp_candidates %>%
+ filter(.iter == 11) %>%
+ ggplot(aes(x = rbf_sigma, y = cost)) +
+ geom_raster(aes(fill = log(objective + 0.00001))) +
+ scale_x_log10(labels = fmt_dcimals(2), limits = x_rng) +
+ scale_y_continuous(trans = "log2", labels = fmt_dcimals(2), limits = y_rng) +
+ scale_fill_gradientn(colours = rev(scales::brewer_pal(palette = "RdPu")(3))) +
+ labs(title = "Expected Improvement") +
+ coord_fixed(ratio = 1/2.5) +
+ ylab(NULL) +
+ theme_bw() +
+ theme(
+ legend.position = "none",
+ axis.title.y = element_blank(),
+ axis.text.y = element_blank(),
+ axis.ticks.y = element_blank(),
+ panel.grid.minor.x = element_blank(),
+ panel.grid.minor.y = element_blank(),
+ axis.text.x = element_text(size = 8)
+ )
+
+# These are based off of all the tuning grids used in the chapter
+x_rng <- c(4.546377230472e-08, 1.54534400806564)
+y_rng <- c(0.000580667536622422, 53.8173705762377)
```
-
+Figure \@ref(fig:bo-surfaces) shows the mean, variance, and expected improvement surfaces estimated by the GP after 11 iterations. The panel on the right shows a ridge of best estimated improvement along the right side of the candidate space.
+```{r bo-surfaces}
+#| echo = FALSE,
+#| message = FALSE,
+#| warning = FALSE,
+#| fig.width = 9,
+#| fig.height = 4,
+#| out.width = "100%",
+#| fig.cap = "Heat maps of the predicted mean RMSE (left), variance of RMSE (middle), and the expected improvement (right) after 11 search iterations.",
+#| fig.alt = "Heat maps of the predicted mean RMSE (left), variance of RMSE (middle), and the expected improvement (right) after 11 search iterations. The means surface correctly reflects that the best results are near the upper right of the parameter space. The variance patterns show low variance at existing parameter combinations. The expected improvement surface, at this point, is a narrow ridge going form high to low in the cost dimension along higher levels of the kernel function parameter."
+surf_mean + surf_sd + surf_impr
+```
-The surface of the predicted mean surface is very inaccurate in the first few iterations of the search. Despite this, it does help guide the process to the region of good performance. In other words, the Gaussian process model is wrong but shows itself to be very useful. Within the first ten iterations, the search is sampling near the optimum location.
+Figure \@ref(fig:bo-search) shows the search process at three different points in the optimization.
+
+```{r bo-search}
+#| echo = FALSE,
+#| message = FALSE,
+#| warning = FALSE,
+#| fig.width = 9,
+#| fig.height = 4,
+#| out.width = "100%",
+#| fig.cap = "The Bayesian optimization search path after 1, 11, and 25 iterations.",
+#| fig.alt = "The Bayesian optimization search path after 1, 11, and 25 iterations. Initially the search goes in a poor direction before approaching the region of best results. By eleven iterations, the search has focused on the location of the truly optimal results and has probed more extremest directions. By the end, the search focuses on the best area or probes outlying areas, especially at the bounds of the parameter space."
+first_5 + first_11 + all_bo
+```
+
+During the first five iterations, the search initially moves in a poor direction but quickly turns toward better results. The middle panel shows the first eleven iterations, where the process investigates the region of truly optimal results with a short foray to the bottom-right boundary of the candidate space. The remaining iterations, shown in the panel on the right, switch between the region of best results and the far borders of the search space.
While the best tuning parameter combination is on the boundary of the parameter space, Bayesian optimization will often choose new points on other sides of the boundary. While we can adjust the ratio of exploration and exploitation, the search tends to sample boundary points early on.
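The balance between exploration and exploitation is set through the acquisition function. A hedged sketch, reusing `svm_wflow` and `svm_initial` from this chapter and assuming a resampling object named `cell_folds`, might look like:

```{r bo-tradeoff-sketch, eval = FALSE}
# Larger trade_off values favor exploration (sampling uncertain regions)
# over exploitation of the current best mean prediction.
svm_bo_explore <-
  svm_wflow %>%
  tune_bayes(
    resamples = cell_folds,
    initial = svm_initial,
    iter = 25,
    objective = exp_improve(trade_off = 0.1),
    control = control_bayes(verbose = TRUE)
  )
```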
@@ -624,7 +790,6 @@ How are the acceptance probabilities influenced? The heatmap in Figure \@ref(fig
```{r acceptance-prob}
#| echo = FALSE,
-#| dev = "png",
#| fig.height = 4.5,
#| out.width = "80%",
#| fig.cap = "Heatmap of the simulated annealing acceptance probabilities for different coefficient values",
@@ -922,25 +1087,177 @@ autoplot(svm_sa, type = "performance")
autoplot(svm_sa, type = "parameters")
```
-A visualization of the search path helps to understand where the search process did well and where it went astray:
+As with `tune_bayes()`, if execution is stopped manually, the completed iterations are returned.
-```{r iterative-sa-plot, include = FALSE}
-av_capture_graphics(
- sa_2d_plot(svm_sa, result_history, svm_large),
- output = "sa_search.mp4",
- width = 720,
- height = 720,
- res = 120,
- vfilter = 'framerate=fps=10',
- framerate = 1/3
-)
-```
+A visualization of the search path helps to understand where the search process did well and where it went astray. Figure \@ref(fig:sa-plot) illustrates several "phases" of the optimization; these are separated by a restart of the process at the last best results.
-
+```{r sa-plot}
+#| echo = FALSE,
+#| message = FALSE,
+#| warning = FALSE,
+#| fig.width = 10,
+#| fig.height = 7,
+#| out.width = "90%",
+#| fig.cap = "A visualization of different phases of the simulated annealing search.",
+#| fig.alt = "A visualization of different phases of the simulated annealing search. Each portion of the search has many 'dead end paths' that either have immediate poor results or have several iterations before a restart is required. After four restarts, the search finds itself in a region of optimal results."
+history <- result_history %>% add_rowindex()
+params <-
+ svm_sa %>%
+ collect_metrics() %>%
+ select(.iter, cost, rbf_sigma, mean) %>%
+ arrange(.iter)
+initial <-
+ params %>%
+ filter(.iter == 0)
+sa_path <- function(branch = 1, y_axis = TRUE) {
+
+ # ------------------------------------------------------------------------------
+ # Plot before SA optimization
+
+ base_plot <-
+ params %>%
+ ggplot(aes(x = rbf_sigma, y = cost)) +
+ scale_x_log10(labels = fmt_dcimals(2), limits = x_rng) +
+ scale_y_continuous(trans = "log2", labels = fmt_dcimals(2), limits = y_rng) +
+ coord_fixed(ratio = .5)
+
+ sa_plot <-
+ base_plot +
+ geom_point(data = initial, col = "black", pch = 1) +
+ theme_bw()
+
+ # ----------------------------------------------------------------------------
+ # Setup data for the requested path. Determine which rows should be used
+ # (based on the branch argument) by determining restart locations.
+
+ all_restr <- grep("restart", result_history$results)
+ if (branch <= length(all_restr)) {
+ row_limit <- all_restr[branch]
+ } else {
+ row_limit <- nrow(result_history)
+ }
+
+ sa_data <-
+ result_history %>%
+ add_rowindex() %>%
+ filter(.row <= row_limit) %>%
+ select(cost, rbf_sigma, results, .row) %>%
+ mutate(
+ best = NA_integer_,
+ branch_ind = NA_integer_,
+ next_row = dplyr::lead(.row), # TODO maybe don't use these
+ next_cost = dplyr::lead(cost),
+ next_rbf_sigma = dplyr::lead(rbf_sigma)
+ )
+
+ # Mark where the new global best results occur to add a column (includes the
+ # initial results)
+ restr <- grep("restart", sa_data$results)
+ bests <- c(which.max(initial$mean), grep("new best", history$results))
+
+ # Loop through the data to set the best results and also count the branches
+ branch_num <- 1
+ for (i in 4:nrow(sa_data)) {
+ prev_best <- max(bests[bests <= i])
+ sa_data$best[i] <- prev_best
+ sa_data$branch_ind[i] <- branch_num
+ if (sa_data$results[i] == "restart from best") {
+ branch_num <- branch_num + 1
+ }
+ }
+
+ # Remove previous branches (if any) as if they did not occur. This means
+ # eliminating those rows from previous branches that were not new global best.
+ # Re-number rows
+ if (branch > 1) {
+ removals <-
+ sa_data %>%
+ filter(branch_ind < branch & !(results %in% c("initial", "new best"))) %>%
+ select(.row)
+ sa_data <-
+ sa_data %>%
+ anti_join(removals, by = ".row") %>%
+ add_rowindex()
+ }
+
+ last_accepted <- which.max(initial$mean)
+ last_best <- last_accepted
+
+ for (i in 5:nrow(sa_data)) {
+ dat_start <-
+ sa_data %>%
+ slice(last_accepted) %>%
+ select(cost, rbf_sigma)
+
+ if (sa_data$results[i] == "new best") {
+ # The current row is accepted and is globally optimal
+ plot_col <- "black"
+ last_accepted <- i
+ last_best <- i
+
+ } else if (sa_data$results[i] %in% c("accept suboptimal", "better suboptimal")) {
+ # The current row is accepted. Color blue since it is eliminated with restart
+ plot_col <- "blue"
+ last_accepted <- i
+ } else if (sa_data$results[i] %in% c("discard suboptimal")) {
+
+ plot_col <- rgb(0, 0, 0, .4)
+ } else if (sa_data$results[i] %in% c("restart from best")) {
+ plot_col <- rgb(0, 0, 0, .4)
+
+      # Restart goes to the previous global best
+ last_accepted <- last_best
+ }
+
+ dat_plot <-
+ sa_data %>%
+ slice(i) %>%
+ select(next_cost = cost, next_rbf_sigma = rbf_sigma) %>%
+ bind_cols(dat_start)
+
+ sa_plot <-
+ sa_plot +
+ geom_segment(
+ data = dat_plot,
+ aes(xend = next_rbf_sigma, yend = next_cost),
+ # arrow = grid::arrow(length = unit(0.1, "inches")),
+ col = plot_col
+ )
+ }
+ sa_plot <-
+ sa_plot +
+ geom_point(data = sa_data %>% filter(results == "new best"),
+ cex = 1) +
+ theme(
+ panel.grid.minor.x = element_blank(),
+ panel.grid.minor.y = element_blank(),
+ axis.text.y = element_text(size = 8),
+ axis.text.x = element_text(size = 8)
+ )
+
+ if(!y_axis) {
+ sa_plot <-
+ sa_plot +
+ ylab(NULL) +
+ theme(
+ axis.title.y = element_blank(),
+ axis.text.y = element_blank(),
+ axis.ticks.y = element_blank(),
+ axis.text.x = element_text(size = 8)
+ )
+ }
+ sa_plot
+}
+sa_1 <- sa_path(1, TRUE) + ggtitle("Phase 1")
+sa_2 <- sa_path(2, FALSE) + ggtitle("Phase 2")
+sa_3 <- sa_path(3, FALSE) + ggtitle("Phase 3")
+sa_4 <- sa_path(4, TRUE) + ggtitle("Phase 4")
+sa_5 <- sa_path(5, FALSE) + ggtitle("Phase 5")
-Like `tune_bayes()`, manually stopping execution will return the completed iterations.
+sa_1 + sa_2 + sa_3 + sa_4 + sa_5 + plot_layout(ncol = 3)
+```
+
+In the first phase, the search initially finds two new global optima (shown with the solid points). From these, there are several settings that are immediately discarded (light gray lines) while others are suboptimal but acceptable. After a set number of failures, it restarts at the last solid point. The other phases show a slow improvement in global optima with many discarded settings along the way. The process eventually finds its way to the region of optimal results as it exhausts the total number of allowed iterations.
## Chapter Summary {#iterative-summary}
diff --git a/15-workflow-sets.Rmd b/15-workflow-sets.Rmd
index 91ddd86d..ef3e45d5 100644
--- a/15-workflow-sets.Rmd
+++ b/15-workflow-sets.Rmd
@@ -30,7 +30,7 @@ For projects with new data sets that have not yet been well understood, a data p
A good strategy is to spend some initial effort trying a variety of modeling approaches, determine what works best, then invest additional time tweaking/optimizing a small set of models.
:::
-Workflow sets provide a user interface to create and manage this process. We'll also demonstrate how to evaluate these models efficiently using the racing methods discussed in Section \@ref(racing-example).
+Workflow sets provide a user interface to create and manage this process. We'll also demonstrate how to evaluate these models efficiently using the racing methods discussed later in this chapter.
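As a quick sketch of that interface (the preprocessor and model objects named here are placeholders, not the ones defined later in this chapter), a workflow set crosses a list of preprocessors with a list of model specifications:

```{r workflow-set-sketch, eval = FALSE}
library(tidymodels)

# Each preprocessor/model combination becomes one workflow in the set
all_workflows <-
  workflow_set(
    preproc = list(simple = basic_recipe, quadratic = quad_recipe),
    models  = list(cart = cart_spec, rf = rf_spec, nnet = nnet_spec)
  )
```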
## Modeling Concrete Mixture Strength
@@ -311,7 +311,7 @@ autoplot(
select_best = TRUE # <- one point per workflow
) +
geom_text(aes(y = mean - 1/2, label = wflow_id), angle = 90, hjust = 1) +
- lims(y = c(3.5, 9.5)) +
+ lims(y = c(3.0, 9.5)) +
theme(legend.position = "none")
```
@@ -345,7 +345,7 @@ The example model screening with our concrete mixture data fits a total of `r fo
## Efficiently Screening Models {#racing-example}
-One effective method for screening a large set of models efficiently is to use the racing approach described in Section \@ref(racing). With a workflow set, we can use the `workflow_map()` function for this racing approach. Recall that after we pipe in our workflow set, the argument we use is the function to apply to the workflows; in this case, we can use a value of `"tune_race_anova"`. We also pass an appropriate control object; otherwise the options would be the same as the code in the previous section.
+One effective method for screening a large set of models efficiently is to use the racing approach described in Chapter \@ref(grid-search). With a workflow set, we can use the `workflow_map()` function for this racing approach. Recall that after we pipe in our workflow set, the argument we use is the function to apply to the workflows; in this case, we can use a value of `"tune_race_anova"`. We also pass an appropriate control object; otherwise the options would be the same as the code in the previous section.
```{r workflow-sets-race, eval = FALSE}
diff --git a/16-dimensionality-reduction.Rmd b/16-dimensionality-reduction.Rmd
index 1d058b9a..369b4185 100644
--- a/16-dimensionality-reduction.Rmd
+++ b/16-dimensionality-reduction.Rmd
@@ -40,10 +40,10 @@ This chapter has two goals:
* Demonstrate how to use recipes to create a small set of features that capture the main aspects of the original predictor set.
- * Describe how recipes can be used on their own (as opposed to being used in a workflow object, as in Section \@ref(using-recipes)).
+ * Describe how recipes can be used on their own (as opposed to being used in a workflow object, as in Chapter \@ref(recipes)).
:::
-The latter is helpful when testing or debugging a recipe. However, as described in Section \@ref(using-recipes), the best way to use a recipe for modeling is from within a workflow object.
+The latter is helpful when testing or debugging a recipe. However, as described in Chapter \@ref(recipes), the best way to use a recipe for modeling is from within a workflow object.
In addition to the `r pkg(tidymodels)` package, this chapter uses the following packages: `r pkg(baguette)`, `r pkg(beans)`, `r pkg(bestNormalize)`, `r pkg(corrplot)`, `r pkg(discrim)`, `r pkg(embed)`, `r pkg(ggforce)`, `r pkg(klaR)`, `r pkg(learntidymodels)`,[^learnnote] `r pkg(mixOmics)`,[^mixnote] and `r pkg(uwot)`.
@@ -159,7 +159,7 @@ This recipe will be extended with additional steps for the dimensionality reduct
## Recipes in the Wild {#recipe-functions}
-As mentioned in Section \@ref(using-recipes), a workflow containing a recipe uses `fit()` to estimate the recipe and model, then `predict()` to process the data and make model predictions. There are analogous functions in the `r pkg(recipes)` package that can be used for the same purpose:
+As mentioned in Chapter \@ref(recipes), a workflow containing a recipe uses `fit()` to estimate the recipe and model, then `predict()` to process the data and make model predictions. There are analogous functions in the `r pkg(recipes)` package that can be used for the same purpose:
* `prep(recipe, training)` fits the recipe to the training set.
* `bake(recipe, new_data)` applies the recipe operations to `new_data`.
@@ -271,23 +271,33 @@ We will use `prep()` and `bake()` in the next section to illustrate some of thes
## Feature Extraction Techniques
-Since recipes are the primary option in tidymodels for dimensionality reduction, let's write a function that will estimate the transformation and plot the resulting data in a scatter plot matrix via the `r pkg(ggforce)` package:
+Since recipes are the primary option in tidymodels for dimensionality reduction, let's write a function that will estimate the transformation and plot the resulting data:
```{r dimensionality-function}
-library(ggforce)
plot_validation_results <- function(recipe, dat = assessment(bean_val$splits[[1]])) {
- recipe %>%
+ set.seed(1)
+ plot_data <-
+ recipe %>%
# Estimate any additional steps
prep() %>%
# Process the data (the validation set by default)
- bake(new_data = dat) %>%
- # Create the scatterplot matrix
- ggplot(aes(x = .panel_x, y = .panel_y, color = class, fill = class)) +
- geom_point(alpha = 0.4, size = 0.5) +
- geom_autodensity(alpha = .3) +
- facet_matrix(vars(-class), layer.diag = 2) +
- scale_color_brewer(palette = "Dark2") +
- scale_fill_brewer(palette = "Dark2")
+ bake(new_data = dat, all_predictors(), all_outcomes()) %>%
+ # Sample the data down to be more readable
+ sample_n(250)
+
+ # Convert feature names to symbols to use with quasiquotation
+ nms <- names(plot_data)
+ x_name <- sym(nms[1])
+ y_name <- sym(nms[2])
+
+ plot_data %>%
+ ggplot(aes(x = !!x_name, y = !!y_name, col = class,
+ fill = class, pch = class)) +
+ geom_point(alpha = 0.9) +
+ scale_shape_manual(values = 1:7) +
+ # Make equally sized axes
+ coord_obs_pred() +
+ theme_bw()
}
```
@@ -307,10 +317,9 @@ bean_rec_trained %>%
```
```{r bean-pca, ref.label = "dimensionality-pca"}
-#| dev = "png",
#| echo = FALSE,
#| fig.height = 7,
-#| fig.cap = "Principal component scores for the bean validation set, colored by class",
+#| fig.cap = "First two principal component scores for the bean validation set, colored by class",
#| fig.alt = "Principal component scores for the bean validation set, colored by class. The classes separate when the first two components are plotted against one another."
```
@@ -350,10 +359,9 @@ bean_rec_trained %>%
```
```{r bean-pls, ref.label = "dimensionality-pls"}
-#| dev = "png",
#| fig.height = 7,
#| echo = FALSE,
-#| fig.cap = "PLS component scores for the bean validation set, colored by class",
+#| fig.cap = "First two PLS component scores for the bean validation set, colored by class",
#| fig.alt = "PLS component scores for the bean validation set, colored by class. The first two PLS components are nearly identical to the first two PCA components."
```
@@ -388,10 +396,9 @@ bean_rec_trained %>%
```
```{r bean-ica, ref.label = "dimensionality-ica"}
-#| dev = "png",
#| echo = FALSE,
#| fig.height = 7,
-#| fig.cap = "ICA component scores for the bean validation set, colored by class",
+#| fig.cap = "First two ICA component scores for the bean validation set, colored by class",
#| fig.alt = "ICA component scores for the bean validation set, colored by class. There is significant overlap in the first two ICA components."
```
@@ -413,15 +420,7 @@ bean_rec_trained %>%
ggtitle("UMAP")
```
-```{r bean-umap, ref.label = "dimensionality-umap"}
-#| dev = "png",
-#| echo = FALSE,
-#| fig.height = 7,
-#| fig.cap = "UMAP component scores for the bean validation set, colored by class",
-#| fig.alt = "UMAP component scores for the bean validation set, colored by class. There is significant overlap in the first two ICA components."
-```
-
-While the between-cluster space is pronounced, the clusters can contain a heterogeneous mixture of classes.
+The resulting plot is shown on the left-hand side of Figure \@ref(fig:bean-umap). While the between-cluster space is pronounced, the clusters can contain a heterogeneous mixture of classes.
There is also a supervised version of UMAP:
@@ -432,15 +431,32 @@ bean_rec_trained %>%
ggtitle("UMAP (supervised)")
```
-```{r bean-umap-supervised, ref.label = "dimensionality-umap-supervised"}
-#| dev = "png",
+```{r bean-umap}
#| echo = FALSE,
-#| fig.height = 7,
-#| fig.cap = "Supervised UMAP component scores for the bean validation set, colored by class",
-#| fig.alt = "Supervised UMAP component scores for the bean validation set, colored by class. There is significant overlap in the first two ICA components."
+#| fig.height = 5,
+#| fig.width = 10.1,
+#| fig.cap = "The first two UMAP component scores for the bean validation set, colored by class. Results are shown for supervised and unsupervised versions.",
+#| fig.alt = "The first two UMAP component scores for the bean validation set, colored by class. Results are shown for supervised and unsupervised versions. There are clusters that are extremely separated form one another but each contains a mixture of the classes. The supervised version shows more separation between classes."
+
+umap_1 <-
+ bean_rec_trained %>%
+ step_umap(all_numeric_predictors(), num_comp = 4) %>%
+ plot_validation_results() +
+ ggtitle("UMAP")
+
+umap_2 <-
+ bean_rec_trained %>%
+ step_umap(all_numeric_predictors(), outcome = "class", num_comp = 4) %>%
+ plot_validation_results() +
+ ggtitle("UMAP (supervised)") +
+ theme(legend.position = "none") +
+ labs(y = NULL)
+
+umap_1 + umap_2
```
-The supervised method shown in Figure \@ref(fig:bean-umap-supervised) looks promising for modeling the data.
+
+The supervised method shown in Figure \@ref(fig:bean-umap) looks promising for modeling the data.
UMAP is a powerful method to reduce the feature space. However, it can be very sensitive to tuning parameters (e.g., the number of neighbors and so on). For this reason, it would help to experiment with a few of the parameters to assess how robust the results are for these data.
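One way to set up such an experiment is sketched below; the choice to tune `neighbors` and `min_dist` is illustrative, and the resulting recipe would be combined with a model in a workflow and passed to a tuning function:

```{r umap-tune-sketch, eval = FALSE}
# Mark UMAP's main tuning parameters for optimization
bean_rec_trained %>%
  step_umap(
    all_numeric_predictors(),
    outcome = "class",
    num_comp = 4,
    neighbors = tune(),
    min_dist = tune()
  )
```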
diff --git a/17-encoding-categorical-data.Rmd b/17-encoding-categorical-data.Rmd
index 2e443075..279f0b61 100644
--- a/17-encoding-categorical-data.Rmd
+++ b/17-encoding-categorical-data.Rmd
@@ -2,7 +2,6 @@
library(tidymodels)
library(embed)
library(textrecipes)
-library(kableExtra)
tidymodels_prefer()
source("ames_snippets.R")
@@ -11,7 +10,7 @@ neighborhood_counts <- count(ames_train, Neighborhood)
# Encoding Categorical Data {#categorical}
-For statistical modeling in R, the preferred representation for categorical or nominal data is a _factor_, which is a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels.[^python] In Section \@ref(dummies) we introduced feature engineering approaches to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the `Bldg_Type` in our Ames housing data (with levels `r knitr::combine_words(glue::backtick(levels(ames_train$Bldg_Type)))`), to a set of dummy or indicator variables like those shown in Table \@ref(tab:encoding-dummies).
+For statistical modeling in R, the preferred representation for categorical or nominal data is a _factor_, which is a variable that can take on a limited number of different values; internally, factors are stored as a vector of integer values together with a set of text labels.[^python] In Chapter \@ref(recipes) we introduced feature engineering approaches to encode or transform qualitative or nominal data into a representation better suited for most model algorithms. We discussed how to transform a categorical variable, such as the `Bldg_Type` in our Ames housing data (with levels `r knitr::combine_words(glue::backtick(levels(ames_train$Bldg_Type)))`), to a set of dummy or indicator variables like those shown in Table \@ref(tab:encoding-dummies).
[^python]: This is in contrast to statistical modeling in Python, where categorical variables are often directly represented by integers alone, such as `0, 1, 2` representing red, blue, and green.
@@ -30,9 +29,8 @@ recipe(~Bldg_Type, data = ames_train) %>%
bake(ames_train) %>%
slice(show_rows) %>%
arrange(`Raw Data`) %>%
- kable(caption = "Dummy or indicator variable encodings for the building type predictor in the Ames training set.",
- label = "encoding-dummies") %>%
- kable_styling(full_width = FALSE)
+ knitr::kable(caption = "Dummy or indicator variable encodings for the building type predictor in the Ames training set.",
+ label = "encoding-dummies")
```
Many model implementations require such a transformation to a numeric representation for categorical data.
@@ -68,9 +66,8 @@ ord_contrasts <-
setNames(c("Linear", "Quadratic", "Cubic", "Quartic"))
bind_cols(ord_data, ord_contrasts) %>%
- kable(caption = "Polynominal expansions for encoding an ordered variable.",
- label = "encoding-ordered-table") %>%
- kable_styling(full_width = FALSE)
+  knitr::kable(caption = "Polynomial expansions for encoding an ordered variable.",
+ label = "encoding-ordered-table")
```
While this is not unreasonable, it is not an approach that people tend to find useful. For example, an 11-degree polynomial is probably not the most effective way of encoding an ordinal factor for the months of the year. Instead, consider trying recipe steps related to ordered factors, such as `step_unorder()`, to convert to regular factors, and `step_ordinalscore()`, which maps specific numeric values to each factor level.
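As a brief sketch (the `month` column and `example_data` are hypothetical), these steps can be used as follows:

```{r ordered-steps-sketch, eval = FALSE}
# Convert the ordered factor to a regular factor, then make indicator columns
recipe(outcome ~ month, data = example_data) %>%
  step_unorder(month) %>%
  step_dummy(month)

# Or map each level to a single numeric score
recipe(outcome ~ month, data = example_data) %>%
  step_ordinalscore(month)
```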
@@ -110,7 +107,7 @@ ames_glm <-
ames_glm
```
-As detailed in Section \@ref(recipe-functions), we can `prep()` our recipe to fit or estimate parameters for the preprocessing transformations using training data. We can then `tidy()` this prepared recipe to see the results:
+As detailed in Chapter \@ref(dimensionality), we can `prep()` our recipe to fit or estimate parameters for the preprocessing transformations using training data. We can then `tidy()` this prepared recipe to see the results.
```{r}
glm_estimates <-
@@ -202,7 +199,7 @@ Notice in Figure \@ref(fig:encoding-compare-pooling) that most estimates for nei
## Feature Hashing
-Traditional dummy variables as described in Section \@ref(dummies) require that all of the possible categories be known to create a full set of numeric features. _Feature hashing_ methods [@weinberger2009feature] also create dummy variables, but only consider the value of the category to assign it to a predefined pool of dummy variables. Let's look at the `Neighborhood` values in Ames again and use the `rlang::hash()` function to understand more:
+Traditional dummy variables as described in Chapter \@ref(recipes) require that all of the possible categories be known to create a full set of numeric features. _Feature hashing_ methods [@weinberger2009feature] also create dummy variables, but only consider the value of the category to assign it to a predefined pool of dummy variables. Let's look at the `Neighborhood` values in Ames again and use the `rlang::hash()` function to understand more.
```{r}
library(rlang)
@@ -269,11 +266,10 @@ hash_table <-
count(value)
hash_table %>%
- kable(col.names = c("Number of neighborhoods within a hash feature",
- "Number of occurrences"),
- caption = "The number of hash features at each number of neighborhoods.",
- label = "encoding-hash") %>%
- kable_styling(full_width = FALSE)
+ knitr::kable(col.names = c("Number of neighborhoods within a hash feature",
+ "Number of occurrences"),
+ caption = "The number of hash features at each number of neighborhoods.",
+ label = "encoding-hash")
```
The number of neighborhoods mapped to each hash value varies between `r xfun::numbers_to_words(min(hash_table$value))` and `r xfun::numbers_to_words(max(hash_table$value))`. All of the hash values greater than one are examples of hash collisions.
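For reference, the recipe step that produces such a fixed pool of hashed indicator columns is `step_dummy_hash()` from `r pkg(textrecipes)`; the pool size below is an arbitrary choice for illustration:

```{r feature-hash-sketch, eval = FALSE}
library(textrecipes)

recipe(Sale_Price ~ Neighborhood, data = ames_train) %>%
  step_dummy_hash(Neighborhood, signed = FALSE, num_terms = 16L) %>%
  prep() %>%
  bake(new_data = NULL)
```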
diff --git a/18-explaining-models-and-predictions.Rmd b/18-explaining-models-and-predictions.Rmd
index de6d06b8..f6415ed0 100644
--- a/18-explaining-models-and-predictions.Rmd
+++ b/18-explaining-models-and-predictions.Rmd
@@ -29,7 +29,7 @@ lm_fit <- lm_wflow %>% fit(data = ames_train)
# Explaining Models and Predictions {#explain}
-In Section \@ref(model-types), we outlined a taxonomy of models and suggested that models typically are built as one or more of descriptive, inferential, or predictive. We suggested that model performance, as measured by appropriate metrics (like RMSE for regression or area under the ROC curve for classification), can be important for all modeling applications. Similarly, model explanations, answering _why_ a model makes the predictions it does, can be important whether the purpose of your model is largely descriptive, to test a hypothesis, or to make a prediction. Answering the question "why?" allows modeling practitioners to understand which features were important in predictions and even how model predictions would change under different values for the features. This chapter covers how to ask a model why it makes the predictions it does.
+In Chapter \@ref(software-modeling), we outlined a taxonomy of models and suggested that models typically are built as one or more of descriptive, inferential, or predictive. We suggested that model performance, as measured by appropriate metrics (like RMSE for regression or area under the ROC curve for classification), can be important for all modeling applications. Similarly, model explanations, answering _why_ a model makes the predictions it does, can be important whether the purpose of your model is largely descriptive, to test a hypothesis, or to make a prediction. Answering the question "why?" allows modeling practitioners to understand which features were important in predictions and even how model predictions would change under different values for the features. This chapter covers how to ask a model why it makes the predictions it does.
For some models, like linear regression, it is usually clear how to explain why the model makes its predictions. The structure of a linear model contains coefficients for each predictor that are typically straightforward to interpret. For other models, like random forests that can capture nonlinear behavior by design, it is less transparent how to explain the model's predictions from only the structure of the model itself. Instead, we can apply model explainer algorithms to generate understanding of predictions.
@@ -106,7 +106,7 @@ Dealing with significant feature engineering transformations during model explai
## Local Explanations
-Local model explanations provide information about a prediction for a single observation. For example, let's consider an older duplex in the North Ames neighborhood (Section \@ref(exploring-features-of-homes-in-ames)):
+Local model explanations provide information about a prediction for a single observation. For example, let's consider an older duplex in the North Ames neighborhood (Chapter \@ref(ames)).
```{r explain-duplex}
duplex <- vip_train[120,]
@@ -335,7 +335,7 @@ ggplot_pdp <- function(obj, x) {
num_colors <- n_distinct(obj$agr_profiles$`_label_`)
if (num_colors > 1) {
- p <- p + geom_line(aes(color = `_label_`), size = 1.2, alpha = 0.8)
+ p <- p + geom_line(aes(color = `_label_`, lty = `_label_`), size = 1.2)
} else {
p <- p + geom_line(color = "midnightblue", size = 1.2, alpha = 0.8)
}
@@ -372,7 +372,7 @@ ggplot_pdp(pdp_liv, Gr_Liv_Area) +
scale_color_brewer(palette = "Dark2") +
labs(x = "Gross living area",
y = "Sale Price (log)",
- color = NULL)
+ color = NULL, lty = NULL)
```
This code produces Figure \@ref(fig:building-type-profiles), where we see that sale price increases the most between about 1,000 and 3,000 square feet of living area, and that different home types (like single family homes or different types of townhouses) mostly exhibit similar increasing trends in price with more living space.
diff --git a/19-when-should-you-trust-predictions.Rmd b/19-when-should-you-trust-predictions.Rmd
index 335d8dd7..4457a8ba 100644
--- a/19-when-should-you-trust-predictions.Rmd
+++ b/19-when-should-you-trust-predictions.Rmd
@@ -215,7 +215,7 @@ Using the standard error as a measure to preclude samples from being predicted c
## Determining Model Applicability {#applicability-domains}
-Equivocal zones try to measure the reliability of a prediction based on the model outputs. It may be that model statistics, such as the standard error of prediction, cannot measure the impact of extrapolation, and so we need another way to assess whether to trust a prediction and answer, "Is our model applicable for predicting a specific data point?" Let's take the Chicago train data used extensively in [Kuhn and Johnson (2019)](https://bookdown.org/max/FES/chicago-intro.html) and first shown in Section \@ref(examples-of-tidyverse-syntax). The goal is to predict the number of customers entering the Clark and Lake train station each day.
+Equivocal zones try to measure the reliability of a prediction based on the model outputs. It may be that model statistics, such as the standard error of prediction, cannot measure the impact of extrapolation, and so we need another way to assess whether to trust a prediction and answer, "Is our model applicable for predicting a specific data point?" Let's take the Chicago train data used extensively in [Kuhn and Johnson (2019)](https://bookdown.org/max/FES/chicago-intro.html) and first shown in Chapter \@ref(tidyverse). The goal is to predict the number of customers entering the Clark and Lake train station each day.
The data set in the `r pkg(modeldata)` package (a tidymodels package with example data sets) has daily values between `r format(min(Chicago$date), "%B %d, %Y")` and `r format(max(Chicago$date), "%B %d, %Y")`. Let's create a small test set using the last two weeks of the data:
@@ -231,7 +231,7 @@ Chicago_train <- Chicago %>% slice(1:(n - 14))
Chicago_test <- Chicago %>% slice((n - 13):n)
```
-The main predictors are lagged ridership data at different train stations, including Clark and Lake, as well as the date. The ridership predictors are highly correlated with one another. In the following recipe, the date column is expanded into several new features, and the ridership predictors are represented using partial least squares (PLS) components. PLS [@Geladi:1986], as we discussed in Section \@ref(partial-least-squares), is a supervised version of principal component analysis where the new features have been decorrelated but are predictive of the outcome data.
+The main predictors are lagged ridership data at different train stations, including Clark and Lake, as well as the date. The ridership predictors are highly correlated with one another. In the following recipe, the date column is expanded into several new features, and the ridership predictors are represented using partial least squares (PLS) components. PLS [@Geladi:1986], as we discussed in Chapter \@ref(dimensionality), is a supervised version of principal component analysis where the new features have been decorrelated but are predictive of the outcome data.
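As a rough illustration of the kind of recipe that paragraph describes (a sketch only, not the book's exact preprocessing; it assumes the `Chicago` data from the modeldata package and that the mixOmics package, which `step_pls()` needs at `prep()` time, is installed):

```r
library(tidymodels)
data(Chicago, package = "modeldata")

chicago_rec <-
  recipe(ridership ~ ., data = Chicago) %>%
  # expand the date column into calendar-based features, then drop it
  step_date(date, features = c("dow", "month", "year")) %>%
  step_rm(date) %>%
  # center and scale the highly correlated lagged-ridership predictors
  step_normalize(all_numeric_predictors()) %>%
  # compress them into a handful of supervised PLS components
  step_pls(all_numeric_predictors(), outcome = "ridership", num_comp = 10)
```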
Using the preprocessed data, we fit a standard linear model:
diff --git a/20-ensemble-models.Rmd b/20-ensemble-models.Rmd
index 16f8ba3c..b0d2aa26 100644
--- a/20-ensemble-models.Rmd
+++ b/20-ensemble-models.Rmd
@@ -5,7 +5,6 @@ library(rules)
library(baguette)
library(stacks)
library(patchwork)
-library(kableExtra)
load("RData/concrete_results.RData")
```
@@ -71,10 +70,7 @@ stacks() %>%
"...", "Cubist 25", "..."),
caption = "Predictions from candidate tuning parameter configurations.",
label = "ensemble-candidate-preds"
- ) %>%
- kable_styling("striped", full_width = TRUE) %>%
- add_header_above(c(" ", "Ensemble Candidate Predictions" = 7)) %>%
- row_spec(0, align = "c")
+ )
```
There is a single column for the bagged tree model since it has no tuning parameters. Also, recall that MARS was tuned over a single parameter (the product degree) with two possible configurations, so this model is represented by two columns. Most of the other models have 25 corresponding columns, as shown for Cubist in this example.
@@ -105,7 +101,7 @@ concrete_stack <-
concrete_stack
```
-Recall that racing methods (Section \@ref(racing)) are more efficient since they might not evaluate all configurations on all resamples. Stacking requires that all candidate members have the complete set of resamples. `add_candidates()` includes only the model configurations that have complete results.
+Recall that racing methods (introduced in Chapter \@ref(grid-search)) are more efficient since they might not evaluate all configurations on all resamples. Stacking requires that all candidate members have the complete set of resamples. `add_candidates()` includes only the model configurations that have complete results.
:::rmdnote
Why use the racing results instead of the full set of candidate models contained in `grid_results`? Either can be used. We found better performance for these data using the racing results. This might be due to the racing method pre-selecting the best model(s) from the larger grid.
@@ -203,7 +199,7 @@ The regularized linear regression meta-learning model contained `r num_coefs` bl
autoplot(ens, "weights") +
geom_text(aes(x = weight + 0.01, label = model), hjust = 0) +
theme(legend.position = "none") +
- lims(x = c(-0.01, 0.8))
+ lims(x = c(-0.01, 0.9))
```
```{r blending-weights, ref.label = "ensembles-blending-weights"}
diff --git a/21-inferential-analysis.Rmd b/21-inferential-analysis.Rmd
index 2a772a4b..c478f5e2 100644
--- a/21-inferential-analysis.Rmd
+++ b/21-inferential-analysis.Rmd
@@ -12,14 +12,14 @@ data("bioChemists", package = "pscl")
# Inferential Analysis {#inferential}
:::rmdnote
-In Section \@ref(model-types), we outlined a taxonomy of models and said that most models can be categorized as descriptive, inferential, and/or predictive.
+In Chapter \@ref(software-modeling), we outlined a taxonomy of models and said that most models can be categorized as descriptive, inferential, and/or predictive.
:::
Most of the chapters in this book have focused on models from the perspective of the accuracy of predicted values, an important quality of models for all purposes but most relevant for predictive models. Inferential models are usually created not only for their predictions, but also to make inferences or judgments about some component of the model, such as a coefficient value or other parameter. These results are often used to answer some (hopefully) pre-defined questions or hypotheses. In predictive models, predictions on hold-out data are used to validate or characterize the quality of the model. Inferential methods focus on validating the probabilistic or structural assumptions that are made prior to fitting the model.
For example, in ordinary linear regression, the common assumption is that the residual values are independent and follow a Gaussian distribution with a constant variance. While you may have scientific or domain knowledge to lend credence to this assumption for your model analysis, the residuals from the fitted model are usually examined to determine if the assumption was a good idea. As a result, the methods for determining if the model's assumptions have been met are not as simple as looking at holdout predictions, although that can be very useful as well.
-We will use p-values in this chapter. However, the tidymodels framework tends to promote confidence intervals over p-values as a method for quantifying the evidence for an alternative hypothesis. As previously shown in Section \@ref(tidyposterior), Bayesian methods are often superior to both p-values and confidence intervals in terms of ease of interpretation (but they can be more computationally expensive).
+We will use p-values in this chapter. However, the tidymodels framework tends to promote confidence intervals over p-values as a method for quantifying the evidence for an alternative hypothesis. As previously shown in Chapter \@ref(compare), Bayesian methods are often superior to both p-values and confidence intervals in terms of ease of interpretation (but they can be more computationally expensive).
:::rmdwarning
There has been a push in recent years to move away from p-values in favor of other methods [@pvalue]. See Volume 73 of [*The American Statistician*](https://www.tandfonline.com/toc/utas20/73/) for more information and discussion.
diff --git a/DESCRIPTION b/DESCRIPTION
index 6d46c33e..51403f89 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
Package: TMwR
Title: Tidy Modeling with R.
-Version: 0.0.1.9010
+Version: 1.0.1
Authors@R: c(
person("Max", "Kuhn", , "max@rstudio.com", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2402-136X")),
@@ -62,6 +62,7 @@ Imports:
probably,
pscl,
purrr,
+ ragg,
ranger,
recipes (>= 0.1.16),
rlang,
@@ -88,7 +89,6 @@ Imports:
xgboost,
yardstick
Remotes:
- tidymodels/censored,
tidymodels/learntidymodels
biocViews: mixOmics
Encoding: UTF-8
diff --git a/RData/concrete_results.RData b/RData/concrete_results.RData
index e28c162b..de97d1f3 100644
Binary files a/RData/concrete_results.RData and b/RData/concrete_results.RData differ
diff --git a/RData/rda_fit.RData b/RData/rda_fit.RData
index 27fb5957..e8913d0f 100644
Binary files a/RData/rda_fit.RData and b/RData/rda_fit.RData differ
diff --git a/RData/sa_history.RData b/RData/sa_history.RData
index d8c3e3eb..015500d4 100644
Binary files a/RData/sa_history.RData and b/RData/sa_history.RData differ
diff --git a/_common.R b/_common.R
index bf074b6e..7d6a2db6 100644
--- a/_common.R
+++ b/_common.R
@@ -3,11 +3,17 @@ options(dplyr.print_min = 6, dplyr.print_max = 6)
options(cli.width = 85)
options(crayon.enabled = FALSE)
+library(ragg)
+
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
fig.align = 'center',
- tidy = FALSE
+ tidy = FALSE,
+ # see https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/#the-solution
+ dev = "agg_png",
+ dev.args = list(res = 300, units = "in"),
+ fig.ext = "png"
)
diff --git a/contributors.csv b/contributors.csv
index e7b93171..f82eadd0 100644
--- a/contributors.csv
+++ b/contributors.csv
@@ -38,3 +38,5 @@ topepo,389,Max Kuhn,NA
x1o,3,Dmitry Zotikov,NA
xiaochi-liu,3,Xiaochi,xiaochi.rbind.io
zachbogart,1,Zach Bogart,zachbogart.com
+arisp99,1,Aris Paschalidis,arispas.com
+MikeJohnPage,1,NA,www.mikejohnpage.com
diff --git a/convert_oreilly.md b/convert_oreilly.md
new file mode 100644
index 00000000..80eb8cdc
--- /dev/null
+++ b/convert_oreilly.md
@@ -0,0 +1,104 @@
+# Prep for O'Reilly submission
+
+
+
+## Change the "pkg" CSS class to be just **bold**
+
+In TMwR.css? This is not working for me yet.
+
+## Generate `.md` files:
+
+Choose a directory to put the new files in (use `_bookdown.yml` to generate only part of the book):
+
+```r
+library(bookdown)
+render_book(output_format = html_book(keep_md = TRUE),
+ output_dir = "tmwr-atlas/")
+```
+
+## Convert divs to markdown images
+
+In new directory:
+
+```
+sed -i ".bak" 's/<div class="figure".*>//g' *.md
+sed -i ".bak" 's/<img src="\(.*\)" alt="\(.*\)".*\/>/[[\2]]\n/g' *.md
+sed -i ".bak" "s/:::rmdnote/STARTNOTE/g" *.md
+sed -i ".bak" "s/:::rmdwarning/STARTWARNING/g" *.md
+sed -i ".bak" "s/:::/STOPBOX/g" *.md
+```
+
+## Convert to asciidoc using pandoc
+
+In the new directory:
+
+```
+for f in *.md; do pandoc --markdown-headings=atx \
+ --verbose \
+ --wrap=none \
+ --reference-links \
+ --citeproc \
+ --bibliography=TMwR.bib \
+ --lua-filter=lower-header.lua \
+ -f markdown -t asciidoc \
+ -o "${f%.md}.adoc" \
+ "$f"; done
+```
+
+## Fix notes/warnings/image/etc
+
+Using sed:
+
+```
+sed -i ".bak" "s/STARTNOTE/[NOTE]\n====\n/g" *.adoc
+sed -i ".bak" "s/STARTWARNING/[WARNING]\n====\n/g" *.adoc
+sed -i ".bak" "s/STOPBOX/\n====/g" *.adoc
+sed -i ".bak" -E "s/^{empty}//g" *.adoc
+sed -i ".bak" -E "1 s/\[#([^()]*)]*\]/\[\1\]/" *.adoc
+sed -i ".bak" -E "s/\@ref\(fig:([^()]*)\)/<<\1>>/g" *.adoc
+sed -i ".bak" -E "s/\@ref\(tab:([^()]*)\)/<<\1>>/g" *.adoc
+sed -i ".bak" -E "s/\@ref\(([^()]*)\)/<<\1>>/g" *.adoc
+perl -i~ -0777 -pe 's/\[\[refs\]\].*\Z//sg' *.adoc
+perl -i~ -0777 -pe 's/\.\(\#tab\:(.*?)\)(.*?)/[[\1]]\n\.\2/g' *.adoc
+sed -i ".bak" 's/\[\[\(.*\)\]\] image:\(.*\)\[\(.*\)\]/\[\[\1\]\]\n\.\3\nimage::\2\[\]/g' *.adoc
+sed -i ".bak" 's/image::figures/image::images/g' *.adoc
+sed -i ".bak" 's/image::premade/image::images/g' *.adoc
+sed -i ".bak" 's/\.svg/\.png/g' *.adoc
+sed -i ".bak" 's/Figure <</<</g' *.adoc
+ select(Neighborhood, Longitude, Latitude) %>%
+ group_nest(Neighborhood) %>%
+ mutate(con_hull = map(data, ~ .x[chull(.x),])) %>%
+ select(-data) %>%
+ unnest(con_hull)
+
+chull_ames <-
+ ggplot() +
+ xlim(ames_x) +
+ ylim(ames_y) +
+ theme_void() +
+ theme(legend.position = "none") +
+ geom_sf(data = ia_roads, aes(geometry = geometry), alpha = .1) +
+ geom_polygon(
+ data = chull_ames,
+ aes(
+ x = Longitude,
+ y = Latitude,
+ col = Neighborhood,
+ fill = Neighborhood
+ ),
+ show.legend = FALSE,
+ size = 1,
+ alpha = .5
+ ) +
+ scale_color_manual(values = ames_cols) +
+ scale_fill_manual(values = ames_cols)
+
+agg_png("ames_chull.png", width = 820 * 3, height = 550 * 3, res = 300, scaling = 1)
+print(chull_ames)
+dev.off()
+
## -----------------------------------------------------------------------------
mitchell_x <- extendrange(ames$Longitude[ames$Neighborhood == "Mitchell"], f = .1)
@@ -138,7 +197,7 @@ mitchell_box <-
size = .3,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
+ scale_color_manual(values = c(Meadow_Village = "#1F78B4", Mitchell = "#A6CEE3")) +
scale_shape_manual(values = ames_pch) +
geom_rect(
aes(
@@ -183,15 +242,15 @@ mitchell <-
size = 4,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch)
+ scale_color_manual(values = c(Meadow_Village = "#1F78B4", Mitchell = "#A6CEE3")) +
+ scale_shape_manual(values = c(Meadow_Village = 17, Mitchell = 16))
# make plot and guide side-by-side
# mitchell_box + plot_spacer() + mitchell + plot_layout(widths = c(2, 0.1, 3))
# guide inset in plot
-agg_png("mitchell.png", width = 480 * mitchell_ratio * 2, height = 480 * 2, res = 200)
+agg_png("mitchell.png", width = 480 * mitchell_ratio * 3, height = 480 * 3, res = 300, scaling = 1)
print(mitchell)
print(mitchell_box, vp = viewport(0.8, 0.27, width = 0.3 * ames_ratio, height = 0.3))
dev.off()
@@ -214,9 +273,7 @@ timberland_box <-
data = ames,
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = .2,
alpha = .5
@@ -250,9 +307,7 @@ timberland <-
data = ames,
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = 5,
alpha = .5
@@ -261,7 +316,7 @@ timberland <-
scale_shape_manual(values = ames_pch)
# guide inset in plot
-agg_png("timberland.png", width = 480 * timberland_ratio)
+agg_png("timberland.png", width = 480 * timberland_ratio, res = 300, scaling = 1/3)
print(timberland)
print(timberland_box, vp = viewport(0.85, 0.2, width = 0.3 * ames_ratio, height = 0.3))
dev.off()
@@ -283,15 +338,11 @@ dot_rr_box <-
data = ames %>% filter(Neighborhood %in% c("Iowa_DOT_and_Rail_Road")),
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = .3,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch) +
geom_rect(
aes(
xmin = dot_rr_x[1],
@@ -319,18 +370,14 @@ dot_rr <-
data = ames %>% filter(Neighborhood %in% c("Iowa_DOT_and_Rail_Road")),
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = 6,
alpha = .5
- ) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch)
+ )
# guide inset in plot
-agg_png("dot_rr.png", width = 480 * dot_rr_ratio)
+agg_png("dot_rr.png", width = 480 * dot_rr_ratio, res = 300, scaling = 1/3)
print(dot_rr)
print(dot_rr_box, vp = viewport(0.5, 0.26, width = 0.45 * ames_ratio, height = 0.45))
dev.off()
@@ -353,15 +400,11 @@ crawford_box <-
data = ames %>% filter(Neighborhood %in% c("Crawford")),
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = .3,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch) +
geom_rect(
aes(
xmin = crawford_x[1],
@@ -389,18 +432,14 @@ crawford <-
data = ames %>% filter(Neighborhood %in% c("Crawford")),
aes(
x = Longitude,
- y = Latitude,
- col = Neighborhood,
- shape = Neighborhood
+ y = Latitude
),
size = 5,
alpha = .5
- ) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch)
+ )
# guide inset in plot
-agg_png("crawford.png", width = 480 * crawford_ratio)
+agg_png("crawford.png", width = 480 * crawford_ratio, res = 300, scaling = 1/3)
print(crawford)
print(crawford_box, vp = viewport(0.5, 0.2, width = 0.35 * ames_ratio, height = 0.35))
dev.off()
@@ -430,8 +469,8 @@ northridge_box <-
size = .3,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch) +
+ scale_color_manual(values = c(Northridge = "#B2DF8A", Somerset = "#6A3D9A")) +
+ scale_shape_manual(values = c(Northridge = 16, Somerset = 17)) +
geom_rect(
aes(
xmin = northridge_x[1],
@@ -475,15 +514,15 @@ northridge <-
size = 4,
alpha = .5
) +
- scale_color_manual(values = ames_cols) +
- scale_shape_manual(values = ames_pch)
+ scale_color_manual(values = c(Northridge = "#B2DF8A", Somerset = "#6A3D9A")) +
+ scale_shape_manual(values = c(Northridge = 16, Somerset = 17))
# make plot and guide side-by-side
# northridge_box + plot_spacer() + northridge + plot_layout(widths = c(2, 0.1, 3))
# guide inset in plot
-agg_png("northridge.png", width = 480 * northridge_ratio)
+agg_png("northridge.png", width = 480 * northridge_ratio, res = 300, scaling = 1/3)
print(northridge)
print(northridge_box, vp = viewport(0.85, 0.21, width = 0.35 * ames_ratio, height = 0.35))
dev.off()
diff --git a/extras/crawford.png b/extras/crawford.png
new file mode 100644
index 00000000..fbc90a7f
Binary files /dev/null and b/extras/crawford.png differ
diff --git a/extras/dot_rr.png b/extras/dot_rr.png
new file mode 100644
index 00000000..3b9263cc
Binary files /dev/null and b/extras/dot_rr.png differ
diff --git a/extras/iowa_highway.dbf b/extras/iowa_highway.dbf
new file mode 100644
index 00000000..0c3d9a9c
Binary files /dev/null and b/extras/iowa_highway.dbf differ
diff --git a/extras/iowa_highway.prj b/extras/iowa_highway.prj
new file mode 100644
index 00000000..379ef7c8
--- /dev/null
+++ b/extras/iowa_highway.prj
@@ -0,0 +1 @@
+GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],TOWGS84[0,0,0,0,0,0,0],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.01745329251994328,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]]
\ No newline at end of file
diff --git a/extras/mitchell.png b/extras/mitchell.png
new file mode 100644
index 00000000..016d7935
Binary files /dev/null and b/extras/mitchell.png differ
diff --git a/extras/northridge.png b/extras/northridge.png
new file mode 100644
index 00000000..83808aea
Binary files /dev/null and b/extras/northridge.png differ
diff --git a/extras/timberland.png b/extras/timberland.png
new file mode 100644
index 00000000..160fc3b8
Binary files /dev/null and b/extras/timberland.png differ
diff --git a/index.Rmd b/index.Rmd
index 666a0daa..1b3f7a4b 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -44,9 +44,9 @@ This book is not intended to be a comprehensive reference on modeling techniques
```{r, eval = FALSE, echo = FALSE}
library(tidyverse)
contribs_all_json <- gh::gh("/repos/:owner/:repo/contributors",
- owner = "tidymodels",
- repo = "TMwR",
- .limit = Inf
+ owner = "tidymodels",
+ repo = "TMwR",
+ .limit = Inf
)
contribs_all <- tibble(
login = contribs_all_json %>% map_chr("login"),
@@ -103,7 +103,7 @@ df <- tibble::tibble(
source = stringr::str_split(source, " "),
source = purrr::map_chr(source, ~ .x[1]),
info = paste0(package, " (", version, ", ", source, ")")
- )
+ )
pkg_info <- knitr::combine_words(df$info)
```
diff --git a/lower-header.lua b/lower-header.lua
new file mode 100644
index 00000000..a22ffbab
--- /dev/null
+++ b/lower-header.lua
@@ -0,0 +1,9 @@
+function Header(el)
+ -- The header level can be accessed via the attribute 'level'
+ -- of the element. See the Pandoc documentation later.
+ if (el.level <= 1) then
+ return el
+ end
+ el.level = el.level + 1
+ return el
+end
diff --git a/pre-proc-table.Rmd b/pre-proc-table.Rmd
index 103c838c..49fc31d2 100644
--- a/pre-proc-table.Rmd
+++ b/pre-proc-table.Rmd
@@ -2,7 +2,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(cli)
-library(kableExtra)
tk <- symbol$tick
x <- symbol$times
@@ -69,13 +68,12 @@ tab <-
tab %>%
arrange(model) %>%
mutate(model = paste0("", model, "")) %>%
- kable(
+ knitr::kable(
caption = "Preprocessing methods for different models.",
label = "preprocessing",
escape = FALSE,
align = c("l", rep("c", ncol(tab) - 1))
- ) %>%
- kable_styling(full_width = FALSE)
+ )
```
Footnotes:
diff --git a/premade/ames_chull.png b/premade/ames_chull.png
new file mode 100644
index 00000000..16c05f7e
Binary files /dev/null and b/premade/ames_chull.png differ
diff --git a/premade/ames_plain.png b/premade/ames_plain.png
new file mode 100644
index 00000000..5c085c0f
Binary files /dev/null and b/premade/ames_plain.png differ
diff --git a/premade/crawford.png b/premade/crawford.png
index 47afa96d..fbc90a7f 100644
Binary files a/premade/crawford.png and b/premade/crawford.png differ
diff --git a/premade/dot_rr.png b/premade/dot_rr.png
index f4acc50e..3b9263cc 100644
Binary files a/premade/dot_rr.png and b/premade/dot_rr.png differ
diff --git a/premade/mitchell.png b/premade/mitchell.png
index 04c150ad..016d7935 100644
Binary files a/premade/mitchell.png and b/premade/mitchell.png differ
diff --git a/premade/morphology.png b/premade/morphology.png
index 471bc459..72e22de5 100644
Binary files a/premade/morphology.png and b/premade/morphology.png differ
diff --git a/premade/northridge.png b/premade/northridge.png
index cf716ab3..83808aea 100644
Binary files a/premade/northridge.png and b/premade/northridge.png differ
diff --git a/premade/timberland.png b/premade/timberland.png
index 003ec6b5..160fc3b8 100644
Binary files a/premade/timberland.png and b/premade/timberland.png differ
diff --git a/render12b2648c7e576.rds b/render12b2648c7e576.rds
new file mode 100644
index 00000000..845999a7
Binary files /dev/null and b/render12b2648c7e576.rds differ
diff --git a/tmwr-atlas/01-software-modeling.md b/tmwr-atlas/01-software-modeling.md
new file mode 100644
index 00000000..98948bfc
--- /dev/null
+++ b/tmwr-atlas/01-software-modeling.md
@@ -0,0 +1,207 @@
+# (PART\*) Introduction {-}
+
+# Software for modeling {#software-modeling}
+
+
+
+
+Models are mathematical tools that can describe a system and capture relationships in the data given to them. Models can be used for various purposes, including predicting future events, determining if there is a difference between several groups, aiding map-based visualization, discovering novel patterns in the data that could be further investigated, and more. The utility of a model hinges on its ability to be reductive, or to reduce complex relationships to simpler terms. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.
+
+Since the beginning of the twenty-first century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when might be a good time to walk the dog, ordering a product from a website, typing a text message to a friend and having it autocorrected, and checking email. In each of these instances, there is a good chance that some type of model was involved. In some cases, the contribution of the model might be easily perceived ("You might also be interested in purchasing product _X_") while in other cases, the impact could be the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, to identify a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.
+
+:::rmdnote
+There are two reasons that models permeate our lives today:
+
+ * an abundance of software exists to create models, and
+ * it has become easier to capture and store data, as well as make it accessible.
+:::
+
+This book focuses largely on software. It is obviously critical that software produces the correct relationships to represent the data. For the most part, determining mathematical correctness is possible, but the reliable creation of appropriate models requires more. In this chapter, we outline considerations for building or choosing modeling software, the purposes of models, and where modeling sits in the broader data analysis process.
+
+## Fundamentals for Modeling Software
+
+It is important that the modeling software you use is easy to operate in a proper way. The user interface should not be so poorly designed that the user would not know that they used it inappropriately. For example, @baggerly2009 report myriad problems in the data analyses from a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user interface of the software made it easy to offset the column names of the data from the actual data columns. This resulted in the wrong genes being identified as important for treating cancer patients and eventually contributed to the termination of several clinical trials [@Carlson2012].
+
+If we need high quality models, software must facilitate proper usage. @abrams2003 describes an interesting principle to guide us:
+
+> The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.
+
+Data analysis and modeling software should espouse this idea.
+
+Second, modeling software should promote good scientific methodology. When working with complex predictive models, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at discovering patterns that they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue can go undetected until a later time when new data that contain the true result are obtained.
+
+:::rmdwarning
+As our models have become more powerful and complex, it has also become easier to commit latent errors.
+:::
+
+This same principle also applies to programming. Whenever possible, the software should be able to protect users from committing mistakes. Software should make it easy for users to do the right thing.
+
+These two aspects of model development -- ease of proper use and good methodological practice -- are crucial. Since tools for creating models are easily accessible and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. Tools should be powerful enough to create high-performance models, but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software for modeling which has been designed with these characteristics in mind.
+
+The software is based on the R programming language [@baseR]. R has been designed especially for data analysis and modeling. It is an implementation of the S language (with lexical scoping rules adapted from Scheme and Lisp) which was created in the 1970s to
+
+> "turn ideas into software, quickly and faithfully" [@Chambers:1998]
+
+R is open-source and free of charge. It is a powerful programming language that can be used for many different purposes but specializes in data analysis, modeling, visualization, and machine learning. R is easily extensible; it has a vast ecosystem of packages, mostly user-contributed modules that focus on a specific theme, such as modeling, visualization, and so on.
+
+One collection of packages is called the *tidyverse* [@tidyverse]. The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Several of these design philosophies are directly informed by the aspects of software for modeling described in this chapter. If you've never used the tidyverse packages, Chapter \@ref(tidyverse) contains a review of its basic concepts. Within the tidyverse, the subset of packages specifically focused on modeling are referred to as the *tidymodels* packages. This book is a practical guide for conducting modeling using the tidyverse and tidymodels packages. It shows how to use a set of packages, each with its own specific purpose, together to create high-quality models.
+
+## Types of Models {#model-types}
+
+Before proceeding, let's describe a taxonomy for types of models, grouped by purpose. This taxonomy informs both how a model is used and many aspects of how the model may be created or evaluated. While not exhaustive, most models fall into at least one of these categories:
+
+### Descriptive models {-}
+
+The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.
+
+For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip can measure a signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned.
+
+An early method for evaluating such issues was the probe-level model, or PLM [@bolstad2004]. A statistical model would be created that accounted for the known differences in the data, such as the chip, the RNA sequence, the type of sequence, and so on. If there were other, unknown factors in the data, these effects would be captured in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When a problem did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe the chip off and rescan, repeat the sample, etc.). Figure \@ref(fig:software-descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different color values; areas that are darker are where the signal intensity was larger than the model expects while the lighter color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel exhibits an undesirable artifact in the middle of the chip.
+
+
+(\#fig:software-descr-examples)Two examples of how descriptive models can be used to illustrate specific patterns.
+
+Another example of a descriptive model is the _locally estimated scatterplot smoothing_ model, more commonly known as LOESS [@cleveland1979]. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure \@ref(fig:software-descr-examples)(b) where a nonlinear trend is illuminated by the flexible smoother. From this plot, it is clear that there is a highly nonlinear relationship between the sale price of a house and its latitude.
+
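For instance, a LOESS smoother for the sale price versus latitude trend mentioned above can be sketched with ggplot2 using the `ames` data from the modeldata package; the point transparency and log scale below are just presentation choices.

```r
library(ggplot2)
data(ames, package = "modeldata")

# a flexible LOESS fit traces the nonlinear price-versus-latitude trend
ggplot(ames, aes(x = Latitude, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x) +
  scale_y_log10()
```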
+
+### Inferential models {-}
+
+The goal of an inferential model is to produce a decision for a research question or to explore a specific hypothesis, similar to how statistical tests are used.^[Many specific statistical tests are in fact equivalent to models. For example, t-tests and analysis of variance (ANOVA) methods are particular cases of the generalized linear model.] An inferential model starts with some predefined conjecture or idea about a population, and produces a statistical conclusion such as an interval estimate or the rejection of a hypothesis.
+
+For example, the goal of a clinical trial might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative, like an existing therapy or no treatment at all. If the clinical endpoint was related to survival of a patient, the _null hypothesis_ might be that the new treatment has an equal or lower median survival time, with the _alternative hypothesis_ being that the new therapy has higher median survival. If this trial were evaluated using traditional null hypothesis significance testing via modeling, the significance testing would produce a p-value using some pre-defined methodology based on a set of assumptions for the data. Small values for the p-value in the model results would indicate that there is evidence that the new therapy helps patients live longer. Large values for the p-value in the model results would conclude that there is a failure to show such a difference; this lack of evidence could be due to a number of reasons, including the therapy not working.
+
+What are the important aspects of this type of analysis? Inferential modeling techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal probabilistic assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical modeling results are highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: "If my data were independent and the residuals follow distribution _X_, then test statistic _Y_ can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate."
+
+:::rmdwarning
+One aspect of inferential analyses is that there tends to be a delayed feedback loop in understanding how well the data matches the model assumptions. In our clinical trial example, if statistical (and clinical) significance indicate that the new therapy should be available for patients to use, it still may be years before it is used in the field and enough data are generated for an independent assessment of whether the original statistical analysis led to the appropriate decision.
+:::
+
+### Predictive models {-}
+
+Sometimes data are modeled to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.
+
+A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to their store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit.
+
+For this type of model, the problem type is one of estimation rather than inference. For example, the buyer is usually not concerned with a question such as "Will I sell more than 100 copies of book _X_ next month?" but rather "How many copies of book _X_ will customers purchase next month?" Also, depending on the context, there may not be any interest in why the predicted value is _X_. In other words, there is more interest in the value itself than evaluating a formal hypothesis related to the data. The prediction can also include measures of uncertainty. In the case of the book buyer, providing a forecasting error may be helpful in deciding how many to purchase. It can also serve as a metric to gauge how well the prediction method worked.
+
+What are the most important factors affecting predictive models? There are many different ways that a predictive model can be created, so the important factors depend on how the model was developed.^[Broader discussions of these distinctions can be found in @breiman2001 and @shmueli2010.]
+
+A *mechanistic model* could be derived using first principles to produce a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person's body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the unknown parameters of this equation so that predictions can be generated. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeling practitioner is much faster than it would be for a hypothesis test.
+
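As a small, self-contained sketch of such a mechanistic model, consider one-compartment drug kinetics with first-order absorption; the parameter values below are purely illustrative, not estimates from any data.

```r
# predicted drug concentration over time after a single oral dose,
# derived from the differential equations of a one-compartment model
one_cmpt_conc <- function(time, dose = 100, ka = 1.2, ke = 0.25, V = 30) {
  (dose * ka) / (V * (ka - ke)) * (exp(-ke * time) - exp(-ka * time))
}

curve(one_cmpt_conc(x), from = 0, to = 24,
      xlab = "Hours after dose", ylab = "Predicted concentration")
```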
+*Empirically driven models* are created with more vague assumptions. These models tend to fall into the machine learning category. A good example is the _K_-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the _K_ most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new books to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of "similar"). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was a good choice, the predictions would be close to the actual values.
+
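To make the 5-nearest neighbor logic concrete, here is a toy sketch with simulated book data; every column name and value here is made up for illustration.

```r
library(dplyr)

set.seed(1)
books <- tibble(
  page_count    = runif(50, 100, 800),
  list_price    = runif(50, 10, 60),
  monthly_sales = rpois(50, 40)
)
new_book <- tibble(page_count = 350, list_price = 25)

books %>%
  mutate(
    # "similarity" here is plain Euclidean distance on two features
    dist = sqrt((page_count - new_book$page_count)^2 +
                  (list_price - new_book$list_price)^2)
  ) %>%
  slice_min(dist, n = 5) %>%                        # the five most similar books
  summarize(predicted_sales = mean(monthly_sales))  # average their sales
```
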
+## Connections Between Types of Models
+
+:::rmdnote
+Note that we have defined the type of a model by how it is used, rather than its mathematical qualities.
+:::
+
+An ordinary linear regression model might fall into any of these three classes of model, depending on how it is used:
+
+* A descriptive smoother, similar to LOESS, called _restricted smoothing splines_ [@Durrleman1989] can be used to describe trends in data using ordinary linear regression with specialized terms.
+
+* An _analysis of variance_ (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.
+
+* If a simple linear regression model produces accurate predictions, it can be used as a predictive model.
+
+There are many examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the K-nearest neighbors model, for example, makes the math required for inference intractable.
+
+There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables are related to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the statistical qualities of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building may be used to determine a minimal subset of independent variables that have a "statistically significant" relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g. 0.05). From here, the analyst may focus on making qualitative statements about the relative influence that the variables have on the outcome (e.g., "There is a statistically significant relationship between age and the odds of heart disease.").
+
+This approach can be dangerous when statistical significance is used as the only measure of model quality. It is possible that this statistically optimized model has poor model accuracy, or performs poorly on some other measure of predictive capacity. While the model might not be used for prediction, how much should inferences be trusted from a model that has significant p-values but dismal accuracy? Predictive performance tends to be related to how close the model's fitted values are to the observed data.
+
+:::rmdwarning
+If a model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not be sufficient proof that a model is appropriate.
+:::
+
+This may seem intuitively obvious, but is often ignored in real-world data analysis.
+
+## Some Terminology {#model-terminology}
+
+Before proceeding, we outline here some additional terminology related to modeling and data. These descriptions are intended to be helpful as you read this book but not exhaustive.
+
+First, many models can be categorized as being _supervised_ or _unsupervised_. Unsupervised models are those that learn patterns, clusters, or other characteristics of the data but lack an outcome, i.e., a dependent variable. Principal component analysis (PCA), clustering, and autoencoders are examples of unsupervised models; they are used to understand relationships between variables or sets of variables without an explicit relationship between predictors and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category.
+
+Within supervised models, there are two main sub-categories:
+
+* *Regression* predicts a numeric outcome.
+
+* *Classification* predicts an outcome that is an ordered or unordered set of qualitative values.
+
+These are imperfect definitions and do not account for all possible types of models. In Chapter \@ref(models), we refer to this characteristic of supervised techniques as the _model mode_.
+
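For example, in the parsnip package the same model type can be declared with either mode (a brief illustration, assuming parsnip is installed):

```r
library(parsnip)

# the same model type, used in each of the two supervised modes
decision_tree() %>% set_mode("regression")
decision_tree() %>% set_mode("classification")
```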
+Different variables can have different _roles_, especially in a supervised modeling analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the value being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms _outcomes_ and _predictors_ are used most frequently in this book.
+
+In terms of the data or variables themselves, whether used for supervised or unsupervised models, as predictors or outcomes, the two main categories are quantitative and qualitative. Examples of the former are real numbers like `3.14159` and integers like `42`. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be naturally placed on a numeric scale, like "red", "green", and "blue".
+
+
+## How Does Modeling Fit into the Data Analysis Process? {#model-phases}
+
+In what circumstances are models created? Are there steps that precede such an undertaking? Is model creation the first step in data analysis?
+
+:::rmdnote
+There are always a few critical phases of data analysis that come before modeling.
+:::
+
+First, there is the chronically underestimated process of *cleaning the data*. No matter the circumstances, you should investigate the data to make sure that they are applicable to your project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances).
+
+Data cleaning can also overlap with the second phase of *understanding the data*, often referred to as exploratory data analysis (EDA). EDA brings to light how the different variables are related to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is, "How did I come by _these_ data?" This question can help you understand how the data at hand have been sampled or filtered and if these operations were appropriate. For example, when merging database tables, a join may go awry that could accidentally eliminate one or more sub-populations. Another good idea is to ask if the data are relevant. For example, to predict whether patients have Alzheimer's disease or not, it would be unwise to have a data set containing subjects with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model may simply predict who are the oldest patients.
+
+Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one _performance metric_ should be identified with realistic goals of what can be achieved. Common statistical metrics, discussed in more detail in Chapter \@ref(performance), are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighed. It is also important that the metric be germane; alignment with the broader data analysis goals is critical.
+
+The process of investigating the data may not be simple. @wickham2016 contains an excellent illustration of the general data analysis process, reproduced in Figure \@ref(fig:software-data-science-model). Data ingestion and cleaning/tidying are shown as the initial steps. When the analytical steps for understanding commence, they are a heuristic process; we cannot pre-determine how long they may take. The cycle of transformation, modeling, and visualization often requires multiple iterations.
+
+
+(\#fig:software-data-science-model)The data science process (from R for Data Science, used with permission).
+
+This iterative process is especially true for modeling. Figure \@ref(fig:software-modeling-process) is meant to emulate the typical path to determining an appropriate model. The general phases are:
+
+* *Exploratory data analysis (EDA):* Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure \@ref(fig:software-data-science-model)) where different discoveries lead to more questions and data analysis "side-quests" to gain more understanding.
+
+* *Feature engineering:* The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Chapter \@ref(recipes) focuses entirely on this important step.
+
+* *Model tuning and selection (large circles with alternating segments):* A variety of models are generated and their performance is compared. Some models require parameter tuning where some structural parameters are required to be specified or optimized. The alternating segments within the circles signify the repeated data splitting used during resampling (see Chapter \@ref(resampling)).
+
+* *Model evaluation:* During this phase of model development, we assess the model's performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter \@ref(compare)) help you to understand whether any differences in models are within the experimental noise.
+
+
+(\#fig:software-modeling-process)A schematic for the typical modeling process.
+
+After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize, document, and communicate the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.
+
+As an example, @fes use data to model the daily ridership of Chicago's public train system using predictors such as the date, the previous ridership results, the weather, and other factors. Table \@ref(tab:inner-monologue) walks through an approximation of these authors' "inner monologue" when analyzing these data and eventually selecting a model with sufficient performance.
+
+
+Table: (\#tab:inner-monologue)Hypothetical inner monologue of a model developer.
+
+|Thoughts |Activity |
+|:--------------------------------------------------------------------------------------------------------------------------------|:-------------------|
+|The daily ridership values between stations are extremely correlated. |EDA |
+|Weekday and weekend ridership look very different. |EDA |
+|One day in the summer of 2010 has an abnormally large number of riders. |EDA |
+|Which stations had the lowest daily ridership values? |EDA |
+|Dates should at least be encoded as day-of-the-week, and year. |Feature Engineering |
+|Maybe PCA could be used on the correlated predictors to make it easier for the models to use them. |Feature Engineering |
+|Hourly weather records should probably be summarized into daily measurements. |Feature Engineering |
+|Let’s start with simple linear regression, K-nearest neighbors, and a boosted decision tree. |Model Fitting |
+|How many neighbors should be used? |Model Tuning |
+|Should we run a lot of boosting iterations or just a few? |Model Tuning |
+|How many neighbors seemed to be optimal for these data? |Model Tuning |
+|Which models have the lowest root mean squared errors? |Model Evaluation |
+|Which days were poorly predicted? |EDA |
+|Variable importance scores indicate that the weather information is not predictive. We’ll drop them from the next set of models. |Model Evaluation |
+|It seems like we should focus on a lot of boosting iterations for that model. |Model Evaluation |
+|We need to encode holiday features to improve predictions on (and around) those dates. |Feature Engineering |
+|Let’s drop K-NN from the model list. |Model Evaluation |
+
+## Chapter Summary {#software-summary}
+
+This chapter focused on how models describe relationships in data, and different types of models such as descriptive models, inferential models, and predictive models. The predictive capacity of a model can be used to evaluate it, even when its main goal is not prediction. Modeling itself sits within the broader data analysis process, and exploratory data analysis is a key part of building high-quality models.
+
+
diff --git a/tmwr-atlas/02-tidyverse.md b/tmwr-atlas/02-tidyverse.md
new file mode 100644
index 00000000..7583fb54
--- /dev/null
+++ b/tmwr-atlas/02-tidyverse.md
@@ -0,0 +1,319 @@
+# A Tidyverse Primer {#tidyverse}
+
+
+
+What is the tidyverse, and where does the tidymodels framework fit in? The tidyverse is a collection of R packages for data analysis that are developed with common ideas and norms. From @tidyverse:
+
+> "At a high level, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next."
+
+In this chapter, we briefly discuss important principles of the tidyverse design philosophy and how they apply in the context of modeling software that is easy to use properly and supports good statistical practice, like we outlined in Chapter \@ref(software-modeling). The next chapter covers modeling conventions from the core R language. Together, you can use these discussions to understand the relationships between the tidyverse, tidymodels, and the core or base R language. Both tidymodels and the tidyverse build on the R language, and tidymodels applies tidyverse principles to building models.
+
+## Tidyverse Principles
+
+The full set of strategies and tactics for writing R code in the tidyverse style can be found at the website <https://design.tidyverse.org/>. Here we can briefly describe several of the general tidyverse design principles, their motivation, and how we think about modeling as an application of these principles.
+
+### Design for humans
+
+The tidyverse focuses on designing R packages and functions that can be easily understood and used by a broad range of people. Both historically and today, a substantial percentage of R users are not people who create software or tools but instead people who create analyses or models. As such, R users do not typically have (or need) computer science backgrounds, and many are not interested in writing their own R packages.
+
+For this reason, it is critical that R code be easy to work with to accomplish your goals. Documentation, training, accessibility, and other factors play an important part in achieving this. However, if the syntax itself is difficult for people to easily comprehend, documentation is a poor solution. The software itself must be intuitive.
+
+To contrast the tidyverse approach with more traditional R semantics, consider sorting a data frame. Data frames can represent different types of data in each column, and multiple values in each row. Using only the core language, we can sort a data frame using one or more columns by reordering the rows via R's subscripting rules in conjunction with `order()`; you cannot successfully use a function you might be tempted to try in such a situation because of its name, `sort()`. To sort the `mtcars` data by two of its columns, the call might look like:
+
+
+```r
+mtcars[order(mtcars$gear, mtcars$mpg), ]
+```
+
+While very computationally efficient, it would be difficult to argue that this is an intuitive user interface. By contrast, the tidyverse function `arrange()` from dplyr takes a set of variable names as input arguments directly:
+
+
+```r
+library(dplyr)
+arrange(.data = mtcars, gear, mpg)
+```
+
+:::rmdnote
+The variable names used here are "unquoted"; many traditional R functions require a character string to specify variables, but tidyverse functions take unquoted names or _selector functions_. The selectors allow for one or more readable rules that are applied to the column names. For example, `ends_with("t")` would select the `drat` and `wt` columns of the `mtcars` data frame.
+:::
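+
+For example, a quick sketch of a selector in action (dplyr is already loaded above; the result follows from the column names of `mtcars`):
+
+```r
+# Keep only the mtcars columns whose names end in "t":
+names(select(mtcars, ends_with("t")))
+#> [1] "drat" "wt"
+```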
+
+Additionally, naming is crucial. If you were new to R and were writing data analysis or modeling code involving linear algebra, you might be stymied when searching for a function that computes the matrix inverse. Using `apropos("inv")` yields no candidates. It turns out that the base R function for this task is `solve()`, for solving systems of linear equations. For a matrix `X`, you would use `solve(X)` to invert `X` (with no vector for the right-hand side of the equation). This is only documented in the description of one of the _arguments_ in the help file. In essence, you need to know the name of the solution to be able to find the solution.
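+
+For instance, a minimal sketch of that usage:
+
+```r
+X <- matrix(c(2, 0, 0, 4), nrow = 2)
+X_inv <- solve(X)   # no right-hand side supplied, so the inverse of X is returned
+X_inv %*% X         # recovers the 2 x 2 identity matrix
+```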
+
+The tidyverse approach is to use function names that are descriptive and explicit over those that are short and implicit. There is a focus on verbs (e.g. `fit`, `arrange`, etc.) for general methods. Verb-noun pairs are particularly effective; consider `invert_matrix()` as a hypothetical function name. In the context of modeling, it is also important to avoid highly technical jargon in names such as Greek letters or obscure terms. Names should be as self-documenting as possible.
+
+When there are similar functions in a package, function names are designed to be optimized for tab-completion. For example, the glue package has a collection of functions starting with a common prefix (`glue_`) that enables users to quickly find the function they are looking for.
+
+
+### Reuse existing data structures
+
+Whenever possible, functions should avoid returning a novel data structure. If the results are conducive to an existing data structure, it should be used. This reduces the cognitive load when using software; no additional syntax or methods are required.
+
+The data frame is the preferred data structure in tidyverse and tidymodels packages, because its structure is a good fit for such a broad swath of data science tasks. Specifically, the tidyverse and tidymodels favor the tibble, a modern reimagining of R's data frame that we describe in the next section on example tidyverse syntax.
+
+As an example, the rsample package can be used to create _resamples_ of a data set, such as cross-validation or the bootstrap (described in Chapter \@ref(resampling)). The resampling functions return a tibble with a column called `splits` of objects that define the resampled data sets. Three bootstrap samples of a data set might look like:
+
+
+```r
+boot_samp <- rsample::bootstraps(mtcars, times = 3)
+boot_samp
+#> # Bootstrap sampling
+#> # A tibble: 3 × 2
+#>   splits          id
+#>   <list>          <chr>
+#> 1 <split [32/13]> Bootstrap1
+#> 2 <split [32/10]> Bootstrap2
+#> 3 <split [32/13]> Bootstrap3
+class(boot_samp)
+#> [1] "bootstraps" "rset" "tbl_df" "tbl" "data.frame"
+```
+
+With this approach, vector-based functions can be used with these columns, such as `vapply()` or `purrr::map()`.^[If you've never seen `::` in R code before, it is an explicit method for calling a function. The value of the left-hand side is the _namespace_ where the function lives (usually a package name). The right-hand side is the function name. In cases where two packages use the same function name, this syntax ensures that the correct function is called.] This `boot_samp` object has multiple classes but inherits methods for data frames (`"data.frame"`) and tibbles (`"tbl_df"`). Additionally, new columns can be added to the results without affecting the class of the data. This is much easier and more versatile for users to work with than a completely new object type that does not make its data structure obvious.
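+
+For instance, a purrr mapping function can iterate over the `splits` list column directly. A minimal sketch, using the rsample function `analysis()` to extract the analysis (in-bag) rows of each split:
+
+```r
+# Count the rows in each bootstrap analysis set; each contains all 32 cars,
+# sampled with replacement:
+purrr::map_int(boot_samp$splits, ~ nrow(rsample::analysis(.x)))
+#> [1] 32 32 32
+```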
+
+One downside to relying on common data structures is the potential loss of computational performance. In some situations, data can be encoded in specialized formats that are more efficient representations of the data. For example:
+
+ * In computational chemistry, the structure-data file format (SDF) is a tool to take chemical structures and encode them in a format that is computationally efficient to work with.
+
+ * Data that have a large number of values that are the same (such as zeros for binary data) can be stored in a sparse matrix format. This format can reduce the size of the data as well as enable more efficient computational techniques.
+
+These formats are advantageous when the problem is well scoped and the potential data processing methods are both well defined and suited to such a format.^[Not all algorithms can take advantage of sparse representations of data. In such cases, a sparse matrix must be converted to a more conventional format before proceeding.] However, once such constraints are violated, specialized data formats are less useful. For example, if we perform a transformation of the data that converts the data into fractional numbers, the output is no longer sparse; the sparse matrix representation is helpful for one specific algorithmic step in modeling but this is often not true before or after that specific step.
+
+:::rmdwarning
+A specialized data structure is not flexible enough for an entire modeling workflow in the way that a common data structure is.
+:::
+
+One important feature in the tibble produced by rsample is that the `splits` column is a list. In this instance, each element of the list has the same type of object: an `rsplit` object that contains the information about which rows of `mtcars` belong in the bootstrap sample. _List columns_ can be very useful in data analysis and, as will be seen throughout this book, are important to tidymodels.
+
+
+### Design for the pipe and functional programming
+
+The magrittr pipe operator (`%>%`) is a tool for chaining together a sequence of R functions.^[In R 4.1, a native pipe operator `|>` was introduced as well. In this book, we use the magrittr pipe since users on older versions of R will not have the new native pipe.] To demonstrate, consider the following commands which sort a data frame and then retain the first 10 rows:
+
+
+```r
+small_mtcars <- arrange(mtcars, gear)
+small_mtcars <- slice(small_mtcars, 1:10)
+
+# or more compactly:
+small_mtcars <- slice(arrange(mtcars, gear), 1:10)
+```
+
+The pipe operator substitutes the value of the left-hand side of the operator as the first argument to the right-hand side, so we can implement the same result as before with:
+
+
+```r
+small_mtcars <-
+ mtcars %>%
+ arrange(gear) %>%
+ slice(1:10)
+```
+
+The piped version of this sequence is more readable; this readability increases as more operations are added to a sequence. This approach to programming works in this example because all of the functions we used return the same data structure (a data frame) that is then the first argument to the next function. This is by design. When possible, create functions that can be incorporated into a pipeline of operations.
+
+If you have used ggplot2, this is not unlike the layering of plot components into a `ggplot` object with the `+` operator. To make a scatter plot with a regression line, the initial `ggplot()` call is augmented with two additional operations:
+
+
+```r
+library(ggplot2)
+ggplot(mtcars, aes(x = wt, y = mpg)) +
+ geom_point() +
+ geom_smooth(method = lm)
+```
+
+While similar to the dplyr pipeline, note that the first argument to this pipeline is a data set (`mtcars`) and that each function call returns a `ggplot` object. Not all pipelines need to keep the returned values (plot objects) the same as the initial value (a data frame). Piping dplyr operations has acclimated many R users to expect a data frame back from a pipeline; as ggplot2 shows, this does not need to be the case. Pipelines are incredibly useful in modeling workflows, but modeling pipelines can return objects such as model components instead of a data frame.
+
+R has excellent tools for creating, changing, and operating on functions, making it a great language for functional programming. This approach can replace iterative loops in many situations, such as when a function returns a value without other side effects.^[Examples of function side effects could include changing global data or printing a value.]
+
+Let's look at an example. Suppose you are interested in the logarithm of the ratio of the fuel efficiency to the car weight. To those new to R and/or coming from other programming languages, a loop might seem like a good option:
+
+
+```r
+n <- nrow(mtcars)
+ratios <- rep(NA_real_, n)
+for (car in 1:n) {
+ ratios[car] <- log(mtcars$mpg[car]/mtcars$wt[car])
+}
+head(ratios)
+#> [1] 2.081 1.988 2.285 1.896 1.693 1.655
+```
+
+Those with more experience in R may know that there is a much simpler and faster vectorized version that can be computed by:
+
+
+```r
+ratios <- log(mtcars$mpg/mtcars$wt)
+```
+
+However, in many real-world cases, the element-wise operation of interest is too complex for a vectorized solution. In such a case, a good approach is to write a function to do the computations. When we design for functional programming, it is important that the output only depends on the inputs and that the function has no side effects. Violations of these ideas in the following function are shown with comments:
+
+
+```r
+compute_log_ratio <- function(mpg, wt) {
+ log_base <- getOption("log_base", default = exp(1)) # gets external data
+ results <- log(mpg/wt, base = log_base)
+ print(mean(results)) # prints to the console
+ done <<- TRUE # sets external data
+ results
+}
+```
+
+A better version would be:
+
+
+```r
+compute_log_ratio <- function(mpg, wt, log_base = exp(1)) {
+ log(mpg/wt, base = log_base)
+}
+```
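+
+Because `log()` is vectorized, this improved function can also be applied directly to entire columns; a quick check (output shown in the same style as the chunks above):
+
+```r
+# Same results as the loop and the vectorized versions shown earlier:
+head(compute_log_ratio(mtcars$mpg, mtcars$wt))
+#> [1] 2.081 1.988 2.285 1.896 1.693 1.655
+```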
+
+The purrr package contains tools for functional programming. Let's focus on the `map()` family of functions, which operates on vectors and always returns the same type of output. The most basic function, `map()`, always returns a list and uses the basic syntax of `map(vector, function)`. For example, to take the square root of the first three `mpg` values, we could:
+
+
+```r
+map(head(mtcars$mpg, 3), sqrt)
+#> [[1]]
+#> [1] 4.583
+#>
+#> [[2]]
+#> [1] 4.583
+#>
+#> [[3]]
+#> [1] 4.775
+```
+
+There are specialized variants of `map()` that return values when we know or expect that the function will generate one of the basic vector types. For example, since the square-root returns a double-precision number:
+
+
+```r
+map_dbl(head(mtcars$mpg, 3), sqrt)
+#> [1] 4.583 4.583 4.775
+```
+
+There are also mapping functions that operate across multiple vectors:
+
+
+```r
+log_ratios <- map2_dbl(mtcars$mpg, mtcars$wt, compute_log_ratio)
+head(log_ratios)
+#> [1] 2.081 1.988 2.285 1.896 1.693 1.655
+```
+
+The `map()` functions also allow for temporary, anonymous functions defined using the tilde character. The argument values are `.x` and `.y` for `map2()`:
+
+
+```r
+map2_dbl(mtcars$mpg, mtcars$wt, ~ log(.x/.y)) %>%
+ head()
+#> [1] 2.081 1.988 2.285 1.896 1.693 1.655
+```
+
+These examples have been trivial; in later sections, the same patterns will be applied to more complex modeling problems.
+
+:::rmdnote
+For functional programming in tidy modeling, functions should be defined so that functions like `map()` can be used for iterative computations.
+:::
+
+
+## Examples of Tidyverse Syntax
+
+Let's begin our discussion of tidyverse syntax by exploring more deeply what a tibble is and how tibbles work. Tibbles have slightly different rules than basic data frames in R. For example, tibbles naturally work with column names that are not syntactically valid variable names:
+
+
+```r
+# Wants valid names:
+data.frame(`variable 1` = 1:2, two = 3:4)
+#> variable.1 two
+#> 1 1 3
+#> 2 2 4
+# But can be coerced to use them with an extra option:
+df <- data.frame(`variable 1` = 1:2, two = 3:4, check.names = FALSE)
+df
+#> variable 1 two
+#> 1 1 3
+#> 2 2 4
+
+# But tibbles just work:
+tbbl <- tibble(`variable 1` = 1:2, two = 3:4)
+tbbl
+#> # A tibble: 2 × 2
+#>   `variable 1`   two
+#>          <int> <int>
+#> 1            1     3
+#> 2            2     4
+```
+
+Standard data frames enable _partial matching_ of arguments so that code using only a portion of the column names still works. Tibbles prevent this from happening since it can lead to accidental errors.
+
+
+```r
+df$tw
+#> [1] 3 4
+
+tbbl$tw
+#> Warning: Unknown or uninitialised column: `tw`.
+#> NULL
+```
+
+Tibbles also prevent one of the most common R errors: dropping dimensions. If a standard data frame subsets the columns down to a single column, the object is converted to a vector. Tibbles never do this:
+
+
+```r
+df[, "two"]
+#> [1] 3 4
+
+tbbl[, "two"]
+#> # A tibble: 2 × 1
+#>     two
+#>   <int>
+#> 1     3
+#> 2     4
+```
+
+There are various other advantages to using tibbles instead of data frames, such as better printing and more.^[Chapter 10 of @wickham2016 has more details on tibbles.]
+
+
+
+To demonstrate some syntax, let's use tidyverse functions to read in data that could be used in modeling. The data set comes from the city of Chicago's data portal and contains daily ridership data for the city's elevated train stations. The data set has columns for:
+
+- the station identifier (numeric),
+- the station name (character),
+- the date (character in `mm/dd/yyyy` format),
+- the day of the week (character), and
+- the number of riders (numeric).
+
+Our tidyverse pipeline will conduct the following tasks, in order:
+
+1. We will use the tidyverse package readr to read the data from the source website and convert them into a tibble. To do this, the `read_csv()` function can determine the type of data by reading an initial number of rows. Alternatively, if the column names and types are already known, a column specification can be created in R and passed to `read_csv()` (a sketch of such a specification appears after the pipeline code below).
+
+1. We select only the columns we need (dropping, for example, the station ID) and rename the column `stationname` to `station`. The function `select()` is used for this. When selecting columns, use either the column names or a dplyr selector function. When renaming, a new variable name can be declared using the argument format `new_name = old_name`.
+
+1. The date field is converted to the R date format using the `mdy()` function from the lubridate package. We also convert the ridership numbers to thousands. Both of these computations are executed using the `dplyr::mutate()` function.
+
+1. There are a small number of days that have more than one record of ridership numbers at certain stations. To mitigate this issue, we use the maximum number of rides for each station and day combination. We group the ridership data by station and day, and then summarize within each unique combination using the maximum statistic.
+
+The tidyverse code for these steps is:
+
+
+```r
+library(tidyverse)
+library(lubridate)
+
+url <- "http://bit.ly/raw-train-data-csv"
+
+all_stations <-
+ # Step 1: Read in the data.
+ read_csv(url) %>%
+ # Step 2: filter columns and rename stationname
+ dplyr::select(station = stationname, date, rides) %>%
+ # Step 3: Convert the character date field to a date encoding.
+ # Also, put the data in units of 1K rides
+ mutate(date = mdy(date), rides = rides / 1000) %>%
+ # Step 4: Summarize the multiple records using the maximum.
+ group_by(date, station) %>%
+ summarize(rides = max(rides), .groups = "drop")
+```
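+
+As an aside, step 1 noted that a column specification could be created and passed to `read_csv()`. A sketch of what that might look like; the names `station_id` and `daytype` for the two unused columns are assumptions, not taken from the data description above:
+
+```r
+library(readr)   # also attached by library(tidyverse) above
+
+ct_spec <- cols(
+  station_id  = col_double(),     # assumed name for the station identifier
+  stationname = col_character(),
+  date        = col_character(),  # converted to a proper date later with mdy()
+  daytype     = col_character(),  # assumed name for the day-of-week column
+  rides       = col_double()
+)
+
+# read_csv(url, col_types = ct_spec) would then skip the type-guessing step.
+```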
+
+This pipeline of operations illustrates why the tidyverse is popular. Each transformation is carried out by a simple, easy-to-understand function, and the series of steps is bundled together in a streamlined, readable way. The focus is on how the user interacts with the software. This approach enables more people to learn R and achieve their analysis goals, and adopting these same principles for modeling in R has the same benefits.
+
+## Chapter Summary
+
+This chapter introduced the tidyverse, with a focus on applications for modeling and how tidyverse design principles inform the tidymodels framework. Think of the tidymodels framework as applying tidyverse principles to the domain of building models. We described differences in conventions between the tidyverse and base R, and introduced two important components of the tidyverse system, tibbles and the pipe operator `%>%`. Data cleaning and processing can feel mundane at times, but these tasks are important for modeling in the real world; we illustrated how to use tibbles, the pipe, and tidyverse functions in an example data import and processing exercise.
diff --git a/tmwr-atlas/03-base-r.md b/tmwr-atlas/03-base-r.md
new file mode 100644
index 00000000..889cc8e5
--- /dev/null
+++ b/tmwr-atlas/03-base-r.md
@@ -0,0 +1,501 @@
+# A Review of R Modeling Fundamentals {#base-r}
+
+
+
+Before describing how to use tidymodels for applying tidy data principles to building models with R, let's review how models are created, trained, and used in the core R language (often called "base R"). This chapter is a brief illustration of core language conventions that are important to be aware of even if you were to never use base R for models at all. This chapter is not exhaustive but provides readers (especially those new to R) the basic, most commonly used motifs.
+
+The S language, on which R is based, has had a rich data analysis environment since the publication of @WhiteBook (commonly known as The White Book). This version of S introduced standard infrastructure components familiar to R users today, such as symbolic model formulae, model matrices, and data frames, as well as standard object-oriented programming methods for data analysis. These user interfaces have not substantively changed since then.
+
+## An Example
+
+To demonstrate some fundamentals for modeling in base R, let's use experimental data from @mcdonald2009, by way of @mangiafico2015, on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species: _O. exclamationis_ and _O. niveus_. The data are contained in a data frame called `crickets` with a total of 31 data points. These data are shown in Figure \@ref(fig:cricket-plot) using the following ggplot2 code.
+
+
+```r
+library(tidyverse)
+
+data(crickets, package = "modeldata")
+names(crickets)
+
+# Plot the temperature on the x-axis, the chirp rate on the y-axis. The plot
+# elements will be colored differently for each species:
+ggplot(crickets,
+ aes(x = temp, y = rate, color = species, pch = species, lty = species)) +
+ # Plot points for each data point and color by species
+ geom_point(size = 2) +
+ # Show a simple linear model fit created separately for each species:
+ geom_smooth(method = lm, se = FALSE, alpha = 0.5) +
+ scale_color_brewer(palette = "Paired") +
+ labs(x = "Temperature (C)", y = "Chirp Rate (per minute)")
+```
+
+
+
+```
+#> [1] "species" "temp" "rate"
+```
+
+
+
+
+(\#fig:cricket-plot)Relationship between chirp rate and temperature for two different species of cricket.
+
+
+The data exhibit fairly linear trends for each species. For a given temperature, _O. exclamationis_ appears to chirp more per minute than the other species. For an inferential model, the researchers might have specified the following null hypotheses prior to seeing the data:
+
+* Temperature has no effect on the chirp rate.
+
+* There are no differences between the species' chirp rate.
+
+There may be some scientific or practical value in predicting the chirp rate but in this example we will focus on inference.
+
+To fit an ordinary linear model in R, the `lm()` function is commonly used. The important arguments to this function are a model formula and a data frame that contains the data. The formula is _symbolic_. For example, the simple formula:
+
+```r
+rate ~ temp
+```
+specifies that the chirp rate is the outcome (since it is on the left-hand side of the tilde `~`) and that the temperature value is the predictor.^[Most model functions implicitly add an intercept column.] Suppose the data contained the time of day in which the measurements were obtained in a column called `time`. The formula:
+
+```r
+rate ~ temp + time
+```
+
+would not add the time and temperature values together. This formula would symbolically represent that temperature and time should be added as separate _main effects_ to the model. A main effect is a model term that contains a single predictor variable.
+
+There are no time measurements in these data but the species can be added to the model in the same way:
+
+```r
+rate ~ temp + species
+```
+
+Species is not a quantitative variable; in the data frame, it is represented as a factor column with levels `"O. exclamationis"` and `"O. niveus"`. The vast majority of model functions cannot operate on non-numeric data. For species, the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as "dummy variables") in place of the original qualitative values. In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is `"O. exclamationis"` and a value of one when the data correspond to `"O. niveus"`. The underlying formula machinery automatically converts these values for the data set used to create the model, as well as for any new data points (for example, when the model is used for prediction).
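+
+One way to see this encoding directly is base R's `model.matrix()`. A quick sketch with the formula above (the indicator column's name matches the coefficient names printed later in this chapter):
+
+```r
+# The formula machinery adds an intercept and a single binary indicator column
+# for the second factor level, O. niveus:
+colnames(model.matrix(rate ~ temp + species, data = crickets))
+#> [1] "(Intercept)"      "temp"             "speciesO. niveus"
+```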
+
+:::rmdnote
+Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Chapter \@ref(recipes).
+:::
+
+The model formula `rate ~ temp + species` creates a model with different y-intercepts for each species; the slopes of the regression lines could be different for each species as well. To accommodate this structure, an interaction term can be added to the model. This can be specified in a few different ways, and the most basic uses the colon:
+
+```r
+rate ~ temp + species + temp:species
+
+# A shortcut can be used to expand all main effects and
+# two-way interactions among these predictors:
+rate ~ (temp + species)^2
+
+# Another shortcut to expand factors to include all possible
+# interactions (equivalent for this example):
+rate ~ temp * species
+```
+
+In addition to the convenience of automatically creating indicator variables, the formula offers a few other niceties:
+
+* _In-line_ functions can be used in the formula. For example, to use the natural log of the temperature, we can create the formula `rate ~ log(temp)`. Since the formula is symbolic by default, literal math can also be applied to the predictors using the identity function `I()`. To use Fahrenheit units, the formula could be `rate ~ I( (temp * 9/5) + 32 )` to convert from Celsius.
+
+* R has many functions that are useful inside of formulas. For example, `poly(x, 3)` adds linear, quadratic, and cubic terms for `x` to the model as main effects. The splines package also has several functions to create nonlinear spline terms in the formula.
+
+* For data sets where there are many predictors, the period shortcut is available. The period represents main effects for all of the columns that are not on the left-hand side of the tilde. Using `~ (.)^3` would add main effects as well as all two- and three-variable interactions to the model.
+
+Returning to our chirping crickets, let's use a two-way interaction model. In this book, we use the suffix `_fit` for R objects that are fitted models.
+
+
+```r
+interaction_fit <- lm(rate ~ (temp + species)^2, data = crickets)
+
+# To print a short summary of the model:
+interaction_fit
+#>
+#> Call:
+#> lm(formula = rate ~ (temp + species)^2, data = crickets)
+#>
+#> Coefficients:
+#> (Intercept) temp speciesO. niveus
+#> -11.041 3.751 -4.348
+#> temp:speciesO. niveus
+#> -0.234
+```
+
+This output is a little hard to read. For the species indicator variables, R mashes the variable name (`species`) together with the factor level (`O. niveus`) with no delimiter.
+
+Before going into any inferential results for this model, the fit should be assessed using diagnostic plots. We can use the `plot()` method for `lm` objects. This method produces a set of four plots for the object, each showing different aspects of the fit, as shown in Figure \@ref(fig:interaction-plots).
+
+
+```r
+# Place two plots next to one another:
+par(mfrow = c(1, 2))
+
+# Show residuals vs predicted values:
+plot(interaction_fit, which = 1)
+
+# A normal quantile plot on the residuals:
+plot(interaction_fit, which = 2)
+```
+
+
+
+
+(\#fig:interaction-plots)Residual diagnostic plots for the linear model with interactions, which appear reasonable enough to conduct inferential analysis.
+
+
+:::rmdnote
+When it comes to the technical details of evaluating expressions, R is _lazy_ (as opposed to eager). This means that model fitting functions typically compute the minimum possible quantities at the last possible moment. For example, if you are interested in the coefficient table for each model term, this is not automatically computed with the model but is instead computed via the `summary()` method.
+:::
+
+Our next order of business with the crickets is to assess if the inclusion of the interaction term is necessary. The most appropriate approach for this model is to re-compute the model without the interaction term and use the `anova()` method.
+
+
+```r
+# Fit a reduced model:
+main_effect_fit <- lm(rate ~ temp + species, data = crickets)
+
+# Compare the two:
+anova(main_effect_fit, interaction_fit)
+#> Analysis of Variance Table
+#>
+#> Model 1: rate ~ temp + species
+#> Model 2: rate ~ (temp + species)^2
+#> Res.Df RSS Df Sum of Sq F Pr(>F)
+#> 1 28 89.3
+#> 2 27 85.1 1 4.28 1.36 0.25
+```
+
+This statistical test generates a p-value of 0.25. This implies that there is a lack of evidence against the null hypothesis that the interaction term is not needed by the model. For this reason, we will conduct further analysis on the model without the interaction.
+
+Residual plots should be re-assessed to make sure that our theoretical assumptions are valid enough to trust the p-values produced by the model (plots not shown here but spoiler alert: they are).
+
+We can use the `summary()` method to inspect the coefficients, standard errors, and p-values of each model term:
+
+
+```r
+summary(main_effect_fit)
+#>
+#> Call:
+#> lm(formula = rate ~ temp + species, data = crickets)
+#>
+#> Residuals:
+#> Min 1Q Median 3Q Max
+#> -3.013 -1.130 -0.391 0.965 3.780
+#>
+#> Coefficients:
+#> Estimate Std. Error t value Pr(>|t|)
+#> (Intercept) -7.2109 2.5509 -2.83 0.0086 **
+#> temp 3.6028 0.0973 37.03 < 2e-16 ***
+#> speciesO. niveus -10.0653 0.7353 -13.69 6.3e-14 ***
+#> ---
+#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+#>
+#> Residual standard error: 1.79 on 28 degrees of freedom
+#> Multiple R-squared: 0.99, Adjusted R-squared: 0.989
+#> F-statistic: 1.33e+03 on 2 and 28 DF, p-value: <2e-16
+```
+
+The chirp rate for each species increases by 3.6 chirps as the temperature increases by a single degree. This term shows strong statistical significance as evidenced by the p-value. The species term has a value of -10.07. This indicates that, across all temperature values, _O. niveus_ has a chirp rate that is about 10 fewer chirps per minute than _O. exclamationis_. Similar to the temperature term, the species effect is associated with a very small p-value.
+
+The only issue in this analysis is the intercept value. It indicates that at 0 C, there are negative chirps per minute for both species. While this doesn't make sense, the data only go as low as 17.2 C and interpreting the model at 0 C would be an extrapolation. This would be a bad idea. That being said, the model fit is good within the _applicable range_ of the temperature values; the conclusions should be limited to the observed temperature range.
+
+If we needed to estimate the chirp rate at a temperature that was not observed in the experiment, we could use the `predict()` method. It takes the model object and a data frame of new values for prediction. For example, the chirp rate for _O. exclamationis_ at temperatures between 15 C and 20 C can be estimated via:
+
+
+```r
+new_values <- data.frame(species = "O. exclamationis", temp = 15:20)
+predict(main_effect_fit, new_values)
+#> 1 2 3 4 5 6
+#> 46.83 50.43 54.04 57.64 61.24 64.84
+```
+
+:::rmdwarning
+Note that the non-numeric value of `species` is passed to the predict method, as opposed to the numeric, binary indicator variable.
+:::
+
+While this analysis has obviously not been an exhaustive demonstration of R's modeling capabilities, it does highlight some major features important for the rest of this book:
+
+* The language has an expressive syntax for specifying model terms for both simple and quite complex models.
+
+* The R formula method has many conveniences for modeling that are also applied to new data when predictions are generated.
+
+* There are numerous helper functions (e.g., `anova()`, `summary()` and `predict()`) that you can use to conduct specific calculations after the fitted model is created.
+
+Finally, as previously mentioned, this framework was first published in 1992. Most of these ideas and methods were developed in that period but have remained remarkably relevant to this day. It highlights that the S language and, by extension, R have been designed for data analysis since their inception.
+
+## What Does the R Formula Do? {#formula}
+
+The R model formula is used by many modeling packages. It usually serves multiple purposes:
+
+* The formula defines the columns that are used by the model.
+
+* The standard R machinery uses the formula to encode the columns into an appropriate format.
+
+* The roles of the columns are defined by the formula.
+
+For the most part, practitioners' understanding of what the formula does is dominated by the last purpose. Our focus when typing out a formula is often to declare how the columns should be used. For example, the previous specification we discussed sets up predictors to be used in a specific way:
+
+```r
+(temp + species)^2
+```
+
+Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Chapter \@ref(recipes)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.
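+
+Extending the earlier `model.matrix()` sketch to this interaction formula shows both behaviors at once: the indicator column created for `species` and its product with `temp`:
+
+```r
+colnames(model.matrix(rate ~ (temp + species)^2, data = crickets))
+#> [1] "(Intercept)"           "temp"                  "speciesO. niveus"
+#> [4] "temp:speciesO. niveus"
+```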
+
+:::rmdwarning
+This is an important point which will come up multiple times in this text, especially when we discuss more complex feature engineering in Chapter \@ref(recipes) and beyond. The formula in R has some limitations and our approaches to overcoming them contend with all three aspects.
+:::
+
+## Why Tidiness is Important for Modeling {#tidiness-modeling}
+
+One of the strengths of R is that it encourages developers to create a user-interface that fits their needs. As an example, here are three common methods for creating a scatter plot of two numeric variables in a data frame called `plot_data`:
+
+
+```r
+plot(plot_data$x, plot_data$y)
+
+library(lattice)
+xyplot(y ~ x, data = plot_data)
+
+library(ggplot2)
+ggplot(plot_data, aes(x = x, y = y)) + geom_point()
+```
+
+In these three cases, separate groups of developers devised three distinct interfaces for the same task. Each has advantages and disadvantages.
+
+In comparison, the _Python Developer's Guide_ espouses the notion that, when approaching a problem:
+
+> "There should be one -- and preferably only one -- obvious way to do it."
+
+R is quite different from Python in this respect. An advantage of R's diversity of interfaces is that it can evolve over time and fit different types of needs for different users.
+
+Unfortunately, some of the syntactical diversity is due to a focus on the needs of the person _developing_ the code instead of the needs of the person _using_ the code. Inconsistencies between packages can be a stumbling block to R users.
+
+Suppose your modeling project has an outcome with two classes. There are a variety of statistical and machine learning models you could choose from. In order to produce a class probability estimate for each sample, it is common for a model function to have a corresponding `predict()` method. However, there is significant heterogeneity in the argument values used by those methods to make class probability predictions; this heterogeneity can be difficult for even experienced users to navigate. A sampling of these argument values for different models is shown in Table \@ref(tab:probability-args).
+
+
+Table: (\#tab:probability-args)Heterogeneous argument names for different modeling functions.
+
+|Function |Package |Code |
+|:------------|:----------|:-------------------------------------------|
+|lda() |MASS |predict(object) |
+|glm() |stats |predict(object, type = "response") |
+|gbm() |gbm |predict(object, type = "response", n.trees) |
+|mda() |mda |predict(object, type = "posterior") |
+|rpart() |rpart |predict(object, type = "prob") |
+|various |RWeka |predict(object, type = "probability") |
+|logitboost() |LogitBoost |predict(object, type = "raw", nIter) |
+|pamr.train() |pamr |pamr.predict(object, type = "posterior") |
+
+Note that the last example has a custom function to make predictions instead of using the more common `predict()` interface (the generic `predict()` method). This lack of consistency is a barrier to day-to-day usage of R for modeling.
+
+As another example of unpredictability, the R language has conventions for missing data which are handled inconsistently. The general rule is that missing data propagate more missing data; the average of a set of values with a missing data point is itself missing and so on. When models make predictions, the vast majority require all of the predictors to have complete values. There are several options baked in to R at this point with the generic function `na.action()`. This sets the policy for how a function should behave if there are missing values. The two most common policies are `na.fail()` and `na.omit()`. The former produces an error if missing data are present while the latter removes the missing data prior to calculations by case-wise deletion. From our previous example:
+
+
+```r
+# Add a missing value to the prediction set
+new_values$temp[1] <- NA
+
+# The predict method for `lm` defaults to `na.pass`:
+predict(main_effect_fit, new_values)
+#> 1 2 3 4 5 6
+#> NA 50.43 54.04 57.64 61.24 64.84
+
+# Alternatively
+predict(main_effect_fit, new_values, na.action = na.fail)
+#> Error in na.fail.default(structure(list(temp = c(NA, 16L, 17L, 18L, 19L, : missing values in object
+
+predict(main_effect_fit, new_values, na.action = na.omit)
+#> 2 3 4 5 6
+#> 50.43 54.04 57.64 61.24 64.84
+```
+
+From a user's point of view, `na.omit()` can be problematic. In our example, `new_values` has 6 rows but only 5 would be returned with `na.omit()`. To adjust for this, the user would have to determine which row had the missing value and interleave a missing value in the appropriate place if the predictions were merged into `new_values`.^[A base R policy called `na.exclude()` does exactly this.] While it is rare that a prediction function uses `na.omit()` as its missing data policy, this does occur. Users who have determined this as the cause of an error in their code find it _quite memorable_.
+
+To resolve the usage issues described here, the tidymodels packages have a set of design goals. Most of the tidymodels design goals fall under the existing rubric of "Design for Humans" from the tidyverse [@tidyverse], but with specific applications for modeling code. There are a few additional tidymodels design goals that complement those of the tidyverse. Some examples:
+
+* R has excellent capabilities for object-oriented programming, and we use this in lieu of creating new function names (such as a hypothetical new `predict_samples()` function).
+
+* _Sensible defaults_ are very important. Also, functions should have no default for arguments when it is more appropriate to force the user to make a choice (e.g., the file name argument for `read_csv()`).
+
+* Similarly, argument values whose default can be derived from the data should be. For example, for `glm()` the `family` argument could check the type of data in the outcome and, if no `family` was given, a default could be determined internally.
+
+* Functions should take the *data structures that users have* as opposed to the data structure that developers want. For example, a model function's only interface should not be constrained to matrices. Frequently, users will have non-numeric predictors such as factors.
+
+Many of these ideas are described in the tidymodels guidelines for model implementation. In subsequent chapters, we will illustrate examples of existing issues, along with their solutions.
+
+:::rmdnote
+There are a few existing R packages that provide a unified interface to harmonize these heterogeneous modeling APIs, such as caret and mlr. The tidymodels framework is similar to these in adopting a unification of the function interface, as well as enforcing consistency in the function names and return values. It is different in its opinionated design goals and modeling implementation, discussed in detail throughout this book.
+:::
+
+The `broom::tidy()` function, which we use throughout this book, is another tool for standardizing the structure of R objects. It can return many types of R objects in a more usable format. For example, suppose that predictors are being screened based on their correlation to the outcome column. Using `purrr::map()`, the results from `cor.test()` can be returned in a list for each predictor:
+
+
+```r
+corr_res <- map(mtcars %>% select(-mpg), cor.test, y = mtcars$mpg)
+
+# The first of ten results in the vector:
+corr_res[[1]]
+#>
+#> Pearson's product-moment correlation
+#>
+#> data: .x[[i]] and mtcars$mpg
+#> t = -8.9, df = 30, p-value = 6e-10
+#> alternative hypothesis: true correlation is not equal to 0
+#> 95 percent confidence interval:
+#> -0.9258 -0.7163
+#> sample estimates:
+#> cor
+#> -0.8522
+```
+
+If we want to use these results in a plot, the standard format of hypothesis test results are not very useful. The `tidy()` method can return this as a tibble with standardized names:
+
+
+```r
+library(broom)
+
+tidy(corr_res[[1]])
+#> # A tibble: 1 × 8
+#>   estimate statistic  p.value parameter conf.low conf.high method        alternative
+#>      <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr>         <chr>
+#> 1   -0.852     -8.92 6.11e-10        30   -0.926    -0.716 Pearson's pr… two.sided
+```
+
+These results can be "stacked" and added to a `ggplot()`, as shown in Figure \@ref(fig:corr-plot).
+
+
+```r
+corr_res %>%
+ # Convert each to a tidy format; `map_dfr()` stacks the data frames
+ map_dfr(tidy, .id = "predictor") %>%
+ ggplot(aes(x = fct_reorder(predictor, estimate))) +
+ geom_point(aes(y = estimate)) +
+ geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .1) +
+ labs(x = NULL, y = "Correlation with mpg")
+```
+
+
+
+
+(\#fig:corr-plot)Correlations (and 95% confidence intervals) between predictors and the outcome in the `mtcars` data set.
+
+
+Creating such a plot is possible using core R language functions, but automatically reformatting the results makes for more concise code with less potential for errors.
+
+## Combining Base R Models and the Tidyverse
+
+R modeling functions from the core language or other R packages can be used in conjunction with the tidyverse, especially with the dplyr, purrr, and tidyr packages. For example, if we wanted to fit separate models for each cricket species, we can first break out the cricket data by this column using `dplyr::group_nest()`:
+
+
+```r
+split_by_species <-
+ crickets %>%
+ group_nest(species)
+split_by_species
+#> # A tibble: 2 × 2
+#>   species          data
+#>   <fct>            <list<tibble[,2]>>
+#> 1 O. exclamationis           [14 × 2]
+#> 2 O. niveus                  [17 × 2]
+```
+
+The `data` column contains the `rate` and `temp` columns from `crickets` in a _list column_. From this, the `purrr::map()` function can create individual models for each species:
+
+
+```r
+model_by_species <-
+ split_by_species %>%
+ mutate(model = map(data, ~ lm(rate ~ temp, data = .x)))
+model_by_species
+#> # A tibble: 2 × 3
+#>   species          data               model
+#>   <fct>            <list<tibble[,2]>> <list>
+#> 1 O. exclamationis           [14 × 2] <lm>
+#> 2 O. niveus                  [17 × 2] <lm>
+```
+
+To collect the coefficients for each of these models, use `broom::tidy()` to convert them to a consistent data frame format so that they can be unnested:
+
+
+```r
+model_by_species %>%
+ mutate(coef = map(model, tidy)) %>%
+ select(species, coef) %>%
+ unnest(cols = c(coef))
+#> # A tibble: 4 × 6
+#>   species          term        estimate std.error statistic  p.value
+#>   <fct>            <chr>          <dbl>     <dbl>     <dbl>    <dbl>
+#> 1 O. exclamationis (Intercept)   -11.0      4.77       -2.32 3.90e- 2
+#> 2 O. exclamationis temp            3.75     0.184      20.4  1.10e-10
+#> 3 O. niveus        (Intercept)   -15.4      2.35       -6.56 9.07e- 6
+#> 4 O. niveus        temp            3.52     0.105      33.6  1.57e-15
+```
+
+:::rmdnote
+List columns can be very powerful in modeling projects. List columns provide containers for any type of R objects, from a fitted model itself to the important data frame structure.
+:::
+
+## The tidymodels Metapackage
+
+The tidyverse (Chapter \@ref(tidyverse)) is designed as a set of modular R packages, each with a fairly narrow scope. The tidymodels framework follows a similar design. For example, the rsample package focuses on data splitting and resampling. Although resampling methods are critical to other activities of modeling (e.g., measuring performance), they reside in a single package and performance metrics are contained in a different, separate package, yardstick. There are many benefits to adopting this philosophy of modular packages, from less bloated model deployment to smoother package maintenance.
+
+
+
+The downside to this philosophy is that there are a lot of packages in the tidymodels framework. To compensate for this, the tidymodels _package_ (which you can think of as a "metapackage" like the tidyverse package) loads a core set of tidymodels and tidyverse packages. Loading the package shows which packages are attached:
+
+
+```r
+library(tidymodels)
+#> ── Attaching packages ─────────────────────────────────────────── tidymodels 0.2.0 ──
+#> ✓ broom 0.7.12 ✓ recipes 0.2.0
+#> ✓ dials 0.1.1 ✓ rsample 0.1.1
+#> ✓ dplyr 1.0.8 ✓ tibble 3.1.6
+#> ✓ ggplot2 3.3.5 ✓ tidyr 1.2.0
+#> ✓ infer 1.0.0 ✓ tune 0.2.0
+#> ✓ modeldata 0.1.1 ✓ workflows 0.2.6
+#> ✓ parsnip 0.2.1.9001 ✓ workflowsets 0.2.1
+#> ✓ purrr 0.3.4 ✓ yardstick 0.0.9
+#> ── Conflicts ────────────────────────────────────────────── tidymodels_conflicts() ──
+#> x purrr::discard() masks scales::discard()
+#> x dplyr::filter() masks stats::filter()
+#> x dplyr::lag() masks stats::lag()
+#> x recipes::step() masks stats::step()
+#> • Learn how to get started at https://www.tidymodels.org/start/
+```
+
+If you have used the tidyverse, you'll notice some familiar names as a few tidyverse packages, such as dplyr and ggplot2, are loaded together with the tidymodels packages. We've already said that the tidymodels framework applies tidyverse principles to modeling, but the tidymodels framework also literally builds on some of the most fundamental tidyverse packages like these.
+
+Loading the metapackage also shows if there are function naming conflicts with previously loaded packages. As an example of a naming conflict, before loading tidymodels, invoking the `filter()` function will execute the function in the stats package. After loading tidymodels, it will execute the dplyr function of the same name.
+
+There are a few ways to handle naming conflicts. The function can be called with its namespace (e.g., `stats::filter()`). This is not bad practice but it does make the code less readable.
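+
+For example, a small sketch of fully qualified calls (both functions exist regardless of which packages are attached):
+
+```r
+dplyr::filter(mtcars, cyl == 6)          # always dplyr's row-filtering verb
+stats::filter(presidents, rep(1/3, 3))   # always the time series filter in stats
+```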
+
+Another option is to use the conflicted package. We can set a rule that remains in effect until the end of the R session to ensure that one specific function will always run if no namespace is given in the code. As an example, if we prefer the dplyr version of the previous function:
+
+
+```r
+library(conflicted)
+conflict_prefer("filter", winner = "dplyr")
+```
+
+For convenience, tidymodels contains a function that captures most of the common naming conflicts that we might encounter:
+
+
+```r
+tidymodels_prefer(quiet = FALSE)
+#> [conflicted] Will prefer dplyr::filter over any other package
+#> [conflicted] Will prefer dplyr::select over any other package
+#> [conflicted] Will prefer dplyr::slice over any other package
+#> [conflicted] Will prefer dplyr::rename over any other package
+#> [conflicted] Will prefer dials::neighbors over any other package
+#> [conflicted] Will prefer parsnip::fit over any other package
+#> [conflicted] Will prefer parsnip::bart over any other package
+#> [conflicted] Will prefer parsnip::pls over any other package
+#> [conflicted] Will prefer purrr::map over any other package
+#> [conflicted] Will prefer recipes::step over any other package
+#> [conflicted] Will prefer themis::step_downsample over any other package
+#> [conflicted] Will prefer themis::step_upsample over any other package
+#> [conflicted] Will prefer tune::tune over any other package
+#> [conflicted] Will prefer yardstick::precision over any other package
+#> [conflicted] Will prefer yardstick::recall over any other package
+#> [conflicted] Will prefer yardstick::spec over any other package
+#> ── Conflicts ───────────────────────────────────────────────── tidymodels_prefer() ──
+```
+
+:::rmdwarning
+Be aware that using this function opts you in to using `conflicted::conflict_prefer()` for all namespace conflicts, making every conflict an error and forcing you to choose which function to use. The function `tidymodels::tidymodels_prefer()` handles the most common conflicts from tidymodels functions, but you will need to handle other conflicts in your R session yourself.
+:::
+
+## Chapter Summary
+
+This chapter reviewed core R language conventions for creating and using models that are an important foundation for the rest of this book. The formula operator is an expressive and important aspect of fitting models in R and often serves multiple purposes in non-tidymodels functions. Traditional R approaches to modeling have some limitations, especially when it comes to fluently handling and visualizing model output. The tidymodels metapackage applies tidyverse design philosophy to modeling packages.
diff --git a/tmwr-atlas/04-ames.md b/tmwr-atlas/04-ames.md
new file mode 100644
index 00000000..525b9a8c
--- /dev/null
+++ b/tmwr-atlas/04-ames.md
@@ -0,0 +1,160 @@
+
+
+# (PART\*) Modeling Basics {-}
+
+# The Ames Housing Data {#ames}
+
+In this chapter, we'll introduce the Ames housing data set [@ames], which we will use in modeling examples throughout this book. Exploratory data analysis, like what we walk through in this chapter, is an important first step in building a reliable model. The data set contains information on 2,930 properties in Ames, Iowa, including columns related to:
+
+ * house characteristics (bedrooms, garage, fireplace, pool, porch, etc.),
+ * location (neighborhood),
+ * lot information (zoning, shape, size, etc.),
+ * ratings of condition and quality, and
+ * sale price.
+
+:::rmdnote
+Our modeling goal is to predict the sale price of a house based on other information we have, like its characteristics and location.
+:::
+
+The raw housing data are provided in @ames, but in our analyses in this book, we use a transformed version available in the modeldata package. This version has several changes and improvements to the data. For example, the longitude and latitude values have been determined for each property. Also, some columns were modified to be more analysis ready. For example:
+
+ * In the raw data, if a house did not have a particular feature, it was implicitly encoded as missing. For example, there were 2,732 properties that did not have an alleyway. Instead of leaving these as missing, they were relabeled in the transformed version to indicate that no alley was available.
+
+ * The categorical predictors were converted to R's factor data type. While both the tidyverse and base R have moved away from importing data as factors by default, this data type is a better approach for storing qualitative data for modeling than simple strings.
+ * We removed a set of quality descriptors for each house since they are more like outcomes than predictors.
+
+To load the data:
+
+
+```r
+library(modeldata) # This is also loaded by the tidymodels package
+data(ames)
+
+# or, in one line:
+data(ames, package = "modeldata")
+
+dim(ames)
+#> [1] 2930 74
+```
+
+Figure \@ref(fig:ames-map) shows the locations of the properties in Ames. The locations will be revisited in the next section.
+
+
+
+
+(\#fig:ames-map)Property locations in Ames, IA.
+
+
+The void of data points in the center of Ames corresponds to Iowa State University.
+
+## Exploring Features of Homes in Ames
+
+Let's start our exploratory data analysis by focusing on the outcome we want to predict: the last sale price of the house (in USD). We can create a histogram to see the distribution of sale prices in Figure \@ref(fig:ames-sale-price-hist).
+
+
+```r
+library(tidymodels)
+tidymodels_prefer()
+
+ggplot(ames, aes(x = Sale_Price)) +
+ geom_histogram(bins = 50, col= "white")
+```
+
+
+
+
+(\#fig:ames-sale-price-hist)Sale prices of houses in Ames, Iowa.
+
+
+This plot shows us that the data are right-skewed; there are more inexpensive houses than expensive ones. The median sale price was \$160,000 and the most expensive house was \$755,000. When modeling this outcome, a strong argument can be made that the price should be log-transformed. The advantages of this type of transformation are that no houses would be predicted with negative sale prices and that errors in predicting expensive houses will not have an undue influence on the model. Also, from a statistical perspective, a logarithmic transform may also stabilize the variance in a way that makes inference more legitimate. We can use similar steps to now visualize the transformed data, shown in Figure \@ref(fig:ames-log-sale-price-hist).
+
+
+```r
+ggplot(ames, aes(x = Sale_Price)) +
+ geom_histogram(bins = 50, col= "white") +
+ scale_x_log10()
+```
+
+
+
+
+(\#fig:ames-log-sale-price-hist)Sale prices of houses in Ames, Iowa after a log (base 10) transformation.
+
+
+While not perfect, this will probably result in better models than using the untransformed data, for the reasons we just outlined previously.
+
+:::rmdwarning
+The disadvantages to transforming the outcome are mostly related to interpretation of model results.
+:::
+
+The units of the model coefficients might be more difficult to interpret, as will measures of performance. For example, the root mean squared error (RMSE) is a common performance metric that is used in regression models. It uses the difference between the observed and predicted values in its calculations. If the sale price is on the log scale, these differences (i.e. the residuals) are also on the log scale. It can be difficult to understand the quality of a model whose RMSE is 0.15 on such a log scale.
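+
+To build some intuition for such a value, we can back-transform it; an RMSE of 0.15 on the log10 scale corresponds to predictions that are off by a multiplicative factor of roughly 10^0.15 (about 41%) in either direction:
+
+```r
+10^0.15
+#> [1] 1.413
+```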
+
+Despite these drawbacks, the models used in this book utilize the log transformation for this outcome. _From this point on_, the outcome column is pre-logged in the `ames` data frame:
+
+
+```r
+ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
+```
+
+Another important aspect of these data for our modeling are their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, Figure \@ref(fig:ames-chull) duplicates the data from Figure \@ref(fig:ames-map) with convex hulls around the data from each neighborhood.
+
+
+
+
+(\#fig:ames-chull)Neighborhoods in Ames represented using a convex hull.
+
+
+We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to the campus of Iowa State University where there are no residential houses. Second, while there are a number of neighborhoods that are adjacent to each other, others are geographically isolated. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
+
+
+
+
+(\#fig:ames-timberland)Locations of homes in Timberland.
+
+
+Figure \@ref(fig:ames-mitchell) visualizes how the Meadow Village neighborhood in Southwest Ames is like an island of properties ensconced inside the sea of properties that make up the Mitchell neighborhood.
+
+
+
+
+(\#fig:ames-mitchell)Locations of homes in Meadow Village and Mitchell.
+
+
+A detailed inspection of the map also shows that the neighborhood labels are not completely reliable. For example, Figure \@ref(fig:ames-northridge) shows there are some properties labeled as being in Northridge that are surrounded by homes in the adjacent Somerset neighborhood.
+
+
+
+
+(\#fig:ames-northridge)Locations of homes in Somerset and Northridge.
+
+
+Also, there are ten isolated homes labeled as being in Crawford that you can see in Figure \@ref(fig:ames-crawford) but are not close to the majority of the other homes in that neighborhood:
+
+
+
+
+(\#fig:ames-crawford)Locations of homes in Crawford.
+
+
+Also notable is the "Iowa Department of Transportation (DOT) and Rail Road" neighborhood adjacent to the main road on the east side of Ames, shown in Figure \@ref(fig:ames-dot-rr). There are several clusters of homes within this neighborhood as well as some longitudinal outliers; the two homes furthest east are isolated from the other locations.
+
+
+
+
+(\#fig:ames-dot-rr)Homes labeled as 'Iowa Department of Transportation (DOT) and Rail Road'.
+
+
+As previously described in Chapter \@ref(software-modeling), it is critical to conduct exploratory data analysis prior to beginning any modeling. These housing data have characteristics that present interesting challenges about how the data should be processed and modeled. We describe many of these in later chapters. Some basic questions that could be examined during this exploratory stage include:
+
+ * Are there any odd or noticeable things about the distributions of the individual predictors? Is there much skewness or any pathological distributions?
+
+ * Are there high correlations between predictors? For example, there are multiple predictors related to the size of the house. Are some redundant?
+
+ * Are there associations between predictors and the outcomes?
+
+Many of these questions will be revisited as these data are used in upcoming examples.
+
+## Chapter Summary {#ames-summary}
+
+This chapter introduced the Ames housing dataset and investigated some of its characteristics. This data set will be used in later chapters to demonstrate tidymodels syntax. Exploratory data analysis like this is an essential component of any modeling project; EDA uncovers information that contributes to better modeling practice.
+
+The important code for preparing the Ames data set that we will carry forward into subsequent chapters is:
+
+
+
+```r
+library(tidymodels)
+data(ames)
+ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
+```
diff --git a/tmwr-atlas/05-data-spending.md b/tmwr-atlas/05-data-spending.md
new file mode 100644
index 00000000..195ac2d0
--- /dev/null
+++ b/tmwr-atlas/05-data-spending.md
@@ -0,0 +1,150 @@
+
+
+# Spending our Data {#splitting}
+
+There are several steps to create a useful model, including parameter estimation, model selection and tuning, and performance assessment. At the start of a new project, there is usually an initial finite pool of data available for all these tasks, which we can think of as an available data budget. How should the data be applied to different steps or tasks? The idea of _data spending_ is an important first consideration when modeling, especially as it relates to empirical validation.
+
+:::rmdwarning
+When data are reused for multiple tasks, instead of carefully "spent" from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors.
+:::
+
+When there are copious amounts of data available, a smart strategy is to allocate specific subsets of data for different tasks, as opposed to allocating the largest possible amount (or even all) to the model parameter estimation only. For example, one possible strategy (when both data and predictors are abundant) is to spend a specific subset of data to determine which predictors are informative, before considering parameter estimation at all. If the initial pool of data available is not huge, there will be some overlap in how and when our data is "spent" or allocated, and a solid methodology for data spending is important.
+
+This chapter demonstrates the basics of _splitting_ (i.e., creating a data budget for) our initial pool of samples for different purposes.
+
+## Common Methods for Splitting Data {#splitting-methods}
+
+The primary approach for empirical model validation is to split the existing pool of data into two distinct sets, the training set and the test set. One portion of the data is used to develop and optimize the model. This _training set_ is usually the majority of the data. These data are a sandbox for model building where different models can be fit, feature engineering strategies are investigated, and so on. We as modeling practitioners spend the vast majority of the modeling process using the training set as the substrate to develop the model.
+
+The other portion of the data is placed into the _test set_. This is held in reserve until one or two models are chosen as the methods that are most likely to succeed. The test set is then used as the final arbiter to determine the efficacy of the model. It is critical to only look at the test set once; otherwise, it becomes part of the modeling process.
+
+:::rmdnote
+How should we conduct this split of the data? This depends on the context.
+:::
+
+Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [rsample](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary at the end of Chapter \@ref(ames):
+
+
+```r
+library(tidymodels)
+tidymodels_prefer()
+
+# Set the random number stream using `set.seed()` so that the results can be
+# reproduced later.
+set.seed(501)
+
+# Save the split information for an 80/20 split of the data
+ames_split <- initial_split(ames, prop = 0.80)
+ames_split
+#> <Training/Testing/Total>
+#> <2344/586/2930>
+```
+
+The printed information denotes the amount of data in the training set ($n = 2,344$), the amount in the test set ($n = 586$), and the size of the original pool of samples ($n = 2,930$).
+
+The object `ames_split` is an `rsplit` object and only contains the partitioning information; to get the resulting data sets, we apply two more functions:
+
+
+```r
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+dim(ames_train)
+#> [1] 2344 74
+```
+
+These objects are data frames with the same columns as the original data but only the appropriate rows for each set.
+
+Simple random sampling is appropriate in many cases but there are exceptions. When there is a dramatic _class imbalance_ in classification problems, one class occurs much less frequently than another. Using a simple random sample may haphazardly allocate these infrequent samples disproportionately into the training or test set. To avoid this, _stratified sampling_ can be used. The training/test split is conducted separately within each class and then these subsamples are combined into the overall training and test set. For regression problems, the outcome data can be artificially binned into quartiles and then stratified sampling can be conducted four separate times. This is an effective method for keeping the distributions of the outcome similar between the training and test set. The distribution of the sale price outcome for the Ames housing data is shown in Figure \@ref(fig:ames-sale-price).
+
+(\#fig:ames-sale-price)The distribution of the sale price (in log units) for the Ames housing data. The vertical lines indicate the quartiles of the data.
+
+As previously discussed, the sale price distribution is right-skewed, with proportionally more inexpensive houses than expensive houses on either side of the center of the distribution. The worry here with simple splitting is that the more expensive houses would not be well represented in the training set; this would increase the risk that our model would be ineffective at predicting the price for such properties. The dotted vertical lines in Figure \@ref(fig:ames-sale-price) indicate the four quartiles for these data. A stratified random sample would conduct the 80/20 split within each of these data subsets and then pool the results together. In rsample, this is achieved using the `strata` argument:
+
+
+```r
+set.seed(502)
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+dim(ames_train)
+#> [1] 2342 74
+```
+
+Only a single column can be used for stratification.
+
+:::rmdnote
+There is very little downside to using stratified sampling.
+:::
+
+Are there situations when random sampling is not the best choice? One case is when the data have a significant time component, such as time series data. Here, it is more common to use the most recent data as the test set. The rsample package contains a function called `initial_time_split()` that is very similar to `initial_split()`. Instead of using random sampling, the `prop` argument denotes what proportion of the first part of the data should be used as the training set; the function assumes that the data have been pre-sorted in an appropriate order.
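+
+As a hedged sketch (not part of the Ames analysis), a time-based split might look like the following, using the `drinks` time series from the modeldata package, with the most recent 25% of rows held out for testing:
+
+
+```r
+library(tidymodels)
+tidymodels_prefer()
+
+data(drinks, package = "modeldata")
+
+# The rows are already ordered by date; the first 75% become the training set
+drinks_split <- initial_time_split(drinks, prop = 0.75)
+drinks_train <- training(drinks_split)
+drinks_test  <- testing(drinks_split)
+```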
+
+:::rmdnote
+As we've mentioned, the proportion of data that should be allocated for splitting is highly dependent on the context of the problem at hand. Too little data in the training set hampers the model's ability to find appropriate parameter estimates. Conversely, too little data in the test set lowers the quality of the performance estimates. There are parts of the statistics community that eschew test sets in general because they believe all of the data should be used for parameter estimation. While there is merit to this argument, it is good modeling practice to have an unbiased set of observations as the final arbiter of model quality. A test set should be avoided only when the data are pathologically small.
+:::
+
+## What About a Validation Set?
+
+Previously, when describing the goals of data splitting, we singled out the test set as the data that should be used to conduct a proper evaluation of model performance on the final model(s). This raises the question: "How can we tell what is best if we don't measure performance until we use the test set?"
+
+It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Chapter \@ref(tuning).] To combat this issue, a small validation set of data was held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to using the test set.
+
+:::rmdnote
+Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
+:::
+
+Validation sets are discussed more in Chapter \@ref(resampling) as a special case of _resampling_ methods that are used on the training set.
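+
+As a brief, hedged sketch (the resampling objects involved are covered properly in Chapter \@ref(resampling)), one way to carve a validation set out of the training data with rsample is:
+
+
+```r
+set.seed(52)
+# Hold back 25% of the training set for validation
+val_set <- validation_split(ames_train, prop = 3/4)
+```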
+
+## Multi-Level Data
+
+With the Ames housing data, a property is considered to be the _independent experimental unit_. It is safe to assume that, statistically, the data from a property are independent of other properties. For other applications, that is not always the case:
+
+ * For longitudinal data, for example, the same independent experimental unit can be measured over multiple time points. An example would be a human subject in a medical trial.
+
+ * A batch of manufactured product might also be considered the independent experimental unit. In repeated measures designs, replicate data points from a batch are collected at multiple times.
+
+ * @spicer2018 report an experiment where different trees were sampled across the top and bottom portions of a stem. Here, the tree is the experimental unit and the data hierarchy is sample within stem position within tree.
+
+Chapter 9 of @fes contains other examples.
+
+In these situations, the data set will have multiple rows per experimental unit. Simple resampling across rows would lead to some data within an experimental unit being in the training set and others in the test set. Data splitting should occur at the independent experimental unit level of the data. For example, to produce an 80/20 split of the Ames housing data set, 80% of the properties should be allocated for the training set.
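+
+To make this concrete, here is a minimal sketch with a small simulated longitudinal data set (the Ames data have one row per property, so a made-up example is used instead); the split is made on subjects, not rows:
+
+
+```r
+library(tidymodels)
+tidymodels_prefer()
+
+# Simulated longitudinal data: three measurements per subject
+set.seed(11)
+visits <- tibble(
+  subject = rep(1:10, each = 3),
+  visit   = rep(1:3, times = 10),
+  outcome = rnorm(30)
+)
+
+# Allocate 80% of the *subjects* (not the rows) to the training set
+train_subjects <- sample(unique(visits$subject), size = 8)
+
+visits_train <- visits %>% filter(subject %in% train_subjects)
+visits_test  <- visits %>% filter(!subject %in% train_subjects)
+```
+
+The rsample package also has grouping-aware functions (such as `group_vfold_cv()`) that handle this kind of split automatically.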
+
+
+## Other Considerations for a Data Budget
+
+When deciding how to spend the data available to you, keep a few more things in mind. First, it is critical to quarantine the test set from any model building activities. As you read this book, notice which data are exposed to the model at any given time.
+
+:::rmdwarning
+The problem of _information leakage_ occurs when data outside of the training set are used in the modeling process.
+:::
+
+For example, in a machine learning competition, the test set data might be provided without the true outcome values so that the model can be scored and ranked. One potential method for improving the score might be to fit the model using the training set points that are most similar to the test set values. While the test set isn't directly used to fit the model, it still has a heavy influence. In general, this technique is highly problematic since it compromises the model's ability to _generalize_ in order to optimize performance on a specific data set. There are more subtle ways that the test set data can be utilized during training. Keeping the training data in a separate data frame from the test set is one small check to make sure that information leakage does not occur by accident.
+
+Second, techniques to subsample the training set can mitigate specific issues (e.g., class imbalances). This is a valid and common technique that deliberately results in the training set data diverging from the population from which the data were drawn. It is critical that the test set continues to mirror what the model would encounter in the wild. In other words, the test set should always resemble new data that will be given to the model.
+
+Next, at the beginning of this chapter, we warned about using the same data for different tasks. Chapter \@ref(resampling) will discuss solid, data-driven methodologies for data usage that will reduce the risks related to bias, overfitting, and other issues. Many of these methods apply the data-splitting tools introduced in this chapter.
+
+Finally, the considerations in this chapter apply to developing and choosing a reliable model, the main topic of this book. When training a final chosen model for production, after ascertaining the expected performance on new data, practitioners often use all available data for better parameter estimation.
+
+
+## Chapter Summary {#splitting-summary}
+
+Data splitting is the fundamental strategy for empirical validation of models. Even in the era of unrestrained data collection, a typical modeling project has a limited amount of appropriate data and wise "spending" of a project's data is necessary. In this chapter, we discussed several strategies for partitioning the data into distinct groups for modeling and evaluation.
+
+At this checkpoint, the important code snippets for preparing and splitting are:
+
+
+```r
+library(tidymodels)
+data(ames)
+ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
+
+set.seed(502)
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+```
diff --git a/tmwr-atlas/06-fitting-models.md b/tmwr-atlas/06-fitting-models.md
new file mode 100644
index 00000000..d5d25396
--- /dev/null
+++ b/tmwr-atlas/06-fitting-models.md
@@ -0,0 +1,461 @@
+
+
+# Fitting Models with parsnip {#models}
+
+The parsnip package, one of the R packages that are part of the tidymodels metapackage, provides a fluent and standardized interface for a variety of different models. In this chapter, we give some motivation for why a common interface is beneficial for understanding and building models in practice and show how to use the parsnip package.
+
+Specifically, we will focus on how to `fit()` and `predict()` directly with a parsnip object, which may be a good fit for some straightforward modeling problems. The next chapter illustrates a better approach for many modeling tasks by combining models and preprocessors together into something called a `workflow` object.
+
+
+## Create a Model
+
+Once the data have been encoded in a format ready for a modeling algorithm, such as a numeric matrix, they can be used in the model building process.
+
+Suppose that a linear regression model was our initial choice. This is equivalent to specifying that the outcome data is numeric and that the predictors are related to the outcome in terms of simple slopes and intercepts:
+
+$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi}$$
+
+There are a variety of methods that can be used to estimate the model parameters:
+
+* _Ordinary linear regression_ uses the traditional method of least squares to solve for the model parameters.
+
+* _Regularized linear regression_ adds a penalty to the least squares method to encourage simplicity by removing predictors and/or shrinking their coefficients towards zero. This can be executed using Bayesian or non-Bayesian techniques.
+
+In R, the stats package can be used for the first case. The syntax for linear regression using the function `lm()` is:
+
+```r
+model <- lm(formula, data, ...)
+```
+
+where `...` symbolizes other options to pass to `lm()`. The function does _not_ have an `x`/`y` interface, where we might pass in our outcome as `y` and our predictors as `x`.
+
+To estimate with regularization, the second case, a Bayesian model can be fit using the rstanarm package:
+
+```r
+model <- stan_glm(formula, data, family = "gaussian", ...)
+```
+
+In this case, the other options passed via `...` would include arguments for the prior distributions of the parameters as well as specifics about the numerical aspects of the model. As with `lm()`, only the formula interface is available.
+
+A popular non-Bayesian approach to regularized regression is the glmnet model [@glmnet]. Its syntax is:
+
+```r
+model <- glmnet(x = matrix, y = vector, family = "gaussian", ...)
+```
+
+In this case, the predictor data must already be formatted into a numeric matrix; there is only an `x`/`y` method and no formula method.
+
+Note that these interfaces are heterogeneous in either how the data are passed to the model function or in terms of their arguments. The first issue is that, to fit models across different packages, the data must be formatted in different ways. `lm()` and `stan_glm()` only have formula interfaces while `glmnet()` does not. For other types of models, the interfaces may be even more disparate. For a person trying to do data analysis, these differences require the memorization of each package's syntax and can be very frustrating.
+
+For tidymodels, the approach to specifying a model is intended to be more unified:
+
+1. *Specify the _type_ of model based on its mathematical structure* (e.g., linear regression, random forest, _K_-nearest neighbors, etc).
+
+2. *Specify the _engine_ for fitting the model.* Most often this reflects the software package that should be used, like Stan or glmnet. These are models in their own right, and parsnip provides consistent interfaces by using these as engines for modeling.
+
+3. *When required, declare the _mode_ of the model.* The mode reflects the type of prediction outcome. For numeric outcomes, the mode is regression; for qualitative outcomes, it is classification.^[Note that parsnip constrains the outcome column of a classification model to be encoded as a _factor_; using binary numeric values will result in an error.] If a model algorithm can only address one type of prediction outcome, such as linear regression, the mode is already set.
+
+These specifications are built without referencing the data. For example, for the three cases we outlined:
+
+
+```r
+library(tidymodels)
+tidymodels_prefer()
+
+linear_reg() %>% set_engine("lm")
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+
+linear_reg() %>% set_engine("glmnet")
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: glmnet
+
+linear_reg() %>% set_engine("stan")
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: stan
+```
+
+
+Once the details of the model have been specified, the model estimation can be done with either the `fit()` function (to use a formula) or the `fit_xy()` function (when your data are already pre-processed). The parsnip package allows the user to be indifferent to the interface of the underlying model; you can always use a formula even if the modeling package's function only has the `x`/`y` interface.
+
+The `translate()` function can provide details on how parsnip converts the user's code to the package's syntax:
+
+
+```r
+linear_reg() %>% set_engine("lm") %>% translate()
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+#>
+#> Model fit template:
+#> stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
+
+linear_reg(penalty = 1) %>% set_engine("glmnet") %>% translate()
+#> Linear Regression Model Specification (regression)
+#>
+#> Main Arguments:
+#> penalty = 1
+#>
+#> Computational engine: glmnet
+#>
+#> Model fit template:
+#> glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
+#> family = "gaussian")
+
+linear_reg() %>% set_engine("stan") %>% translate()
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: stan
+#>
+#> Model fit template:
+#> rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(),
+#> weights = missing_arg(), family = stats::gaussian, refresh = 0)
+```
+
+Note that `missing_arg()` is just a placeholder for the data that has yet to be provided.
+
+:::rmdnote
+Note that we supplied a required `penalty` argument for the glmnet engine. Also, for the Stan and glmnet engines, the `family` argument was automatically added as a default. As will be shown later, this option can be changed.
+:::
+
+Let's walk through how to predict the sale price of houses in the Ames data as a function of only longitude and latitude:[^fitxy]
+
+
+
+```r
+lm_model <-
+ linear_reg() %>%
+ set_engine("lm")
+
+lm_form_fit <-
+ lm_model %>%
+ # Recall that Sale_Price has been pre-logged
+ fit(Sale_Price ~ Longitude + Latitude, data = ames_train)
+
+lm_xy_fit <-
+ lm_model %>%
+ fit_xy(
+ x = ames_train %>% select(Longitude, Latitude),
+ y = ames_train %>% pull(Sale_Price)
+ )
+
+lm_form_fit
+#> parsnip model object
+#>
+#>
+#> Call:
+#> stats::lm(formula = Sale_Price ~ Longitude + Latitude, data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude Latitude
+#> -302.97 -2.07 2.71
+lm_xy_fit
+#> parsnip model object
+#>
+#>
+#> Call:
+#> stats::lm(formula = ..y ~ ., data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude Latitude
+#> -302.97 -2.07 2.71
+```
+
+[^fitxy]: What are the differences between `fit()` and `fit_xy()`? The `fit_xy()` function always passes the data as-is to the underlying model function. It will not create dummy/indicator variables before doing so. When `fit()` is used with a model specification, this almost always means that dummy variables will be created from qualitative predictors. If the underlying function requires a matrix (like glmnet), it will make them. However, if the underlying function uses a formula, `fit()` just passes the formula to that function. We estimate that 99% of modeling functions using formulas make dummy variables. The other 1% include tree-based methods that do not require purely numeric predictors. See Section \@ref(workflow-encoding) for more about using formulas in tidymodels.
+
+Not only does parsnip enable a consistent model interface for different packages, it also provides consistency in the model arguments. It is common for different functions which fit the same model to have different argument names. Random forest model functions are a good example. Three commonly used arguments are the number of trees in the ensemble, the number of predictors to randomly sample with each split within a tree, and the number of data points required to make a split. For three different R packages implementing this algorithm, those arguments are shown in Table \@ref(tab:rand-forest-args).
+
+
+Table: (\#tab:rand-forest-args)Example argument names for different random forest functions.
+
+|Argument Type |ranger |randomForest |sparklyr |
+|:----------------------|:---------------|:------------|:-------------------------|
+|# sampled predictors |`mtry` |`mtry` |`feature_subset_strategy` |
+|# trees |`num.trees` |`ntree` |`num_trees` |
+|# data points to split |`min.node.size` |`nodesize` |`min_instances_per_node` |
+
+In an effort to make argument specification less painful, parsnip uses common argument names within and between packages. Table \@ref(tab:parsnip-args) shows, for random forests, what parsnip models use.
+
+
+Table: (\#tab:parsnip-args)Random forest argument names used by parsnip.
+
+|Argument Type |parsnip |
+|:----------------------|:-------|
+|# sampled predictors |`mtry` |
+|# trees |`trees` |
+|# data points to split |`min_n` |
+
+Admittedly, this is one more set of arguments to memorize. However, when other types of models have the same argument types, these names still apply. For example, boosted tree ensembles also create a large number of tree-based models, so `trees` is also used there, as is `min_n`, and so on.
+
+Some of the original argument names can be fairly jargon-y. For example, to specify the amount of regularization to use in a glmnet model, the Greek letter `lambda` is used. While this mathematical notation is commonly used in the statistics literature, it is not obvious to many people what `lambda` represents (especially those who consume the model results). Since this is the penalty used in regularization, parsnip standardizes on the argument name `penalty`. Similarly, the number of neighbors in a _K_-nearest neighbors model is called `neighbors` instead of `k`. Our rule of thumb when standardizing argument names is:
+
+> If a practitioner were to include these names in a plot or table, would the people viewing those results understand the name?
+
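+As a small illustration (a sketch, not output from the book's analysis), the same standardized names appear across model types:
+
+
+```r
+# `penalty` instead of glmnet's `lambda`
+linear_reg(penalty = 0.01) %>% set_engine("glmnet")
+
+# `neighbors` instead of `k`
+nearest_neighbor(neighbors = 5) %>%
+  set_engine("kknn") %>%
+  set_mode("regression")
+```
+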
+To understand how the parsnip argument names map to the original names, use the help file for the model (available via `?rand_forest`) as well as the `translate()` function:
+
+
+```r
+rand_forest(trees = 1000, min_n = 5) %>%
+ set_engine("ranger") %>%
+ set_mode("regression") %>%
+ translate()
+#> Random Forest Model Specification (regression)
+#>
+#> Main Arguments:
+#> trees = 1000
+#> min_n = 5
+#>
+#> Computational engine: ranger
+#>
+#> Model fit template:
+#> ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(),
+#> num.trees = 1000, min.node.size = min_rows(~5, x), num.threads = 1,
+#> verbose = FALSE, seed = sample.int(10^5, 1))
+```
+
+Modeling functions in parsnip separate model arguments into two categories:
+
+* _Main arguments_ are more commonly used and tend to be available across engines.
+
+* _Engine arguments_ are either specific to a particular engine or used more rarely.
+
+For example, in the translation of the previous random forest code, the arguments `num.threads`, `verbose`, and `seed` were added by default. These arguments are specific to the ranger implementation of random forest models and wouldn't make sense as main arguments. Engine-specific arguments can be specified in `set_engine()`. For example, to have the `ranger::ranger()` function print out more information about the fit:
+
+
+```r
+rand_forest(trees = 1000, min_n = 5) %>%
+ set_engine("ranger", verbose = TRUE) %>%
+ set_mode("regression")
+#> Random Forest Model Specification (regression)
+#>
+#> Main Arguments:
+#> trees = 1000
+#> min_n = 5
+#>
+#> Engine-Specific Arguments:
+#> verbose = TRUE
+#>
+#> Computational engine: ranger
+```
+
+
+## Use the Model Results
+
+Once the model is created and fit, we can use the results in a variety of ways; we might want to plot, print, or otherwise examine the model output. Several quantities are stored in a parsnip model object, including the fitted model. This can be found in an element called `fit`, which can be returned using the `extract_fit_engine()` function:
+
+
+```r
+lm_form_fit %>% extract_fit_engine()
+#>
+#> Call:
+#> stats::lm(formula = Sale_Price ~ Longitude + Latitude, data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude Latitude
+#> -302.97 -2.07 2.71
+```
+
+Normal methods can be applied to this object, such as printing, plotting, and so on:
+
+
+```r
+lm_form_fit %>% extract_fit_engine() %>% vcov()
+#> (Intercept) Longitude Latitude
+#> (Intercept) 207.311 1.57466 -1.42397
+#> Longitude 1.575 0.01655 -0.00060
+#> Latitude -1.424 -0.00060 0.03254
+```
+
+:::rmdwarning
+Never pass the `fit` element of a parsnip model to a model prediction function, i.e., use `predict(lm_form_fit)` but *do not* use `predict(lm_form_fit$fit)`. If the data were preprocessed in any way, incorrect predictions will be generated (sometimes, without errors). The underlying model's prediction function has no idea if any transformations have been made to the data prior to running the model. See the next section for more on making predictions.
+:::
+
+One issue with some existing methods in base R is that the results are stored in a manner that may not be the most useful. For example, the `summary()` method for `lm` objects can be used to print the results of the model fit, including a table with parameter values, their uncertainty estimates, and p-values. These particular results can also be saved:
+
+
+```r
+model_res <-
+ lm_form_fit %>%
+ extract_fit_engine() %>%
+ summary()
+
+# The model coefficient table is accessible via the `coef` method.
+param_est <- coef(model_res)
+class(param_est)
+#> [1] "matrix" "array"
+param_est
+#> Estimate Std. Error t value Pr(>|t|)
+#> (Intercept) -302.974 14.3983 -21.04 3.640e-90
+#> Longitude -2.075 0.1286 -16.13 1.395e-55
+#> Latitude 2.710 0.1804 15.02 9.289e-49
+```
+
+There are a few things to notice about this result. First, the object is a numeric matrix. This data structure was most likely chosen since all of the calculated results are numeric and a matrix object is stored more efficiently than a data frame. This choice was probably made in the late 1970s when computational efficiency was extremely critical. Second, the non-numeric data (the labels for the coefficients) are contained in the row names. Keeping the parameter labels as row names is very consistent with the conventions in the original S language.
+
+A reasonable next step might be to create a visualization of the parameter values. To do this, it would be sensible to convert the parameter matrix to a data frame. We could add the row names as a column so that they can be used in a plot. However, notice that several of the existing matrix column names would not be valid R column names for ordinary data frames (e.g., `"Pr(>|t|)"`). Another complication is the consistency of the column names. For `lm` objects, the p-value column is named `"Pr(>|t|)"`, but for other models, a different test might be used and, as a result, the column name would be different (e.g., `"Pr(>|z|)"`); the type of test is encoded in the column name.
+
+While these additional data formatting steps are not impossible to overcome, they are a hindrance, especially since they might be different for different types of models. The matrix is not a highly reusable data structure mostly because it constrains the data to be of a single type (e.g. numeric). Additionally, keeping some data in the dimension names is also problematic since those data must be extracted to be of general use.
+
+As a solution, the broom package has methods to convert many types of model objects to a tidy structure. For example, using the `tidy()` method on the linear model produces:
+
+
+
+```r
+tidy(lm_form_fit)
+#> # A tibble: 3 × 5
+#>   term        estimate std.error statistic  p.value
+#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
+#> 1 (Intercept)  -303.      14.4       -21.0 3.64e-90
+#> 2 Longitude      -2.07     0.129     -16.1 1.40e-55
+#> 3 Latitude        2.71     0.180      15.0 9.29e-49
+```
+
+The column names are standardized across models and do not contain any additional data (such as the type of statistical test). The data previously contained in the row names are now in a column called `term` and so on. One important principle in the tidymodels ecosystem is that a function should return values that are _predictable, consistent,_ and _unsurprising_.
+
+
+## Make Predictions {#parsnip-predictions}
+
+Another area where parsnip diverges from conventional R modeling functions is the format of values returned from `predict()`. For predictions, parsnip always conforms to the following rules:
+
+1. The results are always a tibble.
+2. The column names of the tibble are always predictable.
+3. There are always as many rows in the tibble as there are in the input data set.
+
+For example, when numeric data are predicted:
+
+
+```r
+ames_test_small <- ames_test %>% slice(1:5)
+predict(lm_form_fit, new_data = ames_test_small)
+#> # A tibble: 5 × 1
+#> .pred
+#>   <dbl>
+#> 1  5.22
+#> 2  5.21
+#> 3  5.28
+#> 4  5.27
+#> 5  5.28
+```
+
+The row order of the predictions is always the same as that of the original data.
+
+:::rmdnote
+Why are there leading dots in some of the column names? Some tidyverse and tidymodels arguments and return values contain periods. This is to protect against merging data with duplicate names. There are some data sets that contain predictors named `pred`!
+:::
+
+These three rules make it easier to merge predictions with the original data:
+
+
+```r
+ames_test_small %>%
+ select(Sale_Price) %>%
+ bind_cols(predict(lm_form_fit, ames_test_small)) %>%
+ # Add 95% prediction intervals to the results:
+ bind_cols(predict(lm_form_fit, ames_test_small, type = "pred_int"))
+#> # A tibble: 5 × 4
+#> Sale_Price .pred .pred_lower .pred_upper
+#>        <dbl> <dbl>       <dbl>       <dbl>
+#> 1       5.02  5.22        4.91        5.54
+#> 2       5.39  5.21        4.90        5.53
+#> 3       5.28  5.28        4.97        5.60
+#> 4       5.28  5.27        4.96        5.59
+#> 5       5.28  5.28        4.97        5.60
+```
+
+The motivation for the first rule comes from some R packages producing dissimilar data types from prediction functions. For example, the ranger package is an excellent tool for computing random forest models. However, instead of returning a data frame or vector as output, a specialized object is returned that has multiple values embedded within it (including the predicted values). This is just one more step for the data analyst to work around in their scripts. As another example, the native glmnet model can return at least four different output types for predictions, depending on the model specifics and characteristics of the data. These are shown in Table \@ref(tab:predict-types).
+
+
+Table: (\#tab:predict-types)Different return values for glmnet prediction types.
+
+|Type of Prediction |Returns a: |
+|:------------------------|:-------------------------------|
+|numeric |numeric matrix |
+|class |character matrix |
+|probability (2 classes) |numeric matrix (2nd level only) |
+|probability (3+ classes) |3D numeric array (all levels) |
+
+Additionally, the column names of the results contain coded values that map to a vector called `lambda` within the glmnet model object. This excellent statistical method can be discouraging to use in practice because of all of the special cases an analyst might encounter that require additional code to be useful.
+
+For the second tidymodels prediction rule, the predictable column names for different types of predictions are shown in Table \@ref(tab:predictable-column-names).
+
+
+Table: (\#tab:predictable-column-names)The tidymodels mapping of prediction types and column names.
+
+|type value |column name(s) |
+|:----------|:--------------------------|
+|`numeric` |`.pred` |
+|`class` |`.pred_class` |
+|`prob` |`.pred_{class levels}` |
+|`conf_int` |`.pred_lower, .pred_upper` |
+|`pred_int` |`.pred_lower, .pred_upper` |
+
+The third rule regarding the number of rows in the output is critical. For example, if any rows of the new data contain missing values, the output will be padded with missing results for those rows.
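+
+A quick, hedged sketch of this behavior, reusing `lm_form_fit` and `ames_test_small` from above and introducing a missing value purely for illustration:
+
+
+```r
+ames_with_na <- ames_test_small
+ames_with_na$Longitude[1] <- NA
+
+# Still five rows of output; the prediction for the first row is missing
+predict(lm_form_fit, new_data = ames_with_na)
+```
+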
+A main advantage of standardizing the model interface and prediction types in parsnip is that, when different models are used, the syntax is identical. Suppose that we used a decision tree to model the Ames data. Outside of the model specification, there are no significant differences in the code pipeline:
+
+
+```r
+tree_model <-
+ decision_tree(min_n = 2) %>%
+ set_engine("rpart") %>%
+ set_mode("regression")
+
+tree_fit <-
+ tree_model %>%
+ fit(Sale_Price ~ Longitude + Latitude, data = ames_train)
+
+ames_test_small %>%
+ select(Sale_Price) %>%
+ bind_cols(predict(tree_fit, ames_test_small))
+#> # A tibble: 5 × 2
+#> Sale_Price .pred
+#>        <dbl> <dbl>
+#> 1       5.02  5.15
+#> 2       5.39  5.15
+#> 3       5.28  5.32
+#> 4       5.28  5.32
+#> 5       5.28  5.32
+```
+
+This demonstrates the benefit of homogenizing the data analysis process and syntax across different models. It enables the user to spend their time on the results and interpretation rather than having to focus on the syntactical differences between R packages.
+
+## parsnip-Extension Packages
+
+The parsnip package itself contains interfaces to a number of models. However, for ease of package installation and maintenance, there are other tidymodels packages that have parsnip model definitions for other sets of models. The discrim package has model definitions for the set of classification techniques called discriminant analysis methods (such as linear or quadratic discriminant analysis). In this way, the package dependencies required for installing parsnip are reduced. A list of all of the models that can be used with parsnip (across different packages that are on CRAN) can be found at <https://www.tidymodels.org/find/>.
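+
+For example, a hedged sketch of using one of these extension packages (assuming discrim is installed):
+
+
+```r
+library(tidymodels)
+library(discrim)
+
+# The linear discriminant analysis model definition is supplied by discrim
+discrim_linear() %>%
+  set_engine("MASS") %>%
+  set_mode("classification")
+```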
+
+## Creating Model Specifications {#parsnip-addin}
+
+It may become tedious to write many model specifications, or to remember how to write the code to generate them. The parsnip package includes an RStudio addin that can help. Either choosing this addin from the _Addins_ toolbar menu or running the code:
+
+
+
+```r
+parsnip_addin()
+```
+
+will open a window in the Viewer panel of the RStudio IDE with a list of possible models for each model mode. These can be written to the source code panel.
+
+The model list includes models from parsnip and parsnip-adjacent packages that are on CRAN.
+
+
+## Chapter Summary {#models-summary}
+
+This chapter introduced the parsnip package, which provides a common interface for models across R packages using a standard syntax. The interface and resulting objects have a predictable structure.
+
+The code for modeling the Ames data that we will use moving forward is:
+
+
+```r
+library(tidymodels)
+data(ames)
+ames <- mutate(ames, Sale_Price = log10(Sale_Price))
+
+set.seed(123)
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+lm_model <- linear_reg() %>% set_engine("lm")
+```
diff --git a/tmwr-atlas/07-the-model-workflow.md b/tmwr-atlas/07-the-model-workflow.md
new file mode 100644
index 00000000..d35dabcc
--- /dev/null
+++ b/tmwr-atlas/07-the-model-workflow.md
@@ -0,0 +1,545 @@
+
+
+# A Model Workflow {#workflows}
+
+In the previous chapter, we discussed the parsnip package, which can be used to define and fit the model. This chapter introduces a new concept called a _model workflow_. The purpose of this concept (and the corresponding tidymodels `workflow()` object) is to encapsulate the major pieces of the modeling process (previously discussed in Chapter \@ref(software-modeling)). The workflow is important in two ways. First, using a workflow concept encourages good methodology since it is a single point of entry to the estimation components of a data analysis. Second, it enables the user to better organize their projects. These two points are discussed in the following sections.
+
+
+## Where Does the Model Begin and End? {#begin-model-end}
+
+So far, when we have used the term "the model", we have meant a structural equation that relates some predictors to one or more outcomes. Let's consider again linear regression as an example. The outcome data are denoted as $y_i$, where there are $i = 1 \ldots n$ samples in the training set. Suppose that there are $p$ predictors $x_{i1}, \ldots, x_{ip}$ that are used in the model. Linear regression produces a model equation of
+
+$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_px_{ip} $$
+
+While this is a linear model, it is only linear in the parameters. The predictors could be nonlinear terms (such as $\log(x_i)$).
+
+:::rmdwarning
+The conventional way of thinking about the modeling process is that it only includes the model fit.
+:::
+
+For some data sets that are straightforward in nature, fitting the model itself may be the entire process. However, there are a variety of choices and additional steps that often occur before the model is fit:
+
+* While our example model has $p$ predictors, it is common to start with more than $p$ candidate predictors. Through exploratory data analysis or using domain knowledge, some of the predictors may be excluded from the analysis. In other cases, a feature selection algorithm may be used to make a data-driven choice for the minimum predictor set for the model.
+* There are times when the value of an important predictor is missing. Rather than eliminating this sample from the data set, the missing value could be imputed using other values in the data. For example, if $x_1$ were missing but was correlated with predictors $x_2$ and $x_3$, an imputation method could estimate the missing $x_1$ observation from the values of $x_2$ and $x_3$.
+* It may be beneficial to transform the scale of a predictor. If there is not _a priori_ information on what the new scale should be, we can estimate the proper scale using a statistical transformation technique, the existing data, and some optimization criterion. Other transformations, such as PCA, take groups of predictors and transform them into new features that are used as the predictors.
+
+While these examples are related to steps that occur before the model fit, there may also be operations that occur after the model is created. When a classification model is created where the outcome is binary (e.g., `event` and `non-event`), it is customary to use a 50% probability cutoff to create a discrete class prediction, also known as a "hard prediction". For example, a classification model might estimate that the probability of an event was 62%. Using the typical default, the hard prediction would be `event`. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to raise the cutoff from 50% to some greater value. This increases the level of evidence required to call a new sample an event. While this reduces the true positive rate (which is bad), it may have a more dramatic effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works, even though it is not contained in the model fitting step.
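+
+A tiny sketch of what such a post-processing step looks like, using a made-up vector of predicted event probabilities:
+
+
+```r
+event_prob <- c(0.62, 0.45, 0.91, 0.80, 0.10)
+
+# Default 50% cutoff
+ifelse(event_prob >= 0.50, "event", "non-event")
+
+# A stricter 80% cutoff requires more evidence before predicting an event,
+# trading some true positives for fewer false positives
+ifelse(event_prob >= 0.80, "event", "non-event")
+```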
+
+It is important to focus on the broader _modeling process_, instead of only fitting the specific model used to estimate parameters. This broader process includes any preprocessing steps, the model fit itself, as well as potential post-processing activities. In this book, we will refer to this more comprehensive concept as the *model workflow* and highlight how to handle all its components to produce a final model equation.
+
+:::rmdnote
+In other software, such as Python or Spark, similar collections of steps are called _pipelines_. In tidymodels, the term "pipeline" already connotes a sequence of operations chained together with a pipe operator (such as `%>%` from magrittr or the newer native `|>`). Rather than using ambiguous terminology in this context, we call the sequence of computational operations related to modeling *workflows*.
+:::
+
+Binding together the analytical components of a data analysis is important for another reason. Future chapters will demonstrate how to accurately measure performance, as well as how to optimize structural parameters (i.e. model tuning). To correctly quantify model performance on the training set, Chapter \@ref(resampling) advocates using resampling methods. To do this properly, no data-driven parts of the analysis should be excluded from validation. To this end, the workflow must include all significant estimation steps.
+
+To illustrate, consider principal component analysis (PCA) signal extraction. We'll talk about this more in Chapter \@ref(recipes) as well as Chapter \@ref(dimensionality); PCA is a way to replace correlated predictors with new artificial features that are uncorrelated and capture most of the information in the original set. The new features could be used as the predictors and least squares regression could be used to estimate the model parameters.
+
+There are two ways of thinking about the model workflow. Figure \@ref(fig:bad-workflow) illustrates the _incorrect_ method to think of the PCA preprocessing step, as _not being part of the modeling workflow_.
+
+(\#fig:bad-workflow)Incorrect mental model of where model estimation occurs in the data analysis process.
+
+The fallacy here is that, although PCA does significant computations to produce the components, its operations are assumed to have no uncertainty associated with them. The PCA components are treated as _known_ and, if not included in the model workflow, the effect of PCA could not be adequately measured.
+
+Figure \@ref(fig:good-workflow) shows an _appropriate_ approach.
+
+(\#fig:good-workflow)Correct mental model of where model estimation occurs in the data analysis process.
+
+In this way, the PCA preprocessing is considered part of the modeling process.
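+
+As a forward-looking sketch (workflow and recipe syntax are introduced in the rest of this chapter and in Chapter \@ref(recipes); the number of components here is arbitrary), a PCA step bundled into the modeling workflow might look like:
+
+
+```r
+pca_wflow <-
+  workflow() %>%
+  add_recipe(
+    recipe(Sale_Price ~ ., data = ames_train) %>%
+      step_normalize(all_numeric_predictors()) %>%
+      step_pca(all_numeric_predictors(), num_comp = 5)
+  ) %>%
+  add_model(linear_reg() %>% set_engine("lm"))
+```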
+
+## Workflow Basics
+
+The workflows package allows the user to bind modeling and preprocessing objects together. Let's start again with the Ames data and a simple linear model:
+
+
+```r
+library(tidymodels) # Includes the workflows package
+tidymodels_prefer()
+
+lm_model <-
+ linear_reg() %>%
+ set_engine("lm")
+```
+
+A workflow always requires a parsnip model object:
+
+
+```r
+lm_wflow <-
+ workflow() %>%
+ add_model(lm_model)
+
+lm_wflow
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: None
+#> Model: linear_reg()
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+Notice that we have not yet specified how this workflow should preprocess the data: `Preprocessor: None`.
+
+If our model is very simple, a standard R formula can be used as a preprocessor:
+
+
+```r
+lm_wflow <-
+ lm_wflow %>%
+ add_formula(Sale_Price ~ Longitude + Latitude)
+
+lm_wflow
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: Formula
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Sale_Price ~ Longitude + Latitude
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+Workflows have a `fit()` method that can be used to create the model. Using the objects created in the summary at the end of Chapter \@ref(models):
+
+
+```r
+lm_fit <- fit(lm_wflow, ames_train)
+lm_fit
+#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
+#> Preprocessor: Formula
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Sale_Price ~ Longitude + Latitude
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#>
+#> Call:
+#> stats::lm(formula = ..y ~ ., data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude Latitude
+#> -302.97 -2.07 2.71
+```
+
+We can also `predict()` on the fitted workflow:
+
+
+```r
+predict(lm_fit, ames_test %>% slice(1:3))
+#> # A tibble: 3 × 1
+#> .pred
+#>   <dbl>
+#> 1  5.22
+#> 2  5.21
+#> 3  5.28
+```
+
+The `predict()` method follows all of the same rules and naming conventions that we described for the parsnip package in Chapter \@ref(models).
+
+Both the model and preprocessor can be removed or updated:
+
+
+```r
+lm_fit %>% update_formula(Sale_Price ~ Longitude)
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: Formula
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Sale_Price ~ Longitude
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+Note that, in this new object, the output shows that the previous fitted model was removed since the new formula is inconsistent with the previous model fit.
+
+
+## Adding Raw Variables to the `workflow()`
+
+There is another interface for passing data to the model, the `add_variables()` function which uses a dplyr-like syntax for choosing variables. The function has two primary arguments: `outcomes` and `predictors`. These use a selection approach similar to the tidyselect back-end of tidyverse packages to capture multiple selectors using `c()`.
+
+
+```r
+lm_wflow <-
+ lm_wflow %>%
+ remove_formula() %>%
+ add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))
+lm_wflow
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: Variables
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Outcomes: Sale_Price
+#> Predictors: c(Longitude, Latitude)
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+The predictors could also have been specified using a more general selector, such as
+
+
+```r
+predictors = c(ends_with("tude"))
+```
+
+One nicety is that any outcome columns accidentally specified in the predictors argument will be quietly removed. This facilitates the use of:
+
+
+```r
+predictors = everything()
+```
+
+When the model is fit, the specification assembles these data, unaltered, into a data frame and passes it to the underlying function:
+
+
+```r
+fit(lm_wflow, ames_train)
+#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
+#> Preprocessor: Variables
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Outcomes: Sale_Price
+#> Predictors: c(Longitude, Latitude)
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#>
+#> Call:
+#> stats::lm(formula = ..y ~ ., data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude Latitude
+#> -302.97 -2.07 2.71
+```
+
+If you would like the underlying modeling method to do what it would normally do with the data, `add_variables()` can be a helpful interface. As we will see in an upcoming section in this chapter, it also facilitates more complex modeling specifications. However, as we mention in the next section, models such as `glmnet` and `xgboost` expect the user to make indicator variables from factor predictors. In these cases, a recipe or formula interface will typically be a better choice.
+
+In the next chapter, we will look at a more powerful preprocessor (called a _recipe_) that can also be added to a workflow.
+
+## How Does a `workflow()` Use the Formula? {#workflow-encoding}
+
+Recall from Chapter \@ref(base-r) that the formula method in R has multiple purposes (we will discuss this further in Chapter \@ref(recipes)). One of these is to properly encode the original data into an analysis ready format. This can involve executing in-line transformations (e.g., `log(x)`), creating dummy variable columns, creating interactions or other column expansions, and so on. However, there are many statistical methods that require different types of encodings:
+
+ * Most packages for tree-based models use the formula interface but *do not* encode the categorical predictors as dummy variables.
+
+ * Packages can use special in-line functions that tell the model function how to treat the predictor in the analysis. For example, in survival analysis models, a formula term such as `strata(site)` would indicate that the column `site` is a stratification variable. This means that it should not be treated as a regular predictor and does not have a corresponding location parameter estimate in the model.
+
+ * A few R packages have extended the formula in ways that base R functions cannot parse or execute. In multilevel models (e.g. mixed models or hierarchical Bayesian models), a model term such as `(week | subject)` indicates that the column `week` is a random effect that has different slope parameter estimates for each value of the `subject` column.
+
+A workflow is a general purpose interface. When `add_formula()` is used, how should the workflow pre-process the data? Since the preprocessing is model dependent, workflows attempts to emulate what the underlying model would do whenever possible. If it is not possible, the formula processing should not do anything to the columns used in the formula. Let's look at this in more detail.
+
+### Tree-based models {-}
+
+When we fit a tree to the data, the parsnip package understands what the modeling function would do. For example, if a random forest model is fit using the ranger or randomForest packages, the workflow knows that predictor columns that are factors should be left as-is.
+
+As a counter example, a boosted tree created with the xgboost package requires the user to create dummy variables from factor predictors (since `xgboost::xgb.train()` will not). This requirement is embedded into the model specification object and a workflow using xgboost will create the indicator columns for this engine. Also note that a different engine for boosted trees, C5.0, does not require dummy variables so none are made by the workflow.
+
+This determination is made for each model and engine combination.
+
+### Special formulas and in-line functions {#special-model-formulas}
+
+A number of multilevel models have standardized on a formula specification devised in the lme4 package. For example, to fit a regression model that has random effects for subjects, we would use the following formula:
+
+```r
+library(lme4)
+lmer(distance ~ Sex + (age | Subject), data = Orthodont)
+```
+
+The effect of this is that each subject will have an estimated intercept and slope parameter for `age`.
+
+The problem is that standard R methods can't properly process this formula:
+
+
+
+
+```r
+model.matrix(distance ~ Sex + (age | Subject), data = Orthodont)
+#> Warning in Ops.ordered(age, Subject): '|' is not meaningful for ordered factors
+#> (Intercept) SexFemale age | SubjectTRUE
+#> attr(,"assign")
+#> [1] 0 1 2
+#> attr(,"contrasts")
+#> attr(,"contrasts")$Sex
+#> [1] "contr.treatment"
+#>
+#> attr(,"contrasts")$`age | Subject`
+#> [1] "contr.treatment"
+```
+
+The result is a zero row data frame.
+
+:::rmdwarning
+The issue is that the special formula has to be processed by the underlying package code, not the standard `model.matrix()` approach.
+:::
+
+Even if this formula could be used with `model.matrix()`, this would still present a problem since the formula also specifies the statistical attributes of the model.
+
+The solution in workflows is an optional supplementary model formula that can be passed to `add_model()`. The `add_variables()` specification provides the bare column names and then the actual formula given to the model is set within `add_model()`:
+
+
+```r
+library(multilevelmod)
+
+multilevel_spec <- linear_reg() %>% set_engine("lmer")
+
+multilevel_workflow <-
+ workflow() %>%
+ # Pass the data along as-is:
+ add_variables(outcome = distance, predictors = c(Sex, age, Subject)) %>%
+ add_model(multilevel_spec,
+ # This formula is given to the model
+ formula = distance ~ Sex + (age | Subject))
+
+multilevel_fit <- fit(multilevel_workflow, data = Orthodont)
+multilevel_fit
+#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
+#> Preprocessor: Variables
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Outcomes: distance
+#> Predictors: c(Sex, age, Subject)
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear mixed model fit by REML ['lmerMod']
+#> Formula: distance ~ Sex + (age | Subject)
+#> Data: data
+#> REML criterion at convergence: 471.2
+#> Random effects:
+#> Groups Name Std.Dev. Corr
+#> Subject (Intercept) 7.391
+#> age 0.694 -0.97
+#> Residual 1.310
+#> Number of obs: 108, groups: Subject, 27
+#> Fixed Effects:
+#> (Intercept) SexFemale
+#> 24.52 -2.15
+```
+
+We can even use the previously mentioned `strata()` function from the survival package for survival analysis:
+
+
+```r
+library(censored)
+
+parametric_spec <- survival_reg()
+
+parametric_workflow <-
+ workflow() %>%
+ add_variables(outcome = c(fustat, futime), predictors = c(age, rx)) %>%
+ add_model(parametric_spec,
+ formula = Surv(futime, fustat) ~ age + strata(rx))
+
+parametric_fit <- fit(parametric_workflow, data = ovarian)
+parametric_fit
+#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
+#> Preprocessor: Variables
+#> Model: survival_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Outcomes: c(fustat, futime)
+#> Predictors: c(age, rx)
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Call:
+#> survival::survreg(formula = Surv(futime, fustat) ~ age + strata(rx),
+#> data = data, model = TRUE)
+#>
+#> Coefficients:
+#> (Intercept) age
+#> 12.8734 -0.1034
+#>
+#> Scale:
+#> rx=1 rx=2
+#> 0.7696 0.4704
+#>
+#> Loglik(model)= -89.4 Loglik(intercept only)= -97.1
+#> Chisq= 15.36 on 1 degrees of freedom, p= 9e-05
+#> n= 26
+```
+
+Notice how, in both of these calls, the model-specific formula was used.
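+
+To double-check which formula the engine actually received, one option (a quick sketch, output not shown) is to pull the underlying lme4 fit out of the trained workflow with `extract_fit_engine()`:
+
+```r
+# Sketch: extract_fit_engine() returns the lme4 model object itself; printing
+# it shows the formula that add_model() supplied to the engine.
+multilevel_fit %>% extract_fit_engine()
+```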
+
+## Creating Multiple Workflows at Once {#workflow-sets-intro}
+
+There are some situations where the data require numerous attempts to find an appropriate model. For example:
+
+* For predictive models, it is advisable to evaluate a variety of different model types. This requires the user to create multiple model specifications.
+
+* Sequential testing of models typically starts with an expanded set of predictors. This "full model" is compared to a sequence of the same model that removes each predictor in turn. Using basic hypothesis testing methods or empirical validation, the effect of each predictor can be isolated and assessed.
+
+In these situations, as well as others, it can become tedious or onerous to create a lot of workflows from different sets of preprocessors and/or model specifications. To address this problem, the workflowsets package creates combinations of workflow components. A list of preprocessors (e.g., formulas, dplyr selectors, or feature engineering recipe objects discussed in the next chapter) can be combined with a list of model specifications, resulting in a set of workflows.
+
+As an example, let's say that we want to focus on the different ways that house location is represented in the Ames data. We can create a set of formulas that capture these predictors:
+
+
+```r
+location <- list(
+ longitude = Sale_Price ~ Longitude,
+ latitude = Sale_Price ~ Latitude,
+ coords = Sale_Price ~ Longitude + Latitude,
+ neighborhood = Sale_Price ~ Neighborhood
+)
+```
+
+These representations can be crossed with one or more models using the `workflow_set()` function. We'll just use the previous linear model specification to demonstrate:
+
+
+```r
+library(workflowsets)
+location_models <- workflow_set(preproc = location, models = list(lm = lm_model))
+location_models
+#> # A workflow set/tibble: 4 × 4
+#>   wflow_id        info             option    result    
+#>   <chr>           <list>           <list>    <list>    
+#> 1 longitude_lm    <tibble [1 × 4]> <opts[0]> <list [0]>
+#> 2 latitude_lm     <tibble [1 × 4]> <opts[0]> <list [0]>
+#> 3 coords_lm       <tibble [1 × 4]> <opts[0]> <list [0]>
+#> 4 neighborhood_lm <tibble [1 × 4]> <opts[0]> <list [0]>
+location_models$info[[1]]
+#> # A tibble: 1 × 4
+#>   workflow   preproc model      comment
+#>   <list>     <chr>   <chr>      <chr>  
+#> 1 <workflow> formula linear_reg ""
+extract_workflow(location_models, id = "coords_lm")
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: Formula
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Sale_Price ~ Longitude + Latitude
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+Workflow sets are mostly designed to work with resampling, which is discussed in Chapter \@ref(resampling). The columns `option` and `result` must be populated with specific types of objects that result from resampling. We will demonstrate this in more detail in Chapters \@ref(compare) and \@ref(workflow-sets).
+
+In the meantime, let's create model fits for each formula and save them in a new column called `fit`. We'll use basic dplyr and purrr operations:
+
+
+```r
+location_models <-
+ location_models %>%
+ mutate(fit = map(info, ~ fit(.x$workflow[[1]], ames_train)))
+location_models
+#> # A workflow set/tibble: 4 × 5
+#>   wflow_id        info             option    result     fit       
+#>   <chr>           <list>           <list>    <list>     <list>    
+#> 1 longitude_lm    <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
+#> 2 latitude_lm     <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
+#> 3 coords_lm       <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
+#> 4 neighborhood_lm <tibble [1 × 4]> <opts[0]> <list [0]> <workflow>
+location_models$fit[[1]]
+#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
+#> Preprocessor: Formula
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> Sale_Price ~ Longitude
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#>
+#> Call:
+#> stats::lm(formula = ..y ~ ., data = data)
+#>
+#> Coefficients:
+#> (Intercept) Longitude
+#> -184.40 -2.02
+```
+
+We use a purrr function here to map through our models, but there is an easier, better approach to fit workflow sets that will be introduced in Chapter \@ref(compare).
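+
+As a preview of that approach, the `workflow_map()` function applies the same function to every workflow in the set. A sketch, assuming a resampling object such as `ames_folds` that is not created until later chapters, might look like:
+
+```r
+# Hypothetical sketch: ames_folds does not exist yet in this book's code.
+# workflow_map() would run fit_resamples() on each of the four workflows.
+location_models %>%
+  workflow_map("fit_resamples", resamples = ames_folds, seed = 1101)
+```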
+
+:::rmdnote
+In general, there's a lot more to workflow sets! While we've covered the basics here, the nuances and advantages of workflow sets won't be illustrated until Chapter \@ref(workflow-sets).
+:::
+
+## Evaluating the Test Set
+
+Let's say that we've concluded our model development and have settled on a final model. There is a convenience function called `last_fit()` that will _fit_ the model to the entire training set and _evaluate_ it with the testing set.
+
+Using `lm_wflow` as an example, we can pass the model and the initial training/testing split to the function:
+
+
+```r
+final_lm_res <- last_fit(lm_wflow, ames_split)
+final_lm_res
+#> # Resampling results
+#> # Manual resampling
+#> # A tibble: 1 × 6
+#>   splits             id               .metrics .notes   .predictions .workflow 
+#>   <list>             <chr>            <list>   <list>   <list>       <list>    
+#> 1 <split [2342/588]> train/test split <tibble> <tibble> <tibble>     <workflow>
+```
+
+:::rmdnote
+Notice that `last_fit()` takes a data split as an input, not a data frame. This function uses the split to generate the training and test sets for the final fitting and evaluation.
+:::
+
+The `.workflow` column contains the fitted workflow and can be pulled out of the results using:
+
+
+```r
+fitted_lm_wflow <- extract_workflow(final_lm_res)
+```
+
+Similarly, `collect_metrics()` and `collect_predictions()` provide access to the performance metrics and predictions, respectively.
+
+
+```r
+collect_metrics(final_lm_res)
+collect_predictions(final_lm_res) %>% slice(1:5)
+```
+
+We'll see more about `last_fit()` in action and how to use it again in Chapter \@ref(dimensionality).
+
+## Chapter Summary {#workflows-summary}
+
+In this chapter, you learned that the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes preprocessing steps and operations taken after a model is fit. We introduced a concept called a *model workflow* that can capture the important components of the modeling process. Multiple workflows can also be created inside of a *workflow set*. The `last_fit()` function is convenient for fitting a final model to the training set and evaluating with the test set.
+
+For the Ames data, the related code that we'll see used again in later chapters is:
+
+
+```r
+library(tidymodels)
+data(ames)
+
+ames <- mutate(ames, Sale_Price = log10(Sale_Price))
+
+set.seed(123)
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+lm_model <- linear_reg() %>% set_engine("lm")
+
+lm_wflow <-
+ workflow() %>%
+ add_model(lm_model) %>%
+ add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))
+
+lm_fit <- fit(lm_wflow, ames_train)
+```
+
+
diff --git a/tmwr-atlas/08-feature-engineering.md b/tmwr-atlas/08-feature-engineering.md
new file mode 100644
index 00000000..3969d359
--- /dev/null
+++ b/tmwr-atlas/08-feature-engineering.md
@@ -0,0 +1,642 @@
+
+
+# Feature Engineering with recipes {#recipes}
+
+Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. This includes transformations and encodings of the data to best represent their important characteristics. Imagine that you have two predictors in a data set that can be more effectively represented in your model as a ratio; creating a new predictor from the ratio of the original two is a simple example of feature engineering.
+
+Take the location of a house in Ames as a more involved example. There are a variety of ways that this spatial information can be exposed to a model, including neighborhood (a qualitative measure), longitude/latitude, distance to the nearest school or Iowa State University, and so on. When choosing how to encode these data in modeling, we might choose an option we believe is most associated with the outcome. The original format of the data, for example numeric (e.g., distance) versus categorical (e.g., neighborhood), is also a driving factor in feature engineering choices.
+
+There are many other examples of preprocessing to build better features for modeling:
+
+ * Correlation between predictors can be reduced via feature extraction or the removal of some predictors.
+
+ * When some predictors have missing values, they can be imputed using a sub-model.
+
+ * Models that use variance-type measures may benefit from coercing the distribution of some skewed predictors to be symmetric by estimating a transformation.
+
+Feature engineering and data preprocessing can also involve reformatting that may be required by the model. Some models use geometric distance metrics and, consequently, numeric predictors should be centered and scaled so that they are all in the same units. Otherwise, the distance values would be biased by the scale of each column.
+
+:::rmdnote
+Different models have different preprocessing requirements and some, such as tree-based models, require very little preprocessing at all. Appendix \@ref(pre-proc-table) contains a small table of recommended preprocessing techniques for different models.
+:::
+
+In this chapter, we introduce the [recipes](https://recipes.tidymodels.org/) package which you can use to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets. The recipes package is, like parsnip for models, one of the core tidymodels packages.
+
+This chapter uses the Ames housing data and the R objects created in the book so far, as summarized at the end of Chapter \@ref(workflows).
+
+## A Simple `recipe()` for the Ames Housing Data
+
+In this section, we will focus on a small subset of the predictors available in the Ames housing data:
+
+ * The neighborhood (qualitative, with 29 neighborhoods in the training set)
+
+ * The gross above-grade living area (continuous, named `Gr_Liv_Area`)
+
+ * The year built (`Year_Built`)
+
+ * The type of building (`Bldg_Type` with values `OneFam` ($n = 1,936$), `TwoFmCon` ($n = 50$), `Duplex` ($n = 88$), `Twnhs` ($n = 77$), and `TwnhsE` ($n = 191$))
+
+Suppose that an initial ordinary linear regression model were fit to these data. Recalling that, in Chapter \@ref(ames), the sale prices were pre-logged, a standard call to `lm()` might look like:
+
+
+```r
+lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)
+```
+
+When this function is executed, the data are converted from a data frame to a numeric _design matrix_ (also called a _model matrix_) and then the least squares method is used to estimate parameters. In Chapter \@ref(base-r) we listed the multiple purposes of the R model formula; let's focus only on the data manipulation aspects for now. What the formula above does can be decomposed into a series of steps:
+
+1. Sale price is defined as the outcome while neighborhood, gross living area, the year built, and building type variables are all defined as predictors.
+
+1. A log transformation is applied to the gross living area predictor.
+
+1. The neighborhood and building type columns are converted from a non-numeric format to a numeric format (since least squares requires numeric predictors).
+
+As mentioned in Chapter \@ref(base-r), the formula method will apply these data manipulations to any data, including new data, that are passed to the `predict()` function.
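+
+To see what these steps produce, a quick sketch is to build the design matrix directly with `model.matrix()`; the factor columns become binary indicator columns and the gross living area is logged in place:
+
+```r
+# Sketch: the same formula, evaluated by base R's model.matrix()
+head(model.matrix(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type,
+                  data = ames))
+```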
+
+A recipe is also an object that defines a series of steps for data processing. Unlike the formula method inside a modeling function, the recipe defines the steps via `step_*()` functions without immediately executing them; it is only a specification of what should be done. Here is a recipe equivalent to the formula above that builds on the code summary at the end of Chapter \@ref(splitting):
+
+
+```r
+library(tidymodels) # Includes the recipes package
+tidymodels_prefer()
+
+simple_ames <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
+ data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_dummy(all_nominal_predictors())
+simple_ames
+#> Recipe
+#>
+#> Inputs:
+#>
+#> role #variables
+#> outcome 1
+#> predictor 4
+#>
+#> Operations:
+#>
+#> Log transformation on Gr_Liv_Area
+#> Dummy variables from all_nominal_predictors()
+```
+
+Let's break this down:
+
+1. The call to `recipe()` with a formula tells the recipe the _roles_ of the "ingredients" or variables (e.g., predictor, outcome). It only uses the data `ames_train` to determine the data types for the columns.
+
+1. `step_log()` declares that `Gr_Liv_Area` should be log transformed.
+
+1. `step_dummy()` is used to specify which variables should be converted from a qualitative format to a quantitative format, in this case, using dummy or indicator variables. An indicator or dummy variable is a binary numeric variable (a column of ones and zeroes) that encodes qualitative information; we will dig deeper into these kinds of variables later in this chapter.
+
+The function `all_nominal_predictors()` captures the names of any predictor columns that are currently factor or character (i.e., nominal) in nature. This is a dplyr-like selector function similar to `starts_with()` or `matches()` but can only be used inside of a recipe.
+
+:::rmdnote
+Other selectors specific to the recipes package are: `all_numeric_predictors()`, `all_numeric()`, `all_predictors()`, and `all_outcomes()`. As with dplyr, one or more unquoted expressions, separated by commas, can be used to select which columns are affected by each step.
+:::
+
+What is the advantage to using a recipe, over a formula or raw predictors? There are a few, including:
+
+ * These computations can be recycled across models since they are not tightly coupled to the modeling function.
+
+ * A recipe enables a broader set of data processing choices than formulas can offer.
+
+ * The syntax can be very compact. For example, `all_nominal_predictors()` can be used to capture many variables for specific types of processing while a formula would require each to be explicitly listed.
+
+ * All data processing can be captured in a single R object instead of in scripts that are repeated, or even spread across different files.
+
+
+
+## Using Recipes
+
+As we discussed in Chapter \@ref(workflows), preprocessing choices and feature engineering should typically be considered part of a modeling workflow, not as a separate task. The workflows package contains high level functions to handle different types of preprocessors. Our previous workflow (`lm_wflow`) used a simple set of dplyr selectors. To improve on that approach with more complex feature engineering, let's use the `simple_ames` recipe to preprocess data for modeling.
+
+This object can be attached to the workflow:
+
+
+```r
+lm_wflow %>%
+ add_recipe(simple_ames)
+#> Error in `add_recipe()`:
+#> ! A recipe cannot be added when variables already exist.
+```
+
+That did not work! We can only have one preprocessing method at a time, so we need to remove the existing preprocessor before adding the recipe.
+
+
+```r
+lm_wflow <-
+ lm_wflow %>%
+ remove_variables() %>%
+ add_recipe(simple_ames)
+lm_wflow
+#> ══ Workflow ═════════════════════════════════════════════════════════════════════════
+#> Preprocessor: Recipe
+#> Model: linear_reg()
+#>
+#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
+#> 2 Recipe Steps
+#>
+#> • step_log()
+#> • step_dummy()
+#>
+#> ── Model ────────────────────────────────────────────────────────────────────────────
+#> Linear Regression Model Specification (regression)
+#>
+#> Computational engine: lm
+```
+
+Let's estimate both the recipe and model using a simple call to `fit()`:
+
+
+```r
+lm_fit <- fit(lm_wflow, ames_train)
+```
+
+The `predict()` method applies the same preprocessing that was used on the training set to the new data before passing them along to the model's `predict()` method:
+
+
+```r
+predict(lm_fit, ames_test %>% slice(1:3))
+#> Warning in predict.lm(object = object$fit, newdata = new_data, type = "response"):
+#> prediction from a rank-deficient fit may be misleading
+#> # A tibble: 3 × 1
+#> .pred
+#>   <dbl>
+#> 1 5.08
+#> 2 5.32
+#> 3 5.28
+```
+
+If we need the bare model object or recipe, there are `extract_*` functions that can retrieve them:
+
+
+```r
+# Get the recipe after it has been estimated:
+lm_fit %>%
+ extract_recipe(estimated = TRUE)
+#> Recipe
+#>
+#> Inputs:
+#>
+#> role #variables
+#> outcome 1
+#> predictor 4
+#>
+#> Training data contained 2342 data points and no missing data.
+#>
+#> Operations:
+#>
+#> Log transformation on Gr_Liv_Area [trained]
+#> Dummy variables from Neighborhood, Bldg_Type [trained]
+
+# To tidy the model fit:
+lm_fit %>%
+ # This returns the parsnip object:
+ extract_fit_parsnip() %>%
+ # Now tidy the linear model object:
+ tidy() %>%
+ slice(1:5)
+#> # A tibble: 5 × 5
+#> term estimate std.error statistic p.value
+#>   <chr> <dbl> <dbl> <dbl> <dbl>
+#> 1 (Intercept) -0.669 0.231 -2.90 3.80e- 3
+#> 2 Gr_Liv_Area 0.620 0.0143 43.2 2.63e-299
+#> 3 Year_Built 0.00200 0.000117 17.1 6.16e- 62
+#> 4 Neighborhood_College_Creek 0.0178 0.00819 2.17 3.02e- 2
+#> 5 Neighborhood_Old_Town -0.0330 0.00838 -3.93 8.66e- 5
+```
+
+:::rmdnote
+There are tools for using (and debugging) recipes outside of workflow objects. These are described in Chapter \@ref(dimensionality).
+:::
+
+## How Data are Used by the `recipe()`
+
+Data are passed to recipes at different stages.
+
+First, when calling `recipe(..., data)`, the data set is used to determine the data types of each column so that selectors such as `all_numeric()` or `all_numeric_predictors()` can be used.
+
+Second, when preparing the data using `fit(workflow, data)`, the training data are used for all estimation operations including a recipe that may be part of the `workflow`, from determining factor levels to computing PCA components and everything in between.
+
+:::rmdwarning
+It is important to realize that all preprocessing and feature engineering steps *only* utilize the training data. Otherwise, information leakage can negatively impact the model's performance when used with new data.
+:::
+
+Finally, when using `predict(workflow, new_data)`, no model or preprocessor parameters like those from recipes are re-estimated using the values in `new_data`. Take centering and scaling using `step_normalize()` as an example. Using this step, the means and standard deviations from the appropriate columns are determined from the training set; new samples at prediction time are standardized using these values from training when `predict()` is invoked.
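+
+A minimal sketch of this behavior, using a made-up column `x` and the lower-level `prep()` and `bake()` functions (covered in more detail in Chapter \@ref(dimensionality)):
+
+```r
+# Sketch with toy data: prep() estimates the mean and standard deviation of
+# `x` from train_df; bake() reuses those training-set statistics for new_df.
+library(tidymodels)
+train_df <- tibble(x = c(1, 2, 3, 4, 5))
+new_df   <- tibble(x = c(10, 20))
+
+norm_rec  <- recipe(~ x, data = train_df) %>% step_normalize(x)
+norm_prep <- prep(norm_rec, training = train_df)
+bake(norm_prep, new_data = new_df)
+```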
+
+
+## Examples of `recipe()` Steps {#example-steps}
+
+Before proceeding, let's take an extended tour of the capabilities of recipes and explore some of the most important `step_*()` functions. Each of these recipe step functions specifies one possible "step" in a feature engineering process, and different recipe steps can have different effects on the columns of the data.
+
+### Encoding qualitative data in a numeric format {#dummies}
+
+One of the most common feature engineering tasks is transforming nominal or qualitative data (factors or characters) so that they can be encoded or represented numerically. Sometimes we can alter the factor levels of a qualitative column in helpful ways prior to such a transformation. For example, `step_unknown()` can be used to change missing values to a dedicated factor level. Similarly, if we anticipate that a new factor level may be encountered in future data, `step_novel()` can allot a new level for this purpose.
+
+Additionally, `step_other()` can be used to analyze the frequencies of the factor levels in the training set and convert infrequently occurring values to a catch-all level of "other", with a specific threshold that can be specified. A good example is the `Neighborhood` predictor in our data, shown in Figure \@ref(fig:ames-neighborhoods).
+
+
+
+
+(\#fig:ames-neighborhoods)Frequencies of neighborhoods in the Ames training set.
+
+
+Here we see there are two neighborhoods that have less than five properties in the training data (Landmark and Green Hills); in this case, no houses at all in the Landmark neighborhood were included in the training set. For some models, it may be problematic to have dummy variables with a single non-zero entry in the column. At a minimum, it is highly improbable that these features would be important to a model. If we add `step_other(Neighborhood, threshold = 0.01)` to our recipe, the bottom 1% of the neighborhoods will be lumped into a new level called "other". In this training set, this will catch 7 neighborhoods.
+
+For the Ames data, we can amend the recipe to use:
+
+
+```r
+simple_ames <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
+ data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01) %>%
+ step_dummy(all_nominal_predictors())
+```
+
+:::rmdnote
+Many, but not all, underlying model calculations require predictor values to be encoded as numbers. Notable exceptions include tree-based models, rule-based models, and naive Bayes models.
+:::
+
+There are a few strategies for converting a factor predictor to a numeric format. The most common method is to create "dummy" or indicator variables. Let's take the predictor in the Ames data for the building type, which is a factor variable with five levels (see Table \@ref(tab:dummy-vars)). For dummy variables, the single `Bldg_Type` column would be replaced with four numeric columns whose values are either zero or one. These binary variables represent specific factor level values. In R, the convention is to exclude a column for the first factor level (`OneFam`, in this case). The `Bldg_Type` column would be replaced with a column called `TwoFmCon` that is one when the row has that value and zero otherwise. Three other columns are similarly created:
+
+
+Table: (\#tab:dummy-vars)Illustration of binary encodings (i.e., "dummy variables") for a qualitative predictor.
+
+|Raw Data | TwoFmCon| Duplex| Twnhs| TwnhsE|
+|:--------|--------:|------:|-----:|------:|
+|OneFam | 0| 0| 0| 0|
+|TwoFmCon | 1| 0| 0| 0|
+|Duplex | 0| 1| 0| 0|
+|Twnhs | 0| 0| 1| 0|
+|TwnhsE | 0| 0| 0| 1|
+
+
+Why not all five? The most basic reason is simplicity; if you know the value for these four columns, you can determine the last value because these are mutually exclusive categories. More technically, the classical justification is that a number of models, including ordinary linear regression, have numerical issues when there are linear dependencies between columns. If all five building type indicator columns are included, they would add up to the intercept column (if there is one). This would cause an issue, or perhaps an outright error, in the underlying matrix algebra.
+
+The full set of encodings can be used for some models. This is traditionally called the "one-hot" encoding and can be achieved using the `one_hot` argument of `step_dummy()`.
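+
+For example, a sketch of requesting the full set of building type indicator columns:
+
+```r
+  # Keep all five indicator columns rather than dropping the first level
+  step_dummy(all_nominal_predictors(), one_hot = TRUE)
+```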
+
+One helpful feature of `step_dummy()` is that there is more control over how the resulting dummy variables are named. In base R, dummy variable names mash the variable name with the level, resulting in names like `NeighborhoodVeenker`. Recipes, by default, use an underscore as the separator between the name and level (e.g., `Neighborhood_Veenker`) and there is an option to use custom formatting for the names. The default naming convention in recipes makes it easier to capture those new columns in future steps using a selector, such as `starts_with("Neighborhood_")`.
+
+Traditional dummy variables require that all of the possible categories be known to create a full set of numeric features. There are other methods for doing this transformation to a numeric format. _Feature hashing_ methods only consider the value of the category to assign it to a predefined pool of dummy variables. _Effect_ or _likelihood encodings_ replace the original data with a single numeric column that measures the _effect_ of those data. Both feature hashing and effect encoding methods can seamlessly handle situations where a novel factor level is encountered in the data. Chapter \@ref(categorical) explores these and other methods for encoding categorical data, beyond straightforward dummy or indicator variables.
+
+:::rmdnote
+Different recipe steps behave differently when applied to variables in the data. For example, `step_log()` modifies a column in-place without changing the name. Other steps, such as `step_dummy()`, eliminate the original data column and replace it with one or more columns with different names. The effect of a recipe step depends on the type of feature engineering transformation being done.
+:::
+
+### Interaction terms
+
+Interaction effects involve two or more predictors. Such an effect occurs when one predictor has an effect on the outcome that is contingent on one or more other predictors. For example, if you were trying to predict how much traffic there will be during your commute, two potential predictors could be the specific time of day you commute and the weather. However, the relationship between the amount of traffic and bad weather is different for different times of day. In this case, you could add an interaction term between the two predictors to the model along with the original two predictors (which are called the "main effects"). Numerically, an interaction term between predictors is encoded as their product. Interactions are only defined in terms of their effect on the outcome and can be combinations of different types of data (e.g., numeric, categorical, etc). [Chapter 7](https://bookdown.org/max/FES/detecting-interaction-effects.html) of @fes discusses interactions and how to detect them in greater detail.
+
+After exploring the Ames training set, we might find that the regression slopes for the gross living area differ for different building types, as shown in Figure \@ref(fig:building-type-interactions).
+
+
+```r
+ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) +
+ geom_point(alpha = .2) +
+ facet_wrap(~ Bldg_Type) +
+ geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "lightblue") +
+ scale_x_log10() +
+ scale_y_log10() +
+ labs(x = "Gross Living Area", y = "Sale Price (USD)")
+```
+
+
+
+
+(\#fig:building-type-interactions)Gross living area (in log-10 units) versus sale price (also in log-10 units) for five different building types.
+
+
+How are interactions specified in a recipe? A base R formula would take an interaction using a `:`, so we would use:
+
+```r
+Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type +
+ log10(Gr_Liv_Area):Bldg_Type
+# or
+Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type
+```
+
+where `*` expands those columns to the main effects and interaction term. Again, the formula method does many things simultaneously and understands that a factor variable (such as `Bldg_Type`) should be expanded into dummy variables first and that the interaction should involve all of the resulting binary columns.
+
+Recipes are more explicit and sequential, and give you more control. With the current recipe, `step_dummy()` has already created dummy variables. How would we combine these for an interaction? The additional step would look like `step_interact(~ interaction terms)` where the terms on the right-hand side of the tilde are the interactions. These can include selectors, so it would be appropriate to use:
+
+
+```r
+simple_ames <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
+ data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01) %>%
+ step_dummy(all_nominal_predictors()) %>%
+ # Gr_Liv_Area is on the log scale from a previous step
+ step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") )
+```
+
+Additional interactions can be specified in this formula by separating them by `+`. Also note that the recipe will only utilize interactions between different variables; if the formula uses `var_1:var_1`, this term will be ignored.
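+
+For example, a sketch that also crosses the year built with the building type indicators:
+
+```r
+  # Two interaction specifications, separated by `+`
+  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") +
+                   Year_Built:starts_with("Bldg_Type_") )
+```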
+
+Suppose that, in a recipe, we had not yet made dummy variables for building types. It would be inappropriate to include a factor column in this step, such as:
+
+```r
+ step_interact( ~ Gr_Liv_Area:Bldg_Type )
+```
+
+This is telling the underlying (base R) code used by `step_interact()` to make dummy variables and then form the interactions. In fact, if this occurs, a warning states that this might generate unexpected results.
+
+
+
+This behavior gives you more control, but is different from R’s standard model formula.
+
+
+As with naming dummy variables, recipes provides more coherent names for interaction terms. In this case, the interaction is named `Gr_Liv_Area_x_Bldg_Type_Duplex` instead of `Gr_Liv_Area:Bldg_TypeDuplex` (which is not a valid column name for a data frame).
+
+
+:::rmdnote
+_Remember that order matters_. The gross living area is log transformed prior to the interaction term. Subsequent interactions with this variable will also use the log scale.
+:::
+
+
+### Spline functions
+
+When a predictor has a non-linear relationship with the outcome, some types of predictive models can adaptively approximate this relationship during training. However, simpler is usually better and it is not uncommon to try to use a simple model, such as a linear fit, and add in specific non-linear features for predictors that may need them, such as longitude and latitude for the Ames housing data. One common method for doing this is to use _spline_ functions to represent the data. Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, non-linear relationship. As more spline terms are added to the data, the capacity to non-linearly represent the relationship increases. Unfortunately, it may also increase the likelihood of picking up on data trends that occur by chance (i.e., over-fitting).
+
+If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(fig:ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:
+
+
+```r
+library(patchwork)
+library(splines)
+
+plot_smoother <- function(deg_free) {
+ ggplot(ames_train, aes(x = Latitude, y = 10^Sale_Price)) +
+ geom_point(alpha = .2) +
+ scale_y_log10() +
+ geom_smooth(
+ method = lm,
+ formula = y ~ ns(x, df = deg_free),
+ color = "lightblue",
+ se = FALSE
+ ) +
+ labs(title = paste(deg_free, "Spline Terms"),
+ y = "Sale Price (USD)")
+}
+
+( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )
+```
+
+
+
+
+(\#fig:ames-latitude-splines)Sale price versus latitude, with trend lines using natural splines with different degrees of freedom.
+
+
+The `ns()` function in the splines package generates feature columns using functions called _natural splines_.
+
+Some panels in Figure \@ref(fig:ames-latitude-splines) clearly fit poorly; two terms _under-fit_ the data while 100 terms _over-fit_. The panels with five and 20 terms seem like reasonably smooth fits that catch the main patterns of the data. This indicates that the proper amount of "non-linear-ness" matters. The number of spline terms could then be considered a _tuning parameter_ for this model. These types of parameters are explored in Chapter \@ref(tuning).
+
+In recipes, there are multiple steps that can create these types of terms. To add a natural spline representation for this predictor:
+
+
+```r
+recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
+ data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01) %>%
+ step_dummy(all_nominal_predictors()) %>%
+ step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
+ step_ns(Latitude, deg_free = 20)
+```
+
+The user would need to determine if both neighborhood and latitude should be in the model since they both represent the same underlying data in different ways.
+
+### Feature extraction
+
+Another common method for representing multiple features at once is called _feature extraction_. Most of these techniques create new features from the predictors that capture the information in the broader set as a whole. For example, principal component analysis (PCA) tries to extract as much of the original information in the predictor set as possible using a smaller number of features. PCA is a linear extraction method, meaning that each new feature is a linear combination of the original predictors. One nice aspect of PCA is that each of the new features, called the principal components or PCA scores, are uncorrelated with one another. Because of this, PCA can be very effective at reducing the correlation between predictors. Note that PCA is only aware of the predictors; the new PCA features might not be associated with the outcome.
+
+In the Ames data, there are several predictors that measure size of the property, such as the total basement size (`Total_Bsmt_SF`), size of the first floor (`First_Flr_SF`), the gross living area (`Gr_Liv_Area`), and so on. PCA might be an option to represent these potentially redundant variables as a smaller feature set. Apart from the gross living area, these predictors have the suffix `SF` in their names (for square feet) so a recipe step for PCA might look like:
+
+```r
+ # Use a regular expression to capture house size predictors:
+ step_pca(matches("(SF$)|(Gr_Liv)"))
+```
+
+Note that all of these columns are measured in square feet. PCA assumes that all of the predictors are on the same scale. That's true in this case, but often this step can be preceded by `step_normalize()`, which will center and scale each column.
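+
+If the size predictors needed to be standardized first, a sketch of that ordering would be:
+
+```r
+  # Center and scale the size predictors, then extract the components
+  step_normalize(matches("(SF$)|(Gr_Liv)")) %>%
+  step_pca(matches("(SF$)|(Gr_Liv)"))
+```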
+
+There are existing recipe steps for other extraction methods, such as: independent component analysis (ICA), non-negative matrix factorization (NNMF), multidimensional scaling (MDS), uniform manifold approximation and projection (UMAP), and others.
+
+### Row sampling steps
+
+Recipe steps can affect the rows of a data set as well. For example, _subsampling_ techniques for class imbalances change the class proportions in the data being given to the model; these techniques often don't improve overall performance but can generate better behaved distributions of the predicted class probabilities. There are several possible approaches to try when subsampling your data with class imbalance:
+
+ * _Downsampling_ the data keeps the minority class and takes a random sample of the majority class so that class frequencies are balanced.
+
+ * _Upsampling_ replicates samples from the minority class to balance the classes. Some techniques do this by synthesizing new samples that resemble the minority class data while other methods simply add the same minority samples repeatedly.
+
+ * _Hybrid methods_ do a combination of both.
+
+The [themis](https://themis.tidymodels.org/) package has recipe steps that can be used to address class imbalance via subsampling. For simple downsampling, we would use:
+
+```r
+ step_downsample(outcome_column_name)
+```
+
+:::rmdwarning
+Only the training set should be affected by these techniques. The test set or other holdout samples should be left as-is when processed using the recipe. For this reason, all of the subsampling steps default the `skip` argument to have a value of `TRUE`.
+:::
+
+There are other step functions that are row-based as well: `step_filter()`, `step_sample()`, `step_slice()`, and `step_arrange()`. In almost all uses of these steps, the `skip` argument should be set to `TRUE`.
+
+### General transformations
+
+Mirroring the original dplyr operation, `step_mutate()` can be used to conduct a variety of basic operations to the data. It is best used for straightforward transformations like computing a ratio of two variables, such as `Bedroom_AbvGr / Full_Bath`, the ratio of bedrooms to bathrooms for the Ames housing data.
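+
+A sketch of such a step (the new column name here is just for illustration):
+
+```r
+  # Hypothetical column name; the ratio is computed row by row
+  step_mutate(Bedroom_Bath_Ratio = Bedroom_AbvGr / Full_Bath)
+```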
+
+:::rmdwarning
+When using this flexible step, use extra care to avoid data leakage in your preprocessing. Consider, for example, the transformation `x = w > mean(w)`. When applied to new data or testing data, this transformation would use the mean of `w` from the _new_ data, not the mean of `w` from the training data.
+:::
+
+
+### Natural language processing
+
+Recipes can also handle data that are not in the traditional structure where the columns are features. For example, the [textrecipes](https://textrecipes.tidymodels.org/) package can apply natural language processing methods to the data. The input column is typically a string of text and different steps can be used to tokenize the data (e.g., split the text into separate words), filter out tokens, and create new features appropriate for modeling.
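+
+As a sketch, assuming a data frame `reviews` with a character column `review_text` (neither appears elsewhere in this book), a text recipe might tokenize the text, keep the most frequent tokens, and compute tf-idf features:
+
+```r
+# Hypothetical sketch: reviews and review_text are made up for illustration.
+library(textrecipes)
+
+recipe(~ review_text, data = reviews) %>%
+  step_tokenize(review_text) %>%
+  step_tokenfilter(review_text, max_tokens = 100) %>%
+  step_tfidf(review_text)
+```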
+
+
+## Skipping Steps for New Data {#skip-equals-true}
+
+The sale price data are already log transformed in the `ames` data frame. Why not use:
+
+```r
+ step_log(Sale_Price, base = 10)
+```
+
+This will cause a failure when the recipe is applied to new properties with an unknown sale price. Since price is what we are trying to predict, there probably won't be a column in the data for this variable. In fact, to avoid _information leakage_, many tidymodels packages isolate the data being used when making any predictions. This means that the training set and any outcome columns are not available for use at prediction time.
+
+:::rmdnote
+For simple transformations of the outcome column(s), we strongly suggest that those operations be _conducted outside of the recipe_.
+:::
+
+However, there are other circumstances where this is not an adequate solution. For example, in classification models where there is a severe class imbalance, it is common to conduct _subsampling_ of the data that are given to the modeling function, as previously mentioned. For example, suppose that there were two classes and a 10% event rate. A simple, albeit controversial, approach would be to _down-sample_ the data so that the model is provided with all of the events and a random 10% of the non-event samples.
+
+The problem is that the same subsampling process should not be applied to the data being predicted. As a result, when using a recipe, we need a mechanism to ensure that some operations are only applied to the data that are given to the model. Each step function has an option called `skip` that, when set to `TRUE`, causes the step to be skipped by the `predict()` function. In this way, you can isolate the steps that affect the modeling data without causing errors when applied to new samples. However, all steps are applied when using `fit()`.
+
+
+
+At the time of this writing, the step functions in the recipes and themis packages that are only applied to the training data are: `step_adasyn()`, `step_bsmote()`, `step_downsample()`, `step_filter()`, `step_nearmiss()`, `step_rose()`, `step_sample()`, `step_slice()`, `step_smote()`, `step_smotenc()`, `step_tomek()`, and `step_upsample()`.
+
+
+## Tidy a `recipe()`
+
+In Chapter \@ref(base-r), we introduced the `tidy()` verb for statistical objects. There is also a `tidy()` method for recipes, as well as individual recipe steps. Before proceeding, let's create an extended recipe for the Ames data using some of the new steps we've discussed in this chapter:
+
+
+```r
+ames_rec <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+ Latitude + Longitude, data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01) %>%
+ step_dummy(all_nominal_predictors()) %>%
+ step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
+ step_ns(Latitude, Longitude, deg_free = 20)
+```
+
+The `tidy()` method, when called with the recipe object, gives a summary of the recipe steps:
+
+
+```r
+tidy(ames_rec)
+#> # A tibble: 5 × 6
+#> number operation type trained skip id
+#>   <int> <chr> <chr> <lgl> <lgl> <chr>
+#> 1 1 step log FALSE FALSE log_66JTU
+#> 2 2 step other FALSE FALSE other_ePfcw
+#> 3 3 step dummy FALSE FALSE dummy_Z18Cl
+#> 4 4 step interact FALSE FALSE interact_JLU36
+#> 5 5 step ns FALSE FALSE ns_rvsqQ
+```
+
+This result can be helpful for identifying individual steps, perhaps to then be able to execute the `tidy()` method on one specific step.
+
+We can specify the `id` argument in any step function call; otherwise it is generated using a random suffix. Setting this value can be helpful if the same type of step is added to the recipe more than once. Let's specify the `id` ahead of time for `step_other()`, since we'll want to `tidy()` it:
+
+
+```r
+ames_rec <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+ Latitude + Longitude, data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01, id = "my_id") %>%
+ step_dummy(all_nominal_predictors()) %>%
+ step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
+ step_ns(Latitude, Longitude, deg_free = 20)
+```
+
+We'll re-fit the workflow with this new recipe:
+
+
+```r
+lm_wflow <-
+ workflow() %>%
+ add_model(lm_model) %>%
+ add_recipe(ames_rec)
+
+lm_fit <- fit(lm_wflow, ames_train)
+```
+
+The `tidy()` method can be called again along with the `id` identifier we specified to get our results for applying `step_other()`:
+
+
+```r
+estimated_recipe <-
+ lm_fit %>%
+ extract_recipe(estimated = TRUE)
+
+tidy(estimated_recipe, id = "my_id")
+#> # A tibble: 22 × 3
+#> terms retained id
+#>   <chr> <chr> <chr>
+#> 1 Neighborhood North_Ames my_id
+#> 2 Neighborhood College_Creek my_id
+#> 3 Neighborhood Old_Town my_id
+#> 4 Neighborhood Edwards my_id
+#> 5 Neighborhood Somerset my_id
+#> 6 Neighborhood Northridge_Heights my_id
+#> # … with 16 more rows
+```
+
+The `tidy()` results we see here for using `step_other()` show which factor levels were retained, i.e., not added to the new "other" category.
+
+The `tidy()` method can be called with the `number` identifier as well, if we know which step in the recipe we need:
+
+
+```r
+tidy(estimated_recipe, number = 2)
+#> # A tibble: 22 × 3
+#> terms retained id
+#>   <chr> <chr> <chr>
+#> 1 Neighborhood North_Ames my_id
+#> 2 Neighborhood College_Creek my_id
+#> 3 Neighborhood Old_Town my_id
+#> 4 Neighborhood Edwards my_id
+#> 5 Neighborhood Somerset my_id
+#> 6 Neighborhood Northridge_Heights my_id
+#> # … with 16 more rows
+```
+
+Each `tidy()` method returns the relevant information about that step. For example, the `tidy()` method for `step_dummy()` returns a column with the variables that were converted to dummy variables and another column with all of the known levels for each column.
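+
+For example, a sketch of looking at the dummy variable step, which is the third step in `ames_rec`:
+
+```r
+# step_dummy() is step number 3 in the recipe defined above
+tidy(estimated_recipe, number = 3)
+```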
+
+## Column Roles
+
+When a formula is used with the initial call to `recipe()` it assigns _roles_ to each of the columns depending on which side of the tilde that they are on. Those roles are either `"predictor"` or `"outcome"`. However, other roles can be assigned as needed.
+
+For example, in our Ames data set, the original raw data contained a column for address.^[Our version of these data does not contain that column.] It may be useful to keep that column in the data so that, after predictions are made, problematic results can be investigated in detail. In other words, the column could be important even when it isn't a predictor or outcome.
+
+To solve this, the `add_role()`, `remove_role()`, and `update_role()` functions can be helpful. For example, for the house price data, the role of the street address column could be modified using:
+
+```r
+ames_rec %>% update_role(address, new_role = "street address")
+```
+
+After this change, the `address` column in the data frame will no longer be a predictor but instead will be a `"street address"` according to the recipe. Any character string can be used as a role. Also, columns can have multiple roles (additional roles are added via `add_role()`) so that they can be selected under more than one context.
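+
+Continuing the hypothetical address example, a sketch of adding a second role:
+
+```r
+# Sketch: the column keeps its "street address" role and gains another one
+ames_rec %>% add_role(address, new_role = "id variable")
+```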
+
+This can be helpful when the data are _resampled_. It helps to keep the columns that are not involved with the model fit in the same data frame (rather than in an external vector). Resampling, described in Chapter \@ref(resampling), creates alternate versions of the data mostly by row subsampling. If the street address were in another column, additional subsampling would be required and might lead to more complex code and a higher likelihood of errors.
+
+Finally, all step functions have a `role` field that can assign roles to the results of the step. In many cases, columns affected by a step retain their existing role. For example, the `step_log()` calls to our `ames_rec` object affected the `Gr_Liv_Area` column. For that step, the default behavior is to keep the existing role for this column since no new column is created. As a counter-example, the step to produce splines defaults new columns to have a role of `"predictor"` since that is usually how spline columns are used in a model. Most steps have sensible defaults but, since the defaults can be different, be sure to check the documentation page to understand which role(s) will be assigned.
+
+## Chapter Summary {#recipes-summary}
+
+In this chapter, you learned about using recipes for flexible feature engineering and data preprocessing, from creating dummy variables to handling class imbalance and more. Feature engineering is an important part of the modeling process where information leakage can easily occur and good practices must be adopted. Between the recipes package and other packages that extend recipes, there are over 100 available steps. All possible recipe steps are enumerated at [`tidymodels.org/find`](https://www.tidymodels.org/find/). The recipes framework provides a rich data manipulation environment for preprocessing and transforming data prior to modeling.
+Additionally, [`tidymodels.org/learn/develop/recipes/`](https://www.tidymodels.org/learn/develop/recipes/) shows how custom steps can be created.
+
+Our work here has used recipes solely inside of a workflow object. For modeling, that is the recommended use because feature engineering should be estimated together with a model. However, for visualization and other activities, a workflow may not be appropriate; more recipe-specific functions may be required. Chapter \@ref(dimensionality) discusses lower-level APIs for fitting, using, and troubleshooting recipes.
+
+The code that we will use in later chapters is:
+
+
+```r
+library(tidymodels)
+data(ames)
+ames <- mutate(ames, Sale_Price = log10(Sale_Price))
+
+set.seed(123)
+ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
+ames_train <- training(ames_split)
+ames_test <- testing(ames_split)
+
+ames_rec <-
+ recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+ Latitude + Longitude, data = ames_train) %>%
+ step_log(Gr_Liv_Area, base = 10) %>%
+ step_other(Neighborhood, threshold = 0.01) %>%
+ step_dummy(all_nominal_predictors()) %>%
+ step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
+ step_ns(Latitude, Longitude, deg_free = 20)
+
+lm_model <- linear_reg() %>% set_engine("lm")
+
+lm_wflow <-
+ workflow() %>%
+ add_model(lm_model) %>%
+ add_recipe(ames_rec)
+
+lm_fit <- fit(lm_wflow, ames_train)
+```
+
+
+
diff --git a/tmwr-atlas/09-judging-model-effectiveness.md b/tmwr-atlas/09-judging-model-effectiveness.md
new file mode 100644
index 00000000..1bf0ac33
--- /dev/null
+++ b/tmwr-atlas/09-judging-model-effectiveness.md
@@ -0,0 +1,461 @@
+
+
+# Judging Model Effectiveness {#performance}
+
+Once we have a model, we need to know how well it works. A quantitative approach for estimating effectiveness allows us to understand the model, to compare different models, or to tweak the model to improve performance. Our focus in tidymodels is on empirical validation; this usually means using data that were not used to create the model as the substrate to measure effectiveness.
+
+:::rmdwarning
+The best approach to empirical validation involves using _resampling_ methods that will be introduced in Chapter \@ref(resampling). In this chapter, we will motivate the need for empirical validation by using the test set. Keep in mind that the test set can only be used once, as explained in Chapter \@ref(splitting).
+:::
+
+When judging model effectiveness, your decision about which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best sub-model. Choosing the wrong metric can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (a.k.a. $R^2$). The former measures _accuracy_ while the latter measures _correlation_. These are not necessarily the same thing. Figure \@ref(fig:performance-reg-metrics) demonstrates the difference between the two.
+
+
+
+
+(\#fig:performance-reg-metrics)Observed versus predicted values for models that are optimized using the RMSE compared to the coefficient of determination.
+
+
+A model optimized for RMSE has more variability but has relatively uniform accuracy across the range of the outcome. The right panel shows that there is a tighter correlation between the observed and predicted values but this model performs poorly in the tails.
+
+This chapter will demonstrate the yardstick package, a core tidymodels package whose focus is measuring model performance. Before illustrating syntax, let's explore whether empirical validation using performance metrics is worthwhile when a model is focused on inference rather than prediction.
+
+## Performance Metrics and Inference
+
+
+
+The effectiveness of any given model depends on how the model will be used. An inferential model is used primarily to understand relationships, and typically emphasizes the choice (and validity) of probabilistic distributions and other generative qualities that define the model. For a model used primarily for prediction, by contrast, predictive strength is of primary importance and other concerns about underlying statistical qualities may be less important. Predictive strength is usually determined by how close our predictions come to the observed data, i.e., fidelity of the model predictions to the actual results. This chapter focuses on functions that can be used to measure predictive strength. However, our advice for those developing inferential models is to use these techniques even when the model will not be used with the primary goal of prediction.
+
+A longstanding issue with the practice of inferential statistics is that, with a focus purely on inference, it is difficult to assess the credibility of a model. For example, consider the Alzheimer's disease data from @CraigSchapiro when 333 patients were studied to determine the factors that influence cognitive impairment. An analysis might take the known risk factors and build a logistic regression model where the outcome is binary (impaired/non-impaired). Let's consider predictors for age, sex, and the Apolipoprotein E genotype. The latter is a categorical variable with the six possible combinations of the three main variants of this gene. Apolipoprotein E is known to have an association with dementia [@Kim:2009p4370].
+
+A superficial, but not uncommon, approach to this analysis would be to fit a large model with main effects and interactions, then use statistical tests to find the minimal set of model terms that are statistically significant at some pre-defined level. If a full model with the three factors and their two- and three-way interactions were used, an initial phase would be to test the interactions using sequential likelihood ratio tests [@HosmerLemeshow]. Let's step through this kind of approach for the example Alzheimer's disease data:
+
+* When comparing the model with all two-way interactions to one with the additional three-way interaction, the likelihood ratio test produces a p-value of 0.888. This implies that there is no evidence that the 4 additional model terms associated with the three-way interaction explain enough of the variation in the data to keep them in the model.
+
+* Next, the two-way interactions are similarly evaluated against the model with no interactions. The p-value here is 0.0382. This is somewhat borderline, but, given the small sample size, it would be prudent to conclude that there is evidence that some of the 10 possible two-way interactions are important to the model.
+
+* From here, we would build some explanation of the results. The interactions would be particularly important to discuss since they may spark interesting physiological or neurological hypotheses to be explored further.
+
+While shallow, this analysis strategy is common in practice as well as in the literature. This is especially true if the practitioner has limited formal training in data analysis.
+
+One missing piece of information in this approach is how closely this model fits the actual data. Using resampling methods, discussed in Chapter \@ref(resampling), we can estimate the accuracy of this model to be about 73.3%. Accuracy is often a poor measure of model performance; we use it here because it is commonly understood. If the model has 73.3% fidelity to the data, should we trust conclusions it produces? We might think so until we realize that the baseline rate of non-impaired patients in the data is 72.7%. This means that, despite our statistical analysis, the two-factor model appears to be only 0.6% better than a simple heuristic that always predicts patients to be unimpaired, regardless of the observed data.
+
+:::rmdnote
+The point of this analysis is to demonstrate the idea that optimization of statistical characteristics of the model does not imply that the model fits the data well. Even for purely inferential models, some measure of fidelity to the data should accompany the inferential results. Using this, the consumers of the analyses can calibrate their expectations of the results.
+:::
+
+In the remainder of this chapter, we will discuss general approaches for evaluating models via empirical validation. These approaches are grouped by the nature of the outcome data: purely numeric, binary classes, and three or more class levels.
+
+## Regression Metrics
+
+Recall from Chapter \@ref(models) that tidymodels prediction functions produce tibbles with columns for the predicted values. These columns have consistent names, and the functions in the yardstick package that produce performance metrics have consistent interfaces. The functions are data frame-based, as opposed to vector-based, with the general syntax of:
+
+```r
+function(data, truth, ...)
+```
+
+where `data` is a data frame or tibble and `truth` is the column with the observed outcome values. The ellipses or other arguments are used to specify the column(s) containing the predictions.
+
+
+To illustrate, let's take the model from the very end of Chapter \@ref(recipes). This model, `lm_fit`, combines a linear regression model with a predictor set supplemented with an interaction and spline functions for longitude and latitude. It was created from a training set (named `ames_train`). Although we do not advise using the test set at this juncture of the modeling process, it will be used here to illustrate functionality and syntax. The data frame `ames_test` consists of 588 properties. To start, let's produce predictions:
+
+
+
+```r
+ames_test_res <- predict(lm_fit, new_data = ames_test %>% select(-Sale_Price))
+ames_test_res
+#> # A tibble: 588 × 1
+#> .pred
+#>   <dbl>
+#> 1 5.07
+#> 2 5.31
+#> 3 5.28
+#> 4 5.33
+#> 5 5.30
+#> 6 5.24
+#> # … with 582 more rows
+```
+
+The predicted numeric outcome from the regression model is named `.pred`. Let's match the predicted values with their corresponding observed outcome values:
+
+
+```r
+ames_test_res <- bind_cols(ames_test_res, ames_test %>% select(Sale_Price))
+ames_test_res
+#> # A tibble: 588 × 2
+#> .pred Sale_Price
+#>   <dbl>      <dbl>
+#> 1 5.07 5.02
+#> 2 5.31 5.39
+#> 3 5.28 5.28
+#> 4 5.33 5.28
+#> 5 5.30 5.28
+#> 6 5.24 5.26
+#> # … with 582 more rows
+```
+
+We see that these values mostly look close but we don't yet have a quantitative understanding of how the model is doing because we haven't computed any performance metrics. Note that both the predicted and observed outcomes are in log10 units. It is best practice to analyze the predictions on the transformed scale (if one were used) even if the predictions are reported using the original units.
+
+Let's plot the data in Figure \@ref(fig:ames-performance-plot) before computing metrics:
+
+
+```r
+ggplot(ames_test_res, aes(x = Sale_Price, y = .pred)) +
+ # Create a diagonal line:
+ geom_abline(lty = 2) +
+ geom_point(alpha = 0.5) +
+ labs(y = "Predicted Sale Price (log10)", x = "Sale Price (log10)") +
+ # Scale and size the x- and y-axis uniformly:
+ coord_obs_pred()
+```
+
+(\#fig:ames-performance-plot)Observed versus predicted values for an Ames regression model, with log-10 units on both axes.
+
+
+There is one low-price property that is substantially over-predicted, i.e., quite high above the dashed line.
+
+Let's compute the root mean squared error for this model using the `rmse()` function:
+
+
+```r
+rmse(ames_test_res, truth = Sale_Price, estimate = .pred)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 rmse standard 0.0736
+```
+
+This shows us the standard format of the output of yardstick functions. Metrics for numeric outcomes usually have a value of "standard" for the `.estimator` column. Examples with different values for this column are shown in the next sections.
+
+To compute multiple metrics at once, we can create a _metric set_. Let's add $R^2$ and the mean absolute error:
+
+
+```r
+ames_metrics <- metric_set(rmse, rsq, mae)
+ames_metrics(ames_test_res, truth = Sale_Price, estimate = .pred)
+#> # A tibble: 3 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 rmse standard 0.0736
+#> 2 rsq standard 0.836
+#> 3 mae standard 0.0549
+```
+
+This tidy data format stacks the metrics vertically. The root mean squared error and mean absolute error metrics are both on the scale of the outcome (so `log10(Sale_Price)` for our example) and measure the difference between the predicted and observed values. The value for $R^2$ measures the squared correlation between the predicted and observed values, so values closer to one are better.
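+
+As a quick check of that last point, the `rsq` value shown above should be reproducible directly from the predictions (a sketch using base R):
+
+```r
+# yardstick's rsq() is the squared correlation between the observed and
+# predicted values:
+cor(ames_test_res$Sale_Price, ames_test_res$.pred)^2
+```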
+
+:::rmdwarning
+The yardstick package does _not_ contain a function for adjusted $R^2$. This modification of the coefficient of determination is commonly used when the same data used to fit the model are used to evaluate the model. This metric is not fully supported in tidymodels because it is always a better approach to compute performance on a separate data set than the one used to fit the model.
+:::
+
+## Binary Classification Metrics
+
+To illustrate other ways to measure model performance, we will switch to a different example. The modeldata package (another one of the tidymodels packages) contains example predictions from a test data set with two classes ("Class1" and "Class2"):
+
+
+```r
+data(two_class_example)
+tibble(two_class_example)
+#> # A tibble: 500 × 4
+#> truth Class1 Class2 predicted
+#> <fct> <dbl> <dbl> <fct>
+#> 1 Class2 0.00359 0.996 Class2
+#> 2 Class1 0.679 0.321 Class1
+#> 3 Class2 0.111 0.889 Class2
+#> 4 Class1 0.735 0.265 Class1
+#> 5 Class2 0.0162 0.984 Class2
+#> 6 Class1 0.999 0.000725 Class1
+#> # … with 494 more rows
+```
+
+The second and third columns are the predicted class probabilities for the test set while `predicted` are the discrete predictions.
+
+For the hard class predictions, there are a variety of yardstick functions that are helpful:
+
+
+```r
+# A confusion matrix:
+conf_mat(two_class_example, truth = truth, estimate = predicted)
+#> Truth
+#> Prediction Class1 Class2
+#> Class1 227 50
+#> Class2 31 192
+
+# Accuracy:
+accuracy(two_class_example, truth, predicted)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 accuracy binary 0.838
+
+# Matthews correlation coefficient:
+mcc(two_class_example, truth, predicted)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 mcc binary 0.677
+
+# F1 metric:
+f_meas(two_class_example, truth, predicted)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 f_meas binary 0.849
+
+# Combining these three classification metrics together
+classification_metrics <- metric_set(accuracy, mcc, f_meas)
+classification_metrics(two_class_example, truth = truth, estimate = predicted)
+#> # A tibble: 3 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 accuracy binary 0.838
+#> 2 mcc binary 0.677
+#> 3 f_meas binary 0.849
+```
+
+The Matthews correlation coefficient and F1 score both summarize the confusion matrix, but compared to `mcc()` which measures the quality of both positive and negative examples, the `f_meas()` metric emphasizes the positive class, i.e., the event of interest. For binary classification data sets like this example, yardstick functions have a standard argument called `event_level` to distinguish positive and negative levels. The default (which we used in this code) is that the *first* level of the outcome factor is the event of interest.
+
+:::rmdnote
+There is some heterogeneity in R functions in this regard; some use the first level and others the second to denote the event of interest. We consider it more intuitive that the first level is the most important. The second-level logic is born of encoding the outcome as 0/1 (in which case the second value is the event) and unfortunately remains in some packages. However, tidymodels (along with many other R packages) require a categorical outcome to be encoded as a factor and, for this reason, the legacy justification for the second level as the event becomes irrelevant.
+:::
+
+As an example where the second level is the event:
+
+
+```r
+f_meas(two_class_example, truth, predicted, event_level = "second")
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 f_meas binary 0.826
+```
+
+In this output, the `.estimator` value of "binary" indicates that the standard formula for binary classes will be used.
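+
+Other metrics computed from the confusion matrix follow the same data frame interface and respect the same `event_level` argument. For instance (a brief sketch):
+
+```r
+# Sensitivity and specificity for the hard class predictions; by default,
+# the first factor level ("Class1") is treated as the event:
+sensitivity(two_class_example, truth, predicted)
+specificity(two_class_example, truth, predicted)
+```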
+
+There are numerous classification metrics that use the predicted probabilities as inputs rather than the hard class predictions. For example, the receiver operating characteristic (ROC) curve computes the sensitivity and specificity over a continuum of different event thresholds. The predicted class column is not used. There are two yardstick functions for this method: `roc_curve()` computes the data points that make up the ROC curve and `roc_auc()` computes the area under the curve.
+
+The interfaces to these types of metric functions use the `...` argument placeholder to pass in the appropriate class probability column. For two-class problems, the probability column for the event of interest is passed into the function:
+
+
+```r
+two_class_curve <- roc_curve(two_class_example, truth, Class1)
+two_class_curve
+#> # A tibble: 502 × 3
+#> .threshold specificity sensitivity
+#> <dbl> <dbl> <dbl>
+#> 1 -Inf 0 1
+#> 2 1.79e-7 0 1
+#> 3 4.50e-6 0.00413 1
+#> 4 5.81e-6 0.00826 1
+#> 5 5.92e-6 0.0124 1
+#> 6 1.22e-5 0.0165 1
+#> # … with 496 more rows
+
+roc_auc(two_class_example, truth, Class1)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 roc_auc binary 0.939
+```
+
+The `two_class_curve` object can be used in a `ggplot` call to visualize the curve, as shown in Figure \@ref(fig:example-roc-curve). There is an `autoplot()` method that will take care of the details:
+
+
+```r
+autoplot(two_class_curve)
+```
+
+(\#fig:example-roc-curve)Example ROC curve.
+
+
+If the curve was close to the diagonal line, then the model’s predictions would be no better than random guessing. Since the curve is up in the top, left-hand corner, we see that our model performs well at different thresholds.
+
+There are a number of other functions that use probability estimates, including `gain_curve()`, `lift_curve()`, and `pr_curve()`.
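+
+These functions share the interface shown above for `roc_curve()`. As a sketch, a precision-recall curve for the same data could be computed and plotted with:
+
+```r
+# Precision-recall curve for the event of interest (Class1), visualized
+# with the same autoplot() convenience method:
+pr_curve(two_class_example, truth, Class1) %>%
+  autoplot()
+```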
+
+## Multi-Class Classification Metrics
+
+What about data with three or more classes? To demonstrate, let's explore a different example data set that has four classes:
+
+
+```r
+data(hpc_cv)
+tibble(hpc_cv)
+#> # A tibble: 3,467 × 7
+#> obs pred VF F M L Resample
+#> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <chr>
+#> 1 VF VF 0.914 0.0779 0.00848 0.0000199 Fold01
+#> 2 VF VF 0.938 0.0571 0.00482 0.0000101 Fold01
+#> 3 VF VF 0.947 0.0495 0.00316 0.00000500 Fold01
+#> 4 VF VF 0.929 0.0653 0.00579 0.0000156 Fold01
+#> 5 VF VF 0.942 0.0543 0.00381 0.00000729 Fold01
+#> 6 VF VF 0.951 0.0462 0.00272 0.00000384 Fold01
+#> # … with 3,461 more rows
+```
+
+As before, there are factors for the observed and predicted outcomes along with four other columns of predicted probabilities for each class. (These data also include a `Resample` column. These `hpc_cv` results are for out-of-sample predictions associated with 10-fold cross-validation. For the time being, this column will be ignored and we'll discuss resampling in depth in Chapter \@ref(resampling).)
+
+The functions for metrics that use the discrete class predictions are identical to their binary counterparts:
+
+
+```r
+accuracy(hpc_cv, obs, pred)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 accuracy multiclass 0.709
+
+mcc(hpc_cv, obs, pred)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 mcc multiclass 0.515
+```
+
+Note that, in these results, a "multiclass" `.estimator` is listed. Like "binary", this indicates that the formula for outcomes with three or more class levels was used. The Matthews correlation coefficient was originally designed for two classes but has been extended to cases with more class levels.
+
+There are methods for taking metrics designed for outcomes with only two classes and extending them to outcomes with more than two classes. For example, a metric such as sensitivity measures the true positive rate which, by definition, is specific to two classes (i.e., "event" and "non-event"). How can this metric be used in our example data?
+
+There are wrapper methods that can be used to apply sensitivity to our four-class outcome. These options are macro-averaging, macro-weighted averaging, and micro-averaging:
+
+ * Macro-averaging computes a set of one-versus-all metrics using the standard two-class statistics. These are averaged.
+
+ * Macro-weighted averaging does the same but the average is weighted by the number of samples in each class.
+
+ * Micro-averaging computes the contribution for each class, aggregates them, then computes a single metric from the aggregates.
+
+See @wu2017unified and @OpitzBurst for more on extending classification metrics to outcomes with more than two classes.
+
+Using sensitivity as an example, the usual two-class calculation is the ratio of the number of correctly predicted events divided by the number of true events. The "manual" calculations for these averaging methods are:
+
+
+```r
+class_totals <-
+ count(hpc_cv, obs, name = "totals") %>%
+ mutate(class_wts = totals / sum(totals))
+class_totals
+#> obs totals class_wts
+#> 1 VF 1769 0.51024
+#> 2 F 1078 0.31093
+#> 3 M 412 0.11883
+#> 4 L 208 0.05999
+
+cell_counts <-
+ hpc_cv %>%
+ group_by(obs, pred) %>%
+ count() %>%
+ ungroup()
+
+# Compute the four sensitivities using 1-vs-all
+one_versus_all <-
+ cell_counts %>%
+ filter(obs == pred) %>%
+ full_join(class_totals, by = "obs") %>%
+ mutate(sens = n / totals)
+one_versus_all
+#> # A tibble: 4 × 6
+#> obs pred n totals class_wts sens
+#> <fct> <fct> <int> <int> <dbl> <dbl>
+#> 1 VF VF 1620 1769 0.510 0.916
+#> 2 F F 647 1078 0.311 0.600
+#> 3 M M 79 412 0.119 0.192
+#> 4 L L 111 208 0.0600 0.534
+
+# Three different estimates:
+one_versus_all %>%
+ summarize(
+ macro = mean(sens),
+ macro_wts = weighted.mean(sens, class_wts),
+ micro = sum(n) / sum(totals)
+ )
+#> # A tibble: 1 × 3
+#> macro macro_wts micro
+#> <dbl> <dbl> <dbl>
+#> 1 0.560 0.709 0.709
+```
+
+Thankfully, there is no need to manually implement these averaging methods. Instead, yardstick functions can automatically apply these methods via the `estimator` argument:
+
+
+```r
+sensitivity(hpc_cv, obs, pred, estimator = "macro")
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 sensitivity macro 0.560
+sensitivity(hpc_cv, obs, pred, estimator = "macro_weighted")
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 sensitivity macro_weighted 0.709
+sensitivity(hpc_cv, obs, pred, estimator = "micro")
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 sensitivity micro 0.709
+```
+
+When dealing with probability estimates, there are some metrics with multi-class analogs. For example, @HandTill determined a multi-class technique for ROC curves. In this case, _all_ of the class probability columns must be given to the function:
+
+
+```r
+roc_auc(hpc_cv, obs, VF, F, M, L)
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 roc_auc hand_till 0.829
+```
+
+Macro-weighted averaging is also available as an option for applying this metric to a multi-class outcome:
+
+
+```r
+roc_auc(hpc_cv, obs, VF, F, M, L, estimator = "macro_weighted")
+#> # A tibble: 1 × 3
+#> .metric .estimator .estimate
+#> <chr> <chr> <dbl>
+#> 1 roc_auc macro_weighted 0.868
+```
+
+Finally, all of these performance metrics can be computed using dplyr groupings. Recall that these data have a column for the resampling groups. We haven't yet discussed resampling in detail, but notice how we can pass a grouped data frame to the metric function to compute the metrics for each group:
+
+
+```r
+hpc_cv %>%
+ group_by(Resample) %>%
+ accuracy(obs, pred)
+#> # A tibble: 10 × 4
+#> Resample .metric .estimator .estimate
+#> <chr> <chr> <chr> <dbl>
+#> 1 Fold01 accuracy multiclass 0.726
+#> 2 Fold02 accuracy multiclass 0.712
+#> 3 Fold03 accuracy multiclass 0.758
+#> 4 Fold04 accuracy multiclass 0.712
+#> 5 Fold05 accuracy multiclass 0.712
+#> 6 Fold06 accuracy multiclass 0.697
+#> # … with 4 more rows
+```
+
+The groupings also translate to the `autoplot()` methods, with results shown in Figure \@ref(fig:grouped-roc-curves).
+
+
+```r
+# Four 1-vs-all ROC curves for each fold
+hpc_cv %>%
+ group_by(Resample) %>%
+ roc_curve(obs, VF, F, M, L) %>%
+ autoplot() +
+ theme(legend.position = "none")
+```
+
+(\#fig:grouped-roc-curves)Resampled ROC curves for each of the four outcome classes.
+
+
+This visualization shows us that the different groups all perform about the same, but that the `VF` class is predicted better than the `F` or `M` classes, since the `VF` ROC curves are up in the top left corner more. This example uses resamples as the groups, but any grouping in your data can be used. This `autoplot()` method can be a quick visualization method for model effectiveness across outcome classes and/or groups.
+
+## Chapter Summary {#performance-summary}
+
+Different metrics measure different aspects of a model fit, e.g., RMSE measures accuracy while $R^2$ measures correlation. Measuring model performance is important even when a given model will not be used primarily for prediction; predictive power is also important for inferential or descriptive models. Functions from the yardstick package measure the effectiveness of a model using data. The primary tidymodels interface uses tidyverse principles and data frames (as opposed to vector arguments). Different metrics are appropriate for regression and classification models and, within these, there are sometimes different ways to estimate the statistics, such as for multi-class outcomes.
diff --git a/tmwr-atlas/1-software-modeling.html b/tmwr-atlas/1-software-modeling.html
new file mode 100644
index 00000000..b70ceb51
--- /dev/null
+++ b/tmwr-atlas/1-software-modeling.html
@@ -0,0 +1,472 @@
+1 Software for modeling | Tidy Modeling with R
+
Models are mathematical tools that can describe a system and capture relationships in the data given to them. Models can be used for various purposes, including predicting future events, determining if there is a difference between several groups, aiding map-based visualization, discovering novel patterns in the data that could be further investigated, and more. The utility of a model hinges on its ability to be reductive, or to reduce complex relationships to simpler terms. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.
+
Since the beginning of the twenty-first century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when might be a good time to walk the dog, ordering a product from a website, typing a text message to a friend and having it autocorrected, and checking email. In each of these instances, there is a good chance that some type of model was involved. In some cases, the contribution of the model might be easily perceived (“You might also be interested in purchasing product X”) while in other cases, the impact could be the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, to identify a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.
+
+
There are two reasons that models permeate our lives today:
+
+
an abundance of software exists to create models, and
+
it has become easier to capture and store data, as well as make it accessible.
+
+
+
This book focuses largely on software. It is obviously critical that software produces the correct relationships to represent the data. For the most part, determining mathematical correctness is possible, but the reliable creation of appropriate models requires more. In this chapter, we outline considerations for building or choosing modeling software, the purposes of models, and where modeling sits in the broader data analysis process.
It is important that the modeling software you use is easy to operate in a proper way. The user interface should not be so poorly designed that the user would not know that they used it inappropriately. For example, Baggerly and Coombes (2009) report myriad problems in the data analyses from a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user interface of the software made it easy to offset the column names of the data from the actual data columns. This resulted in the wrong genes being identified as important for treating cancer patients and eventually contributed to the termination of several clinical trials (Carlson 2012).
+
If we need high quality models, software must facilitate proper usage. Abrams (2003) describes an interesting principle to guide us:
+
+
The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.
+
+
Data analysis and modeling software should espouse this idea.
+
Second, modeling software should promote good scientific methodology. When working with complex predictive models, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at discovering patterns that they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue can go undetected until a later time when new data that contain the true result are obtained.
+
+
As our models have become more powerful and complex, it has also become easier to commit latent errors.
+
+
This same principle also applies to programming. Whenever possible, the software should be able to protect users from committing mistakes. Software should make it easy for users to do the right thing.
+
These two aspects of model development – ease of proper use and good methodological practice – are crucial. Since tools for creating models are easily accessible and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. Tools should be powerful enough to create high-performance models, but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software for modeling which has been designed with these characteristics in mind.
+
The software is based on the R programming language (R Core Team 2014). R has been designed especially for data analysis and modeling. It is an implementation of the S language (with lexical scoping rules adapted from Scheme and Lisp) which was created in the 1970s to
+
+
“turn ideas into software, quickly and faithfully” (Chambers 1998)
+
+
R is open-source and free of charge. It is a powerful programming language that can be used for many different purposes but specializes in data analysis, modeling, visualization, and machine learning. R is easily extensible; it has a vast ecosystem of packages, mostly user-contributed modules that focus on a specific theme, such as modeling, visualization, and so on.
+
One collection of packages is called the tidyverse (Wickham et al. 2019). The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Several of these design philosophies are directly informed by the aspects of software for modeling described in this chapter. If you’ve never used the tidyverse packages, Chapter 2 contains a review of its basic concepts. Within the tidyverse, the subset of packages specifically focused on modeling are referred to as the tidymodels packages. This book is a practical guide for conducting modeling using the tidyverse and tidymodels packages. It shows how to use a set of packages, each with its own specific purpose, together to create high-quality models.
+Baggerly, K, and K Coombes. 2009. “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology.” The Annals of Applied Statistics 3 (4): 1309–34.
+
+
+Carlson, B. 2012. “Putting Oncology Patients at Risk.” Biotechnology Healthcare 9 (3): 17–21.
+
+
+Chambers, J. 1998. Programming with Data: A Guide to the S Language. Berlin, Heidelberg: Springer-Verlag.
+
+
+R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
+
+
+Wickham, H, M Averick, J Bryan, W Chang, L McGowan, R François, G Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43).
+
Before proceeding, let’s describe a taxonomy for types of models, grouped by purpose. This taxonomy informs both how a model is used and many aspects of how the model may be created or evaluated. While not exhaustive, most models fall into at least one of these categories:
+
+
Descriptive models
+
The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.
+
For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip can measure a signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned.
+
An early method for evaluating such issues were probe-level models, or PLM’s (Bolstad 2004). A statistical model would be created that accounted for the known differences in the data, such as the chip, the RNA sequence, the type of sequence, and so on. If there were other, unknown factors in the data, these effects would be captured in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When a problem did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan, repeat the sample, etc.). Figure 1.1(a) shows an application of this method for two microarrays taken from Gentleman et al. (2005). The images show two different color values; areas that are darker are where the signal intensity was larger than the model expects while the lighter color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel exhibits an undesirable artifact in the middle of the chip.
+
+
+
+Figure 1.1: Two examples of how descriptive models can be used to illustrate specific patterns.
+
+
+
Another example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS (Cleveland 1979). Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure 1.1(b) where a nonlinear trend is illuminated by the flexible smoother. From this plot, it is clear that there is a highly nonlinear relationship between the sale price of a house and its latitude.
+
+
+
Inferential models
+
The goal of an inferential model is to produce a decision for a research question or to explore a specific hypothesis, similar to how statistical tests are used.1 An inferential model starts with some predefined conjecture or idea about a population, and produces a statistical conclusion such as an interval estimate or the rejection of a hypothesis.
+
For example, the goal of a clinical trial might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative, like an existing therapy or no treatment at all. If the clinical endpoint was related to survival of a patient, the null hypothesis might be that the new treatment has an equal or lower median survival time, with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using traditional null hypothesis significance testing via modeling, the significance testing would produce a p-value using some pre-defined methodology based on a set of assumptions for the data. Small values for the p-value in the model results would indicate that there is evidence that the new therapy helps patients live longer. Large values for the p-value in the model results would conclude that there is a failure to show such a difference; this lack of evidence could be due to a number of reasons, including the therapy not working.
+
What are the important aspects of this type of analysis? Inferential modeling techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal probabilistic assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical modeling results is highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: “If my data were independent and the residuals follow distribution X, then test statistic Y can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.”
+
+
One aspect of inferential analyses is that there tends to be a delayed feedback loop in understanding how well the data matches the model assumptions. In our clinical trial example, if statistical (and clinical) significance indicate that the new therapy should be available for patients to use, it still may be years before it is used in the field and enough data are generated for an independent assessment of whether the original statistical analysis led to the appropriate decision.
+
+
+
+
Predictive models
+
Sometimes data are modeled to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.
+
A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to their store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit.
+
For this type of model, the problem type is one of estimation rather than inference. For example, the buyer is usually not concerned with a question such as “Will I sell more than 100 copies of book X next month?” but rather “How many copies of book X will customers purchase next month?” Also, depending on the context, there may not be any interest in why the predicted value is X. In other words, there is more interest in the value itself than evaluating a formal hypothesis related to the data. The prediction can also include measures of uncertainty. In the case of the book buyer, providing a forecasting error may be helpful in deciding how many to purchase. It can also serve as a metric to gauge how well the prediction method worked.
+
What are the most important factors affecting predictive models? There are many different ways that a predictive model can be created, so the important factors depend on how the model was developed.2
+
A mechanistic model could be derived using first principles to produce a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person’s body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the unknown parameters of this equation so that predictions can be generated. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeling practitioner is much faster than it would be for a hypothesis test.
+
Empirically driven models are created with more vague assumptions. These models tend to fall into the machine learning category. A good example is the K-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the K most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new books to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of “similar”). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was a good choice, the predictions would be close to the actual values.
+
+
+
REFERENCES
+
+
+Bolstad, B. 2004. Low-Level Analysis of High-Density Oligonucleotide Array Data: Background, Normalization and Summarization. University of California, Berkeley.
+
+
+Breiman, L. 2001b. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.
+
+
+Cleveland, W. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36.
+
+
+Gentleman, R, V Carey, W Huber, R Irizarry, and S Dudoit. 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Berlin, Heidelberg: Springer-Verlag.
+
+
+Shmueli, G. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310.
+
+
+
+
+
+
Many specific statistical tests are in fact equivalent to models. For example, t-tests and analysis of variance (ANOVA) methods are particular cases of the generalized linear model.↩︎
+
Broader discussions of these distinctions can be found in Breiman (2001b) and Shmueli (2010).↩︎
Note that we have defined the type of a model by how it is used, rather than its mathematical qualities.
+
+
An ordinary linear regression model might fall into any of these three classes of model, depending on how it is used:
+
+
A descriptive smoother, similar to LOESS, called restricted smoothing splines (Durrleman and Simon 1989) can be used to describe trends in data using ordinary linear regression with specialized terms.
+
An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.
+
If a simple linear regression model produces accurate predictions, it can be used as a predictive model.
+
+
There are many examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the K-nearest neighbors model, for example, makes the math required for inference intractable.
+
There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables are related to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the statistical qualities of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building may be used to determine a minimal subset of independent variables that have a “statistically significant” relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g. 0.05). From here, the analyst may focus on making qualitative statements about the relative influence that the variables have on the outcome (e.g., “There is a statistically significant relationship between age and the odds of heart disease.”).
+
This approach can be dangerous when statistical significance is used as the only measure of model quality. It is possible that this statistically optimized model has poor model accuracy, or performs poorly on some other measure of predictive capacity. While the model might not be used for prediction, how much should inferences be trusted from a model that has significant p-values but dismal accuracy? Predictive performance tends to be related to how close the model’s fitted values are to the observed data.
+
+
If a model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not be sufficient proof that a model is appropriate.
+
+
This may seem intuitively obvious, but is often ignored in real-world data analysis.
+
+
REFERENCES
+
+
+Durrleman, S, and R Simon. 1989. “Flexible Regression Models with Cubic Splines.” Statistics in Medicine 8 (5): 551–61.
+
Before proceeding, we outline here some additional terminology related to modeling and data. These descriptions are intended to be helpful as you read this book but not exhaustive.
+
First, many models can be categorized as being supervised or unsupervised. Unsupervised models are those that learn patterns, clusters, or other characteristics of the data but lack an outcome, i.e., a dependent variable. Principal component analysis (PCA), clustering, and autoencoders are examples of unsupervised models; they are used to understand relationships between variables or sets of variables without an explicit relationship between predictors and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category.
+
Within supervised models, there are two main sub-categories:
+
+
Regression predicts a numeric outcome.
+
Classification predicts an outcome that is an ordered or unordered set of qualitative values.
+
+
These are imperfect definitions and do not account for all possible types of models. In Chapter 6, we refer to this characteristic of supervised techniques as the model mode.
+
Different variables can have different roles, especially in a supervised modeling analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the value being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms outcomes and predictors are used most frequently in this book.
+
In terms of the data or variables themselves, whether used for supervised or unsupervised models, as predictors or outcomes, the two main categories are quantitative and qualitative. Examples of the former are real numbers like 3.14159 and integers like 42. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be naturally placed on a numeric scale, like “red”, “green”, and “blue”.
+
diff --git a/tmwr-atlas/1.5-model-phases.html b/tmwr-atlas/1.5-model-phases.html
new file mode 100644
index 00000000..e9515106
--- /dev/null
+++ b/tmwr-atlas/1.5-model-phases.html
@@ -0,0 +1,583 @@
+1.5 How Does Modeling Fit into the Data Analysis Process? | Tidy Modeling with R
+
1.5 How Does Modeling Fit into the Data Analysis Process?
+
In what circumstances are models created? Are there steps that precede such an undertaking? Is model creation the first step in data analysis?
+
+
There are always a few critical phases of data analysis that come before modeling.
+
+
First, there is the chronically underestimated process of cleaning the data. No matter the circumstances, you should investigate the data to make sure that they are applicable to your project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances).
+
Data cleaning can also overlap with the second phase of understanding the data, often referred to as exploratory data analysis (EDA). EDA brings to light how the different variables are related to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is, “How did I come by these data?” This question can help you understand how the data at hand have been sampled or filtered and if these operations were appropriate. For example, when merging database tables, a join may go awry that could accidentally eliminate one or more sub-populations. Another good idea is to ask if the data are relevant. For example, to predict whether patients have Alzheimer’s disease or not, it would be unwise to have a data set containing subjects with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model may simply predict who are the oldest patients.
+
Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one performance metric should be identified with realistic goals of what can be achieved. Common statistical metrics, discussed in more detail in Chapter 9, are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighed. It is also important that the metric be germane; alignment with the broader data analysis goals is critical.
+
The process of investigating the data may not be simple. Wickham and Grolemund (2016) contains an excellent illustration of the general data analysis process, reproduced in Figure 1.2. Data ingestion and cleaning/tidying are shown as the initial steps. When the analytical steps for understanding commence, they are a heuristic process; we cannot pre-determine how long they may take. The cycle of transformation, modeling, and visualization often requires multiple iterations.
+
+
+
+Figure 1.2: The data science process (from R for Data Science, used with permission).
+
+
+
This iterative process is especially true for modeling. Figure 1.3 is meant to emulate the typical path to determining an appropriate model. The general phases are:
+
+
Exploratory data analysis (EDA): Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure 1.2) where different discoveries lead to more questions and data analysis “side-quests” to gain more understanding.
+
Feature engineering: The understanding gained from EDA results in the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Chapter 8 focuses entirely on this important step.
+
Model tuning and selection (large circles with alternating segments): A variety of models are generated and their performance is compared. Some models require parameter tuning where some structural parameters are required to be specified or optimized. The alternating segments within the circles signify the repeated data splitting used during resampling (see Chapter 10).
+
Model evaluation: During this phase of model development, we assess the model’s performance metrics, examine residual plots, and conduct other EDA-like analyses to understand how well the models work. In some cases, formal between-model comparisons (Chapter 11) help you to understand whether any differences in models are within the experimental noise.
+
+
+
+
+Figure 1.3: A schematic for the typical modeling process.
+
+
+
After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize, document, and communicate the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.
+
As an example, M. Kuhn and Johnson (2020) use data to model the daily ridership of Chicago’s public train system using predictors such as the date, the previous ridership results, the weather, and other factors. Table 1.1 walks through an approximation of these authors’ “inner monologue” when analyzing these data and eventually selecting a model with sufficient performance.
+
Table 1.1: Hypothetical inner monologue of a model developer.

|Thoughts |Activity |
|:--------|:--------|
|The daily ridership values between stations are extremely correlated. |EDA |
|Weekday and weekend ridership look very different. |EDA |
|One day in the summer of 2010 has an abnormally large number of riders. |EDA |
|Which stations had the lowest daily ridership values? |EDA |
|Dates should at least be encoded as day-of-the-week, and year. |Feature Engineering |
|Maybe PCA could be used on the correlated predictors to make it easier for the models to use them. |Feature Engineering |
|Hourly weather records should probably be summarized into daily measurements. |Feature Engineering |
|Let’s start with simple linear regression, K-nearest neighbors, and a boosted decision tree. |Model Fitting |
|How many neighbors should be used? |Model Tuning |
|Should we run a lot of boosting iterations or just a few? |Model Tuning |
|How many neighbors seemed to be optimal for these data? |Model Tuning |
|Which models have the lowest root mean squared errors? |Model Evaluation |
|Which days were poorly predicted? |EDA |
|Variable importance scores indicate that the weather information is not predictive. We’ll drop them from the next set of models. |Model Evaluation |
|It seems like we should focus on a lot of boosting iterations for that model. |Model Evaluation |
|We need to encode holiday features to improve predictions on (and around) those dates. |Feature Engineering |
|Let’s drop K-NN from the model list. |Model Evaluation |
+
+
REFERENCES
+
+
+Kuhn, M, and K Johnson. 2020. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.
+
+
+Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
+
This chapter focused on how models describe relationships in data, and different types of models such as descriptive models, inferential models, and predictive models. The predictive capacity of a model can be used to evaluate it, even when its main goal is not prediction. Modeling itself sits within the broader data analysis process, and exploratory data analysis is a key part of building high-quality models.
diff --git a/tmwr-atlas/10-resampling.md b/tmwr-atlas/10-resampling.md
new file mode 100644
index 00000000..3df53258
--- /dev/null
+++ b/tmwr-atlas/10-resampling.md
@@ -0,0 +1,824 @@
+
+
+# (PART\*) Tools for Creating Effective Models {-}
+
+# Resampling for Evaluating Performance {#resampling}
+
+We have already covered several pieces that must be put together to evaluate the performance of a model. Chapter \@ref(performance) described statistics for measuring model performance, and Chapter \@ref(splitting) introduced the idea of data spending where we recommended the test set for obtaining an unbiased estimate of performance. However, we usually need to understand the performance of a model or even multiple models _before using the test set_.
+
+:::rmdwarning
+Typically we can't decide on which final model to use with the test set before first assessing model performance. There is a gap between our need to measure performance reliably and the data splits (training and testing) we have available.
+:::
+
+In this chapter, we describe an approach called resampling that can fill this gap. Resampling estimates of performance can generalize to new data in a similar way as estimates from a test set. The next chapter complements this one by demonstrating statistical methods that compare resampling results.
+
+In order to fully appreciate the value of resampling, let's first take a look at the resubstitution approach, which can often fail.
+
+## The Resubstitution Approach {#resampling-resubstition}
+
+When we measure performance on the same data that we used for training (as opposed to new data or testing data), we say we have "resubstituted" the data. Let's again use the Ames data to demonstrate these concepts. The end of Chapter \@ref(recipes) summarizes the current state of our Ames analysis. It includes a recipe object named `ames_rec`, a linear model, and a workflow using that recipe and model called `lm_wflow`. This workflow was fit on the training set, resulting in `lm_fit`.
+
+For a comparison to this linear model, we can also fit a different type of model. _Random forests_ are a tree ensemble method that operates by creating a large number of decision trees from slightly different versions of the training set [@breiman2001random]. This collection of trees makes up the ensemble. When predicting a new sample, each ensemble member makes a separate prediction. These are averaged to create the final ensemble prediction for the new data point.
+
+Random forest models are very powerful and they can emulate the underlying data patterns very closely. While this model can be computationally intensive, it is very low-maintenance; very little preprocessing is required (as documented in Appendix \@ref(pre-proc-table)).
+
+Using the same predictor set as the linear model (without the extra preprocessing steps), we can fit a random forest model to the training set via the `"ranger"` engine (which uses the ranger R package for computation). This model requires no preprocessing, so a simple formula can be used:
+
+
+```r
+rf_model <-
+ rand_forest(trees = 1000) %>%
+ set_engine("ranger") %>%
+ set_mode("regression")
+
+rf_wflow <-
+ workflow() %>%
+ add_formula(
+ Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
+ Latitude + Longitude) %>%
+ add_model(rf_model)
+
+rf_fit <- rf_wflow %>% fit(data = ames_train)
+```
+
+How should we compare the linear and random forest models? For demonstration, we will predict the training set to produce what is known as an "apparent metric" or "resubstitution metric". This function creates predictions and formats the results:
+
+
+```r
+estimate_perf <- function(model, dat) {
+ # Capture the names of the `model` and `dat` objects
+ cl <- match.call()
+ obj_name <- as.character(cl$model)
+ data_name <- as.character(cl$dat)
+ data_name <- gsub("ames_", "", data_name)
+
+ # Estimate these metrics:
+ reg_metrics <- metric_set(rmse, rsq)
+
+ model %>%
+ predict(dat) %>%
+ bind_cols(dat %>% select(Sale_Price)) %>%
+ reg_metrics(Sale_Price, .pred) %>%
+ select(-.estimator) %>%
+ mutate(object = obj_name, data = data_name)
+}
+```
+
+Both RMSE and $R^2$ are computed. The resubstitution statistics are:
+
+
+```r
+estimate_perf(rf_fit, ames_train)
+#> # A tibble: 2 × 4
+#> .metric .estimate object data
+#> <chr> <dbl> <chr> <chr>
+#> 1 rmse 0.0367 rf_fit train
+#> 2 rsq 0.959 rf_fit train
+estimate_perf(lm_fit, ames_train)
+#> # A tibble: 2 × 4
+#> .metric .estimate object data
+#> <chr> <dbl> <chr> <chr>
+#> 1 rmse 0.0754 lm_fit train
+#> 2 rsq 0.816 lm_fit train
+```
+
+
+
+Based on these results, the random forest is much more capable of predicting the sale prices; the RMSE estimate is two-fold better than that of linear regression. If we needed to choose between these two models for this price prediction problem, we would probably choose the random forest because, on the log scale we are using, its RMSE is about half as large. The next step applies the random forest model to the test set for final verification:
+
+
+```r
+estimate_perf(rf_fit, ames_test)
+#> # A tibble: 2 × 4
+#> .metric .estimate object data
+#> <chr> <dbl> <chr> <chr>
+#> 1 rmse 0.0704 rf_fit test
+#> 2 rsq 0.852 rf_fit test
+```
+
+The test set RMSE estimate, 0.0704, is *much worse than the training set* value of 0.0367! Why did this happen?
+
+Many predictive models are capable of learning complex trends from the data. In statistics, these are commonly referred to as _low bias models_.
+
+:::rmdnote
+In this context, _bias_ is the difference between the true pattern or relationships in data and the types of patterns that the model can emulate. Many black-box machine learning models have low bias, meaning they can reproduce complex relationships. Other models (such as linear/logistic regression, discriminant analysis, and others) are not as adaptable and are considered _high bias_ models.^[See Section 1.2.5 of @fes for a discussion.]
+:::
+
+For a low-bias model, the high degree of predictive capacity can sometimes result in the model nearly memorizing the training set data. As an obvious example, consider a 1-nearest neighbor model. It will always provide perfect predictions for the training set no matter how well it truly works for other data sets. Random forest models are similar; re-predicting the training set will always result in an artificially optimistic estimate of performance.
+
+For both models, Table \@ref(tab:rmse-results) summarizes the RMSE estimate for the training and test sets:
+
+
+Table: (\#tab:rmse-results)Performance statistics for training and test sets.
+
+|object | train| test|
+|:------|------:|------:|
+|lm_fit | 0.0754| 0.0736|
+|rf_fit | 0.0367| 0.0704|
+
+Notice that the linear regression model is consistent between training and testing, because of its limited complexity.^[It is possible for a linear model to nearly memorize the training set, like the random forest model did. In the `ames_rec` object, change the number of spline terms for `longitude` and `latitude` to a large number (say 1000). This would produce a model fit with a very small resubstitution RMSE and a test set RMSE that is much larger.]
+
+:::rmdwarning
+The main take-away from this example is that re-predicting the training set will result in an artificially optimistic estimate of performance. It is a bad idea for most models.
+:::
+
+If the test set should not be used immediately, and re-predicting the training set is a bad idea, what should be done? Resampling methods, such as cross-validation or validation sets, are the solution.
+
+
+## Resampling Methods
+
+Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. The diagram in Figure \@ref(fig:resampling-scheme) illustrates how resampling methods generally operate.
+
+(\#fig:resampling-scheme)Data splitting scheme from the initial data split to resampling.
+
+
+Resampling is only conducted on the training set, as you see in Figure \@ref(fig:resampling-scheme). The test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples:
+
+* The model is fit with the *analysis set*.
+
+* The model is evaluated with the *assessment set*.
+
+These two subsamples are somewhat analogous to training and test sets. Our language of _analysis_ and _assessment_ avoids confusion with the initial split of the data. These data sets are mutually exclusive. The partitioning scheme used to create the analysis and assessment sets is usually the defining characteristic of the method.
+
+Suppose twenty iterations of resampling are conducted. This means that twenty separate models are fit on the analysis sets and the corresponding assessment sets produce twenty sets of performance statistics. The final estimate of performance for a model is the average of the twenty replicates of the statistics. This average has very good generalization properties and is far better than the resubstitution estimates.
+
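+Mechanically, this amounts to averaging a set of per-resample statistics. A toy sketch (with simulated RMSE values, not results from the Ames models):
+
+```r
+# Twenty hypothetical per-resample RMSE estimates:
+set.seed(123)
+resample_rmse <- tibble(
+  id   = sprintf("Resample%02d", 1:20),
+  rmse = runif(20, min = 0.07, max = 0.09)
+)
+
+# The resampling estimate is the average; the standard error describes the
+# precision of that estimate:
+resample_rmse %>%
+  summarize(estimate = mean(rmse), std_err = sd(rmse) / sqrt(n()))
+```
+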
+The next section defines several commonly used resampling methods and discusses their pros and cons.
+
+### Cross-validation {#cv}
+
+Cross-validation is a well established resampling method. While there are a number of variations, the most common cross-validation method is _V_-fold cross-validation. The data are randomly partitioned into _V_ sets of roughly equal size (called the "folds"). For illustration, _V_ = 3 is shown in Figure \@ref(fig:cross-validation-allocation) for a data set of thirty training set points with random fold allocations. The number inside the symbols is the sample number.
+
+(\#fig:cross-validation-allocation)V-fold cross-validation randomly assigns data to folds.
+
+
+The color of the symbols in Figure \@ref(fig:cross-validation-allocation) represent their randomly assigned folds. Stratified sampling is also an option for assigning folds (previously discussed in Chapter \@ref(splitting)).
+
+For 3-fold cross-validation, the three iterations of resampling are illustrated in Figure \@ref(fig:cross-validation). For each iteration, one fold is held out for assessment statistics and the remaining folds are substrate for the model. This process continues for each fold so that three models produce three sets of performance statistics.
+
+
+
+
+(\#fig:cross-validation)V-fold cross-validation data usage.
+
+
+When _V_ = 3, the analysis sets are 2/3 of the training set and each assessment set is a distinct 1/3. The final resampling estimate of performance averages each of the _V_ replicates.
+
+Using _V_ = 3 is a good choice to illustrate cross-validation but is a poor choice in practice because it is too low to generate reliable estimates. In practice, values of _V_ are most often 5 or 10; we generally prefer 10-fold cross-validation as a default because it is large enough for good results in most situations.
+
+:::rmdnote
+What are the effects of changing _V_? Larger values result in resampling estimates with small bias but substantial variance. Smaller values of _V_ have large bias but low variance. We prefer 10-fold since noise is reduced by replication, but bias is not.^[See Section 3.4 of @fes for a longer description of the effects of changing _V_.]
+:::
+
+The rsample function `vfold_cv()` creates these folds. Its primary input is the training set data frame, as well as the number of folds (defaulting to 10):
+
+
+```r
+set.seed(1001)
+ames_folds <- vfold_cv(ames_train, v = 10)
+ames_folds
+#> # 10-fold cross-validation
+#> # A tibble: 10 × 2
+#> splits id
+#>
+#> 1 Fold01
+#> 2 Fold02
+#> 3 Fold03
+#> 4 Fold04
+#> 5 Fold05
+#> 6 Fold06
+#> # … with 4 more rows
+```
+
+The column named `splits` contains the information on how to split the data (similar to the object used to create the initial training/test partition). While each row of `splits` has an embedded copy of the entire training set, R is smart enough not to make copies of the data in memory.^[To see this for yourself, try executing `lobstr::obj_size(ames_folds)` and `lobstr::obj_size(ames_train)`. The size of the resample object is much less than ten times the size of the original data.] The print method inside of the tibble shows the size of each split: `[2107/235]` indicates that roughly two thousand samples are in the analysis set and 235 are in that particular assessment set.
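+
+Following the footnote's suggestion, the two object sizes can be compared directly. This is only a quick sketch; the exact values depend on the data and session:
+
+
+```r
+# The resample object references the training set rather than storing ten copies
+lobstr::obj_size(ames_train)
+lobstr::obj_size(ames_folds)
+```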
+
+These objects also always contain a character column called `id` that labels the partition.^[Some resampling methods require multiple `id` fields.]
+
+To manually retrieve the partitioned data, the `analysis()` and `assessment()` functions return the corresponding data frames:
+
+
+```r
+# For the first fold:
+ames_folds$splits[[1]] %>% analysis() %>% dim()
+#> [1] 2107 74
+```
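+
+The `assessment()` function works the same way for the held-out portion of a fold; a minimal sketch for the same fold (output not shown):
+
+
+```r
+# Rows held out from the first fold (the assessment set)
+ames_folds$splits[[1]] %>% assessment() %>% dim()
+```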
+
+The tidymodels packages, such as [tune](https://tune.tidymodels.org/), contain high-level user interfaces so that functions like `analysis()` are not generally needed for day-to-day work. Chapter \@ref(resampling) demonstrates functions to fit a model over these resamples.
+
+There are a variety of variations on cross-validation; we'll go through the most important ones.
+
+### Repeated cross-validation {-}
+
+The most important variation on cross-validation is repeated _V_-fold cross-validation. Depending on the size or other characteristics of the data, the resampling estimate produced by _V_-fold cross-validation may be excessively noisy.^[For more details, see Section 3.4.6 of @fes.] As with many statistical problems, one way to reduce noise is to gather more data. For cross-validation, this means averaging more than _V_ statistics.
+
+To create _R_ repeats of _V_-fold cross-validation, the same fold generation process is done _R_ times to generate _R_ collections of _V_ partitions. Now, instead of averaging _V_ statistics, $V \times R$ statistics produce the final resampling estimate. Due to the Central Limit Theorem, the summary statistics from each model tend toward a normal distribution, as long as we have a lot of data relative to $V \times R$.
+
+Consider the Ames data. On average, 10-fold cross-validation uses assessment sets that contain roughly 234 properties. If RMSE is the statistic of choice, we can denote that estimate's standard deviation as $\sigma$. With simple 10-fold cross-validation, the standard error of the mean RMSE is $\sigma/\sqrt{10}$. If this is too noisy, repeats reduce the standard error to $\sigma/\sqrt{10R}$. For 10-fold cross-validation with $R$ replicates, the plot in Figure \@ref(fig:variance-reduction) shows how quickly the standard error^[These are _approximate_ standard errors. As will be discussed in the next chapter, there is a within-replicate correlation that is typical of resampled results. By ignoring this extra component of variation, the simple calculations shown in this plot are overestimates of the reduction in noise in the standard errors.] decreases with replicates.
+
+
+
+
+(\#fig:variance-reduction)Relationship between the relative variance in performance estimates versus the number of cross-validation repeats.
+
+
+Larger numbers of replicates have progressively less impact on the standard error. However, if the baseline value of $\sigma$ is impractically large, the diminishing returns on replication may still be worth the extra computational costs.
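+
+Since the standard error for $R$ repeats is $\sigma/\sqrt{10R}$, the reduction relative to a single repeat is $1/\sqrt{R}$. A small sketch of that calculation:
+
+
+```r
+# Relative standard error for R repeats of 10-fold cross-validation,
+# compared to a single repeat: (sigma / sqrt(10 * R)) / (sigma / sqrt(10))
+repeats <- 1:10
+1 / sqrt(repeats)
+# e.g., four repeats halve the standard error; nine repeats cut it to a third
+```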
+
+To create repeats, invoke `vfold_cv()` with an additional argument `repeats`:
+
+
+```r
+vfold_cv(ames_train, v = 10, repeats = 5)
+#> # 10-fold cross-validation repeated 5 times
+#> # A tibble: 50 × 3
+#> splits id id2
+#>
+#> 1 Repeat1 Fold01
+#> 2 Repeat1 Fold02
+#> 3 Repeat1 Fold03
+#> 4 Repeat1 Fold04
+#> 5 Repeat1 Fold05
+#> 6 Repeat1 Fold06
+#> # … with 44 more rows
+```
+
+### Leave-one-out cross-validation {-}
+
+One variation of cross-validation is leave-one-out (LOO) cross-validation where _V_ is the number of data points in the training set. If there are $n$ training set samples, $n$ models are fit using $n-1$ rows of the training set. Each model predicts the single excluded data point. At the end of resampling, the $n$ predictions are pooled to produce a single performance statistic.
+
+Leave-one-out methods are deficient compared to almost any other method. For anything but pathologically small samples, LOO is computationally excessive and it may not have good statistical properties. Although the rsample package contains a `loo_cv()` function, these objects are not generally integrated into the broader tidymodels frameworks.
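+
+Although not recommended for routine use, such an object can be created in the same way as the other resampling objects; a sketch (output not shown):
+
+
+```r
+# One resample per row of the training set; n models would have to be fit
+loo_cv(ames_train)
+```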
+
+### Monte Carlo cross-validation {-}
+
+Another variant of _V_-fold cross-validation is Monte Carlo cross-validation (MCCV, @xu2001monte). Like _V_-fold cross-validation, it allocates a fixed proportion of data to the assessment sets. The difference between MCCV and regular cross-validation is that, for MCCV, this proportion of the data is randomly selected each time. This results in assessment sets that are not mutually exclusive. To create these resampling objects:
+
+
+```r
+mc_cv(ames_train, prop = 9/10, times = 20)
+#> # Monte Carlo cross-validation (0.9/0.1) with 20 resamples
+#> # A tibble: 20 × 2
+#> splits id
+#>
+#> 1 Resample01
+#> 2 Resample02
+#> 3 Resample03
+#> 4 Resample04
+#> 5 Resample05
+#> 6 Resample06
+#> # … with 14 more rows
+```
+
+### Validation sets {#validation}
+
+In Chapter \@ref(splitting), we briefly discussed the use of a validation set, a single partition that is set aside to estimate performance separate from the test set. When using a validation set, the initial available data set is split into a training set, a validation set, and a test set (see Figure \@ref(fig:three-way-split)).
+
+
+
+
+(\#fig:three-way-split)A three-way initial split into training, testing, and validation sets.
+
+
+Validation sets are often used when the original pool of data is very large. In this case, a single large partition may be adequate to characterize model performance without having to do multiple iterations of resampling.
+
+With the rsample package, a validation set is like any other resampling object; this type is different only in that it has a single iteration.^[In essence, a validation set can be considered a single iteration of Monte Carlo cross-validation.] Figure \@ref(fig:validation-split) shows this scheme.
+
+
+
+
+
+(\#fig:validation-split)A two-way initial split into training and testing with an additional validation set split on the training set.
+
+
+To create a validation set object that uses 3/4 of the data for model fitting:
+
+
+
+```r
+set.seed(1002)
+val_set <- validation_split(ames_train, prop = 3/4)
+val_set
+#> # Validation Set Split (0.75/0.25)
+#> # A tibble: 1 × 2
+#> splits id
+#>
+#> 1 validation
+```
+
+
+### Bootstrapping {#bootstrap}
+
+Bootstrap resampling was originally invented as a method for approximating the sampling distribution of statistics whose theoretical properties are intractable [@davison1997bootstrap]. Using it to estimate model performance is a secondary application of the method.
+
+A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn _with replacement_. This means that some training set data points are selected multiple times for the analysis set. Each data point has a 63.2% chance of inclusion in the analysis set at least once. The assessment set contains all of the training set samples that were not selected for the analysis set (on average, about 36.8% of the training set). When bootstrapping, the assessment set is often called the "out-of-bag" sample.
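+
+The 63.2% figure comes from the probability that a given row is drawn at least once in $n$ draws with replacement, $1 - (1 - 1/n)^n$, which converges to $1 - e^{-1} \approx 0.632$. A quick sketch of that calculation:
+
+
+```r
+# Probability that a given training set row appears at least once in a
+# bootstrap sample of size n drawn with replacement
+n <- c(10, 100, 1000, 10000)
+1 - (1 - 1/n)^n
+# converges to 1 - exp(-1), approximately 0.632
+```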
+
+For a training set of 30 samples, a schematic of three bootstrap samples is shown in Figure \@ref(fig:bootstrapping).
+
+
+
+
+(\#fig:bootstrapping)Bootstrapping data usage.
+
+
+Note that the sizes of the assessment sets vary.
+
+Using the rsample package, we can create such bootstrap resamples:
+
+
+```r
+bootstraps(ames_train, times = 5)
+#> # Bootstrap sampling
+#> # A tibble: 5 × 2
+#> splits id
+#>
+#> 1 Bootstrap1
+#> 2 Bootstrap2
+#> 3 Bootstrap3
+#> 4 Bootstrap4
+#> 5 Bootstrap5
+```
+
+Bootstrap samples produce performance estimates that have very low variance (unlike cross-validation) but have significant pessimistic bias. This means that, if the true accuracy of a model is 90%, the bootstrap would tend to estimate the value to be less than 90%. The amount of bias cannot be empirically determined with sufficient accuracy. Additionally, the amount of bias changes over the scale of the performance metric. For example, the bias is likely to be different when the accuracy is 90% versus when it is 70%.
+
+The bootstrap is also used inside of many models. For example, the random forest model mentioned earlier contained 1,000 individual decision trees. Each tree was the product of a different bootstrap sample of the training set.
+
+### Rolling forecasting origin resampling {#rolling}
+
+When the data have a strong time component, a resampling method should support modeling to estimate seasonal and other temporal trends within the data. A technique that randomly samples values from the training set can disrupt the model's ability to estimate these patterns.
+
+Rolling forecast origin resampling [@hyndman2018forecasting] provides a method that emulates how time series data is often partitioned in practice, estimating the model with historical data and evaluating it with the most recent data. For this type of resampling, the size of the initial analysis and assessment sets are specified. The first iteration of resampling uses these sizes, starting from the beginning of the series. The second iteration uses the same data sizes but shifts over by a set number of samples.
+
+To illustrate, a training set of fifteen samples was resampled with an analysis set size of eight samples and an assessment set size of three. The second iteration discards the first training set sample and both data sets shift forward by one. This configuration results in five resamples, as shown in Figure \@ref(fig:rolling).
+
+
+
+
+(\#fig:rolling)Data usage for rolling forecasting origin resampling.
+
+
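+That configuration can be created directly with `rolling_origin()`; a small sketch using a hypothetical fifteen-row series (output not shown):
+
+
+```r
+# Fifteen samples, analysis sets of eight, assessment sets of three,
+# shifting forward by one sample each time (five resamples in total)
+toy_series <- tibble(day = 1:15)
+rolling_origin(toy_series, initial = 8, assess = 3, cumulative = FALSE)
+```
+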
+There are a few different configurations of this method:
+
+* The analysis set can cumulatively grow (as opposed to remaining the same size). After the first analysis set, new samples can accrue without discarding the earlier data (a sketch of this option follows the code below).
+
+* The resamples need not increment by one. For example, for large data sets, the incremental block could be a week or month instead of a day.
+
+For a year's worth of data, suppose that six sets of 30-day blocks define the analysis set. For assessment sets of 30 days with a 29-day skip, we can use the rsample package to specify:
+
+
+```r
+time_slices <-
+ tibble(x = 1:365) %>%
+ rolling_origin(initial = 6 * 30, assess = 30, skip = 29, cumulative = FALSE)
+
+data_range <- function(x) {
+ summarize(x, first = min(x), last = max(x))
+}
+
+map_dfr(time_slices$splits, ~ analysis(.x) %>% data_range())
+#> # A tibble: 6 × 2
+#> first last
+#>   <int> <int>
+#> 1 1 180
+#> 2 31 210
+#> 3 61 240
+#> 4 91 270
+#> 5 121 300
+#> 6 151 330
+map_dfr(time_slices$splits, ~ assessment(.x) %>% data_range())
+#> # A tibble: 6 × 2
+#> first last
+#>   <int> <int>
+#> 1 181 210
+#> 2 211 240
+#> 3 241 270
+#> 4 271 300
+#> 5 301 330
+#> 6 331 360
+```
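+
+As a sketch of the cumulative option mentioned above, setting `cumulative = TRUE` keeps the start of each analysis set fixed at the beginning of the series while the end grows (output not shown):
+
+
+```r
+time_slices_cumulative <-
+  tibble(x = 1:365) %>%
+  rolling_origin(initial = 6 * 30, assess = 30, skip = 29, cumulative = TRUE)
+
+# The analysis sets now all start at day 1 and grow with each resample
+map_dfr(time_slices_cumulative$splits, ~ analysis(.x) %>% data_range())
+```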
+
+
+
+## Estimating Performance {#resampling-performance}
+
+Any of the resampling methods discussed in this chapter can be used to evaluate the modeling process (including preprocessing, model fitting, etc.). These methods are effective because different groups of data are used to train and assess the model. To reiterate, the process to use resampling is as follows:
+
+1. During resampling, the analysis set is used to estimate the preprocessing, that same preprocessing is applied to the analysis set itself, and the processed data are used to fit the model.
+
+2. The preprocessing statistics produced by the analysis set are applied to the assessment set. The predictions from the assessment set estimate performance on new data.
+
+This sequence repeats for every resample. If there are _B_ resamples, there are _B_ replicates of each of the performance metrics. The final resampling estimate is the average of these _B_ statistics. If _B_ = 1, as with a validation set, the individual statistics represent overall performance.
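+
+To make those two steps concrete, here is a manual sketch for a single resample, assuming the `ames_rec` recipe from earlier chapters (functions such as `fit_resamples()` carry out this work automatically):
+
+
+```r
+split_1 <- ames_folds$splits[[1]]
+
+# Step 1: estimate the preprocessing using only the analysis set
+prepped <- recipes::prep(ames_rec, training = analysis(split_1))
+
+# Step 2: apply those same preprocessing statistics to the assessment set
+assess_processed <- recipes::bake(prepped, new_data = assessment(split_1))
+```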
+
+Let's reconsider the previous random forest model contained in the `rf_wflow` object. The `fit_resamples()` function is analogous to `fit()`, but instead of having a `data` argument, `fit_resamples()` has `resamples` which expects an `rset` object like the ones shown in this chapter. The possible interfaces to the function are:
+
+
+```r
+model_spec %>% fit_resamples(formula, resamples, ...)
+model_spec %>% fit_resamples(recipe, resamples, ...)
+workflow %>% fit_resamples( resamples, ...)
+```
+
+There are a number of other optional arguments, such as:
+
+* `metrics`: A metric set of performance statistics to compute. By default, regression models use RMSE and $R^2$ while classification models compute the area under the ROC curve and overall accuracy. Note that this choice also defines what predictions are produced during the evaluation of the model. For classification, if only accuracy is requested, class probability estimates are not generated for the assessment set (since they are not needed). A short sketch of supplying a custom metric set appears after the list of control arguments below.
+
+* `control`: A list created by `control_resamples()` with various options.
+
+The control arguments include:
+
+* `verbose`: A logical for printing logging messages.
+
+* `extract`: A function for retaining objects from each model iteration (discussed later in this chapter).
+
+* `save_pred`: A logical for saving the assessment set predictions.
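+
+As mentioned above, a custom metric set can be passed to the `metrics` argument. A sketch using the yardstick functions `metric_set()`, `rmse()`, and `mae()` (attached with the tidymodels meta-package); the fit is not evaluated here:
+
+
+```r
+# Compute RMSE and mean absolute error instead of the default metrics
+reg_metrics <- metric_set(rmse, mae)
+
+# rf_wflow %>% fit_resamples(resamples = ames_folds, metrics = reg_metrics)
+```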
+
+For our example, let's save the predictions in order to visualize the model fit and residuals:
+
+
+```r
+keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
+
+set.seed(1003)
+rf_res <-
+ rf_wflow %>%
+ fit_resamples(resamples = ames_folds, control = keep_pred)
+rf_res
+#> # Resampling results
+#> # 10-fold cross-validation
+#> # A tibble: 10 × 5
+#> splits id .metrics .notes .predictions
+#>
+#> 1