diff --git a/TMwR.bib b/TMwR.bib index a725d7ce..c980228c 100644 --- a/TMwR.bib +++ b/TMwR.bib @@ -147,4 +147,23 @@ @book{bolstad2004 author={Bolstad, B}, year={2004}, publisher={University of California, Berkeley} +} + + +@article{Durrleman1989, + author = {Durrleman, S and Simon, R}, + title = {Flexible regression models with cubic splines}, + journal = {Statistics in Medicine}, + volume = {8}, + number = {5}, + pages = {551-561}, + year = {1989} +} + + +@book{kuhn20202, + title={Feature engineering and selection: A practical approach for predictive models}, + author={Kuhn, M and Johnson, K}, + year={2020}, + publisher={CRC Press} } \ No newline at end of file diff --git a/_book/a-model-workflow.html b/_book/a-model-workflow.html index 4c465884..fb4536e1 100644 --- a/_book/a-model-workflow.html +++ b/_book/a-model-workflow.html @@ -24,7 +24,7 @@ - + diff --git a/_book/a-tale-of-two-models.html b/_book/a-tale-of-two-models.html index 31d45696..3985d1de 100644 --- a/_book/a-tale-of-two-models.html +++ b/_book/a-tale-of-two-models.html @@ -24,7 +24,7 @@ - + diff --git a/_book/a-tale-of-two-models.md b/_book/a-tale-of-two-models.md index 92725ffe..b2df7124 100644 --- a/_book/a-tale-of-two-models.md +++ b/_book/a-tale-of-two-models.md @@ -1,7 +1,7 @@ -# A tale of two models +# A tale of two models {#two-models} (tentative title) diff --git a/_book/a-tidyverse-primer.html b/_book/a-tidyverse-primer.html index 0273e201..9df0a755 100644 --- a/_book/a-tidyverse-primer.html +++ b/_book/a-tidyverse-primer.html @@ -24,7 +24,7 @@ - + diff --git a/_book/data-spending.html b/_book/data-spending.html new file mode 100644 index 00000000..734b5075 --- /dev/null +++ b/_book/data-spending.html @@ -0,0 +1,177 @@ + + + + + + + 4 Spending our data | Tidy Modeling with R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ 4 Spending our data
+ General data splitting
+ Re-emphasize roles of different data sets and good/bad ways of doing things.
+ Validation sets.
+ What we do differently with a lot of data.
+ Allude to resampling.
diff --git a/_book/data-spending.md index e08ba17d..9f46cc71 100644 --- a/_book/data-spending.md +++ b/_book/data-spending.md @@ -1,7 +1,7 @@ -# Spending our data +# Spending our data {#data-spending} General data splitting diff --git a/_book/feature-engineering.html index 275724cd..8df26f0b 100644 --- a/_book/feature-engineering.html +++ b/_book/feature-engineering.html @@ -24,15 +24,15 @@
    }); + + diff --git a/_book/how-good-is-our-model.html b/_book/how-good-is-our-model.html index 250d303b..6b1db720 100644 --- a/_book/how-good-is-our-model.html +++ b/_book/how-good-is-our-model.html @@ -24,7 +24,7 @@ - + diff --git a/_book/how-good-is-our-model.md b/_book/how-good-is-our-model.md index 47932965..13243c33 100644 --- a/_book/how-good-is-our-model.md +++ b/_book/how-good-is-our-model.md @@ -1,7 +1,7 @@ -# How good is our model? +# How good is our model? {#model-metrics} (or how well does our model work? Superman does good; a model can work well) diff --git a/_book/index.html b/_book/index.html index 06ced160..9908787a 100644 --- a/_book/index.html +++ b/_book/index.html @@ -24,7 +24,7 @@ - + @@ -72,22 +72,23 @@

    Hello World

    -

    This is the website for Tidy Modeling with R. Its purpose is to be a guide to using a new collection of software in the R programming language that enable model building. There are few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use software to create models. The focus will be on a dialect of R called the tidyverse that is designed to be a better interface for common tasks using R. If you’ve never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the tidymodels packages. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially more complex predictive or machine learning models, can be created to work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent this from occurring but common pitfalls are discussed in the course of describing and demonstrating the software.

    -

    This book is not intended to be a reference on different types of models. We suggest other resources to learn the nuances of models. A general source for information about the most common type of model, the linear model, we suggest Fox (2008). Another excellent resource for investigating and analyzing data is Wickham and Grolemund (2016). For predictive models, Kuhn and Johnson (2013) is a good resource. For pure machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive).

    +

This is the website for Tidy Modeling with R. Its purpose is to be a guide to using a new collection of software in the R programming language that enables model building. There are a few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use our software to create models. The focus will be on a dialect of R called the tidyverse that is designed to be a better interface for common tasks using R. If you’ve never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the tidymodels packages. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may also fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent these and other pitfalls.

    +

This book is not intended to be a reference on different types of these techniques. We suggest other resources to learn the nuances of models. As a general source of information about the most common type of model, the linear model, we suggest Fox (2008). Another excellent resource for investigating and analyzing data is Wickham and Grolemund (2016). For predictive models, Kuhn and Johnson (2013) is a good resource. For pure machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive).

    We do not assume that readers will have had extensive experience in model building and statistics. Some statistical knowledge is required, such as: random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course.

    -

    This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. The sources used to create the book can be found at github.com/topepo/TMwR. We use the bookdown package to create the website (Xie 2016). One reason that we chose this license and this technology for the book is so that we can make it completely reproducible; all of the code and data used to create it are free and publicly available.

    Tidy Modeling with R is currently a work in progress. As we create it, this website is updated. Be aware that, until it is finalized, the content and/or structure of the book may change.

    -

    This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting types, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the contributing.md file. Also, be aware that this effort has a code of conduct, which can be found at code_of_conduct.md.

    +

    This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting typos, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the contributing.md file. Also, be aware that this effort has a code of conduct, which can be found at code_of_conduct.md.

    In terms of software lifecycle, the tidymodels packages are fairly young. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive the specific versions of software that were used to produce it. The primary packages, and their versions, used to create this website are:

    #> ─ Session info ───────────────────────────────────────────────────────────────
     #>  setting  value                       
    @@ -129,18 +129,18 @@ 

    Hello World

    #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York -#> date 2019-12-16 +#> date 2019-12-21 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── -#> package * version date lib source -#> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) -#> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) -#> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) -#> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) -#> ggplot2 * 3.2.1.9000 2019-12-06 [1] local -#> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) -#> rlang 0.4.2.9000 2019-12-14 [1] Github (r-lib/rlang@ec7c1ed) -#> tibble * 2.99.99.9010 2019-12-06 [1] Github (tidyverse/tibble@f4365f7) +#> package * version date lib source +#> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) +#> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) +#> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) +#> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) +#> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0) +#> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) +#> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0) +#> tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

    pandoc is also instrumental in creating this work. The version used here is 2.3.1.

    @@ -160,9 +160,6 @@

    References

    Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

    -
    -

    Xie, Y. 2016. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://github.com/rstudio/bookdown.

    -
    @@ -207,6 +204,20 @@

    References

    }); + + diff --git a/_book/index.md b/_book/index.md index 43c3c0fb..53796087 100644 --- a/_book/index.md +++ b/_book/index.md @@ -3,7 +3,7 @@ knit: "bookdown::render_book" title: "Tidy Modeling with R" author: ["Max Kuhn"] -date: "2019-12-16" +date: "2019-12-21" description: "Modeling of data is integral to science, business, politics, and many other aspects of our lives. The goals of this book are to: introduce neophytes to models and the tidyverse, demonstrate the `tidymodels` packages, and to outline good practices for the phases of the modeling process." github-repo: topepo/TMwR twitter-handle: topepos @@ -16,17 +16,16 @@ colorlinks: yes # Hello World {-} -This is the website for _Tidy Modeling with R_. Its purpose is to be a guide to using a new collection of software in the R programming language that enable model building. There are few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use software to create models. The focus will be on a dialect of R called _the tidyverse_ that is designed to be a better interface for common tasks using R. If you've never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the _tidymodels packages_. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially more complex predictive or machine learning models, can be created to work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent this from occurring but common pitfalls are discussed in the course of describing and demonstrating the software. +This is the website for _Tidy Modeling with R_. Its purpose is to be a guide to using a new collection of software in the R programming language that enable model building. There are few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use our software to create models. The focus will be on a dialect of R called _the tidyverse_ that is designed to be a better interface for common tasks using R. If you've never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the _tidymodels packages_. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may also fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent these and other pitfalls. -This book is not intended to be a reference on different types of models. We suggest other resources to learn the nuances of models. A general source for information about the most common type of model, the _linear model_, we suggest @fox08. Another excellent resource for investigating and analyzing data is @wickham2016. For predictive models, @apm is a good resource. 
For pure machine learning methods, @Goodfellow is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive). +This book is not intended to be a reference on different types of these techniques We suggest other resources to learn the nuances of models. A general source for information about the most common type of model, the _linear model_, we suggest @fox08. Another excellent resource for investigating and analyzing data is @wickham2016. For predictive models, @apm is a good resource. For pure machine learning methods, @Goodfellow is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive). We do not assume that readers will have had extensive experience in model building and statistics. Some statistical knowledge is required, such as: random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course. -This website is __free to use__, and is licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0](http://creativecommons.org/licenses/by-nc-nd/3.0/us/) License. The sources used to create the book can be found at [`github.com/topepo/TMwR`](https://github.com/topepo/TMwR). We use the [`bookdown`](https://bookdown.org/) package to create the website [@bookdown]. One reason that we chose this license and this technology for the book is so that we can make it _completely reproducible_; all of the code and data used to create it are free and publicly available. _Tidy Modeling with R_ is currently a work in progress. As we create it, this website is updated. Be aware that, until it is finalized, the content and/or structure of the book may change. -This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting types, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the [`contributing.md`](https://github.com/topepo/TMwR/blob/master/contributing.md) file. Also, be aware that this effort has a code of conduct, which can be found at [`code_of_conduct.md`](https://github.com/topepo/TMwR/blob/master/code_of_conduct.md). +This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting typos, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the [`contributing.md`](https://github.com/topepo/TMwR/blob/master/contributing.md) file. Also, be aware that this effort has a code of conduct, which can be found at [`code_of_conduct.md`](https://github.com/topepo/TMwR/blob/master/code_of_conduct.md). In terms of software lifecycle, the tidymodels packages are fairly young. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive the specific versions of software that were used to produce it. The primary packages, and their versions, used to create this website are: @@ -44,18 +43,18 @@ In terms of software lifecycle, the tidymodels packages are fairly young. 
We wil #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York -#> date 2019-12-16 +#> date 2019-12-21 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── -#> package * version date lib source -#> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) -#> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) -#> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) -#> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) -#> ggplot2 * 3.2.1.9000 2019-12-06 [1] local -#> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) -#> rlang 0.4.2.9000 2019-12-14 [1] Github (r-lib/rlang@ec7c1ed) -#> tibble * 2.99.99.9010 2019-12-06 [1] Github (tidyverse/tibble@f4365f7) +#> package * version date lib source +#> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) +#> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) +#> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) +#> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) +#> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0) +#> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) +#> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0) +#> tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library ``` diff --git a/_book/introduction.html b/_book/introduction.html index 5472a78c..3bb4352c 100644 --- a/_book/introduction.html +++ b/_book/introduction.html @@ -24,7 +24,7 @@ - + @@ -32,7 +32,7 @@ - + @@ -72,22 +72,23 @@

    1 Introduction

    -

    Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a different between two groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.

    +

    Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a difference between several groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.

Since the start of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when a good time would be to walk the dog, ordering a product from a website, typing (and autocorrecting) a text message to a friend, and checking email. In each of these instances, there is a good chance that some type of model was used in an assistive way. In some cases, the contribution of the model might be easily perceived (“You might also be interested in purchasing product X”) while in other cases the impact was the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.

    Two reasons that models permeate our lives are that software exists that facilitates their creation and that data has become more easily captured and accessible. In regard to software, it is obviously critical that software produces the correct equations that represent the data. For the most part, determining mathematical correctness is possible. However, the creation of an appropriate model hinges on a few other aspects.

    -

    First, it is important that it is easy to operate the software in a proper way. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, Baggerly and Coombes (2009) report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to offset the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials (Carlson 2012). If we are to expect high quality models, it is important that the software facilitate proper usage. Abrams (2003) describes an interesting principle to live by:

    +

    First, it is important that it is easy to operate the software in a proper way. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, Baggerly and Coombes (2009) report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to offset the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials (Carlson 2012).

    +

    If we are to expect high quality models, it is important that the software facilitate proper usage. Abrams (2003) describes an interesting principle to live by:

    The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.

    Data analysis software should also espouse this idea.

    -

    The second important aspect of model building is related to scientific methodology. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This relates to software. Whenever possible, the software should be able to protect users from committing such mistakes. Here, software should make it easy for users to do the right thing.

    -

    These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. On one had, they tools should be powerful enough to create high-performance models but, one the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can can create different types of models. The software has been designed with these additional characteristics in mind.

    +

The second important aspect of model building is related to scientific methodology. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns that they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This also relates to programming. Whenever possible, the software should be able to protect users from committing such mistakes. Software should make it easy for users to do the right thing.

    +

These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. On one hand, the tools should be powerful enough to create high-performance models but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can create different types of models. The software has been designed with these additional characteristics in mind.

    The software is based on the R programming language (R Core Team 2014). R has been designed especially for data analysis and modeling. It is based on the S language which was created in the 1970’s to

    “turn ideas into software, quickly and faithfully” (Chambers 1998)

    @@ -126,64 +128,95 @@

    1 Introduction

    1.1 Types of models

Before proceeding, let’s describe a taxonomy for types of models, grouped by purpose. While not exhaustive, most models fall into at least one of these categories:

    -

    Descriptive Models: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually illustrate some trend or artifact in the data.

    -

    For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issued on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements. An early methods for evaluating such issues where probe-level models, or PLM’s (Bolstad 2004). A statistical model would be created that accounted for the known differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure 1.1(a) shows an application of this method for two microarrays taken from Gentleman et al. (2005). The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.

    +

    Descriptive Models: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.

    +

For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned.

    +

An early method for evaluating such issues was the probe-level model, or PLM (Bolstad 2004). A statistical model would be created that accounted for the known differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure 1.1(a) shows an application of this method for two microarrays taken from Gentleman et al. (2005). The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.


    Figure 1.1: Two examples of how descriptive models can be used to illustrate specific patterns.

    -

    Another more general, and simpler example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS (Cleveland 1979). Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure 1.1(b) where a nonlinear trend is illuminated by the flexible smoother.

    +

    Another example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS (Cleveland 1979). Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure 1.1(b) where a nonlinear trend is illuminated by the flexible smoother.
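As a small illustration (our own sketch using the built-in `mtcars` data, not an example from the text), base R’s `loess()` can draw out such a trend:

```r
# Fit a flexible LOESS smoother to expose a nonlinear trend in the data
fit <- loess(mpg ~ disp, data = mtcars, span = 0.75)

# Predict over a grid and draw the smooth over the raw points
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), length.out = 100))
plot(mpg ~ disp, data = mtcars, pch = 19, col = "grey50")
lines(grid$disp, predict(fit, newdata = grid), lwd = 2)
```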

    Inferential Models: In these situations, the goal is to produce a decision for a research question or to test a specific hypothesis. The goal is to make some statement of truth regarding some predefined conjecture or idea. In many (but not all) cases, some qualitative statement is produced.

    -

    For example, in a clinical trial, the goal might be to provide some confirmation that a new therapy does a better job in prolonging life than an alternative (perhaps an existing therapy or no therapy). If the clinical endpoint was related to survival or a patient, the null hypothesis might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using the traditional null hypothesis significance testing (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such an difference (which could be due to a number of reasons).

    -

    What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to make such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results are highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution X, then test statistic Y can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.

    -

    One consequence of relying on formal statistical assumptions is that there tends to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.

    +

For example, in a clinical trial, the goal might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative (e.g., an existing therapy or no treatment). If the clinical endpoint was related to the survival of a patient, the null hypothesis might be that the two therapeutic groups have equal median survival times, with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using traditional null hypothesis significance testing (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such a difference (which could be due to a number of reasons).
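For a concrete sketch of this kind of analysis, the log-rank test below compares two treatment groups in the `aml` data that ship with the `survival` package (an illustrative stand-in for a real trial, not an analysis from the text):

```r
library(survival)

# Log-rank test comparing survival between the two treatment groups
survdiff(Surv(time, status) ~ x, data = aml)

# Kaplan-Meier fit; printing it shows the median survival for each group
survfit(Surv(time, status) ~ x, data = aml)
```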

    +

What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results is highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution X, then test statistic Y can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.

    +

    One aspect of inferential analyses is that there tends to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.

    Predictive Models: There are occasions where data are modeled in an effort to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.

    A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to his/her store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit.

For this type of model, the problem type is one of estimation rather than inference. For example, the buyer is usually not concerned with a question such as “Will I sell more than 100 copies of book X next month?” but rather “How many copies of X will customers purchase next month?” Also, depending on the context, there may not be any interest in why the predicted value is X. In other words, there is more interest in the value itself than in evaluating a formal hypothesis related to the data. That said, the prediction can also include measures of uncertainty. In the case of the book buyer, some sort of forecasting error might be valuable to help them decide on how many to purchase or could serve as a metric to gauge how well the prediction method worked.

    What are the most important factors affecting predictive models? There are many different ways that a predictive model could be created. The important factors depend on how the model was developed.

    -

    For example, in some cases, a mechanistic model can be developed based on first principles and results in a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person’s body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the known parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model functions based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.

    -

    Empirically driven models are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple K-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, previous data from existing books may be available. If a 5-nearest neighbor model were used, the buyer might estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of “similar”). For this model, it is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model is no consistent with the mechanism that generated the data, the predictions would not be close to the actual values.

    +

In some cases, a mechanistic model can be developed based on first principles to produce a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person’s body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the unknown parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.

    +

Empirically driven models are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple K-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of “similar”). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model were a poor match for the mechanism that generated the data, the predictions would not be close to the actual values.
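A bare-bones sketch of this structure, written in base R with simulated “sales” data (our illustration, not code from the text):

```r
# A minimal 5-nearest neighbor regression: no model equation, just the
# structure of the prediction (the mean outcome of the k closest references)
knn_predict <- function(new_x, ref_x, ref_y, k = 5) {
  dists <- sqrt(rowSums(sweep(ref_x, 2, new_x)^2))  # Euclidean distances
  mean(ref_y[order(dists)[1:k]])
}

set.seed(123)
ref_x <- matrix(rnorm(200), ncol = 2)                 # simulated predictors
ref_y <- as.numeric(ref_x %*% c(2, -1) + rnorm(100))  # simulated sales
knn_predict(c(0.5, 0.5), ref_x, ref_y)
```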

    Broader discussions of these distinctions can be found in Breiman (2001) and Shmueli (2010). Note that we have defined the type of model by how it is used rather than its mathematical qualities. An ordinary linear regression model might fall into all three classes of models, depending on how it is used:

      -
    • A descriptive LOESS model uses linear regression to estimate the trend in the data.

    • -
    • An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.

    • -
    • If a simple linear regression model produces highly correct predictions, it can be used as a predictive model.

    • +
• A descriptive smoother similar to LOESS, called restricted smoothing splines (Durrleman and Simon 1989), can be used to describe trends in data using ordinary linear regression with specialized terms (see the sketch after this list).

    • +
    • An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.

    • +
    • If a simple linear regression model produces highly accurate predictions, it can be used as a predictive model.

    -

    However, there are more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.

    +

    However, there are many more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.

There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables relate to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the statistical qualities of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building are usually used to determine a minimal subset of independent variables that have a “statistically significant” relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g. 0.05). From here, the analyst typically focuses on making qualitative statements about the relative influence that the variables have on the outcome.

    -

    A potential problem with this approach is that it can be dangerous when statistical significance is used as the only measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but an accuracy of 35%? Predictive performance tends to be related to how close the model’s fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model is good.

    +

    A potential problem with this approach is that it can be dangerous when statistical significance is used as the only measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but a dismal accuracy? Predictive performance tends to be related to how close the model’s fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model should be used. This may seem intuitively obvious, but is often ignored in real-world data analysis.
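A small simulation (our own, with hypothetical data) makes the point: a weak predictor can earn a convincing p-value in a large sample while the model predicts barely better than chance:

```r
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.1 + 0.1 * x))  # a real but very weak effect

fit <- glm(y ~ x, family = binomial)
summary(fit)$coefficients   # x earns a small p-value at this sample size

# Hard class predictions at a 50% cutoff; accuracy is barely above chance
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
mean(pred == y)
```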

    1.2 Some terminology

    -

    supervise, unsupervised

    -

    types of variables

    -

    types of predictors

    -

    The mode of the model: regression, classification

    +

    Before proceeding, some additional terminology related to modeling, data, and other quantities should be outlined. These descriptions are not exhaustive.

    +

    First, many models can be categorized as being supervised or unsupervised. Unsupervised models are those that seek patterns, clusters, or other characteristics of the data but lack an outcome variable (i.e., a dependent variable). For example, principal component analysis (PCA), clustering, and autoencoders are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category. Within supervised models, the two main sub-categories are:

    +
      +
    • Regression, where a numerical outcome is being predicted.

    • +
    • Classification, where the outcome is an ordered or unordered set of qualitative values.

    • +
    +

    These are imperfect definitions and do not account for all possible types of models. In coming chapters, we refer to these types of supervised techniques as the model mode.
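As a quick sketch of these flavors using built-in data sets (our illustration):

```r
# Unsupervised: principal component analysis has no outcome variable
pca <- prcomp(USArrests, scale. = TRUE)

# Supervised, regression mode: a numeric outcome
reg <- lm(mpg ~ wt + hp, data = mtcars)

# Supervised, classification mode: a qualitative (here, binary) outcome
cls <- glm(am ~ wt + hp, data = mtcars, family = binomial)
```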

    +

In terms of data, the two main categories are quantitative and qualitative. Examples of the former are real numbers and integers. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be placed on a numeric scale.

    +

Different variables can have different roles in an analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the value being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms outcomes and predictors are used most frequently in this text.

    -
    -

    1.3 Where does modeling fit into the data analysis/scientific process?

    -

    (probably need to get explicit permission to use this)

    +
    +

    1.3 How does modeling fit into the data analysis/scientific process?

    +

In what circumstances are models created? Are there steps that precede such an undertaking? Is it the first step in data analysis?

    +

There are always a few critical phases of data analysis that come before modeling. First, there is the chronically underestimated process of cleaning the data. No matter the circumstances, the data should be investigated to make sure that they are well understood, applicable to the project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances).

    +

Data cleaning can also overlap with the second phase of understanding the data, often referred to as exploratory data analysis (EDA). There should be knowledge of how the different variables relate to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is “how did I come by these data?” This question can help understand how the data at-hand have been sampled or filtered and if these operations were appropriate. For example, when merging database tables, a join may go awry and accidentally eliminate one or more sub-populations of samples (as the sketch below illustrates). Another good idea would be to ask if the data are relevant. For example, to predict whether patients have Alzheimer’s disease or not, it would be unwise to have a data set containing subjects with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model may simply predict which patients are oldest.
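For instance, a minimal dplyr sketch with hypothetical tables shows how an inner join can silently drop a sub-population:

```r
library(dplyr)

patients <- tibble(id = 1:4, age = c(54, 61, 70, 66))
labs     <- tibble(id = c(1, 2, 4), score = c(0.8, 1.1, 0.9))

# inner_join() silently drops patient 3, who has no lab record
inner_join(patients, labs, by = "id")

# left_join() keeps every patient and exposes the gap as an NA
left_join(patients, labs, by = "id")
```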

    +

Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one performance metric should be identified with realistic goals of what can be achieved. Common statistical metrics are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighed. It is also important that the metric be germane (i.e., alignment with the broader data analysis goals is critical).
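A base R sketch of two such metrics, computed on simulated values (our illustration):

```r
set.seed(7)
truth <- rnorm(100)
est   <- truth + rnorm(100, sd = 0.5)

# Root mean squared error for a numeric outcome
sqrt(mean((truth - est)^2))

obs  <- rbinom(100, 1, 0.5)
pred <- rbinom(100, 1, 0.5)

# Classification accuracy for a binary outcome
mean(pred == obs)
```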


    Figure 1.2: The data science process (from R for Data Science).

    -
    -
    -

    1.4 Modeling is a process, not a single activity

    -

    (probably need to get explicit permission to use this too)

    +

The process of investigating the data may not be simple. Wickham and Grolemund (2016) contains an excellent illustration of the general data analysis process, reproduced in Figure 1.2. Data ingestion and cleaning are shown as the initial steps. When the analytical steps commence, they are a heuristic process; we cannot pre-determine how long they may take. The cycle of analysis, modeling, and visualization often requires multiple iterations.

    -A schematic for the typical modeling process. +A schematic for the typical modeling process (from _Feature Engineering and Selection_).

    -Figure 1.3: A schematic for the typical modeling process. +Figure 1.3: A schematic for the typical modeling process (from Feature Engineering and Selection).

    +

    This iterative process is especially true for modeling. Figure 1.3 originates from Kuhn and Johnson (2020) and is meant to emulate the typical path to determining an appropriate model. The general phases are:

    +
      +
    • Exploratory data analysis (EDA) and Quantitative Analysis (blue bars). Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure 1.2) where different discoveries lead to more questions and data analysis “side-quests” to gain more understanding.
    • +
• Feature engineering (green bars). This understanding translates into the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (e.g., using the ratio of two predictors).

    • +
• Model tuning and selection (red and gray bars). A variety of models are generated and their performance is compared. Some models require parameter tuning, where structural parameters cannot be estimated directly from the data and must be specified or optimized.

    • +
    +

    After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize and document the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.

    +
    +
    +

    1.4 Where does the model begin and end?

    +

    So far, we have defined the model to be a structural equation that relates some predictors to one or more outcomes. Let’s consider ordinary linear regression as a simple and well known example. The outcome data are denoted as \(y_i\), where there are \(i = 1 \ldots n\) samples in the data set. Suppose that there are \(p\) predictors \(x_{i1}, \ldots, x_{ip}\) that are used to predict the outcome. Linear regression produces a model equation of

    +

    \[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_px_{ip} \]

    +

While this is a linear model, it is only linear in the parameters. The predictors could be nonlinear terms (such as \(\log(x_i)\)).
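A brief sketch of fitting such an equation with built-in data (our illustration):

```r
# Linear in the parameters, nonlinear in the predictor via log()
fit <- lm(mpg ~ log(disp) + wt, data = mtcars)
coef(fit)  # the estimated beta-hats
predict(fit, newdata = data.frame(disp = 200, wt = 3))
```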

    +

The conventional way of thinking is that the modeling process is encapsulated by the model. For many data sets that are straightforward in nature, this is the case. However, there are a variety of choices and additional steps that often occur before the data are ready to be added to the model. Some examples:

    +
      +
• While our model has \(p\) predictors, it is common to start with more than this number of candidate predictors. Through exploratory data analysis or previous experience, some of the predictors may be excluded from the analysis. In other cases, a feature selection algorithm may be used to make a data-driven choice for the minimal set of predictors to be used in the model.
    • +
• There are times when the value of an important predictor is not known. Rather than eliminating this value from the data set, it could be imputed using other values in the data. For example, if \(x_1\) were missing but was correlated with predictors \(x_2\) and \(x_3\), an imputation method could estimate the missing \(x_1\) observation from the values of \(x_2\) and \(x_3\).
    • +
    • As previously mentioned, it may be beneficial to transform the scale of a predictor. If there is not a priori information on what the new scale should be, it might be estimated using a transformation technique. Here, the existing data would be used to statistically estimate the proper scale that optimizes some criterion. Other transformations, such as the previously mentioned PCA, take groups of predictors and transform them into new features that are used as the predictors.
    • +
    +
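To make the imputation and transformation examples concrete, here is a sketch using the recipes package and a small simulated data set (both are our illustrative choices, not code from this text): a missing value in one predictor is imputed from two correlated predictors, and the predictors are then converted to principal components.

```r
library(recipes)

# Simulated data: x1 has one missing value but is correlated with x2 and x3
dat <- data.frame(
  y  = c(3.1, 4.0, 5.2, 6.1, 7.3),
  x1 = c(1.0, NA,  3.0, 4.1, 5.0),
  x2 = c(1.1, 2.0, 3.2, 4.0, 5.1),
  x3 = c(0.9, 2.1, 2.9, 4.2, 4.8)
)

rec <- recipe(y ~ ., data = dat) %>%
  # estimate the missing x1 value from x2 and x3 with a linear model
  step_impute_linear(x1, impute_with = imp_vars(x2, x3)) %>%
  # center and scale, then replace the predictors with two principal components
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)

# prep() estimates the steps from the data; bake() applies them
bake(prep(rec), new_data = NULL)
```

Each of these steps uses the data to estimate its parameters, which is exactly why they belong inside the broader modeling process rather than outside of it.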

While the examples above are related to steps that occur before the model, there may also be operations that occur after the model is created. For example, when a classification model is created where the outcome is binary (e.g., event and non-event), it is customary to use a 50% probability cutoff to create a discrete class prediction (also known as a “hard prediction”). Suppose a classification model estimates that the probability of an event is 62%; using the typical default, the hard prediction would be event. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to raise the cutoff from 50% to some greater value. This increases the level of evidence required to classify a new sample as an event. While this reduces the true positive rate (which is bad), it may have a more profound effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works even though it is not contained in the model fitting step.
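A minimal base R sketch of this post-processing choice, with invented probabilities:

```r
# Hypothetical predicted probabilities of an event from some classifier
prob_event <- c(0.62, 0.48, 0.81, 0.55, 0.30)

# The typical default: a 50% cutoff for the hard class prediction
ifelse(prob_event >= 0.50, "event", "non-event")

# Raising the cutoff to 70% demands more evidence before predicting an
# event, trading some true positives for fewer false positives
ifelse(prob_event >= 0.70, "event", "non-event")
```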

    +

    These examples have a common characteristic of requiring data for derivations that alter the raw data values or the predictions generated by the model.

    +

It is very important to focus on the broader model fitting process instead of the specific model being used to estimate parameters. This broader process includes any pre-processing steps, the model fit itself, as well as any potential post-processing activities. In this text, this will be referred to as the model workflow; it includes any data-driven activities that are used to produce a final model equation.

    +

    This will come into play when topics such as resampling (Chapter 8) and model tuning are discussed. Chapter 7 describes software for creating a model workflow.
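As a preview of that software, here is a sketch of how a workflow might bundle pre-processing and a model together using the workflows package (the specific recipe and model are placeholder choices of our own, not code from this text):

```r
library(recipes)
library(parsnip)
library(workflows)

# A small pre-processing recipe and a linear model specification
rec <- recipe(mpg ~ disp + wt, data = mtcars) %>%
  step_log(disp)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# The workflow carries the pre-processing and the model as one unit
wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

# Fitting the workflow estimates the recipe steps and the model together
fit(wflow, data = mtcars)
```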

    1.5 Outline of future chapters

    +

The first order of business is to introduce (or review) the ideas and syntax of the tidyverse in Chapter 2. In this chapter, we also summarize the unmet needs for modeling when using R. This provides good motivation for why model-specific tidyverse techniques are needed. This chapter also outlines some additional principles related to these challenges.

    +

Chapter 3 shows two different data analyses for the same data, where one is focused on prediction and the other on inference. This illustrates the challenges of each approach and which issues are most relevant for each.

    @@ -210,9 +243,15 @@

    References

    Cleveland, W. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36.

    +
    +

    Durrleman, S, and R Simon. 1989. “Flexible Regression Models with Cubic Splines.” Statistics in Medicine 8 (5): 551–61.

    +

    Gentleman, R, V Carey, W Huber, R Irizarry, and S Dudoit. 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Berlin, Heidelberg: Springer-Verlag.

    +
    +

    Kuhn, M, and K Johnson. 2020. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.

    +

    R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

    @@ -222,6 +261,9 @@

    References

    Wickham, H, M Averick, J Bryan, W Chang, L McGowan, R François, G Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43).

    +
    +

    Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

    +
    @@ -229,7 +271,7 @@

    References

    - + @@ -266,6 +308,20 @@

    References

});
+
+
diff --git a/_book/introduction.md b/_book/introduction.md
index e9c3ea53..9398ef76 100644
--- a/_book/introduction.md
+++ b/_book/introduction.md
@@ -3,21 +3,23 @@
# Introduction
-Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a different between two groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.
+Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a difference between several groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.
Since the start of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when a good time would be to walk the dog, ordering a product from a website, typing (and autocorrecting) a text message to a friend, and checking email. In each of these instances, there is a good chance that some type of model was used in an assistive way. In some cases, the contribution of the model might be easily perceived ("You might also be interested in purchasing product _X_") while in other cases the impact was the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.
Two reasons that models permeate our lives are that software exists that facilitates their creation and that data has become more easily captured and accessible. In regard to software, it is obviously critical that software produces the _correct_ equations that represent the data. For the most part, determining mathematical correctness is possible. However, the creation of an appropriate model hinges on a few other aspects.
-First, it is important that it is easy to operate the software in a _proper way_. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, @baggerly2009 report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to _offset_ the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials [@Carlson2012]. If we are to expect high quality models, it is important that the software facilitate proper usage. @abrams2003 describes an interesting principle to live by:
+First, it is important that it is easy to operate the software in a _proper way_. For example, the user interface should not be so arcane that the user would not know that they have specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, @baggerly2009 report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to _offset_ the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials [@Carlson2012].
+
+If we are to expect high quality models, it is important that the software facilitate proper usage. @abrams2003 describes an interesting principle to live by:
> The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.
Data analysis software should also espouse this idea.
-The second important aspect of model building is related to _scientific methodology_. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This relates to software. Whenever possible, the software should be able to protect users from committing such mistakes. Here, software should make it easy for users to do the right thing.
+The second important aspect of model building is related to _scientific methodology_. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This also relates to programming. Whenever possible, the software should be able to protect users from committing such mistakes. Software should make it easy for users to do the right thing.
-These two aspects of model creation are crucial.
Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be _robust_ to the experience of the user. On one had, they tools should be powerful enough to create high-performance models but, one the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can can create different types of models. The software has been designed with these additional characteristics in mind.
+These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be _robust_ to the experience of the user. On one hand, the tools should be powerful enough to create high-performance models but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can create different types of models. The software has been designed with these additional characteristics in mind.
The software is based on the R programming language [@baseR]. R has been designed especially for data analysis and modeling. It is based on the _S language_ which was created in the 1970's to
@@ -31,25 +33,27 @@ One collection of packages is called the **_tidyverse_** [@tidyverse]. The tidyv
Before proceeding, let's describe a taxonomy for types of models, grouped by _purpose_. While not exhaustive, most models fall into _at least_ one of these categories:
-**Descriptive Models**: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually illustrate some trend or artifact in the data.
+**Descriptive Models**: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.
+
+For example, large scale measurements of RNA have been possible for some time using _microarrays_. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned.
-For example, large scale measurements of RNA have been possible for some time using _microarrays_. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issued on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements. An early methods for evaluating such issues where _probe-level models_, or PLM's [@bolstad2004]. A statistical model would be created that accounted for the _known_ differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure \@ref(fig:descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.
+An early method for evaluating such issues was _probe-level models_, or PLMs [@bolstad2004]. A statistical model would be created that accounted for the _known_ differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe the chip off and rescan). Figure \@ref(fig:descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.
    Two examples of how descriptive models can be used to illustrate specific patterns.

    (\#fig:descr-examples)Two examples of how descriptive models can be used to illustrate specific patterns.

-Another more general, and simpler example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS [@cleveland1979]. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of _smoothers_ are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure \@ref(fig:descr-examples)(b) where a nonlinear trend is illuminated by the flexible smoother.
+Another example of a descriptive model is the _locally estimated scatterplot smoothing_ model, more commonly known as LOESS [@cleveland1979]. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of _smoothers_ are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure \@ref(fig:descr-examples)(b) where a nonlinear trend is illuminated by the flexible smoother.
**Inferential Models**: In these situations, the goal is to produce a decision for a research question or to test a specific hypothesis. The goal is to make some statement of truth regarding some predefined conjecture or idea. In many (but not all) cases, some qualitative statement is produced.
-For example, in a clinical trial, the goal might be to provide some confirmation that a new therapy does a better job in prolonging life than an alternative (perhaps an existing therapy or no therapy). If the clinical endpoint was related to survival or a patient, the _null hypothesis_ might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using the traditional *null hypothesis significance testing* (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such an difference (which could be due to a number of reasons).
+For example, in a clinical trial, the goal might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative (e.g., an existing therapy or no treatment). If the clinical endpoint was related to the survival of a patient, the _null hypothesis_ might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using traditional *null hypothesis significance testing* (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such a difference (which could be due to a number of reasons).
-What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to make such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results are highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution _X_, then test statistic _Y_ can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.
+What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results is highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution _X_, then test statistic _Y_ can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.
-One consequence of relying on formal statistical assumptions is that there _tends_ to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.
+One aspect of inferential analyses is that there _tends_ to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.
**Predictive Models**: There are occasions where data are modeled in an effort to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.
@@ -59,55 +63,97 @@ For this type of model, the problem type is one of _estimation_ rather than infe
What are the most important factors affecting predictive models? There are many different ways that a predictive model could be created. The important factors depend on how the model was developed.
-For example, in some cases, a _mechanistic model_ can be developed based on first principles and results in a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person's body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the known parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model functions based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.
+For example, a _mechanistic model_ could be developed based on first principles to produce a model equation that is dependent on assumptions. For instance, when predicting the amount of a drug that is in a person's body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the known parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.
-_Empirically driven models_ are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple _K_-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, previous data from existing books may be available. If a 5-nearest neighbor model were used, the buyer might estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of "similar"). For this model, it is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model is no consistent with the mechanism that generated the data, the predictions would not be close to the actual values.
+_Empirically driven models_ are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple _K_-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of "similar"). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was not a good choice, the predictions would not be close to the actual values.
Broader discussions of these distinctions can be found in @breiman2001 and @shmueli2010.
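As an illustration of that structure, here is a sketch of a 5-nearest neighbor regression for the book-buying example, using the parsnip package with the kknn engine and invented sales data (all of these choices are ours, for illustration only):

```r
library(parsnip)

# Invented historical data for existing books
books <- data.frame(
  sales = c(150, 210, 180, 95, 220, 130, 170),
  price = c(25, 18, 22, 40, 19, 30, 24),
  pages = c(320, 280, 300, 450, 260, 380, 310)
)

# A 5-nearest neighbor regression model; the kknn package does the fitting
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_mode("regression") %>%
  set_engine("kknn")

knn_fit <- fit(knn_spec, sales ~ price + pages, data = books)

# The prediction is driven by the five most similar existing books
predict(knn_fit, new_data = data.frame(price = 21, pages = 290))
```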
Note that we have defined the type of model by how it is used rather than its mathematical qualities. An ordinary linear regression model might fall into all three classes of models, depending on how it is used:
-* A descriptive LOESS model uses linear regression to estimate the trend in the data.
+* A descriptive smoother similar to LOESS, called _restricted smoothing splines_ [@Durrleman1989], can be used to describe trends in data using ordinary linear regression with specialized terms.
-* An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.
+* An _analysis of variance_ (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.
-* If a simple linear regression model produces highly correct predictions, it can be used as a predictive model.
+* If a simple linear regression model produces highly accurate predictions, it can be used as a predictive model.
-However, there are more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.
+However, there are many more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.
There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables relate to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the _statistical qualities_ of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building are usually used to determine a minimal subset of independent variables that have a "statistically significant" relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g., 0.05). From here, the analyst typically focuses on making qualitative statements about the relative influence that the variables have on the outcome.
-A potential problem with this approach is that it can be dangerous when statistical significance is used as the _only_ measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but an accuracy of 35%? Predictive performance tends to be related to how close the model's fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model is good.
+A potential problem with this approach is that it can be dangerous when statistical significance is used as the _only_ measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but a dismal accuracy? Predictive performance tends to be related to how close the model's fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model should be used. This may seem intuitively obvious, but is often ignored in real-world data analysis.
## Some terminology
-supervise, unsupervised
+Before proceeding, some additional terminology related to modeling, data, and other quantities should be outlined. These descriptions are not exhaustive.
+
+First, many models can be categorized as being _supervised_ or _unsupervised_. Unsupervised models are those that seek patterns, clusters, or other characteristics of the data but lack an outcome variable (i.e., a dependent variable). For example, principal component analysis (PCA), clustering, and autoencoders are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category. Within supervised models, the two main sub-categories are:
+
+ * _Regression_, where a numerical outcome is being predicted.
+
+ * _Classification_, where the outcome is an ordered or unordered set of _qualitative_ values.
+
+These are imperfect definitions and do not account for all possible types of models. In coming chapters, we refer to these types of supervised techniques as the _model mode_.
-types of variables
+In terms of data, the main types are quantitative and qualitative. Examples of the former are real numbers and integers. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be placed on a numeric scale.
-types of predictors
+Different variables can have different _roles_ in an analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the values being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms _outcomes_ and _predictors_ are used most frequently in this text.
-The mode of the model: regression, classification
+## How does modeling fit into the data analysis/scientific process? {#model-phases}
+
+In what circumstances are models created? Are there steps that precede such an undertaking? Is it the first step in data analysis?
-## Where does modeling fit into the data analysis/scientific process?
+There are always a few critical phases of data analysis that come before modeling. First, there is the chronically underestimated process of **cleaning the data**. No matter the circumstances, the data should be investigated to make sure that it is well understood, applicable to the project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances).
-(probably need to get explicit permission to use this)
+Data cleaning can also overlap with the second phase of **understanding the data**, often referred to as exploratory data analysis (EDA). There should be knowledge of how the different variables relate to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is "how did I come by _these_ data?" This question can help understand how the data at hand have been sampled or filtered and if these operations were appropriate. For example, when merging database tables, a join may go awry and accidentally eliminate one or more sub-populations of samples. Another good idea would be to ask if the data are _relevant_. For example, to predict whether patients have Alzheimer's disease or not, it would be unwise to have a data set containing subjects with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model may simply predict who the oldest patients are.
+
+Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one _performance metric_ should be identified with realistic goals of what can be achieved. Common statistical metrics are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighed. It is also important that the metric be germane (i.e., alignment with the broader data analysis goals is critical).
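For instance, a metric such as the root mean squared error can be computed with the yardstick package; the observed and predicted values below are invented for illustration:

```r
library(yardstick)

# Invented observed outcomes and model predictions
results <- data.frame(
  truth    = c(3.2, 4.8, 5.1, 6.0, 7.4),
  estimate = c(3.0, 5.0, 4.7, 6.3, 7.1)
)

# Root mean squared error, reported in the units of the outcome
rmse(results, truth = truth, estimate = estimate)
```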
    The data science process (from _R for Data Science_).

    (\#fig:data-science-model)The data science process (from _R for Data Science_).

+The process of investigating the data may not be simple. @wickham2016 contains an excellent illustration of the general data analysis process, reproduced in Figure \@ref(fig:data-science-model). Data ingestion and cleaning are shown as the initial steps. Once the analytical steps commence, the process becomes heuristic; we cannot pre-determine how long it will take. The cycle of analysis, modeling, and visualization often requires multiple iterations.
+
+
    +A schematic for the typical modeling process (from _Feature Engineering and Selection_). +

    (\#fig:modeling-process)A schematic for the typical modeling process (from _Feature Engineering and Selection_).

    +
-
-## Modeling is a _process_, not a single activity
+This iteration is especially pronounced for modeling. Figure \@ref(fig:modeling-process) originates from @kuhn20202 and illustrates the typical path to determining an appropriate model. The general phases are:
+
+ * Exploratory data analysis (EDA) and Quantitative Analysis (blue bars). Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure \@ref(fig:data-science-model)) where different discoveries lead to more questions and data analysis "side-quests" to gain more understanding.
+
+ * Feature engineering (green bars). This understanding translates into the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (e.g., the ratio of two predictors).
-(probably need to get explicit permission to use this too)
+ * Model tuning and selection (red and gray bars). A variety of models are generated and their performance is compared. Some models require _parameter tuning_, where structural parameters must be specified or optimized.
+
+After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize and document the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.
    -A schematic for the typical modeling process. -

    (\#fig:modeling-process)A schematic for the typical modeling process.

    -
+
+
+## Where does the model begin and end? {#begin-model-end}
+
+So far, we have defined the model to be a structural equation that relates some predictors to one or more outcomes. Let's consider ordinary linear regression as a simple and well-known example. The outcome data are denoted as $y_i$, where there are $i = 1 \ldots n$ samples in the data set. Suppose that there are $p$ predictors $x_{i1}, \ldots, x_{ip}$ that are used to predict the outcome. Linear regression produces a model equation of
+
+$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_px_{ip} $$
+
+While this is a _linear_ model, it is only linear in the parameters; the predictors themselves can enter as nonlinear terms (such as $\log(x_i)$).
+
+The conventional way of thinking is that the modeling _process_ is encapsulated by the model. For many data sets that are straightforward in nature, this is the case. However, there are a variety of _choices_ and additional steps that often occur before the data are ready to be added to the model. Some examples:
+
+* While our model has $p$ predictors, it is common to start with more than this number of candidate predictors. Through exploratory data analysis or previous experience, some of the predictors may be excluded from the analysis. In other cases, a feature selection algorithm may be used to make a data-driven choice for the minimal set of predictors to be used in the model.
+* There are times when the value of an important predictor is not known. Rather than eliminating this value from the data set, it could be _imputed_ using other values in the data. For example, if $x_1$ were missing but was correlated with predictors $x_2$ and $x_3$, an imputation method could estimate the missing $x_1$ observation from the values of $x_2$ and $x_3$.
+* As previously mentioned, it may be beneficial to transform the scale of a predictor. If there is **not** _a priori_ information on what the new scale should be, it might be estimated using a transformation technique. Here, the existing data would be used to statistically _estimate_ the proper scale that optimizes some criterion. Other transformations, such as the previously mentioned PCA, take groups of predictors and transform them into new features that are used as the predictors.
+
+While the examples above are related to steps that occur before the model, there may also be operations that occur after the model is created. For example, when a classification model is created where the outcome is binary (e.g., `event` and `non-event`), it is customary to use a 50% probability cutoff to create a discrete class prediction (also known as a "hard prediction"). Suppose a classification model estimates that the probability of an event is 62%; using the typical default, the hard prediction would be `event`. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to _raise_ the cutoff from 50% to some greater value. This increases the level of evidence required to classify a new sample as an event. While this reduces the true positive rate (which is bad), it may have a more profound effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works even though it is not contained in the model fitting step.
+ +These examples have a common characteristic of requiring data for derivations that alter the raw data values or the predictions generated by the model. + +It is very important to focus on the broader _model fitting process_ instead of the specific model being used to estimate parameters. This would include any pre-processing steps, the model fit itself, as well as potential post-processing activities. In this text, this will be referred to as the **model workflow** and would include any data-driven activities that are used to produce a final model equation. + +This will come into play when topics such as resampling (Chapter \@ref(resampling)) and model tuning are discussed. Chapter \@ref(workflows) describes software for creating a model workflow. ## Outline of future chapters +The first order of business is to introduce (or review) the ideas and syntax of the tidyverse in Chapter \@ref(tidyverse-primer). In this chapter, we also summarize the unmet needs for modeling when using R. This provides good motivation for why model-specific tidyverse techniques are needed. This chapter also outlines some additional principles related to this challenges. + +Chapter \@ref(two-models) shows two different data analyses for the same data where one is focused on prediction and the other is for inference. This should illustrates the challenges for each approach and what issues are most relavant for each. + diff --git a/_book/model-metrics.html b/_book/model-metrics.html new file mode 100644 index 00000000..41e2e686 --- /dev/null +++ b/_book/model-metrics.html @@ -0,0 +1,176 @@ + + + + + + + 5 How good is our model? | Tidy Modeling with R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    +
    +

    5 How good is our model?

    +

    (or how well does our model work? Superman does good; a model can work well)

    +

    Measuring performance

    +

Don’t re-evaluate the training set

    +

    Statistical significance as a measure of effectiveness.

    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + + + + diff --git a/_book/resampling-for-evaluating-performance.html b/_book/resampling-for-evaluating-performance.html index 3aa341b5..6c4dc1c5 100644 --- a/_book/resampling-for-evaluating-performance.html +++ b/_book/resampling-for-evaluating-performance.html @@ -24,7 +24,7 @@ - + @@ -132,6 +132,9 @@

    8 Resampling for evaluating perfo

    Cleveland, W. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36.

    +

    Durrleman, S, and R Simon. 1989. “Flexible Regression Models with Cubic Splines.” Statistics in Medicine 8 (5): 551–61.

    +
    +

    Fox, J. 2008. Applied Regression Analysis and Generalized Linear Models. Second. Thousand Oaks, CA: Sage.

    @@ -144,6 +147,9 @@

    8 Resampling for evaluating perfo

    Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.

    +

    ———. 2020. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.

    +
    +

    R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

    @@ -155,9 +161,6 @@

    8 Resampling for evaluating perfo

    Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

    -
    -

Xie, Y. 2016. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman & Hall/CRC. https://github.com/rstudio/bookdown.

    -

    diff --git a/_book/resampling.html b/_book/resampling.html new file mode 100644 index 00000000..5d179213 --- /dev/null +++ b/_book/resampling.html @@ -0,0 +1,226 @@ + + + + + + + 8 Resampling for evaluating performance | Tidy Modeling with R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    +
    +

    8 Resampling for evaluating performance

    +

Maybe include some simple examples of comparing models using resampling (perhaps go full tidyposterior?)
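Until those examples are written, here is a minimal sketch of the kind of resampling object this chapter will build on, using the rsample package (the data set and seed are arbitrary choices):

```r
library(rsample)

# Ten cross-validation folds; each resample has an analysis portion for
# fitting and an assessment portion for estimating performance
set.seed(123)
folds <- vfold_cv(mtcars, v = 10)
folds
```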

    + +
    +
    +

    Abrams, B. 2003. “The Pit of Success.” https://blogs.msdn.microsoft.com/brada/2003/10/02/the-pit-of-success/.

    +
    +
    +

    Baggerly, K, and K Coombes. 2009. “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology.” The Annals of Applied Statistics 3 (4): 1309–34.

    +
    +
    +

    Bolstad, B. 2004. Low-Level Analysis of High-Density Oligonucleotide Array Data: Background, Normalization and Summarization. University of California, Berkeley.

    +
    +
    +

    Breiman, L. 2001. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.

    +
    +
    +

    Carlson, B. 2012. “Putting Oncology Patients at Risk.” Biotechnology Healthcare 9 (3): 17–21.

    +
    +
    +

    Chambers, J. 1998. Programming with Data: A Guide to the S Language. Berlin, Heidelberg: Springer-Verlag.

    +
    +
    +

    Cleveland, W. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36.

    +
    +
    +

    Durrleman, S, and R Simon. 1989. “Flexible Regression Models with Cubic Splines.” Statistics in Medicine 8 (5): 551–61.

    +
    +
    +

    Fox, J. 2008. Applied Regression Analysis and Generalized Linear Models. Second. Thousand Oaks, CA: Sage.

    +
    +
    +

    Gentleman, R, V Carey, W Huber, R Irizarry, and S Dudoit. 2005. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Berlin, Heidelberg: Springer-Verlag.

    +
    +
    +

    Goodfellow, I, Y Bengio, and A Courville. 2016. Deep Learning. MIT Press.

    +
    +
    +

    Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.

    +
    +
    +

    ———. 2020. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.

    +
    +
    +

    R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

    +
    +
    +

    Shmueli, G. 2010. “To Explain or to Predict?” Statistical Science 25 (3). Institute of Mathematical Statistics: 289–310.

    +
    +
    +

    Wickham, H, M Averick, J Bryan, W Chang, L McGowan, R François, G Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43).

    +
    +
    +

    Wickham, H, and G Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

    +
    +
    +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + + + + diff --git a/_book/resampling.md b/_book/resampling.md index 6ff44835..87a6a50b 100644 --- a/_book/resampling.md +++ b/_book/resampling.md @@ -1,7 +1,7 @@ -# Resampling for evaluating performance +# Resampling for evaluating performance {#resampling} Maybe inlcude some simple examples of comparing models using resampling (perhaps go full `tidyposterior`?) diff --git a/_book/search_index.json b/_book/search_index.json index ad1335cd..350a62e3 100644 --- a/_book/search_index.json +++ b/_book/search_index.json @@ -1,11 +1,11 @@ [ -["index.html", "Tidy Modeling with R Hello World", " Tidy Modeling with R Max Kuhn 2019-12-16 Hello World This is the website for Tidy Modeling with R. Its purpose is to be a guide to using a new collection of software in the R programming language that enable model building. There are few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use software to create models. The focus will be on a dialect of R called the tidyverse that is designed to be a better interface for common tasks using R. If you’ve never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the tidymodels packages. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially more complex predictive or machine learning models, can be created to work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent this from occurring but common pitfalls are discussed in the course of describing and demonstrating the software. This book is not intended to be a reference on different types of models. We suggest other resources to learn the nuances of models. A general source for information about the most common type of model, the linear model, we suggest Fox (2008). Another excellent resource for investigating and analyzing data is Wickham and Grolemund (2016). For predictive models, Kuhn and Johnson (2013) is a good resource. For pure machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive). We do not assume that readers will have had extensive experience in model building and statistics. Some statistical knowledge is required, such as: random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course. This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. The sources used to create the book can be found at github.com/topepo/TMwR. We use the bookdown package to create the website (Xie 2016). One reason that we chose this license and this technology for the book is so that we can make it completely reproducible; all of the code and data used to create it are free and publicly available. Tidy Modeling with R is currently a work in progress. As we create it, this website is updated. 
Be aware that, until it is finalized, the content and/or structure of the book may change. This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting types, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the contributing.md file. Also, be aware that this effort has a code of conduct, which can be found at code_of_conduct.md. In terms of software lifecycle, the tidymodels packages are fairly young. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive the specific versions of software that were used to produce it. The primary packages, and their versions, used to create this website are: #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os macOS Mojave 10.14.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2019-12-16 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) #> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) #> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) #> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) #> ggplot2 * 3.2.1.9000 2019-12-06 [1] local #> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) #> rlang 0.4.2.9000 2019-12-14 [1] Github (r-lib/rlang@ec7c1ed) #> tibble * 2.99.99.9010 2019-12-06 [1] Github (tidyverse/tibble@f4365f7) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library pandoc is also instrumental in creating this work. The version used here is 2.3.1. References "], -["introduction.html", "1 Introduction 1.1 Types of models 1.2 Some terminology 1.3 Where does modeling fit into the data analysis/scientific process? 1.4 Modeling is a process, not a single activity 1.5 Outline of future chapters", " 1 Introduction Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a different between two groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful. Since the start of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when a good time would be to walk the dog, ordering a product from a website, typing (and autocorrecting) a text message to a friend, and checking email. In each of these instances, there is a good chance that some type of model was used in an assistive way. In some cases, the contribution of the model might be easily perceived (“You might also be interested in purchasing product X”) while in other cases the impact was the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses avoid the discovery of cars that over-pollute. For better or worse, models are here to stay. 
Two reasons that models permeate our lives are that software exists that facilitates their creation and that data has become more easily captured and accessible. In regard to software, it is obviously critical that software produces the correct equations that represent the data. For the most part, determining mathematical correctness is possible. However, the creation of an appropriate model hinges on a few other aspects. First, it is important that it is easy to operate the software in a proper way. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, Baggerly and Coombes (2009) report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to offset the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials (Carlson 2012). If we are to expect high quality models, it is important that the software facilitate proper usage. Abrams (2003) describes an interesting principle to live by: The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks. Data analysis software should also espouse this idea. The second important aspect of model building is related to scientific methodology. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This relates to software. Whenever possible, the software should be able to protect users from committing such mistakes. Here, software should make it easy for users to do the right thing. These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. On one had, they tools should be powerful enough to create high-performance models but, one the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can can create different types of models. The software has been designed with these additional characteristics in mind. The software is based on the R programming language (R Core Team 2014). 
R has been designed especially for data analysis and modeling. It is based on the S language which was created in the 1970’s to “turn ideas into software, quickly and faithfully” (Chambers 1998) R is open-source and is provided free of charge. It is a powerful programming language that can be used for many different purposes but specializes in data analysis, modeling, and machine learning. R is easily extensible; it has a vast ecosystem of packages; these are mostly user-contributed modules that focus on a specific theme, such as modeling, visualization, and so on. One collection of packages is called the tidyverse (Wickham et al. 2019). The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Several of these design philosophies are directly related to the aspects of software described above. Within the tidyverse, there is a set of packages specifically focused on modeling and these are usually referred to as the tidymodels packages. This book is an extended software manual for conducting modeling using the tidyverse. It shows how to use a set of packages, each with its own specific purpose, together to create high-quality models. 1.1 Types of models Before proceeding, lets describe a taxa for types of models, grouped by purpose. While not exhaustive, most models fail into at least one of these categories: Descriptive Models: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually illustrate some trend or artifact in the data. For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issued on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements. An early methods for evaluating such issues where probe-level models, or PLM’s (Bolstad 2004). A statistical model would be created that accounted for the known differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure 1.1(a) shows an application of this method for two microarrays taken from Gentleman et al. (2005). The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact. Figure 1.1: Two examples of how descriptive models can be used to illustrate specific patterns. 
Another more general, and simpler example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS (Cleveland 1979). Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure 1.1(b) where a nonlinear trend is illuminated by the flexible smoother. Inferential Models: In these situations, the goal is to produce a decision for a research question or to test a specific hypothesis. The goal is to make some statement of truth regarding some predefined conjecture or idea. In many (but not all) cases, some qualitative statement is produced. For example, in a clinical trial, the goal might be to provide some confirmation that a new therapy does a better job in prolonging life than an alternative (perhaps an existing therapy or no therapy). If the clinical endpoint was related to survival or a patient, the null hypothesis might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using the traditional null hypothesis significance testing (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such an difference (which could be due to a number of reasons). What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to make such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results are highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution X, then test statistic Y can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate. One consequence of relying on formal statistical assumptions is that there tends to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision. Predictive Models: There are occasions where data are modeled in an effort to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data. A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to his/her store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit. 
For this type of model, the problem type is one of estimation rather than inference. For example, the buyer is usually not concerned with a question such as “Will I sell more than 100 copies of book X next month?” but rather “How many copies of X will customers purchase next month?” Also, depending on the context, there may not be any interest in why the predicted value is X. In other words, is more interest in the value itself than evaluating a formal hypothesis related to the data. That said, the prediction can also include measures of uncertainty. In the case of the book buyer, some sort of forecasting error might be valuable to help them decide on how many to purchase or could serve as a metric to gauge how well the prediction method worked. What are the most important factors affecting predictive models? There are many different ways that a predictive model could be created. The important factors depend on how the model was developed. For example, in some cases, a mechanistic model can be developed based on first principles and results in a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person’s body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the known parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model functions based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test. Empirically driven models are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple K-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, previous data from existing books may be available. If a 5-nearest neighbor model were used, the buyer might estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of “similar”). For this model, it is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model is no consistent with the mechanism that generated the data, the predictions would not be close to the actual values. Broader discussions of these distinctions can be found in Breiman (2001) and Shmueli (2010). Note that we have defined the type of model by how it is used rather than its mathematical qualities. An ordinary linear regression model might fall into all three classes of models, depending on how it is used: A descriptive LOESS model uses linear regression to estimate the trend in the data. 
An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression. If a simple linear regression model produces highly correct predictions, it can be used as a predictive model. However, there are more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable. There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables related to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the statistical qualities of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building are usually used to determine a minimal subset of independent variables that have a “statistically significant” relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g. 0.05). From here, the analyst typically focuses on making qualitative statements about the relative influence that the variables have on the outcome. A potential problem with this approach is that it can be dangerous when statistical significance is used as the only measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but an accuracy of 35%? Predictive performance tends to be related to how close the model’s fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model is good. 1.2 Some terminology supervise, unsupervised types of variables types of predictors The mode of the model: regression, classification 1.3 Where does modeling fit into the data analysis/scientific process? (probably need to get explicit permission to use this) Figure 1.2: The data science process (from R for Data Science). 1.4 Modeling is a process, not a single activity (probably need to get explicit permission to use this too) Figure 1.3: A schematic for the typical modeling process. 1.5 Outline of future chapters References "], -["a-tidyverse-primer.html", "2 A tidyverse primer 2.1 Principles 2.2 Code 2.3 Why tidiness is important for modeling 2.4 Some additional tidy principals for modeling.", " 2 A tidyverse primer 2.1 Principles What does it mean to be “tidy” (distinguish tidy data vs tidy interfaces etc. ) 2.2 Code Things that I think that we’ll need summaries of: strategies: variable specification, pipes (with data or other first arguments), conflicts and using namespaces, splicing, non-standard evaluation, tactics: select, bind_cols, tidyselect, slice, !! and !!!, ... 
for passing arguments, tibbles, joins, nest/unnest, group_by 2.3 Why tidiness is important for modeling 2.4 Some additional tidy principals for modeling. "], -["a-tale-of-two-models.html", "3 A tale of two models", " 3 A tale of two models (tentative title) Perhaps show an example of a predictive model and contrast it with another that is inferential. Chicago data from FES: one predictive model and one to test if there is a difference in ridership with the Bears are at home. what do we care about for each? how accurate is the inferential model? Perhaps look at the tscount package to deal with the autoregressive potential. "], -["spending-our-data.html", "4 Spending our data", " 4 Spending our data General data splitting Re-emphasize roles or different data sets and good/bad ways of doing things. Validation sets. What we do differently with a lot of data. Allude to resampling. "], -["how-good-is-our-model.html", "5 How good is our model?", " 5 How good is our model? (or how well does our model work? Superman does good; a model can work well) Measuring performance Don’t revaluate the training set Statistical significance as a measure of effectiveness. "], +["index.html", "Tidy Modeling with R Hello World", " Tidy Modeling with R Max Kuhn 2019-12-21 Hello World This is the website for Tidy Modeling with R. Its purpose is to be a guide to using a new collection of software in the R programming language that enables model building. There are a few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use our software to create models. The focus will be on a dialect of R called the tidyverse that is designed to be a better interface for common tasks using R. If you’ve never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the tidymodels packages. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may also fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent these and other pitfalls. This book is not intended to be a reference on different types of these techniques. We suggest other resources to learn the nuances of models. As a general source of information about the most common type of model, the linear model, we suggest Fox (2008). Another excellent resource for investigating and analyzing data is Wickham and Grolemund (2016). For predictive models, Kuhn and Johnson (2013) is a good resource. For pure machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive). We do not assume that readers will have had extensive experience in model building and statistics. Some statistical knowledge is required, such as: random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course. Tidy Modeling with R is currently a work in progress.
As we create it, this website is updated. Be aware that, until it is finalized, the content and/or structure of the book may change. This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting typos, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the contributing.md file. Also, be aware that this effort has a code of conduct, which can be found at code_of_conduct.md. In terms of software lifecycle, the tidymodels packages are fairly young. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive the specific versions of software that were used to produce it. The primary packages, and their versions, used to create this website are: #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os macOS Mojave 10.14.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2019-12-21 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0) #> bookdown * 0.14 2019-10-01 [1] CRAN (R 3.6.0) #> broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0) #> dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) #> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0) #> purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) #> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0) #> tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library pandoc is also instrumental in creating this work. The version used here is 2.3.1. References "], +["introduction.html", "1 Introduction 1.1 Types of models 1.2 Some terminology 1.3 How does modeling fit into the data analysis/scientific process? 1.4 Where does the model begin and end? 1.5 Outline of future chapters", " 1 Introduction Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a difference between several groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful. Since the start of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when a good time would be to walk the dog, ordering a product from a website, typing (and autocorrecting) a text message to a friend, and checking email. In each of these instances, there is a good chance that some type of model was used in an assistive way. In some cases, the contribution of the model might be easily perceived (“You might also be interested in purchasing product X”) while in other cases the impact was the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.
Two reasons that models permeate our lives are that software exists that facilitates their creation and that data has become more easily captured and accessible. In regard to software, it is obviously critical that software produces the correct equations that represent the data. For the most part, determining mathematical correctness is possible. However, the creation of an appropriate model hinges on a few other aspects. First, it is important that it is easy to operate the software in a proper way. For example, the user interface should not be so arcane that the user would not know that they have specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, Baggerly and Coombes (2009) report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to offset the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials (Carlson 2012). If we are to expect high quality models, it is important that the software facilitate proper usage. Abrams (2003) describes an interesting principle to live by: The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks. Data analysis software should also espouse this idea. The second important aspect of model building is related to scientific methodology. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns that they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex, it has also become easier to commit latent errors. This also relates to programming. Whenever possible, the software should be able to protect users from committing such mistakes. Software should make it easy for users to do the right thing. These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be robust to the experience of the user. On one hand, the tools should be powerful enough to create high-performance models but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can create different types of models. The software has been designed with these additional characteristics in mind. The software is based on the R programming language (R Core Team 2014).
R has been designed especially for data analysis and modeling. It is based on the S language which was created in the 1970s to “turn ideas into software, quickly and faithfully” (Chambers 1998). R is open-source and is provided free of charge. It is a powerful programming language that can be used for many different purposes but specializes in data analysis, modeling, and machine learning. R is easily extensible; it has a vast ecosystem of packages; these are mostly user-contributed modules that focus on a specific theme, such as modeling, visualization, and so on. One collection of packages is called the tidyverse (Wickham et al. 2019). The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Several of these design philosophies are directly related to the aspects of software described above. Within the tidyverse, there is a set of packages specifically focused on modeling and these are usually referred to as the tidymodels packages. This book is an extended software manual for conducting modeling using the tidyverse. It shows how to use a set of packages, each with its own specific purpose, together to create high-quality models. 1.1 Types of models Before proceeding, let’s describe a taxonomy of models, grouped by purpose. While not exhaustive, most models fall into at least one of these categories: Descriptive Models: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data. For example, large scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned. Early methods for evaluating such issues were probe-level models, or PLMs (Bolstad 2004). A statistical model would be created that accounted for the known differences for the data from the chip, such as the RNA sequence, the type of sequence, and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure 1.1(a) shows an application of this method for two microarrays taken from Gentleman et al. (2005). The images show two different colors; red is where the signal intensity was larger than the model expected, while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact. Figure 1.1: Two examples of how descriptive models can be used to illustrate specific patterns.
Another example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS (Cleveland 1979). Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of smoothers are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure 1.1(b) where a nonlinear trend is illuminated by the flexible smoother. Inferential Models: In these situations, the goal is to produce a decision for a research question or to test a specific hypothesis. The goal is to make some statement of truth regarding some predefined conjecture or idea. In many (but not all) cases, some qualitative statement is produced. For example, in a clinical trial, the goal might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative (e.g., an existing therapy or no treatment). If the clinical endpoint was related to survival of a patient, the null hypothesis might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using traditional null hypothesis significance testing (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such a difference (which could be due to a number of reasons). What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results is highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution X, then test statistic Y can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate. One aspect of inferential analyses is that there tends to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision. Predictive Models: There are occasions where data are modeled in an effort to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data. A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to his/her store for the next month. An over-prediction wastes space and money due to excess books. If the prediction is smaller than it should be, there is opportunity loss and less profit. For this type of model, the problem type is one of estimation rather than inference.
For example, the buyer is usually not concerned with a question such as “Will I sell more than 100 copies of book X next month?” but rather “How many copies of X will customers purchase next month?” Also, depending on the context, there may not be any interest in why the predicted value is X. In other words, there is more interest in the value itself than in evaluating a formal hypothesis related to the data. That said, the prediction can also include measures of uncertainty. In the case of the book buyer, some sort of forecasting error might be valuable to help them decide on how many to purchase or could serve as a metric to gauge how well the prediction method worked. What are the most important factors affecting predictive models? There are many different ways that a predictive model could be created. The important factors depend on how the model was developed. For example, a mechanistic model could be developed based on first principles to produce a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person’s body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the unknown parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test. Empirically driven models are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple K-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of “similar”). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was not a good choice, the predictions would not be close to the actual values. Broader discussions of these distinctions can be found in Breiman (2001) and Shmueli (2010). Note that we have defined the type of model by how it is used rather than its mathematical qualities. An ordinary linear regression model might fall into all three classes of models, depending on how it is used: A descriptive smoother, similar to LOESS, called restricted smoothing splines (Durrleman and Simon 1989) can be used to describe trends in data using ordinary linear regression with specialized terms.
An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression. If a simple linear regression model produces highly accurate predictions, it can be used as a predictive model. However, there are many more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable. There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables relate to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the statistical qualities of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building are usually used to determine a minimal subset of independent variables that have a “statistically significant” relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g. 0.05). From here, the analyst typically focuses on making qualitative statements about the relative influence that the variables have on the outcome. This approach can be dangerous when statistical significance is used as the only measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but a dismal accuracy? Predictive performance tends to be related to how close the model’s fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model should be used. This may seem intuitively obvious, but is often ignored in real-world data analysis. 1.2 Some terminology Before proceeding, some additional terminology related to modeling, data, and other quantities should be outlined. These descriptions are not exhaustive. First, many models can be categorized as being supervised or unsupervised. Unsupervised models are those that seek patterns, clusters, or other characteristics of the data but lack an outcome variable (i.e., a dependent variable). For example, principal component analysis (PCA), clustering, and autoencoders are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category. Within supervised models, the two main sub-categories are: Regression, where a numerical outcome is being predicted. Classification, where the outcome is an ordered or unordered set of qualitative values. These are imperfect definitions and do not account for all possible types of models.
In coming chapters, we refer to these types of supervised techniques as the model mode. In terms of data, the main species are quantitative and qualitative. Examples of the former are real numbers and integers. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be placed on a numeric scale. Different variables can have different roles in an analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the value being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms outcomes and predictors are used most frequently here. 1.3 How does modeling fit into the data analysis/scientific process? In what circumstances are models created? Are there steps that precede such an undertaking? Is it the first step in data analysis? There are always a few critical phases of data analysis that come before modeling. First, there is the chronically underestimated process of cleaning the data. No matter the circumstances, the data should be investigated to make sure that it is well understood, applicable to the project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances). Data cleaning can also overlap with the second phase of understanding the data, often referred to as exploratory data analysis (EDA). There should be knowledge of how the different variables relate to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is “how did I come by these data?” This question can help understand how the data at hand have been sampled or filtered and if these operations were appropriate. For example, when merging database tables, a join may go awry that could accidentally eliminate one or more sub-populations of samples. Another good idea would be to ask if the data are relevant. For example, to predict whether patients have Alzheimer’s disease or not, it would be unwise to have a data set containing subjects with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model may simply predict who the oldest patients are. Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one performance metric should be identified with realistic goals of what can be achieved. Common statistical metrics are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighed. It is also important that the metric be germane (i.e., alignment with the broader data analysis goals is critical). Figure 1.2: The data science process (from R for Data Science). The process of investigating the data may not be simple. Wickham and Grolemund (2016) contains an excellent illustration of the general data analysis process, reproduced in Figure 1.2. Data ingestion and cleaning are shown as the initial steps. When the analytical steps commence, they are a heuristic process; we cannot pre-determine how long they may take. The cycle of analysis, modeling, and visualization often requires multiple iterations.
Figure 1.3: A schematic for the typical modeling process (from Feature Engineering and Selection). This iterative process is especially true for modeling. Figure 1.3 originates from Kuhn and Johnson (2020) and is meant to emulate the typical path to determining an appropriate model. The general phases are: Exploratory data analysis (EDA) and Quantitative Analysis (blue bars). Initially there is a back and forth between numerical analysis and visualization of the data (represented in Figure 1.2) where different discoveries lead to more questions and data analysis “side-quests” to gain more understanding. Feature engineering (green bars). This understanding is translated into the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features (using the ratio of two predictors). Model tuning and selection (red and gray bars). A variety of models are generated and their performance is compared. Some models require parameter tuning, where structural parameters must be specified or optimized. After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize and document the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose. 1.4 Where does the model begin and end? So far, we have defined the model to be a structural equation that relates some predictors to one or more outcomes. Let’s consider ordinary linear regression as a simple and well known example. The outcome data are denoted as \\(y_i\\), where there are \\(i = 1 \\ldots n\\) samples in the data set. Suppose that there are \\(p\\) predictors \\(x_{i1}, \\ldots, x_{ip}\\) that are used to predict the outcome. Linear regression produces a model equation of \\[ \\hat{y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1x_{i1} + \\ldots + \\hat{\\beta}_px_{ip} \\] While this is a linear model, it is only linear in the parameters. The predictors could be nonlinear terms (such as \\(\\log(x_i)\\)). The conventional way of thinking is that the modeling process is encapsulated by the model. For many data sets that are straightforward in nature, this is the case. However, there are a variety of choices and additional steps that often occur before the data are ready to be added to the model. Some examples: While our model has \\(p\\) predictors, it is common to start with more than this number of candidate predictors. Through exploratory data analysis or previous experience, some of the predictors may be excluded from the analysis. In other cases, some feature selection algorithm may have been used to make a data-driven choice for the minimum predictor set to be used in the model. There are times when the value of an important predictor is not known. Rather than eliminating this value from the data set, it could be imputed using other values in the data. For example, if \\(x_1\\) were missing but was correlated with predictors \\(x_2\\) and \\(x_3\\), an imputation method could estimate the missing \\(x_1\\) observation from the values of \\(x_2\\) and \\(x_3\\). As previously mentioned, it may be beneficial to transform the scale of a predictor.
If there is no a priori information on what the new scale should be, it might be estimated using a transformation technique. Here, the existing data would be used to statistically estimate the proper scale that optimizes some criterion. Other transformations, such as the previously mentioned PCA, take groups of predictors and transform them into new features that are used as the predictors. While the examples above are related to steps that occur before the model, there may also be operations that occur after the model is created. For example, when a classification model is created where the outcome is binary (e.g., event and non-event), it is customary to use a 50% probability cutoff to create a discrete class prediction (also known as a “hard prediction”). For example, a classification model might estimate that the probability of an event was 62%. Using the typical default, the hard prediction would be event. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to raise the cutoff from 50% to some greater value. This increases the level of evidence required to call a new sample an event. While this reduces the true positive rate (which is bad), it may have a more profound effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works even though it is not contained in the model fitting step. These examples have a common characteristic of requiring data for derivations that alter the raw data values or the predictions generated by the model. It is very important to focus on the broader model fitting process instead of the specific model being used to estimate parameters. This would include any pre-processing steps, the model fit itself, as well as potential post-processing activities. In this text, this will be referred to as the model workflow and would include any data-driven activities that are used to produce a final model equation. This will come into play when topics such as resampling (Chapter 8) and model tuning are discussed. Chapter 7 describes software for creating a model workflow. 1.5 Outline of future chapters The first order of business is to introduce (or review) the ideas and syntax of the tidyverse in Chapter 2. In this chapter, we also summarize the unmet needs for modeling when using R. This provides good motivation for why model-specific tidyverse techniques are needed. This chapter also outlines some additional principles related to these challenges. Chapter 3 shows two different data analyses for the same data where one is focused on prediction and the other is for inference. This should illustrate the challenges for each approach and what issues are most relevant for each. References "], +["tidyverse-primer.html", "2 A tidyverse primer 2.1 Principles 2.2 Code 2.3 Modeling via base R 2.4 Why tidiness is important for modeling 2.5 Some additional tidy principles for modeling.", " 2 A tidyverse primer 2.1 Principles What does it mean to be “tidy” (distinguish tidy data vs. tidy interfaces, etc.) 2.2 Code Things that I think that we’ll need summaries of: strategies: variable specification, pipes (with data or other first arguments), conflicts and using namespaces, splicing, non-standard evaluation, tactics: select, bind_cols, tidyselect, slice, !! and !!!, ...
for passing arguments, tibbles, joins, nest/unnest, group_by 2.3 Modeling via base R white book turning point base R conventions, basic usage, etc. 2.4 Why tidiness is important for modeling 2.5 Some additional tidy principles for modeling. "], +["two-models.html", "3 A tale of two models", " 3 A tale of two models (tentative title) Perhaps show an example of a predictive model and contrast it with another that is inferential. Chicago data from FES: one predictive model and one to test if there is a difference in ridership when the Bears are at home. What do we care about for each? How accurate is the inferential model? Perhaps look at the tscount package to deal with the autoregressive potential. "], +["data-spending.html", "4 Spending our data", " 4 Spending our data General data splitting Re-emphasize roles of different data sets and good/bad ways of doing things. Validation sets. What we do differently with a lot of data. Allude to resampling. "], +["model-metrics.html", "5 How good is our model?", " 5 How good is our model? (or how well does our model work? Superman does good; a model can work well) Measuring performance Don’t re-evaluate the training set Statistical significance as a measure of effectiveness. "], ["feature-engineering.html", "6 Feature engineering", " 6 Feature engineering Purpose(s) of these activities. Why do we do this? Different representations of the same data Imputation; transformations; (unsup) removal; projection; encodings; "], +["workflows.html", "7 A model workflow", " 7 A model workflow aka modeling process or model pipeline How to encapsulate the pre-processing and model objects/activities Treat them as a single unit for good methodology and convenience. "], +["resampling.html", "8 Resampling for evaluating performance", " 8 Resampling for evaluating performance Maybe include some simple examples of comparing models using resampling (perhaps go full tidyposterior?) "] ] diff --git a/_book/spending-our-data.html b/_book/spending-our-data.html index 209b2590..cbbb3074 100644 --- a/_book/spending-our-data.html +++ b/_book/spending-our-data.html @@ -24,7 +24,7 @@ - + diff --git a/_book/the-model-workflow.md b/_book/the-model-workflow.md index 25ff7c1b..b531de72 100644 --- a/_book/the-model-workflow.md +++ b/_book/the-model-workflow.md @@ -1,7 +1,7 @@ -# A model workflow +# A model workflow {#workflows} aka modeling process or model pipeline diff --git a/_book/tidyverse-primer.html b/_book/tidyverse-primer.html new file mode 100644 index 00000000..7d476747 --- /dev/null +++ b/_book/tidyverse-primer.html @@ -0,0 +1,195 @@ + + + + + + + 2 A tidyverse primer | Tidy Modeling with R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    +
    +

    2 A tidyverse primer

    +
    +

    2.1 Principles

    +

What does it mean to be “tidy” (distinguish tidy data vs. tidy interfaces, etc.)
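As a placeholder for that discussion, a minimal sketch of the tidy data idea; the plant_height tibble is invented for illustration, and tidyr 1.0 or later is assumed.

```r
library(tidyr)

# A "wide" layout: readable, but the year variable is spread across columns
plant_height <- tibble::tribble(
  ~plant, ~yr_2018, ~yr_2019,
  "a",         1.2,      1.8,
  "b",         0.9,      1.5
)

# The tidy layout: one row per observation, one column per variable
pivot_longer(plant_height,
             cols = c(yr_2018, yr_2019),
             names_to = "year",
             values_to = "height")
```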

    +
    +
    +

    2.2 Code

    +

    Things that I think that we’ll need summaries of:

    +
      +
• strategies: variable specification, pipes (with data or other first arguments), conflicts and using namespaces, splicing, non-standard evaluation

    • +
• tactics: select, bind_cols, tidyselect, slice, !! and !!!, ... for passing arguments, tibbles, joins, nest/unnest, group_by (a rough sketch of a few of these appears after this list)

    • +
    +
    +
    +
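A rough sketch of a few of these tactics, using the built-in mtcars data; dplyr and tidyr are assumed, and nothing about the examples is final.

```r
library(dplyr)
library(tidyr)

# select(), group_by(), and summarize() chained with the pipe
mtcars %>%
  as_tibble(rownames = "car") %>%   # tibbles keep row names as a real column
  select(car, cyl, wt, mpg) %>%     # tidyselect semantics for columns
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# nest() collapses each group's rows into a list-column of tibbles
mtcars %>%
  as_tibble(rownames = "car") %>%
  group_by(cyl) %>%
  nest()
```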

    2.3 Modeling via base R

    +

    white book turning point

    +

    base R conventions, basic usage, etc.
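For orientation, a sketch of those conventions (the formula interface that dates to the S “white book”, Statistical Models in S), using the built-in mtcars data:

```r
# Formula interface: outcome on the left, predictors on the right
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)                          # coefficients, p-values, R-squared
coef(fit)                             # extractor functions are the interface
predict(fit, newdata = head(mtcars))  # predictions for new data
```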

    +
    +
    +

    2.4 Why tidiness is important for modeling

    +
    +
    +

2.5 Some additional tidy principles for modeling.

    + +
    +
    +
    + +
    +
    +
    + + +
    +
+ + + + + + + + + + + + + + + + diff --git a/_book/tidyverse.md b/_book/tidyverse.md index 6b912521..f5fc4840 100644 --- a/_book/tidyverse.md +++ b/_book/tidyverse.md @@ -1,7 +1,7 @@ -# A tidyverse primer +# A tidyverse primer {#tidyverse-primer} ## Principles @@ -16,8 +16,17 @@ Things that I think that we'll need summaries of: * tactics: `select`, `bind_cols`, `tidyselect`, `slice`, `!!` and `!!!`, `...` for passing arguments, tibbles, joins, `nest`/`unnest`, `group_by` + +## Modeling via base R + +white book turning point + +base R conventions, basic usage, etc. + ## Why tidiness is important for modeling + + ## Some additional tidy principles for modeling. diff --git a/_book/two-models.html b/_book/two-models.html new file mode 100644 index 00000000..1fdc8312 --- /dev/null +++ b/_book/two-models.html @@ -0,0 +1,175 @@ + + + + + + + 3 A tale of two models | Tidy Modeling with R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    +
    +

    3 A tale of two models

    +

    (tentative title)

    +

    Perhaps show an example of a predictive model and contrast it with another that is inferential.

    +

    Chicago data from FES: one predictive model and one to test if there is a difference in ridership with the Bears are at home. what do we care about for each? how accurate is the inferential model? Perhaps look at the tscount package to deal with the autoregressive potential.

    + +
    +
    + +
    +
    +
    + + +
    +
diff --git a/a-tale-of-two-models.Rmd b/a-tale-of-two-models.Rmd
index 2c5d8408..c14d0100 100644
--- a/a-tale-of-two-models.Rmd
+++ b/a-tale-of-two-models.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/tale-")
-# A tale of two models
+# A tale of two models {#two-models}

 (tentative title)
diff --git a/data-spending.Rmd b/data-spending.Rmd
index 3377d119..800a71f5 100644
--- a/data-spending.Rmd
+++ b/data-spending.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/spending-")
-# Spending our data
+# Spending our data {#data-spending}

 General data splitting
diff --git a/how-good-is-our-model.Rmd b/how-good-is-our-model.Rmd
index 0d5c2c2d..8bb2c713 100644
--- a/how-good-is-our-model.Rmd
+++ b/how-good-is-our-model.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/performance-")
-# How good is our model?
+# How good is our model? {#model-metrics}

 (or how well does our model work? Superman does good; a model can work well)
diff --git a/index.Rmd b/index.Rmd
index 5d1c3770..60f15efc 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -15,17 +15,16 @@ colorlinks: yes

 # Hello World {-}

-This is the website for _Tidy Modeling with R_. Its purpose is to be a guide to using a new collection of software in the R programming language that enable model building. There are few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use software to create models. The focus will be on a dialect of R called _the tidyverse_ that is designed to be a better interface for common tasks using R. If you've never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models. The tools used to do this are referred to as the _tidymodels packages_. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially more complex predictive or machine learning models, can be created to work very well on the data at hand but may fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent this from occurring but common pitfalls are discussed in the course of describing and demonstrating the software.
+This is the website for _Tidy Modeling with R_. Its purpose is to be a guide to using a new collection of software in the R programming language that enables model building. There are a few goals, depending on your background. First, if you are new to modeling and R, we hope to provide an introduction on how to use our software to create models. The focus will be on a dialect of R called _the tidyverse_ that is designed to be a better interface for common tasks using R. If you've never heard of the tidyverse, there is a chapter that provides a solid introduction. The second (and primary) goal is to demonstrate how the tidyverse can be used to produce high quality models.
The tools used to do this are referred to as the _tidymodels packages_. The third goal is to use the tidymodels packages to encourage good methodology and statistical practice. Many models, especially complex predictive or machine learning models, can work very well on the data at hand but may also fail when exposed to new data. Often, this issue is due to poor choices that were made during the development and/or selection of the models. Whenever possible, our software attempts to prevent these and other pitfalls.

-This book is not intended to be a reference on different types of models. We suggest other resources to learn the nuances of models. A general source for information about the most common type of model, the _linear model_, we suggest @fox08. Another excellent resource for investigating and analyzing data is @wickham2016. For predictive models, @apm is a good resource. For pure machine learning methods, @Goodfellow is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive).
+This book is not intended to be a reference on these different types of techniques. We suggest other resources to learn the nuances of models. For general information about the most common type of model, the _linear model_, we suggest @fox08. Another excellent resource for investigating and analyzing data is @wickham2016. For predictive models, @apm is a good resource. For pure machine learning methods, @Goodfellow is an excellent (but formal) source of information. In some cases, we describe some models that are used in this text but in a way that is less mathematical (and hopefully more intuitive).

 We do not assume that readers will have had extensive experience in model building and statistics. Some statistical knowledge is required, such as: random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course.

-This website is __free to use__, and is licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0](http://creativecommons.org/licenses/by-nc-nd/3.0/us/) License. The sources used to create the book can be found at [`github.com/topepo/TMwR`](https://github.com/topepo/TMwR). We use the [`bookdown`](https://bookdown.org/) package to create the website [@bookdown]. One reason that we chose this license and this technology for the book is so that we can make it _completely reproducible_; all of the code and data used to create it are free and publicly available. _Tidy Modeling with R_ is currently a work in progress. As we create it, this website is updated. Be aware that, until it is finalized, the content and/or structure of the book may change.

-This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting types, grammar, and other aspects of our work that could use improvement. Instructions for making contributions can be found in the [`contributing.md`](https://github.com/topepo/TMwR/blob/master/contributing.md) file. Also, be aware that this effort has a code of conduct, which can be found at [`code_of_conduct.md`](https://github.com/topepo/TMwR/blob/master/code_of_conduct.md).
+This openness also allows users to contribute if they wish. Most often, this comes in the form of correcting typos, grammar, and other aspects of our work that could use improvement.
Instructions for making contributions can be found in the [`contributing.md`](https://github.com/topepo/TMwR/blob/master/contributing.md) file. Also, be aware that this effort has a code of conduct, which can be found at [`code_of_conduct.md`](https://github.com/topepo/TMwR/blob/master/code_of_conduct.md).

 In terms of software lifecycle, the tidymodels packages are fairly young. We will do our best to maintain backwards compatibility and, at the completion of this work, will archive the specific versions of software that were used to produce it. The primary packages, and their versions, used to create this website are:
diff --git a/introduction.Rmd b/introduction.Rmd
index 1c2663e7..4f743b8f 100644
--- a/introduction.Rmd
+++ b/introduction.Rmd
@@ -10,21 +10,23 @@ transparent_theme()

 # Introduction

-Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a different between two groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.
+Models are mathematical tools that create equations that are intended to mimic the data given to them. These equations can be used for various purposes, such as: predicting future events, determining if there is a difference between several groups, as an aid to a map-based visualization, discovering novel patterns in the data that could be further investigated, and so on. Their utility hinges on their ability to be reductive; the primary influences in the data can be captured mathematically in a way that is useful.

 Since the start of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when a good time would be to walk the dog, ordering a product from a website, typing (and autocorrecting) a text message to a friend, and checking email. In each of these instances, there is a good chance that some type of model was used in an assistive way. In some cases, the contribution of the model might be easily perceived ("You might also be interested in purchasing product _X_") while in other cases the impact was the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that over-pollute. For better or worse, models are here to stay.

 Two reasons that models permeate our lives are that software exists that facilitates their creation and that data has become more easily captured and accessible. In regard to software, it is obviously critical that software produces the _correct_ equations that represent the data. For the most part, determining mathematical correctness is possible. However, the creation of an appropriate model hinges on a few other aspects.

-First, it is important that it is easy to operate the software in a _proper way_. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information.
As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, @baggerly2009 report myriad problems in the data analysis in a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user-interface of the software was poor enough that it was easy to _offset_ the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials [@Carlson2012]. If we are to expect high quality models, it is important that the software facilitate proper usage. @abrams2003 describes an interesting principle to live by:
+First, it is important that it is easy to operate the software in a _proper way_. For example, the user interface should not be so arcane that the user would not know that they have inappropriately specified the wrong information. As an analogy, one might have a high quality kitchen measuring cup capable of great precision but if the chef adds a cup of salt instead of a cup of sugar, the results would be unpalatable. As a specific example of this issue, @baggerly2009 report myriad problems in the data analysis of a high profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The user interface of the software was poor enough that it was easy to _offset_ the column names of the data from the actual data columns. In the analysis of the data, this resulted in the wrong genes being identified as important for treating cancer patients. This, and many other issues, led to the stoppage of numerous clinical trials [@Carlson2012].
+
+If we are to expect high quality models, it is important that the software facilitate proper usage. @abrams2003 describes an interesting principle to live by:

 > The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.

 Data analysis software should also espouse this idea.

-The second important aspect of model building is related to _scientific methodology_. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex it has also become easier to commit latent errors. This relates to software. Whenever possible, the software should be able to protect users from committing such mistakes. Here, software should make it easy for users to do the right thing.
+The second important aspect of model building is related to _scientific methodology_. For models that are used to make complex predictions, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions.
Many machine learning models are so adept at finding patterns, they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these types of methodological errors are insidious in that the issue might be undetectable until a later time when new data that contain the true result are obtained. In short, as our models become more powerful and complex, it has also become easier to commit latent errors. This also relates to programming. Whenever possible, the software should be able to protect users from committing such mistakes. Software should make it easy for users to do the right thing.

-These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be _robust_ to the experience of the user. On one had, they tools should be powerful enough to create high-performance models but, one the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can can create different types of models. The software has been designed with these additional characteristics in mind.
+These two aspects of model creation are crucial. Since tools for creating models are easily obtained and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, their backgrounds will vary. It is important that their tools be _robust_ to the experience of the user. On one hand, the tools should be powerful enough to create high-performance models but, on the other hand, should be easy to use in an appropriate way. This book describes a suite of software that can create different types of models. The software has been designed with these additional characteristics in mind.

 The software is based on the R programming language [@baseR]. R has been designed especially for data analysis and modeling. It is based on the _S language_ which was created in the 1970's to
@@ -38,9 +40,11 @@ One collection of packages is called the **_tidyverse_** [@tidyverse]. The tidyv

 Before proceeding, let's describe a taxonomy for types of models, grouped by _purpose_. While not exhaustive, most models fall into _at least_ one of these categories:

-**Descriptive Models**: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually illustrate some trend or artifact in the data.
+**Descriptive Models**: The purpose here would be to model the data so that it can be used to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.
+
+For example, large scale measurements of RNA have been possible for some time using _microarrays_. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issues on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements when scanned.

-For example, large scale measurements of RNA have been possible for some time using _microarrays_. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip would be able to assess a measure of signal based on the abundance of a specific RNA sequence. The chip would contain thousands (or more) outcomes, each a quantification of the RNA related to some biological process. However, there could be quality issued on the chip that might lead to poor results. A fingerprint accidentally left on a portion of the chip might cause inaccurate measurements. An early methods for evaluating such issues where _probe-level models_, or PLM's [@bolstad2004]. A statistical model would be created that accounted for the _known_ differences for the data from the chip, such as the RNA sequence, the type of sequence and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g. a fingerprint) and a possible solution (wipe the chip off and rescan). Figure \@ref(fig:descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.
+An early method for evaluating such issues was the _probe-level model_, or PLM [@bolstad2004]. A statistical model would be created that accounted for the _known_ differences in the data from the chip, such as the RNA sequence, the type of sequence, and so on. If there were other, unwanted factors in the data, these would be contained in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When an issue did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe the chip off and rescan). Figure \@ref(fig:descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different colors; red is where the signal intensity was larger than the model expects while the blue color shows lower than expected values. The left-hand panel demonstrates a fairly random pattern while the right-hand panel shows some type of unwanted artifact.

 ```{r descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns.", out.width = '80%', fig.height = 8, warning = FALSE, message = FALSE}
 load("RData/plm_resids.RData")
@@ -80,16 +84,16 @@ ames_plot <-
 grid.arrange(plm_plot, ames_plot, ncol = 1)
 ```

-Another more general, and simpler example of a descriptive model is the locally estimated scatterplot smoothing model, more commonly known as LOESS [@cleveland1979]. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of _smoothers_ are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure \@ref(fig:descr-examples)(b) where a nonlinear trend is illuminated by the flexible smoother.
+Another example of a descriptive model is the _locally estimated scatterplot smoothing_ model, more commonly known as LOESS [@cleveland1979]. Here, a smooth and flexible regression model is fit to a data set, usually with a single independent variable, and the fitted regression line is used to elucidate some trend in the data. These types of _smoothers_ are used to discover potential ways to represent a variable in a model. This is demonstrated in Figure \@ref(fig:descr-examples)(b) where a nonlinear trend is illuminated by the flexible smoother.
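To make this concrete, here is a minimal base R sketch of a LOESS smoother. The simulated data, seed, and variable names are our own hypothetical illustration, not one of the book's data sets:

```r
# Simulate data with a nonlinear trend plus noise (hypothetical example)
set.seed(101)
x <- runif(200, min = 0, max = 10)
y <- sin(x) + x / 3 + rnorm(200, sd = 0.25)

# Fit a smooth, flexible local regression; `span` controls the smoothness
smoother <- loess(y ~ x, span = 0.5)

# The fitted curve elucidates the trend without a pre-specified equation
grid <- data.frame(x = seq(0, 10, length.out = 100))
grid$trend <- predict(smoother, newdata = grid)
plot(x, y, col = "grey50")
lines(grid$x, grid$trend, lwd = 2)
```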
**Inferential Models**: In these situations, the goal is to produce a decision for a research question or to test a specific hypothesis. The aim is to make some statement of truth regarding some predefined conjecture or idea. In many (but not all) cases, some qualitative statement is produced.

-For example, in a clinical trial, the goal might be to provide some confirmation that a new therapy does a better job in prolonging life than an alternative (perhaps an existing therapy or no therapy). If the clinical endpoint was related to survival or a patient, the _null hypothesis_ might be that the two therapeutic groups have equal median survival times with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using the traditional *null hypothesis significance testing* (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy does help patients live longer. If not, the conclusion is that there is a failure to show such an difference (which could be due to a number of reasons).
+For example, in a clinical trial, the goal might be to provide confirmation that a new therapy does a better job in prolonging life than an alternative (e.g., an existing therapy or no treatment). If the clinical endpoint was related to the survival of a patient, the _null hypothesis_ might be that the two therapeutic groups have equal median survival times, with the alternative hypothesis being that the new therapy has higher median survival. If this trial were evaluated using traditional *null hypothesis significance testing* (NHST), a p-value would be produced using some pre-defined methodology based on a set of assumptions for the data. Small values of the p-value indicate that there is evidence that the new therapy helps patients live longer. If not, the conclusion is that there is a failure to show such a difference (which could be due to a number of reasons).

-What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to make such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results are highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution _X_, then test statistic _Y_ can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.
+What are the important aspects of this type of analysis? Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability. Generally, to compute such a quantity, formal assumptions must be made about the data and the underlying processes that generated the data. The quality of the statistical results is highly dependent on these pre-defined assumptions as well as how much the observed data appear to agree with them. The most critical factors here are theoretical in nature: if my data were independent and follow distribution _X_, then test statistic _Y_ can be used to produce a p-value. Otherwise, the resulting p-value might be inaccurate.

-One consequence of relying on formal statistical assumptions is that there _tends_ to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data were generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.
+One aspect of inferential analyses is that there _tends_ to be a longer feedback loop that could help understand how well the data fit the assumptions. In our clinical trial example, if statistical (and clinical) significance indicated that the new therapy should be available for patients to use, it may be years before it is used in the field and enough data are generated to have an independent assessment of whether the original statistical analysis led to the appropriate decision.
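As a minimal illustration of this inferential workflow, consider a hypothetical two-arm comparison in base R. The group sizes, distributions, and parameter values are simulated assumptions, not the clinical example above:

```r
# Hypothetical outcomes for a control arm and a new therapy
set.seed(30)
control <- rexp(100, rate = 1 / 20)   # mean outcome around 20 months
therapy <- rexp(100, rate = 1 / 26)   # mean outcome around 26 months

# Test the null hypothesis of no improvement for the new therapy
wilcox.test(therapy, control, alternative = "greater")

# The resulting p-value is only as trustworthy as the pre-defined
# assumptions (e.g., independent observations within and between groups)
```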
**Predictive Models**: There are occasions where data are modeled in an effort to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.
@@ -99,51 +103,57 @@ For this type of model, the problem type is one of _estimation_ rather than infe

 What are the most important factors affecting predictive models? There are many different ways that a predictive model could be created. The important factors depend on how the model was developed.

-For example, in some cases, a _mechanistic model_ can be developed based on first principles and results in a model equation that is dependent on assumptions. For example, when predicting the amount of a drug that is in a person's body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the known parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model functions based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.
+For example, a _mechanistic model_ could be developed based on first principles to produce a model equation that is dependent on assumptions. For instance, when predicting the amount of a drug that is in a person's body at a certain time, some formal assumptions are made on how the drug is administered, absorbed, metabolized, and eliminated. Based on this, a set of differential equations can be used to derive a specific model equation. Data are used to estimate the unknown parameters of this equation and predictions are made after parameter estimation. Like inferential models, mechanistic predictive models greatly depend on the assumptions that define their model equations. However, unlike inferential models, it is easy to make data-driven statements about how well the model performs based on how well it predicts the existing data. Here the feedback loop for the modeler is much faster than it would be for a hypothesis test.

-_Empirically driven models_ are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple _K_-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, previous data from existing books may be available. If a 5-nearest neighbor model were used, the buyer might estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of "similar"). For this model, it is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model is no consistent with the mechanism that generated the data, the predictions would not be close to the actual values.
+_Empirically driven models_ are those that have more vague assumptions that are used to create their model equations. These models tend to fall more into the machine learning category. A good example is the simple _K_-nearest neighbor (KNN) model. Given a set of reference data, a new sample is predicted by using the values of the most similar data in the reference set. For example, if a book buyer needs a prediction for a new book, historical data from existing books may be available. A 5-nearest neighbor model would estimate the amount of the new book to purchase based on the sales numbers of the five books that are most similar to the new one (for some definition of "similar"). This model is only defined by the structure of the prediction (the average of five similar books). No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was not a good choice, the predictions would not be close to the actual values.
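The 5-nearest neighbor idea can be sketched in a few lines of base R. The reference data and the `knn_predict()` helper below are hypothetical, written only to show the structure of the prediction:

```r
# Hypothetical reference set: predictors and sales for existing books
set.seed(7)
ref_x <- data.frame(price = runif(50, 10, 40), pages = runif(50, 150, 600))
ref_y <- 2000 - 25 * ref_x$price + 0.5 * ref_x$pages + rnorm(50, sd = 50)

# Average the outcomes of the k most similar (scaled) reference points
knn_predict <- function(new_x, x, y, k = 5) {
  x_sc   <- scale(x)
  new_sc <- scale(new_x,
                  center = attr(x_sc, "scaled:center"),
                  scale  = attr(x_sc, "scaled:scale"))
  d <- sqrt(rowSums((x_sc - matrix(new_sc, nrow(x), ncol(x), byrow = TRUE))^2))
  mean(y[order(d)[seq_len(k)]])
}

# Predicted sales for a new book: the mean of its five nearest neighbors
knn_predict(data.frame(price = 25, pages = 300), ref_x, ref_y)
```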
Broader discussions of these distinctions can be found in @breiman2001 and @shmueli2010.

 Note that we have defined the type of model by how it is used rather than its mathematical qualities. An ordinary linear regression model might fall into all three classes of models, depending on how it is used:

-* A descriptive LOESS model uses linear regression to estimate the trend in the data.
+* A descriptive smoother similar to LOESS, called the _restricted smoothing spline_ [@Durrleman1989], can be used to describe trends in data using ordinary linear regression with specialized terms.

-* An analysis of variance (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.
+* An _analysis of variance_ (ANOVA) model is a popular method for producing the p-values used for inference. ANOVA models are a special case of linear regression.

-* If a simple linear regression model produces highly correct predictions, it can be used as a predictive model.
+* If a simple linear regression model produces highly accurate predictions, it can be used as a predictive model.

-However, there are more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.
+However, there are many more examples of predictive models that cannot (or at least should not) be used for inference. Even if probabilistic assumptions were made for the data, the nature of the KNN model makes the math required for inference intractable.

 There is an additional connection between the types of models. While the primary purpose of descriptive and inferential models might not be related to prediction, the predictive capacity of the model should not be ignored. For example, logistic regression is a popular model for data where the outcome is qualitative with two possible values. It can model how variables relate to the probability of the outcomes. When used in an inferential manner, there is usually an abundance of attention paid to the _statistical qualities_ of the model. For example, analysts tend to strongly focus on the selection of which independent variables are contained in the model. Many iterations of model building are usually used to determine a minimal subset of independent variables that have a "statistically significant" relationship to the outcome variable. This is usually achieved when all of the p-values for the independent variables are below some value (e.g., 0.05). From here, the analyst typically focuses on making qualitative statements about the relative influence that the variables have on the outcome.

-A potential problem with this approach is that it can be dangerous when statistical significance is used as the _only_ measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but an accuracy of 35%? Predictive performance tends to be related to how close the model's fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model is good.
+A potential problem with this approach is that it can be dangerous when statistical significance is used as the _only_ measure of model quality. It is certainly possible that this statistically optimized model has poor model accuracy (or some other measure of predictive capacity). While the model might not be used for prediction, how much should the inferences be trusted from a model that has all significant p-values but a dismal accuracy? Predictive performance tends to be related to how close the model's fitted values are to the observed data. If the model has limited fidelity to the data, the inferences generated by the model should be highly suspect. In other words, statistical significance may not imply that the model should be used. This may seem intuitively obvious, but is often ignored in real-world data analysis.
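This can be demonstrated with a small, hypothetical simulation: when the true signal is weak but the sample is large, every coefficient in a logistic regression can be "significant" while the model's accuracy stays barely better than guessing.

```r
# Weak but real signal; with n = 5000, the coefficients test as significant
set.seed(13)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.1 * x1 - 0.1 * x2))

fit <- glm(y ~ x1 + x2, family = binomial)
summary(fit)$coefficients   # p-values for x1 and x2 are typically tiny

# ... yet the fitted values barely separate the classes
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == y)             # accuracy close to 0.5
```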
## Some terminology

-supervise, unsupervised
+Before proceeding, some additional terminology related to modeling, data, and other quantities should be outlined. These descriptions are not exhaustive.

-types of variables
+First, many models can be categorized as being _supervised_ or _unsupervised_. Unsupervised models are those that seek patterns, clusters, or other characteristics of the data but lack an outcome variable (i.e., a dependent variable). For example, principal component analysis (PCA), clustering, and autoencoders are used to understand relationships between variables or sets of variables without an explicit relationship between the variables and an outcome. Supervised models are those that have an outcome variable. Linear regression, neural networks, and numerous other methodologies fall into this category. Within supervised models, the two main sub-categories are:

-types of predictors
+ * _Regression_, where a numerical outcome is being predicted.

-The mode of the model: regression, classification
+ * _Classification_, where the outcome is an ordered or unordered set of _qualitative_ values.

+These are imperfect definitions and do not account for all possible types of models. In coming chapters, we refer to these types of supervised techniques as the _model mode_.

+In terms of data, the main species are quantitative and qualitative. Examples of the former are real numbers and integers. Qualitative values, also known as nominal data, are those that represent some sort of discrete state that cannot be placed on a numeric scale.

+Different variables can have different _roles_ in an analysis. Outcomes (otherwise known as the labels, endpoints, or dependent variables) are the values being predicted in supervised models. The independent variables, which are the substrate for making predictions of the outcome, are also referred to as predictors, features, or covariates (depending on the context). The terms _outcomes_ and _predictors_ are used most frequently in this text.

-## Where does modeling fit into the data analysis/scientific process?
+## How does modeling fit into the data analysis/scientific process? {#model-phases}

-(probably need to get explicit permission to use this)
+In what circumstances are models created? Are there steps that precede such an undertaking? Is it the first step in data analysis?

+There are always a few critical phases of data analysis that come before modeling. First, there is the chronically underestimated process of **cleaning the data**. No matter the circumstances, the data should be investigated to make sure that it is well understood, applicable to the project goals, accurate, and appropriate. These steps can easily take more time than the rest of the data analysis process (depending on the circumstances).
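A few lines of base R capture the spirit of these early checks. The data frame and its columns are hypothetical stand-ins:

```r
# Hypothetical raw data with some typical problems baked in
raw <- data.frame(
  age    = c(34, 51, NA, 29, 44, 61),
  group  = c("a", "a", "b", "b", "b", "c"),
  income = c(52000, 61000, 58000, -1, 47000, 75000)
)

summary(raw)          # distributions, ranges, and missingness at a glance
colSums(is.na(raw))   # how much is missing, and in which columns?
table(raw$group)      # is any sub-population unexpectedly rare?
range(raw$income)     # a value of -1 is probably a sentinel code, not income
```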
-## Modeling is a _process_, not a single activity +Data cleaning can also overlap with the second phase of **understanding the data**, often referred to as exploratory data analysis (EDA). There should be knowledge of how the different variables related to one another, their distributions, typical ranges, and other attributes. A good question to ask at this phase is "how did I come by _these_ data?" This question can help understand how the data at-hand have been sampled or filtered and if these operations were appropriate. For example, when merging data base tables, a join may go awry that could accidentally eliminate one or more sub-populations of samples. Another good idea would be to ask if the data are _relavant_. For example, to predict whether patients have Alzheimer's disease or not, it would be unwise to have a data set containing subject with the disease and a random sample of healthy adults from the general population. Given the progressive nature of the disease, the model my simply predict who the are the _oldest patients_. +Finally, before starting a data analysis process, there should be clear expectations of the goal of the model and how performance (and success) will be judged. At least one _performance metric_ should be identified with realistic goals of what can be achieved. Common statistical metrics are classification accuracy, true and false positive rates, root mean squared error, and so on. The relative benefits and drawbacks of these metrics should be weighted. It is also important that the metric be germane (i.e., alignment with the broader data analysis goals is critical). -(probably need to get explicit permission to use this too) +```{r data-science-model, echo = FALSE, out.width = '80%', fig.cap = "The data science process (from _R for Data Science_)."} +knitr::include_graphics("figures-premade/data-science-model.svg") +``` +The process of investigating the data may not be simple. @wickham2016 contains an excellent illustration of the general data analysis process, reproduced with Figure \@ref(fig:data-science-model). Data ingestion and cleaning are shown as the initial steps. When the analytical steps commence, they are a heuristic process; we cannot pre-determine how long they may take. The cycle of analysis, modeling, and visualization often require multiple iterations. 
-## Modeling is a _process_, not a single activity

-(probably need to get explicit permission to use this too)

 ```{r modeling-process, echo = FALSE, out.width = '100%', fig.width=8, fig.height=3, dev = "svg", fig.cap = "A schematic for the typical modeling process (from _Feature Engineering and Selection_).", warning = FALSE}
 widths <- c(8, 4, 10, 2, 6, 6, rep(1, 19), 2, rep(1, 19), 2,
@@ -174,11 +184,11 @@ bar_loc$ybot <- 1
 bar_loc$g <- factor(as.character(bar_loc$g),
                     levels = c("EDA", "Quantitative Analysis", "Feature Engineering", "Model Fit", "Model Tuning"))

-text_loc <- data.frame(x = c(1, 8, 30, 36, 120, 124, 132, 147, 211, 215)+1,
+text_loc <- data.frame(x = c(1, 8, 30, 36, 120, 124, 132, 147, 211, 215) + 1,
                        y = 2.1)
 text_loc$label <- letters[1:nrow(text_loc)]

-mod_loc <- data.frame(x = c(45, 66, 87, 107, 162, 195)+1,
+mod_loc <- data.frame(x = c(45, 66, 87, 107, 162, 195) + 1,
                       y = .75,
                       label = c("Model\n#1", "Model\n#2", "Model\n#3", "Model\n#4", "Model\n#2", "Model\n#4"))
@@ -186,14 +196,20 @@ mod_loc <- data.frame(x = c(45, 66, 87, 107, 162, 195)+1,
 ggplot(bar_loc) +
   geom_rect(aes(fill = g, xmin = srt, xmax = stp, ymin = ybot, ymax = ytop), alpha = .7) +
-  theme(legend.position = "bottom",
-        axis.line=element_blank(),axis.text.x=element_blank(),
-        axis.text.y=element_blank(),axis.ticks=element_blank(),
-        axis.title.x=element_text(hjust = .05),
-        axis.title.y=element_blank(),
-        panel.background=element_blank(),
-        panel.border=element_blank(),panel.grid.major=element_blank(),
-        panel.grid.minor=element_blank(),plot.background=element_blank()) +
+  theme(
+    legend.position = "bottom",
+    axis.line = element_blank(),
+    axis.text.x = element_blank(),
+    axis.text.y = element_blank(),
+    axis.ticks = element_blank(),
+    axis.title.x = element_text(hjust = .05),
+    axis.title.y = element_blank(),
+    panel.background = element_blank(),
+    panel.border = element_blank(),
+    panel.grid.major = element_blank(),
+    panel.grid.minor = element_blank(),
+    plot.background = element_blank()
+  ) +
   scale_fill_manual(values = diag_cols, name = "") +
   geom_text(data = text_loc, aes(x = x, y = y, label = label)) +
   geom_text(data = mod_loc, aes(x = x, y = y, label = label), size = 3) +
   ylim(c(.5, 2.25))
 ```

+This iterative process is especially true for modeling. Figure \@ref(fig:modeling-process) originates from @kuhn20202 and is meant to emulate the typical path to determining an appropriate model. The general phases are:

+ * Exploratory data analysis (EDA) and quantitative analysis (blue bars). Initially, there is a back and forth between numerical analysis and visualization of the data (represented in Figure \@ref(fig:data-science-model)) where different discoveries lead to more questions and data analysis "side-quests" to gain more understanding.

+ * Feature engineering (green bars). This understanding is translated into the creation of specific model terms that make it easier to accurately model the observed data. This can include complex methodologies (e.g., PCA) or simpler features, such as using the ratio of two predictors (see the sketch after this list).

+ * Model tuning and selection (red and gray bars). A variety of models are generated and their performance is compared. Some models require _parameter tuning_, where structural parameters must be specified or optimized.
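As a sketch of the feature engineering phase (the predictor names and values are hypothetical), both a simple ratio feature and a PCA projection can be created in a few lines of R:

```r
# Hypothetical predictors for 100 samples
set.seed(11)
raw <- data.frame(
  total_sales = runif(100, 1e4, 1e6),
  n_customers = runif(100, 100, 5000),
  p1 = rnorm(100), p2 = rnorm(100), p3 = rnorm(100)
)

# A simple engineered feature: the ratio of two predictors
raw$sales_per_customer <- raw$total_sales / raw$n_customers

# A more complex one: summarize correlated predictors with PCA components
pca <- prcomp(raw[, c("p1", "p2", "p3")], center = TRUE, scale. = TRUE)
features <- cbind(raw["sales_per_customer"], pca$x[, 1:2])
head(features)
```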
+After an initial sequence of these tasks, more understanding is gained regarding which types of models are superior as well as which sub-populations of the data are not being effectively estimated. This leads to additional EDA and feature engineering, another round of modeling, and so on. Once the data analysis goals are achieved, the last steps are typically to finalize and document the model. For predictive models, it is common at the end to validate the model on an additional set of data reserved for this specific purpose.

## Where does the model begin and end? {#begin-model-end}

So far, we have defined the model to be a structural equation that relates some predictors to one or more outcomes. Let's consider ordinary linear regression as a simple and well known example. The outcome data are denoted as $y_i$, where there are $i = 1 \ldots n$ samples in the data set. Suppose that there are $p$ predictors $x_{i1}, \ldots, x_{ip}$ that are used to predict the outcome. Linear regression produces a model equation of

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_px_{ip} $$

While this is a _linear_ model, it is only linear in the parameters. The predictors could be nonlinear terms (such as $\log(x_i)$).

The conventional way of thinking is that the modeling _process_ is encapsulated by the model. For many data sets that are straightforward in nature, this is the case. However, there are a variety of _choices_ and additional steps that often occur before the data are ready to be added to the model. Some examples:

* While our model has $p$ predictors, it is common to start with more than this number of candidate predictors. Through exploratory data analysis or previous experience, some of the predictors may be excluded from the analysis. In other cases, a feature selection algorithm may have been used to make a data-driven choice for the minimum predictor set to be used in the model.

* There are times when the value of an important predictor is not known. Rather than eliminating this value from the data set, it could be _imputed_ using other values in the data. For example, if $x_1$ were missing but was correlated with predictors $x_2$ and $x_3$, an imputation method could estimate the missing $x_1$ observation from the values of $x_2$ and $x_3$.

* As previously mentioned, it may be beneficial to transform the scale of a predictor. If there is **not** _a priori_ information on what the new scale should be, it might be estimated using a transformation technique. Here, the existing data would be used to statistically _estimate_ the proper scale that optimizes some criterion. Other transformations, such as the previously mentioned PCA, take groups of predictors and transform them into new features that are used as the predictors.

While the examples above are related to steps that occur before the model, there may also be operations that occur after the model is created. For example, when a classification model is created where the outcome is binary (e.g., `event` and `non-event`), it is customary to use a 50% probability cutoff to create a discrete class prediction (also known as a "hard prediction"). For example, a classification model might estimate that the probability of an event is 62%. Using the typical default, the hard prediction would be `event`. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to _raise_ the cutoff from 50% to some greater value. This increases the level of evidence required to call a new sample an event. While this reduces the true positive rate (which is bad), it may have a more profound effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works, even though it is not contained in the model fitting step.
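The effect of moving the cutoff can be sketched with simulated probabilities. The prevalence, signal strength, and cutoff values below are hypothetical assumptions, not output from a real model:

```r
# Hypothetical true classes and predicted event probabilities
set.seed(3)
truth <- rbinom(1000, size = 1, prob = 0.3)
prob  <- plogis(qlogis(0.3) + 1.5 * truth + rnorm(1000))

rates <- function(cutoff) {
  pred <- as.integer(prob >= cutoff)
  c(cutoff         = cutoff,
    true_pos_rate  = mean(pred[truth == 1] == 1),
    false_pos_rate = mean(pred[truth == 0] == 1))
}

# Raising the cutoff trades some true positives for fewer false positives
round(rbind(rates(0.50), rates(0.75)), 3)
```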
These examples have a common characteristic of requiring data for derivations that alter the raw data values or the predictions generated by the model.

It is very important to focus on the broader _model fitting process_ instead of the specific model being used to estimate parameters. This includes any pre-processing steps, the model fit itself, as well as potential post-processing activities. In this text, this will be referred to as the **model workflow**, and it includes any data-driven activities that are used to produce a final model equation.

This will come into play when topics such as resampling (Chapter \@ref(resampling)) and model tuning are discussed. Chapter \@ref(workflows) describes software for creating a model workflow.

## Outline of future chapters

+The first order of business is to introduce (or review) the ideas and syntax of the tidyverse in Chapter \@ref(tidyverse-primer). In this chapter, we also summarize the unmet needs for modeling when using R. This provides good motivation for why model-specific tidyverse techniques are needed. This chapter also outlines some additional principles related to these challenges.
+
+Chapter \@ref(two-models) shows two different data analyses for the same data, where one is focused on prediction and the other is for inference. This should illustrate the challenges of each approach and which issues are most relevant for each.
diff --git a/resampling.Rmd b/resampling.Rmd
index 23e35974..fe8ab6d6 100644
--- a/resampling.Rmd
+++ b/resampling.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/resampling-")
-# Resampling for evaluating performance
+# Resampling for evaluating performance {#resampling}

 Maybe include some simple examples of comparing models using resampling (perhaps go full `tidyposterior`?)
diff --git a/the-model-workflow.Rmd b/the-model-workflow.Rmd
index 796ccc62..abfab383 100644
--- a/the-model-workflow.Rmd
+++ b/the-model-workflow.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/workflow-")
-# A model workflow
+# A model workflow {#workflows}

 aka modeling process or model pipeline
diff --git a/tidyverse.Rmd b/tidyverse.Rmd
index 0197d3ff..c902243e 100644
--- a/tidyverse.Rmd
+++ b/tidyverse.Rmd
@@ -2,7 +2,7 @@ knitr::opts_chunk$set(fig.path = "figures/tidyverse-")
-# A tidyverse primer
+# A tidyverse primer {#tidyverse-primer}

 ## Principles
@@ -17,8 +17,17 @@ Things that I think that we'll need summaries of:
 * tactics: `select`, `bind_cols`, `tidyselect`, `slice`, `!!` and `!!!`, `...` for passing arguments, tibbles, joins, `nest`/`unnest`, `group_by`

+## Modeling via base R
+
+white book turning point
+
+base R conventions, basic usage, etc.
+
 ## Why tidiness is important for modeling

 ## Some additional tidy principles for modeling.