Cross validation
This page describes how cross-validation could work in NIMBLE, and poses an open question about its functionality. Any comments or ideas are very welcome!
Cross-validation methods generally work by splitting a model's data into two parts: a training portion and a test portion. The model is fit (in our case, likely via MCMC) to only the training data, and values for the held-out test data are simulated (in our case, from their posterior predictive distribution). These simulated values are then compared to the actual observed, left-out values via some loss function. The average of the loss function can be computed over all posterior draws, producing an estimate of out-of-sample prediction error.
A commonly used type of cross-validation is k-fold cross-validation, in which the data set is partitioned into k roughly equal parts. Each of the k parts is in turn held out as test data, the model is fit to the remaining data in the manner described above, and a model fit estimate is obtained from the held-out part. The k model fit estimates are then averaged to provide an overall measure of fit. k-fold cross-validation has the advantage of producing an estimate of fit that takes into account all data points.
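As a concrete illustration of the loss calculation just described, here is a minimal R sketch; the posterior predictive draws are simulated placeholders, and squared-error loss is just one possible choice of loss function:

```r
## Minimal sketch of the loss calculation described above. 'predDraws' stands
## in for posterior predictive draws of the held-out data (here just random
## placeholders).
set.seed(1)
nDraws <- 1000                      # number of posterior draws
yHeldOut <- rnorm(20)               # the observed, left-out data values
predDraws <- matrix(rnorm(nDraws * 20), nDraws, 20)  # placeholder draws

## Squared-error loss for each posterior draw, then averaged over all draws:
lossPerDraw <- rowMeans(sweep(predDraws, 2, yHeldOut)^2)
cvEstimate <- mean(lossPerDraw)     # estimate of out-of-sample prediction error
```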
Employing cross-validation for hierarchical models in a general manner raises the question: What method do we want to use for "leaving out" data? Imagine we wish to perform k-fold CV. In simple models, e.g. a linear model with iid-distributed data, partitioning the data into randomly chosen, equally sized subsets is a reasonable approach. For more general hierarchical models, randomly choosing data points no longer seems like a good go-to solution.
For example, the hierarchical Hidden Markov Model (HMM) presented as Model 1 from Turek et al. (2016) (link here) is a model where some number of HMMs are related by a set of top-level parameters. Randomly choosing a subset of data to leave out for this model will almost surely result in some data points being left out from each of the different HMMs. The cross-validation metric in this case would measure how well the model can predict these randomly withheld data points across a range of different HMMs, which may not be a measurement of interest to a researcher. This issue is described in an old-ish blog post by Andrew Gelman here, as well as likely in a number of other places.
More relevant to a researcher could be how well the model can predict a new, currently unseen HMM given the already observed HMMs. This could be estimated by leaving out whole sets of data corresponding to one (or more) of the lower-level HMMs in the model, predicting the course of these left-out HMMs, and comparing the predictions to the observed data.
This HMM example is intended to show that sensible methods of leaving out data in hierarchical models can be very model-specific, and randomly leaving out data across all of a model's data points is not always sensible. As such, it may make more sense to let users define the subsets of data they wish to leave out. This can be imagined in a few different forms.
- (The current implementation) Assume that all of the data in a user's hierarchical model is in an M-dimensional array, and further assume that the i-th dimension of that array defines the hierarchical grouping. This method would perform k-fold CV by partitioning the data along this dimension. All the user would need to provide is the name of the data array and the number of the dimension that defines the hierarchical groupings.
For example, in the hierarchical HMM model described above, maybe the observed data is an array `y[i, t]`, where row `i` denotes data coming from HMM number `i`, and column `t` denotes data at time point `t`. In this model formulation, the first dimension (the rows `i`) would be the dimension that defines the hierarchical grouping. k-fold CV could be accomplished by sequentially leaving out each row of data and predicting the values of that row given the other rows (see the sketch below).

The drawbacks to this implementation are the strict requirements on the format of the data (all data contained in a single array, one dimension of which defines the hierarchical groupings), which may make it unusable for many hierarchical model formulations.
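A hedged sketch of how the node names for each fold could be constructed under this scheme; `sliceNodes` is a hypothetical helper written for illustration, not part of NIMBLE:

```r
## Hypothetical helper (not NIMBLE's API): build the node-name string for each
## fold by slicing an array variable along the grouping dimension 'foldDim'.
sliceNodes <- function(varName, dims, foldDim) {
  lapply(seq_len(dims[foldDim]), function(i) {
    idx <- sapply(seq_along(dims), function(d)
      if (d == foldDim) as.character(i) else paste0("1:", dims[d]))
    paste0(varName, "[", paste(idx, collapse = ", "), "]")
  })
}

## For the HMM example, with y having dims (10 HMMs x 100 time points):
sliceNodes("y", dims = c(10, 100), foldDim = 1)[[5]]  # "y[5, 1:100]"
```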
- Extensions to the above implementation can be imagined. A user could be allowed to specify combinations of dimensions along which to leave out data (e.g. an array `y[i, j, t]` where `y[i, j, ]` is left out as test data for every combination of `i` and `j`; see the sketch below). Extensions to multiple data variables in the model could also be implemented.
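Extending the hypothetical `sliceNodes` sketch above to the `y[i, j, ]` case, the folds for all `(i, j)` combinations might be built like this:

```r
## Hypothetical extension: one fold per (i, j) combination, leaving out the
## whole slice y[i, j, 1:dims[3]] as test data.
sliceNodes2 <- function(varName, dims) {
  grid <- expand.grid(i = seq_len(dims[1]), j = seq_len(dims[2]))
  lapply(seq_len(nrow(grid)), function(r)
    paste0(varName, "[", grid$i[r], ", ", grid$j[r], ", 1:", dims[3], "]"))
}

sliceNodes2("y", dims = c(3, 4, 50))[[1]]  # "y[1, 1, 1:50]"
```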
- A user could provide a list. Each element of the list would be named for a data variable in the model and would be an object of the same dimensions as that variable, but instead of containing data, it would contain integers in the range [1, k] for k-fold CV. Each integer would denote the "fold" in which that data point would be left out (see the example below). This implementation would offer the most flexibility, but would also require the most work from users before running the function. It would also run into a potential issue in dealing with multivariate nodes -- e.g., if a user specified that the first element of a multivariate node should be held out in fold 1, and the second element of that same multivariate node should be held out in fold 2.
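For illustration, such an input list might look like the following; the variable names and dimensions here are made up for the example:

```r
## Made-up example of the proposed input: a model with data variables 'y'
## (a 10 x 100 matrix) and 'z' (a vector of length 30), with each data point
## assigned at random to one of k = 5 folds.
k <- 5
foldAssignments <- list(
  y = matrix(sample(1:k, 10 * 100, replace = TRUE), nrow = 10, ncol = 100),
  z = sample(1:k, 30, replace = TRUE)
)
## All data points whose entry equals j would be held out together in fold j.
```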
It seems like the trick will be making the algorithm flexible enough to work for a variety of models, while making the input easy enough for a user to want to try it.
Sorry I'm so behind on getting to these, Nick. I just read over this, and had a quick idea about what might be most flexible for the user, while also allowing the model to be set up in any manner. This idea can probably be refined a lot, but here's the basic idea:
Suppose the model and data can be anything. As an input to the CV algorithm, a user provides two things:
- An integer `k`, specifying the number of folds for the k-fold CV; this will also determine the integer arguments (1, 2, 3, ..., k) that will be supplied to the next thing, a function.
- A function `f(i)`, which, when supplied with an integer argument `i`, returns the data node names that should be left out in the i-th iteration of the CV algorithm.
For example, in your HMM example given above, these would be:
- `k` = number of HMMs.
- A function that, given argument `i`, returns `"y[i, 1:100]"`; so for example when the argument is `i = 5`, the function would actually return `"y[5, 1:100]"`. This is assuming the data in each nested HMM has length 100 (see the sketch below).
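In R, such a function could be as simple as the following, assuming (as above) 100 observations per HMM:

```r
## The fold function for the HMM example: fold i leaves out all data from
## HMM number i (assuming each nested HMM contributes 100 observations).
f <- function(i) paste0("y[", i, ", 1:100]")

f(5)  # returns "y[5, 1:100]"
```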
This seems totally flexible, and the user can (hopefully, easily) write such a function that allows for any combination of the data / components of model variables to be "left out" on the different iterations of the CV.
Thoughts, or improvements? -DT
A variant on DT's idea would be to provide a list of vectors of node names. Each element in the list would represent the nodes to be left out in one fold. For regular schemes such as described above where, say, each row of a matrix represents a grouped set of data, we could provide a utility function to create the list of vectors of node names from a matrix. In other words, implement finer-grained control and provide users with utility functions to create the fine-grained input from high-level input when that is simple to do.
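For the each-row-is-a-fold case mentioned here, such a utility function might look like this; the function name is hypothetical:

```r
## Hypothetical utility: build the list of node-name vectors from a matrix
## variable whose rows each define one fold.
foldsFromMatrixRows <- function(varName, nRow, nCol) {
  lapply(seq_len(nRow), function(i) paste0(varName, "[", i, ", 1:", nCol, "]"))
}

foldsFromMatrixRows("y", nRow = 10, nCol = 100)  # 10 folds, one row each
```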