mlr3resampling provides new cross-validation algorithms for the mlr3 framework in R
install.packages("mlr3resampling")#release version from CRAN
## OR: development version from GitHub:
install.packages("remotes")
remotes::install_github("tdhock/mlr3resampling")For an overview of functionality, please read my recent blog post, the SOAK arXiv paper, and other articles.
See examples in Newer resamplers vignette and data viz for regression and classification.
A supervised learning algorithm inputs a train set, and outputs a prediction function, which can be used on a test set. If each data point belongs to a subset (such as geographic region, year, etc.), then how do we know if subsets are similar enough so that we can get accurate predictions on one subset, after training on Other subsets? And how do we know if training on All subsets would improve prediction accuracy, relative to training on the Same subset? SOAK, Same/Other/All K-fold cross-validation, can be used to answer these questions, by fixing a test subset, training models on Same/Other/All subsets, and then comparing test error rates (Same versus Other, and Same versus All); see the sketch after the list below for how these train sets are formed.
- subsets are similar (for the learning algorithm) if All is more accurate than Same, and, between Same and Other, whichever model was trained on more data is more accurate.
- subsets are different (for the learning algorithm) if All/Other is less accurate than Same.
- Other can be just as bad as featureless baseline (or worse) when the subsets have different learnable/predictable patterns.
- Note that results depend on the data set and on the learning algorithm. For a given data set, a linear model may not be able to learn on one subset and accurately predict on the other, but a more complex learning algorithm may be able to.
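To make the definitions above concrete, here is a small self-contained sketch (toy data, invented for illustration; not the package's actual implementation) of how the three train sets could be formed for one fixed test subset and test fold. The convention shown here, holding out the test fold in every subset so that All is the union of Same and Other, is an assumption; see the package documentation for the exact rule used by ResamplingSameOtherSizesCV.

## Toy sketch (hypothetical data) of Same/Other/All train sets for one
## fixed test subset and test fold. Not the package's implementation.
library(data.table)
toy.dt <- data.table(
  row = 1:12,
  subset = rep(c("NE", "NW", "S"), each = 4), # e.g. geographic region
  fold = rep(1:2, times = 6))                 # K = 2 folds per subset
test.subset <- "S"
test.fold <- 1
test.rows <- toy.dt[subset == test.subset & fold == test.fold, row]
train.rows <- list( # the three train sets compared by SOAK
  same  = toy.dt[subset == test.subset & fold != test.fold, row],
  other = toy.dt[subset != test.subset & fold != test.fold, row],
  all   = toy.dt[fold != test.fold, row])
## Each train set is used to fit a model, which is then evaluated on the
## same test.rows, so the three test error rates are directly comparable.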
This is implemented in ResamplingSameOtherSizesCV, which you use on a
task that defines the subset role, for example the Arizona trees data:
each row is a pixel in an image, and we want to do binary
classification: does the pixel contain a tree or not?
> data(AZtrees,package="mlr3resampling")
> table(AZtrees$region3)
NE NW S
1464 1563 2929

We see in the output above that the region3 column has three values
(NE, NW, S). Each represents the region/area in which the pixel was
found. If we want good predictions in the south (S), can we train on
the north (NE+NW)? We can use the code below to set up the CV
experiment. Rows 12, 15, and 18 in the output below represent splits
that attempt to answer that question (test.subset=S, train.subsets=other).
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
test.subset train.subsets test.fold
<char> <char> <int>
1: NE all 1
2: NW all 1
3: S all 1
4: NE all 2
5: NW all 2
6: S all 2
7: NE all 3
8: NW all 3
9: S all 3
10: NE other 1
11: NW other 1
12: S other 1
13: NE other 2
14: NW other 2
15: S other 2
16: NE other 3
17: NW other 3
18: S other 3
19: NE same 1
20: NW same 1
21: S same 1
22: NE same 2
23: NW same 2
24: S same 2
25: NE same 3
26: NW same 3
27: S same 3
test.subset train.subsets test.fold

The rows in the output above represent different kinds of splits:
- train.subsets=same is used as a baseline, to answer the question, “what is the baseline error rate, if we train on data from the same subset as the test data?” This important baseline allows us to see whether the All and Other models do better or worse.
- train.subsets=all is used to answer the question, “can we get accurate predictions on a given test subset, by training on data from all subsets?”
- train.subsets=other is used to answer the question, “can we get accurate predictions on a given test subset, by training on data from other subsets?”
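To compute answers to these questions, we can benchmark one or more learners over these splits, and then compare mean test error rates by train.subsets. The code below is a minimal sketch, not the only way to run the comparison: it assumes the task.obj and same_other_sizes_cv objects created above, that the rpart package is installed, and that mlr3resampling::score() joins each per-iteration score (default classification measure: classif.ce) with its test.subset/train.subsets information, as in the package vignettes.

## Sketch: compare Same/Other/All test error rates on AZtrees.
library(mlr3)
library(data.table)
learner.list <- list(
  mlr3::lrn("classif.featureless"), # baseline that ignores the features
  mlr3::lrn("classif.rpart"))       # decision tree (needs rpart installed)
## Recent mlr3 versions accept a resampling already instantiated on this
## task; otherwise, create a fresh ResamplingSameOtherSizesCV here.
bench.grid <- mlr3::benchmark_grid(task.obj, learner.list, same_other_sizes_cv)
bench.result <- mlr3::benchmark(bench.grid)
## Assumed: mlr3resampling::score() adds test.subset/train.subsets columns
## to the usual per-iteration scores.
score.dt <- mlr3resampling::score(bench.result)
## Mean test error for each learner, test subset, and kind of train set:
score.dt[, .(mean.error = mean(classif.ce)),
         by = .(learner_id, test.subset, train.subsets)]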
Code to re-run:
data(AZtrees,package="mlr3resampling")
table(AZtrees$region3)
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]

See examples in the Newer resamplers vignette and data viz for regression and classification.
How many train samples are required to get accurate predictions on a test set? Cross-validation can be used to answer this question, with variable-size train sets. For example, consider the Arizona trees data below:
> dim(AZtrees)
[1] 5956 25
> length(unique(AZtrees$polygon))
[1] 189

The output above indicates we have 5956 rows and 189 polygons. We can
do cross-validation on either polygons (if the task has a group role) or
rows (if no group role is set). The code below sets a down-sampling
ratio of 0.8, and four sizes of down-sampled train sets.
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$param_set$values$sizes <- 4
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
n.train.groups test.fold
<int> <int>
1: 51 1
2: 64 1
3: 80 1
4: 100 1
5: 126 1
6: 51 2
7: 64 2
8: 80 2
9: 100 2
10: 126 2
11: 51 3
12: 64 3
13: 80 3
14: 100 3
15: 126 3

The output above has one row per train/test split that will be computed in the cross-validation experiment. The full train set size is 126 polygons, and there are four smaller train set sizes (each a factor of 0.8 smaller). Each train set size will be computed for each fold ID from 1 to 3.
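As a quick check of the arithmetic, the down-sampled sizes follow a geometric sequence: the full size times ratio^i, truncated to whole groups (the truncation rule is an assumption here; see the package documentation for the exact computation).

## Down-sampled train set sizes: 126 groups scaled by ratio^i, truncated
## to whole groups (assumed rule, which matches the output above).
full.groups <- 126
ratio <- 0.8
floor(full.groups * ratio^(0:4))
## [1] 126 100  80  64  51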
Code to re-run:
data(AZtrees,package="mlr3resampling")
dim(AZtrees)
length(unique(AZtrees$polygon))
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$param_set$values$sizes <- 4
same_other_sizes_cv$param_set$values$ratio <- 0.8
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]

- mlr3resampling code was copied/modified from the Resampling and ResamplingCV classes in the excellent mlr3 package.
- As of Oct 2024, scikit-learn in Python implements support for groups via GroupKFold (keeping samples together when splitting), but not subsets (test data come from one subset, train data come from Same/Other/All subsets).
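For comparison, the sketch below shows plain grouped 3-fold CV in mlr3, roughly the analogue of scikit-learn's GroupKFold: polygons are kept together when splitting, but there is no subset role, which is the part that ResamplingSameOtherSizesCV adds. This is an illustrative sketch, not code from the package.

## Grouped 3-fold CV in mlr3 (analogous to scikit-learn's GroupKFold):
## groups (polygons) are kept together, but there is no subset role.
library(mlr3)
data(AZtrees, package = "mlr3resampling")
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target = "y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value = TRUE)
task.obj$col_roles$group <- "polygon" # keep each polygon in a single fold
grouped_cv <- mlr3::rsmp("cv", folds = 3)
grouped_cv$instantiate(task.obj)
## Unlike ResamplingSameOtherSizesCV, every train set here mixes all
## regions; there is no Same/Other/All distinction.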