@mb706 mb706 commented Aug 13, 2025

Introducing Mlr3Component as base class for Learners, Resamplings, Measures, PipeOps, Filters, Tasks, etc.

Description

The new class Mlr3Component should become the base class for things we store in mlr3misc::Dictionary containers, such as Learners, PipeOps, Optimizers, Terminators, etc. It gives all of these the following fields:

  • id (character(1)): For identification inside tables and prefixing of ParamSets in e.g. Graphs. Can usually be changed, but can be set to read-only.
  • packages (character): Packages that are required for the object. They are checked upon construction, and a warning is thrown if any of them are not present. The packages of the objects involved (e.g. mlr3, mlr3pipelines) are automatically inserted here by Mlr3Component.
  • properties (character): Any character vector the class wants, often indicating some capabilities.
  • param_set (ParamSet): The ParamSet of the object. There is some machinery here that auto-constructs this ParamSet from components, if there are any.
  • man (character(1)): identifies the class for which the help-page should be opened. This is automatically inferred from the class hierarchy.
  • label (character(1)): Short description of the object for pretty-printing, automatically extracted as the title of the help page.
  • hash (character(1)): hash of all elements that constitute the "configuration" of the object (but not the "state", such as a trained model)
  • phash (character(1)): hash of all elements that constitute the configuration, except the param_set$values
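The difference between hash and phash can be illustrated with a toy R6 class. This is only a sketch: ToyComponent is a hypothetical stand-in and not the Mlr3Component class from this PR, a plain list stands in for param_set$values, and digest::digest() is assumed here as the hashing function.

```r
library(R6)

# Hypothetical stand-in; NOT the Mlr3Component class from this PR.
ToyComponent <- R6Class("ToyComponent",
  public = list(
    id = NULL,
    values = NULL,  # plain list standing in for param_set$values
    initialize = function(id, values = list()) {
      self$id <- id
      self$values <- values
    }
  ),
  active = list(
    # phash: class name and id, but not the parameter values
    phash = function() {
      digest::digest(list(class(self)[[1]], self$id), algo = "xxhash64")
    },
    # hash: everything that goes into phash, plus the parameter values
    hash = function() {
      digest::digest(list(class(self)[[1]], self$id, self$values), algo = "xxhash64")
    }
  )
)

a <- ToyComponent$new("pca", values = list(rank = 2))
b <- ToyComponent$new("pca", values = list(rank = 3))

a$phash == b$phash  # same class and id, so the phash agrees
a$hash == b$hash    # parameter values differ, so the hash does not
```

Two objects with equal phash but different hash are the same algorithm with a different parameter configuration.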

The following methods are implemented:

  • initialize(): constructor
  • format(): returns "<classname:id>"
  • print(): prints param_set values and packages; should probably be overridden by subclasses
  • help(): Opens the help page, using the man field
  • configure(): sets param_set values and class fields
  • override_info(): changes man and hash
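A rough sketch of how format() and configure() behave, again on a hypothetical toy class rather than the actual implementation. The field whitelist inside configure() is a simplification made for this example; the real method presumably decides between parameters and fields by inspecting the ParamSet.

```r
library(R6)

# Hypothetical toy class; the real Mlr3Component may differ in every detail.
ToyComponent <- R6Class("ToyComponent",
  public = list(
    id = NULL,
    predict_type = "response",  # example of a configurable class field
    values = list(),            # plain list standing in for param_set$values
    initialize = function(id) {
      self$id <- id
    },
    # format(): "<classname:id>"
    format = function(...) {
      sprintf("<%s:%s>", class(self)[[1]], self$id)
    },
    # configure(): names on the (assumed) field whitelist set the field,
    # everything else is treated as a parameter value
    configure = function(...) {
      args <- list(...)
      for (nm in names(args)) {
        if (nm %in% c("id", "predict_type")) {
          self[[nm]] <- args[[nm]]
        } else {
          self$values[[nm]] <- args[[nm]]
        }
      }
      invisible(self)
    }
  )
)

x <- ToyComponent$new("rpart")
x$configure(predict_type = "prob", cp = 0.01)
x$format()  # "<ToyComponent:rpart>"
```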

The following additional private fields are introduced, which are set through arguments of initialize():

  • .dict_entry (character(1)): The key under which the object is retrieved by its shorthand constructor, e.g. "pca" for PipeOpPCA, which is constructed by po("pca"). By default, the construction id and the .dict_entry are the same, with some exceptions, e.g. for wrapper objects (PipeOpLearner has .dict_entry "learner" but gets the id from the Learner that it wraps).
  • .dict_shortaccess (character(1)): The name of the shorthand constructor, e.g. "po"
  • .additional_configuration (character): names of fields that represent the configuration of the object that are not param_set or construction arguments of the object; e.g. $predict_type for Learners
  • .representable (logical(1)): Whether it would make sense to build a string from which the object can be reconstructed. Given all the data we have, it would be easy to build the lrn("classif.xxx", parval1 = 1, parval2 = 2) string for an object, which could help with debugging etc., but for some objects, such as Tasks, this does not make sense.
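To make the idea behind .representable concrete: given the shorthand constructor name, the dictionary key and the parameter values, the reconstruction string can be assembled with base R alone. repr_string() below is a hypothetical helper written for illustration; it is not part of this PR.

```r
# Hypothetical helper, not part of the PR: builds a string such as
# lrn("classif.xxx", parval1 = 1, parval2 = 2) from its three ingredients.
repr_string <- function(dict_shortaccess, dict_entry, values = list()) {
  # deparse() gives the R source representation of each value
  value_args <- sprintf(
    "%s = %s",
    names(values),
    vapply(values, function(v) paste(deparse(v), collapse = ""), character(1))
  )
  args <- c(deparse(dict_entry), value_args)
  sprintf("%s(%s)", dict_shortaccess, paste(args, collapse = ", "))
}

repr_string("lrn", "classif.xxx", list(parval1 = 1, parval2 = 2))
repr_string("po", "pca")
```

Printed with cat(), the first call gives lrn("classif.xxx", parval1 = 1, parval2 = 2) and the second gives po("pca").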

Furthermore, the following private methods may need to be overridden by concrete classes:

  • .additional_phash_input(): returns a list of objects that should become part of both phash and hash, in addition to the class name, the id and (for hash only) the param_set. A method that overrides this should call super$.additional_phash_input() and add its own elements.
  • deep_clone: Overriding methods should call super$deep_clone() for the values that they don't handle themselves, since the base class deep_clone takes care of the ParamSet.
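The two override patterns above can be sketched with plain R6. ToyBase and ToyLearner are hypothetical classes invented for this example, and an environment stands in for the ParamSet; only the pattern of delegating to super is the point.

```r
library(R6)

# Hypothetical base class; an environment stands in for the ParamSet.
ToyBase <- R6Class("ToyBase",
  public = list(
    id = NULL,
    params = NULL,
    initialize = function(id) {
      self$id <- id
      self$params <- new.env()
    },
    # exposes the private method so the example can inspect it
    phash_input = function() private$.additional_phash_input()
  ),
  private = list(
    .additional_phash_input = function() list(class(self)[[1]], self$id),
    deep_clone = function(name, value) {
      if (name == "params") {
        # copy the environment so the clone gets its own parameters
        list2env(as.list(value, all.names = TRUE), envir = new.env())
      } else {
        value
      }
    }
  )
)

ToyLearner <- R6Class("ToyLearner", inherit = ToyBase,
  public = list(
    predict_type = "response",
    cache = NULL,
    initialize = function(id) {
      super$initialize(id)
      self$cache <- new.env()
    }
  ),
  private = list(
    .additional_phash_input = function() {
      # extend, rather than replace, what the base class contributes
      c(super$.additional_phash_input(), list(self$predict_type))
    },
    deep_clone = function(name, value) {
      if (name == "cache") {
        new.env()  # handled here: the clone starts with an empty cache
      } else {
        super$deep_clone(name, value)  # everything else, including params
      }
    }
  )
)

l1 <- ToyLearner$new("toy")
l1$params$cp <- 0.1
l2 <- l1$clone(deep = TRUE)
l2$params$cp <- 0.5

l1$params$cp              # 0.1: the original is unaffected by the clone
length(l1$phash_input())  # 3: class name, id, predict_type
```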

We also have an autotest, which is best called through test_that_mlr3component_dict(). This function calls expect_mlr3component_subclass() for each of a series of provided classes in turn. See the example in the documentation for how it can be used efficiently, e.g. to test all PipeOps in a given package.
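The pattern of issuing one test_that() call per class can be sketched as follows. The helpers expect_toy_component() and test_that_toy_dict() are hypothetical stand-ins; the signatures of the real test_that_mlr3component_dict() and expect_mlr3component_subclass() are not shown in this description and may look different.

```r
library(testthat)
library(R6)

# Two hypothetical classes to run the autotest on.
ToyA <- R6Class("ToyA", public = list(id = "a"))
ToyB <- R6Class("ToyB", public = list(id = "b"))

# Hypothetical per-class expectation: constructs the object and checks its id.
expect_toy_component <- function(cls) {
  obj <- cls$new()
  expect_true(is.character(obj$id) && length(obj$id) == 1L)
}

# Hypothetical driver: one test_that() block per class, so a failure
# report carries the name of the class that failed.
test_that_toy_dict <- function(classes) {
  for (nm in names(classes)) {
    test_that(sprintf("toy component autotest: %s", nm), {
      expect_toy_component(classes[[nm]])
    })
  }
  invisible(names(classes))
}

# Called at the top level of a test file, not inside a test_that() block.
tested <- test_that_toy_dict(list(ToyA = ToyA, ToyB = ToyB))
```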

Discussion

This PR makes the following opinionated decisions:

  • Introduces .dict_entry, .dict_shortaccess, and .additional_configuration; once these are in place, we have an easy way of getting string-representations of our most common algorithm-objects
  • Adds a param_set to everything that can be retrieved from a Dictionary -- this may be a problem for the Task class; we could also split up the Mlr3Component into a class with, and a class without, a ParamSet.
  • Builds man and label automatically and deprecates passing these as part of construction. The label is constructed from the title of a help page; this changes the label slightly in some cases but means we don't have to write the same information twice (once in the roxygen @title and once in the constructor itself). The man is inferred from the class name, which is only a problem for some Tasks and the MeasureSimple. I have decided to provide the function override_info to keep the man field itself read-only.
  • This base class provides the ParamSet construction method from mlr3pipelines, where the param_set argument of the constructor can be set to an alist(), i.e. a list of expressions, and the $param_set field is then set to the ParamSetCollection of the evaluated expressions. This makes it possible to have a ParamSetCollection of the ParamSets of constituent R6 objects (e.g. PipeOps in a Graph) that can withstand cloning.
  • The test_that_mlr3component_dict function calls test_that("....", { ... }) itself and should therefore be called in a test file but outside of a test_that()-block. Having a different test_that()-call for each class being tested makes diagnostics much easier, since then the testthat-reporter will automatically add the name of the class for which the tests failed.
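Why storing expressions instead of evaluated objects helps with cloning can be shown with a small R6 sketch. ToyChild and ToyParent are hypothetical, plain lists stand in for ParamSets, and the lazy evaluation below is a simplification of the idea rather than the code used in mlr3pipelines.

```r
library(R6)

# Hypothetical constituent object holding its own "parameters".
ToyChild <- R6Class("ToyChild",
  public = list(
    values = NULL,
    initialize = function(...) self$values <- list(...)
  )
)

# Hypothetical container: keeps unevaluated expressions and resolves them
# on access, so they always refer to the children of the current object.
ToyParent <- R6Class("ToyParent",
  public = list(
    child = NULL,
    initialize = function(child, param_source) {
      self$child <- child
      private$.param_source <- param_source  # an alist() of expressions
    }
  ),
  active = list(
    param_values = function() {
      env <- environment()  # `self` is visible from here
      lapply(private$.param_source, function(expr) eval(expr, envir = env))
    }
  ),
  private = list(
    .param_source = NULL,
    deep_clone = function(name, value) {
      if (inherits(value, "R6")) value$clone(deep = TRUE) else value
    }
  )
)

parent <- ToyParent$new(ToyChild$new(k = 1), alist(self$child$values))
copy <- parent$clone(deep = TRUE)
copy$child$values$k <- 2

parent$param_values[[1]]$k  # 1: resolves against the original child
copy$param_values[[1]]$k    # 2: resolves against the cloned child
```

Because R6 re-binds self when cloning, the stored expression self$child$values picks up the clone's own child without any extra bookkeeping.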

Descendant PRs

Package        PR                          Status
mlr3misc       #143
mlr3           mlr-org/mlr3#1370
mlr3pipelines  mlr-org/mlr3pipelines#943
bbotk          mlr-org/bbotk#297
mlr3tuning     mlr-org/mlr3tuning#506
mlr3data       mlr-org/mlr3data#25
mlr3filters    mlr-org/mlr3filters#177
mlr3learners   mlr-org/mlr3learners#358

Deployment Timeline

  1. merge mlr3misc
  2. put mlr3misc on CRAN
  3. merge other packages, taking care not to break unrelated packages
  4. put other packages on CRAN
  5. set mlr3.on_deprecated_mlr3component default to "warn"; push to CRAN
  6. set mlr3.on_deprecated_mlr3component default to "error"; push to CRAN
  7. remove deprecation messages; push to CRAN
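The staged rollout relies on an option that controls how loudly deprecated usage is reported. A minimal sketch in base R of how such a switch could work: only the option name and the values "warn" and "error" come from the timeline above, while the helper names and the "ignore" fallback are assumptions made for illustration.

```r
# Hypothetical helper; the "ignore" fallback is an assumption.
toy_deprecated <- function(msg) {
  mode <- getOption("mlr3.on_deprecated_mlr3component", default = "ignore")
  switch(mode,
    ignore = invisible(NULL),
    warn = warning(msg, call. = FALSE),
    error = stop(msg, call. = FALSE),
    stop(sprintf("Unknown value '%s' for mlr3.on_deprecated_mlr3component", mode))
  )
}

# Hypothetical constructor that still accepts the deprecated 'label' argument.
toy_constructor <- function(id, label = NULL) {
  if (!is.null(label)) {
    toy_deprecated("Passing 'label' is deprecated; it is inferred from the help page.")
  }
  list(id = id)
}

# Step 5 of the timeline: deprecated usage warns.
options(mlr3.on_deprecated_mlr3component = "warn")
res_warn <- tryCatch(toy_constructor("x", label = "X"), warning = function(w) "warned")

# Step 6 of the timeline: deprecated usage fails.
options(mlr3.on_deprecated_mlr3component = "error")
res_error <- tryCatch(toy_constructor("x", label = "X"), error = function(e) "failed")
```

Downstream packages can set the option to "error" in their own test suites early, to find deprecated calls before the default changes.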

Optional further developments:

  • Add repr()
  • Representation of wrappers through e.g. as_learner()
  • extend autotests
  • decide what to do with non-algorithm classes (Objective, ResamplingResult, DataBackend, etc.)
