Yaml provenance #2505

@LuiggiTenorioK

In GitLab by @mandresm on Jul 11, 2024, 17:24

Summary

As mentioned in previous meetings, I want to propose that conf/metadata/experiment_data.yml contain information about the provenance of each value in the form of a comment. I am opening this issue to discuss the implementation strategy, timeline, responsibilities, and possible improvements to the feature.

An equivalent feature exists in ESM-Tools, an experiment configuration tool and workflow manager we develop at AWI. I propose we copy/paste from there and start modifying what we need. Here is an example of what I have in mind, taken from the equivalent yaml file in ESM-Tools:

fesom:
  model: fesom  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:4,col:8
  branch: 2.0.2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:17,col:13
  version: 2 # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:399,col:18
  type: ocean # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:7,col:7
  comp_command: mkdir -p build; cd build; cmake -DOIFS_COUPLED=ON -DFESOM_COUPLED=ON -DCMAKE_INSTALL_PREFIX=../ ..;   make install -j `nproc --all` # <SOME_ABSOLUTE_PATH>/esm_tools/configs/setups/awicm3/awicm3.yaml,line:414,col:31
  clean_command: rm -rf build CMakeCache.txt # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:10,col:16
  required_plugins:
  - git+https://github.com/esm-tools-plugins/tar_binary_restarts  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:13,col:3
  install_bins: bin/fesom.x  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:22,col:19
  git-repository:
  - https://github.com/FESOM/fesom2.git  # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:20,col:7
  - https://gitlab.dkrz.de/FESOM/fesom2.git # <SOME_ABSOLUTE_PATH>/esm_tools/configs/components/fesom/fesom-2.0.yaml,line:21,col:7

Who am I suggesting should implement this feature?

Either me or @Hussam-Turjman over the next one to two months, with support from someone from Autosubmit, for example @dbeltrankyl or @kinow. But if someone at BSC wants to get a head start, feel free :)

What does this feature support?

  • Reading yamls into Python collections and, for dictionaries and lists, storing the line, column and file path of each value as a provenance attribute of that value
  • Each value read from a yaml file becomes an instance of a dynamically defined subclass of its original type, with methods to handle the provenance history
  • Dictionary and list subclasses which ensure that methods such as update, __setitem__, etc. keep the history of each value's provenance, plus other methods to recursively retrieve and set the provenance values. There is also a clean_provenance method that recursively returns the original value with its original type.
  • Writing yamls where each value that has a provenance gets a comment next to it indicating the line, column and path of the file defining that value

All of this would be useful not only for the comments in conf/metadata/experiment_data.yml, but also for querying the provenance of a given value at any point in Autosubmit, simply by using the provenance attribute of that value: my_dict["my_key"].provenance in a dict and my_list[my_index].provenance in a list. It could also come in handy for improving error messages.
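
To make the idea of dynamically defined value subclasses more concrete, here is a minimal sketch. It is not the ESM-Tools implementation: the names provenance_type, wrap and describe are hypothetical, the provenance entries are plain dicts whose keys (yaml_file, line, col) are assumed for illustration, and the file names are shortened from the example above. clean_provenance only mimics, for single values, the method mentioned in the list.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def provenance_type(base):
    """Create (once per built-in type) a subclass of `base` that accepts a `provenance` attribute."""
    return type(base.__name__.capitalize() + "WithProvenance", (base,), {"_base": base})


def wrap(value, provenance):
    """Return `value` as an instance of the dynamic subclass of its own type.

    bool and None cannot be subclassed, so this sketch returns them unchanged.
    """
    if value is None or isinstance(value, bool):
        return value
    wrapped = provenance_type(type(value))(value)
    wrapped.provenance = [provenance]  # history, newest entry last
    return wrapped


def clean_provenance(value):
    """Convert a wrapped value back to its original built-in type."""
    base = getattr(type(value), "_base", None)
    return value if base is None else base(value)


def describe(key, value):
    """Build an error-message fragment pointing at the file that defined `value`."""
    origin = value.provenance[-1]
    return f"{key}={value} (defined in {origin['yaml_file']}, line {origin['line']}, col {origin['col']})"


# Hand-written provenance entries; a real loader would fill these in while parsing.
my_dict = {
    "branch": wrap("2.0.2", {"yaml_file": "fesom-2.0.yaml", "line": 17, "col": 13}),
    "version": wrap(2, {"yaml_file": "awicm3.yaml", "line": 399, "col": 18}),
}

print(my_dict["branch"].provenance[-1])
print(describe("version", my_dict["version"]))
print(type(clean_provenance(my_dict["version"])))
```

Because the wrapped values are still instances of str, int and so on, existing code that compares or formats them should keep working, which is what makes this approach attractive for retrofitting into an existing code base.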

Can we reuse (copy/paste) the code from ESM-Tools?

Yes, our license is GPL-2: https://github.com/esm-tools/esm_tools?tab=GPL-2.0-1-ov-file#readme

Relevant files in ESM-Tools

How can it be implemented?

  1. While parsing the yaml, one needs to extract the line and column information and store it in a collection that has the same structure as the collection loaded from the yaml. We do that with the EsmToolsLoader, a subclass of ruamel.yaml.YAML:
    https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L693-L770
    Note that EsmToolsLoader has some deprecated methods related to dumping. The most important method there is load.

    The loader uses this constructor class, which derives from ruamel.yaml.RoundTripConstructor:
    https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L638-L673

    Note that there we subclass from EnvironmentConstructor, whose parent class is ruamel.yaml.RoundTripConstructor. For the implementation here we could subclass directly from ruamel.yaml.RoundTripConstructor.

    Once the code is implemented, one can simply do:

    esm_tools_loader = EsmToolsLoader()
    esm_tools_loader.set_filename(yaml_file)
    yaml_load, provenance = esm_tools_loader.load(yaml_file)

    as in these lines: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/yaml_to_dict.py#L188-L198

    After this, yaml_load holds the standard collection as read by ruamel.yaml, and provenance holds another collection with the same structure as yaml_load in terms of keys, but whose values are provenance objects instead.

  2. Join the two worlds into a single collection. For example, for a dictionary use the class DictWithProvenance:

    dictionary_with_provenance = DictWithProvenance(yaml_load, provenance)

    This dict now has all the provenance information attached to its values, and you can use it as you see fit. If your collection is a list, use ListWithProvenance instead of DictWithProvenance. Tuples, sets and other collection types are not supported.

    For all the methods related to provenance, see provenance.py itself. It's almost more docstrings than code: https://github.com/esm-tools/esm_tools/blob/release/src/esm_parser/provenance.py

  3. You can now operate on the lists and dictionaries as you usually would. As long as you use __setitem__ (or update in the case of dictionaries), the provenance history is kept in the provenance attribute of the value. The last entry in that history is the provenance of the current value:

    my_list_with_prov[2] = my_var_with_prov
    previous_provenance = my_list_with_prov[2].provenance[-2]
    latest_provenance = my_list_with_prov[2].provenance[-1]

  4. Time to dump the Frankenstein dictionary we've been putting together from pieces of other yamls, using the function yaml_dump: https://github.com/esm-tools/esm_tools/blob/6cf5ea8664267a80031b2d54ec6e863cf7da9645/src/esm_parser/dict_to_yaml.py#L11-L130

    yaml_dump(dict_or_list_with_prov, "/path/to/the/commented.yaml")

    It's not a very elegant or efficient function, but it does the job, I guess...
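
Steps 2 to 4 can be sketched without any yaml library, to show how the pieces fit together. This is a simplified illustration, not the ESM-Tools code: Value, ProvDict, ProvList, attach and dump_lines are hypothetical names, only string values are handled, the two input collections are written by hand instead of coming from a loader, and user.yaml is an invented file used to demonstrate an override.

```python
class Value(str):
    """String value carrying a provenance history (newest entry last)."""

    def __new__(cls, text, history):
        obj = super().__new__(cls, text)
        obj.provenance = list(history)
        return obj


def with_history(old, new):
    """Prepend the history of the overwritten value to the new value."""
    if isinstance(old, Value) and isinstance(new, Value):
        return Value(new, old.provenance + new.provenance)
    return new


class ProvDict(dict):
    def __setitem__(self, key, new):
        super().__setitem__(key, with_history(self.get(key), new))

    def update(self, other):
        # dict.update bypasses __setitem__, so route every item through it
        for key, new in other.items():
            self[key] = new


class ProvList(list):
    def __setitem__(self, index, new):
        super().__setitem__(index, with_history(self[index], new))


def attach(data, prov):
    """Step 2: merge the loaded collection with its parallel provenance collection."""
    if isinstance(data, dict):
        return ProvDict((k, attach(v, prov[k])) for k, v in data.items())
    if isinstance(data, list):
        return ProvList(attach(v, p) for v, p in zip(data, prov))
    return Value(data, [prov])


def comment(value):
    p = value.provenance[-1]
    return f"{p['yaml_file']},line:{p['line']},col:{p['col']}"


def dump_lines(data, indent=0):
    """Step 4: render dicts and lists of scalars as yaml-like lines with comments."""
    pad = "  " * indent
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(dump_lines(value, indent + 1))
            else:
                lines.append(f"{pad}{key}: {value}  # {comment(value)}")
    else:
        for value in data:
            lines.append(f"{pad}- {value}  # {comment(value)}")
    return lines


# The two parallel collections a provenance-aware loader would return (hand-written here)
yaml_load = {"fesom": {"branch": "2.0.2", "git-repository": ["https://github.com/FESOM/fesom2.git"]}}
provenance = {"fesom": {"branch": {"yaml_file": "fesom-2.0.yaml", "line": 17, "col": 13},
                        "git-repository": [{"yaml_file": "fesom-2.0.yaml", "line": 20, "col": 7}]}}

config = attach(yaml_load, provenance)
# Step 3: overriding a value keeps the old provenance in the history
config["fesom"]["branch"] = Value("2.1", [{"yaml_file": "user.yaml", "line": 3, "col": 11}])
print("\n".join(dump_lines(config)))
```

After the override, config["fesom"]["branch"].provenance holds two entries, the original one from fesom-2.0.yaml followed by the one from user.yaml, and only the latest one is written into the comment. A real implementation would also need to handle non-string scalars, nested lists, and proper yaml quoting.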
