
kacpnowak
Contributor

Description

This PR implements separate source and target datasets for FesomDataReader.

It was also necessary to modify the masking strategies, as a healpix cell can be empty for the source but not for the target. Additionally, the number of tokens per cell can differ between source and target. Each strategy required a different adjustment:

  • random: when the mask is missing for a token, it is assumed to be False.
  • healpix: when the mask is missing for an entire healpix cell, it is assumed to be False.
  • causal: if the mask is missing for an entire cell, it is assumed to be True for every token inside. If the number of tokens in a cell differs between source and target, the source masked ratio is computed and the same fraction of target tokens is masked.
  • channels: unsupported
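
To make the fallback rules above concrete, here is a minimal sketch of the healpix and causal cases for a single cell. It is not the code from this PR: the function names, the use of NumPy boolean arrays, the rounding of the masked count, and the choice to mask the trailing tokens in the causal case are all assumptions made for illustration.

```python
import numpy as np


def healpix_target_mask(source_cell_masked, n_target):
    """Hypothetical helper: per-token target mask for one cell (healpix strategy).

    source_cell_masked is True/False for the whole cell, or None when the
    source has no mask for this cell (e.g. the cell is empty in the source).
    """
    # Missing mask for the entire cell is treated as False (not masked).
    masked = bool(source_cell_masked) if source_cell_masked is not None else False
    return np.full(n_target, masked, dtype=bool)


def causal_target_mask(source_mask, n_target):
    """Hypothetical helper: per-token target mask for one cell (causal strategy).

    source_mask is a sequence of booleans (one per source token), or None /
    empty when the source has no tokens in this cell.
    """
    # Missing mask for the entire cell: every target token is masked.
    if source_mask is None or len(source_mask) == 0:
        return np.ones(n_target, dtype=bool)

    source_mask = np.asarray(source_mask, dtype=bool)

    # Same token count: reuse the source mask unchanged.
    if len(source_mask) == n_target:
        return source_mask.copy()

    # Different token count: mask the same fraction of target tokens.
    ratio = source_mask.mean()
    n_masked = int(round(ratio * n_target))  # rounding rule is an assumption

    mask = np.zeros(n_target, dtype=bool)
    if n_masked > 0:
        # Assumption: causal masking hides the trailing tokens of the cell.
        mask[n_target - n_masked:] = True
    return mask


if __name__ == "__main__":
    # Source has 4 tokens with half masked; target has 6 tokens in the same cell.
    print(causal_target_mask([False, False, True, True], 6))
    # Cell is empty in the source but holds 3 target tokens.
    print(causal_target_mask(None, 3))
    print(healpix_target_mask(None, 3))
```

The point of transferring the ratio rather than the mask itself is that a per-token mask has no defined meaning when source and target hold different numbers of tokens in a cell, while a fraction does.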

Huge thanks to @shmh40 for all the help.

Issue Number

Closes #911

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@kacpnowak kacpnowak marked this pull request as draft October 6, 2025 15:49
@shmh40 shmh40 self-requested a review October 6, 2025 15:57
@shmh40 shmh40 added the labels "model" (Related to model training or definition, not generic infra) and "data reading" (Everything related to data reading) Oct 6, 2025
Labels
data reading Everything related to data reading model Related to model training or definition (not generic infra)
Development

Successfully merging this pull request may close these issues.

Enable FesomDataReader to have different source and target datasets
2 participants