amiiiza (Collaborator) commented Nov 25, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This PR adds the ATLAS Top Tagging dataset from CERN Open Data to TopoBench’s pointcloud domain.
It introduces a full dataset implementation, loader, and configuration needed to use this dataset for binary top-quark jet tagging.

Main changes:

  • Added ATLASTopTaggingDataset with:
    • Support for constituent-level 4-vectors (pt, eta, phi, energy) for up to 200 particles per jet.
    • Optional high-level jet features (15 variables: mass, τ ratios, ECFs, etc.).
    • Configurable options for:
      • split (train/val/test),
      • subset fraction for fast experimentation,
      • max_constituents,
      • toggling high-level features.
  • Implemented ATLASTopTaggingDatasetLoader in the pointcloud data domain.
  • Implemented a preprocessing pipeline that:
    • Downloads the raw files from the CERN Open Data portal.
    • Handles both compressed .h5.gz and uncompressed .h5 input, with a fallback when .gz files are missing.
    • Saves preprocessed data to a reusable .pt file.
  • Added a stats() helper to summarize:
    • number of jets,
    • class distribution (signal/background),
    • average number of constituents per jet.
  • Added Hydra/OmegaConf configuration files to register the dataset and loader within the existing TopoBench experiment setup.
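
The `.h5.gz` / `.h5` fallback described above can be sketched roughly as follows. This is only an illustration using the standard library, not the code in this PR: the helper names `find_raw_files` and `ensure_uncompressed` and the flat directory layout are assumptions made for the example.

```python
import gzip
import shutil
from pathlib import Path


def find_raw_files(raw_dir):
    """Return raw HDF5 inputs, preferring .h5.gz and falling back to .h5.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    raw_dir = Path(raw_dir)
    compressed = sorted(raw_dir.glob("*.h5.gz"))
    if compressed:
        return compressed
    # Fallback: no compressed files, so look for uncompressed ones.
    uncompressed = sorted(raw_dir.glob("*.h5"))
    if uncompressed:
        return uncompressed
    raise FileNotFoundError(f"No .h5.gz or .h5 files found in {raw_dir}")


def ensure_uncompressed(path):
    """Decompress a .h5.gz file next to itself and return the .h5 path."""
    path = Path(path)
    if path.suffix != ".gz":
        return path  # already an uncompressed .h5 file
    target = path.with_suffix("")  # foo.h5.gz -> foo.h5
    if not target.exists():
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Raising `FileNotFoundError` when neither kind of file exists mirrors the "no files found" error case covered by the unit tests listed under Testing.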

Issue

There is no issue associated with this PR.

Additional context

Data

Source: CERN Open Data Portal - Record 80030
Task: Binary classification (top quark jet tagging)
Size: ~93M events (~280GB compressed)
Features:

  • Constituent-level: 4-vectors for up to 200 particles per jet
    • pt (transverse momentum)
    • eta (pseudorapidity)
    • phi (azimuthal angle)
    • energy
  • High-level (optional): 15 jet-level features including mass, tau ratios, ECF, etc.
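
For readers less familiar with the (pt, eta, phi, energy) parametrization, the sketch below converts constituents to Cartesian momenta with the standard collider-kinematics relations and sums them into a jet invariant mass. It is illustrative only: the function names are hypothetical, natural units (c = 1) are assumed, and it is not claimed to be how the dataset's high-level mass feature was computed.

```python
import math


def to_cartesian(pt, eta, phi, energy):
    """Convert (pt, eta, phi, energy) to (px, py, pz, energy)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    return px, py, pz, energy


def jet_mass(constituents):
    """Invariant mass of the summed constituent 4-vectors.

    `constituents` is an iterable of (pt, eta, phi, energy) tuples.
    """
    px = py = pz = e = 0.0
    for pt, eta, phi, energy in constituents:
        cx, cy, cz, ce = to_cartesian(pt, eta, phi, energy)
        px += cx
        py += cy
        pz += cz
        e += ce
    m2 = e * e - (px * px + py * py + pz * pz)
    # Clamp small negative values caused by floating-point rounding.
    return math.sqrt(max(m2, 0.0))
```

The result is in the same units as the inputs.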

Testing

  • Added unit tests for:
    • download failure handling and logging,
    • preprocessing with .h5 fallback when no .h5.gz files exist,
    • preprocessing error handling when no files are found,
    • flexible HDF5 loading (compressed/uncompressed, label and feature extraction, slicing by num_jets),
    • graph construction, filtering, and transforms in process(),
    • stats() output (basic summary + label distribution).
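
As a rough illustration of what the `stats()` test checks, the summary could be computed along these lines. This is a minimal sketch, not the PR's implementation: the function name `summarize_jets`, the dictionary keys, and the convention that label 1 means signal are all assumptions.

```python
def summarize_jets(labels, num_constituents):
    """Summarize a jet collection.

    Assumes `labels` holds 1 for signal and 0 for background (an assumed
    convention), and `num_constituents` holds the constituent count per jet.
    """
    num_jets = len(labels)
    num_signal = sum(1 for y in labels if y == 1)
    avg = sum(num_constituents) / num_jets if num_jets else 0.0
    return {
        "num_jets": num_jets,
        "num_signal": num_signal,
        "num_background": num_jets - num_signal,
        "avg_constituents": avg,
    }
```

A test can then assert on the returned dictionary directly rather than on printed output.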

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.24%. Comparing base (5cc4932) to head (f22d9ed).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #246      +/-   ##
==========================================
+ Coverage   94.01%   94.24%   +0.23%     
==========================================
  Files         184      186       +2     
  Lines        6664     6936     +272     
==========================================
+ Hits         6265     6537     +272     
  Misses        399      399              

@levtelyatnikov levtelyatnikov added the category-a1 Submission to TDL Challenge 2025: Mission A, Category 1. label Nov 25, 2025
gbg141 (Collaborator) commented Nov 26, 2025

Hi @amiiiza! Two quick comments:

  • Don't worry about the project coverage check: its failure is not related to your PR (I am trying to fix it now). Sorry for the inconvenience!

  • Did you fill out the required Google Form with the information about your PR? We can't find an entry for it.

Thank you!

amiiiza (Collaborator, Author) commented Nov 26, 2025

Hi @gbg141, thank you for letting me know and for resolving the issue. I have also submitted the form; it should be accessible now.
