@theosaulus theosaulus commented Nov 22, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests.
  • My PR follows PEP8 guidelines.
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

The Open Catalyst 2020 (OC20) and 2022 (OC22) datasets together contain 1.3 million molecular relaxations, with results from over 260 million DFT calculations. This PR implements three Open Catalyst (OC20/OC22) datasets:

  1. OC20 IS2RE - Initial Structure to Relaxed Energy prediction task using OC20 data
  2. OC22 IS2RE - Initial Structure to Relaxed Energy prediction task using OC22 data
  3. OC20 S2EF - Structure to Energy and Forces prediction task with multiple training splits (200K, 2M, 20M, all)

Implementation Details:

Dataset Loaders (3 files):

  • topobench/data/loaders/graph/oc20_is2re_dataset_loader.py - IS2REDatasetLoader for OC20
  • topobench/data/loaders/graph/oc22_is2re_dataset_loader.py - OC22IS2REDatasetLoader for OC22
  • topobench/data/loaders/graph/oc20_dataset_loader.py - OC20DatasetLoader for S2EF task

Dataset Classes (3 files):

  • topobench/data/datasets/oc20_is2re_dataset.py - IS2REDataset handling LMDB data
  • topobench/data/datasets/oc22_is2re_dataset.py - OC22IS2REDataset with multiple LMDB files
  • topobench/data/datasets/oc20_dataset.py - OC20Dataset for S2EF with configurable splits

Configuration Files (7 files):

  • configs/dataset/graph/OC20_IS2RE.yaml
  • configs/dataset/graph/OC22_IS2RE.yaml
  • configs/dataset/graph/OC20_S2EF_200K.yaml
  • configs/dataset/graph/OC20_S2EF_2M.yaml
  • configs/dataset/graph/OC20_S2EF_20M.yaml
  • configs/dataset/graph/OC20_S2EF_all.yaml

Test Coverage (20 tests total):

  • test/data/load/test_oc20_datasets.py - 17 unit tests covering:

    • Loader initialization and dataset loading
    • Data item access and validation
    • Split indices validity and non-overlap
    • Integration with PreProcessor pipeline
    • Multiple train split configurations for S2EF
  • test/pipeline/test_pipeline.py - 3 tests, 1 for each dataset
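
The split checks listed above (indices valid and non-overlapping) could look roughly like the following sketch. The dictionary layout (split name mapped to a list of integer indices) and the helper name `check_split_indices` are assumptions for illustration, not taken from the PR's test file.

```python
def check_split_indices(split_idx, num_samples):
    """Validate a split dictionary.

    Hypothetical layout: {"train": [...], "valid": [...], "test": [...]},
    where each value is a sequence of integer sample indices.
    """
    seen = set()
    for name, indices in split_idx.items():
        indices = [int(i) for i in indices]
        idx_set = set(indices)
        # No index may appear twice within the same split.
        if len(idx_set) != len(indices):
            raise ValueError(f"duplicate indices in split '{name}'")
        # Every index must point at an existing sample.
        if any(i < 0 or i >= num_samples for i in idx_set):
            raise ValueError(f"out-of-range index in split '{name}'")
        # Splits must be pairwise disjoint.
        if seen & idx_set:
            raise ValueError(f"split '{name}' overlaps an earlier split")
        seen |= idx_set
    return True
```

Raising `ValueError` with the split name keeps a failing test message readable; a real test would more likely use plain `assert` statements inside a pytest function.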

Issue

This PR addresses TDL Challenge 2025 - Category A1: Broadening Benchmarks with Graph Datasets.

Additional context

Datasets Source: Open Catalyst Project

References:

  • Chanussot, L., et al. (2021). "Open Catalyst 2020 (OC20) Dataset and Community Challenges." ACS Catalysis.
  • Tran, R., et al. (2023). "The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts." ACS Catalysis.

Testing Notes: All tests use max_samples=10 for efficiency. To train on the full datasets, set max_samples: null in the configs.
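
As a minimal sketch of what a `max_samples` cap might do, assuming it simply keeps the first N samples (the PR may subsample differently): a YAML `null` is loaded as Python `None`, which here means "no cap". The function name `select_indices` is hypothetical.

```python
def select_indices(num_total, max_samples=None):
    """Return the sample indices to keep.

    Assumption: the cap keeps the first `max_samples` samples.
    `max_samples=None` (YAML `null`) keeps the full dataset.
    """
    if max_samples is None:
        return list(range(num_total))
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    # Never request more samples than the dataset holds.
    return list(range(min(num_total, max_samples)))
```

With this convention, `select_indices(1000, 10)` keeps ten samples for a quick test run, while `select_indices(1000, None)` keeps all of them.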

@theosaulus theosaulus marked this pull request as draft November 23, 2025 00:04
@theosaulus theosaulus marked this pull request as ready for review November 23, 2025 00:42
@theosaulus
Author

Sorry about the initial version of this PR, which had issues after some file renaming and cleanup. This version passes all tests locally.

@theosaulus
Author

I figured out that the tests are failing because they require downloading the datasets, which are several GB even in their smallest versions. I am not sure how to proceed from here and get the tests to pass...

@levtelyatnikov levtelyatnikov added the category-a1 Submission to TDL Challenge 2025: Mission A, Category 1. label Nov 24, 2025
@levtelyatnikov
Collaborator

Dear Participant,

To help streamline our testing process, could we kindly request that you implement a mock data generator instead of requiring the full dataset to be downloaded? This generator should return a small snapshot of real data that is sufficient for running your pipeline and ensuring the test passes successfully.
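
For illustration only, a mock generator along these lines might resemble the sketch below. It uses only the standard library and fills the samples with synthetic placeholder values rather than a snapshot of real OC20 data. The field names (`atomic_numbers`, `pos`, `edge_index`, `y_relaxed`) and the function name are assumptions, not the PR's actual schema.

```python
import random


def make_mock_is2re_samples(num_samples=10, max_atoms=8, seed=0):
    """Generate small synthetic graph samples (hypothetical schema).

    All values are random placeholders, not physically meaningful.
    """
    rng = random.Random(seed)  # fixed seed keeps tests reproducible
    samples = []
    for _ in range(num_samples):
        n = rng.randint(3, max_atoms)
        atomic_numbers = [rng.randint(1, 83) for _ in range(n)]
        pos = [[rng.uniform(0.0, 10.0) for _ in range(3)] for _ in range(n)]
        # Chain connectivity, stored in both directions (undirected graph).
        src = list(range(n - 1)) + list(range(1, n))
        dst = list(range(1, n)) + list(range(n - 1))
        samples.append(
            {
                "atomic_numbers": atomic_numbers,
                "pos": pos,
                "edge_index": [src, dst],
                "y_relaxed": rng.uniform(-5.0, 5.0),  # placeholder energy
            }
        )
    return samples
```

A loader under test could consume such samples in place of the downloaded LMDB files, so the pipeline runs without any network access.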

Please ensure that the original configuration file for your new dataset is added to the exclusion list. This needs to be placed on line 40 of the file located at: https://github.com/geometric-intelligence/TopoBench/blob/main/test/data/load/test_datasetloaders.py
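
The exclusion mechanism can be pictured as a simple filter over config paths. The sketch below is not the code in `test_datasetloaders.py`; the variable name `EXCLUDED_CONFIGS`, the function name, and the choice of excluded files are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion set: configs whose datasets are too large for CI.
EXCLUDED_CONFIGS = {
    "OC20_S2EF_2M.yaml",
    "OC20_S2EF_20M.yaml",
    "OC20_S2EF_all.yaml",
}


def configs_to_test(config_paths, excluded=EXCLUDED_CONFIGS):
    """Keep only the config files whose file name is not excluded."""
    return [p for p in config_paths if PurePosixPath(p).name not in excluded]
```

Matching on the file name rather than the full path keeps the exclusion list short and independent of where the configs directory lives.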

Thank you very much for making this modification!

@theosaulus
Author

Thank you for your reply. To create a realistic mock data generator, I kept the download only for the smallest of the datasets (360 MB). I added the other configs to the exclusion list, as you mentioned.
On a side note, because TopoBench requires PyTorch 2.3.0, some packages that would simplify the data loading (fairchem-core, to be precise) cannot be used at the moment. If the environment is updated, some parts of my submission can be significantly simplified.
Best

@levtelyatnikov
Collaborator

Dear Participant,

As far as I can see, the test is still failing because it is trying to download the dataset file is2res_train_val_test_lmdbs.zip, which is 8 GB in size.

@theosaulus
Author

theosaulus commented Nov 25, 2025

Hi, thanks for your message.
I commented out the additional tests that I had forgotten to remove. This time it should only test the mock dataset...
Edit: oops, some heavy configs remained... It should be good now.

@theosaulus theosaulus marked this pull request as draft November 26, 2025 00:27
@theosaulus theosaulus marked this pull request as ready for review November 26, 2025 01:39
@theosaulus
Author

I am slightly confused that the tests are stuck at the "Post job cleanup" step...
