Category: A1; Team name: MatTheo; Dataset: OC20/OC22 #233

theosaulus · 2025-11-22T01:41:13Z

Checklist

My pull request has a clear and explanatory title.
My pull request passes the Linting test.
I added appropriate unit tests and I made sure the code passes all unit tests.
My PR follows PEP8 guidelines.
My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
I linked to issues and PRs that are relevant to this PR.

Description

The Open Catalyst 2020 (OC20) and 2022 (OC22) datasets together contain about 1.3 million molecular relaxations, with results from over 260 million DFT calculations. This PR implements three Open Catalyst (OC20/OC22) datasets:

OC20 IS2RE - Initial Structure to Relaxed Energy prediction task using OC20 data
OC22 IS2RE - Initial Structure to Relaxed Energy prediction task using OC22 data
OC20 S2EF - Structure to Energy and Forces prediction task with multiple training splits (200K, 2M, 20M, all)

Implementation Details:

Dataset Loaders (3 files):

topobench/data/loaders/graph/oc20_is2re_dataset_loader.py - IS2REDatasetLoader for OC20
topobench/data/loaders/graph/oc22_is2re_dataset_loader.py - OC22IS2REDatasetLoader for OC22
topobench/data/loaders/graph/oc20_dataset_loader.py - OC20DatasetLoader for S2EF task

Dataset Classes (3 files):

topobench/data/datasets/oc20_is2re_dataset.py - IS2REDataset handling LMDB data
topobench/data/datasets/oc22_is2re_dataset.py - OC22IS2REDataset with multiple LMDB files
topobench/data/datasets/oc20_dataset.py - OC20Dataset for S2EF with configurable splits
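
Because OC22IS2REDataset reads from multiple LMDB files, a global sample index has to be mapped to one file and a position inside it. The following is a minimal sketch of one common way to do this, using cumulative sizes and bisection. The names `locate` and `shard_sizes` are hypothetical and are not taken from this PR; the actual dataset class may resolve indices differently.

```python
import bisect
from itertools import accumulate
from typing import Sequence, Tuple


def locate(index: int, shard_sizes: Sequence[int]) -> Tuple[int, int]:
    """Map a global sample index to (shard number, index inside that shard).

    `shard_sizes` holds the number of samples stored in each LMDB file,
    in the order the files are read (hypothetical layout).
    """
    # Running totals: cumulative[k] is the number of samples in shards 0..k.
    cumulative = list(accumulate(shard_sizes))
    total = cumulative[-1] if cumulative else 0
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} samples")
    # First shard whose running total is strictly greater than the index.
    shard = bisect.bisect_right(cumulative, index)
    offset = cumulative[shard - 1] if shard > 0 else 0
    return shard, index - offset
```

For example, with three files holding 3, 2 and 4 samples, `locate(3, [3, 2, 4])` returns `(1, 0)`, the first sample of the second file.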

Configuration Files (7 files):

configs/dataset/graph/OC20_IS2RE.yaml
configs/dataset/graph/OC22_IS2RE.yaml
configs/dataset/graph/OC20_S2EF_200K.yaml
configs/dataset/graph/OC20_S2EF_2M.yaml
configs/dataset/graph/OC20_S2EF_20M.yaml
configs/dataset/graph/OC20_S2EF_all.yaml

Test Coverage (20 tests total):

test/data/load/test_oc20_datasets.py - 17 unit tests covering:
- Loader initialization and dataset loading
- Data item access and validation
- Split indices validity and non-overlap
- Integration with PreProcessor pipeline
- Multiple train split configurations for S2EF
test/pipeline/test_pipeline.py - 3 tests, 1 for each dataset
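
As an illustration of the "split indices validity and non-overlap" item above, a check of this kind could look roughly like the sketch below. The function name `check_split_indices` and the dictionary layout of the splits are assumptions made for the example; the tests in this PR may be written differently.

```python
from typing import Dict, Sequence


def check_split_indices(splits: Dict[str, Sequence[int]], num_samples: int) -> None:
    """Raise ValueError if split indices are out of range, repeated, or shared.

    `splits` maps a split name (e.g. "train") to its sample indices
    (hypothetical layout, not the PR's actual data structure).
    """
    seen = set()
    for name, indices in splits.items():
        unique = {int(i) for i in indices}
        if len(unique) != len(indices):
            raise ValueError(f"split {name!r} contains repeated indices")
        if any(i < 0 or i >= num_samples for i in unique):
            raise ValueError(f"split {name!r} has indices outside [0, {num_samples})")
        if not seen.isdisjoint(unique):
            raise ValueError(f"split {name!r} overlaps with an earlier split")
        seen |= unique


# With max_samples=10, a valid partition passes silently.
check_split_indices({"train": range(0, 8), "valid": [8], "test": [9]}, num_samples=10)
```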

Issue

This PR addresses TDL Challenge 2025 - Category A1: Broadening Benchmarks with Graph Datasets.

Additional context

Datasets Source: Open Catalyst Project

References:

Chanussot, L., et al. (2021). "Open Catalyst 2020 (OC20) Dataset and Community Challenges." ACS Catalysis.
Tran, R., et al. (2023). "The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts." ACS Catalysis.

Testing Notes: all tests use max_samples=10 for efficiency; users should set max_samples: null in the configs to train on the full dataset.
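
To make the effect of this setting concrete, here is a small sketch of how a `max_samples` cap could be interpreted, assuming that a YAML `null` reaches the code as Python `None`. The helper name `effective_length` is hypothetical; the dataset classes in this PR may apply the cap in another way.

```python
from typing import Optional


def effective_length(total: int, max_samples: Optional[int]) -> int:
    """Number of samples exposed by a dataset holding `total` samples.

    `max_samples=None` (YAML `null`) means "no cap"; otherwise the dataset
    is truncated to at most `max_samples` items.
    """
    if max_samples is None:
        return total
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    return min(total, max_samples)
```

With this convention, `effective_length(1000, 10)` gives 10 (the test setting) and `effective_length(1000, None)` gives 1000 (the full dataset).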

theosaulus · 2025-11-23T00:45:03Z

Sorry for the initial PR, which had issues after some file renaming and cleanup. This version passes all tests locally.

theosaulus · 2025-11-23T05:32:27Z

I figured out that the tests are failing because they require downloading the datasets, which are several GB even in their smallest versions. I am not sure how to proceed from here and get the tests to pass...

levtelyatnikov · 2025-11-24T18:04:11Z

Dear Participant,

To help streamline our testing process, could we kindly request that you implement a mock data generator instead of requiring the full dataset to be downloaded? This generator should return a small snapshot of real data that is sufficient for running your pipeline and ensuring the test passes successfully.

Please ensure that the original configuration file for your new dataset is added to the exclusion list. This needs to be placed on line 40 of the file located at: https://github.com/geometric-intelligence/TopoBench/blob/main/test/data/load/test_datasetloaders.py

Thank you very much for making this modification!
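
For readers unfamiliar with the idea, a mock data generator for this kind of dataset could be sketched as follows. Note that the request above asks for a small snapshot of real data; this sketch instead produces random synthetic values and only mimics the shape of an atomistic graph sample. Every name here (`MockSample`, `make_mock_samples`, the field names, the cutoff value) is hypothetical and not taken from TopoBench or from this PR.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MockSample:
    atomic_numbers: List[int]                # one entry per atom
    pos: List[Tuple[float, float, float]]    # Cartesian coordinates
    edge_index: List[Tuple[int, int]]        # directed (source, target) pairs
    y_relaxed: float                         # placeholder target energy


def make_mock_samples(num_samples: int = 10, min_atoms: int = 4,
                      max_atoms: int = 8, cutoff: float = 3.0,
                      seed: int = 0) -> List[MockSample]:
    """Generate small random samples; the same seed gives the same samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        n = rng.randint(min_atoms, max_atoms)
        atomic_numbers = [rng.randint(1, 83) for _ in range(n)]
        pos = [tuple(rng.uniform(0.0, 5.0) for _ in range(3)) for _ in range(n)]
        # Connect every ordered pair of distinct atoms closer than the cutoff.
        edge_index = [(i, j) for i in range(n) for j in range(n)
                      if i != j and math.dist(pos[i], pos[j]) <= cutoff]
        samples.append(MockSample(atomic_numbers, pos, edge_index,
                                  rng.uniform(-10.0, 10.0)))
    return samples
```

A generator like this needs no download, so a test can call `make_mock_samples(num_samples=10)` and feed the result to the rest of the pipeline.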

theosaulus · 2025-11-25T03:56:02Z

Thank you for your reply. To create a realistic mock data generator, I kept the download only for the smallest of the datasets (360 MB). I added the other configs to the exclusion list as you mentioned.
On a side note, because the PyTorch version in TopoBench has to be 2.3.0, some useful packages that would facilitate the data loading cannot be used at the moment (fairchem-core, to be precise). If the env is updated, some parts of my submission could be significantly simplified.
Best
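
One way such a version constraint is often handled is to treat the package as an optional dependency and fall back to a built-in code path when it is missing. The sketch below illustrates that pattern only. It assumes the top-level import name of fairchem-core is `fairchem`; the helper names `has_module` and `pick_lmdb_backend` and the backend labels are hypothetical and not part of this PR.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(name) is not None


def pick_lmdb_backend() -> str:
    """Choose a data-loading path depending on what is installed.

    Assumption: fairchem-core is importable as `fairchem`.
    """
    return "fairchem" if has_module("fairchem") else "builtin"
```

With this pattern the environment does not need to change for the fallback to work, and the simpler path is picked up automatically once the optional package becomes installable.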

levtelyatnikov · 2025-11-25T08:31:02Z

Dear Participant,

As far as I can see, the test is still failing because it is trying to download the dataset file is2res_train_val_test_lmdbs.zip, which is 8 GB in size.

theosaulus · 2025-11-25T17:03:26Z

Hi, thanks for your message.
I commented out the additional tests that I had forgotten to remove; this time it should only test the mock dataset...
Edit: oops, some heavy configs remained... should be good now.

theosaulus · 2025-11-26T04:29:04Z

I am slightly confused about the fact that the tests are stuck at the "Post job cleanup" step...

theosaulus and others added 12 commits November 17, 2025 15:17

initial commit for OC20/22

6f16d4d

preprocessing still not fully working

42e7ad8

preprocessing seems to work now although slow

2d40f9d

code cleaning and separating different functions in different files to respect the code tree

869894d

IS2RE works

b290b42

format

bf3d279

renaming and tests

42adabf

keep some files untouched

01f082e

Merge pull request #1 from theosaulus/oc20

de9f3e7

Category: A1; Team name: MatTheo; Dataset: OC20/OC22

ruff fix

e7c1e10

fixed data splits, tests, and code running

c9bc65f

Merge pull request #2 from theosaulus/oc20

6e845b5

Oc20

theosaulus marked this pull request as draft November 23, 2025 00:04

theosaulus and others added 2 commits November 22, 2025 19:41

configs and tests

2f50eea

Merge pull request #3 from theosaulus/oc20

e76187f

configs and tests

theosaulus marked this pull request as ready for review November 23, 2025 00:42

levtelyatnikov added the category-a1 Submission to TDL Challenge 2025: Mission A, Category 1. label Nov 24, 2025

theosaulus and others added 2 commits November 24, 2025 22:54

mock config and avoid testing the other configs

0da47c7

Merge pull request #4 from theosaulus/oc20

2f589c3

mock config and avoid testing the other configs

theosaulus and others added 2 commits November 25, 2025 11:59

remove heavy tests on the larger datasets

c9baf78

Merge pull request #5 from theosaulus/oc20

08d57ab

remove heavy tests on the larger datasets

theosaulus and others added 3 commits November 25, 2025 18:57

removing again unnecessary tests and bug fix on number of loaded molecules

554e397

ruff

5b13f85

Merge pull request #6 from theosaulus/oc20

3c1ac1a

Oc20

theosaulus marked this pull request as draft November 26, 2025 00:27

theosaulus and others added 2 commits November 25, 2025 20:38

fixed tests cleanly

48f5b65

Merge pull request #7 from theosaulus/oc20

97f0a85

fixed tests cleanly

theosaulus marked this pull request as ready for review November 26, 2025 01:39

theosaulus and others added 2 commits November 25, 2025 21:14

ase package

6dde111

Merge pull request #8 from theosaulus/oc20

7fd4b08

ase package

gbg141 closed this Nov 26, 2025

gbg141 reopened this Nov 26, 2025

levtelyatnikov mentioned this pull request Nov 26, 2025

Category: B2; Team name: NeuroTriangles; Dataset: A123CortexM (Mouse Auditory Cortex) #252

Open

7 tasks
