Category: A1; Team name: MatTheo; Dataset: OC20/OC22 #233
…o respect the code tree
Category: A1; Team name: MatTheo; Dataset: OC20/OC22
configs and tests
Sorry for the initial PR, which had issues after some file renaming and cleanup. This version passes all tests locally.
I figured out that the tests are failing because they need to download the datasets, which are several GB even in their smallest versions. I am not sure how to proceed from here and get the tests to pass...
Dear Participant,
To help streamline our testing process, could you please implement a mock data generator instead of requiring the full dataset to be downloaded? The generator should return a small snapshot of real data that is sufficient to run your pipeline and let the tests pass.
Please also add the original configuration file for your new dataset to the exclusion list, which is on line 40 of this file: https://github.com/geometric-intelligence/TopoBench/blob/main/test/data/load/test_datasetloaders.py
Thank you very much for making this modification!
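As a rough illustration of the kind of mock data generator requested above, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: the names `MockSample` and `make_mock_samples`, the field names, the element list, and the distance cutoff are assumptions for illustration and are not taken from this PR or from TopoBench. It also fills the samples with seeded random placeholder values, whereas the request above asks for a small snapshot of real data, so a real implementation would load a few stored records instead.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MockSample:
    """Hypothetical stand-in for one structure with a scalar energy target."""

    atomic_numbers: List[int]
    positions: List[Tuple[float, float, float]]
    edge_index: List[Tuple[int, int]]
    y_relaxed: float


def make_mock_samples(
    num_samples: int = 10,
    max_atoms: int = 8,
    cutoff: float = 3.0,
    seed: int = 0,
) -> List[MockSample]:
    """Build a small, deterministic list of synthetic samples.

    Nothing is downloaded: all values are random placeholders drawn from a
    seeded generator, so repeated calls with the same seed give equal output.
    """
    rng = random.Random(seed)
    elements = [1, 6, 8, 29, 78]  # H, C, O, Cu, Pt (arbitrary choice)
    samples = []
    for _ in range(num_samples):
        n_atoms = rng.randint(2, max_atoms)
        numbers = [rng.choice(elements) for _ in range(n_atoms)]
        positions = [
            (rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0))
            for _ in range(n_atoms)
        ]
        # Connect every ordered pair of distinct atoms closer than the cutoff.
        edges = [
            (i, j)
            for i in range(n_atoms)
            for j in range(n_atoms)
            if i != j and math.dist(positions[i], positions[j]) <= cutoff
        ]
        samples.append(MockSample(numbers, positions, edges, rng.uniform(-5.0, 5.0)))
    return samples
```

A test could then call `make_mock_samples(num_samples=10)` in place of the dataset download and feed the result through the rest of the pipeline.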
mock config and avoid testing the other configs
Thank you for your reply. To create a realistic mock data generator, I kept the download only for the smallest of the datasets (360 MB). I added the other configs to the exclusion list, as you mentioned.
Dear Participant, as far as I can see, the test is still failing because it tries to download the dataset file is2res_train_val_test_lmdbs.zip, which is 8 GB in size.
remove heavy tests on the larger datasets
Hi, thanks for your message.
fixed tests cleanly
ase package
I am slightly confused that the tests are stuck at the "Post job cleanup" step...
Checklist
Description
The Open Catalyst 2020 (OC20) and 2022 (OC22) datasets altogether contain 1.3 million molecular relaxations with results from over 260 million DFT calculations. This PR implements three Open Catalyst (OC20/OC22) datasets:
Implementation Details:
Dataset Loaders (3 files):
- topobench/data/loaders/graph/oc20_is2re_dataset_loader.py - IS2REDatasetLoader for OC20
- topobench/data/loaders/graph/oc22_is2re_dataset_loader.py - OC22IS2REDatasetLoader for OC22
- topobench/data/loaders/graph/oc20_dataset_loader.py - OC20DatasetLoader for S2EF task

Dataset Classes (3 files):
- topobench/data/datasets/oc20_is2re_dataset.py - IS2REDataset handling LMDB data
- topobench/data/datasets/oc22_is2re_dataset.py - OC22IS2REDataset with multiple LMDB files
- topobench/data/datasets/oc20_dataset.py - OC20Dataset for S2EF with configurable splits

Configuration Files (7 files):
- configs/dataset/graph/OC20_IS2RE.yaml
- configs/dataset/graph/OC22_IS2RE.yaml
- configs/dataset/graph/OC20_S2EF_200K.yaml
- configs/dataset/graph/OC20_S2EF_2M.yaml
- configs/dataset/graph/OC20_S2EF_20M.yaml
- configs/dataset/graph/OC20_S2EF_all.yaml

Test Coverage (20 tests total):
- test/data/load/test_oc20_datasets.py - 17 unit tests covering:
- test/pipeline/test_pipeline.py - 3 tests, 1 for each dataset

Issue
This PR addresses TDL Challenge 2025 - Category A1: Broadening Benchmarks with Graph Datasets.
Additional context
Datasets Source: Open Catalyst Project
References:
Testing Notes: all tests use `max_samples=10` for efficiency, but users should set `max_samples: null` in configs for full dataset training.
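The effect of a `max_samples` setting can be sketched as follows. This is an illustration only: the helper `limit_samples` is a hypothetical name, not code from this PR, and it assumes that `max_samples` simply truncates the dataset to its first entries. A YAML `null` is read as Python `None`, which this sketch treats as "no limit".

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def limit_samples(items: Sequence[T], max_samples: Optional[int] = None) -> List[T]:
    """Return the first `max_samples` items, or all items when it is None."""
    if max_samples is None:
        # Corresponds to `max_samples: null` in a config: use the full dataset.
        return list(items)
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    # Slicing past the end is safe, so a short dataset is returned unchanged.
    return list(items[:max_samples])


indices = list(range(1000))
test_subset = limit_samples(indices, max_samples=10)  # small subset for tests
full_set = limit_samples(indices, max_samples=None)  # full dataset for training
```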