@demiqin demiqin commented Nov 24, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This pull request adds support for the OMol25Metals dataset, a metal–complex subset of the OMol25 benchmark, as a new higher-order molecular dataset in the hypergraph domain for the TAG-DS Topological Deep Learning Challenge 2025.

Dataset and features

OMol25 (https://arxiv.org/abs/2505.08762) is one of the largest publicly available molecular datasets; this PR covers its metal-complex subset. In our integration:

  • Each sample is a metal-containing molecule with:

    • Node features: atom-level descriptors derived from OMol25 (for example charge-related quantities, coordination, and local environment features).
    • Edge features: bond-level information capturing chemical connectivity.
    • Face / rank-2 features: higher-order groups such as rings and related metal–ring structures. We explicitly compute ring-related features and store, as a rank-2 descriptor, a flag marking whether each face is a metal-containing ring.
  • Graph-level targets and graph features:
    We expose several OMol25 scalar quantities as graph-level features so users can define different regression tasks without rebuilding the dataset:

    • y: total energy term (used as the default scalar regression target in our config).
    • nl_energy
    • spin
    • homo_energy
    • homo_lumo_gap

    In the current configuration, the dataset is set up as a graph-level regression benchmark on a single scalar, y. The other entries above are also stored as graph-level attributes, so users can:

    • Switch the main label to any of these quantities.
    • Treat them as a multi-target regression problem.
    • Use them as auxiliary graph-level features for more complex models.
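
As a rough illustration of the options above, the sketch below uses a plain `SimpleNamespace` as a stand-in for one processed sample. The attribute names come from the list above; the helper names `set_target` and `stack_targets` and all numeric values are hypothetical, and the real dataset presumably stores these quantities as tensors on a PyG `Data` object rather than Python floats.

```python
from types import SimpleNamespace

# Graph-level attribute names as listed in this PR description.
GRAPH_ATTRS = ("y", "nl_energy", "spin", "homo_energy", "homo_lumo_gap")


def set_target(sample, name):
    """Return a shallow copy of `sample` whose `y` holds the attribute `name`."""
    if name not in GRAPH_ATTRS:
        raise KeyError(f"unknown graph-level attribute: {name}")
    out = SimpleNamespace(**vars(sample))  # leave the original sample untouched
    out.y = getattr(sample, name)
    return out


def stack_targets(sample, names):
    """Collect several graph-level attributes into one multi-target list."""
    return [getattr(sample, n) for n in names]


# Placeholder values, not real OMol25 numbers.
sample = SimpleNamespace(
    y=-1.0, nl_energy=-0.5, spin=3, homo_energy=-0.2, homo_lumo_gap=0.1
)

gap_task = set_target(sample, "homo_lumo_gap")         # single-target switch
multi = stack_targets(sample, ["y", "homo_lumo_gap"])  # multi-target regression
```

With real samples the same idea would apply per `Data` object, for example inside a dataset transform, so the processed file does not need to be rebuilt.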

These node, edge, face, and graph-level features are stored in the processed data.pt file so users can directly feed them into higher-order neural networks (for example GCCN-style models, cell-complex networks, or hypergraph architectures) without needing to reconstruct higher-order structure from the raw molecules.

Issue

There is no existing GitHub issue associated with this contribution.

Additional context

Contributors

Data provenance and processing pipeline

  • Base dataset: OMol25 (FAIR), described at
    https://huggingface.co/facebook/OMol25/blob/main/DATASET.md
  • We select metal complexes from OMol25 and process them with a dedicated pipeline:
    • Parse molecules and filter for metal-containing species.
    • Convert to ASE and then to PyG Data objects (graph structure).
    • Compute higher-order features:
      • Node-level descriptors.
      • Edge-level bond features.
      • Face / rank-2 features that encode ring structure, including a flag for metal rings and other group-level statistics.
    • Export the result as a single processed/data.pt file used by OMol25MetalsDataset.
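
The filtering and metal-ring steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the pipeline's actual code: the `METALS` set (transition metals only), the helper names, and the toy molecule are all hypothetical, and rings are assumed to be given as lists of atom indices.

```python
# Assumption: "metal" means a transition metal (Sc-Zn, Y-Cd, Hf-Hg).
# The real pipeline may use a different or broader definition.
METALS = set(range(21, 31)) | set(range(39, 49)) | set(range(72, 81))


def is_metal_complex(atomic_numbers):
    """True if at least one atom in the molecule is in the METALS set."""
    return any(z in METALS for z in atomic_numbers)


def metal_ring_flags(atomic_numbers, rings):
    """For each ring (a list of atom indices), 1 if it contains a metal, else 0."""
    return [
        int(any(atomic_numbers[i] in METALS for i in ring))
        for ring in rings
    ]


# Toy molecule: atom 0 is Fe (Z=26), atoms 1 and 4 are N (Z=7), the rest are C (Z=6).
atomic_numbers = [26, 7, 6, 6, 7, 6, 6, 6]
rings = [
    [0, 1, 2, 3, 4],  # chelate-style ring through the metal
    [2, 3, 5, 6, 7],  # all-carbon ring
]

keep = is_metal_complex(atomic_numbers)
flags = metal_ring_flags(atomic_numbers, rings)
```

In a full pipeline the flags would become one column of the rank-2 feature matrix, with one row per face.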

The processing code and documentation are hosted in our OMol25Metals pipeline repository:

This integration is intended to let users:

  • Use OMol25Metals as a higher-order molecular benchmark in TopoBench.
  • Directly run both standard GNN baselines and higher-order models (for example GCCN / cell-complex networks) on a realistic energy-science dataset with explicit node, edge, and face features.

@levtelyatnikov levtelyatnikov added the category-a2 Submission to TDL Challenge 2025: Mission A, Category 2. label Nov 25, 2025