@alexsandro-santos alexsandro-santos commented Nov 20, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This pull request integrates two datasets from the Open Graph Benchmark (OGB) into the TopoBench framework: ogbn-arxiv and ogbn-products. Both datasets are homogeneous, single-label node classification tasks, which makes them fully compatible with TopoBench’s current graph pipeline.

The first dataset, ogbn-arxiv, is a directed citation network of computer science papers, where each node corresponds to a paper described by a 128-dimensional word-embedding feature vector, and the task is to predict one of forty subject areas. The second dataset, ogbn-products, is a large co-purchase graph where nodes represent Amazon products with 100-dimensional semantic features, and the task is to classify each product into one of forty-seven categories.

To support these datasets, this PR adds a unified dataset loader that handles feature casting and label formatting. Corresponding dataset configuration files have been added under configs/dataset/graph/, following the same structure and conventions used by the existing TopoBench datasets. Although this PR adds full support for ogbn-products, it is intentionally not used in test/pipeline/test_pipeline.py, because of its size.
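The feature casting and label formatting mentioned above can be sketched as follows. This is only an illustration, not the loader added in this PR: the helper name `format_node_data` is hypothetical, NumPy stands in for the tensor library, and it assumes that labels arrive as an `(N, 1)` column that a single-label pipeline wants as a flat `(N,)` integer vector.

```python
import numpy as np


def format_node_data(features, labels):
    """Hypothetical helper: cast features to float32 and flatten (N, 1) labels.

    Assumption: raw labels come as an (N, 1) column, while a single-label
    classification pipeline expects a 1-D integer vector of length N.
    """
    x = np.asarray(features, dtype=np.float32)
    y = np.asarray(labels)

    # Drop the trailing singleton dimension if present: (N, 1) -> (N,)
    if y.ndim == 2 and y.shape[1] == 1:
        y = y[:, 0]
    if y.ndim != 1:
        raise ValueError(f"expected single-label targets, got shape {y.shape}")
    if x.shape[0] != y.shape[0]:
        raise ValueError("features and labels disagree on the number of nodes")

    return x, y.astype(np.int64)


# Toy input: three nodes, two features each, labels stored as a column.
x, y = format_node_data([[1, 2], [3, 4], [5, 6]], [[0], [2], [1]])
```

Rejecting anything that is not one label per node keeps the sketch in line with the single-label assumption described above.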

Additional context

Although the OGBN collection contains five datasets in total, only ogbn-arxiv and ogbn-products are included in this PR. The remaining three datasets require functionality that is beyond TopoBench’s current architecture. ogbn-mag is a heterogeneous graph containing multiple node and edge types, which would require dedicated hetero-graph support to integrate properly. ogbn-proteins is a multi-label binary classification task with no node features, which does not align with TopoBench’s assumption of single-label node classification on feature-bearing graphs. Finally, ogbn-papers100M is extremely large and cannot feasibly be downloaded or tested in the CI environment. Because these datasets would require substantial structural additions or special-case handling, they are intentionally omitted from this PR to keep the scope clean and aligned with the existing pipeline.
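The selection criteria above can be written down as a small table of flags. This is a sketch for illustration only: the `OgbnDataset` class, its field names, and `exclusion_reasons` are hypothetical and not part of TopoBench, and the flags encode only what this description states about each dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OgbnDataset:
    """Hypothetical record of the properties discussed in this PR."""
    name: str
    homogeneous: bool
    single_label: bool
    has_node_features: bool
    size_manageable: bool  # can be downloaded and handled by the pipeline


CANDIDATES = [
    OgbnDataset("ogbn-arxiv", True, True, True, True),
    # Supported, but left out of test/pipeline/test_pipeline.py due to size.
    OgbnDataset("ogbn-products", True, True, True, True),
    OgbnDataset("ogbn-mag", False, True, True, True),
    OgbnDataset("ogbn-proteins", True, False, False, True),
    OgbnDataset("ogbn-papers100M", True, True, True, False),
]


def exclusion_reasons(ds):
    """Return the reasons a dataset falls outside the current pipeline."""
    reasons = []
    if not ds.homogeneous:
        reasons.append("heterogeneous graph")
    if not ds.single_label:
        reasons.append("multi-label task")
    if not ds.has_node_features:
        reasons.append("no node features")
    if not ds.size_manageable:
        reasons.append("too large for CI")
    return reasons


supported = [ds.name for ds in CANDIDATES if not exclusion_reasons(ds)]
print(supported)
```

Only the two datasets with no exclusion reasons remain, which matches the scope of this PR.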

Co-authored by: @giovanni-br and @alexsandro-santos

@alexsandro-santos alexsandro-santos marked this pull request as ready for review November 20, 2025 23:35
@giovanni-br giovanni-br marked this pull request as draft November 21, 2025 13:43
@giovanni-br giovanni-br marked this pull request as ready for review November 21, 2025 16:07
@levtelyatnikov levtelyatnikov added the category-a1 Submission to TDL Challenge 2025: Mission A, Category 1. label Nov 24, 2025