@henrytsay

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This PR implements Challenge B1: a scalable data loading pipeline for large-scale inductive learning settings. It adds an OnDiskPreProcessor class that processes large datasets without holding them in memory.

Key Features

OnDiskPreProcessor Implementation (topobench/data/preprocessor/on_disk_preprocessor.py):

  • Inherits from torch_geometric.data.OnDiskDataset for on-disk storage
  • Processes raw data samples one at a time, saving each to disk immediately as data_{i}.pt
  • Supports topological lifting transformations via DataTransform integration
  • Avoids memory bottlenecks by never loading the entire dataset into RAM
  • Maintains full compatibility with existing PreProcessor API and split loading methods
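
The one-sample-at-a-time pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `OnDiskStoreSketch`, `raw_samples`, and `add_num_values` are hypothetical names, `pickle` stands in for `torch.save`, and the class does not inherit from `torch_geometric.data.OnDiskDataset`. Only the `data_{i}.pt` file naming is taken from the description.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterable, Optional


class OnDiskStoreSketch:
    """Hypothetical stand-in for the PR's OnDiskPreProcessor (illustration only)."""

    def __init__(self, root: str, transform: Optional[Callable[[Any], Any]] = None):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.transform = transform
        self._num_samples = 0

    def process(self, samples: Iterable[Any]) -> None:
        # Consume the iterable lazily: each sample is transformed and written
        # to disk before the next one is pulled from the source.
        for i, sample in enumerate(samples):
            if self.transform is not None:
                sample = self.transform(sample)
            # pickle is an assumption here; it stands in for torch.save.
            with open(self.root / f"data_{i}.pt", "wb") as f:
                pickle.dump(sample, f)
            self._num_samples = i + 1

    def __len__(self) -> int:
        return self._num_samples

    def __getitem__(self, idx: int) -> Any:
        # Samples are read back from disk only when they are requested.
        if not 0 <= idx < self._num_samples:
            raise IndexError(idx)
        with open(self.root / f"data_{idx}.pt", "rb") as f:
            return pickle.load(f)


def raw_samples(n: int):
    # Generator source, so the full dataset never exists as one list.
    for i in range(n):
        yield {"x": [i, i + 1], "y": i % 2}


def add_num_values(sample):
    # Hypothetical transform standing in for a topological lifting.
    return {**sample, "num_values": len(sample["x"])}


with tempfile.TemporaryDirectory() as tmp:
    store = OnDiskStoreSketch(tmp, transform=add_num_values)
    store.process(raw_samples(5))
    stored_files = sorted(p.name for p in Path(tmp).iterdir())
    third = store[2]
```

Because `process` only ever holds the current sample, peak memory in this sketch depends on the largest single sample rather than on the dataset size, which is the property the bullets above describe.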

Configuration Support:

  • Added use_on_disk_preprocessing: true flag to dataset configs
  • Example config: MUTAG_ondisk.yaml
  • Integrated with topobench/run.py for seamless config-based switching
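
As a rough sketch of what config-based switching could look like, the snippet below picks a preprocessor class from a dataset config. Only the `use_on_disk_preprocessing` flag name comes from this PR; `select_preprocessor`, the two stub classes, the plain-dict config, and the in-memory default when the flag is absent are all assumptions made for illustration and do not reflect the actual code in topobench/run.py.

```python
from typing import Any, Mapping


class InMemoryPreProcessorStub:
    """Placeholder for the existing in-memory PreProcessor (hypothetical)."""


class OnDiskPreProcessorStub:
    """Placeholder for the PR's OnDiskPreProcessor (hypothetical)."""


def select_preprocessor(dataset_cfg: Mapping[str, Any]) -> type:
    # Assumed behavior: a missing flag falls back to the in-memory path.
    if dataset_cfg.get("use_on_disk_preprocessing", False):
        return OnDiskPreProcessorStub
    return InMemoryPreProcessorStub


# A dict standing in for the parsed dataset config.
on_disk_choice = select_preprocessor({"use_on_disk_preprocessing": True})
default_choice = select_preprocessor({})
```

Keeping the choice behind a single flag means existing configs that do not set it would keep their current behavior under this assumed default.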

Comprehensive Testing:

  • 23 unit tests in test/data/preprocess/test_on_disk_preprocessor.py
  • Tests cover: initialization, transforms, split loading, edge cases, and memory efficiency
  • End-to-end integration tests in test/pipeline/test_on_disk_pipeline.py
  • Validates model training with OnDiskPreProcessor using GCN on MUTAG dataset

@levtelyatnikov added the category-b1 label (Submission to TDL Challenge 2025: Mission B, Category 1.) on Nov 26, 2025