Conversation

@grapentt grapentt commented Nov 25, 2025

🚀 B1 Bonus: On-Disk Transductive Learning with SQLite-Backed Structure Indexing

📌 Problem Statement

Challenge: Training Topological Neural Networks (TNNs) on large transductive graphs (100K+ nodes) with complex topological structures (cliques, cycles, etc.) faces fundamental memory limitations:

  1. Structure Explosion: A graph with 100K nodes can contain millions of cliques/simplices
  2. Preprocessing Bottleneck: Computing and storing all topological structures requires O(N × D^k) memory where:
    • N = number of nodes
    • D = average degree
    • k = structure size (e.g., 3 for triangles)
  3. Transductive Constraint: Unlike inductive learning (separate train/test graphs), transductive learning requires sampling from a single large graph with train/val/test masks

Real-World Impact: Popular benchmarks like ogbn-products (2.4M nodes, 61M edges) are currently infeasible for TNNs due to these memory constraints.
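
Back-of-the-envelope, using the numbers above: ogbn-products has N ≈ 2.4M and average degree D ≈ 2 × 61M / 2.4M ≈ 51, so the k = 3 bound gives N × D³ ≈ 3 × 10¹¹ candidate structures, far beyond what fits in memory.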


💡 Our Approach: Two-Strategy Solution

We developed two complementary strategies for memory-efficient transductive TNN training:

Strategy 1: Structure-Centric Sampling 🎯

Guarantee: 100% Structure Completeness

# Sample structures FIRST, then extract nodes
structures = sample_k_cliques(target=500)  # Sample 500 cliques
nodes = union_of_nodes(structures)          # Extract their nodes
batch = build_subgraph(nodes, structures)   # Create batch

Key Innovation: Reverses traditional graph sampling—we sample topological structures (cliques) first, then derive the node set. This guarantees all sampled structures are 100% complete in the batch (no missing nodes).
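
As a minimal sketch of the structures-first idea (using NetworkX directly for clarity; in this PR the cliques come from the SQLite index instead, and all names below are illustrative):

import random
import networkx as nx

def sample_structure_centric(G, target=500, k=3, seed=0):
    # Enumerate k-cliques (the PR precomputes these into SQLite instead)
    cliques = [c for c in nx.enumerate_all_cliques(G) if len(c) == k]
    rng = random.Random(seed)
    sampled = rng.sample(cliques, min(target, len(cliques)))
    # Derive the node set from the structures: every sampled clique
    # is complete in the induced subgraph by construction
    nodes = set().union(*map(set, sampled)) if sampled else set()
    return G.subgraph(nodes).copy(), sampled

batch_graph, batch_cliques = sample_structure_centric(nx.karate_club_graph(), target=20)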

Strategy 2: Extended Context Sampling 🌐

Near-Complete Structures with Topology-Aware Heuristics

# Sample nodes FIRST (cluster-aware), then expand context
core_nodes = cluster_aware_sample(batch_size=1000)     # Louvain clustering
context_nodes = expand_context(core_nodes, ratio=1.5)  # +50% context
all_nodes = core_nodes | context_nodes                 # Core + context union
structures = query_structures(all_nodes)               # Query structures
batch = build_subgraph(all_nodes, structures, core_mask)

Key Innovation: Uses graph topology (community detection) to sample dense regions, then adds context nodes to increase structure completeness. Distinguishes "core" vs "context" nodes for loss computation.
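
A minimal sketch of this strategy, assuming NetworkX's built-in Louvain implementation; the simple 1-hop expansion heuristic below is illustrative, not the PR's exact code:

import random
import networkx as nx

def sample_extended_context(G, batch_size=1000, ratio=1.5, seed=0):
    rng = random.Random(seed)
    # Cluster-aware core sampling: draw core nodes from a dense community
    communities = nx.community.louvain_communities(G, seed=seed)
    largest = max(communities, key=len)
    core = set(rng.sample(sorted(largest), min(batch_size, len(largest))))
    # Expand with 1-hop neighbors until the +50% context budget is met
    budget = int(len(core) * ratio)
    context = set(core)
    for u in core:
        context.update(G.neighbors(u))
        if len(context) >= budget:
            break
    # Core vs. context mask: only core nodes contribute to the loss
    core_mask = {n: n in core for n in context}
    return G.subgraph(context).copy(), core_mask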


🏗️ Architecture & Workflow

Core Components

┌─────────────────────────────────────────────────────────────┐
│  1. OnDiskTransductivePreprocessor                          │
│     - Converts PyG Data → NetworkX graph                    │
│     - Detects topological structures (cliques up to size k) │
│     - Builds SQLite index via CliqueQueryEngine             │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  2. SQLite Index (Persistent Disk Storage)                  │
│     - Table: cliques (id, size, node_list_json)            │
│     - Table: node_to_cliques (node_id, clique_id)          │
│     - Enables fast queries: O(log N) per structure         │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  3. Strategy-Specific Samplers                              │
│     A. StructureCentricSampler                              │
│        - Samples structure IDs directly                     │
│        - Budget-aware: stops at node_budget                 │
│     B. ClusterAwareNodeSampler                              │
│        - Uses Louvain/KMeans clustering                     │
│        - Samples from dense communities                     │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  4. Collate Functions (On-Demand Structure Query)          │
│     - Query structures from SQLite index                    │
│     - Build PyG Data batch with subgraph                    │
│     - Apply topological transforms (liftings) at batch-time│
│     - Return: Data(x, edge_index, precomputed_structures)  │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  5. TransductiveSplitDataset (Seamless Integration)        │
│     - Wraps preprocessor + sampler + collate               │
│     - Compatible with TBDataloader (batch_size=1)          │
│     - Lazy materialization for memory efficiency           │
└─────────────────────────────────────────────────────────────┘
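
The SQLite index in box 2 can be pictured with the following sqlite3 sketch (the exact schema inside CliqueQueryEngine may differ; the columns follow the diagram):

import json
import sqlite3

conn = sqlite3.connect("./index/structures.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS cliques (
    id             INTEGER PRIMARY KEY,
    size           INTEGER NOT NULL,
    node_list_json TEXT    NOT NULL
);
CREATE TABLE IF NOT EXISTS node_to_cliques (
    node_id   INTEGER NOT NULL,
    clique_id INTEGER NOT NULL REFERENCES cliques(id)
);
-- B-tree indices behind the O(log N) lookups
CREATE INDEX IF NOT EXISTS idx_clique_size ON cliques(size);
CREATE INDEX IF NOT EXISTS idx_node ON node_to_cliques(node_id);
""")
# Store one triangle: the clique row plus its node-to-clique mappings
cur = conn.execute("INSERT INTO cliques (size, node_list_json) VALUES (?, ?)",
                   (3, json.dumps([0, 1, 2])))
conn.executemany("INSERT INTO node_to_cliques VALUES (?, ?)",
                 [(n, cur.lastrowid) for n in (0, 1, 2)])
conn.commit()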

Training Workflow

# 1. Preprocessing (One-Time, Cached on Disk)
preprocessor = OnDiskTransductivePreprocessor(
    graph_data=data,           # PyG Data object
    data_dir="./index",        # SQLite storage
    max_clique_size=3          # Detect up to 3-cliques
)
preprocessor.build_index()     # Stores to disk: ./index/structures.db

# 2. Create Splits (Automatic Batching)
split_config = OmegaConf.create({
    "strategy": "structure_centric",  # or "extended_context"
    "structures_per_batch": 500,      # Structure budget
    "node_budget": 2000,               # Node budget (prevents explosion)
    "shuffle": True
})
train, val, test = preprocessor.load_dataset_splits(split_config)

# 3. Training (Standard PyTorch Lightning)
datamodule = TBDataloader(train, val, test, batch_size=1)  # batch_size=1!
trainer.fit(model, datamodule)  # Batches pre-built, just iterate

# Memory: O(batch_size × D^k) instead of O(N × D^k) ✅

✨ Key Innovations

1. SQLite-Backed Structure Index

  • Persistent storage: Index computed once, reused across runs
  • Fast queries: B-tree indexing enables O(log N) structure lookups
  • Scalable: Handles millions of structures without memory overhead
  • Node-to-structure mapping: Bidirectional queries for both strategies (see the SQL sketch below)
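
Both directions map onto plain SQL against the schema sketched earlier (illustrative queries, reusing that conn):

import json

# Structures → nodes (structure-centric): fetch sampled cliques by id
rows = conn.execute(
    "SELECT node_list_json FROM cliques WHERE id IN (?, ?, ?)", (1, 2, 3)
).fetchall()
sampled_cliques = [json.loads(r[0]) for r in rows]

# Nodes → structures (extended context): cliques touching a node set
clique_ids = {r[0] for r in conn.execute(
    "SELECT DISTINCT clique_id FROM node_to_cliques WHERE node_id IN (?, ?)",
    (0, 1))}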

2. Dual Sampling Strategies

  • Structure-Centric: Novel "structures-first" paradigm guarantees completeness
  • Extended Context: Topology-aware node sampling with context expansion
  • Trade-off selection: Users choose based on their priority (completeness vs. coverage)

3. Batch-Time Transform Application

  • Deferred lifting: Topological transforms (SimplicialCliqueLifting, etc.) applied during batch collation (sketched below)
  • Memory efficiency: Only lift mini-batch subgraphs, not entire graph
  • Flexibility: Each batch gets fresh transforms (supports data augmentation)
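
A rough sketch of deferred lifting inside a collate function (hypothetical helper; the actual collate in this PR additionally attaches the precomputed structures queried from the index):

from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def collate_with_lifting(node_ids, data, lifting):
    # Build only the mini-batch subgraph (relabels nodes to 0..B-1)
    edge_index, _ = subgraph(node_ids, data.edge_index,
                             relabel_nodes=True, num_nodes=data.num_nodes)
    batch = Data(x=data.x[node_ids], edge_index=edge_index)
    # Lift the subgraph at batch time (e.g. SimplicialCliqueLifting)
    # instead of lifting the entire 100K+ node graph up front
    return lifting(batch)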

4. Seamless Integration with Existing Pipeline

# Works exactly like inductive learning!
train, val, test = preprocessor.load_dataset_splits(config)
datamodule = TBDataloader(train, val, test, batch_size=1)
trainer.fit(model, datamodule)

  • Zero API changes for users
  • Compatible with all existing models, transforms, and training loops
  • Drop-in replacement for standard preprocessing

5. Budget-Aware Sampling

  • Node budget: Prevents batch size explosion (critical for structure-centric)
  • Structure budget: Controls number of structures per batch
  • Adaptive: The sampler stops once a budget is reached, ensuring consistent memory usage (see the sketch below)
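
The budget logic amounts to a simple guarded loop, roughly:

def sample_until_budget(clique_stream, structures_per_batch=500, node_budget=2000):
    nodes, structures = set(), []
    for clique in clique_stream:
        if len(structures) >= structures_per_batch:
            break  # structure budget reached
        if len(nodes | set(clique)) > node_budget:
            continue  # would overflow the node budget: skip this structure
        structures.append(clique)
        nodes |= set(clique)
    return nodes, structures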

Tutorials

✅ tutorial_ondisk_transductive_structure_centric.ipynb

✅ tutorial_ondisk_transductive_extended_context.ipynb

Note: Due to time constraints I could not fully finish this PR. Most things should work, but there may be some small issues. After the challenge is officially over, I will add benchmarks and tests and thoroughly refactor the code.

@grapentt grapentt marked this pull request as draft November 25, 2025 20:21
@grapentt grapentt force-pushed the b1-bonus-ogbn-products branch from 4cb395f to ca4d4db on November 25, 2025 20:29
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@levtelyatnikov levtelyatnikov added the category-b1 Submission to TDL Challenge 2025: Mission B, Category 1. label Nov 26, 2025
@grapentt grapentt marked this pull request as ready for review November 26, 2025 17:50