Conversation

@grapentt grapentt commented Nov 25, 2025

🚀 B1 Bonus: On-Disk Transductive Learning with SQLite-Backed Structure Indexing

📌 Problem Statement

Challenge: Training Topological Neural Networks (TNNs) on large transductive graphs (100K+ nodes) with complex topological structures (cliques, cycles, etc.) faces fundamental memory limitations:

  1. Structure Explosion: A graph with 100K nodes can contain millions of cliques/simplices
  2. Preprocessing Bottleneck: Computing and storing all topological structures requires O(N × D^k) memory where:
    • N = number of nodes
    • D = average degree
    • k = structure size (e.g., 3 for triangles)
  3. Transductive Constraint: Unlike inductive learning (separate train/test graphs), transductive learning requires sampling from a single large graph with train/val/test masks

Real-World Impact: Popular benchmarks like ogbn-products (2.4M nodes, 61M edges) are currently infeasible for TNNs due to these memory constraints.
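
Back-of-the-envelope, using the numbers above: ogbn-products has N ≈ 2.4M and average degree D ≈ 2 × 61M / 2.4M ≈ 51, so the k = 3 bound gives N × D³ ≈ 3 × 10¹¹ candidate structures, far beyond what fits in memory.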


💡 Our Approach: Two-Strategy Solution

We developed two complementary strategies for memory-efficient transductive TNN training:

Strategy 1: Structure-Centric Sampling 🎯

Guarantee: 100% Structure Completeness

# Sample structures FIRST, then extract nodes
structures = sample_k_cliques(target=500)  # Sample 500 cliques
nodes = union_of_nodes(structures)          # Extract their nodes
batch = build_subgraph(nodes, structures)   # Create batch

Key Innovation: Reverses traditional graph sampling—we sample topological structures (cliques) first, then derive the node set. This guarantees all sampled structures are 100% complete in the batch (no missing nodes).
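
As a minimal sketch of the structures-first idea (using NetworkX directly for clarity; in this PR the cliques come from the SQLite index instead, and all names below are illustrative):

import random
import networkx as nx

def sample_structure_centric(G, target=500, k=3, seed=0):
    # Enumerate k-cliques (the PR precomputes these into SQLite instead)
    cliques = [c for c in nx.enumerate_all_cliques(G) if len(c) == k]
    rng = random.Random(seed)
    sampled = rng.sample(cliques, min(target, len(cliques)))
    # Derive the node set from the structures: every sampled clique
    # is complete in the induced subgraph by construction
    nodes = set().union(*map(set, sampled)) if sampled else set()
    return G.subgraph(nodes).copy(), sampled

batch_graph, batch_cliques = sample_structure_centric(nx.karate_club_graph(), target=20)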

Strategy 2: Extended Context Sampling 🌐

Near-Complete Structures with Topology-Aware Heuristics

# Sample nodes FIRST (cluster-aware), then expand context
core_nodes = cluster_aware_sample(batch_size=1000)     # Louvain clustering
context_nodes = expand_context(core_nodes, ratio=1.5)  # +50% context
all_nodes = core_nodes | context_nodes                 # Core + context union
structures = query_structures(all_nodes)               # Query structures
batch = build_subgraph(all_nodes, structures, core_mask)

Key Innovation: Uses graph topology (community detection) to sample dense regions, then adds context nodes to increase structure completeness. Distinguishes "core" vs "context" nodes for loss computation.
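
A minimal sketch of this strategy, assuming NetworkX's built-in Louvain implementation; the simple 1-hop expansion heuristic below is illustrative, not the PR's exact code:

import random
import networkx as nx

def sample_extended_context(G, batch_size=1000, ratio=1.5, seed=0):
    rng = random.Random(seed)
    # Cluster-aware core sampling: draw core nodes from a dense community
    communities = nx.community.louvain_communities(G, seed=seed)
    largest = max(communities, key=len)
    core = set(rng.sample(sorted(largest), min(batch_size, len(largest))))
    # Expand with 1-hop neighbors until the +50% context budget is met
    budget = int(len(core) * ratio)
    context = set(core)
    for u in core:
        context.update(G.neighbors(u))
        if len(context) >= budget:
            break
    # Core vs. context mask: only core nodes contribute to the loss
    core_mask = {n: n in core for n in context}
    return G.subgraph(context).copy(), core_mask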


🏗️ Architecture & Workflow

Core Components

┌─────────────────────────────────────────────────────────────┐
│  1. OnDiskTransductivePreprocessor                          │
│     - Converts PyG Data → NetworkX graph                    │
│     - Detects topological structures (cliques up to size k) │
│     - Builds SQLite index via CliqueQueryEngine             │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  2. SQLite Index (Persistent Disk Storage)                  │
│     - Table: cliques (id, size, node_list_json)            │
│     - Table: node_to_cliques (node_id, clique_id)          │
│     - Enables fast queries: O(log N) per structure         │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  3. Strategy-Specific Samplers                              │
│     A. StructureCentricSampler                              │
│        - Samples structure IDs directly                     │
│        - Budget-aware: stops at node_budget                 │
│     B. ClusterAwareNodeSampler                              │
│        - Uses Louvain/KMeans clustering                     │
│        - Samples from dense communities                     │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  4. Collate Functions (On-Demand Structure Query)          │
│     - Query structures from SQLite index                    │
│     - Build PyG Data batch with subgraph                    │
│     - Apply topological transforms (liftings) at batch-time│
│     - Return: Data(x, edge_index, precomputed_structures)  │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  5. TransductiveSplitDataset (Seamless Integration)        │
│     - Wraps preprocessor + sampler + collate               │
│     - Compatible with TBDataloader (batch_size=1)          │
│     - Lazy materialization for memory efficiency           │
└─────────────────────────────────────────────────────────────┘
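
The SQLite index in box 2 can be pictured with the following sqlite3 sketch (the exact schema inside CliqueQueryEngine may differ; the columns follow the diagram):

import json
import sqlite3

conn = sqlite3.connect("./index/structures.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS cliques (
    id             INTEGER PRIMARY KEY,
    size           INTEGER NOT NULL,
    node_list_json TEXT    NOT NULL
);
CREATE TABLE IF NOT EXISTS node_to_cliques (
    node_id   INTEGER NOT NULL,
    clique_id INTEGER NOT NULL REFERENCES cliques(id)
);
-- B-tree indices behind the O(log N) lookups
CREATE INDEX IF NOT EXISTS idx_clique_size ON cliques(size);
CREATE INDEX IF NOT EXISTS idx_node ON node_to_cliques(node_id);
""")
# Store one triangle: the clique row plus its node-to-clique mappings
cur = conn.execute("INSERT INTO cliques (size, node_list_json) VALUES (?, ?)",
                   (3, json.dumps([0, 1, 2])))
conn.executemany("INSERT INTO node_to_cliques VALUES (?, ?)",
                 [(n, cur.lastrowid) for n in (0, 1, 2)])
conn.commit()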

Training Workflow

# 1. Preprocessing (One-Time, Cached on Disk)
preprocessor = OnDiskTransductivePreprocessor(
    graph_data=data,           # PyG Data object
    data_dir="./index",        # SQLite storage
    max_clique_size=3          # Detect up to 3-cliques
)
preprocessor.build_index()     # Stores to disk: ./index/structures.db

# 2. Create Splits (Automatic Batching)
split_config = OmegaConf.create({
    "strategy": "structure_centric",  # or "extended_context"
    "structures_per_batch": 500,      # Structure budget
    "node_budget": 2000,               # Node budget (prevents explosion)
    "shuffle": True
})
train, val, test = preprocessor.load_dataset_splits(split_config)

# 3. Training (Standard PyTorch Lightning)
datamodule = TBDataloader(train, val, test, batch_size=1)  # batch_size=1!
trainer.fit(model, datamodule)  # Batches pre-built, just iterate

# Memory: O(batch_size × D^k) instead of O(N × D^k) ✅

✨ Key Innovations

1. SQLite-Backed Structure Index

  • Persistent storage: Index computed once, reused across runs
  • Fast queries: B-tree indexing enables O(log N) structure lookups
  • Scalable: Handles millions of structures without memory overhead
  • Node-to-structure mapping: Bidirectional queries for both strategies (see the SQL sketch below)
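
Both directions map onto plain SQL against the schema sketched earlier (illustrative queries, reusing that conn):

import json

# Structures → nodes (structure-centric): fetch sampled cliques by id
rows = conn.execute(
    "SELECT node_list_json FROM cliques WHERE id IN (?, ?, ?)", (1, 2, 3)
).fetchall()
sampled_cliques = [json.loads(r[0]) for r in rows]

# Nodes → structures (extended context): cliques touching a node set
clique_ids = {r[0] for r in conn.execute(
    "SELECT DISTINCT clique_id FROM node_to_cliques WHERE node_id IN (?, ?)",
    (0, 1))}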

2. Dual Sampling Strategies

  • Structure-Centric: Novel "structures-first" paradigm guarantees completeness
  • Extended Context: Topology-aware node sampling with context expansion
  • Trade-off selection: Users choose based on their priority (completeness vs. coverage)

3. Batch-Time Transform Application

  • Deferred lifting: Topological transforms (SimplicialCliqueLifting, etc.) applied during batch collation (sketched below)
  • Memory efficiency: Only lift mini-batch subgraphs, not entire graph
  • Flexibility: Each batch gets fresh transforms (supports data augmentation)
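
A rough sketch of deferred lifting inside a collate function (hypothetical helper; the actual collate in this PR additionally attaches the precomputed structures queried from the index):

from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def collate_with_lifting(node_ids, data, lifting):
    # Build only the mini-batch subgraph (relabels nodes to 0..B-1)
    edge_index, _ = subgraph(node_ids, data.edge_index,
                             relabel_nodes=True, num_nodes=data.num_nodes)
    batch = Data(x=data.x[node_ids], edge_index=edge_index)
    # Lift the subgraph at batch time (e.g. SimplicialCliqueLifting)
    # instead of lifting the entire 100K+ node graph up front
    return lifting(batch)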

4. Seamless Integration with Existing Pipeline

# Works exactly like inductive learning!
train, val, test = preprocessor.load_dataset_splits(config)
datamodule = TBDataloader(train, val, test, batch_size=1)
trainer.fit(model, datamodule)

  • Zero API changes for users
  • Compatible with all existing models, transforms, and training loops
  • Drop-in replacement for standard preprocessing

5. Budget-Aware Sampling

  • Node budget: Prevents batch size explosion (critical for structure-centric)
  • Structure budget: Controls number of structures per batch
  • Adaptive: The sampler stops once a budget is reached, ensuring consistent memory usage (see the sketch below)
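
The budget logic amounts to a simple guarded loop, roughly:

def sample_until_budget(clique_stream, structures_per_batch=500, node_budget=2000):
    nodes, structures = set(), []
    for clique in clique_stream:
        if len(structures) >= structures_per_batch:
            break  # structure budget reached
        if len(nodes | set(clique)) > node_budget:
            continue  # would overflow the node budget: skip this structure
        structures.append(clique)
        nodes |= set(clique)
    return nodes, structures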

Tutorials

✅ tutorial_ondisk_transductive_structure_centric.ipynb

✅ tutorial_ondisk_transductive_extended_context.ipynb

Note: Due to time constraints I could not fully finish this PR. Most things should work, but there may be some small issues. After the challenge is officially over, I will add benchmarks and tests and thoroughly refactor the code.

@grapentt grapentt marked this pull request as draft November 25, 2025 20:21
@grapentt grapentt force-pushed the b1-bonus-ogbn-products branch from 4cb395f to ca4d4db on November 25, 2025 20:29
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@levtelyatnikov levtelyatnikov added the category-b1 Submission to TDL Challenge 2025: Mission B, Category 1. label Nov 26, 2025
@grapentt grapentt marked this pull request as ready for review November 26, 2025 17:50