@marindigen marindigen commented Nov 25, 2025

Checklist

  • My pull request has a clear and explanatory title following the challenge format.
  • My pull request passes linting.
  • I added appropriate unit tests and made sure the code passes all unit tests.
  • My PR follows PEP8 guidelines.
  • My code is documented using numpy-style docstrings.
  • I included a pipeline test showing that a model can train on the new benchmark task.
  • This PR introduces at most one dataset loader, as required by the challenge rules.

Description

This PR is a Category B2 (“Pioneering New TDL Benchmark Tasks”) submission to the TAG-DS Topological Deep Learning Challenge 2025.

It integrates the Bowen et al. (2024) mouse auditory cortex calcium-imaging dataset into TopoBench under the name:

A123CortexM — A1/A2/3 mouse auditory cortex correlation graphs,

and builds a small family of topology-aware benchmark tasks:

  1. Graph-level BF-bin classification (standard graph classification).
  2. Triangle role classification (2-simplex motif roles combining embedding × weight).
  3. Triangle common-neighbour prediction (topological embedding depth of triangles).

All of these are driven by a single dataset / loader pair and configured via the specific_task parameter in the dataset YAML.

In addition, the PR introduces a generic triangle utility in topobench.data.utils.triangle_classifier that can be reused by other datasets.

Dataset and graph construction

The underlying data come from:

Bowen et al. (2024), “Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex”, PNAS Nexus.

Each recording session provides:

  • Neuronal activity traces,
  • Pairwise signal correlations (SigCorrs) and noise correlations (NoiseCorrsTrial),
  • Best-frequency (BF) values per neuron and layer annotations.

The dataset class:

  • topobench/data/datasets/a123.py

    • A123CortexMDataset(InMemoryDataset)

performs the following steps:

  1. Download & unpack
    Uses download_file_from_link to fetch the “Auditory cortex data” archive and extract it under raw/.

  2. Session / layer extraction
    For each .mat file and each layer (1–5), it reads:

    • SigCorrs (signal correlation matrix),
    • NoiseCorrsTrial (trial-level noise correlations),
    • BFInfo[layer]["BFval"] (per-neuron best frequency).
  3. BF-bin subgraphs
    Neurons are binned by BF into n_bins (default 9). For each (session, layer, BF-bin) with at least min_neurons neurons (default from config, 3 for tests):

    • Correlation and noise-correlation matrices are restricted to those neurons,
    • A sample dictionary is built with metadata:
      {session_file, session_id, layer, bf_bin, neuron_indices, corr, noise_corr}.
  4. Graph representation (_sample_to_pyg_data)
    Each sample becomes a torch_geometric.data.Data graph with:

    • Nodes: neurons in a single (session, layer, BF-bin).

    • Node features x ∈ ℝ^{n×3}:

      • mean_corr: mean signal correlation to others,
      • std_corr: standard deviation of signal correlation,
      • noise_diag: diagonal entries of the noise-correlation matrix (per-neuron noise level).
    • Edges: undirected edges between neuron pairs whose signal correlation ≥ corr_threshold (configurable; corr_threshold: 0.2 in the YAML). Edges are constructed from the upper triangle and symmetrised with to_undirected.

    • Edge attributes: correlation weights on those edges.

    • Label y: integer BF-bin in [0, num_classes − 1] (config num_classes: 9).

    • Metadata: session_id, layer.

    During process(), graphs with no edges are filtered out, and the remaining graphs are collated and stored in processed/data.pt.
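The graph construction in step 4 can be sketched without any dependencies. This is a minimal illustration, not the code in a123.py: the function name corr_to_graph is hypothetical, plain nested lists stand in for tensors, and the use of the population standard deviation is an assumption, since the description does not say which estimator is used.

```python
import statistics


def corr_to_graph(corr, noise_corr, corr_threshold=0.2):
    """Build node features and a symmetric edge list from correlation matrices.

    corr, noise_corr: square nested lists (n x n).
    Returns (x, edges, weights), where x[i] = [mean_corr, std_corr, noise_diag].
    """
    n = len(corr)
    x = []
    for i in range(n):
        # Correlation "to others": the diagonal entry is excluded.
        others = [corr[i][j] for j in range(n) if j != i]
        mean_corr = statistics.fmean(others) if others else 0.0
        # Assumption: population standard deviation.
        std_corr = statistics.pstdev(others) if others else 0.0
        x.append([mean_corr, std_corr, noise_corr[i][i]])

    edges, weights = [], []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            if corr[i][j] >= corr_threshold:
                # Store both directions, mirroring what to_undirected produces.
                edges += [(i, j), (j, i)]
                weights += [corr[i][j], corr[i][j]]
    return x, edges, weights
```

A graph for which edges comes back empty corresponds to the samples that process() filters out.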

The dataset behaviour is controlled by the YAML configs/dataset/graph/a123.yaml.
For CI we restrict to num_graphs: 10 to keep runtime reasonable; users can increase this for full experiments.

Generic triangle utilities

To make triangle-based benchmarks reusable across datasets, this PR adds:

  • topobench/data/utils/triangle_classifier.py

with the base class:

class TriangleClassifier:
    """
    Generic triangle utility for weighted graphs.
    - enumerate_triangles(G): list of (a, b, c) triangles
    - classify_and_weight_triangles(triangles, G): attaches edge weights + domain-specific roles
    - extract_triangles(edge_index, edge_weights, num_nodes): convenience from PyG graphs
    """

The base methods provide:

  • Efficient triangle enumeration on a NetworkX graph,
  • Edge weight extraction per triangle,
  • A hook for domain-specific role definitions via _classify_role and _role_to_label (which are intentionally left abstract).

This utility is then specialised for the auditory cortex dataset, but can be reused by other TopoBench datasets to define their own triangle-based tasks.
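The enumerate-then-classify pattern with abstract hooks can be sketched as follows. This is an illustrative stand-in rather than the shipped class: the name TriangleClassifierSketch is hypothetical, and a plain dict of adjacency sets plus a weight dict replace the NetworkX graph G so that the example stays dependency-free.

```python
from itertools import combinations


class TriangleClassifierSketch:
    """Illustrative stand-in; not the class added by this PR."""

    def enumerate_triangles(self, adj):
        """adj: dict mapping node -> set of neighbours (undirected, no self-loops)."""
        triangles = []
        for a in sorted(adj):
            # Only consider neighbours larger than a, so each triangle appears once.
            higher = sorted(v for v in adj[a] if v > a)
            for b, c in combinations(higher, 2):
                if c in adj[b]:
                    triangles.append((a, b, c))
        return triangles

    def classify_and_weight_triangles(self, triangles, adj, weights):
        """weights: dict mapping frozenset({u, v}) -> edge weight."""
        records = []
        for a, b, c in triangles:
            w = [
                weights[frozenset((a, b))],
                weights[frozenset((b, c))],
                weights[frozenset((a, c))],
            ]
            role = self._classify_role((a, b, c), w, adj)
            records.append(
                {
                    "nodes": (a, b, c),
                    "edge_weights": w,
                    "role": role,
                    "label": self._role_to_label(role),
                }
            )
        return records

    # Domain-specific hooks, left abstract in the base class.
    def _classify_role(self, nodes, edge_weights, adj):
        raise NotImplementedError

    def _role_to_label(self, role):
        raise NotImplementedError
```

A dataset-specific subclass only needs to override the two hooks; enumeration and weight extraction are inherited unchanged.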

A123-specific triangle classifier

In topobench/data/datasets/a123.py we define:

class TriangleClassifier(BaseTriangleClassifier):
    ...

This subclass implements the domain-specific logic for auditory cortex correlation graphs: each triangle is assigned a role class based on its number of common neighbours (embedding) and its edge-weight class.
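One way to obtain nine role labels from embedding × weight is a 3 × 3 grid. The sketch below is only an illustration of that idea: the function name role_label, the use of the mean edge weight as the strength measure, and every cut point are hypothetical placeholders, not the values used in a123.py.

```python
def role_label(num_common, edge_weights, weight_cuts=(0.3, 0.5), embed_cuts=(1, 3)):
    """Map a triangle to a label in 0..8 = 3 embedding levels x 3 weight levels.

    All cut points are hypothetical placeholders.
    """
    mean_w = sum(edge_weights) / len(edge_weights)
    # Count how many cut points are reached; each bin index is 0, 1 or 2.
    weight_bin = sum(mean_w >= c for c in weight_cuts)
    embed_bin = sum(num_common >= c for c in embed_cuts)
    return embed_bin * 3 + weight_bin
```

With this layout, the label can be decoded back into its two components via divmod(label, 3).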

Tasks

The dataset’s process() method always builds the graph dataset and then inspects:

specific_task = self.parameters.get("specific_task", "classification")

to optionally build triangle-level tasks:

1. Graph-level BF-bin classification (specific_task: classification)

  • Task level: graph.
  • Label: BF-bin (0–8).
  • Config: task_level: graph, num_classes: 9, loss_type: cross_entropy.

This is a standard graph classification benchmark on correlation graphs, suitable as a baseline and as input for higher-order liftings.

2. Triangle role classification (specific_task: triangle_classification)

  • Implemented in:

    • A123CortexMDataset._extract_triangles_from_graphs()
    • A123CortexMDataset.create_triangle_classification_task()

Step 1 – Triangle extraction

_extract_triangles_from_graphs():

  • Iterates over all graphs in the dataset,

  • Builds a NetworkX graph G for each (with edge weights from signal correlations),

  • Uses TriangleClassifier.enumerate_triangles(G) and classify_and_weight_triangles() to obtain triangle dicts with:

    • nodes: (a, b, c),
    • edge_weights: [w_ab, w_bc, w_ac],
    • role: role string,
    • label: 0–8.

A list of raw triangle records is collected, each with:
graph_idx, tri (triangle dict), G, num_nodes.

Step 2 – Building the triangle dataset

create_triangle_classification_task() converts these into torch_geometric.data.Data objects:

  • x ∈ ℝ^{1×3}: the three edge weights (purely topological/functional – no node features or BF info),
  • y: integer role label in {0, …, 8},
  • Metadata: nodes, role, graph_idx.

This defines a triangle-level classification benchmark targeting 2-simplex motif roles.

3. Triangle common-neighbour prediction (specific_task: triangle_common_neighbors)

  • Implemented in:

    • A123CortexMDataset.create_triangle_common_neighbors_task()

Here we focus on a purely structural topological quantity: the number of common neighbours of each triangle.

For each triangle (a, b, c) in the raw list:

  1. Compute the set of common neighbours:

    common = (set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c))) - {a, b, c}
    num_common = len(common)
  2. Define the label:

    • Exact common-neighbour count, capped at 8:

      • 0–7 → class 0–7,
      • ≥8 → class 8.
  3. Define the features:

    • Node degrees of the triangle vertices in G:

      deg_a = G.degree(a)
      deg_b = G.degree(b)
      deg_c = G.degree(c)
      x = [deg_a, deg_b, deg_c]

    So each triangle sample has x ∈ ℝ^{1×3} (degrees) and y ∈ {0, …, 8} (binned common neighbours).

This gives a triangle-level classification task where labels are higher-order topological statistics (coface-like information), and features are structural (degrees), avoiding direct leakage of the label.
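The label and feature definitions above can be condensed into one small function. This is a minimal sketch: the name common_neighbour_sample is hypothetical, and a dict of adjacency sets stands in for the NetworkX graph G.

```python
def common_neighbour_sample(adj, tri, cap=8):
    """Return (x, y) for one triangle.

    adj: dict mapping node -> set of neighbours (undirected, no self-loops).
    tri: tuple (a, b, c) of triangle vertices.
    """
    a, b, c = tri
    # Nodes adjacent to all three vertices, excluding the vertices themselves.
    common = (adj[a] & adj[b] & adj[c]) - {a, b, c}
    y = min(len(common), cap)  # counts of 8 or more collapse into class 8
    x = [len(adj[a]), len(adj[b]), len(adj[c])]  # vertex degrees
    return x, y
```

Note that each common neighbour adds one to all three degrees, so the degrees bound the label from above without determining it.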

Loader and task selection

The loader:

  • topobench/data/loaders/graph/a123_loader.py

    • A123DatasetLoader(AbstractLoader)

performs the following steps:

  1. Reads data_name and specific_task from parameters.

  2. Constructs A123CortexMDataset(root, name, parameters).

  3. Depending on specific_task:

    • classification
      → Uses the default graph dataset from processed/data.pt.

    • triangle_classification
      → Loads triangle dataset from processed/data_triangles.pt (if it exists) and assigns it to self.dataset.data / self.dataset.slices.

    • triangle_common_neighbors
      → Loads triangle CN (common neighbors) dataset from processed/data_triangles_common_neighbors.pt.

If the triangle files are missing, the loader emits a warning asking the user to make sure the dataset has been processed with the matching specific_task.

This respects the one-loader-per-PR rule while making the triangle tasks selectable via configuration.
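The task-to-file selection can be sketched as a small lookup. This is an illustration only: the helper resolve_processed_file and the TASK_FILES table are hypothetical, and returning None after the warning is an assumption about what happens when a file is missing. The three file names are the ones listed above.

```python
import warnings
from pathlib import Path

# File names as given in the description; the mapping itself is a sketch.
TASK_FILES = {
    "classification": "data.pt",
    "triangle_classification": "data_triangles.pt",
    "triangle_common_neighbors": "data_triangles_common_neighbors.pt",
}


def resolve_processed_file(processed_dir, specific_task="classification"):
    """Return the processed file for a task, or None (with a warning) if absent."""
    if specific_task not in TASK_FILES:
        raise ValueError(f"Unknown specific_task: {specific_task!r}")
    path = Path(processed_dir) / TASK_FILES[specific_task]
    if not path.exists():
        warnings.warn(
            f"{path} not found; make sure the dataset was processed with "
            f"specific_task={specific_task!r}."
        )
        return None
    return path
```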

Tests and pipeline integration

To satisfy the challenge requirements:

  1. Unit tests

    • test/data/load/test_a123_dataset.py checks:

      • Creation and basic properties of A123 graphs,
      • Correct behaviour of the loader,
      • Correct loading of triangle datasets for triangle_classification / triangle_common_neighbors (e.g. shapes of x, valid label ranges, non-empty datasets in test settings).
  2. Pipeline test

    • test/pipeline/test_pipeline.py is extended with a configuration that:

      • Uses the A123 config (with e.g. specific_task: triangle_classification and num_graphs: 10),
      • Trains an existing TopoBench model for max_epochs=2,
      • Logs metrics such as train/accuracy, val/accuracy, test/accuracy, macro precision, recall, and F1.

    This demonstrates that the entire training pipeline runs successfully on the new benchmark task. Performance is not tuned; the goal is compatibility and coverage.

  3. Coverage

    • Tests exercise:

      • A123CortexMDataset.process,
      • Triangle extraction / classification / CN logic,
      • A123DatasetLoader.load_dataset for multiple specific_task settings,
      • The generic topobench.data.utils.triangle_classifier utility.

    This helps maintain the ≥93% Codecov target.

    Note: we weren't able to check the coverage using Codecov due to dependency incompatibilities.

Why this is a useful B2 benchmark

This contribution adds:

  • A real, biophysically grounded brain dataset (mouse auditory cortex), and
  • Two explicit triangle-level tasks that are naturally suited to topological models.

Key points for TDL:

  • Triangles are 2-simplices of the clique complex of the correlation graph.
  • Triangle roles combine internal functional strength and higher-order embedding (common neighbours).
  • Common-neighbour prediction targets a coface-like topological statistic directly.

These are exactly the kind of questions where simplicial networks, cell-complex networks, and hypergraph networks should shine compared to edge-only GNNs:

  • They can operate directly on 2-cells and their cofaces,
  • They can capture how information “flows” through triangles embedded in local motifs,
  • They can more naturally encode constraints on multi-neuron interactions (beyond pairwise edges).

Because everything is driven through a single YAML (specific_task switch) and a reusable triangle utility, the benchmark is also extensible: other datasets can plug into topobench.data.utils.triangle_classifier and define their own domain-specific triangle roles or CN-style tasks.

Limitations and future directions

  • Triangle roles currently use fixed correlation thresholds and simple bins; more refined roles could incorporate spatial distances, laminar structure, or alternative measures (e.g. causal TE networks).
  • The common-neighbour task is currently treated as a 9-class classification problem; a regression version or ordinal metrics would be natural extensions.
  • The CI config only uses num_graphs: 10; full experiments on the whole dataset will likely reveal richer distributions of motif types and CN counts.

Relation to previous work

This PR builds on my earlier contributions to data loading and streaming in TopoBench (previous PR: #241), but is focused on the new TDL benchmark tasks (Category B2), with an emphasis on higher-order structure in functional brain networks.

Note: This PR duplicates the changes from #241 because, without the updated download_file_from_link function, the dataset cannot be downloaded, and the submission would not run.

References

  • Bowen, Z., et al. (2024). Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex. PNAS Nexus, 3(2), pgae074.

…o check and test the training. To download the dataset, the verify parameter of requests.get() in 'download_file_from_link' must be set to False. Note also that the run script currently fails to download the data even with verify set to False
…ll_to_dict and process_mat. I have also modified download_file_from_link by specifying verify=False in requests.get()
- Implement TriangleClassifier utility for extracting and classifying triangles
- Add create_triangle_classification_task() to generate 7-class role predictions
- Support triangle task via dataset loader (task_type='triangle_task')
- Add config section with triangle task settings (7 classes, 3 features)
- Include comprehensive test suite for triangle classification pipeline

Added triangle common neighbours classification task: predicting the number of common neighbours from intra-triangle data:

- Dataset creation
- Support triangle task via dataset loader (task_type='triangle_common_task')
- Add config section
- Include comprehensive test suite
… to the for the datasets (use for the triangle options) as the classes are imbalanced. Ideally, need to add class weights to the model and opt for a focal loss
… TriangleClassifier into utils for reusability and created a new class inheriting from it to specialise it for the a123 task
@gbg141
Collaborator

gbg141 commented Nov 26, 2025

Hi @marindigen! It seems that the testing error can be easily fixed just by slightly renaming the newly introduced test_io_utils.py file. Good luck!

…ion. Move the test class to the appropriate file. Note that the same changes were made in PR geometric-intelligence#241 (they are duplicated here because the script would not run otherwise and would require additional adaptation to the old download_file_from_link function).
@marindigen
Author

The CI failed with a no space left on device error on the GitHub runner (infrastructure issue, not code-related), which could be due to the size of the dataset (~2 GB). What could be a possible solution to this?

@levtelyatnikov
Collaborator

The CI failed with a no space left on device error on the GitHub runner (infrastructure issue, not code-related), which could be due to the size of the dataset (~2 GB). What could be a possible solution to this?

Dear @marindigen, one possible solution is to mock the data instead of downloading it. Please refer to PR #233 for reference if needed.

@levtelyatnikov added the category-b2 label (Submission to TDL Challenge 2025: Mission B, Category 2) on Nov 26, 2025
@gbg141
Collaborator

gbg141 commented Nov 26, 2025

Hi again @marindigen! Could you please comment out (or convert to markdown) the content of tutorial_train_brain_model.ipynb? We decided to accept your submission given the cause of the failing test.
Thank you!

@marindigen
Author

marindigen commented Nov 27, 2025

Hi @gbg141 and @levtelyatnikov!

Thank you for your patience and for accepting the submission as it is! I have converted the tutorial to a markdown file.
