Category: A1; Team name: SPAICOM_Semantic; Dataset: Semantic #242

Engrima18 · 2025-11-24T16:08:35Z

Checklist

My pull request has a clear and explanatory title.
My pull request passes the Linting test.
I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
My PR follows PEP8 guidelines. (refer to comment below)
My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
I linked to issues and PRs that are relevant to this PR.

Description

This pull request adds support for integrating our Hugging Face semantic latent-space dataset for CIFAR-10, and is designed to extend to other image datasets. The dataset consists of point clouds extracted from the latent representations of neural encoder models (any model available through the timm library is supported). Each dataset corresponds to a different encoder model, resulting in distinct latent geometries and therefore distinct underlying topological structures.

The objective of this dataset design is to make it possible to study the latent topology of different models and datasets by treating their latent spaces as samples of an underlying manifold. The point clouds can be used for topological or geometric structure analysis, manifold reconstruction, or graph extraction. Some works in the literature apply graph extraction and graph alignment techniques to align latent spaces; for example, methods based on latent functional maps leverage correspondences between latent manifolds to compare or align representations.
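As an illustration of the graph-extraction use case mentioned above, here is a minimal sketch that builds a symmetric k-nearest-neighbour graph from a latent point cloud. It is not the code in this PR: the function name `knn_graph`, the choice of Euclidean distance, and the random stand-in data are all assumptions made for the example.

```python
import numpy as np


def knn_graph(points: np.ndarray, k: int) -> np.ndarray:
    """Return a symmetric boolean adjacency matrix of the k-NN graph.

    `points` has shape (n_points, latent_dim). Hypothetical helper,
    not part of the PR's dataloader.
    """
    n = points.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")

    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(points**2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    np.fill_diagonal(dist2, np.inf)  # a point is never its own neighbour

    # Indices of the k closest points in each row (unordered within the k)
    neighbours = np.argpartition(dist2, k - 1, axis=1)[:, :k]

    adjacency = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adjacency[rows, neighbours.ravel()] = True

    # Symmetrise: keep an edge if either endpoint selected the other
    return adjacency | adjacency.T


# Stand-in for a latent point cloud (assumed shape: 200 samples, 64 dims)
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 64))
graph = knn_graph(latents, k=10)
```

The dense distance matrix keeps the sketch short but scales quadratically with the number of points; for a full CIFAR-10 split a tree-based or approximate neighbour search would likely be preferable.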

This is, to the best of our knowledge, the first dataset created specifically for semantic communications. It enables research in several directions:

Semantic alignment across models
Joint source–channel coding (JSCC) informed by latent-level semantics
Measuring the similarity of latent spaces between different models, datasets, data modalities, or tasks
Studying cross-model robustness and transferability through geometric and topological structure
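To make the "similarity of latent spaces" direction concrete, the sketch below computes linear Centered Kernel Alignment (CKA) between two sets of latent vectors extracted for the same inputs. CKA is one commonly used similarity index, chosen here for illustration; the PR does not state which metric the authors intend, and the function name `linear_cka` and the random data are assumptions.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n samples.

    `x` has shape (n, d1) and `y` has shape (n, d2); the latent
    dimensions may differ. Returns a value in [0, 1].
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must describe the same samples")

    # Centre each feature so the index ignores constant offsets
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Stand-ins for latents of two encoders on the same 500 images
rng = np.random.default_rng(0)
latents_a = rng.normal(size=(500, 32))
latents_b = rng.normal(size=(500, 48))
similarity = linear_cka(latents_a, latents_b)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either latent space, which makes it convenient when comparing encoders whose coordinate systems are arbitrary.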

The dataset therefore fills an important gap for researchers working on semantic communications, representation learning, latent alignment, and information-theoretic analysis of neural representations.

Issue

This PR addresses the lack of a unified, model-agnostic dataset for studying the latent geometry of neural networks in the context of semantic communications. Existing datasets generally provide raw data only, without exposing the latent structures that are crucial for tasks involving semantic alignment, JSCC optimization, or cross-model comparison.

By providing standardized latent-space point clouds for CIFAR-10 (and extendable to other datasets), this PR enables:

reproducible topological and geometric analysis across models
systematic comparison of latent spaces from models of different architectures
benchmarking semantic communication strategies using true latent manifolds
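As one possible baseline for comparing latent spaces of different architectures, the sketch below fits a least-squares linear map from one encoder's latent space to another's using paired samples. This is a generic technique shown under stated assumptions, not the alignment method of this PR; `fit_linear_alignment` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear_alignment(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares matrix W such that source @ W approximates target.

    `source` has shape (n, d_src) and `target` has shape (n, d_tgt),
    where row i of both arrays comes from the same input sample.
    """
    if source.shape[0] != target.shape[0]:
        raise ValueError("source and target must be paired row by row")
    weights, *_ = np.linalg.lstsq(source, target, rcond=None)
    return weights


# Synthetic paired latents: the target is an exact linear image of the source
rng = np.random.default_rng(0)
src_latents = rng.normal(size=(300, 16))
true_map = rng.normal(size=(16, 24))
tgt_latents = src_latents @ true_map

w = fit_linear_alignment(src_latents, tgt_latents)
residual = np.linalg.norm(src_latents @ w - tgt_latents)
```

On real encoder latents the residual will not vanish; its size on held-out samples gives a rough indication of how far two latent spaces are from being linearly related.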

This feature is important because it supports research at the intersection of machine learning, information theory, and topology, areas where standardized latent datasets have been missing.

Additional context

Latent representations are extracted from timm neural models, ensuring broad architecture support.
The dataset format is intentionally simple (point clouds + metadata) to maximize compatibility across downstream pipelines.
The dataset can naturally extend to other modalities (audio, text, multimodal) for cross-domain semantic alignment research.
The PR lays the groundwork for future integration with TopoBench for benchmarking topological distances, manifold quality, and structure-aware semantic communication metrics.
Dataset source (our lab's Hugging Face account): https://huggingface.co/datasets/spaicom-lab/semantic-cifar10

Engrima18 and others added 8 commits November 15, 2025 13:10

added dataloader for Semantic Dataset (v1)

d657a06

conf update for the semantic pointcloud dataset

272d06f

conf update for the semantic pointcloud dataset

9aa8621

modify input arguments of the semantic dataloader

1c8001b

added semantic dataset, dataloader and hydra yaml

017b7a7

update semantic dataset 2

48de3e2

linting error

5a2f950

import all dependencies

3ee79cb

levtelyatnikov added the category-a1 label (Submission to TDL Challenge 2025: Mission A, Category 1) on Nov 25, 2025
