Category: A1; Team name: SPAICOM_Semantic; Dataset: Semantic #242
Checklist
Description
This pull request adds support for integrating our HuggingFace semantic latent-space dataset for CIFAR-10, as well as other possible image datasets. The dataset consists of point clouds extracted from the latent representations of various neural models (all models available through the timm library are supported). Each dataset corresponds to a different encoder model, resulting in distinct latent geometries and therefore distinct underlying topological structures.
The goal of this dataset design is to enable the study of latent topology across models and datasets by treating each latent space as a sample from an underlying manifold. The point clouds can be used for topological or geometric structure analysis, manifold reconstruction, or graph extraction. Some works in the literature apply graph extraction and graph alignment techniques to align latent spaces; for example, methods based on latent functional maps leverage correspondences between latent manifolds to compare or align representations.
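As an illustration of the graph-extraction use case mentioned above, here is a minimal sketch that builds a symmetric k-nearest-neighbour graph from a latent point cloud. It is not the code in this PR: the function name `knn_graph`, the choice of Euclidean distance, and the random stand-in array `latents` (used in place of an actual latent point cloud from the dataset) are all assumptions made for the example.

```python
import numpy as np


def knn_graph(points: np.ndarray, k: int) -> np.ndarray:
    """Return a symmetric boolean adjacency matrix of the k-NN graph.

    Hypothetical helper for illustration; `points` has shape (n, d).
    """
    n = points.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")

    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(points**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbour

    # Indices of the k closest points in each row (unordered within the k)
    nbrs = np.argpartition(d2, k - 1, axis=1)[:, :k]

    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adj[rows, nbrs.ravel()] = True

    # Symmetrise: keep an edge if either endpoint selected the other
    return adj | adj.T


# Stand-in for a latent point cloud (assumed shape: 200 samples, 64 dims)
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 64))
adjacency = knn_graph(latents, k=5)
```

The resulting adjacency matrix could then be passed to a graph library or used as input to an alignment method. The dense distance matrix keeps the sketch short but scales quadratically with the number of points, so a larger point cloud would likely call for an approximate nearest-neighbour index instead.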
This is, to the best of our knowledge, the first dataset created specifically for semantic communications. It enables research in several directions:
The dataset therefore fills an important gap for researchers working on semantic communications, representation learning, latent alignment, and information-theoretic analysis of neural representations.
Issue
This PR addresses the lack of a unified, model-agnostic dataset for studying the latent geometry of neural networks in the context of semantic communications. Existing datasets generally provide raw data only, without exposing the latent structures that are crucial for tasks involving semantic alignment, joint source-channel coding (JSCC) optimization, or cross-model comparison.
By providing standardized latent-space point clouds for CIFAR-10 (and extendable to other datasets), this PR enables:
This feature is important because it supports research at the intersection of machine learning, information theory, and topology, areas where standardized latent datasets have been missing.
Additional context