GeoZarr Multiscales Clarifications #86

emmanuelmathot · 2025-08-19T13:19:37Z

This PR implements specification clarifications for multiscale overviews in GeoZarr, addressing ambiguities in TileMatrixSet integration and group structure discovery.

Fixes issues #79, #81, #83

Key Changes

Enhanced TileMatrixSet Support:

- Clarified that zoom level group names must exactly match TileMatrix identifiers
- Added a deterministic mapping between TileMatrixSet definitions and Zarr group hierarchies
- Expanded support for scientific coordinate systems (UTM, polar stereographic, sinusoidal)

Improved Group Discovery:

- Added subsection 9.7.2.1 explaining how to discover multiscale structure through TileMatrixSet metadata
- Provided a WebMercatorQuad example showing the implied group structure for zoom levels 7-15
- Eliminated ambiguity about which groups participate in multiscale collections

Custom TileMatrixSet Integration:

- Added subsection 9.7.3.1 with a detailed UTM Zone 33N example for scientific projections
- Emphasized native CRS preservation and optimized chunking strategies
- Clarified inline JSON object requirements following OGC TileMatrixSet v2.0
- Added support for arbitrary decimation schemes (quadtree, nonary tree, custom factors)

Consistency Requirements:

- Ensured tile_matrix_set_limits keys match zoom level group names
- Added constraints preventing naming conflicts with TileMatrix identifiers
- Strengthened requirements for multiscales metadata attribute placement
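
The naming rules listed above can be pictured as a simple set comparison. The following is a minimal sketch, not part of the PR: the function name and the idea of passing plain lists of names are assumptions made for illustration, and it takes the PR's wording that every TileMatrix identifier corresponds to a child group.

```python
def check_multiscale_names(tile_matrix_ids, limits_keys, child_group_names):
    """Return a list of problems; an empty list means the names are consistent.

    Hypothetical helper: all three arguments are plain lists of strings.
    """
    ids = set(tile_matrix_ids)
    problems = []
    # Every tile_matrix_set_limits key must be a known TileMatrix identifier.
    for key in sorted(set(limits_keys) - ids):
        problems.append(f"tile_matrix_set_limits key {key!r} is not a TileMatrix id")
    # Every TileMatrix identifier must have a child group of exactly that name.
    for name in sorted(ids - set(child_group_names)):
        problems.append(f"no child group named {name!r}")
    return problems


ids = ["0", "1", "2"]
print(check_multiscale_names(ids, ["0", "1", "2"], ["0", "1", "2"]))
print(check_multiscale_names(ids, ["0", "1", "3"], ["0", "1"]))
```

The first call reports no problems; the second reports one stray limits key and one missing group.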

Files Modified

standard/template/sections/clause_7_unified_data_model.adoc (Section 7.2.3)
standard/template/sections/clause_9_zarr_encoding_overviews.adoc (Sections 9.7.1-9.7.5)

These changes maintain backward compatibility while providing clear guidance for implementations and expanding support for Earth observation use cases beyond web mapping.

cc @d-v-b, @geospatial-jeff, @vincentsarago


geospatial-jeff

Overall a big step in the right direction!!

geospatial-jeff · 2025-08-20T14:14:35Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

+- Factor of 3 (nonary tree): Each zoom level has 9x more tiles, useful for certain scientific gridding schemes
+- Other integer factors: Application-specific requirements may dictate alternative decimation
+
+When using non-standard decimation factors, the TileMatrixSet definition MUST explicitly specify the matrixWidth and matrixHeight values for each TileMatrix to ensure correct spatial alignment and resolution relationships. Implementations MUST NOT assume factor-of-2 scaling between zoom levels unless explicitly defined in the TileMatrixSet.
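
To illustrate why the factor should be read from the TileMatrixSet rather than assumed, here is a small sketch that derives the per-axis factor between consecutive tile matrices from `matrixWidth` and `matrixHeight`. It is an illustration only: the function name is hypothetical, the list is assumed to be ordered from coarsest to finest, and tile size and extent are assumed identical across levels.

```python
from fractions import Fraction


def decimation_factors(tile_matrices):
    """Return (coarser_id, finer_id, x_factor, y_factor) for consecutive levels.

    Assumes the list is ordered coarsest first and that every level covers
    the same extent with the same tile size (assumptions for this sketch).
    """
    factors = []
    for coarse, fine in zip(tile_matrices, tile_matrices[1:]):
        fx = Fraction(fine["matrixWidth"], coarse["matrixWidth"])
        fy = Fraction(fine["matrixHeight"], coarse["matrixHeight"])
        factors.append((coarse["id"], fine["id"], fx, fy))
    return factors


# Hypothetical factor-of-3 pyramid, not taken from the PR.
tile_matrices = [
    {"id": "0", "matrixWidth": 1, "matrixHeight": 1},
    {"id": "1", "matrixWidth": 3, "matrixHeight": 3},
    {"id": "2", "matrixWidth": 9, "matrixHeight": 9},
]
for coarse_id, fine_id, fx, fy in decimation_factors(tile_matrices):
    print(coarse_id, "->", fine_id, "x factor", fx, "y factor", fy)
```

Using `Fraction` keeps non-integer ratios exact, so a reader can reject a pyramid whose ratio is not the integer it expects instead of silently rounding.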


geospatial-jeff · 2025-08-20T14:27:55Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

@@ -7,7 +7,26 @@ Multiscale datasets are composed of a set of Zarr groups representing multiple z

 ==== Hierarchical Layout

-Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group. Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution.
+Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier specified in the associated TileMatrixSet (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group containing the multiscales metadata attribute. Each zoom-level group SHALL be a Dataset (as defined in Section 7.4.1) and SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. All zoom level groups MUST have the same member keys to ensure structural consistency across resolutions.


NIT. We should try to use the terms "child", "root", "group", and "dataset" consistently here. For example:

`Zarr group` should be described as a "child group". It's already explicit that these are Zarr groups.

Is `multiscale root group` the same as a "root dataset"?

`Each zoom-level group SHALL be a Dataset`: if zoom levels are described as "groups", it's a bit confusing to call them "datasets", given we already have the concept of a "root dataset" which isn't a "child group".

I think part of the confusion here is that Section 9.1 describes the "root dataset" as a "dataset" when it's actually a group, just like the child groups. The term "dataset" is a bit overloaded in the current spec.

> The term "dataset" is a bit overloaded in the current spec.

Agreed. I think "dataset" should be restricted to denote a Zarr group that contains Zarr arrays with a consistent relationship between data variables and coordinate variables. I don't think the group tasked with containing a set of datasets (i.e., a multiscale group) needs to be a dataset itself.

> Zarr group should be described as a "child group". It's already explicit these are zarr groups.

OK

> Is multiscale root group the same as a "root dataset"?

It might be, but not necessarily. A Zarr store can contain multiple multiscale groups.

> Each zoom-level group SHALL be a Dataset if zoom-levels are described as "groups" its a bit confusing to call them "datasets" given we already have the concept of a "root dataset" which isn't a "child group".

I believe we also need to use the term "store" for the root "dataset" group, with "child group" meaning any group in the store's data tree.

I will propose another PR for that specific topic and then align this PR with the terminology.

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

christophenoel

Reviewed for the comments provided in #83 about the native data.

christophenoel · 2025-09-03T12:10:31Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

@@ -7,7 +7,26 @@ Multiscale datasets are composed of a set of Zarr groups representing multiple z

 ==== Hierarchical Layout



My suggestion:

Suggested change

A **Dataset Group** (as defined in Section 7.4.1) SHALL hold the native resolution variables directly.

Multiscale overviews, if present, SHALL be represented as Zarr groups identified by the Tile Matrix identifiers specified in the associated TileMatrixSet (e.g., `"1"`, `"2"`, `"3"`).

Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution.

A zoom-level group MAY contain multiple data variables. To ensure structural consistency, all zoom-level groups SHALL have the same member keys.

The Dataset Group SHALL contain a `multiscales` attribute that defines the TileMatrixSet reference.

Child groups representing zoom levels SHALL use group names that exactly match the Tile Matrix identifier values from the referenced TileMatrixSet.

The presence and naming of zoom-level groups is determined by the `tileMatrices` array in the TileMatrixSet definition, excluding the native resolution.

maybe we should review #89 first

christophenoel · 2025-09-03T12:13:07Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

+----
+/measurements/r10m/          # Multiscale root group with multiscales metadata
+├── 0/                       # Native resolution (zoom level 0)  
+│   ├── band1                # Data variable at zoom level 0
+│   ├── band2                # Data variable at zoom level 0
+│   └── spatial_ref          # Coordinate reference variable
+├── 1/                       # First overview level
+│   ├── band1                # Data variable at zoom level 1
+│   ├── band2                # Data variable at zoom level 1
+│   └── spatial_ref          # Coordinate reference variable
+└── 2/                       # Second overview level
+    ├── band1                # Data variable at zoom level 2
+    ├── band2                # Data variable at zoom level 2
+    └── spatial_ref          # Coordinate reference variable
+----


Side effect from previous comment:

Suggested change

----
/measurements/r10m/          # Dataset Group with native resolution and multiscales metadata
├── band1                    # Native resolution variable
├── band2
├── spatial_ref
├── 1/                       # First overview level
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── 2/                       # Second overview level
    ├── band1
    ├── band2
    └── spatial_ref
----

The original layout described a uniform model, where all child datasets have the same layout. This change moves to a non-uniform model, where one child dataset is treated differently. Why opt for a non-uniform layout? And how would this layout handle upsampling (i.e., resampling the original data on a finer sampling grid), which is a common practice in machine learning?

I would not characterise this as a move to a non-uniform model. The layout remains uniform: all child datasets representing zoom levels have the same internal structure.

The only distinction is that the native dataset is not treated as a child, but as the Dataset Group itself.

It is important to be careful here:

- Initial datasets typically do not include overviews; these are optional.
- The presence of overviews should not change the core dataset model or how a dataset is defined.
- Clients may legitimately choose not to support overviews, while still supporting the core dataset.

With this in mind, keeping the native resolution in the Dataset Group maintains consistency with the definition of a dataset, while allowing optional overviews without altering the underlying model.

I forgot to mention that, for Zarr, remodelling the native dataset into a 0/ child group when adding overviews would be time-consuming and inefficient, so avoiding this step is an important consideration.

> With this in mind, keeping the native resolution in the Dataset Group maintains consistency with the definition of a dataset, while allowing optional overviews without altering the underlying model.

The price of consistency here is complexity -- clients will need to distinguish between "dataset that's just a collection of variables" from "dataset that's a collection of variables and also sub-datasets". These are two very different arrangements from a modelling POV.

Instead of broadening the scope of a dataset, I would define a new element in the hierarchy ("multiscale dataset") that has exactly 1 job: containing datasets that together define a set of zoom levels. A multiscale dataset has a very narrowly defined layout (all members are datasets; there must be at least 1 member; all members are consistent with each other). And a multiscale dataset has very narrowly defined attributes: it must contain the "multiscales" key.

This simplifies the dataset definition, because datasets are no longer tasked with two roles (containing variables and / or containing other datasets).

another concern: if datasets can contain more datasets, then it's possible to recurse. This means, when you open a dataset, you don't know how far you are from the root, which is problematic for parsers if they have to traverse up or down the tree to navigate.

I think it's much simpler if datasets are constrained to only contain data variables, and nothing else. This means a parser can be sure that the hierarchy has "leaf nodes", which makes the hierarchy easier to model.
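
The "multiscale dataset" proposal above can be sketched as a two-level data model. This is only an illustration of the idea under stated assumptions: the class names are hypothetical, variables are modelled as a plain mapping of name to shape, and "consistent with each other" is interpreted here as "same variable names", which the comment does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Leaf node: holds only variables (name -> shape), never sub-groups."""
    variables: dict


@dataclass
class MultiscaleDataset:
    """Container whose only job is to hold one Dataset per zoom level."""
    attributes: dict
    levels: dict  # zoom-level name -> Dataset

    def validate(self):
        if "multiscales" not in self.attributes:
            raise ValueError("missing 'multiscales' attribute")
        if not self.levels:
            raise ValueError("at least one zoom level is required")
        if not all(isinstance(ds, Dataset) for ds in self.levels.values()):
            raise ValueError("every member must be a Dataset")
        # Assumed meaning of "consistent": identical variable names per level.
        name_sets = {frozenset(ds.variables) for ds in self.levels.values()}
        if len(name_sets) != 1:
            raise ValueError("zoom levels do not share the same variable names")


pyramid = MultiscaleDataset(
    attributes={"multiscales": {}},
    levels={
        "0": Dataset({"band1": (1024, 1024), "band2": (1024, 1024)}),
        "1": Dataset({"band1": (512, 512), "band2": (512, 512)}),
    },
)
pyramid.validate()  # passes silently for this example
```

Because `Dataset` cannot hold groups in this model, a parser always reaches a leaf after at most one step down from a `MultiscaleDataset`, which is the property the comment argues for.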

I tried to clarify this in #89. Maybe we should get there first.

> The price of consistency here is complexity -- clients will need to distinguish between "dataset that's just a collection of variables"

I disagree:

- Client not supporting overviews: just reads the dataset as usual (a strong requirement for some stakeholders)
- Client supporting overviews: reads the multiscales attribute to retrieve the downscaled levels

> This simplifies the dataset definition, because datasets are no longer tasked with two roles (containing variables and / or containing other datasets).

In CDM/NetCDF, a dataset can include other datasets; that is a matter of fact.

> another concern: if datasets can contain more datasets, then it's possible to recurse.

I understand your concerns, but this is how NetCDF works.
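
The two client behaviours described in this reply can be sketched with an in-memory stand-in for a Dataset Group. Everything here is an assumption made for illustration: the nested-dict representation, the function names, and the `tile_matrix_set` / `tileMatrices` layout of the `multiscales` attribute. No Zarr library API is implied.

```python
def read_variables(group):
    """Overview-unaware client: use only arrays stored directly in the group."""
    return sorted(group["arrays"])


def overview_levels(group):
    """Overview-aware client: consult 'multiscales' to find the overview groups."""
    multiscales = group["attrs"].get("multiscales")
    if multiscales is None:
        return []
    ids = [tm["id"] for tm in multiscales["tile_matrix_set"]["tileMatrices"]]
    # Keep only identifiers that actually exist as child groups.
    return [level for level in ids if level in group["groups"]]


# Hypothetical Dataset Group holding native data plus two overview levels.
dataset_group = {
    "attrs": {
        "multiscales": {
            "tile_matrix_set": {"tileMatrices": [{"id": "1"}, {"id": "2"}]}
        }
    },
    "arrays": {"band1": None, "band2": None, "spatial_ref": None},
    "groups": {"1": {}, "2": {}},
}

print(read_variables(dataset_group))
print(overview_levels(dataset_group))
```

In this sketch the first client never looks at `groups` or `attrs`, so adding overviews later does not change what it reads.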

christophenoel · 2025-09-03T12:15:06Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

+Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier specified in the associated TileMatrixSet (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group containing the multiscales metadata attribute. Each zoom-level group SHALL be a Dataset (as defined in Section 7.4.1) and SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. All zoom level groups SHALL have the same member keys to ensure structural consistency across resolutions.
+
+The multiscale root group SHALL contain a multiscales attribute that defines the TileMatrixSet reference. Child groups representing zoom levels SHALL use group names that exactly match the TileMatrix identifier values from the referenced TileMatrixSet. The presence and naming of zoom level groups is determined by the tileMatrices array in the TileMatrixSet definition.


A Dataset Group, which contains one or more data variables, may or may not include overviews. I believe the Dataset Group itself should hold the native resolution directly (a dedicated 0/ zoom-level group would not exist), since overviews can be added later and may not even be used or supported by the client.

My suggestion:

Suggested change

A **Dataset Group** (as defined in Section 7.4.1) SHALL hold the native resolution variables directly.

Multiscale overviews, if present, SHALL be represented as Zarr groups identified by the Tile Matrix identifiers specified in the associated TileMatrixSet (e.g., `"1"`, `"2"`, `"3"`).

Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. A zoom-level group MAY contain multiple data variables. To ensure structural consistency, all zoom-level groups SHALL have the same member keys.

The Dataset Group SHALL contain a `multiscales` attribute that defines the TileMatrixSet reference.

Child groups representing zoom levels SHALL use group names that exactly match the Tile Matrix identifier values from the referenced TileMatrixSet. The presence and naming of zoom-level groups is determined by the `tileMatrices` array in the TileMatrixSet definition, excluding the native resolution.
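
As a rough illustration of the suggested wording, the sketch below derives the expected overview group names from a `multiscales` attribute and checks that all zoom-level groups share the same member keys. The attribute layout (`tile_matrix_set` containing `tileMatrices`), the `native_id` parameter, and both function names are assumptions for this example, not text from the specification.

```python
def expected_overview_groups(multiscales, native_id="0"):
    """List child group names implied by tileMatrices, skipping the native level."""
    tile_matrices = multiscales["tile_matrix_set"]["tileMatrices"]
    return [tm["id"] for tm in tile_matrices if tm["id"] != native_id]


def share_member_keys(level_members):
    """True if every zoom-level group lists exactly the same member keys."""
    key_sets = {frozenset(keys) for keys in level_members.values()}
    return len(key_sets) <= 1


# Hypothetical attribute and members mirroring the r10m layout in this thread.
multiscales = {
    "tile_matrix_set": {"tileMatrices": [{"id": "0"}, {"id": "1"}, {"id": "2"}]}
}
members = {
    "1": ["band1", "band2", "spatial_ref"],
    "2": ["band1", "band2", "spatial_ref"],
}

print(expected_overview_groups(multiscales))
print(share_member_keys(members))
```

If the TileMatrixSet already omits the native level, the `native_id` filter simply has no effect.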

christophenoel · 2025-09-03T12:18:43Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

@@ -20,7 +39,7 @@ Each zoom level SHALL be represented as a Zarr group, identified by the Tile Mat
 |Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes`
 |===

-Each multiscale group MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.
+Each multiscale group SHALL define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.


I don't see why chunking should be defined differently for overviews. The rules should be given for any Data Variable in the relevant section.
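
For a sense of scale on the recommended chunk sizes quoted in the diff, the sketch below counts chunks per axis for a single 2-D array. The array shape is a hypothetical value chosen for illustration and does not come from the PR; the helper name is likewise made up.

```python
import math


def chunk_grid(shape, chunk):
    """Number of chunks along each axis when edge chunks may be partial."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunk))


shape = (10980, 10980)  # hypothetical array size, not taken from the PR
for side in (256, 512):
    grid = chunk_grid(shape, (side, side))
    print(side, grid, grid[0] * grid[1])
```

The same arithmetic applies to native and overview arrays alike, which is consistent with the point that chunking rules need not be specific to overviews.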

christophenoel · 2025-09-03T12:22:27Z

standard/template/sections/clause_9_zarr_encoding_overviews.adoc

+- The TileMatrixSet definition (whether referenced by identifier or included inline) specifies 
+  the exact set of zoom levels through its tileMatrices array
+- Each TileMatrix.id value corresponds to a required child group in the multiscale hierarchy
+- Variable discovery within each zoom level group follows standard Zarr metadata conventions


Suggested change

- Variable discovery within each zoom level group follows standard Zarr metadata conventions

- The TileMatrix.id `"0"` corresponds to the native resolution stored directly in the Dataset Group, and is therefore not provided in the tile matrix set.

emmanuelmathot added 2 commits August 19, 2025 15:14

Enhance multiscale overview encoding documentation with detailed structure and custom TileMatrixSet support

640d218

Update TileMatrix identifiers and add custom decimation requirements for GeoZarr specification

1931cee

d-v-b reviewed Aug 19, 2025

Clarify requirements for zoom level groups in multiscale overview encoding

85be066

geospatial-jeff reviewed Aug 20, 2025

emmanuelmathot mentioned this pull request Aug 22, 2025

Clarify terminology across specification #89

Open

comments integrated

02f2891

emmanuelmathot requested review from d-v-b and geospatial-jeff August 22, 2025 08:24

vincentsarago approved these changes Aug 25, 2025

christophenoel reviewed Sep 3, 2025

christophenoel mentioned this pull request Sep 3, 2025

Multiscale hierarchy structure needs clarification #83

Open

