GeoZarr Multiscales Clarifications #86
…cture and custom TileMatrixSet support
…for GeoZarr specification
standard/template/sections/clause_9_zarr_encoding_overviews.adoc
Overall a big step in the right direction!!
- Factor of 3 (nonary tree): Each zoom level has 9x more tiles, useful for certain scientific gridding schemes
- Other integer factors: Application-specific requirements may dictate alternative decimation

When using non-standard decimation factors, the TileMatrixSet definition MUST explicitly specify the matrixWidth and matrixHeight values for each TileMatrix to ensure correct spatial alignment and resolution relationships. Implementations MUST NOT assume factor-of-2 scaling between zoom levels unless explicitly defined in the TileMatrixSet.
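To make the requirement above concrete, here is a minimal sketch (not part of the spec text) of how `matrixWidth` and `matrixHeight` might be derived per level for an arbitrary integer decimation factor. The helper name `tile_matrices`, the 512-pixel tile size, the 10980-pixel grid, and the convention that id `"0"` is the finest level are all assumptions made for illustration.

```python
import math


def tile_matrices(width, height, factor, levels, tile_size=512):
    """Hypothetical helper: per-level matrix sizes for an integer decimation factor.

    Assumes level "0" is the native (finest) grid of width x height pixels and
    that each following level is decimated by `factor` along both axes.
    """
    matrices = []
    for level in range(levels):
        scale = factor ** level
        level_width = math.ceil(width / scale)
        level_height = math.ceil(height / scale)
        matrices.append(
            {
                "id": str(level),
                # Number of tiles needed to cover the level, rounded up.
                "matrixWidth": math.ceil(level_width / tile_size),
                "matrixHeight": math.ceil(level_height / tile_size),
            }
        )
    return matrices


# Factor-of-3 pyramid over an assumed 10980 x 10980 pixel grid.
for tm in tile_matrices(10980, 10980, factor=3, levels=3):
    print(tm)
```

Because the values are written out per TileMatrix, a reader never has to guess the factor from the level index.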
❤️
@@ -7,7 +7,26 @@ Multiscale datasets are composed of a set of Zarr groups representing multiple z

==== Hierarchical Layout

Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group. Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution.
Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier specified in the associated TileMatrixSet (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group containing the multiscales metadata attribute. Each zoom-level group SHALL be a Dataset (as defined in Section 7.4.1) and SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. All zoom level groups MUST have the same member keys to ensure structural consistency across resolutions.
NIT. We should try to use the terms "child", "root", "group", and "dataset" consistently here. For example:
- `Zarr group` should be described as a "child group". It's already explicit that these are Zarr groups.
- Is `multiscale root group` the same as a "root dataset"?
- `Each zoom-level group SHALL be a Dataset`: if zoom levels are described as "groups", it's a bit confusing to call them "datasets", given we already have the concept of a "root dataset" which isn't a "child group".
I think part of the confusion here is section 9.1 describing the "root dataset" as a "dataset" when it's actually a group just like child groups. The term "dataset" is a bit overloaded in the current spec.

The term "dataset" is a bit overloaded in the current spec.
agree. I think "dataset" should be restricted to denote a Zarr group that contains Zarr arrays that have a consistent data variable--coordinate variable relationship. I don't think the group tasked with containing a set of datasets (i.e., a multiscale group) needs to be itself a dataset.
- Zarr group should be described as a "child group". It's already explicit these are zarr groups.
OK
- Is multiscale root group the same as a "root dataset"?
It might be, but not necessarily. A Zarr store can contain multiple multiscale groups.
- Each zoom-level group SHALL be a Dataset if zoom-levels are described as "groups" its a bit confusing to call them "datasets" given we already have the concept of a "root dataset" which isn't a "child group".
I believe we also need to use the term `store` for the root "dataset" group, and `child group` for any group in the store data tree.
I will propose another PR for that specific topic and then align this PR with the terminology.
See #89
Reviewed for the comments provided in #83 about the native data.
@@ -7,7 +7,26 @@ Multiscale datasets are composed of a set of Zarr groups representing multiple z

==== Hierarchical Layout
My suggestion:
A **Dataset Group** (as defined in Section 7.4.1) SHALL hold the native resolution variables directly.
Multiscale overviews, if present, SHALL be represented as Zarr groups identified by the Tile Matrix identifiers specified in the associated TileMatrixSet (e.g., `"1"`, `"2"`, `"3"`).
Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution.
A zoom-level group MAY contain multiple data variables. To ensure structural consistency, all zoom-level groups SHALL have the same member keys.
The Dataset Group SHALL contain a `multiscales` attribute that defines the TileMatrixSet reference.
Child groups representing zoom levels SHALL use group names that exactly match the Tile Matrix identifier values from the referenced TileMatrixSet.
The presence and naming of zoom-level groups is determined by the `tileMatrices` array in the TileMatrixSet definition, excluding the native resolution.
maybe we should review #89 first
----
/measurements/r10m/           # Multiscale root group with multiscales metadata
├── 0/                        # Native resolution (zoom level 0)
│   ├── band1                 # Data variable at zoom level 0
│   ├── band2                 # Data variable at zoom level 0
│   └── spatial_ref           # Coordinate reference variable
├── 1/                        # First overview level
│   ├── band1                 # Data variable at zoom level 1
│   ├── band2                 # Data variable at zoom level 1
│   └── spatial_ref           # Coordinate reference variable
└── 2/                        # Second overview level
    ├── band1                 # Data variable at zoom level 2
    ├── band2                 # Data variable at zoom level 2
    └── spatial_ref           # Coordinate reference variable
----
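The "same member keys" requirement for this uniform layout could be checked along the following lines. This is only a sketch: the helper name `check_uniform_layout` is hypothetical, and the layout is modelled as a plain dictionary of group names to member names rather than read from a real Zarr store.

```python
def check_uniform_layout(zoom_groups):
    """Hypothetical check: report zoom-level groups whose member keys differ.

    `zoom_groups` maps each zoom-level group name to its member names.
    The first group (in sorted name order) is used as the reference.
    Returns a dict of group name -> sorted list of differing member names.
    """
    names = sorted(zoom_groups)
    if not names:
        raise ValueError("a multiscale root group needs at least one zoom level")
    reference = set(zoom_groups[names[0]])
    problems = {}
    for name in names[1:]:
        members = set(zoom_groups[name])
        if members != reference:
            # Symmetric difference: members missing from or extra to this group.
            problems[name] = sorted(members ^ reference)
    return problems


layout = {
    "0": ["band1", "band2", "spatial_ref"],
    "1": ["band1", "band2", "spatial_ref"],
    "2": ["band1", "spatial_ref"],  # band2 is missing on purpose
}
print(check_uniform_layout(layout))
```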
Side effect from previous comment:
----
/measurements/r10m/           # Dataset Group with native resolution and multiscales metadata
├── band1                     # Native resolution variable
├── band2
├── spatial_ref
├── 1/                        # First overview level
│   ├── band1
│   ├── band2
│   └── spatial_ref
└── 2/                        # Second overview level
    ├── band1
    ├── band2
    └── spatial_ref
----
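As a rough illustration of how a reader might treat this suggested layout, the sketch below separates the native-resolution variables held directly in the Dataset Group from the overview child groups. The helper `split_dataset_group` and the `"array"`/`"group"` labels are assumptions for the example; a real client would take the node kind from the Zarr metadata.

```python
def split_dataset_group(members):
    """Hypothetical helper: separate native variables from overview child groups.

    `members` maps each member name to its node kind, "array" or "group".
    """
    native_variables = sorted(k for k, kind in members.items() if kind == "array")
    overview_groups = sorted(k for k, kind in members.items() if kind == "group")
    return native_variables, overview_groups


members = {
    "band1": "array",
    "band2": "array",
    "spatial_ref": "array",
    "1": "group",
    "2": "group",
}
native, overviews = split_dataset_group(members)
print(native)
print(overviews)
```

Under this layout, a client that does not support overviews would read only the arrays and ignore the child groups.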
The original layout described a uniform model, where all child datasets have the same layout. This change moves to a non-uniform model, where one child dataset is treated differently. Why opt for a non-uniform layout? And how would this layout handle upsampling (i.e., resampling the original data on a finer sampling grid), which is a common practice in machine learning?
I would not characterise this as a move to a non-uniform model. The layout remains uniform: all child datasets representing zoom levels have the same internal structure.
The only distinction is that the native dataset is not treated as a child, but as the Dataset Group itself.
It is important to be careful here:
- Initial datasets typically do not include overviews; these are optional.
- The presence of overviews should not change the core dataset model or how a dataset is defined.
- Clients may legitimately choose not to support overviews, while still supporting the core dataset.
With this in mind, keeping the native resolution in the Dataset Group maintains consistency with the definition of a dataset, while allowing optional overviews without altering the underlying model.
I forgot to mention that, for Zarr, remodelling the native dataset into a 0/ child group when adding overviews would be time-consuming and inefficient, so avoiding this step is an important consideration.
With this in mind, keeping the native resolution in the Dataset Group maintains consistency with the definition of a dataset, while allowing optional overviews without altering the underlying model.
The price of consistency here is complexity -- clients will need to distinguish between "dataset that's just a collection of variables" from "dataset that's a collection of variables and also sub-datasets". These are two very different arrangements from a modelling POV.
Instead of broadening the scope of a dataset, I would define a new element in the hierarchy ("multiscale dataset") that has exactly 1 job: containing datasets that together define a set of zoom levels. A multiscale dataset has a very narrowly defined layout (all members are datasets; there must be at least 1 member; all members are consistent with each other). And a multiscale dataset has very narrowly defined attributes: it must contain the "multiscales" key.
This simplifies the dataset definition, because datasets are no longer tasked with two roles (containing variables and / or containing other datasets).
another concern: if datasets can contain more datasets, then it's possible to recurse. This means, when you open a dataset, you don't know how far you are from the root, which is problematic for parsers if they have to traverse up or down the tree to navigate.
I think it's much simpler if datasets are constrained to only contain data variables, and nothing else. This means a parser can be sure that the hierarchy has "leaf nodes", which makes the hierarchy easier to model.
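To illustrate the "leaf node" property being discussed, here is a small sketch that walks a hierarchy and lists groups holding both arrays and child groups. The nested-dictionary model (a dict value is a child group, anything else stands in for an array) and the name `nested_datasets` are assumptions for the example, not anything defined by the spec.

```python
def nested_datasets(node, path=""):
    """Yield paths of groups that hold arrays and also contain child groups.

    `node` is a nested dict: a dict value is a child group, any other
    value stands in for an array.
    """
    has_arrays = any(not isinstance(v, dict) for v in node.values())
    children = {k: v for k, v in node.items() if isinstance(v, dict)}
    if has_arrays and children:
        yield path or "/"
    for name, child in children.items():
        yield from nested_datasets(child, f"{path}/{name}")


tree = {
    "measurements": {
        "r10m": {
            "band1": None,           # native-resolution array
            "1": {"band1": None},    # overview child group
        }
    }
}
print(list(nested_datasets(tree)))
```

If datasets were constrained to contain only variables, this walk would return an empty list for every conforming hierarchy.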
I tried to clarify this in #89. Maybe we should get there first.
The price of consistency here is complexity -- clients will need to distinguish between "dataset that's just a collection of variables"
I disagree:
- Client not supporting overviews: just reading the dataset as usual (strong requirement for some stakeholders)
- Client supporting overviews: reading the multiscales attribute to retrieve the downscales
This simplifies the dataset definition, because datasets are no longer tasked with two roles (containing variables and / or containing other datasets).
In CDM/NetCDF, a dataset can include other datasets... that's a matter of fact.
another concern: if datasets can contain more datasets, then it's possible to recurse.
I understand your concerns, but this is how NetCDF works.
Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier specified in the associated TileMatrixSet (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group containing the multiscales metadata attribute. Each zoom-level group SHALL be a Dataset (as defined in Section 7.4.1) and SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. All zoom level groups SHALL have the same member keys to ensure structural consistency across resolutions.

The multiscale root group SHALL contain a multiscales attribute that defines the TileMatrixSet reference. Child groups representing zoom levels SHALL use group names that exactly match the TileMatrix identifier values from the referenced TileMatrixSet. The presence and naming of zoom level groups is determined by the tileMatrices array in the TileMatrixSet definition.
A Dataset Group, which contains one or multiple data variables, may include or exclude overviews. I believe the Dataset Group itself should hold the native resolution directly (a dedicated 0/ zoom-level group would not exist), since overviews can be added later and may not even be used or supported by the client.
My suggestion:
A **Dataset Group** (as defined in Section 7.4.1) SHALL hold the native resolution variables directly.
Multiscale overviews, if present, SHALL be represented as Zarr groups identified by the Tile Matrix identifiers specified in the associated TileMatrixSet (e.g., `"1"`, `"2"`, `"3"`).
Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution. A zoom-level group MAY contain multiple data variables. To ensure structural consistency, all zoom-level groups SHALL have the same member keys.
The Dataset Group SHALL contain a `multiscales` attribute that defines the TileMatrixSet reference.
Child groups representing zoom levels SHALL use group names that exactly match the Tile Matrix identifier values from the referenced TileMatrixSet. The presence and naming of zoom-level groups is determined by the `tileMatrices` array in the TileMatrixSet definition, excluding the native resolution.
@@ -20,7 +39,7 @@ Each zoom level SHALL be represented as a Zarr group, identified by the Tile Mat
|Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes`
|===
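The table row above places `multiscales` in the parent `.zattrs` for one Zarr format and under `attributes` in the parent group `zarr.json` for the other. A minimal sketch of a reader handling both cases is shown below; it assumes a store laid out as plain files on a local filesystem, and the helper name `read_multiscales` is hypothetical.

```python
import json
from pathlib import Path


def read_multiscales(group_dir):
    """Hypothetical helper: return the `multiscales` attribute of a group, or None.

    Assumes the store is a directory on the local filesystem. Looks for
    `zarr.json` (attributes nested under "attributes") first, then `.zattrs`
    (attributes at the top level of the JSON document).
    """
    group_dir = Path(group_dir)
    zarr_json = group_dir / "zarr.json"
    if zarr_json.is_file():
        metadata = json.loads(zarr_json.read_text())
        return metadata.get("attributes", {}).get("multiscales")
    zattrs = group_dir / ".zattrs"
    if zattrs.is_file():
        return json.loads(zattrs.read_text()).get("multiscales")
    return None
```

Calling `read_multiscales("/path/to/measurements/r10m")` on a group without either file simply returns `None`, so a caller can treat overviews as absent.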
Each multiscale group MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.
Each multiscale group SHALL define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.
I don't see why the chunking should be defined differently for overviews. The rules should be provided for any Data Variable in the related section.
- The TileMatrixSet definition (whether referenced by identifier or included inline) specifies
  the exact set of zoom levels through its tileMatrices array
- Each TileMatrix.id value corresponds to a required child group in the multiscale hierarchy
- Variable discovery within each zoom level group follows standard Zarr metadata conventions
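The discovery rule in the bullets above can be sketched as follows, assuming an inline TileMatrixSet given as a dictionary with a `tileMatrices` list whose entries carry an `id`. The helper names and the `"custom"` identifier are hypothetical, and the example TileMatrixSet is deliberately stripped down to the fields used here.

```python
def required_child_groups(tile_matrix_set):
    """Return the child group names implied by the tileMatrices array."""
    return [str(tm["id"]) for tm in tile_matrix_set["tileMatrices"]]


def missing_child_groups(tile_matrix_set, existing_groups):
    """Return required child groups that are absent from `existing_groups`."""
    existing = set(existing_groups)
    return [
        name
        for name in required_child_groups(tile_matrix_set)
        if name not in existing
    ]


# Stripped-down inline TileMatrixSet; only the ids matter for discovery.
tms = {"id": "custom", "tileMatrices": [{"id": "0"}, {"id": "1"}, {"id": "2"}]}
print(required_child_groups(tms))
print(missing_child_groups(tms, ["0", "1"]))
```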
- Variable discovery within each zoom level group follows standard Zarr metadata conventions
- The TileMatrix.id `"0"` corresponds to the native resolution stored directly in the Dataset Group, and is therefore not provided in the tile matrix set.
This PR implements specification clarifications for multiscale overviews in GeoZarr, addressing ambiguities in TileMatrixSet integration and group structure discovery.
Fixes issues #79, #81, #83
Key Changes
Enhanced TileMatrixSet Support:
Improved Group Discovery:
Custom TileMatrixSet Integration:
Consistency Requirements:
Files Modified
- standard/template/sections/clause_7_unified_data_model.adoc (Section 7.2.3)
- standard/template/sections/clause_9_zarr_encoding_overviews.adoc (Sections 9.7.1-9.7.5)

These changes maintain backward compatibility while providing clear guidance for implementations and expanding support for Earth observation use cases beyond web mapping.
cc @d-v-b, @geospatial-jeff, @vincentsarago