Skip to content

Conversation

keewis
Copy link
Contributor

@keewis keewis commented Aug 26, 2025

For some reason a dataset I've been working on produces BytesCodec instances where endian is None, which causes the translation from v3 to v2 metadata to fail.

No tests because I didn't check thoroughly on where these should be.

(I've seen that you're planning to migrate to v3 entirely, in which case this won't be necessary anymore. It's such a simple fix, though, that this might still be worth it)

  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@maxrjones
Copy link
Member

For some reason a dataset I've been working on produces BytesCodec instances where endian is None, which causes the translation from v3 to v2 metadata to fail.

are you able to share this dataset? If not, what dtype is this for? I'm wondering if it's a single byte dtype where the configuration should be empty. I'm reluctant to accept this change because an endian of None seems to be out-of-spec.

@keewis
Copy link
Contributor Author

keewis commented Aug 27, 2025

are you able to share this dataset?

Not sure, I'll have to find a public version of it.

what dtype is this for?

these are single-byte dtypes (in my case, int8, after a CF scale+offset compression). The encoding in the virtual dataset is indeed empty, but it still creates a BytesCodec instance with endian=None. The repr of v3_metadata is:

(BytesCodec(endian=None), Shuffle(codec_name='numcodecs.shuffle', codec_config={'elementsize': 1}), Zlib(codec_name='numcodecs.zlib', codec_config={'level': 1}))

which I assume is the default for this dtype?

@d-v-b
Copy link

d-v-b commented Aug 27, 2025

from the v3 spec:

endian:

Required for data types for which endianness is applicable. For example, this includes multi-byte data types, such as uint16 and int32, but not single-byte data types, such as uint8 or bool. If present, the value MUST be a string equal to either "big" or "little".

@d-v-b
Copy link

d-v-b commented Aug 27, 2025

but that's describing a JSON object, not an in-memory python object. in zarr python, our in-memory representation of the bytes codec uses endian=None to indicate that the JSON object should have endian key unset.

@keewis
Copy link
Contributor Author

keewis commented Aug 27, 2025

not exactly the same dataset, but this also fails with that error:

from obstore.store import from_url
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset

url = "https://data-cersat.ifremer.fr/data/sea-surface-temperature/odyssea/l4/glob/nrt/data/v2.1/2021/001/20210101000000-IFR-L4_GHRSST-SSTfnd-ODYSSEA-GLOB_010-v02.1-fv01.0.nc"
store = from_url(url)
registry = ObjectStoreRegistry({url: store})

vds = open_virtual_dataset(
    url=url,
    loadable_variables=[],
    parser=HDFParser(),
    registry=registry,
)
display(vds)
vds.vz.to_kerchunk("refs.parquet", format="parquet")

@keewis
Copy link
Contributor Author

keewis commented Aug 27, 2025

in digging through the zarr code, I noticed that

BytesCodec.from_dict({"name": "bytes", "configuration": {}})
BytesCodec.from_dict({"name": "bytes"})

both get the default system endian because that's the default for __init__, and it is only when explicitly passing None that that appears:

BytesCodec.from_dict({"name": "bytes", "configuration": {"endian": None}})

(but that would be against the spec, right?)

Also:

from zarr.codecs import BytesCodec

codec = BytesCodec(endian=None)
roundtripped = BytesCodec.from_dict(codec.to_dict())

assert codec.endian == roundtripped.endian

fails for the same reason.

@d-v-b, any idea on what the intended behavior should be?

Not sure if there's a better fix for this, but the `getattr` call basically guarantees
that if `codec` does not have the `endian` attribute the last condition
is never reached.
@keewis
Copy link
Contributor Author

keewis commented Sep 2, 2025

the failing RTD build looks unrelated (it fails while creating the env)

@betolink
Copy link

betolink commented Sep 2, 2025

We just ran into this and was about to open the same PR, thanks for working on this @keewis. It'll be great if this gets merged and we get a new release since this is a blocker for all earthaccess virtualizarr workflows. cc @maxrjones

@maxrjones maxrjones merged commit 49789a6 into zarr-developers:main Sep 3, 2025
13 checks passed
@keewis keewis deleted the endianness branch September 3, 2025 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants