Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707

mkitti · 2025-02-13T12:27:20Z

Zstandard can use a streaming compression scheme where the total size of the data is not known at the beginning of the process. In this case, the size of the data is unknown
and is not saved in the Zstandard frame header.

Before this pull request, numcodecs would refuse to decompress data if the size were unknown. This pull request adds a routine to decompress data if the size is unknown,
specifically when ZSTD_getFrameContentSize returns ZSTD_CONTENTSIZE_UNKNOWN.

This pull request is based on prior pull request I made to numcodecs.js:
manzt/numcodecs.js#47

Fixes zarr-developers/zarr-python#2056

xref:
zarr-developers/zarr-python#2056

TODO:

Unit tests and/or doctests in docstrings
Tests pass locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
Docs build locally
GitHub Actions CI passes
Test coverage to 100% (Codecov passes)

mkitti · 2025-02-24T15:20:23Z

@StephanPreibisch

codecov · 2025-06-27T23:06:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.96%. Comparing base (4fdb625) to head (313d534).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #707   +/-   ##
=======================================
  Coverage   99.96%   99.96%           
=======================================
  Files          63       64    +1     
  Lines        2712     2792   +80     
=======================================
+ Hits         2711     2791   +80     
  Misses          1        1

Files with missing lines	Coverage Δ
numcodecs/tests/test_pyzstd.py	`100.00% <100.00%> (ø)`
numcodecs/tests/test_zstd.py	`100.00% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mkitti · 2025-06-27T23:37:23Z

Before

In [1]: import numcodecs

In [2]: numcodecs.__version__
Out[2]: '0.16.1'

In [3]: codec = numcodecs.Zstd()

In [4]: bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'

In [5]: codec.decode(bytes_val)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 codec.decode(bytes_val)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: invalid input data

In [6]: bytes3 = b'(\xb5/\xfd\x00X$\x02\x00\xa4\x03ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz\x01\x00:\xfc\xdfs\x05\x05L\x00\x00\x08s\x01\x00\xfc\xff9\x10\x02L\x00\x00
      ⋮ \x08k\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08c\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08[\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08S\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08K\x01\x00\xfc\xf
      ⋮ f9\x10\x02L\x00\x00\x08C\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08u\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08m\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08e\x01\x00\xfc\xff9\x10\x02L\x00\x00\
      ⋮ x08]\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08U\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08M\x01\x00\xfc\xff9\x10\x02M\x00\x00\x08E\x01\x00\xfc\x7f\x1d\x08\x01'

In [7]: codec.decode(bytes3)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 codec.decode(bytes3)

File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()

File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()

RuntimeError: Zstd decompression error: invalid input data

After

In [1]: import numcodecs

In [2]: codec = numcodecs.Zstd()

In [3]: bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'

In [4]: codec.decode(bytes_val)
Out[4]: b'Hello World!'

In [5]: bytes3 = b'(\xb5/\xfd\x00X$\x02\x00\xa4\x03ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz\x01\x00:\xfc\xdfs\x05\x05L\x00\x00\x08s\x01\x00\xfc\xff9\x10\x02L\x00\x00
      ⋮ \x08k\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08c\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08[\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08S\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08K\x01\x00\xfc\xf
      ⋮ f9\x10\x02L\x00\x00\x08C\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08u\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08m\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08e\x01\x00\xfc\xff9\x10\x02L\x00\x00\
      ⋮ x08]\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08U\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08M\x01\x00\xfc\xff9\x10\x02M\x00\x00\x08E\x01\x00\xfc\x7f\x1d\x08\x01'

In [6]: codec.decode(bytes3) == b'ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz' * 32 * 1024
Out[6]: True

mkitti · 2025-06-27T23:38:03Z

@d-v-b @jakirkham could you review?

mkitti · 2025-06-28T00:00:52Z

Consider the following script:

import zarr, tensorstore as ts, numpy as np
arr = ts.open({
    'driver': 'zarr',
    'kvstore': {
        'driver': 'file',
        'path': 'ts_zarr2_zstd',
    },
    'metadata': {
        'compressor': {
            'id': 'zstd',
            'level': 3,
        },
        'shape': [1024, 1024],
        'chunks': [64, 64],
        'dtype': '|u1',
    }
}, create=True, delete_existing=True).result()
arr[:,:] = np.random.randint(0, 9, size=(1024,1024), dtype='u1')
arr2 = zarr.open_array("ts_zarr2_zstd")
print(arr2[:,:])

Before

$ python ts_zarr2_zstd.py 
Traceback (most recent call last):
  File "/home/mkitti/src/numcodecs/before/ts_zarr2_zstd.py", line 20, in <module>
    print(arr2[:,:])
          ~~~~^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 2441, in __getitem__
    return self.get_orthogonal_selection(pure_selection, fields=fields)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/_compat.py", line 43, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 2883, in get_orthogonal_selection
    return sync(
           ^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/sync.py", line 163, in sync
    raise return_result
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/sync.py", line 119, in _runner
    return await coro
           ^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/array.py", line 1298, in _get_selection
    await self.codec_pipeline.read(
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 464, in read
    await concurrent_map(
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 69, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 67, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 270, in read_batch
    chunk_array_batch = await self.decode_batch(
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 190, in decode_batch
    chunk_array_batch = await ab_codec.decode(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 129, in decode
    return await _batching_helper(self._decode_single, chunks_and_specs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 407, in _batching_helper
    return await concurrent_map(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 69, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/core/common.py", line 67, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/abc/codec.py", line 420, in wrap
    return await func(chunk, chunk_spec)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/site-packages/zarr/codecs/_v2.py", line 36, in _decode_single
    chunk = await asyncio.to_thread(self.compressor.decode, cdata)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mkitti/src/numcodecs/before/.pixi/envs/default/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "numcodecs/zstd.pyx", line 261, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 191, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data

After

$ python before/ts_zarr2_zstd.py 
[[4 3 6 ... 3 4 2]
 [1 7 7 ... 0 8 5]
 [2 4 2 ... 0 1 2]
 ...
 [7 1 3 ... 4 5 4]
 [0 6 1 ... 4 1 4]
 [5 6 8 ... 6 1 7]]

d-v-b · 2025-06-28T07:34:17Z

it seems like compatibility with tensorstore is part of the goal here. Would it make sense to add an integration test that uses tensorstore?

mkitti · 2025-06-28T10:51:24Z

it seems like compatibility with tensorstore is part of the goal here. Would it make sense to add an integration test that uses tensorstore?

This be better done in another pull request. It would be good to add zarr and tensorstore as optional test dependencies.

Also note that the zstd compatability issue only occurs when writing Zarr v2 with tensorstore.

The closest analog to numcodecs for tensorstore is riegeli where most of the compression routines are implemented.

mkitti · 2025-06-30T18:08:48Z

@normanrz , you may be interested in taking a look as well with regard to Zstandard.

d-v-b · 2025-07-03T09:46:38Z

numcodecs/tests/test_zstd.py

+    bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'
+    dec = codec.decode(bytes_val)
+    assert dec == b'Hello World!'


where did these bytes come from? Ideally we would have a test that generated a streaming output from another zstd tool, and used that as an input. Is this particularly onerous to test?

alternatively, include explanatory reference information from the zstd spec as a comment. basically imagine someone else coming to do maintenance on this test -- how will they know where to look to decipher bytes_val?

Could we compare it the zstd CLI?

In [1]: import subprocess In [2]: subprocess.run(["zstd","--no-check"], input=b"Hello world!", capture_out ...: put=True).stdout Out[2]: b'(\xb5/\xfd\x00Xa\x00\x00Hello world!'

$ echo -n "Hello world!" | zstd --no-check | hd 00000000 28 b5 2f fd 00 58 61 00 00 48 65 6c 6c 6f 20 77 |(./..Xa..Hello w| 00000010 6f 72 6c 64 21 |orld!| 00000015

or should we break down the frame format:
https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md

whatever's easiest for you. the goal is to upgrade from a blob of (apparently) random bytes in our test to something that has a clear tie to the zstd spec.

The hard part for me is figuring out how to manage the test dependency and what the basic requirements are there. If we need to cobble this together relying on PyPI alone, then it probably makes sense to pull in python-zstandard or pyzstd as a test dependency

If we could do a conda package, then the conda-forge zstd package would probably the way to go. In this case, we would not need to rely on another 3d party wrapper but could just depend on the 1st party command line interface.

https://anaconda.org/conda-forge/zstd

Alternatively, we could also use the system package manager to install the zstd CLI.

The byte strings are there to decouple the dependency. They are the same as the bytes in the numcodecs.js test implementation.

https://github.com/manzt/numcodecs.js/blob/main/test%2Fzstd.test.js

I suppose what we could do is just add a bunch of tests that are conditional on the available tools or packages which would be optional dependencies.

I still think we should keep the byte strings there for the case when the no other tools are available. We could of course better document how to generate those bytes.

I still think we should keep the byte strings there for the case when the no other tools are available. We could of course better document how to generate those bytes.

this is also fine, which is why I said earlier that a comment explaining where the bytes came from would be sufficient.

can you declare the python zstd package as a test dependency?

Which one?

https://pypi.org/project/pyzstd/
https://pypi.org/project/zstd/
https://pypi.org/project/zstandard/

I'm leaning towards pyzstd since it is relatively complete and well maintained.

i don't care which one 🤣

I added pyzstd tests. I even included a test which decompressing multiple frames in a buffer, which currently fails. I marked it as a xfail.

The multiple frames issue can be resolved by #757

d-v-b · 2025-07-03T09:47:50Z

this looks good but as I noted in this comment I think it would be great to test against the output of another zstd tool instead of hand-writing bytestreams

mkitti · 2025-07-03T23:30:05Z

I'm considering an optimization by using a bytearray structure earlier to avoid a memory copy at the end:
mkitti@f089739

However, this requires locking the GIL.

I would need to do some performance testing.

numcodecs/tests/test_pyzstd.py

normanrz · 2025-07-10T18:48:05Z

This is great work. Thank you so much!

I'm considering an optimization by using a bytearray structure earlier to avoid a memory copy at the end: mkitti@f089739

However, this requires locking the GIL.

In zarr-python, we use multithreading for en/decoding multiple chunks in parallel. Therefore, I think locking the GIL would not be great.

d-v-b

thanks mark!

Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case

d04e536

d-v-b mentioned this pull request Feb 25, 2025

zstd failure for unsepcified buffer size #424

Closed

mkitti and others added 4 commits June 27, 2025 15:43

Add tests and documentation for streaming Zstd

1096df7

Add better tests from numcodecs.js

a903ee5

Merge branch 'main' into mkitti-zstd-stream-decompress

ccddf41

Adapt zstd.pyx streaming for Py_buffer

d3049ca

Formatting with ruff

834b74d

mkitti marked this pull request as ready for review June 27, 2025 23:37

Fix zstd comparison of different signedness

a2c7fb0

Undo change to unrelated test

0dcffe6

d-v-b reviewed Jul 3, 2025

View reviewed changes

mkitti added 4 commits July 3, 2025 16:32

Add zstd cli tests

4edc73a

Test Zstd against pyzstd

16cc9b3

Apply ruff

308caf3

Fix zstd tests, coverage

1f5670f

mkitti added 2 commits July 9, 2025 22:17

Make imports not optional

dcf2989

Add EndlessZstdDecompressor tests

78f26fc

d-v-b reviewed Jul 10, 2025

View reviewed changes

numcodecs/tests/test_pyzstd.py Outdated Show resolved Hide resolved

d-v-b mentioned this pull request Jul 10, 2025

zarr-python cannot read arrays saved by tensorstore using the zstd compressor zarr-developers/zarr-python#2056

Closed

Add docstrings to test_pyzstd.py

313d534

d-v-b approved these changes Jul 10, 2025

View reviewed changes

d-v-b merged commit 7a6fad3 into zarr-developers:main Jul 10, 2025
30 checks passed

mkitti mentioned this pull request Jul 10, 2025

Zstd compression does not encode content size in header google/tensorstore#182

Open

Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707

Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707

Uh oh!

Conversation

mkitti commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkitti commented Feb 24, 2025

Uh oh!

codecov bot commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mkitti commented Jun 27, 2025

Before

After

Uh oh!

mkitti commented Jun 27, 2025

Uh oh!

mkitti commented Jun 28, 2025

Before

After

Uh oh!

d-v-b commented Jun 28, 2025

Uh oh!

mkitti commented Jun 28, 2025

Uh oh!

mkitti commented Jun 30, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

d-v-b commented Jul 3, 2025

Uh oh!

mkitti commented Jul 3, 2025

Uh oh!

Uh oh!

normanrz commented Jul 10, 2025

Uh oh!

d-v-b left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mkitti commented Feb 13, 2025 •

edited

Loading

codecov bot commented Jun 27, 2025 •

edited

Loading