-
Notifications
You must be signed in to change notification settings - Fork 100
Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add streaming decompression for ZSTD_CONTENTSIZE_UNKNOWN case #707
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #707 +/- ##
=======================================
Coverage 99.96% 99.96%
=======================================
Files 63 64 +1
Lines 2712 2792 +80
=======================================
+ Hits 2711 2791 +80
Misses 1 1
🚀 New features to boost your workflow:
|
BeforeIn [1]: import numcodecs
In [2]: numcodecs.__version__
Out[2]: '0.16.1'
In [3]: codec = numcodecs.Zstd()
In [4]: bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!'
In [5]: codec.decode(bytes_val)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 codec.decode(bytes_val)
File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()
File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()
RuntimeError: Zstd decompression error: invalid input data
In [6]: bytes3 = b'(\xb5/\xfd\x00X$\x02\x00\xa4\x03ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz\x01\x00:\xfc\xdfs\x05\x05L\x00\x00\x08s\x01\x00\xfc\xff9\x10\x02L\x00\x00
⋮ \x08k\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08c\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08[\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08S\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08K\x01\x00\xfc\xf
⋮ f9\x10\x02L\x00\x00\x08C\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08u\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08m\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08e\x01\x00\xfc\xff9\x10\x02L\x00\x00\
⋮ x08]\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08U\x01\x00\xfc\xff9\x10\x02L\x00\x00\x08M\x01\x00\xfc\xff9\x10\x02M\x00\x00\x08E\x01\x00\xfc\x7f\x1d\x08\x01'
In [7]: codec.decode(bytes3)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 codec.decode(bytes3)
File numcodecs/zstd.pyx:261, in numcodecs.zstd.Zstd.decode()
File numcodecs/zstd.pyx:191, in numcodecs.zstd.decompress()
RuntimeError: Zstd decompression error: invalid input data After
|
@d-v-b @jakirkham could you review? |
Consider the following script: import zarr, tensorstore as ts, numpy as np
arr = ts.open({
'driver': 'zarr',
'kvstore': {
'driver': 'file',
'path': 'ts_zarr2_zstd',
},
'metadata': {
'compressor': {
'id': 'zstd',
'level': 3,
},
'shape': [1024, 1024],
'chunks': [64, 64],
'dtype': '|u1',
}
}, create=True, delete_existing=True).result()
arr[:,:] = np.random.randint(0, 9, size=(1024,1024), dtype='u1')
arr2 = zarr.open_array("ts_zarr2_zstd")
print(arr2[:,:]) Before
After
|
it seems like compatibility with tensorstore is part of the goal here. Would it make sense to add an integration test that uses tensorstore? |
This be better done in another pull request. It would be good to add zarr and tensorstore as optional test dependencies. Also note that the zstd compatability issue only occurs when writing Zarr v2 with tensorstore. The closest analog to numcodecs for tensorstore is riegeli where most of the compression routines are implemented. |
@normanrz , you may be interested in taking a look as well with regard to Zstandard. |
numcodecs/tests/test_zstd.py
Outdated
bytes_val = b'(\xb5/\xfd\x00Xa\x00\x00Hello World!' | ||
dec = codec.decode(bytes_val) | ||
assert dec == b'Hello World!' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where did these bytes come from? Ideally we would have a test that generated a streaming output from another zstd tool, and used that as an input. Is this particularly onerous to test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, include explanatory reference information from the zstd spec as a comment. basically imagine someone else coming to do maintenance on this test -- how will they know where to look to decipher bytes_val
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we compare it the zstd
CLI?
In [1]: import subprocess
In [2]: subprocess.run(["zstd","--no-check"], input=b"Hello world!", capture_out
...: put=True).stdout
Out[2]: b'(\xb5/\xfd\x00Xa\x00\x00Hello world!'
$ echo -n "Hello world!" | zstd --no-check | hd
00000000 28 b5 2f fd 00 58 61 00 00 48 65 6c 6c 6f 20 77 |(./..Xa..Hello w|
00000010 6f 72 6c 64 21 |orld!|
00000015
or should we break down the frame format:
https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whatever's easiest for you. the goal is to upgrade from a blob of (apparently) random bytes in our test to something that has a clear tie to the zstd spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hard part for me is figuring out how to manage the test dependency and what the basic requirements are there. If we need to cobble this together relying on PyPI alone, then it probably makes sense to pull in python-zstandard or pyzstd as a test dependency
If we could do a conda package, then the conda-forge zstd package would probably the way to go. In this case, we would not need to rely on another 3d party wrapper but could just depend on the 1st party command line interface.
https://anaconda.org/conda-forge/zstd
Alternatively, we could also use the system package manager to install the zstd CLI.
The byte strings are there to decouple the dependency. They are the same as the bytes in the numcodecs.js test implementation.
https://github.com/manzt/numcodecs.js/blob/main/test%2Fzstd.test.js
I suppose what we could do is just add a bunch of tests that are conditional on the available tools or packages which would be optional dependencies.
I still think we should keep the byte strings there for the case when the no other tools are available. We could of course better document how to generate those bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should keep the byte strings there for the case when the no other tools are available. We could of course better document how to generate those bytes.
this is also fine, which is why I said earlier that a comment explaining where the bytes came from would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you declare the python zstd package as a test dependency?
Which one?
https://pypi.org/project/pyzstd/
https://pypi.org/project/zstd/
https://pypi.org/project/zstandard/
I'm leaning towards pyzstd since it is relatively complete and well maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't care which one 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added pyzstd tests. I even included a test which decompressing multiple frames in a buffer, which currently fails. I marked it as a xfail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multiple frames issue can be resolved by #757
this looks good but as I noted in this comment I think it would be great to test against the output of another zstd tool instead of hand-writing bytestreams |
I'm considering an optimization by using a However, this requires locking the GIL. I would need to do some performance testing. |
This is great work. Thank you so much!
In zarr-python, we use multithreading for en/decoding multiple chunks in parallel. Therefore, I think locking the GIL would not be great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks mark!
Zstandard can use a streaming compression scheme where the total size of the data is not known at the beginning of the process. In this case, the size of the data is unknown
and is not saved in the Zstandard frame header.
Before this pull request, numcodecs would refuse to decompress data if the size were unknown. This pull request adds a routine to decompress data if the size is unknown,
specifically when
ZSTD_getFrameContentSize
returnsZSTD_CONTENTSIZE_UNKNOWN
.This pull request is based on prior pull request I made to numcodecs.js:
manzt/numcodecs.js#47
Fixes zarr-developers/zarr-python#2056
xref:
zarr-developers/zarr-python#2056
TODO: