Support rechunking to seasonal frequency with SeasonalResampler #10519
base: main
Conversation
@@ -201,7 +201,7 @@ def copy(
  # FYI in some cases we don't allow `None`, which this doesn't take account of.
  # FYI the `str` is for a size string, e.g. "16MB", supported by dask.
  T_ChunkDim: TypeAlias = str | int | Literal["auto"] | tuple[int, ...] | None  # noqa: PYI051
- T_ChunkDimFreq: TypeAlias = Union["TimeResampler", T_ChunkDim]
+ T_ChunkDimFreq: TypeAlias = Union["TimeResampler", "SeasonResampler", T_ChunkDim]
Suggested change:
- T_ChunkDimFreq: TypeAlias = Union["TimeResampler", "SeasonResampler", T_ChunkDim]
+ T_ChunkDimFreq: TypeAlias = Union["Resampler", T_ChunkDim]
Good time to generalize here
@@ -52,6 +52,7 @@
  "EncodedGroups",
  "Grouper",
  "Resampler",
+ "SeasonResampler",
Suggested change:
- "SeasonResampler",
+ "SeasonGrouper",
+ "SeasonResampler",
This was an oversight.
if variable.ndim != 1:
    raise ValueError(
        f"chunks={self!r} only supported for 1D variables. "
        f"Received variable {name!r} with {variable.ndim} dimensions instead."
    )
I think this check should go in `_resolve_frequency` in dataset.py
# Create a temporary resampler that ignores drop_incomplete for chunking
# This prevents data from being silently dropped during chunking
resampler_for_chunking = (
    self._for_chunking() if hasattr(self, "_for_chunking") else self
)
Suggested change (remove these lines):
- # Create a temporary resampler that ignores drop_incomplete for chunking
- # This prevents data from being silently dropped during chunking
- resampler_for_chunking = (
-     self._for_chunking() if hasattr(self, "_for_chunking") else self
- )
Instead let's define this `compute_chunks` method independently on TimeResampler and SeasonResampler. We would like to add FloatResampler (#4008) in the future, so we should keep it general. For example, the "contains datetime" check should be specific to the resamplers.
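A sketch of the design the reviewer is proposing, with placeholder bodies; the class and method names follow the review, but the implementations here are illustrative stand-ins:

```python
from abc import ABC, abstractmethod


# Each concrete resampler owns its compute_chunks logic, so the shared
# chunking code needs no hasattr() checks or isinstance() special cases,
# and a future FloatResampler can plug in without touching dataset.py.
class Resampler(ABC):
    @abstractmethod
    def compute_chunks(self, name, variable) -> tuple[int, ...]: ...


class TimeResampler(Resampler):
    def compute_chunks(self, name, variable):
        # placeholder: a real implementation would resample the time
        # coordinate and may check that it contains datetime values
        return (len(variable),)


class SeasonResampler(Resampler):
    def compute_chunks(self, name, variable):
        # placeholder: a real implementation would ignore drop_incomplete
        # while chunking to avoid silently dropping data
        return (len(variable),)


print(TimeResampler().compute_chunks("time", [1, 2, 3]))
```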
@@ -169,7 +170,60 @@ class Resampler(Grouper):
  Currently only used for TimeResampler, but could be used for SpaceResampler in the future.
  """

- pass
+ def resolve_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:
Suggested change:
- def resolve_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:
+ def compute_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:
@@ -65,7 +69,7 @@ Bug fixes
  creates extra variables that don't match the provided coordinate names, instead
  of silently ignoring them. The error message suggests using the factory method
  pattern with :py:meth:`xarray.Coordinates.from_xindex` and
- :py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`).
+ :py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, , :pull:`10503`).
Suggested change:
- :py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, , :pull:`10503`).
+ :py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, :pull:`10503`).
- Allow skipping the creation of default indexes when opening datasets (:pull:`8051`).
  By `Benoit Bovy <https://github.com/benbovy>`_ and `Justus Magin <https://github.com/keewis>`_.
Suggested change (remove these lines):
- - Allow skipping the creation of default indexes when opening datasets (:pull:`8051`).
-   By `Benoit Bovy <https://github.com/benbovy>`_ and `Justus Magin <https://github.com/keewis>`_.
bad merge
chunks = (
    DataArray(
        np.ones(variable.shape, dtype=int),
        dims=(name,),
        coords={name: variable},
    )
    .resample({name: resampler_for_chunking})
    .sum()
)
# When bins (binning) or time periods are missing (resampling)
# we can end up with NaNs. Drop them.
if chunks.dtype.kind == "f":
    chunks = chunks.dropna(name).astype(int)
chunks_tuple: tuple[int, ...] = tuple(chunks.data.tolist())
If you're interested, a better approach here would be to pass this DataArray to `Resampler.factorize()` and use the `group_indices` of the returned `EncodedGroups` object to figure out the chunks tuple. This is another reason I'd prefer just sending the variable to a method on the Resampler. That way the resamplers are free to optimize if they can.
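A sketch of the factorize-based approach; it assumes `group_indices` is a sequence of slices or integer index arrays (both shapes appear in xarray's groupby machinery), so the chunk sizes follow directly without building and summing a ones array:

```python
# Hypothetical helper: derive the chunks tuple from the group_indices of
# an EncodedGroups object instead of resample(...).sum() on a ones array.
def chunks_from_group_indices(group_indices) -> tuple[int, ...]:
    sizes = []
    for idx in group_indices:
        if isinstance(idx, slice):
            sizes.append(idx.stop - idx.start)
        else:
            sizes.append(len(idx))
    # drop empty groups (e.g. seasons with no data), mirroring the
    # dropna() step in the original code
    return tuple(s for s in sizes if s > 0)


# e.g. three seasonal groups covering a daily time axis
print(chunks_from_group_indices([slice(0, 90), slice(90, 182), [182, 183]]))
```

Because each resampler already knows its own grouping, routing this through a method on the Resampler lets subclasses skip the intermediate DataArray entirely when they can.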
data = create_test_data()
for chunks in [1, 2, 3, 4, 5]:
    rechunked = data.chunk({"dim1": chunks})
    assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
        (8 % chunks,) if 8 % chunks else ()
    )

    rechunked = data.chunk({"dim2": chunks})
    assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
        (9 % chunks,) if 9 % chunks else ()
    )

    rechunked = data.chunk({"dim1": chunks, "dim2": chunks})
    assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
        (8 % chunks,) if 8 % chunks else ()
    )
    assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
        (9 % chunks,) if 9 % chunks else ()
    )
What is the point of this test?
# Test standard seasons
rechunked = ds.chunk(x=2, time=SeasonResampler(["DJF", "MAM", "JJA", "SON"]))
expected = tuple(
why is expected needed?
import dask.array

N = 365 * 2  # 2 years
time = xr.date_range("2001-01-01", periods=N, freq="D")
let's parameterize this over use_cftime=[True, False]
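A sketch of the requested parameterization; the test body is elided here, and the commented-out `xr.date_range` call with `use_cftime` is how the toggle would typically be threaded through:

```python
import pytest


# Sketch of the reviewer's request: run the seasonal chunking test for
# both numpy datetime64 and cftime-backed time coordinates.
@pytest.mark.parametrize("use_cftime", [True, False])
def test_chunk_by_season_resampler(use_cftime):
    # time = xr.date_range("2001-01-01", periods=730, freq="D",
    #                      use_cftime=use_cftime)
    ...
```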
whats-new.rst
api.rst
Previously, users could not use `SeasonResampler` for chunking operations in xarray, despite it being a natural fit for seasonal data analysis. When attempting `ds.chunk(time=SeasonResampler(["DJF", "MAMJ", "JAS", "ON"]))`, users encountered obscure errors because the chunking logic was hardcoded to work only with `TimeResampler` objects. This limitation prevented efficient seasonal analysis workflows and forced users to rely on workarounds or manual chunking strategies.

This PR adds a generalized chunking approach: a `resolve_chunks` method on the `Resampler` base class, plus chunking logic updated to work with all `Resampler` objects, not just `TimeResampler`. It also adds a `_for_chunking` method to `SeasonResampler` that ensures `drop_incomplete=False` during chunking operations, preventing silent data loss. The solution maintains full backward compatibility with existing `TimeResampler` functionality while enabling seamless seasonal chunking.