Merged
63 commits
eb1a967
All works, just need to satisfy mypy and whatnot now
charles-turner-1 Jul 13, 2025
852476d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 13, 2025
c921c59
Merge branch 'main' into autochunk-cftime
charles-turner-1 Jul 13, 2025
1aba531
Fix moving import to be optional
charles-turner-1 Jul 13, 2025
9429c3d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 13, 2025
3c9d27e
Make mypy happy
charles-turner-1 Jul 13, 2025
5153d2d
Add some clarifying comments about what we need to do to optimise this
charles-turner-1 Jul 13, 2025
62e71e6
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 14, 2025
cfdc31b
@dcherian's suggestions. Just need to update chunking strategy to res…
charles-turner-1 Jul 14, 2025
2f16bc7
Merge branch 'main' of https://github.com/charles-turner-1/xarray
charles-turner-1 Jul 14, 2025
ce720fa
Merge branch 'main' into autochunk-cftime
charles-turner-1 Jul 14, 2025
4fa58c1
Merge branch 'main' into autochunk-cftime
charles-turner-1 Jul 23, 2025
e58d6d7
Can now load cftime arrays with auto-chunking. Implementation still k…
charles-turner-1 Jul 23, 2025
590e503
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 23, 2025
f953976
Test for autochunking when reading from disk
charles-turner-1 Jul 25, 2025
6706524
replace `build_chunkspec` with faking the dtype of a cftime array & a…
charles-turner-1 Jul 25, 2025
4e56acd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 25, 2025
0d008cd
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 25, 2025
49c4e9c
Merge branch 'main' into autochunk-cftime
charles-turner-1 Jul 25, 2025
4594099
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 25, 2025
5d00b0a
Remove redundant comments, rename things to make them clearer, add mo…
charles-turner-1 Jul 25, 2025
80421ef
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 25, 2025
d1f7ad3
Refactor to move most of the changes into the DaskManager
charles-turner-1 Jul 25, 2025
1b7de62
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 25, 2025
4407185
bare-min tests should pass now?
charles-turner-1 Jul 25, 2025
d8f45b2
Deepak's suggestions (think mypy is still going to be angry for now)
charles-turner-1 Jul 28, 2025
20226c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 28, 2025
11ac9f0
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Jul 28, 2025
8485df5
Fix errant line
charles-turner-1 Jul 28, 2025
2c27877
Clean up `DaskManager.rechunk` a bit - maybe possible to remove more …
charles-turner-1 Jul 28, 2025
0983261
Remove unused import
charles-turner-1 Jul 28, 2025
c4ec31f
Merge branch 'main' into autochunk-cftime
charles-turner-1 Aug 8, 2025
adbf5b2
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Aug 8, 2025
6c93bc4
Fix a couple of type errors
charles-turner-1 Aug 8, 2025
74bc0ea
Mypy & tests passing locally
charles-turner-1 Aug 8, 2025
0b9bbd0
Merge branch 'main' into autochunk-cftime
charles-turner-1 Aug 12, 2025
e58322f
Merge branch 'main' into autochunk-cftime
charles-turner-1 Aug 20, 2025
dbc6ebd
Merge branch 'main' into autochunk-cftime
charles-turner-1 Aug 24, 2025
5680663
Merge branch 'autochunk-cftime' of https://github.com/charles-turner-…
charles-turner-1 Aug 24, 2025
b5933ed
Deepak's comments
charles-turner-1 Aug 25, 2025
5db9225
Merge branch 'main' into autochunk-cftime
charles-turner-1 Aug 26, 2025
600c0fd
Merge branch 'main' into autochunk-cftime
charles-turner-1 Sep 3, 2025
9fcc6eb
Merge branch 'main' into autochunk-cftime
charles-turner-1 Sep 8, 2025
dc83692
Merge branch 'main' into autochunk-cftime
charles-turner-1 Sep 10, 2025
1e1bbf3
Merge branch 'main' into autochunk-cftime
charles-turner-1 Sep 11, 2025
9443815
Edits
dcherian Sep 18, 2025
db52c62
Merge branch 'pydata:main' into autochunk-cftime
charles-turner-1 Sep 22, 2025
85ebafd
Start refactoring `get_chunk` into named_array - xfail marker on the …
charles-turner-1 Sep 23, 2025
e2627c6
Optimise `fake_target_chunksize` (@dcherian suggestion)
charles-turner-1 Sep 23, 2025
0bca828
WIP
charles-turner-1 Sep 23, 2025
a930a65
Everything seems to be working - some type issues though I think
charles-turner-1 Sep 23, 2025
cbcb640
object => cftime - zarr failures...
charles-turner-1 Sep 23, 2025
70208e0
Merge branch 'main' into autochunk-cftime
charles-turner-1 Sep 23, 2025
3f0d3aa
Fix typing
charles-turner-1 Sep 23, 2025
e944eb4
Don't just import Variable in typing clause
charles-turner-1 Sep 23, 2025
1393351
Merge branch 'main' into autochunk-cftime
charles-turner-1 Oct 2, 2025
90242d1
Merge branch 'main' into autochunk-cftime
charles-turner-1 Oct 9, 2025
92bb538
Cleanup
dcherian Oct 13, 2025
1e3a015
Remove Variable handling
dcherian Oct 13, 2025
861cc57
Try more
dcherian Oct 13, 2025
16ccc78
bugfix
dcherian Oct 13, 2025
1bd2f32
typing
dcherian Oct 13, 2025
9dead77
Merge branch 'main' into autochunk-cftime
dcherian Oct 13, 2025
14 changes: 12 additions & 2 deletions xarray/backends/api.py
@@ -35,7 +35,8 @@
from xarray.core.utils import emit_user_level_warning, is_remote_uri
from xarray.namedarray.daskmanager import DaskManager
from xarray.namedarray.parallelcompat import guess_chunkmanager
from xarray.structure.chunks import _get_chunk, _maybe_chunk
from xarray.namedarray.utils import _get_chunk
from xarray.structure.chunks import _maybe_chunk
from xarray.structure.combine import (
_infer_concat_order_from_positions,
_nested_combine,
@@ -244,7 +245,16 @@ def _chunk_ds(

variables = {}
for name, var in backend_ds.variables.items():
var_chunks = _get_chunk(var, chunks, chunkmanager)
if var._in_memory:
variables[name] = var
continue
var_chunks = _get_chunk(
var._data,
chunks,
chunkmanager,
preferred_chunks=var.encoding.get("preferred_chunks", {}),
dims=var.dims,
)
variables[name] = _maybe_chunk(
name,
var,
6 changes: 6 additions & 0 deletions xarray/namedarray/daskmanager.py
@@ -264,3 +264,9 @@ def shuffle(
if chunks != "auto":
raise NotImplementedError("Only chunks='auto' is supported at present.")
return dask.array.shuffle(x, indexer, axis, chunks="auto")

def get_auto_chunk_size(self) -> int:
Contributor

@tomwhite is there an equivalent for cubed? I didn't see it in the docs...

from dask import config as dask_config
from dask.utils import parse_bytes

return parse_bytes(dask_config.get("array.chunk-size"))
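
To make the return value of `get_auto_chunk_size` concrete: dask's `array.chunk-size` setting is a human-readable string (the dask default is `"128MiB"`), and `parse_bytes` turns it into an integer byte count. The sketch below is a simplified, standard-library stand-in for that conversion. The helper name `parse_size_sketch` and its unit table are assumptions made for illustration only and are not dask's implementation.

```python
import re

# Hypothetical unit table for illustration; dask's parse_bytes accepts more spellings.
_UNITS = {
    "B": 1,
    "kB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "KiB": 2**10,
    "MiB": 2**20,
    "GiB": 2**30,
}


def parse_size_sketch(text: str) -> int:
    """Convert a string such as '128MiB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(float(number) * _UNITS[unit])


# With dask's default setting, the auto chunk size would be 128 * 2**20 bytes.
default_limit = parse_size_sketch("128MiB")
```

This byte count is what the cftime branch of `_get_chunk` later rescales before handing it to `normalize_chunks`.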
33 changes: 32 additions & 1 deletion xarray/namedarray/parallelcompat.py
@@ -346,7 +346,14 @@ def rechunk(
dask.array.Array.rechunk
cubed.Array.rechunk
"""
return data.rechunk(chunks, **kwargs)
from xarray.core.common import _contains_cftime_datetimes
from xarray.namedarray.utils import _get_chunk

if _contains_cftime_datetimes(data):
chunks2 = _get_chunk(data, chunks, self, preferred_chunks={}) # type: ignore[arg-type]
else:
chunks2 = chunks # type: ignore[assignment]
return data.rechunk(chunks2, **kwargs)

@abstractmethod
def compute(
@@ -746,3 +753,27 @@ def store(
cubed.store
"""
raise NotImplementedError()

def get_auto_chunk_size(
self,
) -> int:
"""
Get the default chunk size for a variable.

This is used to determine the chunk size when opening a dataset with
``chunks="auto"`` or when rechunking an array with ``chunks="auto"``.

Parameters
----------
target_chunksize : int, optional
The target chunk size in bytes. If not provided, a default value is used.

Returns
-------
chunk_size : int
The chunk size in bytes.
"""

raise NotImplementedError(
"For 'auto' rechunking of cftime arrays, get_auto_chunk_size must be implemented by the chunk manager"
)
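
The pattern in this hunk is an opt-in hook: the base class raises `NotImplementedError`, and only chunk managers that want `"auto"` chunking of cftime arrays need to override it. The sketch below illustrates that pattern with stand-in classes so that it runs without xarray installed. `ChunkManagerSketch` and `FixedBudgetManager` are hypothetical names, not part of xarray, and the 64 MiB budget is an arbitrary assumption.

```python
class ChunkManagerSketch:
    """Stand-in for the abstract chunk manager base class (hypothetical)."""

    def get_auto_chunk_size(self) -> int:
        # Mirrors the default behaviour in the diff: no byte budget is known.
        raise NotImplementedError(
            "get_auto_chunk_size must be implemented by the chunk manager"
        )


class FixedBudgetManager(ChunkManagerSketch):
    """Hypothetical manager that reports a fixed byte budget per chunk."""

    def __init__(self, budget_bytes: int = 64 * 2**20) -> None:
        self._budget_bytes = budget_bytes

    def get_auto_chunk_size(self) -> int:
        # Return the target chunk size in bytes.
        return self._budget_bytes


manager = FixedBudgetManager()
budget = manager.get_auto_chunk_size()
```

A manager that reads its budget from a configuration system, as `DaskManager` does in this PR, would follow the same shape with a lookup in place of the stored attribute.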
107 changes: 106 additions & 1 deletion xarray/namedarray/utils.py
@@ -1,9 +1,12 @@
from __future__ import annotations

import importlib
import itertools
import sys
import warnings
from collections.abc import Hashable, Iterable, Iterator, Mapping
from functools import lru_cache
from numbers import Number
from typing import TYPE_CHECKING, Any, TypeVar, cast

import numpy as np
@@ -23,7 +26,9 @@
DaskArray = NDArray # type: ignore[assignment, misc]
DaskCollection: Any = NDArray # type: ignore[no-redef]

from xarray.namedarray._typing import _Dim, duckarray
from xarray.core.types import T_ChunkDim
from xarray.namedarray._typing import DuckArray, _Dim, duckarray
from xarray.namedarray.parallelcompat import ChunkManagerEntrypoint


K = TypeVar("K")
@@ -195,6 +200,106 @@ def either_dict_or_kwargs(
return pos_kwargs


def _get_chunk( # type: ignore[no-untyped-def]
data: DuckArray[Any],
chunks,
chunkmanager: ChunkManagerEntrypoint[Any],
*,
preferred_chunks,
dims=None,
) -> Mapping[Any, T_ChunkDim]:
"""
Return map from each dim to chunk sizes, accounting for backend's preferred chunks.
"""
from xarray.core.common import _contains_cftime_datetimes
from xarray.core.utils import emit_user_level_warning
from xarray.structure.chunks import _get_breaks_cached

dims = chunks.keys() if dims is None else dims
shape = data.shape

# Determine the explicit requested chunks.
preferred_chunk_shape = tuple(
itertools.starmap(preferred_chunks.get, zip(dims, shape, strict=True))
)
if isinstance(chunks, Number) or (chunks == "auto"):
chunks = dict.fromkeys(dims, chunks)
chunk_shape = tuple(
chunks.get(dim, None) or preferred_chunk_sizes
for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
)

limit: int | None
if _contains_cftime_datetimes(data):
limit, dtype = fake_target_chunksize(data, chunkmanager.get_auto_chunk_size())
else:
limit = None
dtype = data.dtype

chunk_shape = chunkmanager.normalize_chunks(
chunk_shape,
shape=shape,
dtype=dtype,
limit=limit,
previous_chunks=preferred_chunk_shape,
)
Comment on lines +233 to +245
Contributor

Does this seem fine to you, @charles-turner-1? I wanted to avoid calling get_auto_chunk_size as much as possible.

Contributor Author

charles-turner-1 Oct 13, 2025

Yeah, looks good! fake_target_chunksize also contains the same _contains_cftime_datetimes check and returns early if it's false, so we could remove the check either in fake_target_chunksize or here without causing issues, if you think that's a good idea?

I'm guessing you meant calling fake_target_chunksize in your comment above, in which case we would probably want to either remove the check in that function, or leave it in if we want to reuse fake_target_chunksize elsewhere?


# Warn where requested chunks break preferred chunks, provided that the variable
# contains data.
if data.size: # type: ignore[unused-ignore,attr-defined] # DuckArray protocol doesn't include 'size' - should it?
for dim, size, chunk_sizes in zip(dims, shape, chunk_shape, strict=True):
if preferred_chunk_sizes := preferred_chunks.get(dim):
disagreement = _get_breaks_cached(
size=size,
chunk_sizes=chunk_sizes,
preferred_chunk_sizes=preferred_chunk_sizes,
)
if disagreement:
emit_user_level_warning(
"The specified chunks separate the stored chunks along "
f'dimension "{dim}" starting at index {disagreement}. This could '
"degrade performance. Instead, consider rechunking after loading.",
)

return dict(zip(dims, chunk_shape, strict=True))
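
The first half of `_get_chunk` resolves one requested chunk size per dimension before anything is normalized: a scalar or `"auto"` is broadcast to every dimension, explicit per-dimension requests win, and anything unspecified falls back to the backend's preferred chunks (or the full dimension length). The sketch below isolates only that resolution step in plain Python. The function name `resolve_requested_chunks` is hypothetical, and the `normalize_chunks` call and the warning logic are deliberately left out.

```python
from numbers import Number


def resolve_requested_chunks(dims, shape, chunks, preferred_chunks):
    """Return the per-dimension chunk request before normalization."""
    # Preferred chunks default to the full dimension length when unspecified.
    preferred_shape = tuple(
        preferred_chunks.get(dim, size) for dim, size in zip(dims, shape)
    )
    # A scalar or "auto" applies to every dimension.
    if isinstance(chunks, Number) or chunks == "auto":
        chunks = dict.fromkeys(dims, chunks)
    # An explicit request wins; otherwise fall back to the preferred size.
    return tuple(
        chunks.get(dim) or preferred
        for dim, preferred in zip(dims, preferred_shape)
    )


dims = ("time", "x")
shape = (120, 10)
explicit = resolve_requested_chunks(dims, shape, {"time": 12}, {"x": 5})
auto = resolve_requested_chunks(dims, shape, "auto", {})
fallback = resolve_requested_chunks(dims, shape, {}, {})
```

In the PR, the resulting tuple is what gets passed to `chunkmanager.normalize_chunks`, together with the (possibly faked) dtype and byte limit.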


def fake_target_chunksize(
data: DuckArray[Any],
limit: int,
) -> tuple[int, np.dtype[Any]]:
"""
The `normalize_chunks` algorithm takes a size `limit` in bytes, but will not
work for object dtypes. So we rescale the `limit` to an appropriate one based
on `float64` dtype, and pass that to `normalize_chunks`.

Arguments
---------
data : Variable or ChunkedArray
The data for which we want to determine chunk sizes.
limit : int
The target chunk size in bytes. Passed to the chunk manager's `normalize_chunks` method.
"""

# Short circuit for non-object dtypes
from xarray.core.common import _contains_cftime_datetimes

if not _contains_cftime_datetimes(data):
return limit, data.dtype

from xarray.core.formatting import first_n_items

output_dtype = np.dtype(np.float64)

nbytes_approx: int = sys.getsizeof(first_n_items(data, 1)) # type: ignore[no-untyped-call]
Contributor

I just came across this and I'm not quite sure it's the right size. I think sys.getsizeof is the in-memory size and dtype.itemsize is the uncompressed disk size. Consider, for instance:

import sys
import numpy as np
import cftime

np.dtype(np.float64).itemsize  # 8
sys.getsizeof(np.float64(1.0))  # 32
sys.getsizeof(np.array([1.0], dtype=np.float64))  # 120
sys.getsizeof(cftime.DatetimeGregorian.fromordinal(2450000))  # 112

Contributor

I'm kind of wondering if setting the dtype to np.dtype(np.float64) would suffice.
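
The distinction raised in this thread can be reproduced with the standard library alone: the packed size of a value (what a NumPy `itemsize` reports for `float64`) is smaller than the size of the Python object that boxes it, because `sys.getsizeof` includes the object header. This is a minimal sketch of that point. It uses `struct.calcsize` as a stand-in for `dtype.itemsize`, and the exact boxed size depends on the interpreter build, so no specific value is asserted for it.

```python
import struct
import sys

# Packed size of a C double, the storage format behind float64.
packed_size = struct.calcsize("d")

# Size of the Python float object that wraps the same value, header included.
boxed_size = sys.getsizeof(1.0)

# Per-element overhead of keeping the value as a Python object.
overhead = boxed_size - packed_size
```

The same gap is why an object-dtype array of cftime datetimes occupies far more memory per element than the 8 bytes its pointer slot suggests.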


f64_nbytes = output_dtype.itemsize

limit = int(limit * (f64_nbytes / nbytes_approx))

return limit, output_dtype
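
As a rough worked example of the rescaling in `fake_target_chunksize`: the byte limit is shrunk by the ratio of the `float64` item size to the approximate object size, so that a chunk sized for `float64` holds about as many elements as would fit in the original limit at the object's real size. The sketch below assumes a 128 MiB limit and a 112-byte object, the figure quoted for a cftime datetime in the review comment. Both numbers are illustrative, and `rescale_limit_sketch` is a hypothetical name.

```python
def rescale_limit_sketch(
    limit_bytes: int, object_nbytes: int, fake_itemsize: int = 8
) -> int:
    """Scale a byte limit from object-sized elements to fake_itemsize elements."""
    return int(limit_bytes * (fake_itemsize / object_nbytes))


limit = 128 * 2**20  # assumed byte budget (128 MiB)
object_nbytes = 112  # assumed in-memory size of one cftime object

scaled_limit = rescale_limit_sketch(limit, object_nbytes)

# Elements per chunk when the array is treated as float64 under the scaled limit.
elements_as_float64 = scaled_limit // 8

# Elements per chunk if each element really costs object_nbytes.
elements_as_objects = limit // object_nbytes
```

Both counts come out the same here, which is the intent of the rescaling: the element count per chunk tracks the real memory cost rather than the 8-byte pointer size.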


class ReprObject:
"""Object that prints as the given value, for use with sentinel values."""

53 changes: 2 additions & 51 deletions xarray/structure/chunks.py
@@ -7,12 +7,10 @@
import itertools
from collections.abc import Hashable, Mapping
from functools import lru_cache
from numbers import Number
from typing import TYPE_CHECKING, Any, Literal, TypeVar, Union, overload

from xarray.core import utils
from xarray.core.utils import emit_user_level_warning
from xarray.core.variable import IndexVariable, Variable
from xarray.core.variable import Variable
from xarray.namedarray.parallelcompat import (
ChunkManagerEntrypoint,
get_chunked_array_type,
@@ -23,6 +21,7 @@
from xarray.core.dataarray import DataArray
from xarray.core.dataset import Dataset
from xarray.core.types import T_ChunkDim
from xarray.core.variable import Variable

MissingCoreDimOptions = Literal["raise", "copy", "drop"]

@@ -62,54 +61,6 @@ def _get_breaks_cached(
return None
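
The warning emitted by `_get_chunk` depends on `_get_breaks_cached` reporting the first index at which a requested chunk boundary falls inside a stored chunk. Only the tail of that function appears in this diff, so the following is a simplified sketch of the idea rather than xarray's implementation. It assumes a single integer preferred chunk size, and `first_break_sketch` is a hypothetical name.

```python
import itertools


def first_break_sketch(size, chunk_sizes, preferred_chunk_size):
    """Return the first requested boundary that splits a stored chunk, or None."""
    # Boundaries of the stored (preferred) chunks, excluding the array end.
    preferred_stops = set(range(preferred_chunk_size, size, preferred_chunk_size))
    # Boundaries of the requested chunks, excluding the array end.
    for stop in itertools.accumulate(chunk_sizes[:-1]):
        if stop not in preferred_stops:
            return stop
    return None


# Requested chunks of 4 cut through stored chunks of 5, first at index 4.
misaligned = first_break_sketch(10, (4, 4, 2), 5)

# Requested chunks that are multiples of the stored chunks never split one.
aligned = first_break_sketch(12, (6, 6), 3)
```

A non-`None` result corresponds to the `disagreement` value that triggers the "specified chunks separate the stored chunks" warning.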


def _get_chunk(var: Variable, chunks, chunkmanager: ChunkManagerEntrypoint):
"""
Return map from each dim to chunk sizes, accounting for backend's preferred chunks.
"""
if isinstance(var, IndexVariable):
return {}
dims = var.dims
shape = var.shape

# Determine the explicit requested chunks.
preferred_chunks = var.encoding.get("preferred_chunks", {})
preferred_chunk_shape = tuple(
itertools.starmap(preferred_chunks.get, zip(dims, shape, strict=True))
)
if isinstance(chunks, Number) or (chunks == "auto"):
chunks = dict.fromkeys(dims, chunks)
chunk_shape = tuple(
chunks.get(dim, None) or preferred_chunk_sizes
for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
)

chunk_shape = chunkmanager.normalize_chunks(
chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
)

# Warn where requested chunks break preferred chunks, provided that the variable
# contains data.
if var.size:
for dim, size, chunk_sizes in zip(dims, shape, chunk_shape, strict=True):
try:
preferred_chunk_sizes = preferred_chunks[dim]
except KeyError:
continue
disagreement = _get_breaks_cached(
size=size,
chunk_sizes=chunk_sizes,
preferred_chunk_sizes=preferred_chunk_sizes,
)
if disagreement:
emit_user_level_warning(
"The specified chunks separate the stored chunks along "
f'dimension "{dim}" starting at index {disagreement}. This could '
"degrade performance. Instead, consider rechunking after loading.",
)

return dict(zip(dims, chunk_shape, strict=True))


def _maybe_chunk(
name: Hashable,
var: Variable,
27 changes: 27 additions & 0 deletions xarray/tests/test_backends.py
@@ -60,6 +60,7 @@
from xarray.coding.variables import SerializationWarning
from xarray.conventions import encode_dataset_coordinates
from xarray.core import indexing
from xarray.core.common import _contains_cftime_datetimes
from xarray.core.indexes import PandasIndex
from xarray.core.options import set_options
from xarray.core.types import PDDatetimeUnitOptions
@@ -6238,6 +6239,32 @@ def test_open_multi_dataset(self) -> None:
) as actual:
assert_identical(expected, actual)

@requires_cftime
def test_open_dataset_cftime_autochunk(self) -> None:
"""Create a dataset with cftime datetime objects and
ensure that auto-chunking works correctly."""
import cftime

original = xr.Dataset(
{
"foo": ("time", [0.0]),
"time_bnds": (
("time", "bnds"),
[
[
cftime.Datetime360Day(2005, 12, 1, 0, 0, 0, 0),
cftime.Datetime360Day(2005, 12, 2, 0, 0, 0, 0),
]
],
),
},
{"time": [cftime.Datetime360Day(2005, 12, 1, 12, 0, 0, 0)]},
)
with self.roundtrip(original, open_kwargs={"chunks": "auto"}) as actual:
assert isinstance(actual.time_bnds.variable.data, da.Array)
assert _contains_cftime_datetimes(actual.time)
assert_identical(original, actual)

# Flaky test. Very open to contributions on fixing this
@pytest.mark.flaky
def test_dask_roundtrip(self) -> None:
15 changes: 15 additions & 0 deletions xarray/tests/test_dask.py
@@ -1161,6 +1161,21 @@ def test_auto_chunk_da(obj):
assert actual.chunks == expected.chunks


def test_auto_chunk_da_cftime():
yrs = np.arange(2000, 2120)
cftime_dates = xr.date_range(
start=f"{yrs[0]}-01-01", end=f"{yrs[-1]}-12-31", freq="1YE", use_cftime=True
)
yr_array = np.tile(cftime_dates.values, (10, 1))
da = xr.DataArray(
yr_array, dims=["x", "t"], coords={"x": np.arange(10), "t": cftime_dates}
).chunk({"x": 4, "t": 5})
actual = da.chunk("auto").data
expected = da.data.rechunk({0: 10, 1: 120})
np.testing.assert_array_equal(actual, expected)
assert actual.chunks == expected.chunks


def test_map_blocks_error(map_da, map_ds):
def bad_func(darray):
return (darray * darray.x + 5 * darray.y)[:1, :1]