How to maximize performance when converting a stack of netCDF files (ERA5) to a Zarr store? #3409
-
The same question was asked about 7 years ago on Stack Overflow, yet no clear answer/solution was provided. Running out of ideas on how to solve this problem, I am reaching out directly to the Zarr dev community.

I have a bunch of netCDF files, each representing one year of ERA5 data for the variable of choice and weighing about 280 MB. Implementing Zarr in the subsequent workflow made a substantial improvement, so I would like to convert my entire dataset to a Zarr store on my server: 256 CPUs (two AMD EPYC 7742 64-core processors with hyper-threading), 1 TiB of memory, a fast local NVMe disk for scratch use, and an Nvidia Quadro RTX 6000 card with 24 GB of GPU memory and 4608 CUDA cores. Resources are therefore not the limiting factor. Using the following code to convert 10 years' worth of netCDF files to a Zarr store, I came to the conclusion that I am underusing the machine's resources.

```python
from dask.distributed import LocalCluster
import xarray as xr

cluster = LocalCluster()  # I have attempted multiple settings, which did not change much

# Open 10 years' worth of data (about 3 GB), rechunk it, and store it to Zarr (no compression)
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_201*.nc', parallel=True)
ds = ds.chunk(chunks={'time': 10000, 'longitude': 3, 'latitude': 3, 'level': 13})
ds.to_zarr('test.zarr', mode='w')
```

So, within the logic and options I use, would you have pointers as to what may be improved, or even some alternatives for carrying out this conversion? Thanks.

A sample of the dataset prior to rechunking:

```
In [8]: ds
Out[8]:
<xarray.Dataset> Size: 2GB
Dimensions:    (time: 87648, level: 13, latitude: 10, longitude: 8)
Coordinates:
  * time       (time) datetime64[ns] 701kB 2010-01-01 ... 2019-12-31T23:00:00
  * longitude  (longitude) float64 64B -78.44 -78.19 -77.94 ... -76.94 -76.69
  * latitude   (latitude) float64 80B -8.239 -8.489 -8.739 ... -10.24 -10.49
  * level      (level) float64 104B 400.0 450.0 500.0 ... 900.0 950.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
Attributes:
    CDI:                     Climate Data Interface version 2.2.4 (https://mp...
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    history:                 Tue Jul 01 22:11:48 2025: cdo -O -s -mergetime /...
    CDO:                     Climate Data Operators version 2.2.2 (https://mp...
```
-
If you're not tied to xarray, then "converting data from netcdf to zarr" boils down to "load data from $source, decompress it, re-compress it, and copy it to $dest". In the past when I needed to speed this kind of thing up, I defined a single function.

But ultimately, why is it important that you use all of your compute resources? It seems more important that the computation finish in a reasonable amount of time. How long was the estimated time to complete the conversion?
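A minimal sketch of what such a single per-file function might look like, assuming the target Zarr store has already been initialised with the full time axis and coordinates. The function name, the store path `era5.zarr`, and the use of a process pool are illustrative assumptions, not code from this thread:

```python
# Hedged sketch: copy one netCDF file per task into an existing Zarr store.
# Assumes 'era5.zarr' was created beforehand with the full-length coordinates.
import glob
from concurrent.futures import ProcessPoolExecutor

import xarray as xr


def copy_one_file(path: str) -> str:
    # Each yearly file is only ~280 MB, so load it fully into memory, then
    # write it into the matching time region of the already-initialised store.
    ds = xr.open_dataset(path).load()
    ds.to_zarr('era5.zarr', mode='a', region='auto')
    ds.close()
    return path


if __name__ == '__main__':
    files = sorted(glob.glob('inputs/climate/yearly/PLEV_201*.nc'))
    # One worker process per file. This is only safe if the store's time chunks
    # do not straddle file boundaries; otherwise two workers could write the
    # same chunk concurrently.
    with ProcessPoolExecutor(max_workers=len(files)) as pool:
        for done in pool.map(copy_one_file, files):
            print('finished', done)
```

Skipping dask entirely like this keeps each worker's job trivially simple; the parallelism comes from running many independent single-file copies at once.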
-
I seem to have found a few tricks to accomplish the conversion in reasonable time, notably the .persist() command from Dask.
Here is a sample of code that achieves the operation I needed in reasonable time:

```python
import xarray as xr
import pandas as pd
import dask.array

# Full hourly time axis for the target store (2011-2014).
tvec = pd.date_range('2011-01-01 00:00:00', '2014-12-31 23:00:00', freq='1h', inclusive='both')

# Open one year just to grab the coordinates and attributes.
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_2011.nc', parallel=True, chunks='auto')

# Lazy placeholder array covering the whole period.
za = dask.array.zeros(
    (len(tvec), ds.level.shape[0], ds.latitude.shape[0], ds.longitude.shape[0]),
    dtype='float32',
)

# Template dataset with the final shape, used to initialise the Zarr store.
dd = xr.Dataset(
    coords={
        'time': tvec,
        'longitude': ds.longitude.values,
        'latitude': ds.latitude.values,
        'level': ds.level.values,
    },
    data_vars={
        'z': (('time', 'level', 'latitude', 'longitude'), za),
        't': (('time', 'level', 'latitude', 'longitude'), za),
        'u': (('time', 'level', 'latitude', 'longitude'), za),
        'v': (('time', 'level', 'latitude', 'longitude'), za),
        'q': (('time', 'level', 'latitude', 'longitude'), za),
        'r': (('time', 'level', 'latitude', 'longitude'), za),
    },
    attrs=ds.attrs,
)

# Rechunk to the target layout and write the template store.
dd = dd.chunk(chunks={'time': 8760, 'longitude': 3, 'latitude': 3, 'level': 13}).persist()
dd.to_zarr('test.zarr', mode='w', zarr_format=3)
ds = None

# Fill the store year by year, writing each year into its own time region.
for year in [2011, 2012, 2013, 2014]:
    ds = xr.open_mfdataset(f'inputs/climate/yearly/PLEV_{year}.nc', parallel=True, chunks='auto')
    ds = ds.persist()
    ds.to_zarr('test.zarr', mode='a', region='auto', align_chunks=True)
    ds = None
```

For future reference, a forum dedicated to problems of this nature is: https://discourse.pangeo.io/
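A possible lighter-weight variant of the template step (an assumption on my part, not something tested in this thread): xarray can write just the store layout, coordinates, and metadata by passing compute=False, so the placeholder zeros never need to be persisted or stored, and the yearly region writes then fill the chunks as above.

```python
# Sketch of a metadata-only template write, reusing the `dd` template Dataset
# built in the snippet above; compute=False skips computing/storing the zeros.
dd = dd.chunk(chunks={'time': 8760, 'longitude': 3, 'latitude': 3, 'level': 13})
dd.to_zarr('test.zarr', mode='w', zarr_format=3, compute=False)
# The year-by-year region writes that follow remain unchanged.
```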