How to maximize performance when converting a stack of netCDF files (ERA5) to a Zarr store? #3409
-
The same question was asked about 7 years ago on Stack Overflow, yet no clear answer/solution was provided. Running out of ideas on how to solve this problem, I am reaching out directly to the Zarr dev community.

I have a bunch of netCDF files, each representing one year of ERA5 data for the variable of choice and weighing about 280 MB. Implementing Zarr in the subsequent workflow made a substantial improvement, so I would like to convert my entire dataset to a Zarr store on my server: 256 CPUs (two AMD EPYC 7742 64-core processors with hyper-threading), 1 TiB of memory, a fast local NVMe disk for scratch use, and an Nvidia Quadro RTX 6000 card with 24 GB of GPU memory and 4608 CUDA cores. Resources are therefore not the limiting factor. Using the following code to convert 10 years' worth of netCDF files to a Zarr store, I came to the conclusion that I am underusing the machine's resources.

```python
from dask.distributed import LocalCluster
import xarray as xr

cluster = LocalCluster()  # I have attempted multiple settings, which did not change much

# Open 10 years' worth of data (about 3 GB), rechunk it, and store it to Zarr (no compression)
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_201*.nc', parallel=True)
ds = ds.chunk(chunks={'time': 10000, 'longitude': 3, 'latitude': 3, 'level': 13})
ds.to_zarr('test.zarr', mode='w')
```

So, within the logic and options I use, would you have pointers as to what may be improved, or even some alternatives for carrying out this conversion? Thanks.

A sample of the dataset prior to rechunking:

```
In [8]: ds
Out[8]:
<xarray.Dataset> Size: 2GB
Dimensions:    (time: 87648, level: 13, latitude: 10, longitude: 8)
Coordinates:
  * time       (time) datetime64[ns] 701kB 2010-01-01 ... 2019-12-31T23:00:00
  * longitude  (longitude) float64 64B -78.44 -78.19 -77.94 ... -76.94 -76.69
  * latitude   (latitude) float64 80B -8.239 -8.489 -8.739 ... -10.24 -10.49
  * level      (level) float64 104B 400.0 450.0 500.0 ... 900.0 950.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 365MB dask.array<chunksize=(1, 1, 10, 8), meta=np.ndarray>
Attributes:
    CDI:                     Climate Data Interface version 2.2.4 (https://mp...
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    history:                 Tue Jul 01 22:11:48 2025: cdo -O -s -mergetime /...
    CDO:                     Climate Data Operators version 2.2.2 (https://mp...
```
-
If you're not tied to xarray, then "converting data from netcdf to zarr" boils down to "load data from $source, decompress it, re-compress it, and copy it to $dest". In the past when I needed to speed this kind of thing up, I defined a single function.

But ultimately, why is it important that you use all of your compute resources? It seems more important that the computation finish in a reasonable amount of time. How long was the estimated time to complete the conversion?
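A minimal sketch of what such a single per-file function might look like, assuming the target Zarr store has already been initialised with the full time axis and coordinates. The function name, the store path `era5.zarr`, and the use of a process pool are illustrative assumptions, not code from this thread:

```python
# Hedged sketch: copy one netCDF file per task into an existing Zarr store.
# Assumes 'era5.zarr' was created beforehand with the full-length coordinates.
import glob
from concurrent.futures import ProcessPoolExecutor

import xarray as xr


def copy_one_file(path: str) -> str:
    # Each yearly file is only ~280 MB, so load it fully into memory, then
    # write it into the matching time region of the already-initialised store.
    ds = xr.open_dataset(path).load()
    ds.to_zarr('era5.zarr', mode='a', region='auto')
    ds.close()
    return path


if __name__ == '__main__':
    files = sorted(glob.glob('inputs/climate/yearly/PLEV_201*.nc'))
    # One worker process per file. This is only safe if the store's time chunks
    # do not straddle file boundaries; otherwise two workers could write the
    # same chunk concurrently.
    with ProcessPoolExecutor(max_workers=len(files)) as pool:
        for done in pool.map(copy_one_file, files):
            print('finished', done)
```

Skipping dask entirely like this keeps each worker's job trivially simple; the parallelism comes from running many independent single-file copies at once.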
-
I seem to have found a few tricks to accomplish the conversion in reasonable time, notably the .persist() command from Dask.
Here is a sample of code that achieves the operation I needed in reasonable time:

```python
import xarray as xr
import pandas as pd
import dask.array

# Full hourly time axis for the target store (2011-2014).
tvec = pd.date_range('2011-01-01 00:00:00', '2014-12-31 23:00:00', freq='1h', inclusive='both')

# Open one year just to grab the coordinates and attributes.
ds = xr.open_mfdataset('inputs/climate/yearly/PLEV_2011.nc', parallel=True, chunks='auto')

# Lazy placeholder array covering the whole period.
za = dask.array.zeros(
    (len(tvec), ds.level.shape[0], ds.latitude.shape[0], ds.longitude.shape[0]),
    dtype='float32',
)

# Template dataset with the final shape, used to initialise the Zarr store.
dd = xr.Dataset(
    coords={
        'time': tvec,
        'longitude': ds.longitude.values,
        'latitude': ds.latitude.values,
        'level': ds.level.values,
    },
    data_vars={
        'z': (('time', 'level', 'latitude', 'longitude'), za),
        't': (('time', 'level', 'latitude', 'longitude'), za),
        'u': (('time', 'level', 'latitude', 'longitude'), za),
        'v': (('time', 'level', 'latitude', 'longitude'), za),
        'q': (('time', 'level', 'latitude', 'longitude'), za),
        'r': (('time', 'level', 'latitude', 'longitude'), za),
    },
    attrs=ds.attrs,
)

# Rechunk to the target layout and write the template store.
dd = dd.chunk(chunks={'time': 8760, 'longitude': 3, 'latitude': 3, 'level': 13}).persist()
dd.to_zarr('test.zarr', mode='w', zarr_format=3)
ds = None

# Fill the store year by year, writing each year into its own time region.
for year in [2011, 2012, 2013, 2014]:
    ds = xr.open_mfdataset(f'inputs/climate/yearly/PLEV_{year}.nc', parallel=True, chunks='auto')
    ds = ds.persist()
    ds.to_zarr('test.zarr', mode='a', region='auto', align_chunks=True)
    ds = None
```

For future reference, a forum dedicated to problems of this nature is: https://discourse.pangeo.io/
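A possible lighter-weight variant of the template step (an assumption on my part, not something tested in this thread): xarray can write just the store layout, coordinates, and metadata by passing compute=False, so the placeholder zeros never need to be persisted or stored, and the yearly region writes then fill the chunks as above.

```python
# Sketch of a metadata-only template write, reusing the `dd` template Dataset
# built in the snippet above; compute=False skips computing/storing the zeros.
dd = dd.chunk(chunks={'time': 8760, 'longitude': 3, 'latitude': 3, 'level': 13})
dd.to_zarr('test.zarr', mode='w', zarr_format=3, compute=False)
# The year-by-year region writes that follow remain unchanged.
```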