Huge performance drop when reading new ERA5 netCDF4 format

We routinely download ERA5 data in netCDF format (specifically, pressure-level reanalyses). Since 18 September 2024, the files have arrived in the new netCDF4 format.

Apart from the breaking changes in file structure, which we have accommodated, we noticed a large drop in performance when manipulating the data from these files: operations are slower by a factor of 5 to 15, depending on the operations involved. Other users have reported the same issue.

Inspecting the files shows that they are chunked along every dimension:
valid_time: chunk size 5, total size 24
pressure_level: chunk size 8, total size 37
latitude: chunk size 181, total size 721
longitude: chunk size 360, total size 1440
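
To make the cost of this layout concrete, here is a rough sketch that counts how many chunks overlap a given read request. It assumes the library has to read and decompress every overlapping chunk in full, which is how chunked HDF5/netCDF4 storage generally behaves. The function names and the request format are made up for illustration; only the chunk sizes and dimension lengths come from the listing above.

```python
import math

# (chunk size, total size) per dimension, as listed above.
DIMS = {
    "valid_time": (5, 24),
    "pressure_level": (8, 37),
    "latitude": (181, 721),
    "longitude": (360, 1440),
}


def chunks_touched(request):
    """Number of chunks overlapping a request.

    `request` maps a dimension name to a (start, stop) index range with
    stop exclusive; dimensions left out are read in full.
    """
    count = 1
    for name, (chunk, size) in DIMS.items():
        start, stop = request.get(name, (0, size))
        first = start // chunk          # index of first chunk touched
        last = (stop - 1) // chunk      # index of last chunk touched
        count *= last - first + 1
    return count


def values_requested(request):
    """Number of data values the request actually needs."""
    total = 1
    for name, (_, size) in DIMS.items():
        start, stop = request.get(name, (0, size))
        total *= stop - start
    return total


def read_amplification(request):
    """Values decompressed divided by values needed (assumes whole-chunk reads)."""
    values_per_chunk = math.prod(chunk for chunk, _ in DIMS.values())
    return chunks_touched(request) * values_per_chunk / values_requested(request)


if __name__ == "__main__":
    one_time_step = {"valid_time": (0, 1)}  # global 3-D field, single time
    print(chunks_touched(one_time_step), round(read_amplification(one_time_step), 1))
```

Under these assumptions, reading one global 3-D field for a single valid_time touches 80 of the 400 chunks and decompresses roughly 5.4 times more values than it needs. This ignores caching and compression ratio, so it is only an indication of where a slowdown of this order could come from.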

Is there an intended use case that would benefit from such a chunking strategy? In our case, we have web services that serve on-the-fly access requests to these datasets, and with the new format this has become flat-out impractical for interactive use.

For the time being, we reverted to the legacy format, but who knows for how long it will be supported?

Could the ERA5 team look into this issue?

I have found that I have to re-chunk the files I download to match my access pattern, which is generally to read a global 3-D grid for a single analysis time. You can do this with the netCDF nccopy utility. Below is an example for a file containing 0.75° resolution pressure-coordinate data for a single analysis time. (The ellipsis … indicates the path to the files.) This command re-chunks the data into a single chunk for each 3-D field. I am not certain of the best way to set the buffer sizes (the nccopy documentation is not very clear on this), but these settings seem to work for these files. On our Linux servers it takes about 10 s to copy a single file. I also set the compression level to 5, which I find to be a reasonable balance between processing time and space saving.

I also do not understand why the original files were chunked the way they were.

Regards, Ken

nccopy -w -s -d 5 -m 1G -h 1G -c valid_time/1,pressure_level/37,latitude/241,longitude/480 …/20200831T200000Z.ncd.CDS …/20200831T200000Z.ncd
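
For files with other grid sizes, the same nccopy arguments can be built programmatically. The following is only a sketch: the helper names chunk_spec and nccopy_args and the file names in.nc and out.nc are hypothetical, and the flags and values are simply copied from the command above rather than tuned.

```python
def chunk_spec(dim_sizes, single=("valid_time",)):
    """Build an nccopy -c chunk spec with one chunk per 3-D field.

    `dim_sizes` maps dimension name to its total length, in file order.
    Dimensions named in `single` get a chunk length of 1; every other
    dimension is kept whole in one chunk.
    """
    parts = []
    for name, size in dim_sizes.items():
        length = 1 if name in single else size
        parts.append(f"{name}/{length}")
    return ",".join(parts)


def nccopy_args(src, dst, dim_sizes, deflate=5, buffer="1G"):
    """Argument list mirroring the command above (same flags, same values)."""
    return [
        "nccopy", "-w", "-s",
        "-d", str(deflate),   # compression level
        "-m", buffer,         # copy buffer size
        "-h", buffer,         # chunk cache size
        "-c", chunk_spec(dim_sizes),
        src, dst,
    ]


if __name__ == "__main__":
    # Dimension sizes of the 0.25-degree files described in the first post.
    sizes = {"valid_time": 24, "pressure_level": 37,
             "latitude": 721, "longitude": 1440}
    # in.nc and out.nc are placeholder names, not real paths.
    print(" ".join(nccopy_args("in.nc", "out.nc", sizes)))
```

The resulting list can be passed to subprocess.run if nccopy is on the PATH. Note that one chunk per 3-D field is much larger on the 0.25° grid than on the 0.75° grid, so the buffer sizes may need adjusting.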