We routinely collect ERA5 data in netCDF (specifically, the pressure-level reanalyses). From 18 September 2024 onwards, we have been receiving files in the new netCDF4 format.
Apart from the breaking changes in file structure, which we accommodated, we noticed a large drop in performance when manipulating the data from these files: a slowdown by a factor of 5 to 15, depending on the operations involved. Other users have reported the same issue.
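For illustration, the kind of operation that slowed down is a simple slice read. A minimal sketch with xarray (the file name, variable name, and coordinate values are placeholders):

```python
# Minimal sketch: time a single time/level slice read with xarray.
# "era5_pl.nc", "t", and the coordinate values are placeholders.
import time

import xarray as xr

ds = xr.open_dataset("era5_pl.nc")
t0 = time.perf_counter()
# Read one 2-D field: a single valid_time and pressure_level
field = ds["t"].sel(valid_time="2024-09-18T00:00", pressure_level=500).load()
print(f"read {field.shape} in {time.perf_counter() - t0:.3f} s")
```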
Looking at the files, it appears that chunking is applied across all dimensions (a short inspection script is sketched after the list):
valid_time: chunk size 5, total size 24
pressure_level: chunk size 8, total size 37
latitude: chunk size 181, total size 721
longitude: chunk size 360, total size 1440
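For reference, this is how the per-variable chunking can be inspected; a minimal sketch with the netCDF4 Python library (the file name is a placeholder):

```python
# Minimal sketch: print the on-disk chunk shape of every variable.
# "era5_pl.nc" is a placeholder file name.
from netCDF4 import Dataset

with Dataset("era5_pl.nc") as ds:
    for name, var in ds.variables.items():
        # chunking() returns the string "contiguous" or a list with one
        # chunk size per dimension
        print(name, var.dimensions, var.shape, var.chunking())
```

The same information also shows up in the special attributes printed by `ncdump -hs`.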
Is there an intended use case that would benefit from such a chunking strategy? In our case, we run web services that make on-the-fly access requests to these datasets, and with the new format this has become flat-out impractical for interactive use.
For the time being, we have reverted to the legacy format, but it is unclear how long that will remain supported.
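If a server-side fix takes a while, a possible stopgap would be to rechunk each file after download so that a single 2-D field occupies exactly one chunk. A sketch with xarray (file names are placeholders, and the dimension order is assumed to match the listing above):

```python
# Sketch: rewrite the file so each (time, level) horizontal slice is one
# chunk, so that single-field reads touch exactly one chunk.
# File names are placeholders.
import xarray as xr

ds = xr.open_dataset("era5_pl.nc")
encoding = {
    name: {"chunksizes": (1, 1, 721, 1440), "zlib": True, "complevel": 1}
    for name, var in ds.data_vars.items()
    if var.dims == ("valid_time", "pressure_level", "latitude", "longitude")
}
ds.to_netcdf("era5_pl_rechunked.nc", encoding=encoding)
```

The equivalent can be done on the command line with `nccopy -c`, but either way this adds a processing step to every download, which is what we would like to avoid.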
Could the ERA5 team look into this issue?