ERA5 Forecast Accumulation netCDF Download Error

Hello all,

When I attempt to download ERA5 forecast data from the complete ERA5 archive (e.g. https://apps.ecmwf.int/data-catalogues/era5/?stream=oper&levtype=sfc&expver=1&month=dec&year=1979&type=fc&class=ea) in netCDF format via a Python script, the request fails. There are 2 initializations (06Z and 18Z), each with 19 forecast steps (0-18), so there are 38 valid times in a 24-hour period. Does anyone have thoughts on how to work around this while keeping netCDF format and downloading all of the data?

Thanks

Hi Frederick,

That's an excellent question!

This issue is due to you requesting forecast data in netCDF, i.e. it is caused by the GRIB-to-netCDF conversion software used.

The data for each forecast field are stored in individual grib records, which are independent of each other.

NetCDF files (generally) have coordinate axes which the fields are mapped onto.

In your request, you asked for forecast steps from 0-18 hrs from 06Z and 18Z.

So, for a given day

06Z + fc step of 13 hrs = 19:00

and

18Z + fc step of 1 hr = 19:00.

i.e. you have 2 fields with the same 'validity time' of 19:00.

When you do the GRIB-to-netCDF conversion, it tries to map 1 data field onto 1 time coordinate, so it gives an error when it finds 2 fields for 19:00, 20:00, etc.

(this is a 'feature' of the data model used in this particular conversion script)
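To see the collision concretely, here is a quick stdlib-only sketch (the date is arbitrary) that lists the validity times reached by both initializations:

```python
from datetime import datetime, timedelta

base = datetime(1979, 12, 1)
valid_times = []
for init_hour in (6, 18):          # the two forecast initializations
    for step in range(19):         # forecast steps 0-18 hrs
        valid_times.append(base + timedelta(hours=init_hour + step))

# validity times reached by both initializations
duplicates = sorted({t for t in valid_times if valid_times.count(t) == 2})
print(len(duplicates), duplicates[0], duplicates[-1])
# 7 1979-12-01 18:00:00 1979-12-02 00:00:00
```

So within one day's worth of data there are seven hourly validity times (18:00 through 00:00 the next day) that appear twice, which is exactly what the converter trips over.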

Some straightforward ways to get around this are:

1) Just use the GRIB version of the data (no netCDF conversion involved).

2) Just request forecast steps 1-12 each day for 06Z and 18Z (netCDF conversion is OK: 06Z covers 07:00-18:00 and 18Z covers 19:00-06:00, so no validity time repeats).

3) Get forecast steps 0-18 each day for 06Z in one request (netCDF conversion is OK) and forecast steps 0-18 each day for 18Z in a separate request (netCDF conversion is OK).
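Option 3 could look roughly like the sketch below: one request per initialization time, so validity times never repeat within a single netCDF file. The dataset name ("reanalysis-era5-complete") and the MARS-style keyword values (in particular the param code) are assumptions for illustration; check the current CDS documentation for your dataset before running.

```python
# Build one request per initialization (06Z, 18Z), sharing all other keywords.
# NOTE: dataset name and keyword values below are assumed examples, not a
# verified request -- adapt them to your actual variables and dates.
common = {
    "class": "ea",
    "expver": "1",
    "stream": "oper",
    "type": "fc",
    "levtype": "sfc",
    "param": "167.128",                           # assumed example: 2m temperature
    "date": "1979-12-01/to/1979-12-31",
    "step": "/".join(str(s) for s in range(19)),  # forecast steps 0-18
    "format": "netcdf",
}

requests = {init: dict(common, time=init) for init in ("06:00:00", "18:00:00")}

# import cdsapi
# client = cdsapi.Client()
# for init, request in requests.items():
#     client.retrieve("reanalysis-era5-complete", request,
#                     f"era5_fc_{init[:2]}z.nc")
```

The actual retrieval calls are left commented out since they need CDS credentials; the point is simply that each file then contains only one initialization, so the GRIB-to-netCDF conversion has a unique time axis to map onto.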


Hope that helps!

Thanks,
Kevin

C3S User Support at ECMWF

I had a similar issue with seasonal forecast data, and the real solution was to work only with GRIB files. The cfgrib library provides a clean interface for xarray, so if you are using Python the transition from netCDF to GRIB is seamless.

Providing a bit more info to make this better searchable for others with similar problems.

I had a similar problem when I recently downloaded ERA5 files as netCDF. I requested several variables and downloaded one file per year:

'surface_pressure', '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'mean_surface_downward_short_wave_radiation_flux', 'geopotential'

For all years except 2024 this worked fine. For 2024 (the current year), tp and msdwswrf were missing.

If I just open the GRIB file for 2024, it loads all variables except tp and msdwswrf, and gives the following warning (just an excerpt):

In [2]: d=xr.open_dataset("scandinavia_2024.grb")
skipping variable: paramId==228 shortName='tp'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/cfgrib/dataset.py", line 660, in build_dataset_components
    dict_merge(variables, coord_vars)
  File "/usr/local/lib/python3.12/site-packages/cfgrib/dataset.py", line 591, in dict_merge
    raise DatasetBuildError(
cfgrib.dataset.DatasetBuildError: key present and new value is different: key='time' value=Variable(dimensions=('time',), data=array([1704067200, 1704070800, 1704074400, ..., 1728594000, 1728597600,
       1728601200])) new_value=Variable(dimensions=('time',), data=array([1704045600, 1704088800, 1704132000, 1704175200, 1704218400,
       1704261600, 1704304800, 1704348000, 1704391200, 1704434400,
[...]
       1728453600, 1728496800, 1728540000, 1728583200]))
skipping variable: paramId==235035 shortName='msdwswrf'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/cfgrib/dataset.py", line 660, in build_dataset_components
    dict_merge(variables, coord_vars)
  File "/usr/local/lib/python3.12/site-packages/cfgrib/dataset.py", line 591, in dict_merge
    raise DatasetBuildError(
cfgrib.dataset.DatasetBuildError: key present and new value is different: key='time' value=Variable(dimensions=('time',), data=array([1704067200, 1704070800, 1704074400, ..., 1728594000, 1728597600,
       1728601200])) new_value=Variable(dimensions=('time',), data=array([1704045600, 1704088800, 1704132000, 1704175200, 1704218400,
       1704261600, 1704304800, 1704348000, 1704391200, 1704434400,
[...]
       1704693600, 1704736800, 1704780000, 1704823200, 1704866400,
[...]

Solution

It worked fine to read the dataset with cfgrib.open_datasets, so e.g. d=cfgrib.open_datasets("scandinavia_2024.grb"). More information on that functionality is provided in the cfgrib repository (https://github.com/ecmwf/cfgrib, a Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes).

For me d[0] includes variables indexed only along time, which is the same as valid_time:

 Dimensions:     (time: 6816, latitude: 62, longitude: 97)
 Coordinates:
     number      int64 0
   * time        (time) datetime64[ns] 2024-01-01 ... 2024-10-10T23:00:00
     step        timedelta64[ns] 00:00:00
     surface     float64 0.0
   * latitude    (latitude) float64 72.25 72.0 71.75 71.5 ... 57.5 57.25 57.0
   * longitude   (longitude) float64 4.39 4.64 4.89 5.14 ... 27.89 28.14 28.39
     valid_time  (time) datetime64[ns] 2024-01-01 ... 2024-10-10T23:00:00
 Data variables:
     z           (time, latitude, longitude) float32 ...
     sp          (time, latitude, longitude) float32 ...
     u10         (time, latitude, longitude) float32 ...
...

The values from the aggregated variables are in d[1] and are indexed differently:

<xarray.Dataset>
Dimensions:     (time: 569, step: 12, latitude: 62, longitude: 97)
Coordinates:
    number      int64 0
  * time        (time) datetime64[ns] 2023-12-31T18:00:00 ... 2024-10-10T18:0...
  * step        (step) timedelta64[ns] 01:00:00 02:00:00 ... 11:00:00 12:00:00
    surface     float64 0.0
  * latitude    (latitude) float64 72.25 72.0 71.75 71.5 ... 57.5 57.25 57.0
  * longitude   (longitude) float64 4.39 4.64 4.89 5.14 ... 27.89 28.14 28.39
    valid_time  (time, step) datetime64[ns] 2023-12-31T19:00:00 ... 2024-10-1...
Data variables:
    tp          (time, step, latitude, longitude) float32 ...
    msdwswrf    (time, step, latitude, longitude) float32 ...

It's easy, though, to convert them:

data = d[1].stack({"time_linear": ["time", "step"]})
data = data.swap_dims({"time_linear": "valid_time"})

# you _might_ need to slice the data further; data includes some extra (NaN)
# values from 2023-12-31 in order to cover 2024
# data = data.isel(valid_time=slice(5,-7))
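As a sanity check, the stack/swap_dims trick can be exercised on a tiny self-contained dataset shaped like d[1] (the sizes, dates, and values below are made up for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for d[1]: a variable indexed by (time, step), with a
# 2-D valid_time coordinate, like tp/msdwswrf in the accumulated dataset.
times = pd.to_datetime(["2024-01-01 06:00", "2024-01-01 18:00"])
steps = pd.to_timedelta([1, 2, 3], unit="h")
ds = xr.Dataset(
    {"tp": (("time", "step"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": times,
        "step": steps,
        "valid_time": (("time", "step"),
                       times.values[:, None] + steps.values[None, :]),
    },
)

# Flatten (time, step) into one linear dimension, then index by valid_time
flat = ds.stack({"time_linear": ["time", "step"]})
flat = flat.swap_dims({"time_linear": "valid_time"})
print(flat["tp"].dims, flat.sizes["valid_time"])   # ('valid_time',) 6
```

After the swap, tp is a plain 1-D time series along valid_time, matching the layout of the instantaneous variables in d[0].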

API request

# original API request as netCDF
    dataset = "reanalysis-era5-single-levels"
    request = {
        'product_type': ['reanalysis'],
        'variable': ["surface_pressure", '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_dewpoint_temperature', '2m_temperature', 'total_precipitation', 'mean_surface_downward_short_wave_radiation_flux', 'geopotential'],
        'year': [str(year)],
        'month': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'],
        'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'],
        'time': ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00', '06:00', '07:00', '08:00', '09:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00', '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'],
        'data_format': "netcdf", #'grib',
        'download_format': 'unarchived',
        'area': [72.29, 4.39, 57, 28.5]
    }

    client = cdsapi.Client()
    target = f"scandinavia_{year}.nc"
    client.retrieve(dataset, request, target)#.download()

As mentioned above, to get the variables for the current year (2024) I needed to download the GRIB files instead.