ERA5 CDS requests which return a mixture of ERA5 and ERA5T data

The ERA5 hourly and monthly data are made available with a 3 month delay. This means that after a month has passed, another month's worth of ERA5 data is written to the dataset.

ERA5T (near real time) preliminary data are used to fill the gap between the end of the ERA5 data and 5 days before the present date. The oldest month of these is overwritten each month as new ERA5 data become available.

So as an example, say the current date is 15th February 2020:

  • ERA5 data currently cover 1/1/1979 - 30/11/2019 (instantaneous variables) and 1/1/1979 - 1/12/2019 06 UTC (accumulated variables)
  • ERA5T data (with a 5 day delay) cover 1/12/2019 - 10/2/2020 (instantaneous variables) and 1/12/2019 07 UTC - 10/2/2020 (accumulated variables)
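
The date arithmetic in this example can be sketched in Python. This is only an illustration under stated assumptions: the function name coverage and the two constants are hypothetical, and the rule that ERA5 ends on the last day of the month three months before the current one is inferred from the 15 February 2020 example above, not taken from an official release schedule.

```python
from datetime import date, timedelta

ERA5T_DELAY_DAYS = 5    # from the text: ERA5T stops 5 days before present
ERA5_DELAY_MONTHS = 3   # from the text: ERA5 has a 3 month delay


def coverage(today: date) -> dict:
    """Assumed ERA5 / ERA5T boundaries for instantaneous variables."""
    # Count months since year 0 so the subtraction handles year changes.
    months = today.year * 12 + (today.month - 1) - (ERA5_DELAY_MONTHS - 1)
    era5t_start = date(months // 12, months % 12 + 1, 1)
    return {
        "era5_end": era5t_start - timedelta(days=1),
        "era5t_start": era5t_start,
        "era5t_end": today - timedelta(days=ERA5T_DELAY_DAYS),
    }


example = coverage(date(2020, 2, 15))
print(example)
```

For 15 February 2020 this gives the boundaries listed above: ERA5 up to 30/11/2019 and ERA5T from 1/12/2019 to 10/2/2020.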

For requests which return a mixture of ERA5 and ERA5T data (such as for data from the 1st of the month), instantaneous variables (e.g. temperature) come from ERA5T (which has an 'experiment version', or expver, of 5) while accumulated variables (fluxes, precipitation) come from both datasets with the following structure:

  • 00-06 UTC on the 1st day of the month from ERA5 (expver 1)
  • 07-23 UTC on the 1st day of the month (and the following dates up to 5 days before present) from ERA5T (expver 5)
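
As a rough illustration of this structure, the sketch below returns the expver one would expect for a given timestamp in a mixed request. The function expected_expver and its arguments are hypothetical names, and era5t_start (00 UTC on the first day of the ERA5T period) has to be supplied by the caller.

```python
from datetime import datetime, timedelta


def expected_expver(valid_time: datetime, era5t_start: datetime, accumulated: bool) -> int:
    """Return 1 (ERA5) or 5 (ERA5T) following the structure described above."""
    # Accumulated variables still come from ERA5 for 00-06 UTC on the first day,
    # so for them ERA5T only starts at 07 UTC.
    first_era5t = era5t_start + timedelta(hours=7) if accumulated else era5t_start
    return 5 if valid_time >= first_era5t else 1


start = datetime(2020, 1, 1)  # assumed first ERA5T day for this example
print(expected_expver(datetime(2020, 1, 1, 6), start, accumulated=True))
print(expected_expver(datetime(2020, 1, 1, 7), start, accumulated=True))
print(expected_expver(datetime(2020, 1, 1, 0), start, accumulated=False))
```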

When these data are converted to netCDF a new dimension is created called expver containing 1 and 5. Moreover, a single time coordinate is used which covers the entire requested period.

dimensions:
        longitude = 1440 ;
        latitude = 721 ;
        expver = 2 ;
        time = 24 ;
variables:
        float longitude(longitude) ;
                longitude:units = "degrees_east" ;
                longitude:long_name = "longitude" ;
        float latitude(latitude) ;
                latitude:units = "degrees_north" ;
                latitude:long_name = "latitude" ;
        int expver(expver) ;
                expver:long_name = "expver" ;
        int time(time) ;
                time:units = "hours since 1900-01-01 00:00:00.0" ;
                time:long_name = "time" ;
                time:calendar = "gregorian" ;
        short tp(time, expver, latitude, longitude) ;
                tp:scale_factor = 9.06276558810304e-07 ;
                tp:add_offset = 0.0296950577259784 ;
                tp:_FillValue = -32767s ;
                tp:missing_value = -32767s ;
                tp:units = "m" ;
                tp:long_name = "Total precipitation" ;
data:

expver = 5, 1 ;

}

Both expver dimensions use the full time extent of the time coordinate, but the expver 1 data only cover the first 7 timesteps; the remaining timesteps are 'padded' with empty fields.
For the expver 5 data, the first 7 timesteps are padded with empty fields, with the remaining timesteps coming from the ERA5T data.
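
The padding can be checked with a few lines of xarray. The sketch below builds a small synthetic array with the same dimension order as tp in the header above (it does not read a real download; the grid size and values are made up) and counts how many timesteps hold data for each expver.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one day of a mixed download (24 hourly steps, 2x2 grid).
values = np.random.default_rng(0).random((24, 2, 2, 2))
tp = xr.DataArray(
    values,
    dims=("time", "expver", "latitude", "longitude"),
    coords={"time": np.arange(24), "expver": [1, 5]},  # time = hour of day (UTC)
    name="tp",
)

# Mimic the padding: expver 1 only has data for 00-06 UTC, expver 5 for 07-23 UTC.
keep = ((tp["expver"] == 1) & (tp["time"] <= 6)) | ((tp["expver"] == 5) & (tp["time"] > 6))
tp = tp.where(keep)

# Number of timesteps with data for each expver at one grid point.
valid_steps = tp.isel(latitude=0, longitude=0).notnull().sum("time")
print(valid_steps.values)

# At every timestep and grid point exactly one expver should hold data.
complementary = bool((tp.notnull().sum("expver") == 1).all())
print(complementary)
```

The same two checks can be run on a real file after opening it with xr.open_dataset, replacing the synthetic tp with the variable of interest.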

When the latest ERA5 data are released, they overwrite the ERA5T data for the entire month and, for accumulated variables, for 00-06 UTC on the 1st of the next month. This process is repeated each month.

Note that, for the time being, if you download only ERA5 or only ERA5T data, the 'expver' dimension mentioned above will not appear. This makes it difficult to tell ERA5 and ERA5T apart.

It seems that if one requests hourly total precipitation ERA5 data for 1 January 2020, the file contains both expver versions (1 and 5) and the file size is doubled (around 99 MB). For other days, the expver dimension does not appear.

Thank you for reporting this, Julia. We are looking into a long term solution now. Unfortunately it will take some time.

As pointed out above, only mixed ERA5/ERA5T data has 'expver'. When users consider accumulated variables the file has "the following structure:

  • 00-06 UTC on the 1st day of the month from ERA5 (expver 1)
  • 07-23 UTC on the 1st day of the month (and the following dates up to 5 days before present) from ERA5T (expver 5)"

So in your case, data for 00-06 UTC on 1 January 2020 are ERA5 while the rest of the data are ERA5T. Data for 2 January 2020 are only ERA5T, so 'expver' does not appear. Moreover, please note that "Both expver dimensions use the full time extent of the time coordinate, but the expver 1 data only cover the first 7 timesteps; the remaining timesteps are 'padded' with empty fields." This means that the empty fields contain NaN values.

Hello Michela and Xiaobo

I can see why you want to keep two expver values, but I think it makes things more complicated than necessary for users. Moreover, the introduction of the two experiments is breaking code, with the consequent loss of time spent first identifying the issue and then finding a (not-so-straightforward) solution. My suggestion would be to get rid of the two expver values and just communicate when ERA5T data are replaced by ERA5, as is already indicated in Release of ERA5T.

Could this solution – i.e. merging expver 1 with 5, so no expver dimension/parameter appears in the retrieval – be implemented please? I think it'd be much cleaner if this was done at your end.

Thank you very much

Alberto

FYI Alberto, we are thinking about having 'expver' as a dimension for all ERA5 and ERA5T data.

I have the same issue as Julia Wagemann when downloading SurfaceSolarRadiation for all available 2020 timesteps: the first six hours of 01/01/2020 are expver = 1, but the rest of the 2020 timesteps are expver = 5.

Yes, this is because Surface Solar Radiation is an accumulated parameter and January is a month with ERA5 and ERA5T mixed data. For these reasons, the file has the following structure:

  • 00-06 UTC on the 1st day of the month from ERA5 (expver 1)
  • 07-23 UTC on the 1st day of the month (and the following dates up to 5 days before present) from ERA5T (expver 5)

So in your case too, data for 00-06 UTC on 1 January 2020 are ERA5 while the rest of the data are ERA5T. Data for 2 January 2020 are only ERA5T, so 'expver' does not appear. Moreover, please pay attention to the empty fields, which contain NaN values. This happens because "Both expver dimensions use the full time extent of the time coordinate, but the expver 1 data only cover the first 7 timesteps; the remaining timesteps are 'padded' with empty fields. For the expver 5 data, the first 7 timesteps are padded with empty fields, with the remaining timesteps coming from the ERA5T data."

Semi-related to this topic... Is there any documentation for why the most recent data for ERA5T instantaneous variables are available only from 0 - 21Z for the most recent available day, and the accumulated variables are available through 06Z the following day?

I understand this is the best we can do for the time being.


Our technical team commented: "the accumulated fields are forecast fields from the forecast starting at 18h, while the instantaneous fields are analysis fields from the 9-21h assimilation window."

I see, thank you for the quick response. And are the accumulated and instantaneous data released through CDS at 18Z and 21Z, respectively, or is there some specific lag (computational) time for each?

Has anyone found a simple way to get rid of the expver dimension, or at least to filter ERA5 and ERA5T data in MARS scripts? As Alberto said, this is breaking a lot of code. In my case, I download Ozone Total Column on a monthly basis to automatically generate maps with NCL. However, the presence of expver adds a dimension that NCL can't understand, and I can't get rid of it (I could not find an easy way: I can remove the expver variable but the dimension remains).

Hi, is there a way to remove the expver dimension? I have the same situation with the November 2020 data and it is breaking all my code, which works perfectly on data from 2003 to date. I would really appreciate the help if anyone knows a way to achieve this. Thanks

Hi,

At the moment I think the easiest way is to retrieve the ERA5 and ERA5T data in separate requests, by careful selection of the dates. In this way you would get 2 netCDF files without the 'expver' dimension, which you can then merge if required.
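
A minimal sketch of the merge step with xarray is shown below. It assumes the two downloads have no expver dimension and do not overlap in time. The helper toy_file only fabricates small stand-in datasets so the example runs on its own; with real downloads you would open the two files with xr.open_dataset instead.

```python
import numpy as np
import xarray as xr


def toy_file(steps):
    """Hypothetical stand-in for a downloaded file without an expver dimension."""
    steps = list(steps)
    return xr.Dataset(
        {"tp": (("time",), np.full(len(steps), 0.001))},
        coords={"time": steps},  # integer time index, for illustration only
    )


era5_part = toy_file(range(0, 7))     # result of the ERA5-only request
era5t_part = toy_file(range(7, 24))   # result of the ERA5T-only request

# Join the two pieces along time and make sure the result is in time order.
merged = xr.concat([era5_part, era5t_part], dim="time").sortby("time")
print(merged.sizes["time"], "expver" in merged.dims)
```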

Thanks

Kevin

Hi,

If you are looking for a Python workaround, you can use the xarray method reduce(np.nansum, 'expver'). This collapses the dimension by summing the two expver arrays, which are perfectly complementary (one is NaN wherever the other has a value). I know it isn't the cleanest approach, but one line keeps a lot of code from breaking.
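
One thing to watch with np.nansum is that a point which is NaN in both expver slices (for example a masked point) comes out as 0 rather than NaN. The toy example below, using made-up values, compares it with xarray's sum using min_count=1, which keeps such points as NaN. It is a sketch of the behaviour, not a statement about what real ERA5 files contain.

```python
import numpy as np
import xarray as xr

nan = np.nan
# Toy series at one grid point: 3 timesteps x 2 expver values.
# The last timestep is missing in BOTH versions.
da = xr.DataArray(
    [[0.2, nan], [nan, 0.5], [nan, nan]],
    dims=("time", "expver"),
    coords={"time": [0, 1, 2], "expver": [1, 5]},
)

# Workaround from the post above: all-NaN points become 0.
collapsed_nansum = da.reduce(np.nansum, dim="expver")

# Alternative: require at least one valid value, otherwise keep NaN.
collapsed_min_count = da.sum("expver", min_count=1)

print(collapsed_nansum.values)
print(collapsed_min_count.values)
```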

Hi.

Thank you all for the helpful responses. I appreciate the suggestion, Marco Venturini. That is a solution I can work with.

Thanks

cdo --reduce_dim -copy in.nc out.nc

worked well for me and it removed the expver dimension

this will do the trick


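import xarray as xr

# Open the mixed ERA5/ERA5T file (open_mfdataset needs dask installed)
ERA5 = xr.open_mfdataset('era5.tp.20200801.nc', combine='by_coords')
# Take expver 1 (ERA5) and fill its NaN gaps with expver 5 (ERA5T)
ERA5_combine = ERA5.sel(expver=1).combine_first(ERA5.sel(expver=5))
# Load the data into memory, then write the merged file
ERA5_combine.load()
ERA5_combine.to_netcdf("era5.tp.20200801.copy.nc")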


from https://unseen-open.readthedocs.io/_/downloads/en/latest/pdf/