Error when downloading ERA5 Daily Statistics

The code below, which I've been using to download daily means for a variety of climate variables, now fails with: NotImplementedError: This is a beta version. This functionality has not been implemented yet.

Any advice on how to resolve?

import cdsapi
import requests
import os

# Create the 'era5' directory if it doesn't exist
download_folder = "era5"
os.makedirs(download_folder, exist_ok=True)

c = cdsapi.Client(timeout=300)

years = [str(year) for year in range(2020, 2021)]
months = ['{:02d}'.format(month) for month in range(1, 13)]
current_year = '2024'
current_month = '08'

# Variables and their corresponding daily statistics to download
variables_stats = [
    ("2m_temperature", ["daily_mean", "daily_minimum", "daily_maximum"]),
    ("10m_u_component_of_wind", ["daily_mean"]),
    ("10m_v_component_of_wind", ["daily_mean"]),
    ("2m_dewpoint_temperature", ["daily_mean"]),
    ("surface_pressure", ["daily_mean"]),
    ("total_precipitation", ["daily_mean"])
]

# Function to download data
def download_data(yr, mn, var, stat):
    file_name = os.path.join(download_folder, f"{var}.{stat}.{yr}-{mn}.nc")
    temp_file_name = os.path.join(download_folder, f"tmp.{var}.{stat}.{yr}-{mn}.nc")

    # Check if file already exists
    if os.path.exists(file_name):
        print(f"File {file_name} already exists. Skipping download.")
        return

    # Check if temporary file exists (indicating a previous incomplete download)
    if os.path.exists(temp_file_name):
        print(f"Temporary file {temp_file_name} found. Resuming download.")
        os.remove(temp_file_name)  # Remove incomplete download to restart

    # Initiate the download process
    result = c.service(
        "tool.toolbox.orchestrator.workflow",
        params={
            "realm": "user-apps",
            "project": "app-c3s-daily-era5-statistics",
            "version": "master",
            "kwargs": {
                "dataset": "reanalysis-era5-single-levels",
                "product_type": "reanalysis",
                "variable": var,
                "statistic": stat,
                "year": yr,
                "month": mn,
                "time_zone": "UTC+00:00",
                "frequency": "1-hourly"
            },
            "workflow_name": "application"
        }
    )

    # Download data
    location = result[0]['location']
    res = requests.get(location, stream=True)
    print(f"Writing data to {temp_file_name}")

    with open(temp_file_name, 'wb') as fh:
        for r in res.iter_content(chunk_size=1024):
            fh.write(r)

    # Rename the temporary file to the final file name
    os.rename(temp_file_name, file_name)
    print(f"Download completed. Renamed {temp_file_name} to {file_name}")

# Sequentially download data
for yr in years:
    for mn in months:
        if yr == current_year and mn > current_month:
            break
        for var, stats in variables_stats:
            for stat in stats:
                file_name = os.path.join(download_folder, f"{var}.{stat}.{yr}-{mn}.nc")
                if not os.path.exists(file_name):
                    download_data(yr, mn, var, stat)

I’ve got the same problem, too.


I’ve posted a similar question elsewhere (no reply yet), but since you seem to work with daily means, maybe you can help with this general conceptual question:

What kind of values do you expect for daily means based on the hourly data?

If the data represent values at hourly time points (e.g. temperature), I assume the correct mean should take the two values from Day at 00:00 and Day+1 at 00:00, average them, and combine this with the other 23 values (from Day at 01:00 to Day at 23:00) to get the daily mean. Do you know anything about this?

If the hourly data are accumulative (e.g. precipitation), the sum over 24 hours for a Day is found in the hourly value for Day+1 at 00:00. Should the daily mean not be that sum, instead of the mean of the 24 (accumulative) values from the Day itself?

Do you expect daily means for UTC or also for different time zones? Hourly data are in UTC, so the switch from one day to the next (and the restart of accumulative data) is at a different time to the actual switch for the time zone of interest (if this is not UTC).

Thanks and kind regards.

Hi! To my understanding - and I think this information is on the Confluence page, though I can't seem to find it at the moment - it should just be the average of all hourly values over the course of the day, not the average from 00:00 of one day to 00:00 of the next.

If the hourly data is accumulative like precipitation, I have been getting the daily mean and then multiplying by 24 to get total precipitation for the day.
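As a minimal sketch of that mean-times-24 conversion (the function name is illustrative; ERA5 total_precipitation values are in metres):

```python
def daily_total_from_mean(daily_mean_m: float, hours: int = 24) -> float:
    """Approximate a day's accumulated precipitation from the daily mean
    of its hourly values by multiplying by the number of hours."""
    return daily_mean_m * hours

# A daily mean of 0.0005 m corresponds to roughly 0.012 m (12 mm) for the day.
print(daily_total_from_mean(0.0005))
```

Whether this matches a true hourly sum depends on how the daily mean was computed from the accumulated series, which is exactly the question discussed below.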

I have been using UTC, since my work covers long time scales over the whole globe, so the specific time zone doesn't play much of a role. If you are looking at a specific region, or a shorter time frame, the time zone may matter more for you, in which case I would work in the most relevant time zone for your project. The back-end data engineering that handles these calculations should adjust by time zone automatically, so I don't believe this is something you have to fix manually.

All of that said, I do not know what the new implementation of all of this will look like. Hope that helps. Take care.

Hi, thank you for your quick and detailed reply!

My understanding is similar, i.e. that the average of all hourly values over a day is used for the stats, but that leads me to the following concerns.

Here is an example of what hourly values look like (I work with point locations):

Time Precipitation[mm] Temperature[degC]
2013-01-01T00:00+00:00 0.199618 5.778961
2013-01-01T01:00+00:00 0.000000 5.764557
2013-01-01T02:00+00:00 0.000000 5.725739
2013-01-01T03:00+00:00 0.000000 5.725983
2013-01-01T04:00+00:00 0.000000 5.596344
2013-01-01T05:00+00:00 0.000000 5.696198
2013-01-01T06:00+00:00 0.000000 5.879303
2013-01-01T07:00+00:00 0.000000 5.708893
2013-01-01T08:00+00:00 0.000000 5.613434
2013-01-01T09:00+00:00 0.000771 5.873932
2013-01-01T10:00+00:00 0.004914 6.321442
2013-01-01T11:00+00:00 0.008201 6.73819
2013-01-01T12:00+00:00 0.012807 7.189117
2013-01-01T13:00+00:00 0.018581 7.261871
2013-01-01T14:00+00:00 0.023051 7.308502
2013-01-01T15:00+00:00 0.030146 6.817291
2013-01-01T16:00+00:00 0.040229 6.693024
2013-01-01T17:00+00:00 0.051955 6.622955
2013-01-01T18:00+00:00 0.134121 6.602936
2013-01-01T19:00+00:00 0.190826 6.845856
2013-01-01T20:00+00:00 0.212247 6.76236
2013-01-01T21:00+00:00 0.253724 6.308502
2013-01-01T22:00+00:00 0.293908 5.82901
2013-01-01T23:00+00:00 0.309419 5.257233
2013-01-02T00:00+00:00 0.310628 4.452545
2013-01-02T01:00+00:00 0.000000 3.746979
2013-01-02T02:00+00:00 0.000000 3.277496
2013-01-02T03:00+00:00 0.000000 2.771637
2013-01-02T04:00+00:00 0.000000 2.552643

For accumulative vars, like precipitation, the daily stat is just the value of DAY+1 at 00:00 (cf. ERA5-Land: data documentation - Copernicus Knowledge Base - ECMWF Confluence Wiki). It is straightforward to get this value from the hourly entries (though one also needs data for the day after the last DAY of interest). However, if the mean is taken from the 24 values of a DAY (00:00-23:00), that would not make much sense for accumulative vars. One could use the stat "MAX" instead of "MEAN", but the value of DAY at 00:00 still should not be included (as it holds the accumulated amount of the previous day), and the value of DAY+1 at 00:00 (which actually holds the correct accumulated amount) would be missing.
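A minimal sketch of that extraction, assuming a plain list of hourly accumulated values starting at DAY 00:00 (names are illustrative):

```python
def daily_precip_total(hourly_accum, day_start_idx):
    """For continuously accumulated hourly precipitation (ERA5-Land style),
    the total for a day is the value at DAY+1 00:00, i.e. 24 steps after
    the index of DAY 00:00."""
    return hourly_accum[day_start_idx + 24]

# 24 hourly entries for the day, then the DAY+1 00:00 entry holding the total
# (value taken from the example table above).
series = [0.0] * 24 + [0.310628]
print(daily_precip_total(series, 0))  # 0.310628
```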

Did you maybe once compare the daily values directly downloaded to hourly values for an example to see if your calculation (mean*24) gives the same result?

For non-accumulative values, like temperature, I again think that just the mean of the 24 entries from DAY at 00:00 to DAY at 23:00 would be slightly biased. One could say it best represents the mean of the period from DAY-1 at 23:30 to DAY at 23:30. Therefore, I also included the value from DAY+1 at 00:00, giving the two midnight values half the weight of all other values. I would be curious about Copernicus' advice on this, but could not find it specifically discussed anywhere.
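The half-weight scheme described above amounts to a trapezoidal average over the 25 values from DAY 00:00 to DAY+1 00:00; a minimal sketch (function name illustrative):

```python
def weighted_daily_mean(values):
    """Daily mean from the 25 hourly values DAY 00:00 .. DAY+1 00:00,
    giving the two midnight values half the weight of the others
    (the trapezoidal rule over the 24-hour interval)."""
    if len(values) != 25:
        raise ValueError("expected 25 hourly values (DAY 00:00 to DAY+1 00:00)")
    weights = [0.5] + [1.0] * 23 + [0.5]  # weights sum to 24
    return sum(w * v for w, v in zip(weights, values)) / 24.0

# For a constant series, the weighted mean recovers the constant.
print(weighted_daily_mean([5.0] * 25))  # 5.0
```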

Then, for locations in different time zones, things get a bit more complicated as accumulations end some time during the local day, but in principle possible in the same manner. To my knowledge, requesting hourly data for specific time zones is impossible.

Thanks again and kind regards!

Hi T_B,
My understanding is that for the hourly data, for an accumulative stat like total_precipitation, the value at Day+1 at 00:00 will be the total accumulation since 23:00 the previous day. If I choose MAX instead, I was under the impression that it would return the value from the hour with the most precipitation. Neither of these is exactly what you want, either. That is why I am using mean x 24, which I believe gets closer (and was recommended to me by a member of the staff). It would be a bit of a computational pain, but you could download all of the hourly data and perform the sum yourself.
I don't use mean x 24 for any variable other than accumulative ones.

By the way - have you seen the latest post about the CDS API? It looks like the daily statistics are back as of 10/14. Using dataset = “derived-era5-single-levels-daily-statistics”
e.g. ERA5 post-processed daily statistics on single levels from 1940 to present
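A rough sketch of a request against that new dataset, assuming the field names shown on the dataset's CDS download form (double-check them against the form for your account before submitting; the helper function names here are illustrative):

```python
DATASET = "derived-era5-single-levels-daily-statistics"

def build_request(variable, statistic, year, month):
    """Assemble a request dict for the daily-statistics dataset.
    Field names mirror the CDS download form and may need adjusting."""
    return {
        "product_type": "reanalysis",
        "variable": [variable],
        "year": [year],
        "month": [month],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "daily_statistic": statistic,
        "time_zone": "utc+00:00",
        "frequency": "1_hourly",
    }

def download_daily(variable, statistic, year, month, target):
    """Submit the request via cdsapi (requires ~/.cdsapirc credentials)."""
    import cdsapi
    client = cdsapi.Client()
    client.retrieve(DATASET, build_request(variable, statistic, year, month)).download(target)

# Example call (commented out so nothing is submitted accidentally):
# download_daily("2m_temperature", "daily_mean", "2020", "01",
#                "2m_temperature.daily_mean.2020-01.nc")
```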

Thanks again for your reply!

I am pretty sure the accumulations in hourly data are continuously additive such that the value at Day +1 at 00:00 contains the whole previous day, not just one hour. All example values I saw look like that (e.g. above, always increasing throughout the day) and the documentation reads:

“The accumulations in the short forecasts of ERA5-Land (with hourly steps from 01 to 24) are treated the same as those in ERA-Interim or ERA-Interim/Land, i.e., they are accumulated from the beginning of the forecast to the end of the forecast step. For example, runoff at day=D, step=12 will provide runoff accumulated from day=D, time=0 to day=D, time=12. The maximum accumulation is over 24 hours, i.e., from day=D, time=0 to day=D+1,time=0 (step=24). […] For the CDS time, or validity time, of 00 UTC, the accumulations are over the 24 hours ending at 00 UTC i.e. the accumulation is during the previous day.”

(I would be curious if the staff member you mentioned has different or additional views on this.)

Thank you also for pointing to the new daily data set. Unfortunately, it does not contain accumulative data (the page reads: “Note that the accumulated variables are omitted.”)

Cheers

Hi again,

Perhaps the source of our different understandings was the use of different data sets? Apparently, accumulative variables differ between ERA5-Land and ERA5 single levels/pressure levels:

https://confluence.ecmwf.int/display/CKB/ERA5+family+post-processed+daily+statistics+documentation

Best regards!

Hello! I think you may be right. I was a bit confused about that, but hadn’t looked at the Land documentation. Thanks for your dedication to finding that discrepancy.

This issue is solved since the release of the daily statistics dataset for the new API: