How to Download Faster

I am downloading a lot of data from the ERA5 dataset: 20 years of temperature and precipitation data for a few thousand small regions. How do I do this quickly? The download time for each region is around 5-10 minutes, so with a few thousand regions this will never finish. Are there any tips on how to use the API to download the data more efficiently?


Thanks

Hi Joseph,

Is there any chance you could group those small regions into bigger ones, so that you download the data in fewer requests and then extract the per-region data from the downloaded files?

Cheers,

Xiaobo

Thanks for the reply. So requesting the maximum amount of data that my computer can handle reasonably well will speed up the entire process? Or are there request shapes that will dramatically slow it down? For example, if I request multiple variables (temperature and radiation) or multiple years of data in one query, will that actually be worse than requesting one variable and one year at a time?


Thanks,

Joe

Hi Joe,

The recommendation is:

  • For daily data, make one request per month
  • For monthly data, make one request per year

If you are asking for too much, you may have to split the period into smaller ones. Within each request, ask for as many variables as possible.
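As a minimal sketch of that chunking (the dataset name, variables and file names below are only placeholders to adapt to your own case), one request per month with all variables bundled in could look like:

```python
# Placeholder variable list; bundle as many variables as possible per request.
VARIABLES = ['2m_temperature', 'total_precipitation']

def monthly_request(year, month):
    """Build one CDS request covering a whole month, all variables bundled in."""
    return {
        'variable': VARIABLES,
        'year': str(year),
        'month': '%02d' % month,
        'day': ['%02d' % d for d in range(1, 32)],  # days not in the month are skipped
        'format': 'netcdf',
    }

def download_year(year):
    """Submit the twelve monthly requests (needs cdsapi and CDS credentials)."""
    import cdsapi
    c = cdsapi.Client()
    for month in range(1, 13):
        c.retrieve('reanalysis-era5-land', monthly_request(year, month),
                   'era5_%d_%02d.nc' % (year, month))
```

The point is that the month, not the region, is the unit you loop over.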

Regards,

Xiaobo


Hi Xiaobo,

What is the recommendation for hourly data?

Niels.

Hi Niels,

Hourly data falls in the daily data category.

Cheers,

Xiaobo

Thanks. I tried that and found out it takes ca. 15 minutes to download one month of hourly data for 4 variables for one location. As this may be of help to others, here is my code:

import cdsapi
c = cdsapi.Client()

for month in range(1, 13):
    month2 = '%02d' % month  # zero-pad the month number, e.g. 1 -> '01'

    c.retrieve(
        'reanalysis-era5-land',
        {
            'variable': [
                '2m_dewpoint_temperature', '2m_temperature', 'surface_pressure',
                'surface_solar_radiation_downwards', 'total_precipitation',
            ],
            'area': '50.8/5.2/50.8/5.2',  # North/West/South/East, collapsed to one point
            'year': '2011',
            'month': month2,
            'day': [
                '01', '02', '03', '04', '05', '06',
                '07', '08', '09', '10', '11', '12',
                '13', '14', '15', '16', '17', '18',
                '19', '20', '21', '22', '23', '24',
                '25', '26', '27', '28', '29', '30',
                '31',
            ],
            'time': [
                '00:00', '01:00', '02:00', '03:00', '04:00', '05:00',
                '06:00', '07:00', '08:00', '09:00', '10:00', '11:00',
                '12:00', '13:00', '14:00', '15:00', '16:00', '17:00',
                '18:00', '19:00', '20:00', '21:00', '22:00', '23:00',
            ],
            'format': 'netcdf',
        },
        '/Users/.../data/weather/my-file' + month2 + '.nc')

I am using the 'area' key to request only one location – an area collapsed to a single point at latitude 50.8, longitude 5.2.

Hi Niels,

Performance is related to how busy the CDS is. https://cds.climate.copernicus.eu/live/queue gives you information about the current queue.

I hope this helps.

Xiaobo

I have a similar problem with very slow download speed. The data I request via the CDS API is only 2.1 MB, yet it takes 45 minutes to download. This is unusual, because I can download more than 2 GB in the same amount of time using ecmwfapi. Additionally, I found that once the requested data grows beyond a certain size – for example, when I widen the area by 9 degrees of longitude – an error always occurs:

KeyboardInterrupt

or

Exception: the request you have submitted is not valid. One or more variable sizes violate format constraints.

With a small data size, though, the download works.

Below is my code for retrieving ERA5 data:

code start

import calendar
import cdsapi
server = cdsapi.Client()

def retrieve_era5():
    """
    A function to demonstrate how to iterate efficiently over several years
    and months for a particular era5_request.
    Change the variables below to adapt the iteration to your needs.
    You can use the variable 'target' to organise the requested data in files
    as you wish. In the example below the data are organised in files per
    month (e.g. "era5_daily_201510.grb").
    """
    yearStart = 1998
    yearEnd = 1998
    monthStart = 1
    monthEnd = 1
    for year in range(yearStart, yearEnd + 1):
        Year = str(year)
        for month in range(monthStart, monthEnd + 1):
            Month = str(month)
            numberOfDays = calendar.monthrange(year, month)[1]
            Days = [str(x) for x in range(1, numberOfDays + 1)]
            target = "era5_1h_daily_0to70S_100Eto120W_025025_quv_%04d%02d.nc" % (year, month)
            era5_request(Year, Month, Days, target)

def era5_request(Year, Month, Days, target):
    """
    An ERA5 request for analysis pressure-level data.
    Change the keywords below to adapt it to your needs
    (e.g. to add or to remove levels, parameters, times etc.).
    """
    server.retrieve('reanalysis-era5-pressure-levels',
                    {'product_type': 'reanalysis',
                     'format': 'netcdf',
                     'variable': ['specific_humidity', 'u_component_of_wind',
                                  'v_component_of_wind'],
                     'year': Year,
                     'month': Month,
                     'day': Days,
                     'pressure_level': ['300', '350', '400', '450', '500', '550',
                                        '600', '650', '700', '750', '775', '800',
                                        '825', '850', '875', '900', '925', '950',
                                        '975', '1000'],
                     'time': ['00:00', '01:00', '02:00', '03:00', '04:00', '05:00',
                              '06:00', '07:00', '08:00', '09:00', '10:00', '11:00',
                              '12:00', '13:00', '14:00', '15:00', '16:00', '17:00',
                              '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'],
                     'area': [0, 100, -1, 101]},
                    target)

if __name__ == '__main__':
    retrieve_era5()

code end

This code deliberately starts small: it tries to download specific_humidity, u_component_of_wind and v_component_of_wind from 1998-01-01 to 1998-01-31, at a temporal resolution of 1 hour, a spatial resolution of 0.25° x 0.25°, pressure levels from 300 hPa to 1000 hPa, and an area of 1°S to 0°, 100°E to 101°E.

Below is the picture showing the results of running this code:

Below is the picture showing the download via ecmwfapi: basically, 23 minutes to retrieve 2.18 GB of data:


When changing 'area': [0, 100, -1, 101] to 'area': [0, 100, -70, 101], it works fine. But when changing it to 'area': [0, 100, -70, 130], the error occurs:


KeyboardInterrupt or Exception: the request you have submitted is not valid. One or more variable sizes violate format constraints.


I thought this might be due to a limit, but when I built the same request through the website, it showed the data size was under the limit.


So I do not know what is going on.

Hi Ted,

Could you share the script which did not run?

Thank you,

Xiaobo

Hi Xiaobo

Problem solved: it turns out that I hit the limit, which I guess is 10 GB. So now I just retrieve the data day by day.
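In case it helps anyone else, the day-by-day variant only needs a small helper around the loop above (the file names here are hypothetical):

```python
import calendar

def daily_targets(year, month):
    """One (day, output-file) pair per day of the month, for day-by-day retrieval.
    Each pair would be passed to era5_request() with Days=[day]."""
    ndays = calendar.monthrange(year, month)[1]
    return [('%02d' % d, 'era5_quv_%04d%02d%02d.nc' % (year, month, d))
            for d in range(1, ndays + 1)]
```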

Thanks for letting me know, Ted.

Hi Xiaobo,

I am conducting probabilistic yield forecasting for rooftop PV systems, but I found it extremely time-consuming to download the ensemble forecasts of ssrd for my target PV site. It takes 1.5 hours to download the 50 ensemble members of hourly ssrd forecasts for the target site for one day (around 350 kB), and I have to download the ensemble forecasts for several years. Could you please give me some suggestions? My request is shown below:


from ecmwfapi import ECMWFService

server = ECMWFService("mars",
    url="https://api.ecmwf.int/v1",
    key=my_api,
    email=my_email)

server.execute(
    {
    "class": "od",
    "date": "20190101/to/20190102",
    "expver": "1",
    "levtype": "sfc",
    "number": "1/to/50",
    "param": "169.128",
    "area": "51.7/5.2/51.2/5.7",
    "step": "0/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/23",
    "GRID":"0.25/0.25",
    "stream": "enfo",
    "time": "0000",
    "type": "pf",
    },
    "2019_irr_step023.grib")

Hi Bin,

Since you are not accessing data on the CDS (Climate Data Store), could you raise a ticket at our support portal https://support.ecmwf.int?

Thank you,

Xiaobo

Hi Xiaobo,

I encountered a similar problem while downloading ERA5 data through the Python API. When the request started, the download speed was very good, but it quickly decreased and an error was eventually reported (see the images below). My code is almost the same as Niels_Holst's above.

Could you help me check where the problem lies?

Thank you very much.
Li


Hi Joseph,

Try condensing your regions as Xiaobo suggested, and parallelize your downloads.

I think your best bet is to download your data in as few different regions or bounding boxes as possible, probably just one big area covering all the regions you want, and then subset that on the client side after downloading if needed.
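As a sketch of that approach (the [North, West, South, East] ordering is the CDS 'area' convention; subset_region assumes a NetCDF download and xarray on the client side):

```python
def enclosing_area(regions):
    """Merge several [North, West, South, East] boxes into one CDS 'area'
    that covers them all, so everything fits in a single request."""
    norths, wests, souths, easts = zip(*regions)
    return [max(norths), min(wests), min(souths), max(easts)]

def subset_region(path, box):
    """Cut one region back out of the downloaded NetCDF file (requires xarray).
    Note that in an ERA5 file the latitude coordinate runs north to south,
    so the latitude slice goes from north down to south."""
    import xarray as xr
    north, west, south, east = box
    ds = xr.open_dataset(path)
    return ds.sel(latitude=slice(north, south), longitude=slice(west, east))
```

For example, enclosing_area([[51.0, 5.0, 50.5, 5.5], [52.0, 4.0, 51.5, 4.5]]) gives [52.0, 4.0, 50.5, 5.5], one box you can fetch in a single request and then carve up locally.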

And you can have multiple concurrent requests running; the CDS API allows up to 20 pending requests per user, maybe more now that the "disruption" on the old CDS is over. Do your downloads in parallel to take advantage of that: several requests then make their way through the queue at the same time, instead of each new request only joining the back of the queue after the previous one has made it all the way through.
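Here is a sketch of that parallelization with Python's concurrent.futures; the dataset, variables and worker count are just assumptions, and each thread simply blocks in retrieve() while its request sits in the CDS queue:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def make_jobs(years):
    """One (request, target-file) job per month; each job becomes its own CDS request."""
    jobs = []
    for year in years:
        for month in range(1, 13):
            request = {
                'variable': ['2m_temperature', 'total_precipitation'],
                'year': str(year),
                'month': '%02d' % month,
                'day': ['%02d' % d for d in range(1, 32)],
                'time': ['%02d:00' % h for h in range(24)],
                'format': 'netcdf',
            }
            jobs.append((request, 'era5_%d_%02d.nc' % (year, month)))
    return jobs

def download_all(jobs, workers=5):
    """Submit several requests at once so they queue on the CDS side in parallel
    (needs cdsapi and CDS credentials)."""
    import cdsapi

    def fetch(job):
        request, target = job
        cdsapi.Client().retrieve('reanalysis-era5-land', request, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for future in as_completed([pool.submit(fetch, j) for j in jobs]):
            print('finished', future.result())
```

Keep the worker count below the per-user request limit so you are not just queueing against yourself.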

My understanding is that the CDS prioritizes your requests, and governs their retrieval performance, by the number of "items" you're downloading, which is (variables × levels × dates) but not × locations/grid cells; a request is the unit of queue transit. I think this is because on the CDS's disks the entire geographical grid for a single (variable, time) is contiguous, so once the system is fetching one data point it is cheap (or even free) to fetch other nearby grid cells for that same (variable, time). On top of that, I think you get a prioritization penalty for total download volume in MB.

So, if you don’t break your requests up in to separate requests for different areas, it’ll probably go faster. That’s how I do it - each request is for either a full month or full day of data, for one variable, for the entire grid, and then I subset the downloaded netcdf file on the client side. That has worked pretty well.

I remember reading about this in these ECMWF doco pages, though I’m having a hard time finding the exact references now. And some of it was from discussion in my ECMWF support tickets, which are not publicly accessible.

Cheers,
Andrew

If you only need data for a couple of locations but a long time series, feel free to get this data from my API at Open-Meteo. Just make sure to select "ERA5" from the model selection, otherwise you will also get ECMWF HRES data: 🏛️ Historical Weather API | Open-Meteo.com