I checked back with the server admins and realized that while I can download data under certain conditions, I cannot under the conditions I am generally advised to use.
This is an HPC environment, with a login node where I connect to the cluster and compute nodes where jobs are meant to run. I can download data from the login node without issues, but not from the compute nodes.
The server admin explained this, and I will copy his answer below. In short, the compute nodes' network allows larger packet sizes than the login node's, and the CDS server then sends packets that are too large to get through, since it only knows about the compute node's maximum packet size.
For now, the solution is simply to run the downloads on the login node. Still, I hope this helps someone who runs into this issue in the future!
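For reference, here is a minimal sketch of the kind of CDS API retrieval this applies to; the dataset name and request fields are only an example, not my actual request:

import cdsapi

# Minimal CDS API retrieval sketch. The dataset and request below are
# only an example; run this from the login node.
client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2020",
        "month": "01",
        "day": "01",
        "time": "12:00",
        "format": "netcdf",
    },
    "era5_t2m_20200101.nc",
)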
Best,
Ezra
Answer by our server admin staff:
This [my issue above with the time-outs] relates to a feature in TCP where each side of a connection tells the other side how big packets (segments) it can accept, using the “mss” option. This means each side can send larger packets if the other side supports them.
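As an illustration, on Linux the segment size that a connection ended up using can be read back from the kernel after the handshake (the address below is just the CDS endpoint seen in the captures further down):

import socket

# Read the MSS the kernel settled on for this connection (Linux only).
# 136.156.155.74:443 is the CDS endpoint from the captures below.
with socket.create_connection(("136.156.155.74", 443), timeout=10) as s:
    mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
    print("MSS in use for this connection:", mss)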
But the connecting client and the answering server each only know this maximum size for their own local (closest) network, so they may tell the other side a size that does not work because of some network in between the client and the server. That is solved by a mechanism called Path MTU Discovery, where the machine sending a packet asks the routers on the way to the receiver to send back an error report if it sends a packet that is too big. It then switches to sending smaller packets, which get through.
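For illustration, on Linux the kernel's current estimate of the path MTU towards a peer can also be read back on a connected socket (IP_MTU is not always exported by Python's socket module, hence the fallback to its Linux value):

import socket

# Read the kernel's current path MTU estimate towards the peer (Linux only).
IP_MTU = getattr(socket, "IP_MTU", 14)  # 14 is the Linux value of IP_MTU

with socket.create_connection(("136.156.155.74", 443), timeout=10) as s:
    print("path MTU towards peer:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))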
This seems to be what is happening here. When you run your code on a login node (or when I run mine on the system server), it is directly connected to a network with a maximum size of 1460, so it tells the server that, the server does not try anything bigger, and it works:
15:59:11.991394 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [S], seq 3085352550, win 42340, options [mss 1460,sackOK,TS val 3986337542 ecr 0,nop,wscale 10], length 0
15:59:12.043896 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [S.], seq 1612981711, ack 3085352551, win 8192, options [mss 8902,sackOK,TS val 2725104033 ecr 3986337542,nop,wscale 0], length 0
15:59:12.043932 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [.], ack 1, win 42, options [nop,nop,TS val 3986337594 ecr 2725104033], length 0
15:59:12.068845 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [P.], seq 1:518, ack 1, win 42, options [nop,nop,TS val 3986337619 ecr 2725104033], length 517
15:59:12.121424 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [.], ack 518, win 7675, options [nop,nop,TS val 2725104111 ecr 3986337619], length 0
15:59:12.122941 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [.], seq 1:1449, ack 518, win 7675, options [nop,nop,TS val 2725104112 ecr 3986337619], length 1448
15:59:12.122967 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [.], ack 1449, win 42, options [nop,nop,TS val 3986337673 ecr 2725104112], length 0
However, when you run your code on a compute node, that node is connected to a local network with a bigger maximum size, but the traffic still has to pass through the system server, which has the same maximum size as above. The node does not know that, though, and announces its local maximum size of 4052. Looking at the traffic passing through the system server:
15:58:31.763844 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [S], seq 3398525870, win 64832, options [mss 4052,sackOK,TS val 1600394909 ecr 0,nop,wscale 7], length 0
15:58:31.816649 IP 136.156.155.74.https > 130.236.103.100.34236: Flags [S.], seq 3571240643, ack 3398525871, win 8192, options [mss 8902,sackOK,TS val 2725063809 ecr 1600394909,nop,wscale 0], length 0
15:58:31.816750 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [.], ack 1, win 507, options [nop,nop,TS val 1600394962 ecr 2725063809], length 0
15:58:31.850850 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [P.], seq 1:518, ack 1, win 507, options [nop,nop,TS val 1600394996 ecr 2725063809], length 517
15:58:31.903402 IP 136.156.155.74.https > 130.236.103.100.34236: Flags [.], ack 518, win 7675, options [nop,nop,TS val 2725063896 ecr 1600394996], length 0
What happens here is that in step 6 the server tries to send a packet back to Tetralith that is too large (bigger than the 1448 bytes it sends when it works); that packet is lost and the connection hangs.
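As a side note, the MSS values above follow directly from the interface MTUs: for IPv4, the MSS is the MTU minus 40 bytes of IP and TCP headers, so an MTU of 1500 gives MSS 1460 and an MTU of 4092 gives MSS 4052. On a Linux node the interface MTU can be read like this (the interface name is only a placeholder):

from pathlib import Path

# Linux exposes each interface's MTU under /sys/class/net/<name>/mtu.
def interface_mtu(ifname: str) -> int:
    return int(Path("/sys/class/net", ifname, "mtu").read_text())

print(interface_mtu("eth0"))  # "eth0" is a placeholder interface name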
The Path MTU Discovery I talked about above should have kicked in, and the CDS server 136.156.155.74 should have been told that it needs to send smaller packets, but this seems to have failed. That is commonly the case when the server sits behind a firewall that is too strict and blocks too much ICMP traffic, including the error reports Path MTU Discovery relies on.
[…]
See: Path MTU Discovery - Wikipedia, especially the
part about Problems.