I checked back with the server admins and realized that while I can download data under certain conditions, I cannot under the conditions I am generally advised to use.
This is an HPC environment, with a login node where I connect to the cluster and compute nodes where jobs are meant to run. I can download data from the login node without issues, but not from the compute nodes.
The server admin explained this, and I will copy his answer below. In short, the compute nodes' network allows larger packet sizes than the login node's, and the CDS server then sends packets that are too large to get through, since it only knows about the compute node's maximum packet size.
For now, the solution is simply to run the downloads on the login node. Still, I hope this helps someone who runs into this issue in the future!
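For reference, here is a minimal sketch of the kind of CDS API retrieval this applies to; the dataset name and request fields are only an example, not my actual request:

import cdsapi

# Minimal CDS API retrieval sketch. The dataset and request below are
# only an example; run this from the login node.
client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2020",
        "month": "01",
        "day": "01",
        "time": "12:00",
        "format": "netcdf",
    },
    "era5_t2m_20200101.nc",
)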
Best,
Ezra
Answer by our server admin staff:
This [my issue above with the time-outs] relates to a feature in TCP where each side of a connection tells the other side how big packets (segments) it can accept, using the “mss” option. This means each side can send larger packets if the other side supports them.
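As an illustration, on Linux the segment size that a connection ended up using can be read back from the kernel after the handshake (the address below is just the CDS endpoint seen in the captures further down):

import socket

# Read the MSS the kernel settled on for this connection (Linux only).
# 136.156.155.74:443 is the CDS endpoint from the captures below.
with socket.create_connection(("136.156.155.74", 443), timeout=10) as s:
    mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
    print("MSS in use for this connection:", mss)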
But the connecting client and the answering server each only know this maximum size for their own local (closest) network, so they may tell the other side a size that does not work because of some network in between the client and the server. That is solved by a mechanism called Path MTU Discovery, where the machine sending a packet asks the routers on the way to the receiver to send back an error report if it sends a packet that is too big. It then switches to sending smaller packets, which get through.
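For illustration, on Linux the kernel's current estimate of the path MTU towards a peer can also be read back on a connected socket (IP_MTU is not always exported by Python's socket module, hence the fallback to its Linux value):

import socket

# Read the kernel's current path MTU estimate towards the peer (Linux only).
IP_MTU = getattr(socket, "IP_MTU", 14)  # 14 is the Linux value of IP_MTU

with socket.create_connection(("136.156.155.74", 443), timeout=10) as s:
    print("path MTU towards peer:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))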
This seems to be what is happening here. When you run your code on a login node (or when I run mine on the system server), it is directly connected to a network with a maximum size of 1460, so it tells the server that, the server does not try anything bigger, and it works:
15:59:11.991394 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [S], seq 3085352550, win 42340, options [mss 1460,sackOK,TS val 3986337542 ecr 0,nop,wscale 10], length 0
15:59:12.043896 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [S.], seq 1612981711, ack 3085352551, win 8192, options [mss 8902,sackOK,TS val 2725104033 ecr 3986337542,nop,wscale 0], length 0
15:59:12.043932 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [.], ack 1, win 42, options [nop,nop,TS val 3986337594 ecr 2725104033], length 0
15:59:12.068845 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [P.], seq 1:518, ack 1, win 42, options [nop,nop,TS val 3986337619 ecr 2725104033], length 517
15:59:12.121424 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [.], ack 518, win 7675, options [nop,nop,TS val 2725104111 ecr 3986337619], length 0
15:59:12.122941 IP 136.156.155.74.https > 130.236.103.100.40200: Flags [.], seq 1:1449, ack 518, win 7675, options [nop,nop,TS val 2725104112 ecr 3986337619], length 1448
15:59:12.122967 IP 130.236.103.100.40200 > 136.156.155.74.https: Flags [.], ack 1449, win 42, options [nop,nop,TS val 3986337673 ecr 2725104112], length 0
However, when you run your code on a compute node, that node is connected to a local network with a bigger maximum size, but the traffic still has to pass through the system server, which has the same maximum size as above. The node does not know that, though, and announces its local maximum size of 4052. Looking at the traffic passing through the system server:
15:58:31.763844 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [S], seq 3398525870, win 64832, options [mss 4052,sackOK,TS val 1600394909 ecr 0,nop,wscale 7], length 0
15:58:31.816649 IP 136.156.155.74.https > 130.236.103.100.34236: Flags [S.], seq 3571240643, ack 3398525871, win 8192, options [mss 8902,sackOK,TS val 2725063809 ecr 1600394909,nop,wscale 0], length 0
15:58:31.816750 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [.], ack 1, win 507, options [nop,nop,TS val 1600394962 ecr 2725063809], length 0
15:58:31.850850 IP 130.236.103.100.34236 > 136.156.155.74.https: Flags [P.], seq 1:518, ack 1, win 507, options [nop,nop,TS val 1600394996 ecr 2725063809], length 517
15:58:31.903402 IP 136.156.155.74.https > 130.236.103.100.34236: Flags [.], ack 518, win 7675, options [nop,nop,TS val 2725063896 ecr 1600394996], length 0
What happens here is that in step 6 the server tries to send a packet back to Tetralith that is too large (bigger than the 1448 bytes it sends when it works); that packet is lost and the connection hangs.
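As a side note, the MSS values above follow directly from the interface MTUs: for IPv4, the MSS is the MTU minus 40 bytes of IP and TCP headers, so an MTU of 1500 gives MSS 1460 and an MTU of 4092 gives MSS 4052. On a Linux node the interface MTU can be read like this (the interface name is only a placeholder):

from pathlib import Path

# Linux exposes each interface's MTU under /sys/class/net/<name>/mtu.
def interface_mtu(ifname: str) -> int:
    return int(Path("/sys/class/net", ifname, "mtu").read_text())

print(interface_mtu("eth0"))  # "eth0" is a placeholder interface name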
The Path MTU Discovery I talked about above should have kicked in, and the CDS server 136.156.155.74 should have been told that it needs to send smaller packets, but this seems to have failed. That is commonly the case when the server sits behind a firewall that is too strict and blocks too much ICMP traffic, including the error reports Path MTU Discovery relies on.
[…]
See: Path MTU Discovery - Wikipedia, especially the
part about Problems.