Separation of retrieve and download

Magee_Eric · 10 December 2024 23:07

Is there a way in the new CDS-API to separate the retrieve and the download similar to what is described here by fridgerator on Sep 2, 2020?

github.com/ecmwf/cdsapi

Separation of retrieve and download

opened 11:54AM - 06 May 19 UTC

closed 06:00AM - 03 Sep 20 UTC

jblarsen

If retrieve is called without a target filename the method will execute a retrieve request to CDS and poll it until the request is 'completed' or 'failed' and return a Result object after that. The download method on this object can then be called to actually download the data file. But for retrieve requests which take a very long time to complete (> 1 day) this workflow may not be suitable (on all systems).

I would therefore propose an option to have a slightly different workflow:

1. The retrieve request returns a Result object after the first ("robust") request without waiting for the state of the reply to be 'completed' or 'failed'.
2. A new instance method on the Result class named e.g. 'query_state', 'update', 'update_state' or something like this is added.

This would allow users more control over the retrieval and download process. In our system it is for example not optimal to have very long running processes. We can then split the processing up into one task which executes the retrieve request and puts the Result object information in our key value store. After that we can then regularly dispatch a task which queries the state of the CDS task and downloads the resulting dataset when ready. Just like you do internally in cdsapi now.

So in summary this suggestion will not change anything about how cdsapi works now. But it will give users more flexibility in integrating cdsapi in data processing pipelines. Please let me know what you think of the above. Please see PR below for details.
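The split workflow proposed in the issue can be sketched without any CDS-specific calls. What follows is only an illustration under stated assumptions, not how cdsapi implements it: a JSON file stands in for the key-value store, the names `load_pending`, `record_request`, `poll_once`, `get_status` and `download` are hypothetical, and the status strings 'completed' and 'failed' are the ones used in the issue text.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_pending(store: Path) -> List[str]:
    """Return the request IDs still waiting, or an empty list if none are stored."""
    return json.loads(store.read_text()) if store.exists() else []


def record_request(store: Path, request_id: str) -> None:
    """Task 1: remember a submitted request ID, then exit without waiting."""
    pending = load_pending(store)
    if request_id not in pending:
        pending.append(request_id)
    store.write_text(json.dumps(pending))


def poll_once(
    store: Path,
    get_status: Callable[[str], str],
    download: Callable[[str], None],
) -> List[str]:
    """Task 2: one short polling pass; returns the IDs that are still pending."""
    still_pending = []
    for request_id in load_pending(store):
        status = get_status(request_id)
        if status == "completed":
            download(request_id)  # finished: fetch the data and drop the ID
        elif status != "failed":
            still_pending.append(request_id)  # not finished yet: keep for the next pass
        # failed requests are dropped without downloading
    store.write_text(json.dumps(still_pending))
    return still_pending
```

Each call to `poll_once` is short-lived, so it can be dispatched by a scheduler at whatever interval suits the system; `get_status` and `download` would wrap whichever client or HTTP calls are actually used.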

Koen_Hufkens · 12 December 2024 19:55

If by this you mean scheduling the download to pick up the data at a later point, check here under Basic API documentation:

Magee_Eric · 16 December 2024 22:26

This certainly works; however, I am retrieving and downloading with independent scripts. In order to not tie up the job queue with a job simply waiting for a request to process and complete, I am sending the requests first. I then log the request ID and run a new script to download the completed requests. Here is a snippet showing what I am doing now (aside from error checking, etc.):

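import requests

# request_id and api_key are assumed to be defined earlier (e.g. read from the request log)
session = requests.Session()
url = f"https://cds.climate.copernicus.eu/api/retrieve/v1/jobs/{request_id}"
headers = {"User-Agent": "datapi/0.1.1", "PRIVATE-TOKEN": api_key}
session.headers.update(headers)
# Query the job status
s = session.get(url)
# some checking here to make sure the request exists and is complete
# Query results to get the URL for the download
r = session.get(url + "/results")
# more checking
# Assumption (not shown in the original post): the download link sits under asset -> value
asset = r.json()["asset"]["value"]
result = session.get(asset["href"])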
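For large files, the final `session.get(...)` on the asset link in the snippet above would hold the whole response body in memory. One possible variation, sketched here on the assumption that `href` is a direct download link, streams the body to disk in chunks. The helper name `download_to_file` and its parameters are hypothetical; `stream=True`, `raise_for_status()` and `iter_content()` are standard `requests` features.

```python
from pathlib import Path


def download_to_file(session, href, target, chunk_size=1024 * 1024):
    """Stream the file at href to target using a requests.Session-like object."""
    target = Path(target)
    with session.get(href, stream=True) as response:
        response.raise_for_status()  # stop early on HTTP errors
        with target.open("wb") as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return target
```

With the session from the snippet above, this might be called as `download_to_file(session, asset["href"], "output.nc")`, where the file name is only a placeholder.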

If there is a better way to do this, let me know.

Thanks,
Eric

Koen_Hufkens · 17 December 2024 09:21

That’s what I meant, and what I see as the only option. There does not seem to be a way to list all request statuses outright, as there was before, since the request ID is included in the query. If you do find one, let me know.

Magee_Eric · 17 December 2024 14:00

Got it - I read the first part of the post but missed the last part about the custom API implementations. Appreciate your help.

Amy_Ngwele · 21 December 2024 11:01

If you have the request ID, you can do the download later using this:

    import cdsapi

    # request_id: the ID logged at submission; save_file: the target path
    new_client = cdsapi.Client()
    result = new_client.client.get_remote(request_id)
    out = result.download(save_file)

Found in a reply to this post:

Hannes_K · 2 January 2025 09:43

You can query https://cds.climate.copernicus.eu/api/retrieve/v1/jobs to get information on all your jobs, including their ID and status. The code below works for me.

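import requests

# YOUR_PRIVATE_KEY is a placeholder for your personal access token
with requests.Session() as session:
    session.headers.update({"PRIVATE-TOKEN": YOUR_PRIVATE_KEY})
    r = session.get("https://cds.climate.copernicus.eu/api/retrieve/v1/jobs")
    r.raise_for_status()

result = r.json()
# Print the ID and status of every job returned in the listing
for job in result.get("jobs", []):
    print(job["jobID"], job["status"])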

See here for more information: ECMWF APIs (FAQ, API + data documentation)
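To feed such a listing into a separate download script, the jobs can be grouped by status first. This is a minimal sketch that relies only on the `jobs`, `jobID` and `status` keys used in the code above; the function name `group_job_ids_by_status` is hypothetical.

```python
from collections import defaultdict


def group_job_ids_by_status(payload):
    """Map each status string to the list of job IDs reporting it."""
    grouped = defaultdict(list)
    for job in payload.get("jobs") or []:  # tolerate a missing or empty listing
        grouped[job["status"]].append(job["jobID"])
    return dict(grouped)
```

With `result` from the listing code, `group_job_ids_by_status(result).get("successful", [])` would give the IDs that are ready to download, assuming the service reports finished jobs with the status string "successful"; check the printed statuses to confirm the exact values.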