TL;DR: The cdstoolbox package is your one-stop shop for data handling and application development in the CDS Toolbox, providing remote processing of data and caching of results. It's not possible to use popular Python libraries like numpy and xarray to process data in the Toolbox, but most of the functionality of these libraries can be found within the cdstoolbox package. You can browse the documentation pages to find the tools you need.
The CDS Toolbox offers a broad variety of tools for retrieving, processing and visualising datasets from the C3S Climate Data Store, with the cdstoolbox Python package containing everything you need to explore datasets and develop applications in the Toolbox editor. The cdstoolbox package (usually imported as ct) draws from several widely used Python libraries including xarray, numpy, scipy and pandas; if you're familiar with these libraries, you'll find a lot of the same functionality within cdstoolbox.
Why cdstoolbox?
One huge advantage of using the Toolbox over processing data "offline" is that your code, or workflow, is run entirely within the CDS infrastructure. This means you do not need to download huge volumes of data or have a powerful computer to work with CDS data - you can use our computers instead!
In order to achieve this, the Toolbox has its own tools contained within the cdstoolbox namespace which understand how to make use of the CDS infrastructure. It's important to understand that it's not possible to use libraries like numpy and xarray to work with data directly in the Toolbox editor, but you can find substitute tools in the cdstoolbox package. You can explore the tools available in the Toolbox through the documentation pages.
This is all you need to know, but read on to learn how the cdstoolbox package achieves remote data processing and caching of results.
What are remote objects?
When you retrieve data from the CDS catalogue in a Toolbox workflow, the result is returned as a remote object, which is simply a pointer to a data file stored (cached) on the CDS. Remote objects can be printed within a workflow to get an xarray summary of the underlying data:
    import cdstoolbox as ct

    # Retrieve one ERA5 2m temperature field on a 3-degree grid.
    data = ct.catalogue.retrieve(
        'reanalysis-era5-single-levels',
        {
            'variable': '2m_temperature',
            'product_type': 'reanalysis',
            'year': '2017',
            'month': '01',
            'day': '01',
            'time': '12:00',
            'grid': ['3', '3'],
        }
    )
    # Printing a remote object shows an xarray summary of the underlying data.
    print(data)

    <xarray.DataArray 'tas' (lat: 61, lon: 120)>
    array([[248.32056, 248.32056, 248.32056, ..., 248.32056, 248.32056, 248.32056],
           [246.57446, 246.83813, 247.68188, ..., 246.20337, 246.36548, 246.44751],
           [266.68774, 264.39868, 264.83032, ..., 263.67212, 264.8557 , 266.97485],
           ...,
           [247.51587, 247.47095, 246.27759, ..., 249.92798, 248.469  , 247.53345],
           [250.15063, 249.92798, 249.57837, ..., 250.44556, 250.39478, 250.31665],
           [250.28345, 250.28345, 250.28345, ..., 250.28345, 250.28345, 250.28345]],
          dtype=float32)
    Coordinates:
        realization  int64 ...
        time         datetime64[ns] ...
      * lat          (lat) float64 -90.0 -87.0 -84.0 -81.0 ... 81.0 84.0 87.0 90.0
      * lon          (lon) float64 -180.0 -177.0 -174.0 -171.0 ... 171.0 174.0 177.0
    Attributes:
        long_name:      Near-Surface Air Temperature
        units:          K
        standard_name:  air_temperature
        comment:        near-surface (usually, 2 meter) air temperature
        type:           real
        Conventions:    CF-1.7
        institution:    European Centre for Medium-Range Weather Forecasts
        history:        2020-05-13T15:49:37 GRIB to CDM+CF via cfgrib-0.9.7.7/ecC...
        source:         ECMWF
This is a great way to get a quick view of the data you're working with, although in reality a remote object is not an xarray DataArray but rather an object containing all the information the Toolbox needs to find, retrieve and operate on the data. This means that remote objects cannot be treated as Python arrays, because the data isn't present within the object itself; instead, we need to use the tools and services within the cdstoolbox namespace, because they understand how to access and process the data.
How do CDS tools and services work?
The CDS tools and services are functions that are designed to work with remote objects. They take dictionaries as input, which provide parameters and/or file locations, and return a dictionary containing a "resultlocation": the URL of the file produced by the service, whether that is a netCDF, JSON, PNG or any other file type.
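The dictionary-in, dictionary-out pattern described above can be pictured with a small stand-alone sketch. This is plain Python, not cdstoolbox code: the `mean_service` function, the `FILE_STORE` dictionary, the `"location"` input key and the `cds.example` URLs are all hypothetical names made up for illustration; only the `"resultlocation"` key comes from the description above.

```python
import json

# Hypothetical in-memory stand-in for files stored on the CDS.
FILE_STORE = {"https://cds.example/cache/input.json": [1.0, 2.0, 3.0, 6.0]}


def mean_service(request):
    """Toy service: takes a dict with a file location, writes an output
    file, and returns a dict pointing at that output."""
    values = FILE_STORE[request["location"]]
    result = sum(values) / len(values)
    result_url = "https://cds.example/cache/mean-output.json"
    # The "file" produced by the service (JSON here, but it could be any type).
    FILE_STORE[result_url] = json.dumps({"mean": result})
    return {"resultlocation": result_url}


response = mean_service({"location": "https://cds.example/cache/input.json"})
print(response["resultlocation"])
```

Note that the caller only ever receives a location, never the data itself; the next service in a chain would be handed that location as part of its own input dictionary.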
When working with remote data, the process is as follows:
- A cdstoolbox function is called on a remote object;
- The execution of the function takes place remotely on a CDS compute node;
- The data referenced by the remote object is loaded from the CDS cache and processed;
- The output is saved back to the cache and returned as a new remote object.
This means that all of the 'heavy lifting' takes place on powerful CDS compute nodes, and every result produced is cached on the CDS.
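The four steps above can be mimicked in a few lines of ordinary Python. The sketch below is a toy model only and makes no claim about how the Toolbox is actually implemented: `Remote`, `store`, `to_celsius` and the `cache://` locations are hypothetical, and a dictionary stands in for the CDS cache.

```python
from dataclasses import dataclass
import itertools

CACHE = {}                  # stands in for the CDS cache
_ids = itertools.count(1)   # generates unique cache locations


@dataclass(frozen=True)
class Remote:
    """Pointer to cached data; it holds a location, not the data."""
    location: str


def store(values):
    """Save values to the cache and return a remote object pointing at them."""
    location = f"cache://result-{next(_ids)}"
    CACHE[location] = values
    return Remote(location)


def to_celsius(remote):
    # Step 2: this body stands in for code running on a compute node.
    values = CACHE[remote.location]                     # Step 3: load from the cache
    converted = [round(v - 273.15, 2) for v in values]  # Step 3: process
    return store(converted)                             # Step 4: cache and return a new remote


kelvin = store([273.15, 283.15])
celsius = to_celsius(kelvin)    # Step 1: call a function on a remote object
print(celsius)                  # shows only the pointer
print(CACHE[celsius.location])  # the data lives in the cache
```

In this model, `celsius` carries no temperature values at all, which mirrors why a remote object cannot be indexed or looped over like an array in a workflow.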
Caching
Once a cdstoolbox function has been executed, the result is stored in the CDS cache. This means that the next time that service is run with the exact same inputs, the result is retrieved instantly from the cache and the data doesn't need to be processed again.
Taking this one step further, if you run a Toolbox workflow that has been run before, without changing its state, the previous result will be retrieved from the cache without executing any of your workflow!
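The idea of reusing a result when a function is called again with the exact same inputs can be sketched as follows. This is a generic memoisation example in plain Python; how the CDS actually builds its cache keys is not described here, so the hashing scheme, `cached_call` and `area_mean` are assumptions made purely for illustration.

```python
import hashlib
import json

CACHE = {}   # stands in for the CDS cache
calls = []   # records each time real processing happens


def cached_call(func, **inputs):
    """Run func only if this exact function/inputs pair has not been seen."""
    key_source = json.dumps(
        {"func": func.__name__, "inputs": inputs}, sort_keys=True
    )
    key = hashlib.sha256(key_source.encode()).hexdigest()
    if key not in CACHE:
        CACHE[key] = func(**inputs)
    return CACHE[key]


def area_mean(values):
    calls.append("area_mean")
    return sum(values) / len(values)


first = cached_call(area_mean, values=[1, 2, 3])
second = cached_call(area_mean, values=[1, 2, 3])  # same inputs: served from the cache
third = cached_call(area_mean, values=[1, 2, 4])   # different inputs: processed again
print(len(calls))
```

Changing any input produces a different key, so only genuinely repeated requests skip the processing step.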