ARCO Data Lake coming soon for the Data Stores Services!

As part of its effort to continuously improve the quality of service offered to users, the Data Stores Services (DSS) at ECMWF is setting up an “ARCO Data Lake”. ARCO (Analysis-Ready Cloud-Optimised) data structures can offer users much faster access, especially for access patterns that do not match the underlying native storage. For example, retrieving a long time series at a single geographical point can be slow if the underlying data is stored as a series of horizontal fields, as is the case for ERA5 datasets. ARCO data structures, by contrast, are chunked in both space and time, which improves performance when slicing through time.
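To make the chunking argument concrete, here is a rough sketch that counts how many chunks must be read to extract a one-year time series at a single grid point. The grid size, the chunk shapes and the `chunks_touched` helper are illustrative assumptions, not the actual layout of the Data Lake.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a contiguous selection.

    selection   -- one (start, stop) index pair per dimension, stop exclusive
    chunk_shape -- chunk length along each dimension
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # index of the first chunk hit
        last = (stop - 1) // size      # index of the last chunk hit
        count *= last - first + 1
    return count


# Hypothetical hourly dataset: one year on a 721 x 1440 grid (time, lat, lon).
n_time = 8760
point_series = [(0, n_time), (100, 101), (200, 201)]  # all times, one grid point

# Field-wise layout: each chunk holds one full horizontal field.
field_chunks = (1, 721, 1440)
# Layout chunked in both space and time (assumed chunk shape).
arco_chunks = (720, 32, 32)

print(chunks_touched(point_series, field_chunks))  # 8760
print(chunks_touched(point_series, arco_chunks))   # 13
```

Under these assumed shapes, the same point time series needs 13 chunk reads instead of 8760, which is the kind of gain described above.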

The Data Lake is being incrementally populated with the most demanded datasets from the DSS portfolio. Data is hosted in Zarr format and Data Lake assets are served to users through a variety of interfaces.

These include dedicated time-series datasets, as well as standard visualization services (WMTS) and interactive applications such as the ERA Explorer and the Thermal Trace.

Datasets are available for visualization in the Wekeo Viewer, e.g. for ERA5-Land at https://wekeo.copernicus.eu/data?view=layers&dataset=EO:ECMWF:DAT:REANALYSIS_ERA5_LAND

Direct access to data cubes using the API token is currently in alpha (test phase). Because this new access mechanism is expected to generate high demand and workload on the hosting infrastructure, datasets will be opened to the public gradually, subject to the outcome of the tests.

Looking ahead, the ARCO-based capabilities across all layers of the DSS infrastructure, from data to software, interfaces and services, will continue to grow, making the ARCO Data Lake and related capabilities a cornerstone component for the evolution of the Service.

Watch this space for further announcements!

ECMWF Support
on behalf of Data Stores Services


Hi Anabelle,

The ARCO initiative is great!

As I understand it, the API currently only accepts requests for a single location. This is already very useful for certain workflows, e.g. analysis over multiple years for a couple of specific locations.

I was wondering whether there is a way to access the underlying Zarr store with the entire data without going through the API, e.g. via an S3-compatible interface, to be able to open the data directly with xarray (using zarr.storage.ObjectStore)?

I assume that the underlying Zarr store is chunked along the time axis, which would allow us to easily do analysis over multiple years, on a global scale. One example: aggregate to daily statistics (mean/max/min), compute annual extrema, and do extreme-value statistics on the resulting data.
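As a rough illustration of that workflow, here is a minimal sketch on a synthetic hourly series for a single point, using only the Python standard library. The synthetic signal, the function names and the choice of a method-of-moments Gumbel fit are assumptions made for the example, not something provided by the DSS.

```python
import math
import statistics
from collections import defaultdict
from datetime import datetime, timedelta


def daily_stats(times, values):
    """Group hourly values by calendar day: {date: (mean, min, max)}."""
    by_day = defaultdict(list)
    for t, v in zip(times, values):
        by_day[t.date()].append(v)
    return {d: (statistics.fmean(vs), min(vs), max(vs)) for d, vs in by_day.items()}


def annual_maxima(daily):
    """Reduce the daily maxima to one maximum per calendar year."""
    by_year = {}
    for day, (_, _, day_max) in daily.items():
        by_year[day.year] = max(by_year.get(day.year, -math.inf), day_max)
    return by_year


def gumbel_return_level(maxima, return_period):
    """Fit a Gumbel distribution by the method of moments and return the
    level exceeded on average once every `return_period` years."""
    beta = statistics.stdev(maxima) * math.sqrt(6) / math.pi   # scale
    mu = statistics.fmean(maxima) - 0.5772156649 * beta        # location
    return mu - beta * math.log(-math.log(1 - 1 / return_period))


# Synthetic hourly series, 2000-2004: seasonal cycle + diurnal cycle + trend.
times, values = [], []
t = datetime(2000, 1, 1)
while t < datetime(2005, 1, 1):
    day_of_year = t.timetuple().tm_yday
    values.append(
        10.0
        + 8.0 * math.sin(2 * math.pi * day_of_year / 365.25)
        + 3.0 * math.sin(2 * math.pi * t.hour / 24)
        + 0.5 * (t.year - 2000)
    )
    times.append(t)
    t += timedelta(hours=1)

daily = daily_stats(times, values)
yearly = annual_maxima(daily)
level_10y = gumbel_return_level(list(yearly.values()), return_period=10)
print(len(daily), sorted(yearly), round(level_10y, 2))
```

With the real data opened in xarray, the first two steps would typically be written with `resample` or `groupby` on the time coordinate, applied across the whole grid rather than one point.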

I realize this is a question applicable more broadly to other sources of data on the CDS, apologies if this is explained somewhere else.

Many thanks,

Alexandre

PS: I found this initiative on AWS Open Data, which partly answers my questions, but I was wondering whether there are other ways to do this that are officially provided by ECMWF.