Reading and regridding from cloud-based datasets

I am trying to build an xarray-zarr cloud-based dataset from ERA5 for training a 1° model, and I am finding that the regridding step of the creation is extremely slow. My understanding is that the anemoi dataset .zarr lives in the cloud, so anemoi-datasets just wraps the steps for handling the data as it is loaded. Is this the correct approach, and if so, is there a way to speed it up, e.g. by using GPUs, since regridding is essentially matrix multiplication? Thank you!

As an MWE, the regridding step of the following command is expected to take 3 hours to process a single month of data for only a small subset of the variables:

anemoi-datasets create recipe.yaml gcp_era5.zarr

where recipe.yaml is the following anemoi-datasets configuration:

dates:
  start: 2020-01-01T00:00
  end: 2020-01-31T23:00
  frequency: 6h

input:
  join:
    - pipe:
      - xarray-zarr:
          url: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg_derived.zarr/
          param:
            - 2m_temperature
      - rename:
          param:
            2m_temperature: 2t
      - regrid:
          method: linear
          in_grid: [0.25, 0.25]
          out_grid: O96
    - pipe:
      - xarray-zarr:
          url: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg_derived.zarr/
          param:
            - temperature
      - rename:
          param:
            temperature: t
            level:
              - 1000
              - 850
              - 500
      - regrid:
          method: linear
          in_grid: [0.25, 0.25]
          out_grid: O96

    - forcings:
        template: ${input.join.0.pipe}
        param:
          - cos_latitude
          - cos_longitude
          - sin_latitude
          - sin_longitude
          - cos_julian_day
          - cos_local_time
          - sin_julian_day
          - sin_local_time
          - insolation

Any advice would be appreciated! For example, should I download all the data first? Which package/framework should I use to regrid TB-scale datasets? Should I be accessing already regridded N320 data from MARS? Thank you!
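
To make the matrix-multiplication view of regridding concrete, here is a minimal sketch in plain Python. It is a toy 1-D linear interpolation, not how anemoi-datasets implements its regrid filter; the grids, field values, and the function names linear_weights and apply_weights are all hypothetical. It shows that the interpolation weights depend only on the two grids, so they can be computed once and reused for every field and date.

```python
from bisect import bisect_right


def linear_weights(src, dst):
    """Return one sparse row [(source_index, weight), ...] per target point.

    `src` must be strictly increasing and every point in `dst` must lie
    inside [src[0], src[-1]].
    """
    rows = []
    for x in dst:
        # Index of the left neighbour, clamped so that i + 1 is always valid.
        i = min(max(bisect_right(src, x) - 1, 0), len(src) - 2)
        t = (x - src[i]) / (src[i + 1] - src[i])
        rows.append([(i, 1.0 - t), (i + 1, t)])
    return rows


def apply_weights(rows, field):
    """Sparse matrix-vector product: one output value per row of weights."""
    return [sum(w * field[j] for j, w in row) for row in rows]


# Toy 1-D grids (hypothetical; not the real 0.25 degree or O96 latitudes).
src_lat = [0.0, 0.25, 0.5, 0.75, 1.0]
dst_lat = [0.1, 0.6, 1.0]

# The weights depend only on the grids, so they are built once.
weights = linear_weights(src_lat, dst_lat)

# Two fields on the source grid, e.g. two dates of one variable.
fields = [
    [10.0, 11.0, 12.0, 13.0, 14.0],
    [0.0, 0.0, 4.0, 0.0, 0.0],
]

# Each field costs one sparse product against the same weights.
regridded = [apply_weights(weights, f) for f in fields]
```

If the weights are rebuilt for every field, that setup cost can dominate the run time; whether that is what happens in the recipe above is not something this sketch can show. With cached weights, the per-field work is a sparse product, which is often limited more by reading the source data than by arithmetic.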

Hi @Julian_Schmitt,

Many thanks for sharing these details, and apologies for the delayed reply.

It may be helpful to know that ECMWF has already made a public Anemoi ERA5 O96 dataset available, which includes the variables you mention and covers January 2020:
https://anemoi.readthedocs.io/projects/training/en/latest/user-guide/download-era5-o96.html

Rather than regridding ARCO ERA5, you could use the public Anemoi O96 dataset directly as the source and create a smaller subset from it (see the anemoi-dataset page of the Anemoi Datasets 0.5.35 documentation). For a one-off subsetting step, an example could be:

dates:
  start: "2020-01-01T00:00:00"
  end: "2020-01-31T18:00:00"
  frequency: 6h

input:
  anemoi-dataset:
    dataset: "https://data.ecmwf.int/anemoi-datasets/era5-o96-1979-2023-6h-v8.zarr"
    param:
      - 2t
      - t_1000
      - t_850
      - t_500

If, however, you expect to create multiple datasets or access the dataset repeatedly for training, please ensure that you copy/download the dataset locally first, as the server hosting the dataset limits the total number of simultaneous connections and repeated heavy access may affect other users.

I hope this helps!

Hi Meghan,

Thanks for the response. We figured out the O96 data as suggested and ultimately used the API to get N320 through the CDS. One thing we noticed when copying via:

anemoi-datasets copy --resume --transfers 10 https://data.ecmwf.int/anemoi-datasets/era5-o96-1979-2023-6h-v8.zarr .

was that the --transfers flag resulted in lots of NaNs (~50% of the dataset) being downloaded without warning, despite a good connection. Since the metadata transferred OK, we didn’t pick up on the issue until some training runs gave bad results and we checked the raw arrays. We reran with the default for transfers and the problem was resolved. Is this possibly worth looking into further? Happy to raise a GitHub issue if you think that would be more useful than posting here.
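
A check along the following lines, run right after the copy, would flag this kind of silent gap before training. It is a minimal sketch in plain Python using toy values; the function names nan_fraction_per_step and suspicious_steps are hypothetical and not part of anemoi-datasets. For a real dataset, each step would be one date read from the copied array, and a vectorised NaN count would replace the inner loop.

```python
import math


def nan_fraction_per_step(steps):
    """Yield (step_index, fraction_of_NaN_values) for each time step.

    `steps` is any iterable of flat sequences of floats, one per date.
    """
    for index, values in enumerate(steps):
        values = list(values)
        missing = sum(1 for v in values if math.isnan(v))
        yield index, (missing / len(values) if values else 0.0)


def suspicious_steps(steps, threshold=0.0):
    """Return the indices of steps whose NaN fraction exceeds `threshold`."""
    return [i for i, frac in nan_fraction_per_step(steps) if frac > threshold]


# Toy data standing in for three dates of a copied dataset (hypothetical).
nan = float("nan")
copied = [
    [280.1, 281.4, 279.9, 280.0],  # complete
    [nan, nan, nan, nan],          # a chunk that never arrived
    [280.3, nan, 280.2, 280.5],    # partially missing
]

fractions = list(nan_fraction_per_step(copied))
bad = suspicious_steps(copied)
```

Some variables can legitimately contain NaNs, so the threshold is best chosen per variable, ideally by comparing against the NaN fraction of the same dates in the source.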

Thank you,
Julian