How to extract data from NetCDF to CSV using python?

Marian_Dobranschi · 27 January 2022 22:37

Dear ECMWF,

I find Copernicus data very important, however the NetCDF files are very difficult to work with. I've been trying different scripts in python and R to extract climate data from NetCDF files into CSV for further analysis, but it seems a very difficult task.

I would like to ask for your help.

I need to extract climate data (precipitation, average temperature, etc.) for European countries using the data provided by Copernicus - "Temperature and precipitation climate impact indicators from 1970 to 2100 derived from European climate projections ". Ideally I would like to have a panel data set, with columns per year, per country, and then separate columns for each climate variable.

I've tried the python script recommended by this page - How to convert NetCDF to CSV

This is the code used:

#this is for reading the .nc in the working folder
import glob
#this is reaquired ti read the netCDF4 data
from netCDF4 import Dataset 
#required to read and write the csv files
import pandas as pd
#required for using the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset(‘prAdjust_tmean.nc’)


This is how data looks - contents

print(data)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
    frequency: year
    CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
    creation_date: 2020-02-12T15:00:49ZCET+0100
    Conventions: CF-1.6
    institution_url: www.smhi.se
    invar_platform_id: -
    invar_rcm_model_driver: MPI-M-MPI-ESM-LR
    time_coverage_start: 1971
    time_coverage_end: 2000
    domain: EUR-11
    geospatial_lat_min: 23.942343
    geospatial_lat_max: 72.641624
    geospatial_lat_resolution: 0.04268074 degree
    geospatial_lon_min: -35.034023
    geospatial_lon_max: 73.937675
    geospatial_lon_resolution: 0.009246826 degree
    geospatial_bounds: -
    NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
    acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
    contact: Hydro.fou@smhi.se
    keywords: precipitation
    license: Copernicus License V1.2
    output_frequency: 30 year average value
    summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
    comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf,  https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
    history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
    invar_bc_institution: Swedish Meteorological and Hydrological Institute
    invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
    invar_bc_method_id: TimescaleBC v1.02
    invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
    invar_bc_observation_id: EFAS-Meteo
    invar_bc_period: 1990-2018
    data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
    institution: SMHI
    project_id: C3S_424_SMHI
    references: 
    source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
    invar_experiment_id: rcp45
    invar_realisation_id: r1i1p1
    invar_rcm_model_id: MPI-CSC-REMO2009-v1
    variable_name: prAdjust_tmean
    dimensions(sizes): x(1000), y(950), time(1), bnds(2)
    variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
    groups:

After that I extract the needed variable:

t2m = data.variables['prAdjust_tmean']

Get dimensions assuming 3D: time, latitude, longitude

time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = ‘./’


And the Error:

OverflowError                             Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
      2 time_dim, lat_dim, lon_dim = t2m.get_dims()
      3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
      5 latitudes = data.variables[lat_dim.name][:]
      6 longitudes = data.variables[lon_dim.name][:]

C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
    509     if tz is None:
    510         tz = _get_rc_timezone()
--> 511     return _from_ordinalf_np_vectorized(x, tz).tolist()
    512 
    513 

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
   2106             vargs.extend([kwargs[_n] for _n in names])
   2107 
-> 2108         return self._vectorize_call(func=func, args=vargs)
   2109 
   2110     def _get_ufunc_and_otypes(self, func, args):

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
   2190                       for a in args]
   2191 
-> 2192             outputs = ufunc(*inputs)
   2193 
   2194             if ufunc.nout == 1:

C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
    329 
    330     dt = (np.datetime64(get_epoch()) +
--> 331           np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
    332     if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
    333         raise ValueError(f'Date ordinal {x} converts to {dt} (using '

OverflowError: int too big to convert






And this is the last part of the script:

import os
 
# Path
path = "/home"
 
# Join various path components
print(os.path.join(path, "User/Desktop", "file.txt"))
 
 
# Path
path = "User/Documents"
 
# Join various path components
print(os.path.join(path, "/home", "file.txt"))

filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
    x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]
df = pd.DataFrame({
    'time': [t.isoformat() for t in times_grid],
    'latitude': latitudes_grid,
    'longitude': longitudes_grid,
    't2m': t2m[:].flatten()})
df.to_csv(filename, index=False)

print('Done')

`Once again, I would like to stress that I need to extract from nc file these specific columns: year (time period), longitude, latitude and the climate variables (temperature, precipitations, etc.). I hope you can help me in this task.`

Many thanks for your time and help.

Best regards, Marian

Milana_Vuckovic · 28 January 2022 09:17

Hi Marian,

You can do what you need using Python libraries pandas and xarray.

So something like this:

import xarray as xr
import pandas as pd
#read the data
dsc = xr.open_dataset('prAdjust_tmean.nc')
#select one variable and convert to pandas dataframe
df_tmean = ds.prAdjust_tmean.to_dataframe()
#reset index so lat/lon/time are not indexes any more
df_tmean =  df_tmean .reset_index()
# write to csv with whatever options you like
df_tmean .to_csv('new_file.csv', index=False, sep='\t',float_format = '%.1f')

If pandas complain your data is too big, try to work with smaller chunk until you manage to do what you need and then try to scale.

If you want to learn more, you might find these resources useful:
https://gitlab.com/eu_hidalgo/ecmwf_enccs_training

I hope this helps!

Milana

Marian_Dobranschi · 28 January 2022 14:02

Dear Milana,

I've tried your script. It works to extract some data, but it fails to extract yearly data. The output is just for one year, which is not helpful at all.

I need average temp for a given lat and long for every year that the nc document stores. It has data from 1971-2000, but the output from your script manages to extract only data for 2000. This issue gets more and more complicated, and the netCSF files are impossible to work with. So climate data are needed in non-spatial format, non-array to be included in regression analysis as a standalone variables. These variables should be time series, meaning yearly (or monthly, or daily) observations per country. Array data cannot be used in multi-variate regressions, because time series should be similar, for example GDP per year per country and precipitations per year per country should be comparable in terms of form, shape, length. It goes against the practice and rationale of statistical softwares to combine non-array and array data, that's why we need a method to extract array data in a non-array format in order to work in other way than plotting (which array data is best for).