Working with global high-resolution data

A unique feature of xarray is that it works well with very large datasets. This is crucial for global high-resolution models like GCHP, which can produce gigabytes of data for even a single variable. The data often exceed a computer's memory, so IDL/MATLAB programs will simply die if you try to read such large files.

Lazy evaluation

Here we use a native resolution GEOS-FP metfield as an example. You can download it at:

ftp://ftp.as.harvard.edu/gcgrid/GEOS_0.25x0.3125/GEOS_0.25x0.3125.d/GEOS_FP/2015/07/GEOSFP.20150701.A3cld.Native.nc
In [1]:
%%bash
du -h ./GEOSFP.20150701.A3cld.Native.nc
1.4G    ./GEOSFP.20150701.A3cld.Native.nc

The file size is 1.4 GB – not extremely large. But that’s already after NetCDF compression! The raw data size (after reading into memory) would be more than 20 GB, which exceeds most computers’ memory.
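We can sanity-check that "more than 20 GB" claim with some quick arithmetic, using the dimensions reported in the dataset summary shown later in this notebook (7 float64 variables on an 8 × 72 × 721 × 1152 grid):

```python
# Rough estimate of the uncompressed data size, from the file's
# dimensions: 7 float64 variables, each on an 8 x 72 x 721 x 1152 grid,
# at 8 bytes per value.
nvars, ntime, nlev, nlat, nlon = 7, 8, 72, 721, 1152
raw_bytes = nvars * ntime * nlev * nlat * nlon * 8
print(raw_bytes / 1e9, 'GB')  # ~26.8 GB, vs. 1.4 GB compressed on disk
```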

Let’s see how long it takes to read such a file with xarray:

In [2]:
import xarray as xr
%time ds = xr.open_dataset("./GEOSFP.20150701.A3cld.Native.nc")
CPU times: user 19.8 ms, sys: 4.65 ms, total: 24.4 ms
Wall time: 25.9 ms

Wait, just milliseconds?

It looks like we do get the entire file:

In [3]:
ds
Out[3]:
<xarray.Dataset>
Dimensions:   (lat: 721, lev: 72, lon: 1152, time: 8)
Coordinates:
  * time      (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lev       (lev) float32 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * lat       (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon       (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Data variables:
    CLOUD     (time, lev, lat, lon) float64 0.1229 0.1229 0.1229 0.1229 ...
    OPTDEPTH  (time, lev, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    QCCU      (time, lev, lat, lon) float64 9.969e+36 9.969e+36 9.969e+36 ...
    QI        (time, lev, lat, lon) float64 8.382e-08 8.382e-08 8.382e-08 ...
    QL        (time, lev, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    TAUCLI    (time, lev, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    TAUCLW    (time, lev, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
Attributes:
    Title:                 GEOS-FP time-averaged 3-hour cloud parameters (A3c...
    Contact:               GEOS-Chem Support Team (geos-chem-support@as.harva...
    References:            www.geos-chem.org; wiki.geos-chem.org
    Filename:              GEOSFP.20150701.A3cld.Native.nc
    History:               File generated on: 2015/11/29 15:02:54 GMT-0400
    ProductionDateTime:    File generated on: 2015/11/29 15:02:54 GMT-0400
    ModificationDateTime:  File generated on: 2015/11/29 15:02:54 GMT-0400
    Format:                NetCDF-4
    SpatialCoverage:       global
    Conventions:           COARDS
    Version:               GEOS-FP
    Model:                 GEOS-5
    Nlayers:               72
    Start_Date:            20150701
    Start_Time:            00:00:00.0
    End_Date:              20150701
    End_Time:              23:59:59.99999
    Delta_Time:            030000
    Delta_Lon:             0.312500
    Delta_Lat:             0.250000

But that’s deceptive: xarray only reads the metadata (dimensions, coordinates, and attributes), which is why it is lightning fast. It will only read the actual numerical data if you explicitly execute ds.load(). Never try that command on data this large, or your program will simply die.

xarray will automatically retrieve the data from disk only when they are actually needed. In other words, it tries to keep the memory usage as low as possible. We call this lazy evaluation.
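The idea behind lazy evaluation can be illustrated with a toy class (this is just a sketch of the concept, not xarray's actual machinery): the expensive read happens only when the data is first requested, and the result is cached.

```python
# A toy illustration of lazy evaluation (NOT xarray's real implementation):
# the "loader" function is only called when the data is actually requested.
class LazyArray:
    def __init__(self, loader):
        self._loader = loader   # function that performs the expensive read
        self._data = None       # nothing in memory yet

    def load(self):
        if self._data is None:           # read from "disk" only once,
            self._data = self._loader()  # and only when asked
        return self._data

reads = []
lazy = LazyArray(lambda: reads.append('disk read') or [1, 2, 3])
print(len(reads))   # 0 -- creating the object touched no data
lazy.load()
print(len(reads))   # 1 -- data read on first access
lazy.load()
print(len(reads))   # 1 -- cached, no second read
```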

Any operation that doesn’t require the actual data values will be lightning fast, such as selecting a variable:

In [4]:
dr = ds['CLOUD'] # super fast
dr
Out[4]:
<xarray.DataArray 'CLOUD' (time: 8, lev: 72, lat: 721, lon: 1152)>
[478420992 values with dtype=float64]
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lev      (lev) float32 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Attributes:
    long_name:       Total cloud fraction in grid box
    units:           1
    gamap_category:  GMAO-3D$

The same goes for indexing and other array operations, as long as you don’t explicitly ask for the data values:

In [5]:
dr_surf = dr.isel(lev=0) # super fast
dr_surf
Out[5]:
<xarray.DataArray 'CLOUD' (time: 8, lat: 721, lon: 1152)>
[6644736 values with dtype=float64]
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
    lev      float32 1.0
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Attributes:
    long_name:       Total cloud fraction in grid box
    units:           1
    gamap_category:  GMAO-3D$

This dr_surf object points to the surface level of the CLOUD data. This subset is small enough to read into memory. Even a single level takes about 2 seconds to read, so you can imagine how long the full dataset would take.

(Note that the NetCDF format allows you to read just a subset of the data into memory.)

In [6]:
%time dr_surf.load()
CPU times: user 1.87 s, sys: 132 ms, total: 2.01 s
Wall time: 2.01 s
Out[6]:
<xarray.DataArray 'CLOUD' (time: 8, lat: 721, lon: 1152)>
array([[[ 0.122925,  0.122925, ...,  0.122925,  0.122925],
        [ 0.289062,  0.290039, ...,  0.287598,  0.288574],
        ...,
        [ 0.      ,  0.      , ...,  0.      ,  0.      ],
        [ 0.      ,  0.      , ...,  0.      ,  0.      ]],

       [[ 0.151123,  0.151123, ...,  0.151123,  0.151123],
        [ 0.27002 ,  0.269043, ...,  0.272461,  0.271484],
        ...,
        [ 0.      ,  0.      , ...,  0.      ,  0.      ],
        [ 0.      ,  0.      , ...,  0.      ,  0.      ]],

       ...,
       [[ 0.211426,  0.211426, ...,  0.211426,  0.211426],
        [ 0.106689,  0.106567, ...,  0.106934,  0.106812],
        ...,
        [ 0.      ,  0.      , ...,  0.      ,  0.      ],
        [ 0.      ,  0.      , ...,  0.      ,  0.      ]],

       [[ 0.548828,  0.548828, ...,  0.548828,  0.548828],
        [ 0.624023,  0.625   , ...,  0.623047,  0.623047],
        ...,
        [ 0.      ,  0.      , ...,  0.      ,  0.      ],
        [ 0.      ,  0.      , ...,  0.      ,  0.      ]]])
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
    lev      float32 1.0
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Attributes:
    long_name:       Total cloud fraction in grid box
    units:           1
    gamap_category:  GMAO-3D$

Since the data is now in memory, we can plot it. In fact, you don’t need to run dr_surf.load() before plotting. xarray will automatically read the data when you try to plot, because a plotting operation requires the actual data values.

In [7]:
%matplotlib inline
dr_surf.isel(time=0).plot(cmap='Blues_r')
Out[7]:
<matplotlib.collections.QuadMesh at 0x1238efe10>
[Figure: global map of surface-level cloud fraction at the first time step]

More explanation on data size

So how much data is actually read into memory? Just about 53 MB.

In [8]:
print(dr_surf.nbytes / 1e6, 'MB')
53.157888 MB

An equivalent calculation:

In [9]:
print(8*721*1152 * 8 / 1e6, 'MB') # ntime*nlon*nlat * 8 byte float
53.157888 MB

What’s the size of the entire CLOUD data, with all 72 levels?

In [10]:
print(dr.nbytes / 1e9, 'GB')
3.827367936 GB

An equivalent calculation:

In [11]:
print(8*721*1152*72 * 8 / 1e9, 'GB') # ntime*nlon*nlat*nlev * 8 byte float
3.827367936 GB
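These size calculations generalize to any array shape and dtype. Here is a small helper (a convenience sketch, not part of xarray; the function name is our own) that estimates the in-memory size without ever allocating the array:

```python
import numpy as np

def estimate_nbytes(shape, dtype='float64'):
    """Estimate the in-memory size of an array, without creating it."""
    n_values = np.prod(shape, dtype=np.int64)  # total number of elements
    return int(n_values * np.dtype(dtype).itemsize)

# Full CLOUD field: (time, lev, lat, lon) in float64
print(estimate_nbytes((8, 72, 721, 1152)) / 1e9, 'GB')  # 3.827367936 GB
```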

Even when memory is sufficient, reading that much data takes a very long time!

Finally, what’s the size of the entire NetCDF file?

In [12]:
print(ds.nbytes / 1e9, 'GB')
26.791583396 GB

That size would kill almost any laptop. But thanks to lazy evaluation, we have no problem with it.

Out-of-core data processing

Sometimes you do need to process the entire dataset (e.g. take a global average), not just select a subset. Fortunately, xarray can still handle this, thanks to out-of-core computing.

We know our NetCDF data has 8 time slices, so let’s divide it into 8 pieces, each small enough to fit in memory. (See the xarray documentation for more about chunking.)

In [13]:
ds = ds.chunk({'time': 1})

Now our DataArray is backed by a dask array with a chunksize:

In [14]:
dr = ds['CLOUD']
dr
Out[14]:
<xarray.DataArray 'CLOUD' (time: 8, lev: 72, lat: 721, lon: 1152)>
dask.array<xarray-CLOUD, shape=(8, 72, 721, 1152), dtype=float64, chunksize=(1, 72, 721, 1152)>
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lev      (lev) float32 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Attributes:
    long_name:       Total cloud fraction in grid box
    units:           1
    gamap_category:  GMAO-3D$

We want to take the column (vertical) average over the entire data field. The following command is still lazily evaluated; no actual computation happens yet.

In [15]:
dr.mean(dim='lev')
Out[15]:
<xarray.DataArray 'CLOUD' (time: 8, lat: 721, lon: 1152)>
dask.array<mean_agg-aggregate, shape=(8, 721, 1152), dtype=float64, chunksize=(1, 721, 1152)>
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...

Call dr.mean(dim='lev').compute() when you actually need the result. xarray will then process each piece (“chunk”) one by one and put the results together.
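The chunk-by-chunk strategy can be sketched in plain NumPy (an illustration of the idea only; a small random array stands in for CLOUD, and each time slice plays the role of one chunk):

```python
import numpy as np

# Small stand-in for the CLOUD field: (time, lev, lat, lon)
data = np.random.rand(8, 72, 10, 10)

# Process one time slice at a time, so only ~one chunk is ever
# "in memory" at once, then assemble the per-chunk results.
chunks = [data[t] for t in range(data.shape[0])]        # one (lev, lat, lon) chunk per time
results = [chunk.mean(axis=0) for chunk in chunks]      # lev-mean of each chunk
lev_mean = np.stack(results)                            # put the pieces back together

# The assembled result matches the all-at-once computation
assert np.allclose(lev_mean, data.mean(axis=1))
```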

Because xarray uses dask under the hood, we can use dask’s ProgressBar to monitor the progress; otherwise the computation just looks stuck. (The with statement below is a context manager. Just think of it as activating temporary functionality inside the with block. For example, the popular visualization package seaborn uses with to set a temporary style.)

In [16]:
from dask.diagnostics import ProgressBar

with ProgressBar():
    dr_levmean = dr.mean(dim='lev').compute()
[########################################] | 100% Completed | 17.9s

It takes about 20 seconds, which is somewhat time-consuming. But this calculation is simply not possible with traditional tools like IDL/MATLAB.

Now we see the actual numerical data:

In [17]:
dr_levmean
Out[17]:
<xarray.DataArray 'CLOUD' (time: 8, lat: 721, lon: 1152)>
array([[[ 0.098929,  0.098929, ...,  0.098929,  0.098929],
        [ 0.116195,  0.116132, ...,  0.116353,  0.116277],
        ...,
        [ 0.027914,  0.027806, ...,  0.028132,  0.028024],
        [ 0.02229 ,  0.02229 , ...,  0.02229 ,  0.02229 ]],

       [[ 0.103683,  0.103683, ...,  0.103683,  0.103683],
        [ 0.114203,  0.114143, ...,  0.11426 ,  0.114244],
        ...,
        [ 0.026128,  0.026203, ...,  0.026008,  0.026071],
        [ 0.016974,  0.016974, ...,  0.016974,  0.016974]],

       ...,
       [[ 0.07467 ,  0.07467 , ...,  0.07467 ,  0.07467 ],
        [ 0.072274,  0.072266, ...,  0.07231 ,  0.072309],
        ...,
        [ 0.006045,  0.006052, ...,  0.00604 ,  0.006046],
        [ 0.008976,  0.008976, ...,  0.008976,  0.008976]],

       [[ 0.069148,  0.069148, ...,  0.069148,  0.069148],
        [ 0.070722,  0.070705, ...,  0.070723,  0.070725],
        ...,
        [ 0.002394,  0.002377, ...,  0.002432,  0.002411],
        [ 0.003763,  0.003763, ...,  0.003763,  0.003763]]])
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...

You can use dr_levmean.to_netcdf("filename.nc") to save the result to disk.

We can further simplify the above process. Instead of using .compute() to get an in-memory result, we can simply do a “disk to disk” operation.

In [18]:
with ProgressBar():
    dr.mean(dim='lev').to_netcdf('column_average.nc')
[########################################] | 100% Completed | 19.4s

xarray automatically reads each piece (chunk) of data into memory, computes the mean, dumps the result to disk, and proceeds to the next piece.
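The same "disk to disk" pattern can be sketched with a NumPy memory-mapped output file (again just an illustration of the idea, with a small random array and a hypothetical output path standing in for the real data): each chunk's result is written straight to disk, so the full result never sits in memory at once.

```python
import os
import tempfile
import numpy as np

# Small stand-in for the CLOUD field: (time, lev, lat, lon)
data = np.random.rand(8, 72, 10, 10)

# Pre-allocate the result array on disk (.npy file), then fill it
# one chunk at a time.
out_path = os.path.join(tempfile.mkdtemp(), 'column_average.npy')
out = np.lib.format.open_memmap(out_path, mode='w+',
                                dtype='float64', shape=(8, 10, 10))
for t in range(8):
    out[t] = data[t].mean(axis=0)   # compute one chunk, write it to disk
out.flush()

# Reading the file back gives the same answer as computing in memory
assert np.allclose(np.load(out_path), data.mean(axis=1))
```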

Open the file to double check:

In [19]:
xr.open_dataset('column_average.nc')
Out[19]:
<xarray.Dataset>
Dimensions:  (lat: 721, lon: 1152, time: 8)
Coordinates:
  * time     (time) datetime64[ns] 2015-07-01T01:30:00 2015-07-01T04:30:00 ...
  * lat      (lat) float32 -89.9375 -89.75 -89.5 -89.25 -89.0 -88.75 -88.5 ...
  * lon      (lon) float32 -180.0 -179.688 -179.375 -179.062 -178.75 ...
Data variables:
    CLOUD    (time, lat, lon) float64 0.09893 0.09893 0.09893 0.09893 ...

A compact version of the previous section

Everything we did in the previous section can be condensed into 2 lines of code:

In [20]:
ds = xr.open_dataset("./GEOSFP.20150701.A3cld.Native.nc", chunks={'time': 1})
ds['CLOUD'].mean(dim='lev').to_netcdf('column_average_CLOUD.nc')

That’s it! You’ve just solved a geoscience big data problem without worrying about any specific “big data” techniques!