Multidimensional data analysis library xarray

This article introduces xarray, a Python library that supports multidimensional data analysis. For details, please refer to the official documentation.

Features of xarray

Background

Scientific measurement data is often multidimensional. For example, when time series are recorded by sensors installed at multiple positions, the measurement data is two-dimensional: spatial channel × time. If a short-time Fourier transform is then applied to the data, it becomes three-dimensional: spatial channel × time × frequency.

Such data is commonly handled with numpy's np.ndarray. However, since np.ndarray is just a plain matrix (or tensor), any other information has to be kept separately. In the example above,

Dimension order: the first dimension of the two-dimensional data corresponds to the spatial channel, and the second dimension corresponds to time.
Coordinates of each dimension

and so on are the "other information" meant here.

Therefore, if you want to cut out a certain time range from the data, for example, you have to cut out the corresponding part of the time axis at the same time as the data itself.

Of course, this can be done correctly with a plain np.ndarray, but in a complicated program such bookkeeping can become a source of mistakes.
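The bookkeeping described above can be sketched with plain numpy as follows. The array contents and the names data_np, x_axis, t_axis and mask are made up for illustration; the point is only that the data and its time axis have to be cut separately and kept in sync by hand.

```python
import numpy as np

# Hypothetical measurement: 5 spatial channels x 4 time samples
data_np = np.arange(20, dtype=float).reshape(5, 4)
x_axis = np.linspace(0.0, 1.0, 5)   # positions of the channels
t_axis = np.linspace(0.0, 2.0, 4)   # times of the samples

# Select the samples with 0.5 <= t <= 2.0
mask = (t_axis >= 0.5) & (t_axis <= 2.0)
data_cut = data_np[:, mask]   # cut the data along the second dimension ...
t_cut = t_axis[mask]          # ... and remember to cut the time axis too

print(data_cut.shape)
print(t_cut)
```

Nothing ties data_cut to t_cut except the programmer's discipline, which is the gap xarray is meant to close.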

xarray is a library that simplifies such operations. (Since it uses np.ndarray internally, the high computing performance of np.ndarray is hardly sacrificed.) pandas is a well-known library for handling one-dimensional data, but it cannot (easily) handle multidimensional data. xarray fills that gap.

In addition to the above,

The __str__ method is overridden, so printing an object gives an overview of it.
Indexing and slicing by position along an axis (for example, finding the data closest to a certain time) are possible. The result is also an xarray object, which holds the correct axis information.
Simple statistical processing (moving average, etc.) is possible, again with the axis information kept correctly.
Conversion to and from pandas is possible.
It appears to support huge data that does not fit in memory.

and so on.
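Two of the features listed above, the moving average and the conversion to and from pandas, can be sketched as follows. The signal values and the name signal are invented for illustration, and the calls (rolling, to_series, xr.DataArray.from_series) are assumed to behave as in recent xarray releases rather than specifically in the version used in this article.

```python
import numpy as np
import xarray as xr

# Hypothetical one-dimensional signal sampled at t = 0, 1, 2, 3, 4
t_axis = np.arange(5, dtype=float)
signal = xr.DataArray(np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
                      dims=['t'], coords={'t': t_axis}, name='signal')

# Centered moving average over 3 samples; the t axis is kept,
# and the two edge samples become NaN because the window is incomplete
smoothed = signal.rolling(t=3, center=True).mean()

# Round trip through pandas: the t coordinate becomes the Series index
series = signal.to_series()
back = xr.DataArray.from_series(series)

print(smoothed)
print(series)
```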

The version used in this article is as follows.

import numpy as np
import xarray as xr
xr.__version__

'0.9.1'

xarray is commonly imported under the abbreviation xr.

Data types

It mainly supports two data types, xr.DataArray and xr.Dataset.

xr.DataArray

xr.DataArray is the multidimensional data mentioned above. Internally it has coords, an ordered dictionary that pairs axis labels with axis values, and attrs, an ordered dictionary that stores other information.

Since the __getitem__ method is implemented, elements can be accessed like da[i, j], just as with np.ndarray. However, the return value is also an xr.DataArray object, so it inherits the axis information and so on.

xr.Dataset

An object that holds multiple xr.DataArrays. It can have multiple axes, and it keeps track of which axes each piece of data corresponds to.

It can be accessed like a dictionary. For example, for an xr.Dataset named data that holds temperature T and density N, data['T'] returns the temperature T as an xr.DataArray.
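The two data types described above can be sketched together as follows. The temperature and density values, the units stored in attrs, and the variable names are all invented for illustration.

```python
import numpy as np
import xarray as xr

t_axis = np.linspace(0.0, 2.0, 4)

# Two hypothetical quantities sharing the same time axis;
# attrs holds free-form extra information such as units
T = xr.DataArray(np.array([300.0, 310.0, 320.0, 330.0]),
                 dims=['t'], coords={'t': t_axis}, attrs={'units': 'K'})
N = xr.DataArray(np.array([1.0, 1.1, 1.2, 1.3]),
                 dims=['t'], coords={'t': t_axis}, attrs={'units': 'kg/m^3'})

# A Dataset bundles them and is accessed like a dictionary
data = xr.Dataset({'T': T, 'N': N})
temperature = data['T']          # an xr.DataArray again

# Integer indexing returns a DataArray that still knows its t value
second_sample = T[1]
print(temperature.attrs['units'])
print(second_sample)
```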

How to use xr.DataArray

It plays a role similar to Series in pandas. It holds the data values themselves together with the axis data.

Instantiation

data = xr.DataArray(np.random.randn(2, 3))

This creates a 2x3 xr.DataArray object with no axis information.

You can view a summary of the created object with print.

print(data)

<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 0.32853 , -1.010702,  1.220686],
       [ 0.877681,  1.180265, -0.963936]])
Dimensions without coordinates: dim_0, dim_1

If you do not specify the axes explicitly, as in this case, the labels dim_0 and dim_1 are assigned automatically.

For example, consider the case where the first dimension of some data data_np corresponds to the spatial position x and the second dimension corresponds to the time t.

#Example data
data_np = np.random.randn(5,4)
x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)

data = xr.DataArray(data_np, dims=['x','t'], 
                    coords={'x':x_axis, 't':t_axis}, 
                    name='some_measurement')

In the call above:

In dims, list (or tuple) the labels corresponding to each dimension of data_np.
In coords, give a dictionary that maps each axis label to the corresponding axis data.

print(data)

<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 1.089975,  0.343039, -0.521509,  0.02816 ],
       [ 1.117389,  0.589563, -1.030908, -0.023295],
       [ 0.403413, -0.157136, -0.175684, -0.743779],
       [ 0.814334,  0.164875, -0.489531, -0.335251],
       [ 0.009115,  0.294526,  0.693384, -1.046416]])
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0

In the displayed summary, the line

<xarray.DataArray 'some_measurement' (x: 5, t: 4)>

indicates that this DataArray is a 5x4 matrix named some_measurement, whose first dimension has the axis label 'x' and whose second dimension has the axis label 't'.

Also, the lines following

Coordinates:

list the axis data.

Axis information

The list of axes can be accessed through dims. The order shown here indicates which dimension of the underlying data each axis corresponds to.

data.dims

('x', 't')

To access the values of an axis, pass its label name inside [].

data['x']

<xarray.DataArray 'x' (x: 5)>
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
Coordinates:
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
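As a small sketch of how the axis information can be used in ordinary numpy code, the example below extracts the axis values as a plain np.ndarray and looks up the position of a named axis. The zero-filled data is a placeholder, and get_axis_num is used here on the assumption that it is available in the installed xarray version.

```python
import numpy as np
import xarray as xr

x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)
data = xr.DataArray(np.zeros((5, 4)), dims=['x', 't'],
                    coords={'x': x_axis, 't': t_axis},
                    name='some_measurement')

# The axis itself is a DataArray; .values gives the underlying np.ndarray
x_values = data['x'].values

# Position of the 't' axis in the underlying array (0-based)
axis_of_t = data.get_axis_num('t')

print(x_values)
print(axis_of_t)
```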

Indexing

xarray supports several types of indexing. Since it uses the indexing machinery of pandas, it is about as fast as pandas.

NumPy-like access

data[0,1]

<xarray.DataArray 'some_measurement' ()>
array(0.3430393695918721)
Coordinates:
    t        float64 0.6667
    x        float64 0.0

Since it is array-like, it can be accessed like a normal matrix. The axis information is carried over to the result.

Positional indexing

With .loc, you can access the data by specifying positions along the axis data (that is, axis values rather than integer indices).

data.loc[0:0.5, :1.0]

<xarray.DataArray 'some_measurement' (x: 3, t: 2)>
array([[ 1.089975,  0.343039],
       [ 1.117389,  0.589563],
       [ 0.403413, -0.157136]])
Coordinates:
  * t        (t) float64 0.0 0.6667
  * x        (x) float64 0.0 0.25 0.5

.loc[0:0.5, :1.0] cuts out the data in the range 0 ≤ x ≤ 0.5 along the axis of the first dimension and t ≤ 1.0 along the axis of the second dimension. Unlike integer slicing, both endpoints of the range are included.
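The difference between slicing by axis value and slicing by integer index can be checked with a short sketch. The data values here are a simple arange chosen for illustration, not the random data shown above.

```python
import numpy as np
import xarray as xr

x_axis = np.linspace(0.0, 1.0, 5)   # 0.0, 0.25, 0.5, 0.75, 1.0
t_axis = np.linspace(0.0, 2.0, 4)   # 0.0, 0.667, 1.333, 2.0
data = xr.DataArray(np.arange(20.0).reshape(5, 4), dims=['x', 't'],
                    coords={'x': x_axis, 't': t_axis})

# Slicing by axis value: the end point x = 0.5 is included
by_value = data.loc[0:0.5, :1.0]

# The equivalent integer slices: the end index is excluded as in numpy
by_index = data[0:3, 0:2]

print(by_value.shape)
print(by_value.equals(by_index))
```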

Access with axis label name

Use the .isel and .sel methods for access with an axis label name.

.isel specifies the axis label and its index as an integer.

data.isel(t=1)

<xarray.DataArray 'some_measurement' (x: 5)>
array([ 0.343039,  0.589563, -0.157136,  0.164875,  0.294526])
Coordinates:
    t        float64 0.6667
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0

.sel specifies the axis label and its axis value.

data.sel(t=slice(0.5,2.0))

<xarray.DataArray 'some_measurement' (x: 5, t: 3)>
array([[ 0.343039, -0.521509,  0.02816 ],
       [ 0.589563, -1.030908, -0.023295],
       [-0.157136, -0.175684, -0.743779],
       [ 0.164875, -0.489531, -0.335251],
       [ 0.294526,  0.693384, -1.046416]])
Coordinates:
  * t        (t) float64 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
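The relation between .isel and .sel can be illustrated with a small sketch. The arange data is made up for illustration; method='nearest' is the same option that is described later for xr.Dataset, and it is assumed here to work the same way on xr.DataArray.

```python
import numpy as np
import xarray as xr

x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)
data = xr.DataArray(np.arange(20.0).reshape(5, 4), dims=['x', 't'],
                    coords={'x': x_axis, 't': t_axis},
                    name='some_measurement')

# Integer index 1 along t ...
by_index = data.isel(t=1)
# ... is the sample whose t value (about 0.667) is closest to 0.7
by_value = data.sel(t=0.7, method='nearest')

# Several axes can be given at once: an exact x value and a t range
window = data.sel(x=0.25, t=slice(0.5, 2.0))

print(by_index.equals(by_value))
print(window.values)
```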

Calculation

Many np.ndarray-like operations are supported.

Basic arithmetic, including broadcasting, is supported.

data+10

<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 11.089975,  10.343039,   9.478491,  10.02816 ],
       [ 11.117389,  10.589563,   8.969092,   9.976705],
       [ 10.403413,   9.842864,   9.824316,   9.256221],
       [ 10.814334,  10.164875,   9.510469,   9.664749],
       [ 10.009115,  10.294526,  10.693384,   8.953584]])
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0

Element-wise calculations also inherit the axis information.

np.sin(data)

<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 0.886616,  0.336351, -0.498189,  0.028156],
       [ 0.89896 ,  0.555998, -0.857766, -0.023293],
       [ 0.39256 , -0.15649 , -0.174781, -0.677074],
       [ 0.727269,  0.164129, -0.470212, -0.329006],
       [ 0.009114,  0.290286,  0.639144, -0.865635]])
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
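As a sketch of what broadcasting means for labeled data, the example below adds an array that depends only on x to one that depends only on t. In xarray the dimensions are matched by name rather than by position, so the result is two-dimensional. The values and the names x_part and t_part are invented for illustration.

```python
import numpy as np
import xarray as xr

# One array depends only on x, the other only on t
x_part = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims=['x'],
                      coords={'x': [0.0, 0.5, 1.0]})
t_part = xr.DataArray(np.array([10.0, 20.0]), dims=['t'],
                      coords={'t': [0.0, 1.0]})

# Broadcasting by dimension name: the result has dims ('x', 't')
total = x_part + t_part

# Reductions take the axis label instead of an axis number
mean_over_t = total.mean(dim='t')

print(total)
print(mean_over_t)
```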

How to use xr.Dataset

xr.Dataset is an object that is a collection of multiple xr.DataArrays.

In particular, xr.DataArrays that share an axis can be indexed and sliced all at once. A single measuring instrument may output several kinds of signals, and xr.Dataset is well suited to handling such multidimensional information.

This is a role similar to DataFrame in pandas.

Instantiation

The first argument, data_vars, is dict-like. Use the name of each piece of data as the key and a two-element tuple as the value. The first element of the tuple is the axis labels corresponding to that data, and the second element is the data itself (array-like).

Pass a dict-like object to coords to store the axis data. Use the axis label as the key and the axis values as the value.

ds = xr.Dataset({'data1': (['x','t'], np.random.randn(5,4)), 'data2': (['x','t'], np.random.randn(5,4))}, 
                coords={'x': x_axis, 't': t_axis})

ds

<xarray.Dataset>
Dimensions:  (t: 4, x: 5)
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
Data variables:
    data1    (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
    data2    (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...

To access the contents, pass the label name inside []. In that case, the return value will be xr.DataArray.

ds['data1']

<xarray.DataArray 'data1' (x: 5, t: 4)>
array([[ -1.091230e+00,  -1.851416e+00,   3.429677e-01,   2.077113e+00],
       [  1.476765e+00,   9.389425e-04,   1.358136e+00,  -1.627471e+00],
       [ -2.007550e-01,   1.008126e-01,   7.177067e-01,   8.893402e-01],
       [ -1.813395e-01,  -3.407015e-01,  -9.673550e-01,   1.135727e+00],
       [  2.423873e-01,  -1.198268e+00,   1.650465e+00,  -1.923102e-01]])
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0

You can also access the axes by label.

ds['x']

<xarray.DataArray 'x' (x: 5)>
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
Coordinates:
  * x        (x) float64 0.0 0.25 0.5 0.75 1.0
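Because an xr.Dataset behaves like a dictionary, the stored names can be listed and new entries can be assigned. The sketch below uses constant arrays instead of random data, and the derived variable name total is hypothetical.

```python
import numpy as np
import xarray as xr

x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)
ds = xr.Dataset(
    {'data1': (['x', 't'], np.ones((5, 4))),
     'data2': (['x', 't'], 2.0 * np.ones((5, 4)))},
    coords={'x': x_axis, 't': t_axis})

# Names of the stored data, similar to dict keys
names = list(ds.data_vars)

# Assigning a new entry stores a derived variable on the shared axes
ds['total'] = ds['data1'] + ds['data2']

print(names)
print(ds['total'].dims)
```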

Indexing of xr.Dataset

Use isel for index access. To access the element at index 1 (the second element) along the x axis, specify the axis label name and the corresponding index, as follows:

ds.isel(x=1)

<xarray.Dataset>
Dimensions:  (t: 4)
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
    x        float64 0.25
Data variables:
    data1    (t) float64 1.477 0.0009389 1.358 -1.627
    data2    (t) float64 -1.416 -0.4929 0.4926 -0.7186

Of course, you can specify multiple axes.

ds.isel(x=1, t=2)

<xarray.Dataset>
Dimensions:  ()
Coordinates:
    t        float64 1.333
    x        float64 0.25
Data variables:
    data1    float64 1.358
    data2    float64 0.4926

It also supports slicing.

ds.isel(x=slice(0,2,1), t=2)

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
    t        float64 1.333
  * x        (x) float64 0.0 0.25
Data variables:
    data1    (x) float64 0.343 1.358
    data2    (x) float64 -0.22 0.4926

Position indexing

Use the .sel method for position indexing. As with .isel, specify the axis label name, but this time give the axis value instead of the index.

ds.sel(x=0.0)

<xarray.Dataset>
Dimensions:  (t: 4)
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
    x        float64 0.0
Data variables:
    data1    (t) float64 -1.091 -1.851 0.343 2.077
    data2    (t) float64 0.4852 -0.5463 -0.22 -1.357

By default, only an exactly matching value is returned, but this behavior can be changed with the method option. If you want the nearest value, set method='nearest'.

# Returns the data whose x is closest to 0.4.
ds.sel(x=0.4, method='nearest')

<xarray.Dataset>
Dimensions:  (t: 4)
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
    x        float64 0.5
Data variables:
    data1    (t) float64 -0.2008 0.1008 0.7177 0.8893
    data2    (t) float64 -0.03163 0.6942 0.8194 -2.93
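The difference between the default exact match and method='nearest' can be sketched as follows. The constant data is a placeholder, and the behavior shown for a missing value (a KeyError) is what recent xarray releases do; it is assumed rather than confirmed for the version used in this article.

```python
import numpy as np
import xarray as xr

x_axis = np.linspace(0.0, 1.0, 5)   # 0.4 is not one of these values
t_axis = np.linspace(0.0, 2.0, 4)
ds = xr.Dataset(
    {'data1': (['x', 't'], np.ones((5, 4))),
     'data2': (['x', 't'], np.zeros((5, 4)))},
    coords={'x': x_axis, 't': t_axis})

# Default: an exact match is required, so x = 0.4 is not found
try:
    ds.sel(x=0.4)
    exact_match_found = True
except KeyError:
    exact_match_found = False

# With method='nearest', the closest axis value (x = 0.5) is used
nearest = ds.sel(x=0.4, method='nearest')

print(exact_match_found)
print(float(nearest['x']))
```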

You can also pass a slice object.

ds.sel(x=slice(0,0.5))

<xarray.Dataset>
Dimensions:  (t: 4, x: 3)
Coordinates:
  * t        (t) float64 0.0 0.6667 1.333 2.0
  * x        (x) float64 0.0 0.25 0.5
Data variables:
    data1    (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
    data2    (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...