This article introduces xarray, a Python library that supports multidimensional data analysis. For more information, please refer to the official documentation.
Scientific measurement data is often multidimensional. For example, when time series data is measured with sensors installed at multiple positions, the measurement data is two-dimensional: spatial channel direction × time direction. Furthermore, when a short-time Fourier transform is applied to the data, it becomes three-dimensional: spatial channel direction × time direction × frequency direction.
In general, such data is often handled with numpy's np.ndarray. However, since np.ndarray is a simple matrix (or tensor), any other information has to be kept separately. In the above example, the values along each axis (such as the time axis) are this "other information".
Therefore, if you want, for example, to cut out a certain time range from the data and use it, you also have to cut out the corresponding part of the time axis data at the same time.
Of course, you can do all of this with a plain np.ndarray, but in a complicated program such bookkeeping can become a source of mistakes.
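To make the bookkeeping concrete, here is a minimal sketch using only numpy. The array names (`signal`, `t`) and the sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical measurement: 3 channels x 100 time samples.
t = np.linspace(0.0, 1.0, 100)         # time axis, kept in a separate array
signal = np.random.randn(3, t.size)    # the data itself

# Cutting out 0.2 <= t <= 0.5 means applying the same mask to both arrays.
mask = (t >= 0.2) & (t <= 0.5)
signal_cut = signal[:, mask]
t_cut = t[mask]                        # forgetting this line leaves the axis out of sync
```

Nothing ties `signal` and `t` together here, so every slicing step has to be repeated by hand for each array.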
xarray is a library that simplifies such operations. (Since it uses np.ndarray internally, the high-speed computing performance of np.ndarray is hardly sacrificed.) pandas is a similar library for labelled one-dimensional data, but it cannot (easily) handle multidimensional data. xarray fills that gap.
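As a rough preview, the sketch below shows how a time-range cut might look when the axis travels with the data. The dimension names `channel` and `t` and the array sizes are assumptions made for this example. The `sel` method is described later in this article.

```python
import numpy as np
import xarray as xr

# Hypothetical measurement: 3 channels x 100 time samples.
t = np.linspace(0.0, 1.0, 100)
signal = xr.DataArray(
    np.random.randn(3, t.size),
    dims=['channel', 't'],
    coords={'t': t},
)

# One call cuts the data and the time axis together.
cut = signal.sel(t=slice(0.2, 0.5))
print(cut['t'].values[:3])   # the time axis of the cut-out part
```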
In addition to the above, the `__str__` method is overridden, so printing an object gives you an overview of its contents, and so on.
By the way, this article uses the following version.
import numpy as np
import xarray as xr
xr.__version__
'0.9.1'
It seems to be common to abbreviate xarray as `xr`.
xarray mainly provides two data types, xr.DataArray and xr.Dataset.
xr.DataArray
xr.DataArray holds the multidimensional data mentioned above.
Internally, it has a dictionary-like `coords`, which pairs axis labels with axis values, and a dictionary `attrs`, which stores other information.
Since the `__getitem__` method is overridden, it can be accessed like `da[i, j]`, just like np.ndarray.
However, the return value is also an xr.DataArray object, so it carries over the axis information and so on.
xr.Dataset
An object that holds multiple xr.DataArrays. It can have multiple axes, and it keeps track of which axes each piece of data corresponds to.
You can access it like a dictionary object. For example, for an xr.Dataset `data` that holds temperature T and density N, `data['T']` returns the temperature T as an xr.DataArray.
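The two types described above can be sketched as follows. The values, the `units` attribute, and the variable names `T` and `N` are hypothetical. They only mirror the temperature and density example in the text.

```python
import numpy as np
import xarray as xr

# A small DataArray with axis values (coords) and extra information (attrs).
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=['x', 't'],
    coords={'x': [0.0, 1.0], 't': [0.0, 0.5, 1.0]},
    attrs={'units': 'V'},          # hypothetical metadata
)

# Integer indexing works as with np.ndarray, but the result is still a DataArray
# that remembers which x and t it came from.
point = da[1, 2]
print(float(point), float(point['x']), float(point['t']))

# A Dataset holding two variables on the same axes, accessed like a dictionary.
data = xr.Dataset({'T': da, 'N': da * 2.0})
T = data['T']
```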
xr.DataArray plays a role similar to `Series` in pandas.
It holds the data values themselves together with the axis data.
data = xr.DataArray(np.random.randn(2, 3))
This creates a 2x3 xr.DataArray object with no axis information.
You can view a summary of the created object with the `print` function.
print(data)
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 0.32853 , -1.010702, 1.220686],
[ 0.877681, 1.180265, -0.963936]])
Dimensions without coordinates: dim_0, dim_1
If you do not specify the axes explicitly, as in this case, the labels dim_0 and dim_1 are assigned automatically.
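As a small sketch of this behavior, the automatically assigned labels can be inspected through `dims` and, if desired, replaced afterwards with `rename`. The new names `x` and `t` are only illustrative.

```python
import numpy as np
import xarray as xr

unlabeled = xr.DataArray(np.zeros((2, 3)))
print(unlabeled.dims)    # automatically assigned labels

# Replace the automatic labels with meaningful ones (names chosen for illustration).
labeled = unlabeled.rename({'dim_0': 'x', 'dim_1': 't'})
print(labeled.dims)
```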
For example, consider the case where the first dimension of some data `data_np` corresponds to the spatial position x and the second dimension corresponds to the time t.
# Example data
data_np = np.random.randn(5, 4)
x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)
data = xr.DataArray(data_np, dims=['x', 't'],
                    coords={'x': x_axis, 't': t_axis},
                    name='some_measurement')
Here, pass to `dims` a list (or tuple) of the labels corresponding to each dimension of `data_np`, and pass to `coords` the axis labels and the corresponding axis data in dictionary form.
print(data)
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 1.089975, 0.343039, -0.521509, 0.02816 ],
[ 1.117389, 0.589563, -1.030908, -0.023295],
[ 0.403413, -0.157136, -0.175684, -0.743779],
[ 0.814334, 0.164875, -0.489531, -0.335251],
[ 0.009115, 0.294526, 0.693384, -1.046416]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
In the displayed summary, the line
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
indicates that this DataArray is a 5x4 matrix named `some_measurement`, with the axis label 'x' for the first dimension and 't' for the second dimension.
Also, the lines following `Coordinates:` list the axis data.
The list of axes can be accessed with `dims`.
The order shown there indicates which dimension of the original data each axis corresponds to.
data.dims
('x', 't')
To access the values of an axis, pass its label name as the key.
data['x']
<xarray.DataArray 'x' (x: 5)>
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
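Since an axis is itself returned as an xr.DataArray, its raw values can be taken out as a plain np.ndarray when needed. The following is a minimal sketch under the same setup as above, with random data as a stand-in.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.random.randn(5, 4),
    dims=['x', 't'],
    coords={'x': np.linspace(0.0, 1.0, 5), 't': np.linspace(0.0, 2.0, 4)},
)

x_values = data['x'].values    # plain np.ndarray holding the x axis values
t_coord = data.coords['t']     # the t axis via the coords mapping, as an xr.DataArray
print(type(x_values), t_coord.dims)
```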
xarray supports several kinds of indexing. Because it builds on the indexing machinery of pandas, it is about as fast as pandas.
data[0,1]
<xarray.DataArray 'some_measurement' ()>
array(0.3430393695918721)
Coordinates:
t float64 0.6667
x float64 0.0
Since it is array-like, it can be indexed like a normal matrix. The axis information is carried over to the result.
With the `.loc` indexer, you can select data by specifying positions in terms of the axis values.
data.loc[0:0.5, :1.0]
<xarray.DataArray 'some_measurement' (x: 3, t: 2)>
array([[ 1.089975, 0.343039],
[ 1.117389, 0.589563],
[ 0.403413, -0.157136]])
Coordinates:
* t (t) float64 0.0 0.6667
* x (x) float64 0.0 0.25 0.5
`.loc[0:0.5, :1.0]` cuts out the data in the range 0 ≤ x ≤ 0.5 along the axis of the first dimension and in the range t ≤ 1.0 along the axis of the second dimension. Unlike integer slicing, both endpoints are included.
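The difference between label-based slicing with `.loc` and ordinary positional slicing can be sketched as follows. The data values are an arbitrary `np.arange` pattern chosen so that the result is easy to read.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(20.0).reshape(5, 4),
    dims=['x', 't'],
    coords={'x': np.linspace(0.0, 1.0, 5), 't': np.linspace(0.0, 2.0, 4)},
)

# Label-based: the end points 0.5 (x) and 1.0 (t) are axis values and are inclusive.
by_label = data.loc[0.0:0.5, :1.0]
# Position-based: 2 and 1 are integer positions and are exclusive, as in numpy.
by_position = data[0:2, :1]

print(by_label.shape, by_position.shape)
```

With the axes above, the label-based slice keeps x = 0.0, 0.25, 0.5 and the two t values below 1.0, while the positional slice keeps only the first two x positions and the first t position.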
Use the `.isel` and `.sel` methods for access by axis label name.
`.isel` takes an axis label and an integer index along that axis.
data.isel(t=1)
<xarray.DataArray 'some_measurement' (x: 5)>
array([ 0.343039, 0.589563, -0.157136, 0.164875, 0.294526])
Coordinates:
t float64 0.6667
* x (x) float64 0.0 0.25 0.5 0.75 1.0
`.sel` takes an axis label and a value along that axis.
data.sel(t=slice(0.5,2.0))
<xarray.DataArray 'some_measurement' (x: 5, t: 3)>
array([[ 0.343039, -0.521509, 0.02816 ],
[ 0.589563, -1.030908, -0.023295],
[-0.157136, -0.175684, -0.743779],
[ 0.164875, -0.489531, -0.335251],
[ 0.294526, 0.693384, -1.046416]])
Coordinates:
* t (t) float64 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
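The relationship between the two methods can be sketched as follows: selecting by the integer index 1 and selecting by the axis value stored at that index give the same result. The random data and the query value 0.7 are only placeholders.

```python
import numpy as np
import xarray as xr

t_axis = np.linspace(0.0, 2.0, 4)
data = xr.DataArray(
    np.random.randn(5, 4),
    dims=['x', 't'],
    coords={'x': np.linspace(0.0, 1.0, 5), 't': t_axis},
)

by_index = data.isel(t=1)                         # integer position along t
by_value = data.sel(t=t_axis[1])                  # exact axis value at that position
near = data.sel(t=0.7, method='nearest')          # closest axis value to 0.7

print(by_index.equals(by_value), float(near['t']))
```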
xarray supports many np.ndarray-like operations.
Basic arithmetic, including broadcasting, is supported.
data+10
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 11.089975, 10.343039, 9.478491, 10.02816 ],
[ 11.117389, 10.589563, 8.969092, 9.976705],
[ 10.403413, 9.842864, 9.824316, 9.256221],
[ 10.814334, 10.164875, 9.510469, 9.664749],
[ 10.009115, 10.294526, 10.693384, 8.953584]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
Element-wise calculations also carry over the axis information.
np.sin(data)
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 0.886616, 0.336351, -0.498189, 0.028156],
[ 0.89896 , 0.555998, -0.857766, -0.023293],
[ 0.39256 , -0.15649 , -0.174781, -0.677074],
[ 0.727269, 0.164129, -0.470212, -0.329006],
[ 0.009114, 0.290286, 0.639144, -0.865635]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
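Broadcasting in xarray is driven by the axis labels, not by the position of each dimension. A minimal sketch follows. The two one-dimensional profiles and their values are hypothetical, and `mean(dim='t')` is used here only as one example of a reduction along a named axis.

```python
import numpy as np
import xarray as xr

x_profile = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims=['x'],
                         coords={'x': [0.0, 0.5, 1.0]})
t_profile = xr.DataArray(np.array([10.0, 20.0]), dims=['t'],
                         coords={'t': [0.0, 1.0]})

# Arrays with different axes are broadcast against each other by label.
outer = x_profile * t_profile            # dims ('x', 't'), shape (3, 2)

# Subtract the mean over t; the result of mean has only the 'x' axis
# and is broadcast back along 't' automatically.
anomaly = outer - outer.mean(dim='t')
print(anomaly.values)
```

No `np.newaxis` or reshaping is needed, because each operand declares which axes it has.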
`xr.Dataset` is an object that collects multiple `xr.DataArray`s.
In particular, `xr.DataArray`s that share an axis can be indexed and sliced all at once.
A single measuring instrument often outputs several types of signals, and `xr.Dataset` is well suited to handling such **multidimensional** information.
It plays a role similar to `DataFrame` in pandas.
The first argument, `data_vars`, is `dict`-like.
Use the name of each data variable as the key and a two-element tuple as the value.
The first element of the tuple is the axis labels corresponding to that data, and the second element is the data itself (`array`-like).
Pass a `dict`-like object to `coords` to store the axis data.
Use the axis label as the key and the axis values as the value.
ds = xr.Dataset({'data1': (['x','t'], np.random.randn(5,4)), 'data2': (['x','t'], np.random.randn(5,4))},
coords={'x': x_axis, 't': t_axis})
ds
<xarray.Dataset>
Dimensions: (t: 4, x: 5)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
Data variables:
data1 (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
data2 (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...
To access the contents, pass the label name inside `[]`.
In that case, the return value is an `xr.DataArray`.
ds['data1']
<xarray.DataArray 'data1' (x: 5, t: 4)>
array([[ -1.091230e+00, -1.851416e+00, 3.429677e-01, 2.077113e+00],
[ 1.476765e+00, 9.389425e-04, 1.358136e+00, -1.627471e+00],
[ -2.007550e-01, 1.008126e-01, 7.177067e-01, 8.893402e-01],
[ -1.813395e-01, -3.407015e-01, -9.673550e-01, 1.135727e+00],
[ 2.423873e-01, -1.198268e+00, 1.650465e+00, -1.923102e-01]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
You can also access the axes by label.
ds['x']
<xarray.DataArray 'x' (x: 5)>
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
Use `isel` for index access. To access the element at index 1 (the second element) along the x axis, specify the axis label name and the corresponding index, as follows:
ds.isel(x=1)
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.25
Data variables:
data1 (t) float64 1.477 0.0009389 1.358 -1.627
data2 (t) float64 -1.416 -0.4929 0.4926 -0.7186
Of course, you can specify multiple axes.
ds.isel(x=1, t=2)
<xarray.Dataset>
Dimensions: ()
Coordinates:
t float64 1.333
x float64 0.25
Data variables:
data1 float64 1.358
data2 float64 0.4926
It also supports slicing.
ds.isel(x=slice(0,2,1), t=2)
<xarray.Dataset>
Dimensions: (x: 2)
Coordinates:
t float64 1.333
* x (x) float64 0.0 0.25
Data variables:
data1 (x) float64 0.343 1.358
data2 (x) float64 -0.22 0.4926
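Index access on an xr.Dataset is applied to every variable it holds. The sketch below checks this on a Dataset built the same way as above, with fresh random data as a stand-in.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'data1': (['x', 't'], np.random.randn(5, 4)),
     'data2': (['x', 't'], np.random.randn(5, 4))},
    coords={'x': np.linspace(0.0, 1.0, 5), 't': np.linspace(0.0, 2.0, 4)},
)

row = ds.isel(x=1)   # one call slices data1 and data2 together

# Each variable in the result matches slicing that variable on its own.
same1 = row['data1'].equals(ds['data1'].isel(x=1))
same2 = row['data2'].equals(ds['data2'].isel(x=1))
print(list(row.data_vars), same1, same2)
```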
Use the `.sel` method for position-based indexing.
As with `.isel`, specify the axis label name, but this time together with the axis value.
ds.sel(x=0.0)
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.0
Data variables:
data1 (t) float64 -1.091 -1.851 0.343 2.077
data2 (t) float64 0.4852 -0.5463 -0.22 -1.357
By default, only a value that matches exactly is returned, but you can change this with the `method` option.
If you want the nearest value, set `method='nearest'`.
# Returns the data at the x value closest to x = 0.4.
ds.sel(x=0.4, method='nearest')
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.5
Data variables:
data1 (t) float64 -0.2008 0.1008 0.7177 0.8893
data2 (t) float64 -0.03163 0.6942 0.8194 -2.93
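The difference between the default exact matching and `method='nearest'` can be sketched as follows. The example assumes that a lookup of a value missing from the axis raises `KeyError`, and the Dataset is rebuilt with random stand-in data.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'data1': (['x', 't'], np.random.randn(5, 4))},
    coords={'x': np.linspace(0.0, 1.0, 5), 't': np.linspace(0.0, 2.0, 4)},
)

# x = 0.4 is not on the axis (0.0, 0.25, 0.5, 0.75, 1.0), so exact matching fails.
try:
    ds.sel(x=0.4)
    exact_found = True
except KeyError:
    exact_found = False

# With method='nearest', the closest axis value (0.5) is used instead.
nearest = ds.sel(x=0.4, method='nearest')
print(exact_found, float(nearest['x']))
```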
You can also pass a slice object.
ds.sel(x=slice(0,0.5))
<xarray.Dataset>
Dimensions: (t: 4, x: 3)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5
Data variables:
data1 (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
data2 (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...
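Slices along several axes can be combined in a single `.sel` call. The sketch below uses random stand-in data, and the range 0.5 to 1.5 for t is chosen only for illustration.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'data1': (['x', 't'], np.random.randn(5, 4)),
     'data2': (['x', 't'], np.random.randn(5, 4))},
    coords={'x': np.linspace(0.0, 1.0, 5), 't': np.linspace(0.0, 2.0, 4)},
)

# Keep 0 <= x <= 0.5 and 0.5 <= t <= 1.5 for every variable at once.
subset = ds.sel(x=slice(0.0, 0.5), t=slice(0.5, 1.5))
print(subset['x'].values, subset['t'].values)
```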