Previous article
Building on the article above, this one introduces data analysis with xarray from a more practical point of view.
I have been a member of the pydata/xarray development team since 2017. These days I am mostly a user, but I still submit the occasional bug-fix PR.
I use xarray for almost all of the data analysis in my day job.
This article assumes some familiarity with numpy (i.e., you can create and operate on np.ndarray objects).
In this article, I will introduce the data model of xarray again, and then discuss the basic operation of indexing. It may be safe to say that this indexing is the essence of xarray.
After that, I will introduce concat, which is the reverse operation of indexing.
Building on these, I will then discuss how to structure your own data so that it is easy to analyze with xarray.
I will also talk a little about saving and loading data and out-of-memory processing using dask.
This article was written with Google Colaboratory. Anyone can copy it from here and run through it themselves.
In a nutshell, xarray provides **multidimensional arrays with labeled axes**. You can think of it as a combination of numpy's np.ndarray and pandas' pd.Series.
Once you get used to it, it is very convenient, because you no longer have to keep track of axes by hand during data analysis.
On the other hand, xarray is often said to be a little idiosyncratic, with a high learning cost. This article aims to lower that cost as much as possible.
There are two main classes in xarray: xarray.DataArray and xarray.Dataset. Let me introduce each in turn.
This is almost the same as the content described in Multidimensional data analysis library xarray.
xarray.DataArray
Let me first introduce xarray.DataArray, a **multidimensional array with axes**. How to construct one from scratch is explained later; to begin with, let's use the data bundled in xarray.tutorial.
import numpy as np
import xarray as xr
data = xr.tutorial.open_dataset('air_temperature')['air']
data
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> [3869000 values with dtype=float32] Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
Here `data` is a 3D xr.DataArray instance (the official abbreviation for xarray is `xr`).
Displaying `data` shows a summary like the one above. The `(time: 2920, lat: 25, lon: 53)` in the top row gives the name and number of elements of each dimension of the stored array.
As you can see, xarray **handles each dimension by name** (called a *label* in the official documentation).
The first axis is the `time` axis, the second is the `lat` axis, and the third is the `lon` axis.
Because the axes are named, you no longer have to worry about their order.
In addition, each axis can carry coordinate values; these are listed in the Coordinates section.
In this data, all three axes, `time`, `lat`, and `lon`, have coordinates.
You can also attach other metadata, which appears in the Attributes section.
As described in detail in the indexing section, having coordinate axes makes it possible to extract the range of data you want intuitively. For example, to get the data for August 30, 2014:
data.sel(time='2014-08-30')
<xarray.DataArray 'air' (time: 4, lat: 25, lon: 53)> array([[[270.29 , 270.29 , ..., 258.9 , 259.5 ], [273. , 273.1 , ..., 262.6 , 264.19998], ..., [299.5 , 298.79 , ..., 299. , 298.5 ], [299.29 , 299.29 , ..., 299.9 , 300. ]], [[269.9 , 270.1 , ..., 258.6 , 259.5 ], [273. , 273.29 , ..., 262.5 , 265. ], ..., [299.19998, 298.9 , ..., 298.5 , 298. ], [299.4 , 299.69998, ..., 299.19998, 299.6 ]], [[270.4 , 270.69998, ..., 261.29 , 261.9 ], [273. , 273.6 , ..., 266.4 , 268.6 ], ..., [297.9 , 297.6 , ..., 298.29 , 298.29 ], [298.69998, 298.69998, ..., 299.1 , 299.4 ]], [[270. , 270.4 , ..., 266. , 266.19998], [273. , 273.9 , ..., 268.1 , 269.69998], ..., [298.5 , 298.29 , ..., 298.69998, 299. ], [299.1 , 299.19998, ..., 299.5 , 299.69998]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2014-08-30 ... 2014-08-30T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
You can do it like this. Here the `.sel` method selects data by coordinate value: `.sel(time='2014-08-30')` means referring to the `time` axis and selecting the data at `'2014-08-30'`.
A DataArray is a multidimensional array like np.ndarray, so you can apply operations to it just as you would to an np.ndarray.
data * 2
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> array([[[482.4 , 485. , 487. , ..., 465.59998, 471. , 477.19998], [487.59998, 489. , 489.4 , ..., 465.59998, 470.59998, 478.59998], [500. , 499.59998, 497.78 , ..., 466.4 , 472.78 , 483.4 ], ..., [593.2 , 592.39996, 592.8 , ..., 590.8 , 590.2 , 589.39996], [591.8 , 592.39996, 593.58 , ..., 591.8 , 591.8 , 590.39996], [592.58 , 593.58 , 594.2 , ..., 593.8 , 593.58 , 593.2 ]], [[484.19998, 485.4 , 486.19998, ..., 464. , 467.19998, 471.59998], [487.19998, 488.19998, 488.4 , ..., 462. , 465. , 471.4 ], [506.4 , 505.78 , 504.19998, ..., 461.59998, 466.78 , 477. ], ... [587.38 , 587.77997, 590.77997, ..., 590.18 , 589.38 , 588.58 ], [592.58 , 594.38 , 595.18 , ..., 590.58 , 590.18 , 588.77997], [595.58 , 596.77997, 596.98 , ..., 591.38 , 590.98 , 590.38 ]], [[490.18 , 488.58 , 486.58 , ..., 483.37997, 482.97998, 483.58 ], [499.78 , 498.58 , 496.78 , ..., 479.18 , 480.58 , 483.37997], [525.98 , 524.38 , 522.77997, ..., 479.78 , 485.18 , 492.58 ], ..., [587.58 , 587.38 , 590.18 , ..., 590.58 , 590.18 , 589.38 ], [592.18 , 593.77997, 594.38 , ..., 591.38 , 591.38 , 590.38 ], [595.38 , 596.18 , 596.18 , ..., 592.98 , 592.38 , 591.38 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
As you would intuitively expect, the multiplication applies only to the array data; the axis data is left unchanged.
I mentioned that a DataArray can hold Coordinates in addition to the array data. Coordinate data falls into the following categories.
Dimension coordinate
This is the kind of coordinate introduced in the previous section: it has the same name as a dimension (`time`, `lon`, and `lat` in the data above).
Non-dimension coordinate
You can also hold coordinates whose names differ from every dimension. I will explain later in which situations they are necessary and convenient; for now, the key point is that they are **retained but not operated on**.
Scalar coordinate
A scalar can also be held as a coordinate. This, too, is a frequently used kind of data; for the time being, think of it like a non-dimension coordinate: retained but not operated on.
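As a quick sketch of "retained but not operated on", here is a hypothetical DataArray carrying all three kinds of coordinates (the names `label` and `run` are made up for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical array with a dimension coordinate 'x',
# a non-dimension coordinate 'label' and a scalar coordinate 'run'.
da = xr.DataArray(
    np.arange(3.0),
    dims=['x'],
    coords={'x': [10, 20, 30],
            'label': ('x', ['a', 'b', 'c']),
            'run': 7},
)

doubled = da * 2
# The array values are doubled, but every coordinate is carried
# through unchanged: coordinates are retained, not operated on.
print(doubled.coords['label'].values)  # ['a' 'b' 'c']
print(int(doubled.coords['run']))      # 7
```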
A DataArray has the basic attributes and methods found on np.ndarray:

- `.shape`: returns the shape
- `.size`: returns the total number of elements
- `.ndim`: returns the number of dimensions
- `.sizes`: returns an associative (dictionary-like) mapping from dimension names to sizes

and, concerning the dtype:

- `.dtype`: returns the dtype of the array
- `.astype()`: converts the array to another dtype
- `.real`: returns the real part of the data
- `.imag`: returns the imaginary part of the data

A DataArray wraps an np.ndarray. (As an aside, it can actually store multidimensional arrays other than np.ndarray.)
You can access the wrapped array via `.data`.
(Note that `.values` converts whatever array is inside to np.ndarray and returns that.)
type(data.data), data.shape
(numpy.ndarray, (2920, 25, 53))
Coordinate objects can be retrieved by passing the axis name in `[]`, dictionary-style.
data['lat']
<xarray.DataArray 'lat' (lat: 25)> array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5, 45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5, 15. ], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 Attributes: standard_name: latitude long_name: Latitude units: degrees_north axis: Y
What you get back is also a DataArray object.
xarray.Dataset
A Dataset is an object that collects multiple DataArrays.
data = xr.tutorial.open_dataset('air_temperature')
data['mean_temperature'] = data['air'].mean('time')
data
<xarray.Dataset> Dimensions: (lat: 25, lon: 53, time: 2920) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69 mean_temperature (lat, lon) float32 260.37564 260.1826 ... 297.30502 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
In this example, `data` holds two DataArrays, `air` and `mean_temperature`.
They are listed in the section called Data variables.
- `air` is a three-dimensional array with dimensions (`time`, `lat`, `lon`)
- `mean_temperature` is a two-dimensional array with dimensions (`lat`, `lon`)
The DataArrays held in a Dataset can share axes. This makes it possible to index multiple DataArrays at the same time, as shown below.
data.sel(lat=70, method='nearest')
<xarray.Dataset> Dimensions: (lon: 53, time: 2920) Coordinates: lat float32 70.0 * lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lon) float32 250.0 249.79999 ... 242.59 246.29 mean_temperature (lon) float32 264.7681 264.3271 ... 253.58247 257.71475 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
The DataArrays in a Dataset may have different dimensions. Whether axes are shared is determined by their names; as a consequence, you cannot have two unshared coordinate axes with the same name.
To get a DataArray from a Dataset, just pass the name of the DataArray as a key, like a dictionary.
da = data['air']
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> array([[[241.2 , 242.5 , ..., 235.5 , 238.59999], [243.79999, 244.5 , ..., 235.29999, 239.29999], ..., [295.9 , 296.19998, ..., 295.9 , 295.19998], [296.29 , 296.79 , ..., 296.79 , 296.6 ]], [[242.09999, 242.7 , ..., 233.59999, 235.79999], [243.59999, 244.09999, ..., 232.5 , 235.7 ], ..., [296.19998, 296.69998, ..., 295.5 , 295.1 ], [296.29 , 297.19998, ..., 296.4 , 296.6 ]], ..., [[245.79 , 244.79 , ..., 243.98999, 244.79 ], [249.89 , 249.29 , ..., 242.48999, 244.29 ], ..., [296.29 , 297.19 , ..., 295.09 , 294.38998], [297.79 , 298.38998, ..., 295.49 , 295.19 ]], [[245.09 , 244.29 , ..., 241.48999, 241.79 ], [249.89 , 249.29 , ..., 240.29 , 241.68999], ..., [296.09 , 296.88998, ..., 295.69 , 295.19 ], [297.69 , 298.09 , ..., 296.19 , 295.69 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
This section describes how to create a DataArray object.
The easiest way is to pass an np.ndarray together with a name for each axis.
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'])
<xarray.DataArray (x: 3, y: 4)> array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) Dimensions without coordinates: x, y
The example above creates a DataArray with 3x4 elements. The first dimension is named `x` and the second `y`.
To attach coordinates to the axes, pass a dictionary to the `coords` keyword.
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4]})
<xarray.DataArray (x: 3, y: 4)> array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) Coordinates: * x (x) int64 0 1 2 * y (y) float64 0.1 0.2 0.3 0.4
To include a non-dimension coordinate, you must specify which axis it depends on. To do this, make the dictionary value (to the right of the colon) a tuple: the first element is the name of the axis it depends on, and the second is the array itself.
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4],
'z': ('x', ['a', 'b', 'c'])})
<xarray.DataArray (x: 3, y: 4)> array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) Coordinates: * x (x) int64 0 1 2 * y (y) float64 0.1 0.2 0.3 0.4 z (x) <U1 'a' 'b' 'c'
Here the non-dimension coordinate `z` is defined along the `x` axis.
Scalar coordinates are passed via `coords` as well.
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4], 'scalar': 3})
<xarray.DataArray (x: 3, y: 4)> array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) Coordinates: * x (x) int64 0 1 2 * y (y) float64 0.1 0.2 0.3 0.4 scalar int64 3
You can see that the scalar coordinate does not belong to any dimension.
In addition to axis information, you can also pass attributes and a name for the DataArray. Note, however, that attributes are dropped in the course of operations, so it may be better to attach the things you want to keep all at once when saving to a file.
I will return to this in the later chapter on data structures for storage.
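A minimal sketch of how attributes are dropped (the attribute values are hypothetical); in recent xarray versions, `xr.set_options(keep_attrs=True)` can be used to propagate them instead:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims=['x'],
                  attrs={'units': 'degK', 'long_name': 'air temperature'})

# By default, arithmetic drops the attributes.
print((da * 2).attrs)  # {}

# Ask xarray to propagate attributes through operations.
with xr.set_options(keep_attrs=True):
    print((da * 2).attrs['units'])  # degK
```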
`.isel`, `.sel`, `.interp`, `.reindex`
Indexing is the most basic and essential operation; it is no exaggeration to say that xarray development began in order to simplify it.
xarray supports both position-based and coordinate-based indexing.
.isel
This is positional indexing, just like ordinary arrays such as np.ndarray, where you give integers inside brackets, e.g. `data[i, j, k]`.
Each argument indicates the **position** of an element in the array.
da[0, :4, 3]
<xarray.DataArray 'air' (lat: 4)> array([244. , 244.2 , 247.5 , 266.69998], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 lon float32 207.5 time datetime64[ns] 2013-01-01 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
With the above notation, you need to remember the position of each dimension (for example, that the `time` axis comes first).
That is surprisingly hard to keep straight across analyses. Written as below, you only need to remember the axis names.
da.isel(time=0, lat=slice(0, 4), lon=3)
<xarray.DataArray 'air' (lat: 4)> array([244. , 244.2 , 247.5 , 266.69998], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 lon float32 207.5 time datetime64[ns] 2013-01-01 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
In this way, you pass each axis name to the `.isel()` method as a keyword argument, with the corresponding index after the equals sign.
To specify a range, you need to use a `slice` object explicitly (e.g. `slice(0, 4)`).
.sel
In actual data analysis, the range of interest is usually specified in terms of coordinate values. In xarray this is achieved with the `.sel` method.
da.sel(time='2013-01-01', lat=slice(75, 60))
<xarray.DataArray 'air' (time: 4, lat: 7, lon: 53)> array([[[241.2 , 242.5 , ..., 235.5 , 238.59999], [243.79999, 244.5 , ..., 235.29999, 239.29999], ..., [272.1 , 270.9 , ..., 275.4 , 274.19998], [273.69998, 273.6 , ..., 274.19998, 275.1 ]], [[242.09999, 242.7 , ..., 233.59999, 235.79999], [243.59999, 244.09999, ..., 232.5 , 235.7 ], ..., [269.19998, 268.5 , ..., 275.5 , 274.69998], [272.1 , 272.69998, ..., 275.79 , 276.19998]], [[242.29999, 242.2 , ..., 236.09999, 238.7 ], [244.59999, 244.39 , ..., 232. , 235.7 ], ..., [273. , 273.5 , ..., 275.29 , 274.29 ], [275.5 , 275.9 , ..., 277.4 , 277.6 ]], [[241.89 , 241.79999, ..., 235.5 , 237.59999], [246.29999, 245.29999, ..., 231.5 , 234.5 ], ..., [273.29 , 272.6 , ..., 277.6 , 276.9 ], [274.1 , 274. , ..., 279.1 , 279.9 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2013-01-01T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
As with `.isel`, pass the axis name as a keyword and the desired coordinate value after the equals sign.
You can also use a `slice` object; in that case, its arguments can be real-valued coordinates.
There are, however, a few things to know about the `.sel` method.
It accepts the options `method` and `tolerance`.
Probably the most commonly used is `method='nearest'`, which indexes using the closest coordinate value.
da.sel(lat=76, method='nearest')
<xarray.DataArray 'air' (time: 2920, lon: 53)> array([[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999, 238.7 ], ..., [243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999, 244.89 ], [245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999, 244.79 ], [245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ]], dtype=float32) Coordinates: lat float32 75.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
Because coordinate axes can be arbitrary arrays, indexing with `.sel` does not always work well, for example when an axis is unsorted, contains NaN, or contains duplicate values. The following techniques help in such cases.
- You can sort along an axis with the `sortby` method, e.g. `da.sortby('lat')`.
- Using the `isnull` method, which flags NaN coordinates, you can build an array that skips those entries, e.g. `da.isel(lat=~da['lat'].isnull())`.
- With `da.isel(lat=np.unique(da['lat'], return_index=True)[1])` you can build an array that skips duplicated coordinate values.
.interp
Interpolation is possible with the same syntax as indexing.
da.interp(lat=74)
<xarray.DataArray 'air' (time: 2920, lon: 53)> array([[242.23999321, 243.29999998, 243.9799988 , ..., 232.79998779, 235.41999512, 238.87998962], [242.69999081, 243.25999451, 243.53999329, ..., 231.60000001, 233.1599945 , 235.75999145], [243.21998903, 243.07599789, 242.97999266, ..., 232.69998783, 234.45999444, 237.49999702], ..., [245.72999275, 245.38999009, 244.68999648, ..., 242.74999088, 243.20999146, 244.00999451], [247.42999566, 246.58999336, 245.48999023, ..., 242.4899933 , 243.38999027, 244.58999329], [247.00999749, 246.28999329, 245.32999587, ..., 240.84999084, 241.00999144, 241.74999085]]) Coordinates: * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 lat int64 74 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
xr.where
Sometimes you want to select only the data that satisfies a more complex condition: in the data above, for example, only certain values of `time`, or only points where `air` is greater than or equal to some threshold.
When the selection result is one-dimensional, for example when selecting particular days of the week, you can simply pass a one-dimensional boolean array to the `.isel` method.
import datetime
# Choose only weekends
is_weekend = da['time'].dt.dayofweek.isin([5, 6])
is_weekend
<xarray.DataArray 'dayofweek' (time: 2920)> array([False, False, False, ..., False, False, False]) Coordinates: * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
# Pass the bool-typed array to the .isel method.
da.isel(time=is_weekend)
<xarray.DataArray 'air' (time: 832, lat: 25, lon: 53)> array([[[248.59999, 246.89 , ..., 245.7 , 246.7 ], [256.69998, 254.59999, ..., 248. , 250.09999], ..., [296.69998, 296. , ..., 295. , 294.9 ], [297. , 296.9 , ..., 296.79 , 296.69998]], [[245.09999, 243.59999, ..., 249.89 , 251.39 ], [254. , 251.79999, ..., 247.79999, 250.89 ], ..., [296.79 , 296.19998, ..., 294.6 , 294.69998], [297.19998, 296.69998, ..., 296.29 , 296.6 ]], ..., [[242.39 , 241.79999, ..., 247.59999, 247.29999], [247.18999, 246.09999, ..., 253.29999, 254.29999], ..., [294.79 , 295.29 , ..., 297.69998, 297.6 ], [296.38998, 297.19998, ..., 298.19998, 298.29 ]], [[249.18999, 248. , ..., 244.09999, 242.5 ], [254.5 , 253. , ..., 251. , 250.59999], ..., [294.88998, 295.19998, ..., 298. , 297.29 ], [296.1 , 296.88998, ..., 298.5 , 298.29 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-05 ... 2014-12-28T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
When you want to select based on the data of a multidimensional array, the shape of the selected result is not well defined, so it is better to replace the unselected elements with NaN instead.
In that case you can use the `where` method.
da.where(da > 270)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> array([[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ... [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
Here, the elements that do not satisfy `da > 270` have been replaced with `np.nan`.
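The section title also mentions the `xr.where` function; unlike the `.where` method, it takes an explicit replacement value. A small sketch with a hypothetical array:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([[265.0, 272.0],
                            [280.0, 268.0]]),
                  dims=['lat', 'lon'])

# Method form: unselected elements become NaN.
masked = da.where(da > 270)

# Function form: supply an explicit replacement instead of NaN.
clipped = xr.where(da > 270, da, 270.0)
print(clipped.values)  # [[270. 272.] [280. 270.]]
```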
So far we have introduced basic indexing, but xarray also allows for advanced indexing.
As shown in this figure, you can cut out an array with new dimensions from a multidimensional array. The official documentation calls this advanced indexing.
To achieve this, define the arrays you pass as arguments as DataArrays, and give those DataArrays the dimensions that the newly created array should have.
Following the figure above, the new dimension is `z`, so we define DataArrays along `z`.
lon_new = xr.DataArray([60, 61, 62], dims=['z'], coords={'z': ['a', 'b', 'c']})
lat_new = xr.DataArray([16, 46, 76], dims=['z'], coords={'z': ['a', 'b', 'c']})
If you pass these as arguments to the `.sel` method, it selects the points whose `(lon, lat)` values are closest to `(60, 16)`, `(61, 46)`, and `(62, 76)`, respectively.
da.sel(lon=lon_new, lat=lat_new, method='nearest')
<xarray.DataArray 'air' (time: 2920, z: 3)> array([[296.29 , 280. , 241.2 ], [296.29 , 279.19998, 242.09999], [296.4 , 278.6 , 242.29999], ..., [298.19 , 279.99 , 243.48999], [297.79 , 279.69 , 245.79 ], [297.69 , 279.79 , 245.09 ]], dtype=float32) Coordinates: lat (z) float32 15.0 45.0 75.0 lon (z) float32 200.0 200.0 200.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 * z (z) <U1 'a' 'b' 'c' Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
You can see that the resulting array no longer depends on the `lon` and `lat` axes, but on `z`.
A similar method can also be used for interpolation.
da.interp(lon=lon_new, lat=lat_new)
<xarray.DataArray 'air' (time: 2920, z: 3)> array([[nan, nan, nan], [nan, nan, nan], [nan, nan, nan], ..., [nan, nan, nan], [nan, nan, nan], [nan, nan, nan]]) Coordinates: * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 lon (z) int64 60 61 62 lat (z) int64 16 46 76 * z (z) <U1 'a' 'b' 'c' Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
xr.concat
You often want to combine multiple xarray objects, for example when analyzing together experimental or simulation data obtained under various parameters.
This is the equivalent of np.concatenate and np.stack.
I think it's easy to understand if you think of it as the reverse operation of indexing.
The syntax is
xr.concat([data1, data2, ...], dim='dim')
Pass a list of DataArrays (or Datasets) as the first argument; the second argument is the direction of concatenation. Depending on that direction, the operation behaves either like concatenate or like stack.
The basic operation of the `xr.concat` function is to connect multiple DataArrays along an existing axis.
For example, take the following two DataArrays:
da0 = da.isel(time=slice(0, 10))
da1 = da.isel(time=slice(10, None))
da0
<xarray.DataArray 'air' (time: 10, lat: 25, lon: 53)> array([[[241.2 , 242.5 , ..., 235.5 , 238.59999], [243.79999, 244.5 , ..., 235.29999, 239.29999], ..., [295.9 , 296.19998, ..., 295.9 , 295.19998], [296.29 , 296.79 , ..., 296.79 , 296.6 ]], [[242.09999, 242.7 , ..., 233.59999, 235.79999], [243.59999, 244.09999, ..., 232.5 , 235.7 ], ..., [296.19998, 296.69998, ..., 295.5 , 295.1 ], [296.29 , 297.19998, ..., 296.4 , 296.6 ]], ..., [[244.79999, 244.39 , ..., 242.7 , 244.79999], [246.7 , 247.09999, ..., 237.79999, 240.2 ], ..., [297.79 , 297.19998, ..., 296.4 , 295.29 ], [297.9 , 297.69998, ..., 297.19998, 297. ]], [[243.89 , 243.79999, ..., 240.29999, 242.59999], [245.5 , 245.79999, ..., 236.59999, 239. ], ..., [297.6 , 297. , ..., 295.6 , 295. ], [298.1 , 297.69998, ..., 296.79 , 297.1 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2013-01-03T06:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
You can restore the original data by concatenating them in the `time` direction.
xr.concat([da0, da1], dim='time')
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ... [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ], [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999], [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
Note that the `time` axis has been restored as well.
For example, suppose you have two DataArrays like this:
da0 = da.isel(time=0)
da1 = da.isel(time=1)
da0
<xarray.DataArray 'air' (lat: 25, lon: 53)> array([[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 time datetime64[ns] 2013-01-01 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
To combine these along the `time` axis, do the following:
xr.concat([da0, da1], dim='time')
<xarray.DataArray 'air' (time: 2, lat: 25, lon: 53)> array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ..., [296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 , 294.79 ], [296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 , 295.1 ], [296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 , 296.6 ]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
In this way, if da0 and da1 each have a scalar coordinate named time, concat joins them along that direction and promotes time to a new dimension coordinate (you specify the dimension name with the dim keyword, as in dim='time').
Alternatively, you can pass a DataArray as the dim keyword to create a new coordinate axis with that name.
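To illustrate passing a DataArray as dim, here is a minimal sketch with synthetic arrays (the names and values are made up for this example, not taken from the tutorial data):

```python
import numpy as np
import xarray as xr

# Two slices that share the same 'x' axis but carry no 'time' coordinate
da0 = xr.DataArray(np.zeros(3), dims=['x'], coords={'x': [0, 1, 2]})
da1 = xr.DataArray(np.ones(3), dims=['x'], coords={'x': [0, 1, 2]})

# Passing a DataArray as `dim` creates a new dimension coordinate
# named after it, holding the given values.
time = xr.DataArray([10.0, 20.0], dims='time', name='time')
combined = xr.concat([da0, da1], dim=time)
print(combined.dims)            # ('time', 'x')
print(combined['time'].values)  # [10. 20.]
```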
Another advantage of xarray is that it can save DataArray and Dataset objects to disk in a consistent, self-contained way. Several file formats are supported, but the easiest to use is probably netCDF.
Wikipedia also has an explanation.
NetCDF is a machine-independent binary format that reads and writes data as arrays (array-oriented) and can store, along with the data itself, descriptions of that data (self-descriptive). NetCDF is an international standard of the Open Geospatial Consortium, an international consortium that develops open geographic-information standards.
Currently, versions 3 and 4 are both maintained, but it is safer to use version 4. Version 4 conforms to the HDF5 standard, so even an ordinary HDF5 reader can read its contents.
The official file extension is .nc.
The netcdf4 package is required to save in netCDF format. Install it on your system with
pip install netcdf4
or
conda install netcdf4
To save an xarray object in netCDF format, call the .to_netcdf method.
data.to_netcdf('test.nc')
/home/keisukefujii/miniconda3/envs/xarray/lib/python3.7/site-packages/ipykernel_launcher.py:1: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
"""Entry point for launching an IPython kernel.
You should now have test.nc saved in your current directory.
Conversely, to load a saved file, use xr.load_dataset.
xr.load_dataset('test.nc')
<xarray.Dataset> Dimensions: (lat: 25, lon: 53, time: 2920) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69 mean_temperature (lat, lon) float32 260.37564 260.1826 ... 297.30502 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
As you can see, not only the array itself but also the axis names, coordinate values, and the other information in attributes are stored.
scipy.io.netcdf
xarray's .to_netcdf uses the netcdf4 library by default, but falls back to scipy.io.netcdf if netcdf4 is not installed on your system. However, scipy.io.netcdf only supports version 3 of the format, which is incompatible with version 4, and this can cause confusion.
Don't forget to install netcdf4 on your system.
Alternatively, you can specify the package to use explicitly, as in .to_netcdf('test.nc', engine='netcdf4').
Files in netCDF format can be read randomly. Therefore, if you only need part of the data in a file, it is more efficient to read just that part when necessary instead of reading the whole file from disk up front. This is especially important when dealing with files that are too large to fit in memory. This is called out-of-memory processing.
Out-of-memory processing is positioned as a fairly important feature in xarray, and various implementations are in progress. I'll cover the details in another article and only mention the most basic usage here.
Out-of-memory processing is performed by calling open_dataset instead of load_dataset when loading.
unloaded_data = xr.open_dataset('test.nc')
unloaded_data
<xarray.Dataset> Dimensions: (lat: 25, lon: 53, time: 2920) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 ... mean_temperature (lat, lon) float32 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
If you look at Data variables, you can see that air contains no numbers, just ... .
This indicates that the data has not been read yet. However, since the coordinate axes are always read, the indexing operations shown above can still be performed as-is.
unloaded_data['air'].sel(lat=60, method='nearest')
<xarray.DataArray 'air' (time: 2920, lon: 53)> [154760 values with dtype=float32] Coordinates: lat float32 60.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
Selecting a variable or indexing it still does not load the data.
It is loaded into memory for the first time when an operation is performed on it or when .data or .values is accessed.
You can also use the .compute() method to load it explicitly.
unloaded_data.compute()
<xarray.Dataset> Dimensions: (lat: 25, lon: 53, time: 2920) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69 mean_temperature (lat, lon) float32 260.37564 260.1826 ... 297.30502 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
open_dataset and open_dataarray lock the file for future reading. Therefore, if you touch a file with the same name from another program, an error will occur.
For example, consider the case where the result of a time-consuming calculation is written out to a .nc file periodically. If you open that file with open_dataset to check the progress, subsequent attempts to write to it will fail.
It would be fine if the write simply failed and stopped, but it often seems to corrupt the existing data instead.
So use this out-of-memory processing with caution.
You can use the with statement to ensure that the file is closed and unlocked, as shown below.
with xr.open_dataset('test.nc') as f:
print(f)
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 ...
mean_temperature (lat, lon) float32 ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
As mentioned above, if you save your data in netCDF format, you can analyze it smoothly with xarray. When generating data in experiments or simulations, it is a good idea to design a data structure with the subsequent analysis in mind and save it in netCDF format.
It is recommended to store the following along with the data:
- quantities that can change from experiment to experiment, as scalar coordinates rather than as attributes
- static information (such as details of the equipment used in the experiment and a description of the data), as attributes
In the case of an experiment, I think the data should be saved with a structure like the following.
from datetime import datetime
# Let's assume that this is image data taken with a camera.
raw_data = np.arange(512 * 2048).reshape(512, 2048)
# The x-axis and y-axis coordinates of the image.
x = np.arange(512)
y = np.arange(2048)
# Position data for each pixel of the image
r = 300.0
X = r * np.cos(x / 512)[:, np.newaxis] + r * np.sin(y / 2048)
Y = r * np.sin(x / 512)[:, np.newaxis] - r * np.cos(y / 2048)
# Parameters used in the experiment
p0 = 3.0
p1 = 'normal'
# Experiment time
now = datetime.now()
data = xr.DataArray(
raw_data, dims=['x', 'y'],
coords={'x': x, 'y': y,
'X': (('x', 'y'), X),
'Y': (('x', 'y'), Y),
'p0': p0, 'p1': p1,
'datetime': now},
attrs={'camera type': 'flash 4',
'about': 'Example of the recommended data structure.'})
data
<xarray.DataArray (x: 512, y: 2048)> array([[ 0, 1, 2, ..., 2045, 2046, 2047], [ 2048, 2049, 2050, ..., 4093, 4094, 4095], [ 4096, 4097, 4098, ..., 6141, 6142, 6143], ..., [1042432, 1042433, 1042434, ..., 1044477, 1044478, 1044479], [1044480, 1044481, 1044482, ..., 1046525, 1046526, 1046527], [1046528, 1046529, 1046530, ..., 1048573, 1048574, 1048575]]) Coordinates: * x (x) int64 0 1 2 3 4 5 6 7 8 ... 504 505 506 507 508 509 510 511 * y (y) int64 0 1 2 3 4 5 6 7 ... 2041 2042 2043 2044 2045 2046 2047 X (x, y) float64 300.0 300.1 300.3 300.4 ... 414.7 414.8 414.9 414.9 Y (x, y) float64 -300.0 -300.0 -300.0 -300.0 ... 89.66 89.79 89.91 p0 float64 3.0 p1 <U6 'normal' datetime datetime64[ns] 2020-11-13T03:31:56.219372 Attributes: camera type: flash 4 about: Example of the recommended data structure.
By collecting the quantities that can change between experiments as scalar coordinates, it becomes easier later to analyze how the results depend on them.
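To sketch why this helps, here is a hypothetical example (the names p0 and x are made up for illustration): if each run carries its varying parameter as a scalar coordinate, concat promotes it to a dimension coordinate, and the dependence on that parameter can be selected directly.

```python
import numpy as np
import xarray as xr

# Hypothetical runs that differ only in the scalar coordinate p0
runs = []
for p0 in [1.0, 2.0]:
    runs.append(xr.DataArray(p0 * np.arange(4.0), dims=['x'],
                             coords={'x': np.arange(4), 'p0': p0}))

# concat promotes the scalar coordinate p0 into a dimension coordinate
combined = xr.concat(runs, dim='p0')
print(combined.dims)                # ('p0', 'x')
print(combined.sel(p0=2.0).values)  # [0. 2. 4. 6.]
```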
Reduction: .sum, .mean, etc.
Reduction operations such as np.sum and np.mean are provided as methods.
For example, np.sum lets you specify along which axis to sum. Similarly, xarray lets you specify the axis by its name.
data = xr.tutorial.open_dataset('air_temperature')['air']
data.sum('lat')
<xarray.DataArray 'air' (time: 2920, lon: 53)> array([[6984.9497, 6991.6606, 6991.5303, ..., 6998.77 , 7007.8804, 7016.5605], [6976.4307, 6988.45 , 6993.2407, ..., 6994.3906, 7006.7505, 7019.941 ], [6975.2603, 6982.02 , 6988.77 , ..., 6992.0503, 7004.9404, 7020.3506], ..., [6990.7505, 6998.3496, 7013.3496, ..., 6995.05 , 7008.6504, 7019.4497], [6984.95 , 6991.6504, 7007.949 , ..., 6994.15 , 7008.55 , 7020.8506], [6981.75 , 6983.85 , 6997.0503, ..., 6985.6494, 6999.2495, 7012.0493]], dtype=float32) Coordinates: * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
If you sum along a certain axis in this way, the coordinates for that axis disappear from the result. The other axes remain.
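A minimal sketch of this behavior with a synthetic array (the names lat and lon here are only illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones((2, 3)), dims=['lat', 'lon'],
                  coords={'lat': [10.0, 20.0], 'lon': [0.0, 1.0, 2.0]})

# Summing over 'lat' removes that dimension and its coordinate;
# the 'lon' axis survives.
reduced = da.sum('lat')
print(reduced.dims)             # ('lon',)
print('lat' in reduced.coords)  # False
print(reduced.values)           # [2. 2. 2.]
```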
Note that when you call .sum and similar methods, np.nansum is called by default instead of np.sum, so the calculation skips NaN values.
If you do not want to skip NaN (i.e., you want np.sum behavior instead of np.nansum), you need to pass skipna=False.
.plot
DataArray has a plot method, which makes it easy to visualize the contents of the data.
data.sel(time='2014-08-30 08:00', method='nearest').plot()
<matplotlib.collections.QuadMesh at 0x7f9a98557310>
It is better to think of this as a quick look at the data rather than a way to make final figures for a paper or report, but it is very convenient for visualizing and understanding data in Jupyter, IPython, and so on as you proceed with the analysis.
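For reference, a self-contained sketch with synthetic data (the Agg backend is selected so it runs without a display; names are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs in scripts/CI
import xarray as xr

da = xr.DataArray(np.random.rand(4, 5), dims=['lat', 'lon'],
                  coords={'lat': np.linspace(-30.0, 30.0, 4),
                          'lon': np.linspace(0.0, 100.0, 5)})

# For 2-D data, .plot() draws a pcolormesh and returns a QuadMesh,
# with the axes labelled from the dimension names.
mesh = da.plot()
print(type(mesh).__name__)  # QuadMesh
```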