This article is a machine translation of the pandas User Guide page "MultiIndex / advanced indexing" (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), with unnatural sentences revised.
At the time of writing, the latest pandas release is 0.25.3, but with the future in mind, the text of this article is based on the documentation for the development version 1.0.0.
If you find mistranslations, better translations, questions, and so on, please use the comments section or an edit request.
This chapter describes indexing with a MultiIndex (hierarchical index) and other advanced indexing features.
For documentation on basic indexing, see Indexing and selecting data (https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb).
:warning: **Warning** Whether an assignment operation returns a copy or a reference depends on the context. This is called *chained assignment* and should be avoided. See [Returning a view or a copy](https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb#Returning a view or a copy).
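As a minimal sketch of the pattern this warning refers to (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained assignment -- the first indexing step may return a copy,
# so the second step can write to a temporary that is then discarded:
# df[df['a'] > 1]['b'] = 0        # avoid: may silently do nothing

# Preferred: a single .loc call that addresses rows and columns together.
df.loc[df['a'] > 1, 'b'] = 0
```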
See also the cookbook (https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-selection) for more advanced operations.
Hierarchical / multi-level indexing is very useful for sophisticated data analysis and manipulation, especially when working with high-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series (1d) and DataFrame (2d).
In this section, we will show what a "hierarchical" index means and how it integrates with all of the pandas indexing features described above and in previous chapters. Later, when discussing Grouping (https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby) and Pivoting and Reshaping Data (https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping), we will introduce important applications to explain how it helps structure data for analysis.
See also the cookbook (https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-selection) for more advanced strategies.
_Changed in version 0.24.0_: MultiIndex.labels has been renamed to MultiIndex.codes, and MultiIndex.set_labels to MultiIndex.set_codes.
In pandas objects, axis labels are ordinarily stored in an Index object; the MultiIndex object is its hierarchical analogue. You can think of a MultiIndex as an array of tuples, where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set (direct product) of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor attempts to return a MultiIndex when a list of tuples is passed. Below are various ways to initialize a MultiIndex.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
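The MultiIndex.from_arrays() constructor mentioned above builds the same kind of index directly from a list of arrays, without going through tuples; a small sketch:

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]

# Each inner list supplies the labels for one level.
mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
```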
When you want all combinations (the direct product) of the elements of two iterables, the MultiIndex.from_product() method is convenient.
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
You can also create a MultiIndex directly from a DataFrame using the MultiIndex.from_frame() method. This is the complement of MultiIndex.to_frame().
_New in version 0.24.0_
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
....: ['foo', 'one'], ['foo', 'two']],
....: columns=['first', 'second'])
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
You can also create a MultiIndex automatically by passing a list of arrays directly to Series or DataFrame, as shown below.
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All MultiIndex constructors take a names argument that stores string labels for the levels themselves. If no names are provided, None will be assigned.
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can be set on any axis of a pandas object, and the number of **levels** of the index is up to you.
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
In [19]: df
Out[19]:
first bar baz foo qux
second one two one two one two one two
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747
In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first bar baz foo
second one two one two one two
first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
The higher levels of the index have been "sparsified" to make the console output easier to read. You can control how the index is displayed using the multi_sparse option of pandas.set_option().
In [21]: with pd.option_context('display.multi_sparse', False):
....: df
....:
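For illustration, here is a small sketch (using a toy Series rather than the df above) of what disabling display.multi_sparse does: every row repeats its full outer-level label instead of leaving repeats blank.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
s = pd.Series(range(4), index=mi)

with pd.option_context('display.multi_sparse', False):
    # With multi_sparse disabled, both 'a' rows show the 'a' label.
    text = repr(s)
```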
It's worth keeping in mind that nothing prevents you from using tuples as single, indivisible labels.
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one) -1.236269
(bar, two) 0.896171
(baz, one) -0.487602
(baz, two) -0.082240
(foo, one) -2.182937
(foo, two) 0.380396
(qux, one) 0.084844
(qux, two) 0.432390
dtype: float64
The reason MultiIndex matters is that it lets you perform the grouping, selection, and reshaping operations explained in this and subsequent chapters. As you will see in later sections, you can work with hierarchically indexed data without explicitly creating a MultiIndex yourself. However, when loading data from a file, you may want to generate your own MultiIndex while preparing the dataset.
The get_level_values() method returns a vector of the labels at each position for a specific level.
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
One of the key features of a hierarchical index is the ability to select data by a "partial" label identifying a subgroup in the data. **Partial** selection "drops" levels of the resulting hierarchical index in exactly the same way as selecting a column in a regular DataFrame.
In [25]: df['bar']
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [26]: df['bar', 'one']
Out[26]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
In [27]: df['bar']['one']
Out[27]:
A 0.895717
B 0.410835
C -1.413681
Name: one, dtype: float64
In [28]: s['qux']
Out[28]:
one -1.039575
two 0.271860
dtype: float64
For information on how to select on deeper levels, see the Cross-section section below.
A MultiIndex keeps all of the defined levels of an index, even if they are not actually used. You may notice this when slicing the index. For example:
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
In [30]: df[['foo','qux']].columns.levels # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid recomputing the levels and thereby improve slicing performance. To see only the levels that are actually used, you can use the [get_level_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_level_values.html#pandas.MultiIndex.get_level_values) method.
In [31]: df[['foo', 'qux']].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
dtype=object)
# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct a MultiIndex with only the used levels, you can use the [remove_unused_levels()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.remove_unused_levels.html#pandas.MultiIndex.remove_unused_levels) method.
In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
reindex
Operations between differently indexed objects that have a MultiIndex on their axes work as expected; data alignment behaves the same as for an index of tuples.
In [35]: s + s[:-2]
Out[35]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
two NaN
dtype: float64
In [36]: s + s[::2]
Out[36]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
The [reindex()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex) method of Series/DataFrame can also be given another MultiIndex, or even a list or array of tuples.
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[38]:
foo two -0.706771
bar one -0.861849
qux one -1.039575
baz one -0.494929
dtype: float64
Syntactically, getting MultiIndex to work with advanced indexing via .loc is a bit challenging, but every effort has been made to make it work. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [41]: df.loc[('bar', 'two')]
Out[41]:
A 0.805244
B 0.813850
C 1.607920
Name: (bar, two), dtype: float64
For this example, df.loc['bar', 'two'] would also work, but be aware that this shorthand notation can lead to ambiguity in general.
If you also want to index a specific column with .loc, you must use a tuple as follows:
In [42]: df.loc[('bar', 'two'), 'A']
Out[42]: 0.8052440253863785
You do not have to specify all levels of a MultiIndex when passing only the first elements of the tuple. For example, you can use a "partial" index to get all elements with bar in the first level as follows:
df.loc['bar']
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (which, in this example, is also equivalent to df.loc['bar',]).
"Partial" slicing also works quite nicely.
In [43]: df.loc['baz':'foo']
Out[43]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
You can slice with a "range" of values by providing a slice of tuples.
In [44]: df.loc[('baz', 'two'):('qux', 'one')]
Out[44]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
In [45]: df.loc[('baz', 'two'):'foo']
Out[45]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
As with reindexing, you can also pass a list of labels or tuples.
In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[46]:
A B C
first second
bar two 0.805244 0.813850 1.607920
qux one -1.170299 1.130127 0.974466
:ballot_box_with_check: **Note** It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. A tuple is interpreted as one multi-level key, whereas a list points to several keys. In other words, a tuple goes horizontally (traversing levels), while a list goes vertically (scanning levels).
Importantly, a list of tuples refers to several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level.
In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
....:
In [48]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
Out[48]:
A c 1
B d 5
dtype: int64
In [49]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
Out[49]:
A c 1
d 2
B c 4
d 5
dtype: int64
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you were indexing by label — see Selection by Label (https://qiita.com/nkay/items/d322ed9d9a14bdbf14cb#Select by Label) — including slices, lists of labels, individual labels, and boolean indexers.
You can use slice(None) to select all the contents of *that* level. You do not need to specify all the *deeper* levels; they are implied as slice(None).
As usual, this is label-based indexing, so **both ends** of the slicer are included.
:warning: **Warning** In .loc, specify all axes (**index** and **columns**), as there are some ambiguous cases in which the passed indexer could be misinterpreted as indexing *both* axes, rather than, for example, the MultiIndex of the rows.
Write this:
df.loc[(slice('A1', 'A3'), ...), :] # noqa: E999
Do not write this:
df.loc[(slice('A1', 'A3'), ...)] # noqa: E999
In [50]: def mklbl(prefix, n):
....: return ["%s%s" % (prefix, i) for i in range(n)]
....:
In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
....: mklbl('B', 2),
....: mklbl('C', 4),
....: mklbl('D', 2)])
....:
In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
....: ('b', 'foo'), ('b', 'bah')],
....: names=['lvl0', 'lvl1'])
....:
In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
....: .reshape((len(miindex), len(micolumns))),
....: index=miindex,
....: columns=micolumns).sort_index().sort_index(axis=1)
....:
In [54]: dfmi
Out[54]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[55]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[24 rows x 4 columns]
You can use pandas.IndexSlice for a more natural syntax using : rather than slice(None).
In [56]: idx = pd.IndexSlice
In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
You can use this method to make very complex selections on multiple axes at the same time.
In [58]: dfmi.loc['A1', (slice(None), 'foo')]
Out[58]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
... ... ...
B1 C1 D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
[16 rows x 2 columns]
In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[59]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
You can use a boolean indexer to provide selections related to the *values*.
In [60]: mask = dfmi[('a', 'foo')] > 200
In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[61]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
You can also specify the axis argument of .loc to interpret the passed slicers as applying to a single axis.
In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[62]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[32 rows x 4 columns]
Furthermore, you can *set the values* using the following methods:
In [63]: df2 = dfmi.copy()
In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10
In [65]: df2
Out[65]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
[64 rows x 4 columns]
You can also use an alignable object on the right-hand side.
In [66]: df2 = dfmi.copy()
In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000
In [68]: df2
Out[68]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
[64 rows x 4 columns]
Cross-section
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
In [69]: df
Out[69]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [70]: df.xs('one', level='second')
Out[70]:
A B C
first
bar 0.895717 0.410835 -1.413681
baz -1.206412 0.132003 1.024180
foo 1.431256 -0.076467 0.875906
qux -1.170299 1.130127 0.974466
# using slices
In [71]: df.loc[(slice(None), 'one'), :]
Out[71]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
baz one -1.206412 0.132003 1.024180
foo one 1.431256 -0.076467 0.875906
qux one -1.170299 1.130127 0.974466
You can also select columns with xs by specifying the axis argument.
In [72]: df = df.T
In [73]: df.xs('one', level='second', axis=1)
Out[73]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
# using slices
In [74]: df.loc[:, (slice(None), 'one')]
Out[74]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
With xs, you can also select using multiple keys.
In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[75]:
first bar
second one
A 0.895717
B 0.410835
C -1.413681
# using slices
In [76]: df.loc[:, ('bar', 'one')]
Out[76]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
You can keep the selected levels by passing drop_level=False to xs.
In [77]: df.xs('one', level='second', axis=1, drop_level=False)
Out[77]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Compare the above result with drop_level=True (the default).
In [78]: df.xs('one', level='second', axis=1, drop_level=True)
Out[78]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Using the level argument of the [reindex()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex) and [align()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.align.html#pandas.DataFrame.align) methods of pandas objects is useful for broadcasting values across a level. For example:
In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
....: codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
....:
In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
In [81]: df
Out[81]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [82]: df2 = df.mean(level=0)
In [83]: df2
Out[83]:
0 1
one 1.060074 -0.109716
zero 1.271532 0.713416
In [84]: df2.reindex(df.index, level=0)
Out[84]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
# alignment
In [85]: df_aligned, df2_aligned = df.align(df2, level=0)
In [86]: df_aligned
Out[86]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [87]: df2_aligned
Out[87]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
swaplevel
The swaplevel() method can swap the order of two levels.
In [88]: df[:5]
Out[88]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [89]: df[:5].swaplevel(0, 1, axis=0)
Out[89]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
reorder_levels
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the levels of a hierarchical index in one step.
In [90]: df[:5].reorder_levels([1, 0], axis=0)
Out[90]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Renaming names of an Index or MultiIndex
The [rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html#pandas.DataFrame.rename) method, usually used to rename the columns of a DataFrame, can also rename the labels of a MultiIndex. The columns argument of rename accepts a dictionary containing only the columns you wish to rename.
In [91]: df.rename(columns={0: "col0", 1: "col1"})
Out[91]:
col0 col1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
In [92]: df.rename(index={"one": "two", "y": "z"})
Out[92]:
0 1
two z 1.519970 -0.493662
x 0.600178 0.274230
zero z 0.132885 -0.023688
x 2.410179 1.450520
The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, you can specify the level names of a MultiIndex, which is useful later when using reset_index() to move values from the MultiIndex into regular columns.
In [93]: df.rename_axis(index=['abc', 'def'])
Out[93]:
0 1
abc def
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
Note that the columns of a DataFrame are an index, so using rename_axis with the columns argument will change the name of that index.
In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, a Series, or a mapping function to map labels/names to new values.
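As a small sketch of the mapping-function form (hypothetical frame and labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 2)), columns=['alpha', 'beta'])

# Any callable that maps an old label to a new one can be used
# wherever a dictionary is accepted:
renamed = df.rename(columns=str.upper)
```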
If you want to work with the Index object directly rather than through the DataFrame, you can use [Index.set_names()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.set_names.html#pandas.Index.set_names) to change the names.
In [95]: mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
In [96]: mi.names
Out[96]: FrozenList(['x', 'y'])
In [97]: mi2 = mi.rename("new name", level=0)
In [98]: mi2
Out[98]:
MultiIndex([(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['new name', 'y'])
:warning: **Warning** Prior to pandas 1.0.0, it was also possible to set the names of a MultiIndex by updating the name of a level.
>>> mi.levels[0].name = 'name via level'
>>> mi.names[0]  # only works for older pandas
'name via level'
As of pandas 1.0, this silently fails to update the names of the MultiIndex. Use Index.set_names() instead.
Sorting a MultiIndex
Objects with a MultiIndex need to be sorted in order for indexing and slicing to work effectively. As with any index, you can use sort_index().
In [99]: import random
In [100]: random.shuffle(tuples)
In [101]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [102]: s
Out[102]:
qux one 0.206053
foo two -0.251905
bar two -2.213588
one 1.063327
qux two 1.266143
baz one 0.299368
foo one -0.863838
baz two 0.408204
dtype: float64
In [103]: s.sort_index()
Out[103]:
bar one 1.063327
two -2.213588
baz one 0.299368
two 0.408204
foo one -0.863838
two -0.251905
qux one 0.206053
two 1.266143
dtype: float64
In [104]: s.sort_index(level=0)
Out[104]:
bar one 1.063327
two -2.213588
baz one 0.299368
two 0.408204
foo one -0.863838
two -0.251905
qux one 0.206053
two 1.266143
dtype: float64
In [105]: s.sort_index(level=1)
Out[105]:
bar one 1.063327
baz one 0.299368
foo one -0.863838
qux one 0.206053
bar two -2.213588
baz two 0.408204
foo two -0.251905
qux two 1.266143
dtype: float64
You can also pass a level name to sort_index if the levels of the MultiIndex are named.
In [106]: s.index.set_names(['L1', 'L2'], inplace=True)
In [107]: s.sort_index(level='L1')
Out[107]:
L1 L2
bar one 1.063327
two -2.213588
baz one 0.299368
two 0.408204
foo one -0.863838
two -0.251905
qux one 0.206053
two 1.266143
dtype: float64
In [108]: s.sort_index(level='L2')
Out[108]:
L1 L2
bar one 1.063327
baz one 0.299368
foo one -0.863838
qux one 0.206053
bar two -2.213588
baz two 0.408204
foo two -0.251905
qux two 1.266143
dtype: float64
For higher-dimensional objects, you can sort by level on axes other than the index if they have a MultiIndex.
In [109]: df.T.sort_index(level=1, axis=1)
Out[109]:
one zero one zero
x x y y
0 0.600178 2.410179 1.519970 0.132885
1 0.274230 1.450520 -0.493662 -0.023688
Indexing will work even if the data is not sorted, but it will be rather inefficient (and you will see a PerformanceWarning). It will also return a copy of the data rather than a view.
In [110]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
.....: 'joe': ['x', 'x', 'z', 'y'],
.....: 'jolie': np.random.rand(4)})
.....:
In [111]: dfm = dfm.set_index(['jim', 'joe'])
In [112]: dfm
Out[112]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.
Out[4]:
jolie
jim joe
1 z 0.64094
Furthermore, indexing an incompletely sorted index can raise errors like the following:
In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The [is_lexsorted()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.is_lexsorted.html#pandas.MultiIndex.is_lexsorted) method of MultiIndex indicates whether the index is sorted, and the lexsort_depth property returns the sort depth.
In [113]: dfm.index.is_lexsorted()
Out[113]: False
In [114]: dfm.index.lexsort_depth
Out[114]: 1
In [115]: dfm = dfm.sort_index()
In [116]: dfm
Out[116]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [117]: dfm.index.is_lexsorted()
Out[117]: True
In [118]: dfm.index.lexsort_depth
Out[118]: 2
Selection now works as expected.
In [119]: dfm.loc[(0, 'y'):(1, 'z')]
Out[119]:
jolie
jim joe
1 y 0.110968
z 0.537020
Like NumPy ndarrays, pandas Index, Series, and DataFrame provide the take() method, which retrieves elements along a given axis at the given positions. The given indices must be a list or ndarray of integer index positions. take can also accept negative integers as positions relative to the end of the object.
In [120]: index = pd.Index(np.random.randint(0, 1000, 10))
In [121]: index
Out[121]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [122]: positions = [0, 9, 3]
In [123]: index[positions]
Out[123]: Int64Index([214, 329, 567], dtype='int64')
In [124]: index.take(positions)
Out[124]: Int64Index([214, 329, 567], dtype='int64')
In [125]: ser = pd.Series(np.random.randn(10))
In [126]: ser.iloc[positions]
Out[126]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [127]: ser.take(positions)
Out[127]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
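As noted above, take also accepts negative integers as positions relative to the end of the object; a minimal sketch:

```python
import pandas as pd

idx = pd.Index([10, 20, 30, 40])

# -1 is the last position, -2 the second-to-last, and so on.
tail = idx.take([-1, -2])
```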
For DataFrames, the given indices must be a one-dimensional list or ndarray specifying the row or column positions.
In [128]: frm = pd.DataFrame(np.random.randn(5, 3))
In [129]: frm.take([1, 4, 3])
Out[129]:
0 1 2
1 -1.237881 0.106854 -1.276829
4 0.629675 -1.425966 1.857704
3 0.979542 -1.633678 0.615855
In [130]: frm.take([0, 2], axis=1)
Out[130]:
0 2
0 0.595974 0.601544
1 -1.237881 -1.276829
2 -0.767101 1.499591
3 0.979542 0.615855
4 0.629675 1.857704
Note that the take method of pandas objects is not intended to work with boolean indexers and may return unexpected results.
In [131]: arr = np.random.randn(10)
In [132]: arr.take([False, False, True, True])
Out[132]: array([-1.1935, -1.1935, 0.6775, 0.6775])
In [133]: arr[[0, 1]]
Out[133]: array([-1.1935, 0.6775])
In [134]: ser = pd.Series(np.random.randn(10))
In [135]: ser.take([False, False, True, True])
Out[135]:
0 0.233141
0 0.233141
1 -0.223540
1 -0.223540
dtype: float64
In [136]: ser.iloc[[0, 1]]
Out[136]:
0 0.233141
1 -0.223540
dtype: float64
Finally, as a small note on performance: because the take method handles a narrower range of inputs, it can be much faster than fancy indexing.
In [137]: arr = np.random.randn(10000, 5)
In [138]: indexer = np.arange(10000)
In [139]: random.shuffle(indexer)
In [140]: %timeit arr[indexer]
.....: %timeit arr.take(indexer, axis=0)
.....:
219 us +- 1.23 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)
72.3 us +- 727 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
In [141]: ser = pd.Series(arr[:, 0])
In [142]: %timeit ser.iloc[indexer]
.....: %timeit ser.take(indexer)
.....:
179 us +- 1.54 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
162 us +- 1.6 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
So far we have covered MultiIndex quite extensively. Documentation for DatetimeIndex and PeriodIndex can be found in the time series documentation, and documentation for TimedeltaIndex can be found [here](https://dev.pandas.io/docs/user_guide/timedeltas.html#timedeltas-index).
The following subsections highlight some other index types.
CategoricalIndex
CategoricalIndex is an index that helps support indexes with duplicates. It is a container around a Categorical, and it allows efficient indexing and storage of an index with a large number of duplicated elements.
In [143]: from pandas.api.types import CategoricalDtype
In [144]: df = pd.DataFrame({'A': np.arange(6),
.....: 'B': list('aabbca')})
.....:
In [145]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
In [146]: df
Out[146]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [147]: df.dtypes
Out[147]:
A int64
B category
dtype: object
In [148]: df['B'].cat.categories
Out[148]: Index(['c', 'a', 'b'], dtype='object')
Setting the index creates a `CategoricalIndex`.
In [149]: df2 = df.set_index('B')
In [150]: df2.index
Out[150]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Indexing with `__getitem__`/`.iloc`/`.loc` works similarly to an `Index` with duplicates. The indexers must be **in the categories**; otherwise the operation will raise a `KeyError`.
In [151]: df2.loc['a']
Out[151]:
A
B
a 0
a 1
a 5
The `CategoricalIndex` is **preserved** after indexing.
In [152]: df2.loc['a'].index
Out[152]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Sorting the index sorts by the order of the categories (because the index was created with `CategoricalDtype(list('cab'))`, the sort order is `cab`).
In [153]: df2.sort_index()
Out[153]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on indexes retain the properties of the index as well.
In [154]: df2.groupby(level=0).sum()
Out[154]:
A
B
c 4
a 6
b 5
In [155]: df2.groupby(level=0).sum().index
Out[155]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Reindexing operations return a resulting index based on the type of the passed indexer. Passing a list returns a plain `Index`; passing a `Categorical` returns a `CategoricalIndex`, indexed according to the categories of the **passed** `Categorical` dtype. This allows you to arbitrarily index even values that **do not exist** in the categories, similar to how pandas reindexing works elsewhere.
In [156]: df3 = pd.DataFrame({'A': np.arange(3),
.....: 'B': pd.Series(list('abc')).astype('category')})
.....:
In [157]: df3 = df3.set_index('B')
In [158]: df3
Out[158]:
A
B
a 0
b 1
c 2
In [159]: df3.reindex(['a', 'e'])
Out[159]:
A
B
a 0.0
e NaN
In [160]: df3.reindex(['a', 'e']).index
Out[160]: Index(['a', 'e'], dtype='object', name='B')
In [161]: df3.reindex(pd.Categorical(['a', 'e'], categories=list('abe')))
Out[161]:
A
B
a 0.0
e NaN
In [162]: df3.reindex(pd.Categorical(['a', 'e'], categories=list('abe'))).index
Out[162]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, name='B', dtype='category')
: warning: **Warning** Reshaping and comparison operations on a `CategoricalIndex` must have the same categories, or a `TypeError` will be raised.
In [163]: df4 = pd.DataFrame({'A': np.arange(2),
   .....:                     'B': list('ba')})
   .....:
In [164]: df4['B'] = df4['B'].astype(CategoricalDtype(list('ab')))
In [165]: df4 = df4.set_index('B')
In [166]: df4.index
Out[166]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, name='B', dtype='category')
In [167]: df5 = pd.DataFrame({'A': np.arange(2),
   .....:                     'B': list('bc')})
   .....:
In [168]: df5['B'] = df5['B'].astype(CategoricalDtype(list('bc')))
In [169]: df5 = df5.set_index('B')
In [170]: df5.index
Out[170]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, name='B', dtype='category')
In [1]: pd.concat([df4, df5])
TypeError: categories must match existing categories when appending
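One way around this restriction (a sketch, not from the original text) is to recast both indexes to a shared `CategoricalDtype` covering the union of the categories before concatenating. The `common` dtype and the reconstructed `df4`/`df5` below mirror the warning's example:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Reconstruction of the two frames from the warning above
df4 = pd.DataFrame({'A': np.arange(2)},
                   index=pd.CategoricalIndex(list('ba'), categories=list('ab'), name='B'))
df5 = pd.DataFrame({'A': np.arange(2)},
                   index=pd.CategoricalIndex(list('bc'), categories=list('bc'), name='B'))

# Recast both indexes to a common dtype covering the union of categories;
# with matching categories, the concatenation succeeds
common = CategoricalDtype(categories=['a', 'b', 'c'])
df4.index = df4.index.astype(common)
df5.index = df5.index.astype(common)

out = pd.concat([df4, df5])
print(out)
```

This approach keeps the result categorical rather than falling back to an object-dtype index.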
Int64Index and RangeIndex
`Int64Index` is a fundamental basic index in pandas. It is an immutable array implementing an ordered, sliceable set.
`RangeIndex` is a subclass of `Int64Index` that provides the default index for all `NDFrame` objects. `RangeIndex` is an optimized version of `Int64Index` that can represent a monotonic ordered set. It is analogous to Python's `range` type.
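As a minimal sketch of this relationship, the default index on a newly constructed object is a `RangeIndex`, which, like Python's `range`, is defined by its start, stop, and step rather than by materializing every value:

```python
import pandas as pd

# The default index of a newly constructed Series is a RangeIndex
s = pd.Series([10, 20, 30])
print(s.index)  # RangeIndex(start=0, stop=3, step=1)

# A RangeIndex can also be built explicitly from start/stop/step
idx = pd.RangeIndex(start=0, stop=10, step=2)
print(list(idx))  # [0, 2, 4, 6, 8]
```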
Float64Index
By default, a `Float64Index` is automatically created when floating-point values, or a mixture of integer and floating-point values, are passed in indexing. This enables a pure label-based slicing paradigm in which scalar indexing and slicing work identically across `[]`, `ix`, and `loc`.
In [171]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
In [172]: indexf
Out[172]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [173]: sf = pd.Series(range(5), index=indexf)
In [174]: sf
Out[174]:
1.5 0
2.0 1
3.0 2
4.5 3
5.0 4
dtype: int64
Scalar selection with `[]` and `.loc` is always label-based. An integer matches an equal float index (for example, `3` is equivalent to `3.0`).
In [175]: sf[3]
Out[175]: 2
In [176]: sf[3.0]
Out[176]: 2
In [177]: sf.loc[3]
Out[177]: 2
In [178]: sf.loc[3.0]
Out[178]: 2
The only positional indexing is via `iloc`.
In [179]: sf.iloc[3]
Out[179]: 3
A scalar index that is not found will raise a `KeyError`. Slicing is primarily on the values of the index when using `[]`, `ix`, and `loc`, and is **always** positional with `iloc`. The exception is when the slice is boolean, in which case it is always positional.
In [180]: sf[2:4]
Out[180]:
2.0 1
3.0 2
dtype: int64
In [181]: sf.loc[2:4]
Out[181]:
2.0 1
3.0 2
dtype: int64
In [182]: sf.iloc[2:4]
Out[182]:
3.0 2
4.5 3
dtype: int64
The float index allows you to use slices with floating point numbers.
In [183]: sf[2.1:4.6]
Out[183]:
3.0 2
4.5 3
dtype: int64
In [184]: sf.loc[2.1:4.6]
Out[184]:
3.0 2
4.5 3
dtype: int64
On a non-float index, slicing with floats will raise a `TypeError`.
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
The following are common use cases for using this type of index: Imagine an irregular timedelta-like indexing scheme where the data is recorded as a float. This could be, for example, a millisecond offset.
In [185]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
.....: index=np.arange(5) * 250.0,
.....: columns=list('AB')),
.....: pd.DataFrame(np.random.randn(6, 2),
.....: index=np.arange(4, 10) * 250.1,
.....: columns=list('AB'))])
.....:
In [186]: dfir
Out[186]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
1000.4 -0.179734 0.993962
1250.5 -0.212673 0.909872
1500.6 -0.733333 -0.349893
1750.7 0.456434 -0.306735
2000.8 0.553396 0.166221
2250.9 -0.101684 -0.734907
Selection operations always work on a value basis for all selection operators.
In [187]: dfir[0:1000.4]
Out[187]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
1000.4 -0.179734 0.993962
In [188]: dfir.loc[0:1001, 'A']
Out[188]:
0.0 -0.435772
250.0 -0.808286
500.0 -1.815703
750.0 -0.243487
1000.0 1.162969
1000.4 -0.179734
Name: A, dtype: float64
In [189]: dfir.loc[1000.4]
Out[189]:
A -0.179734
B 0.993962
Name: 1000.4, dtype: float64
You can get the first second (1000 milliseconds) of the data as follows:
In [190]: dfir[0:1000]
Out[190]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
If you need integer position-based selection, use `iloc`.
In [191]: dfir.iloc[0:5]
Out[191]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
IntervalIndex
[`IntervalIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex), together with its own dtype `IntervalDtype` and the `Interval` scalar type, provides first-class support for interval notation in pandas.
`IntervalIndex` allows some unique indexing and is also used as the return type of the categories in [`cut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) and [`qcut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html#pandas.qcut).
`IntervalIndex` can be used as the index of a `Series` or a `DataFrame`.
In [192]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
.....: index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
.....:
In [193]: df
Out[193]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
Label-based indexing via `.loc` along the edges of an interval works as you would expect, selecting that particular interval.
In [194]: df.loc[2]
Out[194]:
A 2
Name: (1, 2], dtype: int64
In [195]: df.loc[[2, 3]]
Out[195]:
A
(1, 2] 2
(2, 3] 3
If you select a label *contained* within an interval, that interval is selected.
In [196]: df.loc[2.5]
Out[196]:
A 3
Name: (2, 3], dtype: int64
In [197]: df.loc[[2.5, 3.5]]
Out[197]:
A
(2, 3] 3
(3, 4] 4
Selecting with an `Interval` will only return exact matches (pandas 0.25.0 and later).
In [198]: df.loc[pd.Interval(1, 2)]
Out[198]:
A 2
Name: (1, 2], dtype: int64
If you try to select an interval that is not exactly contained in the `IntervalIndex`, you will get a `KeyError`.
In [7]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError: Interval(0.5, 2.5, closed='right')
To select all `Interval`s that overlap a given `Interval`, create a boolean indexer using the [`overlaps()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.overlaps.html#pandas.IntervalIndex.overlaps) method.
In [199]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [200]: idxr
Out[200]: array([ True, True, True, False])
In [201]: df[idxr]
Out[201]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
cut and qcut
[`cut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) and [`qcut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html#pandas.qcut) both return `Categorical` objects, and the bins they create are stored as an `IntervalIndex` in the `.categories` attribute.
In [202]: c = pd.cut(range(4), bins=2)
In [203]: c
Out[203]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
In [204]: c.categories
Out[204]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
closed='right',
dtype='interval[float64]')
[`cut()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) also accepts an `IntervalIndex` as its `bins` argument, which enables a useful pandas idiom. First, call `cut()` with some data and a fixed number of `bins` to create the initial bins. Then, pass the resulting `.categories` value as the `bins` argument of a later call to `cut()`, and the new data will be binned into exactly the same bins.
In [205]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[205]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value that falls outside all bins is assigned a `NaN` value.
If you need intervals at a regular frequency, you can use the [`interval_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.interval_range.html#pandas.interval_range) function to create an `IntervalIndex` from various combinations of `start`, `end`, and `periods`. The default frequency of `interval_range` is 1 for numeric intervals and one calendar day for datetime-like intervals.
In [206]: pd.interval_range(start=0, end=5)
Out[206]:
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
In [207]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
Out[207]:
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
closed='right',
dtype='interval[datetime64[ns]]')
In [208]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
Out[208]:
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
closed='right',
dtype='interval[timedelta64[ns]]')
The `freq` argument can be used to specify a non-default frequency, and for datetime-like intervals it accepts the various [frequency aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases).
In [209]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[209]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
closed='right',
dtype='interval[float64]')
In [210]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
Out[210]:
IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
closed='right',
dtype='interval[datetime64[ns]]')
In [211]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
Out[211]:
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
closed='right',
dtype='interval[timedelta64[ns]]')
In addition, the `closed` argument can be used to specify on which side(s) the intervals are closed. By default, intervals are closed on the right.
In [212]: pd.interval_range(start=0, end=4, closed='both')
Out[212]:
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
closed='both',
dtype='interval[int64]')
In [213]: pd.interval_range(start=0, end=4, closed='neither')
Out[213]:
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
closed='neither',
dtype='interval[int64]')
_New in version 0.23.0_
If you specify `start`, `end`, and `periods`, the resulting `IntervalIndex` will consist of `periods` evenly-spaced intervals from `start` to `end`.
In [214]: pd.interval_range(start=0, end=6, periods=4)
Out[214]:
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
closed='right',
dtype='interval[float64]')
In [215]: pd.interval_range(pd.Timestamp('2018-01-01'),
.....: pd.Timestamp('2018-02-28'), periods=3)
.....:
Out[215]:
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
closed='right',
dtype='interval[datetime64[ns]]')
Label-based indexing with integer axis labels is a tricky topic. It has been discussed frequently on the mailing list and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index, standard tools like `.loc` allow *only* label-based indexing. The following code raises an exception:
In [216]: s = pd.Series(range(5))
In [217]: s[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-217-76c3dce40054> in <module>
----> 1 s[-1]
~/work/1/s/pandas/core/series.py in __getitem__(self, key)
1076 key = com.apply_if_callable(key, self)
1077 try:
-> 1078 result = self.index.get_value(self, key)
1079
1080 if not is_scalar(result):
~/work/1/s/pandas/core/indexes/base.py in get_value(self, series, key)
4623 k = self._convert_scalar_indexer(k, kind="getitem")
4624 try:
-> 4625 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4626 except KeyError as e1:
4627 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
~/work/1/s/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/work/1/s/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
~/work/1/s/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: -1
In [218]: df = pd.DataFrame(np.random.randn(5, 4))
In [219]: df
Out[219]:
0 1 2 3
0 -0.130121 -0.476046 0.759104 0.213379
1 -0.082641 0.448008 0.656420 -1.051443
2 0.594956 -0.151360 -0.069303 1.221431
3 -0.182832 0.791235 0.042745 2.069775
4 1.446552 0.019814 -1.389212 -0.702312
In [220]: df.loc[-2:]
Out[220]:
0 1 2 3
0 -0.130121 -0.476046 0.759104 0.213379
1 -0.082641 0.448008 0.656420 -1.051443
2 0.594956 -0.151360 -0.069303 1.221431
3 -0.182832 0.791235 0.042745 2.069775
4 1.446552 0.019814 -1.389212 -0.702312
This deliberate decision was made to prevent ambiguity and subtle bugs (many users reported finding bugs when the API was changed to stop "falling back" to position-based indexing).
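If position-based selection is what you actually want, `.iloc` accepts negative positions just like Python lists; a minimal sketch:

```python
import pandas as pd

s = pd.Series(range(5))

# s[-1] raises a KeyError here, because -1 is interpreted as a label
# on the integer index. .iloc is always positional, so negative
# positions work as they do for Python lists.
print(s.iloc[-1])   # 4
print(s.iloc[-2:])  # the last two elements, at positions 3 and 4
```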
If the index of a `Series` or `DataFrame` is monotonically increasing or decreasing, the bounds of a label-based slice can lie outside the range of the index, much like slicing a normal Python list. Monotonicity of an index can be tested with the [`is_monotonic_increasing`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html#pandas.Index.is_monotonic_increasing) and `is_monotonic_decreasing` attributes.
In [221]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))
In [222]: df.index.is_monotonic_increasing
Out[222]: True
#Lines 0, 1 do not exist, but returns lines 2, 3 (both), 4
In [223]: df.loc[0:4, :]
Out[223]:
data
2 0
3 1
3 2
4 3
#An empty DataFrame is returned because the slice is out of index
In [224]: df.loc[13:15, :]
Out[224]:
Empty DataFrame
Columns: [data]
Index: []
On the other hand, if the index is not monotonous, both slice boundaries must be * unique * values of the index.
In [225]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
.....: columns=['data'], data=list(range(6)))
.....:
In [226]: df.index.is_monotonic_increasing
Out[226]: False
#There is no problem because both 2 and 4 are in the index
In [227]: df.loc[2:4, :]
Out[227]:
data
2 0
3 1
1 2
4 3
#0 does not exist in the index
In [9]: df.loc[0:4, :]
KeyError: 0
#3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'
`Index.is_monotonic_increasing` and `Index.is_monotonic_decreasing` only check whether an index is weakly monotonic. To check for strict monotonicity, combine them with the [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_unique.html#pandas.Index.is_unique) attribute.
In [228]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])
In [229]: weakly_monotonic
Out[229]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [230]: weakly_monotonic.is_monotonic_increasing
Out[230]: True
In [231]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[231]: False
Unlike standard Python sequence slicing, where the endpoint is excluded, label-based slicing in pandas is inclusive. The main reason is that it is often not easy to determine the "successor", i.e. the next element after a particular label, in an index. For example, consider the following `Series`:
In [232]: s = pd.Series(np.random.randn(6), index=list('abcdef'))
In [233]: s
Out[233]:
a 0.301379
b 1.240445
c -0.846068
d -0.043312
e -1.658747
f -0.819549
dtype: float64
Suppose you want to slice from c to e using an integer. This is done as follows:
In [234]: s[2:5]
Out[234]:
c -0.846068
d -0.043312
e -1.658747
dtype: float64
However, if you only have `c` and `e`, determining the next element in the index can be somewhat complicated. For example, the following does not work:
s.loc['c':'e' + 1]
A very common use case is to specify a specific time series that starts and ends on two specific dates. To make this possible, we designed the label-based slice to include both endpoints.
In [235]: s.loc['c':'e']
Out[235]:
c -0.846068
d -0.043312
e -1.658747
dtype: float64
This is arguably a case of "practicality beats purity", but be careful if you expect label-based slicing to behave exactly like standard Python integer slicing.
Different indexing operations can potentially change the dtype of a `Series`.
In [236]: series1 = pd.Series([1, 2, 3])
In [237]: series1.dtype
Out[237]: dtype('int64')
In [238]: res = series1.reindex([0, 4])
In [239]: res.dtype
Out[239]: dtype('float64')
In [240]: res
Out[240]:
0 1.0
4 NaN
dtype: float64
In [241]: series2 = pd.Series([True])
In [242]: series2.dtype
Out[242]: dtype('bool')
In [243]: res = series2.reindex_like(series1)
In [244]: res.dtype
Out[244]: dtype('O')
In [245]: res
Out[245]:
0 True
1 NaN
2 NaN
dtype: object
This is because the reindexing operations above implicitly insert `NaN`s and change the `dtype` accordingly. This can cause problems when using numpy ufuncs such as `numpy.logical_and`.
See this [past issue](https://github.com/pydata/pandas/issues/2388) for more information.
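A small illustration of the dtype change (a sketch; the `fillna`/`astype` workaround shown here is one possible approach, not the only one):

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False])
print(s.dtype)  # bool

# Reindexing to a label that does not exist inserts NaN,
# which forces an upcast from bool to object
r = s.reindex([0, 1, 2])
print(r.dtype)  # object

# One possible workaround before applying numpy ufuncs such as
# numpy.logical_and: fill the missing values and cast back to bool
clean = r.fillna(False).astype(bool)
print(np.logical_and(clean, True).tolist())  # [True, False, False]
```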