Click here for details ↓ https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html According to the document, it is a hierarchical index and can be used when manipulating high-dimensional data (roughly).
Panel that handles 3D data structures in pandas has been abolished, so check how to handle it here. Since it is a personal memorandum, it is not exhaustive at all. Therefore, if you have any questions, you have to investigate it as appropriate.
python 3.7.7 pandas 1.1.1 Execution environment: jupyter notebook
As expected, research data cannot be used, so sample data will be created appropriately.
#pandas import
import pandas as pd
#Data creation
array =[['cat','cat','cat','cat','cat','dog','dog','dog','bird','bird'],
['black','white','white','brown','brown','black','white','brown','white','yellow'],
['male','male','female','male','female','female','female','male','male','female'],
[3,5,5,5,4,7,8,10,1,2],
[20,25,26,30,22,77,55,90,10,15],
[3.6,4.5,4.0,5.2,3.0,15.6,10.5,20.1,0.52,0.89]]
#Convert to DataFrame.Transpose rows and columns with T
df = pd.DataFrame(array).T
#column settings
df.columns = ['animal','color','male or female','age','height','weight']
Like this. Since the content is created implicitly, there seems to be a very obese cat.
animal | color | male or female | age | height | weight | |
---|---|---|---|---|---|---|
0 | cat | black | male | 3 | 20 | 3.6 |
1 | cat | white | male | 5 | 25 | 4.5 |
2 | cat | white | female | 5 | 26 | 4 |
3 | cat | brown | male | 5 | 30 | 5.2 |
4 | cat | brown | female | 4 | 22 | 3 |
5 | dog | black | female | 7 | 77 | 15.6 |
6 | dog | white | female | 8 | 55 | 10.5 |
7 | dog | brown | male | 10 | 90 | 20.1 |
8 | bird | white | male | 1 | 10 | 0.52 |
9 | bird | yellow | female | 2 | 15 | 0.89 |
Digression: You can use .to_markdown () to create a markdown notation table. It seems that it was implemented from 1.0.0 or later of pandas? However, in the end, .to_html () is indebted.
index How to set and reset the index
The Column name specified by set_index is the index of the level starting from 0 from animal. level→0 : animal, 1 : color, 2 : male or female
df2 = df.set_index(['animal','color','male or female'])
age | height | weight | |||
---|---|---|---|---|---|
animal | color | male or female | |||
cat | black | male | 3 | 20 | 3.6 |
white | male | 5 | 25 | 4.5 | |
female | 5 | 26 | 4 | ||
brown | male | 5 | 30 | 5.2 | |
female | 4 | 22 | 3 | ||
dog | black | female | 7 | 77 | 15.6 |
white | female | 8 | 55 | 10.5 | |
brown | male | 10 | 90 | 20.1 | |
bird | white | male | 1 | 10 | 0.52 |
yellow | female | 2 | 15 | 0.89 |
By the way, the index levels can be set in any order you like.
df3 = df.set_index(['animal','male or female','color'])
age | height | weight | |||
---|---|---|---|---|---|
animal | male or female | color | |||
cat | male | black | 3 | 20 | 3.6 |
white | 5 | 25 | 4.5 | ||
female | white | 5 | 26 | 4 | |
male | brown | 5 | 30 | 5.2 | |
female | brown | 4 | 22 | 3 | |
dog | female | black | 7 | 77 | 15.6 |
white | 8 | 55 | 10.5 | ||
male | brown | 10 | 90 | 20.1 | |
bird | male | white | 1 | 10 | 0.52 |
female | yellow | 2 | 15 | 0.89 |
Comparing with the above table, it can be confirmed that the values of the first and second levels are interchanged.
As a side note, .set_index () has a parameter drop, which is True by default. This specifies whether to delete the data of the column used as Index. Therefore, when drop = False, the data of the column specified in Index remains as it is as shown below.
df4 = df.set_index(['animal','color','male or female'], drop=False)
animal | color | male or female | age | height | weight | |||
---|---|---|---|---|---|---|---|---|
animal | color | male or female | ||||||
cat | black | male | cat | black | male | 3 | 20 | 3.6 |
white | male | cat | white | male | 5 | 25 | 4.5 | |
female | cat | white | female | 5 | 26 | 4 | ||
brown | male | cat | brown | male | 5 | 30 | 5.2 | |
female | cat | brown | female | 4 | 22 | 3 | ||
dog | black | female | dog | black | female | 7 | 77 | 15.6 |
white | female | dog | white | female | 8 | 55 | 10.5 | |
brown | male | dog | brown | male | 10 | 90 | 20.1 | |
bird | white | male | bird | white | male | 1 | 10 | 0.52 |
yellow | female | bird | yellow | female | 2 | 15 | 0.89 |
Like this. Basically, it seems that there is no problem with the default, so you do not have to worry about it.
To reset the Index, use .reset_index ().
df1 = df2.reset_index()
If you compare df and df1, you can see that the table is the same.
Confirmation that the column name specified in Index is likely to be included in names.
df2.index
>
MultiIndex([( 'cat', 'black', 'male'),
( 'cat', 'white', 'male'),
( 'cat', 'white', 'female'),
( 'cat', 'brown', 'male'),
( 'cat', 'brown', 'female'),
( 'dog', 'black', 'female'),
( 'dog', 'white', 'female'),
( 'dog', 'brown', 'male'),
('bird', 'white', 'male'),
('bird', 'yellow', 'female')],
names=['animal', 'color', 'male or female'])
I wonder if it can be used when checking the column name in Index.
df2.index.names
>
FrozenList(['animal', 'color', 'male or female'])
From here, we will use df2 to see various things.
Use .index.get_level_values (). This will return a specific level of label as a vector.
#Specified by a specific level column name
df2.index.get_level_values('animal')
> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')
Same result as above ↓
#Can also be specified by level number
df2.index.get_level_values(0)
> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')
Avoid duplication with set, convert to list, and sort it to create a new list. If you want to sort in descending order, set the argument reverse to True. (I don't think it makes much sense here.)
#Get label without duplication
a = sorted(list(set(df2.index.get_level_values('animal'))))
a
> ['bird', 'cat', 'dog']
By using .xs (), you can specify the index column name and its value (label) for selection or extraction.
Let's use the list created earlier to check the operation of xs (). The contents are ['bird','cat','dog'] ← Using the values in this list as keys, find the average height for each male and female animal.
M = {}
F = {}
for s in a:
m = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
M[s] = m
f = df2['height'].xs([s, 'female'], level=['animal', 'male or female']).mean()
F[s] = f
#result
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
F
> {'bird': 15.0, 'cat': 24.0, 'dog': 66.0}
In addition to .mean (), .min (), .max (), .std (), .count (), etc. can be used according to the purpose.
For the time being, commentary
First, specify the column name of the value you want to find (this time I want to know the average of'height') df2['height']
Next, use .xs () to make detailed specifications. df2['height'].xs([s, 'male'], level=['animal', 'male or female'])
First, about [s,'male']. This time, in order to calculate the average height of each animal for each male (male) and female (female), specify a label corresponding to the information of "which animal and which gender" in the first argument of .xs ().
Next, about level = ['animal','male or female']. This seems to indicate what level the [s,'male'] you specified earlier is. So, it would be nice if you could correspond to the feeling of s in'animal'and'male' in'male or female'. Since the level can be specified by a number, the result is the same even if level = [0,2].
Finally, I want to find the average, so add .mean ().
The contents of the dictionary look like this ↓
#In the program above.mean()Remove and run
m = df2['height'].xs([s, 'male'], level=['animal', 'male or female'])
#result
M
>
{'bird': color
white 10
Name: height, dtype: object,
'cat': color
black 20
white 25
brown 30
Name: height, dtype: object,
'dog': color
brown 90
Name: height, dtype: object}
For each animal, the value of'height'corresponding to the male (male) can be obtained.
In order, I felt that I should check from here first. I think the usage of .loc is the same as for a normal DataFrame that is not Multiindex.
Data review
df2
>
age height weight
animal color male or female
cat black male 3 20 3.6
white male 5 25 4.5
female 5 26 4
brown male 5 30 5.2
female 4 22 3
dog black female 7 77 15.6
white female 8 55 10.5
brown male 10 90 20.1
bird white male 1 10 0.52
yellow female 2 15 0.89
For Index, specify'cat'of'animal'. Specify all column names of Column.
df2.loc['cat']
#Or
# df2.loc['cat', :]
>
age height weight
color male or female
black male 3 20 3.6
white male 5 25 4.5
female 5 26 4
brown male 5 30 5.2
female 4 22 3
You can specify only the column names of Column that you want to narrow down.
df2.loc['cat', ['height', 'weight']]
>
height weight
color male or female
black male 20 3.6
white male 25 4.5
female 26 4
brown male 30 5.2
female 22 3
The first argument is the label of a specific level of Index, and the second argument is the column name of the Column you want to narrow down.
df2.loc['cat', 'height']
>
color male or female
black male 20
white male 25
female 26
brown male 30
female 22
Name: height, dtype: object
Specify multiple labels within the same level of Index.
df2.loc[['cat', 'bird'], 'height']
>
animal color male or female
cat black male 20
white male 25
female 26
brown male 30
female 22
bird white male 10
yellow female 15
Name: height, dtype: object
This time, how to cross the index level. Narrow down in order of Index level (from 0 in terms of number). If nothing is specified in (), the result is the same as df2.loc [:,'height'].
# 'cat'What is.
df2.loc[('cat'), 'height']
>
color male or female
black male 20
white male 25
female 26
brown male 30
female 22
Name: height, dtype: object
# 'cat'Of'white'What is.
df2.loc[('cat','white'), 'height']
>
male or female
male 25
female 26
Name: height, dtype: object
# 'cat'Of'white'Is'male'What is.
df2.loc[('cat','white','male'), 'height']
> 25
If you narrow down to this point, you can rewrite the value.
df3 = df2.copy()
#Substitute 30
df3.loc[('cat','white','male'), 'height'] = 30
df3.loc[('cat','white','male'), 'height']
> 30
It seems that slices can also be used, but if you do it without thinking, the following error will occur
df2.loc[('cat','brown','male'):('dog','black','female')]
> UnsortedIndexError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'
Solution Sort using .sort_index ().
df5 = df2.sort_index(level=0)
df5
>
age height weight
animal color male or female
bird white male 1 10 0.52
yellow female 2 15 0.89
cat black male 3 20 3.6
brown female 4 22 3
male 5 30 5.2
white female 5 26 4
male 5 25 4.5
dog black female 7 77 15.6
brown male 10 90 20.1
white female 8 55 10.5
Specify level = number as an argument. Since level = 0 was specified this time, it is sorted based on the animal label.
df5.loc[('cat','brown','male'):('dog','black','female')]
>
age height weight
animal color male or female
cat brown male 5 30 5.2
white female 5 26 4
male 5 25 4.5
dog black female 7 77 15.6
('cat','brown'): ('dog','white') can be narrowed down.
pd.indexSlice
idx = pd.IndexSlice
df5.loc[idx[:, :, ['male']], 'height']
>
animal color male or female
bird white male 10
cat black male 20
brown male 30
white male 25
dog brown male 90
Name: height, dtype: object
Looking only at the output result, the male'height'for each animal can be extracted as in the case of using .xs (). So if you want to get the same result as using .xs (), do this. a = ['bird', 'cat', 'dog']
# pd.Use Index Slice
M_idx = {}
for s in a:
M_idx[s] = df5.loc[idx[[s], :, ['male']], 'height'].mean()
M_idx
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
Comparison with when using .xs ()
# .xs()use
M = {}
for s in a:
M[s] = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
Which one to use seems to be personal preference
I feel that Multiindex can be used in various ways once I get used to it, but I'm still not sure how to use it. Later, there may be typographical errors or typographical errors that I overlooked, so I will correct them.
MultiIndex / advanced indexing
Recommended Posts