Pandas is a library that provides functions to support data analysis in the Python programming language [^wiki]. I think Pandas is on the complicated side even among Python libraries [^atm]. Still, it is so flexible that analyzing data without Pandas is unthinkable for a data analyst. So I would like to explain up to the point of "if you understand this far, you can do anything (by looking things up on other sites)" [^title].

[^wiki]: See https://ja.wikipedia.org/wiki/Pandas
[^atm]: Though there is an atmosphere in which you are not supposed to call it difficult.
[^title]: Pandas is not a language, but the title fits nicely.
1. Preparation

Become able to use numpy (1D) index references, slicing, Boolean index references, and fancy index references.
Become able to use numpy (2D) index references, slicing, and Boolean index references, and understand the behavior of the `np.ix_` function. (Fancy indexing on a 2D ndarray has a specification that is hard to use, so I personally rarely use it.)
2. Introduction to Pandas

Become able to create Series and DataFrame objects and use index references. (Series is basically an extension of numpy (1D); DataFrame, via df.loc (label names take priority) or df.iloc (positions take priority), is basically an extension of numpy (2D).)
Become able to add, extract, delete, and modify data in a Series or DataFrame. (If the elements or index names of a Series or DataFrame are strings, you can extract and modify them in batch operations. This is convenient, so understand string processing with the Pandas str accessor.)
After that, you should be on track and reach a level where you can investigate things yourself (you should be able to understand groupby and so on smoothly).
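As a small taste of where this leads, here is a minimal `groupby` sketch (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'score': [1, 2, 10]})
# Mean score per team; the result is a Series indexed by team name
means = df.groupby('team')['score'].mean()
print(means['A'], means['B'])  # 1.5 10.0
```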
For example, with Numpy:

```python
arr = np.arange(12)      # arr is a one-dimensional ndarray
arr = arr.reshape(3, 4)  # arr is a two-dimensional ndarray
# In arr[i, j], the first element is the row and the second is the column
arr[:2]             # 2D ndarray
arr[:2, 0]          # 1D ndarray
arr[:, arr[0] > 2]  # 2D ndarray
```

and with Pandas:

```python
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df = DataFrame(pop)     # DataFrame (2D)
df[df['Nevada'] > 2]    # DataFrame (2D)
df.iloc[-1:]['Nevada']  # Series (1D)
```
"What type does this expression produce?" If you stay aware of that and understand it, you are about halfway there. So let's summarize the behavior of index references on a 2D ndarray, and then move on to Pandas.
import

```python
import numpy as np  # ndarray
# Needed to display matplotlib plots inline in Jupyter
%matplotlib inline
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import pandas as pd
```
Numpy

Let's take up the two-dimensional ndarray. There are two things to understand here in order to understand Pandas:

Understand Numpy (1D) index references, slicing, Boolean index references, and fancy index references.
Understand two-dimensional Numpy indexing, `arr[<row specification>]` or `arr[<row specification>, <column specification>]`, until you can read it without stumbling.
```python
arr = np.arange(12).reshape(3,4)  # arr is a two-dimensional ndarray (3 rows, 4 columns)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

# Get a one-dimensional ndarray
arr[1]    # element reference by scalar value
arr[0:2]  # slicing: extract rows 0 and 1 (row 2 is not extracted)
# For each element of row 1, return the Boolean value of (> 2)
arr[1] > 2  # array([ True,  True,  True,  True], dtype=bool)

# Get a two-dimensional ndarray
arr > 2  # Boolean index reference
arr[np.array([True, False, True])]  # extract rows 0 and 2 by Boolean index reference
# arr[[True, False, True]]  # Warning
arr[[0,2,1]]  # fancy index reference: index with an array of integers; extracts rows 0, 2, 1 in that order
```
In Numpy (2D), indexing is `arr[· (first argument), · (second argument)]`: the first argument is the row and the second is the column. It is basically the same as for a 1D ndarray, but note the pitfalls that are easy to fall into:

```python
# When you want to specify only the second argument, the first argument cannot be omitted.
# In that case, use the slice `:` as the first argument.
arr[:, 1]

# If you specify a fancy index for both the first and second arguments, the behavior is a little unintuitive.
# (Also note that the result is a one-dimensional ndarray!)
# Equivalent to np.array([arr[i,j] for i,j in zip([1,2], [0,1])])  # array([4, 9])
arr[[1,2], [0,1]]

# To get a 2D ndarray extracting the region of rows 1, 2 and columns 0, 1, do the following:
arr[np.ix_([1,2], [0,1])]
# array([[4, 5],
#        [8, 9]])
```
The chances of using the following in Numpy itself are extremely low, but since the idea becomes important when using Pandas, I describe it below (feel free to skip it for now):

```python
# Column 1 as a Boolean array
arr[:,1] > 2  # array([False,  True,  True], dtype=bool)
# Extract the rows whose column-1 element is (> 2)
arr[arr[:,1] > 2]  # same as arr[np.array([False, True, True])] (I don't use this much personally)
# Also the same as arr[arr[:, 1] > 2, :]

arr[1] > 5
arr[:, arr[1] > 5]  # arr[1] > 5 is array([False, False,  True,  True], dtype=bool)
# arr[:, np.array([False, False, True, True])]  # same as above
```
In summary, the behavior of the index-reference types on a 2D ndarray looks like this [^summary]:

[^summary]: This is a little forced. The "none" column for the second argument refers to `arr[·]`. The parentheses, as in (1d), mean that you don't use that combination much. 1d stands for a 1D `ndarray` and 2d for a 2D `ndarray`.

| first arg \ second arg | none | scalar | slicing | Boolean index | fancy index |
|---|---|---|---|---|---|
| none | - | ❌ | ❌ | ❌ | ❌ |
| scalar | 1d | 0d | 1d | 1d | 1d |
| slicing | 2d | 2d | 2d | 2d | 2d |
| Boolean index | 2d | 1d | 2d | (2d) | (1d) |
| fancy index | 2d | 1d | 2d | (1d) | (1d) |
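A quick check of a few cells of the table (using `ndim` to observe the dimensionality of each result):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr[1, 2].ndim)              # 0  (scalar, scalar)  -> 0d
print(arr[0:2, [1, 3]].ndim)       # 2  (slicing, fancy)  -> 2d
print(arr[arr[:, 0] > 0, 1].ndim)  # 1  (Boolean, scalar) -> 1d
```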
list (slightly different)

```python
# A trap where you can mistake one for the other. I want to multiply each element of arr by 4.
>>> arr = [0,1,2,3]
>>> arr*4
[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
>>> np.arange(4)*4
array([ 0,  4,  8, 12])
# To do the same without converting to numpy, use a comprehension.
>>> [i*4 for i in range(4)]
[0, 4, 8, 12]
```
In Numpy, 1D and 2D arrays are both treated as the same ndarray, but in Pandas they are split: 1D => Series, 2D => DataFrame. So, although the names differ, DataFrame and Series cannot be treated separately, because you constantly go back and forth between 2D <=> 1D.

For example, you can extract a one-dimensional Series by selecting a single row or column of a DataFrame. Conversely, you can create a DataFrame by passing a list or dict of Series (1D) as an argument to the DataFrame (2D) constructor.

So it is important to keep track of whether a variable is one-dimensional or two-dimensional, even though the name changes between Series and DataFrame.
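A minimal sketch of that 1D <=> 2D round trip (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
col = df['A']                   # DataFrame (2D) -> Series (1D)
df2 = pd.DataFrame({'A': col})  # Series (1D) -> DataFrame (2D)
print(type(col).__name__, type(df2).__name__)  # Series DataFrame
```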
Basically, I often pass a dict or list to the constructor. A dict becomes a Series with an index.

```python
# Example of passing a dict
dic = {'word' : 470, 'camera' : 78}
Series(dic)
# The zip + dict combination technique is often used to generate a Series:
Series(dict(zip(words, frequency)))
```
For index references, a Series is an extension of the 1D ndarray. The difference is that index names can also be used as the index argument.

```python
ser = Series(np.random.randn(5), index = list('ABCDE'))
# A    1.700973
# B    1.061330
# C    0.695804
# D   -0.435989
# E   -0.332942
# dtype: float64

ser[1]          # extract row 1, i.e. row 'B', as a 0-dimensional value (float64)
ser['A']        # extract row 'A' (float)
ser[1:3]        # slicing: extract rows 1 and 2 (Series, 1D)
ser[-1:]        # extract the last row
ser[:-1]        # extract all rows except the last
ser[[1,2]]      # extract rows 1 and 2 (fancy index)
ser[['A','B']]  # the (fancy) index can also be given as strings
ser > 0         # Series (1D) whose elements are Boolean values
ser[ser > 0]    # element reference with the Boolean index (ser > 0)
# Since both reads and writes are possible, you can also assign only to the
# matching elements, as below. This technique of putting a condition on the
# left-hand side is often used with DataFrame.
ser[ser > 0] = 0
```
As long as the outside is a list or dict, it doesn't matter what the inside is (list, Series, dict, and tuple all work):

```python
# When both the outside and the inside are dicts
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df2 = DataFrame(pop)
#       Nevada  Ohio
# 2000     NaN   1.5
# 2001     2.4   1.7
# 2002     2.9   NaN

# When the outside is a dict and the inside is a Series
# df1 and df2 are DataFrames (so df1['name'] and df2['address'] are Series)
## column names are ['typeA', 'typeB'], index names are [0,1,2,3]
dfA = DataFrame({'typeA' : df1['name'], 'typeB' : df2['address']})
## index names are [0,1,2,3], column names are ['name', 'address'] (the attribute T transposes)
dfB = DataFrame([df1['name'], df2['address']]).T
```

We often use the dict + built-in zip function to create a DataFrame:

```python
dict(zip([1,2,3], [4,5,6,7]))  # {1: 4, 2: 5, 3: 6} => cannot be converted to a DataFrame
list(zip([1,2,3], [4,5,6,7]))  # [(1, 4), (2, 5), (3, 6)] => can be converted (outside: list, inside: tuples)
pd.DataFrame(list(zip([1,2,3], [4,5,6,7])))  # => OK!
```
You can also create a DataFrame by passing in a 2D ndarray:

```python
df = DataFrame(np.arange(12).reshape(3,4), columns = list('ABCD'))
print(df)
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11
```

You can also create a DataFrame by combining the constructor with a Series:

```python
DataFrame(Series({'word' : 470, 'camera' : 78}), columns = ['frequency'])
```

Creating a DataFrame from a Series will be discussed in detail in the data-addition section of the beginner's edition.
In Pandas, index references can be made with df[·], with df.loc[<row specification>] or df.loc[<row specification>, <column specification>], or with df.iloc[<row specification>] or df.iloc[<row specification>, <column specification>]. df[·] behaves rather confusingly, as follows:

```python
# Often used
# dfA[1]  # runtime error!! a column cannot be retrieved by an integer value here
dfA['typeA']             # extract column 'typeA' as a Series (1D)
dfA[['typeB', 'typeA']]  # extract columns typeB, typeA (in that order) as a DataFrame (2D)
dfA['typeA'] > 3         # Series (1D) whose elements are Boolean values

# A little confusing (I use it a lot personally)
dfA[dfA['typeA'] > 3]  # extract the rows where dfA's 'typeA' column is greater than 3
# dfA.loc[dfA['typeA'] > 3]  # if you are worried, use this

# The following is quite confusing, so I don't use it much
dfA[1:]  # extract rows 1 onward as a DataFrame (2D) (note that this is a ROW extraction)
# Rather than dfA[1:], I would write:
dfA.loc[1:]  # makes it clear that rows are being specified; or dfA.loc[1:, :]
```

df.loc is a version of Numpy indexing where you can also specify label names. So basically, you can write it in the same spirit as Numpy index references.
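One difference from Numpy worth knowing: label slices with `df.loc` include the end point, while positional slices do not:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 3]}, index=list('ABCD'))
print(len(df.loc['A':'C']))  # 3 -- with labels, the end point 'C' is included
print(len(df.iloc[0:2]))     # 2 -- positional slicing excludes the end point
```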
df.loc

However, there are two things to keep in mind when dealing with df.loc (quite important and easy to get stuck on).

One is that df.loc gives priority to label names, so even when an integer value is specified, the positional index number is not referenced: the row whose label name matches is extracted. For example, when you sort and then want to extract the first row, it is quite easy to have an accident:

```python
dic = list(zip([0,3,5,6], list('ADCB')))
dfA = DataFrame(dic, columns = ['typeA', 'typeB'])
#    typeA typeB
# 0      0     A
# 1      3     D
# 2      5     C
# 3      6     B
dfA = dfA.sort_values(by = 'typeB')
#    typeA typeB
# 0      0     A
# 3      6     B
# 2      5     C
# 1      3     D
dfA.loc[1]  # I want to extract row 1 (i.e. the second row), but loc extracts the row whose index LABEL is 1:
# typeA    3
# typeB    D
# Name: 1, dtype: object

## To prevent such a tragedy, use df.iloc: the row number takes priority.
## Extracts (# 3    6    B)
dfA.iloc[1]
```

iloc is often used after an extraction. (If the index is not in numerical order, rows cannot be referenced with loc[number].)

```python
df = df[df['A'] == name]
df.iloc[0]['B']  # feels a little awkward...
```
The other is a trap that is easy to fall into when dealing with an integer index: if you want to extract the last row, referencing -1 with df.loc will fail. Since label names take priority, you are told that there is no label -1. Again, use df.iloc to make it explicit that the extraction is by row number.

```python
# dfA.loc[-1]  # NG
dfA.iloc[-1]   # OK (the last row is extracted as a Series (1D))
dfA.iloc[-1:]  # OK (the last row is extracted as a DataFrame (2D))
```

Conversely, df.iloc can only take numbers, so if you want to specify rows by row number and columns by label name, write as follows:

```python
df.iloc[i]['A']  # it is fine to write it like this
# iloc can only specify columns by number:
# Location based indexing can only have
# [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
# res *= df.iloc[i, 'A']  # error
```
To summarize DataFrame index references:

If you feel uneasy, use df.loc[<row specification>] or df.loc[<row specification>, <column specification>] rather than df[·].
However, if you want to emphasize extraction by row number, use df.iloc[<row specification>] or df.iloc[<row specification>, <column specification>].

If you keep only these two points in mind, you can extract data as pleasantly as with Numpy index references. It would be really easy if the single df.loc form were all you had to remember, but integer indexes are so common in practice that you can't avoid using df.iloc :sweat:

Note that with loc and iloc, an lvalue must not use an indexer twice, as below; you would be writing to a copy, and the value you want to modify in the original DataFrame is not modified:

```python
# "A value is trying to be set on a copy of a slice from a DataFrame"
df.loc[5]['colA']  # cannot be used as an lvalue
# No problem (because it is a single reference):
df.loc[k, 'non_view_rate'] *= mult
```
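A runnable sketch of the safe pattern (the column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3]})
# The chained form df.loc[1]['colA'] = 99 would write into a temporary copy.
# A single .loc call selects the row and column at once and writes through:
df.loc[1, 'colA'] = 99
print(df.loc[1, 'colA'])  # 99
```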
So far we've looked at Pandas index references. Maybe we're over the mountain, but there are still some confusing parts, such as additions and modifications to a DataFrame. For the basic usage of each function, see [Python for Data Analysis --- Data Wrangling with NumPy and pandas](https://www.amazon.co.jp/dp/4873116554) [1]; here I would like to summarize things in a reverse-lookup way.

```python
ser = Series([1,2,3], index = list('ABC'))
# A    1
# B    2
# C    3
# dtype: int64
```

This will be written as Series (3*1). The index names are all the same (['A', 'B', 'C']). Let's see how to concatenate various patterns of data.
(Series(3*1) <- Series(3*1)) -> DataFrame

DataFrame (2*3):

```python
DataFrame([s1, s2])  # using the constructor
```

DataFrame (3*2):

```python
df = DataFrame([s1, s2], index = list('AB')).T
pd.concat([s1, s2], axis = 1)  # to line Series up as columns, use concat(..., axis = 1)
```

DataFrame (6*1):

```python
serA.append(serB)
# or
pd.concat([serA, serB])
# If you want the index to be a serial number from 0:
s1.append(s2).reset_index(drop = True)  # renumber the index
```

DataFrame (1*6):

```python
df1 = DataFrame(serA)
df2 = DataFrame(serB)
ndf = df1.join(df2, how = 'outer', lsuffix = 'A', rsuffix = 'B')  # hmm...
# join as used here can only connect two.
ndf = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
#    0A  1A  2A  0B  1B  2B
# 0   1   2   3   4   5   6
```
(DataFrame(n*3) <- Series(3*1)) => DataFrame

DataFrame ((n+1)*3):

```python
# "Can only append a Series if ignore_index=True or if the Series has a name"
df.append(serA, ignore_index = True)

cols = ['colA', 'colB', 'colC']
res_df = DataFrame(columns = cols)
res_df = res_df.append(Series([1,2,3], cols).T, ignore_index = True)
...
```

(Series(3*1) + Series(3*1) + Series(3*1) + Series(3*1)) -> DataFrame (4*3):

```python
df = DataFrame([serA, serB, serC, serD])
# If you want a DataFrame (3*4), just add .T
df = DataFrame([serA, serB, serC, serD]).T
```
```python
# Add one row
df.loc['newrow'] = 0
df.append(serA, ignore_index = True)
# Add multiple rows
df1.append(df2)
# Add multiple DataFrames (pass a list)
df1.append([df2, df3, df4])
# or
pd.concat([df1, df2, df3, df4])

# Add one column
df['newcol'] = 0

# Non-matching indexes become an outer join (filled with NaN values)
df1.join(df2, how = 'outer')
df1.join([df2, df3], how = 'outer')
# merge allows finer control, but is limited to merging two DataFrames:
df1.merge(df2, how = 'outer')
```

For df1.merge(df2), please refer to [1]. For other details, the official page http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html or http://sinhrks.hatenablog.com/entry/2015/01/28/073327 is good, I think.

The latter explains

+ Simple vertical concatenation: DataFrame.append
+ Flexible concatenation: pd.concat
+ Join by column value: pd.merge
+ Join by index: DataFrame.join (an easy-to-use version of merge)

with figures and examples, so it is easy to understand.
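A minimal side-by-side of vertical concatenation versus a column-value join (data and column names made up for illustration; note that in current pandas, `DataFrame.append` has been removed in favor of `pd.concat`):

```python
import pandas as pd

df1 = pd.DataFrame({'k': [1, 2], 'a': ['x', 'y']})
df2 = pd.DataFrame({'k': [2, 3], 'b': ['p', 'q']})

stacked = pd.concat([df1, df2])                   # vertical concatenation: 4 rows, NaN where columns differ
merged = pd.merge(df1, df2, on='k', how='outer')  # outer join on column 'k': keys 1, 2, 3
print(len(stacked), len(merged))  # 4 3
```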
```python
# Rename the index
df.index = ['one', 'two', 'three']
# Reassign index numbers
df.reset_index(drop = True)  # renumber the index (from 0)

# Rename columns
## After creating/editing a table, the columns may not be in the expected order,
## so it is safer to specify the column order explicitly.
df = df[['old_a', 'old_b', 'old_c']]
df.columns = ['new_a', 'new_b', 'new_c']

# Or use df.rename
df = df[['old_a', 'old_b', 'old_c']]  # either way, this can be skipped if you don't care about column order
# Since rename is not a destructive method, the result must be assigned back.
# Specify the columns parameter (note that it is not an axis parameter).
df = df.rename(columns = {'old_a' : 'new_a', 'old_b' : 'new_b', 'old_c' : 'new_c'})
```

Note 1) You can also use df.rename when you want to change only some index or column names. (Specify a dict (as a before => after correspondence table) in the index or columns parameter. Note that there is no axis parameter, and that it is columns (with an s), not column.)

Note 2) reindex is a rearrangement of the existing index positions, not an index renaming. set_index creates a new object using one or more specific columns as the index, as in df.set_index(['c1','c0']); note that this is not a method for renaming the index either. reset_index converts a (hierarchical) index back into columns. They are simply in the relationship set_index <=> reset_index.
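A small round-trip sketch of that set_index <=> reset_index relationship (column names made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'c0': [1, 2], 'c1': ['a', 'b'], 'v': [10, 20]})
indexed = df.set_index(['c1', 'c0'])  # move columns into a hierarchical index
restored = indexed.reset_index()      # and back: the index levels become columns again
print(list(restored.columns))  # ['c1', 'c0', 'v']
```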
```python
# Focus on column 'A': select the rows where it is 'wrong', and in those rows change column 'B' to 'sth'
df.loc[df['A'] == 'wrong', 'B'] = 'sth'
```

For sorting there are sort_index (sorts by the index) and sort_values (sort key given with by); both can sort in ascending or descending order (for descending order, specify ascending = False).

For the pitfalls of df[·], see http://naotoogawa.hatenablog.jp/entry/2015/09/12/PandasのDataFrameの嵌りどころ

```python
# Enclose each Boolean index in parentheses
df = df[(df['A'] > 0) | (df['B'] > 0)]
```

When a df[·] element should be among multiple candidates:

```python
# apply takes as its first argument a function (e.g. a lambda) whose argument is a Series
# map takes as its first argument a function (e.g. a lambda) whose argument is an element
df = df[df['A'].map(lambda d : d in listA)]
```
Delete rows and columns with df.drop (non-destructive). By specifying axis you can delete either rows or columns.

```python
df = df.drop("A", axis=1)
# When the columns are 'A', 'B', .., 'F' and you want to delete columns 'C' through 'F',
# it is more common to do the following:
df = df[['A', 'B']]
```

See http://nekoyukimmm.hatenablog.com/entry/2015/02/25/222414.
(For dropping rows that are entirely NA, there is how = 'all'.)

```python
# Returns a DataFrame
df.apply(lambda ser: ser % 2 == 0)
df.applymap(lambda x: x % 2 == 0)
df.isin([1,2])
df = df[~df.index.duplicated()]  # remove duplicated indexes (drops the data appearing from the second occurrence on)

# Returns a Series
df['goal'] == 0
df.apply(lambda ser : (ser > 0).any())
df['A'].map(lambda x : x > -1)
serA > serB  # Series type
~bool_ser    # flip each element of a Boolean index

# The second argument applies only to the elements that are False for the first argument
df['A'].where(df['A'] > 0, -df['A'])  # Series version of abs (where the condition fails, a minus sign is applied; i.e. a negative becomes positive)
(df['goal'] == 0).all()  # True if every element meets the condition
df.apply(lambda ser: ser % 2 == 0)
(df['cdf(%)'] < 90).sum()  # count the number meeting the condition
df.where(df % 3 == 0, -df)
```
It's quite common to get stuck around NA, so make a note of where NA values may be generated.

```python
dic = dict(zip(list('ABCD'), [3,4,6,2]))  # generate a dict
ser = Series(dic, index = list('ABCDE'))
# 'E', which is not in dic, becomes NaN
# A    3.0
# B    4.0
# C    6.0
# D    2.0
# E    NaN
# dtype: float64
```
(Example)

```python
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df = DataFrame(pop)
#       Nevada  Ohio
# 2000     NaN   1.5
# 2001     2.4   1.7
# 2002     2.9   NaN
```

When df.reindex is given something in the index parameter that is not in df's index. (Example omitted.)

When loc is given labels that do not exist:

```python
df.loc[[2002, 2001, 1999], ['Alaska', 'Nevada']]
#       Alaska  Nevada
# 2002     NaN     2.9
# 2001     NaN     2.4
# 1999     NaN     NaN
```

(Note: in current pandas, passing missing labels in a list to .loc raises a KeyError; the NaN-filling behavior shown here is from older versions.)
Note) df['non_exists'] and df.loc[:, 'non_exists'] (specifying a name that is not among the columns) are errors.

Other places where NA values appear:

+ Addition between DataFrames (NaN for the elements of index or columns that do not correspond)
+ merge and join when how = 'outer' is specified
+ append and concat (where the indexes do not correspond)
Ways to handle NA values:

+ df.dropna (parameters how and axis), df.fillna(0) (set NA values to 0 uniformly)
+ Specify the fill_value parameter or the method parameter of df.reindex
+ combine_first: only the NA values of old_df are filled in from the argument. (Note that elements that are 0 in old_df are not subject to patching!)

```python
# df's index has gaps (index.name : B_idx, columns = ['A']) => I want a serial-number index (0~89) with the interpolated values set to 0
old_df = DataFrame(index = range(90), columns = ['A'])
new_df = old_df.combine_first(df).fillna(0)  # index.name disappears
```
The string operations of Series deserve a special mention, because they are quietly used all the time. They can be used not only on elements but also on index names and column names!

See http://pandas.pydata.org/pandas-docs/stable/text.html (especially the bottom) for more information. If you want to manipulate strings in a DataFrame, you can usually solve it by looking there.

You can use them, for example, when you want to extract only the rows that match a certain regular expression:

```python
# Update df by extracting only the rows whose column 'A' starts with a lowercase letter
r = '^[a-z]'
df = df[df['A'].str.match(r)]  # df['A'].str.match(r) is a Boolean index
```
[1] [Python for Data Analysis --- Data Wrangling with NumPy and pandas](https://www.amazon.co.jp/dp/4873116554), Wes McKinney
[2] http://sinhrks.hatenablog.com/entry/2015/01/28/073327
[3] Documentation: http://pandas.pydata.org/pandas-docs/stable/api.html