Last night, in this [Introduction to Data Scientists] series, I summarized the basics of SciPy as part of the basics of using the libraries for scientific calculation, data processing, and graph drawing. Tonight I will summarize the basics of Pandas, supplementing the explanations in the book. 【Caution】 I am reading ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/%E6%9D%B1%E4%BA%AC%E5%A4%A7%E5%AD%A6%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E8%82%B2%E6%88%90%E8%AC%9B%E5%BA%A7-Python%E3%81%A7%E6%89%8B%E3%82%92%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%E3%83%87%E2%80%95%E3%82%BF%E5%88%86%E6%9E%90-%E5%A1%9A%E6%9C%AC%E9%82%A6%E5%B0%8A/dp/4839965250/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=) and summarizing the parts that I have doubts about or find useful. The outline therefore follows the book closely, but please read this on the understanding that the content itself is independent of the book.
"Pandas is a convenient library for so-called preprocessing of data before modeling (with machine learning and so on) in Python ... It lets you perform spreadsheet-like operations as well as data extraction and search."
>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> pd.__version__
'1.0.3'
"Series is like a one-dimensional array ..." But what does "like" mean here? Checking the type and printing the object gives the following.
>>> sample_pandas_data = pd.Series([0,10,20,30,40,50,60,70,80,90])
>>> print(type(sample_pandas_data))
<class 'pandas.core.series.Series'>
>>> print(sample_pandas_data)
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
dtype: int64
Unlike a plain array, a pandas.core.series.Series is indexed: each value carries an index label, shown in the left column of the output.
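To make the "like" a little more concrete, here is a minimal sketch (the variable names s and values are my own, not from the book). It shows that a Series pairs a NumPy array of values with an index, and that pandas supplies a default RangeIndex when no index is given.

```python
import numpy as np
import pandas as pd

# Same data as above; the name `s` is used only for this illustration.
s = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

# The values can be taken out as a NumPy array.
values = s.to_numpy()
print(type(values))

# With no index given, pandas supplies a default RangeIndex (0 to 9).
print(s.index)

# Elements are looked up through their index label.
print(s[3])
```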
According to the reference, "Pandas is based on NumPy, so compatibility is very high." A Series can therefore be created directly from a NumPy array.
>>> array = np.arange(0,100,10)
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
>>> series_sample = pd.Series(array)
>>> series_sample
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
dtype: int32
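The Series inherits the dtype of the NumPy array, which is int32 in this environment. To get int64, specify dtype = 'int64' when creating the array.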
>>> array = np.arange(0,100,10, dtype = 'int64')
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
>>> series_sample = pd.Series(array)
>>> series_sample
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
9 90
dtype: int64
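Besides setting the dtype on the NumPy side, the dtype can also be fixed on the pandas side. The following is a minimal sketch; the names arr, s_default, s_int64 and s_cast are introduced only for illustration. The default integer dtype of np.arange depends on the platform and NumPy version, so the sketch compares dtypes instead of assuming int32.

```python
import numpy as np
import pandas as pd

arr = np.arange(0, 100, 10)                 # platform-default integer dtype
s_default = pd.Series(arr)                  # inherits the dtype of arr
s_int64 = pd.Series(arr, dtype="int64")     # dtype fixed at construction
s_cast = s_default.astype("int64")          # dtype converted afterwards

print(s_default.dtype == arr.dtype)
print(s_int64.dtype, s_cast.dtype)
```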
【reference】 Differences between Pandas and NumPy and how to use them properly
>>> sample_pandas_index_data = pd.Series([0,10,20,30,40,50,60,70,80,90], index = ['a','b','c','d','e','f','g','h','i','j'])
>>> sample_pandas_index_data
a 0
b 10
c 20
d 30
e 40
f 50
g 60
h 70
i 80
j 90
dtype: int64
>>> sample_pandas_index_data.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
>>> sample_pandas_index_data.values
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
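Once a Series has labels, its elements can be reached either by label or by position. This is a minimal sketch using the same values and labels as above; the variable names are hypothetical. Note that a label slice with .loc includes both ends, while a position slice with .iloc excludes the end.

```python
import pandas as pd

s = pd.Series(
    [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    index=["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
)

by_label = s.loc["c"]            # label-based access
by_position = s.iloc[2]          # position-based access (0-origin)
label_slice = s.loc["b":"d"]     # label slice: both ends are included
position_slice = s.iloc[1:3]     # position slice: the end is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```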
Both the values and the index can also be created from NumPy arrays.
>>> array0 = np.arange(0,100,10, dtype = 'int64')
>>> array1 = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> sample_pandas_index_data2 = pd.Series(array0,index = array1)
>>> sample_pandas_index_data2
a 0
b 10
c 20
d 30
e 40
f 50
g 60
h 70
i 80
j 90
dtype: int64
"DataFrame is a two-dimensional array ..." Here the data is converted from a dictionary, and the output is displayed in tabular format.
>>> attri_data1 = {'ID':['100','101','102','103','104'],
... 'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
... 'Birth_year':['1990','1989','1970','1954','2014'],
... 'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
ID City Birth_year Name
0 100 Tokyo 1990 Hiroshi
1 101 Osaka 1989 Akiko
2 102 Kyoto 1970 Yuki
3 103 Hokkaido 1954 Satoru
4 104 Tokyo 2014 Steve
>>> type(attri_data1)
<class 'dict'>
>>> attri_data_frame1=DataFrame(attri_data1, index=['a','b','c','d','e'])
>>> attri_data_frame1
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
c 102 Kyoto 1970 Yuki
d 103 Hokkaido 1954 Satoru
e 104 Tokyo 2014 Steve
>>> attri_data_frame1.T
a b c d e
ID 100 101 102 103 104
City Tokyo Osaka Kyoto Hokkaido Tokyo
Birth_year 1990 1989 1970 1954 2014
Name Hiroshi Akiko Yuki Satoru Steve
>>> attri_data_frame1.Birth_year
a 1990
b 1989
c 1970
d 1954
e 2014
Name: Birth_year, dtype: object
>>> attri_data_frame1[['ID','Birth_year']]
ID Birth_year
a 100 1990
b 101 1989
c 102 1970
d 103 1954
e 104 2014
>>> attri_data_frame1[attri_data_frame1['City']=='Tokyo']
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
e 104 Tokyo 2014 Steve
>>> attri_data_frame1['City']=='Tokyo'
a True
b False
c False
d False
e True
Name: City, dtype: bool
>>> attri_data_frame1[attri_data_frame1['City'].isin(['Tokyo','Osaka'])]
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
e 104 Tokyo 2014 Steve
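Boolean masks like these can also be combined. The following is a minimal sketch, assuming the same sample data; the names df, tokyo_2000s and others are introduced only for illustration. Each condition needs its own parentheses, because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "103", "104"],
        "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
        "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
        "Name": ["Hiroshi", "Akiko", "Yuki", "Satoru", "Steve"],
    },
    index=["a", "b", "c", "d", "e"],
)

# AND: lives in Tokyo and was born in 2000 or later.
# Birth_year is stored as strings, so convert it before comparing numbers.
tokyo_2000s = df[(df["City"] == "Tokyo") & (df["Birth_year"].astype(int) >= 2000)]

# NOT: ~ inverts a mask, here selecting everyone outside Tokyo and Osaka.
others = df[~df["City"].isin(["Tokyo", "Osaka"])]

print(tokyo_2000s)
print(others)
```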
**axis = 1 refers to columns**
>>> attri_data_frame1.drop(['Birth_year'], axis = 1)
ID City Name
a 100 Tokyo Hiroshi
b 101 Osaka Akiko
c 102 Kyoto Yuki
d 103 Hokkaido Satoru
e 104 Tokyo Steve
**axis = 0 refers to rows**
>>> attri_data_frame1.drop(['c','e'], axis = 0)
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
d 103 Hokkaido 1954 Satoru
The operations above do not change the original data.
>>> attri_data_frame1
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
c 102 Kyoto 1970 Yuki
d 103 Hokkaido 1954 Satoru
e 104 Tokyo 2014 Steve
To change the original data itself, pass the option inplace = True.
>>> attri_data_frame1.drop(['c','e'], axis = 0, inplace = True)
>>> attri_data_frame1
ID City Birth_year Name
a 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
d 103 Hokkaido 1954 Satoru
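As an alternative to inplace = True, the result of drop can simply be assigned to a new name, which leaves the original frame available. This is a minimal sketch with a shortened version of the sample data; the names df, trimmed and no_birth are hypothetical. The index= and columns= keywords of drop are used here in place of axis = 0 and axis = 1.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "103", "104"],
        "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
        "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
    },
    index=["a", "b", "c", "d", "e"],
)

trimmed = df.drop(index=["c", "e"])          # same as axis = 0
no_birth = df.drop(columns=["Birth_year"])   # same as axis = 1

print(trimmed)
print(no_birth)
print(len(df))  # the original frame still has all five rows
```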
>>> attri_data1 = {'ID':['100','101','102','103','104'],
... 'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
... 'Birth_year':['1990','1989','1970','1954','2014'],
... 'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
ID City Birth_year Name
0 100 Tokyo 1990 Hiroshi
1 101 Osaka 1989 Akiko
2 102 Kyoto 1970 Yuki
3 103 Hokkaido 1954 Satoru
4 104 Tokyo 2014 Steve
>>> math_pt = [50, 43, 33,76,98]
>>> attri_data_frame1['Math']=math_pt
>>> attri_data_frame1
ID City Birth_year Name Math
0 100 Tokyo 1990 Hiroshi 50
1 101 Osaka 1989 Akiko 43
2 102 Kyoto 1970 Yuki 33
3 103 Hokkaido 1954 Satoru 76
4 104 Tokyo 2014 Steve 98
>>> attri_data2 = {'ID':['100','101','102','105','107'],
... 'Math':[50, 43, 33,76,98],
... 'English':[90, 30, 20,50,30],
... 'Sex':['M', 'F', 'F', 'M', 'M']}
>>> attri_data_frame2=DataFrame(attri_data2)
>>> attri_data_frame2
ID Math English Sex
0 100 50 90 M
1 101 43 30 F
2 102 33 20 F
3 105 76 50 M
4 107 98 30 M
>>> attri_data_frame1
ID City Birth_year Name Math
0 100 Tokyo 1990 Hiroshi 50
1 101 Osaka 1989 Akiko 43
2 102 Kyoto 1970 Yuki 33
3 103 Hokkaido 1954 Satoru 76
4 104 Tokyo 2014 Steve 98
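pd.merge finds the columns that the two frames have in common and merges on them. Here the common columns are ID and Math, so both are used as keys, and rows are joined only when both values match.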
>>> pd.merge(attri_data_frame1,attri_data_frame2)
ID City Birth_year Name Math English Sex
0 100 Tokyo 1990 Hiroshi 50 90 M
1 101 Osaka 1989 Akiko 43 30 F
2 102 Kyoto 1970 Yuki 33 20 F
>>> pd.merge(attri_data_frame1,attri_data_frame2, how = 'outer')
ID City Birth_year Name Math English Sex
0 100 Tokyo 1990 Hiroshi 50 90.0 M
1 101 Osaka 1989 Akiko 43 30.0 F
2 102 Kyoto 1970 Yuki 33 20.0 F
3 103 Hokkaido 1954 Satoru 76 NaN NaN
4 104 Tokyo 2014 Steve 98 NaN NaN
5 105 NaN NaN NaN 76 50.0 M
6 107 NaN NaN NaN 98 30.0 M
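To control which column is used as the key, the on argument can be given explicitly. The following is a minimal sketch with shortened versions of the two frames (the names left, right, default_keys and by_id are my own). When only ID is the key, the Math column that exists on both sides is kept twice and receives the default suffixes _x and _y.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "103", "104"],
        "City": ["Tokyo", "Osaka", "Kyoto", "Hokkaido", "Tokyo"],
        "Math": [50, 43, 33, 76, 98],
    }
)
right = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "105", "107"],
        "Math": [50, 43, 33, 76, 98],
        "English": [90, 30, 20, 50, 30],
    }
)

# No `on`: every common column (ID and Math) is used as a key.
default_keys = pd.merge(left, right)

# Explicit key: only ID is matched, so Math appears as Math_x and Math_y.
by_id = pd.merge(left, right, on="ID", how="inner")

print(default_keys.columns.tolist())
print(by_id.columns.tolist())
```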
Related: Merge, join, concatenate and compare
"Aggregate around a specific column with groupby"
>>> attri_data_frame2.groupby('Sex')['Math'].mean()
Sex
F 38.000000
M 74.666667
Name: Math, dtype: float64
>>> attri_data_frame2.groupby('Sex')['English'].mean()
Sex
F 25.000000
M 56.666667
Name: English, dtype: float64
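The two groupby calls above can be combined, and several statistics can be computed at once. This is a minimal sketch using the same scores; the names scores, means and summary are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "105", "107"],
        "Math": [50, 43, 33, 76, 98],
        "English": [90, 30, 20, 50, 30],
        "Sex": ["M", "F", "F", "M", "M"],
    }
)

# Mean of several columns per group in one call.
means = scores.groupby("Sex")[["Math", "English"]].mean()

# Several statistics of one column per group.
summary = scores.groupby("Sex")["Math"].agg(["mean", "max", "count"])

print(means)
print(summary)
```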
You can sort by index with attri_data_frame1.sort_index().
>>> attri_data_frame1=DataFrame(attri_data1, index=['e','b','a','c','d'])
>>> attri_data_frame1
ID City Birth_year Name
e 100 Tokyo 1990 Hiroshi
b 101 Osaka 1989 Akiko
a 102 Kyoto 1970 Yuki
c 103 Hokkaido 1954 Satoru
d 104 Tokyo 2014 Steve
>>> attri_data_frame1.sort_index()
ID City Birth_year Name
a 102 Kyoto 1970 Yuki
b 101 Osaka 1989 Akiko
c 103 Hokkaido 1954 Satoru
d 104 Tokyo 2014 Steve
e 100 Tokyo 1990 Hiroshi
attri_data_frame1.sort_values(by = ['Birth_year']) sorts by the values in the 'Birth_year' column.
>>> attri_data_frame1.sort_values(by=['Birth_year'])
ID City Birth_year Name
c 103 Hokkaido 1954 Satoru
a 102 Kyoto 1970 Yuki
b 101 Osaka 1989 Akiko
e 100 Tokyo 1990 Hiroshi
d 104 Tokyo 2014 Steve
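In this data Birth_year is stored as strings, so the sort above compares text. That happens to give the right order because every year has four digits. A minimal sketch of a numeric sort in descending order, with hypothetical names df and newest_first, could look like this:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Hiroshi", "Akiko", "Yuki", "Satoru", "Steve"],
        "Birth_year": ["1990", "1989", "1970", "1954", "2014"],
    },
    index=["e", "b", "a", "c", "d"],
)

# Convert the strings to integers so that the values are compared as numbers.
df["Birth_year"] = df["Birth_year"].astype(int)

# ascending=False sorts from the newest year to the oldest.
newest_first = df.sort_values(by="Birth_year", ascending=False)
print(newest_first)
```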
Next, look up specific values with isin and detect missing values with isnull, as preparation for operations such as excluding missing values.
>>> attri_data_frame1.isin(['Tokyo'])
ID City Birth_year Name
e False True False False
b False False False False
a False False False False
c False False False False
d False True False False
>>> attri_data_frame1['Name'] = np.nan
>>> attri_data_frame1
ID City Birth_year Name
e 100 Tokyo 1990 NaN
b 101 Osaka 1989 NaN
a 102 Kyoto 1970 NaN
c 103 Hokkaido 1954 NaN
d 104 Tokyo 2014 NaN
>>> attri_data_frame1.isnull()
ID City Birth_year Name
e False False False True
b False False False True
a False False False True
c False False False True
d False False False True
Count the number of nulls in each column.
>>> attri_data_frame1.isnull().sum()
ID 0
City 0
Birth_year 0
Name 5
dtype: int64
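Once the nulls have been counted, they can be excluded with dropna. The following is a minimal sketch on a small hypothetical frame (not the data from the book) in which the Name column is entirely missing and one City value is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["100", "101", "102"],
        "City": ["Tokyo", np.nan, "Kyoto"],
        "Name": [np.nan, np.nan, np.nan],
    }
)

# Drop every row that contains at least one NaN (all rows here, because of Name).
rows_any = df.dropna()

# Drop only the columns that consist entirely of NaN.
cols_all = df.dropna(axis=1, how="all")

# Drop rows only when a specific column is missing.
rows_city = df.dropna(subset=["City"])

print(len(rows_any))
print(cols_all)
print(rows_city)
```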
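Extract the rows where Math >= 50. The Money column is added first, because it appears in the output.
>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2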
ID Math English Sex Money
0 100 50 90 M 1000
1 101 43 30 F 2000
2 102 33 20 F 500
3 105 76 50 M 300
4 107 98 30 M 700
>>> attri_data_frame2[attri_data_frame2['Math'] >= 50]
ID Math English Sex Money
0 100 50 90 M 1000
3 105 76 50 M 300
4 107 98 30 M 700
Average of Money by Sex (the Money column is assigned from a NumPy array).
>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2
ID Math English Sex Money
0 100 50 90 M 1000
1 101 43 30 F 2000
2 102 33 20 F 500
3 105 76 50 M 300
4 107 98 30 M 700
>>> attri_data_frame2.groupby('Sex')['Money'].mean()
Sex
F 1250.000000
M 666.666667
Name: Money, dtype: float64
You may also want to process missing values, for which the mean of each column is a useful reference.
>>> attri_data_frame2['Money'].mean()
900.0
>>> attri_data_frame2['Math'].mean()
60.0
>>> attri_data_frame2['English'].mean()
44.0
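One common way to process missing values is to fill them with the column mean. This is a minimal sketch, assuming a hypothetical copy of the scores in which one Math value is missing; the names scores, math_mean and filled are introduced only for illustration. mean() skips NaN by default, so the mean is taken over the four remaining values.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {
        "ID": ["100", "101", "102", "105", "107"],
        "Math": [50, 43, np.nan, 76, 98],  # one value made missing on purpose
    }
)

# NaN is skipped: (50 + 43 + 76 + 98) / 4
math_mean = scores["Math"].mean()

# Replace the missing value with that mean.
filled = scores["Math"].fillna(math_mean)

print(math_mean)
print(filled)
```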
Finally, write and read a CSV file, with and without the index. A saved file has to be read in a way that matches whether it was written with the index or not.
>>> attri_data_frame2.to_csv(r'samole0.csv',index=False)
>>> attri_data_frame2.to_csv(r'samole1.csv',index=True)
>>> df = pd.read_csv("samole0.csv")
>>> df
ID Math English Sex Money
0 100 50 90 M 1000
1 101 43 30 F 2000
2 102 33 20 F 500
3 105 76 50 M 300
4 107 98 30 M 700
>>> df = pd.read_csv("samole1.csv")
>>> df
Unnamed: 0 ID Math English Sex Money
0 0 100 50 90 M 1000
1 1 101 43 30 F 2000
2 2 102 33 20 F 500
3 3 105 76 50 M 300
4 4 107 98 30 M 700
>>> df = pd.read_csv("samole1.csv", index_col=0)
>>> df
ID Math English Sex Money
0 100 50 90 M 1000
1 101 43 30 F 2000
2 102 33 20 F 500
3 105 76 50 M 300
4 107 98 30 M 700
When to_csv is called without the index argument, the index is written by default. After all, it is better to be aware of it.
>>> df.to_csv(r'samole3.csv')
>>> df_ = pd.read_csv("samole3.csv")
>>> df_
Unnamed: 0 ID Math English Sex Money
0 0 100 50 90 M 1000
1 1 101 43 30 F 2000
2 2 102 33 20 F 500
3 3 105 76 50 M 300
4 4 107 98 30 M 700
>>> df_ = pd.read_csv("samole3.csv", index_col=0)
>>> df_
ID Math English Sex Money
0 100 50 90 M 1000
1 101 43 30 F 2000
2 102 33 20 F 500
3 105 76 50 M 300
4 107 98 30 M 700
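The same round trip can be tried without creating files by writing to an in-memory buffer. This is a minimal sketch with a small hypothetical frame whose index labels are x, y and z, so that the effect of index_col is easy to see; io.StringIO stands in for the CSV file.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"ID": [100, 101, 102], "Math": [50, 43, 33]},
    index=["x", "y", "z"],
)

buf = io.StringIO()
df.to_csv(buf)            # index=True is the default, so the labels are written

buf.seek(0)
naive = pd.read_csv(buf)  # the labels come back as a column named 'Unnamed: 0'

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # the first column becomes the index again

print(naive.columns.tolist())
print(restored)
```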
・ I summarized the basics of Pandas following this book.
・ Pandas can also draw graphs and perform many other kinds of processing, but I think it can be put to use once you understand the range summarized here.
・ For further learning, I have added links to the relatively easy-to-understand tutorials.
Package overview
Getting started tutorials
・ What kind of data does pandas handle?
・ How do I read and write tabular data?
・ How do I select a subset of a DataFrame?
・ How to create plots in pandas?
・ How to create new columns derived from existing columns?
・ How to calculate summary statistics?
・ How to reshape the layout of tables?
・ How to combine data from multiple tables?
・ How to handle time series data with ease?
・ How to manipulate textual data?
Comparison with other tools