Memorandum @ Python OR Seminar: Pandas

pandas

>>> import pandas as pd

For pandas, the book "[Python for Data Analysis](http://www.amazon.co.jp/Python%E3%81%AB%E3%82%88%E3%82%8B%E3%83%87%E3%83%BC%E3%82%BF%E5%88%86%E6%9E%90%E5%85%A5%E9%96%80-%E2%80%95NumPy%E3%80%81pandas%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E5%87%A6%E7%90%86-Wes-McKinney/dp/4873116554)" (Japanese edition: Pythonによるデータ分析入門) covers the details. If you want to study pandas in depth, start there.

Data type

pandas has two main data types, **Series** and **DataFrame**.

Series: a type that holds one column (or one row) of data.

# Ordinary list
>>> lst = [1, 2, 3, 4, 5]
# Series
>>> s = pd.Series(lst)
>>> s
0    1
1    2
2    3
3    4
4    5
dtype: int64

DataFrame: a type that holds tabular data.

# Ordinary dictionary
>>> dic = {"a": [1, 2, 3], "b": [9, 8, 7]}
# DataFrame
>>> df = pd.DataFrame(dic)
>>> df
   a  b
0  1  9
1  2  8
2  3  7
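As a small sketch of how the two types relate (the names `col` and `row` are chosen here only for illustration): selecting one column or one row of a DataFrame gives back a Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [9, 8, 7]})

col = df["a"]     # one column: a Series indexed by the row labels
row = df.iloc[0]  # one row: a Series indexed by the column names

print(type(col).__name__)
print(list(row.index))
```

This is why the description of Series says "column (or row)": both directions of a table come out as the same type.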

I/O

pandas provides reader functions for many file formats.

>>> pd.read*?  # In IPython, * is a wildcard: this lists every name that matches
pd.read_clipboard
pd.read_csv
pd.read_excel
pd.read_fwf
pd.read_gbq
pd.read_hdf
pd.read_html
pd.read_json
pd.read_msgpack
pd.read_pickle
pd.read_sql
pd.read_sql_query
pd.read_sql_table
pd.read_stata
pd.read_table

CSV file reading example

>>> dataset = pd.read_csv('sample.csv')

CSV file writing example

>>> dataset.to_csv('write.csv')
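A minimal round-trip sketch, assuming made-up CSV content in place of `sample.csv`. It reads from and writes to in-memory buffers so that it runs without any files; the column names are borrowed from later in this memo, and the values are invented.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for sample.csv
csv_text = (
    "UID,dtime,Sousyouhi\n"
    "id1,2014-01-01 00:00,1.5\n"
    "id2,2014-01-01 02:00,2.0\n"
)

dataset = pd.read_csv(io.StringIO(csv_text))

# index=False keeps the row numbers out of the written file
out = io.StringIO()
dataset.to_csv(out, index=False)
print(out.getvalue())
```

Without `index=False`, `to_csv` writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back.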

Read data

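When you want to see a few rows of data

>>> dataset.head()  # first 5 rows; pass an integer n to get the first n rows
>>> dataset.tail()  # last 5 rows; pass an integer n to get the last n rows

>>> dataset.iloc[n]  # see row n (.ix was removed in pandas 1.0; use .iloc for positions)
>>> dataset.iloc[m:n]  # see rows m to (n-1)
>>> dataset.iloc[[0, 3, 5]]  # see rows that are not adjacent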
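A minimal sketch of position-based row selection on a small made-up frame. It uses `iloc`, because `.ix` was deprecated and then removed in pandas 1.0; the variable names are illustrative only.

```python
import pandas as pd

dataset = pd.DataFrame({"x": [10, 11, 12, 13, 14, 15]})

first = dataset.head(2)           # first 2 rows
last = dataset.tail(2)            # last 2 rows
one = dataset.iloc[3]             # row at position 3, returned as a Series
span = dataset.iloc[1:4]          # positions 1, 2, 3 (the end is exclusive)
picked = dataset.iloc[[0, 3, 5]]  # arbitrary positions
```

`iloc` always counts positions from 0, regardless of the index labels; use `loc` when you want to select by label instead.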

When you want to see a column of data

>>> dataset.column_name  # attribute access works only when the name is a valid identifier
# Or
>>> dataset['Column name']
# When you want to see multiple columns
>>> dataset[['Column name 1', 'Column name 2', ...,]]

Search

>>> dataset[dataset.UID == 'id1']  # rows whose UID column equals the value 'id1'
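The search above works through a boolean mask. A minimal sketch on made-up data (the values and the names `mask`, `hits`, and `both` are assumptions for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({
    "UID": ["id1", "id2", "id1", "id3"],
    "Sousyouhi": [1.0, 2.5, 0.5, 3.0],
})

mask = dataset.UID == "id1"  # boolean Series, True where UID matches
hits = dataset[mask]         # keeps only the rows where the mask is True

# Conditions combine with & (and) and | (or); each one needs parentheses
both = dataset[(dataset.UID == "id1") & (dataset.Sousyouhi > 0.7)]
```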

Handling of missing values NA

Check for missing values

One way is to combine `len()` and `count()`.

- `len()` returns the size of the data (the number of rows).
- `count()` returns the number of non-NA elements in each column. (With `axis=1`, it counts per row instead.)

>>> len(dataset) - dataset.count()
UID             0
dtime           0
Sousyouhi       0
Hatsudenryou    0
Jikasyouhi      0
Uriden          0
Kaiden          0
Use_AirCon      0
Use_Kyutou      3
Use_Kaden       0
dtype: int64

If you want to take a closer look, combine `isnull()` and `any()`.

- `isnull()`: sets NA elements to True and all others to False.
- `any()`: returns True for each column that contains at least one True, and False otherwise. (With `axis=1`, it works per row instead.)

>>> dataset[dataset.isnull().any(axis=1)]
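A minimal sketch of both checks on a made-up frame with a single missing value (the data is invented; only the column names come from this memo):

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "UID": ["id1", "id2", "id3"],
    "Use_Kyutou": [0.3, np.nan, 0.1],
})

# Number of NA values per column: total rows minus non-NA count
na_per_column = len(dataset) - dataset.count()

# Rows that contain at least one NA
rows_with_na = dataset[dataset.isnull().any(axis=1)]
```

`dataset.isnull().sum()` gives the same per-column counts as the `len()` and `count()` combination.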

Handling of missing values NA

There are two main approaches.

- `dropna()`: deletes the rows that contain NA. (With `axis=1`, it deletes columns instead.)
- `fillna('something')`: replaces NA with `something`.

>>> dataset.dropna()

>>> dataset.fillna(0)  # replace NA with 0

# Fill NA with the previous value (forward fill)
>>> dataset.ffill()
# Fill NA with the next value (backward fill)
>>> dataset.bfill()
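To compare the options side by side, here is a minimal sketch on a made-up column with two consecutive missing values. It uses the `ffill()` and `bfill()` methods, since recent pandas versions deprecate the `method=` argument of `fillna`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v": [1.0, np.nan, np.nan, 4.0]})

dropped = df.dropna()  # rows 0 and 3 remain
zeroed = df.fillna(0)  # NaN becomes 0
forward = df.ffill()   # NaN becomes the previous valid value
backward = df.bfill()  # NaN becomes the next valid value
```

All four calls return a new DataFrame and leave `df` unchanged.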

Summary statistics

`describe()` computes most of the common summary statistics in one call: count, mean, std, min, 25%, 50%, 75%, and max.

>>> dataset.describe()
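A minimal sketch on a made-up column, showing that the result is itself a DataFrame whose rows are the statistic names:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()

# Individual statistics can be looked up by row label
mean_x = summary.loc["mean", "x"]
median_x = summary.loc["50%", "x"]
```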

Grouping

Group rows by the values in a column.

>>> dataset.groupby('Column name')
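`groupby` on its own only builds a grouping object; the values appear once an aggregation is applied. A minimal sketch on made-up data (the values are invented):

```python
import pandas as pd

dataset = pd.DataFrame({
    "UID": ["id1", "id2", "id1", "id2"],
    "Sousyouhi": [1.0, 2.0, 3.0, 6.0],
})

grouped = dataset.groupby("UID")     # lazy: nothing is computed yet
means = grouped["Sousyouhi"].mean()  # one value per UID
stats = grouped["Sousyouhi"].agg(["min", "max"])
```

`means` is a Series indexed by UID, and `stats` is a DataFrame with one column per aggregation.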

Graph

pandas alone can draw several kinds of graphs (it uses matplotlib under the hood).

Time series graph

Set dtime as the index.

>>> tdataset = dataset.copy()
>>> tdataset.index = pd.to_datetime(tdataset.dtime)
>>> tdataset.drop('dtime', axis=1, inplace=True)
>>> b = tdataset[tdataset.UID == 'id1'] \
...                      [['UID', 'Sousyouhi']]
>>> b.plot()
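A minimal sketch of the index conversion on its own, assuming `dtime` holds timestamp strings (the two rows are made up). It skips the plot so that it runs without a display.

```python
import pandas as pd

dataset = pd.DataFrame({
    "dtime": ["2014-01-01 00:00", "2014-01-01 02:00"],
    "Sousyouhi": [1.0, 2.0],
})

tdataset = dataset.copy()
tdataset.index = pd.to_datetime(tdataset["dtime"])  # strings become a DatetimeIndex
tdataset = tdataset.drop("dtime", axis=1)           # the column is now redundant

print(type(tdataset.index).__name__)
```

A DatetimeIndex is what makes time-based operations such as resampling possible later on.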

Resampling

Left as it is (one reading every 2 hours), the graph is too fine-grained. Resample it to one value per day.

>>> num = b.drop('UID', axis=1)  # resample only the numeric column
>>> c = num.resample('1D').mean()  # daily mean; resample() needs an explicit aggregation since pandas 0.18
>>> c.plot()
>>> num.resample('1D').std().plot()  # standard deviation
>>> num.resample('1D').max().plot()  # maximum value
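A minimal sketch of daily resampling on made-up data: one reading every 2 hours over two days, so each day holds 12 readings. The values are simply 0 to 23, which makes the results easy to check by hand.

```python
import pandas as pd

# Hypothetical data: 24 readings, 2 hours apart, starting at midnight
idx = pd.date_range("2014-01-01", periods=24, freq=pd.Timedelta(hours=2))
b = pd.DataFrame({"Sousyouhi": [float(i) for i in range(24)]}, index=idx)

daily_mean = b.resample("D").mean()  # one row per day
daily_max = b.resample("D").max()

print(daily_mean)
```

The first day covers the values 0 to 11 and the second day 12 to 23, so the daily means are 5.5 and 17.5.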

Moving average

>>> c.rolling(12).mean().plot()  # 12-sample window (12 days for daily data); pd.rolling_mean was removed
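A minimal sketch of a moving average with a window of 3 on a made-up series, to show how the window behaves at the start:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value is the mean of the current entry and the 2 before it
ma = s.rolling(3).mean()

print(ma)
```

The first two entries are NaN because the window is not yet full; passing `min_periods=1` to `rolling` would fill them with partial-window means instead.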

Histogram

>>> c.hist()

Box plot

>>> c.boxplot(return_type='axes')

Correlation coefficient

>>> c.corr()
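`corr()` returns a matrix of pairwise Pearson correlation coefficients between the numeric columns. A minimal sketch with made-up columns that are perfectly correlated and perfectly anti-correlated:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],  # y = 2 * x, so the correlation with x is 1
    "z": [4, 3, 2, 1],  # z falls as x rises, so the correlation with x is -1
})

r = df.corr()
print(r)
```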

Correlogram

>>> import matplotlib.pyplot as plt
>>> import statsmodels.api as sm
>>> plt.plot(sm.tsa.acf(b['Column name']))

Scatter plot matrix

>>> pd.plotting.scatter_matrix(c)

Multiple regression

>>> c = c.fillna(0)
>>> m = sm.OLS(c.Sousyouhi,
...            c[['Hatsudenryou', 'Use_AirCon']])  # no intercept unless sm.add_constant is used
>>> r = m.fit()
>>> r.summary2()
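To make explicit what such a fit computes, here is a minimal sketch that uses `numpy.linalg.lstsq` in place of statsmodels. This is a swapped-in technique for illustration, not the method of this memo, and the data is made up so that the true coefficients are known. Like `sm.OLS` without `sm.add_constant`, it fits no intercept.

```python
import numpy as np

# Hypothetical explanatory variables (two columns) and a response
# constructed as y = 2 * x1 + 3 * x2 with no noise
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
])
y = X @ np.array([2.0, 3.0])

# Least-squares solution of X @ coef = y
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(coef)
```

The statsmodels result object adds what this sketch leaves out: standard errors, t-values, and R-squared, all shown by `summary2()`.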

Online documentation

For everything else, see the official pandas documentation.