We have summarized the grammars that are often used in Python's data analysis library "pandas".

Change log

2019-02-18 Updated the display enlargement method
2018-05-06 Reflect comments (pd.set_option('display.width', 100)）
2018-02-14 Link correction
2017-11-01　df.fillna(method='ffill')Corrected the description of
2017-06-09 Correction of broken links, etc.
2016-10-10 Editing examples
2016-06-21　df.rolling, pd.date_range, pd.datetime, df.Add pivot, add other examples

Handling CSV files

pd.read_csv (read)

There are many options available, so you can load anything other than csv.

import pandas as pd

df = pd.read_csv('some.csv')

Example: When reading multiple columns (date and hour) together as a datetime type index (date_hour)

df = pd.read_csv('some.csv', parse_dates={'date_hour':['date', 'hour']}, index_col='date_hour')

Example: When a file containing Japanese and -- are treated as missing data

df = pd.read_csv('some.csv', encoding='Shift_JIS', na_values='--')

--List of options

option	meaning
index_col	Column name to be index
parse_dates	Column name (list or dictionary) to be read as datetime type
date_parser	parse_Self-made function to read the column specified by dates
na_values	Character string (list) to be a missing value
encoding	'Shift_JIS'Such
sep	Delimiter(' ': In the case of space)

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

df.to_csv (write)

It's terribly easy to write to a csv file. There are many options.

df.to_csv('some2.csv')

Example: If you don't need index

df.to_csv('some2.csv', index=None)

Example: When naming the index

df.to_csv('some2.csv', index_label='date')

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Handling of missing values

df.dropna (deleted)

Delete row (index) with missing value (nan)

df = df.dropna()

Example: If a particular column (temp or depth) has a missing value, delete that row (Ignore missing values other than temp and depth)

df = df.dropna(subset=['temp','depth'])

df.fillna (fill in)

Example: Fill in the blanks with a constant (0)

df = df.fillna(0)

Example: Fill in the front (rear) holes (2017/11/01 Corrected to reflect @ hadacchi's comment)

df = df.fillna(method='ffill')  #forward fill in the blank nan 1.0 nan -> nan 1.0 1.0 （forward  =Index increase direction = DataFrame downward direction)
df = df.fillna(method='bfill')  #backward fill-in-the-blank nan 1.0 nan -> 1.0 1.0 nan （backward =Index decrease direction = DataFrame upward direction)

--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

df.interpolate (interpolation)

Many interpolation methods are available.

df = df.interpolate(method='index')

method: linear, time, index, values, nearest, zero, slinear, quadratic, cubic, barycentric, krogh, polynomial, spline, piecewise_polynomial, pchip
For datetime type index etc., it is better to use ʻindex instead of linear`.

--Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html

df.resample (resample)

Change the resolution (frequency) of time series data. When I use it, I often reduce the number of data, so I recognize it as a function for compression.

Example: Convert hourly data to daily average (ignore missing values and average)

#daily = hourly.resample('D', how='mean') <-Old way of writing
daily = hourly.resample('D').mean()

Example: If the value is placed at 00:00, but you want it to be 12:00

daily = hourly.resample('D', loffset='12H').mean()

Time format

Charactor	meaning	Remarks
M	Month	0.5M で半Monthとはならない
D	Day	I'm doing something with 15D
H	time	12H は12time
T or min	Minutes	30min で30Minutes

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html

df.rolling (moving average)

Take the n-term moving average, maximum value, etc.

Example: 3-term moving average (center = True and place the value in the center (2nd term in this case))

ma3 = hourly.rolling(3, center=True).mean()

Handling of time

pd.date_range

Easily create consecutive datetime variables

date = pd.date_range('2012-1-1', '2012-1-2', freq='D')

Example: Since it is a pandas Timestamp as it is above, when returning to the python datetime variable

date = pd.date_range('2012-1-1', '2012-1-2', freq='D').to_pydatetime()

pd.datetime

ʻImport datetime` not needed

date = pd.datetime(2012, 1, 1, 0, 0, 0)

Other frequently used functions

df.describe (summary statistics)

Take a quick look at the number of data, mean, variance, and other statistics for each column.

print df.describe()

df.groupby (aggregate)

Data can be handled easily by grouping by column (converting the value of column so that it can be handled as index). It is easier to understand than stack and unstack.

Example: Group by multiple columns (type and time) and take the average of each group.

grouped_mean = df.groupby(['type','time']).mean()

--Reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

df.pivot（1D→2D）

Image of turning a one-dimensional array into a two-dimensional array

temp2d = df.pivot(index='y', columns='x', values='temp')

Other tips

DataFrame creation (dictionary type)

If you want to create a DataFrame from something other than a file, create a dictionary and load it.

data = {'a':[0, 1], 'b':[2, 3]}

df = pd.DataFrame(data)

Example: When indexing the above example

date = pd.date_range('2012-1-1', '2012-1-2', freq='D')

df = pd.DataFrame(data, index=date)

Expansion of wrapping width when printing

If you don't want the display to wrap when printing a data frame with a large number of columns, you can change the display width.

pd.set_option('display.width', 100)
or
pd.set_option('display.max_columns', 100)  #Can be controlled by the number of columns

#pd.set_option('line_width', 100)  # line_width is deprecated or abolished (2018/05/06 Thanks to @dhwty）

print df  #Does not wrap up to 100 characters (or column)

Reference link

pandas document
Processing datasets with pandas (1) @ ynakayama
pandas memo @ airtoxin
Refer to data frame by condition in Pandas @wwacky

Grammar summary often used in pandas

** Handling CSV files **

** pd.read_csv ** (read)

** df.to_csv ** (write)

** Handling of missing values **

** df.dropna ** (deleted)

** df.fillna ** (fill in)

** df.interpolate ** (interpolation)

** df.resample ** (resample)

** Time format **

** df.rolling ** (moving average)

** Handling of time **

** Other frequently used functions **

** df.describe ** (summary statistics)

** df.groupby ** (aggregate)