Introduction

This article is a memo of data analysis by python using pandas. I will briefly describe the basics. The variables to which the data read by pandas are assigned are as follows.

an_data

What is pandas

Python data analysis library

pandas import

Import example (Hereafter, usage example is described in pd)

import pandas as pd

Read CSV file and Excel file

Read CSV file

an_data = pd.read_csv("<CSV file name>")

Read Excel file

an_data = pd.read_excel("<Excel file name>")

Display the read data for 5 lines

Useful when you want to check the contents of the data

an_data.head()

Data join

Describes the process when combining separately read data. First data name: an_data1 Second data name: an_data2

Union

Combine the data vertically. (If you set ignore_index = True, the index will be reassigned)

concat_data = pd.concat([an_data1, an_data2], ignore_index=True)

join

Combine the data horizontally.

merge_data = pd.merge(an_data1, an_data2[[<Combine specific keys>]], on="<Join key>", how=left)

How to handle date data

Change the data recognized as a character string to date data

an_data["date"] = pd.to_datetime(an_data["date"])

Extract only year and month

an_data["Year / month"] = an_data["date"].dt.strftime("%Y%m")

Mysterious numbers are read for data such as Excel that has the following notation. Make it possible to handle mysterious numbers as date data.

--Original data that becomes a mysterious number --May 1, 2020 --When the above data is pd.to_datetime (), the following numbers are output (this is an example, so it is different from the actual value). - 456789

ch_time = pd.to_timedelta(an_data.loc[flg_num, "date"].astype("float"), unit="D") + pd.to_datetime("1900/01/01")

If you process one level higher, it will be separated by a hyphen like YYYY-MM-DD. It is troublesome if it is mixed with a slash-separated date, so unify the format (YYYY-MM-DD).

ch_time = pd.to_datetime(an_data.loc[~flg_num, "date"])

Find a value without data

Addition of index containing null data

an_data.isnull().sum()

Contains null data or outputs as boolean (axis is 0 for vertical, 1 for horizontal)

an_data.isnull().any(axis=0)

Data output by describe

The number of data, mean, standard deviation, and quintuplet summary are output.

an_data.describe()

Data type output

<Read data>.dtypes

Example of data summarization by groupby

With the as_index = False option, the index that is automatically created with the name specified by groupby is not created.

Group by the same value that one column has and add the numbers in another column

an_data.groupby(["<Specific column>"]).sum()["<Another column>"]

It is possible to specify multiple specific columns and different columns.

an_data.groupby(["<Specific column A>", "<Specific column B>"]).sum()["<Another column A">, "<Another column b>"]

Group by the same value that one column has and count the values in another column.

tmp = an_data.groupby(["<Specific column>"]).count()["<Another column>"]

group by and aggregation

You can output average etc. by using agg

tmp = an_data.groupby("<Specific column>").agg("mean", "median", "max", "min")["<The column you want to aggregate>"]

pivot_table

Create a new table

an_data = pd.pivot_table(an_data, index="<Column name you want to index>", columns"<The column name you want to be column>", values=["<Column values you want to output to the table>"], aggfunc="", fill_value=0

sort

ascending = True is in ascending order

an_data.sort_values(by=["<Main sort target column>"], ascending=True)

Remove whitespace

an_data["test_val"] = an_data["test_val"].str.replace("  ", "")
an_data["test_val"] = an_data["test_val"].str.replace(" ", "")

Data confirmation by unique

print(len(pd.unique(an_data["val"])))
print(pd.unique(an_data["val"]))

Check the number of parameters whose value is a numerical value

flg_num = an_data["<Column name>"].astype("str").str.isdigit()
flg_num.sum()

loc Label designation

Specify by label

an_data.loc[:, [<column01>, <column02>]]

Specified by boolean

an_data.loc[[True, False, True, False], [False, False, True]]

iloc Numbering

Specify by number

an_data.iloc[[0:2], [2:4]]

Specified by boolean

an_data.iloc[[False, True, True], [False, True, False]]

Rename label

Specifying inplace = True will change the original data

an_data.rename(columns={"<Name before change>":"<Name after change>"}, index={"<Name before change>":"<Name after change>"}, inplace=True)

Reindex

Since the index value is reassigned, the index is not covered.

an_data = an_data.reset_index(drop=False)

Drop processing

an_data = an_data.drop("<Items you want to drop>", axis=1)

where Use where to change the value of the item that "does not" match the condition In the following example, if the value of is not 5 or less, the value of is changed to 1.

an_data["<Target item>"] = an_data["<Target item>"].where(an_data["<Comparison items>"]<5, 1)

relativedelta

delta = relative_delta(dtime1, dtime2)
The content of delta is the difference between dtime1 and dtime2 (dtime1 is the latest date))

Removal of missing values

an_data = an_data.dropna(subsent=["<Column name where the missing value exists>"])

The replacement of the missing part is as follows (filled with 0)

an_data = an_data.fillna(0)

Confirmation is possible at

an_data.isna().sum()

csv output

an_data.to_csv("", index=False)

example graph output

Import example of library for graph output (Hereafter, usage example is described in plt) In addition, the second line describes an example when using it with jupyter notebook.

import matplotlib.pyplot as plt
%matplotlib inline

Read the data to be plotted (listed in the order of horizontal axis, vertical axis, option) At least you should put a label

plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-1')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-2')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-3')

Plot the read data on a graph

plt.legend()

Processing memos often used in pandas (beginners)