This article is a memo of data analysis by python using pandas. I will briefly describe the basics. The variables to which the data read by pandas are assigned are as follows.
Python data analysis library
Import example (Hereafter, usage example is described in pd)
import pandas as pd
an_data = pd.read_csv("<CSV file name>")
an_data = pd.read_excel("<Excel file name>")
Useful when you want to check the contents of the data
an_data.head()
Describes the process when combining separately read data. First data name: an_data1 Second data name: an_data2
Combine the data vertically. (If you set ignore_index = True, the index will be reassigned)
concat_data = pd.concat([an_data1, an_data2], ignore_index=True)
Combine the data horizontally.
merge_data = pd.merge(an_data1, an_data2[[<Combine specific keys>]], on="<Join key>", how=left)
Change the data recognized as a character string to date data
an_data["date"] = pd.to_datetime(an_data["date"])
Extract only year and month
an_data["Year / month"] = an_data["date"].dt.strftime("%Y%m")
Mysterious numbers are read for data such as Excel that has the following notation. Make it possible to handle mysterious numbers as date data.
--Original data that becomes a mysterious number --May 1, 2020 --When the above data is pd.to_datetime (), the following numbers are output (this is an example, so it is different from the actual value). - 456789
ch_time = pd.to_timedelta(an_data.loc[flg_num, "date"].astype("float"), unit="D") + pd.to_datetime("1900/01/01")
If you process one level higher, it will be separated by a hyphen like YYYY-MM-DD. It is troublesome if it is mixed with a slash-separated date, so unify the format (YYYY-MM-DD).
ch_time = pd.to_datetime(an_data.loc[~flg_num, "date"])
Addition of index containing null data
an_data.isnull().sum()
Contains null data or outputs as boolean (axis is 0 for vertical, 1 for horizontal)
an_data.isnull().any(axis=0)
The number of data, mean, standard deviation, and quintuplet summary are output.
an_data.describe()
<Read data>.dtypes
With the as_index = False option, the index that is automatically created with the name specified by groupby is not created.
an_data.groupby(["<Specific column>"]).sum()["<Another column>"]
It is possible to specify multiple specific columns and different columns.
an_data.groupby(["<Specific column A>", "<Specific column B>"]).sum()["<Another column A">, "<Another column b>"]
tmp = an_data.groupby(["<Specific column>"]).count()["<Another column>"]
You can output average etc. by using agg
tmp = an_data.groupby("<Specific column>").agg("mean", "median", "max", "min")["<The column you want to aggregate>"]
pivot_table
Create a new table
an_data = pd.pivot_table(an_data, index="<Column name you want to index>", columns"<The column name you want to be column>", values=["<Column values you want to output to the table>"], aggfunc="", fill_value=0
ascending = True is in ascending order
an_data.sort_values(by=["<Main sort target column>"], ascending=True)
an_data["test_val"] = an_data["test_val"].str.replace(" ", "")
an_data["test_val"] = an_data["test_val"].str.replace(" ", "")
print(len(pd.unique(an_data["val"])))
print(pd.unique(an_data["val"]))
flg_num = an_data["<Column name>"].astype("str").str.isdigit()
flg_num.sum()
loc Label designation
an_data.loc[:, [<column01>, <column02>]]
an_data.loc[[True, False, True, False], [False, False, True]]
iloc Numbering
an_data.iloc[[0:2], [2:4]]
an_data.iloc[[False, True, True], [False, True, False]]
Specifying inplace = True will change the original data
an_data.rename(columns={"<Name before change>":"<Name after change>"}, index={"<Name before change>":"<Name after change>"}, inplace=True)
Since the index value is reassigned, the index is not covered.
an_data = an_data.reset_index(drop=False)
an_data = an_data.drop("<Items you want to drop>", axis=1)
where
Use where to change the value of the item that "does not" match the condition
In the following example, if the value of
an_data["<Target item>"] = an_data["<Target item>"].where(an_data["<Comparison items>"]<5, 1)
relativedelta
delta = relative_delta(dtime1, dtime2)
The content of delta is the difference between dtime1 and dtime2 (dtime1 is the latest date))
an_data = an_data.dropna(subsent=["<Column name where the missing value exists>"])
The replacement of the missing part is as follows (filled with 0)
an_data = an_data.fillna(0)
Confirmation is possible at
an_data.isna().sum()
an_data.to_csv("", index=False)
Import example of library for graph output (Hereafter, usage example is described in plt) In addition, the second line describes an example when using it with jupyter notebook.
import matplotlib.pyplot as plt
%matplotlib inline
Read the data to be plotted (listed in the order of horizontal axis, vertical axis, option) At least you should put a label
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-1')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-2')
plt.plot(an_data.index, an_data['<Items on the vertical axis>'], label='label-3')
Plot the read data on a graph
plt.legend()
Recommended Posts