Pandas memo

I will update this memo from time to time as I come across Pandas-related topics while learning Python.

Pandas: a library that provides functions to support data analysis.

import

```python
import pandas as pd
```

Data capture

Read CSV [read_csv]

```python
csv_test_1 = pd.read_csv('hoge.csv')
```

Read Excel [read_excel]

```python
excel_data = pd.read_excel('hoge.xlsx')
```

Data join (union)

Vertical combination of data [concat]

```python
csv_test_2 = pd.read_csv('hoge_2.csv')
csv_test = pd.concat([csv_test_1, csv_test_2], ignore_index=True)
csv_test.head()
```
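To try the vertical combination without any CSV files, here is a minimal sketch. The two small DataFrames stand in for `hoge.csv` and `hoge_2.csv`; their column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the contents of hoge.csv and hoge_2.csv
csv_test_1 = pd.DataFrame({"id": [1, 2], "price": [100, 200]})
csv_test_2 = pd.DataFrame({"id": [3, 4], "price": [300, 400]})

# Stack the rows of both tables; ignore_index=True renumbers the index from 0
csv_test = pd.concat([csv_test_1, csv_test_2], ignore_index=True)

print(len(csv_test))         # number of rows after the combination
print(list(csv_test.index))  # row labels after renumbering
```

Without `ignore_index=True`, each table keeps its own row labels, so the combined index would contain duplicates (0, 1, 0, 1).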

Data merge LEFT JOIN [merge]

- When the join column has the same name in both tables, join with `on="id"` as the condition.

`joined_table = pd.merge(table_1, table_2, on="join column", how="join method")`

```python
join_data = pd.merge(a_data, b_data[["id", "date", "customer"]], on="id", how="left")
join_data.head()
```

- When the join columns have different names in the two tables, join with `left_on="customer_name", right_on="Customer name"`.

```python
pd.merge(a_data, b_data, left_on="customer_name", right_on="Customer name", how="left")
```
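Both join patterns can be checked on small tables. The following is only a sketch: the contents of `a_data` and `b_data` are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical tables; the names and values are for illustration only
a_data = pd.DataFrame({
    "id": [1, 2, 3],
    "customer_name": ["Sato", "Suzuki", "Tanaka"],
})
b_data = pd.DataFrame({
    "id": [1, 2],
    "Customer name": ["Sato", "Suzuki"],
    "date": ["2019-01-01", "2019-01-02"],
})

# Same key name on both sides: use on=
same_key = pd.merge(a_data, b_data[["id", "date"]], on="id", how="left")

# Different key names: use left_on= / right_on=
diff_key = pd.merge(
    a_data,
    b_data[["Customer name", "date"]],
    left_on="customer_name",
    right_on="Customer name",
    how="left",
)

# A left join keeps every row of the left table; unmatched rows get NaN
print(same_key)
print(diff_key)
```

Note that with `left_on` / `right_on`, both key columns remain in the result.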

Data confirmation

Get unique values [pd.unique(data)]

```python
pd.unique(test_data.item_name)
len(pd.unique(test_data.item_name))  # Number of unique values
```
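A minimal sketch of the same check on made-up data. `Series.nunique()` is not in the memo above; I add it here as a shorter way to get the count.

```python
import pandas as pd

# Hypothetical data for illustration
test_data = pd.DataFrame({"item_name": ["A", "B", "A", "C", "B"]})

# Unique values, returned in order of first appearance
unique_items = pd.unique(test_data["item_name"])

# Number of distinct values (missing values are excluded by default)
n_unique = test_data["item_name"].nunique()

print(list(unique_items))
print(n_unique)
```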

Date manipulation

Convert the values in column a to datetime type [to_datetime()]

```python
test_data["a"] = pd.to_datetime(test_data["a"])
```

Extract date parts [dt]

Format a date [dt.strftime("%Y%m")]

```python
time_data["payment_month"] = time_data["payment_date"].dt.strftime("%Y%m")
```
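The conversion and the formatting usually go together, because the `.dt` accessor only works on datetime columns. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical data: the dates are stored as strings at first
time_data = pd.DataFrame({
    "payment_date": [
        "2019-02-01 01:36:57",
        "2019-02-15 10:00:00",
        "2019-03-03 08:30:00",
    ]
})

# Convert the strings to datetime so that the .dt accessor becomes available
time_data["payment_date"] = pd.to_datetime(time_data["payment_date"])

# Build a year-month key such as "201902", convenient for monthly aggregation
time_data["payment_month"] = time_data["payment_date"].dt.strftime("%Y%m")

print(time_data.dtypes)
print(time_data["payment_month"].tolist())
```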

Pivot table

Create a pivot table [pd.pivot_table]

```python
pd.pivot_table(test_data, index='item_name', columns='payment_month', values=['price', 'quantity'], aggfunc='sum')
```

**pivot_table overview**

- index: specify the rows
- columns: specify the columns
- values: specify the values to be aggregated
- aggfunc: specify the aggregation method
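To see how the four arguments shape the result, here is a small sketch. The data is made up; the column names follow the call above.

```python
import pandas as pd

# Hypothetical sales data for illustration
test_data = pd.DataFrame({
    "item_name": ["A", "A", "B", "B"],
    "payment_month": ["201902", "201903", "201902", "201902"],
    "price": [100, 150, 200, 250],
    "quantity": [1, 2, 1, 3],
})

# Rows: item_name, columns: payment_month, cells: sum of price and quantity
pivot = pd.pivot_table(
    test_data,
    index="item_name",
    columns="payment_month",
    values=["price", "quantity"],
    aggfunc="sum",
)

print(pivot)
```

Passing a list to `values` gives two-level column labels such as `("price", "201902")`. Combinations with no data (item B in 201903 here) become NaN.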

This is not Pandas-specific content, so I will reorganize it later.

Data display

Display [print]

```python
print(len(test_data))  # Display the number of rows
```

Display the first 5 lines of data [head]

```python
csv_test_1.head()
```

Specify the data column and display the first 5 rows [head]

```python
csv_test_1["Column name"].head()
```

Manipulating data

Extract data with .loc [.loc[condition, column to get]]

```python
# flg_is_null is assumed to be a boolean Series (mask) aligned with test_data, e.g. the result of .isnull()
res = test_data.loc[flg_is_null, "item_name"]
```
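The snippet above does not show where `flg_is_null` comes from. The sketch below assumes it is a missing-value mask built from a hypothetical `item_price` column; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the price of item B is missing
test_data = pd.DataFrame({
    "item_name": ["A", "B", "C"],
    "item_price": [100.0, float("nan"), 300.0],
})

# Boolean mask: True for rows where item_price is missing (assumed definition)
flg_is_null = test_data["item_price"].isnull()

# Keep the rows where the mask is True, and only the item_name column
res = test_data.loc[flg_is_null, "item_name"]

print(res.tolist())
```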

Creating a data column

Store the product of columns a and b in a new column named new.

```python
test_data["new"] = test_data["a"] * test_data["b"]
```

Data calculation

Sum up column a [column.sum()]

```python
test_data["a"].sum()
```

Aggregate by specified group [groupby("column")["column"].sum()]

```python
# Select the column before summing so that only that column is aggregated
test_data.groupby("create_date")["price"].sum()
```

Aggregate by specified group (multiple columns) [groupby(["column", ...])[["column", ...]].sum()]

```python
# Group by two keys and sum only the selected columns
test_data.groupby(["create_date", "item_name"])[["price", "quantity"]].sum()
```
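A minimal sketch of both aggregations on made-up data. It selects the columns before calling `sum()`, so columns that cannot be summed are never touched.

```python
import pandas as pd

# Hypothetical sales data for illustration
test_data = pd.DataFrame({
    "create_date": ["2019-02-01", "2019-02-01", "2019-02-02"],
    "item_name": ["A", "B", "A"],
    "price": [100, 200, 300],
    "quantity": [1, 2, 3],
})

# One grouping key, one aggregated column: the result is a Series
by_date = test_data.groupby("create_date")["price"].sum()

# Two grouping keys, two aggregated columns: the result is a DataFrame
by_date_item = test_data.groupby(["create_date", "item_name"])[["price", "quantity"]].sum()

print(by_date)
print(by_date_item)
```

With several grouping keys, the result has a multi-level row index, so a single row is addressed with a tuple such as `("2019-02-02", "A")`.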

Data comparison

Compare the total of column a with the total of column b and display the result as True / False

```python
test_data["a"].sum() == test_data["b"].sum()
```

Check for missing values: isnull() marks each cell as True / False, and sum() counts the missing values per column

```python
test_data.isnull().sum()
```

Check for missing values: returns True / False for each column depending on whether it contains any missing value

```python
test_data.isnull().any(axis=0)
```
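The two missing-value checks can be compared side by side. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data: column a has one missing value, column b has none
test_data = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": ["x", "y", "z"],
})

# Number of missing values in each column
null_counts = test_data.isnull().sum()

# Whether each column contains at least one missing value
has_null = test_data.isnull().any(axis=0)

print(null_counts)
print(has_null)
```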

Output of various statistics [describe()]

```python
test_data.describe()
```

Maximum and minimum values of the specified column [max(), min()]

```python
test_data["create_date"].min()
test_data["create_date"].max()
```

Data type confirmation [dtypes]

```python
test_data.dtypes
```

- describe() displays the following statistics: number of data points (count), mean (mean), standard deviation (std), minimum (min), quartiles (25%, 75%), median (50%), and maximum (max).
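A quick sketch to confirm which rows describe() returns for a numeric column. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical numeric data for illustration
test_data = pd.DataFrame({"price": [100, 200, 300, 400]})

# Summary statistics for the numeric columns
stats = test_data.describe()

print(stats)
print(list(stats.index))  # names of the statistics, in the order listed above
```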

Work memo

- Data cleansing
- Data processing: Pandas
- Visualization: Matplotlib
- Machine learning: scikit-learn