These notes are a write-up of what I studied.
・ Overview of the basic libraries used in data analysis
・ Elementary code
The following three libraries are commonly used in data analysis. The conventional import alias is shown in parentheses.
・ pandas (pd)
・ NumPy (np)
・ pyplot (plt) from matplotlib
pandas
A library for reading data, checking basic information about it, sorting and reshaping it, finding and removing missing values, and aggregating.
numpy
A Python library that makes it easy to write numerical algorithms that run faster than the equivalent plain-Python calculations.
matplotlib
A plotting library that supports 2D and 3D graphs.
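As a small illustration of the NumPy description above, here is a minimal sketch of array arithmetic applied to every element at once. The sample values are made up for this example.

```python
import numpy as np

# Made-up sample data for illustration
values = np.array([1.0, 2.0, 3.0, 4.0])

# Arithmetic is applied to every element without writing a Python loop
doubled = values * 2

# Aggregations are single method calls
total = values.sum()
mean = values.mean()

print(doubled)
print(total, mean)
```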
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Display graphs inline in the Jupyter notebook
%matplotlib inline
df = pd.read_csv("file name") # Read a CSV file into a DataFrame; the first line is used as the column names
df = pd.read_csv("file name",header=None) # With header=None, the first line is treated as data and the columns are numbered 0, 1, ...
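To see what `header=None` changes, the sketch below reads the same CSV text twice. It uses an in-memory string through `io.StringIO` instead of a real file, and the CSV content is invented for this example.

```python
import io

import pandas as pd

# Invented CSV content; io.StringIO lets read_csv treat a string like a file
csv_text = "x,y\n1,10\n2,20\n3,30\n"

# Default: the first line becomes the column names
df = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is kept as a data row, columns are numbered
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)

print(df.columns.tolist(), df.shape)
print(df_raw.columns.tolist(), df_raw.shape)
```

With `header=None` the table has one more row, because the line `x,y` is counted as data.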
df.head() # Show the first five rows of the data
df.tail() # Show the last five rows of the data
# If you pass a number as the argument, that many rows are shown.
df.head(10) # Show the first 10 rows
df.tail(10) # Show the last 10 rows
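The behavior of `head` and `tail` can be checked on a small table. This is a sketch using a hypothetical 20-row DataFrame built in memory rather than a CSV file.

```python
import pandas as pd

# Hypothetical 20-row table with values 0..19
df = pd.DataFrame({"y": range(20)})

first_default = df.head()   # no argument: 5 rows
first_ten = df.head(10)     # rows 0 to 9
last_ten = df.tail(10)      # rows 10 to 19

print(len(first_default))
print(first_ten["y"].tolist())
print(last_ten["y"].tolist())
```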
df.shape # An attribute that gives the size of the data as (number of rows, number of columns)
df.describe() # A function that calculates basic statistics such as minimum and maximum values, standard deviation, and mean
df.info() # A function that shows each column's data type (string, integer, floating point) and non-null count
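The three calls above can be tried together on a small table. The column names and values below are invented for this sketch.

```python
import pandas as pd

# Invented example table with one string, one integer, and one float column
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "qty": [1, 2, 3],
    "price": [1.5, 2.5, 3.5],
})

n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses
stats = df.describe()       # by default only numeric columns are summarized
df.info()                   # prints dtypes and non-null counts

print(n_rows, n_cols)
print(stats)
```

Note that the string column `name` does not appear in the `describe()` result here, since the default summary covers numeric columns only.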
df["Column name"] # Extract a specific column
df[["Column name","Column name",...,"Column name"]] # Extract multiple specific columns
df[df["Column name"]Conditional expression] # Extract the rows that meet the condition
df[df["y"]>=df["y"].mean()] # Extract the rows where y is greater than or equal to the mean of y
df.sort_values(by="y",ascending=False) # Sort the rows in descending order of y
df["Column name"][df["Column name"]Conditional expression] # Extract the values of the left column for the rows that meet the condition in the right brackets
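The extraction and sorting patterns above can be sketched on a small table as follows. The columns `x` and `y` and their values are made up for the example.

```python
import pandas as pd

# Made-up data: the mean of y is 25
df = pd.DataFrame({"x": ["a", "b", "c", "d"], "y": [10, 40, 20, 30]})

# Rows where y is at or above the mean of y
above_mean = df[df["y"] >= df["y"].mean()]

# All rows, sorted by y from largest to smallest
sorted_desc = df.sort_values(by="y", ascending=False)

# Values of column x, restricted to rows where y > 15
x_where_big = df["x"][df["y"] > 15]

print(above_mean)
print(sorted_desc)
print(x_where_big.tolist())
```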
df["Column name"].plot() # Generate a line graph with the row number on the horizontal axis and the values of the specified column on the vertical axis
df["Column name"].plot(figsize=(Width,Height)) # Set the graph size with figsize
df["Column name"].plot(figsize=(Width,Height),title="Title name") # Set the title
ax = df["Column name"].plot(figsize=(Width,Height),title="Title name")
ax.set_xlabel("Label name") # Set the x-axis label
ax.set_ylabel("Label name") # Set the y-axis label
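Putting the plotting calls above together, a minimal sketch might look like this. The data, title, and label text are placeholders, and the `Agg` backend line is an assumption so that the script also runs outside a notebook.

```python
import matplotlib

# Assumption: non-interactive backend so the script runs without a display
matplotlib.use("Agg")

import pandas as pd

# Placeholder data
df = pd.DataFrame({"y": [3, 1, 4, 1, 5]})

# plot() returns a matplotlib Axes, which is then used to set the labels
ax = df["y"].plot(figsize=(8, 4), title="y by row")
ax.set_xlabel("row number")
ax.set_ylabel("y")

print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
```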
df["Column"].plot.hist() # Generate a histogram: the values of Column are divided into bins and the frequency of each bin is counted
df["Column"].plot.hist(grid=True) # Add grid lines
plt.axvline(x=Numerical value,color="color") # Draw a vertical line
plt.axvline(x=df["y"].mean(),color="red") # Example: a red vertical line at the mean of y (x must be a single number, not a whole column)
df["y"].plot.hist() # Run in the same cell to overlay the graphs
plt.savefig("file name.extension") # Save the graph
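The histogram, vertical line, and save steps can be combined as in the sketch below. The data are invented, the output file name `hist_example.png` in the temporary directory is hypothetical, and the `Agg` backend is an assumption for running without a display.

```python
import os
import tempfile

import matplotlib

# Assumption: non-interactive backend so the script runs without a display
matplotlib.use("Agg")

import pandas as pd
from matplotlib import pyplot as plt

# Invented data whose mean is exactly 3.0
df = pd.DataFrame({"y": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

# Histogram of y with grid lines
ax = df["y"].plot.hist(grid=True)

# Red vertical line at the mean of y, drawn on the same axes
mean_line = plt.axvline(x=df["y"].mean(), color="red")

# Hypothetical output path; the extension selects the image format
out_path = os.path.join(tempfile.gettempdir(), "hist_example.png")
plt.savefig(out_path)
```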
df[["Column name 1","Column name 2"]].boxplot(by="Column name 1") # Box plot showing the spread of the values of Column name 2 for each category of the column passed to by
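A small sketch of the grouped box plot follows. The column names `group` and `value` are hypothetical, and the per-group medians are computed separately with `groupby` only to show the numbers that the center lines of the boxes correspond to.

```python
import matplotlib

# Assumption: non-interactive backend so the script runs without a display
matplotlib.use("Agg")

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical data: two groups with different spreads
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 2, 4, 6],
})

# One box per category of "group", showing the distribution of "value"
df[["group", "value"]].boxplot(by="group")
fig = plt.gcf()

# Numeric counterpart of the box center lines
medians = df.groupby("group")["value"].median()
print(medians)
```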
df.isnull() # Mark each value as True if it is null (missing) and False otherwise
df.isnull().any() # Check, for each column, whether it contains any null
df.isnull().sum() # Count the number of nulls in each column
df["Column name"].value_counts() # Count how many times each distinct value appears
df.fillna(Numerical value) # Replace all null values with the specified value
df.dropna(subset=["Column name"]) # Delete the rows where the specified column is null
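The missing-value checks above can be tried on a table with one deliberate gap. The data and the fill value `0` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: column a has one missing value, column b has none
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10, 10, 30]})

has_null = df.isnull().any()       # per column: is there any null?
null_counts = df.isnull().sum()    # per column: how many nulls?
counts = df["b"].value_counts()    # occurrences of each distinct value in b

filled = df.fillna(0)              # new DataFrame with nulls replaced by 0
dropped = df.dropna(subset=["a"])  # new DataFrame without rows where a is null

print(null_counts)
print(filled)
print(dropped)
```

Both `fillna` and `dropna` return a new DataFrame here, so `df` itself is left unchanged unless the result is assigned back.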
df[["Column name 1","Column name 2"]].corr() # Output the correlation coefficients between the two columns
df.plot.scatter(x="Column name",y="Column name",figsize=(5,5)) # Draw a scatter plot
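As a final sketch, the correlation and scatter plot calls can be checked on perfectly linear data, where the correlation coefficient should come out as 1. The data are made up, and the `Agg` backend is an assumption for running without a display.

```python
import matplotlib

# Assumption: non-interactive backend so the script runs without a display
matplotlib.use("Agg")

import pandas as pd

# Made-up data where y is exactly 2 * x
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

# 2x2 correlation matrix; the off-diagonal entry is the x-y correlation
corr = df[["x", "y"]].corr()

# Scatter plot of y against x
ax = df.plot.scatter(x="x", y="y", figsize=(5, 5))

print(corr)
```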