Tips for data analysis ・ Notes

Introduction

For writing down notes for data analysis by yourself

-** Tips for analysis ** -** Notes on analysis **


Previous information
This time, I will explain using the data used in the Kaggle competition. Kaggle House Prices DataSet Kaggle House Prices Kernel Make a note using.

Please note that we do not care about model accuracy etc. as we keep it for Memo`


Data contents
There are 81 features in all, of which SalePrices is used as the objective variable (target) for analysis. The explanatory variables are other than SalePrices.

Tips First of all, it is better to do this for the data

Preprocessing

First of all, read the data (DataFrame-> df) and do ʻEDA (exploratory data analysis) `.

--.head (): Display the beginning of data, extract 5 lines if (5), defalt value is 5 --.info (): Data summary (number of rows, number of columns, column name of each column, type of data stored in each column etc ....) --.describe (): Basic statistics of data (min, max, 25% etc .....) --.shape [0], .shape [1]: Check the number of rows and columns of data --.columns: Get column names of data --.isnull (). sum (): Check the number of missing areas in each column --.dtypes: Check the type type of each column


# import
import pandas as pd
import matplotlib.pyplot as plt    ## for drawing graph 

## load Data
df = pd.read~~~~(csv , json etc...)   

#Top display of data
df.head() 

#Summary display of data
df.info()

#Number of dimensions of data (how many rows and columns)
print('There are {} rows and {} columns in df'.format(df.shape[0], df.shape[1]))

#Get column of data
df.columns

#Count the number of missing values in each column data
df.isnull().sum()

#Type type confirmation of each column data
df.dtypes

You can get the column names with df.columns, but if you keep a list of column names according to the type type, you may be able to use it if you think about it later. Describe the code. It is also possible to change ʻinclude to type type (float64 etc ...)`.


obj_columns = df.select_dtypes(include=['object']).columns
numb_columns = df.select_dtypes(include=['number']).columns

plot example

Histogram of each column (feature)

Histogram the features of each column. However, this can only handle numerical data. Since it cannot be applied to character string data, it will be described later.

bins is the setting of the fineness of the class value for analyzing the frequency, and figsize is the setting of the size of the figure. You can make many other arguments, so please refer to the following. official pyplot.hist document


df.hist(bins=50, figsize=(20,15))
plt.show()

スクリーンショット 2020-08-22 15.28.00.png

Correlation matrix heat map

In machine learning, it is important to find the correlation matrix for each feature. Visualize with a heat map to visualize the correlation matrix.

corr is the correlation matrix, annot sets the value in the cell, cmap specifies the color of the figure You can make many other arguments, so please refer to the following. official seaborn.heatmap document

import seaborn as sns  ## for drawing graph 
corr = train_df.corr()
corr[np.abs(corr) < 0.1] = 0  ## corr<0.1  =>  corr=0 
sns.heatmap(corr, annot=True, cmap='YlGnBu')
plt.show()

スクリーンショット 2020-08-22 15.47.09.png

Estimating the importance of features

Random Forest is used to calculate which features are important for the target SalePrices. In order to perform RandomForest, it is first necessary to divide it into explanatory variable and objective variable. ** This time, all variables other than target (SalePsices) are used as explanatory variables. ** **

--.drop ("***", axis = (0 or 1)): Specify a column that is not used for ***, row if axis = 0, column if 1. --RandomForestRegressor (n_estimators = **): n_estimators is the number of learnings


from sklearn.ensemble import RandomForestRegressor
X_train = df.drop("SalePrices", axis=1)
y_train = df[["SalePrices"]]
rf = RandomForestRegressor(n_estimators=80, max_features='auto')
rf.fit(X_train, y_train)

ranking = np.argsort(-rf.feature_importances_)  ##To draw in descending order of importance
sns.barplot(x=rf.feature_importances_[ranking], y=X_train.columns.values[ranking], orient='h')
plt.show()

スクリーンショット 2020-08-22 16.08.50.png

Distribution of each feature (for checking outliers, etc.)

It is said that when features include outliers when performing regression analysis, it tends to affect the accuracy of the model. Therefore, it is considered that outlier processing is indispensable for improving the accuracy of the model. Therefore, it is easy to visualize and confirm how many outliers are included in each feature.

Here, plots of the top 30 features are performed.

--.iloc [:,:]: Extract value by specifying row or column --sns.regplot: Plot the 2D data and the result of the linear regression model


X_train = X_train.iloc[:,ranking[:30]]

fig = plt.figure(figsize=(12,7))
for i in np.arange(30):
    ax = fig.add_subplot(5,6,i+1)
    sns.regplot(x=X_train.iloc[:,i], y=y_train)

plt.tight_layout()
plt.show()

important point

Check if the target (objective variable) follows a normal distribution

In this data, target is SalePrices. In machine learning, whether the objective variable follows a normal distribution is important because it affects the model. So let's look at the distribution of SalePrices. In this figure, the vertical axis shows the percentage and the horizontal axis shows the SalePrices.


sns.distplot(y_train, color="red", label="real_value")
plt.legend()
plt.show()

スクリーンショット 2020-08-22 16.19.37.png

From the above figure, it can be seen that the distribution is slightly biased to the left. It is not a normal distribution (data in which most of the numbers are concentrated in the center when graphed and are "distributed" in a symmetrical bell shape). Therefore, the commonly used ** logarithmic conversion and difference conversion ** are shown below so that the objective variable follows a normal distribution.


y_train2 = np.log(y_train)
y_train2 = y_train2.replace([np.inf, -np.inf], np.nan)
y_train2 = y_train2.fillna(0)
sns.distplot(y_train2, color="blue", label="log_value")
plt.legend()
plt.show()

スクリーンショット 2020-08-22 16.19.43.png

By performing logarithmic conversion, the figure is close to the normal distribution.


y_train3 = y_train.diff(periods = 1)
y_train3 = y_train3.fillna(0)
sns.distplot(y_train3, color="green", label="diff_value")
plt.legend()
plt.show()

スクリーンショット 2020-08-22 16.19.49.png

It is a figure that can be said to be a normal distribution by performing difference conversion. In other words, it can be inferred that the objective variable that does not easily affect the accuracy of the model is SalePrices that has undergone differential conversion. In this way, ** When dealing with any data, it is better to check whether the values are biased. ** ** I would like to write a future article comparing the accuracy of the model when it actually follows the normal distribution and when it does not.

Summary

The above is a summary of what to do first when analyzing data. In the future, I would like to write articles about the handling of character string data and the handling of time series data mentioned earlier when analyzing data.

Recommended Posts

Tips for data analysis ・ Notes
Python for Data Analysis Chapter 4
[Python] Notes on data analysis
Python data analysis learning notes
Python for Data Analysis Chapter 3
Notes for challenging basketball video analysis
Preprocessing template for data analysis (Python)
Data analysis for improving POG 3-Regression analysis-
Data analysis Titanic 2
Data analysis python
Data analysis Titanic 1
Data analysis Titanic 3
JupyterLab Basic Setting 2 (pip) for data analysis
JupyterLab Basic Setup for Data Analysis (pip)
Analysis for Data Scientists: Qiita Self-Article Summary 2020
Prepare a programming language environment for data analysis
Analysis for Data Scientists: Qiita Self-Article Summary 2020 (Practice)
[CovsirPhy] COVID-19 Python Package for Data Analysis: Data loading
Data analysis with python 2
Data analysis using xarray
Data analysis parts collection
Folder structure for analysis
Data analysis using Python 0
Data analysis overview python
Source analysis for Django--INSTALLED_APPS
Python data analysis template
Data analysis with Python
Data analysis for improving POG 1 ~ Web scraping with Python ~
[For beginners] How to study Python3 data analysis exam
Organizing basic procedures for data analysis and statistical processing (2)
Analysis of measurement data ①-Memorandum of understanding for scipy fitting-
[CovsirPhy] COVID-19 Python package for data analysis: SIR-F model
[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis
Stop thinking for use in data analysis competition LightGBM
[CovsirPhy] COVID-19 Python Package for Data Analysis: SIR model
[CovsirPhy] COVID-19 Python Package for Data Analysis: Parameter estimation
Data set for machine learning
My python data analysis container
Multidimensional data analysis library xarray
[CovsirPhy] COVID-19 Python Package for Data Analysis: Scenario Analysis (Parameter Comparison)
Installation notes for TensorFlow for Windows
[Understand in the shortest time] Python basics for data analysis
Which should I study, R or Python, for data analysis?
An introduction to statistical modeling for data analysis (Midorimoto) reading notes (in Python and Stan)
<Python> Build a dedicated server for Jupyter Notebook data analysis
Wrap analysis part1 (data preparation)
[PyTorch] Data Augmentation for segmentation
[Python + Selenium] Tips for scraping
Stock price data acquisition tips
~ Tips for beginners to Python ③ ~
Data analysis using python pandas
Introduction to Statistical Modeling for Data Analysis GLM Model Selection
Notes for HHKB Programming Contest 2020
Analyzing Twitter Data | Trend Analysis
Introduction to Statistical Modeling for Data Analysis Generalized Linear Models (GLM)
Create a USB boot Ubuntu with a Python environment for data analysis
Launch notes for existing Django applications
Pandas / DataFrame Tips for practical use
Basic syntax notes for shell script
Python: Time Series Analysis: Preprocessing Time Series Data
Shift the data for 3 months Shift the data for n months