Hello! This is Nakagawa from the Lumada Data Science Lab. at Hitachi, Ltd.
At the Lumada Data Science Lab., we actively accept in-house SEs as trainees and train them as data scientists, with the aim of improving the quality of the proposals we make to our customers. In the practical training, the trainees regularly take on data analysis exercises and discuss the questions that come up with lab members who work on data analysis every day. In this article, I would like to introduce one such exercise and the discussion that followed.
By sharing solutions to problems that people who are just starting out in data analysis often stumble over, as well as techniques that are useful for people already working in the field, I hope this article becomes an opportunity to think about what data analysis really is.
- Mr. Matsushita (male, in his 9th year since joining the company)
- Engaged in SE work related to social security in the Public Systems Division
- Experienced in development in Java and C, but new to data analysis
- An active mid-career SE who loves traveling abroad and drinking
Here, we would like to introduce the specific contents of the exercises that Mr. Matsushita summarized.
- Perform multiple regression analysis using Python and scikit-learn. (The development environment is Jupyter Notebook, which is convenient for data analysis in Python.)
- scikit-learn comes with bundled datasets that let you try out data analysis and machine learning right away (https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset).
- This time, we work on the Boston house prices dataset from among those bundled datasets.
- The data analysis follows the CRISP-DM process.
CRISP-DM
An effective way of thinking when carrying out data analysis is [CRISP-DM (CRoss-Industry Standard Process for Data Mining)](https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf). It is a data mining methodology and process model divided into the following phases, in which you analyze data while iterating a PDCA-like cycle, from understanding the customer's business issues through actual modeling, its evaluation, and deployment to the business (business improvement).
Understanding the business
Understanding the data
Data preparation
Modeling
Evaluation
Deployment
We also worked on this theme in this order.
In this phase, we clarify the business challenges and set the goal of the data analysis. This time, the problem to be solved has already been defined, so the goal is as follows.
Goal: Create and evaluate a numerical prediction model for Boston house prices
Check the data to be analyzed and decide whether it can be used as it is or whether it needs to be processed. Specifically, check whether there is data that cannot be used for analysis as is, for example because of many missing values or outliers, and if so, decide on a processing policy such as removal or imputation.
Import the libraries to be used and load the data for the analysis.
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
# Load the dataset
from sklearn import datasets
boston_data = datasets.load_boston()
# Store the explanatory variables
boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
# Store the objective variable (house price)
boston_medv = pd.DataFrame(boston_data.target)
# Check the data
boston.info()
The variables of the Boston house prices dataset that are the subject of this data analysis are as follows.
Column name | Contents |
---|---|
CRIM | Crime rate per capita by town |
ZN | Percentage of residential land zoned for lots over 25,000 square feet |
INDUS | Percentage of non-retail business area per town |
CHAS | Charles River dummy variable (= 1 if the tract borders the river; 0 otherwise) |
NOX | Nitric oxide concentration (parts per 10 million) |
RM | Average number of rooms per dwelling |
AGE | Percentage of dwellings built before 1940 |
DIS | Weighted distances to five Boston employment centers |
RAD | Index of accessibility to radial highways |
TAX | Property tax rate per $10,000 |
PTRATIO | Pupil-teacher ratio by town |
B | 1000(Bk - 0.63)^2, where Bk is the ratio of African Americans by town |
LSTAT | Percentage of low-income earners in the population |
MEDV | Median house price in $1000s (* objective variable) |
Check the Boston house prices dataset for missing values.
#Confirmation of missing values
boston.isnull().sum()
Output result
#Number of missing values
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
dtype: int64
It was confirmed that there are no missing values in this data.
Regarding outliers and abnormal values, you first need to consider what should be treated as an outlier or abnormal value in the first place. This requires understanding the background of the business or phenomenon being analyzed and the background of the data, such as how it was measured. With that background in mind, outliers and abnormal values are judged, for example, from the following viewpoints (a small code sketch follows the list).
- Is the value clearly impossible given the nature of the variable (for example, an age that is negative or a character string)?
- Is the value impossible given the nature of the business or phenomenon (for example, an age below the minimum age allowed to apply for a loan)?
- If a probability distribution can be assumed for the variable, does the value deviate significantly from it (for example, extreme values on both sides appear frequently relative to an assumed normal distribution)?
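As a reference, here is a minimal sketch of such mechanical checks in pandas, applied to the RM column of this dataset. The non-negativity check and the common 1.5 × IQR rule used here are purely illustrative assumptions, not criteria used in this exercise.
# Values that are impossible by the nature of the variable (a room count cannot be negative)
impossible_rm = boston[boston['RM'] < 0]
# Candidate outliers by the 1.5 * IQR rule (the same rule a box plot whisker uses)
q1, q3 = boston['RM'].quantile([0.25, 0.75])
iqr = q3 - q1
candidates = boston[(boston['RM'] < q1 - 1.5 * iqr) | (boston['RM'] > q3 + 1.5 * iqr)]
print(len(impossible_rm), len(candidates))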
First, check the variation in values using a box plot.
# Check for outliers with box plots (visualized in a grid)
fig, axs = plt.subplots(ncols=5, nrows=3, figsize=(13, 8))
for i, col in enumerate(boston.columns):
    sns.boxplot(boston[col], ax=axs[i//5, i%5])
# Adjust the spacing between subplots
fig.subplots_adjust(wspace=0.2, hspace=0.5)
# Remove the unused empty subplots
fig.delaxes(axs[2, 4])
fig.delaxes(axs[2, 3])
Output result
Looking at the box plots, CRIM, ZN, CHAS, RM, DIS, PTRATIO, B, and LSTAT show large variation in their values, and there may be outliers. Next, use histograms to look at the values and distributions of the individual variables in more detail.
# Check the distributions with histograms (visualized in a grid)
fig, axs = plt.subplots(ncols=5, nrows=3, figsize=(13, 8))
for i, col in enumerate(boston.columns):
    sns.distplot(boston[col], bins=20, kde_kws={'bw':1}, ax=axs[i//5, i%5])
# Adjust the spacing between subplots
fig.subplots_adjust(wspace=0.2, hspace=0.5)
# Remove the unused empty subplots
fig.delaxes(axs[2, 4])
fig.delaxes(axs[2, 3])
Output result
Checking the histograms while keeping the meaning of each variable in mind, the values all seem plausible, so we do not treat any of them as outliers or abnormal values here. CHAS has a distinctive shape in both the box plot and the histogram because it is a dummy variable that flags whether the tract is along the river with 0 or 1.
Following the policy decided in "Understanding the data", process the data so that it can be fed into the next step, modeling. For example, the following kinds of processing are performed (a small illustrative sketch follows the list).
- Impute missing values, outliers, and abnormal values with the mean or mode, or exclude them
- Bin numerical data into meaningful intervals and convert it into categorical data
- Encode label data as numeric flags so that it can be handled as numbers
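As a reference, here is a minimal sketch of these operations in pandas. The Boston data needs none of them, so the column choices, bin edges, and labels below are purely illustrative assumptions.
# Impute missing values with the mean (illustrative: this data has no missing values)
boston_prep = boston.copy()
boston_prep['RM'] = boston_prep['RM'].fillna(boston_prep['RM'].mean())
# Bin a numeric column into categories (the bin edges and labels are arbitrary here)
boston_prep['AGE_BIN'] = pd.cut(boston_prep['AGE'], bins=[0, 35, 70, 100], labels=['new', 'mid', 'old'])
# Turn the categorical column into numeric flag (dummy) columns
boston_prep = pd.get_dummies(boston_prep, columns=['AGE_BIN'])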
As confirmed in "Understanding the data", we proceed to the next phase on the assumption that there are no missing values, outliers, or abnormal values.
Build a model using a method suited to the conditions of the analysis. Select the variables to feed into the model and split the data into training and test sets. This time, we choose linear regression (multiple regression) as the modeling algorithm. Incidentally, scikit-learn's documentation provides a cheat sheet that summarizes guidelines for choosing an algorithm and modeling method.
Select the variables to feed into the model. This is the process of searching for effective combinations while reducing the number of variables actually used. Reducing the number of variables has the following benefits:
- Lower computational cost and shorter processing time
- Less overfitting and better generalization (prediction performance on unknown data)
There are the following methods for selecting variables (a brief sketch of the latter two is shown after the list).
- Filter Method: rank variables by an evaluation metric and select the top-ranked ones
- Wrapper Method: actually build models with combinations of variables and choose the best-performing combination
- Embedded Method: select variables as part of the machine learning algorithm itself
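As a reference, here is a minimal sketch of the Wrapper and Embedded approaches with scikit-learn (RFE and Lasso, respectively). The exercise itself uses the Filter Method described next, and the parameter values below are illustrative assumptions.
# Wrapper method: recursive feature elimination around a linear regression model
# (in practice you would standardize the features first, since RFE ranks by coefficient size)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Lasso
rfe = RFE(LinearRegression(), n_features_to_select=8)   # number of features kept is an arbitrary choice
rfe.fit(boston, boston_medv.values.ravel())
print(boston.columns[rfe.support_])
# Embedded method: Lasso shrinks some coefficients to exactly zero
lasso = Lasso(alpha=1.0, max_iter=10000)                 # alpha is an arbitrary choice
lasso.fit(boston, boston_medv.values.ravel())
print(boston.columns[lasso.coef_ != 0])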
This time, as an example of the Filter Method, we check the strength of the correlation between variables. A heatmap of the correlation matrix quantifies this, and a pair plot is also useful because it visualizes the histogram of each variable together with the relationship between every pair of variables.
# Quantify the correlations between the quantitative variables with a seaborn heatmap
boston["MEDV"] = boston_medv
plt.figure(figsize=(11, 11))
sns.heatmap(boston.corr(), cmap="summer", annot=True, fmt='.2f', square=True, linewidths=.5)
plt.ylim(0, boston.corr().shape[0])
plt.show()
# Visualize the pairwise relationships between the quantitative variables with a seaborn pair plot
sns.pairplot(boston)
plt.show()
Output result
Checking the heatmap, RAD and TAX are strongly positively correlated with each other, so of the two we keep TAX, which has the stronger negative correlation with the objective variable MEDV, and drop RAD.
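As a quick check of this choice, the relevant entries can be pulled straight out of the correlation matrix (a small illustrative snippet):
# Correlation between RAD and TAX, and of each with the objective variable MEDV
print(boston.corr().loc['RAD', 'TAX'])
print(boston.corr().loc[['RAD', 'TAX'], 'MEDV'])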
It is common to use part of the data to train the model and the rest to verify the predictive power of the trained model. This time, we use 50% of the data for training and 50% for testing.
# Store the explanatory variables in Xm and the objective variable in Ym
Xm = boston.drop(['MEDV', 'RAD'], axis=1)
Ym = boston.MEDV
# Import the function that splits data into train and test sets
from sklearn.model_selection import train_test_split
# The rows assigned to X_train and X_test are chosen at random
# test_size=0.5 sends 50% of the data to the test set
X_train, X_test = train_test_split(Xm, test_size=0.5, random_state=1234)
Y_train, Y_test = train_test_split(Ym, test_size=0.5, random_state=1234)
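Incidentally, a single call to train_test_split can split the explanatory and objective variables together, which guarantees that their rows stay paired; this is simply an equivalent alternative to the two calls above.
# Equivalent split in one call, keeping the rows of X and Y aligned
X_train, X_test, Y_train, Y_test = train_test_split(Xm, Ym, test_size=0.5, random_state=1234)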
Now feed the training data into a linear regression model (multiple regression model) and fit it.
#Import sklearn linear regression model and fit with train data
from sklearn import linear_model
model_lr = linear_model.LinearRegression()
model_lr.fit(X_train, Y_train)
#Use the generated model to get the predicted values for the explanatory variables of the test data
predict_lr = model_lr.predict(X_test)
#Regression coefficient
print(model_lr.coef_)
#Intercept
print(model_lr.intercept_)
Output result
#Regression coefficient
[-2.79020004e-02 5.37476665e-02 -1.78835462e-01 3.58752530e+00
-2.01893649e+01 2.15895260e+00 1.95781711e-02 -1.66948371e+00
6.47894480e-03 -9.66999954e-01 3.62212576e-03 -6.65471265e-01]
#Intercept
48.68643686655955
Evaluate the accuracy and performance of the created model to determine whether the goal can be achieved. In addition, the model is tuned as necessary based on the evaluation results.
To evaluate the accuracy of the model, we assess the error and the strength of association between the model's predicted values and the correct values using the following metrics (their formulas are given after the list).
- MAE (Mean Absolute Error): average of the absolute errors
- MSE (Mean Squared Error): average of the squared errors
- RMSE (Root Mean Squared Error): square root of the MSE
- Coefficient of determination ($R^2$): proportion of the variance in the objective variable explained by the model
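For reference, writing $y_i$ for the correct values, $\hat{y}_i$ for the predicted values, $\bar{y}$ for the mean of the correct values, and $n$ for the number of test samples, these metrics are defined as follows.
$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$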
#Evaluation
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(Y_test, predict_lr)
print("MAE:{}".format(mae))
# MSE
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, predict_lr)
print("MSE:{}".format(mse))
# RMSE
rmse = np.sqrt(mse)
print("RMSE:{}".format(rmse))
#Coefficient of determination
print("R^2:{}".format(model_lr.score(X_test, Y_test)))
#Evaluation results
MAE:3.544918694530246
MSE:23.394317851152568
RMSE:4.83676729346705
R^2:0.7279094372333412
You have created a model with a coefficient of determination of approximately 0.73.
Apply the results of the data analysis to the business to solve the business problems. This time the goal was to create and evaluate a model, but in actual business the created model would then be used to improve operations and build systems.
Members of the Lumada Data Science Lab. will now answer the candid questions that the trainee raised while working through the actual data analysis.
I was unsure what to remove as outliers. How should I decide? I couldn't figure it out just by looking at the box plots and histograms.
The purpose is to exclude records that take clearly impossible values or that arise under special conditions, because such data can inadvertently distort the modeling results. First of all, it is important to look closely at the data. As Mr. Matsushita did, plotting box plots and histograms is a common way to notice issues; for example, if the values of a variable are unusually skewed, a plot makes that easy to see. By digging further into the records that take such values, you may also discover the phenomenon behind them and what the values actually mean.
At what value of the correlation coefficient do you judge the correlation between variables to be strong? Or is there another way to check? I'm not sure about the criteria.
A correlation coefficient of 0.7 or more is generally said to indicate a strong correlation, but in practice it depends on the domain, so it is very important to discuss the criteria with the customer. In this case we focused on the combination of explanatory variables with a correlation coefficient of 0.9 or more, but it is important not only to look at the correlation coefficient but also to check the pattern of variation in a scatter plot.
Is the coefficient of determination $R^2$ what cross_val_score calculates in k-fold cross-validation?
Information on how the library is implemented can be found in the API Reference (https://scikit-learn.org/stable/modules/classes.html), so it is worth referring to it. The metric that cross_val_score computes is specified by its scoring argument; you can specify the score to calculate by name or by a function. If it is not specified, the score method implemented by the modeling algorithm is used, and LinearRegression implements $R^2$.
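As a reference, here is a minimal sketch of cross_val_score using the data from this exercise; the choice of cv=5 and the explicit scoring name are illustrative assumptions.
from sklearn.model_selection import cross_val_score
# With no scoring argument, the estimator's score method is used (R^2 for LinearRegression)
r2_scores = cross_val_score(linear_model.LinearRegression(), Xm, Ym, cv=5)
# An explicit metric can be requested by name (negated MSE, since scorers follow "greater is better")
mse_scores = cross_val_score(linear_model.LinearRegression(), Xm, Ym, cv=5, scoring='neg_mean_squared_error')
print(r2_scores.mean(), -mse_scores.mean())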
Should k-fold cross-validation be done every time? Is there a case where cross-validation is not performed?
k-fold cross-validation is just one method; as a general principle, it is important to perform some kind of validation. Hold-out validation, k-fold cross-validation, leave-one-out cross-validation, and so on are used depending on conditions such as the amount and variability of the data. The purpose of validation is to detect the model overfitting the training data (overfitting) and to improve the prediction performance (generalization) for unknown data. If the statistical behavior of the population is known and its distribution is clear, you may instead use all the data to estimate the parameters of that distribution.
How should I identify the main factor (the explanatory variable that most affects the objective variable)? By the regression coefficients? The correlation coefficients? There seem to be many ways to do it, and I'm not sure which to choose.
In the case of a multiple regression model, as long as you pay attention to the independence of the explanatory variables (that there is no multicollinearity), it is sufficient to compare the standardized regression coefficients, which account for the differences in scale and unit between variables. Multicollinearity itself requires careful checking of the relationships among multiple variables, for example with the VIF (Variance Inflation Factor), which in effect regresses each explanatory variable on the others to evaluate their influence, or it can be handled with approaches such as principal component regression, which first synthesizes uncorrelated variables and then regresses on them.
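As a reference, here is a minimal sketch of both ideas applied to this exercise's data: standardized regression coefficients obtained by standardizing the variables before fitting, and VIF values computed with statsmodels (statsmodels is assumed to be available; it is not used elsewhere in this exercise).
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Standardized regression coefficients: fit on variables rescaled to mean 0, variance 1
scaler_X, scaler_y = StandardScaler(), StandardScaler()
Xs = scaler_X.fit_transform(Xm)
ys = scaler_y.fit_transform(Ym.values.reshape(-1, 1)).ravel()
model_std = linear_model.LinearRegression().fit(Xs, ys)
print(pd.Series(model_std.coef_, index=Xm.columns).sort_values())
# VIF: how strongly each explanatory variable is explained by the others
X_const = sm.add_constant(Xm)
vif = pd.Series([variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])], index=Xm.columns)
print(vif)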
I would like to refer to data analysis implementations written by other people. Is there a good way to find them?
Browsing Qiita goes without saying, but there is also a site called Kaggle that hosts data science competitions, where Notebooks (data analysis programs) for various problems, including Boston housing, are published and actively discussed. Just reading them is a great way to learn what techniques other data analysts are using. You may also want to read an introductory book to get a basic grounding in the prerequisite statistics.
Since scikit-learn provides many data analysis methods, I was able to implement the model more smoothly than I had imagined. However, even with prepared sample data like this, I often struggled with how to proceed with the analysis, so I imagine that data analysis in real business would require even more trial and error. I feel I now understand a little why it is said that "preprocessing is 90% of the work up to modeling in data analysis." This time the main focus was on learning how to use the methods, but through these exercises I hope to understand the essence of the methods and make proposals based on data analysis that our customers find convincing.
This time, we asked Mr. Matsushita, a trainee, to work on the Boston house prices dataset. In the discussion, I think we had meaningful exchanges about outliers, correlations, and how to think about validation. The Lumada Data Science Lab. will continue to post articles about our practical training, so please look forward to the next one. Thank you for reading this far.