This is a note on preprocessing features for modeling, written while analyzing data myself.
EDA and the other steps that come before preprocessing are covered in
Tips and precautions when analyzing data
so please refer to that article.
What is important in preprocessing is:
- **Handling missing values**
- **What type the feature being preprocessed is**
In this article, I describe preprocessing methods for two feature types: **Numeric (numerical data)** and **Categorical**. Note that there are other kinds of preprocessing for data such as datetime and location information, but they are not covered in this article.
Preliminary information
This time, I will explain using data from a Kaggle competition.
Kaggle House Prices DataSet
I am writing this note in a Kaggle House Prices Kernel.
Please note that since this is kept as a memo, I am not concerned with model accuracy.
Data contents
There are 81 features in total, of which SalePrice is used as the objective variable (target) for this analysis. The explanatory variables are everything other than SalePrice.
First, check the contents of the data. If it contains missing values, there are two approaches to dealing with them:
1. Delete the columns or rows that contain missing values.
2. Fill in (impute) the missing values with some other value.
When doing this, it is important to keep in mind **why you are analyzing the data and what you want to produce as output**. Broadly speaking, there are two reasons for analyzing data: **the first is building a model that predicts the objective variable (target) with high accuracy, and the second is understanding the data**. If the goal is a highly accurate model, carelessly deleting columns and rows that contain missing values can drastically reduce the amount of data, which is not a good idea; in that case it is better to impute the missing values with some other value, a typical example being the column mean. On the other hand, if the goal is understanding the data, carelessly imputing and over-modifying the data can lead to misunderstanding it. **In other words, when dealing with missing values, what matters is what you want out of the analysis.**
Below is the code for 1 (deleting columns and rows that contain missing values) and 2 (imputing missing values with other values).
import pandas as pd
## load Data
df = pd.read_csv("train.csv")  # e.g. the House Prices train.csv; use pd.read_json() etc. for other formats
## count nan
df.isnull().sum()
The output above shows that this data contains missing values. Now, let's handle the missing values for *LotFrontage*.
- .dropna(how="???", axis=(0 or 1)): how takes "any" or "all"; axis=0 drops rows, axis=1 drops columns
With any, rows or columns containing at least one missing value are deleted. With all, only rows or columns in which every value is missing are deleted.
df.dropna(how="any", axis=0)
df.dropna(how="all", axis=0)
With any, a row (the specified axis) is deleted if it contains even one missing value; as a result, the shape of df becomes (0, 81). With all, a row is deleted only if every value in it is missing; as a result, the shape of df remains (1460, 81).
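A quick way to confirm this yourself (a minimal sketch; the shapes assume the House Prices training set loaded above):
# Compare the row counts before and after each strategy
df.shape                              # (1460, 81)
df.dropna(how="any", axis=0).shape    # (0, 81): every row has at least one NaN
df.dropna(how="all", axis=0).shape    # (1460, 81): no row is entirely NaN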
***
- .dropna(subset=["???"]): select specific columns/rows with subset
df.dropna(subset=["LotFrontage"])
By passing subset, you can delete only the rows that have a missing value in the specified column(s). This argument can be quite effective in practice.
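For example, a quick check of the effect (a sketch; the exact row count depends on how many LotFrontage values are missing in df.isnull().sum() above):
# Only rows whose LotFrontage is NaN are dropped; all 81 columns remain
df.dropna(subset=["LotFrontage"]).shape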
- .fillna(???, inplace=bool(True or False)): fill missing values with the value given as ???; inplace controls whether the original object is modified
With inplace=True, the DataFrame is updated without allocating a new object, which saves memory, but because the original object is overwritten it becomes harder to reuse. For data that is not particularly large, I recommend creating a new object instead (personal opinion...).
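A minimal sketch of the difference (df_filled is a name I made up for illustration):
# Returns a new object; df itself is left unchanged
df_filled = df.fillna(0)
# Modifies df itself and returns None, so the original values are lost
df.fillna(0, inplace=True)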
# Fill NaN with 0
df.fillna(0)
# Fill multiple columns, each with a different value
df.fillna({'LotFrontage': df["LotFrontage"].mean(),'PoolArea': df["PoolArea"].median(), 'MoSold': df["MoSold"].mode().iloc[0]})
As shown above, you can fill with values such as the column mean or median. For the mode, `.mode()` returns a Series (there can be more than one mode), so the first value is taken with `iloc[0]`.
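For example, for MoSold it looks roughly like this (a sketch; the actual values depend on the data):
df["MoSold"].mode()          # a Series: there can be more than one mode
df["MoSold"].mode().iloc[0]  # take the first mode as a scalar to pass to fillna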
There are various other ways to fill values with fillna, so please check the official documentation. pandas.DataFrame.fillna
The first thing to do when working with Numeric data is **scaling**. Scaling means converting values into a fixed range. For example, if you want to predict ice cream sales from variables such as temperature, precipitation, and humidity, each variable has different units and a different range of values. Training on them as-is may not work well, so the values need to be adjusted to a common range; this is scaling.
There are several methods for scaling features. In this article, I would like to touch on the three I use relatively often: 1. MinMaxScaler, 2. StandardScaler, and 3. log transformation.
**1. MinMaxScaler** converts the values of all features to the same scale: subtract the minimum from every value and divide by the difference between the maximum and minimum. As a result, every value ends up between 0 and 1.
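Written out by hand for a single column, the computation is as follows (a sketch; LotArea is just an example column, and MinMaxScaler below does the same thing for every column at once):
x = df["LotArea"]
x_scaled = (x - x.min()) / (x.max() - x.min())  # every value now lies between 0 and 1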
**However, this method has a disadvantage: because the values are compressed into the 0-to-1 range, the standard deviation becomes small and the influence of outliers is suppressed.** If outliers matter in your problem, this method can make them difficult to take into account.
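To see why, here is a toy example with made-up numbers (not from the dataset):
import numpy as np
x = np.array([1, 2, 3, 4, 1000])                # 1000 is an outlier
x_mm = (x - x.min()) / (x.max() - x.min())
# x_mm is roughly [0, 0.001, 0.002, 0.003, 1]:
# the ordinary values are squashed near 0 and their differences almost disappear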
sklearn.preprocessing.MinMaxScaler
from sklearn import preprocessing
# First, extract the column names of dtype=number from df
numb_columns = df.select_dtypes(include=['number']).columns
# Extract only the dtype=number data
num_df = df[numb_columns]
# MinMaxScaler
mm_scaler = preprocessing.MinMaxScaler()
# fit_transform returns a NumPy array
num_df_fit = mm_scaler.fit_transform(num_df)
# Convert the array back to a DataFrame
num_df_fit = pd.DataFrame(num_df_fit, columns=numb_columns)
**After the conversion is done, it is a good idea to check that the scaling worked as expected.** For example:
# Check the maximum value of each feature
num_df_fit.max()
# Check the minimum value of each feature
num_df_fit.min()
It is best to check things like this after converting :relaxed:
**2. StandardScaler** converts values to a standardized distribution with mean 0 and variance 1. First, subtract the mean so that the values are centered around 0; then divide by the standard deviation so that the resulting distribution has a mean of 0 and a standard deviation of 1.
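By hand, for a single column, this is (a sketch; StandardScaler below does it for every column, and pandas' std() uses the sample standard deviation, so the result differs very slightly from StandardScaler's):
x = df["LotArea"]
x_std = (x - x.mean()) / x.std()  # mean becomes roughly 0, standard deviation roughly 1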
StandardScaler has the same disadvantage as MinMaxScaler when it comes to outliers: they affect the calculation of the mean and standard deviation, which narrows the range of the transformed features. **In particular, because the magnitude of the outliers differs from feature to feature, the spread of the converted data can end up looking very different from feature to feature.**
sklearn.preprocessing.StandardScaler
# StandardScaler
s_scaler = preprocessing.StandardScaler()
# fit_transform returns a NumPy array
num_df_s_fit = s_scaler.fit_transform(num_df)
# Convert the array back to a DataFrame
num_df_s_fit = pd.DataFrame(num_df_s_fit, columns=numb_columns)
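As with MinMaxScaler, it is worth checking the result after conversion (the values will only be approximately 0 and 1 because of floating point and the sample standard deviation):
# Check that each feature now has mean ~0 and standard deviation ~1
num_df_s_fit.mean()
num_df_s_fit.std()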
3. Log transformation was discussed in the previous article, so I will only touch on it briefly here. Machine learning models often assume a normal distribution, so when a variable does not follow one, a log transformation (or a similar transformation) may be applied so that it does. See:
Tips and precautions when performing data analysis
import numpy as np
num_df = np.log(num_df)
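One caveat from my side, not covered in the original articles: several of these columns contain zeros, and np.log(0) is -inf, so np.log1p (log(1 + x)) is a common alternative.
# For columns that contain zeros, log1p avoids -inf
num_df_log = np.log1p(num_df)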
Categorical data is string data such as gender (male/female) or prefecture (Hokkaido/Aomori/Iwate...). When handling such data, it is usually converted into numerical data first.
In this article, I would like to touch on the two methods I use relatively often: 1. Label Encoding and 2. One Hot Encoding.
sklearn.preprocessing.LabelEncoder
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# First, extract the column names of dtype=object from df
obj_columns = df.select_dtypes(include=["object"]).columns
# Extract only the dtype=object data
obj_df = df[obj_columns]
# Extract the unique values of the Street column
str_uniq = obj_df["Street"].unique()
# labelEncoder
le.fit(str_uniq)
list(le.classes_)
le.transform(str_uniq)
As shown above, fit the encoder with **le.fit()**, get the unique values of the feature with **list(le.classes_)**, and map those unique values to numbers with **le.transform()**.
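Concretely, for the Street column it looks like this (the encoded numbers are just the alphabetical index of each class):
list(le.classes_)             # e.g. ['Grvl', 'Pave']
le.transform(["Pave"])        # array([1]): each class is mapped to an integer
le.inverse_transform([0])     # array(['Grvl'], ...): map a number back to its label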
Numbers produced by LabelEncoder are said to work well with tree-based models (random forest, etc.) but not with non-tree-based models (regression analysis, etc.). **This is because the unique values are converted into numbers whose magnitude has no meaning, yet a regression model treats that magnitude as if it were meaningful.**
sklearn.preprocessing.OneHotEncoder
from sklearn import preprocessing
# One Hot Encoder
oh = preprocessing.OneHotEncoder()
str_oh = oh.fit_transform(obj_df.values)
Pass an array (here obj_df.values) as the argument of fit_transform.
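fit_transform returns a SciPy sparse matrix, so to inspect the result you can densify it (a sketch; get_feature_names_out requires a fairly recent scikit-learn, older versions use get_feature_names instead):
# Look at the first encoded row as a dense array
str_oh.toarray()[0]
# Names of the generated one-hot columns
oh.get_feature_names_out(obj_columns)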
In this article, we have covered the preprocessing of features for modeling. On Kaggle and elsewhere, **Target Encoding** is also often used for categorical data. I have not covered it here because I have not studied it enough yet, but I would like to update this article once I have.