I'm new to Python and machine learning. In my enthusiasm for data analysis, I got stuck because I neglected to check for missing values, so I am leaving this memo as a reflection.
- Before starting data analysis, check for missing values.
- If missing values are found, take some measure, such as replacing the missing values with other data or excluding the rows that contain them from the analysis.
- When I participated in a data analysis competition on Kaggle, I analyzed a dataset too large to check by eye.
- At that time I did not notice that it contained missing values (NaN), so the program filled up with NaN and the errors would not stop, because `1 + NaN` is `NaN`.

① First and foremost, check whether the data contains any missing values.
- Use `isnull().any()`.
- It tells you which columns of your dataframe contain missing values.
- If you check `df_example` as shown below, ***True*** tells you that the Population and GDP columns contain missing values. (I imagine this can happen when, for example, the exact population of a country such as North Korea is not known.)
# Example: suppose the CSV file contains basic statistics for each country
# (the path below is a placeholder)
import pandas as pd
df_example = pd.read_csv("hogehoge/example.csv")
# Show, for each column, whether it contains at least one missing value
print(df_example.isnull().any())
# Example output
Id            False
Name          False
Population     True
GDP            True
Region        False
life_expct    False
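The same check can be tried without a CSV file. The following is a minimal sketch that assumes a small hand-made dataframe (`df_toy` and its values are hypothetical, invented for illustration, not real statistics). In addition to `isnull().any()`, it uses `isnull().sum()` to count how many values are missing in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df_toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Population": [100.0, np.nan, 300.0],
    "GDP": [np.nan, np.nan, 50.0],
})

# True for every column that contains at least one missing value
has_missing = df_toy.isnull().any()
print(has_missing)

# Number of missing values in each column
missing_counts = df_toy.isnull().sum()
print(missing_counts)
```

Knowing the count per column helps when deciding whether replacing the values or dropping rows is the more reasonable measure.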
② Replace the missing values in the columns where they were found.
- I will omit how to delete a column that consists entirely of NaN, and how to delete the rows themselves instead of replacing the missing values.
# Replace the missing values in the Population column with 0
df_example.loc[df_example['Population'].isnull(), 'Population'] = 0
- In this case, pay attention to whether the replacement value is appropriate and what you need to keep in mind in later calculations.
- For example, if you replace the population with 0 as above, there could be two patterns:
  - "This data is analyzed only to find the top 30 most populous countries and their characteristics, so this is not a problem."
  - "Since we will compute the average population from this data, let's calculate it only over countries whose population is not 0, so that both the numerator and the denominator are correct."
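The second pattern can be sketched as follows. This is a minimal example under assumptions: the `population` series and its numbers are hypothetical, and `fillna(0)` is used here in place of the `.loc` assignment above (it has the same effect on the missing values). It compares the mean after replacing NaN with 0 against the mean over only the countries that have a population value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration only.
population = pd.Series([100.0, np.nan, 300.0], name="Population")

# Replace missing values with 0 (same effect as the .loc assignment above)
filled = population.fillna(0)

# Naive mean: the replaced 0 is counted in the denominator and pulls the mean down
naive_mean = filled.mean()  # (100 + 0 + 300) / 3

# Mean over only the countries whose population is not 0
nonzero_mean = filled[filled != 0].mean()  # (100 + 300) / 2

print(naive_mean, nonzero_mean)
```

Note that pandas `mean()` skips NaN by default, so calling `population.mean()` on the column before replacement gives the same value as the second calculation.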
- When you are given data, it is important to check for missing values first instead of jumping straight into the analysis.
- Determine whether missing values (NaN) are present and count them with pandas
- Exclude (delete) / replace (fill in) / extract missing values (NaN) with pandas
(that's all)
- I came to write this article after finding that missing values had been mixed into the input layer of a deep learning model, which made all of the later analysis useless.
- In addition to checking for missing values, I think there are many other checks and data cleansing steps to do before analysis, such as drawing a histogram to look for outliers. I have refrained from covering them in this article as of March 24, 2020, but I would like to add them after looking into them.
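How a single missing value can make later calculations useless can be illustrated with plain NumPy. This is only a sketch under assumptions, not the model I actually used: `x` is a hypothetical input with one missing feature and `W` is a made-up weight matrix standing in for one linear layer.

```python
import numpy as np

# Hypothetical input with one missing feature, and made-up weights
x = np.array([1.0, np.nan, 3.0])
W = np.ones((3, 2))

# Every output depends on every input, so one NaN contaminates all outputs
y = x @ W
print(y)

# NaN also propagates through simple arithmetic
print(1 + np.nan)

# True when every output value is NaN
all_nan = bool(np.isnan(y).all())
print(all_nan)
```

Once the outputs are NaN, anything computed from them (a loss, a gradient, a later layer) is NaN as well, which is why the check has to happen before the data is fed in.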