[Python beginner's memo] Why and how to check for missing values (NaN) before data analysis

I'm new to Python and machine learning. In my eagerness to get on with data analysis, I skipped checking for missing values and got stuck, so I am leaving this memo as a lesson to myself.

Conclusion

- Before starting data analysis, check the data for missing values.
- If you find any, deal with them in some way, for example by replacing the missing values with other data or by excluding the rows that contain them from the analysis.

What happened

- When I took part in a competition on Kaggle, a data analysis contest platform, I was working with more data than I could check by eye.
- I did not notice that the data contained missing values (NaN), so NaN spread through my program and the errors would not stop.

What is a missing value?

Not a Number / NaN
- A special value used when the result of a calculation cannot be represented as a number.
  - Following the details seems to require fairly deep study, so I will not go into them in this article.
- Arithmetic between NaN and any other number returns NaN, so if even one NaN gets into the data, the calculation may not give a correct result.
- For example, the result of `1 + NaN` is `NaN`.
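As a small illustration of the propagation described above, here is a minimal sketch using only the Python standard library. The variable names are made up for this example.

```python
import math

# float("nan") creates a NaN value in plain Python
nan = float("nan")

# Arithmetic with NaN yields NaN again
result = 1 + nan
print(math.isnan(result))

# A single NaN is enough to turn a whole sum into NaN
total = sum([10.0, 20.0, nan])
print(math.isnan(total))

# NaN is not equal to anything, not even itself,
# so use math.isnan() rather than == to detect it
print(nan == nan)
```

Because `==` cannot detect NaN, a dedicated check such as `math.isnan()` (or the pandas methods used later in this memo) is needed.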

Countermeasures: what I recommend doing at the start of data analysis

- ① First and foremost, check whether the data contains any missing values.
  - Use `isnull().any()`.
  - It tells you which columns of the DataFrame contain missing values.
  - If you check df_example as shown below, Population and GDP come back as ***True***, meaning those columns contain missing values. (You can imagine why: the exact population of North Korea, for example, may not be known.)

# Example: suppose a file such as countries.csv contains basic statistics for each country
import pandas as pd
df_example = pd.read_csv("hogehoge/example.csv").copy()

print(df_example.isnull().any())

# Example output
Id            False
Name          False
Population     True
GDP            True
Region        False
life_expct    False
dtype: bool
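Besides knowing which columns are affected, it often helps to know how many values are missing. The following is a minimal sketch that builds a small DataFrame in memory instead of reading a CSV. The rows are invented for illustration and are not the data used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data (values are made up)
df_example = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Population": [1000.0, np.nan, 3000.0],
    "GDP": [np.nan, np.nan, 50.0],
})

# True for each column that contains at least one missing value
has_missing = df_example.isnull().any()
print(has_missing)

# Number of missing values in each column
missing_counts = df_example.isnull().sum()
print(missing_counts)
```

`isnull().sum()` works because `True` is counted as 1 when summed, so the result is the number of NaN entries per column.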

- ② For each column where missing values were found, replace them.
  - I omit here how to drop a column that consists entirely of NaN, and how to delete the rows themselves instead of replacing the missing values.

# For a column found to contain missing values, replace NaN with 0
df_example.loc[df_example['Population'].isnull(), 'Population'] = 0
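For reference, pandas also provides `fillna()` for replacement and `dropna()` for excluding rows, which is the other option mentioned in the conclusion. This is only a sketch on invented data, not necessarily the best approach for a given dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data (values are made up for illustration)
df_example = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Population": [1000.0, np.nan, 3000.0],
})

# Replacement: same effect as the .loc assignment, written with fillna()
filled = df_example.copy()
filled["Population"] = filled["Population"].fillna(0)
print(filled)

# Exclusion: drop the rows whose Population is missing
dropped = df_example.dropna(subset=["Population"])
print(dropped)
```

Both calls return new objects here, so the original `df_example` still contains the NaN and can be compared against the results.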

Caution

- When replacing, consider whether the replacement value is appropriate and what you will need to keep in mind in later calculations.
- For example, if you replace the population with 0 as above, there are two possible situations:
  - "This data is only used to find the 30 most populous countries and their characteristics, so the replacement causes no problem."
  - "We will compute the average population from this data, so let's calculate over only the countries whose population is not 0 and make sure the numerator and denominator are both correct."
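The second pattern can be checked with a small calculation. The sketch below uses three invented population values to show how replacing NaN with 0 changes the average, and how restricting the calculation to non-zero values corrects it.

```python
import numpy as np
import pandas as pd

# Hypothetical populations; the second country's value is missing
population = pd.Series([1000.0, np.nan, 3000.0], name="Population")

# After replacing NaN with 0, the zero is counted in the denominator
filled = population.fillna(0)
mean_including_zero = filled.mean()  # (1000 + 0 + 3000) / 3

# Restrict both numerator and denominator to non-zero values
mean_excluding_zero = filled[filled != 0].mean()  # (1000 + 3000) / 2

# For comparison: Series.mean() skips NaN by default (skipna=True)
mean_skipping_nan = population.mean()

print(mean_including_zero, mean_excluding_zero, mean_skipping_nan)
```

Note that filtering on "not 0" assumes no country has a true population of 0. If 0 could be a legitimate value in a column, a different placeholder or leaving the NaN in place may be safer.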

Summary

- When you are given data, it is important to check for missing values first instead of jumping straight into the analysis.

References

- Pandas determines if missing value NaN is included, counts the number
- Exclude (delete) / replace (fill in) / extract missing value NaN with pandas

(End)

Supplement

- I wrote this article after missing values got into the input layer of a deep learning model and made all of the subsequent analysis useless.
- Besides checking for missing values, I expect there are many other checks and data cleansing steps to do before analysis, such as drawing histograms to look for outliers. I have left them out of this article as of March 24, 2020, but I would like to add them after looking into them.