Pandas is convenient, isn't it? I want to treat values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles of a Pandas DataFrame as outliers and remove them. Rather than dropping whole rows based on a single column's value, the idea is to detect outliers in each column separately and replace them with NaN.
Reference: [Find outliers with the interquartile range (IQR) in correlation analysis (Python) - I sell services and buy homes](http://www.ie-kau.net/entry/2016/04/14/%E7%9B%B8%E9%96%A2%E5%88%86%E6%9E%90%E3%81%AE%E6%99%82%E3%81%AB%E5%9B%9B%E5%88%86%E4%BD%8D%E7%AF%84%E5%9B%B2%28IQR%29%E3%81%A7%E5%A4%96%E3%82%8C%E5%80%A4%E3%82%92%E8%A6%8B%E3%81%A4%E3%81%91%E3%82%8B%EF%BC%88Pyt)
drop_outlier.py

```python
import pandas as pd


def drop_outlier(df):
    """Replace each column's IQR outliers with NaN and return a new DataFrame."""
    df = df.copy()
    for name, col in df.items():  # iteritems() was removed in pandas 2.0
        # Quartiles
        q1 = col.quantile(0.25)
        q3 = col.quantile(0.75)
        iqr = q3 - q1  # interquartile range
        # Outlier fences
        outlier_min = q1 - iqr * 1.5
        outlier_max = q3 + iqr * 1.5
        # Values outside the fences become NaN; writing into the loop
        # variable (col[...] = None) can silently modify only a copy
        df[name] = col.where((col >= outlier_min) & (col <= outlier_max))
    return df
```
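A quick check on a toy DataFrame (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 100, 4, 5]})
print(drop_outlier(df))
#        a
# 0    1.0
# 1    2.0
# 2    NaN
# 3    4.0
# 4    5.0
```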
If you want to feed the data into a machine-learning library such as scikit-learn, fill the NaN cells afterwards with `fillna` or the like. Keep in mind that the outliers are then replaced with other values rather than removed, so take that into account when using this :joy:

```python
df = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
```
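As a rough end-to-end sketch (the column names, data, and estimator here are placeholders, assuming scikit-learn is installed):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x': [1.0, 2.0, 100.0, 4.0, 5.0],
                   'y': [1.1, 2.1, 2.9, 4.2, 5.0]})

cleaned = drop_outlier(df)  # the 100.0 in 'x' becomes NaN
filled = cleaned.bfill()    # backfill NaN from the next valid row
model = LinearRegression().fit(filled[['x']], filled['y'])
```

Note that `bfill` leaves a NaN in place when the last row of a column is an outlier, so a `ffill` or interpolation fallback may be needed.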
Enjoy your Pandas life.