Fill outliers with NaN based on quartiles in Pandas

Pandas is convenient, isn't it? I want to treat values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile as outliers in a Pandas DataFrame. Instead of deleting an entire row based on the value in one column, this detects outliers column by column and replaces them with NaN.

Reference: [Find outliers in the interquartile range (IQR) during correlation analysis (Python) - I sell services and buy homes](http://www.ie-kau.net/entry/2016/04/14/%E7%9B%B8%E9%96%A2%E5%88%86%E6%9E%90%E3%81%AE%E6%99%82%E3%81%AB%E5%9B%9B%E5%88%86%E4%BD%8D%E7%AF%84%E5%9B%B2%28IQR%29%E3%81%A7%E5%A4%96%E3%82%8C%E5%80%A4%E3%82%92%E8%A6%8B%E3%81%A4%E3%81%91%E3%82%8B%EF%BC%88Pyt)

drop_outlier.py


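def drop_outlier(df):
  """Replace outliers in each column of df with NaN (modifies df) and return it."""
  # df.items() replaces df.iteritems(), which was removed in pandas 2.0
  for name, col in df.items():
    # Quartiles
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1  # Interquartile range

    # Outlier reference points
    outlier_min = q1 - iqr * 1.5
    outlier_max = q3 + iqr * 1.5

    # Replace values outside the range with NaN.
    # Assigning back to df[name] makes sure the DataFrame itself is updated.
    df[name] = col.where(col.between(outlier_min, outlier_max))
  return df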
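
As a rough illustration of the same 1.5 × IQR rule, here is a minimal vectorized sketch that computes the bounds for all columns at once. The helper name `mask_outliers_iqr`, the parameter `k`, and the sample data are made up for this example, and it assumes every column is numeric. Unlike the function above, it returns a new DataFrame rather than modifying the input.

```python
import pandas as pd


def mask_outliers_iqr(df, k=1.5):
    """Return a copy of df with values outside [Q1 - k*IQR, Q3 + k*IQR] set to NaN.

    Hypothetical helper for illustration; assumes all columns are numeric.
    """
    q1 = df.quantile(0.25)  # per-column first quartile (a Series indexed by column name)
    q3 = df.quantile(0.75)  # per-column third quartile
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series index with the columns
    in_range = (df >= lower) & (df <= upper)
    # where() keeps values where the mask is True and puts NaN elsewhere
    return df.where(in_range)


# Made-up sample data: only the value 100 in column "a" falls outside the bounds
df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
cleaned = mask_outliers_iqr(df)
print(cleaned)
```

For column `a`, the quartiles are 2 and 4, so the IQR is 2 and the bounds are -1 and 7. Only 100 is replaced with NaN, and column `b` is left as it is.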

If you want to pass the data to a machine learning library such as scikit-learn, fill in the removed values with fillna or a similar method. Note that this replaces the outliers with other values, so keep that in mind when you use the result.

df = df.bfill()  # same as df.fillna(method='bfill'), which is deprecated in recent pandas
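
To make the effect of a backward fill concrete, here is a small sketch on made-up data. One thing to watch for, shown below, is that a NaN at the end of a column has no later value to copy from, so it stays NaN after `bfill()`. Chaining `ffill()` afterwards is one possible way to cover that case, not the only one.

```python
import pandas as pd

nan = float("nan")
# Made-up column with a gap in the middle and a gap at the end
df = pd.DataFrame({"a": [1.0, nan, 3.0, nan]})

# Backward fill: each NaN takes the next valid value below it
back_filled = df.bfill()

# The trailing NaN is still there, so also fill forward from the last valid value
fully_filled = df.bfill().ffill()

print(back_filled)
print(fully_filled)
```

Here the middle gap becomes 3.0 after `bfill()`, while the last row is only filled once `ffill()` is applied as well.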

Enjoy your Pandas life.
