Summary of this article

- Investigates how likely "outliers" are to appear in a box plot when the data are assumed to follow a normal distribution.
- Definition of "outlier": a value less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR, where Q1 and Q3 are the 1st and 3rd quartiles and IQR = Q3 - Q1 is the interquartile range (extreme values are included).
- Under a normal distribution, the probability of an "outlier" appearing is approximately 0.70%.

Motivation to write this article

- When you draw a box plot during statistical analysis at work, "outliers" often appear.
- I wanted to know how likely "outliers" are to appear when a particular distribution is assumed.

Introduction: Box plot and outliers

Please refer to the following pages for explanations of box plots and of how outliers are defined in them.
- Box plot - Wikipedia
- How to read the box plot
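As a quick illustration of the rule those pages describe, here is a minimal sketch that flags outliers with the 1.5 × IQR fences. The sample values and the helper name `iqr_fences` are hypothetical, and the quartiles use the standard library's "inclusive" interpolation, which is only one of several quartile conventions; other conventions can give slightly different fences.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    # quantiles() sorts internally; n=4 yields the three quartile cut points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample: nine ordinary values and one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

lower, upper = iqr_fences(sample)
outliers = [x for x in sample if x < lower or x > upper]

print("Lower fence:", lower)
print("Upper fence:", upper)
print("Outliers:", outliers)
```

Any value outside the two fences would be drawn as an individual point beyond the whiskers of the box plot.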

Calculate the probability of outliers appearing in the normal distribution

Let's calculate the probability of outliers using the quantile function (ppf) and the cumulative distribution function (cdf) of the standard normal distribution.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

mu = 0  # mean
sd = 1  # standard deviation

q1_ideal = stats.norm.ppf(q=0.25, loc=mu, scale=sd)  # 1st quartile
q3_ideal = stats.norm.ppf(q=0.75, loc=mu, scale=sd)  # 3rd quartile
iqr_ideal = q3_ideal - q1_ideal  # interquartile range
lb_ideal = q1_ideal - 1.5 * iqr_ideal  # lower outlier boundary
ub_ideal = q3_ideal + 1.5 * iqr_ideal  # upper outlier boundary

print('Q1:', q1_ideal)
print('Q3:', q3_ideal)
print('IQR:', iqr_ideal)
print('Lower Bound:', lb_ideal)
print('Upper Bound:', ub_ideal)

print('Probability of lower outliers:', stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of upper outliers:', stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of outliers:', (stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) + stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd)) * 100, '%')

>Q1: -0.674489750196
>Q3: 0.674489750196
>IQR: 1.34897950039
>Lower Bound: -2.69795900078
>Upper Bound: 2.69795900078

>Probability of lower outliers: 0.348830161964%
>Probability of upper outliers: 0.348830161964%
>Probability of outliers: 0.697660323928%

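So, under a normal distribution, the probability that a value is an outlier is about 0.7%. With 1000 samples, about 7 are expected to be outliers. For comparison, only about 0.27% of values fall outside ±3σ, so box plot outliers are more than twice as common as 3σ deviations.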
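The comparison with 3σ can be checked with a short sketch. It uses `statistics.NormalDist` from the standard library instead of scipy so that it runs on its own; the function name `outlier_probability` and the multiplier parameter `k` are illustrative and do not come from the code above. It also suggests that the result does not depend on the mean or standard deviation, since the fences scale together with the distribution.

```python
from statistics import NormalDist


def outlier_probability(mu=0.0, sd=1.0, k=1.5):
    """Probability that a N(mu, sd^2) value lies outside the k * IQR fences."""
    dist = NormalDist(mu, sd)
    q1 = dist.inv_cdf(0.25)  # 1st quartile
    q3 = dist.inv_cdf(0.75)  # 3rd quartile
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Lower tail plus upper tail.
    return dist.cdf(lower) + (1.0 - dist.cdf(upper))


p_standard = outlier_probability()                  # standard normal
p_shifted = outlier_probability(mu=50.0, sd=10.0)   # arbitrary mean and sd
p_three_sigma = 2.0 * NormalDist().cdf(-3.0)        # outside +/- 3 sigma

print(f"Box plot outliers, N(0, 1):     {p_standard:.4%}")
print(f"Box plot outliers, N(50, 10^2): {p_shifted:.4%}")
print(f"Outside 3 sigma:                {p_three_sigma:.4%}")
print(f"Ratio:                          {p_standard / p_three_sigma:.2f}")
```

In standard-deviation units the fences sit at roughly ±2.7σ, which is why they catch more points than a 3σ rule.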

Verify that it is actually the case

Let's use data randomly sampled from a normal distribution to see if this really happens.

# Data generation
n = 1000000  # number of samples
mu = 0  # mean
sd = 1  # standard deviation
# Sampling step assumed: draw n values from N(mu, sd^2)
data = np.random.normal(loc=mu, scale=sd, size=n)

q1 = stats.scoreatpercentile(data, 25)  # 1st quartile
med = stats.scoreatpercentile(data, 50)  # median
q3 = stats.scoreatpercentile(data, 75)  # 3rd quartile
iqr = q3 - q1  # interquartile range
lb = q1 - 1.5 * iqr  # lower outlier boundary
ub = q3 + 1.5 * iqr  # upper outlier boundary

print('Q1:', q1)
print('Q2:', med)
print('Q3:', q3)
print('IQR:', iqr)
print('Lower Bound:', lb)
print('Upper Bound:', ub)

# np.where returns a tuple of index arrays, so take element [0] before counting
print('Percentage of samples that are lower outliers:', len(np.where(data < lb)[0]) / n * 100, '%')
print('Percentage of samples that are upper outliers:', len(np.where(data > ub)[0]) / n * 100, '%')
print('Percentage of samples that are outliers:', (len(np.where(data > ub)[0]) + len(np.where(data < lb)[0])) / n * 100, '%')

>Q1: -0.674873830027
>Q2: -0.00106013590319
>Q3: 0.673290672641
>IQR: 1.34816450267
>Lower Bound: -2.69712058403
>Upper Bound: 2.69553742664

>Percentage of samples that are lower outliers: 0.3554%
>Percentage of samples that are upper outliers: 0.3478%
>Percentage of samples that are outliers: 0.7032%

The proportion of outliers in the random sample was about 0.70%, almost identical to the theoretical value computed from the normal distribution.
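To put "almost the same" into numbers, the sketch below compares the simulated 0.7032% with the theoretical 0.697660323928% using the standard error of a proportion. It assumes each sample is independently an outlier with the theoretical probability (a binomial model); this is only an approximation, because the fences in the simulation were themselves estimated from the data. The variable names are illustrative.

```python
import math

n = 1_000_000                      # number of samples in the simulation
p_theory = 0.697660323928 / 100    # theoretical outlier probability
p_observed = 0.7032 / 100          # proportion observed in the simulation

# Standard error of a proportion under the binomial approximation.
std_error = math.sqrt(p_theory * (1.0 - p_theory) / n)

# How many standard errors the observed proportion is from the theory.
z_score = (p_observed - p_theory) / std_error

print(f"Standard error: {std_error * 100:.4f} %")
print(f"Difference:     {(p_observed - p_theory) * 100:.4f} %")
print(f"z-score:        {z_score:.2f}")
```

Under these assumptions the observed proportion lies within one standard error of the theoretical value, which is consistent with ordinary sampling fluctuation.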
