- This post investigates the probability that "outliers" appear in a box plot when the data are assumed to follow a normal distribution.
- Definition of "outlier": a value less than (1st quartile - 1.5 * interquartile range) or greater than (3rd quartile + 1.5 * interquartile range), including extreme values.
- Under a normal distribution, the probability of an "outlier" appearing is approximately 0.70%.
- When you draw a box plot during statistical analysis at work, "outliers" often appear.
- I wanted to know how likely it is that "outliers" appear when a particular distribution is assumed.
Please refer to the following sites for explanations of box plots and of outliers in box plots.
- Box plot - Wikipedia
- How to read the box plot
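For reference, the outlier rule used in this post can be sketched as a small helper. The function name boxplot_outliers, the parameter k, and the sample list are hypothetical names introduced for illustration, and np.percentile with its default interpolation is an assumption; plotting libraries may compute quartiles slightly differently.

```python
import numpy as np

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # sample quartiles
    iqr = q3 - q1                             # interquartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr # outlier boundaries
    return values[(values < lower) | (values > upper)]

# Illustrative data: one value far away from the rest.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_outliers(sample))
```

With k = 1.5 this matches the definition given at the top of the post.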
Let's calculate the probability of outliers for the standard normal distribution, using its quantile function (ppf) to find the quartiles and its cumulative distribution function (cdf) to find the tail probabilities.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
mu = 0  # mean
sd = 1  # standard deviation
q1_ideal = stats.norm.ppf(q=0.25, loc=mu, scale=sd)  # 1st quartile
q3_ideal = stats.norm.ppf(q=0.75, loc=mu, scale=sd)  # 3rd quartile
iqr_ideal = q3_ideal - q1_ideal  # interquartile range
lb_ideal = q1_ideal - 1.5 * iqr_ideal  # lower outlier boundary
ub_ideal = q3_ideal + 1.5 * iqr_ideal  # upper outlier boundary
print('Q1:', q1_ideal)
print('Q3:', q3_ideal)
print('IQR:', iqr_ideal)
print('Lower Bound:', lb_ideal)
print('Upper Bound:', ub_ideal)
print('Probability of lower outliers:', stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of upper outliers:', stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) * 100, '%')
print('Probability of outliers:', (stats.norm.sf(x=ub_ideal, loc=mu, scale=sd) + stats.norm.cdf(x=lb_ideal, loc=mu, scale=sd)) * 100, '%')
>Q1: -0.674489750196
>Q3: 0.674489750196
>IQR: 1.34897950039
>Lower Bound: -2.69795900078
>Upper Bound: 2.69795900078
>Probability of lower outliers: 0.348830161964 %
>Probability of upper outliers: 0.348830161964 %
>Probability of outliers: 0.697660323928 %
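So, with a normal distribution, the probability of a value being an outlier is about 0.7%. If there are 1000 samples, about 7 of them are expected to be outliers. The outlier boundaries sit at about ±2.70σ, so this is more frequent than falling outside ±3σ, which happens with a probability of about 0.27%.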
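As a quick check of the comparison with 3σ, the following sketch computes the two-sided tail probability beyond the box plot boundaries and beyond ±3σ for a standard normal distribution. The variable names (fence_z, p_fence, p_3sigma) are introduced here for illustration and do not appear in the code above.

```python
from scipy import stats

# Box plot outlier boundary for a standard normal distribution, in units of sigma.
q3 = stats.norm.ppf(0.75)        # 3rd quartile
fence_z = q3 + 1.5 * (2 * q3)    # Q3 + 1.5 * IQR, where IQR = 2 * Q3 by symmetry

# Two-sided tail probabilities beyond the boundaries and beyond 3 sigma.
p_fence = 2 * stats.norm.sf(fence_z)
p_3sigma = 2 * stats.norm.sf(3)

print('Boundary position (in sigma):', fence_z)
print('Probability beyond the boundaries:', p_fence * 100, '%')
print('Probability beyond 3 sigma:', p_3sigma * 100, '%')
print('Expected number of outliers in 1000 samples:', 1000 * p_fence)
```

Because the boundaries are defined through quartiles, their position in units of σ does not depend on the mean or standard deviation chosen.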
Let's use data randomly sampled from a normal distribution to see if this really happens.
# Data generation
n = 1000000  # number of samples
mu = 0  # mean
sd = 1  # standard deviation
# The sampling line is missing from the original; np.random.normal is assumed here.
data = np.random.normal(loc=mu, scale=sd, size=n)
# Sample quartiles and outlier boundaries
q1 = stats.scoreatpercentile(data, 25)
med = np.median(data)
q3 = stats.scoreatpercentile(data, 75)
iqr = q3 - q1
lb = q1 - 1.5 * iqr
ub = q3 + 1.5 * iqr
print('Q1:', q1)
print('Q2:', med)
print('Q3:', q3)
print('IQR:', iqr)
print('Lower Bound:', lb)
print('Upper Bound:', ub)
# Count the samples that fall below the lower boundary or above the upper boundary
n_lower = np.sum(data < lb)
n_upper = np.sum(data > ub)
print('Ratio of the number of samples with lower outliers to the total number of samples:', n_lower / n * 100, '%')
print('Ratio of the number of samples with upper outliers to the total number of samples:', n_upper / n * 100, '%')
print('Ratio of outliers to the total number of samples:', (n_lower + n_upper) / n * 100, '%')
>Q1: -0.674873830027
>Q2: -0.00106013590319
>Q3: 0.673290672641
>IQR: 1.34816450267
>Lower Bound: -2.69712058403
>Upper Bound: 2.69553742664
>Ratio of the number of samples with lower outliers to the total number of samples: 0.3554 %
>Ratio of the number of samples with upper outliers to the total number of samples: 0.3478 %
>Ratio of outliers to the total number of samples: 0.7032 %
The percentage of outlier samples obtained by random sampling was about 0.70% (0.7032%), which is almost the same as the theoretical value of 0.6977% calculated from the normal distribution.
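To judge whether "almost the same" is within ordinary sampling noise, the following rough sketch compares the two values using the binomial standard error of a proportion. It assumes the outlier boundaries are fixed at their theoretical positions; in the simulation they are estimated from the sample quartiles, which adds a little extra variability. The variable names are introduced here for illustration.

```python
import math

p_theory = 0.00697660323928  # theoretical outlier probability computed earlier
p_sample = 0.007032          # observed fraction in the 1,000,000-sample run
n = 1000000                  # number of samples

# Standard error of an observed proportion for n independent draws
# (assumption: fixed boundaries, so each draw is an independent Bernoulli trial).
se = math.sqrt(p_theory * (1 - p_theory) / n)

# Difference between the observed and theoretical values, in standard errors.
z = (p_sample - p_theory) / se

print('Standard error:', se * 100, '%')
print('Difference in units of standard error:', z)
```

Under these assumptions the observed value lies less than one standard error from the theoretical one, which is consistent with random sampling variation.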