For example, suppose you have a data frame like this.
    x         y
0   2  0.954025
1   3  0.146810
2   1  0.409961
3   1  0.164558
4   3  0.782152
5   2  0.905869
6   3  0.210528
7   1  0.437970
8   1  0.801206
9   3  0.089576
10  2  0.960357
11  2  0.670732
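(The original construction of df isn't shown; here is a minimal sketch that rebuilds it from the values above, assuming pandas is imported as pd.)
import pandas as pd
df = pd.DataFrame({
    'x': [2, 3, 1, 1, 3, 2, 3, 1, 1, 3, 2, 2],
    'y': [0.954025, 0.146810, 0.409961, 0.164558, 0.782152, 0.905869,
          0.210528, 0.437970, 0.801206, 0.089576, 0.960357, 0.670732],
})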
To draw a plot with standard-error bars from this data frame, I write the following.
import matplotlib.pyplot as plt

# Mean and standard error of y for each value of x
m = df.pivot_table(index='x', values='y', aggfunc='mean')
e = df.pivot_table(index='x', values='y', aggfunc='sem')

# yerr draws error bars of the given magnitude around each mean
m.plot(xlim=[0.8, 3.2], yerr=e)
plt.show()
As you can see, error bars are drawn by passing the magnitude of the error to yerr. So, to get confidence intervals instead, we define a function cilen that returns the half-width of the confidence interval and pass that to yerr.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def cilen(arr, alpha=0.95):
    """Half-width of the t confidence interval of the mean."""
    if len(arr) <= 1:
        return 0
    m, e, df = np.mean(arr), stats.sem(arr), len(arr) - 1
    interval = stats.t.interval(alpha, df, loc=m, scale=e)
    cilen = np.max(interval) - np.mean(interval)
    return cilen

m = df.pivot_table(index='x', values='y', aggfunc='mean')
e = df.pivot_table(index='x', values='y', aggfunc=cilen)
m.plot(xlim=[0.8, 3.2], yerr=e)
plt.show()
With that, I get a plot whose error bars show confidence intervals.
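Incidentally, the half-width that cilen returns is just the two-sided t critical value multiplied by the standard error of the mean, so an equivalent definition could look like the sketch below (the name cilen_ppf is only for illustration):
from scipy import stats

def cilen_ppf(arr, alpha=0.95):
    # half-width = two-sided t critical value * standard error of the mean
    n = len(arr)
    if n <= 1:
        return 0
    return stats.t.ppf((1 + alpha) / 2, n - 1) * stats.sem(arr)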
Calculating the confidence interval itself is confusing because of the "n or n-1" problem.
If you search for "confidence interval Python" (to avoid reinventing the wheel), you will find two subtly different recipes, try both, and be frustrated that they give different results.
Method 1 gives the same answer as R; the difference is in method 2. Let's check, starting with R.
> x <- c(1, 1, 3, 3)
> t.test(x)
One Sample t-test
data: x
t = 3.4641, df = 3, p-value = 0.04052
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.1626138 3.8373862
sample estimates:
mean of x
2
# Method 1: use stats.sem for the standard error of the mean
import numpy as np
from scipy import stats

alpha = 0.95
data = [1, 1, 3, 3]
mean_val = np.mean(data)
sem_val = stats.sem(data)  # standard error of the mean (uses ddof=1)
ci = stats.t.interval(alpha, len(data)-1, loc=mean_val, scale=sem_val)
print(ci)
>> (0.16261376896260105, 3.837386231037399)
# Method 2: divide np.std (which defaults to ddof=0) by sqrt(n)
import math
import numpy as np
from scipy import stats

alpha = 0.95
data = [1, 1, 3, 3]
mean_val = np.mean(data)
std_val = np.std(data)  # population standard deviation (ddof=0)
ci = stats.t.interval(alpha, len(data)-1, loc=mean_val, scale=std_val/math.sqrt(len(data)))
print(ci)
>> (0.40877684735786857, 3.5912231526421312)
This time I decided to follow R, so I chose method 1.
What differs in method 2 is the very end of the line that computes ci:
math.sqrt(len(data))
That is, it divides by n. But np.std divides by n by default (ddof=0, the population standard deviation), while inferential statistics based on the t distribution want the sample standard deviation, which divides by n-1. In fact, if you change the divisor to
math.sqrt(len(data) - 1)
then the answer from method 2 also matches R exactly.
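As a quick check (a sketch on the same data, not code from the original post), the corrected method 2 looks like this and reproduces the R interval:
import math
import numpy as np
from scipy import stats

alpha = 0.95
data = [1, 1, 3, 3]
mean_val = np.mean(data)
std_val = np.std(data)  # ddof=0: divides by n
# dividing by sqrt(n - 1) compensates for ddof=0;
# equivalently, use np.std(data, ddof=1) / math.sqrt(len(data))
ci = stats.t.interval(alpha, len(data)-1, loc=mean_val, scale=std_val/math.sqrt(len(data)-1))
print(ci)  # approximately (0.1626, 3.8374), the same as R and method 1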