This is my first post, so I hope you will read it kindly. My background is in basic research at pharmaceutical companies, so my statistics and programming are entirely self-taught...
There were various typographical errors in the first version... if you tried to run the code and it didn't work, I'm sorry, and thank you to those who pointed out fixes.

**Dunnett's test**

In biological research, you often want to run a **Student's t-test and show a significant difference** (there is an argument that the expression "significant difference" should be retired (019-00857-9), but that is a topic for another occasion...).
When evaluating multiple groups there is the problem of multiplicity, so tests like Tukey's HSD are used a lot. Dunnett's test is another of these multiple-comparison procedures; it is the method of choice when you do not need all pairwise comparisons and only want to compare the **control group against each of the other groups**. I use it quite often because it **has higher power** than comparing all pairs.
Wikipedia describes it as follows:

> In statistics, Dunnett's test is one of the multiple comparison procedures [1]. It was developed by the Canadian statistician Charles Dunnett to compare each of a number of treatment groups with a single control group. Multiple comparisons against a control group are also called many-to-one comparisons. (Wikipedia)
My understanding is that it can be used **when the populations are normally distributed with equal variances**.
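Since those assumptions are easy to forget, here is a minimal sketch (my own aside, using scipy; not part of Dunnett's procedure itself) of how one might check normality and equal variances first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(10, 2, 6) for _ in range(3)]  # toy data

# Shapiro-Wilk for normality within each group
for g in groups:
    print(stats.shapiro(g).pvalue)

# Levene's test for equal variances across groups
print(stats.levene(*groups).pvalue)
```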
I used to analyze data in Excel, but these days I do my data analysis in Python (on Colaboratory), partly as personal practice. For statistical analysis, though, I would still export the data and use R via EZR or SAS via EXSUS.
Then, reading @issakuss's article (Python Statistical Techniques: Statistical Analysis with Python), I found a package called **[pingouin](https://pingouin-stats.org/api.html)**. It's wonderful! Thanks to the creator. With this, it seemed I could live on Python alone.
However... pingouin does not have Dunnett's test.
So I decided: let's implement it myself while studying the theory! (~~Well, if you look around, there is probably a package somewhere~~) To implement it I had to know the theory, so I did a light(!) bit of studying.
The way the test statistic T is calculated is similar to Tukey's method. With k groups numbered 1 to k and group 1 as the control, the **test statistic T** for comparing group i against the control is

```math
T_i = \frac{|\bar{x}_i - \bar{x}_1|}{\sqrt{V\left(\frac{1}{n_i}+\frac{1}{n_1}\right)}}
```

where $\bar{x}_i$ and $n_i$ are the mean and sample size of group i, and V is the pooled within-group variance. The **degrees of freedom v** at this point are $v = N - k$, with N the total number of observations. So what do we compare this test statistic T against? A...
table?! (laughs) Come to think of it, there were lots of tables in the back of university textbooks... The actual table is [here](https://www.weblio.jp/content/%E3%83%80%E3%83%8D%E3%83%83%E3%83%88%E3%81%AE%E8%A1%A8). pingouin apparently bundles Tukey's table and looks values up in it, but I didn't want to embed a table...
SAS, it turns out, has a **probmc function** and does not rely on a table. So I decided to implement this function (this is where most of the hard work went). Thankfully the formula is published in the SAS documentation, so I will implement it as written there.
The probmc function takes the following arguments (written simply from the docs; I can't actually write SAS):
```
probmc(distribution, q, prob, df, nparms <, parameters>)
```
argument | contents |
---|---|
distribution | chooses the distribution, including whether Dunnett is one-sided or two-sided. It can apparently be used for distributions other than Dunnett's as well. |
q, prob | q is the quantile (here we pass the test statistic T) and prob is the probability that the random variable is less than q, so the p-value can be computed as 1 − prob. For example, to get the critical value at a 5% significance level, set prob = 0.95. One of q and prob is passed as missing, and the missing one becomes the return value. This time I want the p-value, so I leave prob missing. |
df | the degrees of freedom v |
nparms | for Dunnett, the number of comparison groups, k − 1 |
parameters | λ12, λ13, ..., λ1k |
As for λ, it is a value computed from the sample sizes of the groups being compared; for unequal group sizes,

```math
\lambda_i = \frac{n_i}{n_1 + n_i}
```

(I will confirm later whether the unequal-size formula reduces to the same value when all sample sizes are equal; there is a quick numerical check in the sketch below.)
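As a minimal sketch (using the sample sizes of the example data later in this post), λ can be computed like this; note that with equal sample sizes every $\lambda_i$ comes out to the constant 0.5:

```python
import numpy as np

n_control = 4
n_treat = np.array([6, 5])  # sample sizes of the comparison groups

lambda_params = n_treat / (n_control + n_treat)
print(lambda_params)        # [0.6        0.55555556]

# Equal-n case: every lambda_i is n/(n + n) = 0.5, regardless of n
print(5 / (5 + 5))          # 0.5
```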
The prob obtained from the probmc function looks like this:
```
prob = probmc("dunnett2", T, ., v, k-1, <lambda1, lambda2, ...>)
```
And the formula that works inside it is the following:

```math
prob = \int_{0}^{\infty}\int_{-\infty}^{\infty} \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy\,d\mu_v(x)
```
If I can implement this formula, I'm done! (I didn't have the energy left to really understand its derivation...) Here, $\phi(x)$ is the probability density function (pdf) of the standard normal distribution and $\Phi(x)$ is its cumulative distribution function (cdf). Both can be implemented with scipy and numpy:
```python
import numpy as np
from scipy.stats import norm

# T and lambda_1, lambda_2, ... are placeholders to be filled in later
lambda_params = [lambda_1, lambda_2]

def pdf(x):
    return norm.pdf(x)

def cdf(x):
    return norm.cdf(x)

# First factor of the integrand: phi(y) times the product over the k-1 comparisons
def f(x, y):
    return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                             - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                             for lambda_i in lambda_params])
```
This takes care of the first factor of the integrand.
Also, $d\mu_v(x)$ is defined in SAS as follows:

```math
d\mu_v(x)=\frac{v^{\frac{v}{2}}}{\Gamma(\frac{v}{2})2^{\frac{v}{2}-1}}x^{v-1}\exp\left(-\frac{vx^2}{2}\right)dx
```
where $\Gamma(x)$ is the gamma function, for which scipy also has a function (it exists in math as well):
```python
from scipy.special import gamma

# Density factor of d(mu_v); v is the degrees of freedom
def du(v, x):
    return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
```
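As a quick sanity check (my addition, not from the SAS documentation): this density should integrate to 1 over (0, ∞), since $d\mu_v(x)$ is exactly the distribution of $\sqrt{\chi^2_v/v}$, i.e. a chi distribution scaled by $1/\sqrt{v}$:

```python
from scipy import integrate
from scipy.stats import chi

v = 9
# The density integrates to ~1.0 over (0, inf)
print(integrate.quad(lambda x: du(v, x), 0, np.inf)[0])

# It matches scipy's chi distribution with df=v and scale=1/sqrt(v)
print(du(v, 1.2))
print(chi.pdf(1.2, v, scale=1/np.sqrt(v)))
```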
**One problem here...** What exactly is the integration interval of the original formula? To get a feel for $\mu_v(x)$, let's plot its cumulative value:
```python
from scipy import integrate
import matplotlib.pyplot as plt

x = np.arange(-3, 3, 0.01)
plt.figure()
for v in range(9, 30, 3):  # vary the degrees of freedom v
    def duv(x):
        return du(v, x)
    # mu_v(x) = integral of the density from 0 to x
    y = [integrate.quad(duv, 0, xi)[0] for xi in x]
    plt.plot(x, y, label=f"v={v}")
plt.xlabel("x")
plt.ylabel(r"$\mu_v(x)$")
plt.legend()
plt.show()
```
Hmm...? **It is mirrored evenly onto the negative side?** That is a problem for swapping the integration interval... It looks like a sigmoid, so presumably it is meant to be used that way. Unfortunately I couldn't work out the intended meaning of the formula, so for now I decided to proceed with **x from 0 to ∞**.
Also, to cut down the computation, I looked at where the value saturates (the quoted values are for v = 9): $\mu_9(3) = 0.99999999999998976$ seems ample, so in the computation I restrict the range of x to 0 → 5. (For reference, $\mu_9(2) = 0.99999603534120194$.)
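For what it's worth, those saturation values can be reproduced directly with `quad` (assuming `du` as defined above):

```python
# mu_9(x) = integral of the density from 0 to x, for v = 9
print(integrate.quad(lambda x: du(9, x), 0, 3)[0])  # ~0.9999999999999898
print(integrate.quad(lambda x: du(9, x), 0, 2)[0])  # ~0.999996
```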
So the original formula,

```math
prob = \int_{0}^{\infty}\int_{-\infty}^{\infty} \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy\,d\mu_v(x)
```

translates into code as follows:
```python
def f_g(x, y):
    return f(x, y) * duv(x)

def dunnett_prob(T, v, lambda_params):
    # scipy's dblquad integrates func(y, x) with the *inner* variable first,
    # so swap the arguments: x runs over (0, inf), y over (-inf, inf).
    # (Here f and duv still read T, v and lambda_params from the enclosing
    # scope; the consolidated version below fixes the scoping.)
    return integrate.dblquad(lambda y, x: f_g(x, y),
                             0, np.inf,
                             lambda x: -np.inf, lambda x: np.inf)[0]
```
Now, to reduce the amount of computation, let's look at the graph of $f_{T\lambda}(x,y)\,g_v(x)$.
```python
T, v = 10, 9                 # set values arbitrarily for the plot
lambda_params = [0.5, 0.4]   # set values arbitrarily for the plot

y = np.arange(-10, 10, 0.1)
plt.figure()
for x in np.arange(0, 2, 0.5):
    z = [f_g(x, yi) for yi in y]
    plt.plot(y, z, label=f"x={x}")
plt.legend()
plt.xlabel("y")
plt.ylabel("$f(x,y)g(x)$")
#plt.ylim(0, 0.0001)
plt.show()
```
It looks like an even function in y! In other words,

```math
prob \approx 2 \int_{0}^{5}\int_{0}^{\infty} f_{T\lambda}(x,y)\,g_v(x)\,dy\,dx
```

where

```math
f_{T\lambda}(x,y) = \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]
```

and $g_v(x)$ is the density factor of $d\mu_v(x)$.
```python
def dunnett_prob(T, v, lambda_params):
    # Factor 2 from the symmetry in y; x is cut off at 5 as discussed above
    return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                 0, 5,
                                 lambda x: 0, lambda x: np.inf)[0]
```
So I think it can be summarized as follows. I wanted to organize it neatly with classes and the like, but I couldn't pull it off, so apologies for the mess.
**dunnett.py**

```python
import numpy as np
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma

def dunnett_prob(T, v, lambda_params):
    def pdf(x):
        return norm.pdf(x)
    def cdf(x):
        return norm.cdf(x)
    # phi(y) times the product of Phi terms over the k-1 comparisons
    def f(x, y):
        return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                                 - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                                 for lambda_i in lambda_params])
    # density factor of d(mu_v)
    def duv(x):
        return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
    def f_g(x, y):
        return f(x, y) * duv(x)
    # dblquad integrates func(y, x), inner variable first: x in (0, 5), y in (0, inf)
    return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                 0, 5,
                                 lambda x: 0, lambda x: np.inf)[0]
```
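One way to sanity-check `dunnett_prob` (my own check, not from the SAS docs): with a single comparison group, the two-sided Dunnett test reduces to an ordinary two-sided t-test, and in that case the inner integral collapses so the answer should not depend on λ:

```python
from scipy.stats import t

T_check, v_check = 2.0, 9
# Integral version with k-1 = 1 comparison (any lambda in (0,1) should give the same value)
print(dunnett_prob(T_check, v_check, [0.5]))
# Reference: P(|t_v| < T) from the t distribution, ~0.9236
print(2 * t.cdf(T_check, v_check) - 1)
```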
Let's create some suitable data and try it out. The scenario: do drug A and drug B produce an effect relative to the control group? First, creating the data:
```python
import pandas as pd

control = np.random.normal(10, scale=2, size=4)
drug_A = np.random.normal(10.5, scale=2, size=6)
drug_B = np.random.normal(14, scale=2, size=5)

df_c = pd.DataFrame(data=control, columns=["value"])
df_c["group"] = "control"
df_A = pd.DataFrame(data=drug_A, columns=["value"])
df_A["group"] = "drug_A"
df_B = pd.DataFrame(data=drug_B, columns=["value"])
df_B["group"] = "drug_B"
df = pd.concat([df_c, df_A, df_B], ignore_index=True)
```
value | group |
---|---|
11.389 | control |
7.154 | control |
7.932 | control |
14.729 | control |
10.507 | drug_A |
6.578 | drug_A |
6.580 | drug_A |
13.098 | drug_A |
10.131 | drug_A |
13.748 | drug_A |
15.245 | drug_B |
14.078 | drug_B |
17.533 | drug_B |
11.082 | drug_B |
15.413 | drug_B |
We'll proceed with this data. First, shaping the data: the test statistic T and λ have to be computed. I'll follow a similar style to pingouin's internals, with the thought of maybe contributing this to pingouin later.
```python
import pingouin as pg

def dunnett_two_side_unbalanced(data, dv, between, control):
    # First compute the ANOVA
    aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
    v = aov.at[1, 'DF']        # residual degrees of freedom
    ng = aov.at[0, 'DF'] + 1   # total number of groups
    grp = data.groupby(between)[dv]
    n_sub = grp.count()
    control_index = n_sub.index.get_loc(control)  # index of the control group
    n = n_sub.values           # sample size of each group
    gmeans = grp.mean().values # mean of each group
    gvar = aov.at[1, 'MS'] / n # variance of each group mean (pooled MS / n)
    vs_g = np.delete(np.arange(ng), control_index)  # indices other than the control group

    # Pairwise comparisons against the control (test statistic T)
    mn = np.abs(gmeans[control_index] - gmeans[vs_g])
    se = np.sqrt(gvar[control_index] + gvar[vs_g])  # differs slightly from the usual formula, but homoscedasticity is assumed
    tval = mn / se
    # lambda
    lambda_params = n[vs_g] / (n[control_index] + n[vs_g])
    pval = [1 - dunnett_prob(t, v, lambda_params) for t in tval]

    stats = pd.DataFrame({
        'A': [control] * len(vs_g),
        'B': np.unique(data[between])[vs_g],
        'mean(A)': [np.round(gmeans[control_index], 3)] * len(vs_g),
        'mean(B)': np.round(gmeans[vs_g], 3),
        'diff': np.round(mn, 3),
        'se': np.round(se, 3),
        'T': np.round(tval, 3),
        'p-dunnett': pval
    })
    return stats
```
```python
dunnett_two_side_unbalanced(df, "value", "group", "control")
```
A | B | mean(A) | mean(B) | diff | se | T | p-dunnett |
---|---|---|---|---|---|---|---|
control | drug_A | 10.30 | 10.11 | 0.194 | 1.917 | 0.101 | 0.99312 |
control | drug_B | 10.30 | 14.67 | 4.369 | 1.993 | 2.193 | 0.08992 |
That's the output. But is it actually right...? I'm completely worn out by now... Finally, let's run the same analysis in R via EZR. I can't write R, so please forgive me for showing only the output.
```
Multiple Comparisons of Means: Dunnett Contrasts

Fit: aov(formula = value ~ group.factor, data = Dataset)

Linear Hypotheses:
                       Estimate Std. Error t value Pr(>|t|)
drug_A - control == 0   -0.1939     1.9174  -0.101    0.992
drug_B - control == 0    4.3693     1.9926   2.193    0.084
```
I'm tired, so I'll save a proper review for later. R apparently uses a Monte Carlo method, so some difference in the values is expected, but a difference of about 0.005 is too large, isn't it?
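One concrete thing to check when I review (an assumption on my part, not verified here): some derivations parameterize the correlation between the contrasts as $\rho_{ij} = \lambda_i\lambda_j$ with $\lambda_i = \sqrt{n_i/(n_1+n_i)}$, i.e. the square root of the formula used above. Trying that is a one-line change inside `dunnett_two_side_unbalanced`:

```python
# Hypothetical experiment: square-root version of lambda
lambda_params = np.sqrt(n[vs_g] / (n[control_index] + n[vs_g]))
```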
Thank you for reading this far. As stated above, this was an amateur doing his best.
[Multiple comparison (Part 2-1: Specific methods (i)), 3rd Armitage study session, Material 2, Masaaki Doi](http://www012.upp.so-net.ne.jp/doi/biostat/CT39/multiple_comparison_2_1.pdf)
Putting everything together, the complete code:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma
import pingouin as pg

def dunnett_two_side_unbalanced(data, dv, between, control):
    def dunnett_prob(T, v, lambda_params):
        def pdf(x):
            return norm.pdf(x)
        def cdf(x):
            return norm.cdf(x)
        def f(x, y):
            return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                                     - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                                     for lambda_i in lambda_params])
        def duv(x):
            return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
        def f_g(x, y):
            return f(x, y) * duv(x)
        # dblquad integrates func(y, x), inner variable first: x in (0, 5), y in (0, inf)
        return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                     0, 5,
                                     lambda x: 0, lambda x: np.inf)[0]

    # First compute the ANOVA
    aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
    v = aov.at[1, 'DF']        # residual degrees of freedom
    ng = aov.at[0, 'DF'] + 1   # total number of groups
    grp = data.groupby(between)[dv]
    n_sub = grp.count()
    control_index = n_sub.index.get_loc(control)  # index of the control group
    n = n_sub.values           # sample size of each group
    gmeans = grp.mean().values # mean of each group
    gvar = aov.at[1, 'MS'] / n # variance of each group mean (pooled MS / n)
    vs_g = np.delete(np.arange(ng), control_index)  # indices other than the control group
    # Pairwise comparisons against the control (test statistic T)
    mn = np.abs(gmeans[control_index] - gmeans[vs_g])
    se = np.sqrt(gvar[control_index] + gvar[vs_g])  # assumes homoscedasticity
    tval = mn / se
    # lambda
    lambda_params = n[vs_g] / (n[control_index] + n[vs_g])
    pval = [1 - dunnett_prob(t, v, lambda_params) for t in tval]
    stats = pd.DataFrame({
        'A': [control] * len(vs_g),
        'B': np.unique(data[between])[vs_g],
        'mean(A)': [np.round(gmeans[control_index], 3)] * len(vs_g),
        'mean(B)': np.round(gmeans[vs_g], 3),
        'diff': np.round(mn, 3),
        'se': np.round(se, 3),
        'T': np.round(tval, 3),
        'p-dunnett': pval
    })
    return stats
```