This is my first post, so I hope you will read it kindly. My background is in basic research at pharmaceutical companies, so my statistics and programming are entirely self-taught...
There were various typographical errors in the first version... if you tried to run the code and it didn't work, I'm sorry, and thank you to those who pointed out fixes.

**Dunnett's test**

In biological research, you often want to run a **Student's t-test and show a significant difference** (there is an argument that the expression "significant difference" should be retired (019-00857-9), but that is a topic for another occasion...).
When evaluating multiple groups there is the problem of multiplicity, so tests like Tukey's HSD are used a lot. Dunnett's test is another of these multiple-comparison procedures; it is the method of choice when you do not need all pairwise comparisons and only want to compare the **control group against each of the other groups**. I use it quite often because it **has higher power** than comparing all pairs.
Wikipedia describes it as follows:

> In statistics, Dunnett's test is one of the multiple comparison procedures [1]. It was developed by the Canadian statistician Charles Dunnett to compare each of a number of treatment groups with a single control group. Multiple comparisons against a control group are also called many-to-one comparisons. (Wikipedia)
My understanding is that it can be used **when the populations are normally distributed with equal variances**.
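Since those assumptions are easy to forget, here is a minimal sketch (my own aside, using scipy; not part of Dunnett's procedure itself) of how one might check normality and equal variances first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(10, 2, 6) for _ in range(3)]  # toy data

# Shapiro-Wilk for normality within each group
for g in groups:
    print(stats.shapiro(g).pvalue)

# Levene's test for equal variances across groups
print(stats.levene(*groups).pvalue)
```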
I used to analyze data in Excel, but these days I do my data analysis in Python (on Colaboratory), partly as personal practice. For statistical analysis, though, I would still export the data and use R via EZR or SAS via EXSUS.
Then, reading @issakuss's article (Python Statistical Techniques: Statistical Analysis with Python), I found a package called **[pingouin](https://pingouin-stats.org/api.html)**. It's wonderful! Thanks to the creator. With this, it seemed I could live on Python alone.
However... pingouin does not have Dunnett's test.
So I decided: let's implement it myself while studying the theory! (~~Well, if you look around, there is probably a package somewhere~~) To implement it I had to know the theory, so I did a light(!) bit of studying.
The way the test statistic T is calculated is similar to Tukey's method. With k groups numbered 1 to k and group 1 as the control, the **test statistic T** for comparing group i against the control is

```math
T_i = \frac{|\bar{x}_i - \bar{x}_1|}{\sqrt{V\left(\frac{1}{n_i}+\frac{1}{n_1}\right)}}
```

where $\bar{x}_i$ and $n_i$ are the mean and sample size of group i, and V is the pooled within-group variance. The **degrees of freedom v** at this point are $v = N - k$, with N the total number of observations. So what do we compare this test statistic T against? A...
table?! (laughs) Come to think of it, there were lots of tables in the back of university textbooks... The actual table is [here](https://www.weblio.jp/content/%E3%83%80%E3%83%8D%E3%83%83%E3%83%88%E3%81%AE%E8%A1%A8). pingouin apparently bundles Tukey's table and looks values up in it, but I didn't want to embed a table...
SAS, it turns out, has a **probmc function** and does not rely on a table. So I decided to implement this function (this is where most of the hard work went). Thankfully the formula is published in the SAS documentation, so I will implement it as written there.
The probmc function takes the following arguments (written simply from the docs; I can't actually write SAS):
```
probmc(distribution, q, prob, df, nparms <, parameters>)
```
argument | contents |
---|---|
distribution | chooses the distribution, including whether Dunnett is one-sided or two-sided. It can apparently be used for distributions other than Dunnett's as well. |
q, prob | q is the quantile (here we pass the test statistic T) and prob is the probability that the random variable is less than q, so the p-value can be computed as 1 − prob. For example, to get the critical value at a 5% significance level, set prob = 0.95. One of q and prob is passed as missing, and the missing one becomes the return value. This time I want the p-value, so I leave prob missing. |
df | the degrees of freedom v |
nparms | for Dunnett, the number of comparison groups, k − 1 |
parameters | λ12, λ13, ..., λ1k |
As for λ, it is a value computed from the sample sizes of the groups being compared; for unequal group sizes,

```math
\lambda_i = \frac{n_i}{n_1 + n_i}
```

(I will confirm later whether the unequal-size formula reduces to the same value when all sample sizes are equal; there is a quick numerical check in the sketch below.)
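As a minimal sketch (using the sample sizes of the example data later in this post), λ can be computed like this; note that with equal sample sizes every $\lambda_i$ comes out to the constant 0.5:

```python
import numpy as np

n_control = 4
n_treat = np.array([6, 5])  # sample sizes of the comparison groups

lambda_params = n_treat / (n_control + n_treat)
print(lambda_params)        # [0.6        0.55555556]

# Equal-n case: every lambda_i is n/(n + n) = 0.5, regardless of n
print(5 / (5 + 5))          # 0.5
```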
The prob obtained from the probmc function looks like this:
```
prob = probmc("dunnett2", T, ., v, k-1, <lambda1, lambda2, ...>)
```
And the formula that works inside it is the following:

```math
prob = \int_{0}^{\infty}\int_{-\infty}^{\infty} \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy\,d\mu_v(x)
```
If I can implement this formula, I'm done! (I didn't have the energy left to really understand its derivation...) Here, $\phi(x)$ is the probability density function (pdf) of the standard normal distribution and $\Phi(x)$ is its cumulative distribution function (cdf). Both can be implemented with scipy and numpy:
```python
import numpy as np
from scipy.stats import norm

# T and lambda_1, lambda_2, ... are placeholders to be filled in later
lambda_params = [lambda_1, lambda_2]

def pdf(x):
    return norm.pdf(x)

def cdf(x):
    return norm.cdf(x)

# First factor of the integrand: phi(y) times the product over the k-1 comparisons
def f(x, y):
    return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                             - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                             for lambda_i in lambda_params])
```
This takes care of the first factor of the integrand.
Also, $d\mu_v(x)$ is defined in SAS as follows:

```math
d\mu_v(x)=\frac{v^{\frac{v}{2}}}{\Gamma(\frac{v}{2})2^{\frac{v}{2}-1}}x^{v-1}\exp\left(-\frac{vx^2}{2}\right)dx
```
where $\Gamma(x)$ is the gamma function, for which scipy also has a function (it exists in math as well):
```python
from scipy.special import gamma

# Density factor of d(mu_v); v is the degrees of freedom
def du(v, x):
    return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
```
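As a quick sanity check (my addition, not from the SAS documentation): this density should integrate to 1 over (0, ∞), since $d\mu_v(x)$ is exactly the distribution of $\sqrt{\chi^2_v/v}$, i.e. a chi distribution scaled by $1/\sqrt{v}$:

```python
from scipy import integrate
from scipy.stats import chi

v = 9
# The density integrates to ~1.0 over (0, inf)
print(integrate.quad(lambda x: du(v, x), 0, np.inf)[0])

# It matches scipy's chi distribution with df=v and scale=1/sqrt(v)
print(du(v, 1.2))
print(chi.pdf(1.2, v, scale=1/np.sqrt(v)))
```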
**One problem here...** What exactly is the integration interval of the original formula? To get a feel for $\mu_v(x)$, let's plot its cumulative value:
```python
from scipy import integrate
import matplotlib.pyplot as plt

x = np.arange(-3, 3, 0.01)
plt.figure()
for v in range(9, 30, 3):  # vary the degrees of freedom v
    def duv(x):
        return du(v, x)
    # mu_v(x) = integral of the density from 0 to x
    y = [integrate.quad(duv, 0, xi)[0] for xi in x]
    plt.plot(x, y, label=f"v={v}")
plt.xlabel("x")
plt.ylabel(r"$\mu_v(x)$")
plt.legend()
plt.show()
```
Hmm...? **It is mirrored evenly onto the negative side?** That is a problem for swapping the integration interval... It looks like a sigmoid, so presumably it is meant to be used that way. Unfortunately I couldn't work out the intended meaning of the formula, so for now I decided to proceed with **x from 0 to ∞**.
Also, to cut down the computation, I looked at where the value saturates (the quoted values are for v = 9): $\mu_9(3) = 0.99999999999998976$ seems ample, so in the computation I restrict the range of x to 0 → 5. (For reference, $\mu_9(2) = 0.99999603534120194$.)
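For what it's worth, those saturation values can be reproduced directly with `quad` (assuming `du` as defined above):

```python
# mu_9(x) = integral of the density from 0 to x, for v = 9
print(integrate.quad(lambda x: du(9, x), 0, 3)[0])  # ~0.9999999999999898
print(integrate.quad(lambda x: du(9, x), 0, 2)[0])  # ~0.999996
```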
So the original formula,

```math
prob = \int_{0}^{\infty}\int_{-\infty}^{\infty} \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy\,d\mu_v(x)
```

translates into code as follows:
```python
def f_g(x, y):
    return f(x, y) * duv(x)

def dunnett_prob(T, v, lambda_params):
    # scipy's dblquad integrates func(y, x) with the *inner* variable first,
    # so swap the arguments: x runs over (0, inf), y over (-inf, inf).
    # (Here f and duv still read T, v and lambda_params from the enclosing
    # scope; the consolidated version below fixes the scoping.)
    return integrate.dblquad(lambda y, x: f_g(x, y),
                             0, np.inf,
                             lambda x: -np.inf, lambda x: np.inf)[0]
```
Now, to reduce the amount of computation, let's look at the graph of $f_{T\lambda}(x,y)\,g_v(x)$.
```python
T, v = 10, 9                 # set values arbitrarily for the plot
lambda_params = [0.5, 0.4]   # set values arbitrarily for the plot

y = np.arange(-10, 10, 0.1)
plt.figure()
for x in np.arange(0, 2, 0.5):
    z = [f_g(x, yi) for yi in y]
    plt.plot(y, z, label=f"x={x}")
plt.legend()
plt.xlabel("y")
plt.ylabel("$f(x,y)g(x)$")
#plt.ylim(0, 0.0001)
plt.show()
```
It looks like an even function in y! In other words,

```math
prob \approx 2 \int_{0}^{5}\int_{0}^{\infty} f_{T\lambda}(x,y)\,g_v(x)\,dy\,dx
```

where

```math
f_{T\lambda}(x,y) = \phi(y)\prod_{i=1}^{k-1}\biggl[\Phi\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- \Phi\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]
```

and $g_v(x)$ is the density factor of $d\mu_v(x)$.
```python
def dunnett_prob(T, v, lambda_params):
    # Factor 2 from the symmetry in y; x is cut off at 5 as discussed above
    return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                 0, 5,
                                 lambda x: 0, lambda x: np.inf)[0]
```
So I think it can be summarized as follows. I wanted to organize it neatly with classes and the like, but I couldn't pull it off, so apologies for the mess.
**dunnett.py**

```python
import numpy as np
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma

def dunnett_prob(T, v, lambda_params):
    def pdf(x):
        return norm.pdf(x)
    def cdf(x):
        return norm.cdf(x)
    # phi(y) times the product of Phi terms over the k-1 comparisons
    def f(x, y):
        return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                                 - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                                 for lambda_i in lambda_params])
    # density factor of d(mu_v)
    def duv(x):
        return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
    def f_g(x, y):
        return f(x, y) * duv(x)
    # dblquad integrates func(y, x), inner variable first: x in (0, 5), y in (0, inf)
    return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                 0, 5,
                                 lambda x: 0, lambda x: np.inf)[0]
```
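One way to sanity-check `dunnett_prob` (my own check, not from the SAS docs): with a single comparison group, the two-sided Dunnett test reduces to an ordinary two-sided t-test, and in that case the inner integral collapses so the answer should not depend on λ:

```python
from scipy.stats import t

T_check, v_check = 2.0, 9
# Integral version with k-1 = 1 comparison (any lambda in (0,1) should give the same value)
print(dunnett_prob(T_check, v_check, [0.5]))
# Reference: P(|t_v| < T) from the t distribution, ~0.9236
print(2 * t.cdf(T_check, v_check) - 1)
```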
Let's create some suitable data and try it out. The scenario: do drug A and drug B produce an effect relative to the control group? First, creating the data:
```python
import pandas as pd

control = np.random.normal(10, scale=2, size=4)
drug_A = np.random.normal(10.5, scale=2, size=6)
drug_B = np.random.normal(14, scale=2, size=5)

df_c = pd.DataFrame(data=control, columns=["value"])
df_c["group"] = "control"
df_A = pd.DataFrame(data=drug_A, columns=["value"])
df_A["group"] = "drug_A"
df_B = pd.DataFrame(data=drug_B, columns=["value"])
df_B["group"] = "drug_B"
df = pd.concat([df_c, df_A, df_B], ignore_index=True)
```
value | group |
---|---|
11.389 | control |
7.154 | control |
7.932 | control |
14.729 | control |
10.507 | drug_A |
6.578 | drug_A |
6.580 | drug_A |
13.098 | drug_A |
10.131 | drug_A |
13.748 | drug_A |
15.245 | drug_B |
14.078 | drug_B |
17.533 | drug_B |
11.082 | drug_B |
15.413 | drug_B |
We'll proceed with this data. First, shaping the data: the test statistic T and λ have to be computed. I'll follow a similar style to pingouin's internals, with the thought of maybe contributing this to pingouin later.
```python
import pingouin as pg

def dunnett_two_side_unbalanced(data, dv, between, control):
    # First compute the ANOVA
    aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
    v = aov.at[1, 'DF']        # residual degrees of freedom
    ng = aov.at[0, 'DF'] + 1   # total number of groups
    grp = data.groupby(between)[dv]
    n_sub = grp.count()
    control_index = n_sub.index.get_loc(control)  # index of the control group
    n = n_sub.values           # sample size of each group
    gmeans = grp.mean().values # mean of each group
    gvar = aov.at[1, 'MS'] / n # variance of each group mean (pooled MS / n)
    vs_g = np.delete(np.arange(ng), control_index)  # indices other than the control group

    # Pairwise comparisons against the control (test statistic T)
    mn = np.abs(gmeans[control_index] - gmeans[vs_g])
    se = np.sqrt(gvar[control_index] + gvar[vs_g])  # differs slightly from the usual formula, but homoscedasticity is assumed
    tval = mn / se
    # lambda
    lambda_params = n[vs_g] / (n[control_index] + n[vs_g])
    pval = [1 - dunnett_prob(t, v, lambda_params) for t in tval]

    stats = pd.DataFrame({
        'A': [control] * len(vs_g),
        'B': np.unique(data[between])[vs_g],
        'mean(A)': [np.round(gmeans[control_index], 3)] * len(vs_g),
        'mean(B)': np.round(gmeans[vs_g], 3),
        'diff': np.round(mn, 3),
        'se': np.round(se, 3),
        'T': np.round(tval, 3),
        'p-dunnett': pval
    })
    return stats
```
```python
dunnett_two_side_unbalanced(df, "value", "group", "control")
```
A | B | mean(A) | mean(B) | diff | se | T | p-dunnett |
---|---|---|---|---|---|---|---|
control | drug_A | 10.30 | 10.11 | 0.194 | 1.917 | 0.101 | 0.99312 |
control | drug_B | 10.30 | 14.67 | 4.369 | 1.993 | 2.193 | 0.08992 |
That's the output. But is it actually right...? I'm completely worn out by now... Finally, let's run the same analysis in R via EZR. I can't write R, so please forgive me for showing only the output.
```
Multiple Comparisons of Means: Dunnett Contrasts

Fit: aov(formula = value ~ group.factor, data = Dataset)

Linear Hypotheses:
                       Estimate Std. Error t value Pr(>|t|)
drug_A - control == 0   -0.1939     1.9174  -0.101    0.992
drug_B - control == 0    4.3693     1.9926   2.193    0.084
```
I'm tired, so I'll save a proper review for later. R apparently uses a Monte Carlo method, so some difference in the values is expected, but a difference of about 0.005 is too large, isn't it?
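One concrete thing to check when I review (an assumption on my part, not verified here): some derivations parameterize the correlation between the contrasts as $\rho_{ij} = \lambda_i\lambda_j$ with $\lambda_i = \sqrt{n_i/(n_1+n_i)}$, i.e. the square root of the formula used above. Trying that is a one-line change inside `dunnett_two_side_unbalanced`:

```python
# Hypothetical experiment: square-root version of lambda
lambda_params = np.sqrt(n[vs_g] / (n[control_index] + n[vs_g]))
```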
Thank you for reading this far. As stated above, this was an amateur doing his best.
[Multiple comparison (Part 2-1: Specific methods (i)), 3rd Armitage study session, Material 2, Masaaki Doi](http://www012.upp.so-net.ne.jp/doi/biostat/CT39/multiple_comparison_2_1.pdf)
Putting everything together, the complete code:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma
import pingouin as pg

def dunnett_two_side_unbalanced(data, dv, between, control):
    def dunnett_prob(T, v, lambda_params):
        def pdf(x):
            return norm.pdf(x)
        def cdf(x):
            return norm.cdf(x)
        def f(x, y):
            return pdf(y) * np.prod([cdf((lambda_i*y + T*x) / np.sqrt(1 - lambda_i**2))
                                     - cdf((lambda_i*y - T*x) / np.sqrt(1 - lambda_i**2))
                                     for lambda_i in lambda_params])
        def duv(x):
            return (v**(v/2) * x**(v-1) * np.exp(-v*x**2/2)) / (gamma(v/2) * 2**(v/2-1))
        def f_g(x, y):
            return f(x, y) * duv(x)
        # dblquad integrates func(y, x), inner variable first: x in (0, 5), y in (0, inf)
        return 2 * integrate.dblquad(lambda y, x: f_g(x, y),
                                     0, 5,
                                     lambda x: 0, lambda x: np.inf)[0]

    # First compute the ANOVA
    aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
    v = aov.at[1, 'DF']        # residual degrees of freedom
    ng = aov.at[0, 'DF'] + 1   # total number of groups
    grp = data.groupby(between)[dv]
    n_sub = grp.count()
    control_index = n_sub.index.get_loc(control)  # index of the control group
    n = n_sub.values           # sample size of each group
    gmeans = grp.mean().values # mean of each group
    gvar = aov.at[1, 'MS'] / n # variance of each group mean (pooled MS / n)
    vs_g = np.delete(np.arange(ng), control_index)  # indices other than the control group
    # Pairwise comparisons against the control (test statistic T)
    mn = np.abs(gmeans[control_index] - gmeans[vs_g])
    se = np.sqrt(gvar[control_index] + gvar[vs_g])  # assumes homoscedasticity
    tval = mn / se
    # lambda
    lambda_params = n[vs_g] / (n[control_index] + n[vs_g])
    pval = [1 - dunnett_prob(t, v, lambda_params) for t in tval]
    stats = pd.DataFrame({
        'A': [control] * len(vs_g),
        'B': np.unique(data[between])[vs_g],
        'mean(A)': [np.round(gmeans[control_index], 3)] * len(vs_g),
        'mean(B)': np.round(gmeans[vs_g], 3),
        'diff': np.round(mn, 3),
        'se': np.round(se, 3),
        'T': np.round(tval, 3),
        'p-dunnett': pval
    })
    return stats
```