I want to do Dunnett's test in Python

Introduction

This is my first post. I hope you will read it with a warm feeling. Background is a basic researcher working in a lot of pharmaceutical companies, so statistical knowledge and programming are self-taught ...

Postscript (2020/02/14)

There were various typographical errors ... I'm sorry if you tried to move it and it didn't work ... I should have been able to fix it, so thank you. Dunnett's test In biological research, you often want to do ** Student's t-test and show a significant difference ** (The expression significant difference should be eliminated. 019-00857-9) is another opportunity ...).

When evaluating between multiple groups, there is a problem of multiplicity, so there are many opportunities to use Tukey-HSD test etc. Dunnett's test is also one of the multi-group pair comparison tests, and it is a method used when you do not need to perform all pair comparisons and you want to compare ** control group (or control group) with other groups **. I use it quite a lot because it ** has higher detection power ** than comparing all groups.

On Wikipedia

In statistics, Dunnett's test (Dunnett's test) is one of the multiple comparison procedures [1]. Developed by Canadian statistician Charles Danette to compare each of the many treatment groups with a single control group. Multiple comparisons to the control group are also called many-to-one comparisons. Wikipedia

It has been described as.

I understand that you can use ** when the population is normally distributed and evenly distributed **.

I want to do it all with python!

I used excel to analyze data, but now I am doing data analysis with python (using Colaboratory) because I want to practice python personally. However, for statistical analysis, I brought in data and used R with EZR and SAS with EXSUS.

Please read @ issakuss's article (Python Statistical Techniques-Statistical Analysis Against Python-), where ** [pingwouin](https: / I found a package called /pingouin-stats.org/api.html#)**.

Most of the test methods can be used! !! !!

It's wonderful! !! Thanks to the creator. With this, it seemed that I could live on python alone.

However…

There is no Dunnett's test in pingouin ...

So, let's implement it by ourselves while studying! I decided to do that. (~~ Well, if you look for it, there may be a package somewhere ~~)

policy

I had to know the theory to implement it myself, so lightly! !! studied.

The method for calculating the test statistic T is similar to Tukey. Assuming that there are k groups from 1 to i, the ** test statistic T ** is 1 with the control group as 1.

T{1i}=\frac{|μ_1-μ_i|}{\sqrt{(\frac{1}{n_1}+\frac{1}{n_i})\sigma^2}}

However -$ μ : Average - n : Number of samples - σ ^ 2 $: Covariance (I think it's common sense because I often didn't understand the definition of characters, but let me describe it).

At this time, ** degrees of freedom v ** $ v = \sum_{i=1}^{k}n_i - k$ (That is, (total number of samples)-(total number of groups)).

Compared to this test statistic T ...

This is Danette's table! !! !! ………?

table? It was like (laughs) Speaking of which, there were a lot of tables on the back of university textbooks ... The actual table is [this](https://www.weblio.jp/content/%E3%83%80%E3%83%8D%] E3% 83% 83% E3% 83% 88% E3% 81% AE% E8% A1% A8). In pingouin, it seemed that Tukey's table was included and referenced, but I didn't like to include the table ...

SAS is doing something amazing

It seems that SAS uses a ** probmc function ** and does not use a table. So I decided to implement this function (most of the time I had a hard time). Thankfully, the formula was listed in here, so I will implement it as it is. I will.

The probmc function works with the following arguments (simple to write, I don't know SAS).

probmc(distribution, q, prob, df, nparms <, parameters>)

argument Contents
distribution Choose whether Dunnett is one-sided or two-sided. It seems that it can be used for distributions other than Dunnett.
q, prob q is the quantile (here we enter the test statistic T) and prob is the probability. prob is the probability that a random variable will be less than q. Therefore, the p-value can be calculated as 1 – prob. For example, significance level 5%To calculate the critical value of prob= 0.Set to 95. Either is missing and used as an argumentThe missing argument is the return valueIt will be. This time, I want to know the p-value, so I will delete the prob.
df Degrees of freedom v
nparms In the case of Danette, the total number of groups is k-1
parameters λ12,λ13,...λ1i

is. What is λ, but it is a value calculated from the number of samples to be compared. $ \lambda_{1i} = \sqrt{\frac{n_i}{n_1+n_i}} $ It will be. If the number of samples in each group is the same, parameters will not be passed, but in the calculation that if you implement the one that seems to be the most complicated, you can do other things, ** If the number of samples in each group is different ** Implement.

(I will confirm later whether the formula will be the same if the number of samples is the same for the formula when the number of samples in each group is different)

probmc function (both sides dunnett with different sample numbers in each group)

The prob obtained by the probmc function looks like this

prob = probmc("dunnet2", T, ., v, k-1, <lambda1,lambda2...>)


 And the formula that works inside this is as follows.

```math
prob = \int_{0}^{∞}\int_{-∞}^{∞} ϕ(y)∏_{i=1}^{k-1}\biggl[Φ\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- Φ\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy d\mu_v(x)

If you can implement this formula, you're done! (I didn't have the energy to understand the contents ...)

here $ \ phi (x) $: Probability density function of normal distribution (pdf) $ \ Phi (x) $: Cumulative density function (cdf) It will be. These can be implemented using scipy and numpy.

import numpy as np
from scipy.stats import norm

lambda_params = [lambda_1,lambda_2]
T=T


def pdf(x):
  return norm.pdf(x)

def cdf(x):
  return norm.cdf(x)

def f(x,y):
  return pdf(y)*np.prod([cdf((lambda_i*y+T*x)/(np.sqrt(1-lambda_i**2)))-cdf((lambda_i*y-T*x)/(np.sqrt(1-lambda_i**2))) for lambda_i in lambda_params])

with this $ prob=\int_{0}^{∞}\int_{-∞}^{∞}f_{T\lambda}(x,y)dyd\mu_v(x) $

It will be.

Also, $ d \ mu_v (x) $ is defined in SAS as follows:

d\mu_v(x)=\frac{ν^{\frac{v}{2}}}{\Gamma(\frac{v}{2})2^{\frac{v}{2}-1}}x^{v-1}exp(-\frac{vx^2}{2})dx

here $ \ Gamma (x) $: Gamma function And there is also a function in scipy (also present in math).

from scipy.special import gamma

v=v

def du(v,x):
  return (v**(v/2)*x**(v-1)*np.exp(-v*x**2/2))/(gamma(v/2)*2**(v/2-1))

** One problem here ... ** The integration interval of the original equation is $\int_{0}^{∞}d\mu_v(x)$ It was, but this $\int_{0}^{∞}d\mu_v(x)=\int_{?}^{?}g_v(x)dx $ Looking at the graph when trying to convert to ...

from scipy import integrate
import matplotlib.pyplot as plt

x=np.arange(-3,3,0.01)

plt.figure()
for v in range(9,30,3): #change v

  def duv(x):
    return du(v,x)

  y = [integrate.quad(duv,0,xi)[0] for xi in x]

  plt.plot(x,y,label=i)
plt.xlabel("x")
plt.ylabel("$\mu_v(x)$")
plt.legend()
plt.show()

ダウンロード (1).png

Hmm…? ** Is it evenly reversed on the minus side? ** This is a problem in replacing the integration interval ... It looks like a sigmoid function, so I'm probably using it in that way. Unfortunately, I couldn't understand the meaning of the formula, so I decided to proceed with the range of ** x from 0 to ∞ ** for the time being.

Also, looking at the precision when the value is the largest in order to reduce the calculation, $ \ mu_9 (3) = 0.99999999999998976 $ seems to be sufficient, so the range of x is 0 → 5 in the calculation. (By the way, $ \ mu_9 (2) = 0.99999603534120194 $)

Ready!

So the original formula,

prob = \int_{0}^{∞}\int_{-∞}^{∞} ϕ(y)∏_{i=1}^{k-1}\biggl[Φ\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- Φ\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]dy d\mu_v(x)

Is transformed $ prob=\int_{0}^{∞}\int_{-∞}^{∞}f_{T\lambda}(x,y)g_v(x)dydx $ have become. Implement this

def f_g(x,y):
  return f(x,y) * duv(x)

def dunnett_prob(T,v,lambda_params):
  T=T
  v=v
  lambda_params=lambda_params
  return integrate.dblquad(f_g,0,np.inf,lambda x:-np.inf,lambda x:np.inf)[0]

Now let's look at the graph of $ f_ {T \ lambda} (x, y) g_v (x) $ to reduce the amount of calculation.

T, v = 10,9 #Set the value appropriately
lambda_param =[0.5,0.4] #Set the value appropriately

y = np.arange(-10,10,0.1)

plt.figure()
for x in np.arange(0,2,0.5):
  z = [f_g(x,yi) for yi in y]
  plt.plot(y,z,label=x)
plt.legend()
plt.xlabel("y")
plt.ylabel("$f(x,y)g(x)$")
#plt.ylim(0,0.0001)
plt.show()

ダウンロード (2).png

It looks like an even function! In other words

prob ≒ 2 \int_{0}^{5}\int_{0}^{∞}f_{T\lambda}(x,y)g_v(x)dx

However

f_{T\lambda}(x,y) = ϕ(y)∏_{i=1}^{k-1}\biggl[Φ\biggl(\frac{\lambda_iy+Tx}{\sqrt{1-\lambda_i^2}}\biggr)- Φ\biggl(\frac{\lambda_iy-Tx}{\sqrt{1-\lambda_i^2}}\biggr)\biggr]
g_v(x)=\frac{ν^{\frac{v}{2}}}{\Gamma(\frac{v}{2})2^{\frac{v}{2}-1}}x^{v-1}exp(-\frac{vx^2}{2})

It will be.

def dunnett_prob(T,v,lambda_params):
  T=T
  v=v
  lambda_params=lambda_params
  return 2*integrate.dblquad(f_g,0,3,lambda x:0,lambda x:np.inf)[0]

Complete! !! ...?

So I think it will be summarized as follows. I wanted to organize it neatly using classes and so on, but I'm sorry it's dirty because I can't do it well.

dunnett.py


import numpy as np
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma

def dunnett_prob(T,v,lambda_params):

  def pdf(x):
    return norm.pdf(x)

  def cdf(x):
    return norm.cdf(x)

  def f(x,y):
    return pdf(y)*np.prod([cdf((lambda_i*y+T*x)/(np.sqrt(1-lambda_i**2)))-cdf((lambda_i*y-T*x)/(np.sqrt(1-lambda_i**2))) for lambda_i in lambda_params])

  def duv(x):
    return (v**(v/2)*x**(v-1)*np.exp(-v*x**2/2))/(gamma(v/2)*2**(v/2-1))

  def f_g(x,y):
    return f(x,y) * duv(x)

  return 2*integrate.dblquad(f_g,0,5,lambda x:0,lambda x:np.inf)[0]

I actually used it

Create appropriate data and try it out. Consider the pattern of whether drug A and drug B give effective values for the control group. First from data creation

import pandas as pd

control = np.random.normal(10,scale=2,size=4)
drug_A = np.random.normal(10.5,scale=2,size=6)
drug_B = np.random.normal(14,scale=2,size=5)


df_c = pd.DataFrame(data=control,columns=["value"])
df_c["group"] = "control"
df_A = pd.DataFrame(data=drug_A,columns=["value"])
df_A["group"] = "drug_A"
df_B = pd.DataFrame(data=drug_B,columns=["value"])
df_B["group"] = "drug_B"
df = pd.concat([df_c,df_A,df_B],ignore_index=True)
value group
11.389 control
7.154 control
7.932 control
14.729 control
10.507 drug_A
6.578 drug_A
6.580 drug_A
13.098 drug_A
10.131 drug_A
13.748 drug_A
15.245 drug_B
14.078 drug_B
17.533 drug_B
11.082 drug_B
15.413 drug_B

We will go for this data. First of all, from data molding. You need to calculate the test statistic T and λ. I would like to take a similar method, thinking of putting it on pigouin later.

import pingouin as pg

def dunnett_two_side_unbalanced(data, dv, between,control):

  #First compute the ANOVA
  aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
  v = aov.at[1, 'DF'] #Degree of freedom
  ng = aov.at[0, 'DF'] + 1 #Total number
  grp = data.groupby(between)[dv]
  n_sub = grp.count()
  control_index = n_sub.index.get_loc(control) #Get control group index
  n = n_sub.values #The number of samples
  gmeans = grp.mean().values#Each average
  gvar = aov.at[1, 'MS'] / n #Each variance

  vs_g = np.delete(np.arange(ng),control_index) #Get index other than control group

  #Pairwise combinations (find test statistic T)
  mn = np.abs(gmeans[control_index]- gmeans[vs_g])
  se = np.sqrt(gvar[control_index] + gvar[vs_g]) #The formula is a little different, but because it assumes homoscedasticity
  tval = mn / se

  #Find lambda
  lambda_params = n[vs_g]/(n[control_index]+n[vs_g])

  pval = [1-dunnett_prob(t,v,lambda_params) for t in tval]

  stats = pd.DataFrame({
                      'A': [control]*len(vs_g),
                      'B': np.unique(data[between])[vs_g],
                      'mean(A)': np.round(gmeans[control_index], 3)*len(vs_g),
                      'mean(B)': np.round(gmeans[vs_g], 3),
                      'diff': np.round(mn, 3),
                      'se': np.round(se, 3),
                      'T': np.round(tval, 3),
                      'p-dunnett': pval
                      })
  return stats

dunnett_two_side_unbalanced(df,"value","group","control")

A B mean(A) mean(B) diff se T p-dunnett
control drug_A 10.30 10.11 0.194 1.917 0.101 0.99312
control drug_B 10.30 14.67 4.369 1.993 2.193 0.08992

have become. Is it really right ...?

Try it with EZR

I'm completely tired ... Finally, I would like to do the same analysis with R using EZR. R cannot be written, so please forgive me only for the output.


Multiple Comparisons of Means: Dunnett Contrasts


Fit: aov(formula = value ~ group.factor, data = Dataset)

Linear Hypotheses:
                      Estimate Std. Error t value Pr(>|t|)  
drug_A - control == 0  -0.1939     1.9174  -0.101    0.992  
drug_B - control == 0   4.3693     1.9926   2.193    0.084

that? ?? ?? Is it a little different? ??

I'm tired, so I'll do my best to review it again. It seems that R uses the Monte Carlo method, so I think the values are different, but 0.005 is too large, isn't it?

Thank you for watching so far. As mentioned above, amateurs tried their best.

References

[Multiple comparison (Part 2-1: Specific method (i)), 3rd Armitage study session: Material 2, Masaaki Doi](http://www012.upp.so-net.ne.jp/doi/ biostat / CT39 / multiple_comparison_2_1.pdf)

For copy

import numpy as np
from scipy.stats import norm
from scipy import integrate
from scipy.special import gamma
import pingouin as pg


def dunnett_two_side_unbalanced(data, dv, between,control):

  def dunnett_prob(T,v,lambda_params):

    def pdf(x):
      return norm.pdf(x)

    def cdf(x):
      return norm.cdf(x)

    def f(x,y):
      return pdf(y)*np.prod([cdf((lambda_i*y+T*x)/(np.sqrt(1-lambda_i**2)))-cdf((lambda_i*y-T*x)/(np.sqrt(1-lambda_i**2))) for lambda_i in lambda_params])

    def duv(x):
      return (v**(v/2)*x**(v-1)*np.exp(-v*x**2/2))/(gamma(v/2)*2**(v/2-1))

    def f_g(x,y):
      return f(x,y) * duv(x)

    return 2*integrate.dblquad(f_g,0,5,lambda x:0,lambda x:np.inf)[0]

  #First compute the ANOVA
  aov = pg.anova(dv=dv, data=data, between=between, detailed=True)
  v = aov.at[1, 'DF'] #Degree of freedom
  ng = aov.at[0, 'DF'] + 1 #Total number
  grp = data.groupby(between)[dv]
  n_sub = grp.count()
  control_index = n_sub.index.get_loc(control) #Get control group index
  n = n_sub.values #The number of samples
  gmeans = grp.mean().values#Each average
  gvar = aov.at[1, 'MS'] / n #Each variance

  vs_g = np.delete(np.arange(ng),control_index) #Get index other than control group

  #Pairwise combinations (find test statistic T)
  mn = np.abs(gmeans[control_index]- gmeans[vs_g])
  se = np.sqrt(gvar[control_index] + gvar[vs_g]) #The formula is a little different, but because it assumes homoscedasticity
  tval = mn / se

  #Find lambda
  lambda_params = n[vs_g]/(n[control_index]+n[vs_g])

  pval = [1-dunnett_prob(t,v,lambda_params) for t in tval]

  stats = pd.DataFrame({
                      'A': [control]*len(vs_g),
                      'B': np.unique(data[between])[vs_g],
                      'mean(A)': np.round(gmeans[control_index], 3)*len(vs_g),
                      'mean(B)': np.round(gmeans[vs_g], 3),
                      'diff': np.round(mn, 3),
                      'se': np.round(se, 3),
                      'T': np.round(tval, 3),
                      'p-dunnett': pval
                      })
  return stats

Recommended Posts

I want to do Dunnett's test in Python
I want to write in Python! (2) Let's write a test
I want to do something in Python when I finish
I want to do something like sort uniq in Python
[Python] What I did to do Unit Test
I want to create a window in Python
I want to merge nested dicts in Python
I want to display the progress in Python!
I want to do a monkey patch only partially safely in Python
I want to write in Python! (1) Code format check
I want to embed a variable in a Python string
I want to easily implement a timeout in python
Even in JavaScript, I want to see Python `range ()`!
I want to randomly sample a file in Python
I want to work with a robot in python.
I want to write in Python! (3) Utilize the mock
[ML Ops] I want to do multi-project with Python
I want to use the R dataset in python
I want to manipulate strings in Kotlin like Python!
[Python] How to do PCA in Python
I want to do ○○ with Pandas
I want to debug with Python
I want to be able to run Python in VS Code
I want to make input () a nice complement in python
I tried to implement permutation in Python
I want to print in a comprehension
How to do R chartr () in Python
I tried to implement PLSA in Python 2
I want to use jar from python
I want to build a Python environment
I want to analyze logs with Python
How to do portmanteau test with python
I want to play with aws with python
I tried to implement ADALINE in Python
I wanted to solve ABC159 in Python
I tried to implement PPO in Python
I want to embed Matplotlib in PySimpleGUI
I want to solve APG4b with Python (only 4.01 and 4.04 in Chapter 4)
I want to do a full text search with elasticsearch + python
[Python / AWS Lambda layers] I want to reuse only module in AWS Lambda Layers
I wanted to do something like an Elixir pipe in Python
I want to pass the G test in one month Day 1
I want to do it with Python lambda Django, but I will stop
I want to use MATLAB feval with python
I want to pin Datetime.now in Django tests
I want to memoize including Python keyword arguments
I was able to recurse in Python: lambda
I want to email from Gmail using Python.
[Python] I want to manage 7DaysToDie from Discord! 1/3
Minimal implementation to do Union Find in Python
I want to make a game with Python
I wrote "Introduction to Effect Verification" in Python
I don't want to take a coding test
I want to store DB information in list
I want to convert a table converted to PDF in Python back to CSV
I want to batch convert the result of "string" .split () in Python
I want to explain the abstract class (ABCmeta) of Python in detail.
I tried to implement TOPIC MODEL in Python
I want to use Temporary Directory with Python2
Environment maintenance made with Docker (I want to post-process GrADS in Python
I want to color a part of an Excel string in Python