**What you can do with this article**
I am a graduate student majoring in biology. Python is very useful for analyzing and plotting experimental data, but it was quite a problem that few libraries offer a range of statistical tests (especially tests for significant differences between groups). There are a few possible solutions.
(For the second one, I got it working by introducing Rpy2.)
However, I really want to run nonparametric tests in Python itself! Therefore, I decided to implement them using the third option, a library called **scikit_posthocs**.
If you are only interested in how to use scikit_posthocs, feel free to jump straight to that section from the table of contents below.
**First of all, to start the test**
The following content is based on the J-STAGE article series "For those who do not understand statistical tests": I, [II](https://www.jstage.jst.go.jp/article/kagakutoseibutsu/51/6/51_408/_pdf), [III](https://www.jstage.jst.go.jp/article/kagakutoseibutsu/51/7/51_483/_pdf)
I interpret and describe these articles in my own way.
The full explanation would get long, so I'll add it when I have time. In short, the tests follow the flow below.
**Unpaired (independent) data in 3 or more groups**
↓
Normality test (Shapiro-Wilk test or Q-Q plot) → go to nonparametric
↓
Homoscedasticity test (Bartlett's test) → go to nonparametric
↓
One-way analysis of variance (ANOVA) → go to nonparametric
↓
Tukey HSD test, Scheffe test, Tukey test (n is the same in each group), Dunnett test (comparison with the control group)
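The parametric branch of this flow can be sketched with SciPy alone. This is only a minimal illustration on made-up data, not the code from the referenced articles: the three groups, their means, and the random seed are assumptions, and `scipy.stats.tukey_hsd` requires SciPy 1.8 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: three independent, normally distributed groups
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.2, 2.0)]

# 1. Normality of each group (Shapiro-Wilk)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# 2. Homoscedasticity across groups (Bartlett's test)
bartlett_p = stats.bartlett(*groups).pvalue

# 3. One-way ANOVA
anova_p = stats.f_oneway(*groups).pvalue

# 4. Post hoc pairwise comparison (Tukey HSD); pvalue is a 3x3 matrix
tukey_p = stats.tukey_hsd(*groups).pvalue

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Bartlett p-value:", round(bartlett_p, 3))
print("ANOVA p-value:", anova_p)
print("Tukey HSD p-values:\n", np.round(tukey_p, 4))
```

A small p-value in step 1 or 2 is the signal to switch to the nonparametric branch instead of continuing to steps 3 and 4.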
In the article I referred to,
**If you go to nonparametric in the above flowchart**
↓
Homoscedasticity test (Levene test, Fligner test)
↓
One-way analysis (Kruskal-Wallis test)
↓
Steel-Dwass (DSCF) test, Conover test
These are also post hoc tests, so you might think they only make sense after the one-way analysis shows a significant difference. According to the article above, however, it is not necessary to perform the analysis of variance first.
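The first two steps of the nonparametric branch are available in SciPy. The sketch below runs them on made-up, skewed data; the groups, the shift of 5.0, and the seed are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical skewed (non-normal) data; the third group is shifted upward
groups = [rng.exponential(scale=1.0, size=30) + shift for shift in (0.0, 0.0, 5.0)]

# 1. Homoscedasticity tests that do not assume normality
levene_p = stats.levene(*groups).pvalue
fligner_p = stats.fligner(*groups).pvalue

# 2. Kruskal-Wallis test, the rank-based counterpart of one-way ANOVA
kruskal_p = stats.kruskal(*groups).pvalue

print("Levene p-value:", round(levene_p, 3))
print("Fligner p-value:", round(fligner_p, 3))
print("Kruskal-Wallis p-value:", kruskal_p)
```

The last step, the Steel-Dwass or Conover test, is not in SciPy, which is where scikit_posthocs comes in.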
scikit_posthocs is a library that covers many tests not provided by scipy or statsmodels, and it is very easy to use. The official website is very well organized, so please check it out. (Official site, GitHub repository)
The dependencies are NumPy, pandas, SciPy, statsmodels, matplotlib, and seaborn. scikit_posthocs can be installed with pip:
!pip install scikit_posthocs
Any test (other than HSD) can be run as follows.
import scikit_posthocs as sp
import seaborn as sns
# Load the Titanic data
df = sns.load_dataset("titanic")
# Steel-Dwass (DSCF) test
# val_col is the column that holds the values
# group_col is the column that holds the groups you want to compare
sp.posthoc_dscf(df, val_col="fare", group_col="class")
The result will be returned in the following data frame. The contents of the table are p-values.
I put the above together on GitHub: rola-bio/stats_test. Download stats_test.py from it into your working directory and import it. When you run stats_test(), the normality and homoscedasticity of the data are tested and the analysis of variance is performed automatically, following the flow above. The data are then analyzed with a suitable test, and a bar plot of the data annotated with the significant differences is drawn. By default, one of the Tukey HSD, Steel-Dwass, and Conover tests is selected.
Now let's use this function on the Titanic passenger data that comes with seaborn and analyze how fares differ by type of passenger.
titanic.ipynb
import stats_test as st
import seaborn as sns
#Load Titanic data
df = sns.load_dataset("titanic")
df.head()
Next, pass stats_test() the data frame, the value column you want to test, and the column to group by.
This time, I grouped the passengers by port of embarkation (embark_town).
titanic.ipynb
st.stats_test(df,val_col="fare",group_col="embark_town")
Oops? I got an error when I ran this.
TypeError: '<' not supported between instances of 'float' and 'str'
Apparently there are missing values (NaN) in fare or embark_town. You may get this error if group_col contains a mix of strings with ints or null values. In the int case, you can deal with it with
df["column name"] = df["column name"].astype(str)
This time, I removed the NaN rows with dropna as shown below.
titanic.ipynb
st.stats_test(df.dropna(subset=["embark_town"]),val_col="fare",group_col="embark_town")
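For reference, the error message itself can be reproduced outside the package. The sketch below assumes the group labels get sorted or compared somewhere along the way, which I have not confirmed in the code; the series `labels` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical group column: strings plus one missing value (NaN is a float)
labels = pd.Series(["Cherbourg", "Southampton", np.nan, "Queenstown"])

# Sorting a mix of str and float raises the same kind of TypeError
try:
    sorted(labels.unique())
    message = ""
except TypeError as exc:
    message = str(exc)
print(message)

# Fix 1: drop the missing labels before comparing
cleaned = sorted(labels.dropna().unique())
print(cleaned)

# Fix 2 (for int labels mixed with strings): cast everything to str
mixed = pd.Series([1, "2", 3])
as_text = sorted(mixed.astype(str).unique())
print(as_text)
```

Either fix leaves a column with a single comparable type, which is why dropna(subset=["embark_town"]) is enough here.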
Apparently there is a significant difference (p < 0.001) between all groups. People who boarded at Cherbourg are significantly loaded ...
Oops, pardon my language.
People who boarded at Cherbourg are significantly well-off.
This result does not distinguish between men and women, so next let's test for significant differences separately by sex.
titanic.ipynb
# Run the same test separately for each sex
for sex in df["sex"].unique():
    print("""
This result is from {}
""".format(sex))
    # Keep only the rows for this sex
    df_query = df.query("sex == '{}'".format(sex))
    st.stats_test(df_query.dropna(subset=["embark_town"]),
                  val_col="fare", group_col="embark_town")
That was the result. It is a little hard to read because the color assigned to each port of embarkation has changed from the first result ... You can adjust this by editing sign_barplot() in the package.
In any case, Cherbourg passengers seem to be significantly richer, for both men and women. (Grr ...) However, for men, the p-value for the fare difference between Southampton and Cherbourg has risen to about 0.01. Is that because of Maya Yoshida?
That's it.
By the way, if you pass result=True to stats_test(), the results of the intermediate tests are also displayed. You can specify the test yourself by passing test="test name". (Or you can easily change it by editing the function one_way_ANOVA() in stats_test.py.)
When I have time, I may also write up the implementations of the individual tests. For details, please refer to the code itself ...