[High School Information Department Information I] Teaching materials for teacher training: Data format and visualization (Python)


This article implements in Python the "Data format and visualization" section of "Chapter 4 Information and Communication Network and Data Utilization / End of the Book" from the Information I teacher training materials published on the website of the Ministry of Education, Culture, Sports, Science and Technology. I also add some supplementary remarks.

Teaching materials

[High School Information Department "Information I" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm)

[Chapter 4 Utilization of Information and Communication Networks and Data / End of Book (PDF: 10284KB)](https://www.mext.go.jp/component/a_menu/education/micro_detail/__icsFiles/afieldfile/2019/09/24/1416758_006_1.pdf)


What to do this time

I rewrite in Python the R implementation example shown in Learning 24 "Data Format and Visualization" (p. 202 onward).


About the teaching materials

The teaching materials state: "A box plot and a violin plot were created from the sample data of the body measurement of the Statistics Bureau of the Ministry of Internal Affairs and Communications."

Despite that description, the data file read by the R code appears to contain 50 m run results for men and women, so I prepared suitable data myself and used it here. high_male_data.csv

The "diamonds.csv" used in the Python implementation in the latter half is the one published on Kaggle. Diamonds - Kaggle

Qualitative data and its types

Box plot and violin plot

Source code for python implementation

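import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the 50 m run data (columns: gender, run50m)
gen_50 = pd.read_csv('/content/gen_50.csv')

# Box plot of run50m by gender
sns.boxplot(x='gender', y='run50m', data=gen_50)
plt.show()

# Violin plot of the same data, shown as a separate figure
sns.violinplot(x='gender', y='run50m', data=gen_50)
plt.show()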

Since I used the seaborn module to draw the violin plot, I also used the seaborn module to draw the boxplot.
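As a rough illustration of which numbers a box plot summarizes, the sketch below computes the quartiles per group with pandas. The `sample` DataFrame is made-up data standing in for gen_50.csv: only the column names `gender` and `run50m` come from the code above, while the values and the `male` / `female` labels are my assumptions.

```python
import pandas as pd

# Made-up stand-in for gen_50.csv; only the column names come from the article.
sample = pd.DataFrame({
    "gender": ["male"] * 5 + ["female"] * 5,
    "run50m": [6.8, 7.0, 7.2, 7.4, 7.6, 8.4, 8.6, 8.8, 9.0, 9.2],
})

# Quartiles per gender: the box spans Q1 to Q3 and the line inside is the median.
summary = (
    sample.groupby("gender")["run50m"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "median", "Q3"]

# The interquartile range is the height of the box.
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary)
```

A violin plot adds an estimate of the distribution's shape around these same quartiles, which is why the two plots are usually read together.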

Output result of python implementation version

[Figure: box plot of run50m by gender]

[Figure: violin plot of run50m by gender]

[Reference] Source code of R implementation version (from teaching materials)

library( ggplot2 )
# Read the data
gen_50 <- read.csv("gen_50.csv")
boxplot(run50m~gender, data=gen_50)
ggplot(data=gen_50, aes(x=gender, y=run50m, color=gender)) + geom_violin()

Note that the source code as printed in the actual teaching materials contains a bug: a period appears where a comma should be.

Wrong: boxplot(run50m~gender.data=gen_50)
Correct: boxplot(run50m~gender, data=gen_50)

[Reference] Output result of R implementation version

[Figure: box plot of run50m by gender (R)] [Figure: violin plot of run50m by gender (R)]

Histogram, scatter plot and box plot

What is the purpose of data visualization? Visualizing data makes it possible to discover problems and to consider detailed analysis, interpretation, and solutions. Here we explain using the diamonds sample data included in ggplot2, a package for the statistical analysis software R. This is a large dataset of tens of thousands of records covering the carat, cut, clarity, size, price, and so on of diamonds actually sold in the United States.

Source code for python implementation

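import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/content/diamonds.csv')

# Keep only diamonds under 3 carats (same filter as the R version)
df_carat_lt3 = df[df['carat'] < 3]

# Draw a histogram: carat on x-axis, count on y-axis
# (the plotting call is missing from this listing; the next line is a reconstruction
# that assumes the same 0.01 bin width as the R version and seaborn 0.11 or later)
sns.histplot(data=df_carat_lt3, x='carat', binwidth=0.01)
plt.show()

# Draw a scatter plot of price against carat
df_carat_lt3.plot.scatter(x='carat', y='price')
plt.show()

# Draw a box plot of price for each cut
sns.boxplot(x='cut', y='price', data=df_carat_lt3)
plt.show()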

Output result of python implementation version

[Figure: histogram of carat] [Figure: scatter plot of price against carat]

It can be seen that the larger the carat number, the higher the price.
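To put a number on this visual impression, one option is a correlation coefficient. The sketch below is only an illustration on made-up rows: the `toy` DataFrame and its values are my assumptions, and only the column names `carat` and `price` follow diamonds.csv. The rank correlation is computed as the Pearson correlation of the ranks, which avoids needing anything beyond pandas.

```python
import pandas as pd

# Made-up rows standing in for diamonds.csv; only the column names are from the article.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "price": [500, 1500, 2500, 5000, 10000, 16000],
})

# Pearson correlation measures how close the relationship is to a straight line.
pearson = toy["carat"].corr(toy["price"])

# Spearman rank correlation is the Pearson correlation of the ranks;
# it only asks whether price rises whenever carat rises.
spearman = toy["carat"].rank().corr(toy["price"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data the same two lines can be applied to `df_carat_lt3` instead of `toy`.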

[Figure: box plot of price by cut]

The x-axis order in the last figure is not ideal. From best to worst, the cut quality is Ideal > Premium > Very Good > Good > Fair.
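One possible way to make the categories follow this quality order is to convert the `cut` column to an ordered categorical. This is a minimal sketch on made-up rows (the `toy` DataFrame and its prices are my assumptions; only the column names and the quality order come from the text).

```python
import pandas as pd

# Quality order from best to worst, as stated in the text.
cut_order = ["Ideal", "Premium", "Very Good", "Good", "Fair"]

# Made-up rows; only the column names follow diamonds.csv.
toy = pd.DataFrame({
    "cut": ["Good", "Ideal", "Fair", "Premium", "Very Good", "Ideal"],
    "price": [4000, 2500, 4500, 3500, 3000, 2700],
})

# An ordered categorical makes pandas group and sort by quality, not alphabetically.
toy["cut"] = pd.Categorical(toy["cut"], categories=cut_order, ordered=True)
median_price = toy.groupby("cut", observed=False)["price"].median()
print(median_price)
```

For the plot itself, the same list can be passed to seaborn through the `order` parameter, for example `sns.boxplot(x='cut', y='price', data=df_carat_lt3, order=cut_order)`.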


The teaching materials explain this as follows.

This looks like a mysterious phenomenon: the higher the quality of the cut, the lower the price. The same reversal occurs with color and clarity. The hint is "confounding factors". A confounding factor is a hidden variable (factor) that is highly correlated with both of the variables of interest.

The results so far show that price rises with carat. From this, we can infer that many large-carat diamonds have poor cut quality. (In other words, the apparent pattern of higher cut quality going with lower price may be caused by carat acting as a confounding factor.)

Source code for python implementation (cut vs carat)

sns.boxplot(x = 'cut', y = 'carat', data = df_carat_lt3)

Output result of python implementation version (cut vs carat)

[Figure: box plot of carat by cut]

As expected, looking at the medians in particular, Ideal, the highest cut quality, has the lowest carat, and Fair, the lowest cut quality, has the highest carat.
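One way to check whether carat accounts for the reversal is to compare prices within narrow carat ranges instead of across the whole dataset. The sketch below uses made-up rows that I constructed so that the reversal appears; it illustrates the technique and is not a result about the real diamonds data. The `toy` DataFrame, the two carat bins, and the bin labels are all assumptions.

```python
import pandas as pd

# Made-up rows built so that Ideal stones are mostly small and Fair stones mostly large.
toy = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Ideal", "Fair", "Fair", "Fair", "Fair"],
    "carat": [0.3, 0.4, 0.5, 1.2, 0.4, 1.1, 1.3, 1.4],
    "price": [700, 900, 1400, 7000, 600, 4500, 5000, 5500],
})

# Ignoring carat: the lower-quality cut looks more expensive.
overall = toy.groupby("cut")["price"].median()

# Holding carat roughly fixed: compare cuts inside each carat bin.
toy["carat_bin"] = pd.cut(toy["carat"], bins=[0, 1, 2], labels=["0-1", "1-2"])
by_bin = (
    toy.groupby(["carat_bin", "cut"], observed=True)["price"]
    .median()
    .unstack()
)

print(overall)
print(by_bin)
```

In this toy example Fair has the higher median price overall, yet Ideal is more expensive inside every carat bin, which is the pattern a confounding variable produces.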

For a more detailed analysis of this diamond data, see the following page. https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis

About the quality of diamond cuts (supplement)

The higher the quality of the cut, the higher the cost per carat. Achieving better symmetry and proportions from a large-carat rough diamond requires removing more material, resulting in greater waste.

[Reference] Source code of R implementation version (from teaching materials)


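library(tidyverse)  # loads ggplot2 (diamonds data) and dplyr (filter, %>%)

# Histogram of carat for diamonds under 3 carats
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.01)

# Scatter plot of price against carat
ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price))

# Box plot of price for each cut
ggplot(diamonds, aes(cut, price)) + geom_boxplot()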

Note that the source code as printed in the actual teaching materials contains a bug: there is a stray + after the pipe operator.

Wrong: smaller <- diamonds %>% + filter(carat < 3)
Correct: smaller <- diamonds %>% filter(carat < 3)

[Reference] Output result of R implementation version

A tibble: 53940 × 10
carat	cut	color	clarity	depth	table	price	x	y	z
<dbl>	<ord>	<ord>	<ord>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
0.86	Premium	H	SI2	61.0	58	2757	6.15	6.12	3.74
0.75	Ideal	D	SI2	62.2	55	2757	5.83	5.87	3.64
[Figure: histogram of carat (R)] [Figure: scatter plot of price against carat (R)] [Figure: box plot of price by cut (R)]

Source code

Python version https://gist.github.com/ereyester/68b781bd6668005c157b300c5bf22905

R version https://gist.github.com/ereyester/737207c4c99556850950c5b5a49dbfcc

