In this article, I implement "Data format and visualization" from "Chapter 4 Information and Communication Network and Data Utilization / End of the Book" of the Information I teacher training materials, published on the website of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), in Python, and add some supplementary considerations.
[High School Information Department "Information I" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm) [Chapter 4 Utilization of Information and Communication Networks and Data / End of Book (PDF: 10284KB)](https://www.mext.go.jp/component/a_menu/education/micro_detail/__icsFiles/afieldfile/2019/09/24/1416758_006_1.pdf)
Specifically, I rewrite in Python the R implementation example shown in Learning 24 "Data Format and Visualization" (p. 202 onward).
The teaching materials state:
A box plot and a violin plot were created from the sample body measurement data of the Statistics Bureau of the Ministry of Internal Affairs and Communications.
However, the data file actually read by the R code appears to contain 50 m run results for men and women, so I used the following data, which I prepared myself. high_male_data.csv
Also, the "diamonds.csv" file used in the Python implementation in the latter half is the one available on Kaggle. Diamonds - Kaggle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read the 50 m run data
gen_50 = pd.read_csv('/content/gen_50.csv')
plt.subplots_adjust(wspace=0.5)
# Box plot of the 50 m run time for each gender
sns.boxplot(x = 'gender', y = 'run50m', data = gen_50)
plt.show()
# Violin plot of the 50 m run time for each gender
sns.violinplot(x = 'gender', y = 'run50m', data = gen_50)
plt.show()
Since I used the seaborn module to draw the violin plot, I also used the seaborn module to draw the boxplot.
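Because both plots come from seaborn, they can also be drawn side by side in one figure, which is where a setting such as wspace actually takes effect. The following is only a minimal sketch: the small DataFrame is a made-up stand-in for gen_50.csv (the values and the "male"/"female" labels are my assumptions, not the real data), and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for gen_50.csv (values and labels are assumptions)
gen_50 = pd.DataFrame({
    "gender": ["male"] * 5 + ["female"] * 5,
    "run50m": [7.1, 7.4, 6.9, 7.8, 7.3, 8.6, 9.0, 8.8, 9.4, 8.5],
})

# One row, two columns: box plot on the left, violin plot on the right
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4))
fig.subplots_adjust(wspace=0.5)  # horizontal gap between the two plots

sns.boxplot(x="gender", y="run50m", data=gen_50, ax=ax_box)
sns.violinplot(x="gender", y="run50m", data=gen_50, ax=ax_violin)
ax_box.set_title("Box plot")
ax_violin.set_title("Violin plot")

fig.savefig("gen_50_plots.png")  # hypothetical output file name
```

With the real data, replacing the stand-in DataFrame with the result of pd.read_csv should be enough, as long as the column names are gender and run50m.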
By the way, the R source code in the teaching materials is as follows.
library(ggplot2)
# Read the data
gen_50 <- read.csv("gen_50.csv")
boxplot(run50m~gender, data=gen_50)
ggplot(data=gen_50, aes(x=gender, y=run50m, color=gender)) + geom_violin()
Please note that the code printed in the teaching materials contains a bug here. Wrong: boxplot (run50m ~ gender.data = gen_50) Correct: boxplot (run50m ~ gender, data = gen_50)
What is the purpose of data visualization? Visualizing data makes it possible to discover problems and to consider detailed analysis, interpretation, and solutions. Here, we explain this using the diamonds sample data included in the ggplot2 package of the statistical analysis software R. This is a large data set of tens of thousands of records covering the carat, cut, clarity, size, price, and so on of diamonds actually distributed in the United States.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read the Kaggle diamonds data
df = pd.read_csv('/content/diamonds.csv')
# Keep only the diamonds under 3 carats
df_carat_lt3 = df[df['carat'] < 3]
# Draw a histogram: carat on the x-axis, count on the y-axis
ax = df_carat_lt3['carat'].hist(bins=250)
ax.set_xlabel('carat')
ax.set_ylabel('count')
plt.show()
# Draw a scatter plot of price against carat
df_carat_lt3.plot.scatter(x = 'carat', y = 'price')
plt.show()
# Draw a box plot of price for each cut
sns.boxplot(x = 'cut', y = 'price', data = df_carat_lt3)
plt.show()
The scatter plot shows that the larger the carat, the higher the price.
The order of the x-axis in the last figure is not very helpful. For reference, the cut quality ranks as Ideal > Premium > Very Good > Good > Fair.
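The x-axis order can be fixed by passing the categories explicitly through the order parameter of sns.boxplot. This is a minimal sketch under stated assumptions: the tiny DataFrame and its prices are made up for illustration, and cut_order is a name I introduce here. With the real data, df_carat_lt3 would take the place of the sample frame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Cut quality from worst to best (hypothetical helper name)
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Tiny made-up sample standing in for the diamonds data
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Fair", "Very Good",
            "Ideal", "Fair", "Good", "Premium", "Very Good"],
    "price": [900, 1200, 1100, 1500, 1000,
              1300, 2100, 1600, 1700, 1400],
})

# order= fixes the left-to-right order of the boxes on the x-axis
ax = sns.boxplot(x="cut", y="price", data=sample, order=cut_order)
ax.figure.savefig("price_by_cut.png")  # hypothetical output file name
```

Listing the cuts from Fair to Ideal makes the downward trend in price easier to read from left to right.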
The teaching materials say the following.
This looks like a strange phenomenon: the higher the quality of the cut, the lower the price. The same reversal occurs for better color and clarity. The hint is "confounding factors". A confounding factor is a hidden variable (factor) that is highly correlated with both of the two variables of interest.
The results so far show that the larger the carat, the higher the price. From this, we can infer that many large-carat diamonds have poor cut quality. (That is, the apparent tendency for better cuts to be cheaper may be caused by carat acting as a confounding factor.)
sns.boxplot(x = 'cut', y = 'carat', data = df_carat_lt3)
plt.show()
As expected, especially when looking at the medians, the carat of Ideal, the highest cut quality, is the lowest, and the carat of Fair, the lowest cut quality, is the highest.
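One way to check the confounding explanation is to compare prices by cut within groups of the same carat. The sketch below uses a deliberately artificial toy data set (all values, the price formula, and the variable names are my assumptions, not the real diamonds data): price is 4000 per carat, plus 500 for an Ideal cut, and the Fair diamonds are skewed toward larger carats.

```python
import pandas as pd

# Artificial toy data: Fair diamonds tend to be larger than Ideal ones
toy = pd.DataFrame({
    "cut":   ["Fair"] * 4 + ["Ideal"] * 4,
    "carat": [0.5, 1.0, 1.5, 1.5, 0.5, 0.5, 1.0, 1.5],
})
# Assumed price rule: carat dominates, a better cut adds a small premium
toy["price"] = 4000 * toy["carat"] + toy["cut"].map({"Fair": 0, "Ideal": 500})

# Ignoring carat: the median price of Fair is higher than that of Ideal
overall = toy.groupby("cut")["price"].median()

# Holding carat fixed: Ideal is more expensive in every carat group
by_carat = toy.groupby(["carat", "cut"])["price"].median().unstack("cut")

print(overall)
print(by_carat)
```

In this toy example the overall comparison suggests that Fair is more expensive, while the comparison within each carat group shows the opposite. With the real data, a similar check could bin carat with pd.cut before grouping.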
If you would like to see a more detailed analysis of this diamond data, see the following site. https://www.kaggle.com/fuzzywizard/diamonds-in-depth-analysis
The higher the quality of the cut, the higher the cost per carat. Large carat rough diamonds require more material to achieve better symmetry and proportions, resulting in greater waste.
By the way, the R source code in the teaching materials is as follows.
library(ggplot2)
# Added here: dplyr provides %>% and filter()
library(dplyr)
diamonds
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.01)
ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price))
ggplot(diamonds, aes(cut, price)) + geom_boxplot()
Please note that the code printed in the teaching materials contains a bug here. Wrong: smaller <-diamonds%>% + filter (carat <3) Correct: smaller <-diamonds%>% filter (carat <3)
Evaluating diamonds on its own prints the data set:
A tibble: 53940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
(rows omitted)
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
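The R histogram specifies binwidth = 0.01, whereas the Python version specifies the number of bins (bins=250), so the two do not use exactly the same bins. If a fixed bin width is wanted in Python, the bin edges can be computed explicitly. The following is only a sketch: binwidth_edges is a hypothetical helper of mine, the carat values are made up, and centering the bins on multiples of the bin width is my own choice so that two-decimal carat values never sit on a bin edge.

```python
import numpy as np
import pandas as pd

def binwidth_edges(values, binwidth):
    """Return bin edges of a fixed width, centered on multiples of binwidth."""
    k_min = int(np.round(values.min() / binwidth))
    k_max = int(np.round(values.max() / binwidth))
    # Edges lie half a bin width to either side of each multiple
    return (np.arange(k_min, k_max + 2) - 0.5) * binwidth

# Made-up carat values for illustration
carat = pd.Series([0.23, 0.21, 0.86, 0.75, 0.23, 1.01])

edges = binwidth_edges(carat, 0.01)
counts, _ = np.histogram(carat, bins=edges)

print(len(edges) - 1, "bins of width 0.01")
print("total count:", counts.sum())
```

The same edges can be passed to the plotting call, for example df_carat_lt3['carat'].hist(bins=edges), since the bins argument also accepts a sequence of edges.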
python version https://gist.github.com/ereyester/68b781bd6668005c157b300c5bf22905
R version https://gist.github.com/ereyester/737207c4c99556850950c5b5a49dbfcc