"GCI Data Scientist Training Course" is offered by the University of Tokyo (Matsuo Laboratory) "* Practical data The contents of the exercise part are published in Jupyter NoteBook format (CC-BY-NC-ND) in the Scientist Training Course and Deep Learning Course </ u> * ". Chapter 6 is "** Data processing using Pandas **", where you will learn how to use a powerful and high-speed data manipulation library for data analysis. Instead of the "Like" button for the valuable and wonderful teaching materials that you can learn in Japanese, I will post the answers you have solved. Please point out any mistakes.
6.1 Pandas
** <Practice 1> ** Let's extract only the Kyoto column for the following data.
hier_data_frame1 = DataFrame(np.arange(12).reshape((3,4)) ,index = [['c','d','d'],[1,2,1]] ,columns = [['Kyoto','Nagoya','Hokkaido','Kyoto'] ,['Yellow','Yellow','Red','Blue']] ) hier_data_frame1.index.names =['key1','key2'] hier_data_frame1.columns.names =['city','color'] hier_data_frame1
hier_data_frame1["Kyoto"]
** <Practice 2> ** For the data of Exercise 1, put the city together and calculate the average value between the columns.
hier_data_frame1.mean(level = "city", axis = 1)
** <Practice 3> ** Let's calculate the total value of the rows for each key2 for the data of Exercise 1.
hier_data_frame1.sum(level = "key2")
** <Practice 1> ** Let's do an inner join to the following two data tables.
pd.merge(attri_data_frame4, attri_data_frame5, how = "inner")
** <Practice 2> ** Let's outer join the table of attri_data_frame5 based on attri_data_frame4.
pd.merge(attri_data_frame4, attri_data_frame5, how = "outer")
** <Practice 3> ** Let's vertically join the following data to attri_data_frame4.
pd.concat([attri_data_frame4, attri_data_frame6])
** <Practice 1> ** Use the data from "student-mat.csv" used in the previous chapter. Let's add a column that doubles the age.
student_data_math["2age"] = 2 * student_data_math["age"]
student_data_math[["age", "2age"]].head()
** <Practice 2> ** With the same data as above, let's divide the "absences" column into the following three bins and count the number of each. In addition, in the option setting of cut, the right side is closed section by default, but this time add right = False for 0 to be entered.
# pd.cut(student_data_math.absences, absences_bins, right=False)
pd.value_counts(pd.cut(student_data_math.absences, absences_bins, right=False))
> [5, 100) 151
> [1, 5) 129
> [0, 1) 115
> Name: absences, dtype: int64
** <Practice 3> ** With the same data as above, let's divide the "absences" column into three bins using qcut.
# pd.qcut(student_data_math.absences, 3)
pd.value_counts(pd.qcut(student_data_math.absences, 3))
> (-0.001, 2.0] 183
> (6.0, 75.0] 115
> (2.0, 6.0] 97
> Name: absences, dtype: int64
** <Practice 1> ** Let's use "student-mat.csv" that we used earlier and use "student-mat.csv" to aggregate pandas. First, let's find the average score of G1 with school as the axis.
student_data_math = pd.read_csv("student-mat.csv",sep=";")
student_data_math.groupby("school")["G1"].mean()
> school
> GP 10.939828
> MS 10.673913
> Name: G1, dtype: float64
** <Practice 2> ** Next, let's find the average score of G1, G2, and G3, centering on school and gender.
student_data_math.groupby(["school", "sex"])[["G1", "G2", "G3"]].mean()
** <Practice 3> ** Next, let's calculate the maximum and minimum values of G1, G2, and G3 together, centering on school and gender.
student_data_math.groupby(["school", "sex"])[["G1", "G2", "G3"]].agg(["max", "min"])
** <Practice 1> ** For the following data, if there is even one column of NaN, delete it and display the result.
sample_data_frame2.dropna()
** <Practice 2> ** Fill NaN with 0 for the data prepared above.
sample_data_frame2.fillna(0)
** <Practice 3> ** For the data prepared above, fill in NaN with the average value of each column.
sample_data_frame2.fillna(sample_data_frame2.mean())
** 6.3.1 Comprehensive Problem 1 ** Use the previously used "student-mat.csv" to answer the following questions.
(1) For the above data, calculate the average score of G1 by age x gender, and create a table with age on the vertical axis and gender on the horizontal axis.
(2) For the result table displayed in (1), let's display the result of deleting all rows (records) that are NA.
student_data_math = pd.read_csv("student-mat.csv",sep=";")
# (1)
print("(1)")
print(student_data_math.groupby(["age", "sex"])["G1"].mean().unstack())
# (2)
print("\n(2)")
print(student_data_math.groupby(["age", "sex"])["G1"].mean().unstack().dropna())
> (1)
> sex F M
> age
> 15 10.052632 12.250000
> 16 10.203704 11.740000
> 17 11.103448 10.600000
> 18 10.883721 10.538462
> 19 10.642857 9.700000
> 20 15.000000 13.000000
> 21 NaN 10.000000
> 22 NaN 6.000000
>
> (2)
> sex F M
> age
> 15 10.052632 12.250000
> 16 10.203704 11.740000
> 17 11.103448 10.600000
> 18 10.883721 10.538462
> 19 10.642857 9.700000
> 20 15.000000 13.000000
Recommended Posts