Many data analysis tools and services are available these days. I work in the manufacturing industry, and the two approaches to data analysis and visualization that I hear about most are the following.
- Environment ①: Data analysis using Python + NumPy + pandas + α
- Environment ②: Data analysis using BI tools (Business Intelligence tools)
To compare the two and understand how they differ, I decided to simply run the same analysis with both tools and see what happens.
PC OS used: Microsoft Windows 10 Pro 64-bit
Browser used: Microsoft Edge
・ Usage environment: Kaggle Notebook
Because it is a cloud service, I could not confirm a version number; this article reflects the service as of August 24, 2020.
Kaggle is a community and competition website for data analysts. Some competitions offer prize money, and engineers compete on the accuracy of their data analysis. The details are easy to understand at the link below.
Kaggle Tutorial Part 1: What is Kaggle? What does it mean to participate?
I opened my account by referring to the link below. With an account, you can use Kaggle's data analysis services, including Kaggle Notebook, free of charge.
Introduction to Kaggle for Beginners! From opening an account to submitting Titanic
・ Usage environment: Microsoft Power BI Desktop, version 2.84.802.0 (64-bit)
Power BI Desktop is available from the Microsoft Store. Most of its functions, including everything used in this work, are free of charge.
The analysis flow is borrowed from part of a Udemy data science course. The course below covers everything from the basics to simple hands-on exercises, and I recommend it to anyone who wants to learn data science systematically.
[180,000 students worldwide] Practical Python Data Science
Following the Udemy curriculum above, I will use both methods to run the following analysis of the famous sinking of the Titanic as a first step in data analysis.
・ What kind of people were the passengers of the Titanic? (Gender, age, etc.)
・ How those characteristics, individually and in combination, relate to the survival rate
This article covers the processing common to both tools and the results obtained with the "Python + NumPy + pandas + α" environment. The results of "data analysis with BI tools" are covered in the next article.
Opened in Excel, the actual passenger data looks like this.
Bring local data into the environment and display a summary
Only the first 5 rows are extracted to get an overview of the data.
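The loading and summary step can be sketched as follows. This is a minimal sketch, not the exact notebook code: the column names follow the Kaggle Titanic `train.csv`, but the rows are made up and are read from an in-memory string so that the example runs anywhere. The names `SAMPLE_CSV`, `titanic_df`, `first_rows`, and `missing_counts` are my own choices.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle Titanic "train.csv": the column names
# match the real file, but the six rows below are made up for illustration.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T001,7.25,,S
2,1,1,Passenger B,female,38,1,0,T002,71.28,C85,C
3,1,3,Passenger C,female,26,0,0,T003,7.92,,S
4,1,1,Passenger D,female,35,1,0,T004,53.10,C123,S
5,0,3,Passenger E,male,,0,0,T005,8.05,,Q
6,0,2,Passenger F,male,54,0,0,T006,13.00,,S
"""

# In a real notebook this would be pd.read_csv("<path to train.csv>").
titanic_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Show only the first 5 rows to get a feel for the data.
first_rows = titanic_df.head()
print(first_rows)

# Summary of column dtypes and non-null counts.
titanic_df.info()

# Number of missing values per column (Age and Cabin have gaps here).
missing_counts = titanic_df.isnull().sum()
print(missing_counts)
```

`head()` returns the first 5 rows by default, and `info()` together with `isnull().sum()` gives a quick check of which columns have missing values before going further.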
4. Check the ratio of men and women for each room grade
The Pclass column indicates the grade of the room. You can see that there are many men in the third-class rooms.
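One way to get these counts is a cross-tabulation. The sketch below uses `pd.crosstab` on a small made-up table rather than the real Titanic data, so the numbers are illustrative only; the original notebook may well have drawn a chart instead.

```python
import pandas as pd

# Made-up sample with only the two columns needed here (not the real data).
titanic_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["male", "female", "male", "female",
            "male", "male", "male", "female"],
})

# Rows: room grade (Pclass), columns: Sex, values: number of passengers.
sex_by_class = pd.crosstab(titanic_df["Pclass"], titanic_df["Sex"])
print(sex_by_class)
```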
The axes can also be swapped easily.
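With a cross-tabulation, swapping the axes only means swapping the two arguments. Again this is a sketch on made-up rows, not the original notebook code.

```python
import pandas as pd

# Made-up sample with only the two columns needed here (not the real data).
titanic_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["male", "female", "male", "female",
            "male", "male", "male", "female"],
})

# Rows: Sex, columns: room grade (Pclass), i.e. the axes are swapped.
class_by_sex = pd.crosstab(titanic_df["Sex"], titanic_df["Pclass"])
print(class_by_sex)
```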
5. Using the "Age" and "Sex" columns, create a "Person" column with the values "man", "woman", and "child" (under 16 years old)
First, create a Person column.
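A minimal sketch of this step, assuming the `Sex` column holds `"male"` / `"female"` as in the Kaggle data: the helper name `male_female_child` and the sample rows are hypothetical. Note one choice made here: a passenger with a missing age is not treated as a child and keeps the value of `Sex`.

```python
import pandas as pd

# Made-up sample rows (not the real data); one Age is missing on purpose.
titanic_df = pd.DataFrame({
    "Age": [40.0, 8.0, 30.0, None, 15.0, 16.0],
    "Sex": ["male", "male", "female", "female", "female", "male"],
})


def male_female_child(passenger):
    """Return "child" for passengers under 16, otherwise their Sex value."""
    # A missing age (NaN) compares as False, so such rows fall through to Sex.
    if passenger["Age"] < 16:
        return "child"
    return passenger["Sex"]


# axis=1 applies the function row by row.
titanic_df["Person"] = titanic_df[["Age", "Sex"]].apply(male_female_child, axis=1)
print(titanic_df)
```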
Then check the distribution of Person by room grade. Perhaps because the first-class rooms were expensive, there are few children in them. The share of adult men is high in the third-class rooms, and the share of children is also high there, so I imagine the third-class rooms held many single men and families.
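The distribution by room grade can be checked with the same cross-tabulation approach, this time also as ratios within each grade. The rows below are made up, so the numbers only illustrate the shape of the result and do not reproduce the actual Titanic figures.

```python
import pandas as pd

# Made-up sample with a ready-made Person column (not the real data).
titanic_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Person": ["male", "female", "male", "female", "child",
               "male", "male", "male", "child", "child"],
})

# Passenger counts per room grade and Person category.
person_by_class = pd.crosstab(titanic_df["Pclass"], titanic_df["Person"])
print(person_by_class)

# The same table as ratios within each room grade (each row sums to 1).
person_ratio = pd.crosstab(
    titanic_df["Pclass"], titanic_df["Person"], normalize="index"
)
print(person_ratio)
```

Looking at ratios rather than raw counts makes the grades comparable even though they hold different numbers of passengers.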
That is the simple analysis flow. The course goes on from here to take a broader look at these and related characteristics and how they relate to survival, but this article covers only the first stage.
Working through the above, I noticed that the notebook makes it easy to do the following:
- Look at some of the actual data, such as the first few rows, to get a feel for the whole dataset
- Check the data summary to see whether the data is corrupted, and fix it
- Create your own columns to get the information you are after
I felt that the advantage of Python-based notebooks such as Kaggle's is that you can proceed with the analysis while getting a hands-on feel for a large amount of data along various axes. (This may be obvious, of course...)
In the next article, I will try the same thing with Microsoft's Power BI, a BI tool.