Practice working with data using libraries such as python and libraries such as numpy, pandas, and seaborn. The data used is from kaggle. This time, we will use the data from 2016 New Coder Survey. The contents of the data are as follows. Simply put, it's data about who is learning to code.
Free Code Camp is an open source community where you learn to code and build projects for nonprofits.
CodeNewbie.org is the most supportive community of people learning to code.
Together, we surveyed more than 15,000 people who are actively learning to code. We reached them through the twitter accounts and email lists of various organizations that help people learn to code.
Our goal was to understand these people's motivations in learning to code, how they're learning to code, their demographics, and their socioeconomic background.
Also, as a premise, it is executed on ipython notebook. The version is pyenv: anaconda3-2.4.0 (Python 3.5.2 :: Anaconda 2.4.0) is.
If you are familiar with this field, we would appreciate it if you could take a warm look at the content and give us some advice if you notice anything. I would be grateful if you could comment on something like "I would do this kind of analysis if I used this data"! (It will be helpful if you can use the code base!)
I will load the ones that I think I will use.
import numpy as np
from numpy.random import randn
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
I downloaded the data from 2016 New Coder Survey and put it in the same folder with the name "code_survey.csv".
survey_df = pd.read_csv('corder_survey.csv')
shape
survey_df.shape
(15620, 113)
I see. There are quite a few items. The number of rows is 15620 (the number of people targeted for data), and the number of columns (answer items) is 113.
info You may also want to use info.
survey_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15620 entries, 0 to 15619
Columns: 113 entries, Age to StudentDebtOwe
dtypes: float64(85), object(28)
memory usage: 13.5+ MB
describe
survey_df.describe()
You can see information such as'count',' mean',' std',' min', '25%', '50%', '75%', and'max' in each column. Data is omitted because there is too much data.
for col in survey_df.columns:
print(col)
Now I have displayed all 113 items. Since this is a practice, I will pick up the columns to be used first.
Gender:sex
HasChildren:With or without children
EmploymentStatus:Current employment form
Age:age
Income:income
HoursLearning:Learning time
SchoolMajor:Major
Gender
countplot
Let's start with the gender data. Let's make a histogram. seaborn count plot is useful.
sns.countplot('Gender', data=survey_df)
In Japan, it seems that men and women are divided, but there is diversity and it seems to be overseas.
By the way, for a simple histogram, there is plt.hist
in matplotlib.
(There is also plt.bar
that makes a bar graph, but plt.hist
is easier when making a histogram from the frequency distribution of data.
dataset = randn(100)
plt.hist(dataset)
(Randn will generate random numbers according to the normal distribution)
There are also various options.
# normed:Normalization, alpha:Transparency, color:color, bins:Number of bins
plt.hist(dataset, normed=True, alpha=0.8, color='indianred', bins=15)
HasChildren
Similarly, try drawing with or without children using count plot
.
sns.countplot('HasChildren', data=survey_df)
If it is 0 or 1, it is difficult to understand, so let's say No
without children and Yes
with children.
survey_df['HasChildren'].loc[survey_df['HasChildren'] == 0] = 'No'
survey_df['HasChildren'].loc[survey_df['HasChildren'] == 1] = 'Yes'
You can now convert.
df.map
Conversion using map seems to be good.
survey_df['HasChildren'] = survey_df['HasChildren'].map({0: 'No', 1: 'Yes'})
sns.countplot('HasChildren', data=survey_df)
sns.countplot('HasChildren', data=survey_df)
It's a little easier to understand!
EmploymentStatus
The current employment form is also represented by count plot
.
sns.countplot('EmploymentStatus', data=survey_df)
It's kind of messy and hard to understand. ..
So let's change the axis.
sns.countplot(y='EmploymentStatus', data=survey_df)
Easy to see!
Age
Again, try using count plot
.
sns.countplot('Age', data=survey_df)
It's colorful and beautiful, but it's hard to see as a graph.
So let's smooth the graph.
kde plot
Use kernel density estimation (kde: kernel density plot). The method itself is easy.
sns.kdeplot(survey_df['Age'])
There are many people in their 20s and 30s. Is it just as you would expect? However, the hem is widening even if you get older to some extent.
Now let's consider a little kernel density estimation. (If you look at Wikipedia and other sites, you can see a proper explanation.) [Kernel Density Estimate](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5 % BA% A6% E6% 8E% A8% E5% AE% 9A)
dataset = randn(30)
plt.hist(dataset, alpha=0.5)
sns.rugplot(dataset)
The rug plot shows each sample point with a bar.
This is an image of creating a kernel function (which is easy to understand if you consider a normal distribution as an example) for each sample point in this graph and adding them together.
sns.kdeplot(dataset)
In estimating kernel density
--Kernel function: How to spread the influence of each sample point --Bandwidth: Width of kernel function spread
You need to decide on two things.
You can also use various kernel functions. The default is gau (Gaussian distribution, normal distribution).
kernel_options = ["gau", "biw", "cos", "epa", "tri", "triw"]
for kernel in kernel_options:
sns.kdeplot(dataset, kernel=kernel, label=kernel)
Bandwidth can also be changed.
for bw in np.arange(0.5, 2, 0.25):
sns.kdeplot(dataset, bw=bw, label=bw)
So far, the explanation of kernel density estimation has been separated, and then we will continue.
Income
Again, use kdeplot
.
sns.kdeplot(survey_df['Income'])
The unit is dollars, so it's an annual income.
Let's take a closer look at the data.
describe
survey_df['Income'].describe()
RuntimeWarning: Invalid value encountered in median
count 7329.000000
mean 44930.010506
std 35582.783216
min 6000.000000
25% NaN
50% NaN
75% NaN
max 200000.000000
Name: Income, dtype: float64
It seems that the problem that the quartile becomes NaN has already been solved at the time of writing the article, but it seems to be waiting for merge. Wait for the version upgrade without worrying about it. describe() returns RuntimeWarning: Invalid value encountered in median RuntimeWarning #13146
boxplot
I would like to create a boxplot (boxplot).
sns.boxplot(survey['Income'])
From the vertical line on the left, the minimum boxplot, the first quartile (Q1), the median, the third quartile (Q3), and the maximum boxplot are shown. Those that deviate from (minimum value --IQR1.5) ~ (maximum value + IQR1.5) as IQR = Q3-Q1 are represented by black dots as outliers that deviate from the boxplot.
It is also possible to express outliers without them.
sns.boxplot(survey['Income'], whips=np.inf)
violinplot
There is also a Vaviolin plot that gives boxplot information about kde.
sns.violinplot(survey_df['Income'])
The distribution is easier to understand!
HoursLearning
Let's take a look at the study time. First, let's do a kde plot.
sns.kdeplot(survey_df['HoursLearning'])
It's the amount of learning per week in terms of time. Also, the extremum is noticeable, but this is happening with a clear number. Normally, I answer the questionnaire with good numbers, so this happens.
I also have a violin plot.
sns.violinplot(survey_df['HoursLearning'])
It reflects the characteristics of the kde plot.
There is also a distplot
that can generate both countplot
and kdeplot
together.
nan
is deleted and used.
hours_learning = survey_df['HoursLearning']
hours_learning = hours_learning.dropna()
sns.distplot(hours_learning)
You can turn the histogram into a lag plot and add options. Convenient!
sns.distplot(hours_learning, rug=True, hist=False, kde_kws={'color':'indianred'})
SchoolMajor
If it is a continuous value, kdeplot
will be useful, but since this is a categorization, use countplot
.
sns.countplot(y='SchoolMajor' , data=survey_df)
It's hard to see. .. There are too many categories. I want to see the top 10 or so.
collections.Counter
from collections import Counter
major_count = Counter(survey_df['SchoolMajor'])
major_count.most_common(10)
Count by using the standard library collections
.
Furthermore, by setting most_common (10)
, you can get the top 10 of them.
[(nan, 7170),
('Computer Science', 1387),
('Information Technology', 408),
('Business Administration', 284),
('Economics', 252),
('Electrical Engineering', 220),
('English', 204),
('Psychology', 187),
('Electrical and Electronics Engineering', 164),
('Software Engineering', 159)]
Let's display it on the graph.
X = []
Y = []
major_count_top10 = major_count.most_common(10)
for record in major_count_top10:
X.append(record[0])
Y.append(record[1])
# [nan, 'Computer Science', 'Information Technology', 'Business Administration', 'Economics', 'Electrical Engineering', 'English', 'Psychology', 'Electrical and Electronics Engineering', 'Software Engineering']
# [7170, 1387, 408, 284, 252, 220, 204, 187, 164, 159]
plt.barh(np.arange(10), Y)
plt.yticks(np.arange(10), X)
I referred to here. Bar Graph -Introduction to matplotlib
You can use plt.barh
to change the axis of plt.bar
. I also labeled it with yticks
.
Well, I don't want nan to be displayed and I want to sort it in reverse order.
X = []
Y = []
major_count_top10 = major_count.most_common(10)
major_count_top10.reverse()
for record in major_count_top10:
# record[0] == record[0]There is a supplement below
if record[0] == record[0]:
X.append(record[0])
Y.append(record[1])
# ['Software Engineering', 'Electrical and Electronics Engineering', 'Psychology', 'English', 'Electrical Engineering', 'Economics', 'Business Administration', 'Information Technology', 'Computer Science']
# [159, 164, 187, 204, 220, 252, 284, 408, 1387]
plt.barh(np.arange(9), Y)
plt.yticks(np.arange(9), X)
The graph I was thinking of was created!
Here, False
is returned only when comparing between NaN
, so I implemented this. (See also the URL below)
However, it is difficult to understand The implementation method introduced by @shiracamus is easier to understand. I will use this in the future as well.
if record[0] == record[0]:
Changed the part as follows.
if pd.notnull(record[0]):
First, Gender is for men and women only for simplicity.
male_female_df = survey_df.where((survey_df['Gender'] == 'male') + (survey_df['Gender'] == 'female') )
You can count by layer by using hue
of count plot
.
countplot(hue)
sns.countplot('Gender', data=male_female_df, hue='HasChildren')
It seems that both men and women have the same proportion of children.
Graphs other than count plot
can also be represented by layers. Use FacetGrid
.
sns.FacetGrid
fig = sns.FacetGrid(male_female_df, hue='Gender', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = male_female_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Men are a little younger, aren't they?
Since there are multiple Employment Status, I would like to use only the top few.
# male_female_df is survey_Gender of df narrowed down to men and women
#Get the top 5 Employment Status
from collections import Counter
employ_count = Counter(male_female_df['EmploymentStatus'])
employ_count_top = employ_count.most_common(5)
print(employ_count_top)
employ_list =[]
for record in employ_count_top:
if record[0] == record[0]:
employ_list.append(record[0])
def top_employ(status):
return status in employ_list
#employ using apply_Get only the rows of items in list
new_survey_df = male_female_df.loc[male_female_df['EmploymentStatus'].apply(top_employ)]
sns.countplot(y='EmploymentStatus', data=new_survey_df)
Now there are only the top 3 items.
Let's look at the gender stratification using the count plot
hue.
sns.countplot(y='EmploymentStatus', data=employ_df, hue='Gender')
First, convert HasChildren to No-> 0, Yes-> 1
.
new_survey_df['HasChildren'] = new_survey_df['HasChildren'].map({'No': 0, 'Yes': 1})
Use factor plot
here. Let's see how the Employment Status` is related to the presence or absence of children.
factorplot
sns.factorplot('EmploymentStatus', 'HasChildren', data=new_survey_df, aspect=2)
ʻEmloyed for wages` is a little high. This is convincing.
You can also see the factor plots by layer, so let's see if there is any difference between men and women.
sns.factorplot('EmploymentStatus', 'HasChildren', data=new_survey_df, aspect=2, hue='Gender')
This is pretty interesting. In fact, it was completely different for men and women. In light of the employment situation, there are many possibilities.
lmplot
I would like to see the relationship with a regression line. Use lmplot
for the regression line.
sns.lmplot('Age', 'HasChildren', data=new_survey_df)
By the way, lmplot
can also be seen by layer, so I would like to try it.
First, stratify by ʻEmployment Status`.
sns.lmplot('Age', 'HasChildren', data=new_survey_df, hue='EmploymentStatus')
Generally, the value of ʻEmployed for wages` is a little high, but as we saw in the previous section, it seems to be a little clearer if we divide it by gender.
By the way, I will also stratify Gender
.
sns.lmplot('Age', 'HasChildren', data=new_survey_df, hue='Gender')
You can also display multiple graphs side by side. I drew two graphs using subplots
.
fig, (axis1, axis2) = plt.subplots(1, 2, sharey=True)
sns.regplot('HasChildren', 'Age', data=new_survey_df, ax=axis1)
sns.violinplot(y='Age', x='HasChildren', data=new_survey_df, ax=axis2)
regplot
is a low-level version of lmplot
, which is the same for making simple regressions.
I used regplots
because the functions that can be used with subplots
seem to be limited to those that return a matplotlib Axes
object, and lmplot
could not be used.
Detailed explanation Plotting with seaborn using the matplotlib object-oriented interface Was easy to understand.
It seems that the "Axis-level" function mentioned in this article can be used. (regplot, boxplot, kdeplot, and many others)
On the other hand, the "Figure-level" function is also introduced, and there is also lmplot
in it.
(lmplot, factorplot, jointplot and one or two others)
It's almost ʻAxis and partly like
Figure`.
So it seems that FacetGrid
is good for figures.
Plotting on data-aware grids
You can also arrange multiple graphs with FacetGrid
.
I tried to display the distribution of Age side by side by EmploymentStatus.
fig = sns.FacetGrid(new_survey_df, col='EmploymentStatus', aspect=1.5)
fig.map(sns.distplot, 'Age')
oldest = new_survey_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Better writing that I've wanted to know since I started Python Python pandas data selection process in a little more detail <Part 2>
[20,000 people in the world] Practical Python Data Science This is a recommended video course that is easy to understand with detailed explanations. The point is that even if you ask a question, the answer will be returned the next day.
[Introduction to Data Analysis with Python-Data Processing Using NumPy and pandas](https://www.amazon.co.jp/Python%E3%81%AB%E3%82%88%E3%82%8B%E3 % 83% 87% E3% 83% BC% E3% 82% BF% E5% 88% 86% E6% 9E% 90% E5% 85% A5% E9% 96% 80-% E2% 80% 95NumPy% E3% 80% 81pandas% E3% 82% 92% E4% BD% BF% E3% 81% A3% E3% 81% 9F% E3% 83% 87% E3% 83% BC% E3% 82% BF% E5% 87% A6% E7% 90% 86-Wes-McKinney / dp / 4873116554) O'Reilly's data analysis book. It is well organized.
Start Python Club The Python community. I'm not specialized in data analysis, but I'm widely active in Python, so if you use Python, I think it's fun to go.
Recommended Posts