Continuing from the previous post, we will keep looking at data visualization with matplotlib and pandas.
This time, let's use external data to make the examples more practical. First, download the data from pydata-book, which is also the reference for this article.
pydata-book/ch08/tips.csv https://github.com/pydata/pydata-book/blob/master/ch08/tips.csv
import numpy as np
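from pandas import Series, crosstab, read_csv
import matplotlib.pyplot as plt
tips = read_csv('tips.csv')
#Cross tabulate day of the week against party size
#(use tips['size'], because tips.size is the DataFrame's element count)
party_counts = crosstab(tips['day'], tips['size'])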
print( party_counts )
# =>
# size 1 2 3 4 5 6
# day
# Fri 1 16 1 1 0 0
# Sat 2 53 18 13 1 0
# Sun 0 39 15 18 3 1
# Thur 1 48 4 5 1 3
#Normalize each row so that it sums to 1
party_counts = party_counts.div(party_counts.sum(axis=1), axis=0)
print( party_counts )
# =>
# [4 rows x 6 columns]
# size 1 2 3 4 5 6
# day
# Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000
# Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000
# Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158
# Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387
#Plot with a stacked bar chart
party_counts.plot(kind='bar', stacked=True)
#Save before plt.show(); the figure may be empty once the window is closed
plt.savefig("image.png")
plt.show()
From this graph, we can see that party sizes increase on weekends (Saturday and Sunday). There are almost no single-person parties on Sundays, and the proportion of groups of 3 to 4 people, presumably families, is clearly higher.
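The row-wise normalization used above can be checked on a tiny table. This is a minimal sketch with a made-up data set (the `toy` DataFrame is hypothetical and only stands in for tips.csv); it shows that dividing by the row totals with `axis=0` makes every row sum to 1.

```python
import pandas as pd

# Hypothetical mini data set standing in for tips.csv (values are made up).
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 2, 3, 4, 4],
})

# Count parties per (day, size). toy["size"] is used because toy.size is
# the DataFrame attribute holding the total number of elements.
counts = pd.crosstab(toy["day"], toy["size"])

# Divide each row by its own total so that every row sums to 1.
fractions = counts.div(counts.sum(axis=1), axis=0)

print(fractions)
```

Because each row is a set of proportions, the stacked bars all have the same height, which makes the days comparable even when the number of parties differs.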
A histogram is a bar chart that shows the frequency of values after they have been grouped into discrete bins. Let's draw a histogram of the tip as a fraction of the total bill.
Fitting a continuous probability distribution, such as a normal distribution, to data was explained earlier using Gaussian fitting as an example. A plot of a **kernel density estimate** is called a KDE plot. You can make a density plot, using the standard mixture-of-normals kernel density estimate, by specifying kind='kde' for plot.
fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
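#Tip as a fraction of the total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
result = tips['tip_pct']
#Histogram on the upper axes, KDE plot on the lower axes
ax1.hist(result, bins=50, alpha=0.6)
result.plot(kind='kde', ax=ax2)
#Save before plt.show(); the figure may be empty once the window is closed
plt.savefig("image2.png")
plt.show()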
Plotting the kernel density estimate on top of a normalized histogram gives a result that resembles a fitted curve. This is a common technique.
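For intuition about what a Gaussian kernel density estimate computes, here is a minimal NumPy-only sketch. It is not the code pandas runs internally; the function name `gaussian_kde_sketch` is made up for illustration, and the bandwidth follows Scott's rule of thumb as an assumption. The estimate is simply the average of one small normal curve centered on each sample, which is why it is described as a mixture of normal distributions.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)

# Assumed bandwidth: Scott's rule for 1-D data, n**(-1/5) times the sample std.
bandwidth = samples.std(ddof=1) * len(samples) ** (-1.0 / 5.0)

grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth)

# A density should integrate to about 1 over a wide enough range.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```

A smaller bandwidth makes the curve follow individual samples more closely, while a larger one smooths out detail, so the choice of bandwidth plays a role similar to the choice of bin width in a histogram.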
Let's try this on data drawn from two different normal distributions, N(0,1) and N(10,4).
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
#Normal distribution part 1
comp1 = np.random.normal(0,1,size=200) # N(0,1)
#Normal distribution part 2 (the second argument is the standard deviation, so 2 gives variance 4)
comp2 = np.random.normal(10,2,size=200) # N(10,4)
#Combine two normal distributions into one series
values = Series(np.concatenate([comp1, comp2]))
print( values )
# =>
# 0 -0.305123
# 1 -1.663493
# 2 0.845320
# 3 1.217024
# 4 -0.597437
# 5 0.559524
# 6 0.849613
# 7 -0.916863
# 8 2.705579
# 9 1.397815
# 10 -1.135680
# 11 0.322982
# 12 0.568366
# 13 0.567607
# 14 0.360048
# ...
# 385 15.695692
# 386 8.868396
# 387 8.625446
# 388 5.793579
# 389 8.169981
# 390 8.434327
# 391 10.305067
# 392 11.032880
# 393 8.319812
# 394 9.026077
# 395 9.534395
# 396 4.498352
# 397 12.557349
# 398 7.365278
# 399 11.065254
# Length: 400, dtype: float64
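#Draw a normalized histogram (density=True replaces the removed normed=True)
values.hist(bins=100, alpha=0.3, color='b', density=True)
#Kernel density estimation
values.plot(kind='kde', style='r--')
#Save before plt.show(); the figure may be empty once the window is closed
plt.savefig("image3.png")
plt.show()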
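As a numerical check of what "normalized" means here, the following sketch rebuilds the same kind of two-component sample with NumPy only and confirms that the histogram bars have total area 1. The seed, the helper `normal_pdf`, and the equal 0.5 weights are assumptions made for this illustration; the weights follow from drawing 200 samples from each component.

```python
import numpy as np

rng = np.random.default_rng(0)
comp1 = rng.normal(0, 1, size=200)    # N(0, 1)
comp2 = rng.normal(10, 2, size=200)   # N(10, 4): standard deviation 2
values = np.concatenate([comp1, comp2])

# density=True rescales the counts so that the bars have total area 1.
heights, edges = np.histogram(values, bins=100, density=True)
area = float(np.sum(heights * np.diff(edges)))

def normal_pdf(x, mean, std):
    """Probability density of a normal distribution with the given mean and std."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# True density of the equal-weight mixture, evaluated at the bin centers.
centers = 0.5 * (edges[:-1] + edges[1:])
true_density = 0.5 * normal_pdf(centers, 0, 1) + 0.5 * normal_pdf(centers, 10, 2)

print(area)
```

Because the histogram and the KDE curve are both on a density scale, they can share one y-axis, and `true_density` gives the curve that the KDE is trying to approximate.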
Introduction to data analysis with Python: Data processing using NumPy and pandas http://www.oreilly.co.jp/books/9784873116556/