Data visualization method using matplotlib (+ pandas) (4)

This article continues the series on data visualization with matplotlib and pandas, picking up where the previous installment left off.

Visualize external data

This time, let's use external data as a more practical example. First, download the data from pydata-book, which this article also uses as a reference.

pydata-book/ch08/tips.csv https://github.com/pydata/pydata-book/blob/master/ch08/tips.csv

import numpy as np
from pandas import *
import matplotlib.pyplot as plt

tips = read_csv('tips.csv')

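#Cross-tabulate the CSV data
#(use tips['size'], because tips.size is the DataFrame attribute holding the element count)
party_counts = crosstab(tips['day'], tips['size'])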
print( party_counts )
# =>
# size  1   2   3   4  5  6
# day                      
# Fri   1  16   1   1  0  0
# Sat   2  53  18  13  1  0
# Sun   0  39  15  18  3  1
# Thur  1  48   4   5  1  3

#Normalize each row so that the proportions for a day sum to 1
party_counts = party_counts.div(party_counts.sum(1), axis=0)
print( party_counts )
# =>
# size         1         2         3         4         5         6
# day
# Fri   0.052632  0.842105  0.052632  0.052632  0.000000  0.000000
# Sat   0.022989  0.609195  0.206897  0.149425  0.011494  0.000000
# Sun   0.000000  0.513158  0.197368  0.236842  0.039474  0.013158
# Thur  0.016129  0.774194  0.064516  0.080645  0.016129  0.048387

#Plot as a stacked bar chart
party_counts.plot(kind='bar', stacked=True)
#Save before plt.show(), otherwise the saved image may be empty
plt.savefig("image.png")
plt.show()

This graph shows that party sizes grow on weekends (Saturday and Sunday). On Sundays there are almost no single diners, and the share of parties of 3 to 4 people, presumably families, is clearly higher than on weekdays.
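The row normalization used above can be checked on a small table. The following is a minimal sketch, assuming a pandas version whose crosstab accepts the normalize argument; the miniature DataFrame is invented for illustration and is not the tips data. It shows that dividing each row by its row total gives the same result as asking crosstab to normalize by index.

```python
import pandas as pd

# Hypothetical miniature of the tips data (values invented for illustration)
df = pd.DataFrame({
    "day":  ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 2, 3, 4, 4],
})

# Count how many parties of each size appear on each day
counts = pd.crosstab(df["day"], df["size"])

# Divide each row by its row total, as in the article
by_div = counts.div(counts.sum(axis=1), axis=0)

# Let crosstab normalize each row directly
by_arg = pd.crosstab(df["day"], df["size"], normalize="index")

print(by_div)
print(by_arg)
```

Note that the column is accessed as df["size"], since df.size is the DataFrame attribute that returns the number of elements.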

Histogram and fitting

A histogram is a bar graph that shows the frequency of values after they have been grouped into discrete bins. Let's draw the ratio of the tip to the total bill as a histogram.

I explained earlier, using Gaussian fitting as an example, how to fit a continuous probability distribution such as a normal distribution to observed data. A density plot built from a kernel density estimate is called a KDE plot. You can draw a density plot based on the usual mixture-of-normals kernel density estimate by specifying kind='kde' for plot.
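To make the "mixture of normals" idea concrete, here is a minimal sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The function name gaussian_kde_1d and the choice of Scott's rule for the bandwidth are assumptions made for this illustration; it is a simplified sketch, not the code that pandas runs for kind='kde'.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Bandwidth from Scott's rule (an assumption for this sketch)
    h = samples.std(ddof=1) * n ** (-1.0 / 5.0)
    # One normal density per sample, centered on that sample
    z = (grid[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    # The estimate is the average of the n kernels
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200)
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid)

# Riemann-sum approximation of the area under the estimated curve
area = density.sum() * (grid[1] - grid[0])
print(area)
```

Because every kernel is itself a probability density, their average is one too, so the printed area should be very close to 1.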

fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)

tips['tip_pct'] = tips['tip'] / tips['total_bill']
result = tips['tip_pct']

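#Histogram on the upper axes, kernel density estimate on the lower axes
ax1.hist(result, bins=50, alpha=0.6)
result.plot(kind='kde', ax=ax2)

#Save before plt.show(), otherwise the saved image may be empty
plt.savefig("image2.png")
plt.show()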

Plotting the kernel density estimate on top of a normalized histogram gives a fitting-like view of the distribution. This is a common technique.
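The histogram has to be normalized because a density curve encloses an area of 1, while raw bar heights add up to the number of samples, so the two would sit on very different scales. The sketch below, using NumPy's histogram function on synthetic data generated only for this example, shows what the normalization does to the bar heights.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(0.0, 1.0, size=1000)

# Raw counts: the bar heights sum to the number of samples
counts, edges = np.histogram(values, bins=50)

# density=True rescales the heights so that the total bar area is 1
heights, _ = np.histogram(values, bins=50, density=True)

widths = np.diff(edges)
total_area = (heights * widths).sum()

print(counts.sum())
print(total_area)
```

Each normalized height equals the raw count divided by (number of samples × bin width), which puts the bars on the same scale as a density curve.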

Let's try this on a bimodal distribution built from samples of two different normal distributions, N(0,1) and N(10,4).

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

#Normal distribution part 1
comp1 = np.random.normal(0,1,size=200) # N(0,1)
#Normal distribution part 2
comp2 = np.random.normal(10,2,size=200) # N(10,4): standard deviation 2, variance 4

#Combine the two samples into one Series
values = Series(np.concatenate([comp1, comp2]))

print( values )
# =>
# 0    -0.305123
# 1    -1.663493
# 2     0.845320
# 3     1.217024
# 4    -0.597437
# 5     0.559524
# 6     0.849613
# 7    -0.916863
# 8     2.705579
# 9     1.397815
# 10   -1.135680
# 11    0.322982
# 12    0.568366
# 13    0.567607
# 14    0.360048
# ...
# 385    15.695692
# 386     8.868396
# 387     8.625446
# 388     5.793579
# 389     8.169981
# 390     8.434327
# 391    10.305067
# 392    11.032880
# 393     8.319812
# 394     9.026077
# 395     9.534395
# 396     4.498352
# 397    12.557349
# 398     7.365278
# 399    11.065254
# Length: 400, dtype: float64

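#Draw a normalized histogram (density=True replaces normed=True, which newer matplotlib no longer accepts)
values.hist(bins=100, alpha=0.3, color='b', density=True)
#Overlay the kernel density estimate
values.plot(kind='kde', style='r--')

#Save before plt.show(), otherwise the saved image may be empty
plt.savefig("image3.png")
plt.show()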

Reference

Introduction to data analysis with Python-Data processing using NumPy and pandas http://www.oreilly.co.jp/books/9784873116556/