I had the company buy O'Reilly's "Python for Data Analysis".
I'm recording the setup procedure here so it can be shared in-house.
I was told there was no Windows procedure even though this is for internal use, so I'm adding one: Windows users should use Cygwin, install pip and setuptools on Cygwin first, and then install virtualenv the same way as below.
I'll skip explaining pip and virtualenv themselves; just make sure the mkvirtualenv and pip commands are available. Also, since I want to get used to Python 3, I'll use python3. O'Reilly says to install Canopy Express, but I'll install the libraries myself.
$ mkvirtualenv --no-site-packages --python /usr/local/bin/python3 analytics
(analytics)$ pip install numpy
(analytics)$ pip install scipy
(analytics)$ pip install matplotlib
(analytics)$ pip install ipython
(analytics)$ pip install ipython[notebook]
(analytics)$ ipython
This creates a separate environment called analytics; I'll work inside it from now on. The commands above install IPython and the other libraries used for analysis. Check the installed libraries:
$ pip freeze
appnope==0.1.0
cycler==0.9.0
decorator==4.0.6
gnureadline==6.3.3
ipykernel==4.2.2
ipython==4.0.1
ipython-genutils==0.1.0
Jinja2==2.8
jsonschema==2.5.1
jupyter-client==4.1.1
jupyter-core==4.0.6
MarkupSafe==0.23
matplotlib==1.5.0
mistune==0.7.1
nbconvert==4.1.0
nbformat==4.0.1
notebook==4.0.6
numpy==1.10.4
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
Pygments==2.0.2
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
pyzmq==15.1.0
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
terminado==0.6
tornado==4.3
traitlets==4.0.0
wheel==0.24.0
Check that IPython starts:
Python 3.5.1 (default, Dec 7 2015, 21:59:08)
Type "copyright", "credits" or "license" for more information.
IPython 4.0.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:
Exit with Ctrl+D, then install pandas:
$ pip install pandas
Let's check that it works. Start IPython with the --pylab option so graphs can be drawn:
$ ipython --pylab
...
RuntimeError: Python is not installed as a framework. The Mac OS X backend will
I get an error. What is "Python is not installed as a framework."? Googling the message turned up a fix: create a matplotlibrc file under ~/.matplotlib and fill in the following.
~/.matplotlib/matplotlibrc
backend : TkAgg
Check the operation again.
ipython --pylab
In [1]: import pandas  # pandas imports fine
In [2]: plot(arange(10))  # matplotlib works
If a straight-line graph is displayed, it's OK.
ipython notebook
A browser window opens. Create a notebook from "New" at the upper right. This gives you a page where you can type commands, so first enter
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
and execute it with the play button above. Now graphs can be drawn inline. Next, enter
plt.plot(np.random.randn(1000))
and press the play button. This generates 1000 random numbers following a normal distribution and plots them. IPython Notebook records your command history like this. It's amazing!
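As a plain-Python sanity check of the same data the notebook plots (a minimal sketch using the standard library's random.gauss in place of np.random.randn, with no plotting):

```python
import random
import statistics

# Generate 1000 samples from a standard normal distribution,
# the same data the notebook example plots with np.random.randn(1000).
random.seed(0)  # fixed seed so the check is reproducible
samples = [random.gauss(0, 1) for _ in range(1000)]

# With 1000 samples, the mean should be near 0 and the stdev near 1.
print(len(samples))
print(statistics.mean(samples))
print(statistics.stdev(samples))
```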
Move to a suitable working directory
git clone https://github.com/pydata/pydata-book.git
This will bring you sample data that you can use to practice your statistics.
cd pydata-book/ch02
Let's analyze usagov_bitly_data2012-03-16-1331923249.txt in there with Python! By the way, this file is essentially a log of shortened-URL generation.
I was going to write up the analysis itself, but from here on it would just be copying the textbook, so I'll omit it!
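Just to show the shape of the data: each line of the file is a JSON record, and the book's first example counts the 'tz' (time zone) field. A minimal sketch of that idea, using a few made-up inline records in place of the real file (the sample values here are illustrative):

```python
import json
from collections import Counter

# Each line of usagov_bitly_data*.txt is one JSON record.
# Inline samples stand in for the real file here.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "America/New_York", "a": "Mozilla/4.0"}',
    '{"tz": "Europe/London", "a": "Opera/9.80"}',
    '{"a": "GoogleMaps/RochesterNY"}',  # some records lack a tz field
]

records = [json.loads(line) for line in sample_lines]
tz_counts = Counter(rec.get('tz', 'unknown') for rec in records)

print(tz_counts.most_common(1))  # → [('America/New_York', 2)]
```

Against the real file you would replace sample_lines with the opened file object, since iterating a file yields its lines.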
A handy tool that comes up in Chapter 3. I'm recording it here because it's part of the environment setup. During analysis you sometimes want to see, line by line, how a function behaves when it performs some heavy calculation. For example, if a 10 ms calculation repeated a million times can be improved to 5 ms per call, you save a great deal of time in total. Improvements like this are probably even more effective for scientific computations on large matrices. line_profiler is a convenient tool that evaluates how long each line of a function takes.
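The savings in that example are easy to put in numbers (a quick back-of-the-envelope calculation, not from the book):

```python
# 10 ms per call improved to 5 ms, repeated one million times.
before_s = 0.010 * 1_000_000   # 10,000 seconds total
after_s = 0.005 * 1_000_000    # 5,000 seconds total
saved_s = before_s - after_s

print(saved_s)        # 5000.0 seconds saved
print(saved_s / 60)   # over 80 minutes
```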
pip install line_profiler
ipython profile create
vi ~/.ipython/extensions/line_profiler_ext.py
txt:~/.ipython/extensions/line_profiler_ext.py
import line_profiler
def load_ipython_extension(ip):
    ip.define_magic('lprun', line_profiler.magic_lprun)
vi ~/.ipython/profile_default/ipython_config.py
py:~/.ipython/profile_default/ipython_config.py
#------------------------------------------------------------------------------
# TerminalIPythonApp configuration
#------------------------------------------------------------------------------
c.TerminalIPythonApp.extensions = [
'line_profiler_ext',
]
In [1]: from numpy.random import randn
In [2]: def add_and_sum(x, y):
...: added = x + y
...: summed = added.sum(axis=1)
...: return summed
...:
In [5]: x = randn(3000, 3000)
In [6]: y = randn(3000, 3000)
Run the add_and_sum defined above and evaluate how long it takes with the arguments x and y. This is done with the %lprun magic command.
In [16]: %lprun -f add_and_sum add_and_sum(x, y)
Timer unit: 1e-06 s
Total time: 0.036058 s
File: <ipython-input-2-19f64f63ba0a>
Function: add_and_sum at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def add_and_sum(x, y):
2 1 28247 28247.0 78.3 added = x + y
3 1 7809 7809.0 21.7 summed = added.sum(axis=1)
4 1 2 2.0 0.0 return summed
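What %lprun automates can be imitated by hand with time.perf_counter: time each statement separately and see which one dominates, as the table above shows for the addition versus the sum. A pure-Python sketch of that idea (stdlib only, so nested lists stand in for the numpy arrays, and the sizes are scaled down):

```python
import time

def add_and_sum_timed(x, y):
    # Time each "line" of the computation separately, the way
    # line_profiler reports per-line cost.
    t0 = time.perf_counter()
    added = [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
    t1 = time.perf_counter()
    summed = [sum(row) for row in added]
    t2 = time.perf_counter()
    print(f"x + y       : {t1 - t0:.6f} s")
    print(f"sum per row : {t2 - t1:.6f} s")
    return summed

x = [[1.0] * 300 for _ in range(300)]
y = [[2.0] * 300 for _ in range(300)]
result = add_and_sum_timed(x, y)
# Each row sums 300 elements of (1.0 + 2.0), so every entry is 900.0.
```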