I had the company buy O'Reilly's "Python for Data Analysis".
I'm recording the setup procedure here so it can be shared in-house.
I was told there was no Windows procedure even though this is for internal use, so I'm adding one: Windows users should use Cygwin, install pip and setuptools on Cygwin first, and then install virtualenv the same way as below.
I'll skip explaining pip and virtualenv themselves; just make sure the mkvirtualenv and pip commands are available. Also, since I want to get used to Python 3, I'll use python3. O'Reilly says to install Canopy Express, but I'll install the libraries myself.
$ mkvirtualenv --no-site-packages --python /usr/local/bin/python3 analytics
(analytics)$ pip install numpy
(analytics)$ pip install scipy
(analytics)$ pip install matplotlib
(analytics)$ pip install ipython
(analytics)$ pip install ipython[notebook]
(analytics)$ ipython
This creates a separate environment called analytics; I'll work inside it from now on. The commands above install IPython and the other libraries used for analysis. Check the installed libraries:
$ pip freeze
appnope==0.1.0
cycler==0.9.0
decorator==4.0.6
gnureadline==6.3.3
ipykernel==4.2.2
ipython==4.0.1
ipython-genutils==0.1.0
Jinja2==2.8
jsonschema==2.5.1
jupyter-client==4.1.1
jupyter-core==4.0.6
MarkupSafe==0.23
matplotlib==1.5.0
mistune==0.7.1
nbconvert==4.1.0
nbformat==4.0.1
notebook==4.0.6
numpy==1.10.4
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
Pygments==2.0.2
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
pyzmq==15.1.0
scipy==0.16.1
simplegeneric==0.8.1
six==1.10.0
terminado==0.6
tornado==4.3
traitlets==4.0.0
wheel==0.24.0
Check that IPython starts:
Python 3.5.1 (default, Dec 7 2015, 21:59:08)
Type "copyright", "credits" or "license" for more information.
IPython 4.0.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:
Exit with Ctrl+D, then install pandas:
$ pip install pandas
Let's check that it works. Start IPython with the --pylab option so graphs can be drawn:
$ ipython --pylab
...
RuntimeError: Python is not installed as a framework. The Mac OS X backend will
I get an error. What is "Python is not installed as a framework."? Googling the message turned up a fix: create a matplotlibrc file under ~/.matplotlib and fill in the following.
~/.matplotlib/matplotlibrc
backend : TkAgg
Check the operation again.
ipython --pylab
In [1]: import pandas  # pandas imports fine
In [2]: plot(arange(10))  # matplotlib works
If a straight-line graph is displayed, it's OK.
ipython notebook
A browser window opens. Create a notebook from "New" at the upper right. This gives you a page where you can type commands, so first enter
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
and execute it with the play button above. Now graphs can be drawn inline. Next, enter
plt.plot(np.random.randn(1000))
and press the play button. This generates 1000 random numbers following a normal distribution and plots them. IPython Notebook records your command history like this. It's amazing!
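As a plain-Python sanity check of the same data the notebook plots (a minimal sketch using the standard library's random.gauss in place of np.random.randn, with no plotting):

```python
import random
import statistics

# Generate 1000 samples from a standard normal distribution,
# the same data the notebook example plots with np.random.randn(1000).
random.seed(0)  # fixed seed so the check is reproducible
samples = [random.gauss(0, 1) for _ in range(1000)]

# With 1000 samples, the mean should be near 0 and the stdev near 1.
print(len(samples))
print(statistics.mean(samples))
print(statistics.stdev(samples))
```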
Move to a suitable working directory
git clone https://github.com/pydata/pydata-book.git
This will bring you sample data that you can use to practice your statistics.
cd pydata-book/ch02
Let's analyze usagov_bitly_data2012-03-16-1331923249.txt in there with Python! By the way, this file is essentially a log of shortened-URL generation.
I was going to write up the analysis itself, but from here on it would just be copying the textbook, so I'll omit it!
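Just to show the shape of the data: each line of the file is a JSON record, and the book's first example counts the 'tz' (time zone) field. A minimal sketch of that idea, using a few made-up inline records in place of the real file (the sample values here are illustrative):

```python
import json
from collections import Counter

# Each line of usagov_bitly_data*.txt is one JSON record.
# Inline samples stand in for the real file here.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "America/New_York", "a": "Mozilla/4.0"}',
    '{"tz": "Europe/London", "a": "Opera/9.80"}',
    '{"a": "GoogleMaps/RochesterNY"}',  # some records lack a tz field
]

records = [json.loads(line) for line in sample_lines]
tz_counts = Counter(rec.get('tz', 'unknown') for rec in records)

print(tz_counts.most_common(1))  # → [('America/New_York', 2)]
```

Against the real file you would replace sample_lines with the opened file object, since iterating a file yields its lines.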
A handy tool that comes up in Chapter 3. I'm recording it here because it's part of the environment setup. During analysis you sometimes want to see, line by line, how a function behaves when it performs some heavy calculation. For example, if a 10 ms calculation repeated a million times can be improved to 5 ms per call, you save a great deal of time in total. Improvements like this are probably even more effective for scientific computations on large matrices. line_profiler is a convenient tool that evaluates how long each line of a function takes.
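The savings in that example are easy to put in numbers (a quick back-of-the-envelope calculation, not from the book):

```python
# 10 ms per call improved to 5 ms, repeated one million times.
before_s = 0.010 * 1_000_000   # 10,000 seconds total
after_s = 0.005 * 1_000_000    # 5,000 seconds total
saved_s = before_s - after_s

print(saved_s)        # 5000.0 seconds saved
print(saved_s / 60)   # over 80 minutes
```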
pip install line_profiler
ipython profile create
vi ~/.ipython/extensions/line_profiler_ext.py
txt:~/.ipython/extensions/line_profiler_ext.py
import line_profiler
def load_ipython_extension(ip):
    ip.define_magic('lprun', line_profiler.magic_lprun)
vi ~/.ipython/profile_default/ipython_config.py
py:~/.ipython/profile_default/ipython_config.py
#------------------------------------------------------------------------------
# TerminalIPythonApp configuration
#------------------------------------------------------------------------------
c.TerminalIPythonApp.extensions = [
'line_profiler_ext',
]
In [1]: from numpy.random import randn
In [2]: def add_and_sum(x, y):
...: added = x + y
...: summed = added.sum(axis=1)
...: return summed
...:
In [5]: x = randn(3000, 3000)
In [6]: y = randn(3000, 3000)
Run the add_and_sum defined above and evaluate how long it takes with the arguments x and y. This is done with the %lprun magic command.
In [16]: %lprun -f add_and_sum add_and_sum(x, y)
Timer unit: 1e-06 s
Total time: 0.036058 s
File: <ipython-input-2-19f64f63ba0a>
Function: add_and_sum at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def add_and_sum(x, y):
2 1 28247 28247.0 78.3 added = x + y
3 1 7809 7809.0 21.7 summed = added.sum(axis=1)
4 1 2 2.0 0.0 return summed
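What %lprun automates can be imitated by hand with time.perf_counter: time each statement separately and see which one dominates, as the table above shows for the addition versus the sum. A pure-Python sketch of that idea (stdlib only, so nested lists stand in for the numpy arrays, and the sizes are scaled down):

```python
import time

def add_and_sum_timed(x, y):
    # Time each "line" of the computation separately, the way
    # line_profiler reports per-line cost.
    t0 = time.perf_counter()
    added = [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
    t1 = time.perf_counter()
    summed = [sum(row) for row in added]
    t2 = time.perf_counter()
    print(f"x + y       : {t1 - t0:.6f} s")
    print(f"sum per row : {t2 - t1:.6f} s")
    return summed

x = [[1.0] * 300 for _ in range(300)]
y = [[2.0] * 300 for _ in range(300)]
result = add_and_sum_timed(x, y)
# Each row sums 300 elements of (1.0 + 2.0), so every entry is 900.0.
```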