Learn Pandas in 10 minutes

The pandas 0.15.2 documentation has a page called "10 Minutes to pandas". Working through it helped me organize what I knew. If you go through it properly you will not finish in 10 minutes, so these are just notes on the parts that looked useful.

First, import pandas and NumPy.

#import libraries
import pandas as pd
import numpy as np

Create a DataFrame

There are several ways to create a DataFrame, so let's go through them. First, build a matrix with NumPy, then attach an index and column labels.

Create the index.

#Create an index
dates = pd.date_range("20130101", periods=6)
dates

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

Create a DataFrame and attach the index.

#Create a DataFrame
df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list("ABCD"))
df

 	A 	B 	C 	D
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
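Because np.random.randn returns different numbers on every run, your values will not match the table above. A minimal sketch of a reproducible variant, assuming a NumPy version that provides np.random.default_rng (the seed 0 and the name df_demo are arbitrary choices for this example, not part of the original tutorial):

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" values are identical on every run.
# The seed value 0 is an arbitrary choice for this sketch.
rng = np.random.default_rng(0)

dates = pd.date_range("20130101", periods=6)
df_demo = pd.DataFrame(
    rng.standard_normal((6, 4)),  # 6 rows x 4 columns of normal samples
    index=dates,
    columns=list("ABCD"),
)

print(df_demo.shape)  # (6, 4)
```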

This time, build a DataFrame from a dict, thinking of each column as its own Series. This way each column can have a different dtype.

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

 	A 	B 	C 	D 	E 	F
0 	1 	2013-01-02 	1 	3 	test 	foo
1 	1 	2013-01-02 	1 	3 	train 	foo
2 	1 	2013-01-02 	1 	3 	test 	foo
3 	1 	2013-01-02 	1 	3 	train 	foo


df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
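In the dict form, scalar values such as 1. or 'foo' are repeated for every row, and the row count comes from the array-like entries. A small sketch of that behavior (the column names x, y, z and the name df_small are made up for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({
    "x": 1.0,                      # scalar: repeated for every row
    "y": pd.Series([10, 20, 30]),  # this Series supplies the row index 0..2
    "z": "foo",                    # scalar string: also repeated
})

print(len(df_small))  # 3
print(df_small.dtypes)
```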

Viewing a DataFrame

Next is how to view the data in the form you want.

Show only the index, only the columns, or only the underlying NumPy data.

df.index

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None


df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

df.values

array([[ 0.705624  , -0.79390348,  0.84342517,  0.67260162],
       [-1.21112884,  2.0771009 , -1.79586146,  0.02806019],
       [ 0.70608621,  0.38563092,  0.9675681 ,  0.27189394],
       [ 2.15227868, -0.49357565,  1.18428903, -1.19329976],
       [ 0.45576744,  0.78755094,  0.23940583,  1.62758649],
       [-0.63916155, -0.05261954,  0.28800958, -2.20577674]])
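On newer pandas versions (0.24 and later) the same array is also available through the to_numpy() method. A short sketch with a tiny hand-written frame (the name frame and its values are only for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# to_numpy() returns the data as a plain NumPy array, one row per DataFrame row.
arr = frame.to_numpy()

print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)
```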

describe() conveniently shows summary statistics for every numeric column at once.

df.describe()

 	A 	B 	C 	D
count 	6.000000 	6.000000 	6.000000 	6.000000
mean 	0.361578 	0.318364 	0.287806 	-0.133156
std 	1.177066 	1.034585 	1.087978 	1.368150
min 	-1.211129 	-0.793903 	-1.795861 	-2.205777
25% 	-0.365429 	-0.383337 	0.251557 	-0.887960
50% 	0.580696 	0.166506 	0.565717 	0.149977
75% 	0.705971 	0.687071 	0.936532 	0.572425
max 	2.152279 	2.077101 	1.184289 	1.627586
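To see what each row of describe() means, here is a sketch on a column small enough to check by hand (the frame tiny is made up for this example). Note that std is the sample standard deviation, that is, it divides by n - 1.

```python
import pandas as pd

tiny = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = tiny.describe()

# Individual statistics can be read out with .loc[row_label, column_label].
print(summary.loc["count", "A"])  # 4.0
print(summary.loc["mean", "A"])   # 2.5
print(summary.loc["50%", "A"])    # 2.5 (the median)

# describe() agrees with the corresponding single-purpose methods.
same_std = abs(summary.loc["std", "A"] - tiny["A"].std()) < 1e-12
print(same_std)  # True
```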

Transpose the DataFrame, swapping rows and columns.

df.T

 	2013-01-01 00:00:00 	2013-01-02 00:00:00 	2013-01-03 00:00:00 	2013-01-04 00:00:00 	2013-01-05 00:00:00 	2013-01-06 00:00:00
A 	0.705624 	-1.211129 	0.706086 	2.152279 	0.455767 	-0.639162
B 	-0.793903 	2.077101 	0.385631 	-0.493576 	0.787551 	-0.052620
C 	0.843425 	-1.795861 	0.967568 	1.184289 	0.239406 	0.288010
D 	0.672602 	0.028060 	0.271894 	-1.193300 	1.627586 	-2.205777
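A tiny sketch of what the transpose does to the labels: the old column labels become the index and the old index becomes the columns (the row labels r1 and r2 are invented for this example):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])
flipped = frame.T

print(list(flipped.index))    # ['A', 'B']
print(list(flipped.columns))  # ['r1', 'r2']

# The value at (row r2, column A) moves to (row A, column r2).
print(flipped.loc["A", "r2"])  # 2
```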

You can sort along either axis. For example, sort the column labels in descending order.

df.sort_index(axis=1, ascending=False)

 	D 	C 	B 	A
2013-01-01 	0.672602 	0.843425 	-0.793903 	0.705624
2013-01-02 	0.028060 	-1.795861 	2.077101 	-1.211129
2013-01-03 	0.271894 	0.967568 	0.385631 	0.706086
2013-01-04 	-1.193300 	1.184289 	-0.493576 	2.152279
2013-01-05 	1.627586 	0.239406 	0.787551 	0.455767
2013-01-06 	-2.205777 	0.288010 	-0.052620 	-0.639162

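Next, sort the rows by the values in column "B" in ascending order. In pandas 0.15.2 this was written df.sort(columns='B'); that method was deprecated in pandas 0.17 and later removed, so current versions use sort_values.


df.sort_values(by='B')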

 	A 	B 	C 	D
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
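Sorting by values returns a new DataFrame and leaves the original untouched. A minimal sketch using sort_values, which is available in pandas 0.17 and later (the scores frame is made up for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"name": ["a", "b", "c"], "score": [2, 3, 1]})

# Sort rows by the "score" column, largest first.
by_score = scores.sort_values(by="score", ascending=False)

print(by_score["name"].tolist())  # ['b', 'a', 'c']
print(scores["name"].tolist())    # ['a', 'b', 'c'] (original order is kept)
```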

Pick out data

There are many ways to extract data, for example taking only a range of the index.

Select rows by index label and columns by column label at the same time.

df.loc['20130102':'20130104',['A','B']]

 	A 	B
2013-01-02 	-1.211129 	2.077101
2013-01-03 	0.706086 	0.385631
2013-01-04 	2.152279 	-0.493576
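One detail worth noting: a label slice with .loc includes both endpoints, while a positional slice with .iloc excludes the stop position, as usual in Python. A sketch comparing the two on a frame with easy-to-read values (df_sel and its contents are assumptions made for this example):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df_sel = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based: rows 2013-01-02 through 2013-01-04 inclusive, columns A and B.
by_label = df_sel.loc["20130102":"20130104", ["A", "B"]]

# Position-based: rows 1, 2, 3 and columns 0, 1 (stop positions are excluded).
by_position = df_sel.iloc[1:4, 0:2]

print(by_label.shape)                # (3, 2)
print(by_label.equals(by_position))  # True
```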

You can group rows by the values in any column and aggregate each group directly.


#Create a DataFrame
df = pd.DataFrame({"A" : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   "B" : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   "C" : np.random.randn(8),
                   "D" : np.random.randn(8)})

df

 	A 	B 	C 	D
0 	foo 	one 	1.130975 	1.235940
1 	bar 	one 	-0.140004 	-2.714958
2 	foo 	two 	1.526578 	-0.165415
3 	bar 	three 	-1.049092 	-0.037484
4 	foo 	two 	-1.182303 	0.288754
5 	bar 	two 	0.530652 	1.204125
6 	foo 	one 	0.678477 	-0.273343
7 	foo 	three 	0.929624 	0.169822


#Group by column A, then sum the numeric columns C and D
df.groupby('A')[['C', 'D']].sum()

 	C 	D
A 		
bar 	-0.658445 	-1.548317
foo 	3.083350 	1.255758
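The same pattern works for other aggregations. A small sketch with integer values that are easy to verify by hand (the sales frame and its column names are invented for this example):

```python
import pandas as pd

sales = pd.DataFrame({
    "key": ["foo", "bar", "foo", "bar"],
    "value": [1, 2, 3, 4],
})

# One aggregate per group: foo -> 1 + 3, bar -> 2 + 4.
totals = sales.groupby("key")["value"].sum()
print(totals["foo"])  # 4
print(totals["bar"])  # 6

# Several aggregates at once; the result has one column per function name.
stats = sales.groupby("key")["value"].agg(["sum", "mean"])
print(stats.loc["bar", "mean"])  # 3.0
```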

Creating a pivot table

First, create a DataFrame to turn into a pivot table.

#Create a DataFrame
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] *2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

 	A 	B 	C 	D 	E
0 	one 	A 	foo 	0.575699 	-1.669032
1 	one 	B 	foo 	0.592889 	-2.526196
2 	two 	C 	foo 	-2.229949 	-0.703339
3 	three 	A 	bar 	0.801380 	-1.638983
4 	one 	B 	bar 	-0.135691 	-0.302586
5 	one 	C 	bar 	0.317401 	1.169608
6 	two 	A 	foo 	0.064460 	-0.109437
7 	three 	B 	foo 	-0.605017 	1.043246
8 	one 	C 	foo 	-0.365220 	0.850535
9 	one 	A 	bar 	1.033552 	0.226002
10 	two 	B 	bar 	-0.260542 	0.352249
11 	three 	C 	bar 	0.518531 	1.407827

It can be converted to a pivot table with a single call. Combinations of A, B and C that do not occur in the data show up as NaN.

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

 	C 	bar 	foo
A 	B 		
one 	A 	1.033552 	0.575699
 	B 	-0.135691 	0.592889
 	C 	0.317401 	-0.365220
three 	A 	0.801380 	NaN
 	B 	NaN 	-0.605017
 	C 	0.518531 	NaN
two 	A 	NaN 	0.064460
 	B 	-0.260542 	NaN
 	C 	NaN 	-2.229949
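By default pivot_table averages the values that fall into the same cell, and cells with no data become NaN. A sketch of both behaviors on a three-row frame, plus the aggfunc and fill_value parameters for changing them (the obs frame is made up for this example):

```python
import pandas as pd

obs = pd.DataFrame({
    "A": ["one", "one", "two"],
    "C": ["foo", "foo", "bar"],
    "D": [1.0, 3.0, 5.0],
})

# Default aggregation is the mean: the two one/foo rows collapse to (1 + 3) / 2.
table = pd.pivot_table(obs, values="D", index="A", columns="C")
print(table.loc["one", "foo"])           # 2.0
print(pd.isna(table.loc["one", "bar"]))  # True (no one/bar rows exist)

# Sum instead of mean, and show 0 instead of NaN for empty cells.
filled = pd.pivot_table(obs, values="D", index="A", columns="C",
                        aggfunc="sum", fill_value=0)
print(filled.loc["one", "foo"])  # 4.0
```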

Summary

Even a single quick read-through pays off: when you later run into one of these operations, you will remember that pandas can do it, which is a real help.

Reference

pandas 0.15.2 documentation http://pandas.pydata.org/pandas-docs/stable/index.html

10 Minutes to pandas http://pandas.pydata.org/pandas-docs/stable/10min.html