I worked through "10 Minutes to pandas" in the pandas 0.15.2 documentation, and it organized my thinking nicely. Done properly it takes more than 10 minutes, so here I just note the parts that look convenient.
First, import pandas and NumPy.
#import libraries
import pandas as pd
import numpy as np
There are several ways to create a DataFrame, so let me organize them. The first is to build a matrix with NumPy and attach an index and column labels to it.
Create the index.
#Create an index
dates = pd.date_range("20130101", periods=6)
dates
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
Create a DataFrame and attach the index.
#Create a DataFrame
df = pd.DataFrame(np.random.randn(6,4),index = dates, columns = list("ABCD"))
df
A B C D
2013-01-01 0.705624 -0.793903 0.843425 0.672602
2013-01-02 -1.211129 2.077101 -1.795861 0.028060
2013-01-03 0.706086 0.385631 0.967568 0.271894
2013-01-04 2.152279 -0.493576 1.184289 -1.193300
2013-01-05 0.455767 0.787551 0.239406 1.627586
2013-01-06 -0.639162 -0.052620 0.288010 -2.205777
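The values above come from np.random.randn, so they change on every run. As a small sketch for anyone who wants repeatable numbers, the same construction can be done with a seeded generator. The seed value 0 and the variable names here are my own choices, not part of the tutorial.

```python
import numpy as np
import pandas as pd

# Seeded generator: the seed (0) is an arbitrary choice for this sketch
rng = np.random.default_rng(0)

# Same shape of data as above: 6 dates as the index, 4 labelled columns
dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(frame.shape)      # (6, 4)
print(frame.index[0])   # first date in the index
```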
Next, create a DataFrame from a dict, which is like supplying one Series per column label. This way each column can have its own dtype.
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
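Once the columns have different dtypes, it is sometimes handy to pick columns by dtype. The sketch below rebuilds the same kind of frame and keeps only the numeric columns with select_dtypes. The names mixed and numeric are just for illustration.

```python
import numpy as np
import pandas as pd

# Scalars ("A", "B", "F") are repeated to match the length of the other columns
mixed = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Keep only the numeric columns (float64, float32 and int32 here)
numeric = mixed.select_dtypes(include="number")
print(list(numeric.columns))
```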
Next, how to view the data in the form you want.
Display only the index, only the columns, or only the underlying NumPy data.
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[ 0.705624 , -0.79390348, 0.84342517, 0.67260162],
[-1.21112884, 2.0771009 , -1.79586146, 0.02806019],
[ 0.70608621, 0.38563092, 0.9675681 , 0.27189394],
[ 2.15227868, -0.49357565, 1.18428903, -1.19329976],
[ 0.45576744, 0.78755094, 0.23940583, 1.62758649],
[-0.63916155, -0.05261954, 0.28800958, -2.20577674]])
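To make the three pieces easier to see, here is a tiny sketch with made-up data. Note that to_numpy() only exists in newer pandas (0.24 and later); on 0.15.2 you would use .values as above.

```python
import pandas as pd

# Small hypothetical frame: two rows, two columns
small = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])

row_labels = list(small.index)     # the index (row labels)
col_labels = list(small.columns)   # the column labels
data = small.to_numpy()            # the underlying values as a NumPy array

print(row_labels, col_labels)
print(data.shape)
```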
describe() shows summary statistics for every column at once, which is convenient.
df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.361578 0.318364 0.287806 -0.133156
std 1.177066 1.034585 1.087978 1.368150
min -1.211129 -0.793903 -1.795861 -2.205777
25% -0.365429 -0.383337 0.251557 -0.887960
50% 0.580696 0.166506 0.565717 0.149977
75% 0.705971 0.687071 0.936532 0.572425
max 2.152279 2.077101 1.184289 1.627586
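With random data it is hard to check the output by eye, so here is a sketch on a column whose statistics are easy to work out by hand. The values 1 to 4 are made up for the example.

```python
import pandas as pd

# One column with values that are easy to verify manually
tiny = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})
summary = tiny.describe()

# Each row of the summary can be looked up by its label
print(summary.loc["mean", "v"])   # 2.5
print(summary.loc["25%", "v"])    # 1.75
print(summary.loc["max", "v"])    # 4.0
```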
Transpose the DataFrame (swap its rows and columns).
df.T
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A 0.705624 -1.211129 0.706086 2.152279 0.455767 -0.639162
B -0.793903 2.077101 0.385631 -0.493576 0.787551 -0.052620
C 0.843425 -1.795861 0.967568 1.184289 0.239406 0.288010
D 0.672602 0.028060 0.271894 -1.193300 1.627586 -2.205777
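A small sketch of what .T does to the labels, using made-up integer data: the old column labels become the index, and transposing twice gets you back to the original.

```python
import pandas as pd

# Hypothetical frame with 2 rows and 3 columns
wide = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]}, index=["r1", "r2"])

# After transposing: 3 rows (x, y, z) and 2 columns (r1, r2)
tall = wide.T
print(tall.shape)           # (3, 2)
print(tall.loc["z", "r1"])  # 5
```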
Sort along either axis. For example, sort the column labels in descending order.
df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 0.672602 0.843425 -0.793903 0.705624
2013-01-02 0.028060 -1.795861 2.077101 -1.211129
2013-01-03 0.271894 0.967568 0.385631 0.706086
2013-01-04 -1.193300 1.184289 -0.493576 2.152279
2013-01-05 1.627586 0.239406 0.787551 0.455767
2013-01-06 -2.205777 0.288010 -0.052620 -0.639162
Next, sort the rows by the values in column "B", in ascending order.
df.sort(columns='B')
A B C D
2013-01-01 0.705624 -0.793903 0.843425 0.672602
2013-01-04 2.152279 -0.493576 1.184289 -1.193300
2013-01-06 -0.639162 -0.052620 0.288010 -2.205777
2013-01-03 0.706086 0.385631 0.967568 0.271894
2013-01-05 0.455767 0.787551 0.239406 1.627586
2013-01-02 -1.211129 2.077101 -1.795861 0.028060
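One caveat if you are on a newer pandas than 0.15.2: sort_values was added in pandas 0.17 and df.sort was later removed, so the call above will fail there. A minimal sketch of the newer spelling, with made-up data and names:

```python
import pandas as pd

# Hypothetical data: columns deliberately given in the order B, A
scores = pd.DataFrame({"B": [0.4, -0.8, 2.1], "A": [1, 2, 3]}, index=["r1", "r2", "r3"])

# Sort rows by the values in column B (ascending by default)
by_value = scores.sort_values(by="B")

# Sort the column labels themselves (axis=1)
by_label = scores.sort_index(axis=1)

print(list(by_value.index))    # ['r2', 'r1', 'r3']
print(list(by_label.columns))  # ['A', 'B']
```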
Data can be extracted in various ways, for example by taking only part of the index.
Extract data by specifying both a range of index labels and a list of column labels.
df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 -1.211129 2.077101
2013-01-03 0.706086 0.385631
2013-01-04 2.152279 -0.493576
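One thing worth noticing in the output above is that a label slice with .loc includes both endpoints (2013-01-04 is in the result). The sketch below, with made-up integer data, compares it with the position-based .iloc, where the end of the slice is excluded.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": np.arange(6), "B": np.arange(10, 16)}, index=dates)

# Label-based: both '20130102' and '20130104' are included
by_label = frame.loc["20130102":"20130104", ["A", "B"]]

# Position-based: rows 1, 2, 3 (the end position 4 is excluded)
by_position = frame.iloc[1:4, [0, 1]]

print(len(by_label))                 # 3
print(by_label.equals(by_position))  # True
```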
You can group rows by any column. The grouped data can then be aggregated directly.
#Create a DataFrame
df = pd.DataFrame({"A" : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
"B" : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
"C" : np.random.randn(8),
"D" : np.random.randn(8)})
df
A B C D
0 foo one 1.130975 1.235940
1 bar one -0.140004 -2.714958
2 foo two 1.526578 -0.165415
3 bar three -1.049092 -0.037484
4 foo two -1.182303 0.288754
5 bar two 0.530652 1.204125
6 foo one 0.678477 -0.273343
7 foo three 0.929624 0.169822
#Group by column A, then calculate the sum
df.groupby('A').sum()
C D
A
bar -0.658445 -1.548317
foo 3.083350 1.255758
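In the output above the string column B was dropped automatically. As far as I know, recent pandas (2.0 and later) no longer drops non-numeric columns by default, so it seems safer to select the numeric columns yourself. A sketch with made-up numbers that are easy to add up:

```python
import pandas as pd

# Hypothetical data: A is the grouping key, B is a string column, C and D are numeric
sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "three"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10, 20, 30, 40],
})

# Pick the numeric columns explicitly, then sum within each group of A
totals = sales.groupby("A")[["C", "D"]].sum()

print(totals.loc["foo", "C"])  # 4.0
print(totals.loc["bar", "D"])  # 60
```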
Create a DataFrame to turn into a pivot table.
#Create a DataFrame
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] *2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
df
A B C D E
0 one A foo 0.575699 -1.669032
1 one B foo 0.592889 -2.526196
2 two C foo -2.229949 -0.703339
3 three A bar 0.801380 -1.638983
4 one B bar -0.135691 -0.302586
5 one C bar 0.317401 1.169608
6 two A foo 0.064460 -0.109437
7 three B foo -0.605017 1.043246
8 one C foo -0.365220 0.850535
9 one A bar 1.033552 0.226002
10 two B bar -0.260542 0.352249
11 three C bar 0.518531 1.407827
It can be converted to a pivot table relatively easily.
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A 1.033552 0.575699
B -0.135691 0.592889
C 0.317401 -0.365220
three A 0.801380 NaN
B NaN -0.605017
C 0.518531 NaN
two A NaN 0.064460
B -0.260542 NaN
C NaN -2.229949
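Two details of the pivot table are easy to miss with random data: cells with no matching rows become NaN, and when several rows fall into the same cell they are averaged (the default aggregation is the mean). A sketch with made-up data where one cell receives two rows:

```python
import pandas as pd

# Hypothetical records: ("one", "x", "foo") appears twice, with D = 1.0 and D = 5.0
records = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one"],
    "B": ["x", "y", "x", "y", "x"],
    "C": ["foo", "bar", "foo", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# aggfunc="mean" is written out here only to make the default explicit
table = pd.pivot_table(records, values="D", index=["A", "B"], columns="C", aggfunc="mean")

print(table.loc[("one", "x"), "foo"])           # 3.0, the mean of 1.0 and 5.0
print(pd.isna(table.loc[("one", "x"), "bar"]))  # True, no such combination
```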
Even a quick read-through pays off: when you later run into one of these tasks, you will remember that the feature exists, which I really appreciate.
pandas 0.15.2 documentation http://pandas.pydata.org/pandas-docs/stable/index.html
10 Minutes to pandas http://pandas.pydata.org/pandas-docs/stable/10min.html