Splitting data into categories and applying a function to each category is called aggregation or transformation, and it is one of the most important parts of the data analysis workflow. pandas provides powerful functionality for these group operations and lets you work with them intuitively.
Hadley Wickham, author of many popular packages for the R language, describes this process in his well-known paper The Split-Apply-Combine Strategy for Data Analysis (PDF). pandas uses the same split-apply-combine model as the basic idea behind its group operations: the data is first split into groups by one or more keys, a function is applied to each group, and the results are then combined and stored in a result object.
I previously tried getting Japanese stock prices with Ruby. Here I would like to use that collected data to try group operations with pandas on actual stock prices.
Calling groupby on a pandas object returns a GroupBy object. Its apply method splits the data into pieces that are easier to work with, applies a function to each piece, and then combines the results.
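As a minimal sketch of this split-apply-combine pattern (the toy DataFrame below is invented purely for illustration and is not part of the stock data used later):

import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'value': [1, 2, 3, 4]})

grouped = toy.groupby('key')                         # split by key -> GroupBy object
result = grouped.apply(lambda g: g['value'].mean())  # apply a function to each piece
print(result)                                        # results combined into one Series
#=> key
#   a    1.5
#   b    3.5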
import pandas as pd

# Pick up the stock prices of several companies
# NTT DATA
stock_9613 = pd.read_csv('stock_9613.csv',
                         parse_dates=True, index_col=0)
# DTS
stock_9682 = pd.read_csv('stock_9682.csv',
                         parse_dates=True, index_col=0)
# IT Holdings
stock_3626 = pd.read_csv('stock_3626.csv',
                         parse_dates=True, index_col=0)
# NSD
stock_9759 = pd.read_csv('stock_9759.csv',
                         parse_dates=True, index_col=0)
# Extract closing prices from 2010 onward into one DataFrame
df = pd.DataFrame([
    stock_9613.loc['2010-01-01':, 'closing price'],
    stock_9682.loc['2010-01-01':, 'closing price'],
    stock_3626.loc['2010-01-01':, 'closing price'],
    stock_9759.loc['2010-01-01':, 'closing price']
], index=['NTT DATA', 'DTS', 'IT Holdings', 'NSD']).T
#=>            NTT DATA   DTS  IT Holdings   NSD
# Date
# (omitted)
# 2015-01-05 4530 2553 1811 1779
# 2015-01-06 4375 2476 1748 1755
# 2015-01-07 4300 2459 1748 1754
# 2015-01-08 4350 2481 1815 1775
# 2015-01-09 4330 2478 1805 1756
# 2015-01-13 4345 2480 1813 1766
# 2015-01-14 4260 2485 1809 1770
# 2015-01-15 4340 2473 1839 1790
# 2015-01-16 4295 2458 1821 1791
Now we have each company's stock prices since 2010. The companies listed here often collaborate with one another, but how strongly are they actually correlated on the stock market? Out of curiosity, let's compute the annual correlation coefficient of each stock with NTT DATA.
# Compute the daily percent changes (returns)
rets = df.pct_change().dropna()
# Group by year
by_year = rets.groupby(lambda x: x.year)
# Define an anonymous function that computes each column's correlation with NTT DATA
vol_corr = lambda x: x.corrwith(x['NTT DATA'])
# Apply the function to each group
result1 = by_year.apply(vol_corr)
print(result1)
#=>     NTT DATA       DTS  IT Holdings       NSD
# 2010 1 0.346437 0.492006 0.443910
# 2011 1 0.485108 0.575495 0.619912
# 2012 1 0.261388 0.268531 0.212315
# 2013 1 0.277970 0.358796 0.408304
# 2014 1 0.381762 0.404376 0.385258
# 2015 1 0.631186 0.799621 0.770759
Let's visualize it with matplotlib.
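The original chart is not reproduced here, but a minimal plotting sketch along these lines should produce it (the figure size and marker style are my own choices, not from the source):

import matplotlib.pyplot as plt

# Plot the yearly correlation of each stock with NTT DATA
# (the NTT DATA column itself is always 1, so it is dropped)
result1[['DTS', 'IT Holdings', 'NSD']].plot(figsize=(8, 5), marker='o')
plt.xlabel('year')
plt.ylabel('correlation with NTT DATA')
plt.show()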
You can also use the apply method to find the correlation between two specific columns. For example, let's find the correlation of DTS's stock price with NTT DATA's.
# Apply an anonymous function to find the correlation coefficient between one column and another
result2 = by_year.apply(lambda g: g['DTS'].corr(g['NTT DATA']))
print(result2)
#=>
# 2010 0.346437
# 2011 0.485108
# 2012 0.261388
# 2013 0.277970
# 2014 0.381762
# 2015 0.631186
The same works with ordinary named functions. For example, let's run an ordinary least squares (OLS) linear regression for each group.
import statsmodels.api as sm

# Define your own function for linear regression
def regression(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.  # add a constant term
    result = sm.OLS(Y, X).fit()  # OLS from the statsmodels econometrics library
    return result.params  # return the fitted coefficients

# Apply the linear regression function to the grouped returns
result3 = by_year.apply(regression, 'DTS', ['NTT DATA'])
print(result3)
#=>     NTT DATA  intercept
# 2010 0.313685 0.000773
# 2011 0.509025 -0.000057
# 2012 0.360677 0.000705
# 2013 0.238903 0.002063
# 2014 0.395362 0.001214
# 2015 0.418843 -0.002459
Only half a month of 2015 has passed, so the 2015 figures don't mean much yet, but at least we now have results for each year. Applying functions to grouped data like this lets you explore an analysis from many angles, which is very convenient.
Being able to pass your own function to the apply method opens up many possibilities. The applied function can be anything the analyst cares to write, as long as it follows one rule: it must return a pandas object or a scalar value.
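As a small sketch of that rule (both lambdas below are my own illustrations, not from the source): a function returning a scalar gives a Series indexed by group, while a function returning a DataFrame gives the group pieces concatenated back together.

# Returning a scalar: one value per group (a Series indexed by year)
print(by_year.apply(lambda g: g['DTS'].std()))

# Returning a DataFrame: the per-group results are concatenated back together
print(by_year.apply(lambda g: g.nlargest(3, 'DTS')))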
The source code for this article is here.
Introduction to Data Analysis with Python: Data Processing Using NumPy and pandas (O'Reilly Japan) http://www.oreilly.co.jp/books/9784873116556/