I wrote this article motivated by the question, "How do I create a feature that aggregates certain columns every N days with pandas?"
The sample code is an excerpt from the Google Colab notebook linked below. GoogleColab --pandas x groupby x Grouper sample
The data used is this dataset (kaggle Datasets --Bitcoin Historical Data).
First comes the data preprocessing: the Timestamp column is converted to datetime type for the processing that follows.
load_dataset
import pandas as pd
btc = pd.read_csv("/content/bitstampUSD_1-min_data_2012-01-01_to_2019-08-12.csv")
# Forward-fill missing values in every column
btc[btc.columns.values] = btc[btc.columns.values].ffill()
# https://stackoverflow.com/questions/19231871/convert-unix-time-to-readable-date-in-pandas-dataframe
btc["Timestamp"] = pd.to_datetime(btc["Timestamp"], unit='s')
btc.head()
"""
Timestamp Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price
0 2011-12-31 07:52:00 4.39 4.39 4.39 4.39 0.455581 2.0 4.39
1 2011-12-31 07:53:00 4.39 4.39 4.39 4.39 0.455581 2.0 4.39
2 2011-12-31 07:54:00 4.39 4.39 4.39 4.39 0.455581 2.0 4.39
3 2011-12-31 07:55:00 4.39 4.39 4.39 4.39 0.455581 2.0 4.39
4 2011-12-31 07:56:00 4.39 4.39 4.39 4.39 0.455581 2.0 4.39
"""
The main task is **the process of creating features by aggregating certain columns every N days**, and as the title suggests, **pandas x groupby x Grouper** is what gets it done. groupby is commonly used for aggregation, but the key point here is using Grouper, because we want to group **every N days**, i.e., at a fixed cycle/interval. Since we are grouping by time, specifying the Timestamp column as the Grouper key with freq='D' completes the daily aggregation.
groupby_grouper
# Setting freq='D' aggregates on a daily basis.
btc.groupby(pd.Grouper(key='Timestamp', freq='D')).mean()
"""
Open High Low Close Volume_(BTC) Volume_(Currency) Weighted_Price
Timestamp
2011-12-31 4.472552 4.472624 4.472552 4.472624 7.590527 34.069690 4.472579
2012-01-01 4.680778 4.680778 4.680778 4.680778 5.861697 27.856539 4.680778
2012-01-02 5.000000 5.000000 5.000000 5.000000 11.566478 57.832389 5.000000
2012-01-03 5.145917 5.145917 5.145917 5.145917 13.506125 68.725391 5.145917
2012-01-04 5.176708 5.228729 5.176708 5.228729 17.221039 91.624089 5.195443
... ... ... ... ... ... ... ...
2019-08-08 11802.176188 11809.075389 11794.541660 11802.534583 7.059914 83271.678234 11801.428490
2019-08-09 11820.325465 11826.872431 11813.188132 11820.873674 5.099464 60251.317095 11819.498275
2019-08-10 11582.574299 11587.521007 11576.103201 11582.249007 5.219845 60211.963986 11581.356693
2019-08-11 11378.198736 11382.765819 11373.714118 11379.078882 3.062079 34822.593814 11378.195194
2019-08-12 11527.440000 11551.570000 11520.000000 11520.000000 23.805939 274731.256920 11540.450291
2782 rows × 7 columns
"""
That's all.