pandas is a go-to data processing library for data scientists, but you need to be careful to avoid performance pitfalls when working with millions of rows. In this article I would like to introduce some techniques I have picked up in my work.
DataFrame
Since pandas uses a column-oriented data structure, it excels at processing data column by column. It is recommended to build a DataFrame in a "one record per row" format, storing each measurable value of a record (humidity, price, coordinates, etc.) in its own column.
However, when processing huge amounts of data, looping over the rows one by one degrades performance significantly. In this article I would like to introduce performance-conscious ways of handling the data; a short sketch after the sample table illustrates the difference.
First, make a sample table.
import pandas as pd

data = {'Team': ['Tiger', 'Tiger', 'Rabbit', 'Rabbit', 'Cat',
                 'Cat', 'Cat', 'Cat', 'Tiger', 'Human', 'Human', 'Tiger'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
print(df)
'''
Team Rank Year Points
0 Tiger 1 2014 876
1 Tiger 2 2015 789
2 Rabbit 2 2014 863
3 Rabbit 3 2015 673
4 Cat 3 2014 741
5 Cat 4 2015 812
6 Cat 1 2016 756
7 Cat 1 2017 788
8 Tiger 2 2016 694
9 Human 4 2014 701
10 Human 1 2015 804
11 Tiger 2 2017 690
'''
print(df.columns) #Get column label
'''
Index(['Team', 'Rank', 'Year', 'Points'], dtype='object')
'''
print(df.index) #Get row index
'''
RangeIndex(start=0, stop=12, step=1)
'''
A DataFrame has special lists called indexes. In the example above, the column index labels each column as ['Team', 'Rank', 'Year', 'Points'], and the numbers 0, 1, 2, 3, ... on the left form the row index that identifies each row.
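To illustrate the performance point from the introduction, here is a minimal sketch of my own (the PointsPerRank column is hypothetical) comparing a row-by-row loop with the equivalent vectorized column operation; it works on a copy so the sample table used below stays unchanged. On millions of rows the vectorized form is typically orders of magnitude faster.
tmp = df.copy()

# Slow: build a new column by looping over the rows with iterrows()
values = []
for _, row in tmp.iterrows():
    values.append(row['Points'] / row['Rank'])
tmp['PointsPerRank'] = values

# Fast: the same result as a single vectorized column operation
tmp['PointsPerRank'] = tmp['Points'] / tmp['Rank']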
groupby()
groupby() is used when you want to gather rows that share the same value and apply a common operation to each group. It has become a staple of data processing, but it can be confusing at first because it has so many uses.
Let's look at an example. Passing a column name to the groupby() method returns a GroupBy object.
# 1: group by a single label
df_g = df.groupby(by=['Team'])
print(df_g.groups)
'''
{'Cat': Int64Index([4, 5, 6, 7], dtype='int64'),
'Human': Int64Index([9, 10], dtype='int64'),
'Rabbit': Int64Index([2, 3], dtype='int64'),
'Tiger': Int64Index([0, 1, 8, 11], dtype='int64')}
'''
# 2: group by multiple labels
df_g = df.groupby(by=['Team', 'Year'])
print(df_g.groups)
'''
{('Cat', 2014): Int64Index([4], dtype='int64'),
('Cat', 2015): Int64Index([5], dtype='int64'),
('Cat', 2016): Int64Index([6], dtype='int64'),
('Cat', 2017): Int64Index([7], dtype='int64'),
('Human', 2014): Int64Index([9], dtype='int64'),
('Human', 2015): Int64Index([10], dtype='int64'),
('Rabbit', 2014): Int64Index([2], dtype='int64'),
('Rabbit', 2015): Int64Index([3], dtype='int64'),
('Tiger', 2014): Int64Index([0], dtype='int64'),
('Tiger', 2015): Int64Index([1], dtype='int64'),
('Tiger', 2016): Int64Index([8], dtype='int64'),
('Tiger', 2017): Int64Index([11], dtype='int64')}
'''
In this way, you can see which rows belong to which group, in the form {group key: [row index, row index, ...]}; for each key, the row indexes of the rows that share that value are stored.
By the way, to get the data of a single group, pass its key to get_group().
df_oneGroup = df_g.get_group('Rabbit')
print(df_oneGroup)
'''
Team Rank Year Points
2 Rabbit 2 2014 863
3 Rabbit 3 2015 673
'''
df_oneGroup = df_g.get_group(('Cat', 2014))
print(df_oneGroup)
'''
Team Rank Year Points
4 Cat 3 2014 741
'''
In practice, get_group() is rarely used for anything other than debugging, but the GroupBy object lets you perform all sorts of operations per group. For example, call the mean() method to average the yearly Rank and Points for each team. Many other methods are provided, such as sum() and mode(). By the way, as_index=False replaces the group labels in the result with the default index [0, 1, 2, ..., n].
df_mean = df.groupby(by=['Team', 'Year']).mean()
print(df_mean)
'''
Rank Points
Team Year
Cat 2014 3 741
2015 4 812
2016 1 756
2017 1 788
Human 2014 4 701
2015 1 804
Rabbit 2014 2 863
2015 3 673
Tiger 2014 1 876
2015 2 789
2016 2 694
2017 2 690
'''
df_mean = df.groupby(by=['Team', 'Year'], as_index=False).mean()
print(df_mean)
'''
Team Year Rank Points
0 Cat 2014 3 741
1 Cat 2015 4 812
2 Cat 2016 1 756
3 Cat 2017 1 788
4 Human 2014 4 701
5 Human 2015 1 804
6 Rabbit 2014 2 863
7 Rabbit 2015 3 673
8 Tiger 2014 1 876
9 Tiger 2015 2 789
10 Tiger 2016 2 694
11 Tiger 2017 2 690
'''
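The other methods mentioned above work the same way. As a quick sketch of my own (not part of the original post), sum() gives the total Points per team:
print(df.groupby('Team')['Points'].sum())
'''
Team
Cat       3097
Human     1505
Rabbit    1536
Tiger     3049
Name: Points, dtype: int64
'''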
Like GroupBy.mean() above, these methods give you one value per group, but if you want to apply a different aggregation to each column, use agg() (aggregation). The functions used for aggregation can be specified as strings, NumPy functions, user-defined functions, or lambda expressions. To use agg(), define the mapping in a dict and pass it as follows.
'''
location col1 col2 col3 col4
0 a True 2 1 4
1 a False 6 2 6
2 b True 7 6 3
3 b True 3 3 4
4 b False 8 4 6
5 c True 9 57 8
6 d False 1 74 9
'''
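The table above is a new sample whose construction is not shown in the original, so here is a minimal sketch (my own assumption) that builds an equivalent DataFrame named df, along with the NumPy import that the aggregation dict below relies on:
import numpy as np
import pandas as pd

# Reconstructed sample data matching the printed table above
df = pd.DataFrame({'location': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'col1': [True, False, True, True, False, True, False],
                   'col2': [2, 6, 7, 3, 8, 9, 1],
                   'col3': [1, 2, 6, 3, 4, 57, 74],
                   'col4': [4, 6, 3, 4, 6, 8, 9]})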
func_dict = {'col1': lambda x: x.any(),       # whether any value in the group is True (e.g. a missing-value flag)
             'col2': np.mean,                 # mean
             'col3': np.sum,                  # sum
             'col4': lambda S: S.mode()[0]}   # mode (most frequent value)
df_agg = df.groupby('location').agg(func_dict).reset_index()
print(df_agg)
'''
location col1 col2 col3 col4
0 a True 4 3 4
1 b True 6 13 3
2 c True 9 57 8
3 d False 1 74 9
'''
cut()
Next up is cut(), which bins data according to specified boundary values. For example, you can easily divide a whole day into five categories: midnight, morning, noon, afternoon, and night.
prods = pd.DataFrame({'hour': range(0, 24)})
b = [0, 6, 11, 12, 17, 24]
l = ['Midnight', 'a.m.', 'noon', 'afternoon', 'Night']
prods['period'] = pd.cut(prods['hour'], bins=b, labels=l, include_lowest=True)
print(prods)
'''
    hour     period
0      0   Midnight
1      1   Midnight
2      2   Midnight
3      3   Midnight
4      4   Midnight
5      5   Midnight
6      6   Midnight
7      7       a.m.
8      8       a.m.
9      9       a.m.
10    10       a.m.
11    11       a.m.
12    12       noon
13    13  afternoon
14    14  afternoon
15    15  afternoon
16    16  afternoon
17    17  afternoon
18    18      Night
19    19      Night
20    20      Night
21    21      Night
22    22      Night
23    23      Night
'''
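As a quick follow-up of my own (not in the original output, and abridged), value_counts() confirms how many hours fall into each category:
print(prods['period'].value_counts().sort_index())
'''
Midnight     7
a.m.         5
noon         1
afternoon    5
Night        6
'''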
resample()
This time, let's say we want to calculate the cumulative number of records per hour. cumsum() is a good fit here, so first create a num_ride_1h column and fill it with 1. Then group on the timestamp column in one-hour bins with resample(), and finish by calling the cumsum() method on each group. In the code below, n_samples, n_features, start, and end are assumed to have been defined beforehand (the output uses 100 samples and three feature columns), and random_datetimes_or_dates() is a user-defined helper that generates random timestamps.
from sklearn.datasets import make_classification
# make_classification returns (X, y) arrays, so wrap them in a DataFrame
X, y = make_classification(n_samples, n_features, n_informative=n_features, n_redundant=0, n_classes=3, n_clusters_per_class=1)
df_raw = pd.DataFrame(X, columns=[f'var_{i}' for i in range(n_features)])
df_raw['class'] = y
df_raw['timestamp'] = random_datetimes_or_dates(start, end, n=n_samples)  # user-defined helper
df_raw['num_ride_1h'] = 1
print(df_raw)
'''
var_0 var_1 var_2 class timestamp num_ride_1h
0 1.062513 -0.056001 0.761312 0 2020-09-21 00:01:57 1
1 -2.272391 1.307474 -1.276716 0 2020-09-21 00:14:49 1
2 -1.470793 1.245910 -0.708071 2 2020-09-21 00:17:46 1
3 -1.827838 1.365970 -0.933938 0 2020-09-21 00:25:13 1
4 -1.115794 -0.045542 -0.830372 0 2020-09-21 00:31:45 1
.. ... ... ... ... ... ...
95 0.247010 0.903812 0.448323 0 2020-09-21 23:29:25 1
96 -0.665399 1.861112 0.063642 1 2020-09-21 23:32:51 1
97 0.875282 0.028704 0.649306 2 2020-09-21 23:36:21 1
98 2.159065 -1.155290 1.238786 0 2020-09-21 23:37:23 1
99 1.739777 -1.775147 0.748372 2 2020-09-21 23:56:04 1
'''
df_raw['num_ride_1h'] = df_raw.resample('1H', on='timestamp')['num_ride_1h'].cumsum()
'''
var_0 var_1 var_2 class timestamp num_ride_1h
0 -1.331170 -0.274703 0.809738 1 2020-10-11 00:10:54 1
1 -1.373495 -1.067991 1.738302 1 2020-10-11 00:14:24 2
2 -1.471448 0.216404 0.296618 0 2020-10-11 00:43:29 3
3 -2.282394 -1.528916 2.605747 1 2020-10-11 00:48:52 4
4 0.162427 0.524188 -0.663437 2 2020-10-11 00:51:23 5
.. ... ... ... ... ... ...
95 1.197076 0.274294 -0.759543 1 2020-10-11 22:23:50 3
96 -0.459688 0.646523 -0.573518 0 2020-10-11 23:00:20 1
97 0.212496 0.773962 -0.969428 2 2020-10-11 23:11:43 2
98 1.578519 0.496655 -1.156869 1 2020-10-11 23:14:31 3
99 1.318311 -0.324909 -0.114971 0 2020-10-11 23:46:46 4
'''
You can also use pd.Grouper(); both approaches produce the same result.
df_raw['num_ride_1h'] = df_raw.groupby(pd.Grouper(key='timestamp', freq='1h'))['num_ride_1h'].cumsum()
There are several other examples I would like to introduce, but I will end here for now. If I get another opportunity, I will cover data processing centered on time-series data.
**Kuando Co., Ltd.** Kuando Co., Ltd., where I work, is working on upgrading local industries. Please take a look: http://quando.jp/