This time, the word "event data" is used to mean data such as ** (time, event name (, user) (, quantity)) **. When trying to create a user prediction model, you may want to format such event data so that it is "one user, one line".
Make a note of how pandas handles this area.
First, consider the following event data as an example.
import pandas as pd
data = [
['user1', '2015-7-1 10:00:00', 'login'],
['user1', '2015-7-1 10:00:10', 'item'],
['user1', '2015-7-1 10:00:30', 'item'],
['user2', '2015-7-1 10:01:00', 'top'],
['user1', '2015-7-1 10:01:30', 'cart'],
['user2', '2015-7-1 10:01:50', 'login'],
['user2', '2015-7-1 10:02:30', 'logout'],
['user1', '2015-7-2 13:00:00', 'login'],
['user1', '2015-7-2 13:01:00', 'logout'],
['user3', '2015-7-2 13:01:00', 'top'],
['user2', '2015-7-2 13:01:50', 'login'],
['user2', '2015-7-2 13:02:30', 'item'],
['user2', '2015-7-2 13:03:30', 'cart'],
['user2', '2015-7-2 13:03:30', 'history'],
]
df = pd.DataFrame(data)
df.columns = ['user_id', 'time', 'event']
df['time'] = pd.to_datetime(df['time'])
df['dummy'] = 1
df
user_id | time | event | dummy | |
---|---|---|---|---|
0 | user1 | 2015-07-01 10:00:00 | login | 1 |
1 | user1 | 2015-07-01 10:00:10 | item | 1 |
2 | user1 | 2015-07-01 10:00:30 | item | 1 |
3 | user2 | 2015-07-01 10:01:00 | top | 1 |
4 | user1 | 2015-07-01 10:01:30 | cart | 1 |
5 | user2 | 2015-07-01 10:01:50 | login | 1 |
6 | user2 | 2015-07-01 10:02:30 | logout | 1 |
7 | user1 | 2015-07-02 13:00:00 | login | 1 |
8 | user1 | 2015-07-02 13:01:00 | logout | 1 |
9 | user3 | 2015-07-02 13:01:00 | top | 1 |
10 | user2 | 2015-07-02 13:01:50 | login | 1 |
11 | user2 | 2015-07-02 13:02:30 | item | 1 |
12 | user2 | 2015-07-02 13:03:30 | cart | 1 |
13 | user2 | 2015-07-02 13:03:30 | history | 1 |
Suppose you want to format this data as "1 user, 1 row" and "the total number of occurrences for each daily event" as a column.
You can use the DataFrame class pivot_table ()
to do something similar to an Excel PivotTable.
I'd like to aggregate by day, but if I use Grouper ()
, I can group the datetime type columns nicely. How convenient ...
res2 = df.pivot_table(index=['user_id', pd.Grouper(freq='d', key='time')], columns='event', values='dummy', aggfunc=len)
res2
event | cart | history | item | login | logout | top | |
---|---|---|---|---|---|---|---|
user_id | time | ||||||
user1 | 2015-07-01 | 1 | NaN | 2 | 1 | NaN | NaN |
2015-07-02 | NaN | NaN | NaN | 1 | 1 | NaN | |
user2 | 2015-07-01 | NaN | NaN | NaN | 1 | 1 | 1 |
2015-07-02 | 1 | 1 | 1 | 1 | NaN | NaN | |
user3 | 2015-07-02 | NaN | NaN | NaN | NaN | NaN | 1 |
Since Grouper's freq ='d'
, it will be aggregated on a daily basis. If you set freq ='M'
, it will be monthly.
It is good to use unstack ().
res3 = res2.unstack()
res3
event | cart | history | item | login | logout | top | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
time | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 |
user_id | ||||||||||||
user1 | 1 | NaN | NaN | NaN | 2 | NaN | 1 | 1 | NaN | 1 | NaN | NaN |
user2 | NaN | 1 | NaN | 1 | NaN | 1 | 1 | 1 | 1 | NaN | 1 | NaN |
user3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
Use swaplevel (Index1, which you want to replace)
.
Also, I want to operate on the column (axis number 1), so add ʻaxis = 1`.
res4 = res3.swaplevel(0, 1, axis=1)
res4
time | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 | 2015-07-01 | 2015-07-02 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
event | cart | cart | history | history | item | item | login | login | logout | logout | top | top |
user_id | ||||||||||||
user1 | 1 | NaN | NaN | NaN | 2 | NaN | 1 | 1 | NaN | 1 | NaN | NaN |
user2 | NaN | 1 | NaN | 1 | NaN | 1 | 1 | 1 | 1 | NaN | 1 | NaN |
user3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
Use sortlevel ()
.
res4.sortlevel(0, axis=1)
time | 2015-07-01 | 2015-07-02 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
event | cart | history | item | login | logout | top | cart | history | item | login | logout | top |
user_id | ||||||||||||
user1 | 1 | NaN | 2 | 1 | NaN | NaN | NaN | NaN | NaN | 1 | 1 | NaN |
user2 | NaN | NaN | NaN | 1 | 1 | 1 | 1 | 1 | 1 | 1 | NaN | NaN |
user3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
Now you have the desired shape.
There will be a column of "number of event types x number of dates", so if there are too many, it cannot be used. I think that it can be used when the number of event types is not large or when it is aggregated on a monthly basis.
Anyway, pandas is convenient.
Recommended Posts