I wrote this article because there are few recommendation tutorials that actually work through sample data.
There are approaches that build recommendations with machine learning, but this article shows how to build them with a statistics-based method.
I will explain using Python and an open dataset.
This article focuses on the implementation in Python. For the underlying concepts, read the companion article: Recommendation tutorial using association analysis (concept).
The implementation follows the flow of that concept article.
A Google Colaboratory execution environment is available here (for a fee).
Import the required libraries.
#Library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the dataset used in this tutorial, fetching it from a GitHub repository.
#Data set loading
import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/tachiken0210/dataset/master/dataset_cart.csv"

#Function to read a csv file from a URL
def read_csv(url):
    res = urllib.request.urlopen(url)
    res = res.read().decode("utf-8")
    df = pd.read_csv(StringIO(res))
    return df

#Run
df_cart_ori = read_csv(url)
Check the contents of the dataset used this time.
df_cart_ori.head()
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 0 | 108750017 | 583266 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 1 | 108750017 | 662680 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 2 | 108750017 | 664077 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 3 | 108199875 | 661648 | ADD | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 1385478215 |
| 4 | 105031004 | 661231 | ADD | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 1385478215 |
This dataset is log data recorded when customers of an e-commerce (EC) site put products in their carts.

・cart_id: Cart ID associated with the customer (you can think of it as a customer ID, so below it is referred to as the customer)
・goods_id: Product ID
・action: Customer action (ADD: add to cart, DEL: delete from cart, etc.)
・create_at: Time the log was created
・update_at: Time the log was updated (not used this time)
・last_update: Time the log was last updated (not used this time)
・time: Timestamp

Note that open datasets in general, including this one, rarely contain concrete names such as product names; products are identified only by their IDs. This makes it hard to tell what kind of product each item actually is, but that is simply a property of the data.
Next, let's look at the contents of this data. Since the dataset is a chronological log covering all customers, let's focus on a single customer. First, extract the customer with the largest number of logs.
df_cart_ori["cart_id"].value_counts().head()
#output
110728794 475
106932411 426
115973611 264
109269739 205
112332751 197
Name: cart_id, dtype: int64
Customer 110728794 has 475 logs, the most in the data. Let's extract this customer's log.
df_cart_ori[df_cart_ori["cart_id"]==110728794].head(10)
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 32580 | 110728794 | 664457 | ADD | 2013-11-26 22:54:13 | 2013-11-26 22:54:13 | 2013-11-26 22:54:13 | 1385478215 |
| 32619 | 110728794 | 664885 | ADD | 2013-11-26 22:55:09 | 2013-11-26 22:55:09 | 2013-11-26 22:55:09 | 1385478215 |
| 33047 | 110728794 | 664937 | ADD | 2013-11-26 22:58:52 | 2013-11-26 22:58:52 | 2013-11-26 22:58:52 | 1385478215 |
| 33095 | 110728794 | 664701 | ADD | 2013-11-26 23:00:25 | 2013-11-26 23:00:25 | 2013-11-26 23:00:25 | 1385478215 |
| 34367 | 110728794 | 665050 | ADD | 2013-11-26 23:02:40 | 2013-11-26 23:02:40 | 2013-11-26 23:02:40 | 1385478215 |
| 34456 | 110728794 | 664989 | ADD | 2013-11-26 23:05:03 | 2013-11-26 23:05:03 | 2013-11-26 23:05:03 | 1385478215 |
| 34653 | 110728794 | 664995 | ADD | 2013-11-26 23:07:00 | 2013-11-26 23:07:00 | 2013-11-26 23:07:00 | 1385478215 |
| 34741 | 110728794 | 664961 | ADD | 2013-11-26 23:09:41 | 2013-11-26 23:09:41 | 2013-11-26 23:09:41 | 1385478215 |
| 296473 | 110728794 | 665412 | DEL | 2013-12-03 17:17:30 | 2013-12-03 17:17:30 | 2013-12-03 07:41:13 | 1386083014 |
| 296476 | 110728794 | 665480 | DEL | 2013-12-03 17:17:37 | 2013-12-03 17:17:37 | 2013-12-03 07:42:29 | 1386083014 |
Looking at this customer, you can see that they keep adding products to their cart throughout the day.
Analyzing other customers in the same way, it should become clear that, in general, some product y tends to be added to the cart right after some product x.
So the purpose this time is to extract from this dataset the pattern "product y is likely to be added to the cart right after product x."
We will start processing the data next, so keep this goal in the back of your mind.
From here, we preprocess the data. First, the "action" column contains several values, such as ADD and DEL. For now we focus only on the ADD (added to cart) logs, since they reflect the customers' demand for products.
df = df_cart_ori.copy()
df = df[df["action"]=="ADD"]
df.head()
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 3 | 108199875 | 661648 | ADD | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 1385478215 |
| 4 | 105031004 | 661231 | ADD | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 1385478215 |
| 6 | 110388732 | 661534 | ADD | 2013-11-26 03:11:55 | 2013-11-26 03:11:55 | 2013-11-26 03:11:55 | 1385478215 |
| 7 | 110388740 | 662336 | ADD | 2013-11-26 03:11:58 | 2013-11-26 03:11:58 | 2013-11-26 03:11:58 | 1385478215 |
| 8 | 110293997 | 661648 | ADD | 2013-11-26 03:12:13 | 2013-11-26 03:12:13 | 2013-11-26 03:12:13 | 1385478215 |
Next, sort the dataset so that, for each customer, the rows appear in the chronological order the products were added to the cart.
df = df[["cart_id","goods_id","create_at"]]
df = df.sort_values(["cart_id","create_at"],ascending=[True,True])
df.head()
|    | cart_id | goods_id | create_at |
|---|---|---|---|
| 548166 | 78496306 | 661142 | 2013-11-15 23:07:02 |
| 517601 | 79100564 | 662760 | 2013-11-24 18:17:24 |
| 517404 | 79100564 | 661093 | 2013-11-24 18:25:29 |
| 23762 | 79100564 | 664856 | 2013-11-26 13:41:47 |
| 22308 | 79100564 | 562296 | 2013-11-26 13:44:20 |
From here, we format the data to fit the association analysis function used below.
For each customer, we collect the products added to the cart into an ordered list.
df = df[["cart_id","goods_id"]]
df["goods_id"] = df["goods_id"].astype(int).astype(str)
df = df.groupby(["cart_id"])["goods_id"].apply(lambda x:list(x)).reset_index()
df.head()
|    | cart_id | goods_id |
|---|---|---|
| 0 | 78496306 | ['661142'] |
| 1 | 79100564 | ['662760', '661093', '664856', '562296', '663364', '664963', '664475', '583266'] |
| 2 | 79455669 | ['663801', '664136', '664937', '663932', '538673', '663902', '667859'] |
| 3 | 81390353 | ['663132', '661725', '664236', '663331', '659679', '663847', '662340', '662292', '664099', '664165', '663581', '665426', '663899', '663405'] |
| 4 | 81932021 | ['662282', '664218'] |
Next, let's check the contents of this data frame, focusing on the customer in row 2 (79455669).
The first product, 663801, was followed by 664136 in the cart. That is, X = "663801" and Y = "664136".
Likewise, 664937 was added after 664136, giving X = "664136" and Y = "664937".
By lining up these XY pairs for every customer, we can perform an association analysis.
Now let's format the data into such XY pairs.
#For each customer's ordered list, make pairs of consecutive products
def get_combination(l):
    length = len(l)
    list_output = []
    for i in range(length-1):
        #Skip the pair when the same product appears twice in a row
        if l[i]==l[i+1]:
            pass
        else:
            list_pair = [l[i], l[i+1]]
            list_output.append(list_pair)
    return list_output
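As a quick check, here is what get_combination returns for a short, hypothetical input list; note that consecutive duplicates are skipped:

#Quick check with a hypothetical list
print(get_combination(["A", "B", "B", "C"]))
#output
[['A', 'B'], ['B', 'C']]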
df["comb_goods_id"] = df["goods_id"].apply(lambda x:get_combination(x))
#Flatten each customer's pairs into one list of association inputs
dataset = []
for contents in df["comb_goods_id"].values:
    for c in contents:
        dataset.append(c)
print("The number of XY pairs is", len(dataset))
print("Contents of XY pair", dataset[:5])
#output
The number of XY pairs is 141956
Contents of XY pair [['662760', '661093'], ['661093', '664856'], ['664856', '562296'], ['562296', '663364'], ['663364', '664963']]
We use this list as the input for the association analysis. I have written a function, association, that performs the analysis, so let's use it.
#Association analysis function
def association(dataset):
    df = pd.DataFrame(dataset, columns=["x","y"])
    num_dataset = df.shape[0]
    df["sum_count_xy"] = 1
    print("calculating support....")
    #Support of x and of y: the share of all pairs that contain each product
    df_a_support = (df.groupby("x")[["sum_count_xy"]].sum()/num_dataset).rename(columns={"sum_count_xy":"support_x"})
    df_b_support = (df.groupby("y")[["sum_count_xy"]].sum()/num_dataset).rename(columns={"sum_count_xy":"support_y"})
    #Count each (x, y) pair and compute its support
    df = df.groupby(["x","y"]).sum()
    df["support_xy"] = df["sum_count_xy"]/num_dataset
    df = df.reset_index()
    df = pd.merge(df, df_a_support, on="x")
    df = pd.merge(df, df_b_support, on="y")
    print("calculating confidence....")
    #Total number of pairs in which each x (and each y) appears
    df_temp = df.groupby("x")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy":"sum_count_x"})
    df = pd.merge(df, df_temp, on="x")
    df_temp = df.groupby("y")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy":"sum_count_y"})
    df = pd.merge(df, df_temp, on="y")
    #Confidence: how often y follows, given that x appeared
    df["confidence"] = df["sum_count_xy"]/df["sum_count_x"]
    print("calculating lift....")
    #Lift: confidence relative to the baseline frequency of y
    df["lift"] = df["confidence"]/df["support_y"]
    df["sum_count"] = num_dataset
    return df
#Run the association analysis on the dataset
df_output = association(dataset)
df_output.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 485836 | 662615 | 1 | 7.04444e-06 | 7.04444e-06 | 0.000147933 | 1 | 21 | 1 | 6759.81 | 141956 |
| 1 | 549376 | 662615 | 1 | 7.04444e-06 | 2.11333e-05 | 0.000147933 | 3 | 21 | 0.333333 | 2253.27 | 141956 |
| 2 | 654700 | 662615 | 1 | 7.04444e-06 | 0.000464933 | 0.000147933 | 66 | 21 | 0.0151515 | 102.421 | 141956 |
| 3 | 661475 | 662615 | 1 | 7.04444e-06 | 0.000965088 | 0.000147933 | 137 | 21 | 0.00729927 | 49.3417 | 141956 |
| 4 | 661558 | 662615 | 1 | 7.04444e-06 | 0.000408577 | 0.000147933 | 58 | 21 | 0.0172414 | 116.548 | 141956 |
We have the result. The contents of each column are as follows:

・x: Condition part X
・y: Conclusion part Y
・sum_count_xy: Number of records matching XY
・support_xy: Support of XY
・support_x: Support of X
・support_y: Support of Y
・sum_count_x: Number of records matching X
・sum_count_y: Number of records matching Y
・confidence: Confidence
・lift: Lift
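These columns satisfy the relations support_xy = sum_count_xy / sum_count, confidence = sum_count_xy / sum_count_x, and lift = confidence / support_y. As a minimal sanity check of these definitions on the first output row:

#Verify the metric definitions on the first output row
row = df_output.iloc[0]
assert np.isclose(row["support_xy"], row["sum_count_xy"] / row["sum_count"])
assert np.isclose(row["confidence"], row["sum_count_xy"] / row["sum_count_x"])
assert np.isclose(row["lift"], row["confidence"] / row["support_y"])
#For x=485836: confidence = 1/1 = 1, so lift = 1 / 0.000147933 = 6759.81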
The association analysis itself is now done, but there is one caveat.
A high lift value generally indicates high relevance, but when the number of observations is small, the result may simply be a coincidence.
Take the top row, for example: x = "485836", y = "662615".
Its lift is very high at about 6760, but this pair occurred only once (sum_count_xy = 1). With so little data, it could easily have happened by chance.
So how do we derive a threshold on support_xy that separates chance from real patterns?
In practice, deriving such a threshold is difficult; there is no single correct answer.
Here, let's first look at the histogram of the co-occurrence counts (sum_count_xy).
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(df_output["sum_count_xy"], bins=100)
ax.set_title('')
ax.set_xlabel('xy')
ax.set_ylabel('freq')
#It's hard to see, so limit the range of the y-axis
ax.set_ylim(0,500)
plt.show()
You can see that most of the data is concentrated near 0 on the x-axis.
In this tutorial, everything near 0 is regarded as a coincidence and is removed.
We set the threshold with a quantile on support_xy: the lower 98% of the data is treated as coincidence and dropped.
df = df_output.copy()
df = df[df["support_xy"]>=df["support_xy"].quantile(0.98)]
df.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | 662193 | 667129 | 8 | 5.63555e-05 | 0.0036138 | 0.00281073 | 513 | 399 | 0.0155945 | 5.54822 | 141956 |
| 59 | 665672 | 667129 | 12 | 8.45332e-05 | 0.00395193 | 0.00281073 | 561 | 399 | 0.0213904 | 7.61026 | 141956 |
| 60 | 666435 | 667129 | 30 | 0.000211333 | 0.0082279 | 0.00281073 | 1168 | 399 | 0.0256849 | 9.13817 | 141956 |
| 62 | 666590 | 667129 | 7 | 4.93111e-05 | 0.00421257 | 0.00281073 | 598 | 399 | 0.0117057 | 4.16464 | 141956 |
| 63 | 666856 | 667129 | 8 | 5.63555e-05 | 0.00390966 | 0.00281073 | 555 | 399 | 0.0144144 | 5.12835 | 141956 |
This removes the XY pairs that we regard as having occurred by chance.
Then the last step: keep only the y that are strongly related to each x. As explained earlier, this means keeping only the XY pairs with high lift values.
Here, we keep only the pairs with a lift of 2.0 or higher.
To give an intuition for lift: it expresses how much more likely product y is to be added to the cart once product x has been added, compared to how often y is added in general.
For example, in the data frame above, when product "662193" (x) is added to the cart, product "667129" (y) becomes about 5.5 times more likely to be added than usual.
df = df[df["lift"]>=2.0]
df_recommendation = df.copy()
df_recommendation.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | 662193 | 667129 | 8 | 5.63555e-05 | 0.0036138 | 0.00281073 | 513 | 399 | 0.0155945 | 5.54822 | 141956 |
| 59 | 665672 | 667129 | 12 | 8.45332e-05 | 0.00395193 | 0.00281073 | 561 | 399 | 0.0213904 | 7.61026 | 141956 |
| 60 | 666435 | 667129 | 30 | 0.000211333 | 0.0082279 | 0.00281073 | 1168 | 399 | 0.0256849 | 9.13817 | 141956 |
| 62 | 666590 | 667129 | 7 | 4.93111e-05 | 0.00421257 | 0.00281073 | 598 | 399 | 0.0117057 | 4.16464 | 141956 |
| 63 | 666856 | 667129 | 8 | 5.63555e-05 | 0.00390966 | 0.00281073 | 555 | 399 | 0.0144144 | 5.12835 | 141956 |
This completes the generation of the basic recommendation data.
Its concrete use is simple: for example, promote product "667129" to a customer who has added product "662193" to their cart.
In practice, this recommendation data is stored in a database so that, given an input x, the corresponding y is returned.
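As a minimal in-memory sketch of that lookup (the name recommend_table is just for illustration), each x maps to its y candidates sorted in descending order of lift:

#Lookup table: for each input product x, the recommended products y by descending lift
recommend_table = (
    df_recommendation.sort_values("lift", ascending=False)
    .groupby("x")["y"].apply(list)
    .to_dict()
)
#Example: candidates for a customer who added product "662193" to the cart
print(recommend_table["662193"])

Based on the table above, this output should include "667129".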
Now that we have created the recommendation data, let me be upfront about a problem with this kind of recommendation:
it cannot cover all input products (x).
Let's check this.
First, let's check how many types of x are in the original transaction data.
#x corresponds to goods_id in the original data, so count its unique values
df_cart_ori["goods_id"].nunique()
#output
8398
Originally, there are 8398 types of x.
In other words, we should be able to recommend highly relevant products (y) for these 8398 types of inputs (x).
Now let's check how many inputs (x) the earlier output actually covers.
#Count the unique values of x in the recommendation result
df_recommendation["x"].nunique()
#output
463
The recommendation data above supports only 463 types of input (x): 463 out of 8398, or about 5.5%.
This is one weakness of recommendations based on user behavior: products that rarely co-occur in the logs get no recommendations at all.
In practice, this coverage gap will likely need to be addressed.
One possible countermeasure, sketched below, is to fall back to another source of recommendations (such as overall popularity) when x has no association data.
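As an illustration only (not part of the original analysis), a minimal popularity fallback could look like this, assuming the recommend_table dictionary from the sketch above:

#Hypothetical fallback: the most frequently added products overall
popular_goods = (
    df_cart_ori[df_cart_ori["action"] == "ADD"]["goods_id"]
    .astype(int).astype(str)
    .value_counts().head(5).index.tolist()
)

#Return association-based recommendations when available, otherwise popular products
def recommend(x):
    return recommend_table.get(x, popular_goods)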
With association analysis, it is relatively easy to create recommendation data once transaction data has accumulated. If you are building a recommendation engine in your own work, please give it a try!