Recommendation tutorial using association analysis (python implementation)

About this article

I wrote this article because there are few recommendation tutorials that walk through an implementation on sample data.
Recommendations can also be built with machine learning, but this article shows how to build one with a statistics-based method (association analysis).
The explanation uses Python and an open dataset.

This article covers the Python implementation. For the underlying concepts, see the companion article: Recommendation tutorial using association analysis (concept)

The implementation follows the flow of the concept article.

If you want to try this implementation without building an environment yourself

A Google Colaboratory execution environment is available here (for a fee).

** Import required libraries **

Import the required libraries.

#Library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

** Dataset loading **

Load the dataset used in this tutorial. It is fetched from a GitHub repository.

#Data set loading
import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/tachiken0210/dataset/master/dataset_cart.csv"

#Function to read a csv file from a URL
def read_csv(url):
    res = urllib.request.urlopen(url)
    res = res.read().decode("utf-8")
    df = pd.read_csv(StringIO(res))
    return df

#Run
df_cart_ori = read_csv(url)
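
As a side note, recent versions of pandas can also read a CSV directly from a URL, so the helper above could be replaced with a one-liner if you prefer (this is just an alternative; the helper is kept for clarity):

#Alternative: pandas can read the CSV straight from the URL
df_cart_ori = pd.read_csv(url)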

** Check the contents of the dataset **

Check the contents of the dataset used this time.

df_cart_ori.head()
cart_id goods_id action create_at update_at last_update time
0 108750017 583266 UPD 2013-11-26 03:11:06 2013-11-26 03:11:06 2013-11-26 03:11:06 1385478215
1 108750017 662680 UPD 2013-11-26 03:11:06 2013-11-26 03:11:06 2013-11-26 03:11:06 1385478215
2 108750017 664077 UPD 2013-11-26 03:11:06 2013-11-26 03:11:06 2013-11-26 03:11:06 1385478215
3 108199875 661648 ADD 2013-11-26 03:11:10 2013-11-26 03:11:10 2013-11-26 03:11:10 1385478215
4 105031004 661231 ADD 2013-11-26 03:11:41 2013-11-26 03:11:41 2013-11-26 03:11:41 1385478215

** About the dataset used in this tutorial **

This dataset is log data generated when customers on an e-commerce (EC) site add products to their cart.

** ・ cart_id: Cart ID tied to a customer (you can think of it as a customer ID, so it is referred to as the customer below)
・ goods_id: Product ID
・ action: Customer action (ADD: added to cart, DEL: removed from cart, etc.)
・ create_at: Time the log was created
・ update_at: Time the log was updated (not used this time)
・ last_update: Time the log was last updated (not used this time)
・ time: Timestamp **

Note that open datasets in general, including the one used here, rarely contain human-readable information such as product names; products are identified only by their IDs. This makes it hard to tell what kind of product each ID refers to, but that is simply a property of the data.

Next, let's take a look at the contents of this data. This dataset contains data for all customers as a chronological log, so let's focus on just one customer. First, let's extract the customer with the largest number of logs in the data.

df_cart_ori["cart_id"].value_counts().head()
#output
110728794    475
106932411    426
115973611    264
109269739    205
112332751    197
Name: cart_id, dtype: int64

Customer 110728794 has 475 logs, the most in the data. Let's extract this customer's log.

df_cart_ori[df_cart_ori["cart_id"]==110728794].head(10)
cart_id goods_id action create_at update_at last_update time
32580 110728794 664457 ADD 2013-11-26 22:54:13 2013-11-26 22:54:13 2013-11-26 22:54:13 1385478215
32619 110728794 664885 ADD 2013-11-26 22:55:09 2013-11-26 22:55:09 2013-11-26 22:55:09 1385478215
33047 110728794 664937 ADD 2013-11-26 22:58:52 2013-11-26 22:58:52 2013-11-26 22:58:52 1385478215
33095 110728794 664701 ADD 2013-11-26 23:00:25 2013-11-26 23:00:25 2013-11-26 23:00:25 1385478215
34367 110728794 665050 ADD 2013-11-26 23:02:40 2013-11-26 23:02:40 2013-11-26 23:02:40 1385478215
34456 110728794 664989 ADD 2013-11-26 23:05:03 2013-11-26 23:05:03 2013-11-26 23:05:03 1385478215
34653 110728794 664995 ADD 2013-11-26 23:07:00 2013-11-26 23:07:00 2013-11-26 23:07:00 1385478215
34741 110728794 664961 ADD 2013-11-26 23:09:41 2013-11-26 23:09:41 2013-11-26 23:09:41 1385478215
296473 110728794 665412 DEL 2013-12-03 17:17:30 2013-12-03 17:17:30 2013-12-03 07:41:13 1386083014
296476 110728794 665480 DEL 2013-12-03 17:17:37 2013-12-03 17:17:37 2013-12-03 07:42:29 1386083014

Looking at this customer, you can see that they keep adding products to their cart throughout the day. If we analyze other customers in the same way, we should be able to see which product y tends to be added to the cart right after a given product x.
In other words, the goal of this tutorial is to extract the pattern ** "product y is likely to be added to the cart right after a given product x" ** from this dataset.

We will start processing the data next; keep this goal in mind.

** Data preprocessing **

From here we preprocess the data. If you look at the "action" column, there are several values such as ADD and DEL. For now we keep only the ADD (added to cart) logs, that is, we focus solely on the customer's demand for products.

df = df_cart_ori.copy()
df = df[df["action"]=="ADD"]
df.head()
cart_id goods_id action create_at update_at last_update time
3 108199875 661648 ADD 2013-11-26 03:11:10 2013-11-26 03:11:10 2013-11-26 03:11:10 1385478215
4 105031004 661231 ADD 2013-11-26 03:11:41 2013-11-26 03:11:41 2013-11-26 03:11:41 1385478215
6 110388732 661534 ADD 2013-11-26 03:11:55 2013-11-26 03:11:55 2013-11-26 03:11:55 1385478215
7 110388740 662336 ADD 2013-11-26 03:11:58 2013-11-26 03:11:58 2013-11-26 03:11:58 1385478215
8 110293997 661648 ADD 2013-11-26 03:12:13 2013-11-26 03:12:13 2013-11-26 03:12:13 1385478215

Next, sort the dataset so that, for each customer, the rows appear in the order the products were added to the cart.

df = df[["cart_id","goods_id","create_at"]]
df = df.sort_values(["cart_id","create_at"],ascending=[True,True])
df.head()
cart_id goods_id create_at
548166 78496306 661142 2013-11-15 23:07:02
517601 79100564 662760 2013-11-24 18:17:24
517404 79100564 661093 2013-11-24 18:25:29
23762 79100564 664856 2013-11-26 13:41:47
22308 79100564 562296 2013-11-26 13:44:20

From here, we reshape the data into the format expected by the association analysis step.

For each customer, we collect the items added to the cart into a list, in order.

df = df[["cart_id","goods_id"]]
df["goods_id"] = df["goods_id"].astype(int).astype(str)
df = df.groupby(["cart_id"])["goods_id"].apply(lambda x:list(x)).reset_index()
df.head()
|    |   cart_id | goods_id                                                                                                                                    |
|---:|----------:|:--------------------------------------------------------------------------------------------------------------------------------------------|
|  0 |  78496306 | ['661142']                                                                                                                                    |
|  1 |  79100564 | ['662760', '661093', '664856', '562296', '663364', '664963', '664475', '583266']                                                              |
|  2 |  79455669 | ['663801', '664136', '664937', '663932', '538673', '663902', '667859']                                                                        |
|  3 |  81390353 | ['663132', '661725', '664236', '663331', '659679', '663847', '662340', '662292', '664099', '664165', '663581', '665426', '663899', '663405'] |
|  4 |  81932021 | ['662282', '664218']                                                                                                                          |
Next, let's check the contents of the above data frame.
Notice the customer in the third row (79455669).
After the first item, 663801, the item 664136 was put in the cart; that is, X = "663801" and Y = "664136". Likewise, after 664136, 664937 was added to the cart, so X = "664136" and Y = "664937". By lining up these X-Y pairs for every customer, we can perform association analysis.

Now let's format the data so that this XY is paired.

#Create consecutive [X, Y] pairs from an ordered list of items
def get_combination(l):
    length = len(l)
    list_output = []
    for i in range(length-1):
        #Skip the pair when the same item appears twice in a row
        if l[i]==l[i+1]:
            pass
        else:
            list_pair = [l[i], l[i+1]]
            list_output.append(list_pair)
    return list_output

df["comb_goods_id"] = df["goods_id"].apply(lambda x:get_combination(x))

#Flatten the per-customer pair lists into one list of [X, Y] pairs
dataset = []
for contents in df["comb_goods_id"].values:
    for c in contents:
        dataset.append(c)

print("The number of XY pairs is",len(dataset))
print("Contents of XY pair",dataset[:5])
#output
The number of XY pairs is 141956
Contents of XY pair[['662760', '661093'], ['661093', '664856'], ['664856', '562296'], ['562296', '663364'], ['663364', '664963']]
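
As a quick check of what get_combination produces, here is a toy example (the item IDs are made up for illustration):

#Toy example: consecutive pairs, skipping immediate repeats
sample = ["A", "B", "B", "C"]
print(get_combination(sample))
#Output: [['A', 'B'], ['B', 'C']]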

** Perform Association Analysis **

This list is used as the input for the association analysis. I wrote a function called association that performs the analysis, so we will use it here.

#Association analysis: compute support, confidence and lift for each X-Y pair
def association(dataset):
    df = pd.DataFrame(dataset, columns=["x", "y"])
    num_dataset = df.shape[0]
    df["sum_count_xy"] = 1
    print("calculating support....")
    #Support of x and of y: fraction of all pairs in which each appears
    df_a_support = (df.groupby("x")[["sum_count_xy"]].sum() / num_dataset).rename(columns={"sum_count_xy": "support_x"})
    df_b_support = (df.groupby("y")[["sum_count_xy"]].sum() / num_dataset).rename(columns={"sum_count_xy": "support_y"})
    #Count each distinct x-y pair and compute its support
    df = df.groupby(["x", "y"]).sum()
    df["support_xy"] = df["sum_count_xy"] / num_dataset
    df = df.reset_index()
    df = pd.merge(df, df_a_support, on="x")
    df = pd.merge(df, df_b_support, on="y")
    print("calculating confidence....")
    #Total counts per x and per y, used for confidence
    df_temp = df.groupby("x")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy": "sum_count_x"})
    df = pd.merge(df, df_temp, on="x")
    df_temp = df.groupby("y")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy": "sum_count_y"})
    df = pd.merge(df, df_temp, on="y")
    #Confidence: count(x, y) / count(x)
    df["confidence"] = df["sum_count_xy"] / df["sum_count_x"]
    print("calculating lift....")
    #Lift: confidence divided by the baseline support of y
    df["lift"] = df["confidence"] / df["support_y"]
    df["sum_count"] = num_dataset
    return df
#Run the association analysis on the dataset
df_output = association(dataset)
df_output.head()
x y sum_count_xy support_xy support_x support_y sum_count_x sum_count_y confidence lift sum_count
0 485836 662615 1 7.04444e-06 7.04444e-06 0.000147933 1 21 1 6759.81 141956
1 549376 662615 1 7.04444e-06 2.11333e-05 0.000147933 3 21 0.333333 2253.27 141956
2 654700 662615 1 7.04444e-06 0.000464933 0.000147933 66 21 0.0151515 102.421 141956
3 661475 662615 1 7.04444e-06 0.000965088 0.000147933 137 21 0.00729927 49.3417 141956
4 661558 662615 1 7.04444e-06 0.000408577 0.000147933 58 21 0.0172414 116.548 141956

I got the result. The contents of each column are as follows:
** ・ x: Condition part X
・ y: Conclusion part Y
・ sum_count_xy: Number of records matching the pair X-Y
・ support_xy: Support of X-Y
・ support_x: Support of X
・ support_y: Support of Y
・ sum_count_x: Number of records matching X
・ sum_count_y: Number of records matching Y
・ confidence: Confidence
・ lift: Lift **
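
As a quick sanity check on how these columns relate (using the standard definitions: support_xy = count(x,y)/N, confidence = count(x,y)/count(x), lift = confidence/support_y), here is a minimal sketch that recomputes them for the first row of df_output:

#Recompute the metrics for the first row of df_output as a consistency check
row = df_output.iloc[0]
support_xy = row["sum_count_xy"] / row["sum_count"]    #count(x,y) / N
confidence = row["sum_count_xy"] / row["sum_count_x"]  #count(x,y) / count(x)
lift = confidence / row["support_y"]                   #confidence / support(y)
print(support_xy, confidence, lift)  #should match the support_xy, confidence and lift columns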

** Post-processing of association results **

The association analysis itself is done, but there is one caveat.
** A high lift value generally indicates a strong relationship, but when the underlying count is small, the pair may well have occurred by accident. **
Take the top row, for example: the pair x = "485836", y = "662615".
It has a very high lift of about 6760, but the pair occurred only once (sum_count_xy = 1). With that little data, it is quite possible that the co-occurrence happened by chance.
So how do we derive a support_xy threshold that separates chance from non-chance?

... In reality, it is difficult to derive such a threshold (there is no single correct answer).

Here, let's look at a histogram of the pair counts (sum_count_xy, which is just support_xy scaled by the total number of pairs).

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(df_output["sum_count_xy"], bins=100)
ax.set_title('')
ax.set_xlabel('sum_count_xy')
ax.set_ylabel('freq')
#It's hard to see, so limit the range of the y-axis
ax.set_ylim(0,500)
fig.show()

You can see that most pairs are concentrated near 0 on the x-axis.
In this tutorial, all of these pairs near 0 are regarded as coincidences and removed.
We use a quantile to set the threshold: the lower 98% of pairs by support_xy are treated as coincidences.

df = df_output.copy()
df = df[df["support_xy"]>=df["support_xy"].quantile(0.98)]
df.head()
x y sum_count_xy support_xy support_x support_y sum_count_x sum_count_y confidence lift sum_count
58 662193 667129 8 5.63555e-05 0.0036138 0.00281073 513 399 0.0155945 5.54822 141956
59 665672 667129 12 8.45332e-05 0.00395193 0.00281073 561 399 0.0213904 7.61026 141956
60 666435 667129 30 0.000211333 0.0082279 0.00281073 1168 399 0.0256849 9.13817 141956
62 666590 667129 7 4.93111e-05 0.00421257 0.00281073 598 399 0.0117057 4.16464 141956
63 666856 667129 8 5.63555e-05 0.00390966 0.00281073 555 399 0.0144144 5.12835 141956

This removes the X-Y pairs that we judged to have occurred by accident.
The last step is to keep only the y values that are truly relevant to each x. As explained earlier, this means keeping only the X-Y pairs with high lift values.
Here, we keep only pairs with a lift of 2.0 or higher.

** As an intuition for what lift means: it shows how much more likely product y is to be added to the cart when product x has been added, compared to how often y is added in general. **

For example, in the data frame above, when product "662193" (x) is added to the cart, product "667129" (y) is about ** 5.5 times more likely ** to be added to the cart than usual.

df = df[df["lift"]>=2.0]
df_recommendation = df.copy()
df_recommendation.head()
x y sum_count_xy support_xy support_x support_y sum_count_x sum_count_y confidence lift sum_count
58 662193 667129 8 5.63555e-05 0.0036138 0.00281073 513 399 0.0155945 5.54822 141956
59 665672 667129 12 8.45332e-05 0.00395193 0.00281073 561 399 0.0213904 7.61026 141956
60 666435 667129 30 0.000211333 0.0082279 0.00281073 1168 399 0.0256849 9.13817 141956
62 666590 667129 7 4.93111e-05 0.00421257 0.00281073 598 399 0.0117057 4.16464 141956
63 666856 667129 8 5.63555e-05 0.00390966 0.00281073 555 399 0.0144144 5.12835 141956

** How to use association results (= recommendation data) **

This completes the generation of the basic recommendation data.
Using it is straightforward: for example, promote product "667129" to a customer who has just added product "662193" to the cart.
In practice, this recommendation data is stored in a database so that, given an input x, the corresponding y can be returned.
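
To illustrate that lookup, here is a minimal sketch of a hypothetical helper (the name recommend and the top_n parameter are mine, for illustration only) that returns the y candidates for a given x from df_recommendation, sorted by lift:

#Hypothetical lookup helper: return recommended products (y) for a given product (x)
def recommend(goods_id, top_n=3):
    hits = df_recommendation[df_recommendation["x"] == str(goods_id)]
    hits = hits.sort_values("lift", ascending=False)
    return hits["y"].head(top_n).tolist()

#Example: candidates to promote to a customer who added "662193" to the cart
print(recommend("662193"))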

Problems with this recommendation

Now that the recommendation data has been created, let me also be upfront about its problems.

That is: ** it cannot cover every input product (x). **

Let's check this.
First, let's check how many types of x are in the original transaction data.

#x corresponds to goods_id in the original data, so count its number of unique values
df_cart_ori["goods_id"].nunique()
#output
8398

So there are 8398 distinct values of x in the original data.
In other words, ideally we would be able to recommend highly relevant products (y) for all 8398 possible inputs (x).

Now let's check how much input (x) the previous output corresponds to.

#Count the number of unique x values in the recommendation result
df_recommendation["x"].nunique()
#output
463

The recommendation data created above ** only covers 463 types of input (x). ** (463 / 8398, so about 5.5%)
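
To see exactly which products end up with no recommendation, a simple set difference works (a minimal sketch assuming df_cart_ori and df_recommendation from above):

#Products that appear in the transaction data but have no recommendation rule
all_goods = set(df_cart_ori["goods_id"].astype(int).astype(str))
covered_goods = set(df_recommendation["x"])
uncovered_goods = all_goods - covered_goods
print("Products without a recommendation:", len(uncovered_goods))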

This is one weakness of recommendations based on user behavior, and in practice it will likely need to be addressed.
Several countermeasures can be considered.

Summary

With association analysis, it is relatively easy to create recommendation data as long as transaction data has been accumulated. If you need to build a recommendation engine in your own work, please give it a try!
