I wrote this article because there are few recommendation tutorials that actually work through sample data.
There are approaches that build recommendations with machine learning, but this article shows how to build them with a statistics-based method.
I will explain using Python and an open dataset.
This article focuses on the implementation in Python. For the underlying concepts, read the companion article: Recommendation tutorial using association analysis (concept).
The implementation follows the flow of that concept article.
A Google Colaboratory execution environment is available here (for a fee).
Import the required libraries.
#Library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the dataset used in this tutorial, fetching it from a GitHub repository.
#Data set loading
import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/tachiken0210/dataset/master/dataset_cart.csv"

#Function to read a csv file from a URL
def read_csv(url):
    res = urllib.request.urlopen(url)
    res = res.read().decode("utf-8")
    df = pd.read_csv(StringIO(res))
    return df

#Run
df_cart_ori = read_csv(url)
Check the contents of the dataset used this time.
df_cart_ori.head()
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 0 | 108750017 | 583266 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 1 | 108750017 | 662680 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 2 | 108750017 | 664077 | UPD | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 2013-11-26 03:11:06 | 1385478215 |
| 3 | 108199875 | 661648 | ADD | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 1385478215 |
| 4 | 105031004 | 661231 | ADD | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 1385478215 |
This dataset is log data recorded when customers of an e-commerce (EC) site put products in their carts.

・cart_id: Cart ID associated with the customer (you can think of it as a customer ID, so below it is referred to as the customer)
・goods_id: Product ID
・action: Customer action (ADD: add to cart, DEL: delete from cart, etc.)
・create_at: Time the log was created
・update_at: Time the log was updated (not used this time)
・last_update: Time the log was last updated (not used this time)
・time: Timestamp

Note that open datasets in general, including this one, rarely contain concrete names such as product names; products are identified only by their IDs. This makes it hard to tell what kind of product each item actually is, but that is simply a property of the data.
Next, let's look at the contents of this data. Since the dataset is a chronological log covering all customers, let's focus on a single customer. First, extract the customer with the largest number of logs.
df_cart_ori["cart_id"].value_counts().head()
#output
110728794 475
106932411 426
115973611 264
109269739 205
112332751 197
Name: cart_id, dtype: int64
Customer 110728794 has 475 logs, the most in the data. Let's extract this customer's log.
df_cart_ori[df_cart_ori["cart_id"]==110728794].head(10)
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 32580 | 110728794 | 664457 | ADD | 2013-11-26 22:54:13 | 2013-11-26 22:54:13 | 2013-11-26 22:54:13 | 1385478215 |
| 32619 | 110728794 | 664885 | ADD | 2013-11-26 22:55:09 | 2013-11-26 22:55:09 | 2013-11-26 22:55:09 | 1385478215 |
| 33047 | 110728794 | 664937 | ADD | 2013-11-26 22:58:52 | 2013-11-26 22:58:52 | 2013-11-26 22:58:52 | 1385478215 |
| 33095 | 110728794 | 664701 | ADD | 2013-11-26 23:00:25 | 2013-11-26 23:00:25 | 2013-11-26 23:00:25 | 1385478215 |
| 34367 | 110728794 | 665050 | ADD | 2013-11-26 23:02:40 | 2013-11-26 23:02:40 | 2013-11-26 23:02:40 | 1385478215 |
| 34456 | 110728794 | 664989 | ADD | 2013-11-26 23:05:03 | 2013-11-26 23:05:03 | 2013-11-26 23:05:03 | 1385478215 |
| 34653 | 110728794 | 664995 | ADD | 2013-11-26 23:07:00 | 2013-11-26 23:07:00 | 2013-11-26 23:07:00 | 1385478215 |
| 34741 | 110728794 | 664961 | ADD | 2013-11-26 23:09:41 | 2013-11-26 23:09:41 | 2013-11-26 23:09:41 | 1385478215 |
| 296473 | 110728794 | 665412 | DEL | 2013-12-03 17:17:30 | 2013-12-03 17:17:30 | 2013-12-03 07:41:13 | 1386083014 |
| 296476 | 110728794 | 665480 | DEL | 2013-12-03 17:17:37 | 2013-12-03 17:17:37 | 2013-12-03 07:42:29 | 1386083014 |
Looking at this customer, you can see that they keep adding products to their cart throughout the day.
Analyzing other customers in the same way, it should become clear that, in general, some product y tends to be added to the cart right after some product x.
So the purpose this time is to extract from this dataset the pattern "product y is likely to be added to the cart right after product x."
We will start processing the data next, so keep this goal in the back of your mind.
From here, we preprocess the data. First, the "action" column contains several values, such as ADD and DEL. For now we focus only on the ADD (added to cart) logs, since they reflect the customers' demand for products.
df = df_cart_ori.copy()
df = df[df["action"]=="ADD"]
df.head()
|    | cart_id | goods_id | action | create_at | update_at | last_update | time |
|---|---|---|---|---|---|---|---|
| 3 | 108199875 | 661648 | ADD | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 2013-11-26 03:11:10 | 1385478215 |
| 4 | 105031004 | 661231 | ADD | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 2013-11-26 03:11:41 | 1385478215 |
| 6 | 110388732 | 661534 | ADD | 2013-11-26 03:11:55 | 2013-11-26 03:11:55 | 2013-11-26 03:11:55 | 1385478215 |
| 7 | 110388740 | 662336 | ADD | 2013-11-26 03:11:58 | 2013-11-26 03:11:58 | 2013-11-26 03:11:58 | 1385478215 |
| 8 | 110293997 | 661648 | ADD | 2013-11-26 03:12:13 | 2013-11-26 03:12:13 | 2013-11-26 03:12:13 | 1385478215 |
Next, sort the dataset so that, for each customer, the rows appear in the chronological order the products were added to the cart.
df = df[["cart_id","goods_id","create_at"]]
df = df.sort_values(["cart_id","create_at"],ascending=[True,True])
df.head()
|    | cart_id | goods_id | create_at |
|---|---|---|---|
| 548166 | 78496306 | 661142 | 2013-11-15 23:07:02 |
| 517601 | 79100564 | 662760 | 2013-11-24 18:17:24 |
| 517404 | 79100564 | 661093 | 2013-11-24 18:25:29 |
| 23762 | 79100564 | 664856 | 2013-11-26 13:41:47 |
| 22308 | 79100564 | 562296 | 2013-11-26 13:44:20 |
From here, we format the data to fit the association analysis function used below.
For each customer, we collect the products added to the cart into an ordered list.
df = df[["cart_id","goods_id"]]
df["goods_id"] = df["goods_id"].astype(int).astype(str)
df = df.groupby(["cart_id"])["goods_id"].apply(lambda x:list(x)).reset_index()
df.head()
|    | cart_id | goods_id |
|---|---|---|
| 0 | 78496306 | ['661142'] |
| 1 | 79100564 | ['662760', '661093', '664856', '562296', '663364', '664963', '664475', '583266'] |
| 2 | 79455669 | ['663801', '664136', '664937', '663932', '538673', '663902', '667859'] |
| 3 | 81390353 | ['663132', '661725', '664236', '663331', '659679', '663847', '662340', '662292', '664099', '664165', '663581', '665426', '663899', '663405'] |
| 4 | 81932021 | ['662282', '664218'] |
Next, let's check the contents of this data frame, focusing on the customer in row 2 (79455669).
The first product, 663801, was followed by 664136 in the cart. That is, X = "663801" and Y = "664136".
Likewise, 664937 was added after 664136, giving X = "664136" and Y = "664937".
By lining up these XY pairs for every customer, we can perform an association analysis.
Now let's format the data into such XY pairs.
#For each customer's ordered list, make pairs of consecutive products
def get_combination(l):
    length = len(l)
    list_output = []
    for i in range(length-1):
        #Skip the pair when the same product appears twice in a row
        if l[i]==l[i+1]:
            pass
        else:
            list_pair = [l[i], l[i+1]]
            list_output.append(list_pair)
    return list_output
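As a quick check, here is what get_combination returns for a short, hypothetical input list; note that consecutive duplicates are skipped:

#Quick check with a hypothetical list
print(get_combination(["A", "B", "B", "C"]))
#output
[['A', 'B'], ['B', 'C']]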
df["comb_goods_id"] = df["goods_id"].apply(lambda x:get_combination(x))
#Flatten each customer's pairs into one list of association inputs
dataset = []
for contents in df["comb_goods_id"].values:
    for c in contents:
        dataset.append(c)
print("The number of XY pairs is", len(dataset))
print("Contents of XY pair", dataset[:5])
#output
The number of XY pairs is 141956
Contents of XY pair [['662760', '661093'], ['661093', '664856'], ['664856', '562296'], ['562296', '663364'], ['663364', '664963']]
We use this list as the input for the association analysis. I have written a function, association, that performs the analysis, so let's use it.
#Association analysis function
def association(dataset):
    df = pd.DataFrame(dataset, columns=["x","y"])
    num_dataset = df.shape[0]
    df["sum_count_xy"] = 1
    print("calculating support....")
    #Support of x and of y: the share of all pairs that contain each product
    df_a_support = (df.groupby("x")[["sum_count_xy"]].sum()/num_dataset).rename(columns={"sum_count_xy":"support_x"})
    df_b_support = (df.groupby("y")[["sum_count_xy"]].sum()/num_dataset).rename(columns={"sum_count_xy":"support_y"})
    #Count each (x, y) pair and compute its support
    df = df.groupby(["x","y"]).sum()
    df["support_xy"] = df["sum_count_xy"]/num_dataset
    df = df.reset_index()
    df = pd.merge(df, df_a_support, on="x")
    df = pd.merge(df, df_b_support, on="y")
    print("calculating confidence....")
    #Total number of pairs in which each x (and each y) appears
    df_temp = df.groupby("x")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy":"sum_count_x"})
    df = pd.merge(df, df_temp, on="x")
    df_temp = df.groupby("y")[["sum_count_xy"]].sum().rename(columns={"sum_count_xy":"sum_count_y"})
    df = pd.merge(df, df_temp, on="y")
    #Confidence: how often y follows, given that x appeared
    df["confidence"] = df["sum_count_xy"]/df["sum_count_x"]
    print("calculating lift....")
    #Lift: confidence relative to the baseline frequency of y
    df["lift"] = df["confidence"]/df["support_y"]
    df["sum_count"] = num_dataset
    return df
#Run the association analysis on the dataset
df_output = association(dataset)
df_output.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 485836 | 662615 | 1 | 7.04444e-06 | 7.04444e-06 | 0.000147933 | 1 | 21 | 1 | 6759.81 | 141956 |
| 1 | 549376 | 662615 | 1 | 7.04444e-06 | 2.11333e-05 | 0.000147933 | 3 | 21 | 0.333333 | 2253.27 | 141956 |
| 2 | 654700 | 662615 | 1 | 7.04444e-06 | 0.000464933 | 0.000147933 | 66 | 21 | 0.0151515 | 102.421 | 141956 |
| 3 | 661475 | 662615 | 1 | 7.04444e-06 | 0.000965088 | 0.000147933 | 137 | 21 | 0.00729927 | 49.3417 | 141956 |
| 4 | 661558 | 662615 | 1 | 7.04444e-06 | 0.000408577 | 0.000147933 | 58 | 21 | 0.0172414 | 116.548 | 141956 |
We have the result. The contents of each column are as follows:

・x: Condition part X
・y: Conclusion part Y
・sum_count_xy: Number of records matching XY
・support_xy: Support of XY
・support_x: Support of X
・support_y: Support of Y
・sum_count_x: Number of records matching X
・sum_count_y: Number of records matching Y
・confidence: Confidence
・lift: Lift
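These columns satisfy the relations support_xy = sum_count_xy / sum_count, confidence = sum_count_xy / sum_count_x, and lift = confidence / support_y. As a minimal sanity check of these definitions on the first output row:

#Verify the metric definitions on the first output row
row = df_output.iloc[0]
assert np.isclose(row["support_xy"], row["sum_count_xy"] / row["sum_count"])
assert np.isclose(row["confidence"], row["sum_count_xy"] / row["sum_count_x"])
assert np.isclose(row["lift"], row["confidence"] / row["support_y"])
#For x=485836: confidence = 1/1 = 1, so lift = 1 / 0.000147933 = 6759.81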
The association analysis itself is now done, but there is one caveat.
A high lift value generally indicates high relevance, but when the number of observations is small, the result may simply be a coincidence.
Take the top row, for example: x = "485836", y = "662615".
Its lift is very high at about 6760, but this pair occurred only once (sum_count_xy = 1). With so little data, it could easily have happened by chance.
So how do we derive a threshold on support_xy that separates chance from real patterns?
In practice, deriving such a threshold is difficult; there is no single correct answer.
Here, let's first look at the histogram of the co-occurrence counts (sum_count_xy).
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.hist(df_output["sum_count_xy"], bins=100)
ax.set_title('')
ax.set_xlabel('xy')
ax.set_ylabel('freq')
#It's hard to see, so limit the range of the y-axis
ax.set_ylim(0,500)
plt.show()
You can see that most of the data is concentrated near 0 on the x-axis.
In this tutorial, everything near 0 is regarded as a coincidence and is removed.
We set the threshold with a quantile on support_xy: the lower 98% of the data is treated as coincidence and dropped.
df = df_output.copy()
df = df[df["support_xy"]>=df["support_xy"].quantile(0.98)]
df.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | 662193 | 667129 | 8 | 5.63555e-05 | 0.0036138 | 0.00281073 | 513 | 399 | 0.0155945 | 5.54822 | 141956 |
| 59 | 665672 | 667129 | 12 | 8.45332e-05 | 0.00395193 | 0.00281073 | 561 | 399 | 0.0213904 | 7.61026 | 141956 |
| 60 | 666435 | 667129 | 30 | 0.000211333 | 0.0082279 | 0.00281073 | 1168 | 399 | 0.0256849 | 9.13817 | 141956 |
| 62 | 666590 | 667129 | 7 | 4.93111e-05 | 0.00421257 | 0.00281073 | 598 | 399 | 0.0117057 | 4.16464 | 141956 |
| 63 | 666856 | 667129 | 8 | 5.63555e-05 | 0.00390966 | 0.00281073 | 555 | 399 | 0.0144144 | 5.12835 | 141956 |
This removes the XY pairs that we regard as having occurred by chance.
Then the last step: keep only the y that are strongly related to each x. As explained earlier, this means keeping only the XY pairs with high lift values.
Here, we keep only the pairs with a lift of 2.0 or higher.
To give an intuition for lift: it expresses how much more likely product y is to be added to the cart once product x has been added, compared to how often y is added in general.
For example, in the data frame above, when product "662193" (x) is added to the cart, product "667129" (y) becomes about 5.5 times more likely to be added than usual.
df = df[df["lift"]>=2.0]
df_recommendation = df.copy()
df_recommendation.head()
|    | x | y | sum_count_xy | support_xy | support_x | support_y | sum_count_x | sum_count_y | confidence | lift | sum_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | 662193 | 667129 | 8 | 5.63555e-05 | 0.0036138 | 0.00281073 | 513 | 399 | 0.0155945 | 5.54822 | 141956 |
| 59 | 665672 | 667129 | 12 | 8.45332e-05 | 0.00395193 | 0.00281073 | 561 | 399 | 0.0213904 | 7.61026 | 141956 |
| 60 | 666435 | 667129 | 30 | 0.000211333 | 0.0082279 | 0.00281073 | 1168 | 399 | 0.0256849 | 9.13817 | 141956 |
| 62 | 666590 | 667129 | 7 | 4.93111e-05 | 0.00421257 | 0.00281073 | 598 | 399 | 0.0117057 | 4.16464 | 141956 |
| 63 | 666856 | 667129 | 8 | 5.63555e-05 | 0.00390966 | 0.00281073 | 555 | 399 | 0.0144144 | 5.12835 | 141956 |
This completes the generation of the basic recommendation data.
Its concrete use is simple: for example, promote product "667129" to a customer who has added product "662193" to their cart.
In practice, this recommendation data is stored in a database so that, given an input x, the corresponding y is returned.
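As a minimal in-memory sketch of that lookup (the name recommend_table is just for illustration), each x maps to its y candidates sorted in descending order of lift:

#Lookup table: for each input product x, the recommended products y by descending lift
recommend_table = (
    df_recommendation.sort_values("lift", ascending=False)
    .groupby("x")["y"].apply(list)
    .to_dict()
)
#Example: candidates for a customer who added product "662193" to the cart
print(recommend_table["662193"])

Based on the table above, this output should include "667129".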
Now that we have created the recommendation data, let me be upfront about a problem with this kind of recommendation:
it cannot cover all input products (x).
Let's check this.
First, let's check how many types of x are in the original transaction data.
#x corresponds to goods_id in the original data, so count its unique values
df_cart_ori["goods_id"].nunique()
#output
8398
Originally, there are 8398 types of x.
In other words, we should be able to recommend highly relevant products (y) for these 8398 types of inputs (x).
Now let's check how many inputs (x) the earlier output actually covers.
#Count the unique values of x in the recommendation result
df_recommendation["x"].nunique()
#output
463
The recommendation data above supports only 463 types of input (x): 463 out of 8398, or about 5.5%.
This is one weakness of recommendations based on user behavior: products that rarely co-occur in the logs get no recommendations at all.
In practice, this coverage gap will likely need to be addressed.
One possible countermeasure, sketched below, is to fall back to another source of recommendations (such as overall popularity) when x has no association data.
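As an illustration only (not part of the original analysis), a minimal popularity fallback could look like this, assuming the recommend_table dictionary from the sketch above:

#Hypothetical fallback: the most frequently added products overall
popular_goods = (
    df_cart_ori[df_cart_ori["action"] == "ADD"]["goods_id"]
    .astype(int).astype(str)
    .value_counts().head(5).index.tolist()
)

#Return association-based recommendations when available, otherwise popular products
def recommend(x):
    return recommend_table.get(x, popular_goods)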
With association analysis, it is relatively easy to create recommendation data once transaction data has accumulated. If you are building a recommendation engine in your own work, please give it a try!