When building a machine learning model in Python, there are plenty of samples on the net for the processing patterns included in scikit-learn (classification, regression, clustering, etc.), and you can implement them easily by referring to those. When that is not the case, however, choosing a library and finding an implementation sample can be quite difficult. A typical example is **association analysis**, which is often used in marketing analysis. In R you can easily build a model with a widely used library, but in Python there are surprisingly few such code samples. This article fills that gap. Incidentally, the full procedure, including a use case from a more upstream business perspective, the processing steps applied to the original UCI sample dataset, and their explanations, is described in detail in section 5.4 of my book "Profitable AI". If you are interested, please refer to the book as well.
Amazon book https://www.amazon.co.jp/dp/4296106961/
Amazon Kindle https://www.amazon.co.jp/dp/B08F9P726T/
Book support page https://github.com/makaishi2/profitable_ai_book_info/blob/master/README.md
Use the data linked below.
https://github.com/makaishi2/sample-data/blob/master/data/retail-france.csv
This data is the result of some preprocessing applied to the UCI dataset. Please refer to the book mentioned above for the processing steps up to this point.
The following is an overview of the association analysis implementation code using this data. The entire notebook has been uploaded to
https://github.com/makaishi2/sample-notebooks/blob/master/profitable-ai/association-sample.ipynb
Of the preprocessing code shared with the book, only the part relevant to this sample has been extracted here.
# Common preprocessing

# Hide extra warnings
import warnings
warnings.filterwarnings('ignore')

# Import required libraries
import pandas as pd
import numpy as np

# Data frame display function
from IPython.display import display

# Adjust display options
# Floating-point display precision in pandas
pd.options.display.float_format = '{:.4f}'.format
# Show all columns of the data frame
pd.set_option("display.max_columns", None)
Import the preprocessed CSV data shown above into a data frame.
url = 'https://raw.githubusercontent.com/makaishi2/sample-data/master/data/retail-france.csv'
df = pd.read_csv(url)
display(df[100:110])
The result of the display function should look like this.
In order to perform association analysis on the above data, it first needs to be converted to wide format: one row per order and one column per item. (The book explains this wide-format representation in detail.)
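To make the conversion concrete, here is a minimal toy illustration with hypothetical data (the three rows below are made up and are not part of the actual dataset):

# Toy illustration of the wide-format conversion (hypothetical data)
toy = pd.DataFrame({
    'order number':       [1, 1, 2],
    'Item Number':        ['A', 'B', 'A'],
    'Number of products': [2, 1, 3]})

# Aggregate per (order, item) pair, then move items to columns
wide = toy.groupby(['order number', 'Item Number'])['Number of products']\
          .sum().unstack().fillna(0)
print(wide)
# One row per order, one column per item:
# order 1 bought 2 of A and 1 of B; order 2 bought 3 of A and none of B

The implementation for the actual data follows the same pattern.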
# Aggregate product counts, keyed by order number and item number
w1 = df.groupby(['order number', 'Item Number'])['Number of products'].sum()

# Check the result
print(w1.head())
The state of w1 at this stage is as follows.
Use the unstack function to move the item numbers from the row index to the columns.
# Move item numbers to columns (using the unstack function)
w2 = w1.unstack().reset_index().fillna(0).set_index('order number')

# Check the size
print(w2.shape)

# Check the result
display(w2.head())
The result is as follows:
Finally, use the data frame's apply function to convert each element from a count into a True/False binary value.
# Set each element to True or False depending on whether the aggregated count is positive or zero
basket_df = w2.apply(lambda x: x > 0)

# Check the result
display(basket_df.head())
The results are as follows. This completes the pre-processing for association analysis.
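As an aside (my own note, not from the book), the apply call can also be written as a direct element-wise comparison, since comparing a data frame with a scalar already returns a boolean data frame:

# Equivalent one-liner: element-wise comparison yields a boolean data frame
basket_df = w2 > 0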
Use mlxtend as the library for association analysis. mlxtend is not as famous as scikit-learn, but like scikit-learn it is a machine learning library for Python.
First, install the mlxtend library.
# Install mlxtend
!pip install mlxtend
Next, import the two functions used in the analysis, `apriori` and `association_rules`.
# Load the library functions
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
First, the **Apriori** algorithm is used to extract combinations of products with a high value of the metric called **support**.
# Apriori analysis
freq_items1 = apriori(basket_df, min_support=0.06,
    use_colnames=True)

# Check the result
display(freq_items1.sort_values('support',
    ascending=False).head(10))

# Check the number of itemsets
print(freq_items1.shape[0])
The result is as follows.
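For reference, here is a small sanity check of my own (not part of the book's code): the support of a single item is simply the fraction of orders that contain it, so the one-item rows of the apriori output can be verified directly on basket_df.

# Sanity check: support of a single item = fraction of orders containing it
# (the mean of a boolean column is the fraction of True values)
single_support = basket_df.mean()
print(single_support.sort_values(ascending=False).head())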
Next, from the itemset list extracted above, extract the rules with a high **lift** value.
# Extract association rules
a_rules1 = association_rules(freq_items1, metric="lift",
    min_threshold=1)

# Sort by lift value
a_rules1 = a_rules1.sort_values('lift',
    ascending=False).reset_index(drop=True)

# Check the result
display(a_rules1.head(10))

# Check the number of rules
print(a_rules1.shape[0])
The following list is the final result.
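As a quick check of my own (not from the book): lift is the support of the whole rule divided by the product of the antecedent support and the consequent support, and association_rules returns all three of these columns, so the lift of the top rule can be recomputed by hand:

# Recompute the top rule's lift from its component supports
top = a_rules1.iloc[0]
lift_by_hand = top['support'] / (top['antecedent support'] * top['consequent support'])
print(lift_by_hand, top['lift'])  # the two values should match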
In the book, based on the above results, I also use NetworkX to draw the following relationship graph between products.
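The book's plotting code is not reproduced here, but a minimal sketch of how such a graph could be built (my own sketch; the number of rules shown and the styling are arbitrary choices) might look like this:

import networkx as nx
import matplotlib.pyplot as plt

# Build a directed graph from the top rules
# (antecedents/consequents in the mlxtend output are frozensets)
G = nx.DiGraph()
for _, row in a_rules1.head(20).iterrows():
    for a in row['antecedents']:
        for c in row['consequents']:
            G.add_edge(a, c, weight=row['lift'])

nx.draw_networkx(G, node_color='lightblue', font_size=8)
plt.axis('off')
plt.show()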
The technical terms **support** and **lift** used here are explained in the book, so please refer to it for details.