I wanted to learn about marketing analysis methods by touching actual data in Python, so I wrote an article for the purpose of keeping a record of that learning. This time, we will use decyl analysis and RFM analysis, which are customer analysis methods.
Use the Online Retail Data Set (https://www.kaggle.com/vijayuv/onlineretail) published by Kaggle.
It runs on Google Craboratory. The versions of Python and each library are as follows.
By dividing customers by certain characteristics, you will be able to find good customers, place advertisements efficiently, and so on. There is also a method of using attributes such as gender and age for segmentation, but here we will deal with two analysis methods using purchase history data.
This is a method of segmenting customers by arranging them in order of purchase amount and dividing them into 10 parts.
This is a method of segmenting customers according to the following three indicators.
-** Recency : How many days ago did you use it? - Frequency : How often did you use it during a certain period? - Monetary **: Total amount spent during a period
#Import base module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read the data by specifying the data type.
#Data read
dtypes = {
'InvoiceNo': 'object',
'StockCode': 'object',
'Description': 'object',
'Quantity': 'int8',
'InvoiceDate': 'datetime64[ns]',
'UnitPrice': 'float64',
'CustomerID': 'object',
'Country': 'object'
}
raw_data = pd.read_csv('./data/OnlineRetail.csv', dtype=dtypes, engine='python')
#Overview confirmation
print(raw_data.shape)
raw_data.head(20)
>>>
(541909, 8)
To briefly explain each column
--InvouceNo: Order number --StockCode: Item number --Description: Product description --Quantity: Purchased quantity --InvouceDate: Purchase date and time --UnitPrice: Product unit price --CustomerID: Customer number --Country: Country
Decyl analysis uses Quantity, UnitPrice, and CustomerID. Looking at the data a little more
#Check missing values, etc.
raw_data.info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int8
4 InvoiceDate 541909 non-null datetime64[ns]
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null object
7 Country 541909 non-null object
dtypes: datetime64[ns](1),float64(1),int8(1),object(5)
memory usage: 29.5+ MB
The CustomerID column is missing. Also,
#Statistic confirmation of numerical variables
raw_data.describe()
You can see that there are negative values in some of Quantity and UnitPrice. It seems that there are some rules such as a negative value being entered when canceling a purchase, but this time I will use only data of 0 or more. Also, exclude the data for which the Cutomer ID is missing.
#cleansing
data = raw_data.query('Quantity >= 0 & UnitPrice >= 0').dropna(axis=0, subset=['CustomerID'])
Create a total amount (quantity x unit price) column.
#the amount=Calculate quantity x unit price
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
#Find the total purchase amount for each customer
decil = data[['CustomerID', 'TotalPrice']].groupby('CustomerID').sum().reset_index()
decil.head()
Using this value, divide by 10% from the top.
#Find the quantile
parties = decil.quantile(q=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]).to_dict()
parties
>>>
{'TotalPrice': {0.1: 151.812,
0.2: 240.62800000000001,
0.3: 340.95799999999997,
0.4: 474.1039999999999,
0.5: 656.6899999999999,
0.6: 905.4079999999998,
0.7: 1309.4160000000002,
0.8: 1954.0280000000002,
0.9: 3488.1160000000036}}
#Define a function to divide by quantile
def cal_decil(x, col, df):
if x <= df[col][0.1]:
return 1
elif x <= df[col][0.2]:
return 2
elif x <= df[col][0.3]:
return 3
elif x <= df[col][0.4]:
return 4
elif x <= df[col][0.5]:
return 5
elif x <= df[col][0.6]:
return 6
elif x <= df[col][0.7]:
return 7
elif x <= df[col][0.8]:
return 8
elif x <= df[col][0.9]:
return 9
else:
return 10
#Calculation of decyl score
decil['Decil_score'] = decil['TotalPrice'].apply(cal_decil, args=('TotalPrice', parties))
decil.head()
We were able to divide our customers into ranks 1-10. From here, we will look at how much each layer contributes to overall sales. You can use Pandas' .cumsum ()
and .cumprod ()
methods to calculate cumulative sums and ratios.
#Sort in descending order to find the cumulative sum / cumulative ratio
decil = decil.sort_values('TotalPrice', ascending=False)
decil['Cumsum'] = decil['TotalPrice'].cumsum()
decil['Cumprod'] = decil['Cumsum'] / decil['Cumsum'].max()
decil
#Percentage of sales at each level
decil[['Decil_score', 'Cumprod']].groupby('Decil_score').max() \
.reset_index().sort_values('Decil_score', ascending=False)
Looking at the results, we can see that the top 10% of total purchases account for about 55% of total sales, the top 20% for about 70%, and the top 30% for about 80%.
RFM analysis uses InvoiceDate, TotalPrice (column created by decyl analysis), and CustomerID.
In order to calculate the Recency, you need to specify the current time (time at the time of analysis). Since the last time of the purchase data is '2011-12-09 12:50:00'
, this time we will set the current time (NOW
) to December 10, 2011 (0:00).
#Get the latest time in history
data['InvoiceDate'].max()
>>>
Timestamp('2011-12-09 12:50:00')
#Set the current time
import datetime as dt
NOW = dt.datetime(2011,12,10)
GroupBy with CustomerID
and use the agg function to find Recency, Frequency and Monetary as follows.
#RFM calculation
rfm = data.groupby("CustomerID") \
.agg({"InvoiceDate": lambda date: (NOW - date.max()).days,
"InvoiceNo": lambda num: num.nunique(),
"TotalPrice": lambda price: price.sum()}).reset_index()
#Column name change
rfm.rename(columns={'InvoiceDate': 'recency',
'InvoiceNo': 'frequency',
'TotalPrice': 'monetary'}, inplace=True)
rfm.head()
From here, we will segment using each index actually calculated. How to divide (reference value and number of divisions) needs to be changed depending on the situation, this time we will divide each index using the quartile.
#Find the quartile of each of the three indicators
quantiles = rfm.quantile(q=[0.25,0.5,0.75])
quantiles_dict = quantiles.to_dict()
print(quantiles_dict)
>>>
{'recency': {0.25: 17.0, 0.5: 50.0, 0.75: 142.0},
'frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 5.0},
'monetary': {0.25: 300.67499999999995, 0.5: 656.6899999999999, 0.75: 1601.0}}
#Define a function that divides RFM by quartiles
def cal_R(x, col, df):
if x <= df[col][0.25]:
return 1
elif x <= df[col][0.50]:
return 2
elif x <= df[col][0.75]:
return 3
else:
return 4
def cal_FM(x, col, df):
if x <= df[col][0.25]:
return 4
elif x <= df[col][0.50]:
return 3
elif x <= df[col][0.75]:
return 2
else:
return 1
#Calculation of various scores
rfm['R_score'] = rfm.recency.apply(cal_R, args=('recency', quantiles_dict))
rfm['F_score'] = rfm.frequency.apply(cal_FM, args=('frequency', quantiles_dict))
rfm['M_score'] = rfm.monetary.apply(cal_FM, args=('monetary', quantiles_dict))
#Calculation of score by integrating 3 indicators
rfm["RFM_score"] = rfm.R_score.astype(str)+ rfm.F_score.astype(str) + rfm.M_score.astype(str)
rfm.reset_index(inplace=True)
rfm.head()
Using the obtained score, we will segment it, for example, "If the score 4 is 2 or more items, it is a good customer".
It seems that unsupervised machine learning method (clustering) may be used for segmentation. Let's try to implement segmentation using the k-means method.
#Standardization
from sklearn.preprocessing import StandardScaler
tmp = rfm[['R_score', 'F_score', 'M_score']]
ss = StandardScaler()
rfm_scaled = ss.fit_transform(tmp)
print(rfm_scaled)
>>>
array([[ 1.34301118, 1.15405079, 1.34141302],
[-1.33165772, -1.47067031, -1.34162009],
[ 0.45145488, -0.59576328, -0.44727572],
...,
[-1.33165772, 0.27914376, 1.34141302],
[-1.33165772, -1.47067031, -1.34162009],
[-0.44010142, -0.59576328, -1.34162009]])
The number of clusters to be divided must be specified by the analyst, but one method for considering the optimum number of clusters is to see the change in the itertia value (sum of squared errors in the cluster) when the number of divisions is changed. .. For more details, please read this article.
#Clustering (examination of the number of divisions)
from sklearn.cluster import KMeans
inertias = {}
for k in range(2, 10):
km = KMeans(n_clusters=k, random_state=1)
km.fit(rfm_scaled)
inertias[k] = km.inertia_
pd.Series(inertias).plot()
plt.xlabel("K (num of clusters)")
plt.ylabel("Inertia Score")
<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/740683/bd2daeae-5be5-b65f-a9c3-834aea55f405.png " width=60%>
It's a little difficult to judge, but this time I'll divide it into four clusters.
# k=Clustering with 4
k = 4
km = KMeans(n_clusters=k, random_state = 1)
rfm["KMeans"] = km.fit_predict(rfm_scaled)
rfm.head()
#Visualization
sns.pairplot(rfm[['R_score', 'F_score', 'M_score', 'KMeans']], hue="KMeans", palette='jet_r')
plt.show()
#Check the number of elements in each cluster
rfm['KMeans'].value_counts()
>>>
0 1428
3 1112
1 997
2 782
Name: KMeans, dtype: int64
I implemented decyl analysis and RFM analysis in Python. It's a good practice for Pandas, and I found it easier to maintain motivation with meaningful data analysis. Next, I will try product analysis.
https://library.musubu.in/articles/10537# https://www.albert2005.co.jp/knowledge/marketing/customer_product_analysis/decyl_rfm https://qiita.com/deaikei/items/11a10fde5bb47a2cf2c2
Recommended Posts