- This is an introductory article for quickly trying out the Factorization Machines algorithm, which has been attracting attention in recommendation technology in recent years, using a library.
- It is not a theoretical explanation; the goal is simply to get something running.
- Basically, it follows the tutorial, with some supplements.
This time I used a library called fastFM. A description of the library and its performance has been published on arXiv.
For those who want an overview of Factorization Machines and the trends around them, there are reference articles both on and outside Qiita, so I will list a few. References are also available for fastFM itself. Roughly speaking, it is an algorithm that "uses matrix factorization to perform regression, classification, and ranking, and is strong on sparse data."
A note on installation: in my environment, it went as follows.
- It can be installed with `pip install fastFM`.
- In my case an error occurred with the pip install, but it worked when installed from source following the instructions on GitHub.
So for a new environment, your options are to install from source, to use a tool like pyenv that can manage multiple Pythons and set up a 3.6-series interpreter, or to create some kind of 3.6 environment with Docker or the like.
When it comes to Factorization Machines (FM), sample data is often given in dictionary style, like this:
[
{user:A, item:X, ...},
{user:B, item:Y, ...}, ...
]
For sparse data, input in this dictionary form seems common, but this time I would like to work with a simple CSV instead.
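As an aside, if your data does arrive in that dictionary form, scikit-learn's DictVectorizer can turn it into a sparse feature matrix directly. A minimal sketch with made-up user/item records (the records here are hypothetical, not this article's data):

from sklearn.feature_extraction import DictVectorizer

# Made-up dictionary-style records (hypothetical)
records = [
    {"user": "A", "item": "X"},
    {"user": "B", "item": "Y"},
]

# sparse=True (the default) returns a scipy sparse matrix,
# which is the kind of input FM libraries expect
vec = DictVectorizer(sparse=True)
X_sparse = vec.fit_transform(records)

print(vec.feature_names_)    # one-hot style features such as 'user=A'
print(X_sparse.todense())    # dense view, just for inspection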
Sample dummy data
category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
...
I will paste the full data at the bottom of the article. The values are dummies I made up arbitrarily, with the following columns assumed:
- `category`: a column with category information
- `rating`: a column with 5 grades, from 1 to 5
- `is_?`: columns containing flag information; imagine flags such as "bought product ?" or flags indicating user attributes
The usage of the library itself is simple and will feel familiar to anyone who has used scikit-learn or the like. First, let's build the processing flow with plain linear regression. Skipping the fine details, the general model-building flow looks like this.
This time, I will create a regression model with **rating** as the target. (For clarity I import at each step, but you can of course put all the imports at the top.)
import numpy as np
import pandas as pd

# Read the CSV data
raw = pd.read_csv('fm_sample.csv')

# Separate the target column from the rest of the features
target_label = "rating"
data_y = raw[target_label]
data_X = raw.drop(target_label, axis=1)

# Preprocessing
## Use category_encoders, a convenient library for categorical data,
## together with scikit-learn's convenient functions
import category_encoders as ce

## One-hot encode the specified column
enc = ce.OneHotEncoder(cols=['category'])
X = enc.fit_transform(data_X)
y = data_y

# Data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=810)
For evaluation I use MAE (Mean Absolute Error); MSE and the like can be computed just as easily.
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Modeling
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# Evaluation
## Fit on the training data
mean_absolute_error(y_train, reg.predict(X_train))

## Error on the test data
mean_absolute_error(y_test, reg.predict(X_test))
It comes out roughly like this. Of course, for the evaluation part I would normally go further, e.g. plotting the fitted values and looking at how the errors are distributed, but this article only goes as far as getting things to run.
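For instance, a residual plot is one quick way to see how the errors are distributed. A minimal sketch with matplotlib, reusing reg, X_test, and y_test from above (my addition, not part of the original flow):

import matplotlib.pyplot as plt

# Residual = actual - predicted; points far from 0 are large errors
y_pred = reg.predict(X_test)
residuals = y_test - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(0, color='gray')
plt.xlabel('Predicted rating')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residual plot')
plt.grid(True)
plt.show()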
If the flow above works, the rest is done once the modeling part is switched over to fastFM. One point to note: **a DataFrame cannot be passed as-is, so csr_matrix is used**.
from fastFM import als
from scipy.sparse import csr_matrix

# Modeling
fm = als.FMRegression(n_iter=1000, init_stdev=0.1, rank=8, l2_reg_w=0.5, l2_reg_V=0.5, random_state=810)
fm.fit(csr_matrix(X_train), y_train)

# Evaluation
## Fit on the training data
mean_absolute_error(y_train, fm.predict(csr_matrix(X_train)))

## Error on the test data
mean_absolute_error(y_test, fm.predict(csr_matrix(X_test)))
Here `csr_matrix` makes its appearance. It is a data structure for handling sparse data. The idea is simple: where a DataFrame or an ordinary matrix holds 2D data like this:
matrix
array([
[0, 0, 1],
[0, 0, 0],
[0, 3, 0]
])
a sparse matrix holds only the entries that actually contain data:

sparse
Size: 3 x 3
Stored entries:
(0, 2): 1
(2, 1): 3
That is the rough idea. There are several sparse matrix types, such as csr_matrix, coo_matrix, csc_matrix, and lil_matrix, which differ in how they store the data and in processing speed, so if you are interested, search for something like "scipy sparse matrix" (note.nkmk.me has articles on this, for example).
What to remember this time:

- Converting from a DataFrame, example: `csr_matrix(df)`
- Converting a csr_matrix back to a dense matrix with todense, example: `csr_matrix(X_train).todense()`
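To make this concrete, here is a small sketch of the 3 x 3 example above (the exact printed layout may vary by scipy version):

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([
    [0, 0, 1],
    [0, 0, 0],
    [0, 3, 0],
]))

# Printing a sparse matrix shows only the stored entries as (row, col) value pairs
print(m)            # (0, 2) 1  and  (2, 1) 3
print(m.todense())  # back to an ordinary dense matrix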
Binary classification is also possible, so I will try it.
This time, the task is to **classify whether the rating is 4 or higher, or lower**.
Continuing from the preprocessing and data-split steps above, I create the answer labels (whether the rating is 4 or higher), then build and evaluate the model.
Also note that fastFM's classification uses labels of -1 or 1 instead of 0 or 1.
from fastFM import sgd
from sklearn.metrics import roc_auc_score

# Preprocessing, continued
## Set to 1 if the rating is 4 or higher, otherwise -1
y_ = np.array([1 if r > 3 else -1 for r in y])

## Create the training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y_, random_state=810)

# Modeling
fm = sgd.FMClassification(n_iter=5000, init_stdev=0.1, l2_reg_w=0,
                          l2_reg_V=0, rank=2, step_size=0.1)
fm.fit(csr_matrix(X_train), y_train)

## Two kinds of predicted values are available:
## predict returns the -1/1 labels, predict_proba returns probabilities
y_pred = fm.predict(csr_matrix(X_test))
y_pred_proba = fm.predict_proba(csr_matrix(X_test))

# Evaluation
## Example: evaluate with AUC
roc_auc_score(y_test, y_pred_proba)
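Since predict returns the -1/1 labels themselves, label-based metrics such as accuracy can also be computed directly (a small addition of mine):

from sklearn.metrics import accuracy_score

# y_pred holds -1/1 labels, so label-based metrics work as-is
print(accuracy_score(y_test, y_pred))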
Referring to a page on note.nkmk.me, let's also draw the ROC curve.
from sklearn import metrics
import matplotlib.pyplot as plt

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba, drop_intermediate=False)
auc = metrics.auc(fpr, tpr)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()
And with that, the classification example runs as well.
The sample data is pasted below. It was made up arbitrarily, so the data itself is not interesting; please use it to check that the code runs.
fm_sample.csv
category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
A,3,0,0,1,0,1,0
A,2,1,0,0,0,0,1
A,4,0,0,0,0,0,1
A,5,1,0,0,1,1,0
A,1,0,1,0,0,0,1
A,2,0,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
B,3,1,1,0,0,1,0
B,2,0,0,1,0,0,0
B,1,0,0,0,0,0,1
B,3,0,0,1,0,0,1
B,4,0,1,0,0,0,0
B,1,0,0,0,0,0,1
B,2,0,1,0,0,0,1
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
C,2,1,0,1,0,1,0
C,4,0,0,0,0,0,0
C,5,0,0,1,1,1,0
C,2,0,1,0,0,0,1
C,5,1,0,0,0,1,0
C,3,0,0,1,1,1,0
C,2,0,0,0,0,0,1
C,3,0,0,0,0,1,0
A,2,0,0,0,0,0,1
A,4,1,0,0,0,1,0
A,3,0,0,0,0,0,0
A,1,0,0,0,0,0,1
A,3,1,0,0,0,0,0
A,4,0,0,1,0,1,0
A,5,1,1,0,0,1,0
A,3,1,0,0,1,0,0
B,4,0,0,0,0,1,0
B,1,0,0,0,0,0,1
B,5,0,0,0,0,1,0
B,3,0,0,0,0,0,0
B,1,0,0,0,1,0,1
B,3,0,0,1,0,0,0
B,2,0,1,0,0,0,1
B,5,1,0,0,0,1,0
B,4,0,0,0,1,1,1
C,1,0,0,0,0,0,0
C,2,0,0,0,0,0,1
C,3,0,0,1,0,0,0
C,4,0,1,0,0,1,0
C,1,0,0,1,0,0,1
C,1,0,0,0,0,0,0
C,3,0,0,1,0,0,0
C,3,0,0,1,0,1,0
C,5,0,0,0,1,1,0
C,3,0,0,1,0,1,0