- This is an introductory article for quickly trying out the Factorization Machines algorithm, which has been attracting attention in recommendation technology in recent years, using a library.
- It is not a theoretical explanation; the goal is simply to get something running.
- Basically, it follows the tutorial, with some supplements.
This time I used a library called fastFM. A description of the library and its performance has been published on arXiv.
For those who want an overview of Factorization Machines and the trends around them, there are reference articles both on and outside Qiita, so I will list a few. References are also available for fastFM itself. Roughly speaking, it is an algorithm that "uses matrix factorization to perform regression, classification, and ranking, and is strong on sparse data."
A note on installation: in my environment, it went as follows.
- It can be installed with `pip install fastFM`.
- In my case an error occurred with the pip install, but it worked when installed from source following the instructions on GitHub.
So for a new environment, your options are to install from source, to use a tool like pyenv that can manage multiple Pythons and set up a 3.6-series interpreter, or to create some kind of 3.6 environment with Docker or the like.
When it comes to Factorization Machines (FM), sample data is often given in dictionary style, like this:
[
{user:A, item:X, ...},
{user:B, item:Y, ...}, ...
]
For sparse data, input in this dictionary form seems common, but this time I would like to work with a simple CSV instead.
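As an aside, if your data does arrive in that dictionary form, scikit-learn's DictVectorizer can turn it into a sparse feature matrix directly. A minimal sketch with made-up user/item records (the records here are hypothetical, not this article's data):

from sklearn.feature_extraction import DictVectorizer

# Made-up dictionary-style records (hypothetical)
records = [
    {"user": "A", "item": "X"},
    {"user": "B", "item": "Y"},
]

# sparse=True (the default) returns a scipy sparse matrix,
# which is the kind of input FM libraries expect
vec = DictVectorizer(sparse=True)
X_sparse = vec.fit_transform(records)

print(vec.feature_names_)    # one-hot style features such as 'user=A'
print(X_sparse.todense())    # dense view, just for inspection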
Sample dummy data
category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
...
I will paste the full data at the bottom of the article. The values are dummies I made up arbitrarily, with the following columns assumed:
- `category`: a column with category information
- `rating`: a column with 5 grades, from 1 to 5
- `is_?`: columns containing flag information; imagine flags such as "bought product ?" or flags indicating user attributes
The usage of the library itself is simple and will feel familiar to anyone who has used scikit-learn or the like. First, let's build the processing flow with plain linear regression. Skipping the fine details, the general model-building flow looks like this.
This time, I will create a regression model with **rating** as the target. (For clarity I import at each step, but you can of course put all the imports at the top.)
import numpy as np
import pandas as pd

# Read the CSV data
raw = pd.read_csv('fm_sample.csv')

# Separate the target column from the rest of the features
target_label = "rating"
data_y = raw[target_label]
data_X = raw.drop(target_label, axis=1)

# Preprocessing
## Use category_encoders, a convenient library for categorical data,
## together with scikit-learn's convenient functions
import category_encoders as ce

## One-hot encode the specified column
enc = ce.OneHotEncoder(cols=['category'])
X = enc.fit_transform(data_X)
y = data_y

# Data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=810)
For evaluation I use MAE (Mean Absolute Error); MSE and the like can be computed just as easily.
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Modeling
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# Evaluation
## Fit on the training data
mean_absolute_error(y_train, reg.predict(X_train))

## Error on the test data
mean_absolute_error(y_test, reg.predict(X_test))
It comes out roughly like this. Of course, for the evaluation part I would normally go further, e.g. plotting the fitted values and looking at how the errors are distributed, but this article only goes as far as getting things to run.
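For instance, a residual plot is one quick way to see how the errors are distributed. A minimal sketch with matplotlib, reusing reg, X_test, and y_test from above (my addition, not part of the original flow):

import matplotlib.pyplot as plt

# Residual = actual - predicted; points far from 0 are large errors
y_pred = reg.predict(X_test)
residuals = y_test - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(0, color='gray')
plt.xlabel('Predicted rating')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residual plot')
plt.grid(True)
plt.show()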
If the flow above works, the rest is done once the modeling part is switched over to fastFM. One point to note: **a DataFrame cannot be passed as-is, so csr_matrix is used**.
from fastFM import als
from scipy.sparse import csr_matrix

# Modeling
fm = als.FMRegression(n_iter=1000, init_stdev=0.1, rank=8, l2_reg_w=0.5, l2_reg_V=0.5, random_state=810)
fm.fit(csr_matrix(X_train), y_train)

# Evaluation
## Fit on the training data
mean_absolute_error(y_train, fm.predict(csr_matrix(X_train)))

## Error on the test data
mean_absolute_error(y_test, fm.predict(csr_matrix(X_test)))
Here `csr_matrix` makes its appearance. It is a data structure for handling sparse data. The idea is simple: where a DataFrame or an ordinary matrix holds 2D data like this:
matrix
array([
[0, 0, 1],
[0, 0, 0],
[0, 3, 0]
])
a sparse matrix holds only the entries that actually contain data:

sparse
Size: 3 x 3
Stored entries:
(0, 2): 1
(2, 1): 3
That is the rough idea. There are several sparse matrix types, such as csr_matrix, coo_matrix, csc_matrix, and lil_matrix, which differ in how they store the data and in processing speed, so if you are interested, search for something like "scipy sparse matrix" (note.nkmk.me has articles on this, for example).
What to remember this time:

- Converting from a DataFrame, example: `csr_matrix(df)`
- Converting a csr_matrix back to a dense matrix with todense, example: `csr_matrix(X_train).todense()`
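To make this concrete, here is a small sketch of the 3 x 3 example above (the exact printed layout may vary by scipy version):

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([
    [0, 0, 1],
    [0, 0, 0],
    [0, 3, 0],
]))

# Printing a sparse matrix shows only the stored entries as (row, col) value pairs
print(m)            # (0, 2) 1  and  (2, 1) 3
print(m.todense())  # back to an ordinary dense matrix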
Binary classification is also possible, so I will try it.
This time, the task is to **classify whether the rating is 4 or higher, or lower**.
Continuing from the preprocessing and data-split steps above, I create the answer labels (whether the rating is 4 or higher), then build and evaluate the model.
Also note that fastFM's classification uses labels of -1 or 1 instead of 0 or 1.
from fastFM import sgd
from sklearn.metrics import roc_auc_score

# Preprocessing, continued
## Set to 1 if the rating is 4 or higher, otherwise -1
y_ = np.array([1 if r > 3 else -1 for r in y])

## Create the training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y_, random_state=810)

# Modeling
fm = sgd.FMClassification(n_iter=5000, init_stdev=0.1, l2_reg_w=0,
                          l2_reg_V=0, rank=2, step_size=0.1)
fm.fit(csr_matrix(X_train), y_train)

## Two kinds of predicted values are available:
## predict returns the -1/1 labels, predict_proba returns probabilities
y_pred = fm.predict(csr_matrix(X_test))
y_pred_proba = fm.predict_proba(csr_matrix(X_test))

# Evaluation
## Example: evaluate with AUC
roc_auc_score(y_test, y_pred_proba)
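Since predict returns the -1/1 labels themselves, label-based metrics such as accuracy can also be computed directly (a small addition of mine):

from sklearn.metrics import accuracy_score

# y_pred holds -1/1 labels, so label-based metrics work as-is
print(accuracy_score(y_test, y_pred))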
Referring to a page on note.nkmk.me, let's also draw the ROC curve.
from sklearn import metrics
import matplotlib.pyplot as plt

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba, drop_intermediate=False)
auc = metrics.auc(fpr, tpr)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()
And with that, the classification example runs as well.
The sample data is pasted below. It was made up arbitrarily, so the data itself is not interesting; please use it to check that the code runs.
fm_sample.csv
category,rating,is_A,is_B,is_C,is_X,is_Y,is_Z
A,5,0,0,1,0,1,0
A,1,1,0,0,0,0,1
A,3,0,0,1,0,1,0
A,2,1,0,0,0,0,1
A,4,0,0,0,0,0,1
A,5,1,0,0,1,1,0
A,1,0,1,0,0,0,1
A,2,0,0,0,0,0,1
B,2,0,1,0,0,0,0
B,5,0,0,0,0,1,0
B,3,1,1,0,0,1,0
B,2,0,0,1,0,0,0
B,1,0,0,0,0,0,1
B,3,0,0,1,0,0,1
B,4,0,1,0,0,0,0
B,1,0,0,0,0,0,1
B,2,0,1,0,0,0,1
C,1,1,0,0,0,0,1
C,4,0,0,0,0,1,0
C,2,1,0,1,0,1,0
C,4,0,0,0,0,0,0
C,5,0,0,1,1,1,0
C,2,0,1,0,0,0,1
C,5,1,0,0,0,1,0
C,3,0,0,1,1,1,0
C,2,0,0,0,0,0,1
C,3,0,0,0,0,1,0
A,2,0,0,0,0,0,1
A,4,1,0,0,0,1,0
A,3,0,0,0,0,0,0
A,1,0,0,0,0,0,1
A,3,1,0,0,0,0,0
A,4,0,0,1,0,1,0
A,5,1,1,0,0,1,0
A,3,1,0,0,1,0,0
B,4,0,0,0,0,1,0
B,1,0,0,0,0,0,1
B,5,0,0,0,0,1,0
B,3,0,0,0,0,0,0
B,1,0,0,0,1,0,1
B,3,0,0,1,0,0,0
B,2,0,1,0,0,0,1
B,5,1,0,0,0,1,0
B,4,0,0,0,1,1,1
C,1,0,0,0,0,0,0
C,2,0,0,0,0,0,1
C,3,0,0,1,0,0,0
C,4,0,1,0,0,1,0
C,1,0,0,1,0,0,1
C,1,0,0,0,0,0,0
C,3,0,0,1,0,0,0
C,3,0,0,1,0,1,0
C,5,0,0,0,1,1,0
C,3,0,0,1,0,1,0