In this article, as a hands-on exercise in machine learning with PySpark (following on from the articles above), I implement a simple recommendation system using MovieLens data. For the dataset, I downloaded ml-latest-small.zip from here, decompressed it, and used it. After a brief overview of recommendation systems, covering algorithm types and the data they use, we will implement the ALS model provided in Spark ML.
I pulled the jupyter/pyspark-notebook:latest image with Docker and used it as-is (Python 3.8.6, Spark 3.0.1). For details, see this article.
A recommendation system suggests products to users based on product information and behavior-history data. A familiar example is the "People who bought this product also bought..." feature often seen on e-commerce sites.
The data used in recommendation systems falls broadly into "product attribute information" and "user behavior history". The former is data that characterizes each product, such as its price, category, and description. The latter is each user's purchase and rating history. In the MovieLens dataset used in this article, movies.csv contains the attribute information and ratings.csv contains the behavior history.
| movieId | title | genres |
|---|---|---|
| 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |
| userId | movieId | rating | timestamp |
|---|---|---|---|
| 1 | 1 | 4.0 | 964982703 |
| 1 | 3 | 4.0 | 964981247 |
| 1 | 6 | 4.0 | 964982224 |
| 1 | 47 | 5.0 | 964983815 |
| 1 | 50 | 5.0 | 964982931 |
As for the algorithms used in recommendation systems, we will look at three typical ones and their advantages and disadvantages.
1. **Rule-based** Rules such as "if a user buys product A, recommend product B" or "recommend product C to users who have purchased five or more times" are defined in advance (a non-machine-learning method). It has the advantage of being simple and easy to understand, but the disadvantages are that setting up rules is laborious and that they become hard to maintain as the number of products and users grows.
2. **Content-based** Attribute information is used to compute the similarity between products, and products highly similar to a purchased product are recommended. A typical example is to extract features from product category information with TF-IDF or the like and compute the cosine similarity between the products' feature vectors. Compared with the rule-based approach it is easier to implement and more accurate, and because it does not use behavior-history information, the **cold start problem** does not occur (see 3. Collaborative filtering). On the other hand, since only similar products are recommended, encounters with unexpected products (**serendipity**) are unlikely, and its accuracy tends to be inferior to collaborative filtering when sufficient behavior-history data is available.
3. **Collaborative filtering** Similarity between products or between users is computed from users' behavior-history data and used for recommendation; the former is called item-based, the latter user-based. The behavior-history data forms a very large (number of items) x (number of users) matrix, most of which is sparse, i.e., null (replaced with 0 in calculations). Therefore **matrix factorization** is often applied to compress the dimensionality of the feature vectors; in collaborative filtering, **NMF (non-negative matrix factorization)**, which constrains all elements of the factor matrices to be non-negative, is commonly used. The advantages are that **serendipity** is more likely to occur and accuracy tends to be higher than content-based methods; the disadvantages are that it does not work well when behavior-history data is scarce (the **cold start problem**) and that accuracy for new products and new users suffers.
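To make item 2 above concrete, here is a minimal content-based sketch in plain numpy: genre strings are turned into one-hot vectors and compared by cosine similarity. The vectors below are hypothetical illustrations built from the genre labels, not data read from MovieLens.

```python
import numpy as np

# Hypothetical one-hot genre vectors over
# [Adventure, Animation, Children, Comedy, Fantasy, Drama]
toy_story = np.array([1., 1., 1., 1., 1., 0.])  # Adventure|Animation|Children|Comedy|Fantasy
jumanji   = np.array([1., 0., 1., 0., 1., 0.])  # Adventure|Children|Fantasy
waiting   = np.array([0., 0., 0., 1., 0., 1.])  # Comedy|Drama

def cos_sim(a, b):
    # Cosine similarity: dot product normalized by the vector lengths
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Movies sharing more genres score higher
print(cos_sim(toy_story, jumanji))  # ~0.775
print(cos_sim(toy_story, waiting))  # ~0.316
```

In a real content-based system the one-hot vectors would be replaced by TF-IDF features, but the similarity computation is the same.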
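And as a toy version of the matrix factorization in item 3: the sketch below factorizes a small made-up rating matrix R into non-negative factors W and H using the classic multiplicative-update rule for NMF. Note that this naive version treats unobserved entries as literal zeros, whereas the ALS implementation used later handles missing entries properly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (users x items) rating matrix; 0 means "not rated" (made-up values)
R = np.array([[5., 4., 0., 1.],
              [4., 5., 0., 2.],
              [1., 1., 5., 4.],
              [0., 1., 5., 4.]])
k = 2  # number of latent dimensions (the "rank")

# Random non-negative initialization of the factor matrices
W = rng.random((R.shape[0], k)) + 0.1   # user factors
H = rng.random((k, R.shape[1])) + 0.1   # item factors

# Lee-Seung multiplicative updates keep every element non-negative
for _ in range(300):
    H *= (W.T @ R) / (W.T @ W @ H + 1e-9)
    W *= (R @ H.T) / (W @ H @ H.T + 1e-9)

R_hat = W @ H  # dense reconstruction: estimated ratings for every cell
```

The zeros in R get filled in with estimates in R_hat, which is exactly what makes factorization usable for recommendation.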
For the basics of handling DataFrames, please see the articles above.
```python
# Launch a session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ALS').getOrCreate()

# Read the data
ratings = spark.read.csv('ratings.csv', header=True, inferSchema=True)
movies = spark.read.csv('movies.csv', header=True, inferSchema=True)

# Check the shape and schema of ratings
print('ratings_shape:', (ratings.count(), len(ratings.columns)))
ratings.printSchema()
>>>
ratings_shape: (100836, 4)
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)

# Display the first 5 ratings
ratings.show(5)
>>>
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
```
```python
# Check the shape and schema of movies
print('movies_shape:', (movies.count(), len(movies.columns)))
movies.printSchema()
>>>
movies_shape: (9742, 3)
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

# Display the first 5 movies
movies.show(5)
>>>
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+

# Count the number of users
ratings.select('userId').distinct().count()
>>>
610

# Count the number of movies that have at least one rating
ratings.select('movieId').distinct().count()
>>>
9724
```
ALS, or alternating least squares, is a matrix factorization method (with its non-negativity option it can serve as NMF). It is currently the only recommendation algorithm implemented in Spark ML, so that is what we will use. The argument `rank` specifies the number of latent dimensions to compress to. Also, when implementing NMF with scikit-learn, the ratings data must first be reshaped into a matrix (pivot table) with movieId along one axis and userId along the other, but Spark ML's ALS accepts the data as-is: you simply specify `userCol` and `itemCol`.
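For contrast, the pivoting that a scikit-learn-style NMF would require can be sketched with pandas on a few made-up rows (the column names follow ratings.csv; the values are illustrative):

```python
import pandas as pd

# A few long-format rows like those in ratings.csv (values made up)
ratings_long = pd.DataFrame({
    'userId':  [1, 1, 2, 2],
    'movieId': [1, 3, 1, 6],
    'rating':  [4.0, 4.0, 3.0, 5.0],
})

# Long format -> userId x movieId matrix; unobserved cells become 0
R = ratings_long.pivot_table(index='userId', columns='movieId',
                             values='rating', fill_value=0)
print(R.shape)  # (2, 3): 2 users x 3 movies
```

With the full MovieLens data this matrix would be 610 x 9724 and almost entirely zeros, which is exactly the preprocessing step Spark ML's ALS lets you skip.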
```python
from pyspark.ml.recommendation import ALS

# Create the model
als = ALS(
    rank=20,
    maxIter=10,
    regParam=0.1,
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
    seed=0
)

# Train the model
als_model = als.fit(ratings)
```
The feature vectors for movies and users can be obtained as DataFrames via `.itemFactors` and `.userFactors`, respectively. Checking their shapes confirms that both are 20-dimensional.
```python
# Get the movie feature vectors (W) and the user feature vectors (H)
W_movies = als_model.itemFactors
H_users = als_model.userFactors

# Check the shapes of W and H
W = W_movies.select('features').collect()
H = H_users.select('features').collect()
print('W:', (len(W), len(W[0][0])))
print('H:', (len(H), len(H[0][0])))
>>>
W: (9724, 20)
H: (610, 20)
```
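The point of these factors: ALS estimates a user's rating of a movie as the inner product of the corresponding rows of H and W. A plain-numpy illustration with made-up 20-dimensional vectors (not the actual learned factors):

```python
import numpy as np

rng = np.random.default_rng(0)
user_vec  = rng.random(20)   # hypothetical row of userFactors (H)
movie_vec = rng.random(20)   # hypothetical row of itemFactors (W)

# Predicted rating = dot product of the two factor vectors
predicted_rating = float(user_vec @ movie_vec)
print(predicted_rating)
```

In practice Spark computes this for you, e.g. via `als_model.transform()` or the `recommendFor...` methods used below.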
As a test, let's check similarities using the movie feature vectors. movieId = 1, 3114, and 176579 correspond to "Toy Story", "Toy Story 2", and "Cage Dive", so let's compare the similarity among these three movies.
```python
# Look at the following three movies
movies.filter('movieId IN (1, 3114, 176579)').show()
>>>
+-------+------------------+--------------------+
|movieId|             title|              genres|
+-------+------------------+--------------------+
|      1|  Toy Story (1995)|Adventure|Animati...|
|   3114|Toy Story 2 (1999)|Adventure|Animati...|
| 176579|  Cage Dive (2017)|Drama|Horror|Thri...|
+-------+------------------+--------------------+
```
Cosine similarity is not currently implemented in Spark ML (it appears to have been available in MLlib), so here we simply convert the vectors to numpy arrays and compute it ourselves.
```python
# Compute cosine similarity with numpy
import numpy as np
tmp = W_movies.filter('id IN (1, 3114, 176579)').orderBy('id').select('features').collect()

# Get the feature vector of each movie
toystory1_vec = np.array(tmp[0][0])
toystory2_vec = np.array(tmp[1][0])
cagedive_vec = np.array(tmp[2][0])

# Compute the cosine similarities
print('Similarity between Toy Story and Toy Story 2:', (toystory1_vec @ toystory2_vec) / (np.linalg.norm(toystory1_vec) * np.linalg.norm(toystory2_vec)))
print('Similarity between Toy Story and Cage Dive:', (toystory1_vec @ cagedive_vec) / (np.linalg.norm(toystory1_vec) * np.linalg.norm(cagedive_vec)))
>>>
Similarity between Toy Story and Toy Story 2: 0.9472396175366493
Similarity between Toy Story and Cage Dive: 0.7528829246037524
```
Looking at the results, the similarity between "Toy Story" and "Toy Story 2" is considerably higher than that between "Toy Story" and "Cage Dive".
You can use `.recommendForAllUsers()` to get the recommended movies for each user in descending order of predicted preference. The argument `numItems` specifies how many of the top items to retrieve.
```python
# Recommended movies for userId = 100
tmp = als_model.recommendForAllUsers(numItems=10).filter('userId = 100')
tmp.select('recommendations.movieId').collect()
>>>
[Row(movieId=[33649, 74282, 5867, 7121, 1066, 26528, 7071, 179135, 26073, 84273])]

# Check the details of the recommended movies
movies.filter('movieId IN (106642, 33649, 87234, 1046, 93988, 2843, 171495, 1755, 318, 177593)').show()
>>>
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|    318|Shawshank Redempt...|         Crime|Drama|
|   1046|Beautiful Thing (...|       Drama|Romance|
|   1755|Shooting Fish (1997)|      Comedy|Romance|
|   2843|Black Cat, White ...|      Comedy|Romance|
|  33649|  Saving Face (2004)|Comedy|Drama|Romance|
|  87234|    Submarine (2010)|Comedy|Drama|Romance|
|  93988|North & South (2004)|       Drama|Romance|
| 106642|Day of the Doctor...|Adventure|Drama|S...|
| 171495|              Cosmos|  (no genres listed)|
| 177593|Three Billboards ...|         Crime|Drama|
+-------+--------------------+--------------------+
```
Movies classified under genres such as 'Drama', 'Romance', and 'Comedy' are mostly being recommended. Next, let's look at the movies that the user with userId = 100 has rated highly. Here we filter to only the movies given a rating of 5.
```python
# Filter to userId = 100 and rating = 5
tmp = ratings.filter('userId = 100 AND rating = 5')

# Join with the movies data
tmp.join(movies, tmp.movieId == movies.movieId, how='inner').select(['title', 'genres']).show()
>>>
+--------------------+--------------+
|               title|        genres|
+--------------------+--------------+
|      Top Gun (1986)|Action|Romance|
|Terms of Endearme...|  Comedy|Drama|
|Christmas Vacatio...|        Comedy|
|Officer and a Gen...| Drama|Romance|
|Sweet Home Alabam...|Comedy|Romance|
+--------------------+--------------+
```
This confirms that the user has so far rated movies in genres such as 'Drama', 'Romance', and 'Comedy' highly, so the recommendations do seem to match their tastes. Also noteworthy is that "Cosmos" (movieId = 171495) appears among the top recommendations. Since its genres field is (no genres listed), it probably would not have ranked high under a content-based algorithm.
Above we made recommendations from the user's point of view, but you can use `.recommendForAllItems()` to recommend from the item's point of view. For example, when you want to promote a certain product efficiently, you can use it to find which users are most likely to buy it.
```python
# Show the 3 users most likely to rate movieId = 100 highly
tmp = als_model.recommendForAllItems(numUsers=3).filter('movieId = 100')
tmp.select('recommendations.userId').collect()
>>>
[Row(userId=[429, 584, 35])]
```
In this article, I implemented collaborative-filtering recommendation in PySpark. At present only the ALS algorithm used here is implemented in Spark ML, but other algorithms are available in external packages and in services provided by the various cloud vendors, so I hope to try those in the future. Thank you for reading.
References:
- https://toukei-lab.com/recommend-algorithm
- https://ohke.hateblo.jp/entry/2018/09/08/233000