http://connpass.com/event/34680/
As usual I joined partway through, and since I had forgotten my MacBook I took these notes on my iPhone, so the writing may be a little rough.
Mr. Yasuaki Ariga (@chezou) of Cloudera
http://www.slideshare.net/Cloudera_jp/ibis-pandas-summerds
Demo on Jupyter notebook
scikit-learn comes in once the training data has been prepared
spark-sklearn
Can be installed with pip install ibis-framework
If you want to use Impala, Cloudera Director makes it easy to set up.
Mr. Haruka Naito of CyberAgent
The following three types of recommendation systems are used in Ameba.
Item to Item collaborative filtering
Based on ratings from users who are similar to each other
Based on the distance between items computed from user ratings; accuracy holds even for items with few ratings
Similarity is the co-occurrence count (number of overlapping users) divided by the product of the square roots of each item's user count
Distributed to each worker with broadcast variables, which removes the need for complicated joins
Create an item set (filter) in advance and filter the results with it
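The item-to-item similarity described above can be sketched in plain Python. This is an illustrative toy, not the actual Ameba/Spark implementation; the data and function names are assumptions.

```python
from collections import defaultdict
from math import sqrt

# Toy input: user -> set of items the user interacted with.
# In the real system this would come from service logs.
events = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}

# Invert to item -> set of users, so co-occurrence is a set intersection.
item_users = defaultdict(set)
for user, items in events.items():
    for item in items:
        item_users[item].add(user)

def similarity(x, y):
    """Co-occurrence count (overlapping users) divided by the product of
    the square roots of each item's user count, i.e. cosine similarity
    on binary interaction data."""
    co = len(item_users[x] & item_users[y])
    return co / (sqrt(len(item_users[x])) * sqrt(len(item_users[y])))

print(similarity("a", "b"))  # 2 / (sqrt(2) * sqrt(3)) ≈ 0.816
```

In a Spark job, the small item-to-users index is what would be shipped to workers as a broadcast variable, which is why the complicated joins disappear.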
Mr. Nagato Kasaki, DMM.com Lab
A story about operating the system after building it
Spark has been used since February 2015.
Jobs grew from 13 to 168 with 3 engineers; it was manageable because everything was automated
Resources grew about 1.5x, from 230 CPUs / 580 GB to 360 CPUs / 900 GB
Processing time went from 3h to 4h
Because there are many services, it is easy for new services to start using it.
Because the ratio of users to items varies greatly by service, each service also needs its own tuning.
The scale is on the order of 1 million users and 4 million products
They hold an item matrix spanning all services → cross-service recommendations also become possible
Two types of algorithms are used, chosen depending on the case
A recipe defines the parameter settings for Hive, Spark, and Sqoop in JSON.
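The talk did not show a concrete recipe, so the fragment below is only a hypothetical sketch of what such a JSON recipe might look like; every field name here is an assumption, not DMM's actual schema.

```json
{
  "service": "example-service",
  "hive":  { "query": "select user_id, item_id from logs" },
  "spark": { "executor_memory": "4g", "num_executors": 20 },
  "sqoop": { "export_table": "recommendations" }
}
```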
Precision tuning is done by actually deploying and A/B testing (there are academic evaluation metrics, but some things you cannot know without trying). Performance issues are easier to spot, so tune for performance in advance.
Data partitioning sometimes fails because of the 20:80 rule (in many cases the data stays skewed even after splitting). When the split works well, processing drops from 3 hours to 3 minutes.
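The talk does not spell out why the split fails, but the 20:80 point can be illustrated: when events are partitioned by key and one key dominates, the hot key's partition stays overloaded no matter how many partitions you add. A minimal sketch with toy data:

```python
from collections import Counter

# Toy event log following a 20:80 pattern: one hot key generates
# 80 of 100 events, the rest are spread over many keys.
events = ["hot_user"] * 80 + [f"user{i}" for i in range(20)]

# Hash-partitioning by key: all events for the same key land in the
# same partition, so the hot key's partition carries at least 80 events.
NUM_PARTITIONS = 4
load = Counter(hash(user) % NUM_PARTITIONS for user in events)

print(max(load.values()))  # at least 80, however many partitions we use
```

This is why "dividing" alone does not help: the division has to break up the skewed key itself, and when that works, the 3-hours-to-3-minutes speedup becomes possible.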
(The notes below are still rough)
LT (lightning talk) slot
Pitfalls a Spark beginner hit with recommendations
Submitting every 15 minutes exhausted the disk because the JAR is copied on each submit; they submit while recreating the cluster
Loading from BigQuery produces few partitions, so the executors cannot be fully used; repartitioning is important
The full cross product is not feasible because there are too many users, so users are processed together in sets
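The last point, processing users in sets instead of materializing the full user × item cross product, can be sketched in plain Python. The function name, batch size, and scoring function are all illustrative assumptions:

```python
import heapq

def top_k_per_user(users, items, score, k=2, batch_size=100):
    """Instead of building the full user x item cross product at once,
    walk the users in fixed-size batches and keep only the top-k
    scored items for each user."""
    results = {}
    for start in range(0, len(users), batch_size):
        for user in users[start:start + batch_size]:
            results[user] = heapq.nlargest(
                k, items, key=lambda item: score(user, item))
    return results

# Toy scoring function; the real system would use the trained model.
recs = top_k_per_user(["u1", "u2"], [1, 2, 3], score=lambda u, i: i, k=2)
print(recs)  # {'u1': [3, 2], 'u2': [3, 2]}
```

Keeping only top-k per batch bounds memory by batch size rather than by the number of users.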
Recommendation engine performance tuning using Spark
Look at the DAG visualization
If the data is not distributed, distribute it; do not shuffle large amounts of data
Cache RDDs that are used multiple times
When CPU is the bottleneck, there is an option to cache without serializing
KryoSerializer is about twice as fast
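Switching to Kryo is a one-line configuration change. A minimal spark-defaults.conf fragment (spark.serializer is the standard Spark property name; the 2x figure is the speaker's, not a guarantee):

```
# spark-defaults.conf (fragment): use Kryo instead of the default
# Java serializer, which the talk reports as roughly 2x faster
spark.serializer  org.apache.spark.serializer.KryoSerializer
```

The "cache without serializing" option corresponds to persisting with Spark's MEMORY_ONLY storage level (deserialized objects, cheaper on CPU) rather than MEMORY_ONLY_SER (serialized, cheaper on memory).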