http://connpass.com/event/34680/
As usual I joined partway through, and since I had forgotten my MacBook I took these notes on my iPhone, so the writing may be a little rough.
Mr. Yasuaki Ariga (@chezou) of Cloudera
http://www.slideshare.net/Cloudera_jp/ibis-pandas-summerds
Demo on Jupyter notebook
scikit-learn comes in once the training data has been prepared
spark-sklearn
Can be installed with pip install ibis-framework
If you want to use Impala, Cloudera Director makes it easy to set up.
Mr. Haruka Naito of CyberAgent
The following three types of recommendation systems are used in Ameba.
Item to Item collaborative filtering
Based on ratings from users who are similar to each other
Based on the distance between items computed from user ratings; accuracy holds even for items with few ratings
Similarity is the co-occurrence count (number of overlapping users) divided by the product of the square roots of each item's user count
Distributed to each worker with broadcast variables, which removes the need for complicated joins
Create an item set (filter) in advance and filter the results with it
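The item-to-item similarity described above can be sketched in plain Python. This is an illustrative toy, not the actual Ameba/Spark implementation; the data and function names are assumptions.

```python
from collections import defaultdict
from math import sqrt

# Toy input: user -> set of items the user interacted with.
# In the real system this would come from service logs.
events = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}

# Invert to item -> set of users, so co-occurrence is a set intersection.
item_users = defaultdict(set)
for user, items in events.items():
    for item in items:
        item_users[item].add(user)

def similarity(x, y):
    """Co-occurrence count (overlapping users) divided by the product of
    the square roots of each item's user count, i.e. cosine similarity
    on binary interaction data."""
    co = len(item_users[x] & item_users[y])
    return co / (sqrt(len(item_users[x])) * sqrt(len(item_users[y])))

print(similarity("a", "b"))  # 2 / (sqrt(2) * sqrt(3)) ≈ 0.816
```

In a Spark job, the small item-to-users index is what would be shipped to workers as a broadcast variable, which is why the complicated joins disappear.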
Mr. Nagato Kasaki, DMM.com Lab
A story about operating the system after building it
Spark has been used since February 2015.
Jobs grew from 13 to 168 with 3 engineers; it was manageable because everything was automated
Resources grew about 1.5x, from 230 CPUs / 580 GB to 360 CPUs / 900 GB
Processing time went from 3h to 4h
Because there are many services, it is easy for new services to start using it.
Because the ratio of users to items varies greatly by service, each service also needs its own tuning.
The scale is on the order of 1 million users and 4 million products
They hold an item matrix spanning all services → cross-service recommendations also become possible
Two types of algorithms are used, chosen depending on the case
A recipe defines the parameter settings for Hive, Spark, and Sqoop in JSON.
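The talk did not show a concrete recipe, so the fragment below is only a hypothetical sketch of what such a JSON recipe might look like; every field name here is an assumption, not DMM's actual schema.

```json
{
  "service": "example-service",
  "hive":  { "query": "select user_id, item_id from logs" },
  "spark": { "executor_memory": "4g", "num_executors": 20 },
  "sqoop": { "export_table": "recommendations" }
}
```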
Precision tuning is done by actually deploying and A/B testing (there are academic evaluation metrics, but some things you cannot know without trying). Performance issues are easier to spot, so tune for performance in advance.
Data partitioning sometimes fails because of the 20:80 rule (in many cases the data stays skewed even after splitting). When the split works well, processing drops from 3 hours to 3 minutes.
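The talk does not spell out why the split fails, but the 20:80 point can be illustrated: when events are partitioned by key and one key dominates, the hot key's partition stays overloaded no matter how many partitions you add. A minimal sketch with toy data:

```python
from collections import Counter

# Toy event log following a 20:80 pattern: one hot key generates
# 80 of 100 events, the rest are spread over many keys.
events = ["hot_user"] * 80 + [f"user{i}" for i in range(20)]

# Hash-partitioning by key: all events for the same key land in the
# same partition, so the hot key's partition carries at least 80 events.
NUM_PARTITIONS = 4
load = Counter(hash(user) % NUM_PARTITIONS for user in events)

print(max(load.values()))  # at least 80, however many partitions we use
```

This is why "dividing" alone does not help: the division has to break up the skewed key itself, and when that works, the 3-hours-to-3-minutes speedup becomes possible.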
(The notes below are still rough)
LT (lightning talk) slot
Pitfalls a Spark beginner hit with recommendations
Submitting every 15 minutes exhausted the disk because the JAR is copied on each submit; they submit while recreating the cluster
Loading from BigQuery produces few partitions, so the executors cannot be fully used; repartitioning is important
The full cross product is not feasible because there are too many users, so users are processed together in sets
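The last point, processing users in sets instead of materializing the full user × item cross product, can be sketched in plain Python. The function name, batch size, and scoring function are all illustrative assumptions:

```python
import heapq

def top_k_per_user(users, items, score, k=2, batch_size=100):
    """Instead of building the full user x item cross product at once,
    walk the users in fixed-size batches and keep only the top-k
    scored items for each user."""
    results = {}
    for start in range(0, len(users), batch_size):
        for user in users[start:start + batch_size]:
            results[user] = heapq.nlargest(
                k, items, key=lambda item: score(user, item))
    return results

# Toy scoring function; the real system would use the trained model.
recs = top_k_per_user(["u1", "u2"], [1, 2, 3], score=lambda u, i: i, k=2)
print(recs)  # {'u1': [3, 2], 'u2': [3, 2]}
```

Keeping only top-k per batch bounds memory by batch size rather than by the number of users.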
Recommendation engine performance tuning using Spark
Look at the DAG visualization
If the data is not distributed, distribute it; do not shuffle large amounts of data
Cache RDDs that are used multiple times
When CPU is the bottleneck, there is an option to cache without serializing
KryoSerializer is about twice as fast
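Switching to Kryo is a one-line configuration change. A minimal spark-defaults.conf fragment (spark.serializer is the standard Spark property name; the 2x figure is the speaker's, not a guarantee):

```
# spark-defaults.conf (fragment): use Kryo instead of the default
# Java serializer, which the talk reports as roughly 2x faster
spark.serializer  org.apache.spark.serializer.KryoSerializer
```

The "cache without serializing" option corresponds to persisting with Spark's MEMORY_ONLY storage level (deserialized objects, cheaper on CPU) rather than MEMORY_ONLY_SER (serialized, cheaper on memory).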