It's been three years since I joined the company, so here is a summary of what I've built so far, along with my own notes.
Due to the short development periods, I was not able to understand and tune each system in detail. Each platform was built by one person with whatever information I could examine within that period, so please note there are probably many points where you'd say, "No, no, if you tuned it properly, you wouldn't get that result." Also, three years ago I had never touched Linux or SQL, so the systems I was in charge of at the beginning were not particularly well made.
An analysis platform for viewing KPIs across our series of games. Logs are loaded into Redshift with fluentd and aggregated by a daily batch, and the aggregation results are stored in separate Redshift tables. Ruby was used for the batch and CakePHP for the framework. Since fluentd is written in Ruby, I chose Ruby so that I could also write fluentd plugins. CakePHP was company culture.
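To make the daily batch concrete, here is a minimal sketch of the idea, assuming hypothetical table, column, and connection names (the real batch was written in Ruby; Python is used here for consistency with the later generations):

```python
# Minimal sketch of the gen-1 daily batch: aggregate the raw logs that fluentd
# has already loaded into Redshift into a daily summary table.
# All table, column, and connection names are hypothetical.
import psycopg2

DAILY_DAU_SQL = """
    INSERT INTO kpi_daily_dau (log_date, game_id, dau)
    SELECT log_date, game_id, COUNT(DISTINCT user_id)
    FROM login_log
    WHERE log_date = %s
    GROUP BY log_date, game_id;
"""

def run_daily_batch(target_date):
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.ap-northeast-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="batch_user", password="...",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(DAILY_DAU_SQL, (target_date,))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run_daily_batch("2013-11-01")
```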
Approximately 6 months, from October 2013 to March 2014 (in parallel with other work, including the verification period).
All I really remember is desperately learning as I went, since it was my first time touching any of it...
The maintenance cost of Redshift was high for the small amount of data we had (at the time only Dense Storage nodes were available, so the cost performance was poor). The aggregation also took longer than expected. With the rather loose motivation of "Hadoop is popular these days, let's evaluate it," I verified that the aggregation could be done more cheaply, and it was introduced on that basis.
An analysis platform that also covers KPIs for games outside the series. A senior colleague recommended it with "Python has a rich set of statistics libraries, so if you're an analysis engineer, why not Python?", so I used Python for the batch and Django for the framework. The aggregation itself was done with Hive on EMR.
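A minimal sketch of what such a Hive query might look like, with hypothetical table names and S3 paths, held in a Python string (in practice it would be saved to S3 and executed as a Hive step on the cluster):

```python
# Sketch of a gen-2 style aggregation: a HiveQL query over tab-separated logs
# in S3, computing DAU per day. Tables, columns, and S3 paths are hypothetical.
TARGET_DATE = "2014-06-01"

DAU_HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS login_log (
    user_id STRING,
    logged_in_at STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-bucket/login_log/';

SELECT dt, COUNT(DISTINCT user_id) AS dau
FROM login_log
WHERE dt = '{date}'
GROUP BY dt;
""".format(date=TARGET_DATE)

# In practice this would be uploaded to S3 and run as a Hive step on a
# transient EMR cluster, with the result written back to S3 or a DB.
print(DAU_HQL)
```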
Approximately 14 months, from April 2014 to June 2015 (in parallel with other work, including the verification period; visualization was handled by someone else).
I don't think the aggregation time changed dramatically (I didn't keep detailed notes...), but with EMR the cluster only needs to run for the duration of the aggregation, so the total cost came down to roughly half. I also think that starting with Python here turned out to be the right call for doing Spark and machine learning later. It's easy to write!
I wanted to touch new technology, and Spark was becoming a hot topic, so I decided to try it first. If the cache works well, the aggregation time should shrink; if the aggregation time is short, the aggregation cost can be expected to drop; and since Spark can be driven from Python, migration looked easy.
Other than switching from Hive to Spark, it's almost the same as the 2nd generation. I built an RDD-based version on Spark 1.2 and a DataFrame-based version on Spark 1.4.
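As a rough illustration of the two styles, here is the same DAU count sketched RDD-style and DataFrame-style. It is written against the current PySpark API (the originals targeted Spark 1.2 / 1.4 with SparkContext and SQLContext), and the paths and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dau-sketch").getOrCreate()
sc = spark.sparkContext

# RDD style (Spark 1.2 era): parse tab-separated lines of "dt<TAB>user_id"
logins_rdd = sc.textFile("s3://example-bucket/login_log/dt=2015-07-01/")
dau_rdd = (logins_rdd
           .map(lambda line: tuple(line.split("\t")[:2]))  # (dt, user_id)
           .distinct()                                      # one row per user per day
           .map(lambda pair: (pair[0], 1))
           .reduceByKey(lambda a, b: a + b))                # (dt, dau)

# DataFrame style (Spark 1.4 era): the same aggregation, declaratively
logins_df = spark.read.json("s3://example-bucket/login_log_json/dt=2015-07-01/")
dau_df = logins_df.groupBy("dt").agg(F.countDistinct("user_id").alias("dau"))

print(dau_rdd.collect())
dau_df.show()
```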
Approximately 6 months, from June 2015 to December 2015 (in parallel with other work, including the verification period).
Documentation was scarce, so the build was a lot of trial and error. Even when I went to seminars, most of the companies around me were still at the stage of considering adoption, so there wasn't much I could ask about... I think I acquired the skill of cleaning up Google Translate output in order to read unfamiliar English articles. With Spark, iterative processing is fast if you use the cache function (for example, caching the login log when splitting DAU into various segments), but the first pass, before the cache is populated, takes time.
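The caching pattern mentioned above looks roughly like this; the column names and segments are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dau-segments").getOrCreate()

# Cache the login log once; the first action still pays the full load cost.
logins = spark.read.json("s3://example-bucket/login_log/dt=2015-07-01/")
logins.cache()

# Each of these reuses the cached data instead of reading S3 again.
dau_total = logins.select("user_id").distinct().count()
dau_by_level = (logins.groupBy("player_level")
                      .agg(F.countDistinct("user_id").alias("dau")))
dau_payers = (logins.filter(F.col("is_payer"))
                    .select("user_id").distinct().count())
```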
While building various things, I started to think, "Hang on, with this amount of data, do I even need Hive or Spark in the first place?" By reviewing the aggregation algorithm and designing the DB tables properly, I verified that the current amount of data posed no problem without them. Also, since Spark's cache doesn't help on the first pass, it looked like doing the processing on the DB and a single server could actually shorten the aggregation time.
The processing content is the same as the previous generation, but changed to run as multiple processes in Python. Instead of loading all the logs into the DB, I load only the logs needed for aggregation and delete them once they are no longer needed (the log files themselves remain in S3).
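A minimal sketch of that multi-process batch, with a hypothetical per-title aggregation function standing in for the real load, aggregate, and clean-up steps:

```python
from multiprocessing import Pool

# Minimal sketch of the gen-4 idea: run the per-title aggregation jobs in
# parallel with Python's multiprocessing. aggregate_title and the title list
# are hypothetical placeholders.

def aggregate_title(args):
    title_id, target_date = args
    # In practice: copy only the logs this title needs from S3 into the DB,
    # run the aggregation SQL, then drop the working tables.
    return title_id

if __name__ == "__main__":
    target_date = "2016-01-15"
    titles = ["game_a", "game_b", "game_c"]
    with Pool(processes=4) as pool:
        finished = pool.map(aggregate_title, [(t, target_date) for t in titles])
    print(finished)
```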
Approximately 3 months, from December 2015 to February 2016 (in parallel with other work, including the verification period).
Compared to Spark, the aggregation time is less than half. There is no additional cost because we used a server and DB that were already running for other purposes. It feels like this approach is fine for a few GB of logs a day.
Even if the amount of data grows somewhat, this setup can cope, but there was the concern that it would stop being usable if the scope of analysis expanded and the data volume jumped all at once, and for ad hoc analysis on larger data I was launching a Spark cluster each time... So I verified that things would be easier if both the daily batch and ad hoc analysis could be done with BigQuery.
The processing content is the same as the previous generation. For the batch I used Luigi, a Python framework for building data pipelines.
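A minimal sketch of a Luigi pipeline in that spirit: one task loads the day's logs, another aggregates them, and Luigi resolves the dependency. Task names and targets are hypothetical, and local marker files stand in for the real load and query steps:

```python
import luigi

class LoadLogs(luigi.Task):
    target_date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"/tmp/load_{self.target_date}.done")

    def run(self):
        # In practice: load the day's logs from S3 into BigQuery here.
        with self.output().open("w") as f:
            f.write("loaded\n")


class AggregateDAU(luigi.Task):
    target_date = luigi.DateParameter()

    def requires(self):
        return LoadLogs(self.target_date)

    def output(self):
        return luigi.LocalTarget(f"/tmp/dau_{self.target_date}.tsv")

    def run(self):
        # In practice: run the DAU query against BigQuery and save the result.
        with self.output().open("w") as f:
            f.write("dt\tdau\n")


if __name__ == "__main__":
    luigi.run()
```

A run like `python pipeline.py AggregateDAU --target-date 2016-02-20 --local-scheduler` would then execute LoadLogs first and AggregateDAU afterwards.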
From February 2016 to the present
Creating tables takes a long time. Even when we asked Google about it, the answer was that it takes time, so I'm hoping for improvement in the future. The limit of 1,000 tables per query is also hard to work with at the moment. For titles that are more than 1,000 days past release, you cannot simply query each user's cumulative billing amount as of day X and day Y. You have to work around the 1,000-table limit, for example by creating a table for each day that holds "cumulative billing amount up to the previous day + billing amount for that day" (see the sketch below).
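A sketch of that workaround using the google-cloud-bigquery client; the dataset and table names are hypothetical, and the query is written in standard SQL for readability (the platform at the time used per-day tables with legacy SQL):

```python
from google.cloud import bigquery

# Build each day's cumulative-billing table from the previous day's cumulative
# table plus that day's billing log, so a later query only has to read one
# table per target date instead of every daily table since release.

def build_cumulative_table(client, dataset, day, prev_day):
    query = f"""
        SELECT user_id, SUM(amount) AS cumulative_amount
        FROM (
            SELECT user_id, cumulative_amount AS amount
            FROM `{client.project}.{dataset}.cumulative_billing_{prev_day}`
            UNION ALL
            SELECT user_id, amount
            FROM `{client.project}.{dataset}.billing_log_{day}`
        ) AS combined
        GROUP BY user_id
    """
    job_config = bigquery.QueryJobConfig(
        destination=f"{client.project}.{dataset}.cumulative_billing_{day}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(query, job_config=job_config).result()  # wait for completion

if __name__ == "__main__":
    client = bigquery.Client()
    build_cumulative_table(client, "kpi", "20160301", "20160229")
```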
On the other hand, the aggregation time is short, and even with all the data loaded there is almost no management cost, so the fact that ad hoc analysis is easy is a big advantage. The first 1 TB of query processing each month is free, so while the amount of data is small the query cost is nearly zero.
When the logs to be aggregated are on the order of several GB:
| Item | Introduction cost | Operating cost | Ease of ad hoc analysis |
|---|---|---|---|
| Redshift | ◯ | ✕ (roughly △ if you use SSD nodes) | ◯ |
| Hive | ◯ | △ | △ |
| Spark | △ | ◯ | △ |
| SQL + multiprocessing | △ | ◎ | △ |
| BigQuery | △ (table creation is slow) | ? (final cost still unknown; probably ◯) | ◯ |
[Introduction cost] ◯: Almost no cost if you can write SQL △: Points to consider beyond just writing SQL
[Operating cost] ◎: No additional cost because a server and DB already running for other purposes are used ◯: 50,000 yen or less △: A little under 100,000 yen ✕: Over 100,000 yen
[Ease of ad hoc analysis] ◯: No problem as long as you can write SQL △: Some difficulties, such as slow responses or having to think about caching
Depending on the amount of data and the purpose of the analysis, none of these may be the right fit, and there are also cases where on-premises beats the cloud. In the end, it comes down to the right tool for the right job. Personally, I like Spark and BigQuery. Spark has a machine learning library, and it's nice that it's flexible enough to be driven from Scala and Python as well as SQL; I just wish the processing speed were better. BigQuery needs no fine tuning and is low cost, so it's easy to adopt; I hope its current problems are resolved soon. Looking back, I was doing a major rebuild roughly once every six months. I haven't been able to dig deeply into any single technology, but I'm grateful for an environment that let me experience this much in three years.