As a portfolio project, I am currently developing an article posting service, and one of its features displays daily / weekly / monthly rankings on screen (hereafter, the ranking display feature). This article describes the table design (plus a little extra) that went into building it.

I wrote this article to document my thinking process, in case it is useful to my future self, or to anyone else who gets the chance to develop a similar feature.

As a premise, the ranking display feature shows five rankings: daily, weekly, monthly, overall, and trend. Each ranking is determined as follows.
Type | Evaluation criteria |
---|---|
Daily ranking | Total "likes" for the day |
Weekly ranking | Total "likes" over 1 week (7 days) |
Monthly ranking | Total "likes" over 1 month (30 days) |
Overall ranking | Total "likes" over the entire period |
Trend ranking | Score computed from daily "likes" and daily accesses |
Since the rankings do not need to reflect the very latest state at all times, it is acceptable to run the aggregation once per fixed interval.

This chapter describes the thinking process behind the table design for the ranking display feature. Below, the "number of likes" and "number of accesses" are collectively referred to as "count data".

Worrying about performance from the start only complicates the thinking, so I set it aside at first. Simpler is always better, so I began by asking whether the tables could be kept simple.

From the feature overview, the following are the candidate tables related to count data, but are separate tables really necessary for the weekly / monthly / overall data?

- Daily count data
- Weekly count data
- Monthly count data
- Count data for the entire period

The minimum unit of data needed to aggregate the weekly / monthly / overall counts is the daily count data. As long as each day's counts are recorded, the program can compute the count for any period from them. Based on this idea, the feature itself can be realized with the following table.
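As a minimal sketch of this idea (all names here are illustrative, not the actual schema), the totals for any period can be derived in application code from nothing but per-day counts:

```java
import java.util.List;

public class PeriodTotals {
    // Given per-day like counts ordered from newest to oldest,
    // sum the most recent `days` entries.
    static int totalForPeriod(List<Integer> dailyCounts, int days) {
        return dailyCounts.stream()
                .limit(days)
                .mapToInt(Integer::intValue)
                .sum();
    }

    public static void main(String[] args) {
        // 10 days of fake counts, newest first
        List<Integer> daily = List.of(3, 5, 2, 0, 7, 1, 4, 2, 2, 6);
        System.out.println(totalForPeriod(daily, 1));             // daily total
        System.out.println(totalForPeriod(daily, 7));             // weekly total
        System.out.println(totalForPeriod(daily, daily.size()));  // overall total
    }
}
```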
Figure 1. Article table
The data in the table in Figure 1 falls into the following two largely independent use cases.
Data type | Phase where the data is used | Update trigger |
---|---|---|
Count data | Displaying the ranking | Pressing the like button / accessing the article page |
Everything else | Displaying the article page | Creating / editing the article |
With the single table in Figure 1, the non-count columns cannot be updated while the count data is being updated, since both live in the same row. By splitting the table as follows, both kinds of data can be updated concurrently.
Figure 2. Tables after splitting off the count data
From here, we examine performance based on the tables in Figure 2. Needless to say, the tables in Figure 2 still have performance problems. I saw two major problems and considered a design to solve each.

**(Problem 1) The ranking is aggregated every time the page is displayed, so the aggregation time is large.**
**(Problem 2) Daily count data is stored per article, so the growing number of records degrades performance.**

Holding per-day count data makes it possible to compute the weekly / monthly / overall counts, but the flip side is that the counts for each period must be aggregated every time a user accesses a page that displays a ranking.

To solve this, I decided to keep the aggregated counts for 1 day / 1 week / 1 month / the entire period in a separate table, so that displaying a ranking only requires reading that table (the aggregation itself runs at fixed intervals as a scheduled job).
Figure 3. After adding the count data aggregation table to the tables in Figure 2.

With this, per-view aggregation is no longer necessary: the rating table handles additions and updates of the daily count data, while the aggregate_points table handles reads of the count data (when displaying a ranking).
In the table in Figure 1, the per-day count data is held per article, so the number of records grows in proportion to (number of posted articles) × (number of days in service). This eats DB storage and slows down data searches. I therefore considered two measures to reduce the number of records.

**(Measure 1) Keep at most 30 days of records per article.** The count data for the entire period can be kept up to date simply by continuing to add each day's counts to it, so there is no problem discarding count data older than 30 days (the all-time counts are held in a separate table, as in Figure 2).

**(Measure 2) Consolidate the past 30 days into one record.** Even if the table in Figure 1 is trimmed to the past 30 days, the record count is still (total posted articles) × 30 (300,000 records for 10,000 articles), so it can hardly be called reduced. Moreover, to get the monthly total, the article's 30 records must be fetched every time with SQL like the following, so the more data there is, the longer the search takes.
```sql
SELECT * FROM rating WHERE article_id = [id of the article];
```
In that case, I thought, if the 30 days of data were combined into a single set from the start, the work of fetching 30 rows could be saved and the record count could be held to the total number of posted articles.

The current day's count data is updated frequently, but the rest of the history only ever has the newest day appended and the oldest day removed. So if the past 30 days are stored in a format that lets the application perform those operations easily, there is no problem saving them as a single record.
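A minimal sketch of those two operations, assuming an in-memory deque stands in for the stored record (class and field names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ThirtyDayWindow {
    static final int MAX_DAYS = 30;

    // One day's counts; field names are illustrative.
    record DayCount(String date, int favoriteCount) {}

    // Newest day at the head, oldest at the tail.
    final Deque<DayCount> window = new ArrayDeque<>();

    // Called once per day by the scheduled job: append today's counts,
    // then evict the oldest day once the window exceeds 30 entries.
    void rollOver(DayCount today) {
        window.addFirst(today);
        if (window.size() > MAX_DAYS) {
            window.removeLast();
        }
    }

    public static void main(String[] args) {
        ThirtyDayWindow w = new ThirtyDayWindow();
        for (int day = 1; day <= 35; day++) {
            w.rollOver(new DayCount("day-" + day, day)); // fake dates and counts
        }
        System.out.println(w.window.size());                      // capped at 30
        System.out.println(w.window.peekLast().favoriteCount());  // oldest surviving day
    }
}
```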
Some DBs can store JSON or XML data (PostgreSQL, used here, supports both), and JSON/XML is easy to handle on the application side, so the 30 days of data are stored as a single record in JSON or XML format. There is a concern that JSON/XML storage degrades performance, but the data is only modified at fixed intervals by the scheduled job, so the impact on overall performance is negligible.

Figure 4. After applying the record-reduction measures to the tables in Figure 2.

- Example of the past 30 days of count data in JSON format
```json
{
  "rating_info": [
    {
      "favorite_count": "12",
      "date": "2019-04-17"
    },
    {
      "favorite_count": "15",
      "date": "2019-04-16"
    },
    ... (abridged)
  ]
}
```
- Example of the past 30 days of count data in XML format
```xml
<rating_info>
  <day_rating>
    <favorite_count>12</favorite_count>
    <date>2019-04-17</date>
  </day_rating>
  <day_rating>
    <favorite_count>15</favorite_count>
    <date>2019-04-16</date>
  </day_rating>
</rating_info>
```
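To make "easy to handle on the application side" concrete, here is a stdlib-only sketch that renders the 30-day history into the JSON shape shown above. A real implementation would likely use a JSON library such as Jackson; everything here is illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RatingJson {
    // One day's counts; field names mirror the JSON example above.
    record DayCount(String date, String favoriteCount) {}

    // Render the past days (newest first) into the stored JSON string.
    // Counts are kept as strings to match the example above.
    static String toJson(List<DayCount> days) {
        String entries = days.stream()
                .map(d -> String.format(
                        "{\"favorite_count\":\"%s\",\"date\":\"%s\"}",
                        d.favoriteCount(), d.date()))
                .collect(Collectors.joining(","));
        return "{\"rating_info\":[" + entries + "]}";
    }

    public static void main(String[] args) {
        List<DayCount> days = List.of(
                new DayCount("2019-04-17", "12"),
                new DayCount("2019-04-16", "15"));
        System.out.println(toJson(days));
    }
}
```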
Combining the results considered in Section 3.2 gives the following table.
Figure 5. Final table
As a result, the tables accessed by each user operation could be separated as follows, so by creating an index on each table keyed by id, performance can be maintained even when the tables hold a large amount of data.
Table | Role | Read trigger | Update trigger | Read frequency | Update frequency |
---|---|---|---|---|---|
daily_rating | Counts for the current day | Scheduled job | Displaying the article page / pressing the like button | Once a day (*1) | High (*2) |
history_rating | History of the past 30 days | Scheduled job | Scheduled job | Once a day (*1) | Once a day (*1) |
aggregate_points | Weekly / monthly / overall count data | Displaying the ranking page | Scheduled job | High | Once a day (*1) |
(*1) Assuming the scheduled job runs once every 24 hours. (*2) Records are also added/deleted when articles are added/deleted.
Apologies for suddenly jumping to the programming side. This project uses Java + Spring Framework, and since Spring Framework provides a feature for executing tasks on a schedule, we use that.
```java
@Scheduled(cron = "0 0 0 * * *", zone = "Asia/Tokyo")
public void updateAggregateData() {
    // ...
}
```
From left to right, the cron fields are: seconds (0-59), minutes (0-59), hours (0-23), day of month (1-31), month (1-12), and day of week (0-7, where 0 and 7 are Sunday, 1 is Monday, 2 is Tuesday, 3 is Wednesday, 4 is Thursday, 5 is Friday, and 6 is Saturday). So the expression in the code above runs the task every day at 00:00:00.
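A few illustrative Spring cron expressions following those field rules (not part of this project; just examples):

```text
0 0 0 * * *    → every day at 00:00:00 (the expression used above)
0 0 */6 * * *  → every 6 hours, on the hour
0 30 4 * * 1   → every Monday at 04:30:00
```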
To close, some loosely organized final thoughts.

As I wrote this up, I noticed the article's arc: at the start I removed the weekly / monthly count data on the grounds that it can be derived from the daily data, then, once performance entered the picture, I reversed course and brought it back. The lesson: there is no harm in putting the weekly / monthly count data into a table from the start.

This article was also written for the personal purpose of properly documenting what I, a developer who works overwhelmingly by feel and mental images, had figured out in my head. I checked for serious mistakes, but it took me just under 24 hours to write this much, which reconfirmed that I am not good at writing documents (as a student I turned out 6-8 pages of Word material of this size every week for 3 years, but that heyday is long past and I have lost the knack). Output remains a challenge for me.

It is no exaggeration to say I am a beginner at table design and DBs, so I expect there is a lot to learn from reading material such as thorough guides to DB design and SQL antipatterns. I plan to build practical skill through the development of the article posting service behind this article, and of the table- and column-heavy service planned next, and to post what I learn here from time to time.
That's all.
(1) Splitting a table with a large number of columns into a 1:1 relationship using a primary key
(2) How to periodically execute tasks with Spring Boot
(3) WWW SQL Designer (used to create the ER diagrams)