Nice to meet you, my name is Yuya Takeda (@kagamikarasu). I develop the "SOLD OUT 2" market information site.
I started redeveloping the market information site around August 2020 and switched over to the new version in September. Now that more than a month has passed and things have settled down, I'd like to write an article about it.
SOLD OUT 2 is an online "playing shop" game operated by mu. https://so2.mutoys.com/
Tutorial: https://so2-docs.mutoys.com/common/tutorial.html
You consume items to obtain new ones, then use those to craft and sell new products. Everyone has their own play style, whether that's selling to NPCs, selling to other users, or taking part in events.
This is a market information site developed by me (@kagamikarasu). https://market.kagamikarasu.net/
mu provides a public API, so I built the site on top of it. You can follow the price and inventory history of each item, as well as how much each store's inventory has decreased, in graph and table form.
For end users, the site is read-only.
The old setup ran on my home server plus ConoHa (as a load balancer), with Java 1.8 and MySQL (later MariaDB). It had become hard to keep developing, so I rebuilt it.
The Java code had become messy, obtaining and deploying certificates was a hassle, and with the database ballooning I wanted to retire the home server (no spare PC or UPS)...
The largest table had grown past a billion records. When I first built it, I hadn't thought at all about how much data would accumulate.
It was noticeably slow, so I reviewed the indexes and search conditions, but that was a drop in the bucket; in the end I solved it by upgrading the storage and buying more IOPS.
Over three years the storage went HDD → SSD → SSD (NVMe). You could say the earlier setups were simply underpowered, but NVMe is so fast that when I switched storage, the migration finished so quickly I wondered whether I had mistyped the command.
I happened to be studying for the AWS SAA certification, so I decided to use AWS. The biggest problem was keeping the monthly cost down.
If you go with RDS + EC2 + ALB without much thought, RDS alone can cost a fair amount depending on the configuration. That's a normal expense for a company, but quite painful for an individual (at least for me) ('A`)
* Please use the services listed below as directed, i.e., mind the usage limits.
DynamoDB
Migrating to RDS (MariaDB) would have been the easiest path, but migrating and operating hundreds of GB of data there would carry a significant running cost. I just mentioned several hundred GB, but all of the migration data is actually moved to S3 in a Hive-style layout + JSON format. When needed, it flows S3 → Lambda → SQS → DynamoDB, as described later.
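As a rough sketch of what such a Hive-style layout in S3 could look like (the bucket name, partition columns, and helper function below are assumptions, not taken from the article):

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "so2-market-archive"  # hypothetical bucket name

def archive_records(item_id, date, records):
    """Store one item's records under a Hive-style partitioned prefix,
    e.g. item_id=123/dt=2020-09-01/data.json, so that tools like Athena
    can treat item_id and dt as partition columns."""
    key = f"item_id={item_id}/dt={date}/data.json"
    body = "\n".join(json.dumps(r) for r in records)  # JSON Lines
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
```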
At the time of writing, the DynamoDB free tier is 25 GB of storage plus 25 provisioned WCU / RCU.
That means it's best to keep each record within 1 KB. This time I use a single table with 5 WCU / RCU provisioned. Since the market information site always pulls data per item, the partition key is naturally item_id, and since I want the records in chronological order, the sort key is the UNIX time of when the record was registered.
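As a minimal sketch of how that key design gets queried (the table and attribute names here are assumptions, since the article doesn't give them), fetching one item's history over a time range looks roughly like this:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("market_history")  # hypothetical table name

def fetch_item_history(item_id, days=7):
    """Query one item's aggregated records for the last `days` days,
    using the partition key (item_id) + sort key (UNIX time) design."""
    now = int(time.time())
    since = now - days * 24 * 60 * 60
    resp = table.query(
        KeyConditionExpression=(
            Key("item_id").eq(item_id) & Key("registered_at").between(since, now)
        ),
        ScanIndexForward=True,  # oldest first, i.e. chronological order
    )
    return resp["Items"]
```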
The pre-migration database (MariaDB) held per-store data (unaggregated) at 10-minute intervals, while the post-migration database (DynamoDB) holds per-item data (aggregated) at 3-hour intervals.
SQS is involved too, so the concrete aggregation and storage flow is described later, but WCU consumption ends up looking like the figure below. As long as the data stays within 25 GB, the database can keep running for free.
SQS
Next is SQS. I had never paid much attention to queuing, but while studying for the SAA I learned what a useful thing it is and decided to combine it with DynamoDB.
The SQS free tier is 1 million requests per month. It wasn't obvious to me at first, but each message takes two requests: one to send and one to receive. I'm currently planning around 2,000 items, so I run the pipeline every 3 hours to stay under 1 million requests: 2,000 items * 8 runs per day * 30 days * 2 requests = 960,000. Since it's SQS, going slightly over wouldn't hurt much, but I want to stay free as far as possible, hence every 3 hours.
As mentioned above, the aggregated data is written to DynamoDB every 3 hours, one record per item, which currently means about 2,000 items per run. If you wrote all 2,000 items within 1 second, you would need 2,000 WCU (at 1 KB per item), which would cost over $1,000 a month and bankrupt me. So this time I combine SQS with DynamoDB to spread out the write timing.
Specifically, the flow is Lambda (aggregation) → SQS → Lambda (read from SQS / write to DynamoDB) → DynamoDB. Ordering doesn't matter here, so a standard queue is enough; FIFO isn't needed.
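As a minimal sketch of the producer side (the queue URL and message shape are assumptions, not from the article), the aggregation Lambda just pushes one message per item and lets the writes be absorbed later at a gentle pace:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/market-aggregates"  # hypothetical

def enqueue_aggregates(aggregated_rows):
    """Send aggregated per-item rows to SQS in batches of 10 (the API maximum per call)."""
    for i in range(0, len(aggregated_rows), 10):
        batch = aggregated_rows[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps(row)}
                for n, row in enumerate(batch)
            ],
        )
```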
At first I thought the SQS trigger for Lambda would be convenient, but I noticed request consumption I wasn't expecting while monitoring; when I investigated, it turned out the trigger itself keeps polling the queue. Reference: https://encr.jp/blog/posts/20200326_morning/
It's a small thing, but it bothered me, so instead of the Lambda trigger I fetch from SQS myself using CloudWatch Events + Lambda.
The number of SQS messages received looks like this. CloudWatch Events checks the queue once a minute, and the Lambda caps how many messages it pulls per run so the write capacity isn't overwhelmed.
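A minimal sketch of that consumer side, under the same assumptions as above (names are placeholders): a Lambda fired every minute by CloudWatch Events pulls a capped number of messages and writes them to DynamoDB one by one.

```python
import json
from decimal import Decimal

import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("market_history")  # hypothetical table name
QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/market-aggregates"  # hypothetical

MAX_MESSAGES = 10  # cap per run so a 5 WCU table keeps up

def handler(event, context):
    """Invoked every minute by CloudWatch Events; drains a few messages per run."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=MAX_MESSAGES,
        WaitTimeSeconds=1,
    )
    for msg in resp.get("Messages", []):
        row = json.loads(msg["Body"], parse_float=Decimal)  # DynamoDB needs Decimal, not float
        table.put_item(Item=row)  # one ~1 KB record consumes 1 WCU
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```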
Lambda
CloudWatch Events + Lambda handles all of the data acquisition and aggregation.
Lambda is billed on request count and execution time. The free tier is 1 million requests and 400,000 GB-seconds per month. My usage never exceeded it (a little over 10% of the free tier).
At the time of writing, I have built the following functions.
Since I use Python + pandas, the aggregation is very easy. At first I thought I could brute-force it with for loops, but that became painful memory-wise, so I went with pandas.
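Purely as an illustration (the real column names and aggregation rules aren't given in the article, so these are assumptions), collapsing per-store 10-minute rows into per-item 3-hour aggregates with pandas might look like this:

```python
import pandas as pd

def aggregate_market_data(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-store rows into per-item, 3-hour aggregates.

    Assumed columns: item_id, store_id, price, stock, observed_at (UNIX seconds);
    these names are placeholders for whatever the real feed uses.
    """
    df = df.copy()
    df["observed_at"] = pd.to_datetime(df["observed_at"], unit="s")
    return (
        df.groupby(["item_id", pd.Grouper(key="observed_at", freq="3H")])
          .agg(min_price=("price", "min"),
               mean_price=("price", "mean"),
               total_stock=("stock", "sum"),
               store_count=("store_id", "nunique"))
          .reset_index()
    )
```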
I use the Serverless Framework, which makes deployment easier. What it does under the hood is essentially CloudFormation, so I define API Gateway in the same configuration as well.
If you test and deploy repeatedly, old versions pile up, presumably because every deploy is versioned. I recommend "serverless-prune-plugin" to delete them automatically.
Lambda usage is as follows. The allocated memory varies by function, between 128 MB and 512 MB.
CloudWatch Events
This is used for batch (scheduled) processing. There are plenty of options such as EC2 + cron or Digdag, but since Lambda is available, I combine it with CloudWatch Events.
At one point I thought a batch job wasn't running, and it turned out the schedule was in GMT... Because the console language is set to Japanese, I had carelessly assumed JST...
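For reference, a small sketch of the conversion that tripped me up (the cron expression below is just an illustration, not the schedule actually used on the site): CloudWatch Events cron expressions are evaluated in UTC, so a job meant for 09:00 JST has to be written as 00:00 UTC.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

def jst_hour_to_utc_cron_hour(jst_hour: int) -> int:
    """CloudWatch Events cron runs in UTC; convert an hour-of-day from JST (UTC+9)."""
    return (jst_hour - 9) % 24

# A job intended for 09:00 JST must be scheduled at 00:00 UTC:
print(f"cron(0 {jst_hour_to_utc_cron_hour(9)} * * ? *)")  # cron(0 0 * * ? *)

# Sanity check: the same instant shown in both zones.
now_utc = datetime.now(timezone.utc)
print(now_utc.isoformat(), "=", now_utc.astimezone(JST).isoformat())
```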
ECS (EC2)
Next is ECS, which is very powerful. I use a combination of ECR + ECS + ALB + Route 53.
I use the EC2 launch type instead of Fargate (because it's cheaper...). ECS plus ALB lets you run multiple tasks, i.e. Docker containers, on a single EC2 instance.
This lets you deploy without replacing or adding instances (as long as you stay within the instance's memory limit). The EC2 instance itself fades into the background; everything runs in Docker, and I don't even connect to it over SSH.
The EC2 instances are also Spot Instances. Depending on the instance type, this has cut charges by about 70% compared with on-demand.
In exchange for the big discount, AWS can interrupt the instances at any time. When that happens, Spot Fleet automatically replaces the missing instances and the ECS service restores the required number of tasks. Meanwhile, the ALB health checks kick in: unhealthy targets are removed from the target group and healthy ones are registered.
I chose t3.micro this time, which costs roughly 1 yen per hour, so running two of them for a month comes to about 1,440 yen. With Spot Instances and a roughly 70% discount, that drops to around 432 yen, depending on conditions.
ALB
As mentioned earlier, the ALB is used together with ECS, and it can use a certificate from ACM. That's far easier than Let's Encrypt; it's basically a button press.
However, it will probably cost around 2,000 yen a month depending on traffic. That's a bit expensive for personal use if only a single service sits behind it, but since host-based routing is possible, if you run multiple services I don't think it's a bad investment considering the ECS integration, ACM, routing, and the fact that it's managed.
Also, sticky sessions are turned off because sessions are managed in Redis. With sticky sessions on, if an EC2 instance suddenly dropped out, requests would keep being routed to the instance that is gone.
Route 53
This covers everything around the domain. Incidentally, I transferred the domain from "Name.com" to Route 53 last year. It basically costs about 50 yen a month.
Separately, there is a feature called ECS service discovery. Communication between containers is hard to achieve with ECS on its own (containers on the same host can talk to each other, but across hosts it depends on the network mode and so on...).
With service discovery, services registered in ECS are automatically registered and updated in a private hosted zone. If the network mode is awsvpc, A records can be registered.
Since a new hosted zone is created, it costs another 50 yen or so, but having services automatically linked to an internal domain is a huge help. Even if a service (task) goes down, ECS brings it back and re-links it automatically.
In this environment, Redis (not ElastiCache) runs inside ECS, and service discovery is what the application side (a separate container) uses to reach it. ElastiCache would be nice, but it costs money...
S3
S3 charges for the amount stored and for GET / PUT requests. In this system, master data and pre- and post-aggregation data are stored in S3. PUTs happen only on the system side, so I keep them to a minimum, but GETs depend on the user's timing, so I cache the data in Redis on EC2 to make GETs as rare as possible.
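The site itself is a Laravel app, but as a language-neutral illustration of that cache-aside idea, here is a minimal Python sketch (the bucket name, Redis host, and TTL are placeholders) that checks Redis before falling back to S3:

```python
import boto3
import redis

s3 = boto3.client("s3")
cache = redis.Redis(host="redis", port=6379)  # reached via ECS service discovery

BUCKET = "so2-market-data"   # hypothetical bucket name
CACHE_TTL = 3 * 60 * 60      # the data only changes every 3 hours anyway

def get_object_cached(key: str) -> bytes:
    """Return an S3 object's body, serving from Redis when possible to avoid GET requests."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    cache.setex(key, CACHE_TTL, body)
    return body
```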
Data transfer also costs money; it's small, but the more services you combine, the bigger the impact. ALB in particular charges for new connections, so I don't want to waste traffic. It's nice that robots.txt can deal with well-behaved bots, but cutting off unwanted inbound connections with a network ACL is also a good idea.
I wondered whether Spot Instances would make things unstable, but to my surprise there is no sign they have ever been reclaimed. I've been developing new features for several weeks since the switchover, and it's far easier to develop than in the old environment.
In particular, since everything is built with Docker, the initial setup takes some effort, but after that it's just a matter of building the image and updating the task. pandas on the Lambda side is also easy to work with, and once the mechanism is in place it feels like I can do anything. Basically, data is generated by Lambda, Laravel stores it in Redis, and the result is returned to the user.
Free tier usage just before the end of the month looked like this.
It's been more than a year since I created my account, so I assumed the 12-month free tier no longer applied, yet everything stayed within the free tier and nothing was billed. Maybe the year counts from the first bill rather than from account creation?
With this, I feel the migration is essentially done. All the historical data is stored in S3, ready to be loaded into DynamoDB whenever it's needed (though I suspect the historical data isn't needed all that much...).
Since I focused mainly on the design, the operations side is still sparse. For example, deployment is manual. I'd like to wire up CircleCI and CodeDeploy at some point, but since I develop alone, I wonder how much I really lose by not having them.
Monitoring is also sparse. There is no external (synthetic) monitoring and no SNS notifications. I recently noticed that bots were hitting the site a lot and quietly consuming LCU, so I set up robots.txt and a network ACL. Logs go to CloudWatch, so I'd like to build an environment where I can analyze them easily.
Since the site deals with numerical data, I think it would be interesting to try machine learning. I actually dabbled a little (multiple regression), but my PC screamed once the parameters piled up ('A`). That would be too heavy for Lambda, so I'm thinking it may come down to spinning up an instance just for analysis.
Each topic ended up getting only a light treatment, but if I get the chance I'd like to dig into them individually in future articles. Thank you for reading this far!