This is the Day 14 article of the CrowdWorks Advent Calendar 2020.
Hi, this is @shimopata. Lately I've been hooked on craft cola [^1]. Last year I wrote an article titled "Story of successfully converting 200 million production data". That was a valuable experience of a kind I rarely get in personal development, and this year too I was involved in work I would not have experienced outside a company that runs its own services. Few people will face exactly the same situation, but I'm writing it down in the hope that parts of it are useful.
The other day, I migrated a group of batches that had been running on EC2 to the batch server platform. The batch server platform, introduced earlier on the CrowdWorks engineer blog, is a batch execution environment built on Fargate. Previously introduced article → Story of making batch serverless and migrating to Fargate due to EOL of Amazon Linux [^2]
The migration prepares one container environment per batch, as shown in the following figure. [^3]
My team's main focus was migrating the layers above the containers running on Fargate. That meant writing a Dockerfile and modifying each batch so that it works inside the container. There were six types of batches to migrate, and the work differed for each, so I will pick just one and walk through how we handled it.
Various batches were running at CrowdWorks, distributed across several instances. Some of them ran on Amazon Linux, and we were notified that Amazon Linux would reach end of life at the end of December 2020.
A maintenance support period is provided even after that date, but considering future vulnerabilities and extensibility, we decided to migrate for the following reasons.

- Only important and critical security updates are offered, for a reduced set of packages, so not everything will be covered.
- Support for new features may not be guaranteed.

Many of the batches to be migrated had been in operation since before I joined the company. Because of this history and the turnover of staff, we faced the following issues.

- Batches from multiple contexts run on the same server, and the relationships between them are unclear.
- There is little documentation about the servers and batches. Even where documents exist, some have not been updated and no longer match reality.
- Few people know why each batch is needed and what would happen if it were stopped.
- The engineers responsible at the time are no longer available because of transfers and the like.

The batches were like strays (stray batches rather than stray cats), so we needed to clarify how each one related to the others and what it affected.
The work is roughly divided into five steps: investigation, design, implementation, migration, and documentation. I will explain each in detail.
Before migrating, we first investigate how the current batch works and what kind of environment it runs in. Specifically, we investigated the following.

- What is it used for, and what business problem does it solve?
  - We interviewed the people who had been involved and reconfirmed whether the batch was needed at all. If it turns out not to be needed, there is no need to migrate it in the first place.
- cron definition
  - I checked the execution interval and which user runs the batch.
- Contents of the files run from cron
  - I read the shell files and scripts defined there and understood what they process.
- Environment variables registered on the server
  - I cross-checked the environment variables against the scripts being executed to see which variables are used and how.

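
As an illustration of the cron part of this investigation, here is a minimal Python sketch that lists the schedule and command of each entry in a user crontab (for example, the output of `crontab -l`). It assumes the standard five-field format; system crontabs such as `/etc/crontab` carry an extra user column that this sketch does not handle. The names `CronEntry` and `parse_crontab` are hypothetical and not part of the original work.

```python
from typing import List, NamedTuple


class CronEntry(NamedTuple):
    schedule: str  # e.g. "0 3 * * *" or "@daily"
    command: str   # everything after the schedule


def parse_crontab(text: str) -> List[CronEntry]:
    """Extract schedule/command pairs from user-crontab text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Shortcut schedules such as @daily or @reboot.
        if line.startswith("@"):
            parts = line.split(None, 1)
            if len(parts) == 2:
                entries.append(CronEntry(parts[0], parts[1]))
            continue
        fields = line.split(None, 5)
        # Skip environment assignments (e.g. MAILTO=...) and malformed lines.
        if "=" in fields[0] or len(fields) < 6:
            continue
        entries.append(CronEntry(" ".join(fields[:5]), fields[5]))
    return entries
```

Printing the result for each server gives a quick inventory of what runs and when, which can then be matched against the scripts and environment variables.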
This investigation gives you the whole picture of the system. Once you have that, it is time to draw a service overview diagram. The overview diagram of the batch we were in charge of looks like this.
With a complete picture of the current batch, we drew the post-migration overview. At this point we examined the impact of changing the execution platform from EC2 to Fargate. This batch kept its SQL results as CSV files on the EC2 instance for one month as a backup. With Fargate as the execution environment, I decided it would be better to save them to S3, so I revised the design.
The final design looks like this.
Creating a design diagram like this has several benefits:

- It is easier to explain at review time what you are building.
- It can be used as handover material.
- Above all, it deepens your own understanding of the system.

For these reasons, I recommend creating one.
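
For the CSV backup moved to S3 in the design step, one possible layout is a dated object key per run. The following Python sketch only illustrates that idea; the prefix, the file name `result.csv`, and reading "one month" as 30 days are all assumptions, not the actual design.

```python
from datetime import date, timedelta


def backup_key(prefix: str, run_date: date) -> str:
    """Build a dated object key such as 'prefix/2020/12/14/result.csv'.

    The 'result.csv' file name is a hypothetical placeholder.
    """
    return f"{prefix}/{run_date:%Y/%m/%d}/result.csv"


def is_expired(run_date: date, today: date, retention_days: int = 30) -> bool:
    """Return True when a backup is older than the retention period.

    retention_days=30 is an assumed reading of "one month".
    """
    return today - run_date > timedelta(days=retention_days)
```

In practice, expiry could also be left to an S3 lifecycle rule on the prefix rather than handled in code, which keeps the batch itself simpler.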
First, I started by writing a Dockerfile. The base image was Alpine Linux. The reasons for choosing it were:

- The batch requirements are simple: execute SQL and transfer files.
- Alpine Linux contains only the minimum required packages, which helps keep the environment secure.
- Alpine Linux images are light, so they build quickly.

Fast builds are a very important point. Creating an environment for running batches goes like this:

- Install the required packages → rebuild
- Check the behavior → find that prerequisite packages are missing → rebuild
- Find that the permissions required to run the batch are missing → rebuild
- Compare with the source environment → find that a user needs to be created → rebuild

In other words, every time I change the Dockerfile, I build it and check the behavior. If you build only once or twice, the build time does not matter, but as shown above, you rebuild many times through trial and error, so development efficiency depends heavily on having a light image and a quick build. I therefore recommend using a light Docker image with as little unnecessary content as possible.
After writing the Dockerfile and modifying the batch, I made a migration plan. Why plan the migration?

- The migration may destroy existing data.
- The original batch is not idempotent, so re-running it may not restore the original data.

If the migration fails, the risks are countless. We therefore made a migration plan from the following perspectives.

- Clarify what result to expect when the migration is complete.
  - A batch that finishes without errors is not necessarily working correctly. For example, the batch may run successfully without producing the intended data. In some cases, the number of output records differs depending on when the batch runs. Always decide in advance what counts as a completed migration.
- Clarify what the impact would be if a batch failed.
  - When execution fails, someone is usually affected. I checked in advance what kind of impact there might be.
- Is it possible to switch back, and when and how should that be done?
  - When a batch you expected to succeed fails, people panic, and it becomes easy to make unnecessary operational mistakes and errors of judgment. I checked in advance whether we could switch back to the original batch and how to do it.

Planning in advance may feel like a hassle, but I recommend writing the plan down as much as possible, because it will be useful in an emergency.
For this batch, the plan was as follows.

- What is the expected result?
  - The number of lines in the CSV file generated by the batch is the same before and after migration.
- What is the impact of a batch failure?
  - Some of the data presented to users becomes outdated.
- Is it possible to switch back, and when and how should that be done?
  - It is possible to switch back by executing the pre-migration batch.

Because we could switch back, and running the original batch would restore the original state, we could be confident that this was not high-risk work.
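
The expected result for this batch, matching CSV line counts before and after migration, could be checked with something like the following. This is a minimal Python sketch, not the check the team actually ran; the function names and file paths are hypothetical. Note that it counts physical lines, so a quoted field containing a line break would be counted as more than one line.

```python
def count_lines(path: str) -> int:
    """Count the physical lines in a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def same_line_count(before_path: str, after_path: str) -> bool:
    """Compare the line counts of the pre- and post-migration CSV output."""
    return count_lines(before_path) == count_lines(after_path)
```

For example, `same_line_count("before.csv", "after.csv")` returns `True` when both outputs have the same number of lines. Because the count can differ depending on when the batch runs, both files should come from runs against the same data.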
The hard part of migrating the system was working out the current specifications. Sometimes the necessary information was written in the batch documentation, but sometimes it did not say which systems the batch was linked to, what the cron settings were, which user ran the batch, and so on. Whenever something was unclear, I had to log in to the server and check the values before continuing with the migration, which was laborious. So once the migration to the batch server platform was complete, we discussed what information would have made the migration easier had it been documented, and decided to record that information in a README.
Here is the material created during that meeting.
I will introduce a few of the items.
For example, write "Batch to create a summary table for the database for analysis". It also helps whoever maintains the batch later if you briefly describe what problem occurred and what was built to solve it. For those who come later, it is often hard to tell whether a batch is still needed. In fact, it is not uncommon for a batch to become unnecessary over time because of other factors. You can read the behavior from the code and configuration files, but it is difficult to read the original requirements from them. So describe not only what the batch does, but also why it was made and what problem it was meant to solve.
Depending on the system or batch, resources other than the application server and database are often involved, such as AWS resources (S3, SQS, etc.) or the API of an external application. Even a simple configuration diagram covering each resource will improve the understanding of engineers who join later. Personally, I recommend draw.io because it makes it easy to create readable diagrams.
For a recently built system, deployment is usually done with a CI/CD tool such as CircleCI. In fact, most of our systems are automated with such tools. However, if for some reason manual deployment is required, document the steps.
This is related to "what the service is doing", but also describe the impact if the service or batch stops. Knowing how to prioritize the response when a service goes down will help your successors a great deal.
The issues raised before the migration work were resolved as follows:

- Batches from multiple contexts run on the same server, and the relationships between them are unclear.
  - Solved by managing containers separately for each context.
- There is little documentation about the servers and batches. Even where documents exist, some have not been updated and no longer match reality.
  - Solved by preparing documentation.
- Few people know why each batch is needed and what would happen if it were stopped.
  - Solved by preparing documentation.

I'm glad that we could complete the migration while also paying off the existing debt.
Now that I've finished writing, this feels like a very niche article, but I hope it helps you.
Tomorrow's article is by @tmknom, on organizational management in batch migration. I have a feeling it will be a blockbuster, so please read that as well!
[^1]: If you are interested, you can buy it online, or search for a recipe and make it yourself.
[^2]: If you like, please read this article as well!
[^3]: Batches in the same context share the same container, so there are also environments where two or more batches run in one container.