My name is @H-asakawa and I am a web engineer at LITALICO Inc. It has already been three years since I joined as a new graduate; time flies.
This article is day 19 of the LITALICO Engineers Advent Calendar 2020.
The end of life of the Amazon Linux AMI was originally scheduled for June 30, 2020, but was extended to December 31, 2020 (which is why I kept putting the migration off). In this article I will share the lessons and failure stories from building new Amazon Linux 2 EC2 instances for the EOL and migrating the production servers. With the end of the year this close, I doubt there are many EC2 instances still running the Amazon Linux AMI, but I hope this helps anyone who is tackling the migration from now on.
As background, let me outline how I learned infrastructure so that you can get a rough idea of my skill level.
- When I joined: I knew nothing about infrastructure
- First year: watched from the seat next to a senior whenever they sshed into a server or handled troubleshooting at work
- Second year: participated in the in-house infrastructure training
- Second year: studied (training) by reading related books, mainly on what happens up to the point a web browser displays a page
- Second year: hands-on study (training) building a mock AWS environment with several Raspberry Pis
- Second year: participated in the in-house ISUCON (a colleague who was on my team wrote about it, so if you are interested: Story of participating in in-house ISUCON)
- Third year: built a small service with Fargate at work
- Third year: server migration for the Amazon Linux AMI EOL (⬅︎ we are here)
At first I was scared to ssh into the production server. To be honest, in the beginning I did not even know what ssh was. I watched the work from the seat next to my seniors and joined in with point-and-call confirmation, but the fear did not go away. Once I started studying from books, I understood the mechanism and the overall picture a little, and the fear faded. As I took part in the in-house training and gradually gained knowledge, I got to the point where I can somehow google things, investigate, and build them myself, though I am still very much a beginner. Infrastructure is great fun when you build up knowledge steadily and things actually work, but the blast radius of a single mistake is big.. (crying)
From here on, I will look back on how I went about the server migration, what I tried, and where I failed.
As a migration plan for the EOL, we first weighed two options.
- Take this opportunity to move everything to Fargate and go serverless
- Set up new EC2 instances and migrate only the EC2 layer
The grace period until the migration was short and I was the only person assigned to it, and going fully serverless would also have meant rethinking the deployment flow and the work we do via the console; I did not think I could handle all of that, so I dropped that option. For this EOL response, we therefore decided to move only the EC2 instances to new Amazon Linux 2 servers. The scope of impact is smaller that way, and it should be easier than moving to Fargate.
Next, I googled the official information about migrating to Amazon Linux 2 and confirmed the following.
- Update on Amazon Linux AMI end-of-life (not directly relevant, but the audio version is a fun touch)
- There is no in-place migration tool for Amazon Linux 2, so you have to set up a new EC2 instance
- AWS provides a support tool called the Pre-Upgrade Assistant Tool that diagnoses whether there are problems for the migration (I tried it, but it was not particularly helpful for this migration)
Then I thought about what to do. Simply replacing the EC2 instances seemed easier than moving to Fargate, and I wanted to take this opportunity to document the build procedure and server configuration with a configuration management tool.
This was because I wanted to practice Infrastructure as Code, because I thought debugging would be difficult if my work was incomplete and no documentation was left behind, and because I needed to build a total of five EC2 instances across the development, staging, and production environments.
It is not just that I lacked confidence I could do it all manually without a mistake; I was certain I would make one. And wouldn't the written procedure end up missing a step somewhere? Above all, I simply did not feel like doing the same work five times.
At the time, I wondered whether automating the build would really be worth the cost: if I set up one EC2 instance, launched five from its AMI, and then sshed in to tweak the configuration files, it looked like I would be done fairly quickly. When I actually tried automation, though, it was a huge relief not to have to ssh in and fiddle with settings, there was the reassurance of finishing at the push of a button, and the playbooks remained as documentation with guaranteed reproducibility. My takeaway is that if the requirement is to replicate three or more servers, automation is the better choice.
What I decided to do this time
- Survey the existing EC2 environment
- Manually build a new EC2 instance and confirm everything works on Amazon Linux 2
- Turn the build procedure into Ansible playbooks
- Build the five EC2 instances with Ansible
- Migrate the dev and staging servers
- Migrate the production servers
What I did
- Skim the past build documents (Google Docs)
- ssh in and check which libraries are installed
- Read the configuration files and logs under /etc and /var/log for the libraries that seem to be running
- Check whether any jobs are registered directly in cron
- Check whether any settings use different values per environment
Relying on hunches like "I'm pretty sure this is used on the production server...", I tried to understand the existing environment by combining these checks with the working knowledge I had picked up from sshing into the servers and poking around.
Surveying the environment before a migration feels a bit like archaeology. In hindsight, I should have leaned more on the approach in the article below: Infrastructure engineer and mysterious server
This is what I thought was important
- Ask the seniors who seem likely to know (I did this a lot)
- Find out which processes are running
- Understand the communication flow and find out which ports are listening
Looking at which ports are listening and how the processes are behaving makes it much easier to see what is actually in use. If you only check whether a library is installed, there are cases where it stopped being used at some point and only leftovers remain, and even if some issue documents that, reading through everything is very costly. I learned that there are ways to investigate the facts of the current state properly, instead of relying on intuition from past work or digging around in the dark.
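To make that concrete, the sketch below codifies this kind of survey as an Ansible playbook, since Ansible comes up later in this article anyway. The commands themselves (ps, ss, crontab) are standard; the host group name is an assumption and this is not a playbook from the actual migration.

```yaml
# Illustrative sketch only: gather "what is this server actually doing?"
# from an existing host. The host group name "legacy" is an assumption.
- hosts: legacy
  become: yes
  tasks:
    - name: List running processes
      command: ps aux
      register: processes
      changed_when: false

    - name: List listening ports and the processes that own them
      command: ss -tulpn
      register: listening
      changed_when: false

    - name: Show root's crontab (do not fail if none is registered)
      command: crontab -l
      register: cron_entries
      changed_when: false
      failed_when: false

    - name: Print the results for the survey notes
      debug:
        msg:
          - "{{ processes.stdout_lines }}"
          - "{{ listening.stdout_lines }}"
          - "{{ cron_entries.stdout_lines }}"
```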
Once you have worked out what needs to be installed for the new EC2 to run in the same state as before, all that is left is to install it steadily.
In this phase I got stuck on how to install a few things, but overall it felt like steady, straightforward work.
To automate the build procedure later, record every command you type and the order in which you ran them.
After building manually, deploying, and confirming that everything works, reproduce the build automatically with a configuration management tool.
This time, I decided to set up the EC2 instances manually and automate the server configuration that follows with Ansible.
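As a minimal illustration of that flow (the group and file names below are made up, not our actual files): launch the instance by hand, add it to the inventory, and let Ansible take over the configuration over ssh.

```yaml
# Minimal, hypothetical starting point. Run with something like:
#   ansible-playbook -i hosts site.yml --limit staging
# ("staging" and "site.yml" are assumed names.)
- hosts: staging
  become: yes
  tasks:
    - name: Confirm Ansible can reach the new instance before provisioning
      ping:
```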
This article was an interesting read while I was introducing Ansible, so have a look if you are curious: We successfully introduced Ansible in an enterprise & on-premise environment, so we presented it at the user group meetup
I did not automate launching the EC2 instances themselves this time; it only took about three minutes per instance, since all I had to do was choose the OS and volume and attach the existing AWS resources. Managing all AWS resources exclusively through a configuration management tool is not only safer, it is also the desirable state where the configuration can always be shared as code; however, it also means going through the tool even to add a single tag, which seemed likely to make operations rather rigid, so I avoided it this time.
If Terraform could manage everything, including the other AWS resources, I would no longer have to wonder "what was that setting again?". Personally, what I would be happiest about is being spared the chore of opening the AWS console and clicking through resources just to check them.
Here are the final Ansible playbook files, with a one-line comment for each (a rough sketch of one of them follows the list).
- `create-my-user.yml`: Creates a new user on the freshly launched, blank EC2 instance so that you can ssh in.
- `delete-ec2-user.yml`: Deletes the default ec2-user as a security measure, so that the user above becomes the main account.
- `ec2-my-service.yml`: Sets the time zone and host name, installs the required yum packages, installs Ruby, the MySQL client, Apache, Passenger, and more, places the various configuration files, and enables automatic start of the services.
- `install-newrelic-agent.yml`: Installs the New Relic agent, since New Relic monitors CPU and memory usage. Details here: How to install New Relic agent in Ansible on Amazon Linux 2
- `install-td-agent.yml`: Installs td-agent, which ships the access logs to BigQuery. Details here: How to install td-agent v4 in Ansible on Amazon Linux 2
- `test.yml`: For trying things out, such as printing the connection target's host name or reading the per-environment variables.
- `ec2-yum-security-check-update.yml`: Used for vulnerability handling. Runs `yum --security check-update` and logs which libraries would be upgraded to which versions.
- `ec2-yum-security-update.yml`: Used for vulnerability handling. Runs `yum --security update`.
- `ec2-yum-update.yml`: Not planned for regular use. Runs `yum update` to bring every package up to the latest version.
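For reference, here is a hedged sketch of roughly what a playbook like `create-my-user.yml` might contain; the user name, group, and key path are placeholders I made up, not the contents of the real file.

```yaml
# Rough sketch only: create a login user with sudo rights and register an
# ssh public key, so that ec2-user can be removed afterwards.
- hosts: all
  become: yes
  vars:
    login_user: myuser                # placeholder user name
    pubkey_file: files/myuser.pub     # placeholder path to the public key
  tasks:
    - name: Create the operations user
      user:
        name: "{{ login_user }}"
        groups: wheel
        append: yes
        shell: /bin/bash
        state: present

    - name: Register the ssh public key for the new user
      authorized_key:
        user: "{{ login_user }}"
        key: "{{ lookup('file', pubkey_file) }}"
        state: present
```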
With Fargate you cannot ssh in, so there is no need to manage users or deal with library vulnerabilities, but as long as you use EC2 you cannot avoid security updates. Wondering whether I could at least make that easier, I prepared a playbook that simply applies security updates, as sketched below.
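Roughly, the check-update and update playbooks could look like the combined sketch below. This is an illustration under assumptions, not the exact files; the first task mirrors `yum --security check-update` for logging, and the second applies only security errata via the yum module.

```yaml
# Rough sketch only, not the actual playbooks.
- hosts: all
  become: yes
  tasks:
    - name: Log which packages a security update would touch
      command: yum --security check-update
      register: security_check
      changed_when: false
      # check-update exits 100 when updates are available, 0 when none
      failed_when: security_check.rc not in [0, 100]

    - name: Show the pending security updates
      debug:
        var: security_check.stdout_lines

    - name: Apply security updates only
      yum:
        name: '*'
        state: latest
        security: yes
        update_only: yes
```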
The idea is to try the update on dev first, and if there are no side effects, run it on the two staging machines; for production, take an AMI first and then run both machines at the same time, so it should not take much time.. (I have not actually put this into proper operation yet.)
When I started researching Ansible from a position of knowing nothing, it was hard to grasp, even from web articles and the official documentation, why the recommended directory structure is the way it is. So while actually running it, checking connectivity, and adding logic to switch values per environment, I settled on a minimal directory structure that does just what I need. I have not polished or reviewed it at all yet, and I plan to compare it against the Ansible best practices later, but here it is for now.
infrastructure
└── ansible
├── ansible.cfg
├── ansible.log
├── Playbook files for execution.yml
├── hosts
├── roles
│ └── nrinfragent
├── templates
│ ├── Server setting files.j2
│ ├── development
│ ├── production
│ ├── sitemaps
│ ├── staging
│ └── td-agent
└── vars
├── dev.yml
├── prod.yml
├── prod2.yml
├── stg.yml
└── stg2.yml
hosts is the file that describes all of the connection-target servers. The playbook .yml files describe the processing that Ansible executes on the servers over ssh. The per-environment constant files go in the vars directory, and the various configuration files in Jinja2 format go in the templates directory.
The .j2 files are the configuration files for the various services; because the per-environment constants can be referenced inside them, you can keep a single shared configuration file while still setting different values for each environment.
Thanks to this feature of Ansible, absorbing the differences in settings between environments was easy, and I was able to automate the build for all three environments.
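To illustrate the mechanism (the variable names, values, and template contents below are made up, not the real files): a per-environment vars file defines a constant, the Jinja2 template references it, and a template task renders the result onto the server.

```yaml
# Illustrative only.
#
# vars/prod.yml might define a per-environment constant:
#   app_domain: example.com
#
# templates/vhost.conf.j2 would reference it inside the Apache config:
#   ServerName {{ app_domain }}
#
# and a playbook loads the right vars file and renders the template:
- hosts: production
  become: yes
  vars_files:
    - vars/prod.yml
  tasks:
    - name: Render the Apache vhost with this environment's values
      template:
        src: templates/vhost.conf.j2
        dest: /etc/httpd/conf.d/vhost.conf
      notify: restart httpd
  handlers:
    - name: restart httpd
      service:
        name: httpd
        state: restarted
```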
- Once you have written a playbook, everything is configured with a single command, which is wonderfully easy.
- Values inside configuration files can be rewritten flexibly via per-environment constants; Ansible is great.
- At first you will not understand the directory structure just by looking at it, so get it moving and understand it before worrying about the structure.
- Fargate is attractive because you do not have to worry about user management or the operational cost around ssh.
- Beyond the five web servers, once you look at the rest of the infrastructure (Jenkins, bastion hosts, servers for scheduled jobs, Redash, and so on) you notice there are many EC2 instances that need ongoing maintenance.
- If the security-update playbook can be applied to those EC2 instances as well, handling vulnerabilities should get a little easier (I think).
- The benefits of automating with a configuration management tool are huge even when you only have a handful of servers to replicate.
This was the most difficult phase for me personally.
I set up the new EC2 instances and the deployment went well. After that, all that remains is to finish the pre-switch checks, attach the new instances to the load balancer, and detach the old ones, but preparing the checklist turned out to be very hard. There were many things to verify beyond just confirming behavior in the browser.
In the end, I split the checklist into normal and abnormal cases, listing where and what kind of errors were acceptable and which processing had to succeed, and confirmed them one by one.
Even so, I had to revise and review the checklist many times, because some items only occurred to me along the way and others were simply missing. Keeping the service running exactly as before turned out to be difficult. Which brings me to the long-awaited production failure stories.
What I forgot to check was the very thing I had been wondering about: after switching the dev server and the two staging servers to the new EC2 instances, whether access logs were still arriving correctly in BigQuery (an important source of data used for planning measures and verifying their effectiveness). These EC2 instances originally used td-agent to forward the application access logs to BigQuery. But I had not even installed td-agent (I had completely forgotten about the BigQuery log collection).
Naturally, no logs were being sent. But unless something special is going on, nobody in the division checks the access logs of test environments like dev and staging.
So the day of the production maintenance arrived without anyone noticing. I attached the two new servers to the load balancer to form a four-server configuration, and the next morning came (at this point I still had not noticed that half of the logs were missing).
In the morning I started the detachment work to say goodbye to the old servers. The switch appeared to be complete; I monitored for errors, confirmed there were no abnormalities, and finished that day's maintenance (about 30 minutes of work).
Then, the following morning, I noticed that the numbers in the Redash report posted to Slack looked strange, and realized to my dismay that logs were missing. (I wrote an incident report.. crying)
Looking back, why wasn't the BigQuery access log, which is extremely important, on the checklist? If you only look at the error log and declare "no errors, all good", this outcome is inevitable. I needed to verify the normal behavior just as carefully as the error cases.
I also thought it would be good to have a mechanism that raises an alert when something is wrong in td-agent.log.
Yet another problem was discovered.
On the production servers, cron was running several scheduled jobs, but after the switch these jobs ran on UTC instead of JST. As a result, some jobs that should have run did not run at all, and others ran nine hours late.
Jobs that are supposed to run at a specific time (sending various emails to users, creating notifications, and so on) either not running or running nine hours late is a bad situation. I then had to investigate exactly which jobs had not run and which had run nine hours late. (I wrote another incident report.. crying)
When I looked into it later, there were plenty of cautionary articles about this on the web: after changing the OS time zone you have to restart cron or reboot the server, otherwise cron's time zone does not change. When an EC2 instance starts, the time zone is UTC; my Ansible build then set it to JST, but because I never rebooted, cron kept running on UTC.
As a reflection: since services also have auto-start settings that should be verified, it is probably worth rebooting the server at least once. Beyond that, rather than just checking that the times registered in cron looked correct, I should have run a test job and verified that it actually executed at the expected time.
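In Ansible terms, one way to avoid this pitfall is to restart crond whenever the time zone changes. The sketch below is an illustration, not the playbook actually used.

```yaml
# Sketch only: set the timezone and make sure crond picks the change up.
# Without the restart (or a reboot), cron keeps the timezone it was
# started with, which on a fresh EC2 instance is UTC.
- hosts: all
  become: yes
  tasks:
    - name: Set the system timezone to JST
      timezone:
        name: Asia/Tokyo
      notify: restart crond

  handlers:
    - name: restart crond
      service:
        name: crond
        state: restarted
```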
It drove home the importance of testing and checking things in advance..
I caused incidents in production, but I managed to finish the migration work and move to Amazon Linux 2.
I am grateful for the in-house training that gave me the opportunity to learn about infrastructure, and to the seniors and colleagues who ran it.
Going forward, I would like to create a diagram of the current architecture and share it with the team members to discuss future infrastructure improvements.
I took a few detours during the migration, so let me introduce some of them at the end.
- vscode (draw.io plugin): In the end I gave up on automatically generating the infrastructure diagram and managing it as code, and instead installed a plugin that lets me use draw.io inside VS Code. Keeping the diagram up to date looks like a bit of a chore, but I like it because diagrams can be drawn quickly. Reference: It seems that Draw.io can now be used with VS Code!
- Reading the monthly AWS invoice details: To develop a sense of cost now that I touch the infrastructure, I read through the business division's monthly AWS invoice, resource by resource. Doing so, I noticed that the leftover EBS volumes and snapshots created when taking AMIs of EC2 were racking up quite a charge. Now, even without calculating it item by item, just seeing an AWS charge of 0.79 $/h sets something off in my head, so I feel I am on closer terms with the AWS bill than before (blank stare).
- Arm-based EC2: Just when the Ansible playbooks were finished, I read an article saying that Arm-based EC2 instances have good cost performance. I wanted to try it, so I launched an a1 instance and ran Ansible against it, but MySQL 5.7 seems to be incompatible and the install failed. It would probably require tweaking the playbooks a little or moving to MySQL 8, but the cost savings look promising. Reference: I was only happy when I switched Amazon EC2 to Arm
Thank you to everyone who has read this far. It is just a list of what I have done and thought about recently, but I hope some of it was useful to you.
Tomorrow's article is by @negi. I'm looking forward to it.