This was my first time load testing an API server, so I'd like to briefly introduce what I learned along the way: the purpose of load testing, examples of how to prepare and tune, and the parts I found difficult.
- **Determine each server's upper limit: "it cannot perform beyond XX."** Once you know this, you know how many servers are needed to withstand a given number of accesses.
- **Tune to improve performance.** When setting the limit for one server, the better that single server's throughput, the fewer servers you need. So you tune during the load test to raise performance.
- **Guarantee that the system can scale.** Adding servers is pointless unless the load is actually distributed across them, so confirm in advance that load balancing works properly.
- API server built with Django
- Locust (a load testing tool whose scenarios are written in Python)
- New Relic (a performance monitoring service)
There is no point in tuning blindly, so first set a target: the minimum performance the service requires. This time we decided the following:
- **Decide the CPU usage rate that counts as one server's limit.** Decide how much headroom you want to keep. Once this is fixed, the number of requests a server can handle while staying at that CPU usage rate becomes that server's limit value in the load test.
- **Assume the maximum number of requests the service must handle per minute.** Estimate how many requests need to be handled per minute (or per second) at peak. Once you know how many requests one server can handle (its limit value), "maximum requests ÷ one server's limit" tells you how many servers you need.
For example, suppose from DAU and similar figures that the number of accesses at peak time is 12,000 rpm, and that one server may use up to 50% CPU. If the goal is to keep the fleet to four servers, then **the target is for each server to handle 3,000 rpm at 50% CPU usage**, as in the sketch below.
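To make the arithmetic explicit, here is a minimal sketch of that capacity calculation (the numbers are the assumed values above, not measurements):

```python
import math

peak_rpm = 12000             # assumed accesses per minute at peak (from DAU)
per_server_rpm_at_50 = 3000  # target requests one server handles at 50% CPU

# servers needed = peak load / what one server can take, rounded up
servers_needed = math.ceil(peak_rpm / per_server_rpm_at_50)
print(servers_needed)  # => 4
```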
By the way, the server configuration at the time of the load test looked like this.
First, to measure the numbers for a single machine, we ran the load test against one API server. New Relic is installed only on API server 1.
Next, prepare Locust to generate the load. The linked reference covers how to use Locust. Build a scenario by deciding which APIs will be called and at what rate.
The following is a simple example scenario. User registration runs only once when each client starts, and after that the signin API and status API are called in no particular order; the weight given to the `@task` decorator determines how often each API is called relative to the others.
```python
# -*- coding:utf-8 -*-
"""
A scenario simulating a week of typical user behavior
"""
from locust import HttpLocust, TaskSet, task


class ScenarioTaskSet(TaskSet):
    def on_start(self):
        """
        User registration.
        Prepare the information each client needs here.
        """
        self.user_id = "xxxxxxxxxxxxxx"
        # The signin/status tasks below identify the client by this ID
        self.client_id = self.user_id
        self.client.headers = {
            'Content-Type': 'application/json; charset=utf-8',
        }
        self.client.post(
            '/user/signup',
            json={
                "user_id": self.user_id,
            }
        )

    @task(1)
    def signin(self):
        """
        User login
        """
        self.client.post(
            '/user/signin',
            json={
                "client_id": self.client_id,
            }
        )

    @task(10)
    def status(self):
        """
        Get user information
        """
        self.client.post(
            '/client/status',
            json={
                "client_id": self.client_id,
            }
        )


class MyLocust(HttpLocust):
    task_set = ScenarioTaskSet
    # Minimum wait time between task executions (milliseconds)
    min_wait = 1000
    # Maximum wait time between task executions (milliseconds)
    max_wait = 1000
```
Then execute it with the following command (`-H` specifies the server you want to load):

```
locust -H http://apiserver.co.jp
```
Access Locust in a browser (by default it serves on port 8089); when the Locust top screen appears, the preparation is complete. From there you can start the load test by specifying the number of users to simulate and the hatch rate (users spawned per second).
Once you actually start, you can watch the load statistics in the browser like this.
Locust also has a distributed mode that coordinates multiple slave processes. This time we prepared about 10 servers dedicated to Locust and ultimately loaded the API server from all 10 Locust servers.
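For reference, in the Locust versions of that era a distributed run is started with the `--master` / `--slave` flags, roughly like this (`scenario.py` and the master address are placeholders):

```
# On the coordinating node
locust -f scenario.py -H http://apiserver.co.jp --master

# On each of the 10 load-generating nodes
locust -f scenario.py -H http://apiserver.co.jp --slave --master-host=<master-ip>
```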
When Locust is ready, apply some load and check the result with New Relic. New Relic plays the leading role in this load test: when tuning performance, you apply load with Locust, investigate with New Relic where the load concentrates, fix that part, and repeat. (For how to set up New Relic for a Django app, [see here](https://gist.github.com/voluntas/7278351).)
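As a rough outline of what that setup involves (the license key is a placeholder and the run command is only an example; follow the linked gist for the actual steps):

```
# Generate the agent config file with your license key
newrelic-admin generate-config <YOUR_LICENSE_KEY> newrelic.ini

# Launch the Django app wrapped by the New Relic agent
NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python manage.py runserver
```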
Now let's actually apply load and look at New Relic. Below is New Relic's "Overview" page, where you can see the whole picture. This is data from before tuning: the throughput is 43 rpm, which is pretty terrible. The "Web transaction response time" graph is color-coded by type of processing; in this example light blue occupies most of it, which tells you that Python processing is taking too much time overall. The transaction list below likewise shows that the API processing takes too long, and the Error rate shows many errors.
We tune based on this data and raise the server to the target performance. Here is a simple example of how to tune.
Look at New Relic's Transactions page to see which API is the bottleneck when things are slow. Selecting "Most time consuming" sorts the requests by how much time they consume.
Incidentally, since most of the cause this time was Python processing, the "App Server breakdown" is all blue, but normally the share of each kind of processing is visualized like this.
Furthermore, below the graph a "Breakdown table" shows where the time goes, including SQL. If a SQL query's call count (Avg calls) is abnormally high, or its processing time is long, fix it. This time I tuned by combining queries that were called multiple times into one so that Avg calls became 1, and for SQL with a slow Time, by adjusting indexes to shorten it, roughly as in the sketch below.
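As an illustration of that kind of fix (the model and field names here are hypothetical, not the actual app's):

```python
from django.db import models


class Status(models.Model):
    # Hypothetical model for illustration only.
    # db_index=True creates an index so lookups by user_id avoid a full
    # table scan, the kind of change that shortens a slow SQL Time.
    user_id = models.CharField(max_length=64, db_index=True)
    score = models.IntegerField(default=0)


def fetch_statuses(user_ids):
    # Before: one query per user, so Avg calls grows with len(user_ids)
    #   return [Status.objects.get(user_id=u) for u in user_ids]
    # After: a single IN query, so Avg calls becomes 1
    return list(Status.objects.filter(user_id__in=user_ids))
```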
However, the Breakdown table does not show the details of the function doing the Python processing. For more detailed information, open the Transaction trace page from "Transaction traces" under the "Breakdown table".
For example, in "Trace details", the processing contents are displayed in the order of processing in this way, so it is quite convenient when you want to mark the repaired part as to which part of the processing is the cause.
In this case, before the INSERT, rows selected from two tables were being merged in a Python loop, and the merged contents were then INSERTed.
Comparing against "Trace details" made it possible to judge that the loaded part of the application code was this Python loop. By lightening the loop processing, we resolved the problematic part of the application code.
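A minimal sketch of that kind of repair, assuming a hypothetical `Result` model (not the actual code):

```python
from django.db import models


class Result(models.Model):
    # Hypothetical model for illustration only
    user_id = models.CharField(max_length=64)
    total = models.IntegerField()


def merge_and_insert(rows_a, rows_b):
    # Before: each merged row was INSERTed one at a time inside the loop.
    # After: build the merged rows in memory, then issue a single bulk INSERT.
    merged = [
        Result(user_id=a.user_id, total=a.score + b.score)
        for a, b in zip(rows_a, rows_b)
    ]
    Result.objects.bulk_create(merged)
```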
Basically, you repeat "apply load → investigate the cause → fix it and improve performance" until requests are handled stably up to the target CPU usage and there is nothing left to fix. By eliminating every cause, SQL processing and application code alike, you raise the server's performance.
After tuning the server, it's time to measure its limit. Since the CPU usage limit was set at 50%, apply load while watching the server's CPU usage with the top command.
First, while ramping up toward 50% CPU usage, confirm on the Locust browser screen that processing is stable, adjust Locust, and hold the load steady at around 50% CPU. At this point you could calculate the RPM from Locust's figures, but since we had been tuning against New Relic's numbers, we read the RPM from New Relic for consistency and recorded it as the server's limit value.
This fixes the performance of one server, so all that remains is to add servers, confirm that the load is distributed properly, and then provision the number needed for production.
That was a brief introduction to the load testing method, but it is only one example; when you actually run a test, you hit all sorts of walls.
- **Locust scenarios are hard to design.** Adjusting a scenario with dependencies between tasks, and deciding how often data updates should occur, was difficult. Update frequency in particular is directly tied to load, so this alone can change the results of a load test dramatically.
- **The server calls an external service's API.** In this case we had to estimate how much the external service's load would affect the results, and it was hard to come up with a method for that.
- **Investigating the cause of load is difficult.** When some code or a SQL statement looks heavy, you can narrow it down and fix it with the procedure introduced here. But there are many patterns that look fine at first glance yet still underperform, and those require broad knowledge and experience to attack from multiple angles.
If you have a grasp of the New Relic usage and tuning procedure introduced here, you should be able to run a basic load test. In practice, though, more problems appear and different tuning approaches are needed, so if this is your first time, I recommend testing with the advice of an experienced person. You may think you've improved performance quite nicely, and then have an experienced person check it and find that **the amount of load was simply not enough**.
Also, load testing is grueling: if your skills aren't up to it, you end up under load yourself every time you load the server, until you no longer know which of you is being load tested.
Still, you learn a great deal, so I'm sure that by the time a painful load test is over, **the throughput of both the server and yourself will have risen substantially!** If you get the chance, experience it once; it's a great learning experience. I'm sure you too will form a friendship with the server you shared the pain with ☆