I'm reviving a blog post I wrote about 10 years ago, when I was a fledgling engineer. These days I rarely get the chance to ssh into a server and check on it, but this is still worth remembering. https://itinao.hatenadiary.org/
I am an engineer. As soon as I get to work, the dazzling sales and planning people say things like this.
Sales: "The server is kind of slow." Planning: "Do something about it! We can't get any work done."
Me (thinking): "Uh, I don't know how to investigate this." Me: "I'm sorry. I'll contact a senior colleague right away."
As a programmer, I lack knowledge and experience on the infrastructure side. I want to do something, but there is nothing I can do...
It's a common sight when you operate your own service.
Ho-Ren-So (report, contact, consult) is important, but you are an engineer after all. You want to be able to handle it yourself. This post is for people like that. First, let's learn the concept of bottlenecks.
1. CPU load
2. I/O load
CPU load means a process is occupying the CPU (the CPU is busy computing).
If a single process (program) keeps CPU usage at 100% for a long time, it gets in the way of other processes.
To avoid a misunderstanding: 100% CPU usage is not bad in itself. As long as nothing else, such as disk or memory capacity, is the bottleneck, fully using the CPU is the ideal state.
Check whether a program is out of control (an infinite loop, for example).
Review the processing that changed in the latest release.
I/O means input/output. Frequent movement of data in and out puts load on the hardware and network, so I/O load is a different thing from CPU load. Here the slowdown comes not from a busy CPU but from a large volume of reads and writes to disk.
Is a program doing a lot of file I/O?
Is swapping, caused by insufficient memory, generating disk access?
If there is not enough memory, the system will use swap. Conversely, if there are many accesses to swap, there is a possibility of insufficient memory.
So far we have covered the concepts of CPU load and I/O load. Next, let's get straight into investigating CPU and I/O bottlenecks.
1. First of all, calm down. This is important.
2. Check the load average with top.
3. Check with sar whether the CPU load or the I/O load is higher.
4. View per-process information with ps.
5. Take measures such as reviewing the running program or rolling back to the previous version.
6. If it is safe to stop it midway, kill or restart the offending process.
Staying calm matters at all times. The looks from people who want it fixed right now will sting, but don't panic.
Deal with it steadily.
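First of all, let's look at the load average with the top command.
The load average is the average number of processes that were running or waiting to run, plus those waiting for disk I/O, over a unit of time. In other words, it reports how many tasks were competing for resources. It is a system-wide figure, which is why it is compared against the number of cores. A high value means the system is under heavy load.
If the load average is higher than the number of cores, processes are queuing for the CPU, and that may be the cause of the slowness.
$ top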
top - 00:41:49 up 6 days, 2:24, 1 user, load average: 2.15, 3.02, 3.20
Tasks: 93 total, 1 running, 45 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3977928 total, 3324844 free, 121568 used, 531516 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 3630656 avail Mem
The part reading load average: 2.15, 3.02, 3.20 is the load average. From the left, the values are the averages over the last 1 minute, 5 minutes, and 15 minutes.
There are two things to check:
Does the load average exceed the number of cores?
Is swap being used?
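The first check can be sketched in a few lines of Python. This is only an illustration: the function names `parse_load_average` and `exceeds_cores` are made up for this example, and the core count of 2 is an assumption rather than something read from the machine.

```python
import re


def parse_load_average(top_header):
    """Return the 1, 5 and 15 minute load averages from top's first line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header)
    if match is None:
        raise ValueError("no load average in: " + top_header)
    return tuple(float(value) for value in match.groups())


def exceeds_cores(load, cores):
    """True when more tasks are running or waiting than there are cores."""
    return load > cores


# First line of the top output shown above.
header = "top - 00:41:49 up 6 days, 2:24, 1 user, load average: 2.15, 3.02, 3.20"
cores = 2  # assumption for this sketch; os.cpu_count() returns the real value

for minutes, load in zip((1, 5, 15), parse_load_average(header)):
    print(f"{minutes:>2} min: load {load:.2f}, over {cores} cores: {exceeds_cores(load, cores)}")
```

With the sample header and two cores, all three averages come out above the core count, so this machine would be worth a closer look.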
Next, let's look at the load status for each core.
On a multi-core machine, the load average alone may not be enough to judge. In that case, use sar -P ALL to see the status of each CPU individually. Even with multiple CPUs, if there is only one disk, CPU load can be spread across the CPUs but I/O cannot, so the disk becomes the bottleneck.
$ sar -P ALL
Linux 3.10.0-862.2.3.el7.x86_64 (118-27-1-88) 10/01/2018 _x86_64_ (2 CPU)
01:17:35 AM CPU %user %nice %system %iowait %steal %idle
01:17:36 AM all 0.00 0.00 0.00 0.00 0.00 100.00
01:17:36 AM 0 0.00 0.00 0.00 0.00 0.00 100.00
01:17:36 AM 1 0.00 0.00 0.00 0.00 0.00 100.00
The meaning of each column is as follows.
Column | Description |
---|---|
%user | Percentage of time the CPU was in user mode |
%system | Percentage of time the CPU was in kernel mode |
%iowait | Percentage of time the CPU was waiting for I/O |
%idle | Percentage of time the CPU was idle |
Here's what to check:
If %idle is small, CPU usage is high and the CPU may be the bottleneck.
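To make the %idle check concrete, here is a rough Python sketch that reads the sar -P ALL lines shown above and lists CPUs with little idle time. The helper names and the 10% threshold are assumptions for illustration, and the parser expects a single header row like the one in the sample.

```python
SAR_OUTPUT = """\
01:17:35 AM CPU %user %nice %system %iowait %steal %idle
01:17:36 AM all 0.00 0.00 0.00 0.00 0.00 100.00
01:17:36 AM 0 0.00 0.00 0.00 0.00 0.00 100.00
01:17:36 AM 1 0.00 0.00 0.00 0.00 0.00 100.00
"""


def parse_sar_cpu(text):
    """Map each CPU label ("all", "0", "1", ...) to its percentage columns."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith(("Linux", "Average:"))]
    header = rows[0]
    cpu_col = header.index("CPU")  # everything before this column is the timestamp
    names = header[cpu_col + 1:]
    stats = {}
    for row in rows[1:]:
        stats[row[cpu_col]] = dict(zip(names, (float(v) for v in row[cpu_col + 1:])))
    return stats


def busy_cpus(stats, idle_threshold=10.0):
    """Labels of individual CPUs whose %idle is below the (hypothetical) threshold."""
    return [cpu for cpu, cols in stats.items()
            if cpu != "all" and cols["%idle"] < idle_threshold]


stats = parse_sar_cpu(SAR_OUTPUT)
print(busy_cpus(stats))  # empty for the idle sample above
```

The same dictionary also holds %iowait for each CPU, so a similar filter could flag CPUs that spend their time waiting on the disk.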
If the CPU turns out to be the cause of the load, the next step is to find out which process is misbehaving.
$ ps auwx | head
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 19232 1516 ? Ss Feb09 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S Feb09 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Feb09 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S Feb09 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S Feb09 0:00 [stopper/0]
root 6 0.0 0.0 0 0 ? S Feb09 0:06 [watchdog/0]
root 7 0.0 0.0 0 0 ? S Feb09 0:00 [migration/1]
root 8 0.0 0.0 0 0 ? S Feb09 0:00 [stopper/1]
root 9 0.0 0.0 0 0 ? S Feb09 0:00 [ksoftirqd/1]
The meaning of each column is as follows.
Column | Description |
---|---|
%CPU | CPU utilization of the process |
%MEM | Share of physical memory used by the process |
VSZ (RSS) | Virtual (physical) memory area reserved by the process |
STAT | Process state |
TIME | Time the process has occupied the CPU |
STAT shows the process state. Processes that can run on the CPU are in the TASK_RUNNING state, and the CPU is given to the highest-priority task among them.
Notation | Status | Description |
---|---|---|
R | TASK_RUNNING | Runnable (running or ready to run) |
S | TASK_INTERRUPTIBLE | Waiting state. Signals can be received |
D | TASK_UNINTERRUPTIBLE | Waiting state. Signals are not received (typically waiting on disk I/O) |
Z | TASK_ZOMBIE | Zombie state. The state after exit |
T | TASK_STOPPED | Stopped (suspended) state |
There are two things to check:
Look at RSS to see whether any process is extremely large.
Check TIME. For a process stuck in an infinite loop (TASK_RUNNING), TIME keeps increasing.
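Both checks can be done by eye, but as a rough sketch they look like this in Python. The helper names are invented for this example, the input is a shortened copy of the ps output above, and TIME is assumed to be in the minutes:seconds form shown there.

```python
PS_OUTPUT = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 19232 1516 ? Ss Feb09 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S Feb09 0:00 [kthreadd]
root 6 0.0 0.0 0 0 ? S Feb09 0:06 [watchdog/0]
"""


def parse_ps(text):
    """Turn `ps auwx` output into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    columns = lines[0].split()
    # COMMAND is the last column and may contain spaces, so cap the split.
    return [dict(zip(columns, line.split(None, len(columns) - 1)))
            for line in lines[1:]]


def cpu_seconds(time_field):
    """Convert a TIME value in minutes:seconds form to seconds."""
    minutes, seconds = time_field.split(":")
    return int(minutes) * 60 + int(seconds)


def largest_rss(processes, count=3):
    """Processes with the biggest resident memory first."""
    return sorted(processes, key=lambda p: int(p["RSS"]), reverse=True)[:count]


def longest_running(processes, count=3):
    """Processes in state R with the most accumulated CPU time first."""
    runnable = [p for p in processes if p["STAT"].startswith("R")]
    return sorted(runnable, key=lambda p: cpu_seconds(p["TIME"]), reverse=True)[:count]


processes = parse_ps(PS_OUTPUT)
print([p["COMMAND"] for p in largest_rss(processes, 1)])
print(longest_running(processes))  # no R-state process in this idle sample
```

To spot an infinite loop, run the same comparison twice a few seconds apart: a runaway process stays in state R and its TIME value keeps growing between the two snapshots.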
If the top output shows swap in use, the cause may be insufficient physical memory. Let's take a closer look with the sar command.
$ sar -S
00:00:00 kbswpfree kbswpused %swpused kbswpcad %swpcad
00:10:01 2097148 0 0.00 0 0.00
00:20:01 2097148 0 0.00 0 0.00
00:30:01 2097148 0 0.00 0 0.00
00:40:01 2097148 0 0.00 0 0.00
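Column | Description |
---|---|
kbswpfree | Free swap space (KB) |
kbswpused | Used swap space (KB) |
%swpused | Percentage of swap space in use |
kbswpcad | Cached swap (KB): memory that was swapped out and read back in but still has a copy in swap |
%swpcad | Cached swap as a percentage of used swap space |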
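As a small worked example, %swpused is just used swap divided by total swap (free plus used). The function name below is made up for illustration, and the guard for a machine with no swap is an assumption about how you would want that case reported.

```python
def swap_used_percent(kbswpfree, kbswpused):
    """Percentage of swap in use, from the kbswpfree and kbswpused columns."""
    total = kbswpfree + kbswpused
    if total == 0:
        return 0.0  # no swap configured at all
    return 100.0 * kbswpused / total


# Values from the sar -S sample above: nothing is swapped out.
print(swap_used_percent(2097148, 0))
```

A value that creeps upward over successive sar intervals is the signal to move on to a live view of swap traffic.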
After checking how much swap is in use here, watch it live with vmstat. Specifying an interval and a count, as in vmstat 1 100 (every second, 100 times), makes the trend easy to follow.
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 244208 10312 1552 62636 4 23 98 249 44 304 28 3 68 1 0
0 2 244920 6852 1844 67284 0 544 5248 544 236 1655 4 6 0 90 0
1 2 256556 7468 1892 69356 0 3404 6048 3448 290 2604 5 12 0 83 0
0 2 263832 8416 1952 71028 0 3788 2792 3788 140 2926 12 14 0 74 0
0 3 274492 7704 1964 73064 0 4444 2812 5840 295 4201 8 22 0 69 0
Column | Description |
---|---|
r | Number of processes running or waiting for CPU time |
b | Number of processes in uninterruptible sleep (blocked, typically waiting for I/O) |
swpd | Amount of swap in use (KB) |
free | Free memory (KB) |
buff | Memory used as buffers (KB) |
cache | Memory used as cache (KB) |
si | Memory swapped in from disk (KB/s) |
so | Memory swapped out to disk (KB/s) |
bi | Blocks received from a block device (blocks/s) |
bo | Blocks sent to a block device (blocks/s) |
in | Interrupts per second |
cs | Context switches per second |
us | Percentage of CPU time spent running user processes |
sy | Percentage of CPU time spent running kernel code |
id | Percentage of time the CPU was idle |
wa | Percentage of time the CPU was waiting for I/O |
st | Percentage of time stolen from this virtual machine by the hypervisor |
r and b are usually around 0 to 2.
If these numbers are large, the server will probably feel slow.
si and so should normally stay at zero.
If nonzero values keep appearing here, either there is not enough memory or some program is consuming a lot of it.
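The si/so rule is easy to apply mechanically. Here is a minimal Python sketch that parses the vmstat 1 5 sample above and counts the samples with swap traffic; `parse_vmstat` and `swapping_samples` are names invented for this example.

```python
VMSTAT_OUTPUT = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 3 1 244208 10312 1552 62636 4 23 98 249 44 304 28 3 68 1 0
 0 2 244920 6852 1844 67284 0 544 5248 544 236 1655 4 6 0 90 0
 1 2 256556 7468 1892 69356 0 3404 6048 3448 290 2604 5 12 0 83 0
 0 2 263832 8416 1952 71028 0 3788 2792 3788 140 2926 12 14 0 74 0
 0 3 274492 7704 1964 73064 0 4444 2812 5840 295 4201 8 22 0 69 0
"""


def parse_vmstat(text):
    """Return one dict per sample line, keyed by vmstat's column names."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[1].split()  # lines[0] is the group banner (procs, memory, ...)
    return [dict(zip(columns, (int(v) for v in line.split()))) for line in lines[2:]]


def swapping_samples(samples):
    """Samples in which anything was swapped in or out."""
    return [s for s in samples if s["si"] > 0 or s["so"] > 0]


samples = parse_vmstat(VMSTAT_OUTPUT)
print(len(swapping_samples(samples)), "of", len(samples), "samples show swap traffic")
print("peak wa:", max(s["wa"] for s in samples))
```

In this sample every line shows swap activity and wa peaks at 90, which matches the picture of a machine that is short on memory and spending its time on disk I/O.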
First, determine whether the bottleneck is CPU or I/O with the following commands:
top
sar
ps
vmstat
When the CPU load is high
Add or upgrade servers, or improve the program's logic and algorithms
When the I/O load is high
Expand the cache area by adding memory
If adding memory is not possible, consider distributing the data or introducing a cache server
Improve the program to reduce how often it performs I/O
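The whole decision can be summarized as a tiny function. This is a sketch under loud assumptions: the thresholds (80% busy, 20% iowait) are arbitrary illustrative values, the separate "memory" category is my own shorthand for the swapping case, and `classify_load` is a hypothetical name.

```python
ADVICE = {
    "memory": "add memory, or find the program that is consuming it",
    "io": "add memory for cache, distribute data or add a cache server, reduce I/O frequency",
    "cpu": "add or upgrade servers, improve program logic and algorithms",
    "none": "no obvious CPU or I/O bottleneck in this sample",
}


def classify_load(user, system, iowait, si=0, so=0,
                  busy_threshold=80.0, iowait_threshold=20.0):
    """Guess the bottleneck from sar/vmstat style percentages and swap rates."""
    if si > 0 or so > 0:
        return "memory"  # swap traffic points at insufficient memory
    if iowait >= iowait_threshold:
        return "io"
    if user + system >= busy_threshold:
        return "cpu"
    return "none"


# Second vmstat sample above: us=4, sy=6, wa=90, si=0, so=544.
kind = classify_load(user=4, system=6, iowait=90, si=0, so=544)
print(kind, "->", ADVICE[kind])
```

Swap is checked first on purpose: heavy swapping also drives iowait up, so testing iowait first would hide the real cause.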
Phew, putting this together wore me out. I hope it helps you track down the cause of a load problem.