[JAVA] TOMCAT murder case

The OOM Killer's murderous intent

A customer's Tomcat on EC2 was suddenly killed by the OOM Killer early one morning, with no traffic at all. Some of you may be facing similar problems, so I am writing this up as a memorandum. I think it will also be a little useful for Java heap tuning. (This article assumes Java 8.)

The OOM Killer is a standard Linux mechanism that, when the system runs out of memory, kills memory-hungry processes to protect the server as a whole. Tomcat was cut down by it with no questions asked, a classic case of tsujigiri. Since it was early morning, nothing was recorded in the access log, and the only thing left in catalina.out was this:

OpenJDK 64-Bit Server VM warning: Setting LargePageSizeInBytes has no effect on this OS. Large page size is 2048K.
OpenJDK 64-Bit Server VM warning: Failed to reserve large pages memory req_addr: 0x00000006c0000000 bytes: 4294967296 (errno = 12)

The OOM Killer ran immediately after these lines were logged. Eh, large page memory? bytes: 4294967296? Unable to quite swallow the situation, I first looked at memory with the free command. Since this was right after a reboot, there was naturally plenty of free memory. However, one astonishing fact was revealed: the swap total was 0. This is the AWS default. The thinking is presumably that on an elastic cloud server swapping is not the done thing and you should simply size physical memory accordingly, but this customer had no such elastic contract, so the OOM Killer went on its tsujigiri rampage.
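As a quick check, commands like the following (assuming a typical Linux EC2 instance with the usual procps and util-linux tools) show whether any swap is configured at all:

# free -m       # look at the "Swap:" row; a total of 0 means no swap space exists
# swapon -s     # lists active swap devices; empty output means none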

Checking the Java environment variables to reveal the truth behind the tsujigiri

The missing swap is one cause of the tsujigiri, but it does not yet explain why the machine ran out of memory in the first place. For the time being, I used the top command, sorted by memory usage in descending order, and checked the running processes. It turned out that, apart from Java, a virus scanner agent and a log collection agent were consuming about 1.6GB of memory between them. At this point, make a note of the Java PID.
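For reference, sorting processes by memory can be done roughly like this (the -o option requires a reasonably recent top; on older versions press Shift+M inside top instead):

# ps aux --sort=-%mem | head -n 10     # top 10 processes by memory usage, with their PIDs
# top -o %MEM                          # interactive view sorted by memory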

Since (errno = 12) makes it fairly certain that a memory allocation failed, I suspected something was wrong with the initial settings. First, check the environment variables: open setenv.sh and have a look.

-XX:NewSize=1024m ・・・ Initial size of the New area
-XX:MaxNewSize=1024m ・・・ Maximum size of the New area
-Xms4096m ・・・ Minimum size of the entire Java heap
-Xmx4096m ・・・ Maximum size of the entire Java heap
-XX:SurvivorRatio=12 ・・・ Ratio within the New area: Eden is 12 for each Survivor space of 1
-XX:TargetSurvivorRatio=90 ・・・ Target usage rate of a Survivor space after a minor GC (90%)
-XX:MaxTenuringThreshold=15 ・・・ Number of minor GCs an object survives in the Survivor spaces before promotion
-XX:LargePageSizeInBytes=256m ・・・ Large page size to use
-XX:MetaspaceSize=384m ・・・ Initial Metaspace size
-XX:MaxMetaspaceSize=384m ・・・ Maximum Metaspace size
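In a typical Tomcat installation these options are passed through CATALINA_OPTS in bin/setenv.sh; the snippet below is only an illustrative sketch of that shape, not the customer's actual file:

# bin/setenv.sh (illustrative sketch)
export CATALINA_OPTS="$CATALINA_OPTS -Xms4096m -Xmx4096m \
  -XX:NewSize=1024m -XX:MaxNewSize=1024m \
  -XX:SurvivorRatio=12 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=15 \
  -XX:MetaspaceSize=384m -XX:MaxMetaspaceSize=384m \
  -XX:LargePageSizeInBytes=256m"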

It seems the middleware supplier had changed these from Tomcat's defaults, increasing the Java heap considerably. And yes, there is a large page setting, although it is 256MB. Next, check the Java heap with jmap.

# jmap -heap [PID]

Entering the Java process ID in [PID] shows the details of the Java heap. jmap -heap also reveals the default values of settings that are not defined in setenv.sh, so you can see the whole picture. Among the settings not defined in setenv.sh was:

-XX:CompressedClassSpaceSize=1024m ・・・ Size of the compressed class space (this is the default value)

The size of the entire Java heap is set with -Xms/-Xmx, so it is 4GB. The New area, which holds newly created objects, is set to 1GB with -XX:NewSize. The New area is divided into Eden and Survivor, and -XX:SurvivorRatio is the multiplier that determines how large Eden is relative to a single Survivor space; in other words, within the 1GB New area the ratio is Survivor 1 to Eden 12. In this case the entire Java heap is 4GB and the New area is 1GB, so the remaining 3GB is the Old area.
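If you want to double-check which values are actually in effect, the standard JDK 8 tools below should show them as well (shown here only as a sketch of the kind of check you can do):

# jcmd [PID] VM.flags                                       # flags explicitly set on the running JVM
# java -XX:+PrintFlagsFinal -version | grep -i ClassSpace   # default/final values such as CompressedClassSpaceSize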

A quick review of the Java heap

Now let's review the Java heap a bit. The Java heap consists of a New area and an Old area. The New area is further divided into the Eden area and the Survivor area. The Survivor area further consists of Survivor0 and Survivor1.

Newly created objects first go into Eden. When Eden fills up, a minor garbage collection (GC) runs, sifting out the objects that are still alive and moving them to Survivor0. The next time Eden fills up, another minor GC runs and this time moves the surviving objects to Survivor1, at which point Survivor0 is empty. The surviving objects keep being sifted back and forth between Survivor0 and Survivor1, up to the number of times defined by -XX:MaxTenuringThreshold. Because one of the two Survivor spaces is always empty, they can effectively be treated as a single space. An object brave enough to survive that many GCs finally moves to the Old area as an old soldier.

When the Old area eventually fills up, a full GC runs to clean up memory. In fact, the Java heap is not the only memory area the JVM uses: JVM memory can be roughly divided into the Java heap and the native heap. The native heap consists of Metaspace, CompressedClassSpace, the C heap, and the thread stacks.
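If you want to actually watch these minor and full GCs happen, Java 8 GC logging can be turned on with the standard HotSpot options below (the log path is just an example):

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/tomcat/gc.log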

This structure is a little hard to grasp, so I will summarize the memory areas in bullet points.

- Java heap
  - New area ・・・ Area for new objects
    - Eden ・・・ Area where new objects are first placed
    - Survivor ・・・ Area for objects that survived a GC (Survivor0, Survivor1)
  - Old area ・・・ Area for long-lived objects that keep surviving (Tenured)
- Native heap
  - Metaspace ・・・ Area for class and method metadata
  - CompressedClassSpace ・・・ Like Metaspace, an area for class metadata; it reserves 1GB by default
  - C heap ・・・ Area for the resources of the JVM itself
  - Thread stacks ・・・ Stack area allocated for each thread

The figure below maps out these memory allocations, with the parameters from the environment variables above applied. Since CompressedClassSpace reserves 1GB by default, the native heap alone consumes nearly 2GB once the other parts are included.

(Figure: memory map of the Java heap and native heap with the above parameters applied)

In other words, memory was already close to full just from the initial settings. By the time a full GC would start, memory was nearly exhausted, so it was a duel on Ganryujima: would the full GC win, or would out-of-memory win? Most likely, the old soldiers that had piled up in the Old area revolted at the moment of the full GC, and the machine ran out of memory.
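Adding up the figures that have appeared so far gives a rough, back-of-the-envelope picture of why there was no headroom left:

Java heap (-Xms/-Xmx)                        4.0 GB
Metaspace (-XX:MaxMetaspaceSize)             0.4 GB
CompressedClassSpace (default reservation)   1.0 GB
Virus scanner and log collection agents      1.6 GB
----------------------------------------------------
Total                                        about 7 GB, before the C heap, thread stacks, and the OS itself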

Check with jstat

jstat is a quick way to see the actual state of the Java heap in real time. It has many options and it is hard to grasp them all, but the following two are easy for beginners like us to understand.

# jstat -gc -t [PID] 1000
# jstat -gcutil -t [PID] 1000 

-t prepends a timestamp to each line, and the trailing 1000 means that the information for the target PID is output every 1000 milliseconds. -gc outputs the usage of each item in KB, while -gcutil shows the usage of each item as a percentage. The output items are as follows.

List of items for jstat -gc

All values from S0C through CCSU are in KB; GC counts are numbers of events and GC times are in seconds.

S0C ・・・ Capacity of Survivor0
S1C ・・・ Capacity of Survivor1
S0U ・・・ Actual usage of Survivor0
S1U ・・・ Actual usage of Survivor1
EC ・・・ Capacity of Eden
EU ・・・ Actual usage of Eden
OC ・・・ Capacity of the Old area
OU ・・・ Actual usage of the Old area
MC ・・・ Capacity of Metaspace
MU ・・・ Actual usage of Metaspace
CCSC ・・・ Capacity of the compressed class space
CCSU ・・・ Actual usage of the compressed class space
YGC ・・・ Number of young-generation GC events
YGCT ・・・ Cumulative young-generation GC time
FGC ・・・ Number of full GC events
FGCT ・・・ Cumulative full GC time
GCT ・・・ Total cumulative GC time

List of items for jstat -gcutil

All values from S0 through CCS are percentages.

S0 ・・・ Usage rate of Survivor0
S1 ・・・ Usage rate of Survivor1
E ・・・ Usage rate of Eden
O ・・・ Usage rate of the Old area
M ・・・ Usage rate of Metaspace
CCS ・・・ Usage rate of the compressed class space
YGC ・・・ Number of young-generation GC events
YGCT ・・・ Cumulative young-generation GC time
FGC ・・・ Number of full GC events
FGCT ・・・ Cumulative full GC time
GCT ・・・ Total cumulative GC time

If you monitor with jstat, you can catch the moment a minor GC runs: when Eden really does hit 100%, the Survivor columns are rewritten. If Eden drops back toward 0 and each minor GC takes 0.1 seconds or less, the New area is probably fine (if that reasoning is off, please point it out...). On the other hand, with 1GB allocated to the New area, I do wonder whether the Old area really needs to be three times as large these days. Having a full GC that effectively never runs while zombie-like objects pile up until the OOM Killer strikes at the end is nonsense. If the New area is that large, I think the Old area should be about the same size.
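As a rough way to estimate the time per minor GC from the jstat columns, divide the cumulative time by the number of events; the numbers below are purely hypothetical, only to show the calculation:

average minor GC time = YGCT / YGC
e.g. YGCT = 9.6 s over YGC = 120 events → 9.6 / 120 = 0.08 s per minor GC (within the 0.1 s guideline above)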

Conclusion

Getting the settings wrong can be fatal. Let's be careful.

More people bookmarked this than I had imagined, so let me supplement the conclusion a little. First of all, as I wrote in the comment section, the direct cause of the failure was that settings the supplier had delivered in the past for a 32GB-memory server were applied as-is to this 8GB-memory server. As a result, the combined usage of the Java heap, the native heap, and everything else swelled to roughly the size of physical memory and overflowed at the timing of a full GC. When operating at the 8GB level, as in this case, something like the following is a reasonable starting point, to be tuned diligently while watching how it actually behaves.

-XX:NewSize=700m 
-XX:MaxNewSize=700m 
-Xms2048m 
-Xmx2048m 
-XX:MetaspaceSize=300m
-XX:MaxMetaspaceSize=300m 
-XX:SurvivorRatio=6 
-XX:TargetSurvivorRatio=9 
-XX:MaxTenuringThreshold=15

(Even so, the above is only a rough guide. Be sure to tune it yourself: conditions differ completely depending on the application and the server environment, so you need settings that suit each environment. Please note that enoshiman cannot take any responsibility if you copy the above settings wholesale and something goes wrong.) When the settings are as obviously wrong as they were this time, you can spot the problem fairly quickly, but in reality there can be all sorts of other factors, such as a memory leak on the application side or simply insufficient physical memory. It is also effective to take heap dumps over a certain period and check whether any numbers look abnormal, or to visualize things with VisualVM as pointed out in the comments. Once you get used to it, you can pick up the changes with jstat almost by feel. Watching the transitions in the actual usage of each area and in the GC times, if usage does not drop properly, GC takes a long time, or the pattern is uneven, something may be wrong. And watching a GC that is working properly is oddly satisfying. (That makes it sound like a strange hobby...)
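As a concrete example of taking a heap dump for later analysis, the standard jmap invocation looks roughly like this (the output path is just an example); the resulting .hprof file can then be opened in VisualVM or a similar tool:

# jmap -dump:live,format=b,file=/tmp/tomcat-heap.hprof [PID]     # "live" forces a full GC first, so only reachable objects are dumped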
