This article summarizes Chapter 4 of Java Performance (O'Reilly Japan).


4.1 Overview of JIT compiler

Examples of differences between compiler and interpreter languages

Consider reading two values from main memory and adding them. A good compiler issues the reads, executes some other instructions while the data is being fetched, and only then performs the addition. An interpreter cannot do this because it sees only one line at a time.

Interpreters have the advantage of portability

A newer CPU can execute almost all the instructions of its predecessors, but not the other way around (for example, the AVX instructions of Intel's Sandy Bridge processors). Compiled code therefore has to target a particular CPU. One workaround is to put the performance-critical processing in shared libraries prepared for each CPU.

Java sits in between: source code is first compiled to Java bytecode, and the bytecode is then compiled to native code for each platform while the program is running.

Hotspot compilation

- In most programs, only a small part of the code runs frequently.
- The JVM does not compile code as soon as it starts executing it, for two reasons:
  - Compiling code that is executed only once is wasted effort, so the JVM starts compiling only after the code has been called frequently.
  - The more often code runs before compilation, the more information is available for optimization.

For example, to execute code like `b = obj1.equals(obj2)`, the JVM has to check the type (class) of `obj1` to find out which `equals()` method to run. If `obj1` always turns out to be a `java.lang.String` when this code is executed, the compiler can optimize it into a direct call to `String.equals()`.

Registers and main memory

One important optimization is deciding when a value is kept in a register instead of being written back to main memory. Consider the following example.

public class RegisterTest {
  private int sum;
  public void calculateSum(int n) {
    for (int i = 0; i < n; i++) {
      sum += i;
    }
  }
}

For this code, instead of reading sum from main memory and writing it back on every iteration, the compiler may load sum into a register, run the loop on the register, and store the result back to sum in main memory at the end.

A value held in a register cannot be seen by other threads (see Chapter 9). Registers are used particularly aggressively when escape analysis (discussed at the end of this chapter) is enabled.
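The register optimization described above can be written out by hand. The following is only a sketch of the idea: the class name `RegisterSketch` and the `getSum` helper are made up for illustration, and the real JIT performs this transformation on machine code, not on Java source.

```java
// Sketch: a hand-written equivalent of keeping `sum` in a register.
// RegisterSketch and getSum are hypothetical names used for illustration.
final class RegisterSketch {
    private int sum;

    // Same result as calculateSum in the article, but the field is read
    // once, updated in a local variable, and written back once.
    void calculateSum(int n) {
        int local = sum;              // single read of the field
        for (int i = 0; i < n; i++) {
            local += i;               // a local variable can live in a register
        }
        sum = local;                  // single write back to the field
    }

    int getSum() {
        return sum;
    }

    public static void main(String[] args) {
        RegisterSketch r = new RegisterSketch();
        r.calculateSum(10);
        System.out.println(r.getSum()); // 0 + 1 + ... + 9 = 45
    }
}
```

While the loop is running, another thread reading `sum` would not see the intermediate values, which is exactly the visibility issue mentioned for registers.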

4.2 Basic tuning (client compiler and server compiler)

There are two compilers, a client compiler and a server compiler. The names come from the command-line arguments -client and -server that select them.

In most cases the flags that select the compiler do not use the -XX form. The exception is tiered compilation (called hierarchical compilation in the rest of this article), which is enabled with -XX:+TieredCompilation and requires the server compiler.

The client compiler starts compiling earlier, so it is faster in the early stages of execution. The server compiler, on the other hand, takes more time because it optimizes more heavily.

With hierarchical compilation, code is first compiled by the client compiler and then compiled again by the server compiler once it becomes "hot". Hierarchical compilation is enabled by default in Java 8.

Hierarchical compilation in Java 7 had some quirks; for example, the size of the JVM code cache was easily exceeded.

4.3 Java and JIT compiler versions

There are three versions:

- 32-bit client compiler (-client)
- 32-bit server compiler (-server)
- 64-bit server compiler (-d64)

On a 32-bit OS, the JVM must also be the 32-bit version. On a 64-bit OS, you can use either JVM.

If the heap is 3 GB or less, the 32-bit version uses less memory and is faster, apparently because 32-bit memory references are cheaper than 64-bit ones.

Chapter 8 discusses compressed ordinary object pointers (compressed oops), which let even a 64-bit JVM use 32-bit addresses. However, the native code it executes still uses 64-bit addresses, so it consumes more memory.

In programs that make heavy use of 8-byte types (long, double), the 32-bit JVM is slower because it cannot use the CPU's 64-bit registers.

A 32-bit OS has a 4 GB (2^32 bytes) limit.

Related official documentation: CompressedOops

The compiler used by default is shown in the output of:

java -version
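The same information can also be read from inside a running program through system properties. This is a small sketch: `java.vm.name` and `os.arch` are standard properties, while `java.vm.info` (which reports strings such as "mixed mode") is assumed to be set because HotSpot provides it; the class name `VmInfo` is made up.

```java
// Sketch: print which VM is running, similar to the last line of `java -version`.
// VmInfo is a hypothetical name; java.vm.info is HotSpot-specific, hence the fallback.
final class VmInfo {
    static String describe() {
        String name = System.getProperty("java.vm.name");             // e.g. a "64-Bit Server VM" string
        String info = System.getProperty("java.vm.info", "unknown");  // e.g. "mixed mode"
        String arch = System.getProperty("os.arch");
        return name + " (" + info + "), " + arch;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```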

Related official documentation: java

4.4 Intermediate tuning

Code cache tuning

When the JVM compiles code, the resulting assembly-language instructions are stored in the code cache. The code cache has a fixed size. Once it is full, the JVM cannot compile any more code, and the remaining code runs in the interpreter.

With hierarchical compilation on Java 7, the default size often runs out (some system integrators still do not allow upgrading to Java 8 or later, and in that case you run into exactly this kind of trouble).

There is no way to know how much code cache your application needs, so you have to run it and check if it's enough (see below for how to check).

- -XX:InitialCodeCacheSize=N: initial size of the code cache. The default in my environment was 2,555,904 bytes.
- -XX:ReservedCodeCacheSize=N: maximum size of the code cache. The default in my environment was 251,658,240 bytes.
- -XX:CodeCacheExpansionSize=N: the increment by which the code cache is expanded. The default in my environment was 65,536 bytes.

My environment:

momose@momose-pc:~$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu219.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu219.04.1, mixed mode, sharing)
momose@momose-pc:~$
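One way to look up these defaults on your own JVM is to query the flags at runtime. The sketch below uses `HotSpotDiagnosticMXBean` from the `com.sun.management` package, so it assumes a HotSpot-based JVM; the class name `CodeCacheFlags` and the "n/a" fallback are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: read the current values of the code cache flags from a running JVM.
// Assumes a HotSpot JVM; CodeCacheFlags is a hypothetical name.
final class CodeCacheFlags {
    // Returns the flag value as a string, or "n/a" if the flag or the bean is unavailable.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown flag (IllegalArgumentException) or no such bean on this JVM.
            return "n/a";
        }
    }

    public static void main(String[] args) {
        String[] flags = {"InitialCodeCacheSize", "ReservedCodeCacheSize", "CodeCacheExpansionSize"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

The values depend on the JVM version and platform, so they will not necessarily match the numbers listed above.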

What happens if you specify a large ReservedCodeCacheSize, such as 1 GB, so that you never have to worry about running out of code cache? The JVM reserves 1 GB of native memory, but the memory is not allocated until it is actually used.

- A large reservation does not cause a performance problem by itself.
- You cannot reserve more than physical memory plus virtual memory. When several JVMs are started, a reservation can also fail because another JVM has already reserved the memory.

Reserving memory and allocating memory are different things (see Section 8.1 for details).

Reference: [\[tips\]\[Java\] How to check CodeCache area usage - Akira's Tech Notes](http://luozengbin.github.io/blog/2015-09-01-%5Btips%5D%5Bjava%5Dcodecache%E9%A0%98%E5%9F%9F%E4%BD%BF%E7%94%A8%E7%8A%B6%E6%B3%81%E3%81%AE%E7%A2%BA%E8%AA%8D%E6%96%B9%E6%B3%95.html)

You can use jconsole to monitor the size of the code cache. Up to Java 8, selecting the Code Cache memory pool in the Memory panel displays a graph.

From Java 9 onward, the code cache is reportedly managed in areas called code heaps, which appear in the Memory tab of jconsole.

According to the description of -XX:+SegmentedCodeCache in the Java 9 java command reference (https://docs.oracle.com/javase/jp/9/tools/java.htm), dividing the code cache into segments reduces fragmentation of the code and improves efficiency.

According to "Java9 (based on Oracle JVM) catchup" on Qiita, the code cache is divided into three segments by code type.
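Code cache usage can also be checked from code instead of jconsole. The sketch below lists the memory pools whose names contain "Code" through the standard `java.lang.management` API. The pool names ("Code Cache" on Java 8, several "CodeHeap ..." pools on later versions) are HotSpot conventions and are an assumption here; the class name `CodeCacheUsage` is made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: report used bytes for each code cache memory pool.
// Matching on the substring "Code" is an assumption about HotSpot pool names.
final class CodeCacheUsage {
    static Map<String, Long> usedBytesByPool() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is not valid
            if (name.contains("Code") && usage != null) {
                result.put(name, usage.getUsed());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        usedBytesByPool().forEach((name, used) ->
                System.out.println(name + ": " + used + " bytes used"));
    }
}
```

Comparing the reported usage against ReservedCodeCacheSize gives a rough idea of how close the application is to filling the code cache.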

Compile threshold

The number of executions affects when the compilation occurs.

There is only one case where you should adjust the compilation threshold. When the sum of the following two values exceeds the threshold, the method is queued for compilation. This article calls this "standard compilation" (not an official name).

- Call counter: the number of times the method has been called
- Back-edge counter: the number of times execution has returned to the top of a loop in the method (roughly the number of loop iterations executed)

Standard compilation alone does not work well for a long-running loop inside a single long method. For that case, when the back-edge counter exceeds its own threshold, only the loop is compiled. This is called OSR (on-stack replacement) compilation.

The threshold for standard compilation can be specified with the -XX:CompileThreshold=N flag. The default value is 1500 for the client compiler and 10000 for the server compiler.

OSR compilation threshold conditions

Back-edge counter value > CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage) / 100

- -XX:InterpreterProfilePercentage=N defaults to 33
- -XX:OnStackReplacePercentage=N defaults to 933 for the client compiler and 140 for the server compiler

So the OSR threshold for the client compiler is

1500 * (933 - 33) / 100 = 13500

and for the server compiler it is

10000 * (140 - 33) / 100 = 10700
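The threshold arithmetic above can be checked with a few lines of code. This is just the formula from this section written as a function; the class and method names are made up, and the default values passed in are the ones quoted in this article.

```java
// Sketch: the OSR threshold formula from this section as a function.
// OsrThreshold and osrThreshold are hypothetical names.
final class OsrThreshold {
    static long osrThreshold(long compileThreshold,
                             long onStackReplacePercentage,
                             long interpreterProfilePercentage) {
        return compileThreshold
                * (onStackReplacePercentage - interpreterProfilePercentage) / 100;
    }

    public static void main(String[] args) {
        System.out.println(osrThreshold(1500, 933, 33));   // client compiler: 13500
        System.out.println(osrThreshold(10000, 140, 33));  // server compiler: 10700
    }
}
```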

Hierarchical compilation uses a different set of flags for its thresholds.

Each time the JVM reaches a safepoint, the value of each counter is decremented. The counters therefore reflect recent activity only, and not every method is eventually compiled. Some "lukewarm" methods run reasonably often but never become hot enough to be compiled (which is also one of the reasons hierarchical compilation is fast).

With the -XX:+PrintCompilation flag (default false), a log line in the following format is printed every time a method is compiled:

Timestamp  compile ID  attributes  (hierarchical compilation level)  method name  size  deoptimization

The attributes are:

- %: OSR compilation
- s: the method is synchronized
- !: the method has an exception handler
- b: compilation occurred in blocking mode (not output by current Java versions)
- n: the compiler generated a wrapper for a native method

The size is the size of the Java bytecode. If the code has been deoptimized, a message saying so is added to the line.
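To see such a log for yourself, it helps to have a tiny program with an obviously hot method. The sketch below is a made-up example (`HotLoop` and `sumTo` are hypothetical names): it calls a small method many times so that both its call counter and back-edge counter grow quickly. Running it with `java -XX:+PrintCompilation HotLoop` should normally produce lines mentioning `HotLoop::sumTo`, although the exact lines, levels and attributes depend on the JVM version and settings.

```java
// Sketch: a small workload intended to trigger JIT compilation.
// Run with: java -XX:+PrintCompilation HotLoop
final class HotLoop {
    // Called many times; each call also executes a loop (back edges).
    static long sumTo(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        long result = 0;
        for (int round = 0; round < 20_000; round++) {
            result += sumTo(1_000);
        }
        // Print the result so the work cannot be discarded as unused.
        System.out.println(result);
    }
}
```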

You can also get compilation information from a Java program that is already running:

jstat -compiler ${Process ID}

You can also display the most recently compiled method every 1000 milliseconds:

jstat -printcompilation ${Process ID} 1000
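A rough in-process counterpart to `jstat -compiler` is the standard `CompilationMXBean`. The sketch below reports the JIT compiler's name and accumulated compilation time; it does not give per-method details like jstat does. The class name `JitStats` is made up.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Sketch: summarize JIT activity of the current JVM via CompilationMXBean.
// JitStats is a hypothetical name.
final class JitStats {
    static String summary() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            return "no JIT compiler in this JVM"; // e.g. an interpreter-only VM
        }
        if (!jit.isCompilationTimeMonitoringSupported()) {
            return jit.getName();
        }
        return jit.getName() + ": " + jit.getTotalCompilationTime()
                + " ms total compilation time";
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```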

OSR compilation is often time consuming.

4.5 Advanced tuning

The content here is quite specialized and seems to be aimed at JVM engineers. You are unlikely to ever apply the tuning described in this section.

Compiler thread

Compilation runs asynchronously, and the number of compiler threads depends on the number of CPUs and the type of compiler. The number of threads can be changed with -XX:CICompilerCount=N. With hierarchical compilation, one third of the threads are used for client compilation and the rest for server compilation.

Specifying -XX:-BackgroundCompilation makes compilation synchronous instead of asynchronous (background compilation is enabled by default).

Inline

Code accessed through getters and setters is inlined by modern compilers. Inlining is enabled by default and can be disabled with -XX:-Inline.

Whether a method is inlined depends on how hot it is and on its bytecode size.

A hot method is inlined if its bytecode size is 325 bytes or less (changeable with -XX:MaxFreqInlineSize=N). A method of 35 bytes or less (changeable with -XX:MaxInlineSize=N) is inlined unconditionally.
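To make the effect of inlining concrete, the sketch below shows the same loop twice: once through a getter and setter as you would normally write it, and once with direct field access, which is roughly what is left after the compiler inlines the accessors. All names (`InlineSketch`, `Point`, the two methods) are made up, and the second method is only a source-level illustration of what the JIT does on machine code.

```java
// Sketch: what inlining of trivial accessors amounts to.
// InlineSketch and Point are hypothetical names.
final class InlineSketch {
    static final class Point {
        private int x;
        int getX() { return x; }
        void setX(int x) { this.x = x; }
    }

    // As written by the programmer: two method calls per iteration.
    static int viaAccessors(int start, int rounds) {
        Point p = new Point();
        p.setX(start);
        for (int i = 0; i < rounds; i++) {
            p.setX(p.getX() + 2);
        }
        return p.getX();
    }

    // Roughly the shape of the code after inlining: plain field access.
    static int afterInlining(int start, int rounds) {
        Point p = new Point();
        p.x = start;
        for (int i = 0; i < rounds; i++) {
            p.x = p.x + 2;
        }
        return p.x;
    }

    public static void main(String[] args) {
        System.out.println(viaAccessors(1, 5));  // 1 + 5 * 2 = 11
        System.out.println(afterInlining(1, 5)); // 11
    }
}
```

Both accessors are only a few bytes of bytecode, well under the 35-byte limit mentioned above.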

Escape analysis

Escape analysis is an optimization that is performed when -XX:+DoEscapeAnalysis is enabled (it is by default). It does a variety of things. For example, take this class:

public class Factorial {
    private BigInteger factorial;
    private int n;
    public Factorial(int n) {
        this.n = n;
    }
    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...;
        return factorial;
    }
}

and this code that uses it:

ArrayList<BigInteger> list = new ArrayList<BigInteger>();
for (int i = 0; i < 100; i++) {
    Factorial factorial = new Factorial(i);
    list.add(factorial.getFactorial());
}

Because each factorial object is referenced only inside the loop, advanced optimizations such as the following become possible (in rare cases this can lead to bugs):

- The getFactorial() method does not need to synchronize
- The values of the fields n and factorial are kept in registers instead of memory
- No factorial object is actually allocated; only its fields are tracked
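The effect of these optimizations can be illustrated with a simpler, made-up example. In the sketch below, `Pair` is a hypothetical class whose instances never leave the loop that creates them, and `afterEscapeAnalysis` shows by hand roughly what the compiler is allowed to reduce the loop to: no lock, no allocation, fields treated as plain local values. Whether a given JVM actually performs this transformation is not guaranteed.

```java
// Sketch: a non-escaping object and a hand-written equivalent without it.
// EscapeSketch and Pair are hypothetical names.
final class EscapeSketch {
    static final class Pair {
        private final int a;
        private final int b;
        Pair(int a, int b) { this.a = a; this.b = b; }
        synchronized int product() { return a * b; }
    }

    // As written: allocates and locks a Pair on every iteration.
    static long asWritten(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Pair p = new Pair(i, i + 1);  // p is never visible outside this loop
            total += p.product();
        }
        return total;
    }

    // Roughly what remains once the object is known not to escape.
    static long afterEscapeAnalysis(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            int a = i;                    // fields become local values
            int b = i + 1;
            total += a * b;               // no allocation, no synchronization
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(asWritten(4));           // 0 + 2 + 6 + 12 = 20
        System.out.println(afterEscapeAnalysis(4)); // 20
    }
}
```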

4.6 Deoptimization

Deoptimization appears in the compilation log in two forms: code is first "made not entrant", and later it is "made zombie" and cleaned up in connection with GC.

Not entrant code

There are two causes. The first: the compiler optimizes a call through an interface by tying it to a specific implementation class, and when that assumption no longer holds, the code is deoptimized. The second comes from how hierarchical compilation is implemented: when the server compiler finishes compiling a method, the version compiled earlier is marked as not entrant.

Zombie code

If "made zombie" appears in the compilation log, the not entrant code has been abandoned and is discarded once GC has run.

↓ Reference materials Java-JA13-Architect-evans.pdf

4.7 Levels of hierarchical compilation

- 0: code executed by the interpreter
- 1: code compiled by the client compiler in simple mode
- 2: code compiled by the client compiler in limited mode
- 3: code compiled by the client compiler in full mode
- 4: code compiled by the server compiler

All code starts at level 0. In most cases it is first compiled at level 3 and then at level 4. Levels 1 and 2 are reportedly used when the compiler queue is full (they compile quickly because they do not use the profiler). Naturally, deoptimized code goes back to level 0.

4.8 Summary

- Hierarchical compilation gives the best results
- Small methods are inlined
- Compilation is handled through a queue
- The code cache has an upper size limit
- Simple code benefits more easily from optimization

Also, the final modifier seems to **have no effect on performance**.
