A summary of Chapter 4 of the O'Reilly Japan book *Java Performance*.
- Chapter 1: Introduction - Qiita
- Chapter 2: Performance Testing Approach - Qiita
- Chapter 3: Java Performance Toolbox - Qiita ← previous article
- Chapter 4: How the JIT Compiler Works - Qiita ← this article
- Chapter 5: Basics of Garbage Collection - Qiita ← next article
Consider reading two values from main memory and adding them. Because the memory read is slow, a good compiler issues the load, executes other useful instructions while the data arrives, and only then performs the addition. An interpreter cannot do this, because it sees only one statement at a time.
Interpreters have the advantage of portability
A newer CPU generation can execute almost all the instructions of the previous generation, but not the other way around (for example, the AVX instructions introduced with Intel's Sandy Bridge processors). One workaround is to put the performance-critical processing in a shared library built separately for each CPU.
Java sits in an intermediate position: source code is compiled to Java bytecode, which is then compiled to native code for each platform at runtime.
- Most programs spend most of their time in only a small part of their code.
- The JVM does not compile code as soon as it starts executing it, for two reasons:
  - Compiling code that runs only once is wasted effort; compilation starts only after code has been called frequently.
  - The more executions before compilation, the more information is available for optimization.
For example, take code like `b = obj1.equals(obj2)`. To find out which `equals()` method should execute, the JVM checks the type (class) of `obj1`. If `obj1` always turns out to be a `String` when this code runs, the JIT optimizes the call into a direct call to `java.lang.String#equals`.
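A minimal sketch of such a monomorphic call site (the class name and loop count are illustrative, not from the book):

```java
public class EqualsDemo {
    static boolean run() {
        Object obj1 = "hello";
        Object obj2 = "hello";
        boolean b = false;
        // After thousands of calls this loop becomes hot; since obj1 is
        // always a String here, the JIT can replace the virtual dispatch
        // with a direct call to String.equals().
        for (int i = 0; i < 20_000; i++) {
            b = obj1.equals(obj2);
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "true"
    }
}
```

Running this with `-XX:+PrintCompilation` should eventually show the containing method being compiled.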
Another example of compiler optimization is keeping values in registers instead of repeatedly reading them from main memory. Consider the following code:
public class RegisterTest {
    private int sum;

    public void calculateSum(int n) {
        for (int i = 0; i < n; i++) {
            sum += i;
        }
    }
}
For this code, instead of reading `sum` from main memory on every iteration, the compiler may optimize by holding `sum` in a register for the duration of the loop and writing the accumulated result back to `sum` in main memory afterwards.
Registers cannot be read by other threads (see Chapter 9). Registers are used especially aggressively when escape analysis (discussed at the end of this chapter) is enabled.
There are two compilers, client and server, so called after the command-line flags `-client` and `-server` used to select them.
In most cases, `-XX` flags are not used to choose the compiler. The exception is tiered compilation, enabled with `-XX:+TieredCompilation`, which requires the server compiler.
The client compiler starts compiling early, so it is fast in the early stages of a run; the server compiler takes more time but optimizes more aggressively. Tiered compilation combines them: the server compiler recompiles code once it becomes "hot". Tiered compilation is enabled by default in Java 8.
Tiered compilation in Java 7 was quirky; for example, the JVM's code cache could quickly overflow.
There are the following three versions:

- 32-bit client compiler (`-client`)
- 32-bit server compiler (`-server`)
- 64-bit server compiler (`-d64`)
If it is a 32-bit OS, the JVM must also be a 32-bit version. With a 64-bit OS, you can use either JVM.
If the heap is 3 GB or smaller, the 32-bit version uses less memory and is faster, apparently because 32-bit memory references are cheaper than 64-bit ones.
Chapter 8 discusses compressed ordinary object pointers (compressed oops), which let even a 64-bit JVM use 32-bit addresses for objects. However, the native code it executes still uses 64-bit addresses, so it uses more memory.
Programs that make heavy use of 8-byte types (`long`, `double`) are slow on a 32-bit JVM, because the CPU's 64-bit registers cannot be used.
A 32-bit OS is limited to a 4 GB (2^32 byte) address space.
↓ Related official materials CompressedOops
The default Java compiler is displayed with:

java -version

↓ Related official materials: java
When the JVM compiles code, the resulting set of assembly-language instructions is stored in the code cache. The size of the code cache is fixed; once it is full, no more code can be compiled, and the remaining code runs in the interpreter.
With tiered compilation on Java 7, the default size of the code cache often runs out. (Some SIers still won't let me raise the Java version to 8 or later, and in that case this kind of trouble comes up...)
There is no way to know how much code cache your application needs, so you have to run it and check if it's enough (see below for how to check).
- `-XX:InitialCodeCacheSize=N`: initial code cache size. Defaulted to 2,555,904 bytes in my environment.
- `-XX:ReservedCodeCacheSize=N`: maximum size. Defaulted to 251,658,240 bytes in my environment.
- `-XX:CodeCacheExpansionSize=N`: size by which the code cache grows. Defaulted to 65,536 bytes in my environment.
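Besides jstat and jconsole (covered below), you can check code cache usage from inside a running program with the standard `MemoryPoolMXBean` API. A sketch; the pool names vary by Java version, and the `codeCacheUsed` helper is my own:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class CodeCacheUsage {
    // Sums the used bytes of every code-cache-related memory pool.
    static long codeCacheUsed() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Java 8 has a single "Code Cache" pool; Java 9+ segments it into
            // pools such as "CodeHeap 'non-nmethods'" and "CodeHeap 'profiled nmethods'".
            if (name.contains("Code")) {
                System.out.printf("%s: used=%d bytes%n", name, pool.getUsage().getUsed());
                used += pool.getUsage().getUsed();
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("total: " + codeCacheUsed());
    }
}
```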
momose@momose-pc:~$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu219.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu219.04.1, mixed mode, sharing)
momose@momose-pc:~$
What happens if you specify a large `ReservedCodeCacheSize`, say 1 GB, so that you never have to worry about running out of code cache? The JVM reserves 1 GB of native memory, but it is not actually allocated until it is used.
- Reserving a large amount of memory is not in itself a performance problem.
- You cannot reserve more than physical memory plus virtual memory. When multiple JVMs are started, a reservation can also fail because another JVM has already reserved the memory.
↓ Reference materials: [[tips][Java] How to check CodeCache area usage - Akira's Tech Notes](http://luozengbin.github.io/blog/2015-09-01-%5Btips%5D%5Bjava%5Dcodecache%E9%A0%98%E5%9F%9F%E4%BD%BF%E7%94%A8%E7%8A%B6%E6%B3%81%E3%81%AE%E7%A2%BA%E8%AA%8D%E6%96%B9%E6%B3%95.html)
You can use jconsole to monitor the size of the code cache: selecting the "Memory Pool Code Cache" item in the Memory panel displays a graph (up to Java 8). From Java 9 onward, the cache is managed in areas called code heaps. ↓ Memory tab in jconsole
According to the description of `-XX:+SegmentedCodeCache` in the [java command reference](https://docs.oracle.com/javase/jp/9/tools/java.htm), dividing the code cache into segments reduces code fragmentation and improves efficiency. According to Java 9 (Oracle JVM based) catchup - Qiita, the cache is divided into three segments according to the type of code.
The number of executions determines when compilation occurs, and the compilation threshold is the one tunable worth knowing here. When the sum of the following two counters exceeds the threshold, the method is queued for compilation. This is called "standard compilation" (not an official name).
- Call counter: the number of times the method has been called.
- Back-edge counter: the number of times execution branches back in a loop (roughly the number of loop iterations).
Standard compilation does not handle a long-running loop inside a single long method well, so if the back-edge counter alone exceeds its threshold, just the loop is compiled. This compilation is called OSR (on-stack replacement).
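A sketch of the pattern OSR compilation targets (names and loop count are illustrative): the method is called only once, so its call counter stays at 1, but the back-edge counter grows with every iteration.

```java
public class OsrDemo {
    static long hotLoop(int n) {
        long sum = 0;
        // One call with a very large n: the back-edge counter crosses its
        // threshold long before the method returns, so the JIT compiles the
        // loop and swaps it in while it is still running (OSR).
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hotLoop(1_000_000)); // prints "499999500000"
    }
}
```

With `-XX:+PrintCompilation`, such a compilation shows up with the `%` attribute.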
The threshold for standard compilation can be specified with the `-XX:CompileThreshold=N` flag. The default value is 1500 for the client compiler and 10000 for the server compiler.
OSR compilation is triggered when:

back-edge counter value > CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage) / 100
- `-XX:InterpreterProfilePercentage=N` defaults to 33.
- `-XX:OnStackReplacePercentage=N` defaults to 933 for the client compiler and 140 for the server compiler.
So the OSR threshold for the client compiler is 1500 * (933 - 33) / 100 = 13500, and for the server compiler it is 10000 * (140 - 33) / 100 = 10700.
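The arithmetic above can be sketched as a tiny helper (hypothetical code, not part of the JVM):

```java
public class OsrThreshold {
    // back-edge counter threshold for OSR compilation:
    // CompileThreshold * (OnStackReplacePercentage - InterpreterProfilePercentage) / 100
    static long osrThreshold(long compileThreshold,
                             int onStackReplacePercentage,
                             int interpreterProfilePercentage) {
        return compileThreshold
                * (onStackReplacePercentage - interpreterProfilePercentage) / 100;
    }

    public static void main(String[] args) {
        System.out.println(osrThreshold(1500, 933, 33));  // client: prints "13500"
        System.out.println(osrThreshold(10000, 140, 33)); // server: prints "10700"
    }
}
```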
Each time the JVM reaches a safepoint, the value of each counter is decremented. Therefore not every method will eventually be compiled: there are "lukewarm" methods that run reasonably often but never become hot enough to compile (this is also one of the reasons tiered compilation performs well).
The `-XX:+PrintCompilation` flag (default false) makes the JVM print a log line every time it compiles a method, in the form:

timestamp compilation-ID attributes (tiered compilation level) method-name size deopt
The attributes are:

- `%`: OSR compilation
- `s`: synchronized method
- `!`: method has an exception handler
- `b`: compilation in blocking mode (not output in current Java)
- `n`: the compiler generated a wrapper for a native method
The size is the size of the method's Java bytecode. If the code has been deoptimized, the line includes a message saying so.
You can also get compilation information for a Java program that is already running:

jstat -compiler ${Process ID}

You can also print the most recently compiled method every 1000 milliseconds:

jstat -printcompilation ${Process ID} 1000
OSR compilation is often time consuming.
The content here is deep in JVM internals and seems aimed at JVM engineers; it is unlikely that you will ever need the tuning details described here...
Compilation runs asynchronously, and the number of compiler threads depends on the number of CPUs and the type of compiler. The number of threads can be changed with `-XX:CICompilerCount=N`. With tiered compilation, one third of the threads handle client compilation and the rest handle server compilation.
Specifying `-XX:-BackgroundCompilation` disables asynchronous compilation.
Code accessed through getters and setters is inlined by modern compilers. Inlining is enabled by default and can be disabled with `-XX:-Inline`.
The criteria for inlining are how hot the method is and its bytecode size. A hot method is inlined if its bytecode is 325 bytes or less (changeable with `-XX:MaxFreqInlineSize=N`). A method of 35 bytes or less (changeable with `-XX:MaxInlineSize=N`) is inlined unconditionally.
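A sketch of what such an inlining candidate looks like (class and method names are my own; the byte limits are the defaults quoted above):

```java
public class InlineDemo {
    private int x;

    // Tiny accessors: their bytecode is far below the 35-byte MaxInlineSize
    // default, so the JIT inlines them unconditionally once the caller compiles.
    int getX() { return x; }
    void setX(int x) { this.x = x; }

    static long sum(InlineDemo p, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            p.setX(i);
            total += p.getX(); // after inlining, this is just a field read
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new InlineDemo(), 100)); // prints "4950"
    }
}
```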
Escape analysis is an optimization performed when `-XX:+DoEscapeAnalysis` (default `true`) is enabled. It does various things; for example, given:
public class Factorial {
    private BigInteger factorial;
    private int n;

    public Factorial(int n) {
        this.n = n;
    }

    public synchronized BigInteger getFactorial() {
        if (factorial == null)
            factorial = ...; // compute n!
        return factorial;
    }
}
and calling code like:
ArrayList<BigInteger> list = new ArrayList<BigInteger>();
for (int i = 0; i < 100; i++) {
    Factorial factorial = new Factorial(i);
    list.add(factorial.getFactorial());
}
- The synchronization in the `getFactorial()` method is not needed.
- The values of the fields `n` and `factorial` can be kept in registers instead of memory.
- The `Factorial` object itself need not be allocated; only its fields are tracked.
These are advanced optimizations (and in rare cases they have contained bugs).
When code is deoptimized, it is first marked "not entrant" and later "made zombie", at which point the GC can reclaim it.

Code becomes not entrant in two cases. The first is optimistic optimization: the compiler binds a call through an interface to a specific implementation class, and if that assumption is later broken, the code is deoptimized. The second is part of tiered compilation: when the server compiler finishes compiling a method, the version produced by the client compiler is marked not entrant.

When "made zombie" appears in the compilation log, the not-entrant code has been abandoned and the GC can reclaim it.
↓ Reference materials Java-JA13-Architect-evans.pdf
- 0: code executed by the interpreter
- 1: code compiled by the client compiler in simple mode
- 2: code compiled by the client compiler in limited mode
- 3: code compiled by the client compiler in full mode
- 4: code compiled by the server compiler
Execution starts at level 0, and in most cases code is first compiled at level 3 and then at level 4. Levels 1 and 2 are used when the compiler queue is full (they compile quickly because they do not use the profiler). Deoptimized code naturally goes back to level 0.
- Tiered compilation is the strongest option.
- Small methods are inlined.
- Compilation is handled through a queue.
- The code cache has an upper size limit.
- Simple code benefits most easily from optimization.
Also, the final modifier seems to ** have no effect on performance **.