# Vectorization conversion of HotSpot JavaVM

Recent HotSpot JavaVMs are optimized to vectorize iterative processing of scalar operations and convert them into SIMD instructions (what is SIMD? See the second half). When I tried it with the code that actually works for optimization, ** 1.5 to 2.8 times speed improvement ** was seen, so it can be used well with Java libraries that perform a lot of arithmetic processing (do not rely on GPGPU). It may be an effective optimization method.

HotSpot's SIMD optimization is based on Superword-Level Parallelism (SLP) (hereafter this optimization is Called SLP). Originally, this paper was intended to reflect the times and convert non-SIMD-compatible image / audio processing into instructions that use SIMD at the compiler and runtime layers, but this is Java bytecode without SIMD instructions. It is also effective for using SIMD.

## Vectorization conversion of HotSpot JavaVM

### SLP conversion process

Assume the following simple loop processing.

for(int i=0; i<1024; i++){
a[i] = b[i] + c[i];
}


For this loop [loop unrolling](https://ja.wikipedia.org/wiki/%E3%83%AB%E3%83%BC%E3%83%97%E5%B1%95%E9%96 % 8B) (Loop unrolling).

for(int i=0; i<1024; i+=4){
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
a[i+3] = b[i+3] + c[i+3];
}


This alone reduces the number of interactions, the number of condition judgments, and the penalty for branch prediction, so speedup can be expected (there are quite a few such loop unrolls that can be done with a compiler or virtual machine).

Now, the processing inside the loop can be regarded as the following vector operation.

\begin{pmatrix} a_{i+0} \\ a_{i+1} \\ a_{i+2} \\ a_{i+3} \end{pmatrix} = \begin{pmatrix} b_{i+0} \\ b_{i+1} \\ b_{i+2} \\ b_{i+3} \end{pmatrix} + \begin{pmatrix} c_{i+0} \\ c_{i+1} \\ c_{i+2} \\ c_{i+3} \end{pmatrix}


This vector operation is actually replaced with a vector operation instruction.

for(int i=0; i<1000; i+=4){
add xmm0, xmm1 and store result to xmm0
}


With this conversion, it is possible to convert the repetition of simple arithmetic processing into a vector arithmetic. Since all modern CPUs can use SIMD instructions, it is effective to use SIMD without changing the source or bytecode.

### Behavior in Java 8

Actually enable / disable SLP using the following sources and compare the execution times. The measurement is appropriate, so if you need accurate evidence, please do it yourself.

import java.util.Arrays;

public class J8SIMD {
private static final int SIZE = 1024 * 1024;
private static final float[] a = new float[SIZE];
private static final float[] b = new float[SIZE];
static {
Arrays.fill(a, 1);
Arrays.fill(b, 2);
}
for(int i=0; i<a.length; i++){
a[i] += b[i];
}
}
public static void main(String[] args){
// warming up
// measure
long t0 = System.currentTimeMillis();
for(int i=0; i<10000; i++){
}
long t1 = System.currentTimeMillis();
}
}


Running the above source on a Core i7-6600U (MMX, SSE, SSE2, SSSE3, SSE4, AVX compatible) makes a significant difference (although I simply loop unrolled because I'm not sure if I used the SIMD instruction). There is no possibility of just a difference: rolling_eyes :).

I used float here, but both ʻint and doublemake a significant difference, so give it a try. However, due to the data width,double is not as different as ʻint and float.

SLP is enabled by default from around Java7u40, and if you want to disable it, use the -XX: -UseSuperWord option.

java8simd>java -XX:+UseSuperWord J8SIMD

java8simd>java -XX:-UseSuperWord J8SIMD

java8simd>java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)


The execution time (milliseconds) of each operation is as follows.

Calculation SLP enabled SLP disabled Speed ratio
a[i] += b[i] 4,645 6,928 1.50
a[i] -= b[i] 4,647 7,029 1.52
a[i] *= b[i] 4,629 6,978 1.51
a[i] /= b[i] 4,357 12,151 2.79
a[i] %= b[i] - - -

I gave up because % = didn't finish after waiting a few minutes (why is it so slow?). Addition, subtraction and multiplication showed a 1.5x speedup. The speedup of division is particularly remarkable at nearly 3 times, but is there any movement such as processing on a dedicated FPU when executed with SIMD instructions?

The following is the result of changing to the ʻint version and performing bit operations. Bitshift operations do not seem to be optimized in SLP.

Calculation SLP enabled SLP disabled Speed ratio
a[i] <<= b[i] 18,293 19,249 1.05
a[i] >>= b[i] 18,550 19,431 1.05
a[i] >>>= b[i] 19,862 20,105 1.01
a[i] ｜= b[i] 4,735 6,857 1.45
a[i] &= b[i] 4,756 6,799 1.43
a[i] ^= b[i] 4,697 7,532 1.60

Whether or not SLP worked effectively is a remarkable number, so it seems that you can judge by the presence or absence of the -XX: -UseSuperWord option.

### Conditions for activating SLP

Debug Compiled OpenJDK will show how loops are recognized when using the -XX: + PrintCompilation -XX: + TraceLoopOpts option, and -XX: + PrintAssembly will show the assembly generated by HotSpot. is.

Java 8 / 9ea listed in: rolling_eyes: Vectorization in HotSpot JVM I will put the conditions of. I think that the triggering conditions and restrictions will be relaxed in future versions, so please follow the current affairs.

• Must be a Trip-Counted Loop. In other words, it is a count loop, and limit does not change during the loop and step is determined at compile time.
for(int i=0; i<100; i++){ } // OK
for(int i=5; i<100; i++){ } // OK
for(int i=t0; i<t1; i++){ } // OK
for(int i=0; i<100; i+=4){ } // OK
for(int i=0; i<100; i+=d){ } // NG
for(int i=0; i<100; i*=2){ } // NG
for(int i=0; i<t1; i++){     // NG
...
if(...) t1 ++;
}

• Loop counters are only ʻint, short, char.
for(byte i=0; i<100; i++){ }  // NG
for(char i=0; i<100; i++){ }  // OK
for(short i=0; i<100; i++){ } // OK
for(int i=0; i<100; i++){ }   // OK
for(long i=0; i<100; i++){ }  // NG

• Loop unrolling is not performed when a function call is included.
for(int i=0; i<100; i++){  // NG
...
f();
}

• The target data is an array.
for(int i=0; i<100; i++){  // OK
a[i] += b[i];
}
for(int i=0; i<100; i++){  // NG
a.set(i, a.get(i) + b.get(i));
}

• It must be a simple four arithmetic operation.
a[i] ++;  // OK
a[i] += 2; // OK
a[i] *= b[i]; // OK
a[i] = b[i] + c[i]; // OK

• The index is continuous.
for(int i=0; i<100; i++){  // NG
a[i] += b[2 * i];
}

• Aggregation may be partly possible from Java 9
int r = 0;
for(int i=0; i<100; i++){
r += a[i] * b[i];
}


### Behavioral features and Java 9 extensions

Not surprisingly, collection API and Stream operations in Java (and JavaVM languages such as Scala and Kotlin) are not operations that can be translated into SIMD instructions. If the bottleneck is where you're doing a lot of math with them, it might be worth considering replacing them with array & loop counts.

Since ʻUnsafe is a method call, you cannot expect this optimization in code that uses ʻUnsafe for speed-first risk-taking.

Vectorization has a 64-byte alignment constraint, and the fractions ahead of the array are done in an unvectorized pre-loop. Elements that are not multiples of 4 at the end of the array are also post-looped without vectorization. However, post-loop is said to be vectorized in Java 9.

Java 8 uses 128bit registers (xmm), so it is vectorized every 2 or 4 elements, but in Java 9 AVX2 can be increased to 32 or something.

SIMD

SIMD </ b> (Single Instruction Multiple Data) is a mechanism to apply the same instruction to a lot of numerical data arranged at once.

Science and technology arithmetic processing such as fluid simulation, image processing, voice processing, encryption, and machine learning processing frequently involve linear algebraic operations such as "multiply array $a$ by the value of array $b$". It comes out in large quantities.

A GPU or supercomputer vector arithmetic unit is a collection of a large number of low-cost, power-saving, and space-saving arithmetic cores that eliminate unnecessary circuits to perform a large amount of arithmetic. These can apply one instruction to a large amount of data in a register at a time.

Recent CPUs have CPU instructions that perform several to dozens of vector operations. GPUs that handle a large amount of arithmetic in image processing have thousands of processors dedicated to arithmetic and issue arithmetic instructions to them at once. Both are SIMDs, but HotSpot's SLP optimizes for CPU instructions.