Graal vs. C2: Battle of the JITs
GraalVM claims to "Run Programs Faster Anywhere". Indeed, existing applications often run faster on GraalVM than on OpenJDK and Oracle JDK. For example, Twitter reports an 11 percent reduction in CPU utilization from using Graal CE. With GraalVM EE they were even able to reduce CPU utilization by a staggering 20 percent. What often remains undiscussed, however, is how exactly GraalVM achieves this speedup. In this blog post I try to answer that question by comparing GraalVM EE 20.0.0 to Oracle Java 8u241 on a number of benchmarks.
This post is an extended version of the talk I gave at the Compiler Construction course (CS4200) at Delft University of Technology, the slides of which are available here. Both my talk and this post are inspired by Ionut Balosin's A race of two compilers. I have also taken a lot of inspiration from Aleksey Shipilev's JVM Anatomy Quarks.
Introduction
Before discussing how GraalVM achieves a performance speedup over Oracle JDK, we need to briefly introduce the JVM's tiered compilation scheme.
The first step in running a Java program is to compile a `.java` file containing Java source to a `.class` file containing JVM bytecode. When running the resulting `.class` file, the JVM goes through several stages:
- First the bytecode is interpreted. You can think of this as a loop that processes each bytecode instruction individually, altering the state of the virtual machine according to the semantics of the instruction.
- Once a method becomes hot the client compiler (C1) compiles the bytecode to machine code (instructions native to the host platform) that carries out equivalent operations. C1 is a fast, lightly optimizing bytecode compiler.
- When the code becomes even hotter, the server compiler (C2) kicks in. This compiler produces the most performant machine code.
A note on JIT. A bytecode interpreter is much slower than equivalent machine code, often by at least an order of magnitude. What, then, is so great about JIT compilation? First, while the bytecode is being interpreted, the profiler collects information such as how often a branch is taken and which types of values are observed. The JIT compiler can use this profiling information when deciding what machine code to generate. Second, and more importantly, the JIT compiler can make speculative assumptions and deoptimize if those assumptions turn out to be wrong. These two ingredients make it possible for a JIT compiler to outperform an ahead-of-time compiler.
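To make speculation concrete, here is a minimal sketch (all class and method names are my own illustration, not code from this post's benchmarks): if the profiler has only ever observed one receiver type at a call site, the JIT can inline that implementation behind a cheap type check and deoptimize back to the interpreter if the check ever fails.

```java
public class SpeculationDemo {
    interface Shape { double area(); }

    static class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    // If only Squares have ever been observed here, the JIT can replace the
    // virtual dispatch with (roughly): "if (s instanceof Square) use the
    // inlined Square.area body, else deoptimize".
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Square(3) };
        System.out.println(totalArea(shapes)); // 13.0
    }
}
```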
For the purpose of this blog post it is important to know that in GraalVM the C1 and C2 JIT compilers have been replaced by the Graal JIT compiler. The next four sections each present an experiment in which GraalVM is compared to Oracle Java. The code for the benchmarks in this blog post is available on GitHub. The benchmarks are run on GraalVM EE 20.0.0 and Oracle Java 8u241 on a MacBook Pro (13-inch, 2017, 2.5 GHz Intel Core i7).
(Partial) Escape Analysis
Escape analysis is a method for determining the dynamic scope of pointers. Essentially, escape analysis tells us when a reference to an object can escape to other threads of execution or to calling subroutines. A couple of interesting optimizations can be performed when an object is known NOT to escape:
- Stack allocation. With stack allocation, an object is allocated on the stack instead of the heap. Stack allocation is much faster than heap allocation, because it only involves moving the stack pointer. In addition, stack allocation avoids the need for expensive garbage collection.
- Scalar replacement. Scalar replacement replaces an object by its fields. Like with stack allocation, the fields may be allocated on the stack instead of the heap. In addition, the scalar variables may be amenable to further optimizations.
- Lock elision. With lock elision, expensive locking operations are removed. Some classes in the Java standard library assume they may be used in a multithreaded environment, so access in their methods is pessimistically synchronized. However, if an object never leaves the method in which it is constructed, no two threads can ever access it, and the synchronization in the methods that access it can be elided.
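For instance, `StringBuffer`'s methods are synchronized, but when the buffer never escapes its method, escape analysis lets the JIT drop those monitor operations entirely. A minimal sketch (the `label` helper is hypothetical, not from the benchmark code):

```java
public class LockElisionDemo {
    // sb never escapes this method, so no other thread can ever contend on
    // it; the JIT may elide the monitor enter/exit in StringBuffer's
    // synchronized append/toString without changing behavior.
    static String label(int n) {
        StringBuffer sb = new StringBuffer(); // non-escaping
        sb.append("item-");
        sb.append(n);
        return sb.toString();                 // only the String escapes
    }

    public static void main(String[] args) {
        System.out.println(label(3)); // item-3
    }
}
```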
Partial escape analysis is a stronger variant of escape analysis that determines whether an object escapes on each path through a method, instead of for the method as a whole. This means that even if there is a path through the method on which the object escapes, we may still be able to apply the above optimizations on the other paths. Of course, nothing comes for free: the compiler needs to insert instructions that reconstruct the object on the heap when the "escaping" path is taken. Let's look at an example.
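The benchmark source is not reproduced in this excerpt; its pattern corresponds roughly to the following sketch (the class and field names are guesses based on the `getfield x` and `consume` annotations in the assembly below, not the actual benchmark code):

```java
public class Escape {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    static MyObject escaped;   // assigning here makes an object escape

    static void consume(int x) { /* stand-in for a JMH Blackhole */ }

    // On the common path the object never escapes, so Graal can scalar-replace
    // it and pass the field x to consume directly; only on the rare path does
    // the compiler have to materialize the object on the heap.
    static long split(int iterations, boolean rare) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            MyObject o = new MyObject(i);
            if (rare) {
                escaped = o;       // escaping path
            }
            consume(o.x);          // non-escaping path: only the field is used
            sum += o.x;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(split(300, false)); // 44850
    }
}
```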
Benchmark | Dist | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Escape.split | graal | avgt | 5 | 669.760 ± 78.310 | ns/op
Escape.split | oracle | avgt | 5 | 1182.732 ± 145.990 | ns/op
The benchmark shows that Graal is about twice as fast as C2.
Let's add the `perfasm` (`dtraceasm`) profiler to take a look at the assembly code that gets generated:
Graal:
0.12% │ 0x000000011c191640: mov %r11d,0x2c(%rsp)
9.17% │ 0x000000011c191645: mov 0xc(%r10),%r9d ; *getfield x
1.69% │ 0x000000011c191649: mov %rdi,%rsi ; *invokevirtual consume
0.23% │ 0x000000011c19164c: mov %r9d,%edx
0.07% │ 0x000000011c19164f: callq 0x000000011bf990a0 ; *invokevirtual consume
16.97% │ 0x000000011c191654: nop
1.55% │ 0x000000011c191655: mov 0x2c(%rsp),%r11d
0.33% │ 0x000000011c19165a: inc %r11d ; *iinc
8.39% │ 0x000000011c19165d: mov 0x8(%rsp),%r10
1.76% │ 0x000000011c191662: mov 0x10(%rsp),%rdi ; *iload_2
0.09% │ 0x000000011c191667: cmp $0x12c,%r11d
╰ 0x000000011c19166e: jl 0x000000011c191640 ; *if_icmpge
As you can see, there is no trace of the `MyObject` construction. Instead, the field `x` is immediately passed to the `consume` function.
Let's look at what happens on Oracle Java:
Oracle:
0.12% │ 0x000000011e7a44bf: mov 0x50(%rsp),%r9
0.12% │ 0x000000011e7a44c4: movzbl 0x94(%r9),%r10d ; *getfield isDone
0.12% │ 0x000000011e7a44cc: mov 0x8(%rsp),%rbx
│ 0x000000011e7a44d1: add $0x1,%rbx ; *ifeq
│ 0x000000011e7a44d5: test %eax,-0xf8ae4db(%rip)
│ 0x000000011e7a44db: test %r10d,%r10d
│ 0x000000011e7a44de: jne 0x000000011e7a469b ; *aload
0.02% │ 0x000000011e7a44e4: xor %ebp,%ebp
│ 0x000000011e7a44e6: jmp 0x000000011e7a4558
│ 0x000000011e7a44e8: nopl 0x0(%rax,%rax,1)
0.82% │ 0x000000011e7a44f0: mov %r10,0x60(%r15)
0.32% │ 0x000000011e7a44f4: prefetchw 0xc0(%r10)
3.30% │ 0x000000011e7a44fc: mov 0x18(%rsp),%r11
1.05% │ 0x000000011e7a4501: mov 0xa8(%r11),%r10
0.82% │ 0x000000011e7a4508: mov %r10,(%rax)
1.57% │ 0x000000011e7a450b: movl $0xf80190ff,0x8(%rax) ; Put class word header
3.32% │ 0x000000011e7a4512: mov %r12d,0xc(%rax) ; *new
1.80% │ 0x000000011e7a4516: mov 0x24(%rsp),%r10d
0.90% │ 0x000000011e7a451b: mov %r10d,0xc(%rax) ; *synchronization entry
1.22% │ 0x000000011e7a451f: mov %r8,(%rsp)
1.80% │ 0x000000011e7a4523: mov %r9,0x50(%rsp)
1.92% │ 0x000000011e7a4528: mov %r11,0x18(%rsp)
0.47% │ 0x000000011e7a452d: mov %rbx,0x8(%rsp) ; *aload_1
1.47% │ 0x000000011e7a4532: mov 0xc(%rax),%edx ; *getfield x
5.20% │ 0x000000011e7a4535: mov 0x60(%rsp),%rsi
1.62% │ 0x000000011e7a453a: nop
0.47% │ 0x000000011e7a453b: callq 0x000000011e5e00a0 ; *invokevirtual consume
10.67% │ 0x000000011e7a4540: inc %ebp ; *iinc
1.80% │ 0x000000011e7a4542: cmp $0x12c,%ebp
╰ 0x000000011e7a4548: jge 0x000000011e7a44bf ; *if_icmpge
In the introduction I mentioned that Twitter achieves an 11 percent reduction in CPU utilization by switching to the Graal JIT compiler. They also report that using Graal without partial escape analysis results in only a 5 percent reduction. This should give you an idea of how big the contribution of escape analysis is to Graal's performance. If you want to learn more about escape analysis, I can recommend Lukas Stadler's PhD thesis.
Data-Flow Analysis
Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.
One example of data-flow analysis is constant folding: the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime.
Constant folding can make use of arithmetic identities, e.g. if `x` is an integer, then the value `0 * x` is zero regardless of the value of `x`.
This is a fairly obvious example, but what follows is an example of constant folding that may not be so obvious.
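The benchmark body is not shown in this excerpt; it is essentially a loop guarded by `random.nextInt(2) > 1`. A sketch along those lines (my reconstruction, not the actual source):

```java
import java.util.Random;

public class Dataflow {
    // nextInt(2) can only return 0 or 1, so the branch below is provably
    // dead; a compiler that tracks value ranges can remove it, together
    // with most of the work feeding it.
    static int random(Random random, int iterations) {
        int value = 0;
        for (int i = 0; i < iterations; i++) {
            if (random.nextInt(2) > 1) {   // never true
                value += i;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(random(new Random(), 1_000)); // always 0
    }
}
```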
If we run this on GraalVM EE 20.0.0 and Oracle Java we get the following result:
Benchmark | Dist | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Dataflow.random | graal | avgt | 10 | 51.730 ± 4.008 | ns/op
Dataflow.random | oracle | avgt | 10 | 2481.222 ± 107.754 | ns/op
It appears that Graal's data-flow analysis could deduce that `r > 1` is always false. Let's look at the implementation of `Random#nextInt(int)` and speculate about what may happen under the hood:
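From memory, for a power-of-two bound JDK 8's `Random#nextInt(int)` reduces to the expression below. This standalone paraphrase (the `nextIntPow2` name is mine; see the JDK source for the real thing) lets us check it against the actual implementation:

```java
import java.util.Random;

public class NextIntSketch extends Random {
    NextIntSketch(long seed) { super(seed); }

    // For a power-of-two bound, JDK 8's nextInt(int) roughly boils down to:
    // take 31 random bits and keep the top log2(bound) of them.
    int nextIntPow2(int bound) {
        int r = next(31);                        // 31 random bits, never negative
        return (int) ((bound * (long) r) >> 31);
    }

    public static void main(String[] args) {
        NextIntSketch sketch = new NextIntSketch(42);
        Random reference = new Random(42);
        for (int i = 0; i < 5; i++) {
            // With the same seed this should match Random#nextInt(2).
            System.out.println(sketch.nextIntPow2(2) + " " + reference.nextInt(2));
        }
    }
}
```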
The parameter `bound` will be bound to `2`, which means the method will return `(int) ((bound * (long) r) >> 31)`. Without making any assumptions about the value of `r`, what are the possible values of this expression?
Let's introduce some notation to make the discussion easier: each bit is either `0`, `1`, or `x` (unknown), and `b{n}` denotes `b` repeated `n` times.
- We start with 32 bits: `1x{31}` or `0x{31}`
- Widen (sign-extend) to 64 bits: `1{32}1x{31}` or `0{32}0x{31}`
- Shift one position left: `1{31}1x{31}0` or `0{31}0x{31}0`
- Shift 31 positions right: `1{31}1{31}1x` or `0{31}0{31}0x`
- Narrow to 32 bits: `1{31}x` or `0{31}x`
Without knowing anything about the random value `r`, we know that this expression evaluates to one of at most four possible values: `1{31}0` (= -2), `1{32}` (= -1), `0{32}` (= 0), or `0{31}1` (= 1).
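We can sanity-check this analysis by evaluating the expression for extreme values of `r` (a quick verification sketch, not part of the original benchmark code):

```java
public class NextIntBounds {
    // Evaluate (int) ((2 * (long) r) >> 31) for any int r, even a negative
    // one; by the bit analysis above it must be one of -2, -1, 0, 1.
    static int narrow(int r) {
        return (int) ((2 * (long) r) >> 31);
    }

    public static void main(String[] args) {
        int[] probes = { Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE };
        for (int r : probes) {
            System.out.println(r + " -> " + narrow(r));
        }
        // prints: -2147483648 -> -2, -1 -> -1, 0 -> 0, 1 -> 0, 2147483647 -> 1
    }
}
```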
This example may seem a bit contrived: nobody would ever write `random.nextInt(2) > 1`. However, it illustrates the surprising ways in which Graal is able to optimize our code. Moreover, it shows that if the JIT compiler knows that the range of possible values is restricted to some subset, it may be able to perform more aggressive optimizations and generate more efficient code.
Inlining
Inlining is a compiler optimization that replaces a function call site with the body of the called function. It is one of the fundamental optimizations because it creates new opportunities for further optimization. The Graal inliner is much more aggressive than C2's; the following example shows why.
We have defined a class `A` with an abstract method `m` and three concrete subclasses `C1`, `C2`, and `C3` that implement `m`. The array `as` is filled with instances of `C1`, `C2`, and `C3` in round-robin fashion. The benchmark invokes the method `m` on each item `a` in the array `as`. Since this is a virtual method call, the runtime needs to resolve the implementation based on the runtime type of the receiver `a`.
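That setup is not reproduced in this excerpt; it corresponds roughly to the following sketch (the field names `c1`/`c2`/`c3` are taken from the `putfield` annotations in the assembly below, the rest is my reconstruction):

```java
public class Inlining {
    abstract static class A {
        int c1, c2, c3;
        abstract void m();
    }
    static class C1 extends A { void m() { c1++; } }
    static class C2 extends A { void m() { c2++; } }
    static class C3 extends A { void m() { c3++; } }

    // Fill the array with C1, C2, C3 instances in round-robin fashion.
    static A[] fill(int n) {
        A[] as = new A[n];
        for (int i = 0; i < n; i++) {
            switch (i % 3) {
                case 0:  as[i] = new C1(); break;
                case 1:  as[i] = new C2(); break;
                default: as[i] = new C3(); break;
            }
        }
        return as;
    }

    // The benchmark body: a megamorphic virtual call on every element.
    static void test(A[] as) {
        for (A a : as) {
            a.m();
        }
    }

    public static void main(String[] args) {
        A[] as = fill(300);
        test(as);
        int[] totals = new int[3];
        for (A a : as) {
            totals[0] += a.c1;
            totals[1] += a.c2;
            totals[2] += a.c3;
        }
        System.out.println(totals[0] + " " + totals[1] + " " + totals[2]); // 100 100 100
    }
}
```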
Let's look at how the two JITs perform on this example:
Dist | Benchmark | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Oracle | Inlining.test | avgt | 10 | 769.603 ± 24.531 | ns/op
Graal | Inlining.test | avgt | 10 | 497.817 ± 13.365 | ns/op
Graal is much faster (1.55x) than Oracle.
Let's look at the assembly code to see what is going on (by adding `-prof perfasm` or `-prof dtraceasm` to the JMH command):
Oracle:
9.75% │ 0x54790: mov 0x40(%rsp),%r9
0.14% │ 0x54795: data16 data16 nopw 0x0(%rax,%rax,1) ; *aload_1
0.88% │ 0x547a0: mov 0x10(%r9,%rbp,4),%r10d
8.10% │ 0x547a5: mov %r9,0x40(%rsp)
1.52% │ 0x547aa: mov %r10,%rsi
0.06% │ 0x547ad: shl $0x3,%rsi
0.94% │ 0x547b1: movabs $0xffffffffffffffff,%rax
8.19% │ 0x547bb: callq 0x032e0 ; *invokevirtual m
1.24% │ 0x547c0: inc %ebp ; *iinc
0.85% │ 0x547c2: cmp 0x10(%rsp),%ebp
╰ 0x547c6: jl 0x54790
C1::m
<function prologue>
0.55% 0x53b0c: incl 0xc(%rsi) ; Increment at offset 12 (= field c1)
<function epilogue>
C2::m
<function prologue>
0.50% 0x538cc: incl 0x10(%rsi) ; Increment at offset 16 (= field c2)
<function epilogue>
C3::m
<function prologue>
0.36% 0x5368c: incl 0x14(%rsi) ; Increment at offset 20 (= field c3)
<function epilogue>
Graal:
2.81% │││ 0x1d89: mov 0x10(%r8,%rcx,4),%ebx ; *aaload
9.85% │││ 0x1d8e: mov %ecx,%edi
2.67% │││ 0x1d90: inc %edi ; *iinc
3.56% │││ 0x1d9a: cmpl $0xf801e2c3,0x8(,%rbx,8) ; {metadata("C2")} | Is this object of class C2?
24.41% │ │││ 0x1da5: je 0x1e0e | If yes, jump to inlined C2::m
0.70% │ │││ 0x1db3: cmpl $0xf801e341,0x8(,%rbx,8) ; {metadata("C3")} | Is this object of class C3?
5.15% │ ││││ 0x1dbe: jne 0x1de3 ; *invokevirtual m | If no, jump over C3::m to next check
2.04% │ ││││ 0x1dc4: mov $0x1,%ecx | Inlined C3::m
0.15% │ ││││ 0x1dc9: add 0x14(,%rbx,8),%ecx ; *iadd | Inlined C3::m
3.41% │ ││││ 0x1dd0: mov %ecx,0x14(,%rbx,8) ; *putfield c3 | Inlined C3::m
2.78% │ ││││ 0x1dd7: mov %edi,%ecx
0.22% │ │╰││ 0x1dd9: jmp 0x1d80 | Jump back to process next object
3.56% │ │ ││ 0x1de3: cmpl $0xf801e302,0x8(,%rbx,8) ; {metadata("C1")} | Is this object of class C1?
2.07% │ ││ 0x1dee: jne 0x1eb2 ; *invokevirtual m | If not, jump to virtual call
3.52% │ ││ 0x1df4: mov $0x1,%ecx | Inlined C1::m
0.63% │ ││ 0x1df9: add 0xc(,%rbx,8),%ecx ; *iadd | Inlined C1::m
1.15% │ ││ 0x1e00: mov %ecx,0xc(,%rbx,8) ; *putfield c1 | Inlined C1::m
4.44% │ ││ 0x1e07: mov %edi,%ecx
2.07% │ ╰│ 0x1e09: jmpq 0x1d80 ; *invokevirtual m | Jump back to process next object
5.00% │ │ 0x1e0e: mov $0x1,%ecx | Inlined C2::m
0.33% │ 0x1e13: add 0x10(,%rbx,8),%ecx ; *iadd | Inlined C2::m
0.81% │ 0x1e1a: mov %ecx,0x10(,%rbx,8) ; *putfield c2 | Inlined C2::m
7.11% │ 0x1e21: mov %edi,%ecx
0.52% ╰ 0x1e23: jmpq 0x1d80 ; *iload_3 | Jump back to process next object
Oracle did not inline the method call. Instead, the JVM method invocation translates to an assembly call, including the necessary function prologue and epilogue. Graal, on the other hand, compares the object's class word against the native pointer of the VM's Klass instance to decide which implementation to use, and jumps to that implementation. Only when an instance that is not of class C1, C2, or C3 is observed does the code fall back to a virtual call.
Influencing the Inliner
Weβve looked at how megamorphic call sites are inlined in C2 and Graal, but there are many other factors that may influence the inliner.
For example, in C2 you can specify the maximum number of nested calls that are inlined (`-XX:MaxInlineLevel`) and the maximum bytecode size of a method to be inlined (`-XX:MaxInlineSize`).
Vectorization
Modern processors support instructions that perform the same operation on multiple data points simultaneously. Such instructions are known as single instruction, multiple data (SIMD). Java does not yet support vector intrinsics (JEP 338: Vector API), so to benefit from SIMD instructions the JIT compiler has to apply autovectorization.
One example where vectorization can help is UTF-8 to UTF-16 decoding. In UTF-8 each character occupies 1-4 bytes, whereas in UTF-16 each character occupies exactly 2 bytes. If the string in question only contains ASCII characters, then the decoding is just a widening conversion of each byte. With SIMD instructions this widening conversion can be performed for multiple bytes in parallel. The following code shows what such a fast-path decode could look like.
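The benchmark source is not shown in this excerpt; an ASCII fast path along these lines is the pattern in question (a sketch, assuming the decode stops at the first non-ASCII byte):

```java
import java.nio.charset.StandardCharsets;

public class Vectorization {
    // ASCII fast path: while every byte is non-negative (i.e. < 0x80), the
    // UTF-8 to UTF-16 decode is a plain widening conversion per byte; this
    // simple loop is what the JIT may autovectorize.
    static int decodeAscii(byte[] src, char[] dst) {
        int i = 0;
        while (i < src.length && src[i] >= 0) {
            dst[i] = (char) src[i];   // widen byte -> char
            i++;
        }
        return i;                     // number of chars decoded
    }

    public static void main(String[] args) {
        byte[] src = "hello, world".getBytes(StandardCharsets.US_ASCII);
        char[] dst = new char[src.length];
        int n = decodeAscii(src, dst);
        System.out.println(new String(dst, 0, n)); // hello, world
    }
}
```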
If we run this benchmark we obtain the following results:
Benchmark | Compiler | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Vectorization.decodeAscii | C2 | avgt | 10 | 547.945 ± 5.376 | ns/op
Vectorization.decodeAscii | Graal | avgt | 10 | 147.822 ± 5.606 | ns/op
Graal outperforms C2 by 3.7x; that is a pretty big difference!
Let's add `-prof dtraceasm` (Linux: `-prof perfasm`) and take a look at the assembly code that the two JIT compilers generate.
C2
0x08: mov $0x200,%ecx
0x0d: xor %rax,%rax
0x10: shl $0x3,%rcx
0x14: rep rex.W stos %al,%es:(%rdi)
0x17: prefetchw 0x140(%r10)
0x1f: prefetchw 0x180(%r10)
...
│ 0xc70: movslq %edi,%rbx
│ 0xc73: movsbl 0x10(%r8,%rbx,1),%r9d
│ 0xc79: test %r9d,%r9d
│ 0xc7c: jl 0x000000010ba8fd58
│ 0xc82: mov %r9w,0x10(%rdx,%rbx,2)
│ 0xc88: movsbl 0x11(%r8,%rbx,1),%r9d
│ 0xc8e: test %r9d,%r9d
│ 0xc91: jl 0x000000010ba8fd63
│ 0xc97: mov %r9w,0x12(%rdx,%rbx,2)
│ 0xc9d: movsbl 0x12(%r8,%rbx,1),%r9d
│ 0xca3: test %r9d,%r9d
│ 0xca6: jl 0x000000010ba8fd69
│ 0xcac: mov %r9w,0x14(%rdx,%rbx,2)
│ 0xcb2: movsbl 0x13(%r8,%rbx,1),%r9d
│ 0xcb8: test %r9d,%r9d
│ 0xcbb: jl 0x000000010ba8fd5c
│ 0xcc1: mov %r9w,0x16(%rdx,%rbx,2)
│ 0xcc7: movsbl 0x14(%r8,%rbx,1),%r9d
│ 0xccd: mov %edi,%ecx
│ 0xccf: add $0x4,%ecx
│ 0xcd2: test %r9d,%r9d
│ 0xcd5: jl 0x000000010ba8fd6e
│ 0xcdb: mov %r9w,0x18(%rdx,%rbx,2)
│ 0xce1: movsbl 0x15(%r8,%rbx,1),%r9d
│ 0xce7: test %r9d,%r9d
│ 0xcea: jl 0x000000010ba8fd65
│ 0xcec: mov %r9w,0x1a(%rdx,%rbx,2)
│ 0xcf2: movsbl 0x16(%r8,%rbx,1),%r9d
│ 0xcf8: test %r9d,%r9d
│ 0xcfb: jl 0x000000010ba8fd6b
│ 0xcfd: mov %r9w,0x1c(%rdx,%rbx,2)
│ 0xd03: movsbl 0x17(%r8,%rbx,1),%r9d
│ 0xd09: test %r9d,%r9d
│ 0xd0c: jl 0x000000010ba8fd5e
│ 0xd0e: mov %r9w,0x1e(%rdx,%rbx,2)
│ 0xd14: add $0x8,%edi
│ 0xd17: cmp $0x3f9,%edi
│ 0xd1d: jl 0xc70
│ 0xd23: cmp $0x400,%edi
╰ 0xd29: jge 0x000000010ba8fb91
First note that `mov $0x200,%ecx` moves the decimal 512 into `%ecx`, and the `shl` that follows changes this to 4096. The `xor %rax,%rax` instruction zeros `%al`. The `rep rex.W stos %al,%es:(%rdi)` instruction iterates `%ecx` (= 4096) times, each time copying the zero from `%al` to `%es:(%rdi)` and incrementing `%rdi`. To summarize, this block initializes the `char[2048]` array (each char takes up two bytes).
The second thing to notice is that C2 unrolled the loop body 8 times. Specifically, the sign of each byte is checked individually and a jump is performed if the sign is negative.
GraalVM
0x38: mov $0xffffffff,%r10d
│ 0x70: lea 0x8(%r13),%r13
│ 0x74: cmp %rbp,%r13
│ 0x77: jg 0x000000010e10cc8a
│ 0x7d: vmovdqu (%rbx,%r13,4),%ymm0
│ 0x83: vpxor %xmm1,%xmm1,%xmm1
│ 0x87: vpcmpgtd %ymm0,%ymm1,%ymm0
│ 0x8b: vpmovmskb %ymm0,%r14d
│ 0x8f: test %r10d,%r14d
╰ 0x92: je 0x70
Even though the Java loop contains a condition and multiple exits, Graal was able to compile it to AVX2 instructions.
The `vmovdqu` instruction moves 256 bits (32 bytes) into the `%ymm0` register. The `vpxor` instruction effectively zeros the `%ymm1` register (a VEX-encoded write to `%xmm1` clears the upper half as well). The `vpcmpgtd` instruction compares each signed DWORD in `%ymm0` and `%ymm1` and stores the result in the corresponding DWORD of `%ymm0`. The `vpmovmskb` instruction moves the most significant bit of each byte of the source operand `%ymm0` (256 bits) to the destination operand `%r14d` (32 bits). The `test` instruction sets the zero flag if the bitwise AND of its operands `%r10d` and `%r14d` is zero, which is the case only when `%r14d` is zero. The `je` instruction jumps to the start of the loop if the zero flag is set.
To summarize: each iteration of the loop processes 32 bytes. Each of the 32 bytes is mapped to 0 if the byte represents a value greater than or equal to 0 and mapped to 1 otherwise. These 32 bits are collected into a single register. If this register is zero, i.e. if all original bytes were greater than or equal to 0, the loop is repeated.
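A scalar rendition of what the vector loop computes may make this clearer (a sketch; `signMask` is my own name for the combined effect of the compare and movemask steps):

```java
public class SignMask {
    // Fold the sign bits of a 32-byte chunk into one int, the way the
    // compare + vpmovmskb pair does in the vector loop: bit i is set iff
    // byte i is negative (non-ASCII). The chunk is all-ASCII iff mask == 0.
    static int signMask(byte[] src, int from) {
        int mask = 0;
        for (int i = 0; i < 32; i++) {
            if (src[from + i] < 0) {
                mask |= 1 << i;
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        byte[] ascii = new byte[32];            // all zero bytes: plain ASCII
        byte[] mixed = new byte[32];
        mixed[3] = (byte) 0xC3;                 // a UTF-8 lead byte, negative
        System.out.println(signMask(ascii, 0)); // 0: keep looping
        System.out.println(signMask(mixed, 0)); // 8: bit 3 set, leave the fast path
    }
}
```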
Conclusion
We compared the C2 JIT-compiler (Java 8u241) to the Graal JIT-compiler (GraalVM EE 20.0.0) on four benchmarks: escape analysis, data-flow analysis, inlining, and vectorization. On all benchmarks, the Graal JIT-compiler outperformed the C2 JIT-compiler:
- In the escape analysis benchmark, Graal used partial escape analysis to avoid object construction.
- In the data-flow analysis benchmark, Graal used data-flow analysis to remove unreachable code.
- In the inlining benchmark, Graal inlined a megamorphic call site to avoid call overhead.
- In the vectorization benchmark, C2 unrolled the loop but Graal vectorized the loop.