Graal vs. C2: Battle of the JITs
GraalVM claims to "Run Programs Faster Anywhere". Indeed, existing applications often run faster on GraalVM than on OpenJDK and Oracle JDK. For example, Twitter reports an 11 percent reduction in CPU utilization from using Graal CE. With GraalVM EE they were even able to reduce CPU utilization by a staggering 20 percent. What often remains undiscussed, however, is how exactly GraalVM achieves this speedup. In this blog post I try to answer that question by comparing GraalVM EE 20.0.0 to Oracle Java 8u241 on a number of benchmarks.
This post is an extended version of the talk I gave at the Compiler Construction course (CS4200) at Delft University of Technology, the slides of which are available here. Both my talk and this post are inspired by Ionut Balosin's A race of two compilers. I have also taken a lot of inspiration from Aleksey Shipilev's JVM Anatomy Quarks.
Introduction
Before discussing how GraalVM achieves a performance speedup over Oracle JDK, we need to briefly introduce the JVM's tiered compilation scheme.
The first step in running a Java program is to compile a `.java` file containing Java source to a `.class` file containing JVM bytecode. When running the resulting `.class` file, the JVM goes through several stages:
- First the bytecode is interpreted. You can think of this as a loop that processes each bytecode instruction individually, altering the state of the virtual machine according to the semantics of the instruction.
- Once a method becomes hot the client compiler (C1) compiles the bytecode to machine code (instructions native to the host platform) that carries out equivalent operations. C1 is a fast, lightly optimizing bytecode compiler.
- When the code becomes even hotter, the server compiler (C2) kicks in. This compiler produces the most performant machine code.
A note on JIT. A bytecode interpreter is much slower than equivalent machine code, often by at least an order of magnitude. What, then, is so great about JIT compilation? First, while the bytecode is being interpreted, the profiler collects information such as how often a branch is taken and which types of values are observed. The JIT compiler can use this profiling information when deciding what machine code to generate. Second, and more importantly, the JIT compiler can make speculative assumptions and deoptimize if those assumptions turn out to be wrong. These two ingredients make it possible for a JIT compiler to outperform an ahead-of-time compiler.
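To make speculation concrete, here is a minimal sketch (all class and method names are my own illustration, not code from this post's benchmarks): if the profiler has only ever observed one receiver type at a call site, the JIT can inline that implementation behind a cheap type check and deoptimize back to the interpreter if the check ever fails.

```java
public class SpeculationDemo {
    interface Shape { double area(); }

    static class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    // If only Squares have ever been observed here, the JIT can replace the
    // virtual dispatch with (roughly): "if (s instanceof Square) use the
    // inlined Square.area body, else deoptimize".
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Square(3) };
        System.out.println(totalArea(shapes)); // 13.0
    }
}
```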
For the purpose of this blog post it is important to know that in GraalVM the C1 and C2 JIT compilers have been replaced by the Graal JIT compiler. The next four sections each present an experiment in which GraalVM is compared to Oracle Java. The code for the benchmarks in this blog post is available on GitHub. The benchmarks are run on GraalVM EE 20.0.0 and Oracle Java 8u241 on a MacBook Pro (13-inch, 2017, 2.5 GHz Intel Core i7).
(Partial) Escape Analysis
Escape analysis is a method for determining the dynamic scope of pointers. Essentially, escape analysis tells us when a reference to an object can escape to other threads of execution or to calling subroutines. A couple of interesting optimizations can be performed when an object is known NOT to escape:
- Stack allocation. With stack allocation, an object is allocated on the stack instead of the heap. Stack allocation is much faster than heap allocation, because it only involves moving the stack pointer. In addition, stack allocation avoids the need for expensive garbage collection.
- Scalar replacement. Scalar replacement replaces an object by its fields. Like with stack allocation, the fields may be allocated on the stack instead of the heap. In addition, the scalar variables may be amenable to further optimizations.
- Lock elision. With lock elision, expensive locking operations are removed. Some classes in the Java standard library assume they may be used in a multithreaded environment, so access in their methods is pessimistically synchronized. However, if an object never leaves the method in which it is constructed, no two threads can ever access it, and the synchronization in the methods that access it can be elided.
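For instance, `StringBuffer`'s methods are synchronized, but when the buffer never escapes its method, escape analysis lets the JIT drop those monitor operations entirely. A minimal sketch (the `label` helper is hypothetical, not from the benchmark code):

```java
public class LockElisionDemo {
    // sb never escapes this method, so no other thread can ever contend on
    // it; the JIT may elide the monitor enter/exit in StringBuffer's
    // synchronized append/toString without changing behavior.
    static String label(int n) {
        StringBuffer sb = new StringBuffer(); // non-escaping
        sb.append("item-");
        sb.append(n);
        return sb.toString();                 // only the String escapes
    }

    public static void main(String[] args) {
        System.out.println(label(3)); // item-3
    }
}
```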
Partial escape analysis is a stronger variant of escape analysis that determines whether an object escapes on each path through a method, instead of for the method as a whole. This means that even if there is a path through the method on which the object escapes, we may still be able to apply the above optimizations on the other paths. Of course, nothing comes for free: the compiler needs to insert instructions that reconstruct the object on the heap when the "escaping" path is taken. Let's look at an example.
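The benchmark source is not reproduced in this excerpt; its pattern corresponds roughly to the following sketch (the class and field names are guesses based on the `getfield x` and `consume` annotations in the assembly below, not the actual benchmark code):

```java
public class Escape {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    static MyObject escaped;   // assigning here makes an object escape

    static void consume(int x) { /* stand-in for a JMH Blackhole */ }

    // On the common path the object never escapes, so Graal can scalar-replace
    // it and pass the field x to consume directly; only on the rare path does
    // the compiler have to materialize the object on the heap.
    static long split(int iterations, boolean rare) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            MyObject o = new MyObject(i);
            if (rare) {
                escaped = o;       // escaping path
            }
            consume(o.x);          // non-escaping path: only the field is used
            sum += o.x;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(split(300, false)); // 44850
    }
}
```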
Benchmark | Dist | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Escape.split | graal | avgt | 5 | 669.760 ± 78.310 | ns/op
Escape.split | oracle | avgt | 5 | 1182.732 ± 145.990 | ns/op
The benchmark shows that Graal is about twice as fast as C2.
Let's add the `perfasm` (`dtraceasm`) profiler to take a look at the assembly code that gets generated:
Graal:
0.12% │ 0x000000011c191640: mov %r11d,0x2c(%rsp)
9.17% │ 0x000000011c191645: mov 0xc(%r10),%r9d ; *getfield x
1.69% │ 0x000000011c191649: mov %rdi,%rsi ; *invokevirtual consume
0.23% │ 0x000000011c19164c: mov %r9d,%edx
0.07% │ 0x000000011c19164f: callq 0x000000011bf990a0 ; *invokevirtual consume
16.97% │ 0x000000011c191654: nop
1.55% │ 0x000000011c191655: mov 0x2c(%rsp),%r11d
0.33% │ 0x000000011c19165a: inc %r11d ; *iinc
8.39% │ 0x000000011c19165d: mov 0x8(%rsp),%r10
1.76% │ 0x000000011c191662: mov 0x10(%rsp),%rdi ; *iload_2
0.09% │ 0x000000011c191667: cmp $0x12c,%r11d
╰ 0x000000011c19166e: jl 0x000000011c191640 ; *if_icmpge
As you can see, there is no trace of the `MyObject` construction. Instead, the field `x` is immediately passed to the `consume` function.
Let's look at what happens on Oracle Java:
Oracle:
0.12% │ 0x000000011e7a44bf: mov 0x50(%rsp),%r9
0.12% │ 0x000000011e7a44c4: movzbl 0x94(%r9),%r10d ; *getfield isDone
0.12% │ 0x000000011e7a44cc: mov 0x8(%rsp),%rbx
│ 0x000000011e7a44d1: add $0x1,%rbx ; *ifeq
│ 0x000000011e7a44d5: test %eax,-0xf8ae4db(%rip)
│ 0x000000011e7a44db: test %r10d,%r10d
│ 0x000000011e7a44de: jne 0x000000011e7a469b ; *aload
0.02% │ 0x000000011e7a44e4: xor %ebp,%ebp
│ 0x000000011e7a44e6: jmp 0x000000011e7a4558
│ 0x000000011e7a44e8: nopl 0x0(%rax,%rax,1)
0.82% │ 0x000000011e7a44f0: mov %r10,0x60(%r15)
0.32% │ 0x000000011e7a44f4: prefetchw 0xc0(%r10)
3.30% │ 0x000000011e7a44fc: mov 0x18(%rsp),%r11
1.05% │ 0x000000011e7a4501: mov 0xa8(%r11),%r10
0.82% │ 0x000000011e7a4508: mov %r10,(%rax)
1.57% │ 0x000000011e7a450b: movl $0xf80190ff,0x8(%rax) ; Put class word header
3.32% │ 0x000000011e7a4512: mov %r12d,0xc(%rax) ; *new
1.80% │ 0x000000011e7a4516: mov 0x24(%rsp),%r10d
0.90% │ 0x000000011e7a451b: mov %r10d,0xc(%rax) ; *synchronization entry
1.22% │ 0x000000011e7a451f: mov %r8,(%rsp)
1.80% │ 0x000000011e7a4523: mov %r9,0x50(%rsp)
1.92% │ 0x000000011e7a4528: mov %r11,0x18(%rsp)
0.47% │ 0x000000011e7a452d: mov %rbx,0x8(%rsp) ; *aload_1
1.47% │ 0x000000011e7a4532: mov 0xc(%rax),%edx ; *getfield x
5.20% │ 0x000000011e7a4535: mov 0x60(%rsp),%rsi
1.62% │ 0x000000011e7a453a: nop
0.47% │ 0x000000011e7a453b: callq 0x000000011e5e00a0 ; *invokevirtual consume
10.67% │ 0x000000011e7a4540: inc %ebp ; *iinc
1.80% │ 0x000000011e7a4542: cmp $0x12c,%ebp
╰ 0x000000011e7a4548: jge 0x000000011e7a44bf ; *if_icmpge
In the introduction I mentioned that Twitter achieves an 11 percent reduction in CPU utilization by switching to the Graal JIT compiler. They also report that using Graal without partial escape analysis results in only a 5 percent reduction. This should give you an idea of how big the contribution of escape analysis is to Graal's performance. If you want to learn more about escape analysis, I can recommend Lukas Stadler's PhD thesis.
Data-Flow Analysis
Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.
One example of data-flow analysis is constant folding: the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime.
Constant folding can make use of arithmetic identities, e.g. if `x` is an integer, then the value `0 * x` is zero regardless of the value of `x`.
This is a fairly obvious example, but what follows is an example of constant folding that may not be so obvious.
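The benchmark body is not shown in this excerpt; it is essentially a loop guarded by `random.nextInt(2) > 1`. A sketch along those lines (my reconstruction, not the actual source):

```java
import java.util.Random;

public class Dataflow {
    // nextInt(2) can only return 0 or 1, so the branch below is provably
    // dead; a compiler that tracks value ranges can remove it, together
    // with most of the work feeding it.
    static int random(Random random, int iterations) {
        int value = 0;
        for (int i = 0; i < iterations; i++) {
            if (random.nextInt(2) > 1) {   // never true
                value += i;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(random(new Random(), 1_000)); // always 0
    }
}
```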
If we run this on GraalVM EE 20.0.0 and Oracle Java we get the following result:
Benchmark | Dist | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Dataflow.random | graal | avgt | 10 | 51.730 ± 4.008 | ns/op
Dataflow.random | oracle | avgt | 10 | 2481.222 ± 107.754 | ns/op
It appears that Graal's data-flow analysis could deduce that `r > 1` is always false. Let's look at the implementation of `Random#nextInt(int)` and speculate about what may happen under the hood:
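From memory, for a power-of-two bound JDK 8's `Random#nextInt(int)` reduces to the expression below. This standalone paraphrase (the `nextIntPow2` name is mine; see the JDK source for the real thing) lets us check it against the actual implementation:

```java
import java.util.Random;

public class NextIntSketch extends Random {
    NextIntSketch(long seed) { super(seed); }

    // For a power-of-two bound, JDK 8's nextInt(int) roughly boils down to:
    // take 31 random bits and keep the top log2(bound) of them.
    int nextIntPow2(int bound) {
        int r = next(31);                        // 31 random bits, never negative
        return (int) ((bound * (long) r) >> 31);
    }

    public static void main(String[] args) {
        NextIntSketch sketch = new NextIntSketch(42);
        Random reference = new Random(42);
        for (int i = 0; i < 5; i++) {
            // With the same seed this should match Random#nextInt(2).
            System.out.println(sketch.nextIntPow2(2) + " " + reference.nextInt(2));
        }
    }
}
```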
The parameter `bound` will be bound to `2`, which means the method will return `(int) ((bound * (long) r) >> 31)`. Without making any assumptions about the value of `r`, what are the possible values of this expression?
Let's introduce some notation to make the discussion easier: each bit is either `0`, `1`, or `x` (unknown), and `b{n}` denotes `b` repeated `n` times.
- We start with 32 bits: `1x{31}` or `0x{31}`
- Widen (sign-extend) to 64 bits: `1{32}1x{31}` or `0{32}0x{31}`
- Shift one position left: `1{31}1x{31}0` or `0{31}0x{31}0`
- Shift 31 positions right: `1{31}1{31}1x` or `0{31}0{31}0x`
- Narrow to 32 bits: `1{31}x` or `0{31}x`
Without knowing anything about the random value `r`, we know that this expression evaluates to one of at most four possible values: `1{31}0` (= -2), `1{32}` (= -1), `0{32}` (= 0), or `0{31}1` (= 1).
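We can sanity-check this analysis by evaluating the expression for extreme values of `r` (a quick verification sketch, not part of the original benchmark code):

```java
public class NextIntBounds {
    // Evaluate (int) ((2 * (long) r) >> 31) for any int r, even a negative
    // one; by the bit analysis above it must be one of -2, -1, 0, 1.
    static int narrow(int r) {
        return (int) ((2 * (long) r) >> 31);
    }

    public static void main(String[] args) {
        int[] probes = { Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE };
        for (int r : probes) {
            System.out.println(r + " -> " + narrow(r));
        }
        // prints: -2147483648 -> -2, -1 -> -1, 0 -> 0, 1 -> 0, 2147483647 -> 1
    }
}
```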
This example may seem a bit contrived: nobody would ever write `random.nextInt(2) > 1`. However, it illustrates the surprising ways in which Graal is able to optimize our code. Moreover, it shows that if the JIT compiler knows that the range of possible values is restricted to some subset, it may be able to perform more aggressive optimizations and generate more efficient code.
Inlining
Inlining is a compiler optimization that replaces a function call site with the body of the called function. It is one of the fundamental optimizations because it creates new opportunities for further optimization. The Graal inliner is much more aggressive than C2's; the following example shows why.
We have defined a class `A` with an abstract method `m` and three concrete subclasses `C1`, `C2`, and `C3` that implement `m`. The array `as` is filled with instances of `C1`, `C2`, and `C3` in round-robin fashion. The benchmark invokes the method `m` on each item `a` in the array `as`. Since this is a virtual method call, the runtime needs to resolve the implementation based on the runtime type of the receiver `a`.
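That setup is not reproduced in this excerpt; it corresponds roughly to the following sketch (the field names `c1`/`c2`/`c3` are taken from the `putfield` annotations in the assembly below, the rest is my reconstruction):

```java
public class Inlining {
    abstract static class A {
        int c1, c2, c3;
        abstract void m();
    }
    static class C1 extends A { void m() { c1++; } }
    static class C2 extends A { void m() { c2++; } }
    static class C3 extends A { void m() { c3++; } }

    // Fill the array with C1, C2, C3 instances in round-robin fashion.
    static A[] fill(int n) {
        A[] as = new A[n];
        for (int i = 0; i < n; i++) {
            switch (i % 3) {
                case 0:  as[i] = new C1(); break;
                case 1:  as[i] = new C2(); break;
                default: as[i] = new C3(); break;
            }
        }
        return as;
    }

    // The benchmark body: a megamorphic virtual call on every element.
    static void test(A[] as) {
        for (A a : as) {
            a.m();
        }
    }

    public static void main(String[] args) {
        A[] as = fill(300);
        test(as);
        int[] totals = new int[3];
        for (A a : as) {
            totals[0] += a.c1;
            totals[1] += a.c2;
            totals[2] += a.c3;
        }
        System.out.println(totals[0] + " " + totals[1] + " " + totals[2]); // 100 100 100
    }
}
```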
Let's look at how the two JITs perform on this example:
Dist | Benchmark | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Oracle | Inlining.test | avgt | 10 | 769.603 ± 24.531 | ns/op
Graal | Inlining.test | avgt | 10 | 497.817 ± 13.365 | ns/op
Graal is much faster (1.55x) than Oracle.
Let's look at the assembly code to see what is going on (by adding `-prof perfasm` or `-prof dtraceasm` to the JMH command):
Oracle:
9.75% │ 0x54790: mov 0x40(%rsp),%r9
0.14% │ 0x54795: data16 data16 nopw 0x0(%rax,%rax,1) ; *aload_1
0.88% │ 0x547a0: mov 0x10(%r9,%rbp,4),%r10d
8.10% │ 0x547a5: mov %r9,0x40(%rsp)
1.52% │ 0x547aa: mov %r10,%rsi
0.06% │ 0x547ad: shl $0x3,%rsi
0.94% │ 0x547b1: movabs $0xffffffffffffffff,%rax
8.19% │ 0x547bb: callq 0x032e0 ; *invokevirtual m
1.24% │ 0x547c0: inc %ebp ; *iinc
0.85% │ 0x547c2: cmp 0x10(%rsp),%ebp
╰ 0x547c6: jl 0x54790
C1::m
<function prologue>
0.55% 0x53b0c: incl 0xc(%rsi) ; Increment at offset 12 (= field c1)
<function epilogue>
C2::m
<function prologue>
0.50% 0x538cc: incl 0x10(%rsi) ; Increment at offset 16 (= field c2)
<function epilogue>
C3::m
<function prologue>
0.36% 0x5368c: incl 0x14(%rsi) ; Increment at offset 20 (= field c3)
<function epilogue>
Graal:
2.81% │││ 0x1d89: mov 0x10(%r8,%rcx,4),%ebx ; *aaload
9.85% │││ 0x1d8e: mov %ecx,%edi
2.67% │││ 0x1d90: inc %edi ; *iinc
3.56% │││ 0x1d9a: cmpl $0xf801e2c3,0x8(,%rbx,8) ; {metadata("C2")} | Is this object of class C2?
24.41% │ │││ 0x1da5: je 0x1e0e | If yes, jump to inlined C2::m
0.70% │ │││ 0x1db3: cmpl $0xf801e341,0x8(,%rbx,8) ; {metadata("C3")} | Is this object of class C3?
5.15% │ ││││ 0x1dbe: jne 0x1de3 ; *invokevirtual m | If no, jump over C3::m to next check
2.04% │ ││││ 0x1dc4: mov $0x1,%ecx | Inlined C3::m
0.15% │ ││││ 0x1dc9: add 0x14(,%rbx,8),%ecx ; *iadd | Inlined C3::m
3.41% │ ││││ 0x1dd0: mov %ecx,0x14(,%rbx,8) ; *putfield c3 | Inlined C3::m
2.78% │ ││││ 0x1dd7: mov %edi,%ecx
0.22% │ │╰││ 0x1dd9: jmp 0x1d80 | Jump back to process next object
3.56% │ │ ││ 0x1de3: cmpl $0xf801e302,0x8(,%rbx,8) ; {metadata("C1")} | Is this object of class C1?
2.07% │ ││ 0x1dee: jne 0x1eb2 ; *invokevirtual m | If not, jump to virtual call
3.52% │ ││ 0x1df4: mov $0x1,%ecx | Inlined C1::m
0.63% │ ││ 0x1df9: add 0xc(,%rbx,8),%ecx ; *iadd | Inlined C1::m
1.15% │ ││ 0x1e00: mov %ecx,0xc(,%rbx,8) ; *putfield c1 | Inlined C1::m
4.44% │ ││ 0x1e07: mov %edi,%ecx
2.07% │ ╰│ 0x1e09: jmpq 0x1d80 ; *invokevirtual m | Jump back to process next object
5.00% │ │ 0x1e0e: mov $0x1,%ecx | Inlined C2::m
0.33% │ 0x1e13: add 0x10(,%rbx,8),%ecx ; *iadd | Inlined C2::m
0.81% │ 0x1e1a: mov %ecx,0x10(,%rbx,8) ; *putfield c2 | Inlined C2::m
7.11% │ 0x1e21: mov %edi,%ecx
0.52% ╰ 0x1e23: jmpq 0x1d80 ; *iload_3 | Jump back to process next object
Oracle did not inline the method call. Instead, the JVM method invocation translates to an assembly call, including the necessary function prologue and epilogue. Graal, on the other hand, compares the object's class word against the native pointer of the VM's Klass instance to decide which implementation to use, and jumps to that implementation. Only when an instance that is not of class C1, C2, or C3 is observed does the code fall back to a virtual call.
Influencing the Inliner
Weβve looked at how megamorphic call sites are inlined in C2 and Graal, but there are many other factors that may influence the inliner.
For example, in C2 you can specify the maximum number of nested calls that are inlined (`-XX:MaxInlineLevel`) and the maximum bytecode size of a method to be inlined (`-XX:MaxInlineSize`).
Vectorization
Modern processors support instructions that perform the same operation on multiple data points simultaneously. Such instructions are known as single instruction, multiple data (SIMD). Java does not yet support vector intrinsics (JEP 338: Vector API), so to benefit from SIMD instructions the JIT compiler has to apply autovectorization.
One example where vectorization can help is UTF-8 to UTF-16 decoding. In UTF-8 each character occupies 1-4 bytes, whereas in UTF-16 each character occupies exactly 2 bytes. If the string in question only contains ASCII characters, then the decoding is just a widening conversion of each byte. With SIMD instructions this widening conversion can be performed for multiple bytes in parallel. The following code shows what such a fast-path decode could look like.
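The benchmark source is not shown in this excerpt; an ASCII fast path along these lines is the pattern in question (a sketch, assuming the decode stops at the first non-ASCII byte):

```java
import java.nio.charset.StandardCharsets;

public class Vectorization {
    // ASCII fast path: while every byte is non-negative (i.e. < 0x80), the
    // UTF-8 to UTF-16 decode is a plain widening conversion per byte; this
    // simple loop is what the JIT may autovectorize.
    static int decodeAscii(byte[] src, char[] dst) {
        int i = 0;
        while (i < src.length && src[i] >= 0) {
            dst[i] = (char) src[i];   // widen byte -> char
            i++;
        }
        return i;                     // number of chars decoded
    }

    public static void main(String[] args) {
        byte[] src = "hello, world".getBytes(StandardCharsets.US_ASCII);
        char[] dst = new char[src.length];
        int n = decodeAscii(src, dst);
        System.out.println(new String(dst, 0, n)); // hello, world
    }
}
```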
If we run this benchmark we obtain the following results:
Benchmark | Compiler | Mode | Cnt | Score ± Error | Units
---|---|---|---|---|---
Vectorization.decodeAscii | C2 | avgt | 10 | 547.945 ± 5.376 | ns/op
Vectorization.decodeAscii | Graal | avgt | 10 | 147.822 ± 5.606 | ns/op
Graal outperforms C2 by 3.7x; that is a pretty big difference!
Let's add `-prof dtraceasm` (Linux: `-prof perfasm`) and take a look at the assembly code that the two JIT compilers generate.
C2
0x08: mov $0x200,%ecx
0x0d: xor %rax,%rax
0x10: shl $0x3,%rcx
0x14: rep rex.W stos %al,%es:(%rdi)
0x17: prefetchw 0x140(%r10)
0x1f: prefetchw 0x180(%r10)
...
│ 0xc70: movslq %edi,%rbx
│ 0xc73: movsbl 0x10(%r8,%rbx,1),%r9d
│ 0xc79: test %r9d,%r9d
│ 0xc7c: jl 0x000000010ba8fd58
│ 0xc82: mov %r9w,0x10(%rdx,%rbx,2)
│ 0xc88: movsbl 0x11(%r8,%rbx,1),%r9d
│ 0xc8e: test %r9d,%r9d
│ 0xc91: jl 0x000000010ba8fd63
│ 0xc97: mov %r9w,0x12(%rdx,%rbx,2)
│ 0xc9d: movsbl 0x12(%r8,%rbx,1),%r9d
│ 0xca3: test %r9d,%r9d
│ 0xca6: jl 0x000000010ba8fd69
│ 0xcac: mov %r9w,0x14(%rdx,%rbx,2)
│ 0xcb2: movsbl 0x13(%r8,%rbx,1),%r9d
│ 0xcb8: test %r9d,%r9d
│ 0xcbb: jl 0x000000010ba8fd5c
│ 0xcc1: mov %r9w,0x16(%rdx,%rbx,2)
│ 0xcc7: movsbl 0x14(%r8,%rbx,1),%r9d
│ 0xccd: mov %edi,%ecx
│ 0xccf: add $0x4,%ecx
│ 0xcd2: test %r9d,%r9d
│ 0xcd5: jl 0x000000010ba8fd6e
│ 0xcdb: mov %r9w,0x18(%rdx,%rbx,2)
│ 0xce1: movsbl 0x15(%r8,%rbx,1),%r9d
│ 0xce7: test %r9d,%r9d
│ 0xcea: jl 0x000000010ba8fd65
│ 0xcec: mov %r9w,0x1a(%rdx,%rbx,2)
│ 0xcf2: movsbl 0x16(%r8,%rbx,1),%r9d
│ 0xcf8: test %r9d,%r9d
│ 0xcfb: jl 0x000000010ba8fd6b
│ 0xcfd: mov %r9w,0x1c(%rdx,%rbx,2)
│ 0xd03: movsbl 0x17(%r8,%rbx,1),%r9d
│ 0xd09: test %r9d,%r9d
│ 0xd0c: jl 0x000000010ba8fd5e
│ 0xd0e: mov %r9w,0x1e(%rdx,%rbx,2)
│ 0xd14: add $0x8,%edi
│ 0xd17: cmp $0x3f9,%edi
│ 0xd1d: jl 0xc70
│ 0xd23: cmp $0x400,%edi
╰ 0xd29: jge 0x000000010ba8fb91
First note that `mov $0x200,%ecx` moves the decimal 512 into `%ecx`, and the `shl` that follows changes this to 4096. The `xor %rax,%rax` instruction zeros `%al`. The `rep rex.W stos %al,%es:(%rdi)` instruction iterates `%ecx` (= 4096) times, each time copying the zero from `%al` to `%es:(%rdi)` and incrementing `%rdi`. To summarize, this block initializes the `char[2048]` array (each char takes up two bytes).
The second thing to notice is that C2 unrolled the loop body 8 times. Specifically, the sign of each byte is checked individually and a jump is performed if the sign is negative.
GraalVM
0x38: mov $0xffffffff,%r10d
│ 0x70: lea 0x8(%r13),%r13
│ 0x74: cmp %rbp,%r13
│ 0x77: jg 0x000000010e10cc8a
│ 0x7d: vmovdqu (%rbx,%r13,4),%ymm0
│ 0x83: vpxor %xmm1,%xmm1,%xmm1
│ 0x87: vpcmpgtd %ymm0,%ymm1,%ymm0
│ 0x8b: vpmovmskb %ymm0,%r14d
│ 0x8f: test %r10d,%r14d
╰ 0x92: je 0x70
Even though the Java loop contains a condition and multiple exits, Graal was able to compile it to AVX2 instructions.
The `vmovdqu` instruction moves 256 bits (32 bytes) into the `%ymm0` register. The `vpxor` instruction effectively zeros the `%ymm1` register (a VEX-encoded write to `%xmm1` clears the upper half as well). The `vpcmpgtd` instruction compares each signed DWORD in `%ymm0` and `%ymm1` and stores the result in the corresponding DWORD of `%ymm0`. The `vpmovmskb` instruction moves the most significant bit of each byte of the source operand `%ymm0` (256 bits) to the destination operand `%r14d` (32 bits). The `test` instruction sets the zero flag if the bitwise AND of its operands `%r10d` and `%r14d` is zero, which is the case only when `%r14d` is zero. The `je` instruction jumps to the start of the loop if the zero flag is set.
To summarize: each iteration of the loop processes 32 bytes. Each of the 32 bytes is mapped to 0 if the byte represents a value greater than or equal to 0 and mapped to 1 otherwise. These 32 bits are collected into a single register. If this register is zero, i.e. if all original bytes were greater than or equal to 0, the loop is repeated.
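A scalar rendition of what the vector loop computes may make this clearer (a sketch; `signMask` is my own name for the combined effect of the compare and movemask steps):

```java
public class SignMask {
    // Fold the sign bits of a 32-byte chunk into one int, the way the
    // compare + vpmovmskb pair does in the vector loop: bit i is set iff
    // byte i is negative (non-ASCII). The chunk is all-ASCII iff mask == 0.
    static int signMask(byte[] src, int from) {
        int mask = 0;
        for (int i = 0; i < 32; i++) {
            if (src[from + i] < 0) {
                mask |= 1 << i;
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        byte[] ascii = new byte[32];            // all zero bytes: plain ASCII
        byte[] mixed = new byte[32];
        mixed[3] = (byte) 0xC3;                 // a UTF-8 lead byte, negative
        System.out.println(signMask(ascii, 0)); // 0: keep looping
        System.out.println(signMask(mixed, 0)); // 8: bit 3 set, leave the fast path
    }
}
```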
Conclusion
We compared the C2 JIT-compiler (Java 8u241) to the Graal JIT-compiler (GraalVM EE 20.0.0) on four benchmarks: escape analysis, data-flow analysis, inlining, and vectorization. On all benchmarks, the Graal JIT-compiler outperformed the C2 JIT-compiler:
- In the escape analysis benchmark, Graal used partial escape analysis to avoid object construction.
- In the data-flow analysis benchmark, Graal used data-flow analysis to remove unreachable code.
- In the inlining benchmark, Graal inlined a megamorphic call site to avoid call overhead.
- In the vectorization benchmark, C2 unrolled the loop but Graal vectorized the loop.