Thursday, 5 September 2024

JVM Tiered Compilation

Over the years I've written multiple posts with basic information about JIT's and interpreters (I'll just link a few of them: [1], [2]). We could say that the combination of an interpreter, multi-tier JIT compilation, profiling, optimization and deoptimization... really fascinates me. The truffle interpreters and Graal JIT compilation combination [1], [2] is one of the last advances in the area that has kept me dreming with better understanding how all this works. While rethinking a bit about this topic I've realised that I had some misconceptions and doubts (kind of: "I think it's like that, but indeed I'm not sure about having read it or just being assuming it"). For example, I had got to wrongly think that the cycle of profiling-optimizing a given method was sort of neverending, which is not the case (at least in the JVM, either with C2 or Graal). So as of late I've been reading some pretty interesting material on this topic, mainly about the JVM, and I'll summarize here my findings.

When we have an interpreter, a "simple" and fast JIT and a more advanced and slower (in compilation time, of course the code it generates is faster) JIT, the base principle that drives the interaction between these 3 components is that the interpreter counts how many times it invokes a method and when it exceeds a threshold (the method is hot) it compiles it with the "simple" JIT compiler. This compiler adds "counting logic" to the code it generates, so that the method invokation counting continues and when another threshold is exceeded the method is invoked with the advanced JIT compiler. OK, things are more complex than that in the JVM. This article gives an excellent explanation (just notice that if we are using the Graal VM in not AOT mode we don't have C2, but Graal)

First, the interpreter is not just counting method invokations, it's also gathering profiling information. Well, from what I've read, the interpreter will only start to gather profiling information after a certain threshold of invokations. The "simple" C1 compiler also adds logic to gather profile info to the native code that it generates, so that in the future the "advanced" C2 compiler can use that profiling information to generate hyper-optimized code. It's important to note that profiling has a cost. When your code is not just doing its job but also gathering profile information it's doing extra work. That's why the C2 compiler is the ending point, it will not do further optimizations. It does not generate code with "invokation counting" and profile gathering logic so that in the future another more optimized version can be compiled. The code generated by C2 is just as optimized as "necessary", and that's all. Well, indeed there's an additional player, deoptimization. Both C1 and C2 do assumptions when compiling, with the intent to get optimized code, but those assumptions could be proven wrong in the future (the generated code contains assertions/guards to verify this), so in that case the code has to be deoptimized, returning to the interpretation stage and starting over. This was another source of doubts for me, I thought deoptimization would jump to a previous native code version, but no, it's mentioned in multiple sources that it just returns to the interpreted form. These 2 images are crystal clear:

Furthermore, there are 3 levels of C1 compilations. It's well explained in that article, and also in this one.

So the levels of compilation are as follows:

  • 0: Interpreted code
  • 1: Simple C1 compiled code
  • 2: Limited C1 compiled code
  • 3: Full C1 compiled code
  • 4: C2 compiled code

C1 and C2 compilers run in separate threads (even several threads for each compiler depending on your number of cores). Each compiler has a queue where it receives compilation requests (so your program does not suddenly "stop" waiting for a compilation). Trivial methods that would not benefit of further profiling and C2 compilation are sent from the interpreter to C1 for a level 1 compilation, that won't add any profiling to the generated code (it won't even add the invokation-counting logic to it). That code is "final" unless it has to be deoptimized. Non trivial methods are sent to C1 level 3 (that adds full profiling) or C1 level 2 (that adds light profiling). It's not clear to me yet when it sends to level2 rather than level3. Finally, when a method is hot enough it's sent to the C2 compiler along with all the profiling information.

There's another point that I should detail a bit more. When I say that the interpreter and the C1 generated code are counting how many times a method is invoked, it's indeed more complex. It also counts how many loop iterations run inside that method. Invoking a method once, but then running a loop for 1000 iterations in that method makes it hotter than invoking 100 times a method that has no loops.

Compilation is based on two counters in the JVM: the number of times the method has been called, - and the number of times any loops in the method have branched back.

As one methods evolves from its bytecode form (to be interpreted) to one of several native forms (and then back to bytecode form if it has been deoptimized), how does the code that invokes that method know where to jump to? We are not going to be patching the many possible callsites to that method, so the solution seems simple (once you read it from someone smarter than you), adding an extra level of indirection. So calls to a method will go through a table that points to the actual code (the native code version or code that continues interpreting bytecodes). So it's that table who gets patched to point to the right address as the method evolves. We already have this kind of tables (i-tables and v-tables) for virtual methods, so I wonder if this patching is just applied on such v-table, or entries in that v-table point to an additional "bytecode vs native" table on which this patching happens. Wow, VM's are amazingly complex.

Another point that seems amazingly complex to me is the deoptimization part. So you are executing the Jitted compiled version of one method and an assertion/guard fails and we have to go back to the bytecode interpreted version. This is not only that future invokations of that method will have to jump into the interpreted version (the indirection table I've just talked about), it's that we are in the middle of the execution of one method (well, this sound also like the on stack replacement thing...). There's an article where they talk about this in the truffle-graal world.

The first problem is how to jump from compiled code back into an interpreter. First of all, we need to be able to jump from compiled code back into the interpreter. This isn’t as simple as a goto. We not only need to jump from one place to another, but we also need to recreate the state that the interpreter expects. That means that we need to recreate the stack frames for every method that was compiled but now is being interpreted. We then just have to restart execution in the interpreter in the AST node that corresponds to the machine code where we deoptimized - so we maintain a mapping between the two for all generated code.

Reading the different articles cited above I've been amazed by the amount of options that the JVM accepts (parameters that you pass to the java process in the command line). I was familiar with the -Xms and -Xmx settings to set the Heap size, and also the -Xss option to set the Thread stack size. But there are so many more! For example:

  • XX:CompileThreshold=invocations. Sets the number of interpreted method invocations before compilation. By default, in the server JVM, the JIT compiler performs 10,000 interpreted method invocations to gather information for efficient compilation.
  • -XX:-TieredCompilation. To disable tiered compilation (it's enabled by default)
  • -XX:+PrintCompilation. When enabling this optione every time a method is compiled the JVM prints out a line with information about what has just been compiled

I won't close this post without linking these 2 articles about meta-compilation, tracing and partial evaluation, [1] and [2]

Meta-compilation reduces the individual implementation effort by making just-in-time compilers reusable and independent of any particular language. RPython and GraalVM are the main systems in this space. Both approach the problem by compiling the executing program together with the interpreter that is executing it. This combination provides the necessary specialization opportunities to produce efficient native code that rivals custom-built state-of-the-art JIT compilers.

No comments:

Post a Comment