Java Lock Coarsening and Loops

  • 2021-07-22 09:36:54
  • OfStack

1. Preface

"JVM Anatomy Park" is a series of mini-blogs that are constantly updated. It takes 5 to 10 minutes to read each article. Limited by space, only a topic is explained in depth according to questions, tests, benchmark procedures and observation results. Therefore, the data and discussion here can be viewed as anecdotes without checking for 1, writing style, syntactic and semantic errors, repetition or 1. If you choose to accept the contents of this article, you are at your own risk.

Aleksey Shipilëv, JVM performance geek

Twitter: @shipilev

Send questions, comments and suggestions to aleksey@shipilev.net

Lock coarsening (Lock Coarsening) is the process of merging adjacent synchronized blocks that use the same lock object. If the compiler cannot eliminate a lock entirely via lock elision (Lock Elision), lock coarsening can still reduce its overhead.
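For contrast, here is a sketch of a lock-elision candidate (a hypothetical example, not from the original article): the lock object never escapes the method, so escape analysis lets the JIT prove that no other thread can ever contend on it and drop the monitor entirely.

```java
// Hypothetical lock-elision candidate (illustrative names): `local` is
// allocated inside the method and never published, so the JIT can prove
// the synchronized block is uncontended and elide the lock.
class Elision {
    static int elidable() {
        Object local = new Object(); // thread-local, never escapes
        int sum = 0;
        synchronized (local) {       // eligible for lock elision
            sum += 1;
        }
        return sum;
    }
}
```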

2. Problems

As we all know, Hotspot does perform lock coarsening, which can effectively merge several adjacent synchronized blocks and thus reduce lock overhead. For example, it can transform the following code


synchronized (obj) {
 //  Statement  1
}
synchronized (obj) {
 //  Statement  2
}

into


synchronized (obj) {
 //  Statement  1
 //  Statement  2
}

The question is, can Hotspot apply the same optimization to loops? For example, can it transform


for (...) {
 synchronized (obj) {
  // 1 Some operations 
 }
}

into the following?


synchronized (obj) {
 for (...) {
   // 1 Some operations 
 }
}

In theory, nothing stops us from doing this; we can even think of this transformation as loop unswitching applied only to locks. The downside, however, is that the coarsening may become too coarse: while executing the loop, the thread holds the lock the entire time.

Loop unswitching is a compiler optimization technique. By duplicating the loop body and placing one copy in each branch of an if-else statement, a conditional inside the loop is moved outside of it. Each resulting loop contains no condition check and can be optimized independently (for example, vectorized), improving execution speed.
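As an illustration of that note (a hypothetical example with made-up names, not from the original), loop unswitching hoists a loop-invariant condition out of the loop by duplicating the loop body:

```java
// Hypothetical illustration of loop unswitching (names are made up).
class Unswitching {
    // Before: the loop-invariant condition `negate` is tested on every iteration.
    static int sumBefore(int[] a, boolean negate) {
        int sum = 0;
        for (int v : a) {
            if (negate) sum -= v; else sum += v;
        }
        return sum;
    }

    // After: the body is duplicated and `negate` is tested once, leaving
    // two branch-free loops that the compiler can optimize separately.
    static int sumAfter(int[] a, boolean negate) {
        int sum = 0;
        if (negate) {
            for (int v : a) sum -= v;
        } else {
            for (int v : a) sum += v;
        }
        return sum;
    }
}
```

Both versions compute the same result; only the placement of the condition check differs.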

3. Experiment

The easiest way to answer this question is to look for evidence of the optimization in Hotspot's generated code. Fortunately, JMH makes this work very simple; it is useful not only for building benchmarks but also for analyzing them. Let's start with a simple benchmark:


@Fork(..., jvmArgsPrepend = {"-XX:-UseBiasedLocking"})
@State(Scope.Benchmark)
public class LockRoach {
  int x;
  @Benchmark
  @CompilerControl(CompilerControl.Mode.DONT_INLINE)
  public void test() {
    for (int c = 0; c < 1000; c++) {
      synchronized (this) {
        x += 0x42;
      }
    }
  }
}

(For the complete source code, please see the original link.)

Here are a few important tips:

  • Using -XX:-UseBiasedLocking to disable biased locking avoids long startup times: biased locking does not kick in immediately, but waits 5 seconds after JVM initialization (see the BiasedLockingStartupDelay option).
  • Disabling inlining of the @Benchmark method helps us separate it out in the disassembly.
  • Adding the "magic number" 0x42 helps us quickly locate the addition in the disassembly.

Biased locking (Biased Locking). Although CAS atomic instructions have relatively low overhead compared with heavyweight locks, they still incur considerable local latency. To avoid executing unnecessary CAS atomic instructions when acquiring a lock with no contention, biased locking was proposed.
Quickly Reacquirable Locks, Dave Dice, Mark Moir, William Scherer III.

Running environment: i7-4790K, Linux x86_64, JDK 9 EA b156:

Benchmark       Mode  Cnt     Score     Error  Units
LockRoach.test  avgt    5  5331.617  ± 19.051  ns/op

What can we conclude from this number? Nothing, right? We need to investigate what happens under the hood. The -prof perfasm profiler comes in handy here, showing the hot regions in the generated code. Running with default settings, we can see that the hottest instruction is lock cmpxchg (the CAS), and only the code near it is printed. The -prof perfasm:mergeMargin=1000 option merges these hot regions into a single output fragment, which may look a little scary at first glance.
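The profiler runs discussed here can be launched from a standard JMH uberjar; the jar name below is an assumption, and perfasm additionally needs Linux perf and the hsdis disassembler library installed:

```shell
# Assumed jar name; perfasm requires Linux `perf` and the hsdis plugin.
java -jar benchmarks.jar LockRoach -prof perfasm:mergeMargin=1000

# The same benchmark with loop unrolling restricted, as in the second run.
java -jar benchmarks.jar LockRoach -prof perfasm \
     -jvmArgsAppend -XX:LoopUnrollLimit=1
```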

A first pass over the output shows the consecutive jump instructions that lock and unlock. Paying attention to the code with the most cycles (the first column), we can see that the hottest loop looks like this:


 ↬ 0x00007f455cc708c1: lea  0x20(%rsp),%rbx
 |     <some code elided, monitor enter>      ; <--- coarsened!
 |  0x00007f455cc70918: mov  (%rsp),%r10      ; load $this
 |  0x00007f455cc7091c: mov  0xc(%r10),%r11d  ; load $this.x
 |  0x00007f455cc70920: mov  %r11d,%r10d      ; ...hm...
 |  0x00007f455cc70923: add  $0x42,%r10d      ; ...hmmm...
 |  0x00007f455cc70927: mov  (%rsp),%r8       ; ...hmmmmm!...
 |  0x00007f455cc7092b: mov  %r10d,0xc(%r8)   ; LOL Hotspot, redundant store, killed two lines below
 |  0x00007f455cc7092f: add  $0x108,%r11d     ; add 0x108 = 0x42 * 4  <--- unrolled by 4
 |  0x00007f455cc70936: mov  %r11d,0xc(%r8)   ; store $this.x back
 |     <some code elided, monitor exit>       ; <--- coarsened!
 |  0x00007f455cc709c6: add  $0x4,%ebp        ; c += 4  <--- unrolled by 4
 |  0x00007f455cc709c9: cmp  $0x3e5,%ebp      ; c < 1000?
 ↬ 0x00007f455cc709cf: jl   0x00007f455cc708c1

Aha! The loop is unrolled by 4, and lock coarsening is then applied across those 4 unrolled iterations. To quantify the coarsening performance enabled by loop unrolling, we can restrict unrolling with -XX:LoopUnrollLimit=1 and measure again.

Loop unrolling, also known as loop unwinding, is a loop transformation technique. It tries to improve a program's execution speed at the expense of binary size, a classic space-time tradeoff. The transformation can be performed manually by the programmer or applied by an optimizing compiler.
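As a sketch of the transformation (a hand-written illustration with made-up names, not what the JIT literally emits), unrolling the benchmark loop by 4 replaces four loop-condition checks with one:

```java
// Hypothetical illustration of loop unrolling by a factor of 4.
class Unrolling {
    // Before: one addition per iteration, 1000 loop-condition checks.
    static int sumBefore() {
        int x = 0;
        for (int c = 0; c < 1000; c++) x += 0x42;
        return x;
    }

    // After: four additions per iteration, only 250 checks. This mirrors
    // the unrolled-by-4 loop visible in the disassembly above.
    static int sumAfter() {
        int x = 0;
        for (int c = 0; c < 1000; c += 4) {
            x += 0x42; x += 0x42; x += 0x42; x += 0x42;
        }
        return x;
    }
}
```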


Benchmark      Mode Cnt   Score  Error Units
# Default
LockRoach.test    avgt  5  5331.617  ±  19.051 ns/op
# -XX:LoopUnrollLimit=1
LockRoach.test    avgt  5 20679.043  ±  3.133 ns/op

Wow, a 4x performance difference! This makes sense: we already observed that the hottest instruction is lock cmpxchg, so coarsening the lock over 4 iterations means 4x the throughput. Very cool. Can we declare success and move on? Not yet. We must verify that disabling loop unrolling really gives us the comparison we want. The perfasm output does indeed show a similar hot loop, just stepping one iteration at a time.


 ↬ 0x00007f964d0893d2: lea  0x20(%rsp),%rbx
 |     <some code elided, monitor enter>
 |  0x00007f964d089429: mov  (%rsp),%r10      ; load $this
 |  0x00007f964d08942d: addl $0x42,0xc(%r10)  ; $this.x += 0x42
 |     <some code elided, monitor exit>
 |  0x00007f964d0894be: inc  %ebp             ; c++
 |  0x00007f964d0894c0: cmp  $0x3e8,%ebp      ; c < 1000?
 ↬ 0x00007f964d0894c6: jl   0x00007f964d0893d2

Everything checks out.

4. Observations

Although lock coarsening does not work on an entire loop, once another loop optimization, loop unrolling, produces N adjacent lock-unlock operations in the loop body, ordinary lock coarsening kicks in. This yields the performance benefits, and it also helps limit the scope of coarsening, avoiding over-coarsening for very long loops.
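Putting the two optimizations together, the effect observed in the disassembly is roughly equivalent to the following hand-written rewrite (a sketch of the combined transformation, not code the JIT literally produces):

```java
// Sketch of unrolling by 4 plus lock coarsening, as observed above:
// one lock/unlock pair per 4 iterations instead of per iteration.
class Coarsened {
    int x;

    // The benchmark loop as written: lock and unlock 1000 times.
    void testOriginal() {
        for (int c = 0; c < 1000; c++) {
            synchronized (this) {
                x += 0x42;
            }
        }
    }

    // Roughly what the compiled code does: lock and unlock 250 times,
    // doing four iterations' worth of work under each lock acquisition.
    void testTransformed() {
        for (int c = 0; c < 1000; c += 4) {
            synchronized (this) {
                x += 0x42;
                x += 0x42;
                x += 0x42;
                x += 0x42;
            }
        }
    }
}
```

Both methods leave `x` in the same final state; the transformed version simply pays the lock overhead a quarter as often.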
