Introduction to string deduplication in Java

2020-04-01 03:58:27
OfStack

Strings take up a lot of memory in any application. In particular, the char[] arrays holding the individual UTF-16 characters contribute the most to JVM memory consumption, because each character takes up two bytes.

It's not uncommon for 30% of memory to be consumed by strings, not only because strings are the preferred format for interacting with humans, but also because popular HTTP APIs use a lot of strings. With Java 8 Update 20, we now have access to a new feature called string deduplication. It requires the G1 garbage collector and is turned off by default.

String deduplication takes advantage of the fact that the char array inside a String is internal and final, so the JVM can manipulate it at will.

The developers considered a number of strategies for string deduplication, but the final implementation takes the following approach:

Whenever the garbage collector visits a String object, it takes note of its char array. It computes the hash value of the array and stores it together with a weak reference to the array. As soon as the garbage collector finds another String whose char array has the same hash code, it compares the two arrays character by character.

If they match as well, one String is modified to point to the char array of the other String. The first char array is then no longer referenced and can be garbage collected.
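The steps above can be sketched in plain Java. This is only a simulation under stated assumptions: the real work happens inside the G1 collector on the private backing array of String, which application code cannot reach. The class DedupSketch, its Holder stand-in for String, and the use of Arrays.hashCode as the hash function are all hypothetical choices made for illustration.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** Hypothetical stand-in for String: an object with a replaceable backing array. */
    static final class Holder {
        char[] value;

        Holder(String s) {
            this.value = s.toCharArray();
        }
    }

    // Hash of the array contents -> weak references to arrays already seen with that hash.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    /** Returns true if the holder was repointed to a previously seen, equal array. */
    boolean visit(Holder h) {
        int hash = Arrays.hashCode(h.value);
        List<WeakReference<char[]>> bucket = table.computeIfAbsent(hash, k -> new ArrayList<>());
        for (Iterator<WeakReference<char[]>> it = bucket.iterator(); it.hasNext();) {
            char[] known = it.next().get();
            if (known == null) {
                it.remove();          // the array was garbage collected in the meantime
                continue;
            }
            if (known == h.value) {
                return false;         // this holder already uses the recorded array
            }
            if (Arrays.equals(known, h.value)) {
                h.value = known;      // share the array; the old one becomes unreachable
                return true;
            }
        }
        bucket.add(new WeakReference<>(h.value));
        return false;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        Holder a = new Holder("String 1");
        Holder b = new Holder("String 1");
        dedup.visit(a);
        dedup.visit(b);
        System.out.println(a.value == b.value); // prints "true": both share one array
    }
}
```

The weak references matter here: the table must not keep an array alive once no string uses it any more.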

The whole process certainly carries some overhead, but it is kept within tight limits. For example, if a string is not found to have duplicates for a while, it will no longer be checked.

So how does this feature work in practice? First, you need the newly released Java 8 Update 20. Then run the following code with the options -Xmx256m -XX:+UseG1GC:


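import java.util.LinkedList;

public class LotsOfStrings {

 // Keeps every string reachable so the heap fills up with duplicates.
 private static final LinkedList<String> LOTS_OF_STRINGS = new LinkedList<>();

 public static void main(String[] args) throws Exception {
  int iteration = 0;
  while (true) {
   for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 1000; j++) {
     // Each pass creates a fresh String instance, although only 1000 distinct values exist.
     LOTS_OF_STRINGS.add(new String("String " + j));
    }
   }
   iteration++;
   System.out.println("Survived Iteration: " + iteration);
   // Pause briefly between iterations to take pressure off the garbage collector.
   Thread.sleep(100);
  }
 }
}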

This code fails with an OutOfMemoryError after 30 iterations.

Now, turn on string deduplication and run the above code using the following configuration:


-Xmx256m -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

With these options, the program runs noticeably longer and terminates after 50 iterations.
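When comparing the two runs, it can help to confirm which options the JVM was actually started with. The following is a minimal sketch, assuming the standard RuntimeMXBean from java.lang.management; the class name DedupFlagCheck and the helper hasFlag are made up for this example. It only sees options that were passed explicitly at startup, not defaults chosen by the JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class DedupFlagCheck {

    /** Returns true if the exact option string was passed to the JVM. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Options given to the JVM at startup, e.g. -Xmx256m or -XX:+UseG1GC.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("G1 requested: "
                + hasFlag(jvmArgs, "-XX:+UseG1GC"));
        System.out.println("String deduplication requested: "
                + hasFlag(jvmArgs, "-XX:+UseStringDeduplication"));
    }
}
```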

The JVM now also prints out what it does, so let's take a look:


[GC concurrent-string-deduplication, 4658.2K->0.0B(4658.2K), avg 99.6%, 0.0165023 secs]
  [Last Exec: 0.0165023 secs, Idle: 0.0953764 secs, Blocked: 0/0.0000000 secs]
   [Inspected:     119538]
     [Skipped:       0( 0.0%)]
     [Hashed:     119538(100.0%)]
     [Known:        0( 0.0%)]
     [New:       119538(100.0%)  4658.2K]
   [Deduplicated:    119538(100.0%)  4658.2K(100.0%)]
     [Young:       372( 0.3%)   14.5K( 0.3%)]
     [Old:       119166( 99.7%)  4643.8K( 99.7%)]
  [Total Exec: 4/0.0802259 secs, Idle: 4/0.6491928 secs, Blocked: 0/0.0000000 secs]
   [Inspected:     557503]
     [Skipped:       0( 0.0%)]
     [Hashed:     556191( 99.8%)]
     [Known:       903( 0.2%)]
     [New:       556600( 99.8%)   21.2M]
   [Deduplicated:    554727( 99.7%)   21.1M( 99.6%)]
     [Young:       1101( 0.2%)   43.0K( 0.2%)]
     [Old:       553626( 99.8%)   21.1M( 99.8%)]
  [Table]
   [Memory Usage: 81.1K]
   [Size: 2048, Min: 1024, Max: 16777216]
   [Entries: 2776, Load: 135.5%, Cached: 0, Added: 2776, Removed: 0]
   [Resize Count: 1, Shrink Threshold: 1365(66.7%), Grow Threshold: 4096(200.0%)]
   [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
   [Age Threshold: 3]
  [Queue]
   [Dropped: 0]

Conveniently, we don't need to add up the figures ourselves: the [Total Exec] section already provides the totals.

The output above shows that string deduplication ran, taking 16 ms to inspect about 120k strings.

This feature is new, which means it may not be fully vetted yet. The numbers can look different in real applications, especially where strings are used and passed around multiple times, so some strings may be skipped or may already have a hash code in place (as you probably know, a String's hash code is computed lazily).

In the above case, all strings were deduplicated, removing about 4.5 MB of data from memory.
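The figure can be checked against the first block of the log output. The short calculation below uses the two logged values, 4658.2K and 119538 deduplicated strings, and assumes that K in the log means 1024 bytes. The class name DedupStats and its helper methods are hypothetical.

```java
import java.util.Locale;

public class DedupStats {

    /** Converts kilobytes to megabytes, assuming 1 MB = 1024 KB. */
    static double kilobytesToMegabytes(double kilobytes) {
        return kilobytes / 1024.0;
    }

    /** Average number of bytes freed per deduplicated string. */
    static double bytesPerEntry(double kilobytes, long entries) {
        return kilobytes * 1024.0 / entries;
    }

    public static void main(String[] args) {
        double freedKilobytes = 4658.2; // from the "Deduplicated" line of the last execution
        long deduplicated = 119538;     // number of strings deduplicated in that execution

        System.out.println(String.format(Locale.ROOT, "Freed: %.2f MB",
                kilobytesToMegabytes(freedKilobytes)));
        System.out.println(String.format(Locale.ROOT, "Average per string: %.1f bytes",
                bytesPerEntry(freedKilobytes, deduplicated)));
    }
}
```

Under that assumption, the total comes to roughly 4.5 MB, or about 40 bytes for each duplicate char array that was dropped.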

The [Table] section gives statistics about the internal tracking table, and [Queue] lists how many deduplication requests were dropped because of load, which is part of the overhead-limiting mechanism.

So what's the difference between string deduplication and string interning? The two look similar, except that interning reuses the entire String instance, not just the character array.
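The distinction is visible from application code. In the minimal sketch below (the class name InternVsDedup and the sample value are made up for illustration), interning changes which instance a reference points to, which the == operator can observe. Deduplication, by contrast, only swaps the private backing array, so two deduplicated strings remain separate instances.

```java
public class InternVsDedup {

    /** Reference comparison: true only if both variables point to the same instance. */
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Two separate String instances with equal content.
        String a = new String("Germany");
        String b = new String("Germany");

        System.out.println(sameInstance(a, b));                   // false: distinct instances
        System.out.println(a.equals(b));                          // true: same content
        System.out.println(sameInstance(a.intern(), b.intern())); // true: one pooled instance
    }
}
```

Interning has to be requested explicitly at the right place in the code, whereas deduplication needs no code changes at all.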

The argument of the creators of JEP 192 (https://openjdk.java.net/jeps/192) is that developers often do not know where the right place to intern strings would be, or that this place is hidden inside frameworks. To intern effectively, you need some knowledge of where duplicates typically occur (country names, for example). String deduplication also benefits from duplicate strings across applications running in the same JVM, so it covers things like XML schemas, URLs and jar names, which one would normally not expect to appear multiple times.

String interning happens in the application thread, whereas deduplication is performed asynchronously and concurrently during garbage collection, so it adds no runtime overhead to the application. This also explains the Thread.sleep() in the code above. Without the sleep, there would be too much pressure on the GC, and string deduplication would not run at all. But this is only a problem for sample code like this. Real applications usually have a few milliseconds to spare for string deduplication to run.