Reversing strings in Java and the related character encoding issues

  • 2020-04-01 02:02:37
  • OfStack


public String reverse(char[] value) {
    // Swap each char in the first half with its mirror in the second half.
    for (int i = (value.length - 1) >> 1; i >= 0; i--) {
        char temp = value[i];
        value[i] = value[value.length - 1 - i];
        value[value.length - 1 - i] = temp;
    }
    return new String(value);
}

As an algorithm, this code has no problem. But if you look at the StringBuffer source code, you will see that its reverse method is written with more care. The source code is as follows:


public AbstractStringBuilder reverse() {
    boolean hasSurrogate = false;
    int n = count - 1;
    for (int j = (n-1) >> 1; j >= 0; --j) {
        char temp = value[j];
        char temp2 = value[n - j];
        if (!hasSurrogate) {
            hasSurrogate = (temp >= Character.MIN_SURROGATE && temp <= Character.MAX_SURROGATE)
                || (temp2 >= Character.MIN_SURROGATE && temp2 <= Character.MAX_SURROGATE);
        }
        value[j] = temp2;
        value[n - j] = temp;
    }
    if (hasSurrogate) {
        // Reverse back all valid surrogate pairs
        for (int i = 0; i < count - 1; i++) {
            char c2 = value[i];
            if (Character.isLowSurrogate(c2)) {
                char c1 = value[i + 1];
                if (Character.isHighSurrogate(c1)) {
                    value[i++] = c1;
                    value[i] = c2;
                }
            }
        }
    }
    return this;
}

This method is defined in AbstractStringBuilder, the parent class of StringBuffer, so its return type is AbstractStringBuilder. The subclass calls it as follows:

public synchronized StringBuffer reverse() {
    super.reverse();
    return this;
}

Looking at the method body, the basic idea in the source code is the same: traverse half of the string and swap each character with its mirror character. The difference is that, while swapping, the source also checks whether either character lies between Character.MIN_SURROGATE (\ud800) and Character.MAX_SURROGATE (\udfff). If the string contains any such character, it is traversed a second time from beginning to end: for each index i, if value[i] satisfies Character.isLowSurrogate() and value[i + 1] satisfies Character.isHighSurrogate(), the characters at i and i + 1 are swapped back. Some may wonder why this is needed: a Java char is already Unicode, so surely each char can hold a Chinese character?
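One way to see the reason is to run both approaches on a string that contains a character needing two chars. The sketch below is only an illustration: the class name SurrogateReverseDemo, the helper names, and the sample character U+20000 are assumptions chosen for the example, not part of the JDK source.

```java
class SurrogateReverseDemo {

    // The char-by-char swap from the start of this article, applied to a copy.
    static String naiveReverse(String s) {
        char[] value = s.toCharArray();
        for (int i = (value.length - 1) >> 1; i >= 0; i--) {
            char temp = value[i];
            value[i] = value[value.length - 1 - i];
            value[value.length - 1 - i] = temp;
        }
        return new String(value);
    }

    // Library reversal, which restores the order of surrogate pairs.
    static String libraryReverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // U+20000 is a CJK Extension B ideograph; in UTF-16 it is the pair \uD840 \uDC00.
        String s = "a\uD840\uDC00b";

        String naive = naiveReverse(s);
        String library = libraryReverse(s);

        // Naive: the low surrogate now comes first, so the pair is broken.
        System.out.println(Character.isLowSurrogate(naive.charAt(1)));   // true
        // Library: the pair is in high/low order and still decodes to U+20000.
        System.out.println(Integer.toHexString(library.codePointAt(1))); // 20000
    }
}
```

With the naive version the two halves of the character end up in the wrong order, which is exactly the case the second loop in the JDK source repairs.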
A complete Unicode character is called a code point, and a Java char is a code unit. String objects store Unicode text as UTF-16, so characters outside the Basic Multilingual Plane, such as Chinese characters from the large extended character sets, need two chars. This representation is called a surrogate pair: the first char is the high surrogate and the second is the low surrogate. The points to watch are as follows:
To determine whether a char belongs to the surrogate range, use Character.isHighSurrogate()/isLowSurrogate(). To get a complete code point from a high/low surrogate pair, use Character.toCodePoint()/codePointAt().
A code point may need one char or two, so CharSequence.length() does not tell you how many characters a String contains; use String.codePointCount()/Character.codePointCount() instead.
To locate the Nth character in a String, you cannot use N directly as the offset; the string has to be walked from the start, using String/Character.offsetByCodePoints().
To find the previous character from the current position, you cannot simply use offset--; use String.codePointBefore()/Character.codePointBefore(), or String/Character.offsetByCodePoints().
To find the next character from the current position, you cannot simply use offset++; determine the length of the current code point and add it, or use String/Character.offsetByCodePoints().
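
The points above can be sketched with the standard String and Character methods. This is a minimal illustration rather than a complete utility: the class name CodePointDemo and the helper names are hypothetical, and the sample string reuses U+20000 as an assumed example of a two-char character.

```java
class CodePointDemo {

    // Number of Unicode characters (code points), as opposed to s.length() chars.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Code point of the N-th character (0-based), found by walking from the start.
    static int nthCodePoint(String s, int n) {
        int index = s.offsetByCodePoints(0, n);
        return s.codePointAt(index);
    }

    // Index of the char that starts the character after the one at 'index'.
    static int nextIndex(String s, int index) {
        return index + Character.charCount(s.codePointAt(index));
    }

    // Index of the char that starts the character ending just before 'index'.
    static int previousIndex(String s, int index) {
        return index - Character.charCount(s.codePointBefore(index));
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00b"; // 'a', U+20000, 'b'

        System.out.println(s.length());                              // 4
        System.out.println(countCodePoints(s));                      // 3
        System.out.println(Integer.toHexString(nthCodePoint(s, 1))); // 20000
        System.out.println(nextIndex(s, 1));                         // 3
        System.out.println(previousIndex(s, 3));                     // 1

        // Combine a high and a low surrogate into one code point.
        int cp = Character.toCodePoint('\uD840', '\uDC00');
        System.out.println(Integer.toHexString(cp));                 // 20000
    }
}
```

Character.charCount() returns 1 or 2 depending on whether the code point needs a surrogate pair, which is what makes the manual offset arithmetic in nextIndex and previousIndex safe.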

