Detailed Explanation of String Type and Default Character Encoding in Java

  • 2021-07-24 10:57:56
  • OfStack

Why write this

I am writing this mainly because I need to get it off my chest: after a whole morning spent on this problem, my head was spinning.
Garbled Chinese text in Java programs has always puzzled programmers, and I am no exception. I ran into plenty of encoding pitfalls in earlier projects and meant to sort them out, but it always seemed like too much trouble. This time I could not put it off any longer and had to get to the bottom of it.

Encoding of String type

According to information found online, Java represents characters internally as Unicode, while the encoding a String uses when it is converted to or from bytes depends on the JVM's encoding and on the default character set of the local operating system. So I ran a test.
In Java, you can check the local encoding like this (but is it the JVM's or the OS's?):


// Gets the system property indicated by the specified key.
System.out.println(System.getProperty("file.encoding"));

The Javadoc comment says that this returns a system property, but I was unsure what "system" means here. As is well known, the default character encoding on most Chinese-language computers is GBK, and running chcp in CMD returns 936, which is the GBK code page.

Yet this statement printed UTF-8. I was running it in IDEA and had set the project encoding to UTF-8, so I can only guess that the statement actually returns the JVM's encoding, which follows the project encoding.
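To check that guess, the property can be printed next to the charset the JVM actually falls back to. This is only a minimal sketch: the class name DefaultEncodingCheck is made up for illustration, and sun.jnu.encoding is an implementation-specific property that may be missing on some JVMs.

```java
import java.nio.charset.Charset;

class DefaultEncodingCheck {
    /** Returns the charset the JVM uses when no charset is given explicitly. */
    static Charset jvmDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // The property examined in the text above.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // The charset used by new String(byte[]) and String.getBytes().
        System.out.println("defaultCharset() = " + jvmDefault());
        // Implementation-specific; prints null if the JVM does not define it.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}
```

Running the class once from the IDE and once from CMD (optionally with a -Dfile.encoding=GBK option on the java command line) should show whether the value comes from the launch settings or from the operating system.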

Back to the main question: what is the default encoding of the String type? Consider the following test:


/* Test the default encoding of the String type
*/

// Create the test string with String's parameterized constructor
String str = new String("hhhh ty Mental retardation %shfu Touch Shufen 10 Points uif Oral administration NSF Black ");
// 1. Get str's bytes as GBK, then build a String from them with the default encoding
System.out.println(new String(str.getBytes("GBK")));
// 2. Get str's bytes as UTF-8, then build a String from them with the default encoding
System.out.println(new String(str.getBytes("UTF-8")));

Let's look at the results:

// 1.
hhhh ty% shfuuifNSFi, u, u δ δ hhhh ty mentally retarded% shfu touch Shufen 10 points uif oral NSF black i bird reply amount u hair for what, u room fiance fiance
// 2.
hhhh ty mental retardation% shfu touch Shufen 10 points uif oral NSF black i bird reply u hair for u room fiance fiance

Output 2 is intact while output 1 is garbled, so the encoding that new String(byte[]) used to decode the bytes is the same as the local encoding we printed earlier (UTF-8). We therefore conclude that the default encoding used by the String type depends on the local encoding.
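The same experiment can be repeated with explicit charsets so that the result no longer depends on the default. The following is a minimal sketch rather than the original test: the class and method names are hypothetical, and the sample text is written with Unicode escapes so that the source file's own encoding cannot interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    /** Encodes text with one charset, then decodes the bytes with another. */
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587abc"; // two Chinese characters followed by "abc"

        // Same charset on both sides: the text survives.
        System.out.println(text.equals(
                recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true

        // Mismatched charsets: each of the 9 UTF-8 bytes becomes one character.
        String garbled = recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 9

        // GBK is not guaranteed to exist on every JVM, so check before using it.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println(text.equals(recode(text, gbk, gbk)));
        }
    }
}
```

Naming the charset on both the encoding and the decoding side is what makes the outcome independent of the IDE or operating system settings.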

The String.getBytes() Method

In most cases we transfer data as byte arrays rather than as String objects, and we usually call String.getBytes() to convert a string into a byte array. Does this conversion involve encoding? I looked closely at the source code of String.getBytes(), which comes in a no-argument form and a parameterized form:


// 1. No-argument getBytes() method
  public byte[] getBytes() {
    // Stepping into encode() shows that the default character encoding is used
    return StringCoding.encode(value, 0, value.length);
  }

// 2. Parameterized getBytes(String charsetName) method
  public byte[] getBytes(String charsetName)
      throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    // Stepping further in shows that the named charset is used to produce the byte array; if the named charset is not supported, UnsupportedEncodingException is thrown
    return StringCoding.encode(charsetName, value, 0, value.length);
  }

I want to stress again that because I changed the project encoding in the IDE, the default encoding seen by the running program also became UTF-8, so all of the experiments above reflect the encoding that the IDE set for the project.
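The behaviour of the getBytes() variants discussed above can be observed directly. This is a small sketch with hypothetical names; "no-such-charset" is a deliberately invalid charset name chosen only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    /** Returns true if getBytes(String) rejects the given charset name. */
    static boolean isRejected(String charsetName) {
        try {
            "abc".getBytes(charsetName);
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d"; // one Chinese character

        // The same character needs a different number of bytes per charset.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 2

        // No argument: the result depends on the JVM's default charset.
        System.out.println(text.getBytes().length);

        // An unknown name is rejected rather than silently replaced.
        System.out.println(isRejected("no-such-charset")); // true
    }
}
```

The getBytes(Charset) overload used here takes a constant from StandardCharsets, which avoids both the checked exception and any typo in the charset name.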

Interconversion between ByteBuffer and byte Arrays

In NIO, ByteBuffer is generally used as the buffer for data, but sometimes we only have a byte[] array, so we need to convert between the two:


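// byte[] ------> ByteBuffer
byte[] bytes = new byte[1024];
ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);

// ByteBuffer ----> byte[]
// array() is an instance method and returns the whole backing array;
// it only works for array-backed buffers (check hasArray() first)
byte[] backing = byteBuffer.array();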
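One detail of the conversion above is that array() hands back the entire backing array, not just the bytes that were written. A minimal sketch of copying only the readable part follows; the class name and the remainingBytes helper are hypothetical, not part of the NIO API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ByteBufferConversion {
    /** Copies only the readable bytes (position..limit) without moving the caller's position. */
    static byte[] remainingBytes(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate(); // shares content, has its own position
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.put("hi".getBytes(StandardCharsets.UTF_8));
        buffer.flip(); // switch from writing to reading

        System.out.println(buffer.array().length);         // 16
        System.out.println(remainingBytes(buffer).length); // 2

        // Decoding the buffer back to text also needs an explicit charset.
        System.out.println(StandardCharsets.UTF_8.decode(buffer.duplicate())); // hi
    }
}
```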

Summary

To sum up:

  • The encoding of the local JVM is related to the default character encoding of the local operating system, but the JVM's encoding can be changed.
  • Java programs use the Unicode character set by default, and the encoding used for String values declared in the program is related to the JVM's encoding.
  • The default encoding of the String.getBytes() method is the JVM's encoding; the method also accepts a charset name as a parameter, in which case the named charset is used instead.
  • Because Java code uses the Unicode character set, a String can be converted to byte arrays in many different encodings, but the conversion is not guaranteed to be lossless. The string displays correctly only if it is decoded with the same encoding that was used to encode it.
  • The byte stream of a file is determined by the file's encoding, so pay attention to encoding and decoding when reading and writing files with different encodings.
  • A buffer created with ByteBuffer can be converted to and from a byte array, but the ByteBuffer must be large enough to hold the whole byte array.

Closing remarks

Working through all of this made things suddenly clear. In most cases, the root cause of garbled Chinese text is either that the encoding and decoding sides use different charsets, or that data is lost when converting between different encodings. We should therefore standardize how we encode and decode. After all, some conversions are irreversible.
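As an illustration of an irreversible conversion, the sketch below pushes text through ISO-8859-1, a charset that cannot represent Chinese characters. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class LossyConversion {
    /** Encodes text as ISO-8859-1 and decodes it again with the same charset. */
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // ASCII text fits into ISO-8859-1 and survives the round trip.
        System.out.println(throughLatin1("abc"));          // abc

        // Each unmappable character is replaced by '?' while encoding,
        // so the original characters cannot be recovered afterwards.
        System.out.println(throughLatin1("\u4e2d\u6587")); // ??
    }
}
```

Here both sides use the same charset and the text is still lost, which shows that matching the decoder to the encoder is necessary but not sufficient: the chosen charset must also be able to represent every character in the string.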

