Research on Java and related character set encoding issues

2020-04-01 03:32:56
OfStack
The following article describes and discusses these problems, using the two-character string "中文" ("Chinese") as the running example: its GB2312 code is "d6d0 cec4", its Unicode code is "4e2d 6587", and its UTF-8 code is "e4b8ad e69687". (Note that "中文" has no ISO-8859-1 code, but it can still be "represented" with ISO-8859-1.)

1. Basic encoding knowledge

One of the earliest encodings was ISO-8859-1, a single-byte extension of ASCII. To make it easier to represent other languages, a number of standard encodings gradually emerged. The important ones are as follows:

1. ISO-8859-1

It is a single-byte encoding, so it can represent at most 256 characters, with code values 0-255. It is used for English and other Western European languages. For example, the letter a is encoded as 0x61 = 97.

Obviously, ISO-8859-1 covers a narrow range of characters and cannot represent Chinese characters. However, because it is a single-byte encoding, matching the most basic unit of storage in a computer, it is still often used to carry data in other encodings, and many protocols use it by default. For example, although the example string "中文" has no ISO-8859-1 encoding, its GB2312 encoding is the two characters "d6d0 cec4"; viewed as ISO-8859-1, these become the four bytes "d6 d0 ce c4" (in storage, data is handled byte by byte anyway). In UTF-8 it is the six bytes "e4 b8 ad e6 96 87". Obviously, this kind of representation relies on another encoding underneath.
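The "carrier" role described above can be sketched in Java. This is a minimal illustration, not part of the original article: the class name Latin1Carrier and its two helpers are hypothetical, and it assumes the JDK in use ships the GBK charset. The example string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Carrier {
    // Decode arbitrary bytes as ISO-8859-1: every byte becomes one char with the same value.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encode the carrier string back to ISO-8859-1 to recover the original bytes.
    static byte[] unwrap(String carried) {
        return carried.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(gbk);

        String carried = carry(gbkBytes);
        System.out.println(carried.length()); // 4: one char per byte
        for (char c : carried.toCharArray()) {
            System.out.printf("%02x ", (int) c); // d6 d0 ce c4
        }
        System.out.println();

        // No information is lost: the bytes come back unchanged.
        System.out.println(Arrays.equals(gbkBytes, unwrap(carried))); // true
    }
}
```

The carrier string is not readable text; it only becomes meaningful again once the bytes are decoded with the encoding that produced them.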

2. GB2312 / GBK

This is the Chinese national standard encoding, designed for Chinese characters. Chinese characters are encoded in two bytes, while English letters are encoded exactly as in ISO-8859-1 (that is, it is compatible with the ASCII range of ISO-8859-1). GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters; GBK is backward compatible with GB2312.
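A small sketch of these properties, assuming the JDK provides both the "GBK" and "GB2312" charsets. The class name GbkVsGb2312 and the choice of U+570B as a traditional-form test character are illustrative assumptions, not taken from the article.

```java
import java.nio.charset.Charset;

class GbkVsGb2312 {
    // Number of bytes the string occupies in the named charset.
    static int encodedLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String mixed = "a\u4e2d";      // Latin letter a followed by U+4E2D
        String traditional = "\u570b"; // U+570B, a traditional-form character

        // One byte for the letter plus two for the Chinese character.
        System.out.println(encodedLength(mixed, "GBK"));       // 3
        System.out.println(encodedLength(traditional, "GBK")); // 2

        // GB2312 has no code for U+570B, so Java substitutes a single '?' (0x3f).
        byte[] fallback = traditional.getBytes(Charset.forName("GB2312"));
        System.out.println(fallback.length == 1 && fallback[0] == 0x3f); // true
    }
}
```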

3. Unicode

This is the most unified encoding: it can represent the characters of all languages and, in the form discussed here, is a fixed-length two-byte (or four-byte) encoding, English letters included. It is therefore not byte-compatible with ISO-8859-1 or with any other encoding. However, compared with ISO-8859-1, the Unicode code simply adds a 0 byte in front; for example, the letter a is "00 61".

Note that fixed-length encodings are easy for a computer to process (GB2312/GBK is not fixed-length), and Unicode can represent all characters, so much software, Java included, uses Unicode for internal processing.
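The in-memory values mentioned above can be inspected directly. The sketch below uses a hypothetical helper, hexUnits, that prints each Java char as four hex digits. One caveat worth stating: a Java char is a 16-bit UTF-16 code unit, so characters outside the two-byte range occupy two chars; the examples here all fit in one.

```java
import java.nio.charset.StandardCharsets;

class CodeUnits {
    // Render each 16-bit char of the string as four hex digits.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexUnits("a"));            // 0061
        System.out.println(hexUnits("\u4e2d\u6587")); // 4e2d 6587

        // As two-byte big-endian data, the letter a is a zero byte followed by 0x61.
        byte[] be = "a".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(be.length + " " + be[0] + " " + be[1]); // 2 0 97
    }
}
```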

4. UTF-8

Unicode is not compatible with ISO-8859-1 and tends to take up more space: even English letters need two bytes. This makes Unicode inconvenient for transmission and storage. The result is the UTF encodings, of which UTF-8 is the most common. UTF-8 is compatible with ASCII (the lower half of ISO-8859-1) and can represent characters in all languages, but it is a variable-length encoding: each character takes 1 to 4 bytes under the current standard (the original design allowed up to 6). UTF-8 also has a simple built-in verification property, because lead and continuation bytes follow fixed bit patterns. Generally, English letters take one byte and Chinese characters take three.

Note that UTF-8 saves space only relative to fixed-length Unicode; if the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, although UTF-8 uses three bytes per Chinese character, it still takes less space than two-byte Unicode even for Chinese web pages, because such pages contain many English characters.
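The size trade-off can be checked with a few byte counts. The following is a rough sketch: the class name EncodedSize and the tiny HTML fragment are made up for illustration, UTF-16BE stands in for "two-byte Unicode", and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes needed to store the string in the given charset.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        String chinese = "\u4e2d\u6587";             // two Chinese characters
        String page = "<p>\u4e2d\u6587</p>";         // 7 ASCII characters plus 2 Chinese

        // Pure Chinese text: GBK and two-byte Unicode tie, UTF-8 is largest.
        System.out.println(size(chinese, gbk));                         // 4
        System.out.println(size(chinese, StandardCharsets.UTF_16BE));   // 4
        System.out.println(size(chinese, StandardCharsets.UTF_8));      // 6

        // Markup-heavy text: UTF-8 beats two-byte Unicode.
        System.out.println(size(page, gbk));                            // 11
        System.out.println(size(page, StandardCharsets.UTF_8));         // 13
        System.out.println(size(page, StandardCharsets.UTF_16BE));      // 18
    }
}
```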

2. Character handling in Java

When writing a Java application, a number of character set encodings are involved; some only need to be configured correctly, while others require explicit processing.

1. getBytes(charset)

This is a standard Java string method that encodes the characters of the string using charset and returns the result as a byte array. Note that Java always holds strings in memory in Unicode. For example, the example string "中文" is stored as "4e2d 6587"; if charset is "GBK", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If it is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
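The three cases above can be reproduced with a short program. This is a sketch under a few assumptions: the class name GetBytesDemo and its helpers are hypothetical, canonical charset names ("UTF-8", "ISO-8859-1") are used in place of the shorthand in the text, and the Charset overload of getBytes is used so that no checked exception has to be handled.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Format a byte array as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the string in the named charset and return the bytes as hex.
    static String encode(String s, String charsetName) {
        return hex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587"; // the example string, written with Unicode escapes
        System.out.println(encode(s, "GBK"));        // d6 d0 ce c4
        System.out.println(encode(s, "UTF-8"));      // e4 b8 ad e6 96 87
        System.out.println(encode(s, "ISO-8859-1")); // 3f 3f
    }
}
```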

2. new String(bytes, charset)

This is another standard Java string operation. It does the opposite of the previous one: it interprets a byte array according to charset and converts it into a string stored as Unicode. Continuing the getBytes example above, decoding the "GBK" and "utf8" bytes with the matching charset gives the correct result "4e2d 6587" in both cases, but the ISO-8859-1 bytes decode to "003f 003f" (two question marks).

Because UTF-8 can represent/encode all characters, new String(str.getBytes("utf8"), "utf8").equals(str) holds; that is, the conversion is fully reversible.
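A quick way to see which charsets are reversible for a given string is to encode and decode it with the same charset and compare. The helper names below (RoundTrip, roundTrip, survives) are hypothetical, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class RoundTrip {
    // Encode with the named charset, then decode the bytes with the same charset.
    static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(s.getBytes(cs), cs);
    }

    // True when the string comes back unchanged.
    static boolean survives(String s, String charsetName) {
        return roundTrip(s, charsetName).equals(s);
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587";
        System.out.println(survives(s, "UTF-8"));       // true
        System.out.println(survives(s, "GBK"));         // true
        System.out.println(survives(s, "ISO-8859-1"));  // false
        System.out.println(roundTrip(s, "ISO-8859-1")); // ??
    }
}
```

The ISO-8859-1 case fails at the encoding step, not the decoding step: the question marks are already in the byte array before new String ever sees it.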

3. setCharacterEncoding()

This function sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is specified, getParameter() returns correctly decoded strings directly. If it is not specified, ISO-8859-1 is used by default and the result needs further processing; see "form input" below. Note that setCharacterEncoding() must be called before any getParameter() call: this method must be called prior to reading request parameters or reading input using getReader().
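The "further processing" needed when the encoding was not specified can be simulated without a servlet container. The sketch below is only an illustration: ParamRecovery, misdecode, and recover are hypothetical names, no servlet API is involved, and it assumes the form content was actually submitted as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParamRecovery {
    // Simulates a container that decodes the submitted bytes as ISO-8859-1 by default.
    static String misdecode(byte[] submitted) {
        return new String(submitted, StandardCharsets.ISO_8859_1);
    }

    // Turn the mis-decoded string back into bytes, then decode with the real encoding.
    static String recover(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        // Assumption: the browser sent the example string as UTF-8.
        byte[] submitted = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        String wrong = misdecode(submitted);
        System.out.println(wrong.length()); // 6: one garbage char per submitted byte

        String fixed = recover(wrong, StandardCharsets.UTF_8);
        System.out.println(fixed.equals("\u4e2d\u6587")); // true
    }
}
```

Calling setCharacterEncoding() with the right charset before the first getParameter() makes this manual step unnecessary, which is why the ordering requirement matters.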