Details on the use of Java character encoding

2020-04-01 01:54:56
OfStack
1. What is character encoding?

      A character is the general term for letters and symbols, including written characters, graphic symbols, mathematical symbols and so on. A collection of abstract characters is a character set (charset). Character sets were created to make information easier to transmit and store. Commonly used character sets include ASCII, ISO 8859-1, Unicode and GB2312.
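
      In Java, the character sets listed above can be looked up through java.nio.charset. The following is a minimal sketch; the class name CharsetLookup and the helper supported are illustrative names, not part of the original article. GB2312 is an optional charset, so its availability on a given JVM is an assumption that the code checks at run time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetLookup {
    // Returns true if this JVM can resolve a charset with the given name.
    static boolean supported(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // These constants are guaranteed to exist on every Java platform.
        System.out.println(StandardCharsets.US_ASCII.name());
        System.out.println(StandardCharsets.ISO_8859_1.name());
        System.out.println(StandardCharsets.UTF_8.name());

        // GB2312 is optional, so check before calling Charset.forName.
        if (supported("GB2312")) {
            System.out.println(Charset.forName("GB2312").name());
        } else {
            System.out.println("GB2312 is not available on this JVM");
        }
    }
}
```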

2. What are the characteristics of the various character sets?

ASCII:

      ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet.

      Contains: control characters (carriage return, backspace, line feed and so on) and printable characters (uppercase and lowercase English letters, Arabic numerals and symbols).

      Technical features: 7 bits represent one character, for a total of 128 characters.

      Disadvantages: it can only represent English. The additional letters and symbols used by Western European, East Asian and Latin American languages cannot be represented.
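
      The 7-bit limit is easy to observe from Java. The sketch below uses hypothetical names (AsciiDemo, toAscii, isPureAscii) and relies on the documented behaviour of String.getBytes(Charset), which substitutes the charset's replacement byte for characters it cannot map. For US-ASCII that byte is '?' (0x3F).

```java
import java.nio.charset.StandardCharsets;

final class AsciiDemo {
    // Encodes text as US-ASCII; unmappable characters become '?' (0x3F).
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // True when every char fits in the 7-bit range 0x00-0x7F.
    static boolean isPureAscii(String text) {
        return text.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String english = "Java";
        String french = "caf\u00e9"; // "cafe" with an acute accent on the e

        System.out.println(isPureAscii(english)); // true
        System.out.println(isPureAscii(french));  // false

        // The accented letter is lost: it is replaced by '?'.
        byte[] bytes = toAscii(french);
        System.out.println((char) bytes[3]);      // ?
    }
}
```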

ISO 8859-1:

      ISO 8859-1, officially ISO/IEC 8859-1:1998, also known as Latin-1 or "Western European", is the first part of ISO/IEC 8859, the family of 8-bit character sets published by the International Organization for Standardization.

      Based on ASCII, it adds 96 letters and symbols in the previously empty 0xA0-0xFF range for Latin-alphabet languages that need additional symbols. The first edition was published as ISO 8859-1:1987.

      Contains: everything in ASCII, plus the additional letters and symbols used by Western European languages.

      Technical features: 8 bits for one character.
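
      A short Java sketch of the one-byte-per-character property follows. The names Latin1Demo, toLatin1 and fitsLatin1 are made up for this illustration. It encodes an accented letter, which fits in the 0xA0-0xFF range, and uses CharsetEncoder.canEncode to show that a Chinese character does not fit.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Demo {
    // Encodes text as ISO 8859-1, one byte per character.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True when every character of the text exists in ISO 8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        byte[] eAcute = toLatin1("\u00e9"); // e with acute accent
        System.out.println(eAcute.length);                         // 1
        System.out.println(Integer.toHexString(eAcute[0] & 0xFF)); // e9

        System.out.println(fitsLatin1("caf\u00e9")); // true
        System.out.println(fitsLatin1("\u4e2d"));    // false (a Chinese character)
    }
}
```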

Unicode:

      Unicode is a character coding system developed by the Unicode Consortium that supports the exchange, processing and display of written text in the world's languages. Its repertoire is kept in step with the Universal Multiple-Octet Coded Character Set (UCS) defined by ISO/IEC 10646. Version 1.0 was published in 1991, and Unicode 4.1.0 was released on 31 March 2005; newer versions have followed since.

      Technical features: Unicode was originally designed as a 16-bit encoding with 2 bytes per character, and Java's char type is still a 16-bit unit. The code space has since been extended to U+0000 through U+10FFFF. The Unicode code point of a character is fixed. In actual transmission, however, platforms differ in design and saving space matters, so Unicode is serialized in different ways. Each such serialization is called a Unicode Transformation Format (UTF). For example, if a file contains only 7-bit ASCII characters, transmitting every character as 2 bytes wastes half the space. In this case UTF-8 is a better choice: it is a variable-length encoding that still represents each 7-bit ASCII character in a single byte whose high bit is 0. Other characters are converted by a fixed algorithm into multi-byte sequences: 2 or 3 bytes for the rest of the Basic Multilingual Plane and 4 bytes for supplementary characters. The leading bits of each byte (0, 110, 1110, 11110 or 10) mark it as a single-byte character, the first byte of a multi-byte sequence, or a continuation byte.
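
      The variable-length behaviour of UTF-8 can be checked with a few lines of Java. This is a sketch with illustrative names (Utf8Lengths, utf8Length, utf16Length, hex). It compares the number of bytes each sample character takes in UTF-8 and in UTF-16; characters in the Basic Multilingual Plane take 1 to 3 bytes in UTF-8, and a supplementary character takes 4.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Formats the UTF-8 bytes of the text as space-separated hex pairs.
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII letter
            "\u00e9",       // U+00E9, e with acute accent
            "\u4e2d",       // U+4E2D, a Chinese character
            "\uD83D\uDE00"  // U+1F600, a supplementary character (surrogate pair)
        };
        for (String s : samples) {
            System.out.println(hex(s) + " -> UTF-8: " + utf8Length(s)
                    + " bytes, UTF-16: " + utf16Length(s) + " bytes");
        }
    }
}
```

      For U+4E2D the UTF-8 bytes are E4 B8 AD: the first byte starts with the bits 1110, announcing a 3-byte sequence, and the other two start with 10, marking them as continuation bytes.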

GB2312:

      GB 2312, or GB 2312-80, is the simplified Chinese character set defined by a national standard of China. Its full name is the Basic Set of the Chinese Coded Character Set for Information Interchange, and it is also known as GB0. It was issued by China's General Administration of Standards and took effect on May 1, 1981. GB 2312 is used in mainland China and also in places such as Singapore. Almost all Chinese-language systems and internationalized software in mainland China support it.

      Contents: 6,763 Chinese characters, of which 3,755 are first-level characters and 3,008 are second-level characters. It also contains 682 other characters, including the Latin alphabet, the Greek alphabet, Japanese hiragana and katakana, and the Russian Cyrillic alphabet.

      Technical features: each Chinese character and symbol is represented in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte uses 0xA1-0xF7 (row numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (cell numbers 01-94 plus 0xA0). Because the first-level Chinese characters start at row 16, the high byte of a Chinese character ranges over 0xB0-0xF7 and the low byte over 0xA1-0xFE, which gives 72*94=6768 code points. Five of them, 0xD7FA-0xD7FE, are unassigned.
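
      The byte ranges above follow from simple arithmetic, sketched here in Java. The class and method names (Gb2312Ranges, highByte, lowByte, hanziCodePoints) are hypothetical, and the sketch assumes the usual GB 2312 convention that a row or cell number is turned into a byte by adding 0xA0. The comparison against the JDK's own GB2312 charset only runs if that optional charset is installed.

```java
import java.nio.charset.Charset;

final class Gb2312Ranges {
    // High byte for a row number (01-87): row + 0xA0.
    static int highByte(int row) {
        return row + 0xA0;
    }

    // Low byte for a cell number (01-94): cell + 0xA0.
    static int lowByte(int cell) {
        return cell + 0xA0;
    }

    // Code points in the Chinese-character area, 0xB0-0xF7 by 0xA1-0xFE.
    static int hanziCodePoints() {
        int rows = 0xF7 - 0xB0 + 1;  // 72 rows
        int cells = 0xFE - 0xA1 + 1; // 94 cells per row
        return rows * cells;
    }

    public static void main(String[] args) {
        // Row 16, cell 1 is the first first-level Chinese character.
        System.out.printf("%02X %02X%n", highByte(16), lowByte(1)); // B0 A1

        // 6768 code points, of which 6763 are assigned.
        System.out.println(hanziCodePoints());        // 6768
        System.out.println(hanziCodePoints() - 6763); // 5

        // Optional cross-check with the JDK, if GB2312 is available.
        if (Charset.isSupported("GB2312")) {
            byte[] b = "\u554a".getBytes(Charset.forName("GB2312")); // U+554A
            System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF);
        }
    }
}
```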