How many bytes does a character occupy in the Java language?

  • 2021-07-16 02:37:35
  • OfStack

It helps to first distinguish the internal encoding ("inner code") from the external encoding ("outer code").

The internal encoding is the character encoding a program uses internally, in particular the in-memory representation a language uses for its char or String type.
The external encoding is the character encoding used when the program exchanges data with the outside world. "External" is relative to "internal": anything other than the in-memory encoding of char or String counts as external — for example, a serialized char or String, external files, or command-line arguments.

The Java Language Specification stipulates that Java's char type is a UTF-16 code unit, i.e. each char is 16 bits (2 bytes):

char, whose values are 16-bit unsigned integers representing UTF-16 code units (§3.1).

A String, in turn, is a sequence of UTF-16 code units:

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

Thus, Java specifies UTF-16 as the internal encoding of characters — or at least guarantees that users cannot observe any non-UTF-16 encoding being used inside String.
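This can be observed directly from the standard library. A minimal sketch (the class name is illustrative):

```java
// Demonstrates that char is a 16-bit (2-byte) UTF-16 code unit.
public class CharSize {
    public static void main(String[] args) {
        System.out.println(Character.SIZE);  // 16 (bits per char)
        System.out.println(Character.BYTES); // 2  (bytes per char)
        char c = '\u5B57'; // the character "字", which lies in the BMP
        System.out.println((int) c);         // 23383, its UTF-16 code unit value
    }
}
```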

Another example:

The serialization of char and String in the Java standard library uses (modified) UTF-8 as the external encoding, and string constants and symbol names in Java class files are likewise specified to be encoded in modified UTF-8. This was probably a trade-off the designers made at the time between runtime time efficiency (UTF-16, fixed-length for the common case) and external storage space efficiency (UTF-8, variable-length).

First of all, what exactly does "a character" mean here?

If by "character" you mean char in Java, then the answer is simple: 16 bits, i.e. 2 bytes.

If by "character" you mean the abstract characters we see with our eyes, then the question has no answer on its own.

Specifically, it is meaningless to ask how many bytes a character occupies without specifying an encoding.

It's like having the abstract integer 42: how many bytes does it take? That depends on whether you store it in a byte, short, int, or long. A byte takes 1 byte, a short 2 bytes, an int 4 bytes, and a long 8 bytes (in Java these sizes are fixed). Of course, a byte has so few bits that some numbers cannot fit in it at all — 256, for example, cannot be stored in a byte.
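The analogy can be sketched in a few lines (the class name is illustrative):

```java
// The same abstract integer occupies different amounts of storage
// depending on the type that holds it.
public class IntSizes {
    public static void main(String[] args) {
        System.out.println(Byte.BYTES);    // 1
        System.out.println(Short.BYTES);   // 2
        System.out.println(Integer.BYTES); // 4
        System.out.println(Long.BYTES);    // 8
        // 256 does not fit in a byte: the narrowing cast discards high bits.
        System.out.println((byte) 256);    // 0
    }
}
```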

Characters are the same: before you can ask how many bytes one takes, you must first fix the encoding.

The same character may occupy a different number of bytes under different encodings.

Take the Chinese character "字" as an example: it occupies 2 bytes under GBK, 2 bytes under UTF-16, 3 bytes under UTF-8, and 4 bytes under UTF-32.
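These counts can be checked with getBytes. A minimal sketch, assuming the JRE ships the GBK and UTF-32 charsets (standard JDKs do); "字" is written as \u5B57 to keep the source ASCII-safe:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Encodes the same character under several charsets and prints the byte counts.
public class EncodedLengths {
    public static void main(String[] args) {
        String s = "\u5B57"; // the character "字"
        System.out.println(s.getBytes(Charset.forName("GBK")).length);    // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 4
    }
}
```

Note that UTF_16BE is used rather than UTF_16, because the latter prepends a 2-byte byte-order mark when encoding.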

Different characters may occupy a different number of bytes under the same encoding.

The character "字" occupies 3 bytes under UTF-8, while "A" occupies only 1 byte, because UTF-8 is a variable-length encoding.

char in Java is essentially a UTF-16 code unit, and UTF-16 itself is also a variable-length encoding: a character occupies either 2 or 4 bytes.

If an abstract character occupies 4 bytes under UTF-16 — that is, it requires a surrogate pair — it obviously cannot fit in a single char. In other words, a char can only hold characters that occupy 2 bytes under UTF-16: those in the Basic Multilingual Plane.
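A minimal sketch of this, using the emoji U+1F600 (😀), which lies outside the BMP and is written here as its surrogate pair \uD83D\uDE00:

```java
// A supplementary character needs a surrogate pair, so it occupies
// two chars (code units) in a String but is still one code point.
public class SurrogatePair {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, outside the BMP
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```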

The getBytes method actually performs an encoding conversion. You should explicitly pass a parameter specifying the target encoding; otherwise it converts using the platform's default encoding.

If new String("字").getBytes().length returns 3, that tells you the default encoding is UTF-8. If you explicitly pass a parameter instead, e.g. new String("字").getBytes("GBK").length, it returns 2.

You can set the default encoding when starting the JVM.

Assuming your class is called Main, you can set the default encoding with the file.encoding parameter when launching it from the command line, for example: java -Dfile.encoding=GBK Main. Now the no-argument getBytes() uses GBK, so new String("字").getBytes().length returns 2. Of course, if you explicitly specify the encoding, new String("字").getBytes("UTF-8").length still returns 3.
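A minimal sketch for inspecting this, using the Main class name assumed above. One caveat worth knowing: since JDK 18 (JEP 400), the default charset is UTF-8 regardless of the OS locale, so the GBK experiment only behaves as described on older JDKs:

```java
import java.nio.charset.Charset;

// Prints the JVM's default charset (what no-argument getBytes() uses),
// then the byte counts under the default and an explicit encoding.
// Try running with: java -Dfile.encoding=GBK Main  (on JDKs before 18)
public class Main {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset());
        System.out.println("\u5B57".getBytes().length);        // varies with default
        System.out.println("\u5B57".getBytes("UTF-8").length); // always 3
    }
}
```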

Otherwise, the default encoding of the operating-system environment is used.

Usually the default is GBK on (Chinese-locale) Windows, and UTF-8 on Linux and macOS. One thing to note, however: when running under an IDE on Windows, such as Eclipse, if your project's default encoding is UTF-8, the IDE adds the aforementioned -Dfile.encoding=UTF-8 parameter when launching your program. In that case the default encoding is UTF-8 rather than GBK, even though you are on Windows.

Because it is affected by startup parameters and the operating-system environment, the no-argument getBytes method is usually not recommended. Explicitly specifying the charset gives stable, predictable behavior.
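A minimal sketch of the recommended pattern (the class name is illustrative). Passing a StandardCharsets constant avoids both the unstable platform default and the checked UnsupportedEncodingException thrown by the String-argument overload:

```java
import java.nio.charset.StandardCharsets;

// Always name the charset explicitly for stable, portable results.
public class StableBytes {
    public static void main(String[] args) {
        byte[] utf8 = "\u5B57".getBytes(StandardCharsets.UTF_8); // "字"
        System.out.println(utf8.length); // 3, on any platform and any JVM
    }
}
```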
