In-depth analysis of Python's Chinese garbled-text problem

  • 2020-04-02 09:27:14
  • OfStack

In this article, every problem is explained using the character '哈' (ha) as the example. Its various encodings are:
1. Unicode (UTF-16): 54C8;
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
1. str and unicode in Python

Chinese encoding in Python has always been a huge headache, constantly throwing decode and encode exceptions. What exactly are str and unicode in Python?

In Python, unicode usually refers to unicode objects; for example, '哈哈' as a unicode object is

u'\u54c8\u54c8'

str, on the other hand, is a byte array representing the encoded storage form of a unicode object, encoded as utf-8, gbk, cp936, gb2312, and so on. Here it is only a byte stream with no other meaning; to make the byte stream display something meaningful, you must decode it with the correct encoding.
For example:

(Figure: python str and unicode, http://files.jb51.net/upload/201103/20110313135227531.jpg)
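The same split is easiest to see in modern Python 3, where the unicode type is `str` and the byte array is `bytes`. A minimal sketch (Python 3 syntax, not the Python 2 code from the screenshots above):

```python
# -*- coding: utf-8 -*-
# Python 3 equivalent of the str/unicode split: str holds code points,
# bytes holds the encoded storage format.

ha = '\u54c8'                      # the character 哈 (U+54C8)
utf8_bytes = ha.encode('utf-8')    # storage format: UTF-8 byte array
gbk_bytes = ha.encode('gbk')       # storage format: GBK byte array

print(utf8_bytes)                  # b'\xe5\x93\x88'
print(gbk_bytes)                   # b'\xb9\xfe'
print(utf8_bytes.decode('utf-8'))  # decoding with the right codec recovers 哈
```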

The unicode object 哈哈, encoded as utf-8, becomes the str s_utf8. s_utf8 is just the byte array '\xe5\x93\x88\xe5\x93\x88', nothing more. If you expect the print statement to display it as 哈哈, you will be disappointed. Why?

Because the print statement simply hands its output to the operating system, and the operating system interprets the incoming byte stream using the system's own encoding. That explains why the utf-8 string '哈哈' prints as mojibake on a Chinese Windows console: the bytes '\xe5\x93\x88\xe5\x93\x88' are interpreted as GBK and display as the garbage characters 鍝堝搱. To repeat: a str records a byte array in some encoded storage format, and whether it displays correctly when written to a file or printed depends entirely on which encoding is used to decode it.

One more note on print: when a unicode object is passed to print, it is first converted internally to the default encoding (this is just a guess).

2. Converting between str and unicode

str and unicode objects are converted with decode and encode, used as follows:

(Figure: decode and encode demonstration, http://files.jb51.net/upload/201103/20110313135230491.jpg)

Converting GBK '哈哈' to unicode and then to utf-8:
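The same round trip can be written in Python 3, where the decode and encode steps are explicit on bytes and str (a sketch):

```python
# Round trip: GBK bytes -> unicode -> UTF-8 bytes.
gbk_data = b'\xb9\xfe\xb9\xfe'    # '哈哈' encoded as GBK
u = gbk_data.decode('gbk')        # decode: byte array -> unicode string
utf8_data = u.encode('utf-8')     # encode: unicode string -> UTF-8 bytes

print(utf8_data)                  # b'\xe5\x93\x88\xe5\x93\x88'
```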

3. setdefaultencoding

(Figure: setdefaultencoding, http://files.jb51.net/upload/201103/20110313135230405.jpg)

As the demo code above shows, encoding s (a GBK string) directly to utf-8 throws an exception, but it works after running the following code:

import sys
reload(sys)
sys.setdefaultencoding('gbk')

And then the conversion succeeds. Why? When Python converts a str from one encoding directly to another, it first decodes the str into unicode, and that decode uses the default encoding, which is normally ascii. So the first conversion in the example code fails; once the default encoding is set to 'gbk', it no longer does.

As for reload(sys): since Python 2.5, sys.setdefaultencoding is deleted during interpreter initialization, so the sys module has to be reloaded to get the method back.
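In Python 3, sys.setdefaultencoding is gone entirely, along with implicit str/bytes conversion. But the failure it papered over, namely decoding non-ASCII bytes with the ascii codec, can still be reproduced explicitly (a sketch):

```python
# The step Python 2 performed implicitly when re-encoding a str:
# decode it with the default codec, which is ascii.
gbk_data = b'\xb9\xfe\xb9\xfe'   # '哈哈' in GBK

try:
    gbk_data.decode('ascii')     # this implicit decode is what raised in Python 2
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(ascii_failed)              # True: 0xb9 is outside the ASCII range
```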

4. Reading files in different encodings

Create a file test.txt saved in ANSI format (which is GBK on a Chinese Windows system), with the content:

ABC中文

Read it in Python:

# coding=gbk
print open("test.txt").read()

Result: ABC中文

Change the file's encoding to utf-8:

Result: ABC followed by mojibake (the utf-8 bytes of 中文 misread as GBK)

Obviously, decoding is needed here:

# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")

Result: ABC中文

I had edited the test.txt above with EditPlus, but when I edited it with Windows' built-in Notepad and saved it as utf-8, running the script raised an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as utf-8. So we need to strip these bytes when reading; Python's codecs module defines a constant for them:

# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result: ABC中文
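The same BOM check can be exercised in Python 3 by building the byte stream in memory instead of reading a Notepad file (a sketch):

```python
import codecs

# Simulate a file saved by Notepad: UTF-8 content with a BOM prefix.
data = codecs.BOM_UTF8 + 'ABC\u4e2d\u6587'.encode('utf-8')

# Strip the three BOM bytes (0xEF 0xBB 0xBF) if present, then decode.
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]

text = data.decode('utf-8')
print(text)                      # ABC中文
```

In Python 3 you can also skip the manual check by opening the file with encoding='utf-8-sig', which strips the BOM automatically.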

5. The source file's encoding and the encoding declaration

What does the encoding of a source file have to do with the strings declared in it? This bothered me for a long time, but it is finally becoming clear: the file's encoding determines the byte content of the str literals declared in that source file. For example:

str = '哈哈'
print repr(str)

a. If the file is saved as utf-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (the utf-8 encoding of 哈哈).

b. If the file is saved as GBK, the value of str is '\xb9\xfe\xb9\xfe' (the GBK encoding of 哈哈).

As mentioned in the first section, a str in Python is just a byte array, so when the str from case a is output to a GBK-encoded console, it displays as mojibake; and when the str from case b is output to a utf-8-encoded console, it is also garbled, possibly showing nothing at all, since '\xb9\xfe\xb9\xfe' is not valid utf-8. >_<

Now that the file encoding is out of the way, let's talk about what the encoding declaration does. Every file can carry a statement like # coding=gbk at the top to declare its encoding, but what is that declaration for? As far as I can tell, it has three functions:

1. It declares that non-ASCII characters (usually Chinese) will appear in the source file.
2. In advanced IDEs, the IDE will save your file in the format you declared.
3. It determines the encoding used to decode a literal such as u'哈' into unicode, a point that is easy to miss. Here's an example:

# coding: gbk
ss = u'哈哈'
print repr(ss)
print 'ss: %s' % ss

Save this code as a utf-8 file and run it. What do you expect it to output? Your first guess is probably:

u'\u54c8\u54c8'
ss: 哈哈

But the actual output is:

u'\u935d\u581d\u6431'
ss: 鍝堝搱

Why is that? This is where the encoding declaration comes into play. When ss = u'哈哈' runs, the whole process breaks down into the following steps:

1) Get the bytes of '哈哈': the file's encoding determines them, so in a utf-8 file they are '\xe5\x93\x88\xe5\x93\x88' (the utf-8 encoding of 哈哈).

2) Convert to unicode: in this conversion, '\xe5\x93\x88\xe5\x93\x88' is decoded not with utf-8 but with the gbk named in the encoding declaration. Decoded as GBK, those bytes are the three characters 鍝堝搱, whose unicode code points are u'\u935d\u581d\u6431'. That explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
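The two steps above can be reproduced in Python 3, where the wrong-codec decode has to be written out explicitly (a sketch):

```python
# Step 1: the file's encoding determines the bytes of '哈哈' (here UTF-8).
utf8_bytes = '\u54c8\u54c8'.encode('utf-8')   # b'\xe5\x93\x88\xe5\x93\x88'

# Step 2: the declaration says gbk, so the bytes get decoded with the wrong codec.
wrong = utf8_bytes.decode('gbk')

print(repr(wrong))                            # '鍝堝搱', i.e. '\u935d\u581d\u6431'
```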

Okay, this is a little convoluted, so let's look at the next example:

# -*- coding: utf-8 -*-
ss = u'哈哈'
print repr(ss)
print 'ss: %s' % ss

This time, save the example as a GBK file; the result is:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte

Why a utf8 decode error here? Think back to the previous example: in the first step of the conversion, the file encoding is GBK, so the bytes obtained for '哈哈' are the GBK encoding '\xb9\xfe\xb9\xfe'. In the second step, converting to unicode, utf-8 is used to decode '\xb9\xfe\xb9\xfe', and if you check a utf-8 table you will find this byte sequence does not exist in it (for an explanation of utf-8, see Character Encoding Notes: ASCII, UTF-8 and Unicode), hence the error above.
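The failing decode can likewise be reproduced in Python 3 (a sketch):

```python
# The GBK bytes of '哈哈' are not valid UTF-8: 0xb9 is an invalid start byte.
gbk_bytes = '\u54c8\u54c8'.encode('gbk')      # b'\xb9\xfe\xb9\xfe'

try:
    gbk_bytes.decode('utf-8')
    err = None
except UnicodeDecodeError as exc:
    err = exc

print(err is not None)           # True: the decode raised
print(hex(err.object[err.start]))  # the offending byte, 0xb9
```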

