Solving Chinese character encoding problems in Python: UnicodeDecodeError

  • 2020-05-24 05:47:00
  • OfStack

preface

Recently, a project required me to read a Chinese txt document and then save it. The document had previously been base64-encoded, so all the Chinese characters came out garbled when read and displayed. After the project team dropped base64, two errors appeared in succession:


UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x...

If you are not yet familiar with ASCII, Unicode, and UTF-8, see the previous article on strings and encodings.

Here are three concepts to understand:

  • ASCII can only represent digits, English letters, and some special symbols; it cannot represent Chinese characters.
  • Both Unicode and UTF-8 can represent Chinese characters. Unicode, as stored in memory, uses a fixed number of bytes per character, while UTF-8 is a variable-length encoding.
  • Data in memory is usually stored as Unicode, while files on disk are usually stored as UTF-8, because UTF-8 saves storage space.
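The three points above can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the console sessions in this article are Python 2), where `str` holds Unicode text and `bytes` holds encoded data. UTF-32 is used here only as a stand-in for a fixed-length representation, and the variable names are illustrative.

```python
# Python 3 sketch: compare how ASCII, UTF-8 and a fixed-width encoding handle text.
text = "a\u6c49"  # one English letter plus the Chinese character U+6C49

# ASCII cannot represent the Chinese character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 is variable length: 1 byte for 'a', 3 bytes for U+6C49.
utf8_lengths = [len(ch.encode("utf-8")) for ch in text]

# UTF-32 is fixed length: 4 bytes per character (the "-le" variant adds no BOM).
utf32_lengths = [len(ch.encode("utf-32-le")) for ch in text]

print(ascii_ok, utf8_lengths, utf32_lengths)  # False [1, 3] [4, 4]
```

For text that is mostly ASCII, the UTF-8 form is much smaller, which is the storage saving mentioned in the third point.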

So what is the default encoding for python?


>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

Python's default encoding is ascii (in Python 2, as shown above). It can be changed by calling sys.setdefaultencoding('utf-8') after reload(sys).

In python, the encoding of data can be changed by means of encode and decode, for example:


>>> u'汉字'
u'\u6c49\u5b57'
>>> u'汉字'.encode('utf-8')
'\xe6\xb1\x89\xe5\xad\x97'
>>> u'汉字'.encode('utf-8').decode('utf-8')
u'\u6c49\u5b57'

These two methods let us convert between unicode objects and encoded str objects.
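As a complement to the Python 2 session above, here is a minimal Python 3 sketch (an assumption; in Python 3 the Python 2 `unicode` type is called `str` and the Python 2 `str` type is called `bytes`). It shows the round trip, and what happens when bytes are decoded with a codec other than the one that produced them.

```python
# Python 3 sketch: encode() turns text into bytes, decode() turns bytes back into text.
text = "\u6c49\u5b57"  # the two characters used in the session above

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the codec that produced the bytes restores the original text.
round_trip_ok = (utf8_bytes.decode("utf-8") == text
                 and gbk_bytes.decode("gbk") == text)

# Decoding GBK bytes as UTF-8 fails, because 0xba is not a valid UTF-8 start byte.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(utf8_bytes.hex(), gbk_bytes.hex(), round_trip_ok, wrong_codec_failed)
```

The same text produces different bytes under different codecs, so the codec used to decode has to match the one used to encode.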

So, what type is str in python?


>>> import binascii
>>> '汉字'
'\xba\xba\xd7\xd6'
>>> type('汉字')
<type 'str'>
>>> print binascii.b2a_hex('汉字')
babad7d6
>>> print binascii.b2a_hex(u'汉字')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-1: ordinal not in range(128)
>>> print binascii.b2a_hex(u'汉字'.encode('utf-8'))
e6b189e5ad97
>>> print binascii.b2a_hex(u'汉字'.encode('gbk'))
babad7d6

binascii converts between binary data and its ASCII representations; b2a_hex returns the bytes as hexadecimal. The session above shows that '汉字' has type str and consists of the bytes babad7d6, while u'汉字' cannot be converted to ascii, which triggers the first error mentioned at the beginning. The fix is to turn it into a str with .encode('utf-8'). Because my command line is a Windows console that uses GBK by default, u'汉字'.encode('gbk') gives the same result as '汉字'.

To sum up, a str in Python 2 is actually Unicode text stored as bytes in some encoding (GBK in the console above), and Python's default codec is ascii. When non-ASCII data is converted through ascii, an error is raised. Keep the following rules in mind:

unicode => encode('appropriate encoding') => str
str => decode('appropriate encoding') => unicode
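One way to apply these two rules is to decode once where data enters the program and encode once where it leaves. The sketch below is written for Python 3 (an assumption; there the Python 2 `unicode` is `str` and the Python 2 `str` is `bytes`). `to_text` and `to_bytes` are hypothetical helper names, not part of any library.

```python
# Python 3 sketch of the two rules; to_text and to_bytes are hypothetical helpers.
def to_text(value, encoding="utf-8"):
    """Return text: decode bytes with the given encoding, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return bytes: encode text with the given encoding, pass bytes through."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


raw = b"\xba\xba\xd7\xd6"      # GBK bytes, as produced by the console above
text = to_text(raw, "gbk")     # decode once, at the input boundary
out = to_bytes(text, "utf-8")  # encode once, at the output boundary
print(repr(text), out)
```

Inside the program everything is then text, and the choice of encoding is made explicitly at the boundaries instead of being left to the default codec.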

Another easy way to save a lot of trouble is to set the default encoding at the top of the script:


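import sys
reload(sys)  # Python 2 only: makes setdefaultencoding available again
sys.setdefaultencoding('utf-8')  # implicit str/unicode conversions now use utf-8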

The second problem was an error when reading the file. UTF-8 files come in two forms: with a BOM and without. The difference seems to be that a file with a BOM has an extra header at the start (the bytes EF BB BF) that a file without a BOM lacks. As a result, reading the file as utf-8 went wrong.
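To make the BOM concrete, here is a minimal Python 3 sketch (an assumption; it is not the code from the project described here) that builds the same content with and without a BOM and decodes it with two standard codecs, `utf-8` and `utf-8-sig`.

```python
import codecs

# Python 3 sketch: what a UTF-8 BOM looks like at the byte level.
payload = "\u6c49\u5b57".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # BOM_UTF8 is b'\xef\xbb\xbf'

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM and also accepts input without one.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_no_bom = payload.decode("utf-8-sig")

print(repr(decoded_plain), repr(decoded_sig), repr(decoded_no_bom))
```

In this sketch the BOM by itself does not raise a decode error under the plain `utf-8` codec; it shows up as a stray leading character. If BOM handling is the goal, `utf-8-sig` covers both kinds of file.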

Once again a Google search helped: the fix is to read the file with the codecs library (which, I guess, checks the file header).


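import codecs
# errors='ignore' drops bytes that are not valid UTF-8 instead of raising
with codecs.open(file_name, "r", encoding='utf-8', errors='ignore') as f:
    text = f.read()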
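To see what the errors argument in the codecs.open call above does, the following Python 3 sketch (an assumption) writes a small temporary file containing one invalid byte and reads it back three ways. It uses the built-in `open`, which in Python 3 accepts the same `encoding` and `errors` arguments. The file and its contents are made up for illustration.

```python
import os
import tempfile

# Python 3 sketch: effect of the errors argument on a file with invalid UTF-8.
data = b"ok \xe6\xb1\x89 \xff end"  # \xff can never appear in valid UTF-8

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Default errors='strict': the bad byte raises UnicodeDecodeError.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors='ignore': the bad byte is dropped silently.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        ignored = f.read()

    # errors='replace': the bad byte becomes U+FFFD, so the damage stays visible.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        replaced = f.read()
finally:
    os.remove(path)

print(strict_failed, repr(ignored), repr(replaced))
```

errors='ignore' makes the exception go away but also discards data without notice, so errors='replace' can be the safer choice while tracking down which bytes are wrong.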

To deal with encoding problems, one must understand how ASCII, Unicode, and UTF-8 work.

conclusion

That is all for this article. I hope it is of some help in your study or work; if you have questions, feel free to leave a comment.

