Character encoding in Python: an introduction and suggestions for use

  • 2020-04-02 14:30:01
  • OfStack

1. Introduction to character encoding

1.1. ASCII

ASCII (American Standard Code for Information Interchange) is a single-byte encoding. The computer world started out with English only, and a single byte can represent 256 different characters, enough for all English letters and a good number of control symbols. However, ASCII uses only the lower half of that range (\x80 and below), and that fact is the basis on which MBCS encodings are built.

1.2. MBCS

However, other languages soon appeared in the computer world, and single-byte ASCII was no longer enough. Each language then developed its own encoding. Because a single byte can represent too few characters, and because compatibility with ASCII had to be preserved, these encodings use more than one byte per character: GBxxx, BIGxxx and so on. Their rule is that if a byte is \x80 or below, it still represents an ASCII character; if it is above \x80, it and the next byte together (two bytes in total) represent one character; the next byte is then skipped and scanning continues.

At this point IBM invented the concept of the Code Page: it collected these encodings and assigned each one a page number. GBK is page 936, i.e. CP936, so CP936 can also be used as a name for GBK.
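A quick way to confirm this equivalence (a minimal sketch, assuming a Python 2 interpreter, since the rest of this article is about Python 2.x) is to check that the 'gbk' and 'cp936' codecs produce the same bytes:

# coding: UTF-8
u = u'\u6c49'                                 # the character U+6C49 ("han")
print u.encode('gbk') == u.encode('cp936')    # True: cp936 is just another name for GBK
print repr(u.encode('cp936'))                 # '\xba\xba'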

MBCS (Multi-Byte Character Set) is the collective name for these encodings. So far everyone has used at most two bytes, so it is sometimes also called DBCS (Double-Byte Character Set). To be clear, MBCS is not any one specific encoding: in Windows, MBCS refers to a different encoding depending on your region setting, and in Linux it cannot be used as an encoding at all. You never see the name MBCS in Windows because Microsoft uses the fancier-sounding name ANSI instead; under the default region settings of simplified-Chinese Windows, ANSI refers to GBK.

1.3. Unicode

Later, people began to feel that having so many encodings made the world too complicated and too painful, so they sat down together and came up with a solution: represent the characters of all languages with one and the same character set. That character set is Unicode.

The original Unicode standard, UCS-2, used two bytes per character, which is why you often hear that Unicode uses two bytes per character. A while later, however, some felt that 65536 (256*256) code points were too few, so the UCS-4 standard appeared, which uses four bytes per character; still, what we use most is UCS-2.

UCS (Unicode Character Set) is just a table of characters and their code points; for example the code point of the character 汉 is 6C49. How characters are actually transferred and stored is the responsibility of UTF (UCS Transformation Format).

At first, characters were simply stored using their UCS code units; that is UTF-16. For example, 汉 can be stored as \x6C\x49 (UTF-16-BE) or, reversed, as \x49\x6C (UTF-16-LE). After using it for a while, though, the Americans felt they were losing out: English letters used to need only one byte each and now took two, doubling the space they consumed... So UTF-8 was born.

UTF-8 is an awkward variable-length encoding that is compatible with ASCII: ASCII characters are represented in one byte. But what is saved here has to be paid for elsewhere. You have surely heard that Chinese characters take three bytes in UTF-8; characters that need four bytes fare even worse... (search for how UCS-2 is transformed into UTF-8 for the details).
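These byte layouts are easy to verify from Python 2 itself (a small sketch using the character U+6C49 from the text):

# coding: UTF-8
u = u'\u6c49'
print u.encode('utf-16-be').encode('hex')  # '6c49': the UCS-2 code unit, big-endian
print u.encode('utf-16-le').encode('hex')  # '496c': the same two bytes reversed
print u.encode('utf-8').encode('hex')      # 'e6b189': three bytes in UTF-8
print len(u'a'.encode('utf-8'))            # 1: ASCII characters still take a single byte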

Another thing worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding used is not stored with it, so when we open it we have to remember which encoding it was saved with and open it with that encoding, which causes a lot of trouble. (You may want to object that notepad does not ask for an encoding when opening a file; try opening notepad first and then using File > Open to see.) UTF therefore introduces a BOM to announce the encoding: if the first bytes read are one of the following, the text that follows is read with the corresponding encoding:

BOM_UTF8     '\xef\xbb\xbf'
BOM_UTF16_LE '\xff\xfe'
BOM_UTF16_BE '\xfe\xff'

Not every editor writes a BOM, but even without a BOM a Unicode-encoded file can still be read; just like an MBCS-encoded file, the specific encoding has to be specified explicitly, otherwise decoding will fail.
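The BOM values above are exposed as constants by the codecs module. A minimal sketch of the kind of BOM sniffing an editor performs might look like this (detect_bom is a made-up helper name, not a standard function):

# coding: UTF-8
import codecs

def detect_bom(data):
    # guess the codec from the leading bytes; return (codec name, BOM length) or (None, 0)
    for bom, name in ((codecs.BOM_UTF8, 'utf-8'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print detect_bom('\xef\xbb\xbf\xe6\xb1\x89')  # ('utf-8', 3)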

You may have heard that UTF-8 does not need a BOM; that is not quite accurate, it is just that most editors default to UTF-8 when there is no BOM. Even notepad, which defaults to ANSI (MBCS) when saving, first tries UTF-8 when reading a file and, if the bytes decode successfully, decodes them as UTF-8. This awkward habit of notepad causes a bug: create a new text file, type the two characters whose GBK encoding is the byte sequence \xE6\xB1\x89\x61, save with ANSI (MBCS), open the file again, and they turn into "汉a". Try it :)
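Notepad's behaviour can be reproduced in Python 2 (a sketch; only the byte values are asserted here, how the characters display depends on your console): the GBK bytes of what was typed happen to form a valid UTF-8 sequence, so the UTF-8-first guess wins on reopening.

# coding: UTF-8
data = '\xe6\xb1\x89\x61'         # the bytes notepad wrote to disk when saving as ANSI (GBK)
print repr(data.decode('gbk'))    # two CJK characters: what was originally typed
print repr(data.decode('utf-8'))  # u'\u6c49a', i.e. "han" followed by 'a': what notepad shows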

2. Encoding problems in Python 2.x

2.1. str and unicode

str and unicode are both subclasses of basestring. Strictly speaking, str is a byte string: a sequence of bytes obtained by encoding a unicode string. Calling len() on the str '汉' encoded in UTF-8 gives 3, because '汉' encoded in UTF-8 is '\xE6\xB1\x89'.

unicode is the real string; it is what you get by decoding the byte string str with the correct character encoding, and len(u'汉') == 1.

Now look at the two instance methods of basestring, encode() and decode(). Once the difference between str and unicode is understood, these two methods are no longer confusing:


# coding: UTF-8
u = u'汉'
print repr(u)   # u'\u6c49'
s = u.encode('UTF-8')
print repr(s)   # '\xe6\xb1\x89'
u2 = s.decode('UTF-8')
print repr(u2)  # u'\u6c49'
# decoding a unicode is an error
# s2 = u.decode('UTF-8')
# likewise, encoding a str is an error
# u2 = s.encode('UTF-8')

Note that although calling encode() on a str is conceptually an error, Python does not simply forbid it: the str is first decoded with the default (ASCII) codec and then re-encoded, so for pure-ASCII content you silently get back another str with the same content but a different id, while non-ASCII content raises a confusing UnicodeDecodeError. Calling decode() on a unicode behaves the same way in reverse. It is hard to understand why encode() and decode() were put on basestring rather than on unicode and str respectively, but since that is how it is, let's be careful not to mix them up.
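A short demonstration (a sketch, assuming the standard ASCII default encoding):

# coding: UTF-8
s = 'abc'
s2 = s.encode('UTF-8')            # really s.decode('ascii').encode('UTF-8')
print s == s2, s is s2            # True False: same content, different object
try:
    '\xe6\xb1\x89'.encode('UTF-8')   # non-ASCII str: the implicit ASCII decode fails
except UnicodeDecodeError, e:
    print 'UnicodeDecodeError:', e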

2.2. Character encoding declaration

In a source code file, if non-ASCII characters are used, a character encoding declaration is required at the top of the file, like this:


#-*- coding: UTF-8 -*-

Python actually only checks for "#", "coding" and the encoding name; the remaining characters are there for aesthetics. There are many character encodings available in Python, and many case-insensitive aliases, e.g. UTF-8 can also be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
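For instance (a tiny sketch), 'u8' is listed among the standard aliases of UTF-8, so the two names resolve to the same codec:

# coding: UTF-8
import codecs
print codecs.lookup('u8').name                               # 'utf-8'
print u'\u6c49'.encode('u8') == u'\u6c49'.encode('UTF-8')    # True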

It is also important that the declared encoding matches the encoding actually used when the file is saved, otherwise parsing the source is very likely to raise an exception. IDEs usually handle this automatically (changing the declaration also changes the encoding the file is saved in), but users of plain text editors need to be careful :)

2.3. Reading and writing files

When a file is opened with the built-in open(), read() returns a str, which must be decode()d with the correct encoding. For write(), if the argument is unicode it must first be encode()d with the encoding you want to write; if it is a str in some other encoding, it must first be decode()d with that str's own encoding into unicode and then encode()d with the target encoding. If a unicode is passed directly to write(), Python encodes it implicitly with the default encoding (normally ASCII) before writing, which fails for non-ASCII characters.


# coding: UTF-8
f = open('test.txt')
s = f.read()
f.close()
print type(s)  # <type 'str'>
# the file is known to be GBK-encoded, so decode it to unicode
u = s.decode('GBK')
f = open('test.txt', 'w')
# encode to a UTF-8 str before writing
s = u.encode('UTF-8')
f.write(s)
f.close()
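By contrast, passing a unicode straight to the built-in file's write() relies on the implicit default encoding; a short sketch of what usually happens (assuming the standard ASCII default):

# coding: UTF-8
f = open('test.txt', 'w')
try:
    f.write(u'\u6c49')   # implicit encode with sys.getdefaultencoding(), normally ASCII
except UnicodeEncodeError, e:
    print 'UnicodeEncodeError:', e
f.close()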

In addition, the codecs module provides its own open() function, which takes an encoding argument; read() on a file opened this way returns unicode. When writing, if the argument is unicode it is encoded with the encoding given to codecs.open(); if it is a str, it is first implicitly decoded to unicode (using the default encoding) and then encoded and written. Compared with the built-in open(), this method runs into far fewer encoding problems.


# coding: GBK
import codecs
f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>
f = codecs.open('test.txt', 'a', encoding='UTF-8')
# writing unicode
f.write(u)
# writing str: decoding and encoding happen automatically
# this str is GBK-encoded (the source file is declared as GBK)
s = '汉'
print repr(s)  # '\xba\xba'
# the GBK str is first decoded to unicode and then encoded to UTF-8 and written
f.write(s)
f.close()

2.4. Encoding-related methods

The sys and locale modules provide methods for obtaining the default encodings of the current environment.


# coding: gbk
import sys
import locale
def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())
# returns the default character encoding currently used by the system
p(sys.getdefaultencoding)
# returns the encoding used to convert Unicode file names into system file names
p(sys.getfilesystemencoding)
# returns the default locale as a tuple (language, encoding)
p(locale.getdefaultlocale)
# returns the encoding the user has set for text data
# the documentation notes that this function only returns a guess
p(locale.getpreferredencoding)
# '\xba\xba' is the GBK encoding of 汉
# 'mbcs' is a codec that should not be used; it appears here only to show why
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# results on the author's Windows machine (region set to Chinese (Simplified, China)):
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'

3. Some suggestions

3.1. Always use a character encoding declaration, and use the same declaration in every source file of a project.
This must be done.

3.2. Abandon str and use unicode throughout.
Typing a u before the opening quotation mark is a little awkward at first, and you will often forget and have to go back to add it, but doing so gets rid of 90% of encoding problems. If encoding problems are not troubling you badly, feel free to ignore this suggestion.
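One way to lighten the burden of the u prefix, available since Python 2.6, is the unicode_literals future import, which makes unprefixed string literals in that file unicode (a sketch; whether to adopt it project-wide is a separate decision):

# coding: UTF-8
from __future__ import unicode_literals
s = 'han: \u6c49'      # no u prefix needed; this literal is already unicode
print type(s)          # <type 'unicode'>
print len('\u6c49')    # 1, not 3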

3.3. Use codecs.open() instead of the built-in open().
If encoding problems are not troubling you badly, feel free to ignore this suggestion.

3.4. Character encodings that absolutely must be avoided: 'mbcs'/'dbcs' and 'utf-16'.
By 'mbcs' I do not mean GBK or the like, but Python's codec named 'mbcs'; do not use it unless your program is not meant to be portable at all.

In Python, the codec 'mbcs' ('dbcs' is a synonym) means whatever encoding MBCS currently refers to in the Windows environment. This codec does not exist in Python on Linux, so once the code is ported to Linux exceptions are guaranteed. Moreover, what 'mbcs' refers to changes with the Windows region setting. Below are the results of running the code from section 2.4 under different region settings:


# Chinese (Simplified, China)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'
# English (United States)
#sys.getdefaultencoding(): UTF-8
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'
# German (Germany)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'
# Japanese (Japan)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp932')
#locale.getpreferredencoding(): cp932
#'\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'

So when what we need is GBK, we should write 'GBK' directly instead of 'mbcs'.

Similarly, although 'utf-16' is effectively a synonym for 'utf-16-le' on most operating systems, writing 'utf-16-le' directly costs only three more characters, and if 'utf-16' ever means 'utf-16-be' on some system you will get wrong results. In practice UTF-16 is used rather rarely, but when you do use it, take care.
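The difference is easy to see (a sketch run on a little-endian machine; note that the plain 'utf-16' codec also prepends a BOM, one more reason to spell the byte order out):

# coding: UTF-8
u = u'\u6c49'
print u.encode('utf-16').encode('hex')     # 'fffe496c': BOM plus little-endian bytes (on this machine)
print u.encode('utf-16-le').encode('hex')  # '496c': no BOM, byte order fixed by the name
print u.encode('utf-16-be').encode('hex')  # '6c49'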

