Python encoding: the difference between str and unicode
- 2020-05-10 18:24:30
- OfStack
A good article on str and unicode, sorting out Python's encoding handling.
Note: the discussion below is about Python 2.x; Py3k has yet to be tried.
Start
When processing Chinese with Python (reading files or messages, handling HTTP parameters, and so on), you run the code and find garbled text during string manipulation, file reads/writes, or print. Most people then throw encode/decode calls at the problem to debug it, without really thinking about why the garbling happens. These are the most common errors seen while debugging:
Error 1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
Error 2

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
First of all, you need a general understanding of character sets and character encodings: ASCII, Unicode, UTF-8, and so on. See:
Character encoding notes: ASCII, Unicode and UTF-8
Taobao search technology blog - Chinese coding chat
str and unicode
Both str and unicode are subclasses of basestring, so there is a way to tell whether a value is a string:

def is_str(s):
    return isinstance(s, basestring)
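The basestring check above is Python 2 only. As a sketch of my own (not from the original article), the same idea can be written so it also runs under Python 3, where the closest equivalent pair is (str, bytes):

```python
# Hypothetical compatibility helper: basestring exists only in Python 2,
# where both str and unicode inherit from it; Python 3 has no basestring.
try:
    string_types = basestring        # Python 2
except NameError:
    string_types = (str, bytes)      # Python 3

def is_str(s):
    return isinstance(s, string_types)
```

With this, is_str(u'中文'), is_str(b'abc') and is_str('abc') are all true, while is_str(42) is false, under either interpreter.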
Converting between str and unicode
(see the decode and encode documentation)

str  --decode('the_coding_of_str')-->  unicode
unicode  --encode('the_coding_you_want')-->  str
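The two arrows form a lossless round trip whenever the declared codec matches the data. A minimal sketch, written with u''/b'' literals so it behaves the same under Python 2.6+ and Python 3:

```python
# -*- coding: utf-8 -*-
# Round trip: bytes --decode--> text --encode--> bytes.
raw = b'\xe4\xb8\xad\xe6\x96\x87'    # the UTF-8 bytes for u'中文'

text = raw.decode('utf-8')           # str -> unicode (Py2) / bytes -> str (Py3)
assert text == u'\u4e2d\u6587'

back = text.encode('utf-8')          # unicode -> str (Py2) / str -> bytes (Py3)
assert back == raw
```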
The difference
str is a sequence of bytes, produced by encoding (encode) a unicode value.
Ways to declare one:

s = '中文'
s = u'中文'.encode('utf-8')

>>> type('中文')
<type 'str'>
Its length is the number of bytes:

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6
unicode is the real string, a sequence of characters.
Ways to declare one:

s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')

>>> type(u'中文')
<type 'unicode'>
Its length is the number of characters, which is what you really want in program logic:

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
Conclusion
Understand whether what you are processing is str or unicode, and use the matching method (str.decode / unicode.encode).
Here is how to determine whether a value is unicode or str:
>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False
>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
A simple rule: do not call encode on a str, and do not call decode on a unicode (strictly speaking, a str can be encoded; see the end of this post for details. For simplicity it is not recommended.)
>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
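As an aside (my addition, not in the original article): Python 3 removed these implicit conversions entirely. Text has no decode method and bytes has no encode method, so the mistakes above fail immediately with an AttributeError instead of a confusing ascii-codec error. A quick check under Python 3 semantics:

```python
# Under Python 3, the conversion methods exist only in the sensible direction.
print(hasattr(u'\u4e2d\u6587', 'encode'))   # text can be encoded: True
print(hasattr(b'\xe4\xb8\xad', 'decode'))   # bytes can be decoded: True
print(hasattr(u'\u4e2d\u6587', 'decode'))   # no str.decode in Py3: False
print(hasattr(b'\xe4\xb8\xad', 'encode'))   # no bytes.encode in Py3: False
```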
To convert between two different encodings, use unicode as the intermediate form:

# s is a str in code_A
s.decode('code_A').encode('code_B')
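For instance, converting UTF-8 bytes to GBK bytes pivots through unicode. A sketch (utf-8 and gbk are real stdlib codec names; the byte values are the UTF-8 and GBK encodings of u'中文'):

```python
# Convert UTF-8-encoded bytes to GBK-encoded bytes via the unicode pivot.
utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'            # u'中文' in UTF-8
gbk_bytes = utf8_bytes.decode('utf-8').encode('gbk')
print(repr(gbk_bytes))                              # the GBK bytes \xd6\xd0\xce\xc4
```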
File handling, IDE and console
During processing, you can think of a running Python program as a pool with one entrance and one exit.
At the entrance, convert everything to unicode; inside the pool, everything is unicode; at the exit, encode to the target encoding (with the exception, of course, of any specific encoding required by the processing logic itself).
Read external input and decode it to unicode for processing (the internal representation is uniformly unicode), then encode to the desired target encoding and send it to the target output (file or console).
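A minimal sketch of that pipeline (the file names are hypothetical; io.open exists in both Python 2.6+ and Python 3 and performs the decode/encode at the boundary for you):

```python
# -*- coding: utf-8 -*-
import io

# Set up a sample input file holding UTF-8 bytes (hypothetical name).
with io.open('pool_in.txt', 'wb') as f:
    f.write(u'hello \u4e2d\u6587\n'.encode('utf-8'))

# Entrance: io.open with an encoding decodes to unicode as it reads.
with io.open('pool_in.txt', 'r', encoding='utf-8') as f:
    text = f.read()            # unicode in Py2, str in Py3

# Pool: all logic works on unicode.
text = text.upper()            # u'HELLO \u4e2d\u6587\n'

# Exit: encode to the target encoding only on the way out.
with io.open('pool_out.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```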
When the IDE or console reports an error on print, it is because the output encoding differs from the one the IDE/console uses. Convert the output to the matching encoding and it prints normally:

>>> print u'中文'.encode('gbk')
???
>>> print u'中文'.encode('utf-8')
中文
Advice

Standardize the encoding
Unify the encoding used at every link to prevent any one of them from causing garbled text: the environment encoding, the IDE/text editor, file encodings, and database/table encodings.

Make sure of the source file's encoding
This is very important.
The default encoding of a .py file is ASCII. If you use non-ASCII characters in a source file, you need to declare the encoding in the file header, and the declaration must be on line 1 or line 2 of the file. Without it, you get an error like:

File "XXX.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file XXX.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
The declaration looks like:

# -*- coding: utf-8 -*-
or
# coding=utf-8
If the header declares coding=utf-8, then for a = '中文' the encoding of a is utf-8.
If the header declares coding=gb2312, then for a = '中文' the encoding of a is gbk.
So: use one identical encoding declaration in the headers of all source files in a project, and make sure the declared encoding matches the encoding the source file is actually saved in (this depends on your editor).
For hardcoded strings processed in the source code, uniformly use unicode.
This separates their type from the encoding of the source file itself, so each stage of the process can be handled independently:

if s == u'中文':  # not s == '中文'
    pass         # note: by the time s gets here, make sure it is unicode

After the steps above, you only need to care about two encodings: unicode and the single target encoding you settled on.
Processing order
1. Decode early
2. Unicode everywhere
3. Encode late
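The three steps can be captured in a tiny boundary helper (the name to_text is my own, hypothetical): anything that might be bytes is decoded once at the edge, and the rest of the program only ever sees text.

```python
def to_text(s, encoding='utf-8'):
    """Decode early: turn incoming bytes into text; pass text through as-is.

    In Python 2, str is a bytes type, so encoded input gets decoded;
    in Python 3 the same check catches bytes objects.
    """
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s

# Unicode everywhere: both spellings compare equal once normalized.
assert to_text(b'\xe4\xb8\xad\xe6\x96\x87') == to_text(u'\u4e2d\u6587')
```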
Related modules and some methods
Getting and setting the system default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
str.encode('other_coding')
In Python 2, directly encoding a str of one encoding into a str of another encoding actually executes:

# str_A is utf-8
str_A.encode('gbk')
# is executed as
str_A.decode('sys_codec').encode('gbk')
# where sys_codec is sys.getdefaultencoding()

"Getting and setting the system default encoding" above is related to this behavior of str.encode, but I seldom use it this way: it is too complicated to keep under control, and it is easier to decode input explicitly and encode output explicitly (personal opinion).
chardet
A module for detecting file encodings; it must be downloaded separately.

>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
\u literals and unicode strings

>>> u'中'
u'\u4e2d'
>>> s = '\u4e2d'
>>> print s.decode('unicode_escape')
中
>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
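Note that unicode_escape operates on data that literally contains backslash-u sequences, as often seen in JSON-ish logs. A sketch that works under both Python 2 and Python 3 (in Python 3 the escaped form must be a bytes literal):

```python
# A byte string containing the literal characters \u4e2d\u6587
# (12 ASCII bytes), not real Chinese characters yet.
escaped = b'\\u4e2d\\u6587'

# unicode_escape interprets the \uXXXX sequences into real characters.
decoded = escaped.decode('unicode_escape')
assert decoded == u'\u4e2d\u6587'
assert len(decoded) == 2
```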
The above is a write-up of encoding handling in Python; related material will continue to be added. Thank you for supporting this site!