Python encoding: the difference between str and unicode
- 2020-05-10 18:24:30
- OfStack
A good article on str and unicode, sorting out Python's encoding handling.
Note: the discussion below is about Python 2.x; Py3k has yet to be tried.
Start
When processing Chinese with Python (reading files or messages, handling HTTP parameters, and so on), you run the code and find garbled text during string manipulation, file reads/writes, or print. Most people then throw encode/decode calls at the problem to debug it, without really thinking about why the garbling happens. These are the most common errors seen while debugging:
Error 1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
Error 2

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
First of all, you need a general understanding of character sets and character encodings: ASCII, Unicode, UTF-8, and so on. See:
Character encoding notes: ASCII, Unicode and UTF-8
Taobao search technology blog - Chinese coding chat
str and unicode
Both str and unicode are subclasses of basestring, so there is a way to tell whether a value is a string:

def is_str(s):
    return isinstance(s, basestring)
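The basestring check above is Python 2 only. As a sketch of my own (not from the original article), the same idea can be written so it also runs under Python 3, where the closest equivalent pair is (str, bytes):

```python
# Hypothetical compatibility helper: basestring exists only in Python 2,
# where both str and unicode inherit from it; Python 3 has no basestring.
try:
    string_types = basestring        # Python 2
except NameError:
    string_types = (str, bytes)      # Python 3

def is_str(s):
    return isinstance(s, string_types)
```

With this, is_str(u'中文'), is_str(b'abc') and is_str('abc') are all true, while is_str(42) is false, under either interpreter.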
Converting between str and unicode
(see the decode and encode documentation)

str  --decode('the_coding_of_str')-->  unicode
unicode  --encode('the_coding_you_want')-->  str
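The two arrows form a lossless round trip whenever the declared codec matches the data. A minimal sketch, written with u''/b'' literals so it behaves the same under Python 2.6+ and Python 3:

```python
# -*- coding: utf-8 -*-
# Round trip: bytes --decode--> text --encode--> bytes.
raw = b'\xe4\xb8\xad\xe6\x96\x87'    # the UTF-8 bytes for u'中文'

text = raw.decode('utf-8')           # str -> unicode (Py2) / bytes -> str (Py3)
assert text == u'\u4e2d\u6587'

back = text.encode('utf-8')          # unicode -> str (Py2) / str -> bytes (Py3)
assert back == raw
```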
The difference
str is a sequence of bytes, produced by encoding (encode) a unicode value.
Ways to declare one:

s = '中文'
s = u'中文'.encode('utf-8')

>>> type('中文')
<type 'str'>
Its length is the number of bytes:

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6
unicode is the real string, a sequence of characters.
Ways to declare one:

s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')

>>> type(u'中文')
<type 'unicode'>
Its length is the number of characters, which is what you really want in program logic:

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
Conclusion
Understand whether what you are processing is str or unicode, and use the matching method (str.decode / unicode.encode).
Here is how to determine whether a value is unicode or str:
>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False
>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
A simple rule: do not call encode on a str, and do not call decode on a unicode (strictly speaking, a str can be encoded; see the end of this post for details. For simplicity it is not recommended.)
>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
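As an aside (my addition, not in the original article): Python 3 removed these implicit conversions entirely. Text has no decode method and bytes has no encode method, so the mistakes above fail immediately with an AttributeError instead of a confusing ascii-codec error. A quick check under Python 3 semantics:

```python
# Under Python 3, the conversion methods exist only in the sensible direction.
print(hasattr(u'\u4e2d\u6587', 'encode'))   # text can be encoded: True
print(hasattr(b'\xe4\xb8\xad', 'decode'))   # bytes can be decoded: True
print(hasattr(u'\u4e2d\u6587', 'decode'))   # no str.decode in Py3: False
print(hasattr(b'\xe4\xb8\xad', 'encode'))   # no bytes.encode in Py3: False
```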
To convert between two different encodings, use unicode as the intermediate form:

# s is a str in code_A
s.decode('code_A').encode('code_B')
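For instance, converting UTF-8 bytes to GBK bytes pivots through unicode. A sketch (utf-8 and gbk are real stdlib codec names; the byte values are the UTF-8 and GBK encodings of u'中文'):

```python
# Convert UTF-8-encoded bytes to GBK-encoded bytes via the unicode pivot.
utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'            # u'中文' in UTF-8
gbk_bytes = utf8_bytes.decode('utf-8').encode('gbk')
print(repr(gbk_bytes))                              # the GBK bytes \xd6\xd0\xce\xc4
```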
File handling, IDE and console
During processing, you can think of a running Python program as a pool with one entrance and one exit.
At the entrance, convert everything to unicode; inside the pool, everything is unicode; at the exit, encode to the target encoding (with the exception, of course, of any specific encoding required by the processing logic itself).
Read external input and decode it to unicode for processing (the internal representation is uniformly unicode), then encode to the desired target encoding and send it to the target output (file or console).
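A minimal sketch of that pipeline (the file names are hypothetical; io.open exists in both Python 2.6+ and Python 3 and performs the decode/encode at the boundary for you):

```python
# -*- coding: utf-8 -*-
import io

# Set up a sample input file holding UTF-8 bytes (hypothetical name).
with io.open('pool_in.txt', 'wb') as f:
    f.write(u'hello \u4e2d\u6587\n'.encode('utf-8'))

# Entrance: io.open with an encoding decodes to unicode as it reads.
with io.open('pool_in.txt', 'r', encoding='utf-8') as f:
    text = f.read()            # unicode in Py2, str in Py3

# Pool: all logic works on unicode.
text = text.upper()            # u'HELLO \u4e2d\u6587\n'

# Exit: encode to the target encoding only on the way out.
with io.open('pool_out.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```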
When the IDE or console reports an error on print, it is because the output encoding differs from the one the IDE/console uses. Convert the output to the matching encoding and it prints normally:

>>> print u'中文'.encode('gbk')
???
>>> print u'中文'.encode('utf-8')
中文
Advice

Standardize the encoding
Unify the encoding used at every link to prevent any one of them from causing garbled text: the environment encoding, the IDE/text editor, file encodings, and database/table encodings.

Make sure of the source file's encoding
This is very important.
The default encoding of a .py file is ASCII. If you use non-ASCII characters in a source file, you need to declare the encoding in the file header, and the declaration must be on line 1 or line 2 of the file. Without it, you get an error like:

File "XXX.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file XXX.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
The declaration looks like:

# -*- coding: utf-8 -*-
or
# coding=utf-8
If the header declares coding=utf-8, then for a = '中文' the encoding of a is utf-8.
If the header declares coding=gb2312, then for a = '中文' the encoding of a is gbk.
So: use one identical encoding declaration in the headers of all source files in a project, and make sure the declared encoding matches the encoding the source file is actually saved in (this depends on your editor).
For hardcoded strings processed in the source code, uniformly use unicode.
This separates their type from the encoding of the source file itself, so each stage of the process can be handled independently:

if s == u'中文':  # not s == '中文'
    pass         # note: by the time s gets here, make sure it is unicode

After the steps above, you only need to care about two encodings: unicode and the single target encoding you settled on.
Processing order
1. Decode early
2. Unicode everywhere
3. Encode late
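The three steps can be captured in a tiny boundary helper (the name to_text is my own, hypothetical): anything that might be bytes is decoded once at the edge, and the rest of the program only ever sees text.

```python
def to_text(s, encoding='utf-8'):
    """Decode early: turn incoming bytes into text; pass text through as-is.

    In Python 2, str is a bytes type, so encoded input gets decoded;
    in Python 3 the same check catches bytes objects.
    """
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s

# Unicode everywhere: both spellings compare equal once normalized.
assert to_text(b'\xe4\xb8\xad\xe6\x96\x87') == to_text(u'\u4e2d\u6587')
```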
Related modules and some methods
Getting and setting the system default encoding:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
str.encode('other_coding')
In Python 2, directly encoding a str of one encoding into a str of another encoding actually executes:

# str_A is utf-8
str_A.encode('gbk')
# is executed as
str_A.decode('sys_codec').encode('gbk')
# where sys_codec is sys.getdefaultencoding()

"Getting and setting the system default encoding" above is related to this behavior of str.encode, but I seldom use it this way: it is too complicated to keep under control, and it is easier to decode input explicitly and encode output explicitly (personal opinion).
chardet
A module for detecting file encodings; it must be downloaded separately.

>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
\u literals and unicode strings

>>> u'中'
u'\u4e2d'
>>> s = '\u4e2d'
>>> print s.decode('unicode_escape')
中
>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
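Note that unicode_escape operates on data that literally contains backslash-u sequences, as often seen in JSON-ish logs. A sketch that works under both Python 2 and Python 3 (in Python 3 the escaped form must be a bytes literal):

```python
# A byte string containing the literal characters \u4e2d\u6587
# (12 ASCII bytes), not real Chinese characters yet.
escaped = b'\\u4e2d\\u6587'

# unicode_escape interprets the \uXXXX sequences into real characters.
decoded = escaped.decode('unicode_escape')
assert decoded == u'\u4e2d\u6587'
assert len(decoded) == 2
```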
The above is a write-up of encoding handling in Python; related material will continue to be added. Thank you for supporting this site!