How Python converts Unicode escape sequences to Chinese, and the dangers of changing the default encoding

  • 2020-05-30 20:31:23
  • OfStack

This article describes how Python converts Unicode escape sequences to Chinese and what changing the default encoding involves. It is shared here for your reference:

1. When a crawler scrapes web pages, it often needs to turn strings such as "\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8" into Chinese. These are Unicode escape sequences for the Chinese characters 人生苦短，py是岸 ("Life is short, py is the shore"), and they can be converted in either of two ways:

1.


>>> s = u'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'
>>> print s
人生苦短，py是岸

2.


>>> s = r'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'
>>> s = s.decode('unicode_escape')
>>> print s
人生苦短，py是岸
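For readers on Python 3, the same conversion still works, but unicode_escape is a bytes-to-text codec there, so the raw string must be encoded to bytes first. A minimal Python 3 sketch:

```python
# Python 3 sketch: turning \uXXXX escape sequences back into Chinese text.
s = r'\u4eba\u751f\u82e6\u77ed\uff0cpy\u662f\u5cb8'

# 'unicode_escape' decodes bytes, so first map the ASCII escape text to
# bytes (latin-1 is a lossless one-byte-per-character mapping).
decoded = s.encode('latin-1').decode('unicode_escape')
print(decoded)  # 人生苦短，py是岸
```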

2. Another error frequently met in Python 2 character-encoding problems is "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)".

A common workaround is:


import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This changes Python 2's default encoding from ASCII to UTF-8. But it is not a cure-all, and it can make some code behave strangely.
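Rather than changing the global default, the error can usually be fixed locally by converting explicitly at the point where bytes meet text. A minimal sketch (Python 3 syntax; in Python 2 the same encode/decode calls apply to the unicode and str types):

```python
# Explicit conversion at the boundary, instead of a global default change.
text = u'人生苦短'            # a text string
data = text.encode('utf-8')   # text -> bytes, encoding spelled out
back = data.decode('utf-8')   # bytes -> text, same explicit encoding
print(back == text)           # a lossless round trip
```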

A supplement on sys.setdefaultencoding('utf-8'):

Calling sys.setdefaultencoding('utf-8') leads to two big problems.

Simply put, it makes some code behave strangely, and the resulting bugs are invisible and hard to fix. Here are two examples.

1. Encoding errors


# -*- coding: utf-8 -*-
import chardet

def print_string(string):
  try:
    print(u"%s" % string)
  except UnicodeError:
    # Detect the most likely encoding and decode with it instead
    print(u"%s" % unicode(string, encoding=chardet.detect(string)['encoding']))

print_string(u"Ã¾".encode("latin-1"))

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

print_string(u"Ã¾".encode("latin-1"))

Output:


$~ Ã¾
$~ þ

In the code above, the byte string cannot be decoded with the default ascii codec: "Ã¾" encoded as latin-1 is the hex bytes c3 be, which fall well outside the 128-character ascii set, so a UnicodeError is thrown and the exception handler runs. The handler detects the most likely encoding and decodes with it, reliably printing Ã¾.

However, once we set defaultencoding to utf-8, no exception is raised: utf-8 can decode these bytes, so Python uses it directly, and c3 be in utf-8 is þ. We end up printing a completely different character.
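The ambiguity is easy to reproduce: the same two bytes decode to different text depending on the codec, which is exactly why silently swapping the default decoder changes program behavior. A Python 3 sketch:

```python
# The bytes c3 be are valid in both latin-1 and utf-8, but mean
# different things: two characters in one codec, one in the other.
data = b'\xc3\xbe'
as_latin1 = data.decode('latin-1')  # 'Ã¾' (two latin-1 characters)
as_utf8 = data.decode('utf-8')      # 'þ'  (one character, U+00FE)
print(as_latin1, as_utf8)
```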

You might say we would never write code like this, and if we did, we would fix it. But what if a third-party library does it? We can hardly fix every third-party library the project depends on, and even if we avoid such libraries, the bug is still lurking.

2. Dictionaries behave strangely

Suppose we want to check whether a key exists in a dictionary. In general, there are two ways to do it:


# -*- coding: utf-8 -*-
d = {1: 2, '1': '2', '你好': 'hello'}  # '你好' means "hello"

def key_in_dict(key):
  if key in d:
    return True
  return False

def key_found_in_dict(key):
  for _key in d:
    if _key == key:
      return True
  return False

Let's compare the output of the two functions before and after changing the system default encoding.


# -*- coding: utf-8 -*-
print(key_in_dict('你好'))
print(key_found_in_dict('你好'))
print(key_in_dict(u'你好'))
print(key_found_in_dict(u'你好'))

print('------utf-8------')

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

print(key_in_dict('你好'))
print(key_found_in_dict('你好'))
print(key_in_dict(u'你好'))
print(key_found_in_dict(u'你好'))

Output:


$~ True
$~ True
$~ False
$~ False
$~ ------utf-8------
$~ True
$~ True
$~ False
$~ True

As you can see, after the default encoding is changed, the two functions no longer give the same output.

The in operator of dict hashes the key and compares hash values to determine equality. For characters within the ascii set, the hash is the same whether the key is a byte string or unicode, so u'1' in {'1': 1} returns True; but for characters outside the ascii set, such as '你好' above, the byte string and the unicode string hash differently.

The == operator performs a conversion, turning the byte string ('你好' above) into unicode (u'你好' above) before comparing. Under the default ascii encoding, converting '你好' to unicode fails with "UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal", so the two are treated as unequal. Once we manually change the system encoding to utf-8, this restriction is lifted: '你好' converts successfully and compares equal to u'你好'. The end result is that in and == no longer agree.
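Python 3 removes this trap entirely: there is no implicit conversion, so a byte string and a text string simply compare unequal, and in and == always agree. A sketch:

```python
# In Python 3, bytes and str never compare equal, implicitly or otherwise.
d = {'你好': 'hello'}
key_bytes = '你好'.encode('utf-8')

print(key_bytes in d)        # False: a bytes key never matches a str key
print('你好' in d)           # True
print(key_bytes == '你好')   # False, with no warning and no conversion
```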

The root of the problem: strings in Python 2

Python does many tricky things to keep its syntax simple and easy to use, and blurring the line between byte strings and text strings is one of them.

In Python 2, there are three main string types: unicode (text strings), str (byte strings, i.e. binary data), and basestring, the parent class of the other two.

In language design, whether a sequence of bytes should be treated as a string has long been controversial. Java and C#, as we know them, voted no, while Python sided with the supporters. In fact, many operations we perform on text, such as regular-expression matching and character substitution, make little sense for bytes. Python nevertheless treats bytes as characters, so the two types share the same set of operations.

Going further, Python tries to convert bytes to text automatically when necessary, for example in the == comparison above, or when bytes are concatenated with text. Converting between two different types is impossible without an encoding, so Python needs a default encoding. When Python 2 was born, ASCII was (arguably) the most popular choice, so Python 2 chose ASCII. But as everyone knows, ASCII's mere 128 characters are useless the moment a real conversion is needed.

After years of complaints, Python 3 finally learned its lesson: strings are Unicode by default, and the default conversion encoding is UTF-8, so conversions succeed whenever they are genuinely needed.

Best practices

Having said all that, what can you do if you cannot migrate to Python 3 yet?

Here are some suggestions:

All text strings should be of type unicode, not str. If you are operating on text and its type is str, you are creating bugs.

Convert explicitly when needed: decode from bytes to text with var.decode(encoding), and encode from text to bytes with var.encode(encoding).

Data read from outside is bytes by default, so decode it into the required text as soon as it arrives. Likewise, when text needs to be sent out, encode it into bytes at the last moment.
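The decode-at-input, encode-at-output rule can be sketched with an in-memory byte stream standing in for a file or socket (Python 3 syntax; io.BytesIO here is just a stand-in for any external byte source):

```python
import io

# Bytes arrive from outside; decode them into text immediately.
incoming = io.BytesIO('人生苦短，py是岸'.encode('utf-8'))
text = incoming.read().decode('utf-8')   # bytes -> text at the boundary

# Work on text inside the program; encode only when sending it back out.
outgoing = io.BytesIO()
outgoing.write(text.encode('utf-8'))     # text -> bytes at the boundary
print(text)
```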

PS: here are a few more Unicode conversion tools for your reference:

Unicode/ Chinese conversion tool:
http://tools.ofstack.com/transcoding/unicode_chinese

Native/Unicode online encoding conversion tool:
http://tools.ofstack.com/transcoding/native2unicode

Online Chinese character /ASCII code /Unicode code mutual conversion tool:
http://tools.ofstack.com/transcoding/chinese2unicode


I hope this article is helpful to you in your Python programming.

