A method for removing the u prefix from a unicode string in Python

  • 2020-12-21 18:06:36
  • OfStack

Sometimes we run into unicode strings like this:


u'\xe4\xbd\xa0\xe5\xa5\xbd'

This is clearly not a correct unicode string: each character holds what looks like one byte of UTF-8 encoded text, so it was probably transcoded incorrectly somewhere.
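
One common way such a string arises is that UTF-8 bytes were decoded as Latin-1. This is offered as an assumption, not a diagnosis of any particular case. A minimal sketch in Python 3 syntax (the session in this article is Python 2) that reproduces the broken string:

```python
# Python 3 sketch: reproduce the broken string by decoding UTF-8 bytes as Latin-1.
original = '\u4f60\u597d'              # the intended two-character text
utf8_bytes = original.encode('utf-8')  # six bytes: e4 bd a0 e5 a5 bd
broken = utf8_bytes.decode('latin-1')  # each byte becomes one code point

print(ascii(broken))  # prints '\xe4\xbd\xa0\xe5\xa5\xbd'
```

The result has six characters instead of two, which matches the string shown above.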

To get the correct unicode string, we must first convert this string to a non-unicode (byte) string and then decode it. Encoding it the usual way does not work, because it is not a correct unicode string:


In [1]: u'\xe4\xbd\xa0\xe5\xa5\xbd'.encode('utf8')
Out[1]: '\xc3\xa4\xc2\xbd\xc2\xa0\xc3\xa5\xc2\xa5\xc2\xbd'

In [2]: print u'\xe4\xbd\xa0\xe5\xa5\xbd'.encode('utf8')
ä½ å¥½

So how do we get the byte string we want, '\xe4\xbd\xa0\xe5\xa5\xbd'?

Python provides a special codec (raw_unicode_escape) that can handle this situation:

In [4]: u'\xe4\xbd\xa0\xe5\xa5\xbd'.encode('raw_unicode_escape')
Out[4]: '\xe4\xbd\xa0\xe5\xa5\xbd'

In [5]: u'\xe4\xbd\xa0\xe5\xa5\xbd'.encode('raw_unicode_escape').decode('utf8')
Out[5]: u'\u4f60\u597d'

In [7]: print u'\u4f60\u597d'
你好
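
The same repair can be sketched in Python 3 syntax, where every string literal is already unicode and the u prefix is optional. The helper name `repair_mojibake` is hypothetical and chosen only for illustration. It assumes every code point in the input is below U+0100 and that the underlying bytes are valid UTF-8.

```python
def repair_mojibake(text):
    """Hypothetical helper: recover text whose code points are really UTF-8 bytes."""
    # raw_unicode_escape writes each code point below U+0100 as the byte
    # with the same value, which turns the string back into raw bytes.
    raw = text.encode('raw_unicode_escape')
    # Decode those bytes with the encoding they were originally written in.
    return raw.decode('utf-8')


broken = '\xe4\xbd\xa0\xe5\xa5\xbd'
fixed = repair_mojibake(broken)
print(ascii(fixed))  # prints '\u4f60\u597d'

# For input like this, the latin-1 codec produces the same bytes.
same_bytes = broken.encode('latin-1') == broken.encode('raw_unicode_escape')
```

The two codecs differ only for code points above U+00FF: latin-1 raises UnicodeEncodeError, while raw_unicode_escape writes a literal \uXXXX escape sequence into the bytes, so the choice depends on whether a loud failure is preferred for unexpected input.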
