Causes of and solutions to garbled output from urllib.unquote in Python

2020-05-30 20:26:35
OfStack

Discovering the problem

The urllib module in Python handles URL operations. Its unquote method plays the same role as decodeURIComponent in JavaScript: it decodes a URL by replacing each "%xx" escape with the corresponding single character. For example, "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92" decodes to "法国红酒" (French red wine). Used incorrectly, however, it produces garbled output such as "æ³•å›½çº¢é…’".
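
As a minimal sketch of the decoding described above, assuming Python 3 (where the function lives in urllib.parse rather than at the top level of urllib as in Python 2), the round trip looks like this:

```python
from urllib.parse import quote, unquote

encoded = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# By default Python 3 reads the escaped bytes as UTF-8.
decoded = unquote(encoded)
print(decoded)

# Encoding the text again gives back the original percent-escapes.
round_trip = quote(decoded)
print(round_trip)
```

Because Python 3 decodes the escaped bytes as UTF-8 by default, this form does not show the problem discussed in this article.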

The author ran into this problem in a real Tornado application. After the browser sent the request to the backend, the parameter value was read like this:


name = self.get_argument("name", "")
name = urllib.unquote(name)
# save to db

The printed value of name is "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92", which looks fine, yet what ends up in the database is the garbled string "æ³•å›½çº¢é…’". This was genuinely puzzling.

Cause analysis

So I tried feeding the percent-encoded string to unquote directly, as a literal, to see whether the result would still be garbled:


name = '%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92'
name = urllib.unquote(name)
# save to db

This works without any problem: after decoding, the value of name is "法国红酒" (French red wine). After some thought, the only remaining suspect was the call self.get_argument("name"). It turns out that get_argument returns a value of type unicode by default, and when unquote is given a unicode value it returns:


u'\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'

Note that the return value is of type unicode: when unquote receives a unicode argument, it returns unicode in which each '%XX' escape has simply been replaced by the single character '\xXX'. In other words:


u"%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

becomes:


u"\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92"
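
The same one-byte-to-one-character mapping can be reproduced on Python 3, purely as an illustration. Passing encoding="latin-1" to urllib.parse.unquote makes each escaped byte become the code point with the same value; this is only a stand-in for what the Python 2 unquote does with a unicode argument, not the same code path:

```python
from urllib.parse import unquote

encoded = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# Illustration only: Latin-1 maps byte 0xNN to code point U+00NN,
# mimicking the '%XX' -> '\xXX' replacement described in the article.
per_byte = unquote(encoded, encoding="latin-1")
print(repr(per_byte))
print(len(per_byte))   # one character per escaped byte

# Decoding the same escapes as UTF-8 groups the bytes into multi-byte characters.
as_utf8 = unquote(encoded, encoding="utf-8")
print(len(as_utf8))
```

The twelve escaped bytes become twelve separate characters in the first case and four Chinese characters in the second.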

u"\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92" is simply a string of characters whose codes all lie below 256, written in hexadecimal, so let's look at which characters 'e6', 'b3', ... correspond to. According to http://www.ascii-code.com, 'e6' is an extended ASCII character in the range 128-255, and its symbol is 'æ':


DEC OCT HEX BIN Symbol 
230 346 E6 11100110 æ

Now you should understand why garbled characters are generated:


æ³•å›½çº¢é…’
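
To make the mapping concrete, the following Python 3 sketch takes the UTF-8 bytes of the original text, rebuilds the table row for the first byte, and then reads every byte as a separate character. The choice of Windows-1252 for display is an assumption about the terminal or database involved; it is used here because it turns bytes such as 0x95 into visible symbols like '•'.

```python
text = "\u6cd5\u56fd\u7ea2\u9152"      # the four characters 法国红酒
utf8_bytes = text.encode("utf-8")      # 12 bytes, three per character

# Row for the first byte, in the same DEC / OCT / HEX / BIN / Symbol layout.
first = utf8_bytes[0]
row = (first, format(first, "o"), format(first, "X"), format(first, "08b"), chr(first))
print(row)

# Assumption: the bytes are displayed using Windows-1252, one character per byte.
garbled = utf8_bytes.decode("cp1252")
print(garbled)
```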

The solution

After calling self.get_argument('name'), convert the returned value to type str:


name = self.get_argument("name", "")
name = str(name)
name = urllib.unquote(name)
# save to db

Once name is a str, unquote returns the byte string shown below, which is valid UTF-8; decoding it yields the expected characters:


>>> '\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'.decode("utf-8")
u'\u6cd5\u56fd\u7ea2\u9152'

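On Python 3 the same idea as the fix above (turn the escapes into raw bytes first, then interpret those bytes as UTF-8) can be sketched with urllib.parse.unquote_to_bytes. The helper name unquote_utf8 is hypothetical and is not part of urllib or Tornado:

```python
from urllib.parse import unquote_to_bytes


def unquote_utf8(value):
    """Hypothetical helper: percent-decode to raw bytes, then decode as UTF-8."""
    raw = unquote_to_bytes(value)   # accepts str or bytes, always returns bytes
    return raw.decode("utf-8")


name = unquote_utf8("%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92")
print(name)
```

Keeping the byte step explicit makes it clear where the UTF-8 interpretation happens, which is exactly the step that went missing in the garbled case.
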
Conclusion