Causes of and fixes for garbled urllib.unquote output in Python
- 2020-05-30 20:26:35
Found the problem
The author ran into this problem in a real Tornado application. After the browser sent the request to the backend, the parameter value was read like this:
name = self.get_argument("name", "")
name = urllib.unquote(name)
# save to db
Printed, the value of name reads "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92", which looks fine, but it ends up in the database as the garbled string "æ³•å›½çº¢é...". This was puzzling.
So I tried passing the encoded string in directly to see whether it would still come out garbled:
name = '%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92'
name = urllib.unquote(name)
# save to db
Processed this way, there is no problem: the decoded value of name is "法国红酒" ("French wine"). After some thought, the only possible cause was the self.get_argument("name", "") call. By default, get_argument returns a value of type unicode, and when unquote is given a value of type unicode it returns:
u'\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'
Note: the return value is of type unicode. In other words, when unquote receives a unicode argument it returns unicode, and each '%XX' escape simply becomes the single character '\xXX'.
Every character in this string has a code point below 256, written in hexadecimal, so each byte of the original UTF-8 sequence has been turned into one extended ASCII (Latin-1) character. Look up which characters 'e6', 'b3', ... correspond to; http://www.ascii-code.com is a handy reference. 'e6' is an extended ASCII character in the range 128-255, and its symbol is 'æ':
DEC  OCT  HEX  BIN       Symbol
230  346  E6   11100110  æ
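The byte-versus-character confusion described above can be reproduced in a few lines. This is only a sketch, and it assumes Python 3, where the percent-decoding functions live in urllib.parse rather than in the Python 2 urllib module used in this article. It decodes the same escapes to raw bytes and then reads those bytes both ways:

```python
from urllib.parse import unquote_to_bytes

quoted = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"

# The percent escapes stand for 12 raw bytes (UTF-8 for 4 characters).
raw = unquote_to_bytes(quoted)

# One character per byte: the Latin-1 reading that produces the garbage.
as_latin1 = raw.decode("latin-1")

# Three bytes per character: the intended UTF-8 reading.
as_utf8 = raw.decode("utf-8")

print(len(as_latin1), hex(ord(as_latin1[0])))  # 12 characters, the first is 0xe6
print(as_utf8 == "\u6cd5\u56fd\u7ea2\u9152")   # True

# Text that was garbled this way can be repaired by reversing the wrong step.
print(as_latin1.encode("latin-1").decode("utf-8") == as_utf8)  # True
```

The last line only works because Latin-1 maps every byte to exactly one character, so no information is lost by the wrong decoding.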
Now it should be clear where the garbled characters come from.
The fix: after calling the get_argument method, convert the returned value to type str:
name = self.get_argument("name", "")
name = str(name)
name = urllib.unquote(name)
# save to db
Once name is of type str, unquote returns the raw bytes as a str, and decoding them as UTF-8 gives the expected text. It is equivalent to calling:
>>> '\xe6\xb3\x95\xe5\x9b\xbd\xe7\xba\xa2\xe9\x85\x92'.decode("utf-8")
u'\u6cd5\u56fd\u7ea2\u9152'
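For comparison, here is a rough Python 3 sketch of the same fix. It is not the article's code: unquote_utf8 is a hypothetical helper name, and the functions come from urllib.parse, which is an assumption about the reader's environment. The idea is the same as above: get to the raw bytes first, then decode them as UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes


def unquote_utf8(value):
    """Hypothetical helper: percent-decode to bytes, then read them as UTF-8."""
    return unquote_to_bytes(value).decode("utf-8")


name = "%E6%B3%95%E5%9B%BD%E7%BA%A2%E9%85%92"
fixed = unquote_utf8(name)
print(fixed == "\u6cd5\u56fd\u7ea2\u9152")  # True

# Python 3's unquote decodes as UTF-8 by default, so it agrees with the helper.
print(unquote(name) == fixed)  # True

# Asking for Latin-1 explicitly brings the garbled result back.
print(unquote(name, encoding="latin-1") == fixed)  # False
```

One caveat for the Python 2 version above: str(name) raises UnicodeEncodeError if name contains non-ASCII characters, so it is only safe while the value is still fully percent-encoded.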