A small Python script for URL decoding and reading URL-encoded Chinese text

  • 2020-04-02 13:10:22
  • OfStack

 
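#! python 
# -*- coding: utf8 -*- 
# Python 2: decode the UTF-8 byte string, re-encode it as GBK, then turn each \xNN in the repr into %NN 
# (an ASCII-only string like this one produces no \xNN escapes; non-ASCII text such as Chinese does) 
print(repr(" Test alarm, xxxx Is a big pig ".decode("UTF8").encode("GBK")).replace("\\x","%")) 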


Note that the codec passed to the first decode("UTF8") must match the encoding declared at the top of the file.
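The one-liner above relies on Python 2 byte strings. On Python 3, a rough equivalent can be sketched with urllib.parse.quote, which accepts an encoding argument. This is only an illustrative sketch, not the original author's code: the helper name percent_encode and the sample character are assumptions.

```python
from urllib.parse import quote

def percent_encode(text, encoding="gbk"):
    """Encode `text` with `encoding` and percent-escape the resulting bytes.

    Unreserved ASCII characters (letters, digits, "_.-~") are left as they are.
    """
    # safe="" makes quote() escape "/" as well
    return quote(text, safe="", encoding=encoding)

sample = "\u95e8"  # the character 门, chosen only as an example
print(percent_encode(sample, "utf-8"))  # three escapes: UTF-8 uses 3 bytes here
print(percent_encode(sample, "gbk"))    # two escapes: GBK uses 2 bytes here
```

The same character yields a different number of escapes under each encoding, which is why the receiving side has to know which codec was used.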

I first ran into this problem in a small JavaScript puzzle game. The hint for one level read:

The first few levels were very, very simple ~ ~ this level is just a simple string transformation...

It was followed by a long string beginning with %5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684.
Strings like this show up all the time in the browser's address bar, but I had never known how to turn them into something readable.
After some searching on Google, I combined Python's URL decoding with its unicode-escape decoding and arrived at the following solution:


import urllib 
escaped_str="%5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684%5Cu9875%5Cu9762%5Cu540d%5Cu5b57%5Cu662f%5Cx20%5Cx69%5Cx32%5Cx6a%5Cx62%5Cx6a%5Cx33%5Cx69%5Cx34%5Cx62%5Cx62%5Cx35%5Cx34%5Cx62%5Cx35%5Cx32%5Cx69%5Cx62%5Cx33%5Cx2e%5Cx68%5Cx74%5Cx6d"
# unquote turns %5C into a backslash; unicode-escape then interprets the \uXXXX and \xXX sequences 
print urllib.unquote(escaped_str).decode('unicode-escape') 
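The two steps (undo the percent-encoding, then interpret the backslash escapes) can also be sketched on Python 3. The function name decode_escaped is hypothetical, and the sketch assumes the unquoted text is pure ASCII, as it is for the puzzle string; anything else would make the encode("ascii") step fail.

```python
from urllib.parse import unquote

def decode_escaped(escaped_str):
    """Undo percent-encoding, then interpret the backslash escapes that remain."""
    # Step 1: "%5Cu4e0b" becomes a backslash followed by "u4e0b"
    literal = unquote(escaped_str)
    # Step 2: the unicode_escape codec works on bytes, so go through ASCII first
    return literal.encode("ascii").decode("unicode_escape")

# A shortened prefix of the puzzle string, plus two of its \xXX escapes
print(decode_escaped("%5Cu4e0b%5Cu4e00%5Cu5173%5Cx20%5Cx69"))
```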

Recently I was going through the Chinese entries in gfwlist, the list used by the AutoProxy Firefox plugin (anyone who has used a proxy will know it). The sites in it are URL-encoded, for example http://zh.wikipedia.org/wiki/%E9%97%A8, so a regular expression is needed to pick out the URL-encoded Chinese characters. I wrote the following small script:


import urllib 
import re 
# A URL-encoded Chinese character looks like: percent sign + 2 hexadecimal digits, repeated 3 times 
pattern=re.compile(r"((%[0-9A-Fa-f]{2}){3,})") 
with open("listfile","r") as f: 
    for url_str in f: 
        match=pattern.findall(url_str) 
        # findall returns an empty list (never None) when nothing matches 
        if match: 
            # Each item is a tuple of groups; item [0] is the whole percent-encoded run 
            for trans in match: 
                print urllib.unquote(trans[0]), 
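For comparison, here is a minimal Python 3 sketch of the same extraction that works on a single string instead of a file. The names ENCODED_RUN and extract_encoded_text are made up for illustration, and the sketch assumes the escapes are UTF-8, which is what unquote uses by default.

```python
import re
from urllib.parse import unquote

# Three or more consecutive %XX escapes, enough for one 3-byte UTF-8 character
ENCODED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2}){3,}")

def extract_encoded_text(url_str):
    """Return the decoded text of every percent-encoded run in `url_str`."""
    # unquote() decodes as UTF-8 by default and replaces invalid bytes with U+FFFD
    return [unquote(run) for run in ENCODED_RUN.findall(url_str)]

print(extract_encoded_text("http://zh.wikipedia.org/wiki/%E9%97%A8"))
```

Using a non-capturing group means findall returns the matched runs directly, so there is no tuple to index.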

However, the script still has shortcomings. In the test code below, for example, one of the URLs taken from the list file still cannot be decoded properly:


import urllib 
a="http://zh.wikipedia.org/wiki/%BD%F0%B6"
b="http://zh.wikipedia.org/wiki/%E9%97%A8"
de=urllib.unquote 
print de(a),de(b) 

The output shows that the former is decoded correctly while the latter is not. My guess is that this has something to do with Big5 encoding. If anyone knows a solution, please let me know ~

Here are the additions:

de(a).decode("GBK", "ignore")
de(b).decode("utf8", "ignore")

These calls turn the byte strings into the corresponding unicode strings.
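On Python 3 the codec can be passed to unquote directly, so the two decode calls above might be sketched as follows. The wrapper name unquote_as is an assumption for illustration; the encoding and errors parameters are part of urllib.parse.unquote.

```python
from urllib.parse import unquote

def unquote_as(url_str, encoding):
    """Percent-decode `url_str`, interpreting the resulting bytes as `encoding`."""
    # errors="ignore" silently drops bytes that are invalid in that encoding
    return unquote(url_str, encoding=encoding, errors="ignore")

a = "http://zh.wikipedia.org/wiki/%BD%F0%B6"
b = "http://zh.wikipedia.org/wiki/%E9%97%A8"
print(unquote_as(a, "gbk"))    # the dangling %B6 is an incomplete GBK pair and is dropped
print(unquote_as(b, "utf-8"))
```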

The unquote you are using is not a character decoder: it only undoes the percent-escaping, so you still need to decode (and encode) the bytes yourself. My default environment is utf8, and I am guessing yours is GBK, which is why the latter URL fails for you. Guessing encodings is tiring. It would be nice if everyone used utf8, but some people still use the gb encodings.

(link: http://yac163.svn.sourceforge.net/viewvc/yac163/trunk/yac163-nox/Pic.py?revision=198&view=markup)

The code below is very old and does not pass "ignore" to decode.


def strdecode(string, charset=None):
    # Return a unicode string; try the given charset first, then fall back to guessing
    if isinstance(string, unicode):
        return string
    if charset:
        try:
            return string.decode(charset)
        except UnicodeDecodeError:
            return _strdecode(string)
    else:
        return _strdecode(string)

def _strdecode(string):
    # Guess the encoding: utf8 first, then the gb family from narrowest to widest
    try:
        return string.decode('utf8')
    except UnicodeDecodeError:
        try:
            return string.decode('gb2312')
        except UnicodeDecodeError:
            try:
                return string.decode('gbk')
            except UnicodeDecodeError:
                return string.decode('gb18030')

def strencode(string, charset=None):
    # Return a byte string; try the given charset first, then fall back to guessing
    if isinstance(string, str):
        return string
    if charset:
        try:
            return string.encode(charset)
        except UnicodeEncodeError:
            return _strencode(string)
    else:
        return _strencode(string)

def _strencode(string):
    # Same fallback order as _strdecode, applied to encoding
    try:
        return string.encode('utf8')
    except UnicodeEncodeError:
        try:
            return string.encode('gb2312')
        except UnicodeEncodeError:
            try:
                return string.encode('gbk')
            except UnicodeEncodeError:
                return string.encode('gb18030')
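The nested try/except chain can be flattened into a loop over candidate encodings. The following is a minimal Python 3 sketch of that idea, not a port of the code above: the name decode_with_fallback and the default candidate list are assumptions. Because gb18030 is a superset of gbk and gb2312, the sketch lists only gb18030 after utf-8.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    if isinstance(raw, str):
        return raw, None  # already text, nothing to decode
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the input")

text, used = decode_with_fallback(b"\xe9\x97\xa8")
print(text, used)
```

Trying utf-8 first matters: its byte patterns are strict, so bytes in another encoding usually fail fast, whereas the gb codecs accept many byte sequences that were never meant for them.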

