Solving character set conversion problems when crawling web pages with Python

  • 2020-04-02 13:46:05
  • OfStack

The problem:

      Sometimes we scrape web pages and, after processing, save the resulting strings to a file or write them to a database. At that point we need to set the encoding of the string. If the scraped page is encoded as gb2312 and our database uses utf-8, inserting the data without any processing may produce garbled text (I have not tested this, so I don't know whether the database converts automatically). We therefore need to convert gb2312 to utf-8 by hand.

First of all, recall that the default encoding in Python 2 is ASCII. That is fine for English text, but it breaks as soon as Chinese characters appear.
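A minimal sketch of that failure, written so that it should run on both Python 2.6+ and Python 3. The sample text (the two characters 你好, spelled with escapes) and the name ascii_ok are my own choices for illustration, not part of the original example.

```python
text = u"\u4f60\u597d"  # the Chinese characters 你好, chosen for illustration

# English text fits in ASCII without trouble.
english = u"hello".encode("ascii")

# Chinese text does not: the ASCII codec has no bytes for these characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
```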

As you may remember, to print Chinese characters in Python 2 you put a u in front of the string:


# -*- coding: utf-8 -*-
print u"你好"

This way the Chinese characters are displayed. The u prefix turns the literal that follows into a unicode string, so the Chinese characters display correctly.
Closely related is the unicode() function, which is used as follows:


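# -*- coding: utf-8 -*-
str = "你好"                 # a byte string, in the encoding of the source file
str = unicode(str, "utf-8")  # decode those bytes into a unicode string
print str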

The difference from the u prefix is that when you convert str with unicode(), you must pass the correct source encoding as the second argument. Here utf-8 is the character set in which my test.py script itself is saved; on your machine the default may be ANSI.
Unicode is the key to everything that follows, so let's continue.
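To see why the second argument matters, here is a small sketch. It uses the bytes method decode(), which in Python 2 does the same job as unicode(str, encoding) and also exists in Python 3, so the snippet should run on either. The sample bytes (你好 encoded as gb2312) and the variable names are assumptions for illustration.

```python
raw = b"\xc4\xe3\xba\xc3"  # 你好 encoded as gb2312 (illustrative sample bytes)

# Correct source encoding: the bytes decode to the expected two characters.
right = raw.decode("gb2312")

# Wrong source encoding: these bytes are not valid utf-8, so decoding fails.
try:
    raw.decode("utf-8")
    wrong_failed = False
except UnicodeDecodeError:
    wrong_failed = True

print(wrong_failed)
```

A wrong encoding does not always raise an error. Some codecs will decode the bytes into the wrong characters without complaint, which is how garbled text ends up in files and databases.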

Now let's crawl the Baidu home page. Note that if you visit the Baidu home page and view its source, you will see charset=gb2312.


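import urllib2

def main():
    f = urllib2.urlopen("http://www.baidu.com")
    str = f.read()                 # raw bytes of the page, encoded as gb2312
    f.close()
    str = unicode(str, "gb2312")   # decode the bytes into a unicode string
    fp = open("baidu.html", "wb")  # binary mode, so the bytes are written unchanged
    fp.write(str.encode("utf-8"))  # encode as utf-8 before writing
    fp.close()

if __name__ == '__main__':
    main()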

Explanation:
We first call urllib2.urlopen() to fetch the Baidu home page. f is a file-like handle, and str = f.read() reads the entire page source into str.

To be clear, str holds the HTML source we fetched. Because the character set of the page is gb2312, saving str directly to a file gives a file whose encoding shows as ANSI.

For most people that is enough, but sometimes you want to convert gb2312 to utf-8. How is that done?

First of all:
      str = unicode(str, "gb2312") # gb2312 is the actual character set of str, which we now decode to unicode

And then:
      str = str.encode("utf-8") # encode the unicode string as utf-8

Finally:

      Write str to a file. Open the file and check its encoding properties: it is now utf-8. Change <meta charset="gb2312" to <meta charset="utf-8" in the source and you have a utf-8 web page. Having done all this, you have carried out one gb2312 -> utf-8 transcoding.
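The decoding, the charset edit, and the re-encoding can be sketched as one function. This is only an illustration: to_utf8_page, its parameters, and the sample page are hypothetical names of mine, and the regular expression is a simple assumption about how the charset declaration is written, not a full HTML parser. It uses decode()/encode() so that it should run on Python 2.6+ and Python 3 alike.

```python
import re

def to_utf8_page(raw, source_encoding="gb2312"):
    """Decode raw page bytes, update the charset declaration, return utf-8 bytes."""
    text = raw.decode(source_encoding)
    # Point the declared charset at utf-8 so it matches the new bytes.
    pattern = r'(?i)(charset\s*=\s*["\']?)' + re.escape(source_encoding)
    text = re.sub(pattern, r'\1utf-8', text)
    return text.encode("utf-8")

# Hypothetical sample page, encoded as gb2312 to stand in for the fetched bytes.
sample = u'<meta charset="gb2312"><title>\u767e\u5ea6</title>'.encode("gb2312")
page = to_utf8_page(sample)
print(page.decode("utf-8"))
```

Updating the declaration matters because a browser that trusts a stale charset="gb2312" would misread the new utf-8 bytes.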


Conclusion:

      To recap, saving a string in a specific character set takes the following steps:

      1: decode str into a unicode string using unicode(str, "original encoding")

      2: encode the unicode string str into the character set you want using str.encode("target character set")

      3: save str to a file, write it to a database, and so on. By this point it is already in the encoding you specified.
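The three steps above can be collected into one small helper. This is a sketch under my own assumptions: save_as and its parameter names are hypothetical, the sample bytes are 你好 in gb2312, and a temporary directory stands in for the real destination. It uses decode(), the method form of unicode(str, encoding), so it should run on Python 2.6+ and Python 3.

```python
import os
import tempfile

def save_as(data, source_encoding, target_encoding, path):
    """Decode data, re-encode it, and write the resulting bytes to path."""
    text = data.decode(source_encoding)        # step 1: bytes -> unicode
    converted = text.encode(target_encoding)   # step 2: unicode -> target bytes
    with open(path, "wb") as fp:               # step 3: binary mode keeps the bytes intact
        fp.write(converted)
    return converted

out_path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # throwaway location
written = save_as(b"\xc4\xe3\xba\xc3", "gb2312", "utf-8", out_path)
print(len(written))
```

Opening the file in binary mode is a deliberate choice here: the data has already been encoded, so nothing should translate it again on the way to disk.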

