Solving garbled text (mojibake) in Python information extraction

  • 2020-06-07 04:43:52
  • OfStack

A solution to garbled text in Python information extraction

Rather than speak in generalities, let me describe my own situation. Yours may differ, so treat this as a reference.

For information scraping I use Python with BeautifulSoup, lxml, re, and urllib2: urllib2 fetches the page whose content I want to extract, lxml or BeautifulSoup parses it, and the extracted content is inserted into MySQL. It looks simple and easy, but here is the nasty part: domestically developed sites often do not take encoding into account, either when declaring the site's encoding or when saving the page source. In one sentence: even if a tool or the page source tells you the page is utf-8, GBK, or something else, do not believe it. A declaration such as <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> guarantees nothing.
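To make the point concrete, here is a minimal sketch that compares what a page claims with what its bytes actually are. It is written for Python 3 using only the standard library, which is an assumption on my part since the snippets in this article are Python 2. The names `declared_charset` and `decodes_cleanly` and the sample page are hypothetical, made up for illustration.

```python
import re

# Matches charset=... inside a <meta> tag; it works on raw bytes,
# so the page does not have to be decoded first.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(raw):
    """Return the charset named in the page's <meta> tag, or None if absent."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode('ascii') if match else None

def decodes_cleanly(raw, charset):
    """Return True if the bytes are valid in the given charset."""
    try:
        raw.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# Hypothetical page: the <meta> tag claims UTF-8, but the body was saved as GBK.
page = (b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
        + '\u4e2d\u6587'.encode('gbk'))

claimed = declared_charset(page)
print(claimed, decodes_cleanly(page, claimed))  # UTF-8 False
```

A clean decode does not prove the declaration is right, since many byte sequences are valid in several encodings, but a failed decode does prove it is wrong.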

Here is the process (I won't go into the specific libraries here):


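import urllib2
import chardet

# Read the raw bytes: chardet.detect expects a byte string, not the response object
html = urllib2.urlopen("http://example.com").read()  # placeholder URL

print chardet.detect(html)  # prints a dictionary such as {'confidence': 0.99999, 'encoding': 'utf-8'}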

Now the encoding of the whole HTML page is known, and the MySQL database I insert into was created as utf-8. Yet an error occurred on insert, because the strings I got back from lxml were not utf-8 but Big5 (a Traditional Chinese encoding), along with other unexpected encodings such as EUC-JP (a Japanese encoding). So I used the unicode method: first decode the field, then re-encode it as utf-8.


# Detect the field's encoding once instead of once per branch
encoding = chardet.detect(name)['encoding']

if encoding in ('GB2312', 'Big5', 'ascii', 'GBK', 'EUC-JP'):
    # Decode with the detected codec, then re-encode as utf-8;
    # 'ignore' drops any bytes that cannot be converted
    name = unicode(name, encoding, 'ignore').encode('utf-8', 'ignore')
else:
    # Any other (or undetected) encoding is treated as unknown
    name = 'unknown'
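The conversion above can be wrapped in a small function. This is a sketch only, written for Python 3 (an assumption, since the article's code is Python 2), where `unicode(name, enc, 'ignore')` becomes `bytes.decode(enc, errors='ignore')`. The names `to_utf8` and `KNOWN_ENCODINGS` are hypothetical, and the `detected_encoding` argument is assumed to be the value of `chardet.detect(raw)['encoding']`; chardet itself is not imported, so the block runs with the standard library alone.

```python
# Encodings the article handles explicitly; anything else is treated as unknown.
KNOWN_ENCODINGS = {'GB2312', 'Big5', 'ascii', 'GBK', 'EUC-JP'}

def to_utf8(raw, detected_encoding, fallback=b'unknown'):
    """Decode raw bytes with the detected codec and re-encode them as UTF-8."""
    if detected_encoding not in KNOWN_ENCODINGS:
        return fallback
    # errors='ignore' silently drops bytes that are invalid in the detected codec
    return raw.decode(detected_encoding, errors='ignore').encode('utf-8')

# A field that arrived as Big5 is stored as UTF-8
big5_name = '\u4e2d\u6587'.encode('big5')
print(to_utf8(big5_name, 'Big5') == '\u4e2d\u6587'.encode('utf-8'))  # True

# chardet returns None when it cannot guess; that falls through to the fallback
print(to_utf8(b'\xff\xfe', None))  # b'unknown'
```

Keeping the accepted encodings in one set means a newly encountered encoding needs a one-word change rather than another branch.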

Thank you for reading. I hope this helps, and thank you for supporting this site!

