Automatically submitting forms and crawling web pages with Python

  • 2020-04-02 09:19:09
  • OfStack

The following is written in Python and uses lxml to parse the HTML. According to what I have read online, lxml is the fastest parser available, though I have not verified this myself. OK, here is the code.
 
# Python 2 code: urllib2 and urlparse became urllib.request and urllib.parse in Python 3.
import urllib
import urllib2
import urlparse
import lxml.html

def url_with_query(url, values):
    # Replace the query string of url with the encoded values and drop the fragment.
    parts = urlparse.urlparse(url)
    rest, (query, frag) = parts[:-2], parts[-2:]
    return urlparse.urlunparse(rest + (urllib.urlencode(values), None))

def make_open_http():
    # The cookie processor keeps cookies between requests, like a browser session.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    opener.addheaders = [] # pretend we're a human -- don't do this

    def open_http(method, url, values={}):
        if method == "POST":
            return opener.open(url, urllib.urlencode(values))
        else:
            return opener.open(url_with_query(url, values))
    return open_http

open_http = make_open_http()
# Fetch the home page and take its first form (urllib2 needs an explicit scheme).
tree = lxml.html.fromstring(open_http("GET", "http://www.jb51.net").read())
form = tree.forms[0]
form.fields["q"] = "eplussoft"
form.action = "http://www.jb51.net/search"
# submit_form calls open_http(method, url, values) with the form's field values.
response = lxml.html.submit_form(form, open_http=open_http)
html = response.read()
doc = lxml.html.fromstring(html)
lxml.html.open_in_browser(doc)
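
The listing above targets Python 2. For readers on Python 3, the same callback can probably be sketched with the standard library alone, as below. This is only one possible port, not the original author's code: `build_request` is a hypothetical helper added here so that the request can be inspected without touching the network, and `example.com` in the usage note is a placeholder.

```python
import urllib.parse
import urllib.request


def url_with_query(url, values):
    """Return url with its query replaced by the encoded values and no fragment."""
    parts = urllib.parse.urlparse(url)
    return urllib.parse.urlunparse(
        parts._replace(query=urllib.parse.urlencode(values), fragment="")
    )


def build_request(method, url, values=()):
    """Build (but do not send) the request for one form submission."""
    if method.upper() == "POST":
        # urllib issues a POST whenever a request body is supplied.
        body = urllib.parse.urlencode(values).encode("ascii")
        return urllib.request.Request(url, data=body)
    return urllib.request.Request(url_with_query(url, values))


def make_open_http():
    """Return an open_http(method, url, values) callable that keeps cookies."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())

    def open_http(method, url, values=()):
        return opener.open(build_request(method, url, values))

    return open_http
```

The callable returned by `make_open_http()` has the `(method, url, values)` signature that `lxml.html.submit_form` expects for its `open_http` argument. Calling `build_request("GET", "http://example.com/search", [("q", "eplussoft")])` on its own gives a request whose `full_url` carries the encoded query, which is handy for checking what would be sent before any real request is made.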

That said, CAPTCHAs are a big problem. I also looked at Baidu Tieba today, which put me in a bad mood: its verification code image is fetched with AJAX, and that makes things more troublesome. It seems that most forum and blog CAPTCHAs work this way. The first time you fetch the page, the CAPTCHA image is not in the HTML at all, let alone in a form you could recognize. There are still a lot of problems to solve...
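
One rough way to tell whether a page falls into this category is to look at the HTML returned by the first request and check whether it contains a CAPTCHA image at all. The sketch below uses Python 3's `html.parser`. The names `CaptchaImageFinder` and `find_captcha_images` and the `"captcha"` keyword are assumptions made for illustration; real sites name their elements differently, so the keyword would need adjusting per site.

```python
from html.parser import HTMLParser


class CaptchaImageFinder(HTMLParser):
    """Collect the src of every <img> whose attributes mention a keyword."""

    def __init__(self, keyword="captcha"):
        super().__init__()
        self.keyword = keyword.lower()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (bare attributes), so substitute "".
        haystack = " ".join(value or "" for value in attr_map.values()).lower()
        if self.keyword in haystack:
            self.sources.append(attr_map.get("src"))


def find_captcha_images(html, keyword="captcha"):
    """Return the src values of likely CAPTCHA images found in html."""
    finder = CaptchaImageFinder(keyword)
    finder.feed(html)
    finder.close()
    return finder.sources
```

An empty result for a page that visibly shows a CAPTCHA in the browser suggests the image is injected later by a script, so the crawler would have to reproduce that second (AJAX) request itself.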
