Automatically submitting forms and crawling web pages with Python
- 2020-04-02 09:19:09
- OfStack
The following script is written in Python and uses lxml to parse HTML. From what I have read online, lxml is the fastest HTML parser available, although I have not verified that myself. Here is the code.
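# Python 2 script (urllib2 and urlparse do not exist under these names in Python 3)
import urllib
import urllib2
import urlparse
import lxml.html


def url_with_query(url, values):
    # Replace the query string of url with the encoded values and drop the fragment.
    parts = urlparse.urlparse(url)
    rest, (query, frag) = parts[:-2], parts[-2:]
    return urlparse.urlunparse(rest + (urllib.urlencode(values), None))


def make_open_http():
    # HTTPCookieProcessor keeps cookies across requests made through this opener.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    # Clears the default Python-urllib User-agent header to look less like a script.
    # Hiding the client like this is discouraged.
    opener.addheaders = []

    def open_http(method, url, values=None):
        values = values or {}  # avoid a shared mutable default argument
        if method == "POST":
            return opener.open(url, urllib.urlencode(values))
        else:
            return opener.open(url_with_query(url, values))

    return open_http


open_http = make_open_http()
# The scheme is assumed to be http: urllib2 cannot open scheme-relative "//host" URLs.
tree = lxml.html.fromstring(open_http("GET", "http://www.jb51.net").read())
form = tree.forms[0]  # first form on the page
form.fields["q"] = "eplussoft"
form.action = "http://www.jb51.net/search"
# submit_form calls open_http(form.method, form.action, form_values)
response = lxml.html.submit_form(form, open_http=open_http)
html = response.read()
doc = lxml.html.fromstring(html)
lxml.html.open_in_browser(doc)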
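The script above targets Python 2. As a rough sketch only, the two helpers could be ported to Python 3 as shown below. It keeps the same names, url_with_query and make_open_http, but the explicit CookieJar and the bytes-encoded POST body are my own assumptions about a sensible port, not part of the original script.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urlparse, urlunparse
from urllib.request import HTTPCookieProcessor, build_opener


def url_with_query(url, values):
    """Return url with its query replaced by the encoded values and no fragment."""
    parts = urlparse(url)
    return urlunparse(parts._replace(query=urlencode(values), fragment=""))


def make_open_http():
    """Build an open_http(method, url, values) callable that shares one cookie jar."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    def open_http(method, url, values=None):
        values = values or {}
        if method.upper() == "POST":
            # urllib.request expects the POST body as bytes.
            return opener.open(url, urlencode(values).encode("ascii"))
        return opener.open(url_with_query(url, values))

    return open_http
```

The returned open_http has the (method, url, values) signature that lxml.html.submit_form expects for its open_http argument, so it should be usable in place of the Python 2 version.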
CAPTCHAs remain a big problem. Today I also looked at Baidu Tieba, which was frustrating: its verification code image is fetched through AJAX, which makes things more troublesome. It seems that most forums and blogs handle CAPTCHAs the same way. The first time you fetch a page, the HTML contains no CAPTCHA image at all, so there is nothing to download, let alone recognize. There are still a lot of problems to solve...
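One way to observe this is to check whether the HTML from the first request contains any CAPTCHA image at all. The sketch below is only a heuristic: the names CaptchaImageFinder, find_captcha_images, and the keyword list CAPTCHA_HINTS are hypothetical, and real sites may label their images differently. It uses only the standard-library html.parser (Python 3) and cannot see images that a page loads later through AJAX.

```python
from html.parser import HTMLParser

# Hypothetical keywords; real sites may use other names.
CAPTCHA_HINTS = ("captcha", "verifycode", "vcode")


class CaptchaImageFinder(HTMLParser):
    """Collect the src of every <img> whose src, id, class, or alt hints at a CAPTCHA."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        text = " ".join(
            (attr_map.get(name) or "").lower()
            for name in ("src", "id", "class", "alt")
        )
        if any(hint in text for hint in CAPTCHA_HINTS):
            self.sources.append(attr_map.get("src"))


def find_captcha_images(html):
    """Return the src values of likely CAPTCHA images found in the static HTML."""
    finder = CaptchaImageFinder()
    finder.feed(html)
    finder.close()
    return finder.sources
```

An empty result for a freshly fetched page is consistent with the behavior described above: the image is requested separately after the page loads, so a plain HTTP fetch never sees it.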