Summary of some common Python crawler tips

  • 2020-05-12 02:51:09
  • OfStack

Python crawler: a summary of some common crawler techniques

A lot of work gets reused during crawler development, so here is a summary of the common pieces; it may save some effort later on.

1. Basic crawling of web pages

GET method


import urllib2
url "http://www.baidu.com"
respons = urllib2.urlopen(url)
print response.read()

POST method


import urllib
import urllib2

url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url,form_data)
response = urllib2.urlopen(request)
print response.read()

2. Using proxy IPs

In the process of crawler development, your IP often gets blocked, so proxy IPs are needed.

The urllib2 package provides the ProxyHandler class, through which a proxy can be set up for accessing web pages, as shown in the following code snippet:


import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()

3. Cookie processing

Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. Python provides the cookielib module for handling cookies; its main job is to provide objects that can store cookies, so that they can be used together with the urllib2 module to access Internet resources.

Code snippet:


import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. All of the cookies are kept in memory and are lost once the CookieJar instance is garbage-collected; none of this needs to be handled manually.
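
The snippet above keeps cookies only in memory. As a side note, cookielib also provides MozillaCookieJar, a FileCookieJar subclass that can save cookies to a file and load them back later. A minimal sketch, assuming you want to persist cookies to a file named cookies.txt (the file name is just an example):

import urllib2, cookielib

# MozillaCookieJar is a FileCookieJar that reads/writes the Netscape cookies.txt format
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')  # 'cookies.txt' is an arbitrary example path
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.open('http://XXXX')
# save the cookies so a later run can reuse them via cookie_jar.load()
cookie_jar.save(ignore_discard=True, ignore_expires=True)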

Add cookie manually

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)

4. Masquerading as a browser

Some websites dislike crawler visits, so they reject all requests coming from crawlers. As a result, accessing such sites directly with urllib2 often results in HTTP Error 403: Forbidden.

Pay special attention to certain headers, because the server side will check them:

1). User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
2). Content-Type: when calling a REST interface, the server checks this value to determine how the content of the HTTP body should be parsed (a small example follows the snippet below).

These headers can be set by modifying the headers of the HTTP request. The code snippet is as follows:


import urllib2

headers = {
 'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
 url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
 headers = headers
)
print urllib2.urlopen(request).read()
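
The snippet above only sets User-Agent. For the Content-Type case from point 2), the idea is the same: pass the header when building the Request. A minimal sketch posting a JSON body (the endpoint URL and payload below are placeholders, not from the original article):

import json
import urllib2

data = json.dumps({'name': 'abc'})  # example payload
request = urllib2.Request(
 url = 'http://abcde.com/api',  # placeholder REST endpoint
 data = data,  # supplying data makes this a POST request
 headers = {
  'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
  'Content-Type': 'application/json'
 }
)
print urllib2.urlopen(request).read()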

5. Page parsing

The most powerful tool for page parsing is of course the regular expression, which differs from site to site and user to user, so there is no need for much explanation here; two useful URLs are attached:

Introduction to regular expressions: //www.ofstack.com/article/79618.htm

Regular expression online test: http://tool.oschina.net/regex/
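
As a minimal illustration of the regex approach (the pattern below is a quick-and-dirty example, not from the article), extracting the link targets from a page might look like this:

import re
import urllib2

html = urllib2.urlopen('http://www.baidu.com').read()
# grab every href="..." value; good enough for simple pages, not a full HTML parser
links = re.findall(r'href="(.*?)"', html)
for link in links:
 print link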

The second option is a parsing library. Two are commonly used: lxml and BeautifulSoup. Two good sites that introduce how to use them:

lxml: http://my.oschina.net/jhao104/blog/639448

BeautifulSoup: http://cuiqingcai.com/1319.html

As for these two libraries, my evaluation is that both are HTML/XML processing libraries. BeautifulSoup is a pure-Python implementation and is less efficient, but it is practical; for example, you can get the source of a particular HTML node by searching the results. lxml is written in C, is efficient, and supports XPath.
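
As a minimal sketch of the parsing-library approach (assuming BeautifulSoup 4 and lxml are installed; the HTML snippet is a made-up example), extracting links with each library looks like this:

from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><div id="content"><a href="/a">first</a><a href="/b">second</a></div></body></html>'

# BeautifulSoup: pure Python, forgiving of broken markup
soup = BeautifulSoup(html, 'lxml')
for a in soup.find_all('a'):
 print a['href'], a.get_text()

# lxml: C implementation, fast, supports XPath
tree = etree.HTML(html)
print tree.xpath('//div[@id="content"]/a/@href')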

6. Handling verification codes

Some simple verification codes can be recognized with simple techniques, and I have only ever handled a few simple ones myself. Some anti-human captchas, however, such as 12306's, can be dealt with through a manual-coding platform (human solvers), which of course costs money.
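
For the "simple recognition" case, one common route is OCR. A minimal sketch, assuming the Pillow/PIL and pytesseract libraries are installed and the captcha image has already been downloaded to captcha.png (the library choice and file name are assumptions, not from the article):

import pytesseract
from PIL import Image

# open the downloaded captcha and convert it to grayscale to help the OCR engine
image = Image.open('captcha.png').convert('L')  # 'captcha.png' is an example path
code = pytesseract.image_to_string(image)
print code.strip()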

7. gzip compression

Have you ever come across web pages that stay garbled no matter how you transcode them? Haha, that means you didn't know that many web services can send compressed data, which can cut the amount of data transmitted over the wire by more than 60%. This is especially true for XML web services, since XML data compresses very well.

But a server will not send you compressed data unless you tell the server that you can process compressed data.

So you need to modify the code like this:


import urllib2, httplib
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create the Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.

The next step is to extract the data:


import StringIO
import gzip

compresseddata = f.read() 
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream) 
print gzipper.read()

8. Multi-threaded concurrent fetching

If a single thread is too slow, you will want multiple threads. Here is a simple thread-pool template; the program just prints 1-10, but you can see that it runs concurrently.

Although Python's multi-threading is rather limited, it can still improve efficiency to some extent for a network-heavy task like crawling.


from threading import Thread
from Queue import Queue
from time import sleep
# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# The concrete handler that processes a single task
def do_something_using(arguments):
 print arguments

# The worker: keeps fetching items from the queue and processing them
def working():
 while True:
  arguments = q.get()
  do_something_using(arguments)
  sleep(1)
  q.task_done()

# fork NUM worker threads and leave them waiting for tasks
for i in range(NUM):
 t = Thread(target=working)
 t.setDaemon(True)
 t.start()

# put the JOBS into the queue
for i in range(JOBS):
 q.put(i)

# wait for all JOBS to complete
q.join()
