Python crawler tutorial: sharing crawler code

  • 2020-04-02 14:01:54
  • OfStack

If you are learning Python, it is worth writing a crawler. Not only does it let you learn and practice Python step by step, but the crawler itself is useful and interesting: a lot of repetitive downloading and statistics work can be done by writing a crawler program.

Writing a crawler in Python requires basic knowledge of Python, a few networking modules, regular expressions, file operations, and so on. Yesterday, after some studying online, I wrote a crawler that automatically downloads the pictures from the "qiushi encyclopedia" (Qiushibaike) site. The source code is as follows:


# -*- coding: utf-8 -*-
# The line above makes the code support Chinese
#---------------------------------------
#   Program: qiubai (Qiushibaike) picture crawler
#   Version: 0.1
#   Author: zhao wei
#   Date: 2013-07-25
#   Language: Python 2.7
#   The number of pages to download can be set. No further abstraction or interaction optimization.
#---------------------------------------
import urllib2
import urllib
import re

# Regular expression used to grab the addresses of the images
pat = re.compile('<div class="thumb">\n<img src="(ht.*?)".*?>')

# Used to compose the page URLs
nexturl1 = "http://m.qiushibaike.com/imgrank/page/"
nexturl2 = "?s=4582487&slow"

# Page counter
count = 1

# Set the number of pages to fetch
while count < 3:
    print "Page " + str(count) + "\n"
    myurl = nexturl1 + str(count) + nexturl2
    myres = urllib2.urlopen(myurl)   # fetch the page
    mypage = myres.read()            # read the page content
    ucpage = mypage.decode("utf-8")  # transcode
    mat = pat.findall(ucpage)        # grab the image addresses with the regular expression

    count += 1

    if len(mat):
        for item in mat:
            print "url: " + item + "\n"
            # The following three lines separate out the name of the image file
            fnp = re.compile(r'/(\w+\.\w+)$')
            fnr = fnp.findall(item)
            fname = fnr[0]
            urllib.urlretrieve(item, fname)  # download the picture
    else:
        print "no data"

How to use it: create a new practices folder, save the source code in it as qb.py, then run python qb.py on the command line to start downloading the images. You can modify the while condition in the source code to set the number of pages to download, as in the snippet below.
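
For example, since count starts at 1, changing the loop condition as follows would fetch pages 1 through 10 instead of just the first two:

# Fetch pages 1 through 10 instead of 1 and 2
count = 1
while count < 11:
    ...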

