Example of scraping web page information with a Python crawler [urllib and re modules]

  • 2020-06-01 10:14:32
  • OfStack

This article describes an example of using a Python crawler to scrape web page information. It is shared here for your reference, as follows:

First, to parse and read web pages we need the following modules:


import urllib
import urllib2
import re

We can try reading a website with the readline method, for example Baidu:


def test():
   f=urllib.urlopen('http://www.baidu.com')
   while True:
      firstLine=f.readline()
      if not firstLine:  # readline() returns '' at end of stream
         break
      print firstLine
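The same line-by-line pattern can be shown in Python 3 without touching the network. This is a minimal sketch using an in-memory buffer in place of the live HTTP response (the sample HTML is invented for illustration):

```python
import io

def read_lines(response):
    # Read a file-like response line by line, stopping at end of stream
    lines = []
    while True:
        line = response.readline()
        if not line:  # readline() returns b'' once the stream is exhausted
            break
        lines.append(line.rstrip(b'\n'))
    return lines

# Simulate a response body with an in-memory buffer (no network needed)
fake_response = io.BytesIO(b"<html>\n<body>hello</body>\n</html>\n")
print(read_lines(fake_response))
```

With a real `urllib.request.urlopen(...)` response object the loop works the same way, since it exposes the same `readline()` interface.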

Next, let's look at how to scrape information from a page, using a Baidu Tieba (forum) post as an example.

Here are a few things to do:

First we need to fetch the page and its source code. Since we want to handle multiple pages, the URL has to change, so we pass in a page number:


  def getPage(self,pageNum):
     try:
        url=self.baseURL+self.seeLZ+'&pn='+str(pageNum)
        # create request object 
        request=urllib2.Request(url)
        response=urllib2.urlopen(request)
        #print 'URL:'+url
        return response.read()
     except Exception,e:
        print e
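The URL assembled above carries two query parameters: see_lz (show only the original poster) and pn (page number). A small Python 3 sketch of just the URL assembly, using the example thread URL from the full code below (no request is actually made):

```python
def build_url(base_url, see_lz, page_num):
    # ?see_lz=1 filters to the original poster; &pn=N selects the page
    return base_url + '?see_lz=' + str(see_lz) + '&pn=' + str(page_num)

base = 'http://tieba.baidu.com/p/4638659116'
for page in range(1, 4):
    print(build_url(base, 1, page))
```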

Next we grab the post's content, which we split into the title and the body. The title appears on every page, so we only need to grab it once.

We can open the page and press F12 to inspect how the title tag is structured; on Baidu Tieba it looks like <title> .....................

Then we match with reg=re.compile(r'<title>(.*?)。') to grab this information.
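The non-greedy group stops at the first Chinese full stop (。), which on Tieba separates the title from a site suffix. A hedged Python 3 sketch against a made-up title string (the sample HTML here is invented for illustration):

```python
import re

# Sample page head, invented for illustration
html = u'<html><head><title>Example thread title。Baidu Tieba</title></head></html>'

reg = re.compile(u'<title>(.*?)。')  # non-greedy: capture up to the first 。
items = re.findall(reg, html)
print(items[0])
```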

Now that we've grabbed the title, we grab the body. The body spans many posts (floors), so we loop and grab all of the items.

File open and close operations must be placed outside the loop. We also add steps to remove hyperlinks, <br> tags, and the like.
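The cleanup steps just described can be sketched in Python 3 like this (the sample post fragment is invented; the patterns match the ones used in the full code below):

```python
import re

# A made-up post fragment containing a hyperlink and a <br> tag
raw = 'See <a href="http://example.com">this link</a> for details<br>next line'

remove_addr = re.compile('<a.*?>|</a>')  # match opening and closing anchor tags
text = re.sub(remove_addr, '', raw)      # strip them by replacing with ''
text = text.replace('<br>', '')          # drop <br> tags, as the full code does
print(text)
```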

Finally, we run the entry code at the bottom of the script.

Complete code:


# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')
# Crawler: scrape web page information
# Required modules: urllib, urllib2, re
import urllib
import urllib2
import re
# Test function -> read a page line by line
#def test():
#   f=urllib.urlopen('http://www.baidu.com')
#   while True:
#     firstLine=f.readline()
#     if not firstLine: # stop at end of stream
#       break
#     print firstLine
# Grab the original poster's text from the first 10 pages of a Baidu Tieba post
class BDTB:
   def __init__(self,baseUrl,seeLZ):
     # Member variables
     self.baseURL=baseUrl
     self.seeLZ='?see_lz='+str(seeLZ)
   # Get the source code of one page of the post
   def getPage(self,pageNum):
     try:
        url=self.baseURL+self.seeLZ+'&pn='+str(pageNum)
        # create request object
        request=urllib2.Request(url)
        response=urllib2.urlopen(request)
        #print 'URL:'+url
        return response.read()
     except Exception,e:
        print e
   # Match the title
   def Title(self):
     html=self.getPage(1)
     # compile improves the efficiency of repeated regex matching
     reg=re.compile(r'<title>(.*?)。')
     # findall returns a list of matches
     items=re.findall(reg,html)
     f=open('output.txt','w+')
     item=('').join(items)
     f.write('\t\t\t\t\t'+item.encode('gbk'))
     f.close()
   # Match the body
   def Text(self,pageNum):
     html=self.getPage(pageNum)
     # compile improves the efficiency of repeated regex matching
     reg=re.compile(r'"d_post_content j_d_post_content ">(.*?)</div>')
     # findall returns a list of matches
     items=re.findall(reg,html)
     f=open('output.txt','a+')
     # [1:] slice: skip the first element, which we don't need
     for i in items[1:]:
        # Remove hyperlinks
        removeAddr=re.compile('<a.*?>|</a>')
        # replace them with ""
        i=re.sub(removeAddr,"",i)
        # drop <br> tags
        i=i.replace('<br>','')
        f.write('\n\n'+i.encode('gbk'))
     f.close()
# Entry point
baseURL='http://tieba.baidu.com/p/4638659116'
bdtb=BDTB(baseURL,1)
print 'The crawler is starting...'
# Grab multiple pages
bdtb.Title()
print 'Title grabbed!'
for i in range(1,11):
  print 'Grabbing page %02d'%i
  bdtb.Text(i)
print 'Body grabbed!'
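The code above targets Python 2 (urllib2, print statements, sys.setdefaultencoding). On Python 3, urllib2 was merged into urllib.request. A minimal sketch of how the same class could look, using the same example thread URL; only the URL building is exercised here, the HTTP request itself is not run:

```python
import urllib.request

class BDTB:
    def __init__(self, base_url, see_lz):
        self.base_url = base_url
        self.see_lz = '?see_lz=' + str(see_lz)

    def page_url(self, page_num):
        # Same URL scheme as the Python 2 getPage above
        return self.base_url + self.see_lz + '&pn=' + str(page_num)

    def get_page(self, page_num):
        # urllib2.Request / urllib2.urlopen became urllib.request.Request / urlopen
        request = urllib.request.Request(self.page_url(page_num))
        with urllib.request.urlopen(request) as response:
            return response.read().decode('utf-8')

bdtb = BDTB('http://tieba.baidu.com/p/4638659116', 1)
print(bdtb.page_url(1))
```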

PS: Here are two handy online regular expression tools you can use:

JavaScript regular expression online testing tool:
http://tools.ofstack.com/regex/javascript

Online regular expression generation tool:
http://tools.ofstack.com/regex/create_reg


I hope this article is helpful for your Python programming.

