Example of scraping web page information with a Python crawler (urllib and the re module)
- 2020-06-01 10:14:32
- OfStack
This article walks through a working example of scraping web page information with a Python crawler, shared for your reference.
First, to fetch and parse web pages we need the following modules (this code targets Python 2, where urllib2 is available):
import urllib
import urllib2
import re
We can first try reading a site line by line with the readline method, for example Baidu:
def test():
    f=urllib.urlopen('http://www.baidu.com')
    while True:
        firstLine=f.readline()
        if not firstLine:  # readline() returns '' at EOF; without this check the loop never ends
            break
        print firstLine
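Under Python 3 (where urllib.urlopen moved to urllib.request.urlopen), the same line-by-line loop works unchanged on the response object. A minimal sketch, with an in-memory stream standing in for the live network response so it can be run offline (the real call is shown in a comment):

```python
import io

# Stand-in for the file-like object urlopen() returns; a real call under
# Python 3 would be: f = urllib.request.urlopen('http://www.baidu.com')
f = io.BytesIO(b"<html>\n<title>demo</title>\n</html>\n")

lines = []
while True:
    line = f.readline()
    if not line:            # readline() returns b'' at EOF, ending the loop
        break
    lines.append(line.decode('utf-8'))

print(lines)
```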
Next, let's look at how to scrape information from a web page, using a Baidu Tieba post as the example.
Here are a few things to do:
First we need to fetch the page and its source code. Since we want to handle multiple pages, and the URL changes from page to page, we pass in a page number:
def getPage(self,pageNum):
    try:
        url=self.baseURL+self.seeLZ+'&pn='+str(pageNum)
        # create a request object
        request=urllib2.Request(url)
        response=urllib2.urlopen(request)
        #print 'URL:'+url
        return response.read()
    except Exception,e:
        print e
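For reference, here is a Python 3 sketch of the same logic (urllib2 was folded into urllib.request in Python 3). The URL-building step is split into its own function so it can be checked without a network call; the base URL and see_lz value mirror the ones used later in the article:

```python
import urllib.request

def build_url(base_url, see_lz, page_num):
    # ?see_lz=1 shows only the original poster's posts; pn selects the page
    return base_url + '?see_lz=' + str(see_lz) + '&pn=' + str(page_num)

def get_page(base_url, see_lz, page_num):
    # Python 3 equivalent of the article's getPage method
    try:
        url = build_url(base_url, see_lz, page_num)
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        return response.read().decode('utf-8', errors='replace')
    except Exception as e:
        print(e)
        return None

print(build_url('http://tieba.baidu.com/p/4638659116', 1, 2))
```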
Next we grab the post's content, which we split into the title and the body. The title appears on every page, so we only need to grab it once.
Open the page in a browser and press F12 to see how its title tag is structured; on Baidu Tieba it has the form <title>post title。…</title>, with the post title before the full-width 。 separator.
Then we match with
reg=re.compile(r'<title>(.*?)。')
to grab this information.
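As a quick check of that pattern, here it is run against a hypothetical title string shaped like Tieba's (post title, then the full-width 。 separator, then the site name):

```python
import re

# Hypothetical page title in the form described above
html = '<title>My sample post。Baidu Tieba</title>'

# Non-greedy match: capture everything between <title> and the 。 separator
reg = re.compile(r'<title>(.*?)。')
items = re.findall(reg, html)
print(items)
```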
Now that we've grabbed the title, we grab the body. The body spans many posts, so we loop and grab them all.
The file open and close for writing the text must sit outside the loop. We also strip out hyperlinks, <br> tags, and the like from each item.
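A small sketch of that cleanup step on a made-up fragment (the markup is hypothetical, but the two operations mirror those used in the full script below):

```python
import re

# Hypothetical post body containing a hyperlink and a <br> tag
raw = 'see <a href="http://example.com">this link</a> here<br>'

# Strip opening and closing <a> tags
remove_addr = re.compile('<a.*?>|</a>')
text = re.sub(remove_addr, '', raw)
# Strip <br> tags
text = text.replace('<br>', '')
print(text)
```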
Finally, we can call the main function
Complete code:
# -*- coding:utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')
# Crawler: scraping web page information
# Required modules: urllib, urllib2, re
import urllib
import urllib2
import re
# Test function -> read a page line by line
#def test():
#  f=urllib.urlopen('http://www.baidu.com')
#  while True:
#    firstLine=f.readline()
#    print firstLine
# Grab the original poster's text from the first 10 pages of a Baidu Tieba post
class BDTB:
    def __init__(self,baseUrl,seeLZ):
        # member variables
        self.baseURL=baseUrl
        self.seeLZ='?see_lz='+str(seeLZ)
    # Fetch the source code of one page of the post
    def getPage(self,pageNum):
        try:
            url=self.baseURL+self.seeLZ+'&pn='+str(pageNum)
            # create a request object
            request=urllib2.Request(url)
            response=urllib2.urlopen(request)
            #print 'URL:'+url
            return response.read()
        except Exception,e:
            print e
    # Match the title
    def Title(self):
        html=self.getPage(1)
        # compile improves the efficiency of repeated regex matching
        reg=re.compile(r'<title>(.*?)。')
        # findall returns a list
        items=re.findall(reg,html)
        f=open('output.txt','w+')
        item=('').join(items)
        f.write('\t\t\t\t\t'+item.encode('gbk'))
        f.close()
    # Match the body
    def Text(self,pageNum):
        html=self.getPage(pageNum)
        # compile improves the efficiency of repeated regex matching
        reg=re.compile(r'"d_post_content j_d_post_content ">(.*?)</div>')
        # findall returns a list
        items=re.findall(reg,html)
        f=open('output.txt','a+')
        # slice off the first element, which we do not need
        for i in items[1:]:
            # remove hyperlinks
            removeAddr=re.compile('<a.*?>|</a>')
            # replace them with ""
            i=re.sub(removeAddr,"",i)
            # strip <br> tags
            i=i.replace('<br>','')
            f.write('\n\n'+i.encode('gbk'))
        f.close()
# Entry point
baseURL='http://tieba.baidu.com/p/4638659116'
bdtb=BDTB(baseURL,1)
print 'The crawler is starting....'
# loop over the pages
bdtb.Title()
print 'Title grabbed!'
for i in range(1,11):
    print 'Grabbing page %02d' % i
    bdtb.Text(i)
print 'Body text grabbed!'
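One porting note: the encode('gbk') calls exist because Python 2 file objects expect byte strings. Under Python 3 you would instead declare the encoding once when opening the file. A small sketch (the file name and sample strings are hypothetical):

```python
import os
import tempfile

# Python 3: pass the encoding to open() instead of encoding each write
path = os.path.join(tempfile.gettempdir(), 'output_demo.txt')
with open(path, 'w', encoding='gbk') as f:
    f.write('\t\t\t\t\t' + 'Sample title')    # title line, as in Title()
with open(path, 'a', encoding='gbk') as f:
    f.write('\n\n' + 'Sample paragraph')      # body text, as in Text()

with open(path, encoding='gbk') as f:
    content = f.read()
print(content)
```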
PS: Here are two handy online regular expression tools you may find useful:
JavaScript regular expression online testing tool:
http://tools.ofstack.com/regex/javascript
Online regular expression generation tool:
http://tools.ofstack.com/regex/create_reg
I hope this article is helpful to you in your Python programming.