Writing a web crawler script in Python and scheduling it with APScheduler

  • 2020-04-02 13:52:26
  • OfStack

Some time ago I taught myself Python. As a beginner, I wanted to write something for practice, and I had heard that Python is very convenient for writing crawler scripts.

The requirements are as follows: the crawler fetches the e-book page of jd.com, where a few free e-books are added every day. As soon as new free titles appear, the crawler emails me the list so that I can download them.

I. Design outline:

1. Crawler script to obtain free book information of the day

2. Compare the acquired book information with the records already in the database. If a book already exists, do nothing; if it does not exist, insert it into the database

3. Whenever a database insert is performed, send the newly added data by email

4. Use the APScheduler scheduling framework to complete the scheduling of python scripts

II. Key techniques used in the script:

1. Python simple crawler

This script uses the urllib2 module to fetch the page and sgmllib to parse it. The imports are as follows:


import urllib2
from sgmllib import SGMLParser

The urlopen() function fetches the HTML source of the web page, which is stored in content. The ListHref class parses the HTML code, which is a semi-structured document.


content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
listhref = ListHref()
listhref.feed(content)

The full code of the ListHref class appears in the complete script at the end of this article. Only a few key points are covered here:

The ListHref class inherits from SGMLParser and overrides some of its methods. SGMLParser breaks HTML into useful pieces, such as start tags and end tags. Each time it has split off a useful piece, it calls one of its own methods, chosen according to the kind of piece it found. To use the parser, you subclass SGMLParser and override these methods.

SGMLParser parses HTML into different kinds of data and tags, then calls a separate method for each kind:
Start tag (Start_tag)
An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag, like <br> or <img>. In this example, when it finds a start tag <a>, SGMLParser looks for a method named start_a or do_a. If found, SGMLParser calls the method with the tag's list of attributes; otherwise, it calls unknown_starttag with the tag name and the list of attributes.
End tag (End_tag)
An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. In this example, when it finds an end tag </a>, SGMLParser looks for a method named end_a. If found, SGMLParser calls that method; otherwise, it calls unknown_endtag with the tag name.
Text data
A block of text. When SGMLParser finds text that matches no other kind of markup, it calls handle_data with the text.
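
The three callbacks above can be sketched in runnable form. Note that sgmllib exists only in Python 2, so this sketch swaps in html.parser.HTMLParser from the Python 3 standard library, which dispatches through handle_starttag and handle_endtag instead of start_a and end_a. The class name LabelledLinkParser, the sample HTML, and the label text are assumptions made for illustration, not part of the original script.

```python
from html.parser import HTMLParser


class LabelledLinkParser(HTMLParser):
    """Collect the href of every <a> whose text equals a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_a = False
        self.current_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            self.in_a = True
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        # Called for every end tag.
        if tag == "a":
            self.in_a = False
            self.current_href = None

    def handle_data(self, data):
        # Called for the text between tags.
        if self.in_a and self.current_href and data.strip() == self.label:
            self.hrefs.append(self.current_href)


# Hypothetical page fragment, used only to exercise the parser.
parser = LabelledLinkParser("Limited-time free")
parser.feed('<p><a href="/book/1">Limited-time free</a> '
            '<a href="/book/2">Buy</a></p>')
print(parser.hrefs)
```

Only the first link is collected, because its text matches the label while the second one does not.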

The following kinds are not used in this article:
Character reference
An escaped character referenced by its decimal or hexadecimal code. When found, SGMLParser calls handle_charref with the text of the reference.
Entity reference
An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the entity.
Comment
An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction
An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the instruction.
Declaration
An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
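
For readers who want to see these remaining callbacks fire, here is a small sketch. It again uses Python 3's html.parser.HTMLParser as a stand-in for sgmllib, since it exposes handlers with the same names. The class name EventLogger and the sample markup are assumptions. Passing convert_charrefs=False matters: with the default setting, HTMLParser converts references to text itself and never calls handle_charref or handle_entityref.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler is called for each piece of markup."""

    def __init__(self):
        # Keep character and entity references as separate events.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


logger = EventLogger()
logger.feed("<!DOCTYPE html><!-- note --><p>&copy; &#169;</p>"
            "<?proc color='red'>")
logger.close()
for event in logger.events:
    print(event)
```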

For details, see the API reference: http://docs.python.org/2/library/sgmllib.html?highlight=sgmlparser#sgmllib.SGMLParser

2. Working with a MongoDB database from Python

First install PyMongo, the Python driver for MongoDB. Download address: https://pypi.python.org/pypi/pymongo/2.5

Import the module:


import pymongo

Connect to the database server at 127.0.0.1 and select the database to use, mydatabase:


mongoCon=pymongo.Connection(host="127.0.0.1",port=27017)
db= mongoCon.mydatabase

Look up a book in the database; book is the collection being searched:


bookInfo = db.book.find_one({"href":bookItem.href})

Insert the book information into the database. Python supports Chinese text, but encoding and decoding it is still somewhat complicated. For details on decoding and encoding, see http://blog.csdn.net/mayflowers/article/details/1568852


b={
"bookname":bookItem.bookname.decode('gbk').encode('utf8'),
"href":bookItem.href,
"date":bookItem.date
}
db.book.insert(b,safe=True)
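
The decode('gbk').encode('utf8') step above can be tried in isolation. This is a minimal Python 3 sketch, where bytes and str are distinct types; the function name gbk_to_utf8 and the sample title are assumptions made for illustration.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8 bytes."""
    text = raw.decode("gbk")      # bytes -> str, interpreting the input as GBK
    return text.encode("utf-8")   # str -> bytes, written out as UTF-8


title = "中文书名"                  # hypothetical book title
gbk_bytes = title.encode("gbk")   # what a GBK-encoded page would deliver
utf8_bytes = gbk_to_utf8(gbk_bytes)
print(utf8_bytes.decode("utf-8") == title)
```

The round trip only works when the input really is GBK; decoding bytes with the wrong codec either raises UnicodeDecodeError or silently produces garbled text.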

For more about PyMongo, refer to the API documentation: http://api.mongodb.org/python/2.0.1/
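
The find-then-insert logic can be sketched without a running MongoDB server. The snippets above use the PyMongo 2.x API (Connection, insert with safe=True); newer PyMongo releases use MongoClient and insert_one instead, and the sketch below follows the newer method names. The function record_new_books and the class InMemoryCollection are hypothetical: the latter is only a stand-in that mimics find_one and insert_one for exact-match filters, not a real PyMongo class.

```python
def record_new_books(collection, books):
    """Insert books whose href is not stored yet; return the new titles."""
    inserted = []
    for book in books:
        if collection.find_one({"href": book["href"]}) is None:
            collection.insert_one(dict(book))
            inserted.append(book["bookname"])
    return inserted


class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection (exact-match only)."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(doc)


# With a real server one could pass, for example,
# pymongo.MongoClient("127.0.0.1", 27017).mydatabase.book instead.
store = InMemoryCollection()
books = [{"bookname": "Book A", "href": "/book/1", "date": "2020-04-02"}]
first_run = record_new_books(store, books)
second_run = record_new_books(store, books)
print(first_run, second_run)
```

The second call inserts nothing because the href is already stored, which is what keeps the crawler from mailing the same title twice.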

3. Sending mail from Python

Import the mail modules:


# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText

"localhost" is the mail server address. The message fields are set as follows:

msg = MIMEText(context)  # text message content
msg['Subject'] = sub  # subject
msg['From'] = "my@vmail.cn"  # sender
msg['To'] = COMMASPACE.join(mailto_list)  # recipient list


def send_mail(mailto_list, sub, context):
    COMMASPACE = ', '
    mail_host = "localhost"
    me = "my@vmail.cn"
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = COMMASPACE.join(mailto_list)

    # Connect to the mail server, send the message, then disconnect
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()
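
As a rough Python 3 counterpart, the sketch below separates building the message from sending it, so the headers can be inspected without a mail server. The function names build_message and send_message and the recipient addresses are assumptions. send_message is defined but deliberately not called, because it needs an SMTP server listening on the given host.

```python
import smtplib
from email.mime.text import MIMEText


def build_message(sender, recipients, subject, body):
    """Create a text/plain message with the usual headers."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    return msg


def send_message(msg, recipients, host="localhost"):
    """Send msg through the SMTP server at host (requires a live server)."""
    with smtplib.SMTP(host) as smtp:
        smtp.sendmail(msg["From"], recipients, msg.as_string())


recipients = ["a@mail.cn", "b@mail.cn"]  # hypothetical addresses
msg = build_message("my@vmail.cn", recipients,
                    "Free Book is Update", "Today free books are Book A")
print(msg["To"])
print(msg.as_string())
```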

Reference documentation: http://docs.python.org/2/library/email.html?highlight=smtplib

4. The Python scheduling framework APScheduler

Download address: https://pypi.python.org/pypi/APScheduler/2.1.0

Official documentation: http://pythonhosted.org/APScheduler/#faq

API: http://pythonhosted.org/APScheduler/genindex.html

Installation: after downloading, unzip the archive, run python setup.py install, and then import the module:


from apscheduler.scheduler import Scheduler

The APScheduler configuration is relatively simple. This example uses only the add_interval_job method, which runs the job every 30 minutes. For a reference article with more examples, see http://flykite.blog.51cto.com/4721239/832036


# Start the scheduler 
sched = Scheduler()
sched.daemonic = False 
sched.add_interval_job(job,minutes=30) 
sched.start()
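
To make the idea of an interval job concrete, here is a plain standard-library stand-in. It is not APScheduler and has none of its features; it only shows the "wait one interval, run the job, repeat" loop. The name run_every and the max_runs limit are assumptions added so that the example terminates. As far as I know, the APScheduler 3.x equivalent of the call above is add_job(job, 'interval', minutes=30) on a BlockingScheduler, since the 2.x API shown here was changed in later releases.

```python
import time


def run_every(interval_seconds, job, max_runs):
    """Call job() once per interval, max_runs times, in the current thread."""
    for _ in range(max_runs):
        # Wait first: the first run happens one interval after the start.
        time.sleep(interval_seconds)
        job()


calls = []
run_every(0.01, lambda: calls.append(time.time()), max_runs=3)
print(len(calls))
```

A real scheduler adds what this loop lacks: it runs in its own thread, handles several jobs, and copes with jobs that take longer than the interval.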

About the daemonic parameter:

By default, APScheduler runs its scheduler in a thread created with daemon=True, that is, a daemon thread.

In the code above, the job will not run on schedule unless sched.daemonic = False is set.

Without sched.daemonic = False, the scheduler runs in a daemon thread. The script creates the scheduler instance, but because the rest of the script finishes almost instantly, the main thread ends right away, and the scheduler thread is terminated before it ever has a chance to run the job (a daemon thread does not outlive the main thread). For the script to work properly, the scheduler must run as a non-daemon thread: sched.daemonic = False
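
The relationship between daemon threads and the main thread can be observed with the standard library alone. This sketch does not use APScheduler: it launches a short child script twice, once with a daemon worker thread and once with a non-daemon one, and captures what each child prints. The names CHILD and child_output and the 0.3 second delay are assumptions chosen for illustration.

```python
import subprocess
import sys

# Child script: start a worker thread, then let the main thread end at once.
CHILD = """
import threading, time

def job():
    time.sleep(0.3)
    print("job ran")

threading.Thread(target=job, daemon=DAEMON).start()
"""


def child_output(daemon):
    """Run the child script and return whatever it printed."""
    code = CHILD.replace("DAEMON", str(daemon))
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout.strip()


print(repr(child_output(True)))   # daemon worker
print(repr(child_output(False)))  # non-daemon worker
```

With the daemon worker the child prints nothing, because the process exits as soon as the main thread ends. With the non-daemon worker the interpreter waits for the thread, so "job ran" appears.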

Appendix: the complete script



#-*- coding: UTF-8 -*-
import urllib2
from sgmllib import SGMLParser
import pymongo
import time
# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText
from apscheduler.scheduler import Scheduler

#get freebook hrefs
class ListHref(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_a = ""
        self.name = []
        self.freehref=""
        self.hrefs=[]

    def start_a(self, attrs):
        # Remember the href of the <a> tag currently being parsed
        self.is_a = 1
        href = [v for k, v in attrs if k == "href"]
        # An <a> tag may have no href attribute
        self.freehref = href[0] if href else ""

    def end_a(self):
        self.is_a = ""

    def handle_data(self, text):
        # Keep the link only when its text is the free-book label
        if self.is_a == 1 and text.decode('utf8').encode('gbk')==" Limited-time free ":
            self.hrefs.append(self.freehref)
#get freebook Info
class FreeBook(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_title=""
        self.name = ""

    def start_title(self, attrs):
        self.is_title = 1

    def end_title(self):
        self.is_title = ""

    def handle_data(self, text):
        # The text inside <title> holds the book name
        if self.is_title == 1:
            self.name=text
#Mongo Store Module
class freeBookMod:
    def __init__(self, date, bookname ,href):
        self.date=date
        self.bookname=bookname
        self.href=href

def get_book(bookList):
    # Fetch the promotion page and collect the links to free books
    content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
    listhref = ListHref()
    listhref.feed(content)

    # Visit each book page and read the book name from its title
    for href in listhref.hrefs:
        content = urllib2.urlopen(str(href)).read()
        listbook=FreeBook()
        listbook.feed(content)
        name = listbook.name
        n= name.index(' " ')
        #print (name[0:n+2])
        freebook=freeBookMod(time.strftime('%Y-%m-%d',time.localtime(time.time())),name[0:n+2],href)
        bookList.append(freebook)
    return bookList

def record_book(bookList,context,isSendMail):
    # DataBase Operation
    mongoCon=pymongo.Connection(host="127.0.0.1",port=27017)
    db= mongoCon.mydatabase
    for bookItem in bookList:
        bookInfo = db.book.find_one({"href":bookItem.href})

        # Insert only books that are not stored yet
        if not bookInfo:
            b={
                "bookname":bookItem.bookname.decode('gbk').encode('utf8'),
                "href":bookItem.href,
                "date":bookItem.date
            }
            db.book.insert(b,safe=True)
            isSendMail=True
            context=context+bookItem.bookname.decode('gbk').encode('utf8')+','
    return context,isSendMail

#Send Message
def send_mail(mailto_list, sub, context):
    COMMASPACE = ','
    mail_host = "localhost"
    me = "my@vmail.cn"
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = "my@vmail.cn"
    msg['To'] = COMMASPACE.join(mailto_list)

    send_smtp = smtplib.SMTP(mail_host)

    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.close()

#Main job for scheduler
def job():
    bookList=[]
    isSendMail=False
    context="Today free books are"
    mailto_list=["mailto@mail.cn"]
    bookList=get_book(bookList)
    context,isSendMail=record_book(bookList,context,isSendMail)
    # Send mail only when at least one new book was inserted
    if isSendMail:
        send_mail(mailto_list,"Free Book is Update",context)

if __name__=="__main__":
    # Start the scheduler
    sched = Scheduler()
    sched.daemonic = False
    sched.add_interval_job(job,minutes=30)
    sched.start()

