Use Python to extract specific data from HTML files
- 2020-04-02 09:47:45
- OfStack
<div class='entry-content'>
<p> Content of interest 1</p>
<p> Content of interest 2</p>
...
<p> Content of interest n</p>
</div>
<div class='content'>
<p> content 1</p>
<p> content 2</p>
...
<p> content n</p>
</div>
From the HTML above, we want to extract only the 'content of interest' paragraphs and save their text in a list, IDlist.
The question is how to recognize, while parsing, that a piece of text belongs to the content of interest, that is, that it lies inside
<div class='entry-content'>
<p> The content here </p>
<p> And here, </p>
...
<p> And what's going on here </p>
</div>
The idea is as follows:
- On encountering <div class='entry-content'>, set flag = True.
- On encountering </div>, set flag = False.
- On encountering <p> while flag is True, set getdata = True.
- On encountering </p> while getdata is True, set getdata = False.
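The rules above amount to a small state machine. As a minimal sketch that is independent of any parser, assume the page has already been broken into a list of (kind, value) events; the function name collect and this event format are hypothetical and chosen only for illustration:

```python
def collect(events):
    """Apply the flag rules to a list of (kind, value) events.

    Hypothetical event format (an assumption for this sketch):
      ('start', (tag, css_class)), ('end', tag), ('text', string)
    """
    flag = False      # True while inside <div class='entry-content'>
    getdata = False   # True while inside a <p> within that div
    found = []
    for kind, value in events:
        if kind == 'start' and value == ('div', 'entry-content'):
            flag = True
        elif kind == 'end' and value == 'div':
            flag = False
        elif kind == 'start' and value[0] == 'p' and flag:
            getdata = True
        elif kind == 'end' and value == 'p' and getdata:
            getdata = False
        elif kind == 'text' and getdata:
            found.append(value)
    return found


events = [
    ('start', ('div', 'entry-content')),
    ('start', ('p', None)), ('text', 'Content of interest 1'), ('end', 'p'),
    ('end', 'div'),
    ('start', ('div', 'content')),
    ('start', ('p', None)), ('text', 'content 1'), ('end', 'p'),
    ('end', 'div'),
]
print(collect(events))
```

Only the text seen while getdata is True is kept, so the paragraph inside the second div is ignored.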
Python 2's sgmllib module provides the SGMLParser class, which breaks HTML into eight kinds of data [1] and calls a separate method for each kind. To use it, you simply subclass SGMLParser and write handlers for the parts of the page you care about.
The eight kinds of data and their handlers are as follows:
- Start tag: an HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser looks for a method named start_tagname or do_tagname. For example, when it finds a <pre> tag, it looks for a method named start_pre or do_pre. If found, SGMLParser calls that method with the tag's attribute list; otherwise it calls unknown_starttag with the tag's name and attribute list.
- End tag: an HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser looks for a method named end_tagname. If found, SGMLParser calls that method; otherwise it calls unknown_endtag with the name of the tag.
- Character reference: an escaped character written in decimal or equivalent hexadecimal notation, such as &#160;. When found, SGMLParser calls handle_charref with the decimal or hexadecimal text of the character.
- Entity reference: an HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
- Comment: an HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the content of the comment.
- Processing instruction: an HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the content of the processing instruction.
- Declaration: an HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the content of the declaration.
- Text data: a block of text, that is, anything that does not fit into the other seven categories. When found, SGMLParser calls handle_data with the text.
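The sgmllib module exists only in Python 2, so the handler dispatch described above cannot be tried directly on Python 3. As a rough sketch of the same idea, the following uses html.parser.HTMLParser from the Python 3 standard library instead. This is a different class, not SGMLParser: it has similarly named handle_* methods, but it sends every start tag to handle_starttag and every end tag to handle_endtag rather than looking up start_tagname or end_tagname. The class name EventLogger and the sample input are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler fires for each piece of the input."""

    def __init__(self):
        # convert_charrefs=False so that character and entity references
        # are reported through their own handlers instead of as text
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_pi(self, data):
        self.events.append(('pi', data))

    def handle_decl(self, decl):
        self.events.append(('decl', decl))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed("<!DOCTYPE html><?pi data><!-- note --><p>a&#160;b&copy;c</p>")
logger.close()
for event in logger.events:
    print(event)
```

Each printed tuple names the handler that was called and the argument it received, which makes it easy to see how a page is split into the categories listed above.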
from sgmllib import SGMLParser

class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        SGMLParser.reset(self)

    def start_div(self, attrs):
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        self.flag = False

    def start_p(self, attrs):
        if self.flag == False:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)

    def printID(self):
        for i in self.IDlist:
            print i
The approach above has a bug: flag is set to False on every </div>. What happens when divs are nested?
<div class='entry-content'><div> I'm here to make trouble </div><p> Interested in </p></div>
At the first </div>, flag becomes False, so the 'Interested in' content that follows is never collected.
What can we do? How do we tell whether a given </div> is the one that matches <div class='entry-content'>?
It is simple. Every </div> corresponds to a <div>, so we can record the nesting depth we are at: on entering a child div, add 1 to verbatim; on leaving a child div, subtract 1 from verbatim. A </div> seen while verbatim is 0 is therefore at the same level as <div class='entry-content'> and closes it.
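To see the difference the depth counter makes, here is a minimal sketch that runs both variants on the nested example. It is written for Python 3 with html.parser.HTMLParser rather than SGMLParser, so it is an adaptation of the idea and not the article's class. The names EntryContentParser, extract, depth, and the track_depth switch are hypothetical and exist only for this comparison.

```python
from html.parser import HTMLParser


class EntryContentParser(HTMLParser):
    """Collect <p> text inside <div class='entry-content'>."""

    def __init__(self, track_depth=True):
        super().__init__()
        self.track_depth = track_depth  # hypothetical switch: False mimics the buggy version
        self.found = []
        self.flag = False      # inside the target div
        self.getdata = False   # inside a <p> within the target div
        self.depth = 0         # number of child divs currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.flag:
                self.depth += 1            # entering a child div
            elif ('class', 'entry-content') in attrs:
                self.flag = True           # entering the target div
        elif tag == 'p' and self.flag:
            self.getdata = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.flag:
            if self.track_depth and self.depth > 0:
                self.depth -= 1            # leaving a child div
            else:
                self.flag = False          # leaving the target div
        elif tag == 'p':
            self.getdata = False

    def handle_data(self, data):
        if self.getdata:
            self.found.append(data.strip())


def extract(html, track_depth=True):
    parser = EntryContentParser(track_depth)
    parser.feed(html)
    parser.close()
    return parser.found


nested = ("<div class='entry-content'><div> I'm here to make trouble </div>"
          "<p> Interested in </p></div>")
print(extract(nested, track_depth=False))  # flag only: the <p> is missed
print(extract(nested, track_depth=True))   # with depth counter: the <p> is found
```

Without the counter, the inner </div> resets flag before the <p> is reached; with it, only the </div> at depth 0 does.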
The revised code is as follows:
from sgmllib import SGMLParser

class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        self.verbatim = 0  # nesting depth of child divs
        SGMLParser.reset(self)

    def start_div(self, attrs):
        if self.flag == True:
            self.verbatim += 1  # entered a child div: depth plus 1
            return
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        if self.verbatim == 0:
            self.flag = False
        if self.flag == True:  # left a child div: depth minus 1
            self.verbatim -= 1

    def start_p(self, attrs):
        if self.flag == False:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)

    def printID(self):
        for i in self.IDlist:
            print i
Finally
Once we have created our own class, GetIdList, how do we use it?
Simply create an instance: t = GetIdList()
the_page is a string containing the HTML.
t.feed(the_page)  # parse the HTML
t.printID()  # print the result
The complete test code is:
from sgmllib import SGMLParser

class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        self.verbatim = 0  # nesting depth of child divs
        SGMLParser.reset(self)

    def start_div(self, attrs):
        if self.flag == True:
            self.verbatim += 1  # entered a child div: depth plus 1
            return
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        if self.verbatim == 0:
            self.flag = False
        if self.flag == True:  # left a child div: depth minus 1
            self.verbatim -= 1

    def start_p(self, attrs):
        if self.flag == False:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)

    def printID(self):
        for i in self.IDlist:
            print i

##import urllib2
##import datetime
##vrg = (datetime.date(2012,2,19) - datetime.date.today()).days
##strUrl = 'http://www.nod32id.org/nod32id/%d.html'%(200+vrg)
##req = urllib2.Request(strUrl)# fetch the web page
##response = urllib2.urlopen(req)
##the_page = response.read()
the_page ='''<html>
<head>
<title>test</title>
</head>
<body>
<h1>title</h1>
<div class='entry-content'>
<div class= 'ooxx'> I'm here to make trouble </div>
<p> Content of interest 1</p>
<p> Content of interest 2</p>
...
<p> Content of interest n</p>
<div class= 'ooxx'> I'm here to make trouble 2<div class= 'ooxx'> I'm here to make trouble 3</div></div>
</div>
<div class='content'>
<p> content 1</p>
<p> content 2</p>
...
<p> content n</p>
</div>
</body>
</html>
'''
lister = GetIdList()
lister.feed(the_page)
lister.printID()
Running the code prints:
Content of interest 1
Content of interest 2
Content of interest n
Reference
[1] Dive Into Python: http://help.jb51.net/diveintopython/toc/index.html