Extracting specific data from HTML files with Python

2020-04-02 09:47:45
OfStack

Suppose we have an HTML file with the following structure:

 
<div class='entry-content'> 
<p> Content of interest 1</p> 
<p> Content of interest 2</p> 
 ...  
<p> Content of interest n</p> 
</div> 
<div class='content'> 
<p> content 1</p> 
<p> content 2</p> 
 ...  
<p> content n</p> 
</div>

We want to extract the 'content of interest'.

The text we collect is saved in a list, IDlist.
But how do we recognize that the text we encounter is content of interest, that is, text inside a block like this:

 
<div class='entry-content'> 
<p> The content here </p> 
<p> And here, </p> 
 ...  
<p> And what's going on here </p> 
</div>

The idea is as follows:
On encountering <div class='entry-content'>, set flag = True.
On encountering </div>, set flag = False.
On encountering <p> while flag is True, set getdata = True.
On encountering </p> while getdata is True, set getdata = False.

Python 2 provides the SGMLParser class (in the sgmllib module), which breaks HTML into 8 kinds of data [1] and calls a separate method for each kind. You simply subclass SGMLParser and write handlers for the parts of the page you care about.

The available handlers are as follows:

Start tag

An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser looks for a method called start_tagname or do_tagname. For example, when it finds a <pre> tag, it looks for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and the list of attributes.

End tag

An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser looks for a method called end_tagname. If found, SGMLParser calls this method; otherwise, it calls unknown_endtag with the tag name.
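The name-based lookup described for start and end tags can be modelled in a few lines. This is only a simplified sketch of the idea, not the sgmllib implementation: the class names TagDispatcher and PreWatcher and the methods dispatch_start and dispatch_end are hypothetical and exist purely for illustration.

```python
class TagDispatcher:
    """Simplified model of name-based handler lookup (not sgmllib itself)."""

    def __init__(self):
        self.calls = []  # record of which handler ran, for inspection

    def dispatch_start(self, tag, attrs):
        # Try start_<tag>, then do_<tag>; otherwise fall back to unknown_starttag.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                method(attrs)
                return
        self.unknown_starttag(tag, attrs)

    def dispatch_end(self, tag):
        # Try end_<tag>; otherwise fall back to unknown_endtag.
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        self.calls.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.calls.append(("unknown_endtag", tag))


class PreWatcher(TagDispatcher):
    """Hypothetical subclass that only defines handlers for <pre>."""

    def start_pre(self, attrs):
        self.calls.append(("start_pre", attrs))

    def end_pre(self):
        self.calls.append(("end_pre",))
```

With this model, a <pre> tag reaches start_pre and end_pre, while any tag without a dedicated method, such as <div>, ends up in the unknown_* fallbacks.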

Character reference

An escaped character referenced by its decimal or hexadecimal equivalent, such as &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.

Entity reference

An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.

Comment

An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.

Processing instruction

An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.

Declaration

An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.

Text data

A block of text: anything that does not fit into the other seven categories. When found, SGMLParser calls handle_data with the text.
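To see these categories in action on a current interpreter, here is a minimal sketch. sgmllib only exists in Python 2, so the sketch uses Python 3's html.parser.HTMLParser as a stand-in; it exposes similarly named handle_* methods but routes all tags through handle_starttag and handle_endtag. The class EventLogger and the function log_events are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler fires for each piece of the input."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref and handle_entityref
        # are called instead of the references being converted into text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    """Parse html and return the list of (kind, value) events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Calling log_events("<!DOCTYPE html><p>a&#160;b&copy;</p><!-- note -->") yields a declaration event, the start and end of the p tag, separate data events for "a" and "b", a character reference "160", an entity reference "copy", and a comment.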

Putting this together gives the following code:


from sgmllib import SGMLParser
class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        SGMLParser.reset(self)

    def start_div(self, attrs):
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        self.flag = False

    def start_p(self, attrs):
        if not self.flag:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)


    def printID(self):
        for i in self.IDlist:
            print i

The approach above has a bug: on encountering </div>, it sets flag = False.
What happens if divs are nested?


<div class='entry-content'><div> I'm here to make trouble </div><p> Interested in </p></div>

At the first </div>, flag becomes False, so the content of interest in the <p> that follows is never collected.
What can we do? How do we tell which </div> matches <div class='entry-content'>?
It is simple: every </div> corresponds to a <div>, so we can record the nesting depth. On entering a child div, add 1 to verbatim; on leaving a child div, subtract 1 from verbatim. When verbatim is 0, the </div> belongs to the same level as <div class='entry-content'>.
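The depth-counting idea can be shown on its own, without any HTML parsing. The sketch below is an illustration under its own assumptions: the function name matching_close_index and the "open"/"close" event strings are hypothetical, and the counter here starts at 0 and counts the outermost div as level 1, which differs slightly from the verbatim counter that only counts child divs.

```python
def matching_close_index(events):
    """Return the index of the "close" event that matches the first "open".

    events is a list of "open" / "close" strings, one per <div> or </div>.
    Returns None if the first "open" is never closed.
    """
    depth = 0
    for i, event in enumerate(events):
        if event == "open":
            depth += 1          # went one level deeper
        elif event == "close":
            depth -= 1          # came back up one level
            if depth == 0:      # back at the starting level: this is the match
                return i
    return None
```

For a sequence like ["open", "open", "close", "close"] (an outer div containing one child div), the first "close" at index 2 only brings the depth back to 1, so the matching close is the one at index 3.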

The revised code is as follows:


from sgmllib import SGMLParser
class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        self.verbatim = 0
        SGMLParser.reset(self)

    def start_div(self, attrs):
        if self.flag:
            self.verbatim += 1  # entered a child div: depth plus 1
            return
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        if self.verbatim == 0:
            self.flag = False
        if self.flag:  # left a child div: depth minus 1
            self.verbatim -= 1

    def start_p(self, attrs):
        if not self.flag:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)

    def printID(self):
        for i in self.IDlist:
            print i
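Because sgmllib is not available in Python 3, the same depth-counting approach can be sketched with html.parser.HTMLParser from the Python 3 standard library. This is an adaptation, not the code above: the names EntryContentParser, extract_entry_paragraphs, id_list and depth are hypothetical, and the single depth counter merges flag and verbatim (0 means outside the target div, 1 means directly inside it).

```python
from html.parser import HTMLParser


class EntryContentParser(HTMLParser):
    """Collect <p> text inside <div class='entry-content'>, tolerating nested divs."""

    def __init__(self):
        super().__init__()
        self.id_list = []
        self.depth = 0          # 0 = outside the target div
        self.getdata = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1             # child div inside the target div
            elif dict(attrs).get("class") == "entry-content":
                self.depth = 1              # entered the target div
        elif tag == "p" and self.depth > 0:
            self.getdata = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1                 # reaches 0 on the matching </div>
        elif tag == "p":
            self.getdata = False

    def handle_data(self, data):
        if self.getdata:
            self.id_list.append(data)


def extract_entry_paragraphs(html):
    """Return the text of every <p> found inside the entry-content div."""
    parser = EntryContentParser()
    parser.feed(html)
    parser.close()
    return parser.id_list
```

As in the sgmllib version, a <p> inside a child div of entry-content is also collected, since only the outer div decides whether collection is active.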

Finally, how do we use the GetIdList class once we have written it?
Simply create an instance: t = GetIdList()
the_page is a string containing the HTML.
t.feed(the_page)  # parses the HTML

t.printID()  # prints the result

The complete test code is:


from sgmllib import SGMLParser
class GetIdList(SGMLParser):
    def reset(self):
        self.IDlist = []
        self.flag = False
        self.getdata = False
        self.verbatim = 0
        SGMLParser.reset(self)

    def start_div(self, attrs):
        if self.flag:
            self.verbatim += 1  # entered a child div: depth plus 1
            return
        for k, v in attrs:  # iterate over the div's attributes and their values
            if k == 'class' and v == 'entry-content':  # entered <div class='entry-content'>
                self.flag = True
                return

    def end_div(self):  # encountered </div>
        if self.verbatim == 0:
            self.flag = False
        if self.flag:  # left a child div: depth minus 1
            self.verbatim -= 1

    def start_p(self, attrs):
        if not self.flag:
            return
        self.getdata = True

    def end_p(self):  # encountered </p>
        if self.getdata:
            self.getdata = False

    def handle_data(self, text):  # handle text
        if self.getdata:
            self.IDlist.append(text)

    def printID(self):
        for i in self.IDlist:
            print i

##import urllib2
##import datetime
##vrg = (datetime.date(2012,2,19) - datetime.date.today()).days
##strUrl = 'http://www.nod32id.org/nod32id/%d.html'%(200+vrg)
##req = urllib2.Request(strUrl)# fetch the page from the web
##response = urllib2.urlopen(req)
##the_page = response.read()
the_page ='''<html>
<head>
<title>test</title>
</head>
<body>
<h1>title</h1>
<div class='entry-content'>
<div class= 'ooxx'> I'm here to make trouble </div>
<p> Content of interest 1</p>
<p> Content of interest 2</p>
 ... 
<p> Content of interest n</p>
<div class= 'ooxx'> I'm here to make trouble 2<div class= 'ooxx'> I'm here to make trouble 3</div></div>
</div>
<div class='content'>
<p> content 1</p>
<p> content 2</p>
 ... 
<p> content n</p>
</div>
</body>
</html>
'''
lister = GetIdList()
lister.feed(the_page)
lister.printID()

Running it produces the following output:


 Content of interest 1
 Content of interest 2
 Content of interest n

Reference

[1] Dive Into Python: http://help.jb51.net/diveintopython/toc/index.html