Parsing HTML with Python's HTMLParser: an example

  • 2020-04-02 14:33:12
  • OfStack

A few days ago I needed to extract part of the content of a web page, and found two libraries for the job: urllib and HTMLParser. urllib downloads the page, and HTMLParser then parses it.

A simple example


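# Python 2 (in Python 3 the import is: from html.parser import HTMLParser)
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
  # Called automatically for every start tag found while feed() runs
  def handle_starttag(self, tag, attrs):
    print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')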

In this example, HTMLParser is the base class. MyHTMLParser overrides its handle_starttag method to print the tag name and the current position.

How HTMLParser's methods get called puzzled me for a long time, and it took reading a lot of blog posts to realize that HTMLParser has two kinds of methods: those that must be called explicitly and those that are called for you.

Methods that do not need to be invoked explicitly

The following methods are triggered during parsing. By default they do nothing, so we override them according to our needs.

1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing, e.g. <p class='para'>. The parameter tag is the tag name, here 'p'; attrs is a list of (name, value) pairs for all of the tag's attributes, here [('class', 'para')].

2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The parameter tag is the tag name.

3. HTMLParser.handle_data(data): called when text between tags is encountered, e.g. <style>p {color: blue;}</style>. The parameter data is the content between the opening and closing tags. For <div><p>...</p></div>, it is called for the text inside p, not for div.

There are other handler methods, of course, which I won't cover here.
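
The three callbacks above can be watched firing in order with a minimal sketch. It assumes Python 3, where the module is html.parser rather than HTMLParser; the class name EventRecorder and its events list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records every callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<div><p class='para'>hello</p></div>")
recorder.close()

for event in recorder.events:
    print(event)
```

None of the handle_* methods is called directly here; they all run as a side effect of feed(). The only data event is for the text inside p, since div has no text of its own.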

Methods that must be called explicitly

1. HTMLParser.feed(data): the parameter is the HTML string to parse; parsing starts as soon as the method is called.

2. HTMLParser.getpos(): returns the current line number and offset, e.g. (23, 5).

3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
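
As a rough sketch of how these three methods fit together, the parser below calls getpos() and get_starttag_text() from inside handle_starttag while feed() drives the parsing. It assumes Python 3 (html.parser); PositionParser and its seen list are hypothetical names for illustration.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical helper that notes where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset) of the tag being handled;
        # get_starttag_text() gives its raw source text.
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


parser = PositionParser()
parser.feed("<div>\n  <p class='para'>hello</p>\n</div>")
parser.close()

for tag, position, text in parser.seen:
    print(tag, position, text)
```

Line numbers start at 1 and offsets at 0, so the div on the first line is reported at (1, 0).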

One last thing to note: HTMLParser is a simple module, and its HTML parsing is far from complete. For example, it does not recognize that <br> and <br/> are the same element: the first is reported as a start tag and the second as a start-end tag.


from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
  def handle_starttag(self, tag, attrs):
    print 'begin tag', tag
  # Called for XHTML-style empty tags such as <br/>
  def handle_startendtag(self, tag, attrs):
    print 'begin end tag', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()
parser.feed(str1)    # prints "begin tag br"
parser.feed(str2)    # prints "begin end tag br"
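
One possible workaround, sketched here under assumptions, is to keep your own list of void elements and report <br> and <br/> the same way. The code assumes Python 3 (html.parser); VOID_TAGS is a deliberately small, incomplete set chosen for illustration, and VoidAwareParser is a hypothetical name.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Assumption: a small, incomplete set of void elements, for illustration only.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical helper that treats <br> and <br/> identically."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # A bare <br> arrives here; report known void tags as "startend".
        kind = "startend" if tag in VOID_TAGS else "start"
        self.events.append((kind, tag))

    def handle_startendtag(self, tag, attrs):
        # XHTML-style <br/> arrives here.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        # Ignore stray end tags for void elements such as </br>.
        if tag not in VOID_TAGS:
            self.events.append(("end", tag))


parser = VoidAwareParser()
parser.feed("<p>a<br>b<br/>c</p>")
parser.close()
print(parser.events)
```

Both spellings of br end up as the same "startend" event. For anything beyond simple pages, a full HTML parsing library is likely a better fit than extending this list by hand.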

