Parsing HTML with Python's HTMLParser module to extract URLs: an example

  • 2020-05-07 19:56:17
  • OfStack

HTMLParser is the module Python uses to parse HTML. It can pick out the tags, data, and other elements of an HTML document, which makes it a simple way to process HTML. (HTMLParser is the Python 2 module name; in Python 3 the class lives in html.parser.) HTMLParser is event-driven: when it encounters a particular element, it calls a user-defined method so the program can handle it. The callback methods all have names beginning with handle_ and are members of the HTMLParser class. To use it, derive a new class from HTMLParser and override the handle_ methods you need. These methods include:

handle_startendtag   handles tags that are both start and end tag, for example <xx/>
handle_starttag      handles a start tag, for example <xx>
handle_endtag        handles an end tag, for example </xx>
handle_charref       handles numeric character references, which begin with &# and usually give the character's numeric code
handle_entityref     handles named entity references, which begin with &, such as &nbsp;
handle_data          handles text data, that is, the data between <xx> and </xx>
handle_comment       handles comments
handle_decl          handles declarations beginning with <!, for example <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi            handles processing instructions of the form <?instruction>
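
To see when each of these callbacks fires, the following minimal sketch records every call as a (name, payload) pair. It assumes Python 3, where the class is imported from html.parser; EventLogger and trace are hypothetical names chosen for illustration. Note the assumption about convert_charrefs: with the Python 3 default (True), character and entity references are decoded and delivered through handle_data, so the sketch passes convert_charrefs=False to keep handle_charref and handle_entityref active.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback as a (name, payload) tuple."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref in use;
        # with the default (True) references arrive decoded via handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag))

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


def trace(html):
    """Parse html and return the list of recorded callback events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any buffered data
    return parser.events


if __name__ == "__main__":
    for event in trace("<!DOCTYPE html><p>a&amp;b&#65;<br/><!-- hi --></p>"):
        print(event)
```

Running the script prints one line per callback, in document order, which makes it easy to check which handler a given piece of markup ends up in.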

Here I will use extracting URLs from a web page as an example. To get the URLs, you need to find the <a> tags and read the value of their href attribute. Here is the code:


#-*- encoding: gb2312 -*-
# Python 2 code: in Python 3, HTMLParser lives in the html.parser module
import HTMLParser

class MyParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)

  def handle_starttag(self, tag, attrs):
    # Override the callback that handles start tags
    if tag == 'a':
      # Look through the attributes of the <a> tag
      for name, value in attrs:
        if name == 'href':
          print value


if __name__ == '__main__':
  a = '<html><head><title>test</title></head><body><a href="http://www.163.com">Link to 163</a></body></html>'

  my = MyParser()
  # Feed in the data to be parsed, i.e. the HTML
  my.feed(a)
  # Flush any data still buffered in the parser
  my.close()
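
For readers on Python 3, the same approach might look like the sketch below. It is an illustration under stated assumptions rather than a port of the listing above: it imports HTMLParser from html.parser, collects the href values in a list instead of printing them, and optionally resolves relative links with urllib.parse.urljoin. The names LinkCollector, extract_links, and base_url are hypothetical and do not appear in the original code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags in a list."""

    def __init__(self, base_url=""):
        super().__init__()
        # base_url is an assumed extra parameter used to resolve relative links
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return the list of link targets found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()  # flush any buffered data
    return parser.links


if __name__ == "__main__":
    page = ('<html><head><title>test</title></head><body>'
            '<a href="http://www.163.com">Link to 163</a></body></html>')
    print(extract_links(page))
```

Returning a list rather than printing inside the callback keeps the parser reusable, for example when the URLs are to be filtered or fetched afterwards.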

