Parsing HTML with the Python HTMLParser module: an example of extracting URLs
- 2020-05-07 19:56:17
- OfStack
HTMLParser is the module Python uses to parse HTML (in Python 3 the same class is provided by html.parser). It can pick out the tags, data, and other parts of an HTML document, which makes it an easy way to process HTML. HTMLParser is event-driven: when it finds a particular piece of markup, such as a tag, it calls a user-defined method so the program can handle it. These callback methods have names beginning with handle_ and are members of the HTMLParser class. To use the parser, derive a new class from HTMLParser and override the handle_ methods you need. They include:
handle_startendtag handles tags that open and close in one step (XHTML-style empty tags), for example <xx/>
handle_starttag handles a start tag, for example <xx>
handle_endtag handles an end tag, for example </xx>
handle_charref handles numeric character references, which begin with &# and usually represent a character by its internal code, for example &#62;
handle_entityref handles named character references, which begin with &, for example &nbsp;
handle_data handles data, that is, the text between the tags in <xx>data</xx>
handle_comment handles comments
handle_decl handles declarations that begin with <!, for example <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi handles processing instructions of the form <?instruction>
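To see which callback fires for which piece of markup, a small sketch like the following can help. It assumes Python 3, where HTMLParser is imported from html.parser. It passes convert_charrefs=False because, with the Python 3 default, the parser decodes character references itself and passes them to handle_data instead of calling handle_charref and handle_entityref. EventLogger and log_events are names invented for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every callback the parser fires, in order (illustrative helper)."""

    def __init__(self):
        # Assumption: Python 3. convert_charrefs=False keeps handle_charref and
        # handle_entityref active instead of folding references into handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


def log_events(html):
    """Parse html and return the list of (callback, argument) pairs."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = "<!DOCTYPE html><!-- note --><p>x &amp; &#65;<br/></p><?proc inst>"
    for event in log_events(sample):
        print(event)
```

Running it prints one line per callback, which makes it easy to check how a given document is split into events before writing a real handler.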
As an example, here is how to extract the URLs from a web page. To get a URL, you need to find each <a> tag and read the value of its href attribute. Here is the code:
# Python 3: the HTMLParser class lives in the html.parser module
# (Python 2 used "import HTMLParser" instead).
from html.parser import HTMLParser


class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # Override the callback that handles start tags
        if tag == 'a':
            # Look through the attributes of the <a> tag
            for name, value in attrs:
                if name == 'href':
                    print(value)


if __name__ == '__main__':
    a = '<html><head><title>test</title><body><a href="http://www.163.com"> Link to the 163</a></body></html>'
    my = MyParser()
    # Feed in the data to be parsed, i.e. the HTML
    my.feed(a)
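The example above prints each link as soon as it is found. As a variation, the following sketch collects the href values into a list so the caller can use them afterwards. It assumes Python 3 (HTMLParser imported from html.parser). LinkCollector and extract_links are hypothetical names chosen for this illustration, not part of the library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags into self.links (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this method.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None when an attribute is written without a value
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<html><body><a href="http://www.163.com">Link to 163</a></body></html>'
    print(extract_links(page))
```

Returning a list instead of printing keeps the parsing step separate from whatever is done with the links later, such as filtering them or resolving relative paths.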