Python web page parsing: mastering the third-party lxml extension library and XPath usage

  • 2021-10-25 07:07:49
  • OfStack

Today we look at another extension library, lxml, for parsing web pages. Like other parsing libraries, lxml can parse both HTML and XML files. It can also handle large documents, and its parsing speed is relatively fast.

To use lxml well, you need to know XPath. Elements in an lxml tree are usually located with XPath expressions, so this chapter focuses mainly on XPath syntax.

1. Import the lxml extension library and create an object


# -*- coding: UTF-8 -*-

# Import etree from lxml
from lxml import etree

# First, get the page source that the web page downloader fetched
# Here we use the official sample document directly
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Pass the html_doc string from the web page downloader to etree.HTML, which returns an lxml element object
html = etree.HTML(html_doc)
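
To see what `etree.HTML` returns, it can help to inspect the root element and serialize the tree back to text. The following is a minimal sketch, not part of the original example: the shortened string `sample` and the variable names are assumptions made for illustration, while `etree.tostring` is a standard lxml function.

```python
from lxml import etree

# Shortened sample (an assumption for illustration).
# The closing </body> and </html> tags are deliberately left out.
sample = "<html><head><title>Demo</title></head><body><p class='title'>Hello</p>"

root = etree.HTML(sample)

# The returned object is the root <html> element of the parsed tree
print(root.tag)

# Serializing the tree shows that the HTML parser added the missing closing tags
serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

This also explains why the sample document above parses without trouble even though it never closes its `body` and `html` tags.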

2. Using xpath syntax to extract web page elements

Get elements by node path


# xpath() gets elements by following tag nodes along a path
print(html.xpath('/html/body/p'))
# [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]
print(html.xpath('/html'))
# [<Element html at 0x34bc948>]
# '//' finds a nodes at any depth in the document
print(html.xpath('//a'))
# '/' starts at the document root, so this finds the top-level html node
print(html.xpath('/html'))
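
The results printed above are element objects, so their tags or attributes are easier to read than the memory addresses. The sketch below is an illustration only: the short document `sample` and the variable names are assumptions, not part of the original example. It also contrasts `/`, which steps down one level at a time, with `//`, which searches at any depth.

```python
from lxml import etree

# Small sample document (an assumption for illustration)
sample = """
<html><body>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a></p>
</body></html>
"""
root = etree.HTML(sample)

# An absolute path walks down from the root one level at a time
paragraphs = root.xpath('/html/body/p')
paragraph_tags = [p.tag for p in paragraphs]
print(paragraph_tags)

# '//' matches a elements at any depth below the root
links = root.xpath('//a')
link_ids = [a.get('id') for a in links]
print(link_ids)

# a is not a direct child of html, so this absolute path matches nothing
direct = root.xpath('/html/a')
print(direct)
```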

Get elements by filtering


'''
 Get elements by a single attribute
'''
# Get the a tags whose class attribute equals bro (the sample has none, so the result is [])
print(html.xpath('//a[@class="bro"]'))

# Get the a tag whose id attribute equals link3
print(html.xpath('//a[@id="link3"]'))

'''
 Get elements by multiple attributes
'''
# Get the a tags whose class contains sister and whose id contains link1
print(html.xpath('//a[contains(@class,"sister") and contains(@id,"link1")]'))

# Get the a tags whose class contains bro or whose id contains link1
print(html.xpath('//a[contains(@class,"bro") or contains(@id,"link1")]'))

# Use the last() function to get the last a tag among its siblings
print(html.xpath('//a[last()]'))
# Use the index 1 to get the first a tag (XPath positions start at 1, not 0)
print(html.xpath('//a[1]'))
# Filter with position() to get the first two a tags
print(html.xpath('//a[position() < 3]'))

'''
 Get multiple elements using computed expressions
'''
# Filter with position() to get the first and third a tags
# Operators that can be used in expressions: > , < , = , >= , <= , + , - , and , or
print(html.xpath('//a[position() = 1 or position() = 3]'))
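
Two details of the filters above are easy to miss. An `@class="..."` test compares against the whole attribute value, whereas `contains()` is a substring match. Positional filters such as `last()` are evaluated among the `a` children of each parent, not across the whole document. The sketch below illustrates both points on a hypothetical two-paragraph document; the `sample` string and the `ids` helper are assumptions introduced only for this example.

```python
from lxml import etree

# Hypothetical sample: two paragraphs, and one link that carries two class names
sample = """
<html><body>
<p><a class="sister" id="link1">Elsie</a> <a class="sister big" id="link2">Lacie</a></p>
<p><a class="sister" id="link3">Tillie</a></p>
</body></html>
"""
root = etree.HTML(sample)


def ids(expr):
    """Return the id attribute of every element matched by the XPath expression."""
    return [el.get('id') for el in root.xpath(expr)]


# The whole class attribute must equal "sister", so "sister big" is skipped
exact = ids('//a[@class="sister"]')
# Substring match: any class value that contains "sister" qualifies
partial = ids('//a[contains(@class, "sister")]')
# last() is evaluated per parent: the last a inside each p
last_per_parent = ids('//a[last()]')
# Parentheses collect all a elements first, then take the last one overall
last_overall = ids('(//a)[last()]')

print(exact, partial, last_per_parent, last_overall)
```

In the sample document used in this article all three links share one parent, so `//a[last()]` and `(//a)[last()]` happen to return the same element there.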

Get the attributes and text of an element


'''
 Use @ to get an attribute value and text() to get the tag text
'''
# Get the attribute value
print(html.xpath('//a[position() = 1]/@class'))
# ['sister']
# Get the text of the tag
print(html.xpath('//a[position() = 1]/text()'))
# ['Elsie']
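
Expressions ending in `@attribute` or `text()` return lists of strings rather than element objects. The same values can also be read from an element with lxml's `get()` method and `text` attribute. A minimal sketch, assuming a one-link document (`sample` and the variable names are hypothetical):

```python
from lxml import etree

# One-link sample document (an assumption for illustration)
sample = (
    '<html><body><p>Names: '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '</p></body></html>'
)
root = etree.HTML(sample)

# XPath can return the attribute values and text nodes directly as strings
hrefs = root.xpath('//a/@href')
texts = root.xpath('//a/text()')
print(hrefs, texts)

# Alternatively, select the element and read the values from the object
link = root.xpath('//a')[0]
print(link.get('href'), link.text)
```

Selecting the element first is convenient when several values from the same tag are needed, since one query then serves all of them.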
