Python web page parsing: mastering the third-party lxml library and XPath usage
- 2021-10-25 07:07:49
- OfStack
Today we look at another third-party library, lxml, for parsing web pages. Similarly, lxml can parse HTML and XML documents, copes with large documents, and parses comparatively fast.
To use lxml well, you need to know XPath. Because lxml locates elements mainly through XPath expressions, this chapter focuses on XPath syntax.
1. Import the lxml extension library and create an object
# -*- coding: UTF-8 -*-
# Import etree from lxml
from lxml import etree
# First, get the page source fetched by the web page downloader
# Here we simply use the official sample document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Parse the html_doc string from the web page downloader and return an lxml Element object
html = etree.HTML(html_doc)
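To check what `etree.HTML` returns, a minimal sketch such as the following may help. It uses a shortened stand-in (`snippet`, a name chosen here for illustration) in place of `html_doc`, and leaves `<body>` and `<html>` unclosed, as the sample above does, to show that the parser supplies the missing closing tags.

```python
from lxml import etree

# Shortened stand-in for html_doc (an assumption for illustration);
# note that <body> and <html> are never closed.
snippet = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""

root = etree.HTML(snippet)

# etree.HTML returns the root Element of the parsed tree
print(root.tag)  # html

# Serializing the tree shows the repaired markup, closing tags included
serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

Printing the serialized tree is a quick way to see the structure that later XPath expressions will run against.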
2. Using xpath syntax to extract web page elements
Get elements by node path
# xpath() selects elements by following tag nodes
print(html.xpath('/html/body/p'))
# [<Element p at 0x2ebc908>, <Element p at 0x2ebc8c8>, <Element p at 0x2eb9a48>]
print(html.xpath('/html'))
# [<Element html at 0x34bc948>]
# Find all a nodes among the descendants of the document root
print(html.xpath('//a'))
# Find the html node directly under the document root
print(html.xpath('/html'))
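The difference between an absolute path, a `//` descendant search, and a path relative to an element can be sketched as follows. The document is a trimmed copy of the sample above; the variable names `paragraphs`, `links`, and `body` are only illustrative.

```python
from lxml import etree

# Trimmed copy of the sample document (kept short for illustration)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""
html = etree.HTML(html_doc)

# Absolute path: every step names a direct child of the previous step
paragraphs = html.xpath('/html/body/p')
print(len(paragraphs))  # 3

# '//' searches all descendants, at any depth
links = html.xpath('//a')
print([a.tag for a in links])  # ['a', 'a', 'a']

# A path starting with '.' is evaluated relative to the element it is called on
body = html.xpath('/html/body')[0]
print(len(body.xpath('./a')))   # 0: the a tags are not direct children of body
print(len(body.xpath('.//a')))  # 3: they are descendants of body
```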
Get elements by filtering
'''
Get elements by a single attribute
'''
# Get the a tags whose class attribute equals bro (the sample has none, so this returns an empty list)
print(html.xpath('//a[@class="bro"]'))
# Get the a tag whose id attribute equals link3
print(html.xpath('//a[@id="link3"]'))
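As a rough check of what these two attribute filters return, the sketch below runs them against a trimmed copy of the sample links. The names `no_match` and `tillie` are illustrative only.

```python
from lxml import etree

# Trimmed copy of the sample links (kept short for illustration)
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>
"""
html = etree.HTML(html_doc)

# No a tag has class="bro", so xpath() returns an empty list rather than raising
no_match = html.xpath('//a[@class="bro"]')
print(no_match)  # []

# Exactly one a tag has id="link3"; xpath() still returns a list
tillie = html.xpath('//a[@id="link3"]')
print(len(tillie))            # 1
print(tillie[0].text)         # Tillie
print(tillie[0].get('href'))  # http://example.com/tillie
```

Because `xpath()` always returns a list for node selections, it is usually worth checking that the list is non-empty before indexing into it.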
'''
Get elements by multiple attributes
'''
# Get the a tags whose class contains sister and whose id contains link1
print(html.xpath('//a[contains(@class,"sister") and contains(@id,"link1")]'))
# Get the a tags whose class contains bro or whose id contains link1
print(html.xpath('//a[contains(@class,"bro") or contains(@id,"link1")]'))
# Use last() to get the last a tag among its siblings
print(html.xpath('//a[last()]'))
# Use the index 1 to get the first a tag among its siblings (XPath positions start at 1)
print(html.xpath('//a[1]'))
# Filter with position() to get the first two a tags
print(html.xpath('//a[position() < 3]'))
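One detail worth keeping in mind: in XPath, a positional predicate such as `[1]` or `[last()]` is counted among siblings under the same parent, not across the whole document. In the sample above all three links share one parent, so the difference does not show. The sketch below uses a small hypothetical document (`doc`, invented for illustration) in which the links are spread over two paragraphs, and compares `//a[1]` with the parenthesized form `(//a)[1]`.

```python
from lxml import etree

# Hypothetical document: links spread over two parents
doc = '<div><p><a>one</a><a>two</a></p><p><a>three</a></p></div>'
tree = etree.HTML(doc)

# //a[1]: every a that is the first a under its own parent
first_per_parent = tree.xpath('//a[1]/text()')
print(first_per_parent)  # ['one', 'three']

# (//a)[1]: the first a in the whole document
first_overall = tree.xpath('(//a)[1]/text()')
print(first_overall)     # ['one']

# The same distinction applies to last()
last_per_parent = tree.xpath('//a[last()]/text()')
print(last_per_parent)   # ['two', 'three']
last_overall = tree.xpath('(//a)[last()]/text()')
print(last_overall)      # ['three']
```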
'''
Get multiple elements using computed expressions
'''
# Filter with position() to get the first and third a tags
# Operators that can be used in expressions: > , < , = , >= , <= , + , - , and , or
print(html.xpath('//a[position() = 1 or position() = 3]'))
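A few more computed predicates, run against a trimmed copy of the sample links, might look like the sketch below. In addition to the operators listed above, it uses `mod`, which is a standard XPath 1.0 operator; the variable names are illustrative.

```python
from lxml import etree

# Trimmed copy of the sample links (kept short for illustration)
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>
"""
html = etree.HTML(html_doc)

# First and third links, combined with 'or'
first_and_third = html.xpath('//a[position() = 1 or position() = 3]/text()')
print(first_and_third)  # ['Elsie', 'Tillie']

# Odd positions, using the 'mod' operator
odd = html.xpath('//a[position() mod 2 = 1]/text()')
print(odd)              # ['Elsie', 'Tillie']

# Everything from the second link onward
from_second = html.xpath('//a[position() >= 2]/text()')
print(from_second)      # ['Lacie', 'Tillie']

# Arithmetic on last(): the second-to-last link
second_to_last = html.xpath('//a[position() = last() - 1]/text()')
print(second_to_last)   # ['Lacie']
```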
Get the attributes and text of an element
'''
Use @ to get an attribute value and text() to get the tag text
'''
# Get the attribute value
print(html.xpath('//a[position() = 1]/@class'))
# ['sister']
# Get the text of the tag
print(html.xpath('//a[position() = 1]/text()'))
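The sketch below collects a few common ways of pulling attributes and text out of a trimmed copy of the sample. It also shows one case that can be surprising: `text()` only returns text nodes that are direct children of the selected element, so text wrapped in a nested tag such as `<b>` needs `//text()` or `string()`. The variable names are illustrative.

```python
from lxml import etree

# Trimmed copy of the sample document (kept short for illustration)
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
html = etree.HTML(html_doc)

# @attr returns a list of attribute values, one per matching element
hrefs = html.xpath('//a/@href')
print(hrefs)  # ['http://example.com/elsie', 'http://example.com/lacie']

# text() returns only the direct text children of the selected element
direct = html.xpath('//p[@class="title"]/text()')
print(direct)  # []: the text sits inside <b>, not directly in <p>

# //text() descends into nested tags
nested = html.xpath('//p[@class="title"]//text()')
print(nested)  # ["The Dormouse's story"]

# string() returns a single string with all descendant text concatenated
title = html.xpath('string(//p[@class="title"])')
print(title)   # The Dormouse's story

# The same values are available through the Element API
first_link = html.xpath('//a')[0]
print(first_link.get('href'), first_link.text)
```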