XPath: an essential parsing library for Python crawlers
- 2021-11-10 10:01:18
- OfStack
1. Introduction
XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
Why use an XPath parsing library: regular expressions are often used for data parsing, but it is hard to write one that matches precisely. If the pattern is even slightly wrong, the matched data will be wrong as well.
A web page is made up of three parts: HTML, CSS, and JavaScript. The HTML tags are nested hierarchically and form the DOM tree. To obtain the target data, you can locate tags by their position in that hierarchy and then read their text or attributes.
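As a minimal sketch of locating a tag through the hierarchy and then reading its text or an attribute, the snippet below parses a small HTML fragment with lxml. The fragment, its class name, and its link are invented for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical HTML fragment, invented purely for illustration.
html = """
<html>
  <body>
    <div class="post">
      <a href="/item/1"><span>First item</span></a>
    </div>
  </body>
</html>
"""

tree = etree.HTML(html)

# Walk down the hierarchy: div (matched by class) -> a -> span, then take the text.
texts = tree.xpath('//div[@class="post"]/a/span/text()')

# Same path, but stop at <a> and read its href attribute instead.
links = tree.xpath('//div[@class="post"]/a/@href')

print(texts)
print(links)
```

Both calls return lists, because an XPath expression can match any number of nodes.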
2. Install
pip install lxml
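One way to confirm that the installation worked is to import the package and print its version information. This is only a quick check, not a required step.

```python
from lxml import etree

# LXML_VERSION is a tuple describing the installed lxml release.
print("lxml version:", etree.LXML_VERSION)

# LIBXML_VERSION is the version of the underlying libxml2 C library in use.
print("libxml2 version:", etree.LIBXML_VERSION)
```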
3. Nodes
3.1 Select Nodes
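XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or a series of steps. The most useful path expressions are listed below:
Expression | Description |
---|---|
nodename | Selects the child elements of the current node that have this name. |
/ | Selects from the root node. |
// | Selects matching nodes anywhere in the document below the current node, regardless of their position. |
.. | Selects the parent of the current node. |
. | Selects the current node. |
@ | Selects attributes. |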
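The expressions in the table can be tried out with lxml as sketched below. The sample document is a shortened bookstore, and the `lang` attribute is added here as an assumption so that `@` has something to select.

```python
from lxml import etree

# Shortened sample document; the lang attribute is an illustrative addition.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)

# "/" starts at the root node and follows an absolute path.
from_root = root.xpath('/bookstore/book')

# "//" matches nodes anywhere in the document, regardless of depth.
anywhere = root.xpath('//price')

# A bare node name selects matching children of the context node (here: root).
by_name = root.xpath('book')

title = root.xpath('//title')[0]

# "." is the context node itself, ".." is its parent.
current = title.xpath('.')
parent = title.xpath('..')

# "@" selects an attribute; the result is a list of attribute values.
lang = root.xpath('//title/@lang')

print(from_root[0].tag, anywhere[0].text, by_name[0].tag)
print(current[0].tag, parent[0].tag, lang)
```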
3.2 Select an unknown node
XPath wildcards can be used to select nodes whose names are not known in advance.
Wildcard | Description |
---|---|
* | Matches any element node. |
@* | Matches any attribute node. |
node() | Matches any node of any kind. |
The following table lists some path expressions and their results:
Path expression | Result |
---|---|
/bookstore/* | Selects all child elements of the bookstore element. |
//* | Selects all elements in the document. |
//title[@*] | Selects all title elements that have at least one attribute. |
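A short sketch of the wildcard expressions run with lxml follows. The two-book document is an assumption made for illustration; only the first title carries an attribute, so `//title[@*]` matches just one element.

```python
from lxml import etree

# Illustrative document: only the first <title> has an attribute.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)

# All child elements of <bookstore>.
children = root.xpath('/bookstore/*')

# Every element in the document, including the root.
all_elements = root.xpath('//*')

# Only <title> elements that carry at least one attribute.
titles_with_attrs = root.xpath('//title[@*]')

# node() also returns text nodes, such as the whitespace between tags.
first_book_nodes = root.xpath('/bookstore/book[1]/node()')
first_book_elements = [n for n in first_book_nodes if hasattr(n, 'tag')]

print([e.tag for e in children])
print(len(all_elements))
print([t.text for t in titles_with_attrs])
print([e.tag for e in first_book_elements])
```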
3.3 Node relationship
Parent
Each element and attribute has one parent.
In the following example, the book element is the parent of the title, author, year, and price elements:
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
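With lxml, the parent of a node can be reached through the XPath `..` step or through the element's `getparent()` method. A minimal sketch using the book sample:

```python
from lxml import etree

xml = """
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
"""
book = etree.fromstring(xml)
title = book.find('title')

# XPath: ".." steps up from the context node to its parent.
parent_via_xpath = title.xpath('..')[0]

# lxml API: getparent() returns the same parent element.
parent_via_api = title.getparent()

print(parent_via_xpath.tag, parent_via_api.tag)
```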
Children
An element node can have zero, one, or more children.
In the following example, the title, author, year, and price elements are all children of the book element:
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
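The children of an element can be listed either with the `child::` axis (or simply `*`) or by iterating over the element in lxml. A small sketch on the same book sample:

```python
from lxml import etree

xml = """
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
"""
book = etree.fromstring(xml)

# XPath: child::* selects every child element; plain "*" is the short form.
children_via_xpath = [e.tag for e in book.xpath('child::*')]

# lxml API: iterating over an element yields its direct children.
children_via_api = [e.tag for e in book]

print(children_via_xpath)
print(children_via_api)
```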
Sibling
Nodes that have the same parent.
In the following example, the title, author, year, and price elements are siblings:
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
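Siblings can be selected with the `following-sibling::` and `preceding-sibling::` axes. lxml also offers `getnext()` and `getprevious()` for the immediate neighbours. A brief sketch:

```python
from lxml import etree

xml = """
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
"""
book = etree.fromstring(xml)
title = book.find('title')
year = book.find('year')

# Every sibling element that comes after <title>.
after_title = [e.tag for e in title.xpath('following-sibling::*')]

# Every sibling element that comes before <year>.
before_year = [e.tag for e in year.xpath('preceding-sibling::*')]

# Immediate neighbours of <year> through the lxml element API.
neighbours = (year.getprevious().tag, year.getnext().tag)

print(after_title)
print(before_year)
print(neighbours)
```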
Ancestor
A node's parent, the parent's parent, and so on.
In the following example, the ancestors of the title element are the book element and the bookstore element:
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
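The `ancestor::` axis collects all ancestors of a node. lxml's `iterancestors()` does the same but walks upward from the nearest ancestor. A minimal sketch on the bookstore sample:

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)
title = root.xpath('//title')[0]

# XPath results come back in document order: outermost ancestor first.
ancestors_via_xpath = [e.tag for e in title.xpath('ancestor::*')]

# iterancestors() walks upward, so the nearest ancestor comes first.
ancestors_via_api = [e.tag for e in title.iterancestors()]

print(ancestors_via_xpath)
print(ancestors_via_api)
```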
Descendant
Children of a node, children of children, and so on.
In the following example, the descendants of bookstore are book, title, author, year, and price elements:
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
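Descendants can be selected with the `descendant::` axis or iterated with lxml's `iterdescendants()`. A short sketch on the bookstore sample:

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""
root = etree.fromstring(xml)

# XPath: descendant::* selects children, grandchildren, and so on.
descendants_via_xpath = [e.tag for e in root.xpath('descendant::*')]

# lxml API: iterdescendants() yields the same elements in document order.
descendants_via_api = [e.tag for e in root.iterdescendants()]

print(descendants_via_xpath)
print(descendants_via_api)
```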
4. XPath example
Crawl videos from Qiushibaike (the "Encyclopedia of Embarrassment")
import os

import requests
# Import the lxml parser
from lxml import etree

base_url = 'https://www.qiushibaike.com/video/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
res = requests.get(url=base_url, headers=headers)
html = res.content.decode('utf-8')
# Parse the page so it can be queried with XPath
tree = etree.HTML(html)
# Titles
content = tree.xpath('//*/a/div[@class="content"]/span/text()')
# Video URLs
video_list = tree.xpath('//*/video[@controls="controls"]/source/@src')
# Make sure the output directory exists before writing files
os.makedirs('Video', exist_ok=True)
index = 0
for i in video_list:
    # Download the video as a binary stream
    video_content = requests.get(url='https:' + i, headers=headers).content
    # Title of the current video (assumes titles and videos line up one to one)
    title_1 = content[index].strip('\n')
    # Write the binary video data to a file
    with open(f'Video/{title_1}.mp4', 'wb') as f:
        f.write(video_content)
    index += 1
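The two XPath expressions can be tried without any network access on a hand-written fragment. The sketch below assumes a page structure that mirrors those expressions. The HTML, the `cdn.example.com` URLs, and the helper `build_jobs` are all hypothetical and are not taken from the real site.

```python
from lxml import etree

# Hypothetical fragment shaped to match the XPath expressions used above.
SAMPLE = """
<html><body>
  <div class="article">
    <a href="/article/1"><div class="content"><span>
Funny clip
</span></div></a>
    <video controls="controls"><source src="//cdn.example.com/1.mp4"/></video>
  </div>
  <div class="article">
    <a href="/article/2"><div class="content"><span>
Second clip
</span></div></a>
    <video controls="controls"><source src="//cdn.example.com/2.mp4"/></video>
  </div>
</body></html>
"""


def build_jobs(titles, sources):
    """Pair each title with its video URL (hypothetical helper).

    Returns a list of (output_path, download_url) tuples. zip() stops at the
    shorter list, so a missing title or video cannot cause an IndexError.
    """
    jobs = []
    for title, src in zip(titles, sources):
        jobs.append((f'Video/{title.strip()}.mp4', 'https:' + src))
    return jobs


tree = etree.HTML(SAMPLE)
titles = tree.xpath('//*/a/div[@class="content"]/span/text()')
sources = tree.xpath('//*/video[@controls="controls"]/source/@src')
jobs = build_jobs(titles, sources)

for path, url in jobs:
    print(path, url)
```

Pairing the lists with `zip()` is a design choice made for this sketch. It keeps each title aligned with its video without a manual counter.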