Installing and Using the Python BeautifulSoup Library
- 2021-08-28 20:42:04
- OfStack
1. Introduction to BeautifulSoup
Like lxml, Beautiful Soup 4 is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, it uses Python's built-in parser. The lxml parser is more powerful and faster, so it is the recommended choice.
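As a rough sketch of choosing a parser explicitly, the snippet below tries lxml first and falls back to the standard library's html.parser when lxml is not installed. The helper name make_soup and the sample markup are made up for illustration and are not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Naming the parser explicitly also keeps results consistent between machines that have different parsers installed.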
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.
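A minimal sketch of stating the original encoding, assuming the input arrives as ISO-8859-1 bytes with no declared encoding. The sample markup is invented. from_encoding is the constructor argument for this, and original_encoding reports the encoding Beautiful Soup ended up using.

```python
from bs4 import BeautifulSoup

# Bytes in ISO-8859-1 with no encoding declaration (sample data)
raw = "<p>café</p>".encode("iso-8859-1")

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

print(soup.original_encoding)   # the encoding used to decode the input
print(soup.p.string)            # the text is now a Unicode string

# Serialise the document back to UTF-8 bytes
utf8_bytes = soup.encode("utf-8")
print(utf8_bytes)
```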
2. Installation of BeautifulSoup
First, we need to install the BeautifulSoup library. I am using Python 3, so it can be installed directly from cmd with the pip3 command.
Command:
pip3 install beautifulsoup4
After installing BeautifulSoup, we can determine whether the installation was successful by importing the library.
Command:
>>> from bs4 import BeautifulSoup
If no error is reported after pressing Enter, the installation was successful.
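Another quick check, assuming the import above worked, is to print the installed version and parse a tiny document. The sample markup is only for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Version string of the installed beautifulsoup4 package
print(bs4.__version__)

# Parse a trivial document with the built-in parser as a smoke test
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```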
3. Common functions of BeautifulSoup
# Beautiful Soup is a Python library for extracting information from web pages
# A BeautifulSoup object represents the full contents of a document
# prettify() returns the document formatted with standard indentation
# get_text() strips all tags from the document and returns a string containing only the text
from bs4 import BeautifulSoup
text='''
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
'''
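# Create the BeautifulSoup object; naming the parser avoids a warning
# ("lxml" can be used instead of "html.parser" if it is installed)
bf = BeautifulSoup(text, "html.parser")
# Output in standard indented format
print(bf.prettify())
# Strip all tags from the document and return a string containing only the text
print(bf.get_text())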
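get_text() also accepts a separator and a strip flag, which helps when the raw text is full of newlines. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = "<div><p>Hello</p>\n<p>World</p></div>"
soup = BeautifulSoup(markup, "html.parser")

# Default: text pieces are concatenated exactly as they appear
raw_text = soup.get_text()

# Strip each piece, drop whitespace-only pieces, join with a single space
clean_text = soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```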
# Tag object
# A Tag represents a single tag in the HTML document
# Its two main properties are name and attrs
tag = bf.title  # get the first title tag
print(tag)
print(type(tag))  # type of tag
print(tag.name)  # tag name
print(tag.attrs)  # tag attributes, as a dict
print(tag.attrs["lang"])  # get a single attribute, method 1
print(bf.title["lang"])  # get a single attribute, method 2
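Indexing a tag with an attribute it does not have raises KeyError. The sketch below, using made-up markup, shows the more forgiving get() and has_attr() alternatives.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title lang="eng">Harry Potter</title>', "html.parser")
tag = soup.title

lang = tag.get("lang")             # value of an existing attribute
missing = tag.get("id")            # None when the attribute is absent
fallback = tag.get("id", "no-id")  # or a default of your choice
has_lang = tag.has_attr("lang")

try:
    tag["id"]                      # dictionary-style access raises KeyError
    raised = False
except KeyError:
    raised = True

print(lang, missing, fallback, has_lang, raised)
```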
# NavigableString: tag.string
# Represents the text inside a tag
print(tag.string)
print(type(tag.string))  # check the data type
# Comment: the comment part of a document
# A special type of NavigableString object
# Its output does not include the comment delimiters
string='''
<p><!-- This is a note! --></p>
'''
sp = BeautifulSoup(string, "html.parser")
print(sp)
print(sp.p.string)  # the comment text, without <!-- and -->
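Because a comment's .string looks like ordinary text when printed, it can be worth checking the type before using it. A minimal sketch, with a second paragraph added to the sample for contrast:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p><!-- This is a note! --></p><p>Plain text</p>"
soup = BeautifulSoup(markup, "html.parser")

for p in soup.find_all("p"):
    # Comment is a subclass of NavigableString, so isinstance tells them apart
    kind = "comment" if isinstance(p.string, Comment) else "text"
    print(kind, repr(str(p.string)))
```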
# Two commonly used functions
# find_all() searches all tag descendants of the current tag and checks whether each one matches the given conditions
# The result is a list, which may contain multiple elements
print(bf.find_all('title'), end="\n-------\n")
# find() returns the first matching element directly
print(bf.find("title"))
print(bf.find_all("title", lang="eng"))  # find title tags whose lang attribute is eng
print(bf.find_all("title", {"lang": "eng"}))  # same result as above
print(bf.find_all(["title", "price"]))  # match several tag names at once
print(bf.find_all("title", lang="eng")[0].get_text())  # get the text of the first match
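A few more search patterns, sketched on a trimmed copy of the bookstore sample: the limit argument, the None that find() returns when nothing matches, and turning the results into plain Python values. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample = """
<bookstore>
<book><title lang="eng">Harry Potter</title><price>29.99</price></book>
<book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
soup = BeautifulSoup(sample, "html.parser")

# Collect the text of every matching tag
titles = [t.get_text() for t in soup.find_all("title", attrs={"lang": "eng"})]

# Stop after the first match but still get a list back
first_only = soup.find_all("title", limit=1)

# find() returns None when there is no match, so check before using it
author = soup.find("author")

# Convert the extracted text to numbers
prices = [float(p.get_text()) for p in soup.find_all("price")]

print(titles, len(first_only), author, prices)
```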
# Three common kinds of nodes
# Child nodes: a Tag may contain multiple strings or other tags; these are all children of that tag
# Parent node: every tag or string has a parent node, the tag that contains it
# Sibling nodes: nodes at the same level under the same parent
end = "\n-------\n"
print(bf.book, end=end)  # get the first book node
print(bf.book.contents, end=end)  # list of all child nodes of book
print(bf.book.contents[1], end=end)  # the child node at index 1 of book
print(bf.book.children, end=end)  # children is an iterator, not a list
for child in bf.book.children:
    print("===", child)
print(bf.title.parent, end=end)
print(bf.book.parent, end=end)
for parent in bf.title.parents:  # note the difference between parent and parents
    print("===", parent.name)
print(bf.title.next_sibling, end=end)  # the next sibling of the node
print(bf.title.previous_sibling, end=end)  # the previous sibling of the node
print(bf.title.next_siblings, end=end)  # generator over all following siblings of the node
for i in bf.title.next_siblings:
    print("===", i)
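When the markup contains line breaks, the children and siblings of a tag include the newline strings between tags. A small sketch, on a single book element, of skipping them:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = """<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>"""
soup = BeautifulSoup(markup, "html.parser")

# next_sibling of title is the newline string, not the price tag
print(repr(soup.title.next_sibling))

# Keep only the children that are tags
child_tags = [c.name for c in soup.book.children if isinstance(c, Tag)]

# find_next_sibling() skips strings and returns the next tag
next_tag = soup.title.find_next_sibling()

# parents walks up to the BeautifulSoup object itself
ancestors = [p.name for p in soup.title.parents]

print(child_tags, next_tag.name, ancestors)
```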
That covers the installation and basic use of the Python BeautifulSoup library.