Installing and Using the Python BeautifulSoup Library

  • 2021-08-28 20:42:04
  • OfStack

1. Introduction to BeautifulSoup

Like lxml, Beautiful Soup 4 is an HTML/XML parser. Its main job is parsing HTML/XML documents and extracting data from them.

BeautifulSoup supports the HTML parser in the Python standard library (html.parser) and also supports some third-party parsers. If no third-party parser is installed, BeautifulSoup falls back to Python's built-in parser. The lxml parser is faster and handles more kinds of documents, so it is the recommended choice.
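As a minimal sketch of choosing a parser, the second argument of the BeautifulSoup constructor names the parser explicitly. The example assumes only that beautifulsoup4 is installed and treats lxml as optional; the helper name make_soup is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is available, else with html.parser."""
    try:
        # "lxml" needs the third-party lxml package (pip3 install lxml).
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # "html.parser" ships with Python, so this fallback needs no extra install.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Naming the parser also keeps results consistent between machines, since different parsers can build slightly different trees from the same malformed markup.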

Beautiful Soup automatically converts the input document to Unicode and the output document to UTF-8. You don't need to think about encodings unless the document doesn't declare one and Beautiful Soup can't detect it automatically. In that case, you only need to state the original encoding.
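A small sketch of stating the original encoding, assuming the input arrives as ISO-8859-1 bytes with no encoding declaration. The sample markup is invented for illustration; from_encoding and original_encoding are the names Beautiful Soup uses for the hint and for the encoding it settled on.

```python
from bs4 import BeautifulSoup

# Bytes encoded as ISO-8859-1, with nothing in the markup that says so.
raw = "<p>café</p>".encode("iso-8859-1")

# from_encoding tells Beautiful Soup the original encoding up front.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

print(soup.p.string)           # the text is now a Unicode string
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
print(soup.p.encode("utf-8"))  # bytes output, encoded as UTF-8
```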

2. Installation of BeautifulSoup

First, install the BeautifulSoup library. I am using Python 3, so it can be installed directly from the command prompt (cmd) with the pip3 command.

Command:


pip3 install beautifulsoup4

After installing BeautifulSoup, we can determine whether the installation was successful by importing the library.

Command:

>>> from bs4 import BeautifulSoup

If no error is reported after pressing Enter, the library has been installed successfully.
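For a slightly more informative check, the installed version can be printed as well. This is only a sketch; the value shown depends on your installation.

```python
import bs4

# __version__ is a string such as "4.x.y"; the exact value depends on your install.
print(bs4.__version__)
```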

3. Common functions of BeautifulSoup


# Beautiful Soup is a Python library for extracting information from web pages
# A BeautifulSoup object represents the full contents of a document
# prettify() returns the document formatted with standard indentation
# get_text() strips all tags from the document and returns a string containing only the text
from bs4 import BeautifulSoup

text='''
<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
 <title lang="eng">Harry Potter</title>
 <price>29.99</price>
</book>

<book>
 <title lang="eng">Learning XML</title>
 <price>39.95</price>
</book>

</bookstore>
'''

# Create the BeautifulSoup object; naming the parser avoids the "no parser was explicitly specified" warning
bf=BeautifulSoup(text,"html.parser")

# Output with standard indentation
print(bf.prettify())
# Strip all tags from the document and return only the text
print(bf.get_text())

# Tag object
# A Tag corresponds to a tag in the HTML/XML document
# Its two most important attributes are name and attrs

tag=bf.title # Get the first title tag
print(tag)
print(type(tag)) # Type of tag
print(tag.name) # Tag name
print(tag.attrs) # Tag attributes, as a dict
print(tag.attrs["lang"]) # Get a single attribute, method 1
print(bf.title["lang"]) # Get a single attribute, method 2

# NavigableString: tag.string
# Represents the text inside a tag
print(tag.string)
print(type(tag.string)) # View the data type

# Comment: the comment part of a document
# A special type of NavigableString object
# Its output does not include the comment delimiters
string='''
<p><!--  This is a note!  --></p>
'''
sp=BeautifulSoup(string,"html.parser")
print(sp)
print(sp.p.string) # Get the text in the tag (here, the comment text)

# Two commonly used functions

soup=bf # The examples below refer to the parsed bookstore document as soup

# find_all() searches all descendants of the current tag and checks whether each one matches the given conditions
# The result is a list, which can contain multiple elements
print(soup.find_all('title'),end="\n-------\n")

# find() returns the first matching element directly
print(soup.find("title"))

print(soup.find_all("title",lang="eng")) # Find title tags with the attribute lang=eng
print(soup.find_all("title",{"lang":"eng"})) # Same result as above
print(soup.find_all(["title","price"])) # Get several kinds of tags at once
print(soup.find_all("title",lang="eng")[0].get_text()) # Get the text


# Three common kinds of nodes
#   Child nodes: a Tag may contain multiple strings or other tags; these are all children of that tag
#   Parent node: every tag or string inside the document has a parent, the tag that contains it
#   Sibling nodes: nodes at the same level
end="\n-------\n"
print(soup.book,end) # Get the first book node
print(soup.book.contents,end) # Get all child nodes of book as a list
print(soup.book.contents[1],end) # The child at index 1 (index 0 is the whitespace before title)

print(soup.book.children,end) # children is an iterator
for child in soup.book.children:
  print("===",child)
  
print(soup.title.parent,end)
print(soup.book.parent,end)
for parent in soup.title.parents: # Note the difference between parent and parents
  print("===",parent.name)
  
print(soup.title.next_sibling,end) # Get the next sibling of the node (here, whitespace text)
print(soup.title.previous_sibling,end) # Get the previous sibling of the node
print(soup.title.next_siblings,end) # next_siblings is a generator over all following siblings
for i in soup.title.next_siblings: 
  print("===",i)
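The functions listed above can be combined into a small extraction routine. The following is a sketch rather than a definitive recipe: it reuses the bookstore sample, assumes the built-in html.parser, and the helper name list_books is made up for illustration. It also uses find_next_sibling(), which skips the whitespace text nodes that next_sibling returns.

```python
from bs4 import BeautifulSoup

xml_text = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>29.99</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>39.95</price>
</book>
</bookstore>
"""

soup = BeautifulSoup(xml_text, "html.parser")


def list_books(doc):
    """Return (title, price) tuples for every <book> element in doc."""
    books = []
    for book in doc.find_all("book"):
        # find() returns the first match inside this particular book.
        title = book.find("title").get_text(strip=True)
        price = float(book.find("price").get_text(strip=True))
        books.append((title, price))
    return books


books = list_books(soup)
for title, price in books:
    print(f"{title}: {price:.2f}")

# find_next_sibling() jumps straight to the next matching tag,
# ignoring the whitespace string between <title> and <price>.
first_price = soup.title.find_next_sibling("price").get_text()
print(first_price)
```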

That covers the installation and basic use of the Python BeautifulSoup library.

