Creating objects with the Beautiful Soup module in Python, explained

  • 2020-05-27 06:09:04
  • OfStack

Installation

Install the Beautiful Soup module through pip: pip install beautifulsoup4

If you use the PyCharm IDE, you can also open Preferences, go to the Project settings, search for the Beautiful Soup package, and install it from there.
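
To confirm that the installation worked, a quick check along the following lines can help. It is only a sketch; the one thing worth noting is that the package installed as beautifulsoup4 is imported under the name bs4.

```python
# Quick check that the package installed by pip is importable.
# The distribution is called "beautifulsoup4", but the import name is "bs4".
import bs4

print(bs4.__version__)  # version string of the installed package
```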

Create the BeautifulSoup object

The Beautiful Soup module is widely used to extract data from web pages. We can use it to pull any data out of an HTML/XML document, for example all the links in a web page or the content of a particular tag.

To achieve this, Beautiful Soup provides different objects and methods. Any HTML/XML document can be converted into Beautiful Soup objects, each with its own properties and methods, from which we can extract the required data.

Beautiful Soup has the following three types of objects:

  • BeautifulSoup
  • Tag
  • NavigableString
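
As a rough illustration of how the three object types relate, the sketch below parses a one-line document and prints the type of each object it reaches. The parser name html.parser and the variable names are choices made for this example, not something the article prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Parse a tiny document with the standard-library parser.
soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
paragraph = soup.p        # a Tag inside the document
text = paragraph.string   # the text held by that tag

for obj in (soup, paragraph, text):
    print(type(obj).__name__)
# → BeautifulSoup
# → Tag
# → NavigableString
```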

Create the BeautifulSoup object

Creating a BeautifulSoup object is the starting point for any Beautiful Soup project.

A BeautifulSoup object is created by passing either a string or a file-like object, such as a file on the local machine or a web page, to the constructor.

Create a BeautifulSoup object from a string

The object is created by passing a string to BeautifulSoup's constructor.


from bs4 import BeautifulSoup

helloworld = '<p>Hello World</p>'
soup_string = BeautifulSoup(helloworld)
print(soup_string)
# Output when lxml is the selected parser:
# <html><body><p>Hello World</p></body></html>
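
The html and body wrapper in the output above is added by the parser that happened to be selected, not by the input string. A minimal sketch, assuming the standard-library html.parser is named explicitly as the second argument, shows that this parser leaves the fragment as written:

```python
from bs4 import BeautifulSoup

helloworld = "<p>Hello World</p>"

# Naming the parser makes the result independent of what else is installed.
soup_string = BeautifulSoup(helloworld, "html.parser")
print(soup_string)  # → <p>Hello World</p>
```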

Create BeautifulSoup objects from file-like objects

We can also create the object by passing a file-like object to BeautifulSoup's constructor. This is useful when parsing an online web page.


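from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.glumes.com"
page = urlopen(url)  # the HTTP response is a file-like object
soup = BeautifulSoup(page)
print(soup)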

Besides file-like objects such as HTTP responses, we can also pass a local file object to BeautifulSoup's constructor.


with open('foo.html', 'r') as foo_file:
    soup_foo = BeautifulSoup(foo_file)
print(soup_foo)
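
To try the file-like route without a network connection or an existing foo.html, something like the following sketch may be useful. It assumes html.parser as the parser and uses an in-memory stream plus a throwaway temporary file; both are stand-ins invented for this example.

```python
import io
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<p>Hello from a file</p>"

# An in-memory text stream is a file-like object too.
soup_stream = BeautifulSoup(io.StringIO(markup), "html.parser")

# Write a throwaway file so the example does not depend on foo.html existing.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(markup)
    tmp_path = tmp.name

try:
    with open(tmp_path, "r", encoding="utf-8") as html_file:
        soup_file = BeautifulSoup(html_file, "html.parser")
finally:
    os.remove(tmp_path)  # clean up the temporary file

print(soup_stream.p.string)
print(soup_file.p.string)
```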

Create a BeautifulSoup object for XML parsing

The Beautiful Soup module can also be used to parse XML.

When a BeautifulSoup object is created, the module selects an appropriate TreeBuilder class to build the HTML/XML tree. By default it selects an HTML TreeBuilder, which uses the default HTML parser to produce an HTML tree. In the code above, the BeautifulSoup object was generated from a string, which was parsed into an HTML tree.

If we want Beautiful Soup to parse the input as XML, we need to say so explicitly through the features parameter of the BeautifulSoup constructor. Given the requested features, Beautiful Soup selects the most suitable TreeBuilder class that provides them.

Understand the features parameter

Each TreeBuilder has different characteristics depending on the parser it uses. Therefore, the same input can produce different results depending on the features parameter passed to the constructor.
In the Beautiful Soup module, the TreeBuilder classes currently use the following parsers:

  • lxml
  • html5lib
  • html.parser

The features parameter of the BeautifulSoup constructor can accept either a list of strings or a string value.
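
Both forms of the features parameter can be tried side by side. The sketch below assumes html.parser, since it ships with Python, and inspects the builder attribute of the resulting object to see which TreeBuilder was picked. Depending on the installed version, Beautiful Soup may print a warning that it had to guess the parser when features is given as a list; the string form with an exact parser name avoids that.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello World</p>"

# features given as a single string
soup_from_string = BeautifulSoup(markup, features="html.parser")

# features given as a list of strings
soup_from_list = BeautifulSoup(markup, features=["html.parser"])

print(soup_from_string.builder.NAME)
print(soup_from_list.builder.NAME)
```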

Currently, the features and the parser supported by each TreeBuilder are shown in the following table:

Features                                               TreeBuilder            Parser
['lxml', 'html', 'fast', 'permissive']                 LXMLTreeBuilder        lxml
['html', 'html5lib', 'permissive', 'strict', 'html5']  HTML5TreeBuilder       html5lib
['html', 'strict', 'html.parser']                      HTMLParserTreeBuilder  html.parser
['xml', 'lxml', 'permissive', 'fast']                  LXMLTreeBuilderForXML  lxml

Depending on the features parameter specified, Beautiful Soup selects the most appropriate TreeBuilder class. If you request a particular parser and the following error message appears, you probably need to install that parser.


bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. 
Do you need to install a parser library?
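
If installing the missing parser is not an option, one possible approach is to catch the exception and fall back to a parser that is always present. The helper make_soup below is hypothetical, written only for this illustration; FeatureNotFound is the exception class named in the error message above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib", fallback="html.parser"):
    """Hypothetical helper: try the preferred parser, else use the fallback."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not installed.
        return BeautifulSoup(markup, fallback)


soup = make_soup("<p>Hello World</p>")
print(soup.builder.NAME)  # html5lib if it is installed, otherwise html.parser
print(soup.p.string)
```

Silent fallbacks can hide differences between parsers, so in real code it may be preferable to log which parser was actually used.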

For HTML documents, TreeBuilder classes are tried in the priority order of the parsers listed in the table above: lxml first, then html5lib, and finally html.parser. For example, if we pass the string html as the features parameter, Beautiful Soup selects LXMLTreeBuilder if the lxml parser is available. If lxml is not available, it selects HTML5TreeBuilder, which relies on the html5lib parser. If that is not available either, it selects HTMLParserTreeBuilder, which relies on html.parser.
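
The selection can be observed on your own machine with a sketch like the one below. It passes the generic html feature and then prints which TreeBuilder was chosen, so the result depends on which parser libraries are installed. Beautiful Soup may warn that it had to choose a parser, so the example silences warnings for this one call.

```python
import warnings

from bs4 import BeautifulSoup

with warnings.catch_warnings():
    # Asking for the generic "html" feature can trigger a "guessed the parser"
    # warning; ignore it for this demonstration only.
    warnings.simplefilter("ignore")
    soup = BeautifulSoup("<p>Hello World</p>", "html")

print(type(soup.builder).__name__)  # class of the TreeBuilder that was picked
print(soup.builder.NAME)            # name of the parser behind it
```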

As for XML, since lxml is the only supported XML parser, LXMLTreeBuilderForXML is always selected.

So, the code to create a BeautifulSoup object for XML is as follows:


from bs4 import BeautifulSoup

helloworld = '<p>Hello World</p>'
soup_string = BeautifulSoup(helloworld, features="xml")
print(soup_string)

The output is also an XML document:


<?xml version="1.0" encoding="utf-8"?>
<p>Hello World</p>
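
One visible difference between the two modes is how tag names are treated. The sketch below, which uses a made-up Catalog/Item document, parses the same markup once with html.parser and once with features="xml". The XML part needs lxml, so it is guarded in case that library is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<Catalog><Item>Hello World</Item></Catalog>"  # invented sample data

# The HTML parser normalizes tag names to lowercase.
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup)  # → <catalog><item>Hello World</item></catalog>

try:
    xml_soup = BeautifulSoup(markup, features="xml")
except FeatureNotFound:
    xml_soup = None  # lxml is not installed
    print("No XML parser available; install lxml to run this part.")
else:
    print(xml_soup.is_xml)        # True when an XML TreeBuilder was used
    print(xml_soup.find("Item"))  # XML parsing keeps the original case
```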

When creating BeautifulSoup objects, it is good practice to specify the parser explicitly, because the resulting tree can differ considerably from parser to parser, especially when the HTML document is malformed.
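
To see how much the result can vary, the following sketch feeds the same malformed fragment to every parser that happens to be installed and prints what each one produces. Parsers that are not installed are simply skipped, so the number of lines printed depends on your environment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # an opening <a> followed by a stray closing </p>

results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        continue  # this parser library is not installed

for parser, rendered in results.items():
    print(parser, "->", rendered)
```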

When we create a BeautifulSoup object, Tag and NavigableString objects are created along with it.

Create the Tag object

We can get a Tag object from the BeautifulSoup object; it represents a tag in the HTML/XML document.

The example code is shown below:


#!/usr/bin/python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
html_atag = """
 <html>
 <body>
 <p>Test html a tag example</p>
 <a href="http://www.glumes.com">Home</a>
 <a href="http://www.glumes.com/index.html">Blog</a>
 </body>
 </html>
 """
soup = BeautifulSoup(html_atag, 'html.parser')
atag = soup.a
print(type(atag))
print(atag)

You can see from the results that the type of atag is <class 'bs4.element.Tag'>. The result of soup.a is the first <a> tag in the HTML document.
An HTML/XML tag has a name and attributes. The name is the name of the tag; for example, the name of the <a> tag is a. Attributes are things like class, id, and style. The Tag object lets us get both the name and the attributes of an HTML tag.
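
The point that soup.a returns only the first matching tag can be checked against find_all, which returns every match. This is a small sketch that reuses the article's sample document with html.parser; the variable names are chosen for the example.

```python
from bs4 import BeautifulSoup

html_atag = """
<html>
<body>
<p>Test html a tag example</p>
<a href="http://www.glumes.com">Home</a>
<a href="http://www.glumes.com/index.html">Blog</a>
</body>
</html>
"""
soup = BeautifulSoup(html_atag, "html.parser")

first_link = soup.a             # shortcut for the first <a> tag
all_links = soup.find_all("a")  # every <a> tag in document order

print(first_link)                  # → <a href="http://www.glumes.com">Home</a>
print(first_link is all_links[0])  # → True
print([link["href"] for link in all_links])
```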

The name of the Tag object

Get the name of a Tag object through .name.


tagname = atag.name
print(tagname)

You can also change the name of the Tag object:


atag.name = 'p'

This changes the first <a> tag in the HTML document above into a <p> tag.
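
The effect of renaming is easiest to see on a one-tag document. The sketch below, assuming html.parser, reads the name, changes it, and prints the document again; note that the attributes and text stay in place while the tag itself changes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://www.glumes.com">Home</a>', "html.parser")
atag = soup.a

print(atag.name)  # → a

# Renaming the Tag object rewrites the tag in the parsed document.
atag.name = "p"

print(soup)    # → <p href="http://www.glumes.com">Home</p>
print(soup.a)  # → None
```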

Attributes of the Tag object

In an HTML page, tags may have different attributes, such as class, id, and style. The Tag object gives access to a tag's attributes in the form of a dictionary.


atag = soup.a
print(atag['href'])

All attributes can also be accessed through .attrs, which returns them as a dictionary:


print(atag.attrs)
{'href': 'http://www.glumes.com'}
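
A few more dictionary-style operations are sketched below, assuming html.parser. The class values and the id home-link are invented for the example. Two details are worth knowing: a multi-valued attribute such as class comes back as a list, and get returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="nav main" href="http://www.glumes.com">Home</a>', "html.parser"
)
atag = soup.a

print(atag["href"])    # → http://www.glumes.com
print(atag["class"])   # → ['nav', 'main']
print(atag.get("id"))  # → None

# Attributes can be added or changed like dictionary entries.
atag["id"] = "home-link"  # hypothetical id, for illustration only
print(atag.attrs)
```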

Create the NavigableString object

The NavigableString object holds the text content of an HTML or XML tag. It is a Unicode string.

The text content of a tag can be obtained through .string.


first_a_string = soup.a.string
print(first_a_string)
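
The following sketch, assuming html.parser and a made-up two-tag document, shows what .string returns in two situations: a tag that contains only text, and a tag that mixes text with a child tag. In the second case .string is None, and get_text() can be used to collect the text instead.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    "<p>Hello World</p><div>Hello <b>again</b></div>", "html.parser"
)

text = soup.p.string
print(type(text).__name__)    # → NavigableString
print(isinstance(text, str))  # → True
print(text.parent.name)       # → p

# A tag with more than one child has no single string.
print(soup.div.string)      # → None
print(soup.div.get_text())  # → Hello again
```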

Summary

The code is summarized as follows:

BeautifulSoup

soup = BeautifulSoup(String)
soup = BeautifulSoup(String, features="xml")

Tag

tag = soup.tag
tag.name
tag['attribute']

NavigableString

soup.tag.string

Conclusion

That covers creating objects with the Beautiful Soup module in Python. I hope this article is of some help in your study or work. If you have any questions, feel free to leave a comment.
