An example of parsing HTML in Python with BeautifulSoup

  • 2020-06-12 09:57:07
  • OfStack

Introduction

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and gives users the data they want to extract. Because it is so simple, a complete application can be written with very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.
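As a minimal sketch of stating the original encoding, the snippet below passes from_encoding to the constructor. The sample bytes and the choice of Latin-1 are assumptions made for illustration, and html.parser (Python's built-in parser) is used so that nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declared in the markup.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()   # decoded to a Unicode str
output = soup.p.encode()   # serialized to bytes, UTF-8 by default

print(text)
print(output)
```

Inside the program the text is an ordinary str; only the serialized output is bytes.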

Beautiful Soup works with parsers such as lxml and html5lib, so users can choose between different parsing strategies or greater speed.

This article gives a detailed introduction to parsing HTML in Python with BeautifulSoup. Without further ado, let's get started:

1. Install beautifulsoup4


pip install beautifulsoup4
pip install lxml
pip install html5lib

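lxml and html5lib are parsers that Beautiful Soup can use. Python's built-in html.parser also works without installing either of them.

2. The HTML file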
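To see how the choice of parser can affect the result, the following sketch parses the same broken fragment with each parser. The fragment `<a></p>` is an illustrative input and not part of this article's example; a parser that is not installed is reported instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid markup (illustrative input)
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, markup in results.items():
    print(parser, "->", markup if markup is not None else "not installed")
```

Each parser repairs invalid markup in its own way, so the printed trees can differ. For well-formed pages the differences are usually minor.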


<!-- This is the example.html file. -->
 
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com" rel="external nofollow" >my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

Save the HTML above as example.html.

3. Start parsing


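import bs4

# Open example.html and parse it with the html5lib parser;
# the with block closes the file automatically.
with open('example.html') as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html5lib')

# select() takes a CSS selector and returns a list-like ResultSet of matches
elems = exampleSoup.select('#author')
print(elems[0].get_text())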

The result is Al Sweigart
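Indexing the result with [0] raises an IndexError when nothing matches the selector. A minimal sketch of a more defensive lookup follows, assuming the markup is passed in as a string. The helper text_of and the #editor id are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

html = '<p>By <span id="author">Al Sweigart</span></p>'
soup = BeautifulSoup(html, "html.parser")

def text_of(selector):
    """Return the text of the first match, or None if nothing matches."""
    elem = soup.select_one(selector)  # first matching element, or None
    return elem.get_text() if elem is not None else None

author = text_of("#author")
editor = text_of("#editor")  # no such id on this page

print(author)
print(editor)
```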

BeautifulSoup uses the select() method to find elements. It takes a selector similar to a CSS selector:

  • soup.select('div'): all <div> elements
  • soup.select('#author'): the element whose id is author
  • soup.select('.notice'): all elements whose class is notice
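These selector forms can be tried against the example page. The sketch below embeds the page as a string so that it runs without the file, and uses the built-in html.parser. Since the page has no element with class notice, the class selector shown is .slogan instead.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")                    # by tag name
slogan = soup.select(".slogan")[0].get_text()    # by class
author = soup.select("#author")[0].get_text()    # by id
link = soup.select("p a")[0]["href"]             # <a> nested inside a <p>

print(len(paragraphs))
print(slogan)
print(author)
print(link)
```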

Reference: "Automate the Boring Stuff with Python" by Al Sweigart.
