Python crawler tutorial (IV): the HTML text parsing library BeautifulSoup

  • 2020-06-01 10:17:44
  • OfStack

Preface

The third article in this Python crawler series introduced Requests, an excellent library for making network requests. Once a request returns, the target data has to be extracted from the response. The content returned by different websites comes in several formats: one is JSON, another is XML, and the most common is the HTML document. Today we will talk about how to extract the data of interest from HTML.

Write your own HTML parser? Or use regular expressions? Neither is the best approach. Fortunately, the Python community has long had a mature solution for this kind of problem: BeautifulSoup.

BeautifulSoup is a Python library for parsing HTML documents. With BeautifulSoup you can extract anything of interest from HTML with very little code. In addition, it is quite tolerant of malformed markup and can correctly handle an incomplete HTML document.

Install BeautifulSoup


pip install beautifulsoup4

BeautifulSoup 3 is no longer officially maintained, so you should install the latest version, BeautifulSoup 4.

HTML tags

Before learning BeautifulSoup 4, it is necessary to have a basic understanding of HTML documents. As the following code shows, HTML has a tree-like structure.


<html> 
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p> How to use BeautifulSoup</p>
 </body>
</html> 
  • The document is made up of many tags (Tag); html, head, title, and so on are all tags.
  • A pair of tags forms a node; for example, <html>...</html> is the root node.
  • Nodes are related to each other; for example, h1 and p are neighbours, i.e. adjacent sibling (sibling) nodes.
  • h1 is a direct child (children) node of body, and also a descendant (descendants) node of html.
  • body is the parent (parent) node of p, and html is an ancestor (parents) node of p.
  • A string nested between tags is a special child node of that node; for example, "hello, world" is also a node, it simply has no name.

Using BeautifulSoup

Building a BeautifulSoup object takes two arguments: the first is the HTML text string to parse, and the second tells BeautifulSoup which parser to use to parse the HTML.

The parser is responsible for parsing the HTML into the corresponding objects, while BeautifulSoup is responsible for manipulating that data. "html.parser" is Python's built-in parser; lxml is a C-based parser that is faster, but it requires a separate installation.
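For instance, a minimal sketch of choosing a parser; html_doc here is just a throwaway string, and the lxml line only works after a separate pip install lxml:


from bs4 import BeautifulSoup

html_doc = "<html><body><p>hello</p></body></html>"

# Python's built-in parser: no extra dependency
soup = BeautifulSoup(html_doc, "html.parser")

# lxml is faster, but must be installed separately (pip install lxml)
# soup = BeautifulSoup(html_doc, "lxml")

print(soup.p.string)  # hello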

The BeautifulSoup object can be used to locate any label node in HTML.


from bs4 import BeautifulSoup 
text = """
<html> 
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">How to use BeautifulSoup</p>
  <p class="big" id="key1">The second p tag</p>
  <a href="http://foofish.net">python</a>
 </body>
</html> 
"""
soup = BeautifulSoup(text, "html.parser")

# title tag
>>> soup.title
<title>hello, world</title>

# p tag
>>> soup.p
<p class="bold">How to use BeautifulSoup</p>

# content of the p tag
>>> soup.p.string
u'How to use BeautifulSoup'

BeautifulSoup abstracts an HTML document into four main kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. Each tag node is a Tag object, a NavigableString is usually the string wrapped inside a Tag, and the BeautifulSoup object represents the entire HTML document. For example:


>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> type(soup.h1)
<class 'bs4.element.Tag'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Tag

Each Tag has a name that corresponds to the HTML tag name.


>>> soup.h1.name
u'h1'
>>> soup.p.name
u'p'

A tag can also have attributes, which are accessed with dictionary-like syntax; a multi-valued attribute such as class is returned as a list:


>>> soup.p['class']
[u'bold']
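By contrast, a single-valued attribute such as the href of the a tag in the sample document comes back as a plain string (output shown is indicative):


>>> soup.a['href']
u'http://foofish.net'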

NavigableString

You can get the content of a tag directly with .string, which is a NavigableString object that you can explicitly convert to a unicode string.


>>> soup.p.string
u'How to use BeautifulSoup'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> unicode_str = unicode(soup.p.string)
>>> unicode_str
u'How to use BeautifulSoup'

Having covered the basic concepts, we can now get to the main topic: how do we find the data we care about in the HTML? BeautifulSoup offers two approaches, traversing the document tree and searching it, and the two are often combined to complete a lookup.

Traverse the document tree

Traversing the document tree, as the name implies, means walking down from the root html tag until the target element is found. One drawback of traversal is that if the content you are looking for is near the end of the document, you have to walk through the whole document to reach it, which is slow. It therefore usually needs to be combined with the second approach, searching.

A tag node can be obtained directly through dot access on its tag name, for example:

Get the body tag:


>>> soup.body
<body>
  <h1>BeautifulSoup</h1>
  <p class="bold">How to use BeautifulSoup</p>
  <p class="big" id="key1">The second p tag</p>
  <a href="http://foofish.net">python</a>
 </body>

Get the p tag


>>> soup.body.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

Gets the content of the p tag


>>> soup.body.p.string
u'How to use BeautifulSoup'

As mentioned earlier, the content is itself a node and can be obtained with .string. Another drawback of traversing the document tree is that you can only ever get the first matching child node. For example, when there are two adjacent p tags, the second one cannot be reached through .p; you need the next_sibling attribute to move on to the adjacent following node, as in the sketch below. There are also a number of less commonly used properties, such as .contents for all child nodes and .parent for the parent node; refer to the official documentation for more.
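A minimal sketch with the soup built earlier; note that the whitespace between the two p tags is itself a child node, so next_sibling is applied twice here to skip over it:


>>> soup.p.next_sibling.next_sibling
<p class="big" id="key1">The second p tag</p>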

Search document tree

Searching the document tree means looking up elements by tag name, and you can pinpoint a node by also specifying the tag's attribute values. The two most commonly used methods are find and find_all, and both can be called on BeautifulSoup and Tag objects.

find_all()


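Its signature looks roughly like this (in recent releases the text filter is named string):


find_all(name, attrs, recursive, text, limit, **kwargs)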

The return value of find_all is a list of Tag objects. The method can be called very flexibly, and all of its parameters are optional.

The first parameter, name, is the name of the tag node.


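For instance, with the soup object built from the sample document above (output is indicative):


# all title tags
>>> soup.find_all("title")
[<title>hello, world</title>]

# all p tags
>>> soup.find_all("p")
[<p class="bold">How to use BeautifulSoup</p>, <p class="big" id="key1">The second p tag</p>]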

The second parameter is the class attribute value of the tag


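A sketch with the same soup, passing the class value as the second positional argument:


# p tags whose class is "big"
>>> soup.find_all("p", "big")
[<p class="big" id="key1">The second p tag</p>]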

Equivalent to


>>> soup.find_all("p", class_="big")
[<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]

Because class is a Python keyword, it is written as class_ here.

The keyword arguments (kwargs) are tag attribute name-value pairs. For example, to find the tag whose href attribute is "http://foofish.net":


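For example, roughly:


>>> soup.find_all(href="http://foofish.net")
[<a href="http://foofish.net">python</a>]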

Of course, it also supports regular expressions


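For instance, matching any href that starts with http, using re.compile from the standard library (output is indicative):


>>> import re
>>> soup.find_all(href=re.compile("^http"))
[<a href="http://foofish.net">python</a>]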

Besides a concrete value or a regular expression, an attribute can also be given as a Boolean (True/False), meaning that the attribute is present or absent.


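A sketch of both forms against the sample soup:


# tags that have an id attribute, whatever its value
>>> soup.find_all(id=True)
[<p class="big" id="key1">The second p tag</p>]

# tags whose id attribute equals "key1"
>>> soup.find_all(id="key1")
[<p class="big" id="key1">The second p tag</p>]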

Traversal and search can also be combined: first locate the body tag to narrow the search scope, then look for the a tag inside body.


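Roughly like this; body_tag is then reused in the find() examples below:


>>> body_tag = soup.body
>>> body_tag.find_all("a")
[<a href="http://foofish.net">python</a>]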

find()

The find method is similar to find_all, except that it returns a single Tag object instead of a list, and returns None if no matching node is found. If multiple tags match, only the first one is returned.


>>> body_tag.find("a")
<a href="http://foofish.net" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >python</a>
>>> body_tag.find("p")
<p class="bold">\xc8\xe7\xba\xce\xca\xb9\xd3\xc3BeautifulSoup</p>

get_text()

Besides .string, you can use the get_text method to obtain the content of a tag. The difference is that .string returns a NavigableString object, while get_text returns a unicode string.


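For example (illustrative; on Python 3 the same call returns a str):


>>> soup.p.get_text()
u'How to use BeautifulSoup'
>>> type(soup.p.get_text())
<type 'unicode'>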

In real-world scenarios, we usually use the get_text method to obtain the text content of a tag.

Conclusion

BeautifulSoup is a Python library for working with HTML documents. When initializing BeautifulSoup, you specify the HTML document string and the specific parser to use. Its commonly used object types are Tag, NavigableString, and BeautifulSoup. There are two ways to locate HTML elements: traversing the document tree and searching the document tree.

