How to use BeautifulSoup in Python

  • 2020-04-02 13:16:28
  • OfStack

Let's go straight to an example:


#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="//www.jb51.net" class="sister" id="link1">Elsie</a>,
<a href="//www.jb51.net" class="sister" id="link2">Lacie</a> and
<a href="//www.jb51.net" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
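# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, "html.parser")
# print(...) with a single argument works in both Python 2 and Python 3
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.p)
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
print(soup.get_text())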

The result is:


<title>The Dormouse's story</title>
title
The Dormouse's story
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>
[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>, <a class="sister" href="//www.jb51.net" id="link3">Tillie</a>]
<a class="sister" href="//www.jb51.net" id="link3">Tillie</a>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

As you can see, soup is the parsed document that BeautifulSoup builds from the HTML string. soup.title gets the title tag, and soup.p gets the first p tag in the document; to get all of them, use the find_all function.
find_all returns a list that you can loop through to pick out each item in turn.
get_text() returns only the text, with the tags stripped, and it works on any tag object that BeautifulSoup produces. You can try print(soup.p.get_text()).
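
The looping and get_text() calls described above can be sketched as follows. This is a minimal illustration, assuming bs4 is installed and using Python's built-in html.parser; the HTML is a shortened stand-in for the sample document, and the variable names are only for this example.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document, for illustration only.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="//www.jb51.net" class="sister" id="link1">Elsie</a>,
<a href="//www.jb51.net" class="sister" id="link2">Lacie</a> and
<a href="//www.jb51.net" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result, so it can be looped over.
names = [a.get_text() for a in soup.find_all("a")]
print(names)

# get_text() works on any tag, not only on the whole soup.
story_text = soup.p.get_text()
print(story_text)
```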
For example, to get the value of the href attribute of the a tag, use print(soup.a['href']). Other attributes, such as class, are read the same way (soup.a['class']).
Tags such as head can be reached directly by name, as in soup.head, just like soup.title and soup.p above.
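
Here is a small sketch of attribute access, assuming bs4 with the built-in html.parser and a single a tag copied from the sample. One detail worth knowing: class can hold several values, so BeautifulSoup returns it as a list rather than a string. The title attribute below is deliberately absent, to show what get() does for a missing attribute.

```python
from bs4 import BeautifulSoup

# One link from the sample document, for illustration only.
html_doc = '<a href="//www.jb51.net" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

href = link["href"]          # a plain string
classes = link["class"]      # a list, because class is multi-valued
missing = link.get("title")  # get() returns None instead of raising KeyError
all_attrs = link.attrs       # every attribute as a dict

print(href)
print(classes)
print(missing)
print(all_attrs)
```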
How do you get the contents of a tag as a list? Use the contents attribute: print(soup.head.contents) returns all the children of head as a list,
so you can index it with [num] and read a child's tag name with .name.
You can also get a tag's children with children, but print(soup.head.children) does not print a list; it prints an iterator object such as <listiterator object at 0x108e6d150>.
You can turn it into a list with list(), or iterate over the children with a for statement.
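
The difference between contents and children might look like this in practice. The sketch assumes bs4 with the built-in html.parser and a one-line document, so that head has exactly one child and no whitespace nodes.

```python
from bs4 import BeautifulSoup

# Written on one line so that head contains only the title tag.
html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# contents is a real list, so it supports indexing.
head_contents = soup.head.contents
first_child_name = head_contents[0].name

# children is an iterator: loop over it, or wrap it in list().
children = soup.head.children
child_names = [child.name for child in children]
children_as_list = list(soup.head.children)

print(first_child_name)
print(child_names)
print(children_as_list)
```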
As for the string attribute: if a tag has more than one child, it returns None; otherwise it returns the string itself. For example, print(soup.title.string) returns The Dormouse's story.
If a tag contains more than one string, use strings instead.
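
A short sketch of string versus strings, assuming bs4 with the built-in html.parser. The fragment below is made up for the example: the title has a single string, while the p tag mixes text and tags. stripped_strings is a related attribute that trims whitespace from each piece.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: one tag with a single string, one with mixed children.
html_doc = (
    "<title>The Dormouse's story</title>"
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

title_string = soup.title.string         # the only child is a string
story_string = soup.p.string             # several children, so None
story_strings = list(soup.p.strings)     # every string, in document order
stripped = list(soup.p.stripped_strings) # same, with whitespace trimmed

print(title_string)
print(story_string)
print(story_strings)
print(stripped)
```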
Use the parent attribute to move up to the enclosing tag, or parents to walk through all of its ancestors.
Use next_sibling to find the next sibling and previous_sibling to find the previous one; to get all of them, use the plural forms next_siblings and previous_siblings.
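
Moving up and sideways could be sketched like this, assuming bs4 with the built-in html.parser. The three links are written without any text between them; in the full sample document the siblings of an a tag also include the text nodes (commas and line breaks) that sit between the links.

```python
from bs4 import BeautifulSoup

# Links placed back to back so that every sibling is a tag, not a text node.
html_doc = (
    '<p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
link2 = soup.find(id="link2")

parent_name = link2.parent.name
# parents walks up to the top; the soup object itself is named "[document]".
ancestor_names = [ancestor.name for ancestor in link2.parents]

next_id = link2.next_sibling["id"]
previous_id = link2.previous_sibling["id"]

link1 = soup.find(id="link1")
following_ids = [sibling["id"] for sibling in link1.next_siblings]

print(parent_name)
print(ancestor_names)
print(next_id)
print(previous_id)
print(following_ids)
```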

How do you search the tree?

Use the find_all function:


find_all(name, attrs, recursive, text, limit, **kwargs)

Examples:


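print(soup.find_all('title'))        # by tag name
print(soup.find_all('p','title'))    # p tags whose class is "title"
print(soup.find_all('a'))
print(soup.find_all(id="link2"))     # by attribute value
print(soup.find_all(id=True))        # any tag that has an id attribute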

The return value is:


[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>, <a class="sister" href="//www.jb51.net" id="link3">Tillie</a>]
[<a class="sister" href="//www.jb51.net" id="link2">Lacie</a>]
[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>, <a class="sister" href="//www.jb51.net" id="link3">Tillie</a>]
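
The signature also lists a recursive parameter that the examples above do not exercise. As a hedged sketch (assuming bs4 with the built-in html.parser and a one-line stand-in document): by default find_all looks at all descendants of a tag, while recursive=False restricts the search to its direct children.

```python
from bs4 import BeautifulSoup

# One-line stand-in document: title is a grandchild of html, head is a child.
html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default: search every descendant of the html tag.
deep = soup.html.find_all("title")

# recursive=False: only direct children of html are considered.
shallow = soup.html.find_all("title", recursive=False)
direct_heads = soup.html.find_all("head", recursive=False)

print(deep)
print(shallow)
print(direct_heads)
```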

Search by CSS class. Here is an example:


print(soup.find_all("a", class_="sister"))
print(soup.select("p.title"))

Search by attribute:


print(soup.find_all("a", attrs={"class": "sister"}))

Search by text


print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

Limit the number of results


print(soup.find_all("a", limit=2))

The result is:


[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>, <a class="sister" href="//www.jb51.net" id="link3">Tillie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>, <a class="sister" href="//www.jb51.net" id="link3">Tillie</a>]
[u'Elsie']
[u'Elsie', u'Lacie', u'Tillie']
[<a class="sister" href="//www.jb51.net" id="link1">Elsie</a>, <a class="sister" href="//www.jb51.net" id="link2">Lacie</a>]
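
select accepts more than a tag and a class. The sketch below tries a few other CSS selectors, assuming bs4 (with its CSS selector support available) and the built-in html.parser; the HTML is a shortened stand-in for the sample document. select_one returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="//www.jb51.net" class="sister" id="link1">Elsie</a>,
<a href="//www.jb51.net" class="sister" id="link2">Lacie</a> and
<a href="//www.jb51.net" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# tag.class selector, as in the article's select("p.title") example
sister_ids = [a["id"] for a in soup.select("a.sister")]

# #id selector
lacie = soup.select("#link2")[0].get_text()

# child combinator; select_one returns the first match only
first_story_link = soup.select_one("p.story > a")["id"]

print(sister_ids)
print(lacie)
print(first_story_link)
```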

In short, these functions let you find whatever you need in the document.

