Two solutions to garbled Chinese output in Python BeautifulSoup

2020-04-02 13:36:53
OfStack

Solution 1:

I used Python's BeautifulSoup to fetch a web page and print its title, but the output was always garbled. It took a long time to find a fix, which I am sharing below.
Here is the code first:


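# Python 2 code (urllib2 and the print statement do not exist in Python 3)
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.jb51.net/'  # urlopen needs an explicit scheme
page = urllib2.urlopen(url)

soup = BeautifulSoup(page, from_encoding="utf8")
print soup.original_encoding
# Re-encode the title for a GBK console before printing
print (soup.title).encode('gb18030')

# Writing the title to a file works without re-encoding
with open("title.txt", "w") as f:
    f.write(str(soup.title))

# Print the target of every link that has an href attribute
for link in soup.find_all('a', href=True):
    print link['href']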

When I first tested this, the console output was garbled, but the text written to the file was fine. I then found the explanation online.
How print handles an object: it internally calls the object's __str__ to get the corresponding string, so here it is soup's __str__ that gets called. In Python 2 with bs4, that method returns a byte string, UTF-8 by default.
The CMD console on a Chinese-language Windows system uses GBK, so the output only displays correctly once it is re-encoded as gb18030, which is a superset of GBK.
That is the purpose of this line of code:


print (soup.title).encode('gb18030')
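
The effect described above can be reproduced without a network connection. The following is a minimal sketch in Python 3 syntax (the article's own code is Python 2) that uses only the standard library. The sample title string is made up for illustration, and decoding with gbk stands in for what a GBK console does with the bytes it receives.

```python
# Illustrative sample only; not fetched from any real page.
title = "<title>脚本之家</title>"

# Bytes as a UTF-8 encoder would produce them.
utf8_bytes = title.encode("utf-8")

# A GBK console interprets those bytes as GBK, which yields the wrong characters.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Re-encoding as gb18030 produces bytes that a GBK console reads correctly,
# because these characters have the same byte sequence in both encodings.
gb_bytes = title.encode("gb18030")
shown = gb_bytes.decode("gbk")

print(garbled)
print(shown)
```

This also suggests why the file looked fine: the bytes in title.txt are stored untouched, and only the console's interpretation of them was wrong.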

Solution 2:

When BeautifulSoup parses a utf-8 encoded web page, Chinese text comes out garbled if fromEncoding is not specified or is set to utf-8.

The fix is to set the fromEncoding parameter of the BeautifulSoup constructor to gb18030. (fromEncoding and originalEncoding are the BeautifulSoup 3 names; bs4 calls them from_encoding and original_encoding.)


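# Python 2 code using BeautifulSoup 3 (the BeautifulSoup package, not bs4)
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.jb51.net/')  # urlopen needs an explicit scheme
# Tell the parser to decode the page as gb18030
soup = BeautifulSoup(page, fromEncoding="gb18030")
print soup.originalEncoding
print soup.prettify()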
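
One case in which specifying gb18030 helps is when the page bytes are actually in a GB-family encoding, since gb18030 can decode anything that is valid GB2312 or GBK. The sketch below illustrates that case in Python 3 syntax with the standard library only. The helper decode_page and the sample text are hypothetical and are not part of BeautifulSoup; whether this matches the page in the article is an assumption.

```python
def decode_page(raw_bytes, encoding="gb18030"):
    """Hypothetical helper: decode raw page bytes with the given encoding."""
    return raw_bytes.decode(encoding)


# Illustrative sample text, encoded the way a GB2312 or GBK page would be.
sample = "中文标题"
gbk_bytes = sample.encode("gbk")
gb2312_bytes = sample.encode("gb2312")

# Treating GBK bytes as UTF-8 fails, which is the kind of mismatch
# that shows up as garbled text in a parser.
try:
    gbk_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# gb18030 decodes both GB2312 and GBK bytes back to the original text.
print(decode_page(gbk_bytes))
print(decode_page(gb2312_bytes))
print(utf8_ok)
```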