Two solutions to the garbled Chinese output problem in Python BeautifulSoup
- 2020-04-02 13:36:53
- OfStack
Solution 1:
I used Python's BeautifulSoup to fetch a web page and print its title, but the output was always garbled. It took me a long time to find a solution, which I share below.
First, the code:
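# Python 2 code (urllib2 and the print statement do not exist in Python 3)
from bs4 import BeautifulSoup
import urllib2

# The URL appears without a scheme in the source; urllib2 needs a full URL
url = '//www.jb51.net/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, from_encoding="utf8")
print soup.original_encoding
# Re-encode the title so that a GBK console can display it
print soup.title.encode('gb18030')
# Writing the title to a file works without re-encoding
with open("title.txt", "w") as f:
    f.write(str(soup.title))
# Print the target of every link that has an href attribute
for link in soup.find_all('a', href=True):
    print link['href']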
When I first tested this, I found that although the printed output was garbled, the text written to the file was fine. I then found the explanation online:
How print handles an object: internally it calls the object's __str__ method to get the corresponding string, which here is soup's __str__.
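The behaviour of print described above can be sketched with a small class. This is only an illustration written for Python 3: PageTitle is a hypothetical stand-in for a parsed tag and is not part of BeautifulSoup.

```python
# Hypothetical stand-in for a parsed tag; not part of BeautifulSoup.
class PageTitle(object):
    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print() uses this return value as the text to display
        return "<title>%s</title>" % self.text


title = PageTitle("Example page")
rendered = str(title)  # the same string that print() obtains
print(title)
```

Whatever __str__ returns is what reaches the console, so the encoding of that string decides whether the console can display it.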
The CMD console on a Chinese-language system uses GBK encoding, so re-encoding the output as gb18030 (a superset of GBK) makes it display normally.
That is what this line of code does:
print (soup.title).encode('gb18030')
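The effect can be reproduced without a console or a network connection. The sketch below is written for Python 3 and uses a sample string chosen only for illustration. It simulates a GBK console by decoding the output bytes as GBK, which is an assumption about how the console interprets what it receives.

```python
# -*- coding: utf-8 -*-
title = u"中文标题"  # sample text, chosen only for illustration

# UTF-8 bytes interpreted by a GBK console come out garbled
utf8_bytes = title.encode("utf-8")
seen_by_gbk_console = utf8_bytes.decode("gbk", errors="replace")

# gb18030 bytes for the same text are read correctly as GBK
gb_bytes = title.encode("gb18030")
seen_after_recoding = gb_bytes.decode("gbk")

print(seen_by_gbk_console == title)
print(seen_after_recoding == title)
```

The sample characters all exist in GBK. Text outside the GBK range would encode to gb18030 without error but could still fail to display on a GBK console.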
Solution 2:
When BeautifulSoup parses a utf-8 encoded web page, Chinese text can come out garbled if fromEncoding is not specified or is set to utf-8.
The solution is to set the fromEncoding parameter of the BeautifulSoup constructor to gb18030:
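# Python 2 code using the older BeautifulSoup 3 package
import urllib2
from BeautifulSoup import BeautifulSoup

# The URL appears without a scheme in the source; urllib2 needs a full URL
page = urllib2.urlopen('//www.jb51.net/')
# Tell the parser to decode the page as gb18030
soup = BeautifulSoup(page, fromEncoding="gb18030")
print soup.originalEncoding
print soup.prettify()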
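To see why the choice of input encoding matters, the sketch below decodes the same bytes with two different codecs. It is written for Python 3, uses only the standard library, and the page bytes are invented for illustration. It shows the underlying codec behaviour, not BeautifulSoup's own detection logic.

```python
# -*- coding: utf-8 -*-
# Sample page bytes, invented for illustration: a title encoded as gb18030
raw = u"<title>中文标题</title>".encode("gb18030")

# Decoding with the wrong codec loses the Chinese characters
decoded_as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were written in recovers them
decoded_as_gb18030 = raw.decode("gb18030")

print(u"中文标题" in decoded_as_utf8)
print(u"中文标题" in decoded_as_gb18030)
```

If the bytes of a page are really in a GBK-family encoding, naming gb18030 as the input encoding gives the parser the right codec regardless of what the page declares.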