How a Python crawler can write crawled images into a Word document
- 2021-01-22 05:14:40
- OfStack
As a beginner with crawlers, I can already crawl either text or pictures with ease, but the content a crawler fetches is often not just a single picture or a single piece of text. So I wondered whether the pictures and text could be saved together in a Word document. At first I tried saving the picture with the following method:
with open('123.doc', 'wb') as file:
    file.write(response.content)
As a result, the Word document contained nothing but gibberish. This method does not work, so I started looking for another one, but for a long time I found nothing. All I found was a way to manipulate Word documents from Python.
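The gibberish has a simple cause: response.content is still raw image data, and writing it to a file named .doc does not convert it into a Word document. A minimal sketch that compares the leading bytes of some data against a few well-known file signatures makes this visible. The helper name sniff_format is made up for illustration, and the signature table is deliberately incomplete.

```python
# Well-known leading byte signatures ("magic numbers") of a few formats.
# This table is a small illustrative subset, not a complete detector.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),
    (b"PK\x03\x04", "zip"),  # a .docx file is a zip container
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc container
]


def sniff_format(data):
    """Return a rough format name based on the first bytes of data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Passing the bytes that were written to 123.doc to sniff_format would be expected to report an image format rather than a Word container, which is why Word cannot make sense of the file.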
So I came up with a new idea: save the image using the original method, add the image to the Word document, and finally delete the image. This uses the python-docx library, as follows:
import os

import requests
from bs4 import BeautifulSoup
from docx import Document

url = 'https://www.qiushibaike.com/article/119757360'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
wen = soup.find('div', {"class": "content"}).text  # Text of the post
# Cut the image address out of the thumb div; the src is assumed to lack the "https:" scheme
img = str(soup.find('div', {"class": "thumb"})).split('src="')[1].split('"/')[0]
tu = 'https:' + img
img_name = img.split('/')[-1]
# Save the image to local
with open(img_name, 'wb') as f:
    response = requests.get(tu).content
    f.write(response)
document = Document()
document.add_paragraph(wen)  # Add text to the document
document.add_picture(img_name)  # Add images to the document
document.save('tuwen.docx')  # Save the document (python-docx writes the .docx format)
os.remove(img_name)  # Delete images saved locally
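Cutting the image address out of the tag with string splits depends on the exact attribute order in the page. A possibly sturdier sketch reads the src attribute through BeautifulSoup and lets urljoin add the missing scheme. The function names extract_image_url and image_file_name are made up for illustration, and the sketch assumes the picture is an img tag inside the div with class "thumb", as in the code above.

```python
import posixpath
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_image_url(html, page_url):
    """Return the absolute URL of the first image in div.thumb, or None."""
    soup = BeautifulSoup(html, "html.parser")
    thumb = soup.find("div", class_="thumb")
    img = thumb.find("img") if thumb else None
    if img is None or not img.get("src"):
        return None
    # urljoin resolves scheme-less ("//host/path") and relative addresses
    # against the address of the page they were found on.
    return urljoin(page_url, img["src"])


def image_file_name(image_url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlparse(image_url).path)
```

Returning None when the div or the image is missing lets the caller skip posts without a picture instead of failing with an IndexError.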
In the end, I managed to save the picture and text in a Word document, albeit in a clumsy way...