How to Write Images Crawled by a Python Crawler into a Word Document

2021-01-22 05:14:40
OfStack

As a beginner with crawlers, I can already crawl text or pictures without much trouble. But the content a crawler fetches is rarely just a single picture or a single block of text, so I wondered whether the pictures and the text could be saved together in a Word document. At first I tried to save a picture like this:


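# response is assumed to be the result of requests.get() for the image URL
with open('123.doc', 'wb') as file:
    file.write(response.content)  # the with block closes the file automatically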

As a result, the Word document contained nothing but gibberish. This approach clearly does not work, so I started looking for another one. For a long time I found nothing that did this directly; all I found was a way to manipulate Word documents from Python.
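The gibberish is expected: writing image bytes into a file named `.doc` only changes the name, not the format. A `.docx` file is a ZIP package that contains `word/document.xml`, and a legacy `.doc` file is an OLE compound file, while the downloaded bytes still start with the image's own signature. The sketch below illustrates this with the standard library only; `sniff` and the labels it returns are hypothetical names chosen for this example.

```python
import io
import zipfile

# Well-known file signatures ("magic numbers") at the start of a file.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF8": "gif",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole-compound",  # container used by legacy .doc
}


def sniff(data: bytes) -> str:
    """Guess the real format of some bytes, ignoring any file extension."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    # A .docx file is a ZIP archive that contains word/document.xml.
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "word/document.xml" in zf.namelist():
                return "docx"
        return "zip"
    return "unknown"
```

Calling `sniff(response.content)` on the downloaded picture would report an image type, which is why Word cannot open `123.doc` as a document.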

So I tried a new idea: save the image to disk using the original method, add that image file to the Word document, and finally delete the image file. This uses the python-docx library, as follows:


import os

import requests
from bs4 import BeautifulSoup
from docx import Document

url = 'https://www.qiushibaike.com/article/119757360'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
wen = soup.find('div', {"class": "content"}).text  # Text of the post
# Read the src attribute of the <img> tag inside the thumb div
img = soup.find('div', {"class": "thumb"}).find('img')['src']
tu = 'https:' + img  # Assumes the src is protocol-relative (starts with //)
img_name = img.split('/')[-1]

# Save the image to a local file
with open(img_name, 'wb') as f:
    f.write(requests.get(tu).content)

document = Document()
document.add_paragraph(wen)  # Add the text to the document
document.add_picture(img_name)  # Add the image to the document
document.save('tuwen.docx')  # python-docx writes the .docx format, not legacy .doc
os.remove(img_name)  # Delete the locally saved image
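The `'https:' + img` concatenation and `img.split('/')[-1]` in the script above only work when the `src` value is protocol-relative and carries no query string. A slightly more tolerant sketch using `urllib.parse` is shown here; the helper names `absolute_image_url` and `image_file_name`, and the fallback name `image.jpg`, are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def absolute_image_url(page_url: str, src: str) -> str:
    """Resolve an <img> src (absolute, protocol-relative, or relative) against the page URL."""
    return urljoin(page_url, src)


def image_file_name(image_url: str, default: str = "image.jpg") -> str:
    """Take the last path segment as the file name, ignoring any query string."""
    name = posixpath.basename(urlparse(image_url).path)
    return name or default
```

With these, `tu` could be computed as `absolute_image_url(url, img)` and `img_name` as `image_file_name(tu)`.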

In the end, I managed to save the text and the picture together in a Word document, albeit in a clumsy way...