Jun 3, 2020

A simple image crawler in Python

When browsing Zhihu, I sometimes want to save the pictures posted under a question, so I wrote this program. It is a very simple image crawler and can only fetch the pictures that have already been loaded into the page. Since I am not familiar with this area, I will say only a few words and then record the code without much explanation. If you are interested, feel free to take it. I have tested it on Zhihu, and it works on other sites as well.

The previous post showed how to open an image from a URL. The purpose was to see what each image looks like as it is crawled, and then to filter the images before saving them.

Here, the requests library is used to fetch the page. Note that a header is needed when requesting the page: it disguises the program as a browser, since otherwise the server may reject the request. BeautifulSoup is then used to strip away the extra information and extract the image addresses. After an image is downloaded, small pictures such as avatars and emojis are filtered out according to the image size. The image itself can be read with OpenCV, skimage, PIL, and so on.
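To make the address-extraction step concrete, here is a minimal sketch that runs without any network access. It swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependencies; the class name ImgSrcCollector, the function collect_image_urls, and the sample HTML are all made up for this illustration and are not part of the original program.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag that points to an HTTP(S) URL."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip tags without a src and inline placeholders such as data: URIs.
        if src and src.startswith("http"):
            self.sources.append(src)


def collect_image_urls(html):
    """Return the HTTP(S) image addresses found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.sources


# Hypothetical page fragment, invented for this example.
sample_html = """
<div>
  <img src="https://example.com/a.jpg">
  <img src="data:image/svg+xml;utf8,placeholder">
  <img alt="no source">
  <img src="http://example.com/b.png"/>
</div>
"""
print(collect_image_urls(sample_html))
```

Only the two addresses that start with "http" are kept. The placeholder and the tag without a src are dropped, which is the same check the crawler applies before downloading anything.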

The program is as follows:

# -*- coding: utf-8 -*-
# Written for Python 3.
import os
from io import BytesIO      # used by the PIL option

import cv2
import numpy as np
import requests as req
from bs4 import BeautifulSoup
from PIL import Image       # used by the PIL option
from skimage import io      # used by the skimage option

url = "https://www.zhihu.com/question/37787176"
# Pretend to be a browser; without a User-Agent the server may reject the request.
headers = {'User-Agent' : 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Mobile Safari/537.36'}
response = req.get(url, headers=headers)
content = response.text  # decoded HTML (response.content is raw bytes)
#print(content)

soup = BeautifulSoup(content, 'lxml')
images = soup.find_all('img')
print("Found %d images in total" % len(images))

if not os.path.exists("images"):
    os.mkdir("images")

for i, img in enumerate(images):
    print("Processing image %d ..." % (i + 1))
    img_src = img.get('src')
    # Skip <img> tags without a src and non-HTTP sources such as data: URIs.
    if img_src and img_src.startswith("http"):
        img_path = "images/" + str(i + 1) + ".jpg"

        ## Option 1: PIL (image.size is (width, height))
        # print(img_src)
        # response = req.get(img_src, headers=headers)
        # image = Image.open(BytesIO(response.content))
        # w, h = image.size
        # print(w, h)
        # if w >= 500 and h > 500:
        #     #image.show()
        #     image.save(img_path)

        ## Option 2: OpenCV
        resp = req.get(img_src, headers=headers)
        data = np.frombuffer(resp.content, dtype="uint8")
        image = cv2.imdecode(data, cv2.IMREAD_COLOR)
        if image is None:
            # The downloaded bytes could not be decoded as an image.
            continue
        h, w = image.shape[:2]  # array shape is (height, width, channels)
        print(w, h)
        if w >= 400 and h > 400:
            cv2.imshow("Image", image)
            cv2.waitKey(3000)
            ##cv2.imwrite(img_path, image)

        ## Option 3: skimage
        # image = io.imread(img_src)
        # h, w = image.shape[:2]
        # print(w, h)
        # #io.imshow(image)
        # #io.show()
        # if w >= 500 and h > 500:
        #     io.imsave(img_path, image)

print("Processing done!")

A variety of options are given here for your reference.
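
One detail to watch when switching between these options: PIL reports an image's size as (width, height), while the arrays returned by OpenCV and skimage have shape (height, width, channels). The sketch below shows the size filter in a form that makes the order explicit. The helper names width_height_from_shape and is_large_enough are hypothetical, and the default threshold of 400 is simply borrowed from the OpenCV branch of the program.

```python
def width_height_from_shape(shape):
    """Return (width, height) from an array shape (height, width[, channels])."""
    height, width = shape[:2]
    return width, height


def is_large_enough(width, height, min_width=400, min_height=400):
    """True when the image is at least min_width x min_height pixels."""
    return width >= min_width and height >= min_height


# A colour image with 300 rows and 800 columns, as OpenCV or skimage report it.
w, h = width_height_from_shape((300, 800, 3))
print(w, h)                         # 800 300
print(is_large_enough(w, h))        # False: the image is only 300 pixels tall
# A PIL-style (width, height) tuple can be passed straight through.
print(is_large_enough(*(800, 600))) # True
```

Keeping the unpacking in one small helper makes it harder to swap width and height by accident when moving from the PIL option to the array-based ones.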