Python crawler package BeautifulSoup recursive fetching example details

  • 2020-05-26 09:25:09
  • OfStack

Summary:

The main purpose of a crawler is to collect the required content by following links across the network. Crawling is essentially a recursive process: the crawler first fetches the content of a page, then analyzes that content to find other URLs, then fetches those pages in turn, and repeats the process over and over again.
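
As a minimal sketch of that recursive structure (fetch and extract_links here are hypothetical placeholders defined only for illustration; they are not part of the examples below):

visited = set()

def fetch(url):
  # Placeholder: pretend to download the page and return its content
  return "content of " + url

def extract_links(content):
  # Placeholder: pretend to parse new URLs out of the content
  return []

def crawl(url):
  if url in visited:                      # skip pages we have already processed
    return
  visited.add(url)
  content = fetch(url)                    # 1. get the content of the page
  for link in extract_links(content):     # 2. find other URLs in it
    crawl(link)                           # 3. repeat the process for each URL

crawl("http://example.com/start")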

Let's take Wikipedia as an example.

We want to extract all the links from the Wikipedia entry for Kevin Bacon that point to other articles.


# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:  2016-12-25 10:35:00
# @Last Modified by:  HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib.request import urlopen   # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup

# Download the page and hand it to BeautifulSoup for parsing
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

# Print the href attribute of every <a> tag on the page
for link in bsObj.findAll("a"):
  if 'href' in link.attrs:
    print(link.attrs['href'])

The above code pulls out all the hyperlinks on the page.


/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick

First, the extracted URLs may contain duplicates.

Second, there are some URLs we don't need, such as links to the sidebar, header, footer, table-of-contents bar, and so on.

Looking more closely, we can observe that the links pointing to article pages all share three characteristics (a short check of the filtering pattern follows the list):

  • They are all inside the div tag whose id is bodyContent
  • The URLs do not contain a colon
  • The URLs are all relative paths starting with /wiki/ (full absolute paths starting with http could also be crawled)
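
To see how these characteristics translate into a filter, here is a small check of the regular expression used in the next listing against a few of the hrefs printed earlier (the sample list below is just for illustration):

import re

# Pattern from the listing below: relative paths that start with /wiki/
# and contain no colon
articleLink = re.compile("^(/wiki/)((?!:).)*$")

samples = [
  "/wiki/Kyra_Sedgwick",                     # article link -> matches
  "/wiki/Kevin_Bacon_(disambiguation)",      # article link -> matches
  "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # contains a colon -> rejected
  "/wiki/Wikipedia:Protection_policy#semi",  # contains a colon -> rejected
  "#mw-head",                                # not under /wiki/ -> rejected
]

for href in samples:
  print(href, "->", bool(articleLink.match(href)))

The complete listing that uses this pattern is shown below.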

from urllib.request import urlopen   # urllib2 in Python 2
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
# Seed the random number generator with the current time
random.seed(datetime.datetime.now().timestamp())
def getLinks(articleUrl):
  html = urlopen("http://en.wikipedia.org"+articleUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  # Keep only links inside the bodyContent div that start with /wiki/ and contain no colon
  return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
  newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
  if newArticle not in pages:
    print(newArticle)
    pages.add(newArticle)
    links = getLinks(newArticle)

The parameter of getLinks is /wiki/<name of the entry>; the function builds the full page URL by joining it with Wikipedia's absolute address. All URLs pointing to other entries are captured with a regular expression and returned to the main program.
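
For reference, the same join can also be done with urllib.parse.urljoin instead of plain string concatenation; this is only an optional variation, not part of the original code:

from urllib.parse import urljoin

base = "http://en.wikipedia.org"
print(base + "/wiki/Kevin_Bacon")          # plain concatenation, as in the listing above
print(urljoin(base, "/wiki/Kevin_Bacon"))  # urljoin normalizes the slashes for us
# Both lines print: http://en.wikipedia.org/wiki/Kevin_Bacon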

The main loop repeatedly calls getLinks and randomly follows one of the URLs that has not been visited yet, until no new entries are found or the program is stopped.

The following code can crawl the entire Wikipedia site:


from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
  global pages
  html = urlopen("http://en.wikipedia.org"+pageUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  try:
    # Print the title, the first paragraph and the edit link of the current page
    print(bsObj.h1.get_text())
    print(bsObj.find(id="mw-content-text").findAll("p")[0])
    print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
  except AttributeError:
    print("This page is missing something! No worries though!")

  # Recurse into every /wiki/ link that has not been visited yet
  for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
    if 'href' in link.attrs:
      if link.attrs['href'] not in pages:
        # We have encountered a new page
        newPage = link.attrs['href']
        print("----------------\n"+newPage)
        pages.add(newPage)
        getLinks(newPage)

# Start from the Wikipedia front page
getLinks("")

Note that Python's default recursion limit is 1000, so you either need to raise the recursion limit artificially or use some other means (such as an iterative loop) to keep the code running beyond 1000 levels of recursion.
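
As a rough sketch of those two options (crawl_iteratively and its get_links parameter are hypothetical names; get_links stands for a function like getLinks above that returns the /wiki/ hrefs found on a page):

import sys

# Option 1: raise the recursion limit (use with care; very deep recursion
# can still crash the interpreter)
sys.setrecursionlimit(10000)

# Option 2: replace recursion with an explicit stack so no limit applies
def crawl_iteratively(startUrl, get_links):
  pages = set()
  stack = [startUrl]
  while stack:
    url = stack.pop()
    if url in pages:
      continue
    pages.add(url)
    for href in get_links(url):   # hrefs of unvisited pages go onto the stack
      if href not in pages:
        stack.append(href)
  return pages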

Thank you for reading. I hope this helps you, and thank you for your support of this site!

