Using Python's pyspider as an example to analyze how a search engine's web crawler is implemented

  • 2020-04-02 14:44:32
  • OfStack

In this article, we will analyze a web crawler.

A web crawler is a tool that scans web content and records its useful information. It can open a bunch of web pages, analyze the contents of each page to find all the data of interest, store that data in a database, and then do the same for other web pages.

If there are links in the web page that the crawler is analyzing, the crawler will analyze more pages based on those links.

Search engines are based on this principle.
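
To make this concrete, here is a toy sketch of such a loop. This is my own illustration, not pyspider's code; the function toy_crawl and the regex-based link extraction are only for demonstration.

import re
from collections import deque
from urllib.request import urlopen

def toy_crawl(start_url, max_pages=10):
    seen, queue, records = set(), deque([start_url]), []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode('utf-8', errors='replace')
        # "store" the data of interest (here, just the page size)
        records.append({'url': url, 'size': len(html)})
        # the links found in the page become new pages to analyze
        for link in re.findall(r'href="(http[^"]+)"', html):
            queue.append(link)
    return records

if __name__ == '__main__':
    print(toy_crawl('http://example.com/', max_pages=3))

A real crawler adds scheduling, politeness, error handling and persistent storage on top of this loop, which is exactly what pyspider's components do.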

For this article in particular, I chose a stable and relatively young open source project, pyspider (https://github.com/binux/pyspider), implemented by binux.

Note: pyspider is designed to continuously monitor the web. It assumes that pages change over time, so it will revisit the same page periodically.

Overview

The crawler pyspider consists of four main components: a scheduler, a fetcher, a content processor, and a monitoring component.

The scheduler accepts tasks and decides what to do with them. There are several possibilities: it may drop a task (perhaps that particular web page was crawled very recently) or assign it a different priority.

Once the priority of each task has been determined, it is passed on to the fetcher, which retrieves the web page. The process is technically complex, but logically simple.

When a resource has been fetched from the network, the content processor is responsible for extracting useful information. It runs a user-written Python script, which is not isolated in a sandbox. Its responsibilities also include catching exceptions and log messages and managing them appropriately.

Finally, the crawler pyspider has a monitoring component.

The crawler pyspider provides an extremely powerful web UI that allows you to edit and debug your scripts, manage the crawling process, monitor ongoing tasks, and ultimately output the results.

Projects and tasks

In pyspider, we have the concept of projects and tasks.

A task is a single page that needs to be retrieved from a web site and analyzed.

A project is a larger entity that includes all the pages the crawler covers, the Python scripts needed to analyze the pages, the database used to store the data, and so on.

In pyspider we can run multiple projects simultaneously.
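
To give an idea of what a project's Python script looks like, here is a minimal handler, close to the default template that pyspider's web UI generates when you create a new project; the start URL and the CSS selectors are placeholders.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed task: (re)crawl the start page once a day
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # every link found here becomes a new task for the scheduler
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dictionary is the result stored for this task
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }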

Code structure analysis

The root directory

The folders that can be found in the root directory are:

  • data, an empty folder: it is the place where the data generated by the crawler is stored.
  • docs, which contains the project documentation, with some markdown files in it.
  • pyspider, which contains the actual code of the project.
  • tests, which contains a fair amount of test code.

Here I will highlight some important files in the root:

  • .travis.yml: a great example of continuous integration testing. How do you make sure your project actually works? Testing on your own machine with a fixed version of the libraries is not enough.
  • Dockerfile: also a great tool! If I want to try the project on my machine, I just need to run Docker; I don't need to install anything manually. It's a great way to get developers involved in your project.
  • LICENSE: required for any open source project; (if you have your own open source project) don't forget this file.
  • requirements.txt: in the Python world, this file specifies which Python packages need to be installed on your system in order to run the software. It is required in any Python project.
  • run.py: the main entry point of the software.
  • setup.py: the Python script that installs the pyspider project on your system.

The root of the project has been analyzed, and this alone shows that the project was developed in a very professional manner. If you're working on any open source program, hopefully you can reach that level.

Folder pyspider

Let's dig a little deeper and look at the actual code.

Other folders can be found in this folder, and the logic behind the entire software has been split to make it easier to manage and extend.

These folders are: database, fetcher, libs, processor, result, scheduler, webui.

In this folder we can also find the main entry point for the entire project, run.py.

File run.py

This file first completes all the necessary chores to ensure that the crawler runs successfully, and eventually it spawns all the necessary components. Scrolling down, we find the entry point of the entire project, cli().

Function cli()

This function may seem complicated, but if you follow me, you'll see that it's not as complicated as you think. The main purpose of the function cli() is to create all connections to the database and the messaging system. It mainly parses command-line arguments and creates a large dictionary with all the things we need. Finally, we get to work by calling the function all().
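
pyspider's command line is built on click; the sketch below is not the real cli(), only a minimal illustration of the same idea: parse the options, build a shared context, and fall through to the all command when no subcommand is given. The option name and the echoed message are placeholders.

import click

@click.group(invoke_without_command=True)
@click.option('--data-path', default='./data', help='where crawl data is stored')
@click.pass_context
def cli(ctx, data_path):
    # in the real cli() this is where the database and message-queue
    # connections are created and collected into one big context object
    ctx.obj = {'data_path': data_path}
    if ctx.invoked_subcommand is None:
        ctx.invoke(run_all)

@cli.command('all')
@click.pass_context
def run_all(ctx):
    click.echo('starting all components with %r' % ctx.obj)

if __name__ == '__main__':
    cli()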

Function all()

A web crawler does a lot of IO operations, so a good idea is to have different threads or child processes to manage all this work. This way, you can extract useful information from the previous page while waiting for the web to retrieve your current HTML page.

The function all() decides whether to use subprocesses or threads, and then calls all the necessary functions in different threads or subprocesses. At this point pyspider spawns a sufficient number of threads for all the logical modules of the crawler, including the webui. When we finish the project and close the webui, each process is shut down cleanly.
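
Here is a simplified sketch of that idea, not pyspider's actual run.py: each component's loop is wrapped in either a thread or a subprocess, depending on a flag. component_loop and run_in are hypothetical stand-ins for the real component run loops.

import threading
import multiprocessing
import time

def component_loop(name):
    # stand-in for a component's run() loop (scheduler, fetcher, processor, webui)
    for _ in range(3):
        print(name, "working...")
        time.sleep(0.1)

def run_in(target, args, use_subprocess):
    # wrap a component's loop in a thread or a subprocess
    cls = multiprocessing.Process if use_subprocess else threading.Thread
    worker = cls(target=target, args=args, daemon=True)
    worker.start()
    return worker

if __name__ == '__main__':
    names = ['scheduler', 'fetcher', 'processor', 'webui']
    workers = [run_in(component_loop, (n,), use_subprocess=False) for n in names]
    for w in workers:
        w.join()  # wait until every component has finished, then exit cleanly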

Now that our crawler is up and running, let's explore a little further.

The scheduler

The scheduler takes tasks from two different queues (newtask_queue and status_queue) and adds them to another queue (out_queue), which is later read by the fetcher.

The first thing the scheduler does is load all the tasks that need to be done from the database. After that, it starts an infinite loop. Several methods are called in this loop:

1. _update_projects(): tries to update the various settings; for example, we may want to adjust the crawl speed while the crawler is running.

2. _check_task_done(): analyzes completed tasks and saves them to the database; it gets them from the status_queue.

3. _check_request(): if the content processor has asked for more pages to be analyzed, they will have been placed in the newtask_queue; this function takes the new tasks from that queue.

4. _check_select(): adds new pages to the fetcher's queue.

5. _check_delete(): deletes the tasks and projects that have been marked by the user.

6. _try_dump_cnt(): records the number of completed tasks to a file. This is necessary to avoid losing data if the program fails.
 


def run(self):
    while not self._quit:
        try:
            time.sleep(self.LOOP_INTERVAL)
            self._update_projects()
            self._check_task_done()
            self._check_request()
            while self._check_cronjob():
                pass
            self._check_select()
            self._check_delete()
            self._try_dump_cnt()
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            logger.exception(e)
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue

The loop also checks for exceptions raised during the run, and for whether we have asked Python to stop processing.
 


finally:
    # exit components run in subprocess
    for each in threads:
        if not each.is_alive():
            continue
        if hasattr(each, 'terminate'):
            each.terminate()
        each.join()

The fetcher

The purpose of the fetcher is to retrieve network resources.

pyspider can handle both plain HTML text pages and AJAX-based pages. It is important to understand that only the fetcher is aware of this difference. We will focus only on plain HTML text fetching, but most of the ideas port easily to the AJAX fetcher.

The idea here is similar to the scheduler's: we have two queues, one for input and one for output, and a big loop. For every element in the input queue, the fetcher generates a request and puts the result in the output queue.

It sounds simple, but there is a big problem: the network is usually extremely slow, and if all computation is blocked waiting for a web page, the whole process runs extremely slowly. The solution is just as simple: don't block computation while waiting on the network. The idea is to send out a large number of requests, a significant number of them at the same time, and then wait asynchronously for the responses to come back. Once we retrieve a response, another callback function is invoked, which handles the response in the most appropriate way.

All the complex asynchronous scheduling in pyspider is handled by another excellent open source project, Tornado (http://www.tornadoweb.org/en/stable/).
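
To illustrate callback-style asynchronous fetching, here is a small standalone sketch. It is my own code, not pyspider's, and it assumes a Tornado version older than 6.0, which still accepts a callback argument to fetch(), like the version pyspider was built against.

from tornado import ioloop, httpclient

def handle_response(response):
    # called asynchronously as each response comes back
    print(response.effective_url, response.code, len(response.body or b''))

def fetch_all(urls):
    client = httpclient.AsyncHTTPClient()
    for url in urls:
        # all requests are started immediately; nothing blocks here
        client.fetch(url, handle_response)

if __name__ == '__main__':
    fetch_all(['http://example.com/', 'http://example.org/'])
    ioloop.IOLoop.current().start()  # the loop keeps running; stop it manually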

Now that we have a great idea in our heads, let's explore more deeply how this is done.
 


def run(self):
    def queue_loop():
        if not self.outqueue or not self.inqueue:
            return
        while not self._quit:
            try:
                if self.outqueue.full():
                    break
                task = self.inqueue.get_nowait()
                task = utils.decode_unicode_obj(task)
                self.fetch(task)
            except queue.Empty:
                break
    tornado.ioloop.PeriodicCallback(queue_loop, 100, io_loop=self.ioloop).start()
    self._running = True
    self.ioloop.start()
Function run()

The function run() is the big loop of the fetcher.

The function run() defines another function, queue_loop(), which takes all the tasks from the input queue and fetches them. It also listens for the interrupt signal. The function queue_loop() is passed as an argument to Tornado's PeriodicCallback class, which, as you might guess, calls queue_loop() at regular intervals. queue_loop() also calls another function that brings us closer to the actual retrieval of a web resource: fetch().
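
As a tiny standalone illustration of PeriodicCallback (assuming a recent Tornado, where the io_loop argument that appears in pyspider's code is no longer needed):

from tornado import ioloop

def tick():
    # stand-in for queue_loop(): poll the input queue and start fetches
    print("checking the input queue...")

if __name__ == '__main__':
    ioloop.PeriodicCallback(tick, 100).start()  # call tick() every 100 ms
    ioloop.IOLoop.current().start()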

Function fetch(self, task, callback=None)

Network resources are retrieved by either the phantomjs_fetch() function or the simpler http_fetch() function; fetch() itself only decides which is the right way to retrieve a given resource. Next, let's look at http_fetch().

Function http_fetch(self, url, task, callback)
 


def http_fetch(self, url, task, callback):
    '''HTTP fetcher'''
    fetch = copy.deepcopy(self.default_options)
    fetch['url'] = url
    fetch['headers']['User-Agent'] = self.user_agent

    def handle_response(response):
        ...
        return task, result

    try:
        request = tornado.httpclient.HTTPRequest(header_callback=header_callback, **fetch)
        if self.async:
            self.http_client.fetch(request, handle_response)
        else:
            return handle_response(self.http_client.fetch(request))

Finally, this is where the real work gets done. The code for this function is a bit long, but it has a clear structure and is easy to read.

At the beginning of the function, it sets the headers of the fetch request, such as the User-Agent, the timeout, and so on. Then it defines a function that handles the response, handle_response(), which we will examine shortly. Finally, it creates a Tornado request object and sends it. Note how the response is handled by the same function in both the asynchronous and the non-asynchronous case.

Let's step back and analyze what the function handle_response() does.

Function handle_response(response)
 


def handle_response(response):
    result = {}
    result['orig_url'] = url
    result['content'] = response.body or ''
    callback('http', task, result)
    return task, result

This function holds all the information about a response in the form of a dictionary, such as the url, status code, and actual response, and then calls the callback function. The callback function here is a small method: send_result().

Function send_result(self, type, task, result)
 


def send_result(self, type, task, result):
    if self.outqueue:
        self.outqueue.put((task, result))

This last function puts the result into the output queue, where it waits to be read by the content processor.

The content processor

The purpose of the content processor is to analyze the pages that have been retrieved. Its process is, once again, a big loop, but this time there are three output queues (status_queue, newtask_queue, and result_queue) and only one input queue (inqueue).

Let's dig a little deeper into the loop in the function run().

Function run(self)
 


def run(self):
    while not self._quit:
        try:
            task, response = self.inqueue.get(timeout=1)
            self.on_task(task, response)
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue

This function contains little code and is easy to understand. It simply takes the next task to be analyzed from the queue and analyzes it with the on_task(task, response) function. The loop listens for the interrupt signal: as soon as we send one to Python, the loop terminates. Finally, the loop counts the number of exceptions it raises; too many of them will end the loop.

Function on_task(self, task, response)
 


def on_task(self, task, response):
    response = rebuild_response(response)
    project = task['project']
    project_data = self.project_manager.get(project, updatetime)
    ret = project_data['instance'].run(...)

    status_pack = {
        'taskid': task['taskid'],
        'project': task['project'],
        'url': task.get('url'),
        ...
    }
    self.status_queue.put(utils.unicode_obj(status_pack))
    if ret.follows:
        self.newtask_queue.put(
            [utils.unicode_obj(newtask) for newtask in ret.follows])

    for project, msg, url in ret.messages:
        self.inqueue.put(({...}, {...}))

    return True

The on_task() function is the method that actually does the work.

It attempts to use the input task to find the project to which the task belongs. It then runs the project's custom script and analyzes the response the script returns. If all goes well, a dictionary containing all the information obtained from the web page is created. Finally, the dictionary is placed in the status_queue, where it will later be picked up by the scheduler.

If there are some new links to be processed in the analyzed page, the new links are put into the newtask_queue and used by the scheduler later.

Now, pyspider will send the results to other projects if needed.

Finally, if something goes wrong, such as a page returning an error, the error message is added to the log.

The end!

