Python Web Crawler Principle Analysis

  • 2020-06-19 10:39:25
  • OfStack

Since this article discusses the principles behind building web crawlers in Python, we will first point you to a selection of related articles about crawlers in Python:

An example of python implementing a simple crawler

python Crawler Practice - the simplest web crawler tutorial

Web crawlers are one of the most commonly used systems today. The most popular example is Google, which uses crawlers to gather information from websites all over the web. Beyond search engines, news sites also need crawlers to aggregate data sources. Whenever you want to aggregate a large amount of information, a crawler is worth considering.

There are many factors to consider when building a web crawler, especially if you want to scale the system. That is why this has become one of the most popular system design interview questions. In this article, we will cover topics ranging from a basic crawler to a large-scale one, and discuss various problems you may encounter in an interview.

1 - Basic solution

How to build a basic web crawler?

In 8 Things to Know Before the System Design Interview, we talked about starting with something simple. So let's first focus on building a basic web crawler that runs on a single thread. With this simple solution in place, we can keep optimizing.

To crawl a single web page, we only need to make an HTTP GET request to the corresponding URL and parse the response data; this is the core of any crawler. With this in mind, a basic web crawler can work like this (a minimal code sketch follows the list):

Start with a URL pool that contains all the sites we want to crawl.

For each URL, issue an HTTP GET request to fetch the page content.

Parse the content (usually HTML) and extract the potential URLs we want to crawl.

Add the new URLs to the pool and keep crawling.
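To make that loop concrete, here is a minimal single-threaded sketch using only the Python standard library. The seed URL and the regex-based extract_links helper are illustrative assumptions, not something prescribed by the article; a real crawler would use a proper HTML parser.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def extract_links(html, base_url):
    # Naive regex-based link extraction; a real crawler would use an HTML parser.
    hrefs = re.findall(r'href=["\'](.*?)["\']', html)
    return {urljoin(base_url, h) for h in hrefs if not h.startswith("#")}

def crawl(seed_urls, max_pages=100):
    pool = deque(seed_urls)     # URL pool: pages still waiting to be fetched
    seen = set(seed_urls)       # URLs already queued, to avoid repeats
    while pool and len(seen) < max_pages:
        url = pool.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue            # skip pages that fail to download or decode
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                pool.append(link)

crawl(["https://example.com"])
```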

Depending on the problem, sometimes we might have a separate system that generates URLs to crawl. For example, a program can continuously listen to RSS feeds and, for each new article, add its URL to the crawl pool.
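As a sketch of that idea, assuming the third-party feedparser library (my choice for illustration, not mentioned in the article), a feed listener could push new article URLs into the pool like this:

```python
import time
import feedparser  # third-party: pip install feedparser

def watch_feed(feed_url, url_pool, interval=300):
    """Poll an RSS feed and push any new article URLs into the crawl pool."""
    seen = set()
    while True:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            if entry.link not in seen:
                seen.add(entry.link)
                url_pool.append(entry.link)   # hand the URL to the crawler
        time.sleep(interval)                  # wait before polling again
```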

2 - Scaling issues

It is well known that almost any system faces a series of new problems as it scales. In a web crawler, there are many things that can go wrong when scaling the system to multiple machines.

Before jumping to the next section, take a few minutes to think about the bottlenecks of a distributed web crawler and how you would solve them. In the remainder of this article, we will discuss the major issues and their solutions.

3 - Crawl frequency

How often should you crawl a website?

This may not sound like a big deal unless the system reaches a certain scale and you need very fresh content. For example, if you want to get the latest news within the hour, the crawler may need to scrape the news sites every hour. But what's wrong with that?

For some small sites, their servers may not be able to handle such frequent requests. One approach is to follow each site's robots.txt. For those who don't know what robots.txt is, it is basically a standard that websites use to communicate with web crawlers. It can specify which files should not be crawled, and most web crawlers respect that configuration. In addition, you can set different crawl frequencies for different sites. Typically, only a few sites need to be crawled several times a day.
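Python's standard library already ships a robots.txt parser, so a hedged sketch of this politeness check might look like the following; the user-agent string and URLs are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                   # download and parse robots.txt

# Only fetch the page if the site's robots.txt allows our crawler to.
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")

# Some sites also declare how often they want to be visited.
delay = rp.crawl_delay("MyCrawler/1.0")     # may be None if not specified
```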

4 - Deduplication

On a single machine, you can keep the URL pool in memory and remove duplicate entries. But things get more complicated in a distributed system. Basically, multiple crawlers may extract the same URL from different web pages, and they all want to add this URL to the URL pool. Of course, it makes no sense to crawl the same page multiple times. So how do we deduplicate these URLs?

One common approach is to use a Bloom filter. In short, a Bloom filter is a space-efficient data structure that lets you test whether an element is in a set. However, it can return false positives. In other words, a Bloom filter can tell you either that a URL is definitely not in the pool, or that it might be.

To briefly explain how a Bloom filter works: an empty Bloom filter is a bit array of m bits (all set to 0), together with k hash functions that each map an element to one of the m bits. When we add a new element (a URL) to the Bloom filter, we compute its k bit positions with the hash functions and set them all to 1. When we later check for the presence of an element, we first compute its k bit positions: if any one of them is not 1, we immediately know the element is not in the set. However, if all k bits are 1, they may have been set by a combination of other elements, which is where the false positives come from.
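A minimal, illustrative Bloom filter along those lines is sketched below; it derives the k bit positions from salted SHA-256 hashes, and the sizes chosen here are arbitrary rather than tuned for any real workload:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1_000_000, k=5):
        self.m = m                          # number of bits in the filter
        self.k = k                          # number of hash functions
        self.bits = bytearray(m // 8 + 1)   # bit array, initially all zeros

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/page1")
print(bf.might_contain("https://example.com/page1"))  # True
print(bf.might_contain("https://example.com/other"))  # almost certainly False
```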

The Bloom filter is a very common technique and is well suited to deduplicating URLs in a web crawler.

5 - Parsing

After getting the response data from a site, the next step is to parse the data (usually HTML) to extract the information we care about. This sounds simple, but it can be hard to make robust.

The challenge is that you will always find strange markup, malformed URLs, and so on in real-world HTML, and it is hard to cover all the edge cases. For example, you may need to deal with encoding and decoding issues when the HTML contains non-ASCII characters. In addition, web pages that contain images, videos, or even PDFs can trigger strange behavior.
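As one way to soften those problems, a more forgiving parsing step might rely on the third-party BeautifulSoup library and decode the response defensively; this is an illustrative assumption on my part, not a recommendation made by the original article:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def parse_page(raw_bytes, base_url, declared_encoding=None):
    # Decode defensively: fall back to UTF-8 and replace undecodable bytes.
    html = raw_bytes.decode(declared_encoding or "utf-8", errors="replace")
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None
    links = {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}
    return title, links
```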

In addition, some web pages are rendered with JavaScript frameworks such as AngularJS, and your crawler may not be able to extract any content from them at all.

There is no silver bullet here; you cannot build a perfect, robust crawler for every web page. You need plenty of robustness testing to make sure it works as expected.

Conclusion

There are many interesting topics I haven't covered yet, but I want to mention a few of them so you can think about them. One is cycle detection: many sites contain links that form cycles, such as A -> B -> C -> A, and your crawler could end up running forever. How would you solve this?
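One common answer, sketched here as my own assumption rather than the article's, is to key the visited set on a normalized form of each URL, so that a link cycle is cut off the second time any page in it is reached:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalise a URL so that trivially different forms of the same page
    (fragments, trailing slashes, letter case) map to one key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))   # drop the #fragment

visited = set()

def should_visit(url):
    key = normalize(url)
    if key in visited:
        return False        # already seen, so A -> B -> C -> A loops stop here
    visited.add(key)
    return True
```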

Another problem is DNS lookup. DNS lookups can become a bottleneck once the system scales to a certain level, and you may want to run your own DNS servers.
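Short of running your own DNS server, one lightweight mitigation is to cache resolutions in-process. The sketch below is my own assumption, using only the standard library, and it memoizes socket.getaddrinfo results; a production setup would also need to honour DNS TTLs:

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(host, port=80):
    """Cache DNS lookups so repeated requests to the same host skip resolution."""
    # getaddrinfo returns a list of (family, type, proto, canonname, sockaddr).
    return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)

addrs = resolve("example.com")
ip = addrs[0][4][0]   # resolved IP address of the first result
```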

Like many other systems, a scaled-out web crawler is much harder to build than a single-machine version, and many aspects of it can be discussed in a system design interview. Starting with a simple solution and then optimizing it step by step can make things easier than they seem.

That wraps up our summary of web crawler related content. If there is anything else you would like to know, feel free to discuss it in the comments below, and thank you for your support of this site.
