Distributed crawler based on Python, and solving the hang ("fake death") problem

  • 2021-10-27 07:57:46
  • OfStack

Python version: 3.5.4

System: win10 x64

Downloading videos from a web page

Method 1: use the urllib.request.urlretrieve function

This function needs only two arguments to download content to the local disk: the URL and the save path.


import urllib.request

url = 'http://xxx.com/xxx.mp4'
file = 'xxx.mp4'
urllib.request.urlretrieve(url, file)

However, this function has no timeout option, so on a bad network connection the call may hang indefinitely!
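One possible workaround (not in the original article) is setting the module-wide default socket timeout, which the sockets opened internally by urlretrieve honour. A minimal sketch, using the article's placeholder URL and file name:

```python
import socket
import urllib.request

def download(url, path, timeout=10):
    # urlretrieve takes no timeout argument, but it honours the
    # module-wide default socket timeout, so set it temporarily.
    old = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)   # applies to urlretrieve's sockets
    try:
        urllib.request.urlretrieve(url, path)
    finally:
        socket.setdefaulttimeout(old)   # restore the previous default
```

Usage would then look like `download('http://xxx.com/xxx.mp4', 'xxx.mp4')`.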

Method 2: use the urllib.request.urlopen function

The usage method is as follows:


import urllib.request
url = 'http://xxx.com/xxx.mp4'
file = 'xxx.mp4'
response = urllib.request.urlopen(url, timeout=5)
data = response.read()
with open(file, 'wb') as video:
    video.write(data)

This function accepts a timeout argument, which avoids the hang.
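One caveat with the snippet above is that response.read() loads the entire video into memory before writing. For large files, a sketch of a streaming variant (same placeholder URL, chunk size is an arbitrary choice):

```python
import shutil
import urllib.request

def save_stream(url, path, timeout=5, chunk_size=64 * 1024):
    # Stream the response body to disk in fixed-size chunks instead of
    # holding the whole video in memory with response.read().
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(path, 'wb') as out:
            shutil.copyfileobj(response, out, chunk_size)
```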

Parallelizing the program

The pseudo code is as follows:


import urllib.request
import socket
from urllib import error
from queue import Queue
from threading import Thread

class DownloadWorker(Thread):  # subclass Thread and override its run() method
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue  # standard multithreaded pattern: share work via a queue

    def run(self):
        while True:
            link, file = self.queue.get()  # take a (URL, save path) pair from the queue
            try:  # handle the various exceptions with try/except
                response = urllib.request.urlopen(link, timeout=5)
                data = response.read()
                with open(file, 'wb') as video:
                    video.write(data)
            except error.HTTPError as err:
                print('HTTPError, code: %s' % err.code)
            except error.URLError as err:
                print('URLError, reason: %s' % err.reason)
            except socket.timeout:
                print('Time Out!')
            except Exception:
                print('Unknown Error!')
            self.queue.task_done()  # mark this queue item as processed

def main():
    queue = Queue()  # define the work queue
    for x in range(8):  # start 8 worker threads
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    for lineData in txtData:  # txtData: a list of (URL, save path) pairs, defined elsewhere
        link = lineData[0]
        file = lineData[1]
        queue.put((link, file))
    queue.join()  # wait until every item in the queue has been processed

if __name__ == '__main__':
    main()

Supplement: a summary of problems encountered in a large-scale Python crawler

A few days ago I came across some interesting material on a forum and wanted to crawl it. Estimating the scale, the goal was to crawl every article in the entire forum and save each one as a local txt file.

I started writing the crawler with the following general idea:

Start from the forum's front page and collect the URLs of all sections

From each section's listing pages, read the total page count, and derive the URLs of all its pages from the URL pattern

From each listing page, collect the URLs of all articles on it, then fetch those articles and save them as local txt files

This is a typical top-down approach, and with it the first version of the code was written.

Now to the point, a summary of the problems encountered:

1. Large-scale crawlers getting banned by the website

The crawler above performed well during debugging, but in real use it started returning HTTP errors after running for a while. Since it was on a wired network and inspection showed the network itself was not at fault, I concluded the website had banned it, and started investigating.

Generally, a Python crawler disguises itself as a browser by passing a headers argument to the urllib2.Request function, containing a user_agent string similar to:


"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

However, with a single user_agent, large-scale crawling is judged by the website as one client making rapid long-term accesses, which easily gets banned. The initial code already inserted a 0.5 s delay between page visits to prevent exactly this, but it was still not enough; increasing the delay would hurt the crawler's efficiency, and at this scale there would be no telling when it would finish.

So I considered masquerading as many different browsers to solve this problem. Concretely: collect many user_agent strings, save them in a list, and cycle through them on successive page visits, thus pretending to be many different browsers. The list is attached below:


user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
 
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
 
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
 
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0 Zune 3.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 4.0.20402; MS-RTC LM 8)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 3.0.04506; Media Center PC 5.0; SLCC1)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0; .NET CLR 3.0.04506; Media Center PC 5.0; SLCC1)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; FDM; Tablet PC 2.0; .NET CLR 4.0.20506; OfficeLiveConnector.1.4; OfficeLivePatch.1.3)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 3.0.04506; Media Center PC 5.0; SLCC1; Tablet PC 2.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.3029; Media Center PC 6.0; Tablet PC 2.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; Media Center PC 3.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; FDM; .NET CLR 1.1.4322)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; Alexa Toolbar)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.40607)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.1.4322)',
        'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1; .NET CLR 1.0.3705; Media Center PC 3.1; Alexa Toolbar; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)',
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; el-GR)',
        'Mozilla/5.0 (MSIE 7.0; Macintosh; U; SunOS; X11; gu; SV1; InfoPath.2; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)',
        'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; c .NET CLR 3.0.04506; .NET CLR 3.5.30707; InfoPath.1; el-GR)',
        'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; c .NET CLR 3.0.04506; .NET CLR 3.5.30707; InfoPath.1; el-GR)',
        'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; fr-FR)',
        'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; en-US)',
        'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64; .NET CLR 2.0.50727)',
        'Mozilla/4.79 [en] (compatible; MSIE 7.0; Windows NT 5.0; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)',
        'Mozilla/4.0 (Windows; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; FDM; SV1; .NET CLR 3.0.04506.30)',
        'Mozilla/4.0 (Mozilla/4.0; MSIE 7.0; Windows NT 5.1; FDM; SV1)',
        'Mozilla/4.0 (compatible;MSIE 7.0;Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0;)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; YPC 3.2.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; YPC 3.2.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; SLCC1; .NET CLR 3.0.04506)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; .NET CLR 1.1.4322)',
       ]

There are more than 60 user_agent strings above, i.e. more than 60 browsers to impersonate. After switching to this method, long crawls rarely errored out or slowed down, which basically solved the problem.

Note, however, that if the website bans by the user's IP rather than by user_agent, the problem is much harder to handle. Solutions found online involve cloud services and the like, which seem somewhat troublesome and unsuited to individual users; read up on them if interested.
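One common approach to IP-based bans, not covered in the original, is routing requests through rotating proxies. A minimal sketch with urllib2's Python 3 counterpart; the proxy addresses here are made-up placeholders, and real proxies would have to be sourced and validated separately:

```python
import urllib.request

# Hypothetical placeholders; not real proxy endpoints.
PROXIES = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

def opener_for(proxy):
    # Build an opener that routes HTTP traffic through the given proxy,
    # so successive requests can appear to come from different IPs.
    handler = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(handler)
```

Each request would then cycle through `opener_for(p).open(url)` for different `p`, analogous to the user_agent rotation above.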

2. Running unattended for a long time on an unstable network

Since the scale was fairly large, I could not sit in front of the computer the whole time, so the code's stability (fault tolerance) had to be high. Python's try...except syntax served very well here.

Practice over the previous days showed that most errors came from momentary network instability, and the fix is simple: just fetch the page again. So the page-fetching function was written in the following form:


# -*- coding: utf-8 -*-
# Python 2 code (urllib2, print statements)
import urllib2

def get_page_first(url):
    global user_agent_index
    user_agent_index += 1
    user_agent_index %= len(user_agent_list)
    user_agent = user_agent_list[user_agent_index]  # rotate through the user_agent list
    print user_agent_index
    headers = {'User-Agent': user_agent}
    print u"Crawling " + url
    req = urllib2.Request(url, headers=headers)
    try:
        response = urllib2.urlopen(req, timeout=30)
        page = response.read()
    except:
        # on any error (usually transient network trouble), retry once
        response = urllib2.urlopen(req, timeout=30)
        page = response.read()
    print u"Finished crawling " + url
    return page

Here, if a page gives no response within 30 s, it is fetched again. This basically solved the problem.
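The function above retries exactly once, and the second attempt is unprotected, so a second failure still crashes the worker. A sketch of a bounded-retry variant (Python 3 here; the attempt count is an arbitrary choice):

```python
import socket
import urllib.request

def fetch(url, attempts=3, timeout=30):
    # Retry up to `attempts` times on timeouts or connection errors,
    # instead of retrying exactly once.
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (socket.timeout, OSError):
            if i == attempts - 1:
                raise  # out of retries: re-raise the last error
```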

3. Naming errors when saving as local txt

Since the txt files are named in the form "date-author-title", and some post titles contain characters such as ? that are not allowed in file names, errors occurred. The solution here: if saving fails, first fall back to dropping the author ("date-title"); if there is still an error, save as "date-number". The specific code is as follows:


try:
    if news_author[0] == '':
        save_file(path + '//' + news_time[0] + '--' + news_title + '.txt', news)
    else:
        save_file(path + '//' + news_time[0] + '--' + news_author[0] + u" - " + news_title + '.txt', news)
except:
    try:
        save_file(path + '//' + news_time[0] + '--' + news_title + '.txt', news)
    except:
        save_file(path + '//' + news_time[0] + '--' + str(j) + '-' + str(index) + '.txt', news)
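An alternative sketch, not from the original article: strip the forbidden characters from the title up front instead of falling back after a failure. The function name is my own; the character set is the one Windows forbids in file names:

```python
import re

# Characters Windows forbids in file names: \ / : * ? " < > |
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')

def safe_name(title):
    # Replace each forbidden character with '-' so the post title can
    # still appear in the txt file name.
    return _ILLEGAL.sub('-', title).strip()
```

With this, the "date-author-title" form could be kept for every post, at the cost of slightly altered titles.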

4. Duplicate file names overwriting saved files

At first the code did not consider the possibility that two posts on the same day could have the same author and title. Later I noticed that on some pages the article count did not match the number of txt files saved, and traced it to this problem. The file-saving sub-function was therefore modified as follows. The idea: before saving, check whether a file with the same name exists; if not, save directly; if it does, append (i) to the name (with i incrementing from 1) and repeat until no file with that name exists:


import os

def save_file(path, inf):
    if not os.path.exists(path):
        f = open(path, 'w')
        f.write(inf)
        f.close()
    else:
        i = 0
        while True:
            i += 1
            tpath = path[:-4]  # strip the '.txt' extension
            tpath += '(' + str(i) + ')' + '.txt'
            if not os.path.exists(tpath):
                break
        f = open(tpath, 'w')
        f.write(inf)
        f.close()

5. Crawl speed: multi-window crawling and text-only pages

In theory a large-scale crawler could use multithreading to speed up crawling, but to avoid putting too much pressure on the website and getting banned, the main program does not use multithreading. Instead, to speed up progress, I manually opened several command-line windows running the crawler on different sections at the same time. This way, when one instance errors out the others keep running, which also improves the overall fault tolerance.

After running for a while, I found that most of the delay came from fetching, i.e. downloading, the pages, so that was the link to improve. While thinking about how to solve this, I discovered that the forum also offers text-only pages (similar to a mobile version): each page is much smaller, yet the needed information such as the article content is still there. After revising the code accordingly, the speed really did improve greatly. So in the future, before crawling a site, always check whether a text-only (mobile) version exists; it can raise the speed substantially.

6. Overall summary

After later improvements, the code could basically run for several days on end without errors; the stability problem was essentially solved and nothing was missed. After about ten days of running day and night, the crawl finally finished.

