A nice use for decorators in Python

  • 2020-04-02 14:33:08
  • OfStack

Well, I know it's the middle of the night... but I think this idea is worth spending half an hour to share.

Let's simulate a common scraping scenario: you fetch a top-level page, collect a batch of urls from it, then fetch data from each of those sub-pages. Keeping it simple, we split the work into three layers, so our code looks like this:


def func_top(url):
    data_dict = {}

    # Get the sub-page urls on this page
    sub_urls = xxxx

    data_list = []
    for it in sub_urls:
        data_list.append(func_sub(it))

    data_dict['data'] = data_list

    return data_dict

def func_sub(url):
    data_dict = {}

    # Get the bottom-page urls on this sub-page
    bottom_urls = xxxx

    data_list = []
    for it in bottom_urls:
        data_list.append(func_bottom(it))

    data_dict['data'] = data_list

    return data_dict

def func_bottom(url):
    # Get the actual data
    data = xxxx
    return data

func_top is the handler for the top-level page, func_sub handles the sub-pages, and func_bottom handles the deepest pages.

Under normal circumstances this is enough to meet the requirements, but the site you're crawling may be unstable, so some links will often fail to return data.

So at this point you have two choices:

1. Stop when you hit an error, then restart later from where it broke
2. Keep going past errors and rerun the job later; on the rerun, you don't want to re-fetch data you already have from the site, only the data you missed

The first option is almost impossible to implement reliably: if the site reorders its urls, your recorded position becomes meaningless. That leaves the second option, which boils down to caching the data you already have and reading it from the cache when it's needed again.

OK, the goal is already there, how to achieve it?

If we were writing C++ this would be a lot of trouble and the code would be ugly, but thankfully we're using Python, and Python has decorators for functions.
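For readers new to decorators, here's a minimal sketch first (the names `log_calls` and `add` are just for illustration): a decorator is a function that takes a function and returns a wrapped version of it, so you can run extra code around every call.

```python
import functools

def log_calls(func):
    # A minimal decorator: return a wrapper that runs extra code around each call
    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        print('calling %s' % func.__name__)
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(add(1, 2))  # prints "calling add", then 3
```

The `@log_calls` line is just shorthand for `add = log_calls(add)`.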

So here's the plan:

Define a decorator that returns data from the cache if the url was fetched before. If not, it pulls the data from the site and stores it in the cache.

The code is as follows:


import hashlib
import os

def get_dump_data(dir_name, url):
    # Cache file name: md5 of the url, under dumps/<function name>/
    m = hashlib.md5(url.encode('utf-8'))
    filename = m.hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)

    if os.path.isfile(full_file_name):
        with open(full_file_name, 'r') as f:
            return eval(f.read())
    return None


def set_dump_data(dir_name, url, data):
    if not os.path.isdir('dumps/' + dir_name):
        os.makedirs('dumps/' + dir_name)

    m = hashlib.md5(url.encode('utf-8'))
    filename = m.hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)

    with open(full_file_name, 'w') as f:
        f.write(repr(data))


def deco_dump_data(func):
    def func_wrapper(url):
        # Return cached data if this url was fetched before
        data = get_dump_data(func.__name__, url)
        if data is not None:
            return data

        # Otherwise fetch for real and dump the result to the cache
        data = func(url)
        if data is not None:
            set_dump_data(func.__name__, url, data)
        return data

    return func_wrapper

Then we just need to apply the deco_dump_data decorator to each of func_top, func_sub, and func_bottom.
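As a sketch of what that looks like in practice, here is a self-contained demo (it repeats a condensed version of the cache helpers so it runs on its own, and uses a stand-in func_bottom instead of a real network fetch):

```python
import hashlib
import os

def get_dump_data(dir_name, url):
    filename = hashlib.md5(url.encode('utf-8')).hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)
    if os.path.isfile(full_file_name):
        with open(full_file_name, 'r') as f:
            return eval(f.read())
    return None

def set_dump_data(dir_name, url, data):
    os.makedirs('dumps/' + dir_name, exist_ok=True)
    filename = hashlib.md5(url.encode('utf-8')).hexdigest()
    with open('dumps/%s/%s' % (dir_name, filename), 'w') as f:
        f.write(repr(data))

def deco_dump_data(func):
    def func_wrapper(url):
        data = get_dump_data(func.__name__, url)
        if data is not None:
            return data
        data = func(url)
        if data is not None:
            set_dump_data(func.__name__, url, data)
        return data
    return func_wrapper

calls = []  # track how many real "fetches" happen

@deco_dump_data
def func_bottom(url):
    calls.append(url)  # stand-in for the real network fetch
    return {'url': url, 'payload': 'xxxx'}

first = func_bottom('http://example.com/item/1')   # real fetch, dumped to disk
second = func_bottom('http://example.com/item/1')  # served from dumps/func_bottom/
assert first == second and len(calls) == 1
```

On the second call the wrapper finds the dump file and never enters func_bottom, which is exactly the rerun behavior we wanted.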

Done! The biggest advantage of doing it this way is that every layer (top, sub, bottom) dumps its own data, so once a sub layer's data has been dumped, its corresponding bottom-layer urls are never fetched again, which saves a lot of overhead!

OK, that's it ~ life is short, I use python!

