A nice use for decorators in Python
- 2020-04-02 14:33:08
- OfStack
Well, I know it's the middle of the night... but I think this is worth half an hour to share a fresh idea.
We're going to simulate a scenario: you crawl a page, collect a batch of URLs from it, and then crawl data from each of those sub-URLs. To keep it simple, we split the crawl into three layers, so our code looks like this:
def func_top(url):
    data_dict = {}
    # Get the sub-page URLs on this page
    sub_urls = xxxx
    data_list = []
    for it in sub_urls:
        data_list.append(func_sub(it))
    data_dict['data'] = data_list
    return data_dict

def func_sub(url):
    data_dict = {}
    # Get the bottom-page URLs on this page
    bottom_urls = xxxx
    data_list = []
    for it in bottom_urls:
        data_list.append(func_bottom(it))
    data_dict['data'] = data_list
    return data_dict

def func_bottom(url):
    # Fetch the actual data
    data = xxxx
    return data
func_top handles the top-level page, func_sub handles the sub-pages, and func_bottom handles the deepest pages.
Normally this would be enough, but the site you are crawling may be unstable, and requests often come back without the data.
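To make the structure concrete, here is a minimal runnable sketch of the three layers above; the `fetch_*` helpers are hypothetical stand-ins for the real scraping code (the `xxxx` placeholders).

```python
def fetch_sub_urls(url):
    # Hypothetical: pretend the top page links to two sub-pages.
    return [url + '/sub1', url + '/sub2']

def fetch_bottom_urls(url):
    # Hypothetical: pretend each sub-page links to two bottom pages.
    return [url + '/a', url + '/b']

def fetch_data(url):
    # Hypothetical: pretend the bottom page yields one record.
    return {'url': url}

def func_bottom(url):
    return fetch_data(url)

def func_sub(url):
    return {'data': [func_bottom(u) for u in fetch_bottom_urls(url)]}

def func_top(url):
    return {'data': [func_sub(u) for u in fetch_sub_urls(url)]}

result = func_top('http://example.com')
# result['data'] holds 2 sub-page entries, each holding 2 bottom-page records
```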
So at this point you have two choices:
1. Stop on error, then restart from where you broke off.
2. Keep going on error and rerun the whole job later; on the rerun you don't want to re-fetch the data you already have, only the data you missed.
The first option is almost impossible to implement reliably: if the site reorders its URLs, your recorded position becomes meaningless. That leaves only the second option, which boils down to caching the data you already have and reading it from the cache when you need it again.
OK, the goal is already there, how to achieve it?
If we were writing C++ this would be a lot of trouble and the code would be ugly, but thankfully we're using Python, and Python has decorators for functions.
So the plan is:
Define a decorator that returns the data from the cache if it was fetched before. If not, it pulls the data from the site and stores it in the cache.
The code is as follows:
import ast
import hashlib
import os

def get_dump_data(dir_name, url):
    filename = hashlib.md5(url.encode('utf-8')).hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)
    if os.path.isfile(full_file_name):
        with open(full_file_name, 'r') as f:
            # The cache stores repr(data); literal_eval safely parses it back
            return ast.literal_eval(f.read())
    return None

def set_dump_data(dir_name, url, data):
    if not os.path.isdir('dumps/' + dir_name):
        os.makedirs('dumps/' + dir_name)
    filename = hashlib.md5(url.encode('utf-8')).hexdigest()
    full_file_name = 'dumps/%s/%s' % (dir_name, filename)
    with open(full_file_name, 'w') as f:
        f.write(repr(data))

def deco_dump_data(func):
    def func_wrapper(url):
        # Serve from the cache if we already fetched this url
        data = get_dump_data(func.__name__, url)
        if data is not None:
            return data
        # Otherwise fetch it and store the result
        data = func(url)
        if data is not None:
            set_dump_data(func.__name__, url, data)
        return data
    return func_wrapper
Then we just need to apply the deco_dump_data decorator to each of func_top, func_sub and func_bottom.
Done! The biggest advantage is that every layer (top, sub, bottom) dumps its own data, so once a sub layer has its result cached it never descends into the corresponding bottom calls again, which saves a lot of requests!
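Here is a self-contained demo of applying the decorator. It is the same caching logic as above, condensed into one function and writing under a temporary directory instead of 'dumps/' so it can run anywhere; the call counter is a hypothetical fetch that shows the second call never reaches the site.

```python
import ast
import hashlib
import os
import tempfile

CACHE_ROOT = tempfile.mkdtemp()  # stand-in for the article's 'dumps/' directory

def deco_dump_data(func):
    def func_wrapper(url):
        cache_dir = os.path.join(CACHE_ROOT, func.__name__)
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir,
                            hashlib.md5(url.encode('utf-8')).hexdigest())
        if os.path.isfile(path):
            # Cache hit: parse the stored repr() back into Python data
            with open(path) as f:
                return ast.literal_eval(f.read())
        data = func(url)
        if data is not None:
            with open(path, 'w') as f:
                f.write(repr(data))
        return data
    return func_wrapper

calls = {'n': 0}

@deco_dump_data
def func_bottom(url):
    # Hypothetical fetch: count how often we really hit the site.
    calls['n'] += 1
    return {'url': url, 'value': 1}

first = func_bottom('http://example.com/p')
second = func_bottom('http://example.com/p')
# calls['n'] == 1: the second call was served from the on-disk cache
```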
OK, that's it ~ life is short, I use python!