Python crawler multithreading details and example code
- 2020-05-12 02:50:34
- OfStack
Python supports multithreading mainly through the thread and threading modules. thread (renamed _thread in Python 3) is the lower-level module; threading wraps it and is more convenient to use.
Because of the GIL, Python's multithreading is not true parallelism, but it can still significantly improve the efficiency of I/O-bound work such as crawling, since a thread blocked on network I/O releases the GIL and lets other threads run.
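The claim that threads help I/O-bound work despite the GIL can be demonstrated without touching the network, by simulating a slow request with time.sleep (which, like socket I/O, releases the GIL). This is a minimal sketch; the 0.5 s delay and the thread count are arbitrary choices for illustration:

```python
import threading
import time

def fake_request(results, i):
    """Simulate one slow I/O-bound request."""
    time.sleep(0.5)
    results[i] = i

def run_sequential(n):
    """Run n fake requests one after another; return elapsed seconds."""
    results = [None] * n
    start = time.time()
    for i in range(n):
        fake_request(results, i)
    return time.time() - start

def run_threaded(n):
    """Run n fake requests in parallel threads; return elapsed seconds."""
    results = [None] * n
    start = time.time()
    threads = [threading.Thread(target=fake_request, args=(results, i))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

if __name__ == '__main__':
    # Five sequential sleeps take roughly 2.5 s; five concurrent ones
    # take roughly 0.5 s, because the sleeps overlap.
    print('sequential: %.2f s' % run_sequential(5))
    print('threaded:   %.2f s' % run_threaded(5))
```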
Here is an example that verifies the speedup. The code only fetches pages; it does not parse them.
# -*- coding: utf-8 -*-
import time
import threading
from urllib import request  # urllib2 in Python 2

class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        self.func(*self.args)  # apply(self.func, self.args) in Python 2

def open_url(url):
    req = request.Request(url)
    html = request.urlopen(req).read()
    print(len(html))
    return html

if __name__ == '__main__':
    # build the url list
    urlList = []
    for p in range(1, 10):
        urlList.append('http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=' + str(p))
    # the normal (sequential) way
    n_start = time.time()
    for each in urlList:
        open_url(each)
    n_end = time.time()
    print('the normal way take %s s' % (n_end - n_start))
    # multithreading
    t_start = time.time()
    threadList = [MyThread(open_url, (url,)) for url in urlList]
    for t in threadList:
        t.daemon = True
        t.start()
    for i in threadList:
        i.join()
    t_end = time.time()
    print('the thread way take %s s' % (t_end - t_start))
Fetching the nine slow pages with the two methods took about 50 s sequentially and about 10 s with multithreading.
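For comparison, on Python 3.2+ the same fetch-in-parallel pattern is often written with concurrent.futures.ThreadPoolExecutor, which creates, starts, and joins the threads for you. This is a sketch rather than the article's code: the open_url here is a stand-in that simulates a fetch with time.sleep instead of hitting the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def open_url(url):
    """Stand-in for the article's fetch function: simulate slow I/O."""
    time.sleep(0.2)
    return len(url)

urls = ['http://example.com/page/%d' % p for p in range(1, 10)]

start = time.time()
# map() runs open_url concurrently on up to max_workers threads
# and yields results in the input order.
with ThreadPoolExecutor(max_workers=9) as pool:
    sizes = list(pool.map(open_url, urls))
print('fetched %d pages in %.2f s' % (len(sizes), time.time() - start))
```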
Interpretation of the multithreaded code:
# Define a thread class that inherits from threading.Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the parent constructor
        self.args = args
        self.func = func

    def run(self):  # the thread's activity, invoked by start()
        self.func(*self.args)  # apply(self.func, self.args) in Python 2

threadList = [MyThread(open_url, (url,)) for url in urlList]  # create one thread per url
for t in threadList:
    t.daemon = True  # daemon threads do not keep the program alive when the main thread exits
    t.start()        # start the thread; run() executes in the new thread
for i in threadList:
    i.join()  # block the main thread until this child thread has finished
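Starting one thread per URL, as above, does not scale to large crawls. A common refinement is a fixed pool of worker threads consuming URLs from a queue.Queue. The following is a sketch under the same assumption as before, with the fetch simulated by time.sleep; workers write to separate keys, and dict item assignment is thread-safe in CPython:

```python
import queue
import threading
import time

def worker(q, results):
    """Pull urls from the queue until it is drained."""
    while True:
        try:
            url = q.get_nowait()
        except queue.Empty:
            return  # no work left, let the thread exit
        time.sleep(0.1)          # stand-in for the real page fetch
        results[url] = len(url)  # record a result per url
        q.task_done()

def crawl(urls, num_workers=4):
    """Fetch all urls with a bounded pool of worker threads."""
    q = queue.Queue()
    for u in urls:
        q.put(u)
    results = {}
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# usage: crawl(['http://example.com/page/%d' % i for i in range(8)], num_workers=4)
```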
That is the entire content of this article; I hope it helps with your study.