Python crawler multithreading details and example code

  • 2020-05-12 02:50:34
  • OfStack

Python supports multithreading mainly through the thread and threading modules. thread is the lower-level module; threading is a wrapper around thread and is more convenient to use. (In Python 3, thread was renamed _thread.)
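Subclassing Thread (as the example below does) is not the only option: threading.Thread also accepts a target callable directly. A minimal sketch (the greet function and messages list are illustrative only):

```python
import threading

messages = []

def greet(name):
    # Runs in the worker thread
    messages.append('hello, ' + name)

t = threading.Thread(target=greet, args=('world',))
t.start()
t.join()  # Wait for the worker to finish before reading its result

print(messages[0])  # hello, world
```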

Although Python's multithreading is limited by the GIL and is not true parallelism, it can still significantly improve the efficiency of I/O-bound work such as crawling.
Here is an example to verify the efficiency of multithreading. The code only fetches pages; it does not parse them.


# -*- coding: utf-8 -*-
import time
import threading
import urllib.request  # urllib2 in the original Python 2 code

class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        self.func(*self.args)  # apply() was removed in Python 3

def open_url(url):
    request = urllib.request.Request(url)
    html = urllib.request.urlopen(request).read()
    print(len(html))
    return html


if __name__ == '__main__':
    # Build the list of urls
    urlList = []
    for p in range(1, 10):
        urlList.append('http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=' + str(p))

    # The sequential way
    n_start = time.time()
    for each in urlList:
        open_url(each)
    n_end = time.time()
    print('the normal way takes %s s' % (n_end - n_start))

    # Multithreading
    t_start = time.time()
    threadList = [MyThread(open_url, (url,)) for url in urlList]
    for t in threadList:
        t.daemon = True
        t.start()
    for i in threadList:
        i.join()
    t_end = time.time()
    print('the thread way takes %s s' % (t_end - t_start))

We fetched the 9 slow pages with both methods: the sequential approach took about 50 s, while the multithreaded approach took about 10 s.
Explanation of the multithreaded code:
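For comparison, on Python 3 this fetch-in-parallel pattern is usually written with concurrent.futures.ThreadPoolExecutor, which manages the threads for you. A minimal sketch (fetch_length is a stand-in for open_url so the example runs without network access; the pool size is arbitrary):

```python
import concurrent.futures

def fetch_length(url):
    # Stand-in for open_url: a real crawler would download the page here
    return len(url)

urls = ['http://example.com/page' + str(p) for p in range(1, 10)]

# map() submits every url to the pool and yields results in input order
with concurrent.futures.ThreadPoolExecutor(max_workers=9) as pool:
    results = list(pool.map(fetch_length, urls))

print(len(results))  # 9
```

The with block also joins all worker threads on exit, so no explicit start()/join() bookkeeping is needed.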


# Create a thread class that inherits from threading.Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # Call the parent class constructor
        self.args = args
        self.func = func

    def run(self):  # The thread's activity method
        self.func(*self.args)





threadList = [MyThread(open_url, (url,)) for url in urlList]  # Create one thread per url
for t in threadList:
    t.daemon = True  # Daemon threads do not keep the interpreter alive on exit
    t.start()        # Start the thread
for i in threadList:
    i.join()  # Block until this thread terminates, so the main thread waits for all workers
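The role of join() is easiest to see with a toy worker: without the joins, the main thread could reach the final print before any worker has appended its result (the 0.1 s sleep simulates slow I/O):

```python
import threading
import time

results = []

def worker(n):
    time.sleep(0.1)      # Simulate slow I/O
    results.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.daemon = True      # Daemon threads die with the main thread if not joined
    t.start()
for t in threads:
    t.join()             # Wait for every worker to finish

print(sorted(results))   # [0, 1, 2]
```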

That is the entire content of this article; I hope it helps with your study.

