Python 3 multithreaded crawler example with code explanation

  • 2020-06-23 01:01:37
  • OfStack

Overview of Multithreading

Multithreading lets a program split its work across several threads so that it can do more than one thing at a time, making use of otherwise idle CPU time and improving throughput. Python has offered two modules for this: thread and threading. The thread module has several shortcomings that threading addresses. In Python 3 the low-level thread module was renamed _thread and is kept mainly for backward compatibility; the higher-level, more capable threading module is the one to use.
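As a minimal sketch of the threading module in use, the snippet below starts a few threads and waits for them to finish. The names worker and run_workers are hypothetical helpers introduced only for this illustration.

```python
import threading


def worker(task_id, results, lock):
    """Record which thread handled which task."""
    name = threading.current_thread().name
    # The lock shows the general pattern for guarding shared data
    with lock:
        results.append((task_id, name))


def run_workers(count):
    """Start `count` threads, wait for all of them, and return what they recorded."""
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(i, results, lock), name="worker-%d" % i)
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until this thread has finished
    return sorted(results)


if __name__ == "__main__":
    print(run_workers(3))
```

Each Thread object wraps one call to the target function; start() launches it and join() waits for it to complete.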

Usage scenarios

The reference interpreter, CPython, has a GIL (Global Interpreter Lock). While Python code is being interpreted, this mutex restricts threads' access to shared interpreter state: a thread must hold the GIL to execute Python bytecode. The GIL is released when the running thread blocks on an I/O operation or after it has run for a set interval (a number of bytecode instructions in older versions, a time slice in Python 3.2 and later). So although CPython's threads are thin wrappers around native operating-system threads, the CPython process runs only one thread at a time, the one holding the GIL, while the others wait. As a result, even on a multi-core CPU, multithreaded Python code is only time-sliced rather than run in parallel.
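The point that Python threads are native threads living inside a single process can be checked directly. The sketch below assumes Python 3.8 or later (for threading.get_native_id); report and inspect_threads are hypothetical names used only here.

```python
import os
import threading


def report(barrier, seen, lock):
    # Wait until every thread has started, so all of them are alive at once
    barrier.wait()
    with lock:
        seen.append((os.getpid(), threading.get_native_id()))


def inspect_threads(count=3):
    """Return a (process id, native thread id) pair for each of `count` threads."""
    barrier = threading.Barrier(count)
    seen = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=report, args=(barrier, seen, lock))
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    for pid, native_id in inspect_threads():
        print("process %d, native thread %d" % (pid, native_id))
```

Every pair shares the same process id while the native thread ids differ, which matches the description above: several operating-system threads, one interpreter process, one GIL.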

If your program is CPU-intensive, its threads will most likely execute one after another. In that case multithreading is of little use and may even be slower than a single thread because of context-switching overhead. If your code is I/O-intensive, though, as tasks involving network or disk I/O are, multithreading can improve efficiency significantly. Multithreaded crawlers and multithreaded file processing are typical examples.
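The I/O-bound case can be illustrated without a network. In the sketch below, time.sleep stands in for waiting on a socket or disk (an assumption made for the demo, since a sleeping thread releases the GIL just as a blocked I/O call does). The names fake_io, run_sequential and run_threaded are hypothetical.

```python
import threading
import time


def fake_io(delay):
    # Stand-in for a blocking I/O call; the GIL is released while sleeping
    time.sleep(delay)


def run_sequential(n, delay):
    """Run n simulated I/O calls one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_io(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Run n simulated I/O calls on n threads; return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential: %.2f s" % run_sequential(4, 0.2))
    print("threaded:   %.2f s" % run_threaded(4, 0.2))
```

The sequential run takes roughly the sum of the delays, while the threaded run takes roughly one delay, because the waits overlap. Replacing fake_io with a pure computation would remove most of that gain under the GIL.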

Multithreaded crawler

Code example for a multithreaded crawler

Note: The following code runs under Python 3. The Python 2 version differs considerably, so this code will not run there. If you need help, leave a comment below.


# coding=utf-8
import queue
import threading
import time
from urllib import request

baseUrl = 'http://www.pythontab.com/html/pythonjichu/'

# Fill the queue with the pages to fetch (2.html through 9.html)
urlQueue = queue.Queue()
for i in range(2, 10):
    url = baseUrl + str(i) + '.html'
    urlQueue.put(url)


def fetchUrl(urlQueue):
    while True:
        try:
            # Non-blocking read; raises queue.Empty once the queue is drained
            url = urlQueue.get_nowait()
        except queue.Empty:
            break
        print('Current Thread Name %s, Url: %s ' % (threading.current_thread().name, url))
        try:
            response = request.urlopen(url)
            responseCode = response.getcode()
        except Exception:
            # Skip URLs that fail to load and move on to the next one
            continue
        if responseCode == 200:
            # Processing of the fetched content can go here
            # Sleep for 1 second to make the effect of threading visible
            time.sleep(1)


if __name__ == '__main__':
    startTime = time.time()
    threads = []
    # Adjust the number of threads to control the crawl speed
    threadNum = 4
    for i in range(0, threadNum):
        t = threading.Thread(target=fetchUrl, args=(urlQueue,))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        # Join every thread in turn so the main thread waits for all of them;
        # the worker threads do not block one another
        t.join()
    endTime = time.time()
    print('Done, Time cost: %s ' % (endTime - startTime))
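For comparison, the same worker pattern can be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, which handles the work queue and the joins internally. This is a minimal sketch, not the script used for the measurements that follow. fetch_status and crawl are hypothetical helper names introduced here, and the 10-second timeout is an assumed value.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.getcode()
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs
        return None


def crawl(urls, fetch=fetch_status, max_workers=4):
    """Apply fetch to every URL on a thread pool; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return dict(zip(urls, pool.map(fetch, urls)))
```

With the baseUrl from the script above, a call such as crawl([baseUrl + str(i) + '.html' for i in range(2, 10)]) would fetch the same eight pages. Passing fetch as a parameter keeps the pool logic testable without network access.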

Results:

1 thread:


Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/2.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/3.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/4.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/5.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/6.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/7.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/8.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/9.html 
Done, Time cost: 8.182249069213867

2 threads:


Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/2.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/3.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/4.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/5.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/6.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/7.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/8.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/9.html 
Done, Time cost: 4.0987958908081055

4 threads:


Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/2.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/3.html 
Current Thread Name Thread-3, Url: http://www.pythontab.com/html/pythonjichu/4.html 
Current Thread Name Thread-4, Url: http://www.pythontab.com/html/pythonjichu/5.html 
Current Thread Name Thread-2, Url: http://www.pythontab.com/html/pythonjichu/6.html 
Current Thread Name Thread-4, Url: http://www.pythontab.com/html/pythonjichu/7.html 
Current Thread Name Thread-1, Url: http://www.pythontab.com/html/pythonjichu/9.html 
Current Thread Name Thread-3, Url: http://www.pythontab.com/html/pythonjichu/8.html 
Done, Time cost: 2.287320137023926

Adjusting the number of threads shows that the execution time falls as the thread count rises, from about 8.2 seconds to 4.1 seconds to 2.3 seconds across the three runs above, so fetching throughput grows roughly in proportion to the number of threads in this test.
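The proportionality claim can be checked with a little arithmetic on the timings printed above. The sketch below treats the third run as four threads, an assumption based on its output naming Thread-1 through Thread-4. speedup_table is a hypothetical helper.

```python
# Elapsed seconds copied from the three runs above, keyed by thread count
# (the key 4 is an assumption: the third run's output shows four thread names)
timings = {1: 8.182249069213867, 2: 4.0987958908081055, 4: 2.287320137023926}


def speedup_table(timings):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = base / timings[n]
        rows.append((n, speedup, speedup / n))  # efficiency = speedup per thread
    return rows


if __name__ == "__main__":
    for n, speedup, efficiency in speedup_table(timings):
        print("%d thread(s): speedup %.2fx, efficiency %.0f%%" % (n, speedup, efficiency * 100))
```

Two threads give a speedup of almost exactly 2, while four threads give about 3.6 rather than 4, so the scaling is close to proportional but already slightly below it at this small sample size.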

Conclusion:

For I/O-intensive tasks, Python multithreading can significantly improve efficiency; CPU-intensive tasks are not well suited to multithreading.

