A brief analysis of multiprocessing and multithreading in Python


In discussions criticizing Python, it is often said that Python multithreading is hard to use. Others point out that the global interpreter lock (affectionately known as the "GIL") prevents multiple Python threads from actually running at the same time. Because of this, if you are coming from another language such as C++ or Java, the Python threading module will not behave the way you might expect. It is important to note that we can still write concurrent or parallel code in Python and get a significant performance boost, as long as we keep a few things in mind. If you haven't read it yet, I suggest you take a look at Eqbal Quran's article "Concurrency and Parallelism in Ruby".

In this article, we will write a small Python script to download the most popular images from Imgur. We will start with a version that downloads the images sequentially, one at a time. Before that, you have to register an application with Imgur. If you don't already have an Imgur account, sign up for one first.

The script in this article was tested with Python 3.4.2. With some slight changes it should also run under Python 2; urllib is the part that differs most between the two versions.
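For reference, the difference boils down to where urlopen and Request live. A minimal compatibility shim (a sketch, not part of the scripts below) could look like this:

try:
    # Python 3: the request machinery lives in urllib.request
    from urllib.request import urlopen, Request
except ImportError:
    # Python 2: the equivalent classes live in urllib2
    from urllib2 import urlopen, Request
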
Getting started

Let's start by creating a Python module called "download.py". This file contains all the functions needed to fetch the list of images and download them. We will split this work into three separate functions:


  get_links
  download_link
  setup_download_dir

The third function, "setup_download_dir", is used to create the target directory for the download (if it does not exist).

The Imgur API requires HTTP requests to carry an "Authorization" header with the client ID. You can find this client ID on the dashboard of the Imgur application you registered, and the response will be JSON-encoded. We can use Python's standard json library to decode it. Downloading an image is even simpler: you just fetch the image by its URL and write it to a file.

The code is as follows:
 


import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request
 
logger = logging.getLogger(__name__)
 
def get_links(client_id):
  headers = {'Authorization': 'Client-ID {}'.format(client_id)}
  req = Request('https://api.imgur.com/3/gallery/', headers=headers, method='GET')
  with urlopen(req) as resp:
    data = json.loads(resp.read().decode('utf-8'))
  return map(lambda item: item['link'], data['data'])
 
def download_link(directory, link):
  logger.info('Downloading %s', link)
  download_path = directory / os.path.basename(link)
  with urlopen(link) as image, download_path.open('wb') as f:
    f.write(image.read())
 
def setup_download_dir():
  download_dir = Path('images')
  if not download_dir.exists():
    download_dir.mkdir()
  return download_dir

Next, we need to write a module that uses these functions to download the images one by one. We will name it "single.py". It contains the main function of the original, naive version of the Imgur image downloader. The module reads the Imgur client ID from the environment variable "IMGUR_CLIENT_ID". It calls "setup_download_dir" to create the download directory. Finally, it fetches the list of images with the get_links function, filters out all GIF and album URLs, and then uses "download_link" to download each image and save it to disk. Here is the code for "single.py":


import logging
import os
from time import time
 
from download import setup_download_dir, get_links, download_link
 
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)
 
def main():
  ts = time()
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  for link in links:
    download_link(download_dir, link)
  print('Took {}s'.format(time() - ts))
 
if __name__ == '__main__':
  main()

On my laptop, this script took 19.4 seconds to download 91 images. Note that these numbers may vary depending on your network. 19.4 seconds is not terribly long, but what if we wanted to download more images? Perhaps 900 instead of 90. At roughly 0.2 seconds per image, 900 images would take about 3 minutes, and 9,000 images would take around 30 minutes. The good news is that with concurrency or parallelism we can speed this up dramatically.

The following code examples will only show the imports that are new or specific to each example. All of the relevant Python scripts can be found in this GitHub repository.

Using threads

Threads are one of the best-known approaches to concurrency and parallelism. Operating systems generally provide threading support. Threads are lighter-weight than processes and share the same memory space.
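As a quick illustration of that shared memory space (a minimal sketch, unrelated to the Imgur script), several threads can write into the same Python object without any explicit communication channel:

import threading

results = []  # a single list, visible to every thread in the process

def record(n):
    # Threads write directly into the shared object; no message passing is needed
    results.append(n * n)

threads = [threading.Thread(target=record, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]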

Here we will write a new module to replace "single.py". It will create a pool of eight threads, which together with the main thread makes nine. I chose eight threads because my computer has eight CPU cores, and one worker thread per core seemed like a reasonable number. In practice, choosing the number of threads requires weighing other factors as well, such as other applications and services running on the same machine.
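Rather than hardcoding eight, one option (a small sketch, assuming Python 3.4 or later) is to size the pool from the machine's reported core count:

import os

# os.cpu_count() returns the number of logical CPUs, or None if it cannot be determined.
# For IO-bound work such as downloading, using more threads than cores can also be reasonable.
num_worker_threads = os.cpu_count() or 4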

The following script is almost the same as the previous one, except that we now have a new class, DownloadWorker, a subclass of Thread. It overrides the run method with an infinite loop. On every iteration it calls "self.queue.get()" to try to fetch a URL from a thread-safe queue. It blocks until there is an item in the queue for it to process. Once a worker gets an item from the queue, it calls the same "download_link" method used in the previous script to download the image into the directory. When the download finishes, the worker signals to the queue that the task is done. This matters because the queue keeps track of how many tasks are still outstanding; if the workers never signaled completion, the call to "queue.join()" would block the main thread forever.
 


from queue import Queue
from threading import Thread
 
class DownloadWorker(Thread):
  def __init__(self, queue):
    Thread.__init__(self)
    self.queue = queue
 
  def run(self):
    while True:
      # Get the work from the queue and expand the tuple
      directory, link = self.queue.get()
      download_link(directory, link)
      self.queue.task_done()
 
def main():
  ts = time()
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  # Create a queue to communicate with the worker threads
  queue = Queue()
  # Create 8 worker threads
  for x in range(8):
    worker = DownloadWorker(queue)
    # Setting daemon to True will let the main thread exit even though the workers are blocking
    worker.daemon = True
    worker.start()
  # Put the tasks into the queue as a tuple
  for link in links:
    logger.info('Queueing {}'.format(link))
    queue.put((download_dir, link))
  # Causes the main thread to wait for the queue to finish processing all the tasks
  queue.join()
  print('Took {}s'.format(time() - ts))

Running this script on the same machine, the download time dropped to 4.1 seconds, about 4.7 times faster than the previous example. While this is much faster, it is worth mentioning that, because of the GIL, only one thread in this process was ever executing at a time. Therefore, this code is concurrent, but not parallel. The reason it is still faster is that this is an IO-bound task. The process spends almost no effort downloading the images; most of its time is spent waiting on the network. That is why threads provide such a large speedup here: whenever one thread becomes ready, the process can switch to it. Using the threading module in Python, or in any other interpreted language with a GIL, can actually degrade performance for CPU-bound work. If your code performs CPU-intensive tasks, such as decompressing gzip files, the threading module will lead to longer execution times. For CPU-intensive tasks and truly parallel execution, we can use the multiprocessing module.
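To see why, consider a toy CPU-bound function timed sequentially and then with two threads (a rough sketch; the exact numbers depend on your machine, but on CPython the threaded version will not be faster and is often slower):

from threading import Thread
from time import time

def count_down(n):
    # A CPU-bound loop in pure Python; it holds the GIL almost the entire time
    while n > 0:
        n -= 1

N = 10000000

ts = time()
count_down(N)
count_down(N)
print('Sequential: {:.2f}s'.format(time() - ts))

ts = time()
t1 = Thread(target=count_down, args=(N,))
t2 = Thread(target=count_down, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
print('Two threads: {:.2f}s'.format(time() - ts))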

The official Python implementation, CPython, has a GIL, but not all Python implementations do. For example, IronPython, a Python implementation built on the .NET framework, has no GIL, and neither does Jython, which is based on Java. You can find a list of existing Python implementations here.

Spawning multiple processes

The multiprocessing module is easier to use than the threading module, because we don't need to add a new class as we did in the threading example. The only change we need to make is in the main function.

To use multiple processes, we create a process pool. Using the map method it provides, we pass the list of URLs to the pool, which spawns eight new processes that download the images in parallel. This is true parallelism, but it comes at a cost: the entire memory of the script is copied into each child process. In our case this is not a big deal, but it can easily cause serious problems in larger programs.
 


from functools import partial
from multiprocessing.pool import Pool
 
def main():
  ts = time()
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  download = partial(download_link, download_dir)
  with Pool(8) as p:
    p.map(download, links)
  print('Took {}s'.format(time() - ts))
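
One usage note the snippet above omits: on platforms that start child processes by spawning a fresh interpreter (for example Windows), the multiprocessing entry point must be protected by the usual guard, just as in "single.py", otherwise each child re-imports the module and tries to create its own pool:

if __name__ == '__main__':
    main()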

Distributed tasks

You now know that the threading and multiprocessing modules are a great help for scripts running on your own computer, but what do you do when you want to run tasks on a different machine, or when you need to scale beyond what one machine can do? A good use case is long-running background tasks for web applications. If you have some long-running tasks, you don't want them hogging the processes or threads that the rest of your application code needs on the same machine. That would degrade the performance of your application and affect your users. It would be better to run these tasks on one or more other machines.

The Python library RQ is great for this kind of job. It is a simple but powerful library. First, a function and its arguments are enqueued. The function call is serialized with pickle and the resulting representation is appended to a Redis list. Enqueuing the task is only the first step; nothing has actually been done yet. We also need at least one worker listening on that task queue.

The first step is to install and run a Redis server on your computer, or have access to a running Redis server. After that, only a few small changes to the existing code are needed. We create an instance of an RQ Queue and pass it a Redis connection from the redis-py library. Then, instead of calling "download_link" directly, we call "q.enqueue(download_link, download_dir, link)". The first argument to the enqueue method is the function to run; any remaining positional and keyword arguments are passed to that function when the task actually executes.

The last step is to start some workers. RQ provides a handy script for running a worker on the default queue. Just execute "rqworker" in a terminal window and it will start a worker listening on the default queue. Make sure your current working directory is the same as the scripts'. If you want to listen on a different queue, run "rqworker queue_name", and the worker will process jobs from the queue named queue_name. One of the nice things about RQ is that as long as you can connect to Redis, you can run as many workers as you like on as many machines as you like, so it scales up easily with your application. Here is the code for the RQ version:
 


from redis import Redis
from rq import Queue
 
def main():
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  q = Queue(connection=Redis(host='localhost', port=6379))
  for link in links:
    q.enqueue(download_link, download_dir, link)
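
If you would rather start a worker from Python instead of the rqworker script, RQ also exposes a Worker class. A minimal sketch (assuming a local Redis instance and the default queue) might look like this:

from redis import Redis
from rq import Queue, Worker

redis_conn = Redis(host='localhost', port=6379)
# Listen on the default queue, using the same Redis connection used for enqueueing
worker = Worker([Queue('default', connection=redis_conn)], connection=redis_conn)
worker.work()  # blocks and processes jobs as they arrive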

However, RQ is not the only Python task queue solution. RQ is easy to use and covers simple use cases well, but if more advanced requirements come up, other solutions (such as Celery) can be used.

Summary

If your code is IO-bound, both threads and multiple processes can help. Multiprocessing is easier to drop in than threading, but it has a higher memory overhead. If your code is CPU-bound, multiprocessing is clearly the better choice, especially if the target machine has multiple cores or CPUs. For web applications, and whenever you need to scale work across multiple machines, RQ is a better choice.

