An introductory tutorial on network programming using threads in Python

  • 2020-05-10 18:22:04
  • OfStack

Introduction

There is no shortage of concurrency options for Python: the standard library includes support for threads, processes, and asynchronous I/O. In many cases Python simplifies the use of these mechanisms through high-level modules for asynchrony, threading, and subprocesses. Beyond the standard library, there are third-party solutions such as Twisted, Stackless, and the processing module. This article focuses on threading in Python and uses practical examples to illustrate it. While there are many good online resources detailing the threading API, this article tries instead to show some common threading usage patterns through working examples.

The global interpreter lock (Global Interpreter Lock, or GIL) means that the Python interpreter is not fully thread-safe: a thread must hold the global lock before it can safely access Python objects. Because only one thread at a time can hold the lock and interact with Python objects or the Python/C API, the interpreter regularly releases and reacquires it, by default every 100 bytecode instructions. The frequency at which the interpreter checks for thread switches can be controlled with the sys.setcheckinterval() function.
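(A side note for readers on modern Python: the bytecode-based check interval described above is Python 2 behavior; since Python 3.2 the switch is time-based. A quick sketch of the Python 3 equivalent, for comparison:)

```python
import sys

# Python 2 used a bytecode-count check interval (sys.setcheckinterval());
# Python 3.2+ uses a time-based switch interval instead.
print(sys.getswitchinterval())   # default is 0.005 seconds
sys.setswitchinterval(0.01)      # ask the interpreter to switch threads less often
print(sys.getswitchinterval())
```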

In addition, the lock is released and reacquired around potentially blocking I/O operations. For more information, see Gil and Threading State and Threading the Global Interpreter Lock in the resources section.

It should be noted that, because of the GIL, CPU-bound applications will not benefit from the use of threads. For such workloads it is recommended to use processes, or a mix of processes and threads.

It is important to first understand the difference between a process and a thread. Threads differ from processes in that they share state, memory, and resources. This simple distinction is both a strength and a weakness of threads. On the one hand, threads are lightweight and easy to communicate with; on the other hand, they introduce a variety of problems, including deadlocks, race conditions, and added complexity. Fortunately, because of the GIL and the Queue module, threaded programming in Python is much less complex than in many other languages.
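To make the shared-state point concrete, here is a minimal sketch (written in Python 3, while the article's own examples use Python 2) showing that all threads in a process see the same objects:

```python
import threading

shared = []                  # one list, visible to every thread in this process
lock = threading.Lock()      # protects the list from concurrent appends

def worker(name):
    with lock:
        shared.append(name)

threads = [threading.Thread(target=worker, args=("thread-%d" % i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))        # all three names ended up in the same list
```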
Using Python threads

To follow along with this article, I'll assume you have Python 2.5 or later installed, because many of the examples use features that first appeared in Python 2.5. To get started with threads in Python, we'll begin with a simple "Hello World" example:
hello_threads_example


    import threading
    import datetime
    
    class ThreadClass(threading.Thread):
     def run(self):
      now = datetime.datetime.now()
      print "%s says Hello World at time: %s" % 
      (self.getName(), now)
    
    for i in range(2):
     t = ThreadClass()
     t.start()

If you run the example, you get the following output:


   # python hello_threads.py 
   Thread-1 says Hello World at time: 2008-05-13 13:22:50.252069
   Thread-2 says Hello World at time: 2008-05-13 13:22:50.252576

Looking closely at the output, you can see the Hello World statement printed from both threads, each with a timestamp. If you look at the code itself, you'll find two import statements: one imports the datetime module and the other imports the threading module. The class ThreadClass inherits from threading.Thread, which is why you define a run method containing the code you want to execute in the thread. The only other thing to note in run is that self.getName() is a method that returns the name of the thread.

The last three lines of code actually instantiate the class and start the threads. If you look closely, t.start() is what actually starts each thread. The threading module was designed with inheritance in mind and is built on top of the lower-level thread module. In most cases, inheriting from threading.Thread is a best practice because it creates a consistent API for threaded programming.
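That said, inheritance is not the only option: threading.Thread also accepts a target callable, which avoids a subclass for simple cases. A Python 3 sketch (the article's own examples are Python 2):

```python
import threading
import datetime

results = []   # collect output so the main thread can inspect it

def say_hello():
    now = datetime.datetime.now()
    results.append("%s says Hello World at time: %s"
                   % (threading.current_thread().name, now))

# pass a target= callable instead of subclassing threading.Thread
threads = [threading.Thread(target=say_hello) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for line in results:
    print(line)
```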
Using thread queues

As mentioned earlier, using threads gets complicated when threads need to share data or resources. The threading module provides many synchronization primitives, including semaphores, condition variables, events, and locks. While those options exist, it is considered a best practice to focus instead on using queues. Queues are much easier to deal with and make threaded programming considerably safer, because they effectively funnel all access to a resource through a single thread and support a cleaner, more readable design pattern.
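For completeness, here is what one of those lower-level primitives looks like; a minimal Python 3 sketch of an Event used to signal between threads:

```python
import threading

done = threading.Event()      # one of the primitives the threading module provides
results = []

def waiter():
    done.wait()               # blocks until another thread calls set()
    results.append("event fired")

t = threading.Thread(target=waiter)
t.start()
done.set()                    # wake up the waiting thread
t.join()
print(results)
```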

In the next example, you will first create a program that runs serially, fetching the URLs of websites and printing the first 1024 bytes of each page. This is a classic example of a task that threads can complete faster. First, let's use the urllib2 module to grab the pages one at a time, and time how long the code takes to run:
Serial URL fetch


    import urllib2
    import time
    
    hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
    "http://ibm.com", "http://apple.com"]
    
    start = time.time()
    #grabs urls of hosts and prints first 1024 bytes of page
    for host in hosts:
     url = urllib2.urlopen(host)
     print url.read(1024)
    
    print "Elapsed Time: %s" % (time.time() - start)

When you run the above example, you'll get a lot of output in standard output. But you'll end up with the following:


    Elapsed Time: 2.40353488922

Let's take a closer look at this code. Only two modules are imported. The urllib2 module does the heavy lifting of grabbing the Web pages. Calling time.time() records a start time value; calling it again and subtracting the start value determines how long the program took to execute. Looking at the execution speed, about 2.4 seconds is not too bad, but if you had hundreds of Web pages to retrieve, it would take around 50 seconds at this average. Let's look at how to create a threaded version that speeds things up:
Threaded URL fetch


     #!/usr/bin/env python
     import Queue
     import threading
     import urllib2
     import time

     hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
     "http://ibm.com", "http://apple.com"]

     queue = Queue.Queue()

     class ThreadUrl(threading.Thread):
      """Threaded Url Grab"""
      def __init__(self, queue):
       threading.Thread.__init__(self)
       self.queue = queue

      def run(self):
       while True:
        #grabs host from queue
        host = self.queue.get()

        #grabs urls of hosts and prints first 1024 bytes of page
        url = urllib2.urlopen(host)
        print url.read(1024)

        #signals to queue job is done
        self.queue.task_done()

     start = time.time()
     def main():

      #spawn a pool of threads, and pass them the queue instance
      for i in range(5):
       t = ThreadUrl(queue)
       t.setDaemon(True)
       t.start()

      #populate queue with data
      for host in hosts:
       queue.put(host)

      #wait on the queue until everything has been processed
      queue.join()

     main()
     print "Elapsed Time: %s" % (time.time() - start)

There is more code to explain in this example, but it isn't much more complicated than the first threading example, precisely because it uses the Queue module. This pattern is a very common and recommended way to use threads in Python. The steps are:

  1. Create an instance of Queue.Queue() and populate it with data.
  2. Pass the populated queue instance to a thread class that you created by inheriting from threading.Thread.
  3. Spawn a pool of daemon threads.
  4. Pull one item at a time off the queue and process it inside the thread, using the data and the run method.
  5. When the work is done, send a signal to the queue with queue.task_done() to indicate that the task has been completed.
  6. Join on the queue, which in effect means waiting to exit the main program until the queue is empty.

One thing to note when using this pattern: by setting the threads to daemonic, you allow the main program to exit when the only threads left running are daemon threads. This creates a simple way to control the flow of the program, because you can join on the queue before exiting, that is, wait until the queue is empty. The Queue module documentation describes the exact process; see the resources section:

      join()
      Blocks until all items in the queue have been gotten and processed. The count of unfinished tasks goes up whenever an item is added to the queue. The count goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, join() unblocks.
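A minimal Python 3 sketch of those semantics (note that Python 2's Queue module was renamed queue in Python 3):

```python
import queue        # Python 2's "Queue" module was renamed "queue" in Python 3
import threading

q = queue.Queue()
processed = []

def consumer():
    while True:
        item = q.get()        # the unfinished-task count rose when put() was called
        processed.append(item)
        q.task_done()         # lower the unfinished-task count by one

t = threading.Thread(target=consumer)
t.daemon = True               # daemon thread: will not block interpreter exit
t.start()

for i in range(10):
    q.put(i)

q.join()                      # unblocks only after task_done() has run 10 times
print(len(processed))         # -> 10
```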

Using multiple queues

Because the pattern described above is so effective, it is fairly simple to extend it by chaining on additional thread pools and queues. In the example above, you only printed the beginning of each Web page. The next example instead has each thread return the full Web page it fetched and place it into a second queue. A second pool of threads then pulls from that queue and processes each Web page. The work in this example involves parsing the page with a third-party Python module called Beautiful Soup. With that module, it takes only two lines of code to extract the title tag of each page visited and print it out.
Multi-queue data mining of websites


import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
    "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
  """Threaded Url Grab"""
  def __init__(self, queue, out_queue):
    threading.Thread.__init__(self)
    self.queue = queue
    self.out_queue = out_queue

  def run(self):
    while True:
      #grabs host from queue
      host = self.queue.get()

      #grabs urls of hosts and then grabs chunk of webpage
      url = urllib2.urlopen(host)
      chunk = url.read()

      #place chunk into out queue
      self.out_queue.put(chunk)

      #signals to queue job is done
      self.queue.task_done()

class DatamineThread(threading.Thread):
  """Threaded Url Grab"""
  def __init__(self, out_queue):
    threading.Thread.__init__(self)
    self.out_queue = out_queue

  def run(self):
    while True:
      #grabs chunk from out_queue
      chunk = self.out_queue.get()

      #parse the chunk
      soup = BeautifulSoup(chunk)
      print soup.findAll(['title'])

      #signals to queue job is done
      self.out_queue.task_done()

start = time.time()
def main():

  #spawn a pool of threads, and pass them queue instance
  for i in range(5):
    t = ThreadUrl(queue, out_queue)
    t.setDaemon(True)
    t.start()

  #populate queue with data
  for host in hosts:
    queue.put(host)

  for i in range(5):
    dt = DatamineThread(out_queue)
    dt.setDaemon(True)
    dt.start()


  #wait on the queue until everything has been processed
  queue.join()
  out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

If you run this version of the script, you get the following output:


 # python url_fetch_threaded_part2.py 

 [<title>Google</title>]
 [<title>Yahoo!</title>]
 [<title>Apple</title>]
 [<title>IBM United States</title>]
 [<title>Amazon.com: Online Shopping for Electronics, Apparel,
 Computers, Books, DVDs & more</title>]
 Elapsed Time: 3.75387597084

As you can see from analyzing this code, we add another queue instance and pass it to the first thread pool class, ThreadUrl. Next, for the other thread pool class, DatamineThread, we copy almost exactly the same structure. In that class's run method, each thread grabs a Web page (a chunk of text) from the queue and processes it with Beautiful Soup. In this case, Beautiful Soup is used to extract the title tag of each page and print it. This example could easily be pushed toward more valuable scenarios, since it contains the core of a basic search engine or data-mining tool. One idea is to use Beautiful Soup to extract the links from each page and then follow them.
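As a sketch of that idea using only the standard library (html.parser stands in for Beautiful Soup here, and the input HTML is made up for illustration):

```python
from html.parser import HTMLParser   # stdlib stand-in for Beautiful Soup

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="http://ibm.com">IBM</a> '
            '<a href="http://apple.com">Apple</a></body></html>')
print(parser.links)   # -> ['http://ibm.com', 'http://apple.com']
```

Feeding each fetched chunk to an extractor like this, and putting the discovered links back on the input queue, is the seed of a simple crawler.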

Conclusion

This article examined threading in Python and demonstrated the best practice of using queues to reduce complexity, avoid subtle errors, and improve readability. Although this basic pattern is relatively simple, it can be used to solve a wide range of problems by chaining queues and thread pools together. In the final section, you began to explore how to create a more complex processing pipeline that can serve as a model for future projects. The resources section contains many excellent references on general concurrency and threading.

Finally, it is important to note that threads do not solve every problem, and for many situations processes may be more appropriate. In particular, when you only need to spawn many child processes and listen for responses, the standard library subprocess module may be easier to use. See the resources section for more official documentation.
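For illustration, a minimal sketch of the modern subprocess API (subprocess.run, added after this article was written; requires Python 3.7+ for capture_output) launching a child process and reading its response:

```python
import subprocess
import sys

# Launch a child Python process and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True, text=True,
)
print(result.returncode)        # -> 0
print(result.stdout.strip())    # -> hello from a child process
```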

