Sunday, January 28, 2018

Threading in Python

http://www.linuxjournal.com/content/threading-python

Threads can provide concurrency, even if they're not truly parallel.
In my last article, I took a short tour through the ways you can add concurrency to your programs. In this article, I focus on one of those forms that has a reputation for being particularly frustrating for many developers: threading. I explore the ways you can use threads in Python and the limitations the language puts upon you when doing so.
The basic idea behind threading is a simple one: just as the computer can run more than one process at a time, so too can your process run more than one thread at a time. When you want your program to do something in the background, you can launch a new thread. The main thread continues to run in the foreground, allowing the program to do two (or more) things at once.
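As a minimal sketch of that idea (the function name background_task and the 0.1-second sleep are made up for illustration, with the sleep standing in for real work), the following launches one background thread while the main thread carries on:

```python
import threading
import time

def background_task():
    # Runs in a second thread of the same process
    time.sleep(0.1)   # stand-in for slow work
    print("background: done in", threading.current_thread().name)

t = threading.Thread(target=background_task)
t.start()             # returns immediately; the new thread runs on its own

print("main: still running in", threading.current_thread().name)

t.join()              # wait for the background thread to finish
print("main: background thread finished")
```

The "main: still running" line normally appears first, because start() doesn't wait for the new thread to do its work.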
What's the difference between launching a new process and a new thread? A new process is completely independent of your existing process, giving you more stability (in that the processes cannot affect or corrupt one another) but also less flexibility (in that data cannot easily flow from one process to another). Because multiple threads within a process share data, they can work with one another more closely and easily.
For example, let's say you want to retrieve all of the data from a variety of websites. My preferred Python package for retrieving data from the web is the "requests" package, available from PyPI. Thus, I can use a for loop, as follows:

length = {}

for one_url in urls:
    response = requests.get(one_url)
    length[one_url] = len(response.content)

for key, value in length.items():
    print("{0:30}: {1:8,}".format(key, value))

How does this program work? It goes through a list of URLs (as strings), one by one, calculating the length of the content and then storing that length inside a dictionary called length. The keys in length are URLs, and the values are the lengths of the requested URL content.
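To see what the format string in the final loop produces without touching the network, here's a small sketch that fills the dictionary with made-up values (the URLs and numbers are placeholders, not real measurements):

```python
# Made-up stand-ins for the real URL -> content-length results
length = {"http://example.com": 1256,
          "http://example.org": 48213}

for key, value in length.items():
    # {0:30} pads the URL to 30 columns; {1:8,} right-aligns the
    # number in 8 columns and adds thousands separators
    print("{0:30}: {1:8,}".format(key, value))

first_line = "{0:30}: {1:8,}".format("http://example.com", 1256)
```

Padding the URL to a fixed width keeps the numbers lined up in a column, which makes the output easier to scan.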
So far, so good; I've turned this into a complete program (retrieve1.py), which is shown in Listing 1. I put nine URLs into a text file called urls.txt (Listing 2), and then timed how long it took to retrieve all of them. On my computer, the total time was about 15 seconds, although there was clearly some variation in the timing.
Listing 1. retrieve1.py

#!/usr/bin/env python3

import requests
import time

with open('urls.txt') as url_file:
    urls = [one_line.strip()
            for one_line in url_file
            if one_line.strip()]

length = {}

start_time = time.time()

for one_url in urls:
    response = requests.get(one_url)
    length[one_url] = len(response.content)

for key, value in length.items():
    print("{0:30}: {1:8,}".format(key, value))

end_time = time.time()

total_time = end_time - start_time

print("\nTotal time: {0:.3} seconds".format(total_time))

Listing 2. urls.txt

http://lerner.co.il
http://LinuxJournal.com
http://en.wikipedia.org
http://news.ycombinator.com
http://NYTimes.com
http://Facebook.com
http://WashingtonPost.com
http://Haaretz.co.il
http://thetech.com

Improving the Timing with Threads

How can I improve the timing? Well, Python provides threading. Many people think of Python's threads as fatally flawed, because in the standard CPython interpreter only one thread can execute Python code at a time, thanks to the GIL (global interpreter lock). This is a real limitation if your program performs serious calculations and you really want the system to use multiple CPUs in parallel.
However, I have a different sort of use case here. I'm interested in retrieving data from different websites. I/O can take a long time, so whenever a Python thread blocks on I/O (that is, the screen, disk or network), it releases the GIL, allowing a different thread to run while it waits.
In the case of my "retrieve" program, this is perfect. I can spawn a separate thread to retrieve each of the URLs in the list. I then can wait for the URLs to be retrieved concurrently, checking in with each of the threads one at a time. In this way, I probably can save time.
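A quick way to see this effect without a network connection is to let time.sleep() stand in for a slow download, since a sleeping thread, like one waiting on a socket, doesn't hold the GIL. The following is only a sketch; fake_download and the 0.2-second delay are invented for the illustration:

```python
import threading
import time

def fake_download(seconds):
    # time.sleep() blocks without holding the GIL, much like
    # a thread waiting for a response from a web server
    time.sleep(seconds)

start = time.time()

threads = [threading.Thread(target=fake_download, args=(0.2,))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()      # wait for each thread to finish

elapsed = time.time() - start
print("5 waits of 0.2s took {0:.2f} seconds".format(elapsed))
```

Run one after another, the five waits would take a full second; with threads, the total should be close to the length of a single wait.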
Let's start with the core of my rewritten program. I'll want to implement the retrieval as a function, and then invoke that function along with one argument—the URL I want to retrieve. I then can invoke that function by creating a new instance of threading.Thread, telling the new instance not only which function I want to run in a new thread, but also which argument(s) I want to pass. This is how that code will look:

for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    t.start()

But wait. How will the get_length function communicate the content length to the rest of the program? In a threaded program, it's risky to have individual threads directly modify shared built-in data structures, such as a list or dictionary. Some single operations (such as a list "append") happen to be atomic in CPython, but anything that reads, changes and writes back shared data in more than one step isn't thread-safe, and doing so from several threads might cause all sorts of problems.
However, you can use a "queue" data structure, which is thread-safe and thus gives threads a reliable way to pass data to one another. The function can put its results on the queue, and then, when all of the threads have completed their run, you can read those results from the queue.
Here, then, is how the function might look:

from queue import Queue

queue = Queue()

def get_length(one_url):
    response = requests.get(one_url)
    queue.put((one_url, len(response.content)))

As you can see, the function retrieves the content of one_url and then places the URL itself, as well as the length of the content, in a tuple. That tuple is then placed in the queue.
It's a nice little program. The main thread spawns one new thread per URL, each of which runs get_length. In get_length, the information is put on the queue.
The thing is, the main thread now needs to retrieve the results from the queue. But if you do this just after launching the threads, you run the risk of reading from the queue before the threads have completed. So, you need to "join" the threads, which means to wait until they have finished. Once the threads have all been joined, you can read all of their information from the queue.
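The risk of reading too early can be shown in a small, network-free sketch. The names slow_worker, results and release are hypothetical, and the threading.Event is there only to make the timing reproducible by holding the worker back until the main thread has checked the queue:

```python
import threading
from queue import Queue

results = Queue()
release = threading.Event()   # used only to make the timing reproducible

def slow_worker():
    release.wait()            # pretend the download is still in progress
    results.put("finished")

t = threading.Thread(target=slow_worker)
t.start()

# The worker hasn't produced anything yet, so the queue is empty
empty_before_join = results.empty()

release.set()                 # let the "download" complete
t.join()                      # block until the worker has finished

# After join(), the result is guaranteed to be on the queue
empty_after_join = results.empty()

print(empty_before_join, empty_after_join)
```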
There are a few different ways to join the threads. An easy one is to create a list where you will store the threads and then append each new thread object to that list as you create it:

threads = [ ]

for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    threads.append(t)
    t.start()

You then can iterate over each of the thread objects, joining them:

for one_thread in threads:
    one_thread.join()

Note that when you call one_thread.join() in this way, the call blocks until that thread has finished. Perhaps that's not the most efficient way to do things, but because all of the threads keep running while you wait, in my experiments it took only about one second (roughly 15 times faster) to retrieve all of the URLs.
Python threads are routinely seen as terrible and useless. But in this case, you can see that they allowed me to run the retrievals concurrently without too much trouble, with different threads waiting on the network at the same time.
Listing 3. retrieve2.py

#!/usr/bin/env python3

import requests
import time
import threading
from queue import Queue

with open('urls.txt') as url_file:
    urls = [one_line.strip()
            for one_line in url_file
            if one_line.strip()]

queue = Queue()
start_time = time.time()
threads = [ ]

def get_length(one_url):
    response = requests.get(one_url)
    queue.put((one_url, len(response.content)))

# Launch our function in a thread
print("Launching")
for one_url in urls:
    t = threading.Thread(target=get_length, args=(one_url,))
    threads.append(t)
    t.start()

# Joining all
print("Joining")
for one_thread in threads:
    one_thread.join()

# Retrieving + printing
print("Retrieving + printing")
while not queue.empty():
    one_url, length = queue.get()
    print("{0:30}: {1:8,}".format(one_url, length))

end_time = time.time()

total_time = end_time - start_time

print("\nTotal time: {0:.3} seconds".format(total_time))
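
The article's listings manage the threads and the queue by hand. As an alternative, and not something the original program uses, the standard library's concurrent.futures.ThreadPoolExecutor can start the threads, wait for them and collect the return values. The sketch below is network-free: fake_sizes and its numbers are made up, and this get_length is a stand-in that sleeps instead of calling requests.get:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-in data: URL -> pretend content length
fake_sizes = {
    "http://example.com": 1256,
    "http://example.org": 648,
    "http://example.net": 2048,
}

def get_length(one_url):
    time.sleep(0.1)                       # simulate network latency
    return one_url, fake_sizes[one_url]   # return a tuple instead of queue.put

# Leaving the "with" block waits for all worker threads, much like join()
with ThreadPoolExecutor(max_workers=len(fake_sizes)) as executor:
    # map() yields results in the same order as the input URLs
    results = list(executor.map(get_length, fake_sizes))

for one_url, length in results:
    print("{0:30}: {1:8,}".format(one_url, length))
```

The trade-off is less visibility into the individual threads in exchange for not having to write the launch, join and queue-draining loops yourself.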

Considerations

The good news is that this demonstrates how using threads can be effective when you're doing numerous, time-intensive I/O actions. This is especially good news if you're writing a server in Python that uses threads; you can open up a new thread for each incoming request and/or allocate each new request to an existing, pre-created thread. Again, if the threads don't really need to execute in a truly parallel fashion, you're fine.
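The "existing, pre-created thread" approach might look something like the following sketch. Everything here is hypothetical: the "requests" are just strings, incoming and responses are illustrative queue names, and a real server would read from sockets rather than a hard-coded loop:

```python
import threading
from queue import Queue

incoming = Queue()     # hypothetical incoming requests
responses = Queue()    # results produced by the workers
NUM_WORKERS = 3

def worker():
    while True:
        request = incoming.get()
        if request is None:        # sentinel: time to shut down
            break
        responses.put("handled " + request)

# Create the worker threads once, before any requests arrive
workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for i in range(6):
    incoming.put("request {0}".format(i))

for _ in workers:
    incoming.put(None)             # one sentinel per worker
for w in workers:
    w.join()

handled = []
while not responses.empty():
    handled.append(responses.get())
print(sorted(handled))
```

Because the queue is first-in, first-out, every request is picked up before any worker sees a shutdown sentinel.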
But, what if your system receives a very large number of requests? In such a case, your threads might not be able to keep up. This is particularly true if the code being executed in each thread is CPU-intensive.
In such a case, you don't want to use threads. A popular option—indeed, the popular option—is to use processes. In my next article, I plan to look at how such processes can work and interact.
