Exploring why sorting an array makes a Python loop faster

  • 2020-05-05 11:25:08
  • OfStack

This morning I came across a blog post about two Python scripts, one of which is more efficient than the other. The post has since been deleted, so I can't link to it, but the scripts basically boil down to this:
fast.py
 


import time
a = [i for i in range(1000000)]
sum = 0
t1 = time.time()
for i in a:
  sum = sum + i
t2 = time.time()
print t2-t1

slow.py
 


import time
from random import shuffle
a = [i for i in range(1000000)]
shuffle(a)
sum = 0
t1 = time.time()
for i in a:
  sum = sum + i
t2 = time.time()
print t2-t1

As you can see, the two scripts do exactly the same thing. Both generate a list of the first million integers and print the time it takes to sum them. The only difference is that slow.py shuffles the integers first. Strange as it may seem, this shuffling alone is enough to make the program noticeably slower. On my machine, with Python 2.7.3, fast.py is consistently about a tenth of a second faster than slow.py. Try it yourself. (I didn't test it on Python 3, but the gap shouldn't be very different.)
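
To reproduce the measurement without juggling two separate files, here is a minimal sketch of my own that runs both versions in one script, using the same time.time() clock as above (the helper name time_sum is mine):

import time
from random import shuffle

def time_sum(values):
  # Sum the values with a plain Python loop and return the elapsed time.
  total = 0
  t1 = time.time()
  for i in values:
    total = total + i
  t2 = time.time()
  return t2 - t1

ordered = [i for i in range(1000000)]
shuffled = list(ordered)
shuffle(shuffled)

print("ordered:  %f seconds" % time_sum(ordered))
print("shuffled: %f seconds" % time_sum(shuffled))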

So why does shuffling the list cause such a noticeable slowdown? The original author of the post attributed it to branch prediction. If you're not familiar with the term, have a look at the StackOverflow question on the topic, which explains the concept well. (My suspicion is that the original author ran into that problem, or something like it, and applied the idea to a Python snippet where it doesn't really fit.)

I doubted, however, that branch prediction was the real culprit. There are no top-level conditional branches in this Python code, and both scripts execute exactly the same branches inside the loop: no part of the program's control flow is conditioned on the integers, so the branching does not depend on the data in the lists. I was also skeptical that Python is "low-level" enough for CPU-level branch prediction to show up as a factor in the performance of a Python script; Python is, after all, a high-level language.

So if it isn't branch prediction, why is slow.py so slow? After a bit of research and a couple of false starts, I think I've found the answer. It requires some familiarity with the internals of the Python virtual machine.

False start: lists vs. generators

My first thought was that Python might be able to handle the sorted list [i for i in range(1000000)] more efficiently than the shuffled one. In other words, the sorted list could be replaced with the following generator:


def numbers():
  # Yield the first million integers one at a time, without ever
  # holding them all in memory.
  i = 0
  while i < 1000000:
    yield i
    i += 1

My reasoning was that this might be more time-efficient: if Python used a generator internally instead of a real list, it could avoid the overhead of keeping all of the integers in memory at once. The shuffled list in slow.py, however, cannot easily be captured by a simple generator, so the VM would not be able to perform this kind of optimization.
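
For what it's worth, you can time the explicit generator with the same loop used in fast.py and slow.py and see how it compares on your machine; a small sketch (the outcome will vary by interpreter):

import time

def numbers():
  # Same generator as above: yields 0..999999 one at a time.
  i = 0
  while i < 1000000:
    yield i
    i += 1

total = 0
t1 = time.time()
for i in numbers():
  total = total + i
t2 = time.time()
print(t2 - t1)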

However, this turned out not to be a useful line of inquiry. If you insert a.sort() between the shuffle() call and the loop in slow.py, the program becomes just as fast as fast.py. Clearly, something about the numbers being sorted makes the program faster.
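
As a quick check, that experiment looks like this (call it sorted.py; the file name is mine): shuffle the list exactly as slow.py does, then sort it back into order before the timed loop.

import time
from random import shuffle

a = [i for i in range(1000000)]
shuffle(a)
a.sort()  # undo the shuffle before timing the loop
total = 0
t1 = time.time()
for i in a:
  total = total + i
t2 = time.time()
print(t2 - t1)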

False start: lists vs. arrays

My second thought was that the data structure might be causing cache problems. a is a list, which made me wonder whether lists are actually implemented as linked lists under the hood. If the shuffle operation rearranged the nodes of such a linked list, then fast.py would traverse nodes allocated at adjacent addresses and enjoy good cache locality, while slow.py would suffer lots of cache misses, since each node would point to another node that is unlikely to sit on the same cache line.

Unfortunately, that's not it either. Python's list objects are not linked lists but real arrays. In particular, a Python list object is defined by this C structure:
 


typedef struct {
 PyObject_VAR_HEAD     /* reference count, type pointer, and ob_size (the current length) */
 PyObject **ob_item;   /* pointer to an array of PyObject* */
 Py_ssize_t allocated; /* number of slots allocated for ob_item */
} PyListObject;

In other words, ob_item is a pointer to an array of pointers to PyObjects, and allocated is the number of slots allocated for that array. So this hypothesis doesn't help either (although it did reassure me about a few algorithmic complexities of Python list operations I hadn't been sure of: appending to a list is amortized O(1), accessing an arbitrary list element is O(1), and so on). I'm left wondering why Guido chose to call them "lists" rather than "arrays", when they really are arrays.
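
One way to see this array behaviour from Python itself is to watch the allocated buffer grow: sys.getsizeof reports the PyListObject header plus its pointer array, and in CPython the reported size only jumps when the list over-allocates a larger ob_item buffer. A small sketch (the exact sizes are CPython- and platform-specific):

import sys

a = []
last = sys.getsizeof(a)
print("empty list: %d bytes" % last)
for i in range(32):
  a.append(i)
  size = sys.getsizeof(a)
  if size != last:
    # The size only changes when CPython reallocates ob_item with some
    # spare capacity, i.e. when 'allocated' grows past len(a).
    print("len=%2d -> %d bytes" % (len(a), size))
    last = size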

The solution: integer objects

Array elements are adjacent in memory, so a structure like this should not cause cache problems by itself. It turns out that cache locality is indeed the reason slow.py is slower, but it comes from an unexpected place. In Python, an integer is an object allocated on the heap rather than a simple value. In particular, inside the virtual machine an integer object looks like this:
 


typedef struct {
 PyObject_HEAD  /* reference count and type pointer */
 long ob_ival;  /* the actual value, a plain C long */
} PyIntObject;

The only "interesting" element in the above structure is ob_ival (similar to the integer in C language). If you think using a full heap object to implement integers is wasteful, you're probably right. Many languages are optimized to avoid this. For example, Matz's Ruby interpreter typically stores objects as Pointers, but makes exceptions for frequently used Pointers. Simply put, the Ruby interpreter inserts a fixed-length number into the same space as an object and marks it with the least significant bit as an integer rather than a pointer (in all modern systems, malloc always returns memory addresses aligned by multiples of 2). At that point, you just need the appropriate displacement to get the integer value -- no heap position or redirection required. If CPython did a similar optimization, slow.py and fast.py would have the same speed (and they would all probably be faster).

So what does CPython do with integers, and what interpreter behaviour is confusing us here? The CPython interpreter allocates integer objects in "blocks", roughly forty PyIntObjects at a time. When Python needs a new integer object, it carves out the next available slot in the current block and stores the integer there. Our scripts allocate a million integers, and most consecutive integers therefore end up at adjacent addresses. So iterating over the first million numbers in sorted order exhibits good cache locality, while iterating over the same numbers in shuffled order means frequent cache misses.
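
You can get a rough view of this allocation pattern from Python itself: in CPython, id() returns an object's memory address, so freshly allocated integer objects tend to have nearby, increasing addresses, while a shuffled list visits those same addresses in a scattered order. A small sketch (the exact offsets are CPython- and platform-specific):

from random import shuffle

# Use values above CPython's small-integer cache so the objects are
# freshly allocated rather than shared singletons.
a = [i for i in range(1000, 1020)]
base = id(a[0])

print([id(x) - base for x in a])   # offsets in allocation order

shuffle(a)
print([id(x) - base for x in a])   # same objects, scattered traversal order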

So the answer to "why does sorting the array make the code faster?" is that it doesn't. The unshuffled array is faster to traverse because we access the integer objects in the same order in which they were allocated (and they do have to be allocated somewhere).

