20 ways to make your Python fly!

  • 2020-05-12 02:51:18
  • OfStack

Today's article is short on words and long on code: no filler, just 20 techniques for improving the performance of Python programs and saying goodbye to slow Python. The original author is a full-stack programmer working in Python, Java, PHP and C++.

1. Optimize the time complexity of the algorithm

The time complexity of an algorithm has the greatest impact on a program's execution efficiency. In Python, complexity can often be reduced by choosing the right data structure: for example, looking up an element in a list is O(n), while in a set it is O(1). Different scenarios call for different optimizations; in general, the main ideas are divide and conquer, branch and bound, greedy algorithms, and dynamic programming.
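As a small illustration of the dynamic-programming idea (a made-up example, not from the original article): memoizing a naive recursive Fibonacci drops its complexity from exponential to linear.

```python
from functools import lru_cache

# Naive recursive Fibonacci is O(2**n); caching subproblem results
# (the core idea of dynamic programming) makes it O(n).
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # answers instantly; the naive version would take hours
```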

2. Reduce redundant data

Store a large symmetric matrix using only its upper or lower triangle. Use a sparse representation for matrices in which most elements are zero.
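A hedged sketch of the triangular-storage idea (the class name and layout here are my own, not from the article): an n×n symmetric matrix can be packed into n(n+1)/2 slots instead of n².

```python
# Store only the upper triangle of an n x n symmetric matrix in a flat
# list: n*(n+1)//2 entries instead of n*n.
class SymmetricMatrix:
    def __init__(self, n):
        self.n = n
        self.data = [0] * (n * (n + 1) // 2)

    def _index(self, i, j):
        # Normalize so that i <= j, then index into the packed triangle:
        # rows 0..i-1 contribute i*n - i*(i-1)//2 entries, plus offset j-i.
        if i > j:
            i, j = j, i
        return i * self.n - i * (i - 1) // 2 + (j - i)

    def __setitem__(self, key, value):
        i, j = key
        self.data[self._index(i, j)] = value

    def __getitem__(self, key):
        i, j = key
        return self.data[self._index(i, j)]
```

Because `_index` normalizes the pair, `m[i, j]` and `m[j, i]` address the same slot, which is exactly the symmetry being exploited.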

3. Use copy and deepcopy wisely

For data structures such as dict and list, plain assignment only copies a reference. To copy an entire object, use copy and deepcopy from the copy module; the difference is that deepcopy copies recursively. Their efficiency also differs (the following snippets run in IPython):


import copy
a = range(100000)
%timeit -n 10 copy.copy(a) # run copy.copy(a) 10 times

%timeit -n 10 copy.deepcopy(a)
10 loops, best of 3: 1.55 ms per loop
10 loops, best of 3: 151 ms per loop

The -n after timeit specifies the number of runs, and the last two lines are the output of the two timeit calls (same below). So deepcopy is about two orders of magnitude slower here.
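To make the shallow/deep distinction concrete, a minimal illustrative example:

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)   # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)  # inner lists are copied recursively as well

a[0][0] = 99
print(shallow[0][0])  # 99: the shallow copy sees the mutation
print(deep[0][0])     # 1: the deep copy does not
```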

4. Find elements using dict or set

Python's dict and set are both implemented with hash tables (similar to unordered_map in the C++11 standard library), so looking up an element is O(1).


a = range(1000)
s = set(a)
d = dict((i,1) for i in a)
%timeit -n 10000 100 in d
%timeit -n 10000 100 in s
10000 loops, best of 3: 43.5 ns per loop
10000 loops, best of 3: 49.6 ns per loop

dict is slightly faster here (but uses more memory).

5. Use generators (generator) and yield wisely


%timeit -n 100 a = (i for i in range(100000))
%timeit -n 100 b = [i for i in range(100000)]
100 loops, best of 3: 1.54 ms per loop
100 loops, best of 3: 4.56 ms per loop

Using () produces a generator object whose memory footprint is independent of the size of the list, so it is more memory-efficient. In practice, for example, set(i for i in range(100000)) is faster than set([i for i in range(100000)]).

However, when looping over the entire sequence:


%timeit -n 10 for x in (i for i in range(100000)): pass
%timeit -n 10 for x in [i for i in range(100000)]: pass

10 loops, best of 3: 6.51 ms per loop
10 loops, best of 3: 5.54 ms per loop

the latter (the list) is slightly faster; but if the loop may break early, the generator's advantage is clear. yield can also be used to create a generator:


def yield_func(ls):
  for i in ls:
    yield i+1

def not_yield_func(ls):
  return [i+1 for i in ls]

ls = range(1000000)
%timeit -n 10 for i in yield_func(ls):pass

%timeit -n 10 for i in not_yield_func(ls):pass

10 loops, best of 3: 63.8 ms per loop
10 loops, best of 3: 62.9 ms per loop

For lists that fit comfortably in memory, returning a list directly is fine, but yield often reads better (a matter of personal preference).
Python 2.x also has built-in generator-style facilities such as the xrange function and the itertools package.
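A small sketch of the early-break advantage mentioned above (the function name and numbers are invented for illustration): with a generator, breaking out early means the remaining elements are never produced, whereas a list comprehension would have materialized all of them first.

```python
def first_square_over(threshold, items):
    # The generator expression yields squares lazily, one at a time.
    for sq in (i * i for i in items):
        if sq > threshold:
            return sq
    return None

# Only 12 squares are ever computed, even though the range has a million items.
print(first_square_over(100, range(1000000)))
```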

6. Optimize the loop

Don't do work inside a loop that can be done outside it. For example, the following optimization is about twice as fast:


a = range(10000)
size_a = len(a)
%timeit -n 1000 for i in a: k = len(a)
%timeit -n 1000 for i in a: k = size_a
1000 loops, best of 3: 569 µs per loop
1000 loops, best of 3: 256 µs per loop

7. Optimize the order of multiple judgment expressions

For and, put the condition that is satisfied least often first; for or, put the condition that is satisfied most often first. For example:


a = range(2000) 
%timeit -n 100 [i for i in a if 10 < i < 20 or 1000 < i < 2000]
%timeit -n 100 [i for i in a if 1000 < i < 2000 or 10 < i < 20]   
%timeit -n 100 [i for i in a if i % 2 == 0 and i > 1900]
%timeit -n 100 [i for i in a if i > 1900 and i % 2 == 0]
100 loops, best of 3: 287 µs per loop
100 loops, best of 3: 214 µs per loop
100 loops, best of 3: 128 µs per loop
100 loops, best of 3: 56.1 µs per loop

8. Merge the strings in the iterator using join


In [1]: %%timeit
   ...: s = ''
   ...: for i in a:
   ...:     s += i
   ...:
10000 loops, best of 3: 59.8 µs per loop

In [2]: %%timeit
   ...: s = ''.join(a)
   ...:
100000 loops, best of 3: 11.8 µs per loop

join is roughly 5 times faster than accumulating with += (here a is a list of strings).

9. Select the appropriate formatting method


s1, s2 = 'ax', 'bx'

%timeit -n 100000 'abc%s%s' % (s1, s2)
%timeit -n 100000 'abc{0}{1}'.format(s1, s2)
%timeit -n 100000 'abc' + s1 + s2
100000 loops, best of 3: 183 ns per loop
100000 loops, best of 3: 169 ns per loop
100000 loops, best of 3: 103 ns per loop

Of the three, % is the slowest, but the differences are small (all are very fast). (Personally I find % the most readable.)

10. Swap the values of two variables without an intermediate variable


In [3]: %%timeit -n 10000
   ...: a, b = 1, 2
   ...: c = a; a = b; b = c
   ...:
10000 loops, best of 3: 172 ns per loop

In [4]: %%timeit -n 10000
   ...: a, b = 1, 2
   ...: a, b = b, a
   ...:
10000 loops, best of 3: 86 ns per loop

Using a, b = b, a instead of c = a; a = b; b = c to swap the values of a and b is about twice as fast.

11. Use if is instead of if ==


a = range(10000)
%timeit -n 100 [i for i in a if i == True]
%timeit -n 100 [i for i in a if i is True]

Using if is True is nearly twice as fast as if == True.

12. Use cascaded comparisons: x < y < z


x, y, z = 1, 2, 3
%timeit -n 1000000 if x < y < z: pass
%timeit -n 1000000 if x < y and y < z: pass

x < y < z is slightly more efficient and more readable.

13. while 1 is faster than while True


def while_1():
  n = 100000
  while 1:
    n -= 1
    if n <= 0: break

def while_true():
  n = 100000
  while True:
    n -= 1
    if n <= 0: break

%timeit -n 100 while_1()
%timeit -n 100 while_true()

while 1 is noticeably faster than while True, because in Python 2.x True is a global variable rather than a keyword, so it must be looked up on every iteration.

14. Use ** instead of pow


%timeit -n 10000 c = pow(2,20)
%timeit -n 10000 c = 2**20

10000 loops, best of 3: 284 ns per loop
10000 loops, best of 3: 16.9 ns per loop

** is more than 10 times faster!

15. Use cProfile, cStringIO and cPickle, the C implementations of the same functionality (corresponding to profile, StringIO and pickle respectively)


import cPickle
import pickle
a = range(10000)
%timeit -n 100 x = cPickle.dumps(a)
%timeit -n 100 x = pickle.dumps(a)

The C-implemented packages are more than 10 times faster!

16. Use the best way to deserialize

The following compares the efficiency of eval, cPickle and json when deserializing strings produced from the same object:


import json
import cPickle
a = range(10000)
s1 = str(a)
s2 = cPickle.dumps(a)
s3 = json.dumps(a)
%timeit -n 100 x = eval(s1)
%timeit -n 100 x = cPickle.loads(s2)
%timeit -n 100 x = json.loads(s3)

json turns out to be nearly 3 times faster than cPickle, and more than 20 times faster than eval.

17. Use the C extension (Extension)

For CPython (the most common Python implementation), there are four main approaches: the native CPython API, ctypes, Cython, and cffi. All of them allow a Python program to call dynamic link libraries compiled from C. Their characteristics are as follows:

CPython native API: by including the Python.h header, C code can manipulate Python data structures directly. The process is relatively tedious, but the approach is the most broadly applicable.
ctypes: typically used to wrap C programs, letting pure Python code call functions in a dynamic link library (a dll on Windows, an so on Unix). If you want to use an existing C library from Python, ctypes is a good choice; in some benchmarks, Python 2 + ctypes is the best-performing combination.
Cython: a superset of Python designed to simplify writing C extensions. Its advantages are concise syntax and good compatibility with libraries such as numpy that contain many C extensions. Cython is generally used to optimize a specific algorithm or processing module within a project; in some tests it can improve performance by hundreds of times.
cffi: the ctypes counterpart in pypy (see below), also compatible with CPython. cffi provides a way to use C libraries from Python: C declarations can be written directly in the Python code, and existing C libraries can be linked against.
These optimizations generally target the performance-bottleneck modules of an existing project; with only small changes to the original code, they can significantly improve the running efficiency of the whole program.
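A minimal hedged sketch of the ctypes approach (assuming a Unix-like system where the C math library can be located; the fallback name assumes a typical Linux setup):

```python
import ctypes
import ctypes.util

# Load the C math library (Unix-like systems; the name differs on Windows).
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature of sqrt: double sqrt(double).
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(9.0))  # the call goes straight into compiled C code
```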

18. Parallel programming

Because of the GIL, it is difficult for Python to take full advantage of multi-core CPUs. However, the built-in multiprocessing module supports the following parallel modes:

Multi-process: for CPU-bound programs, the multiprocessing classes Process and Pool can be used for multi-process parallel computation. However, because inter-process communication is costly, programs that need a lot of data exchange between processes may not see a large improvement.
Multithreading: for IO-bound programs, the multiprocessing.dummy module wraps threading with the multiprocessing interface, making multithreaded programming very easy (for example, the Pool.map interface can be used, which is concise and efficient).
Distributed: the Managers class in multiprocessing provides a way to share data between different processes, on top of which distributed programs can be developed.
Different business scenarios can choose one of these modes, or a combination, to optimize program performance.

19. Final kill: PyPy

PyPy is Python implemented in RPython (a subset of Python); according to benchmarks on its official site, it runs more than 6 times faster than CPython. The reason it is fast is its Just-in-Time (JIT) compiler, a dynamic compiler: unlike static compilers (gcc, javac, etc.), it optimizes using data gathered while the program runs. For historical reasons, PyPy still has a GIL, though the ongoing STM project is attempting to turn PyPy into a Python without a GIL.

If a Python program contains C extensions (other than via cffi), the JIT's optimizations are much less effective, and the program may even run slower than under CPython (NumPy, for example). So in PyPy it is best to use pure Python or the cffi extension.

As STM, NumPy support and other projects mature, the author believes PyPy will eventually replace CPython.

20. Use performance analysis tools

In addition to the timeit module used in IPython above, there is cProfile. Its usage is simple: python -m cProfile filename.py, where filename.py is the program to run. The standard output shows how many times each function is called and how long it runs, which reveals the program's performance bottlenecks so they can be optimized accordingly.
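The same profiler can also be driven from inside a program; a minimal sketch (the profiled function is invented for illustration):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# Print the top entries sorted by cumulative time to spot bottlenecks.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```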

