Python GPU programming example: explanation of nearest neighbor table calculation

  • 2021-11-24 02:20:16
  • OfStack

Contents

  • Technical background
  • Acceleration scenario
  • GPU Acceleration Based on Numba
  • Summary

Technical background

GPU acceleration is widely used across modern industry, and its benefit comes from the highly parallel nature of GPU computing. In Python there are several GPU parallelization options, including cupy, pycuda and the numba.cuda module mentioned in earlier posts, all of them well-known Python libraries for GPU acceleration. Here we focus on numba.cuda: cupy's strength is the large number of functions it already implements, but it offers relatively little flexibility for writing custom algorithms, while pycuda offers good flexibility and high performance but requires embedding C code inside the Python code, which is far from Pythonic. With numba.cuda, simply adding the numba.cuda.jit decorator in front of a Python function lets us keep almost pure Python syntax and still obtain GPU acceleration.
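
As a minimal illustration of this decorator-based workflow (a sketch that is not part of the original article, assuming numba and a CUDA-capable GPU are installed), a kernel that simply doubles an array looks like this:


# minimal_cuda_jit.py

from numba import cuda
import numpy as np

@cuda.jit
def double_kernel(arr):
    # cuda.grid(1) is this thread's global index in a 1D launch grid
    i = cuda.grid(1)
    if i < arr.size:  # guard against threads beyond the array length
        arr[i] *= 2.0

if __name__ == '__main__':
    data = np.arange(16).astype(np.float32)
    data_cuda = cuda.to_device(data)   # copy the array to GPU memory
    double_kernel[1, 32](data_cuda)    # launch 1 block of 32 threads
    print (data_cuda.copy_to_host())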

Acceleration scenario

First we need to know what kinds of computations a GPU can actually accelerate. Obviously, not every computation benefits from running on a GPU. As mentioned above, GPU acceleration comes from a high degree of parallelism, and parallelism requires that the individual work items neither interfere with nor depend on each other. If the calculation or result of one work item depends on the result of another, full parallelization is impossible and the computation must remain, at least partly, serial. To demonstrate the GPU speedup, we introduce a common problem from the field of molecular dynamics simulation: building the nearest neighbor table.
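
A toy illustration of this distinction (not from the original article, just plain NumPy): squaring every element is embarrassingly parallel, while a running sum is inherently serial because each step needs the previous result.


# parallel_vs_serial.py

import numpy as np

x = np.random.random(8)

# Independent: every output element depends only on its own input element,
# so all of them could be computed at the same time.
squares = x ** 2

# Dependent: each partial sum needs the previous one,
# so this loop cannot be fully parallelized as written.
prefix = np.zeros_like(x)
running = 0.0
for k in range(x.size):
    running += x[k]
    prefix[k] = running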

The nearest neighbor table problem is defined as follows: given a system of n atoms with known three-dimensional coordinates and a cutoff distance d0, two atoms i and j are considered neighbors when their distance satisfies d(i,j) <= d0. The task is to produce a 0-1 matrix A of size n x n, where A(i,j) = 0 means that atoms i and j are not neighbors and A(i,j) = 1 means that they are. For this problem we can traverse the n x n pair space completely in parallel and directly write out the n x n neighbor table, so it is a computation that is very well suited to GPU acceleration. Let's first look at a conventional implementation without GPU acceleration:


# cuda_neighbor_list.py

from numba import jit
from numba import cuda
import numpy as np

@jit
def neighbor_list(crd, neighbors, data_length, cutoff):
    """CPU based neighbor list calculation.
    """
    for i in range(data_length):
        for j in range(i+1, data_length):
            if np.linalg.norm(crd[i]-crd[j]) <= cutoff:
                neighbors[i][j] = 1
                neighbors[j][i] = 1
    return neighbors

if __name__ == '__main__':
    np.random.seed(1)
    atoms = 2**2
    cutoff = 0.5
    crd = np.random.random((atoms,3))
    adjacent = np.zeros((atoms, atoms))
    adjacent = neighbor_list(crd, adjacent, atoms, cutoff)
    print (adjacent)

This is the most conventional CPU implementation: traverse all atom pairs, compute the inter-atomic distance, and fill in the neighbor table. Here we also use numba.jit, which just-in-time compiles the function when it is first executed and can take advantage of the SIMD vectorization provided by the chip vendor. Of course, this is still a CPU-level implementation and optimization, and it runs as follows:

$ python3 cuda_neighbor_list.py
[[0. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 1.]
 [0. 0. 1. 0.]]

The result of this output is a 0-1 nearest neighbor table.
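
As an optional sanity check (not part of the original code, assuming scipy is installed), the same 0-1 table can be reproduced with SciPy's pairwise distance function:


# check_with_scipy.py

import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(1)
atoms = 2**2
cutoff = 0.5
crd = np.random.random((atoms, 3))

dist = cdist(crd, crd)                      # full pairwise distance matrix
adjacent = (dist <= cutoff).astype(float)   # 1 where atoms are within the cutoff
np.fill_diagonal(adjacent, 0.0)             # an atom is not its own neighbor
print (adjacent)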

GPU Acceleration Based on Numba

For the neighbor table calculation described above, it is easy to see that the neighbor_list function can be rewritten as a GPU function. For every pair distance d(i,j) we can launch one thread to do the calculation, which is similar to SIMD on the CPU; on the GPU this style of optimization is called SIMT. Turning the Python function into a GPU kernel is also straightforward: essentially we change the decorator in front of the function and remove the for loops inside it, as in the following transformed version of the neighbor table calculation:


# cuda_neighbor_list.py

from numba import jit
from numba import cuda
import numpy as np

@jit
def neighbor_list(crd, neighbors, data_length, cutoff):
    """CPU based neighbor list calculation.
    """
    for i in range(data_length):
        for j in range(i+1, data_length):
            if np.linalg.norm(crd[i]-crd[j]) <= cutoff:
                neighbors[i][j] = 1
                neighbors[j][i] = 1
    return neighbors

@cuda.jit
def cuda_neighbor_list(crd, neighbors, cutoff):
    """GPU based neighbor list calculation.
    """
    i, j = cuda.grid(2)
    dis = ((crd[i][0]-crd[j][0])**2+\
           (crd[i][1]-crd[j][1])**2+\
           (crd[i][2]-crd[j][2])**2)**0.5
    neighbors[i][j] = dis <= cutoff[0] and dis > 0

if __name__ == '__main__':
    import time
    np.random.seed(1)

    atoms = 2**5
    cutoff = 0.5
    cutoff_cuda = cuda.to_device(np.array([cutoff]).astype(np.float32))
    crd = np.random.random((atoms,3)).astype(np.float32)
    crd_cuda = cuda.to_device(crd)
    adjacent = np.zeros((atoms, atoms)).astype(np.float32)
    adjacent_cuda = cuda.to_device(adjacent)

    time0 = time.time()
    adjacent_c = neighbor_list(crd, adjacent, atoms, cutoff)
    time1 = time.time()
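    # Launch an (atoms, atoms) grid of blocks with a single thread each,
    # so that thread (i, j) handles exactly one pair of atoms.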
    cuda_neighbor_list[(atoms, atoms), (1, 1)](crd_cuda,
                                               adjacent_cuda,
                                               cutoff_cuda)
    time2 = time.time()
    adjacent_g = adjacent_cuda.copy_to_host()
    print ('The time cost of CPU with numba.jit is: {}s'.format(\
                                            time1-time0))
    print ('The time cost of GPU with cuda.jit is: {}s'.format(\
                                            time2-time1))
    print ('The result error is: {}'.format(np.sum(adjacent_c-\
                                            adjacent_g)))

It should be noted that Numba does not yet support all of numpy's functions inside CUDA kernels, so some calculations have to be implemented by hand, such as the norm used for the inter-atomic distance here. In the output we report not only the correctness of the result but also the running time:

$ python3 cuda_neighbor_list.py
The time cost of CPU with numba.jit is: 0.6401469707489014s
The time cost of GPU with cuda.jit is: 0.19208502769470215s
The result error is: 0.0
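
As a side note on the manual distance calculation (a sketch that is not part of the original code; the helper name euclid3 is made up for illustration), the Euclidean distance can also be factored into a cuda.jit device function and reused by several kernels:


from numba import cuda

@cuda.jit(device=True)
def euclid3(a, b):
    """Euclidean distance between two 3D points, callable from CUDA kernels."""
    return ((a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2)**0.5

@cuda.jit
def cuda_neighbor_list_dev(crd, neighbors, cutoff):
    i, j = cuda.grid(2)
    if i < crd.shape[0] and j < crd.shape[0]:
        dis = euclid3(crd[i], crd[j])
        neighbors[i][j] = dis <= cutoff[0] and dis > 0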

It should be noted that the program was run only once here. On the first run the benefit of just-in-time compilation is not obvious, and it may even be slower because of the compilation itself, but it pays off in subsequent calls of the function. The timing above is therefore not very representative; a more representative comparison is shown in the following case:


# cuda_neighbor_list.py

from numba import jit
from numba import cuda
import numpy as np

@jit
def neighbor_list(crd, neighbors, data_length, cutoff):
    """CPU based neighbor list calculation.
    """
    for i in range(data_length):
        for j in range(i+1, data_length):
            if np.linalg.norm(crd[i]-crd[j]) <= cutoff:
                neighbors[i][j] = 1
                neighbors[j][i] = 1
    return neighbors

@cuda.jit
def cuda_neighbor_list(crd, neighbors, cutoff):
    """GPU based neighbor list calculation.
    """
    i, j = cuda.grid(2)
    dis = ((crd[i][0]-crd[j][0])**2+\
           (crd[i][1]-crd[j][1])**2+\
           (crd[i][2]-crd[j][2])**2)**0.5
    neighbors[i][j] = dis <= cutoff[0] and dis > 0

if __name__ == '__main__':
    import time
    np.random.seed(1)

    atoms = 2**10
    cutoff = 0.5
    cutoff_cuda = cuda.to_device(np.array([cutoff]).astype(np.float32))
    crd = np.random.random((atoms,3)).astype(np.float32)
    crd_cuda = cuda.to_device(crd)
    adjacent = np.zeros((atoms, atoms)).astype(np.float32)
    adjacent_cuda = cuda.to_device(adjacent)
    time_c = 0.0
    time_g = 0.0

    for _ in range(100):
        time0 = time.time()
        adjacent_c = neighbor_list(crd, adjacent, atoms, cutoff)
        time1 = time.time()
        cuda_neighbor_list[(atoms, atoms), (1, 1)](crd_cuda,
                                                adjacent_cuda,
                                                cutoff_cuda)
        time2 = time.time()
        if _ != 0:
            time_c += time1 - time0
            time_g += time2 - time1
    
    print ('The total time cost of CPU with numba.jit is: {}s'.format(\
                                            time_c))
    print ('The total time cost of GPU with cuda.jit is: {}s'.format(\
                                            time_g))

Nothing in the code has changed except that a single timed run is replaced by many runs, with the just-in-time compilation of the first iteration excluded from the timing. The final output is:

$ python3 cuda_neighbor_list.py
The total time cost of CPU with numba.jit is: 14.955506563186646s
The total time cost of GPU with cuda.jit is: 0.018685102462768555s

It can be seen that with GPU acceleration, the computation runs roughly 800 times faster than the already numba-optimized CPU version!

Summary

For Python programmers, performance has long been a sore point, so being able to achieve GPU acceleration in a very Pythonic way is great news, and Numba provides exactly that kind of basic capability. Through the nearest neighbor table example, this article presents a computing scenario that is well suited to GPU acceleration: it can be parallelized to a very high degree, and the function is called many times (in a molecular dynamics simulation it is invoked at every step), which makes it a textbook case for GPU acceleration.

This concludes the detailed explanation of the Python GPU programming example for nearest neighbor table calculation. For more on Python GPU programming, see the other related articles on this site.
