Writing CUDA programs in Python

  • 2020-05-27 06:10:07
  • OfStack

There are two main ways to write a CUDA program in Python:

* Numba
* PyCUDA

NumbaPro is no longer recommended; its functionality has been split and integrated into Accelerate and Numba, respectively.

Examples

Numba

Numba optimizes Python code through just-in-time (JIT) compilation. It can optimize for the machine's hardware, supports both CPU and GPU targets, and integrates with NumPy, so plain Python code can run on the GPU.

As follows:


import numpy as np 
from timeit import default_timer as timer
from numba import vectorize

# Compile the scalar function into an elementwise ufunc that runs on the GPU.
@vectorize(["float32(float32, float32)"], target='cuda')
def vectorAdd(a, b):
  return a + b

def main():
  N = 320000000

  A = np.ones(N, dtype=np.float32)
  B = np.ones(N, dtype=np.float32)

  start = timer()
  C = vectorAdd(A, B)
  vectorAdd_time = timer() - start

  print("c[:5] = " + str(C[:5]))
  print("c[-5:] = " + str(C[-5:]))

  print("vectorAdd took %f seconds " % vectorAdd_time)

if __name__ == '__main__':
  main()
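
Under the hood, `@vectorize` turns the scalar function into a ufunc applied once per element, with the GPU handling the elements in parallel. A GPU-free sketch of the same elementwise semantics (plain NumPy, hypothetical helper name), handy for checking results on small inputs:

```python
import numpy as np

def vector_add_reference(a, b):
    # Elementwise semantics of the @vectorize'd function: one scalar
    # add per element; the GPU performs these iterations in parallel.
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = a[i] + b[i]
    return out

A = np.ones(8, dtype=np.float32)
B = np.full(8, 2.0, dtype=np.float32)
C = vector_add_reference(A, B)
print(C)  # every element is 3.0
```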


PyCUDA

PyCUDA kernel functions are written in C/C++ and dynamically compiled into GPU machine code at run time; the Python code then drives the GPU code, as shown below:


import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void func(float *a, float *b, int N)  // int matches the np.int32 passed from Python
{
 const int i = blockIdx.x * blockDim.x + threadIdx.x;
 if (i >= N)
 {
  return;
 }
 float temp_a = a[i];
 float temp_b = b[i];
 a[i] = (temp_a * 10 + 2 ) * ((temp_b + 2) * 10 - 5 ) * 5;
 // a[i] = a[i] + b[i];
}
""")

func = mod.get_function("func")  

def test(N):
  print("N = %d" % N)

  N = np.int32(N)

  a = np.random.randn(N).astype(np.float32)
  b = np.random.randn(N).astype(np.float32)
  # keep a CPU copy of a for the reference computation
  aa = np.empty_like(a)
  aa[:] = a

  # GPU run: drv.InOut copies a to the device and back, drv.In copies b in,
  # so the measured time includes the host<->device transfers
  nThreads = 256
  nBlocks = int((N + nThreads - 1) / nThreads)
  start = timer()
  func(
      drv.InOut(a), drv.In(b), N,
      block=(nThreads, 1, 1), grid=(nBlocks, 1))
  run_time = timer() - start
  print("gpu run time %f seconds " % run_time)

  # CPU run
  start = timer()
  aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
  run_time = timer() - start
  print("cpu run time %f seconds " % run_time)

  # check result: the difference should be zero up to float rounding
  r = a - aa
  print(r.min(), r.max())

def main():
  for n in range(1, 10):
    N = 1024 * 1024 * (n * 10)
    print("------------%d---------------" % n)
    test(N)

if __name__ == '__main__':
  main()
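
The launch configuration above uses the ceiling division nBlocks = (N + nThreads - 1) / nThreads together with the `i >= N` guard in the kernel. A small host-side sketch (pure Python, no GPU needed, hypothetical helper name) confirming that this combination maps every element index exactly once:

```python
def covered_indices(N, n_threads):
    # Reproduce the kernel's index computation on the host:
    # i = blockIdx.x * blockDim.x + threadIdx.x, skipping i >= N.
    n_blocks = (N + n_threads - 1) // n_threads   # ceiling division
    hits = []
    for block in range(n_blocks):
        for thread in range(n_threads):
            i = block * n_threads + thread
            if i < N:                  # the guard inside the kernel
                hits.append(i)
    return hits

# Boundary cases: exact multiple of the block size, one under, one over.
for N in (1, 255, 256, 257, 1000):
    assert covered_indices(N, 256) == list(range(N))
print("ok")
```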

Comparison

Numba uses decorators to mark certain functions for acceleration (you can also write kernel functions in Python), which is similar in spirit to OpenACC, while PyCUDA requires you to write the kernel yourself and compile it at run time; the underlying implementation is C/C++. In testing, the speedups of the two methods are essentially the same. However, Numba is more of a black box, where you don't know what is happening inside, whereas PyCUDA is transparent. The two approaches therefore suit different situations:

* If you don't care about CUDA programming and just want to speed up your algorithms, use Numba.

* If you want to learn or research CUDA programming, or to test the feasibility of an algorithm under CUDA, use PyCUDA.

* If a program will later be ported to C/C++, you should use PyCUDA, because a kernel written with PyCUDA is itself already CUDA C/C++.

