Writing CUDA Programs with Python
- 2020-05-27 06:10:07
- OfStack
There are two main ways to write a CUDA program in Python:
* Numba
* PyCUDA
NumbaPro is no longer recommended; its functionality has been split up and merged into Accelerate and Numba, respectively.
Example
Numba
Numba optimizes Python code through just-in-time (JIT) compilation. It can target the hardware of the machine it runs on, supports both CPU and GPU acceleration, and integrates with NumPy, so that plain Python code can be made to run on the GPU.
For example:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cuda')
def vectorAdd(a, b):
    return a + b

def main():
    N = 320000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = vectorAdd(A, B)
    vectorAdd_time = timer() - start

    print("c[:5] = " + str(C[:5]))
    print("c[-5:] = " + str(C[-5:]))
    print("vectorAdd took %f seconds" % vectorAdd_time)

if __name__ == '__main__':
    main()
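For reference, `@vectorize` lifts the scalar function into a NumPy-style ufunc that is applied elementwise across the input arrays. A minimal CPU-only sketch of that elementwise semantics (no Numba or GPU required; the loop here is only illustrative) looks like:

```python
import numpy as np

def vector_add_scalar(a, b):
    # The scalar rule that @vectorize lifts to whole arrays.
    return a + b

# Elementwise application, which the generated ufunc performs
# in parallel on the GPU when target='cuda':
A = np.ones(8, dtype=np.float32)
B = np.ones(8, dtype=np.float32)
C = np.array([vector_add_scalar(x, y) for x, y in zip(A, B)],
             dtype=np.float32)
print(C.tolist())  # every element is 1.0 + 1.0 = 2.0
```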
PyCUDA
The kernel function in PyCUDA (kernel) is actually written in C/C++ and is dynamically compiled into GPU microcode at run time; the Python code drives the GPU code, as shown below:
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void func(float *a, float *b, const int N)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N)
    {
        return;
    }
    float temp_a = a[i];
    float temp_b = b[i];
    a[i] = (temp_a * 10 + 2) * ((temp_b + 2) * 10 - 5) * 5;
    // a[i] = a[i] + b[i];
}
""")

func = mod.get_function("func")

def test(N):
    # N = 1024 * 1024 * 90  # float: 4M = 1024 * 1024
    print("N = %d" % N)
    N = np.int32(N)

    a = np.random.randn(N).astype(np.float32)
    b = np.random.randn(N).astype(np.float32)
    # copy a to aa
    aa = np.empty_like(a)
    aa[:] = a

    # GPU run
    nThreads = 256
    nBlocks = int((N + nThreads - 1) // nThreads)
    start = timer()
    func(
        drv.InOut(a), drv.In(b), N,
        block=(nThreads, 1, 1), grid=(nBlocks, 1))
    run_time = timer() - start
    print("gpu run time %f seconds" % run_time)

    # CPU run
    start = timer()
    aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
    run_time = timer() - start
    print("cpu run time %f seconds" % run_time)

    # check result
    r = a - aa
    print(r.min(), r.max())

def main():
    for n in range(1, 10):
        N = 1024 * 1024 * (n * 10)
        print("------------%d---------------" % n)
        test(N)

if __name__ == '__main__':
    main()
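The launch configuration above uses ceiling division so that the grid has enough threads to cover every array element, and the `if (i >= N)` guard in the kernel masks the surplus threads in the last block. A pure-Python sketch of that index arithmetic (the helper name `global_indices` is made up for illustration):

```python
def global_indices(N, n_threads=256):
    """Reproduce the kernel's index math: i = blockIdx.x * blockDim.x + threadIdx.x."""
    n_blocks = (N + n_threads - 1) // n_threads  # ceiling division, as in the host code
    covered = []
    for block in range(n_blocks):
        for thread in range(n_threads):
            i = block * n_threads + thread
            if i >= N:        # mirrors the kernel's early-return guard
                continue
            covered.append(i)
    return n_blocks, covered

n_blocks, covered = global_indices(1000, n_threads=256)
print(n_blocks)                 # 4 blocks: 3*256 = 768 < 1000 <= 4*256
print(len(covered), covered[0], covered[-1])  # 1000 indices, 0 .. 999
```

Every element index from 0 to N-1 is visited exactly once, which is why the kernel needs no loop over the data.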
Comparison
Numba uses decorators to mark functions for acceleration (you can also write kernel functions in Python), which is similar in spirit to OpenACC, while PyCUDA requires you to write the kernel yourself and compile it at run time, with the underlying implementation in C/C++. In testing, the speedup achieved by the two methods is roughly the same. However, Numba is more of a black box where you cannot see what is going on inside, whereas PyCUDA is transparent. The two approaches therefore suit different uses:
* If you don't care about CUDA programming and just want to speed up your algorithms, use Numba.
* If you want to learn or research CUDA programming, or to test the feasibility of an algorithm under CUDA, use PyCUDA.
* If a program is to be ported to C/C++ in the future, you must use PyCUDA, because a kernel written with PyCUDA is itself CUDA C/C++.