For the 4th year in a row, I am teaching Parallel and Concurrent Programming. One of the main sections of the course is general-purpose GPU programming (GPGPU), where I mostly use CUDA, but I also show students how to do the same using OpenCL, bindings in Java and Python, as well as Numba. For their projects, all students use either CUDA or Numba.
Because we want to provide students with access to a GPU, even off campus, we show them how to use Google Colab, a Jupyter Notebook environment with an optional GPU or TPU. It has been working fine for 3 straight years.
When preparing this week’s class, I ran the examples on my local machine and it all worked. Fast forward to the lab: I was demoing Google Colab and the example wouldn’t work. In a panic, I gave students temporary access to my machine while I debugged the issue.
I’m still using the code prepared by NVIDIA’s Nuno Subtil for the GPGPU workshop we organized back in 2011. It is a simplified version of the NVIDIA examples, with a very handy HANDLE_ERRORS macro that checks the return code of each CUDA function (cudaMalloc, cudaMemcpy, cudaFree). The example in particular would take two arrays A (containing elements from 0 to 1000) and B (from 1000 to 0) and would return the array C, with the sum of the corresponding positions in both arrays.
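For readers who have not seen this pattern, here is a minimal sketch of what such a macro might look like. The workshop's actual definition is not shown in this post, so the helper name handleError and the message format are my assumptions; only the CUDA runtime calls themselves are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical reconstruction: the real HANDLE_ERRORS macro from the
// workshop code is not reproduced in this post.
static void handleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        // cudaGetErrorString turns the numeric code into readable text
        fprintf(stderr, "%s in %s at line %d\n",
                cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}
#define HANDLE_ERRORS(call) (handleError((call), __FILE__, __LINE__))

int main() {
    const int N = 1001;  // elements 0..1000, as in the example
    int *dev_a = nullptr;

    // Every runtime call returns a cudaError_t, so it can be wrapped
    HANDLE_ERRORS(cudaMalloc((void **)&dev_a, N * sizeof(int)));
    HANDLE_ERRORS(cudaMemset(dev_a, 0, N * sizeof(int)));
    HANDLE_ERRORS(cudaFree(dev_a));

    printf("all runtime calls succeeded\n");
    return 0;
}
```

Note that this only covers functions that return a status. A kernel launch returns nothing, which is exactly the gap that bit me.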
The program would output an array of all zeroes. The CUDA functions returned 0, showing they were working correctly. I reviewed all the code and it should have worked. And it worked on my machine (famous last words).
Finding the Error
After modifying the arrays manually to check where the error was, I narrowed it down to the kernel call. And it made sense: it was the only call whose result wasn’t being checked. Knowing this, I added line 47 to understand the error:
Device Variable Copying: no kernel image is available for execution on the device
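The original listing is not reproduced here, so the following is only a sketch of how a kernel launch can be checked, not the exact line 47 I added. It assumes a hypothetical kernel named vectorSum and a block size of 256 threads; cudaGetLastError and cudaDeviceSynchronize are the standard runtime calls for this.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1001  // elements 0..1000

// Hypothetical kernel name; each thread adds one pair of elements
__global__ void vectorSum(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; c[i] = 0; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (N + threads - 1) / threads;  // round up to cover N
    vectorSum<<<blocks, threads>>>(dev_a, dev_b, dev_c, N);

    // The launch itself returns nothing, so ask the runtime explicitly
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Errors raised while the kernel runs surface at the next sync point
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("c[0] = %d, c[%d] = %d\n", c[0], N - 1, c[N - 1]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```

When the binary contains no code for the installed GPU, the check right after the launch is the one that fires. On a working setup every element of C should be 1000.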
Understanding the Error
Kernel is the name of the main function that executes on the GPU, and it is called from the CPU using a launch of the form kernelName<<<NBLOCKS, NTHREADS>>>(args). The kernel image is the binary on the GPU that will be executed. If it isn’t there, the kernel does nothing, but the program continues silently. This is a really bad design decision on NVIDIA’s part.
Fixing the Error
Because different GPUs support different binary versions (due to architectural differences), I tried to compile for the right architecture. Google Colab gave me a Tesla K80, which worked with compute_37. I added those as flags to the nvcc compiler:
!nvcc -arch=sm_37 -gencode=arch=compute_37,code=sm_37 vector-sum.cu -o vector-sum
And it now worked. Why is this required? Because Tesla K80 is only supported between CUDA 5 and 10. Google Colab is running on CUDA 11 (despite providing unsupported GPUs) and users get silent errors.
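If you are not sure which architecture flag applies to the GPU you were assigned, the runtime can tell you. The sketch below is my own addition, assuming the GPU of interest is device 0. It prints the compute capability and the CUDA versions in use, from which the sm_XY value for -arch can be read.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);  // encoded as 1000*major + 10*minor
    cudaDriverGetVersion(&driverVersion);

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "No usable GPU: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("GPU: %s\n", prop.name);
    printf("Compute capability: %d.%d (try -arch=sm_%d%d)\n",
           prop.major, prop.minor, prop.major, prop.minor);
    printf("CUDA runtime %d.%d, driver supports up to %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10,
           driverVersion / 1000, (driverVersion % 1000) / 10);
    return 0;
}
```

This program contains no kernel, so it runs even when the default compilation target does not match the GPU. On the Tesla K80 it should report compute capability 3.7, which matches the sm_37 flag used above.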
Spending 30 minutes juggling between debugging this issue and giving students a fallback alternative was very stressful. I mostly blame NVIDIA for not having better error handling for kernel calls. There’s also a little blame in not supporting old cards in new SDK versions (even if not supporting all features). Google also shares the blame for loading a version of CUDA incompatible with the provided GPU.