Thursday, January 22, 2009

CUDA Working on Cluster

I was able to get CUDA working on both the frontend machine and the single compute node tonight, with some surprising results. It turned out that if I ran nvidia-smi as root, it would properly create the device entries in /dev, and all was well. I then ran my CUDA FDTD program with doubles (which requires a compute 1.3 class CUDA card) and got some strange timing results.
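
For anyone curious, a quick way to check whether a card can do doubles is to query its compute capability. This little sketch (not part of the FDTD code) just prints it for each device:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // Double precision requires compute capability 1.3 or higher.
            bool hasDoubles = prop.major > 1 || (prop.major == 1 && prop.minor >= 3);
            printf("Device %d: %s (compute %d.%d), doubles: %s\n",
                   dev, prop.name, prop.major, prop.minor,
                   hasDoubles ? "yes" : "no");
        }
        return 0;
    }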

I was certain that the Tesla C1060 would crush the old development board, which has fewer processors and a slower clock. Yet the development board proved faster by 4 or 5%, which was a surprise. As I increased the problem size, the results followed the same pattern.

FDTD is a 3D volumetric finite-difference time-domain electromagnetic simulation. The main loop is O(n^3), and it touches eight 3D arrays at every time step, so the program is both memory-intensive and compute-intensive. A thread block is limited to 512 threads on this hardware, so covering the 3D array space means tiling it with small blocks; I assign, for example, 1x16x16 threads per block, all within the 3D array space. In this case each block has 256 threads, which CUDA schedules onto a multiprocessor. My first thought was that the newer C1060 might benefit from more parallelism, so I made it 1x32x16, thus 512 threads per block. That slowed the program on both GPUs by a fair bit.
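
Roughly, the field-update kernels look something like the sketch below. The array names, the coefficient c, and the indexing are simplified stand-ins rather than the real code, and I'm assuming each 16x16 (i.e. 1x16x16) block covers a y-z tile while every thread walks the x dimension:

    #include <cuda_runtime.h>

    // Hypothetical, stripped-down Ex update showing the thread-to-cell mapping.
    __global__ void update_ex(double *ex, const double *hy, const double *hz,
                              double c, int nx, int ny, int nz)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // y index
        int k = blockIdx.y * blockDim.y + threadIdx.y;   // z index
        if (j < 1 || j >= ny || k < 1 || k >= nz) return;

        // Each thread marches along x, so a 16x16 block covers a y-z tile.
        for (int i = 0; i < nx; ++i) {
            int idx = (i * ny + j) * nz + k;
            ex[idx] += c * ((hz[idx] - hz[idx - nz])     // difference in y
                          - (hy[idx] - hy[idx - 1]));    // difference in z
        }
    }

    // Hypothetical launcher: a 16x16 block of 256 threads, tiled over the y-z plane.
    void launch_update_ex(double *d_ex, const double *d_hy, const double *d_hz,
                          double c, int nx, int ny, int nz)
    {
        dim3 block(16, 16);
        dim3 grid((ny + block.x - 1) / block.x, (nz + block.y - 1) / block.y);
        update_ex<<<grid, block>>>(d_ex, d_hy, d_hz, c, nx, ny, nz);
    }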

Next I wondered whether the multiprocessors might be resource-constrained and fewer threads would help, so I stepped the block down to 1x8x16, or 128 threads. That made a big difference, and now the Tesla C1060 beat the development board by 15 or 20%. Yay! Taking that logic further, I went down to 64 threads, but that made things a bit slower, so 128 threads looks like the right balance for the registers and other resources the FDTD kernels need.
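
The block-size experiments were just wall-clock comparisons. A hypothetical helper like the one below (reusing the update_ex sketch above, with the same made-up arguments) times a batch of steps with CUDA events for a given block shape:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical timing helper: run nsteps of one update kernel with a given
    // block shape and report the elapsed time. The kernel and device pointers
    // are stand-ins for the real FDTD fields.
    float time_block_shape(dim3 block, int nsteps,
                           double *d_ex, const double *d_hy, const double *d_hz,
                           double c, int nx, int ny, int nz)
    {
        dim3 grid((ny + block.x - 1) / block.x,
                  (nz + block.y - 1) / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int step = 0; step < nsteps; ++step)
            update_ex<<<grid, block>>>(d_ex, d_hy, d_hz, c, nx, ny, nz);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %ux%u: %.1f ms for %d steps\n", block.x, block.y, ms, nsteps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

Calling it with dim3(16, 32), dim3(16, 16), dim3(8, 16), and dim3(8, 8) would cover the 512-, 256-, 128-, and 64-thread cases described above.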

I really should play with the "Occupancy Calculator" tool provided by NVIDIA to see if it provides some insight into this sort of performance issue.