Thursday, February 12, 2009

N-Body

I have assigned my students to write a simple n-body simulation in CUDA so they can compare it to the OpenMP version they've already done. I know there are really great n-body programs for CUDA out there, but starting simple is good for a class. I wrote my own as well, because I want to see if any of the students can make theirs faster than mine.

My initial naive implementation is only about 5 times faster on the Tesla C1060 than the OpenMP version on the Core2Quad 8200. This version does nothing special to take advantage of locality, so my next version should be MUCH faster!

Two surprises:
  1. Even with doubles, the floating-point results diverge from the Core2Quad's after a good number of iterations. This wasn't the case in my previous FDTD program, so I'm a bit concerned. I am using sqrt(), so perhaps the GPU's implementation differs slightly.
  2. When I tried 256 threads per block, the CUDA launch failed with error 4 (unspecified launch failure). I've never seen that one before, but the code works with 128 threads and fewer. Hmmm...