My initial naive implementation is only about 5 times faster on the Tesla C1060 than the OpenMP version on the Core 2 Quad Q8200. This version does nothing special to take advantage of locality, so my next version should be MUCH faster!
Two surprises:
- Even with doubles, the floating-point values diverge from the Core2Quad results after a good number of iterations. This wasn't the case in my previous FDTD program, so I'm a bit concerned. I do call sqrt(), so perhaps the device implementation rounds differently than the host's.
- When I tried 256 threads per block, the CUDA launch failed with error 4 ("unspecified launch failure"). I've never seen that one before, but the code works with 128 threads or fewer. Hmmm...
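On the second surprise: a block size above the hardware limit would fail immediately with "invalid configuration argument", whereas error 4 ("unspecified launch failure") typically surfaces only after the kernel has run and faulted, e.g. an out-of-bounds index that is only reached at larger block sizes. A minimal checking pattern looks like the sketch below; the kernel and variable names (fdtd_kernel, d_field, n) are placeholders, not the actual code from this project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real FDTD update step.
__global__ void fdtd_kernel(double *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: without it, a larger block size can
        field[i] = sqrt(field[i]);  // index past the array and trigger error 4
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

    int n = 1000;
    double *d_field;
    cudaMalloc(&d_field, n * sizeof(double));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fdtd_kernel<<<blocks, threads>>>(d_field, n);

    cudaError_t err = cudaGetLastError();   // configuration errors show up here
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // faults inside the kernel show up here
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_field);
    return 0;
}
```

It's also worth compiling with `--ptxas-options=-v`: a kernel whose per-thread register use is fine at 128 threads can exhaust the SM's register file at 256, though that case normally reports "too many resources requested for launch" rather than error 4.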