I turned to the fancy profiler provided by NVIDIA. It's pretty nifty and told me what I already knew: the O(N^2) ComputeForces function is taking all the time. What it couldn't tell me is the number of uncoalesced loads, because that performance counter is apparently not (yet?) supported on G200 model GPUs. Darn. It does tell me that I have a lot of coalesced loads and stores, but they may be only a portion of the total. So I'm not particularly thrilled with the profiler, since uncoalesced memory accesses can be the real performance killer in CUDA apps. But it looks pretty and has lots of other somewhat useful statistics. I'm glad I tried it.
Thursday, February 12, 2009
Some pain, no gain
I modified my program to take advantage of coalesced memory accesses (particularly loads in the N-body problem), which meant changing many, many things. After a fair bit of painful debugging, I got it to work and produce the same results as the original CUDA version. The bummer is that it is only slightly faster (a percent or two). It now seems likely that the memory accesses were already pretty good, but there were certainly more of them than there should have been, so I really expected a significant gain. I am taking advantage of locality MUCH more than before, yet it didn't help.
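
For anyone wondering why data layout matters for coalescing, here is a toy sketch in Python. It is not the CUDA code from this project and it does not reproduce the exact hardware rules; it only counts how many aligned memory segments a group of 16 threads touches when each thread reads one float. The segment size, the four-floats-per-body layout, and every function name are assumptions made up for illustration.

```python
# Toy model of memory coalescing. All constants are illustrative assumptions,
# not the precise rules of any particular GPU.
HALF_WARP = 16       # threads whose loads are grouped together (assumed)
FLOAT_BYTES = 4      # size of one float
SEGMENT_BYTES = 64   # assumed size of one aligned memory segment


def segments_touched(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments covered by a list of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def aos_addresses(first_body, field, floats_per_body=4):
    """Array-of-structs: thread t reads one field of body first_body + t.

    Neighbouring threads are floats_per_body floats apart, so the loads
    are strided across memory.
    """
    return [((first_body + t) * floats_per_body + field) * FLOAT_BYTES
            for t in range(HALF_WARP)]


def soa_addresses(first_body):
    """Struct-of-arrays: thread t reads element first_body + t of one array.

    Neighbouring threads read neighbouring floats.
    """
    return [(first_body + t) * FLOAT_BYTES for t in range(HALF_WARP)]


if __name__ == "__main__":
    print("AoS segments:", segments_touched(aos_addresses(0, 0)))
    print("SoA segments (aligned):", segments_touched(soa_addresses(0)))
    print("SoA segments (offset by 8):", segments_touched(soa_addresses(8)))
```

Under these assumptions the strided layout touches four segments where the contiguous one touches a single segment, and a contiguous read that starts off a segment boundary touches two. Whether a difference like that shows up in run time depends on how much of the kernel is actually limited by memory traffic, which this model says nothing about.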