Thursday, February 26, 2009

Cluster still troubling

It turns out the cluster frontend was having memory errors. Reseating the DIMMs (Crucial Ballistix) seemed to help, as it passed Memtest86, but it failed again when put back in service. I tried to swap the disk into one of the cluster nodes to make it the frontend, but bloody Linux thwarted me completely, particularly because it was using disk labels rather than partition numbers, and somehow those labels couldn't be found on the new machine. The disk wasn't corrupt, but the kernel panicked on boot every time, no matter what I did.

Eventually, I had to reformat, reinstall ROCKS, restore the users, and put the new frontend in service. It seems to be working now, though ROCKS is misbehaving a bit. The scripts that normally handle all the user-adding tasks aren't adding the auto.home entries, which is a surprise. Anyway, I can use emacs and fix it myself. Bloody computers.

Monday, February 23, 2009

Technology troubles

Technology really failed me today. When I got to campus, I rebooted my cluster frontend, which had failed on Saturday. It didn't come back up. Then I discovered that the bright guys in network security had blocked my IP address because someone who had it before me had a compromised machine. When I called the help desk, they said I needed to send an email to get it cleared up. Well, that would have been great, except email was blocked too! So I plugged into the HIPerWall network, only to discover it didn't work either. The antique PowerMac G4 server that provided DHCP addresses, name service, and user accounts has failed, apparently permanently, as it won't even turn on the screen. So there are days I really hate technology.

But I think I'd enjoy digging ditches less, so I'll stick with the technology stuff for now.

I used my iPhone to send the email to unblock my IP address, then I configured the Linux storage server to serve DHCP, so HIPerWall is back up. The CUDA cluster is still hosed, however...

Saturday, February 21, 2009

CUDA cluster troubles

The problem with running a cluster that other people use is that those other people rely on it and it sometimes breaks.

In this case, it looks like my frontend crashed and rebooted itself (why, I don't know) yesterday afternoon. Upon reboot, it seems not to have reinitialized the Tesla card, so the students couldn't do their testing. I ran "nvidia-smi" on it, which created the device entries and made it usable again. Perhaps I need to add that to init.d or rc.local.
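
For what it's worth, a tiny check like this (just a sketch with names of my own choosing, and no substitute for the nvidia-smi call itself) run at boot would at least make a dead card show up in the logs instead of in a student's failed job:

  // gpucheck.cu - minimal sketch: enumerate the CUDA devices and report them.
  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      int count = 0;
      cudaError_t err = cudaGetDeviceCount(&count);
      if (err != cudaSuccess) {
          fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
          return 1;
      }
      for (int d = 0; d < count; ++d) {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, d);
          printf("Device %d: %s\n", d, prop.name);
      }
      return 0;
  }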

It also turns out that two of my compute nodes had problems, not related to reboots. In one case, the devices weren't present and nvidia-smi didn't fix it, but a reboot followed by nvidia-smi did. In the other case, the machine with 2 Tesla cards had only one of them working. Again, nvidia-smi didn't help, but a reboot did.

While I imagine the problem may stem from student code doing horrible things inside the card, I think it points to a driver problem. Perhaps the driver isn't correctly recovering after some crazy operation, but that's not a good sign for the robustness of the CUDA computing environment.

Thursday, February 12, 2009

Some pain, no gain

I modified my program to take advantage of coalesced memory accesses (particularly the loads in the N-body problem), which meant changing many, many things. After a fair bit of painful debugging, I got it to work and produce the same results as the original CUDA version. The bummer is that it is only slightly faster (a percent or two). It now seems likely that the memory accesses were already pretty good, though they were certainly more numerous than they needed to be, so I really expected a significant gain. I am taking advantage of locality MUCH more than before, yet it didn't help.
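
For the curious, the coalesced version is structured roughly like this (a sketch with made-up names, not my actual code; it assumes the mass rides along in the position's w component and that N is a multiple of the block size):

  // Tiled ComputeForces sketch: each thread owns one body, and the block
  // cooperatively stages a tile of positions in shared memory so the
  // global loads are contiguous across the warp (i.e., coalesced).
  __global__ void ComputeForcesTiled(const double4 *pos, double3 *acc, int n)
  {
      extern __shared__ double4 tile[];
      const double eps2 = 1e-9;                        // softening to avoid r = 0
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      double4 pi = pos[i];
      double ax = 0.0, ay = 0.0, az = 0.0;

      for (int base = 0; base < n; base += blockDim.x) {
          tile[threadIdx.x] = pos[base + threadIdx.x]; // coalesced load of one tile
          __syncthreads();
          for (int k = 0; k < blockDim.x; ++k) {
              double4 pj = tile[k];
              double dx = pj.x - pi.x;
              double dy = pj.y - pi.y;
              double dz = pj.z - pi.z;
              double r2 = dx*dx + dy*dy + dz*dz + eps2;
              double inv_r = 1.0 / sqrt(r2);
              double s = pj.w * inv_r * inv_r * inv_r; // m_j / r^3
              ax += dx * s;  ay += dy * s;  az += dz * s;
          }
          __syncthreads();
      }
      acc[i] = make_double3(ax, ay, az);
  }
  // launch: ComputeForcesTiled<<<n / threads, threads, threads * sizeof(double4)>>>(d_pos, d_acc, n);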

I turned to the fancy profiler provided by NVIDIA. It's pretty nifty and told me what I already knew: that the O(N^2) ComputeForces function is taking all the time. What it couldn't tell me is the number of uncoalesced loads, because that performance counter is apparently not (yet?) supported on G200 GPUs. Darn. It does tell me that I have a lot of coalesced loads and stores, but those may be only a portion of the total. So I'm not particularly thrilled with the profiler, since uncoalesced memory accesses can be the real performance killer in CUDA apps. But it looks pretty and has lots of other somewhat useful statistics. I'm glad I tried it.

N-Body

I have assigned my students to write a simple n-body simulation in CUDA and compare it to the OpenMP version they've already done. I know there are really great n-body programs for CUDA out there, but starting simple is good for a class. I wrote my own as well, because I want to see if any of the students can make theirs faster than mine.

My initial naive implementation is only about 5 times faster on the Tesla C1060 than the OpenMP version on the Core2Quad 8200. This version does nothing special to take advantage of locality, so my next version should be MUCH faster!
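
For reference, the naive version is essentially one thread per body with every interaction read straight from global memory, roughly like this sketch (made-up names, not my exact code; mass is assumed to ride in the position's w component):

  // Naive ComputeForces sketch: no shared-memory tiling, no attention to
  // locality - each thread just walks the whole position array.
  __global__ void ComputeForcesNaive(const double4 *pos, double3 *acc, int n)
  {
      const double eps2 = 1e-9;                        // softening to avoid r = 0
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      double4 pi = pos[i];
      double ax = 0.0, ay = 0.0, az = 0.0;
      for (int j = 0; j < n; ++j) {
          double4 pj = pos[j];                         // re-read from global memory every time
          double dx = pj.x - pi.x;
          double dy = pj.y - pi.y;
          double dz = pj.z - pi.z;
          double r2 = dx*dx + dy*dy + dz*dz + eps2;
          double inv_r = 1.0 / sqrt(r2);
          double s = pj.w * inv_r * inv_r * inv_r;     // m_j / r^3
          ax += dx * s;  ay += dy * s;  az += dz * s;
      }
      acc[i] = make_double3(ax, ay, az);
  }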

Two surprises:
  1. Even with doubles, the floating point values differ from the Core2Quad after a good number of iterations. This wasn't the case in my previous FDTD program, so I'm a bit concerned. I am using sqrt(), so perhaps that implementation differs a bit.
  2. When I tried 256 threads per block, the CUDA launch failed with error 4 (unspecified launch failure). I've never seen that one before, but the code works with 128 threads and fewer. Hmmm...
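
Catching that kind of failure takes an explicit check after the kernel call, something like this sketch (the helper and the names in the usage comment are mine):

  #include <cstdio>
  #include <cuda_runtime.h>

  // Report launch or execution errors by name instead of silently getting
  // garbage results back.
  static bool CheckKernel(const char *name)
  {
      cudaError_t err = cudaGetLastError();            // launch-configuration errors
      if (err == cudaSuccess)
          err = cudaThreadSynchronize();               // errors during execution (CUDA 2.x call)
      if (err != cudaSuccess) {
          fprintf(stderr, "%s: %s\n", name, cudaGetErrorString(err));
          return false;
      }
      return true;
  }

  // usage (hypothetical names):
  //   ComputeForces<<<blocks, 256>>>(d_pos, d_acc, n);
  //   if (!CheckKernel("ComputeForces")) { /* maybe fall back to 128 threads */ }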

Tuesday, February 10, 2009

New cluster node

I put together a new node for the CUDA cluster. This one has the same el-cheapo Core2Quad 8200 and 4GB of Kingston DDR3 RAM, as well as the scrounged hard disk from the cluster node it replaced (the hard drive tests OK, but it doesn't sound very quiet...). I got a bigger power supply this time so I could put 2 Tesla cards in. The documentation is somewhat weak on multiple Tesla cards per machine, so I wasn't sure whether I should put in the SLI bridge or not. I didn't, mostly because I forgot - I had the bridge, but buttoned up the machine before I remembered to use it.

Both Tesla cards are detected with no trouble, but I notice an interesting delay as CUDA programs start. This delay does not occur with normal OpenMP or other programs, so it must be something to do with device detection in CUDA. So far, I'm not concerned, but I will see if it causes any trouble. I will also look into how two separate programs can each use one of the cards, so that two of my students can use this node for CUDA simultaneously.
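
The obvious approach is for each program to pin itself to its own card with cudaSetDevice() before doing any other CUDA work, roughly like this sketch (taking the device number from the command line is just my placeholder for however we end up assigning cards):

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  // Pick which Tesla to use so two programs on the same node each get their
  // own card.
  int main(int argc, char **argv)
  {
      int dev = (argc > 1) ? atoi(argv[1]) : 0;
      int count = 0;
      cudaGetDeviceCount(&count);
      if (dev < 0 || dev >= count) {
          fprintf(stderr, "No such device %d (only %d found)\n", dev, count);
          return 1;
      }
      cudaSetDevice(dev);                              // select the card before any other CUDA calls

      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, dev);
      printf("Using device %d: %s\n", dev, prop.name);
      // ... the rest of the program now runs entirely on this card ...
      return 0;
  }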