Saturday, February 21, 2009

CUDA cluster troubles

The problem with running a cluster that other people use is that those other people rely on it and it sometimes breaks.

In this case, it looks like my frontend crashed and rebooted itself (why, I don't know) yesterday afternoon. Upon reboot, it seems not to have reinitialized the Tesla card, so the students couldn't do their testing. I ran "nvidia-smi" on it, which created the device entries and made it usable again. Perhaps I need to add that to init.d or rc.local.

It also turns out that two of my compute nodes had problems, not related to reboots. In one case, the devices weren't present, but nvidia-smi didn't fix it; a reboot, followed by nvidia-smi, did. In the other case, the machine with 2 Tesla cards had only one of them working. Again, nvidia-smi didn't help, but a reboot did.

While I imagine the problem may stem from student code doing horrible things inside the card, I think it points to a driver problem. Perhaps the driver isn't recovering correctly after some crazy operation; either way, it's not a good sign for the robustness of the CUDA computing environment.