Sunday, March 15, 2009

Cluster and CUDA troubles abound

Running this cluster for my class is more trouble than it's worth, and I won't do it again. Research cluster, yes - teaching cluster, no.

Yesterday, one of the students ran the frontend node out of memory. Of course, that meant I couldn't log in to reboot the thing, so I had to go to campus, figure out which entrance of the building is open on weekends and isn't blocked off by construction, and press the reset switch. Very annoying. I am disappointed that Linux is that fragile and that there aren't mechanisms in place to prevent this sort of problem. Surely killing the offending program and freeing its memory should have sorted things out, but that obviously didn't happen.
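
If I keep running the machine, a per-user memory cap seems like the obvious defense. Something along the lines of the pam_limits config below would probably do it. This is only a rough sketch: the @students group name is just a guess at whatever group the class accounts end up in, and these limits apply per process rather than per user, hence the nproc line so nobody just forks around the cap.

    # /etc/security/limits.conf (sketch)
    # "as" caps address space per process, in kilobytes (~4 GB here)
    @students    hard    as       4000000
    # also cap the number of processes per login
    @students    hard    nproc    64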

An even more annoying problem has to do with the Tesla cards. It looks like the drivers keep crashing: after a while, programs suddenly can't open the appropriate /dev entries for the NVIDIA cards anymore. No doubt the students are running buggy programs and such, but that should be expected. I think the driver must not be recovering properly after a crash. Even running nvidia-smi won't fix it. A reboot will, but I discovered that "rmmod nvidia" removes the driver module and clears the problem; running nvidia-smi afterwards causes the driver to reload and reinitialize. Still bloody annoying, because root needs to do it, and I'm not giving the students root...
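
For my own reference, the reset sequence boils down to something like this. It's a rough sketch and has to be run as root; rmmod will refuse to unload the module if any process still has /dev/nvidia* open, so the stragglers have to be killed first.

    #!/bin/sh
    # reset-nvidia.sh (sketch): unload the wedged NVIDIA module and bring it back
    rmmod nvidia || exit 1    # fails if something still holds /dev/nvidia* open
    nvidia-smi                # the first query reloads and reinitializes the driver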