Friday, January 16, 2009

Cluster Troubles

I was able to install ROCKS on my Phenom frontend node with no trouble, even though it doesn't have any CUDA software installed yet. I built one compute node with a Core2Quad, 4GB Kingston DDR3 RAM, a scrounged Western Digital 40GB IDE drive, a Myrinet 2000 PCI card, and a donated NVIDIA 790i motherboard and Tesla C1060 card. I used an old Matrox Millennium PCI video card so I could watch ROCKS install the cluster OS on the system. The install worked fine, but this version of ROCKS doesn't come with the Myrinet Roll, nor does Myricom have a ROCKS 5.1 roll (and my cards are so old they may not be well supported). Getting Myrinet running can wait.

A really great thing about ROCKS is that it doesn't try to recover failed compute nodes - it just reinstalls the OS on them. Unfortunately, this means a lot of things have to go right, and they don't in this case. When I pull the PCI video card out of the node, the reinstall fails, presumably because Anaconda (the Red Hat installer) barfs when it doesn't see a video card. This is a shame, because I have no plans to put video cards in these compute nodes. The workaround is to shut off the automatic reinstall, which also means no automatic updates. That is a serious downside. My old Athlon nodes worked perfectly without video cards, but something here fails during the reinstall process, and without a video card, I can't see what fails.

Then I moved the frontend and compute node up to the cluster racks in our 4th floor machine room. There, it seems that others have "borrowed" my network cables and my jack in the patch panel, so the new system doesn't have network connectivity. I will try a small switch to share the connection used by my other cluster frontend. A serious issue came up with this move: the networks (IPs and gateways) are different between the HIPerWall lab and the UCInet in the 4th floor server room. So I dutifully changed the network settings in /etc/sysconfig to match the new network. But the change didn't stick. This seemed crazy, but it was true. After a bit of looking around, I found that the network settings were ALSO stored in another nearby directory, and those were the ones really being used. This is a real problem that Linux seems to have and that Windows and Mac OS X don't: there needs to be a single source of truth. In this case, links between the two settings files would have solved it, but whatever installer wrote both files made them separate rather than linked. On almost every Linux system I set up, I have some sort of network config issue that requires manually editing config files somewhere. Admittedly, I tend to have stranger network config requirements than most people, for whom DHCP is fine, but Linux will not take over the desktop, as all the proponents have been hoping, until some consistency is achieved.
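
For anyone chasing the same kind of duplicate setting, here is a minimal sketch of one way to hunt for every file that still mentions the old address. It is not what I actually ran; the function name find_setting and the example address are made up for illustration, and all it does is walk a directory tree and report each line containing a given string.

```python
import os


def find_setting(root, needle):
    """Return a sorted list of (path, line_number, line) tuples for every
    line under `root` that contains `needle`.

    `find_setting` is a hypothetical helper name, not part of any tool.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, FIFOs, and broken symlinks
            try:
                # errors="replace" keeps binary junk from raising an exception
                with open(path, "r", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            hits.append((path, lineno, line.rstrip("\n")))
            except OSError:
                continue  # unreadable file (permissions, etc.); move on
    return sorted(hits)
```

Calling something like find_setting("/etc", "192.0.2.10") as root, with the old IP in place of that placeholder address, lists every file and line where the stale value still lives, which at least shows how many copies of the "truth" there are before any editing starts.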