Friday, January 30, 2009

New CUDA

I updated the cluster with the recently-released CUDA 2.1 for Linux 64 and everything seems to work well.

I am disappointed that Mac CUDA trails at 2.0, but I presume 2.1 is coming.

I looked into putting a Tesla card into a Mac Pro, but it looks like it wouldn't be supported under Mac OS X. It would probably work under Windows, which we do run on our Mac Pros sometimes. I imagine the reason is that Apple's display drivers won't load for a Tesla card (it has no display outputs), and Mac CUDA relies on those bundled drivers. Darn.

Wednesday, January 28, 2009

HIPerWall makes news

I was pleased to discover that Information Week has named my company, Hiperwall Inc., as the startup of the week!

Thursday, January 22, 2009

CUDA Working on Cluster

I was able to get CUDA working on both the frontend machine and the single compute node tonight, with some surprising results. It turned out that if I ran nvidia-smi as root, it would properly create the device entries in /dev, and all was well. So then I ran my CUDA FDTD program with doubles (thus requiring a compute capability 1.3 class CUDA card), and got some strange timing results.
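(Before getting to the timing, a quick aside: a small device-query program along the lines of the sketch below is a handy sanity check that the /dev entries are usable and that a card really has the compute capability 1.3 support that doubles require. This is just my own stand-in, not the SDK's deviceQuery sample.)

// Minimal device-query sketch: confirms the driver/device nodes work and
// reports compute capability (doubles need 1.3 or better).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No usable CUDA devices: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute %d.%d, %d multiprocessors\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        if (prop.major == 1 && prop.minor < 3)
            printf("  (no double-precision support on this one)\n");
    }
    return 0;
}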

I was certain that the Tesla C1060 would crush the old development board, which has fewer processors and a slower clock. Yet the development board proved 4 or 5% faster, which was a surprise. As I increased the problem size, the results followed the same pattern.

FDTD is a 3D volumetric finite-difference, time-domain electromagnetic simulation. The main loop is O(n^3), and it touches eight 3D arrays every time step, so it's a really memory-intensive and CPU-intensive program. Because a single CUDA thread block can't cover the whole 3D volume (a block is limited to 512 threads on these cards, and the grid of blocks is only two-dimensional), I assign, for example, 1x16x16 threads per block, all within the 3D array space. In this case, each block has 256 threads, and CUDA schedules each block on one of the GPU's multiprocessors. My first thought was that the newer C1060 might benefit from more parallelism, so I made it 1x32x16, thus 512 threads per block. That slowed the program on both GPUs by a fair bit.
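To make the block layout concrete, here is a heavily stripped-down sketch of how a 1x16x16 block can be mapped onto the volume. The field names, the update itself, and the device pointers dE/dH are placeholders of mine, not my actual FDTD kernel; the grid's x dimension covers the x slabs and its y dimension covers the (y,z) tiles.

// Sketch only: placeholder arrays and a stand-in update, not the real FDTD kernel.
// Block is (1, BY, BZ); grid.x covers x slabs, grid.y covers the (y,z) tiles.
__global__ void updateField(float *E, const float *H, int nx, int ny, int nz)
{
    int tilesY = ny / blockDim.y;                        // tiles along y
    int i = blockIdx.x;                                  // one x slab per block
    int j = (blockIdx.y % tilesY) * blockDim.y + threadIdx.y;
    int k = (blockIdx.y / tilesY) * blockDim.z + threadIdx.z;

    if (i > 0 && i < nx - 1 && j < ny && k < nz) {
        int idx = (i * ny + j) * nz + k;
        E[idx] += 0.5f * (H[idx] - H[idx - ny * nz]);    // stand-in for the curl update
    }
}

// Launch with 1x16x16 = 256 threads per block (dE, dH are device arrays):
//   dim3 block(1, 16, 16);
//   dim3 grid(nx, (ny / 16) * (nz / 16));
//   updateField<<<grid, block>>>(dE, dH, nx, ny, nz);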

Then I thought the multiprocessors might be resource-constrained, so fewer threads per block could help, and I stepped down to 1x8x16, or 128 threads. This made a big difference: now the Tesla C1060 beat the development board by 15 or 20%. Yay! Taking that logic further, I went down to 64 threads, but that made things a bit slower, so 128 threads looks like the right balance for the registers and other resources FDTD needs.
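In case it's useful, here's roughly how such a comparison can be timed with CUDA events. This is just a sketch, not my actual harness: it reuses the updateField placeholder from above, and dE/dH are assumed to be device arrays that have already been allocated and initialized.

// Sketch: time one block configuration over a number of time steps.
float timeConfiguration(dim3 grid, dim3 block, int steps,
                        float *dE, float *dH, int nx, int ny, int nz)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int t = 0; t < steps; ++t)
        updateField<<<grid, block>>>(dE, dH, nx, ny, nz);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for all the time steps to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// e.g. compare 256-thread and 128-thread blocks:
//   float t256 = timeConfiguration(dim3(nx, (ny/16)*(nz/16)), dim3(1,16,16), 100, dE, dH, nx, ny, nz);
//   float t128 = timeConfiguration(dim3(nx, (ny/ 8)*(nz/16)), dim3(1, 8,16), 100, dE, dH, nx, ny, nz);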

I really should play with the "Occupancy Calculator" tool provided by NVIDIA to see if it provides some insight into this sort of performance issue.

Wednesday, January 21, 2009

Cluster sort of working

The new mini-cluster is online and will be put to use for my class. In fact, I was able to demonstrate it in class on Tuesday. I showed how OpenMP can parallelize the finite-difference, time-domain (FDTD) code I've been using for a while and have posted on the original blog.

I ended up with two surprising results. First, on both the Phenom frontend and the Core2Quad compute node, the code ran faster when I told OpenMP to use 8 threads rather than 4, even though each machine has only 4 cores. This is surprising because the work is very memory-bound, and normally the extra overhead of context switching between threads would slow it down or at best make no difference. Yet it ran faster, by a fair bit. I can only think there must be some odd cache effect or something, but it warrants more investigation.
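For reference, the OpenMP parallelization is nothing fancy: the outer loop of the update gets a parallel for, and the thread count is whatever omp_set_num_threads() or OMP_NUM_THREADS says. The sketch below uses placeholder arrays and a stand-in update, not the actual posted code.

/* Sketch of the OpenMP version: placeholder arrays and update, not the code
 * posted on the original blog.  The thread count comes from
 * omp_set_num_threads() or the OMP_NUM_THREADS environment variable. */
#include <omp.h>

void update_field(double ***E, double ***H, int nx, int ny, int nz)
{
    int i, j, k;
    /* parallelize the outer loop over x; each thread gets a chunk of slabs */
    #pragma omp parallel for private(j, k) schedule(static)
    for (i = 1; i < nx - 1; i++)
        for (j = 1; j < ny - 1; j++)
            for (k = 1; k < nz - 1; k++)
                E[i][j][k] += 0.5 * (H[i][j][k] - H[i - 1][j][k]);
}

/* e.g. call omp_set_num_threads(8) before the time-step loop to try 8
 * threads on a 4-core machine. */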

Second, the Phenom (at 2.2 GHz with DDR2 RAM, if I remember correctly) kicked the 2.33 GHz Core2Quad's butt by a quarter to a third in all the tests. And, yes, the C2Q has fast DDR3 RAM and a fast front-side bus. The direction of the result doesn't shock me, as AMD has always done well at floating point for scientific computing, but the margin of victory is surprising. Good job, AMD!

I've been having trouble getting CUDA working on the frontend node. I put the test board I received under NDA last year into that node, in the hope I could reuse it rather than stick one of the Teslas in there. Though the driver is installed and seemingly happy, it isn't creating the /dev entries, so CUDA programs can't run. I will try the compute node today, since it has the official Tesla card. Perhaps the drivers just don't want to deal with a test board, in which case I'll swap a Tesla card into the frontend.

Unfortunately, while getting this system working, my old cluster frontend failed and the network connections in the server room are all messed up, so I need to fix those things. Sometime... 

Friday, January 16, 2009

Cluster Troubles

I was able to install ROCKS on my Phenom frontend node with no trouble, even though it doesn't have any CUDA software installed yet. I built one compute node with a Core2Quad, 4GB of Kingston DDR3 RAM, a scrounged Western Digital 40GB IDE drive, a Myrinet 2000 PCI card, and a donated NVIDIA 790i motherboard and Tesla C1060 card. I used an old Matrox Millennium PCI video card so I could watch ROCKS install the cluster OS on the system. The install worked fine, but this version of ROCKS doesn't come with the Myrinet Roll, nor does Myricom have a ROCKS 5.1 roll (and my cards are so old they may not be well supported). Getting Myrinet running can wait.

A really great thing about ROCKS is that it doesn't try to recover failed compute nodes - it just reinstalls the OS on them. Unfortunately, this means a lot of things have to go right, and they don't in this case. When I pull the PCI video card out of the node, the reinstall fails, presumably because Anaconda (the Red Hat installer) barfs when it doesn't see a video card. This is a shame, because I have no plans to put video cards in these compute nodes. The workaround is to shut off the automatic reinstall, which also means no automatic updates. This is a serious downside. My old Athlon nodes worked perfectly without video cards, but something here fails during the reinstall process, and without a video card, I can't see what fails.

Then I moved the frontend and compute node up to the cluster racks in our 4th floor machine room. There, it seems that others have "borrowed" my network cables and my jack in the patch panel, so the new system doesn't have network connectivity. I will try a small switch to share the connection used by my other cluster frontend.

A more serious issue came up with the move: the networks (IPs and gateways) are different between the HIPerWall lab and the UCInet in the 4th floor server room. So I dutifully changed the network settings in /etc/sysconfig to the new network. But it didn't stick. This seemed crazy, but it was true. After a bit of looking around, I found that the network settings were ALSO stored in another nearby directory, and those were the ones really being used.

This is a real problem that Linux seems to have and that Windows and Mac OS X don't: there needs to be a single source of truth. In this case, links between the two settings files would have solved it, but whatever installer wrote both files made them separate rather than linked. On almost every Linux system I set up, I have some sort of network config issue that requires manually editing config files somewhere. Admittedly, I tend to have stranger network config requirements than most people, for whom DHCP is fine, but Linux will not take over the desktop, as all the proponents have been hoping, until some consistency is achieved.

Wednesday, January 14, 2009

ROCKS not Rolling with CUDA

I finally have enough parts together to start building the new cluster. I have a frontend machine running a 4-core Phenom and a compute node with an Intel Core2Quad. I went to download ROCKS to put on the system and ran into a couple of problems.

First, the ROCKS ftp site is down (as of this writing), which means I can't download the DVD. But I can use the HTTP site and get the CDs, so no great loss.

Then I went looking for the CUDA Roll (Rolls are extensions to ROCKS functionality, so ROCKS knows how to install the software on the nodes). Well, the NVIDIA-supported roll is for ROCKS 4.3, which is way out of date. (Apparently NVIDIA is embarrassed about it, because the link is really hard to find on the NVIDIA website.) Someone has built a test version for ROCKS 5.0 and put it on Google Code, but even that is out of date.

So now I have to decide whether to make my own roll (or at least integrate CUDA into ROCKS so it can be installed automatically), or to install CUDA by hand on each node and prevent ROCKS from reinstalling after each node crash. Having done the former before, I'm leaning towards the latter, since I have so few nodes. If anyone has a modern CUDA roll, please let me know.

Sunday, January 11, 2009

CUDA Cluster Computing

I'm in the process of putting together a new cluster to support both parallel computing education and research. I have an old, decrepit AMD Athlon cluster that has mostly failed (power supplies, fans, and disk drives are the culprits), so I will be replacing some of those nodes with fancy new ones. This is needed because UCI no longer has a student-accessible cluster for education. The graduate students used to have access to a fairly nice cluster as their email machine (they didn't know it could run MPI and had 44 Xeons driving it). That's gone and has not been replaced.

The course I am teaching this quarter is primarily on parallel computer architecture, but in my opinion, the best way to understand a system's behavior is to use it, so I believe in having the students do parallel programming assignments. I will teach them OpenMP for cache-coherent shared memory architectures and MPI for distributed memory architectures. I will also spend a fair bit of time on CUDA, because of its performance and accessibility.

I received some very generous support for my new cluster from NVIDIA. Because they want to make sure my grad students have the best possible CUDA experience, NVIDIA provided six Tesla C1060s and six 790i motherboards. This will help me make a really nice setup.

Unfortunately, my old power supplies, RAM, and CPUs won't work with this new equipment, so I need to come up with those myself. I had hoped to get Intel to donate CPUs, so I asked our industry relations folks to help make contact with Intel. Unfortunately, word got out that I will be teaching CUDA, so the Intel contact refused to help, saying that Larrabee is way better. Of course, Larrabee isn't available yet, so I can't exactly assign projects on it, and the OpenMP and MPI assignments would have made the students familiar with technologies useful on Intel processors, possibly including Larrabee. Darn.

Anyway, I've bought components for one system now and will borrow a machine from home as the cluster front end, so I will post updates as the system comes together. I plan to put ROCKS on it, having been a long-time ROCKS user and fan.


Friday, January 9, 2009

GPU Computing Talk

I just got back from a very interesting talk by Pat Hanrahan of Stanford on GPUs and the future of parallel computing. He was very compelling in his arguments that conventional CPUs are dinosaurs and need to be replaced or at least supplemented with much more efficient GPU-style processors. He also showed a bit about Intel's Larrabee, which was very interesting. The idea of using a bunch of simple, multithreaded x86 cores in a graphics architecture is intriguing.

The biggest roadblock to this approach is software. Frankly, a whole lot of code out there is single-threaded, and that isn't likely to change in the short term. Existing programming paradigms are problematic for parallel or concurrent code, and we programmers tend to make mistakes that are very hard to catch and debug on parallel systems, while sequential debugging is at least tolerable. Until this is addressed by new generations of software architectures and better training for computer scientists and programmers, adoption will be slow, except in specialized domains.

Moving in

Because iWeb only runs on the Mac, and I want to be able to post from my Tablet PC or iPhone, my old .Mac blog has become a burden. So I am moving my blog on Asymmetric Parallel Computing here.

The old postings will stay up at the old address.

The new blog will cover my experiences with CUDA much more than the Cell, because I'm fairly well over the Cell, having found it hard to get reasonable performance out of the thing. As OpenCL becomes available, I'll investigate that too.