I ended up with two surprising results. First, on both the Phenom frontend and the Core2Quad compute node, the code ran faster when I told OpenMP to use 8 threads rather than 4, even though each machine has only 4 cores. This is surprising because the work is very memory-bound, and normally the extra overhead of context switching between threads would slow it down, or at best leave it no faster. Yet it ran faster, by a fair bit. I can only think there is some odd cache effect at work, or perhaps the extra threads help hide memory stalls, but it warrants more investigation.
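For anyone who wants to poke at the same effect, here is a minimal sketch of the kind of oversubscription test I mean (not my actual benchmark code): a memory-bound loop over arrays far larger than cache, timed at 4 and then 8 threads.

```c
/* Sketch of an oversubscription test: a memory-bound "triad" loop,
 * timed at 4 and 8 threads. Compile with: gcc -O2 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)  /* 16M doubles per array, well beyond any cache */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    int counts[] = { 4, 8 };
    for (int t = 0; t < 2; t++) {
        omp_set_num_threads(counts[t]);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];  /* 2 loads + 1 store per element */
        double t1 = omp_get_wtime();
        printf("%d threads: %.3f s\n", counts[t], t1 - t0);
    }
    free(a); free(b); free(c);
    return 0;
}
```

Run each thread count a few times and take the best; the first pass is skewed by thread startup and first-touch page allocation.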
Second, the Phenom (at 2.2 GHz with DDR2 RAM, if I remember correctly) kicked the 2.33 GHz Core2Quad's butt, beating it by a quarter to a third in every test. And, yes, the C2Q has fast DDR3 RAM and a fast front-side bus. The win itself doesn't surprise me much, since AMD has always done well at floating point for scientific computing, but the margin of victory does. Good job, AMD!
I've been having trouble getting CUDA working on the frontend node. I put the test board I received under NDA last year into that node, hoping to reuse it rather than stick one of the Teslas in there; but although the driver is installed and seemingly happy, it isn't creating the /dev entries, so CUDA programs can't run. I'll try the compute node today, since it has the official Tesla card. Perhaps the driver just doesn't want to deal with a test board, in which case I'll swap a Tesla card into the frontend.
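A quick way to tell whether the runtime can see the board at all is a tiny device-enumeration program; when the /dev/nvidia* nodes are missing, the very first runtime call fails rather than reporting the card. This is just a diagnostic sketch using the standard CUDA runtime API, not a fix:

```c
/* Device-visibility check; compile with: nvcc check.cu -o check */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        /* Typical symptom of missing /dev entries: an error here
         * instead of a device count. */
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```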
Unfortunately, while I was getting this system working, my old cluster frontend failed and the network connections in the server room got all messed up, so I need to fix those things too. Sometime...