Sunday, April 5, 2009

Brains vs. Brawn

In the world of parallel computing, sometimes bigger isn't better. While some algorithms are well suited to vast parallel processing, like that provided by CUDA on the Tesla C1060, an algorithmic optimization can sometimes make the problem run faster on fewer processors. The N-Body problem is a prime example. It's a classical problem in astrophysics, molecular dynamics, and even graph drawing. The problem state consists of a number (n) of bodies, which may be stars or molecules or whatever. These bodies exert forces on each other (gravity, electromagnetic forces, etc.), which accelerate them and cause movement. The force on each body (from all the other bodies) is computed for each time step, then acceleration and movement can be calculated. That means O(n^2) force calculations are needed for each time step, and the simulation may run for many time steps. So, for 50000 bodies, roughly 2.5 billion force calculations are needed per time step.
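To make the arithmetic concrete, here is a minimal sketch of the all-pairs force accumulation (the Body struct, G, and the EPS softening constant are illustrative names of my own, not code from the class):

#include <cmath>
#include <vector>

struct Body {
    double x, y, z;    // position
    double mass;
    double fx, fy, fz; // force accumulated this time step
};

const double G   = 6.674e-11; // gravitational constant
const double EPS = 1e-9;      // softening term to avoid division by zero

void compute_forces(std::vector<Body>& bodies) {
    for (size_t i = 0; i < bodies.size(); ++i) {
        bodies[i].fx = bodies[i].fy = bodies[i].fz = 0.0;
        for (size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double dist = std::sqrt(dx*dx + dy*dy + dz*dz + EPS);
            // F = G * m_i * m_j / r^2, directed along (dx, dy, dz) / r
            double f = G * bodies[i].mass * bodies[j].mass / (dist * dist);
            bodies[i].fx += f * dx / dist;
            bodies[i].fy += f * dy / dist;
            bodies[i].fz += f * dz / dist;
        }
    }
}

The doubly nested loop is the whole story: n * (n - 1) interactions every time step, which for n = 50000 gives the 2.5 billion figure above.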

The good news is that these force calculations are completely independent of each other, which means they can be done in parallel, so this type of problem is extremely well suited to CUDA computing. Each thread can compute the force on a single body by stepping through all the other bodies and determining the interaction with each. Using blocking and the fast shared memory as a cache, this can be quite an efficient operation in CUDA.
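Here is a rough sketch of what that kernel can look like, in the usual one-thread-per-body, shared-memory-tile style (the names, the TILE size, and the float4 position/mass packing are my choices for illustration, not the exact code from the course):

#define TILE 256  // threads per block == bodies per shared-memory tile

// pos: xyz = position, w = mass. accel accumulates sum_j m_j * r_ij / |r_ij|^3,
// so multiply by G afterwards (and by m_i if you want force rather than acceleration).
__global__ void nbody_forces(const float4* pos, float3* accel, int n, float eps)
{
    __shared__ float4 tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 myPos = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 acc = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        // Cooperatively load the next tile of bodies into shared memory.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        // Accumulate interactions with every body in the tile.
        for (int k = 0; k < TILE && base + k < n; ++k) {
            float dx = tile[k].x - myPos.x;
            float dy = tile[k].y - myPos.y;
            float dz = tile[k].z - myPos.z;
            float distSqr = dx*dx + dy*dy + dz*dz + eps;
            float invDist = rsqrtf(distSqr);
            float s = tile[k].w * invDist * invDist * invDist; // m_j / r^3
            acc.x += dx * s;  acc.y += dy * s;  acc.z += dz * s;
        }
        __syncthreads();
    }

    if (i < n) accel[i] = acc;
}

Launched as nbody_forces<<<(n + TILE - 1) / TILE, TILE>>>(...), 50000 bodies need only 196 blocks of 256 threads, and each position is read from global memory once per block rather than once per thread.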

On the other hand, there are other ways to solve this type of problem that may be faster. The MDGRAPE system in Japan was designed specifically to perform molecular dynamics computations, and achieved petaFLOPS years ago, well before the current crop of petaFLOPS machines became operational. It never made the Top 500 list because it wasn't general purpose and wouldn't run the HPL benchmark the Top 500 uses. So a special architecture designed to accelerate the N-Body-type problem is one way to speed things up.

Algorithmic enhancements are another way to speed up N-Body. If we observe that distant bodies exert little influence, then perhaps we can group a bunch of distant bodies, treat them as a single distant "body" with the corresponding mass and center of mass, and compute the force against that aggregate. This is the Barnes-Hut optimization. In Barnes-Hut, bodies are grouped into regions of decreasing size until, at the smallest level, each group contains only a single body. A tree stores these groups, with larger groups at the top and smaller, more refined groups toward the bottom. A recursive algorithm builds the tree, and then the force calculations can be performed against the largest portions of the tree that meet certain distance criteria. Barnes-Hut drastically reduces the number of force calculations to O(n log n), so rather than needing 2.5 billion force calculations per iteration, the number shrinks to fewer than 3 million! The catch is that, as bodies move, the tree needs to be rebuilt for each iteration (though that could perhaps be optimized). Barnes-Hut is an approximation, of course, so the results will vary slightly from the O(n^2) version.
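The force-calculation half of Barnes-Hut is short enough to sketch. Assuming an octree node that stores its cell size, total mass, and center of mass (my own layout, with a conventional opening-angle threshold THETA, not the class's actual data structure), the tree walk looks roughly like this:

#include <cmath>

struct Node {
    double cx, cy, cz;   // center of mass of the bodies in this cell
    double mass;         // total mass of the cell
    double size;         // side length of the cell
    Node*  child[8];     // null for empty children
    int    bodyIndex;    // index of the single body in a leaf, else -1
};

const double G     = 6.674e-11;
const double THETA = 0.5;    // opening-angle threshold: smaller = more accurate
const double EPS   = 1e-9;

// Accumulate the force on body i (position px,py,pz, mass m) from the subtree at n.
void bh_force(const Node* n, int i, double px, double py, double pz, double m,
              double& fx, double& fy, double& fz)
{
    if (!n || n->mass == 0.0 || n->bodyIndex == i) return;

    double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
    double dist = std::sqrt(dx*dx + dy*dy + dz*dz + EPS);

    bool isLeaf = (n->bodyIndex >= 0);
    if (isLeaf || n->size / dist < THETA) {
        // Far enough away (or a single body): treat the whole cell as one point mass.
        double f = G * m * n->mass / (dist * dist);
        fx += f * dx / dist;  fy += f * dy / dist;  fz += f * dz / dist;
    } else {
        // Too close: open the cell and recurse into its children.
        for (int c = 0; c < 8; ++c)
            bh_force(n->child[c], i, px, py, pz, m, fx, fy, fz);
    }
}

The opening-angle test (size / dist < THETA) is where the accuracy/speed trade-off lives: THETA = 0 never accepts a group and degenerates back to the exact O(n^2) sum, while larger values accept coarser groups and touch far fewer nodes.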

I assigned my students to build three different versions of N-Body for my parallel architecture class last quarter: one using OpenMP to speed up the force calculations, one using CUDA, and one using Barnes-Hut to reduce the number of force calculations needed. I wanted the students to see the power of CUDA to speed up massively parallel computations, but I also wanted them to see whether a smarter algorithm could overcome the benefit of tremendous hardware parallelism. As expected, the Core2Quads with OpenMP did get good speedups in the force calculations, but it wasn't enough to make the problem fast: with 50000 bodies it took about 15 seconds per time step. The CUDA version was much faster, bringing it down to about 2 seconds per time step on the Tesla cards.
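For reference, the OpenMP version of the force loop is essentially a one-pragma change from the serial sketch above (again my own sketch, reusing the Body struct, G, and EPS from earlier). Each thread owns a disjoint set of i values, so no locking is needed on the accumulators:

#include <cmath>
#include <vector>

void compute_forces_omp(std::vector<Body>& bodies) {
    const int n = static_cast<int>(bodies.size());
    #pragma omp parallel for schedule(static)   // uniform work per i, so static is fine
    for (int i = 0; i < n; ++i) {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double dist = std::sqrt(dx*dx + dy*dy + dz*dz + EPS);
            double f = G * bodies[i].mass * bodies[j].mass / (dist * dist);
            fx += f * dx / dist;  fy += f * dy / dist;  fz += f * dz / dist;
        }
        bodies[i].fx = fx;  bodies[i].fy = fy;  bodies[i].fz = fz;
    }
}

Even with good scaling across four cores, 2.5 billion interactions per step is simply a lot of work, which is why it still took about 15 seconds per step.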

The problem with Barnes-Hut is that it can be done badly. Building the tree is potentially very expensive, particularly if dynamic allocation is used. I assigned the students to do the tree building in parallel, and parallel memory allocation introduces locking delays. I also asked them to use self-scheduling, so managing a task queue meant more locks to slow things down. Some students managed to get no performance improvement in parallel tree building at all, while a few got some. I managed to get about 30-50% improvement by trading some parallelism for speedy recursive calls and avoiding the locks. In any case, the fastest Barnes-Hut implementations we came up with could run 10 time steps in the same 2 seconds it took the CUDA cards to run 1 time step.
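To give a flavor of the "avoid the locks" part, here is one way that trade can look (a sketch of the general idea, not necessarily the exact scheme any of us used): parallelize only across the root's eight octants and give each of those subtree builds its own pre-allocated node arena, so allocation is a lock-free pointer bump and everything below the top level is an ordinary, fast recursive call. This reuses the Node struct from the Barnes-Hut sketch above, and build_subtree stands in for the usual serial recursive construction:

#include <omp.h>
#include <vector>

// A per-subtree bump allocator: the arena is touched by exactly one thread,
// so alloc() needs no lock and never calls malloc in the hot path.
struct NodeArena {
    std::vector<Node> slab;   // all nodes for one subtree, allocated up front
    size_t used = 0;
    explicit NodeArena(size_t capacity) : slab(capacity) {}
    Node* alloc() { return &slab[used++]; }
};

// The ordinary serial recursive build over one octant's bodies (not shown here).
Node* build_subtree(const std::vector<int>& bodyIds, NodeArena& arena);

// Parallelism stops at the eight top-level subtrees; the caller owns the arenas
// so the nodes outlive this call.
void build_top_level(Node* root, const std::vector<int> octantBodies[8],
                     NodeArena arenas[8])
{
    #pragma omp parallel for schedule(dynamic)
    for (int c = 0; c < 8; ++c)
        root->child[c] = build_subtree(octantBodies[c], arenas[c]);
}

With at most eight-way parallelism this can leave cores idle when the octants are unbalanced, but it removes the allocator and task-queue locks entirely, which is exactly the trade between parallelism and speedy recursive calls described above.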

So the final tally showed that CUDA provides an order of magnitude improvement over OpenMP on a Core2Quad for the O(n^2) case, but the O(n log n) Barnes-Hut version is an order of magnitude faster still! It's nice to see brains triumph over brawn.