Monday, October 18, 2010

Atom Observations

I just put a post up on my homepage about my experiences with a new Intel Atom + NVIDIA Ion2 system.

Since it isn't about asymmetric computing, I won't repeat it here, but take a look if you are interested. (Hint: you won't be putting Atoms in your cluster anytime soon if you do floating point.)

Saturday, September 25, 2010

Cheap GPU Computing may be over

For those of us interested in GPU computing, Greg Pfister has written an interesting article entitled "Nvidia-based Cheap Supercomputing Coming to an End" commenting on the future of NVIDIA's supercomputing technology that has been subsidized by gamers and commodity GPUs. It looks like Intel's Sandy Bridge architecture may end that.

If you don't read Greg Pfister's Perils of Parallel blog, you should. He's been doing parallel computing for a long time and is very good at exposing the pitfalls and hidden costs of parallelism.

Tuesday, August 31, 2010

A year later

Well, it has been a year since my last post on here. A lot has happened since then:

I have left UCI to become Chief Scientist at Hiperwall Inc., where we make tiled display wall software. We are just about to release a new and greatly enhanced software version, which is both exciting and tiring, with all the testing and last-minute tweaks.

I have moved most of my research material from the old UCI server to my personal site, though it isn't fully organized yet. That is where I'll be updating things, so please bookmark it and keep watching.

I'm not done with Asymmetric Computing, as we use it for various things in Hiperwall, so I will continue to explore when I have time. For the moment, however, I'm learning how to program the iPhone and iPad, because I have fun plans for them.

Tuesday, August 18, 2009

On Hiatus

Because of intense Hiperwall development (we're nearly done with a new release) and lots of other busy stuff, I'm going to put this blog on hiatus for a bit.

I will be making posts on my site regarding computing and other topics.

I really enjoy computer architecture and parallel computing, so I won't completely give up on this blog just yet; I hope to get back to it during the academic year.

Tuesday, June 23, 2009

Virtual Machines

A slightly off-topic post, but still computing-oriented...

Many people have heard of virtual machines and some have even used them in various capacities. I've used VMWare virtual machines for years quite happily, but have had some recent experiences that have changed my mind (a bit).

I use VMs on my TabletPC so I can run Linux, which I'm more comfortable with for software development than Windows. I also run VMs on my MacBook Pro so I can run Windows for various reasons, including Hiperwall software development. When I got the MacBook Pro, I bought VMWare Fusion, because I figured it would be the best. It is very good, but it didn't support OpenGL, which hurt the Hiperwall software development, so I didn't use it for a while. Then I saw that Parallels Desktop 4 supports OpenGL, so I bought it, and it is great. Things seem more responsive than in VMWare Fusion, and more importantly, it supports both DirectX and OpenGL. The only problem I had was that the installation wizard tried to be too smart and handle everything, but got hung up on some dialog boxes it didn't expect. I sorted it out manually, though, so all is well. Parallels Desktop 4 works very well for me.

On my PCs, I've been using VMWare Server for years, because it is free and fast (once started). Starting it takes forever, though. Really. On my Core2Quad machine at home, it literally stops the clock update on the display and my Logitech G15 keyboard for several minutes as the VM initializes. On my poor TabletPC, with its slower disk and (until recently) less memory, it was nearly interminable and intolerable. And, of course, VMWare Server didn't support much in the way of display devices and such, because it was built for servers. So recently, I tried Sun's VirtualBox. I tried it a year or so ago and it was terrible, but it has now become really great. It starts right away, is very responsive, and supports OpenGL, so I can run several virtual Hiperwall tiles on one PC for testing! It's just great.

That doesn't mean you should throw away VMWare Server, though, because it does one thing the others don't: it can let the guest OS use two processors. This is important for multithreaded apps and when I demonstrate parallel programming in my classes. So it still has its place, but not as my everyday VM.

These VMs execute native x86 code directly on the processor, but all I/O (disk reads and writes, network packets, even sound and graphics) is on virtual devices. These virtual devices then use your machine's physical devices to perform the real work requested. A problem is that interrupt handling is expensive for VMs, so most VMs batch interrupts and process them in groups. This means that responsiveness is often not as good as it should be, even though average throughput is reasonable. So playing a game in a VM is probably doable, but it may suffer from stutters and delays due to the nature of the VM's I/O handling.

So VMs are tremendously useful, and though VMWare makes terrific VMs and has been around for a long time, look at some of the others, too. They may meet your needs even better.

Monday, April 13, 2009

CUDA 2.2

I just saw the announcement that the latest CUDA (2.2) contains a debugger (yay) and a profiler (I hope it's better than the old one) amidst a few other goodies. Check it out here.

If you're interested in the latest NVIDIA CUDA news, follow nvidiadeveloper on Twitter. They don't send too much junk.

Sunday, April 5, 2009

Brains vs. Brawn

In the world of parallel computing, sometimes bigger isn't better. While some algorithms are well suited to vast parallel processing, like that provided by CUDA on the Tesla C1060, sometimes an algorithmic optimization may make the problem run faster on fewer processors. The N-Body problem is a prime example of this. The N-Body problem is a classical problem in astrophysics, molecular dynamics, and even graph drawing. The problem state consists of a number (n) of bodies, which may be stars or molecules or whatever. These bodies interact with each other via gravity or electromagnetic forces, exerting forces on each other that cause acceleration and movement. For each time step, the forces on each body (from all the other bodies) are computed, and then acceleration and movement can be calculated. This means that O(n^2) force calculations are needed for each time step, and the simulation may run for many time steps. So, for 50000 bodies, roughly 2.5 billion force calculations are needed per time step.

The good news is that these force calculations are completely independent of each other, which means they can be done in parallel, so this type of problem is extremely well suited to CUDA computing. Each thread can compute the force on a single body by stepping through all the other bodies and determining the interaction with each. Using blocking and the fast shared memory as a cache, this can be quite an efficient operation in CUDA.

On the other hand, there are other ways to solve this type of problem that may be faster. The MDGRAPE system in Japan was designed specifically to perform molecular dynamics computations, and it achieved petaFLOPS years ago, well before the current crop of petaFLOPS machines became operational. The reason it never made the Top 500 list is that it wasn't general purpose and couldn't run the HPL benchmark the Top 500 uses. So a special architecture designed to accelerate N-Body-type problems is one way to speed things up.

Algorithmic enhancements are another way to speed up N-Body. If we observe that distant bodies exert little influence, then perhaps we can group a bunch of distant bodies, treat them as a single distant "body" with corresponding mass and center of mass, and compute the force in aggregate. This is the approach taken by the Barnes-Hut optimization. In Barnes-Hut, bodies are grouped into various combinations, each decreasing in size, until, at the smallest level, each group contains only a single body. A tree is used to store these groups, with larger groups at the top of the tree and smaller, more refined groups toward the bottom. A recursive algorithm builds the tree, and then the force calculations can be performed against the largest portions of the tree that meet certain distance criteria. Barnes-Hut drastically reduces the number of force calculations to O(n log n), so rather than needing 2.5 billion force calculations per iteration, the number shrinks to fewer than 3 million! The problem is that, as bodies move, the tree must be rebuilt for each iteration (though that could perhaps be optimized). Barnes-Hut is an approximation, of course, so the results will vary slightly from the O(n^2) version.

I assigned my students to build three different versions of N-Body for my parallel architecture class last quarter: one using OpenMP to speed up the force calculations, one using CUDA, and one using Barnes-Hut to reduce their number. I wanted the students to see the power of CUDA to speed up massively parallel computations, but I also wanted them to see whether a smarter algorithm could overcome the benefit of tremendous hardware parallelism. As expected, the Core2Quads with OpenMP did get good speedups in the force calculations, but it wasn't enough to make the problem fast: with 50000 bodies it took about 15 seconds per time step. The CUDA version was much faster, bringing it down to about 2 seconds per time step on the Tesla cards.

The problem with Barnes-Hut is that it can be done badly. Building the tree is potentially very expensive, particularly if dynamic allocation is used. I assigned the students to do the tree building in parallel, and parallel memory allocation introduces locking delays. I also asked them to use self-scheduling, so managing a task queue meant more locks to slow things down. Some students managed to get no performance improvement in parallel tree building at all, while a few got some. I managed to get about 30-50% improvement by trading some parallelism for speedy recursive calls and avoiding the locks. In any case, the fastest Barnes-Hut implementations we came up with could run 10 time steps in the same 2 seconds it took the CUDA cards to run 1 time step.

So the final tally showed that CUDA provides an order of magnitude improvement over OpenMP on Core2Quad for the O(n^2) case, but the O(n log n) Barnes-Hut version is an order of magnitude faster than that! So it's nice to see that brains triumph over brawn.