Tuesday, August 18, 2009

On Hiatus

Because of intense Hiperwall development (we're nearly done with a new release) and lots of other busy stuff, I'm going to put this blog on hiatus for a bit.

I will be making posts on my site (http://www.stephenjenks.com/) regarding computing and other stuff.

I really enjoy computer architecture and parallel computing, so I won't completely give up on this blog just yet, but hope to get back to it during the academic year.

Tuesday, June 23, 2009

Virtual Machines

A slightly off-topic post, but still computing-oriented...

Many people have heard of virtual machines and some have even used them in various capacities. I've used VMWare virtual machines for years quite happily, but have had some recent experiences that have changed my mind (a bit).

I use VMs on my TabletPC so I can run Linux, which I'm more comfortable with for software development than Windows. I also run VMs on my MacBook Pro so I can run Windows for various reasons, including Hiperwall software development. When I got the MacBook Pro, I bought VMWare Fusion, because I figured it would be the best. It is very good, but didn't support OpenGL, which hurt the Hiperwall software development, so I didn't use it for a while. Then I saw that Parallels Desktop 4 supports OpenGL, so I bought it and it is great. Things seem more responsive than VMWare Fusion, and more importantly, it supports both DirectX and OpenGL. The only problem I had was that the installation wizard tried to be too smart and handle everything, but got hung up on some dialog boxes that it didn't expect. I sorted it out manually, though, so all is well. Parallels Desktop 4 works very well for me.

On my PCs, I've been using VMWare Server for years, because it is free and fast (once started). Starting it takes forever, though. Really. On my Core2Quad machine at home, it literally stops the clock update on the display and my Logitech G15 keyboard for several minutes as the VM initializes. On my poor TabletPC with a slower disk and (until recently) less memory, it was nearly interminable and intolerable. And, of course, VMWare Server didn't support much in the way of display devices and such, because it was for servers. So recently, I tried Sun's VirtualBox. I tried it a year or so ago and it was terrible, but it has now become really great. It starts right away, is very responsive, and supports OpenGL, so I can run several virtual Hiperwall tiles on one PC for testing! It's just great.

That doesn't mean you should throw away VMWare Server, though, because it does one thing that the others don't. It can let the guest OS use two processors. This is important for multithreaded apps and when I demonstrate parallel programming in my classes. So it still has its place, but not as my everyday VM.

These VMs execute native x86 code directly on the processor, but all I/O (disk reads and writes, network packets, even sound and graphics) is on virtual devices. These virtual devices then use your machine's physical devices to perform the real work requested. A problem is that interrupt handling is expensive for VMs, so most VMs batch interrupts and process them in groups. This means that responsiveness is often not as good as it should be, even though average throughput is reasonable. So playing a game in a VM is probably doable, but it may suffer from stutters and delays due to the nature of the VM's I/O handling.

So VMs are tremendously useful, and though VMWare makes terrific VMs and has been around for a long time, look at some of the others, too. They may meet your needs even better.

Monday, April 13, 2009

New CUDA

I just saw the announcement that the latest CUDA (2.2) contains a debugger (yay) and a profiler (I hope it's better than the old one) amidst a few other goodies. Check it out here.

If you're interested in the latest NVIDIA CUDA news, follow nvidiadeveloper on Twitter. They don't send too much junk.

Sunday, April 5, 2009

Brains vs. Brawn

In the world of parallel computing, sometimes bigger isn't better. While some algorithms are well suited to vast parallel processing, like that provided by CUDA on the Tesla C1060, sometimes an algorithmic optimization may make the problem run faster on fewer processors. The N-Body problem is a prime example of this. The N-Body problem is a classical problem in astrophysics, molecular dynamics, and even graph drawing. The problem state consists of a number (n) of bodies, which may be stars or molecules or whatever. These bodies interact with each other via gravity or electromagnetic forces, etc. They exert these forces on each other, thus accelerating and causing movement. The forces on each body (from all the other bodies) are computed for each time step, then acceleration and movement can be calculated. This process means that O(n^2) force calculations are needed for each time step, and the simulation may run for many time steps. So, for 50000 bodies, roughly 2.5 billion force calculations are needed per time step.
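As a rough illustration of the all-pairs force loop described above, here is a minimal Python sketch. It assumes 2D positions, gravity only, and a hypothetical `direct_forces` helper with a gravitational constant of 1 and a small softening term; none of those details come from the class assignment.

```python
import math

def direct_forces(positions, masses, g=1.0, softening=1e-9):
    """Return (forces, interactions) using the direct O(n^2) all-pairs sum.

    positions: list of (x, y) tuples; masses: list of floats.
    g and softening are illustrative assumptions, not values from the post.
    """
    n = len(positions)
    forces = []
    interactions = 0
    for i in range(n):
        xi, yi = positions[i]
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            # Softening keeps the division finite if two bodies nearly coincide.
            dist_sq = dx * dx + dy * dy + softening
            dist = math.sqrt(dist_sq)
            magnitude = g * masses[i] * masses[j] / dist_sq
            fx += magnitude * dx / dist
            fy += magnitude * dy / dist
            interactions += 1
        forces.append((fx, fy))
    return forces, interactions
```

The inner loop runs n - 1 times for each of the n bodies, so 50000 bodies need 50000 * 49999 = 2,499,950,000 force calculations per time step, which is where the figure of roughly 2.5 billion comes from.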

The good news is that these force calculations are completely independent of each other, which means they can be done in parallel, so this type of problem is extremely well suited to CUDA computing. Each thread can compute the force on a single body by stepping through all the other bodies and determining the interaction with each. Using blocking and the fast shared memory as a cache, this can be quite an efficient operation in CUDA.

On the other hand, there are other ways to solve this type of problem that may be faster. The MDGRAPE system in Japan was designed specifically to perform molecular dynamics computations, and achieved petaFLOPS years ago, well before the current crop of petaFLOPS machines became operational. The reason it never made the Top 500 list is that it wasn't general purpose and wouldn't run the HPL benchmark Top 500 uses. So a special architecture designed to accelerate the N-Body-type problem is one way to speed things up.

Algorithmic enhancements are another way to speed up N-Body. If we observe that distant bodies exert little influence, then perhaps we can group a bunch of distant bodies, treat them as a single distant "body" with corresponding mass and center of mass, then compute the force in aggregate. This approach is used by the Barnes-Hut optimization. In Barnes-Hut, bodies are grouped into various combinations, each decreasing in size, until, at the smallest level, each group only contains a single body. A tree is used to store these groups, with larger groups at the top of the tree and smaller, refined, groups towards the bottom. A recursive algorithm is used to build the tree, then the force calculations can be performed against the largest portions of the tree that meet certain distance criteria. Barnes-Hut drastically reduces the number of force calculations to O(n log n), so rather than needing 2.5 billion force calculations per iteration, the number shrinks to fewer than 3 million! The problem is that, as bodies move, the tree needs to be rebuilt for each iteration (though that could be optimized, perhaps). Barnes-Hut is an approximation, of course, so the results will vary slightly from the O(n^2) version.
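The tree idea can be sketched in a few dozen lines of Python. This is a minimal 2D quadtree version written for illustration, not the code from the class. The names `Node`, `build_tree`, and `force_on` are hypothetical, as is the opening criterion (cell width divided by distance, compared against a threshold `theta`). The sketch assumes all bodies lie inside the unit square, have positive mass, and that no two bodies share a position.

```python
import math

class Node:
    """A square cell centered at (cx, cy) with half-width `half`."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = 0.0            # total mass of all bodies in this cell
        self.mx = self.my = 0.0    # mass-weighted position sums
        self.body = None           # (x, y, m) while the cell holds one body
        self.children = None       # four sub-cells once the cell is split

    def _child(self, x, y):
        return self.children[2 * (x >= self.cx) + (y >= self.cy)]

    def insert(self, x, y, m):
        if self.mass == 0.0:                # empty cell: keep the body here
            self.body = (x, y, m)
        else:
            if self.children is None:       # occupied leaf: split it first
                h = self.half / 2
                self.children = [Node(self.cx + sx * h, self.cy + sy * h, h)
                                 for sx in (-1, 1) for sy in (-1, 1)]
                old, self.body = self.body, None
                self._child(old[0], old[1]).insert(*old)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

def build_tree(bodies):
    """Build a quadtree over the unit square (an assumption of this sketch)."""
    root = Node(0.5, 0.5, 0.5)
    for x, y, m in bodies:
        root.insert(x, y, m)
    return root

def force_on(node, x, y, m, theta=0.5, g=1.0):
    """Return (fx, fy, interactions) for the body of mass m at (x, y)."""
    if node.mass == 0.0:
        return 0.0, 0.0, 0
    if node.children is None:               # leaf: exact pairwise force
        bx, by, bm = node.body
        if (bx, by) == (x, y):
            return 0.0, 0.0, 0              # skip the body itself
        tx, ty, tm = bx, by, bm
    else:
        tx, ty, tm = node.mx / node.mass, node.my / node.mass, node.mass
        dist = math.hypot(tx - x, ty - y)
        if not (dist > 0 and (2 * node.half) / dist < theta):
            # Cell is too close or too large: descend into the sub-cells.
            fx = fy = 0.0
            count = 0
            for child in node.children:
                cfx, cfy, c = force_on(child, x, y, m, theta, g)
                fx, fy, count = fx + cfx, fy + cfy, count + c
            return fx, fy, count
    # Either a single body or a distant group treated as one "body".
    dx, dy = tx - x, ty - y
    dist = math.hypot(dx, dy)
    f = g * m * tm / dist ** 2
    return f * dx / dist, f * dy / dist, 1
```

Setting `theta` to 0 disables grouping, so the result matches the direct sum. Larger values trade accuracy for fewer force calculations, because a whole distant cell then counts as a single interaction.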

I assigned my students to build 3 different versions of N-Body for my parallel architecture class last quarter: one using OpenMP to speed the force calculations, one using CUDA, and one using Barnes-Hut to reduce their number. I wanted the students to see the power of CUDA to speed massively parallel computations, but I also wanted them to see whether a smarter algorithm could overcome the benefit of tremendous hardware parallelism. As expected, the Core2Quads with OpenMP did get good speedups in the force calculations, but it wasn't enough to make the problem fast: with 50000 bodies it took about 15 seconds per time step. The CUDA version was much faster, bringing it down to about 2 seconds per time step on the Tesla cards.

The problem with Barnes-Hut is that it can be done badly. Building the tree is potentially very expensive, particularly if dynamic allocation is used. I assigned the students to do the tree building in parallel, and parallel memory allocation introduces locking delays. I also asked them to use self-scheduling, so managing a task queue meant more locks to slow things down. Some students managed to get no performance improvement in parallel tree building at all, while a few got some. I managed to get about 30-50% improvement by trading some parallelism for speedy recursive calls and avoiding the locks. In any case, the fastest Barnes-Hut implementations we came up with could run 10 time steps in the same 2 seconds it took the CUDA cards to run 1 time step.

So the final tally showed that CUDA provides an order of magnitude improvement over OpenMP on Core2Quad for the O(n^2) case, but the O(n log n) Barnes-Hut version is an order of magnitude faster than that! So it's nice to see that brains triumph over brawn.

Wednesday, March 25, 2009

OnLive

A company called OnLive has been making the news lately with their announcement of online purchases of games and a new model of gaming. This new model is intriguing, because it hooks our gaming experience into cloud computing. Essentially, we will game via streaming video, with the game video being rendered on a remote "cloud" computer somewhere. This has lots of advantages in terms of not needing noisy, expensive top-of-the-line hardware to play fancy games. Instead, THEY have the fancy hardware and render the game for us, then stream it to our browser or our TV via a presumably inexpensive console.

Will this work? There are many challenges, with latency being the biggest one. Controller events need to be captured, sent to the server system, where they impact the gameplay, which then causes feedback onscreen, which is streamed to you. This round-trip latency may be noticeable if it is much more than your brain's control loop time. Existing multiplayer games experience this as "lag" and it is very annoying, so it may be a problem here too. The LA Times quotes an OnLive exec as saying they want to bring that latency down to 1 millisecond. While they may be able to use prediction and other things to reduce perceived latency, actual packet transfer time is bounded by the speed of light. This means a packet could, at most, travel on the order of 186,000 miles in a second, or 186 miles in a millisecond. (Best case, of course, as there is overhead and signal propagation is often slower than the speed of light.) Therefore, unless they place their servers in every city, 1 ms doesn't make much sense. But then again, perhaps the newspaper misquoted or misunderstood and I am just interpreting it wrong.
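The speed-of-light bound is easy to check with a little arithmetic. The sketch below assumes the latency budget covers a full round trip (the quoted figure does not say), and its `propagation_factor` parameter is a hypothetical knob for links slower than light in a vacuum. The 2/3 used for optical fiber is only a rough approximation.

```python
# Rounded speed of light, matching the 186,000 miles per second used above.
MILES_PER_MILLISECOND = 186.0

def max_server_distance_miles(round_trip_ms, propagation_factor=1.0):
    """Farthest a server can be if the signal must go out and come back in time.

    propagation_factor scales the signal speed relative to light in a vacuum
    (1.0 is the unreachable best case). Processing and queuing delays are
    ignored, so real limits are tighter than this.
    """
    return MILES_PER_MILLISECOND * propagation_factor * round_trip_ms / 2.0

best_case = max_server_distance_miles(1.0)            # vacuum, 1 ms round trip
fiber_case = max_server_distance_miles(1.0, 2 / 3)    # assumed fiber speed
print(f"Best case: {best_case:.0f} miles, fiber estimate: {fiber_case:.0f} miles")
```

Under these assumptions a 1 ms round trip puts the server within about 93 miles at best, and closer once fiber speed and processing time are counted.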

I wish OnLive well and look forward to seeing how well it works.

Sunday, March 15, 2009

Cluster and CUDA troubles abound

Running this cluster for my class is more trouble than it's worth and I won't do it again. Research cluster, yes - teaching cluster, no.

Yesterday, one of the students ran the frontend node out of memory. Of course, that meant I couldn't log in to reboot the thing, so I had to go to campus, find which entrance of the building is open on weekends and isn't closed by construction, and press the reset switch. Very annoying. I am disappointed that Linux is that fragile and that mechanisms aren't in place to prevent this sort of problem. Surely crashing that program and freeing the memory should have solved it, but it obviously didn't.

An even more annoying problem has to do with the Tesla cards. It looks like the drivers keep crashing. After a while, suddenly programs are no longer able to open the appropriate /dev entries for the NVIDIA cards. Clearly the students could be running buggy programs and such, but that should be expected. I think the drivers must not be recovering properly after a crash or something. Even running nvidia-smi won't fix it. A reboot will, but I discovered "rmmod nvidia" will remove the driver module and clear the problem, then nvidia-smi will cause the driver to reload and reinitialize. But bloody annoying, because the root user needs to do it, and I'm not giving the students root...

Wednesday, March 11, 2009

New cluster at UCI

Though it isn't an asymmetric system, there is a new cluster at UCI for researchers and grad students. See the following link for details:


This is good news and seems to be a nice resource for the UCI HPC community.