Tuesday, August 18, 2009

On Hiatus

Because of intense Hiperwall development (we're nearly done with a new release) and lots of other busy stuff, I'm going to put this blog on hiatus for a bit.

I will be making posts on my site (http://www.stephenjenks.com/) regarding computing and other stuff.

I really enjoy computer architecture and parallel computing, so I won't completely give up on this blog just yet, but hope to get back to it during the academic year.

Tuesday, June 23, 2009

Virtual Machines

A slightly off-topic post, but still computing-oriented...

Many people have heard of virtual machines and some have even used them in various capacities. I've used VMWare virtual machines for years quite happily, but have had some recent experiences that have changed my mind (a bit).

I use VMs on my TabletPC so I can run Linux, which I'm more comfortable with for software development than Windows. I also run VMs on my MacBook Pro so I can run Windows for various reasons, including Hiperwall software development. When I got the MacBook Pro, I bought VMWare Fusion, because I figured it would be the best. It is very good, but didn't support OpenGL, which hurt the Hiperwall software development, so I didn't use it for a while. Then I saw that Parallels Desktop 4 supports OpenGL, so I bought it and it is great. Things seem more responsive than VMWare Fusion, and more importantly, it supports both DirectX and OpenGL. The only problem I had was that the installation wizard tried to be too smart and handle everything, but got hung up on some dialog boxes that it didn't expect. I sorted it out manually, though, so all is well. Parallels Desktop 4 works very well for me.

On my PCs, I've been using VMWare Server for years, because it is free and fast (once started). Starting it takes forever, though. Really. On my Core2Quad machine at home, it literally stops the clock update on the display and my Logitech G15 keyboard for several minutes as the VM initializes. On my poor TabletPC with slower disk and (until recently) less memory, it was nearly interminable and intolerable. And, of course, VMWare Server didn't support much in the way of display devices and such, because it was for servers. So recently, I tried Sun's VirtualBox. I tried it a year or so ago and it was terrible, but it has now become really great. It starts right away, is very responsive, and supports OpenGL, so I can run several virtual Hiperwall tiles on one PC for testing! It's just great.

That doesn't mean you should throw away VMWare Server, though, because it does one thing that the others don't. It can let the guest OS use two processors. This is important for multithreaded apps and when I demonstrate parallel programming in my classes. So it still has its place, but not as my everyday VM.

These VMs execute native x86 code directly on the processor, but all I/O (disk reads and writes, network packets, even sound and graphics) goes through virtual devices. These virtual devices then use your machine's physical devices to perform the real work requested. A problem is that interrupt handling is expensive for VMs, so most VMs batch interrupts and process them in groups. This means that responsiveness is often not as good as it should be, even though average throughput is reasonable. So playing a game in a VM is probably doable, but it may suffer from stutters and delays due to the nature of the VM's I/O handling.

So VMs are tremendously useful, and though VMWare makes terrific VMs and has been around for a long time, look at some of the others, too. They may meet your needs even better.

Monday, April 13, 2009

New CUDA

I just saw the announcement that the latest CUDA (2.2) contains a debugger (yay) and a profiler (I hope it's better than the old one) amidst a few other goodies. Check it out here.

If you're interested in the latest NVIDIA CUDA news, follow nvidiadeveloper on Twitter. They don't send too much junk.

Sunday, April 5, 2009

Brains vs. Brawn

In the world of parallel computing, sometimes bigger isn't better. While some algorithms are well suited to vast parallel processing, like that provided by CUDA on the Tesla C1060, sometimes an algorithmic optimization may make the problem run faster on fewer processors. The N-Body problem is a prime example of this. The N-Body problem is a classical problem in astrophysics, molecular dynamics, and even graph drawing. The problem state consists of a number (n) of bodies, which may be stars or molecules or whatever. These bodies interact with each other via gravity or electromagnetic forces, etc. They exert these forces on each other, thus accelerating and causing movement. The forces on each body (from all the other bodies) are computed for each time step, then acceleration and movement can be calculated. This process means that O(n^2) force calculations are needed for each time step, and the simulation may run for many time steps. So, for 50000 bodies, roughly 2.5 billion force calculations are needed per time step.

The good news is that these force calculations are completely independent of each other, which means they can be done in parallel, so this type of problem is extremely well suited to CUDA computing. Each thread can compute the force on a single body by stepping through all the other bodies and determining the interaction with each. Using blocking and the fast shared memory as a cache, this can be quite an efficient operation in CUDA.
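
To make the blocking-plus-shared-memory idea concrete, here is a minimal sketch of what such a kernel might look like. This is my illustration, not the class solution: the float4 layout (mass in .w), the softening term eps2, and all the names are assumptions.

    // Brute-force N-body force kernel sketch: one thread per body, tiling the
    // "other bodies" through shared memory. Names and the softening term are
    // illustrative only.
    __global__ void computeForces(int n, float eps2,
                                  const float4 *pos,   // x,y,z = position, w = mass
                                  float3 *acc)
    {
        extern __shared__ float4 tile[];               // one tile of bodies per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
        float3 ai = make_float3(0.f, 0.f, 0.f);

        for (int base = 0; base < n; base += blockDim.x) {
            int j = base + threadIdx.x;
            tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
            __syncthreads();                           // tile is loaded

            for (int k = 0; k < blockDim.x && base + k < n; ++k) {
                float4 pj = tile[k];
                float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
                float r2 = dx*dx + dy*dy + dz*dz + eps2;   // softened distance squared
                float invR = rsqrtf(r2);
                float s = pj.w * invR * invR * invR;       // m_j / r^3
                ai.x += dx * s;  ai.y += dy * s;  ai.z += dz * s;
            }
            __syncthreads();                           // done with this tile
        }
        if (i < n) acc[i] = ai;
    }

    // Launch sketch: dynamic shared memory sized to one float4 per thread, e.g.
    // computeForces<<<(n + 127) / 128, 128, 128 * sizeof(float4)>>>(n, eps2, d_pos, d_acc);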

On the other hand, there are other ways to solve this type of problem that may be faster. The MDGRAPE system in Japan was designed specifically to perform molecular dynamics computations, and achieved petaFLOPS years ago, well before the current crop of petaFLOPS machines became operational. The reason it never made the Top 500 list is that it wasn't general purpose and wouldn't run the HPL benchmark the Top 500 uses. So a special architecture designed to accelerate N-Body-type problems is one way to speed things up.

Algorithmic enhancements are another way to speed up N-Body. If we observe that distant bodies exert little influence, then perhaps we can group a bunch of distant bodies, treat them as a single distant "body" with corresponding mass and center of mass, then compute the force in aggregate. This approach is used by the Barnes-Hut optimization. In Barnes-Hut, bodies are grouped into various combinations, each decreasing in size, until, at the smallest level, each group contains only a single body. A tree is used to store these groups, with larger groups at the top of the tree and smaller, refined groups towards the bottom. A recursive algorithm is used to build the tree, then the force calculations can be performed against the largest portions of the tree that meet certain distance criteria. Barnes-Hut drastically reduces the number of force calculations to O(n log n), so rather than needing 2.5 billion force calculations per iteration, the number shrinks to fewer than 3 million! The problem is that, as bodies move, the tree needs to be rebuilt for each iteration (though that could perhaps be optimized). Barnes-Hut is an approximation, of course, so the results will vary slightly from the O(n^2) version.
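
The heart of the Barnes-Hut traversal is the acceptance test: if the ratio of a cell's size to its distance is below some threshold theta, the whole cell is treated as one body at its center of mass; otherwise the traversal descends into the children. A rough host-side C++ sketch, with a hypothetical Cell struct and field names of my own choosing (not the assignment code):

    #include <cmath>
    #include <cstdio>

    struct Cell {
        float size;                 // edge length of this cube of space
        float cmx, cmy, cmz, mass;  // center of mass and total mass of the cell
        Cell *child[8];             // octree children, null where empty
        int  body;                  // body index if this is a leaf, else -1
    };

    // Accumulate the force on a unit-mass body at (bx,by,bz) from the subtree at c.
    void addForce(const Cell *c, float bx, float by, float bz,
                  float theta, float eps2, float &fx, float &fy, float &fz)
    {
        if (!c) return;
        float dx = c->cmx - bx, dy = c->cmy - by, dz = c->cmz - bz;
        float d2 = dx*dx + dy*dy + dz*dz + eps2;
        float d  = std::sqrt(d2);

        // Far enough away (size/d < theta), or a leaf: treat the cell as one body.
        if (c->body >= 0 || c->size / d < theta) {
            float s = c->mass / (d2 * d);               // m / r^3
            fx += dx * s;  fy += dy * s;  fz += dz * s;
        } else {
            for (int i = 0; i < 8; ++i)
                addForce(c->child[i], bx, by, bz, theta, eps2, fx, fy, fz);
        }
    }

    int main()
    {
        // Two bodies as leaves of a toy root cell, just to exercise the function.
        Cell a    = {0.f, -1.f, 0.f, 0.f, 1.f, {0}, 0};
        Cell b    = {0.f,  1.f, 0.f, 0.f, 1.f, {0}, 1};
        Cell root = {2.f,  0.f, 0.f, 0.f, 2.f, {&a, &b}, -1};
        float fx = 0.f, fy = 0.f, fz = 0.f;
        addForce(&root, -1.f, 0.f, 0.f, 0.5f, 1e-6f, fx, fy, fz);
        std::printf("force on body 0: (%g, %g, %g)\n", fx, fy, fz);
        return 0;
    }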

I assigned my students to build 3 different versions of N-Body for my parallel architecture class last quarter: one using OpenMP to speed the force calculations, one using CUDA, and one using Barnes-Hut to reduce their number. I wanted the students to see the power of CUDA to speed massively parallel computations, but I also wanted them to see whether a smarter algorithm could overcome the benefit of tremendous hardware parallelism. As expected, the Core2Quads with OpenMP did get good speedups in the force calculations, but it wasn't enough to make the problem fast: with 50000 bodies it took about 15 seconds per time step. The CUDA version was much faster, bringing it down to about 2 seconds per time step on the Tesla cards.

The problem with Barnes-Hut is that it can be done badly. Building the tree is potentially very expensive, particularly if dynamic allocation is used. I assigned the students to do the tree building in parallel, and parallel memory allocation introduces locking delays. I also asked them to use self-scheduling, so managing a task queue meant more locks to slow things down. Some students managed to get no performance improvement in parallel tree building at all, while a few got some. I managed to get about 30-50% improvement by trading some parallelism for speedy recursive calls and avoiding the locks. In any case, the fastest Barnes-Hut implementations we came up with could run 10 time steps in the same 2 seconds it took the CUDA cards to run 1 time step.
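
The "trade some parallelism for speedy recursive calls" trick is essentially a depth cutoff: spawn parallel work only for the top few levels of the recursion, then recurse serially so the many small calls near the leaves pay no task or lock overhead. Here is a toy OpenMP sketch of that pattern on a plain recursive range split (it uses OpenMP tasks, which is not how I had the students structure it); a real Barnes-Hut build would partition bodies into octants at each level, but the shape of the recursion is the same.

    #include <omp.h>
    #include <cstdio>

    // Toy depth-cutoff recursion: tasks only above 'cutoff', serial below it.
    // The "work" here is just a range split; the node count stands in for real
    // tree state.
    static long buildSubtree(int lo, int hi, int depth, int cutoff)
    {
        if (hi - lo <= 1) return 1;                    // leaf
        int mid = lo + (hi - lo) / 2;
        long left, right;
        if (depth < cutoff) {
            #pragma omp task shared(left)
            left = buildSubtree(lo, mid, depth + 1, cutoff);
            right = buildSubtree(mid, hi, depth + 1, cutoff);
            #pragma omp taskwait
        } else {
            left  = buildSubtree(lo, mid, depth + 1, cutoff);
            right = buildSubtree(mid, hi, depth + 1, cutoff);
        }
        return left + right + 1;
    }

    int main()
    {
        long nodes = 0;
        #pragma omp parallel
        #pragma omp single
        nodes = buildSubtree(0, 50000, 0, 3);          // tasks only for the top 3 levels
        std::printf("built %ld nodes\n", nodes);
        return 0;
    }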

So the final tally showed that CUDA provides an order of magnitude improvement over OpenMP on Core2Quad for the O(n^2) case, but the O(n log n) Barnes-Hut version is an order of magnitude faster than that! So it's nice to see that brains triumph over brawn.

Wednesday, March 25, 2009

OnLive

A company called OnLive has been making the news lately with their announcement of online purchases of games and a new model of gaming. This new model is intriguing, because it hooks our gaming experience into cloud computing. Essentially, we will game via streaming video, with the game video being rendered on a remote "cloud" computer somewhere. This has lots of advantages in terms of not needing noisy, expensive top-of-the-line hardware to play fancy games. Instead, THEY have the fancy hardware and render the game for us, then stream it to our browser or our TV via a presumably inexpensive console.

Will this work? There are many challenges, with latency being the biggest one. Controller events need to be captured and sent to the server, where they affect the gameplay; the resulting video is then rendered and streamed back to your screen. This round-trip latency may be noticeable if it is much more than your brain's control loop time. Existing multiplayer games experience this as "lag" and it is very annoying, so it may be a problem here too. The LA Times quotes an OnLive exec as saying they want to bring that latency down to 1 millisecond. While they may be able to use prediction and other tricks to reduce perceived latency, actual packet transfer time is bounded by the speed of light. This means a packet could, at most, travel on the order of 186,000 miles in a second, or 186 miles in a millisecond. (Best case, of course, as there is protocol overhead and signal propagation is often slower than the speed of light.) Therefore, unless they place their servers in every city, 1 ms doesn't make much sense. But then again, perhaps the newspaper misquoted or misunderstood and I am just interpreting it wrong.

I wish OnLive well and look forward to seeing how well it works.

Sunday, March 15, 2009

Cluster and CUDA troubles abound

Running this cluster for my class is more trouble than it's worth and I won't do it again. Research cluster, yes - teaching cluster, no.

Yesterday, one of the students ran the frontend node out of memory. Of course, that meant I couldn't log in to reboot the thing, so I had to go to campus, find which entrance of the building is open on weekends and isn't closed by construction, and press the reset switch. Very annoying. I am disappointed that Linux is that fragile and that no mechanism was in place to prevent this sort of problem. Surely killing the offending program and freeing its memory should have solved it, but that obviously didn't happen.

An even more annoying problem has to do with the Tesla cards. It looks like the drivers keep crashing. After a while, programs are suddenly no longer able to open the appropriate /dev entries for the NVIDIA cards. Clearly the students could be running buggy programs and such, but that should be expected. I think the drivers must not be recovering properly after a crash or something. Even running nvidia-smi won't fix it. A reboot will, but I discovered that "rmmod nvidia" will remove the driver module and clear the problem, and then nvidia-smi will cause the driver to reload and reinitialize. It's bloody annoying, though, because the root user needs to do it, and I'm not giving the students root...

Wednesday, March 11, 2009

New cluster at UCI

Though it isn't an asymmetric system, there is a new cluster at UCI for researchers and grad students. See the following link for details:


This is good news and seems to be a nice resource for the UCI HPC community.

Monday, March 9, 2009

Interesting parallel computing blog

My wife pointed out an interesting parallel computing blog:


It's well worth checking out. The long posts are clearly written and helpful. Take a look. I particularly like the car analogy for multicore processors (i.e., two 62 MPH cars are better than one 120 MPH car?). The author, Greg Pfister, has wide interests, from cloud computing to Larrabee.

Wednesday, March 4, 2009

Cluster Stable

The cluster has been stable for several days now with a Core2Quad frontend node. I don't know what was causing the Phenom frontend node to fail, but as it was my personal machine, I'll take it home and diagnose it after the quarter ends. For now, the students are able to make their parallel programs work on the cluster and finish their projects. Yay!

Thursday, February 26, 2009

Cluster still troubling

It turns out the cluster frontend was having memory errors. Reseating the DIMMs (Crucial Ballistix) seemed to help, as it passed Memtest86, but it failed again when put back in service. I tried to swap the disk into one of the cluster nodes to make it the frontend, but bloody Linux thwarted me completely, particularly because it was using disk labels rather than partition numbers, and somehow they could not be found in the new machine. The disk wasn't corrupt, but the kernel panicked on boot each time, no matter what I did.

Eventually, I had to reformat, reinstall ROCKS, restore the users, and put the new frontend in service. It seems to be working now, though ROCKS is misbehaving a bit. The scripts that normally handle all the user-creation tasks aren't adding the auto.home entries, which is a surprise. Anyway, I can use emacs and fix it myself. Bloody computers.

Monday, February 23, 2009

Technology troubles

Technology really failed me today. When I got to campus, I rebooted my cluster frontend, which failed on Saturday. It didn't come back up. Then I discovered that the bright guys in network security had blocked my IP address because someone who had it before me had a compromised machine. When I called the help desk, they said I needed to send an email to get it cleared up. Well, that would have been great, except email was blocked! So I plugged into the HIPerWall network, only to discover it didn't work either. The antique PowerMac G4 server that provided DHCP addresses, name service, and user accounts has failed, apparently permanently, as it won't even turn on the screen. So there are days I really hate technology.

But I think I'd enjoy digging ditches less, so I'll stick with the technology stuff for now.

I used my iPhone to send the email to unblock my IP address, then I configured the Linux storage server to serve DHCP, so HIPerWall is back up. The CUDA cluster is still hosed, however...

Saturday, February 21, 2009

CUDA cluster troubles

The problem with running a cluster that other people use is that those other people rely on it and it sometimes breaks.

In this case, it looks like my frontend crashed and rebooted itself (why, I don't know) yesterday afternoon. Upon reboot, it seems not to have reinitialized the Tesla card, so the students couldn't do their testing. I ran "nvidia-smi" on it, which created the device entries and made it usable again. Perhaps I need to add that to init.d or rc.local.

It also turns out that two of my compute nodes had problems, not related to reboots. In one case, the devices weren't present, but nvidia-smi didn't fix it. A reboot, followed by nvidia-smi, did. In the other case, the machine with 2 Tesla cards had only one of them working. Again, nvidia-smi didn't help, but a reboot did.

While I imagine the problem may stem from student code doing horrible things inside the card, I think it points to a driver problem. Perhaps the driver isn't correctly recovering after some crazy operation, but that's not a good sign for the robustness of the CUDA computing environment.

Thursday, February 12, 2009

Some pain, no gain

I modified my program to take advantage of coalesced memory accesses (particularly loads in the N-body problem), which meant changing many, many things. After a fair bit of painful debugging, I got it to work and produce the same results as the original CUDA version. The bummer is that it is only slightly faster (like a percent or two). Now it seems likely that the memory accesses were already pretty good, though the old code certainly did more of them than it should have, so I really expected a significant gain. I am taking advantage of locality MUCH more than before, yet it didn't help.
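
For anyone who hasn't fought with coalescing: on these GPUs, the loads from a half-warp collapse into a single memory transaction only when consecutive threads touch consecutive (and suitably aligned) addresses, which is why data layout matters so much. A tiny illustrative pair of kernels, with my own names rather than anything from the N-body code:

    // Array-of-structures: thread i reads body[i].x, so consecutive threads touch
    // addresses 16 bytes apart and the loads coalesce poorly.
    struct Body { float x, y, z, m; };
    __global__ void strided(const Body *body, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = body[i].x + body[i].m;     // strided 4-byte accesses
    }

    // Structure-of-arrays: thread i reads x[i], so consecutive threads touch
    // consecutive 4-byte words and the half-warp's loads coalesce into one
    // transaction (alignment permitting on this hardware generation).
    __global__ void coalesced(const float *x, const float *m, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = x[i] + m[i];
    }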

I turned to the fancy profiler provided by NVIDIA. It's pretty nifty and told me what I already knew: the O(N^2) ComputeForces function is taking all the time. What it couldn't tell me is the number of uncoalesced loads, because that performance counter is apparently not (yet?) supported on G200-class GPUs. Darn. It does tell me that I have a lot of coalesced loads and stores, but they may only be a portion of the total. So I'm not particularly thrilled with the profiler, since uncoalesced memory accesses can be the real performance killer in CUDA apps. But it looks pretty and has lots of other somewhat useful statistics. I'm glad I tried it.

N-Body

I have assigned my students to make a simple n-body simulation in CUDA to compare it to the OpenMP version they've already done. I know there are really great n-body programs for CUDA out there, but starting simple is good for a class. I wrote my own as well, because I want to see if any of the students can make theirs faster than mine.

My initial naive implementation is only about 5 times faster on the Tesla C1060 than the OpenMP version on the Core2Quad 8200. This version does nothing special to take advantage of locality, so my next version should be MUCH faster!

Two surprises:
  1. Even with doubles, the floating point values differ from the Core2Quad after a good number of iterations. This wasn't the case in my previous FDTD program, so I'm a bit concerned. I am using sqrt(), so perhaps that implementation differs a bit.
  2. When I tried 256 threads per block, the CUDA launch failed with error 4 (unspecified launch failure). I've never seen that one before, but the code works with 128 threads and fewer (a quick error-checking sketch follows this list). Hmmm...
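
For reference, the way I catch failures like that is to check cudaGetLastError() right after the launch, and then check again after a synchronize, since some errors only show up when the kernel actually runs. A minimal self-contained sketch (the dummy kernel and names are just stand-ins; cudaThreadSynchronize is the CUDA 2.x name for the synchronize call):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy(float *out) { out[threadIdx.x] = (float)threadIdx.x; }

    int main()
    {
        float *d_out = 0;
        cudaMalloc((void **)&d_out, 256 * sizeof(float));

        dummy<<<1, 256>>>(d_out);                      // try 128 vs 256 threads here
        cudaError_t err = cudaGetLastError();          // catches bad launch configurations
        if (err == cudaSuccess)
            err = cudaThreadSynchronize();             // catches failures during execution
        if (err != cudaSuccess)
            std::printf("launch failed: %s\n", cudaGetErrorString(err));
        else
            std::printf("launch OK\n");

        cudaFree(d_out);
        return 0;
    }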

Tuesday, February 10, 2009

New cluster node

I put together a new node for the CUDA cluster. This one has the same el-cheapo Core2Quad 8200 and 4GB of Kingston DDR3 RAM, as well as the scrounged hard disk from the cluster node it replaced (the hard drive tests OK, but doesn't sound very quiet...). I got a bigger power supply this time so I could put 2 Tesla cards in. The documentation is somewhat weak on multiple Tesla cards per machine, so I wasn't sure if I should put in the SLI bridge or not. I didn't, mostly because I forgot - I had the bridge, but buttoned up the machine before I remembered to use it.

Both Tesla cards are detected with no trouble, but I notice an interesting delay as CUDA programs start. This delay does not occur with normal OpenMP or other programs, so it must be something to do with the device detection in CUDA. So far, I'm not concerned, but I will see if it causes any trouble. I will also look into how to use the two cards from two separate programs, so that two of my students can use this node for CUDA simultaneously.
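
The usual way to split two boards between two independent processes is for each program to pick its card with cudaSetDevice() before doing anything else with the runtime. A minimal sketch with the device number taken from the command line (my own example, not Hiperwall or class code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each student process picks its own board up front, e.g. ./myapp 0 and ./myapp 1.
    // cudaSetDevice must come before any allocation or kernel launch so the context
    // is created on the right card.
    int main(int argc, char **argv)
    {
        int dev = (argc > 1) ? std::atoi(argv[1]) : 0;

        int count = 0;
        cudaGetDeviceCount(&count);
        if (dev >= count) {
            std::fprintf(stderr, "only %d CUDA device(s) present\n", count);
            return 1;
        }
        cudaSetDevice(dev);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("running on device %d: %s\n", dev, prop.name);

        // ... allocate, copy, and launch kernels as usual; everything now targets 'dev'.
        return 0;
    }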

Friday, January 30, 2009

New CUDA

I updated the cluster with the recently-released CUDA 2.1 for Linux 64 and everything seems to work well.

I am disappointed that Mac CUDA trails at 2.0, but I presume 2.1 is coming.

I looked into putting a Tesla card into a Mac Pro, but it looks like it wouldn't be supported under Mac OS X. It would probably work under Windows, which we do run on our Mac Pros sometimes. I imagine the reason is that Apple's display drivers wouldn't bother to load for a Tesla card and Mac CUDA uses the bundled drivers. Darn.

Wednesday, January 28, 2009

HIPerWall makes news

I was pleased to discover that Information Week has named my company, Hiperwall Inc., as the startup of the week!

Thursday, January 22, 2009

CUDA Working on Cluster

I was able to get CUDA working on both the frontend machine and the single compute node tonight, with some surprising results. It turned out that if I ran nvidia-smi as root, it would properly create the device entries in /dev and all was well. So then I ran my CUDA FDTD program with doubles (thus, requiring a compute 1.3 class CUDA card), and strange timing results occurred.

I was certain that the Tesla C1060 would crush the old development board, which has fewer processors and a slower speed. Yet, the test board proved faster by 4 or 5%, which was a surprise. As I increased the problem size, the results followed the same pattern.

FDTD is a 3D volumetric finite-difference, time-domain electromagnetic simulation. The main loop is O(n^3), and it accesses eight 3D arrays for each time step. So it's a really memory-intensive and CPU-intensive program. Because a single CUDA thread block can't span the whole 3D volume, I assign, for example, 1x16x16 threads per block, all within the 3D array space. In this case, each block has 256 threads that CUDA schedules on a multiprocessor. My first thought was that the newer C1060 might benefit from more parallelism, so I made it 1x32x16, thus 512 threads per block. This slowed the program on both GPUs by a fair bit.

Then I thought that perhaps the processors were resource constrained, so fewer threads would help, and I stepped it down to 1x8x16, or 128 threads. This made a big difference, and now the Tesla C1060 did beat the development board by 15 or 20%. Yay! Taking that logic further, I went down to 64 threads, but that made things a bit slower, so it looks like 128 threads may be the right balance for the number of registers and other resources needed by FDTD.
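
For what it's worth, these block-shape experiments amount to changing one dim3 and the grid that goes with it. A skeletal sketch of the kind of mapping I'm describing, with a stand-in kernel and a made-up array layout (the real FDTD update is far more involved):

    // Hypothetical kernel shell showing how a 1x16x16-style block maps onto the
    // 3D volume: each thread owns one (y, z) column and walks x itself.
    __global__ void fdtdStep(float *field, int nx, int ny, int nz)
    {
        int z = blockIdx.x * blockDim.x + threadIdx.x;  // threadIdx.x runs along contiguous z
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (y >= ny || z >= nz) return;
        for (int x = 0; x < nx; ++x)
            field[(x * ny + y) * nz + z] += 1.0f;       // stand-in for the real update
    }

    void launchStep(float *d_field, int nx, int ny, int nz)
    {
        dim3 block(16, 16);   // 256 threads; the 128- and 512-thread experiments just change these
        dim3 grid((nz + block.x - 1) / block.x,
                  (ny + block.y - 1) / block.y);
        fdtdStep<<<grid, block>>>(d_field, nx, ny, nz);
    }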

I really should play with the "Occupancy Calculator" tool provided by NVIDIA to see if it provides some insight into these sorts of performance issues.

Wednesday, January 21, 2009

Cluster sort of working

The new mini-cluster is online and will be put to use for my class. In fact, I was able to demonstrate it in class on Tuesday. I showed how OpenMP can parallelize the Finite Difference, Time Domain code I've been using for a while and have posted on the original blog.

I ended up with two surprising results: First, on both the Phenom frontend and the Core2Quad compute node, the code ran faster when I told OpenMP to use 8 threads rather than 4 (and each machine only has 4 cores). This is surprising, because the work is very memory-bound, and normally the additional overhead of context switching between threads would slow it down or make it perform no better. Yet it ran faster, by a fair bit. I can only think there must be some odd cache effect or something, but it warrants more investigation.
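
For readers who haven't used OpenMP: parallelizing the update is essentially a one-pragma change on the outer loop, and the thread-count experiment is a single call to omp_set_num_threads. A toy stand-in for the kind of loop involved (not the actual FDTD code):

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main()
    {
        const int nx = 256, ny = 256, nz = 256;
        std::vector<float> field(static_cast<size_t>(nx) * ny * nz, 0.0f);

        omp_set_num_threads(8);   // try 4 vs 8 on a quad-core and compare

        double t0 = omp_get_wtime();
        // Split the outer (x) loop across threads.
        #pragma omp parallel for
        for (int x = 1; x < nx - 1; ++x)
            for (int y = 1; y < ny - 1; ++y)
                for (int z = 1; z < nz - 1; ++z) {
                    size_t i = (static_cast<size_t>(x) * ny + y) * nz + z;
                    field[i] = 0.5f * (field[i - 1] + field[i + 1]);   // stand-in stencil
                }
        double t1 = omp_get_wtime();

        std::printf("update took %.3f s with %d threads\n", t1 - t0, omp_get_max_threads());
        return 0;
    }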

Second, the Phenom (at 2.2 GHz with DDR2 RAM, if I remember correctly) kicked the 2.33 GHz Core2Quad's butt, by a quarter to a third in all the tests. And, yes, the C2Q has fast DDR3 RAM and a fast front-side bus. I am not very surprised by this, as AMD has always done well at floating point for scientific computing, but the margin of victory is surprising. Good job, AMD!

I've been having trouble getting CUDA working on the frontend node. I put the test board I received under NDA last year into the node, in the hope I could reuse it rather than stick one of the Teslas in there, but though the driver is installed and seemingly happy, it isn't creating the /dev entries, so CUDA programs can't run. I will try the compute node today, since it has the official Tesla card. Perhaps the drivers just don't want to deal with a test board, in which case, I'll swap a Tesla card in the frontend.

Unfortunately, while getting this system working, my old cluster frontend failed and the network connections in the server room are all messed up, so I need to fix those things. Sometime... 

Friday, January 16, 2009

Cluster Troubles

I was able to install ROCKS on my Phenom frontend node with no trouble, even though it doesn't have any CUDA software installed yet. I built one compute node with a Core2Quad, 4GB Kingston DDR3 RAM, a scrounged Western Digital 40GB IDE drive, Myrinet 2000 PCI card, and donated NVIDIA 790i motherboard and Tesla C1060 card. I used an old Matrox Millenium PCI video card so I could watch ROCKS install the cluster OS on the system. The install worked fine, but this version of ROCKS doesn't come with the Myrinet Roll, nor does Myricom have a ROCKS 5.1 roll (and my cards are so old they may not be well supported). Getting Myrinet running can wait.

A really great thing about ROCKS is that it doesn't try to recover failed compute nodes - it just reinstalls the OS on them. Unfortunately, this means a lot of things have to go right, and they don't in this case. When I pull the PCI video card out of the nodes, the reinstall fails, presumably because Anaconda (the RedHat installer) barfs when it doesn't see a video card. This is a shame, because I have no plans to put video cards in these compute nodes. The workaround is to shut off the automatic reinstall, which also means no automatic updates. This is a serious downside and a shame. My old Athlon nodes worked perfectly without video cards, but something here fails during the reinstall process, and without a video card, I can't see what fails.

Then I moved the frontend and compute node up to the cluster racks in our 4th floor machine room. There, it seems that others have "borrowed" my network cables and jack in the patch panel, so the new system doesn't have network connectivity. I will try a small switch to share the connection used by my other cluster frontend. A serious issue came up with this move - the networks (IPs and gateways) are different between the HIPerWall lab and the UCInet in the 4th floor server room. So I dutifully went and changed the network settings in /etc/sysconfig to the new network. But it didn't stick. This seemed crazy, but it was true. After a bit of looking around, I found the network settings were ALSO in another nearby directory, and those were the ones really being used. This is a real problem that Linux seems to have that Windows and Mac OS X don't: there needs to be a single source of truth. In this case, links between the two settings files would have solved it, but whatever installer wrote both files made them separate rather than linked. In almost every Linux system I set up, I have some sort of network config issue that requires manually editing config files somewhere. Admittedly, I tend to have stranger network config requirements than most people, for whom DHCP is fine, but Linux will not take over the desktop, as all the proponents have been hoping, until some consistency is achieved.

Wednesday, January 14, 2009

ROCKS not Rolling with CUDA

I finally have enough parts together to start building the new cluster. I have a frontend machine running a 4-core Phenom and a compute node with an Intel Core2Quad. I went to download ROCKS to put on the system and ran into a couple of problems.

First, the ROCKS ftp site is down (as of this writing), which means I can't download the DVD. But I can use the HTTP site and get the CDs, so no great loss.

Then, I went looking for the CUDA Roll (Rolls are extensions to ROCKS functionality so ROCKS knows how to install the software on the nodes). Well, the NVIDIA-supported roll is for ROCKS 4.3, which is way out of date. (And apparently NVIDIA must be embarrassed about it, because it is really hard to find the link on the NVIDIA website.) Someone has built a test version for ROCKS 5.0 and put it on Google Code, but even that is out of date.

So now I have to decide if I should try to make my own roll or at least integrate CUDA into ROCKS so it can be automatically installed or if I should install it by hand on each node and prevent ROCKS from reinstalling on each node crash. Having done the former before, I'm leaning towards the latter, since I have so few nodes. If anyone has a modern CUDA roll, please let me know.

Sunday, January 11, 2009

CUDA Cluster Computing

I'm in the process of putting together a new cluster to support both parallel computing education and research. I have an old, decrepit AMD Athlon cluster that has mostly failed (power supplies, fans, and disk drives are the culprits), so I will be replacing some of those nodes with fancy new nodes. This is needed because UCI no longer has a student-accessible cluster to be used for education. The graduate students used to have access to a fairly nice cluster as their email machine (they didn't know it could run MPI and had 44 Xeons driving it). That's gone and has not been replaced.

The course I am teaching this quarter is primarily on parallel computer architecture, but in my opinion, the best way to understand a system's behavior is to use it, so I believe in having the students do parallel programming assignments. I will teach them OpenMP for cache-coherent shared memory architectures and MPI for distributed memory architectures. I will also spend a fair bit of time on CUDA, because of its performance and accessibility.

I received  some very generous support for my new cluster from NVIDIA. Because they want to make sure my grad students have the best possible CUDA experience, NVIDIA provided six Tesla C1060s and six 790i motherboards. This will help me make a really nice setup.

Unfortunately, my old power supplies, RAM, and CPUs won't work with this new equipment, so I need to supply those myself. I had hoped to get Intel to donate CPUs, so I asked our industry relations folks to help make contact with Intel. Unfortunately, word got out that I will be teaching CUDA, so the Intel contact refused to help, saying that Larrabee is way better. Of course, Larrabee isn't available yet, so I can't exactly assign projects with it, and the OpenMP and MPI assignments would have made the students familiar with technologies that would have been useful on Intel processors, possibly including Larrabee. Darn.

Anyway, I've bought components for one system now and will borrow a machine from home as the cluster front end, so I will post updates as the system comes together. I plan to put ROCKS on it, having been a long-time ROCKS user and fan.


Friday, January 9, 2009

GPU Computing Talk

I just got back from a very interesting talk by Pat Hanrahan of Stanford on GPUs and the future of parallel computing. He was very compelling in his arguments that conventional CPUs are dinosaurs and need to be replaced or at least supplemented with much more efficient GPU-style processors. He also showed a bit about Intel's Larrabee, which was very interesting. The idea of using a bunch of simple, multithreaded x86 cores in a graphics architecture is intriguing.

The biggest roadblock to this approach is software. Frankly, a whole lot of code out there is single-threaded, and that isn't likely to change in the short term. Existing programming paradigms are problematic for parallel or concurrent code, and we programmers tend to make mistakes that are very hard to catch and debug on parallel systems, while sequential debugging is at least tolerable. So until this is addressed by new generations of software architectures and better training for computer scientists and programmers, adoption will be slow, except in specialized domains.

Moving in

Because iWeb only runs on the Mac, yet I want to be able to post from my Tablet PC or iPhone, using my old .Mac blog has become a burden. So I am moving my Asymmetric Parallel Computing blog here.

The old posts will stay up at the old address.

The new blog will cover my experiences with CUDA much more than it will cover the Cell, because I'm fairly over the Cell, finding it hard to get reasonable performance out of the thing. As OpenCL becomes available, I'll investigate that too.