Version User Scope of changes
Jun 2 2008, 12:42 PM EDT (current) laytonjb 1868 words added
Jun 2 2008, 12:41 PM EDT laytonjb

Whither Virtualization in HPCC

I think it’s very interesting that when a new technology comes out, people immediately assume it’s a solution for all problems. The phrase that comes to mind is, “When you have a hammer, everything looks like a nail.” In some ways, I think Virtualization has become the hammer and people are searching around for nails (or something that looks like a nail). Recently, many people have come to think that HPCC looks like a Virtualization nail.

I don’t want to be critical of Virtualization because it has been something of a revolution for reducing costs within data centers that focus on enterprise computing. One of the reasons that Virtualization works well in the enterprise world is that the utilization of the hardware is fairly low, less than 50% in many cases. In HPCC the utilization of the hardware is usually well over 90%. What is interesting about HPCC is that even though the utilization of the hardware is over 90%, the demand is usually much higher, with jobs waiting in the queue for the appropriate resources to be free. In general, the resource manager will try to schedule jobs to utilize as much of the hardware as possible, but in some cases there may not be enough free resources to run a job, so it will hold the job until the requested resources are available. Consequently, it looks like the hardware is only being used at some level less than 100% (for example, about 90%), but in fact the demand is much higher than 100%.
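The gap between measured utilization and real demand can be illustrated with a small sketch. The node counts and job sizes are made-up numbers, and `greedy_schedule` is a hypothetical stand-in for a resource manager, not the algorithm of any real scheduler.

```python
def greedy_schedule(total_nodes, job_sizes):
    """Start every submitted job that still fits, in submission order.

    Returns (running, waiting) lists of job sizes, measured in nodes.
    """
    free = total_nodes
    running, waiting = [], []
    for size in job_sizes:
        if size <= free:
            running.append(size)
            free -= size
        else:
            waiting.append(size)  # held until enough nodes are free
    return running, waiting


def utilization_and_demand(total_nodes, job_sizes):
    """Compare what the hardware is doing with what users asked for."""
    running, waiting = greedy_schedule(total_nodes, job_sizes)
    utilization = sum(running) / total_nodes
    demand = (sum(running) + sum(waiting)) / total_nodes
    return utilization, demand


if __name__ == "__main__":
    # Hypothetical 100-node cluster with five submitted jobs.
    util, demand = utilization_and_demand(100, [40, 30, 20, 16, 32])
    print(f"utilization: {util:.0%}, demand: {demand:.0%}")
```

With these numbers, 90 of the 100 nodes are busy, yet the submitted work adds up to 138 nodes. The 10 idle nodes are not spare capacity that could be consolidated; they are simply too few for either waiting job.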

So the idea of using Virtualization in HPCC to increase productivity by consolidating underutilized resources won’t work. The simple fact is that virtually all HPCC systems are either fully utilized or oversubscribed (if you have an HPCC system that is underutilized, please contact me. I know lots of people who are dying for compute cycles :) ). But this doesn’t mean there aren’t some good things that Virtualization could do for HPCC.


Possible Applications of Virtualization in HPCC
I can think of three aspects of Virtualization that have some potential for HPCC. The first one is the idea of using Virtualized hardware on the compute nodes for running user selected distributions. I know this sounds funny, but let me explain. A typical cluster has a set of compute nodes that are almost always homogeneous. That is, they are identical in virtually all respects including hardware and software. But there could be times when you have an application that is built for one specific OS, or one specific kernel, or has some software dependency that can’t be satisfied by what is already available on the compute nodes. What do you do in these cases?

Many times people will solve this problem by setting up a separate cluster for the applications that have specific software requirements. But this can get expensive. I know one organization that has 6 applications, all with different sets of software requirements. Imagine having to construct 6 different clusters, one for each application? So how do we get around this?

One idea from Virtualization that could solve this problem is to use Virtual Machines (VMs) for running the appropriate software. In this scenario, the compute nodes run a host OS (or equivalently, a hypervisor). When users submit a job to the resource manager, they specify what OS they want, what kernel they want, etc., as part of the job. When the job runs, the resource manager tells the compute node what software needs to be run, and the appropriate software is installed inside a VM. Then the job runs inside the VM, and when it is finished the VM is dropped and the node is ready for the next job. This concept allows you to run mixed applications, such as Linux and Windows applications, on a single cluster, or to run applications that need a certain OS that isn’t on the cluster. But, as with everything in life, there is no such thing as a free lunch, and that is true for this scenario as well.
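The job lifecycle described above (request an image, boot a VM, run, drop the VM) can be sketched in a few lines. Everything here is hypothetical: `ComputeNode`, `Job`, `job_vm`, and the image names are illustrative stand-ins, not the interface of any real resource manager or hypervisor.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComputeNode:
    name: str
    active_image: Optional[str] = None          # image of the VM now running
    log: List[str] = field(default_factory=list)


@dataclass
class Job:
    command: str
    os_image: str   # what the user asked for at submission time


@contextmanager
def job_vm(node, image):
    """Boot a VM with the requested image and always drop it afterwards."""
    node.active_image = image
    node.log.append(f"boot {image}")
    try:
        yield node
    finally:
        node.log.append(f"drop {image}")
        node.active_image = None   # node is clean for the next job


def run_job(node, job):
    """Run one job inside a VM built from the image the job requested."""
    with job_vm(node, job.os_image):
        node.log.append(f"run {job.command}")


if __name__ == "__main__":
    node = ComputeNode("n001")
    run_job(node, Job("solver", "linux-image"))
    run_job(node, Job("app.exe", "windows-image"))
    print("\n".join(node.log))
```

The context manager is just a convenient way to guarantee the VM is dropped even if the job fails, so the node always returns to a known state.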

The problem lies with applications that are running inside the VM and need to access the hardware, such as IO and network devices. I am also going to assume that these HPCC applications are probably parallel and will run across several nodes, most likely using MPI. If the application that is running in the VM needs to access a high-speed network card to send messages, for example, it will have to contact the host OS, which then contacts the card on behalf of the VM. The host OS becomes a middleman, which reduces performance and can greatly increase latency. The same is true for accessing disks within the node. From what I’ve seen, the performance degradation used to be on the order of 50% for a high-speed network card (i.e., you got 50% worse performance when running codes that used the high-speed network card under a VM than when not running in a VM). Lately, I’ve seen that improve to about 30% degradation. Various companies claim to have drivers that allow VMs to access the hardware directly. However, I have not seen any benchmarks on these drivers at this time (in one case I know of, a company claimed native-performance drivers 2 years ago and has not published any benchmarks to this day). So the problem of accessing the hardware from the VM is limiting the use of this concept.

A second concept from Virtualization that could be useful for HPCC is the idea of being able to “move” a running process from one node to another node. In the VMware world this is referred to as VMotion (it is called something else for Xen and other virtualization tools). The idea is to be able to move a VM from one physical set of hardware to another while the VM is still doing work. Many people say they would like to be able to do this if they find that a node that is part of a job is failing. However, it is not very likely that you could spot a failing node and move the job off it before the node actually fails. But the idea of moving a VM could be useful for maintenance. That is, you schedule some nodes for maintenance and then move the VMs when the maintenance window opens so you can work on the nodes. But overall, there are some problems with the approach of moving VMs when running HPCC jobs.

Again, I’m assuming that for HPCC the codes are likely to be MPI-based codes. One problem is that MPI codes should be “pinned” to cores for the best possible performance (people always want more performance). But according to VMware, it is not a good idea to pin processes to a specific core, because a move may fail if the target node does not completely match the originating node. In addition, they say that having pinned processes can inhibit the movement of the VM.
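For readers unfamiliar with pinning, here is roughly what it looks like at the operating-system level, outside of any VM. This is only an illustration using Python’s `os` module on Linux; MPI launchers typically have their own binding options, and `pin_to_one_core` is a hypothetical helper name.

```python
import os


def pin_to_one_core():
    """Pin the calling process to one core it is already allowed to use.

    Returns the chosen core, or None where the platform lacks the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # os.sched_setaffinity is only available on some Unixes
    allowed = os.sched_getaffinity(0)   # 0 means "this process"
    core = min(allowed)                 # pick the lowest-numbered allowed core
    os.sched_setaffinity(0, {core})     # the scheduler may now use only it
    return core


if __name__ == "__main__":
    print("pinned to core:", pin_to_one_core())
```

A pinned process carries an assumption about the core layout of the machine it is on, which is one way to see why a move to a node with a different layout is a problem.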

Perhaps more importantly, when you move the VM you need to stop all of the message passing on the network for the VM that is being moved (both sending and receiving messages). You also need to stop all of the IO traffic from the VM that is being moved. Only then can you start to move the VM. In addition, you will need to make sure that the messages and IO traffic from the originating node are moved to the target node. This is a difficult problem for any VM to accomplish. In a recent test, a single node that was doing some local IO was moved to another node. It took over 20 minutes to move the VM. Imagine trying to do this for a job that is running across multiple nodes, performing message passing, and possibly IO at the same time. So at this time, VM movement is not a good option for HPCC.
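The ordering constraint described above (stop message passing and IO first, then move, then redirect traffic to the target) can be written down as a toy model. The `VM` class and `migrate` function are hypothetical and do not correspond to any real hypervisor API.

```python
class VM:
    """Toy model of a VM taking part in a parallel job (hypothetical)."""

    def __init__(self, host):
        self.host = host
        self.network_active = True
        self.io_active = True

    def quiesce(self):
        # Stop sending and receiving messages, then stop IO traffic.
        self.network_active = False
        self.io_active = False

    def resume(self):
        self.network_active = True
        self.io_active = True


def migrate(vm, target_host):
    """Move a VM only after its message passing and IO have been stopped."""
    if vm.network_active or vm.io_active:
        raise RuntimeError("VM must be quiet before it can be moved")
    vm.host = target_host   # messages and IO must now reach the target node
    vm.resume()


if __name__ == "__main__":
    vm = VM("node01")
    vm.quiesce()
    migrate(vm, "node02")
    print(vm.host)
```

The sketch makes the single-VM case look trivial. The hard part in a real MPI job is that every other rank talking to this VM also has to pause and then find it again on the target node.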

A third concept from Virtualization that could be useful for HPCC is based on the idea of using Virtual Machines for checkpointing or restarting applications. One of the Holy Grails for HPCC has been the idea of checkpoint/restart that is independent of the application. A checkpoint is basically a snapshot of the progress of the code. It is a capture of the state of the computation of the node. The reason people want to checkpoint is that if a node goes down and the application fails, you can restart the application from the last checkpoint. Otherwise you have to restart the application from the beginning.
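To make the checkpoint/restart idea concrete, here is a minimal sketch of a toy computation that saves its progress and resumes after a simulated failure. Note that this is application-level checkpointing, where the code itself knows what to save, which is exactly the burden an application-independent approach would remove. The `run` function, the `fail_at` parameter, and the file name are all illustrative assumptions.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_path, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint after every step.

    `fail_at` simulates a node going down at that step (illustrative only).
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)       # restart from the last checkpoint

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)        # snapshot of the progress so far
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run(10, path, fail_at=6)
    except RuntimeError:
        pass
    print(run(10, path))   # resumes at step 6 instead of starting over
```

The second call picks up at step 6 rather than step 0, which is the whole point: the work done before the failure is not lost.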

What Virtualization offers is the ability to very easily create a checkpoint, since the application is running in a VM, which is only software. So it’s relatively easy to create a checkpoint of the state of the VM and write it to storage. But again, this faces the same problem of making sure the VM is “quiet” before creating the checkpoint.

The fundamental problem is how to “quiet” the system prior to creating the checkpoint. This includes stopping the CPU and what it is doing, stopping all message passing and IO, making sure all of the buffers are flushed, etc., and then dumping the state of the VM to a file on some storage. A few companies have tried to do this for clusters and have failed. There is also a new company trying to do this as well. But, fundamentally, this is a very difficult problem.


Summary
Virtualization is becoming something of a revolution for enterprise-class IT. It allows the number of servers to be reduced and increases the utilization of the remaining servers. However, since it is seen as a panacea of sorts, people are trying to apply it to every conceivable sector of IT, including HPCC. At this time I can see three possible ways for Virtualization to impact HPCC:
  1. Using Virtualization to let users select an OS distribution and/or other software requirements and have the assigned compute nodes run this software.
  2. Using Virtualization to move processes from one node (originating node) to another node (target node).
  3. Using Virtualization to easily create checkpoints.

While these 3 concepts sound and appear to be easy, they are, in fact, very difficult to achieve in HPCC. The first concept, being able to boot whatever OS or distribution you need as part of the job, is very attractive to a number of people. But nobody has yet managed to do this while keeping good performance. The second concept, moving VMs from one machine to another, is extraordinarily difficult in HPCC because many applications heavily use the network and/or the storage (IO). The third concept, using VMs to quickly create checkpoints, is very difficult for the same reason: the heavy use of networking and/or storage.

So for right now, it appears that Virtualization really doesn’t have a place in HPCC. That doesn’t mean it can’t or won’t happen some day. But for now, Virtualization in HPCC is still something of a non-starter. Sorry, but HPCC isn’t a Virtualization “nail”.


Jeff