HPCC Software

Software is one of the most important aspects of a cluster. The generic term “software” covers not only the applications but also the software that takes the separate pieces of the cluster (i.e. the nodes) and makes them work together as a single unit. It also includes any development tools and any monitoring and reporting tools that you might use. In short, the software is not only the reason we have clusters (the applications) but also the brains behind the cluster.

In this article I want to talk about what software pieces there are in a cluster, why you need them, what they do, how they work together, etc. I hope this will give you an idea of what software pieces a cluster needs to function, well, as a cluster.

I won’t be comparing options for the various software components since that’s really comparing various tools and that’s not the purpose of this article. I will tell you, at a high level, what the tools do and why you need them. However, at the end of the article I will make some recommendations about components you have to have to make a cluster function and what tools are considered upgrades or give enhanced functionality.

This article will also not discuss applications since those are really up to the user and not the cluster itself.


Really Basic Clustering

When clusters first started you could make them about as basic as you wanted. Really all you needed was an OS. You could run around, pop a CD in the drive of each node, and install the OS on each compute node. You had to adjust the networking and list of hosts slightly, but at a fundamental level you had a functioning cluster (of course, there are a few minor details such as the creation of accounts, password-less logins, and storage that each node can access).

While this sounds really easy, it is a monumental pain to maintain. If you had to update the OS, you had to create a new CD and run around and install it on every node. Alternatively, if it was a simple change, you could use the network to copy it to each node, install it, and then reboot the node. Still, for clusters of reasonable size, this is a pain.

In addition, this type of cluster management is really only meant for a single user. Moreover, that person has to also be the administrator for the cluster. While it’s perhaps a good way to get into clusters, it’s not good for clusters that have to do any real work.


Required Software Components

The first software component for clusters is pretty fundamental and you can probably guess what it is – an Operating System (OS). Each node (link to HPCC nodes article) in the cluster, whether a master node, a login node, or a compute node, needs to have an OS. The OS can be installed on a hard drive in the node or can even be installed on a ramdisk, in which case the node is sometimes called a “diskless” or “stateless” node. Typically, the master node creates what is called an “image” and then sends it to the compute nodes for installation (either to the hard drive or to the ramdisk).

The second software tool is called a Cluster Management Tool (CMT) in this article. Its job is to manage the cluster. It has several functions, some of which are optional. The required functions are:

  • Maintain a list of what nodes are compute nodes (i.e. what nodes are in the cluster). This can be done through something as simple as /etc/hosts that is replicated to each compute node or through a local DNS
  • Create and manage the image or set of packages that are installed on the compute nodes
  • Send the image or packages to the compute nodes (typically done via PXE)
  • Perform basic monitoring of compute nodes (e.g. How are the nodes performing? What nodes are up or down?)
  • Power control of the compute nodes (not absolutely required, but highly recommended). This is the ability to remotely turn nodes on and off. It can be done by a variety of methods, some of which involve additional hardware.
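As a rough illustration of the first function in the list above, here is a minimal Python sketch that generates /etc/hosts-style entries for a set of compute nodes. The naming scheme (node001, node002, ...), the starting address, and the function name are assumptions made up for this example; a real CMT keeps its node list in its own way.

```python
from __future__ import annotations

from ipaddress import IPv4Address


def build_hosts_entries(first_ip: str, count: int, prefix: str = "node") -> list[str]:
    """Return /etc/hosts-style lines for `count` compute nodes.

    Addresses are assigned sequentially starting at `first_ip`. The node
    names (node001, node002, ...) are a hypothetical naming scheme.
    """
    start = IPv4Address(first_ip)
    lines = []
    for i in range(count):
        name = f"{prefix}{i + 1:03d}"
        # IPv4Address supports integer addition, giving the next address.
        lines.append(f"{start + i}\t{name}")
    return lines


if __name__ == "__main__":
    # Print a block that could be replicated to /etc/hosts on every node.
    print("\n".join(build_hosts_entries("10.0.0.1", 4)))
```

The same list could just as well be fed to a local DNS server; the point is only that the CMT holds one authoritative list of nodes and pushes it everywhere.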

While this list of functions may seem short to someone with cluster experience, these functions are really the core of a CMT. Other functions are really nice to have but are not essential to the cluster.
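The basic “what nodes are up or down?” monitoring mentioned in the list can be sketched in a few lines of Python. This is only a sketch under assumptions: it treats a node as “up” if a TCP connection to its SSH port (22) succeeds, which is one possible test and not how any particular CMT does it. The function names are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable


def tcp_probe(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def node_status(
    nodes: Iterable[str], probe: Callable[[str], bool] = tcp_probe
) -> Dict[str, str]:
    """Map each node name to "up" or "down" according to the probe."""
    return {node: ("up" if probe(node) else "down") for node in nodes}
```

The probe is passed in as a parameter so that a different check (a ping, an IPMI query, an agent on the node) can be swapped in without changing the reporting code.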

Examples of CMTs include Platform OCS, Clustercorp ROCKS+, Microsoft Windows CCS, and Platform Manager.


Optional Components

With only this small number of required tools you actually have a basic functioning cluster. However, the cluster is not necessarily ideal and is really only suited to one or maybe two or three users. In addition, you have only limited control over and knowledge of the functioning of the cluster. Let’s review some components that are technically optional, but without which a cluster is really not production quality.

There are some components we can either add to the CMT or layer on top of the CMT. Having administered several clusters for several years, I highly recommend you seriously consider these additional components. These components are:

  • More extensive monitoring tools including a graphical view of the status of the cluster. Examples are Ganglia (http://ganglia.info/), Cacti (http://www.cacti.net/), and Nagios (http://www.nagios.org/)
  • A reporting tool that allows you to create reports about the functioning of the cluster
  • User account administration tool (this allows you to create user accounts on the entire cluster, allows the user to set their password and have it propagate to all nodes in the cluster, and to allow password-less logins to the nodes which is essential for running MPI applications)

Another component that is theoretically optional, but really, really recommended, is a job scheduler (sometimes referred to as a resource manager). A job scheduler is a queuing system that allows users to submit jobs for execution without having to be present when the jobs run. The job scheduler keeps a queue of the submitted jobs and runs them when resources (i.e. nodes) become available. Examples of job schedulers include Platform LSF, PBS-Pro, and MOAB.
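To make the queuing idea concrete, here is a toy first-in, first-out scheduler in Python. It is a deliberately simplified sketch, not how LSF, PBS-Pro, or MOAB work internally: it tracks only a count of free nodes, and a job at the head of the queue blocks everything behind it until enough nodes are free. The class and field names are made up for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int


class FifoScheduler:
    """Toy scheduler: jobs start in submission order as nodes free up."""

    def __init__(self, total_nodes: int) -> None:
        self.total_nodes = total_nodes
        self.free_nodes = total_nodes
        self.queue = deque()  # jobs waiting for nodes
        self.running = []     # jobs currently holding nodes

    def submit(self, job: Job) -> None:
        if job.nodes_needed > self.total_nodes:
            raise ValueError(f"{job.name} needs more nodes than the cluster has")
        self.queue.append(job)
        self._dispatch()

    def finish(self, job: Job) -> None:
        # Release the job's nodes and see whether waiting jobs can start.
        self.running.remove(job)
        self.free_nodes += job.nodes_needed
        self._dispatch()

    def _dispatch(self) -> None:
        # Strict FIFO: only the job at the head of the queue may start.
        while self.queue and self.queue[0].nodes_needed <= self.free_nodes:
            job = self.queue.popleft()
            self.free_nodes -= job.nodes_needed
            self.running.append(job)
```

Real schedulers add priorities, fair-share policies, and backfilling of small jobs into idle nodes, but the core loop of “queue the job, start it when nodes are free” is the same.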

Development tools, while they could be considered applications, are sometimes installed as well. These tools include compilers, editors, debuggers, libraries, etc. With Linux you can easily install the gcc tool chain that comes with pretty much every Linux distribution. With Windows, you would purchase a compiler suite such as Microsoft Visual Studio.

A very useful tool that, while optional, is highly recommended is one that allows remote access to the nodes. Almost always this is IPMI. IPMI allows you to remotely log in to a node and gather information about the node such as temperatures, and it provides other convenient commands. A common example of this is IPMITool.
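As a small illustration, the Python sketch below builds and runs an ipmitool command against a node’s BMC. The host name and user name are placeholders, and the helper function names are made up for this example. The ipmitool options used are -I lanplus (IPMI v2.0 over LAN), -H and -U (host and user), and -E (read the password from the IPMI_PASSWORD environment variable rather than the command line).

```python
import subprocess
from typing import List


def ipmi_command(host: str, user: str, *args: str) -> List[str]:
    """Build an ipmitool command line for a remote BMC.

    The password is not passed here; -E tells ipmitool to read it from
    the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]


def power_status(host: str, user: str) -> str:
    """Run `chassis power status` on the node's BMC and return its output."""
    result = subprocess.run(
        ipmi_command(host, user, "chassis", "power", "status"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ipmitool fails
    )
    return result.stdout.strip()
```

For example, `power_status("node001-bmc", "admin")` would query a hypothetical BMC named node001-bmc. Replacing the last three arguments with `"chassis", "power", "on"` or `"chassis", "power", "off"` gives the remote power control described earlier.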

Sometimes people like to have the ability to access a node even when it’s powered off. This tool usually consists of a hardware piece that is on the node itself and a software tool that runs on the master node and is used to access the node. As a generic term this tool is called a “lights out management” tool. That is, the power to the node can be off, but you can still get to the node for mounting remote media or booting the node and accessing the BIOS. Examples of this hardware vary by vendor. For example, Dell’s is called DRAC and HP’s is called iLO.


Recommended Tools/Configurations

So far I’ve listed tools that can be layered depending upon your preferences. If you asked me for a recommended set of tools, in my opinion, it would look something like the following:

  • Operating System for all nodes
  • CMT with the following features:
    • Ability to create and manage the image or set of packages that get installed to the compute nodes. Ideally, the nodes should be built as stateless nodes. They can include a hard drive if you would like one, but the node OS does not depend upon this.
    • The ability to push the image or packages to the compute nodes
    • Compute node monitoring via software and/or IPMI (BMC cards). This also provides remote power control of the nodes
    • Graphical monitoring tool
  • Job scheduler
  • Development Tools (if needed)
    • Compiler
    • Editors
    • MPI libraries
    • Scientific libraries

The lights out management is an option that is more based on the processes at your site. There is no right or wrong answer to whether you need it or not. Be sure to explore what benefits it gives you and the price. Then make a decision if you need it or not.

This set of tools is what I consider the bare minimum for a production quality cluster. Of course, I like to add other tools to it to suit my habits and processes, but to be honest, they aren’t critical to the operation of the cluster. I will leave it up to you to figure out what additional tools you might want to use.




Latest page update: made by laytonjb, Sep 4 2008, 10:08 AM EDT