HPCC Storage

For many years, people viewed storage as an afterthought in the design of clusters. But as HPCC has grown and taken on an increasingly important role in a company’s overall strategy, storage has become much more important and is usually designed into the cluster from the beginning. For smaller clusters, the storage is fairly simple. For larger or more complex systems, the storage can be as complex as the rest of the cluster, if not more so.

Clusters almost always require some sort of shared file system that allows the nodes to “see” the same files. This allows the nodes to read and/or write to the same file (having multiple nodes write to the same file at the same time over NFS can be dangerous but can be accomplished). So a shared file system is almost a mandatory part of a cluster.


Anatomy of HPCC Storage

To better discuss HPCC storage, let’s start with a basic diagram that shows the anatomy of an HPCC storage system. Figure 1 below is a basic layout of a complete HPCC storage system.

Figure 1 – Basic layout of a complete HPCC storage system


Moving from left to right in Figure 1, we begin with the actual storage media, which is connected to the data servers via a storage network. These data servers are sometimes called IO nodes, and their job is to “serve” the file system to the clients. The IO nodes are connected to the client network via a data server network, and the clients are connected to the main network via a client network. In almost all cases, the data server network and the client network are the same. Finally, the parallel file system runs on the clients and the data servers (IO nodes), and it communicates with the storage media over the storage network.

This diagram represents the most extreme case, meaning the most “pieces” that make up an HPCC storage solution. Many solutions only have a subset of these pieces. Let’s talk about some of the possible HPCC storage solutions.


NFS

One of the keys to making clusters possible at all was NFS. It is the only standard shareable file system. It allows distributed machines to share data, which enables parallel applications to read and write the same files. Since it is a standard, systems with different operating systems or different versions can all share the same data.

Using Figure 1, we can explain NFS fairly easily. The storage network is actually the PCI bus in the data server. In other words, the drives are inside or directly attached to the data server. This doesn’t have to be true, but it is the most common configuration. There is only one data server for a given NFS file system, sometimes called a filer head. Alternatively, the master node can be the data server. The data server is connected to the client network. Figure 2 below shows an NFS setup.



Figure 2 – NFS Setup

NFS has been in production for a long time on workstations, desktops, servers, and clusters. It is a well understood file system that many people use every day. The first widely released version of NFS, NFSv2, appeared in the mid-1980s and used UDP as the data transport protocol. A freely distributable version of NFSv2 was released by the University of California, Berkeley. The next generation, NFSv3, was released in about 1995 and added TCP as a transport for NFS. More recently, around 2003, NFSv4 added security to the protocol.

However, NFS has some limitations. The primary one is that an NFS file system has a single server, which forces all of the IO from the compute nodes to go through that server and creates a bottleneck. Typically the NFS server is connected to the cluster network via a link similar to that of the compute nodes (e.g. GigE, IB, Myri-10G). If N compute nodes are all writing to the same server and the server has the same kind of connection to the cluster network as each node, then this link can be a bottleneck too. However, most parallel applications have IO patterns that aren’t as severe as having every node writing at the same time.
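To put rough numbers on the single-server bottleneck, the sketch below divides the server’s link bandwidth evenly among the clients doing IO at the same time. The function name, the even-split assumption, and the 125 MB/s figure (the raw line rate of GigE, ignoring protocol overhead) are illustrative assumptions, not measurements.

```python
def per_client_bandwidth_mb_s(server_link_mb_s, active_clients):
    """Best-case share of the server link for each client doing IO at once.

    Assumes the link is split evenly and that the server's disks, CPU, and
    protocol overhead are not the limit, so real numbers will be lower.
    """
    if active_clients < 1:
        raise ValueError("active_clients must be at least 1")
    return server_link_mb_s / active_clients


GIGE_MB_S = 125.0  # 1 Gbit/s expressed in MB/s (raw line rate, an upper bound)

for clients in (1, 8, 64, 128):
    share = per_client_bandwidth_mb_s(GIGE_MB_S, clients)
    print(f"{clients:4d} clients writing at once -> {share:7.2f} MB/s each")
```

Even in this best case, 64 nodes writing at once through a single GigE link get less than 2 MB/s each, which is why the IO pattern of the application matters so much.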

An additional problem, and one that people sometimes forget, is that NFS has difficulty with multiple clients writing to the same file at the same time. For a more detailed explanation about NFS read the following link.

Most parallel applications have what is called the “root” process (the rank 0 process) perform all of the IO for the application. This means that the first node in the group of nodes that is running the application performs all of the IO. So the number of nodes that are actually performing IO to the NFS server is smaller than the number of compute nodes. Consequently, NFS is a good file system for clusters of a reasonable size. The definition of reasonable depends upon a number of factors but there have been clusters of 200-300 nodes running NFS quite successfully. But a good rule of thumb is that NFS is good for clusters up to about 64-128 nodes.
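The “root does all the IO” pattern can be sketched without any MPI library. In the simulation below, every rank computes a partial result, but only rank 0 opens the shared file. The helper names, the toy computation, and the temporary directory standing in for an NFS mount are all assumptions for illustration; in a real MPI code the list of partial results would arrive at rank 0 through a collective such as MPI_Gather.

```python
import os
import tempfile


def compute_partial(rank, size):
    """Stand-in for the real per-rank work: sum a strided slice of 0..999."""
    return sum(range(rank, 1000, size))


def run_job(size, path):
    """Simulate `size` ranks and return how many of them wrote to `path`."""
    # Every rank computes locally; none of this touches the shared file system.
    partials = [compute_partial(rank, size) for rank in range(size)]

    writers = 0
    for rank in range(size):
        if rank != 0:
            continue  # non-root ranks hand their result to rank 0 and do no IO
        # Rank 0 holds the gathered results and is the only client writing.
        with open(path, "w") as out:
            for source_rank, value in enumerate(partials):
                out.write(f"{source_rank} {value}\n")
        writers += 1
    return writers


with tempfile.TemporaryDirectory() as shared_dir:  # stands in for an NFS mount
    result_file = os.path.join(shared_dir, "results.txt")
    writers = run_job(size=8, path=result_file)
    with open(result_file) as f:
        lines = f.read().splitlines()
    print(f"{writers} of 8 ranks wrote the file; it holds {len(lines)} results")
```

From the NFS server’s point of view, an 8-rank job written this way looks like a single client, no matter how many nodes it runs on.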



Distributed Parallel Storage

However, there are times when a parallel file system is required. Some of these situations are:
  • Larger clusters
  • Clusters running applications that require large amounts of shared IO (usually parallel IO)
  • Situations requiring a very scalable storage solution for both performance and capacity

For these situations an NFS solution is not enough, and a distributed parallel storage system is needed. However, in general, the complexity of such a system is much greater than that of NFS.

Not all of the pieces shown in Figure 1 are needed for distributed parallel storage solutions. Which ones are needed really depends upon the specific solution. Making the jump from NFS to a distributed parallel file system is not easy. It really requires that you plan ahead and possibly deploy a test system to understand all of the implications as well as all of the tuning aspects for best performance.

There are several distributed parallel storage solutions available. We can divide them into two groups: traditional block-based storage and object-based storage. Under the title of block-based storage there are:

  • IBRIX
  • GPFS
  • GFS
  • GlusterFS
  • Rapidscale
  • EMC Highroad (MPFSi)
  • SGI CXFS

Under the title of object-based file systems there are:

  • Lustre
  • Panasas
  • PVFS

It is beyond the scope of this simple introduction to talk about the various pluses and minuses of each solution. To learn more go to this website and look for a series of articles on HPCC storage.

Let me finish by saying that before making any decisions about technology or solutions you need to ask yourself some very important questions.
  1. What is my goal in deploying/using a parallel file system? Is it increased performance or scalable capacity? Is it a centralized storage solution for multiple clusters?
  2. Do I or any others who will be running the system have experience with distributed parallel storage solutions?
  3. What applications will I be running and what does the IO pattern for each of them look like?
  4. How much capacity do I want to start with and how quickly will it grow?
  5. What kind of backup system do I want? Do I perhaps need to think about a scheme that does not require a backup?
  6. Do I need or want an HSM as part of the storage solution?
  7. What are my requirements for disaster recovery?
  8. How many people can I dedicate to administering the storage solution?
  9. What does the solution cost over its lifetime, not just up front? I need to know the following:
    1. How much is the initial cost? Does it include installation and possibly training?
    2. What is my cost in adding more storage?
    3. What is my yearly maintenance cost and how long can I get support at that price?
  10. What kind of network am I likely to deploy? How will the storage fit into this network?
  11. What off-site requirements do I have? (storing data off-site for emergencies)
  12. How will I transfer any existing data to a new parallel data storage system? How long will the transfer take or how long can I afford it to take?
  13. How many users will be on the system? How do they run their applications and how do they interact with the storage system?
  14. Will quotas be required?
  15. What processes are in place for handling the storage?
  16. Are there any archival requirements from users?


There are many answers to these questions and these answers can help you determine which solution or solutions fit your needs. But I highly suggest you take some time and think about these questions without focusing on the technology or solutions. Unfortunately, people have a natural tendency to become enamored with new technology (I’m included in this group!) and don’t focus enough on the practical aspects. But knowing the answers to these questions can make your life much easier in the long run (I have some stories about people who became enamored with a particular solution for whatever reason and ended up with a nightmare because they didn’t plan ahead).

Either before or after answering these questions you need to consider the applications. What are they? How do they function? What is their IO pattern? How does IO impact the performance? Then look at the problem sizes. What is your problem size today and where do you think it will be in a few years?
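One way to turn “where do you think it will be in a few years” into a number is a simple compound-growth projection. The sketch below assumes a constant yearly growth rate; the 100 TB starting point and the 40% rate are made-up inputs for illustration, so substitute your own figures.

```python
def projected_capacity_tb(start_tb, yearly_growth, years):
    """Capacity needed after `years`, assuming constant compound growth.

    `yearly_growth` is a fraction, e.g. 0.40 for 40% growth per year.
    """
    return start_tb * (1.0 + yearly_growth) ** years


# Hypothetical inputs: 100 TB today, data growing 40% per year.
for year in range(0, 6):
    need = projected_capacity_tb(100.0, 0.40, year)
    print(f"year {year}: about {need:6.0f} TB")
```

Real growth is rarely this smooth, but even a crude projection shows whether the storage you buy today has to scale by a factor of two or a factor of ten.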

After understanding your applications and their IO requirements, start walking through the plan for deploying the storage. You can start by asking who will be involved in administering the storage system and, of those, who has experience with HPCC storage (not just enterprise storage, because it is different). Be sure to find out what experience and what types of skills they have. In addition, be sure to determine how many people you can devote to administration of the storage. Next, look at your processes (I know, I hate that word as well, but it’s important). Look at what processes you have at this time for things such as:

  • Quotas
  • Use of scratch space
  • Backups
  • Migration of old files
  • Data recovery
  • Archiving
  • Off-site storage
  • Disaster recovery
  • Migration of data from one storage platform to another

As part of a review of processes, take some time to watch how your users run their jobs and use the storage. Also be sure to interview them and ask them how they run their applications now and how they would like to run them in the future.

Now that you have all of this data, sit down with various vendors, go over the results, and map out a strategy for HPCC storage. Notice that at this point no one has mentioned anything about technology or solutions. Only after you have the map in place should you begin to look at technology and solutions.


Isn’t this bass ackwards?

It seems backwards, doesn’t it? First look at every aspect of storage except the technology, and only then start to look at technologies. The reason is that when you start thinking about parallel distributed storage for HPCC, things get complicated very quickly. NFS storage is straightforward and simple if you don’t do anything really, really strange. But as soon as you start to consider massive amounts of distributed storage, plus the need to keep complexity from growing, all wrapped up in a desire to perform IO at warp speed, things can quickly overwhelm any attempt to control them if you don’t plan well.

Another reason that HPCC storage can get complicated is that there is no clear front runner. To find a good solution, you need to look at multiple vendors and multiple technologies – this takes time and introduces complexity. This also means you need to consider that you may be moving your data from one storage solution to another in several years.

For example, if you have 500 TB of storage on one vendor’s hardware and you decide to move to another vendor (or have to), then you need to plan how to move 500 TB of data. If we assume we can transfer 1 GB/s between the old storage and the new storage, it will take about 139 hours, almost 6 days. During that time the users can’t use the storage, so the cluster is effectively down. In addition, once the data is on the new storage, how can you check that it is a true mirror copy? That’s going to take additional time. If you want to restore the data from a backup to the new storage rather than copy it from the existing storage, then you will need to plan for that too (be sure to plan for the eventuality that at least one tape will be bad and you will have to copy from the existing storage).
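The arithmetic behind the 500 TB example, and one possible way to check that a copy is a true mirror, can be sketched as follows. The function names are hypothetical, the conversion assumes 1 TB = 1000 GB, and the checksum approach (a single SHA-256 digest over relative paths and file contents) is only one option among many; the two temporary directories stand in for the old and new storage.

```python
import hashlib
import os
import tempfile


def transfer_hours(data_tb, rate_gb_s):
    """Hours needed to move `data_tb` terabytes at `rate_gb_s` GB/s."""
    seconds = data_tb * 1000.0 / rate_gb_s  # assumes 1 TB = 1000 GB
    return seconds / 3600.0


def tree_digest(root):
    """Return one SHA-256 digest covering relative paths and contents under `root`."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the walk order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            digest.update(os.path.relpath(full, root).encode())
            digest.update(b"\0")  # separate the path from the file contents
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()


hours = transfer_hours(500, 1.0)
print(f"500 TB at 1 GB/s: {hours:.1f} hours ({hours / 24:.1f} days)")

# Two small directories stand in for the old and the new storage.
with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
    for root in (old, new):
        with open(os.path.join(root, "data.bin"), "wb") as f:
            f.write(b"example payload")
    copies_match = tree_digest(old) == tree_digest(new)
    print("copies match:", copies_match)
```

Keep in mind that a check like this has to read every byte on both sides, so on 500 TB it costs roughly as much time again as the copy itself.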

One aspect you need to consider is that once you start down the path with a particular vendor and a particular technology you may become locked into that technology. This is NOT a derogatory statement against storage vendors but rather an observation of the state of the industry. There are no standards for parallel distributed storage. Every vendor has their own solution and they don’t play well together. But there might be a solution on the horizon that can help solve that – pNFS.




pNFS

Currently, a number of vendors are working on version 4.1 of the NFS standard (http://tools.ietf.org/wg/nfsv4/). One of the biggest additions in NFSv4.1 is called pNFS, or Parallel NFS. When people first hear about pNFS they sometimes think it is an attempt to kludge parallel file system capabilities into NFS, but this isn't the case. It is the next step in the evolution of the NFS protocol: a well planned, tested, and executed approach to adding a true parallel file system capability to NFS. The goal is to improve performance and scalability while making the file system a standard (recall that NFS is the only true shared file system standard). Moreover, this standard is designed to be used with file-based, block-based, and object-based storage devices, with an eye towards freeing customers from vendor or technology lock-in. The NFSv4.1 draft standard contains a draft specification for pNFS that is being developed and demonstrated now. A number of vendors are working together to develop it.

One of the really attractive features of pNFS is that it avoids vendor lock-in and technology lock-in. This is in part due to pNFS being a standard (if it is approved) in NFSv4.1. In fact it will be the only parallel file system standard. So vendors who follow the standard should be able to inter-operate, which is what all customers want. So theoretically a system may have a pool of object based storage, file based storage, and block based storage, and have the pNFS clients all access this storage pool. This allows you, the customer, to choose whatever storage you want from whichever vendor you want as long as there are layout drivers for it.

So why should vendors support NFSv4.1? The answer is fairly simple. With NFSv4.1 they can support multiple operating systems without having to port their entire software stack. They only have to write a driver for their hardware. While writing a driver isn't trivial, it is much easier than porting an entire software stack to a new OS.

Parallel NFS is on its way to becoming a standard. It's currently in the prototyping stage, and interoperability testing is being performed by the various participants. It is hoped that sometime in 2008 it will be adopted as the new NFS standard and will be available in a number of operating systems. There is a website with information, links, and some code for pNFS.


HPCC Storage Summary
As you can tell, HPCC storage is not easy. In fact, it can be one of the hardest aspects of a good cluster design. But it is critical to fully utilizing the cluster and, more importantly, to helping your users.

If you learn nothing else from this section, you should learn that NFS is easy and everything else is hard. If you do need “something else” such as parallel distributed storage, then you need to take the time to seriously gather information, think about what you are doing, and develop a plan for storage. Don’t become enamored with a technology or solution without an in-depth knowledge of what your requirements are, what you want to do with the storage, and how you will administer it.

While this may scare you slightly (or bring out the inner geek in many of us), there is a ray of hope on the horizon – pNFS.




Latest page update: made by laytonjb, Sep 25 2008, 9:14 AM EDT (updated images to png format)