9-20-2008 - Do You Know What’s Going On with Your IO? - Comments


It’s been a few months since I posted some blogs about how to use strace to get an understanding of the IO pattern of your codes (serial and parallel). Those first few blogs were actually reprints of some articles I wrote for Linux Magazine at the beginning of the year. So I thought I would take some time to review what you can do to improve your understanding of the IO patterns of your applications.

The Storage Fairy Strikes

I promised myself when I started writing this blog that I would not be the kind of blogger who complains or whines about a topic. I wanted to be the kind of blogger who writes about interesting new technology and solutions to problems. I’m going to keep this promise, but only after I break it a little :).

I’ve seen many customers coming to Dell or posting public RFPs that have what I consider to be very poorly written requirements for storage. Moreover, these requirements appear to be based on almost nothing. I know, I know: the customer is always right. But sometimes the customer needs to be made aware of how their applications dominate the design of the storage. Not knowing the first thing about your application, yet asking for a very specific storage solution, is really committing storage suicide. Other times, the customer may just need to take some time to think about what problem they are trying to solve and whether their requirements actually solve it.

For example, I’m seeing a sharply increasing number of HPCC customers who are asking for unprecedented amounts of storage. It has now become common to ask for at least 200-500 TB for smaller systems (64 nodes and smaller), and to ask for multiple Petabytes for larger systems. Plus, from what I can tell, these requests are coming from multiple fields: engineering, chemistry, bioinformatics, physics (mostly LHC projects), etc. These are not Web 2.0 opportunities (that’s a whole other long set of blogs :) ). It’s like the storage fairy whispered into the ears of customers during the night, “You need massive amounts of really fast storage.”

Everyone is absolutely and grimly determined that they have to have this amount of storage in place immediately. In addition, most of them require that this storage be on-line all of the time, the aggregate throughput of the cluster be in the multiple GB/s range (or higher), and that the storage system has five 9’s of reliability and can be expanded into the Petabyte range. Oh, and they want it really, really cheap.

I understand the cheap part having been a professor with limited research budgets, but the other two parts can be 90° opposed to the requirement of cheap. Let’s examine these two requirements – performance and capacity – and see if we can’t shed some light (or logic) on requirements.


To Infinity and Beyond! – The Need for Speed

One of the tenets of HPCC storage is performance. Understandably, customers want storage that is really fast, assuming that faster storage will make their applications run faster than slower storage. That makes sense and seems perfectly logical, but two very important questions, “how much faster will my application become with faster storage?” and “how much does it cost?”, go unanswered in this assumption.

For example, I know an application that scales extremely well and is in production use at a number of companies. About 10% of the wall clock time of this application is spent doing IO and 90% is spent doing computation. This measurement is done when standard NFS over GigE is used. So naturally people will assume that NFS over GigE is a bottleneck for an application that may be running on 200-500 cores using Infiniband. But let’s look a little deeper at this.

If you could have infinitely fast storage, the most you could cut from the run time is 10%. But what does infinitely fast storage cost? Let’s use some simple estimates. NFS over GigE is pretty inexpensive. One estimate is that it is well below $1/GB depending upon the storage and the features. Parallel distributed storage is in the price range of $3-$5 per GB. So it is at least 3-5 times more expensive than simple NFS over GigE.

The performance of parallel distributed storage relative to NFS over GigE can vary. But for this application, the estimates are something of the order of 10 times faster. You may argue that something like Lustre over DDR IB is 100+ times faster than NFS over GigE. But remember that this speed factor is tempered by the application itself (more on that later). So a factor of 10 is about what is usually seen.

So I can spend 3-5 times more money on storage and get 10 times better storage performance. But what does this mean in application performance? It means that instead of 10% of the time spent doing IO, it now spends about 1% of its time doing IO. So I saved 9% of the wall clock time by spending 3-5 times more on storage. Does this make sense?
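The arithmetic above can be written down in a few lines. This is only a sketch of the estimate, assuming that the IO portion of the run is the only part that speeds up and that the compute portion is untouched; the function name is mine and not from any tool.

```python
def runtime_fraction(io_fraction: float, storage_speedup: float) -> float:
    """Return the new wall clock time as a fraction of the original.

    Assumption: only the IO portion gets faster; compute time is unchanged.
    """
    compute = 1.0 - io_fraction
    return compute + io_fraction / storage_speedup


if __name__ == "__main__":
    # Numbers from the example: 10% of the run is IO, storage is 10x faster.
    new_time = runtime_fraction(io_fraction=0.10, storage_speedup=10.0)
    print(f"new run time: {new_time:.2%} of the original")
    print(f"wall clock time saved: {1.0 - new_time:.2%}")

    # Upper bound: infinitely fast storage removes the IO time entirely.
    best = runtime_fraction(io_fraction=0.10, storage_speedup=float("inf"))
    print(f"best possible run time: {best:.2%} of the original")
```

Plugging in your own IO fraction is the useful part: the smaller it is, the less any storage upgrade can buy you.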

An alternative to buying faster storage is perhaps to buy more nodes. If NFS storage is 10% of the total cost, then switching to a faster storage system can increase this to about 30% of the total cost. So for this amount of money I’ve increased the performance of the application by 9%. Wouldn’t it perhaps be better to buy more nodes instead of faster storage? As I mentioned, the application scales very well, so if I add just 10% more nodes, then I’ve gotten more performance improvement than I would by spending 3-5 times more on storage.
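Here is a rough sketch of that comparison. It leans on two assumptions that are mine: the application scales ideally (run time is inversely proportional to node count), and everything in the budget that is not storage is nodes. Real codes scale less than ideally, so treat the output as an upper bound for the "more nodes" option.

```python
# From the example above: with 10x faster storage, IO drops from 10% to 1%
# of the original run time, so the run takes 91% of the original time.
STORAGE_UPGRADE_RUNTIME = 0.91


def runtime_with_more_nodes(node_increase: float) -> float:
    """Run time relative to the original after adding nodes.

    Assumption: ideal scaling, which real applications only approximate.
    """
    return 1.0 / (1.0 + node_increase)


def affordable_node_increase(storage_cost_share: float,
                             storage_cost_multiplier: float) -> float:
    """Fraction of extra nodes the storage upgrade money would buy instead.

    Assumption: everything in the budget that is not storage is nodes.
    """
    extra_budget = storage_cost_share * (storage_cost_multiplier - 1.0)
    node_cost_share = 1.0 - storage_cost_share
    return extra_budget / node_cost_share


if __name__ == "__main__":
    ten_percent = runtime_with_more_nodes(0.10)
    print(f"faster storage: {STORAGE_UPGRADE_RUNTIME:.3f} of original run time")
    print(f"10% more nodes: {ten_percent:.3f} of original run time")

    # Storage is 10% of the cost and the faster system costs 3x as much.
    extra_nodes = affordable_node_increase(0.10, 3.0)
    same_money = runtime_with_more_nodes(extra_nodes)
    print(f"same money spent on nodes buys {extra_nodes:.0%} more nodes, "
          f"run time {same_money:.3f} of original")
```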

What’s the moral of this story? (And it’s not a fable but an actual case.) The moral is,

Don’t focus on the performance of the individual components of the application, but focus on the overall application performance.



Know Your Applications!

I don’t want to get too far ahead of myself, so the first thing I want to talk about is knowing the IO pattern of your codes. Before we get too far into the discussion, let’s start by describing what I mean by an IO pattern.

An IO pattern is simply how a code performs IO. But it’s much more detailed than just a simple, “my code reads data and writes data.” These details can tell you such things as, what kind of storage system you need, what kind of performance you are likely to achieve on the IO side, how much you can impact your overall performance, and even how you can modify your code to make it faster from an IO perspective and an overall perspective.

So what metrics or measurements make up an IO pattern? The first thing you need to determine is how your code is performing IO. In particular, do all of the processes read input data and write output data? (I’m assuming parallel MPI based codes in this case). Or do your applications have a single process (usually the rank 0 process) do all of the reading and writing? Or does a small subset of the nodes in the application perform the IO? I cannot stress enough the fundamental importance of this question. Even if you don’t know anything else about the IO pattern of your codes, know which processes perform IO. The difference in the architecture of your storage system and the resulting performance, between applications that have one node performing IO and all nodes performing IO, is tremendous! Let’s look at a simple example.

Let’s assume you have a fairly good size cluster: 256 nodes with 8 cores each (2,048 cores). If you have an application where all of the processes perform IO (either read or write), then you could have 2,048 cores hitting the storage system at the same time (I’m assuming one MPI process per core). On the other hand, let’s assume you have an application where the rank 0 process performs all of the IO, and that each run uses 128 cores (16 nodes). The cluster can then hold at most 16 runs at once (2,048 / 128), so you have a maximum of 16 cores performing IO at any one time. That’s 128 times fewer cores hitting the file system. That can make a huge difference in how you design and select a storage system (regardless of the capacity).
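The count is simple enough to do in your head, but a small sketch makes it easy to try other cluster and job sizes. It assumes one MPI process per core and a cluster completely filled with identical runs; the function name is just for illustration.

```python
def concurrent_io_processes(total_cores: int, cores_per_job: int,
                            io_procs_per_job: int) -> int:
    """Worst-case number of processes hitting storage at the same time.

    Assumptions: one MPI process per core, and the cluster is completely
    filled with identical jobs.
    """
    jobs = total_cores // cores_per_job
    return jobs * io_procs_per_job


if __name__ == "__main__":
    total_cores = 256 * 8  # 256 nodes with 8 cores each

    # Every process in every 128-core run performs IO.
    all_ranks = concurrent_io_processes(total_cores, 128, 128)
    # Only the rank 0 process of each 128-core run performs IO.
    rank0_only = concurrent_io_processes(total_cores, 128, 1)

    print(f"all ranks do IO: {all_ranks} processes")
    print(f"rank 0 only:     {rank0_only} processes")
    print(f"ratio:           {all_ranks // rank0_only}x")
```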

The second thing you need to know is, generally, how much time your application spends doing IO versus the total run time. A good baseline is to test the application using local storage (i.e. the drives in the nodes). If you can, instrument your code to measure the amount of wall clock time it spends doing IO and the total wall clock time. If it’s an ISV code, press the company to tell you about the IO pattern of the code and how much time is spent doing IO versus computation.
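If your code is yours to modify, the instrumentation can be very light. Below is a minimal sketch in Python of the idea: wrap every IO section in a timer and compare the accumulated IO time with the total wall clock time. The IOTimer class is hypothetical, something I made up for this example, and the same approach carries over to C or Fortran with whatever wall clock timer you already use.

```python
import os
import tempfile
import time
from contextlib import contextmanager


class IOTimer:
    """Accumulates wall clock time spent inside sections marked as IO."""

    def __init__(self):
        self.io_seconds = 0.0
        self.start = time.perf_counter()

    @contextmanager
    def io(self):
        # Time everything executed inside the "with timer.io():" block.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.io_seconds += time.perf_counter() - t0

    def report(self):
        """Return (io_seconds, total_seconds, io_fraction)."""
        total = time.perf_counter() - self.start
        fraction = self.io_seconds / total if total > 0 else 0.0
        return self.io_seconds, total, fraction


if __name__ == "__main__":
    timer = IOTimer()

    # "Computation" phase: stand-in for the real work of the application.
    checksum = sum(i * i for i in range(200_000))

    # IO phase: write 1 MB to a scratch file in the local temp directory.
    with timer.io():
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(b"\0" * (1024 * 1024))
            path = f.name
    os.remove(path)

    io_s, total_s, frac = timer.report()
    print(f"IO: {io_s:.4f} s of {total_s:.4f} s ({frac:.1%})")
```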

If you know these two things, even though they sound insignificant, you will be ahead of about 75% of all cluster people I meet. Even better, you will be in a reasonable position to start intelligently looking at storage. But your education has just begun. Let’s move on to more advanced, yet just as critical, topics.


Strace is Your Friend (In Case You Didn’t Know)

At this point we only have a basic idea of what’s going on with your application. We should know how it does IO from the standpoint of how many processes perform IO, and we have an idea of how much time is spent doing IO. But we don’t understand the details of how the IO is performed. This is where strace is your friend.

How do you determine how and when your application does IO? Ideally you can fully instrument your code to spit out this information. But this can be a pain since you may have to insert lots of code into the application. An easier way is to use strace. It’s a virtual certainty that your code performs IO through the read() and write() functions that come either with the compiler or, more probably, with glibc, and these end up as system calls that strace can track. Strace can give you information such as how many bytes were involved in each IO call and how much time was spent performing it. So very quickly you can extract some really good information about your application.
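As a sketch of how to get at this information, you could run something like "strace -ttt -T -e trace=read,write,lseek -o app.strace ./app" (here ./app and app.strace are placeholder names). The -ttt option puts a timestamp in seconds at the start of each line and -T appends the time spent in the call in angle brackets. The small parser below assumes that line layout and a single process (no pid column); the sample lines are made up for illustration, not taken from a real run.

```python
import re
from typing import NamedTuple, Optional


class Syscall(NamedTuple):
    timestamp: float  # seconds since the epoch (from -ttt)
    name: str         # e.g. "read", "write", "lseek"
    args: str         # raw argument text, left unparsed
    result: int       # return value: bytes transferred, or offset for lseek
    seconds: float    # time spent inside the call (from -T)


# Assumed layout: "<timestamp> <name>(<args>) = <result> ... <<seconds>>"
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)"
    r".*?<(?P<dur>\d+\.\d+)>\s*$"
)


def parse_line(line: str) -> Optional[Syscall]:
    """Parse one strace line; return None for lines that do not match
    (signals, exit notices, interrupted calls, lines without a duration)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Syscall(float(m["ts"]), m["call"], m["args"],
                   int(m["ret"]), float(m["dur"]))


# Hypothetical sample lines in the assumed format.
SAMPLE = """\
1221936480.000100 read(3, "abc"..., 65536) = 65536 <0.000210>
1221936480.500000 lseek(3, 0, SEEK_SET) = 0 <0.000008>
1221936481.000000 write(4, "xyz"..., 1048576) = 1048576 <0.004000>
+++ exited with 0 +++
"""

if __name__ == "__main__":
    for text in SAMPLE.splitlines():
        call = parse_line(text)
        if call is not None:
            print(f"{call.name:6s} result={call.result:8d} "
                  f"time={call.seconds:.6f} s")
```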

One insight you can obtain from this information is the size of the IO functions (read or write). How large are the reads and writes? If the size of the read or write is small, then your application is probably doing lots of small IO operations. On the other hand, the application could be doing large reads or writes. The line between “small” and “large” is somewhat arbitrary, but there are some who believe anything below 1MB in a single read or write function is small and anything above that is large. I think this estimate is a bit generous and consider anything below perhaps 64KB to be small, anything from 64KB to 1MB to be “medium” and anything above 1MB to be large.
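Once you have the byte counts from the strace output, sorting them into these classes is a few lines of code. This sketch uses my own 64KB and 1MB boundaries from above; adjust them if you draw the lines elsewhere.

```python
from collections import Counter

KB = 1024
MB = 1024 * KB


def size_class(nbytes: int) -> str:
    """Classify one read or write by the number of bytes it moved."""
    if nbytes < 64 * KB:
        return "small"
    if nbytes <= 1 * MB:
        return "medium"
    return "large"


def summarize(sizes) -> Counter:
    """Count how many IO calls fall into each size class."""
    return Counter(size_class(n) for n in sizes)


if __name__ == "__main__":
    # Hypothetical byte counts, e.g. the return values of read/write calls.
    sizes = [512, 4096, 65536, 1048576, 4 * 1048576]
    for label, count in summarize(sizes).items():
        print(f"{label:6s} {count}")
```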

From this information you can begin to create what you might call a “history” of the IO of the application, that is, the read or write operations as a function of time. A simple plot of the number of read and write operations and the amount of IO performed versus time can give you a great deal of information about what your application is doing. If you group the operations into time intervals, this kind of plot is commonly called a histogram. You want to look at the plot for any “patterns” of IO. For example, is a reasonable amount of IO done every so often in the application, or is there a bunch of read operations done initially and then a bunch of writes at the end of the application? This information can help you determine the level of IO performed (you can plot number of bytes versus time). It can also tell you if the IO activity is constrained to the beginning and end of the run or if there is IO activity during the entire run.
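One way to build this history is to group the read and write calls into fixed time intervals and add up the operations and bytes in each one. The sketch below assumes you already have (timestamp, call name, bytes) tuples, for example from parsed strace output; the event data shown is invented for illustration.

```python
from collections import defaultdict


def io_history(events, interval=1.0):
    """Group read/write events into time bins of `interval` seconds.

    `events` holds (timestamp, name, nbytes) tuples. Returns a list of
    (bin_start_offset_in_seconds, counters) sorted by time, where the
    offset is measured from the first event.
    """
    events = list(events)
    if not events:
        return []
    start = min(ts for ts, _, _ in events)
    bins = defaultdict(lambda: {"read_ops": 0, "read_bytes": 0,
                                "write_ops": 0, "write_bytes": 0})
    for ts, name, nbytes in events:
        if name not in ("read", "write") or nbytes < 0:
            continue  # skip other calls and failed calls
        index = int((ts - start) // interval)
        bins[index][name + "_ops"] += 1
        bins[index][name + "_bytes"] += nbytes
    return [(index * interval, dict(bins[index])) for index in sorted(bins)]


if __name__ == "__main__":
    # Hypothetical events: reads at the start, one large write later on.
    events = [
        (100.0, "read", 4096),
        (100.2, "read", 4096),
        (103.5, "write", 1048576),
    ]
    for offset, counts in io_history(events, interval=1.0):
        print(f"t+{offset:4.1f}s  reads={counts['read_ops']} "
              f"({counts['read_bytes']} B)  writes={counts['write_ops']} "
              f"({counts['write_bytes']} B)")
```

Feeding the resulting bins to any plotting tool gives you the picture of IO versus time; empty stretches between bins are where the application is only computing.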

In addition to the “pattern” of reads and writes, another very important aspect to know and understand is how the file pointer is moved in between reads and writes. This appears in the strace output as lseek or llseek calls. The reason these are important is that typically when a file pointer is moved from its current location, either forward or backward within the file, the IO operations are interrupted, reducing the IO throughput. As an example, if you are writing a fair amount of data to a file and then interrupt the write to move the file pointer, then you have greatly increased the latency during that time and have also reduced the throughput, the MB/s, so to speak. So this impacts performance and the time it takes to run the application.

There are certain applications that do a tremendous amount of “seeking”. One example that stands out is a class of applications called Finite Element Method (FEM) applications. These applications will typically write some data to the file system then move the file pointer backward, read some data, move the file pointer backward, read some data, and so on. Some applications will even move the file pointer backward, read some data, and then write some data. All of this file pointer movement, reading and writing relatively small amounts of data, can create havoc on IO performance.
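To get a feel for how much seeking a code does, you can walk through the calls in trace order and compare each lseek result with where the file pointer would otherwise be. This is a sketch under some simplifying assumptions of mine: every file descriptor starts at offset 0, only plain lseek is handled (its return value is the new offset), and anything like append mode or pread/pwrite is ignored. The call sequence is a made-up imitation of the write, seek back, read pattern described above.

```python
def seek_summary(calls):
    """Count reads, writes, and forward/backward seeks.

    `calls` holds (name, fd, result) tuples in trace order, where result
    is the return value: bytes moved for read/write, new offset for lseek.
    Assumption: each file descriptor starts at offset 0.
    """
    position = {}  # file descriptor -> tracked file pointer offset
    summary = {"reads": 0, "writes": 0,
               "forward_seeks": 0, "backward_seeks": 0}
    for name, fd, result in calls:
        if result < 0:
            continue  # failed call, file pointer did not move
        current = position.get(fd, 0)
        if name in ("read", "write"):
            summary[name + "s"] += 1
            position[fd] = current + result
        elif name == "lseek":
            if result > current:
                summary["forward_seeks"] += 1
            elif result < current:
                summary["backward_seeks"] += 1
            position[fd] = result
    return summary


if __name__ == "__main__":
    # Hypothetical pattern: write, seek back, read, seek forward, write.
    calls = [
        ("write", 3, 100),
        ("lseek", 3, 0),
        ("read", 3, 50),
        ("lseek", 3, 80),
        ("write", 3, 20),
    ]
    print(seek_summary(calls))
```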

Having this information about the IO pattern of your application is a gold-mine when trying to find a storage solution that fits your application profiles. Describing how one can take this information and turn it into an “optimized” storage solution is the subject of a book and in many cases is really an art rather than a science. So, I will leave that to other blogs and projects.


Phasing in Storage

One of the very interesting requests that customers will sometimes make is for massive amounts of storage to be installed all at once for a new project. What they are saying is that they are quite capable of filling up hundreds of terabytes of storage in an extremely short period of time. For example, they might state that they need 500 TB of storage for a 128 node cluster when the applications haven’t even been written yet! (This example comes from a real customer.) They can also claim they need this much storage when the research teams are just forming.

What I think these customers really need to understand is that filling up hundreds of terabytes in a few weeks actually takes a huge amount of work. Only if the applications do almost nothing but generate data (no computing, no reading, just generating data) might they possibly be able to fill up hundreds of terabytes. Yes, there are exceptions, such as cases where the data is coming from an instrument like a large physics experiment in Switzerland and France (LHC). But for the other 99% of customers, creating this much data that quickly is extremely difficult, if not impossible.

What they really need to consider is how fast storage grows. If a researcher states they need 100 TB now, then they are likely to need more in the future because problem sizes always grow. For example, several companies that make CFD (Computational Fluid Dynamics) applications estimate that the size of CFD problems doubles every year. Since the file size is directly proportional to the problem size, the amount of storage could easily double every year. So if you need X amount of storage this year, then you will need 2X the following year and 4X the third year and so on. Therefore, in addition to knowing your capacity requirements at the initial point in time, you should also try to estimate your storage growth path. This really boils down to,

Know your storage requirements as a function of time.
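As a sketch, the growth path from the CFD example (X, then 2X, then 4X) looks like this. The doubling factor comes from the estimate above; your own growth factor is something you have to measure or estimate for your applications.

```python
def capacity_by_year(initial_tb, growth_factor, years):
    """Capacity needed in each year, starting with the first year.

    Assumption: the need grows by a constant factor every year.
    """
    return [initial_tb * growth_factor ** year for year in range(years)]


if __name__ == "__main__":
    # 100 TB needed now, doubling every year, looked at over three years.
    for year, tb in enumerate(capacity_by_year(100, 2, 3), start=1):
        print(f"year {year}: {tb} TB")
```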

This becomes important for a simple reason: storage gets cheaper with time. It also gets faster and has more capacity as time moves forward. So if you buy only the storage you need now, then in a year or so you may be able to buy even more capacity and more performance for the same money. This saves you money and can also give you more performance.
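To put rough numbers on the idea, here is a sketch comparing buying everything up front with buying each year only what that year requires. The price per TB and the yearly price decline below are purely hypothetical placeholders, not market data; substitute quotes and trends you trust.

```python
def upfront_cost(capacities_tb, price_per_tb):
    """Cost of buying the largest required capacity at today's price."""
    return max(capacities_tb) * price_per_tb


def phased_cost(capacities_tb, price_per_tb, annual_price_decline):
    """Cost of buying each year only the extra capacity that year needs.

    Assumption: the price per TB falls by a constant fraction every year.
    """
    total = 0.0
    owned = 0.0
    for year, needed in enumerate(capacities_tb):
        to_buy = max(0.0, needed - owned)
        price = price_per_tb * (1.0 - annual_price_decline) ** year
        total += to_buy * price
        owned += to_buy
    return total


if __name__ == "__main__":
    need = [100, 200, 400]  # TB needed in years 1, 2, 3 (doubling yearly)
    price = 1000.0          # hypothetical dollars per TB today
    decline = 0.25          # hypothetical 25% price drop per year

    print(f"all at once: ${upfront_cost(need, price):,.0f}")
    print(f"phased:      ${phased_cost(need, price, decline):,.0f}")
```

The exact savings depend entirely on the numbers you plug in, but the phased purchase also leaves room to buy faster storage later if the applications turn out to need it.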


Rinse and Repeat

This blog is already long enough, so I want to go back over the steps that I’m recommending you follow to understand the IO in your application. The first one is fairly simple,


1. Don’t focus on the performance of the individual components of the application, but focus on the overall application performance.

2. Do all of the processes read input data and write output data?

3. How much time does your application spend doing IO versus the total run time?

4. One insight you can obtain from strace information is the size of the IO functions (read or write).

5. From this information you can begin to create what you might call a “history” of the IO of the application - that is, the read or write operations as a function of time.

6. How is the file pointer moved in between reads and writes?

7. Know your storage requirements as a function of time.


With these few bits of information, you are ahead of 99.9% of customers in being able to intelligently examine HPCC storage solutions. As the famous character said, “Learn It, Know It, Live It” (I’ll leave the attribution of this quote to a trivia contest).

Jeff



Latest page update: made by laytonjb, Sep 20 2008, 2:48 PM EDT
Keyword tags: cluster HPCC IO Storage strace