Version User Scope of changes
Jun 2 2008, 12:46 PM EDT laytonjb
Jun 2 2008, 12:40 PM EDT laytonjb 1873 words added, 2 photos added

Changes

Key:  Additions   Deletions
Topics for the Fast Paced Life of High Performance Cluster Computing - HPCC RSS Feed



Start Blog Entry
Wither Virtualization in HPCC
06/02/2008 - Comments

I think it’s very interesting that when a new technology comes out, people immediately assume it’s a solution for all problems. The phrase that comes to mind is, “When you have a hammer, everything looks like a nail.” In some ways, I think Virtualization has become the hammer and people are searching around for nails (or something like looks like a nail). Recently, many people think that HPCC looks like a Virtualization nail.

I don’t want to be critical of Virtualization because it has been something of a revolution for reducing costs within a data center focusing on enterprise computing. One the reasons that Virtualization works well in the enterprise world is that the utilization of the hardware is fairly low – less than 50% in many cases. In HPCC the utilization of the hardware is usually well over 90%. What is interesting about HPCC is that even though the utilization of the hardware is over 90%, the demand is usually much higher with jobs waiting in the queue for the appropriate resources to be free. In general, the resource manager will try to schedule jobs to utilize as much of the hardware as possible but in some cases there may not be enough free resources to run a job, so it will hold the jobs until the requested resources are available. Consequently, it looks like the hardware is only being used at some level less than 100% (for example, about 90%), but in fact, the demand is much higher than 100%.

So the idea of using Virtualization in HPCC to increase productivity by consolidating under utilized resources won’t work. The simple fact is that virtually all HPCC systems are either fully utilized or over subscribed (if you have an HPCC system that is under utilized, please contact me. I know lots of people who are dying for compute cycles J ). But this doesn’t mean there aren’t some good things that Virtualization could do for HPCC.


Possible Applications of Virtualization in HPCC
I can think of three aspects of Virtualization that have some potential for HPCC. The first one is the idea of using Virtualized hardware on the compute nodes for running user selected distributions. I know this sounds funny, but let me explain. A typical cluster has a set of compute nodes that are almost always homogeneous. That is, they are identical in virtually all respects including hardware and software. But there could be times when you have an application that is built for one specific OS, or one specific kernel, or has some software dependency that can’t be satisfied by what is already available on the compute nodes. What do you do in these cases?

Many times people will solve this problem by setting up a separate cluster for the applications that have specific software requirements. But this can get expensive. I know one organization that has 6 applications, all with different sets of software requirements. Imagine having to construct 6 different clusters, one for each application? So how do we get around this?

One idea from Virtualization that could solve this problem is to use Virtual Machines (VM) for running the appropriate software. In this scenario, the nodes run a host OS on the compute nodes (or equivalently, a hypervisor is run on the compute nodes). When a user submits a job to the resource manager they specify what OS they want or what kernel they want, etc., as part of the job. When the job runs, the resource manager tells the compute node what software needs to be run, and the appropriate software is installed inside a VM. Then the job runs inside the VM and when it is finished the VM is dropped and the node is ready for the next job. This concept allows you to run mixed applications such as Linux and Windows applications on a single cluster, or for applications that need a certain OS that isn’t on the cluster. But, as with everything in life, there is no such think as a free lunch. This is true for this scenario.

The problem lies with applications that are running inside the VM and need to access the hardware, such as IO and network. Plus I am going to assume that these HPCC applications are probably parallel and will run across several nodes, most likely using MPI. If the application that is running in the VM needs to access a high-speed network card for sending messages for example, it will have to contact the host OS which then contacts the card on behalf of the VM. The host OS becomes a middleman which reduces performance and can greatly increase latency. The same is true for accessing disks within the node. From what I’ve seen, the performance degradation used to be on the order of 50% for highs-peed network card (i.e. you got 50% worse performance when running codes that used the high-speed network card, running under a VM, than not running in a VM). Lately, I’ve seen that improve to about 30% degradation. There are various companies who are claiming that they have drivers that allow the VM’s to directly access the hardware. However, I have not seen any benchmarks on these drivers at this time (in on case I know a company that claimed native performance drivers 2 years ago and they have not published any benchmarks to this day). So, the problem of accessing the hardware from the VM is limiting the use of this concept.

A second concept from Virtualization that could be useful for HPCC is the idea of being able to “move” a running process from one node to another node. In the VMWare world this is referred to as Vmotion (it is called something else for Xen and other virtualization tools). The idea is to be able to move a VM from one physical set of hardware to another while the VM is still doing work. Many people say they would like to be able to do this if they find that a node that is part of a job that is failing. However, it is not too likely that you could find a job on a node that is failing and move it before the node actually fails. But the idea of moving a VM could be useful for maintenance. That is, scheduling some nodes for maintenance and then moving the VM’s when the maintenance window opens so you can perform maintenance on the nodes. But overall, there are some problems with the approach of moving VM’s when running HPCC jobs.

Again, I’m assuming that for HPCC the codes are likely to be MPI based codes. One problem is that MPI codes should be “pinned” to cores for the best possible performance (people always want more performance). But according to VMWare, it is not a good idea to pin processes to a specific core because trying to move them may not work because the target node may not completely match the originating node. In addition, they say that having pinned processes can inhibit the movement of the VM.

Perhaps more importantly, when you move the VM you need to stop all of the message passing on the network for the VM that is being moved (both sending and receiving messages). You also need to stop all of the IO traffic from the VM that is being moved. Only then can you start to move the VM. In addition, you will need to make sure that the messages and IO traffic from the originating node are moved to the target node. This is a difficult problem for any VM to accomplish. In a recent test, a single node that was doing some local IO was moved to another node. It took over 20 minutes to move the VM. Imagine trying to do this for a job that is running across multiple nodes, performing message passing, and possibly IO at the same time. So at this time, VM movement is not a good option for HPCC.

A third concept from Virtualization that could be useful for HPCC is based on the idea of using Virtual Machines for checkpointing or restarting applications. One of the Holy Grails for HPCC has been the idea of checkpoint/restart that is independent of the application. A checkpoint is basically a snapshot of the progress of the code. It is a capture of the state of the computation of the node. The reason people want to checkpoint is that if a node goes down and the application fails, you can restart the application from the last checkpoint. Otherwise you have to restart the application from the beginning.

What Virtualization offers is that ability to very easily create a checkpoint since the application is running in a VM, which is only software. So it’s relatively easy to create a checkpoint of the state of the VM and write it to storage. But again, this faces the same problem of making sure the VM is “quiet” before creating the checkpoint.

The fundamental problem is how to “quiet” the system prior to creating the checkpoint. This includes stopping the CPU and what it is doing, stopping all messaging passing, IO, making sure all of the buffers are flushed, etc., and then dumping the state of the VM to a file on some storage. There have been a few companies who have tried to do this for clusters and have failed. There is also a new company who is trying to do this as well. But, fundamentally, this is a very difficult problem.


Summary
Virtualization is becoming something of a revolution for enterprise class IT. It allows the number of servers to be reduced and increase the utilization of the remaining servers. However, since it is seen as panacea of sorts, people are trying to apply it to every conceivable sector of IT, including HPCC. At this time I can see three possible ways for Virtualization to impact HPCC:
  1. Using Virtualization to allow to select an OS distributions and/or other software requirements and have the assigned compute nodes run this software.
  2. Using Virtualization to move processes from one node (originating node) to another node (target node).
  3. Using Virtualization to easily create checkpoints.

While these 3 concepts sound and appear to be easy, they are, in fact, very difficult to achieve in HPCC. The first concept, being able to boot what OS or distribution you need as part of the job, is very attractive to a number of people. But achieving this goal, while keeping good performance, has not yet been achieved. The second concept, moving VM’s from machine to another, is extraordinarily difficult in HPCC because many applications heavily use the network and/or the storage (IO). The third concept, using VM’s to quickly create checkpoints, also suffers from the problem of being very difficult because of the heavy use of networking and/or storage.

So for right now, it appears that Virtualization really doesn’t have a place in HPCC. That doesn’t mean it can’t or won’t happen some day. But for now, Virtualization in HPCC is still something of a non-starter. Sorry, but HPCC isn’t a Virtualization “nail”.


Jeff - Comments
End Blog Entry

Start Blog Entry
HPCC is Fastest Growing Sector of IT
Who would have thought?
05/27/2008 - Comments

HPCC is Fastest Growing Sector of IT

Just a quick blog (really a note) that HPC is now recognized as the fastest growing sector of IT. This article from eWeek points out that HPC is was a $11.5 Billion per year business in 2007 and is growing at 19.5%. Earl Joseph from IDC says that x86 servers are the dominant force in data center right now (about 2/3 of all servers). On the HPC side of the house, Linux is clearly the dominant OS with Unix being a very distant second place (Linux is replacing Unix in the HPC sector rather than Unix replacing Unix). He also points out that the rush to commoditization has driven the prices down. For example, Dr. Joseph says,


  • x86 systems cost about $2,000 per processor (not sure what he means by a processor)
  • RISC based processors go for about $7,800 per processor
  • Vector based processors go for about $54,000 per processor (ouch!)
  • Itanium costs about $9,000 per processor


One thing he points out that I think many people miss is that in many ways HPC is the R&D arm of IT – both on the vendor side and the customer side. HPC is on the cutting edge of technology for IT and provides prototype solutions for future IT systems. For example, HPC encountered power and cooling problems years ago – well before today’s “crisis” in power and cooling in IT systems. HPC was also the first sector of IT to focus on x86 systems. HPC has also been very focused on large storage systems, in particular parallel storage systems with good data protection and reliability, well before the rest of the IT sector has even thought about it. Only now is enterprise-class IT even thinking about storage in the Petabyte range.

Now, let me also say that I’m seeing efforts to move enterprise-class IT capability into HPC. Since HPC is becoming more common IT is looking at it as just another tool. But at the same time, this new technology has to conform to the practices of IT. Otherwise, it becomes too difficult for IT to integrate HPC into the mainstream and HPC could be relegated to the fringe even longer. So what is HPC missing that IT wants, or more importantly, needs? That is really the subject of entire blog (actually a long, long blog). But here are some things I think need to be added to HPC for HPC to better accept them:


  • Bare metal backup (disaster recovery)
  • More IT like monitoring (HPC has monitoring but it needs to be more IT like)
  • Much better alerting tools
  • Easier to use tools to setup clusters
  • Reporting tools (I find this to be a severe problem in HPC)
  • Integration with NIS, LDAP, and ActiveDirectory
  • Environment Modules (see last blog)
  • Someone to provide support for all pieces of the cluster software stack including an open-source pieces

These are just a few things I think are needed. There are actually many, many more, but as I said, that’s the subject of another blog (or two or three). However, what IT like feature do you think are missing from cluster tool kits?



Jeff - Comments
End Blog Entry



Start Blog Entry
Environment Modules:
Scratching the Multi-MPI, Multi-Compiler Itch
05/14/2008 - Comments


When people first start using clusters they tend to stick with whatever compiler and MPI library came with the cluster when it was installed. As they become more comfortable with the cluster, using the compilers, using the MPI libraries, they start to look around at other options. Are there other compilers that could perhaps improve performance? Similarly, they may start looking at other MPI libraries – can they help improve performance? Do other MPI libaries have tools that can make things easier? Perhaps even more importantly, these people would like to install the next version of the compilers or MPI libraries so they can test them with their code(s). So this begs the question, how do you have multiple compilers and multiple MPI libraries on the cluster at the same time and not get them confused? I’m glad you asked.


Caveman Approach
I don’t want to get the cavemen angry with me, but the “uneducated” approach to using multiple compilers and MPI libraries on a cluster is, shall we say, kludgy. The basic idea is to change your $PATH in the .bashrc (if you are using bash), and then log out and log back in whenever you need to change your compiler/MPI combination. Initially this sounds like a pain and it is, but it works – to some degree. It doesn’t work in the situation where you want to run multiple jobs each with a different compiler/MPI combination.

For example, let’s say I have a job using the gcc 4.2.1compilers using Open MPI. Then I have a job using gcc 4.2.1 and MPICH2. If I have both jobs in the queue at the same time, how can I control my .bashrc to make sure each job has the correct $PATH? The only way to do this is to restrict myself to one job in the queue at a time. When it’s finished I can then change my .bashrc and submit a new job. Even this is not the best approach because if you are using a different compiler/MPI combination from what is in the queue for something as simple as code development, you have to watch when the job is run to make sure your .bashrc matches your job.


Cro-Magnon Man Approach – Environment Modules
A much better way to handle compiler/MPI combinations is to use Environment Modules (Be careful not to confuse “environment modules” with “kernel modules”). According to the website, “The Environment Modules package provides for the dynamic modification of a user’s environment via modulefiles.” While this may not sound earth shattering, it actually is a quantum leap for using multiple compilers/MPI libraries. But you can use it for more than just compiler/MPI combinations, which I will talk about later.

You can use environment modules to alter or change environment variables such as $PATH, $MANPATH, $LD_LIBRARY_LOAD, and others. Since most job scripts for resource managers such as LSF , PBS-Pro, MOAB, are really shell scripts, you can incorporate environment modules into the scripts to set the appropriate $PATH your compiler/MPI combination needs or any other environment variables your application(s) require for operation.


Installing Environment Modules
How you install environment modules depends upon how your cluster is built. For those that use Platform OCS 4.x Platform has created a roll. Here is a readme describing the various rolls Platform has created. The roll should be included on the OCS DVD that came with the cluster.

If you didn’t use Platform OCS, then you should look around for something like an rpm for environment modules (assuming you used an rpm based cluster toolkit). Remember that Google is your friend when looking for rpm’s. If you can’t find an appropriate binary containing environment modules for your system, then you will have to build it from scratch. Fortunately, the instructions on the website are fairly good and you shouldn’t have too much trouble.


Some Examples for Using Environment Modules
I won’t discuss how to create or define environment modules. The website does a far better job than I in explaining how to do this. But if there is enough demand, I may do a future blog that discusses how to create modules for new compilers/MPI libraries/applications. So let’s assume that the environment modules are installed and functioning.

The first thing to check is what modules are available to you by using the “module avail” command:

home8:~> module avail

-------------------------- /usr/share/modules/modulefiles --------------------------
compiler/gcc dot null
compiler/pgi5.2 module-cvs pbs
compiler/pgi5.2-x86_64 module-info use.own
compiler/pgi6.0 modules
compiler/pgi6.0-x86_64 mpi/mpich-1.2.6
home8:~>



This commands lists what environment modules are installed. In this cases there are a number of compilers installed – gcc, pgi 5.2, pgi 5.2-x86_64, pgi 6.0, pgi 6.0-x86_64, as well as a single MPI library – mpich-1.2.6.

Let’s “load” one of the environment modules:


home8:~> module load compiler/pgi5.2-x86_64


You can just cut and past from the list of available modules to load the ones you want or need (this is what I do and it makes things easier). By loading a module you will have just changed the environmental variables defined for that module. Typically this is $PATH, $MANPATH, and $LD_LIBRARY_LOAD.

To check what modules are loaded just use the “list” option.

home8:~> module list
Currently Loaded Modulefiles:
1) compiler/pgi5.2-x86_64


To unload a module just use the “unload” option but you have to specify the complete name of the environment module.

home8:~> module unload compiler/pgi5.2-x86_64
home8:~> module list
No Modulefiles Currently Loaded.


Alternatively, to you can unload all loaded environment modules using the “purge” option.


home8:~> module load compiler/pgi5.2-x86_64
home8:~> module list
Currently Loaded Modulefiles:
1) compiler/pgi5.2-x86_64
home8:~> module purge
home8:~> module list
No Modulefiles Currently Loaded.



The command is “module purge” to unload all loaded modules. You can see above that after the “module purge” command, there are no more environment modules loaded.

Typically you will want to load a compiler and an MPI library such as the following:


home8:~> module load compiler/pgi5.2-x86_64
home8:~> module load mpi/mpich-1.2.6
home8:~> module list
Currently Loaded Modulefiles:
1) compiler/pg
i5.2-x86_64 2) mpi/mpich-1.2.6
home8:~>



Notice that in the list of modules you don't see a MPI module for each version of the compiler. Environment Modules is smart enough to know which version of the MPI libraries to load based on what version of the compiler is loaded.

To run a job you need to load the modules in the job script. Typically, after the part of the script where you request resources (in the PBS world these are defined as #PBS commands), you will then load the environment modules you need.

These examples are generic and you may not have any modules defined. So let’s finish by looking at some of comments in the Platform readme file about the environment modules roll in Platform OCS 4.x.


Using Environment Modules with Platform OCS
From the readme file, the environment module roll defines some common HPC environments. In particular:
  • MPICH (GNU, Intel compilers)
  • MPICH-GM
  • LAM (GNU, Intel compilers)
  • MVAPICH (with the Topspin IB roll)
  • Open MPI

The documentation gives a couple of quick examples:

% module load hpc/mpich-ethernet-gnu
% which mpirun
/opt/mpich/gnu/bin/mpirun


The “which” command tells where the particular command is located. In this case, the mpirun executable points to the one it /opt/mpich/gnu/bin which is the correct one.

To unload the environment module, you just use the “unload” option.

% module unload hpc/mpich-ethernet-gnu
% which mpirun


This shows that this particular module has been unloaded and that you cannot find the mpirun binary any longer (Note that /opt/mpich is not in the standard path so you should not be able to find the executable unless you have modified the default path for the cluster).


Summary
I consider Environment Modules to be one of the essential tools for clusters. They allow you to easily manage multiple compiler/MPI combinations and even applications. A few simple commands and you can modify your environment for your needs.

I highly encourage you to take a look at the website and associated links then take environment modules for a whirl. You won't regret and you will wonder how life existed prior to it.


Jeff - Comments
End Blog Entry


Start Blog Entry
Estimating Cluster Performance
How can you estimate the Top500 performance of your cluster

05/06/2008 - Comments

Quick Introduction to Performance
I've been asked a few times about how to estimate the Top500 performance of a cluster. So I thought I would discuss this a little bit and also give you a spreadsheet to help you estimate the performance.

I won't discuss the Top500 benchmark in this blog. It is what it is. It is very useful and not useful at all, both at the same time. Nevertheless people are very interested in the Top500 performance of their cluster (one of the reasons is that it can help justify the cost of the cluster). So it's quite common to compute the theoretical performance of the cluster (either in GFLOPS or TFLOPS) and also estimate the performance of the cluster when running the Top500 benchmark (the actual name is the HPL benchmark). The good news is that it's fairly easy to estimate performance.


The current processor family from both AMD and Intel are capable of 4 operations per clock cycle (also called 4 ops per clock or 4 ops/clock). The AMD Opteron dual-core and the old Intel P4 Xeon chips are capable of 2 operations per clock cycle (what people call 2 ops per clock). So to determine performance you just multiple the clock speed of the chip by the number of operations per clock and then multiple by the number of cores. This will produce the theoretical performance. If the clock speed is in GHz, then the theoretical performance will be in GFLOPS (billions of floating-point operations per second). If the clock speed is in MHz, then the theoretical performance will be in MFLOPS (millions of floating-point operations per second).

Here's a quick formula:

Theoretical Performance (GFLOPS) = (4 ops/clock) * (clock speed in GHz) * (number of cores per socket) * (number of sockets per node) * (number of nodes)

This formula assumes that the cores are of the newer variety (4 ops/clock).


HPL Performance
Now that we know the theoretical performance of the cluster it's pretty darn easy to estimate the performance of the cluster running HPL. I've done a fairly extensive analysis of the performance of the clusters in the Top500 as a function of their interconnect (network). This analysis is the subject of another blog byt itself, but as a rule of thumb, clusters connected with Gigabit Ethernet (GigE) achieve about 50% of their theoretical performance. Infiniband clusters achieve about 75% of their theoretical performance. But both of these efficiencies are rules of thumb. On the Top500 list I can find GigE clusters that are almost as high as 65% efficiency or as low as 24%. For Infiniband I can find efficiencies as high as 85% and as low as 29% (that number is really low). However, the two rules of thumb, 50% for GigE and 75% for IB, are pretty good estimates.

If you click on the comments link for this blog it will take you to the blog where you can download a simple spreadsheet. The spreadsheet assumes 4 ops/clock and allows you to input the clock speed of the core, number of cores per socket, and number of sockets per node. It will tell you the theoretical performance in GFLOPS and TFLOPS (trillions of floating point operations per second) and the estimated performance for GigE and IB. It's really simple, so don't feel like you can't change things around.

But remember that Top500 performance does equal perfomance on your particular code.


Jeff - Comments
End Blog Entry


Start Blog Entry
strace for Fun and Profit
Introducing an strace analyzer. Also using strace to examine some ISV codes

04/25/2008 - Comments

OK, I Wrote an strace Analyzer
Using strace to look at the IO pattern of codes, particularly for MPI codes makes my life a lot easier. I use strace to examine the bandwidth of read and write functions performed during the execution of a code. I also look at other statistics in the strace output such as the number of IO function calls, the size of the IO operations (read and write), the fastest and slowest IO function calls, and some other statistics that are fairly useful for analyzing the IO pattern of codes. But the tough part of doing this is taking the strace output and computing statistics and looking for patterns. This is even worse for MPI codes where you have N output files to examine. This cries out for some type automation. In this blog I decided to tackle the task writing an strace analyzer. In addition I'm also going to start using it to examine the strace of an ISV code.

The Basics of the strace Analyzer
It's too difficult to explain all of the intricacies of the strace analyzer which I will be calling strace_analyzer (sorry but I'm not very creative when it comes to names). I'll talk about some of the basic pieces of strace_analyzer but I can't talk about the gory details of it due to the length of the blog (I don't want to make it too long). Plus I want to actually exercise the analyzer on an ISV code.

Since the analyzer will be processing what is basically text, I chose to use Perl as the language for the analyzer. I'm not a Perl programmer but with the help of Google and Joe Landman at Scalable Informatics I managed to write the analyzer that produced the results in this blog (Plus I wanted to learn some Perl anyway).

As I mentioned in previous columns the way to get useful information out of strace is to use the following options.


/usr/bin/strace -tt -o /tmp/strace.out.$$ <path-to-code>/<executable> $@


The first option (-tt)gives you micro-second timings for the function calls and the second option (-o) writes the output to a specific file. Using the resulting information we can compute some useful information.


What Should an Analyzer Produce?
Before I started writing strace_analyzer, I sat down and asked myself what I wanted to see. I decided to focus on these IO functions calls at first.


List of IO Function Calls of Interest
  1. read()
  2. write()
  3. close()
  4. open()
  5. llseek() and lseek()
  6. stat()

Based on these functions, here's the list of information I wanted to see.

List of Output Information
  1. read/write bandwidth (bytes/s and MB/s)
  2. How many open and close functions there are as well as the rate
  3. Slowest and fastest IO function calls (the previous list)
  4. names of files that were opened
  5. Number of function calls for each file


So the analyzer computes these various metrics for a given strace output file and writes the results to the standard output (stdout) unit.

To run the analyzer you just type the name of the analyzer followed by the name of the strace file.


% ./strace_analyzer strace.out.5213


The analyzer then writes out the information to standard output. By default it will echo the lines of the strace output file to standard output. If the strace output file is long I recommend redirecting standard output (stdout) to a file.


% ./strace_analyzer strace.out.5213 > strace.out.5213.analyzed


Then you can look at the file strace.out.5213.analyzed to get the required information. I haven't really pimped out the tool yet so if don't want to see the strace output echoed to the the screen you can modify the code and comment out the printf line that writes the line from the strace file to standard output.

Table 1 summarizes just part of the results from applying the tool to the first file (strace.out.5213) from the MPI code in the last blog.


Table 1 - Summary of Results for Sample MPI Output

Statistic Value
-------------------- -----
Number of open() Function Calls 5
Number of read() Function Calls 26
Number of write() Function Calls 20
Number of llseek() Function Calls 2
Number of stat() Function Calls 5
Number of close() Function Calls 12
Time for slowest read() 0.052859 secs.
Time for slowest write() 0.005340 secs.
Time for slowest open() 0.000252 secs.
Time for slowest close() 0.000164 secs.



The tool will also produce a list of the amount of data either read or written. Table 2 contains the reads and Table 3 contains the writes.

Table 2 - Read File Sizes

File Size Number of Files
--------------------
---------------
(1) <1K Bytes
26
(2) 1K< <8K Bytes 0
(3) 8K< <1M Bytes 0
(4) >1M Bytes 0
Total Number of Bytes read
2,811 Bytes


Table 3 - Write File Sizes

File Size Number of Files
-------------------- ---------------
(1) <1K Bytes 19
(2) 1K< <8K Bytes 0
(3) 8K< <1M Bytes 0
(4) >1M Bytes 1
Total Number of Bytes read 400,613 Bytes



If you recall the simple MPI example from the previous blog, then you will notice the single write() function used with 400,000 bytes. The code was a very simple code that just opened a file and wrote 400,000 bytes of dummy data. So we would expect to see probably one write() function call. But as you can see in the tables, there is quite a bit of small size IO going on. For example, there are 19 write() function calls that are less than 1KB in size. There are also 26 read() function calls that are smaller than 1KB. So it looks like MPI codes do a fair amount of IO in the background, outside your code (at least MPICH2 did for this simple example).


Applyingstrace_analyzer to an ISV Code
Using strace_analyzer against the output from a simple code is not really a big deal, but I wanted to show you what it could do. Now I want to apply it against an strace file from an ISV code.

I don't really want to name the ISV and the code since the purpose of showing you this example is to illustrate how to apply strace_analyzer against an ISV code. I don't want to start a long discussion about the ISV coode. While a discussion could be a good thing, I've also seen people use strace information to start bad-mouthing the code and that's not what I want.

The ISV code I want to analyze is what is commonly called an "Implicit" Finite Element Method (FEM) code. There are several that are very popular. The reason I want to use an Implicit FEM code is that they do a lot of IO. Consequently, they will give us a ton of IO information.

The particular problem solved is not important, but it was run on a single node with two sockets and dual-core CPUs in each socket for a total of 4 CPUs. But I used only a single core to make processing thestrace output file a bit easier. So let's reproduce Table 1 for this ISV code (shown as Table 4 below). Recall that this table of data contains statistics on the IO functions used.


Table 4 - Summary of Results for ISV strace File

Aspect Value
-------------------- -----
Number of open() Function Calls 169
Number of read() Function Calls 143,618
Number of write() Function Calls 267,719
Number of llseek() Function Calls 173,439
Number of stat() Function Calls 0
Number of close() Function Calls 176



Unfortunately the strace file wasn't in the correct format for the strace_analyzer so I couldn't get the times for the read() and write() operations.

Table 5 contains the read function statistics and Table 6 contains the write function statistics.

Table 5 - Read File Sizes

File Size Number of Files
--------------------
---------------
(1) <1K Bytes
888
(2) 1K< <8K Bytes 10
(3) 8K< <1M Bytes 142,708
(4) >1M Bytes 12
Total Number of Bytes read
9,518,258,252 Bytes


Table 6 - Write File Sizes

File Size Number of Files
--------------------
---------------
(1) <1K Bytes 211,580
(2) 1K< <8K Bytes 1,199
(3) 8K< <1M Bytes 54,934
(4) >1M Bytes 6
Total Number of Bytes written
3,361,660,572 Bytes



What do the strace Statistics Tell Us?
In addition to just collecting the IO statistics from the strace file you have to critically look at the results. In the case of the simple MPI code, we noticed that the MPI implementation, in this case MPICH2, did a few really small (less than 1KB) read() and write() functions. This is not something that many people notice. But this was a simple MPI code so there were no real surprises.

In the case of the ISV code we noticed something far different. There are very large numbers of read(), write(), and lseek() function calls. In my eyes, this means that the code is doing a lot of IO, particularly when the code only ran for 214 secs. This is an average of about 671 read(), 810 seek(), and 1,251 write() function calls every second.

If we look at the file size statistics we can see that there is a great deal of IO performed - 3.36 GB in writes, and 9.52 GB in reads. This is an average of 15.7 MB/s for writes and 44.48 MB/s for reads. This is quite a bit of IO given that these are averages.

We can also see that most of the write IO is done on small chunks that are less than 1KB in size but most of the reads are done in chunks of 8KB to 1MB. This tells me that the code the code does a lot of small IO for writes (not a good thing for performance) but the reads are done on larger chunks which helps performance.

I can't put any sections of the strace file here because of length, but in looking at the strace file I can see that there are many seek() functions mixed in the middle of both read() and write() functions. This usually means that the code is performing some IO function (read or write) and then moves the file pointer to a new position. This can cause some slowdown in the IO because the file pointer has to move. If you couple this with large number of write() functions, it becomes obvious that this code is dominated by small chunks of data (not great for performance).

Here is a quick summary of the observations I made from the statistics and looking at the actual strace data.

Observations of IO Pattern in ISV code
  1. The code does a lot of read(), write(), and seek() function calls. The large number of seeks is somewhat unusual.
  2. Writes are dominated by small chunks (less than 1KB). This is not good for performance.
  3. Reads are dominated by larger chunks (8KB to 1MB). This is better for performance.
  4. There are many seek() functions in between reads and writes. This is generally not good for performance.


This observations are consistent with what people have seen for almost all of the "Implicit" FEM codes. These codes do a great deal of IO and quite a bit of it is dominated by smaller data sizes for reads and writes. They also do a lot of seek() functions to move the file pointer. I could write another huge blog about why the codes do this, but let me just say that this is typical for these codes and the decisions behind this general IO pattern were thought out.



Next Steps
While I would love to continue to write about strace and using it to analyze the IO pattern of your codes, I think I would be falling down a rat hole by using the strace_analyzer on various codes and looking at the results. While we can learn a lot from doing this, I think it's more appropriate to leave that as homework for you.

I hope you have learned something from these first 3 blogs on using strace to analyze the IO pattern of your codes. My goal was to show you how easy it is to get some data on the IO patterns and then use that data to collect some statistics about the IO pattern. The effort in doing this can only help you, particularly when you start to look for storage hardware and file systems for your HPC systems.

You can download the strace_analyzer tool from from the comment page (click on the comments link and go to the bottom of the page). I'm going to continue to develop it to add things such as CSV output (so you can analyze and plot anything you want in a spreadsheet), histogram output, and an output format for a new tool I hope to develop. This new tool will be able to take the output fromstrace_analyzer and create a simulation of the IO pattern of your code but using dummy data. You can then take the resulting simulation and try it on different storage hardware and file systems to see how they perform on the IO portion of your code. Not a bad idea - eh? Keep watching this blog for updates on how I'm doing :)

If you want to help, please don't hesitate to grab the code and hack away. Just let me know if you come up with anything cool by posting a comment to this blog (click on the comment link). I also watch the beowulf mailing list if you want to post something there. In the meantime, happy stracing!


Jeff - Comments
End Blog Entry

Start Blog Entry
Using strace with MPI
Using strace to examine code IO patterns in MPI codes.

04/21/2008 - Comments

Remember -strace is your friend

In the last blog, I started talking about how to use strace to examine the IO of applications. In that blog I introduced you to how to use strace on serial codes. Now I want to expand our usage of strace to MPI codes.

As we discovered strace is an extremely useful tool for examining the IO pattern for your codes because it lists all of the function calls from the code, including IO functions such as open, llseek, write, read, and close. As part of the output from strace we can also get information such as how much data was written or read and how much time was used in the operation. From this data we can determine IO rates and IO patterns (how the data is written or read).

Assuming that you have read my last blog and mastered the use of strace on serial codes, let's move on to using strace with MPI codes.

Using strace with MPI codes

MPI codes, while a bit more complicated, don't necessarily have to be difficult to use with strace. Ideally, we would like to have one strace output for every MPI process. This includes have one output for each process even on the same node. So if we had 4 cores on a node, we would want 4 strace output files per node. The reason we want one output file per MPI process is so we can tell which MPI process is performing IO, how much IO, and it's performance. This can also help us debug any problems.

Usually MPI codes are launched by using mpirun or mpiexec or something equivalent on the command line. Even ISV codes use either of these two launch schemes even if they are buried in a script or executable. But the problem is that if you try to use strace with either mpirun or mpiexec you won't be able to separate the output from each MPI process. So we need a way to use strace and separate the output files for each process. Fortunately, I have the bash mojo for such a task.

For the example below, I'll be using MPICH2. MPICH2 has a utility to start codes called mpiexec. A sample command line for MPICH2 to run an MPI code is


mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>


where MACHINEFILE is the name of the file containing a list of the machines being used, path-to-code is the path to where the executable is located, executable is the name of the actual executable, and code-options are any command line arguments to the executable.

The first thing people might try is to change the command line to look like,


/usr/bin/strace mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>


but all this does is run strace against mpiexec, not against the executable as we want. How do we fix this?


The way I run strace against an MPI binary is to write 2 scripts. The first script is for the mpiexec command and the second script is for the MPI executable. The first script is fairly easy,


#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-script>/code1.sh <code-options>



I usually name this script something like main.sh. It's not too different than the command before except rather than specify the executable, I specify a script, code1.sh, and I give the path to this second script. The second script, which is code1.sh looks like,


#!/bin/bash
/usr/bin/strace -tt -o /tmp/strace.out.$$ <path-to-code>/<executable> $@



In this second script, which I call code1.sh all of the strace action takes place. As with the serial code I use the -tt option to get microsecond timing, and I specify the output using the -o option. In this case, I'm sending the output to/tmp and naming it strace.out.$$. Here is some of the bash magic I mentioned. The.$$ after strace.out is a special bash variable that contains the PID (Process ID) of the script. Since each MPI process will get a unique PID, we will have separate strace files for each MPI process.

The second bit of bash knowledge is the option $@ at the end of the script. This is a predefined bash variable that contains all of the options after the script code1.sh for the command line to code1.sh. These are the command line arguements for the actual executable. $@ will contain arg1, arg2, arg3, and so on. It's important to make sure you understand how to use $@. So let's look at a really quick example.

There is an IO benchmark called IOR, from Lawrence Livermore Labs, that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here's an example,


IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>


Don't won't worry about what all of the options mean, but let me point out a couple because they can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?


Sticking with the IOR example the main.sh script looks like,


#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
-r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>



Notice how I've taken the command line arguments and put them in the main.sh script after I've told mpiexec to run the script code1.sh. With the bash predefined variable
$@ in the code script, the options are passed to the code. The code script code1.sh doesn't change at all (except for the name of the binary).


#!/bin/bash
/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@



The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. If your code doesn't have any command line arguments, I would recommend just leaving $@ in the code for future reference.

Just a quick note here. I hate to admit this, but I'm not a bash script expert. Brian Mueller from Panasas was the bash script expert who taught me these tricks (thanks Brian!).


Simple Example
Let's start with a simple example from the MPI-2 book by Bill Gropp, et. at. In Chapter 2 the authors present a simple example of an MPI code where each process of N processes writes data to an individual file (this is usually referred to as N-N IO). I have modified the code to write more data than originally presented. Here is the C code from the book with my modifications.

/* example of parallel Unix write into separate files */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100000

int main(int argc, char *argv[])
{
int i, myrank, buf[BUFSIZE];
char filename[128];
FILE *myfile;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for (i=0; i < BUFSIZE; i++)
buf[i] = myrank * BUFSIZE + i;
sprintf(filename, "testfile.%d", myrank);
myfile = fopen(filename, "w");
fwrite(buf, sizeof(int), BUFSIZE, myfile);
fclose(myfile);
MPI_Finalize();
return 0;
}



After compiling the code, you run the job script on your cluster (or even your dekstop) either specifying the machinefile manually or using a job scheduler. When the job is finished you have to go to each node used in the run, and copy the files from /tmp back to whatever file system is more permanent than /tmp (you can actually automate this in the main.sh script if you like). You could write all of the strace output files to a central file system, but you run the risk that you could get two PIDs that are the same. The chances of this are fairly small, but I don't like to take this chance. :)


Analyzing the strace Output(s)
Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.


strace.out.5213
strace.out.5214
strace.out.5215
strace.out.5216



The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.


Examining the strace file you will notice that it is much longer than for the serial case. The reason is that now we're running an MPI code so much of the extra function calls are due to MPI doing it's thing in the background (i.e. behind our code). I've extracted a few of the important lines from the first strace output file.


15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK) = 0
15:12:54.920687 access(".", W_OK) = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
...
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008) = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0



If you compare these lines to the ones in the serial code, you can see that they are very similar. Despite having more "junk" in the output, let's look at the IO performance.


The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,


54.945790 - 54.943471 = 0.002319 seconds (2319 micro-seconds)


So the IO rate of the write function is,


400,008 bytes / 0.002319 seconds = 1.7249x10^8 bytes/second


This works out to be 172.49 MB/s. A bit faster than the serial code, but again, I think there are some caching affects.

I won't examine the other 3 strace.out.* files since it's fairly straight forward to compute the write performance for each of them. But we're only computing the IO performance for a single write call. Imagine if you have a number of write and read calls in a single code. Then you have to perform the computations for a number of write and read calls. This screams for some sort of automation


What Have We Learned?
While there was nothing earth shattering in this blog we did lay the ground work for examining the IO pattern of MPI codes. While getting strace output is generally easy, in this article we found that not to be the case for general MPI codes. We had to create a couple of scripts so we could get the strace output from each MPI process (which is what we really want). After writing those scripts, getting the strace output for any number of MPI processes is quite easy (Note: you can always add some lines to the scripts to copy the strace files back to your home directory or some centralized location).

If you haven't made the leap yet, you can use these scripts to examine the IO patterns of MPI codes that you don't have the source code. So you can easily examine the IO of commercial ISV codes.


Jeff - Comments
End Blog Entry


Start Blog Entry

strace Is The Friend You Never Knew You Had
Using strace to examine code IO patterns.

04/14/2008 - Comments

strace is Your Friend.

strace is one of those tools that admins use on their systems to track down problems. Typically it is used to debug or troubleshoot problems in *nix operating systems. But strace is one of those tools that is multi-purpose. In this, my first blog, I'll show how you can use strace to get started on examining the IO pattern of your serial codes. In subsequent blogs, I'll show how how to use strace to examine the IO pattern of your MPI codes using the same basic methods.

With HPC codes, it's always better to have more information about the behavior of the code than too little. Knowing the behavior of your code means knowing how it scales, if it's memory bandwidth intensive, if it's floating point intensive, how integer intensive is it, how much message passing is done, what kind of messages are passed, what size messages are passed, and on and on. With this information you can then start to configure your cluster to match your code.

Configuring or designing a cluster to match your code is a very different approach than in the past. With previous high performance systems, you had to modify your code to match the machine. This meant that every time a new series came out, you had to modify and tune your code, which is a long and laborious process. But now, clusters give you so many options that you can tune the hardware and the software to match your code. However, to do this, you need to know your code(s) very well.

One aspect of knowing your code that many people either miss, skip, or forget, is the IO pattern. Many times the reason they don't know is that they are not quite sure how to determine the IO pattern and develop IO requirements from them. In some cases, they don't even know how different IO rates affect the run time of their code. In this and the next several blogs, I hope to give you some a basic tool and tips to examine your IO patterns and develop several basic metrics from them. The tool I will be using is strace.

I'm sure there are some people who already know how to use strace to examine the IO of a code, but for those who don't know how, let's walk through a simple example. In this column, I'm going to start with a serial code. Sure, it's a simple serial code, but I want to show you how you can use strace to examine the IO pattern of your code and also how you can derive some basic metrics. In the next blog, I'll show how to use what we have learned in this blog and apply it to MPI codes.


Simple Serial Example
The code below is a simple C code that just creates an array of floats and writes it to a file. Sure it's a really simple code, but I want to show you how to use strace to look at the IO pattern of the code.


#include <stdio.h>
#define BUFSIZE 100000
int main(int argc, int *argv[] )
{
int i;
float buf[BUFSIZE];
char filename[128];
FILE *myfile;

for (i=0; i < BUFSIZE; i++)
buf[i] = 2.5 + BUFSIZE * i;
sprintf(filename, "testfile");
myfile = fopen(filename, "w");
fwrite(buf, sizeof(float), BUFSIZE, myfile);
fclose(myfile);
}

I compiled the code using gcc on my home box which has a simple software RAID-1 device with a couple of 40GB Seagate drives and uses ext3. We will use strace to run the resulting binary. In general, strace is used to capture the function calls from the binary. This includes calls to IO functions such as open, llseek, write, read, and close. The command line I will use is,


strace -tt -o ./serial.strace ./serial


Notice that I named the binary serial.
I used two options with strace. The first option, -tt gives microsecond timings for all function calls. The second option, -o serial.strace, sends the strace output to a file called serial.strace. The strace output is too long for this blog, but I will show you a few lines from the output and how you can derive some metrics from it.


Analyzing the strace Output
The IO pattern for the code is really simple. It just opens a file, writes the floating data to the file and then closes the data. It's pretty simple, but in this case, there are actually two writes. The first one writes about 397KB of data and the second writes only about 2.7KB. So we have a reasonably large write followed by a very small write. More complicated codes will have different IO patterns (more on that in future blogs). Let's take a closer look at the IO function calls.

We can ignore most of the stuff in the output since we're really interested in the IO part of it. The relevant IO portion(s) of the output are below:


14:47:03.559980 open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
14:47:03.560250 write(3, "\0\0 @@Q\303G\240PCHP|\222HPP\303HP$\364H(|\22I(\346*I"..., 397312) = 397312
14:47:03.562485 write(3, "\0\25\240N\r\30\240N\33\33\240N(\36\240N5!\240NB$\240N"..., 2688) = 2688
14:47:03.562613 close(3)



The first line that I've listed opens the file. The next two lines are important because they actually write out the data. You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages. The function write writes data to a file descriptor (basically a file). You can see the first few bits of data that are being written inside the quotations in the strace output. For this example, the data is binary so the data doesn't much sense. But if the data is text or non-binary data, then you might be able to read the first part of the data. strace does not list all of the data being written to the file. The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two writes. The first one writes 397,312 bytes and the second writes 2,688 bytes. The total of the fwrite functions is 400,000 bytes (as it should be).

The fwrite function calls in the strace output file also contain more useful information. The time that the particular function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function. For this example, it is the amount of time it took to perform the fwrite function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the first fwrite function is,


03.562485 - 03.560250 = 0.002235 seconds (2235 microseconds)


We also know that the amount of data is 397,312 bytes. So the resulting IO rate is,


397,312 bytes / 0.002235 seconds = 1.778x10^8 bytes/sec


This is the same as 178 MB/s assuming that a MB is 1,000,000 bytes.



The amount of time the second fwrite function took is,


03.562613 - 03.562485 seconds = 0.000128 seconds (128 microseconds)


The amount of data written is 2,688 bytes so the IO rate is,


2,688 bytes / 0.000128 seconds = 2.1x10^7 bytes/sec


This is the same as 21 MB/s (a bit slower). Now you can go through and compute the IO rate for all read and write functions in your strace output! (Doesn't this cry out for automation?).


Observations
Let's compare the IO rates for the two fwrite functions. The first fwrite writes out a reasonable amount of data - 397KB. But the second fwrite only writes about 2.7KB, a much smaller amount of data. Now compare the two IO rates. The first IO rate is about 178 MB/s and the second IO rate is only about 21 MB/s. Why is there such a big difference?

There are a number of factors that influence the IO rate. The latency of the drive where it seeks to a certain location on the disk, the actual speed that the drive can write the data on the disk, how fast the file system can translate the data into blocks for the drive controller, any llseek function calls that move the data location to a different point in the file. And believe it not, the amount of data written has an impact on the IO rate.

For very small amounts of data the amount of time to write the data is almost the same as the seek time or the latency of the drive. This is similar to sending messages across a network. That is, for very small packets, the time to send the data is about the same as the latency of the network.

Examining the time for the second fwrite, we can see that the time is very short and the amount of data is much much smaller than the first fwrite. We can observe that writing small amounts of data results in a very low IO rate. Consequently, we can conclude that it is good to avoid small writes if possible. This is because small amounts of data have low IO rates, which can slow down our code(s).


What Have We Learned?
I hope this blog has introduced you to using strace to examine the IO pattern of your code. For this simple example we found out that the resulting binary does a reasonably large write (397KB) followed by a very small write (2.7KB). The strace output can also be used to examine the IO rates of your code. For this simple serial code we found that the first fwrite function writes data at about 178 MB/s (caching effects included) and the second, much smaller fwrite, has an IO rate of 21 MB/s.

From these numbers it can be observed that it's a good idea to avoid small writes because the IO rate is so low. So if your code does a great deal of small writes, and there are many codes that do this, your overall IO throughput will be very low.

In the next blog I will show you how to use strace with an MPI code. I'll also continue to look at more complicated IO patterns. Until then take a few moments and use strace to look at the IO of you favorite codes. Just be careful because strace can produce a huge amount of data (and no, the drive manufacturers didn't pay me to write this blog).


Jeff - Comments
End Blog Entry