4-21-2008 - strace MPI - Comments
Using strace with MPI
Remember: strace is your friend
In the last blog, I started talking about how to use strace to examine the IO of applications. In that blog I introduced you to using strace on serial codes. In this blog, I want to extend our use of strace to MPI codes.
As we discovered, strace is an extremely useful tool for examining the IO pattern of your codes because it lists all of the system calls made by the code, including IO calls such as open, llseek, write, read, and close. The strace output also tells us how much data was written or read and, with timestamps enabled, how much time an operation took. From this data we can determine IO rates and IO patterns (how the data is written or read).
Assuming that you have read my last column and mastered the use of strace on serial codes, let's move on to using strace with MPI codes.
Using strace with MPI codes
MPI codes, while a bit more complicated, don't have to be difficult to use with strace. Ideally, we would like to have one strace output file for every MPI process. This includes having one output file for each process even on the same node. So if we had 4 cores on a node, we would want 4 strace output files per node. The reason we want one output file per MPI process is so we can tell which MPI process is performing IO, how much IO it performs, and its performance. This can also help us debug any problems.
Usually MPI codes are launched by using mpirun or mpiexec or something equivalent on the command line. Even ISV codes use one of these two launch schemes, even if it is buried in a script or executable. But the problem is that if you simply combine strace with either mpirun or mpiexec, you won't be able to separate the output from each MPI process. So we need a way to use strace and separate the output files for each process. Fortunately, I have the bash mojo for such a task.
For the example below, I'll be using MPICH2. MPICH2 has a utility called mpiexec to start codes. A sample command line for MPICH2 to run an MPI code is:
mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>
where MACHINEFILE is the name of the file containing a list of the machines being used, path-to-code is the path to where the executable is located, executable is the name of the actual executable, and code-options are any command line arguments to the executable.
The first thing people might try is to change the command line to look like,
/usr/bin/strace mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>
but all this does is run strace against mpiexec, not against the executable as we want. How do we fix this?
The way I run strace against an MPI binary is to write 2 scripts. The first script is for the mpiexec command and the second script is for the MPI executable. The first script is fairly easy,
#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 <path-to-script>/code1.sh <code-options>
I usually name this script something like main.sh. It's not too different from the command line before, except that rather than specifying the executable, I specify a script, code1.sh, and I give the path to this second script. The second script, code1.sh, looks like,
#!/bin/bash
/usr/bin/strace -tt -o /tmp/strace.out.$$ <path-to-code>/<executable> "$@"
In this second script, which I call code1.sh, all of the strace action takes place. As with the serial code, I use the -tt option to get microsecond timing, and I specify the output file using the -o option. In this case, I'm sending the output to /tmp and naming it strace.out.$$. Here is the bash magic I mentioned. The $$ after strace.out. is a special bash variable that contains the PID (Process ID) of the script. Since each copy of the script running on a node gets a unique PID, we will have a separate strace file for each MPI process.
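To see why $$ leads to one file name per process, here is a minimal sketch that can be run on any machine with bash. It is only an illustration, not part of the scripts above, and the variable names parent_pid and child_pid are made up for this example.

```shell
#!/bin/bash
# $$ expands to the PID of the shell that is running the script.
parent_pid=$$

# A second bash instance stands in for another copy of code1.sh started by mpiexec.
# Both shells are alive at the same time on the same node, so their PIDs differ.
child_pid=$(bash -c 'echo $$')

echo "this shell would write to      /tmp/strace.out.${parent_pid}"
echo "the other shell would write to /tmp/strace.out.${child_pid}"
```

The single quotes around echo $$ matter: they stop the outer shell from expanding $$ so that the inner bash reports its own PID.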
The second bit of bash knowledge is the $@ at the end of the script. This is a predefined bash variable that contains all of the arguments that follow the script name code1.sh on the command line. These are the command line arguments for the actual executable, so $@ will contain arg1, arg2, arg3, and so on. It's important to make sure you understand how to use $@. So let's look at a really quick example.
There is an IO benchmark called IOR from Lawrence Livermore Labs that takes a number of arguments describing the details of how to run the benchmark. Here's an example,
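Before the real benchmark, here is a small sketch of how $@ behaves, which can be run on its own. The function count_args and the sample arguments are hypothetical and only stand in for the real executable and its options. It also shows one detail worth knowing: written with double quotes, as "$@", each argument is handed over intact even if it contains a space, while the unquoted form splits such an argument in two.

```shell
#!/bin/bash
# Hypothetical stand-in for the real executable: reports how many arguments it received.
count_args() {
    echo $#
}

# Simulate the arguments that mpiexec would hand to code1.sh.
# The last argument contains a space on purpose.
set -- -b 25m -o "/tmp/my file"

n_unquoted=$(count_args $@)     # the shell re-splits "/tmp/my file" into two words
n_quoted=$(count_args "$@")     # every original argument is passed through unchanged

echo "unquoted \$@ passes ${n_unquoted} arguments"
echo "quoted \"\$@\" passes ${n_quoted} arguments"
```

For options without spaces, such as the ones used in this article, both forms behave the same; the quoted form is simply the safer habit.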
IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>
Don't worry about what all of the options mean, but let me point out one that can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?
Sticking with the IOR example, the main.sh script looks like,
#!/bin/bash
mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
-r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>
Notice how I've taken the command line arguments and put them in the main.sh script. With the $@ bash predefined variable in the code script, the options are passed to the code. The code script doesn't change at all (except for the name of the binary).
#!/bin/bash
/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR "$@"
The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. If your code doesn't have any command line arguments, I would recommend just leaving $@ in the code for future reference.
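Since the process count appears twice in main.sh (-np for mpiexec and -N for IOR), one way to keep the two in step is to derive it from the machinefile. The following is only a sketch under a stated assumption: the machinefile lists one line per MPI process. The helper name count_slots is made up for this example, and a machinefile that uses a host:count style would need different handling.

```shell
#!/bin/bash
# Hypothetical helper: count the usable lines in a machinefile,
# skipping blank lines and lines that start with '#'.
# Assumes one line per MPI process.
count_slots() {
    grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Possible use inside main.sh (left as a comment so the sketch runs without MPI):
#   NP=$(count_slots ./MACHINEFILE)
#   mpiexec -machinefile ./MACHINEFILE -np "$NP" /home/laytonj/TESTING/code1.sh \
#       -r -w -a MPIIO -b 25m -N "$NP" -s 25 -t 10m -v -o <file location>
```

Under a job scheduler, the same function could be pointed at whatever node file the scheduler provides instead of a hand-written MACHINEFILE.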
Just a quick note here. I hate to admit this, but I'm not a bash script expert. Brian Mueller from Panasas was the bash script expert who taught me these tricks (thanks Brian!).
Simple Example
Let's start with a simple example from the MPI-2 book by Bill Gropp et al. In Chapter 2 the authors present a simple example of an MPI code where each of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented. Here is the C code, adapted from the book.
/* example of parallel Unix write into separate files */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100000

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}
Next, you run the script on your cluster (or even your desktop), either specifying the machinefile manually or using a job scheduler. When the job is finished you have to go to each node used in the run and copy the files from /tmp back to a file system that is more permanent than /tmp. You could write all of the strace output files to a central file system, but then you run the risk that processes on two different nodes get the same PID and overwrite each other's output. The chances of this are fairly small, but I don't like to take this chance. :)
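One way to make the copy-back step safe against identical PIDs on different nodes is to put the node name into the file name while copying. The function below is a minimal sketch of that idea, not something from the original scripts: the name collect_traces, the directory arguments, and the naming scheme are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: copy strace.out.* files from a node-local directory
# to a destination directory, prefixing each copy with the node name so that
# equal PIDs from different nodes cannot overwrite each other.
collect_traces() {
    local src=$1 dest=$2 f
    mkdir -p "$dest" || return 1
    for f in "$src"/strace.out.*; do
        [ -e "$f" ] || continue      # nothing matched the pattern
        cp "$f" "$dest/$(uname -n).$(basename "$f")"
    done
}

# Possible use on each node after the run (paths are examples only):
#   collect_traces /tmp "$HOME/strace_results"
```

A file /tmp/strace.out.5213 on a node called node01 would then end up as node01.strace.out.5213 in the destination directory.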
Analyzing the strace Output(s)
Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.
strace.out.5213
strace.out.5214
strace.out.5215
strace.out.5216
The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.
Examining the strace file, you will notice that it is much longer than for the serial case. The reason is that we're now running an MPI code, so many of the extra system calls are due to MPI doing its thing in the background (i.e. behind our code). I've extracted a few of the important lines from the first strace output file.
15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK) = 0
15:12:54.920687 access(".", W_OK) = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
...
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008) = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0
If you compare these lines to the ones in the serial code, you can see that they are very similar. Despite having more "junk" in the output, let's look at the IO performance.
The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,
54.945790 - 54.943471 = 0.002319 seconds (2319 microseconds)
So the IO rate of the write function is,
400,008 bytes / 0.002319 secs. = 1.7249x10^8 bytes/second
This works out to be 172.49 MB/s. That is a bit faster than the serial code, but again, I think there are some caching effects.
I won't examine the other 3 strace.out.* files since it's fairly straightforward to compute the write performance for each of them. But we're only computing the IO performance for a single write call. Imagine that you have a number of write and read calls in a single code. Then you have to perform the computations for every one of those calls. This screams for some sort of automation.
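As a first step toward that automation, here is a minimal sketch that repeats the hand calculation above with awk. It assumes the -tt timestamp format shown in the excerpt and, like the hand calculation, treats the gap between a read or write line and the next line as the duration of the call. The sample lines are copied from the excerpt (with the write buffer shortened), and the variable names are made up for this example.

```shell
#!/bin/bash
# Three lines taken from the strace excerpt; the write buffer contents are shortened.
trace='15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
15:12:54.943471 write(7, "..."..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008) = 0'

summary=$(printf '%s\n' "$trace" | LC_ALL=C awk '
# Convert an HH:MM:SS.ffffff timestamp to seconds since midnight.
function secs(t,    p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
{
    now = secs($1)
    # The previous line was a read or write: the gap to this line approximates its duration.
    if (pending) {
        dt = now - t0
        if (dt > 0)
            printf "%s %d bytes %.6f s %.2f MB/s\n", call, nbytes, dt, nbytes / dt / 1e6
        pending = 0
    }
    # Remember successful read/write calls; the last field is the byte count returned.
    if ($2 ~ /^(read|write)\(/ && $NF > 0) {
        call = substr($2, 1, index($2, "(") - 1)
        nbytes = $NF
        t0 = now
        pending = 1
    }
}')

echo "$summary"
```

To process a real file, the printf could be replaced by feeding /tmp/strace.out.<PID> to awk directly. Keep in mind that the gap between timestamps also includes any time spent outside the call; strace has a -T option that reports the time spent inside each system call, which may be a better basis for a more careful analysis.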
What Have We Learned?
While there was nothing earth shattering in this blog, we did lay the groundwork for examining the IO pattern of MPI codes. Getting strace output is generally easy, but in this article we found that it takes a little more work for MPI codes. We had to create a couple of scripts so we could get the strace output from each MPI process (which is what we really want). With those scripts in place, getting the strace output for any number of MPI processes is quite easy (Note: you can always add some lines to the scripts to copy the strace files back to your home directory or some centralized location).
If you haven't made the leap yet, you can use these scripts to examine the IO patterns of MPI codes for which you don't have the source code. So you can easily examine the IO of commercial ISV codes.
Latest page update: made by laytonjb, Apr 21 2008, 10:25 AM EDT