8-20-2008 - Are SATA Disks Really Less Reliable than FC Disks?


I’ve seen many, many people say that Fibre Channel (FC) disks are much better than SATA drives. Of course, saying “better” doesn’t mean a whole lot unless you define what “better” means. One of the most common phrases I hear is, “FC disks have a higher MTBF than SATA so they are more reliable” (MTBF = Mean Time Between Failures). Let’s take a look at this claim in the context of HPCC.

The Petascale Data Storage Institute (PDSI) at Carnegie Mellon University was established with a charter that includes collecting failure rate data from very large HPC and non-HPC sites. To date the Institute has collected failure data from 30 systems at 6 sites. Some of these sites are HPC sites and some are Internet Service Providers (ISPs). The data from the 30 systems lists the failure history of various components in the systems. In total, the data covers over 100,000 disk drives from at least 4 different vendors, over time periods ranging from 1 month to 5 years, and includes SATA, SCSI, and FC drives.

One of the challenges in analyzing the data was the definition of what constitutes a "failure" versus a "replacement." In practice, if the user considered the drive to be "failed" for whatever reason, they pulled the drive out and replaced it with a new one. A replacement impacts the usability of the system, so it really does constitute a "failure," particularly in the eyes of a user, and the analysis below treats replacements as failures.

Garth Gibson and Bianca Schroeder at PDSI have examined the data that has been collected and made some interesting observations. Figure One below, courtesy of Garth Gibson at PDSI, shows a descending list of replaced components for 3 of the systems:


  • HPC1 is a 765-node HPC cluster with 4-way SMP nodes and 5 years' worth of data on 10K RPM SCSI drives (3,400 drives)
  • COM1 is an ISP with 26,734 SCSI drives that had been running for 1 month when the data was gathered
  • COM2 is an ISP with up to 9,232 servers and 39,039 SCSI drives that had been running for 1.5 years when the data was gathered

Figure One -- Top Ten Replaced Components


Notice that disk drives are among the most frequently replaced hardware components, accounting for almost 50% of hardware failures in the COM2 data (recall that COM2 used SCSI drives).

They also examined all of the data for disk replacement rates. Figure Two, again courtesy of Garth Gibson at PDSI, is a plot of the Annual Disk Replacement Rate (ARR) for 9 of the systems that span SCSI, SATA, and FC drives. The drive sets are:

System   Type of Drive         Count     Duration
HPC1     18GB 10K RPM SCSI     3,400     5 yrs
         36GB 10K RPM SCSI
HPC2     36GB 10K RPM SCSI     520       2.5 yrs
HPC3     15K RPM SCSI          14,208    1 yr
         15K RPM SCSI
         7.2K RPM SATA
HPC4     250GB SATA            13,634    3 yrs
         500GB SATA
         400GB SATA
COM1     10K RPM SCSI          26,734    1 month
COM2     15K RPM SCSI          39,039    1.5 yrs
COM3     10K RPM FC-AL         3,700     1 yr
         10K RPM FC-AL
         10K RPM FC-AL
         10K RPM FC-AL



Figure Two -- Annual Disk Replacement Rate


Notice that the average ARR for the data is 3%. But the ARR derived from the disk manufacturers' data is between 0.58% and 0.88% (the drive manufacturers state that the Mean Time To Failure (MTTF) is between 1,000,000 and 1,500,000 hours). The ARR from the field data therefore indicates an MTTF that is 2 to 10 times lower than the manufacturers' numbers. In this study, there is little evidence for the commonly held belief that SATA failure rates are higher than SCSI or FC. If anything, the SATA drives had a significantly lower replacement rate than some SCSI (HPC6) or FC (COM3) drives.
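The conversion between MTTF and ARR used in the paragraph above can be checked with a few lines of Python. This is only a rough sketch: it assumes the usual simplification of a constant failure rate and a drive that is powered on all year (8,760 hours), and the function names are made up for illustration.

```python
# Rough sketch: convert between MTTF (hours) and a nominal annual
# replacement rate (ARR), assuming a constant failure rate and drives
# that are powered on 24x7. Function names are hypothetical.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mttf_to_arr(mttf_hours):
    """Nominal ARR (as a fraction) implied by an MTTF in hours."""
    return HOURS_PER_YEAR / mttf_hours


def arr_to_mttf(arr):
    """MTTF in hours implied by an ARR given as a fraction."""
    return HOURS_PER_YEAR / arr


# Datasheet MTTF values quoted in the text
datasheet_arr_high = mttf_to_arr(1_000_000)
datasheet_arr_low = mttf_to_arr(1_500_000)

# Average field ARR quoted in the text (3%)
field_mttf = arr_to_mttf(0.03)

print(f"Datasheet ARR: {datasheet_arr_low:.2%} to {datasheet_arr_high:.2%}")
print(f"MTTF implied by a 3% ARR: {field_mttf:,.0f} hours")
```

With these assumptions, the 3% average works out to an MTTF of roughly 292,000 hours, or about 3.4 to 5.1 times lower than the datasheet values. The wider 2 to 10 times range quoted above presumably reflects the spread across the individual systems rather than the average.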

So, based on this data, it looks like SATA is as reliable as FC or SCSI for these workloads (HPCC and ISP). I’m hesitant to say that this study proves conclusively that SATA is just as good as anything else because there are so many variables. But the evidence is definitely leaning toward SATA being just as good.



Latest page update: made by laytonjb, Aug 20 2008, 5:19 PM EDT