RAID 3 vs. RAID 5 in an HPC Environment
by Chris Wood
This paper will address the application and file system performance
differences between RAID 3 and RAID 5 when used in a high performance computing
environment. The reader is expected to have a working knowledge of RAID
architecture and parity management for the different RAID types.
RAID 3:
RAID 3 typically organizes data by segmenting a user data record into bit- or
byte-sized chunks and spreading the data evenly across "N" drives in parallel;
one of the drives is a dedicated parity drive. In this manner, every record
that is accessed is delivered at the full media rate of the "N" drives that
comprise the stripe group. The drawback is that every record I/O must access
every drive in the stripe group.
Additionally, the stripe width chosen forces the user record length to be a
multiple of each disk's underlying atomic format. Assume each disk has an
atomic 4 KByte format. Atomic format, in this sense, is the blocksize on the
disk (e.g. the data you get when you issue a single atomic read or write
operation). An 8+P array would have 32 KByte logical blocks, a 5+P would have
20 KByte logical blocks, and so on. You may not write to a RAID 3 except in
full-stripe logical blocks. This limits application design flexibility and
limits the user's ability to change the physical array characteristics.
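As a quick illustration of this constraint, the short Python sketch below
(hypothetical helper names, numbers taken from the example above) computes the
full-stripe logical block size an application is forced to use:

```python
# Minimal sketch: the RAID 3 logical block is one atomic disk block per data drive.
ATOMIC_DISK_BLOCK_KB = 4        # assumed 4 KByte atomic format per disk

def raid3_logical_block_kb(data_drives: int) -> int:
    """Smallest unit an application may write: one atomic block per data drive."""
    return data_drives * ATOMIC_DISK_BLOCK_KB

for n in (8, 5):                # the 8+P and 5+P examples from the text
    print(f"{n}+P array -> {raid3_logical_block_kb(n)} KByte logical blocks")
# 8+P array -> 32 KByte logical blocks
# 5+P array -> 20 KByte logical blocks
```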
Multiple Process Problems
For a single long request delivering a single stream of sequential data,
RAID 3 is excellent. For multiple requests, RAID 3 performs very poorly. A
single record request (e.g. file system metadata) would tie up 100% of the
disks and, more importantly, pull the disk heads away from the actual user data
to read the metadata. Multiple requests from two or more active processes
(jobs) only serve to exacerbate this problem. The disk heads are pulled from
one user's data area to the other every time each process makes a request.
Each head movement requires a "seek" plus a disk latency, often on the order
of 15-20 ms. Two processes typically require four (4) movements each time an
I/O request ping-pongs between the two jobs: two metadata requests and two
data transfer requests.
The resulting disk head positioning time of possibly up to 80 ms., coupled
with the latencies built into any controller, contributes to poor overall
throughput. In RAID 3 mode, disk latency is not 1/2 a revolution as with a
single disk, but is statistically (N-1)/N of a revolution, where N is the
width of the stripe group. For example, an 8-way stripe would have a latency
of (8-1)/8, or 7/8ths of a revolution. Additionally, on a 1.063 GBit FC-AL
loop or a 100 MBytes/sec. HIPPI channel sustaining 75 MBytes/sec. data
transfers, just two operations, each reading a megabyte of data with a 20 ms
latency to switch between them, would result in an average throughput of
about 30 MBytes/sec. (i.e. 20 ms position, 13.333 ms data transfer, 20 ms
position, 13.333 ms transfer, and so on). This example, where we lost 45
MBytes/sec. of potential throughput, assumed NO metadata fetches. If you add
metadata I/O, assuming each metadata fetch is less than 64 KBytes, the
throughput drops to between 10 and 15 MBytes/sec. Any host induced delays
will further reduce this number. A good rule of thumb is to assume that a
good RAID 3 controller can sustain about 60-65 unique short record I/Os per
second (60 x 15 ms = 900 ms) or less. Long I/Os decrease this rate as a
function of the transfer time involved for each request.
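The 30 MBytes/sec. figure can be reproduced with a small back-of-the-envelope
model. The sketch below is only a check of the example, using the numbers
assumed above; it simply alternates a 20 ms repositioning with a 1 MByte
transfer at 75 MBytes/sec.:

```python
# Back-of-the-envelope check of the two-process RAID 3 ping-pong example.
CHANNEL_RATE_MB_S = 75.0   # sustainable transfer rate (MBytes/sec.)
REPOSITION_MS = 20.0       # seek + latency each time the I/O switches jobs
REQUEST_MB = 1.0           # each job reads 1 MByte per request

transfer_ms = REQUEST_MB / CHANNEL_RATE_MB_S * 1000.0   # ~13.333 ms
cycle_s = (REPOSITION_MS + transfer_ms) / 1000.0        # ~33.333 ms per request
print(f"~{REQUEST_MB / cycle_s:.0f} MBytes/sec.")       # ~30 MBytes/sec.
# About 45 MBytes/sec. of potential throughput is lost, before any metadata I/O.
```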
It is clear that RAID 3 is not well suited for multiple-process I/O (long or
short), and is especially unsuited for any application that requires a high
I/O-per-second (IOPS) rate with any degree of randomness. On the other hand,
RAID 3 will deliver excellent performance for single-process, single-stream
long sequential I/O requests.
Note: Long "skip sequential" processing (regular stride access) is
a special case and is covered in the RAID 5 section for both types of arrays.
RAID 3 Performance Mitigation
Some users have proposed to mitigate the head movement problem and limited
I/O-per-second rate of RAID 3 by making only very large requests and caching
vast quantities of user data in main memory. This method has been shown to
work in some cases where there is a high degree of locality of reference and
data reuse, but it is not very cost effective given the 100X cost delta
between RAM and disk storage. Most users have realized that it is much more
cost effective to use a flexible RAID 5 architecture than to try to hide the
shortcomings of the RAID 3 architecture. They would rather use expensive main
memory for application space and data manipulation than allocate giant
in-memory disk caches.
It is for these reasons of overall inflexibility and cost that the industry
has gravitated toward full function RAID 5 devices, in order to better handle
both long and short requests, better multiplex many concurrent users, and
still deliver the very high data rates desired for long requests.
RAID 5:
RAID 5 devices interleave data on a logical block by block basis. (Logical
block is usually defined in this sense as what the user application or file
system sees and expects to perform atomic I/O on.) Logical blocks may be any
size that is a multiple of the physical disk block. Most high performance RAID
5s allow the user to specify the logical blocksize as a multiple of the
underlying physical disk block. This design allows a single logical block read
to access only a single disk. A 16-block read would access all disks twice in
an 8+P configuration. Parity is typically distributed (rotated) across all the
disks to eliminate "hot spots" when accessing parity drives. The most
common method is called "right orthogonal" parity wherein the parity
block starts on the rightmost drive for track zero and rotates to the left on
a track or cylinder basis. Numerous other schemes have been supported, but the
right orthogonal scheme seems to be the most prevalent. MAXSTRAT's Gen5 family
of storage servers uses a variant of this method.
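To make the block-interleaved layout concrete, the sketch below shows one
possible mapping of logical blocks onto an 8+P stripe group with parity
rotating from the rightmost drive. The rotation rule is only an illustration
of the general idea, not the exact right orthogonal layout used by the Gen5:

```python
# Illustrative block-interleaved RAID 5 layout with rotated parity (8+P).
DATA_DRIVES = 8
DRIVES = DATA_DRIVES + 1          # nine physical drives, numbered 0..8

def locate(logical_block: int) -> tuple[int, int]:
    """Return (drive, stripe_row) for a logical block, skipping the parity drive."""
    row = logical_block // DATA_DRIVES
    parity_drive = (DRIVES - 1 - row) % DRIVES   # parity starts rightmost, rotates left
    slot = logical_block % DATA_DRIVES           # data position within the row
    return (slot if slot < parity_drive else slot + 1), row

print(locate(3))                                  # a one-block read touches one drive
print(sorted({locate(b)[0] for b in range(16)}))  # a 16-block read spreads over the array
```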
Not all RAID 5s are alike...
Within the RAID 5 category, there are two fundamental types of RAID 5s:
- The "full function" variety, where the array recognizes large requests and
dynamically selects and reads as many disks in parallel as are required to
answer the request, up to the physical width of the stripe group. This type
of array typically has a separate lower interface port for every drive in the
stripe group.
- The "entry level" model, which cannot perform parallel stripe reads but has
to read the data off each disk in a serial sequence. This type of array
usually "daisy chains" multiple drives on a small number of lower interfaces
to save on cost at the expense of performance and parallelism.
These "entry level" arrays are typically found in small SCSI subsystems
designed for PC servers and UNIX workstations that cannot use the data rate
but want the data protection offered by parity arrays. Larger hosts require
the "full function" variety, since they want RAID 3 speed for long requests
but also want the improved short, random block performance offered by
RAID 5s.
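The practical difference can be seen with a trivial timing model. The
per-disk service time below is an assumed number, chosen only to show the
scaling between the two designs:

```python
# Toy model: time to read a full stripe on the two kinds of RAID 5.
DISKS_IN_STRIPE = 8
PER_DISK_MS = 12.0   # assumed seek + latency + one-block transfer per disk

parallel_ms = PER_DISK_MS                    # full function: all drives transfer at once
serial_ms = PER_DISK_MS * DISKS_IN_STRIPE    # entry level: transfers serialized on the
                                             # shared lower interfaces
print(f"full function ~{parallel_ms:.0f} ms, entry level ~{serial_ms:.0f} ms")
```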
Skip Sequential (Stride) Operations
Skip sequential operations (regular stride access) degrade the performance of
RAID 3s in direct proportion to the "stride" length. If you read one block
and skip two (a stride of 3), then your data rate will be 1/3 of the
theoretical burst rate. A stride of 4 delivers 1/4th the rate, and so on. A
RAID 5, on the other hand, will access every third disk in parallel and, in
the case of a stride of 3 mapped to an 8+P array, would sustain the full data
rate. (Disks 1, 4 & 7 hold requests 1, 2 & 3. Disks 2, 5 & 8 hold records 4,
5 & 6, and disks 3 & 6 hold records 7 & 8.) Other strides and other stripe
widths map in a similar manner but may not always deliver 100% of the
potential bandwidth due to the resultant topology of the data.
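The mapping described above can be reproduced with a few lines of Python. The
sketch ignores parity rotation for simplicity and numbers the disks 1 through
8 as in the example:

```python
# Stride-of-3 access pattern over 8 data disks (parity placement ignored).
DATA_DRIVES = 8
STRIDE = 3                                   # read one block, skip two

for n in range(1, 9):                        # first eight requests
    block = (n - 1) * STRIDE + 1             # 1-based logical block numbers
    disk = (block - 1) % DATA_DRIVES + 1
    print(f"request {n}: block {block} -> disk {disk}")
# Requests 1-3 land on disks 1, 4, 7; 4-6 on disks 2, 5, 8; 7-8 on disks 3, 6,
# so the RAID 5 can service them in parallel at close to the full data rate.
```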
The Read-Modify-Write Cycle
Short block writes are a special case for RAID 5s. In order to keep the
parity consistent when an entire stripe is not completely updated (e.g. fewer
records are written than the width of the stripe group), a function called
"read-modify-write" must take place. This function requires that the old data
block, the new data block and the old parity all be XOR'd together to create
the new parity block. In the worst case, this can cause four (4) disk
operations (2 reads/2 writes) to logically perform a single block write.
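The parity update itself is just a byte-wise XOR. A minimal sketch of the
small-write case (generic code, not any particular controller's firmware):

```python
# Read-modify-write parity update for a single-block RAID 5 write.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def new_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """New parity = old data XOR new data XOR old parity."""
    return xor_blocks(xor_blocks(old_data, new_data), old_parity)

# Worst case: read old_data, read old_parity, compute, then write new_data and
# the new parity block -- four disk operations for one logical block write.
```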
Modern RAID 5s tend to have built-in caching to hold the old data in internal
cache and possibly pre-read the old parity, so that when the new write block
is sent to the unit, all the necessary disk blocks are in cache and do not
have to be fetched to do the three-way XOR. The MAXSTRAT Gen5 does this via
24 MBytes of write-behind cache and up to 96 MBytes of read-ahead/
store-through cache. Obviously, the cache hit ratio is paramount to the
performance of a RAID 5 when there are many short, random write updates.
As a second rule of thumb, high performance, full function RAID 5s are
capable of sustaining between 500 and 2000 short random disk read operations a
second. The Gen5 is at the high end of this range. Note: These numbers
represent actual disk reads, not cache hits. Cache hits typically are higher.
Short writes operate at about half this speed. Long writes (greater than a
stripe width) perform in a manner similar to RAID 3 since parity can be
generated on the fly and complete stripes, including new parity, can be written
out in parallel.
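For completeness, the full-stripe case needs no pre-reads at all; the new
parity is simply the XOR of all the new data blocks. A sketch, assuming
equal-length blocks:

```python
# Full-stripe write: parity is generated on the fly from the new data blocks,
# so no old data or old parity has to be read back from disk.
from functools import reduce

def full_stripe_parity(new_blocks: list[bytes]) -> bytes:
    return reduce(lambda acc, blk: bytes(x ^ y for x, y in zip(acc, blk)), new_blocks)

# The data blocks plus the freshly computed parity are then written in parallel,
# which is why long writes behave much like RAID 3.
```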
Read to Write Ratios
Given the effect of the read-modify-write cycle, the actual read-to-write I/O
ratio becomes important to any user considering a RAID 5 device. Most
commercial processing has at least a 4-to-1 read-to-write ratio (per an IBM
study), and scientific processing is typically even heavier on the read side.
Intermediate "swap" I/Os are typically large, so the read-modify-write case
is not of interest during job execution. Final output, now that visual rather
than textual output is the rule, virtually eliminates short end-of-job
writes. In summary, the read-modify-write overhead may be of less concern
than previously thought.
Multiple Processes - RAID 5
Returning to the single and multiple process examples used in the RAID 3
discussion above, the single process case is very similar. In a full function
RAID 5 array, a long request will cause all the disks to be accessed in
parallel, resulting in approximately the same data rate that RAID 3 offers.
(Entry level RAID 5 arrays will be gated by the serial nature of their access
to more than one disk.) Accessing metadata will tend to have a significantly
smaller effect on overall performance, since only one disk (vs. all the disks
in the RAID 3 case) will receive a sudden command to move its read head to a
different position. If this is not the disk that is next in the chain to
transfer data for the long read, then the effect of the head movement may be
masked by the statistical position of the long read's starting block within
the time domain of the entire operation. This "mask" is statistically half of
the stripe group width, since we can never predict which disk in the stripe
group actually holds the first block of the request.
In any specific case, the mask time is a function of both the file system
design and metadata layout and the width of the stripe group. For instance,
the file system layout may or may not attempt to issue well formed (complete
stripe group) I/O. To the extent that the metadata fetch operation can be
masked, we have, on average, the time it takes to read the data off half of a
stripe group to recover from a metadata-induced disk head movement. This time
is a disk rotational latency period plus the disk transfer time for one
logical block of data. Assuming 64 KByte logical disk blocks and 7200 RPM
drives, you would require about 11.7 ms. to perform this operation.
We can calculate that the average time required to return the disk head on
the metadata drive (to a position where it will be able to access user data)
will be approximately 12.2 ms. (4.2 ms. latency and 8 ms. average seek). Note
that this number is less than the RAID 3 number since latency is always 1/2 a
revolution for RAID 5 arrays vs. 7/8th of a revolution in the RAID 3 example.
This value tells us that in most cases a fast RAID 5 controller can recover
from a metadata fetch without impacting user data transfers. Unfortunately, our
analysis above also shows us that this "mask" window is only
available 50% of the time. However, half a loaf is better than none!
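The 12.2 ms. figure follows directly from the assumed drive characteristics
(7200 RPM, 8 ms. average seek):

```python
# Average time for the metadata drive's head to return to the user data area.
RPM = 7200
AVG_SEEK_MS = 8.0                          # assumed average seek time

revolution_ms = 60_000 / RPM               # ~8.33 ms per revolution
avg_latency_ms = revolution_ms / 2         # ~4.17 ms: half a revolution (RAID 5 case)

print(f"~{avg_latency_ms + AVG_SEEK_MS:.1f} ms")   # ~12.2 ms
```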
Given the assumptions, we can then extrapolate that the "two process"
performance hit will occur about half of the time on a well-designed RAID 5.
The data rate will not degrade from 75 MBytes/sec to roughly 33 MBytes/sec,
as in the RAID 3 example, but only to about 54 MBytes/sec. This is quite an
improvement! A four process access will scale in the same manner. Experience
has shown these assumptions to be correct.
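The 54 MBytes/sec. figure is simply the expectation over the two cases, using
the numbers above (full rate when the metadata fetch is masked, a RAID 3
style drop to roughly 33 MBytes/sec. when it is not):

```python
# Expected two-process throughput on a full function RAID 5, per the model above.
FULL_RATE = 75.0        # MBytes/sec. when the metadata fetch is masked
DEGRADED_RATE = 33.0    # MBytes/sec. when it is not
MASK_PROBABILITY = 0.5  # the mask window is available about half the time

expected = MASK_PROBABILITY * FULL_RATE + (1 - MASK_PROBABILITY) * DEGRADED_RATE
print(f"~{expected:.0f} MBytes/sec.")   # ~54 MBytes/sec.
```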
Given the above, it is difficult to understand why anybody would choose a
RAID 3 architecture over a full function RAID 5 design, except in the case
where the user is virtually guaranteed that there will be only a single, long
process accessing sequential data. The RAID 3 is so limited that it becomes a
poor choice even in this case.
Chris Wood
email: chrisw@maxstrat.com