Friday, January 11, 2008
Disk-space over time
I've mentioned before that I do SNMP-based queries against NetWare and drop the resulting disk-usage data into a database. The current incarnation of this database went in August of 2004, so I have just over 4 years of data in it now. You can see some real trends in how we manage data in the charts.
To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which run between 7000-8000 home directories on them. We load-balance by number of directories rather than least-size. The chart:

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.
There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.
We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.
Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.
Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.
If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.
Being the 'slow, cheap,' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an edirectory enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.
Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.
To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which run between 7000-8000 home directories on them. We load-balance by number of directories rather than least-size. The chart:

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.
There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.
We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.
Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.
Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.
If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.
Being the 'slow, cheap,' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an edirectory enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.
Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.
Monday, January 07, 2008
I/O starvation on NetWare, another update
I've spoken before about my latency problems on the MSA1500cs. Since my last update I've spoken with Novell at length. Their own back-line HP people were thinking firmware issues to, and recommended I open another case with HP support. And if HP again tries to lay the blame on NetWare, to point their techs at the NetWare backline tech. Who will then have a talk about why exactly it is that NetWare isn't the problem in this case.
This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true, when the problem occurs B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.
The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):
That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.
This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true, when the problem occurs B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.
The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):
I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system.Which, if true, puts the onus of putting up with a badly behaved I/O system onto NetWare again. The tech went on to recommend RAID1 for the LUNs that need high performance when doing array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks, the bottleneck is CPU on the MSA1000 controller that's active. Going RAID1 on the LUNs would keep speeds very fast even when doing array operations.
That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.
Wednesday, November 28, 2007
I/O starvation on NetWare, HP update
Last week I talked about a problem we're having with the HP MSA1500cs and our NetWare cluster. The problem is still there, of course. I've opened cases with both HP and Novell to handle this one. HP because I really thing that such command latencies are a defect, and Novell since they're having starvation issues with clusters.
This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.
This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.Or...
Working as designed. Fix your software. Talk to Novell.Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.
Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.
Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin
Wednesday, November 21, 2007
I/O starvation on NetWare
The MSA1500cs we've had for a while has shown a bad habit. It is visible when you connect a serial cable to the management port on the MSA1000 controller, and doing a "show perf" after starting performance tracking. The line in question is "Avg Command Latency:", which is a measure of how long it takes to execute an I/O operation. Under normal circumstances this metric stays between 5-30ms. When things go bad, I've seen it as far as 270ms.
This is a problem with our cluster nodes. Our cluster nodes can seen LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.
IF the MSA is in the high Avg Command Latency state, and
IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
THEN "Concurrent Disk Requests" in Monitor go north of 1000
This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.
Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.
This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.
This is something that a very new feature in the Linux kernel could help out, that that's to introduce the concept of 'priority I/O' to the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.
I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.
This is a problem with our cluster nodes. Our cluster nodes can seen LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.
IF the MSA is in the high Avg Command Latency state, and
IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
THEN "Concurrent Disk Requests" in Monitor go north of 1000
This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.
Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.
This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.
This is something that a very new feature in the Linux kernel could help out, that that's to introduce the concept of 'priority I/O' to the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.
I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.
Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin
Thursday, August 09, 2007
Updates
We've finally moved the MSA up to Bond Hall. I've spent the better part of the last two days undressing and redressing racks as a result of this. It has been interesting.
One of the things we learned is that it would be a really good idea to invest in power cables of custom length. Especially since we have the nice power-strips up there in BH. 4 footers will do in most cases, and have a minimum of cable to coil and stow. What we do have are longer cables and that leads to having black power cables crammed everywhere there is space, and that complicates chasing them when the time comes.
We have four racks over in Bond Hall in which to stuff things. Because of this, those are some of our densest racks.
The fibre channel bridged the distance just peachy. I don't know what the total distance is, but it is well under 10km so we could get both the fibre switch in BH and the fibre switch up here talking. Servers up here can see the MSA and talk to it at full speed. And do so without incurring any extra Brocade licensing costs to make it all work. That was nice to see.
Right now we're doing some tests on backup-to-disk to the MSA. The performance is... underwhelming. I'm not yet sure if the MSA is the bottleneck, or the speed of data coming from the backup sources. I do know that the MSA is all too easy to saturate for I/O, and it wouldn't surprise me in the least if B2D pegs the needle. Early indications are that the MSA is a welcome addition to our existing backup strategy, but it won't come close to replacing tape.
And finally, I'll be on vacaton all next week. So there won't be any updates until the week of the 21st.
One of the things we learned is that it would be a really good idea to invest in power cables of custom length. Especially since we have the nice power-strips up there in BH. 4 footers will do in most cases, and have a minimum of cable to coil and stow. What we do have are longer cables and that leads to having black power cables crammed everywhere there is space, and that complicates chasing them when the time comes.
We have four racks over in Bond Hall in which to stuff things. Because of this, those are some of our densest racks.
The fibre channel bridged the distance just peachy. I don't know what the total distance is, but it is well under 10km so we could get both the fibre switch in BH and the fibre switch up here talking. Servers up here can see the MSA and talk to it at full speed. And do so without incurring any extra Brocade licensing costs to make it all work. That was nice to see.
Right now we're doing some tests on backup-to-disk to the MSA. The performance is... underwhelming. I'm not yet sure if the MSA is the bottleneck, or the speed of data coming from the backup sources. I do know that the MSA is all too easy to saturate for I/O, and it wouldn't surprise me in the least if B2D pegs the needle. Early indications are that the MSA is a welcome addition to our existing backup strategy, but it won't come close to replacing tape.
And finally, I'll be on vacaton all next week. So there won't be any updates until the week of the 21st.
Thursday, July 19, 2007
That darned MSA again
I'm not sure where this problem sits, but I'm having trouble with this MSA1500cs and my NetWare servers. I've found a failure case that is a bit unusual, but things shouldn't fail this way.
The setup:
zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/
Which, you know, shouldn't happen. And this error is consistent across the following changes:
I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.
Call Novell
The next step here is to delete these pools and volumes, recreate them, and see if things go Poink in quite the same way. I'm not convinced that'll fix the problem, as the errors being reported are Write errors, not Read errors, and the faulting blocks are different every time. I'm suspecting instability in the Write channel somewhere that is unique to a nss /poolrebuild, as I didn't get these errors when FILLING these volumes. Write channel in this case has a lot of Fibre Channel in it.
The setup:
- NetWare 6.5, SP5 plus patches
- EVA3000 visible
- MSA1500cs visible
- Pool in question hosted on the MSA
- Pool in question has snapshots
- Do a nss /poolrebuild on the pool
7-19-2007 9:48:22 am: COMN-3.24-1092 [nmID=A0025]The block number changes every time, and when it decides to crap out of the rebuild also changes every time. No consistency. The I/O error (20204) decodes to:
NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
An I/O error (20204(zio.c[2260])) at block 36640253(file block
-36640253)(ZID 1) has compromised pool integrity.
zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/
Which, you know, shouldn't happen. And this error is consistent across the following changes:
- Updating the HAM driver (QL2300.HAM) from version 6.90.08 (a.k.a 6.90h) to 6.90.13 (6.90m).
- Updating the firmware on the card from 1.43 to 1.45 (I needed to do this anyway for the EVA3000 VCS upgrade next month)
- Applying the N65NSS5B patch, I had N65NSS5A on there before
I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.
Call Novell
| Good: | Bad: |
|
|
Wednesday, June 13, 2007
Still waiting
Any day now OES2 will come out.
Any day now.
Any day now I'll get a paravirtualizable NetWare and will be able to run it through its paces.
Any day now I'll get to try and figure out how Xen virtualization of NetWare interacts with an HP MSA1500cs.
But not today.
Any day now.
Any day now I'll get a paravirtualizable NetWare and will be able to run it through its paces.
Any day now I'll get to try and figure out how Xen virtualization of NetWare interacts with an HP MSA1500cs.
But not today.
Labels: msa, netware, novell, OES, virtualization
Wednesday, February 07, 2007
Disaster recovery, space, and storage management
Remember that benchmark series I did back in August and September? We'll we're finally planning on deploying on that hardware. We spent a good chunk of this afternoon trying to figure out how to carve the MSA into pieces. Turns out, it won't be as easy as we thought.
Or more specifically, it'll be very easy. We just won't have enough space to do meaningful backup-to-disk. Its primary role as the disaster recovery copy of the EVA will take up 85% of the available disk-space on the MSA. That leaves 15% of the space to work in VM OS volumes and the B2D stuff. Not much left for the backup-to-disk part of this project, which was billed as a significant part back in June. Oops. We'll be on tape for a while yet.
Happily, we can add storage cabinets to this device no problem. Except funding of course, but that almost goes without saying. We'll have to kick the tree and see if any money falls out of it, otherwise we're waiting until our fiscal year begins (July) to do any B2D stuff.
Or more specifically, it'll be very easy. We just won't have enough space to do meaningful backup-to-disk. Its primary role as the disaster recovery copy of the EVA will take up 85% of the available disk-space on the MSA. That leaves 15% of the space to work in VM OS volumes and the B2D stuff. Not much left for the backup-to-disk part of this project, which was billed as a significant part back in June. Oops. We'll be on tape for a while yet.
Happily, we can add storage cabinets to this device no problem. Except funding of course, but that almost goes without saying. We'll have to kick the tree and see if any money falls out of it, otherwise we're waiting until our fiscal year begins (July) to do any B2D stuff.
Thursday, October 19, 2006
MSA, mirroring, and the NetWare cluster
Mirroring WUF is probably the easist thing we'll do, once we get the fibre interconnect between the local SAN and the BCC SAN. Setting up the software RAID devices in NetWare is a fairly simple thing to do, and will immediately be integrated into the cluster. It was a method like this that I used to migrate the SOFTWARE volume from a direct-attach on FACSRV2 to be on the SAN.
That said, there are some design considerations to take into account. SAN best-practices documents at both HP and Novell (and Microsoft) say it is better to create many LUNs than it is to use a few big LUNs. This is the practice we use in the Exchange cluster. The reasoning behind this is to allow the operating system to queue IO operations across many LUNs rather than stack them all up behind a few LUNs, which has the ultimate effect of making IO more efficient on the SAN device. I propose that we follow this practice when partitioning out the MSA.
The LUNs we create on the MSA for use in WUF will have a 64K stripe-size. This is the stripe that best supports file-server loads. For comparison, the stripe-size in the EVA is an unmodifyable 128K.
MSA guidelines strongly recommend against RAID5 arrays larger than 14 drives, which limits us to Drive Arrays of 6.5TB or smaller. Each drive-array we create loses us a drive for parity. Also, I'd like to designate one drive per shelf to be a hot-spare. This leaves us with 22 drives for use as storage.
Right this moment WUF has just over 6TB allocated to it which is almost to the max for a single drive array.
Tags: MSA, NetWare
That said, there are some design considerations to take into account. SAN best-practices documents at both HP and Novell (and Microsoft) say it is better to create many LUNs than it is to use a few big LUNs. This is the practice we use in the Exchange cluster. The reasoning behind this is to allow the operating system to queue IO operations across many LUNs rather than stack them all up behind a few LUNs, which has the ultimate effect of making IO more efficient on the SAN device. I propose that we follow this practice when partitioning out the MSA.
The LUNs we create on the MSA for use in WUF will have a 64K stripe-size. This is the stripe that best supports file-server loads. For comparison, the stripe-size in the EVA is an unmodifyable 128K.
MSA guidelines strongly recommend against RAID5 arrays larger than 14 drives, which limits us to Drive Arrays of 6.5TB or smaller. Each drive-array we create loses us a drive for parity. Also, I'd like to designate one drive per shelf to be a hot-spare. This leaves us with 22 drives for use as storage.
Right this moment WUF has just over 6TB allocated to it which is almost to the max for a single drive array.
Tags: MSA, NetWare
MSA, Mirroring, and NetWare
The performance testing is largely done.
In short, Mirroring performance matches or is within a few percentage points of EVA performance. Mirror performance follows the slowest device, which in all tests is the EVA. Since the EVA is the benchmark by which we compare production performance, this tells me that we can safely expect to mirror at least some of the EVA data on the MSA.
MSA performance exceeds EVA performance significantly. This is NOT true for the same tests on a Windows server running locally. I can't theorize why this might be, but the data show it quite well.
Unfortunately, I lack the resources to do a TRUE concurrency test. I can't tell you how MSA vs EVA performs when 50 workstations are pounding random IO. From what I've seen, EVA should turn in better numbers in that case due to technological differences. On the other hand, EVA should have turned in faster numbers than it did in this single-streamer test.
Tags: MSA, NetWare
In short, Mirroring performance matches or is within a few percentage points of EVA performance. Mirror performance follows the slowest device, which in all tests is the EVA. Since the EVA is the benchmark by which we compare production performance, this tells me that we can safely expect to mirror at least some of the EVA data on the MSA.
MSA performance exceeds EVA performance significantly. This is NOT true for the same tests on a Windows server running locally. I can't theorize why this might be, but the data show it quite well.
Unfortunately, I lack the resources to do a TRUE concurrency test. I can't tell you how MSA vs EVA performs when 50 workstations are pounding random IO. From what I've seen, EVA should turn in better numbers in that case due to technological differences. On the other hand, EVA should have turned in faster numbers than it did in this single-streamer test.
Tags: MSA, NetWare
Wednesday, October 11, 2006
MSA Performance update
An update to the MSA performance testing.
One thing became very, very clear when testing the default stripe sizes. A 16KB stripe size on a RAID5 array on the MSA gives faster read performance, but much worse Write performance. Enough worse, that I'm curious why it's a default. We'll be going with a 64K stripe for our production use, as that's a good compromise between read/write performance.
The Windows part of the mirror/unmirror test is completed. Write performance tracks, as in the curve has the same shape, the MSA performance. This makes sense, because software mirroring needs to have both writes commit before it'll move on to the next operation. This by necessity forces write performance to follow the slowest performing storage device. All in all, Write performance trailed MSA performance, which in turn trailed EVA performance for the large file test.
Read performance is where the real performance gains were to be had. This also makes sense because software Raid1 generally has each storage device alternating serving blocks. On reflection this could play a bit of hob with in-MSA or in-EVA predictive reads, but testing that is difficult. Performance matched EVA performance for files under 8GB in size, and still exceeded MSA performance for the 32GB file.
I'm running the NetWare test right now. Because this has to run over the network, I can't compare these results to the Windows test. But I can at least get a feeling for whether or not NetWare's software mirror provides similar performance characteristics. Considering how slow this test is running (Gig Ether isn't having as much of an impact as I thought it would), it'll be next week before I'll have more data.
Because of the delays I'm seeing, I've had to strike a few tests from the testing schedule. This needs to be in production during Winter Break, so we need time to set up pre-production systems and start building the environment.
Tags: msa, benchmarking
- RAID stripe performance (standard IOZONE, and a 32GB file IOZONE)
- 64K both Raid0 and Raid5
- Default stripes: 16K Raid5, and 128K Raid0
- Versus EVA performance
- Software mirror performance (software Raid1)
- Windows/NetWare: MSA/EVA
Windows: MSA/MSA?? Windows: EVA/EVA- Concurrency performance
- Multiple high-rate streams to the same Disk Array (different logical drives)
- Multiple high-rate streams to different Disk Arrays
- Random I/O & Sequential I/O performance interaction on the same array
One thing became very, very clear when testing the default stripe sizes. A 16KB stripe size on a RAID5 array on the MSA gives faster read performance, but much worse Write performance. Enough worse, that I'm curious why it's a default. We'll be going with a 64K stripe for our production use, as that's a good compromise between read/write performance.
The Windows part of the mirror/unmirror test is completed. Write performance tracks, as in the curve has the same shape, the MSA performance. This makes sense, because software mirroring needs to have both writes commit before it'll move on to the next operation. This by necessity forces write performance to follow the slowest performing storage device. All in all, Write performance trailed MSA performance, which in turn trailed EVA performance for the large file test.
Read performance is where the real performance gains were to be had. This also makes sense because software Raid1 generally has each storage device alternating serving blocks. On reflection this could play a bit of hob with in-MSA or in-EVA predictive reads, but testing that is difficult. Performance matched EVA performance for files under 8GB in size, and still exceeded MSA performance for the 32GB file.
I'm running the NetWare test right now. Because this has to run over the network, I can't compare these results to the Windows test. But I can at least get a feeling for whether or not NetWare's software mirror provides similar performance characteristics. Considering how slow this test is running (Gig Ether isn't having as much of an impact as I thought it would), it'll be next week before I'll have more data.
Because of the delays I'm seeing, I've had to strike a few tests from the testing schedule. This needs to be in production during Winter Break, so we need time to set up pre-production systems and start building the environment.
Tags: msa, benchmarking
Labels: benchmarking, msa, storage
Wednesday, October 04, 2006
More MSA performance
Since my OES benchmark went so well, I've been asked to do a series on the MSA we just received for our BCC cluster. Long time readers will remember that the BCC cluster will be done with free or cheap software, not Novell BCC. Unfortunately, the same goes for the hardware. So I get to find out if the MSA will really live up to our performance expectations.
The testing series I've worked out is this:
One thing the testing has already shown, and that is for Raid5 performance a quiescent MSA out-performs the in-production EVA. Since there is no way to do tests against the EVA without competing at the disk level for I/O supporting production, I can't get a true apples to apples comparison. By the numbers, EVA should outperform MSA. It's just that classes have started to the EVA is currently supporting the 6 node NetWare cluster and the two node 8,000 mailbox Exchange cluster, where the MSA is doing nothing but being subjected to benchmarking loads.
The other thing that is very apparent in the tests are the prevalence of caching. Both the host server and the MSA have caching. The host server is more file-based caching, and the MSA (512MB) is block-level caching. This has a very big impact on performance numbers for files under 512MB. This is why the 32GB file test is very important to us, since that test blows past ALL caching and yields the 'worst case' performance numbers for MSA.
Tags: msa, benchmarking
The testing series I've worked out is this:
- RAID stripe performance (standard IOZONE, and a 32GB file IOZONE)
- 64K both Raid0 and Raid5
- Default stripes: 16K Raid5, and 128K Raid0
- Versus EVA performance
- Software mirror performance (software Raid1)
- Windows/NetWare: MSA/EVA
- Windows: MSA/MSA
- ?? Windows: EVA/EVA
- Concurrency performance
- Multiple high-rate streams to the same Disk Array (different logical drives)
- Multiple high-rate streams to different Disk Arrays
- Random I/O & Sequential I/O performance interaction on the same array
One thing the testing has already shown, and that is for Raid5 performance a quiescent MSA out-performs the in-production EVA. Since there is no way to do tests against the EVA without competing at the disk level for I/O supporting production, I can't get a true apples to apples comparison. By the numbers, EVA should outperform MSA. It's just that classes have started to the EVA is currently supporting the 6 node NetWare cluster and the two node 8,000 mailbox Exchange cluster, where the MSA is doing nothing but being subjected to benchmarking loads.
The other thing that is very apparent in the tests are the prevalence of caching. Both the host server and the MSA have caching. The host server is more file-based caching, and the MSA (512MB) is block-level caching. This has a very big impact on performance numbers for files under 512MB. This is why the 32GB file test is very important to us, since that test blows past ALL caching and yields the 'worst case' performance numbers for MSA.
Tags: msa, benchmarking
Labels: benchmarking, msa, storage
Monday, October 02, 2006
Performance of the MSA
I'm doing another performance series on the MSA we'll be putting into Bond Hall. This will be our BCC SAN as well as the home for the 'backup to disk' storage.
One of the tests I ran was to do a full IOZONE series on a 32GB file. This is to better get a feel for how such large files perform on the MSA, since I suspect that any backup-to-disk system will be generating files that large. But I got some s-t-r-a-n-g-e numbers. It turns out that the random-write test is much faster than the random-read test. Weird.
So, um. Yeah. And you want to know the scary part? This holds true for both a Raid0 and Raid5 array. Both have a 64K stripe size, which is not default. Raid5's default stripe is 16K, and Raid0 is 128K. I'll test the default stripes next to see if they affect the results any. But this is STILL weird.
Perhaps writes are cached and reordered, and reads just come off of disk? Hard to say. But read speed does improve as the record size increases. The 16MB record size turns in a read speed of only .5 that of the write speed. Yet the read performance at 64K is .12 that of write. Ouch! I'm running the same test on the EVA to see if there is a difference, but I don't know what the EVA stripe-size is.
Tags: benchmarking, MSA
One of the tests I ran was to do a full IOZONE series on a 32GB file. This is to better get a feel for how such large files perform on the MSA, since I suspect that any backup-to-disk system will be generating files that large. But I got some s-t-r-a-n-g-e numbers. It turns out that the random-write test is much faster than the random-read test. Weird.
| RandomR | |||
| Rec KB | 2048 | 4096 | 8192 |
| Thru | 10881 | 17115 | 23311 |
| RandomW | |||
| Rec KB | 2048 | 4096 | 8192 |
| Thru | 64491 | 63949 | 62681 |
So, um. Yeah. And you want to know the scary part? This holds true for both a Raid0 and Raid5 array. Both have a 64K stripe size, which is not default. Raid5's default stripe is 16K, and Raid0 is 128K. I'll test the default stripes next to see if they affect the results any. But this is STILL weird.
Perhaps writes are cached and reordered, and reads just come off of disk? Hard to say. But read speed does improve as the record size increases. The 16MB record size turns in a read speed of only .5 that of the write speed. Yet the read performance at 64K is .12 that of write. Ouch! I'm running the same test on the EVA to see if there is a difference, but I don't know what the EVA stripe-size is.
Tags: benchmarking, MSA
Labels: benchmarking, msa, storage
