Thursday, October 02, 2008
MSA performance in the new config
Today I reconfigured the MSA1500 to run in Active/Active mode. While I was in there, I also rearranged our disk arrays. We have 41 500GB, 7.2K RPM drives in there. I created two 20-disk arrays and filled each array with Raid 0+1 LUNs. This yielded 9TB of usable space. The one leftover drive will stay extra until we get an odd number of new drives.
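As a rough check on that 9TB figure, here is a minimal Python sketch of the arithmetic. It assumes the drives hold 500 decimal gigabytes each and that the array reports space in binary terabytes; neither is stated above, and the function name is made up for illustration.

```python
def mirrored_usable_tib(drive_count, drive_gb):
    """Usable space of one mirrored (Raid 0+1) array, in binary terabytes."""
    raw_bytes = drive_count * drive_gb * 10**9  # assumption: vendor GB = 10^9 bytes
    usable_bytes = raw_bytes // 2               # mirroring stores every block twice
    return usable_bytes / 2**40


# Two 20-disk arrays of 500GB drives, as described above.
total = 2 * mirrored_usable_tib(20, 500)
print(f"{total:.2f} binary TB usable")  # prints "9.09 binary TB usable"
```

Under those assumptions the 40 drives come to 10TB in decimal units after mirroring, which is about 9TB once it is expressed in binary units.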
Yes, a profligate waste of space but at least it'll be fast. It also had the added advantage of not needing to stripe in like Raid5 or Raid6 would have. This alone saved us close to two weeks flow time to get it back into service.
Another benefit to not using a parity RAID is that the MSA is no longer controller-CPU bound for I/O speeds. Right now I have a pair of writes, each effectively going to a separate controller, and the combined I/O is on the order of 100Mbs while controller CPU loads are under 80%. Also, more importantly, Average Command Latency is still in the 20-30ms range.
The limiting factor here appears to be how fast the controllers can commit I/O to the physical drives, rather than how fast the controllers can do parity-calcs. CPU not being saturated suggests this, but a "show perf physical" on the CLI shows the queue depth on individual drives:

The drives with a zero are associated with LUNs being served by the other controller, and thus not listed here. But a high queue depth is a good sign of I/O saturation on the actual drives themselves. This is encouraging to me, since it means we're finally, finally, after two years, getting the performance we need out of this device. We had to go to an active/active config with a non-parity RAID to do it, but we got it.
Labels: benchmarking, msa, storage, sysadmin
Monday, September 29, 2008
That darned iPhone
The DHCP scope that is associated with our Wireless network filled this morning. We knew we were getting tight, and were in the process of getting a new system in place that could handle non-contiguous network segments as address pools. But that's not in yet.
The reason we ran out? Apple iPhones. When they come into contact with a Wifi network, they grab an IP address even if they're not actively using it. So. Full scope.
That isn't my department, so I'm not sure what exactly we're doing about it. But it came up this morning.
Wednesday, September 24, 2008
Fickle fortune
I lost a RAID card in one of my Beta servers. Crap. These beasties are all old beasties since that's the only hardware that could be released for the beta. And with crap servers, comes a crap failure rate. This is the second RAID card I've lost, and I've lost one hard-drive too. It isn't common to lose more RAID cards than hard-drives. Arrg.
This puts a kink in things. This was going to be an eDirectory host, so I could host my replicas on one set of servers and abuse the crap out of the non-replica application servers. I may have to dual host. Icky icky.
Tuesday, September 23, 2008
That darned 32-bit limit
Today I learned that the disk-space counters NetWare publishes over SNMP are signed 32-bit integers. These stats live in a table at OID .1.3.6.1.4.1.23.2.28.2.14.1.3. We had just expanded our FacShare volume past 2TB, and the monitors promptly reported negative space. It's a simple integer overflow, since Novell is apparently using a signed integer for a number that can never legitimately be negative.
I've pointed this out in an enhancement request. This being NetWare, they may not choose to fix it if it is more than a two-line fix. We'll see.
This also means that volumes over 4TB cannot be effectively monitored with SNMP. Since NSS can have up to 8TB volumes on NetWare, this could potentially be a problem. We're not there yet.
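The wraparound is easy to reproduce. The Python sketch below assumes the counter is a 32-bit value counting kilobytes, which fits a sign flip at 2TB but isn't confirmed anywhere above; the function names are illustrative only.

```python
import struct


def as_signed32(value):
    """Reinterpret the low 32 bits of value as a signed integer, the way the monitor reads it."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def recover_unsigned32(reported):
    """Undo the sign flip. Only correct while the true value is below 2**32."""
    return reported + 2**32 if reported < 0 else reported


KB_PER_TB = 2**30  # assumption: binary units

true_kb = int(2.5 * KB_PER_TB)   # a 2.5TB volume
reported = as_signed32(true_kb)
print(reported)                                  # prints -1610612736
print(recover_unsigned32(reported) == true_kb)   # prints True

too_big = 5 * KB_PER_TB          # a 5TB volume wraps all the way around
print(recover_unsigned32(as_signed32(too_big)) == too_big)  # prints False
```

A monitoring script can correct negative readings this way up to 4TB. Past that the 32 bits themselves wrap and the real size can't be recovered, which is where the 4TB ceiling comes from.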
Friday, September 19, 2008
Monitoring ESX datacenter volume stats
A long while back I mentioned I had a perl script that we use to track certain disk space details on my NetWare and Windows servers. That goes into a database, and it can make for some pretty charts. A short while back I got asked if I could do something like that for the ESX datacenter volumes.
A lot of googling later I found how to turn on the SNMP daemon for an ESX host, and a script or two to publish the data I need by SNMP. It took some doing, but it ended up pretty easy to do. One new perl script, the right config for snmpd on the ESX host, setting the ESX host's security policy to permit SNMP traffic, and pointing my gathering script at the host.
The perl script that gathers the local information is very basic:
#!/usr/bin/perl -w
use strict;

# Usage: vmfsspace.specific <volume UUID> <friendly name>
my $partition = ".";
my $partmaps = ".";
# \Q...\E escapes shell metacharacters in the path before it reaches the pipe below.
my $vmfsvolume = "\Q/vmfs/volumes/$ARGV[0]\E";
my $vmfsfriendly = $ARGV[1];
my $capRaw = 0;
my $capBlock = 0;
my $blocksize = 0;
my $freeRaw = 0;
my $freeBlock = 0;
my $freespace = "";
my $totalspace = "";

open(my $fh, "/usr/sbin/vmkfstools -P $vmfsvolume |")
    or die "Cannot run vmkfstools: $!";
while (<$fh>) {
    # Captures, in order: capacity in bytes, capacity in blocks, block size
    # in bytes, free bytes, free blocks.
    if (/Capacity ([0-9]*).*\(([0-9]*).* ([0-9]*)\), ([0-9]*).*\(([0-9]*).*avail/) {
        $capRaw = $1;
        $capBlock = $2;
        $blocksize = $3;
        $freeRaw = $4;
        $freeBlock = $5;
        $freespace = $freeBlock;
        $totalspace = $capBlock;
        $blocksize = $blocksize/1024;    # bytes to KB
        #print ("1 = $1\n2 = $2\n3 = $3\n4 = $4\n5 = $5\n");
        print ("$vmfsfriendly\n$totalspace\n$freespace\n$blocksize\n");
    }
}
close($fh);

Then append the /etc/snmp/snmp.conf file with the following lines (in my case):

exec .1.3.6.1.4.1.6876.99999.2.0 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-61468d50-ed1f-001cc447a19d Disk1
exec .1.3.6.1.4.1.6876.99999.2.1 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-7aa208e8-be6b-001cc447a19d Disk2

The first parameter after exec is the OID to publish. The script returns an array of values, one element per line, that are assigned to .0, .1, .2 and on up. I'm publishing the details I'm interested in, which may be different than yours. That's the 'print' line in the script.
The script itself lives in /root/bin/ since I didn't know where better to put it. It has to have execute rights for Other, though.
The big unique-ID looking number is just that, a UUID. It is the UUID assigned to the VMFS volume. The VMFS volumes are multi-mounted between each ESX host in that particular cluster, so you don't have to worry about chasing the node that has it mounted. You can find the number you want by logging in to the ESX host on the SSH console, and doing a long directory on the /vmfs/volumes folder. The friendly name of your VMFS volume is symlinked to the UUID. The UUID is what goes in to the snmp.conf file.
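The same lookup can be scripted. This is a minimal Python sketch of the symlink walk described above, assuming Python 3 is available wherever it runs (which may not be true of the ESX service console itself); the function name is hypothetical.

```python
import os


def friendly_to_uuid(volumes_dir="/vmfs/volumes"):
    """Map each symlinked friendly name to the UUID directory it points at."""
    mapping = {}
    for entry in sorted(os.listdir(volumes_dir)):
        path = os.path.join(volumes_dir, entry)
        if os.path.islink(path):
            # Strip any trailing slash so basename() returns the UUID itself.
            target = os.readlink(path).rstrip("/")
            mapping[entry] = os.path.basename(target)
    return mapping


if __name__ == "__main__":
    # Only attempt the listing on a host that actually has the directory.
    if os.path.isdir("/vmfs/volumes"):
        for name, uuid in friendly_to_uuid().items():
            print(f"{name} -> {uuid}")
```

Entries that are not symlinks (the UUID directories themselves) are skipped, so the result contains only friendly names.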
The last parameter ("Disk1" and "Disk2" above) is the friendly name of the volume to publish over SNMP. As you can see, I'm very creative.
These values are queried by my script and dropped into the database. Since the ESX datacenter volumes only get space consumed when we provision a new VM or take a snapshot, the graph is pretty chunky rather than curvy like the graph I linked to earlier. If VMware ever changes how the vmkfstools command returns data, this script will break. But until then, it should serve me well.
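On the collecting side, the four published values are enough to work out sizes and usage. A sketch in Python, assuming the values arrive as strings in the order the Perl script prints them (friendly name, total blocks, free blocks, block size in KB); the function name and the sample numbers are made up for illustration.

```python
def volume_stats(values):
    """Turn [name, total_blocks, free_blocks, block_kb] into sizes in GB (2**30 bytes)."""
    name, total_blocks, free_blocks, block_kb = values
    kb_per_gb = 1024 * 1024
    total_gb = int(total_blocks) * float(block_kb) / kb_per_gb
    free_gb = int(free_blocks) * float(block_kb) / kb_per_gb
    return {
        "name": name,
        "total_gb": total_gb,
        "free_gb": free_gb,
        "percent_used": 100.0 * (total_gb - free_gb) / total_gb,
    }


# Illustrative values only: 512,000 one-megabyte blocks, 128,000 of them free.
stats = volume_stats(["Disk1", "512000", "128000", "1024"])
print(stats["total_gb"], stats["free_gb"], stats["percent_used"])  # prints 500.0 125.0 75.0
```

Working in blocks rather than the raw byte counts keeps the numbers small, which matters if they travel as 32-bit SNMP integers.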
Labels: stats, storage, sysadmin, virtualization
Moving storage around
The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide, everything but Blackboard will be very easy to expand.
Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.
Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
- Rebuilding all of the Disk Arrays.
- Creating LUNs expressly for Backup-to-Disk functionality.
- Flashing the Active/Active firmware (the 7.00 rev) onto it.
- Getting the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA.
What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15 and 20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.
Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before, the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).
In the meantime, I watch progress bars. Terabytes of data take a long time to copy if you're not doing it at the block level. Which I can't.
Labels: clustering, msa, storage, sysadmin
Sunday, September 14, 2008
EVA6100 upgrade a success
Friday night four HP techs arrived to put together the EVA6100 from a pile of parts and the existing EVA3000. It took them 5 hours to get it to the point where we could power on and see if all of our data was still there (it was, yay), and a few hours of work on our part after that to put everything back together.
There was only one major hitch for the night, which meant I got to bed around 6am Saturday morning instead of 4am.
For EVA, and probably all storage systems, you present hosts to them and selectively present LUNs to those hosts. These host-settings need to have an OS configured for them, since each operating system has its own quirks for how it likes to see its storage. While the EVA6100 has a setting for 'vmware', the EVA3000 did not. Therefore, we had to use a 'custom' OS setting and a 16 digit hex string we copied off of some HP knowledge-base article. When we migrated to the EVA6100 it kept these custom settings.
Which, it would seem, don't work for the EVA6100. It caused ESX to whine in such a way that no VMs would load. It got very worrying for a while there, but thanks to an article on vmware's support site and some intuition we got it all back without data loss. I'll probably post what happened and what we did to fix it in another blog post.
The only service that didn't come up right was secure IMAP for Exchange. I don't know why it decided to not load. My only theory is that our startup sequence wasn't right. Rebooting the HubCA servers got it back.
Labels: clustering, exchange, hp, storage, sysadmin
Thursday, September 11, 2008
Fixing DNS issues
I've noticed some slow DNS on my station for the last few weeks and finally got down to checking it out. In the wake of the cache-poisoning scare of late July, we had to upgrade our DNS servers to something a bit less scarily old. I believe this required an operating system rev. The last time this happened to me, we figured out that the DNS server in question had auto-negotiated itself to 10-HalfDuplex, and the switch thought it was 100-FullDuplex. You can imagine what that did to throughput.
I fired up wireshark and started tracking my DNS requests. A pattern soon emerged. The first entry in my resolv.conf list was taking anywhere from .5 to 5.2 seconds to resolve most queries. This is hella slow for a DNS server. Since I don't manage these machines, I let the admin who did manage 'em know about it. He couldn't find anything wrong with the DNS servers at first glance.
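A rough way to get similar timings without a packet capture is to time lookups through the system resolver. This Python sketch is only an approximation of the wireshark approach: it measures the whole resolver path (every server in resolv.conf, plus any local caching) rather than a single server, and the function names are illustrative.

```python
import socket
import time


def time_lookup(hostname, family=socket.AF_INET, resolver=socket.getaddrinfo):
    """Return the seconds taken to resolve hostname for one address family."""
    start = time.perf_counter()
    try:
        resolver(hostname, None, family)
    except socket.gaierror:
        pass  # a failed lookup still shows how long the resolver waited
    return time.perf_counter() - start


def report(hostname):
    """Print IPv4 and IPv6 lookup times side by side."""
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        print(f"{label}: {time_lookup(hostname, family):.3f}s")
```

Calling report() with a hostname of your choosing prints one line per address family, which makes it easy to see whether one family's lookups are the slow ones.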
Another thing I noticed when looking at the resolver requests I was passing was a lot of IPv6 requests. Almost all of them were for Active Directory related queries, as I've turned off IPv6 support in my web-browser. I still haven't quite figured out how to disable IPv6 on my openSUSE 10.3 machine here.
As it happens, said DNS admin came back in and said to look at things again. So I dropped into nslookup and started throwing queries and watching the response times in wireshark, and sure enough they were zippy again. He turned off IPv6 support on the DNS servers.
Looks like we'll probably need to have a conversation on campus about IPv6 sooner rather than later. Vista comes with it turned on by default, and happily we don't have much of that yet. But these newer linux distros all have it turned on by default.
