Wednesday, May 14, 2008

NetWare and Xen

Here is something I didn't really know about in virtualized NetWare:

Guidelines for using NSS in a virtual environment

Towards the bottom of this document, you get this:

Configuring Write Barrier Behavior for NetWare in a Guest Environment

Write barriers are needed for controlling I/O behavior when writing to SATA and ATA/IDE devices and disk images via the Xen I/O drivers from a guest NetWare server. This is not an issue when NetWare is handling the I/O directly on a physical server.

The XenBlk Barriers parameter for the SET command controls the behavior of XenBlk Disk I/O when NetWare is running in a virtual environment. The setting appears in the Disk category when you issue the SET command in the NetWare server console.

Valid settings for the XenBlk Barriers parameter are integer values from 0 (turn off write barriers) to 255, with a default value of 16. A non-zero value specifies the depth of the driver queue, and also controls how often a write barrier is inserted into the I/O stream. A value of 0 turns off XenBlk Barriers.

A value of 0 (no barriers) is the best setting to use when the virtual disks assigned to the guest server’s virtual machine are based on physical SCSI, Fibre Channel, or iSCSI disks (or partitions on those physical disk types) on the host server. In this configuration, disk I/O is handled so that data is not exposed to corruption in the event of power failure or host crash, so the XenBlk Barriers are not needed. If the write barriers are set to zero, disk I/O performance is noticeably improved.

Other disk types such as SATA and ATA/IDE can leave disk I/O exposed to corruption in the event of power failure or a host crash, and should use a non-zero setting for the XenBlk Barriers parameter. Non-zero settings should also be used for XenBlk Barriers when writing to Xen LVM-backed disk images and Xen file-backed disk images, regardless of the physical disk type used to store the disk images.

Nice stuff there! The "xenblk barriers" can also have an impact on the performance of your virtualized NetWare server. If your I/O stream runs the server out of cache, performance can really suffer if barriers are non-zero. If it fits in cache, the server can reorder the I/O stream to the disks to the point that you don't notice the performance hit.

So, keep in mind where your disk files are! If you're using one huge XFS partition and hosting all the disks for your VM-NW systems on that, then you'll need barriers. If you're presenting a SAN LUN directly to the VM, then you'll need to "SET XENBLK BARRIERS = 0", as they're set to 16 by default. This'll give you better performance.
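
The decision rule in the quoted documentation is simple enough to sketch in a few lines of Python. This is only an illustration: the function names and the disk-type strings are made up for the example and are not part of any Novell tool. The one real piece is the SET command text itself.

```python
# Hypothetical helper; names and type strings are illustrative, not a Novell API.
DIRECT_SAFE_TYPES = {"scsi", "fibre-channel", "iscsi"}
DEFAULT_BARRIERS = 16  # default value per the quoted documentation


def recommended_xenblk_barriers(physical_type: str, is_disk_image: bool) -> int:
    """Return 0 when barriers can be turned off, otherwise the default of 16."""
    if is_disk_image:
        # LVM-backed and file-backed images need barriers on any physical disk.
        return DEFAULT_BARRIERS
    if physical_type.lower() in DIRECT_SAFE_TYPES:
        # Physical SCSI, Fibre Channel, or iSCSI disks (or partitions on them).
        return 0
    # SATA, ATA/IDE, and anything unrecognized keep barriers on.
    return DEFAULT_BARRIERS


def set_command(value: int) -> str:
    """Build the console command, enforcing the documented 0-255 range."""
    if not 0 <= value <= 255:
        raise ValueError("XenBlk Barriers must be between 0 and 255")
    return f"SET XENBLK BARRIERS = {value}"


print(set_command(recommended_xenblk_barriers("fibre-channel", False)))
print(set_command(recommended_xenblk_barriers("sata", False)))
```

Unknown disk types fall through to the default on purpose, since leaving barriers on costs performance but not data.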



Monday, April 02, 2007

Concurrency, again

I performed another concurrency test on Friday. I had 9 workstations performing an iozone throughput test. Each machine ran 20 threads, each processing against a 15MB file, for a total working set size of 2.7GB, which fits into the server's RAM. The results from the workstations were pretty consistent. The workstations had all of 384MB of RAM in them, and the number of IOZone threads running caused significant page-faulting to occur, which has the side effect of minimizing client-side caching. The workstations were connected to the core by way of 100Mb ethernet, so maximum theoretical speeds are 12.5MB/s.
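
For anyone who wants to check the arithmetic, here is a quick sketch. The function names are mine, and the link-speed conversion simply divides megabits by 8 and ignores protocol overhead.

```python
def working_set_mb(stations: int, threads_per_station: int, file_mb: int) -> int:
    """Total data touched when every thread works on its own file."""
    return stations * threads_per_station * file_mb


def wire_speed_mb_per_s(link_megabits: float) -> float:
    """Theoretical ceiling of a link in MB/s (8 bits per byte, no overhead)."""
    return link_megabits / 8


print(working_set_mb(9, 20, 15), "MB working set")
print(wire_speed_mb_per_s(100), "MB/s per workstation link")
```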

Some typical results, units are in KB/s:

Initial write     11058.47
Rewrite           11457.83
Read               5896.23
Re-read            5844.52
Reverse Read       6395.33
Stride read        5988.33
Random read        6761.84
Mixed workload     8713.86
Random write       7279.35

Consistently, write performance is better than read performance. On the tests that benefit greatly from caching, reverse read and stride read, performance was quite acceptable. All nine machines wrote at near flank speed for 100Mb ethernet, which means that the 1Gb link the server was plugged in to was doing quite a bit of work during the Initial Write stage.
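
A rough sketch of what "near flank speed" works out to. Two assumptions here are mine: that the typical Initial Write figure applies to all nine stations at once, and that 1 MB is 1000 KB when comparing against link speed.

```python
INITIAL_WRITE_KB_S = 11058.47   # typical per-workstation result quoted above
STATIONS = 9


def link_fraction(kb_per_s: float, link_megabits: float, kb_per_mb: int = 1000) -> float:
    """Share of a link's theoretical bandwidth consumed by a given throughput."""
    link_kb_per_s = link_megabits / 8 * kb_per_mb
    return kb_per_s / link_kb_per_s


# One workstation against its own 100Mb port.
per_station = link_fraction(INITIAL_WRITE_KB_S, 100)

# All nine workstations against the server's 1Gb link.
aggregate_kb_s = INITIAL_WRITE_KB_S * STATIONS
server_link = link_fraction(aggregate_kb_s, 1000)

print(f"per station: {per_station:.0%} of 100Mb")
print(f"aggregate:   {aggregate_kb_s / 1000:.1f} MB/s, {server_link:.0%} of 1Gb")
```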

What is perhaps the most encouraging is that CPU loading on the server itself stayed below the saturation level. Having spoken with some of the engineers who write this stuff, this is not surprising. They've spent a lot of effort in making sure that incoming requests can be fulfilled from cache and not go to disk. Going to disk is more expensive in Linux than in NetWare due to architectural reasons. Had the working set been 4GB or larger I strongly suspect that CPU loading would have been significantly higher. Unfortunately, as school is back in session I can't 'borrow' that lab right now as the tests themselves consume 100% of the resources on the workstations. Students would notice that.

The next step for me is to see if I can figure out how large the 'working set' of open files on FacShare is. If it's much bigger than, say, 3.2GB we're going to need new hardware to make OES work for us. This won't be easy. Most of the open-file size is Outlook archives (.PST files) for Facilities Management. PST files are low performance critters, so I don't care if they're slow. I do care about things like Access databases, though, so figuring out what my 'active set' actually is will take some figuring.

Long story short: With OES2 and 64 bit hardware, I bet I could actually use a machine with 18GB of RAM!



Thursday, March 29, 2007

Why cache is good

One of my post-BrainShare tasks is to rebenchmark some OES performance. I did a benchmark series back in September and the results there weren't terribly encouraging. I learned at BrainShare that a mid-December NCPSERV patch fixed a lot of performance issues, and I should rerun my tests. Okay, I can do that.

One test I did underlines the need to tune your cache correctly. Using the same iozone tool I've used in the past, I ran the throughput test with multiple threads. Three tests:

20 threads, each processing against a separate 100MB file (2GB working set)
40 threads, each processing against a separate 100MB file (4GB working set)
20 threads, each processing against a separate 200MB file (4GB working set)

The server I'm playing with is the same one I used in September. It is running OES SP2, patched as of a few days ago. 4GB of RAM, and 2x 2.8 P4 CPUs. The data volume is on the EVA 3000 on a Raid0 partition. I'm testing OES throughput, not the parity performance of my array. Due to PCI memory, effective memory is 3.2GB. Anyway, the very good table:

                  20x100M        40x100M        20x200M
Initial write     12727.29193    12282.03964    12348.50116
Rewrite           11469.85657    10892.61572    11036.0224
Read              17299.73822    11653.8652     12590.91534
Re-read           15487.54584    13218.80331    11825.04736
Reverse Read      17340.01892    2226.158993    1603.999649
Stride read       16405.58679    1200.556759    1507.770897
Random read       17039.8241     1671.739376    1749.024651
Mixed workload    10984.80847    6207.907829    6852.934509
Random write      7289.342926    6792.321884    6894.767334

The 2GB dataset fit inside of memory. You can see the performance boost that provides on each of the Read tests. It is especially significant on the tests designed to bust read-ahead optimization such as Reverse Read, Stride Read, and Random Read. The Mixed Workload test showed it as well.
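
To put a number on that boost, here is a small sketch that divides the in-memory (20x100M) result by each of the 4GB results. The figures are copied straight from the table; the function name is just for illustration.

```python
# Throughput figures copied from the table above, in the order
# (20x100M, 40x100M, 20x200M), units as reported by iozone.
RESULTS = {
    "Initial write":  (12727.29193, 12282.03964, 12348.50116),
    "Rewrite":        (11469.85657, 10892.61572, 11036.0224),
    "Read":           (17299.73822, 11653.8652, 12590.91534),
    "Re-read":        (15487.54584, 13218.80331, 11825.04736),
    "Reverse Read":   (17340.01892, 2226.158993, 1603.999649),
    "Stride read":    (16405.58679, 1200.556759, 1507.770897),
    "Random read":    (17039.8241, 1671.739376, 1749.024651),
    "Mixed workload": (10984.80847, 6207.907829, 6852.934509),
    "Random write":   (7289.342926, 6792.321884, 6894.767334),
}


def cached_advantage(results):
    """Map each test to how many times faster the 2GB run was than each 4GB run."""
    return {
        name: (fits / big_a, fits / big_b)
        for name, (fits, big_a, big_b) in results.items()
    }


for name, (vs_40x100, vs_20x200) in cached_advantage(RESULTS).items():
    print(f"{name:15s} {vs_40x100:6.2f}x {vs_20x200:6.2f}x")
```

The write tests come out close to 1x while the read-ahead-busting tests land somewhere between 7x and 14x, which is the whole point about cache.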

One thing that has me scratching my head is why Stride Read is so horrible with the 4GB data-sets. By my measure about 2.8GB of RAM should be available for caching, so most of the dataset should fit into cache and therefore turn in the fast numbers. Clearly, something else is happening.

Anyway, that is why you want to have a high cache-hit percentage on your NSS cache. This is also why 64-bit memory will help you if you have very large working sets of data that your users are playing on, and we're getting to the level where 64-bit will help. And will help even though OES NCP doesn't scale quite as far as we'd like it to. That's the overall question I'm trying to answer here.



Monday, March 26, 2007

BrainShare done

I'm back at work. BrainShare was a blast, as usual. Learned a lot. Spent most of the day dumping what I learned, and will be spending the rest of the week working on things I learned about last week. Like a new benchmark series with OES with the mid-December NCP patch. I want to see if that changed anything.

Also next week when class is back in I need to analyze our I/O patterns on WUF to better design a test for OES. I need to know FOR SURE if OES-Linux is up to the task of handling 5000 concurrent connections the way we do it. The last series suggested it, but I need more details.



Monday, March 19, 2007

TUT212: Novell Storage Services

Not a new topic, but it contained the updates to NSS that'll be there in OES2.

By far the biggest thing is a 64-bit version of OES. Big big big. How big? Very big.

Remember those benchmarks I ran? The ones that compare the ability of OES to keep up with NetWare? And how I learned that on OES NCP operations are CPU bound w-a-y more than on NetWare? That may be going away on 64-bit platforms.

You see, 64-bit Linux allows the kernel to have all addressable memory as kernel memory. 32-bit Linux was limited to the bottom 1GB of RAM. If NSS is allowed to store all of its cache in kernel memory, it'll behave exactly like 32-bit NetWare has done since NSS was introduced with NetWare 5.0. I have very high hopes that 64-bit OES will solve the performance problems I've had with OES.



Wednesday, October 11, 2006

MSA Performance update

An update to the MSA performance testing.
  1. RAID stripe performance (standard IOZONE, and a 32GB file IOZONE)
    1. 64K both Raid0 and Raid5
    2. Default stripes: 16K Raid5, and 128K Raid0
    3. Versus EVA performance
  2. Software mirror performance (software Raid1)
    1. Windows/NetWare: MSA/EVA
    2. Windows: MSA/MSA
    3. ?? Windows: EVA/EVA
  3. Concurrency performance
    1. Multiple high-rate streams to the same Disk Array (different logical drives)
    2. Multiple high-rate streams to different Disk Arrays
    3. Random I/O & Sequential I/O performance interaction on the same array
Testing EVA performance versus MSA performance was a bit of a trick. The EVA is in production, where the MSA is 100% devoted to this test. Hardly apples to apples. I also learned that the stripe size on the EVA is 128KB.

One thing became very, very clear when testing the default stripe sizes. A 16KB stripe size on a RAID5 array on the MSA gives faster read performance, but much worse write performance. Enough worse that I'm curious why it's a default. We'll be going with a 64K stripe for our production use, as that's a good compromise between read and write performance.

The Windows part of the mirror/unmirror test is completed. Write performance tracks the MSA performance, in that the curve has the same shape. This makes sense, because software mirroring needs to have both writes commit before it'll move on to the next operation. This by necessity forces write performance to follow the slowest performing storage device. All in all, write performance trailed MSA performance, which in turn trailed EVA performance for the large file test.

Read performance is where the real performance gains were to be had. This also makes sense because software Raid1 generally has each storage device alternating serving blocks. On reflection this could play a bit of hob with in-MSA or in-EVA predictive reads, but testing that is difficult. Performance matched EVA performance for files under 8GB in size, and still exceeded MSA performance for the 32GB file.

I'm running the NetWare test right now. Because this has to run over the network, I can't compare these results to the Windows test. But I can at least get a feeling for whether or not NetWare's software mirror provides similar performance characteristics. Considering how slow this test is running (Gig Ether isn't having as much of an impact as I thought it would), it'll be next week before I'll have more data.

Because of the delays I'm seeing, I've had to strike a few tests from the testing schedule. This needs to be in production during Winter Break, so we need time to set up pre-production systems and start building the environment.



Wednesday, October 04, 2006

More MSA performance

Since my OES benchmark went so well, I've been asked to do a series on the MSA we just received for our BCC cluster. Long time readers will remember that the BCC cluster will be done with free or cheap software, not Novell BCC. Unfortunately, the same goes for the hardware. So I get to find out if the MSA will really live up to our performance expectations.

The testing series I've worked out is this:
  1. RAID stripe performance (standard IOZONE, and a 32GB file IOZONE)
    1. 64K both Raid0 and Raid5
    2. Default stripes: 16K Raid5, and 128K Raid0
    3. Versus EVA performance
  2. Software mirror performance (software Raid1)
    1. Windows/NetWare: MSA/EVA
    2. Windows: MSA/MSA
    3. ?? Windows: EVA/EVA
  3. Concurrency performance
    1. Multiple high-rate streams to the same Disk Array (different logical drives)
    2. Multiple high-rate streams to different Disk Arrays
    3. Random I/O & Sequential I/O performance interaction on the same array
The dark green ones are the steps I've completed so far. I'm in the process of restriping for the 16K/128K stripes, which will probably take the rest of the day to complete. I may be able to start off the testing series before I go home tonight. If so, it'll probably get done sometime Sunday evening.

One thing the testing has already shown is that for Raid5 performance a quiescent MSA out-performs the in-production EVA. Since there is no way to do tests against the EVA without competing at the disk level for I/O supporting production, I can't get a true apples to apples comparison. By the numbers, EVA should outperform MSA. It's just that classes have started, so the EVA is currently supporting the 6 node NetWare cluster and the two node 8,000 mailbox Exchange cluster, where the MSA is doing nothing but being subjected to benchmarking loads.

The other thing that is very apparent in the tests is the prevalence of caching. Both the host server and the MSA have caching. The host server is more file-based caching, and the MSA (512MB) is block-level caching. This has a very big impact on performance numbers for files under 512MB. This is why the 32GB file test is very important to us, since that test blows past ALL caching and yields the 'worst case' performance numbers for MSA.



Monday, October 02, 2006

Performance of the MSA

I'm doing another performance series on the MSA we'll be putting into Bond Hall. This will be our BCC SAN as well as the home for the 'backup to disk' storage.

One of the tests I ran was to do a full IOZONE series on a 32GB file. This is to better get a feel for how such large files perform on the MSA, since I suspect that any backup-to-disk system will be generating files that large. But I got some s-t-r-a-n-g-e numbers. It turns out that the random-write test is much faster than the random-read test. Weird.

           Rec KB     2048     4096     8192
RandomR    Thru      10881    17115    23311
RandomW    Thru      64491    63949    62681

So, um. Yeah. And you want to know the scary part? This holds true for both a Raid0 and Raid5 array. Both have a 64K stripe size, which is not default. Raid5's default stripe is 16K, and Raid0 is 128K. I'll test the default stripes next to see if they affect the results any. But this is STILL weird.

Perhaps writes are cached and reordered, and reads just come off of disk? Hard to say. But read speed does improve as the record size increases. The 16MB record size turns in a read speed of only .5 that of the write speed. Yet the read performance at 64K is .12 that of write. Ouch! I'm running the same test on the EVA to see if there is a difference, but I don't know what the EVA stripe-size is.
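
For the three record sizes shown in the table, the read-to-write ratio can be worked out directly. This sketch only covers those three columns; the 64K and 16MB figures mentioned above aren't in the table, so they aren't recomputed here.

```python
# Random read / random write throughput by record size (KB), from the table above.
RANDOM_READ = {2048: 10881, 4096: 17115, 8192: 23311}
RANDOM_WRITE = {2048: 64491, 4096: 63949, 8192: 62681}


def read_write_ratio(reads, writes):
    """Read throughput as a fraction of write throughput, per record size."""
    return {rec_kb: reads[rec_kb] / writes[rec_kb] for rec_kb in reads}


for rec_kb, ratio in read_write_ratio(RANDOM_READ, RANDOM_WRITE).items():
    print(f"{rec_kb:5d} KB records: read is {ratio:.2f} of write")
```

The ratio climbs as the record size grows, which matches the trend described in the text.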




Tuesday, September 19, 2006

Results: Conclusions

The objective of this series of tests was to determine how well Open Enterprise Server -- Linux (here referred to as 'Linux') scales when compared to Open Enterprise Server - NetWare (here referred to as 'NetWare'). One of the prime goals was to figure out if we need to throw hardware at our cluster if we decide to migrate to Linux soon. My earlier test had shown that for a single station pounding on a Linux and NetWare server, the Linux server turned in better performance.

I was testing the performance of an NSS volume mounted over NCP. In part this is because NetWare clustering only works with NSS, but mostly because of two other reasons. First, the only other viable file-server for Linux is Samba, and I already know it has 'concurrency issues' that crop up well below the level of concurrency we show on the WUF cluster. Second, the rich meta-data that NSS provides is extensively used by us. I don't believe any Linux file system has an equivalent for directory quotas.

Hardware

  • HP ProLiant BL20P G2
  • 2x 2.8GHz CPU
  • 4GB RAM
  • HP EVA3000 fibre attached
OES-NetWare config
  • NetWare 6.5, SP5 (a.k.a. OES NetWare SP2)
  • N65NSS5B patch
  • nw65sp5upd1
  • 200GB NSS volume, no salvage, RAID0, on EVA3000
OES-Linux config
  • OES Linux SP2
  • Post-patches up to 9/12/06
  • 200GB NSS volume, no salvage, RAID0, on EVA3000
No attempts at tuning the operating systems were taken. Default settings were used to better resemble 'out of the box' performance. The one exception was on the NetWare IOZONE tests, where MAXIMUM SERVICE PROCESSES was bumped to 1000 from 750 (to no measurable effect, as it turned out).

To facilitate the testing I was granted the use of one of the computer labs on mothballs between terms. This lab had 32 stations in it, though only 30 stations were ever used in a test. I thank ATUS for the lending of the lab.

Client Configuration
  • Windows XP Sp2, patched
  • P3 1.6GHz CPU
  • 256MB RAM
  • Dell
  • Novell Client version 4.91.2.20051209 + patches
  • NWFS.SYS dated 11/22/05
When you look at situations where the Linux server was not bogged down with CPU load, it turned in performance that rivaled and in some cases exceeded that turned in by the NetWare server. This is consistent with my January benchmark. File-create and Dir-create both showed very comparable performance when load was low.

Unfortunately, the Linux configuration hits its performance ceiling well before the NetWare server does. Linux just doesn't scale as well as NetWare. I/O operations on Linux are much more CPU bound than on NetWare, as CPU load on all tests on the Linux server was excessive. The impact of that loading was very variable, though, so there is some leeway.

Both of the file-create and dir-create tests created 600,000 objects in each run of the test. This is a clearly synthetic benchmark that also happened to highlight one of the weaknesses of the NCP Server on Linux. During both tests it was 'ndsd' that was showing the high load, and that is the process that handles the NCP server. Very little time was spent in "IO WAIT", with the rest evenly split between USER and SYSTEM.

The IOZONE tests also drove CPU quite high due to NCP traffic, but it seems that actual I/O throughput was not greatly affected by the load. In this test it seems that Linux may have out-run NetWare in terms of how fast it drove the network. The difference is slight, a few percentage points, but looks to be present. I regret not having firm data for that, but what I do have is suggestive of this.

But what does that mean for WWU?

The answer to this comes with understanding the characteristics of the I/O pattern of the WUF cluster. The vast majority of it is read/write, with create and delete thrown in as very small minority operations. Backup performance is exclusively read, and that is the most I/O intensive thing we do with these volumes. There are a few middling sized Access databases on some of the shared volumes, but most of our major databases have been housed in the MS SQL server (or Oracle).

For a hypothetical reformat of WUF to be OES-Linux based, I can expect CPU on the servers doing file-serving to be in the 60-80% range with frequent peaks to 100%. I can also expect 100% CPU during backups. This, I believe, is the high end of the acceptable performance envelope for the server hardware we have right now. With half of the nodes scheduled for hardware replacement in the next 18 months, the possibility of dual and even quad-core systems becomes much more attractive if OES Linux is to be a long term goal.

OES-Linux meets our needs. Barely, but it does. Now to see what OES2 does for us!



Monday, September 18, 2006

Results: IOZONE and throughput tests

Unfortunately for me there were significant problems with the iozone and throughput tests. With iozone it is very, very clear that some form of client-side caching took place during the NetWare tests that did not occur during the Linux test. This seriously tainted the data. On NetWare, one station recorded a throughput of 292715 for a 16MB file size and 32KB record size, yet that same station at the same data-set recorded a throughput of 6602 on Linux. Yet, somehow, the total run-time for that workstation was not 44 times longer for the OES-Linux run than the OES-NetWare run.

With the throughput tests, there were no perceivable differences between 16 simultaneous threads and 32 simultaneous threads. The NetWare throughput test showed signs of client-side caching as well, so those results are tainted. Plus I learned that there were some client-side considerations that impacted the test. The clients all had WinXP SP2 in 256MB of RAM, and instantiating 16 to 32 simultaneous IOZone threads causes serious page faults to occur during the test.

As such, I'm left with much more rough data from these tests. CPU load for the servers in question, network load, and fibre-channel switch throughput. Since these didn't record very granular details, the results are very rough and hard to draw conclusions from. But I'll do what I can.

At the outset I predicted that these tests would be I/O intensive, not CPU intensive. It turns out I was wrong for Linux, as CPU loads approached those exhibited by the dir-create and file-create tests for the whole iozone run. On the other hand, the data are suggestive that the CPU loading did not affect performance to a significant degree. CPU load on NetWare did approach 80% during the very early phases of the iozone tests, when file-sizes were under 8MB, and decreased markedly as the test went on. It was during this time that the highest throughputs were reported on the SAN.

Looking at the network throughput graphs for both the lab-switch uplink to the router core and the NIC on the server itself suggests that throughput to/from OES-Linux was actually faster than OES-NetWare. The difference is slight if it is there, but at a minimum both servers drove an equivalent speed of data over the ethernet. Unfortunately, the presence of client-side caching on the clients for the NetWare run prevents me from determining the actual truth of this.

On the fibre-channel switch attached to the server and the disk device (an HP EVA) I watched the throughputs recorded on the fibre ports for both devices. The high-water mark for data transfer occurred during the first 30 minutes of the iozone run with NetWare. The Linux test may have posted an equivalent level, but that test was run during the night and therefore its high-water mark was unobserved. At the time of the NetWare high-water mark all 32 stations were pounding on the server with file-sizes under 16MB. The level posted was 101 MB/s (or 6060 MB/Minute), which is quite zippy. This transfer rate coincided quite well with the rate observed on the ethernet. This translates to about 80% utilization on the ethernet, which is pretty close to the maximum expected throughput for parallel streams.
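
The unit conversions here are easy to sanity-check. A minimal sketch, assuming the server's ethernet link is gigabit (125 MB/s theoretical) and ignoring protocol overhead; the function names are mine.

```python
def mb_per_minute(mb_per_s: float) -> float:
    """Convert a MB/s rate into the MB/Minute figure backup tools tend to report."""
    return mb_per_s * 60


def link_utilization(mb_per_s: float, link_megabits: float = 1000) -> float:
    """Fraction of a link's theoretical bandwidth (assumed gigabit by default)."""
    return mb_per_s / (link_megabits / 8)


print(mb_per_minute(101), "MB/Minute")
print(f"{link_utilization(101):.1%} of a gigabit link")
```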

For comparison, the absolute maximum transfer rate I've achieved with this EVA is 146 MB/s (8760 MB/Min). This was done with iozone running locally on the OES-Linux box and TSATEST running on one of the WUF cluster nodes backing up a large locally mounted volume. Since this setup involved no ethernet overhead, it did test the EVA to its utmost. It was quite clear that the iozone I/O was contending with the TSATEST data, as when the iozone test was terminated the TSATEST screen reported throughput increasing from 830 MB/Min to 1330 MB/Min. I should also note that due to the zoning on the Fibre Channel switch, this I/O occurred on different controllers on the EVA.

These tests suggest that when it comes to shoveling data as fast as possible in parallel, OES-Linux performs at a minimum the equivalent of OES-NetWare and may even surpass it by a few percentage points. This test tested modify, read, and write operations, which except for the initial file-create and final file-delete operations are metadata-light. Unlike file-create, the modify, read, and write operations on OES-Linux appear to not be significantly impacted by CPU loading.

Next, conclusions.



Results: create operation differences

Looking at charts that show just create operations on the platforms is interesting.
Graph comparing file-create and dir-create operations on OES-Linux
I just put the Min values in the error bars to make it a cleaner graph. But here you can see the trend mentioned in the file-create tests about the 4000 object line. Only here 4500 objects seems to be the point where file-create passes dir-create in terms of time per operation. This is a result of CPU usage and the fact that file-create appears to be more affected by it than it is on NetWare. The identical NetWare chart is illustrative, but since CPU never went above 70% for more than a few moments it isn't a pure apples-to-apples comparison.
Graph comparing file-create and dir-create operations on OES-NW
In this case, file-create remains below dir-create for the whole run. What's more, dir-create drove CPU a lot harder than file-create did. The early data in the Linux run shows that OES-Linux would follow this file-create-is-faster pattern given sufficient CPU.

Exactly why file-create performance degrades so fast when CPU contention begins is unclear to me. In terms of disk bandwidth, all four tests barely twitched the needle on the SAN monitor; these tests do not involve big I/O transfers. As far as NSS is concerned, a directory and a file are very similar objects in the grand scheme of things. Yet NSS seems to track more data related to directories than files, so it seems counterintuitive that file-create would lag when CPU becomes a problem. This question is one I should bring with me to BrainShare 2007.

Next, IOZONE and throughput tests.



Results: file-create

The Test:
30 workstations create a sub-directory, and in that sub-directory create 20,000 files. At each 500 files it does a directory listing and times how long it takes to retrieve the list. A running total of the time taken to create files is kept, and a log of how long each entry takes to create is also kept.
Graph comparing file-create times between OES-Linux and OES-NetWare on an NSS volume
This chart is interesting in several ways. First of all, note the lower error bars for the Linux line. Those bars overlap the NetWare results, and up to about 4000 files they are actually below the NetWare average. This says to me that when there is CPU room, Linux may be faster than NetWare when responding to file creates. This particular line was caused by the same thing as in the previous test, namely that some test stations started up to 30 seconds before the whole group was running and therefore had a window of uncontended I/O. Those same workstations finished their tests while others were still around 12000 files, which further explains the downward trend of the Linux line above that threshold.

The second interesting thing is the sheer variability of the results. As with the dir-create test, CPU was completely utilized on the OES-Linux box. The reported load-averages were very similar to dir-create. Some test workstations were able to run a complete test before others even got to 12000 files. Yet others took a really long time to process. The file-create test ran well over an hour, where the same test on NetWare took just under 30 minutes.
Graph comparing file enumeration between OES-Linux and OES-NetWare
This graph shows significant differences between the two platforms. As with the first chart, at 4000 files and under some workstations turned in NetWare-equivalent response times when speaking to OES-Linux. As with the above, this was due to uncontended I/O. But once all the clients started running the test the response time for directory enumeration was greatly degraded.

Because file-create seems to clog the I/O channels more than dir-create did, directory enumeration had to compete in the same channels and thus response times suffered. Towards the end of the test, when some workstations had finished early, response times were creeping back towards parity with OES-NetWare.

Next, create operation differences.




Friday, September 15, 2006

Results: dir-create

Taking a look at the data for the dir-create test, you can see the differences between the two platforms.

The Test:
30 workstations create a sub-directory, and in that sub-directory create 20,000 directories. At each 500 directories it does a directory listing and times how long it takes to retrieve the list. A running total of the time taken to create directories is kept, and a log of how long each entry takes to create is also kept.
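
For readers who want a feel for what each workstation was doing, here is a minimal Python sketch of the loop described above. It is not the tool used in these tests, and every name in it is hypothetical; it just creates directories, times each create, and times a directory listing at every interval.

```python
import os
import tempfile
import time


def run_dir_create(base_dir, total=20000, list_every=500):
    """Create `total` directories under a new sub-directory of base_dir.

    Returns (running_total_seconds, per_create_seconds, listings), where
    listings holds (count_so_far, seconds_to_list, entries_seen) tuples.
    """
    work_dir = os.path.join(base_dir, "dir-create-test")
    os.mkdir(work_dir)

    create_times = []
    listings = []
    running_total = 0.0

    for i in range(1, total + 1):
        start = time.perf_counter()
        os.mkdir(os.path.join(work_dir, f"d{i:05d}"))
        elapsed = time.perf_counter() - start
        create_times.append(elapsed)      # log of each individual create
        running_total += elapsed          # running total of create time

        if i % list_every == 0:
            start = time.perf_counter()
            entries = os.listdir(work_dir)
            listings.append((i, time.perf_counter() - start, len(entries)))

    return running_total, create_times, listings


if __name__ == "__main__":
    # Small run against a local temp directory; the real test pointed
    # 30 workstations at an NCP-mounted NSS volume.
    with tempfile.TemporaryDirectory() as tmp:
        total_s, creates, listings = run_dir_create(tmp, total=1000, list_every=500)
        print(f"{len(creates)} creates in {total_s:.3f}s, {len(listings)} listings")
```

Swapping `os.mkdir` for a file create would give the file-create variant of the same loop.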

Directory Create graph comparing NetWare to Linux
This chart shows it very well. As I've said before, the state of the server affected this run. At its peak, the NetWare server had a CPU load around 65%. The Linux server had a load average around 18, which roughly translates to a CPU load of 900%. Directory Create is an expensive operation due to the amount of meta-data involved. This is clearly much more expensive on the Linux platform than it is on the NetWare platform.
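
The load-average translation used above is just a division by the CPU count. A sketch of that rough rule follows; keep in mind that a Linux load average counts runnable and uninterruptible tasks rather than measuring CPU busy time, so this is an approximation and nothing more.

```python
def load_average_to_cpu_percent(load_average: float, cpu_count: int) -> float:
    """Rough equivalent CPU load: a load average equal to cpu_count is 100%."""
    return load_average / cpu_count * 100


# The test server has two CPUs.
print(load_average_to_cpu_percent(18, 2))   # the dir-create peak
print(load_average_to_cpu_percent(2.0, 2))  # a fully busy but unqueued box
```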

The range of results is also quite interesting. Generally speaking, when speaking to a NetWare server the clients had a pretty even spread of response times. Some were faster than others. It just happens. Because of testing limits I was not able to start all stations at exactly the same time; however, start-times were within 30 seconds of each other. The stations that went first recorded really good times for the first 3000 directories or so, then slowed down as everyone got going. This effect was quite clear in the raw Linux data, though it is hidden in the above chart.

A side effect of that is that when the fast clients finished, it removed some of the I/O contention going on. You can see that in the downward curve of the Linux line towards the end of the test. That doesn't indicate that Linux was getting better at higher speeds, just that some clients had finished working and had removed themselves from the testing environment.

Directory Enumeration graph comparing NetWare and Linux
This is the chart that describes how long it takes to enumerate a single directory inside of a dir-list of the created sub-directory. As the test progressed there were more directories to enumerate. Mere enumeration isn't an expensive operation, as it just involves a sub-set of the metadata in the directory-entries. As with the dir-create test, dir-enum shows that Linux is slower on the ball than NetWare is under heavy load conditions. This is pretty clearly CPU related, as a single client running these tests shows very little difference between the platforms.

The hump and fall-off of the Linux line is an artifact of faster workstations getting done quicker and getting out of the way. The sheer variability of the Linux line is interesting in and of itself. I'm sure further testing may identify the cause of that, but I'm limited on time and other resources so I won't be investigating it now.

Next, on Monday, file-create and file-enumerate.


Thursday, September 14, 2006

Testing completed

And now I enter the data-analysis phase. It'll be a while until I release numbers.

But, I figured I'd give some impressions I got from the tests. For brevity purposes, when I say NetWare I mean, "OES NetWare 6.5 SP3 with patches up to 8/23/06", and when I say Linux, I mean, "OES Linux SP2, with patches up to 9/1/06". Also, when talking about I/O, I'm referring to, "I/O performed over the network via NCP to an NSS volume."
  • I/O on Linux is more CPU bound than on NetWare. For absolute sure, dir-create and file-create are much more expensive operations CPU-wise. They both perform similarly when done with unloaded systems, but the system hit for create on Linux is much higher than on NetWare. This could be due to System/User memory barriers, but my testing isn't robust enough to test that sort of thing. NetWare is all Ring 0, where by necessity Novell has brought a lot of the file-sharing functions in Linux into Ring 3.
  • Bulk I/O speed is similar. When talking about bulk I/O functions, in my case this was the IOZONE test, both platforms perform similarly. Unfortunately, caching played a big role on the NetWare test and didn't perform any role in the Linux test. This is the inverse of my findings in January. The testing gods frowned on me.
  • Linux seems to support faster network I/O than NetWare. Unfortunately, this may just be a side-effect of the caching. But network loads were higher when running the bulk IO tests on Linux than they were with NetWare. This can be a good thing (Linux supports more network I/O than NetWare) or a bad thing (Linux requires more network I/O for similar performance). Not sure at this time which it is.
CPU loads on the WUF cluster nodes during term generally run on average in the 8-12% range. The multiplier for CPU load was similar for dir-create and file-create operations. If you assume (incorrectly) that the CPU is reflective of file I/O activity, Linux machines performing the same duties would report load-averages around 8.0. Since most I/O are reads, and that operation is not as load-inducing as a create, the averages would be under 100% (load-average of 2.0 for these boxes). But still higher than for NetWare.

Another thing to note is that the bulk I/O test with IOZONE also induced very high load-averages on Linux, but the apparent throughput was very comparable to NetWare. IOZONE works by creating a file of size X and running a series of tests on records of size Y. Unlike the dir-create and file-create tests, this test doesn't test how fast you can create files; it tests how fast you can get data. Clearly record I/O within files still induces CPU load in the form of NDSD activity; however, unlike the dir-create and file-create tests, the apparent throughput is not nearly as affected by high-CPU conditions.

From this early stage it looks like we could convert WUF to Linux and still not need new hardware. But we'd be running that hardware harder, much harder, than it would have run under NetWare. Since we're not pushing the envelope with our NetWare servers now, we have the room to move. If our servers were running closer to 20% CPU, the answer would be quite different.

As I read the documentation, it looks like NCPserv is a function of ndsd. Therefore, seeing ndsd taking up CPU cycles that way was due to NCP operations, not DS operations. If that's the case, substituting a reiser partition for the NSS partition would decrease CPU loading some, but probably not the order of magnitude it needs.


Wednesday, September 13, 2006

return of.. part 2

Now that I'm looking at the network loading data I am seeing something interesting. The NetWare server handled the first 30 minutes of load better than the OES-Linux server did, but after that the OES-Linux server provided better throughput. The difference isn't great, a few percentage points on the GbE link, but it is there. CPU is still pretty high, but it's more than keeping up.

Unfortunately, we seem to have an 'apples to apples' problem. While the network utilization appears to be higher with the OES Linux server, implying better throughputs, it is clear from the few clients that have finished the run that there was no caching involved with this particular test. Comparing numbers, therefore, will be a bear.

Ideally I'd rerun the NetWare test with client caching and oplock 2 disabled, but I don't have time for that. This server needs to be given back to the service I borrowed it from.


Tuesday, September 12, 2006

return of differences.

The big iozone test kicked off about 45 minutes ago. When I did this on NetWare, CPU hung at around 60-65% or so, and the Telecom guys made happy noises as their new monitoring software turned colors they'd never seen before.

Okay, it turned 'warning'. Before it was either green/working, or red/broken. They'd never seen yellow/high-load before. They were quite happy.

Anyway... the 1Gb link between the lab with all the workstations and the router core was running 79-81% utilization. Nice!

On the SAN link we had around 20% utilization, the highest I'd ever seen the EVA driven.

Right now I can't tell what that link is running, but the link into the server itself is running in the 50-60% range. Better analysis will occur tomorrow when I can ask the Telecom guys how that link behaved overnight. As for the server, load-levels are well above 3.0 again. Right at this moment it's at 14ish, with ndsd being the prime process.

At this point I'm beginning to question what Unix load-averages mean when compared to the CPU percentage reported by NetWare. Are they comparable? How does one compare? Anyway, the dir-create and file-create tests showed themselves to be much more CPU-bound on Linux than NetWare, and this sort of bulk I/O seems to have a similar binding. Late in the test CPU on NetWare was fairly low, 20% range, with the prime teller of loading being allocated Service Processes. So I'm pretty curious as to what load will look like when all the stations get into the 128MB file sizes and larger.


differences continued

I finally managed to get the 'big file' test done. Results in the end were similar to those reported in the previous post. The file-create process isn't so much improved over dir-create that the tests ran fast. There was a marked difference, though. During the dir-create test the uptime load-levels were in the 17-19 range, where with the file-create test they were in the 4-6 range. Much improved, but still pushing CPU well past 100%.

I haven't looked at the data closely yet, but I suspect that the same trends reported in the dir-create test follow here. I didn't do a test for dir-create and file-create on NetWare with a smaller number of stations, but then it didn't seem like I needed to. The 'break even' point, where CPU is just under 100%, on the dir-create looks to be in the 4-6 station range, with the file-create point on or around 10 stations.


Monday, September 11, 2006

differences bloom

I got the OES-Linux SP2 server formatted and installed this morning. And the NSS volume created. I ran the first benchmark, and golly there is a difference.

Test 1 is the 'big directory' test. The client stations create 20,000 sub directories in a sub-directory titled the name of the machine. The time to create each directory is tracked, and the time it takes to enumerate each directory is also tracked. In testing out the benchmark it is clear that mkdir is a more expensive operation than 'touch' is in the make-file test (also 20,000 files).

On NetWare with 30 client machines pounding the server, CPU rose to about 80% or so and stayed there. Load on the CPUs was equal. There was some form of bottlenecking going on because some clients finished much faster than others, and it isn't clear what separated the two classes.

On Linux the load-average is pretty stable around 18. The process taking up that CPU is ndsd. The numbers I'm getting back from the clients are vastly worse than NetWare. The first time I ran it I figured that this was due to the workstation objects not having the posixAccount extension. So I fixed that, and now the percentages are better, but still much worse than NetWare. I'll run this test again with only 10 clients, so I get to compare smaller concurrent access numbers.

That kind of load is not exactly 'real user load', it's a synthetic load designed to show how well either platform handles abuse. The iozone benchmark should be closer to comparable since that's just a single file, and ndsd shouldn't be involved with those accesses much at all. That'll be almost entirely i/o subsystem.


Friday, September 08, 2006

progressing

Right now I'm running the mass IOZONE test. 30 workstations are pounding the test NetWare server with IOZONE, running this command-line:

iozone -Rab \report-dump\IOZONE-std\%COMPUTERNAME%-iozone1.xls -g 1G -i 0 -i 1 -i 2 -i 3 -i 4 -i 5
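
For anyone scripting a similar run, here is a small sketch that assembles the same argument list per workstation. The function name and the sample machine name are made up for illustration; the flags are simply the ones in the command above, with `-Rab` split into `-R -a -b`:

```python
def build_iozone_command(computer_name, max_size="1G", tests=(0, 1, 2, 3, 4, 5)):
    """Assemble the iozone argument list used on each workstation."""
    # Report path from the command above, with %COMPUTERNAME% filled in.
    report = "\\".join(
        ["", "report-dump", "IOZONE-std", f"{computer_name}-iozone1.xls"]
    )
    args = ["iozone", "-R", "-a", "-b", report, "-g", max_size]
    for test_number in tests:
        args += ["-i", str(test_number)]
    return args


# "LAB-PC01" is a hypothetical workstation name.
print(" ".join(build_iozone_command("LAB-PC01")))
```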

Right now all the stations are chewing on the 1GB file, and are all at various record-size stages. But the fun thing is the "nss /cachestats" output:
BENCHTEST-NW:nss /cachestat
***** Buffer Cache Statistics *****
Min cache buffers: 512
Num hash buckets: 524288
Min OS free cache buffers: 256
Num cache pages allocated: 414103
Cache hit percentage: 63%
Cache hit: 3407435
Cache miss: 1978789
Cache hit percentage(user): 60%
Cache hit(user): 3031275
Cache miss(user): 1978789
Cache hit percentage(sys): 100%
Cache hit(sys): 376160
Cache miss(sys): 0
Percent of buckets used: 48%
Max entries in a bucket: 7
Total entries: 399112
Yep. All that I/O is only partially being satisfied by cache-reads. As it should be at this stage of the game.
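
The percentages in that output can be reproduced from the hit and miss counters. This is only a sketch of the arithmetic, not how NSS computes it internally. The function name is made up, and the use of integer truncation is an inference: it yields the 60% shown for user I/O, where rounding would give 61%.

```python
def hit_percentage(hits, misses):
    """Share of lookups served from cache, truncated to a whole percent."""
    total = hits + misses
    if total == 0:
        return 0
    return hits * 100 // total


# Counters copied from the cachestats output above
print(hit_percentage(3407435, 1978789))  # overall: 63
print(hit_percentage(3031275, 1978789))  # user:    60
print(hit_percentage(376160, 0))         # sys:     100
```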

What surprised me yesterday when I kicked off this particular test was how badly hammered the server was at the very beginning. This is the small file-size test, and better approximates actual usage. CPU during the first 30 minutes of the test was in the 70-90% range, and was asymmetric; CPU1 was nearly pegged. During that phase of it we also drove a network utilization of 79-83% on the GigE uplink between the switch serving the testing machines and the router core. And on the Fibre Channel switch serving the test server, the high-water mark for transfer speed was 101 MB/Second (~20% utilization).
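
As a rough cross-check on those figures, link utilization can be turned into throughput. This sketch ignores protocol overhead and assumes decimal megabytes, and the function name is made up. By this arithmetic, 79-83% of a gigabit link is roughly 99-104 MB/s, which is in the same ballpark as the 101 MB/Second seen on the Fibre Channel side.

```python
def utilization_to_mb_per_sec(utilization_percent, link_bits_per_sec=1_000_000_000):
    """Convert link utilization (percent) into MB/s, ignoring protocol overhead."""
    bytes_per_sec = link_bits_per_sec / 8 * utilization_percent / 100
    return bytes_per_sec / 1_000_000


for pct in (79, 83):
    print(pct, utilization_to_mb_per_sec(pct))
```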

The FC speed is notable. The fastest throughput I had been able to produce on the port linking the EVA was about 25 MB/Second, and that was done with TSATEST running against local volumes in parallel on three machines. Clearly our EVA is capable of much higher performance than we've been demanding of it. Nice to know.

Depending on how the numbers look once this test is done, I might change my testing procedure a bit. Run a separate 'small file' run in IOZone to capture the big-load periods, and perhaps a separate 'big file' run with 1G files to capture the 'cache exhaustion' performance.

On a NetWare note, the 'Current MP Service Processes' counter hit the max of 750 pretty fast during the early stages of the test. Upping the max to 1000 showed how utilization of service processes progressed during the test. Right now it's steady at 530 used processes. Since I don't think Linux has a similar tunable parameter, this could be one factor making a difference between the platforms.


Wednesday, September 06, 2006

Technology is cooooool

I just worked out a REALLY NEAT trick to help with managing my benchmarking clients. I figured something like this is possible, but actually seeing it work was one of those moments that make what I do so fun. I came real close to shouting, "I am the zombie master," but I held off. Just.

Anyway, the trick:
  1. Make sure all the clients are imported as Workstation Objects.
  2. Create a Workstation Group, and add all of the clients into it.
  3. Add the newly created Workstation Group as a R/W trustee of the volume I'm benchmarking against. This allows the workstations as themselves, not users, to write files.
  4. Create a Workstation Policy, associate it to the group.
  5. In the Workstation Policy, create a Scheduled Task. Point it at the batchfile I wrote that'll map a drive to the correct volume, run the tests, and clean up.
  6. Modify the schedule so it'll run at a specific time, making sure to uncheck the 'randomize' box.
  7. Force a refresh of the Policies on the clients (restarting the Workstation Manager service will do it).
And the best part? I don't have to be physically present to kick off the activities! Woo! I can even run the big I/O ones in the depths of night.

The jobs all seem to start within 30 seconds of the scheduled time. This doesn't seem to be due to differences in the workstation clocks; on checking, those are all within 3 seconds of 'true'. Rather, it looks like the Workstation Manager task polling interval. I wish I could get true 'everyone right now' performance, but that's not possible without w-a-y more minions.

On the 'large number of sub-directories' test, the early jumpers seemed to get a continued edge over the late starters. The time to create directories for the early jumpers was consistently in the 3-5ms range, where the late jumpers were in the 10-13ms range. Significant difference there. And some started fast and became slow, so there is clearly some threshold involved here beyond just the server dealing with all those new directory entries. CPU load on the NetWare box (what I have staged up first) during the test with 32 clients creating and enumerating large directories was in the 55-70% range. That load is spread equally over both CPUs, so those bits of NSS are fully MP enabled.


Thursday, August 31, 2006

Math is hard

So far in the process of debugging this little proggie I've discovered three major math errors.
  1. The timer I'm using has some math in it that made me miss decimal places. Oops.
  2. Small bug where I was retrieving the directory list the same number of times as the stepping factor. Give a stepping factor of 500, and it'll grab the dir list 500 times. Oops.
  3. After retrieving the time taken to grab the directory list I divided that by the stepping factor. Then the total number of entries it retrieved. It should have been just divided by the number of entries retrieved. Oops.
*sigh*
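
To make the third error concrete, here is a small sketch of the wrong and right arithmetic. The function names and the sample numbers are invented for illustration; the actual tool's code isn't shown in this post.

```python
def buggy_per_entry_ms(total_enum_ms, entries_retrieved, stepping_factor):
    """The mistake: dividing by the stepping factor AND the entry count."""
    return total_enum_ms / stepping_factor / entries_retrieved


def per_entry_ms(total_enum_ms, entries_retrieved):
    """The fix: divide the enumeration time by the entries retrieved, nothing else."""
    return total_enum_ms / entries_retrieved


# Hypothetical numbers: 24 ms to list 3000 entries, stepping factor of 500.
print(buggy_per_entry_ms(24.0, 3000, 500))  # far too small
print(per_entry_ms(24.0, 3000))             # 0.008 ms per entry
```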


Benchmark observations

Switching this little prog to make files instead of directories was the work of about three lines of code. No biggie. Comparing it to the Directory make, there are a couple of observations I can make when running against NetWare/NSS onna SAN:
  • MKDIR seems to be more server-CPU intensive than TOUCH by quite a bit. During the long MKDIR test CPU was noticeably higher than ambient, but the TOUCH test barely twitched the needle. Hmmm.
  • MKDIR is a faster operation than TOUCH, from a client's perspective.
  • Directories are faster to enumerate than files.
  • Enumeration operations are sensitive to network latency. When the client is busy, enumeration gets noisier.
  • Both create and enumerate are sensitive to client CPU loads.
  • Enumeration is much faster than create, by about four orders of magnitude.
  • At least Directory Create time does trend upwards over time depending on how many objects are in the parent directory. Though this is only really visible when going well above 100,000 directories, and is very slight; 2.049ms at 2000 dirs and 2.4159ms at 500K dirs. Haven't tested files yet.
Both tests were run on a directory with Purge Immediate set. For a REASON. In the actual benchmark I'll probably also set PI, so I'm not filling the slack space with deleted files and have to explain away a mid-benchmark performance drop when the server has to expire off the oldest files/directories.


Perhaps not?

I ran a few tests with the benchmark I wrote yesterday, and it doesn't seem to be a useful one. Perhaps if I tweak it to use files instead of directories it'll be more useful. But the charts I get out of it show very slow slides up, with no clear break points. MKDIR functions work nearly linearly, and enumerating the directories also shows very good performance. It still takes a long time to parse through 100,000 directories, but atomically it works well.

Though, I wonder if files yield different results?

And, doh, I bet I'll get different results when running against a Windows server than a NetWare one. Heh. We'll see.


Wednesday, August 30, 2006

That was easier than I thought...

After not nearly as much time spent in front of VisualStudio as I thought I'd need, I now have a tool to test out the big-directory case. I'm running a few tests to see if the output does make sense, but early returns show not too shabby information. I'm still not 100% certain that I have my units right, but at least it'll give me something to compare against, and show whether large directory listings are subject to linear or curved response times.

Some sample output:

Iterations, MkDirOpTime(ms), EnumerationTimeDOS(ms), EnumerationOpTimeDOS(ms)
500, 0.213460893024992, 4.33778808035625, 0.0086755761607125
1000, 0.205388901117948, 8.56917296772817, 0.00856917296772817
1500, 0.206062279938047, 12.6200889185171, 0.00841339261234476
2000, 0.203543292746338, 16.5182862916397, 0.00825914314581986
2500, 0.202069268714127, 20.5898861176478, 0.00823595444705914
3000, 0.201296393468305, 24.786919106575, 0.00826230636885834


That's 3000 directories being enumerated in that bottom line. I'm also not 100% on the time unit being milliseconds, though the same operation converts the arbitrary (?) system units into real units for all of those.
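
Whatever the units turn out to be, the columns can at least be checked against each other. The sketch below parses the sample output above and confirms that the last column is the total enumeration time divided by the number of directories. The function name is made up, and the relationship is read off the numbers rather than taken from the tool's source.

```python
import csv
import io

SAMPLE = """\
Iterations, MkDirOpTime(ms), EnumerationTimeDOS(ms), EnumerationOpTimeDOS(ms)
500, 0.213460893024992, 4.33778808035625, 0.0086755761607125
1000, 0.205388901117948, 8.56917296772817, 0.00856917296772817
1500, 0.206062279938047, 12.6200889185171, 0.00841339261234476
2000, 0.203543292746338, 16.5182862916397, 0.00825914314581986
2500, 0.202069268714127, 20.5898861176478, 0.00823595444705914
3000, 0.201296393468305, 24.786919106575, 0.00826230636885834
"""


def check_per_entry_column(text, tolerance=1e-9):
    """Return True if every row has per-entry time == total time / iterations."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        iterations = int(row["Iterations"])
        total = float(row["EnumerationTimeDOS(ms)"])
        per_entry = float(row["EnumerationOpTimeDOS(ms)"])
        if abs(total / iterations - per_entry) > tolerance:
            return False
    return True


print(check_per_entry_column(SAMPLE))
```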

The utility takes two arguments:
  • Number of directories to create.
  • [optional] Stepping factor.
The above output was generated with "bigdir 3000 500", which creates 3000 subdirectories and does a directory enumeration after every 500 directories. It defaults to a step of 1.
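
For readers who want to try the same idea without the original tool, here is a rough Python sketch of the approach. It is not the VisualStudio program described above: the directory naming, the choice to average mkdir time over each step, and the use of `os.scandir` for enumeration are all assumptions made for illustration. The demo runs against a temporary local directory; for a real test the parent would be a directory on the mapped volume.

```python
import os
import tempfile
import time


def bigdir(parent, count, step=1):
    """Create `count` subdirectories under `parent`; after every `step`
    creations, time one full enumeration of `parent`.

    Returns rows of (dirs_so_far, avg_mkdir_ms, enum_total_ms, enum_per_entry_ms).
    """
    rows = []
    mkdir_ms = 0.0
    for i in range(1, count + 1):
        start = time.perf_counter()
        os.mkdir(os.path.join(parent, f"dir{i:06d}"))
        mkdir_ms += (time.perf_counter() - start) * 1000.0
        if i % step == 0:
            start = time.perf_counter()
            entries = sum(1 for _ in os.scandir(parent))
            enum_ms = (time.perf_counter() - start) * 1000.0
            # Divide by entries retrieved only, not by the stepping factor.
            rows.append((i, mkdir_ms / step, enum_ms, enum_ms / entries))
            mkdir_ms = 0.0
    return rows


# Small local demo: 20 directories, enumerate after every 5.
with tempfile.TemporaryDirectory() as demo_parent:
    print("Iterations, MkDirOpTime(ms), EnumerationTime(ms), EnumerationOpTime(ms)")
    for row in bigdir(demo_parent, 20, 5):
        print(", ".join(str(value) for value in row))
```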


Tuesday, August 29, 2006

Getting ready for a benchmark

Last January I did a benchmark of OES-Linux versus OES-NetWare performance for NCP and CIFS sharing. That was done on OES SP1 due to SP2's relatively recent release. SP2 has now been out for quite some time, and both platforms have seen significant improvements with regards to NSS and NCP.

Right now I'm looking to test two things:
  • NCP performance to an NSS volume from a Windows workstation (iozone)
  • Big directory (10,000+ entries) performance over NCP (tool unknown)
I'm open to testing other things, but my testing environment is limited. There are a few things I'd like to test, but don't have the material to do:
  • Large scale concurrent connection performance test. Essentially, the NCP performance test done massively parallel. Over 1000 simultaneous connections. Our cluster servers regularly serve around 3000 simultaneous connections during term, and I really want to know how well OES-Linux handles that.
  • Scaled AFP test. This requires having multiple Mac machines, which I don't have access to. We have a small but vocal Mac community (all educational institutions do, I believe), and they'll notice if performance drops as a result of a theoretical NetWare to Linux kernel change.
  • Any AFP test at all. No Mac means no testy testy.
  • NCP performance to an NSS volume from a SLED10 station. I don't have a reformattable test workstation worth beans that can drive a test like this one, and I don't trust a VM to give consistent results.
The large directory test is one that my co-workers pointed to after my last test. The trick there will be finding a tool that'll do what I need to do. IOZONE comes with one that comes kinda close, but isn't right. I need to generate X sub-directories, and time how long it takes to enumerate those X sub-directories. Does it scale linearly, or is there a threshold where the delay goes up markedly?

This may require me to write custom code, which I'm loth to do but will do if I have to. Especially since different API calls can yield different results on the same platform, and I'm not programmer enough to be able to be certain which API call I'm hooking is the one we want to test. This is why I'd like to find a pre-built tool.

If you have something that you'd like tested, post in the comments. It may actually happen if you include a pointer to a tool that'll measure it. Who knows?
