Tuesday, March 25, 2008

IPv6 vs IPX

In a session last week came the following comment from a presenter (paraphrased):
How may of you in the room have been at this long enough to do IPX? Ok, great. Now how many of you have done anything with IPv6? Doesn't that look JUST like IPX?
And he's right, to a point. IPX addresses are of the form network-number:node-number, such as:

00008021:0002a540d0e1

Where 'node number' is the MAC address of the network card in question. It's up to the routers to figure out where network-numbers live, and advertised services issue full-network broadcasts to advertise said service, which is the primary reason that IPX just doesn't scale if WAN links are in the mix. But that's by the by.

IPv6 addresses work similarly:

2001:0db8:85a3:08d3:1319:8a2e:0370:7334

The last 48 bits are the MAC address and the bits ahead of it constitute the network number. Except... the IPv6 designers knew about the failings of IPX and worked around them. The last 48 bits don't have to be the MAC address, though as I understand it that address has to exist for each physical interface. Unlike IPX, IPv6 has the ability to have 'secondary' addresses. The lack of this ability was the main reason that Novell Cluster Services only worked on IP networks, which caused its own wave of grief when clustering was introduced in the NetWare 5.1 era. Secondary IPv6 numbers don't have to follow the MAC format, which in my opinion is a good thing!

Yes, when I first read about IPv6 addressing I had that same, "wow, this is just like IPX," moment the BrainShare presenter had. Only, more scalable, and more flexible.

Labels: , , , ,


Thursday, March 20, 2008

BrainShare Thursday

Not a good day. My first course, "Advanced BASH," could more accurately be described as, "BASH scripting tips & tricks". I then proceeded to skip the other three sessions I had signed up for.
  • Novell Open Enterprise Server 2 Interoperability with Windows and AD. All about Domain Services for Windows and Samba. Neither of which we'll ever use. No idea why I wanted to be in this session.
  • Rapid Deployment of ZENworks Configuration Management. Other people around here have suggested that if we haven't moved yet, wait until at least SP3 before moving. If then. So, demotivated. Plus I was rather tired.
  • Configuring Samba on OES2. CIFS will do what we need, I don't need Samba. Don't need this one. Skipped.
DL236: Advanced BASH Course
BASH tips and tricks. I got a lot out of it, but the developers around me were quietly derisive.

ZEN Overview and Features
Not so much with the futures, but it did explain Novell's overall ZEN strategy. It isn't a coincidence that most of Novell's recent purchases have been for ZEN products.

TUT303: OES2 Clusters, from beginning to extremes
This was great. They had a full demo rig, and they showed quite a bit in it. Including using Novell Cluster Services to migrate Xen VM's around. They STRONGLY recommended using AutoYast to set up your cluster nodes to ensure they are simply identical except for the bits you explicitly want different (hostname, IP). And also something else I've heard before, you want one LUN for each NSS Pool. Really. Plus, the presenters were rather funny. A nice cap for the day.

And tonight, Meet the Experts!

Labels: , , , , , , ,


Monday, March 17, 2008

Today at Brainshare

Monday. Opening day. I had trouble getting to sleep last night due to a poor choice of bed-time reading (don't read action, don't read action, don't read action). And had to get up at 6am body time in order to get breakfast before the morning keynote. There be zombies.

Breakfast was uninspired. As per usual, the hashbrowns had cooled to a gellid mass before I found everything and got a seat.

The Monday keynotes are always the CxO talks about strategy and where we're going. Today a mess of press releases from Novell give a good idea what the talks were about. Hovsepian was first, of course, and was actually funny. He gave some interesting tid-bits of knowledge.
  • Novell's group of partners is growing, adding a couple hundred new ones since last year. This shows the Novell 'ecosystem' is strong.
  • 8700 new customers last year
  • Novell press mentions are now only 5% negative.
Jeff Jaffe came on to give the big wow-wow speech about Novell's "Fossa" project, which I'm too lazy to link to right now. The big concern is agility. He also identified several "megatrends" in the industry:
  • High Capacity Computing
  • Policy Engines
  • Orchestration
  • Convergence
  • Mobility
I'm not sure what 'Convergence' is, but the others I can take a stab at. Note the lack of 'virtualization' in this list. That's soooo 2007. The big problem is now managing the virtualization, thus Orchestration. And Policy Engines.

Another thing he mentioned several times in association with Fossa and agility, is mergers and acquisitions. This is not something us Higher Ed types ever have to deal with, but it is an area in .COM land that requires a certain amount of IT agility to accommodate successfully. He mentioned this several times, which suggests that this strategy is aimed squarely at for-profit industry.

Also, SAP has apparently selected SLES as their primary platform for the SMB products.

Pat Hume from SAP also spoke. But as we're on Banner, and it'll take a sub-megaton nuclear strike to get us off of it, I didn't pay attention and used the time to send some emails.

Oh, and Honeywell? They're here because they have hardware that works with IDM. That way the same ID you use for your desktop login can be tied to the RFID card in your pocket that gets you into the datacenter. Spiffy.

ATT375 Advanced Tips & Tricks for Troubleshooting eDir 8.8
A nice session. Hard to summarize. That said, they needed more time as the Laptops with VMWare weren't fast enough for us to get through many of the exercises. They also showed us some nifty iMonitor tricks. And where the high-yield shoot-your-foot-off weapons are kept.

BUS202 Migrating a NetWare Cluster to OES2
Not a good session. The presenter had a short slide deck, and didn't really present anything new to me other than areas where other people have made major mistakes. And to PLAN on having one of the linux migrations go all lost-data on you. He recommended SAN snapshots. It shortly digressed into "Migrating a NetWare Cluster to Linux HA", which is a different session all together. So I left.

TUT215 Integrating Macintosh with Novell
A very good session. The CIO of Novell Canada was presenting it, and he is a skilled speaker. Apparently Novell has written a new AFP stack from scratch for OES2 Sp1, since NETATALK is comparatively dog slow. And, it seems, the AFP stack is currently out performing the NCP stack on OES2 SP1. Whoa! Also, the Banzai GroupWise client for Mac is apparently gorgeous. He also spent quite a long time (18 minutes) on the Kanaka client from Condrey Consulting. The guy who wrote that client was in the back of the room and answered some questions.

Labels: , , , , , ,


Thursday, February 14, 2008

OES2-SP1 soon to be inclosed beta

Novell just announced that OES2 SP1 is going into closed beta.

"What is in this release of Open Enterprise Server

Novell Open Enterprise Server 2 Support Pack 1 refreshes the SUSE Linux Enterprise Server 10 distribution with SLES10 SP2, fixes defects found since the release of OES2 and also adds in the following functionality:

  • Novell engineered CIFS and AFP protocols
  • New version of iFolder (3.7)
  • Updated iPrint with an accounting API
  • 64-bit version of eDirectory
  • Enhanced migration tools and migration GUI
  • Improved performance of the XEN hypervisor
  • Domain Services for Windows
  • NetWare 6.5 Support Pack 8

Note that although Domain Services for Windows is part of OES2 SP1, a separate beta program will be run in order to collate DSfW feedback."

Novell engineered CIFS? I soooo want to know what that is. Is is a completely new CIFS stack, or is it Samba with Novell extensions whacked on? I want to know! The other important bit of information:

The beta test program is currently scheduled to begin mid March and run through October.
Which means there won't be product for my 2008 upgrade window. Fie. Well, at least we'll have ample time to prototype and test for the 2009 upgrade window.


Labels: , , ,


Wednesday, January 16, 2008

NetWare library patches

Novell recently split the libc and clib patches for NetWare. For a long time patches like "nwlib6a" included both. Now, they're split.

This just caused me a problem. It turns out that if you have libcsp6b (the LibC patch) applied and not nwlib6k (the CLib patch), there is an abend possibility. It happened yesterday. It turns out that in that case, a badly formed network broadcast can cause an abend. This caused three of my six cluster nodes to fall on their butts at the same time. That was fun. Strange (but good) thing is, I had already applied both patches to these three servers but hadn't gotten around to rebooting them yet. So, by killing themselves they actually fixed the problem.

The abend, key details:

EIP in SERVER.NLM at code start +0015FD27h

Heh heh heh. Oops.

And now a bit of history. Long time NetWare admins can ignore this part.

Q: Why are there two C libraries?

CLIB is the library NetWare started with. It began life in the dark and misty past, probably in the late 1980's. It is the deepest, darkest bowels of NetWare from the era when Novell was it when it came to office networking. Being so old, its APIs are very mature. Applications developed against CLIB generally speaking just plain work.

CLIB is also depreciated since it is highly proprietary, and doesn't play well with others. "Just plain works" in this instance means an assumption of 8.3 names, with kludging to support long file names if at all possible. CLIB applications have a tendency to have IPX dependencies for no good reason.

LIBC was created, IIRC, around the release of NetWare 5.0 when it became possible for NetWare to operate in a "pure IP" environment. LIBC was designed with the concept of POSIX semantics in mind, which CLIB was not. LIBC was created from scratch with long file name support. By now, as of NetWare 6.5 SP7, most of the NetWare kernel is written against LIBC rather than CLIB.

As an example of LIBC vs CLIB, take the 'MyWeb' service this blog is served by. When I did this the first time, it was on NetWare 6.0, using Apache 1.3. Apache 1.3 was linked against CLIB and was very stable. The service notes for the Apache Modules I needed to run to make it work made it clear that supporting long file-names on remote servers was something that only recently started working.

When the migration to NetWare 6.5 came around, it meant I had to migrate MyWeb to Apache 2.0. Apache 2.0 is linked against LIBC and used a different apache module to make things work. I had troubles. The LibC functions were not nearly as mature as their CLIB counterparts, and it showed. 3.5 years later things are now a lot more stable then back then.

Labels: , , ,


Friday, November 30, 2007

OES2 SP1 timing

Novell just posted the third draft of their OES2 Best Practices guide. Which you can locate here. In that guide is this text:
Domain Services for Windows, which is scheduled to ship with OES 2 SP1 (currently scheduled for late 2008), will also offer some clear advantages.
"Late 2008" means they WILL NOT have SP1 out by August of 2008. This means that the upgrade of our 6 node cluster to OES will have to wait until 2009. Grrarrr!

Another 21 months of a 32-bit operating system on the single biggest storage consumer on campus. We'll have at least one hardware refresh before then for some of the nodes, and... boy I hope they have NetWare drivers for that. The very limited testing I did with NetWare-in-Xen was not encouraging from a performance stand-point. If it looks like I'll have to deploy that way for the next servers we get in the cluster, I'll have to do more real testing to characterize the performance hit (if any). The idea of a 64-bit memory space for file-caching makes me drool. Not getting it for 21 months is painful.

That said, if Novell releases the eDirectory enabled AFP server for OES2-Linux outside of the service-pack I could still make the 2008 window. That's our only dependency for SP1

Labels: , , , , ,


Wednesday, November 28, 2007

I/O starvation on NetWare, HP update

Last week I talked about a problem we're having with the HP MSA1500cs and our NetWare cluster. The problem is still there, of course. I've opened cases with both HP and Novell to handle this one. HP because I really thing that such command latencies are a defect, and Novell since they're having starvation issues with clusters.

This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.
Or...
Working as designed. Fix your software. Talk to Novell.
Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.

Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.

Labels: , , , , , , ,


Wednesday, November 21, 2007

I/O starvation on NetWare

The MSA1500cs we've had for a while has shown a bad habit. It is visible when you connect a serial cable to the management port on the MSA1000 controller, and doing a "show perf" after starting performance tracking. The line in question is "Avg Command Latency:", which is a measure of how long it takes to execute an I/O operation. Under normal circumstances this metric stays between 5-30ms. When things go bad, I've seen it as far as 270ms.

This is a problem with our cluster nodes. Our cluster nodes can seen LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.

IF the MSA is in the high Avg Command Latency state, and
IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
THEN "Concurrent Disk Requests" in Monitor go north of 1000

This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.

Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.

This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.

This is something that a very new feature in the Linux kernel could help out, that that's to introduce the concept of 'priority I/O' to the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.

I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.

Labels: , , , , , , ,


Tuesday, September 18, 2007

OES2: clustering

I made a cluster inside Xen! Two NetWare VM's inside a Xen container. I had to use a SAN LUN as the shared device since I couldn't make it work doing it just to a single file. Not sure what's up with that. But, it's a cluster, the volume moves between the two just fine.

Another thing about speeds, now that I have some data to play with. I copied a bunch of user directory data over to the shared LUN. It's a piddly 10GB LUN so it filled quick. That's OK, it should give me some ideas of transfer times. Doing a TSATEST backup from one cluster-node to the other (i.e. inside the Xen bridge) gave me speeds on the order of 1000MB/Min. Doing a TSATEST backup from a server in our production tree to the cluster node (i.e. over the LAN) gave me speeds of about 350MB/Min. Not so good.

For comparison, doing a TSATEST backup from the same host only drawing data from one of the USER volumes on the EVA (highly fragmented, but must faster, storage) gives a rate of 550 MB/Min.

I also discovered the VAST DIFFERENCES between our production eDirectory tree, which has been in existence since 1995 if the creation timestamp on the tree object is to be believed, and the brand new eDir 8.8 tree the OES2 cluster is living in. We have a heckova lot more attributes and classes in the prod tree than in this new one. Whoa. It made for some interesting challenges when importing users into it.

Labels: , , , ,


OES2-beta progress

As mentioned before, I have the OES2 beta. Right now I have two NetWare servers parked in Xen VM's on SLES10SP1. This is how it is supposed to work!

I haven't gotten very far in my testing, but a few things are showing. I managed to do a TSATEST-based throughput run of a backup of SYS. That's about a gig of data. Throughputs for just one stream to one of the servers was around 500 MB/min, which is passible and within the realm of real performance for slower hardware. The downside of that is that the CPU reported by "xm top" was around 45%, where the CPU reported in MONITOR was closer to 25%. That's way higher than I expected, but could be related to all the disk I/O ops. This I/O was to a file in the file-system, not a physical device like a LUN on the SAN (that comes later).

Now I'm trying to get Novell Cluster Services installed. I want to get a weensy 2-node cluster set up to prove that it can be done. I suspect it can, but actually seeing it will be very nice.

Labels: , , ,


Monday, July 09, 2007

More fun OES2 tricks

I had an idea while I was googling around a bit ago. This may not work the way I expect as I'm not 100% on the technologies involved. But it sounds feasible.

Lets say you want to create a cluster mirror of a 2-node cluster for disaster recovery purposes. This will need at least four servers to set up. You have shared storage for both cluster pairs. So far so good.

Create the four servers as OES2-Linux servers. Set up the shared storage as needed so everything can see what it should in each site. Use DRBD to create new block-devices that'll be mirrored between the cluster pairs. Then set up NetWare-in-VM on each server, using the DRBD block-devices as the Cluster disk devices. You could even do SYS: on the DRBD block-devices if you want a true cluster-clone. That way when disk I/O happens on the clustered resources it gets replicated asynchronously to the DR site; unlike software RAID1 the I/O is considered committed when it hits local storage, SW RAID1 only considers writes committed when all mirrored LUNs report the commit.

Then, if the primary site ever dies, you can bring up an exact replica of the primary cluster, only on the secondary cluster pair. Key details like how to get the same network in both locations I leave as an exercise for the Cisco engineers. But still, an interesting idea.

Labels: , , , ,


Thursday, March 22, 2007

TUT 202: NetWare cluster migrations to LInux clusters

There is a book on this: "Novell Cluster Services for NetWare and Linux". This session was about OES, not OES2. Again, my notes:
  • On linux, cluster nodes are added through YaST
  • 120 bytes of meta-data per file on NSS
  • iPrint volumes could go on non-NSS volumes
  • ext3 on OES2 is indexed, not indexed on OES1. Problem for larger directories.
  • Novell Server Consolidation and Migration Tool can migrate Netware to Linux
  • While running in mixed mode, can not extend or create NSS pools. Reboots all around to make this take.
  • In mixed mode, trustee modifications do NOT transfer to the other OS. Migrate your NetWare volumes to OES-Linux, and leave them there!
    • In OES-Linux, trustees are kept in a file, not in the file-system.
  • In mixed mode, cluster load/unload scripts are kept in /etc/opt/novell/ncs/
    • When out of mixed mode, scripts are promoted into edir
  • Cluster licenses are not checked in OES-linux, but still 'count' come audit time. So have them.
  • The 'cluster convert' command ends mixed-mode operation
  • Clustering inside VMWare ESX server: only 2-node Microsoft clusters are supported. All others are not.

Labels: , , , ,


Thursday, September 21, 2006

The pro/con of clustering

A question was posed:
On the topic of clusters, do you find the benefits of a cluster/SAN setup out weighed by the increased complication in node upgrades/patching and the "all your eggs in one basket" when it comes to storage on the SAN.
One of the biggest things to get used to with clustering is that your uptimes for your cluster nodes will go down dramatically from what you're used to with your existing mainline servers, but your service uptimes will go up. Once we put in the cluster we haven't had an unplanned multi-hour outage that wasn't attributable to network issues. The key here is 'unplanned'. We've had several planned outages for both service-packing and actual hardware upgrades to the SAN array itself.

Prior to the cluster WWU put in three 'facstaff' servers and three 'student' servers to handle user directories and shared directories. This way when one server died, only a third of that class of user was out of luck. The cluster still follows this design for the user directories, but that's more for load-balancing between the cluster nodes than disaster resiliance. Since the cluster went in we've merged all of our facstaff shared volumes into a single volume. This was done because we were getting more and more cases of departments needing access to both Share1 and Share3, and we didn't have drive letters for that.

Patching and service-packing the cluster is easier than it would be with stand-alone servers. I can script things so that three of our six cluster nodes vacate themselves from the cluster in the middle of the night, so I can apply service-packs to them in the middle of the day. Repeat the same trick the next day. I can have a service-pack rolled out to the cluster in 48 hours with no after hours work on my part. THAT is a savings (unless you're counting on the overtime pay, which I don't get anyway).

The downside is the 'eggs in one basket' problem. If this building sinks into the Earth right now, WWU is screwed. Recovering from tape, after we get replacement hardware of course, would take close to a week. Don't think we haven't noticed this problem.

To be fair, though, we'd have this problem even if we were still on separate servers. True disaster recovery requires multi-location of data and services, which stand-alone servers also suffer from. Under the old architecture and presuming those servers were split between campus and our building, the 'building sinking into the ground' scenario would cause a significant portion of campus to stop working and a significant portion of students to lose everything for the days it'd take us to recover from tape. During that time WWU's teaching function would probably halt as, 'the Earth ate my homework,' would be a very valid excuse.

In our case losing a third or two thirds of all user-directory and shared-directory data would halt the business of the university. While the outage wouldn't be quite as severe as it would be if our SAN melted, it would be just as disruptive. Because of that, going for an 'all or nothing' solution that increases perceived uptime was very much in order.

We're in the process of trying to replicate our SAN data to a backup datacenter on campus. We can't afford Novell's Business Continuity Cluster which would provide automation to make this exact thing work. So we're having to make do on our own. We don't yet have a firm plan on how to make it work, and the 'fail back' plan is just a shaky; we only got the hardware for the backup SAN a month ago. It will happen, we just don't know what the final solution will look like.

As for iSCSI versus FibreChannel, my personal bias is for FC. However, I fully realize that gigabit ethernet is w-a-y cheaper than any FC solution out there today. I prefer FC because the bandwidth is higher and due to how it is designed I/O contention on the wire has less impact to overall performance. Just remember that iSCSI really really likes jumbo frames (MTU >1500 bytes), and not all router techs are OK with twiddling that; you may end up with a parallel and separate ethernet setup between your servers and the iSCSI storage.

As for iSCSI throughput, I haven't done tests on that. However I just got done looking at a whole bunch of throughput tests in and out of our FC SAN. During the IOZONE tests on NetWare, I recorded a high-water mark of 101 MB/s out of the EVA. This is 80% GigE speed, and therefore theoretically this transfer rate was doable over iSCSI. The true high-water mark was achieved by running IOZONE on the Linux server locally on the server, and on a cluster node running TSATEST on a locally mounted volume. At that time I saw a maximum transfer rate of 146 MB/s, which is 117% of GigE speed, so iSCSI wouldn't have been able to handle that. On the other hand, during day to day operations and during backups the transfer rate has never exceeded the 125 MB/s GigE mark. It's come close, but not exceeded it.

Tags:

Labels:


This page is powered by Blogger. Isn't yours?