Monday, May 12, 2008
DataProtector 6 has a problem, continued

See? This is an in-progress count of one of these directories. 1.1 million files, 152MB of space consumed. That comes to an average file-size of 133 bytes. This is significantly under the 4kb block-size for this particular NTFS volume. On another server with a longer serving enhincrdb hive, the average file-size is 831 bytes. So it probably increases as the server gets older.
On the up side, these millions of weensy files won't actually consume more space for quite some time as they expand into the blocks the files are already assigned to. This means that fragmentation on this volume isn't going to be a problem for a while.
On the down side, it's going to park (in this case) 152MB of data on 4.56GB of disk space. It'll get better over time, but in the next 12 months or so it's still going to be horrendous.
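The slack-space arithmetic above can be sketched roughly as follows. This is a simplified model that assumes every file occupies at least one whole block; it ignores the fact that NTFS can keep very small files inside the MFT record itself, so treat the results as an upper bound. The function names are illustrative only.

```python
import math


def allocated_bytes(file_size, block_size):
    """Bytes one file occupies on disk when rounded up to whole blocks.

    Assumption: even a tiny file takes at least one full block.
    """
    blocks = max(1, math.ceil(file_size / block_size))
    return blocks * block_size


def hive_footprint(file_count, avg_file_size, block_size):
    """Estimated on-disk size of a hive of similarly sized files."""
    return file_count * allocated_bytes(avg_file_size, block_size)


if __name__ == "__main__":
    files = 1_100_000   # file count from the in-progress tally
    avg_size = 133      # average file size in bytes
    for block in (4096, 1024):
        total = hive_footprint(files, avg_size, block)
        print(f"{block}-byte blocks: {total / 1e9:.2f} GB on disk")
```

With these inputs the 4k case lands in the same neighborhood as the figure above, and the 1k case is a quarter of that.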
This tells me two things:
- When deciding where to host the enhincrdb hive on a Windows server, format that particular volume with a 1k block size.
- If HP supported NetWare as an Enhanced Incremental Backup client, the 4kb block size of NSS would cause this hive to grow beyond all reasonable proportions.
Since it is highly likely that I'll be using DataProtector for OES2 systems, this is something I need to keep in mind.
Wednesday, May 07, 2008
DataProtector 6 has a problem
One of the niiiice things about DP is what's called "Enhanced Incremental Backup". This is a de-duplication strategy that only backs up files that have changed, and only stores the changed blocks. From these incremental backups you can construct synthetic full backups, which are just pointer databases to the blocks for that specified point-in-time. In theory, you only need to do one full backup, keep that backup forever, do enhanced incrementals, then periodically construct synthetic full backups.
We've been using it for our BlackBoard content store. That's around... 250GB of file store. Rather than keep 5 full 275GB backup files for the duration of the backup rotation, I keep 2 and construct synthetic fulls for the other 3. In theory I could just go with 1, but I'm paranoid :). This greatly reduces the amount of disk-space the backups consume.
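To make the "pointer database" idea concrete, here is a toy model of building a synthetic full. It is not DataProtector's actual format or algorithm; the block keys, the media location strings, and the `changed`/`deleted` fields are all invented for illustration.

```python
def synthetic_full(full, incrementals):
    """Merge a full backup's block map with later incrementals.

    `full` maps (path, block_number) -> media location.
    Each incremental is a dict with a "changed" map of the same shape
    and an optional "deleted" collection of keys. Incrementals must be
    supplied oldest first, so newer pointers overwrite older ones.
    """
    view = dict(full)
    for inc in incrementals:
        view.update(inc["changed"])
        for key in inc.get("deleted", ()):
            view.pop(key, None)
    return view


if __name__ == "__main__":
    full = {("a.doc", 0): "full:0", ("a.doc", 1): "full:1", ("b.doc", 0): "full:2"}
    incs = [
        {"changed": {("a.doc", 1): "inc1:0"}},
        {"changed": {("c.doc", 0): "inc2:0"}, "deleted": [("b.doc", 0)]},
    ]
    for key, where in sorted(synthetic_full(full, incs).items()):
        print(key, "->", where)
```

The result holds no data of its own, only pointers into the one real full and the incrementals, which is why it costs so little disk space.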
Unfortunately, there is a problem with how DP does this. The problem rests on the client side of it. In the "$InstallDir$\OmniBack\enhincrdb" directory it constructs a file hive. An extensive file hive. In this hive it keeps track of file state data for all the files backed up on that server. This hive is constructed as follows:
- The first level is the mount point. Example: enhincrdb\F\
- The 2nd level are directories named 00-FF which contain the file state data itself
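A small script along these lines could tally how big each mount point's hive has become. It only assumes the two-level layout described above (mount point, then 00-FF directories holding the state files); the function name and the example path are hypothetical.

```python
import os


def summarize_hive(hive_root):
    """Return {mount_point: (file_count, total_bytes)} for an enhincrdb-style tree.

    Assumes each first-level directory under hive_root is a mount point
    and that everything beneath it is file state data.
    """
    summary = {}
    for entry in sorted(os.scandir(hive_root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        count = 0
        size = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                count += 1
                size += os.path.getsize(os.path.join(dirpath, name))
        summary[entry.name] = (count, size)
    return summary


if __name__ == "__main__":
    import sys
    # Pass the hive directory as the first argument (location is site-specific).
    for mount, (count, size) in summarize_hive(sys.argv[1]).items():
        average = size / count if count else 0
        print(f"{mount}: {count} files, {size} bytes, {average:.0f} bytes average")
```

On a hive with millions of entries the walk itself will take a long while, which is rather the point of this post.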
The last real full backup I took of the content store backed up just under 1.7 million objects (objects = directory entries in NetWare, or IIRC inodes in unix-land). Yet the enhincrdb hive had 2.7 million objects. Why the difference? I'm not sure, but I suspect it was keeping state data for 1 million objects that no longer were present in the backup. I have trouble believing that we managed to churn over 60% of the objects in the store in the time I have backups, so I further suspect that it isn't cleaning out state data from files that no longer have a presence in the backup system.
DataProtector doesn't support Enhanced Incrementals for NetWare servers, only Windows and possibly Linux. Due to how this is designed, were it to support NetWare it would create absolutely massive directory structures on my SYS: volumes. The FACSHARE volume has about 1.3TB of data in it, in about 3.3 million directory entries. The average FacStaff User volume (we have 3) has about 1.3 million, and the average Student User volume has about 2.4 million. Due to how our data works, our Student user volumes have a high churn rate due to students coming and going. If FACSHARE were to share a cluster node with one Student user volume and one FacStaff user volume, they would have a combined directory-entry count of 7.0 million. This would generate, at first, a \enhincrdb directory with 7.0 million files. Given our regular churn rate, within a year it could easily be over 9.0 million.
When you move a volume to another cluster node, it will create a hive for that volume in the \enhincrdb directory tree. We're seeing this on the BlackBoard Content cluster. So given some volumes moving around, it is quite conceivable that each cluster node will have each cluster volume represented in its own \enhincrdb directory. That will mean over 15 million directory-entries parked there on each SYS volume, steadily increasing as time goes on and taking who knows how much space.
And as anyone who has EVER had to do a consistency check of a volume that size knows (be it vrepair, chkdsk, fsck, or nss /poolrebuild), it takes a whopper of a long time when you get a lot of objects on a file-system. The old Traditional File System on NetWare could only support 16 million directory entries, and DP would push me right up to that limit. Thank heavens NSS can support w-a-y more than that. You better hope that the file-system that the \enhincrdb hive is on never has any problems.
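The directory-entry totals work out as sketched below. Two assumptions are baked in: the FACSHARE figure is read as 3.3 million (which is what the 7.0 million total implies), and the count of three Student user volumes comes from the disk-space post further down rather than from this one. Churn is not modelled, so these are floor values.

```python
# Approximate directory-entry counts per volume, as quoted in the post.
FACSHARE = 3_300_000
FACSTAFF_AVG = 1_300_000
STUDENT_AVG = 2_400_000


def hive_entries(facshare=0, facstaff=0, student=0):
    """Directory entries a node's hive would track for the given volume counts."""
    return facshare * FACSHARE + facstaff * FACSTAFF_AVG + student * STUDENT_AVG


# One FACSHARE, one FacStaff and one Student volume on the same node.
one_node = hive_entries(facshare=1, facstaff=1, student=1)

# Every volume has visited the node at least once:
# FACSHARE plus 3 FacStaff and (assumed) 3 Student volumes.
every_volume = hive_entries(facshare=1, facstaff=3, student=3)

if __name__ == "__main__":
    print(f"one of each: {one_node:,}")
    print(f"all volumes: {every_volume:,}")
```

Add a year of stale state data on top of the all-volumes figure and the "over 15 million" estimate is easy to reach.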
But, Enhanced Incrementals only apply to Windows so I don't have to worry about that. However.... if they really do support Linux (and I think they do), then when I migrate the cluster to OES2 next year this could become a very real problem for me.
DataProtector's "Enhanced Incremental Backup" feature is not designed for the size of file-store we deal with. For backing up the C: drive of application servers or the inetpub directory of IIS servers, it would be just fine. But for file-servers? Good gravy, no! Unfortunately, those are the servers in most need of de-dup technology.
Monday, May 05, 2008
Linux @ Home
1: Wireless driver problems
I have an intel 3945 WLAN card. It works just fine in linux, well supported. What throws it for a loop, however, are sleep and hibernate states. It can go one, two, four, maybe five cycles through sleep before it will require a reboot in order to find the home wireless again. If it doesn't lock the laptop up hard. Since my usage patterns are heavily dependent upon Sleep mode, this is a major, major disincentive to keep the Linux side booted.
I understand the 2.6.25 kernel is a lot better about this particular driver. Thus, I await with eager anticipation the release of openSUSE 11.0. This driver is currently the ipw3945 driver, and will eventually turn into the iwl3945 driver once it comes down the pipe. What little I've read about it suggests that the iwl driver is more stable through power states.
2: NetWare remote console
I use rconip for remote console to NetWare. Back when Novell first created the IP-based rconsole, they also released rconj alongside ConsoleOne to provide it. As this was written in Java, it was mind bogglingly slow. This little .exe file was vastly faster, and I've come to use it extensively. Unless I get Wine working, this tool will have to stay on my Windows XP partition. It works great, and I haven't found a good linux-based replacement yet.
Time has moved on. Hardware has gotten faster, and the 'java penalty' has reduced markedly. RconJ is actually usable, but I still don't use it. Plus, it would require me to install ConsoleOne onto my laptop. It's 32-bit, so that's actually possible, but I really don't want to do that.
The Remote Console through the Novell Remote Monitor (that service out on :8009) has a nice remote-console utility, but it also requires Java. I'm still biased against java, and java-on-linux still seems fairly unstable to me. I don't trust it yet. It also doesn't scale well. When I'm service-packing, it is a LOT nicer looking to have 6 rconip windows up than 6 browser-based NRM java-consoles open. Plus, rconip will allow me access to the server console if DS is locked, something that NRM can't do and is invaluable in an emergency.
Once the wireless driver problems are fixed, I'll boot the linux side much more often. Remote-X over SSH actually makes some of my remote management a touch easier than it is in WinXP. And if I really really need to use Windows, my work XP VM is accessible over RDesktop. There are a few other non-work reasons why I don't boot Linux very often, but I'll not go into those here.
So, oddly, NetWare is partly responsible for keeping me in Windows at home. But only partly.
Labels: linux, netware, novell, opinion, virtualization
Thursday, April 17, 2008
NetWare and Novell, changing a company
We are all familiar with NetWare, the dominant Network Operating system of the 1980s and 1990s. We are all familiar with Microsoft's tactics of penetrating the NOS market with Windows NT by focusing on using Windows as an application platform.
Apparently Richard worked for Novell around 2001. I find that interesting since my first BrainShare was 2001, and that was when they announced the release of NetWare 6.0. While there he saw what seemed to be an outright denial that NetWare had been passed up by Windows and something new needed to be done.
In 2001 I knew that Windows had for all intents and purposes won. The only place you ever really saw NetWare servers were as file-servers, or running GroupWise or the small handful of apps that used NetWare as an application server. The stalwart loyalists among us saw this as annoying, but not a major problem.
It was also good for Novell's bottom line. NetWare still accounted for a large percentage of their revenues. Even though the writing was on the wall, they were still making real money on it so didn't see a need to change. This is why NetWare 6.0 introduced the AMP stack to NetWare, as a way to better make NetWare an application server and to slow the loss of customers. At BrainShare 2001 there was open speculation about "NetWare 7.0" and what it would look like.
And there still was until 2005 when Novell announced what the next version of NetWare would be. This being after the SUSE and Ximian purchases, it would be based on Linux. This move had been rumored, and alternately derided and lauded, for some time. There was a great wailing and gnashing of teeth on the part of the stalwart NetWare loyalists. It also started an exodus of customers, as Novell's financial reports at the time point out.
Fortunately for the company, they started actively promoting (for certain values of 'active' that are higher than they were previously, but still in the theme of Novell Stealth Marketing) and developing their other products, like GroupWise, Novell Identity Management, ZenWorks, and most especially their Linux business. It took them until last quarter to turn in a quarter in the black, and NetWare revenues are under 20% of total now. So, they've turned the corner and are no longer dependent on the NetWare cash cow. They have a couple of them in the field now, which is a MUCH healthier place to be.
It's a funny thing, but one of the reasons why NetWare is such a kick-butt file-server compared to everything else is why it's a challenging environment to develop in. Had Novell seen the light earlier and bought SUSE (or rolled their own Linux distro) in... 1999 instead, right after the NW5.1 release, they still would have run into the fundamental architectural problems in 32-bit linux that make it an inferior file-serving platform for large environments. By 2008 their server could have been a LOT more mature, and perfectly poised to take advantage of 64-bit Linux.
Novell in the 1990's is not an example of a 'nimble' company. It is trying to get there now through diversification. Not many companies (especially tech companies) have survived the loss of their prime money earner; Apple has done it through OSX, which required a fanatically loyal fan base to survive the dark years. This is the prime reason people kept predicting the imminent demise or buyout of Novell. Now that they're earning profits again, and have diversified away from just the OS sector, they're not going to be going out of business any time soon.
Now if only they had better SMB packages and programs. I hear repeatedly from peers who support SMBs that Novell's packages and programs in that space are lacking or exploitative. Significant revenue, and more importantly mindshare, are in the SMB market. Plus, today's SMB is tomorrow's large or global enterprise.
Tuesday, March 25, 2008
IPv6 vs IPX
How many of you in the room have been at this long enough to do IPX? Ok, great. Now how many of you have done anything with IPv6? Doesn't that look JUST like IPX?
And he's right, to a point. IPX addresses are of the form network-number:node-number, such as:
00008021:0002a540d0e1
Where 'node number' is the MAC address of the network card in question. It's up to the routers to figure out where network-numbers live, and advertised services issue full-network broadcasts to advertise said service, which is the primary reason that IPX just doesn't scale if WAN links are in the mix. But that's by the by.
IPv6 addresses work similarly:
2001:0db8:85a3:08d3:1319:8a2e:0370:7334
In the classic autoconfigured form, the last 64 bits are derived from the 48-bit MAC address and the bits ahead of them constitute the network number. Except... the IPv6 designers knew about the failings of IPX and worked around them. Those last bits don't have to come from the MAC address, though as I understand it that address has to exist for each physical interface. Unlike IPX, IPv6 has the ability to have 'secondary' addresses. The lack of this ability was the main reason that Novell Cluster Services only worked on IP networks, which caused its own wave of grief when clustering was introduced in the NetWare 5.1 era. Secondary IPv6 numbers don't have to follow the MAC format, which in my opinion is a good thing!
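The resemblance is easy to show in a few lines. The sketch below splits the IPX example into network and node, then builds an IPv6 address from a /64 prefix and the same MAC using the modified EUI-64 scheme (insert ff:fe in the middle of the MAC and flip the universal/local bit). Pairing the IPX example's node number with the IPv6 example's prefix is purely for illustration, and the function names are mine.

```python
import ipaddress


def parse_ipx(address):
    """Split 'network:node' into (network_number, node_bytes)."""
    network, node = address.split(":")
    return int(network, 16), bytes.fromhex(node)


def eui64_interface_id(mac):
    """Build a 64-bit interface identifier from a 48-bit MAC (modified EUI-64)."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    first = mac[0] ^ 0x02  # flip the universal/local bit
    return bytes([first]) + mac[1:3] + b"\xff\xfe" + mac[3:]


def autoconfigured_address(prefix, mac):
    """Combine a /64 prefix with the MAC-derived interface identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    interface_id = int.from_bytes(eui64_interface_id(mac), "big")
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


if __name__ == "__main__":
    network, node = parse_ipx("00008021:0002a540d0e1")
    print(f"IPX network {network:08x}, node {node.hex()}")
    print(autoconfigured_address("2001:db8:85a3:8d3::/64", node))
```

In both cases the routers only care about the leading network part, and the host part can be worked out from the hardware without any central assignment.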
Yes, when I first read about IPv6 addressing I had that same, "wow, this is just like IPX," moment the BrainShare presenter had. Only, more scalable, and more flexible.
Labels: brainshare, clustering, netware, novell, sysadmin
Thursday, March 20, 2008
BrainShare Wednesday
That said, the new GroupWise WebAccess is gorgeous. I wish Exchange had their non-ActiveX pages look that good.
TUT175: RBAC: Avoiding the horror, getting past the hype
Mostly about IDM as it turned out. Only minimally interesting from an abstract viewpoint about roles in general.
TUT 277: Advanced eDirectory Configuration, new features, and tuning for performance
I learned a few things I didn't know, such as the fact that each object has an "AncestorList" attribute listing who their parent objects are. This apparently greatly speeds up searching. SP3, coming out this Summer, will have faster LDAP binds for a couple of reasons. Right now Novell is recommending 2 million objects as a reasonable maximum size for a partition for performance reasons.
And also they reiterated something I've heard before...
You know how back in the NetWare 4 days, we said to design your tree by geography at the first level, and then get to departments? Um, sorry about that. It was great back then, but for LDAP or IDM it really, really slows things down.
Yep. I took my first class for my CNA when 'Green River' was just coming out, or was just out. So I remember that.
TUT221: iPrint on Linux, what Novell Support wants you to know
A nice session from a mainline support guy about the ways people don't do iPrint on linux correctly. We're not going there until pcounter can run in linux, so this is still somewhat abstract. But, nice to know.
- The reason that some print jobs render differently than direct-print jobs, is because of how Windows is designed. Direct-print jobs render with the 'local print provider', and iPrint jobs render with the 'network print provider'. This is a Microsoft thing, not an iPrint thing. You can duplicate it by setting up a microsoft IPP printer (assuming you're not mandating SSL like we are) and printing to the same printer with the same driver.
- The Manager on Linux doesn't use a Broker, it uses a 'driver store'.
- The Manager on NetWare doesn't always bind to the same broker. I didn't know that.
- It is recommended to have only one Broker, or one driver store per tree.
- Novell recommends using DNS rather than IP for your printer-agents, check your manager load scripts.
Labels: brainshare, edir, linux, microsoft, netware, novell, OES, pcounter, printing
Tuesday, March 18, 2008
BrainShare Tuesday
ATT326: Advanced Linux Troubleshooting
An ATT, therefore hard to summarize. But I learned about a few new commands I didn't know about before. Like strace. And vimdiff.
TUT130: Challenges in Storage I/O in Virtualization
Another nice one, but an emergency at work (printing down in a dorm, during finals week) distracted me heavily during the first half of it. Which resulted in the following note in my notes:
NPIV looks really nifty. Look into it.
NPIV being how you can use fibre-channel zoning to zone off VM's, rather than HBA's. Highly useful. I also learned about a neat new thing called Virtual Fabrics. Virtual Fabrics work kind of like VLANs for fabrics. You can segregate your fabrics into fabrics that share hardware but nothing else. Handy if your, say, Solaris admins don't want you mucking about with their zoning, while saving money through consolidated hardware.
TUT216: OES2 SP1 Architectural Overview
There is a LOT of new stuff in SP1.
- It will include eDir 8.8.4 (8.8.3 will ship this summer sometime)
- NCP and eDir will be fully 64-bit
- OES2 SP1 will be based on SLES10 SP2, which will be releasing about the same time
- AFP Support
  - AFP 3.1
  - Uses Diffie-Hellman 1 for password exchange, meaning the 8-character password problem is solved.
  - Fully SMP-safe
  - Has cross-protocol locking with NCP. CIFS doesn't have cross-protocol locking yet, but IIRC, Samba does
  - Does not need LUM enabled users
- CIFS Support
  - NTLMv1, but v2 is a possibility if enough people ask, so file those enhancement requests!!
  - CIFS is separate from Samba, therefore cannot be used in conjunction with Domain Services for Windows
  - As with AFP, fully SMP safe
- eDir 8.8.4
  - LDAP auditing enhanced
  - "newer auth protocols", but they didn't say what.
TUT211: Enhanced Protocol Support in OES2 SP1
This is the session where they went into detail about the AFP and CIFS support. They said that netatalk, the existing AFP stack on Linux, gets really slow once you go over 20 concurrent users. Whoa! I can soooo understand why Novell felt the need to make a new one.
- The 8 character password limit has been fixed! They now support DH1 for passing passwords.
- The 'afptcp' daemon can use one password protocol at a time, so you can only use DH1, or one of the other three I can't remember.
- Support for OSX 10.1 and 10.2 is scanty, and 10.5 is limited but users may not notice anyway.
- Passwords will be case sensitive.
- Kerberos will be in a future release
- Performance is faster than NetWare, partly due to the ability to multi-thread
- Can register services by way of SLP
- Only supports NSS for the time being, the other Linux file-systems will be a future feature.
- Can support 500 concurrent users, and 1000+ in the future. This fits our current AFP loads.
- We can configure more about how it works than we could on NetWare, such as how many worker threads to spawn.
- Has meaningful debug logs!
- Has a new command, 'afpstat' that works like 'netstat' for giving a snapshot of afp connections.
Tonight was the night formerly known as 'Sponsor Night,' but has a new name now that everyone who gets a booth is no longer a 'sponsor'. Some are sponsors, some are exhibitors. I can't keep track. Anyway, today was their party. "World of Novellcraft!" Homage to vid-gaming.
Lots of Wii, lots of Rock Band, some Halo, lots of women dressed in Renaissance Festival gear getting their pictures taken by the 90%+ male audience. I've blogged before about my ambivalence about Sponsor Night. I lasted until about 7, when I came back to the hotel.
Tomorrow I have an actual LUNCH BREAK in my schedule! Ooo! And
Labels: brainshare, edir, linux, netware, novell, OES, sysadmin
Monday, March 17, 2008
Today at Brainshare
Breakfast was uninspired. As per usual, the hashbrowns had cooled to a gelid mass before I found everything and got a seat.
The Monday keynotes are always the CxO talks about strategy and where we're going. Today a mess of press releases from Novell give a good idea what the talks were about. Hovsepian was first, of course, and was actually funny. He gave some interesting tid-bits of knowledge.
- Novell's group of partners is growing, adding a couple hundred new ones since last year. This shows the Novell 'ecosystem' is strong.
- 8700 new customers last year
- Novell press mentions are now only 5% negative.
- High Capacity Computing
- Policy Engines
- Orchestration
- Convergence
- Mobility
Another thing he mentioned several times in association with Fossa and agility, is mergers and acquisitions. This is not something us Higher Ed types ever have to deal with, but it is an area in .COM land that requires a certain amount of IT agility to accommodate successfully. He mentioned this several times, which suggests that this strategy is aimed squarely at for-profit industry.
Also, SAP has apparently selected SLES as their primary platform for the SMB products.
Pat Hume from SAP also spoke. But as we're on Banner, and it'll take a sub-megaton nuclear strike to get us off of it, I didn't pay attention and used the time to send some emails.
Oh, and Honeywell? They're here because they have hardware that works with IDM. That way the same ID you use for your desktop login can be tied to the RFID card in your pocket that gets you into the datacenter. Spiffy.
ATT375 Advanced Tips & Tricks for Troubleshooting eDir 8.8
A nice session. Hard to summarize. That said, they needed more time as the Laptops with VMWare weren't fast enough for us to get through many of the exercises. They also showed us some nifty iMonitor tricks. And where the high-yield shoot-your-foot-off weapons are kept.
BUS202 Migrating a NetWare Cluster to OES2
Not a good session. The presenter had a short slide deck, and didn't really present anything new to me other than areas where other people have made major mistakes. And to PLAN on having one of the linux migrations go all lost-data on you. He recommended SAN snapshots. It shortly digressed into "Migrating a NetWare Cluster to Linux HA", which is a different session altogether. So I left.
TUT215 Integrating Macintosh with Novell
A very good session. The CIO of Novell Canada was presenting it, and he is a skilled speaker. Apparently Novell has written a new AFP stack from scratch for OES2 SP1, since NETATALK is comparatively dog slow. And, it seems, the AFP stack is currently outperforming the NCP stack on OES2 SP1. Whoa! Also, the Banzai GroupWise client for Mac is apparently gorgeous. He also spent quite a long time (18 minutes) on the Kanaka client from Condrey Consulting. The guy who wrote that client was in the back of the room and answered some questions.
Labels: brainshare, clustering, netware, novell, OES, sysadmin, virtualization
Thursday, February 14, 2008
OES2-SP1 soon to be in closed beta
"What is in this release of Open Enterprise Server
Novell Open Enterprise Server 2 Support Pack 1 refreshes the SUSE Linux Enterprise Server 10 distribution with SLES10 SP2, fixes defects found since the release of OES2 and also adds in the following functionality:
- Novell engineered CIFS and AFP protocols
- New version of iFolder (3.7)
- Updated iPrint with an accounting API
- 64-bit version of eDirectory
- Enhanced migration tools and migration GUI
- Improved performance of the XEN hypervisor
- Domain Services for Windows
- NetWare 6.5 Support Pack 8
Note that although Domain Services for Windows is part of OES2 SP1, a separate beta program will be run in order to collate DSfW feedback."
Novell engineered CIFS? I soooo want to know what that is. Is it a completely new CIFS stack, or is it Samba with Novell extensions whacked on? I want to know! The other important bit of information:
The beta test program is currently scheduled to begin mid March and run through October.
Which means there won't be product for my 2008 upgrade window. Fie. Well, at least we'll have ample time to prototype and test for the 2009 upgrade window.
Labels: clustering, netware, novell, OES
Monday, February 04, 2008
Today's 18 year olds...
- Were born in 1990
- ...have never known life without cable or satellite TV.
- ...probably have never seen a rotary dial phone.
- ...have had internet access for most of their school life.
Which got me thinking about a few things. One of the items that is frequently put forth about Kids These Days (tm) is that they don't KNOW anything, they just know how to FIND things. There is some debate about this, but it is a common sentiment. I believe that kids these days (KTD) have figured out keyword based searching, and the search engines have gotten good enough at mind-reading that arcane search incantations aren't needed nearly as often as they were in the past.
Before Google, there was AltaVista. This was an era of the internet where boolean search incantations were needed to really narrow down to what you wanted. I didn't switch to Google for a long time because Google didn't have the NEAR search term, which I used on AltaVista as a way to narrow results to be more relevant. I didn't know at the time that Google effectively threw that term in on every search.
Those of us who lived through that era of the internet built up searching skills. I remember some searches I did back then that were pretty complex. I can't remember the exact terms used, but they looked like this:
bootes AND (antaries OR proxima) AND (fulcrum NEAR pinnacle)
I had a logic class in college, so these sorts of parenthetical statements made sense to me. Still do, I just don't end up needing to uncork the boolean logic to find what I need anymore as the search engines have gotten good enough that I don't NEED to do it. I know google allows much of the above, but I haven't had to do it so I don't know the syntax for it.
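For anyone who never had to write one of these, here is roughly what that query asks for, expressed as code. It is only a sketch: the tokenizer is crude, and the ten-word window for NEAR is my assumption about how AltaVista treated the operator, not something taken from its documentation.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(tokens, first, second, window=10):
    """True if the two words occur within `window` words of each other."""
    first_positions = [i for i, word in enumerate(tokens) if word == first]
    second_positions = [i for i, word in enumerate(tokens) if word == second]
    return any(abs(i - j) <= window
               for i in first_positions for j in second_positions)


def matches(text):
    """bootes AND (antaries OR proxima) AND (fulcrum NEAR pinnacle)"""
    tokens = tokenize(text)
    present = set(tokens)
    return ("bootes" in present
            and ("antaries" in present or "proxima" in present)
            and near(tokens, "fulcrum", "pinnacle"))


if __name__ == "__main__":
    print(matches("Bootes and Proxima: the fulcrum sits at the pinnacle"))
    print(matches("Bootes has a fulcrum and a pinnacle"))
```

The parentheses map straight onto the nesting of the `and`/`or` expressions, which is why a logic class made these queries feel natural.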
So I posit that yes, KTD don't know anything, but neither are their search skills robust.
Which brings me to Novell. I got to thinking what a NetWare administrator in 1990 had to know to do their job, and how I could fit into such a hypothetical time.
Right now if I don't know the answer to a problem I have a few methods to figure it out.
- Hit the online Novell Knowledge Base over at novell.com/support
- Hit the peer-support forums over at forums.novell.com (or nntp://forums.novell.com/ if you prefer old-school)
- Pay for a support incident
- Ask around the office
A NetWare admin in 1990 had a different list:
- Hit the peer-support forums over on CompuServe, which required a modem and a CompuServe account.
- See if the problem is mentioned in the book-shelf of manuals, which was a big investment to own.
- Pay for a support incident.
- Ask around the office.
As I see it, a NetWare admin of 1990 was on average more knowledgeable about their product than the NetWare admin of 2008. Such administrators avoided the cost of paying for support incidents by having the manuals in hard-copy form, and plonking down real money for CompuServe accounts. If I have a weird problem I'll hit up the Novell KB to see if there is a TID on it, then check the support forums to see if it is mentioned there, before I'll expend an incident on the thing. In time I've managed to teach myself how NetWare works in some very basic ways, simply by troubleshooting oddball problems. This is why I typically end up talking to backline support when I call in, unless the problem is a known issue in the private KB. My skills are probably on par with what was normal 'back in the day'.
I think this holds true for a lot of the tech field. Back then there was a lot of stuff you just had to KNOW. Or failing that, have spent the money to get the backup resources in place (manuals, support contracts). These days a base understanding of how things work is the key to phrasing the right search queries in the online knowledge bases, and less rote memorization (training) can be effective in solving a greater list of problems.
Prosthetic memory! Prosthetic training! The tools of geeks everywhere.
Wednesday, January 30, 2008
I don't think it works that way
I finally, finally managed to get a meaningful packet-capture after it fails, and I found some traffic that... doesn't look right. Take a look:
-> NCP Connection Destroy
<- R OK
<- FIN, PSH, ACK
<- RST, ACK
-> ACK (to R OK)
<- RST
Note the three packets in the middle. The responding server is tearing down the connection twice for some reason. Compare this with a 'normal' tear-down:
-> NCP Connection Destroy
<- R OK
-> FIN, PSH, ACK
<- ACK
-> RST, ACK
The first example I gave is the last traffic on the wire before the server abends, so is of course highly suspicious. The pattern that leaps right out is that the responding server is issuing the FIN,PSH,ACK and RST,ACK pair, rather than the sending server, and doing so before the sending server can say "I got it" to the connection close packet.
Now I need to catch it in the act again to prove this theory.
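When sifting through future captures, a check like the one below could flag the suspicious pattern automatically. It works on hand-transcribed summaries in the same arrow notation used above, not on real capture files, and the rule it encodes (the responder sends FIN or RST before the requester has said anything further after "R OK") is just my reading of the two traces.

```python
SUSPECT = [
    ("->", "NCP Connection Destroy"),
    ("<-", "R OK"),
    ("<-", "FIN, PSH, ACK"),
    ("<-", "RST, ACK"),
    ("->", "ACK"),
    ("<-", "RST"),
]

NORMAL = [
    ("->", "NCP Connection Destroy"),
    ("<-", "R OK"),
    ("->", "FIN, PSH, ACK"),
    ("<-", "ACK"),
    ("->", "RST, ACK"),
]


def responder_closes_first(trace):
    """True if the responder ('<-') starts the TCP teardown right after 'R OK'.

    `trace` is a list of (direction, summary) tuples in wire order.
    """
    seen_reply = False
    for direction, summary in trace:
        if not seen_reply:
            seen_reply = direction == "<-" and summary == "R OK"
            continue
        if direction == "->":
            return False  # the requester spoke first, as in the normal case
        if "FIN" in summary or "RST" in summary:
            return True
    return False


if __name__ == "__main__":
    print("suspect trace flagged:", responder_closes_first(SUSPECT))
    print("normal trace flagged:", responder_closes_first(NORMAL))
```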
Friday, January 25, 2008
A needed patch.
The sorting problem happens when you have eDir 8.8 installed. Suddenly C1 starts sorting things by creation date rather than as you've ever seen it before. This is... confusing. ConsoleOne 1.3h helped some of it for us, but not all. And now, we have a patch!
Let ConsoleOne Sort Correctly!
Wednesday, January 16, 2008
NetWare library patches
This just caused me a problem. It turns out that if you have libcsp6b (the LibC patch) applied and not nwlib6k (the CLib patch), there is an abend possibility. It happened yesterday. It turns out that in that case, a badly formed network broadcast can cause an abend. This caused three of my six cluster nodes to fall on their butts at the same time. That was fun. Strange (but good) thing is, I had already applied both patches to these three servers but hadn't gotten around to rebooting them yet. So, by killing themselves they actually fixed the problem.
The abend, key details:
EIP in SERVER.NLM at code start +0015FD27h
Heh heh heh. Oops.
And now a bit of history. Long time NetWare admins can ignore this part.
Q: Why are there two C libraries?
CLIB is the library NetWare started with. It began life in the dark and misty past, probably in the late 1980's. It is the deepest, darkest bowels of NetWare from the era when Novell was it when it came to office networking. Being so old, its APIs are very mature. Applications developed against CLIB generally speaking just plain work.
CLIB is also deprecated since it is highly proprietary, and doesn't play well with others. "Just plain works" in this instance means an assumption of 8.3 names, with kludging to support long file names if at all possible. CLIB applications have a tendency to have IPX dependencies for no good reason.
LIBC was created, IIRC, around the release of NetWare 5.0 when it became possible for NetWare to operate in a "pure IP" environment. LIBC was designed with the concept of POSIX semantics in mind, which CLIB was not. LIBC was created from scratch with long file name support. By now, as of NetWare 6.5 SP7, most of the NetWare kernel is written against LIBC rather than CLIB.
As an example of LIBC vs CLIB, take the 'MyWeb' service this blog is served by. When I did this the first time, it was on NetWare 6.0, using Apache 1.3. Apache 1.3 was linked against CLIB and was very stable. The service notes for the Apache Modules I needed to run to make it work made it clear that supporting long file-names on remote servers was something that only recently started working.
When the migration to NetWare 6.5 came around, it meant I had to migrate MyWeb to Apache 2.0. Apache 2.0 is linked against LIBC and used a different apache module to make things work. I had troubles. The LibC functions were not nearly as mature as their CLIB counterparts, and it showed. 3.5 years later things are now a lot more stable than back then.
Labels: clustering, netware, novell, sysadmin
Friday, January 11, 2008
Disk-space over time
To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which run between 7000-8000 home directories on them. We load-balance by number of directories rather than least-size. The chart:

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.
There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.
We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.
Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.
Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.
If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.
Being the 'slow, cheap' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an eDirectory-enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.
Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.
Monday, January 07, 2008
I/O starvation on NetWare, another update
This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true: when the problem occurs, B2D speeds drop through the floor. Last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.
The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):
I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system.
Which, if true, puts the onus of putting up with a badly behaved I/O system onto NetWare again. The tech went on to recommend RAID1 for the LUNs that need high performance when doing array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks, the bottleneck is CPU on the MSA1000 controller that's active. Going RAID1 on the LUNs would keep speeds very fast even when doing array operations.
That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.
Wednesday, January 02, 2008
Where NetWare Fits
We regularly run between 1200-6000 concurrent connections on our cluster nodes. This is a density that just doesn't happen all that often in the market. If you have 6000 users close enough together to all talk to the same file-server at LAN speeds using a protocol designed for file-serving (such as NCP, SMB/CIFS, or AFP), you're a big organization. 6000 is a large corporate campus, a large governmental entity of some kind, or a larger .EDU like us. Nationally, the number of 'large' file-servers like that is peanuts compared to the number of 'workgroup' (i.e. under 300 concurrent users) servers out there.
It is therefore no surprise to me that Novell is not devoting a lot of engineering to supporting the top end of this market. While it may pay well, there just isn't enough revenue coming from these customers to try and handle the hardest-to-test use-case: very high concurrency. I find it disappointing because I AM one of those customers (a larger .EDU), but I understand the business drivers supporting the decision.
For the moment, NetWare 6.5 (32bit) is the top-dog performance wise for our environment. That isn't going to stay true for much longer. It would not surprise me to find out that a Windows Enterprise Server (x86_64) with 16GB of RAM can out-perform a NetWare 6.5 (32bit) server with 4GB of RAM, simply due to the added room for a file-cache. What I don't know is how CPU-bound file-serving I/O is on a Windows Enterprise Server. That's the one area that could keep NetWare 6.5 (32bit) on top. I already know that OES2-Linux out-performs NetWare for NCP traffic, so long as you stay within CPU bounds.
For high-concurrency applications, as far as I know NetWare still wins.
Friday, December 28, 2007
NetWare and Hyperthreading, again
It has long been consensus in the support forums that (paraphrased), "If you have hyperthreading turned on and get an I/O thread stuck on a logical processor, woe be unto you."
I have a server that I've been backing up for a fellow admin in another department. This particular server has 525GB of storage to back up, so it's going to take some time. It has been vexing figuring this one out. Until today, when I finally twigged to the fact that this server has HT turned on. I turn HT off as almost the first thing I do when setting up a server, so I don't think about it when troubleshooting.
Between 1000 and 1215 today, the backup got 882MB of data. Yeah, very crappy.
At 1215 I turned off the logical processors. This is a handy feature NetWare has, and I used it in the article I linked above.
At 1222 when I checked back the backup was up to 4.0GB.
At 1417 it is now up to 71GB backed up.
The only thing that changed was me turning off the logical processors. That's it. At that rate, this server should be backed up in around 15 hours, which is a far cry from the 30+ hours it was doing before.
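For anyone who wants to check my math, here is a rough sketch of the arithmetic behind that 15 hour estimate. It uses only the clock times and byte counts quoted above, and assumes 1GB = 1000MB and that the rate holds steady for the whole 525GB, neither of which is guaranteed.

```python
from datetime import datetime

def minutes_between(start_hhmm, end_hhmm):
    """Elapsed minutes between two same-day 24-hour clock times such as '1215'."""
    fmt = "%H%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return delta.total_seconds() / 60

TOTAL_GB = 525  # size of the server being backed up

# Logical processors ON: 882MB moved between 1000 and 1215.
rate_ht_on = 0.882 / minutes_between("1000", "1215")         # GB per minute

# Logical processors OFF: 4.0GB at 1222, 71GB at 1417.
rate_ht_off = (71 - 4.0) / minutes_between("1222", "1417")   # GB per minute

# Projected time for the whole backup at the HT-off rate (assumes a steady rate).
hours_ht_off = TOTAL_GB / rate_ht_off / 60
speedup = rate_ht_off / rate_ht_on

print(f"HT on:  {rate_ht_on * 1000:.1f} MB/min")
print(f"HT off: {rate_ht_off * 1000:.0f} MB/min")
print(f"Full backup at the HT-off rate: about {hours_ht_off:.0f} hours")
print(f"Speedup over the bad stretch: roughly {speedup:.0f}x")
```

The HT-on figure comes from one particularly bad two-hour window, so treat the speedup as an illustration of how bad a stuck thread can get rather than a benchmark.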
Turn Hyperthreading off on your NetWare servers. Just do it.
Labels: hyperthreading, netware, novell, storage
Thursday, December 27, 2007
eDirectory certificate server changes
With Certificate Server 3.2 and later, in order to completely backup the Certificate Authority, it is necessary to back up the CRL database and the Issued Certificates database. On NetWare, these files are located in the sys:system\certserv directory.
For other platforms, both of these databases are located in the same directory as the eDirectory dib files. The defaults for these locations are as follows:
Windows: c:\novell\nds\dibfiles
Linux/AIX/Solaris: /var/opt/novell/edirectory/data/dib
These defaults can be changed at the time that eDirectory is installed.
The files to back up for the CRL database are crl.db, crl.lck, crl.01 and the crl.rfl directory. The files to back up for the Issued Certificates database are cert.db, cert.lck, cert.01, and the cert.rfl directory.
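To make the list above easier to drop into a backup job, here is a small sketch that expands it into full paths. The directory defaults and file names come straight from the quoted text; the function name and the platform keys are my own invention, and the defaults will be wrong if eDirectory was installed somewhere else.

```python
# File and directory names as listed in the quoted documentation.
CRL_DB = ["crl.db", "crl.lck", "crl.01", "crl.rfl"]
ISSUED_CERT_DB = ["cert.db", "cert.lck", "cert.01", "cert.rfl"]

# Default locations from the quoted text, paired with each platform's
# path separator. The keys are hypothetical labels, not Novell terms.
DEFAULT_DIRS = {
    "netware": ("sys:system\\certserv", "\\"),
    "windows": ("c:\\novell\\nds\\dibfiles", "\\"),
    "unix": ("/var/opt/novell/edirectory/data/dib", "/"),
}

def ca_backup_paths(platform):
    """Return the full paths a backup job needs for the CA databases."""
    base, sep = DEFAULT_DIRS[platform]
    return [base + sep + name for name in CRL_DB + ISSUED_CERT_DB]

for path in ca_backup_paths("netware"):
    print(path)
```

Remember that the .rfl entries are directories, so whatever consumes this list needs to recurse into them.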
I didn't know about that directory. I also didn't know that the CA is publishing a certificate-revocation-list to sys:apache2\htdocs\crl\. Time to twiddle the backup jobs.
Thursday, December 20, 2007
eDir 8.8, Priority Sync
6.0 Priority Sync
Priority Sync is a new feature in Novell® eDirectory 8.8™ that is complimentary to the current synchronization process in eDirectory. Through Priority Sync, you can synchronize the modified critical data, such as passwords, immediately.
You can sync your critical data through Priority Sync when you cannot wait for normal synchronization. The Priority Sync process is faster than the normal synchronization process. Priority Sync is supported only between two or more eDirectory 8.8 or later servers hosting the same partition.
6.1 Need for Priority Sync
Normal synchronization can take some time, during which the modified data would not be available on other servers. For example, suppose that in your setup you have different applications talking to the directory. You change your password on Server1. With normal synchronization, it is some time before this change is synchronized with Server2. Therefore, a user would still be able to authenticate to the directory through an application talking to Server2, using the old password.
In large deployments, when the critical data of an object is modified, changes need to be synchronized immediately. The Priority Sync process resolves this issue.
Which sounds spiffy. Instant sync of passwords? I'm all for that. Then I remembered, 'wasn't that happening already? That's right, that's the "SYNC_IMMEDIATE" flag in schema.' And that's what's described in this older CoolSolutions article.
Looking at iMonitor I see this:

As 90-95% of our user objects are in either the root container or the students container, those are the statistics I'm interested in. The "maximum ring delta" number is very, very rarely over 30 seconds for these two partitions. With it being intersession right now, we're seeing somewhat higher numbers than usual, but it is still kept in close sync. As we have 24 hour computer labs, and a simple login causes several user-object attributes to update, we have a continual flow of directory changes. In our case, using Priority Sync would buy us a few seconds at most. We're not under any sort of regulatory mandate to do things 'instantly' so that isn't an issue, and our password-change process is well known to our end users for taking "up to 5 minutes".
Still, I like the idea even if it isn't a good fit for us.
Wednesday, December 19, 2007
eDir 8.8 is in
Whenever you do upgrades like this you always wonder if those balls you're juggling are tennis-balls or grenades. It took about a half hour per server and didn't have any significant hitches. The one problem that did surface is that the OES1-linux server's LDAP server had its certificate change from the one it was using to SSL CertificateDNS. This was not good, as that certificate doesn't have the subject-name we need and this caused some S/LDAP binds to fail due to SSL validation problems. That was an easy fix. The LDAP servers on the NetWare boxes didn't change.
This was a tennis-ball upgrade. So far.
We haven't turned on case-sensitive LDAP binds yet, but soon. Soon.
One unexpected side-effect of getting all three eDir servers upgraded to 8.8 like this, is that the Change Cache is now cleaned of those permanent residents we've had for ages. Woo!
Monday, December 17, 2007
Not dead.
On my list of things to do during the winter inter-session is to get eDir 8.8 deployed in the production tree. I just need to have ALL the servers in the tree (all, not just replica holders due to backlink updates) up and talking when I do the first one, and that could take some scheduling. This is the first step to OES2, which will be deployed on the eDir servers first.
As soon as I get some new hardware, since they're getting old.
Friday, November 30, 2007
OES2 SP1 timing
Domain Services for Windows, which is scheduled to ship with OES 2 SP1 (currently scheduled for late 2008), will also offer some clear advantages.
"Late 2008" means they WILL NOT have SP1 out by August of 2008. This means that the upgrade of our 6 node cluster to OES will have to wait until 2009. Grrarrr!
Another 21 months of a 32-bit operating system on the single biggest storage consumer on campus. We'll have at least one hardware refresh before then for some of the nodes, and... boy I hope they have NetWare drivers for that. The very limited testing I did with NetWare-in-Xen was not encouraging from a performance stand-point. If it looks like I'll have to deploy that way for the next servers we get in the cluster, I'll have to do more real testing to characterize the performance hit (if any). The idea of a 64-bit memory space for file-caching makes me drool. Not getting it for 21 months is painful.
That said, if Novell releases the eDirectory-enabled AFP server for OES2-Linux outside of the service-pack I could still make the 2008 window. That's our only dependency for SP1.
Wednesday, November 28, 2007
I/O starvation on NetWare, HP update
This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.
Or...
Working as designed. Fix your software. Talk to Novell.
Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.
Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.
Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin
Wednesday, November 21, 2007
I/O starvation on NetWare
This is a problem with our cluster nodes. Our cluster nodes can see LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.
IF the MSA is in the high Avg Command Latency state, and
IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
THEN "Concurrent Disk Requests" in Monitor go north of 1000
This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.
Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.
This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.
This is something that a very new feature in the Linux kernel could help with, and that's the concept of 'priority I/O' in the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.
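To illustrate the idea, here is a toy sketch of a request queue that services high-priority I/O ahead of everything else. This is not how the Linux kernel (or anything in NetWare) actually implements it; the class, the priority levels, and the request strings are all made up for the example.

```python
import heapq
import itertools

HIGH, NORMAL = 0, 1  # hypothetical priority levels; lower number is serviced first

class PriorityIOQueue:
    """Toy I/O queue: highest priority first, FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserves arrival order

    def submit(self, request, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = PriorityIOQueue()

# A thousand bulk writes pile up behind the laggy storage device...
for i in range(1000):
    queue.submit(f"bulk write {i} to slow LUN")

# ...and then the cluster heart-beat arrives last.
queue.submit("cluster heartbeat write", priority=HIGH)

first = queue.next_request()
second = queue.next_request()
print(first)
print(second)
```

In a plain FIFO queue the heart-beat would wait behind all thousand bulk writes. Note that prioritizing the queue does nothing for requests already handed to a slow device, so this would shrink the problem rather than remove it.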
I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.
Labels: clustering, hp, msa, netware, novell, OES, storage, sysadmin
Monday, October 15, 2007
Peer-to-peer sharing
I want to share U:\SharedStuff\ApacheGroup\ to five other users. U: is my home directory, which is actually map-rooted so I don't see the top level directory. So I go to a web page and tell it I want to share this directory, to these people, for this long. Go.
It struck me that this sort of thing can be engineered with NetWare and OES. The key components are eDirectory, NSS, and NetStorage.
The web server takes the request and translates $Path into a real path by referencing the HomeDirectory attribute of the user who requested the share. Then, using LDAP it creates two objects:
A Group Object
- Created and named dynamically
- [AuxClass] Attribute with user-defined name
- [AuxClass] Attribute with the creator
- [AuxClass] Attribute with the expiry date
- Since this is eDirectory, group memberships apply immediately rather than taking a logout/login cycle to refresh the access token like in MS networks.
- Created & named dynamically
- Associated to the created group
- Assigned to the specified users
- This allows the share to show up in NetStorage
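A rough sketch of the web-server side of this might look like the following. It stops short of the actual LDAP calls, and everything in it is an assumption for illustration: the function names, the attribute names (shareDisplayName, shareCreator, shareExpiry), the example volume and DNs are all hypothetical and would need to match whatever auxiliary class actually gets defined in schema.

```python
from datetime import date, timedelta
import uuid

def resolve_share_path(requested_path, home_directory, drive="U:"):
    """Translate a map-rooted path (U:\\...) into a real path under the
    requesting user's HomeDirectory value."""
    if not requested_path.upper().startswith(drive.upper()):
        raise ValueError(f"path is not on the map-rooted drive {drive}")
    relative = requested_path[len(drive):].strip("\\")
    return home_directory.rstrip("\\") + "\\" + relative

def build_share_group(creator_dn, display_name, members, days_valid, today=None):
    """Build the attribute set for the dynamically named group object."""
    today = today or date.today()
    return {
        "cn": "share-" + uuid.uuid4().hex[:12],   # created and named dynamically
        "shareDisplayName": display_name,          # hypothetical aux-class attribute
        "shareCreator": creator_dn,                # hypothetical aux-class attribute
        "shareExpiry": (today + timedelta(days=days_valid)).isoformat(),
        "member": list(members),
    }

def is_expired(group, today=None):
    """What a janitor process would check before tearing a share down."""
    today = today or date.today()
    return date.fromisoformat(group["shareExpiry"]) < today

# Example values are made up; only the U:\SharedStuff\ApacheGroup\ path is from the post.
real_path = resolve_share_path("U:\\SharedStuff\\ApacheGroup\\",
                               "STUDENTS1:\\users\\jdoe")
group = build_share_group("cn=jdoe,ou=students,o=example", "Apache Group",
                          ["cn=asmith,ou=students,o=example"],
                          days_valid=30, today=date(2007, 10, 15))
print(real_path)
print(group["shareExpiry"])
```

The resulting dictionary is what would get handed to an LDAP add operation, with the trustee assignment on the resolved path done separately.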
There is a small constellation of maintenance tasks that also need to be created, such as a janitor process to deal with expirations, a helpdesk view to track who has what shares, a historic view to see what shares got deleted recently that suddenly need to be back RIGHT NOW, something to interface this with whatever disk or directory quota systems are in use.
The use of NetStorage allows WebDAV to be used as an access method, which allows the shares to be seen. The really brave may be able to leverage DFS to create actual directory structures reflecting the shares in the actual directories so drive mappings can be used; unfortunately I have no idea if a DFS database that large is a good idea.
Users would love this. No need to go through management to get a directory set up on the shared space. You just set up and go. Great for adhoc groups, or small private gatherings.
Unfortunately, this sort of share model is one that a lot of sys-admins are familiar with. If you've ever had a chance to examine the network of a small business with under 15 users, all of whom call themselves 'not that good with computers', you know what I'm talking about. This model of sharing is the one that Windows for Workgroups was designed for, and is still the default mode for plain old WinXP. Excessive use of peer to peer sharing like that can lead to one unholy mess, especially if a key person leaves (or in the case of the Windows example, one hard drive crashes hard).
If left unchecked, you can get whole business processes designed with the assumption that [username] will never retire. That already happens to an alarming extent, but this would make the dependency more invisible to those of us charged with making it all work again when it breaks. You can have shared spaces that are business critical to the company living 100% inside a user's self-managed space, and vulnerable to deletion on termination of that employee.
This is all part of the balance we as system administrators have to keep between end user functionality, and data protection. Desktop techs fight a constant battle to get users to save data on the server where it is backed up, and Novell puts out things like iFolder to help that whole thing become more invisible. We created shared directories to draw a big line between 'my stuff' and 'us stuff'.
That said, data-access habits are changing all the time. My own boss prefers to email a 150KB Excel spreadsheet to all of us, even though all of us have ready access to a shared directory set up just for that. SharePoint integrates with Office to make the web-server look like a file-server. We still have to adapt with the times.
User-directed sharing is something I can see as highly desirable among the student population and faculty as well. Among staff, I'm less sure it's a good idea outside of the 'trivial' personal use we're allowed.
Wednesday, September 26, 2007
OES2 release date
OES2 will be released on October 5th.
OES2-SP1 is targeted for mid-April, 2008.
AFP integration will be in SP1.
I sooooooooo hope they don't push SP1 past July. If that happens, my main migration of our cluster will have to be pushed to 2009. Ick. We're already running out of effective file-cache in 32-bit memory space. I need 64-bit to really give good performance. Hope hope hope.
A few other minor points:
- Around the release of SP1, Prosoft and Condrey Consulting (Kanaka) will release an NCP client for Mac.
- The clearing of throats next to a mic is a sign of someone who doesn't do a lot of work in front of mics.
- OES2 is fully 64-bit optimized (on Linux)
- They claim EVEN BETTER NSS performance on OES2. I hope to try that out, soon as I can figure out how to get SLES10/OES2-beta5 to talk to my SAN luns. It hates me.
Tuesday, September 25, 2007
OES2 Web-chat tomorrow
Open Enterprise Server 2 Live Webcast
Tomorrow, September 26th at 11AM PDT.
They'll be talking about all the spiffy that's in OES2, and some new info about code releases. I think this is the 'event' they mentioned a while back.
Monday, September 24, 2007
Mod_edir issues again
Right now I'm suspecting libc, as that's been my problem in the past. Perhaps the connection tear-down code in mod_edir isn't "taking" somehow.
Unfortunately, I'm not sure if I can call in an incident against mod_edir, or if I'll have to work with the devs (somehow) and call in against libc. If I reboot the web-servers every couple of days that causes the connections to close, but that is not a fun solution.
Thursday, September 20, 2007
That... is a lot of connections!
Yep. Check the Concurrent Connections number. That is a very big number. During term we run between 1500 and 4000 concurrent connections. Yet... that is way above that. What's more, going into the Novell Remote Manager, I find this pair of very interesting numbers:
Connection Slots Allocated: 44000
Connection Slots Being Used: 43982
Looking at the connections shows me what the problem is. All those 'extra' connections are for the user account that allows MyWeb (what you're reading this through ultimately) to work. Somehow... and this is a guess... mod_edir seems to be creating a new connection for each request coming in, rather than reusing the old ones. Or perhaps it isn't cleaning up after itself. Probably since I put SP6 in.
This would explain why this particular server has an unreasonably high memory allocation to CONNMGR. Must Poke More.
Tuesday, September 18, 2007
OES2: clustering
Another thing about speeds, now that I have some data to play with. I copied a bunch of user directory data over to the shared LUN. It's a piddly 10GB LUN so it filled quick. That's OK, it should give me some ideas of transfer times. Doing a TSATEST backup from one cluster-node to the other (i.e. inside the Xen bridge) gave me speeds on the order of 1000MB/Min. Doing a TSATEST backup from a server in our production tree to the cluster node (i.e. over the LAN) gave me speeds of about 350MB/Min. Not so good.
For comparison, doing a TSATEST backup from the same host only drawing data from one of the USER volumes on the EVA (highly fragmented, but much faster, storage) gives a rate of 550 MB/Min.
I also discovered the VAST DIFFERENCES between our production eDirectory tree, which has been in existence since 1995 if the creation timestamp on the tree object is to be believed, and the brand new eDir 8.8 tree the OES2 cluster is living in. We have a heckova lot more attributes and classes in the prod tree than in this new one. Whoa. It made for some interesting challenges when importing users into it.
Labels: clustering, netware, novell, OES, virtualization
OES2-beta progress
I haven't gotten very far in my testing, but a few things are showing. I managed to do a TSATEST-based throughput run of a backup of SYS. That's about a gig of data. Throughput for just one stream to one of the servers was around 500 MB/min, which is passable and within the realm of real performance for slower hardware. The downside of that is that the CPU reported by "xm top" was around 45%, where the CPU reported in MONITOR was closer to 25%. That's way higher than I expected, but could be related to all the disk I/O ops. This I/O was to a file in the file-system, not a physical device like a LUN on the SAN (that comes later).
Now I'm trying to get Novell Cluster Services installed. I want to get a weensy 2-node cluster set up to prove that it can be done. I suspect it can, but actually seeing it will be very nice.
Labels: clustering, netware, novell, OES