

Wrap-up of the migration from the Tru64 TruCluster mission critical NAS server to the CentOS 5 Linux NAS server

 

 

This post wraps up a topic I have been posting about on and off here for a while: the new mission critical NAS server cluster based on CentOS 5. Previous posts in this series, starting August 29th of 2007:

 

  1. Tru64 NAS Server Replacement Project
  2. NFS, GFS, nodirplus / readdirplus, and Tru64 updates
  3. CentOS 5 NAS Cluster
  4. CentOS 5 HA Cluster Speeds and Feeds
  5. Kernel Hackage
  6. One Week Later
  7. Bug 431253
  8. GFS or NFSD?

 

 

We are not quite done with the migration of all the file systems off of the Tru64 TruCluster. Its original ~4.5 Terabytes have been slowly absorbed by the new Linux cluster. We have been very cautious; we wanted to make sure we introduced change in a controlled manner, in case we had any more of those HP-UX client type issues lurking in the woodwork. Dan Goetzman, chief NAS abuser, did find another one, and only this week too. More on that below.

 

Semantics

 

We also have the fact that we are still running our modified version of the CentOS 5 OS. Neither Red Hat nor CentOS has closed the issue we opened (see post "Bug 431253" above), and I think that is a smoking gun waiting to shoot some folks in the toes. Here is why I think that: the file open / close semantics used to "live" inside the code provided by each file system. Ext3 file open / close code could therefore be slightly (or even very) different from GFS or XFS or some other file system, since each file system was written at different times and places by different people for different reasons, and in some cases, like XFS or JFS, for operating systems other than Linux. XFS comes to us from SGI, therefore Irix, and JFS is from IBM / AIX.

 

Recent kernels have provided the file access semantics internally. An installable file system is not required to use them, but they are available to all. The file system maintainers have started to move from the code inside each of the various file system types to routines in the kernel. It makes sense: Why maintain this common code in all these different places?

 

 

GFS went 'there' (to using the kernel file access routines) first, and it is our belief that this is where the HP-UX client issue was introduced. The kernel routines (written by a subset of people who more than likely did not write all the internal routines contained in all the different file systems) don't work 100% the same way as those buried in the file system code. That might be a bit of an understatement.
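To make that distinction concrete, here is a minimal C sketch, not taken from GFS or any real filesystem, of the pattern described above: in the 2.6-era Linux VFS a filesystem can either carry its own open handler or wire in the generic helper the kernel provides. The "examplefs" names are made up; struct file_operations, generic_file_open(), and generic_file_llseek() are the actual kernel symbols of that era. This is a fragment that would live inside a kernel module, not a standalone program.

#include <linux/fs.h>

/* Old pattern: the filesystem carries its own open semantics,
   possibly subtly different from every other filesystem's. */
static int examplefs_open(struct inode *inode, struct file *filp)
{
        /* filesystem-specific checks would go here (flags, large
           files, cluster locks, ...) before falling through to the
           generic helper the kernel provides */
        return generic_file_open(inode, filp);
}

static const struct file_operations examplefs_file_ops = {
        /* Old way: per-filesystem open/close code */
        .open   = examplefs_open,
        /* New way, the move GFS made first: point straight at the
           shared kernel routine instead:
        .open   = generic_file_open,                               */
        .llseek = generic_file_llseek,
};

The catch, as described above, is that the shared kernel routine does not necessarily behave bit-for-bit like the per-filesystem code it replaces, so a single change there can ripple out to every filesystem that adopts it.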

 

Since Dan's reading on the subject leads him to believe that the other file system types are also going to migrate to letting the kernel handle the semantics, that was/is going to put everyone in the same boat. The broken HP NAS client boat. So that the metaphor is not too mixed, the smoking gun is then used to shoot a hole in the bottom of the boat, passing through one's toes and perhaps some aquatic life forms.

 

 

We don't have to migrate to a new version of the CentOS OS any time soon, though. CentOS is working fine. Dan's file semantics kernel patch is working and has a long runtime on it, so we have confidence we can move forward. We do have some motivation to move forward if we can: the TruCluster is off both hardware and software support.

 

Ouroboros Tru64 TruCluster

 

 

The Tru64 TruCluster hardware now has so much excess capacity, since its formerly brimming file systems have been "drained" over to the CentOS cluster, that any hardware failure could easily be dealt with by self-cannibalization. Ehww. Sounds ugly when I type it that way. True though: we have two ES40 server nodes, each with four GB of RAM and four CPUs. There are empty RAID sets of all disk capacities (36GB, 72GB, 144GB). The fiber channel cards, Brocade switches, memory channel, etc. are all twinned out for the TruCluster. If something fails, it fails over to the surviving bits, and in the seven years we have had this gear the only failures we have had have been either of disks or failures of imagination. In failure mode, we can choose to ignore it, use the redundant capacity, raid other Alpha based gear for parts (I still have VMS servers running on Alpha gear which in a pinch might give up their lives), or, worst case, do a time and materials call to HPQ. More than likely, the TruCluster will just eat itself though, reducing in size and capacity as it goes. That takes care of the hardware.

 

The software is a different story. It cannot eat itself ... hopefully. It never has, anyway. It is stable and we have not patched it in literally years. Before that the patch rate was pretty low, and consisted mostly of point patches for specific problems. Stability of the OS / NAS bits is good news and bad news. Good that it is stable. Bad when things like NFS V4 are starting to creep into the shop, which the TruCluster just will not deal with other than by forcing the client to downshift to V3 or V2.

 

Easy Does It

 

 

This slow migration of critical file systems meant Dan did not have to spend a concentrated, focused block of time on data migration; he could go slow, do a good job, and think about each move in depth. Quality still counts, especially when you are moving your most critical bits and bytes!

 

 

As I write this, I just looked at the status of the move on the internal Wiki: the vast majority of the file systems that have for literally years lived on the TruCluster are now over on the CentOS 5 cluster. We have been running builds and packaging against them for months.

 

 

The uptime of the cluster as a whole has been satisfactory. We have had no customer facing service outages at all, and while there have been rolling upgrades and individual node outages, they have been inside the design parameters. The point of doing this as a cluster was to be able to offline a node, work on it, then have it rejoin the cluster, and Dan has taken advantage of that to upgrade the ILO cards and do various other service related things. I looked at one of the three nodes a moment ago, and it has over sixty days of uptime. That does not matter much though: the customer facing service has been up pretty much continuously since we put the cluster into service last December.

 

 

The main thing, and this is the key point, is that our customers never knew we did anything to the cluster. That is exactly what we used to achieve with the TruCluster, even though the underlying OS, clustering technology, hardware, and therefore technical procedures are completely different.

 

Sun Client Bug

 

 

Since I last posted here, we have discovered one more unruly client. This time it is Solaris, and the fix is a patch to that OS, not a change to the server. Dan, as usual, has been all over the problem. Here is what he found. First, a note in a web forum from Casper Dik at Sun:

 

Casper H.S. *** <Casper.***@xxxxxxx> writes:

 

 

"Ross"  <nospam@xxxxxxxx> writes:

 

Thank you, Casper!
Here is the output:
bash-2.05$ cd testdir
bash-2.05$ ls  -f
.. testfile .

 

Ah,  yes.

 

The chmod code is broken and can't deal with "." and ".." not
being the first two entries of a directory.

Bug id: 4171523 which was filed eons ago and not fixed (being a P4 it
dropped off the radar screen, it seems)

I've upped the priority, pinged the responsible engineer and
added that chown suffers from the same issue.

 

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth

 

 

Casper appears to be a pretty valid authority on such things, according to some research someone on my team did, turning up this:

 

 

 

 

Dan used Casper's information to find this:

 

 

"There is a Solaris BugID for this exact problem, they seem to know about it.
It appears to be only fixed for Solaris 9 and 10;

 

125499-01 - For Solaris 10 on sparc
123394-01 - For Solaris 9 on sparc

 

I [Dan] applied the patch to [a Sun system we use a lot], and all is well.
Fix is going to be on the Solaris side for this one...

 

The patch fixed chmod/chown as that is what it patched. It looks like chgrp is still broken, same exact defect.
So far, I cannot find where Sun has fixed chgrp for the same problem"

 

This is not a show stopper as near as we can tell, at least for us. Your shop, and mileage, will of course vary. Peeling back the covers a bit, Dan found the underlying bits that were causing the problem:

 

The GFS filesystem getdents() call returns the directory entries in no particular order. These get returned back, via NFS, to the Solaris client where the user space utils chmod/chown/chgrp EXPECT items #1 and #2 to be "." and "..". Depending on the returned list order, a loop can develop, and does in our example, until the ch* command has exhausted its user space open file limit. I confirmed that our LCFS server is NOT returning the list as the Solaris client expects. Note that, as far as I know, all other NFS clients have no problem with the list returned. Just SOLARIS!

 

Great! A Solaris bug that seems to be in most/all clients (I have tested [a solaris client] and [another solaris client]) triggered by an abnormal, but not illegal, return by the NFS server.
... I did test with XFS as the backing store filesystem, no problem. So it must be the GFS getdents() quirk.
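To show what "no particular order" means in practice, here is a small userspace C sketch, not Sun's code and not ours, of the difference Dan is describing. The directory path is made up; opendir()/readdir() are the standard POSIX calls that sit on top of getdents().

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
        /* hypothetical NFS-mounted directory, like the testdir above */
        DIR *d = opendir("/mnt/nfs/testdir");
        struct dirent *de;

        if (d == NULL) {
                perror("opendir");
                return 1;
        }

        /* Broken assumption (roughly what the affected Solaris
           chmod/chown code seems to have done): treat entries #1 and
           #2 as "." and ".." and descend into the rest.  If "." or
           ".." turns up later in the list, the tool re-enters the
           same directory and loops until it runs out of open file
           descriptors.

           Robust approach: skip "." and ".." by NAME, wherever they
           appear in the list.  Position carries no meaning.        */
        while ((de = readdir(d)) != NULL) {
                if (strcmp(de->d_name, ".") == 0 ||
                    strcmp(de->d_name, "..") == 0)
                        continue;
                printf("%s\n", de->d_name);  /* a real tool would chmod/recurse here */
        }

        closedir(d);
        return 0;
}

Nothing in readdir() or in the NFS protocol promises that "." and ".." come first, which is why the GFS ordering is, as Dan puts it, abnormal but not illegal.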

 

Relative Costs, Relative Features

 

I have noted here before why we went to the complexity and expense of the TruCluster, but I will not assume you have read everything in this blog over the years on that subject. It goes all the way back to the beginning in 2005, in posts like "Linux and NAS", where I noted this:

 

 

"We take a 2 tiered approach to NAS storage for R&D Support. In our first tier is the 5 9’s type storage. The stuff that just can’t go down. The bits and pieces that are used on our “assembly line” to build and manufacturer our own products. The kind of storage that, if it were down would idle hundreds of people around the world in R&D and endanger our time to market. And we know with a great deal of pain just how critical this storage is, because we used to use a storage appliance there, and it could not survive our network. It crashed all the time, and we paid for it dearly."

 

 

We paid pretty dearly for the TruCluster too: round numbers, about 140k per Terabyte. Sure, a single SATA disk holds a Terabyte now, and for a bit less money per TB. For fun, I divided the cost of a one Terabyte SATA disk by the TruCluster's cost per Terabyte, and the spreadsheet said that the disk basically cost nothing, as a percentage. Tweaking the precision up a bit in OpenOffice Calc, I get 0.00277. Pretty near free.

 

I will not say that the CentOS 5 based system is as good as our TruCluster is/was. It is both better and worse, depending on how you look at it. How you define "better". That it achieves high customer facing uptime was a requirement. That it is as fast or faster (and it is faster at some things, such as CIFS) was also a requirement. It would not even be worth pursuing without those very minimal goals. It is less expensive. On the down side, our little Linux machine is not as HA, since nothing invented on this planet today can match TruCluster on that score. Sigh. <tongue-in-cheek> I guess that is why it had to die. </tongue-in-cheek>

 

There are things the new server did not have to be. One obvious thing is that it did not have to be the same SSI cluster architecture as what came before it. TruCluster is just one possible SSI cluster. The best one out there, but there are others. There is a Linux SSI project, although we gave up waiting for it to mature to the same place (or near enough for our needs) as TruCluster. According to the feature matrix of the current OpenSSI product, it looks like it might be viable for NAS now: NFS-HA is listed in any case, plus "a highly available cluster filesystem with transparent failover." Maybe our next generation NAS server will have a look there. The technology moves so fast that every generation has been significantly different from the one that came before it. But I slightly digress.

 

 

The primary design goal of the TruCluster based NAS server was not to use the technology for its own sake but to have NO customer facing service outages. The new server did not have to be SSI. It just had to achieve the same thing from the point of view of our R&D customer. Serve files fast and reliably: be NAS data-tone.

 

The TruCluster is/was far higher performance on the I/O subsystem, to say the least. Hundreds of disk arms versus the one SATA arm in this comparison. Cache in the NAS heads. Cache in the HSG80 disk controllers. Cache inside the disks. Disks spinning 25% faster. It is not a fair or even sane comparison. At best it gives a hint about how one might go about building a lower cost solution with high density disks as a starting place. Even at the high cost of the 2001 Tru64 based solution, the avoided cost of downtime paid for the TruCluster over and over and over. I tracked it once. I figure, based on how badly the NAS appliances had hurt us, and based on a few times when the TruCluster stumbled on various issues but did not fall, thanks to its design, that we came out about two million dollars *ahead*.

 

 

With the passing of Tru64 into that dark night, the new kid on the block comes with a very different price point and way of doing things. I have posted the design here already (in the links above), and the speeds and feeds, so I will not beat that to death. The main point here is that our current tier 1 file server solution, 7 years down the road from our last one, is not the same technical solution, but leverages commodity parts and prices, is assembled a slightly different way to achieve the same service goal, and runs at about 1.5% of the cost per Terabyte.

 

That cost is not the whole story. The Tru64 TruCluster came with vendor support, hardware and software. Some of the best in the biz, too: ex-Digital folks with a passion for their gear. Our solution is supported by us, and while the hardware has support contracts with Sun and Apple, we also have onsite spares of most of the major subsystems so that we can get the affected subsystems back up and running fast. If it works right, most downs should *not* be customer facing.

So far, so good.