This post continues the look at some of the useful things you need to know when going into production with Linux. Most of it applies to any other OS too, but the examples all come from our Linux deployments, specifically the first year on our Linux-based NAS.
Last post I talked about using a special version of "fsck" to repair GFS-based file systems. As I thought about that post I realized that I had some more general things I wanted to get into in this area. I also noted that I would talk in more detail about useful commands and related things we have learned along the way that you should know about *before* your Linux-based cluster fails.
As I also alluded to in my last post, the first year on the Linux cluster has not been utterly pain free. For one thing, we had the fencing set up wrong, so that when a node failed for any reason it could not be "shot in the head" and recovered by the surviving nodes. This has now been fixed: I know this because last night a heartbeat failed, the node was fenced, and the service recovered on a surviving node. Then the heartbeat returned and the cluster was whole again. We did nothing, and there was no customer-facing outage. Here is Dan's exact verbiage:
I see that node #1 on the file serving cluster here in Houston rebooted this morning. That's the node that normally handles the NFS service by default priority. It looks like node #1 lost the heartbeat token to the other nodes. Probably due to the NIC driver or something. This is the same thing that happened a couple of weeks back on node #3.
This time, the cluster recovered without my help. As it should! With the fence configuration fixed now, node #2 was able to fence (reset) node #1 via the Sun ELOM (IPMI over LAN) and node #1 then rebooted and joined the cluster again. All is well!
The cluster resource manager moved NFS to node #2 to maintain that service. For now, I have left NFS on node #2 with the Samba service that normally defaults to node #2.
The cluster did its thing, no service outage. Although I suspect NFS stalled for a few moments and then took off again... Life is Good!
We don't yet know why heartbeat is getting lost from time to time, but at least we now totally survive it when it happens. More on that in a second...
This takes me to the things I wanted to say about a few design choices we made in setting up the cluster, and I want to tie these back to another, different cluster we deployed with far less success 8 or so years ago as well as the TruCluster that the Linux cluster replaced.
Design point / choice number one: If you have read the previous series about our NAS cluster design, you might have followed a link to this picture. If not, and you do so now, you will see that we chose to implement three nodes: Sun X2200s in our case. Why?
Our TruCluster (may it rest in peace, and in this case, pieces) had magnificent uptime. It ran and ran with rarely a burp. However, the TruCluster was two ES40 nodes. If we took one down to apply patches, we were left literally "standing on one foot": we were not HA any more. At least once, the *other* node failed while we had one down for service, which meant we had a customer-facing outage.
With the price of an ES40, a third node would have been a significant bit of money for the insurance. Our thinking at the time: this is the best cluster software on the planet (when it was viable, before TruCluster was led to the firing line), so what are the chances we'll take a hit on the surviving node when we have one down for service?
As with all Disaster Recovery / Business Continuance math, that question is tricky, and the real answer is: "There is a 100% chance of the surviving node going down while the other node is offline... given enough time." In the seven years that the TruCluster was in service, it happened to us at least once.
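To put rough numbers on that "given enough time" answer, here is a small Python sketch. It assumes each maintenance window is independent and carries the same chance of the surviving node failing. The function name and the 1% figure are made up for illustration and are not measurements from our TruCluster.

```python
def chance_of_outage(p_fail_per_window: float, windows: int) -> float:
    """Chance that the surviving node fails during at least one of
    `windows` maintenance windows, assuming independent windows that
    each carry the same failure probability."""
    if not 0.0 <= p_fail_per_window <= 1.0:
        raise ValueError("p_fail_per_window must be between 0 and 1")
    if windows < 0:
        raise ValueError("windows must not be negative")
    # P(at least one failure) = 1 - P(no failure in any window)
    return 1.0 - (1.0 - p_fail_per_window) ** windows


if __name__ == "__main__":
    # Hypothetical 1% chance per window; watch it climb with the window count.
    for windows in (1, 10, 50, 200, 1000):
        print(windows, round(chance_of_outage(0.01, windows), 4))
```

Even a small per-window risk creeps toward certainty as the number of windows grows, which is the whole argument for buying the spare node when it is cheap.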
Commodity hardware and Linux change the spare-hardware-insurance math. The price of a third X2200 plus Linux is an order of magnitude less than another ES40 node would have been. More than an order. There is the issue of increased complexity though, and I'll come back to that in a bit. To lead into the complexity issue, I want to go back to the heartbeat design.
Most clusters use a dedicated bit of hardware for the heartbeat internode signaling. If you use only one interconnect though, you have a single point of failure. The CentOS cluster software does not require a private network segment for heartbeat, and in fact the default is to use the public network segment. That appears to be considered "Best Practice".
If the cluster is done right, then at least two high speed, modern, supported, monitorable network switches are in play, and each of the three nodes connects to *both*. The heartbeat signaling is small, low bandwidth traffic. With the port to port switching, high speed switch backplane, and second switch redundancy, the heartbeat should be fine. To do this right on private networks would mean adding two *more* high speed switches, plus two more NICs to each server. At some point the cost and complexity are not returning much in the way of value, and may in fact be adding more points of failure to your cluster, such that it starts failing when nothing is really even wrong!
OK: That is the theory, but as Dan's note indicates, the theory is being challenged by the occasional loss of heartbeat. I hate it when that happens!
That seems like a good way to move into my point about the cluster we used to have that lowered our uptime. A lot. It was not a Linux cluster, but it was a vendor supported, vendor installed, vendor configured solution using some of the better clustering technology of the day that was not TruCluster.
The problem was that the application that ran on the cluster was not cluster aware, and we were never able to fully script it so that all the bits and pieces of the application would fail over when there was a problem. The app, not knowing all this redundant stuff was out there, was often confused as to which node it was running on, and failed at least once a month. We finally took the cluster apart, created two standalone servers, and uptime went up to over a year.
There were echoes of this when we first built the Linux cluster: neither NFS nor Samba is really cluster aware, at least not yet. I think Samba will be soon. NFS, being stateless, does not really need to be as cluster aware as one might think. Since GFS keeps the file state, and keeps all the underlying addressing mechanisms for the files the same across all the nodes, NFS can stop and restart anywhere.
You can see the results of this design in Dan's account of the failure. We are doing most of our cluster magic by using GFS as the file system, so that all nodes can mount the same FS yet not overwrite each other. The CentOS cluster software only has to worry about where a particular service is running, and about moving services around. File state is not its job. We then set it up so that NFS runs on one node, Samba on another, and the third is insurance... that inexpensive insurance we could not afford on the TruCluster.
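As a way to picture that division of labor, here is a toy Python model of "where should each service run right now". It is only a sketch of the idea, not how the CentOS cluster software is implemented. The node names and the preference order beyond each service's default node are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set

# Preferred node order per service. The first entry matches the defaults
# described in this post (NFS on node #1, Samba on node #2); the rest of
# each ordering is assumed for illustration.
PREFERENCES: Dict[str, List[str]] = {
    "NFS": ["node1", "node2", "node3"],
    "Samba": ["node2", "node3", "node1"],
}


def place_services(preferences: Dict[str, List[str]],
                   online: Set[str]) -> Dict[str, Optional[str]]:
    """Put each service on the first node in its preference list that is
    currently online, or None if no listed node is available."""
    placement: Dict[str, Optional[str]] = {}
    for service, ordered_nodes in preferences.items():
        placement[service] = next(
            (node for node in ordered_nodes if node in online), None
        )
    return placement


if __name__ == "__main__":
    # Normal day: all three nodes are up.
    print(place_services(PREFERENCES, {"node1", "node2", "node3"}))
    # Node #1 has been fenced: NFS lands on node #2 next to Samba.
    print(place_services(PREFERENCES, {"node2", "node3"}))
```

The model never touches file state at all, which is the point: GFS handles that, so moving a service is just a question of picking another node.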
The takeaway from all this, then: Clusters do not, in and of themselves, make everything magically more HA. You have to start with the best of breed in cluster software, but you also have to know the cluster environment, and test it seven ways from yesterday to be sure that in failure mode it is actually doing what you think it should be doing. This ties back to something I said last post. To paraphrase: There is no substitute for knowing what you are doing. Linux is not magic. Clustering is not magic. All the magic comes from your people. Your business is only as good as your process: process designed by people who do not know what they are doing will land you in a world of hurt.
Today's Linux Cluster Commands
Hopping down off my soapbox now, here is another bit of Cluster Wisdom (tm) as documented by our NAS Wizard, Dan Goetzman. First off, I want to back up and establish some common terms, which Dan provides here:
The "TruCluster Replacement" project is an evolution of our LCFS (Low Cost File Server) project using Linux clustering (CentOS) and the "Snapple" (SuN x2200 servers and APPLE XServe Raid (XSR) storage) hardware platform.
- In addition to the features of the "Snapple" based LCFS platform...
- CentOS 5 cluster technology to provide failover services for NFS and Samba.
- GFS Cluster/SAN/Parallel filesystems for user data.
- CLVM to make the SAN storage available on all cluster nodes.
- Public network failover using Linux "bonding" driver.
The cluster consists of:
- rnd-fs - Main cluster name (NOT in dns).
- rnd-fs01 - Cluster node #1 (default node for NFS service).
- rnd-fs02 - Cluster node #2 (default node for Samba service).
- rnd-fs03 - Cluster node #3 (default backup and virus scanning service).
- rnd-clunfs.bmc.com - NFS server service.
- rnd-fs.bmc.com - Samba server service.
- yellow.bmc.com - Big Brother service.
So, now that we have some common terms and server names, the commands in this and future posts will have context. Finally for today, some helpful cluster commands:
- clustat - To view the normal cluster configuration.
    Member Status: Quorate

    Member Name        ID   Status
    ------ ----        ---- ------
    rnd-fs01           1    Online, Local, rgmanager
    rnd-fs02           2    Online, rgmanager
    rnd-fs03           3    Online, rgmanager

    Service Name       Owner (Last)     State
    ------- ----       ----- ------     -----
    service:NFS        rnd-fs01         started
    service:Samba      rnd-fs02         started
- clusvcadm -r NFS -m rnd-fs03 - To relocate the NFS service to node #3.
- system-config-cluster - To configure the cluster.
- ip addr show - To display where the cluster is serving a cluster IP resource.
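If you want to feed the clustat output shown above into a monitoring check, a small parser is enough. The Python sketch below works on a pasted copy of that sample rather than running clustat itself. The function name is hypothetical, and the column layout is assumed to match the sample, so check it against your own output before trusting it.

```python
# Sample text copied from the clustat output in this post.
SAMPLE = """\
Member Status: Quorate

Member Name        ID   Status
------ ----        ---- ------
rnd-fs01           1    Online, Local, rgmanager
rnd-fs02           2    Online, rgmanager
rnd-fs03           3    Online, rgmanager

Service Name       Owner (Last)     State
------- ----       ----- ------     -----
service:NFS        rnd-fs01         started
service:Samba      rnd-fs02         started
"""


def parse_clustat(text: str) -> dict:
    """Split clustat-style text into quorum state, members, and services."""
    result = {"quorate": False, "members": {}, "services": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("-"):
            continue  # skip blank lines and the dashed header rules
        if line.startswith("Member Status:"):
            result["quorate"] = line.split(":", 1)[1].strip() == "Quorate"
        elif line.startswith("Member Name"):
            section = "members"
        elif line.startswith("Service Name"):
            section = "services"
        else:
            parts = line.split(None, 2)
            if len(parts) != 3:
                continue  # not a row in the layout this sketch expects
            if section == "members":
                name, node_id, status = parts
                flags = [flag.strip() for flag in status.split(",")]
                result["members"][name] = {"id": int(node_id), "flags": flags}
            elif section == "services":
                name, owner, state = parts
                result["services"][name] = {"owner": owner, "state": state}
    return result


if __name__ == "__main__":
    status = parse_clustat(SAMPLE)
    for name, info in status["services"].items():
        print(name, "is", info["state"], "on", info["owner"])
```

From there it is a short step to alerting when a service is not on its default node, or when the cluster is no longer quorate.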
That should do it for this time. Next time, shutting down and rebooting a single node, and removing a node from a cluster, plus some troubleshooting.