
Just because a hurricane hit us doesn't mean I can't write a blog post!



Last September we "stood up", for the very first time, our CentOS Linux-based cluster to replace the aged and unsupported Tru64 TruCluster. In fact, it was not all that long ago that I wrote the wrap-up article to that adventure, so I guess this is a postscript.



First off, the fact that I have changed roles has influenced several things around the file server: a new manager took over my team, and when the server had a problem she suddenly found out she had a Linux file server she was responsible for. It is documented seven ways from Sunday on the Wiki: Dan is amazing about things like that. The problem, of course, is that when everything is working, no one reads the doc, and when it fails they don't have time. Dan works with me on my new team, but he went back and fixed the file server for the old team a couple of times, and here is the nut of what this article is about. What I am about to say here is going to be true about any and every complicated bit of technology that people rely on every day: it is not limited to just Linux.



You have to know how to use the technology.



The Linux NAS was never advertised as being as good as the TruCluster that preceded it, but when it failed, it took people who understood TruCluster / Tru64 / ADVFS to fix it. The same is true of any technology stack I have ever worked with.



Technology is only as good as the people and process that support it. See ITIL for details.



This is a truth that I think about all the time in my new role as a technologist. 10% of the work is designing the solution. The rest of it is training, communicating, and then, more than likely, going back and retraining some more.



Along comes this hurricane named Ike, and it is huge: As big as the state of Texas from side to side. Houston's power grid crumbled before Ike. The Linux NAS server has a weak spot in the design: It will not run without electrons. I know, I know: We should have had wind power as a backup. Next time....



When power returned, the Global File System that underlies the core design of the NAS marked many high-I/O, high-usage file systems as needing repair, and they would not mount. The log said that each file system had been "withdrawn":


--------------------- GFS Begin ------------------------

  WARNING: GFS filesystems withdraw
     GFS: fsid=rnd-fs:p4_gfs.1: withdrawn:

  WARNING: GFS withdraw events
      [<ffffffff884c3c94>] :gfs:gfs_lm_withdraw+0xc4/0xd3:
     GFS: fsid=rnd-fs:p4_gfs.1: about to withdraw from the cluster:
     GFS: fsid=rnd-fs:p4_gfs.1: telling LM to withdraw:

  WARNING: GFS fatal events
     GFS: fsid=rnd-fs:p4_gfs.1: fatal: filesystem consistency error:

  ---------------------- GFS End -------------------------
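When several filesystems withdraw at once, as happened after the power loss, it helps to pull the affected fsids out of the logs before starting recovery. Here is a small sketch of how that could look; the helper name is mine, and the syslog path in the usage note is an assumption (the default CentOS location), not something from Dan's Wiki:

```shell
# Hypothetical helper: scan syslog files for GFS withdraw messages.
# The pattern matches the "withdrawn" lines shown in the log excerpt above.
find_withdrawn() {
    grep -h "GFS: fsid=.*withdrawn" "$@" | sort -u
}
```

On a cluster node you would point it at the syslog files, e.g. `find_withdrawn /var/log/messages*`, to list each affected fsid once.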


This is sysadmin 101 stuff: fsck, fix things, and you are back up and running... except that in the cluster and GFS the command's name is not fsck. And you cannot just fsck: here, then, is what Dan wrote on our Wiki about how to recover from this:




HOWTO: Recover a GFS filesystem from a "withdraw" state


When a corrupt GFS filesystem structure is discovered by a node, that node will "withdraw" from the filesystem. That is, all I/O for the corrupted filesystem will be blocked on that node to prevent further corruption. Note that other nodes may still have access to the filesystem, as they have not discovered the corruption.


  • halt/reboot - Use a hardware halt on the node that is in the "withdraw" state and then reboot that node.  


Note: A simple reboot command should work, but on our version of the cluster it seems to hang in the GFS umount stage on the withdrawn filesystem. So a hard reboot of the node seems to be required at this time.


  • umount ${MOUNT_POINT} - Unmount the filesystem on ALL NODES!  
  • gfs_fsck ${BLOCK_DEVICE} - Run a full fsck. Run on one node only!  
  • mount ${MOUNT_POINT} - On all nodes, to restore service.  


Note: nfsd will hang on the withdrawn filesystem. You may
need to relocate the NFS service to a surviving node first!
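Dan's steps above can be sketched as a short shell script. This is a minimal illustration, not his actual procedure: the mount point, block device, and the DRY_RUN wrapper are all assumptions made up for the example, and in real life the umount and mount steps have to be run on every node while gfs_fsck runs on exactly one.

```shell
#!/bin/sh
# Hypothetical sketch of the withdraw-recovery steps above.
# MOUNT_POINT and BLOCK_DEVICE are placeholder values.
# With DRY_RUN=1 (the default here) the commands are only printed,
# so the sequence can be reviewed before anything destructive runs.
MOUNT_POINT=/mnt/p4_gfs
BLOCK_DEVICE=/dev/vg_gfs/p4_gfs
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 0. Per the note above: relocate the NFS service to a surviving
#    node first, or nfsd will hang on the withdrawn filesystem.

# 1. Unmount the withdrawn filesystem -- on ALL nodes.
run umount "$MOUNT_POINT"

# 2. Repair the filesystem -- on exactly ONE node.
run gfs_fsck "$BLOCK_DEVICE"

# 3. Remount on every node to restore service.
run mount "$MOUNT_POINT"
```

Running it as-is just prints the three commands; setting DRY_RUN=0 on each node (in the right order) would execute them for real.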




Since the cluster went into production, Dan has had to perform this particular recovery about four times. Ike only gets credit for this last one. The other three were caused by a single node failing and leaving I/O pending, which in turn appears to be the ILOM card in the node acting up.



Next time: Some other handy Linux cluster things to know before your Linux based cluster fails...