Share: |


The Chief R&D Support NAS Basher takes a deep dive into kernel code to fix our CentOS cluster for HP-UX clients

 

  Since   I   last posted here about out CentOS NAS cluster, we have been in the weeds.   Our hopes for Linux being able to deal with this Enterprise class level of   support have been shaken *and* stirred. I will let Dan Goetzman tell the story   in a sec, but first some background since my last post.

 

  When we first released the CentOS server, it was not into full production,   move everything from the Tru64 server mode. We were more cautious than that.   The Tru64 file server, despite being out of support and now running on   hardware with no support contract was still not causing any problems. Not any   *new* ones anyway. So we migrated our groups home directories first, and then   a few "lower availability required" file systems, and then sat back and   evaluated.

 

  At first it looked like we would go ahead and live with the Sun NFSV2 Stale   Handle problem   (noted   in my first post), but then a raft of patches to the kernel came out, and   there were quite a number of them that hit areas of the kernel that were of   interest to us, specifically in NFS and GFS.

 

Dan and I talked about it, and decided to try the new kernel. That meant   re-certification but we decided to try it on the test hardware. Immediately   Dan found a problem with HP-UX clients, and it was *deadly*. Worse, we found   out the old server had this problem too! We had not actually tested the entire   mix of HP-UX clients possible.

  The HP-UX Problem

 

UNIX and Linux have the concept of bits set to define the read and write   ability of a file. If I own file 'xyz' and the write bits are turned off, I   can not write to the file even though I own it. I can use the 'chmod' command   to turn the write bit on, and then I can write to it.

 

The funny thing about that is that by using the chmod command, I am   technically writing to the file, actually the inode of the file. That means   that there is a bit of code someplace that makes sure I own the file and can   do it.

 

With GFS, and *only* GFS as the backing store, and HP-UX and only certain   versions of HP-UX as the client, accessing via NFS, we went down a code path   where HP-UX would attempt to creat a file, and then get rejected when it tried   to write to the file.

 

Dan's initial look at this came back with the theory that the GFS team had   begun to use certain generic kernel file system semantics, and that other file   systems like EXT3 and XFS had not.

 

This was a show stopping problem. Our environment is far too heterogeneous to   work with a gaping hole like this. We talked about it some more. Dan's   research had found one other post about this issue, meaning we were out   someplace in the code that very few had followed into. That was going to mean   that the "Many Eyes, Shallow Bugs" leverage of Open Source was not working to   our advantage.

 

Having the source code meant that we could see if this was something that we   could fix, but Dan told me at least three times that he was not a kernel guy,   and that he was not even sure what the Posix compliant behavior should be. He   decided to take a swing at it anyway. I turn it over to him here:

  Dan's Kernel Story

 

  I finally have "hacked together" a fix for the "HP-UX NFS client   on a el5 based NFS server with GFS filesystems" problem!
  
   After adding a bunch of "printk's" to the kernel and many kernel builds, I was   able to trace down the kernel function that was at the root of our problem. It   seems that NFSD calls
vfs_create (and that returns OK)   and then calls nfsd_setattr to set the file attributes   correctly. nfsd_setattr does a few things and ends up   calling notify_change, and down the road a bit more will   end up calling gfs_setattr and then farther down the path   will end up calling generic_permission (a regular kernel   routine).
  
   It's this
generic_permission call that returns   -EACCES. Apparently due to the fact that the file was   created with the correct owner, but with NO access   permissions in the case of the HP-UX NFS client. Interesting, this   generic_permission call is supposed to replace the   gfs_permission call that was the way it was done in the   pre 2.6.10 days. Apparently GFS is the only filesystem (as of the el5 vintage)   that has made this change. ext3 does not yet call   generic_permission. I found patches to make this change   to XFS, but a trace of XFS on el5 reveals it does not call   generic_permission at this time. So, that's why it only   fails on GFS on CentOS5!
  
   Not really wanting to change a kernel function that other things might call, I   elected to change where the nfsd layer in the kernel gets the error returned   (by
notify_change). My hack simply checks if   notify_change returns error=-EACCES   and then IF (NFS uid == inode uid) reset the error var to   0. That is, if the owner of the inode is the same as the calling owner uid   then allow access. I added a printk at the kernel.debug level so I can see   this via syslog if I have the kernel.debug level set to log. To verify it   works...
  
   Initial tests indicate success. I have all 3 nodes on the cluster up on this   "BMCFIX" kernel now. HP-UX NFS clients seem to work AOK now.
   I posted this to bugs.centos.com case, to have the experts look at how to   provide a more permanent fix. As I am not really up on things like POSIX   compliance and all. This is just a hack to prove that I am on the correct path   and it will resolve the problem.
  
   Anyhow, here is the patch if you are interested in the exact code that was   added to fs/nfsd/vfs.c to the nfsd_setattr function:

 

 

  --- vfs.c_save  2008-01-18 13:06:50.000000000 -0600
   +++ vfs.c       2008-01-18 13:18:40.000000000 -0600
   @@ -348,6 +348,11 @@
           if (!check_guard || guardtime == inode->i_ctime.tv_sec) {
                   fh_lock(fhp);
                   err = notify_change(dentry, iap);
   +               /* Allow access override if owner for HP-UX NFS client bug on GFS */
   +               if (err == -EACCES & (current->fsuid == inode->i_uid)) {
   +                       printk (KERN_DEBUG "nfsd_setattr: Bug detected! Ignoring -EACCES error for owner\n");
   +                       err = 0;
   +               }
                   err = nfserrno(err);
                   fh_unlock(fhp);

  Dan's bug number is at http://bugs.centos.org and is 2583.

  We Admit: Its a Hack

 

  I post this all here in the spirit of openness, should anyone follow us out   here to the bare edge of GFS based NAS servers. We do not know what the right   way to really fix this problem would be, but we looked at it as being like my   example about owning a file system being an implicit authority to at least   write to the inode.

 

Dan stood this code up a four days ago, and so far, so good. In fact, we know that it is doing what we want in terms of being a cluster because a "network burp" caused thre NFS service to migrate from one node to another. We only knew about it because we saw it in the log. The customer facing service kept right on running.

  Open Source

  This problem has all sorts of things about the advantages and dis-advantages   of Open Source, all wrapped into one neat bug number.

  1.      By having the source code, and a guy good enough to read and understand it,     we were able to fix a severe problem in-house, with relying on anyone  
  2.      Because we were on the bleeding edge where very few folks appear to be, we     were on our own. "Many Eyes, Shallow Bugs" principle does not work when     there are not many sets of eyes looking at all the possible cases  
  3.      Linux is great for a heterogeneous environment, as long as one is willing to     put in the time and effort sometimes. Along the way of shooting this bug,     Dan was laughing about some of the code comments about all the other patches     in Linux to deal with various corner cases for things like Irix and other     more obscure combinations of problems. It is easy to see why the embedded     market loves Linux.
      
  4.      By choosing CentOS, we chose not having a support option, but one way out of     this would be to use the equivalent version of RedHat, and taking out a     support contract. That back door possibility was part of the attraction of     CentOS.  
  5.      By tripping over this now, and documenting it, we have hopefully made life     easier for whomever comes this way next: Dan notes that XFS is getting ready     to start using the kernel provided file systems semantics, so they would     have seen this next.