
The chief NAS blaster of R&D Support gets bugged about the NFS Create problem

 

Specifically, this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=431253

 

Which is the same as this one:

http://bugs.centos.org/view.php?id=2583

 

Attention Deficit

 

When the bug we had open at CentOS had languished for a while, Dan and I talked about it for a bit, and then he scooped up my test computer (goodbye to PCLinuxOS for now) and went back to his desk to try something out. The goal was to see if he could recreate the bug we had found on CentOS 5 on RedHat EL5 as well. Given the way CentOS is derived, it stood to reason that CentOS would have all the same bugs as RH EL5, and that the reverse was probably true too: RH would have all the same bugs as CentOS.

 

He carefully put together a test RH NFS server that uses GFS as the file system, making sure that all the same packages at all the same releases were installed. Then he tested the HP-UX clients, and sure enough, they failed the exact same way. Here is the text from the RH Bugzilla bug (formatting is mine...):

 


Description of problem:

HP-UX NFS clients fail creating a new file on a CentOS 5 NFS server updated with the 2.6.18-53.1.6.el5 when GFS is used as the backing filesystem.

Version-Release number of selected component (if applicable):

kmod-gfs-0.1.16-5.2.6.18_8.el5
kernel-2.6.18-53.1.6.el5

How reproducible:

Using any HP-UX client.
I tested using both 11.00 and 11.23

Steps to Reproduce:

1. Create a GFS filesystem on the server
2. NFS export the filesystem
3. Mount on any hp-ux client
4. hp-ux$ cp anyfile /to/nfs/fs/on/rhas5

Actual results:

hp-ux$ cp: cannot create anyfile: Permission denied

Expected results:

No error

Additional info:

HP-UX clients first create the new file using the NFS procedure CREATE (using
UNCHECKED and mode=0) and the server returns NFS3_EACCESS. I traced the -EACCES
error to the generic_permission kernel function. Apparently ext3 and xfs
filesystems do not use generic_permission but gfs does and it returns -EACCES to the nfsd.

I have created a simple patch to check for the -EACCES error and allow access
if the FSUID=Inode-UID. This resolves the problem but is probably not the best
way to fix the bug. I will attach the patch for reference. Hopefully those more
knowledgeable than I can determine the correct fix needed to resolve the root
cause of this bug.
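
To make the failure concrete, here is a rough user-space approximation of the create pattern the report describes: make the file with mode 0 first, write the data, and only set the real permissions afterwards. This is a sketch only, not the HP-UX cp code, and a Linux client is not guaranteed to produce the identical RPC sequence; the target path simply mirrors the example path from the bug report.

/* Hedged illustration: approximates the HP-UX client's create-then-set-mode
 * pattern from user space. On the affected server, the initial create is
 * what comes back with "Permission denied". */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/to/nfs/fs/on/rhas5/anyfile";  /* example path from the bug */
        const char buf[] = "test data\n";

        /* Step 1: create with mode 0, roughly CREATE (UNCHECKED, mode=0) on the wire. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0);
        if (fd < 0) {
                fprintf(stderr, "create failed: %s\n", strerror(errno));
                return 1;
        }

        /* Step 2: write the file contents. */
        if (write(fd, buf, sizeof(buf) - 1) < 0)
                perror("write");

        /* Step 3: set the real permissions afterwards (a SETATTR on the wire). */
        if (fchmod(fd, 0644) < 0)
                perror("fchmod");

        close(fd);
        return 0;
}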

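And for reference, here is the shape of the workaround the report describes: if the filesystem's permission check came back with -EACCES but the requesting FSUID owns the inode, let the operation through. This is a sketch of the idea only, not the patch actually attached to the bug; the helper name and where it would be hooked in are assumptions, and the owner test is written the 2.6.18-era way (current->fsuid).

/* Hedged sketch of the workaround idea, NOT the actual patch from the bug.
 * The helper name and its call site are assumptions. */
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/sched.h>

static int allow_owner_on_eacces(struct inode *inode, int err)
{
        /* GFS's generic_permission() rejects the mode-0 file the HP-UX client
         * just created. The creator still owns the inode, so let the owner
         * through and leave every other error untouched. */
        if (err == -EACCES && current->fsuid == inode->i_uid)
                return 0;

        return err;
}

As the report itself says, this unblocks the HP-UX clients but is probably not the correct long-term fix; sorting out the root cause is what the upstream folks are being asked to do.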

More Debugging

 

The case was picked up by Robert Peterson of RedHat, and the first thing he wanted us to do was to check and be sure the behavior still existed in the latest CVS version of the code. There were some changes in the GFS extended attribute code that might have an impact on this. Robert did not have access to an HP-UX workstation to verify it himself.

 

Dan loaded up the CVS version of the code, verified that it did not fix the problem, and submitted the traces he had from both CentOS and RH. That is where the matter now stands.

Why?

 

It appears, looking through the various Bugzilla reports, that the RedHat folks do not get wrapped around the axle when bugs show up and are reported against CentOS instead of their distro: the spirit of Open Source appears to hold true, and they are more concerned about the bugs getting fixed than about where they were found.

 

At the same time, reporting it against CentOS did not get any obvious attention. We knew we were out on the bleeding edge with our configuration anyway, but we theorized that it would get some cycles if we did some work to reproduce it in RedHat code. We did, and it did. We now have someone looking at it, and what we hope is that a fix will get generated that will roll into everyone's base code and be POSIX compliant as well. Given the way CentOS is derived, fixing it in RedHat fixes it in CentOS at the same time, and sooner or later everyone in all the distros that use the GFS code will get it.

Works for Us

 

In the meantime, Dan's patch has kept the CentOS cluster stable and doing exactly what we want for going on a month now. We are putting new file systems on it, taking nodes down without disrupting service, and so forth. It is pretty much exactly what we were hoping for.

 

Why would we need to take a node down? The most recent example was the lights-out management cards. Dan had discovered that they also run Linux, and that there was a new version of the embedded code for them. We had suffered a few hangs on the cards and wanted them patched up to current. Because the servers are in a cluster, Dan could take a node down, flash the iLO card, bring it back up, then do the next node, and then the next, and our customers never even noticed he was working on the system.

 

When it works like that, it is a beautiful thing.