
Kernel Hackage

Posted by Steve Carl Jan 22, 2008

The Chief R&D Support NAS Basher takes a deep dive into kernel code to fix our CentOS cluster for HP-UX clients

 

Since I last posted here about our CentOS NAS cluster, we have been in the weeds. Our hopes for Linux being able to deal with this Enterprise class level of support have been shaken *and* stirred. I will let Dan Goetzman tell the story in a sec, but first some background since my last post.

 

When we first released the CentOS server, it was not in full, move-everything-from-the-Tru64-server production mode. We were more cautious than that. The Tru64 file server, despite being out of support and now running on hardware with no support contract, was still not causing any problems. Not any *new* ones anyway. So we migrated our group's home directories first, and then a few "lower availability required" file systems, and then sat back and evaluated.

 

At first it looked like we would go ahead and live with the Sun NFSV2 Stale Handle problem (noted in my first post), but then a raft of patches to the kernel came out, and there were quite a number of them that hit areas of the kernel that were of interest to us, specifically in NFS and GFS.

 

Dan and I talked about it, and decided to try the new kernel. That meant re-certification, but we decided to try it on the test hardware. Immediately Dan found a problem with HP-UX clients, and it was *deadly*. Worse, we found out the old server had this problem too! We had not actually tested the entire mix of HP-UX clients possible.

  The HP-UX Problem

 

UNIX and Linux have the concept of bits set to define the read and write ability of a file. If I own file 'xyz' and the write bits are turned off, I can not write to the file even though I own it. I can use the 'chmod' command to turn the write bit on, and then I can write to it.

 

The funny thing about that is that by using the chmod command, I am technically writing to the file, or actually to the inode of the file. That means that there is a bit of code someplace that makes sure I own the file and can do it.
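
To make that concrete, here is a minimal shell sketch of the idea (the file name is just an example): even with every write bit turned off, the owner can still rewrite the mode bits stored in the inode.

$ touch xyz && chmod 444 xyz    # my file, but no write bits anywhere
$ echo data >> xyz
bash: xyz: Permission denied    # writing the file's data is refused...
$ chmod u+w xyz                 # ...yet chmod succeeds, because I own the file
$ echo data >> xyz              # and now the write goes through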

 

With GFS, and *only* GFS as the backing store, and HP-UX, and only certain versions of HP-UX, as the client, accessing via NFS, we went down a code path where HP-UX would attempt to create a file, and then get rejected when it tried to write to the file.

 

Dan's initial look at this came back with the theory that the GFS team had begun to use certain generic kernel file system semantics, and that other file systems like EXT3 and XFS had not.

 

This was a show-stopping problem. Our environment is far too heterogeneous to work with a gaping hole like this. We talked about it some more. Dan's research had found only one other post about this issue, meaning we were out someplace in the code that very few had followed into. That was going to mean that the "Many Eyes, Shallow Bugs" leverage of Open Source was not working to our advantage.

 

Having the source code meant that we could see if this was something that we could fix, but Dan told me at least three times that he was not a kernel guy, and that he was not even sure what the POSIX compliant behavior should be. He decided to take a swing at it anyway. I turn it over to him here:

  Dan's Kernel Story

 

I finally have "hacked together" a fix for the "HP-UX NFS client on an el5 based NFS server with GFS filesystems" problem!
  
After adding a bunch of "printk's" to the kernel, and many kernel builds, I was able to trace down the kernel function that was at the root of our problem. It seems that NFSD calls vfs_create (and that returns OK) and then calls nfsd_setattr to set the file attributes correctly. nfsd_setattr does a few things and ends up calling notify_change; a bit farther down the road that ends up calling gfs_setattr, and farther down the path still, generic_permission (a regular kernel routine).
  
It's this generic_permission call that returns -EACCES, apparently because the file was created with the correct owner, but with NO access permissions, in the case of the HP-UX NFS client. Interestingly, this generic_permission call is supposed to replace the gfs_permission call that was the way it was done in the pre-2.6.10 days. Apparently GFS is the only filesystem (as of the el5 vintage) that has made this change. ext3 does not yet call generic_permission. I found patches to make this change to XFS, but a trace of XFS on el5 reveals it does not call generic_permission at this time. So, that's why it only fails on GFS on CentOS 5!
  
Not really wanting to change a kernel function that other things might call, I elected to change where the nfsd layer in the kernel gets the error returned (by notify_change). My hack simply checks if notify_change returns error=-EACCES, and then IF (NFS uid == inode uid) resets the error var to 0. That is, if the owner of the inode is the same as the calling owner uid, then allow access. I added a printk at the kernel.debug level so I can see this via syslog if I have the kernel.debug level set to log. To verify it works...
  
Initial tests indicate success. I have all 3 nodes on the cluster up on this "BMCFIX" kernel now. HP-UX NFS clients seem to work AOK now.
I posted this to a bugs.centos.org case, to have the experts look at how to provide a more permanent fix, as I am not really up on things like POSIX compliance and all. This is just a hack to prove that I am on the correct path and it will resolve the problem.
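
For reference, routing those kernel.debug messages to a file with classic syslogd takes one selector line (the log path here is just an example):

# /etc/syslog.conf - capture kernel debug-level messages
kern.debug                              /var/log/kern-debug.log

# then watch for the override firing:
tail -f /var/log/kern-debug.log | grep nfsd_setattr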
  
Anyhow, here is the patch if you are interested in the exact code that was added to fs/nfsd/vfs.c, in the nfsd_setattr function:

 

 

  --- vfs.c_save  2008-01-18 13:06:50.000000000 -0600
   +++ vfs.c       2008-01-18 13:18:40.000000000 -0600
   @@ -348,6 +348,11 @@
           if (!check_guard || guardtime == inode->i_ctime.tv_sec) {
                   fh_lock(fhp);
                   err = notify_change(dentry, iap);
   +               /* Allow access override if owner for HP-UX NFS client bug on GFS */
   +               if (err == -EACCES && (current->fsuid == inode->i_uid)) {
   +                       printk (KERN_DEBUG "nfsd_setattr: Bug detected! Ignoring -EACCES error for owner\n");
   +                       err = 0;
   +               }
                   err = nfserrno(err);
                   fh_unlock(fhp);

  Dan's bug number is at http://bugs.centos.org and is 2583.

We Admit: It's a Hack

 

I post this all here in the spirit of openness, should anyone follow us out here to the bare edge of GFS based NAS servers. We do not know what the right way to really fix this problem would be, but we looked at it as being like my earlier example: owning a file is an implicit authority to at least write to its inode.

 

Dan stood this code up four days ago, and so far, so good. In fact, we know that it is doing what we want in terms of being a cluster because a "network burp" caused the NFS service to migrate from one node to another. We only knew about it because we saw it in the log. The customer facing service kept right on running.

  Open Source

This problem illustrates all sorts of things about the advantages and disadvantages of Open Source, all wrapped into one neat bug number.

  1. By having the source code, and a guy good enough to read and understand it, we were able to fix a severe problem in-house, without relying on anyone.
  2. Because we were on the bleeding edge where very few folks appear to be, we were on our own. The "Many Eyes, Shallow Bugs" principle does not work when there are not many sets of eyes looking at all the possible cases.
  3. Linux is great for a heterogeneous environment, as long as one is willing to put in the time and effort sometimes. Along the way of shooting this bug, Dan was laughing about some of the code comments about all the other patches in Linux to deal with various corner cases for things like IRIX and other more obscure combinations of problems. It is easy to see why the embedded market loves Linux.
  4. By choosing CentOS, we chose not having a support option, but one way out of this would be to use the equivalent version of Red Hat and take out a support contract. That back door possibility was part of the attraction of CentOS.
  5. By tripping over this now, and documenting it, we have hopefully made life easier for whomever comes this way next: Dan notes that XFS is getting ready to start using the kernel provided file system semantics, so they would have seen this next.

Two Heads

Posted by Steve Carl Jan 17, 2008

A sequel of sorts to PCLinuxOS 2007 and Mint 4.0: ELDs?

 

I have read over and over that having two screens on a single computer increases productivity by 20-30%. Linux has had dual monitor support via X for a while. What with one thing and another, the only computers I have ever actually done dual head support on before now were my laptops. While I liked it a great deal, especially for presentations, I have never had it at my desk on the machine where it would make the most sense: my Linux desktop.

 

Ubuntu 7.10, and by extension Mint 4.0, has a new monitor management widget with dual head support, so that direct editing of the /etc/X11/xorg.conf file is no longer required. I know there are serious geek points being lost here by admitting that I prefer not to edit the xorg.conf file directly unless I have to, but so be it. Fedora has had this dual head setup GUI widget for a long time, so all I can say is "About time there, Ubuntu".

 

The new Dell 745 running Mint 4.0 just needed a dual head card to make it happen. It has a PCI-E video card, and an on-board video card, but the BIOS won't let both be active at the same time. Doh.

 

Some quick research found a cheap one with the Nvidia GeForce 7300 GS chipset. This PCI-E card has a VGA port and a DVI port with a VGA adapter. Both the flat panels I have are VGA: a Dell E197FP and a Dell E172FP. A 19 and a 17 inch panel, both 4:3 ratio. The E197FP is whiter on the panel backlight, probably because it is newer.

 

I pulled the ATI based card out of the Dell 745, installed the nVidia based card, and booted. Linux Mint immediately told me it was in low resolution, and asked did I want to do anything about that. I clicked "yes", and we went into the display setup do-dad. It saw I had two heads on the box, let me configure them, and then continued the boot.

 

Once up, I had a weird desktop. It was in Xinerama mode, and the Dell E197FP was in 1024x768 mode, but the pan mode was in 1280x1024, so the desktop was all slippy slidy and panning all over the place when I moved the mouse. The E172FP was stuck in 1024x768 mode even though it can do 1280x1024, so once I 'slid' off the left panel onto the right hand one it would work right. Generally odd.

 

Clicking on the restricted drivers controller and enabling the nVidia drivers did not change this. Going into the System / Administration / Screens and Graphics applet (I have the Gnome menus enabled, not the SLAB looking thing), I poked around with various settings before figuring out a few things. No matter what I did in there though, I could not get the screens set up the way I wanted them to be. I could not get the E172FP out of 1024x768 mode. It worked, but the panel was doing some funny things to the fonts to make them look OK. Sort of smooth but blurry.

 

I had read on the Mint wiki that the Envy app had been maintained in the Mint distro because it had more and better control over the screens than the current Ubuntu application did. I went to Applications / Systems Tools / Envy and fired that up. It asked if I wanted to install the nVidia drivers. I thought I had them on via the restricted source manager, but decided that there would be no harm if I said yes.

 

There was no harm, but I did not expect what happened. Envy pulled off the current nVidia packages, downloaded 76 new packages, re-compiled the driver, and finally launched the nVidia configuration applet.

 

I now had far better control of the system set up, but it took a bit more tweaking to get it to do what I wanted. First off I had to tell the applet that the right hand head was relative to, and to the right of, the left hand head. And I had to be in Xinerama mode to allow things I clicked on in email to launch in the browser started on the right hand head; otherwise they were in separate X sessions and isolated from each other.
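
For the curious, the net effect of those choices in /etc/X11/xorg.conf is a server layout stanza roughly like this (the identifier names here are made up; the real ones come from whatever the nVidia tool generates):

Section "ServerLayout"
        Identifier   "DualHead"
        Screen  0    "LeftScreen" 0 0
        Screen  1    "RightScreen" RightOf "LeftScreen"
        Option       "Xinerama" "on"
EndSection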

 

VMware Server started and ran just fine on the new setup, and now I could make a full screen guest appear on the right hand screen and work with it at the same time as the left hand screen stayed talking to the host OS. Very cool.

 

One last oddity. I had to enable 'root' mode to be able to use the nVidia screen management widget. Running it as me generated error messages about not being able to write to /etc/X11/xorg.conf. It came with no su or sudo style wrapper. It was not hard, and I probably could have dug up the name of the binary and launched it with sudo and achieved the same thing.

 

So, in the few minutes I have had as a two headed Linux person, am I 30% more productive? I would say yes. The screen real estate allowing me to refer to things and "highlight and paste" from browser to email alone has made it worth it. The down side: my PCLOS box now has no monitor on it. Guess I'll have to dig up a CRT for it, at least till I configure VNC Server.

BarCampESM

Posted by Steve Carl Jan 17, 2008

Join us January 18th and 19th in Austin TX for BarCampESM!

 

If you happen to be able to get over to Austin, TX Friday or Saturday, and you are interested in open and informal discussions around Open Management or Enterprise Systems Management or ITIL or BSM, or if you just like hangin' with other geeks, then drop by J. Blacks on West 6th Street in downtown Austin. Here is the Wiki:

 

http://barcamp.org/BarCampESM

 

The first 50 folks signed up there get free food on Saturday. After that first 50 you have to pay for the food, but not for the company.

 

The details, including a link to Google Maps, are on the Wiki. I hope to see some of you there!

YAB

Posted by Steve Carl Jan 17, 2008

I must be crazy. Yet Another Blog: "On Being Open", at the Open Management Consortium's new website.

<note from 2010: Posted for historical completeness only. The OMC website is no more>

 

Back when I only had this blog, I posted here at least twice a week, except on vacations and whatnot. That activity all happened at nights and on weekends, and when my family would see me bent over the laptop typing away, they would say "Writing another Blog?"

 

Last year I added a second blog over at blogger. The URL is http://on-being-open.blogspot.com/, but the name of that blog is "Adventures in Linux and Open Source". It was meant to be an extension of what I do here, but with a personal use slant. Of course these things have a way of taking on a life of their own, and I, like here, have strayed from time to time from my central theme.

 

Now comes a new phase of my Open Source life: beginning to work with the Open Management Consortium. Which meant a new blog on that new website:

 

http://beta.openmanagement.org/blogs/stevecarl/

 

The "beta" in the URL is because this is a new web page just assembled, and I'll have to update this when it goes GA.

 

The title of this blog was suggested by the URL of my personal one, and is called "On Being Open".

 

Now my family does not ask me if I am working on a blog, but "Which Blog is This?" [sung to the tune of "What Child is This". I have a musical family.]

 

PS: my newest post at the OMC went up last night, and it is "The VW Beetle Principle"


Fresh installs on nearly fresh computers: A new Enterprise Linux Desktop Adventure

 

Featuring an old desktop Linux guy...

 

Happy New Year!

 

Since I last posted anything here, having spent three weeks on vacation in Far West Texas, some things have changed on the Enterprise Linux Desktop (ELD) in my office.

 

First off, a new Dell Optiplex 745 appeared on my desk. It never booted anything before it booted my Mint 4.0 LiveCD. I messed for a bit with the LiveCD, surfing and editing and generally making sure it looked like Mint liked this new hardware. Once that checked out OK, Mint 4.0 spun down onto the hard drive, replacing whatever OS was on there before. Pretty sure it was not Linux in any case. Whatever it was, it is not there now:

 

/dev/sda1   *           1        1216     9767488+  83  Linux
/dev/sda2            1217        1459     1951897+  82  Linux swap / Solaris
/dev/sda3            1460       19452   144528772+  83  Linux

 

This is my normal "/" separated from "/home" config.

 

The Optiplex 745 replaced a Precision 340. The 340 was running Mint 3.1. The new gear is better in every possible way: Dual Core. More memory, etc. It lines up like this:

 

 


                            New Optiplex 745                                 Old Precision 340
CPU                         Intel 6300 Dual Core, 1.87 GHz, 7445 BogoMIPS    Intel Pentium 4, 2.0 GHz, 3991 BogoMIPS
RAM                         2 GigaBytes                                      1.25 GigaBytes
Disk (/dev/sda, /dev/hda)   Seagate, 8 MB cache, 160 GB, SATA                WD, 2 MB cache, 80 GB, ATA
Video                       fglrx driver, RV516, ATI X1300/X1550             ATI Rage 128
Network                     NetXtreme BCM5754 Gigabit Ethernet PCI Express   3Com 3c905c, 100 Mb

 

(Updated 1/14/2008 to fix the hard drive spec swap. Thanks David!)

You'd be tempted, based on that specification lineup, to think that the new system is twice as fast as the old one, and you'd be correct. Mostly. The dual processors make it so a single runaway thread is easy to cancel and recover from, and of course Linux is beautifully SMP these days, so it feels 2x fast. But there is more than meets the eye - or - at least that meets the BogoMIPS here. BogoMIPS aren't called that for nothing. Core processors are far better at out-of-order instruction execution and predictive pipe-lining, and the Intel 6300 has VT, so VMware Server is better at guest hosting on the new system than the old.

 

None of that shows up in a BogoMIPS rating. Bogo is short for Bogus. That is a good name. It is not that a BogoMIPS rating is useless. It just has to be kept in perspective.

 

The hard drives in each computer are both 7200 RPM units, and there is just one arm, so anything that goes I/O intensive is not seeing 2x. SATA w/ 8MB cache is better than ATA with 2MB cache, but not that much better.

 

Random thought of the day: Anyone recall when IBM called these things "Hardfiles"?

 

Mint 4.0 Install

 

Mint 4.0 installed without any issues in the 745. Compiz enabled. The newer graphics card handles the Compiz effects without any apparent strain. Of course, I keep most of them turned off except for things like window preview and other functional or informational effects. Be that as it may, the 340 would not run Compiz on its Rage 128 card. Not in any reduced mode that I tried anyway. Evolution, Openoffice, everything all just snap along. OO 2.3 launches in particular are about 1 second. Wow.

 

Evolution 2.12.1 has not had any issues at all. It blazes along, and very clearly benefits from the underlying speedy hardware.

 

One interesting thing is that the Dell E197FP LCD flat panel is supported much better. I have no idea what exactly made it better, but the OS detects and sets up the panel in /etc/X11/xorg.conf without any interference on my part. The fonts are nicer looking, better aliased, and the overall effect is that the entire screen is much bigger and more useful than it was before. I changed out the OS and the computer, so only the flat panel is the same, so no way to know what fixed this. More on this later in this program.

 

In late breaking news about video: I put a new version of Compiz on, and now the 3D desktop does not work anymore. I don't actually care that much, since there are few things in there I really use other than the Expose-like feature and the Window preview, but this is a bummer from the point of view of stability. In a real ELD, of course, this Compiz package change would have been tested by the desktop support folks before it was certified to roll out to the environment.

 

A new PCLinuxOS 2007 system

 

With Mint 4.0 happily spinning on the 745, and my Evolution email and other Enterprise desktop stuff brought over, it was time to decide what to do with the 340. It is not a bad little box. Sure, it would not run Vista well or anything, but then, it is over three years old. Vista is not a valid benchmark of computer / OS viability. Linux runs well on way less than this Dell 340 box, especially with the 1.25 GB RAM in place.

 

I gave some thought to Ubuntu Server, just to see it in action... and to see what its GFS code looks like. In a soon-to-happen post, GFS has been a real pain on the CentOS cluster and we have been giving serious thought to what to do about it. More on that in another post.

 

I decided to load back up PCLinuxOS 2007. I had never run it on a desktop class system, only laptops. On old grungy laptops it felt really crisp, so I wondered how it would do on this fairly good spec 340.

 

It runs crisply.

 

Installing it was not the pain free Mint 4.0 experience though. Not horrible or anything, just two issues:

 

  1. GRUB does not work
  2. The default video settings were a slight pain.

GRUB / LILO and the Dell 340

I fixed GRUB by re-installing PCLinuxOS and selecting LILO as the boot loader. I messed around with GRUB for a bit first, and finally determined that for some reason GRUB could not see /dev/hda, which it thinks of as hd0. Well, it should have thought that. But it didn't. It would boot the MBR, then fall into the GRUB prompt, and nothing manual I tried would load the kernel. I booted back to the LiveCD, used chroot to mess around with it for a while, and then decided I was having a bad GRUB / PC BIOS interaction.

 

A light went on about something Mint 3.0 / 3.1 had been doing on the 340. It would not boot on the 340 either, but it was failing in a different way. Mint booting would fail FSCK on /dev/hda1. I'd then mount /dev/hda1 (/) and /dev/hda3 (/home), type exit, and it would come up.

PCLinuxOS failed to see the disk at all with GRUB.

 

LILO fixed everything, and thank goodness PCLinuxOS still includes the option of using either Bootloader!

<gripe>GRUB changes the names of the disks just enough to be confusing. The first disk it finds (be it hda or sda) is called HD0. That is just close enough to HDA visually to be confusing. I wish it was DISK0, or even better, DISK1 instead.</gripe>
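
For reference, that mapping lives in GRUB's device.map file, and partition numbers are zero-based too, so a minimal sketch of what the 340 should have had looks like this:

# /boot/grub/device.map
(hd0)   /dev/hda

# and at the GRUB prompt, /dev/hda1 is addressed as:
# root (hd0,0)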

 

Video Set Up

 

PCLinuxOS is KDE 3.5.8, where Mint uses Gnome (unless you load up the KDE packages or the Kubuntu-ish version [of which there is no 4.0 GA version yet]{At the time of this writing, and that will just about be enough on the brackets!}).

 

On the 340, Mint did the right thing as far as screen resolution and whatnot. PCLinuxOS came in at 1024x768. To fix this required going into DrakConf (System / Configuration / Configuring Your Computer). The KDE Control Center app for configuring the display will not help you, even though it is not greyed out or anything. Really, would it be a huge mod to the KDE code to put a little note in the KDE Control Center telling people to go use DrakConf?

 

Two things had to be changed: first I had to tell it that the display was a generic 1280x1024 flat panel (it is a Dell E172FP, but this was not detected). This is at Hardware / Configure your monitor in DrakConf, aka the "PCLinuxOS Control Center".

 

Also in that same place is "Hardware / Change the screen resolution". Once this was done, and X was restarted, I had 1280x1024. And it was lovely. Far nicer anti-aliasing than what Mint had done with this particular video card. But this is not apples to apples. This could be KDE and the way PCLinuxOS sets it up rather than anything about xorg. Mint used Gnome, and I have started to notice that there are some corner cases in hardware where Gnome does a better job, and others where KDE looks better. I have not chased this to ground.

General Linux Video Ramblings

 

There is greatness in the way Linux does video support. Having different layers like X and the desktop, it (like any other OS done this way) is able to keep the look and feel stuff largely separate from the hardware support, allowing different groups to focus on what they are interested in. At the same time, KDE and Gnome drive enough stuff that they do have a very tight level of reliance upon the X server. It is a testament to Open Standards that KDE/Gnome do not seem to care if they are installed on top of XFree86 or Xorg. Still, despite all this abstraction, it remains that there are some set ups where one looks better than the other.

 

Case in point:

 

Previous set up was the 340 with the Dell E197FP display. Distro was Mint. It looked OK, but not as good as Mint on the Dell 620, or my personal Acer. I could never figure it out, never get the fonts to look smooth and anti-aliased.

 

I moved the E197FP to the 745, and added a Dell E172FP to the 340. Same sync speeds, same resolution: 1280x1024. Mint still looked jaggy on the 340 / E172, but the 745 / E197 now looks great. Smooth and clean: in a funny way it feels like the E197FP gained more screen real estate. I guess because everything can be smaller and still be readable without looking jagged.

 

Then PCLOS rolled in to the 340/E172 combo, and now it looks great. Every bit as nice to look at as the 745 / E197.

 

There are subtle interactions in the stack of OS, X Server, Video Card, Monitor, Distro, and chosen desktop environment that can make a huge difference in the way the whole shooting match looks and feels.

ELD and PCLOS

 

As noted in my previous PCLOS on a laptop foray, PCLOS works great as an Enterprise Linux Desktop. It would be an easy argument to make in fact that Kubuntu / PCLOS would be a better choice for ELD, at least if the people using it were recently using an OS from Redmond. The KDE 3.5 user interface paradigm is closer to the one on MS's XP than the Gnome 2.20 of Mint 4.0. When I give the ELD lab at places like LinuxWorld or SHARE, I always use a KDE based desktop.

 

And "Yes, there is a KDE version of Mint 4.0 coming soon". The Mint blog says that they are having to work harder than they thought porting the Mint add-in tools to KDE.

 

Dell and Linux

 

Even though these are not officially supported for Linux by Dell, both of these computers run it without issue. The Ubuntu Linux supported hardware is an Inspiron 530 N at this writing. Other than the GRUB thing on the 340, both these distros work great. Either one is viable as an Enterprise Linux desktop.

 

While I have looked at many Distros over the years of this blog as Enterprise Linux candidates, I usually do it on laptops. This was my first all-real-desktop hardware look. My theory has always been that laptop hardware is harder to support for the OS, and if laptops work, desktops should be a breeze.

 

I don't think I have totally invalidated that idea here today. The GRUB issue on the 340, but not on the 745, does show that *all* hardware needs to be evaluated before a particular Linux distro is deployed to the enterprise desktop.

 

The point is more that there are other viable distros for the ELD role than just SUSE/Novell and RedHat. These two work fine.

Vacation Strikes Again

Posted by Steve Carl Dec 23, 2007

Gone till January, 2008

 

"Adventures in Linux" is on hiatus.

 

Kind of like that writers strike in Hollywood, except that it is nothing at all like that.

 

Upon return, I should have an update on the CentOS NAS Cluster. Till then.....

 

Happy Holidays everyone!!!!

 

PS: I will be posting over at http://on-being-open.blogspot.com/ from time to time during the holidays if you are interested in personal Linux / Open Source topics...


Getting along with MS Windows users who do not even know you are not an MS Windows user, starring Mint 4.0.

 

Mint 4.0 has been performing flawlessly so far. Here I am on week two of using it, after taking a chance that it would be OK, based on earlier testing of Ubuntu 7.10 and on things I did on my personal Acer. Along the way I have found a few behaviors that may be helping my overall stability.

 

First up: OpenOffice 2.3 is just amazing. I have worked on two presentations, where people sent me MS formatted presentations that I had to add content to and then send on to the next person for their edits and updates. Back and forth these things flowed via email, culminating in two presentations, where the presenter was a person running MS Windows with a copy of my slide deck piled into a much bigger slide deck. The formatting looked fine, there were no weird page bleeds. Nothing. No one in the audience ever knew that this was created on anything other than MS Win. It was every bit as boring as every other presentation ever given. :)

 

These are the OO packages I have been using:

 

openoffice.org 1:2.3.0-1ubuntu5.3
openoffice.org-base 1:2.3.0-1ubuntu5.3
openoffice.org-calc 1:2.3.0-1ubuntu5.3
openoffice.org-common 1:2.3.0-1ubuntu5.3
openoffice.org-core 1:2.3.0-1ubuntu5.3
openoffice.org-core02 2.1.0-6
openoffice.org-debian-menus 2.1-6
openoffice.org-draw 1:2.3.0-1ubuntu5.3
openoffice.org-evolution 1:2.3.0-1ubuntu5.3
openoffice.org-filter-mobiledev 1:2.3.0-1ubuntu5.3
openoffice.org-gnome 1:2.3.0-1ubuntu5.3
openoffice.org-gtk 1:2.3.0-1ubuntu5.3
openoffice.org-help-en-us 1:2.3.0-1ubuntu2
openoffice.org-hyphenation 0.2
openoffice.org-impress 1:2.3.0-1ubuntu5.3
openoffice.org-java-common 1:2.3.0-1ubuntu5.3
openoffice.org-l10n-common 1:2.3.0-1ubuntu2
openoffice.org-l10n-en-gb 1:2.3.0-1ubuntu2
openoffice.org-l10n-en-za 1:2.3.0-1ubuntu2
openoffice.org-math 1:2.3.0-1ubuntu5.3
openoffice.org-style-human 1:2.3.0-1ubuntu5.3
openoffice.org-thesaurus-en-us 1:2.2.0-2ubuntu1
openoffice.org-writer 1:2.3.0-1ubuntu5.3
python-uno 1:2.3.0-1ubuntu5.3

 

A quick scan of those names and it would appear that Mint does not modify them in any way from what Ubuntu provides. These showed up as an update though: originally when I brought up OpenOffice it used to say "Mint Edition", but now it does not. I don't really care what it says, as long as it works.

 

Clean Screen

 

Part of "working" is having the screen be clean and easy to read. No one wants to work on a presentation and have their eyes cross after 10 minutes because the fonts looked horrible. The D620's screen resolution is 1440x900, and Mint did not have to be told that, or be told that the dpi was 128, or to turn on anti-aliasing optimized for LCD's. Mint did all that out of the box. Further, while the "915resolution" package is installed, it does not actually appear to be in use. If it is, it is being very quiet, with no messages at boot. Only video related stuff appears to be coming out of agpgart:

 

[   11.920000] Linux agpgart interface v0.102 (c) Dave Jones
[   11.928000] agpgart: Detected an Intel 945GM Chipset.
[   11.928000] agpgart: Detected 7932K stolen memory.
[   11.944000] agpgart: AGP aperture is 256M @ 0xd0000000

 

I interpret that to mean that the BIOS patching that the 915resolution package did is no longer needed by the current xorg:

 

xserver-xorg  1:7.2-5ubuntu13

 

Which has this intriguing set of packages:

 

xserver-xorg-video-i810 2:1.7.4-0ubuntu5 Intel i8xx, i9xx display driver
xserver-xorg-video-intel  2:2.1.1-0ubuntu9 X.Org X server -- Intel i8xx, i9xx display driver

 

The /etc/X11/xorg.conf seems to confirm:

 

Section "Device"
         Identifier      "Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller"
         Driver          "intel"
         BusID           "PCI:0:2:0"
EndSection

 

All of which begs the question: why is 915resolution still installed? It is not hurting anything, but it is clearly not needed. I have been very happy with the level of GUI performance, with the 3D effects settings I settled upon mentioned in my last post, plus adding in the package mentioned in the comment to it by 67GTA; to wit, I needed to install compizconfig-settings-manager to get the extra window preview I was looking for without having to run all the extra stuff.

 

Wait Wait

 

One possible reason for the new level of stability provided by Mint 4.0, as well as other recent Linuxii, is the way an application will darken when it becomes non-responsive. It is not an application specific thing, but something under the covers of X itself. For example, I'll be typing away at the address bar on an email in Evolution. Evolution will be querying the GAL to try and find the closest matches to the mess I am typing. If I get too far ahead, the entire Evolution window fades to a dark gray to let me know it is no longer paying attention to my meanderings. I can jump over to a different app, like say Firefox, and keep right on working on a different thing. Linux is not hung. Apps are not gated on each other. And when Evolution catches back up, it fades back to normal color. I have even had both Firefox and Evolution grayed out, and hopped over to OpenOffice to work while they sorted themselves out.

 

So far they have always come back, and when they do they are ready to do more. In Evolution's case, it was horrible when I accidentally set the GAL server to one in France, but even with it set to one in Houston it can gray out from time to time while I am here in San Francisco.

 

What I am wondering now is: what did Evolution do *before* this feature appeared? Now I stop typing and let it catch up. Before, I probably blustered on, and perhaps some of the failures I saw in Evolution back in the 2.10 days were caused by me just typing away when it was lost in thought. Dunno. Just a theory. Nothing to back it up other than the fact that 2.12 under Mint has been solid as the day is long for two weeks, even while remote to the MS Exchange server I am using.

 

Evo 2.12 also has one other nice feature. I used to get meetings from time to time from an Outlook user that Evolution would say were too complicated, and it would ask me to *save* them to the hard drive, and then open them. I never did. I just hopped over to webmail and accepted them there. That has not happened so far in two weeks of 2.12, despite some deeply complex meetings showing up.

 

Up to Date

 

The Mint folks mentioned on their website that they were not going to be pushing out as much service for 4.0 as they had in the past, trying to favor stability. Stable it certainly has been, but a fair number of updates have come down via the nifty MintUpdate applet. I have accepted everything it has recommended and so far, no issues. Updates besides OpenOffice have been to things like Firefox (at 2.0.10... 2.0.11 came down on the Mac the other day, so still a little behind). Actually, now that I mention it, OpenOffice is slightly behind too, at 2.3.0, not 2.3.1. The release notes for OpenOffice say this .1 version is all fixes, not new functions. I am not having any problems, so it may not matter. A bunch of libmono stuff also landed. Download servers are fast, and probably mostly Ubuntu's.

 

Meetings, live or otherwise

 

MS's "Livemeeting" (another app name like Sharepoint: Sharepoint is only for sharing amongst MS Windows folks. All others need not apply)(Yeah, I know there is a "reduced experience" when using other  browsers with Sharepoint. Phooey Wiki is full function no matter what browser or platform I am on). Livemeeting supposedly has a web interface that works from Firefox, and it does work sometimes. I have not figured out the rhyme or reason of the failures, but I assume, based on prejudice and not investigation (at least I admit it), that it is something weird in the code LiveMeeting is sending the browser. For those sad moments when I must use Livemeeting, if FF does not work, there is always Codeweavers Crossover Office and IE6, or worse, VMware and MSWin as a guest. But that always feels like such a failure of purpose.

 

There is also Opera, which often handles IE stuff better than FF. And there is IEs4Linux, and now in Mint something called "Wine-doors", to let you put IE on in ways other than Codeweavers. One way or the other, I can get at Livemeeting when I have to. Today's was IE6 under Codeweavers. I was too sleepy to figure out why FF was hung out to dry by Livemeeting.

Oh. The misnomer of LiveMeeting: When have you ever been to a lively meeting? I mean really. :)

 

Is Good... Bad?

 

If MS Windows users do not know I am a Linux user, and think there are no issues using MS Windows specific things all the time, is that really good? What is the incentive to think about Open Standards, and Open document formats? If Linux keeps doing all the heavy lifting of compatibility, that is good for us Linuxii, but is that bad in the large sense of things? I am not sure, but it seems like it.

On the other hand, I know more and more people that are looking at either Linux or OS.X as their full time desktop, and there is all the web 2.0 stuff...

 

Wait and see. For now, that is the view from the vantage point of Mint 4.0.


A look at the Ubuntu 7.10 based newly minted version of Linux for use as my daily use system

 

The last two posts I did here had to do with using a data center grade version of Linux to create a production level, critical service. The upside to using such a Linux is supposed to be stability and predictability. Everything tested. Everything settled and stable. Calm. Quiet. Maybe even a little boring.

 

The downside? Not the latest kernel or packages, therefore not the latest features. Do we need the latest? Maybe. NFS Version 4 and IPv6 are going to be challenges in the near future. Linux has had support for these things for a long long time, but now we are starting to see some systems from several vendors enabling them *by default*.

 

Running a data center is always about balancing the value of the new features against the change and disruption they bring, and none of that has much to do, most of the time, with the way we run an R&D Datacenter. R&D *has* to be out there in front, so we are always forced to adopt and adapt to new technologies by our customers far faster than what is probably normal for a data center.

 

I point out all this to re-iterate a point I have made about the Linux Enterprise desktop before, but that bears a quick repeat: while Linux is ready to be an Enterprise desktop right now, today, chances are that a centrally supported/managed, cross enterprise standard Linux desktop image is *not* based off the latest and greatest version of Linux, but off something like SLED/Novell or RedHat Enterprise Linux Desktop. And in some ways that is a shame. There is nothing wrong with them per se, but...

 

Linux evolves so quickly that all sorts of nifty new feature/function that Linux Desktop users would probably love to have is also probably only in the newer releases. This rapid evolution is not even all Linux's fault. The underlying hardware, especially laptops, moves quickly too. Example: by the time I received my Dell D620 laptop, I could not order one from Dell, because they had moved on to the D630, with substantial hardware changes to the unit that Linux would have to deal with, such as moving from the old low end Intel 945 graphics card to the new X3100. Only the newer Xorg servers included with the new release of a distro have full support for this new chipset. If you want things like Compiz Fusion to really shine, you have to be on the new release... and in fact, Compiz Fusion is itself only on the newest releases! Before that, Compiz and Beryl were still separate things. I do admire how fast they merged though.

 

Unless the strategy is moving everyone to thin clients (and that is perfectly do-able with Linux, although it seems unlikely for laptops) then rapid certification and provisioning is what is going to make the end users happy with these fresh new releases. Another reason to go Web 2.0, since that decreases your desktop certification efforts immensely. All you have to know is whether or not Firefox or Opera as shipped on the latest Linux works with your AJAX application and you are done. You could really care less about most of the rest of it. OK. You might care. Adrian Monk would of course, but I digress.

 

Hey: What about Mint then?

 

Right... right. Getting there. All that was to say that I freely admit that I am an early adopter. I am always interested in / curious about how well the bleeding edge stuff is doing. In part, it is how I give back to Open Source too: problems that I and my early adopter fellow geeks can find and report on the edge make the further back products more stable when they come along.

 

Mint 4.0 is about as bleeding edge as it gets as of this writing. Fedora 8 and OpenSuse 10.3 are already getting old it would seem. Life on the edge is fleeting for a Linux Distro. Mint is based on Ubuntu 7.10, and there is all sorts of new nifty goodness there: I read that for the Ubuntu folks 7.10 was a chance to get out there and get as many new features in as they could because it was not going to be one of their extended (LTS) support releases.

 

I installed Ubuntu 7.10 on the Dell D620 for a few weeks, and I have to say that I did not get the feeling that it was in any way an *unstable* release, for all its new features and currency. What I was curious about was what Mint was going to do to add to the already polished and updated feel that Ubuntu had. 7.10 was the best Ubuntu I had ever installed, and it seemed at the time that they were not leaving much for Mint to do. But it turns out that was not at all correct. Mint takes Ubuntu's refinement to yet another level. I love the artwork, which is a refinement of 3.0's. Kind of a carbon fiber / modern look, without looking tacky like some carbon fiber themes I have seen.

 

If current releases of Linux like Ubuntu 7.10 are where the bleeding edge stuff is, and Extended support / enterprise releases are where the centrally supported Linux desktop crowds want to be, then Mint is to Ubuntu 7.10 what Ubuntu 7.10 is to 6.06 LTS (Long Term Support). Slicker, newer, more features, etc.

 

Here then are some of the specifics.

 

Evolution

 

I can not talk about an Enterprise Linux desktop, at least in a Microsoft Exchange email server based shop, without bringing up Evolution first. Evolution with its MS Exchange connector is still the best way to let a Linux person communicate with MS Windows / Outlook using counterparts in their environment. In a perfect world MS would be more open about their mail protocols, or Production IT shops might lose their fascination with MS Exchange in exchange for a more standard set of server protocols (whew... that was a hard sentence to type...), but if you have to deal with MS Exchange, you probably want Evolution.

 

Mint as near as I can tell does not add anything to the underlying Evolution packages. They appear to be exactly the same as the ones Ubuntu ships, namely:

 

evolution 2.12.1-0ubuntu1
evolution-common 2.12.1-0ubuntu
evolution-data-server 1.12.1-0ubuntu1
evolution-data-server-common 1.12.1-0ubuntu1
evolution-dbg 2.12.1-0ubuntu1
evolution-exchange 2.12.0-0ubuntu1
evolution-plugins 2.12.1-0ubuntu1
evolution-webcal 2.12.0-0ubuntu1
nautilus-sendto 0.10-0ubuntu1
openoffice.org-evolution 1:2.3.0-1ubuntu5

 

The bad news here is that only the base Evolution package has a debugging version currently available. The good news is that 2.12 has so far been a pretty trouble free version so that I have not yet *had* to report any problems. This is true for both Ubuntu 7.10 and Mint 4.0.

 

Backing up to the Install

 

Installing Mint 4.0 is exactly the same as Ubuntu 7.10 was, or Mint 3.1, or Mint 3.0 or Ubuntu 7.04... Same installer. Same LiveCD boot. What is different is that when I booted on the Live CD, Compiz fusion was enabled by default, and appeared to work without any problems at all. Once it was installed to the hard drive though, Compiz was *disabled* by default. Kinda weird. It *knew* it worked.

My main complaint about the install is the same one I have had for a while, which is that neither Ubuntu nor Mint can seem to figure out that there is already a Linux installed on the hard drive, and so neither asks if I want to re-use the existing /home. Every time, I have to go into manual disk layout mode and force it to format "/", but otherwise leave the disk alone. My disk layout has not changed on this laptop since I first built it:

 

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00019fb7

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        1824    14651248+   7  HPFS/NTFS
/dev/sda2   *        1825        3040     9767520   83  Linux
/dev/sda3            3041        3283     1951897+  82  Linux swap / Solaris
/dev/sda4            3284        9729    51777495   83  Linux

 

SDA2 is "/". SDA4 is "/home". Really, this is an easy configuration to figure out. Anyone that wanted their data to survive an upgrade would do something like this. Further, Fedora and OpenSUSE both can figure out at install that the userid in "/home" called "steve" is the same one as the new one it just asked me to create during the install. They ask if they should reuse the home directory, and change the UID to whatever makes them internally happy. Going back the other way, Ubuntu / Mint have no idea. I have to fix the UID on the files manually. Going from Ubuntu 7.10 to Mint 4.0 was no problem though.

 

Fedora and OpenSUSE are both starting to use LiveCD installs, at least as options, but neither of them has it down to as few steps. If Mint / Ubuntu were a bit smarter about disk layout, it would be even fewer install steps!

 

The rest of the story

 

I came back from vacation, after a week of not even looking at a computer other than my iPhone, and had one day before I had to get on a plane and fly from the swamplands to our Sunnyvale office. My D620 was running Ubuntu 7.10 without issue, and so I did the only thing I could do before a long business trip where the Linux laptop would be my one and only connection back to the home office. I upgraded it from Ubuntu 7.10 to Mint 4.0, because 4.0 had gone GA while I was on vacation, and based on everything I know about Mint, I expected no issues. There were none. Mint dropped in like a champ, and I could start looking at the nifty new tools like MintUpdate and the updated stuff like MintDesktop.

 

One may wonder why the world needs yet another updater. I can use Synaptic or Adept to do this, after all. But I really like the classification of risk information that MintUpdate provides, and think it is a valuable addition. MintDesktop fixes the problem I have with finding where Gnome buried some settings, especially how to turn off "Spatial Mode" in Nautilus. Mint 4.0 continues to default to providing their SLED-like default menu, and I guess that means there are people that like it. I must not only be an early adopter but an outlier when it comes to the value of certain features. Whatever. I know how to turn back on the Gnome default menus in no time at all now, so it is not hurting me. I left the SLED-like menu turned on, with its codename "Daryna" displaying (at the bottom; Gnome standard menus at the top), and mess with it from time to time to see if I can figure out why it is still here. After all, I did not get why Ubuntu was so popular all that long ago. I have to be open to new things in this world of Open Source, or I will be run over by the rate-of-paradigm-change truck. I hate that truck sometimes though. :)

 

GUI Goodness

 

I found the place to turn back on Compiz fusion (System / Preferences / Appearance / Visual Effects), and loaded up the Emerald theme manager (with Synaptic), which now works with Compiz (Beryl is nowhere to be found: It really is married back into Compiz). I played around with a few themes, and settled on one with transparent edged windows that let me see behind the current window, which is useful. The effects are set to "Normal", as the "Extra" level of effects acted weird on the D620's low-end 945GM graphics card. I am not a measure of what others want here: If the effect does not provide me an actual useful function, I turn it off in favor of speed anyway. I know many like the glitz just because it is fun to watch.

 

The down side is I can not find a higher level of effect control than this off/normal/extra, or the Gnome GL desktop control. I am sure it is around, but it is not obvious. There is an effect from Beryl missing that I want back: when I hover over the task bar, I want to see the composited mini-version of the window for that task. That was just too handy!

 

Other than that, "Normal" setting of Compiz has been stable and is pretty crisp. I have the D620 sitting right next to my MacBook Pro right now, and it is interesting to see how similar the window shadowing looks to what OS.X 10.5 does. OS.X really stepped up the drama of these shadows in the new release, but Compiz is not far behind it.

 

Just Call Me Compatible

 

One of the things I like about Linux, and especially Mint / Ubuntu, is how easy it is to load up additional software so that I can co-exist with my OS.X system. Sure, it comes pretty tweaked out for MS Windows co-existence already, and while I do have to be compatible at the office with those folks, in my personal life there is no MS Windows. Only Linux and OS.X. Synaptic makes short work of putting on Avahi, Avahi Service Discovery, Macutils, HFS support, plus stuff like ACPI Sensors, HDDtemp, gkrellm and so on. My current system temps are CPU 47C, Hard Drive 38C. I knew you'd want to know that.... Actually, it is interesting that the IBM Thinkpads appear to have far more sensors on them. When I run ACPI and IBM-ACPI on them, I see so many thermal zones I can not display them all in the task bar and leave room for anything else. The Macbook Pro is like that too. This D620 only has two. It runs cool though.

 

A Week and Counting

 

Mint 4.0 has been the main way I have done work for the last week. I have done presentations with OpenOffice (saved to .ppt for the MS users that needed it), looked at spreadsheets, read email, scheduled and accepted meetings, updated task lists, searched contacts.... all the usual office kind of stuff. The only problem I had was Evolution seemed really slow looking up email addresses, and I found out I forgot to update the GAL (Global Address List) server to the right one, so that was my fault. Evo defaults to picking the first GAL server it can find in the alphabet, which in this case is on the other side of the planet from me. Once I had that set to point at a server in the same country, that problem went away.

 

Mint 4.0, and its underlying Ubuntu 7.10 may be bleeding edge (2.6.22 kernel) but so far it is stable, fast, pretty, and high function as well. The only reason I can see why this would not be a good enterprise laptop / desktop OS for everyone would be that pesky amount of time it takes to certify new OS releases by the central support team.


The good, the bad, and the trivia of the certification of the CentOS 5 HA cluster

 

Last week I posted about our moving to replace our trusty Tru64 TruCluster NAS server with a new CentOS 5 based NAS solution. I said then that I would peel back the covers a bit and show the test results from our qualification runs. In fact, back a few posts I said we were going to be open about this whole project, and here is that promised openness, in all its geeky glory.

 

 

This post is largely not my work, but that of Dan Goetzman, the man with the NAS plan who did all this work. This one post actually covers literally months of work in planning and testing and gathering results. The only changes I have made to Dan's post to our internal Wiki are that I deleted two graphs (because I don't know how to post graphics here), and removed system names in favor of system type information: anyone reading this is not going to care if we named a Solaris system "Yoda" or "Shuttlebay"; it is still a Solaris system no matter what geek-space name we picked for it. Hey, we're geeks: we admit it.

 

 

My deep thanks to Dan for letting me use his work like this. Truth be told, this whole blog would be a much shallower, less technical thing if it was not for his work over the years. He keeps me honest. He gives me ideas, data, time, and outstanding work. I could not ask for more from anyone.

 


Server - Sun X2200

  • HW = (3) Sun X2200m2

  • Data Disk = Apple XServe Raid, (28) 750GB SATA disks, (4) 3.5 TB RAID LUN's (Raid 5, 6+1)

  • OS = CentOS 5 with Cluster Suite using GFS filesystems for user data.

  • Connectathon Version = cthon04

NFS test results

Test 1 - Basic function

  • iozone test - Pass

  • locktest - Pass

Test 2 - Full function

 

OS                  Basic      Basic      Basic      Basic      Lock       Lock       Lock       Lock       CPT    Client
                    v2u        v2t        v3u        v3t        v2u        v2t        v3u        v3t
Solaris 9           Pass       Pass       Pass       Pass       Pass       Pass       Pass       Pass       Pass   Superman
Solaris 8           Fail(1)    Fail(1)    Pass       Pass       Fail(1)    Fail(1)    Pass       Pass       Pass   Gas
Solaris 7           (no results recorded)
HP-UX 11.00         Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass   Hercules
AIX 5.1.0           Pass(2)    Pass(2)    Pass(2)    Pass       Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass   Perfaix02
Tru64 5.1B          Pass(2)    Pass(2)    Pass(2)    Pass(2)    Fail(3)    Pass(2)    Pass(2)    Pass(2)    Pass   Thing
Linux 2.6.8-1.521   Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass   Smore
Linux 2.4.21-4.EL   Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass(2)    Pass   Putter


 

 

Notes:

  1. Permission denied on mounting, due to the known Solaris NFSV2 with acl problem.

  2. Passed with warnings. Typically client side implementation issues with locking.

  3. Lock test failed to complete due to a coredump.

  4. Lock test failed in "non native 64 bit mode" only.

Test 3 - Client platform

Same as "basic function test", run on each client/platform.

  • Results recorded in CPT column in table above.

Test 4 - Throughput

About 70 MB/s peak sustained write rate measured at the server.

Clients used;

  1. SunFire V440, Solaris 8, 1000 BT, NFSV3_TCP, iozone -i 0 x 3 streams

  2. SunFire V880, Solaris 9, 1000 BT, NFSV3_TCP, iozone -i 0 x 3 streams

  3. HP rx2600, HP-UX 11.23, 1000 BT, NFSV3_TCP, iozone -i 0 x 3 streams

Note: Each client ran 3 iozone streams for a total of 9 streams.
Note: I/O was directed to a single XSR Raid controller.

Test 5 - Burn test

Pass - Multiple clients running a complete "iozone -a" pass concurrently.

Clients used;

  1. Solaris system 1

  2. Solaris system 2

  3. Solaris system 3

  4. Tru64

  5. HP-UX

Note: The NFS UDP clients, Tru64 and HP-UX, were very slow. This
is due to the very fast gigabit NFS server and slow 100BT clients.

Test 6 - Basic Tier 2 client test

As it is difficult to find a working compiler for some of the tier 2 clients, cthon04 was not used to test the NFS protocol. Instead a basic confidence NFS test was used.

  1. Verify my $HOME automounts and is accessible

  2. Copy the contents of my $HOME to a test area on the NAS (sketched below)
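
In shell terms, that confidence test boils down to something like this (the test area path is hypothetical; cpio is the copy tool used, per note 2 below):

ls $HOME                                       # 1. trigger and verify the automount
cd $HOME && find . | cpio -pdm /nas/testarea   # 2. copy $HOME to a NAS test area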

Clients tested;

  • Dynix/PTX (Sequent)

  • OpenVMS using VMS/TCPIP

  • SCO (1)

  • SINIX/Reliant (2)

  • OSX

  • FreeBSD

Notes:

  1. NFSV2 UDP dropped packet retry/timeout problems. Set r/wsize=1K to run the tests (see the mount sketch below).

  2. cpio ran to completion, but with errors trying to reset the modification time (-m option to cpio).
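
For note 1, the workaround looks something like this Linux-style NFS mount (server and paths are hypothetical, and the exact option names vary by client OS):

mount -t nfs -o nfsvers=2,udp,rsize=1024,wsize=1024 rnd-fs01:/export/test /mnt/test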

CIFS test results

Test 1 - Basic function

  • iozone test - Pass

Test 2 - Client platform

Same as "basic function test", run on each client/platform.

  • NT 4.0 SP6a - Pass

  • Windows XP SP2 - Pass

  • Windows 2000 SP4 - Pass

  • Windows 2003 Server SP1 - Pass

  • OSX 10.4.10 - Pass

Test 3 - Throughput

About 90 MB/s peak sustained write rate measured at the server switch port.

Clients used;

  1. Windows Server 2003 SP1, 1000 BT, iozone -t 4 -s 300m -r 32k -i0

  2. Windows Server 2003 SP1, 1000 BT, iozone -t 4 -s 300m -r 32k -i0

  3. Windows Server 2003 SP1, 1000 BT, iozone -t 4 -s 300m -r 32k -i0

  4. Windows Server 2003 SP1, 1000 BT, iozone -t 4 -s 300m -r 32k -i0

Note: Each client ran 4 iozone streams for a total of 16 streams.
Note: I/O was spread across all 4 XSR Raid Controllers.

Test 4 - Burn test

Use a select set of clients to run a full iozone -a pass.

  • iozone -a - Pass

Note: Used the same set of clients used for test #3 above.

High Availability Tests

Test 1 - "Graceful" Shutdown

Node#1 was taken down with a graceful shutdown.

 

Pass - No file service outage detected

  • Start "iozone -a" (both UDP and TCP) on NFS clients

  • shutdown -h now - On node#1

  • clustat - On a surviving node, shows the NFS service was relocated to another node OK

  • NFSV3-UDP "iozone -a" continues to run OK from a NFS client

  • NFSV3-TCP "iozone -a" continues to run OK from a NFS client

  • Power up and boot node#1

  • clustat - Shows the NFS service recovered back to node#1

  • NFSV3-UDP "iozone -a" continues to run OK from a NFS client

  • NFSV3-TCP "iozone -a" continues to run OK from a NFS client
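The checks above are the stock Cluster Suite tools. For reference, the service can also be relocated by hand rather than waiting for automatic failback; the service and node names here are illustrative:

clustat                          # show cluster members and where each service currently lives
clusvcadm -r nfssvc -m rnd-fs01  # manually relocate the NFS service back to node#1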

Test 2 - Power Cord "yank"

Power was interrupted to node#1 by "yanking" the power cords.

 

Fail - NFS service fails to recover.

  • Problem: fence_ipmilan fails because the X2200 LOM card is not available. A second fence method is defined (fence_brocade), but the fence daemon is not able to query via ccs_get to obtain the next fence method. This is a bug!

  • Syslog Messages:

fenced[2881]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.19.176.17...ipmilan: Failed to connect after 30 seconds Failed
ccsd[2844]: process_get: Invalid connection descriptor received.
ccsd[2844]: Error while processing get: Invalid request descriptor
fenced[2881]: fence "rnd-fs01" failed
fenced[2881]: fencing node "rnd-fs01"

Test 3 - Network Cable "yank"

Public network cable was "yanked" from node#1 while testing.

 

Pass - No file service outage detected

  • Start "iozone -a" on NFS test clients

  • Yank network cable from node#1

  • Cluster detects node failure

  • Cluster fences failed node successfully, using fence_ipmilan

  • Cluster relocates NFS service to a surviving node

  • "iozone -a" on the test client continues after a short delay, as expected

  • Reconnect network cable

  • Node#1 joins the cluster

  • NFS service remains on node#2

  • "iozone -a" continues to run OK

Additional Tests

Test - Filesystem "expand on the fly"

Increase the size of a filesystem while running a "iozone -a" test from a NFS client.

 

Pass - No file service outage detected

  • Start a "iozone -a" on a NFS client

  • Create a 100 GB "segment" using the admin GUI

  • Attach the new segment to the test filesystem using the admin GUI

  • "iozone -a" on the NFS client was not interrupted

Legato Backup and Restore Testing

Using Legato as the backup server.

Restore Tests

  1. Single file restore - To an alternate path.

  2. Sub tree restore - To an alternate path.

  3. Entire volume restore - To an alternate path.

  4. Single file restore - With a NT ACL defined.


There it is then.  A pretty nifty box so far. We have migrated more data to it over the last week, and so far, so good.

 

I am on vacation in West Texas next week. We do not allow EMI out there, so I doubt I'll have anything new to post here. How much can one say about Linux or Open Source when one is surrounded by high desert and has no computer in reach?

 

I'm going to defer the posts about storage virtualization and the new mirror process until I have more time to do them justice. I also have a ton of new desktop stuff under way: I have been working with Mint 4.0 Beta, Fedora 8, OpenSUSE 10.3, and Ubuntu 7.10 for a few weeks now so when I finish up on the NAS series of posts I'll jump back in with some more there about the Enterprise Linux desktop. Hint: It works better all the time.

CentOS 5 NAS Cluster

Posted by Steve Carl Nov 9, 2007
Share This:
Update on replacing the Tru64 NAS server with a Linux High Availability (HA) CentOS 5 server

As I noted in my last post, here is an update on where we are at with replacing our trusty but aged Tru64 TruCluster NAS server with a new HA NAS server. The new server is a CentOS 5 based cluster with three nodes. I'll get into the particulars in a second, but first,

How We Got Here (in a nutshell)

 

Digital created the best cluster software in the world, VAXCluster. Digital ported this to Tru64. Digital was sold to Compaq. Compaq continued Tru64 and TruCluster. We had a NAS appliance. We bought another. It failed and failed and failed, for over a year. We replaced that with the TruCluster. HP bought Compaq and killed the AlphaChip and Tru64 TruCluster future development. Our TruCluster aged, and we began to look at replacements. Two appliance vendors came in, were tested, failed. Tru64 started to have issues with new NFS clients. We started our in-house HA NAS testing based off our years of Tier II NAS using Linux. Pant Pant Pant. Whew. Twenty plus years of history in one paragraph!

What We Liked About Tru64 TruCluster

 

It may be true that we over-engineered the Tru64 NAS solution. After being burned so badly by the appliance, and having so many critical builds depend on the server, we were not prepared for anything other than the most reliable NAS we could figure out how to build. Tru64 was tried and true. TruCluster was the best cluster software there was for UNIX, and the Alphachip was the hottest chip on the block back then. It all seemed to be a no-brainer.

 

Once built, we had rolling upgrades, and while a node might fail, the service stayed up. Customer facing (my customer being of course BMC R&D) outages were few and far between, and while we did have a data loss issue once (leading to the Linux snapshot servers), we never had a server failure. TruCluster let us sleep at night.

 

We hoped that Linux clustering would one day catch up to TruCluster, and so watched things  like the Linux SSI project with great interest.

Whatever we ultimately use, it has to pass the NAS server testing suite.

Re-Thinking the NAS Solution

 

We knew what we liked about TruCluster, but after seven years, we also decided it was time to question some of its very basic design assumptions. We came up with a new set, tested the two new appliances against them, and then decided to try to build it  ourselves out of Linux parts we found laying about the OSS World.

 

On the assumption that a picture is worth a large quantity of words, here is a Dia diagram, saved as a PNG, that I drew of the new beastie:

http://lh3.google.com/stevecarl/RzTqSfdxlLI/AAAAAAAAADI/S3txOOy_KrQ/s144/lcfs-ha.png

Words Anyway

 

And now that we have that picture, a fair quantity of words is probably in order explaining what in the heck that is all about.

The servers are Sun X2200 M2's running CentOS 5 and Cluster Suite. An X2200 is small, but it is big enough to keep the gig pipe full, so we do not need anything bigger.

 

To make all the cluster stuff happen, we are using Cluster LVM on top of the Linux multipath drivers. Each device has two paths because there are two switches in the SAN fabric, and each cluster node is hooked to each switch. GFS layers on top of that to create the global file system across all the nodes.
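As a rough sketch of how a stack like that gets assembled (device, volume group, and cluster names here are made up; the actual build steps on our cluster may differ in detail):

multipath -ll                               # verify each LUN shows two paths
pvcreate /dev/mapper/mpath0                 # physical volume on the multipath device
vgcreate -c y vg_nas /dev/mapper/mpath0     # -c y makes the volume group cluster-aware (CLVM)
lvcreate -L 500G -n lv_data vg_nas
gfs_mkfs -p lock_dlm -t rnd_cluster:data -j 3 /dev/vg_nas/lv_data   # three journals, one per node
mount -t gfs /dev/vg_nas/lv_data /export/data                       # repeated on every node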

 

Node one runs NFS. Node two runs Samba. Node three runs the backups. Should the NFS or Samba node fail, the service will restart on one of the surviving nodes, and since the file system is global to all three nodes, no magic occurs at the service level to move the file systems or anything.

The Spinning Bits

 

The disks are the ever nifty Apple Xserve RAID units. We burn a fair amount of capacity for HA on these: the RAID 5 is 5+1, with a hot spare, for a total of seven disks per RAID controller. The disks are 750 GB SATA. There are 14 disks in each shelf, and we have two shelves (so four RAID controllers and 28 disks in all), for a total of 15 terabytes of capacity before formatting: five 750 GB data disks per controller, times four controllers.

 

There is a single point of failure here: there is a single RAID card over each side, and so even though there are two cards in the shelf, each card only manages half the disks. They do not talk to each other. This is not Enterprise grade storage.

 

We mitigate that risk by having bought the spares kits: We have spare disks in carriers, spare RAID card, and spare RAID card battery. This was part of the rethink: we decided to save some money on the disks but have a recoverability plan. It is not that it will never go down, but that we can get it going again quickly. The gear is all on three year hardware support, so broken bits are a matter of RMA'ing things, and everything should be designed to return to service quickly.

 

We have over a year of runtime on these units on the second tier storage, and have not had any serious issues thus far, thus our willingness to try this configuration out.

Testing and Migration

 

In addition to all the run time on similar gear, we have been beating the heck out of these. By “We” I of course mean “Dan”, the master NAS blaster. Here is his Wiki record of the problems and the workarounds from the testing:


NFSV2 "STALE File Handle" with GFS filesystems

Problem Description

Only when using NFSV2 over a GFS filesystem! NFSV3 over GFS is OK. NFSV2 over XFS is also OK.

 

We were able to duplicate this from any NFSV2 client:

  • cd /data/rnd-clunfs-v2t - To trigger the automount

  • ls - Locate one of the test directories, a simple folder called "superman"

  • cd superman - Step down into the folder

  • ls - Attempt to look at the contents; returns the error:

ls: cannot open directory .: Stale NFS file handle

Note: This might be the same problem as in Red Hat bugzilla #229346

 

I am not sure, and it appears to be in a status of ON_Q, so it is not yet released as an update. If this is the same problem, it is clearly a problem in the GFS code.

Problem Resolution

 

To verify that this was indeed the same bug as Red Hat bugzilla #229346, I found the patch for the gfs kernel module and applied it to our CentOS cluster.


The patch does indeed fix this problem!

 

Instructions to apply the patch:

  • Download the gfs kernel module source, gfs-kmod-0.1.16-5.2.6.18_8.1.8.el5.src.rpm (if your kernel is 2.6.18_8.1.8.el5)

  • rpmbuild -bp gfs-kmod-0.1.16-5.2.6.18_8.1.8.el5.src.rpm - Unpack the source rpm to /usr/src/redhat/SOURCES

  • cd /usr/src/redhat/SOURCES and add the following patch:

Filename: gfs-nfsv2.patch (as far as I can tell, all it does is remove a file handle type/length sanity check that the NFSV2 file handles were failing):

--- gfs-kernel-0.1.16/src/gfs/ops_export.c_orig 2007-08-31 09:43:29.000000000 -0500
+++ gfs-kernel-0.1.16/src/gfs/ops_export.c      2007-08-31 09:43:52.000000000 -0500
@@ -61,9 +61,6 @@

        atomic_inc(&get_v2sdp(sb)->sd_ops_export);
 
-       if (fh_type != fh_len)
-               return NULL;
-
        memset(&parent, 0, sizeof(struct inode_cookie));
 
        switch (fh_type) {
  • cd /usr/src/redhat/SPECS and make the following changes:

Filename: gfs-kernel.spec

Name:           %{kmod_name}-kmod
Version:        0.1.16
Release:        99.%(echo %{kverrel} | tr - _) <--Change the version from 5 to 99--<<<<
Summary:        %{kmod_name} kernel modules

Source0:        gfs-kernel-%{version}.tar.gz
Patch0:         gfs-nfsv2.patch                <--Add this line--<<<<
Patch1:         gfs-kernel-extras.patch
Patch2:         gfs-kernel-lm_interface.patch

%setup -q -c -T -a 0
%patch0 -p0                                    <--Add this line--<<<<
pushd %{kmod_name}-kernel-%{version}*
%patch1 -p1 -b .extras
%patch2 -p1
  • rpmbuild -ba --target x86_64 gfs-kmod.spec - Build the new patched kmod-gfs rpm package

  • rpm -Uvh /usr/src/redhat/RPMS/x86_64/kmod-gfs-0.1.16-99.2.6.18_8.1.8.el5.x86_64.rpm - Install the patched gfs module

  • depmod -a - Required step so the new module is seen on reboot

  • Reboot the system to load the new kernel
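One quick sanity check after the reboot, to be sure the kernel is really picking up the patched module and not the stock one:

rpm -q kmod-gfs      # should report the -99 release built above
modinfo gfs | head   # shows which gfs.ko the kernel will load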

NFSV2 Mount "Permission Denied" on Solaris clients

Problem Description

Certain Solaris clients, Solaris 7, 8, and maybe 9, fail with "Permission Denied" on mount when using NFSV2. Apparently the problem is a known issue in Solaris when the NFS server (in this case CentOS) offers NFS ACL support: Solaris attempts to use NFS ACL's even with NFSV2, where they are NOT supported.

This problem has been fixed in more recent versions of Solaris (like some 9 and all 10+).

Note: This problem was detected on a previous test/evaluation of Red Hat AS 5 and was expected with CentOS 5.
Disclaimer: I think this is an accurate description of the problem.

Problem Resolution

 

Assume Solaris NFS clients will NOT use NFSV2? In practice that means forcing the Solaris clients to mount with NFSV3.
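On the Solaris side that would look something like the following (server and paths are illustrative):

# One-off mount, forcing NFSV3:
mount -F nfs -o vers=3 rnd-clunfs:/data /mnt/data

# Or pinned in /etc/vfstab:
# rnd-clunfs:/data  -  /mnt/data  nfs  -  yes  vers=3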

Cluster Recovery Fails on "Power Cord Yank Test"

Problem Description

 

The cluster software must fence a failed node successfully before it will recover a cluster service, like NFS or Samba. The fence method used in our configuration is the Sun X2200 Embedded LOM via remote ipmi. When the power cord on the X2200 servers is disconnected, the ELOM is also down. This causes the fence operation to the ELOM to fail. The cluster configuration allows multiple fence methods to be defined to address this issue. But there appears to be a bug in this version of the software that prevents the ccsd (Cluster Configuration Service Daemon) from answering the fenced "ccs_get" request for the alternate fence method when a node has failed.
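For context, the two fence methods are declared per node in /etc/cluster/cluster.conf, roughly like the sketch below (device names and the Brocade port are illustrative, not our actual config). fenced is supposed to fall through from method 1 to method 2, and the ccs_get failure is exactly what breaks that fall-through:

<clusternode name="rnd-fs01" nodeid="1">
  <fence>
    <method name="1">
      <device name="elom-fs01"/>            <!-- fence_ipmilan, via the embedded LOM -->
    </method>
    <method name="2">
      <device name="brocade-sw1" port="3"/> <!-- fence_brocade, disables the SAN fabric port -->
    </method>
  </fence>
</clusternode>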

Problem Resolution

 

None at this time. Waiting on a fix from CentOS. The assumption is that we can run with this configuration, but the cluster will not fail over services if a power cord or both power supplies on one of the X2200 nodes were to "fail". This would result in a service interruption.


And there you have it so far: We have our team's home directories running on the new servers, and other than being fast, we see no real difference yet. We are trading a few problems on Tru64 for a few possible problems on CentOS 5, but we assume that we will either be able to work around them (such as making Solaris clients use V3, which they tend to prefer anyway) or, for the power cord issue, get a patch to Cluster Services at some point.

 

Next time: “The Numbers of NAS” -or- “Speeds and Feeds for the Geeks who just want to know”. And the new way we are going to do snapshots. And if I have time and space, some stuff about storage virtualization.

Share This:
Getting caught up with what has been happening: three days, over 200 sessions, only one me to get to it all.

When I first heard about the BMC / RealOps deal, I had this mental impression from the term "run book automation" of this agent that watched what people did during the day, and then distilled that into RunBooks. It made no sense of course, but it was a sort of funny mental image: I kept seeing things like run books for calling for Pizza Delivery, or run books for "Hey have you seen Sally recently?" kind of things. Part of the reason I had that image is because it seemed an easier thing to do than what the real Run Book product does.

 

I sat in a session where the covers were peeled back to reveal how the product works. How it scales, how it interacts with the heterogeneous environment, how it automatically opens Remedy cases when the automation can not fix things... all sorts of stuff. The way things like XML were leveraged to make abstractions of some very real, very persnickety interactions was just amazing.

 

The product comes with all sorts of things already pre-modeled too (called "Workflows"): the example shown was a ping failure, where the tool suppressed the alert, and then walked through an ITIL structured process to diagnose and attempt to fix the error. It reminded me a little of the way that Patrol Classic with its agents could deploy specific remedies to specific problems automatically, but raised to a different level. This was not running in the agent; this was not even agent based. Even better, all the abstractions have been dealt with so that this model will work no matter what the specific failing end node is: it knows what the end node is because the CMDB is leveraged. RunBook already has the knowledge to try various things, and if it can not fix it, it can then unsuppress the alert and open a ticket.

 

Even better: the new ticket is opened at the correct urgency level because the CMDB and BMC SIM know just how many business critical processes are affected.

 

Even better: new workflow models can be written all the time, and the RunBooks folks know about the BMC Developer connection, and are looking to set it up so that people can share their automation! It was music to my OSS ears.

 

Virtualization, Capacity Planning, and RunBook

 

Another set of sessions I attended was about how to use our Performance Assurance product (it will always be Best/1 to me..) to model things like VMware ESX, IBM AIX with HMC's, Solaris Zones, and the like. I was starting to wonder towards the end how to deal with things like DRS and V/Motion, not just in a capacity planning sense (although with my heterogeneous workloads of pretty much every X86 OS there is, that was a factor), but in how things like the CMDB stay up to date when DRS moves workload to a different system.

 

It turns out that is a trappable event, and that can trigger us to call RunBook. RunBook can look at what happened, and trigger discovery, and discovery can update the CMDB. Cool. Sure, you could trigger discovery directly I suppose, but RunBook adds this whole new level of intelligence to what happens, so that only the things that need to happen actually occur.

 

RunBook and Emprisa

 

Another recent addition to BMC's BSM portfolio was Emprisa Networks. I'll be honest, I knew nothing about what that was about. I decided to go to a session to fix that, and now I feel a bit silly for not having known just how cool this addition to our BMC family is. Emprisa can audit and manage all the networking gear on the network. It knows about things like uncommitted config changes in routers, and can enforce policies so that if someone makes an unauthorized config change (even if they remembered to commit it) it can roll it right back out. Ultimately the Change Manager rolls out the config changes via the Emprisa tool. The SmartMerge(TM) core tech of this is very very nice: it builds all the scripts required to do everything, and then you can see exactly what it plans to do before it does it. Eventually, as trust builds, some folks apparently just let it do the right thing. It is all logged, and it interacts with Remedy to make sure all the right permissions are in place first, so BSM is not violated by automation, but rather implemented by it.

 

This was the part I really liked: during the session, the presenter (none other than Emprisa co-founder Sherrie Woodring) sat and had a great conversation with one of the main guys from RunBook about building the things into RunBook to make it ready, out of the box, to work with Emprisa. I love working with people who automatically start doing the right thing before they are even asked!

 

Did I mention all this is being developed in the R&D Labs my team supports?

 

Back to the Gulf Coast next week, and a new post about the final config for the new Linux server we are going to use to replace our Tru64 TruCluster.

Share This:
Day  one at BMC UserWorld, and how BSM applies to things other than a production shop.

 

I started “Adventures” a couple of years ago to talk mostly about Linux and Open Source, as they were used inside my organization of R&D Support. Back then the idea was mostly to talk about OSS and Linux in particular, as they drove innovation and had a positive ROI. I have since of course branched to many other places, and so for example there has been a great deal about Linux as an Enterprise desktop / laptop OS.

 

A large number of the folks that read this blog are people who are doing many of the same things that I do. That only makes sense. They are often IT people in unusual IT shops: Universities, heavy duty cutting edge R&D in both hardware and software, and the like.  People who are unafraid of change, often are change drivers, and are usually the bleeding edge / early adopters at their home companies.

 

Being in such shops, you may wonder how such things as ITIL, BSM, and such relate to your daily jobs. If you are non-traditional IT, then how does this huge thing called BSM really relate to you? Is BSM just about traditional IT: the financials and sales and marketing and email and so forth? Or does it extend into R&D IT, Customer Support IT, Training / Education IT, and the other more esoteric areas where we do use computers: we may have data centers and stackers and networks and whatnot, just like the traditional IT folks, but we use them mostly in very non-production, non-traditional ways.

 

The answer of course is: “It Depends” (P). The patented response of consultants the world over.

 

Measure the result that you want

 

On a small scale, say fifty computers, there is probably not a great deal of ROI in working through an entire BSM rollout. The company/school/organization is small enough that the person signing the checks for the gear most likely knows the person wanting the gear, and knows what they plan on using it for. They will have already discussed whether the new gear is worth the benefits it will deliver (even if, because this is a personal relationship, they do not think of it in exactly those terms). The person writing the checks won't sign them unless they know that the new thingy is something they need to succeed. Also, in a small organization, the person making the request probably has a pretty good idea what the fiscal impact is going to be.

 

As companies grow, the layers between the check writer and the person doing the technical function with the gear grow. At some point, the check writer will look at a purchase request and ask “What in the heck is this for?” because they can not know everything that is going on anymore.

 

At that point, there starts being value in implementing certain parts of BSM. At this point, the critical thing to remember at all times is to measure for the results that you want.

Here is my real world example: Virtualization. I am attending all sorts of sessions here at BMC UserWorld because while I have used our computer modeling tools in the past (back when they were called Best/1 in fact), I have not yet loaded up and used the tools that model VMware.

 

Our VMware farm started over a year ago, and it has about 20 medium to large computers in it right now. Sun X4600's, Dell 6950's and the like on the large side, but also many smaller systems such as Dell 2950's and Sun X2100's. There is a definite difference in CPU architectures, and in the ratios of CPU (as measured in SPEC-Int), RAM, and I/O capacity between all these platforms. We have retired over 500 old, old computers with these VMware computers, so at that simple ROI level, we have already come out ahead. We got back scads of data center floor space and even decommissioned four small labs, power was not consumed that would have been, etc. I did not in fact actually need any BMC Software products to measure any of that (although we did use Remedy AM to track both the real and virtual assets, and that will be of huge benefit while we are working on setting up, among other things, SRM).

 

The farm is now big enough and worse, diverse enough, that I need to start figuring out some things about the farms behavior in the future. The diversity is not just in the underlying hardware platform. We support R&D on all their supported platforms and product lines. There is a huge mix of Linux, Solaris X86, and various MS. BMC has over 600 products, so the applications running are even more varied. These VM's are going to behave differently from each other, to be sure. Some will be RAM intensive. Others will be CPU hogs. Others will be massive I/O engines. Some will be various combinations of the above. Some will be only mildly active, but still doing useful things. Others will be stone dead most of the time.

 

With one VMware server there is no need to measure anything with BMC PA. HA, DRS, V/Motion and so forth are not options. The built-in tools for system utilization are all you need.

 

In the bigger scenario, it is far more complicated. HA only works between like-CPU'ed systems. You can not have a Sun X4600 (AMD) fail over to a Dell 6950 (Intel), for example. There are restrictions in what V/Motion can hot-migrate, and what requires a cold move. And you have to have a SAN and enough HBA's on the servers to make it all work too. You want a new server? You have to be able to measure what you have so you can tell the check writer why you need more of them, and you also need to be sure you are buying the right thing. Example: it does no good to add more virtual server disk images to an overloaded SAN device.

 

One of the sessions (actually two of them, it was a double session) I attended today addressed exactly this topic. And I learned an interesting thing I did not know before about the way in which VMware servers are measured. I did mainframe VM for the first 15 or so years of my career. I know about things like the Virtual to Real ratios, wall clock elongation, and so forth. The MF folks solved that years ago by making the guests aware they were running in a virtual world so that the performance data could be dealt with accordingly. It is not trivial to do, but can be done.

 

I assumed that the VMware world was going to see this exact same thing, and one of the things I was not sure of was how the problem of multiple different guest OS's from multiple different vendors was going to be dealt with for hypervisor awareness. The answer did not surprise me much. <linux content> Linux is getting there first </linux content>. MS Windows is going to take some heavy lifting. As I write this I realize I did not ask about Solaris X86, but I assume that, being open source, it will get there pretty quickly if it has not already.

 

The good news is that you only need inside-the-guest visibility for some pretty detailed application modeling, and you can do a credible capacity plan at the VM level using the Performance Assurance tools as they exist today.

 

That circles back to BSM and measuring the right thing: If what one needs to justify is new server investment, then being able to show that the servers are consumed at the server level is probably sufficient.

 

Measure People Correctly Too

 

ITIL does not really purport to be a way to manage people, but the same concepts, or perhaps philosophies, apply. I am fairly far afield from a normal "Adventures" post already, so what the heck. If you are in a non-traditional IT shop, and part of or managing a team of people in that shop, then these same things apply to the team as they do to the computers. You have to measure to gain the results you want. While I do not personally think that a singular measurement of ticket closure rate is a great measure even in a traditional IT shop, it pretty much flies out the window in a non-traditional one. It may not seem intuitively obvious: I usually tell new members of R&D Support who came from production environments that it takes about six months to make all the mental shifts required to get one simple concept: in R&D, flexibility rules. Availability is not even number two or three on the list.

 

This changes everything, from your SLA on down. I am sure this same thing is true of other environments besides R&D too. A school with a computer lab, for example: they will be very good at imaging the hardware back to known states every semester, or even more often, because of how dissected the students' computers will be. You could not measure the person responsible for that lab on computer platform uptime. The end result would be students afraid to look at the computers for fear of incurring the wrath of the lab admin.

 

If you measure and reward (or I suppose punish) people based on closed ticket counts as a primary measure, then ALL you will get is closed tickets. Lots of them. The customer will not be happy: I guarantee it. You can stand in front of a group of customers with an OpenOffice Impress slide showing ticket closure rates mapped to SLA all day long, and you will never convince them that you are doing the job for them. Really: Your customer, no matter who they are, does not care one little bit about your closure rate. If you want satisfied customers, you have to measure a thing called “Customer Satisfaction”.

 

Nothing makes a customer madder than being asked if a ticket can be closed before the problem is solved. Well, maybe closing the ticket without asking. Measure and compensate people based on ticket closure rates, and you get the following:

 

  1. Some people will jump on the easy tickets first, and leave the hard stuff till someone forces them to take it. Actually, this one is interesting, because it also can work against them: I watch who does this. I watch who takes the hard cases. Guess which one I reward?

  2. Some people will close tickets without asking if they solved the problem. This is against everything ITIL and BSM talk about, but it really happens when people think that the way they are being measured for their compensation and promotion is largely ticket count.

  3. Some people will ask to close a ticket that is about to pop the SLA timer, whether or not the problem is solved. They may offer to open a new one, but still, the customer does not want to have that conversation. They want their problem solved.

 

Actions are important here. Organizationally, one can not say that ticket counts are less important than other things, and then, when bonus time, raise time, or public recognition time arrives, reward the people with the high ticket closure rates. In R&D Support, I measure my team based on customer satisfaction. Remedy sends a survey to each customer with each closure. Obviously not every survey is answered: I only need a statistically significant response rate though.

 

I do look at tickets. At review time I read through all the tickets that a person worked over the year. I look for good ITIL/BSM practices, like whether a ticket is associated with an asset or not. I read the work log to see that the customer was kept up to date along the way on a long running ticket. I look at the closure to be sure it is not something like "Done" but something that is useful for the Service Desk in the future, in case of similar problems. I look at the complexity of the ticket. And of course I look at the customer satisfaction rating.

 

I will give a higher rating to someone with 10 tickets than another with 100, if the 10 were high in difficulty and done right and the 100 were system reboots.

 

The point of that sidetrip is only this: measure the right thing; it does not matter if you are talking about people or computers. ITIL at its core is a set of ideas / philosophies / learnings that can be generalized to all sorts of situations.

 

End of Day One.....

Share This:
On the road with the Dell D620 laptop and a freshly installed Mint 3.1. OpenSUSE 10.3 put on hold till the IBM T41 can be brought into play for it.

In my last several posts I was looking at OpenSUSE 10.3 on the Dell D620 laptop. It was my original intention to leave 10.3 on the D620 when I traveled to Pune, India this week. When it came down to the wire though, I decided that I needed to be sure that my laptop was 100% for the next two weeks, and did not have time to mess any further with 10.3.

 

This is not a strong indictment of 10.3. I expect when I get off the road and have a moment to install it on my IBM T41 that it will work flawlessly. 10.3 had many good things about it and I would like to explore it more, but I was nervous about relying on it specifically on the D620 hardware while I was on the road. I thought briefly about taking the T41 with me and doing the install while traveling, but I already had my personal Apple in the backpack, and with the D620, it was full. No room for another, not to mention what a pain it is to take the laptops through airport security. Two laptops in one backpack are not light either.

 

This week I am in Pune, India, where BMC has an office. R&D Support has a team here, and I am here to work on various teamwork related issues that in part include a global rollout of ITSM Version 7. I am not really spending much time thinking about Linux per se, more ITIL than anything else. It was the need to not have to worry about laptop issues that drove me to convert the D620 back to Mint at the last minute.

 

While I was at it, I went ahead and grabbed the new 3.1 version. I knew 3.0 worked well on the D620, and so just sort of assumed 3.1 would as well. That has in fact been the case, and the laptop has largely faded over the last week into just being a device I need to do my job while on the road.

Evolution has to be working of course, and Mint 3.1 still ships version 2.10 by default. Ubuntu 7.10 has 2.12, but Mint 4.0, based off 7.10, won't be out till November.

 

OpenSUSE had Evo 2.12 out before Ubuntu, and I was loath to move back to 2.10 since that was a thing that was working fairly well for me under OpenSUSE. It was a tug of war: a fairly well working Evo 2.12 on a not-as-well-working OpenSUSE, versus a well working Mint with a downlevel but working Evolution. I finally went with Mint and the mental comfort of knowing I would not be having any D620 specific issues when I had no time to work on them. I needed Linux and Evolution to "Just Work"(tm).

 

Being literally on the other side of the planet from the MS Exchange server that stores my email was a lingering concern, given my recent spate of Evo-related issues. As part of the old joke goes, IBM has been working on upgrading the speed of light but has not yet succeeded (a CMG bit of humor about issues involved in extra-galactic memory page swapping). Adding a quarter second round trip to each packet of the server-to-client email conversation was probably not going to enhance the stability of Evo. As my mother says, this was a "Wasted Worry"(TM, My mom). Evolution 2.10 under Mint 3.1 is fast, and stable even when accessing MS Exchange on the far side of the Earth.

 

I planned to be hooking up to projectors while here, to go through some Wiki doc with the team. That one factor, more than anything else, is what tipped my decision to load Mint. I am sure OpenSUSE can do this too, but I was having trouble with it. Given time, I could have figured it out, but I was out of runway. Well. I needed to be on the runway for the mind-and-seat-numbing 20+ hour flight. I knew Mint handled external monitor switching on the D620 without issue, so I went with it. Sure enough, I needed to hook up to this previously-unknown-to-me projector, and sure enough, it all worked fine. It even handled the difference between a 1440x900 internal panel and a 1024x768 external projector without major issue.

 

Something Mint Did Not Handle

 

Lest I be considered a Mint Pollyanna, I should note that Mint did not  handle a Dell docking station with complete poise. I was assigned a  place to sit here, and my workspace included a Dell docking station plus  external monitor. This eased the power cord issues (110v plug tips in a  220v country: Yes, I have an adapter. But who wants to crawl under the  desk every day to plug in! There is 220v down there!). I took a moment  to try and set up dual head mode, and run with both panels. Thanks for  playing. But no.

 

Mint did tell me that it could not validate the config I was using,  but I told it to go ahead anyway. Silly Geek. X crashed, and would not  return till I deleted /etc/X11/xorg.conf and let a reboot rebuild the  file.

 

Fine. I'll just use the dock as a charging station then.

 

Having said that, here is something it did handle in the same docking  department. I was going to hot plug into the dock, and someone standing  with me at the time told me "That won't work. You need to shut down  first!". They have XP on a D620, and the same dock as this one. I  plugged in, and the screen went black. "I told you so" look was  delivered by special airmail. I pressed Fn-f8, the screen returned, and  Owl delivering the post took off realizing they were not in the right  place. OK, it wasn't really that dramatic. Can't resist a good Harry  Potter reference though. Apparently this hotdock operation does not work  on XP, or at least not well. Surprise!

 

I've Got the Power

 

The D620 came with the small four cell battery. There is an optional six cell, but I don't have it. Under MSWin XP, the undocked time when being actively used, panel fully bright, is about two hours. Under OpenSUSE it was about that too. Under Mint, it is more like three. Better CPU throttling, or different services started, or what... I don't know why it is different. It is not something I expected. Still, I need to get the six cell battery, because that would be "Even Better"(tm).

 

Another Dimension

 

As I have noted in my personal blog, I am not a big user of all the 3D  desktop stuff. On OpenSUSE, Compiz Fusion was a recipe for frequent X  restarts on the D620. I assume that is all tied back to the graphics  card issues I was having. Beryl under Mint works all day long every day  though, so the few features that I really do like (f9 "Expose", task bar  hover for window preview) are available.

 

Next Journey

 

I leave Pune next Tuesday, arrive home Wednesday, and then fly again  the following Monday. This time will be to Vancouver for BMC Userworld. I  have never been to a Userworld before, and I am looking forward to it.

 

I  am feeling a slight tug towards trying Ubuntu 7.10 for that leg of the  trip: Based off the Dell Mint experience, and what I have read of 7.10,  it should work fine. But I still need a trouble free experience so I can  focus on the event, so ... who knows.

 

That is on the other side of a long plane ride, where I will have  plenty of time to think about it.

Share This:

Like Fedora and Mint before, this post is about the trials and tribulations of using OpenSUSE 10.3 as a full time Linux desktop in a corporate environment. That means among other things that I won't really be paying much attention to things like whether or not it can play MP3's, and will be more interested in whether or not it can read and write standard document formats, as well as non-standard but prevalent formats such as MS's .doc. (I call a format standard if it is open, and non-standard if it is not. Just because a format is everywhere does not mean it is a good idea to use it. Many things are everywhere.)

 

  Evolution 2.12.0!

 

Let there be no doubt that SUSE is the company that owns Evolution. This is the newest, freshest revision of Evolution in my large and very current stable of Linux desktops:

 

    rpm -qa | grep -i evolution:

    evolution-exchange-lang-2.12.0-5
     evolution-exchange-debuginfo-2.12.0-5
     evolution-data-server-devel-1.12.0-5
     evolution-exchange-2.12.0-5
     evolution-data-server-1.12.0-5
     evolution-pilot-2.12.0-5
     evolution-data-server-debuginfo-1.12.0-5
     evolution-exchange-doc-2.12.0-5
     evolution-2.12.0-5
     evolution-devel-2.12.0-5
     evolution-webcal-lang-2.12.0-5
     evolution-data-server-doc-1.12.0-5
     evolution-sharp-0.14.0.1-5
     evolution-webcal-2.12.0-5
     evolution-debuginfo-2.12.0-5

 

I have all the debugging versions on, and this is the only distro I have that ships, out of the box, a debugging version of Connector itself (see the second line of the list above)! In four days of hard use, I have had one crash, and bug buddy said something new: new to me anyway. It said it did not have any useful diagnostic information in the dump, and so it was not going to do anything. Interesting that with all these debuginfo's on, there was nothing useful!

 

Evolution appears to be fast and mostly stable. The crash was on day one, and it has not happened again. Recent changelog activity in the bugs I reported has my reports as dups of closed problems, so I may be chasing a ghost now.

 

Every now and then Evolution goes off into la-la land. I have no idea why. I tried cleaning out all the .gconf and .evolution files in case it was cruft left over from the previous Evolution, but that appeared to make no difference.

 

The only other issue I have had is minor but annoying. I have my Mint 3.0 desktop, running Evolution 2.10, filtering my inbox. All my rules for mail handling run from there, all my archiving. I lopped out 500 messages yesterday, and suddenly 2.12 could not see *anything* in the inbox until new messages appeared. It still can see only those messages that have arrived after that "lopping" (technical term for email erasure) from the other machine. Being read or unread makes no difference. Looks like I'll have to not delete so many emails at the same time, or be logged out from the OpenSUSE box when I do it.

 

  As an experiment, I defined a second inbox, this time using IMAP. IMAP can see   everything in the Inbox, even when WebDAV can't. In fact, I have moved to   using IMAP to read and send, and only use WebDAV (Connector is WebDAV based)   to read and accept meetings and access the address book. A little clunky but   workable and IMAP is faster.  When I travel and access my inbox from the   road, IMAP is always better on a high-latency line, so this is not such a bad   thing.

  OpenSUSE is smarter than me, therefore it does not obey me.

 

When I received the Dell D620, I was informed by someone who already had one that it had a 1280x800 screen. I never questioned that. Mint and Fedora set it up that way by default. So did MS Windows XP. OpenSUSE 10.3 insisted it was in fact a 1440x900 screen, and would not set it up any other way. It left off the Intel "915resolution" video bios patch in /etc/sysconfig/videobios. It had tiny little pixelated fonts in both Gnome and KDE that my eyes just could not read. I decided OpenSUSE was wrong, and fought it. I forced the videobios patch on, set it to 1280x800, and the Gnome and KDE fonts looked much nicer. Very smooth and round, though with occasionally odd hinting.

 

Laughing in triumph, I proceeded on to other things. SUSE laughed back: Gnome started crashing every other minute. I switched to KDE. It looked nice, and was stable. But every now and then it would, for no apparent reason, switch resolutions on me... to a really weird 1396 x something-or-the-other. OpenSUSE was not minding me. Not one little bit. It would not stay in 1280x800.

 

Light began to dawn. I researched the D620, found the 1440x900 option, took back out the videobios patch, and instead overrode the default font sizes and default DPI. Fonts were now bigger, smoother and less pixelated. Not great, but better.

 

  Funny thing is that the fonts looked better at 1280x800, like the hardware was   doing some very nice interpolation. The fonts don't look too bad now, and the   crashes have stopped so I'll leave it this way.

 

OpenSUSE is the first distro that appeared to know what the screen hardware really was, but it may be that Mint and Fedora just behaved better when I told them a wrong screen resolution, hiding the reality of the hardware from the loose nut at the keyboard. (Update: Future me has now installed Mint 3.1 on the D620, which knew exactly what the hardware was and configured it correctly out of the box. Oh well.)

  KDE versus Gnome

 

I have no real deep-seated preference these days for either GUI. I like them both, and I use them both. SUSE used to be a KDE-first Distro, but with the purchase of Ximian, it started to lean towards Gnome. In fact there was a real firestorm a while back when it looked like SUSE was going to go Gnome only, but that turned out to not be the case.

 

  My preferences have very little to do with my Dell D620's. Gnome is just not   stable on it even with the resolution sorted, with all sorts of hangs and xorg   crashes. Better, but not great. Bug Buddy never gets invoked for any of it, so   I have nothing to submit. I suppose it is possible that the resolution issues   have left bad bits laying about for Gnome to trip over, but I do not have time   right now to figure that out.

 

  Now that I have the default screen resolution and DPI sorted out, KDE not only   looks nice, but is very stable. Evolution looks better under KDE than Gnome.

 

KDE did flake out when I added the screen brightness widget to the panel. Crash city. A trip back to Gnome (to have a GUI), deleting the .kde* files, switching back to KDE, re-setting up everything, and totally avoiding the brightness app, and now everything is OK again.

  Long Live OpenSUSE, but not on the D620

 

There is much to like about OpenSUSE 10.3. Yast is far faster than it was. Alternate repositories are much easier to get going. I like the currency of Evolution even if it is going to skew my attempts to report the Connector issues I have been having. That may be no big deal though: recent comments have the stuff I have submitted so far as duplicates of other issues, and those base issues closed as resolved. I might be wasting the Gnome project folks' valuable time (although it appears that they did not know it also happened on PCLOS: in the parts I read more closely, the Gnome Project had the Evo crashes filed against Ubuntu and Fedora).

 

  For all its spicy OpenSUSE goodness, Mint is better on the D620 hardware.   Better than OpenSUSE, better than Fedora 7, better than PCLOS. I think when I   get back from Pune India and Vancouver, Canada, I'll put OpenSUSE on the IBM   T41 (Where SUSE has worked very well in the past), and Ubuntu 7.10 or the next   version of Mint on the D620. If I find a travel power cord, I may do the D620   to Mint conversion on the flight across the Atlantic. 24 hours on an airplane:   Gotta have something geeky to do! Besides, I have to check into the screen   resolution thing to be sure I am not just mis-remembering what Mint was doing. (Update: Future me again. Yeah. I could not wait. I installed Mint late in the night. Mint got it right the first time.)

  Travel plans

 

  I'll be on the road for most of the rest of October. First I am going to Pune,   India. I have been out of the US before, but never this far away. We have a   BMC office there, and I will be going there to meet more of the global R&D   support team. After that I will be headed over to Vancouver to attend BMC's   UserWorld! I am interested in this because I have never attended a BMC user   conference, despite having been at BMC for over 18 years. All this travel   means schlepping both my Apple and my D620: My backpack looks like I am headed   to give a lab at Linuxworld or something. I'll post from the road as I can.

Share This:
Last week's "failed" test plan now leads to a new test plan. This time, more platforms are included to help isolate Evolution issues from Distro issues.

 

Let's see: the big hand is on the 1, and the little hand is on the 50 so... well I'll be. Welcome to post number 150 at talk.bmc.com... not counting the 17 or so I currently have on my personal blog over at on-being-open.blogspot.com of course. "Adventures" kicked off in September of 2005, so clearly there has been no issue finding things to talk about every week here. :)

 

  I have some updates on where we are at with the Linux NAS server that I'll get   posted soon, but I wanted to put a diagram with it, and so that means taking   the time to draw them with   Dia,   and that just has not happened yet. The other topic I have been working on   here lately is Evolution 2.10.3 against MS Exchange 2003.

  Multi-Distro Evolution testing

 

  In my sequential testing, first Mint, then PCLOS, now Fedora 7, all on the   Dell D620, I have seen what looks to me to be the same problem, repeating over   and over: An instability in the Exchange Connector that requires totally   recycling Evolution to recover from.

 

I have blathered on and on here about Open Source and Community, and people doing what they can. I have finally decided that this is a case of needing to step up and do what I can. I am not a C language coder. Not beyond 'Hello World' anyway. I have written in many languages over the nearly 30 years I have been in IT, but that is not one of them. Truth be told, as much as I like to think of myself as a hacker, I am more of a hacker-arounder (tm). Being a manager, even a technical manager, for over 20 years and not in the coding trenches means I am not going to be able to fix the problem(s) in Connector myself. Well... I *could*, I just first would have to come up to speed on C++. Then the Connector project, and the WebDAV protocol. Might take a while.

 

  What I can do is try and get the diagnostic information to those who can fix   it. I am technical enough to be able to install and configure Linux, and   follow instructions for needed diagnostic information. I clearly can find   problems with regularity!

 

  My working theories about the Evolution Connector problems are currently:

 

  • Evolution Connector does not get as many development cycles as it needs,
      • perhaps because it is not as easy / interesting to work on as making Evo work against Open Source mail servers or Groupwise,
      • perhaps because the MS Exchange server environment is hard to come by for the currently active Evolution developers.
  • Neither the KDE nor Gnome environment makes a difference in the stability of Connector.
  • The Connector stability issue is not related to any particular Distro.

 

The first assumption is not testable. I base it on the fact that most Open Source stuff with problems of this type gets fixed far faster than this one has been. My general observation is that a continuing problem in a given area of open source maps to a lack of critical mass: not enough eyes are looking at it. I am totally guessing. This could also just be a real snakey, twisted problem. Or both.

 

Last post I was starting to feel like the latter theory was not valid because Fedora 7 had been rock solid stable on Evo Connector. It has been better, for whatever reason: it went way over a week before Connector started crashing, and when it did start crashing, it did not happen nearly as often as it did on Mint or PCLOS.

 

  It does crash, and when it crashes, it looks like the same problem, at least   from the outside looking in.

 

One advantage of working in R&D Support is that old hardware goes into bone piles, and I can go dumpster diving. I'm not proud. Linux doesn't need all that much hardware to run on either. I am also technical enough to fix broken hardware. To create a way to test my theories, as well as report the issues as they occur, I needed to have several Distros running at the same time, rather than serially on my D620. The dumpster yielded some new victims. It would be optimal if the hardware all matched of course, but that is not possible. For office space reasons, I went with all old laptops. I have to work in here in the daytime!

  Combo Burrito

 

  Here is the current lineup:

 
   Dell C400 running Fedora 7: I wrote about this computer   a   while back on my personal blog, when I was comparing Mint and Fedora.   Fedora is happy here, and supports all the hardware out of the box, so no need   to rock the boat. 1.2 Ghz Processor, 1 GB RAM, so pretty decent. I have an   open bug on Fedora, so this one is waiting should I need to do any additional   debugging.

 

  Dell Inspiron 8100 running Mint 3.1: With a 1.2 Ghz processor, 512 MB   RAM, and a dim but otherwise lovely 1600x1200 screen, Mint 3.1 (brand spanking   new...) installed late one night on the   8100   without issues. With a dedicated graphics card from Nvidia, I was hoping Beryl   or Compiz would work, but there is not enough memory on the card to deal with   the resolution and the 3d compositing. Nope. No big deal. I was mostly   curious. The keyboard on the 8100 is very clunky and metallic sounding. So far   so good. No Evolution failures.

 

  Dell Latitude C610 running PCLOS: I liked PCLOS on the D620, and VMware   was easy to get going on it. The   ACPI   reporting oddities I was seeing with PCLOS on the D620 are not an issue on   the C610 either. The screen is bright but only 1024 by 768, same as the Dell   C400. RAM is smallish at 256MB, but the KDE memory mapping tool shows well   over half of it is in use as a disk cache, so it is plenty. The processor is   1.8 Ghz, and the graphics card is a separate Radeon, so at 1024x768 Beryl   works well. Keyboard is much better than the 8100's, and this unit (also like   the C400) has a TrueMobile Wifi chip, which PCLOS found and configured without   issue. Evolution has not crashed here yet (but I already have one PCLOS crash   collected from when it was on the D620).

 

IBM T41: Right now the T41 is running Mint 3.0, but this one I have set aside for Mandriva.

 

  Dell Desktop running Mint 3.0: This is where I run many of my central   processes, and I don't upgrade this system very often. I do not expect that it   will be much different from Mint 3.1 here, because the Kernel and Evolution   release levels are the same.

 

  Dell D620 running OpenSUSE 10.3: Fedora 7 has been replaced by OpenSUSE   10.3. Despite running on the newest hardware of the lot, and the D620 having   run Mint, Ubuntu, PCLinuxOS, and Fedora 7 with mostly success, OpenSUSE was a   pain to get running smoothly. Most of it revolved around the graphics hardware   being mis-identified. After some brute force, it was running at 1280x800.   KDE's default fonts look pretty bad, so I have Gnome running there for now.   Then I had to get rid of SLAB. But now it is up and running, and Evolution has   debugging symbols on *all* the modules, so I am ready to fail... Not that I   want to. But that is the goal here.

  In reserve are my Acer 5610 and IBM X30, should anything act up. Both are   running Mint right now, and I am hoping they don't get called into service.

  Next Steps

 

  Getting Mandriva 2007 into the mix is my goal for next week. Week after that I   fly to Pune India for a week, so this project will pause for a bit then.

 

In each case, I am running as close to the 2.6.22 kernel as I can, and the closest to the 2.10.3 version of Evolution available on each Distro. Obviously, across distros I can not even come close to controlling for all the variables here. My theory is that the same Connector failure will occur regardless of all these variables. The responses so far on Bugzilla have my two failures marked as dups of 458322, so I may be on the right track. I'll read through the details on each next week and see what I need to do to debug further. I am loading up all available debugging packages and Gnome's bug-buddy. I saw that one comment said something about also wanting debugging symbols for things like Pango, so I'll have to sort that all out.

 

This post, like the NAS post before it, is about a project in flight, rather than one where it is already done and all the facts are in, conclusions reached, and the doc written up. I hope to learn along the way of course, but also to just practice what I preach about being open. Besides, I have to have something to post in entry 151! :)
