Try a little X86 VM infrastructure, succeed, grow organically. At some point it becomes time to stop and look around and figure out some better ways to do some things in the virtual world. One stop along that road today: Storage Virtualization.
As noted in my last post, "Virtually Greener", this post is a deeper dive into some of the things we have learned along the way about virtualizing X86, specifically using VMWare.
The first problem any X86 virtualization project faces is education and culture change. There is Fear, Uncertainty and Doubt, and this is not the normal vendor-generated FUD aimed at a competitor. This 'real' FUD is based on:
Fear of Change
Fear of the Unknown
Fear of Loss of Control
Hopefully the 'Fear of Change' is pretty obvious.
The Virtual Machine story sounds so unreal to someone not initiated into the mysteries of virtualization:
RDS: "Hi, I'm your friendly neighborhood R&D Support person. I would like to take that ancient computer you are using away from you and replace it with one that does not actually exist, but you'll get better performance."
RDP (R&D Person): "Come again?"
RDS: "I'd like to take that computer you have been using for years and shut it down and scrap it, but first I want to P2V it so that you can keep using it, except that the Virtual version will be much better than the one you have now."
RDP: "Err... right. How will it be better?"
RDS: "I can add more memory if you need it, part of which may be shared with other virtual computers. The other computers are there because your virtual machine will live inside this great big computer with a bunch of other virtual computers similar to yours."
RDP: "A bunch of others... won't that be slower?"
RDS: "No, because the new computer is 10 or 20 times faster than the one you have now."
RDP: "How many others?"
RDS: "Maybe 30 or 40 or 50. No more than 75 probably. It depends."
RDP: "And this will be faster? How? I can count..."
And so forth....
OK. I admit that the above conversation never happened exactly. At least not all at once, and not with just one person. But it shows the confusion that surrounds this whole thing. And it smacks of Big Brother, centralized, Mainframe, Glass House stuff that some people ran screaming from because of all the restrictions.
The truth of the matter is that most people just don't use all the resources of their computer very often, and that most of the time it is sitting idle. Right now, as I type this in on my MacBook, the CPU gauge in the task bar is barely visible, with less than 7% out of 200% currently being used. And this is a two year old unit, not even the fastest laptop going anymore... and it's a laptop. As long as I am not doing image processing or massive file conversions, or at least as long as I am not doing them at the same time as someone else sharing my computer, there is room on here for many of us.
Even with the overhead of virtualization (and VMWare's is very high right now, relative to the mainframe: We use 30% as a planning number for VMWare, whereas most stuff on the mainframe is at about 5% these days), the fact is that you can layer a large number of users onto one large, fairly inexpensive, data center grade computer, and have the net result be far less expensive and perform better. As noted in "Virtually Greener" there is a great deal of power to be saved here as well.
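To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is only an illustration: the function name, the 8-core host in the example call, and the 25% headroom default are assumptions made up for this example. The 30% overhead is the planning number above, and the 7% demand is the laptop reading mentioned earlier, not a measured server workload.

```python
def vms_per_host(host_cores, vm_demand_pct, overhead_pct=30, headroom_pct=25):
    """Estimate how many lightly loaded VMs fit on one host, by CPU alone.

    All values are whole percentages, where 100 means one fully busy core.
    headroom_pct is a hypothetical safety margin, not a figure from this post.
    """
    capacity_pct = host_cores * 100
    # Take the virtualization overhead off the top first.
    after_overhead = capacity_pct * (100 - overhead_pct) // 100
    # Then hold back some room for peaks.
    usable_pct = after_overhead * (100 - headroom_pct) // 100
    return usable_pct // vm_demand_pct


# Hypothetical 8-core host, each guest averaging 7% of one core.
print(vms_per_host(8, 7))
```

CPU is only one of the limits. Memory and disk I/O need the same sort of estimate before anyone commits to a number.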
The hard part is drawing this all out in ways that people can latch onto. We decided to go after it by creating a "proof of concept" VMWare farm.
Proof of Concept
How big one makes such a thing as a "Virtual Farm" depends on many factors. We had pretty lofty goals, and a fairly large scale data center. We are after about one in three X86 computers over a 12 month period, and we are well on our way to achieving that goal. Our R&D data center seemed like a prime candidate for this as we had so many older, more inefficient computers still in use. This old gear remained for reasons of customer support: by keeping around a raft of hardware and the related OS and application releases, we ended up with many thousands of square feet of pre-1998 computers still in service.
The problem was how to keep the functionality: keep our internal customers supported (remembering that my customer is other groups inside BMC, like R&D and Customer Support), but reduce the footprint. A conversation I never want to have with R&D or Customer Support:
CS: "You made a change to our infrastructure and it affected how well we were able to support [insert customer name here]. I am coming over to your office as soon as Sally returns my brass knuckles."
Kidding. Our CS folks are a passionate bunch though.
We started small: Dell 1850s, 1950s, 2950s, and then a Sun X4600. 4 GB, 8 GB, 16 GB, 32 GB and then finally the VMWare release 3/3.01/3.02 limit of 64 GB. We created the ESX servers all with internal disks, and started putting up the smallest of VM's. When the target is a 1993 computer with an average of 128 MB of RAM, even a 4 GB ESX server can hold a good number of OS images.
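The memory side of that is simple division. A small sketch, using the host sizes and the 128 MB guest size from the paragraph above. The function name and the reserved_mb parameter (memory held back for the hypervisor itself) are hypothetical, and the result is a ceiling by RAM only, ignoring CPU, disk, and any memory sharing between guests.

```python
HOST_RAM_GB = [4, 8, 16, 32, 64]   # host sizes mentioned in the post
GUEST_RAM_MB = 128                 # average RAM of the old machines being replaced


def guests_by_ram(host_gb, guest_mb=GUEST_RAM_MB, reserved_mb=0):
    """Upper bound on guest count by RAM alone.

    reserved_mb is a hypothetical allowance for the hypervisor; set it to
    whatever your own measurements suggest.
    """
    return (host_gb * 1024 - reserved_mb) // guest_mb


for size in HOST_RAM_GB:
    print(size, "GB host:", guests_by_ram(size), "guests at most")
```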
When a request for a new computer would arrive, we'd look at it, and ask if it was a performance, benchmarking, capacity planning, or device driver related need. If it was not, we would offer a VM instead of a real computer.
The first advantage was that we could provision that immediately. In less than two hours from the time that the request had arrived in the Remedy inbox, we could turn around an exact environment that met the needs of the requester. And they worked. They were not slow. With the VMWare VI (Virtual Infrastructure) console they had complete control over their VM. They could install things in it, reboot it at will, and never need to get us involved. With more and more people using the environments, we were able to build out templates that allowed us to turn around requests even faster.
Like any pilot program, there comes a time when it is time to go with it, or time to throw in the towel. Did the FUD win, or did the facts? Were the facts on your side? This project was a go, but now we had a problem. It started small, it grew like a weed, and now we had a pile of servers. The big ones worked better because they allowed more resources to be shared: a Sun X4600 or Dell 6950 could easily run 50 VM's, or even 75 of the really small VM's. Taking advantage of features like DRS (so that workloads can be balanced across multiple systems), VMotion, and HA clustering (so that if hardware fails the VM's can re-start on surviving members of the cluster) takes additional investment. It takes a SAN, and switches, and paying attention to what type of HBA you buy so that VMWare supports it. It takes planning and thought, and in some cases some outside the box thinking. One does not want to have all their cost savings and power savings and data center floor space eaten right back up.
The inverse problem is that one does not want to cheap out on the gear. When there are 50 VM's running on one server, even if all of that workload is not considered production *individually*, a server failure that can not recover quickly elsewhere means that at a minimum 50 people were just idled, and probably more if these were multi-user OS's running inside the VM's.
VMWare simplifies the math here by publishing what hardware they support. We have a fair number of Apple Xserve RAID (XSR) disk arrays, and we like them a great deal. We would have liked to have been able to use them for VMWare, but they are not certified. Tests showed they worked just fine for most things, except workloads with a high amount of random reads. Virtual machines can do exactly that sort of random read access pattern often, so XSR's are not optimal.
Or are they?
One of the big myths of disk space is that SATA is way slower than SCSI or Fibre Channel. It is... and it is not. Most SATA disks these days have fluid bearing designs, and extremely high MTBF (Mean Time Between Failures) ratings... for all that is worth. The newer revisions of the SATA interface are pretty fast. http://www.sata-io.org/3g.asp documents 3g at 300 Megabytes a second. The current spec is only half that, but that is still a crisp 150 MB/sec. Way faster than I can type.
Part of what slows down SATA versus SCSI is the number of arms versus the density of the data: the biggest SCSI I have seen as of this writing is 300GB. The biggest SATA: 1 TeraByte. Give or take an elephant, three SCSI disks with three data transferring arms are going to be faster than one SATA disk with one lonely little actuator. Then there is the fact that SCSI disks can be had that spin faster than the normal 7200 RPM of SATA: 10,000 and even 15,000 RPM. If you have three of the 15k SCSI units, you are going to go *way* faster than the one little 7200 RPM unit. You will pay for that premium speed though, and not just in capital to purchase, but in power. Spinning a disk at 15,000 RPM takes engineering, testing, expensive parts, and *power*.
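The RPM part of that argument can be put into numbers. On average a disk has to wait half a revolution for the data to come around, so the average rotational latency is half of 60,000 ms divided by the RPM. The sketch below turns that into a very rough random I/O rate. The function names are mine, and the seek times in the example calls are placeholder assumptions, not measurements of any real drive.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the platter: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def random_iops(rpm, avg_seek_ms, disks=1):
    """Rough random I/Os per second for a set of identical, independent disks.

    Ignores caching, queueing, and transfer time; avg_seek_ms is an assumed value.
    """
    service_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return disks * 1000 / service_ms


for rpm in (7200, 10_000, 15_000):
    print(rpm, "RPM:", round(avg_rotational_latency_ms(rpm), 2), "ms average rotational latency")

# One 7200 RPM disk versus three 15,000 RPM disks (seek times are placeholders).
print(round(random_iops(7200, avg_seek_ms=8.5)), "vs", round(random_iops(15_000, avg_seek_ms=3.5, disks=3)))
```

Notice that the disk count multiplies the whole result, which is why the number of arms matters at least as much as the spin rate.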
Looked at another way: The same number of disks, SCSI or SATA or FC, spinning at the same RPM, with the same density per track, and the same amount of on-board disk cache, is going to perform near enough the same as to make the price difference between the three favor SATA. SCSI or FC might be a little faster, but not dramatically so.
That is where storage virtualization can help. Storage Virtualization scares most people even more than OS virtualization. The FUD reasons are more or less the same. SV comes with a huge unknown: With Storage Virtualization the link between the data and its physical location on disk is largely broken.
One example might be a LUN where an OS keeps '/'. Normally, that is the first 10GB of any computer I build. With SV, I can 'spray' that out over multiple disks (kind of like RAID does) but even farther and wider, over more than one disk array, and even multiple different models and vendors. The VLUN can be spread across literally hundreds of disks, and many tens of controllers. The IBM SVC limit is somewhere around 1024 devices. That is a lot of arms to throw at data. It is not one block per disk though. There is a chunk size involved here too. Since each of those 1024 devices can in turn be a RAID5 array, the I/O could be across thousands of actual disks.
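For anyone who wants to see how 'spraying' with a chunk size works, here is a minimal sketch of plain round-robin striping. To be clear, this is a generic illustration and not how the IBM SVC actually lays out its data; the function name and the tiny numbers in the example are made up to keep the output readable.

```python
def locate(virtual_block, chunk_blocks, num_devices):
    """Map a block of the virtual LUN to (device index, block on that device).

    Chunks of chunk_blocks consecutive blocks are dealt out to the devices
    in turn, like cards around a table.
    """
    chunk_index, offset_in_chunk = divmod(virtual_block, chunk_blocks)
    device = chunk_index % num_devices
    chunk_on_device = chunk_index // num_devices
    return device, chunk_on_device * chunk_blocks + offset_in_chunk


# Hypothetical layout: chunks of 4 blocks spread over 3 backing devices.
for block in range(0, 16, 4):
    print("virtual block", block, "->", locate(block, 4, 3))
```

A long sequential read walks across every device in turn, and a burst of random reads tends to land on many different arms at once.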
Hopefully it is easy to see how in that scenario, the advantages of SCSI are diminished, and also see how a system programmer is going to be looking very closely at the storage virtualization device to be sure it is always healthy. It is now the only one that knows where the data is. The storage virtualization devices we played with are the ones from IBM, called the SAN Volume Controller or SVC. The SVC is an IBM X Series computer running Linux (yea! I had to get Linux in here *someplace!*) that sits between the hosts and the disks. The disks are just providers of blocks, and all the hosts are looking for on the SAN are LUNs, so the SVC creates VLUNs, and uses a block table to keep track of it all.
The SVC's come in pairs, and are active / active clusters so that there is no single point of failure. You can add more SVCs to the group to extend the speeds and feeds in a nearly linear fashion. They are quite amazing.
We tested the SVC's using the Apple XSR storage, and the technical term for that is that they "rocked". Our test set up was 2 Apple XSR's. Each XSR was fully configured with 750GB drives, 14 in each one. The XSR uses one controller for each seven disks, and the two controllers are not aware of each other, so this is essentially two disk arrays in each tray. We set each stripe up as RAID5, with one hot spare, so that leaves five data disks in each stripe. 4 controllers = 4 stripes = 20 data disks. The SVC created VLUNs over the top of all of this.
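The disk arithmetic for that test rig, spelled out as a small sketch. The tray, controller, and drive counts come from the paragraph above. The usable capacity figure is my own derivation in decimal GB before any formatting overhead, and the names are made up for the example.

```python
TRAYS = 2                  # Apple XSR units in the test
CONTROLLERS_PER_TRAY = 2   # each controller owns its own set of disks
DISKS_PER_CONTROLLER = 7
HOT_SPARES_PER_SET = 1
RAID5_PARITY_DISKS = 1     # RAID5 spends one disk's worth of space on parity
DRIVE_GB = 750


def data_disks_per_set():
    """Disks' worth of usable space in one RAID5 set after the spare and parity."""
    return DISKS_PER_CONTROLLER - HOT_SPARES_PER_SET - RAID5_PARITY_DISKS


def total_data_disks():
    return TRAYS * CONTROLLERS_PER_TRAY * data_disks_per_set()


def usable_gb():
    return total_data_disks() * DRIVE_GB


print(data_disks_per_set(), "data disks per set,", total_data_disks(), "in total,", usable_gb(), "GB usable")
```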
The Apple XSR has one very annoying feature. When setting up the LUNs across a RAID group, they can not be of different sizes. You can pick the number of LUNs, but the XSR sets them all up equally sized. Here is another way the IBM SVC can be very useful, since the VLUNs are created out of block pools.
The IBM SVC also acts as a write cache (in addition to any the devices themselves might have) so the VLUNs it presents appear to be quite speedy. This is not just an appearance, though: when you can write across as many arms as you want to put into a VLUN, they actually are fast. IBM currently owns the high water mark for speeds and feeds in the virtual storage sub-market.
I have brought all this up to point out two interesting things:
The IBM SVC *is* certified by VMWare
The IBM SVC will use the Apples, but only as a generic block device. The SVC has certified certain disk devices at different service levels, and the SVC needs to create a special VLUN for a cluster quorum disk, and it will not use the Apple XSR for this because they are not certified.
So... you CAN use the Apple XSR, in a certified way with VMWare, but you need at least one certified disk set to keep the SVC happy, and more importantly, safe. This *is* all your data we are talking about here.
The SVC also lets you implement classes of storage, so you can use (for example) the Apple XSR blocks as a place to keep mirrors, snapshots, or perhaps templates, but put the running VM's on what is more normally considered "Enterprise" storage. You can save a serious amount of money this way, at least as long as your "Farm" is big enough that the cost of a virtualization device is offset by the savings of being able to use tier II storage for some things.
Big IBM SVC caveat: The SAN Volume Controller assumes that you have already done the right thing as far as data reliability. It does not implement things like RAID at the virtual layer. One way to look at that is that your VLUN is only as stable as the least reliable disk in the block data stripe. I believe that other storage virtualization products implement RAID at the virtual layer, and that would have some real advantages. With the SVC, plan for disk failures. Failures are not the end of the world, if they are planned for.
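Why that caveat matters more as the VLUN gets wider: if the virtual layer adds no redundancy, losing any one backing array loses the VLUN. Here is a sketch of the standard probability arithmetic, assuming the arrays fail independently of each other (a simplification). The function name and the 1% figure are purely illustrative, not failure rates we measured.

```python
def vlun_loss_probability(p_array_loss, arrays):
    """Chance that at least one of the backing arrays is lost.

    Assumes each array fails independently with the same probability
    p_array_loss over whatever period you care about.
    """
    return 1 - (1 - p_array_loss) ** arrays


# Illustrative only: a 1% chance per array, spread over more and more arrays.
for n in (1, 4, 16, 64):
    print(n, "arrays:", round(vlun_loss_probability(0.01, n), 4))
```

The wider the spread, the more the RAID level and hot spares underneath each array have to carry the load.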
It also might be a good idea not to tell everyone you don't really know where their data actually is....
Truth be told, we have not yet pulled the trigger on storage virtualization in the R&D production environment. We are not quite ready to, but growth is going to force the issue sooner or later.
Before I leave this part, one last thing about SATA versus SCSI or FC. In our experience of failure rates (and it is nothing like the scale of Google's in this area), SATA based disks do fail more often. Once you get past "Infant Death Syndrome" (the tendency for new electronic gear to either fail early, within a few weeks, or last a fairly long time), the SATA disks do seem to fail more often in mid-life than SCSI or FC. To use SATA means always using RAID, Hot Spares, Cold Spares, monitoring, and generally planning for failure. Even with all that, done right, they can be worth the money one saves, at least in a large scale deployment.
Is It Worth It? This VM thing?
It can be. There are new problems to deal with in the virtual world, but if you come from a VM mainframe shop, you already know what they are. The number one issue: VMWare sprawl. It is just so easy to create VM's to test every little thing that before you know it you are hip deep in the things. The good news: Everything is getting tested. The bad news: that SAN... even that lovely storage virtualized, multi-tiered unit with massive TeraBytes of capacity is full. It can not be helped: You will need a policy about the life cycle of a Virtual Machine. When it is born, you should already be planning its funeral. Hey... that's really BSM / ITIL'ish!
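A life cycle policy can start out as simply as recording a lifetime when the VM is created and checking it regularly. The sketch below is one hypothetical way to do that. It is not a VMWare or BMC feature, and every name and date in it is made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class VmLease:
    """A VM plus the lifetime agreed on the day it was created (hypothetical record)."""
    name: str
    created: date
    lifetime_days: int

    @property
    def expires(self) -> date:
        return self.created + timedelta(days=self.lifetime_days)


def due_for_retirement(leases, today):
    """Names of VMs whose agreed lifetime has run out as of 'today'."""
    return [lease.name for lease in leases if lease.expires <= today]


leases = [
    VmLease("test-build-01", date(2007, 1, 10), 30),
    VmLease("support-repro-07", date(2007, 3, 1), 365),
]
print(due_for_retirement(leases, today=date(2007, 6, 1)))
```

Whether an expired VM gets deleted, archived, or renewed is a policy decision; the point is that somebody gets asked.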
The other thing that is true is that a big central service needs a capacity plan. We are deploying BMC's own Performance Assurance product (formerly Best/1) across our VM farms so that we can start figuring out where the bottlenecks are, and where the biggest bang for our IT buck is to improve the service. That is pretty BSM / ITIL'ish too!
In the traditional story of IT skill sets and the evolving Data Center, you may be able to run this setup with a higher ratio of OS images to System Programmers / Admins, but the people doing the work do have to be more skilled than ever.
There has been one benefit of the organic growth though: With the capacity plans we can show that fewer bigger machines are better than more smaller machines, and that should let us retire *from* the VMWare farm the first wave of smaller systems we used to figure the whole thing out in the first place. This is great for us because there are some classes of needs that just do not work in the Virtual World. They need real, dedicated hardware. By re-using the first, smaller, VMWare servers for those needs, I can either defer other new purchases or, even better, retire some of that 1998 gear next. It is a win either way.
Hopefully this post did not come off as a commercial for VMWare, Apple, or IBM (or, now that I think about it, Sun or Dell). In order to keep this real, I talked about gear we actually tested and used, but there are other storage virtualizers, other inexpensive disk arrays, and other OS virtualization solutions. For example: I use Parallels on my Mac and I love it. I could do the same thing with Xserves and Parallels. The only problem I would have with an Apple Xserve solution is there is no 2U or 4U larger version of the Xserve with even more CPU's available.
We have tested Xen and like many of its features. Virtual Iron seems to be a very mature and usable product with which you could do most everything I describe here, and with which I theorize you would have the same sets of issues, especially about VM image sprawl. It happened in the 1970's and onward on the mainframe with VM/370 and its children. It will happen again in this new space. Everything that is old is new again.