June 23, 2012 is the 100th birthday of Alan Turing. Seventy-six years ago, Turing, just 24 years old, designed an imaginary machine to answer an important question: are all numbers computable? In doing so, he designed a simple model that is still the most powerful computing model known to computer scientists. To honor Turing, two scientists, Jeroen van den Bos and Davy Landman, constructed a working Turing machine. It is not the first time such a machine has been built. The interesting thing this time is that the machine was built entirely from a single LEGO Mindstorms NXT set.
The modern LEGO brick design was developed in 1958. It was a revolutionary concept. The first LEGO bricks made 54 years ago still interlock with those made today, whether you are constructing toys or even a Turing machine. When you want to build a LEGO toy or machine, you don't need to worry about when and where the bricks were manufactured. You focus on the thing you are building, which standard shapes you need, and how many of each. And you can get them in any LEGO store, no matter what you are building.
Sound familiar? This is very similar to how one would build a cloud service using resources in a shared fabric pool. You don't care which clusters or storage arrays host these resources. All you care about is the types (e.g. 4-CPU vs. 8-CPU VM) and service levels (e.g. platinum vs. gold) these resources need to support. Instead of treating individual element devices, such as compute hosts or storage arrays, as the key building blocks, IT now needs to focus on the logical layer that provides computing power to everything running inside the cloud - VMs, storage, databases, and application services. This new way of building services changes everything about how to measure, analyze, remediate, and optimize the resources shared within the fabric pool.
To understand why we need to shift our focus to pools and away from element devices, let's talk about another popular toy - the jigsaw puzzle. Last year, I bought a 3D earth jigsaw puzzle for my son, who was 3 years old at the time. He was very excited, as he had just taken a trip to Shanghai and was looking forward to a trip to Disney World. He was eager to learn about all the places he had been and would be visiting. So he and I (well, mostly I) built the earth out of all those puzzle pieces. The final product was a great sphere constructed from 240 pieces. We enjoyed it for 2 weeks, until one of the pieces went missing. How can you blame a 3-year-old boy who wanted to redo the whole thing by himself? Now here is the problem: unlike those two scientists who used LEGO bricks to build the Turing machine, I can't simply go to a store and buy that missing piece. I need to somehow find it or call the manufacturer to send me a replacement. In IT, this is called incident-based management. When all your applications are built on dedicated infrastructure devices, you can customize those devices, and the way they are put together, to fit the particular needs of each application. If one of those devices has an issue, it impacts the overall health of that application. So you file a ticket, and the operations team does triage, isolation, and remediation.
In a cloud environment with shared resource pools, things happen differently. Since the pool is now built from standard blocks and is shared by applications, you have the ability, through the cloud management system, to set policies that move VMs or logical disks around if their underlying infrastructure blocks are hit by issues. So a small percentage of unhealthy infrastructure blocks doesn't necessarily need immediate triage and repair. If you monitor only the infrastructure blocks themselves, you will be overwhelmed by alerts that don't necessarily impact your cloud services. Responding to all these alerts immediately increases your maintenance costs without necessarily improving your service quality. Google did a study on the failure rate of their storage devices. They found that the AFR (annual failure rate) of those storage devices is 8%. Assuming Google has 200,000 storage devices (in reality, it may have more than that), that is 16,000 failures a year, so roughly every half hour there will be a storage alert somewhere in the environment. How expensive is it to have a dedicated team constantly triaging and fixing those problems?
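The half-hour figure is simple arithmetic, and it is worth seeing how it scales. Here is a minimal Python sketch, assuming failures are spread evenly over the year (real failures cluster, so treat this as an average). The function name is my own, and the inputs are the 8% AFR and the assumed 200,000 devices from above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def minutes_between_failures(device_count: int, annual_failure_rate: float) -> float:
    """Average minutes between device failures across the whole fleet.

    Assumes failures are evenly spread over the year, which is a
    simplification made only for this back-of-the-envelope estimate.
    """
    failures_per_year = device_count * annual_failure_rate
    return HOURS_PER_YEAR * 60 / failures_per_year


if __name__ == "__main__":
    # 200,000 devices and 8% AFR are the numbers assumed in the text.
    interval = minutes_between_failures(200_000, 0.08)
    print(f"One failure about every {interval:.0f} minutes")
```

Double the fleet and the interval halves, which is why per-device alerting stops being practical at cloud scale.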
So how do we know when services hosted in a pool will be impacted? We give this problem a name - pool decay. You need to measure the decay state - the combination of the performance behavior of the pool itself and the distribution of the unhealthy building blocks underneath it. That way, you can tell how the pool, as a single unit, performs and how much capacity it has left to provide computing power to the hosted services. When you go looking for a solution that truly understands the cloud, you need to check whether it can detect pool decay without giving you excessive false positives. Otherwise, you will just get a solution that is cloudwashing.
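To make the idea of a decay state a little more concrete, here is a rough Python sketch. It is not taken from any product: the class names, the fields, and the 1.5x latency threshold are all hypothetical. It also reduces "distribution of unhealthy blocks" to a simple share of capacity, which is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One standard building block in the pool (hypothetical model)."""
    name: str
    capacity: float  # any consistent unit, e.g. vCPUs or TB
    healthy: bool


@dataclass
class DecayState:
    unhealthy_fraction: float  # share of pool capacity on unhealthy blocks
    headroom: float            # healthy capacity minus current demand
    latency_ratio: float       # observed pool latency / baseline latency

    def services_at_risk(self, max_latency_ratio: float = 1.5) -> bool:
        # The threshold is an illustrative assumption, not a recommendation.
        return self.headroom < 0 or self.latency_ratio > max_latency_ratio


def measure_decay(blocks: List[Block], demand: float,
                  baseline_latency_ms: float,
                  observed_latency_ms: float) -> DecayState:
    """Summarize the pool as a single unit instead of alerting per block."""
    total = sum(b.capacity for b in blocks)
    healthy = sum(b.capacity for b in blocks if b.healthy)
    return DecayState(
        unhealthy_fraction=(total - healthy) / total,
        headroom=healthy - demand,
        latency_ratio=observed_latency_ms / baseline_latency_ms,
    )


if __name__ == "__main__":
    # Ten equal blocks, one of them unhealthy, with demand well under capacity.
    pool = [Block(f"host-{i}", 8.0, healthy=(i != 0)) for i in range(10)]
    state = measure_decay(pool, demand=50.0,
                          baseline_latency_ms=10.0, observed_latency_ms=11.0)
    print(state, state.services_at_risk())
```

The point of the sketch is the shape of the question: one unhealthy block out of ten raises no flag here, because the pool as a whole still has headroom and its latency is close to baseline.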
Back to the missing piece from my 3D jigsaw set: I finally found it under the sofa. But lesson learned: I now buy my boy LEGO sets instead.
Next week, we will examine how resource pools combined with automation introduce another well-known challenge - the outage storm. Stay tuned.