I recently had a chance to chat with Ron Kaminski, ITS Sr. Consultant, and Shane Z. Smith – their capacity planner who is focused on web site and data center performance. Ron had previously written a wonderful paper on the above topic and I wanted to learn more.
DK: You previously wrote a paper about these concerns, but not everyone had a chance to read it. And things may have changed. Could you give us some insights into what people are missing as they approach SaaS and Cloud?
Ron K: People think you don’t need to bother with tools – just throw more hardware at the problem. But they miss why they have a capacity issue. As an example, in a file server farm, you might find that the real work only takes up 1% of the resources used. Large amounts of resources are devoted to things like virus scanners, a huge consumer. No one expects virus scanners to take up 70% of the machine. Infrastructure decisions are being made without understanding how much (or how little) of the CPU utilization is for real work.
Kimberly Clark has sites everywhere in the world, so when you schedule backups or maintenance activity based on North American time, a good time in the US may be peak usage time on the other side of the globe. To manage this effectively, you need to be able to differentiate between real work and support work. People who just look at total CPU don’t see that.
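The distinction Ron draws between real work and support work can be illustrated with a minimal sketch. It assumes a collector that already reports CPU seconds per process name. The pattern list, the function names, and the sample figures are hypothetical, chosen only to echo the file server example above; they are not taken from any tool mentioned in the interview.

```python
from collections import defaultdict

# Hypothetical substrings that mark a process as support work (assumption).
SUPPORT_PATTERNS = ("virusscan", "backup", "screensaver")


def classify(process_name):
    """Label a process as 'support' or 'real' work by name matching."""
    name = process_name.lower()
    return "support" if any(p in name for p in SUPPORT_PATTERNS) else "real"


def cpu_share_by_class(samples):
    """samples: iterable of (process_name, cpu_seconds).

    Returns each workload class's fraction of total CPU seconds.
    """
    totals = defaultdict(float)
    for name, cpu_seconds in samples:
        totals[classify(name)] += cpu_seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cls: total / grand_total for cls, total in totals.items()}


# Invented sample data for illustration only.
samples = [
    ("VirusScan.exe", 70.0),
    ("backup_agent", 29.0),
    ("fileserver", 1.0),
]
shares = cpu_share_by_class(samples)
for workload_class, share in sorted(shares.items()):
    print(f"{workload_class}: {share:.0%}")
```

A total-CPU view would show this hypothetical node as busy; the per-class view shows that almost none of that load is real work.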
Shane S: Without tools based on workload characterization, we wouldn’t have caught these issues. And even getting the data can be a problem: you can’t look at data at this level in PaaS (platform as a service). And SaaS vendors require proof that performance problems aren’t at your end. You need tools that help you do that.
Ron K: We needed to create our own collector, because we have a lot of virtual terminals. Did you know that if you leave a DOS window open, it burns 100% of the CPU? People aren’t aware of this impact. We need the ability to detect what process is causing the problem and then pass rules that minimize the impact. As an example, some screen savers can take 100% of the CPU too. By banning selected programs or behaviors, we have recovered as much as 40 CPUs’ worth of resources. You need a tool to get to the details. Otherwise, you are like a one-legged guy in a horse race. Vendor tools need to get easier to use and be a lot more scalable.
With 3000 nodes, you need something that is scale-aware. There is going to be lots of manual characterization.
DK: Doesn’t a CMDB help with workload characterization?
Ron K: Yes, it is going to be essential to make this scale. And some people will use one workload characterization file to manage everything – discover it once and apply it to every situation – but that works best with a more homogeneous environment.
But this isn’t perfect. People’s assumptions of what is on a node versus what shows up in the CMDB are often out of sync. Ideally, I want a capacity planning tool to feed this information into a CMDB. Get people out of the business of doing workload characterization.
Capacity planners need to understand what is driving CPU busy – if not, they will wildly overprovision. Disk is often what causes them to die, or one bad link in the network. In a global firm, it can be very difficult to determine these choke points. Without war in the world, all companies are going to be global and we need the right tools to do that.
DK: Do these initiatives change the role of a capacity planner? Eliminate it?
Shane S: That depends on the situation. With PaaS, we have zero insight into workloads and performance – PaaS is stateless. You can’t get performance metrics from them. What you get is total CPU and total memory – that’s all. You can’t do capacity planning like that. With IaaS (infrastructure as a service), you need to get the vendor to let you put in your tools. It can take time. In the private cloud – you must do it. In PaaS, you are reliant on the PaaS vendor to provide the monitoring/performance tools to get the information you require for capacity planning.
Ron K: The problem is one of scale. You can’t do it by hand. You need tools. There are some things you can get even if you only get totals. If you have a 4-CPU box and one CPU is routinely 100% busy, this is a loop or something that is just soaking up CPU, like a DOS command window on a virtual desktop. If you can’t get a collector in there, get a mini-collector that once a day goes in and collects process consumption. You can get information by comparing day to day – it may help to flush out what is just using too many resources. You have to be sneakier as a capacity planner now. I’m looking to vendor partners who come up with tools that help us manage the complexity and the scale. But for now, we’re working to deploy these mini-collectors so we can at least point out the silliness.
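Both ideas in Ron's answer can be sketched in a few lines. This is an illustrative sketch, not the collector described in the interview. It assumes total utilization arrives as fractions between 0 and 1, and that a hypothetical mini-collector stores cumulative CPU seconds per process roughly 24 hours apart. All names, thresholds, and sample values are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def looks_like_pegged_cpu(total_util_samples, n_cpus, tolerance=0.02):
    """True if total utilization never drops below one CPU's worth.

    On a 4-CPU box, one looping process keeps the total at or above 25%.
    """
    if not total_util_samples or n_cpus < 1:
        return False
    floor = 1.0 / n_cpus
    return min(total_util_samples) >= floor - tolerance


def daily_cpu_hogs(previous, current, busy_fraction=0.9):
    """Compare two daily snapshots of {process_key: cumulative_cpu_seconds}.

    Returns {process_key: fraction of one CPU used over the day} for
    processes that kept a CPU busy for at least busy_fraction of the day.
    """
    hogs = {}
    for key, cpu_now in current.items():
        cpu_then = previous.get(key)
        if cpu_then is None or cpu_now < cpu_then:
            continue  # new or restarted process: no reliable delta
        fraction = (cpu_now - cpu_then) / SECONDS_PER_DAY
        if fraction >= busy_fraction:
            hogs[key] = fraction
    return hogs


# Invented sample data for illustration only.
pegged = looks_like_pegged_cpu([0.26, 0.31, 0.25, 0.40], n_cpus=4)
yesterday = {"cmd.exe:412": 1000.0, "fileserver:88": 5000.0}
today = {"cmd.exe:412": 87000.0, "fileserver:88": 5600.0}
hogs = daily_cpu_hogs(yesterday, today)
print(pegged, sorted(hogs))
```

The first check needs only the totals a provider exposes; the second needs one small snapshot per day, which is the kind of low-footprint collection Ron describes.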
Shane S: The stateless environment is a challenge – it is much harder to get the data.
Ron K: Service providers want to charge you. If 30% of your environments are doing something stupid, you are just wasting money. Capacity planning is an issue of scale – there can be a lot of savings there.
DK: What do you mean when you say these technologies have an “Achilles Heel?”
Ron K: The issue is ubiquitous web access – people think that anything that can be done can and should be done on the web. But how you choose to get it there can impact resource demand. Even once you decide to do this, there are thousands of ways to do it. You need to think about this, focus on efficiency and select better. They call it code for a reason. It isn’t obvious what is using a lot of resources and so choices in coding can have a huge resource impact. Choices must take into account the distances, especially for chatty applications. We’re going to look back in 30 years and laugh at the code we are running now. I told my mother back in 1970, “In the future, no one is going to read books anymore. People will read off screens.” In the future, people won’t be writing code.
The current manual method in IT takes too long to get things running. In 20 years, vendors will have this automated. Users will be saying what they want the app to do. Human interfaces, like Siri, will know what the rules are and enable the automation of applications, without IT coders. Those programs will quickly learn how to optimize the code. We will always need capacity planning tools until the automation is excellent, and I put it to you that we will always need them for advanced analysis. Coding as a career has to end. Data is just getting bigger – so the impact of bad coding decisions is going to get worse.
In reality, there are too many nodes to do this the way we used to do it. Large scale automation in the future will help. I believe corporate data centers will go away – everything will be on the web or in the cloud. But then the tools will have to scale even more. I think IaaS will be the software vendor’s next big challenge. We need this because too many still see capacity planning simply as telling you how much to buy, not how efficient your systems can be. Large scale vendors of services need to find a way to handle these problems.
Capacity planning teams will shrink – automation will replace them. The future is global firms.
Shane S: Automation will eventually take over, but the vendors are extremely far away from that. PaaS and IaaS don’t seem to be doing capacity planning properly, but it is in their interest to do this.
Ron K: This is a big opportunity for a vendor and it would give better value for their customers too. It might result in a new thing – CPaaS. Cloud users need something that doesn’t require their cloud providers to install something – we need new tools which can use the data already available. Both cloud providers and cloud users need more detail. As a user of a cloud, you would want to know that another user is sucking up so much resource that it is impacting you.
DK: Isn’t that the same kind of problem that we had when we first started sharing CPU resources?
Ron K: Yes, and we still need the kind of data that shows you when this is happening. “Hey, cloud vendor – why does my performance suck when I’m doing the same work as I was last hour?” This tool needs to exist.
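The question Ron wants to put to a cloud vendor can be approximated from data a cloud user already has. The sketch below assumes hourly records of requests served and mean response time. It flags hours where the work stayed roughly the same but response time rose sharply, which is one possible sign of contention from another tenant. The function name, thresholds, and sample values are hypothetical.

```python
def unexplained_slowdowns(hours, work_tolerance=0.10, slowdown=1.5):
    """hours: list of (label, requests_served, mean_response_ms), in time order.

    Flags an hour when its request count is within work_tolerance of the
    previous hour but its mean response time grew by at least `slowdown` times.
    """
    flagged = []
    for (_, prev_work, prev_ms), (label, work, ms) in zip(hours, hours[1:]):
        if prev_work <= 0 or prev_ms <= 0:
            continue  # cannot compare against an empty or invalid hour
        same_work = abs(work - prev_work) / prev_work <= work_tolerance
        if same_work and ms / prev_ms >= slowdown:
            flagged.append(label)
    return flagged


# Invented sample data for illustration only.
hours = [
    ("09:00", 1000, 120.0),
    ("10:00", 1020, 125.0),
    ("11:00", 990, 310.0),   # same work, much slower
    ("12:00", 2000, 400.0),  # slower, but the work doubled
]
suspect_hours = unexplained_slowdowns(hours)
print(suspect_hours)
```

A flagged hour does not prove a noisy neighbor; it only gives the cloud user a specific, data-backed question to bring to the provider.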
What we don’t have is time – there is so much to do, but never enough people to get it all done. Tools could help but not if we have to write them. I recently converted my home system to Macs and now, when software is updated, it is automatic – I don’t have to worry about it. That is the future. Upgrades should just happen.
DK: What’s the best way for capacity planners (CPs) to adapt to these technologies?
Ron K: Take a step back and see how you spend your time. Figure out what can be automated. Do that. If you don’t find ways to make capacity planning scalable, you won’t be able to get it done and answer questions fast enough to be relevant. You need to be fast.
Shane S: Automation is key. And you need to understand the toolset that you have to make automation work. Don’t rely on IaaS tools to work properly – validate them. Make your own automated alerts to weed out these problems.