
Steve Carl

Running Hot and Cold

Posted by Steve Carl Dec 15, 2010


It is a really good idea to run your data center warmer these days. Except when it isn't.

 

Data centers need capacity planning, just like the computers that run in them. One size does not fit all. There are standards like ASHRAE that give great ideas and guidance, but at the end of it all, building a green data center requires more than just blind adherence to some standards. At the same time, physics underlies it all. There are some things you can do, and others that are ... Suboptimal.

 

Example: the most recent set of ASHRAE standards indicates that it is good to run your data center warmer than in the past. The range of allowable humidity is higher too.

 

If I were to blindly follow that, I would find computers failing all over the place. The reason is simple: no one told the older computers that it is OK to run warmer. A natural consequence of BMC's heterogeneous and deep platform support is that we have quite a number of computers designed and built before the new standards were set. They like it cold.

 

It is even more complicated than that though. You knew it had to be.

 

Fans and Power

 

Fans use power as the cube of their rotational speed. Not the square. Not linear. Airflow is linear.
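
To make the cube relationship concrete, here is a quick back-of-envelope sketch (in Python, with made-up fan numbers) of what the affinity laws imply: spin a fan 50% faster and you get 50% more air, but for roughly 3.4 times the power.

    # Fan affinity laws: airflow scales linearly with speed, power with its cube.
    def fan_airflow(base_cfm, base_rpm, new_rpm):
        """Airflow is roughly linear in fan speed."""
        return base_cfm * (new_rpm / base_rpm)

    def fan_power(base_watts, base_rpm, new_rpm):
        """Power rises with the cube of fan speed."""
        return base_watts * (new_rpm / base_rpm) ** 3

    # Hypothetical server fan: 60 CFM and 10 W at 5,000 RPM.
    print(fan_airflow(60, 5000, 7500))   # 90.0 CFM -- 50% more air
    print(fan_power(10, 5000, 7500))     # ~33.8 W -- nearly 3.4x the power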

 

The best example of that I have seen was when I was building a new data center in San Jose. The DC was going into an existing structure, but we had more or less gutted the building and custom built out the space to meet our needs. I had loaded up a fair number of servers into the shiny new data center, as we had certified and commissioned the new DC before we had started moving the people into the adjacent office space.

 

Building management had scheduled the fire marshal to come over one evening to look at the fire panel for the occupied space. Someone tried the alarm on the panel, which was connected incorrectly to the UPS, and the UPS dropped the line power to the building. In a way this was a good thing, as we found out about the incorrect configuration of the fire panel / UPS.

 

There was no backup generator, so the data center HVAC went offline, but because the UPS was working, the servers stayed up and running. I happened to be in the UPS room looking at the UPS control screens because it was a new model that I was not familiar with, and I was learning how to drive the various displays. Set up SMTP and SNMP, and so on.

 

From the UPS room, I could hear the sounds of the data center, and therein began a howl. Slowly at first, but building to a nearly deafening roar, every cooling fan in every server came online, and sped up in increments to its maximum speed. I had no idea how quiet the room had been till it wasn't.

 

I watched the power drain on the UPS with interest. It increased from 160 kVA to 270 kVA. Driving the fans at their maximum speed chewed power at an incredible rate. Even for all that, and even though the outage was not all that long... about 20 minutes... I had two older computers lose hard drives from the heat.

 

The point here is that ASHRAE says that it is OK to warm your cold aisles, and it is to some degree, but what degree that is will depend utterly on what type of computers you have, and at what temperature they are going to start cranking up their fans to stay cool.

 

It is not more power efficient to spend less on A/C if you are spending more on making fans spin faster. Where that inflection point is, I cannot say: data required. I know that for us it is not very far away from the old ASHRAE standards.
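
For illustration only, here is a rough sketch of that inflection point. Every number in it is invented: the real chiller curve and fan ramp profile have to come from your own measurements, which is the "data required" part.

    # Toy model of total power vs. cold-aisle setpoint. All coefficients are
    # hypothetical; the shape (A/C savings vs. cube-law fan growth) is the point.
    def chiller_kw(setpoint_f, base_kw=100.0):
        # Assume ~1.5% chiller savings per degree F above 68 (illustrative only).
        return base_kw * (1 - 0.015 * (setpoint_f - 68))

    def server_fan_kw(setpoint_f, base_kw=20.0):
        # Assume fans idle at 50% speed up to 78 F, then ramp to 100% by 90 F.
        speed = 0.5 if setpoint_f <= 78 else min(1.0, 0.5 + 0.5 * (setpoint_f - 78) / 12)
        return base_kw * (speed / 0.5) ** 3

    for t in range(68, 92, 2):
        total = chiller_kw(t) + server_fan_kw(t)
        print(f"{t} F: {total:6.1f} kW total")
    # With these made-up curves, total power bottoms out around 78 F and then
    # climbs steeply as the fans take over.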

 

Too Much of a Good Thing

 

Your data center A/C, be it CRACs or CRAHs, likes a temperature differential. Give or take an elephant, 30 degrees F is good. In hot/cold aisles, with no air mixing, that means that if your cold aisle is 68 degrees, your hot aisle is 98 degrees. No one is going to want to spend much time in the hot aisle, and when they do they will be wearing Hawaiian shirts and Bermuda shorts. By keeping the air from mixing, your A/C can be as much as 50% more efficient. Since DC HVAC is about 40% of the power bill (and therefore contributes hugely to your CO2 emissions), running the HVAC at maximum efficiency is paramount to a green DC.
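
The reason the delta-T matters so much shows up in the standard sensible-heat approximation for air: heat removed (BTU/hr) is roughly 1.08 times airflow (CFM) times delta-T (degrees F). A small sketch, using a hypothetical 10 kW rack:

    # Sensible heat of air at roughly standard conditions:
    #   BTU/hr ~= 1.08 x CFM x delta-T (F)
    # For a fixed heat load, doubling the delta-T roughly halves the airflow
    # the CRAC/CRAH fans have to move -- which is where the efficiency comes from.
    def required_cfm(heat_watts, delta_t_f):
        btu_per_hr = heat_watts * 3.412      # 1 W ~= 3.412 BTU/hr
        return btu_per_hr / (1.08 * delta_t_f)

    print(round(required_cfm(10_000, 30)))   # ~1053 CFM at a 30 F delta-T
    print(round(required_cfm(10_000, 15)))   # ~2106 CFM if mixing cuts delta-T in half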

 

Some people like to go to warmer climates, especially during the winter, but there are limits. Increase the temperature in the cold aisle to the ASHRAE maximum (assuming all your gear is new enough to be able to run at that temperature) and your cold aisle is at 80 degrees. People are not utterly uncomfortable at that temperature, though they will probably be thinking about putting on the shorts and tennis shoes. The hot aisle is another story. At 110 degrees, there will have to be hazard pay to go in there. Next to the fire extinguisher will be hydration stations.

 

To drive that kind of temperature differential also requires the servers to be racked at a density that can generate that level of heat. There are lots of ifs and caveats here, so once again you have to know exactly what densities are even possible with your specific mix of servers. In some cases, our gear is old enough that even though the power supplies are not very efficient, the servers are so large that I cannot pack them together that closely. In the case of the old Tandems, I can't easily rack them at all...

 

Latitude


Too much HVAC is inefficient and a poor use of power, and therefore carbon-footprint intensive. Not enough and your servers run too hot and fail. It is fairly easy to figure out how much A/C is the right amount, as noted in my two previous posts here ("By The Numbers" and "What's in a Name(plate)?"), but not addressed there is the idea of failure, in the sense of what to do when you lose an HVAC unit of some sort. Things fail. Entropy is law.

 

A recent case for us was when a 10-ton CRAC failed in one of the R&D DCs. There were three other units in the DC: two 20-ton units and another 10-ton. The problem was that there was not enough cooling left in the surviving units to deal with the heat load until the 10-ton could be repaired.

 

In theory there should have been one more 20-ton unit available: powered down, but piped into the common plenum so that it could assume the workload of the largest possible single failure. Alternatively (and this is what we did), about 10 tons of heat load had to be powered off until the HVAC technical crew had a chance to get the unit repaired. In this case, parts had to be ordered, and there was a multi-day wait. Supplemental air was brought in. Not pretty.
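
This is really just an N+1 capacity check, which you can do on a napkin or in a few lines of code. The unit sizes below match the room described above; the heat load figure is hypothetical (one ton of cooling is about 3.517 kW of heat removal).

    # Does the surviving CRAC capacity cover the heat load after the largest
    # single unit fails? Load figure is hypothetical.
    TON_KW = 3.517   # 1 ton of cooling ~= 12,000 BTU/hr ~= 3.517 kW

    def survives_worst_failure(unit_tons, heat_load_kw):
        surviving_tons = sum(unit_tons) - max(unit_tons)
        return surviving_tons * TON_KW >= heat_load_kw

    units = [20, 20, 10, 10]                     # the room described above
    print(survives_worst_failure(units, 180))    # False: 40 tons ~= 141 kW left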

 

[Image: sup-cool.JPG]

 

We had some latitude in how this was dealt with because the room is normally cooled to about 70 F in the cold aisle. That extra 10 degrees bought us time, and meant that we did not have to power down quite as many systems. We could, for a short time, run warmer.

 

For this lab, an idle 20-ton unit would be a hugely expensive investment relative to the room's workload. A spare HVAC unit may make sense when it is 5% of the total room or something, but not when it is 30%. Then it is expensive insurance.

 

The fans ran faster while the 10 ton was being repaired, and we used more power because of that for the duration, but it was a short duration.

 

Guidelines


I am not saying that one should pay no heed to ASHRAE: far from it. I am saying that in the effort to both design and run a green data center, understand that ASHRAE issues guidelines, not rules of nature like the second law of thermodynamics. Apply them knowledgeably to your particular set of servers, and also to the future plans for the data center.

 

The lessons of BSM are clear: you cannot manage what you cannot measure (in this case, the potential heat load), and managing effectively (i.e., efficiently) not only saves you money, it makes your DC greener.

Steve Carl

What's in a Name(plate)?

Posted by Steve Carl Dec 2, 2010


In my last post, "By the Numbers", I talked about the Diversity Factor, and why it is important to know your real one. The DF is in turn based on your "nameplate" power rating. I talked a little about this, but I think it is worth a deeper dive: while it is straightforward most of the time, it is not always. Getting this wrong can be a disaster, either in a new green data center design or, if you are moving something to a co-lo, in having to go back and re-adjust the parameters of your contract at a disadvantage.

 

The nameplate is a label attached somewhere on the power supply. It may not be visible from the back, and you may not easily be able to pull out the power supply to find it. Most computers have just one kind of power supply per model, but there can be sub-models and variations that mess with any assumptions you might make here. In some recent models I have even seen two power supply options listed, with one of them intended to be more power efficient, supporting things like taking the server into a powered-off state during times of inactivity and then powering it back up without a command from a KVM or remote control, or even someone standing there pushing the button. Rather, it powers up as software determines that load in the cluster is increasing and the server's RAM and CPU need to get on duty.

 

I also noted in my last post that most power supplies today are "World" power. They can deal with 50 or 60 hertz A/C, and voltages ranging from 100 through 250, all without giving you blue smoke. In the technical world, it is considered bad to let the blue smoke out of the computer, because you can never get it back inside. That same auto-ranging capability means that you have to know the wattage of the power supply by explicit vendor statement. It has to say something somewhere about 1100 watts or 2000 watts or whatever. Volts times Amps does not get you there.
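
As a hypothetical illustration of why: an auto-ranging nameplate might read something like "100-240 V, 12-6 A". Multiplying volts by amps gives you two different VA numbers, and neither one is the vendor's rated output wattage, which also depends on power factor and supply efficiency.

    # Naive volts x amps on a hypothetical auto-ranging nameplate.
    nameplate = {"volts": (100, 240), "amps": (12, 6)}

    low_va  = nameplate["volts"][0] * nameplate["amps"][0]   # 1200 VA at 100 V
    high_va = nameplate["volts"][1] * nameplate["amps"][1]   # 1440 VA at 240 V
    print(low_va, high_va)   # two different numbers, neither the rated wattage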

 

So what if there is no nameplate, or if there is no easy way to get at the nameplate because the server is powered up and the users would be grumpy about you powering it down to find the nameplate?

 

Also, what if you can see the nameplate, but the server has more than one power supply? Is it active/active? N+1? N*2? Active/passive? No way to tell from the nameplate.

 

The Google is of course your friend.

 

Searching the Vendor Sites

 

Some vendors are better than others about keeping the data about their computers online. Others are very aggressive about removing data on older systems they have discontinued. Kudos here go to Sun (pre-Oracle: I am watching to see if they maintain this level of goodness), with special mention to Dell. IBM is a problem for really old systems, because things can be spread out quite a bit and because they use the word "Power" in their server names, which leads to many false trails. HP's (and therefore Compaq's and Digital's) documentation is very good. Or very bad. Or very missing, depending on what you are looking for. Cisco is pretty good, though the different generations are documented in different places in different docs.

Some old stuff is just not findable.

 

When trying to find out something like nameplate wattage, these keywords in various combinations are what I have found the most useful:

 

  • Model Name (like Enterprise 250 or 7015-r40)
  • The vendor name
  • watts
  • power (except for IBM where this is nearly useless)
  • specification / specs / "technical specification"
  • "power supply"
  • Searching used computer depots for replacement power supplies
  • Maximum BTU

 

About that last one: You can reverse into wattage from BTU. Make sure you use the Maximum BTU rating to keep everything in Maximum Wattage until you apply the DF. You only want to apply the DF once, and to the right number.

 

Wattage from BTU: one watt is about 3.412 BTU/hr. So 680 maximum BTU/hr in a specification sheet means the maximum wattage of that power supply is roughly 200 watts.
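
In code, with a hypothetical diversity factor standing in for whatever you actually measured:

    # Convert a spec sheet's maximum BTU/hr back to maximum watts, then apply
    # the diversity factor exactly once. The DF below is a placeholder.
    def btu_hr_to_watts(btu_hr):
        return btu_hr / 3.412          # 1 W ~= 3.412 BTU/hr

    max_watts = btu_hr_to_watts(680)   # ~199 W: the ~200 W example above
    DIVERSITY_FACTOR = 0.6             # hypothetical measured DF
    print(round(max_watts), round(max_watts * DIVERSITY_FACTOR))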

 

Again: maximum rating. Keep it all the same. Some vendors will give you an estimate of the median power usage, and they may give it in watts or BTUs. It doesn't matter. They have no idea what your config is, how it fits into your overall data center, or what your measured diversity factor is. It might be handy to know for figuring out potential hotspots if it looks like the median is close to the max. But track it separately and do not confuse the two.

 

ADDM

 

BMC's ADDM is another way to find the nameplate rating of your servers, and to do it in an automated fashion. I have recently learned how to do some very basic things with ADDM, and the part I really need for designing and maintaining my R&D data centers is this: ADDM can not only discover everything on my network, it also has a database (called the HRD, or Hardware Reference Data) of servers and other gear, with over 1,000 entries. That database can enter into the CMDB not only all the other information about the server (OS, patch level, disk config, network config, etc.), but it can also update the MaxPower entry in the Atrium CMDB with a server's max watts rating. Then it is a simple matter of pulling the data out of the CMDB by rack, row, room, or whatever, and having your max wattage right there.

 

In addition to wattage, you have to know the server's size, in rack units (EIA U), to figure out the rack configuration. ADDM's database has that too, and can populate the CMDB with it as well.
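
Once that data is in the CMDB, the per-rack roll-up is trivial. Here is a sketch against a hypothetical CSV export; the column names are made up, so adjust them to match whatever your extract actually contains.

    # Sum nameplate watts and rack units per rack from an inventory export.
    import csv
    from collections import defaultdict

    def summarize_racks(path):
        totals = defaultdict(lambda: {"max_watts": 0, "rack_units": 0})
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                rack = row["rack"]                     # hypothetical column names
                totals[rack]["max_watts"] += int(row["max_power_watts"])
                totals[rack]["rack_units"] += int(row["rack_units"])
        return totals

    for rack, t in sorted(summarize_racks("cmdb_export.csv").items()):
        print(f"{rack}: {t['max_watts']} W nameplate, {t['rack_units']} U")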

 

This is only the tip of the ADDM iceberg, of course. It does way more than just populate power and system size in the CMDB.

 

In combination with the diversity factor, you now have everything you need to figure out how you want to set up the servers in new configurations and densities.
