
Green IT

31 posts

Chip Down

Posted by Steve Carl Nov 27, 2017

In "The Core(s) of What's Next", written in 2016 and published at the end of 2017 because the author somehow skipped it, I went into a lot of detail about the various chip architectures and what was coming up next.


Even though I did not publish it till recently, it was actually the post that stuck in my head the most on this topic, and I have been watching to see how things develop. Some interesting things have happened.




When I wrote that post, SPARC was something I was seriously thinking about for a large UNIX deployment. My UNIX team loves the OS (Solaris), it's solid as the day is long, and Oracle had put LOTS of microcode assists into SPARC that made it very attractive for the project I had in mind.


Then the layoffs happened.


So now it appears for new projects we are left with AMD / Intel, IBM Power, and ARM. Sure, things on Solaris are supported out to the 2030s, so if you already have an investment there, it's got a long life still.




We know IBM has announced it plans to take the current Power 8 architecture out to Power 9 and 10, and not only that, says they think they can get down to a 7 nanometer fab. In fact, I am starting to see that 7 nm number a fair bit now. That is, to me, the most interesting part of it because it ties into my whole thesis about Moore's Observation and where that is going.


Fab size


Over in ARM-land we saw an interesting thing happen with the Snapdragon 835. That was the jump down to 10 nanometer on the fab. For green reasons, the most interesting part of that was the 40% lower power consumption. It also meant that the phone / mobile was pushing the envelope of size / power reductions. That makes sense, given the battery powered nature of that platform, but it has impact upstream when that same power / size reduction rolls onto servers. It was clearly not easy to get down to 10 nm. Cannon Lake rolled out of 2017 into 2018.


In that second link is a quote from Intel's Senior Fellow, Mark Bohr, underlining the scaling down issue. It convinces me even further that the future of data center computing, and in fact, Green IT from the point of view of power consumption is going to be basically this: As chips can't scale down, there will be no choice but to scale out.


Power Up


When we kicked off our "Go Big to Get Small" initiative, we removed over 1.1 megawatts from our data centers globally. That is a lot of power. But the company is not static. There are new products to support. We were bucking the power growth trend for a long time, but no more. Newer hardware is consuming twice as much power as older hardware. Admittedly it is doing 3 times as much computing too, but the bottom line for the DC is power growth.
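The "twice the power, three times the computing" trade can be put in numbers. This is only a rough sketch: the function name is made up for illustration, and the 2x and 3x figures are the round numbers from the paragraph above, not measurements.

```python
def relative_efficiency(power_ratio: float, compute_ratio: float) -> float:
    """Compute-per-watt of the new hardware relative to the old hardware."""
    return compute_ratio / power_ratio


# Round figures from the text: 2x the power draw, 3x the computing.
gain = relative_efficiency(power_ratio=2.0, compute_ratio=3.0)
print(f"Compute per watt improves by {gain:.2f}x")
print("Power per box still doubles, which is what the DC has to feed and cool.")
```

So each unit of work gets cheaper in watts, while the draw per box still goes up.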


I see the number (noted here) of 6% CAGR power increase fairly frequently, and I have zero reason to doubt it. The fab size went down a bit, but the cores and the RAM and all the rest continued going up.
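To get a feel for what a 6% CAGR means over time, here is a small compounding sketch. The 1.0 MW starting load is a made-up placeholder, not a figure from our data centers; only the 6% rate comes from the text.

```python
import math


def project_power(initial_mw: float, cagr: float, years: int) -> float:
    """Power draw after compounding growth at `cagr` for `years` years."""
    return initial_mw * (1 + cagr) ** years


def doubling_time_years(cagr: float) -> float:
    """Years for a quantity growing at `cagr` per year to double."""
    return math.log(2) / math.log(1 + cagr)


START_MW = 1.0  # hypothetical starting load, for illustration only
for year in (1, 5, 10):
    print(f"Year {year:2d}: {project_power(START_MW, 0.06, year):.2f} MW")
print(f"Doubling time at 6%: {doubling_time_years(0.06):.1f} years")
```

At that rate the power bill doubles in a bit under twelve years, even with no change in the growth rate.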


For us it's about supporting more customers and products. For the larger world it's about IoT and apps and ever increasing numbers of mobile devices expecting the majority of their compute to be happening somewhere else. It may be your DC, it may be a cloud provider's, or any mix of that, but the compute demand is there, driving DC growth and therefore more power usage.


At the end of Moore's Observation, there is still more compute demand to come. Much more. We are just going to have to solve for that in ways other than getting smaller. I don't know if we stop at 7 nm, or if we can get down lower. It seems likely that whatever the absolute, physics-required bottom is, we are very near it.


Couldn't resist that last link. Sorry.


The Core(s) of What's Next

Posted by Steve Carl Nov 27, 2017

Well, this is embarrassing. I wrote a blog post a long while back, and never posted it, and I was just getting ready to write an update commentary about it... and it's not out here!


So: here is what I wrote forever ago:




In my last Green IT post (Ed note: This wasn't my last post. I just missed posting this one...) I looked at the Green / Power side of CPUs and Cores. Here I want to open that up, and have a look around.

Framing this thought experiment is the idea that we are running out of road with Moore's Observation.


What the Observation Is


It is worth noting here that what Moore observed was not that things would go twice as fast every two years or that things would cost half as much every two years. That sort of happened as a side effect, but the real nut of it was that the number of transistors in an integrated circuit doubles approximately every two years. Originally it was 12 months, but that was walked back to 2 years, and some split that and call it 18 months. In 2010 it was predicted that by 2013 the rate of doubling would only be every three years.

Just because the transistors doubled does not mean it's twice as fast. Not any more than a 1 GHz chip from one place is half as fast as a 2 GHz chip from a different place, because it all depends. Double the transistors only means it is twice as complex. Probably twice as big, if the fab size stays the same. Architectures matter. Workload matters. Application matters.


Since the Observation was made in 1965, doubling what an IC had back then was not the same order of magnitude as doubling it now. IBM's Power 7, which came out in 2010, has 1.2 billion transistors. It is made using 45 nanometer lithography. Three years on, the Power 8 is using 22 nanometer lithography and the 12 core version has 4.2 billion transistors.

To stay on that arc, the Power 9 would have to be on 11 nanometer lithography, and have over eight billion transistors (Sparc has already passed that...). However, from what I have read, both IBM and Intel's next step down is 14 nanometer, not 11.  It may not seem like a big difference, but when you are talking about billionths of a meter, you are talking about creating and manipulating things the size of a SMALL virus. We are in the wavelength of X-Rays here.
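The "arc" can be checked with a little arithmetic. This sketch uses only the transistor counts and the three-year gap quoted above; the function names are invented for illustration, and it assumes smooth exponential growth, which real product generations do not follow exactly.

```python
import math


def doubling_period_years(count_start: float, count_end: float, years: float) -> float:
    """Implied years per doubling between two transistor counts."""
    doublings = math.log2(count_end / count_start)
    return years / doublings


def projected_count(count_start: float, years: float, period_years: float = 2.0) -> float:
    """Transistor count after `years`, doubling every `period_years`."""
    return count_start * 2 ** (years / period_years)


POWER7 = 1.2e9  # transistors, figure quoted in the text
POWER8 = 4.2e9  # transistors, 12 core version, figure quoted in the text

period = doubling_period_years(POWER7, POWER8, years=3)
print(f"Power 7 to Power 8: one doubling every {period:.2f} years")
print(f"One more two-year doubling: {projected_count(POWER8, 2) / 1e9:.1f} billion")
```

By those figures the Power 7 to Power 8 step ran a little ahead of the two-year pace, and one more doubling lands at the "over eight billion" mark.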


A silicon atom is about 0.2 nanometers across (as near as such a quantum object can be measured anyway). We are not too many halvings away from trying to build pathways 1 atom wide, and quantum mechanics is a real bear to deal with at that scale. Personally, I don't even try. Also, there is not much redundancy in a pathway that wide. Any tiny event can blow the atom right off the substrate.
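"Not too many halvings" can be counted. The sketch below treats the node name as a literal feature width, which is a simplification, and uses the rough 0.2 nm atom size from the paragraph above. The function name is made up for illustration.

```python
import math

SILICON_ATOM_NM = 0.2  # approximate figure used in the text


def halvings_until(feature_nm: float, floor_nm: float = SILICON_ATOM_NM) -> float:
    """How many times a width can be halved before reaching `floor_nm`."""
    return math.log2(feature_nm / floor_nm)


for node_nm in (22, 14):
    print(f"{node_nm} nm is about {halvings_until(node_nm):.1f} halvings from one atom wide")
```

Six or seven halvings is not a long runway for something that has been halving every few years.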


So we'll do other things. We'll start making them taller, with more layers. The die will get bigger. To get more cores in a socket will mean the socket will get physically larger... up to a point. That point is the balance between heat removal at the atomic scale and power. Seen a heat sink on a 220 watt socket lately? They are huge.


The Design, the Cost, the Chips to Fall


Ok. So making chips is going to get harder. Who can afford to invest the time and effort to build the tooling and the process to make these tiny, hot little things?


Over the last 10 or 15 years we have watched the vendors fall. After kicking Intel's tush around the X86 market place by creating the AMD64 chips, and thereby dooming the Itanium, AMD ended up divesting themselves of their chip fabrication plants and created Global Foundries in the process.


Before that, HP had decided it was not anything they wanted to be doing anymore, and made plans to dump the Alpha they had acquired from Digital via Compaq. They also decided to stop making the PA-RISC line, and instead migrate to the short lived, rarely loved Itanium. To be fair, they didn't know what AMD was going to do with that AMD64 design. But there is a reason the Itanium's nickname was the Itanic, and actually it has lasted a while longer than most would have thought.


Intel could not let AMD have all the fun in the 64 bit X86 compatible world, and pedaled hard to catch back up. They are having fun at AMD's expense these days, but I never count AMD out. They were not only the first to have the 64 bit X86 market, they had all the cool virtualization assists first. They were early to the party to integrate graphics controllers onto CPU silicon. They blazed trails where GPU's are used as co-processors.


Meanwhile IBM opened up itself to all sorts of speculation by PAYING Global Foundries to take its Fab business: Please. I guess the gaming platforms moving away from Power just hurt too much. Those were the days.


That leaves us with three chip architectures for your future Data Center:

  • AMD64 (Intel / AMD)
  • IBM Power
  • Oracle SPARC

Plus the newcomer: ARM


Death by 1000 Cuts


Yes: Itanium is still around. May be for a while. If you have a Tandem / HP NonStop, then you have these for now. Until HP finally moves them to AMD64. If they want feature / speed parity with what is going on in the rest of the world, they'll have to do something like that.


The VMS Operating System problem was solved by porting it to AMD64 via VMS Software, Inc. And HP-UX (my first UNIX OS) customers seem to be slowly turning into Linux customers on, you guessed it, AMD64 chips. HP is a big player in Linux space, so that makes sense. HP-UX 11i v3 keeps getting updated, but the release cadence relative to the industry, especially Linux, looks and feels like it is meant to be on hold. Let's face it, if you have to sue someone to support you, your platform probably has larger issues to deal with. Not trying to be snarky there either. Microsoft and Red Hat Linux dropped their support for the chip. Server Watch says that it's all over too. So does PC World.

Linux runs on everything so if Linux doesn't run on your chip... Just saying. You probably do not have to think about where in your DC to put that brand new Itanium based computer. Unless you are Tandem based, as noted.


So what does all this mean for What's Next?


There are a few obvious outcomes to all this line of thinking. One is that the operating systems of the next decade are fewer. There is strong alignment of Chip to OS, except on AMD64. It has numerous varieties. There even used to be an AIX there, back in the day (version 1.3 on the PS/2, 1989).


Next is that operating systems themselves are going to hide. Really: As much as I love Linux, no one in the marketing department cares what OS their application is running on / under. The only time I hear an OS related observation from an application person is "why are you taking my app down?" "Oh.. It's Patch Tuesday". Or SSH was hacked. Or whatever.

It's a hard thing for a computer centric person to see sometimes, but the change that mobile and DC consolidation and outsourcing (sometimes called "Cloud Computing") hath wrought is that the application itself is king. It's their world and our data centers are just the big central place that they run in.


Clearly Linux and MS Windows are in upward trajectories. Every major player such as IBM, HP, Oracle, etc. etc. supports those two.


The Sparc / Solaris and Power / AIX applications are still alive and kicking (though with 30% of the market, they are being slowly eroded by Linux). With the spinning off of its X86 Server business to the same folks that bought their laptops, IBM is left with only high end servers (I Series is technically called midrange) (Oh, and Lenovo made that laptop business work out pretty well for themselves). IBM wants to be in the DC, where the margin is. Same thing more or less at Sun/Oracle. All their server hardware is being focused on making their core product run faster.


HP will be in the AMD64 or ARM world, and that's pretty interesting. The Moonshot product is nothing I have personally been able to play with, but it makes all kinds of sense. If you don't need massive CPU horsepower, you can do some pretty nice appliance like things here. And since Applications are king, not what hardware it runs on, chances to have lots of little units in a grid that are easy to just swap when they fail has a very Internet like flavor to it.


How will Santa Package all our new Toys?


Looking at Moonshot, and all the various CPU's, it seems that, for a while at least, we'll be seeing CPU's inserted into sockets or Ball Grid Arrays (Surface mounted). Apple has certainly proved with the Air line that CPU-soldered-to-the-mainboard solves lots of packaging problems. Till the chips get thicker, and start having water cooling pipes running through them because air just can't pull heat away the way that water can.


Yep: Liquid in the data center (spill cleanup on aisle three). We can be as clever about the packaging as we like, but physics rules here, and to keep trying to make these faster / better / cheaper is going to mean a return to hotter more than likely. That's a real problem in a blade chassis.  Even if the water is closed loop and self contained to the airflow of the RAM / CPU air path, it means taller. Wider.


Or, you go the other way, and just do slower but more. Like hundreds of Mac Mini's stacked wide and deep, or perhaps little slivers of mobos from Mac Airs ranked thirty across and four deep on every tray / shelf. You wouldn't replace the CPU anymore. The entire board assembly with CPU and RAM would become the service unit. Maybe everything fits into the drawer the same way that some disk vendors do it now.


When I designed our most recent data center, it was extremely hard to stay inside the 24 inch / 600 mm rack width. By going taller (48U) I could put more servers in one rack. Which meant more power and wiring to have to keep neatly dressed off to the side, in a rack that had little side room. The Network racks are all 750 mm for that exact reason.

If we go uber-dense on the packaging because of the CPU design limits, then what does that mean about the cabling? Converge the infrastructure all you like, the data paths to that density are going to grow, and 40 Gb and 100 Gb Ethernet don't actually travel in the Aether. I know, right? More like the Higgs field.


That's a conversation for another post though.




I wrote that and never posted it apparently in July of 2016. Back to late 2017. Things have happened since then, and so that's what the NEXT post is about.

Virtually Efficient

Posted by Steve Carl Dec 2, 2016
Share This:

It is clear that a data center full of computers running at 90% capacity is far more power efficient than the same workload running on many discrete machines, each averaging, say, 3% utilization. Even if those smaller machines have smaller power supplies.


Example (and I'll show this is conservative in a sec): one machine running a 1000 watt supply, with 30 VM's on it. That power supply is probably only averaging 500 watts but let's use .67 as the factor, just to keep it high side. 670 watts. 22 watts per OS image.


The same thing running on 30 smaller machines, each with a 200 watt power supply, each using 100 watts (.5 factor). Still favoring the small machines. You have 670 watts versus 3000. You are using about 4.5 times less power.


Those numbers are high in our world though. In our current blade world (documented at length in the "Go Big to Get Small" series), we are running 150 VM's per Dell M630 blade, and each blade is pulling less than 500 watts on average. Less than 3.3 watts per VM.
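The arithmetic from the last few paragraphs can be laid out in a few lines. This is a sketch using only the figures quoted above; the helper name is invented, and the blade line treats "less than 500 watts" as an upper bound.

```python
def watts_per_image(supply_watts: float, load_factor: float, images: int) -> float:
    """Average watts drawn per OS image on one physical host."""
    return supply_watts * load_factor / images


# One big host: 1000 W supply at a 0.67 load factor, 30 VMs.
big_host_watts = 1000 * 0.67
# Thirty small hosts: 200 W supply each at a 0.5 load factor, one OS image apiece.
small_hosts_watts = 30 * 200 * 0.5

print(f"Virtualized: {big_host_watts:.0f} W total, "
      f"{watts_per_image(1000, 0.67, 30):.1f} W per image")
print(f"Discrete:    {small_hosts_watts:.0f} W total, "
      f"{watts_per_image(200, 0.5, 1):.1f} W per image")
print(f"Ratio:       {small_hosts_watts / big_host_watts:.2f}x")

# The blade figure: under 500 W for 150 VMs, so this is an upper bound.
print(f"Blade:       {watts_per_image(500, 1.0, 150):.1f} W per VM at most")
```

Swapping in your own supply sizes, load factors and VM counts shows quickly how much the ratio depends on how full you run the big host.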


Clearly virtualization is 'Green', not to mention saving tons of DC space, power, cooling and all of that equals money.


All is NOT Golden in Virtual Land


Virtualizing seems a no brainer for most things that do NOT use up the entire physical hardware footprint in a single application / instance. Even that statement had caveats before I even GET to the major issues.


The main problem I see / come across with virtualization is databases. Example: Oracle won't certify their RDB for any virtualization platform other than their own OVM, and the reason is I/O. Over in MS land, Hyper-V with Server 2016 is just now becoming good for SQL Server workloads.


Virtual I/O as a bottleneck is a well understood problem, and Intel and AMD long ago added microcode to allow their virtualization assists to dedicate PCI slots to particular instances. One of those is Intel's VT-d. If you are a mainframer, this is not unlike the 'attach' or 'dedicate' command/directive in z/VM, and with it you give one virtual image complete control over the device. It’s the only one that can do I/O to it. It undoes some of the flexibility of virtualization, but it decreases Virtual I/O overhead. Here is a cookbook for how to do it with KVM for example.


It’s the classic virtual versus physical tradeoff, but it allows you, in theory, to dedicate something like a Fibre Channel card to a database server, and get the hypervisor out of its way for I/O. You can instantly get into trouble with stuff like this, because this is your important database! You can't have the single point of failure. You have to dedicate two cards! Which means now all the other images need a complete OTHER set of Fibre cards to run through. What if you want to do the same thing with your Test / Dev / QA instance to be sure you are keeping your environment apples to apples?


Lots of dedicated cards, and therefore lots of single use I/O slots on your server frame.




As you work your way through the things you can do to get rid of the most overhead for the least amount of effort, you sooner or later have to arrive at the OS itself.  How much overhead a particular hypervisor has is about as complicated a question as you could care to have. You have so many variables.


  • What's the Hypervisor itself? KVM? OVM/Xen? VMware? Hyper-V? Acropolis?
  • Is this full on Virtualization, or Paravirtualization?
  • What's the hardware platform architecture? (Mainframe Virtualization is decades older / more mature still, though many articles on virtualization forget it was not invented by VMware.)
  • Which generation of chipset / Microcode is in play, and are you fully set up to take advantage of everything available?


Containerization asks a different question or two: How much of that OS do you actually need to do what you want? Also, how isolated do you really need to be between the hosting OS / Platform, and the applications?


If the answer to this is "Not much and not very" then Containers change the math of efficiency / overhead a great deal. Real world example: When we were consolidating one of our data centers a few years ago, we went to move as many of the physical Sun / Solaris systems as we could into Zones. When we did the math to figure out how to size the host, we computed we could put about fifty of the OS images on each host. There were variables that we had to take scientific guesses for:


  • How much faster the new hardware was than the old
  • What was the actual, real combined overhead of the Zone


We used VMware consolidation style calculations. In the end, we ended up with hosts that easily could have held twice as many OS Images as what we planned for. We consolidated hundreds of systems into a few, but it could have been half as many still again. We were after 10:1 space reductions, and we could have gotten 11:1. More importantly we could have spent less money on the Sun blades to absorb the workload if we had known the real number ahead of time. In the scope of the larger project, again, this was not much, but still. What we ended up with was excellent performance for all the new guests, and no need to buy any new capacity for years. It was like Y2K all over again.


That same logic applies to Linux containers today, and companies have come along to make it even MORE attractive by bundling up Containers into libraries that you can check out and personalize. Need an Apache web server? Just spin it up from the library. Docker is a great example of adding value to that by creating an internet registry of such things, and all sorts of management tools around that.


Fast to deploy. Low overhead to run. Portability from host to host. Low virtualization I/O overhead. Higher application density per host. What's not to like?


Other than Application Sprawl of course.


If you thought keeping track of your CLM cloud was fun, wait till you have all those tiny little containers running all over the place. Seems we are always trading ease and lowered cost in for sprawl. Have a mainframe? It's central and expensive and too controlled for your taste? Do client server. Spread it around. Lower the acquisition cost. Now there are a zillion little pools of compute, and a zillion applications. Crud! We need to get our arms around that. Let's make great big DC's, and rack them all where we can see them. Crud! We have cabinets full of tiny systems. Let's consolidate and make everything virtual. Cool! It's smaller and we know where it all is, and it's all on supportable hardware. But… what are all those zillions of VM's doing? Anyone know who owns all those things? OK. Let's corral all that, and get it all under Cloud Lifecycle Management or something similar. Now we have names. And expirations. But… What about all those heavy OS's and the RDB's that need better I/O… Hey! Let's Containerize!


None of that even counts all the hidden computing costs going on out there in the public clouds and running off the employee company credit cards.


Round and round we go…


We seek efficiency and manageability and flexibility. We want to enable everyone, but also be sure we know what's going on so we can stay in front of the next zero-day that comes into our life, not to mention have some idea where our corporate data might be currently living.


Coda to What's Next


What got me thinking about all of this was my pondering on what the DC will look like next. Part of that is determined by what is going to go into it. We are not just reaching the end of the line for things like Moore's Observation. We are reaching the end of the line for what we can do to strip out certain kinds of waste. We have gotten rid of underused computers. We have shared pools of storage together so that they can be more efficient. We are de-duping and compressing. We are stripping out all the unrequired parts of the OS.


What we are NOT doing is going back to writing things in assembly language so that programs are as efficient as possible. We are staying high level. Getting ever more abstract. Software will continue to get bigger even as we reach the lower limits of the ability of the hardware to get smaller. That will be an interesting inflection point.


We used to joke for years about how you always needed to upgrade the hardware to run the latest, ever fatter version of MS Windows. The reality there is that Windows 10 runs fine on Windows 7 spec hardware. It's NOT the OS's getting fatter any more. The software target is getting larger because of all the OTHER code we are ginning up. Software defining everything!


The data center will be as small and dense as we can make it. The world that runs inside it, and that it will be connected to? Managing it? That is a whole other thing.


At a recent technical conference I was at, I went to several sessions that referred to Moore's "Law" and how it was just going to keep making things better in the DC. Clearly not readers of my stuff. <sad face>. One session leader however, referring to a new server generation, said "So, it's about 1.5 times faster than the previous generation… So much for Moore's law". I was proud of my self restraint. I did not cheer out loud. Much.


It was easy to see why there was so much enthusiasm around for the idea that Moore's Observation was inexhaustibly marching on. In particular there was much talk about Solid State Storage, and how much smaller and faster it was. All the new and cool things you could do with it. How tiny and power efficient it is.


All true.


The crux of the matter though is this: Flash memory, like general processors, is reaching the lower limits of current lithography. Samsung is at 12 nm according to the roadmap. Like so many other things, going vertical is soon the only way around that limitation of physics. Making memory more DENSE like that has heat implications, and COST implications. If you read Moore's Observation as a statement about the cost per transistor, Flash, like general processors, is going to be getting off the halving / doubling cost / capacity train soon.


It is deceptive right now, at this inflection point. We see that moving from SAS or other spinning disk technology to Flash is allowing us to put in one cabinet instead of five or six, for the same capacity, and at higher access speeds. No matter which storage vendor you are looking at, their Solid State offering has that 5 to 1 or more space decrease, all in one great leap! It's huge. And with compression and dedupe and all the things Solid State enables, the cost per Terabyte is getting in line with the older tech, with lower long term costs to boot. Less space. Less power. More speed. It all makes sense. It all fits into the worldview of those raised thinking Moore's Observation is Holy Writ, because it has never failed in their professional lifetimes.


It won't obviously fail just yet. There will be a few turns of the packaging crank to keep upping that physical density on the circuit boards. More chips closer together. Stacked packages. Better airflow management. Still, it can not beat physics. We are in the quantum realm here and there is a lower limit to how small this can go until we start figuring out quantum memory. When a bit becomes a qubit we'll have another leap in capacity (though how that affects speed and cost is not clear at this point).


I said above 'current lithography', and that was intentional. For example, IBM has it in mind to use carbon nanotubes to get us past such limits. IBM also announced they think they can get to 7 nm with a version of current lithography, but since we are at 14-12-10 nm now, that is NOT a huge leap, and it’s a few years before we see 7 nm arrive. Carbon nanotubes take you down to 4 or 5 nm. 4 atoms wide. These articles discuss the scientifically possible, but say nothing about the cost per transistor to achieve them.


What is obvious though is that the Power per cabinet is going to rise, and for a while. The DC can keep getting smaller and more power dense as far as storage is concerned for another few generations without straining packaging too hard.  Hitachi, for example, just announced 14 TB Flash drawers for the G1000 / G1500, and they fit in the same flash slots. I imagine with more packaging / airflow work they have a few more turns of that crank.


Sun  / Oracle gave up on their Chassis, as did IBM (in all of one generation of chassis, when you are talking about their new, revolutionary never been anything like it ever Pureflex chassis). If you want Power or Sparc based gear, we are back in rack mount servers. This will limit our per-cabinet density. Unless you are building custom Super-computers or are Google, we are seeing the new face of the DC in terms of form factor. Same as the old face in a way. Vendors are avoiding esoteric cooling technologies for as long as they can, though IBM has been building water cooled mainframes for a while now. Intel has experimented with computers dipped in oil baths for cooling but that article was from 2012, and clearly that idea has not caught on yet.


To get denser costs more money, and so, like Moore's Observation relative to the cost per transistor, it's always the math of how much more square footage costs than esoteric cooling technologies.


That seems to be the face of the next few generations of DC's. You can go taller (I have a DC with 57U cabinets for example) but you run out of height, and using that height (special server lifts, special cage walls, etc) is more of a problem. You can go denser, but you soon hit maximum density before you have to introduce specialized cooling. You can hyperconverge, but your cabling plant starts getting more complex as you add bricks. Wider racks with more side space to keep it all neat become attractive.


Soon, your choice will be simple. Bigger DC or move it to the cloud (which just means THEY will need a bigger DC).


(Coda: I know of several DC's just getting started retiring 15 year old gear, and getting to higher density stuff, so for a while, while the older stuff is retired, 'we' (global DC denizens) are still going to be able to get smaller. Like the leap from Spinning to Solid State disk though, once you are past that inflection point, it’s a good idea to be sure you have a good right of first refusal on the white space next to your cage!)


[Coda 2: I realized after I wrote this that there was an assumption in there: That all the other efficiencies are already in use to drive up utilization per computer. We started at average utilization of less than 3% back for "Go Big to Get Small". Now we try to be close to 90 / 90 - CPU / Memory. When we add a computer resource to the on-prem cloud, it's because we are OUT of something. A DC that has not been on that journey yet has wiggle room to shrink.]


[Coda 3: Interesting article about how Facebook is changing DC designs potentially for many others and the Open Compute Project. Just read today]

Moore Storage Please

Posted by Steve Carl Mar 15, 2016

For a very very long time I have had it as a to-do to finish up my thinking / posting about how the end of Moore's Observation was going to affect the way we design and build data centers. The core tenet of this was that data centers are about done shrinking. In the whole "Go Big to Get Small" series I did here I talked about the 10:1 floor space reductions and the 4:1 power reductions we were after and have achieved.


The bad news is, you reach a point where you can not 'cash that check' any more.


One exception to this trend, for now, is Storage. Namely: The glacial move away from mechanical storage towards solid state storage.


I had written a whole piece about disks, areal densities, vertical recording technologies, and various other things that tied into my general theme of Moore's Observation being over. I left it 'on the hook' so long here that it was lost when they upgraded the site. I'm looking at you, Matt Laurenceau.


I had paused writing it because I just kept feeling like I was on the cusp of change and that anything I posted about storage was going to get passed by 3 minutes after I hit send. Then I moved / consolidated another DC, and by the time I got back to it (now) not only was the original post gone, but everything it was about had in fact been passed by.


An intersection we are at that is related:


  1. Annual storage growth is supposed to keep going ballistically up: 40% a year
  2. Annual IT budgets are generally flat to tiny growth. Nowhere near the growth rate of storage, to be sure.
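Those two lines together imply a number: how fast the cost per terabyte has to fall for the budget to keep up. The sketch below makes a big simplifying assumption, that storage spend scales with total capacity, and the function names are made up for illustration. The 40% growth rate and the flat budget are the figures from the list above.

```python
def required_price_decline(capacity_growth: float, budget_growth: float) -> float:
    """Yearly fractional drop in cost per TB needed for spend to track budget."""
    return 1 - (1 + budget_growth) / (1 + capacity_growth)


def capacity_multiple(capacity_growth: float, years: int) -> float:
    """How many times larger the storage estate is after `years` of growth."""
    return (1 + capacity_growth) ** years


decline = required_price_decline(capacity_growth=0.40, budget_growth=0.0)
print(f"Cost per TB must fall about {decline:.1%} every year")
print(f"Capacity after 5 years: {capacity_multiple(0.40, 5):.1f}x today's")
```

A flat budget against 40% growth only works while the price per terabyte keeps dropping by more than a quarter a year.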


Used to be I'd look at a storage array that could grow to two petabytes usable and think "Wow". Now it's more like "Is that all?"


To address some of the problems with the growth of storage, and the general lack of it physically shrinking, some vendors went with tiered approaches: Slow (7200 RPM), high density [eight terabyte right now] drive arrays fronted by faster, lower density drives (10K or 15K RPM), in turn fronted by even smaller amounts of Flash drives, in turn fronted by even smaller but faster still caches.


It made a sort of sense at the time. Flash memory and RAM were expensive. Early Flash wore out quickly. Spinning drives were cheap and slow (relatively speaking). Further, it's well understood that data is largely 'cold': the standard number I see around says that 80-90% of your data is low reference. Only 10-20% of it is 'hot'. That does not even count data warehousing or adding something like a gnarly tape tier. We have been solving this problem for a very long time with things like hierarchical storage.


We keep solving this problem because we keep trying to drive the cost per Megabyte / Gigabyte / Terabyte down per unit. The unit we think in keeps going up.


The end is in sight. Flash technology is finally getting at or near the same cost as spinning media. For enterprise grade storage, let's call that near 1,000 USD per Terabyte. There is a caveat there: You HAVE to use dedupe technology on the Flash memory to achieve things like this. It's still five times the price of disks at the time I write this. I'll read this next year and go "Wow: Stuff was expensive back then".


Dedupe makes all kinds of sense. Why store 80 copies of something when you can store one, and have 80 pointers to it instead? It works better with fast controllers and storage though. And therein lies a conundrum. It's a classic one.
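The "one copy, 80 pointers" idea can be sketched in a few lines. This is a toy, not how any particular array does it: the `DedupeStore` class and its method names are made up for illustration, and real products dedupe at the block level and worry about reference counts, hash collisions and a lot more.

```python
import hashlib


class DedupeStore:
    """Toy content-addressed store: one copy per unique blob, many pointers."""

    def __init__(self):
        self.blocks = {}    # digest -> the single stored copy of the bytes
        self.pointers = {}  # logical name -> digest

    def write(self, name, data):
        # Hashing every write is the controller work traded for storage space.
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)
        self.pointers[name] = digest

    def read(self, name):
        return self.blocks[self.pointers[name]]

    def physical_bytes(self):
        return sum(len(blob) for blob in self.blocks.values())


if __name__ == "__main__":
    store = DedupeStore()
    payload = b"same attachment" * 1000
    for i in range(80):
        store.write(f"copy-{i}", payload)
    print(len(store.pointers), "pointers,", len(store.blocks), "stored copy")
```

Note where the cost lands: every write gets hashed, which is exactly the controller work that has to be paid for somewhere.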


In the early days of PC's we used to do anything we could to stay in memory. Disks were just so slow! The entire UNIX operating system was designed to treat everything like a file, and to use every bit of RAM to cache the disk I/O. DOS had 'TSR' ('Terminate and Stay Resident') memory management programs so that programs would NOT have to be reloaded from the slow, slow disks. We compressed hard drives and traded CPU cycles for disk space. Many thought that would slow down the disk, but often it was the other way: A compressed program read off the disk faster than an uncompressed one. As long as you were CPU rich, you were good to go.


Dedupe is like that. You trade controller cycles for disk / Flash storage space. Problem is what I outlined in my post "The Core Necessities". We are off the free CPU train. CPU's are getting more expensive per transistor. We are nearly done with current tech at shrinking them. That means to add CPU power literally means to add more CPU's. Or CPU Cores at least. That equals more heat and more power for the CPU's.


As long as that's cheaper than the actual memory though, we'll do that deal.


Physically speaking, Flash memory has some space and power problems too. See "Moore's Memory" for details. In summary: Quantum size limits are going to cap how small we can go. CPU's and memory are going to become three dimensional to deal with that, with more and more layers of silicon. There literally is no place to go but up. That article says they see no limit to how many layers they can go, but I do. I am sure they were taking this into account and thought it obvious, but it does not mention power. Power equals heat. Heat has to go someplace, and I am sure this can be solved with heat pipes, and voltage regulators and all manner of that kind of thing.


All of which means that while we may get more DENSITY, we are not going to see a huge drop, if any, in cost or power. Not right away.


So: Storage is going to see a precipitous drop in size / power when it makes the move to all solid state, but then it is going to be on the same ramp as CPU's and memory. Unless something really, really new comes along (optical is often bandied about, along with quantum computing) we are going to enter a phase of life where we are going to have to do something else. Something much harder.


We are going to have to manage all this stuff.

Moore's Memory

Posted by Steve Carl Jun 18, 2015

I waited a bit to write this post, in part because of being busy with getting a new DC started up, but in part because of this:


The hard part about writing about tech is that it catches up to you, and fast! The punch line here is that we are more or less at the end of the line for Silicon, and going smaller means new materials. New processes. Massive R&D and investments for new fabs.


It will not be cheaper to get smaller. Far from it.




From one point of view, Moore's Observation actually stopped a while ago, back at 28 nanometers:


Worse, from the point of view of RAM / DRAM et al, the Observation quit being in effect a while ago:


There are implications there for mass storage too, but for now I am just thinking about how all of this affects RAM, and therefore ultimately what that means for us in the Data Center designing / building / maintaining business.


The Good Old Days (of the last couple years)


During recent DC consolidations we have seen 4 to 1 power reductions and 10 to 1 space reductions, but the end of Moore's Observation for processors means that how things go forward soon will be different. Chips will be bigger: it's still faster to have more components on a small chip than have them talking to each other across a mainboard, or a backplane. Way, way faster. The speed of light is still a law. Einstein!!!!


Parallelism will increase. Has to. More processors. More cache. We can not make them much faster than they are in terms of clock-cycle without them just radiating right off the substrate in useless and uncontrollable ways. Microwaves are just a few Gigahertz.


All of that downsizing is just as true for RAM, if not more-so. Memory is on chips, just like CPU's but it has always been on a different path too. How you make a chip 'remember' something is not the same problem as making transistors compute things. The design on the substrate is different, and therefore the way it scales down is different.




Memory requirements are NOT going to stop though. To get things done, more and more things want to stay memory resident. Virtualization requires lots of RAM to hold all the system images. The Power 8 system we just ordered has 1 terabyte of RAM. The systems we have been getting rid of from 10-15 years ago ran 2, 4, 8, 16 or so gigabytes.


To go to bigger and bigger RAM sizes, sooner or later, the RAM chips are going to get bigger. Heat will be a problem. Cooling memory will be a thing again. If you have ever overclocked your PC, and had it live, you know what kind of work you had to do on the heatsink. Liquid cooler caps were not unheard of.


Looking Back


In those two posts I wondered about, ultimately, form factor. Would blades be able to survive in a blade chassis if all this new size was coming into play? Now add in the fact that RAM is going to get larger. More / longer DIMM slots. More memory controllers on-chip. More address-ability of 64 bit-ness but no shrinkage of the chip die.


Looking back at that Power 8 I just mentioned above: You know what I can NOT buy from IBM? Anything with a Power 8 chipset in a blade form factor. Has to be a Rack mount. Has to be a 2U or 4U case. The smallest blades for the Sun / Oracle 6000 are full height! Dell announced the M630 to replace the M620 half height blade, but there is no M430 quarter height blade in sight. Not for the M1000e chassis anyway.


It's Getting Racky


If the case the computer guts sit in has to get bigger, then it seems we are headed back to something more like a rack-mount design. It still has to fit inside our DC, and we do have that huge investment in the current cage.


Also, when you get right down to it, modern DC's like the ones our Co-Lo's are in are able to handle 250, 500, even 1,000 watts per square foot, and we are nowhere near that in our cage. Bigger servers, with big CPU's and RAM installations that run hotter than what we have now, are not really a problem there yet.


But then there is this new trend:


Everything that is old is new again! Back in the days of the mini-computers, we had big cases full of CPU's and RAM because they were NOT dense. Now we are headed back to that form factor because we can not make things any smaller / denser? Could be. It certainly aligns with the 'hyperconvergence' story.


We have over 24,000 virtual machines and growing. We can only manage that because we have BMC's CLM in there taking a lot of the work off the data center team. We could not easily use the *current* hyperconverged platforms because of that scale, but that market is changing super fast. See things like EVO:RACK.


And that all leads me to think about the Storage part of this next.

The Core Necessities

Posted by Steve Carl Jan 11, 2015

In my last blog entry here and my guest blog over at Kiamesha, I went into what we had achieved with our DC consolidation and redesign, and I started to go into what's next.


At one level, we were happy the design lasted as long as it did, with as few in-flight changes as it all needed. But looked at another way there was a minor disappointment too, and that was that the design lasted so long.


Say what?


One would think, given Moore's Observation (It is not a law) that we should have seen some mid-life turns of the crank in the technology base. There were of course, but nothing that was financially compelling, and certainly nothing that would have saved us enough power to have made a huge difference.


Long Lasting Blades


I mentioned back in the Go Big to Get Small series that the blade we are using for X86/AMD64 virtualization is the m620. When viewed through the lens of capital investment, you have to love that over two years in, the m620 config never really changed. Nothing that came out in that product's lifetime made us want to move to any config other than the one we started out with. All the capacity measurements just kept saying that, for our particular use case, that config was the best price / performer.


Recently Dell came out with the m630 blade, and it contains the Intel® Xeon® processor E5-2600 v3 series of processors. The m620 used the Intel® Xeon® processor E5-2600 and E5-2600 v2 product lines. Is that change enough to warrant the move to a new config? We'll soon see. In the meantime though it got me to thinking about the evolution of CPU architecture and how it relates to the greening of the data center.


You Have to Measure


Most every time you see anyone making assertions about something being better than another thing, you see a caveat added. Like "Which smartphone is right for you? Well, it depends on how you use it...".


What I am going to talk about here comes with that caution in bold face and italicized font. How a CPU is used really really depends.


Let's take the example of the well known VMware scheduler issue. If you are not a VMware shop or you are running newer versions that have this fixed, you won't have this problem. There was a time, however, that VMware dispatched guest vCPU's all at the same time. If the guest had 4 vCPU's then VMware had to find 4 real CPU's on the server free, at the same time, in order to release the guest machine to run. The more vCPU's a guest had, the longer it tended to stay in CPU wait state. This in turn meant that the more real CPU's available to the ESX host OS, the more likely it was that enough CPU's would become available quickly.
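Both trends in that paragraph fall out of a very simple toy model. To be loud about the assumptions: this is not how the ESX scheduler is implemented. It just pretends that each real CPU is idle independently with some probability `p_idle` (a made-up number here) and asks how likely it is that enough of them are idle at the same instant.

```python
from math import comb


def p_can_dispatch(physical_cpus, vcpus, p_idle):
    """Probability that at least `vcpus` of `physical_cpus` are idle at once.

    Toy assumption: every physical CPU is idle independently with
    probability `p_idle`. Real schedulers and workloads are not this tidy.
    """
    if vcpus > physical_cpus:
        return 0.0
    return sum(
        comb(physical_cpus, k) * p_idle**k * (1 - p_idle) ** (physical_cpus - k)
        for k in range(vcpus, physical_cpus + 1)
    )


if __name__ == "__main__":
    for host in (8, 16):
        for guest in (1, 2, 4):
            chance = p_can_dispatch(host, guest, p_idle=0.3)
            print(f"{host} real CPUs, {guest} vCPUs: {chance:.1%} chance to dispatch")
```

In this model, bigger guests wait longer and bigger hosts help. Whether your shop behaves that way is something only measurement will tell you.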


Unless you were CPU bound. Or it was Tuesday. Without measuring, you just don't know.


Our workload, on the m620, tends to run out of virtual memory before it runs out of real CPU. We have seen it in capacity measurement after measurement, and for years on end, using BCO, and all the tools that came before it. It predated the m620 by at least two generations if not more. Our ratio is 16GB of RAM to one CPU (be that a core or a socket running one core). We have seen it so long we have started to wonder if it is not a law of virtual nature. Like Moore's, it is just an Observation.
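As a rough sketch of how that ratio gets used when sizing a blade: the 16GB-per-core figure is the one observed above, while the function names and the example blade are hypothetical. Your ratio will differ, which is the whole point of measuring.

```python
GB_PER_CORE = 16  # the ratio observed in this post; measure your own


def balanced_ram_gb(cores, gb_per_core=GB_PER_CORE):
    """RAM a blade needs so memory and CPU run out at about the same time."""
    return cores * gb_per_core


def first_limit(cores, ram_gb, gb_per_core=GB_PER_CORE):
    """Which resource a blade is expected to exhaust first at this ratio."""
    needed = balanced_ram_gb(cores, gb_per_core)
    if ram_gb < needed:
        return "memory"
    if ram_gb > needed:
        return "cpu"
    return "balanced"


if __name__ == "__main__":
    cores = 2 * 8  # hypothetical blade: two sockets, eight cores each
    print(f"{cores} cores want about {balanced_ram_gb(cores)} GB of RAM")
    print("with 128 GB installed, first limit:", first_limit(cores, 128))
```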


If you are running KVM, or Xen or some other virtualization solution, the numbers will be different. If the workload you virtualized tended to run at 30% average utilization rather than 5%, your numbers won't be like mine. If you aren't virtualized, but instead run mathematical models all day long every day...


I usually dislike the mealy mouthed "which is better" comparisons that lead you to the answer "Depends on you", so I'm sorry I had to write one. On the other hand, if I had asserted this 16-to-1 RAM-to-Core ratio was a law of virtual nature, everyone would have had the right to climb through their Mac and pound my head, so at least I staved that off.


Green Perfect World


If you look at the CPU spec sheets I linked above, the watts / thermal performance (Intel calls it TDP) of the real CPU's of all three series ran in the range of 80 to 150 watts per socket. That makes sense. They have to be installed in the same systems with the same airflow already engineered. Put out a 200 watt part all of a sudden and no one is going to be happy with you.


Two generations back for the E5-2600 it was 4, 6 or 8 cores though. V2 (1 generation ago) added 10 and 12 cores. V3 added 14, 16, and 18 cores per socket, with other bits like the cache getting larger as well.


Clearly I am getting more and more cores for the same(ish) power. That's not counting anything that may have been done to increase per-core efficiency too, like increased cache, better / bigger / faster look aside buffers, better predictive pipelining, better out of order instruction execution, instructions per cycle, and all that (did I mention you have to measure?).


If I can more or less run a "16 core per socket, double my RAM per blade" kind of config, and not have all that double or more the price of the unit, then I can use half the DC real estate to get the same amount of work done. I can hold down the number of NEW chassis I need to add, and save that 'overhead' power too. Plus the costs of the internal-to-the-chassis switches on the back.


Same logic says that if doubling is too expensive, what about a 50% increase? 12 cores and 384GB per blade? As long as that bump is less than 50% more expensive (and every other variable remains the same... a big if, I know) then why wouldn't I do that?
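That break-even logic is simple enough to write down. This is a minimal sketch under the same "every other variable remains the same" assumption; the function names and the sample percentages are mine, not measured prices.

```python
def relative_cost_per_unit(capacity_gain, price_increase):
    """Cost per unit of work in the new config, relative to the old one.

    Both arguments are fractions: 0.5 means 50% more.
    """
    return (1 + price_increase) / (1 + capacity_gain)


def worth_upgrading(capacity_gain, price_increase):
    """True when the new config is cheaper per unit of work than the old."""
    return relative_cost_per_unit(capacity_gain, price_increase) < 1


if __name__ == "__main__":
    # Hypothetical quotes for a blade with 50% more cores and RAM.
    for price_increase in (0.40, 0.50, 0.60):
        ratio = relative_cost_per_unit(0.50, price_increase)
        print(
            f"+50% capacity for +{price_increase:.0%} price: "
            f"{ratio:.2f}x cost per unit, upgrade: {worth_upgrading(0.50, price_increase)}"
        )
```

It ignores the savings from needing fewer chassis and switches, so if anything it understates the case for the denser config.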


Standards are good. New standards that save power are, to be very technical, "better". And in some ways: What took so long? Shouldn't I have been able to do this all more than 6 months ago?


No: Because price / performance ISN'T ramping down as fast any more.


That logic applies beyond AMD64 / X86 space too. Same for Sparc or Power. It will be true for ARM, when its data center day arrives. I'd say same for Itanium, but that one's getting ready to exit stage left.


Stay Tiny My Friends


If you have been doing Data Center Consolidation for a while, and particularly if you have been retiring older workloads into the virtual world, this is the next frontier. The servers are starting to get on up to their 3 year depreciation schedule sell-by dates (our oldest ones are 2 years old in this most recent project).


Re-hosting them to newer, more Core-dense servers is the next win, though let's face it: It's not like the last one. I was going after 10-to-1 reductions then. Now I am getting jazzed about 50% reduction possibilities.


It's not nothing though. When viewed from a watts per VM point of view, or from the perspective of how to stay in my new, tiny data center, it's what has to come next. When viewed from the perspective of how we all reduce the amount of power our data centers will be using globally over the next decades, it is something that remains front and center in the design of the Green DC solution.


Remember when a server had one core? Those were the days...

Share This:

In the entire "Go Big to Get Small" Series I posted here I went into great detail about what we were after, and how exactly we planned to do it. I named names: Device types, like the Dell M1000e and IBM Pureflex Chassis, to name but two.


The goals were lofty, and based on early work we had confidence.


Still, the number one question I get is basically "Yes, but what's the reality of all that? Sure, you are replacing lots of old gear with new, and you are staying inside your run rate, but it's all just a beautiful plan. Call me when it's real."


Except now it's not just a pie-in-the-sky plan. It's reality. We are not done yet, but Phase One is complete, and the results are in.


Starting Place Two Years Ago


To understand where we are now, let's go back to the beginning of the project, two years ago, and see what we had:


  • USA DC 1:
    • 38,000 square feet
    • Consuming 542 KW
  • USA DC 2
    • 4,248 Square feet
    • Consuming 127 KW
  • International DC
    • 4,000 square feet
    • 450 KW
  • Total:
    • 46,248 square feet
    • 1,119 KW


If you go back a decade those power numbers are higher: Virtualization had taken a bite out of them, but it was organic. For example, in the International DC over the previous 10 years we had dropped from 600 KW down to 450 KW as a result, and in the DC 1 listed above, back around 2001 it was consuming 1.1 Megawatts or so.


Knowing the starting place is important in the discussion though, and this project was about reducing the number in my outline above, and that story starts in July of 2012.


Where We Are Today


Those three DC's are now two:


  • USA combined DC
    • 2,500 Square feet
    • 160KW
  • International moved DC
    • 800 Square feet
    • 80 KW
  • Total
    • 3,300 square feet
    • 240 KW


It was more than consolidation. It was downright collapse of footprint, and yet we have *more* system images running now than when we started.


A 14x reduction in floor space, and a 4.6x reduction in power consumption.
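For anyone who wants to check my math, here is a small sketch that totals the two lists above and works out the ratios. The numbers are copied straight from the lists, with all power figures treated as KW; the function names are just for illustration.

```python
# Figures copied from the two lists in this post: (square feet, KW).
BEFORE = {
    "USA DC 1": (38_000, 542),
    "USA DC 2": (4_248, 127),
    "International DC": (4_000, 450),
}
AFTER = {
    "USA combined DC": (2_500, 160),
    "International moved DC": (800, 80),
}


def totals(sites):
    """Sum floor space and power draw across a dict of sites."""
    square_feet = sum(sq_ft for sq_ft, _ in sites.values())
    kilowatts = sum(kw for _, kw in sites.values())
    return square_feet, kilowatts


def reduction_factors(before, after):
    """Return (floor space factor, power factor) going from before to after."""
    before_sq_ft, before_kw = totals(before)
    after_sq_ft, after_kw = totals(after)
    return before_sq_ft / after_sq_ft, before_kw / after_kw


if __name__ == "__main__":
    space_factor, power_factor = reduction_factors(BEFORE, AFTER)
    print(f"floor space: {space_factor:.2f}x smaller")
    print(f"power:       {power_factor:.2f}x lower")
```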


PS: Not only is this much Greener than it was, we did it in the old run rate, and will save from this project alone several million dollars. There is a sort of rule of thumb about spending on Green Tech: If it's more than a 10% uplift, it won't happen. Google anything about spending on Green tech and you'll see article after article talking about *not* spending more, just to have it be Green in some way.


What about being Green AND saving money? Money you can then use to re-invest in the business? Most people would take that deal.


Here is what all the various blade architectures I documented in the "Go Big to Get Small" series look like all together in the USA DC:




Note all the room for new blades / chassis around these. In the USA DC, I have 40 48U empty racks!


Why? Because this was three DC's and Phase One. Phase One was the low hanging fruit. We started two years ago with 26 DC's and now we have 18. We want to get down to 4 majors and 2 minors, so this space is there to absorb what is going to come here in Phase II.


The Green of this is clear of course. That kind of power reduction equals a massive reduction in CO2 emitted to power this place. One colleague of mine calculated this as being the same thing as 71 average USA houses in power savings and CO2 reductions thereof.

Share This:

My apologies for the massive delay in this series. Flu and Phase II took over my personal and professional lives for a while. The good news here is that we are now after our next 100 KW of power reduction. The bad news of course is that I was not done talking about the last 100+ KW reduction.


I know GBtGS is a fairly long series, so I will try not to assume here that you have read all of it. If you have read it, there may be some repeats of information. Please bear with me. I'll try to keep that at a minimum too.


The DC


The goal of this phase of the GBtGS was to take 37,000 square feet of DC down to 11,000 square feet. More: The 16,000 square foot part was designed with an 18" raised floor, and at 250 PSI floor loading. The 11,000 square feet was 6" raised floor and 72 PSI floor loading. Both spaces were originally 38 watts / SF, though over the years the 11,000 SF floor had been upgraded in some areas to be 50 watts / SF. Still nothing compared to a modern DC's 250 watts per Square Foot.


In the last post there are some pictures and discussion about cleaning air dams of cables out from under the 6" floor. That is key to the success of this phase of the densification. Not only are servers going to be closer together, they are going to be virtualized onto blade chassis, and by virtue of that, much more dense. I have fairly beaten virtualization to death in this series' previous posts, so enough said about that.


In the old DC design, airflow was never really much of a consideration. Quite the opposite, it was utterly ignored. If a place got hot, it was easy to just move a few things around. We had the luxury of space. Further, this was originally a mainframe only DC, so airflow just did not matter. The chill water did the work.


I did not have a thermal camera or spreadsheets full of Fluid Dynamics equations to figure out a new design with. I did have good basic design principles, such as "Don't mix your hot and cold air". I have imaged the Austin DC with a thermal camera, so I knew what kinds of things to watch out for. I also had my skin: I could stand in the DC, feel the air flow, and feel the heat. See / feel the leaks. Come up with remediation.


All of that led to this 7,000 SF DC:




and this adjacent / connected 4,000 SF DC:




Those two rooms connect at the double doors, and total out to 11,000 square feet. The 7,000 SF room is about 50 watts / SF. The 4,000 SF room is 38 watts / SF. All are 6" raised floor.


You'll see lots of white space here in both DC's. I opened up the cold aisles so things could flow. This dropped air resistance where I wanted the cold air to arrive. I always had in mind being able to add positive pressure floor tiles to be able to pull the air where I needed it, but it turned out that the big wide cold aisles did the trick.


The 7,000 SF space has a drop ceiling that is the hot air return plenum, so I can use virtual chimneys to take the air up, and then back to the CRAC's. Since the drop ceiling is used as a hot air return plenum, hot and cold don't mix as much. That is a good thing too, because I was not able to move everything in the room around the way I would have liked, and you can see there are a lot of weird airflows. With the 50 watts per SF, the room is literally colder. There are two places in the room where brand new hot / cold aisles could be implemented, and that allowed higher density server installations, including one area full of M1000e's.


In the 4,000 SF DC you'll see that the NOC takes up 25% of the room, there on the left, with a small mainframe right below that. That allowed me to densify the racks in the center of the room, at least as far as being able to fill the 42U racks full of gear. The three CRAC's can only vent along the centerline of the space.

I was not able to use the drop ceiling as a plenum in this DC, so airflow had to be managed purely by design of the rows.


Other parts of the room contained production gear that was in odd orientations, and networking gear cabinets with side blowing airflows so that air whooshes about in small circular paths. Not great, but enough density was achieved elsewhere in the space that after virtualization and tight-stacking, 37,000 square feet fit into 11,000, and there was room, power, and HVAC to spare.


Genset and UPS


In 1993 the DC's were fed by one 750 KVA UPS, and backed by a genset. We grew and grew, and by 1998 we added a second 750 KVA UPS. We then segmented that workload so that important things were on the older UPS, and things that could go down for short periods were on the UPS that had no genset behind it.


The goal here was to get the workload down to less than what that older UPS/Genset could manage, and to completely abandon the newer UPS. We would leave it behind, just like the 16,000 square foot floor.


I tracked that overall drop in usage with a keen eye. It looked like this:




Mission accomplished and then some. We not only fit on the single UPS, we were not even going to be straining it.


Even better from the Green IT point of view was that the Virtualization efforts had saved us in one year 205 KW.


CO2 and Green


I mentioned in "Not all Electrons..." that in Texas, because of the general mix of ways that power is generated, on average about 1.4 pounds of CO2 is created for every KWh consumed. That means that about 239 pounds of CO2 per hour less are going into the air because of our DC. That is 5,725 pounds less per day, or 2,089,728 pounds less per year.
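The conversion itself is just multiplication, sketched below. The 1.4 pounds per kilowatt-hour is the Texas average quoted above; the 100 KW input is only an illustrative round number (the size of the next reduction this series is aiming for), and the 365 day year is an assumption.

```python
LB_CO2_PER_KWH = 1.4  # Texas average quoted in this post


def co2_saved_lb(kw_saved, hours, lb_per_kwh=LB_CO2_PER_KWH):
    """Pounds of CO2 avoided by drawing `kw_saved` less for `hours` hours."""
    return kw_saved * hours * lb_per_kwh


if __name__ == "__main__":
    kw = 100  # illustrative input, not a measured saving
    print(f"per hour: {co2_saved_lb(kw, 1):,.0f} lb")
    print(f"per day:  {co2_saved_lb(kw, 24):,.0f} lb")
    print(f"per year: {co2_saved_lb(kw, 24 * 365):,.0f} lb")
```

Plug in your own KW saved and your own grid's pounds per kilowatt-hour; the mix of generation varies a lot from place to place.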


Because this was a retrofit of an existing DC, we added no CO2 due to new construction either.


There is another green being saved here of course. The rent of the floor we left. The maintenance on all the infrastructure of the floor. The cost of the power. Total it all up, and we are about a million USD to the good. In other words, this project utterly paid for itself. In one year.


Retrofitting the space saved us a lot of time and money, and got the DC ready for the next phase of the GBtGS project. A phase we are well on the way to executing. At the end of this next phase, we'll be down another 100 KW, and positioned for even more consolidation from other data centers around North America.


I'll wrap up this series next post. GBtGS entered phase II though, so there will be more to talk about soon.

Share This:

Or; "How to get There from Here"


We live in interesting times. The popular press love to call this the 'Post-PC Era'. The PC is in theory dead, or irrelevant. Canonical's Ubuntu, Gnome 3, and Microsoft are all grafting tablet UI's onto their desktop systems, forcing people like me to run to alternative desktops like MATE and Cinnamon or stay on Windows 7 (the new XP?). I rather believe the idea voiced in one of those links to the effect that the Mint version of Linux's rise to dominance in the Linux desktop world is at least in part due to the fact that they created Cinnamon as an answer to the sub-optimal UX that is Gnome 3 / Unity. My humble opinion only: I hear there are some that like the new UX. Still, Distrowatch has it as number 1 as I look now.


I don't know that this will go anywhere, but there is even a proposal to switch the default desktop of Fedora to Cinnamon. I hope they do, and I hope the Gnome folks figure this out eventually. From the looks of Windows 8.1, it appears MS is starting to realize they should not have gone down the same road these other two projects paved. Not everything is a phone. Not everything is a tablet. But this is not an "Adventures" post.


My thinking on all this, when viewed from the paradigm of the data center, is that we are going to a tiered architecture. There is the great big central glass house (Cloudy, public or private), the mid-tier, powerful device, and the edge device. In the DC that's the central router, the distribution layer, and the edge switches. In computing terms, it's the server, the desktop, and the tablet / phone / convertible devices. Nothing doomed. Nothing dramatic. Just a shift in usage to match the need, and to put the right device into use for the right workload.


Datacenter Envy


For that first tier: the central glass house. The middle of the cloud. Whatever you want to call it, the perfect world is high density. Maximum energy management / efficiency. High availability. Everything where you can monitor it. Standardization of parts. On and on.


Public clouds are designed from the ground up to be dense/hot/efficient like this. Private clouds probably should be too, and for the same reasons.




We'll never live in that Nirvana of Data Center existence. One of our big customers might be an all Dell shop. Another an all IBM one. Many more historical mixes of other vendors. It does not take long before we are in the place of supporting all of it in our data centers. Rather than that small, hot-as-the-sun Data Center, we have the much less small Data Center with at least one of everything.


That does not mean we can not look at the small, hot Data Center as an ideal and try to move towards it: That we can not do better than we have in the past. So far in this series I have detailed our approaches to the server and storage sides of that. Now: the room it all sits in.


Keeping up with the Times


When we started this project, we looked at all the cool kids, in their fancy Tier 4 Co-Lo's, and then looked out across our huge data centers, un-densely packed and scattered across the globe. There was a lot of low hanging fruit. The problem is we could not just take everything we had and jam it into a modern data center. Even jammed into tall racks, it would take lots and lots and lots of space. Made no fiscal sense. Nothing changed about the CO2 just because it was jammed closer together in that nice new Co-Lo and / or modern DC we had built. If it uses a couple of megawatts spread out, it uses a couple of megawatts jammed together, assuming nothing else changed. The watts per square foot goes up. The total watts, or the total required HVAC, do not. For every 3500 watts of gear, you need an effective ton of HVAC, no matter how densely or un-densely it is arranged. Spread out, you are using lots of square footage, but you do not need a lot in the way of airflow management. Jam it together, you have to work a little harder to get the hot where it needs to be.
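A quick sketch makes the point that density changes the watts per square foot but not the cooling bill. The 3500 watts per ton rule of thumb is the one above; the two floor areas are hypothetical numbers picked only to show the contrast.

```python
WATTS_PER_TON = 3500  # rule of thumb from this post


def hvac_tons(total_watts, watts_per_ton=WATTS_PER_TON):
    """Effective tons of HVAC needed to carry away `total_watts` of heat."""
    return total_watts / watts_per_ton


def watts_per_sq_ft(total_watts, sq_ft):
    """Heat density of the same load over a given floor area."""
    return total_watts / sq_ft


if __name__ == "__main__":
    load_watts = 2_000_000  # "a couple of megawatts"
    for sq_ft in (40_000, 4_000):  # hypothetical spread-out vs jammed floors
        print(
            f"{sq_ft:>6,} SF: {watts_per_sq_ft(load_watts, sq_ft):.0f} watts / SF, "
            f"{hvac_tons(load_watts):.0f} tons of HVAC"
        )
```

Same tons either way. The only thing that moves is how hard you have to work on airflow.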


None of that counts the absolute inefficiency that is mixing your hot and cold air. Yikes. Old MF DC's didn't care about that because the MF heat was carried off by the chill water pipes: Never entered the room air at all.


Some of my early thinking about redesigning versus building DC's was in the "Build or Retrofit?" series ( [1], [2], [3]). Where we landed, decision-wise, was that, no matter what the future might hold (Co-Lo or build our own modern DC), where we were now had to be updated first. We had to do all the work to shrink the footprint and the power in the current space before we could design our future perfect. We did not want to take all the old stuff to the new place. We did not want to move everything, build out a huge cage, and then shrink it by a factor of 10 down the road. Get smaller and smaller, hotter and hotter, till we had a tiny cage that glowed in the dark. We wanted to start with the small, hot cage!


Looking at what we had to work with, internally there were all sorts of DC's, all sorts of sizes, with all sorts of capabilities. But there was one DC that stood out. The one that had the "Good Bones". It had central UPS. It had a GenSet. Its own dedicated Chiller. Lots of available chill water. It was right next to lots of BMC employees, so accessibility was optimal. Even a lights-out DC has people going in and out of it.


Even it was too big though.


Mr Peabody and his boy, Sherman


Time to get into the WABAC (Wayback) machine and see why this place is what it is.


It's 1992. We have outgrown our current building. Our water cooled mainframe needed to be replaced with something better. Stronger. Faster. We used VM/XA to virtualize most everything. Aside: I was, among other things, a VM System Programmer. Virtualization was nothing new even back then. It just did not have the cool factor it does these days.


That one mainframe could appear to be many tens of MVS, VM, and VSE images, but we needed more. The plan was to take the 600 and make it a 720. Think about a second full size MF. We looked at the growth rate. We decided it was time to move the people and the data center to a new place, and design it to meet all our needs. It was the 1992 dream house (opened in 1993), and it had everything. A 16,000 square foot primary DC floor. A 7,000 square foot DC on the floor above to house communications gear and the operations area. Massive chill water pipes, and enough cooling for four water cooled mainframes as big or bigger than the 720. Heat dissipation of 38 watts a square foot in the room, plus the chill water pipes.


In times before that we had lived in a DC without redundant power, and later one that had an old UPS, but no genset. This new one was beyond awesome.


Then, in 1993, we bought Patrol, and all these little computers needed a place to sit. No problem. We had a big, empty data center. We never bought another water cooled mainframe, as the air cooled CMOS based units became all the rage.


Then we connected to the Internet, and now we had more little computers for things like SMTP, firewalls, WWW, etc. (Personal Aside: This was when I learned UNIX and Linux.)


Twenty years passed, and that DC from the early 1990's became the little DC that could. In 1999 we added another 4000 square feet because we had so much gear coming into the DC. That 38 watts per square foot was still holding well, so we went with that in the new space.


The mainframe operators area was removed, and a NOC was installed. World wide control of everything, using Patrol and Mainview. The old Operator area was re-deployed as server space. More servers. We passed what 38 watts a square foot could handle, so we added CRACs.


We added another UPS, which was fun because now we had power outages again for some of the gear that was not on the old UPS: The new one had no genset behind it. A long enough power outage would bring things down.


Cowlings were added to the CRACs in the original 7,000 square feet, and air conditioning added to get to nearly 50 watts a square foot.


The raised floor is only six inches, and had cables under it from 1993. 360 and 370 channel cables....




.... Serial cables. Later generations of cables layered over this like sedimentary rock. Ethernet and fiber looked tiny compared to a MF bus and tag cable set, but there were hundreds of them.


From There to Here


The 20 year old data center had at least one more act in it. Under all those layers of cables were the good bones. Maybe not high heat density bones, but the good ones. A solid building. UPS. Genset. Chill water. It was far better than the labs in other places that had been built out of storage areas and had in-rack UPS and supplemental air units. The next best DC had good airflow management, modern wiring, etc, but no genset to back up the UPS.


It was going to take some work. Even as a temporary place to hold things. Consolidate things. Get ready for the next generation DC, whatever that may be. A year ago we started Phase One of the Go Big to get Small project, and that meant fixing the sins of the past of the DC. Hey: They seemed like good ideas at the time.


It also meant getting rid of 16,000 square feet of DC (an entire floor), and making it all fit in the remaining 11,000 square feet. With room and power to spare to be able to absorb other DC's that needed modernization just as badly if not worse.


We gave ourselves a year to swizzle every platform into new footprints of gear. To redesign the airflow, re-lay out the room. Pull all the underfloor cables damming the air from flowing to where it was needed. Increase the density of the computing, but keep it spread out enough to live in the watts per square foot we had. The new/old DC would be more efficient because it would mix hot/cold air less. It would get more for the cooling dollar. Hold more gear for the same amount of power. We would not just reduce CO2 because we were virtualizing and densifying, we would because the room would be more efficient.


We had to do all this without delaying any products or causing any outages to production things like the network or the virtualization infrastructure. We did not want to spend a great deal of money on the DC redesign because it was only going to be a stepping stone to the future perfect: Knowing there is no future perfect, because things change. Technology changes. Needs change: No matter what we design now or in the next year, more than likely it won't work for whatever comes twenty years from now.


As I write this, I am also getting ready to start Phase 2 of the project. I have an 11,000 square foot DC with room, power, and HVAC to spare, and I have other DC's that need to move in here. Go through the process. Shrink. Use less space and power. Get us ready for the DC of the future. The one that is half this size, handles 250 watts a square foot or more, and holds what, in 2001, was over 70,000 square feet of data centers and labs.
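To put the watts per square foot numbers side by side, here is a rough sketch of the heat budget each room implies. It assumes the design density applies evenly across the floor area, and it reads the current 11,000 square feet as the original 7,000 (uprated to nearly 50 watts) plus the 4,000 added in 1999 (at 38 watts). Both are my simplifications, not measured values.

```python
# Rough heat-load budget: floor area times design watts per square foot.
# Areas and densities are the figures quoted in the post; treating the
# density as uniform across each area is an assumption.

def heat_load_kw(square_feet: float, watts_per_sq_ft: float) -> float:
    """Return the heat load a floor area is designed to dissipate, in kW."""
    return square_feet * watts_per_sq_ft / 1000.0

# Today's room: 7,000 sq ft uprated to ~50 W/sq ft plus 4,000 sq ft at 38.
current_kw = heat_load_kw(7000, 50) + heat_load_kw(4000, 38)

# The "DC of the future": half the size at 250 W/sq ft.
future_kw = heat_load_kw(11000 / 2, 250)

print(f"current room: {current_kw:.0f} kW")
print(f"future room:  {future_kw:.0f} kW")
print(f"ratio:        {future_kw / current_kw:.1f}x")
```

Half the floor, but well over twice the heat to move. That is why the future room is a redesign and not just a smaller copy of this one.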


Next time: Numbers and pictures of the new / old DC.


I have mentioned this several times in the series, but will quickly reiterate it here. While in many ways I could be talking about a production shop at AnyCompanyAnyWhere Inc., I am not. This is much more complicated because this is for R&D. This is for our thousands of customers and our hundreds of products. This is where the products get designed and built and supported. We have no particular favorites, but we standardize where we can. Where it makes sense to. For all that, we will have one of most everything, and that goes for storage.


Stretched out across all my R&D data centers are things like IBM DS/Shark, XIV, and SVC. Apple XRAID. Compaq/HP StorageWorks. Xsigo. Hitachi. EMC. Sun/Oracle/Storagetek. Various white box players. JBOD. On and on. Stuff from vendors long dead. Stuff from vendors I can not tell you about.


Not to mention Terabytes of local storage.


Where it makes sense for R&D to have access to this device or that one, we have it. Where the storage is just a LUN presented to a VM, we cut back on the variety a bit.


I am going to talk here about the Hitachi VSP. Its lessons are generalizable to any SAN storage that might be used for the same mission. Here is one of ours:



Note the three large rectangular divisions: I am calling them "bricks" here, but that is just a name I made up because the front design kind of looks like a brick or tile wall, and calling them tiles ... just did not seem right. These "bricks" are the DKC's and DKU's that the storage array is built from. More on that further on.


Enterprise Storage and Virtual Density


There are certain things one must do when trying to get 10-to-1 decreases in server footprint. When the wall of blade servers goes up, there is more than just the heat coming off that wall to consider:


  1. Boot From SAN: All of it. Every blade from every vendor. Every VM. Every host OS. Keep this as utility as possible so that the servers are just compute nodes. Don't give in to any temptation to install local storage because it is easier. You will be sad. Sooner or later if you do.
  2. Fiber Channel it. As fast as you can afford. We went with 8 Gb, and are ready for 16 Gb on most of the blades / chassis. We looked at Infiniband, and it is not off the table for future iterations, though some early work in storage virtualization left a bad taste in our mouth about it. Ditto iSCSI.
  3. Enterprise Class: When you have this many assets running in this small a space, going down becomes massively more painful. A single blade might take out fifty to seventy VM's. A single chassis ten times that. But the central SAN failing is all of it. Thousands of VM's. The entire internal cloud.
  4. Tier it. Virtualize it. Thin provision it in the hardware. You need to go fast, and you do not want unused bytes of expensive enterprise class storage just sitting around hoping that someday someone will use them.


Gigabytes and Kilowatts


Other than bytes, one fairly common way to measure your storage is how much power it takes to run how much storage. Gigabytes / Terabytes per watt / kilowatt kinds of numbers. This being 2013, I'll go with Watts per Terabyte usually in this post.  Before the "Go Big" consolidation efforts started, some of the devices we had/have in the DC came out when watts per Gigabyte or even Megabyte made more sense. I'm looking at you 1993.


Another thing to consider is that with Enterprise devices like these, the money is up front. By that I mean that a base level device is the most expensive way to build it. A big empty frame is expensive, even though it positions one for less expensive upgrades down the road. It is also the most expensive in terms of watts per Terabyte. Numbers on that farther down the post.


Buying as big as possible up front will save monetary units down the road.


I don't have an amp clamp on any of our VSP's, so all I can measure in our DC is at the PDU or UPS. I can use lots of data center math, and DCIM tools like Nlyte to get pretty close. So that one can follow along at home, I am going to use a free tool that anyone can use. Hitachi's Weight and Power calculator.


There are other advantages to using the same tool.  How we put one of these arrays together is not going to be the way anyone else is. How much tiering. How much total storage. How many controllers. How much cache. All of it. It all changes the numbers for storage-to-power ratios. The calculator is a spreadsheet, and this was version 13.14.


Side Note: I tried to run this under LibreOffice 4.1 and Excel 2011 on the Mac, but this SS appears to be sadly MS Windows specific. That's why I have a Windows 7 VM though.


For the purposes of this article, I'll put together two different configs in the tool: A starter system and a midrange setup. We'll see how that works out as far as power. It is between you and your finance department how much you are spending for this of course.


A lot of the Hitachi specific terminology used here is in this reference guide.


This SAN is the center of the Virtual world. Some things should not be skimped on. I'll put the cache at 512 GB for both configs. For fat pipes, I am maxing out the 16 port fiber channel at 4 per controller (DKC). That appears in the tool to reserve some ports for high speed internal usage.


The Hitachi VSP can scale all the way into the Petabytes. In the maximum config it has 6 standard rack size cabinets, about 24" by 40" by 42U.


This first config will just be a single maxed out cabinet (Frame 00 on the config). That is 1 DKC (Controller “brick”) and 2 DKU's (disk “bricks”). Each disk brick can hold either 80 large form factor (LFF) disks, or 128 Small Form Factor (SFF) disks. I will make one SFF (DKU-01) and one LFF (DKU-00).


Config 1: Small. Medium Performance. Big cache.


Using the tool, I will stuff 80 7200 RPM SAS LFF drives at 3 TB each, and 128 10,000 RPM SAS SFF drives at 900 GB each. I will configure 8 spare drives of each type, and that gives me 324,000 Gigabytes of storage, with two tiers, lots of redundancy, and about 260 TB of usable capacity to thin provision into.


According to the Calculator that is 4.3 Kilowatts of power under standard load (whatever that means). Still : an amazingly small amount of power. 13 watts per Terabyte raw, or 16 watts per TB after formatting. There once was a day when 25 watts per TB was the holy grail of SAN storage. Because of the density a 3 TB SAS disk brings to the party, we are well under that.
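For anyone following along at home, the arithmetic behind those numbers can be sketched in a few lines. The drive counts, sizes, spares, usable capacity, and the 4.3 KW figure are the ones quoted above. The variable names are just mine, and this is no substitute for the calculator itself.

```python
# Config 1: one LFF brick and one SFF brick, 8 spares of each drive type.
lff_drives, lff_tb = 80, 3.0    # 7200 RPM SAS, 3 TB each
sff_drives, sff_tb = 128, 0.9   # 10,000 RPM SAS, 900 GB each
spares_per_type = 8

# Raw capacity counts only the drives that are not held back as spares.
raw_tb = ((lff_drives - spares_per_type) * lff_tb
          + (sff_drives - spares_per_type) * sff_tb)

usable_tb = 260.0       # "about 260 TB" after formatting, per the post
power_watts = 4300.0    # 4.3 KW under "standard load", per the calculator

print(f"raw capacity:        {raw_tb:.0f} TB")
print(f"watts per raw TB:    {power_watts / raw_tb:.1f}")
print(f"watts per usable TB: {power_watts / usable_tb:.1f}")
```

That lands on 324 TB raw, a bit over 13 watts per raw TB, and between 16 and 17 watts per usable TB.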


We actually started with a config very similar to this one, and connected literally thousands of VM's to it, and they perform far better than they did before. “Before” was on old, slow hardware with internal disks. If this works, one can only imagine what a better tiered, more scaled up version can do... so let's have a look.


Config 2: Medium Size, Good Tiering.


Enabling Frame 01 and Frame 02, this is the biggest the Hitachi can go without adding a second controller frame (DKC 1).


All these bricks! What to Do?


I put in four SFF bricks and four LFF bricks. For the purposes of this discussion, five tiers (only four performance tiers: there are two 7200 RPM disk types. I just did that to show it could be done):


  1. SAS SFF SSD 400GB
    1. 56 drives

    2. 8 spares
    3. This will be 22 TB of smoking fast storage.
  2. SAS SFF 15K 300GB
    1. 216 drives

    2. 10 spares
    3. 64 TB of high speed storage
  3. SAS SFF 10K 900 GB
    1. 212 drives

    2. 10 spares
    3. 190 TB of medium speed storage
  4. SAS LFF 7200 3 TB
    1. 228 drives

    2. 12 spares
    3. 684 TB of low speed storage
  5. SATA LFF 7200 2 TB
    1. 68 Drives

    2. 12 spares
    3. 136 TB of low speed / lower cost storage


That is a lot of disks and a lot of storage: It's easy to play around and mess with the ratios of the tiers. Performance data would give you some clues about the best tiering. I went with this since it demonstrated the point and made sure there was always plenty of storage in each tier. Also plenty of backup drives of each type. It's not meant to be optimal to cost or even power, but rather something I feel comfortable with saying that tens of thousands of VM's could run here. As the back-end for a CLM install? No problem.


The power calculator reports this is 11.4 KW for just over a petabyte raw. 4 watts a TB raw or 5 watts a terabyte formatted. There is a pretty good chunk of 10 and 15 thousand RPM disks spinning. Fast disks use more power than slow ones: only makes sense.
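Summing the tiers is a quick way to sanity check the slot counts and the "just over a petabyte" figure. The sketch below assumes the drive counts listed for each tier exclude the spares, which is what the per-tier TB figures imply. The brick sizes (128 SFF or 80 LFF slots) come from earlier in the post. One caution: dividing 11.4 KW by this simple sum gives roughly 10 watts per raw TB, not 4, so the calculator's own figure presumably rests on inputs that are not reproduced here.

```python
# (tier, form factor, drives in service, spares, TB per drive)
tiers = [
    ("SAS SSD 400 GB",   "SFF",  56,  8, 0.4),
    ("SAS 15K 300 GB",   "SFF", 216, 10, 0.3),
    ("SAS 10K 900 GB",   "SFF", 212, 10, 0.9),
    ("SAS 7200 3 TB",    "LFF", 228, 12, 3.0),
    ("SATA 7200 2 TB",   "LFF",  68, 12, 2.0),
]

# Four bricks of each form factor, at 128 SFF or 80 LFF slots per brick.
slots_available = {"SFF": 4 * 128, "LFF": 4 * 80}

# Every slot holds either an in-service drive or a spare.
slots_used = {
    ff: sum(drives + spares for _, f, drives, spares, _ in tiers if f == ff)
    for ff in slots_available
}

raw_tb = sum(drives * tb for _, _, drives, _, tb in tiers)
power_watts = 11400.0   # 11.4 KW, per the calculator

for ff in slots_available:
    print(f"{ff}: {slots_used[ff]} of {slots_available[ff]} slots filled")
print(f"raw capacity:     {raw_tb:.0f} TB")
print(f"watts per raw TB: {power_watts / raw_tb:.1f}")
```

The drive counts fill all eight bricks exactly, which is a decent sign the tier list was transcribed correctly.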


It would be easy to make this lower power / higher capacity. This config should keep the Fiber Channel pipes full.


Adding a fourth cabinet adds more disk controllers in another DKC, and everything should more or less scale linearly from there. As you can see from the picture, we don't have anything like this big and fast yet: All kinds of growth are possible in speed and capacity, with very little increase in footprint or power consumption. We shrank the 38,000 square foot data center to 11,000 square feet, dropped all sorts of power, and fit everything storage-wise in these two cabinets.


Full Disclosure / Broken Record: We did not virtualize everything. The Heterogeneity previously alluded to.... That means not every single server in the DC has its disk space out here. Everything we could put here, or on something like it, we did.


Compared to What?


Looking back at the starting place for all of this desire for consolidation and hardware updating: Racks full of thousands of servers. Real physical servers. Often desk-side engineering stations on shelves in racks, making airflow management a pain.


There was/is NAS installed of course, but each server booted internal disks, and often had at least two internal SCSI disks. Disks such as Seagate Cheetah, Fujitsu, or IBM Ultrastar. I have a couple of them on my desk here: This 18 GB IBM Ultrastar I am looking at right here (see it?) is rated at 700 mA at 5 volts (3.5 watts) and 800 mA at 12 volts (9.6 watts) for a total of 13 watts. This other disk is a 33.8 GB Fujitsu. It says 5V / 1 Amp + 12V / 1.2 Amps. 5 watts + 14.4 watts = 19 watts peak. The specs for this Seagate Cheetah say idle power runs between 8.7 and 11.68 watts, depending on interface.
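The label math is just volts times amps on each rail, added together. A minimal sketch, using the nameplate currents read off the two drives above (the function name is mine):

```python
def drive_watts(amps_5v: float, amps_12v: float) -> float:
    """Nameplate power of a disk: sum of volts x amps on the 5 V and 12 V rails."""
    return 5.0 * amps_5v + 12.0 * amps_12v

ultrastar_18gb = drive_watts(0.7, 0.8)   # IBM Ultrastar: 700 mA @ 5 V, 800 mA @ 12 V
fujitsu_33gb = drive_watts(1.0, 1.2)     # Fujitsu: 1 A @ 5 V, 1.2 A @ 12 V

print(f"Ultrastar 18 GB: {ultrastar_18gb:.1f} W")
print(f"Fujitsu 33.8 GB: {fujitsu_33gb:.1f} W")
```

Nameplate ratings are peak numbers, so real draw at idle will be lower, as the Cheetah's idle spec suggests.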


We had piles of servers with 18, 33, and 72 GB drives. 146 GB was considered huge back in the day. Over five hundred servers had 20 GB disks or smaller.


Extremely conservatively: If the average across the shop was 30 GB, at 10 watts each it is easy to figure out some interesting things like watts per GB and power reductions. This is extremely conservative, if for no other reason than there were a fair number of 5.25 inch disk drives still in use, and they use *a lot* more power than this. The vast majority were either SCSI or IDE, not Fiber Channel or SAS.


The next question is at what point in the history of the Data Center are we comparing this to? Go back 10 years, and we had 17,000 real systems across the globe, and well over 5,000 in the biggest one. We had two 750 KVA UPS's loaded up.


Today we are using less than the full capacity of one UPS. Over the last year or so has been the real Go Big to Get Small effort, so that there are only about 2,000 devices in this one large DC. 4,000 or so disk drives. That's easily 30 KW. A third of a watt per Gigabyte, or over 300 watts per Terabyte!
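Here is that back-of-the-envelope comparison written out. It uses the "extremely conservative" 30 GB and 10 watts per drive from above, and compares against the roughly 13 watts per raw TB of the small VSP config. Treat it as a sketch of the reasoning, not a measurement.

```python
# Legacy fleet, using the conservative averages from the post.
avg_drive_gb = 30.0
avg_drive_watts = 10.0
drive_count = 4000

legacy_watts_per_gb = avg_drive_watts / avg_drive_gb
legacy_watts_per_tb = legacy_watts_per_gb * 1000.0
fleet_kw = drive_count * avg_drive_watts / 1000.0

# Config 1 of the VSP came out at about 13 watts per raw TB.
vsp_watts_per_tb = 13.0

print(f"legacy: {legacy_watts_per_gb:.2f} W/GB, {legacy_watts_per_tb:.0f} W/TB")
print(f"fleet of {drive_count} drives: {fleet_kw:.0f} KW at the nameplate average")
print(f"legacy vs VSP: {legacy_watts_per_tb / vsp_watts_per_tb:.0f}x the watts per TB")
```

Even the small VSP config is well over an order of magnitude better per Terabyte than the islands of internal disk it replaced.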


Even worse: That capacity was sprayed all over the place: 38,000 square feet of data center, full to the rafters with old systems, each one an island of storage. Shared capacity was only on the NAS (of which there were many terabytes). Better than a bunch of cross mounted systems, but still not optimal.


No thin provisioning. No tiering. The system went as fast as the fastest disks you bought for it. If the system that was attached to the storage failed, that storage and whatever it did for the business was unavailable. If it was something like a Continuous Integration server, things could grind to a halt till the server was fixed, or the disks swapped to a working server, or some other similar thing.


Not everything could be HA when running amber waving fields of physical servers. Virtualize it. Consolidate it. Now everything has to be HA, but everything benefits from that. Not only is power being saved, and space reduced, and less CO2 emitted, the availability is *higher*.


It's also easier to find all your "Stuff". It's right there.


Pete and Repeat


I want to state again: not everything in the DC is Hitachi. I picked the VSP for this post because we have a lot of it, and it is a good example of what would be achieved with any Enterprise class storage. Had I picked IBM XIV, the details of how the storage is installed, how the controllers work, how the disks go in to the frame and are laid out would have been different. The point would be the same. Ditto any other Enterprise vendor's gear. Go big and dense to achieve the goal. Make sure it is HA.


In another kind of repeat, if this all seems like we have been here before it is because we have been. We just called it the mainframe back then. Therein lies the tale of the data center, and the redesign, and that's next.



Some useful blades....


Before I move on to things like Storage or DC design, I wanted to take a post to look back at all the blades I have discussed or at least alluded to. By now most of what we are up to should be obvious, but what the heck: Never hurts to underline a point.

One of the main points is this twofer:


  • We think Virtualization is the key to successful DC consolidation, and across all major platforms these days
  • Also of a “These days” nature: Blades are ready for virtualization duty.




Are blades the densest, most virtualization concentrated option a given vendor may have? Two examples from UNIX:


  • Our Sun blades are T4-1B's. That is 1 socket, and there are no two or four socket blades from Sun.
  • Each 10 U chassis holds 10 blades, so 10 sockets, and at 128 GB per blade, 1280 GB in 10 U.
  • Now, after the most recent product announcements, we would get T5-1B's, with 256 GB per blade, or 2560 GB per 10 sockets / 10U
  • The T5-8 is 8 sockets, and 8U, and enough DIMM sockets to result in the same memory density per socket as the blade solution.


So, more or less, running Sun blades is the same density as the Sun rack mount solution. The upside of the blades is that the blade chassis can be upgraded a blade at a time. Need 4. Buy 4. Have empty slots left for future needs. In our case a new blade acquisition can be the newer, denser T5 based blades.
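The Sun comparison above boils down to sockets and GB per rack unit. A quick sketch using the numbers from the list. The T5-8 memory figure is derived from the "same memory density per socket" statement, so it is an inference, not a spec sheet value.

```python
def density(sockets: int, gb_per_socket: int, rack_units: int):
    """Return (total GB, GB per rack unit, sockets per rack unit)."""
    total_gb = sockets * gb_per_socket
    return total_gb, total_gb / rack_units, sockets / rack_units

# One-socket blades, 10 per 10U chassis.
t4_blades = density(sockets=10, gb_per_socket=128, rack_units=10)
t5_blades = density(sockets=10, gb_per_socket=256, rack_units=10)

# T5-8 rack mount: 8 sockets in 8U, same GB per socket as the T5 blade (inferred).
t5_8_rack = density(sockets=8, gb_per_socket=256, rack_units=8)

for name, (gb, gb_per_u, sockets_per_u) in [
    ("T4-1B chassis", t4_blades),
    ("T5-1B chassis", t5_blades),
    ("T5-8 rack mount", t5_8_rack),
]:
    print(f"{name}: {gb} GB, {gb_per_u:.0f} GB/U, {sockets_per_u:.0f} socket/U")
```

Same sockets per U and same GB per U for the T5 blade chassis and the T5-8, which is the "more or less the same density" point.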


Advantage to the Sun Blade solution.


  • Our IBM Pureflex has the FSM, taking up one of the 14 blade slots. Leaves 13 slots for compute nodes.
  • Each slot can hold a two socket / 256 GB Power 7 based server.
  • That's 26 sockets and 3,328 GB in the first chassis. Later chassis don't have the FSM sucking up a blade slot, so a bit more capacity there.
  • We use a lower density DIMM for cost reasons, and because we think in a virtual environment that 256 GB of RAM to 2 sockets is the right ratio. You could double that RAM number with the higher density DIMM.
    • Side note: We wish the HMC could manage these blades. Save us a slot, and we are not using any of the advanced features the FSM gives us. But we knew it was a Pure solution when we bought it, and it was worth it to get to the latest blade tech.

  • Now that would be Power 7+ based blades, and like the Sun, we can make all the new blades Power 7+, and later Power 8, etc.


IBM has some seriously huge Power 7 based systems. The biggest is the 795. In one rack footprint, it will house 32 sockets / 256 cores, and 16 TB of RAM. Using the same DIMM's we use it would be 8 TB.


That's cool if you are after that much power in a single image, and also: It is a full 42U rack : We need to triple the blade number to compare :

  • Blade Chassis 1 (with FSM): 26 sockets / 3328 GB RAM
  • Blade Chassis 2 (14 available slots) : 28 sockets / 3584 GB RAM
  • Blade Chassis 3 (14 available slots) : 28 sockets / 3584 GB RAM
  • Three Chassis, 30U in a 42U rack: 82 Sockets / 10,496 GB RAM


I could choose a 48U rack, slide in another chassis, and go even further, but it is clear the Blade has higher density and RAM, and once again, upgrading the blade can be done a blade at a time, and as new processor and memory densities come out the new blades can incorporate those.
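The chassis totals above are easy to reproduce. This sketch uses the two socket / 256 GB compute nodes we buy and the slot counts from the list, and sets them against a 795 fitted with the same DIMM's (8 TB, taken here as 8 x 1024 GB). The constant names are mine.

```python
SOCKETS_PER_NODE = 2
GB_PER_NODE = 256
CHASSIS_HEIGHT_U = 10

# Compute slots per chassis: the first chassis loses one slot to the FSM.
compute_slots = [13, 14, 14]

blade_sockets = sum(slots * SOCKETS_PER_NODE for slots in compute_slots)
blade_ram_gb = sum(slots * GB_PER_NODE for slots in compute_slots)
rack_units_used = CHASSIS_HEIGHT_U * len(compute_slots)

# Power 795 in one rack footprint, with the same lower density DIMM's.
p795_sockets = 32
p795_ram_gb = 8 * 1024

print(f"blades: {blade_sockets} sockets, {blade_ram_gb} GB in {rack_units_used}U")
print(f"795:    {p795_sockets} sockets, {p795_ram_gb} GB in a full rack")
print(f"socket ratio: {blade_sockets / p795_sockets:.1f}x")
```

The blades come out ahead on both sockets and RAM, with 12U of the rack still empty.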


Our conclusion: Blades are the way to go these days. Having watched them for years, they have finally arrived as the most computing density with the most flexible deployment options available today. Perfect for that little private cloud you have always wanted to build.


Other Blade Advantages


Being a heterogeneous shop, since we are an R&D shop, we have no real reason to take advantage of one of the main selling points of a blade. That is the ability to run more than one processor architecture on different blades inside the same chassis.


We use Dell and Cisco for our AMD64 blades, but if we were, say, an all IBM shop, we could have a mix of Xeon and Power 7 blades in the PureFlex. Quick aside: To some degree we already take advantage of the multi-OS nature of the IBM solution, since we can run AIX, i Series, and Linux on Power on the same chassis.


Sun has Xeon blades for its Blade Chassis. HP has Xeon blades for its blade chassis. In addition we plan to run VMS on the Itanium blades we already have, in addition to HP-UX.


Between the four chassis types (Sorry Dell and Cisco: Lumping you together here) we can run numerous operating environments. KVM. Xen. VMware. Linux. Linux on Power. AIX. i Series. Solaris. HP-UX. VMS. Windows. Hyper-V.


If you are more dedicated to one of those vendors than another, you can (for example) get both your Red Hat KVM on AMD64, and Red Hat KVM on Power (Red Hat announced KVM on Power at the 2013 summit: had to slip it in...) from the same vendor on the same chassis. Or, in the more likely scenario, run AIX on Power, and an AMD64 operating system of some kind on the same chassis. If you are a shop trying to sole source, it's handy.


It's something that Dell or Cisco or any of the pure AMD64 blade vendors can not match, although there are enough OS's for AMD64 to go around, so I doubt they are losing any sleep over it.




You can not put hundreds upon hundreds of operating environments into this density without also having redundant bandwidth. Take VMware for example: We have something like 15,000 VM's of all types, spread across the globe on the VMware servers. When they talk to each other inside their host server, this all happens at virtual network speeds. When they talk outside the host, it goes at whatever speed the server is hooked up as. A single Dell chassis with 16 fully configured blades in theory will run 700-800 VM's.
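To get a feel for what those numbers mean per blade, here is a small sketch. The 700-800 VM's per chassis is the "in theory" figure above. The chassis count is purely illustrative, since not all 15,000 VM's live on Dell blades or in one place.

```python
import math

BLADES_PER_CHASSIS = 16
VMS_PER_CHASSIS = (700, 800)   # theoretical low / high for a full chassis
TOTAL_VMS = 15_000             # all VMware VM's, worldwide

# VM's each blade would carry at the low and high end.
vms_per_blade = tuple(v / BLADES_PER_CHASSIS for v in VMS_PER_CHASSIS)

# Chassis needed if everything ran on this one platform (best case, worst case).
chassis_needed = tuple(math.ceil(TOTAL_VMS / v) for v in reversed(VMS_PER_CHASSIS))

print(f"VM's per blade: {vms_per_blade[0]:.0f} to {vms_per_blade[1]:.0f}")
print(f"chassis for {TOTAL_VMS} VM's: {chassis_needed[0]} to {chassis_needed[1]}")
```

Forty to fifty chatty OS images behind each blade's network ports is the reason the bandwidth has to be both fat and redundant.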


When they chat with each other inside that blade, it is all virtual fast. When they talk across blades (the east / west communications), it happens across the internal switches in the back of the chassis. Still extremely fast.


When they chat outside the chassis, they hop to the Top of Rack (TOR) switch, and move on out from there. With that many OS images running around inside the one chassis, the times it wants to talk outside the chassis will be frequent. We went with 10 Gb Ethernet, and have an eye on 40 Gb should the need emerge.


All the vendors support 10 Gb. No problem. We had to retool the TOR and core capabilities to handle it, but it only makes sense. This is not an area to scrimp on.


Similarly, we went 8 Gb on the Fiber Channel, and will watch to see if there is any need to go to 16. Everything is boot-from-SAN, so all the disk I/O happens here. There are no internal disks on the chassis to take any of the I/O load off.


Being fast at I/O is also a big deal inside the VM. The biggest knock a VM gets is that they are bad at I/O, and that is because, in the not too distant past, VM's were bad at I/O. The fastest way to get your customer fighting your "Virtual First" policy is to be really bad at Virtual I/O in a customer's I/O intensive environment.


10U seems to be the consensus for how tall a blade enclosure should be.




Every one of these blade environments has its own particular set of management tools. The Suns have Ops Center. IBM has the FSM. Etc. So on. So forth. Each technology has different remote access (though this is also a similarity, as they all have remote access).


Most have redundant switches for Ethernet and Fiber Channel. Some add Infiniband or QBR options. But the Sun does not have a common / redundant fiber channel switch in the backplane: Rather each blade has FC cards. Side note: We had hoped the new T5 Blades would change this, but they did not. So the good news is that the T5 blades go into the chassis we already have. The bad news is we still have to buy Express cards on a per-blade basis, which consumes far more ports on the central FC switching.


Most have six power supplies, and run 3+3, but some like the Sun have 2 and run 1+1. More FC connections to the Sun, but fewer AC power cords...


Most have two socket blade options, but the Sun only has one socket per blade. Some go all the way to four sockets per blade.


They all may have settled on 10U, but how many blades fit in that chassis varies. For Sun, it's 10. No half height. For IBM, it's 13 half height, and one FSM in the first chassis, and then 14 half or seven full width blades from there on. Etc.


We will not get the exact same number of VM's per socket either: different virtualization technologies vary in what they can do here. With KVM, I can over-commit the RAM more than with VMware. That changes the per-socket VM counts, since 80% of the AMD64 workload is RAM constrained, not CPU constrained. With Sun's Zones / Containers we can really ramp up the virtualization counts because of the way RAM is shared, but the reverse is true when using LDOMS. IBM is the same way when looking at WPARS versus LPARS. Etc.


No matter what the virtualization tech is, the result is more or less the same: Higher density than if we were using rack mounted servers. Much higher density than the fleet of gear we just retired.


10-to-1 and Two Thirds Reduction in Space / Power


I have said it many times before, but to repeat myself: We are an R&D shop. That means many things, among them Heterogeneity. We get to be virtual first... right up until we have a workload that can not be virtualized.


Here are some reasons we can not virtualize something:


  • The platform is too old: No virtualization solution exists. Example: VMS version 7.x
  • Repeatability: The rock that virtualization dashes upon: if you need to know how long something takes, you have to do it on real hardware. Anything else is something no one would believe the number on. VM, on the mainframe, has had this problem for going on 5 decades, depending on how you consider the early work at Cambridge.
  • Scalability: Same thing as above.
  • I/O: Maybe. Maybe not. With the right hardware, you can dedicate I/O to important VM's, and cut out the virtual middleman. But that is expensive in terms of hardware footprint, and starts reducing some of the goodness you get from virtualization. This is a case-by-case call. SAN matters too, and I'll get to that in another post.
  • A BMC product needs access "Under the hood". For example, bare metal provisioning like Blade Logic. Can't test that only on a virtual solution. At some point, you have to run on the real iron.
  • Workload exceeds certain size parameters. Right now if it's bigger than 16 GB, has more than 4 CPU's, and needs terabytes and terabytes of local storage for whatever reasons, I would take a long hard look at it to see if it should be real or virtual. With KVM solutions like RHEV you can, in theory, run absolutely monstrous VM's (2 Terabytes of RAM in one VM? Nice to know it can do it, but nothing I will need soon.): But why consume nearly an entire host just because you can?
    • With Blades I can just slide the end user into a dedicated blade, and if they stop using it, I can turn it back to serving as a virtualization host. Most blades can scale to whatever real need an end user might have, and if they want less, we'll get a big one so we don't waste the slot later.
  • Portable demo environment because the Internet is not available
    • That may be virtual. Probably is. But it may be on a laptop, not on a Blade based server in the DC.
    • See this use case less and less, as the Internet continues to insert more of its tubes all over the place, and people figure out secure ways to let people access it. The second one can get to the Interwebs, they are able to get in to the internal Blade based cloud.


Here are some reasons I will inevitably be offered that are not real reasons to not virtualize on a blade:


  • I want a real machine.
  • I work from home
  • It needs to be inside network x.y.z
    • This one may still have some truth, but with SDN, it is going to die as a valid reason soon if it did not already.
  • VM's are too slow.
    • If your VM's are too slow, then the hosts need work. My Windows 7 environment is almost solely a VM these days, and it's like being on a real machine. I get to it from my Mac or Linux desktops. No issues. And it's nice not having to run it locally.
  • I need physical access to my machine so I can:
    • Insert media
    • Reboot it
  • My real machine is cheaper than your virtual one
    • If hardware were the only factor in price, that would probably be true, but support costs are always a huge multiple of the real hardware cost.


There are valid reasons to keep real, physical rack mount servers around, and that reduces the amount of space and power we can recover. I have mentioned over and over in this series the 10-to-1 reduction number for both power and space, and from that it would be easy to assume that means I expect, overall, to get to that kind of reduction. That, for example, my 38,000 square foot DC will become 3,800 instead.


Maybe. Someday. Right now we have a more modest goal. 2/3's reduction in space and power. That is nothing to make fun of. It equals real carbon reductions and real cost savings. We get there because we can leverage the 10-to-1 reductions we can get from Blades and Virtualization to offset the hardware we have to keep around at full size. Over time those reasons will fade. We won't have to support a VMS 7 system anymore, and so once everything is on 8.2 or later, it can all be virtual.
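To see how a 10-to-1 gain on part of the load turns into a two thirds reduction overall, here is a minimal model. It assumes space and power shrink in proportion to the share of the load that moves to blades, and that everything else stays at full size. That is a simplification of mine, not how we actually plan the floor.

```python
def overall_reduction(virtualized_fraction: float, ratio: float = 10.0) -> float:
    """Overall reduction when a fraction of the load shrinks by `ratio`
    and the rest stays at full size."""
    remaining = virtualized_fraction / ratio + (1.0 - virtualized_fraction)
    return 1.0 - remaining

def fraction_needed(target_reduction: float, ratio: float = 10.0) -> float:
    """Share of the load that must get the `ratio` reduction to hit the target."""
    return target_reduction / (1.0 - 1.0 / ratio)

needed = fraction_needed(2.0 / 3.0)
print(f"share of load that must move to blades: {needed:.0%}")
print(f"check: overall reduction = {overall_reduction(needed):.1%}")
print(f"38,000 sq ft at two thirds reduction: {38000 / 3:,.0f} sq ft")
```

Under that model, about three quarters of the footprint has to get the full 10-to-1 treatment to land at two thirds overall, which leaves room for a sizable pile of real iron.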


The low hanging fruit still allows us, with a strong investment in time, design, people, and infrastructure to get on the road to that 2/3's reduction goal. With each success we learn something, and we earn the reputation of having a new way of doing things that works. Helps with those that resist the change, and just want to keep everything under their desk.


So: Enough on Blades. Next time, Storage.


Last post I finished up on the UNIX blade servers by talking about HP's blades. That clears the deck to get into what is actually the much larger footprint blade-space, AMD64 blades. We have Cisco UCS blades as well, but I am going to cover the Dell M1000e here.


The Wide Wide World of AMD64


On Power chips you can run well supported Linux or AIX. On Sparc chips you can run Solaris, plus an assortment of special purpose OS's (most of which I honestly never heard of before today). On Itanium you can run HP-UX, Non-Stop, and VMS, and some Linux and BSD versions are still available. Red Hat and Microsoft famously announced they would remove support back in 2009.


No chip supports as many different operating systems as AMD64 architecture chips. Intel's new Haswell line looks to keep that going with better performance, and substantially better power savings.


It's not just the entire line of every MS OS there is, and pretty much every version of Linux and BSD. Solaris continues to have X86 versions. Android has been ported (admittedly, Android is Linux). On and on. The big kids in the room as far as servers go are currently Linux and MS Windows Server of course.


Apple's OS.X is on this architecture (though tightly tied to the Intel chipsets for now: I know of no AMD chips that commercially run OS.X). Even though it is an AMD64 design, Apple limits OS.X to running only on their own hardware, so we can not leverage the AMD64 blades to support that OS. (Note: We *do* support OS.X with a number of products. IOS too, but those are different posts.)


Just in terms of total number of OS images, we have well north of ten thousand OS images running on AMD64 based hardware. We are moving as much of that as we can, worldwide, to blades.


The reasons are the same as they were for UNIX: we are after 10-to-1 power reductions, significant CO2 reductions that follow power reductions, 10-to-1 space reductions, and a modernization of our entire data center footprint.


Side Note: It will be extremely interesting to watch ARM in this space. ARM has just about as many OS's running on it as there are AMD64 OS's these days, and they are very interested in creating low power server solutions. If that should "become a thing", then I can see another chipset entering the data center in quantity. The good news here is that it will already be a low power solution, and therefore fit into our overall goals, though it remains to be seen how it will be packaged. Maybe a rack shelf full of smart phones with cracked screens? According to some, we are still a year or so away from adding ARM based servers to the DC.


If nothing else, the commercial success of Linux and Windows means we would have *lots* of AMD64. The more of our customers that run those platforms, the more R&D we do on those platforms. Only makes sense.




The M1000e chassis that holds the blades should, by now, "sound" familiar. Up in the front you can stick blades, and blades can be half or full height.  8 full height, or 16 half height per chassis. We have a mix:




There is a midplane that everything connects into.


Around back there are power supplies (six 2700 watt units) and slots for a plethora of switches. I do mean a metric ton of options. A total of six slots are there to slide all those switch options into, so everything can be fully redundant (three redundant fabrics of whatever type you need).


For our needs, 8 Gb Fiber Channel and 10 Gb Ethernet, matching what we have on the UNIX blades, are available. There are switches from Brocade, Dell (PowerConnect) and Cisco.




The M620 is a beautiful blade for virtualization. You can run 2 sockets, 8 cores each, and 256 GB of RAM (actually way more than that...). It matches our rule of thumb (ROT) about the ratio of processor power to RAM perfectly. It supports all the major AMD64 OS's, including Red Hat Enterprise Virtualization, Citrix XenServer, and of course VMware ESX.


Unlike some of the blades, it does all that in a half height blade, so that we can slot in 16 servers in one 10U space. Since it is boot from SAN, there are no internal hard drives (as we configure it: you can put two in) to mess with.


Crammed for Power


One of the ongoing things we looked at with something this dense is how to wire it up. Six power supplies, each rated at 2700 watts. Connected to 208V (US) power, that's about 13 amps per plug. The power supplies can run anywhere from 3+3 to 5+1, depending on how much power you need for the servers installed in the chassis. Hooking them to C20 plugs, and backing the PDUs with 30 amps each, gets it done.


That math may seem screwy but it's not. Not for this configuration.


For the M620, consider the 115 watts per socket E5-2665, 2.4 GHz part as the processor.


In our case, with switches and 16 M620 blades, we are looking at about 5,300 watts total per chassis, maximum. Probably less, because virtualization is memory intensive, and that uses less of the power hungry CPU.


That easily fits in the 3+3 configuration of the power supplies, and so the total draw across the entire chassis is only about 25 amps, and we'll have at least two 30 amp circuits feeding it. In one install I am looking at six different 30 amp circuits feeding three M1000e chassis: The ultimate in redundant power: Six different feeds to six different power supplies. Well... 18 different power supplies, but on a per-chassis basis it's only 6.
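
The arithmetic behind that wiring can be sketched in a few lines of Python. The figures (2700 watt supplies, 208V, roughly 5,300 watts per chassis, 3+3 supplies) come from the text above. The function name and the simple watts-divided-by-volts model, which ignores power factor and breaker derating, are assumptions made for illustration only.

```python
def amps(watts: float, volts: float = 208.0) -> float:
    """Current at a given wattage using the simple I = P / V model (assumption)."""
    return watts / volts


SUPPLY_WATTS = 2700     # nameplate rating of one M1000e power supply
CHASSIS_WATTS = 5300    # estimated maximum for 16 M620 blades plus switches
ACTIVE_SUPPLIES = 3     # the load-carrying half of a 3+3 configuration

per_plug_worst_case = amps(SUPPLY_WATTS)      # one supply running flat out
chassis_total = amps(CHASSIS_WATTS)           # whole chassis at estimated max
fits_in_3_plus_3 = CHASSIS_WATTS <= ACTIVE_SUPPLIES * SUPPLY_WATTS

print(f"Worst case per plug: {per_plug_worst_case:.1f} A")
print(f"Whole chassis:       {chassis_total:.1f} A")
print(f"Fits in 3+3:         {fits_in_3_plus_3}")
```

Changing `ACTIVE_SUPPLIES` to 5 models the 5+1 end of the range for a heavier blade mix.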


For that we get 256 cores and 4,096 GB of RAM. Per chassis. Three of them in a 42U rack, and we are still only at about 16,000 watts.


Sure, that's about 800 watts a square foot (20 SF cell size), but still: That is a lot of virtualization capacity in a very small space.
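
For anyone who wants to rerun the density numbers with their own blade counts, here is a rough sketch. All the inputs are the figures quoted above (16 blades, 2 sockets of 8 cores, 256 GB per blade, about 5,300 watts per chassis, three chassis per rack, a 20 square foot cell). The variable names are mine, not anything from Dell.

```python
BLADES_PER_CHASSIS = 16
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 8
RAM_GB_PER_BLADE = 256
CHASSIS_WATTS = 5300        # estimated maximum per chassis, from the text
CHASSIS_PER_RACK = 3
CELL_SQ_FT = 20             # rack plus its access paths

cores_per_chassis = BLADES_PER_CHASSIS * SOCKETS_PER_BLADE * CORES_PER_SOCKET
ram_gb_per_chassis = BLADES_PER_CHASSIS * RAM_GB_PER_BLADE
rack_watts = CHASSIS_PER_RACK * CHASSIS_WATTS
watts_per_sq_ft = rack_watts / CELL_SQ_FT

print(f"Cores per chassis:  {cores_per_chassis}")
print(f"RAM per chassis:    {ram_gb_per_chassis} GB")
print(f"Watts per rack:     {rack_watts}")
print(f"Watts per sq. foot: {watts_per_sq_ft:.0f}")
```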


Since they are all boot from SAN, any given blade can easily be replaced. It's just a compute node in the array of M620 servers.


Mix and Match / Real and Virtual


Looking at the picture above you can see we have things other than the M620 in those chassis. The nice thing about a blade chassis is that, even if you are not going to use all of it for virtualization, you can still replace a standard rack mount server with a blade server.


We have reasons at BMC to need real hardware. We are a virtual-first shop, but not to the point of not being able to develop some of our products that need real hardware to test against. Performance. Scalability. Development against real iron. Whatever the reason might be, we can slide X86 hardware designed to meet a need into the chassis, and it instantly gets access to the high speed SAN and network of the chassis. It still takes less space. It still uses less power.


These same things are true for all the UNIX blades too of course: This was just the first time in this series I had a picture with an obvious difference to talk about it.


The other thing that is true is that if a blade was bought for one thing, and the reason goes away, it is easy to re-purpose it virtually, inside the chassis. All the remote support things are there that make reconfiguring / repurposing a matter of minutes. Between being a blade, and having BladeLogic... it's a snap.


High Speed Network


Mentioned before, but worth repeating, is that the integrated switches give us the ability to have high speed and redundant switches for both Fiber Channel, at 8 Gb, and Ethernet, at 10 Gb. Like all the other blades I have talked about in this series, when we are ready for 40 Gb Ethernet, or 16 Gb Fiber Channel, I only have to change the switches and possibly the mezzanine cards.


The Sun blade is a slight exception there: Its NEM is already capable of 40 Gb Ethernet, but there is no shared FC switch, so I have to replace lots of Express cards on each blade to take it to the next speed. 8 Gb is the fastest card at the time of this writing (just looked to be sure...), but IBM, HP, and Dell are all 16 Gb FC capable already.


Blade Summary


Before I start in on storage, that last paragraph is a good place to stop, because it is probably worth a post to go over the entire blade field we have. As noted there, the Sun blades differ from all the others we have in how they do some things, and of course many of the blades have very similar features that are nonetheless worth talking about. But that is a full post, and it's for next time.


Today's post is about another of the UNIX blades, and it is our answer to how to get the same 10-to-1 space and power reductions that we have already seen on the Sun/Oracle and IBM Blades. Technically we'll run VMS as well as HP-UX on the new Blade.


We just got the HP Blade, so everything I discuss here is slightly theoretical. I have no reason to suspect that we will not achieve exactly the same things we have with the other blade solutions, but I do not want to present this as a fait accompli. It's a work in progress, but I wanted to post about it so it was clear we did not have a hole in the design / strategy.


800 Pound Gorilla in the Room


I would be less than honest if I did not discuss the Itanium chipset right up front. One reason we just got this blade was that we were waiting for the Poulson/9500-series chipset. Itanium does not get a lot of updates (there were nearly three years between the Poulson/9500 and the Tukwila/9300), and so it lags the commodity offerings from AMD and Intel in terms of speed. Its RAS advantages are basically gone these days as well: things like memory controllers that can take bad DIMMs offline are now things AMD64 chipsets can do.


Intel is going to merge Itanium and Xeon according to ExtremeTech, and that's a logical move that should help HP keep Non-Stop, HP-UX and VMS on updated platforms, though I personally wish they would just release AMD64 versions of all and be done with it. The linked article above says that there are still niches for the IA64 instruction architecture around mathematical precision, so perhaps the merged architecture will bring these to the broader AMD64 (x86-64) world.


In previous posts about the blades, I discussed my rule of thumb for sizing a blade (it would work for a rack mount too): This is a point-in-time rule of thumb, and I am always questioning it, and looking at our BCO data to validate it. Basically, for a Xeon Nehalem / Sandy Bridge class processor (8 core per socket), we put two sockets (16 cores) with 256 GB of RAM for our virtual environments (Virtual is the vast majority of our R&D environment).


That ROT has worked for the IBM and Sun environments very well. Still, I was nervous about applying it to Itanium. We knew Poulson was bringing Itanium back up closer to where the AMD64 chips were at, and also more in line with Sparc and Power, but how close? This would be our first one.


We took a slight risk (or is that RISC?) and configured the BL860c i4 with two sockets / 8 cores each, and 256 GB of RAM each. The beauty of a blade is that if this does not perform at near parity with the other environments, we can add another blade configured differently to compensate. We'll already have the blade chassis!


Beauty Shots






Two BL860c i4 blades: Plenty of room to grow / adapt.





What we like in a blade: redundant switches!


What We Are After


What we are after is more or less the same as what we were after in the Sun / Oracle and IBM environments: Lots of very old desktop and rackmount HP servers, running old releases of HP-UX. VMS is a possibility too, though that is a secondary goal to the HP-UX footprint right now. Here is a small sample of the model table (it's much larger than this):


HP Model                      Nameplate watts

9000 712/100 Workstation      110
9000 712/60 Workstation       110
9000 712/80 Workstation       110
9000 A400                     1765
9000 A500                     900
9000 B1000                    500
9000 D380                     930
9000 K200 Server              1225
9000 K200 Server              515
9000 K210 Server              1225
9000 K210 Server              515
9000 L2000                    2015
9000 L2000 36                 2015
9000 rp2450 Server            880
9000 rp2470 Server            880
9000 rp3410 Server            536
9000 rp3440                   1350
9000 rp3440 Server            600
9000 rp5450                   930
9000 rp5470                   1200
9000 rp7400 Server            3000
9000 rp7410 server            3000
9000 rp8400 Server            1336
9000 rp8400 Server            1936
9000 rp8400 Server            2436
9000 rp8400 Server            2986
9000 rp8420 Server            1111
9000 rp8420 Server            2141
9000 rp8420 Server            2812
9000 rp8420 Server            3489
9000 Server rp7420            2015


That is a small sample. You can see from that table the variance in the nameplate wattage between the models, and even inside some of the models. I also chose that part of the list because it is all PA-RISC based. We have lots of Itanium HP gear too, but the real challenge will be the PA-RISC based stuff.


That was another attraction for the new technology. HP has introduced a Container that can run PA-RISC workloads on Itanium based servers. At the core of that is something called "Aries".


This "feels" a bit like using Zones to run Solaris 8 and 9 workloads, or WPARS to run AIX 5.2 and 5.3 workloads, except it's more like Apple's Rosetta, which allowed Apple's Intel chipped computers to run binaries created for Apple's Power chipped computers. I used Rosetta quite a bit and it worked extremely well. I hope that Aries is the same or even better.


We now have to learn how to bring images from old HP computers over to the new blade server. For Itanium, it will be either picking up an IVM and setting it down in the new place, or a P2V into an IVM. For PA-RISC workloads it will hopefully be P2V from PA-RISC to an Aries container.


In theory then, even with binary translation of PA-RISC to Itanium IA64 going on, the Poulson class hardware should run things better and faster than some of the ancient gear it is coming from.


That is our bet, and based off Zones and WPARS and the success there, we have a great deal of optimism about it.




Look at the back of the c7000 and across the bottom you will see the six power cords-to-be. These connect to six 2400 watt power supplies. At 208V, these are max 2450 watts each (peak 2692) and 91% efficient.


The BL860c i4 blade is nominally 500 watts, and can peak at 595. We are booting from SAN, so we'll never quite reach that peak. No internal disks to spin up. There are only two blades (1000 watts) and a max of 8 (4000 watts), so even with the 10 Gb Ethernet and the 8 Gb Fiber Channel switches, running 3+3 on the power supplies, this blade chassis never uses more than 7350 watts, and most of the time much less than that. I only have to replace 6 servers like rp8400 to get back the power that 2 blades will use. We plan to replace sixty-nine, of all types, in the first wave.
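
Here is a small sketch of that power ceiling. The inputs are the numbers quoted above (2450 watts max per supply at 208V, three load-carrying supplies in 3+3, 500 watts nominal and 595 peak per blade). The helper name is hypothetical, and the model ignores the switches and fans, which the text does not put a number on.

```python
SUPPLY_MAX_WATTS = 2450      # per supply at 208 V, from the text
ACTIVE_SUPPLIES = 3          # load-carrying half of a 3+3 configuration
BLADE_NOMINAL_WATTS = 500
BLADE_PEAK_WATTS = 595


def blade_load(blades: int, watts_each: int = BLADE_NOMINAL_WATTS) -> int:
    """Total blade wattage, ignoring switches and fans (assumption)."""
    return blades * watts_each


chassis_ceiling = ACTIVE_SUPPLIES * SUPPLY_MAX_WATTS

for blades in (2, 8):
    nominal = blade_load(blades)
    peak = blade_load(blades, BLADE_PEAK_WATTS)
    print(f"{blades} blades: {nominal} W nominal, {peak} W peak, "
          f"{chassis_ceiling - peak} W left under the 3+3 ceiling")
```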


If you have read the other posts, you should now be feeling a deja-vu: Isn't that starting to look like a 10-to-1 reduction in power?


EIA-U Who?


How much space do sixty-nine HP servers, a wide assortment from the last 15 years, take? If they were all 2 U, then 130+ U. All being replaced in 10 U, with 6 empty slots. Technically 12, since there are half slot blades, but nothing we would buy is half height. The Itanium blades we are interested in are all full height.


So, 10-to-1 rack height reduction, and if these were stuffed into racks at 30 U each, then four racks become one 1/3 full rack. Three of these blade chassis in one rack, fully populated, would give you a theoretical maximum wattage of 21,000. With a rack cell size (the space a rack plus its physical access paths takes) of 20 square feet, that's 1000 watts a square foot! Our DC is nowhere near that, but then, we are nowhere near needing to retire the theoretical 720 physical HPs that would take, either.
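
The space math is easy to replay. This is only a sketch: the 2 U per server figure is the same simplifying assumption made above, and the other inputs (69 servers, a 10 U chassis, 30 U used per old rack, 21,000 watts, a 20 square foot cell) are the numbers from the text.

```python
SERVERS = 69
U_PER_SERVER = 2            # simplifying assumption: every old server is 2 U
CHASSIS_U = 10              # one blade chassis
OLD_RACK_FILL_U = 30        # how full the old racks were packed
RACK_MAX_WATTS = 21_000     # theoretical maximum quoted for three chassis
CELL_SQ_FT = 20

old_u = SERVERS * U_PER_SERVER
height_reduction = old_u / CHASSIS_U
fraction_of_rack = CHASSIS_U / OLD_RACK_FILL_U
watts_per_sq_ft = RACK_MAX_WATTS / CELL_SQ_FT

print(f"Old footprint:    {old_u} U")
print(f"Height reduction: {height_reduction:.1f} to 1")
print(f"New footprint:    {fraction_of_rack:.2f} of a 30 U rack")
print(f"Theoretical max:  {watts_per_sq_ft:.0f} W per square foot")
```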


P2V Density


I am after sixty-nine servers. I have two blades. Is that realistic? Experience with virtual says that memory is the bottleneck 80 percent of the time in our shop. This configuration gives, on average, 7 GB of RAM per physical server being retired. Is that enough?


Oh yes. Look at how old the stuff going away is. Some of those servers are 256 MB, and many are 1 or 2 GB. Looking over at the IBM and Sun, we are seeing about 50-to-1-blade types of consolidation rates. Part of that is that Zones and WPARS are extremely efficient at sharing resources. Our original planning was 20-to-1, and we have gone way past that.


All the HP has to be is about 3/5's as efficient as either of those two platforms, and we are good to go.


If not: We have empty blade slots.
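
To put numbers on that bet, here is a quick sketch using only figures from this post: two blades with 256 GB each, sixty-nine servers to retire, the original 20-to-1 planning number, and the roughly 50-to-1 seen on the IBM and Sun blades. The variable names are mine.

```python
BLADES = 2
RAM_GB_PER_BLADE = 256
SERVERS_TO_RETIRE = 69

ORIGINAL_PLAN_PER_BLADE = 20    # the planning ratio we started with
OBSERVED_PER_BLADE = 50         # roughly what the IBM and Sun blades deliver

ram_per_server_gb = BLADES * RAM_GB_PER_BLADE / SERVERS_TO_RETIRE
servers_per_blade = SERVERS_TO_RETIRE / BLADES

print(f"RAM per retired server: {ram_per_server_gb:.1f} GB")
print(f"Servers per blade:      {servers_per_blade:.1f}")
print(f"Above original plan:    {servers_per_blade > ORIGINAL_PLAN_PER_BLADE}")
print(f"Below observed ratio:   {servers_per_blade < OBSERVED_PER_BLADE}")
```

The required ratio lands between the original plan and what the other platforms already deliver, which is why the empty slots are the fallback rather than the plan.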



Next time: AMD64 Blades


When IBM announced the new Pure line of systems, our attention went straight to the compute nodes. Even from a distance in the announcement glossies, it was clear that was a blade system, and it was not the one that had been around for the last decade. The original announcements had no real details about that tantalizing new blade chassis. Pure was being sold as a whole ecosystem.


But it was that enigmatic blade chassis that caught the eye. It looked like a brand new blade design, and it was.


I was looking at the Bladecenter H for our shop as part of the Go Big to Get Small project, and frankly, I was worried about it. It was at least a ten year old design, and had been designed to probably last about 10 years before the new design came out. The compute nodes were managed differently than the networking switches, and the word was that the power envelope was about done: That no new, significantly higher density blades were really going to be possible in that footprint.


This concern was more or less verified when Power 7+ came out and no Power 7+ blades were going to be issued for the Bladecenter. It was indeed done after its 10 years of life. You know, in computers, a design that lasts 10 years is actually pretty impressive...


I pondered the advisability of perhaps just going 740 or 750 rack mount rather than Bladecenter. We had a great deal of recent success retiring five racks worth of Power 4 systems into two 740's with Power 6 chips. It went against the general idea of blade-ifying all the architectures, but in truth, if the chip architecture did not have a viable blade design, there was no point in being pedantic about it. The real goals were footprint reduction. Power reduction. CO2 reduction. If it took a rack mount design to do it because the blade was not there, then ... so what?


It was a close thing. When the Pure showed up I was days away from ordering hardware.


To get the PureFlex carved out from the rest of the Pure system required IBM doing a special config. From their point of view, I was undoing all the goodness of Pure. From mine, I was getting the tasty bit, and keeping the rest of my infrastructure as *my* infrastructure. Nothing against the storage or the switches that Pure contained: They just were not what we use.


Power 7 and Power 7+


That first year the two Power chipped blades were the p260 and the p460. Both had Power 7 chips. As of this writing, you can get Power 7+ versions. A 260 was one slot. A 460 was 2. Shades of the Thin and Wide nodes of the IBM SP2!


The p260 had half the CPU sockets, half the I/O, and half the DIMM slots of a p460. The one and only thing I can now see that could be done with 1 p460 that you could not do with 2 p260s is run a dual VIO server. The I/O daughter cards on the p260 only supported one VIO instance.




I mentioned in the post about the Sun blade my blade sizing rule of thumb: 128 GB per CPU socket, all things being approximately equal. A few years from now that will be a horribly dated rule, but it has worked for the last year or so. I thought through that with this new blade design for a while before deciding it was probably still good. I have not re-examined the assumptions in light of the Power 7+, but for the Power 7, the memory bus speeds were far slower than the Intel blades in the same chassis. PureFlex Xeon blades ran at 1600 MHz, and the Power 7 at 1066 MHz.
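
As a trivial illustration of that rule of thumb, here it is as a function. The 128 GB per socket figure is the rule stated above, and the socket counts (two on a p260, twice that on a p460) are as described in this post. The function name is made up for the example, and like the rule itself this is a point-in-time sketch, not a sizing tool.

```python
GB_PER_SOCKET = 128   # point-in-time rule of thumb from the text


def rot_ram_gb(sockets: int) -> int:
    """RAM to configure for a blade with the given number of CPU sockets."""
    return sockets * GB_PER_SOCKET


# Socket counts as described in this post: a p460 has double a p260.
for name, sockets in {"p260": 2, "p460": 4}.items():
    print(f"{name}: {sockets} sockets -> {rot_ram_gb(sockets)} GB")
```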


After much internal discussion about how much an R&D workload needs a dual VIO server, it was decided that (going forward) a new chassis would start with a p460, and that builds and other critical things (NIS, NIM, etc.) would run on that blade. The rest of the blades would be either p260 or p460 depending on criticality of workload.


The very first one was bought with two p260s because at the time we did not understand the thing about dual VIO servers yet.




And there, in that picture is the story so far: The two P740's that retired the five racks, and their close personal friend, the PureFlex.


Note in the picture there are three blades: There is one Intel blade, running a closed Linux OS on X86, called the FSM. The FSM is the new fancy Pure system manager, and is based on a hugely enhanced Systems Director. You don't manage these Power 7 blades with an HMC, unfortunately. FSM may be *better* but if you are already an HMC shop, this is a new thing you have to learn.


I have occasionally wondered about the withdrawal of the SDMC, and the move back to HMC, and how it might relate to the FSM. It seems like perhaps the SDMC overlapped the FSM, and so it was decided to not have three different ways of managing this stuff. Who knows? Not me. And if none of that paragraph's acronymic pondering made sense, then maybe it really doesn't matter. The way it is now is that the PureFlex is managed by the FSM, and the rack mounted world by the HMC.




The chassis of the PureFlex is 10U, and has 14 slots. The FSM can currently manage 3 chassis, so you only lose the slot in the first one. The second and third chassis have all 14 slots available, and soon the FSM will probably handle more than three chassis. Architecturally it seems a conservative decision by IBM to make sure everything scales linearly.


As it relates to our idea about using p460's, the first chassis would need to have at least one p260, and six p460s to be full. You could not put a 7th p460 in the first chassis, but you can in chassis two and three.


Each p260 has 2 sockets, therefore 256 GB of RAM. Our p460 equipped chassis have double each of those. Per blade. But the blades are twice as big, so the density is the same. In math terms 2x(p260) = 1x(p460).


Except for dual VIO.
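
The slot arithmetic above can be sketched like this. The 14 slots, the single slot taken by the FSM in the first chassis, and the one-slot p260 versus two-slot p460 are all from the text. The function and the "p460 first, p260 in whatever is left" packing policy are assumptions made to mirror our approach, not anything IBM prescribes.

```python
CHASSIS_SLOTS = 14
SLOTS = {"p260": 1, "p460": 2}   # single-wide and double-wide compute nodes
FSM_SLOTS = 1                    # the FSM node sits in the first chassis only


def fill_with_p460(has_fsm: bool) -> dict:
    """Pack as many p460s as fit, then use any leftover slot for a p260."""
    free = CHASSIS_SLOTS - (FSM_SLOTS if has_fsm else 0)
    p460s, leftover = divmod(free, SLOTS["p460"])
    return {"p460": p460s, "p260": leftover // SLOTS["p260"]}


first_chassis = fill_with_p460(has_fsm=True)
later_chassis = fill_with_p460(has_fsm=False)

print("First chassis:         ", first_chassis)
print("Second / third chassis:", later_chassis)
```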


Around on the back side we have six 2500 watt (nameplate) power supplies, the Chassis Management Module (CMM) or modules (there can be two), and four slots for things like network switches and fiber channel switches. We wanted full redundancy, so we filled the slots from the start with two 10 Gb Ethernet switches, and two 8 Gb Fiber Channel switches (16 Gb is available, so already pretty future-ready).


The "North - South" communications of the chassis are done via the network, so having 10 Gb is a big plus for a multi-chassis setup.


All the details are in this Redbook linked here.


When you compare the PureFlex to the Sun, you can see the IBM design is newer by virtue of more sockets per blade, and having FC switches shared amongst the blades rather than an FC card per blade like the Sun. The T5-1B blades did not change any of that (even if they upped the single socket speeds enough that perhaps that does not matter as much.)


And, HP and Dell blades have shared switching like that too... but those are other posts.


Density of Compute


At the end of the day, what we are after is the ability to retire physical systems into virtual images. It's an R&D workload, so putting fifty LPARs onto a 256 GB p260 is possible. Some of the systems being replaced don't even have a GB of RAM. Going virtual, onto this blade, are things like 7026-H70's with 750 watt power supplies. 7043-260's with 640 watts nameplate. 7046-B50's with 140.


At the end of the day, it's another 10 to 1 win. 10 times less power. 10 times less space. 10 times less CO2. And everything has moved from Power 3, 4, 5, or 6 to Power 7. It's Virtual. It's faster.




I mentioned before in the Sun blade post that backwards compatibility is critical. We are not just a heterogeneous shop, we have workloads running in R&D on all sorts of releases inside any given vendor. To run AIX on the PureFlex you have to be on AIX 5.2 or higher, and there is a pile of caveats. AIX 5.2 and 5.3 only run in WPARS (think Sun Zones) and if you are P2V'ing a physical system to the WPAR, there are required minimum patch levels that may be a problem if you are trying to stay as backward compatible as possible. Hopefully no production shop is backlevel, even if still on 5.2.




AIX 5.1 and before? No choice there but to stay on physical hardware. Still: 5.2 came out in 2002, and was EOL in 2009. And it can run in a WPAR on the PureFlex, so that is pretty good backwards compatibility. For Go Big to Get Small, it is a chance to try and get everything at least up to AIX 5.2.


But Wait, There's More!


If you want to run i/Series on the PureFlex, you can. Not as far back as AIX, but i 6.1 and 7.1 are there.


For the record, i as the name of the series and OS? Really? Because everyone I know still calls it AS/400 or i/Series. Not just "i" or the very slightly better "IBM i".


But whatever you want to call it, it runs on the Power 7 blades. We are setting that up now, and will be retiring a 520 or two when we do.


Next time: Another Blade!
