BladeLogic 7.4.x is the most significant update to our software in years. Everyone knows about the changes to provisioning and patching, but there's a little-advertised feature which I've found very useful. Thought I'd post it in the public forums to get some discussion going.
This feature is significant for a lot of reasons. One of the most intriguing is that it could reduce the time to deploy software and files by a dramatic margin.
BladeLogic 7.4.x quietly added a feature that lets you specify a URL as the source of a software deployment. For instance, you can reference an NFS mount on a UNIX box, an SMB share on a Windows box, or an NSH path.
According to pg 340 of the User's Guide,
Agent mounts source for direct use at deployment (no local copy)—The Deploy Job instructs an agent to mount or map the device specified in the URL and deploy the software package directly to the agent. When you select this option, the agent uses the data transmission protocol specified in the URL to access the specified source files. The software package is not copied to a staging area on the agent, so no local copy of the source file is created.
To use this option, the Source Location field (see step c) must provide a URL that complies with BladeLogic’s requirements for network data transmission, including a data transmission protocol: either NFS or SMB. See URL Syntax for Network Data Transmission for detailed information about the required syntax.
To illustrate how significant this is, let's discuss a typical use for a BladeLogic software deployment package.
Let's say you have ten thousand servers running Sun Solaris 10, and you need to deploy Sun's latest patch cluster. That's 458.4 megabytes. Typically you would load the patch cluster into the BladeLogic depot, then deploy to the targets. Theoretically, a BladeLogic server with a gigabit connection could deploy 458 megabytes to a target in 7.33 seconds*. In the real world, this is rarely the case. Jobs of this size have been known to take three minutes, even on a fast network and server. With 10,000 targets, it would take the better part of a month to deploy the patch cluster to all targets. No one is going to spend three weeks deploying patches 24x7, so there has to be a better way.
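If you want to check my arithmetic, here's the back-of-the-envelope math (the sizes and the 3-minute observed time are from the scenario above; everything else is simple unit conversion):

```python
# Back-of-the-envelope numbers for the scenario above (illustrative only).
PATCH_MB = 458.4        # Solaris 10 patch cluster size, in megabytes
GIGABIT_MBPS = 1000     # gigabit link, in megabits per second

# Theoretical single-target transfer: the data crosses the wire twice
# (file server -> app server -> agent), so double the payload.
theoretical_seconds = PATCH_MB * 8 * 2 / GIGABIT_MBPS
print(f"theoretical transfer: {theoretical_seconds:.2f} s")       # -> 7.33 s

# Observed: ~3 minutes per target, 10,000 targets, one after another.
total_days = 3 * 10_000 / 60 / 24
print(f"serial deploy to 10,000 targets: {total_days:.1f} days")  # -> 20.8 days
```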
For ages we've had repeaters to speed up the deploy process. Repeaters are a great solution, and they're easy to implement. But a repeater is three times slower than the method I'm about to describe. The reason is that our file is copied three times when a repeater is used*.
Here's the fastest method to deploy files to numerous hosts with BladeLogic. Create a software deploy job, and for the installable source, specify an NFS mount. For example,
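A source location of this kind looks roughly like the following (the hostname and path are invented for the example; see "URL Syntax for Network Data Transmission" in the User's Guide for the exact required form):

```
nfs://patchserver01/export/patches/10_Recommended.zip
```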
Note I've selected the option "Agent mounts source for direct use at deployment (no local copy)". According to the 7.4.x docs, this sidesteps the staging area; that alone will double the throughput of a deploy job with a single target.
Reducing the time of our deploy from 20.8 days to 10.4 days is a huge improvement. But how can we improve it further? For that trick, we call upon the property dictionary.
Here's how to do it:
1. Using the property dictionary, create a server property called "PATCH_SERVER".
2. Parameterize your software deploy job to reference the property for the name of the server where the patch cluster will be copied from.
3. Copy the patch cluster to a server that can serve up the file via NFS.
4. Last but not least, pick 21 servers that need to be patched, and set their PATCH_SERVER property to point at the NFS server from step 3. (See pg 345 of the User's Guide for details.)
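The parameterization in step 2 is just per-target string expansion: the job's source-location URL contains a property reference, and each target substitutes its own value. A rough sketch of the idea in Python (the `??NAME??` notation mimics BladeLogic's property-reference syntax; the hostname and path are made up):

```python
# Illustrative sketch of how a parameterized source location resolves per
# target.  ??NAME?? mimics BladeLogic's property-reference notation;
# the server name and path are invented for the example.
def resolve_source(url_template: str, properties: dict) -> str:
    """Expand ??NAME?? references in a source-location URL."""
    for name, value in properties.items():
        url_template = url_template.replace(f"??{name}??", value)
    return url_template

template = "nfs://??PATCH_SERVER??/export/patches/10_Recommended.zip"

# Each wave of targets points PATCH_SERVER at a host patched in the
# previous wave, so the load fans out instead of hitting one depot.
print(resolve_source(template, {"PATCH_SERVER": "host0042"}))
# -> nfs://host0042/export/patches/10_Recommended.zip
```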
With 21 targets, it should be possible to deploy the file to all targets in 31.5 minutes. (This takes the original 3-minute deploy time, halves it because there's only one copy, not two, and multiplies by 21.)
At this point, you're thinking "who cares?" Deploying to 21 servers is no big deal. The trick is to have those 21 servers serve the file up to the next 21, and those to the next 21. Just like a pyramid scheme!
By using this scheme, deploy time could be cut from roughly 21 days to about 90 minutes (!!!)
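The fan-out arithmetic is easy to sketch. Assuming ~1.5 minutes per copy and 21 copies served serially from each source (the figures used above), the whole estate is covered in three waves:

```python
# Fan-out arithmetic for the "deployment pyramid" (illustrative).
FANOUT = 21                    # each patched server feeds 21 more
WAVE_MINUTES = 1.5 * FANOUT    # ~1.5 min per copy, served serially per source
TARGETS = 10_000

patched, waves = 0, 0
sources = 1                    # the initial NFS server from step 3
while patched < TARGETS:
    patched += sources * FANOUT   # every source serves a full fan-out
    sources = patched             # everything patched so far can serve next
    waves += 1

print(waves, waves * WAVE_MINUTES)
# -> 3 94.5   (three waves, about an hour and a half)
```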
By creating a "software deployment pyramid", we're leveraging the network and storage bandwidth of every server we deploy the patch to. If anyone else can tell me another way to reduce file deployment time by 99.7%, I'm all ears. This method is ideal for BladeLogic, because traditional patching methods don't give you access to all 10,000 servers at once. BladeLogic also scales better than traditional methods, because we support the use of multiple app servers. There's nothing to stop you from using this method with 20 app servers and ten thousand managed hosts. The pyramid scheme also reduces load on the BladeLogic depot, the storage it lives on, and the network interface of both the app server and depot.
I've intentionally ignored a number of factors to keep the discussion simple. I haven't factored in network bandwidth. If you do the math, you'll see that we don't need much bandwidth to copy a ~460-megabyte file in 90 seconds: roughly 41 megabits per second, about 4% of a gigabit link.
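The per-copy bandwidth figure is quick to sanity-check (assuming each ~458 MB transfer finishes in the 90 seconds used above):

```python
# Per-copy bandwidth needed if each transfer finishes in ~90 seconds
# (numbers from the scenario above; illustrative only).
PATCH_MB = 458.4
COPY_SECONDS = 90

mbps = PATCH_MB * 8 / COPY_SECONDS
print(f"{mbps:.1f} Mbit/s, {mbps / 1000:.1%} of a gigabit link")
# -> 40.7 Mbit/s, 4.1% of a gigabit link
```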
I've ignored the bottleneck of the deploy job itself, but that's easy to fix. Just use lots of app servers. BladeLogic performance improves dramatically with multiple app servers, and ANYONE with 10,000 managed servers should be using at least twenty of them.
There are other ways to deploy very large files, such as repeaters, but they're not as fast. Another drawback of repeaters is that they're all-or-nothing. The method I've outlined here can be used whenever you need to patch or deploy large files, without re-configuring your BladeLogic infrastructure to use repeaters.
The bottom line is that this method of file deployment is dramatically faster than existing methods, and offers a level of BladeLogic performance that was impossible prior to 7.4.x.
The best description I've seen of the traditional file deployment process was posted by Sean Daley in the internal forums. According to what he posted:
"In the past we've noticed that our performance can lag behind normal scp performance quite a bit (.. edited for brevity...). There are a couple of things to look at / note.
1) When deploying 1 GB worth of files you're actually copying 2 GB worth of files. The process of copying a file from one agent to another agent does not make the data go directly from agent A to agent B. The data first has to pass through the client initiating the copy before heading to agent B. So when you're deploying 1 GB of files, the appserver has to copy these files from the fileserver to the target agent. This means 1 GB of data passes from the fileserver to the appserver and then from the appserver to the agent (so effectively 2 GB of data). Even 2 GB of data with those numbers seems weird though.
2) Are you doing a direct or an indirect deploy (using a repeater) when you do this?
If you're doing an indirect deploy, are you using the maximum cache size feature on the repeater? If so don't. That feature performs horribly. (.. edited ..)
3) Going back to #1 again, because the data gets copied between two machines, you'll need to make sure that the network to both servers is not messed up. For example, if a switch port is misconfigured on the file server this could have catastrophic consequences for performance.