One of the core competencies of TrueSight Server Automation (TSSA) is automating the patch update process for servers. TSSA makes this easy for the typical monthly cycle of patching servers in large groups, where the servers in a group are patched in no particular order. Many applications, however, are hosted on multiple servers, often in an HA (High Availability) and/or failover cluster to limit service outages. Automation to patch these types of services needs to update the nodes in a specific order, usually one at a time, to prevent service outages during the patch update process. We call this Service Aware Patching.
Microsoft Exchange is one of these services, and I want to share how we have implemented Service Aware Patching entirely in TrueSight Server Automation here at Customer Zero (i.e., BMC-IT).
When automating any cluster service, we need to be able to idle the current target node by moving services off to the other nodes. Experience has taught us that changing the cluster state requires a careful, methodical approach to be successful.
Key "milestones" in the process:
- Verify the target cluster node is in the "normal" state BEFORE attempting to change the cluster state
- Move services off the target node to idle the node (i.e., move the node into Maintenance Mode)
- Verify the services successfully moved to another node and the target node is actually idle
- Use normal TSSA Patch Analysis and Deployment to update the patch level of the target node
- Verify the node and services are still in the idle state, as patch deployment probably restarted the node
- Move the services back onto the target node (i.e., move the node out of Maintenance Mode)
- Verify the cluster node and the cluster are in the "normal" state BEFORE moving on to the next node in the sequence
Notice that we use a careful, logical sequence to move the target node through the required phases; we call this the "chain" of steps. The important point is that if ANY step in the chain fails, the automation procedure needs to stop and NOT PROCEED to the next step. If something unexpected happens on the cluster, ignoring the failed step and continuing will most likely cause a service outage.
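To make the "stop at the first failure" rule concrete, here is a minimal Python sketch of a chain runner. It is not how TSSA implements batch jobs; the step names and the lambda stand-ins are hypothetical, and one verify step is made to fail on purpose so the halt is visible.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_chain(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (succeeded, names of the steps that were actually run).
    """
    executed = []
    for name, action in steps:
        executed.append(name)
        try:
            ok = action()
        except Exception:
            ok = False  # an unexpected error counts as a failed step
        if not ok:
            return False, executed  # halt: never proceed past a failed step
    return True, executed


# Hypothetical stand-ins for the real jobs; "verify idle" fails on purpose.
demo_steps = [
    ("verify normal", lambda: True),
    ("enter maintenance mode", lambda: True),
    ("verify idle", lambda: False),
    ("patch", lambda: True),
    ("verify still idle", lambda: True),
    ("exit maintenance mode", lambda: True),
    ("verify normal again", lambda: True),
]

succeeded, executed = run_chain(demo_steps)
print(succeeded, executed)
```

In this run the patch step is never reached, because the node was not confirmed idle.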
TSSA takes care of the actual patch updates. The question is: how do we implement the steps that handle the transitions of the cluster nodes?
Microsoft Exchange automation requires knowing how to verify the Exchange service and how to move an Exchange node in and out of "Maintenance Mode". Fortunately, Exchange experts have already written PowerShell scripts that do what we need for Service Aware Patching.
We used the PowerShell scripts available here:
- Office Exchange Server 2013 Maintenance Mode Script (Start)
- Office Exchange Server 2013 Maintenance Mode (Stop)
The first issue we encountered is common when automating existing scripts of any type. Scripts designed for interactive use rather than for automation typically do not set an exit code on failure. They rely instead on the admin running the script to read the errors written to STDOUT. This was true of these PowerShell scripts, so one of our Exchange admins copied the scripts we needed for the all-important "verify" steps in our Service Aware Patching chain and modified them to set a non-zero exit code if any key operation or check failed. Our TSSA deploy jobs automatically detect and report a failure when a script exits with a non-zero exit code.
Once we had the steps in the chain defined and the PowerShell scripts to manipulate the Exchange cluster, all the hard work was done! Now we just needed to create the TSSA jobs to execute the steps in the chain in the correct order!
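The exit code idea can be illustrated with a small Python sketch (our real scripts are PowerShell, so this is only an analogy). The embedded check script and its "database status" value are made up; the point is that the caller only looks at the exit code, not at the printed text.

```python
import subprocess
import sys

# Hypothetical check script: prints a problem and, crucially, exits non-zero.
CHECK_SCRIPT = """
import sys
database_status = "Dismounted"   # made-up value standing in for a real query
if database_status != "Mounted":
    print("database is " + database_status)
    sys.exit(1)                  # without this line the caller would see success
"""


def run_check(script: str) -> int:
    """Run a script in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.returncode


exit_code = run_check(CHECK_SCRIPT)
print("step failed" if exit_code != 0 else "step passed")
```

A script that only prints an error and then ends normally returns 0, which an automation tool would treat as success.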
Implementation in TrueSight Server Automation
We used TSSA batch jobs to execute the chain of steps in the prescribed order.
|Note: Important options to set in the TSSA Batch jobs;|
The main TSSA batch job runs our Service Aware Patching and controls the node-by-node sequence in the cluster:
|TSSA Batch Job: "Exchange 2013 - FULL Sequence Maintenance Mode Operation (PROD)"|
This main TSSA batch job is the controller job that calls the child job for each node in the prescribed order, and any failure will cause the batch job to stop and not continue to the next node.
|Note: If any step in the chain fails, manual intervention by the Exchange admins is required to determine how to resolve it without service impact.|
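The controller behaviour can be sketched in Python as follows. This is an illustration under assumptions, not TSSA's batch job logic: the node names are placeholders and `child_job` is a hypothetical callable standing in for the per-node child batch job.

```python
from typing import Callable, List, Optional, Tuple


def run_controller(
    nodes: List[str], child_job: Callable[[str], bool]
) -> Tuple[List[str], Optional[str], List[str]]:
    """Run the child job on each node in the given order.

    Returns (nodes completed, the node that failed or None, nodes never touched).
    """
    done = []
    for index, node in enumerate(nodes):
        if not child_job(node):
            # Stop here so the remaining nodes stay in service for the admins.
            return done, node, nodes[index + 1:]
        done.append(node)
    return done, None, []


# Placeholder node names; the child job for "node2" is made to fail.
nodes = ["node1", "node2", "node3"]
done, failed, untouched = run_controller(nodes, lambda node: node != "node2")
print(done, failed, untouched)
```

Reporting the untouched nodes alongside the failed one tells the admins exactly where the sequence stopped.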
Each child job is where our chain of steps is executed against one specific node in the cluster:
|TSSA Batch Job: "Exchange 2013 - Maintenance Mode Operation (Node #?)"|
Note: There is one child job per node.
Here's a screenshot showing a successful run of one of the child jobs;
Overview of the TSSA low-level jobs and the associated PowerShell scripts
|Deploy - Verify Exchange READY for Maintenance Mode|
|Description: Script we created to check mailbox database (Get-MailboxDatabase) status is OK|
|Deploy - Move Exchange to Maintenance Mode|
|Description: Script from VanHybrid site to move Exchange into "Maintenance Mode"|
|Deploy - Verify Exchange Maintenance Mode|
|Description: Script we created to check various Exchange components are in the Inactive state and the cluster node state is "Paused"|
|Deploy - Move Exchange to Normal Mode|
|Description: Script from VanHybrid site to move Exchange out of "Maintenance Mode"|
|Deploy - Verify Exchange Normal Mode|
|Description: Script we created to check various Exchange components are in Active state and the cluster node state is "Up"|
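The two "verify" scripts share the same shape: compare every component state and the cluster node state against the expected values, and fail if anything differs. A minimal Python sketch of that logic is below. Our actual scripts are PowerShell and query Exchange directly; here the sample states are invented and `verify_state` is a hypothetical helper.

```python
from typing import Dict, List


def verify_state(
    components: Dict[str, str],
    node_state: str,
    want_component: str,
    want_node: str,
) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = [
        f"{name} is {state}, expected {want_component}"
        for name, state in components.items()
        if state != want_component
    ]
    if node_state != want_node:
        problems.append(f"cluster node is {node_state}, expected {want_node}")
    return problems


# Invented sample data: one component has not gone Inactive yet.
sample = {"ServerWideOffline": "Inactive", "HubTransport": "Active"}
problems = verify_state(sample, "Paused", "Inactive", "Paused")

# A non-zero exit code is what lets the calling job detect the failure.
exit_code = 1 if problems else 0
print(exit_code, problems)
```

The same helper covers the Normal Mode check by passing "Active" and "Up" as the expected values.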
Our Service Aware Patching use case for the Exchange service uses a single TSSA batch job to run the entire sequence end to end: moving one node at a time into maintenance mode, updating its patch level using normal TSSA patching jobs, and then returning the node to service. All nodes are sequenced through, and the end result is the entire Exchange cluster patched with no service outage. All using automation.