The team is new to BladeLogic Server Automation.
I'd like some advice on keeping 10,000+ agents active/available 95% of the time.
The Best Practice webinar series on the BMC community by Sean Berry makes the following points on maintenance.
Agent Health Survey:
Managed servers go up and down regularly
Run the “Update Server Properties” Job periodically, and before a critical job
This updates the AGENT_STATUS property:
– “Agent is Alive” for hosts that are up, vs.
– “Agent is Unavailable” for hosts that are down.
Use AGENT_STATUS in Server Smart Groups to include only available hosts in Jobs
Can’t deploy to a host that’s not up
Re-run Update Server Properties Job more often against a server group that only includes “down” servers
Use a Server Smart Group to identify hosts that have been out of contact > 2 days
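To make the grouping logic in these points concrete, here is a rough Python sketch of what the smart groups would be selecting. It is not BladeLogic code and does not call any BladeLogic API: the Server record, the last_contact field, and the sample hosts are all hypothetical, and only the two AGENT_STATUS values come from the webinar points. In the product itself, these filters would presumably be Server Smart Group conditions rather than a script.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The two AGENT_STATUS values quoted in the webinar points.
ALIVE = "Agent is Alive"
UNAVAILABLE = "Agent is Unavailable"


@dataclass
class Server:
    """Hypothetical stand-in for a managed server record."""
    name: str
    agent_status: str        # value of the AGENT_STATUS property
    last_contact: datetime   # assumed field: last time the agent responded


def available(servers):
    """Hosts a job can target: AGENT_STATUS is 'Agent is Alive'."""
    return [s for s in servers if s.agent_status == ALIVE]


def down(servers):
    """Hosts to re-check more often: anything not reported alive."""
    return [s for s in servers if s.agent_status != ALIVE]


def out_of_contact(servers, now, max_age=timedelta(days=2)):
    """Hosts that have been out of contact for longer than max_age."""
    return [s for s in servers if now - s.last_contact > max_age]


def availability_pct(servers):
    """Percentage of hosts currently reported alive."""
    if not servers:
        return 0.0
    return 100.0 * len(available(servers)) / len(servers)


# Made-up sample data, for illustration only.
now = datetime(2024, 1, 10, 12, 0)
fleet = [
    Server("web01", ALIVE, now - timedelta(minutes=5)),
    Server("app01", UNAVAILABLE, now - timedelta(hours=6)),
    Server("db01", UNAVAILABLE, now - timedelta(days=3)),
]

print("deploy targets:", [s.name for s in available(fleet)])
print("re-check often:", [s.name for s in down(fleet)])
print("out of contact > 2 days:", [s.name for s in out_of_contact(fleet, now)])
print("availability below 95%:", availability_pct(fleet) < 95.0)
```

The split into three groups mirrors the points above: jobs target only the available group, the down group gets the more frequent Update Server Properties runs, and the out-of-contact group is the list someone would need to investigate by hand.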
For example, we could arrange for the following
How has the community managed agent health checks?