    Agent Healthcheck

      The team is new to BladeLogic Server Automation.


      I'd like some advice on keeping 10,000+ agents active/available 95% of the time.


      The Best Practice webinar series on the BMC community by Sean Berry makes the following points on maintenance.


      Agent Health Survey:

      Managed servers go up and down regularly

      Run the “Update Server Properties” Job periodically, and before a critical job

      Updates the AGENT_STATUS property:

      – “Agent is Alive” for hosts that are up, vs.

      – “Agent is Unavailable” for hosts that are down.


      Use AGENT_STATUS in Server Smart Groups to include only available hosts in Jobs

      Can’t deploy to a host that’s not up




      Re-run Update Server Properties Job more often against a server group that only includes “down” servers

      Use a Server Smart Group to identify hosts that have been out of contact > 2 days
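
      To make the grouping in these points concrete, here is a minimal Python sketch of the selection logic only. It does not call any BladeLogic API: the list of server records is a hypothetical export, `LAST_CONTACT` is an assumed field name, and `partition_servers` is an illustrative helper. Only the AGENT_STATUS values and the 2-day threshold come from the slides.

```python
from datetime import datetime, timedelta

# AGENT_STATUS values as quoted in the webinar slides.
ALIVE = "Agent is Alive"
UNAVAILABLE = "Agent is Unavailable"


def partition_servers(servers, now, stale_after=timedelta(days=2)):
    """Split hypothetical server records into three name lists.

    available: hosts whose AGENT_STATUS is "Agent is Alive"
    down:      every other host (candidates for more frequent re-checks)
    stale:     hosts whose assumed LAST_CONTACT is older than stale_after
    """
    available = [s["name"] for s in servers if s["AGENT_STATUS"] == ALIVE]
    down = [s["name"] for s in servers if s["AGENT_STATUS"] != ALIVE]
    stale = [s["name"] for s in servers if now - s["LAST_CONTACT"] > stale_after]
    return available, down, stale


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    fleet = [
        {"name": "web01", "AGENT_STATUS": ALIVE, "LAST_CONTACT": now - timedelta(hours=1)},
        {"name": "db01", "AGENT_STATUS": UNAVAILABLE, "LAST_CONTACT": now - timedelta(days=3)},
        {"name": "app01", "AGENT_STATUS": UNAVAILABLE, "LAST_CONTACT": now - timedelta(hours=6)},
    ]
    available, down, stale = partition_servers(fleet, now)
    print("available:", available)
    print("down:", down)
    print("out of contact > 2 days:", stale)
```

      In the product itself the same three sets would presumably be expressed as Server Smart Group conditions rather than code; the sketch is only meant to show which hosts each group should capture.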

      For example, we could arrange for the following:

      1. Set up a scheduled job that runs across the fleet to review "agent health" status and recovery options
      2. Set up a smart group, run a "verify agent" job, identify agents with issues, and diagnose them. A potential next step would be to restart the agent if needed
      3. Otherwise, troubleshoot manually
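
      As a rough illustration of the three steps above and the 95% target, the following Python sketch works on a plain list of status strings. The function names, the `restart_attempted` flag, and the action labels are all hypothetical and are not part of BladeLogic.

```python
ALIVE = "Agent is Alive"


def meets_target(statuses, target_percent=95):
    """Return True if at least target_percent of agents report ALIVE.

    Uses integer arithmetic to avoid floating-point edge cases at the
    boundary (for example exactly 9,500 of 10,000 agents alive).
    """
    if not statuses:
        return False
    alive = sum(1 for s in statuses if s == ALIVE)
    return alive * 100 >= len(statuses) * target_percent


def next_action(status, restart_attempted):
    """Pick the next step for one agent, following the numbered list."""
    if status == ALIVE:
        return "none"
    if not restart_attempted:
        return "restart agent"
    return "troubleshoot manually"


if __name__ == "__main__":
    fleet = [ALIVE] * 9500 + ["Agent is Unavailable"] * 500
    print("target met:", meets_target(fleet))
    print(next_action("Agent is Unavailable", restart_attempted=False))
    print(next_action("Agent is Unavailable", restart_attempted=True))
```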

      How has the community managed agent health checks?