9 Replies Latest reply on Jul 30, 2009 11:50 AM by Jude Seth

    Reboot Job

      I am looking for some help or direction on getting BladeLogic to reboot a series of servers in a specific order, waiting for each system to come back up before restarting the next one.  Basically I am looking for something like this:

       

      -Reboot Domain controller 1

      -Once DC 1 is back up and functioning reboot DC 2

      -Reboot the primary DB server in a cluster

      -Once Primary DB server is up and functioning (service x, y & z are running) reboot secondary DB server

      -Reboot application server

      -Once application server is up and service X is running reboot web server

      -Once web server is up and port 443 is listening reboot web server 2

       

      This is a simple example, but if someone has some ideas on how to do this, I could use them to reboot about 40 different Windows servers that consist of multiple web, database, application and terminal services servers.  I am assuming a Network Shell script would be the best way to do this, but I don't know if I should use one massive script or multiple little scripts put together in a batch job.  I haven't used the Network Shell much, so some sample scripts would be very helpful.

        • 1. Re: Reboot Job
          Bill Robinson

          There's a couple ways to do this I think.  You could use the reboot script here:

          https://www.bladelogic.com/community/entry.jspa?externalID=674&categoryID=29

           

          Create a Batch Job that contains NSH Jobs that run the reboot script against each target server.  The Batch Job should execute sequentially and halt if one of the sub-jobs fails (the box doesn't reboot or doesn't come back).
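The sequential, halt-on-failure behaviour described above can be sketched in plain shell. This is only an illustration of the control flow, not how a Batch Job is implemented; `run_in_order`, `reboot_and_wait` and the server names are hypothetical.

```shell
#!/bin/sh
# run_in_order COMMAND SERVER [SERVER...]
# Runs "COMMAND SERVER" for each server in the order given and stops at the
# first failure, mirroring a sequential Batch Job that halts on error.
run_in_order() {
    step=$1
    shift
    for server in "$@"
    do
        echo "Starting $server"
        if ! "$step" "$server"
        then
            echo "Halting: step failed on $server" >&2
            return 1
        fi
    done
    return 0
}

# Hypothetical stand-in for the real work; replace it with the job logic
# (send the reboot, then wait for the agent to answer again).
reboot_and_wait() {
    echo "would reboot $1 and wait for it to come back"
}

# Hypothetical server names, listed in the order they must be rebooted.
run_in_order reboot_and_wait dc1 dc2 db1 db2 app1 web1 web2
```

Because the loop returns as soon as one step fails, later servers are never touched when an earlier one does not come back.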

           

          Or you could try something like this:

          https://www.bladelogic.com/community/entry!default.jspa?categoryID=29&externalID=1532

          (I wrote it a while ago but never posted it)

           

          This should let you define a 'cluster' and 'nodes' for the cluster (the cluster is really a group of servers that need to reboot in a particular order, not a *real* cluster) and will handle rebooting the servers in the proper order.

           

          The process check you want would have to be worked in, but we could modify the scripts a bit to handle that (after the reboot, wait up to x seconds for a process to start running before exiting w/ the ok).
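A minimal sketch of that "wait up to x seconds for something to become true" step, written as a generic polling helper. The name `wait_until` and its argument order are assumptions made for illustration; they are not part of the posted scripts.

```shell
#!/bin/sh
# wait_until MAX_SECONDS INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds or MAX_SECONDS
# have elapsed. Returns 0 on success, 1 on timeout.
# Note: plain sh has no local variables, so max/interval/waited are global.
wait_until() {
    max=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@" >/dev/null 2>&1
    do
        if [ "$waited" -ge "$max" ]
        then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A call such as `wait_until 300 10 some_probe dbserver1 || exit 1` would then hold the job until the probe passes, where `some_probe` is a hypothetical command that exits 0 once the service or process is running on the host.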

          • 2. Re: Reboot Job
            Thanks for the quick response.  However, I don't have access to that support forum.  I requested access back in April, but the answer we are getting from support is that this is the support forum we should be using.
            • 3. Re: Reboot Job
              Bill Robinson

              here's the cluster one:

               

              #!/bin/nsh
              #  BladeLogic Multi-Platform Reboot And Monitoring Script
              # 1.0 Bill Robinson - Initial Creation
              #
              # This script will reboot systems and wait for them to come back online.
              # Additionally, it will check for NODE and CLUSTERNAME properties
              # to sequentially reboot servers in a cluster.  These properties must be set
              # on the servers.  Clusterless servers must have a Node value of 0.
              
              # Maximum time to wait to have the server go down. Not that reliable as we are only
              # testing that the agent has gone down and not necessarily that the server has gone
              # down. Also defined is the interval time between checks to see if the server is down.
              #
              MAX_SHUTDOWN_TIME=300
              SHUTDOWN_INTERVAL=20
              
              #
              # Maximum amount of time we will wait to have the server comeback up once we have detected
              # that it has gone down.  Also defined is the interval time between checks to see if the
              # server is back up.
              #
              MAX_REBOOT_TIME=300
              REBOOT_INTERVAL=20
              
              #
              # Maximum times we will check to see if a process has stopped.
              # Also, time between checks of process in seconds.
              MAX_PROCESS_RETRY=200
              PROCESS_INTERVAL=10
              
              
              OS=`uname -s`
              HOSTNAME=${NSH_RUNCMD_HOST}
              # The NSH_RUNCMD_HOST envar returns the FQDN, which is what we want
              cd //@/
              APP_OS=`uname -D //${HOST}/`
              DEVNULL=/dev/null
              [ "${APP_OS}" = "WindowsNT" ] && DEVNULL=NUL
              
              DEBUG=0
              # Script name for the usage message (print_usage refers to $PROGRAM)
              PROGRAM=${0##*/}
              
              sub print_usage() {
                   echo "\n\n\n"
                   echo "Usage: $PROGRAM [-p <process> ]\n"
                   echo "  -p <process>     - process to wait for before rebooting - the reboot is not sent until this process has stopped"
                   echo "\n\n\n" 
                   exit 1
              }
              
              sub print_debug() {
                   [ ${DEBUG} -eq 1 ] && echo "DEBUG: $@"
              }
              
              sub print_info() {
                   echo "INFO: $@"
              }
              
              sub print_error() {
                   echo "ERROR: $@"
              }
              
              sub reboot_server() {
                   # Wait for the named process (if any) to stop before rebooting
                   check_process
                   if [ ${RETVAL} -eq 0 ]
                        then
                        case "${OS}" in
                             SunOS)
                                  nexec ${HOSTNAME} shutdown -i6 -y -g 0
                                  RETVAL=$?
                                  print_info "Rebooting ${HOSTNAME}"
                                  ;;

                             Linux)
                                  nexec ${HOSTNAME} shutdown -r now
                                  RETVAL=$?
                                  print_info "Rebooting ${HOSTNAME}"
                                  ;;

                             WindowsNT)
                                  nexec ${HOSTNAME} reboot
                                  RETVAL=$?
                                  print_info "Rebooting ${HOSTNAME}"
                                  ;;

                             *)
                                  print_error "Unknown platform \"${OS}\""
                                  exit 1
                                  ;;
                        esac
                        [ ${RETVAL} -ne 0 ] && print_error "Possible error in sending reboot request"
                   else
                        # No reboot was sent. Exit here: returning RETVAL=1 would send the caller
                        # into check_restart, which would find the agent still up and report success.
                        print_error "Timeout waiting for Process ${PROCESS} to stop on server: ${HOSTNAME}"
                        exit 1
                   fi
              }
              
              sub agent_status() {
                   cd //@/
                   print_debug "uname -D //${HOSTNAME}/"
                   uname -D //${HOSTNAME}/ > ${DEVNULL} 2>&1
                   RETVAL=$?
              }
              
              sub check_process() {
                   # Check for Running Process before rebooting
                   RETVAL=0
                   if [ -n "${PROCESS}" ]
                        then
                        RETRY=0
                        print_info "Checking for Process ${PROCESS} on server: ${HOSTNAME}"
                        
                        while [ `nps -H -h ${HOSTNAME} | grep -iw "${PROCESS}" | grep -v -e "grep" -e "runscript" -e "nsh" | wc -l` -ne 0 ] && [ ${RETRY} -le ${MAX_PROCESS_RETRY} ]
                             do
                             print_info "`date` ${PROCESS} still running server: ${HOSTNAME}"
                             sleep ${PROCESS_INTERVAL}
                             RETRY=`expr ${RETRY} + 1`
                        done
                        [ ${RETRY} -gt ${MAX_PROCESS_RETRY} ] && RETVAL=1
                   fi
              }
              
              sub reboot_nodes(){
                   for SERVER in `blcli Server listAllServers`
                        do
                        SERVER_CLUSTER=`blcli Server getFullyResolvedPropertyValue ${SERVER} CLUSTERNAME`
                        print_debug "Server ${SERVER} is in cluster ${SERVER_CLUSTER}"
                        if [ "${SERVER_CLUSTER}" = "${CLUSTER}" ] && ( [ "${CLUSTER}" != "NONE" ] || [ "${SERVER}" = "${HOSTNAME}" ] )
                             then
                             HOSTNAME=${SERVER}
                             print_info "Rebooting clustered server ${HOSTNAME}"
                             reboot_server
                             [ ${RETVAL} -eq 0 ] && check_reboot
                             [ ${RETVAL} -eq 1 ] && check_restart
                             [ ${RETVAL} -eq 1 ] && print_error "Failed to reboot ${HOSTNAME}" && exit 1
                             print_info "Rebooted clustered server ${HOSTNAME}"
                        fi
                   done
              }
              
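              sub check_reboot() {
                   print_info "Waiting for server ${HOSTNAME} to shut down..."
                   # Give the server a certain amount of time to kill the agent and reboot
                   count=${SHUTDOWN_INTERVAL}
                   sleep ${SHUTDOWN_INTERVAL}

                   agent_status
                   while [ ${RETVAL} -eq 0 ]
                   do
                        print_info "`date` Agent still running ..."
                        count=`expr ${count} + ${SHUTDOWN_INTERVAL}`
                        if [ ${count} -gt ${MAX_SHUTDOWN_TIME} ]
                             then
                             # The agent never went away, so the server did not reboot. Stop here:
                             # falling through to check_restart would find the agent still up
                             # and report a successful reboot.
                             print_error "Reboot command sent but server not coming down"
                             exit 1
                        fi
                        sleep ${SHUTDOWN_INTERVAL}
                        agent_status
                   done
                   # The agent is down. Callers test for RETVAL=1 before calling check_restart,
                   # so normalize whatever non-zero status uname returned.
                   RETVAL=1
              }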
              
              sub check_restart() {
                   # Now we know the agent is down and we are waiting for the system to reboot.
                   # Give it a bunch of time to come back up.
                   count=${REBOOT_INTERVAL}
                   sleep ${REBOOT_INTERVAL}

                   print_info "Waiting for server ${HOSTNAME} to come back up..."
                   agent_status
                   while [ ${RETVAL} -ne 0 ]
                   do
                        print_info "`date` Agent still not up ..."
                        count=`expr ${count} + ${REBOOT_INTERVAL}`

                        if [ ${count} -gt ${MAX_REBOOT_TIME} ]
                             then
                             print_error "Server has not come back up after more than $count seconds ..."
                             RETVAL=1
                             # Leave the loop; RETVAL=1 alone would keep the while condition true forever
                             break
                        else
                             sleep ${REBOOT_INTERVAL}
                             agent_status
                        fi
                   done
                   [ ${RETVAL} -eq 0 ] && print_info "Server ${HOSTNAME} back up and running"
              }
              
              # Test to see if the new argument is an argument or a new option
              sub is_new_arg()
              {
                   print_debug "Checking $1 to see if it's an argument or an option."
                   echo "$1" | egrep -q -e ^-p
                   RES=$?
              
                   # Ensure blank values are ignored
                   if [ -z "$1" ]
                        then
                        RES=0
                   fi
              
                   print_debug "Result for argument ${1}: ${RES}"
                   return ${RES}
              }
              
              sub parse_args()
              {
                   print_debug "Arguments: $@"
              
                   if [ $# -eq 1 ]
                   then
                        print_usage
                   fi
              
                   while [ $# -ge 2 ]
                   do
                       case "$1" in
                        -p)
                             print_debug "Getting Process to check."
                             shift
                             PROCESS=${1}
                             print_info "Checking for process ${PROCESS}"
                             # Consume the value too, so the next pass sees the next option
                             shift
                             ;;
                        "")
                             shift
                             ;;
                        *)     
                             print_error "Argument ${1} not recognized."
                             print_usage
                             ;;
                       esac
                   done
              
                   print_debug "Finished parsing arguments."
              }
              
              
              # Main Script
              # Check to see if we're running as a runscript
              if [ -z "${NSH_RUNCMD_HOST}" ]
                   then
                   print_error "You must run this script using the \"runscript\" option."
                   exit 1
              fi
              
              PROCESS=""
              # Parse the arguments
              parse_args $@
              
              # Start the reboot process
              NODE=`blcli Server getFullyResolvedPropertyValue ${HOSTNAME} NODE`
              
              if [ "${NODE}" = 0 ]
                   then
                   # No other nodes, proceed with reboot
                   reboot_server
                   [ ${RETVAL} -eq 0 ] && check_reboot
                   [ ${RETVAL} -eq 1 ] && check_restart
                   [ ${RETVAL} -eq 1 ] && print_error "Failed to reboot ${HOSTNAME}" && exit 1
              elif [ "${NODE}" = 1 ]
                   then
                   # This is the first node
                   CLUSTER=`blcli Server getFullyResolvedPropertyValue ${HOSTNAME} CLUSTERNAME`
                   reboot_nodes
              elif [ "${NODE}" -gt 1 ]
                   then
                   # This is a secondary node, reboot handled by primary
                   print_info "Reboot handled by Primary Node"
                   exit 0
              fi
              
              exit 0
              
              
              
              
              • 4. Re: Reboot Job
                Bill Robinson

                here's the normal reboot one:

                 

                #
                #  BladeLogic Multi-Platform Reboot And Monitoring Script
                #    (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
                #    (1.1) Updated by Craig Williams and Sean Berry at Northern Trust Bank
                #          to include the sleep statement inside the monitoring loop, timeouts set to 300s (5m),
                #          intervals set to 20s.  Some debugging statements added.
                #    (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for 
                #          WinNT
                #    (1.3) Added AIX Barry Lowrance.
                #          Kill last 3 background jobs added to clean up nexec on app server from non return of shutdown on AIX.
                #          I have not had time to test which background kill is really needed but %1 does not work, so I added
                #          %2 & %3 which should kill anything additional which I may have missed.
                 
                #Barry Lowrance: uncomment to add DEBUG tracing if needed
                # set -x
                 
                #Barry Lowrance modified MAX shutdown time to 10 minutes to give any users some extra warning.
                MAX_SHUTDOWN_TIME=600
                # Runs a check every 30 seconds until the system is down.
                SHUTDOWN_INTERVAL=30
                 
                 
                #Barry Lowrance modified MAX boot time to 30 minutes for those larger servers.
                MAX_REBOOT_TIME=1800
                # Runs a check every 30 seconds until the system is back up.
                REBOOT_INTERVAL=30
                
                
                if [ $# -ge 1 ]
                then
                    echo "Accepting boot Arguments for Solaris"
                    # Join all arguments into one string
                    BOOT_ARGS="$*"
                    echo "Boot Args: $BOOT_ARGS"
                fi
                
                
                OS=`uname -s`
                # csw1 6/12/2006
                # Removed because this was returning the --short-- hostname
                #HOSTNAME=`uname -n`
                HOSTNAME=$NSH_RUNCMD_HOST
                # The NSH_RUNCMD_HOST envar returns the FQDN, which is what we want
                 
                if [ "$OS" = "WindowsNT" ]
                then
                    DEVNULL=NUL
                else
                    DEVNULL=/dev/null
                fi
                 
                if test -z "$HOSTNAME"
                then
                    echo Usage $0 hostname
                    exit 1
                fi
                 
                pwd | egrep -q ^//
                 
                if [[ $? -ne 0 ]] 
                then
                            print "ERROR: You must run this script using the \"runscript\" option." 1>&2
                            exit 1
                fi
                 
                # Have to be local so the uname -D command works properly
                cd //@/
                 
                agent_up ()
                {
                #    uname -D //$1/ > $DEVNULL 2> $DEVNULL
                    echo uname -D //$1/
                    uname -D //$1/
                    return $?
                }
                 
                if agent_up $HOSTNAME
                then
                 
                    echo Rebooting server $HOSTNAME ...
                 
                    case "$OS" in
                        SunOS)
                            # Quote the test so an empty or multi-word BOOT_ARGS cannot break it
                            if [ -z "$BOOT_ARGS" ]
                            then
                                nexec $HOSTNAME shutdown -i6 -y -g 0 &
                            else
                                nexec $HOSTNAME reboot -- $BOOT_ARGS &
                            fi
                            ;;
                 
                        Linux)
                            nexec $HOSTNAME shutdown -r now &
                            ;;
                 
                        AIX)
                            nexec $HOSTNAME shutdown -r +5&
                            ;;
                 
                 
                        WindowsNT)
                            nexec $HOSTNAME reboot
                            ;;
                 
                        *)
                            echo "Unknown platform \"$OS\""
                            exit 1
                            ;;
                    esac
                 
                    if test $? -ne 0
                    then
                        echo '***** Warning - Possible error in sending reboot request'
                    fi
                 
                    #
                    # Give the server a certain amount of time to kill the
                    # agent and reboot
                    #
                    count=$SHUTDOWN_INTERVAL
                    sleep $SHUTDOWN_INTERVAL
                 
                    while agent_up $HOSTNAME
                    do
                        echo `date` Agent still running ...
                        count=`expr $count + $SHUTDOWN_INTERVAL`
                 
                        if test $count -gt $MAX_SHUTDOWN_TIME
                        then
                            echo "Reboot command sent but server not coming down"
                           #Barry Lowrance kill added to cleanup any background jobs which did not exit.
                            kill %1; kill %2; kill %3
                            exit 1
                        fi
                 
                        sleep $SHUTDOWN_INTERVAL
                    done
                 
                    #
                    # Now we know the agent is down and we are waiting for the
                    # system to reboot. Give a bunch of time to come back up.
                    #
                    count=$REBOOT_INTERVAL
                    sleep $REBOOT_INTERVAL
                 
                    while ! agent_up $HOSTNAME
                    do
                        echo `date` Agent still not up ...
                        count=`expr $count + $REBOOT_INTERVAL`
                        sleep $REBOOT_INTERVAL
                 
                        if test $count -gt $MAX_REBOOT_TIME
                        then
                            echo "Reboot has not yet come up after more than $count seconds ..."
                            #Barry Lowrance kill added to cleanup any background jobs which did not exit.
                            kill %1; kill %2; kill %3
                            exit 1
                        fi
                    done
                 
                    echo Server $HOSTNAME back up and running
                       #Barry Lowrance kill added to cleanup any background jobs which did not exit.     
                        kill %1; kill %2; kill %3
                else
                    echo Agent currently not running
                    #Barry Lowrance kill added to cleanup any background jobs which did not exit.     
                    kill %1; kill %2; kill %3
                    exit 1
                fi
                exit 0
                
                • 5. Re: Reboot Job
                  What version are you on? I believe you could do the following on version 7.5+: create a BLPackage that does nothing (an empty external command) but select the reboot option on the package. Then, with that package, create a Deploy Job that has "Number of Targets to Process in Parallel" set to 1. It would then not move on to the next target until the first is back up.
                  • 6. Re: Reboot Job
                    young so

                    There is another approach: create a smart group.

                     

                    You should also have a Smart Server Group that looks for servers where NEEDS_REBOOT equals TRUE.  Upon completion of your Set Reboot Property Script Job, browse this group to see all servers that need to be rebooted.

                     

                    We will next want to reboot any servers that need rebooting.  Generally the recommended way to do this is through the use of the “Reboot Windows Server” script, which is included in the samples directory of your Operations Manager installation.  Make sure you have your own copy in your own personal workspace to ensure there is no outside interference while you are modifying job targets.  Right-click on the job and select Open.

                     

                    Here is a link to the script:

                     

                    https://www.bladelogic.com/community/entry.jspa?categoryID=29&externalID=681

                    • 7. Re: Reboot Job
                      Bill Robinson

                      The process check is the only thing the BLPackage method might barf on, I think. Maybe we could do something w/ an external command in the BLPackage that looks for the pid/process?
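One possible shape for such a check, sketched in plain shell: run a process lister, look for the name, and return an exit code the job step can act on. The function names are hypothetical, and the right lister depends on the target (the cluster script above uses `nps`; a Windows external command would need its own equivalent).

```shell
#!/bin/sh
# process_listed NAME
# Reads a process listing on stdin and succeeds if NAME appears as a whole
# word, ignoring case (the same grep -iw match the cluster script uses).
process_listed() {
    grep -iw -- "$1" >/dev/null
}

# check_process_cmd NAME LISTER [ARGS...]
# Runs the given listing command and turns the result into an exit code:
# 0 = process found, 1 = not found.
check_process_cmd() {
    name=$1
    shift
    if "$@" 2>/dev/null | process_listed "$name"
    then
        echo "$name is running"
        return 0
    fi
    echo "$name not found" >&2
    return 1
}
```

For example, `check_process_cmd sqlservr ps -e || exit 1` would fail the step when no matching process is listed; `sqlservr` is just a placeholder process name here.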

                      • 8. Re: Reboot Job
                        Thanks for the responses. As soon as I put out some fires I will try some of these suggestions.
                        • 9. Re: Reboot Job

                          Several helpful answers have been posted.