Recovery and debugging options for TrueSight Capacity Optimization (TSCO) Gateway Server/BPA Unix console Manager run failures.

Version 6

    This document contains official content from the BMC Software Knowledge Base. It is automatically updated when the knowledge article is modified.


    PRODUCT:

    BMC Performance Assurance for Servers


    APPLIES TO:

    TrueSight Capacity Optimization 20.02, 11.5, 11.3.01, 11.0, 10.7, 10.5, 10.3, 10.0; BMC Performance Assurance 9.5



    PROBLEM:

     

      What recovery and debugging options are available for the TrueSight Capacity Optimization (TSCO) Gateway Server (formerly BMC Performance Assurance) Unix console Manager runs?  
        
        NOTE: This document was originally published as Solution SLN000000222608.   
       

     

       

    TrueSight Capacity Optimization (Gateway Server) 20.02, 11.5, 11.3, 11.0, 10.7, 10.5, 10.3, 10.0
    BMC Performance Assurance for Servers  9.5, 9.0, 7.5.10, 7.5.00, 7.4.10

    Manager run

      

     


    SOLUTION:

     

    Q: If my Manager run wasn't executed, what is the best way to recover?

      
      The best way to recover is to manually execute the *.Manager script that wasn't executed. This will create the necessary scripts and start data collection on the remote node (from the time that you execute the *.Manager script).   
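
    For example, a manual recovery might look like the following, run as the user under which the Manager runs are scheduled (the directory and script name below are illustrative; use the Manager Output Directory and the *.Manager script for the run that was missed):

    > cd /data/manager_output                 # illustrative Manager Output Directory
    > ls *.Manager                            # identify the script for the missed run
    > ./2024-01-15-2024-01-16.Manager         # execute it; collection starts from now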

    Q: Is there an easy way to re-execute all of my *.Manager scripts?

        Here are the quick recovery commands when all Manager runs have failed to execute:

    As the user under which you schedule your Manager runs:
       
    > $BEST1_HOME/bgs/scripts/pcrontab.sh -list | grep Manager | grep -v GeneralManager | awk '{ print $7 }' > /tmp/runs.sh
    > chmod 755 /tmp/runs.sh
    > /tmp/runs.sh
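
    Before running /tmp/runs.sh it can be worth reviewing its contents; the generated file is simply a list of *.Manager script paths (the paths shown below are illustrative):

    > cat /tmp/runs.sh
    /data/manager_output/2024-01-15-2024-01-16.Manager
    /data/manager_output/prod/2024-01-15-2024-01-16.Manager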
     
       

    This is useful for recovering when the Gateway Server console was down during the period when the runs would have been executed, or when cron problems on the machine prevented the runs from executing.

       

    Q: If my Manager run wasn't executed, how do I debug the problem?

    The first step is to determine what prevented the Manager run from executing. To do that, Technical Support will want to see the following files (a sketch of commands for gathering them appears after this list):

    • The entire contents of the /usr/adm/best1_V.V.VV/bgs/log/pcron directory
    • The entire contents of the /usr/adm/best1_V.V.VV/bgs/pcron/repository directory
    • The entire contents of the /usr/adm/best1_V.V.VV/local/manager directory
    • The output of 'ls -lR [Manager Output Directory]' where [Manager Output Directory] is the directory where the *.Manager script exists for the Manager run that wasn't executed.
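
    A minimal sketch for gathering those files into a single archive for Technical Support, assuming $BEST1_HOME is /usr/adm/best1_9.5.00 and the Manager Output Directory is /data/manager_output (both paths are illustrative):

    > cd /usr/adm/best1_9.5.00
    > tar -cvf /tmp/manager_debug.tar bgs/log/pcron bgs/pcron/repository local/manager
    > ls -lR /data/manager_output > /tmp/manager_output_listing.txt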
       

    Q: How are these files useful for debugging?

       
    The entire contents of the /usr/adm/best1_V.V.VV/bgs/log/pcron directory

    • This contains the pcron log file, which records which scripts were executed by pcron and when.

    The entire contents of the /usr/adm/best1_V.V.VV/bgs/pcron/repository directory

    • This contains the entire list of events currently scheduled in pcron.

    The entire contents of the /usr/adm/best1_V.V.VV/local/manager directory

    • This contains the UCM Status Reports and udrCollectMgr log files. These give a good indication of whether the *.Manager script isn't being executed at all, or whether the UDR Collection Manager is failing to properly start data collection.

    The output of 'ls -lR [Manager Output Directory]' where [Manager Output Directory] is the directory where the *.Manager script exists for the Manager run that wasn't executed.

    • This allows Technical Support to verify which files were created in the Manager Output Directory (for example, does the *.Manager script still exist?).
       

    Depending on what Technical Support sees in those logs, it may be necessary to look further at things like the system cron log (/var/cron/log and /var/cron/olog), but root access is required to read those files, so they are generally not requested initially.
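
    If root access is available, a quick check of the system cron log for the pcron job might look like this (the log path varies by platform; use whichever cron log exists on your system):

    # As root:
    > grep -i pcron /var/cron/log | tail -20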

       

    Above, /usr/adm/best1_V.V.VV is the $BEST1_HOME directory for your Gateway Server console version (for example, for BPA version 9.5.00 the correct path would be /usr/adm/best1_9.5.00).
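
    If $BEST1_HOME is set in the Manager run user's environment (as the commands above assume), you can confirm the installation path with the following (output shown is illustrative):

    > echo $BEST1_HOME
    /usr/adm/best1_9.5.00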

       

    Q: Is there a way to pre-load the remote node with data collection requests so that, if Manager fails to issue data collection, the problem can be identified before data has been lost?

       

    By default, Manager will pre-populate a few days' worth of data collection requests onto the remote node. Manager registers data collection requests with the agent, and the agent starts those requests when the start time arrives.

    By default, data collection requests are registered with the remote node 3 days in advance. Although it is almost never necessary to change the default, the number of days in advance that collection requests are registered can be changed as follows:

       

    Step 1

        Edit the /usr/adm/best1_V.V.VV/local/setup/collectManager.cfg file.

    Step 2

    Change the 'COLREQ_DAYS_ADVANCE = 3' parameter to 'COLREQ_DAYS_ADVANCE = #', where '#' is the number of days in advance to pre-register data collection requests (0 to 3). The default is '3' (maintain 3 days of registered collection requests).

    Step 3

    Save the collectManager.cfg file.
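
    For example, to pre-register only 1 day of collection requests, the edited line in collectManager.cfg would read as follows (excerpt only; the rest of the file is unchanged):

    COLREQ_DAYS_ADVANCE = 1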

    This option is especially useful in an environment where the managing node is going to be upgraded over the weekend and will be unavailable for a few days. It allows Manager to pre-register the collection requests so that no data is lost while the managing node is down.

    This Manager option also provides a good buffer when debugging problems with Manager runs not being consistently executed, because it allows the problem to be debugged without any data being lost on the remote nodes. The root cause of the Manager run instability still needs to be identified, but this feature eliminates the most damaging symptom of the problem while the source is tracked down.
     
        

    Section II: Manager debugging methodology

       
       When debugging Manager problems the first goal should be to determine which task is failing and then attempt to isolate that to a more specific step within that task.

    The high level Manager tasks are:
        
         
    • Data collection
    • Data transfer
    • Data processing / Visualizer file creation
    • (Obsolete) Visualizer file population (or transfer to the Visualizer PC)
        
    A good place to start researching is either the General Manager Lite (GMLite) reports or the Gateway Manager UI that is part of the TSCO Application Server. Information similar to the UCM Status Reports mentioned below is also available in those newer interfaces.

    Outside of the Gateway Manager UI and GMLite reports, a good place to start researching is the UDR Collection Manager (UCM) status reports. These include an overview of the data collection and data transfer components of the Manager run.
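
    As noted above, the UCM status reports and udrCollectMgr log files live under $BEST1_HOME/local/manager; a quick way to spot the most recent ones is a reverse-time listing (a sketch, assuming $BEST1_HOME is set in the environment):

    > ls -lt $BEST1_HOME/local/manager | head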

    For example, the UCM status report may show that data collection has failed for node hou-remprd-08. Expanding that node's status shows that data collection failed to start because the connection to the Service Daemon is failing.
       The "Service daemon not installed on the remote node (connection refused)" message indicates that the collection request is not being received by the remote node. So, the appropriate place to begin debugging this issue is on the remote node itself to determine why the Service Daemon isn't responding.
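
    On the remote node, a reasonable first check is whether the Service Daemon process is running and listening. The process name (bgssd) and port (10129) below are assumptions based on a typical agent installation and should be verified against your configuration:

    # On the remote node (process name and port are assumptions - verify locally):
    > ps -ef | grep bgssd | grep -v grep
    > netstat -an | grep 10129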

    If an entire date is missing from the 'Manager Runs' tree on the left side of the UCM status reports, data collection requests were not registered for any Manager runs on that date. That generally indicates a problem with the execution of the *.Manager scripts due to an issue with cron, pcron, or the *.Manager scripts themselves.

    To debug Manager run execution issues the places to look are:
        
         
    • The $BEST1_HOME/bgs/log/pcron/[hostname]-[ManagerUser].log file
    • The output of '$BEST1_HOME/bgs/scripts/pcrontab.sh -list'
    • The output of 'crontab -l' run as the user under which the Manager runs were scheduled
    • The system cron log
        
    The $BEST1_HOME/bgs/log/pcron/[hostname]-[ManagerUser].log file will indicate whether the scripts are being executed. If they are not, the '$BEST1_HOME/bgs/scripts/pcrontab.sh -list' command will show whether the scripts are even scheduled to be executed. If the scripts are scheduled but aren't being executed, the output of 'crontab -l' will show whether the 'pcron' event is properly scheduled in cron. If it is, the system cron log will indicate whether that job is being run every minute on the machine.
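
    A sketch of the first three checks, run as the Manager run user (the system cron log check requires root access, as noted earlier; the user name 'bpauser' in the log file name is illustrative):

    > tail -50 $BEST1_HOME/bgs/log/pcron/`hostname`-bpauser.log    # were the scripts executed?
    > $BEST1_HOME/bgs/scripts/pcrontab.sh -list                    # are they scheduled in pcron?
    > crontab -l | grep -i pcron                                   # is the pcron job in cron?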

    If the *.Manager scripts are properly scheduled in pcron but aren't being executed, the most likely problems are:
        
         
    1. Cron is not executing the 'pcron' event scheduled in the 'pcrontab' file (for example, the Manager user's account has been removed from the list of authorized cron users)
    2. The 'pcron' event is no longer scheduled in the Manager user's crontab file
    3. Pcron is being executed by cron but is failing (hopefully information will be in the log files)
    4. All of the *.Manager scripts have somehow been deleted from the Manager Output Directory (unlikely)
        
    Other debugging entry points within Manager on Unix include (a listing sketch follows this list):

    • Check that the [date]-[date].ProcessDay.out file has been created by the [date]-[date].ProcessDay script. If the *.ProcessDay script has been executed (and the deletion of the *.ProcessDay.out file is disabled within Manager), then a new *.ProcessDay.out file should be created each day. If the log doesn't exist, that probably means the *.ProcessDay script was never executed. This could be because the *.XferData script hung or all nodes failed to transfer for that Manager run.

    • Check that the [date]-[date].manager.log file has been created by the [begin]-[end].Manager script. If the *.Manager script has been executed (and the deletion of the *.manager.log file is disabled within Manager), then a new *.manager.log file should be created each day. If the log doesn't exist, that probably means the *.Manager script was not executed for that day (or failed before it had a chance to create the log file). The *.manager.log file contains the output from the udrCollectMgr command (which manages data collection and data transfer) and the output from the *.XferData script.
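
    A quick listing in the Manager Output Directory shows which of these logs were created and when (the directory below is illustrative):

    > cd /data/manager_output
    > ls -l *.ProcessDay.out *.manager.log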
       
      
      
    Related Products:

    1. TrueSight Capacity Optimization
    2. BMC Performance Assurance for Servers
    3. BMC Capacity Management Essentials

    Legacy ID: KA312458

     


    Article Number:

    000330078


    Article Type:

    Solutions to a Product Problem


