Troubleshooting performance issues in ADDM/Discovery

Version 1

    This document contains official content from the BMC Software Knowledge Base. It is automatically updated when the knowledge article is modified.


    PRODUCT:

    BMC Discovery


    COMPONENT:

    BMC Atrium Discovery and Dependency Mapping


    APPLIES TO:

    BMC Atrium Discovery and Dependency Mapping



    PROBLEM:

     

    What are the elements BMC Customer Support needs to investigate performance issues in ADDM/Discovery?

     


    SOLUTION:

     

    Legacy ID: KA371872

    Performance issues can be difficult to troubleshoot. Finding the root cause requires collecting a substantial amount of information. Gathering it takes time, but it significantly speeds up the resolution of performance issues.

    First elements to collect before investigating a performance issue:

    • What operations are too slow (UI response, scans, CMDB sync, etc.)?
      • If this is a scan, please send us a screenshot of the Discovery Run page.
      • If this is a UI function such as a query or a report, let us know which one(s).
    • How much time do you expect it to take?
    • How much time does it take now? If it looks stuck, we need to know how this was diagnosed.
    • Was it quicker in the past? How much time did it take then?
    Other elements to collect (from the UI):

    It is recommended to collect all of these at the same time. If this information is not collected while the problem is happening, it is important to let Support know when the problem was observed. 

    - Go to Administration > Appliance Configuration and select the Usage Data Collection tab. Click anywhere in the "Submission Data" field, copy/paste the content into a text file, and send us the file. This contains the cluster/hardware configuration and the volume of data. 
    - Send us a screenshot of the Discovery > Scheduled Runs page. It helps us evaluate the scan activity. Note this only applies to a scanning appliance. If this is a consolidator, send us a screenshot of the Discovery > Currently Processing Runs page, and let us know how often scans typically run (once per day, 24x7, etc.). 
    - Send us a screenshot of the Administration > Model Maintenance page (both the General tab and the DDD Removal blackout windows). It contains important parameters that impact the datastore size and memory usage. 
    - Send us a screenshot of the Administration > Discovery Configuration page (just for any settings that are not the default values). It contains important parameters that can impact scan performance. Also, check the setting for the "Trade discovery performance with interactivity" parameter. 
    - If this is for a scan, go to the Discovery Run page and open the Discovery Access Finishing Rate report. Send us a screenshot of the output. This can show whether a few problematic IPs are extending the length of the scan.  

    Run the following generic search query to gather more information about the Discovery Access nodes on the appliance, and send us the result:
        SEARCH DiscoveryAccess SHOW end_state PROCESS WITH countUnique(0)
    NOTE: A very large number (e.g. 500K+) of DiscoveryAccess nodes typically causes performance problems.
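
    If it is more convenient, the same search can be run from the appliance command line with the tw_query utility (a minimal sketch; as with other tw_ commands, it prompts for the password of the UI user passed with --username):
        tw_query --username system "SEARCH DiscoveryAccess SHOW end_state PROCESS WITH countUnique(0)"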


    Go to Administration > Performance:
    - Click the Patterns tab and select the day that performance problems are being observed. Sort on the Average Execution Time column; sort twice to bring the highest values to the top. Send us a screenshot of this page. If the problem has occurred over several days, a screenshot from each day would help to identify trends. 
    - Click the Engines tab while a scan is running (or has recently finished). Send us a screenshot of this page. 
    - Click the Hardware tab. Send us a screenshot of the "Daily SAR Statistics" and "Daily Disk Usage Statistics" graphs. 
    - Click the Datastore tab. Send us a screenshot of the Datastore Cache Performance graph. 
    - Click the DDD Removal tab. Send us a screenshot of the Discovery Access Removal Statistics graph. 
    - Also, send us a screenshot of the Administration > Search Management page (while a query or scan is running). It allows us to identify queries that cannot finish (if any).   

    Go to Administration > Appliance Support / Support Services:
    - Under "Atrium Discovery Logs By Date", select the date range for when the problem is happening, plus at least one day before.
    - Check the boxes for AppServer, AppServer errors, Discovery, Performance, Reasoning, and Others.
    - Under "Miscellaneous Files", check the boxes for sar logs, System messages, and Atrium Discovery output files.
    - Under Create Archive, enter a name for the zip file.
    - Click "Gather" and send us the result.

    Other elements to collect (from the command line):

    If the Discovery server is a virtual appliance, execute the commands below (this will stop the tideway services):

      

    sudo /sbin/service tideway stop                   # stop the tideway services
    top -d 10 -n 3 > topoutputWhileADDMIsDown.txt     # capture 3 samples, 10 seconds apart
    sudo /sbin/service tideway start                  # restart the tideway services

      

    Send us the resulting topoutputWhileADDMIsDown.txt file. This will allow us to check if the CPUs are being shared by ESX with other VMs.
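
    In particular, the %st (steal time) field on the Cpu(s) line shows how much CPU time the hypervisor diverted to other VMs. As a quick check (a minimal sketch, reading the file captured above):

        grep 'Cpu(s)' topoutputWhileADDMIsDown.txt

    Consistently non-zero %st values while the tideway services are stopped suggest CPU contention at the ESX level.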

    Send us the versions of the TKU/EDP/SKU installed on the appliance. You can collect this information from the command line as follows: 
    - Open an ssh session to the appliance and log in as the "tideway" user
    - Execute the command below. It will prompt for the system password.
    tw_pattern_management --list-uploads
    - Copy/paste the output into a text file or email (no screenshot please)
    For more details see KA 000137700. 

    Notes about the logs requested above:

       You can check the “System messages” logs (/var/log/messages*) for the string “oom-killer”. If the oom-killer is invoked, the appliance ran out of memory, and a process (usually the model) was killed.
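
    For example, a minimal check from the command line (root privileges may be required, and the exact message format varies with the kernel version):

        sudo grep -i oom-killer /var/log/messages*

    Any match means the kernel killed a process to reclaim memory; note the timestamp and the process named in the surrounding lines.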

    The performance logs show the memory and CPU used by each process. These logs contain the output of “top -c” run every ten minutes. For example:  

     top - 21:30:34 up 794 days, 18:22,  1 user,  load average: 0.00, 0.00, 0.00
    Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
    Cpu(s):  1.8%us,  0.6%sy,  0.0%ni, 97.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   3926060k total,  3788924k used,   137136k free,   211740k buffers
    Swap:  8388600k total,  6076632k used,  2311968k free,  1949916k cached
     
    Check the "Mem" and "Swap" lines and make sure that virtual memory is not exhausted. 

    The %wa value shows how much time the CPU is waiting on I/O. A value consistently at 20% or above is typically too high. If the I/O is slow and a lot of swap space is being used, performance will be poor. Solutions are:

      
       
    1. Add more RAM so that less swap space is used; and/or
    2. Make the I/O faster.
      

    You can check I/O performance by running the “iozone” utility. See https://docs.bmc.com/docs/display/DISCO111/Disk+IO+Performance+Guidelines.
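
    A minimal sketch of such a test (the test selection, record size, file size, and target path below are illustrative assumptions; the file size should exceed the appliance's RAM so that caching does not mask the real disk speed):

        iozone -i 0 -i 1 -r 1m -s 8g -f /usr/tideway/iozone.tmp

    Here -i 0 and -i 1 run the write/rewrite and read/re-read tests, -r sets the record size, -s the file size, and -f the temporary test file.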
     

      

    Typical root causes of performance issues: https://docs.bmc.com/docs/display/DISCO111/Factors+affecting+performance

      

    Other common root causes:

      
       
    • Not enough RAM, causing the OS to swap heavily. To confirm, check the performance logs. The appliance typically uses most of the RAM, but that does not necessarily mean it is swapping. Active swapping can be confirmed with vmstat (see the sketch after this list). 
    • High CPU I/O wait: if it is consistently above 20%, then performance is being degraded.
    • A high density of large servers. For example, if you have many large Weblogic servers running 100+ Weblogic processes, this will trigger the Weblogic pattern 100+ times per server. As a result, the model process may consume more memory, causing intensive swapping.
    • In some cases, scanning/consolidating takes place at the same time that the appliance is trying to do internal housekeeping (primarily, determining what DDD can be removed and trying to actually remove it). The result of such a conflict is typically that neither scanning nor housekeeping makes any progress. You may be able to manage this by creating DDD removal blackout windows to segregate these activities.
    • Poor scanning performance can result if there are many/large reasoning transaction files. To check this, from the command line, please run these commands and send the output files:    
      ls -la /usr/tideway/var/persist/reasoning/engine/queue > /usr/tideway/reasoning_persist.out 
      ls -la /usr/tideway/var/persist/consolidation > /usr/tideway/consolidation_persist.out  
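
    As mentioned in the first item above, vmstat can confirm whether the appliance is actively swapping rather than merely holding allocated swap (a minimal sketch; the 5-second interval and six samples are arbitrary):

        vmstat 5 6

    Sustained non-zero values in the si (swap-in) and so (swap-out) columns indicate active swapping.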

     


    Article Number:

    000096875


    Article Type:

    Solutions to a Product Problem


