Multiple aspects of disk modeling and reporting using Analyze and Predict (from the BPA Gateway Server console)

Version 23

    This document contains official content from the BMC Software Knowledge Base. It is automatically updated when the knowledge article is modified.


    PRODUCT:

    TrueSight Capacity Optimization


    COMPONENT:

    Capacity Optimization


    APPLIES TO:

    9.5, 10.5



    PROBLEM:

     

    LP: BMC Performance Assurance 9.5, 9.0, 7.5.10

      

    TrueSight Capacity Optimization 10.0

      


    DR: BMC Performance Assurance 9.5, 9.0, 7.5.10



    Questions and Issues:

      


    1) We require a Predict Disk Auto-Calibrate feature similar to the existing Predict CPU Auto-Calibrate feature. Analyze reports zero Disk Queue Times for EMC Solid State devices even at high Disk Utilization, as expected. Unfortunately, Predict calculates very high Disk Wait Times based on a random arrival rate, and thus shows significant Workload I/O Wait Time in the Workload Response Time Breakdown that we do not believe is correct.

      


    2) Questions regarding the Analyze Disk Summary report, which shows EMC Solid State hdiskpowerxx devices at 74% disk utilization: we believe these devices can sustain triple the reported I/O rate, despite the high disk utilization values reported by Analyze.

      


    3) Analyze/Predict currently report (and model) only the physical hdiskpowerxx disk devices and eliminate the virtual hdiskxx disk devices. We have a requirement to report all virtual hdiskxx disks, so we need an option to record and report them in Visualizer.

      


    4) Currently we cannot use Predict for accurate I/O modeling with Solid State devices, and we would like these fixes and enhancements to be seriously considered so that we can have confidence in using Predict for reporting and modeling EMC Solid State Devices.

    I have sent two days of BPA AIX UDR data from a partition servicing a large amount of I/O, along with four Analyze .an files, to the support ticket.



    Issue Summary: Predict's calculated AIX Disk Queue Length and Disk Wait Times do not match the Disk Queue Length measured by Analyze for the new EMC Solid State Devices.
     

     


    SOLUTION:

     

    Legacy ID: KA427861

      

    Solution

      



    Item 1. RFE QM001869008 "Predict modeled Disk Queue not the same as measured/Analyze Queue" was opened and scheduled for implementation in the next release of Predict, 10.3. The new feature provides an alternative to calculating disk response time from disk utilization, the approach summarized by these relationships:

      

     

      

    Disk Response Time  =  Disk Service Time + Disk Queue Time

    Disk Queue Time     =  Disk Service Time * (Disk Utilization / (1 - Disk Utilization))

      


    The Predict calculation of queue time is based on modeling assumptions, including that I/Os arrive at the disk randomly and that I/Os are not waiting somewhere other than the disk. You can certainly find real-world examples where these assumptions are violated, and in those situations the calculated/modeled queueing time is overstated.

    The most common scenario where these assumptions don't apply is one where the I/Os are not requested randomly, but in fact come from a single source (or a very limited number of sources). In this situation, the I/Os aren't queued at the disk but inside the application: a new I/O isn't requested until the last I/O request completes. You will therefore see a measured disk queue of zero (or close to it) at the disk, even though the I/O load is substantial and the observed disk utilization can approach 100%.
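    As a rough illustration of that distinction (a toy Python sketch with illustrative names and numbers, not anything from Predict itself), random open arrivals produce exactly the queueing the model predicts, while a closed loop produces none:

    import random

    def mean_queue_time_open(io_rate, service_time, n=200_000):
        """Simulate random (open) arrivals to a single disk and return the
        average time an I/O waits in queue before service begins."""
        clock = 0.0          # arrival clock
        disk_free_at = 0.0   # when the disk finishes its current I/O
        total_wait = 0.0
        for _ in range(n):
            clock += random.expovariate(io_rate)      # next random arrival
            start = max(clock, disk_free_at)          # queue if disk is busy
            total_wait += start - clock
            disk_free_at = start + random.expovariate(1.0 / service_time)
        return total_wait / n

    # ~90% utilization: the mean wait converges toward 9x the service time,
    # matching Disk Service Time * (U / (1 - U)) = 1.0 * (0.9 / 0.1) = 9.
    print(mean_queue_time_open(io_rate=0.9, service_time=1.0))

    # By contrast, a closed loop (one application thread that issues the next
    # I/O only after the previous one completes) never has an I/O arrive while
    # the disk is busy, so the measured queue is zero even near 100% busy.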

      

    Here are some sample disk measurements where there is a single I/O-bound application running on an AIX SPLPAR:

      

    [Chart: measured disk utilization vs. measured disk queue length per interval; the queue values remain near zero even at high utilization]

      

    Although the disk queue tends to be somewhat higher when disk utilization is higher, the actual queue values are all essentially zero.

      

    When the new disk queue calibration feature is activated, the actual measured queue is used to calculate disk response time rather than the modeled queue length. A similar feature has existed for the CPU queue since the earliest versions of Predict. So if the new disk calibration feature is enabled, this formula is used instead of the formula shown above:

      

    Disk Queue Time        =   Disk Service Time * Measured Disk Queue

      

    So for the sample data shown in the chart, at 90% disk utilization, Predict would normally calculate a queue length of 9.   With the new feature enabled, Predict will use the measured queue of 0 instead of using the modeling result of 9.
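    Expressed as code (a minimal sketch; the function names and the 5 ms service time are illustrative, not Predict parameters), the two calculations compare as follows:

    def modeled_queue_time(service_time, utilization):
        # Default behavior: queue length is derived from utilization.
        return service_time * (utilization / (1.0 - utilization))

    def calibrated_queue_time(service_time, measured_queue):
        # Disk queue calibration: use the measured queue length instead.
        return service_time * measured_queue

    # A 90% busy device with a hypothetical 5 ms service time and a
    # measured queue of 0:
    print(modeled_queue_time(0.005, 0.90))    # ~0.045 s, i.e. a queue length of 9
    print(calibrated_queue_time(0.005, 0.0))  # 0.0 s, matching the measurement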

      

     

      

    Item 2. This was addressed via a technical analysis of measured per I/O characteristics for a solid state disk.  The details are included below under the section "Additional Information".   

      

     

      


    Item 3. This is addressed via three defects, which are documented in KA424568:

    • QM001767096 "BPA version 9.0.00 reports incorrect Disk Controller I/O (double the correct value) for AIX PowerPath environments"
    • QM001878751 "CUT-DISK is cutting disk controllers in addition to disk, CUT-DISK should apply to disks only"
    • QM001878103 "Gateway Server Analyze is reporting columns 15 to 18 of table CAXCTRLD in pages/sec instead of mbytes/sec"
      

     

      


    Item 4 is not a separate technical item, but a summary of the impact of the previous three items.

      



    Additional Information

      


    The Predict modeling principle of using service time per I/O to represent a physical disk is discussed in KA363432. Here is a summary of the most important concepts for disk modeling, expressed as formulas:

      

     

      

    Disk Utilization           =  Disk Service Time per IO * IO Rate (I/Os per second)

    Disk Response Time per IO  =  Disk Service Time per IO + Disk Queue Time per IO

    Disk Queue Time per IO     =  Disk Service Time per IO * (Disk Utilization / (1 - Disk Utilization))

      

     

      

    Underlying these formulas are the assumptions that:

      

     

      

    (a) Only one I/O can be serviced at a time (per disk)

      

    (b) Due to (a), measured disk utilization represents a sum of the service times of all I/Os that were serviced

      

    (c) For Distributed Systems platforms (Linux, Windows, AIX, etc.) an I/O is defined as a 4 KB data transfer (for example, an IOP which is 100 KB would be represented as 25 I/Os)
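    Put together (a minimal sketch under the assumptions above; the 0.2 ms service time and the I/O mix are hypothetical numbers, not measured values), the formulas and the 4 KB normalization look like this:

    def normalized_io_rate(iops, avg_iop_kb):
        # Assumption (c): one modeled I/O is a 4 KB transfer, so a
        # 100 KB IOP counts as 25 I/Os.
        return iops * (avg_iop_kb / 4.0)

    def disk_metrics(service_time_per_io, iops, avg_iop_kb):
        io_rate = normalized_io_rate(iops, avg_iop_kb)   # 4 KB I/Os per second
        utilization = service_time_per_io * io_rate      # busy fraction
        queue_time = service_time_per_io * (utilization / (1.0 - utilization))
        response_time = service_time_per_io + queue_time
        return utilization, queue_time, response_time

    # Hypothetical disk: 0.2 ms per 4 KB I/O, 180 IOPS averaging 20 KB each
    # -> 900 normalized I/Os per second -> 18% utilization.
    print(disk_metrics(0.0002, iops=180, avg_iop_kb=20))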

      

     

      

    In this issue, we are discussing whether these assumptions are valid for Predict modeling of I/O (the same would be true for BCO), or whether the service time per I/O is not constant regardless of load but is instead a function of the type and size of the I/O requests. The customer is stating that the service time is not constant.

      

     

      

    An analysis of four days of UDR data (5-8 September) was done. As promised, this application generates a lot of I/O! Overall, the service time per I/O for the disks carrying the majority of the I/O load is in fact fairly constant, regardless of load and/or disk utilization, which supports continuing to use the current design of Analyze and Predict for reporting and modeling I/O capacity.

      

    Engineering agreed that the specific results from this set of solid state devices tend to support the current approach as valid. However, it's not without reservations:

      

    "The problem is, disk and disk controller technologies have evolved so much in the last couple of decades that the service time per I/O as we measure it, is not guaranteed to be constant anymore. 

      

    Here's a snippet from Adrian Cockcroft's article on this subject that sums it up well:

      

    In the old days, once the device driver sent the disk a request, it knew that the disk would do nothing else until the request was complete. The time it took was the service time, and the average service time is a property of the disk itself. Disks that spin faster and seek faster have lower (better) service times. With today's systems, the device driver issues a request, that request is queued internally by the RAID controller (and the disk drive) and several more requests can be sent before the first one comes back. The service time, as measured by the device driver, varies according to the load level and queue length and is not directly comparable to the old-style service time of a simple disk drive. 

      

     

      

    We will need to go back to our data source to see if we can get measurements that represent disk service time without the internal queue. For now, this is the best we can do."

      

     

      

    The analysis of the key measured I/O metrics is available in the attached document, and the corresponding Excel spreadsheet containing the data analyzed is also attached to this article. One important point from this analysis is that part of the confusion about the capacity of the disk devices most likely arose because the customer's in-house staff were observing IOPS, rather than following the BMC methodology of using a 4 KB data transfer as the yardstick for characterizing and modeling I/O capacity. Since the size of the IOPs is highly variable while this application is running, you will not find a correlation between disk utilization and the IOPS rate. However, you will see that disk utilization is well correlated with the amount of data being transferred (reported as either MB/sec or 4 KB I/Os per second), which serves as the basis for the BMC methodology.
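    The effect is easy to reproduce with synthetic numbers (a toy demonstration only, not the customer's measurements): when data volume drives utilization but IOP size varies, utilization correlates with MB/sec and not with raw IOPS:

    import random

    def pearson(xs, ys):
        # Plain Pearson correlation coefficient.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Synthetic intervals: transferred data drives utilization; IOP size varies.
    mb_per_sec = [random.uniform(10, 200) for _ in range(1000)]
    iop_kb = [random.choice([4, 16, 64, 256]) for _ in range(1000)]
    iops = [mb * 1024 / kb for mb, kb in zip(mb_per_sec, iop_kb)]
    utilization = [mb / 250 for mb in mb_per_sec]

    print(pearson(utilization, mb_per_sec))  # ~1.0: tracks the data volume
    print(pearson(utilization, iops))        # much weaker: IOP size masks the load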

      

     

      

    Additional discussion of the I/O characteristics of this application can be seen in the last case study presented in ftpdepot.bmc.com/pub/perform/gfc/das/Presentation%20-%20CCMG2015_Managing%20IBM%20PowerVM%20Virtualization.pdf.

      
    Related Products:

    1. BMC TrueSight Capacity Optimization
    2. BMC Performance Assurance

     


    Article Number:

    000110899


    Article Type:

    Solutions to a Product Problem


