How is CPU Utilization and Capacity correlated on Sun Chip MultiThreading (CMT) Platforms?

Version 3

    This document contains official content from the BMC Software Knowledge Base. It is automatically updated when the knowledge article is modified.


    BMC Performance Assurance for Servers


       The "CPU Utilization and Capacity on Chip MultiThreading (CMT) Platforms" paper from the Sun Microsystems Server Quality Office & CMT Engineering discusses "the problem of measuring CPU utilization on CMT based systems, what tools to use, and what the information means".
    • BMC Performance Assurance for Virtual Servers 9.5, 9.0, 7.4.10, 7.4.00
    • BMC Performance Assurance for Servers 9.5, 9.0, 7.4.10, 7.4.00
    • Sun Chip MultiThreading (CMT)



    Legacy ID: KA317969

       There are typically two different types of things you might want to know when looking at the CPU utilization of a system:  
       (A) How much competition is there for shared resources on the system? (Since competition for shared resources will result in degraded application response time)  
       (B) What will be the impact on application response time of an increased or decreased load on the system (resulting in increased or decreased competition for shared resources)?  
       When looking at CPU utilization you are really just limiting the scope of that question to competition for CPU resources.  
       The 'CPU Utilization and Capacity on Chip MultiThreading (CMT) Platforms' white paper from Sun is basically saying that you can't answer either of those questions from CPU utilization alone, and that you thus need to look beyond a single measure. The paper focuses on identifying how physical resource contention impacts application response time, rather than on determining where on the application response time curve you are at the machine's current utilization level. I also think the key point the paper is trying to make is that in the past, when there was a one-to-one correlation between the logical resources on which work could be scheduled (threads/processes scheduled on a CPU) and the physical resources servicing that work (the actual physical CPU cores), it was possible to read a lot more into a single CPU utilization value than is possible in today's hardware environment.
       In relation to CPU reporting in the Perform product, 'Example 2: Compute-bound load, 64 threads' is the most interesting because it shows the primary problem we have with CMT reporting. In the 'vmstat 1' output the system reports CPU User Time (vmstat 'us') of 50% and CPU Idle Time (vmstat 'id') of 50% during the test because the load uses 64 of the 128 virtual CPUs on the system. But the test is defined as a compute-bound load, so during a period when the system really can't do any more work, the top-level CPU counters report the machine as only 50% utilized. The same is true in 'Example 3. Java integer intensive threaded load', where another compute-bound load happens to use the physical cores more efficiently. So, one key problem with the CMT environment is that it is difficult (if not impossible) to know where you are on the response time curve using just CPU utilization and queue length when you are running at less than 100% utilization. As 'Example 1. Compute-bound load, 200 threads.' shows, once you saturate the machine you begin to see both 100% CPU utilization and a measurable queue length at the system level, but before that you see neither. I believe the paper's examples specifically picked 64 threads to fully utilize all of the integer pipelines on the test machine.
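       As a rough sketch of the arithmetic behind Example 2 (the vCPU and thread counts come from the paper; the rest is my illustration):

```python
# Why vmstat under-reports saturation on CMT hardware: a minimal sketch
# of the arithmetic in the paper's Example 2. The vCPU and thread counts
# are from the example; everything else is illustrative.

V_CPUS = 128        # logical CPUs (vCPUs) the Solaris scheduler sees
BUSY_THREADS = 64   # compute-bound threads, each running on its own vCPU

# vmstat derives utilization from the fraction of vCPUs that have a
# thread scheduled, so a fully compute-bound 64-thread load reads as
# only 50% busy / 50% idle.
vmstat_user = 100 * BUSY_THREADS / V_CPUS
vmstat_idle = 100 - vmstat_user

print(f"vmstat us: {vmstat_user:.0f}%  id: {vmstat_idle:.0f}%")
# Yet, per the paper, those 64 threads saturate the machine's integer
# pipelines: no additional compute throughput is available at "50% busy".
```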
       So, it is logical to assume that what you would want to collect and report is core utilization (for example, as provided by 'corestat'), but this also turns out not to be what you need to determine where the system is on the response time curve. The reason is that core utilization is reported at a lower hardware level than what we are looking for. Core utilization doesn't include the time that the core is 'blocked' due to context switching (such as a switch triggered by a CPU cache miss). In a CMT environment the OS schedules threads on a virtual CPU (vCPU), to use the white paper's terminology, and the hardware itself then cycles the active threads in round-robin fashion on the physical core. Whenever a thread becomes unrunnable (for example, due to a cache miss) it is cycled off of the core in favor of another runnable thread in that thread group. The time it takes to cycle the thread off of the core isn't measured as 'utilization' within corestat, but the core is effectively blocked by the switch. So, even when a machine is running at full utilization, corestat will report core utilization as less (typically significantly less) than 100% per core. This is clearest in 'Example 1. Compute-bound load, 200 threads.', where there is a significant run queue at the OS level for a compute-bound application, yet corestat reports utilization of around 80% per core (40% per integer pipeline).
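       A hypothetical cycle-accounting sketch may make this concrete (the ~80% figure matches the paper's Example 1; the cycle split itself is my assumption):

```python
# Hypothetical cycle accounting for one saturated CMT core, illustrating
# why corestat can report well under 100% even while a run queue exists
# at the OS level. corestat derives utilization from instruction-issue
# counters, so cycles spent stalled (e.g. switching threads after a
# cache miss) count neither as busy nor as idle-with-no-work.

TOTAL_CYCLES = 1_000_000
ISSUE_CYCLES = 800_000   # cycles actually issuing instructions (assumed)
STALL_CYCLES = 200_000   # cache-miss / thread-switch stalls (assumed)

corestat_util = 100 * ISSUE_CYCLES / TOTAL_CYCLES

# From a capacity standpoint the core is fully occupied: an additional
# runnable thread would add wait time, not throughput.
print(f"corestat core utilization: {corestat_util:.0f}% (core is saturated)")
```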
       Unfortunately the CMT environment doesn't report a simple core 'busy' versus 'idle, waiting for runnable work' metric so it is impossible to know from corestat whether you have core resources available to provide increased transaction throughput at a higher transaction arrival rate (or whether the higher transaction arrival rate will just result in increased wait time).  
       The prstat 'LAT' metric they mention in the paper is interesting, and it is something we don't currently collect in the Perform product. The reason is actually covered in the Sun white paper: the metric is valuable at the thread level but diluted at the process level (since it would then be an average across multiple, potentially disparate threads). They can show it at the process level in the examples because they know all of the threads are doing the same work and will be active concurrently, but on a typical system that wouldn't be the case for an arbitrary process. The prstat LAT metric is interesting in that it gives a '% CPU Ready Time' for threads, which may indicate CPU queuing before you would see it in the system-level CPU utilization. Although I assume you would see those threads in the run queue at that point, so run queue alone would be a good high-level indication of CPU contention. I would have liked to have seen more examples of vmstat, corestat, and prstat at various levels of physical core contention on the system. I'm checking what types of benchmark testing we have done in-house and whether anyone has created a formal white paper (or CMG paper) on CPU utilization measurement and interpretation in a CMT environment.
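       To illustrate the dilution problem (the thread names and LAT values below are hypothetical):

```python
# Sketch of why the prstat LAT metric loses meaning when rolled up from
# threads to a process. All thread names and %LAT values are invented
# for illustration.

# Per-thread %LAT: percentage of time spent runnable but waiting for a CPU.
thread_lat = {
    "worker-1": 40.0,   # CPU-starved worker threads with real contention
    "worker-2": 38.0,
    "logger":    0.1,   # mostly-idle housekeeping threads
    "listener":  0.2,
    "timer":     0.0,
}

# A process-level roll-up averages across these disparate threads,
# hiding the contention that the worker threads actually experience.
process_lat = sum(thread_lat.values()) / len(thread_lat)
worst_lat = max(thread_lat.values())

print(f"process-level LAT: {process_lat:.1f}%")   # looks mild
print(f"worst thread LAT:  {worst_lat:.1f}%")     # real CPU queuing
```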
       We have this document from Debbie Sheetz which also discusses some of the same questions as above:     KA343359.  There is a detailed discussion about the limitations of corestat in 'Appendix E. Techniques for Acquiring Per Core CPU Usage'.  
       I think we are all in agreement that using just CPU utilization alone is not sufficient to determine application response time in a CMT environment (and, in general, using just CPU utilization in any environment doesn't give a complete enough picture to talk in terms of application performance).  And when the hardware is virtualized with CMT, there are at least two CPU utilizations to choose from: thread utilization or core utilization.  
       Using CPU utilization combined with CPU Queue Length is a better option, but it still doesn't tell you where you are on the response time curve. By the time you start to see an increase in CPU queue length you are much further down the response time curve than you might expect, because queuing effects are being hidden at the physical resource level: a process appears to be scheduled on a processor (vCPU) at the system level, but that vCPU isn't necessarily scheduled on a physical resource - it may be waiting its turn on a shared physical resource. And that queueing isn't being reported.
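       One way to see the effect is with a simple single-server queueing sketch (this model and its numbers are my illustration, not from the paper): for an open M/M/1-style queue, mean response time grows as R = S / (1 - U), so understating effective utilization badly understates how far along the curve you really are.

```python
# M/M/1-style response time sketch: R = S / (1 - U). Illustrative only;
# the 50%/80% utilization figures are assumptions, not measurements.

SERVICE_TIME = 1.0   # normalized service time S

def response_time(util):
    """Mean response time for a single-server queue at utilization util."""
    return SERVICE_TIME / (1.0 - util)

reported_util = 0.50    # what the vCPU-level counters show
effective_util = 0.80   # contention actually present at the physical cores (assumed)

print(f"expected from reported util:  R = {response_time(reported_util):.1f}x service time")
print(f"actual at the physical level: R = {response_time(effective_util):.1f}x service time")
```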
       The complexity of utilization reporting when multiple virtual objects share a single physical resource is that the line between utilization and wait time becomes blurred, as queues are buried behind the logical unit of work visible at the OS level. This is true of CPUs in a CMT environment, disks in a SAN/RAID environment, logical OS partitions in a virtualized environment, and so on. When the OS sees only an abstraction of the physical resource where work is being done, the true definition of utilization becomes less clear - particularly as more levels of abstraction are added between the OS resource scheduler and the physical resource itself.

    Q: Based upon this, what would be the overall conclusion as it applies to the product? Keep using CPU Utilization and Queue Length and no changes? Since CMT is "thread oriented" are there plans to collect threads (which is useful in other instances) and add LAT metric?

       I do think you can keep using BPA CPU utilization and queue length as indicators of CPU saturation -- what is more difficult in a CMT environment is knowing how close you are to saturation when you are below the saturation limit. That is still an open question in relation to capacity planning for hardware virtualized with CMT.
       Engineering is actively researching better methods for CPU reporting in Solaris CMT environments. We've been having a lot of discussions with Oracle/Sun about the metrics that are available and what types of metrics we'd like to see added in future Solaris versions to make it easier to determine where you are in the response time versus utilization curve.  
       I don't believe we have seriously considered collecting thread-level data, simply due to the volume of data it would generate. It's easy to decide to look at individual threads and thread LAT when you are researching a known problem, but it is a much harder decision when you are talking about data collection across thousands of machines in an environment, most of which really aren't doing much. We've been talking about conditional data collection (only collect metric/metric group X if condition Y is true), but that adds significant complexity to the design of both the agent and analyze, and it isn't clear that we currently know enough to come up with quality rules in that area across all of the various Unix and Windows variants and each of their virtualization solutions.
       So, I think for now CPU utilization and queue length continues to be the best place to start a performance analysis. But for capacity planning purposes there are still many blind spots in CMT reporting and prediction which I don't believe anyone has overcome at this point.  
       Note: at the time this article was published, a live link to the Sun paper being discussed was no longer available.  

    Q: I recently saw  that Solaris 11 had some new commands, pginfo and pgstat, which say they report the same information as "corestat", which is discussed in the white paper.  Is this part of the BMC Performance Assurance agent?

        This is not currently part of the 9.5 Solaris agent, and there is no outstanding request for enhancement (RFE) to include this support.
        Here's a link to a discussion of the new commands  
    Related Products:  
    1. BMC Performance Assurance for Servers

