13 Replies Latest reply: Oct 25, 2003 7:27 AM by DevConUser NameToUpdate

CPU Upgrade - normalisation of CPU utilisation in Visualizer

johnwatson240597

We upgraded 2 Sun E10k systems from 6 to 8 processors this week, and amended the domain and analyze files accordingly. Two days after the upgrade, we used Visualizer to compare %CPU utilisation for the day before and the day after the upgrade. Utilisation did not appear to have changed, and further investigation showed that Visualizer is now averaging utilisation over 8 CPUs for all data, both before and after the upgrade - therefore all normalised utilisation data reported before the upgrade is now incorrect. In other words, the SQL executed to draw the graphs must be using the last static value for the number of processors as the normalisation factor, regardless of the fact that the number of processors may change over time.
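To illustrate what I think is happening, here's a toy sketch (made-up numbers and logic, not the actual Visualizer SQL):

```python
# Toy sketch of the suspected bug: normalising every interval by the
# latest static CPU count instead of the count in effect at the time.
# Numbers are invented; this is not the actual Visualizer SQL/logic.

intervals = [
    # (label, busy CPU-equivalents, CPUs installed at the time)
    ("day before upgrade", 4.8, 6),
    ("day after upgrade",  4.8, 8),
]

latest_cpu_count = 8  # last static value, applied to ALL dates

for label, busy, cpus_then in intervals:
    reported = busy / latest_cpu_count * 100  # what we appear to be seeing
    expected = busy / cpus_then * 100         # per-interval normalisation
    print(f"{label}: reported {reported:.0f}%, expected {expected:.0f}%")
```

Both days come out at 60% when normalised over 8 CPUs - which matches the "utilisation did not appear to have changed" symptom - whereas the day before the upgrade should really read 80%.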

Is this a documented "feature"? I know that we can get round this by using hierarchy graphs (as per Perry's Clicks and Tricks presentation) and drawing the averaged graphs from there, but surely there must be a better solution that allows us to continue reporting normalised CPU as before, simply by selecting CPU Utilisation from the Graphics > CPU/System menu?! If not, we would need to rework a lot of the Visualizer templates we use for reporting these servers, which I would see as an undesirable situation given that this appears to be a shortcoming in the product.

Any ideas anyone?

  • 1.
    Perry Stupp

    John,

    That doesn't sound right. What version of Visualizer are you running? Visualizer has been able to vary the processor count dynamically for quite a while now. Make sure that you show CPU Usage "by processor", or else you may not notice a difference. If that doesn't get it, then I suspect you're either using an older version of Visualizer or there's something wrong with the associated CAXNODE keys.

    Regards,

    Perry.

  • 2.
    johnwatson240597

    Perry,
    We are actually using the latest version, Visualizer 3.7.10 - I didn't think it sounded right! I take it then that the SQL query doesn't just use the last entry in the CAXNODE table for num_of_proc for every date?

    Yes, we show the CPU by Processor, so maybe you're correct that there's a problem with one of the tables. My colleague has opened a support call with BMC, so hopefully a resolution will come from that quarter.

    In the meantime, we can get round this - as I said - by using the Processor Hierarchy and averaging from there (which does appear to work).

    Strange though!!

    Cheers,

    John

  • 3. Perceive!
    johnwatson240597

    While I'm on the subject - our install of Perceive does deal with this! When we draw the CPU utilisation graph showing periods before and after the upgrade, the normalised CPU figure is entirely consistent and correct for the number of CPUs in the server at the respective time. Weird.

  • 4. Problem found?
    johnwatson240597

    I think I've discovered the problem here - looking at the CAXNODE entries for the two nodes that were upgraded, there are two rows for each node. Both rows for each node show the same number of CPUs (8), but the Spec_int ratings are different, i.e. they correspond to the configurations before and after the upgrade.

    Clearly something screwy has happened with Visualizer - a support case is open, but it is quite interesting all the same.
    I'd imagine that we could simply correct the entries in CAXNODE to fix the problem, but will need confirmation from support before doing so.

    The suspect rows from CAXNODE are enclosed.

  • 5.
    Perry Stupp

    John,

    "
    We are actually using the latest version, Visualizer 3.7.10 - I didn't think it sounded right!
    "

    Actually, you'd be surprised: this can happen even with 3.7.10 if you move directly to 3.7.10 from a really old version of Visualizer without subsetting or migrating it first.

    "
    I take it then that the SQL query doesn't just use the last entry in the CAXNODE table for num_of_proc for every date?
    "

    That is correct. It used to be the case quite some time ago, although that was more because there was only one static entry in the database at a time. Somewhere around 3.5.20 they introduced a hash key into the CAXNODE NODE_ID field to allow the configuration information to vary dynamically.
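    As a rough mental model (invented schema - the real CAXNODE hash-key mechanism is more involved), per-interval lookup means picking the configuration row in effect at each point in time, rather than always taking the last row:

```python
# Rough mental model of per-interval configuration lookup (invented
# schema; the real CAXNODE hash-key mechanism is more involved).
import bisect

# One row per configuration change: (effective_from_day, num_of_proc)
config_rows = [(0, 6), (100, 8)]  # 6 CPUs from day 0, 8 from day 100

def cpus_in_effect(day):
    """Return the processor count in effect on a given day."""
    days = [d for d, _ in config_rows]
    return config_rows[bisect.bisect_right(days, day) - 1][1]

print(cpus_in_effect(99), cpus_in_effect(100))  # 6 8
```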

    Check your analyze file - I suspect that you hard-coded the processor type / number of processors? I don't know for certain where Visualizer gets the processor count from (clearly not the CPU statistics), but I suspect that it is influenced by this. While you're at it, you may want to look into why you have four entries rather than just the two you would expect (have you made other configuration changes in the past?). If you find any one interval that references two different entries in CAXNODE, then Analyze and Predict are producing mismatched hash keys. This won't really affect Visualizer much, but it will introduce additional challenges in customizing Perceive, particularly the normalization that you referred to in your other thread.

    Regarding Perceive CPU Utilization, you're right. I muddled a series of threads together in my head and was thinking of other entries that are dependent on the hard-coded CPU count derived from the discovery process. CPU Utilization will vary dynamically for exactly the reason you stated above: it graphs an average of the per-CPU entries. I will modify my post above accordingly.

    Regards,

    Perry.

  • 6.
    johnwatson240597

    Yes, we do - as standard - hard code the number of CPUs and model into analyze.

    There are four entries in the CAXNODE table because we actually upgraded two servers from six to eight CPUs at the same time, so you would expect two entries for each, i.e. one with the old CPU configuration (and the corresponding SPEC rating) and the other with the new CPU config (likewise with the new SPEC rating). The SPEC ratings do appear to be correct, but the number of processors does not.

    As to the version of Visualizer, the measurement data in the Oracle database is completely 3.7.10 - not subsetted/converted from any previous instance. The summary data, however, was subsetted from a 3.6.20 database when we created our existing 3.7.10 instance.

    Do you think that changing the number of CPUs from 8 to 6 in the CAXNODE table in the appropriate row will resolve our problem?

  • 7.
    Perry Stupp

    D'oh! Sorry, I wasn't paying close enough attention - that's two servers, so ignore my comments regarding mismatched keys.

    "
    Yes, we do - as standard - hard code the number of CPUs and model into analyze.
    "

    Have you updated this number since you upgraded? Note that it sometimes takes a while before changes take effect.

    "
    Do you think that changing the number of CPUs from 8 to 6 in the CAXNODE table in the appropriate row will resolve our problem?
    "

    It will but there's a small chance that it will get overwritten the next time you populate data. Let us know how you make out on this if you decide to do it.
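    For what it's worth, the sort of one-row correction being discussed would look something like this toy sqlite3 stand-in (table and column names are taken from this thread, and the node_id values are invented - verify against your actual Oracle schema and rows before touching anything):

```python
# Toy demonstration (sqlite3 stand-in for the real Oracle database) of
# the one-row correction discussed above: fix num_of_proc on the
# pre-upgrade CAXNODE row so historical data is normalised over 6 CPUs.
# Table/column names follow the thread; node_id values are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE caxnode (node_id TEXT, num_of_proc INT, spec_int REAL)")
con.executemany("INSERT INTO caxnode VALUES (?, ?, ?)",
                [("nodeA#old", 8, 120.0),   # wrong: pre-upgrade row says 8
                 ("nodeA#new", 8, 160.0)])  # correct post-upgrade row

# Target the pre-upgrade row via its SPEC rating, which the thread says
# is still correct, and restore the old processor count.
con.execute("UPDATE caxnode SET num_of_proc = 6 "
            "WHERE node_id = 'nodeA#old' AND spec_int = 120.0")

print(con.execute("SELECT node_id, num_of_proc FROM caxnode "
                  "ORDER BY node_id").fetchall())
```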

    Regards,

    Perry.

  • 8.
    johnwatson240597

    Yes, we have updated the number of CPUs since the upgrade (it was done on the morning after).

    I'll get one of our trusty DBAs to make the change to CAXNODE to see if this works (we'll probably do it for one of the servers to begin with) and I'll let you know.

    By the way, have I said how good a product Perceive is? 

  • 9.
    johnwatson240597

    A couple of interesting things:

    One of our test boxes was also upgraded last week, from a 2-way to a 3-way E10k domain - I didn't actually know this until I stumbled across it today. In our domain and analyze files these values have never been set (primarily because it's a test box, so we've never looked at it with the same rigour as production) - yet the entries for this server in CAXNODE are consistent! That seems to indicate that we shouldn't set these values in either the domain or analyze command files. Surely that can't be the case?! Or would the recommendation be that these values are never explicitly set in either of these components?

    Additionally, I've had one of our DBAs make the change to the appropriate row in CAXNODE for one of the problem servers - this appears to have corrected the problem, although we will need to check whether tomorrow morning's population wipes away the manual change.

  • 10.
    johnwatson240597

    Looks like making the change to the appropriate row in CAXNODE works - and it is not overwritten by further populations. We'll now need to identify any other nodes where this has happened, and go back and change them with the help of our DBA.

    In general though, it would be useful to get a recommendation from BMC on whether or not we should explicitly specify the model type and number of CPUs in our domain and analyze files - it would appear that doing so is what caused this issue.

  • 11.
    Patrick Gudat

    "
    In general though, it would be useful to get a recommendation from BMC whether or not we should explicitly specify the model type and number of CPUs in our domain and analyze files."

    From the Desk of Debbie

    "
    I would like to take this opportunity to remind everyone that hardcoding is a method of LAST RESORT and should only be used when all other attempts at automatic hardware recognition have failed.
    "

  • 12.
    johnwatson240597

    In that case, wouldn't it be better to make it less accessible? If that makes sense...

    It seems like an obvious field to be completed when creating a node in a domain file - specify the name, pick the model number, number of CPUs, collect home etc.

    To quote from page 5-7 of the latest "Collecting Data with Patrol Performance Assurance" document:

    "Click the Configuration tab to identify attributes of the node,
    such as CPU type and operating system. You can use the defaults
    shown here. Much of this information is acquired in the collected
    data. The defaults for the repository directories are the
    installation defaults."

    To me, this doesn't (nor does any of the other documentation I can recall seeing) suggest that selecting the model type in the domain file is a measure of last resort - indeed, perhaps quite the opposite. Perhaps the documentation would benefit from a revision?

    So I'm guessing I'll now have to go back and remove all of the explicitly specified server types from all of the domain and analyze files... I suppose it will keep me occupied for a time at least!

  • 13.
    DevConUser NameToUpdate

    Hi John

    There are other methods besides hardcoding, and not all of them are in the documentation (big surprise). You could create your own odm file to translate the BIOS strings coming from the machines so they align with entries in the hardware table instead. I believe Debbie's reference was to a specific problem within the NT console, however.

    B