    High network utilization/CPU Run Queues?

      Running BL 7.6 managing a few hundred Windows servers, on a single app server running 64-bit RHEL 5.2. JVM is tuned to 6GB of memory.


      We've been troubleshooting a high CPU run queue issue on this server when a BL Compliance job runs. We have the job limited to a maximum of 20 targets at a time.


      What we found is that the network throughput, on this Virtual Machine, is up to ~ 1500-2000kbps on one vm NIC, and Run Queue is ~ 20.
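      For anyone who wants to reproduce the throughput number from inside the guest, here is a minimal sketch that computes per-interface throughput from two samples of /proc/net/dev. It assumes "kbps" means kilobits per second and that the file has the usual Linux layout (8 receive fields followed by 8 transmit fields per interface). The function names are made up for illustration.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not the layout this sketch assumes
        # field 0 is received bytes, field 8 is transmitted bytes
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_kbps(before, after, seconds):
    """Return {interface: (rx_kbps, tx_kbps)} between two samples."""
    first, second = parse_net_dev(before), parse_net_dev(after)
    result = {}
    for iface, (rx2, tx2) in second.items():
        if iface not in first:
            continue
        rx1, tx1 = first[iface]
        # bytes -> bits -> kilobits, averaged over the sample interval
        result[iface] = ((rx2 - rx1) * 8 / 1000 / seconds,
                         (tx2 - tx1) * 8 / 1000 / seconds)
    return result

# Usage idea: read /proc/net/dev, sleep N seconds, read it again,
# then call throughput_kbps(first_text, second_text, N).
```

      This only shows totals per NIC, not which peer the traffic goes to, so it cannot by itself separate appserver-to-target traffic from appserver-to-database traffic.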


      When we lowered the number of parallel targets for this job down to 10, the Run queue went down to ~ 7 but Network throughput stayed the same.


      When we lowered the number to 1 target at a time, we started to see runqueue down to 0.7 and throughput below 1,000kbps.


      The server is only reaching about 70 to 80% CPU utilization, with plenty of memory left.  What is strange is that System CPU utilization is higher than User CPU utilization.
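      As a rough way to quantify the user versus system split without a monitoring tool, the sketch below compares two samples of the aggregate "cpu" line in /proc/stat. It assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal) and ignores any later fields. The function name is hypothetical.

```python
def cpu_split(before, after):
    """Return CPU time percentages between two /proc/stat samples."""
    def aggregate(text):
        for line in text.splitlines():
            parts = line.split()
            if parts and parts[0] == "cpu":
                # keep the first 8 counters, padding on very old kernels
                values = [int(x) for x in parts[1:9]]
                return values + [0] * (8 - len(values))
        raise ValueError("no aggregate cpu line found")

    delta = [b - a for a, b in zip(aggregate(before), aggregate(after))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    user, nice, system, idle, iowait, irq, softirq, steal = delta
    pct = lambda ticks: 100.0 * ticks / total
    return {
        "user": pct(user + nice),
        "system": pct(system),
        "irq": pct(irq + softirq),  # interrupt handling, e.g. network I/O
        "iowait": pct(iowait),
        "idle": pct(idle),
        "steal": pct(steal),        # time taken by the hypervisor
    }

# Usage idea: read /proc/stat, wait a few seconds, read it again,
# then call cpu_split(first_text, second_text).
```

      Reporting irq/softirq and steal separately may help here, since heavy network traffic and hypervisor contention both show up outside of plain user time.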


      My question is if anyone has seen such correlation between Run Queue and network bottlenecks.  My understanding is that Run Queue will go up (a bad thing) when there is an I/O bottleneck, such as network or storage.
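      One thing that may help separate the two cases: on Linux, vmstat reports runnable tasks (column "r") separately from tasks blocked in uninterruptible sleep, usually on I/O (column "b"), while the load average counts both. The sketch below reads the same two counters from /proc/stat. The function name is made up, and it assumes the kernel exposes procs_running and procs_blocked there.

```python
def run_queue_snapshot(proc_stat_text):
    """Return runnable and blocked task counts from /proc/stat text."""
    wanted = ("procs_running", "procs_blocked")
    values = {}
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            values[parts[0]] = int(parts[1])
    missing = [name for name in wanted if name not in values]
    if missing:
        raise ValueError("counters not found: " + ", ".join(missing))
    return values

# Usage idea: call this once a second while the job runs, e.g.
#   with open("/proc/stat") as f:
#       print(run_queue_snapshot(f.read()))
# procs_running typically includes the process doing the reading.
```

      If procs_running is high while procs_blocked stays near zero, the tasks are competing for CPU rather than waiting on I/O, which would point away from a pure network or storage bottleneck.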


      If so, short of upping the available throughput capacity on the NIC or keeping the number of parallel targets low, what else have you done to remedy this?




          What does the util look like on the targets when they are being scanned? Are you using Oracle or MSSQL DB?

            Vinnie Lima

            Hey Adam. Will need to look at the targets tomorrow for their utilization.


            Using MSSQL DB.  Looking at the MSSQL DB server, network and CPU utilization match the high throughput.


            Can also look at the network flow to see if it's appserver->targets or appserver->SQL or both.

              Another question, are your MSSQL table spaces on local disk, SAN, or NAS? Also, on your cpu util on the appserver, is the proc being hijacked by JAVA, NSH, or something else?

                Vinnie Lima

                Hi Adam,


                Wish I could post some of the statistics here but would be a pain to transfer it from production.


                Basically we were seeing high CPU run queue (20-30) with high system cpu time (versus user cpu time). Today I looked in more detail and found out (with our VMWare admins) that the MSSQL VM was being capped on how much CPU resource it could leverage from the ESX Host.  Additionally, there was very little traffic going from the appserver to the targets, but a lot of traffic between appserver and MSSQL server. These two VMs were on two different ESX Hosts.


                So, the VMadmin vmotioned the SQL vm so that it is co-hosted on the same ESX server (to keep the communication internal to the vswitch), and we saw drastic improvements on the CPU Run queue (was down to < 7 when processing 800 hosts for compliance job w/ 10 parallel targets).


                The ESX hosts have 6x 1GB NICs but the throughput is still 1GB (not 6GB).  What is even stranger is that when NetBackup kicks in during the middle of the night, it far surpasses the NIC throughput we saw while a compliance job was running.


                So, the current resolution seems to be either A) removing the CPU cap on the MSSQL VM and/or B) placing both VMs on the same ESX host so that traffic stays internal to the vswitch.


                To your question above, the MSSQL table space is on a local VM, which is in a datastore, which is housed on a fibre channel-based SAN.


                From a process perspective, the top process is the blappserv java process itself.


                Still have two questions:


                1) Is it normal for bl appserver to consume a lot of sys cpu time versus user cpu time?

                2) Has anyone seen > 0 CPU run queue time when running a job?  Don't know if this is normal or abnormal due to something else we haven't pinpointed yet.



                  I have never looked into either of those two questions.


                  I can say though that BMC's best practice is to not have a database on a virtual machine (for any of our products). Too many unexplainable issues have arisen when databases are virtualized. Not sure if that is the issue for you here, but it does complicate things.

