2 Replies Latest reply on Mar 15, 2019 11:16 AM by James Yant

    High pq/rc file count on coordinator node in 3 node cluster

    James Yant

      For several months we've been having problems with our 3 node cluster: some Discovery runs never cycle and pattern uploads never complete until a reboot is performed. The system puts the stuck Discovery runs on hold indefinitely, and attempting to cancel a run just leaves it in the "cancelling" state indefinitely. Other symptoms include a sporadically slow GUI or timeouts when accessing the "Discovery Home" tab, manual pattern runs that appear never to finish before the GUI times out, and a DA count that rises slowly but steadily until the cluster is forcibly rebooted. A reboot typically produces a small drop, but the count turns upward again after a day of running. Checking the appliance via SSH shows a reasoning engine queue with a massive file count, anywhere from 50k files to almost 2 million at times. The lower counts tend to occur immediately after a reboot, and the count then rises over the next day.
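      For anyone wanting to track the queue size over time rather than eyeballing it over SSH, here is a minimal sketch of a counter. It assumes the queue files can be told apart by a `.pq` or `.rc` suffix, which is a guess based on the thread title, not something confirmed here. The function name `count_queue_files` and the directory argument are hypothetical; pass whatever directory holds the queue on your appliance.

```python
from collections import Counter
from pathlib import Path


def count_queue_files(queue_dir, suffixes=(".pq", ".rc")):
    """Count files under queue_dir, grouped by suffix.

    Assumption: queue files are identified by their suffix.
    queue_dir is supplied by the caller; no real path is implied.
    """
    # Start every suffix at zero so missing types still show up.
    counts = Counter({suffix: 0 for suffix in suffixes})
    # Walk the directory tree recursively and tally matching files.
    for path in Path(queue_dir).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            counts[path.suffix] += 1
    return dict(counts)
```

      Running this periodically (for example from cron) and logging the result with a timestamp would show whether the count climbs steadily or jumps when a particular scan starts.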

       

      A more recent symptom I've noticed over the past week is that several of our Discovery runs are actively running at once, when normally only 1 or 2 would be running at a time. Also worth mentioning: over the past week the Discovery service has been crashing sometime overnight, requiring us to click the start button to get it back up and running. A reboot yesterday appears to have remedied this, but it's a notable symptom.

       

      Do any of these symptoms sound familiar or lead anyone to suspect a specific cause? I currently have a support case open and have been exchanging logs and information since January, but we've yet to find a smoking gun. We've looked at patterns, reviewed performance graphs via the GUI, and even got a clean bill of health from db_verify. Yesterday, while looking at logs during a support call, we noted that the reasoning engine on the coordinator appeared to have been stopped, yet the status command reported that it was still running. Our Discovery instance is currently at version 11.3 and has been incrementally upgraded from 10.

        • 1. Re: High pq/rc file count on coordinator node in 3 node cluster
          Andrew Waters

          That sounds like you have at least one pattern which is taking a very long time to run. If the engine performance shows few events actually being processed during a scan this would also tend to indicate a pattern issue. Restarting temporarily works around the problem because pq files can be processed in a different order and hence go away until the system tries to run the pattern again.

           

          Regarding the slow UI, you would need to look at the performance load on the appliances.

          • 2. Re: High pq/rc file count on coordinator node in 3 node cluster
            James Yant

            We've taken a look at the active patterns and didn't notice anything outright alarming. Looking at whole-cluster pattern performance, we noticed that one of our custom patterns has over 27K invocations and a total execution time of 791. Although a little high, this didn't appear to be a cause for concern to support. Is there anything specific to look for with regard to pattern performance that we may be missing?
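            One way we could slice figures like these is by mean time per invocation rather than by total time, since a pattern that runs rarely but takes very long each time would not stand out in a total. Below is a rough sketch of that ranking. The function name `rank_by_mean_time` and the input layout are hypothetical, and the thread does not state the unit of the execution time, so the result is only meaningful relative to other patterns measured the same way.

```python
def rank_by_mean_time(stats):
    """Rank patterns by mean execution time per invocation, slowest first.

    stats maps a pattern name to (invocations, total_time).
    The time unit is whatever the performance page reports (assumed
    to be the same for every pattern).
    """
    ranked = []
    for name, (invocations, total_time) in stats.items():
        # Guard against patterns that were never invoked.
        mean = total_time / invocations if invocations else 0.0
        ranked.append((name, mean))
    # Highest mean first: these are the candidates for a slow pattern.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

            With our numbers (roughly 27,000 invocations, total time 791) the mean comes out at about 0.03 per invocation, so the interesting comparison would be against patterns with far fewer invocations but a similar or larger total.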