
    INFO: Number of batches of missed write commands left to examine

    Matt Lambie

      Any idea why our consolidation cluster keeps showing "Datastore Recovering" under Cluster Management, and why we get the following in tw_svc_model.log?

       

      model.datastore.ds_store: INFO: Number of batches of missed write commands left to examine: 46056

       

      If we leave discovery running, the number keeps growing and the cluster cannot keep up.

      If we stop discovery it eventually completes, but the problem returns after a few consolidation runs!

        • 1. Re: INFO: Number of batches of missed write commands left to examine
          Andrew Waters

          It would imply either that there are frequent communication issues between the members of the cluster, or that there are I/O performance issues where the machines give up waiting for all the members to write the updates to disk.

           

          There should be entries in tw_svc_model.log across the cluster giving more detail about the issue.
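
          If it helps, here is a rough Python sketch of the kind of thing you could run on each appliance to pull the WARNING and ERROR lines out of tw_svc_model.log and compare what the members are complaining about. It is not a supported tool, and the log path is an assumption based on a default install, so adjust it if yours differs:

          import re
          import sys

          # Rough helper (not a supported tool): print WARNING/ERROR lines from
          # tw_svc_model.log so the output from each cluster member can be
          # compared side by side. The default path below is an assumption.
          LOG_PATH = "/usr/tideway/log/tw_svc_model.log"

          def interesting_lines(path):
              """Yield lines that look like warnings or errors from the model service log."""
              pattern = re.compile(r":\s+(WARNING|ERROR):")
              with open(path) as log:
                  for line in log:
                      if pattern.search(line):
                          yield line.rstrip()

          if __name__ == "__main__":
              path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
              for line in interesting_lines(path):
                  print(line)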

          • 2. Re: INFO: Number of batches of missed write commands left to examine
            Matt Lambie

            Thanks Andrew, I will do some more digging.

            • 3. Re: INFO: Number of batches of missed write commands left to examine
              Matt Lambie

              Good morning, we have been watching this cluster for a while now and I have raised a ticket to get some support.

               

              However, to try to help others, I will run this discussion to completion.

               

              History so far:

              We have a three-node v10.2.0.3 cluster which has been performing badly for a while.

              In the GUI we saw each node in turn show as "Datastore recovering", and we had to stop discovery (consolidation) for almost a week to allow this to complete.

              During recovery we saw the following in the tw_svc_model.log:

              140264255354624: 2016-01-01 00:00:00,307: model.datastore.ds_store: INFO: Number of batches of missed write commands left to examine: 6059060

               

              Once recovery completed, we rebooted the cluster nodes and I saw the following error:

              140630730176256: 2016-01-03 22:37:43,920: reasoning.reasoningcontroller: ERROR: Recovery error: Failed to retrieve persistence queue details from starting ECA Controller

               

              Finally, since the recovery and reboot, we are seeing the following errors:

              139770698180352: 2016-01-05 23:56:08,103: model.datastore.ds_store: WARNING: Repairing node 2b3532568e1314cf7dac06cc:da136b56c60951fd74b0c06a:r:Inference

               

              Discovery is running slowly at the moment and the repair warnings have stopped.

               

              I also have calls out to our internal teams to investigate the VM performance.
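
              For anyone else watching one of these recoveries, a quick way to see whether the backlog is actually shrinking is to pull the "missed write commands left to examine" counts and timestamps out of the log. A rough Python sketch along those lines (the log path is an assumption based on a default appliance layout):

              import re
              import sys

              # Rough helper: extract the timestamp and backlog count from lines like
              #   ...: model.datastore.ds_store: INFO: Number of batches of missed write commands left to examine: 6059060
              # so you can see whether the number is trending down (recovery progressing)
              # or up (the cluster is falling behind). The log path is an assumed default.
              LOG_PATH = "/usr/tideway/log/tw_svc_model.log"
              PATTERN = re.compile(
                  r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+:.*"
                  r"missed write commands left to examine:\s+(?P<count>\d+)"
              )

              def backlog_samples(path):
                  """Yield (timestamp, backlog_count) pairs found in the log."""
                  with open(path) as log:
                      for line in log:
                          match = PATTERN.search(line)
                          if match:
                              yield match.group("ts"), int(match.group("count"))

              if __name__ == "__main__":
                  path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
                  for ts, count in backlog_samples(path):
                      print("%s  %s" % (ts, count))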

              • 4. Re: INFO: Number of batches of missed write commands left to examine
                Ondrej Kieler

                Hello Matt,

                 

                did you solve this problem? We have the same issue.

                 

                Regards,

                 

                Ondrej

                • 5. Re: INFO: Number of batches of missed write commands left to examine
                  Matt Lambie

                  Hi Ondrej.

                   

                  Our experience was to admit defeat and do a model wipe of the datastore.

                   

                  Check the size of the doomed db file - ours was large and the cluster just could not keep up with processing new data and clearing out the old.

                   

                  Once the db files are near or over the size of the allocated RAM, you are in for a painfully grinding ride.
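
                  A quick and dirty way to check where you stand on that is to total up the datastore files and compare against physical RAM. A rough Python sketch; the directory below is an assumption about a default appliance layout, so point it at wherever your datastore files actually live:

                  import os

                  # Rough check: total size of the datastore files versus physical RAM.
                  # The directory below is an assumption about a default appliance
                  # layout; point it at wherever your datastore files actually live.
                  DATASTORE_DIR = "/usr/tideway/var/localdisk"

                  def total_file_size(root):
                      """Sum the size of all files under a directory tree, in bytes."""
                      total = 0
                      for dirpath, _dirnames, filenames in os.walk(root):
                          for name in filenames:
                              try:
                                  total += os.path.getsize(os.path.join(dirpath, name))
                              except OSError:
                                  pass  # file removed or unreadable; skip it
                      return total

                  def physical_ram_bytes():
                      """Read MemTotal from /proc/meminfo (Linux only)."""
                      with open("/proc/meminfo") as meminfo:
                          for line in meminfo:
                              if line.startswith("MemTotal:"):
                                  return int(line.split()[1]) * 1024  # value is in kB
                      raise RuntimeError("MemTotal not found in /proc/meminfo")

                  if __name__ == "__main__":
                      db_bytes = total_file_size(DATASTORE_DIR)
                      ram_bytes = physical_ram_bytes()
                      gib = 1024.0 ** 3
                      print("Datastore files: %.1f GiB" % (db_bytes / gib))
                      print("Physical RAM:    %.1f GiB" % (ram_bytes / gib))
                      if db_bytes >= ram_bytes:
                          print("Datastore is at or beyond RAM size - expect it to grind.")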

                   

                  Our current plan, should this happen again, is to export the root_node_keys, do a model wipe, and reimport the keys and custom TPL.

                   

                  In fact we have maintenance tasks in place to do this too.