It would imply that either there are frequent communication issues between the members of the cluster, or there are I/O performance issues where the machines give up waiting for all the members to write the updates to disk.
There should be information in tw_svc_model.log across the cluster giving more information about the issue.
Thanks Andrew, I will do some more digging.
Good morning, we have been watching this cluster for a while now and I have raised a ticket to get some support.
However, to try to offer help to others, I will run this discussion through to completion.
History so far:
We have a three node v10.2.0.3 cluster which has been performing badly for a while.
In the GUI we saw each node in turn show as "Datastore recovering", and we had to stop discovery (consolidation) for almost a week to allow this to complete.
During recovery we saw the following in the tw_svc_model.log:
140264255354624: 2016-01-01 00:00:00,307: model.datastore.ds_store: INFO: Number of batches of missed write commands left to examine: 6059060
Once completed, we rebooted the cluster nodes and I saw the following error:
140630730176256: 2016-01-03 22:37:43,920: reasoning.reasoningcontroller: ERROR: Recovery error: Failed to retrieve persistence queue details from starting ECA Controller
Finally, since the recovery and reboot, we are seeing the following errors:
139770698180352: 2016-01-05 23:56:08,103: model.datastore.ds_store: WARNING: Repairing node 2b3532568e1314cf7dac06cc:da136b56c60951fd74b0c06a:r:Inference
Discovery is running slowly at the moment and the repair warnings have stopped.
I also have calls out to our internal teams to investigate the VM performance.
Did you solve this problem? We have the same issue.
Our experience was to admit defeat and model wipe the datastore.
Check the size of the doomed db file - ours was large and the cluster just could not keep up with processing new data and clearing out the old.
Once the db files are near or over the size of allocated RAM then you are in for a painfully grinding ride.
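Since the problem hinges on the datastore files outgrowing physical RAM, a quick check like the following can be scripted. This is only a minimal POSIX-shell sketch: the datastore directory below is an assumption (appliances differ by version), so point DS_DIR at wherever your db files actually live.

```shell
#!/bin/sh
# Hedged sketch: warn when the datastore files approach physical RAM.
# Compare two sizes in kB and report which side of the threshold we are on.
check_fit() {
    # $1 = datastore size in kB, $2 = RAM size in kB
    if [ "$1" -ge "$2" ]; then
        echo "WARNING: datastore ${1} kB is at or above RAM ${2} kB"
    else
        echo "OK: datastore ${1} kB fits within RAM ${2} kB"
    fi
}

# DS_DIR is an assumed location -- override it for your appliance.
DS_DIR=${DS_DIR:-/usr/tideway/var/localdisk}
db_kb=$(du -sk "$DS_DIR" 2>/dev/null | awk '{print $1}')
# Total physical RAM from the Linux /proc interface.
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo 2>/dev/null)
check_fit "${db_kb:-0}" "${ram_kb:-1}"
```

Running this from cron and alerting well before the size crosses RAM gives you time to clear out old data before the cluster hits the grinding stage described above.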
The current plan, in the event of this happening again, is to export the root_node_keys, model wipe, and re-import the keys and custom tpl.
In fact we have maintenance tasks in place to do this too.
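For anyone scripting a similar reset, the sequence above (export keys, wipe, re-import keys and tpl) could be wired into a skeleton like this. To be clear: the `*_CMD` variables are placeholders, not real BMC Discovery commands -- the actual export/wipe/import utilities vary by version, so substitute your appliance's own tooling.

```shell
#!/bin/sh
# Skeleton of the reset procedure: export identity keys, wipe the model,
# re-import the keys so rediscovery produces consistent node identities.
# All three commands are PLACEHOLDERS (echo stubs) -- replace them with
# the real utilities on your appliance before use.
set -eu

BACKUP_DIR=${BACKUP_DIR:-/tmp/model_reset}
EXPORT_KEYS_CMD=${EXPORT_KEYS_CMD:-"echo export-root-node-keys"}
WIPE_CMD=${WIPE_CMD:-"echo model-wipe"}
IMPORT_KEYS_CMD=${IMPORT_KEYS_CMD:-"echo import-root-node-keys"}

mkdir -p "$BACKUP_DIR"

# 1. Save the root node keys (and back up custom tpl alongside them)
#    before destroying anything -- after the wipe they are all you have.
$EXPORT_KEYS_CMD > "$BACKUP_DIR/root_node_keys.txt"

# 2. Wipe the model. Irreversible: stop discovery first and confirm the
#    backup above is complete.
$WIPE_CMD

# 3. Restore the keys and custom tpl so the rebuilt model lines up with
#    the old one.
$IMPORT_KEYS_CMD < "$BACKUP_DIR/root_node_keys.txt"
echo "reset complete"
```

Keeping the key/tpl backup step in a separate scheduled job (as we do) means a fresh export already exists if the cluster degrades faster than expected.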