While it might not quite be Spring (northern hemisphere-centric) I have already seen an odd daffodil so I am going to pretend it is.
There have been a few posts in the past discussing the importance of keeping the Discovery Access nodes under control. My previous post is from several years ago, so it is time to revisit, based on an actual recent customer experience.
Firstly, we considered a 2-member scanning cluster. Its performance was mainly OK from a user perspective, since no direct reporting was done on it. However, it occasionally suffered from strange symptoms - mainly stuck scans that wouldn't finish or couldn't be cancelled. I have noticed that when the datastore gets too large, it can lead to this sort of irregularity - and since we saw a large number of DDD nodes in the statistics page, we planned to make the DDD removal more aggressive (changing the "Directly Discovered Data removal" setting from 28 days to 14 days, in the Model Maintenance page). We had planned for a couple of days of non-scanning, to allow the removals to complete as quickly as possible.
We did this - but unfortunately we had dramatically underestimated the time it was going to take to do these deletes in the datastore. After a few days, we could still see the model process actively chugging away: the persistence queue (/usr/tideway/var/persist/reasoning/engine/queue) filled up with hundreds of thousands of files, reduced, and filled up again.
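For anyone wanting to keep an eye on that queue without repeatedly listing the directory by hand, here is a minimal sketch that counts the files under it. The path is the one quoted above; the function name and the idea of walking subdirectories are my own assumptions for illustration, not a BMC-supplied tool.

```python
import os

# Path as quoted in this post; adjust if your installation differs.
QUEUE_DIR = "/usr/tideway/var/persist/reasoning/engine/queue"


def count_queue_files(path):
    """Return the number of files under path, including subdirectories.

    Hypothetical helper for illustration. A missing directory counts as 0,
    because os.walk silently yields nothing for a path it cannot read.
    """
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


if __name__ == "__main__":
    if os.path.isdir(QUEUE_DIR):
        print(f"{QUEUE_DIR}: {count_queue_files(QUEUE_DIR)} files")
    else:
        print(f"{QUEUE_DIR} not found on this machine")
```

Run it periodically (from cron or a simple loop) and log the result; a count that keeps refilling, as ours did, suggests the deletes are nowhere near finished.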
Due to the pressure of getting scanning started again, we decided to do a model wipe, which is a very quick operation. Thankfully we did not need to worry about a root node key export as these were scanners with no direct CMDB connection. All data would be refreshed after scanning was resumed.
Once the scanners had been reconnected to the Proxies, and scanning started, service was restored with no reliability problems.
The next week, we started to observe similar performance problems on the 3-member consolidation cluster. For context, we were scanning about 50,000 hosts, and it had been 2 months since the last compaction. We were running about 4 million DDD nodes. Firstly, we performed a compaction, with the intention of speeding up subsequent deletes: this reduced the datastore to 66% of its original size.
We wanted to reduce the aging below the existing 14 days, but the UI option has limited granularity:
and did not have our desired value of 10 days. Moreover, having been bitten by the experience on the scanners, we wanted to change it in 1-day increments to make sure the deletions had finished before we reduced it further. The way to do this is via the tw_options command. If run like this, it will show the current setting (in seconds):
So we changed to 13 days like this:
restarted the services, and confirmed the UI showed the new "custom" value:
Then, we monitored the model process (CPU usage) and the persistence queue. The next morning, we were confident it had reached a steady state, so we reduced the setting by another day and repeated until we reached 10 days. The statistics graph looked like this:
which clearly shows that as we reduced the time (yellow), bunches of deletion work were added (red/orange) which, once processed, resulted in a dramatic reduction in the total DA count (blue).
Consolidator performance is now acceptable again, and we plan to do another compaction in a few months.
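Since the setting is expressed in seconds, it is easy to slip when converting by hand. A small sketch of the arithmetic for the one-day steps we took from 14 days down to 10 follows. The function names are mine, and it only computes the values; it does not call tw_options, whose exact option name is not reproduced here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def days_to_seconds(days):
    """Convert a whole number of days to seconds."""
    return days * SECONDS_PER_DAY


def step_schedule(current_days, target_days):
    """Return (days, seconds) pairs for each one-day reduction.

    Hypothetical helper: starts one day below current_days and ends at
    target_days, mirroring the step-by-step approach described above.
    """
    if target_days > current_days:
        raise ValueError("target must not exceed the current setting")
    return [
        (d, days_to_seconds(d))
        for d in range(current_days - 1, target_days - 1, -1)
    ]


if __name__ == "__main__":
    for days, seconds in step_schedule(14, 10):
        print(f"{days} days = {seconds} seconds")
```

For our case this gives 1123200, 1036800, 950400 and 864000 seconds for 13, 12, 11 and 10 days respectively. Apply one step, wait for the model process and persistence queue to settle, then move to the next.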