
A very common problem with HA cells is drifting out of sync.  The issue occurs when the two cells hold different or conflicting information.  The conflicts accumulate until the cells can no longer reconcile.  Normally, one cell or the other will die, and on a TSIM that takes down all the processes, because the cell is a key component that all of them communicate with.

 

You may see pending propagations and a long stream of “unbuffer sent” messages in the cell logs.  This may or may not be a problem.  It could be that the cell could not propagate the events or changes at that moment for some reason and will succeed on the next attempt.  So we can’t go by the logs alone, but if the messages are constant throughout the logs, that tells us we need to look further.  To me it’s more of a flag.

 

What I normally do in these situations is to check the PRP records in the cell itself by running:

cd $MCELL_HOME/var/<cell>

grep "PRP" mcdb | wc -l

 

If the value returned is, say, 10, there is no concern.  If the value is over 100, that is a clear indication that the HA cell pair has drifted out of sync.  You can confirm this by comparing the cell size on the two nodes.  If the difference is more than, say, 1 or 2 MB, the pair is likely out of sync.  In one case the difference was several hundred MB and there was no doubt it was out of sync.  The only way to fix the problem is to shut down both sides and, provided the primary is actually the active server, copy the mcdb and xact files from the active node to the passive node to re-sync them manually.
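
The rule of thumb above can be sketched as a few small shell functions. This is only an illustration: the function names are made up, the "watch" band between 10 and 100 is an assumption (the text only says 10 is fine and over 100 is a concern), and it assumes the mcdb file size is a reasonable stand-in for "cell size".

```shell
#!/bin/sh
# Sketch only: thresholds follow the rule of thumb in this post, not BMC documentation.

# Count the lines in an mcdb file that mention PRP
# (equivalent to: grep "PRP" mcdb | wc -l).
count_prp() {
    grep -c "PRP" "$1" || true
}

# Classify a PRP count. The middle "watch" band is an assumption.
classify_prp() {
    if [ "$1" -gt 100 ]; then
        echo "out-of-sync-likely"
    elif [ "$1" -le 10 ]; then
        echo "ok"
    else
        echo "watch"
    fi
}

# Absolute size difference, in bytes, between two mcdb files
# (for example the primary's file and a copy fetched from the secondary).
size_diff_bytes() {
    a=$(( $(wc -c < "$1") ))
    b=$(( $(wc -c < "$2") ))
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}
```

For example, classify_prp "$(count_prp mcdb)" run from the cell's var directory prints one of the three labels, and a size_diff_bytes result above roughly 2097152 (2 MB) would match the "likely out of sync" case described above.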

 

Normally, this occurs when one member has been down for an hour or more.  In that situation, copy the mcdb and xact from the server that stayed up so that the secondary is in sync before it is restarted.  If the member is down for under an hour, it will not be a problem, because the propagations will still be waiting.  After an hour, those propagations will have timed out and will never be sent to the passive cell, which is what causes the cells to be out of sync.  This sometimes happens during maintenance windows, when the VM or server is brought down for OS patching, ESX hardware work, or something similar.

 

In very rare situations it can happen when it shouldn’t.  For example, both the primary and secondary cells are in the same data center, and there are no firewalls, load balancers, or other devices that could interfere with propagation between the two cells, but the traffic to the cell is high-normal to high.  When that much data has to sync across the cells, there is a chance they will drift out of sync.  What you can generally do is add these lines to $MCELL_HOME/etc/<cell>/mcell.conf, save the file, and restart the cell.

 

TCPKeepAlive=TRUE
DestinationBufferKeepWait=1h
ConnectTimeOut=1000
SynchronizeTimeOut=5000
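
If you would rather script the change than edit the file by hand, something like the following could work. It is a minimal sketch: the function name add_ha_settings is hypothetical, it assumes mcell.conf uses plain Name=Value lines with # for comments, and it leaves any parameter that is already set untouched rather than overwriting it.

```shell
#!/bin/sh
# Sketch only: appends the four settings above to a cell's mcell.conf.
# Parameters that already have an uncommented value are left as they are.
add_ha_settings() {
    conf="$1"
    cp -p "$conf" "$conf.bak" || return 1      # always keep a backup first

    # Make sure the file ends with a newline before appending
    if [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi

    for setting in \
        "TCPKeepAlive=TRUE" \
        "DestinationBufferKeepWait=1h" \
        "ConnectTimeOut=1000" \
        "SynchronizeTimeOut=5000"
    do
        name=${setting%%=*}
        # Append only if no uncommented line already sets this parameter
        if ! grep -q "^[[:space:]]*${name}[[:space:]]*=" "$conf"; then
            printf '%s\n' "$setting" >> "$conf"
        fi
    done
}
```

You would call it as add_ha_settings "$MCELL_HOME/etc/<cell>/mcell.conf" and then restart the cell. If a parameter is already present with a different value, review it by hand.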

 

To manually sync the two servers, you can do the following:

***Always take a backup before making any changes***

**This procedure assumes that it is the Primary cell which contains the correct data**

 

1 - Copy the mcell.dir file from MCELL_HOME/etc on the Primary to MCELL_HOME/etc on the Secondary.

2 - Copy the cell KB directory (MCELL_HOME/etc/<cellname>/kb) from the Primary to the Secondary. Take care not to copy the mcell.conf file that is in MCELL_HOME/etc/<cellname>

3 - Stop both the Primary & Secondary cells

4 - Run a statbld on Primary

a - cd MCELL_HOME/var/<cellname>#1

*note if this is a pre 7.4 cell then the directory will be MCELL_HOME/log/<cellname>#1

b - rename xact to xact.1

c - run 'statbld -n <cellname>#1'

5 - Delete the contents of the MCELL_HOME/var/<cellname>#2 directory for the Secondary cell

*note if this is a pre 7.4 cell then the directory will be MCELL_HOME/log/<cellname>#2

6 - Copy the contents of the Primary cell var directory (MCELL_HOME/var/<cellname>#1) over to the Secondary cell var directory.

*note if this is a pre 7.4 cell then the directory will be MCELL_HOME/log/<cellname>#1 to be copied across to MCELL_HOME/log/<cellname>#2

7 - Start the primary cell and shortly afterwards start the secondary cell. Then check the status by running the following:

mgetinfo -n <cellname>#1 -v activity
mgetinfo -n <cellname>#2 -v activity
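
Step 4 is the part most easily mistyped, so here is a rough sketch of it as a script. The function names and the DRY_RUN variable are made up for illustration. By default it only prints the commands so you can review them, and it picks var or log depending on which directory exists, following the pre-7.4 note above. It does not stop the cells or copy anything to the Secondary.

```shell
#!/bin/sh
# Sketch only: step 4 of the procedure above, run on the Primary
# after both cells have been stopped and a backup has been taken.

# State directory for node 1 or 2: var/<cell>#N on 7.4 and later,
# log/<cell>#N on older cells.
cell_state_dir() {
    home="$1"; cell="$2"; node="$3"
    if [ -d "${home}/var/${cell}#${node}" ]; then
        echo "${home}/var/${cell}#${node}"
    else
        echo "${home}/log/${cell}#${node}"
    fi
}

# Set xact aside and run statbld. Prints the commands unless DRY_RUN=0.
primary_statbld() {
    dir=$(cell_state_dir "$1" "$2" 1)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "cd ${dir}"
        echo "mv xact xact.1"
        echo "statbld -n ${2}#1"
    else
        cd "$dir" && mv xact xact.1 && statbld -n "${2}#1"
    fi
}
```

Running primary_statbld "$MCELL_HOME" <cellname> prints the three commands. Setting DRY_RUN=0 first makes it execute them.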

 

There is a video that walks through the above procedure, available at https://www.youtube.com/watch?v=wZ2FyRIVuzo

**Please note that in newer cell versions the windows service status for the standby cell is 'Running' rather than 'Paused'**

 

How to check whether the data in the primary and secondary HA cells are in sync?


You can run:

mquery -n CellName#1 -a CORE_EVENT -s event_handle,CLASS -f csv > EVT1
mquery -n CellName#2 -a CORE_EVENT -s event_handle,CLASS -f csv > EVT2

Then compare files EVT1 and EVT2 (for instance with "diff").

 

Similarly for data:

mquery -n CellName#1 -d -a CORE_DATA -s event_handle,CLASS -f csv > DAT1
mquery -n CellName#2 -d -a CORE_DATA -s event_handle,CLASS -f csv > DAT2
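
A plain diff can report spurious differences if the two cells return rows in a different order, so it may help to sort the exports first. The sketch below does that and counts the rows found on only one side. The function name compare_exports is hypothetical, and it assumes each row of the csv output fits on one line.

```shell
#!/bin/sh
# Sketch only: compare two mquery csv exports regardless of row order.
# Returns 0 when both files contain the same rows, 1 otherwise.
compare_exports() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    # Use the same collation for sort and comm
    LC_ALL=C sort "$1" > "$a_sorted"
    LC_ALL=C sort "$2" > "$b_sorted"

    # comm -23: rows only in the first file; comm -13: rows only in the second
    only_a=$(( $(LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | wc -l) ))
    only_b=$(( $(LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | wc -l) ))
    rm -f "$a_sorted" "$b_sorted"

    echo "only in $1: $only_a"
    echo "only in $2: $only_b"
    [ $((only_a + only_b)) -eq 0 ]
}
```

Use it as compare_exports EVT1 EVT2 for events and compare_exports DAT1 DAT2 for data. Non-zero counts on either side point to rows that exist in one cell but not the other.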

 

 

What other ways do you determine a cell has a sync issue? Let us see your comments and questions.

 


Important Note to BPPM 9.x, 8.x users
BMC has publicly announced the end of life of BPPM 9.6. The announcement was made at the end of 2019, and support will run out at the end of 2020.

Here is the official notice in case you did not receive it:


December 31, 2019 

 

Dear Customer: 

 

BMC is writing to you today as a customer of BMC ProactiveNet Performance Management.

Effective today and according to terms set out on BMC’s Customer Support site (https://www.bmc.com/support/resources/bmc-product-support-policy.html) BMC is announcing the End of Life for the BMC ProactiveNet Performance Management products as listed in the accompanying product table.  

 

BMC will continue to provide Limited Support for the current generally available versions of these products as defined at the Product End of Life section of BMC’s Product Support Policy until the end of support date of December 31, 2020. As per this policy, BMC will continue to develop hot fixes for problems of high technical impact; however, this solution is considered functionally stabilized, and therefore, will not receive further enhancements during this period.  

 

Effective today until the end of support date listed above, BMC will continue to provide customers with active support contracts access to BMC Customer Support as described in this letter. You can continue to interact with BMC Customer Support just like you do today to receive such support. 

Prior to the end of support date, customers with an active support contract have the option to migrate their ProactiveNet licenses to the replacement offerings.  

If you have any questions regarding this transition, please contact your BMC Account Manager at 1-855-834-7487 and press Option 1 for Sales, or BMC Customer Support at 1-800-537-1813. To receive updated information as it becomes available, please subscribe to our Proactive Alerts for ProactiveNet and TrueSight products from our BMC Support Central site. 

 

Thank you.  

 

Sincerely,

 

Ron Coleman, Director, Product Management
BMC Software  


 

 

TrueSight Best Practices Webinar Series

 

Interested in learning more about using TrueSight? If so, keep an eye on this Communities page, which contains the upcoming Webinar details as well as links to past Webinars.

https://communities.bmc.com/docs/DOC-107967?et=watches.email.document

 


Have you been putting off an upgrade?

The BMC Assisted MIGration Offering, or AMIGO, is a program designed to assist our customers in planning and preparing for product upgrades from an older version to a newer supported version.  By engaging with BMC Technical Support Analysts, you will be provided with materials containing guidelines and best practices to aid in compiling your own upgrade plan. An upgrade expert will then review your plan and offer advice and suggestions to ensure success through proper planning and testing.

The AMIGO program consists of a Starter Phase and a Review Phase.  Each phase is initiated by opening a support case and ends when the case is closed.

In the Starter Phase, an AMIGO Starter case is opened.  Reference material will be provided and a call with a Technical Support Analyst will take place to discuss the details of your upgrade and address any questions you may have.  The AMIGO Starter case will be closed, and the next step will be for you to prepare a documented upgrade plan.

In the Review Phase, an AMIGO Review case is opened preferably two weeks prior to a set upgrade date.  A call will be scheduled with an upgrade expert to review your detailed plan, providing feedback and recommendations, along with answers to any outstanding questions.  As needed, a follow up discussion with a Technical Support Analyst may take place for feedback after the upgrade is performed.

The AMIGO program includes:

» A “Question and Answer” session before you upgrade

» A review of your upgrade plan with Customer Support

» An upgrade checklist

» Helpful tips and tricks for upgrade success from previous customer upgrades

» A follow-up session with Customer Support to let them know how it went. This will help BMC to enhance the process.

 

To get started, please review the details here:

https://docs.bmc.com/docs/TSOperations/113/amigo-checklist-for-truesight-operations-management-814553031.html

 

Then open a BMC Support issue containing your environment information (product, version, OS, etc.) and the planned date of the installation, if known. We will contact you promptly, and work with you to ensure a successful and timely

 

 


 

 

 

 

New TrueSight knowledge articles added to the BMC Knowledge Base over the last month:

 

000183822 "Start the previously active node node.company.com first and then start this node" is seen even though node.company.com is running.

000182383 Silent install hang and tsim_server_install_log.txt not created

000183905 How can OpenJDK 11 be added to the TrueSight Repository

000183539 What is the function of the processes associated TSIM

000182339 "The publication was not successful: Publish validation of IM(s) failed" seen when running the publish command

000181448 How to stop the PATROL Agent disconnect event increasing to CRITICAL severity after 5 minutes.

000180946 What are the cell default KB changes between TSIM versions 11.3.02 and 11.3.03

000184596 The event and data count between HA cells is not the same, why is this?

000184568 Linux OS events are being Forced-closed or RATE_CLOSED and immediately reopened again

000184513 Events are propagated between cells that have a firewall between them, but the event flow is very low, therefore the connection is closed due to firewall configuration rules for timing out idle connections.

 

                       


Feedback is always welcome. Let us know what you want to see more of in the blog posts.