
Update Note:

This article is not applicable to TrueSight Capacity Optimization (CO) version 10.0 and later. One of the many enhancements in the CO 10.0 release is that the communication mechanism used by the Remote Scheduling Supervisor has been changed to increase its stability and avoid the scenarios covered in this article, where it became necessary to manually clean the communication channel.

 

In Capacity Optimization (CO), the Datahub's Remote Scheduling Supervisor service communicates periodically with the CO Schedulers. If this communication fails, the CO Schedulers will be flagged with a red ERROR state. They may continue to run jobs that have already been scheduled, but they will no longer respond to new scheduling requests from the CO console.

 

The symptoms associated with the Datahub's Remote Scheduling Supervisor being unable to communicate with the CO Schedulers are varied and cover a broad range of possible problems. This article considers several of the most common cases, where the symptoms include one or more of the following conditions.

 

  • All CO Schedulers are reporting in red ERROR state.
  • When trying to submit a job to a scheduler (or stop a running task), the CO UI responds with the error, "The system was unable to execute the given request. Please check scheduler [Scheduler Name] status."
  • CO Reports fail to complete when executed, and the UI displays a message such as, "The report has been submitted and will be executed immediately. Please refresh to view the results."
  • The CO Scheduler cpit.log repeatedly logs a message like, "[main]- [MIFSchedulerCommander] Performing startup procedure for instance #[ID]" and never reports "[MIFSchedulerCommander] Received startup messages."

 

It is not uncommon for the communication channel between the CO Scheduler and the Remote Scheduling Supervisor to become corrupted when the CO file system fills up. Different symptoms may be visible when this occurs on the CO Application Server (AS) or a CO ETL Engine (EE) server.

 

  • If e-mail alerting is enabled, the most obvious symptom will be e-mail error messages from CO reporting a failure of the Local Monitoring task.

 

  • If e-mail alerting is not enabled, there may be other symptoms, which typically include:
    • CO Analysis and Predict reports failing to generate output, with unexpected errors (such as the time filter not covering a period that contains data, when a quick analysis shows that it does)
    • Analyses failing with the error, "java.io.IOException - message: No space left on device to write file: /path/file"
    • Alert messages such as, "Caplan Scheduler *** ALERT mail *** Task Local monitoring for Default [50] completed with 1 errors"
    • Messages referring to a file system full clean-up

 

If the CO file system fills up, it is a best practice to always clean the CO Scheduler (and dataccum) and CO Datahub working set files to prevent problems with this communication channel. Generally, the steps needed to fix most of the issues indicated by the messages above are shown below and should be followed in the order written. However, if you only need to clear up Datahub communication problems, the Scheduler (and dataccum) portion of the clean-up does not need to be followed as described in this article.

 

  • Stop all the CO components on all machines in the CO instance
  • Clean up the working set files on all machines
  • Restart all the components
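The three steps above can be sketched as a POSIX shell outline. This is an illustration only: it assumes the default /opt/cpit install location, the CO 9.5 SP1/SP2 "cpit clean" syntax shown later in this article, and that "cpit start" restarts a component; DRY_RUN=1 (the default here) only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the stop / clean / restart sequence (assumptions: default
# /opt/cpit location, CO 9.5 SP1/SP2 "cpit clean" syntax, "cpit start").
CO_HOME="${CO_HOME:-/opt/cpit}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Stop all the CO components (repeat on every machine in the instance).
run "$CO_HOME/cpit" stop datahub
run "$CO_HOME/cpit" stop scheduler
run "$CO_HOME/cpit" stop dataccum

# 2. Clean up the working set files on all machines.
run "$CO_HOME/cpit" clean datahub
run "$CO_HOME/cpit" clean scheduler

# 3. Restart all the components once the clean-up is done everywhere.
run "$CO_HOME/cpit" start datahub
run "$CO_HOME/cpit" start scheduler
run "$CO_HOME/cpit" start dataccum
```

Remember that the clean-up must be completed on every machine in the CO instance before any component is restarted.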

 

Restoring the communication channel between the CO Datahub and the CO Schedulers is considered a best practice for resolving the problems associated with the aforementioned symptoms. When the communication channel is down, the remote schedulers will queue the status messages that they want to send back to the CO console. As a result, when the communication channel is re-established, there can be a spike in traffic which saturates the channel and takes time to clear (or causes the original communication problem to continue).

 

Cleaning the Datahub component only will frequently resolve this communication problem, and can be a quick path to recovery. If this resolves the problems, you will not need to execute the Scheduler clean-up steps listed later. The Scheduler (and dataccum) clean-up is used when the CO Remote Scheduling Supervisor communication channel has been corrupted (for example, after a file-system-full problem).

 

To clean the Datahub, stop all the components (including the scheduler) and clean up the corrupted runtime files:

  • Access the AS via ssh as the CO OS user.
  • Change directory to the CO home folder (usually /opt/cpit):

cd /[CO Installation Directory]

Try to stop the CO DWH (Datahub) in a clean way:

./cpit stop Datahub

  • Wait five minutes or until the countdown ends.
  • Ensure there are no other DWH jboss processes stuck, using these commands:

ps -ef | grep jboss

Issue a kill -9 $PIDNUMBER for every remaining jboss process

  • Ensure there are no run.sh processes stuck, using these commands:

ps -ef | grep run.sh

Issue a kill -9 $PIDNUMBER for every remaining run.sh
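The same ps / grep / kill pattern repeats throughout this procedure for jboss, run.sh, scheduler, and dataccum. As an illustration only (kill_stuck is a hypothetical helper, not a command shipped with CO), it can be wrapped in a small POSIX shell function:

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): find and kill any
# leftover processes matching a pattern after "cpit stop" has finished.
kill_stuck() {
    pattern="$1"
    # "grep -v grep" drops the grep command itself from the ps listing.
    pids=$(ps -ef | grep "$pattern" | grep -v grep | awk '{print $2}')
    if [ -z "$pids" ]; then
        echo "no stuck $pattern processes"
        return 0
    fi
    for pid in $pids; do
        echo "killing stuck $pattern process $pid"
        kill -9 "$pid"
    done
}

# Example usage, one pattern per clean-up step:
#   kill_stuck jboss
#   kill_stuck run.sh
#   kill_stuck scheduler
#   kill_stuck dataccum
```

Take care that the pattern is specific enough not to match unrelated processes before sending kill -9.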

Execute these commands to clean-up the Datahub directories on the AS (path relative to the base CO Installation Directory):

  • CO 9.5 SP1 and SP2:

./cpit clean datahub

  • CO 9.0:

rm -rf datahub/jboss/server/all/data/kahadb/*

rm -rf datahub/jboss/dlq_messages/*

rm -rf datahub/jboss/server/all/tmp/*

rm -rf datahub/jboss/server/all/data/tx-object-store/*

  • CO 4.5:

rm -rf repository/kahadb/*

rm -rf datahub/jboss/server/all/data/kahadb/*

rm -rf datahub/jboss/not_processed_messages/*

rm -rf datahub/jboss/dlq_messages/*

  • CO 4.0:

rm -rf datahub/jboss/server/default/data/data/*

rm -rf datahub/jboss/server/default/data/tx-object-store/*

rm -rf datahub/jboss/dlq_messages/*

rm -rf datahub/jboss/server/default/tmp/*

rm -rf datahub/jboss/bin/activemq-data/localhost/*

 

NOTE: If you are using these steps to fix the Datahub after a machine migration, or after copying the Datahub, check that these files contain no pointers to other machines' hostnames:

datahub/jboss/server/all/deploy/cluster-service.xml

datahub/jboss/server/all/deploy/activemq-ra.rar/META-INF/ra.xml
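A quick way to check those two files after a migration is a grep from the CO installation directory. The snippet below is illustrative only: check_stale_host is a hypothetical helper, and oldserver.example.com stands in for the previous machine's hostname.

```shell
#!/bin/sh
# Illustrative check (not a product command): report whether any of the
# given Datahub configuration files still reference an old hostname.
check_stale_host() {
    host="$1"; shift
    # grep -l prints each file that contains the hostname.
    if grep -l "$host" "$@" 2>/dev/null; then
        echo "stale hostname references found; edit the files listed above"
    else
        echo "no stale hostname references"
    fi
}

# Example usage from the CO installation directory (hostname is a placeholder):
# check_stale_host oldserver.example.com \
#     datahub/jboss/server/all/deploy/cluster-service.xml \
#     datahub/jboss/server/all/deploy/activemq-ra.rar/META-INF/ra.xml
```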

 

To clean the Scheduler, you will also need to follow these steps:

  • Stop the AS Scheduler:

cd /[CO Installation Directory]

./cpit stop scheduler

  • Check that there are no other schedulers stuck, using these commands:

ps -ef | grep scheduler

Issue a kill -9 $PIDNUMBER for every remaining scheduler

  • Clean up the scheduler task configuration directories on the AS.

Execute these commands to clean the Scheduler directories on the AS (path relative to the base CO Installation Directory):

  • CO 9.5 SP1 and SP2:

./cpit clean scheduler

  • CO 9.5:

rm -rf scheduler/task/*

rm -rf scheduler/mif/notdelivered/*

rm -rf scheduler/localdb/*

  • CO 9.0, 4.5, 4.0:

rm -rf scheduler/task/*

rm -rf scheduler/mif/notdelivered/*

rm -rf scheduler/localdb/*

Execute these commands to clean the dataccum directories on the AS:

  • Stop the AS dataccum:

cd /[CO Installation Directory]

./cpit stop dataccum

  • Ensure there are no other dataccum processes stuck, using these commands:

ps -ef | grep dataccum

Issue a kill -9 $PIDNUMBER for every remaining dataccum

  • Access the EE via ssh as the CO OS user.

  • Stop the EE scheduler:

cd /[CO Installation Directory]

./cpit stop scheduler

  • Ensure there are no other schedulers stuck on the EE, using these commands:

ps -ef | grep scheduler

Issue a kill -9 $PIDNUMBER for every remaining scheduler on the EE

  • Clean up the scheduler task configuration directories on the EE.

Execute these commands to clean the Scheduler directories on the EE (path relative to the base CO Installation Directory):

  • CO 9.5 SP1 and SP2:

./cpit clean scheduler

  • CO 9.5:

rm -rf scheduler/task/*

rm -rf scheduler/mif/notdelivered/*

rm -rf scheduler/localdb/*

  • CO 9.0, 4.5, 4.0:

rm -rf scheduler/task/*

rm -rf scheduler/mif/notdelivered/*

rm -rf scheduler/localdb/*

Execute these commands to clean the dataccum directories on the EE:

  • Stop the EE dataccum:

cd /[CO Installation Directory]

./cpit stop dataccum

  • Ensure there are no other dataccum processes stuck, using these commands:

ps -ef | grep dataccum

Issue a kill -9 $PIDNUMBER for every remaining dataccum

  • Now, check the ETL and chain status:

In the UI, go to Administration -> Scheduler -> ETLs and Administration -> Scheduler -> System tasks and look for RUNNING tasks that might be stuck. Note their IDs and then force them to be ended using SQL within the CO database:

update task_status set status= 'ENDED' where taskid in (XX,XX2,XX3)

  • Restart the components you stopped to restore the functionality ON BOTH MACHINES
  • Run the "Component status checker" task
  • Wait at least a minute, and then access Administration > System > Status to check the status.

 

 

We hope you found this article useful. This content is also available as: Steps to recover CO functionality when the CO Schedulers are unable to properly communicate with the Datahub Remote Scheduling Supervisor service

 

Knowledge Article ID:    KA350370  https://bmcsites.force.com/casemgmt/sc_KnowledgeArticle?sfdcid=000031595

 

 

 

Miss a blog?  See BMC TrueSight Support Blogs