1 Reply Latest reply on Oct 16, 2018 12:28 AM by Jon Trotter

    Clean way to kill a hung backup job?

    Jon Trotter

      11.2.0.2

      August TKU/Storage TKU

      tideway-appliance-6.18.01.30-727098.centos6.x86_64

      tideway-11.2.0.2-722067.centos6.ga.x86_64

       

I tried running a backup from the command line using a script we have in place that runs automatically each weekend on two other environments without issue. Within a short time it showed errors and acted like it was going to restart, but three hours later it still has not started. I checked the logs, and it seems all the nodes were having trouble communicating with the coordinator at the time of the failure; I'm not sure why, since the coordinator is reachable via a PuTTY session. In the coordinator's cluster.out log, this entry is present:

       

      WARNING: User 'auto' does not have permission for operation 'model/audit/write'.

       

      This is from the tw_backup.log:

       

      Coordinator

This command is still running on the coordinator, where the backup was started.

       

      tideway   2458  1829  0 13:20 ?        00:00:00 /bin/bash /usr/tideway/data/installed/startup/16model_init --callback IOR:010000003a00000009010100010000000901010002549444c3a746964657761792e636f6d2f436c7573746572434f5242412f536572766961312e312e3400005262000003654f7065726174696f6e43616c6c6261636b3a312e300bd000000010102000b00000031302e3231312e312e340000526200000e000000fe54e9765b00000725000000003a00000300000000000000080000000100000000545441010000001c000000010000000100010501000000010001054544151000000010000001b57761792f746000000746c762d6169362d613030312e66616e6e69656d61652e636f6d0000290000004775f7376635f636c757372f746d702f6f6d6e692d74696465725f6d616e6167657200 start

       

      140073289467648: 2018-10-15 12:47:06,797: backup.manager: INFO: waitForProceed from 'stop services' (60 secs)

      140073289467648: 2018-10-15 12:47:23,150: backup.manager: ERROR: Problem performing db_recover on datastore

      140073289467648: 2018-10-15 12:47:23,150: backup.manager: INFO: Record state 'create backup' completed on uuid a553e300000000000000444444

      140072973797120: 2018-10-15 12:47:25,115: backup.backup_common: INFO: Task: Starting Services

       

      Node 2

The backup command still shows as running on this node. No other nodes show any commands still processing.

       

      tideway  28953     1  2 12:44 ?        00:04:21 python /usr/tideway/python/misc/backup/main.pyc tw_backup -u auto --passwordfile=/usr/tideway/autopass.txt --reduce-db --backup-local --overwrite --stop-services --daemon=start --security-token=auto:4e365db7d752000004a3f47165db7d7528c0a68d3 --timestamp=137589146856016139 --distributed-operation=f646028e001045b2c73638bcbfaad145

       

      139883265246976: 2018-10-15 13:13:06,775: backup.manager: INFO: Performed 'create backup', next 'start services'

      139883265246976: 2018-10-15 13:13:06,775: backup.manager: INFO: Record state 'create backup' completed on uuid b553e300000000000000444444

      139883265246976: 2018-10-15 13:13:06,862: backup.manager: INFO: waitForProceed from 'create backup' (0 secs)

      139883265246976: 2018-10-15 13:13:06,864: backup.manager: ERROR: TEST Discovery 11-01: Problem performing db_recover on datastore

      139882960066304: 2018-10-15 13:13:08,755: backup.backup_common: INFO: Task: Starting Services

       

      Node 3

      140657932896000: 2018-10-15 13:12:26,157: backup.manager: INFO: Performed 'create backup', next 'start services'

      140657932896000: 2018-10-15 13:12:26,157: backup.manager: INFO: Record state 'create backup' completed on uuid c553e300000000000000444444

      140657932896000: 2018-10-15 13:12:26,203: backup.manager: INFO: waitForProceed from 'create backup' (0 secs)

      140657932896000: 2018-10-15 13:12:26,206: backup.manager: INFO: Waiting for other members to complete create backup tasks

      140657932896000: 2018-10-15 13:12:36,216: backup.manager: INFO: waitForProceed from 'create backup' (10 secs)

      140657932896000: 2018-10-15 13:12:46,232: backup.manager: INFO: waitForProceed from 'create backup' (20 secs)

      140657932896000: 2018-10-15 13:12:56,245: backup.manager: INFO: waitForProceed from 'create backup' (30 secs)

      140657932896000: 2018-10-15 13:13:06,259: backup.manager: INFO: waitForProceed from 'create backup' (40 secs)

      140657932896000: 2018-10-15 13:13:16,265: backup.manager: INFO: waitForProceed from 'create backup' (50 secs)

      140657932896000: 2018-10-15 13:19:16,768: misc.monitored_operations: INFO: proceedMonitoredOperation - Cluster exception Failed to contact TEST Discovery 11-01 after 3 attempts

       

      Node 4

      139672136955648: 2018-10-15 13:12:39,076: backup.manager: INFO: Performed 'create backup', next 'start services'

      139672136955648: 2018-10-15 13:12:39,076: backup.manager: INFO: Record state 'create backup' completed on uuid d553e300000000000000444444

      139672136955648: 2018-10-15 13:12:39,092: backup.manager: INFO: waitForProceed from 'create backup' (0 secs)

      139672136955648: 2018-10-15 13:12:39,095: backup.manager: INFO: Waiting for other members to complete create backup tasks

      139672136955648: 2018-10-15 13:12:49,099: backup.manager: INFO: waitForProceed from 'create backup' (10 secs)

      139672136955648: 2018-10-15 13:12:59,112: backup.manager: INFO: waitForProceed from 'create backup' (20 secs)

      139672136955648: 2018-10-15 13:13:09,119: backup.manager: INFO: waitForProceed from 'create backup' (30 secs)

      139672136955648: 2018-10-15 13:13:09,123: backup.manager: ERROR: TEST Discovery 11-01: Problem performing db_recover on datastore

       

While this may have been part of the issue, it doesn't change the fact that the system is currently unavailable. Is there a way to cleanly shut down the backup process? Will killing the processes cause additional issues, or is that the only way to recover?
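The per-node checks above boil down to filtering `ps -ef` for the backup processes. A minimal sketch of that filter (it runs against a captured listing here so it works anywhere; on the appliance you would pipe `ps -ef` directly, and the process names are taken from the output quoted above):

```shell
# Filter a ps listing for the stuck backup processes on a node. This demo
# uses a captured listing so it stays runnable; on the appliance, pipe
# `ps -ef` into the same grep instead.
ps_listing='tideway  28953     1  2 12:44 ?  00:04:21 python /usr/tideway/python/misc/backup/main.pyc tw_backup -u auto --daemon=start
tideway   1234   999  0 12:00 ?  00:00:00 /usr/sbin/sshd'

# Match either the tw_backup wrapper script or the backup daemon itself,
# and nothing else.
echo "$ps_listing" | grep -E 'tw_backup|backup/main\.pyc'
```

Running this on each cluster member shows which nodes still have a backup process alive before deciding whether to kill anything.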

        • 1. Re: Clean way to kill a hung backup job?
          Jon Trotter

I ran the fix-interrupted operation on the coordinator and basically had to kill all the processes still running on the coordinator, plus the one backup process still running on the cluster member. Once all nodes showed services not running, I was able to restart services locally on each member and log in. Only one role for reports was missing.
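A hedged sketch of that recovery sequence, for anyone following along. `tw_backup` and `tw_service_control` are real appliance utilities, but the exact option spellings below are assumptions based on this thread; check `--help` on your version before running anything for real. `DRY_RUN=1` keeps the script printing commands instead of executing them:

```shell
# Sketch of the recovery steps, not an exact transcript. Option spellings
# are assumptions; verify against the appliance's own --help output.
DRY_RUN=1   # print commands instead of executing them
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# 1. On the coordinator, clear the interrupted-backup state.
run /usr/tideway/bin/tw_backup --fix-interrupted

# 2. Kill any backup processes still running (the coordinator and the one
#    cluster member that still showed the backup command).
run pkill -f 'backup/main.pyc'

# 3. Once every node shows services not running, restart them locally on
#    each member.
run /usr/tideway/bin/tw_service_control --start
```

Flip `DRY_RUN` to 0 only after confirming each command against your appliance's documentation.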

          1 of 1 people found this helpful