8 Replies Latest reply on Jun 12, 2018 6:35 AM by Jacek Szlaczka

    ADDM crashes 3-5 minutes after starting

    Jacek Szlaczka

      Hi!

       

      I am trying to figure out why our QA instance of ADDM crashes shortly after a service restart. It looks like /usr/tideway/var/localdisk/tideway.db/logs is filling up with many binary logs, which end up consuming 100% of the space on /usr/ (~25GB increase within 5 minutes of starting the services).

      Anyway, let me paste the details of the support case I created, because we are not getting any answers there, so maybe some of you will be able to help us investigate:

       

      Begin quote:

       

       

      We are experiencing a very strange ADDM behavior which I think is related to the model service, let me explain.

      1. 3 days ago
      I received a phone call that our ADDM QA instance had crashed, with the following error showing in the UI:
      ===================================================
      The cluster has shut down because there is insufficient available disk space on ADDM QA CLUSTER-01 (Address: 153.112.6.103).
      ===================================================

      However, when checking available disk space, everything seems fine:

      [tideway@server~]$ df -h
      Filesystem Size Used Avail Use% Mounted on
      /dev/sda7 960M 264M 645M 30% /
      tmpfs 24G 0 24G 0% /dev/shm
      /dev/sda1 477M 35M 417M 8% /boot
      /dev/sda6 1.4G 2.7M 1.3G 1% /home
      /dev/sda5 1.9G 119M 1.7G 7% /tmp
      /dev/sda8 37G 7.5G 28G 22% /usr
      /dev/sda3 2.4G 534M 1.7G 24% /var
      /dev/sdb1 992G 667G 275G 71% /mnt/addm/db_data

      (same situation on second cluster node).


      2. After restarting services on both servers, we are able to access ADDM for maybe 1-2 minutes before it's impossible to log in.

      3. I have noticed that the model service is VERY active right after restart and is consuming ~100% CPU

      4. Another strange thing is that /usr/tideway/var/localdisk/tideway.db/logs sees ENORMOUS growth in the 3 minutes after restart. I am attaching terminal output to show how fast it grows.

      As a result, the /usr/ disk fills up after 3-5 minutes and ADDM crashes. After the crash, /usr/ disk space usage goes back to normal (see paragraph #1).
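
      To capture how quickly the log directory grows relative to the free space on /usr/, a small sampling script can help. The following is a minimal Python sketch, not an ADDM tool: the function names (dir_size_bytes, seconds_until_full, watch) are made up for illustration, and the linear time-to-full estimate is a simplifying assumption.

```python
import os
import shutil
import time


def dir_size_bytes(path):
    """Sum the sizes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A log file may disappear between listing and stat.
                pass
    return total


def seconds_until_full(free_bytes, growth_bytes, interval_s):
    """Linear estimate of the time left until the disk is full.

    Returns None when the directory did not grow during the interval.
    """
    if growth_bytes <= 0:
        return None
    return free_bytes / (growth_bytes / interval_s)


def watch(log_dir, mount_point, interval_s=10, samples=30):
    """Print directory size, free space and a rough ETA once per interval."""
    previous = dir_size_bytes(log_dir)
    for _ in range(samples):
        time.sleep(interval_s)
        current = dir_size_bytes(log_dir)
        free = shutil.disk_usage(mount_point).free
        eta = seconds_until_full(free, current - previous, interval_s)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} logs={current / 2**30:.2f}GiB "
              f"free={free / 2**30:.2f}GiB eta_s={eta}")
        previous = current
```

      Calling watch("/usr/tideway/var/localdisk/tideway.db/logs", "/usr") right after a service restart would print one line per interval, which is roughly what the attached timestamped df -h and du output shows.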


      I suppose that a model (for us, patterns/models are developed on QA) could have caused this, but we first need to identify the root cause and how to solve it.
      Extending the disk is not a good idea in my opinion, because during normal usage /usr/ is only 25-30% full, as you can see.

      5. We also see similar /usr/ disk space usage on production, without it going crazy at any time.
      Below you can see disk usage during normal work:

      [a147721@server_prod~]$ df -h
      Filesystem Size Used Avail Use% Mounted on
      /dev/sda7 960M 524M 386M 58% /
      tmpfs 16G 0 16G 0% /dev/shm
      /dev/sda1 74M 34M 36M 49% /boot
      /dev/sda6 1.5G 2.8M 1.4G 1% /home
      /dev/sda5 1.9G 1.6G 264M 86% /tmp
      /dev/sda8 38G 18G 19G 48% /usr
      /dev/sda3 2.4G 606M 1.7G 27% /var
      /dev/sdb1 992G 645G 296G 69% /mnt/addm/db_data

      Please help us find the root cause and get it resolved.

      Thanks,
      Jacek

       

      --- End quote

       

      I am also attaching the terminal output after the service restart (with timestamps of df -h) and the 'du' commands, if you scroll down.

       

      Any ideas? Any questions? I would be very thankful if someone could help us out.

       

       

      Cheers!

      Jacek

        • 1. Re: ADDM crashes 3-5 minutes after starting
          Andrew Waters

          What is being reported in the model log, tw_svc_model.log?

          • 2. Re: ADDM crashes 3-5 minutes after starting
            Jacek Szlaczka

            Hi,

             

            There are a lot of searches like this:

            139909808563968: 2018-05-28 14:44:56,590: model.search.servants: INFO: Search 4271 ([eca_engine]): SEARCH HardwareDetail WHERE (((serial = 'CAT1002R5RQ') AND (vendor = 'Cisco Systems')) AND (type = 'I/O Module')) SHOW

            139909808563968: 2018-05-28 14:44:56,592: model.search.servants: INFO: Completed search 4271: SEARCH HardwareDetail WHERE (((serial = 'CAT1002R5RQ') AND (vendor = 'Cisco Systems')) AND (type = 'I/O Module')) SHOW

            139909808563968: 2018-05-28 14:44:56,599: model.search.servants: INFO: Search 4272 ([eca_engine]): SEARCH HardwareDetail WHERE (((__vendor = 'IBM') AND (type = 'I/O Module')) AND (switchmodule_mac IN ['00:16:c8:02:f9:00', '00:16:c8:02:f9:01', '00:16:c8:02:f9:02', '00:16:c8:02:f9:03', '00:16:c8:02:f9:04', '00:16:c8:02:f9:05', '00:16:c8:02:f9:06', '00:16:c8:02:f9:07', '00:16:c8:02:f9:08', '00:16:c8:02:f9:09', '00:16:c8:02:f9:0a', '00:16:c8:02:f9:0b', '00:16:c8:02:f9:0c', '00:16:c8:02:f9:0d', '00:16:c8:02:f9:0e', '00:16:c8:02:f9:0f', '00:16:c8:02:f9:10', '00:16:c8:02:f9:11', '00:16:c8:02:f9:12', '00:16:c8:02:f9:13', '00:16:c8:02:f9:14', '00:16:c8:02:f9:15', '00:16:c8:02:f9:16', '00:16:c8:02:f9:17', '00:16:c8:02:f9:18', '00:16:c8:02:f9:19', '00:16:c8:02:f9:1a', '00:16:c8:02:f9:40', '00:16:c8:02:f9:41'])) SHOW

            139909808563968: 2018-05-28 14:44:56,602: model.search.servants: INFO: Completed search 4272: SEARCH HardwareDetail WHERE (((__vendor = 'IBM') AND (type = 'I/O Module')) AND (switchmodule_mac IN ['00:16:c8:02:f9:00', '00:16:c8:02:f9:01', '00:16:c8:02:f9:02', '00:16:c8:02:f9:03', '00:16:c8:02:f9:04', '00:16:c8:02:f9:05', '00:16:c8:02:f9:06', '00:16:c8:02:f9:07', '00:16:c8:02:f9:08', '00:16:c8:02:f9:09', '00:16:c8:02:f9:0a', '00:16:c8:02:f9:0b', '00:16:c8:02:f9:0c', '00:16:c8:02:f9:0d', '00:16:c8:02:f9:0e', '00:16:c8:02:f9:0f', '00:16:c8:02:f9:10', '00:16:c8:02:f9:11', '00:16:c8:02:f9:12', '00:16:c8:02:f9:13', '00:16:c8:02:f9:14', '00:16:c8:02:f9:15', '00:16:c8:02:f9:16', '00:16:c8:02:f9:17', '00:16:c8:02:f9:18', '00:16:c8:02:f9:19', '00:16:c8:02:f9:1a', '00:16:c8:02:f9:40', '00:16:c8:02:f9:41'])) SHOW

            139909808563968: 2018-05-28 14:44:56,609: model.search.servants: INFO: Search 4273 ([eca_engine]):

             

             

             

            Notepad++ found 6997 hits similar to this.

             


            Also a lot of searches like this:

             

            SEARCH DiscoveryAccess

                        WHERE endpoint = '131.97.230.12'

                          AND _last_interesting IS DEFINED

                          AND range_prefix = ''

                        SHOW

            139909787584256: 2018-05-28 14:44:51,555: model.search.servants: INFO: Search 4053 ([eca_engine]):

                        SEARCH DiscoveryAccess

                        WHERE endpoint = '10.77.153.20'

                          AND _last_interesting IS DEFINED

                          AND range_prefix = ''

                        SHOW

             

             

            I will upload the full log file somewhere, because I am not sure what is "abnormal" for the model log.
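
            One way to see which searches dominate the model log is to tally them by node kind. This is a minimal Python sketch that assumes every started search is logged with a "model.search.servants: INFO: Search N" header followed by the SEARCH statement, either on the same line or on a later line, as in the excerpts above. The function name count_search_kinds is hypothetical, and "Completed search" entries are skipped so each search is counted once.

```python
import re
from collections import Counter

# Header written when a search starts or completes (format taken from the excerpts).
HEADER = re.compile(r"model\.search\.servants: INFO: (Search|Completed search) \d+")
# The node kind is the first word after the upper-case SEARCH keyword.
KIND = re.compile(r"\bSEARCH\s+(\w+)")


def count_search_kinds(lines):
    """Count started searches per node kind, e.g. HardwareDetail."""
    counts = Counter()
    counting = False  # True between a "Search N" header and its SEARCH statement
    for line in lines:
        header = HEADER.search(line)
        if header:
            counting = header.group(1) == "Search"
        kind = KIND.search(line)
        if kind and counting:
            counts[kind.group(1)] += 1
            counting = False
    return counts
```

            Running it over tw_svc_model.log, for example with open("tw_svc_model.log", errors="replace") as the line source, and printing counts.most_common(10) would show whether HardwareDetail or DiscoveryAccess lookups make up most of the activity.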

            • 3. Re: ADDM crashes 3-5 minutes after starting
              Jacek Szlaczka

              Attaching full model log.

              • 4. Re: ADDM crashes 3-5 minutes after starting
                Andrew Waters

                That looks like the datastore is trying to purge some history, and the purge generates lots of changes (hence large transaction logs) in the short interval while it is running. However, the log does mention performing checkpoints, so it should be tidying them up.

                 

                Have you changed the purge history timeout recently?

                • 5. Re: ADDM crashes 3-5 minutes after starting
                  Jacek Szlaczka

                  No, we have not changed any of that recently :/ So, if the transaction log is getting full, is our only option to expand the disk? Or is there something else we can do to avoid this kind of behavior?

                  • 6. Re: ADDM crashes 3-5 minutes after starting
                    Jacek Szlaczka

                    Hi,

                     

                    It looks like our database is corrupted. This is what we got after running db_verify:

                    [tideway@segotl0849 ~]$ grep BAD  /tmp/db_verifyResultFile.txt

                    db_verify: p0002_rInference_pidx: BDB0090 DB_VERIFY_BAD: Database verification failed

                     

                     

                     

                     

                    Can we just remove the p0002_rInference_pidx file? We don't need inference data and want to get the appliance up as soon as possible.

                     

                    Cheers

                    • 7. Re: ADDM crashes 3-5 minutes after starting
                      Brice-Emmanuel Loiseaux

                      No, this is not a valid solution.

                      If only _pidx files are corrupted, then the solution is to reindex the datastore:

                      - touch /usr/tideway/var/.ds_reindex_request

                      - restart tideway services
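
                      The check and the first step can be sketched in Python. This is only an illustration: it assumes the db_verify result file contains lines in the format shown earlier in this thread, and the helper names (failed_databases, only_index_files, request_reindex) are hypothetical. Restarting the tideway services is left to the operator.

```python
import re
from pathlib import Path

# Matches lines such as:
# db_verify: p0002_rInference_pidx: BDB0090 DB_VERIFY_BAD: Database verification failed
BAD_LINE = re.compile(r"^db_verify:\s*([^:\s]+):.*DB_VERIFY_BAD")


def failed_databases(report_text):
    """Return the database names that db_verify reported as bad."""
    names = []
    for line in report_text.splitlines():
        match = BAD_LINE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def only_index_files(names):
    """True when there is at least one failure and all failures are _pidx files."""
    return bool(names) and all(name.endswith("_pidx") for name in names)


def request_reindex(flag_path="/usr/tideway/var/.ds_reindex_request"):
    """Create the reindex request file (same effect as the touch command)."""
    Path(flag_path).touch()
```

                      A cautious flow would read /tmp/db_verifyResultFile.txt, call request_reindex() only if only_index_files() returns True, and otherwise stop and involve support.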

                      • 8. Re: ADDM crashes 3-5 minutes after starting
                        Jacek Szlaczka

                        Alright, we have managed to get this one resolved with the help of BMC support.

                        We had to move the transaction logs to a different location, because /usr/ kept getting swamped with transaction logs and we had no spare disk.

                         

                        After doing that and performing a reindex, the problem was solved.
