19 Replies · Latest reply on Aug 6, 2015 11:53 AM by Clement BARRET

    [ NSH ] fork failed: not enough core / *PLEASE HELP*

    Clement BARRET

      Hi Bill Robinson,

       

You might know how to solve this one... This issue is proving really hard for me to work around, and it is critical for us. I suspect it's not directly related to the BladeLogic components but rather to the server's configuration... anyway.

       

We have two BL app servers running the BL job thread pools. They are running perfectly fine.

       

We have another server whose only purpose is to run NSH scripts (outside of any jobs), either cron-based or triggered by our own user web portal. Let's call this server BLNSH.

       

      So.

       

On this BLNSH server, we run a bunch of NSH scripts in the background from cron; they watch for input files that trigger actions. Let's call them WORKERS.

       

Everything worked fine until recently, when we first encountered a VERY annoying issue that I cannot work around or fix.

       

Sometimes (and I suspect it's when a user starts a bunch of unrelated actions/scripts on this BLNSH server), my WORKERS just crash, printing this message: "fork failed: not enough core"

       

      prepare_to_execute_generes_job_and_wait:40: fork failed: not enough core

       

So basically, the WORKER code does nothing fancy: some sleep, then a test for whether an input file is there, then some sleep again, and so on. A simple, basic loop.
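For illustration, that loop can be sketched in plain shell as follows. This is a hypothetical reconstruction: the real generes_worker_daemon.nsh is not shown in this thread, and the input-file naming (`*.req`) and the handler are made up here.

```shell
# Hypothetical sketch of the worker loop described above.
# poll_once scans a directory for input files and "handles" each one;
# here the handler just renames the file, where the real worker would
# run blcli commands to trigger its dedicated job.
poll_once() {
    dir="$1"
    for f in "$dir"/*.req; do
        [ -e "$f" ] || continue     # glob matched nothing
        mv "$f" "$f.done"           # placeholder for the real handler
    done
}

# The daemon itself would loop forever:
#   while :; do poll_once /some/input/dir; sleep 30; done
```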

       

I tried modifying /etc/security/limits.conf to increase the hard and soft nproc limits (see the attached limits.conf file).

I tried 32768, then 4096, with no success, restarting crond each time so the changes would be taken into account.
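For reference, nproc overrides of that kind take this form in /etc/security/limits.conf (the bladmin user and the 4096 value are the ones mentioned in this thread; the exact domain used in the attached file is not shown):

```
# /etc/security/limits.conf
# <domain>   <type>   <item>   <value>
bladmin      soft     nproc    4096
bladmin      hard     nproc    4096
```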

       

       

      % ulimit -a

      -t: cpu time (seconds)         unlimited

      -f: file size (blocks)         unlimited

      -d: data seg size (kbytes)     unlimited

      -s: stack size (kbytes)        10240

      -c: core file size (blocks)    unlimited

      -m: resident set size (kbytes) unlimited

      -u: processes                  4096

      -n: file descriptors           32768

      -l: locked-in-memory size (kb) 64

      -v: address space (kb)         unlimited

      -x: file locks                 unlimited

      -N 11:                         30502

      -N 12:                         819200

       

This is the NSH version:

       

      % version

      BladeLogic RSCD Agent 8.3.02.452 (Release) [Apr  6 2015 10:53:13]

      Copyright (C) 1996-2012 BladeLogic Inc.

      BladeLogic Network Shell 8.3.02.452 (Release) [Apr  6 2015 10:53:13]

      Copyright (C) 1996-2012 BladeLogic Inc.


      This is a RHEL server.


      % cat /etc/redhat-release

      Red Hat Enterprise Linux Server release 6.2 (Santiago)


      % uname -a

      Linux blnsh01 2.6.32-504.8.1.el6.x86_64 #1 SMP Fri Dec 19 12:09:25 EST 2014 x86_64 x86_64

       

       

So, I wonder if you could tell me which limit I should raise, or whether I should just try setting everything to "unlimited" to solve this issue...

       

      Any help will be much appreciated.

       

      Best regards,

        • 1. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
          Bill Robinson

Are these scripts running as the 'bladmin' OS user? And is this a BSA appserver, or is it only running things via cron/interactive/etc.?

Is there anything in /etc/security/limits.d?

          • 2. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
            Rajeev Gupta

This happens when the system runs out of memory.

Since this is a RHEL machine, you can check the process spawner configuration:

blasadmin -s default show proc spawn

If the output is True, then check spawner.log in the br directory under NSH.

Make sure the entries in it are being updated. If it is stale, then run

            /etc/init.d/blprocserv stop

            /etc/init.d/blprocserv start

then check the log again to see whether it gets updated after running an NSH job.


            More:

You can also increase the Java heap size on your RHEL server, which helps if the issue is not with the spawner. If it is still not fixed, then increase the amount of core/memory.

            • 3. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
              Bill Robinson

              It doesn’t seem like this system is an appserver since these scripts are running via cron or some other process, so the spawner will not help here.

              • 4. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                Rajeev Gupta

Oh ok.. but I guess increasing the Java heap/core/swap memory should fix this.

                • 5. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                  Clement BARRET

                  Bill Robinson

                   

                  1) Yes

                   

                  2) Only via cron/interactive/etc. (as stated)

                   

3) Yes. I hadn't checked that directory; there is only one file, called 90-nproc.conf.

                   

Here is its content:

                   

                  % cat /etc/security/limits.d/90-nproc.conf

                  # Default limit for number of user's processes to prevent

                  # accidental fork bombs.

                  # See rhbz #432903 for reasoning.

                   

                  *          soft    nproc     1024

                   

                   

Do the values in this file supersede the ones in /etc/security/limits.conf?

                   

I ask because when I run "ulimit -a > /tmp/cron_ulimit_check.txt" from the bladmin user's crontab, I get the proper limits (the ones listed in my previous post, matching the values I put in limits.conf).
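For what it's worth: files under /etc/security/limits.d/ are read after limits.conf, but pam_limits also gives an explicit per-user entry precedence over a `*` wildcard entry, so one common fix for the RHEL 6 90-nproc.conf cap is to add a user-specific line (the bladmin account and the 4096 value here are just the ones from this thread):

```
# /etc/security/limits.d/90-nproc.conf
*          soft    nproc     1024
# An explicit user entry takes precedence over the * wildcard:
bladmin    soft    nproc     4096
```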

                  • 6. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                    Clement BARRET

Here are my latest values (I'll keep you posted on whether this fixes my crashing issues).

                     

                    % ulimit -a

                    -t: cpu time (seconds)         unlimited

                    -f: file size (blocks)         unlimited

                    -d: data seg size (kbytes)     unlimited

                    -s: stack size (kbytes)        unlimited

                    -c: core file size (blocks)    unlimited

                    -m: resident set size (kbytes) unlimited

                    -u: processes                  8192

                    -n: file descriptors           65535

                    -l: locked-in-memory size (kb) unlimited

                    -v: address space (kb)         unlimited

                    -x: file locks                 unlimited

                    -N 11:                         30502

                    -N 12:                         819200

                     

Fingers crossed...

                    • 7. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                      Sean Berry

                      Are you running blcli on this machine as well as NSH?

                      • 8. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                        Clement BARRET

Well, once a worker receives an input file, it does indeed trigger its dedicated job execution (one job per worker) using some blcli commands.

                         

But the error occurs while nothing is being sent to the workers' inputs, so they are just "sleeping", then checking whether an input file is there, then sleeping again, etc.

                        • 9. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                          Bill Robinson

                          How many workers do you have running at any given time, and how much memory on the box ?

                          • 10. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                            Clement BARRET

                            10 workers simultaneously

                             

and I guess the server does not have enough RAM... (4 GB)

                             

                            % free -k

                                         total       used       free     shared    buffers     cached

                            Mem:       3924636    3836408      88228          0       4508      79804

                            -/+ buffers/cache:    3752096     172540

                            Swap:      1048572     885228     163344

                             

                             

I found the "culprit" (we had already noticed that this bunch of processes was causing the crash):

                             

                             

                            Out of memory: Kill process 6325 (modifyServerPro) score 35 or sacrifice child

                            Killed process 13123, UID 30000, (modifyServerPro) total-vm:1308336kB, anon-rss:107592kB, file-rss:72kB

                            modifyServerPro invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0

                            modifyServerPro cpuset=/ mems_allowed=0

                             

                             

                            % dmesg | grep Kill

                            Out of memory: Kill process 6325 (modifyServerPro) score 35 or sacrifice child

                            Killed process 13123, UID 30000, (modifyServerPro) total-vm:1308336kB, anon-rss:107592kB, file-rss:72kB

                            Out of memory: Kill process 6325 (modifyServerPro) score 36 or sacrifice child

                            Killed process 13141, UID 30000, (modifyServerPro) total-vm:1308336kB, anon-rss:121668kB, file-rss:52kB

                            Out of memory: Kill process 6325 (modifyServerPro) score 36 or sacrifice child

                            Killed process 13144, UID 30000, (modifyServerPro) total-vm:1308336kB, anon-rss:120420kB, file-rss:20kB

                            Out of memory: Kill process 6325 (modifyServerPro) score 36 or sacrifice child

                            [...]

                             

So basically... I don't know how to "protect" my processes (running as "bladmin") from being killed by the system when there is no RAM left...

                             

Is there some "priority" I could set on my workers (using nice/renice or something similar) to prevent them from being killed?

                            • 11. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                              Bill Robinson

The OOM killer will usually target the process(es) consuming the most memory, but those may not be the ones that caused the OOM condition.

                               

                              there was a write up here: linux - How OOM killer decides which process to kill first? - Unix & Linux Stack Exchange

                               

                              it also looks like there's a way to set exclusions:

                              http://backdrift.org/oom-killer-how-to-create-oom-exclusions-in-linux

                               

But if you are really running out of memory, I'd look into why: what is spinning up that causes the OOM killer to be invoked?
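A minimal sketch of the exclusion approach from those links, assuming a kernel that exposes /proc/&lt;pid&gt;/oom_score_adj (older RHEL 6 kernels use oom_adj with -17 as the disable value, as tried later in this thread). The overridable proc-root parameter is not part of any real interface; it exists purely so the function can be exercised without root.

```shell
# Sketch: exempt a process from the OOM killer by writing -1000 to
# its oom_score_adj file. Requires root in real use.
# $1 = PID; $2 = optional proc root (defaults to /proc, overridable
# purely for testing).
oom_exempt() {
    procroot="${2:-/proc}"
    echo -1000 > "$procroot/$1/oom_score_adj"
}

# Usage (as root): oom_exempt 12345
```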

                              • 12. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                                Clement BARRET

Thanks for this answer. I had already found and tried those tweaks by googling "oom kill", and they didn't work...

                                 

Even putting -17, as root, in the given processes' oom_adj files had no effect...

                                 

I don't understand why those scripts use that much RAM, though...

                                 

These are only small NSH scripts connected to the blcli (via blcli_connect, after setting the right profile and role), yet their footprint is quite HUGE compared to standard bash scripts... I really don't understand why the RAM usage is so high. Is there something we can configure so that NSH scripts use less RAM?

                                 

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

                                11823 bladmin  20  0 1380m 349m  21m S  1.0  9.1  0:23.68 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=10

                                31681 bladmin  20  0 1379m 356m  20m S  1.0  9.3  0:41.24 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=6

                                2417 bladmin  20  0 1379m 339m  21m S  0.7  8.9  0:30.38 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=7

                                5536 bladmin  20  0 1380m 352m  21m S  0.7  9.2  0:27.81 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=8

                                23374 bladmin  20  0 1380m 340m  20m S  0.7  8.9  0:32.67 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=3

                                25960 bladmin  20  0 1380m 336m  20m S  0.7  8.8  0:33.11 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=4

                                28808 bladmin  20  0 1379m 344m  20m S  0.7  9.0  0:32.00 /bin/nsh /tools/bin/generes/prod/generes_worker_daemon.nsh debug=42 WORKER_NUMBER=5

[...] (I have 10 of them)

                                 

                                Any thought ?

                                 

PS: Meanwhile, we have asked our OPS to allocate more RAM to the server (a VM); we hope to get 16 GB.

                                • 13. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                                  Bill Robinson

Can’t you ignore ‘virt’ for the most part? What matters is the res size... so you’ve got 10 x ~350m, which is around 3.5g. Did you set the Xms for the blcli to something? The default Xms is like 1m I think, and Xmx is 256m by default.

                                   

                                  Can you attach the worker script ?

                                  • 14. Re: [ NSH ] fork failed: not enough core / *PLEASE HELP*
                                    Santosh Kothuru

If memory is the cause, then as a workaround you can clear the memory cache on the server until there is a permanent fix.
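That workaround can be sketched as follows (root required). Note that it only drops clean page cache, dentries, and inodes, so it buys time rather than fixing the underlying shortage; the overridable target path is not a real option, it exists purely so the function can be tested without root.

```shell
# Drop the kernel's page cache, dentries, and inodes (echo 3).
# sync first so dirty pages are written back and become reclaimable.
# $1 = optional target file (defaults to the real sysctl knob).
drop_caches() {
    target="${1:-/proc/sys/vm/drop_caches}"
    sync
    echo 3 > "$target"
}

# Usage (as root): drop_caches
```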
