1 2 Previous Next 15 Replies Latest reply on Apr 23, 2010 8:34 AM by Rob Fernie

    "Too many open files" during CIS compliance run

      We are running CIS content compliance jobs across our servers, and it seems that if we allow the compliance job to take more than around 10 threads, we start getting "Too many open files" errors in the job log and in appserver.log. What is confusing is that the appserver and the process spawner are both set to allow 64K file handles, and even during a run, the total open files is only in the hundreds. We have verified the max fd settings using the Infrastucture Management view in the GUI and by running "pfiles" (this is Solaris) on the JVM processes. All of the JVMs, both for blappserv and blprocserv show "Current rlimit: 65536 file descriptors".

       

      Any suggestions of how we might track this down or tune to get away from this?

       

      We are running 7.4.6.802 on Solaris 10. The actual stack trace looks like this:

       

      [07 Sep 2009 02:59:23,176] [Job-WorkItem-Thread-2] [ERROR] [f246668@FHLMC.COM:APP_BL_WIN_ADMIN_GG:] [Compliance] error occured evaluating compliance rules on component CIS - Win03 v2.0 - Enterprise Level 2 (hq4ntsp57) : com.bladelogic.mfw.util.BlException: Unable to parse file: /opt/bladelogic/tmp/application_server/Job-WorkItem-Thread-21252306761915.ini: Too many open files
      com.bladelogic.mfw.util.BlException: com.bladelogic.mfw.util.BlException: Unable to parse file: /opt/bladelogic/tmp/application_server/Job-WorkItem-Thread-21252306761915.ini: Too many open files
              at com.bladelogic.ast.evaluator.ConditionEvaluator.collectAssetInformation(ConditionEvaluator.java:176)
              at com.bladelogic.ast.evaluator.ConditionEvaluator.evaluate(ConditionEvaluator.java:109)
              at com.bladelogic.model.job.compliance.compliance.ComplianceJobWorkItemImpl.doExecute(ComplianceJobWorkItemImpl.java:157)
              at com.bladelogic.model.job.workitem.BaseJobWorkItemImpl.execute(BaseJobWorkItemImpl.java:117)
              at com.bladelogic.app.service.workitem.WorkItemThread.run(WorkItemThread.java:140)
      Caused by: com.bladelogic.mfw.util.BlException: Unable to parse file: /opt/bladelogic/tmp/application_server/Job-WorkItem-Thread-21252306761915.ini: Too many open files
              at com.bladelogic.model.asset.config.ServerConfigFileImpl.jniHandler(ServerConfigFileImpl.java:257)
              at com.bladelogic.model.asset.config.ServerConfigFileImpl._browseConfigFile(Native Method)
              at com.bladelogic.model.asset.config.ServerConfigFileImpl.parseConfigFile(ServerConfigFileImpl.java:208)
              at com.bladelogic.model.asset.config.ServerConfigFileImpl.parseConfigFile(ServerConfigFileImpl.java:156)
              at com.bladelogic.model.asset.config.ServerConfigFileImpl.<init>(ServerConfigFileImpl.java:117)
              at com.bladelogic.model.asset.config.ServerConfigFileFactory.getFromAgent(ServerConfigFileFactory.java:291)
              at com.bladelogic.model.asset.config.ExtendedObjectImpl.getEntriesFromComponent(ExtendedObjectImpl.java:244)
              at com.bladelogic.model.asset.config.ExtendedObjectImpl.getEntriesFromContainer(ExtendedObjectImpl.java:141)
              at com.bladelogic.model.asset.config.ExtendedObjectImpl.parseExtendedObject(ExtendedObjectImpl.java:113)
              at com.bladelogic.model.asset.config.ExtendedObjectImpl.<init>(ExtendedObjectImpl.java:91)
              at com.bladelogic.model.asset.config.ExtendedObjectFactory.getBlAssetFromContainer(ExtendedObjectFactory.java:204)
              at com.bladelogic.app.collector.LiveAssetContainer.getBlAsset(LiveAssetContainer.java:36)
              at com.bladelogic.app.api.factory.BlAssetServiceImpl.getBlAssetFromContainer(BlAssetServiceImpl.java:66)
              at com.bladelogic.ast.evaluator.ConditionEvaluator.collectAssetInformation(ConditionEvaluator.java:159)
              ... 4 more

        • 1. Re: "Too many open files" during CIS compliance run
          Bill Robinson

          I've run into this before, I had to change the 'max_uprc' (something like that) setting in /etc/system (max user processes)

          and reboot.  I bumped it up to like 256 or 384 and that seemed to work on the install, you may need more.  this may be the problem.

          • 2. Re: "Too many open files" during CIS compliance run

            I was hopeful for that idea, but looking at the system, it doesn't seem to be the case.

             

            This is a fairly new Solaris 10 system and the default maxuprc is 29995:

             

            blprod03/ % nexec -e adb -k
            physmem 1b79ea
            maxuprc/D
            maxuprc:
            maxuprc:        29995

            During today's run when these errors occured, the job was limited to 50 threads as a test (to avoid these errors we have been dropping all the way to 5), and during that run, while the errors were showing up, there were 80-90 procs running on the whole system with a total of 500-600 lwps, no swap use at all, and around 20 of these threads were being farmed to another appserver (we have 3 job appservers). With pfiles, I could see that the appserver JVM had an fd limit of 65536, with 308 open files (mostly sockets, less than 100 of the actual ini files that are always mentioned in the error)

            • 3. Re: "Too many open files" during CIS compliance run
              Bill Robinson

              and you've looked at the ulimit settings for the bladmin unix user right?  those are set to unlimited or something 'large' ?

              • 4. Re: "Too many open files" during CIS compliance run

                In Solaris 10 by default, users have a soft limit of 256 and a hard limit of 64k.

                 

                So within a process a user can set their own ulimit to anything up to 64k without a system setting change.

                 

                In our startup scripts we have maxed out ulimit settings that are inherited by child processes:

                 

                bladmin@blprod04:~ % cat /etc/init.d/blappserv
                #!/bin/sh
                #
                # description: BladeLogic Application Server
                # processname: blappserv
                ulimit -n 65535
                LD_LIBRARY_PATH=${LD_LIBRARY_PATH:=}${LD_LIBRARY_PATH:+:}/opt/bladelogic/lib
                export LD_LIBRARY_PATH

                 

                And that new value is reflected in pfiles and in the Infrastructure Management view in the GUI

                 

                Another curiousity in this which makes the error message confusing is that at times the Infrastructure Management view will show that the appserver has 1000 or more files open, but this error occurs when the appserver has far fewer files open.

                 

                My niggling instinct says that it is somehow another process-- not the appserver-- that is running out of files, but the error certainly appears as a JVM stack trace, and all of the running JVMs are the entire system are set to 65535.

                • 5. Re: "Too many open files" during CIS compliance run
                  Bill Robinson

                  hmm, but that ulimit in the init is before the su to bladmin.  if you su - bladmin and do a ulimit -a what does it show?

                   

                  EOs (that's what's likely failing) are run by either the appserver process or the spawner process.  is there anything in the spawner.log ? and you're sure the EO execution is sent to the process spawner (in blasadmin, 'show proc all' shows the spawner enabled?)

                  • 6. Re: "Too many open files" during CIS compliance run

                    Assuming the extended objects are run using NSH would they loop back through the agent and use whatever account the agent is mapping to? Perhaps the agent has to launch with a higher ulimit

                    • 7. Re: "Too many open files" during CIS compliance run

                      Starting logged in as my personal id:

                       

                      blprod04:~ % ulimit -a
                      -t: cpu time (seconds)         unlimited
                      -f: file size (blocks)         unlimited
                      -d: data seg size (kbytes)     unlimited
                      -s: stack size (kbytes)        8192
                      -c: core file size (blocks)    unlimited
                      -n: file descriptors           256
                      -v: virtual memory size (kb)   unlimited
                      blprod04:~ % ulimit -n 65536
                      blprod04:~ % ulimit -a
                      -t: cpu time (seconds)         unlimited
                      -f: file size (blocks)         unlimited
                      -d: data seg size (kbytes)     unlimited
                      -s: stack size (kbytes)        8192
                      -c: core file size (blocks)    unlimited
                      -n: file descriptors           65536
                      -v: virtual memory size (kb)   unlimited
                      blprod04:~ % su - bladmin

                       

                      [long motd removed]

                       

                      bladmin@blprod04:~ % ulimit -a
                      -t: cpu time (seconds)         unlimited
                      -f: file size (blocks)         unlimited
                      -d: data seg size (kbytes)     unlimited
                      -s: stack size (kbytes)        8192
                      -c: core file size (blocks)    unlimited
                      -n: file descriptors           65536
                      -v: virtual memory size (kb)   unlimited

                       

                       

                      Correspondingly,

                      The su command calls "blappserv". Here can be seen the pfiles output for the blappserv script, which persists as the appserver parent:

                       

                      bladmin@blprod04:~ % pfiles $(pgrep blappserv)
                      13751:  /bin/sh /opt/bladelogic/br/blappserv
                        Current rlimit: 65535 file descriptors
                         0: S_IFCHR mode:0666 dev:339,0 ino:6815752 uid:0 gid:3 rdev:13,2
                            O_RDONLY|O_LARGEFILE
                            /devices/pseudo/mm@0:null
                         1: S_IFREG mode:0644 dev:256,65541 ino:146680 uid:20831 gid:73969 size:2099
                            O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
                            /opt/bladelogic/br/console.log
                         2: S_IFREG mode:0644 dev:256,65541 ino:146680 uid:20831 gid:73969 size:2099
                            O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
                            /opt/bladelogic/br/console.log
                        19: S_IFREG mode:0755 dev:256,65541 ino:17313 uid:0 gid:0 size:1462
                            O_RDONLY|O_LARGEFILE FD_CLOEXEC
                            /opt/bladelogic/br/blappserv

                      And that shows the 64k limit (as does the appserver JVM itself).

                       

                       

                      Yes, we are definitely using the spawner. It helps a great deal. Keeps our swap space clear. And the spawner JVM and its children also show 65536 files.

                       

                      One of these exceptions occurs in that appserver.log at [09 Sep 2009 11:40:59,499].

                       

                      Below is the corresponding content of the spawner log:

                       

                      [09 Sep 2009 11:40:57,422] [RMI TCP Connection(13320)-172.26.74.25] [INFO] Executing command : scriptutil -h he2unx24 -s inetd_svcs_status
                      [09 Sep 2009 11:40:57,493] [WaitForProcessThread-PID21932-39334] [INFO] Process finished: 21932
                      [09 Sep 2009 11:40:57,705] [RMI TCP Connection(13238)-172.26.74.25] [INFO] Creating new process with command : scriptutil -h hq4unx119 -s inetd_svcs_status
                      [09 Sep 2009 11:40:57,710] [RMI TCP Connection(13314)-172.26.74.25] [INFO] Executing command : scriptutil -h hq4unx119 -s inetd_svcs_status
                      [09 Sep 2009 11:40:59,230] [WaitForProcessThread-PID21658-39262] [INFO] Process finished: 21658
                      [09 Sep 2009 11:40:59,376] [WaitForProcessThread-PID21858-39314] [INFO] Process finished: 21858
                      [09 Sep 2009 11:40:59,760] [RMI TCP Connection(13254)-172.26.74.25] [INFO] Creating new process with command : scriptutil -h he2xcs110 -s ssh_info
                      [09 Sep 2009 11:40:59,766] [RMI TCP Connection(13254)-172.26.74.25] [INFO] Executing command : scriptutil -h he2xcs110 -s ssh_info
                      [09 Sep 2009 11:41:00,401] [RMI TCP Connection(13321)-172.26.74.25] [INFO] Creating new process with command : scriptutil -h he3unx108z001 -s unix_svcs_status
                      [09 Sep 2009 11:41:00,406] [RMI TCP Connection(13294)-172.26.74.25] [INFO] Executing command : scriptutil -h he3unx108z001
                      -s unix_svcs_status
                      [09 Sep 2009 11:41:01,484] [RMI TCP Connection(13323)-172.26.74.25] [INFO] Creating new process with command : scriptutil -h he1unxp110z01 -s unix_svcs_status
                      [09 Sep 2009 11:41:01,488] [RMI TCP Connection(13323)-172.26.74.25] [INFO] Executing command : scriptutil -h he1unxp110z01
                      -s unix_svcs_status

                      • 8. Re: "Too many open files" during CIS compliance run
                        Bill Robinson

                        in the bladmin user's profile can you add a 'ulimit -n 65536', restart the appserver and process spawner and see if you hit the same issue?

                        • 9. Re: "Too many open files" during CIS compliance run

                          I have added the ulimit in to bladmin's .profile and then completely shut down all of the job servers, and then completely restarted them all. Same error. I will go ahead and try setting the ulimits for the agents on the appservers are well.

                          • 10. Re: "Too many open files" during CIS compliance run
                            Bill Robinson

                            dang, this happened to me on 7.4.6 and i had to up that maxprc setting and force the setting in the .profile for bladmin and that fixed it.

                             

                            open a support ticket if you haven't already...

                            • 11. Re: "Too many open files" during CIS compliance run

                              Did you ever find a resolution to this issue? We're running into the same thing - and it seems even more prominent now that we have upgraded to 7.6.

                              Thanks!

                              • 12. Re: "Too many open files" during CIS compliance run

                                No, and we haven't put in a ticket yet largely because we completed our heavy round of CIS testing, and it won't get

                                as heavy again until after we upgrade to 8.0. We compensated a bit by reducing the number of threads for the compliance job-- usually 5 or 10 threads brought things to the point that the errors didn't come up, but of course that slows the runs down substantially.

                                 

                                That's a pity to hear that it still occurs in 7.6.

                                • 13. Re: "Too many open files" during CIS compliance run

                                  We're not running CIS, but we see this in any compliance job that leverages a number of EOs, as well as large snapshot jobs for BSARA. We opened a ticket, and we've been working with ulimit settings and limiting parallel threads. I'll investigate the suggestions above and let you know if some magical combination thereof provides any relief.

                                  • 14. Re: "Too many open files" during CIS compliance run

                                    You can add one more person who is seeing "Too many open files" errors. In my case, they're showing up in the job run logs of snapshot jobs. The jobs are using components that include File, Configuration File, and one Extended Object. The EO isn't that involved. All it does is read a text file and run a sed command on it to replace some text before feeding the file to BL.

                                     

                                    This is on a development system so we don't even have many servers in the inventory, about a dozen.

                                     

                                    I'm running BL 7.6 on a Solaris 10 system.

                                    1 2 Previous Next