9 Replies Latest reply on Jan 24, 2018 7:09 AM by Bill Robinson

    Patch Analysis Results using BLCLI

    Steve Abercrombie

      Here's my situation: I've got multiple Windows PA jobs that run against multiple servers, and each group needs to reboot separately.  We'll call them PAjob1 (server1-server20), PAjob2 (server21-server30), and PAjob3 (server31-server36).  What I'm trying to accomplish is to have them all patch in batches of twenty.  Then I want PAjob2 to wait for PAjob1 to finish completely before kicking off its reboots (meaning I made PAjob2 part of a Batch Job, as it will have another NSH script that monitors PAjob1 and waits for it to finish before kicking off reboots on the PAjob2 servers).  After that, I want PAjob3 to do the same thing with PAjob2, and so on and so forth.  What I need is three things:

       

      1 - I want to look at the results of the PA jobs and get the servers that errored out.

      2 - Get the reason each server errored out (this could be in the Patch Analysis portion or the Remediation portion).

      3 - I want a monitor that watches the previous PA job so that I know when it is safe to kick off reboots.

       

      I'm thinking that NSH using BLCLI would be the most effective way to accomplish this, but I'm open to other ideas if they are more effective and perform better.  A rough sketch of what I have in mind for the monitoring piece is below.
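
      Something like this is what I'm picturing for the monitor (very rough sketch - the job group "/Patching" and job name "PAjob1" are placeholders for our real locations, and I believe JobRun findLastRunKeyByJobKey and JobRun getJobRunIsRunningByRunKey are unreleased blcli commands, so they would need to be verified against the BSA version in use):

      #!/bin/nsh
      # rough sketch only - assumes it runs where the script-based blcli helpers
      # (blcli_connect / blcli_execute / blcli_storeenv) are available, e.g. an
      # NSH Script Job.  Group path and job name below are placeholders.
      blcli_connect

      # look up the patching job and its most recent run
      blcli_execute PatchingJob getDBKeyByGroupAndName "/Patching" "PAjob1"
      blcli_storeenv JOB_KEY
      # findLastRunKeyByJobKey / getJobRunIsRunningByRunKey are unreleased
      # commands - verify they exist in the BSA version in use
      blcli_execute JobRun findLastRunKeyByJobKey ${JOB_KEY}
      blcli_storeenv RUN_KEY

      # poll until the run is no longer active, then it is safe to reboot
      RUNNING="true"
      while [ "${RUNNING}" = "true" ]
      do
          sleep 60
          blcli_execute JobRun getJobRunIsRunningByRunKey ${RUN_KEY}
          blcli_storeenv RUNNING
      done
      echo "PAjob1 is finished - ok to kick off PAjob2 reboots"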

       

      Thanks,

       

      Steve

        • 1. Re: Patch Analysis Results using BLCLI
          Bill Robinson

          so it's ok if all the deploy jobs are installing patches at the same time, but you need to hold the reboots to the sequence you have above ?

          • 2. Re: Patch Analysis Results using BLCLI
            Steve Abercrombie

            Yes, the first batch of servers will go ahead and reboot.  The next batch of servers needs to wait for the first batch to reboot before those can reboot (i.e. - DB servers in the first batch and web servers in the second).  Further, I'd like the second batch to be able to kick off its reboots as soon as the first batch is done, even if other servers in the same Patch Analysis job are still patching.  I don't want all of the servers in the same Patch Analysis job to necessarily be dependent upon each other, if that makes sense.

            • 3. Re: Patch Analysis Results using BLCLI
              Bill Robinson

              right, but there's a lot of little details here... so do you want to:

              run an 'auto-remediate/auto-deploy' (a job that runs the analysis, generates the blpackages and bldeploy jobs, and then runs the bldeploy jobs) against the first batch, let the deploy jobs handle the reboots, then once all of that is done, run the 'auto-remediate/auto-deploy' against batch #2, wait for that to be all done, then #3 ?

              or

              start the patch install on all three batches at the same time, but only reboot servers from batch 2 when all the servers in batch 1 have rebooted, then do batch 3 after all the ones in batch 2 ?  the problem i see here is that the reboots will not be handled by the deploy job but by your script, because while batch 1 can use the deploys to do the reboot, batch 2 can't.  you have to monitor the patching process on every server to see when it's done so you can reboot it (since all three batches are running at the same time).

               

               

              for case #1 - you should be able to put the patching job in a batch job and then use the 'by server' execution mode.  that will create a batch job run for each server in the target list, so one server won't wait for analysis on the other servers to complete before running the remediation and then the deploy.  then use a nsh script that calls the blcli to run each batch job in sequence - call the first one and then monitor status w/ something like the approach in 'Run a Job from the BLCLI and wait for it to complete vs executeJobAndWait' (rough sketch below).  i'm assuming this is auto-remediation and will run the deploy right after the deploy jobs are generated ?  then once it's done, start the 2nd patching/batch job and so on.
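
              eg, very roughly (the group path and batch job names are placeholders - BatchJob getDBKeyByGroupAndName and Job executeJobAndWait are released blcli commands as far as i know, but verify against your version):

              #!/bin/nsh
              # rough sketch - "/Patching" and the batch job names are placeholders.
              # Job executeJobAndWait blocks until the job (and its member jobs)
              # finish, which is what gives you the batch1 -> batch2 -> batch3 order.
              blcli_connect

              for JOB in "PAbatch1" "PAbatch2" "PAbatch3"
              do
                  blcli_execute BatchJob getDBKeyByGroupAndName "/Patching" "${JOB}"
                  blcli_storeenv BATCH_KEY
                  echo "starting ${JOB}"
                  blcli_execute Job executeJobAndWait ${BATCH_KEY}
                  echo "${JOB} finished"
              done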

               

              i also think case #2 isn't a good idea because the patch install itself could take down the application being patched, or the network, or something else that would render the service the clusters provide inaccessible.

              • 4. Re: Patch Analysis Results using BLCLI
                Steve Abercrombie

                Yes, it would be case #2.  If that isn't the recommended path, then what would be the recommended path to patch 300 servers within a 2 to 3 hour timeframe?  The two approaches I've looked into so far were case #2, and creating Patch Analysis jobs 20 at a time and running them (each job runs against just one server, making it easier to monitor for reboots).  The reason I mention 20 at a time is because it seems to thread 20 servers at a time, unless I'm incorrect.  When I set it up this way, I kept running out of memory on our BladeLogic servers, and I verified that it was indeed bringing the servers down.  I'm working on an Enterprise solution that will work on 300 servers * 30+ projects.  I want the solution to be as automated as possible, including patching, rebooting, notifying technicians of errors, etc.  All I need is the most efficient way to do this on Windows servers while letting servers reboot when they need to, but in a priority order as mentioned in case #2.  Thoughts?

                • 5. Re: Patch Analysis Results using BLCLI
                  Bill Robinson

                  what would be the recommended path to patch 300 servers within a 2 to 3 hour timeframe?

                  make sure they all have SSDs for storage and only need a couple patches each ?

                   

                  a few things to look at:

                  - if you are doing this all at once - analysis, then staging of the patches then deploy, does the staging bit take 'a long time' ?

                  - in the env do you have 'MaxLightweightWorkItemThreads' set to 200 and EnableAsyncExecution blank or set to 'true' (not false) ?  these are blasadmin settings that allow the commit phase of the deploy to run in an asynchronous fashion, so more deploys can happen concurrently without waiting for free WITs (other jobs running).  a quick way to check/set them is sketched below.
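
                  eg, on each appserver host (i'm going from memory here - i believe both settings live under the AppServer module, but double-check w/ 'blasadmin show all' in your version, and <instance> is whatever your deployment is called):

                  blasadmin -s <instance> show AppServer MaxLightweightWorkItemThreads
                  blasadmin -s <instance> show AppServer EnableAsyncExecution

                  blasadmin -s <instance> set AppServer MaxLightweightWorkItemThreads 200
                  blasadmin -s <instance> set AppServer EnableAsyncExecution true

                  (restart the appserver service afterwards for the change to take effect)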

                   

                  The reason I mention 20 at time is because it seems to thread 20 servers at a time unless I'm incorrect.

                  is the parallelism of the patching job set to 20 ? otherwise it should use all available WITs in the env to hit as many targets as it can in parallel.

                   

                  When I set it up this way, I kept running out of memory on our BladeLogic servers and I verified that it was indeed bringing the servers down

                  did you set up the patching job to run directly, or are you using a batch job w/ the 'by server' option, or something else ?  how much memory do the appservers have ?  how many instances per appserver ?  max heap per instance ?  there are probably more settings to look into here, because you should certainly be able to target > 20 servers at a time w/ a patching job - the appserver on my laptop can do that.

                   

                  are all of the batches you identified above the A, B, C nodes of clusters ?  like you have 20 clustered systems each w/ 3 nodes or is it a mix ?  i mean, really you just don't want all the nodes in a cluster rebooting at the same time but otherwise you can do the reboots for non-clustered whenever during the window ?

                  • 6. Re: Patch Analysis Results using BLCLI
                    Steve Abercrombie

                    what would be the recommended path to patch 300 servers within a 2 to 3 hour timeframe?

                    make sure they all have SSDs for storage and only need a couple patches each ? - I wish, not sure what we have as I didn't install it

                     

                    a few things to look at:

                    - if you are doing this all at once - analysis, then staging of the patches then deploy, does the staging bit take 'a long time' ? - Staging isn't too bad

                    - in the env do you have 'MaxLightWeightWorkItemThreads' set to 200 and EnableAsyncExecution blank or set to 'true' (not false) - these are blasadmin settings that allow the commit phase of the deploy to run in an asynchronous fashion and more deploys to happen concurrently, and not need to wait for free WITs (other jobs running) - I don't know but I can check, are these the recommended settings?

                     

                    The reason I mention 20 at time is because it seems to thread 20 servers at a time unless I'm incorrect.

                    is the parallelism of the patching job set to 20 ? otherwise it should use all available WITs in the env to hit as many targets as it can in parallel.

                     

                     

                    Yes, it is set to 20.  Is this the recommended setting or should it be higher?

                     

                    When I set it up this way, I kept running out of memory on our BladeLogic servers and I verified that it was indeed bringing the servers down

                    set up the patching job to run, or are you using a batch job w/ the 'by server' or something else.  how much memory do the appservers have?  how many instances per appserver ?  max heap per instance ?  there's probably more settings to look into here because you should certainly be able to target > 20 servers at a time w/ a patching job.  the appserver on my laptop can do that.

                     

                    The patching job is set up to run.  I don't know the answers to the other questions but I'll go dig them up.  What's recommended for each of these?  There is a Patch Analysis job per server, meaning that 20 Patch Analysis jobs kick off at the same time, then another 20 Patch Analysis jobs about 6 minutes later, and so on.  This adds up to a total of 300 Patch Analysis jobs for 300 servers, allowing each server to reboot when it completes instead of waiting for all of the other servers to finish patching.

                     

                    are all of the batches you identified above the A, B, C nodes of clusters ?  like you have 20 clustered systems each w/ 3 nodes or is it a mix ?  i mean, really you just don't want all the nodes in a cluster rebooting at the same time but otherwise you can do the reboots for non-clustered whenever during the window ?

                     

                    Yes, it is a mix of clustered and non-clustered, allowing nodes to reboot separately without taking the whole cluster down.

                    • 7. Re: Patch Analysis Results using BLCLI
                      Barry Reilly

                      My 2 cents... sharing my experience of how I implemented it to cover all scenarios.

                       

                      1. Create and divide targets into environment type and subtype using properties.
                      2. Create smart groups for these environments and their 3 sub-divisions.
                      3. Tag the targets for each environment and sub-division and the smart groups will auto-populate (a rough blcli sketch for the tagging is below this list).
                      4. Create the corresponding Windows Patching Job for each environment/sub-division.
                      5. There's more, but for now I'll start with how this works...
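
                      If you want to script the tagging instead of setting the properties by hand in the GUI, something like this works from NSH (rough sketch - the server name and values are examples only, the properties must already exist in the Property Dictionary, and Server setPropertyValueByName is the released command as far as I recall - verify against your BSA version):

                      blcli_connect
                      # example only: tag a domain controller as one-at-a-time
                      blcli_execute Server setPropertyValueByName "dc01.example.com" "_PATCH_OAAT" "true"
                      blcli_execute Server setPropertyValueByName "dc01.example.com" "_PATCH_MANUAL" "false"
                      blcli_execute Server setPropertyValueByName "dc01.example.com" "_PATCH_EXCLUDE" "false"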

                       

                      Environments

                      1. Production

                      2. Non-Production

                      3. [opt] DMZ-Production

                      4. [opt] DMZ-Non-Prod

                       

                      Sub-divisions for each environment

                       

                      1.1 OAAT (One At A Time, with auto reboot at end of job)

                      1.2 Manual (all manual targets with no auto reboot at end of job)

                      1.3 All (others)

                       

                      e.g. 'Prod OAAT' - All production servers that can only be done one at a time, with a reboot, before moving on to the next one

                       

                      e.g. 'Prod Manual' - All prod servers where Patch Analysis, stage, and commit are done but with no reboot at all; the business owner of the servers or an assigned team carries out the manual intervention to ensure the servers get rebooted and started in 1. the right order and 2. with the relevant services or batch file/command prompt window run as user X on server Y

                       

                      e.g. 'Prod All' - All remaining prod servers with P.A., staging, and commit, with auto reboot.

                       

                      Smart group definitions:

                       

                      Patch PROD-All

                      - Criteria

                                         OS = Windows  AND

                                         ENVIRONMENT = (Is One Of) Production, DMZ-Production  AND

                                         _PATCH_OAAT = False  AND

                                         _PATCH_MANUAL = False  AND

                                         _PATCH_EXCLUDE = False

                       

                      Note: the _PROPERTY_NAME properties are custom properties added using the Property Dictionary view under Configuration in the console GUI.

                       

                      Patch PROD_OAAT

                      - Criteria

                                         OS = Windows  AND

                                         ENVIRONMENT = (Is One Of) Production, DMZ-Production  AND

                                         _PATCH_OAAT = True  AND

                                         _PATCH_MANUAL = False  AND

                                         _PATCH_EXCLUDE = False

                       

                      It is similar for the other smart groups: Prod_Manual; Non-Prod_OAAT; Non-Prod_Manual; Non-Prod_All.

                       

                      Examples of where you would typically tag these properties:

                       

                      Domain controllers as _PATCH_OAAT;

                       

                      For passive nodes in clusters (SQL, Exchange, IIS) it's a similar idea, but have a policy and process so that 24 hours before your scheduled patch window all passive nodes are checked to confirm they are still passive, so that the group of passive nodes never changes from one month of patching to the next.

                       

                       

                      My monthly patching schedule for example is:

                       

                      Patch Tuesday

                      - schedule Patch Catalog to update 3 times from 18:00 GMT on second Tuesday of the month.

                      - Schedule batch job for P.A./stage/commit of all BladeLogic servers in an OAAT fashion.

                      - Includes app servers & remote repeaters

                       

                      Patch Wednesday

                      - Schedule 2nd Wednesday of the month - run patch analysis for Non-Production (all groups of Non-prod)

                      - Defer staging to 18:00 Wednesday

                      - Defer commit to 18:00 Patch Thursday

                      - auto reboot for Non-Prod-ALL,

                      - schedule Non-Prod-Manual via script job to reboot at X time/date

                       

                      Patch Thursday

                      - Schedule Passive Nodes to do P.A. /stage/commit , no reboot

                      - schedule script reboot job for 12:00 noon, OAAT ( One at a time)

                       

                      -Scheduled Non-Prod_X jobs run at 18:00

                       

                       

                      Patch Friday

                      - scheduled Patch analysis for Prod_All, OAAT; Manual group jobs

                      -Defer staging until 18:00 on Friday

                      -Defer commit until 17:00 Saturday

                       

                      Patch Saturday

                      -All production staging continues

                      -16:00 scheduled script job - Reboot all production servers other than PROD_OAAT;

                      -17:00 commit Prod OAAT & PROD_MANUAL Jobs.

                      -18:00 start commit for Prod-ALL, with auto reboot.

                       

                      Patch Sunday

                       

                      Owners for those in Patch_Manual look after their own servers and the order of rebooting and starting up the apps and dependencies; logging in as user X on server(s), etc.

                       

                      Patching Window ends at 23:00 on Sunday night

                       

                       

                      Smart groups are very powerful when used with custom properties to quickly get a view of things like:

                       

                      All servers missing service pack

                      All servers where Patch_exclude = true

                      All servers where reboot_needed = YES/true

                      - I have a scheduled script job to read the registry and populate the smart groups (a rough sketch of that script is below this list):

                      --- All windows 2003 servers pending reboot

                      ---All windows 2008 .. 2012 etc.. pending reboot
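
                      For reference, the guts of that script is roughly this (rough sketch only - REBOOT_NEEDED stands in for the reboot_needed property mentioned above, the target server name is passed in as a parameter, and the 'Component Based Servicing' key only exists on Windows 2008 and later, so older versions need their own checks):

                      #!/bin/nsh
                      # rough sketch - pass the target server name as the first parameter
                      TARGET="$1"

                      # query the pending-reboot key; reg query returns non-zero if the key is absent
                      nexec ${TARGET} cmd /c 'reg query "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending"' > /dev/null 2>&1
                      if [ $? -eq 0 ]
                      then
                          PENDING="true"
                      else
                          PENDING="false"
                      fi

                      # write the result back to the custom server property so the smart groups pick it up
                      blcli_connect
                      blcli_execute Server setPropertyValueByName "${TARGET}" "REBOOT_NEEDED" "${PENDING}"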

                       

                       

                      Note: when setting the target limit (parallelism) for the Windows Patching Jobs, remember to be realistic.

                       

                      Prod_OAAT, limit = 1

                      Prod_Manual, limit = 10 for 80 targets

                      Prod_All, limit = 20 for 1100 targets

                      Non-Prod, etc.

                       

                      When running multiple App Servers for BSA the load is auto-balanced, so additional jobs that are running/scheduled will start up on the additional BSA app servers as needed.

                       

                      I have 3 BSA app servers, with one of them acting as the patch catalog, file & depot share, etc.

                      The other 2 as app servers

                       

                      and then some repeaters for DMZs and remote networks around the world.

                       

                      Note 2: a repeater is just a target with the BSA agent installed.  You simply tell the BSA console configuration that you are adding a repeater, state the name of the target, and it will automatically tag that remote target's agent as a repeater.

                       

                      You then either manually set a property per target to use that repeater or, better yet, use the repeater routing rules to create a rule with:

                       

                      If network address is X, then use repeater Y.

                      • 8. Re: Patch Analysis Results using BLCLI
                        Steve Abercrombie

                        Thanks, I'll take this into consideration.  I've pretty much mapped out how I need to implement this.  What I found to be the problem was that I was using NSH to create Batch Jobs that in turn run 20+ Patch Analysis jobs, which then all kicked off at the same time on one App Server instead of load-balancing to the other servers.  I still think it should be able to handle this load, but that's what was happening.  Unfortunately I do not control the App Server configuration and can only offer up suggestions for configuration changes.  What I believe will work for my issue is what Bill mentioned above - stagger the patch job times - and I'll have to write my NSH job to do that instead.  With MaxThreads set to the default of 200, does this mean there could be a max of 200 BladeLogic jobs running without performance issues?  Thanks for your help and suggestions!

                        • 9. Re: Patch Analysis Results using BLCLI
                          Bill Robinson

                          kicked off all at the same time on one App Server instead of load balancing to other servers

                          there are a couple of things to look into here - the batch jobs can get picked up by the same appserver, but the actual per-target 'work' (work item thread) should be shared out somewhat evenly across all the appservers in the env.  that is not shown in the 'tasks in progress' but can be seen in the job run logs (something like 'executing work item thread for server abc on appserver xyz') or in the 'infrastructure management' view.  if all the wits for one batch job (and its children) are still going to one appserver, that might mean you have a job routing rule in place that the batch job triggers, or your appservers can't talk to each other on the MinPort, MaxPort, RegistryPort, or the bladelogic.keystore is not in sync between the appservers, or the time or time zones are off between the appservers (or something else).  so if the WIT sharing isn't working, that needs to be looked at w/ support.

                           

                          MaxThreads - there's no such thing... which one of these are you talking about:

                          MaxApprovalThreads:3

                          MaxJobThreads:5

                          MaxLightweightWorkItemThreads:0

                          MaxNshProxyThreads:15

                          MaxRESTNotifyThreads:12

                          MaxWorkItemThreads:100

                          MaxWorkerThreads:10

                          MaxAuthSvcThreads:5

                          did you mean 'MaxLightweightWorkItemThreads' ?  that one applies only to the 'commit' phase of the deploy jobs - it lets that phase run in an async way and not use a full wit, which means when you get to the commit phase of the deploy you can have far more commits running than you have WorkItemThreads in the env.  MaxJobs (not MaxJobThreads) is what governs how many jobs can be running concurrently on an appserver instance.

                           

                          note that when you run a batch job, the member jobs do not count against the 'maxjobs' count.  only the batch job itself contributes to that count.  also - to some degree the maxjobs doesn't matter - if you spin up a batch job that ultimately targets 200 servers, you are using 200 wits.  that's usually the choke point.  so if you have 10 batch jobs and they each target 200 servers, that's 2000 wits that need to be used.

                           

                           

                          imo it might make sense to break this down a little more if you can - you have non-clustered nodes that you can reboot whenever during the window, right ?  so that could be one group - you can patch and reboot those independently of your clusters, so that's one job that can go off and run w/o much limitation.

                           

                          then you have the cluster nodes - imo that's where you should have your batching and order of reboot.  and in here, i think you want to patch all the node As and then reboot them, make sure they work and can take traffic (fail over to them), and then do node B.  i would not want to patch B and C while patching A, because if some patch messes up the A node, now you are stuck.  that might increase the total time but it seems safer (i guess there could be a problem having the nodes w/ different patches too).  so that bit could be done via some orchestration (cough cough BAO) or in the nsh script.  maybe that could be done in a batch job w/o nsh - i could see something like:

                          Batch Job (stop if one member job fails, use targets from each job)

                              - patching job to patch node A, reboot

                              - compliance job that fails if node A isn't working, otherwise it fails over to this node

                              - patching job to patch node B, reboot

                             - compliance job to check node B

                            ....

                          or that could be in a nsh script - roughly sketched below.  you might be able to cut time by staging the patches ahead of time and then only firing off the commit phase during the window.
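
                          very rough sketch of the nsh version, per cluster (the group path, job names, host names and the health check are all placeholders - substitute whatever actually tells you a node is healthy before you fail over to it):

                          #!/bin/nsh
                          # rough sketch - each per-node job below is the patching (or deploy) job
                          # for that node, which also handles the reboot.  names are placeholders.
                          blcli_connect

                          for NODE in A B C
                          do
                              blcli_execute PatchingJob getDBKeyByGroupAndName "/Patching/Cluster1" "Cluster1-Node${NODE}-Patch"
                              blcli_storeenv JOB_KEY
                              blcli_execute Job executeJobAndWait ${JOB_KEY}

                              # placeholder health check - stop before touching the next node if
                              # this node didn't come back up properly
                              nexec cluster1-node-${NODE} cmd /c 'sc query SomeAppService' | grep -q RUNNING
                              if [ $? -ne 0 ]
                              then
                                  echo "node ${NODE} did not come back healthy - stopping"
                                  exit 1
                              fi
                          done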