8 Replies Latest reply on Feb 17, 2015 3:04 PM by Joseph Schuler

    The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them

    Yanick Girouard

      * Disclaimer: This is a long read, but I recommend that anyone looking into using BSA to analyze and update Red Hat servers read it, and I invite everyone else to share their experience with it. This is not a rant, but rather constructive criticism of features that have so far proven extremely disappointing for us, if not unusable. Here is the story of our saga with the Red Hat patch management module of BSA, and the steps I believe BMC should take to fix and improve it to match our collective needs.


      Ever since we started using the Red Hat Patch Catalog in BSA 8.3, we've had countless issues with it. Some concerned the analysis results, but most concerned the performance and duration of the Catalog Update Job (CUJ). Because we needed special child channels included in the catalog in order to properly analyze RHES6 targets (we needed the optional channel to resolve some dependencies), we had no alternative but to start using the offline downloader. Because of the way the offline downloader works, the duration of the CUJ almost tripled compared to online mode, making it almost useless, or at least extremely impractical for us. Every time a bug was fixed, another one was found, and the fix sometimes made the duration of the CUJ even worse than before. It seemed to me that the more I dug to understand how things worked under the covers, the more issues I found.

       

      So far, we have opened many RFEs and defects regarding the duration and performance of the job, as well as bugs related to incorrect analysis results or the parsing of errata data. Out of 4 requests for enhancement and 8 defects opened, only 4 were resolved, and the remaining ones are the most important, as they concern the core issue of performance and run-time duration.

       

      The last release, revision 8.3.03.190 (along with offline downloader revision 8.3.03-07), made it so much worse that it's now clear we can no longer use it until the performance is drastically improved. Since we upgraded to that revision in our development and staging environments, the average duration of our CUJ went from 4h23m (using 8.3.03.116) to 24h11m! Nothing changed on our server other than the revision of the app server and offline downloader. Something in the fixes brought by the last release is obviously not optimized and actually made things worse.

       

      Here is a screenshot showing the evolution of the CUJ duration (always in offline mode) since we started using it in our development environment:

      02-16-15 10-29-45 AM.png

      * Note that although the catalog name was changed starting at r190, it is the same catalog; it was just renamed in the console.

       

      Following this testing, I tried to break down the CUJ process in depth to pinpoint what was in fact making it that slow and to see if I could find obvious places to improve. Again, please note that although I'm only showing a single environment here, the same test was done on another application server with the same results. I confirm that nothing has changed in the hardware/software setup of the servers other than the version of BSA and the offline downloader. Here are my analysis results.

       

       

      Things to improve in order to reduce execution time of Red Hat CUJ and limit performance impact

       

       

      [Common for online and offline modes]

       

      - Allow the CUJ to use more than one CPU, when available, for all phases that use Java

       

      During certain phases of the CUJ that use Java, we noticed that only one CPU was being used at full capacity on the application server even though 2 CPUs were available. The step in question takes more than 1 hour to execute, so the impact is not negligible. Splitting the load across multiple CPUs to reduce the duration would be a logical solution.

       

      - Spawn "Depot Object Processing batch" work item threads on multiple job servers instead of using all available work item threads on the application server the CUJ is running against.

       

      The effect is that no additional jobs can run on the application server while the CUJ is going through this phase. In a multi-application-server environment, you would expect this kind of threading to be spread across all application servers to speed up processing and limit the impact. Also, the number of batches to process is usually higher than the maximum number of work item threads (WITs) available, so it would take less time if they could all spawn on the available application servers instead of being executed in groups (i.e. start with 49 out of 50 available WITs, then spawn more when they are done, and so on until all batches are done, as opposed to spawning all 150 batches across 6 available application servers and using only 25 WITs on each).
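
      To put rough numbers on the batching example above, here is a minimal sketch of the arithmetic. It assumes that every batch takes about the same time and that each group waits for the previous one to finish, which simplifies what was observed. The function name is made up for illustration.

```python
import math

def waves_needed(batches: int, threads_per_server: int, servers: int = 1) -> int:
    """Number of sequential rounds needed to run `batches` work items
    when at most threads_per_server * servers of them can run at once."""
    capacity = threads_per_server * servers
    return math.ceil(batches / capacity)

# Figures from the example above: 150 batches, 49 usable WITs on one server,
# versus 25 WITs on each of 6 application servers.
single_server = waves_needed(150, 49)          # 4 rounds, one after the other
spread_out = waves_needed(150, 25, servers=6)  # 1 round
print(single_server, spread_out)
```

      Under those assumptions, spreading the batches out turns four sequential rounds into one and leaves half of each server's WITs free for other jobs.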

       

      - Add the option to specify the maximum number of concurrent work item threads the CUJ can use at any time.

       

      This is mainly for the Depot Object Processing phase. For environments that have only a few application servers, it would prevent one of them from becoming unavailable for other jobs because all of its WITs are used by the CUJ.
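
      As a generic illustration of what such a cap means (this is not how BSA schedules work items, which is not visible from the outside), the sketch below processes a list of batches while never letting more than a configured number run at once. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_batches(batches, max_concurrent):
    """Process every batch, never running more than max_concurrent at once.
    Returns the peak number of batches seen running simultaneously."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def process(batch):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        # Real depot object processing for `batch` would happen here.
        with lock:
            running -= 1

    # The pool size is the cap: extra batches simply wait in the queue.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        list(pool.map(process, batches))
    return peak

# 150 batches, but at most 25 in flight at any time.
peak = run_batches(range(150), max_concurrent=25)
print(peak <= 25)
```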

       

      - Improve the performance of the createrepo wrapper (create_repo_wrapper.sh) by using createrepo's --update and --cachedir options.

       

      Based on observations, the CUJ creates a temporary workspace directory (e.g. catalog_2023204.part/RHES6x86) where the repository data is compiled using createrepo. This directory is recreated every time, and all metadata has to be regenerated by the createrepo command on every run of the CUJ. This takes a considerable amount of time (and CPU) and could be drastically improved if the repodata directory were kept between runs and reused with the --cachedir and --update options of createrepo. From the man page of createrepo:

       

             -c --cachedir <path>

                    Specify a directory to use as a cachedir. This allows createrepo to create a cache of checksums of packages in the repository. In consecutive runs of createrepo over the same repository of files that do not have a complete change out of all packages this decreases the processing time dramatically.

       

             --update

                    If metadata already exists in the outputdir and an rpm is unchanged (based on file size and mtime) since the metadata was generated, reuse the existing metadata rather than recalculating it. In the case of a large repository with only a few new or modified rpms this can significantly reduce I/O and processing time.
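
      As a rough illustration of that suggestion, the sketch below builds a createrepo command line that keeps a persistent cache directory and reuses existing metadata. Only the --update and --cachedir options come from the man page quoted above. The function name and the paths are placeholders, and how create_repo_wrapper.sh actually invokes createrepo is not known here.

```python
def createrepo_command(repo_dir, cache_dir):
    """Return the argv for an incremental createrepo run.

    repo_dir  -- directory holding the RPMs (and repodata from earlier runs)
    cache_dir -- directory kept between runs for cached package checksums
    """
    return [
        "createrepo",
        "--update",               # reuse metadata for unchanged RPMs
        "--cachedir", cache_dir,  # persistent checksum cache
        repo_dir,
    ]

# Hypothetical paths, for illustration only.
cmd = createrepo_command("/path/to/repo/RHES6x86", "/path/to/createrepo_cache")
print(" ".join(cmd))
```

      The list can be handed to subprocess.run(cmd, check=True). Both options only pay off if the repository directory and the cache directory survive from one run to the next, which is exactly what the recreated temporary workspace prevents today.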

       

      - Use "nice" (for Red Hat application servers) to limit the CPU impact of *any* script or command launched by the CUJ.

       

      This one is self-explanatory.
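
      For completeness, a minimal sketch of what this could look like when a command is launched from a script. The helper name is made up, and 19 is simply the lowest priority on Linux. Which commands the CUJ launches, and how, is an assumption here.

```python
def with_nice(argv, niceness=19):
    """Prefix a command with nice(1) so it runs at reduced CPU priority."""
    return ["nice", "-n", str(niceness)] + list(argv)

# Example: wrap a createrepo call (the path is a placeholder).
print(with_nice(["createrepo", "--update", "/path/to/repo"]))
```

      The resulting list can be passed to subprocess.run() in place of the original command.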

       

      [Online catalog mode only]

       

      - Let us select the repository channels we want to include per OsArch instead of hard-coding them.

       

      This is the main reason why we have to use the offline downloader, even though it was not meant for that. If we need any extra channel that is not one of the hard-coded ones, then even if our application server has direct access to the Internet, we need to use the offline downloader to get it. I reckon this requires major UI changes, but it shouldn't change much of the mechanics you're currently using. All we need is a checklist of channels to select for each OsArch, with the selected list fed to the downloader instead of hard-coded channel names. If you want BSA to become a valid replacement for Red Hat Satellite 6, for example, this is a must, and it matches how things are done in the field. Unix sysadmins understand the concept of channels, but not of "product filters" with the actual channel names hard-coded and hidden as they are now.

       

      For example, we would select RHES6 x86_64 in a drop-down and then be presented with a choice among all of the following, which is basically every channel available to the offline downloader when filtering on RHES6 and x86_64:

       

      rhel-x86_64-rhev-agent-6-server-beta-debuginfo

      rhel-x86_64-rhev-agent-6-server

      rhel-x86_64-rhev-agent-6-server-beta

      rhel-x86_64-rhev-agent-6-server-debuginfo

      rhel-x86_64-server-6-beta

      rhel-x86_64-server-6-beta-debuginfo

      rhel-x86_64-server-6-cf-tools-1-beta-debuginfo

      rhel-x86_64-server-6-cf-tools-1-beta

      rhel-x86_64-server-6-debuginfo

      rhel-x86_64-server-6-rhscl-1-beta

      rhel-x86_64-server-6-rhscl-1-beta-debuginfo

      rhel-x86_64-server-6-rhscl-1-debuginfo

      rhel-x86_64-server-6-rhscl-1

      rhel-x86_64-server-6-thirdparty-oracle-java-beta

      rhel-x86_64-server-6-thirdparty-oracle-java

      rhel-x86_64-server-6

      rhel-x86_64-server-dts-6-beta

      rhel-x86_64-server-dts-6-beta-debuginfo

      rhel-x86_64-server-dts-6-debuginfo

      rhel-x86_64-server-dts-6

      rhel-x86_64-server-dts2-6-beta

      rhel-x86_64-server-dts2-6

      rhel-x86_64-server-dts2-6-beta-debuginfo

      rhel-x86_64-server-dts2-6-debuginfo

      rhel-x86_64-server-extras-6

      rhel-x86_64-server-extras-6-debuginfo

      rhel-x86_64-server-fastrack-6

      rhel-x86_64-server-fastrack-6-debuginfo

      rhel-x86_64-server-optional-6-beta

      rhel-x86_64-server-optional-6

      rhel-x86_64-server-optional-6-beta-debuginfo

      rhel-x86_64-server-optional-6-debuginfo

      rhel-x86_64-server-optional-fastrack-6-debuginfo

      rhel-x86_64-server-optional-fastrack-6

      rhel-x86_64-server-rh-common-6

      rhel-x86_64-server-rh-common-6-beta

      rhel-x86_64-server-rh-common-6-beta-debuginfo

      rhel-x86_64-server-rh-common-6-debuginfo

      rhel-x86_64-server-rhsclient-6-debuginfo

      rhel-x86_64-server-rhsclient-6

      rhel-x86_64-server-supplementary-6-beta

      rhel-x86_64-server-supplementary-6-beta-debuginfo

      rhel-x86_64-server-supplementary-6-debuginfo

      rhel-x86_64-server-supplementary-6

      rhel-x86_64-server-v2vwin-6-beta-debuginfo

      rhel-x86_64-server-v2vwin-6-beta

      rhel-x86_64-server-v2vwin-6

      rhel-x86_64-server-v2vwin-6-debuginfo

      rhn-tools-rhel-x86_64-server-6

      rhn-tools-rhel-x86_64-server-6-beta

      sam-rhel-x86_64-server-6-beta

      sam-rhel-x86_64-server-6-beta-debuginfo

      sam-rhel-x86_64-server-6

      sam-rhel-x86_64-server-6-debuginfo

       

       

      [Offline catalog mode only]

       

      - Make it so the catalog does not have to regenerate all Errata metadata every time it runs in offline mode.

       

      There is a huge gap between the Red Hat CUJ duration of an online catalog and that of an offline catalog using the same filters (channels). Much of this gap is due to the fact that the offline downloader must regenerate the Errata metadata from scratch every time it runs (this is what I observed, and BMC's dev team confirmed it). The reason given is that it can't tell whether the catalog filters have changed (because it runs externally) and needs to know which OsArch to add to the "Supported OsArch" property of the Errata objects so that it matches the OsArchs included in the catalog.

       

      I'm not clear as to why this is such a hard problem to resolve, as I can think of several ways it could be fixed, but I'll let your dev team brainstorm on it. Why not simply list every supported OsArch applicable to an erratum, even if that OsArch is not present in the catalog? Why does it matter? When you look at the Errata page on the Red Hat Customer Portal, it lists all supported products, so why can't BSA use this? The offline downloader is wasting a gigantic amount of time and resources reprocessing those on every run. If I'm missing something crucial, please explain.
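
      To make the "has the filter set changed?" question concrete, here is one conceivable approach, sketched under my own assumptions. It is neither something BMC has described nor the fix proposed above. The idea is to fingerprint the configured (os, arch, channel) filters and regenerate everything only when the fingerprint differs from the one saved on the previous run. All names are hypothetical.

```python
import hashlib
import json

def filter_fingerprint(filters):
    """Return a stable hash of the (os, arch, channel) filters.
    Order does not matter: the filters are sorted before hashing."""
    canonical = json.dumps(sorted(filters), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_full_regeneration(filters, previous_fingerprint):
    """True when the filter set differs from the one used on the last run."""
    return filter_fingerprint(filters) != previous_fingerprint

last_run = filter_fingerprint([
    ("RHES6", "x86_64", "rhel-x86_64-server-6"),
    ("RHES6", "x86_64", "rhel-x86_64-server-optional-6"),
])

# Same filters listed in a different order: no regeneration needed.
same = [
    ("RHES6", "x86_64", "rhel-x86_64-server-optional-6"),
    ("RHES6", "x86_64", "rhel-x86_64-server-6"),
]
print(needs_full_regeneration(same, last_run))
```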

       

      So there it is. It's up to BMC to fix it now, and I believe I've done more than a customer should have to do by suggesting those fixes. Now it's time to deliver a solution that the BMC user community actually needs, and to do so according to those needs. A lot of us are already committed to using BSA to meet our Red Hat patching requirements, and being stuck with something that works halfway or takes so much time to run is causing us a lot of grief, and possibly even monetary penalties if, for example, we can't update/patch servers quickly enough after a critical erratum is released. If the code has to be scrapped and redone to do it right, then so be it, but please stop pushing this back to the next release. We need fixes now, in the releases we are using now.

       

      I hope that this post will bring more visibility to these issues and put more pressure on BMC to address them accordingly. The bottom line is that a Red Hat patch catalog shouldn't take more than about 1 hour in total to update if you want it to be competitive with Red Hat Satellite (which takes less than 10 minutes to do the same) and other automated patching solutions, and it shouldn't kill the application server's performance while it's running either.

       

      Thank you for reading.

        • 3. Re: Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
          Yanick Girouard

          Hi Deepak Pawar, I'll share one log here, but if you need more we'll go through our premier support channels to submit them within the issue we have opened for the CUJ duration. We are already preparing an update to that ticket in relation to this post, which will include more details that the community doesn't need to see. I just don't want this thread to become a witch hunt or a troubleshooting one. That's not why I've posted it.

          • 4. Re: Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
            Akbar Aziz

            Yanick Girouard thank you for the details of your environment and the issues encountered. It seems like you have had a lot of trouble with something that should be easier and work out of the box. The team is investigating the issue, and we hope to have a plan in place to address the concerns you have raised in this post. I will reach out to you via email so we can discuss this in more detail, along with next steps.

             

            I will post an update to this thread when a plan of action is defined so others who have run into similar issues will be aware of what steps to take next.

             

            Akbar

            • 5. Re: Re: Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
              Yanick Girouard

              Akbar, before you do anything, I would encourage you to reproduce this run-time duration in your labs first, using the same product filters and channels we are using in both our offline downloader and our catalog, on revision 8.3.03.190. Then, and only then, try the same using your latest release (8.6) to see if the issue persists. If you can confirm to us that all of these issues have been fixed in the latest release and that a CUJ using offline mode and our current product filters takes no more than 1-2 hours (which is what would be acceptable for us), and you can demonstrate it (with screenshots and logs as proof), then it will be an incentive for us to upgrade.

               

              Here is the XML of our Red Hat downloader:

               

              <!--
              Please categorize the erratype/errata ID/update level filter as per one of the valid OS, Architecture values.
              Valid values for OS are RHES3, RHES4, RHES5, RHES6, RHAS3, RHAS4,
              Valid values for Architecture are s390x, x86 and x86_64
              
              Please use downloader command with -listChannel option
              to know applicable OS and Architecture.
              
              Expect Redhat Analysis to fail if OS Arch values are not from the above valid set of values.
              User is responsible for selecting correct combination of OS Arch, downloader will
              not validate it.
              -->
              
              <redhat-downloader-config>
                      <config>
                              <!--<proxy-settings>
                                      <port>8080</port>
                                      <host>127.0.0.1</host>
                                      <username>user</username>
                                      <password></password>
                                      <domain-name></domain-name>
                                      <proxy-type>ntlm-v2</proxy-type>
                              </proxy-settings>-->
                              <temporary-location>/BMC_Storage/tmp/offline_downloader</temporary-location>
                              <payload-repository-location>/BMC_Storage/patch_catalogs/linux</payload-repository-location>
                              <!-- The default value for download-request-retries will be 10 if no value is specified -->
                              <download-request-retries>10</download-request-retries>
                              <download-request-timeout>360000</download-request-timeout>
                              <downloader-parallel-threads>10</downloader-parallel-threads>
                      </config>
              
                      <subscription>
                              <errata-type-filter>
                                      <os>RHES5</os>
                                      <arch>x86</arch>
                                      <channel-label>rhel-i386-server-5</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
                              <errata-type-filter>
                                      <os>RHES5</os>
                                      <arch>x86_64</arch>
                                      <channel-label>rhel-x86_64-server-5</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
              
                              <errata-type-filter>
                                      <os>RHES6</os>
                                      <arch>x86</arch>
                                      <channel-label>rhel-i386-server-6</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
                              <errata-type-filter>
                                      <os>RHES6</os>
                                      <arch>x86</arch>
                                      <channel-label>rhel-i386-server-optional-6</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
                              <errata-type-filter>
                                      <os>RHES6</os>
                                      <arch>x86_64</arch>
                                      <channel-label>rhel-x86_64-server-6</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
                              <errata-type-filter>
                                      <os>RHES6</os>
                                      <arch>x86_64</arch>
                                      <channel-label>rhel-x86_64-server-optional-6</channel-label>
                                      <errata-severity>
                                              <critical>true</critical>
                                              <high>true</high>
                                              <moderate>true</moderate>
                                              <low>true</low>
                                      </errata-severity>
                                      <errata-type>
                                              <security>true</security>
                                              <bugfix>true</bugfix>
                                              <enhancement>true</enhancement>
                                      </errata-type>
                              </errata-type-filter>
                      </subscription>
              </redhat-downloader-config>
              
              
              

               

              * Please take note of the extra optional channels for RHES6
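
              For anyone who wants to compare their own downloader configuration against ours, here is a small sketch that lists the channel labels per OS/architecture from a file in the format shown above. The element names are taken from our XML; the function name and the trimmed-down sample are only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed-down sample in the same format as the configuration above.
SAMPLE = """
<redhat-downloader-config>
  <subscription>
    <errata-type-filter>
      <os>RHES6</os>
      <arch>x86_64</arch>
      <channel-label>rhel-x86_64-server-6</channel-label>
    </errata-type-filter>
    <errata-type-filter>
      <os>RHES6</os>
      <arch>x86_64</arch>
      <channel-label>rhel-x86_64-server-optional-6</channel-label>
    </errata-type-filter>
  </subscription>
</redhat-downloader-config>
"""

def channels_by_os_arch(xml_text):
    """Map each (os, arch) pair to the channel labels configured for it."""
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for flt in root.iter("errata-type-filter"):
        key = (flt.findtext("os"), flt.findtext("arch"))
        result[key].append(flt.findtext("channel-label"))
    return dict(result)

print(channels_by_os_arch(SAMPLE))
```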

               

              And here is our offline patch catalog:

              02-17-15 9-50-36 AM.png

               

              * Note: In our production environment, we also have RHES4x86 and RHES4_x86_64 in our OsArch list, so it would take even longer there.

              • 6. Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
                Joseph Schuler

                Hi Yanick,

                 

                What version of the JRE is embedded in your offline downloader? We had some problems with some versions of JRE in the 8.5.01 downloaders and were able to downgrade to fix it.

                • 7. Re: Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
                  Yanick Girouard

                  I'm using whatever release came with the latest revision of the downloader that was provided to me, which is 8.3.03-07. The version info of the java binary located inside the jre/bin folder of the offline downloader gives me this:

                   

                  # ./java -version

                  java version "1.6.0_13"

                  Java(TM) SE Runtime Environment (build 1.6.0_13-b03)

                  Java HotSpot(TM) Server VM (build 11.3-b02, mixed mode)

                   

                  Just to clarify however, the duration issue is not with the downloader (although it could most likely be optimized further), but with the CUJ itself, which runs in BSA after the offline downloader is done running. The last run of the offline downloader took a mere 35 minutes to run (9:39 to 11:13), which is fine.

                  • 8. Re: Re: The saga of the Red Hat Patch Management features in BSA - Issues, performance analysis and suggestions to fix or improve them
                    Joseph Schuler

                    Okay, Yanick,

                     

                    This is a different issue from what I was expecting.  The slowdown I saw before was in the metadata gathering portion of the offline downloader and not the CUJ.