* Disclaimer: This is a long read, but I recommend that anyone who's looking into BSA to analyze and update Red Hat servers read it, and that everyone else share their experience with it. This is not a rant, but rather constructive criticism of features that have so far proven extremely disappointing for us, if not unusable. Here is the story of our saga with the Red Hat patch management module of BSA, and the steps I believe BMC should take to fix and improve it to match our collective needs.
Ever since we started using the Red Hat Patch Catalog in BSA 8.3, we've had countless issues with it. Some concerned the analysis results, but most concerned the performance and duration of the Catalog Update Job (CUJ). Because we needed special child channels included in the catalog in order to properly analyze RHES6 targets (the optional channel was required to resolve some dependencies), we had no alternative but to start using the offline downloader. Because of the way the offline downloader works, the duration of the CUJ almost tripled compared to the online mode, making it almost useless, or at least extremely impractical, for us. Every time a bug was fixed, another one was found, and the fix sometimes made the duration of the CUJ even worse than before. It seemed that the more I dug in to understand how things worked under the covers, the more issues I found.
So far, we have opened many RFEs and defects covering the duration and performance issues, as well as other bugs related to incorrect analysis results or parsing of errata data. Out of 4 requests for enhancement and 8 defects opened, only 4 were resolved, while the remaining ones are the most important and concern the core performance and run-time duration issues.
The last release, revision 8.3.03.190 (along with offline downloader revision 8.3.03-07), made things so much worse that it's now clear we can no longer use it until the performance is drastically improved. Since we upgraded to that revision in our development and staging environments, the average duration of our CUJ went from 4h23m (using 8.3.03.116) to 24h11m! Nothing changed on our server other than the revision of the app server and the offline downloader. Something in the fixes brought by the last release is obviously not optimized and actually made things worse.
Here is a screenshot showing the evolution of the CUJ duration (always in offline mode) since we started using it in our development environment:
* Note that although the catalog name was changed starting at r190, it is the same catalog; it was just renamed in the console.
Following this testing, I tried to break down the CUJ process in depth and pin-point what was in fact making it that slow, to see if I could find obvious places to improve. Again, please note that although I'm only showing a single environment here, the same test was done on another application server with the same results. I confirm that nothing changed in the hardware/software setup of the servers other than the version of BSA and the offline downloader. Here are my analysis results.
Things to improve in order to reduce execution time of Red Hat CUJ and limit performance impact
[Common for online and offline modes]
- Allow the CUJ to use more than one CPU, when available, for all Java-based phases
It was noticed that during certain Java-based phases of the CUJ, only one CPU was being used at full capacity on the application server even though 2 CPUs were available. The step in question takes more than 1 hour to execute, so the impact is not negligible. Splitting the load across multiple CPUs to reduce the duration would be a logical solution.
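For reference, here is roughly how this kind of single-threaded behavior can be confirmed from the shell; this is a sketch, and "appserver_java" is a placeholder for the catalog JVM's actual command-line pattern, not a real BSA process name:

```shell
# While the slow CUJ step is running, locate the JVM and check its
# thread count and CPU usage. "appserver_java" is a placeholder
# pattern, not the real BSA process name.
pid=$(pgrep -f appserver_java | head -n 1)
if [ -n "$pid" ]; then
    # NLWP = number of threads; %CPU pinned near 100 on a 2-CPU host
    # means only one core is effectively being used.
    ps -o pid,nlwp,pcpu,comm -p "$pid"
else
    echo "no matching process"
fi
```

A process can have hundreds of threads (high NLWP) and still saturate only one core if a single thread does all the work, which is what we observed here.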
- Spawn "Depot Object Processing batch" work item threads on multiple job servers instead of using all available work item threads on the application server the CUJ is running against.
This effectively prevents any additional jobs from running on the application server while the CUJ is going through this phase. In a multi-application-server environment, you would expect this kind of threading to be spread across all application servers to speed up processing and limit the impact. Also, the number of batches to process is usually higher than the maximum number of work item threads (WITs) available, so it would take less time if they could all spawn on the available application servers instead of being executed in groups (i.e. start with 49 out of 50 available WITs, then when those are done, spawn more, and so on until all batches are done, as opposed to spawning all 150 batches across 6 available application servers and only using 25 WITs on each).
- Add the option to specify the maximum number of concurrent work item threads the CUJ can use at any time.
This is mainly for the Depot Object Processing phase. For environments that have only a few application servers, this would prevent one of them from becoming unavailable for other jobs because all its WITs are used by the CUJ.
- Improve the performance of the createrepo wrapper (create_repo_wrapper.sh) by using createrepo's --update and --cachedir options.
Based on observations, the CUJ creates a temporary workspace directory (i.e. catalog_2023204.part/RHES6x86) where the repository data is compiled using createrepo. This directory is recreated every time, and all metadata has to be regenerated by the createrepo command at every run of the CUJ. This takes a considerable amount of time (and CPU) and could be drastically improved if the repodata directory were kept between runs and reused, with the --cachedir and --update options of createrepo. From the man page of createrepo:
-c --cachedir <path>
Specify a directory to use as a cachedir. This allows createrepo to create a cache of checksums of packages in the repository. In consecutive runs of createrepo over the same repository of files that do not have a complete change out of all packages this decreases the processing time dramatically.
--update
If metadata already exists in the outputdir and an rpm is unchanged (based on file size and mtime) since the metadata was generated, reuse the existing metadata rather than recalculating it. In the case of a large repository with only a few new or modified rpms this can significantly reduce I/O and processing time.
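As a rough sketch of what the wrapper could do instead, assuming the workspace directory were preserved between runs (which it currently is not): the function below is hypothetical replacement logic, with placeholder paths, and it only echoes the command for illustration.

```shell
# Hypothetical replacement logic for create_repo_wrapper.sh: reuse the
# repodata directory and checksum cache from the previous CUJ run.
run_createrepo() {
    repo_dir="$1"                           # per-OsArch workspace directory
    cache_dir="$repo_dir/.createrepo-cache"
    mkdir -p "$cache_dir"
    if [ -d "$repo_dir/repodata" ]; then
        # Metadata from a previous run exists: update incrementally.
        set -- createrepo --update --cachedir "$cache_dir" "$repo_dir"
    else
        # First run: full generation, seeding the checksum cache.
        set -- createrepo --cachedir "$cache_dir" "$repo_dir"
    fi
    echo "would run: $*"   # dry-run for illustration; a real wrapper would execute it
}
```

Per the man page excerpt above, on a large repository with only a few new or modified rpms, every run after the first then becomes an incremental update instead of a full metadata rebuild.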
- Use "nice" (for Red Hat application servers) to limit the CPU impact of *any* script or command launched by the CUJ.
This one is self-explanatory.
[Online catalog mode only]
- Let us select the repository channels we want to include per OsArch instead of hard-coding them.
This is the main reason why we have to use the offline downloader, even though it was not meant for that. If we need any extra channel that is not part of the hard-coded ones, even if our application server has direct access to the Internet, we have to use the offline downloader to get it. I reckon this requires major UI changes, but it shouldn't change much of the mechanics you're currently using. All we need is a checklist of channels to select for each OsArch, with the selected list fed to the downloader instead of hard-coded channel names. If you want BSA to become a valid replacement for Red Hat Satellite 6, for example, this is a must, and it goes along with how things are done in the field. Unix sysadmins understand the concept of channels, but not of "product filters" with the actual channel names hard-coded and hidden like you have now.
i.e. We would select RHES6 x86_64 in a drop-down and would then be presented with the choice of selecting any of the following, which is basically all the channels available to the offline downloader when filtering on RHES6 and x86_64:
[Offline catalog mode only]
- Make it so the catalog does not have to regenerate all Errata metadata every time it runs in offline mode.
There is a huge gap between the duration of a Red Hat CUJ for an online catalog and that of an offline catalog using the same filters (channels). Much of this gap is due to the fact that the offline downloader must regenerate the Errata metadata from scratch every time it runs (so I observed, and BMC's dev team confirmed it). The reason given is that it can't tell whether the catalog filters have changed (because it runs externally) and needs to know which OsArchs to add to the "Supported OsArch" property of the Errata objects so that it matches the OsArchs included in the catalog.
I'm not clear on why this is such a problem to resolve, as I can think of several ways it could be fixed, but I'll let your dev team brainstorm on it. Why not simply list all supported OsArchs applicable to an errata, even if an OsArch isn't present in the catalog? Why does it matter? When you look at the Errata page on the Red Hat Customer Portal, it lists all supported products, so why can't BSA use this? The offline downloader is wasting a gigantic amount of time and resources reprocessing those on every run. If I'm missing something crucial, please explain.
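One such way, sketched below under the assumption that the downloader keeps a small state file between runs (file names are hypothetical): hash the channel-filter configuration and only regenerate the Errata metadata when the hash differs from the previous run.

```shell
# Decide whether errata metadata needs regenerating by comparing a
# hash of the channel-filter file against the one saved last run.
filters_changed() {
    filter_file="$1"                  # hypothetical channel-filter config
    stamp_file="$filter_file.sha256"  # hash recorded by the previous run
    new_hash=$(sha256sum "$filter_file" | cut -d' ' -f1)
    if [ -f "$stamp_file" ] && [ "$(cat "$stamp_file")" = "$new_hash" ]; then
        echo "unchanged"   # safe to reuse the existing errata metadata
    else
        echo "changed"     # regenerate, then record the new hash
        printf '%s\n' "$new_hash" > "$stamp_file"
    fi
}
```

The first run reports "changed" and records the hash; subsequent runs with the same filters report "unchanged", so the expensive regeneration step could be skipped.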
So there it is. It's up to BMC to fix it now, and I believe I've done more than a customer should by suggesting those fixes. Now it's time to deliver a solution that the BMC user community actually needs, and to do so according to those needs. A lot of us are already committed to using BSA for our Red Hat patching requirements, and being stuck with something that works halfway, or takes this much time to run, is causing us a lot of grief, and possibly even monetary penalties if we can't update/patch servers quickly enough after a critical errata is released, for example. If the code has to be scrapped and redone to do it right, then so be it, but please stop pushing this back to the next release. We need fixes now, in the releases we are using now.
I hope this post brings more visibility to these issues and puts more pressure on BMC to address them accordingly. The bottom line is, a Red Hat patch catalog shouldn't take more than about an hour in total to update if you want it to be competitive with Red Hat Satellite (which takes less than 10 minutes to do the same) and other automated patching solutions, and it shouldn't kill the application server's performance while it's running either.
Thank you for reading.