
When I encounter an issue that throws an error message, I typically take the obvious route: determine what the error means, search for likely causes, and if needed, enable more verbose logging to capture more detail on what is happening.  This approach is basically to start from the point of failure and track backwards or dig deeper.

 

However, when confronted with a complex issue in a multi-tiered environment – where there may be no error, where the failure occurs in another tier, or where data issues were introduced well upstream – it is sometimes useful to take a systems view of diagnosing issues. This approach looks for failures across many tiers and validates that features and components work individually. I call this the systems approach to diagnostics.

 

In practice, when I look into issues, I use a hybrid approach that combines some knowledge of how the feature works and its dependencies with appropriate diagnostic tools to narrow in on where the issue is occurring.

 

In this post, I will summarize several types of diagnostic tools that enable the systems approach to diagnostics, and the relative merits and limitations of each.

 

 

Component test / unit test

 

The goal of a component test is to validate that a particular feature, server process, or product component is basically operational.  It is a diagnostic that is run ad hoc. If a component test fails, it quickly identifies a point of failure to be investigated further.  A very simple example would be to verify a server process is running. A more meaningful unit test would be to verify a feature can perform a simple unit of work.  Most component tests are very simple, because it takes time to perform each of them.
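To make this concrete, here is a minimal sketch of a component test in Python. The process name and port below are placeholders, not values from any particular product; substitute whatever your server actually uses.

```python
# Minimal component-test sketch. The process name and port are hypothetical
# placeholders -- replace them with the ones your product actually uses.
import socket
import subprocess

def process_running(name: str) -> bool:
    """Return True if a process matching the given name appears in the process list."""
    # 'pgrep' is available on most UNIX systems; on Windows you would use 'tasklist'.
    result = subprocess.run(["pgrep", "-f", name], capture_output=True)
    return result.returncode == 0

def port_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    checks = {
        "server process running": process_running("appserver"),      # placeholder process name
        "server port listening":  port_listening("localhost", 8080), # placeholder port
    }
    for name, ok in checks.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```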

 

Since unit tests validate a discrete feature, does it make sense to automate many unit tests, to simplify the process of running them, and to generate a more comprehensive result? It does, to a point. There are a few challenges that inhibit a fully automated unit test system. First, how much can you realistically unit test in a reasonable period of time, without negatively impacting the system, and without encountering false failures? The “keep it simple” principle is relevant here.

 

Second, component tests should only verify features that are installed, enabled, and relevant for the solution. For enterprise products which support a variety of post-installation deployment options for load balancing and high availability across multiple servers, this presents a significant challenge. Reliable unit test automation requires some sort of centralized configuration to track the current deployment. This requirement is sometimes at odds with maintaining maximum deployment flexibility and redundancy in a high availability environment. It is always a goal to automate functionality for greater efficiency, consistency, and less effort, but in practice it is easiest to achieve in the context of a solution which limits or drives deployment choices.

 

A component test or unit test diagnostic only goes so far. A failure indicates something to investigate further. A non-failure means the test did not reveal useful information, so progress to your next step in the diagnostic process.

 

Integration / connectivity unit test

 

The integration / connectivity unit test is a logical extension of the component unit test. It is also an ad hoc test, but it involves multiple components and may span multiple servers. The goal is to verify that a particular conduit of information, or “pipe”, works. This can be a simple exercise when testing an operation that pulls data back, but substantially more complex when trying to push data through. There are a few additional challenges beyond those described above for the component test.
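As an illustration, here is a hedged sketch of a pull test and a push test against a hypothetical HTTP endpoint. The URL, paths, and payload are assumptions, and the pushed record is deliberately tagged so it can be found and cancelled out later.

```python
# Connectivity unit-test sketch: verify one "pipe" between tiers.
# The endpoint URL and payload below are hypothetical placeholders.
import json
import urllib.request

BASE_URL = "http://midtier.example.com:8080/api"   # assumed endpoint, not a real product URL

def test_pull() -> bool:
    """Pull test: read a small, known piece of data through the pipe."""
    try:
        with urllib.request.urlopen(f"{BASE_URL}/version", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def test_push() -> bool:
    """Push test: send a clearly tagged test record so it can be found and cancelled later."""
    record = json.dumps({"summary": "DIAGNOSTIC TEST - please ignore",
                         "source": "connectivity-check"}).encode()
    req = urllib.request.Request(f"{BASE_URL}/records", data=record,
                                 headers={"Content-Type": "application/json"}, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status in (200, 201)
    except OSError:
        return False

if __name__ == "__main__":
    print("pull:", "PASS" if test_pull() else "FAIL")
    print("push:", "PASS" if test_push() else "FAIL")
```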

 

One challenge is test data segregation.  You want to perform the test in a realistic simulation of production functionality, yet you do not want this test data to be visible to actual operators or to influence metrics.   When performing the test manually, finding and cancelling out the test data is part of the exercise, but automation tools need to consider how to do this programmatically.

 

Another challenge is admin user channel conflict. Testing with the same user account, using authentication parameters read from the same place, is ideal for replicating the behavior of the actual feature, but using the same user account for multiple purposes can complicate administration and log analysis.

 

End-to-end system validation / heartbeat

 

This kind of scheduled or recurring diagnostic runs test cases and monitors the health of the feature. It is sometimes called a “heartbeat” diagnostic. It tests a feature from end to end at a periodic interval by pushing a test and monitoring for the expected automated result. This sort of diagnostic is only possible when there is an expected, automated response back to confirm correct completion, so it is not appropriate for all kinds of features.
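A rough outline of such a heartbeat, assuming the feature exposes some way to submit a test transaction and check its result. The submit_test and check_result calls below are placeholders to be replaced with the feature's real interfaces.

```python
# Heartbeat sketch: push a test unit of work on a schedule and confirm the
# automated response comes back within a threshold. submit_test() and
# check_result() are hypothetical stand-ins for the feature's own calls.
import time

INTERVAL_SECONDS = 300        # how often to run the heartbeat
RESPONSE_TIMEOUT = 60         # how long to wait for the expected automated result

def submit_test() -> str:
    """Submit a tagged test transaction; returns an id used to look up the result."""
    raise NotImplementedError("replace with the feature's own submit call")

def check_result(test_id: str) -> bool:
    """Return True once the expected automated response for test_id is visible."""
    raise NotImplementedError("replace with the feature's own status call")

def alert(message: str) -> None:
    print("ALERT:", message)   # stand-in for an email, event, or service-desk ticket

def heartbeat_once() -> None:
    start = time.time()
    test_id = submit_test()
    while time.time() - start < RESPONSE_TIMEOUT:
        if check_result(test_id):
            # the elapsed time doubles as a performance metric for the overall feature
            print(f"heartbeat OK in {time.time() - start:.1f}s")
            return
        time.sleep(5)
    alert(f"heartbeat {test_id} did not complete within {RESPONSE_TIMEOUT}s")

if __name__ == "__main__":
    while True:
        heartbeat_once()
        time.sleep(INTERVAL_SECONDS)
```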

 

This kind of diagnostic has some great benefits. In addition to telling you when the overall feature does not work, it can also track the performance of the overall feature, and it can proactively send events or notifications to alert the service desk or administrator.

 

This type of diagnostic faces all of the challenges described for component test and integration test type of diagnostics, and is most successful when implemented as part of a solution which drives or limits post installation deployment options.

 

Metadata consistency checker

 

Multi-tier enterprise applications often employ functionality which abstracts underlying technology, storing metadata to track the mapping to foreign objects. For example, a database view references database tables, and it is the responsibility of the DBMS to maintain its metadata to ensure it is accurate. In corner cases where operations fail or when changes are made to underlying objects, the metadata may become inaccurate. The symptom encountered may be an error from a lower level in the stack when an operation is attempted at a higher level.

 

The role of a metadata consistency checker is to compare metadata with actual objects and report discrepancies. These diagnostics are typically only needed in rare situations, but when they are, they isolate an issue efficiently and consistently.
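A minimal sketch of the idea, with the two lookup functions left as placeholders for the application's dictionary and the underlying catalog:

```python
# Metadata consistency-checker sketch: compare what the metadata says exists
# with the objects that actually exist, and report the differences both ways.
# get_metadata_objects() and get_actual_objects() are hypothetical stand-ins
# (e.g. rows from an application dictionary vs. names from the DBMS catalog).
def get_metadata_objects() -> set[str]:
    raise NotImplementedError("query the application's metadata/dictionary")

def get_actual_objects() -> set[str]:
    raise NotImplementedError("query the underlying catalog, e.g. information_schema.tables")

def check_consistency() -> int:
    metadata, actual = get_metadata_objects(), get_actual_objects()
    missing = sorted(metadata - actual)      # metadata points at objects that no longer exist
    untracked = sorted(actual - metadata)    # objects exist but the metadata does not know about them
    for name in missing:
        print(f"MISSING OBJECT: metadata references '{name}' but it was not found")
    for name in untracked:
        print(f"UNTRACKED OBJECT: '{name}' exists but has no metadata entry")
    return len(missing) + len(untracked)

if __name__ == "__main__":
    raise SystemExit(1 if check_consistency() else 0)
```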

 

Data quality analysis

 

Data quality issues can occur in enterprise applications for a number of reasons, including implementation decisions, user error, bad data imports, integrations, feature changes, defects, or inappropriate workarounds which bypass application data validation features. Following best practices, using data validation functionality, and following good implementation process can avoid most data issues. Even when all the data is valid, there may be a need to quantify how much data you have, what type, and where it resides. Data quality analysis tools assist with this process.

These tools are accessible in a wide variety of ways. In some cases, relevant counts are displayed or accessible in the user interface because they are commonly needed when using the feature. In other cases, counts are collected as part of diagnostic features because the amount and type of data is relevant to understanding performance issues. Database or application queries may be used to address specific needs, and most applications employ a search mechanism that allows users to query for a variety of common use cases. Finally, there may be a specific utility to check for or correct specific data situations that can occur.
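As a simple illustration, a handful of counting queries already goes a long way. The table and column names below are hypothetical, and sqlite3 is used only to keep the sketch self-contained; point the same idea at the application database or its query API.

```python
# Data quality sketch: quantify how much data you have and flag common problems.
# Table and column names are hypothetical examples.
import sqlite3

CHECKS = {
    "total assets":          "SELECT COUNT(*) FROM assets",
    "assets missing owner":  "SELECT COUNT(*) FROM assets WHERE owner IS NULL OR owner = ''",
    "duplicate asset names": "SELECT COUNT(*) FROM (SELECT name FROM assets GROUP BY name HAVING COUNT(*) > 1)",
    "assets per status":     "SELECT status, COUNT(*) FROM assets GROUP BY status",
}

def run_checks(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        for label, sql in CHECKS.items():
            for row in conn.execute(sql):
                print(f"{label}: {row}")

if __name__ == "__main__":
    run_checks("application.db")   # hypothetical database file
```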

 

Both the challenge and the merit of data quality analysis tools is that they are closely aligned with the application itself. The guideline is simple: if you think there may be a data quality issue, check for and run the data quality tools included with the application.

 

Diagnostic collection

 

A feature which collects system, version, configuration, logs and other diagnostic information to a single file is not in itself a system diagnostic utility, but it is a step in that direction.  Collecting relevant information from many locations to a single file in a standardized format is a step in the process.  Collecting such information from other servers is another step. Finally, running an error analyzer against the files to look for errors at all levels in the stack is the systems diagnostic.  The principle is “collect everything that might be relevant”, so successive analysis routines have all the information required.
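A bare-bones sketch of such a collector, with assumed source directories and a small manifest recording what was gathered and from where:

```python
# Diagnostic-collection sketch: gather "everything that might be relevant"
# into one zip so later analysis passes have all the inputs they need.
# The source paths are hypothetical examples.
import json
import platform
import zipfile
from datetime import datetime
from pathlib import Path

SOURCES = [
    Path("/opt/product/conf"),    # configuration files (assumed location)
    Path("/opt/product/logs"),    # log files (assumed location)
]

def collect(output: str) -> None:
    manifest = {
        "collected_at": datetime.now().isoformat(),
        "host": platform.node(),
        "os": platform.platform(),
        "files": [],
    }
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as zf:
        for source in SOURCES:
            for path in source.rglob("*"):
                if path.is_file():
                    zf.write(path, arcname=str(path.relative_to(source.parent)))
                    manifest["files"].append(str(path))
        # store the manifest alongside the collected files in a standardized format
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))

if __name__ == "__main__":
    collect("diagnostics.zip")
```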

 

The challenge of this approach is that the information collected can be substantial, so the file can be large. Proper care needs to be taken to either collect information within the relevant time range or collect only non-verbose logs.

 

One merit of this approach is de-coupling the collection and analysis steps. This allows the analysis routines to be improved over time, since the collection process itself is straightforward. Another merit is that if no analyzer identifies the issue, the file can be easily shared with a subject matter expert who can review the information directly.

 

Error analyzer

 

The error analyzer was introduced above as part of the motivation for a diagnostic collection feature. For the systems approach to diagnostics, this means looking for trends and correlating errors in diagnostics either by timeframe or by operation. There are numerous challenges to this approach. Errors and timestamps are represented differently, logs have varied formats, and you may get a large number of errors for a particular recurring failure. The goal is for the error analyzer to provide trends, categories, and higher-level summaries of failures. If the analyzer just reported all errors encountered anywhere, it would be counterproductive for focusing an investigation. A key feature for this purpose is to report on unique error messages.
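A small sketch of that idea: scan collected logs, normalize away timestamps and numeric ids, and report each unique message with a count. The patterns are deliberately simple and would need tuning for real log formats.

```python
# Error-analyzer sketch: collapse a large volume of log errors into unique,
# counted messages so a recurring failure shows up once with a count instead
# of thousands of times. Normalization rules here are intentionally crude.
import re
from collections import Counter
from pathlib import Path

ERROR_LINE = re.compile(r"ERROR|SEVERE|Exception", re.IGNORECASE)

def normalize(line: str) -> str:
    """Strip the parts that vary between occurrences of the same error."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*", "<timestamp>", line)
    line = re.sub(r"\b\d+\b", "<n>", line)          # ids, counts, thread numbers
    return line.strip()

def summarize(log_dir: str, top: int = 20) -> None:
    counts: Counter[str] = Counter()
    for path in Path(log_dir).rglob("*.log"):
        for line in path.read_text(errors="replace").splitlines():
            if ERROR_LINE.search(line):
                counts[normalize(line)] += 1
    for message, count in counts.most_common(top):
        print(f"{count:6d}  {message}")

if __name__ == "__main__":
    summarize("collected_logs")   # hypothetical directory of collected logs
```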

 

One of the best uses of an error analyzer is to routinely identify low-level errors which can be investigated and resolved to keep the system running smoothly and cleanly. Then, if a problem occurs, running the error analyzer surfaces the specific deviation, with a message that can be researched.

 

Configuration analyzer

 

The goal of a configuration analyzer is to check for configuration errors, outliers, or values outside recommended settings.  For enterprise software solutions, there may be many possible ways to configure an application where the best values are dependent on configuration values in other applications.  This is essential for versatility. If there were only one valid configuration, the value would be hard-coded and not configurable.   A configuration error is a situation where one configuration is clearly out of sync with a related setting.  A configuration outlier is a potentially intentional configuration value which merits further exploration because it is well outside of the norm.  A configuration value outside recommended settings is the case where there are clear guidelines of how to configure based on other variables like system size, but for which the specified configuration is not the recommended value.  Again this may be intentional, and it may be temporary, but the value of a configuration analyzer would be to raise awareness of this situation.
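A toy sketch of the three kinds of checks, using hypothetical setting names and thresholds purely for illustration:

```python
# Configuration-analyzer sketch: check settings against three kinds of rules --
# hard errors (settings out of sync with each other), outliers (far outside
# the norm), and values outside a size-based recommendation.
# All setting names and thresholds below are made up for illustration.
def analyze(config: dict) -> list[str]:
    findings = []

    # Error: two related settings clearly out of sync.
    if config.get("server_port") != config.get("client_target_port"):
        findings.append("ERROR: client_target_port does not match server_port")

    # Outlier: possibly intentional, but far outside the norm.
    if config.get("session_timeout_minutes", 60) > 24 * 60:
        findings.append("OUTLIER: session_timeout_minutes is more than a day")

    # Outside recommendation: guideline derived from another variable (system size).
    users = config.get("expected_users", 0)
    recommended_threads = max(8, users // 100)
    if config.get("worker_threads", 0) < recommended_threads:
        findings.append(f"RECOMMENDATION: worker_threads below suggested "
                        f"{recommended_threads} for {users} users")

    return findings

if __name__ == "__main__":
    sample = {"server_port": 8443, "client_target_port": 8444,
              "session_timeout_minutes": 30, "expected_users": 2000, "worker_threads": 8}
    for finding in analyze(sample):
        print(finding)
```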

 

The value of a configuration analyzer is to quickly validate many configuration settings. The challenge is determining how strongly to enforce each setting. If a setting is intentionally set otherwise, should the user be able to flag it as intentional and suppress the warning going forward, thereby hiding an important deviation? Should the administrator be able to refine the definition of what is normal or the expected configuration? Defining the dependencies is another challenge when the number of related products and tiers varies greatly.

 

A configuration analyzer can be designed to run on the server itself from the UI, or it can be a separate process that is recommended but not exposed or enforced within the application. There are merits to each approach. The value of a separate process is that the user can be informed but not blocked from proceeding, which loosens the requirement to strictly define which configuration errors are blockers vs. merely noteworthy. A separate process also allows the configuration analyzer to be updated easily with additional checks. However, a separate process risks the diagnostic being forgotten, missed, and therefore of no value.

 

Automatic updates to analyzers

 

A good systems diagnostic framework allows the analyzer capabilities to grow and evolve both with the product and with experience using it.   This is a delicate balance because configuration abilities differ by product versions, so closely tying the analyzer to the product addresses this challenge somewhat, though the complexity is multiplied as soon as you introduce multiple products and versions.   Several of the paragraphs above describing analyzers imply the value of a separate tool because it can take updates which leverage new routines and checks.   A desirable solution would be to make the analyzers accessible in the product and connect over the internet to download the latest updates.  Since most enterprise application servers are not exposed directly to the internet, a good compromise may be to download and import the latest routines.

 

Job run or task status

 

When an application user runs a job or performs an operation, the results are typically displayed in the user interface so the user knows the outcome when it completes. If the same job is scheduled and run periodically without user intervention, the results of the job are again stored and reported in the same place. This is a good principle for application design – diagnostics are consistent no matter how or when the feature runs. The challenge from a systems diagnostic perspective is: how do you know when a scheduled job that is part of a larger overall process fails? How can you correlate this failure with failures in other related applications?

 

There are a few common approaches for incorporating job statuses into an overall diagnostic framework. One is for the client or actor which runs the job to monitor or wait on the job status and use its own diagnostics to report the failure. Another is to have job failures trigger a notification or an event that can be acted upon. Finally, job failures may be captured by a diagnostic collection utility as described above, so an offline analysis tool can report on them.
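A sketch of the first two approaches, where the status and event calls are placeholders for whatever scheduler API and event integration are actually available:

```python
# Job-status sketch: the caller waits on the job status and reports through
# its own diagnostics, and a failure raises a notification/event.
# get_job_status() and send_event() are hypothetical stand-ins.
import time

def get_job_status(job_id: str) -> str:
    """Return 'RUNNING', 'SUCCESS', or 'FAILED' for the scheduled job."""
    raise NotImplementedError("replace with the scheduler's or application's status API")

def send_event(severity: str, message: str) -> None:
    print(f"[{severity}] {message}")   # stand-in for an event-manager or service-desk integration

def wait_for_job(job_id: str, timeout: int = 3600, poll: int = 30) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_job_status(job_id)
        if status == "SUCCESS":
            return True
        if status == "FAILED":
            send_event("CRITICAL", f"job {job_id} failed")
            return False
        time.sleep(poll)
    send_event("WARNING", f"job {job_id} still running after {timeout}s")
    return False
```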

 

Error logs and queues

 

Some enterprise application features leverage back-end processes which record errors to a log file or table when failures occur, and some queue up asynchronous operations to be processed sequentially on a first-in-first-out basis.  These are key diagnostics when back-end processes fail or are backed up.  From a systems analysis perspective, I would want these diagnostics to be exposed as another data source for the error analyzer so the results could tell me meaningful information about all relevant errors – whether they are recorded in the database or in a log file.  Also, it would be good for queued operations to indicate a failure if the queue grew too large, or it did not diminish at a sufficient rate – since these would be situations warranting further investigation.
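A sketch of such a queue check, with an assumed depth query and illustrative thresholds:

```python
# Queue-monitoring sketch: flag a queue that is too deep or not draining.
# get_queue_depth() is a hypothetical stand-in for a count against the queue
# table or API; the thresholds are illustrative only.
import time

MAX_DEPTH = 10_000          # absolute size that always warrants a look
MIN_DRAIN_RATE = 1.0        # entries per second the queue should shrink by when backed up

def get_queue_depth() -> int:
    raise NotImplementedError("e.g. SELECT COUNT(*) FROM pending_operations")

def check_queue(sample_seconds: int = 60) -> None:
    first = get_queue_depth()
    time.sleep(sample_seconds)
    second = get_queue_depth()
    drain_rate = (first - second) / sample_seconds
    if second > MAX_DEPTH:
        print(f"ALERT: queue depth {second} exceeds {MAX_DEPTH}")
    elif second > 0 and drain_rate < MIN_DRAIN_RATE:
        print(f"ALERT: queue not draining (depth {second}, rate {drain_rate:.2f}/s)")
    else:
        print(f"OK: depth {second}, drain rate {drain_rate:.2f}/s")
```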

 

One merit of recording errors in queues is that it presents a logical way to organize errors, identify trends, and find actionable information. It is also typically easier to implement a notification or event in the case of errors, which is difficult to achieve when errors are written to files. The challenge of recording errors in tables or within the application data is that they are not accessible to external analysis; the data must first be retrieved or dumped to enable analysis.

 

Event manager

 

Several diagnostic capabilities described above referenced “sending an event”, and those assume there is an event manager to accept and manage events. The value of an event manager rather than a simple notification is that the event can be correlated to other events, prioritized, have policies applied, and flow into a defined process to expedite resolution. The challenge is deciding where to apply this correlation logic. As discussed above, some knowledge of how the features map to components is required, and this understanding resides with the application itself rather than the event manager. The key decisions in successfully implementing an event manager diagnostic are deciding which events matter and determining how to act on them.
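A minimal sketch of the correlation idea, with an illustrative priority policy keyed by component; a real event manager adds policies, escalation, and workflow on top of this.

```python
# Event-manager sketch: accept events, correlate repeats by a key
# (here host + component), and apply a simple priority policy.
# The policy mapping below is made up for illustration.
from dataclasses import dataclass, field
from datetime import datetime

PRIORITY_POLICY = {"heartbeat": "critical", "queue": "major", "config": "minor"}

@dataclass
class Event:
    host: str
    component: str
    message: str
    count: int = 1
    first_seen: datetime = field(default_factory=datetime.now)

class EventManager:
    def __init__(self) -> None:
        self.open_events: dict[tuple[str, str], Event] = {}

    def raise_event(self, host: str, component: str, message: str) -> Event:
        key = (host, component)
        if key in self.open_events:
            self.open_events[key].count += 1          # correlate repeats instead of flooding
        else:
            self.open_events[key] = Event(host, component, message)
        event = self.open_events[key]
        priority = PRIORITY_POLICY.get(component, "minor")
        print(f"{priority.upper():8s} {host}/{component} x{event.count}: {message}")
        return event
```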

 

The diagram below illustrates how I view these diagnostics as a progression of less to more comprehensive coverage.  It would be nice to whittle the diagnostic stack to just alerts which tell you about real failures you care about, but doing that effectively requires a dozen prerequisite steps to be leveraged.

[Diagram: 12stepsdiagram.jpg – progression from less to more comprehensive diagnostic coverage]

 

My experiences with taking a systems approach to diagnostics have revealed there are many incremental steps toward a more comprehensive, more efficient diagnostic system.  There are nuances, tradeoffs, and different paths to be taken.  Many of the steps are complementary or dependent on other steps, so there is no single golden path that applies generically across all situations.  The best next step is generally a choice of return on investment – how to deliver the most value in the shortest span of time.   Hopefully the information in this post provides a good framework for understanding the journey.

 

This post represents my own thoughts and experiences and does not necessarily represent BMC's position, strategies, or opinion. I welcome any feedback, please use the Add Comments link below.

 

Jesse


As a long-time member of Customer Support, I spend a lot of time thinking about how to isolate and resolve issues more quickly.  Once I find the root cause, I ask myself,  “How could I have found it more quickly?”  There are differences between BMC products and the issues that occur, but also similarities in the investigative process. I have found some practices that work well for me and in this blog, I hope to share those tools, techniques and resources that have been useful across several BMC products.

 

If you are not familiar with it, the Maintenance Tool is installed with many BMC products to simplify diagnostic and maintenance activities, including:

  • BMC Cloud Lifecycle Management
  • BMC Atrium CMDB Suite
  • BMC Atrium Orchestrator
  • Most of the BMC Remedy products and applications, including BMC Remedy AR System Server, BMC Remedy IT Service Management Suite
  • BMC ProactiveNet Performance Management.

 

The exact name, location, and features are specific to the product and version that installed the Maintenance Tool, but it is typically found in a subdirectory of the product installation with a file name such as:

  • <product name>MaintenanceTool.cmd (Windows)
  • <product name>MaintenanceTool.sh  (UNIX)

 

The most commonly used feature of the Maintenance Tool is Zip Logs, used to collect diagnostics to a file which can be provided to Customer Support.  This output file collects the most commonly needed product, environment, and diagnostic information. In the most recent product versions, the Maintenance Tool also includes features to review these collected diagnostics and perform elementary analysis.  These are the features I will describe in this post.

 

If the Browse to Log button appears on the Logs page, the features described in this post should be available in the Maintenance Tool.  You can use any Maintenance Tool that has these features to view a Zip Logs output file, even if the zip logs file was created from an earlier version of the product or from a different BMC product.

 

[Screenshot: browseToLog2.png – the Browse to Log button on the Logs page]

 

To open files:

  1. Click Browse to Log
  2. In the navigation tree, open a directory or Zip file to view the included files.
  3. Double-click a file to open it in the viewing pane.
    The file opens in the best viewer for that file.

 

Below are some files included and the viewer which opens to view them:

 

File: ProductRegistry.xml
Location: Windows directory (Windows); etc directory (UNIX)
Viewer: Paged XML viewer
Content: Product versions, product install locations, and product and feature installation history.

File: OperatingSystemsData.xml
Location: User temporary directory on the origin system
Viewer: Paged XML viewer
Content: Environment details at the time Zip Logs was performed. Includes environment variables and may include System Memory, Partitions, Registry Keys, status of Ports, or other system details.

File: <product>InstalledConfiguration.xml
Location: User temporary directory on the origin system
Viewer: Paged XML viewer
Content: Environment information and installation options during the last successful installation.

File: <product>InstallingConfiguration.xml
Location: User temporary directory on the origin system
Viewer: Paged XML viewer
Content: Environment information and installation options during the last attempted installation.

File: <product>_install_log.txt and <product>_uninstall_log.txt
Location: User temporary directory on the origin system
Viewer: Log viewer
Content: Install logs and uninstall logs.

File: <product>_configuration_log.txt
Location: User temporary directory on the origin system
Viewer: Log viewer
Content: Activity in the Maintenance Tool, including Zip Logs and Health Check (in some products).

File: Product specific configuration files
Location: Under the product installation directory on the origin system
Viewer: Text viewer
Content: Product configuration.

File: Product specific log files
Location: Under the product installation directory on the origin system
Viewer: Text viewer
Content: Product log file.

File: Product specific image files
Location: No standard image files collected.
Viewer: Image viewer
Content: View the image.

 

Each file opens in a separate page in the appropriate viewer for the file type. This makes it easier to examine and compare information in multiple files. For example, you may view a screenshot of a problem to find a value and then search for that value in a collected log file.

 

Features of Viewers

Paged XML Viewer
  • Resize columns
  • Sort columns by clicking a column header
Log Viewer
  • Resize columns
  • Sort columns by clicking a column header
  • Indicator of the highest severity level in the log; hover over the colored box in the right scroll area to see a count of exceptions
  • Color bars indicate exceptions on the page, hover over them to see details, or click them to jump to that line in the log
  • Build boolean queries to find strings or regular expressions in the log
  • Save queries
  • Load saved queries
  • Filter log to only display lines matching a query
  • View number of matches of query
Text Viewer
  • Find a string or regular expression in files
  • Save queries
  • Load saved queries
  • Filter for lines in file with a string or regular expression
  • Go to a line number in a file
Image Viewer
  • Supports most image file formats.
  • No other features beyond displaying the image.

 

I recently recorded a short video showing how I use the Maintenance Tool to view files in the Log Zipper output file.  This provides an overview of the process I use in general, but I wanted to share two more examples in detail.

 

Checking for missing configuration files

 

When I click Zip Logs, the Log Zipper collects the product configuration files and log files needed for analysis. The Log Zipper is the tool that collects the files, but all the logic of what files to collect and where they should be found on the system is installed with the product. This means that if the Log Zipper is unable to collect a file, the file is not in the location expected by the product itself.

 

Here is how I screen for issues of missing configuration files:

  1. Launch the Maintenance Tool
  2. Click Zip Logs and wait for it to complete.
  3. Click Configuration Log.
    This opens the <product>_configuration_log.txt file in the viewer.  You could also navigate and find the file in the Log Zipper output file.
  4. Choose Edit > Filter
  5. Right-click on the default query ( Details containsIgnoreCase '' ) and add or change clauses as necessary.
    You can filter on "Adding file" to see all the files or directories which Log Zipper attempts to collect, but rather than looking at this list, I constructed a query to search for lines where Details contains "Not found" and Severity is INFO, WARNING, or SEVERE, as shown below:
    [Screenshot: Filter_missing_files2.png – filter query for missing files]

The output indicates the file name and the location where the file was expected. Close the filter dialog and uncheck the filter checkbox to return to the unfiltered results. See Searching in the log viewer for more details on defining search queries.
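If the Maintenance Tool is not at hand, roughly the same filter can be scripted against the extracted <product>_configuration_log.txt. This is only an approximation, since the exact line layout varies by product:

```python
# Rough script equivalent of the "Missing files" filter, run against the
# <product>_configuration_log.txt extracted from the Zip Logs output.
import sys

SEVERITIES = ("INFO", "WARNING", "SEVERE")

def missing_files(log_path: str) -> None:
    with open(log_path, errors="replace") as fh:
        for line in fh:
            # keep lines that report a "Not found" detail at one of these severities
            if "not found" in line.lower() and any(sev in line for sev in SEVERITIES):
                print(line.rstrip())

if __name__ == "__main__":
    missing_files(sys.argv[1] if len(sys.argv) > 1 else "configuration_log.txt")
```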

 

After testing the query, you can choose Edit > Save Search Query in the Filter window.

[Screenshot: SaveQuery.png – saving the search query]

 

Specify a value for Short Description.  For example, I call this search "Missing files".

The other attributes can be left blank. Long Description, Source Description, and Resolution Description hold details on why you are performing the search, information about the string if found, and how to resolve any issues found. These attributes are not visible or used in the current Maintenance Tool, so they do not need to be populated.

 

Searching for errors in a log file

 

Another useful activity when analyzing an issue is to search for errors in a log file.  However, there are many different error code prefixes or indications of an error, including:

  • ARERR
  • BMC- 
  • ErrorCode
  • Error msg
  • ORA-
  • SQL Server
  • Exit code
  • ReturnCode
  • MessageNum
  • java.lang
  • SEVERE

 

Searching for each of these strings individually takes time and familiarity with the log.  It also makes it difficult to see patterns of errors such as when one error is always preceded by a different error.

 

To simplify this process, I use a query to search for all common error prefixes.

 

When viewing logs in the log viewer, this query is constructed using an OR match type and a list of strings.

[Screenshot: Filter_errors_log_Viewer2.png – OR query for common error strings in the log viewer]

Typically, this query is not needed in the log viewer because the Severity column indicates when an error has occurred, but the query can narrow the display to lines with specific product errors.

 

When a file is open with the text viewer:

  1. Choose Edit > Filter
  2. Check the regex option to enter the search as a regular expression.
  3. Enter the list of search expressions, separated by a pipe character so it will return results on any of the search strings, as shown below, and click Filter.

 

[Screenshot: Filter_errors_text_viewer.png – regular expression filter in the text viewer]

This will show just the lines which contain one of these search strings.
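The same pipe-separated search can also be scripted outside the Maintenance Tool, for example against a raw log file. The prefixes below simply mirror the list above; the log file name is a placeholder.

```python
# Script version of the pipe-separated error search, useful when the
# Maintenance Tool is not available.
import re
import sys

ERROR_PATTERN = re.compile(
    r"ARERR|BMC-|ErrorCode|Error msg|ORA-|SQL Server|Exit code|ReturnCode|MessageNum|java\.lang|SEVERE"
)

def find_errors(log_path: str) -> None:
    with open(log_path, errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if ERROR_PATTERN.search(line):
                print(f"{number}: {line.rstrip()}")

if __name__ == "__main__":
    find_errors(sys.argv[1] if len(sys.argv) > 1 else "product.log")
```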

 

Save the query for future use by choosing Edit > Save Search Query.

 

Note: This technique of using a regular expression to search for several indications of an error is not unique to the Maintenance Tool. Many text editors also have the ability to search on a regular expression. However, the Maintenance Tool allows you to do this in one pane when viewing a log, then easily switch to the OperatingSystemsData file to view information about the system configuration or environment variables, and perform the same sort of analysis across several files to compare related information.

 

I hope the information in this blog post is helpful, perhaps as a quick check before doing a deep analysis of an issue. I am interested in feedback on this topic. How does this technique work for you? Are there other quick searches worth saving and running routinely as a pre-analysis?