When I encounter an issue that throws an error message, I typically take the obvious route: determine what the error means, search for likely causes, and if needed, enable more verbose logging to capture more detail on what is happening. This approach is basically to start from the point of failure and track backwards or dig deeper.
However, when confronted with a complex issue in a multi-tiered environment – where there may be no error, a failure may occur in another tier, or data issues may have been introduced well upstream – it is sometimes useful to take a systems view of diagnosing issues. This approach looks for failures across many tiers and validates that features and components work individually. I call this the systems approach to diagnostics.
In practice, when I look into issues, I use a hybrid approach that combines knowledge of how the feature works and what it depends on with appropriate diagnostic tools to narrow in on where the issue is occurring.
In this post, I will summarize several types of diagnostic tools that enable the systems approach to diagnostics, and the relative merits and limitations of each.
Component test / unit test
The goal of a component test is to validate that a particular feature, server process, or product component is basically operational. It is a diagnostic that is run ad hoc. If a component test fails, it quickly identifies a point of failure to be investigated further. A very simple example would be to verify a server process is running. A more meaningful unit test would be to verify a feature can perform a simple unit of work. Most component tests are very simple, because it takes time to perform each of them.
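As a sketch of the idea (not from any specific product), a minimal ad hoc component-test runner might look like this in Python; `check_port` is a hypothetical example of the "verify a server process is running" style of check:

```python
import socket

def check_port(host, port, timeout=2.0):
    """Example component test: is a server process listening on its port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_component_tests(tests):
    """Run each named check ad hoc; a False result flags a point of
    failure to investigate further, while a True result reveals nothing new."""
    results = {}
    for name, check in tests.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as a failure
    return results
```

Keeping each check a tiny callable honors the "keep it simple" principle: checks stay fast, and a failing name immediately narrows the investigation.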
Since unit tests validate a discrete feature, does it make sense to automate many unit tests, to simplify the process of running them, and to generate a more comprehensive result? It does, to a point. There are a few challenges that inhibit a fully automated unit test system. First, how much can you realistically unit test in a reasonable period of time, without negatively impacting the system, and without encountering false failures? The “keep it simple” principle is relevant here.
Second, component tests should only verify features that are installed, enabled, and relevant for the solution. For enterprise products which support a variety of post-installation deployment options for load balancing and high availability across multiple servers, this presents a significant challenge. Reliable unit test automation requires some sort of centralized configuration to track the current deployment. This requirement is sometimes at odds with maintaining maximum deployment flexibility and redundancy in a high availability environment. It is always a goal to automate functionality for greater efficiency, consistency, and less effort, but in practice it is easiest to achieve this in the context of a solution which limits or drives deployment choices.
A component test or unit test diagnostic only goes so far. A failure indicates something to investigate further. A non-failure means the test did not reveal useful information, so proceed to the next step in the diagnostic process.
Integration / connectivity unit test
The integration / connectivity unit test is a logical extension of the component unit test. It is also an ad hoc test, but it involves multiple components and may span multiple servers. The goal is to verify that a particular conduit of information, or “pipe,” works. This can be a simple exercise when testing an operation that pulls data back, but substantially more complex when trying to push data through. There are a few additional challenges beyond those described above for the component test.
One challenge is test data segregation. You want to perform the test in a realistic simulation of production functionality, yet you do not want this test data to be visible to actual operators or to influence metrics. When performing the test manually, finding and cancelling out the test data is part of the exercise, but automation tools need to consider how to do this programmatically.
Another challenge is admin user channel conflict. Testing with the same user account, using authentication parameters read from the same place, is ideal for replicating the behavior of the actual feature, but using the same user account for multiple purposes can complicate administration and log analysis.
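The test data segregation challenge can be handled by tagging synthetic records so automation can find and cancel them programmatically. A minimal sketch, in which the `TEST_TAG` marker and the dict record shape are assumptions rather than any product's actual format:

```python
TEST_TAG = "DIAG_TEST"  # hypothetical marker for synthetic test records

def make_test_record(payload):
    """Wrap a test payload so it is identifiable as diagnostic traffic."""
    return {"tag": TEST_TAG, **payload}

def purge_test_records(records):
    """Drop synthetic test records so they are not visible to operators
    and do not influence metrics."""
    return [r for r in records if r.get("tag") != TEST_TAG]
```

The same tag that lets the cleanup routine find test data also lets reporting queries exclude it, addressing both halves of the segregation problem.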
End-to-end system validation / heartbeat
This kind of scheduled or recurring diagnostic runs test cases and monitors the health of a feature. It is sometimes called a “heartbeat” diagnostic. It tests a feature from end to end at a periodic interval by pushing a test transaction and monitoring for the expected automated result. This sort of diagnostic is only possible when there is an expected, automated response to confirm correct completion, so it is not appropriate for all kinds of features.
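The core of a heartbeat can be sketched as a probe-and-poll loop; `send_probe` and `check_response` are assumed stand-ins for whatever the feature's test transaction and expected automated result look like, not a real API:

```python
import time

def heartbeat(send_probe, check_response, timeout=5.0, poll=0.1):
    """Push a test transaction and wait for the expected automated result.

    Returns (passed, elapsed_seconds) so the diagnostic can both alert on
    failure and track the feature's end-to-end latency over time.
    """
    start = time.monotonic()
    token = send_probe()  # e.g. submit a tagged test message, get its id
    while time.monotonic() - start < timeout:
        if check_response(token):
            return True, time.monotonic() - start
        time.sleep(poll)
    return False, time.monotonic() - start
```

Run on a schedule, the elapsed-time half of the result is what enables the performance tracking and proactive alerting described below.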
This kind of diagnostic has some great benefits. In addition to telling you when the overall feature does not work, it can also track the performance of the overall feature, and it can proactively send events or notifications to alert the service desk or administrator.
This type of diagnostic faces all of the challenges described for component test and integration test type of diagnostics, and is most successful when implemented as part of a solution which drives or limits post installation deployment options.
Metadata consistency checker
Multi-tier enterprise applications often employ functionality which abstracts underlying technology, storing metadata to track the mapping to foreign objects. For example, a database view references database tables, and it is the responsibility of the DBMS to maintain its metadata to ensure it is accurate. In corner cases where operations fail or when changes are made to underlying objects, the metadata may become inaccurate. The symptom encountered may be an error from a lower level in the stack when an operation is attempted in a higher level.
The role of a metadata consistency checker is to compare metadata with actual objects and report discrepancies. These diagnostics are typically only needed in rare situations, but in those situations they are very helpful for isolating an issue efficiently and consistently.
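In its simplest form, the consistency check is a set comparison between what the metadata tracks and what actually exists. A generic sketch, not tied to any particular DBMS or product:

```python
def check_metadata(tracked, actual):
    """Compare metadata entries against the real objects and report
    discrepancies in both directions."""
    tracked, actual = set(tracked), set(actual)
    return {
        "dangling": sorted(tracked - actual),   # metadata referencing nothing
        "untracked": sorted(actual - tracked),  # objects missing from metadata
    }
```

Either discrepancy list explains the "error from a lower level in the stack" symptom: an operation resolves the mapping, then the underlying object is not where the metadata says it is.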
Data quality analysis
Data quality issues can occur in enterprise applications for a number of reasons, including implementation decisions, user error, bad data imports, integrations, feature changes, defects, or inappropriate workarounds which bypass application data validation features. Following best practices, using data validation functionality, and following a good implementation process can avoid most data issues. Even when all of the data is valid, there may be a need to quantify how much data you have, what type, and where it resides. Data quality analysis tools assist with this process, and they are surfaced in a variety of ways. In some cases, relevant counts are displayed or accessible in the user interface, because they are commonly needed to use the feature. In other cases, counts are collected as part of diagnostic features, since the amount and distribution of data is relevant to understanding performance issues. Database or application queries may be used to address specific needs, and most applications provide a search mechanism covering a variety of common use cases. Finally, there may be a specific utility to check for or correct particular data situations.
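As a small illustration of the kind of count such tools produce, here is a per-field profile over a record set (a generic sketch, assuming records are simple dicts):

```python
def profile_field(records, field):
    """Count how many records have a value, an empty value, or no field at all,
    answering "how much data do we have, and of what shape?" for one field."""
    counts = {"present": 0, "empty": 0, "missing": 0}
    for record in records:
        if field not in record:
            counts["missing"] += 1
        elif record[field] in (None, ""):
            counts["empty"] += 1
        else:
            counts["present"] += 1
    return counts
```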
Both the merits and the challenges of data quality analysis tools stem from how closely they are aligned to the application itself. The guideline is simple: if you suspect a data quality issue, check for and run the data quality tools the application includes.
Diagnostic collection
A feature which collects system, version, configuration, log, and other diagnostic information into a single file is not in itself a systems diagnostic utility, but it is a step in that direction. Collecting relevant information from many locations into a single file in a standardized format is the first step. Collecting the same information from other servers is another. Finally, running an error analyzer against the files to look for errors at all levels in the stack is the systems diagnostic. The principle is “collect everything that might be relevant,” so successive analysis routines have all the information they require.
The challenge of this approach is that the information collected can be substantial, so the file can be large. Care needs to be taken to either collect information in the relevant time range or collect only non-verbose logs.
One merit of this approach is de-coupling the collection and analysis steps. This allows the analysis routines to be improved over time, while the collection process stays straightforward. Another merit is that if no analyzer identifies the issue, the file can easily be shared with a subject matter expert who can review the information directly.
Error analyzer
The error analyzer was introduced above as part of the motivation for a diagnostic collection feature. For the systems approach to diagnostics, this means looking for trends and correlating errors across diagnostics, either by timeframe or by operation. There are numerous challenges to this approach. Errors and timestamps are represented differently, logs have varied formats, and a recurring failure may produce a very large number of errors. The goal is for the error analyzer to provide trends, categories, and a higher-level summation of failures. If the analyzer simply reported every error encountered anywhere, it would be counterproductive for focusing an investigation. A key feature for this purpose is reporting on unique error messages.
One of the best uses of an error analyzer is to routinely identify low-level errors which can be investigated and resolved to keep the system running smoothly and cleanly. Then, if a problem occurs, running the error analyzer will surface the specific new errors, with messages that can be researched.
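The unique-message idea can be sketched by normalizing away the parts of a message that vary (ids, timestamps), so a recurring failure reports once with a count instead of flooding the report. The log format here is an assumption for illustration:

```python
import re
from collections import Counter

def summarize_errors(log_lines):
    """Group ERROR lines by normalized message rather than listing every hit."""
    counts = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue
        message = line.split("ERROR", 1)[1].strip(" :")
        message = re.sub(r"\d+", "N", message)  # collapse ids and timestamps
        counts[message] += 1
    return dict(counts)
```

A real analyzer would need a format adapter per log source, which is exactly the varied-formats challenge noted above, but the aggregation step stays the same.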
Configuration analyzer
The goal of a configuration analyzer is to check for configuration errors, outliers, or values outside recommended settings. For enterprise software solutions, there may be many possible ways to configure an application where the best values are dependent on configuration values in other applications. This is essential for versatility. If there were only one valid configuration, the value would be hard-coded and not configurable. A configuration error is a situation where one configuration is clearly out of sync with a related setting. A configuration outlier is a potentially intentional configuration value which merits further exploration because it is well outside of the norm. A configuration value outside recommended settings is the case where there are clear guidelines of how to configure based on other variables like system size, but for which the specified configuration is not the recommended value. Again this may be intentional, and it may be temporary, but the value of a configuration analyzer would be to raise awareness of this situation.
The value of a configuration analyzer is to quickly validate many configuration settings. The challenge is determining how strongly to enforce each setting. If a setting is intentionally set otherwise, should the user be able to flag it as intentional and suppress the warning going forward, thereby hiding an important deviation? Should the administrator be able to refine the definition of the normal or expected configuration? Defining the dependencies is another challenge when the number of related products and tiers varies greatly.
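A rule-based sketch of such an analyzer: each rule inspects the configuration and returns a warning string or None, so findings inform without blocking. The setting names (`app_pool_size`, `db_max_connections`) are hypothetical, chosen only to illustrate a cross-application dependency:

```python
def analyze_config(config, rules):
    """Apply each rule to the configuration and collect the warnings."""
    findings = []
    for rule in rules:
        message = rule(config)
        if message:
            findings.append(message)
    return findings

def pool_vs_connections(config):
    """Dependency rule: the app's pool must fit within what the DB allows."""
    if config.get("app_pool_size", 0) > config.get("db_max_connections", 0):
        return "app_pool_size exceeds db_max_connections"
    return None
```

Keeping rules as independent functions is one way to handle the update problem discussed next: new checks can ship as new rules without touching the runner.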
A configuration analyzer can run on the server itself from the UI, or it can be a separate process that is recommended but not exposed or enforced within the application. There are merits to each approach. The value of a separate process is that the user can be informed but not blocked from proceeding, which loosens the requirement to strictly define which configuration errors are blockers versus merely noteworthy. A separate process also allows the configuration analyzer to be updated easily with additional checks. However, a separate process risks the diagnostic being forgotten or missed, and therefore delivering no value.
Automatic updates to analyzers
A good systems diagnostic framework allows the analyzer capabilities to grow and evolve both with the product and with experience using it. This is a delicate balance: configuration capabilities differ by product version, so closely tying the analyzer to the product addresses this challenge somewhat, though the complexity multiplies as soon as multiple products and versions are introduced. Several of the paragraphs above describing analyzers imply the value of a separate tool, because it can take updates which deliver new routines and checks. A desirable solution would be to make the analyzers accessible in the product and have them connect over the internet to download the latest updates. Since most enterprise application servers are not exposed directly to the internet, a good compromise may be to download and import the latest routines.
Job run or task status
When an application user runs a job or performs an operation, the results are typically displayed in the user interface so they know the result when it completes. If the same job is scheduled and run periodically without user intervention, the results of the job are again stored and reported in the same place. This is a good principle for application design - diagnostics are consistent no matter how or when the feature runs. The challenge from a systems diagnostic approach is how do you know when a scheduled job that is part of a larger overall process fails? How can you correlate this failure with failures in other related applications?
There are a few common approaches for incorporating job statuses into an overall diagnostic framework. One is for the client or actor which runs the job to monitor or wait on the job status, and use its own diagnostics to report the failure. Another is to have job failures trigger a notification or an event that can be acted upon. Finally, job failures may be captured by a diagnostic collection utility as described above, so an offline analysis tool can report on them.
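The notification approach can be sketched as a scan over recorded job results; the status strings and the `notify` callback are assumptions standing in for whatever the application actually stores and however alerts are delivered:

```python
def watch_jobs(job_statuses, notify):
    """Report each scheduled job that did not succeed, so unattended
    failures surface in the overall diagnostic framework."""
    failures = []
    for job, status in sorted(job_statuses.items()):
        if status != "SUCCESS":
            failures.append(job)
            notify(f"job {job} ended with status {status}")
    return failures
```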
Error logs and queues
Some enterprise application features leverage back-end processes which record errors to a log file or table when failures occur, and some queue up asynchronous operations to be processed sequentially on a first-in-first-out basis. These are key diagnostics when back-end processes fail or are backed up. From a systems analysis perspective, I would want these diagnostics to be exposed as another data source for the error analyzer so the results could tell me meaningful information about all relevant errors – whether they are recorded in the database or in a log file. Also, it would be good for queued operations to indicate a failure if the queue grew too large, or it did not diminish at a sufficient rate – since these would be situations warranting further investigation.
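The two queue checks suggested above (too large, or not diminishing at a sufficient rate) can be sketched from periodic depth samples; the thresholds are illustrative, not recommendations:

```python
def queue_health(depth_samples, max_depth, min_drain_per_sample):
    """Flag a queue that is too deep, or one that still has work queued
    but is not shrinking fast enough. Samples are oldest first."""
    alerts = []
    current = depth_samples[-1]
    if current > max_depth:
        alerts.append("queue too deep")
    if len(depth_samples) >= 2 and current > 0:
        drain = (depth_samples[0] - current) / (len(depth_samples) - 1)
        if drain < min_drain_per_sample:
            alerts.append("queue not draining")
    return alerts
```

Either alert string is the kind of condition that would warrant further investigation, and could be fed to the error analyzer or event manager like any other failure.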
One merit of recording errors in queues is that it presents a logical way to organize errors, identify trends, and find actionable information. It is also typically easier to implement a notification or event when errors occur, which is difficult to achieve when errors are written to files. The challenge of recording errors in tables or within the application data is that they are not accessible to external analysis; they must first be retrieved or dumped to enable analysis.
Event manager
Several diagnostic capabilities described above referenced “sending an event,” and those assume there is an event manager to accept and manage events. The value of an event manager over a simple notification is that an event can be correlated with other events, prioritized, have policies applied, and flow into a defined process to expedite resolution. The challenge is deciding where to apply the correlation logic. As discussed above, some knowledge of how features map to components is required, and this understanding resides with the application itself rather than the event manager. The key decisions in successfully implementing an event manager diagnostic are deciding which events matter, and then determining how to act on them.
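Time-window correlation, the simplest form of the correlation mentioned above, can be sketched as grouping events whose timestamps fall close together. This is a toy model; real event managers also correlate by operation, component, and topology:

```python
def correlate_by_time(events, window):
    """Group (timestamp, name) events into clusters: a new cluster starts
    whenever the gap to the previous event exceeds `window` seconds."""
    groups = []
    for ts, name in sorted(events):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, name))
        else:
            groups.append([(ts, name)])
    return groups
```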
The diagram below illustrates how I view these diagnostics as a progression from less to more comprehensive coverage. It would be nice to whittle the diagnostic stack down to just alerts which tell you about the real failures you care about, but doing that effectively requires a dozen prerequisite steps to be in place.
My experiences with taking a systems approach to diagnostics have revealed there are many incremental steps toward a more comprehensive, more efficient diagnostic system. There are nuances, tradeoffs, and different paths to be taken. Many of the steps are complementary or dependent on other steps, so there is no single golden path that applies generically across all situations. The best next step is generally a choice of return on investment – how to deliver the most value in the shortest span of time. Hopefully the information in this post provides a good framework for understanding the journey.
This post represents my own thoughts and experiences and does not necessarily represent BMC's position, strategies, or opinion. I welcome any feedback, please use the Add Comments link below.