One of the fundamental questions in any business is how to spend time. Systematically saving time on every step of the software development process is a goal that is often overlooked. This blog post discusses how automated software checks provide valuable insight into the state of a product, which shortens the time needed to diagnose issues or suspicious behavior. We will also discuss how these automated checks may evolve as a software project develops.
In a software project, Mean Time To Diagnose (MTTD), Mean Time To Fix (MTTF) and other metrics change dynamically as the codebase grows, processes change, corrective actions take place, teams change, etc. However, all other factors being equal, as the codebase grows (or, more precisely, as code complexity grows), the time to diagnose increases, often exponentially, and in some cases it can be measured in days.
A software issue in this context is any unexpected behavior: it may be a configuration problem, a bug in processing, a data problem, an installation problem, a disk space problem, a network problem, an external service outage, etc. During diagnosis, precious time can be wasted by steering the investigation in the wrong direction or by assigning resources to investigate multiple possible causes. In general, the investigation of a complex issue usually starts by eliminating misleading clues. Here, we focus on how to save diagnosis time by using automated checks to quickly collect the pieces of information that help us eliminate those misleading clues.
Example: Checking the Status of a Web Service
Building a tool or infrastructure for automated checks (usually around a Continuous Integration environment) requires at least some customization and in-house development, simply because the available tools do not support all project needs out of the box. This infrastructure should evolve together with the product, and over time the number of automated checks should grow. Overall, the best person to define these kinds of checks is an engineer with wide knowledge of the system design, who always starts from the question: what kind of information does a certain check provide?
For example, if our product uses an external web service, we need to ensure that the service is working properly. There are several possible checks for this example, and using all of them together is advised:
- (a) Check that the service is up and running (simply put: ping). However, the service may be up and running but returning an error response, therefore:
- (b) Check that the service returns the expected response for the same request (one or more). Here, we put assertions on expected values, we define that we do not expect errors in the response, etc. While repeating the same test is good for preserving compatibility, issues may be hidden somewhere else, therefore:
- (c) Check the response for a valid request that uses one or more randomized values. These kinds of requests may not be comparable, depending on the product’s architecture, but they may catch some interesting bugs. Finally:
- (d) Check the negative case(s): deliberately send invalid request(s) and configure response assertions to look for specific error messages, as expected by design.
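The four checks can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `send` transport, the `Reply` type, the `square` operation and the error text are all hypothetical stand-ins for a real service, and the checks are labeled (a) to (d) in the order listed above.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Reply:
    """Simplified response: an HTTP-like status code plus a parsed body."""
    status: int
    body: dict


# Hypothetical transport: takes request parameters, returns a Reply,
# and raises ConnectionError when the service cannot be reached.
Send = Callable[[dict], Reply]


def _attempt(check: Callable[[], bool]) -> bool:
    """Run one check; an unreachable service counts as FAIL for that check."""
    try:
        return check()
    except ConnectionError:
        return False


def run_checks(send: Send, rng: random.Random) -> Dict[str, bool]:
    def ping() -> bool:  # (a) is the service reachable at all?
        send({"op": "ping"})
        return True

    def fixed() -> bool:  # (b) same request, same expected answer
        reply = send({"op": "square", "x": 4})
        return reply.status == 200 and reply.body.get("result") == 16

    def randomized() -> bool:  # (c) valid request with a randomized value
        x = rng.randint(1, 1000)
        reply = send({"op": "square", "x": x})
        return reply.status == 200 and reply.body.get("result") == x * x

    def negative() -> bool:  # (d) invalid request must yield the designed error
        reply = send({"op": "square", "x": "not-a-number"})
        return reply.status == 400 and "invalid" in str(reply.body.get("error", "")).lower()

    # Every check runs independently so that each combination of results stays observable.
    return {"a": _attempt(ping), "b": _attempt(fixed),
            "c": _attempt(randomized), "d": _attempt(negative)}


def fake_service(request: dict) -> Reply:
    """In-memory stand-in for the external service, used only for this demo."""
    if request["op"] == "ping":
        return Reply(200, {})
    x = request.get("x")
    if not isinstance(x, int):
        return Reply(400, {"error": "Invalid value for x"})
    return Reply(200, {"result": x * x})


print(run_checks(fake_service, random.Random(7)))
```

Passing the transport in as a function keeps the checks independent of any particular client library, so the same sketch can be pointed at a fake or at a real endpoint.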
Checks that compare the values returned in the response with those stored in the database can also be added. Note that we did not explicitly state one important property of checks (b) and (c): whether they alter application data in any way. To GET something is not the same as to CREATE something. When checking the latter, the automated check is designed in the typical fashion:
- Setup (clean the record if it already exists)
- Test (create)
- Teardown (cleanup, if necessary)
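A minimal sketch of this setup/test/teardown shape, assuming a hypothetical record API with `create`, `get` and `delete` operations (replaced here by an in-memory `FakeRecordStore` so the example runs on its own):

```python
from typing import Dict, Optional


class FakeRecordStore:
    """In-memory stand-in (hypothetical) for the service's create/get/delete API."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def create(self, key: str, value: dict) -> None:
        if key in self.records:
            raise ValueError(f"record {key!r} already exists")
        self.records[key] = value

    def get(self, key: str) -> Optional[dict]:
        return self.records.get(key)

    def delete(self, key: str) -> None:
        self.records.pop(key, None)  # deleting a missing record is not an error


def check_create(store: FakeRecordStore, key: str = "monitoring-check-record") -> bool:
    # Setup: remove leftovers from an earlier, possibly interrupted run.
    store.delete(key)
    try:
        # Test: create the record and read it back.
        payload = {"name": "probe"}
        store.create(key, payload)
        return store.get(key) == payload
    except ValueError:
        return False
    finally:
        # Teardown: leave no test data behind, even when the test failed.
        store.delete(key)
```

Using a dedicated, clearly named key for the probe record is one way to keep the check from colliding with real application data.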
It is also a good idea to add to your automated checks at least one check that will always fail or be “red”, just to make sure that actual monitoring does not have any bugs that may produce invalid (optimistic) results.
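One way to sketch such an always-red check is shown below. The names `always_fails`, `run_suite` and `report_is_trustworthy` are illustrative, not part of any tool mentioned here.

```python
from typing import Callable, Dict


def always_fails() -> bool:
    """Canary check: must be reported as FAIL on every run."""
    return False


def run_suite(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check and report PASS/FAIL; an exception counts as FAIL."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "PASS" if check() else "FAIL"
        except Exception:
            report[name] = "FAIL"
    return report


def report_is_trustworthy(report: Dict[str, str], canary: str = "canary") -> bool:
    """Trust a run only if the canary is present and shown as FAIL."""
    return report.get(canary) == "FAIL"


report = run_suite({"ping": lambda: True, "canary": always_fails})
print(report, report_is_trustworthy(report))
```

If the canary ever shows up as PASS, or is missing from the report, the monitoring itself is the first suspect.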
Interpreting the Results
This simple example illustrates that every automated check should be designed to probe the service or component from a different angle, because different combinations of PASS/FAIL results lead us to the real cause of the problem. Here are some scenarios:
- if (a) is FAIL and any of the remaining checks is PASS, then something may be wrong with the monitoring script
- if (a) and (c) are PASS and (b) is FAIL, then something may be wrong with the service: it may be a data error, an error between the application and the database etc.
- if all except (d) are PASS, then something may be wrong with error handling
- if there are many (c) requests that alternately PASS and FAIL, then something may be wrong with the load-balancer or one of the nodes below it etc.
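The last scenario depends on a history of results rather than on a single run. A small sketch of how alternating results might be detected follows; the threshold of three flips is an arbitrary assumption and would need tuning for a real system.

```python
from typing import Sequence


def count_flips(history: Sequence[bool]) -> int:
    """Number of PASS<->FAIL transitions between consecutive runs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def looks_like_flapping(history: Sequence[bool], min_flips: int = 3) -> bool:
    """True when results alternate often enough to suspect a load balancer or a single bad node."""
    return count_flips(history) >= min_flips


# True = PASS, False = FAIL for consecutive runs of check (c).
print(looks_like_flapping([True, False, True, False, True]))
print(looks_like_flapping([True, True, False, False, False]))
```

A single PASS-to-FAIL transition points to a change at one moment in time, while repeated transitions suggest that requests are reaching healthy and unhealthy nodes in turn.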
In our experience, Apache JMeter has worked well for automating checks like these. JMeter is a free, open source tool that supports many types of tests, from SOAP web services to databases, FTP and HTTP, and it is fairly simple to add new custom components like the ones our team developed (for HBase, JMS, JSON, OAuth, SSH, comparing XMLs etc).
Once the checks and corresponding [regex/xpath/schema/…] assertions are in place, the automated script is refactored to decouple the environment configuration and input data (where needed) from the test. Decoupling environment configuration from the actual test is extremely important: it enables the team to re-use the same checks across different environments and also cuts down on script maintenance time. It is also easier to respond to environment changes, such as changes in service endpoints. A step further, where needed, is to decouple input data from the test. Input data may simply be moved from the script into a corresponding input file (e.g. a CSV file). Sometimes, for each row in the input file, one can define custom dynamic value assertions that are read by the script (this is also supported by JMeter).
When the script is ready, the next step is to configure it to run automatically at a given interval, for example, every 10 minutes. Of course, the interval will depend on how long it takes for all the checks to complete. To speed up the execution, the script may be configured to execute multiple checks in parallel. In JMeter this is accomplished by turning off the “Run Thread Groups Consecutively” option and organizing different checks in corresponding Thread Groups. In our case, automated re-running of the script was set up with an in-house runner, but because JMeter can be run from a Linux command line, one can simply set up a cron job for this purpose.
The team should decide how much historical data should be preserved. If test result artifacts (XML reports) consume significant space because tests are executed frequently, older artifacts may be archived and compressed, so that they remain available when a later root cause analysis needs to establish when a certain problem first occurred. An additional improvement on the reporting side is to enable comparison of results from execution to execution, while also preserving information on the builds against which each test was executed.
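Outside of JMeter, the same decoupling can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the environment names, the `example.test` endpoints, the `/customers/` path and the CSV columns.

```python
import csv
import io
from typing import Dict, Iterator, List, Tuple

# Hypothetical environment settings; in practice these might live in one file per environment.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "staging": {"endpoint": "https://staging.example.test/api"},
    "production": {"endpoint": "https://prod.example.test/api"},
}

# Hypothetical input data: one request per row, with the value to assert on.
INPUT_CSV = """customer_id,expected_status
1001,active
1002,suspended
"""


def load_cases(csv_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (input value, expected value) pairs from CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row["customer_id"], row["expected_status"]


def build_requests(env_name: str, csv_text: str) -> List[Tuple[str, str]]:
    """Combine environment configuration with input data; the check logic stays unchanged."""
    endpoint = ENVIRONMENTS[env_name]["endpoint"]
    return [(f"{endpoint}/customers/{cid}", expected)
            for cid, expected in load_cases(csv_text)]


for url, expected in build_requests("staging", INPUT_CSV):
    print(url, "->", expected)
```

Switching environments then means changing only the `env_name` argument, and adding a test case means adding a row to the input data.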
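For readers not using JMeter, the idea of running independent checks in parallel can be sketched with Python's standard thread pool. This is not how JMeter implements Thread Groups; `run_in_parallel` and the example checks are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_in_parallel(checks: Dict[str, Callable[[], bool]], max_workers: int = 4) -> Dict[str, bool]:
    """Run independent checks concurrently; an exception counts as FAIL."""

    def safe(check: Callable[[], bool]) -> bool:
        try:
            return bool(check())
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(safe, check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_pass() -> bool:
    time.sleep(0.1)  # stands in for a slow network call
    return True


def raises() -> bool:
    raise RuntimeError("simulated failure")


print(run_in_parallel({"slow-1": slow_pass, "slow-2": slow_pass, "broken": raises}))
```

Parallel execution only helps when the checks do not depend on each other or on shared test data; checks that create and delete the same record should stay in one sequence.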
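Archiving older reports could look like the sketch below. The `*.xml` file pattern, the age limit and the function name are assumptions; a real setup would follow whatever retention policy the team agrees on.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import List, Optional


def archive_old_reports(report_dir: Path, max_age_days: float,
                        now: Optional[float] = None) -> List[Path]:
    """Gzip every *.xml report older than max_age_days and remove the original."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    archived = []
    for report in sorted(report_dir.glob("*.xml")):
        if report.stat().st_mtime >= cutoff:
            continue  # recent report: keep it uncompressed for quick access
        target = report.with_name(report.name + ".gz")
        with report.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        report.unlink()
        archived.append(target)
    return archived
```

Calling `archive_old_reports(Path("reports"), max_age_days=7)` from the same scheduler that runs the checks would keep one week of reports uncompressed; the directory name and the seven days are again only examples.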
Fault Tree Analysis and “Hints”
A final improvement that can also save time when interpreting the results (and can make life easier for new team members) is to provide hints based on the test results. In this post, we have implicitly assumed that the result of a check is either PASS or FAIL, i.e. it is a boolean value. Therefore, a boolean decision table can be constructed, and “hints” about the possible cause can be defined for each combination. These hints can either be verbal (to help a human understand the possible causes) or be defined in a way that triggers a certain action (like a service restart) in a self-healing system. Implementing this fault tree analysis model carries some risk: a bug can exist in such a complex model, and that can negate all the time-saving benefits gained from the much simpler checks.
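A decision table for the four web-service checks might be sketched as follows. The rule format (a pattern in which `None` means "any result") and the hint texts are assumptions made for this example; the rules themselves encode the single-run scenarios described earlier. Because a bug in the table is the main risk, the sketch also lists the combinations that have no hint yet.

```python
from itertools import product
from typing import List, Optional, Tuple

Outcome = Tuple[bool, bool, bool, bool]      # results of checks (a), (b), (c), (d)
Pattern = Tuple[Optional[bool], ...]         # None means "any result"

MONITORING = "Suspect the monitoring script: ping failed but another check passed."

# First matching rule wins.
RULES: List[Tuple[Pattern, str]] = [
    ((True, True, True, True), "No problem detected."),
    ((False, True, None, None), MONITORING),
    ((False, None, True, None), MONITORING),
    ((False, None, None, True), MONITORING),
    ((True, False, True, None), "Suspect the service: data error or application/database problem."),
    ((True, True, True, False), "Suspect error handling."),
]


def matches(pattern: Pattern, outcome: Outcome) -> bool:
    return all(p is None or p == o for p, o in zip(pattern, outcome))


def hint_for(outcome: Outcome) -> Optional[str]:
    """Return the first matching hint, or None when the table has a gap."""
    for pattern, hint in RULES:
        if matches(pattern, outcome):
            return hint
    return None


def outcomes_without_hint() -> List[Outcome]:
    """Enumerate all 16 combinations and report those no rule covers."""
    return [o for o in product([True, False], repeat=4) if hint_for(o) is None]


print(hint_for((True, False, True, True)))
print(len(outcomes_without_hint()), "combinations still need a hint")
```

Listing the uncovered combinations makes gaps in the model visible, which is one inexpensive guard against the model silently giving no answer.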
During a software development process, the time to diagnose an issue increases as project complexity grows. A combination of monitoring and automated testing is introduced in order to shorten the diagnosis time and react quickly to any unexpected software behavior. In the setup described here, we used JMeter, an open source testing tool, which enabled us to create various automated checks, to parameterize environment configuration, input data and assertions, and to automatically re-run all checks at a given time interval. This type of setup will not only shorten the time needed to promote bug fixes to production; it will also make bigger code changes possible with confidence, and it will decrease the number of reported issues that are not software bugs but rather configuration or test environment problems.