Photo by RDNE Stock project from Pexels


Metrics are a type of measurement, which is the assignment of numbers to something with the intent of describing it in a way that makes sense (a theoretical construct model). Test managers apply metrics to the test process to gain understanding and inform their decision-making, with the goal of improving the performance or value of that process (Flamholtz, 1979; Kaner, 2010).

Measuring <parameter> in the hopes of improving <goal> (Austin, 1996a)


Informational vs Motivational

There are two different ways collected metrics can be used, both of which apply to testing processes (Austin, 1996c):

Informational: To gather insight and understanding of testing processes to see how changes improve things.

Motivational: To set a measurable target for testers in order to instigate a change in behaviour, with the goal of improving things.


Value vs Cost

Key Indicators

The two key indicators of software testing are the quality of the information provided and the cost to obtain that information. These two indicators are used to work out testing’s return on investment (ROI), that is, whether the cost of testing is worth the information provided:

Value: Information on software quality, particularly risk, with a goal of increasing that value

Cost: Time and money needed to get that information with the goal of reducing that cost

The cost of testing is the money (and therefore time) spent hiring, training, equipping and ultimately paying testers to do their job. It can be broken down into different testing activities such as risk-based testing, pair-testing and automation, personal development, or costs associated with testing, such as hardware or software purchases or licences. Cost of testing can also be measured in terms of project time-scales, such as mean time-to-test and amount of white space.
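As a minimal sketch of tallying the cost side, the breakdown above can be expressed as a simple sum over activities. All category names and figures here are hypothetical, purely for illustration:

```python
# Hypothetical breakdown of testing cost by activity (names and figures
# are illustrative, not drawn from any real project).
testing_costs = {
    "risk-based testing": 3200,    # tester hours * rate, in currency units
    "pair testing": 1800,
    "automation": 2500,
    "personal development": 600,
    "tooling and licences": 900,
}

total_cost = sum(testing_costs.values())
print(f"Total cost of testing: {total_cost}")

# Share of each activity, largest first: useful when deciding where to cut cost
for activity, cost in sorted(testing_costs.items(), key=lambda kv: -kv[1]):
    print(f"{activity}: {cost / total_cost:.0%}")
```

Unlike value, every entry here is straightforwardly enumerable, which is exactly why the cost side of the ROI equation is the easy half to supervise.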

The value of testing, however, presents a problem: because of its qualitative nature, it can’t be assigned numbers. The temptation is to reach for surrogate metrics like bugs reported or test cases written, but these suffer from poor construct validity: not all bugs are equal or even known about, test cases are a limited form of verification checking only, and neither is truly enumerable. Just as counting the chapters in a book isn’t a measure of its quality, or of anything else. The only way to measure the true output of testing would be to put a time or monetary figure on what was saved by finding a bug before it was too late, but this is a challenging if not impossible task.

This means only partial supervision of testing is possible (cost can be supervised, but value cannot), opening the door to measurement dysfunction.


Execution vs Reporting vs Setup

Critical Effort Dimensions

Testing can be broken down into two critical effort dimensions: types of task to which testers must dedicate effort to achieve the key output indicator, the value of information on software quality and risk (Austin, 1996d).

Execution: Effort dedicated to finding new information including interaction, experimentation and observation of the software itself. Effort must be dedicated to test execution otherwise there’s nothing to report to stakeholders.

Reporting: Effort dedicated to investigating, evaluating and communicating information once found, so stakeholders can act upon it. Effort must be dedicated to test reporting, otherwise all the testing in the world would go to waste if stakeholders never knew what was uncovered.

While not strictly required, a third effort dimension can vastly improve the above two:

Setup: Effort dedicated to creating an environment for fast and effective testing. Effort should be dedicated to test setup to get the most out of testing activities with reasonable cost.



As with testing’s overall output value, the quality of effort put into and produced from these three dimensions can’t be measured, but their cost can, particularly the time dedicated to each. This may give insight into why projects are taking longer than expected, such as when only execution time is considered in estimates. On one software project, for example, bug investigation and reporting added another 50–100% to the overall testing time.

In one particular project, testers were given an estimate that testing should take only two days. Test execution ended up taking just 1.5 days, but setup took three days and bug investigation, evaluation and reporting took another 5.5 days, so the overall testing effort took two weeks. This was much longer than the original estimate, and without a breakdown by critical effort dimension, upper management were left wondering why it took so long.
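The arithmetic of that example is worth making explicit. This sketch uses the figures from the project above (working days, assuming a five-day week):

```python
# Effort per critical dimension from the example project, in working days
effort = {
    "setup": 3.0,        # preparing the environment for fast, effective testing
    "execution": 1.5,    # interacting with and observing the software itself
    "reporting": 5.5,    # bug investigation, evaluation and communication
}

total_days = sum(effort.values())
print(f"Total: {total_days} working days, or {total_days / 5:.0f} weeks")

# Share of each dimension: execution is only a small slice of the whole
for dimension, days in effort.items():
    print(f"{dimension}: {days} days ({days / total_days:.0%} of testing time)")
```

Execution accounts for only 15% of the total here; an estimate built on execution alone was always going to miss by a wide margin.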


Construct Validity, Surrogates and Dysfunction

Surrogate metrics are the best available “stand-ins” where full supervision of a critical effort dimension isn’t possible, such as the value of the information provided on software quality. Instead, metrics that correlate with the goal of improving testing’s value may be carefully chosen by test managers. However, care must be taken over the surrogate metric’s construct validity, or in other words: “are you measuring what you think you’re measuring?”.

Where construct validity isn’t sufficiently taken care of by the test manager, measurement dysfunction can set in, meaning the metric shows improvement while actual testing value decreases. One major cause is gaming, where a motivational goal doesn’t inspire the desired behaviour; instead, team members focus only on making the numbers look good (Austin, 1996b).



With full supervision of cost, testers can quickly complete testing of product backlog items, meeting sprint goals. However, without full supervision of the actual value of the information provided, testing can end up being “quick and dirty”, with poorly communicated problems and many serious risks going undiscovered.

This has been a frequently experienced problem in agile projects, where a misplaced focus on speed over sustainability resulted in testers performing only basic verification checks. While the cost metrics looked good, the quality of testing dropped drastically, unknown to project stakeholders. Even surrogate metrics like number of reported bugs wouldn’t help: a low bug count is as indicative of low quality testing as it is of high quality software, and any motivational target would result in many low effort, trivial bugs being reported over the important but less common showstoppers. The classic case of quantity over quality.



Measurement breakdown:

Actual Purpose (Five Whys): Start with “why” questions to drill down to the actual purpose of what is being measured

Attribute/Instrument/Reading: Identify each part of the measurement:

  • Attribute: the thing being measured
  • Instrument: the thing doing the measuring
  • Reading: the result of the measurement, including precision (granularity and margin of error)

Scale: Increase/decrease meaning in respect to the construct model:

  • Ratio (multipliable with nothing-based zero e.g. metres)
  • Interval (consistent gap only e.g. celsius)
  • Cardinal (counting number of items e.g. apples)
  • Ordinal (greater/less than only e.g. rankings 1st, 2nd, 3rd…)
  • Nominal (categorisation only e.g. names)

Principal/Agent/Customer (Three Whos): The one doing the measuring (test manager), the ones being measured (testers), and the ones benefitting from improvements (stakeholders, e.g. customers, users, developers and product owners)

Critical Effort Dimension: Identified by asking the question: will continued failure in this area affect the company’s goals?

Vision: How does this metric tie into the vision, mission and goals for testing at the company?
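The scale types listed above constrain which comparisons a metric can legitimately support. This sketch illustrates each with invented values:

```python
# Illustrative examples of the five scale types (all values invented)

runtimes_s = [120.0, 60.0]           # Ratio: true zero, so "twice as long" is meaningful
assert runtimes_s[0] / runtimes_s[1] == 2.0

temps_c = [30.0, 15.0]               # Interval: gaps compare, but 30°C is not "twice" 15°C
assert temps_c[0] - temps_c[1] == 15.0

bug_count = len(["BUG-1", "BUG-2"])  # Cardinal: counting discrete items
assert bug_count == 2

priorities = ["high", "medium", "low"]   # Ordinal: order only, no meaningful distances
assert priorities.index("high") < priorities.index("low")

categories = {"UI", "API", "database"}   # Nominal: names, only equality/membership checks
assert "UI" in categories and "UI" != "API"

print("all scale checks passed")
```

Applying a ratio-style operation (“twice as many”) to an ordinal or nominal metric is one of the quieter ways construct validity breaks down.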

General approaches:

Opposing: Playing two opposing metrics off against each other, so that as one is gamed to show false improvement towards the goal, the other decreases to show movement away from the goal, and vice versa. Attaining the goal means maintaining a balance between the two.

Mixed-Mode: Combining metrics with a qualitative study, such as a hands-on or question-based performance review and evaluation, where the accuracy of the metrics can be compared and contrasted (Robson)

Inverse Correlation (affirming the consequent): Reversing the metric to see whether the opposite still holds true (Austin, 1996b)




  • Austin, R. 1996. Measuring and Managing Performance in Organizations. New York: Dorset House. (pp. xiii(a), 18-19(b), 21(c), 31(d))
  • Flamholtz, E. 1979. Toward a psycho-technical systems paradigm of organizational measurement, Decision Sciences, 10(1), pp. 71–84. Available at: Link
  • Kaner, C., 2010. BBST Foundations Lecture 6A: Measurement. [online] Available at: Link (ti: 0:45)
  • Robson, C., 2011. Real World Research. 3rd Ed. Oxford: Wiley, p. 161.

2023-07-02: General improvements
