Components of a Modern Quality Approach
To Software Development
Dolores Zage, Wayne Zage
Ball State University
Sponsor: iconectiv
Final Report
October 2015
Table of Contents
Section 1: Software Development Process and Quality
1.1 New Versus Traditional Process Overview
1.2 DevOps
1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures
1.2.2 Zeroturnaround Survey
1.2.2.1 Survey: Quality of Development
1.2.2.2 Survey Results: Predictability
1.2.2.3 Survey Results: Tool Type Usage
1.2.3 DevOps Process Maturity Model
1.2.4 DevOps Workflows
1.3 Code Reviews
1.3.1 Using Checklists
1.3.1.1 Sample Checklist Items
1.3.2 Checklists for Security
1.3.3 Monitoring the Code Review Process
1.3.4 Evolvability Defects
1.3.5 Other Guidelines
1.3.6 High Risk Code and High Risk Changes
1.4 Testing
1.4.1 Number of Test Cases
1.5 Agile Process and QA
1.5.1 Agile Quality Assessment (QA) on Scrum
1.6 Product Backlog
Section 2: Software Product Quality and its Measurement
2.1 Goal Question Metric Model
2.2 Generic Software Quality Models
2.3 Comparing Traditional and Agile Metrics
2.4 NESMA Agile Metrics
2.4.1 Metrics for Planning and Forecasting
2.4.2 Dividing the Work into Manageable Pieces
2.4.3 Metrics for Monitoring and Control
2.4.4 Metrics for Improvement (Product Quality and Process Improvement)
2.5 Another State-of-the-Practice Survey on Agile Metrics
2.6 Metric Trends are Important
2.7 Defect Measurement
2.8 Defects and Complexity Linked
2.9 Performance, a Factor in Quality
2.10 Security, a Factor in Quality
2.10.1 Security Standards
2.10.2 Shift of Security Responsibilities within Development
2.10.3 List of Current Practices
2.10.4 Risk of Third Party Applications
2.10.5 Rate of Repairs
2.10.6 Other Code Security Strategies
2.10.7 Design Vulnerabilities
Section 3: Assessment of Development Methods and Project Data
3.1 The Namcook Analytics Software Risk Master (SRM) tool
3.2 Crosstalk Table
3.2.1 Ranges of Software Development Quality
3.3 Scoring Method of Methods, Tools and Practices in Software Development
3.4 DevOps Self-Assessment by IBM
3.5 Thirty Software Engineering Issues that have Stayed Constant for Thirty Years
3.6 Quality and Defect Removal
3.6.1 Error-Prone Modules (EPM)
3.6.2 Inspection Metrics
3.6.3 General Terms of Software Failure and Software Success
Section 4: Conclusions and Project Take-Aways
4.1 Process
4.2 Product Measurements
Acknowledgements
Appendix A – Namcook Analytics Estimation Report
Appendix B – Sample DevOps Self-Assessment
References
Section 1: Software Development Process and Quality
1.1 New Versus Traditional Process Overview
At first, enterprises used Agile development techniques for pilot projects developed by small teams.
Realizing the benefits of shorter delivery and release phases and of responsiveness to change while still
delivering quality software, enterprises searched for ways to achieve similar benefits in their larger
development efforts by scaling Agile. Many frameworks and methods were developed to satisfy this need.
The Scaled Agile Framework (SAFe) is one of the most widely implemented scaled Agile frameworks. Most
pieces of the framework are borrowed: existing Agile methods packaged and organized differently to
accommodate the larger scale. Integrated within SAFe and other scaled enterprise development
methodologies are Agile methods such as Scrum, along with other techniques that foster the delivery of software.
Why evaluate the process? Developing good software is difficult and a good process or method can make
this difficult task a little easier and perhaps more predictable. In the past, researchers performed an
analysis of software standards and determined that standards heavily focus on processes rather than
products [PFLE]. They characterized software standards as prescribing “the recipe, the utensils, and the
cooking techniques, and then assume that the pudding will taste good.” This corresponds with Deming
who argued that, “The quality of a product is directly related to the quality of the process used to create
it” [DEMI]. Watts Humphrey, the creator of the CMM, believed that high quality software can only be
produced by a high quality process. Most would agree that the probability of producing high quality
software is greater if the process is also of high quality. All recognize, however, that possessing a good
process in isolation is not enough; the process has to be staffed with skilled, motivated people.
Can a process promote motivation? Agile methods are people-oriented rather than process-oriented. Agile
methods assert that no process will ever make up for the skill of the development team and that, therefore,
the role of a process is to support the development team in its work. Another movement, DevOps, encourages
collaboration by integrating development and operations. Figure 1 compares the old and new ways of
delivering software. The catalyst for many of the enumerated changes is teamwork and cooperation.
Everyone should participate, and all share in responsibility and accountability. In true Agile, teams
have the autonomy to choose development tools. In the new world, disconnects should be removed, and
development tools and processes should be chosen so that people can receive feedback quickly and make
necessary changes. Successful integration of DevOps and Agile development will play a key role in the
delivery of rapid, continuous, high quality software. Institutions that can accomplish this merger at the
enterprise scale will outpace those struggling to adapt.
Figure 1: Old and New Way of Developing Software
For this reason, identifying the characteristics of high-performing Agile and DevOps cultures is important
to assist in outlining a new transformational technology.
1.2 DevOps
DevOps is the fusion of “Development” and “Operations” functions to reduce risk, liability, and
time-to-market while increasing operational awareness. It is one of the largest movements in Information
Technology of the past decade. The DevOps evolution stemmed from many previous ideas in software
development, such as automation tooling, culture shifts, and iterative development models like Agile.
During the fusion process, DevOps was not provided with a central set of operational guidelines. Once an
organization decided to use DevOps for its software development, it had free rein in deciding how to
implement DevOps, which produced its own challenges. Even Agile, many of whose attributes were adopted
by the DevOps movement, falls into the same predicament. In 2006, Gilb and Brody wrote an article
suggesting the same lack of quantification for Agile methods, and that this is a major weakness [GILB].
There is insufficient focus on quantified performance levels, such as metrics evaluating required
qualities, resource savings, and workload capacities of the developed software. This lack of
quantification does not mean that DevOps fails to prescribe monitoring and measuring. The purpose of
monitoring and measuring within DevOps is to compare the current state of a project to the same project
at some time in the near past, providing an answer about current project progress. However, this
quantification does not help to answer the question of how to benchmark the overall DevOps implementation.
1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures
A possible description of a high-performing DevOps environment is that it produces good quality systems
on time. It is important to identify the characteristics of such high-performing cultures so that these
practices are emulated and metrics can be identified that quantify these typical characteristics and track
successful trends. Below are seven key points of a high-performing DevOps culture [DYNA]:
1. Deploy daily – decouple code deployment from feature releases
2. Handle all non-functional requirements (performance, security) at every stage
3. Exploit automated testing to catch errors early and quickly - 82% of high-performance DevOps
organizations use automation [PUPP].
4. Employ strict version control – version control in operations has the strongest correlation with
high performing DevOps organizations [PUPP].
5. Implement end-to-end performance monitoring and metrics
6. Perform peer code review
7. Allocate more cycle time to reduction of technical debt.
Other attributes of high-performing DevOps environments are that operations are involved early on in the
development process so that plans for any necessary changes to the production environment can be
formulated, and a continuous feedback loop between development, testing and operations is present. Just
as development has its daily stand-up meeting, development, QA and operations should meet frequently
to jointly analyze issues in production.
1.2.2 Zeroturnaround Survey
Many of the above trends are highlighted in a survey conducted by Zeroturnaround of 1,006 developers
on the practices they use during software development. The goal of this survey was to confirm or
disprove the effectiveness of the best quality practices, including the methodologies, tools, company
size and industry within the context of those practices [KABA]. The report noted that the survey
respondents were disproportionately biased towards Java and were employed in Software/Technology
companies. Zeroturnaround also divided the respondents into three groups based on their responses:
the top 10% were identified as rock stars, the middle 80% as average, and the bottom 10% as laggards.
For this report, Zeroturnaround concentrated on two aspects of software development, namely the quality
of the software and the predictability of delivery. The report results are summarized in
Sections 1.2.2.1-1.2.2.3.
1.2.2.1 Survey: Quality of Development
The quality of the software was determined by the frequency of critical or blocker bugs discovered
post-release. Zeroturnaround decided that a good measure of software quality is to ask the respondents,
“How often do you find critical or blocker bugs after release?” This simple question is an easy way to
judge whether at least minimum requirements are met, and the response reflects a key quality metric:
critical bugs are the defects most likely to negatively impact the largest group of end users. The
survey responses to this question were converted into percentages.
Do you find critical or blocker bugs after a release?
A. No - 100%
B. Almost never - 75%
C. Sometimes - 50%
D. Often - 25%
E. Always - 0%
The analysis of the answers to this question, summarized below, indicates that released software has a
50% probability of containing a critical bug; most respondents admitted to “sometimes” releasing
software with bugs. On average, 58% of releases go to production without critical bugs. The analysis
also demonstrates a distinct difference among respondents: the laggards deliver bug-free software only
25% of the time, while the rock stars deliver bug-free software 75% of the time.
Average 58%
Median 50%
Mode 50%
Standard Deviation 19%
Laggards (Bottom 10%) 25%
Rock stars (Top 10%) 75%
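The conversion of survey answers to scores and the summary statistics above can be sketched as follows. The sample responses below are hypothetical, invented only to exercise the mapping; the real survey had 1,006 respondents.

```python
from statistics import mean, median, mode, pstdev

# Mapping of survey answers to a bug-free release score (from the report's scale).
SCORES = {"No": 1.00, "Almost never": 0.75, "Sometimes": 0.50,
          "Often": 0.25, "Always": 0.00}

def summarize(responses):
    """Convert answers to scores and compute the report's summary statistics."""
    scores = [SCORES[r] for r in responses]
    return {
        "average": mean(scores),
        "median": median(scores),
        "mode": mode(scores),
        "std_dev": pstdev(scores),
    }

# Hypothetical sample of ten respondents.
sample = ["Sometimes"] * 5 + ["Almost never"] * 3 + ["Often"] * 2
stats = summarize(sample)
print(stats["median"])  # 0.5 for this sample
```

Applied to the full survey data, this yields the average (58%), median (50%), mode (50%) and standard deviation (19%) reported above.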
1.2.2.2 Survey Results: Predictability
Predictability of delivery for this report is determined by delays in releases, execution of planned
requirements, and in-process changes, also known as scope creep. The expression

Predictability = (1 / (1 + % late)) x (% of plans delivered) x (1 - % scope creep)
converts predictability of delivery into a mathematical expression. An example listed below is provided
to demonstrate the use of the mathematical formula:
Example: Mike’s team is pretty good at delivering on time--they only release late 10% of the time. On
average, they get 75% of their development plans done, and they’ve been able to limit scope creep to just
10% as well. Based on that, we can calculate that Mike’s team releases software with 61% predictability,
(1 / (1 + 0.10)) x 0.75 x (1 - 0.10) = 0.61 = 61%
The 61% is not a true probability because it is not normalized to 100%. The authors could have
normalized over delivery, but chose not to, as normalization did not affect trends and made the number
harder to interpret. It was suggested that change in scope should be included in the formula, since it
definitely affects the ability to predict delivery. The authors tested this suggestion by adding change
in scope to the formula; the outcome was that omitting an estimate for change in scope does not change
any trends. These trends are represented with concrete numbers, so statistical analysis may impact the
accuracy of absolute numbers, but not the relative trends. The findings and observations on the
predictability of software releases are that companies can predict delivery within a 60% margin;
considering just the rock stars, they can attain an 80% margin. When predictability was categorized by
industry, there was no significant relationship. Predictability does increase slightly (3%) for larger
companies; the authors theorize that this is due to a greater level of organizational requirements, so
more non-development staff are available to coordinate projects and release plans as teams scale up in size.
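The predictability formula and the worked example for Mike's team can be sketched as a small function:

```python
def predictability(pct_late, pct_plans_delivered, pct_scope_creep):
    """Predictability = (1 / (1 + % late)) x (% of plans delivered) x (1 - % scope creep).

    All arguments are fractions, e.g. 0.10 for 10%.
    """
    return (1 / (1 + pct_late)) * pct_plans_delivered * (1 - pct_scope_creep)

# Mike's team: late 10% of the time, 75% of plans delivered, 10% scope creep.
p = predictability(0.10, 0.75, 0.10)
print(round(p, 2))  # 0.61, i.e. 61% predictability
```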
Noting the obvious difference in the probability of bug-free delivery between the rock stars and laggards,
it is important to enumerate the differences identified by the survey. Based on the responses of over
1,000 engineers, half of the respondents do not fix code quality issues identified through static analysis
(Sonar or SonarQube), and some do not even monitor code quality. Those that use static analysis see a
10% increase in delivering bug-free software.

Automated tests exhibited the largest overall improvements in both the predictability and quality of
software deliveries. The laggards did no automated testing (0%), while the rock stars covered 75% of the
functionality with automated tests. Quality also increases most when developers themselves are testing
the code. More than half of the respondents have less than 25% test coverage, and there is a significant
increase in both predictability and quality as test coverage increases.

Code reviews significantly impact the predictability of releases but only moderately affect software
quality. A plausible explanation is that developers are poor at spotting bugs in code but good at spotting
software design issues and code smells that impact future development and maintenance. The majority of
problems found by reviewers are not functional mistakes, but what the researchers call evolvability
defects (discussed further in Section 1.3.4).

Close communication, such as daily standups and team chat, seems to be the best way to communicate and
increase the predictability of releases. Most teams work on technical debt at least sometimes, but the
survey results indicate no significant increases in the quality and predictability of software releases
from doing so. However, a negative trend appears when no technical debt maintenance is done.
1.2.2.3 Survey Results: Tool Type Usage
The article also contains a segment reporting tool and technology usage/popularity combined with the
corresponding increases or decreases in predictability and quality. Developers were asked about the
tools they used, and Figure 2 presents a bar chart of their responses.

The report analyzed which technologies and tool types influence the quality and predictability of
releases. For quality, there were no significant trends when compared to technologies and tool types;
it appears that quality is affected by development practices, but not development tools. For
predictability, using version control and an IDE significantly improves the predictability of
deliveries, and there is a reasonable increase in predictability for users of code quality analysis,
continuous integration, issue trackers, profilers and IaaS solutions. Use of a text editor or debugger
has little or no effect on predictability.
Figure 2: Tool Type Usage Results from Zeroturnaround 2013 Survey
What are our takeaways from this survey and report? First of all, it appears that the majority of the
survey respondents are our peers, developing mainly in Java and categorizing the company type as
Software/Technology companies. Therefore, comparisons can be made. Automation of testing is a
leading indicator of quality reiterated by many reports [4, 5]. As test coverage increases, both
predictability and quality increase and automation can promote greater coverage. Code reviews increase
predictability and can increase the quality of the structure of the code, which is part of refactoring best
practice in Agile developments. Code quality analysis and fixing quality problems is another practice that
increases both quality and predictability. Quality was not significantly affected by a tool set,
underscoring that quality is based mainly on practices, not tools. However, good tools can make a team
more productive. They can also serve as focal points to enforce the practices that further improve the
ability to predictably deliver quality software. Identifying and implementing best practices is one key to
improving software development. However, metrics need to be chosen carefully to measure the
improvement or lack thereof. The importance of automated testing and assessment of coverage has been
outlined by numerous sources. All of these practices should be considered important in our SDL.
1.2.3 DevOps Process Maturity Model
The reason many organizations adopt DevOps is rapid delivery that maximizes outcomes. One of the key
attributes and best practices of DevOps is the integration of quality assurance into the workflow. The
earlier errors are caught and fixed, the less rework is required, pushing the team toward a stable
product. Figure 3 provides a DevOps Process Maturity Model with five stages, from Initial to Optimize,
aligned with the CMMI Maturity Model. Assuming that our target is level 4, the major keys to achieving
this level are quantification and control.
Figure 3: DevOps Process Maturity Model
To reach the quantify level and eventually the optimize level within the DevOps maturity model, each
workflow should be analyzed to determine how errors are introduced and subsequently identify possible
quality assurance techniques to be inserted into those checkpoints.
1.2.4 DevOps Workflows
The first workflow is documentation. A best practice is good documentation of the process and other
components of development. Documentation should be readily available and up-to-date. Documentation
that is out of date or incorrect can be very detrimental. The process should be documented along with the
configuration of all the systems. The documentation and configuration files should be placed in a
Software Configuration Management (SCM) system. A configuration management system (CMS) should
be used to describe the families and groups of systems in the configuration. A benefit of a configuration
management system is that it checks in periodically with a centralized catalog to ensure that the system is
continuing to comply and run with the approved configuration. There are many instances where a
change in the configuration has a devastating effect. As seen in Figure 4, a single configuration error
can have far-reaching impact across IT and the business. From chaos theory, a butterfly flapping its
wings in one part of the world can result in a hurricane in other parts. A configuration error may not
be as devastating, but it has a large impact in terms of time, money and risk. Many CMSs have a method
to test or validate a set of configurations before pushing them to a machine. For example, many CMSs
(such as Puppet or BCFG2) have validation commands that can be executed as part of a process before
continuing and installing the configuration. Another option is to invoke a canary system as the first
line of defense for catching configuration errors [CHIL].
Figure 4: Configuration Errors Impact [VERI]
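The validate-before-push idea can be illustrated with a minimal sketch. This is a hypothetical gate that assumes JSON configuration files; a real CMS would use its own validator (for example, Puppet's `puppet parser validate` for manifests) in place of the `json.loads` check here.

```python
import json

def validate_configs(configs):
    """Return a list of (name, error) pairs for configs that fail to parse.

    `configs` maps a configuration file name to its raw JSON text. An empty
    result means every configuration is syntactically valid and safe to push.
    """
    errors = []
    for name, text in configs.items():
        try:
            json.loads(text)
        except ValueError as exc:
            errors.append((name, str(exc)))
    return errors

configs = {
    "web.json": '{"port": 8080, "workers": 4}',
    "db.json": '{"host": "db01", "port": }',  # malformed on purpose
}
problems = validate_configs(configs)
if problems:
    print("Blocking push:", problems[0][0])  # Blocking push: db.json
```

The point is that the gate runs before anything reaches a machine: a single malformed file blocks the whole push rather than propagating a bad configuration.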
Another best practice is to use an SCM for all of the products. An SCM allows multiple developers to
contribute to a project or set of files at once, merging their contributions without overwriting previous
work. An SCM also allows the rollback of changes in the event of an error making its way into the
repository. However, rollbacks should be avoided, and a code review can be inserted as a quality
assurance step to catch errors before they are committed.
1.3 Code Reviews
Code reviews employ someone other than the developer who wrote the code to check the work. Studies
have shown that quick, lightweight reviews found nearly as many bugs as more formal code reviews
[IBM]. At shops like Microsoft and Google, developers don’t attend formal code review meetings.
Instead, they take advantage of collaborative code review platforms like Gerrit (open source), CodeFlow
(Microsoft), Collaborator (Smartbear), ReviewBoard (open source) or Crucible (Atlassian, usually
bundled with Fisheye code browser), or use e-mail to request reviews asynchronously and to exchange
information with reviewers. These tools support a pre-commit method of code review. The code review
occurs before the code/configuration is committed to an SCM.
Reviewing code against coding standards (see Google’s Java coding guide, http://google-
styleguide.googlecode.com/svn/trunk/javaguide.html) is an inefficient way for a developer to spend their
valuable time. Every developer should use the same coding style templates in their IDEs and use a tool
like Checkstyle to ensure that code is formatted consistently. Highly configurable, Checkstyle can be
made to support almost any coding standard. Configuration files are supplied at the Checkstyle download
site supporting the Oracle Code Conventions and Google Java Style. An example of a report that can be
produced using Checkstyle and Maven can be seen at http://checkstyle.sourceforge.net/. Coding style
checker tools free up reviewers to focus on the things that matter such as assisting developers to write
better code and create code that works correctly and is easy to maintain.
Additionally, the use of static analysis tools upfront will make reviews more efficient. Free tools such
as FindBugs and PMD for Java can catch common coding bugs, inconsistencies, sloppy or messy code, and
dead code before the code is submitted for review. The latest static analysis tools go far beyond this,
and are capable of finding serious errors in programs such as null-pointer dereferences, buffer
overruns, race conditions, resource leaks, and other errors. Static analysis can also assist testing. If the
unreachable code or redundant conditions can be brought to the attention of the tester early, then they do
not need to waste time in a futile attempt to achieve the impossible. Static analysis frees the reviewer
from searching for micro-problems and bad practices, so they can concentrate on higher-level mistakes.
Static analysis is only a tool to help with code review and is not a substitute. Static analysis tools can’t
find functional correctness problems or design inconsistencies or errors of omission or help you find a
better or simpler way to solve a problem.
The reviewer should be concentrating on:

Correctness:
- Functional correctness: does the code do what it is supposed to do? The reviewer needs to know the
problem area, the requirements, and usually something about this part of the code to be effective at
finding functional correctness issues.
- Coding errors: low-level coding mistakes like using <= instead of <, off-by-one errors, using the
wrong variable (like mixing up lessee and lessor), copy-and-paste errors, or leaving debugging code in
by accident.
- Design mistakes: errors of omission, incorrect assumptions, messing up architectural and design
patterns such as model-view-controller (MVC).
- Security: properly enforcing security and privacy controls (authentication, access control, auditing,
encryption).

Maintainability:
- Clarity: class, method and variable naming, comments, etc.
- Consistency: using common routines or language/library features instead of rolling your own, and
following established conventions and patterns.
- Organization: poor structure, duplicate or unused/dead code.
- Approach: areas where the reviewer can see a simpler, cleaner or more efficient implementation.
1.3.1 Using Checklists
A checklist is an important component of any review. Checklists are most effective at detecting
omissions. Omissions are typically the most difficult types of errors to find. A reviewer does not require
a checklist to look for algorithm errors or sensible documentation. The difficult task is to notice when
something is missing and reviewers are likely to forget it as well. The longer a checklist becomes, the
less effective each item is reviewed [SMAR]. Limit the checklist to about 20 items. In fact, the SEI
performed a study indicating that a person makes about 15-20 common mistakes. For example, the
checklist can remind the reviewer to confirm that all errors are handled, that function arguments are tested
for invalid values, and that unit tests have been created. It is estimated that people make 15-20 common
mistakes in coding [SMAR]. Below is a sample review checklist from Smartbear
(http://smartbear.com/SmartBear/media/pdfs/best-kept-secrets-of-peer-code-review.pdf).
1.3.1.1 Sample Checklist Items
1. Documentation: All subroutines are commented in clear language.
2. Documentation: Describe what happens with corner-case input.
3. Documentation: Complex algorithms are explained and justified.
4. Documentation: Code that depends on non-obvious behavior in external libraries is documented with
reference to external documentation.
5. Documentation: Units of measurement are documented for numeric values.
6. Documentation: Incomplete code is indicated with appropriate distinctive markers (e.g. “TODO” or
“FIXME”).
7. Documentation: User-facing documentation is updated (online help, contextual help, tool-tips, version
history).
8. Testing: Unit tests are added for new code paths or behaviors.
9. Testing: Unit tests cover errors and invalid parameter cases.
10. Testing: Unit tests demonstrate the algorithm is performing as documented.
11. Testing: Possible null pointers always checked before use.
12. Testing: Array indexes checked to avoid out-of-bound errors.
13. Testing: Don’t write new code that is already implemented in an existing, tested API.
14. Testing: New code fixes/implements the issue in question.
15. Error Handling: Invalid parameter values are handled properly early in the subroutine.
16. Error Handling: Error values of null pointers from subroutine invocations are checked.
17. Error Handling: Error handlers clean up state and resources no matter where an error occurs.
18. Error Handling: Memory is released, resources are closed, and reference counters are managed under
both error and no error conditions.
19. Thread Safety: Global variables are protected by locks or locking subroutines.
20. Thread Safety: Objects accessed by multiple threads are accessed only through a lock.
21. Thread Safety: Locks must be acquired and released in the right order to prevent deadlocks, even in
error-handling code.
22. Performance: Objects are duplicated only when necessary.
23. Performance: Busy-wait loops are not used in place of proper thread synchronization methods.
24. Performance: Memory usage is acceptable even with large inputs.
25. Performance: Optimization that makes code harder to read should only be implemented if a profiler or
other tool has indicated that the routine stands to gain from optimization. These kinds of optimizations
should be well-documented and code that performs the same task simply should be preserved somewhere.
An effective method to maintain the checklist is to match defects found during review to the associated
checklist item. Items that turn up many defects should be kept. Defects that aren't associated with any
checklist item should be scanned periodically; usually there are categorical trends in your defects, and
each type of defect can be turned into a checklist item that would cause the reviewer to find it. Over
time, the team will become used to the more common checklist items and will adopt programming habits
that prevent some of them altogether. The list can be shortened by reviewing the “Top 5 Most Violated”
checklist items every month to determine whether anything can be done to help developers avoid the
problem. For example, if a common problem is “not all methods are fully documented,” a feature in the
IDE can be enabled that requires developers to have at least some sort of documentation on every method.
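Matching review defects to checklist items can be as simple as a tally. The defect tags below are illustrative sample data, not figures from the report:

```python
from collections import Counter

# Each review defect is tagged with the checklist item it matches,
# or None if no existing item covers it. Sample data is hypothetical.
defects = [
    "Unit tests are added for new code paths",
    "Possible null pointers always checked before use",
    "Possible null pointers always checked before use",
    None,  # uncovered defect: candidate for a new checklist item
    "Unit tests are added for new code paths",
]

tally = Counter(d for d in defects if d is not None)
uncovered = sum(1 for d in defects if d is None)

# The "Top 5 Most Violated" items, reviewed monthly:
for item, count in tally.most_common(5):
    print(count, item)
print("defects not matching any item:", uncovered)
```

Items with high counts stay on the list; the uncovered count signals where a new checklist item may be needed.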
1.3.2 Checklists for Security
Coding checklists are not usually devoted specifically to security reviews. Agnitio is an open-source
code review tool that guides a reviewer through a security review by following a detailed code and
design review checklist, and records the results of each review, removing the inconsistency of manual
security code review documentation. It assists developers and security professionals in conducting
manual security code reviews in a consistent and repeatable way. Code reviews are important for finding
security vulnerabilities, and often are the only way to find them short of exhaustive and expensive
penetration testing. This is why code reviews are a fundamental part of secure SDLCs such as
Microsoft's SDL.
1.3.3 Monitoring the Code Review Process
The code review process should be monitored for defect removal. For example, how do code reviews
compare to the other methods of defect removal practices in predicting how many hours are required to
finish a project? The minimal list of raw numbers collected is lines of code including comments (LOC),
inspection time, and defect count. LOC and inspection time are obvious. A defect in a code review is
something a reviewer wants changed in the code. A tool-assisted review process should be able to
collect these automatically without manual intervention. From these data, other analytical metrics can be
calculated and, if necessary, classified into reviews from a development group, reviews of a certain
author, reviews performed by a certain reviewer, or on a set of files. The calculated ratios are inspection
rate, defect rate and defect density.
The inspection rate is the rate at which a certain amount of code is reviewed. The ratio is LOC divided
by inspection hours. An expected value for a meticulous inspection would be 100-200 LOC/hour; a normal
inspection might be 200-500; above 800-1000 LOC/hour is so fast that it can be concluded the reviewer
performed a perfunctory job.
The defect rate is the rate defects are uncovered by the reviewers. The ratio is defect count divided by
inspection hours. A typical value for source code would be 5-20 defects per hour depending on the review
technique. For example, formal inspections with both private code-reading phases and inspection
meetings will be on the slow end, whereas the lightweight approaches, especially those without scheduled
inspection meetings, will trend toward the high end. The time spent uncovering the defects in review is
counted in the metric and not the time taken to actually fix those defects.
The defect density is the number of defects found in a given amount of source code. The ratio is defect
count divided by kLOC (thousand lines of code). A higher defect density indicates that more defects are
being uncovered and that the reviews are effective. That is, a high defect density is more likely to mean
the reviewers did a thorough job than that the underlying source code is extremely bad. It is difficult to
give an expected value for defect density because of the many factors involved. For example, a mature,
stable code base with tight development controls might have a defect density as low as 5 defects/kLOC,
whereas a review of new code written by junior developers, in an environment uncontrolled except for a
strict review process, might uncover 100-200 defects/kLOC.
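As a sketch, the three ratios above can be computed directly from the raw data a review tool collects. The function names and the example figures are illustrative, not taken from any particular tool:

```python
# Illustrative sketch: the three calculated review ratios described above.
# Function names and inputs are hypothetical, not from a specific review tool.

def inspection_rate(loc: int, hours: float) -> float:
    """LOC reviewed per inspection hour (100-200 meticulous, >800-1000 perfunctory)."""
    return loc / hours

def defect_rate(defects: int, hours: float) -> float:
    """Defects uncovered per inspection hour (typically 5-20 for source code)."""
    return defects / hours

def defect_density(defects: int, loc: int) -> float:
    """Defects found per thousand lines of code reviewed."""
    return defects / (loc / 1000)

# Example: a review of 1,500 LOC that took 6 hours and found 30 defects
print(inspection_rate(1500, 6))   # 250.0 LOC/hour -> a normal inspection pace
print(defect_rate(30, 6))         # 5.0 defects/hour
print(defect_density(30, 1500))   # 20.0 defects/kLOC
```

Classifying these ratios by group, author, reviewer, or file set is then a matter of filtering the raw records before applying the same formulas.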
1.3.4 Evolvability Defects
There is more to reviews, however, than finding bugs and security vulnerabilities. A 2009 study by
Mantyla and Lassenius revealed that the majority of problems found by reviewers are not functional
mistakes, but what the researchers call evolvability defects [MANT]: issues that make code harder to
understand and maintain, more fragile, and more difficult to modify and fix.
Between 60% and 75% of the defects found in code reviews fall into this class. Of these, approximately
1/3 are simple code clarity issues, such as improving element naming and comments. The rest are
organizational problems: code that is poorly structured, duplicated, or unused, code that could be
expressed with a much simpler and cleaner implementation, or hand-rolled code that could be replaced
with built-in language features or library calls. Reviewers also find changes that do not belong or are not
required, copy-and-paste mistakes, and inconsistencies.
These defects or recommendations feed back into refactoring and are important for future maintenance of
the software, reducing complexity and making it easier to change or fix the code in the future. However,
it’s more than this: many of these changes also reduce the technical risk of implementation, offering
simpler and safer ways to solve the problem, and isolating changes or reducing the scope of a change,
which in turn will reduce the number of defects that could be found in testing or escape into the release.
1.3.5 Other Guidelines
An important aspect of enterprise architecture is the development of guidelines for addressing common
concerns across IT delivery teams. An organization may develop security guidelines, connectivity
guidelines, coding standards, and many others. By following common development guidelines, the
delivery teams produce more consistent solutions, which in turn make them easier to operate and support
once in production, thereby supporting the DevOps strategy.
1.3.6 High Risk Code and High Risk Changes
If possible, all code should be reviewed. When that is not possible, one must ensure that high risk code
and high risk changes are always reviewed. Listed below are candidates for these categories.
High risk code:
Network-facing APIs
Plumbing (framework code, security libraries, etc.)
Critical business logic and workflows
Command and control and root admin functions
Safety-critical or performance-critical (especially real-time) sections
Code that handles private or sensitive data
Code that is complex
Code developed by many different people
Code that has had many defect reports – error prone code
High risk changes:
Code written by a developer who has just joined the team
Big changes
Large-scale refactoring (redesign disguised as refactoring)
1.4 Testing
Although static analysis and code review prevent many errors, these activities will not catch them all.
Mistakes can still creep into the production environment. Untested code must be exercised in a testbed,
and all changes should be made in the testbed first; this is another best practice. Testing, except at the unit
level, is performed following a formal delivery of the application build to the QA team after most, if not
all, construction is completed. Unit tests are just as much about improving productivity as they are about
catching bugs, so proper unit testing can speed development rather than slow it down. Unit testing is not
in the same class as integration testing, system testing, or any kind of adversarial "black-box" testing that
attempts to exercise a system based solely on its interface contract. These types of tests can be automated
in the same style as unit tests, perhaps even using the same tools and frameworks. However, unit tests
codify the intent of a specific low-level unit of code. They are focused and they are fast. When an
automated test breaks during development, the responsible code change is rapidly identified and
addressed. This rapid feedback cycle generates a sense of flow during development, which is the ideal
state of focus and motivation needed to solve complex problems.
As software grows, defect potential increases and defect removal efficiencies decrease. The defect density
at release time increases and more defects are released to the end-user(s) of the software product. Larger
software size increases the complexity of software and, thereby, the likelihood that more defects will be
injected. For testing, a larger software size has two consequences:
The number of tests required to achieve a given level of test coverage increases exponentially
with software size.
The time to find and remove a defect first increases linearly and then grows exponentially with
software size.
As software size grows, software developers must reduce their defect potentials and improve their
removal efficiencies simply to maintain existing levels of released defect density. The raw code coverage
metric is meaningful mainly when it is too low; a high value requires further analysis.
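The one-sided nature of the coverage metric can be sketched as a simple gate; the 70% threshold here is an illustrative assumption, not a standard value:

```python
# Sketch of the policy stated above: treat raw code coverage as a one-sided
# signal. The 70% minimum is an illustrative assumption, not a standard.

def assess_coverage(percent: float, minimum: float = 70.0) -> str:
    if percent < minimum:
        return "fail: coverage too low, add tests"
    # High coverage alone proves little; the tests themselves need review.
    return "pass: threshold met, but inspect test quality before trusting it"

print(assess_coverage(45.0))
print(assess_coverage(92.0))
```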
1.4.1 Number of Test Cases
Although there are many definitions of software quality, it is widely accepted that a project with many
defects lacks quality. Testing is one of the most crucial tasks in software development that can increase
software quality. A large part of the testing effort is in developing test cases. The motive for writing a test
case should be the complete and correct coverage of a requirement, which could require five or fifty test
cases. The number of test cases is basically irrelevant for this purpose and can even be a damaging
distraction. A large number of test cases could artificially inflate confidence that the software has been
adequately tested. There is also no standard for what constitutes one test case. A tester can create one
large test case or 200 smaller ones. It is good practice to write a separate test case for each piece of
functionality, and some testers break test cases down further into discrete steps. Thus, the number of
test cases cannot assure a requirement's coverage; it is the content of the test cases that covers a requirement.
The number of test cases also does not indicate the quality of the test cases. Choosing the right techniques
and prioritizing the right test cases can provide significant economic benefits. Therefore, it is important to
analyze test case quality. Test case quality has many facets, such as the number of faults revealed and
efficiency, the time spent to reveal a fault. The most common and oldest measures are coverage measures,
used as a direct measure of test case quality [FRAN, Hutchins].
Interesting research results on test coverage are presented in a paper by Mockus, Nagappan, and Dinh-
Trong [MOCK]. Key observations and conclusions from the paper are the following:
"Despite dramatic differences between the two industrial projects under study we found that code
coverage was associated with fewer field failures. This strongly suggests that code coverage is a
sensible and practical measure of test effectiveness."
The authors state “an increase in coverage leads to a proportional decrease in fault potential."
"Disappointingly, there is no indication of diminishing returns (when an additional increase in
coverage brings smaller decrease in fault potential)."
"What appears to be even more disappointing, is the finding that additional increases in coverage
come with exponentially increasing effort. Therefore, for many projects it may be impractical to
achieve complete coverage."
From this paper, it can be concluded that more coverage means fewer bugs, but at increasing cost.
Although there are strong indications of a correlation between coverage and fault detection, considering
only the number of faults may not be sufficient. Code coverage does not guarantee that the code is
correct, and attaining 100 percent code coverage does not imply the system will have no failures; bugs
can still be found outside the anticipated scenarios.
1.5 Agile Process and QA
The development process has been transformed to Agile and the code is being developed iteratively. The
product owners are maintaining the backlog and the development team is completing chunks of the
product in two or three week increments, or sprints. The QA process follows the released sprint. The
process just described is partially stuck in Waterfall mode (in the mindset of developers) if the QA
department is lagging a sprint behind the development team. Often, the developers in this scenario
consider their work done when they have deployed their changes to a QA environment for testing
purposes. This is “throwing it over the wall” again, only in smaller increments. Many factors can derail
development in this setup: loosely defined acceptance criteria, outdated quality standards, time-
consuming regression tests, slow and error-prone deployments to the QA environment, and rigid
organization charts. The setup described is not unusual and has been attempted before, so it is important
to revisit it and identify some of the common mistakes organizations make during their transition to
Agile.
There are two main ways that Agile provides a basis for system quality verification. They are the
acceptance criteria and the definition of done. Acceptance criteria are normally expressed in Gherkin, the
language used in Cucumber to write tests. This structured format defines the functionality, performance or
other non-functional aspects that will be required of the software to be accepted by the business and/or
stakeholders. Naturally, the acceptance criteria should be defined before beginning the development
effort on the functionality. The team and the testers should all be involved, at least to some degree, in
producing the criteria. With the acceptance criteria in hand, the developers understand what needs to be
built by knowing how it will be tested, which helps ensure they build software that yields only the
desired outcomes.
The second way of checking quality is the definition of done: a list of quality checks that must be
satisfied before a piece of functionality can be considered done. The list also includes the non-functional
quality requirements that the team must always adhere to when working through the backlog, sprint after
sprint. The acceptance criteria ensure that the software delivers the expected value, and the definition of
done ensures that the software is built with quality in mind.
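As an illustration of acceptance criteria driving development, a Gherkin-style scenario can be mirrored directly in an automated test. The feature, class, and amounts below are hypothetical, chosen only to show the Given/When/Then shape:

```python
# Hypothetical acceptance test mirroring a Gherkin scenario:
#   Given a registered user with a $100 balance
#   When the user withdraws $40
#   Then the remaining balance is $60

class Account:
    def __init__(self, balance: float):
        self.balance = balance

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

def test_withdrawal_reduces_balance():
    account = Account(balance=100.0)   # Given
    account.withdraw(40.0)             # When
    assert account.balance == 60.0     # Then

test_withdrawal_reduces_balance()
print("acceptance criterion satisfied")
```

Tests written this way run under an ordinary unit-test runner, so the acceptance criteria can execute in the same automated suite as the unit tests.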
In all development models, testing constitutes a majority of the schedule; some suggest that if this is not
the case, quality is most likely suffering. In an Agile process, change occurs rapidly, so regression testing
is performed frequently to verify that changes have not regressed the software. With every new release,
as more functionality is added, the amount to be tested and retested expands. The only sensible approach
is to automate whatever can be automated. Developers should be writing unit tests throughout
development. There are integration tests verifying that the internal components work together properly,
and the new code integrates well with external components. These are normally automated. Acceptance
tests are tests that verify that the acceptance criteria are being satisfied, and automating these tests would
prevent manually running regression tests with each new release. The timing of test automation is also
important. The ideal process would be to automate the acceptance tests, Acceptance Test-Driven
Development (ATDD), before the development effort has even begun. There are tools that translate the
acceptance criteria’s Gherkin statements into the tests. Even if ATDD is not followed, automating the
acceptance tests should be part of the done definition. Manually triggering tests to execute and manual
deployments are also common pitfalls. The best practice is early visibility to quality issues by being
notified immediately after committing the offending code into source control. A continuous integration
(CI) build that runs your suite of automated tests enables the developer to look at the issue with all the
details of the problem fresh in his/her mind.
As others have experienced, combining acceptance test and development into one gated check-in saves
cost and rework. The code cannot be committed into source control without first passing all required tests
of the CI build. Some of the software can be environment-specific, and the tests must be executed against
an environment that mirrors production as closely as possible. Combining CI build and automated
deployment provides continuous deployment, at least to the test environment. The latest code is
automatically deployed to the test environment after each successful commit to source control enabling
the test environment to continuously execute against the latest work by the developers. Bugs are made
visible significantly sooner in the development cycle, including the fickle “it works on my machine”
bugs.
The best practice is to tightly integrate the development and QA efforts. Testing should be incorporated
into the core development cycle such that no team can ever call anything “done” unless it has been fully
vetted by thorough testing. Some aspects of quality assurance are challenging for a development team
primarily focused on delivering new functionality of high quality: security testing and load testing.
Although these tests can be automated, they are so costly in time and processing power to run that they
are typically not part of an automated test suite that executes as part of
continuous integration. Another category of testing that is definitely best suited for dedicated QA testers
is exploratory testing. Automated tests can only catch bugs in the predictable and designed behavior of an
application, while exploratory tests catch the rest. The QA department should coordinate and refine these
practices, provided that the testers themselves are allocated to the development team whose code is being
tested.
1.5.1 Agile Quality Assessment (QA) on Scrum
There are many challenges when applying an Agile quality assessment. The following questions must be
assessed to determine the process:
Is QA part of the development team?
Can QA be part of the same iteration as development?
Who performs QA? (Separate team)
Does QA cost more in Agile as the product fluctuates from sprint to sprint?
Can Agile QA be scaled?
Is a test plan required?
Who defines the test cases?
Are story acceptance tests enough?
When is testing done?
When and how are bugs tracked?
Much of QA is about testing to ensure the product is working right. Automation is QA’s best friend by
providing repeatability, consistency and better test coverage. Since sprint cycles are very short, QA has
little time to test the application. QA performs full functionality testing of the new features added for a
particular sprint as well as full regression testing for all of the previously implemented functionality. As
the development proceeds, this responsibility grows and any automation will greatly reduce the pressure.
Early in the transition to Agile, the process may include less-than-optimal practices until the root causes
can be addressed. For example, a sprint dedicated to regression testing does not reflect underlying Agile
principles; such a sprint is sometimes labeled a hardening sprint and is considered an anti-pattern.
For example, an Informit publication related a story of a company that struggled with testing everything
in a sprint because of a large amount of legacy code with no automated tests and a hardware element that
required manual testing. Until more automation could be implemented, a targeted regression testing
sprint was initiated at the end of each sprint, and another sprint was added before each release for a more
thorough regression testing session with participation by all groups. To address the lack of legacy test
automation, an entire team was assigned to automate tests. Meanwhile, the other teams were trained in
test automation techniques and tools so they could create automated tests during current sprints.
Eventually, the test suites were automated, the dedicated test team was no longer required, and the Scrum
teams were automating their own tests. As a result, the time required for regression testing was cut in
half and the hardening sprints were greatly reduced.
A presentation, “Case: Testing in Large-Scale Agile Development”, by Ismo Paukamainen, senior
specialist in test and verification at Ericsson, was given at the FiSTB Testing Assembly 2014 [PAUK]. In
this presentation, he describes and outlines Ericsson’s transformation from RUP to an Agile process.
Naturally, parts of the presentation focused on continuous integration, which provides continuous
assurance of system quality. The process appears to have been very good on functional quality, but not as
good in the areas labeled non-functional. As he states:
“Before Agile, the system test was a very late activity, having a long lead-time. It was often hard to
convince management about the needs for system tests requiring resources, human and machine for many
weeks. This was because the requirements for the product are most often for the new functionality, not for
non-functional system functions which are in a scope of system tests. The fact is that only ~5% of the
faults found after a release are in the new features, the rest are in the customer perceived quality area.”
At the conclusion of the talk, he also had five insightful takeaways about the transition:
Test competence: If not spread equally, think about other ways to support teams (e.g., in test
analysis and sprint planning). Product owners should take responsibility for checking that there are
enough tests for a sprint. A dedicated testing professional position in a cross-functional team is
recommended.
Fast feedback: In Waterfall, the aim was to do as much testing as possible at a lower
integration level, so that testing happened earlier and it was easier to find (and fix) faults close to
the designer. In Agile, the aim is to get feedback as fast as possible, which means the strategy
is no longer to run a mass of tests at the lower level, but to run tests at the level that gives the fastest
feedback. It may be that running tests in a target (production-like) environment serves
feedback better, and the lower level is needed only to verify special error cases that may not be
executable on target.
Test Automation is a must in Agile. Use common frameworks and test cases as much as
possible. Try to avoid extra maintenance work around automation (for example, Continuous
Integration).
Independent Test Teams are a good way to support cross-functional teams, especially to cover
agile testing. Making non-functional system tests the job of cross-functional teams would mean:
i) possible overlapping testing,
ii) a need for more test tools and test environments,
iii) a competence issue, and
iv) possibly too much to do within sprints.
Independent test teams need to be co-located with cross-functional teams and have good
communication with them. A sense of community!
Raise Your Organizational Awareness of the Product Quality: Monitor the system quality
(Robustness, Characteristics, Upgrade …) and make it visible through the whole organization.
The desired software is broken down into named features (requirements, stories), which are part of what it
means to deliver the desired system. For each named feature, there are one or more automated acceptance
tests which, when they pass, show that the feature in question is implemented. The running tested
features (RTF) metric shows, at every moment in the project, how many features are passing all of their
acceptance tests. Automated testing is also a factor in quality-producing environments; therefore,
measuring automated unit and acceptance test results is another important measure.
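The RTF metric can be sketched as a simple count over per-feature acceptance test results; the data layout and feature names here are illustrative assumptions:

```python
# Sketch of the running tested features (RTF) metric described above:
# a feature counts only when all of its acceptance tests currently pass.
# The dictionary layout and feature names are illustrative assumptions.

features = {
    "login":    [True, True, True],   # acceptance test results per feature
    "search":   [True, False],
    "checkout": [True, True],
}

def running_tested_features(feature_tests: dict[str, list[bool]]) -> int:
    """Count features whose acceptance tests all pass (and that have tests)."""
    return sum(1 for tests in feature_tests.values() if tests and all(tests))

print(running_tested_features(features))  # 2 of 3 features pass all tests
```

Recomputing this count on every CI run gives the at-every-moment view the metric is meant to provide.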
1.6 Product Backlog
A common Agile approach to change management is a product backlog strategy. A foundational concept
is that requirements, and defect reports, should be managed as an ordered queue called a "product
backlog." The contents of the product backlog will vary to reflect evolving requirements, with the product
owner responsible for prioritizing work on the backlog based on the business value of the work item. Just
enough work to fit into the current iteration is selected from the top of the stack by the team at the start of
each iteration as part of the iteration planning activity. This approach has several potential advantages.
First, it is simple to understand and implement. Second, because the team is working in priority order, it is
always focusing on the highest business value at the time, thereby maximizing potential return on
investment (ROI). Third, it is very easy for stakeholders to define new requirements and refocus existing
ones.
There are also potential disadvantages. The product backlog must be groomed throughout the project
lifecycle to maintain priority order, and that effort can become a significant overhead if the requirements
change rapidly. It also requires a supporting strategy to address non-functional requirements. With a
product backlog strategy, practitioners will often adopt an overly simplistic approach that focuses only on
managing functional requirements. Finally, this approach requires a product owner who is capable of
managing the backlog in a timely and efficient manner.
Section 2: Software Product Quality and its Measurement
2.1 Goal Question Metric Model
Many software metrics exist that provide information about resources, processes and products involved in
software development. The introduction of software metrics to provide quantitative information for a
successful measurement program is necessary, but it is not enough. There are other important success
factors that must be considered when selecting metrics. Foremost is that the metrics must quantify
performance achievements towards measurement goals. Basili created the Goal Question Metric (GQM)
interpretation model to assist with outlining the goals, subgoals and questions for a measurement program
[BASI]. Table 1 gives a GQM definition template employing a DevOps concept for software
development. The development process affects the nature and timing of the metrics.
Table 1: Main Goal of Software Development
Analyze: Software Development
For the purpose of: Assessing and Improving Performance
With respect to: Software Quality
From the viewpoint of: Management, Scrum Master and Development Team
In the context of: DevOps Environment
2.2 Generic Software Quality Models
The main goal of assessing and improving software development with respect to quality can be broken
down into three aspects. The sub-goals of functional, structural and process quality improvement form the
basis to derive the questions and metrics for the GQM. Dividing software quality into three sub-goals
allows us to illuminate the trade-off that exists among competing goals. In general, functional quality
reflects the software’s conformance to the functional requirements or specifications. Functional quality is
typically enforced and measured through software testing. Software structural quality refers to achieving
the non-functional requirements to support the delivery of the functional requirements, such as reliability,
efficiency, security and maintainability. Just as important as the first two sub-goals, which receive the
majority of the quality dialog, is process quality: whether the development process consistently delivers
quality software on time and within budget. Table 2 breaks down the three sub-goals into more
measurable components.
Table 2: Three Sub-Goals Broken Down into Measurable Components
Question Property
Functional Does the system deliver the business value
planned? How many user requirements were
delivered in the sprint?
enhancement rate
Does the solution do the right thing? How many
bugs were removed in the sprint?
defect removal rate
Structure How modifiable (maintainable) is the software?
How modular is the software?
How testable is the software?
Maintainability:
duplication
unit size / complexity
What is the performance efficiency of the
software?
Efficiency: time-behavior,
resource utilization, capacity
How secure is the software? Security: confidentiality, integrity,
non-repudiation, accountability,
authenticity
How usable is the software? Usability: learnability, operability,
user error protection, user interface
aesthetics, accessibility
How reliable is the software? Reliability
Process What is the capacity of the software
development process?
Velocity
What is the cycle/lead time? Cycle time/ lead time
How many bugs were fixed before delivery? Defect removal effectiveness
How can we improve the delivery of business value?
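One of the process properties in Table 2, defect removal effectiveness, can be computed as the fraction of total defects removed before delivery. The figures in the example are hypothetical:

```python
# Illustrative computation of defect removal effectiveness (DRE) from
# Table 2: the fraction of all defects that were removed before delivery.

def defect_removal_effectiveness(pre_release: int, post_release: int) -> float:
    """pre_release: defects fixed before delivery; post_release: escapes."""
    total = pre_release + post_release
    return pre_release / total if total else 1.0

# Example: 90 defects fixed before delivery, 10 escaped to the field
print(defect_removal_effectiveness(90, 10))  # 0.9 -> 90% effective
```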
Structural quality is determined through the analysis of the system, its components and source code.
Software quality measurement is about quantifying to what extent a system or software possesses
desirable characteristics. This can be performed through qualitative or quantitative means or a mix of
both. In both cases, for each desirable characteristic there is a set of measurable attributes whose
presence in a piece of software or system tends to be associated with that characteristic. Historically,
many of these attributes have been extracted from ISO 9126-3 and the subsequent ISO 25000:2005
quality model, also known as SQuaRE. Based on these models, the Consortium for IT Software Quality
(CISQ) has defined five major desirable structural characteristics needed for a piece of software to
provide business value: Reliability, Efficiency, Security, Maintainability and (adequate) Size. In Figure 5,
the five characteristics on the right, which matter to the user or owner of the business system, depend on
the measurable attributes on the left. Other quality models have been created, such as the one from
Fenton depicted in Figure 6.
To understand the professional meaning of code quality, the complete study of these concepts is required.
However, these models do not lend themselves naturally to practical development environments and we
need to explore more deeply what impacts business value.
Figure 5: Relationship Between Desirable Software Characteristics (right) and Measurable
Attributes (left) [WIKI]
Figure 6: Software Quality Model [FENT]
2.3 Comparing Traditional and Agile Metrics
Traditional software development and Agile methods actually have the same starting point. Each process
plans to develop a product of acceptable quality, applying a specific amount of effort within a certain
timeframe. The approach and processes differ, but the goal stated above is still the same. Traditional
software methods apply metrics to plan and forecast, monitor and control, and to integrate performance
improvement within the process, and Agile also requires metrics with these same capabilities. Agile
clearly differs from the traditional approach in that traditional software development metrics track a plan
through evaluating cost expenditures, whereas Agile development metrics do not track against a plan.
Agile metrics attempt to measure the value delivered or the avoidance of future costs. Another difference
between Agile and traditional is the units of measure. Table 3 is a matrix comparing the core metric units
of Agile to traditional software development [NESM].
Table 3: Comparison of Core Metrics for Agile and Traditional Development Methods
Core Metric Agile Traditional
Product (size) Features, stories Function points (FP), COSMIC
function points, use case points
Quality Defects/iteration, defects, MTTD Defects/release, defects, MTTD
Effort Story points Person months
Time Duration (months) Duration (months)
Productivity Velocity, story points/iteration Hours/FP
In Table 3, Agile uses a subjective unit, the story point, to measure effort, making comparisons between
teams, projects and organizations impossible. Traditional methods use standardized units of measure,
function points (FP) and COSMIC function points (CFP). Both FP and CFP are objective and are
recognized international standards. Several estimation and metric tools use the metric hours/FP for
benchmarking purposes. A noticeable characteristic of Agile is the absence of benchmarking metrics or
any other form of external comparison. The units of measure used for product (size) and productivity are
subjective and apply exclusively to the project and team in question. There is no way to compare
development teams or tendering contractors on productivity, so selecting a development team based on
productivity is virtually impossible [NESM].
2.4 NESMA Agile Metrics
The Netherlands Software Metrics User Association (NESMA) began in 1989 as a reaction to the
counting guidelines of the International Function Point Users Group (IFPUG) and became one of the first
FPA user organizations in the world. Their NESMA standard for functional size measurement became
an ISO/IEC standard. In 2011, the organization shifted from an FPA user group to an organization that
provides information about the applied use of software metrics: estimation, benchmarking, productivity
measurement, outsourcing and project control. NESMA conducted a search of the web for recommended
Agile metrics, a so-called state-of-the-practice survey. They divided the survey into three main areas of
interest: planning and forecasting, monitoring and control, and performance improvement. The following
three tables, Tables 4, 5 and 6, transcribe information from the website (http://nesma.org/2015/04/Agile-
metrics/).
2.4.1 Metrics for Planning and Forecasting
Table 4: Metrics for Planning and Forecasting
Metric Purpose How to measure
Number of features Insight into size of product
(and entire release). Insight
into progress.
The product comprises features that in
turn comprise stories. Features are
grouped as “to do”, “in progress” and
“accepted”.
Number of planned stories
per iteration/release
Same as number of features. The work is described in stories which are
quantified in story points.
Number of accepted stories
per iteration/release
To track progress of the
iteration/release
Formal registration of accepted stories
Team velocity See monitoring and control
LOC Indicates amount of
completed work (progress)
calculation of other metrics
i.e. defect density
According to the rules agreed upon.
2.4.2 Dividing the Work into Manageable Pieces
In order to plan and forecast, the development process requires the work to be divided into manageable
pieces. It is essential in larger organizations that these pieces are organized, scaled and of a consistent
hierarchy if they are going to be used for measurement. There are two important abstractions used to
build software: features and components. Features are system behaviors useful to the customer.
Components are distinguishable software parts that encapsulate functions needed to implement the
feature. Agile’s delivery focuses on features (stories). Large-scale systems are built out of components
that provide for the separation of concerns and improved testability, providing a base for fast system
evolution. In Agile, should the teams be organized around features, components or both? Getting it
wrong can lead to a brittle system (all feature teams) or a great design whose value arrives only in the
future (all component teams).
Previously, large-scale developments followed the component organization depicted in Figure 7. The
problem with this organization is that most new features create dependencies requiring cooperation
between teams, which creates a drag on velocity because the teams spend time discussing and
analyzing dependencies. Component organization may still be desired when one component has
higher criticality, requires rare or unique skills and technologies, or is heavily used by other components
or systems. Feature team organization, pictured in Figure 8, operates through user stories and refactoring.
Figure 7: An Agile Program Comprised of Component Teams [SCAL]
Figure 8: An Agile Program Comprised of Feature Teams
For large developments, the organization is not as clear cut. Some features are large and are split into
multiple user stories. It is overly simplistic to think of all teams as being either component- or feature-based.
To ensure the highest feature throughput, the SAFe (Scaled Agile Framework) guidelines suggest a mix
of feature and component teams, with feature teams making up the largest share at about 75-80%.
This split is dictated by the number of specialized technologies or skills required to develop the product.
Depending on the hierarchy (features or stories), the unit of work used in Table 5 may mask important
details if not counted uniformly.
2.4.3 Metrics for Monitoring and Control
Table 5: Metrics for Monitoring and Control

Iteration burn-down
  Purpose: Performance per iteration; are we on track?
  How to measure: Effort remaining (in hours) for the current iteration (effort spent/planned expresses performance).

Team velocity per iteration
  Purpose: To learn the historical velocity for a certain team. Cannot be used to compare different teams.
  How to measure: Number of realized story points per iteration within this release. Velocity is team and project-specific.

Release burn-down
  Purpose: To track progress of a release from iteration to iteration; are we on track for the entire release?
  How to measure: Number of story points "to do" after completion of an iteration within the release (extrapolation with a certain velocity shows the end date).

Release burn-up
  Purpose: How much 'product' can be delivered within the given time frame?
  How to measure: Number of story points realized after completion of an iteration.

Number of test cases per iteration
  Purpose: To identify the amount of testing effort per iteration; to track progress of testing.
  How to measure: Number of test cases per iteration recorded as sustained, failed, and to do.

Cycle time (team's capacity)
  Purpose: To determine bottlenecks of the process; the discipline with the lowest capacity is the bottleneck.
  How to measure: Number of stories that can be handled per discipline within an iteration (i.e. analysis, UI design, coding, unit test, system test).

Little's Law (cycle times are proportional to queue length)
  Purpose: Insight into duration; we can predict completion time based on queue length.
  How to measure: Work in progress (number of stories) divided by the capacity of the process step.
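As an illustration of the release burn-down extrapolation described in Table 5, the following sketch (the function name and the figures are hypothetical, not from the report's data) forecasts how many iterations remain from the outstanding story points and historical velocity:

```python
import math

def forecast_iterations(points_remaining, velocity_history):
    """Forecast the iterations left in a release by dividing the remaining
    story points by the team's average historical velocity."""
    avg_velocity = sum(velocity_history) / len(velocity_history)
    return math.ceil(points_remaining / avg_velocity)

# 120 story points still "to do"; velocity over the last four iterations
print(forecast_iterations(120, [28, 32, 30, 30]))  # 4
```

The same division underlies the burn-down chart: the extrapolation line simply projects the average velocity forward until the remaining points reach zero.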
One metric not mentioned previously is Little's Law, which states that the more items there are in the
queue, the longer the average time each item will take to travel through the system. Therefore, managing
the queue (backlog) is a powerful mechanism for reducing wait time, since long queues result in delays,
waste, unpredictable outcomes, disappointed customers and low employee morale. (See Section 1.6,
Product Backlog.) However, everyone realizes that variability exists in technology. Some companies
limit utilization to less than 100% so that a development has some flexibility, which is counterintuitive to
most models that suggest setting resources to 100% utilization. There is also the well-known observation that
work expands to fill the time allotted. To make use of the slack left by less than 100% utilization, some have instituted a Hardening
Innovation Planning (HIP) sprint to promote a new innovation or technology, find a solution to a nagging
defect or identify a fantastic new feature.
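A minimal sketch of the Little's Law entry in Table 5 (the numbers are invented for illustration): dividing the work in progress by the capacity of a process step yields the expected cycle time, which is why long queues translate directly into long waits.

```python
def littles_law_cycle_time(wip_stories, throughput_per_iteration):
    """Little's Law: average cycle time = work in progress / throughput.
    Here the result is expressed in iterations."""
    return wip_stories / throughput_per_iteration

# 24 stories in the queue; the step completes 8 stories per iteration
print(littles_law_cycle_time(24, 8))  # 3.0
```

Halving the queue halves the expected wait without any change in throughput, which is the quantitative argument for backlog management made above.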
2.4.4 Metrics for Improvement (Product Quality and Process Improvement)
Table 6: Metrics for Improvement (Product Quality and Process Improvement)

Cumulative number of defects
  Purpose: To track the effectiveness of testing.
  How to measure: Logging each defect in a defect management system.

Number of test sessions
  Purpose: To track testing effort and compare it to the cumulative number of defects.
  How to measure: Extraction of data from the defect repository.

Defect density
  Purpose: To determine the quality of software in terms of "lack of defects".
  How to measure: The cumulative number of defects divided by KLOC.

Defect distribution per origin
  Purpose: To decide where to allocate quality assurance resources.
  How to measure: By logging the origin of defects in the defect repository and extracting the data by means of an automated tool.

Defect distribution per type
  Purpose: To learn which types of defects are the most common and help avoid them in the future.
  How to measure: By logging the type of defects in the defect repository and extracting the data by means of an automated tool.

Defect cycle time
  Purpose: Insight into the time to solve a defect (speed of defect resolution).
  How to measure: The resolution date of the defect (usually the closing date in the defect repository) minus its opening date.
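The defect cycle time entry in Table 6 can be computed directly from repository dates; a small sketch (the dates are invented):

```python
from datetime import date

def defect_cycle_time(opened, resolved):
    """Days from defect detection to defect resolution
    (resolution date minus opening date)."""
    return (resolved - opened).days

# Defect opened March 2, resolved March 9
print(defect_cycle_time(date(2015, 3, 2), date(2015, 3, 9)))  # 7
```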
As seen from Tables 4, 5, and 6, Agile metrics are essentially the same metrics as those used within traditional
development, except that they use Agile units (features, stories, story points) and concepts.
2.5 Another State-of-the-Practice Survey on Agile Metrics
Another state-of-the-practice survey divided the Agile metrics into three main areas of interest: planning
and forecasting, monitoring and control, and performance improvement [GALE]. Most researchers and
consultants claim that there are four distinct areas of interest to collect for Agile development:
• Predictability
• Value
• Quality
• Team Health – can be based on an Agile maturity survey
How do the categories of predictability (see Section 1.2.2.2, Survey Results: Predictability), value, quality and
team health overlap with the NESMA categories? Predictability maps to the planning and forecasting
category, value maps to monitoring and control, and quality maps to performance improvement. It
appears that the Agile community has not come to a consensus on what team health is. However, since
development teams are the foundation of production, it does appear that the more organized and
focused the teams are, the better the outcome.
Based on three of these areas, predictability, value and quality, a list of what to measure during Agile
development was compiled. This "What to measure?" list consists of twelve categories:
1. Events that halt a release, such as continuous integration or continuous deployment stops (a quality-based metric of type outcome)
2. Number and types of corrective actions per team or across teams (quality, outcome)
3. Number of stories delivered with zero bugs
4. Number of stories reworked (value, output)
5. Percentage of technical debt addressed, with a target >30% (value or quality, outcome), including:
   • Coding standards violations
   • Code violations
   • Dead code
   • Code dependencies (coupling)
6. Velocity per team, where trending is most important. Velocity is not used to measure productivity
but to derive the duration of a set of features (predictability, output).
7. Delivery predictability per story point, average variance improving across teams (predictability,
output)
8. Release burn-down charts, displaying both story points completed and those added by iteration
9. Percentage of test automation, including UI-level, component/middle-tier and unit-level coverage,
where trending is most important (quality, output)
10. Organizational commit levels; the more that participate, the better the value (predictability,
output)
11. New test cases added per release (quality, outcome)
12. Defect cycle time; we want to reduce the time from defect detection to defect fix. This
not only improves the business experience, but reduces the code written on top of faulty code, and
ensures issues are fresher in developers' minds and faster to fix.
How does NESMA's state of the practice relate to the "What to measure?" list? The state of
the practice is a general set of metrics used in Agile environments. The "What to measure?" list builds on
more general measures, such as the number of stories, but then qualifies them by a particular attribute or
event, such as the number of stories reworked.
2.6 Metric Trends are Important
Almost every article reinforces the fact that trending is much more important than any specific data
point [FOWL]. Used as a target, a metric is only a means to communicate a goal. In most cases the target is
an arbitrary number, and excessive amounts of time are spent determining its value or working to
move toward it. When an attribute such as quality is turned into a number, the result is highly
interpretive, and any figure is relative and arbitrary. As Martin Fowler points out, there is a significant
difference between code coverage at 5% and at 95%, but what about between 94% and 95%? A target
value such as 95% informs developers when to stop, but what if that additional 1% requires a significant
amount of effort to achieve? Should the extra effort be expended, and does it bring additional value to the
product? Focusing on trends provides feedback on real data and creates an opportunity for a reaction.
Moving in either direction, a team can ask what is causing the change. A trend analysis produces actions
earlier than concentrating on an individual number. Arbitrary absolute numbers can create a feeling of
helplessness, especially when events outside of a team's control prevent progress. Trends focus on
moving in the right direction rather than being hostage to external barriers. Since trends are important,
Agile reporting should use shorter reporting periods to allow more opportunity to react and change.
With any type of Agile methodology, it is important to reinforce lean and Agile principles, such as
concentrating on working software, where numbers are not as important as trends. The project is already
collecting velocity and burn-down numbers. It is simple: the more user requirements delivered to the
customer, the greater the functional completeness (enhancement rate). The benefit that users receive
from the software increases with the degree of its functional completeness. The delivery
rate of user requirements is also considered to be the throughput of the software development process.
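One hedged way to operationalize trend-watching is to fit a least-squares slope to a short metric series and react to its sign and magnitude rather than to any single data point; a sketch (the coverage series is invented for illustration):

```python
def trend_slope(values):
    """Least-squares slope of a metric series sampled once per sprint.
    The sign indicates the direction of the trend."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

coverage = [78.0, 79.5, 80.1, 81.0, 82.4]  # code coverage % per sprint
print(trend_slope(coverage) > 0)  # True: coverage is trending upward
```

A change of sign, or a sudden change in slope, is the kind of early signal the trend-over-target argument above calls for.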
2.7 Defect Measurement
In software, the narrowest sense of product quality is commonly recognized as a lack of defects or bugs in
the product. The number of delivered defects is known to be an excellent predictor of customer satisfaction,
thus it is important to uncover trends in the defect removal processes. Using this viewpoint, or scope,
three important measures of software quality are:
• Defect potential: the number of injected defects in software systems, per size attribute.
• Defect removal efficiency: the percentage of defects removed before releasing the software to intended users.
• Delivered defects: the number of released defects in the software, per size attribute.
Defect potential refers to the total quantity of bugs or defects that will be found in five software artifacts:
requirements, design, code, documents, and “bad fixes” or secondary defects. Defect potentials vary with
application size, and they also vary with the class and type of software. Defect potentials are calibrated
through function points. Organizations with combined defect removal efficiency levels of 75% or less
can be viewed as exhibiting professional malpractice. In other words, they are below acceptable levels for
professional software organizations.
Testing alone is insufficient for optimal defect removal efficiency. Most forms of testing achieve only
about 35% defect removal efficiency (DRE), and seldom top 50%. DRE is defined as
DRE = E / (E + D)
where E is the number of errors found before delivery of the software to the end user and D is the number
of defects found after delivery. To achieve a high level of cumulative defect removal, many forms of
defect removal need to be combined. In a blog post (https://www.linkedin.com/grp/post/2191046-
105467445), Capers Jones provides an analysis in which he revisited 21 famous software problems, such as
the Therac-25 radiation overdoses, the Wall Street crash of 2008, the McAfee anti-virus bug of 2010, the
Knight Capital stock trade problem of August 2012, and others. All of these systems had been tested, yet
none of these famous software problems could have been found through testing alone. He suggests that pre-test
inspections and static analysis would have found most of them. Below are the defect removal efficiency rates for
various methods, based on commercial applications.
Measuring Defect Removal Efficiency [BLAC]

Patterns of Defect Prevention and Removal Activities

Prevention Activities
  Prototypes            20.00%

Pretest Removal
  Desk checking         15.00%
  Requirements review   25.00%
  Design review         45.00%
  Document review       20.00%
  Code inspections      50.00%
  Usability labs        25.00%
  Subtotal              89.48%

Testing Activities
  Unit test             25.00%
  New function test     30.00%
  Regression test       20.00%
  Integration test      30.00%
  Performance test      15.00%
  System test           35.00%
  Field test            50.00%
  Subtotal              91.88%

Overall Efficiency      99.32%
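The subtotals above follow from combining stage efficiencies multiplicatively: assuming defects escaping one activity flow into the next, the cumulative efficiency is one minus the product of the per-stage escape rates. The sketch below reproduces the table's figures; the `dre` function implements the E / (E + D) definition given earlier.

```python
def dre(pre_release_defects, post_release_defects):
    """Defect removal efficiency: E / (E + D)."""
    return pre_release_defects / (pre_release_defects + post_release_defects)

def cumulative_efficiency(stage_efficiencies):
    """Fraction of defects caught when stages are applied in sequence:
    1 minus the product of each stage's escape rate (1 - e)."""
    escape = 1.0
    for e in stage_efficiencies:
        escape *= (1.0 - e)
    return 1.0 - escape

pretest = [0.15, 0.25, 0.45, 0.20, 0.50, 0.25]        # desk check .. usability labs
testing = [0.25, 0.30, 0.20, 0.30, 0.15, 0.35, 0.50]  # unit test .. field test
print(round(cumulative_efficiency(pretest) * 100, 2))                     # 89.48
print(round(cumulative_efficiency(testing) * 100, 2))                     # 91.88
print(round(cumulative_efficiency([0.20] + pretest + testing) * 100, 2))  # 99.32
```

Note that the pretest subtotal excludes the prototypes (prevention) row, while the overall figure includes it; this combination rule is an inference, but it matches all three published subtotals.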
Defect tracking and its analysis have traditionally been used to measure software quality throughout the
lifecycle. However, in Agile methodologies, it has been suggested that pre-production defect tracking
may be detrimental to software teams [TECH]. Many suggest that pre-production tracking makes it
difficult to determine a true value of software quality. Pre-production defects (especially those found
by QA) are nevertheless still important to track; however, the focus of the activity should shift to
prevention. All defects should be measured by phase of origin (requirements, design, code, user
documents and bad fixes) so that their causes, and ways to improve the process, can be identified. As
previously stated, for more than 40 years customer satisfaction has had a strong correlation with the
volume of defects in applications when they are released to customers. Released defect levels are a
product of defect potentials and defect removal efficiency. Jones and Bonsignour observe that the Agile
community has not yet done a good job of measuring defect potentials, defect removal efficiency,
delivered defects or customer satisfaction [JONEa]. A development group that does not reach a defect
removal efficiency of 85% or above will not have a good customer satisfaction rating.
For defects, identify the areas in the code that have the most bugs, the length of time needed to fix bugs and the
number of bugs each team can fix during a given time span. Track the bug opened/closed ratio to
determine whether more bugs are being uncovered than in previous iterations or whether a team is falling behind. This
may indicate a need to review and fix rather than attempting to deliver a new feature. Determine the
reasons for any changes in trend. Collect post-sprint defects, QA defects, and post-release defect arrivals.
Perform a root cause analysis; for example, determine why a particular defect escaped from testing or
why a defect was injected into the code. It is especially insightful when a defect count or trend is
matched with a QA activity.
To compare teams, systems or organizations, defect density (the number of bugs per LOC or
another size metric such as function points) is used. Defect density is a recognized industry standard and a
best practice. Current defect density numbers can be compared against data retrieved from organizations
such as Gartner or the International Software Benchmarking Standards Group (ISBSG), normally for a fee.
The true defect density is not known until after the release, and for this reason Microsoft uses code churn
to predict defect density. Moreover, incident data must be filtered to isolate true defects. Incidents can be
labeled in the data store as: Change Request Agreed, Deferred, Duplicate, Existing Production Incident,
Merged with another Defect, No Longer an Issue, Not a Fault, Not in Scope of Project, Resolution
Implemented, Referred to another Project, Third Party Fix, Risk Accepted by the Business, Workaround
Accepted by the Business, or other customized exceptions. There are also problems associated with
comparing defect density against outsiders. Every tool has its own definition of size (LOC), it is easy for
projects to add more code to make the LOC metric look better, and comparisons between programming languages
are meaningless without an agreed-upon LOC equivalency table. Moreover, source code is not always
available, as with third-party applications. Therefore, benchmarking against yourself may be the most
effective approach.
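Defect density itself is a simple ratio; a sketch normalizing to KLOC for internal benchmarking (the counts are invented):

```python
def defect_density(defects, loc):
    """Defects per thousand lines of code (KLOC)."""
    return defects / (loc / 1000.0)

# 46 defects logged against a 115,000-line release
print(defect_density(46, 115_000))  # 0.4 defects per KLOC
```

The caveats above still apply: the number is only comparable across releases if the LOC counting rules and the incident-to-defect filtering stay constant.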
Figure 9 compares the Agile and Waterfall defect removal rates in Hewlett-Packard projects [SIRI]. The Agile
process has a steadier defect removal rate throughout the lifecycle, while the Waterfall process
displays a late peak followed by a gradual decline. This information was collected from two different product
releases created by the same team but using two different processes. Note that it is easier to compare and
observe defect trends in the Agile project and, therefore, easier to introduce modifications.
Defects are important to study. A Special Analysis Report on Software Defect Density from the ISBSG
reveals useful information about defects in software, both in development and in the initial period after a
system has gone into operation:
• The split of where defects are found, i.e. in development or in operation, seems to follow the
80:20 rule. Roughly 80% of defects are found during development, leaving 20% to be found in
the first weeks of the system's operation.
• Fortunately, in the case of extreme defects, less than 2.5% were found in the first weeks of
the system's operation.
• Extreme defects make up only 2% of the defects found in the Build, Test and Implement tasks of
software development.
• The industry hasn't improved over time. Software defect densities show no changing trend over
the last 15 years.
Figure 9: Agile to Waterfall Defect Removal in Hewlett-Packard Projects
Maintainability is part of every quality model. Especially in light of DevOps development, which
stresses testing in its process, maintainability is an important quality characteristic. A system will be
around for a long time, and the traditional assumption that existing systems will decay and become more
difficult and expensive to maintain should be avoided. The system must deliver new and better services
at a reasonable cost. As new features are added to the system, they must be testable. The symptoms of poor
testability are unnecessary complexity, unnecessary coupling, redundancy, and failure to relate the software
model to the physical model. When these conditions exist, they also make automated testing more
difficult. In the presence of these symptoms, tests either do not get created or have a lower probability of
being executed because of the effort and time commitment. Developers cannot be assured that the system
delivers the intended value if tests do not exist or are not executed.
The process has a prevailing influence over the quality of the code. One of the myths of Agile is that an
iterated set of user stories will emerge with a coherent design requiring at most some refactoring to factor
out commonalities. In practice, these stories do not self-organize. Experience shows that as new
functionality is added, a system tends to become more complex; hence the law of increasing
entropy. Refactoring and technical debt are inextricably linked in the Agile space. Refactoring
is a method of removing or reducing the presence of technical debt. However, not all technical debt is a
direct refactoring candidate; technical debt can stem from documentation, test cases or any deliverable.
Developers, product managers and researchers disagree about what constitutes technical debt. The
simplest definition found is that technical debt is the difference between what was promised and what was
actually delivered, including technical shortcuts made to meet delivery deadlines. An easy way to
document technical debt is to raise an issue within the project management system (e.g., Jira). The issue
can be documented with different priorities, such as those that block future functionality or hamper
implementation. If an item is small, it can be added to a sprint once the sprint's main focus has been
completed. This bookkeeping helps monitor technical debt. The process of refactoring must be
incorporated, and as stated previously, reviews are better at uncovering evolvability defects.
Technical debt has become a popular euphemism for bad code. This debt is real, and we incur it both
consciously and accidentally. Static analysis alone cannot fully calculate debt, and it may not always be
possible to pay (fix) debt in the future. Modules are built on top of the original technical debt, which
creates dependencies that eventually become too ingrained and too expensive to fix. Some researchers
suggest that there are seven deadly sins of bad code, each one representing a major axis of quality
analysis: bad distribution of complexity, duplication, lack of comments, coding rule violations,
potential bugs, no unit tests (or useless ones) and bad design. Many of these can be mitigated with the
proper techniques and tools. The SonarQube default project dashboard tracks and displays each of these
deadly sins.
Study after study has shown that poor requirements management is the leading cause of failure for traditional
software development teams. When it comes to requirements, Agile software developers typically focus
on functional ones that describe something of value to end users—a screen, report, feature, or business
rule. Most often these functional requirements are captured in the form of user stories, although use cases
or usage scenarios are also common, and more advanced teams will iteratively capture the details as
customer acceptance tests. Over the years, Agilists have developed many strategies for dealing with
functional requirements effectively, which is likely one of the factors leading to the higher success rates
enjoyed by Agile teams. Disciplined Agile teams go even further, realizing that there is far more to
requirements than just this, and that we also need to consider nonfunctional requirements and constraints.
Although Agile teams have figured out how to effectively address functional requirements, most are
struggling with nonfunctional requirements.
The definition of software quality is very diverse, as seen in Figures 5 and 6. However, it is widely
accepted that a project with many defects lacks quality. The simplest measure of assessing software
quality is the frequency of critical or blocker bugs discovered post-release, as outlined in the
Zeroturnaround survey in Section 1.2.2. The problem with this assessment is that quality
is measured after the fact. It is not acceptable to postpone the assurance of software quality until after
release and, as outlined several times, the cost of removing bugs only increases over time. Is there a direct
method to quantify quality pre-release? There is no single metric that defines good versus bad software.
Software quality can, however, be measured indirectly from attributes associated with producing quality software.
From the Zeroturnaround survey, seven key traits of high-performing DevOps cultures, those
environments that produced good-quality systems on time, were outlined. Five of the seven are
characteristics that can be adopted and measured directly: these high-quality-producing environments
deploy daily, handle non-functional requirements during every sprint, exploit automated testing, mandate
strict version control and perform peer code review. The remaining two traits of successful DevOps
teams, implementing end-to-end performance monitoring and metrics and allocating
more cycle time to the reduction of technical debt, are much more challenging. What can be measured is
code churn, static analysis findings, test failures, coverage, performance, bugs and bug arrival rates, and an
indication of size.
Any software metric can be criticized for its effectiveness, especially if one is searching for a silver
bullet. All metrics are somewhat helpful; each measures a particular attribute of the software.
Also, the manner in which they are applied may not be perfect: many practitioners use a metric for a
purpose for which it was never intended. The original purpose of the McCabe metric (also known as cyclomatic
complexity) was to measure the effort required to develop test cases for a component or module. Every piece
of code contains sections of sequence, selection and iteration, and this metric quantifies the number of
linearly independent paths through a program's source code. To perform the basis path testing proposed by
McCabe, each linearly independent path through the program must be tested; in this case, the number of
test cases will equal the cyclomatic complexity of the program. Therefore, a module with a higher
McCabe value requires more testing effort than a module with a lower value, since the higher value
indicates more pathways through the code. A higher value also implies that a module may be more
difficult for a programmer to understand, since the programmer must understand the different pathways
and the results of those pathways.
For example, a cyclomatic complexity analysis can have problems stemming from recursion and fall-through.
If one of the project goals is good performance, recursion should be avoided. Fall-through,
where one component passes control down to another such that there is no single entry/exit point, also
affects the metric. Les Hatton claimed (keynote at TAIC-PART 2008, Windsor, UK, Sept 2008) that
McCabe's cyclomatic complexity number has the same predictive ability as lines of code. The selected
threshold for cyclomatic complexity is based on categories established by the Software Engineering
Institute, as follows:

Cyclomatic Complexity    Risk Evaluation
1-10                     A simple module without much risk
11-20                    A more complex module with moderate risk
21-50                    A complex module of high risk
51 and greater           An untestable program of very high risk
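The SEI categories above can be applied mechanically; a sketch (the function names are illustrative), using the common approximation that for a single-entry, single-exit routine the cyclomatic complexity equals the number of binary decision points plus one:

```python
def cyclomatic_complexity(decision_points):
    """McCabe's V(G) for a single-entry, single-exit routine:
    number of binary decision points plus one."""
    return decision_points + 1

def sei_risk(v):
    """Map a cyclomatic complexity value to its SEI risk category."""
    if v <= 10:
        return "simple, without much risk"
    if v <= 20:
        return "more complex, moderate risk"
    if v <= 50:
        return "complex, high risk"
    return "untestable, very high risk"

# A routine with 14 decision points has V(G) = 15
print(sei_risk(cyclomatic_complexity(14)))  # more complex, moderate risk
```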
2.8 Defects and Complexity Linked
In the section “High Risk Code and High Risk Changes”, one of the bullets suggests that code that is
complex is high risk. Identifying the most complex code and monitoring it to determine the rate of
change assists developers in deciding where to focus efforts in review, testing and refactoring. Software
complexity encompasses numerous properties, all of which affect the external and internal interactions of
the software. Higher levels of complexity in software increase the risk of unintentionally interfering with
interactions and increase the chance of introducing defects, either when creating the software or when making
changes to it. Many measures of software complexity have been proposed. Perhaps the most common
is McCabe's cyclomatic complexity metric, a measure of the number of linearly independent paths
through a routine's control flow. Using cyclomatic complexity
by itself, however, can produce the wrong results, because numerous other properties
introduce complexity, not just the control flow of the software.
Another important perspective comes from understanding the change in the complexity of a system over
time. It is important to identify components that:
• cross a defined threshold of complexity and are thus candidates for attention
• suddenly change in complexity and are forecast to continue with that trend
• increase in complexity where they were not expected to, as a possible indication of poor
programming or design
It is important to manage control flow code complexity for testability. The completeness of test plans is
often measured in terms of coverage. There are several levels or dimensions of coverage to consider:
• Function, or subroutine, coverage – measures whether every function or subroutine has been
tested
• Code, or statement, coverage – measures whether every line of code has been tested
• Branch coverage – measures whether every case for a condition has been tested, i.e., tested for
both true and false
• Loop coverage – measures whether every case of loop processing has been tested, i.e., zero
iterations, one iteration, many iterations
• Path coverage – measures whether every possible combination of branch coverage has been
tested
Large programs can have huge numbers of paths through them. A program with a mere n = 20 control
point statements (IF, FOR, WHILE, CASE) can have over one million different paths through it (paths =
2^n). Removing redundant conditions and organizing necessary conditions in the simplest possible way
helps minimize control flow complexity, and thus minimizes both the probability of defects and the
required testing effort.
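The contrast between the 2^n worst case and the n + 1 linearly independent paths of basis path testing can be checked directly (a sketch; the function names are illustrative):

```python
def worst_case_paths(decision_points):
    """Upper bound on distinct paths through n sequential binary decisions."""
    return 2 ** decision_points

def basis_paths(decision_points):
    """Linearly independent paths to cover under McCabe's basis path
    testing: the cyclomatic complexity, n + 1."""
    return decision_points + 1

print(worst_case_paths(20))  # 1048576 -- over one million paths
print(basis_paths(20))       # 21 test cases suffice for basis path coverage
```

This gap is precisely why exhaustive path coverage is impractical while basis path testing remains tractable.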
Control flow is not the only aspect of concern when managing testability. Managing the data flow and its
impact on the complexity of the code implementation is also a concern. Several methods exist to measure
the use, organization or allocation of data:
• Span between data references is based on the positions of the data references and the number of
statements between each reference (the span). The larger the span, the more difficult it becomes
for the developer to determine the value of a variable at a particular point, the more likely
defects are, and the more testing is required.
• Particular data can play different roles or usages within different modules or the same module. These
roles are: input needed to produce a module's output; data changed or created within a module;
data used for control; and data passing through unchanged. Researchers have observed that each
type of data usage contributes to complexity in a different amount, with data used for control
contributing the most. By considering these data flow complexity factors when designing the
program code, the ultimate testability and quality of the program can be increased.
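As a sketch of the span measure described above (representing each statement by the set of variables it references is an assumption made for illustration):

```python
def reference_spans(statement_vars, name):
    """Spans for one variable: the number of statements between each pair of
    successive references, given per-statement referenced-variable sets."""
    refs = [i for i, vars_ in enumerate(statement_vars) if name in vars_]
    return [b - a - 1 for a, b in zip(refs, refs[1:])]

# 'total' is referenced in statements 0, 4 and 5
stmts = [{"total"}, {"x"}, {"y"}, {"x", "y"}, {"total", "x"}, {"total"}]
print(reference_spans(stmts, "total"))  # [3, 0]
```

A long span (here, the 3 statements between the first two references) is the signal that a reader must hold the variable's value in mind across unrelated code, which is where defects tend to creep in.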
After identifying these complex parts, developers can:
• Remove hard coding.
• Revisit other design aspects and determine whether the design needs to be upgraded.
• Hold a managed code walkthrough to inspect the code for defects.
• Refactor the section of code to simplify it; possibly break it into smaller, more manageable and
more testable pieces.
• Seek alternative design solutions that avoid those parts.
• Adjust the programmer resource plan to place the most reliable programmers on those
challenging programs.
• Allow additional time and resources for more extensive testing.
The "Pareto Principle", more commonly known as the "80/20" rule, is a relation that describes causality
and results. It claims that roughly 80% of output is a direct result of about 20% of the input. It is
generally known that 80% of the problems are located in 20% of the code. This phenomenon was also
observed in the ISBSG study. A question all developers would like answered is: where is
the risk? Which component or module is vulnerable or defect-prone? To assist in this quest, many
software fault prediction models have been proposed. These models use various sets of metrics,
both static and dynamic, to predict software fault-proneness. The problem is that each metric only partially
reflects the many aspects that influence software fault-proneness. Even though much effort has been
directed at this problem, none of the software fault prediction techniques has proven to be consistently
accurate [BISH]. It is known that no single metric can predict bugs, and that testing demonstrates the presence
of bugs but does not prove their absence. We have also found that enhanced (product and process) data
is helpful in prediction.
We outlined the importance of linking the metrics to the organization or project objectives to demonstrate
achievement of those objectives. There are other important factors that must also be considered during
metrics adoption. In an Agile environment, much of the decision-making is delegated to the development
teams. Development teams require software measurement information to assist them during their daily
operations. Measurement needs to be integrated into the workflow to provide this assistance and to avoid
measurement becoming a task of simply collecting data.
Capers Jones, who has been collecting software data for more than thirty years, makes this comment
about metrics [JONEe]:
“Accurate measurements of software development and maintenance costs and accurate
measurement of quality would be extremely valuable. But as of 2014, the software industry labors
under a variety of non-standard and highly inaccurate measures compounded by very sloppy
measurement practices. For that matter, there is little empirical data about the efficacy of software
standards themselves.”
Even with the metric inconsistency problems mentioned above, internally consistent metrics are important for internal benchmarks and comparisons. Metrics on size, productivity and quality are the key ones to concentrate on, and consistency is the key to analyzing their value. Metrics are like words in a sentence; together they create sense and meaning. Over-analyzing the data provided by one metric or one set of metrics (such as productivity) can be harmful to other aspects of the software, such as quality.
2.9 Performance, a Factor in Quality
Performance directly translates into utility for the end user. There are different levels of performance as
seen in the properties of Table 1, such as time-behavior, resource utilization and capacity. Ultimately,
performance is about improving user response times and reducing latency throughout the system while retaining functional accuracy and consistency. Performance can be measured through low-level code performance and benchmarking and, at a higher level, by establishing a balance between resources and requirements.
The first focus will be on Java resource utilization and system requirements. In general, there are four
main resources which are the keys to the performance of any executing system. They are the CPU
computing power, the memory (both cache and RAM), the IO/Network and the database. Database
access is separated from IO/Network resources because it can greatly affect performance and is often
responsible for IO bottlenecks. The functional performance requirements can also be placed into four
categories. They are throughput, scalability, responsiveness and latency, and resource consumption.
Throughput is normally rated as the maximum number of concurrent users the system can accommodate
at once. When the number of users grows, how the system responds to the increasing number of requests
is a measure of its scalability. Responsiveness is the length of time the system takes to first respond to a
request and latency is the time required to process the request. Not only must the system serve users; other tasks will also consume resources and influence throughput. In general, most performance problems can be described in terms of one or more of these factors. For example, if speed is the issue, latency is most likely the problem.
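As a concrete illustration of these terms, the sketch below computes responsiveness, latency and a simplified request-rate view of throughput (rather than the maximum-concurrent-users rating described above) from a handful of invented request timestamps; both the numbers and the simplification are assumptions for illustration only:

```python
# Hypothetical request records: (arrival, first_response, completion), in seconds.
requests = [(0.0, 0.05, 0.30), (0.1, 0.12, 0.45), (0.2, 0.28, 0.60),
            (0.3, 0.33, 0.90), (0.4, 0.50, 1.10)]

# Responsiveness: how long the system takes to first respond to each request.
responsiveness = [first - arr for arr, first, _ in requests]
# Latency: time required to process the request after the first response.
latency = [done - first for _, first, done in requests]
# Throughput, simplified here as completed requests per second of wall time.
window = max(done for *_, done in requests) - min(arr for arr, *_ in requests)
throughput = len(requests) / window

print(f"avg responsiveness: {sum(responsiveness) / len(responsiveness):.3f} s")
print(f"avg latency:        {sum(latency) / len(latency):.3f} s")
print(f"throughput:         {throughput:.2f} completed requests/s")
```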
Performance tuning requires effort. In a survey conducted by a tool vendor [PLUM], 76% of the respondents reported struggling most with gathering, making sense of, reproducing, and linking to the root cause the evidence required for performance analysis. Figure 10 presents a pie chart identifying
the most time-consuming part of the process [ZERO]. The survey also asked how long it took to finally
detect the root cause of a performance issue. The average time to find and fix a root cause was 80 hours.
There are three major types of performance tools to identify performance problems or to optimize
performance. They fall into tool categories that monitor, profile or test. Java profiling and monitoring
tools measure and optimize performance during runtime. Performance testing identifies areas of heavy
resource utilization. There are many Java monitoring tools or Application Performance Management
(APM) tools. The issue with monitoring is that many production environments are a complex mixture of
services very carefully balanced to work together. Plus, with applications shifting to the cloud and
dramatically different enterprise architectures, APM tools are challenged to provide real performance
benefits across systems with virtual perimeters. A blog post from profitbricks dated June 10, 2014 identifies 38 APM tools (https://blog.profitbricks.com/application-performagement-tools/). This short list contains some of the most comprehensive APM tools available; tool #5 is Compuware APM. A Zeroturnaround article, which favors New Relic APM for web applications, also lists this tool for complex applications; there it appears under the name Dynatrace, reflecting a recent name change. This APM has the largest market share according to a Gartner report [GART]. One of
Dynatrace’s selling points is that it eliminates false positives and erroneous alerts, thereby reducing the
cost of deploying and managing the application. Another tool in the same category, mentioned in both
sources, is AppDynamics. It is listed as tool #2 in profitbricks, and the basic tool is free while the pro
tool has a cost.
Figure 10: Most Time-Consuming Part of Performance Tuning
Monitoring tools that identify memory leaks, garbage collection inefficiencies and locked threads can also
be used. These are less powerful, usually work through the JVM, and are less costly. One such tool is
Plumbr, which runs as a Java agent on the JVM. It is used as an overall monitoring tool. Java Mission
Control is a performance monitoring tool by Oracle and is free. It has a nice, simple, configurable
dashboard for viewing statistics of many JVM properties.
Many of the monitoring tools assist in identifying when a performance problem exists. Engineers must
then find the cause and eliminate the issue. There are many tools or sources used for this evidence
gathering phase. Many engineers use the application log or heap and thread dumps as evidence. JVM tools such as jconsole, jmc, jstat and jmap can be used. On average, an engineer uses no fewer than four different tools to gather enough evidence to solve a performance problem. Other specialized tools, such as those offered by JClarity (Illuminate and Censum), can be used to identify the problem. Illuminate is a
performance monitoring tool, while Censum is an application focused on garbage collection logs
analysis. Takipi was created for error tracking and analysis, informing developers exactly when
and why production code breaks. Whenever a new exception is thrown or a log error occurs, the Takipi
software captures the exception and shows the variable state which caused it, across methods and
machines. Takipi will lay this information over the actual code which executed at the moment of the error
so developers can analyze the exception as if they were there when it happened.
Code profilers gather information about low-level code events to focus on performance questions. Many
of them use instrumentation to extract this information. A popular and frequently mentioned profiler is YourKit, one of the most established leaders in Java profiling. Another profiler is JProfiler, which can display a call graph in which methods are represented by colored rectangles, providing instant visual feedback about where slow code resides.
Application monitoring tools point out problems in performance, profilers provide low-level insight and highlight individual parts, and performance testing tools tell us whether a new solution is better than the previous one. Apache JMeter is an open source Java application for load testing functional behavior and measuring performance. The Zeroturnaround article [ZERO] includes a section labeled “Performance Issues in action” in which the reader is led through an application (Atlassian Confluence) using these performance tools. It uses JMeter to create and gather profile data, YourKit to display and analyze the profile data, and XRebel to diagnose other, mostly HTTP, performance issues. It is an excellent exercise for
those not familiar with these types of tools.
Teams are constantly delivering code. SonarQube can be used to analyze the frequency of changes, the
size of changes, and to correlate this information with error data to assist in understanding whether the
code is being changed too much or too quickly to be safe. Another metric that measures change is code
churn. Code churn is a measure of the amount of code change taking place within a software unit over
time. It is easily extracted from a system’s change history, as recorded automatically by a version control
system. Code churn has been used to predict fault potential where large and/or recent changes contribute
the most to fault potential. Microsoft uses code churn as an early prediction of system defect density
using a set of relative code churn measures that relate the amount of churn to other variables such as
component size and the temporal extent of churn [NAGA].
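A minimal sketch of one relative churn measure, in the spirit of [NAGA], is shown below; the single ratio and the change-history numbers are illustrative assumptions, not Microsoft's actual (richer) measure set:

```python
def relative_churn(added, deleted, modified, component_loc):
    """Churned lines normalized by component size: one simple relative measure."""
    return (added + deleted + modified) / component_loc if component_loc else 0.0

# Hypothetical totals extracted from a version control change history:
# component -> (lines added, lines deleted, lines modified, component LOC)
history = {"billing": (400, 250, 300, 5000),
           "gateway": (40, 10, 25, 8000),
           "ui":      (900, 600, 700, 4000)}

# Rank components by relative churn; high churn suggests high fault potential.
ranked = sorted(history, key=lambda c: relative_churn(*history[c]), reverse=True)
for name in ranked:
    print(f"{name}: relative churn = {relative_churn(*history[name]):.3f}")
```

With these invented figures, "ui" ranks first: it has churned more than half its own size.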
2.10 Security, a Factor in Quality
If you think Java is relatively safe, just think about the yearly security reports beginning in 2010. Java
became the main vehicle for malware attacks in the third quarter of 2010, when the attacks increased 14-
fold, according to Microsoft's Security Intelligence Report Volume 10 [MICR]. In 2012, Kaspersky Lab,
a leading anti-virus company, labeled it the year of Java vulnerabilities. Kaspersky reported that in 2012,
Java security holes were responsible for 50% of attacks while Windows components and Internet
Explorer were only exploited in 3% of the recorded incidents [KASP]. Cisco's 2014 Annual Security
Report puts the blame on Oracle's Java as a leading cause of security woes and reported that Java
represented 91% of all indicators of compromise in 2013 [CISC]. Perhaps the main reason Java is such a
target is the same reason why it is popular with enterprises and developers; namely, it is portable and
works on any operating system. Moreover, patching a large Java application, such as the JRE, is difficult
and there is the possibility that the patch could break the functionality within the application.
Why focus on application security? Estimates from reliable sources report that anywhere from 70% to
90% of security incidents are due to application vulnerabilities. Moreover, only security built into the application itself stands a chance of preventing sophisticated attacks.
A report from the SANS Institute “2015 State of Application Security: Closing the Gap” can provide a
current general view of application software security [SANSa]. The report was driven by a survey given
to 435 qualified respondents answering questions about application security and its practices.
Respondents were divided into builders and defenders, with 35% being builders and 65% defenders. The
most interesting and important of the questions focused on security standards, the shift of security
responsibilities within development, the list of current practices, the risk of third party applications and
the rate of repairs using secure development life-cycle practices. These topics will be discussed in
Sections 2.10.1 through 2.10.5.
2.10.1 Security Standards
Many security standards and requirements frameworks have been developed in attempts to address risks
to enterprise systems and the resident critical data. A survey question asked the participants to select the
security standards or models followed by their organization. Ten standards or guidelines were explicitly
listed as seen in Figure 11. The Open Web Application Security Project (OWASP) Top 10, a community-driven, consensus-based list of the top 10 application security risks with versions for web and mobile applications, is by far the leading application security standard among the builders who participated in the survey [OWAS].
Figure 11: Application Security Standards in Use
The survey report provided a few reasons for the overwhelming reliance on the OWASP Top 10. First of
all, the OWASP Top 10 is the shortest and simplest of the software security guidelines to understand
since there are only 10 different areas of concern. Also, most static analysis and dynamic analysis security
tools report vulnerabilities in OWASP Top 10 risk categories, making it easy to demonstrate compliance.
The OWASP Top 10, like the MITRE/SANS Top 25 [MITR], is also referenced in several regulatory
standards. After the OWASP Top 10, much more comprehensive standards, such as ISO/IEC 27034 and
NIST 800-53/64, often required in government work, are used as security guidelines. Fewer institutions
use the more general coding guidelines and process frameworks such as CERT Secure Coding Standards,
Microsoft’s SDL and BSIMM (Building Security In Maturity Model).
The problem with standards and guidelines is that much of the effort has essentially become an exercise in reporting on compliance, which diverts security program resources from the constantly evolving attacks that must be addressed. The National Security Agency (NSA) recognized this problem
and began an effort to prioritize a list of the controls that would have the greatest impact in improving risk
posture against real-world threats. The SANS Institute coordinated the input and formulated the Critical
Security Controls for Effective Cyber Defense [SANSb]. This compilation contains much valuable information, with a strong emphasis on "What Works": security controls for which products, processes, architectures and services in use have demonstrated real-world effectiveness. Section 6
of the Critical Security Controls for Effective Cyber Defense report is directly on Application Software
Security (CSC 6). There are eleven suggestions to implement in CSC 6.
Many of the SANS’ survey respondents (47%) indicated that their application security program needed
improvement. Some organizations rated themselves as above average; however, this may be because the recent slew of security breaches did not directly impact them, perhaps giving them a false sense of confidence.
2.10.2 Shift of Security Responsibilities within Development
The majority (59%) of builder respondents followed lightweight Agile or Lean methods (mostly Scrum),
14% still used the Waterfall method and a smaller percentage followed more structured development
approaches. More of the survey organizations are considering adopting DevOps and SecDevOps practices
and approaches to share the responsibilities for making systems secure and functional among builders,
operations and defenders. These methods are viewed as a radically different way of thinking about and
doing application security. Currently, most organizations pursue secure code externally, through pen testing and compliance reviews. Many concur that defenders need to work collaboratively with builders
and operations teams to embed iterative security checks throughout software design, development and
deployment. The main takeaway is that “builders, defenders and operations should be sharing tools and
ideas as well as responsibility for building and running systems, while ensuring the availability,
performance and security of these systems” [SANSa]. Application security should be everyone’s duty.
Eighty-four percent of successful security breaches have been accomplished through application software
vulnerabilities [PUTM].
Within the development cycle, Agile and application security appear to possess conflicting goals.
Developers work diligently to provide value and meet release deadlines, while security is concerned with
the potential exposure and negative impact that applications can generate for the business and their
customers. Agile developers adopt changes as part of their development environment culture, but even
adding a checkpoint for security could be perceived as an impediment to productivity, especially during a
tight schedule period. Many application builders are also unaware of inherent security issues in their
code. Merely mandating that code be scanned for vulnerabilities and the issues fixed will not create a culture that contributes to secure application development. Developers also need to be educated in best practices for producing secure code. Everyone must contribute to the fine-tuning of the process, determining the
best points to perform security reviews and code scanning functions. “Getting security right means being
involved in the software development process” [MCGR].
The Closing the Gap report [SANSa] listed four important areas to include throughout the development
lifecycle for effective application security. These are:
• “Design and build. Consider compliance and privacy requirements; design security features;
develop use cases and abuse cases; complete attack surface analysis; conduct threat modeling;
follow secure coding standards; use secure libraries and use the security features of application
frameworks and languages.
• Test. Use dynamic analysis, static analysis, interactive application security testing (IAST), fuzzing, code reviews, pen testing, bug bounty programs and secure component life-cycle management.
• Fix. Conduct vulnerability remediation, root cause analysis, web application firewalls (WAF)
and virtual patching and runtime application self-protection (RASP).
• Govern. Insist on oversight and risk management; secure SDLC practices, metrics and
reporting; vulnerability management; secure coding training; and managing third-party software
risk.”
No indication was provided of how to incorporate these into the various development lifecycles. In the Waterfall methodology, there is a one-to-one mapping; in Agile, however, these activities must be adapted to be iterative and incremental.
2.10.3 List of Current Practices
In this section, the focus is on the list of current practices compiled from the builders’ responses (Table
7). Risk assessment is the leading practice for all types of applications except for web applications.
Penetration testing is the second leading practice for internal apps. Currently, applications are the biggest source of data breaches; NIST claims that 90% of security vulnerabilities exist at the application layer. Risk assessment, or analyzing how hackers might conduct attacks, can give developers a better idea of specific defenses. To ensure that an application has no weak points, penetration testing is used.
The article “37 Powerful Penetration Testing Tools for Every Penetration Tester” is a good resource in
identifying the scope and features of current penetration tools [SECU]. Practicing secure coding
techniques is another method to keep applications from getting hacked. The SANS Software Security has
a course designed specifically for Java, DEV541: Secure Coding in Java/JEE: Developing Defensible
Applications, https://software-security.sans.org/course/secure-coding-java-jee-developing-defensible-
applications.
Table 7: Builders’ Application Security Practices
2.10.4 Risk of Third Party Applications
The survey reports that 79% of the builder respondents use open source or third-party software libraries in
their applications. This agrees with a 2012 CIO report that over 80% of typical software applications are
open source components and frameworks consumed in binary form [OLAV]. The CIO report also details
that many organizations regularly download software components and frameworks with known security
vulnerabilities, even if newer, patched versions of the components or frameworks were available. Many
of these contain such well-known vulnerabilities as HeartBleed, ShellShock, POODLE and FREAK. A
thorough assessment must be made when using or procuring applications.
2.10.5 Rate of Repairs
In the survey, 26% of defenders took two to seven days to deploy patches to critical applications in use, while another 22% took eight to thirty days, and 14% needed thirty-one days to three months to deploy patches satisfactorily (Figure 12). Serious security vulnerabilities are important to repair as quickly as
possible. Observing the survey responses, it appears that most respondents need assistance in this effort.
Figure 12: Time to Deploy Patches
Developers require fundamental software security knowledge to understand the vulnerability and fix the
code properly, test for regressions, and build and deploy the fix quickly. Perhaps more importantly, the vulnerability must be analyzed for its root cause, and that cause must be addressed to break a vicious and likely dangerous cycle.
2.10.6 Other Code Security Strategies
There are other useful strategies to assist in developing secure code. Michael Howard provided lessons learned from five years of building more secure code. For security reviews, Microsoft ranks modules (code) by their potential for vulnerabilities and by age [HOWE]. As the code base becomes larger, analysis tools are required, and they can help determine the amount of review and testing to provide. For example, if analyzing one component produces 10 warnings while analyzing a component of similar size yields 100 warnings, the second component is in greater need of review. The output of the analysis can be used to determine overall code riskiness. Microsoft and many other
companies apply tools at check-in time to catch bugs early and run them at fairly frequent intervals to deal with any new issues quickly. They have learned that executing the tools only every few months leads to developers having to deal with hundreds of warnings at a time. For every security vulnerability identified, a root cause analysis is performed. The analyst also determines why an actual issue was not discovered by tools. There are three possible reasons: the tool did not find the vulnerability, the tool found it but mistakenly triaged the issue as low priority, or the tool found the issue and humans ignored it. Such an analysis allows the fine-tuning of tools and their use. A great deal of manual work is involved in security assessment, so automation should be pursued where possible. Build or buy
tools that scan code and upload the results to a central site for analysis by security experts. Some tools can even combine the outputs of different analysis tools.
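The warning-count heuristic above can be sketched as a simple ranking by warning density; the component names and figures below are hypothetical:

```python
def warning_density(warnings, kloc):
    """Static-analysis warnings per KLOC; higher density suggests reviewing sooner."""
    return warnings / kloc if kloc else float("inf")

# Hypothetical components: (warning count, size in KLOC).
components = {"comp_a": (10, 12.0), "comp_b": (100, 12.5), "comp_c": (35, 40.0)}

# Components with similar size but many more warnings rise to the top.
review_order = sorted(components,
                      key=lambda name: warning_density(*components[name]),
                      reverse=True)
print("suggested review order:", review_order)
```

Here "comp_b", similar in size to "comp_a" but with ten times the warnings, is reviewed first, matching the Microsoft example above.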
2.10.7 Design Vulnerabilities
Many security vulnerabilities are not coding issues at all but design issues; therefore, Microsoft mandates threat modeling and attack surface analysis as part of its Security Development Lifecycle (SDL) process. Part of the lessons learned is that "It's essential to build threat models to uncover potential
design weaknesses and determine your software's attack surface. You need to make sure that all material
threats are mitigated and that the attack surface is as small as possible.” Microsoft continues to review for
features in its products that are not secure enough for the current computer environment and deprecates
those deemed to be insecure.
Design-level vulnerabilities are the hardest defect category to handle. Design-level problems accounted for about 50% of the security flaws uncovered during Microsoft's "security push" in 2002 [HOGL].
Unfortunately, ascertaining whether a program has design-level vulnerabilities requires great expertise,
which makes finding such flaws not only difficult, but particularly hard to automate. Examples of design-
level problems include error handling in object-oriented systems, object sharing and trust issues,
unprotected data channels both internal and external, incorrect or missing access control mechanisms,
lack of auditing/logging or incorrect logging, and ordering and timing errors (especially in multithreaded
systems). These sorts of flaws almost always lead to security risk.
Security issues that are not syntactic or code-related (such as business logic flaws) cannot be detected in code and need to be identified through threat modeling and abuse case modeling. Software security practitioners perform many different tasks to manage software security risks, including
• creating security abuse/misuse cases;
• listing normative security requirements;
• performing architectural risk analysis;
• building risk-based security test plans;
• using static analysis tools;
• performing security tests;
• performing penetration testing in the final environment; and
• cleaning up after security breaches.
Three of these tasks are closely linked: architectural risk analysis, risk-based security test planning, and security testing, because a critical aspect of security testing relies on probing security risks. If we hope to secure a system, it is important to also work on the architectural or design risk. Over the last few years, much progress has been made in static analysis and code scanning tools; the same cannot be said of architectural risk analysis. There are some good process frameworks, such as Microsoft's STRIDE model. However, to obtain the kinds of results expected, these models require specialists and are difficult to translate into widespread practice. To assist in developing secure software during the design phase, the
SwA Forum and Working Groups developed a pocket guide [SOFT] which includes the following topics:
• Basic Concepts
• Misuse Cases and Threat Modeling
• Design Principles for Secure Software
• Secure Design Patterns
o Architectural-level Patterns
o Design-level Patterns
• Multiple Independent Levels of Security and Safety (MILS)
• Secure Session Management
• Design and Architectural Considerations for Mobile Applications
• Formal Methods and Architectural Design
• Design Review and Verification
• Key Architecture and Design Practices for Mitigating Exploitable Software Weaknesses
• Questions to Ask Developers
The above activities combined with secure coding techniques will enable more secure and reliable
software.
Section 3: Assessment of Development Methods and Project Data
In this report, five resources were used to provide comparisons/assessments of the development
methodology and/or project data. These are discussed in sections 3.1 to 3.5.
3.1 Namcook Analytics Software Risk Master (SRM) tool. (Estimation report is attached to this
report in Appendix A.)
3.2 A table from a Crosstalk article by Capers Jones modified for NPAC.
3.3 Scoring method of factors in software development in Software Engineering Best Practices,
Capers Jones (also in the excel file as a separate spreadsheet)
3.4 DevOps self-assessment by IBM (See assessment results in Appendix B.)
3.5 The list of “Thirty Software Engineering Issues that have stayed constant for 30 years”
Additional information on software quality is contained in section 1.5 of this report.
3.1 The Namcook Analytics Software Risk Master (SRM) tool
The Namcook Analytics Software Risk Master (SRM) tool predicts requirements size in terms of pages,
words and diagrams. It also predicts requirements bugs or defects and “toxic requirements” which should
not be included in the application. A toxic requirement is one that causes harm later in development
and/or maintenance. Reproduced below from the Namcook website are samples of typical SRM
predictions for software projects of 1,000 function points (Tables 8 and 9) and 10,000 function points (Tables 10 and 11).
Table 8: Metrics for Projects with 1000 Function Points
Requirements creep to reach 1,000 function points = 149
Requirements pages = 275
Requirements words = 110,088
Requirements diagrams = 180
Requirements completeness = 91.44%
Requirements reuse = 25%
Requirements bugs = 169
Toxic requirements = 4
Requirements test cases = 667
Reading days (1 person) = 4.59
Amount one person can understand = 93.27%
Financial risks from delays, overruns = 22.33%
Table 9: 1,000 Function Points Requirements Methods
Interviews
Joint Application Design (JAD)
Embedded users
UML diagrams
Nassi-Schneidewind diagrams
FOG or FLESCH readability scoring
IBM Doors or equivalent
Requirement inspections
Agile
Iterative
Rational Unified Process (RUP)
Team Software Process (TSP)
Table 10: Metrics for Projects with 10,000 Function Points
Requirements creep to reach 10,000 function points = 2,031
Requirements pages = 2,126
Requirements words = 850,306
Requirements diagrams = 1,200
Requirements completeness = 73.91%
Requirements reuse = 17%
Requirements bugs = 1,127
Toxic requirements = 29
Requirements test cases = 5,472
Reading days (1 person) = 35.43
Amount one person can understand = 12.08%
Financial risks from delays, overruns = 42.50%
Table 11: 10,000 Function Points Requirements Methods
Focus groups
Joint Application Design (JAD)
Quality Function Deployment (QFD)
UML diagrams
State change diagrams
Flow diagrams
Nassi-Schneidewind charts
Dynamic, animated 3D requirements models
FOG or FLESCH readability scoring
IBM Doors or equivalent
Text static analysis
Requirements inspections
Automated requirements models
Rational Unified Process (RUP)
Team Software Process (TSP)
In general, “greenfield requirements” for novel applications are more troublesome than “brownfield” requirements, which are frequently replacements for aging legacy applications whose requirements are well known and understood. In total, requirements bugs or defects account for approximately 20% of the bugs
entering the final released application. “Requirements bugs are resistant to testing and the optimal
methods for reducing them include requirements defect prevention and pre-test requirements inspections.
The use of automated requirements models is recommended. The use of automated requirements static
analysis tools is recommended. The use of tools that evaluate readability such as the FOG and FLESCH
readability scores is recommended.” The last quote and the tables were from [JONEc].
Dolores Zage registered as a user on the Namcook.com site and was able to use the SRM demo
application on the website. Figure 13 below is a listing of the input that was selected to produce the
estimation report. Average settings were used for project staffing details, which are an even mix of
experts and novices. For the project scope, standalone PC had to be selected because other settings
caused a PHP error. Interestingly, no size factor was requested as input. With the given knowledge of the project and the limitations of the application, the SRM tool calculated that NPAC would be 465.06 function points, or about 53,330 LOC.
SOFTWARE TAXONOMY AND PROCESS ASSESSMENT REPORT
General Information:
Today's date - 08/18/2015
Industry or NAIC Code - telecommunications
Company - BSU
Country - IN, USA
Project Start Date - 18-AUG-2015
Planned Delivery Date - Unknown
Actual Delivery Date - Unknown
Project Name - numbers
Data Provided By - Dolores
Project Manager - NA
Project Staffing Details:
Project Staffing Schedule - Normal staff; normal schedule
Client Project Experience - Average experienced clients
Project Management Experience - Average experienced management
Development Team Experience - Even mix of experts and novices
Methodology Experience - Even mix of experts and novices
Programming Language Experience - Even mix of experts and novices
Hardware Platform Experience - Even mix of experts and novices
Operating System Experience - Even mix of experts and novices
Test Team Experience - Even mix of experts and novices
Quality Assurance Team Experience - Even mix of experts and novices
Customer Support Team Experience - Even mix of experts and novices
Maintenance Team Experience - Even mix of experts and novices
Project Taxonomy Input:
Project Nature - New software application development
Work Hours per month - 132
Project Scope - Standalone program: PC
Project Class - External program developed under government contract (civilian)
Primary Project Type - Communications or telecommunications
Secondary Project Type - Service oriented architecture (SOA)
Problem Complexity - Majority of avg, a few complex problems, algorithms - 7
Code Complexity - Fair structure with some large modules - 7
Data Complexity - More than 20 files, large complex data interactions - 11
Development Methodology - Agile, Internally Created
Development Methodology Value - 10
Current CMMI Level - Level 4: Managed
Primary Programming Language - Java - 6.00 - 90%
Secondary Programming Language - SQL - 25.00 - 10%
Effective Programming Level - 7.9
Number of maintenance sites - 1
Number of initial client sites - 80
Annual growth of client sites - 15
Number of application users - 1000
Annual growth of application users - 10
Testing Methodologies:
Defect Prevention - QFD; Pretest Removal - Desk Check; Static Analysis; Inspections; Test Removal - Unit; Function; Regression; Component; Performance; System; Acceptance;
Figure 13: SRM Tool Settings
The entire estimation report appears in Appendix A. Based on the pretest removal and test removal
selections made in the tool, the defect removal efficiency was 99.4%, as reported in the estimation report.
3.2 Crosstalk Table
The data for Table 12 come from approximately 600 companies, of which about 150 are Fortune 500
companies. The table divides projects into excellent, average and poor categories. All of the projects
were 1000 function points in size and coded in Java. These data can be extrapolated for comparisons with
NPAC data. For convenience, the data were transferred to an Excel spreadsheet into which the NPAC
data can be inserted. The closer the NPAC data align with the excellent category, the higher the probability
that NPAC can be rated as excellent.
Note: The numbers in parentheses within the table cells are explained in the notes below Table 12.
Table 12: Comparisons of Excellent, Average and Poor Software Results
Topics Excellent Average Poor NPAC
Project Info
Size in function points 1000 1000 1000 (1)
Programming Language Java Java Java Java
Language level 6.25 6.0 5.75 (2)
Source statements per function point 51.2 53.33 55.75 (3)
Certified reuse percent 20% 10% 5% (4)
Quality
Defects per function point 2.82 3.47 4.27 4.95 (5)
Defect Potential 2818 3467 4266 (6)
Defects per KLOC 55.05 65.01 76.65
Defect Removal Efficiency 99% 90% 83%
Delivered Defects 28 347 725
High Severity Defects 4 59 145
Security Vulnerabilities 2 31 88
Delivered per function point .03 0.35 0.73
Delivered per KLOC .55 6.5 13.03
Key Quality Control Methods
Formal estimates of defects YES NO NO
Formal inspections of deliverables YES NO NO
Static Analysis of Code YES YES NO
Formal Test Case Design YES YES NO
Testing by certified test personnel YES NO NO
Mathematical test case design YES NO NO
Project Parameter Results
Schedule in calendar months 12.02 13.8 18.2
Technical staff + management 6.25 6.67 7.69
Effort in staff months 75.14 92.03 139.96
Effort in staff hours 9919 12147 18477
Cost in dollars $751,415 $920,256 $1,399,770
Cost per function point $751.42 $920.26 $1,399.77
Cost per KLOC $14,676 $17,255 $25,152
Productivity Rates
Function points per staff month 13.31 10.87 7.14
Work hours per function point 9.92 12.15 18.48
Lines of code per staff month 681 580 398
Cost Drivers
Bug repairs 25% 40% 45%
Paper documents 20% 17% 20%
Code Development 35% 18% 13%
Meetings 8% 13% 10%
Management 12% 12% 12%
Total 100% 100% 100%
Methods, Tools, Practices
Development Methods TSP/PSP (7) Agile Waterfall
Requirements Methods JAD Embedded Interview
CMMI Levels 5 3 1
Work hours per month 132 132 132
Unpaid overtime 0 0 0
Team experience experienced average inexperienced
Formal risk analysis YES YES NO
Formal quality analysis YES NO NO
Formal change control YES YES NO
Formal sizing of project YES YES NO
Formal reuse analysis YES NO NO
Parametric estimation tools YES NO NO
Inspections of key materials YES NO NO
Accurate status reporting YES YES NO
Accurate defect tracking YES NO NO
More than 15% certified reuse YES MAYBE NO
Low cyclomatic complexity YES MAYBE NO
Test coverage > 95% YES MAYBE NO
Notes on cell contents
1. Function point count (FP)
2. Language level – Capers Jones' language level metric for the programming language; a higher
level means fewer source statements per function point (approximately 320 / level)
3. Source statements (logical source code) per function point
4. Certified reuse percentage
5. See Section 3.2.1, Ranges of Software Development Quality
6. Defect potential – using the commercial application type data: 4.95 defects per function point * FP
7. PSP (Personal Software Process) provides a standard personal process structure for
software developers. TSP (Team Software Process) is a guideline for software product
development teams; it focuses on helping development teams improve their quality
and productivity to better meet cost and schedule goals. (Both are due to Watts Humphrey
and are precursors of DevOps.)
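Several of the Table 12 cells can be derived from one another, which makes the table easy to sanity-check. The following sketch (written for this report; the 320/language-level rule is Capers Jones' convention, and all input numbers are taken from the "Excellent" column of Table 12) recomputes a few derived cells:

```python
FP = 1000           # project size in function points (all Table 12 columns)
WORK_HOURS = 132    # work hours per month (Table 12)

def statements_per_fp(language_level):
    """Jones' convention: source statements per FP ~= 320 / language level."""
    return 320 / language_level

def delivered_defects(defect_potential, dre):
    """Delivered defects = defect potential * (1 - defect removal efficiency)."""
    return defect_potential * (1 - dre)

# Cross-checks against the "Excellent" column of Table 12:
print(round(statements_per_fp(6.25), 1))      # 51.2 source statements per FP
print(round(delivered_defects(2818, 0.99)))   # 28 delivered defects
print(round(FP / 75.14, 2))                   # 13.31 function points per staff month
print(round(75.14 * WORK_HOURS))              # ~9918 staff hours (table: 9919, rounding)
```

The same relations reproduce the "Average" column as well; for example, a defect potential of 3467 at 90% DRE gives 347 delivered defects.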
3.2.1 Ranges of Software Development Quality
Given the size and economic importance of software, one might think that every industrialized nation
would have accurate data on software productivity, quality, and demographics. No such data seem to
exist: there are no effective national averages for any software topic. A national repository of
software quality data would be very useful for comparison, but it does not exist. One reason is that
quality data are more difficult to collect than productivity data. Many individual development tasks
focus on identifying defects, including defects found in requirements and defects identified by static
analysis, desk checking, and testing, and these counts are not always included in the quality data.
Currently, the best data on software productivity and quality tend to come from companies that build
commercial estimation tools and companies that provide commercial benchmark services, all of which
are fairly small. The combined 2014 data from software benchmark groups such as Galorath, ISBSG,
Namcook Analytics, Price Systems, Q/P Management Group, Quantimetrics, QSM, Reifer Associates
and Software Productivity Research cover about 80,000 projects. However, these are competing
companies, and with a few exceptions, such as the recent joint study by ISBSG, Namcook, and Reifer,
the data are not shared or compared and are not always consistent.
Tables 13 and 14 present quality data compiled by Namcook from about 20,000 projects; the values are
approximate averages for software quality, aggregated by application size and by application type.
Table 13: Quality Data Based on Project Size in Function Points
Size        Defect Potential   Removal Efficiency   Defects Delivered
1 1.50 96.93% 0.05
10 2.50 97.50% 0.06
100 3.00 96.65% 0.10
1000 4.30 91.00% 0.39
10000 5.25 87.00% 0.68
100000 6.75 85.70% 0.97
Average 3.88 92.46% 0.37
Table 14: Quality Data Based on Project Type
Type                 Defect Potential   Removal Efficiency   Defects Delivered
Domestic outsource 4.32 94.50% 0.24
IT projects 4.62 92.25% 0.36
Web projects 4.64 91.30% 0.40
Systems/embedded 4.79 98.30% 0.08
Commercial 4.95 93.50% 0.32
Government 5.21 88.70% 0.59
Military 5.45 98.65% 0.07
Average 4.94 93.78% 0.30
The values in Tables 13 and 14 vary by both application size and application type. Many suggest that for
national average purposes the values shown by type are more meaningful than those by size, since very
few applications are larger than 10,000 function points, and these large sizes distort the averages. Viewed
across industries, the 2014 defect potential averages about 4.94 per function point, defect removal
efficiency averages about 93.78%, and delivered defects average about 0.3 per function point. Defect
potentials range from about 1.25 to about 7.50 per function point, and defect removal efficiency ranges
from a high of 99.65% to a low of below 77.00%.
3.3 Scoring Method of Methods, Tools and Practices in Software Development
Software development and software project management offer dozens of methods and hundreds of tools
and practices. Which ones should be used? One approach is to evaluate and rank the many different
factors on a scale. The Excel file containing Table 12 includes another worksheet listing the various
methods and practices scored on a scale from +10 to -10. A score of +10 means a factor is very beneficial
to the quality and productivity of a project; a score of -10 means it is very detrimental. An average value
is given based on the size and type of project. The scoring data stem from observations of about 150
Fortune 500 companies, 50 smaller companies, and 30 government organizations; the negative scores also
draw on data from 15 lawsuits. The actual values are less important than the distribution into the various
categories. Using this method, one can display the range of impact of using the various methods, tools
and practices together.
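As a sketch of how such a scoring sheet can be applied (the factor names and scores below are hypothetical illustrations, not values from the actual spreadsheet), the combined impact of a chosen set of practices is simply the sum of their scores:

```python
# Hypothetical factor scores on the +10 (very beneficial) to -10 (very
# detrimental) scale described in the text; not values from the spreadsheet.
scores = {
    "formal inspections": 9,
    "static analysis": 8,
    "parametric estimation": 7,
    "inadequate defect tracking": -8,
}

chosen = ["formal inspections", "static analysis", "inadequate defect tracking"]
net_impact = sum(scores[factor] for factor in chosen)
print(net_impact)  # 9: the beneficial practices outweigh the detrimental one
```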
3.4 DevOps Self-Assessment by IBM
A self-assessment of DevOps practices was also performed using IBM's online DevOps self-assessment
(http://www.surveygizmo.com/s3/1659087/IBM-DevOps-Self-Assessment). A copy of the questions, the
answers provided and the results are in the file DevOps+Self+Assessment+Results.pdf.
Based on the answers to the assessment questions, each DevOps practice is rated as scaled, reliable,
consistent or practiced (as seen in Figure 14) in the five areas of Design, Construct, Configuration
Management, Build, and Test and Assess Quality.
Figure 14: Levels of Achievement in DevOps Practices
3.5 Thirty Software Engineering Issues that Have Stayed Constant for Thirty Years
In [JONEb], we find the following persistent issues in software engineering:
1. Initial requirements are seldom more than 50% complete.
2. Requirements grow at about 2% per calendar month during development.
3. About 20% of initial requirements are delayed until a second release.
4. Finding and fixing bugs is the most expensive software activity.
5. Creating paper documents is the second most expensive software activity.
6. Coding is the third most expensive software activity.
7. Meetings and discussions are the fourth most expensive activity.
8. Most forms of testing are less than 35% efficient in finding bugs.
9. Most forms of testing touch less than 50% of the code being tested.
10. There are more defects in requirements and design than in source code.
11. There are more defects in test cases than in the software itself.
12. Defects in requirements, design, and code average 5.0 per function point.
13. Total defect removal efficiency before release averages only about 85%.
14. About 15% of software defects are delivered to customers.
15. Delivered defects are expensive and cause customer dissatisfaction and technical debt.
16. About 5% of modules in applications will contain 50% of all defects.
17. About 7% of all defect repairs will accidentally inject new defects.
18. Software reuse is only effective for materials that approach zero defects.
19. About 5% of software outsource contracts end up in litigation.
20. About 35% of projects > 10,000 function points will be cancelled.
21. About 50% of projects > 10,000 function points will be one year late.
22. The failure mode for most cost estimates is to be excessively optimistic.
23. Productivity rates in the U.S. are about 10 function points per staff month.
24. Assignment scopes for development are about 150 function points.
25. Assignment scopes for maintenance are about 750 function points.
26. Development costs about $1200 per function point in the U.S. (range < $500 to > $3000).
27. Maintenance costs about $150 per function point per calendar year.
28. After delivery applications grow at about 8% per calendar year during use.
29. Average defect repair rates are about 10 bugs or defects per month.
30. Programmers and managers need about 10 days of annual training to stay current.
3.6 Quality and Defect Removal
There are various definitions of quality; a common definition in software engineering is conformance
to requirements. However, requirements themselves can have defects, and some requirements are labeled
toxic. There are other "ility" qualities such as maintainability and reliability, but these cannot be measured
directly. This is why quality comes down to the absence of defects, which leaves two powerful metrics for
understanding software quality: 1) software defect potential and 2) defect removal efficiency (DRE). The
phrase "software defect potential" was first used in IBM circa 1970. Defect potential is the total of bugs or
defects likely to be found in all software deliverables, such as the requirements, architecture, design, code,
user documents, test cases and bad fixes. The quality benchmarks for defect potentials on leading projects
are < 3.00 per function point, combined with defect removal efficiency levels that average > 97% for all
projects and 99% for mission-critical projects.
The DRE metric was also developed in IBM in the early 1970s at the same time IBM was developing
formal inspections. The concept of DRE is to track all defects found by the development teams and then
compare those to post-release defects reported by customers in a fixed time period after the initial release
(normally 90 days). If the development team found 900 defects prior to release and customers reported 100
defects in the first three months, then the total volume of bugs was an even 1,000, so defect removal
efficiency is 90%.
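The worked example above can be written directly as a small function (a sketch; the 900 and 100 figures are the ones from the text):

```python
def defect_removal_efficiency(found_before_release, found_after_release):
    """DRE = defects removed before release / total defects, where the
    post-release count comes from a fixed window (normally 90 days)
    after the initial release."""
    total = found_before_release + found_after_release
    return found_before_release / total

# The example from the text: 900 defects found by the development team
# before release, 100 reported by customers in the first three months.
print(defect_removal_efficiency(900, 100))  # 0.9, i.e. 90% DRE
```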
The U.S. average circa 2013 for DRE is just a bit over 85%. Testing alone is not sufficient to raise DRE
much above 90%. To approach or exceed 99% in DRE, it is necessary to use a synergistic combination of
pre-test static analysis and inspections combined with formal testing using mathematically designed test
cases, ideally created by certified test personnel. DRE can also be applied to defects found in other
materials such as requirements and design. Requirements, architecture, and design defects are resistant to
testing and, therefore, pre-test inspections of requirements and design documents should be used for all
major software projects. Table 15 illustrates current ranges for defect potentials and defect removal
efficiency levels in the United States circa 2013 for applications in the 1,000 function point size range:
Table 15: Software Defect Potentials and Defect Removal Efficiency
Defect Origins            Defect Potential   Defect Removal   Defects Delivered   % of Total
Requirements defects 1.00 75.00% 0.25 31.15%
Design defects 1.25 85.00% 0.19 23.36%
Test case defects 0.75 85.00% 0.11 14.02%
Bad fix defects 0.35 75.00% 0.09 10.90%
Code defects 1.50 95.00% 0.08 9.35%
User document defects 0.60 90.00% 0.06 7.48%
Architecture defects 0.30 90.00% 0.03 3.74%
TOTAL 5.75 85.00% 0.80 100.00%
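Each "Defects Delivered" cell in Table 15 equals the defect potential times (1 - removal efficiency), so the table can be recomputed from its inputs. A sketch using only the Table 15 values (the overall DRE implied by these numbers is about 86%, close to the table's rounded 85.00%):

```python
# (origin, defect potential per FP, removal efficiency) -- values from Table 15
rows = [
    ("Requirements",   1.00, 0.75),
    ("Design",         1.25, 0.85),
    ("Test cases",     0.75, 0.85),
    ("Bad fixes",      0.35, 0.75),
    ("Code",           1.50, 0.95),
    ("User documents", 0.60, 0.90),
    ("Architecture",   0.30, 0.90),
]

delivered = {name: potential * (1 - removal) for name, potential, removal in rows}
total_potential = sum(potential for _, potential, _ in rows)
total_delivered = sum(delivered.values())

print(round(delivered["Requirements"], 2))              # 0.25 per FP, as in the table
print(round(total_potential, 2))                        # 5.75 per FP
print(round(total_delivered, 2))                        # 0.8, the table's 0.80
print(round(1 - total_delivered / total_potential, 2))  # 0.86 overall DRE
```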
3.6.1 Error-Prone Modules (EPM)
One of the most important findings in the early 1970s by IBM was that errors were not randomly
distributed through all modules of large systems, but tended to cluster in a few modules, which were
termed “error-prone modules.” For example, 57% of customer reported bugs in the IBM IMS data base
application were found in 32 modules out of a total of 425 modules. More than 300 IMS modules had
zero defect reports. A Microsoft study found that fixing 20% of the bugs would eliminate 80% of system
crashes. Other companies replicated these findings and error-prone modules are an established fact of
large systems.
3.6.2 Inspection Metrics
One of the merits of formal inspections of requirements, design, code, and other deliverables is the suite
of standard metrics that are part of the inspection process. Inspection data routinely includes preparation
effort, inspection session team size and effort, defects detected before and during inspections; defect
repair effort after inspections; and calendar time for the inspections for specific projects. These data are
useful in comparing the effectiveness of inspections against other methods of defect removal such as pair
programming, static analysis, and various forms of testing. To date, inspections have the highest levels of
defect removal efficiency (> 85%) of any known form of software defect removal.
3.6.3 General Terms of Software Failure and Software Success
The terms “software failure” and “software success” are ambiguous in the literature. Here is Capers
Jones’ definition of “success”, where he attempts to quantify the major issues troubling software
[JONEd]: success means
< 3.00 defects per function point;
> 97% defect removal efficiency;
> 97% of valid requirements implemented;
< 10% requirements creep;
0 toxic requirements forced into application by unwise clients;
> 95% of requirements defects removed;
development schedule achieved within + or – 3% of a formal plan;
costs achieved within + or – 3% of a formal parametric cost estimate.
Section 4: Conclusions and Project Take-Aways
We found many excellent suggestions for enabling teams to deliver quality software, but not all things
will work for all teams.
4.1 Process
1. Changing processes leads to differences in software quality.
2. Mixing the distinguishing characteristics of high-performing Agile and DevOps teams can lead to rapid
delivery and maximized outcomes through collaborative performance. (Section 1.1)
3. The more collaborative the process becomes, the easier it is to attain item 2; Agile and DevOps are
based on teamwork and cooperation. (Section 1.1) Make the process visible and available to all teams,
with delivery tasks and trends (metrics) available to every team. Raise awareness of product quality:
everyone is responsible for, and owns, the trends.
4. The key points for high-performing DevOps culture: (Section 1.2.1)
Deploy daily – decouple code deployment from feature releases
Handle all non-functional requirements (performance, security) at every stage
Exploit automated testing to catch errors early and quickly; 82% of high-performing DevOps
organizations use automation [Puppet].
Employ strict version control; version control in operations has the strongest correlation with
high-performing DevOps organizations [Puppet]. Save all products into a software configuration
management (SCM) system, making them readily available, merging contributions by multiple
authors, and determining where changes have been made. Along with an SCM, use a configuration
management system. (Section 1.2.4)
Implement end-to-end performance monitoring and metrics
Perform peer code review.
Use collaborative code review platforms such as Gerrit, CodeFlow, ReviewBoard,
Crucible or SmartBear; review against coding standards first, then apply checklists.
(Section 1.3, 1.3.1)
Apply a separate checklist for security. (Section 1.3.2)
Perform static analysis using tools such as FindBugs (byte code), PMD (source code) and
CheckStyle. Note that SonarQube can take the output from these tools and present it;
SonarQube also provides an indicator of poor design before “human reviews”. (Section 1.3)
Monitor the code review process (Section 1.3.3)
Allocate more cycle time to reduction of technical debt.
Reviews assist in identifying evolvability defects, which make code harder to understand and
maintain. (Section 1.3.4)
Agile requires refactoring. Refactoring and technical debt are linked.
5. The key success word for Agile is continuous: continuous testing, planning, iterations, integration and
improvement, resulting in continuously delivering tested, working software.
6. As test coverage increases, both predictability and quality increase, and automation can promote
greater coverage. (Section 1.2.2.3) A raw code coverage number is only relevant when it is too low;
when it is high it requires further analysis: determine what is not covered and why. Multiple studies
show that about 85% of defects in production could have been detected by simply testing all possible
2-way combinations of input parameters. NIST provides a free testing tool for this, the Advanced
Combinatorial Testing System (ACTS).
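The 2-way (pairwise) coverage idea can be sketched in a few lines: enumerate every pair of parameters and every combination of their values, then check which combinations no test in the suite exercises. The parameter names and test suite below are hypothetical; NIST's ACTS tool performs this analysis, and also generates covering suites, at much larger scale:

```python
from itertools import combinations, product

# Hypothetical input parameters and their possible values.
params = {
    "browser": ["chrome", "firefox"],
    "os": ["linux", "windows", "mac"],
    "tls": ["1.2", "1.3"],
}

def uncovered_pairs(tests):
    """Return every 2-way (parameter-pair, value-pair) combination
    not exercised by any test in the suite."""
    missing = []
    for p1, p2 in combinations(params, 2):
        for v1, v2 in product(params[p1], params[p2]):
            if not any(t[p1] == v1 and t[p2] == v2 for t in tests):
                missing.append(((p1, v1), (p2, v2)))
    return missing

# A tiny (incomplete) test suite: each test fixes one value per parameter.
suite = [
    {"browser": "chrome", "os": "linux", "tls": "1.2"},
    {"browser": "firefox", "os": "windows", "tls": "1.3"},
]
print(len(uncovered_pairs(suite)))  # 10 of the 16 value pairs are still untested
```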
7. Review high risk code and high risk changes for both vulnerabilities and defects. (Section 1.3.6).
8. Integrate QA into the development process; this fosters the collaboration outlined in item 2. (Section 1.5)
9. Groom the product backlog. Many development teams do not have a ready, usable product
backlog. Over 80% of teams have user stories in their product backlog, but less than 10% find
them acceptable. A product backlog in a high ready state can dramatically improve a team's
velocity, by as much as 50%. (Section 1.6)
10. AUTOMATE, AUTOMATE, AUTOMATE …
When the same deployment tools are used for all development and test environments, errors are
detected and fixed early.
Studies have determined that there is no single best tool, underscoring the fact that quality is based
on practices, not on the exact tool. Tools can make a team more productive and collaborative, and
can enforce a practice. (Section 1.2.2.3)
More than 80% of high-performing software development organizations rely on automated tools
for infrastructure management and deployment. Automated testing (checking) is a factor in
quality production environments.
11. Develop a defect prevention strategy.
Log and document defects to provide the key parameters for analysis (root cause ->
preventive actions -> improvement) and measurement.
Defect Removal Efficiency (DRE) must be over 85%, and ideally closer to 95%. (Section 2.7)
From both an economic and a quality standpoint, defect prevention and testing are both necessary
to achieve lower costs, shorter schedules and low levels of defects.
Conduct dynamic appraisals through functional and performance testing. Coverage, coverage
and more coverage.
As of 2015 there are more than 20 forms of testing. The assumed test stages include 1) unit test,
2) function test, 3) regression test, 4) component test, 5) usability test, 6) performance test, 7)
system test, and 8) acceptance test. Most forms of testing have only about a 35% DRE, so at least
8 types of testing are needed to top 80% in overall testing DRE.
Static appraisals can eliminate 40-50% of the coding defects. (Section 1.3)
Defects do not originate only in code; only 35% of the total defect potential originates from the
code. Requirements account for 20%, design 25%, documents 12%, and bad fixes another 8%.
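If successive test stages removed defects independently, the cumulative DRE of a sequence of stages would be 1 - Π(1 - DREᵢ); in practice stages overlap, which is one reason so many stages are needed to top 80%. A sketch of this arithmetic (the independence assumption is ours, for illustration; the per-stage efficiencies are the ones from the Appendix A estimate, whose own cumulative test subtotal is 82% because it also models bad-fix injection):

```python
from math import prod

# Per-stage defect removal efficiencies from the Appendix A estimate,
# unit test through acceptance test.
stage_dre = [0.31, 0.33, 0.12, 0.30, 0.11, 0.34, 0.15]

def cumulative_dre(efficiencies):
    """If stages were independent, the fraction of defects escaping all
    stages is the product of the per-stage escape rates."""
    return 1 - prod(1 - e for e in efficiencies)

print(round(cumulative_dre(stage_dre), 3))  # 0.858 under the independence assumption
```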
4.2 Product Measurements
12. No single metric can provide a complete quality measure and selecting the set of metrics that provides
the essential quality coverage is also impossible.
13. It is important to understand that quality needs to be improved faster and to a higher level than
productivity in order for productivity to improve. The major reason for this link is that identifying and
fixing defects is overall the most expensive activity in software development. Quality leads and
productivity follows. Attempting to improve productivity without first improving quality will not be
successful.
14. If only one quality aspect of development is measured, it should be defects. Defects are at the core of
the customer's and external reviewer’s value perception. Released defect levels are a product of
defect potentials and defect removal efficiency. The phrase "defect removal efficiency" refers to one
of the most powerful of all quality metrics. Fixing bugs on the same day as they are discovered can
double the velocity of a team. (Section 2.7)
15. Collect just enough feedback to respond to unexpected events, and change the process as needed.
Metrics on the number of test runs and passes, code coverage metrics and defect metrics should be
reviewed to ensure that the code is providing the value required. The SonarQube default quality
setting tracks the seven deadly sins in bad code: bad distribution of complexity, duplications, lack of
comments, coding rules violations, potential bugs, no unit tests or useless ones, and bad design.
(Section 2.7)
16. Next, concentrate on performance and security. These qualities are externally visible to customers and
the public. Defective and slow software will make customers demand a new product, and security
problems will lead to headlines in the news. (Section 2.9)
Monitoring the performance of complex applications requires a tool such as Dynatrace.
Use a monitoring tool such as Java Mission Control to find memory leaks, garbage collection
inefficiencies and locked threads.
Use code profilers such as YourKit or JProfiler to identify and remove bottlenecks.
17. Another quality aspect on the radar should be maintainability, a quality attribute listed in hundreds of
quality models. The system will be used and updated for an extended time, and it should not become
more difficult and expensive to maintain. As new services (features) are added, they should be added
at a reasonable cost and should also be testable. Symptoms of poor maintainability are unnecessary
complexity, unnecessary coupling, redundancy, and a software model that does not reflect the actual
or physical model. Note that many of these symptoms already appear among the seven sins of bad
code. (Section 2.7, Section 1.3)
18. Numbers are not as important as trends (Section 2.6).
Acknowledgements
The authors thank Chris Drake, Michael Iacovelli and Frank Schmidt at iconectiv for the valuable insights
and suggestions regarding this work that they shared with us through numerous teleconferences during the
summer of 2015. This research is also supported by the National Science Foundation under Grant No.
1464654.
Appendix A
Namcook Analytics - Estimation Report
Copyright © by Namcook Analytics. All rights reserved.
Web: www.Namcook.com
Blog: http://Namcookanalytics.com
Part 1: Namcook Analytics Executive Overview
Project name: numbers
Project manager: NA
Key Project Dates:
Today's Date 08-18-15
Project start date: 08-18-15
Planned delivery date: 08-17-16
Predicted Delivery date 07-07-16
Planned schedule months: 12.01
Actual schedule months: 10.82
Plan versus Actual: -1.19
Key Project Data:
Size in FP 465.06
Size in KLOC 53.33
Total Cost of Development 375,586.16
Part 2: Namcook Development Schedule
Activity            Staffing   Effort Months   Schedule Months   Project Costs   $ per Funct. Pt.   Wk Hrs per FP
Requirements 1.32 4.70 3.56 $46,994.93 $101.05 1.33
Design 1.76 6.66 3.78 $66,576.16 $143.16 1.89
Coding 3.30 9.53 2.89 $95,340.54 $205.01 2.71
Testing 2.97 7.01 2.36 $70,078.58 $150.69 1.99
Documentation 0.91 1.45 1.60 $14,525.71 $31.23 0.41
Quality Assurance 0.83 1.82 2.19 $18,157.13 $39.04 0.52
Total Project 0.87 6.39 10.82 $63,913.11 $137.43 1.81
3.47 37.56 16.38 $375,586.16 $807.61 10.66
Gross schedule months 16.38
Overlap % 0.66
Predicted Delivery Date 10.82 07-07-16
Client target delivery schedule and date 12.01 08-17-16
Difference (predicted minus target) -1.19
Odds 70% Odds 50% Odds 10%
05-13-16 03-28-16 02-12-16
Features deferred to meet schedule:
Function Pts. (38)
Lines of code (2,008)
% deferred -9.92%

Productivity              FP/Month   LOC/Month   WH/Month
Productivity (+ reuse)    12.38      660.39      10.66
Productivity (- reuse)    10.11      539.07      13.06
Estimates for User Development Activities
User Activities Staffing Schedule Effort Costs $ per FP
User requirements team: 0.72 5.41 3.87 $0 $0.00
User prototype team: 0.58 2.70 1.57 $0 $0.00
User change control team: 0.62 10.82 6.71 $0 $0.00
User acceptance test team: 1.03 1.62 1.68 $0 $0.00
User installation team: 0.81 0.81 0.66 $0 $0.00
0.75 4.27 14.48 $0 $0.00
Number of Initial Year 1 Users: 1,000 12.00
Number of users needing training: 900 0.05 41.86 $0 $0.00
TOTAL USER COSTS
56.34 $0 $0.00
$ per function point $0.00
% of Development 0.00%
Staffing by Occupation
Occupation Normal Peak
Groups Staffing Staffing
1 Programmers 4 5
2 Testers 3 5
3 Designers 1 2
4 Business analysts 1 2
5 Technical writers 1 1
6 Quality assurance 1 1
7 1st line management 1 2
8 Data base administration 0 0
9 Project office staff 0 0
10 Administrative staff 0 0
11 Configuration control 0 0
12 Project librarians 0 0
13 2nd line managers 0 0
14 Estimating specialists 0 0
15 Architects 0 0
16 Security specialists 0 0
17 Performance specialists 0 0
18 Function point specialists 0 0
19 Human factors specialists 0 0
20 3rd line managers 0 0
TOTAL 14
20
Risks Odds
Cancellation 12.81%
Negative ROI 16.23%
Cost Overrun 14.09%
Schedule Slip 17.08%
Unhappy Customers 36.00%
Litigation 5.64%
Average Risk 16.98%
Financial Risk 23.65%
Less than 15% = Acceptable
15% - 35% = Caution
Greater than 35% = Danger
Part 3: Namcook Quality Predictive Outputs
Software Quality
Defect Potentials Potential
Requirements defect potential 380
Design defect potential 365
Code defect potential 572
Document defect potential 79
Total Defect Potential 1,396
Defect Prevention Efficiency Remainder Bad Fixes
JAD 0% 1,396 0
QFD 27% 1,019 10
Prototype 0% 1,029 (0)
Models 0% 1,029 0
Subtotal 26% 1,029 10
Pre-Test Removal Efficiency Remainder Bad Fixes
Desk check 26% 761 21
Pair programming - not used 0% 782 21
Static analysis 55% 361 10
Inspections 88% 45 1
Subtotal 96% 46 53
Test Removal     Efficiency   Remainder   Bad Fixes   Test Cases   Per KLOC   Per FP   Test Scripts
Unit 31% 32 1 480 19 1 66
Function 33% 22 1 522 21 1 69
Regression 12% 20 1 235 9 1 46
Component 30% 14 1 313 13 1 53
Performance 11% 13 0 157 6 0 38
System 34% 9 0 496 20 1 67
Acceptance 15% 8 0 106 4 0 31
Subtotal 82% 8 4 2,308 93 5 144
Defects delivered 8
High severity 1
Security flaws 1
High severity % 14.92%
Deliv. Per FP 0.02
High sev per FP 0.00
Security flaws per FP 0.00
Deliv. Per KLOC 0.34
High sev per KLOC 0.05
Security flaws per KLOC 0.02
Cumulative
Removal Efficiency
99.40%
Defect prevention costs $40,832.31
Pre-Test Defect Removal Costs $71,988.87
Testing Defect Removal Costs $140,837.96
Total Development Defect Removal Costs $253,659.14
Defect removal % of development 67.54%
Defect removal per FP 545.43
Defect removal per KLOC 10,226.87
Defect removal per defect 110.49
Three-year Maintenance Defect Removal Costs 60,769.16
TCO defect removal costs 314,428.29
Defect removal % of TCO 0.25%
Reliability (days to first defect) 29.54
Reliability (days between defects) 198.03
Customer satisfaction with software 96.42%
Part 4: Namcook Maintenance and Cost Benchmarks
Maintenance Summary Outputs for three years
Year of first release 2016
Application size at first release - function points 465
Application growth (three years) - function points 586
Application size at first release - lines of code 24803
Application growth (three years) - lines of code 31245
Application users at first release 1000
Application users after three years 1359233
Incidents after three years 4857
Staff Effort
Cost per Cost per
Three-Year Totals Personnel Months Costs Function Pt. Function Pt.
1,000 1,260
Management 43.93 1581.57 7907862.60 17003.96 13498.29
Customer support 658.43 23703.33 118516657.07 254841.65 202301.52
Enhancement 0.43 15.43 77142.54 165.88 131.68
Maintenance 0.34 12.15 60769.16 130.67 103.73
TOTAL 703.12 25312.49 126562431.37 272142.16 216035.22
Namcook Total Cost of Ownership Benchmarks
Staffing Effort Costs $ per FP % of TCO
at release
Development 3.47 38 $375,586.16 $807.61 Cost per
Maintenance Mgt. 43.93 1582 $7,907,862.60 $17,003.96 6.23%
Customer support 658.43 23703 $118,516,657.07 $254,841.65 93.37%
Enhancement 0.43 15 $77,142.54 $165.88 0.06%
Maintenance 0.34 12 $60,769.16 $130.67 0.05%
User Costs 0.75 14 $0.00 $0.00 0.00%
Total TCO 707.35 25365 $126,938,017.53 $272,949.76 100.00%
Part 5: Additional Data Points
Note: Namcook Analytics uses SRM and IFPUG function points as default values.
This section provides application size in 21 metrics.
Alternate Size Metrics Size % of IFPUG
1 IFPUG 4.3 465 100.00%
2 Automated code based function points 498 107.00%
3 Automated UML based function points 479 103.00%
4 Backfired function points 465 100.00%
5 COSMIC function points 558 120.00%
6 Fast function points 451 97.00%
7 Feature points 465 100.00%
8 FISMA function points 474 102.00%
9 Full function points 544 117.00%
10 Function points light 449 96.50%
11 IntegraNova function points 507 109.00%
12 Mark II function points 493 106.00%
13 NESMA function points 484 104.00%
14 RICE objects 2,591 557.14%
15 SCCQI function points 1,479 318.00%
16 Simple function points 453 97.50%
17 SNAP non functional size metrics 118 25.45%
18 SRM pattern matching function points 465 100.00%
19 Story points 362 77.78%
20 Unadjusted function points 414 89.00%
21 Use-Case points 217 46.67%
Document Sizing
Document Type        Pages    Words    Percent Complete
1 Requirements 192 76,781 94.55%
2 Architecture 46 18,268 93.24%
3 Initial design 223 89,230 87.46%
4 Detail design 379 151,472 88.77%
5 Test plans 76 30,329 91.16%
6 Development Plans 26 10,231 91.24%
7 Cost estimates 46 18,268 94.24%
8 User manuals 184 73,783 94.88%
9 HELP text 88 35,152 95.06%
10 Courses 67 26,973 94.79%
11 Status reports 47 18,887 93.24%
12 Change requests 90 35,952 99.55%
13 Bug reports 496 198,214 92.51%
TOTAL 1,959 783,541 93.13%
Work hours per page - writing 0.95
Work hours per page - reading 0.25
Total document hours - writing 1,860.91
Total document hours - reading 85.01
Total document hours 1,945.92
Total document months 14.74
Total document $ 147,425.42
$ per function point 317.00
% of total development 39.25%
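The totals above can be cross-checked. Assuming SRM-style defaults of 132 work hours per staff-month and roughly $10,000 per staff-month (constants inferred here because they reproduce the figures shown, not stated in the table), the arithmetic works out as:

```python
# Re-derive the documentation effort figures from the per-page rates above.
# Assumed constants (not stated in the table): 132 work hours per staff-month
# and ~$10,000 per staff-month, SRM-style defaults that reproduce the totals.
HOURS_PER_MONTH = 132.0
DOLLARS_PER_MONTH = 10_000.0

total_pages = 1_959
writing_hours = total_pages * 0.95      # ~1,861 h (table: 1,860.91, which
                                        # carries unrounded page counts)
total_hours = 1_945.92                  # writing + reading, from the table
months = total_hours / HOURS_PER_MONTH  # ~14.74 staff-months
cost = months * DOLLARS_PER_MONTH       # ~$147,400
cost_per_fp = cost / 465                # ~$317 per function point
```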
DevOps Practices Self Assessment
Please describe your purpose for completing the assessment.
I only want to self-assess my practices
Enter the contact information in the fields provided. This information will be used to forward the results to you. If you included your IBM representative's information, a copy of your responses will be forwarded to your IBM representative.
Your name
Dolores
Your Email Address
IBM representative name
N/A
IBM representative email
What is your company's industry?
Education
What is the geographic area of your company's primary operations?
North America
Please select the assessment experience you prefer.
I would like to manually select each practice to assess
Please select up to two adoption paths to focus your assessment. The next step will let you select from a list of practices to further refine your assessment questions.
Develop / Test (Design, Construct, Configuration Management, Build, Test, Quality Assessment)
Please select one or more practices from the list to focus your self-assessment.
Design, Construct, Configuration Management, Build, Quality Management, Quality Assessment
Design is focused on producing products during a phase of the project using formal processes for review and measures of completion. The confidence of design activities to ensure scope and requirements are understood for implementation is low, and effectiveness is not measured. A formal method is used to review design products for approval or to improve or correct them.
Partial
Developers work independently and deliver code changes; deliveries are formally scheduled and resource intensive. Integration is a planned event that impacts most development activities in an application or project.
Partial
Code deliveries and integrations are performed periodically using a common process and automation. Integrations are accomplished by individual developers and automated when possible. Coding techniques are available and used inconsistently. Common architecture standards for application coding are defined and trained. Code reviews are effective and manually initiated.
Yes
Coding best practices are defined consistently across technologies and include reviews and automated validation through scanning/testing. Consistent architecture standards are used across the organization and validated in testing and reviews.
Yes
Code changes are collaboratively developed across technologies, applications and teams, continuously. Developers have immediate access to relevant information for code changes to ensure iterative improvements or changes to design, requirements or coding implementation are understood. Standards in coding and reviews are measured across the organization. Best coding and validation practices are trained, used and verified consistently.
Yes
Source control of assets is largely a manual activity that relies heavily on individuals following processes. Performing changes on an asset by more than one team member is only accomplished through locks and access controls. Merging asset changes is accomplished on desktops manually and formally scheduled by a specialized integration team. Applying changes across versions for different releases is performed outside of the configuration management tool or process.
No
Builds are performed manually and automated across projects and environments. Build systems range from a developer's IDE (usually for Dev only) to a formal centralized build server, which is normally used for formal promotions to QA and production. Build management and standards are controlled at the project level. Formal builds are scheduled following formal delivery and integration events to validate application-level integration and application promotion.
No
Informal builds produced by individual developers via their desktop IDE are used for validation but never deployed to environments. A centralized build service is in place that controls the artifacts and processes used in the build. The automated build process includes code scanning and unit/security tests. The build process periodically produces a build of each application under change for testing or verification purposes. Build results are measured and monitored by development teams consistently across the enterprise.
Yes
A daily build of an application under change is promoted to test. Build is provided as a service that supports continuous integration, compilation and packaging by platform. Individual developers, teams and applications regularly schedule automated builds based on changes to source code. A dependency management process is in place for software builds, using a dependency-management repository to trace the standard libraries and provision them at build time.
Yes
All builds could be promoted through the software delivery pipeline, if desired. Each project/team tracks changes to the build process as well as source code and dependencies. Modifying the build process requires approval, so access to the official build machines and build server configuration is restricted where compliance is a factor or where Enterprise Continuous Integration becomes a production system. Build measures are used to improve development and configuration management processes.
Yes
Crash reporting is incorporated into mobile applications to provide basic measures for quality assessment. Crash reporting is used to improve basic stability of the applications.
Yes
Quality reporting is embedded into mobile applications to support user sentiment and usage patterns. Measures are used to drive changes into development teams for usability improvements, defect correction and enhancements.
Yes
Mobile application teams assess quality by monitoring social media, application repositories, and user feedback. Each monitoring source is used to define defects or enhancements to the specific application. Measures are used to determine the impact of application team improvements on user satisfaction.
No
The main objective of testing is to validate that the product satisfies the specified requirements. However, testing is still seen by many stakeholders as the project phase that follows coding.
No
Design: Reliable
Construct: Scaled
Configuration Management: Scaled
Build: Scaled
Test: Reliable
Assess Quality: Scaled
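Stated compactly, the assessment output is a mapping from practice to maturity level. A hypothetical sketch of tallying the results above (the practice names and the two levels come from the assessment itself; the aggregation logic is illustrative and not part of IBM's tool):

```python
# The self-assessment outcome as a practice -> maturity-level map.
from collections import Counter

results = {
    "Design": "Reliable",
    "Construct": "Scaled",
    "Configuration Management": "Scaled",
    "Build": "Scaled",
    "Test": "Reliable",
    "Assess Quality": "Scaled",
}

# Count how many practices sit at each maturity level.
by_level = Counter(results.values())
print(dict(by_level))  # {'Reliable': 2, 'Scaled': 4}
```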
References
[BASI] Basili, V., Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In
Encyclopedia of Software Engineering. Wiley, 1994.
[BISH] Bisht, A., A.S. Dhanoa, A.S. Dhillon, and G. Singh, "A Survey on Quality Prediction of Software
Systems", isems.org.
[BLAC] Black, R., "Measuring Defect Potentials and Defect Removal Efficiency", 2008, http://www.rbcs-
us.com/images/documents/Measuring-Defect-Potentials-and-Defect-Removal-Efficiency.pdf
[CHIL] Childers, B., "Geek Guide: Slow Down to Speed Up, Continuous Quality Assurance in a DevOps
Environment", Linux Journal, 2014.
[CISC] “Cisco 2014 Annual Security Report”, http://www.cisco.com/web/offers/lp/2014-annual-security-
report/index.html
[DEMI] Deming, W.E., Out of the Crisis, MIT Center for Advanced Engineering Study, Cambridge, MA,
1982.
[DYNA] Dynatrace, “DevOps: Hidden Risks and How to Achieve Results”, 2015.
http://www.dynatrace.com/content/dam/en/general/ebook-devops.pdf
[FENT] Fenton, N., and J. Bieman, Software Metrics: A Rigorous and Practical Approach, Third Edition,
2014, CRC Press, Boca Raton, FL.
[FOWL] Fowler, M. “An Appropriate Use of Metrics”, Feb. 2013,
http://martinfowler.com/articles/useOfMetrics.html
[FRAN] P. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In Proc. 6th ACM
SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’98), pages
153–162. ACM Press, 1998.
[GALE] Galen, R., "2 Dozen Wild & Crazy Agile Metrics Ideas", RGalen Consulting.
[GART] Gartner, "Market Share Analysis: Application Performance Monitoring, 2014", May 27, 2015,
http://www.gartner.com/technology/reprints.do?id=1-2H15OOF&ct=150602&st=sb
[GILB] Gilb, T and L. Brodie, “What’s Wrong with Agile Methods: Some Principles and Values to
Encourage Quantification”, Methods and Tools, Summer 2007, accessed 6/2015
http://www.methodsandtools.com/archive/archive.php?id=58
[HOGL] Hoglund, G., and G. McGraw, Exploiting Software: How to Break Code, Addison-Wesley, 2004.
[HOWA] Howard, M., "Lessons Learned from Five Years of Building More Secure Software", Trustworthy
Computing, Microsoft,
http://download.microsoft.com/download/A/E/1/AE131728-943B-42B4-B130-C1DEBE68F503/Trustworthy%20Computing.pdf
[IBM] IBM developerWorks, "11 proven practices for more effective, efficient peer code review",
accessed 6/2015, http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-
peer-review/
[JONEa] Jones, C and O. Bonsignour, The Economics of Software Quality, 2011, Pearson Publishing
[JONEb] Jones, C., “Software Engineering issues for 30 years”, http://www.namcook.com/index.html
[JONEc] Jones, C., “Examples of Software Risk Master (SRM) Requirements Predictions”, January 11,
2014, http://namcookanalytics.com/wp-content/uploads/2014/01/RequirementsData2014.pdf
[JONEd] Jones, C., “Evaluating Software Metrics and Software Measurement Practices”, Version 4.0
March 14, 2014, Namcook Analytics LLC
http://namcookanalytics.com/wp-content/uploads/2014/03/Evaluating-Software-Metrics-and-
Software-Measurement-Practices.pdf.
[JONEe] Jones, C., “The Mess of Software Metrics”, Version 2, September 12, 2014,
http://namcookanalytics.com/wp-content/uploads/2014/09/problems-variations-software-
metrics.pdf
[KABA] Kabanov, Jevgeni, “Developer Productivity Report 2013 – How Engineering Tools & Practices
Impact Software Quality & Delivery”, Zeroturnaround, 2013,
http://pages.zeroturnaround.com/RebelLabs-
AllReportLanders_DeveloperProductivityReport2013.html?utm_source=Productivity%20Report
%202013&utm_medium=allreports&utm_campaign=rebellabs&utm_rebellabsid=76
[KASP] Kaspersky Lab, "Oracle Java surpasses Adobe Reader as the most frequently exploited software",
December 2012,
http://www.kaspersky.com/about/news/virus/2012/Oracle_Java_surpasses_Adobe_Reader_as_the
_most_frequently_exploited_software
[MANT] Mantyla, M.V., and C. Lassenius, "What types of defects are really discovered in code
reviews?", IEEE Transactions on Software Engineering, 2009, 35(3), 430-448.
[MCGR] McGraw, G, S. Migues, J. West, “BSIMM6”, October 2015, https://www.bsimm.com/wp-
content/uploads/2015/10/BSIMM6.pdf
[MICR] Microsoft, “Microsoft Security Intelligence Report volume 10 (July – December 2010)”,
http://www.microsoft.com/en-us/download/details.aspx?id=17030
[MITR] 2011 CWE/SANS Top 25 Most Dangerous Software Errors, http://cwe.mitre.org/top25
[MOCK] Mockus, A., Nachiappan Nagappan, Trung T. Dinh-Trong, "Test coverage and post-verification
defects: A multiple case study" Proceedings of the 2009 3rd International Symposium on
Empirical Software Engineering and Measurement, October 2009.
[NAGA] Nagappan, N., and T. Ball, “Use of Relative Code Churn to Predict System Defect Density”,
Microsoft Research, 2005, http://research.microsoft.com/pubs/69126/icse05churn.pdf
[NESM] NESMA, http://nesma.org/2015/04/Agile-metrics/
[OLAV] Olavsrud, T., “Do Insecure Open Source Components Threaten Your Apps?”, CIO, March 2012,
http://www.cio.com/article/2397662/governance/do-insecure-open-source-components-threaten-
your-apps-.html
[OWAS] “OWASP Top 10”, www.owasp.org/index.php/Category:OWASP_Top_Ten_Project
[PAUK] Paukamainen, I. “Case: Testing in Large Scale Agile Development”, presentation
http://testingassembly.ttlry.mearra.com/files/2014%20ISMO%20Case_TestingInLargeScaleAgile
Development.pdf
[PFLE] Pfleeger, S.L., N. Fenton, and N. Page, "Evaluating software engineering standards", IEEE
Computer, vol. 27, no. 9, pp. 71–79, 1994.
[PLUM] Plumbr, "Java performance tuning survey results", November 2014,
https://plumbr.eu/blog/performance-blog/java-performance-tuning-survey-results-part-i
[PUPP] Puppet Labs, IT Revolution Press and Thoughtworks, 2014 State Of DevOps Report and 2013 State
Of DevOps Infographic, 2015, https://puppetlabs.com/2013-state-of-devops-infographic
[PUTM] Putman, R., "Secure Agile SDLC", https://www.brighttalk.com/webcast/1903/92961.
[SANSa] SANS Institute, “2015 State of Application Security: Closing the Gap”, May 14, 2015,
https://software-security.sans.org/blog/2015/05/14/2015-state-of-application-security-closing-the-
gap
[SANSb] SANS Institute, “Critical Security Controls”, https://www.sans.org/critical-security-controls/
[SCAL] http://scaledAgileframework.com/features-components/
[SECU] “37 Powerful Penetration Testing Tools for Every Penetration Tester”, Security testing, June
2015, http://www.softwaretestinghelp.com/penetration-testing-tools/
[SIRI] Sirias, C., "Project Metrics for Software Development", InfoQ, July 14, 2009,
http://www.infoq.com/articles/project-metrics
[SMAR] SmartBear, "11 Best Practices for Peer Code Review", Whitepaper.
[SOFT] Software Assurance Pocket Guide Series: Development, Volume V, Version 1.3, "Architecture and
Design Considerations for Secure Software", February 22, 2011, https://buildsecurityin.us-
cert.gov/sites/default/files/Architecture_and_Design_Pocket_Guide_v1.3.pdf
[TECH] "Quality metrics: A guide to measuring software quality", SearchSoftwareQuality,
http://searchsoftwarequality.techtarget.com/guides/Quality-metrics-A-guide-to-measuring-
software-quality
[VERI] Verizon DBIR 2012, IDC, Infonetics Research
[WIKI] Software Quality, https://en.wikipedia.org/wiki/Software_quality
[ZERO] zeroturnaround.com/rebellabs/the-developers-guide