Components of a Modern Quality Approach
To Software Development
Dolores Zage, Wayne Zage
Ball State University
Sponsor: iconectiv
Final Report
October 2015
Table of Contents
Section 1: Software Development Process and Quality
1.1 New Versus Traditional Process Overview
1.2 DevOps
1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures
1.2.2 Zeroturnaround Survey
1.2.2.1 Survey: Quality of Development
1.2.2.2 Survey Results: Predictability
1.2.2.3 Survey Results: Tool Type Usage
1.2.3 DevOps Process Maturity Model
1.2.4 DevOps Workflows
1.3 Code Reviews
1.3.1 Using Checklists
1.3.1.1 Sample Checklist Items
1.3.2 Checklists for Security
1.3.3 Monitoring the Code Review Process
1.3.4 Evolvability Defects
1.3.5 Other Guidelines
1.3.6 High Risk Code and High Risk Changes
1.4 Testing
1.4.1 Number of Test Cases
1.5 Agile Process and QA
1.5.1 Agile Quality Assessment (QA) on Scrum
1.6 Product Backlog
Section 2: Software Product Quality and its Measurement
2.1 Goal Question Metric Model
2.2 Generic Software Quality Models
2.3 Comparing Traditional and Agile Metrics
2.4 NESMA Agile Metrics
2.4.1 Metrics for Planning and Forecasting
2.4.2 Dividing the Work into Manageable Pieces
2.4.3 Metrics for Monitoring and Control
2.4.4 Metrics for Improvement (Product Quality and Process Improvement)
2.5 Another State-of-the-Practice Survey on Agile Metrics
2.6 Metric Trends are Important
2.7 Defect Measurement
2.8 Defects and Complexity Linked
2.9 Performance, a Factor in Quality
2.10 Security, a Factor in Quality
2.10.1 Security Standards
2.10.2 Shift of Security Responsibilities within Development
2.10.3 List of Current Practices
2.10.4 Risk of Third Party Applications
2.10.5 Rate of Repairs
2.10.6 Other Code Security Strategies
2.10.7 Design Vulnerabilities
Section 3: Assessment of Development Methods and Project Data
3.1 The Namcook Analytics Software Risk Master (SRM) tool
3.2 Crosstalk Table
3.2.1 Ranges of Software Development Quality
3.3 Scoring Method of Methods, Tools and Practices in Software Development
3.4 DevOps Self-Assessment by IBM
3.5 Thirty Software Engineering Issues that have Stayed Constant for Thirty Years
3.6 Quality and Defect Removal
3.6.1 Error-Prone Modules (EPM)
3.6.2 Inspection Metrics
3.6.3 General Terms of Software Failure and Software Success
Section 4: Conclusions and Project Take-Aways
4.1 Process
4.2 Product Measurements
Acknowledgements
Appendix A – Namcook Analytics Estimation Report
Appendix B – Sample DevOps Self-Assessment
References
Section 1: Software Development Process and Quality
1.1 New Versus Traditional Process Overview
At first, enterprises used Agile development techniques for pilot projects developed by small teams.
Realizing the benefits of shorter delivery and release phases and of responsiveness to change while still
delivering quality software, enterprises searched for ways to achieve similar benefits in their larger
development efforts by scaling Agile. Many frameworks and methods were developed to satisfy this need.
The Scaled Agile Framework (SAFe) is one of the most widely implemented scaled Agile frameworks. Most
pieces of the framework are borrowed: existing Agile methods packaged and organized differently to
accommodate the larger scale. Integrated within SAFe and other scaled enterprise development
methodologies are Agile methods such as Scrum, along with other techniques that foster the delivery of software.
Why evaluate the process? Developing good software is difficult and a good process or method can make
this difficult task a little easier and perhaps more predictable. In the past, researchers performed an
analysis of software standards and determined that standards heavily focus on processes rather than
products [PFLE]. They characterized software standards as prescribing “the recipe, the utensils, and the
cooking techniques, and then assume that the pudding will taste good.” This corresponds with Deming
who argued that, “The quality of a product is directly related to the quality of the process used to create
it” [DEMI]. Watts Humphrey, the creator of the CMM, believed that high quality software can only be
produced by a high quality process. Most would agree that the probability of producing high quality
software is greater if the process is also of high quality. All recognize, however, that possessing a good
process in isolation is not enough; the process has to be staffed with skilled, motivated people.
Can a process promote motivation? Agile methods are people-oriented rather than process-oriented. Agile
methods assert that no process will ever make up for the skill of the development team and that, therefore,
the role of a process is to support the development team in its work. Another movement, DevOps, encourages
collaboration by integrating development and operations. Figure 1 compares the old and new ways of
delivering software. The catalyst for many of the enumerated changes is teamwork and cooperation.
Everyone should participate, and all share in responsibility and accountability. In true Agile, teams
have the autonomy to choose development tools. In the new world, disconnects should be removed, and
development tools and processes should be chosen so that people can receive feedback quickly and make
necessary changes. Successful integration of DevOps and Agile development will play a key role in the
delivery of rapid, continuous, high quality software. Institutions that can accomplish this merger at the
enterprise scale will outpace those struggling to adapt.
Figure 1: Old and New Way of Developing Software
For this reason, identifying the characteristics of high-performing Agile and DevOps cultures is important
to assist in outlining a new transformational technology.
1.2 DevOps
DevOps is the fusion of “Development” and “Operations” functions to reduce risk, liability, and
time-to-market while increasing operational awareness. It is one of the largest movements in Information
Technology of the past decade. The DevOps evolution stemmed from many previous ideas in software
development, such as automation tooling, culture shifts, and iterative development models like Agile.
During the fusion process, DevOps was not provided with a central set of operational guidelines. Once an
organization decided to use DevOps for its software development, it had free rein in deciding how to
implement DevOps, which produced its own challenges. Even Agile, many of whose attributes were adopted
by the DevOps movement, falls into the same predicament. In 2006, Gilb and Brody wrote an article
suggesting the same lack of quantification for Agile methods, and that this is a major weakness [GILB].
There is insufficient focus on quantified performance levels, such as metrics evaluating required
qualities, resource savings, and workload capacities of the developed software. This lack of
quantification does not mean that DevOps fails to prescribe monitoring and measuring. The purpose of
monitoring and measuring within DevOps is to compare the current state of a project to the same project
at some time in the near past, providing an answer about current project progress. However, this
quantification does not help to answer the question of how to benchmark the overall DevOps implementation.
1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures
A possible description of a high-performing DevOps environment is that it produces good quality systems
on time. It is important to identify the characteristics of such high-performing cultures so that these
practices are emulated and metrics can be identified that quantify these typical characteristics and track
successful trends. Below are seven key points of a high-performing DevOps culture [DYNA]:
1. Deploy daily – decouple code deployment from feature releases
2. Handle all non-functional requirements (performance, security) at every stage
3. Exploit automated testing to catch errors early and quickly - 82% of high-performance DevOps
organizations use automation [PUPP].
4. Employ strict version control – version control in operations has the strongest correlation with
high performing DevOps organizations [PUPP].
5. Implement end-to-end performance monitoring and metrics
6. Perform peer code review
7. Allocate more cycle time to reduction of technical debt.
Other attributes of high-performing DevOps environments are that operations are involved early on in the
development process so that plans for any necessary changes to the production environment can be
formulated, and a continuous feedback loop between development, testing and operations is present. Just
as development has its daily stand-up meeting, development, QA and operations should meet frequently
to jointly analyze issues in production.
1.2.2 Zeroturnaround Survey
Many of the above trends are highlighted in a survey conducted by Zeroturnaround of 1,006 developers
on the practices they use during software development. The goal of this survey was to confirm or
disprove the effectiveness of the best quality practices, including the methodologies, tools, company
size and industry within the context of those practices [KABA]. The report noted that the survey
respondents were disproportionately biased towards Java and were employed in Software/Technology
companies. Zeroturnaround also divided the respondents into three groups based on their responses:
the top 10% were identified as rock stars, the middle 80% as average, and the bottom 10% as laggards.
For this report, Zeroturnaround concentrated on two aspects of software development, namely the quality
of the software and the predictability of delivery. The report results are summarized in
Sections 1.2.2.1-1.2.2.3.
1.2.2.1 Survey: Quality of Development
The quality of the software was determined by the frequency of critical or blocker bugs discovered
post-release. Zeroturnaround decided that a good measure of software quality is to ask the respondents,
“How often do you find critical or blocker bugs after release?” This simple question is an easy way to
judge whether at least minimum requirements are met, and the response reflects a key quality metric:
critical bugs are the defects most likely to negatively impact the largest group of end users. The
survey responses to this question were converted into percentages.
Do you find critical or blocker bugs after a release?
A. No - 100%
B. Almost never - 75%
C. Sometimes - 50%
D. Often - 25%
E. Always - 0%
The analysis of the answers to this question, summarized below, indicates that released software has a
50% probability of containing a critical bug; most respondents admitted to “sometimes” releasing
software with bugs. On average, 58% of releases go to production without critical bugs. The analysis
also demonstrates a distinct difference among respondents: the laggards deliver bug-free software only
25% of the time, while the rock stars deliver bug-free software 75% of the time.
Average 58%
Median 50%
Mode 50%
Standard Deviation 19%
Laggards (Bottom 10%) 25%
Rock stars (Top 10%) 75%
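The conversion of survey answers to scores and the summary statistics above can be sketched as follows. The sample responses below are hypothetical, invented only to exercise the mapping; the real survey had 1,006 respondents.

```python
from statistics import mean, median, mode, pstdev

# Mapping of survey answers to a bug-free release score (from the report's scale).
SCORES = {"No": 1.00, "Almost never": 0.75, "Sometimes": 0.50,
          "Often": 0.25, "Always": 0.00}

def summarize(responses):
    """Convert answers to scores and compute the report's summary statistics."""
    scores = [SCORES[r] for r in responses]
    return {
        "average": mean(scores),
        "median": median(scores),
        "mode": mode(scores),
        "std_dev": pstdev(scores),
    }

# Hypothetical sample of ten respondents.
sample = ["Sometimes"] * 5 + ["Almost never"] * 3 + ["Often"] * 2
stats = summarize(sample)
print(stats["median"])  # 0.5 for this sample
```

Applied to the full survey data, this yields the average (58%), median (50%), mode (50%) and standard deviation (19%) reported above.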
1.2.2.2 Survey Results: Predictability
Predictability of delivery for this report is determined by delays in releases, execution of planned
requirements, and in-process changes, also known as scope creep. The expression

Predictability = (1 / (1 + % late)) x (% of plans delivered) x (1 - % scope creep)
converts predictability of delivery into a mathematical expression. An example listed below is provided
to demonstrate the use of the mathematical formula:
Example: Mike’s team is pretty good at delivering on time--they only release late 10% of the time. On
average, they get 75% of their development plans done, and they’ve been able to limit scope creep to just
10% as well. Based on that, we can calculate that Mike’s team releases software with 61% predictability,
(1 / (1 + 0.10)) x 0.75 x (1 - 0.10) = 0.61 = 61%
The 61% is not a true probability because it is not normalized to 100%. The authors could have
normalized over delivery, but chose not to, as normalization did not affect trends and made the number
harder to interpret. It was suggested that change in scope should be included in the formula, since it
definitely affects the ability to predict delivery. The authors tested this suggestion by adding change
in scope to the formula; the outcome was that omitting an estimate for change in scope does not change
any trends. These trends are represented with concrete numbers, so statistical analysis may impact the
accuracy of absolute numbers, but not the relative trends. The findings and observations on the
predictability of software releases are that companies can predict delivery within a 60% margin;
considering just the rock stars, they can attain an 80% margin. When predictability was categorized by
industry, there was no significant relationship. Predictability does increase slightly (3%) for larger
companies; the authors theorize that this is due to a greater level of organizational requirements, so
more non-development staff are available to coordinate projects and release plans as teams scale up in size.
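The predictability formula and the worked example for Mike's team can be sketched as a small function:

```python
def predictability(pct_late, pct_plans_delivered, pct_scope_creep):
    """Predictability = (1 / (1 + % late)) x (% of plans delivered) x (1 - % scope creep).

    All arguments are fractions, e.g. 0.10 for 10%.
    """
    return (1 / (1 + pct_late)) * pct_plans_delivered * (1 - pct_scope_creep)

# Mike's team: late 10% of the time, 75% of plans delivered, 10% scope creep.
p = predictability(0.10, 0.75, 0.10)
print(round(p, 2))  # 0.61, i.e. 61% predictability
```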
Noting the obvious difference in the probability of bug-free delivery between the rock stars and laggards,
it is important to enumerate the differences identified by the survey. Based on the responses of over
1,000 engineers, half of the respondents do not fix code quality issues identified through static analysis
(Sonar or SonarQube), and some do not even monitor code quality. Those that use static analysis see a
10% increase in delivering bug-free software.

Automated tests exhibited the largest overall improvements in both the predictability and quality of
software deliveries. The laggards did no automated testing (0%), while the rock stars covered 75% of the
functionality with automated tests. Quality also increases most when developers themselves are testing
the code. More than half of the respondents have less than 25% test coverage, and there is a significant
increase in both predictability and quality as test coverage increases.

Code reviews significantly impact the predictability of releases but only moderately affect software
quality. A plausible explanation is that developers are poor at spotting bugs in code but good at spotting
software design issues and code smells that impact future development and maintenance. The majority of
problems found by reviewers are not functional mistakes, but what the researchers call evolvability
defects (discussed further in Section 1.3.4).

Close communication, such as daily standups and team chat, seems to be the best way to communicate and
increase the predictability of releases. Most teams work on technical debt at least sometimes, but the
survey results indicate no significant increases in the quality and predictability of software releases
from doing so. However, a negative trend appears when no technical debt maintenance is done.
1.2.2.3 Survey Results: Tool Type Usage
The article also contains a segment reporting tool and technology usage/popularity combined with the
corresponding increases or decreases in predictability and quality. Developers were asked about the
tools they used, and Figure 2 presents a bar chart of their responses.

The report analyzed which technologies and tool types influence the quality and predictability of
releases. For quality, there were no significant trends when compared to technologies and tool types;
it appears that quality is affected by development practices, but not development tools. For
predictability, using version control and an IDE significantly improves the predictability of
deliveries, and there is a reasonable increase in predictability for users of code quality analysis,
continuous integration, issue trackers, profilers and IaaS solutions. Use of a text editor or debugger
has little or no effect on predictability.
Figure 2: Tool Type Usage Results from Zeroturnaround 2013 Survey
What are our takeaways from this survey and report? First of all, it appears that the majority of the
survey respondents are our peers, developing mainly in Java and categorizing the company type as
Software/Technology companies. Therefore, comparisons can be made. Automation of testing is a
leading indicator of quality reiterated by many reports [4, 5]. As test coverage increases, both
predictability and quality increase and automation can promote greater coverage. Code reviews increase
predictability and can increase the quality of the structure of the code, which is part of refactoring best
practice in Agile developments. Code quality analysis and fixing quality problems is another practice that
increases both quality and predictability. Quality was not significantly affected by a tool set,
underscoring that quality is based mainly on practices, not tools. However, good tools can make a team
more productive. They can also serve as focal points to enforce the practices that further improve the
ability to predictably deliver quality software. Identifying and implementing best practices is one key to
improving software development. However, metrics need to be chosen carefully to measure the
improvement or lack thereof. The importance of automated testing and assessment of coverage has been
outlined by numerous sources. All of these practices should be considered important in our SDL.
1.2.3 DevOps Process Maturity Model
The reason many organizations adopt DevOps is rapid delivery that maximizes outcomes. One of the key
attributes and best practices of DevOps is the integration of quality assurance into the workflow. The
earlier errors are caught and fixed, the less rework is required, pushing the team toward a stable
product. Figure 3 provides a DevOps Process Maturity Model with five stages, from Initial to Optimize,
aligned with the CMMI Maturity Model. Assuming that our target is level 4, the major keys to achieving
this level are quantification and control.
Figure 3: DevOps Process Maturity Model
To reach the quantify level and eventually the optimize level within the DevOps maturity model, each
workflow should be analyzed to determine how errors are introduced and subsequently identify possible
quality assurance techniques to be inserted into those checkpoints.
1.2.4 DevOps Workflows
The first workflow is documentation. A best practice is good documentation of the process and other
components of development. Documentation should be readily available and up-to-date. Documentation
that is out of date or incorrect can be very detrimental. The process should be documented along with the
configuration of all the systems. The documentation and configuration files should be placed in a
Software Configuration Management (SCM) system. A configuration management system (CMS) should
be used to describe the families and groups of systems in the configuration. A benefit of a configuration
management system is that it checks in periodically with a centralized catalog to ensure that the system is
continuing to comply and run with the approved configuration. There are many instances where a
change in the configuration has a devastating effect. As seen in Figure 4, a single configuration error
can have far-reaching impact across IT and the business. From chaos theory, a butterfly flapping its
wings in one part of the world can result in a hurricane in other parts. A configuration error may not
be as devastating, but it has a large impact in terms of time, money and risk. Many CMSs have a method
to test or validate a set of configurations before pushing them to a machine. For example, many CMSs
(such as Puppet or BCFG2) have validation commands that can be executed as part of a process before
continuing and installing the configuration. Another option is to invoke a canary system as the first
line of defense for catching configuration errors [CHIL].
Figure 4: Configuration Errors Impact [VERI]
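The validate-before-push idea can be illustrated with a minimal sketch. This is a hypothetical gate that assumes JSON configuration files; a real CMS would use its own validator (for example, Puppet's `puppet parser validate` for manifests) in place of the `json.loads` check here.

```python
import json

def validate_configs(configs):
    """Return a list of (name, error) pairs for configs that fail to parse.

    `configs` maps a configuration file name to its raw JSON text. An empty
    result means every configuration is syntactically valid and safe to push.
    """
    errors = []
    for name, text in configs.items():
        try:
            json.loads(text)
        except ValueError as exc:
            errors.append((name, str(exc)))
    return errors

configs = {
    "web.json": '{"port": 8080, "workers": 4}',
    "db.json": '{"host": "db01", "port": }',  # malformed on purpose
}
problems = validate_configs(configs)
if problems:
    print("Blocking push:", problems[0][0])  # Blocking push: db.json
```

The point is that the gate runs before anything reaches a machine: a single malformed file blocks the whole push rather than propagating a bad configuration.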
Another best practice is to use an SCM for all of the products. An SCM allows multiple developers to
contribute to a project or set of files at once, merging their contributions without overwriting previous
work. An SCM also allows the rollback of changes in the event of an error making its way into the
repository. However, rollbacks should be avoided, and a code review can be inserted as a quality
assurance step to catch errors before they are committed.
1.3 Code Reviews
Code reviews employ someone other than the developer who wrote the code to check the work. Studies
have shown that quick, lightweight reviews found nearly as many bugs as more formal code reviews
[IBM]. At shops like Microsoft and Google, developers don’t attend formal code review meetings.
Instead, they take advantage of collaborative code review platforms like Gerrit (open source), CodeFlow
(Microsoft), Collaborator (Smartbear), ReviewBoard (open source) or Crucible (Atlassian, usually
bundled with Fisheye code browser), or use e-mail to request reviews asynchronously and to exchange
information with reviewers. These tools support a pre-commit method of code review. The code review
occurs before the code/configuration is committed to an SCM.
Reviewing code against coding standards (see Google’s Java coding guide, http://google-
styleguide.googlecode.com/svn/trunk/javaguide.html) is an inefficient way for a developer to spend their
valuable time. Every developer should use the same coding style templates in their IDEs and use a tool
like Checkstyle to ensure that code is formatted consistently. Highly configurable, Checkstyle can be
made to support almost any coding standard. Configuration files are supplied at the Checkstyle download
site supporting the Oracle Code Conventions and Google Java Style. An example of a report that can be
produced using Checkstyle and Maven can be seen at http://checkstyle.sourceforge.net/. Coding style
checker tools free up reviewers to focus on the things that matter such as assisting developers to write
better code and create code that works correctly and is easy to maintain.
Additionally, the use of static analysis tools upfront will make reviews more efficient. Free tools such
as FindBugs and PMD for Java can catch common coding bugs, inconsistencies, sloppy or messy code, and
dead code before the code is submitted for review. The latest static analysis tools go far beyond this,
and are capable of finding serious errors in programs such as null-pointer dereferences, buffer
overruns, race conditions, resource leaks, and other errors. Static analysis can also assist testing. If the
unreachable code or redundant conditions can be brought to the attention of the tester early, then they do
not need to waste time in a futile attempt to achieve the impossible. Static analysis frees the reviewer
from searching for micro-problems and bad practices, so they can concentrate on higher-level mistakes.
Static analysis is only a tool to help with code review and is not a substitute. Static analysis tools can’t
find functional correctness problems or design inconsistencies or errors of omission or help you find a
better or simpler way to solve a problem.
The reviewer should be concentrating on:

Correctness:
- Functional correctness: does the code do what it is supposed to do? The reviewer needs to know the
problem area, the requirements, and usually something about this part of the code to be effective at
finding functional correctness issues.
- Coding errors: low-level coding mistakes like using <= instead of <, off-by-one errors, using the
wrong variable (like mixing up lessee and lessor), copy-and-paste errors, or leaving debugging code in
by accident.
- Design mistakes: errors of omission, incorrect assumptions, messing up architectural and design
patterns such as model-view-controller (MVC).
- Security: properly enforcing security and privacy controls (authentication, access control, auditing,
encryption).

Maintainability:
- Clarity: class, method and variable naming, comments, etc.
- Consistency: using common routines or language/library features instead of rolling your own, and
following established conventions and patterns.
- Organization: poor structure, duplicate or unused/dead code.
- Approach: areas where the reviewer can see a simpler, cleaner or more efficient implementation.
1.3.1 Using Checklists
A checklist is an important component of any review. Checklists are most effective at detecting
omissions. Omissions are typically the most difficult types of errors to find. A reviewer does not require
a checklist to look for algorithm errors or sensible documentation. The difficult task is to notice when
something is missing and reviewers are likely to forget it as well. The longer a checklist becomes, the
less effective each item is reviewed [SMAR]. Limit the checklist to about 20 items. In fact, the SEI
performed a study indicating that a person makes about 15-20 common mistakes. For example, the
checklist can remind the reviewer to confirm that all errors are handled, that function arguments are tested
for invalid values, and that unit tests have been created. It is estimated that people make 15-20 common
mistakes in coding [SMAR]. Below is a sample review checklist from Smartbear
(http://smartbear.com/SmartBear/media/pdfs/best-kept-secrets-of-peer-code-review.pdf).
1.3.1.1 Sample Checklist Items
1. Documentation: All subroutines are commented in clear language.
2. Documentation: Describe what happens with corner-case input.
3. Documentation: Complex algorithms are explained and justified.
4. Documentation: Code that depends on non-obvious behavior in external libraries is documented with
reference to external documentation.
5. Documentation: Units of measurement are documented for numeric values.
6. Documentation: Incomplete code is indicated with appropriate distinctive markers (e.g. “TODO” or
“FIXME”).
7. Documentation: User-facing documentation is updated (online help, contextual help, tool-tips, version
history).
8. Testing: Unit tests are added for new code paths or behaviors.
9. Testing: Unit tests cover errors and invalid parameter cases.
10. Testing: Unit tests demonstrate the algorithm is performing as documented.
11. Testing: Possible null pointers always checked before use.
12. Testing: Array indexes checked to avoid out-of-bound errors.
13. Testing: Don’t write new code that is already implemented in an existing, tested API.
14. Testing: New code fixes/implements the issue in question.
15. Error Handling: Invalid parameter values are handled properly early in the subroutine.
16. Error Handling: Error values of null pointers from subroutine invocations are checked.
17. Error Handling: Error handlers clean up state and resources no matter where an error occurs.
18. Error Handling: Memory is released, resources are closed, and reference counters are managed under
both error and no error conditions.
19. Thread Safety: Global variables are protected by locks or locking subroutines.
20. Thread Safety: Objects accessed by multiple threads are accessed only through a lock.
21. Thread Safety: Locks must be acquired and released in the right order to prevent deadlocks, even in
error-handling code.
22. Performance: Objects are duplicated only when necessary.
23. Performance: Busy-wait loops are not used in place of proper thread synchronization methods.
24. Performance: Memory usage is acceptable even with large inputs.
25. Performance: Optimization that makes code harder to read should only be implemented if a profiler or
other tool has indicated that the routine stands to gain from optimization. These kinds of optimizations
should be well-documented and code that performs the same task simply should be preserved somewhere.
An effective method to maintain the checklist is to match defects found during review to the associated
checklist item. Items that turn up many defects should be kept. Defects that aren't associated with any
checklist item should be scanned periodically; usually there are categorical trends in your defects, and
each type of defect can be turned into a checklist item that would cause the reviewer to find it. Over
time, the team will become used to the more common checklist items and will adopt programming habits
that prevent some of them altogether. The list can be shortened by reviewing the “Top 5 Most Violated”
checklist items every month to determine whether anything can be done to help developers avoid the
problem. For example, if a common problem is “not all methods are fully documented,” a feature in the
IDE can be enabled that requires developers to have at least some sort of documentation on every method.
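Matching review defects to checklist items can be as simple as a tally. The defect tags below are illustrative sample data, not figures from the report:

```python
from collections import Counter

# Each review defect is tagged with the checklist item it matches,
# or None if no existing item covers it. Sample data is hypothetical.
defects = [
    "Unit tests are added for new code paths",
    "Possible null pointers always checked before use",
    "Possible null pointers always checked before use",
    None,  # uncovered defect: candidate for a new checklist item
    "Unit tests are added for new code paths",
]

tally = Counter(d for d in defects if d is not None)
uncovered = sum(1 for d in defects if d is None)

# The "Top 5 Most Violated" items, reviewed monthly:
for item, count in tally.most_common(5):
    print(count, item)
print("defects not matching any item:", uncovered)
```

Items with high counts stay on the list; the uncovered count signals where a new checklist item may be needed.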
1.3.2 Checklists for Security
Coding checklists are not usually devoted specifically to security reviews. Agnitio is an open-source
code review tool that guides a reviewer through a security review by following a detailed code and
design review checklist, and records the results of each review, removing the inconsistency of manual
security code review documentation. It assists developers and security professionals in conducting
manual security code reviews in a consistent and repeatable way. Code reviews are important for finding
security vulnerabilities, and often are the only way to find them short of exhaustive and expensive
penetration testing. This is why code reviews are a fundamental part of secure SDLCs such as
Microsoft's SDL.
1.3.3 Monitoring the Code Review Process
The code review process should be monitored for defect removal. For example, how do code reviews
compare to the other methods of defect removal practices in predicting how many hours are required to
finish a project? The minimal list of raw numbers collected is lines of code including comments (LOC),
inspection time, and defect count. LOC and inspection time are obvious. A defect in a code review is
something a reviewer wants changed in the code. A tool-assisted review process should be able to
collect these automatically without manual intervention. From these data, other analytical metrics can be
calculated and, if necessary, classified into reviews from a development group, reviews of a certain
author, reviews performed by a certain reviewer, or on a set of files. The calculated ratios are inspection
rate, defect rate and defect density.
The inspection rate is the rate at which a certain amount of code is reviewed. The ratio is LOC divided
by inspection hours. An expected value for a meticulous inspection would be 100-200 LOC/hour; a normal
inspection might be 200-500; above 800-1000 LOC/hour is so fast that it can be concluded the reviewer
performed a perfunctory job.
The defect rate is the rate defects are uncovered by the reviewers. The ratio is defect count divided by
inspection hours. A typical value for source code would be 5-20 defects per hour depending on the review
technique. For example, formal inspections with both private code-reading phases and inspection
meetings will be on the slow end, whereas the lightweight approaches, especially those without scheduled
inspection meetings, will trend toward the high end. The time spent uncovering the defects in review is
counted in the metric and not the time taken to actually fix those defects.
The defect density is the number of defects found in a given amount of source code. The ratio is defect
count divided by kLOC (thousand lines of code). A higher defect density indicates that more defects are
being uncovered and that the reviews are effective. That is, a high defect density is more likely to mean
the reviewers did a thorough job than that the underlying source code is extremely bad. It is difficult to
give an expected value for defect density because of the many factors involved. For example, a mature,
stable code base with tight development controls might have a defect density as low as 5 defects/kLOC,
whereas a review of new code written by junior developers, in an environment uncontrolled except for a
strict review process, might uncover 100-200 defects/kLOC.
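As a sketch, the three ratios above can be computed directly from the raw data a review tool collects. The function names and the example figures are illustrative, not taken from any particular tool:

```python
# Illustrative sketch: the three calculated review ratios described above.
# Function names and inputs are hypothetical, not from a specific review tool.

def inspection_rate(loc: int, hours: float) -> float:
    """LOC reviewed per inspection hour (100-200 meticulous, >800-1000 perfunctory)."""
    return loc / hours

def defect_rate(defects: int, hours: float) -> float:
    """Defects uncovered per inspection hour (typically 5-20 for source code)."""
    return defects / hours

def defect_density(defects: int, loc: int) -> float:
    """Defects found per thousand lines of code reviewed."""
    return defects / (loc / 1000)

# Example: a review of 1,500 LOC that took 6 hours and found 30 defects
print(inspection_rate(1500, 6))   # 250.0 LOC/hour -> a normal inspection pace
print(defect_rate(30, 6))         # 5.0 defects/hour
print(defect_density(30, 1500))   # 20.0 defects/kLOC
```

Classifying these ratios by group, author, reviewer, or file set is then a matter of filtering the raw records before applying the same formulas.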
1.3.4 Evolvability Defects
There is more to reviews, however, than finding bugs and security vulnerabilities. A 2009 study by
Mantyla and Lassenius revealed that the majority of problems found by reviewers are not functional
mistakes, but what the researchers call evolvability defects [MANT]: issues that make code harder to
understand and maintain, more fragile, and more difficult to modify and fix.
Between 60% and 75% of the defects found in code reviews fall into this class. Of these, approximately
1/3 are simple code clarity issues, such as improving element naming and comments. The rest are
organizational problems: code that is poorly structured, duplicated, or unused, code that could be
expressed with a much simpler and cleaner implementation, or hand-rolled code that could be replaced
with built-in language features or library calls. Reviewers also find changes that do not belong or are not
required, copy-and-paste mistakes, and inconsistencies.
These defects or recommendations feed back into refactoring and are important for future maintenance of
the software, reducing complexity and making it easier to change or fix the code in the future. However,
it’s more than this: many of these changes also reduce the technical risk of implementation, offering
simpler and safer ways to solve the problem, and isolating changes or reducing the scope of a change,
which in turn will reduce the number of defects that could be found in testing or escape into the release.
1.3.5 Other Guidelines
An important aspect of enterprise architecture is the development of guidelines for addressing common
concerns across IT delivery teams. An organization may develop security guidelines, connectivity
guidelines, coding standards, and many others. By following common development guidelines, the
delivery teams produce more consistent solutions, which in turn make them easier to operate and support
once in production, thereby supporting the DevOps strategy.
1.3.6 High Risk Code and High Risk Changes
If possible, all code should be reviewed. When that is not possible, one must ensure that high risk code
and high risk changes are always reviewed. Listed below are candidates for these categories.
High risk code:
Network-facing APIs
Plumbing (framework code, security libraries, etc.)
Critical business logic and workflows
Command and control and root admin functions
Safety-critical or performance-critical (especially real-time) sections
Code that handles private or sensitive data
Code that is complex
Code developed by many different people
Code that has had many defect reports – error prone code
High risk changes:
Code written by a developer who has just joined the team
Big changes
Large-scale refactoring (redesign disguised as refactoring)
1.4 Testing
Although static analysis and code review prevent many errors, these activities will not catch them all.
Mistakes can still creep into the production environment. Untested code must be exercised in a testbed,
and all changes should be made in the testbed first; this is another best practice. Testing, except at the unit
level, is performed following a formal delivery of the application build to the QA team after most, if not
all, construction is completed. Unit tests are just as much about improving productivity as they are about
catching bugs, so proper unit testing can speed development rather than slow it down. Unit testing is not
in the same class as integration testing, system testing, or any kind of adversarial "black-box" testing that
attempts to exercise a system based solely on its interface contract. These types of tests can be automated
in the same style as unit tests, perhaps even using the same tools and frameworks. However, unit tests
codify the intent of a specific low-level unit of code. They are focused and they are fast. When an
automated test breaks during development, the responsible code change is rapidly identified and
addressed. This rapid feedback cycle generates a sense of flow during development, which is the ideal
state of focus and motivation needed to solve complex problems.
As software grows, defect potential increases and defect removal efficiencies decrease. The defect density
at release time increases and more defects are released to the end-user(s) of the software product. Larger
software size increases the complexity of software and, thereby, the likelihood that more defects will be
injected. For testing, a larger software size has two consequences:
The number of tests required to achieve a given level of test coverage increases exponentially
with software size.
The time to find and remove a defect first increases linearly and then grows exponentially with
software size.
As software size grows, software developers must reduce their defect potentials and improve their
removal efficiencies simply to maintain existing levels of released defect density. The raw code coverage
metric is meaningful mainly when it is too low; a high value requires further analysis.
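The one-sided nature of the coverage metric can be sketched as a simple gate; the 70% threshold here is an illustrative assumption, not a standard value:

```python
# Sketch of the policy stated above: treat raw code coverage as a one-sided
# signal. The 70% minimum is an illustrative assumption, not a standard.

def assess_coverage(percent: float, minimum: float = 70.0) -> str:
    if percent < minimum:
        return "fail: coverage too low, add tests"
    # High coverage alone proves little; the tests themselves need review.
    return "pass: threshold met, but inspect test quality before trusting it"

print(assess_coverage(45.0))
print(assess_coverage(92.0))
```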
1.4.1 Number of Test Cases
Although there are many definitions of software quality, it is widely accepted that a project with many
defects lacks quality. Testing is one of the most crucial tasks in software development that can increase
software quality. A large part of the testing effort is in developing test cases. The motive for writing a test
case should be the complete and correct coverage of a requirement, which could require five or fifty test
cases. The number of test cases is basically irrelevant for this purpose and can even be a damaging
distraction. A large number of test cases could artificially inflate confidence that the software has been
adequately tested. There is also no standard for what constitutes one test case. A tester can create one
large test case or 200 smaller ones. It is good practice to write a separate test case for each piece of
functionality, and some testers break test cases down further into discrete steps. Thus, the number of
test cases cannot assure a requirement's coverage; it is the content of the test cases that covers a requirement.
The number of test cases also does not indicate the quality of the test cases. Choosing the right techniques
and prioritizing the right test cases can provide significant economic benefits. Therefore, it is important to
analyze test case quality. Test case quality has many facets, such as the number of faults revealed and
efficiency, the time spent to reveal a fault. The most common and oldest measures are coverage measures,
used as a direct measure of test case quality [FRAN, Hutchins].
Interesting research results on test coverage are presented in a paper by Mockus, Nagappan, and Dinh-
Trong [MOCK]. Key observations and conclusions from the paper are the following:
"Despite dramatic differences between the two industrial projects under study we found that code
coverage was associated with fewer field failures. This strongly suggests that code coverage is a
sensible and practical measure of test effectiveness."
The authors state “an increase in coverage leads to a proportional decrease in fault potential."
"Disappointingly, there is no indication of diminishing returns (when an additional increase in
coverage brings smaller decrease in fault potential)."
"What appears to be even more disappointing, is the finding that additional increases in coverage
come with exponentially increasing effort. Therefore, for many projects it may be impractical to
achieve complete coverage."
From this paper, it can be concluded that more coverage means fewer bugs, but at increasing cost.
Although there are strong indications of a correlation between coverage and fault detection, considering
only the number of faults may not be sufficient. Code coverage does not guarantee that the code is
correct, and attaining 100 percent code coverage does not imply the system will have no failures; bugs
can still be found outside the anticipated scenarios.
1.5 Agile Process and QA
The development process has been transformed to Agile and the code is being developed iteratively. The
product owners are maintaining the backlog and the development team is completing chunks of the
product in two or three week increments, or sprints. The QA process follows the released sprint. The
process just described is partially stuck in Waterfall mode (in the mindset of developers) if the QA
department is lagging a sprint behind the development team. Often, the developers in this scenario
consider their work done when they have deployed their changes to a QA environment for testing
purposes. This is “throwing it over the wall” again, only in smaller increments. Many factors can derail
development in this setup: loosely defined acceptance criteria, outdated quality standards, time-
consuming regression tests, slow and error-prone deployments to the QA environment, and rigid
organization charts. The setup described is not unusual and has been attempted before, so it is important
to revisit it and identify some of the common mistakes organizations make during their transition to
Agile.
There are two main ways that Agile provides a basis for system quality verification. They are the
acceptance criteria and the definition of done. Acceptance criteria are normally expressed in Gherkin, the
language used in Cucumber to write tests. This structured format defines the functionality, performance or
other non-functional aspects that will be required of the software to be accepted by the business and/or
stakeholders. Naturally, the acceptance criteria should be defined before beginning the development
effort on the functionality. The team and the testers should all be involved, at least to some degree, in
producing the criteria. With the acceptance criteria in hand, the developers understand what needs to be
built by knowing how it will be tested, which helps ensure they build software that yields only the
desired outcomes.
The second way of checking quality is the definition of done: a list of quality checks that must be
satisfied before a piece of functionality can be considered done. The list also includes the non-functional
quality requirements that the team must always adhere to when working through the backlog, sprint after
sprint. The acceptance criteria ensure that the software delivers the expected value, and the definition of
done ensures that the software is built with quality in mind.
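As an illustration of acceptance criteria driving development, a Gherkin-style scenario can be mirrored directly in an automated test. The feature, class, and amounts below are hypothetical, chosen only to show the Given/When/Then shape:

```python
# Hypothetical acceptance test mirroring a Gherkin scenario:
#   Given a registered user with a $100 balance
#   When the user withdraws $40
#   Then the remaining balance is $60

class Account:
    def __init__(self, balance: float):
        self.balance = balance

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

def test_withdrawal_reduces_balance():
    account = Account(balance=100.0)   # Given
    account.withdraw(40.0)             # When
    assert account.balance == 60.0     # Then

test_withdrawal_reduces_balance()
print("acceptance criterion satisfied")
```

Tests written this way run under an ordinary unit-test runner, so the acceptance criteria can execute in the same automated suite as the unit tests.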
In all development models, testing constitutes a majority of the schedule; some suggest that if this is not
the case, quality is most likely suffering. In an Agile process, change occurs rapidly, so regression testing
is performed frequently to verify that changes have not regressed the software. With every new release,
as more functionality is added, the amount to be tested and retested expands. The only sensible approach
is to automate whatever can be automated. Developers should be writing unit tests throughout
development. There are integration tests verifying that the internal components work together properly,
and the new code integrates well with external components. These are normally automated. Acceptance
tests are tests that verify that the acceptance criteria are being satisfied, and automating these tests would
prevent manually running regression tests with each new release. The timing of test automation is also
important. The ideal process would be to automate the acceptance tests, Acceptance Test-Driven
Development (ATDD), before the development effort has even begun. There are tools that translate the
acceptance criteria’s Gherkin statements into the tests. Even if ATDD is not followed, automating the
acceptance tests should be part of the done definition. Manually triggering tests to execute and manual
deployments are also common pitfalls. The best practice is early visibility to quality issues by being
notified immediately after committing the offending code into source control. A continuous integration
(CI) build that runs your suite of automated tests enables the developer to look at the issue with all the
details of the problem fresh in his/her mind.
As others have experienced, combining acceptance test and development into one gated check-in saves
cost and rework. The code cannot be committed into source control without first passing all required tests
of the CI build. Some of the software can be environment-specific, and the tests must be executed against
an environment that mirrors production as closely as possible. Combining CI build and automated
deployment provides continuous deployment, at least to the test environment. The latest code is
automatically deployed to the test environment after each successful commit to source control enabling
the test environment to continuously execute against the latest work by the developers. Bugs are made
visible significantly sooner in the development cycle, including the fickle “it works on my machine”
bugs.
The best practice is to tightly integrate the development and QA efforts. Testing should be incorporated
into the core development cycle such that no team can ever call anything “done” unless it has been fully
vetted by thorough testing. Some aspects of quality assurance are challenging for a development team
primarily focused on delivering new functionality of high quality: security testing and load testing.
Although these tests can be automated, they are so costly in time and processing power to run that they
are typically not part of an automated test suite that executes as part of
continuous integration. Another category of testing that is definitely best suited for dedicated QA testers
is exploratory testing. Automated tests can only catch bugs in the predictable and designed behavior of an
application, while exploratory tests catch the rest. The QA department should coordinate and refine these
practices, provided that the testers themselves are allocated to the development team whose code is being
tested.
1.5.1 Agile Quality Assessment (QA) on Scrum
There are many challenges when applying an Agile quality assessment. The following questions must be
assessed to determine the process:
Is QA part of the development team?
Can QA be part of the same iteration as development?
Who performs QA? (Separate team)
Does QA cost more in Agile as the product fluctuates from sprint to sprint?
Can Agile QA be scaled?
Is a test plan required?
Who defines the test cases?
Are story acceptance tests enough?
When is testing done?
When and how are bugs tracked?
Much of QA is about testing to ensure the product is working right. Automation is QA’s best friend by
providing repeatability, consistency and better test coverage. Since sprint cycles are very short, QA has
little time to test the application. QA performs full functionality testing of the new features added for a
particular sprint as well as full regression testing for all of the previously implemented functionality. As
the development proceeds, this responsibility grows and any automation will greatly reduce the pressure.
Early in the transition to Agile, the process may include less-than-optimal practices until the root causes
can be addressed. For example, a sprint dedicated to regression testing does not reflect underlying Agile
principles; such a sprint is sometimes labeled a hardening sprint and is considered an anti-pattern.
For example, an Informit publication related a story of a company that struggled with testing everything
in a sprint because of a large amount of legacy code with no automated tests and a hardware element that
required manual testing. Until more automation could be implemented, a targeted regression testing
sprint was initiated at the end of each sprint, and another sprint was added before each release for a more
thorough regression testing session with participation by all groups. To address the lack of legacy test
automation, an entire team was assigned to automate tests. Meanwhile, the other teams were trained in
test automation techniques and tools so they could create automated tests during current sprints.
Eventually, the test suites were automated, the dedicated test team was no longer required, and the Scrum
teams were automating their own tests. As a result, the time required for regression testing was cut in
half and the hardening sprints were greatly reduced.
A presentation, “Case: Testing in Large-Scale Agile Development”, by Ismo Paukamainen, senior
specialist in test and verification at Ericsson, was given at the FiSTB Testing Assembly 2014 [PAUK]. In
this presentation, he describes and outlines Ericsson’s transformation from RUP to an Agile process.
Naturally, parts of the presentation focused on continuous integration, which provides continuous
assurance of system quality. The process appears to have been very good on functional quality, but not as
good in the areas labeled non-functional. As he states:
“Before Agile, the system test was a very late activity, having a long lead-time. It was often hard to
convince management about the needs for system tests requiring resources, human and machine for many
weeks. This was because the requirements for the product are most often for the new functionality, not for
non-functional system functions which are in a scope of system tests. The fact is that only ~5% of the
faults found after a release are in the new features, the rest are in the customer perceived quality area.”
At the conclusion of the talk, he also had five insightful takeaways about the transition:
Test competence: If not spread equally, think about other ways to support teams (e.g., in test
analysis and sprint planning). Product owners should take responsibility for checking that there are
enough tests for a sprint. A dedicated testing professional position in a cross-functional team is
recommended.
Fast feedback: In Waterfall, the aim was to do as much testing as possible at a lower
integration level, so that testing happened earlier and it was easier to find (and fix) faults close to
the designer. In Agile, the aim is to get feedback as fast as possible, which means the strategy
is no longer to run a mass of tests at the lower level, but to run tests at the level that gives the fastest
feedback. It may be that running tests in a target (production-like) environment serves
feedback better, and the lower level is needed only to verify special error cases that may not be
executable on target.
Test Automation is a must in Agile. Use common frameworks and test cases as much as
possible. Try to avoid extra maintenance work around automation (for example, Continuous
Integration).
Independent Test Teams are a good way to support cross-functional teams, especially to cover
agile testing. Making non-functional system tests the job of cross-functional teams would mean:
i) possible overlapping testing,
ii) a need for more test tools and test environments,
iii) a competence issue, and
iv) possibly too much to do within sprints.
Independent test teams need to be co-located with cross-functional teams and have good
communication with them. A sense of community!
Raise Your Organizational Awareness of the Product Quality: Monitor the system quality
(Robustness, Characteristics, Upgrade …) and make it visible through the whole organization.
The desired software is broken down into named features (requirements, stories), which are part of what it
means to deliver the desired system. For each named feature, there are one or more automated acceptance
tests which, when they pass, show that the feature in question is implemented. The running tested
features (RTF) metric shows, at every moment in the project, how many features are passing all of their
acceptance tests. Automated testing is also a factor in quality-producing environments; therefore,
measuring automated unit and acceptance test results is another important measure.
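The RTF metric can be sketched as a simple count over per-feature acceptance test results; the data layout and feature names here are illustrative assumptions:

```python
# Sketch of the running tested features (RTF) metric described above:
# a feature counts only when all of its acceptance tests currently pass.
# The dictionary layout and feature names are illustrative assumptions.

features = {
    "login":    [True, True, True],   # acceptance test results per feature
    "search":   [True, False],
    "checkout": [True, True],
}

def running_tested_features(feature_tests: dict[str, list[bool]]) -> int:
    """Count features whose acceptance tests all pass (and that have tests)."""
    return sum(1 for tests in feature_tests.values() if tests and all(tests))

print(running_tested_features(features))  # 2 of 3 features pass all tests
```

Recomputing this count on every CI run gives the at-every-moment view the metric is meant to provide.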
1.6 Product Backlog
A common Agile approach to change management is a product backlog strategy. A foundational concept
is that requirements, and defect reports, should be managed as an ordered queue called a "product
backlog." The contents of the product backlog will vary to reflect evolving requirements, with the product
owner responsible for prioritizing work on the backlog based on the business value of the work item. Just
enough work to fit into the current iteration is selected from the top of the stack by the team at the start of
each iteration as part of the iteration planning activity. This approach has several potential advantages.
First, it is simple to understand and implement. Second, because the team is working in priority order, it is
always focusing on the highest business value at the time, thereby maximizing potential return on
investment (ROI). Third, it is very easy for stakeholders to define new requirements and refocus existing
ones.
There are also potential disadvantages. The product backlog must be groomed throughout the project
lifecycle to maintain priority order, and that effort can become a significant overhead if the requirements
change rapidly. It also requires a supporting strategy to address non-functional requirements. With a
product backlog strategy, practitioners will often adopt an overly simplistic approach that focuses only on
managing functional requirements. Finally, this approach requires a product owner who is capable of
managing the backlog in a timely and efficient manner.
Section 2: Software Product Quality and its Measurement
2.1 Goal Question Metric Model
Many software metrics exist that provide information about resources, processes and products involved in
software development. The introduction of software metrics to provide quantitative information for a
successful measurement program is necessary, but it is not enough. There are other important success
factors that must be considered when selecting metrics. Foremost is that the metrics must quantify
performance achievements towards measurement goals. Basili created the Goal Question Metric (GQM)
interpretation model to assist with outlining the goals, subgoals and questions for a measurement program
[BASI]. Table 1 gives a GQM definition template employing a DevOps concept for software
development. The development process affects the nature and timing of the metrics.
Table 1: Main Goal of Software Development
Analyze: Software Development
For the purpose of: Assessing and Improving Performance
With respect to: Software Quality
From the viewpoint of: Management, Scrum Master and Development Team
In the context of: DevOps Environment
2.2 Generic Software Quality Models
The main goal of assessing and improving software development with respect to quality can be broken
down into three aspects. The sub-goals of functional, structural and process quality improvement form the
basis to derive the questions and metrics for the GQM. Dividing software quality into three sub-goals
allows us to illuminate the trade-off that exists among competing goals. In general, functional quality
reflects the software’s conformance to the functional requirements or specifications. Functional quality is
typically enforced and measured through software testing. Software structural quality refers to achieving
the non-functional requirements to support the delivery of the functional requirements, such as reliability,
efficiency, security and maintainability. Just as important as the first two sub-goals, which receive the
majority of the quality dialog, is process quality: whether the development process consistently delivers
quality software on time and within budget. Table 2 breaks down the three sub-goals into more
measurable components.
Table 2: Three Sub-Goals Broken Down into Measurable Components
Question Property
Functional Does the system deliver the business value
planned? How many user requirements were
delivered in the sprint?
enhancement rate
Does the solution do the right thing? How many
bugs were removed in the sprint?
defect removal rate
Structure How modifiable (maintainable) is the software?
How modular is the software?
How testable is the software?
Maintainability:
duplication
unit size / complexity
What is the performance efficiency of the
software?
Efficiency: time-behavior,
resource utilization, capacity
How secure is the software? Security: confidentiality, integrity,
non-repudiation, accountability,
authenticity
How usable is the software? Usability: learnability, operability,
user error protection, user interface
aesthetics, accessibility
How reliable is the software? Reliability
Process What is the capacity of the software
development process?
Velocity
What is the cycle/lead time? Cycle time/ lead time
How many bugs were fixed before delivery? Defect removal effectiveness
How can we improve the delivery of business value?
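One of the process properties in Table 2, defect removal effectiveness, can be computed as the fraction of total defects removed before delivery. The figures in the example are hypothetical:

```python
# Illustrative computation of defect removal effectiveness (DRE) from
# Table 2: the fraction of all defects that were removed before delivery.

def defect_removal_effectiveness(pre_release: int, post_release: int) -> float:
    """pre_release: defects fixed before delivery; post_release: escapes."""
    total = pre_release + post_release
    return pre_release / total if total else 1.0

# Example: 90 defects fixed before delivery, 10 escaped to the field
print(defect_removal_effectiveness(90, 10))  # 0.9 -> 90% effective
```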
Structural quality is determined through the analysis of the system, its components and source code.
Software quality measurement is about quantifying to what extent a system or software possesses
desirable characteristics. This can be performed through qualitative or quantitative means or a mix of
both. In both cases, for each desirable characteristic there is a set of measurable attributes whose
presence in a piece of software or system tends to be associated with that characteristic. Historically,
many of these attributes have been extracted from ISO 9126-3 and the subsequent ISO 25000:2005
quality model, also known as SQuaRE. Based on these models, the Consortium for IT Software Quality
(CISQ) has defined five major desirable structural characteristics needed for a piece of software to
provide business value: Reliability, Efficiency, Security, Maintainability and (adequate) Size. In Figure 5,
the five characteristics on the right, which matter to the user or owner of the business system, depend on
the measurable attributes on the left. Other quality models have been created, such as the one from
Fenton depicted in Figure 6.
To understand the professional meaning of code quality, the complete study of these concepts is required.
However, these models do not lend themselves naturally to practical development environments and we
need to explore more deeply what impacts business value.
Figure 5: Relationship Between Desirable Software Characteristics (right) and Measurable
Attributes (left) [WIKI]
Figure 6: Software Quality Model [FENT]
2.3 Comparing Traditional and Agile Metrics
Traditional software development and Agile methods actually have the same starting point. Each process
plans to develop a product of acceptable quality, applying a specific amount of effort within a certain
timeframe. The approach and processes differ, but the goal stated above is still the same. Traditional
software methods apply metrics to plan and forecast, monitor and control, and to integrate performance
improvement within the process, and Agile also requires metrics with these same capabilities. Agile
clearly differs from the traditional approach in that traditional software development metrics track a plan
through evaluating cost expenditures, whereas Agile development metrics do not track against a plan.
Agile metrics attempt to measure the value delivered or the avoidance of future costs. Another difference
between Agile and traditional is the units of measure. Table 3 is a matrix comparing the core metric units
of Agile to traditional software development [NESM].
Table 3: Comparison of Core Metrics for Agile and Traditional Development Methods
Core Metric Agile Traditional
Product (size) Features, stories Function points (FP), COSMIC
function points, use case points
Quality Defects/iteration, defects, MTTD Defects/release, defects, MTTD
Effort Story points Person months
Time Duration (months) Duration (months)
Productivity Velocity, story points/iteration Hours/FP
In Table 3, Agile uses a subjective unit, the story point, to measure effort, making comparisons between
teams, projects and organizations impossible. Traditional methods use standardized units of measure,
function points (FP) and COSMIC function points (CFP). Both FP and CFP are objective and are
recognized international standards. Several estimation and metric tools use the metric hours/FP for
benchmarking purposes. A noticeable characteristic of Agile is the absence of benchmarking metrics or
any other form of external comparison. The units of measure used for product (size) and productivity are
subjective and apply exclusively to the project and team in question. There is no way to compare
development teams or tendering contractors on productivity, so selecting a development team based on
productivity is virtually impossible [NESM].
2.4 NESMA Agile Metrics
The Netherlands Software Metrics User Association (NESMA) began in 1989 as a reaction to the
counting guidelines of the International Function Point Users Group (IFPUG) and became one of the first
FPA user organizations in the world. Their NESMA standard for functional size measurement became
an ISO/IEC standard. In 2011, the organization shifted from an FPA user group to an organization that
provides information about the applied use of software metrics: estimation, benchmarking, productivity
measurement, outsourcing and project control. NESMA conducted a search of the web for recommended
Agile metrics, a so-called state-of-the-practice survey. They divided the survey into three main areas of
interest: planning and forecasting, monitoring and control, and performance improvement. The following
three tables, Tables 4, 5 and 6, transcribe information from the website (http://nesma.org/2015/04/Agile-
metrics/).
2.4.1 Metrics for Planning and Forecasting
Table 4: Metrics for Planning and Forecasting
Metric Purpose How to measure
Number of features Insight into size of product
(and entire release). Insight
into progress.
The product comprises features that in
turn comprise stories. Features are
grouped as “to do”, “in progress” and
“accepted”.
Number of planned stories
per iteration/release
Same as number of features. The work is described in stories which are
quantified in story points.
Number of accepted stories
per iteration/release
To track progress of the
iteration/release
Formal registration of accepted stories
Team velocity See monitoring and control
LOC Indicates amount of
completed work (progress)
calculation of other metrics
i.e. defect density
According to the rules agreed upon.
2.4.2 Dividing the Work into Manageable Pieces
In order to plan and forecast, the development process requires the work to be divided into manageable
pieces. It is essential in larger organizations that these pieces are organized, scaled and of a consistent
hierarchy if they are going to be used for measurement. There are two important abstractions used to
build software: features and components. Features are system behaviors useful to the customer.
Components are distinguishable software parts that encapsulate functions needed to implement the
feature. Agile’s delivery focuses on features (stories). Large-scale systems are built out of components
that provide for the separation of concerns and improved testability, providing a base for fast system
evolution. In Agile, should the teams be organized around features, components or both? Getting it
wrong can lead to a brittle system (all feature teams) or a great design whose value arrives only in the
future (all component teams).
Previously, large-scale developments followed the component organization depicted in Figure 7. The
problem with this organization is that most new features create dependencies requiring cooperation
between teams, which creates a drag on velocity because the teams spend time discussing and
analyzing dependencies. Component organization may still be desired when one component has
higher criticality, requires rare or unique skills and technologies, or is heavily used by other components
or systems. Feature team organization, pictured in Figure 8, operates through user stories and refactoring.
Figure 7: An Agile Program Comprised of Component Teams [SCAL]
Figure 8: An Agile Program Comprised of Feature Teams
For large developments, the organization is not as clear cut. Some features are large and are split into
multiple user stories. It is overly simplistic to think of all teams as being either component- or feature-based.
To ensure the highest feature throughput, the SAFe (Scaled Agile Framework) guidelines suggest a mix
of feature and component teams, with feature teams making up the largest share at about 75-80%.
This split is dictated by the number of specialized technologies or skills required to develop the product.
Depending on the hierarchy (features or stories), the unit of work used in Table 5 may mask important
details if not counted uniformly.
2.4.3 Metrics for Monitoring and Control
Table 5: Metrics for Monitoring and Control

Iteration burn-down
  Purpose: Performance per iteration; are we on track?
  How to measure: Effort remaining (in hours) for the current iteration (effort spent/planned expresses performance).

Team velocity per iteration
  Purpose: To learn the historical velocity for a certain team. Cannot be used to compare different teams.
  How to measure: Number of realized story points per iteration within this release. Velocity is team and project-specific.

Release burn-down
  Purpose: To track progress of a release from iteration to iteration; are we on track for the entire release?
  How to measure: Number of story points "to do" after completion of an iteration within the release (extrapolation with a certain velocity shows the end date).

Release burn-up
  Purpose: How much 'product' can be delivered within the given time frame?
  How to measure: Number of story points realized after completion of an iteration.

Number of test cases per iteration
  Purpose: To identify the amount of testing effort per iteration; to track progress of testing.
  How to measure: Number of test cases per iteration recorded as sustained, failed, and to do.

Cycle time (team's capacity)
  Purpose: To determine bottlenecks of the process; the discipline with the lowest capacity is the bottleneck.
  How to measure: Number of stories that can be handled per discipline within an iteration (i.e. analysis, UI design, coding, unit test, system test).

Little's Law (cycle times are proportional to queue length)
  Purpose: Insight into duration; we can predict completion time based on queue length.
  How to measure: Work in progress (number of stories) divided by the capacity of the process step.
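As an illustration of the release burn-down extrapolation described in Table 5, the following sketch (the function name and the figures are hypothetical, not from the report's data) forecasts how many iterations remain from the outstanding story points and historical velocity:

```python
import math

def forecast_iterations(points_remaining, velocity_history):
    """Forecast the iterations left in a release by dividing the remaining
    story points by the team's average historical velocity."""
    avg_velocity = sum(velocity_history) / len(velocity_history)
    return math.ceil(points_remaining / avg_velocity)

# 120 story points still "to do"; velocity over the last four iterations
print(forecast_iterations(120, [28, 32, 30, 30]))  # 4
```

The same division underlies the burn-down chart: the extrapolation line simply projects the average velocity forward until the remaining points reach zero.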
One metric not mentioned previously is Little's Law, which states that the more items there are in the
queue, the longer the average time each item will take to travel through the system. Therefore, managing
the queue (backlog) is a powerful mechanism for reducing wait time, since long queues result in delays,
waste, unpredictable outcomes, disappointed customers and low employee morale. (See Section 1.6,
Product Backlog.) However, everyone realizes that variability exists in technology. Some companies
limit utilization to less than 100% so that a development has some flexibility, which is counterintuitive to
most models that suggest setting resources to 100% utilization. There is also the well-known observation that
work expands to fill the time allotted. To make use of the slack left by less than 100% utilization, some have instituted a Hardening
Innovation Planning (HIP) sprint to promote a new innovation or technology, find a solution to a nagging
defect or identify a fantastic new feature.
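A minimal sketch of the Little's Law entry in Table 5 (the numbers are invented for illustration): dividing the work in progress by the capacity of a process step yields the expected cycle time, which is why long queues translate directly into long waits.

```python
def littles_law_cycle_time(wip_stories, throughput_per_iteration):
    """Little's Law: average cycle time = work in progress / throughput.
    Here the result is expressed in iterations."""
    return wip_stories / throughput_per_iteration

# 24 stories in the queue; the step completes 8 stories per iteration
print(littles_law_cycle_time(24, 8))  # 3.0
```

Halving the queue halves the expected wait without any change in throughput, which is the quantitative argument for backlog management made above.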
2.4.4 Metrics for Improvement (Product Quality and Process Improvement)
Table 6: Metrics for Improvement (Product Quality and Process Improvement)

Cumulative number of defects
  Purpose: To track the effectiveness of testing.
  How to measure: Logging each defect in a defect management system.

Number of test sessions
  Purpose: To track testing effort and compare it to the cumulative number of defects.
  How to measure: Extraction of data from the defect repository.

Defect density
  Purpose: To determine the quality of software in terms of "lack of defects".
  How to measure: The cumulative number of defects divided by KLOC.

Defect distribution per origin
  Purpose: To decide where to allocate quality assurance resources.
  How to measure: By logging the origin of defects in the defect repository and extracting the data by means of an automated tool.

Defect distribution per type
  Purpose: To learn which types of defects are the most common and help avoid them in the future.
  How to measure: By logging the type of defects in the defect repository and extracting the data by means of an automated tool.

Defect cycle time
  Purpose: Insight into the time to solve a defect (speed of defect resolution).
  How to measure: The resolution date of the defect (usually the closing date in the defect repository) minus its opening date.
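The defect cycle time entry in Table 6 can be computed directly from repository dates; a small sketch (the dates are invented):

```python
from datetime import date

def defect_cycle_time(opened, resolved):
    """Days from defect detection to defect resolution
    (resolution date minus opening date)."""
    return (resolved - opened).days

# Defect opened March 2, resolved March 9
print(defect_cycle_time(date(2015, 3, 2), date(2015, 3, 9)))  # 7
```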
As seen from Tables 4, 5, and 6, Agile metrics are essentially the same metrics as those used within traditional
development, except that they use Agile units (features, stories, story points) and concepts.
2.5 Another State-of-the-Practice Survey on Agile Metrics
Another state-of-the-practice survey divided the Agile metrics into three main areas of interest: planning
and forecasting, monitoring and control, and performance improvement [GALE]. Most researchers and
consultants claim that there are four distinct areas of interest to collect for Agile development:
• Predictability
• Value
• Quality
• Team Health – can be based on an Agile maturity survey
How do the categories of predictability (see Section 1.2.2.2, Survey Results: Predictability), value, quality and
team health overlap with the NESMA categories? Predictability maps to the planning and forecasting
category, value maps to monitoring and control, and quality maps to performance improvement. It
appears that the Agile community has not come to a consensus on what team health is. However, since
development teams are the foundation of production, it does appear that the more organized and
focused the teams are, the better the outcome.
Based on three of these areas, predictability, value and quality, a list of what to measure during Agile
development was compiled. This "What to measure?" list consists of twelve categories:
1. Events that halt a release, such as continuous integration or continuous deployment stops (a quality-based metric of type outcome)
2. Number and types of corrective actions per team or across teams (quality, outcome)
3. Number of stories delivered with zero bugs
4. Number of stories reworked (value, output)
5. Percentage of technical debt addressed, with a target >30% (value or quality, outcome), including:
   • Coding standards violations
   • Code violations
   • Dead code
   • Code dependencies (coupling)
6. Velocity per team, where trending is most important. Velocity is not used to measure productivity
but to derive the duration of a set of features (predictability, output).
7. Delivery predictability per story point, average variance improving across teams (predictability,
output)
8. Release burn-down charts, displaying both story points completed and those added by iteration
9. Percentage of test automation, including UI-level, component/middle-tier and unit-level coverage,
where trending is most important (quality, output)
10. Organizational commit levels; the more that participate, the better the value (predictability,
output)
11. New test cases added per release (quality, outcome)
12. Defect cycle time; we want to reduce the time from defect detection to defect fix. This
not only improves the business experience, but reduces the code written on top of faulty code, and
ensures issues are fresher in developers' minds and faster to fix.
How does NESMA's state of the practice relate to the "What to measure?" list? The state of
the practice is a general set of metrics used in Agile environments. The "What to measure?" list builds on
more general measures, such as the number of stories, but then qualifies them by a particular attribute or
event, such as the number of stories reworked.
2.6 Metric Trends are Important
Almost every article reinforces the fact that trending is much more important than any specific data
point [FOWL]. Used as a target, a metric is only a means to communicate a goal. In most cases the target is
an arbitrary number, and excessive amounts of time are spent determining its value or working to
move toward it. When an attribute such as quality is turned into a number, the result is highly
interpretive, and any figure is relative and arbitrary. As Martin Fowler points out, there is a significant
difference between code coverage at 5% and at 95%, but what about between 94% and 95%? A target
value such as 95% informs developers when to stop, but what if that additional 1% requires a significant
amount of effort to achieve? Should the extra effort be expended, and does it bring additional value to the
product? Focusing on trends provides feedback on real data and creates an opportunity for a reaction.
Moving in either direction, a team can ask what is causing the change. A trend analysis produces actions
earlier than concentrating on an individual number. Arbitrary absolute numbers can create a feeling of
helplessness, especially when events outside of a team's control prevent progress. Trends focus on
moving in the right direction rather than being hostage to external barriers. Since trends are important,
Agile reporting should use shorter reporting periods to allow more opportunity to react and change.
With any type of Agile methodology, it is important to reinforce lean and Agile principles, such as
concentrating on working software, where numbers are not as important as trends. The project is already
collecting velocity and burn-down numbers. It is simple: the more user requirements delivered to the
customer, the greater the functional completeness (enhancement rate). The benefit that users receive
from the software increases with the degree of its functional completeness. The delivery
rate of user requirements is also considered to be the throughput of the software development process.
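One hedged way to operationalize trend-watching is to fit a least-squares slope to a short metric series and react to its sign and magnitude rather than to any single data point; a sketch (the coverage series is invented for illustration):

```python
def trend_slope(values):
    """Least-squares slope of a metric series sampled once per sprint.
    The sign indicates the direction of the trend."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

coverage = [78.0, 79.5, 80.1, 81.0, 82.4]  # code coverage % per sprint
print(trend_slope(coverage) > 0)  # True: coverage is trending upward
```

A change of sign, or a sudden change in slope, is the kind of early signal the trend-over-target argument above calls for.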
2.7 Defect Measurement
In software, the narrowest sense of product quality is commonly recognized as a lack of defects or bugs in
the product. The number of delivered defects is known to be an excellent predictor of customer satisfaction,
thus it is important to uncover trends in the defect removal processes. Using this viewpoint, or scope,
three important measures of software quality are:
• Defect potential: the number of injected defects in software systems, per size attribute.
• Defect removal efficiency: the percentage of defects removed before releasing the software to intended users.
• Delivered defects: the number of released defects in the software, per size attribute.
Defect potential refers to the total quantity of bugs or defects that will be found in five software artifacts:
requirements, design, code, documents, and “bad fixes” or secondary defects. Defect potentials vary with
application size, and they also vary with the class and type of software. Defect potentials are calibrated
through function points. Organizations with combined defect removal efficiency levels of 75% or less
can be viewed as exhibiting professional malpractice. In other words, they are below acceptable levels for
professional software organizations.
Testing alone is insufficient for optimal defect removal efficiency. Most forms of testing achieve only
about 35% defect removal efficiency (DRE), and seldom top 50%. DRE is defined as
DRE = E / (E + D)
where E is the number of errors found before delivery of the software to the end user and D is the number
of defects found after delivery. To achieve a high level of cumulative defect removal, many forms of
defect removal need to be combined. In a blog post (https://www.linkedin.com/grp/post/2191046-
105467445), Capers Jones provides an analysis in which he revisited 21 famous software problems, such as
the Therac-25 radiation overdoses, the Wall Street crash of 2008, the McAfee anti-virus bug of 2010, the
Knight Capital stock trade problem of August 2012, and others. All of these systems had been tested, yet
none of these famous software problems could have been found through testing alone. He suggests that pre-test
inspections and static analysis would have found most of them. Below are the defect removal efficiency rates for
various methods, based on commercial applications.
Measuring Defect Removal Efficiency [BLAC]

Patterns of Defect Prevention and Removal Activities

Prevention Activities
  Prototypes            20.00%

Pretest Removal
  Desk checking         15.00%
  Requirements review   25.00%
  Design review         45.00%
  Document review       20.00%
  Code inspections      50.00%
  Usability labs        25.00%
  Subtotal              89.48%

Testing Activities
  Unit test             25.00%
  New function test     30.00%
  Regression test       20.00%
  Integration test      30.00%
  Performance test      15.00%
  System test           35.00%
  Field test            50.00%
  Subtotal              91.88%

Overall Efficiency      99.32%
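The subtotals above follow from combining stage efficiencies multiplicatively: assuming defects escaping one activity flow into the next, the cumulative efficiency is one minus the product of the per-stage escape rates. The sketch below reproduces the table's figures; the `dre` function implements the E / (E + D) definition given earlier.

```python
def dre(pre_release_defects, post_release_defects):
    """Defect removal efficiency: E / (E + D)."""
    return pre_release_defects / (pre_release_defects + post_release_defects)

def cumulative_efficiency(stage_efficiencies):
    """Fraction of defects caught when stages are applied in sequence:
    1 minus the product of each stage's escape rate (1 - e)."""
    escape = 1.0
    for e in stage_efficiencies:
        escape *= (1.0 - e)
    return 1.0 - escape

pretest = [0.15, 0.25, 0.45, 0.20, 0.50, 0.25]        # desk check .. usability labs
testing = [0.25, 0.30, 0.20, 0.30, 0.15, 0.35, 0.50]  # unit test .. field test
print(round(cumulative_efficiency(pretest) * 100, 2))                     # 89.48
print(round(cumulative_efficiency(testing) * 100, 2))                     # 91.88
print(round(cumulative_efficiency([0.20] + pretest + testing) * 100, 2))  # 99.32
```

Note that the pretest subtotal excludes the prototypes (prevention) row, while the overall figure includes it; this combination rule is an inference, but it matches all three published subtotals.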
Defect tracking and its analysis have traditionally been used to measure software quality throughout the
lifecycle. However, in Agile methodologies, it has been suggested that pre-production defect tracking
may be detrimental to software teams [TECH]. Many suggest that pre-production tracking makes it
difficult to determine a true value of software quality. Pre-production defects (especially those found
by QA) are nevertheless still important to track; however, the focus of the activity should shift to
prevention. All defects should be measured by phase of origin (requirements, design, code, user
documents and bad fixes) so that their causes, and ways to improve the process, can be identified. As
previously stated, for more than 40 years customer satisfaction has had a strong correlation with the
volume of defects in applications when they are released to customers. Released defect levels are a
product of defect potentials and defect removal efficiency. Jones and Bonsignour observe that the Agile
community has not yet done a good job of measuring defect potentials, defect removal efficiency,
delivered defects or customer satisfaction [JONEa]. A development group that does not reach a defect
removal efficiency of 85% or above will not have a good customer satisfaction rating.
For defects, identify the areas in the code that have the most bugs, the length of time needed to fix bugs and the
number of bugs each team can fix during a given time span. Track the bug opened/closed ratio to
determine whether more bugs are being uncovered than in previous iterations or whether a team is falling behind. This
may indicate a need to review and fix rather than attempting to deliver a new feature. Determine the
reasons for any changes in trend. Collect post-sprint defects, QA defects, and post-release defect arrivals.
Perform a root cause analysis; for example, determine why a particular defect escaped from testing or
why a defect was injected into the code. It is especially insightful when a defect count or trend is
matched with a QA activity.
To compare teams, systems or organizations, defect density (the number of bugs per LOC or
another size metric such as function points) is used. Defect density is a recognized industry standard and a
best practice. Current defect density numbers can be compared against data retrieved from organizations
such as Gartner or the International Software Benchmarking Standards Group (ISBSG), normally for a fee.
The true defect density is not known until after the release, and for this reason Microsoft uses code churn
to predict defect density. Moreover, incident data must be filtered to isolate true defects. Incidents can be
labeled in the data store as: Change Request Agreed, Deferred, Duplicate, Existing Production Incident,
Merged with another Defect, No Longer an Issue, Not a Fault, Not in Scope of Project, Resolution
Implemented, Referred to another Project, Third Party Fix, Risk Accepted by the Business, Workaround
Accepted by the Business, or other customized exceptions. There are also problems associated with
comparing defect density against outsiders. Every tool has its own definition of size (LOC), it is easy for
projects to add more code to make the LOC metric look better, and comparisons between programming languages
are meaningless without an agreed-upon LOC equivalency table. Moreover, source code is not always
available, as with third-party applications. Therefore, benchmarking against yourself may be the most
effective approach.
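Defect density itself is a simple ratio; a sketch normalizing to KLOC for internal benchmarking (the counts are invented):

```python
def defect_density(defects, loc):
    """Defects per thousand lines of code (KLOC)."""
    return defects / (loc / 1000.0)

# 46 defects logged against a 115,000-line release
print(defect_density(46, 115_000))  # 0.4 defects per KLOC
```

The caveats above still apply: the number is only comparable across releases if the LOC counting rules and the incident-to-defect filtering stay constant.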
Figure 9 compares the Agile and Waterfall defect removal rates in Hewlett-Packard projects [SIRI]. The Agile
process has a steadier defect removal rate throughout the lifecycle, while the Waterfall process
displays a late peak followed by a gradual decline. This information was collected from two different product
releases created by the same team but using two different processes. Note that it is easier to compare and
observe defect trends in the Agile project and, therefore, easier to introduce modifications.
Defects are important to study. A Special Analysis Report on Software Defect Density from the ISBSG
reveals useful information about defects in software, both in development and in the initial period after a
system has gone into operation:
• The split of where defects are found, i.e. in development or in operation, seems to follow the
80:20 rule. Roughly 80% of defects are found during development, leaving 20% to be found in
the first weeks of the system's operation.
• Fortunately, in the case of extreme defects, less than 2.5% were found in the first weeks of
the system's operation.
• Extreme defects make up only 2% of the defects found in the Build, Test and Implement tasks of
software development.
• The industry hasn't improved over time. Software defect densities show no changing trend over
the last 15 years.
Figure 9: Agile to Waterfall Defect Removal in Hewlett-Packard Projects
Maintainability is part of every quality model. Especially in light of DevOps development, which
stresses testing in its process, maintainability is an important quality characteristic. A system will be
around for a long time, and the traditional assumption that existing systems will decay and become more
difficult and expensive to maintain should be avoided. The system must deliver new and better services
at a reasonable cost. As new features are added to the system, they must be testable. The symptoms of poor
testability are unnecessary complexity, unnecessary coupling, redundancy, and failure to relate the software
model to the physical model. When these conditions exist, they also make automated testing more
difficult. In the presence of these symptoms, tests either do not get created or have a lower probability of
being executed because of the effort and time commitment. Developers cannot be assured that the system
delivers the intended value if tests do not exist or are not executed.
The process has a prevailing influence over the quality of the code. One of the myths of Agile is that an
iterated set of user stories will emerge with a coherent design requiring at most some refactoring to factor
out commonalities. In practice, these stories do not self-organize. Experience shows that as new
functionality is added, a system tends to become more complex; hence the law of increasing
entropy. Refactoring and technical debt are inextricably linked in the Agile space. Refactoring
is a method of removing or reducing the presence of technical debt. However, not all technical debt is a
direct refactoring candidate; technical debt can stem from documentation, test cases or any deliverable.
Developers, product managers and researchers disagree about what constitutes technical debt. The
simplest definition found is that technical debt is the difference between what was promised and what was
actually delivered, including technical shortcuts made to meet delivery deadlines. An easy way to
document technical debt is to raise an issue within the project management system (e.g., Jira). The issue
can be documented with different priorities, such as those that block future functionality or hamper
implementation. If an item is small, it can be added to a sprint once the sprint's main focus has been
completed. This bookkeeping helps monitor technical debt. The process of refactoring must be
incorporated, and as stated previously, reviews are better at uncovering evolvability defects.
Technical debt has become a popular euphemism for bad code. This debt is real, and we incur it both
consciously and accidentally. Static analysis alone cannot fully calculate debt, and it may not always be
possible to pay (fix) debt in the future. Modules are built on top of the original technical debt, which
creates dependencies that eventually become too ingrained and too expensive to fix. Some researchers
suggest that there are seven deadly sins of bad code, each one representing a major axis of quality
analysis: bad distribution of complexity, duplication, lack of comments, coding rule violations,
potential bugs, no unit tests (or useless ones) and bad design. Many of these can be mitigated with the
proper techniques and tools. The SonarQube default project dashboard tracks and displays each of these
deadly sins.
Study after study has shown that poor requirements management is the leading cause of failure for traditional
software development teams. When it comes to requirements, Agile software developers typically focus
on functional ones that describe something of value to end users—a screen, report, feature, or business
rule. Most often these functional requirements are captured in the form of user stories, although use cases
or usage scenarios are also common, and more advanced teams will iteratively capture the details as
customer acceptance tests. Over the years, Agilists have developed many strategies for dealing with
functional requirements effectively, which is likely one of the factors leading to the higher success rates
enjoyed by Agile teams. Disciplined Agile teams go even further, realizing that there is far more to
requirements than just this, and that we also need to consider nonfunctional requirements and constraints.
Although Agile teams have figured out how to effectively address functional requirements, most are
struggling with nonfunctional requirements.
The definition of software quality is very diverse, as seen in Figures 5 and 6. However, it is widely
accepted that a project with many defects lacks quality. The simplest measure of assessing software
quality is the frequency of critical or blocker bugs discovered post-release, as outlined in the
Zeroturnaround survey in Section 1.2.2. The problem with this assessment is that quality
is measured after the fact. It is not acceptable to postpone the assurance of software quality until after
release and, as outlined several times, the cost of removing bugs only increases over time. Is there a direct
method to quantify quality pre-release? There is no single metric that defines good versus bad software.
Software quality can, however, be measured indirectly from attributes associated with producing quality software.
From the Zeroturnaround survey, seven key traits of high-performing DevOps cultures, those
environments that produced good-quality systems on time, were outlined. Five of the seven are
characteristics that can be adopted and measured directly: these high-quality-producing environments
deploy daily, handle non-functional requirements during every sprint, exploit automated testing, mandate
strict version control and perform peer code review. The remaining two traits of successful DevOps
teams, implementing end-to-end performance monitoring and metrics and allocating
more cycle time to the reduction of technical debt, are much more challenging. What can be measured is
code churn, static analysis findings, test failures, coverage, performance, bugs and bug arrival rates, and an
indication of size.
Any software metric can be criticized for its effectiveness, especially if one is searching for a silver
bullet. All metrics are somewhat helpful; each measures a particular attribute of the software.
Also, the manner in which they are applied may not be perfect: many practitioners use a metric for a
purpose for which it was never intended. The original purpose of the McCabe metric (also known as cyclomatic
complexity) was to measure the effort required to develop test cases for a component or module. Every piece
of code contains sections of sequence, selection and iteration, and this metric quantifies the number of
linearly independent paths through a program's source code. To perform the basis path testing proposed by
McCabe, each linearly independent path through the program must be tested; in this case, the number of
test cases will equal the cyclomatic complexity of the program. Therefore, a module with a higher
McCabe value requires more testing effort than a module with a lower value, since the higher value
indicates more pathways through the code. A higher value also implies that a module may be more
difficult for a programmer to understand, since the programmer must understand the different pathways
and the results of those pathways.
For example, a cyclomatic complexity analysis can have problems stemming from recursion and fall-through.
If one of the project goals is good performance, recursion should be avoided. Fall-through,
where one component passes control down to another such that there is no single entry/exit point, also
affects the metric. Les Hatton claimed (keynote at TAIC-PART 2008, Windsor, UK, Sept 2008) that
McCabe's cyclomatic complexity number has the same predictive ability as lines of code. The selected
threshold for cyclomatic complexity is based on categories established by the Software Engineering
Institute, as follows:

Cyclomatic Complexity    Risk Evaluation
1-10                     A simple module without much risk
11-20                    A more complex module with moderate risk
21-50                    A complex module of high risk
51 and greater           An untestable program of very high risk
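The SEI categories above can be applied mechanically; a sketch (the function names are illustrative), using the common approximation that for a single-entry, single-exit routine the cyclomatic complexity equals the number of binary decision points plus one:

```python
def cyclomatic_complexity(decision_points):
    """McCabe's V(G) for a single-entry, single-exit routine:
    number of binary decision points plus one."""
    return decision_points + 1

def sei_risk(v):
    """Map a cyclomatic complexity value to its SEI risk category."""
    if v <= 10:
        return "simple, without much risk"
    if v <= 20:
        return "more complex, moderate risk"
    if v <= 50:
        return "complex, high risk"
    return "untestable, very high risk"

# A routine with 14 decision points has V(G) = 15
print(sei_risk(cyclomatic_complexity(14)))  # more complex, moderate risk
```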
2.8 Defects and Complexity Linked
In the section “High Risk Code and High Risk Changes”, one of the bullets suggests that code that is
complex is high risk. Identifying the most complex code and monitoring it to determine the rate of
change assists developers in deciding where to focus efforts in review, testing and refactoring. Software
complexity encompasses numerous properties, all of which affect the external and internal interactions of
the software. Higher levels of complexity in software increase the risk of unintentionally interfering with
interactions and increase the chance of introducing defects, either when creating the software or when making
changes to it. Many measures of software complexity have been proposed. Perhaps the most common
is McCabe's cyclomatic complexity metric, a measure of the number of linearly independent paths
through a routine's control flow. Using cyclomatic complexity
by itself, however, can produce the wrong results, because numerous other properties
introduce complexity, not just the control flow of the software.
Another important perspective comes from understanding the change in the complexity of a system over
time. It is important to identify components that:
• cross a defined threshold of complexity and are thus candidates for attention
• suddenly change in complexity and are forecast to continue with that trend
• increase in complexity where they were not expected to, as a possible indication of poor
programming or design
It is important to manage control flow code complexity for testability. The completeness of test plans is
often measured in terms of coverage. There are several levels or dimensions of coverage to consider:
• Function, or subroutine, coverage – measures whether every function or subroutine has been
tested
• Code, or statement, coverage – measures whether every line of code has been tested
• Branch coverage – measures whether every case for a condition has been tested, i.e., tested for
both true and false
• Loop coverage – measures whether every case of loop processing has been tested, i.e., zero
iterations, one iteration, many iterations
• Path coverage – measures whether every possible combination of branch coverage has been
tested
Large programs can have huge numbers of paths through them. A program with a mere n = 20 control
point statements (IF, FOR, WHILE, CASE) can have over one million different paths through it (paths =
2^n). Removing redundant conditions and organizing necessary conditions in the simplest possible way
helps minimize control flow complexity, and thus minimizes both the probability of defects and the
required testing effort.
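The contrast between the 2^n worst case and the n + 1 linearly independent paths of basis path testing can be checked directly (a sketch; the function names are illustrative):

```python
def worst_case_paths(decision_points):
    """Upper bound on distinct paths through n sequential binary decisions."""
    return 2 ** decision_points

def basis_paths(decision_points):
    """Linearly independent paths to cover under McCabe's basis path
    testing: the cyclomatic complexity, n + 1."""
    return decision_points + 1

print(worst_case_paths(20))  # 1048576 -- over one million paths
print(basis_paths(20))       # 21 test cases suffice for basis path coverage
```

This gap is precisely why exhaustive path coverage is impractical while basis path testing remains tractable.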
Control flow is not the only aspect of concern when managing testability. Managing the data flow and its
impact on the complexity of the code implementation is also a concern. Several methods exist to measure
the use, organization or allocation of data:
• Span between data references is based on the positions of the data references and the number of
statements between each reference (the span). The larger the span, the more difficult it becomes
for the developer to determine the value of a variable at a particular point, the more likely
defects are, and the more testing is required.
• Particular data can play different roles or usages within different modules or the same module. These
roles are: input needed to produce a module's output; data changed or created within a module;
data used for control; and data passing through unchanged. Researchers have observed that each
type of data usage contributes to complexity in a different amount, with data used for control
contributing the most. By considering these data flow complexity factors when designing the
program code, the ultimate testability and quality of the program can be increased.
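As a sketch of the span measure described above (representing each statement by the set of variables it references is an assumption made for illustration):

```python
def reference_spans(statement_vars, name):
    """Spans for one variable: the number of statements between each pair of
    successive references, given per-statement referenced-variable sets."""
    refs = [i for i, vars_ in enumerate(statement_vars) if name in vars_]
    return [b - a - 1 for a, b in zip(refs, refs[1:])]

# 'total' is referenced in statements 0, 4 and 5
stmts = [{"total"}, {"x"}, {"y"}, {"x", "y"}, {"total", "x"}, {"total"}]
print(reference_spans(stmts, "total"))  # [3, 0]
```

A long span (here, the 3 statements between the first two references) is the signal that a reader must hold the variable's value in mind across unrelated code, which is where defects tend to creep in.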
After identifying these complex parts, developers can:
• Remove hard coding.
• Revisit other design aspects and determine whether the design needs to be upgraded.
• Hold a managed code walkthrough to inspect the code for defects.
• Refactor the section of code to simplify it; possibly break it into smaller, more manageable and
more testable pieces.
• Seek alternative design solutions that avoid those parts.
• Adjust the programmer resource plan to place the most reliable programmers on those
challenging programs.
• Allow additional time and resources for more extensive testing.
The "Pareto Principle", more commonly known as the "80/20" rule, is a relation that describes causality
and results. It claims that roughly 80% of output is a direct result of about 20% of the input. It is
generally known that 80% of the problems are located in 20% of the code. This phenomenon was also
observed in the ISBSG study. A question all developers would like answered is: where is
the risk? Which component or module is vulnerable or defect-prone? To assist in this quest, many
software fault prediction models have been proposed. These models use various sets of metrics,
both static and dynamic, to predict software fault-proneness. The problem is that each metric only partially
reflects the many aspects that influence software fault-proneness. Even though much effort has been
directed at this problem, none of the software fault prediction techniques has proven to be consistently
accurate [BISH]. It is known that no single metric can predict bugs, and that testing demonstrates the presence
of bugs but does not prove their absence. We have also found that enhanced (product and process) data
is helpful in prediction.
We outlined the importance of linking the metrics to the organization or project objectives to demonstrate
achievement of those objectives. There are other important factors that must also be considered during
metrics adoption. In an Agile environment, much of the decision-making is delegated to the development
teams. Development teams require software measurement information to assist them during their daily
operations. Measurement needs to be integrated into the workflow to provide this assistance and to avoid
measurement becoming a task of simply collecting data.
Capers Jones, who has been collecting software data for more than thirty years, makes this comment
about metrics [JONEe]:
“Accurate measurements of software development and maintenance costs and accurate
measurement of quality would be extremely valuable. But as of 2014, the software industry labors
under a variety of non-standard and highly inaccurate measures compounded by very sloppy
measurement practices. For that matter, there is little empirical data about the efficacy of software
standards themselves.”
Even with the metric inconsistency problems mentioned above, internally consistent metrics are important for internal benchmarks and comparisons. Metrics on size, productivity and quality are the key ones to concentrate on, and consistency is the key to analyzing their value. Metrics are like words in a sentence; together they create sense and meaning. Over-analyzing the data provided by one metric or one set of metrics (such as productivity) can be harmful to other aspects of the software, such as quality.
2.9 Performance, a Factor in Quality
Performance directly translates into utility for the end user. There are different levels of performance as
seen in the properties of Table 1, such as time-behavior, resource utilization and capacity. Ultimately,
performance is about improving user response times and reducing latency throughout the system while retaining functional accuracy and consistency. Performance can be measured through low-level code performance and benchmarking and, at a higher level, by establishing a balance between resources and requirements.
The first focus will be on Java resource utilization and system requirements. In general, there are four
main resources which are the keys to the performance of any executing system. They are the CPU
computing power, the memory (both cache and RAM), the IO/Network and the database. Database
access is separated from IO/Network resources because it can greatly affect performance and is often
responsible for IO bottlenecks. The functional performance requirements can also be placed into four
categories. They are throughput, scalability, responsiveness and latency, and resource consumption.
Throughput is normally rated as the maximum number of concurrent users the system can accommodate
at once. When the number of users grows, how the system responds to the increasing number of requests
is a measure of its scalability. Responsiveness is the length of time the system takes to first respond to a
request and latency is the time required to process the request. Not only must the system serve users; other tasks will also consume resources and influence throughput. In general, most performance problems can be described in terms of one or more of these factors. For example, if speed is the issue, latency is most likely the problem.
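As a concrete illustration of these terms, the sketch below computes responsiveness, latency and a simplified request-rate view of throughput (rather than the maximum-concurrent-users rating described above) from a handful of invented request timestamps; both the numbers and the simplification are assumptions for illustration only:

```python
# Hypothetical request records: (arrival, first_response, completion), in seconds.
requests = [(0.0, 0.05, 0.30), (0.1, 0.12, 0.45), (0.2, 0.28, 0.60),
            (0.3, 0.33, 0.90), (0.4, 0.50, 1.10)]

# Responsiveness: how long the system takes to first respond to each request.
responsiveness = [first - arr for arr, first, _ in requests]
# Latency: time required to process the request after the first response.
latency = [done - first for _, first, done in requests]
# Throughput, simplified here as completed requests per second of wall time.
window = max(done for *_, done in requests) - min(arr for arr, *_ in requests)
throughput = len(requests) / window

print(f"avg responsiveness: {sum(responsiveness) / len(responsiveness):.3f} s")
print(f"avg latency:        {sum(latency) / len(latency):.3f} s")
print(f"throughput:         {throughput:.2f} completed requests/s")
```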
Performance tuning requires effort. In a survey conducted by a tool vendor [PLUM], 76% of the respondents reported struggling most with gathering, making sense of, reproducing, and linking to the root cause the evidence required for performance analysis. Figure 10 presents a pie chart identifying
the most time-consuming part of the process [ZERO]. The survey also asked how long it took to finally
detect the root cause of a performance issue. The average time to find and fix a root cause was 80 hours.
There are three major types of performance tools to identify performance problems or to optimize
performance. They fall into tool categories that monitor, profile or test. Java profiling and monitoring
tools measure and optimize performance during runtime. Performance testing identifies areas of heavy
resource utilization. There are many Java monitoring tools or Application Performance Management
(APM) tools. The issue with monitoring is that many production environments are a complex mixture of
services very carefully balanced to work together. Plus, with applications shifting to the cloud and
dramatically different enterprise architectures, APM tools are challenged to provide real performance
benefits across systems with virtual perimeters. A blog post from profitbricks dated June 10, 2014 identifies 38 APM tools (https://blog.profitbricks.com/application-performagement-tools/). This short list contains some of the most comprehensive APM tools available; tool #5 is Compuware APM. A Zeroturnaround article, which favors New Relic APM for web applications, also lists this tool for complex applications; there it appears under the name Dynatrace, reflecting a recent name change. This APM has the largest market share according to a Gartner report [GART]. One of
Dynatrace’s selling points is that it eliminates false positives and erroneous alerts, thereby reducing the
cost of deploying and managing the application. Another tool in the same category, mentioned in both
sources, is AppDynamics. It is listed as tool #2 in profitbricks, and the basic tool is free while the pro
tool has a cost.
Figure 10: Most Time-Consuming Part of Performance Tuning
Monitoring tools that identify memory leaks, garbage collection inefficiencies and locked threads can also
be used. These are less powerful, usually work through the JVM, and are less costly. One such tool is
Plumbr, which runs as a Java agent on the JVM. It is used as an overall monitoring tool. Java Mission
Control is a performance monitoring tool by Oracle and is free. It has a nice, simple, configurable
dashboard for viewing statistics of many JVM properties.
Many of the monitoring tools assist in identifying when a performance problem exists. Engineers must
then find the cause and eliminate the issue. There are many tools or sources used for this evidence
gathering phase. Many engineers use the application log or heap and thread dumps as evidence. JVM tools such as jconsole, jmc, jstat and jmap can be used. On average, an engineer uses no fewer than four different tools to gather enough evidence to solve a performance problem. Other specialized tools, such as those offered by JClarity (Illuminate and Censum), can be used to identify the problem. Illuminate is a
performance monitoring tool, while Censum is an application focused on garbage collection logs
analysis. Takipi was created for error tracking and analysis, informing developers exactly when
and why production code breaks. Whenever a new exception is thrown or a log error occurs, the Takipi
software captures the exception and shows the variable state which caused it, across methods and
machines. Takipi will lay this information over the actual code which executed at the moment of the error
so developers can analyze the exception as if they were there when it happened.
Code profilers gather information about low-level code events to focus on performance questions. Many
of them use instrumentation to extract this information. A popular and frequently mentioned profiler is YourKit, one of the most established leaders in Java profiling. Another profiler is JProfiler, which can display a call graph in which methods are represented by colored rectangles, providing instant visual feedback about where slow code resides.
Application monitoring tools point out problems in performance, profilers provide low-level insight and highlight individual parts, and performance testing tools tell us whether a new solution is better than the previous one. Apache JMeter is an open source Java application for load testing functional behavior and measuring performance. The Zeroturnaround article [ZERO] includes a section labeled “Performance Issues in action” in which the reader is led through an application (Atlassian Confluence) using these performance tools. It uses JMeter to create and gather profile data, YourKit to display and analyze the profile data, and XRebel to diagnose other, mostly HTTP, performance issues. It is an excellent exercise for
those not familiar with these types of tools.
Teams are constantly delivering code. SonarQube can be used to analyze the frequency of changes, the
size of changes, and to correlate this information with error data to assist in understanding whether the
code is being changed too much or too quickly to be safe. Another metric that measures change is code
churn. Code churn is a measure of the amount of code change taking place within a software unit over
time. It is easily extracted from a system’s change history, as recorded automatically by a version control
system. Code churn has been used to predict fault potential where large and/or recent changes contribute
the most to fault potential. Microsoft uses code churn as an early prediction of system defect density
using a set of relative code churn measures that relate the amount of churn to other variables such as
component size and the temporal extent of churn [NAGA].
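A minimal sketch of one relative churn measure, in the spirit of [NAGA], is shown below; the single ratio and the change-history numbers are illustrative assumptions, not Microsoft's actual (richer) measure set:

```python
def relative_churn(added, deleted, modified, component_loc):
    """Churned lines normalized by component size: one simple relative measure."""
    return (added + deleted + modified) / component_loc if component_loc else 0.0

# Hypothetical totals extracted from a version control change history:
# component -> (lines added, lines deleted, lines modified, component LOC)
history = {"billing": (400, 250, 300, 5000),
           "gateway": (40, 10, 25, 8000),
           "ui":      (900, 600, 700, 4000)}

# Rank components by relative churn; high churn suggests high fault potential.
ranked = sorted(history, key=lambda c: relative_churn(*history[c]), reverse=True)
for name in ranked:
    print(f"{name}: relative churn = {relative_churn(*history[name]):.3f}")
```

With these invented figures, "ui" ranks first: it has churned more than half its own size.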
2.10 Security, a Factor in Quality
If you think Java is relatively safe, just think about the yearly security reports beginning in 2010. Java
became the main vehicle for malware attacks in the third quarter of 2010, when the attacks increased 14-
fold, according to Microsoft's Security Intelligence Report Volume 10 [MICR]. In 2012, Kaspersky Lab,
a leading anti-virus company, labeled it the year of Java vulnerabilities. Kaspersky reported that in 2012,
Java security holes were responsible for 50% of attacks while Windows components and Internet
Explorer were only exploited in 3% of the recorded incidents [KASP]. Cisco's 2014 Annual Security
Report puts the blame on Oracle's Java as a leading cause of security woes and reported that Java
represented 91% of all indicators of compromise in 2013 [CISC]. Perhaps the main reason Java is such a
target is the same reason why it is popular with enterprises and developers; namely, it is portable and
works on any operating system. Moreover, patching a large Java application, such as the JRE, is difficult
and there is the possibility that the patch could break the functionality within the application.
Why focus on application security? Estimates from reliable sources report that anywhere from 70% to
90% of security incidents are due to application vulnerabilities. Moreover, only security built into the application itself stands a chance of preventing sophisticated attacks.
A report from the SANS Institute “2015 State of Application Security: Closing the Gap” can provide a
current general view of application software security [SANSa]. The report was driven by a survey given
to 435 qualified respondents answering questions about application security and its practices.
Respondents were divided into builders and defenders, with 35% being builders and 65% defenders. The
most interesting and important of the questions focused on security standards, the shift of security
responsibilities within development, the list of current practices, the risk of third party applications and
the rate of repairs using secure development life-cycle practices. These topics will be discussed in
Sections 2.10.1 through 2.10.5.
2.10.1 Security Standards
Many security standards and requirements frameworks have been developed in attempts to address risks
to enterprise systems and the resident critical data. A survey question asked the participants to select the
security standards or models followed by their organization. Ten standards or guidelines were explicitly
listed as seen in Figure 11. The Open Web Application Security Project (OWASP) Top 10, a community-driven, consensus-based list of the top 10 application security risks with versions for web and mobile applications, is by far the leading application security standard among the builders who participated in the survey [OWAS].
Figure 11: Application Security Standards in Use
The survey report provided a few reasons for the overwhelming reliance on the OWASP Top 10. First of
all, the OWASP Top 10 is the shortest and simplest of the software security guidelines to understand
since there are only 10 different areas of concern. Also, most static analysis and dynamic analysis security
tools report vulnerabilities in OWASP Top 10 risk categories, making it easy to demonstrate compliance.
The OWASP Top 10, like the MITRE/SANS Top 25 [MITR], is also referenced in several regulatory
standards. After the OWASP Top 10, much more comprehensive standards, such as ISO/IEC 27034 and
NIST 800-53/64, often required in government work, are used as security guidelines. Fewer institutions
use the more general coding guidelines and process frameworks such as CERT Secure Coding Standards,
Microsoft’s SDL and BSIMM (Building Security In Maturity Model).
The problem with standards and guidelines is that much of the effort has essentially become an exercise in reporting on compliance, which diverts security program resources from the constantly evolving attacks that must be addressed. The National Security Agency (NSA) recognized this problem
and began an effort to prioritize a list of the controls that would have the greatest impact in improving risk
posture against real-world threats. The SANS Institute coordinated the input and formulated the Critical
Security Controls for Effective Cyber Defense [SANSb]. This compilation contains much valuable information, with a strong emphasis on "What Works": security controls for which products, processes, architectures and services in use have demonstrated real-world effectiveness. Section 6
of the Critical Security Controls for Effective Cyber Defense report is directly on Application Software
Security (CSC 6). There are eleven suggestions to implement in CSC 6.
Many of the SANS’ survey respondents (47%) indicated that their application security program needed
improvement. Some organizations rated themselves as above average; however, this may be because the recent slew of security breaches did not directly impact them, perhaps giving them a false sense of confidence.
2.10.2 Shift of Security Responsibilities within Development
The majority (59%) of builder respondents followed lightweight Agile or Lean methods (mostly Scrum),
14% still used the Waterfall method and a smaller percentage followed more structured development
approaches. More of the survey organizations are considering adopting DevOps and SecDevOps practices
and approaches to share the responsibilities for making systems secure and functional among builders,
operations and defenders. These methods are viewed as a radically different way of thinking about and
doing application security. Currently, most organizations pursue secure code externally, through pen testing and compliance reviews. Many concur that defenders need to work collaboratively with builders
and operations teams to embed iterative security checks throughout software design, development and
deployment. The main takeaway is that “builders, defenders and operations should be sharing tools and
ideas as well as responsibility for building and running systems, while ensuring the availability,
performance and security of these systems” [SANSa]. Application security should be everyone’s duty.
Eighty-four percent of successful security breaches have been accomplished through application software
vulnerabilities [PUTM].
Within the development cycle, Agile and application security appear to possess conflicting goals.
Developers work diligently to provide value and meet release deadlines, while security is concerned with
the potential exposure and negative impact that applications can generate for the business and their
customers. Agile developers adopt changes as part of their development environment culture, but even
adding a checkpoint for security could be perceived as an impediment to productivity, especially during a
tight schedule period. Many application builders are also unaware of inherent security issues in their
code. Merely mandating that code be scanned for vulnerabilities and the issues fixed will not create a culture that contributes to secure application development. Developers also need to be educated in best practices for producing secure code. Everyone must contribute to the fine-tuning of the process, determining the
best points to perform security reviews and code scanning functions. “Getting security right means being
involved in the software development process” [MCGR].
The Closing the Gap report [SANSa] listed four important areas to include throughout the development
lifecycle for effective application security. These are:
• “Design and build. Consider compliance and privacy requirements; design security features;
develop use cases and abuse cases; complete attack surface analysis; conduct threat modeling;
follow secure coding standards; use secure libraries and use the security features of application
frameworks and languages.
• Test. Use dynamic analysis, static analysis, interactive application security testing (IAST), fuzzing, code reviews, pen testing, bug bounty programs and secure component life-cycle management.
• Fix. Conduct vulnerability remediation, root cause analysis, web application firewalls (WAF)
and virtual patching and runtime application self-protection (RASP).
• Govern. Insist on oversight and risk management; secure SDLC practices, metrics and
reporting; vulnerability management; secure coding training; and managing third-party software
risk.”
No indication was provided of how to incorporate these into the various development lifecycles. In the Waterfall methodology, there is a one-to-one mapping; in Agile, however, these activities must be adapted to be iterative and incremental.
2.10.3 List of Current Practices
In this section, the focus is on the list of current practices compiled from the builders’ responses (Table
7). Risk assessment is the leading practice for all types of applications except for web applications.
Penetration testing is the second leading practice for internal apps. Currently, applications are the biggest source of data breaches; NIST claims that 90% of security vulnerabilities exist at the application layer. Risk assessment, or analyzing how hackers might conduct attacks, can give developers a better idea of specific defenses. To ensure that an application has no weak points, penetration testing is used.
The article “37 Powerful Penetration Testing Tools for Every Penetration Tester” is a good resource in
identifying the scope and features of current penetration tools [SECU]. Practicing secure coding
techniques is another method to keep applications from getting hacked. The SANS Software Security has
a course designed specifically for Java, DEV541: Secure Coding in Java/JEE: Developing Defensible
Applications, https://software-security.sans.org/course/secure-coding-java-jee-developing-defensible-
applications.
Table 7: Builders’ Application Security Practices
2.10.4 Risk of Third Party Applications
The survey reports that 79% of the builder respondents use open source or third-party software libraries in
their applications. This agrees with a 2012 CIO report that over 80% of typical software applications are
open source components and frameworks consumed in binary form [OLAV]. The CIO report also details
that many organizations regularly download software components and frameworks with known security
vulnerabilities, even if newer, patched versions of the components or frameworks were available. Many
of these contain such well-known vulnerabilities as HeartBleed, ShellShock, POODLE and FREAK. A
thorough assessment must be made when using or procuring applications.
2.10.5 Rate of Repairs
In the survey, 26% of defenders took two to seven days to deploy patches to critical applications in use, while another 22% took eight to thirty days, and 14% needed thirty-one days to three months to deploy patches satisfactorily (Figure 12). Serious security vulnerabilities are important to repair as quickly as
possible. Observing the survey responses, it appears that most respondents need assistance in this effort.
Figure 12: Time to Deploy Patches
Developers require fundamental software security knowledge to understand the vulnerability and fix the
code properly, test for regressions, and build and deploy the fix quickly. Perhaps more importantly, the vulnerability must be analyzed for its root cause, and that cause must be addressed to break a vicious and likely dangerous cycle.
2.10.6 Other Code Security Strategies
There are other useful strategies to assist in developing secure code. Michael Howard provided lessons learned from five years of building more secure code. For security reviews, Microsoft ranks modules (code) by their potential for vulnerabilities and by age [HOWE]. As the code base becomes larger, analysis tools are required, and they can help determine the amount of review and testing to provide. For example, if analyzing one component produces 10 warnings while analyzing a component of similar size yields 100 warnings, the second component is in greater need of review. The output of the analysis can be used to determine overall code riskiness. Microsoft and many other
companies apply tools at check-in time to catch bugs early and run them at fairly frequent intervals to deal with any new issues quickly. They have learned that executing the tools only every few months leads to developers having to deal with hundreds of warnings at a time. For every security vulnerability identified, a root cause analysis is performed. The analyst also determines why an actual issue was not discovered by tools. There are three possible reasons: the tool did not find the vulnerability, the tool found it but mistakenly triaged the issue as low priority, or the tool found the issue and humans ignored it. Such an analysis allows the fine-tuning of tools and their use. A great deal of manual work is involved in security assessment, so automation should be pursued where possible. Build or buy
tools that scan code and upload the results to a central site for analysis by security experts. Some tools can even combine the outputs of different analysis tools.
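The warning-count heuristic above can be sketched as a simple ranking by warning density; the component names and figures below are hypothetical:

```python
def warning_density(warnings, kloc):
    """Static-analysis warnings per KLOC; higher density suggests reviewing sooner."""
    return warnings / kloc if kloc else float("inf")

# Hypothetical components: (warning count, size in KLOC).
components = {"comp_a": (10, 12.0), "comp_b": (100, 12.5), "comp_c": (35, 40.0)}

# Components with similar size but many more warnings rise to the top.
review_order = sorted(components,
                      key=lambda name: warning_density(*components[name]),
                      reverse=True)
print("suggested review order:", review_order)
```

Here "comp_b", similar in size to "comp_a" but with ten times the warnings, is reviewed first, matching the Microsoft example above.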
2.10.7 Design Vulnerabilities
Many security vulnerabilities are not coding issues at all but design issues; therefore, Microsoft mandates threat modeling and attack surface analysis as part of its Security Development Lifecycle (SDL) process. Part of the lessons learned is that "It's essential to build threat models to uncover potential
design weaknesses and determine your software's attack surface. You need to make sure that all material
threats are mitigated and that the attack surface is as small as possible.” Microsoft continues to review for
features in its products that are not secure enough for the current computer environment and deprecates
those deemed to be insecure.
Design-level vulnerabilities are the hardest defect category to handle. Design-level problems accounted for about 50% of the security flaws uncovered during Microsoft's "security push" in 2002 [HOGL].
Unfortunately, ascertaining whether a program has design-level vulnerabilities requires great expertise,
which makes finding such flaws not only difficult, but particularly hard to automate. Examples of design-
level problems include error handling in object-oriented systems, object sharing and trust issues,
unprotected data channels both internal and external, incorrect or missing access control mechanisms,
lack of auditing/logging or incorrect logging, and ordering and timing errors (especially in multithreaded
systems). These sorts of flaws almost always lead to security risk.
Security issues that are not syntactic or code-related (such as business logic flaws) cannot be detected in code and need to be identified through threat modeling and abuse case modeling. Software security practitioners perform many different tasks to manage software security risks, including
• creating security abuse/misuse cases;
• listing normative security requirements;
• performing architectural risk analysis;
• building risk-based security test plans;
• using static analysis tools;
• performing security tests;
• performing penetration testing in the final environment; and
• cleaning up after security breaches.
Three of these tasks are closely linked: architectural risk analysis, risk-based security test planning, and security testing, because a critical aspect of security testing relies on probing security risks. If we hope to secure a system, it is important to also work on the architectural or design risk. Over the last few years, much progress has been made in static analysis and code scanning tools; the same cannot be said of architectural risk analysis. There are some good process frameworks, such as Microsoft's STRIDE model. However, to obtain the kinds of results expected, these models require specialists and are difficult to translate into widespread practice. To assist in developing secure software during the design phase, the
SwA Forum and Working Groups developed a pocket guide [SOFT] which includes the following topics:
• Basic Concepts
• Misuse Cases and Threat Modeling
• Design Principles for Secure Software
• Secure Design Patterns
o Architectural-level Patterns
o Design-level Patterns
• Multiple Independent Levels of Security and Safety (MILS)
• Secure Session Management
• Design and Architectural Considerations for Mobile Applications
• Formal Methods and Architectural Design
• Design Review and Verification
• Key Architecture and Design Practices for Mitigating Exploitable Software Weaknesses
• Questions to Ask Developers
The above activities combined with secure coding techniques will enable more secure and reliable
software.
Section 3: Assessment of Development Methods and Project Data
In this report, five resources were used to provide comparisons/assessments of the development
methodology and/or project data. These are discussed in sections 3.1 to 3.5.
3.1 Namcook Analytics Software Risk Master (SRM) tool. (Estimation report is attached to this
report in Appendix A.)
3.2 A table from a Crosstalk article by Capers Jones modified for NPAC.
3.3 Scoring method of factors in software development in Software Engineering Best Practices,
Capers Jones (also in the excel file as a separate spreadsheet)
3.4 DevOps self-assessment by IBM (See assessment results in Appendix B.)
3.5 The list of “Thirty Software Engineering Issues that have stayed constant for 30 years”
Additional information on software quality is contained in section 1.5 of this report.
3.1 The Namcook Analytics Software Risk Master (SRM) tool
The Namcook Analytics Software Risk Master (SRM) tool predicts requirements size in terms of pages,
words and diagrams. It also predicts requirements bugs or defects and “toxic requirements” which should
not be included in the application. A toxic requirement is one that causes harm later in development
and/or maintenance. Reproduced below from the Namcook website are samples of typical SRM
predictions for software projects of 1,000 function points (Tables 8 and 9) and 10,000 function points (Tables 10 and 11).
Table 8: Metrics for Projects with 1000 Function Points
Requirements creep to reach 1,000 function points = 149
Requirements pages = 275
Requirements words = 110,088
Requirements diagrams = 180
Requirements completeness = 91.44%
Requirements reuse = 25%
Requirements bugs = 169
Toxic requirements = 4
Requirements test cases = 667
Reading days (1 person) = 4.59
Amount one person can understand = 93.27%
Financial risks from delays, overruns = 22.33%
Table 9: 1,000 Function Points Requirements Methods
Interviews
Joint Application Design (JAD)
Embedded users
UML diagrams
Nassi-Schneidewind diagrams
FOG or FLESCH readability scoring
IBM Doors or equivalent
Requirement inspections
Agile
Iterative
Rational Unified Process (RUP)
Team Software Process (TSP)
Table 10: Metrics for Projects with 10,000 Function Points
Requirements creep to reach 10,000 function points = 2,031
Requirements pages = 2,126
Requirements words = 850,306
Requirements diagrams = 1,200
Requirements completeness = 73.91%
Requirements reuse = 17%
Requirements bugs = 1,127
Toxic requirements = 29
Requirements test cases = 5,472
Reading days (1 person) = 35.43
Amount one person can understand = 12.08%
Financial risks from delays, overruns = 42.50%
Table 11: 10,000 Function Points Requirements Methods
Focus groups
Joint Application Design (JAD)
Quality Function Deployment (QFD)
UML diagrams
State change diagrams
Flow diagrams
Nassi-Schneidewind charts
Dynamic, animated 3D requirements models
FOG or FLESCH readability scoring
IBM Doors or equivalent
Text static analysis
Requirements inspections
Automated requirements models
Rational Unified Process (RUP)
Team Software Process (TSP)
In general, “greenfield requirements” for novel applications are more troublesome than “brownfield” requirements, which are frequently replacements for aging legacy applications whose requirements are well known and understood. In total, requirements bugs or defects account for approximately 20% of the bugs
entering the final released application. “Requirements bugs are resistant to testing and the optimal
methods for reducing them include requirements defect prevention and pre-test requirements inspections.
The use of automated requirements models is recommended. The use of automated requirements static
analysis tools is recommended. The use of tools that evaluate readability such as the FOG and FLESCH
readability scores is recommended.” The last quote and the tables were from [JONEc].
Dolores Zage registered as a user on the Namcook.com site and was able to use the SRM demo
application on the website. Figure 13 below is a listing of the input that was selected to produce the
estimation report. Average settings were used for project staffing details, which are an even mix of
experts and novices. For the project scope, standalone PC had to be selected because other settings
caused a PHP error. Interestingly, no size factor was requested as input. With the given knowledge of the project and the limitations of the application, the SRM tool calculated that NPAC would be 465.06 function points, or about 53,330 LOC.
SOFTWARE TAXONOMY AND PROCESS ASSESSMENT REPORT
General Information:
Today's date - 08/18/2015
Industry or NAIC Code - telecommunications
Company - BSU
Country - IN, USA
Project Start Date - 18-AUG-2015
Planned Delivery Date - Unknown
Actual Delivery Date - Unknown
Project Name - numbers
Data Provided By - Dolores
Project Manager - NA
Project Staffing Details:
Project Staffing Schedule - Normal staff; normal schedule
Client Project Experience - Average experienced clients
Project Management Experience - Average experienced management
Development Team Experience - Even mix of experts and novices
Methodology Experience - Even mix of experts and novices
Programming Language Experience - Even mix of experts and novices
Hardware Platform Experience - Even mix of experts and novices
Operating System Experience - Even mix of experts and novices
Test Team Experience - Even mix of experts and novices
Quality Assurance Team Experience - Even mix of experts and novices
Customer Support Team Experience - Even mix of experts and novices
Maintenance Team Experience - Even mix of experts and novices
Project Taxonomy Input:
Project Nature - New software application development
Work Hours per month - 132
Project Scope - Standalone program: PC
Project Class - External program developed under government contract (civilian)
Primary Project Type - Communications or telecommunications
Secondary Project Type - Service oriented architecture (SOA)
Problem Complexity - Majority of avg, a few complex problems, algorithms - 7
Code Complexity - Fair structure with some large modules - 7
Data Complexity - More than 20 files, large complex data interactions - 11
Development Methodology - Agile, Internally Created
Development Methodology Value - 10
Current CMMI Level - Level 4: Managed
Primary Programming Language - Java - 6.00 - 90%
Secondary Programming Language - SQL - 25.00 - 10%
Effective Programming Level - 7.9
Number of maintenance sites - 1
Number of initial client sites - 80
Annual growth of client sites - 15
Number of application users - 1000
Annual growth of application users - 10
Testing Methodologies:
Defect Prevention - QFD; Pretest Removal - Desk Check; Static Analysis; Inspections; Test Removal - Unit; Function; Regression; Component; Performance; System; Acceptance;
Figure 13: SRM Tool Settings
The entire estimation report appears in Appendix A. Based on the pretest removal and test removal
selections made in the tool, the defect removal efficiency was 99.4%, as reported in the estimation report.
3.2 Crosstalk Table
The data for Table 12 come from approximately 600 companies, of which about 150 are Fortune 500
companies. The table divides projects into excellent, average and poor categories. All of the projects
were 1000 function points in size and coded in Java. These data can be extrapolated for comparisons with
NPAC data. For convenience, the data were transferred to an Excel spreadsheet into which the NPAC
data can be inserted. The closer the NPAC data align with the excellent category, the higher the probability
that NPAC can be rated as excellent.
Note: The numbers in parentheses within the table cells are explained in the notes below Table 12.
Table 12: Comparisons of Excellent, Average and Poor Software Results
Topics Excellent Average Poor NPAC
Project Info
Size in function points 1000 1000 1000 (1)
Programming Language Java Java Java Java
Language level 6.25 6.0 5.75 (2)
Source statements per function point 51.2 53.33 55.75 (3)
Certified reuse percent 20% 10% 5% (4)
Quality
Defects per function point 2.82 3.47 4.27 4.95 (5)
Defect Potential 2818 3467 4266 (6)
Defects per KLOC 55.05 65.01 76.65
Defect Removal Efficiency 99% 90% 83%
Delivered Defects 28 347 725
High Severity Defects 4 59 145
Security Vulnerabilities 2 31 88
Delivered per function point .03 0.35 0.73
Delivered per KLOC .55 6.5 13.03
Key Quality Control Methods
Formal estimates of defects YES NO NO
Formal inspections of deliverables YES NO NO
Static Analysis of Code YES YES NO
Formal Test Case Design YES YES NO
Testing by certified test personnel YES NO NO
Mathematical test case design YES NO NO
Project Parameter Results
Schedule in calendar months 12.02 13.8 18.2
Technical staff + management 6.25 6.67 7.69
Effort in staff months 75.14 92.03 139.96
Effort in staff hours 9919 12147 18477
Cost in dollars $751,415 $920,256 $1,399,770
Cost per function point $751.42 $920.26 $1,399.77
Cost per KLOC $14,676 $17,255 $25,152
Productivity Rates
Function points per staff month 13.31 10.87 7.14
Work hours per function point 9.92 12.15 18.48
Lines of code per staff month 681 580 398
Cost Drivers
Bug repairs 25% 40% 45%
Paper documents 20% 17% 20%
Code Development 35% 18% 13%
Meetings 8% 13% 10%
Management 12% 12% 12%
Total 100% 100% 100%
Methods, Tools, Practices
Development Methods TSP/PSP (7) Agile Waterfall
Requirements Methods JAD Embedded Interview
CMMI Levels 5 3 1
Work hours per month 132 132 132
Unpaid overtime 0 0 0
Team experience experienced average inexperienced
Formal risk analysis YES YES NO
Formal quality analysis YES NO NO
Formal change control YES YES NO
Formal sizing of project YES YES NO
Formal reuse analysis YES NO NO
Parametric estimation tools YES NO NO
Inspections of key materials YES NO NO
Accurate status reporting YES YES NO
Accurate defect tracking YES NO NO
More than 15% certified reuse YES MAYBE NO
Low cyclomatic complexity YES MAYBE NO
Test coverage > 95% YES MAYBE NO
Notes on cell contents
1. Function point count (FP)
2. Language level – Capers Jones' language level metric for the programming language; a higher
level means fewer source statements per function point (approximately 320 / level)
3. Source statements (logical source code) per function point
4. Certified reuse percentage
5. See Section 3.2.1, Ranges of Software Development Quality
6. Defect potential – using the commercial application type data: 4.95 defects per function point * FP
7. PSP (Personal Software Process) provides a standard personal process structure for
software developers. TSP (Team Software Process) is a guideline for software product
development teams; it focuses on helping development teams improve their quality
and productivity to better meet cost and schedule goals. (Both are due to Watts Humphrey
and are precursors of DevOps.)
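Several of the Table 12 cells can be derived from one another, which makes the table easy to sanity-check. The following sketch (written for this report; the 320/language-level rule is Capers Jones' convention, and all input numbers are taken from the "Excellent" column of Table 12) recomputes a few derived cells:

```python
FP = 1000           # project size in function points (all Table 12 columns)
WORK_HOURS = 132    # work hours per month (Table 12)

def statements_per_fp(language_level):
    """Jones' convention: source statements per FP ~= 320 / language level."""
    return 320 / language_level

def delivered_defects(defect_potential, dre):
    """Delivered defects = defect potential * (1 - defect removal efficiency)."""
    return defect_potential * (1 - dre)

# Cross-checks against the "Excellent" column of Table 12:
print(round(statements_per_fp(6.25), 1))      # 51.2 source statements per FP
print(round(delivered_defects(2818, 0.99)))   # 28 delivered defects
print(round(FP / 75.14, 2))                   # 13.31 function points per staff month
print(round(75.14 * WORK_HOURS))              # ~9918 staff hours (table: 9919, rounding)
```

The same relations reproduce the "Average" column as well; for example, a defect potential of 3467 at 90% DRE gives 347 delivered defects.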
3.2.1 Ranges of Software Development Quality
Given the size and economic importance of software, one might think that every industrialized nation
would have accurate data on software productivity, quality, and demographics. No such data seem to
exist: there are no effective national averages for any software topic. A national repository of
software quality data would be very useful for comparison, but it does not exist. One reason is that
quality data are more difficult to collect than productivity data. Many individual development tasks
focus on identifying defects, including defects found in requirements and defects identified by static
analysis, desk checking, and testing, and these counts are not always included in the quality data.
Currently, the best data on software productivity and quality tend to come from companies that build
commercial estimation tools and companies that provide commercial benchmark services, all of which
are fairly small. The combined 2014 data from software benchmark groups such as Galorath, ISBSG,
Namcook Analytics, Price Systems, Q/P Management Group, Quantimetrics, QSM, Reifer Associates
and Software Productivity Research cover about 80,000 projects. However, these are competing
companies, and with a few exceptions, such as the recent joint study by ISBSG, Namcook, and Reifer,
the data are not shared or compared and are not always consistent.
Tables 13 and 14 present quality data compiled by Namcook from about 20,000 projects; the values are
approximate averages for software quality, aggregated by application size and by application type.
Table 13: Quality Data Based on Project Size in Function Points
Size        Defect Potential   Removal Efficiency   Defects Delivered
1 1.50 96.93% 0.05
10 2.50 97.50% 0.06
100 3.00 96.65% 0.10
1000 4.30 91.00% 0.39
10000 5.25 87.00% 0.68
100000 6.75 85.70% 0.97
Average 3.88 92.46% 0.37
Table 14: Quality Data Based on Project Type
Type                 Defect Potential   Removal Efficiency   Defects Delivered
Domestic outsource 4.32 94.50% 0.24
IT projects 4.62 92.25% 0.36
Web projects 4.64 91.30% 0.40
Systems/embedded 4.79 98.30% 0.08
Commercial 4.95 93.50% 0.32
Government 5.21 88.70% 0.59
Military 5.45 98.65% 0.07
Average 4.94 93.78% 0.30
The values in Tables 13 and 14 vary by both application size and application type. Many suggest that for
national average purposes the values shown by type are more meaningful than those by size, since very
few applications are larger than 10,000 function points, and these large sizes distort the averages. Viewed
across industries, the 2014 defect potential averages about 4.94 per function point, defect removal
efficiency averages about 93.78%, and delivered defects average about 0.3 per function point. Defect
potentials range from about 1.25 to about 7.50 per function point, and defect removal efficiency ranges
from a high of 99.65% to a low of below 77.00%.
3.3 Scoring Method of Methods, Tools and Practices in Software Development
Software development and software project management offer dozens of methods and hundreds of tools
and practices. Which ones should be used? One approach is to evaluate and rank the many different
factors on a scale. The Excel file containing Table 12 includes another worksheet listing the various
methods and practices scored on a scale from +10 to -10. A score of +10 means a factor is very beneficial
to the quality and productivity of a project; a score of -10 means it is very detrimental. An average value
is given based on the size and type of project. The scoring data stem from observations of about 150
Fortune 500 companies, 50 smaller companies, and 30 government organizations; the negative scores also
draw on data from 15 lawsuits. The actual values are less important than the distribution into the various
categories. Using this method, one can display the range of impact of using the various methods, tools
and practices together.
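As a sketch of how such a scoring sheet can be applied (the factor names and scores below are hypothetical illustrations, not values from the actual spreadsheet), the combined impact of a chosen set of practices is simply the sum of their scores:

```python
# Hypothetical factor scores on the +10 (very beneficial) to -10 (very
# detrimental) scale described in the text; not values from the spreadsheet.
scores = {
    "formal inspections": 9,
    "static analysis": 8,
    "parametric estimation": 7,
    "inadequate defect tracking": -8,
}

chosen = ["formal inspections", "static analysis", "inadequate defect tracking"]
net_impact = sum(scores[factor] for factor in chosen)
print(net_impact)  # 9: the beneficial practices outweigh the detrimental one
```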
3.4 DevOps Self-Assessment by IBM
A self-assessment of DevOps practices was also performed using IBM's online DevOps self-assessment
(http://www.surveygizmo.com/s3/1659087/IBM-DevOps-Self-Assessment). A copy of the questions, the
answers provided and the results are in the file DevOps+Self+Assessment+Results.pdf.
Based on the answers to the assessment questions, each DevOps practice is rated as scaled, reliable,
consistent or practiced (as seen in Figure 14) in the five areas of Design, Construct, Configuration
Management, Build, and Test and Assess Quality.
Figure 14: Levels of Achievement in DevOps Practices
3.5 Thirty Software Engineering Issues that Have Stayed Constant for Thirty Years
In [JONEb], we find the following persistent issues in software engineering:
1. Initial requirements are seldom more than 50% complete.
2. Requirements grow at about 2% per calendar month during development.
3. About 20% of initial requirements are delayed until a second release.
4. Finding and fixing bugs is the most expensive software activity.
5. Creating paper documents is the second most expensive software activity.
6. Coding is the third most expensive software activity.
7. Meetings and discussions are the fourth most expensive activity.
8. Most forms of testing are less than 35% efficient in finding bugs.
9. Most forms of testing touch less than 50% of the code being tested.
10. There are more defects in requirements and design than in source code.
11. There are more defects in test cases than in the software itself.
12. Defects in requirements, design, and code average 5.0 per function point.
13. Total defect removal efficiency before release averages only about 85%.
14. About 15% of software defects are delivered to customers.
15. Delivered defects are expensive and cause customer dissatisfaction and technical debt.
16. About 5% of modules in applications will contain 50% of all defects.
17. About 7% of all defect repairs will accidentally inject new defects.
18. Software reuse is only effective for materials that approach zero defects.
19. About 5% of software outsource contracts end up in litigation.
20. About 35% of projects > 10,000 function points will be cancelled.
21. About 50% of projects > 10,000 function points will be one year late.
22. The failure mode for most cost estimates is to be excessively optimistic.
23. Productivity rates in the U.S. are about 10 function points per staff month.
24. Assignment scopes for development are about 150 function points.
25. Assignment scopes for maintenance are about 750 function points.
26. Development costs about $1200 per function point in the U.S. (range < $500 to > $3000).
27. Maintenance costs about $150 per function point per calendar year.
28. After delivery applications grow at about 8% per calendar year during use.
29. Average defect repair rates are about 10 bugs or defects per month.
30. Programmers and managers need about 10 days of annual training to stay current.
3.6 Quality and Defect Removal
There are various definitions of quality; a common definition in software engineering is conformance
to requirements. However, requirements themselves can have defects, and some requirements are labeled
toxic. There are other "ility" qualities such as maintainability and reliability, but these cannot be measured
directly. This is why quality comes down to the absence of defects, which leaves two powerful metrics for
understanding software quality: 1) software defect potential and 2) defect removal efficiency (DRE). The
phrase "software defect potential" was first used in IBM circa 1970. Defect potential is the total of bugs or
defects likely to be found in all software deliverables, such as the requirements, architecture, design, code,
user documents, test cases and bad fixes. The quality benchmarks for defect potentials on leading projects
are < 3.00 per function point, combined with defect removal efficiency levels that average > 97% for all
projects and 99% for mission-critical projects.
The DRE metric was also developed in IBM in the early 1970s at the same time IBM was developing
formal inspections. The concept of DRE is to track all defects found by the development teams and then
compare those to post-release defects reported by customers in a fixed time period after the initial release
(normally 90 days). If the development team found 900 defects prior to release and customers reported 100
defects in the first three months, then the total volume of bugs was an even 1,000, so defect removal
efficiency is 90%.
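The worked example above can be written directly as a small function (a sketch; the 900 and 100 figures are the ones from the text):

```python
def defect_removal_efficiency(found_before_release, found_after_release):
    """DRE = defects removed before release / total defects, where the
    post-release count comes from a fixed window (normally 90 days)
    after the initial release."""
    total = found_before_release + found_after_release
    return found_before_release / total

# The example from the text: 900 defects found by the development team
# before release, 100 reported by customers in the first three months.
print(defect_removal_efficiency(900, 100))  # 0.9, i.e. 90% DRE
```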
The U.S. average circa 2013 for DRE is just a bit over 85%. Testing alone is not sufficient to raise DRE
much above 90%. To approach or exceed 99% in DRE, it is necessary to use a synergistic combination of
pre-test static analysis and inspections combined with formal testing using mathematically designed test
cases, ideally created by certified test personnel. DRE can also be applied to defects found in other
materials such as requirements and design. Requirements, architecture, and design defects are resistant to
testing and, therefore, pre-test inspections of requirements and design documents should be used for all
major software projects. Table 15 illustrates current ranges for defect potentials and defect removal
efficiency levels in the United States circa 2013 for applications in the 1,000 function point size range:
Table 15: Software Defect Potentials and Defect Removal Efficiency
Defect Origins            Defect Potential   Defect Removal   Defects Delivered   % of Total
Requirements defects 1.00 75.00% 0.25 31.15%
Design defects 1.25 85.00% 0.19 23.36%
Test case defects 0.75 85.00% 0.11 14.02%
Bad fix defects 0.35 75.00% 0.09 10.90%
Code defects 1.50 95.00% 0.08 9.35%
User document defects 0.60 90.00% 0.06 7.48%
Architecture defects 0.30 90.00% 0.03 3.74%
TOTAL 5.75 85.00% 0.80 100.00%
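Each "Defects Delivered" cell in Table 15 equals the defect potential times (1 - removal efficiency), so the table can be recomputed from its inputs. A sketch using only the Table 15 values (the overall DRE implied by these numbers is about 86%, close to the table's rounded 85.00%):

```python
# (origin, defect potential per FP, removal efficiency) -- values from Table 15
rows = [
    ("Requirements",   1.00, 0.75),
    ("Design",         1.25, 0.85),
    ("Test cases",     0.75, 0.85),
    ("Bad fixes",      0.35, 0.75),
    ("Code",           1.50, 0.95),
    ("User documents", 0.60, 0.90),
    ("Architecture",   0.30, 0.90),
]

delivered = {name: potential * (1 - removal) for name, potential, removal in rows}
total_potential = sum(potential for _, potential, _ in rows)
total_delivered = sum(delivered.values())

print(round(delivered["Requirements"], 2))              # 0.25 per FP, as in the table
print(round(total_potential, 2))                        # 5.75 per FP
print(round(total_delivered, 2))                        # 0.8, the table's 0.80
print(round(1 - total_delivered / total_potential, 2))  # 0.86 overall DRE
```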
3.6.1 Error-Prone Modules (EPM)
One of the most important findings in the early 1970s by IBM was that errors were not randomly
distributed through all modules of large systems, but tended to cluster in a few modules, which were
termed “error-prone modules.” For example, 57% of customer reported bugs in the IBM IMS data base
application were found in 32 modules out of a total of 425 modules. More than 300 IMS modules had
zero defect reports. A Microsoft study found that fixing 20% of the bugs would eliminate 80% of system
crashes. Other companies replicated these findings and error-prone modules are an established fact of
large systems.
3.6.2 Inspection Metrics
One of the merits of formal inspections of requirements, design, code, and other deliverables is the suite
of standard metrics that are part of the inspection process. Inspection data routinely includes preparation
effort, inspection session team size and effort, defects detected before and during inspections; defect
repair effort after inspections; and calendar time for the inspections for specific projects. These data are
useful in comparing the effectiveness of inspections against other methods of defect removal such as pair
programming, static analysis, and various forms of testing. To date, inspections have the highest levels of
defect removal efficiency (> 85%) of any known form of software defect removal.
3.6.3 General Terms of Software Failure and Software Success
The terms “software failure” and “software success” are ambiguous in the literature. Here is Capers
Jones’ definition of “success”, where he attempts to quantify the major issues troubling software
[JONEd]: success means
< 3.00 defects per function point;
> 97% defect removal efficiency;
> 97% of valid requirements implemented;
< 10% requirements creep;
0 toxic requirements forced into application by unwise clients;
> 95% of requirements defects removed;
development schedule achieved within + or – 3% of a formal plan;
costs achieved within + or – 3% of a formal parametric cost estimate.
Section 4: Conclusions and Project Take-Aways
We found many excellent suggestions for enabling teams to deliver quality software, but not all things
will work for all teams.
4.1 Process
1. Changing processes leads to differences in software quality.
2. Mixing the distinguishing characteristics of high-performing Agile and DevOps teams can lead to rapid
delivery and maximized outcomes through collaborative performance. (Section 1.1)
3. The more collaborative the process becomes, the easier it is to attain item 2; Agile and DevOps are
based on teamwork and cooperation. (Section 1.1) Make the process visible and available to all teams,
with delivery tasks and trends (metrics) available to every team. Raise awareness of product quality:
everyone is responsible for, and owns, the trends.
4. The key points for high-performing DevOps culture: (Section 1.2.1)
Deploy daily – decouple code deployment from feature releases
Handle all non-functional requirements (performance, security) at every stage
Exploit automated testing to catch errors early and quickly; 82% of high-performing DevOps
organizations use automation [Puppet].
Employ strict version control; version control in operations has the strongest correlation with
high-performing DevOps organizations [Puppet]. Save all products into a software configuration
management (SCM) system, making them readily available, merging contributions by multiple
authors, and determining where changes have been made. Along with an SCM, use a configuration
management system. (Section 1.2.4)
Implement end-to-end performance monitoring and metrics
Perform peer code review.
Use collaborative code review platforms such as Gerrit, CodeFlow, ReviewBoard,
Crucible or SmartBear; review against coding standards first, then apply checklists.
(Section 1.3, 1.3.1)
Apply a separate checklist for security. (Section 1.3.2)
Perform static analysis using tools such as FindBugs (byte code), PMD (source code) and
CheckStyle. Note that SonarQube can take the output from these tools and present it;
SonarQube also provides an indicator of poor design before “human reviews”. (Section 1.3)
Monitor the code review process (Section 1.3.3)
Allocate more cycle time to reduction of technical debt.
Reviews assist in identifying evolvability defects, which make code harder to understand and
maintain. (Section 1.3.4)
Agile requires refactoring. Refactoring and technical debt are linked.
5. The key success word for Agile is continuous: continuous testing, planning, iterations, integration and
improvement, resulting in continuously delivering tested, working software.
6. As test coverage increases, both predictability and quality increase, and automation can promote
greater coverage. (Section 1.2.2.3) A raw code coverage number is only relevant when it is too low;
when it is high it requires further analysis: determine what is not covered and why. Multiple studies
show that about 85% of defects in production could have been detected by simply testing all possible
2-way combinations of input parameters. NIST provides a free testing tool for this, the Advanced
Combinatorial Testing System (ACTS).
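The 2-way (pairwise) coverage idea can be sketched in a few lines: enumerate every pair of parameters and every combination of their values, then check which combinations no test in the suite exercises. The parameter names and test suite below are hypothetical; NIST's ACTS tool performs this analysis, and also generates covering suites, at much larger scale:

```python
from itertools import combinations, product

# Hypothetical input parameters and their possible values.
params = {
    "browser": ["chrome", "firefox"],
    "os": ["linux", "windows", "mac"],
    "tls": ["1.2", "1.3"],
}

def uncovered_pairs(tests):
    """Return every 2-way (parameter-pair, value-pair) combination
    not exercised by any test in the suite."""
    missing = []
    for p1, p2 in combinations(params, 2):
        for v1, v2 in product(params[p1], params[p2]):
            if not any(t[p1] == v1 and t[p2] == v2 for t in tests):
                missing.append(((p1, v1), (p2, v2)))
    return missing

# A tiny (incomplete) test suite: each test fixes one value per parameter.
suite = [
    {"browser": "chrome", "os": "linux", "tls": "1.2"},
    {"browser": "firefox", "os": "windows", "tls": "1.3"},
]
print(len(uncovered_pairs(suite)))  # 10 of the 16 value pairs are still untested
```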
7. Review high risk code and high risk changes for both vulnerabilities and defects. (Section 1.3.6).
8. Integrate QA into the development process; this fosters the collaboration outlined in item 2. (Section 1.5)
9. Groom the product backlog. Many development teams do not have a ready, usable product
backlog. Over 80% of teams have user stories in their product backlog, but less than 10% find
them acceptable. A product backlog in a high ready state can dramatically improve a team's
velocity, by as much as 50%. (Section 1.6)
10. AUTOMATE, AUTOMATE, AUTOMATE …
When the same deployment tools are used for all development and test environments, errors are
detected and fixed early.
Studies have determined that there is no single best tool, underscoring the fact that quality is based
on practices, not on the exact tool. Tools can make a team more productive and collaborative, and
can enforce a practice. (Section 1.2.2.3)
More than 80% of high-performing software development organizations rely on automated tools
for infrastructure management and deployment. Automated testing (checking) is a factor in
quality production environments.
11. Develop a defect prevention strategy.
Log and document defects to provide the key parameters for analysis (root cause ->
preventive actions -> improvement) and measurement.
Defect Removal Efficiency (DRE) must be over 85%, and ideally closer to 95%. (Section 2.7)
From both an economic and a quality standpoint, defect prevention and testing are both necessary
to achieve lower costs, shorter schedules and low levels of defects.
Conduct dynamic appraisals through functional and performance testing. Coverage, coverage
and more coverage.
As of 2015 there are more than 20 forms of testing. The assumed test stages include 1) unit test,
2) function test, 3) regression test, 4) component test, 5) usability test, 6) performance test, 7)
system test, and 8) acceptance test. Most forms of testing have only about a 35% DRE, so at least
8 types of testing are needed to top 80% in overall testing DRE.
Static appraisals can eliminate 40-50% of the coding defects. (Section 1.3)
Defects do not originate only in code; only 35% of the total defect potential originates from the
code. Requirements account for 20%, design 25%, documents 12%, and bad fixes another 8%.
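If successive test stages removed defects independently, the cumulative DRE of a sequence of stages would be 1 - Π(1 - DREᵢ); in practice stages overlap, which is one reason so many stages are needed to top 80%. A sketch of this arithmetic (the independence assumption is ours, for illustration; the per-stage efficiencies are the ones from the Appendix A estimate, whose own cumulative test subtotal is 82% because it also models bad-fix injection):

```python
from math import prod

# Per-stage defect removal efficiencies from the Appendix A estimate,
# unit test through acceptance test.
stage_dre = [0.31, 0.33, 0.12, 0.30, 0.11, 0.34, 0.15]

def cumulative_dre(efficiencies):
    """If stages were independent, the fraction of defects escaping all
    stages is the product of the per-stage escape rates."""
    return 1 - prod(1 - e for e in efficiencies)

print(round(cumulative_dre(stage_dre), 3))  # 0.858 under the independence assumption
```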
4.2 Product Measurements
12. No single metric can provide a complete quality measure and selecting the set of metrics that provides
the essential quality coverage is also impossible.
13. It is important to understand that quality needs to be improved faster and to a higher level than
productivity in order for productivity to improve. The major reason for this link is that identifying and
fixing defects is overall the most expensive activity in software development. Quality leads and
productivity follows. Attempting to improve productivity without first improving quality will not be
successful.
14. If only one quality aspect of development is measured, it should be defects. Defects are at the core of
the customer's and external reviewer’s value perception. Released defect levels are a product of
defect potentials and defect removal efficiency. The phrase "defect removal efficiency" refers to one
of the most powerful of all quality metrics. Fixing bugs on the same day as they are discovered can
double the velocity of a team. (Section 2.7)
15. Collect just enough feedback to respond to unexpected events, and change the process as needed.
Metrics on the number of test runs and passes, code coverage metrics and defect metrics should be
reviewed to ensure that the code is providing the value required. The SonarQube default quality
setting tracks the seven deadly sins in bad code: bad distribution of complexity, duplications, lack of
comments, coding rules violations, potential bugs, no unit tests or useless ones, and bad design.
(Section 2.7)
16. Next, concentrate on performance and security. These qualities are externally visible to customers and
the public. Defective and slow software will make customers demand a new product, and security
problems will lead to headlines in the news. (Section 2.9)
Monitoring the performance of complex applications requires a tool such as Dynatrace.
Use a monitoring tool such as Java Mission Control to find memory leaks, garbage collection
inefficiencies and locked threads.
Use code profilers such as YourKit or JProfiler to identify and remove bottlenecks.
17. Another quality aspect on the radar should be maintainability, a quality attribute listed in hundreds of
quality models. The system will be used and updated for an extended time, and it should not become
more difficult and expensive to maintain. As new services (features) are added, they should be added
at a reasonable cost and should also be testable. Symptoms of poor maintainability are unnecessary
complexity, unnecessary coupling, redundancy, and a software model that does not reflect the actual
or physical model. Note that many of these symptoms already appear among the seven sins of bad
code. (Section 2.7, Section 1.3)
18. Numbers are not as important as trends (Section 2.6).
Acknowledgements
The authors thank Chris Drake, Michael Iacovelli and Frank Schmidt at iconectiv for the valuable insights
and suggestions regarding this work that they shared with us through numerous teleconferences during the
summer of 2015. This research is also supported by the National Science Foundation under Grant No.
1464654.
Appendix A
Namcook Analytics - Estimation Report
Copyright © by Namcook Analytics. All rights reserved.
Web: www.Namcook.com
Blog: http://Namcookanalytics.com
Part 1: Namcook Analytics Executive Overview
Project name: numbers
Project manager: NA
Key Project Dates:
Today's Date 08-18-15
Project start date: 08-18-15
Planned delivery date: 08-17-16
Predicted Delivery date 07-07-16
Planned schedule months: 12.01
Actual schedule months: 10.82
Plan versus Actual: -1.19
Key Project Data:
Size in FP 465.06
Size in KLOC 53.33
Total Cost of Development 375,586.16
Part 2: Namcook Development Schedule
Activity            Staffing   Effort Months   Schedule Months   Project Costs   $ per Funct. Pt.   Wk Hrs per FP
Requirements 1.32 4.70 3.56 $46,994.93 $101.05 1.33
Design 1.76 6.66 3.78 $66,576.16 $143.16 1.89
Coding 3.30 9.53 2.89 $95,340.54 $205.01 2.71
Testing 2.97 7.01 2.36 $70,078.58 $150.69 1.99
Documentation 0.91 1.45 1.60 $14,525.71 $31.23 0.41
Quality Assurance 0.83 1.82 2.19 $18,157.13 $39.04 0.52
Total Project 0.87 6.39 10.82 $63,913.11 $137.43 1.81
3.47 37.56 16.38 $375,586.16 $807.61 10.66
Gross schedule months 16.38
Overlap % 0.66
Predicted Delivery Date 10.82 07-07-16
Client target delivery schedule and date 12.01 08-17-16
Difference (predicted minus target) -1.19
Odds 70% Odds 50% Odds 10%
05-13-16 03-28-16 02-12-16
Features deferred to meet schedule:
Function Pts. (38)
Lines of code (2,008)
% deferred -9.92%

Productivity              FP/Month   LOC/Month   WH/Month
Productivity (+ reuse)    12.38      660.39      10.66
Productivity (- reuse)    10.11      539.07      13.06
Estimates for User Development Activities
User Activities Staffing Schedule Effort Costs $ per FP
User requirements team: 0.72 5.41 3.87 $0 $0.00
User prototype team: 0.58 2.70 1.57 $0 $0.00
User change control team: 0.62 10.82 6.71 $0 $0.00
User acceptance test team: 1.03 1.62 1.68 $0 $0.00
User installation team: 0.81 0.81 0.66 $0 $0.00
0.75 4.27 14.48 $0 $0.00
Number of Initial Year 1 Users: 1,000 12.00
Number of users needing training: 900 0.05 41.86 $0 $0.00
TOTAL USER COSTS
56.34 $0 $0.00
$ per function point $0.00
% of Development 0.00%
Staffing by Occupation
Occupation Normal Peak
Groups Staffing Staffing
1 Programmers 4 5
2 Testers 3 5
3 Designers 1 2
4 Business analysts 1 2
5 Technical writers 1 1
6 Quality assurance 1 1
7 1st line management 1 2
8 Data base administration 0 0
9 Project office staff 0 0
10 Administrative staff 0 0
11 Configuration control 0 0
12 Project librarians 0 0
13 2nd line managers 0 0
14 Estimating specialists 0 0
15 Architects 0 0
16 Security specialists 0 0
17 Performance specialists 0 0
18 Function point specialists 0 0
19 Human factors specialists 0 0
20 3rd line managers 0 0
TOTAL 14
20
Risks Odds
Cancellation 12.81%
Negative ROI 16.23%
Cost Overrun 14.09%
Schedule Slip 17.08%
Unhappy Customers 36.00%
Litigation 5.64%
Average Risk 16.98%
Financial Risk 23.65%
Less than 15% = Acceptable
15% - 35% = Caution
Greater than 35% = Danger
Part 3: Namcook Quality Predictive Outputs
Software Quality
Defect Potentials Potential
Requirements defect potential 380
Design defect potential 365
Code defect potential 572
Document defect potential 79
Total Defect Potential 1,396
Defect Prevention Efficiency Remainder Bad Fixes
JAD 0% 1,396 0
QFD 27% 1,019 10
Prototype 0% 1,029 (0)
Models 0% 1,029 0
Subtotal 26% 1,029 10
Pre-Test Removal Efficiency Remainder Bad Fixes
Desk check 26% 761 21
Pair programming - not used 0% 782 21
Static analysis 55% 361 10
Inspections 88% 45 1
Subtotal 96% 46 53
Test Removal     Efficiency   Remainder   Bad Fixes   Test Cases   Per KLOC   Per FP   Test Scripts
Unit 31% 32 1 480 19 1 66
Function 33% 22 1 522 21 1 69
Regression 12% 20 1 235 9 1 46
Component 30% 14 1 313 13 1 53
Performance 11% 13 0 157 6 0 38
System 34% 9 0 496 20 1 67
Acceptance 15% 8 0 106 4 0 31
Subtotal 82% 8 4 2,308 93 5 144
Defects delivered 8
High severity 1
Security flaws 1
High severity % 14.92%
Deliv. Per FP 0.02
High sev per FP 0.00
Security flaws per FP 0.00
Deliv. Per KLOC 0.34
High sev per KLOC 0.05
Security flaws per KLOC 0.02
Cumulative
Removal Efficiency
99.40%
Defect prevention costs $40,832.31
Pre-Test Defect Removal Costs $71,988.87
Testing Defect Removal Costs $140,837.96
Total Development Defect Removal Costs $253,659.14
Defect removal % of development 67.54%
Defect removal per FP 545.43
Defect removal per KLOC 10,226.87
Defect removal per defect 110.49
Three-year Maintenance Defect Removal Costs 60,769.16
TCO defect removal costs 314,428.29
Defect removal % of TCO 0.25%
Reliability (days to first defect) 29.54
Reliability (days between defects) 198.03
Customer satisfaction with software 96.42%
Part 4: Namcook Maintenance and Cost Benchmarks
Maintenance Summary Outputs for three years
Year of first release 2016
Application size at first release - function points 465
Application growth (three years) - function points 586
Application size at first release - lines of code 24803
Application growth (three years) - lines of code 31245
Application users at first release 1000
Application users after three years 1359233
Incidents after three years 4857
Staff Effort
Cost per Cost per
Three-Year Totals Personnel Months Costs Function Pt. Function Pt.
1,000 1,260
Management 43.93 1581.57 7907862.60 17003.96 13498.29
Customer support 658.43 23703.33 118516657.07 254841.65 202301.52
Enhancement 0.43 15.43 77142.54 165.88 131.68
Maintenance 0.34 12.15 60769.16 130.67 103.73
TOTAL 703.12 25312.49 126562431.37 272142.16 216035.22
Namcook Total Cost of Ownership Benchmarks
Staffing Effort Costs $ per FP % of TCO
at release
Development 3.47 38 $375,586.16 $807.61 Cost per
Maintenance Mgt. 43.93 1582 $7,907,862.60 $17,003.96 6.23%
Customer support 658.43 23703 $118,516,657.07 $254,841.65 93.37%
Enhancement 0.43 15 $77,142.54 $165.88 0.06%
Maintenance 0.34 12 $60,769.16 $130.67 0.05%
User Costs 0.75 14 $0.00 $0.00 0.00%
Total TCO 707.35 25365 $126,938,017.53 $272,949.76 100.00%
Part 5: Additional Data Points
Note: Namcook Analytics uses SRM and IFPUG function points as default values.
This section provides application size in 21 metrics.
Alternate Size Metrics Size % of IFPUG
1 IFPUG 4.3 465 100.00%
2 Automated code based function points 498 107.00%
3 Automated UML based function points 479 103.00%
4 Backfired function points 465 100.00%
5 COSMIC function points 558 120.00%
6 Fast function points 451 97.00%
7 Feature points 465 100.00%
8 FISMA function points 474 102.00%
9 Full function points 544 117.00%
10 Function points light 449 96.50%
11 IntegraNova function points 507 109.00%
12 Mark II function points 493 106.00%
13 NESMA function points 484 104.00%
14 RICE objects 2,591 557.14%
15 SCCQI function points 1,479 318.00%
16 Simple function points 453 97.50%
17 SNAP non functional size metrics 118 25.45%
18 SRM pattern matching function points 465 100.00%
19 Story points 362 77.78%
20 Unadjusted function points 414 89.00%
21 Use-Case points 217 46.67%
Document Sizing
Document Type        Pages    Words    Percent Complete
1 Requirements 192 76,781 94.55%
2 Architecture 46 18,268 93.24%
3 Initial design 223 89,230 87.46%
4 Detail design 379 151,472 88.77%
5 Test plans 76 30,329 91.16%
6 Development Plans 26 10,231 91.24%
7 Cost estimates 46 18,268 94.24%
8 User manuals 184 73,783 94.88%
9 HELP text 88 35,152 95.06%
10 Courses 67 26,973 94.79%
11 Status reports 47 18,887 93.24%
12 Change requests 90 35,952 99.55%
13 Bug reports 496 198,214 92.51%
TOTAL 1,959 783,541 93.13%
Work hours per page - writing 0.95
Work hours per page - reading 0.25
Total document hours - writing 1,860.91
Total document hours - reading 85.01
Total document hours 1,945.92
Total document months 14.74
Total document $ 147,425.42
$ per function point 317.00
% of total development 39.25%
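The totals above can be cross-checked. Assuming SRM-style defaults of 132 work hours per staff-month and roughly $10,000 per staff-month (constants inferred here because they reproduce the figures shown, not stated in the table), the arithmetic works out as:

```python
# Re-derive the documentation effort figures from the per-page rates above.
# Assumed constants (not stated in the table): 132 work hours per staff-month
# and ~$10,000 per staff-month, SRM-style defaults that reproduce the totals.
HOURS_PER_MONTH = 132.0
DOLLARS_PER_MONTH = 10_000.0

total_pages = 1_959
writing_hours = total_pages * 0.95      # ~1,861 h (table: 1,860.91, which
                                        # carries unrounded page counts)
total_hours = 1_945.92                  # writing + reading, from the table
months = total_hours / HOURS_PER_MONTH  # ~14.74 staff-months
cost = months * DOLLARS_PER_MONTH       # ~$147,400
cost_per_fp = cost / 465                # ~$317 per function point
```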
DevOps Practices Self Assessment
Please describe your purpose for completing the assessment.
I only want to self-assess my practices
Enter the contact information in the fields provided. This information will be used to forward the results to you. If you included your IBM representative's information, a copy of your responses will be forwarded to your IBM representative.
Your name
Dolores
Your Email Address
IBM representative name
N/A
IBM representative email
What is your company's industry?
Education
What is the geographic area of your company's primary operations?
North America
Please select the assessment experience you prefer.
I would like to manually select each practice to assess
Please select up to two adoption paths to focus your assessment. The next step will let you select from a list of practices to further refine your assessment questions.
Develop / Test (Design, Construct, Configuration Management, Build, Test, Quality Assessment)
Please select one or more practices from the list to focus your self-assessment.
Design, Construct, Configuration Management, Build, Quality Management, Quality Assessment
Design is focused on producing products during a phase of the project using formal processes for review and measures of completion. The confidence of design activities to ensure scope and requirements are understood for implementation is low, and effectiveness is not measured. A formal method is used to review design products for approval or to improve or correct them.
Partial
Developers work independently and deliver code changes; deliveries are formally scheduled and resource intensive. Integration is a planned event that impacts most development activities in an application or project.
Partial
Code deliveries and integrations are performed periodically using a common process and automation. Integrations are accomplished by individual developers and automated when possible. Coding techniques are available and used inconsistently. Common architecture standards for application coding are defined and trained. Code reviews are effective and manually initiated.
Yes
Coding best practices are defined consistently across technologies and include reviews and automated validation through scanning/testing. Consistent architecture standards are used across the organization and validated in testing and reviews.
Yes
Code changes are collaboratively developed across technologies, applications and teams, continuously. Developers have immediate access to relevant information for code changes to ensure iterative improvements or changes to design, requirements or coding implementation are understood. Standards in coding and reviews are measured across the organization. Best coding and validation practices are trained, used and verified consistently.
Yes
Source control of assets is largely a manual activity that relies heavily on individuals following processes. Performing changes on an asset by more than one team member is only accomplished through locks and access controls. Merging asset changes is accomplished on desktops manually and formally scheduled by a specialized integration team. Applying changes across versions for different releases is performed outside of the configuration management tool or process.
No
Builds are performed manually and automated across projects and environments. Build systems range from a developer's IDE (usually for Dev only) to a formal centralized build server, which is normally used for formal promotions to QA and production. Build management and standards are controlled at the project level. Formal builds are scheduled following formal delivery and integration events to validate application-level integration and application promotion.
No
Informal builds produced by individual developers via their desktop IDE are used for validation but never deployed to environments. A centralized build service is in place that controls the artifacts and processes used in the build. The automated build process includes code scanning and unit/security tests. The build process periodically produces a build of each application under change for testing or verification purposes. Build results are measured and monitored by development teams consistently across the enterprise.
Yes
A daily build of an application under change is promoted to test. Build is provided as a service that supports continuous integration, compilation and packaging by platform. Individual developers, teams and applications regularly schedule automated builds based on changes to source code. A dependency management process is in place for software builds, using a dependency-management repository to trace the standard libraries and provision them at build time.
Yes
All builds could be promoted through the software delivery pipeline, if desired. Each project/team tracks changes to the build process as well as source code and dependencies. Modifying the build process requires approval, so access to the official build machines and build server configuration is restricted where compliance is a factor or where Enterprise Continuous Integration becomes a production system. Build measures are used to improve development and configuration management processes.
Yes
Crash reporting is incorporated into mobile applications to provide basic measures for quality assessment. Crash reporting is used to improve basic stability of the applications.
Yes
Quality reporting is embedded into mobile applications to support user sentiment and usage patterns. Measures are used to drive changes into development teams for usability improvements, defect correction and enhancements.
Yes
Mobile application teams assess quality by monitoring social media, application repositories, and user feedback. Each monitoring source is used to define defects or enhancements to the specific application. Measures are used to determine the impact of application team improvements on user satisfaction.
No
The main objective of testing is to validate that the product satisfies the specified requirements. However, testing is still seen by many stakeholders as the project phase that follows coding.
No
Design: Reliable
Construct: Scaled
Configuration Management: Scaled
Build: Scaled
Test: Reliable
Assess Quality: Scaled
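Stated compactly, the assessment output is a mapping from practice to maturity level. A hypothetical sketch of tallying the results above (the practice names and the two levels come from the assessment itself; the aggregation logic is illustrative and not part of IBM's tool):

```python
# The self-assessment outcome as a practice -> maturity-level map.
from collections import Counter

results = {
    "Design": "Reliable",
    "Construct": "Scaled",
    "Configuration Management": "Scaled",
    "Build": "Scaled",
    "Test": "Reliable",
    "Assess Quality": "Scaled",
}

# Count how many practices sit at each maturity level.
by_level = Counter(results.values())
print(dict(by_level))  # {'Reliable': 2, 'Scaled': 4}
```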
References
[BASI] Basili, V., Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In
Encyclopedia of Software Engineering. Wiley, 1994.
[BISH] Bisht, A., A.S. Dhanoa, A.S. Dhillon, and G. Singh, "A Survey on Quality Prediction of Software
Systems", isems.org.
[BLAC] Black, R., "Measuring Defect Potentials and Defect Removal Efficiency", 2008, http://www.rbcs-
us.com/images/documents/Measuring-Defect-Potentials-and-Defect-Removal-Efficiency.pdf
[CHIL] Childers, B., "Geek Guide: Slow Down to Speed Up, Continuous Quality Assurance in a DevOps
Environment", Linux Journal, 2014.
[CISC] “Cisco 2014 Annual Security Report”, http://www.cisco.com/web/offers/lp/2014-annual-security-
report/index.html
[DEMI] Deming, W.E., Out of the Crisis, MIT Center for Advanced Engineering Study, Cambridge, MA,
1982.
[DYNA] Dynatrace, “DevOps: Hidden Risks and How to Achieve Results”, 2015.
http://www.dynatrace.com/content/dam/en/general/ebook-devops.pdf
[FENT] Fenton, N., and J. Bieman, Software Metrics: A Rigorous and Practical Approach, Third Edition,
2014, CRC Press, Boca Raton, FL.
[FOWL] Fowler, M. “An Appropriate Use of Metrics”, Feb. 2013,
http://martinfowler.com/articles/useOfMetrics.html
[FRAN] P. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In Proc. 6th ACM
SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’98), pages
153–162. ACM Press, 1998.
[GALE] Galen, R., "2 Dozen Wild & Crazy Agile Metrics Ideas", RGalen Consulting.
[GART] Gartner, "Market Share Analysis: Application Performance Monitoring, 2014", May 27, 2015,
http://www.gartner.com/technology/reprints.do?id=1-2H15OOF&ct=150602&st=sb
[GILB] Gilb, T and L. Brodie, “What’s Wrong with Agile Methods: Some Principles and Values to
Encourage Quantification”, Methods and Tools, Summer 2007, accessed 6/2015
http://www.methodsandtools.com/archive/archive.php?id=58
[HOGL] Hoglund, G., and G. McGraw, Exploiting Software: How to Break Code, Addison-Wesley, 2004.
[HOWA] Howard, M., "Lessons Learned from Five Years of Building More Secure Software", Trustworthy
Computing, Microsoft,
http://download.microsoft.com/download/A/E/1/AE131728-943B-42B4-B130-C1DEBE68F503/Trustworthy%20Computing.pdf
[IBM] IBM developerWorks, "11 proven practices for more effective, efficient peer code review",
accessed 6/2015, http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-
peer-review/
[JONEa] Jones, C and O. Bonsignour, The Economics of Software Quality, 2011, Pearson Publishing
[JONEb] Jones, C., “Software Engineering issues for 30 years”, http://www.namcook.com/index.html
[JONEc] Jones, C., “Examples of Software Risk Master (SRM) Requirements Predictions”, January 11,
2014, http://namcookanalytics.com/wp-content/uploads/2014/01/RequirementsData2014.pdf
[JONEd] Jones, C., “Evaluating Software Metrics and Software Measurement Practices”, Version 4.0
March 14, 2014, Namcook Analytics LLC
http://namcookanalytics.com/wp-content/uploads/2014/03/Evaluating-Software-Metrics-and-
Software-Measurement-Practices.pdf.
[JONEe] Jones, C., “The Mess of Software Metrics”, Version 2, September 12, 2014,
http://namcookanalytics.com/wp-content/uploads/2014/09/problems-variations-software-
metrics.pdf
[KABA] Kabanov, Jevgeni, “Developer Productivity Report 2013 – How Engineering Tools & Practices
Impact Software Quality & Delivery”, Zeroturnaround, 2013,
http://pages.zeroturnaround.com/RebelLabs-
AllReportLanders_DeveloperProductivityReport2013.html?utm_source=Productivity%20Report
%202013&utm_medium=allreports&utm_campaign=rebellabs&utm_rebellabsid=76
[KASP] Kaspersky Lab, "Oracle Java surpasses Adobe Reader as the most frequently exploited software",
December 2012,
http://www.kaspersky.com/about/news/virus/2012/Oracle_Java_surpasses_Adobe_Reader_as_the
_most_frequently_exploited_software
[MANT] Mantyla, M.V., and C. Lassenius, "What types of defects are really discovered in code
reviews?", IEEE Transactions on Software Engineering, 2009, 35(3), 430-448.
[MCGR] McGraw, G, S. Migues, J. West, “BSIMM6”, October 2015, https://www.bsimm.com/wp-
content/uploads/2015/10/BSIMM6.pdf
[MICR] Microsoft, “Microsoft Security Intelligence Report volume 10 (July – December 2010)”,
http://www.microsoft.com/en-us/download/details.aspx?id=17030
[MITR] 2011 CWE/SANS Top 25 Most Dangerous Software Errors, http://cwe.mitre.org/top25
[MOCK] Mockus, A., Nachiappan Nagappan, Trung T. Dinh-Trong, "Test coverage and post-verification
defects: A multiple case study" Proceedings of the 2009 3rd International Symposium on
Empirical Software Engineering and Measurement, October 2009.
[NAGA] Nagappan, N., and T. Ball, “Use of Relative Code Churn to Predict System Defect Density”,
Microsoft Research, 2005, http://research.microsoft.com/pubs/69126/icse05churn.pdf
[NESM] NESMA, http://nesma.org/2015/04/Agile-metrics/
[OLAV] Olavsrud, T., “Do Insecure Open Source Components Threaten Your Apps?”, CIO, March 2012,
http://www.cio.com/article/2397662/governance/do-insecure-open-source-components-threaten-
your-apps-.html
[OWAS] “OWASP Top 10”, www.owasp.org/index.php/Category:OWASP_Top_Ten_Project
[PAUK] Paukamainen, I. “Case: Testing in Large Scale Agile Development”, presentation
http://testingassembly.ttlry.mearra.com/files/2014%20ISMO%20Case_TestingInLargeScaleAgile
Development.pdf
[PFLE] Pfleeger, S.L., N. Fenton, and N. Page, "Evaluating software engineering standards", IEEE
Computer, vol. 27, no. 9, pp. 71–79, 1994.
[PLUM] Plumbr, "Java performance tuning survey results", November 2014,
https://plumbr.eu/blog/performance-blog/java-performance-tuning-survey-results-part-i
[PUPP] Puppet Labs, IT Revolution Press and Thoughtworks, 2014 State Of DevOps Report and 2013 State
Of DevOps Infographic, 2015, https://puppetlabs.com/2013-state-of-devops-infographic
[PUTM] Putman, R., "Secure Agile SDLC", https://www.brighttalk.com/webcast/1903/92961.
[SANSa] SANS Institute, “2015 State of Application Security: Closing the Gap”, May 14, 2015,
https://software-security.sans.org/blog/2015/05/14/2015-state-of-application-security-closing-the-
gap
[SANSb] SANS Institute, “Critical Security Controls”, https://www.sans.org/critical-security-controls/
[SCAL] http://scaledAgileframework.com/features-components/
[SECU] “37 Powerful Penetration Testing Tools for Every Penetration Tester”, Security testing, June
2015, http://www.softwaretestinghelp.com/penetration-testing-tools/
[SIRI] Sirias, C., "Project Metrics for Software Development", InfoQ, July 14, 2009,
http://www.infoq.com/articles/project-metrics
[SMAR] SmartBear, "11 Best Practices for Peer Code Review", Whitepaper.
[SOFT] Software Assurance Pocket Guide Series: Development, Volume V, Version 1.3, "Architecture and
Design Considerations for Secure Software", February 22, 2011, https://buildsecurityin.us-
cert.gov/sites/default/files/Architecture_and_Design_Pocket_Guide_v1.3.pdf
[TECH] "Quality metrics: A guide to measuring software quality", SearchSoftwareQuality,
http://searchsoftwarequality.techtarget.com/guides/Quality-metrics-A-guide-to-measuring-
software-quality
[VERI] Verizon DBIR 2012, IDC, Infonetics Research
[WIKI] Software Quality, https://en.wikipedia.org/wiki/Software_quality
[ZERO] zeroturnaround.com/rebellabs/the-developers-guide