Avoiding the Destiny of Failure in Large Software Systems

John Cosgrove, PE, CDP, CFC
Cosgrove Computer Systems Inc.
(310) 823-9448
JCosgrove@Computer.org
www.CosgroveComputer.com

Los Angeles ACM
Loyola Marymount University – University Hall
December 7, 2005

Responding to Risk in Software Systems

Copyright 2001-2005 CCS Inc.




Contents

The Problem
Seeking a Solution
Lessons-Learned
Summary
Bibliography


The Problem

Most Software Systems Fail
Planning Revisited
Integrated Risk Management
New World of Regulation
Future of Software Engineering


Most Software Systems Fail

Most SW projects fail; bigger projects fail more often
Failure is not inevitable
  – Notable exceptions exist
Poor natural visibility is typical with SW
  – Effective planning & status assessment are critical
Risk management is integral to planning
  – Risk assessment must include the economics of failure

Source: Humphrey, CrossTalk 2005


Planning Revisited

Plans must involve all the responsible stakeholders
  – Developers, customers, end users, etc.
  – Win-Win or Lose-Lose
Development cycle policy must be explicit
  – Critical drivers must be stated: these are the independent variables
  – Schedule, cost, performance or quality
      – Choose one or two; the others are dependent variables
      – Dependent variables vary!
Planning is never complete
  – Rule: never fail a plan, because the plan changes to match reality first

Source: Boehm
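
As a minimal sketch of the "choose one or two" rule above, an explicit development policy could be recorded and checked like this. The function name `classify_drivers` and the error messages are hypothetical, introduced only for illustration; the four driver names come from the slide.

```python
# The four critical drivers named on the slide.
DRIVERS = ("schedule", "cost", "performance", "quality")


def classify_drivers(independent):
    """Label each driver as independent (fixed by policy) or dependent.

    Raises ValueError if the policy names an unknown driver or fixes
    anything other than one or two drivers.
    """
    chosen = set(independent)
    unknown = chosen - set(DRIVERS)
    if unknown:
        raise ValueError(f"unknown drivers: {sorted(unknown)}")
    if not 1 <= len(chosen) <= 2:
        raise ValueError("choose one or two independent drivers")
    # Everything not fixed by policy is a dependent variable and will vary.
    return {d: ("independent" if d in chosen else "dependent") for d in DRIVERS}


policy = classify_drivers(["schedule", "quality"])
```

Here `policy` marks schedule and quality as fixed, which makes explicit that cost and performance are the variables allowed to move.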


Integrated Risk Management

True risk management is an element of planning
Flows from unknowns identified in planning
Two broad categories
  – Catastrophic or unacceptable risk
      – Treat as requiring insurance in some form
  – Conventional risk exposure
      – Classical risk mitigation steps
Both demand $$$ quantification of failure
  – Cost of failure drives budgets


New World of Regulation

Sarbanes-Oxley (SOX)
  – Enforces accountability for reporting “correctness”
  – Software projects are investment assets
  – Correctness, control mechanisms, security are auditable
  – Non-compliance penalties include criminal & civil

“If we managed finances in companies the way we manage software, then somebody would go to prison.” -- Armour


Future of Software Engineering

Functional size & complexity increasing rapidly
  – Size increase ~ 10x every 5 years
  – Scale matters in all engineered systems
      – Humphrey’s analogy with transportation system’s speed

“Increasingly software [i.e., computer systems] .. crucial part of the products and services in almost all industries.”
“Most computer systems .. interconnected ..”
“.. more internal and external threats …”
“In .. past, .. assumed a friendly .. environment.”

Source: Humphrey, SEI/CMU 2002
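
To make the "~10x every 5 years" figure concrete, the short sketch below converts it into an equivalent yearly growth factor and a doubling time. It assumes smooth exponential growth, which the slide does not state; the function names are illustrative only.

```python
import math

GROWTH_FACTOR = 10.0  # size multiplier quoted on the slide
PERIOD_YEARS = 5.0    # period over which that multiplier applies


def annual_growth_factor(factor=GROWTH_FACTOR, years=PERIOD_YEARS):
    """Equivalent per-year multiplier, assuming smooth exponential growth."""
    return factor ** (1.0 / years)


def doubling_time_years(factor=GROWTH_FACTOR, years=PERIOD_YEARS):
    """Years needed for size to double under the same assumption."""
    return years * math.log(2.0) / math.log(factor)


def projected_size(initial_size, elapsed_years,
                   factor=GROWTH_FACTOR, years=PERIOD_YEARS):
    """Size after elapsed_years, starting from initial_size."""
    return initial_size * factor ** (elapsed_years / years)


print(f"annual growth factor: {annual_growth_factor():.3f}")
print(f"doubling time (years): {doubling_time_years():.2f}")
```

Under that assumption, size grows by roughly 58% per year and doubles about every 18 months, so a system planned today may be two orders of magnitude larger a decade on.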


Seeking a Solution

Significant Differences – Software
Why Software is Valuable
Software Creation
Failure Management
Minimizing Failure Costs


Significant Differences - Software

Requirements are seldom complete - IKIWISI ("I'll know it when I see it")
  – “With software the challenge is to balance the unknowable nature of the requirements with the business need for a firm contractual relationship.” -- Watts Humphrey

“Most engineered systems are defined by comprehensive plans and specifications prior to startup. Few software-intensive systems are.”

Most software projects are challenged or fail completely*
  – Projects over $6M: fewer than 10% succeed; at about $1M: ~50%
  – Primary cause: no realistic planning by developers
  – No natural visibility of progress or completion status

* Humphrey, “Why Big Software Projects Fail”


Why Software is Valuable

Value created by the abstraction of productive knowledge
  – Development is a social learning process
Economic value comes from impact on useful activity
  – Efficient automotive ignitions
Value is increased when the knowledge is readily adaptable
  – McDonald's hamburger franchises also work well in China
Franchises show how preserved abstractions can be valuable
Software engineers are ethically obligated to optimize value

Source: Baetjer


Software Creation

What is a Social Learning Process?
Ignorance -> useful, reproducible knowledge
Orders-of-Ignorance (OI) – five levels
  – 0th: Useful knowledge, have the answer
  – 1st: Know the question, but not the answer
  – 2nd: Unknown number of unknowns, apply process
  – 3rd: 2-OI but no process to begin
  – 4th: 3-OI, ignorance of ignorance (meta-ignorance)

Source: Armour, Five Orders of Ignorance, C-ACM 10/00


Failure Management

“..as if the concepts of risk and failure are somehow disconnected.”
“.. purpose of development .. do something not done before.”

90% success means 1 in 10 failure
  – Is the failure tolerable?
Must make it tolerable (e.g., insurance)?
  – Calculate $ likelihood of failure (e.g., 10% of cost)

Source: Armour, “Management of Risk,” C-ACM 3/05


Minimizing Failure Costs

Failure costs are never zero
  – Making costs explicit improves planning
Steps to Minimize
  – Make all catastrophic risks tolerable
      – Rationale behind insurance: life, property, etc.
      – Project example: alternate, plan-B solution
  – Quantify risk exposure in terms of failure costs
      – Rationale behind testing to avoid costly field retrofits
      – Failure cost exposure drives budgets for mitigation
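
The two steps above can be sketched as a small risk register. This is an illustration under stated assumptions, not a method given in the talk: exposure is taken as probability times failure cost, a risk counts as catastrophic when its failure cost exceeds a `tolerable_loss` threshold the organization must choose, and all names and dollar figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float   # chance the failure occurs, 0.0 to 1.0
    failure_cost: float  # dollars lost if the failure occurs

    @property
    def exposure(self):
        """Expected dollar loss: probability times cost of failure."""
        return self.probability * self.failure_cost


def plan_risk_response(risks, tolerable_loss):
    """Split risks into the two broad categories.

    Catastrophic risks (failure cost above tolerable_loss, an assumed
    threshold) need insurance or a plan-B solution. Conventional risks
    are listed with their exposure, which bounds what is worth spending
    on mitigation.
    """
    plan = {"insure_or_plan_b": [], "mitigate": []}
    for risk in risks:
        if risk.failure_cost > tolerable_loss:
            plan["insure_or_plan_b"].append(risk.name)
        else:
            plan["mitigate"].append((risk.name, risk.exposure))
    return plan


# Hypothetical register: a 1-in-10 chance of losing a $1M project, and a
# rare field failure whose cost the organization could not absorb.
register = [
    Risk("late delivery", 0.10, 1_000_000),
    Risk("field safety recall", 0.01, 50_000_000),
]
response = plan_risk_response(register, tolerable_loss=5_000_000)
```

In this example the late-delivery risk carries an exposure of about $100,000 (10% of cost), which suggests an upper bound on a sensible mitigation budget, while the recall is routed to insurance or an alternate solution regardless of its low probability.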


Lessons-Learned

Air Traffic Control Failure
New FBI Software Unusable
Unsafe Automotive Ignition
Framework for Dependable Designs
Dependable Ignition System Example


Air Traffic Control Failure

  – LA regional system failed on 9/14/2004 for 3.5 hours
      – Backup system also failed
  – Many mid-air near misses with 800+ aircraft
  – Improperly blamed on “human error”
      – Fault lay with known “glitch” avoided by manual Ops
      – Fault introduced with system re-host a year earlier
      – Only 1 of 21 centers has the fault corrected
  – Questions: testing, fault tolerance policy, etc.
      – Backup system failed immediately???


New FBI Software Unusable

New Anti-terrorism software – Virtual Case File
  – “.. further delays in four-year effort ..”
“$half-billion Upgrade … will not work ..”
  – “ .. render worthless much of current $170M contract.”
“.. may have outlived its usefulness .. before .. it was .. implemented”
“..officials thought ..get it right the 1st time”.. “That never happens with anybody.”

Source: LA Times, 1/13/05


Unsafe Automotive Ignition

Engine died when accelerating into traffic
  – Intermittent sensor wire
  – Ignition control software failed with open circuit
Hazard analysis missed HW-SW interaction
Incomplete SW system safety requirements
  – Interface failure protection - From Hazard analysis
      – Deterministic values for common failures -- Open, short
  – Control algorithm must be protected
  – Detect failures and substitute “safe” values
Recent examples: LA Times 5/05 – “Prius..”


Framework for Dependable Designs

Defend engineering process in court*
Set bounds for system - three states
  – Operating -- Envelope for normal operations
  – Non-Operating -- Normal not possible
  – Exception -- Recover to normal after anomaly
      – Normal may be degraded-normal
Mishaps occur during state transitions
  – IDs SW system dependability requirements
  – Suggests mishap mitigation -- HW or SW

* Source: Lawson
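
Because mishaps are said to occur during state transitions, one way to make the three-state framework checkable is to enumerate the transitions a design permits and reject the rest. The sketch below is an illustration only: the allowed-transition table and the `StateTracker` class are assumptions, not part of Lawson's methodology as presented here.

```python
from enum import Enum


class SystemState(Enum):
    OPERATING = "operating"          # envelope for normal operations
    NON_OPERATING = "non-operating"  # normal operation not possible
    EXCEPTION = "exception"          # recovering to normal after an anomaly


# Assumed set of permitted transitions; a real design would derive
# these from its hazard analysis.
ALLOWED_TRANSITIONS = {
    (SystemState.NON_OPERATING, SystemState.OPERATING),
    (SystemState.OPERATING, SystemState.NON_OPERATING),
    (SystemState.OPERATING, SystemState.EXCEPTION),
    (SystemState.EXCEPTION, SystemState.OPERATING),
    (SystemState.EXCEPTION, SystemState.NON_OPERATING),
}


class StateTracker:
    """Tracks the current state and logs every transition for review."""

    def __init__(self, initial=SystemState.NON_OPERATING):
        self.state = initial
        self.history = []

    def transition(self, new_state, reason):
        edge = (self.state, new_state)
        if edge not in ALLOWED_TRANSITIONS:
            raise ValueError(
                f"transition {self.state.value} -> {new_state.value} not allowed"
            )
        # Each logged edge is a place to look for dependability requirements.
        self.history.append((self.state, new_state, reason))
        self.state = new_state


tracker = StateTracker()
tracker.transition(SystemState.OPERATING, "startup")
tracker.transition(SystemState.EXCEPTION, "sensor fault detected")
tracker.transition(SystemState.OPERATING, "recovered with safe value")
```

Listing the edges explicitly gives each transition a named owner for mitigation, in hardware or software, and the history log supports later review of the engineering process.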


Dependable Ignition System Example

Automotive Ignition -- Hazard identified
  – Sensor wiring may fail from constant movement
  – Ignition control failure may cause traffic emergency
Requirement - Recover safely from faulty wiring
Allocation of requirement – “What if”
  – HW - Terminate inputs for predictable open/short values
  – SW - Detect open/short values, use last or known safe value
Requirements identification before design is best
  – More options, usually less costly
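
The software half of that allocation might look like the following minimal sketch. It assumes the hardware termination makes an open circuit read near the top of a 10-bit converter range and a short read near the bottom; the thresholds, the safe default, and the class name are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical 10-bit ADC limits. With terminated inputs, an open wire
# is assumed to read near full scale and a short near zero.
OPEN_CIRCUIT_MIN = 1000   # readings at or above this are treated as open
SHORT_CIRCUIT_MAX = 23    # readings at or below this are treated as shorted
KNOWN_SAFE_VALUE = 512    # assumed safe default for the control algorithm


@dataclass
class SensorInput:
    """Filters raw readings so the control algorithm never sees a fault value."""

    last_good: Optional[int] = None
    fault_count: int = 0

    def read(self, raw):
        if raw >= OPEN_CIRCUIT_MIN or raw <= SHORT_CIRCUIT_MAX:
            # Wiring fault detected: substitute the last good reading,
            # or the known safe value if no good reading exists yet.
            self.fault_count += 1
            if self.last_good is not None:
                return self.last_good
            return KNOWN_SAFE_VALUE
        self.last_good = raw
        return raw


sensor = SensorInput()
readings = [sensor.read(raw) for raw in (1023, 400, 0, 410)]
```

With this filter an intermittent wire produces a held value and a rising `fault_count` rather than a stalled engine; a production design would also bound how long a held value may be used before moving to a degraded-normal mode.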


Summary

Most large SW-intensive system developments fail
Public safety and economic security are forcing government & legal systems to recognize the importance of software
Planning and risk management practices are key to any solution
Good systems engineering practices must be adapted to software’s special characteristics


Bibliography - I

Armour, Phillip, The Five Orders of Ignorance, Communications of the ACM, October 2000

Armour, Phillip, Project Portfolios: Organizational Management of Risk, Communications of the ACM, March 2005

Armour, Phillip, Sarbanes-Oxley and Software Projects, Communications of the ACM, June 2005

Baetjer, H., Software as Capital - An Economic Perspective on Software Engineering, IEEE Computer Society Press, 1997

Boehm, Barry, Win-Win Negotiation Tool, Center for Software Engineering-USC, http://sunset.usc.edu

Cosgrove, J., Software Engineering & Law, IEEE Software, May-June 2001

Humphrey, W. S., Managing the Software Process, Addison Wesley, 1990

Humphrey, Watts, The Future of Software Engineering: V, SEI Interactive, Software Engineering Institute, Carnegie Mellon University, Vol. 5, Num. 1, 1Q 2002, http://interactive.sei.cmu.edu/news@sei/columns/watts_new/watts-new-compiled.pdf

Humphrey, Watts, Why Big Software Projects Fail – The 12 Key Questions, CrossTalk Magazine, March 2005 www.stsc.hill.af.mil


Bibliography - II

Lawson, Harold W., An Assessment Methodology for Safety Critical Systems, Lidingo, Sweden, [email protected]

Los Angeles Times, System Failure Snarls Air Traffic in the Southland, 9/15/2004

Los Angeles Times, Human Errors Silenced Airports, 9/16/2004

Los Angeles Times, New FBI Software May Be Unusable, 1/13/2005

Los Angeles Times, Prius Glitches Highlight Problems of Car Computers, 5/18/2005

Lister, T. & DeMarco, T., Both Sides Always Lose: Litigation of Software-Intensive Contracts, CrossTalk, 2/2000, www.stsc.hill.af.mil/Crosstalk/2000/feb/demarco.asp

Parnas, David L., Licensing Software Engineers in Canada, Communications of the ACM, 11/2002

Poore, Jesse H., A Tale of Three Disciplines … and a Revolution, IEEE Computer, 1/2004

Research Triangle Institute, The Economic Impacts of Inadequate Infrastructure for Software Testing, NIST Planning Report 02-3, www.nist.gov