Mars and Beyond: NASA's Software Challenges in the 21st Century
Dr. Michael R. Lowry, NASA Ames Research Center
December 5, 2003




Slide 2

Outline

• NASA's Mission
• Role of Software within NASA's Mission
• The Challenge: Enable Dependable SW-based Systems
• Technical Challenges
  – Scaling!
  – System-software barrier
  – Software is opaque and brittle in the large

• Reasons for Optimism

Slide 3

NASA's Vision
• To improve life here
• To extend life to there
• To find life beyond

NASA's Mission
• To understand and protect our home planet
• To explore the universe and search for life
• To inspire the next generation of explorers

…as only NASA can

5 Strategic Enterprises, One NASA:
• Space Science
• Earth Science
• Biological & Physical Research
• HEDS (Human Exploration and Development of Space)
• Aerospace Technology

Slide 4

Software Growth in Aerospace Missions
Software Enables NASA's Missions

[Chart: flight software size, "Instructions (Equivalent Memory Locations in K)" on a log scale from 1 to 10,000, vs. year, 1960–1995, doubling every 3 or 4 years. Unpiloted systems plotted include Mercury, Venus, Mariner, Titan, Titan IIIC, Surveyor, Viking, Voyager, Galileo, and the Pershing 1/1A/II, Poseidon C3, and Trident C4 missiles. Piloted systems plotted include Mercury 3, Gemini 3 and 8, Apollo 7, Shuttle OFT and Shuttle Operational, and the P-3A, F-111, C-5A, A7D/E, AWACS, B-1A/B, F-15E, F-16 C/D, B-2, C-17 (projected), and F-22 (projected) aircraft. Source: AF Software Technology Support Center]

Slide 5

The Challenge: Software Risk Factors

[Figure: Perrow-style risk chart plotting systems along two axes, INTERACTIONS (linear → complex) and COUPLING (loose → tight). Low-risk examples include the post office, most manufacturing, trade schools, and junior colleges; universities, mining, R&D firms, and military actions have complex interactions but loose coupling; dams, power grids, rail transport, marine transport, and airways are tightly coupled. Space missions sit in the highest-risk quadrant, tight coupling combined with complex interactions, alongside nuclear plants, chemical plants, aircraft, and military early-warning systems.]

Slide 6

Mars Climate Orbiter

• Launched: 11 Dec 1998
• Mission: interplanetary weather satellite; communications relay for Mars Polar Lander
• Fate: arrived 23 Sept 1999; no signal received after initial orbit insertion
• Cause: faulty navigation data, caused by a failure to convert imperial to metric units

Slide 7

MCO Events

• Locus of error
  – A ground software file called "Small Forces" gives thruster performance data
  – This data is used to process telemetry from the spacecraft:
    • The spacecraft signals each Angular Momentum Desaturation (AMD) maneuver
    • Small Forces data is used to compute its effect on the trajectory
    • The software underestimated the effect by a factor of 4.45 (see the worked conversion below)
• Cause of error
  – Small Forces data was given in pound-force seconds (lbf·s)
  – The specification called for newton-seconds (N·s)
• Result of error
  – As the spacecraft approached orbit insertion, the trajectory was corrected, aiming for a periapse of 226 km on the first orbit
  – Estimates were adjusted as the spacecraft approached orbit insertion:
    • 1 week prior: first periapse estimated at 150–170 km
    • 1 hour prior: this was down to 110 km
    • The minimum periapse considered survivable is 80 km
  – MCO entered Mars occultation 49 seconds earlier than predicted
    • The signal was never regained after the predicted 21-minute occultation
    • Subsequent analysis estimates a first periapse of 57 km
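To see where the factor of 4.45 comes from: 1 pound-force equals 4.448 newtons, so an impulse recorded in lbf·s but read as if it were N·s understates the true impulse by exactly that factor. A small illustrative Java check (the impulse value is hypothetical; only the conversion factor is real):

  public class McoUnitError {
      static final double NEWTONS_PER_POUND_FORCE = 4.448;

      public static void main(String[] args) {
          double fileValue = 10.0;   // hypothetical AMD impulse written to the Small Forces file, in lbf·s
          double assumedNs = fileValue;                            // ground software read it as N·s
          double actualNs  = fileValue * NEWTONS_PER_POUND_FORCE;  // true impulse in N·s
          System.out.printf("assumed %.1f N·s, actual %.1f N·s, underestimated by %.2fx%n",
                            assumedNs, actualNs, actualNs / assumedNs);
      }
  }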

Slide 8

Contributing Factors

• For the first 4 months, AMD data was unusable due to file format errors
  – Navigators calculated the data by hand
  – The file format was fixed by April 1999
  – Anomalies in the computed trajectory became apparent almost immediately
• Limited ability to investigate the anomalies:
  – Thrust effects are measured along the Earth–spacecraft line of sight using doppler shift
  – AMD thrusts are mainly perpendicular to that line of sight
• Failure to communicate between teams:
  – E.g., the issue tracking system was not properly used by the navigation team
    • Anomalies were not properly investigated
• Inadequate staffing
  – The operations team was monitoring three missions simultaneously (MGS, MCO and MPL)
• Operations navigation team unfamiliar with the spacecraft
  – Different team from the development and test team
  – This team did not fully understand the significance of the anomalies
  – Assumed familiarity with the previous mission (Global Surveyor) was sufficient:
    • did not understand why AMD was performed 10–14 times more often
    • (MCO had asymmetric solar panels, whereas MGS had symmetric panels)
• Inadequate testing
  – The Software Interface Specification was not used during unit testing of the small forces software
  – End-to-end testing of the ground software was never completed
  – Ground software was not considered "mission critical", so it didn't have independent V&V
• Inadequate reviews
  – Key personnel were missing from critical design reviews

Slide 9

Analysis

• Software size S is increasing exponentially (doubling every three or four years)
• Errors, cost over-runs, and schedule slip are due primarily to non-local dependencies during integration: they grow as S^N with N < 2 (best calibration: N = 1.2)

[Chart: errors grow faster than SW size]

Source: Professor Barry Boehm (software cost modeling)

Slide 10

Predicted Errors as LOC Grows: Current SW Practices/Technology

Errors = e · S^N, where S is the number of modules (LOC/M) and the error rate e = 1/10,000 (reproduced in the sketch below)

[Chart: predicted errors vs. lines of code on log–log axes, with Cassini and MPL marked as data points:

  Lines of Code    Predicted Errors
  10K              0.025
  100K             0.4
  1M               6.3
  10M              100 ]
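The chart's callouts follow directly from the formula. A short Java sketch that reproduces them (N = 1.2 comes from the previous slide; the module size of 100 LOC is my assumption, but with it the computed values match the plotted 0.025, 0.4, 6.3, and 100):

  public class ErrorGrowth {
      public static void main(String[] args) {
          double e = 1.0 / 10_000;   // error rate per module, from the slide
          double n = 1.2;            // calibration from the Analysis slide
          double m = 100;            // assumed LOC per module
          for (double loc : new double[] {1e4, 1e5, 1e6, 1e7}) {
              double s = loc / m;    // S = number of modules
              System.out.printf("%,10.0f LOC -> %7.3f predicted errors%n",
                                loc, e * Math.pow(s, n));
          }
      }
  }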

Slide 11

Future Mars Exploration: MSL (Mars Smart Lander) and MSR (Mars Sample Return)

Slide 12

Beyond Mars: JIMO (Jupiter Icy Moons Orbiter) and TPF (Terrestrial Planet Finder)

Slide 13

Technical Challenges and Opportunities

• System-software barrier (verification is easy, validation is hard)
• Software is transparent and malleable in the small…
  – but opaque and brittle in the large
• General-purpose software dependability tools work well in the small
  – but fail to scale to systems in the large

But there is Reason for Optimism:
• Align software architectures with system analysis
• Success of formal methods in the related field of digital hardware
• Scaling through specialization
• Divide and conquer: compositional reasoning
• Beyond correctness: exploiting the lattice between true and false for software understanding
• Providing the research community with realistic experimental testbeds at scale

Slide 14

Scaling through Specialization: Practical Static Analysis

[Chart: static analyzers plotted by scalability (50 KLoc → 500 KLoc → 1 MLoc) against precision (80%–95%). General-purpose analyzers (PolySpace C-Verifier, DAEDALUS, Coverity) trade precision against scalability; specialized analyzers such as C Global Surveyor (NASA Ames) are positioned toward both high scalability and high precision on their target class of code.]

Slide 15

Explaining the Cause of an Error

Code with a transient error (a corrected version follows at the end of this slide):

  void add(Object o) { buffer[head] = o; head = (head+1)%size; }
  Object take() { tail = (tail+1)%size; return buffer[tail]; }

• Hard to show the error: testing cannot reliably make the error appear, since doing so may require specific environment actions (inputs) or scheduling (for concurrency errors)
• Hard to find the cause of the error: even once we know a way to show the error, it is difficult to localize its root cause

A software model checker (JPF) can automatically find a trace that shows the error appearing, and produces an error trace. From the error trace and the original program, we can then automatically find an explanation for the error and localize its cause. The algorithm uses model checking to first find similar traces that also cause the error (negatives) and traces that do not cause the error (positives):
• Set of negatives: traces that show different versions of the error
• Set of positives: traces that don't show the error

Analysis:
1. Source-code similarities explain control errors: code that appears only in negatives; in all negatives; and in only and all negatives (causal)
2. Data invariants explain errors in data
3. Minimal transformations that create a negative from a positive show the essence of an error
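The transient error above arises because add and take have no full/empty guards and no synchronization, so interleaved calls can overwrite unconsumed slots or re-read stale ones. For contrast, a minimal corrected sketch, assuming the intended behavior is a blocking bounded FIFO (this repair is illustrative, not code from the deck):

  class BoundedBuffer {
      private final Object[] buffer;
      private int head = 0, tail = 0, count = 0;

      BoundedBuffer(int size) { buffer = new Object[size]; }

      synchronized void add(Object o) throws InterruptedException {
          while (count == buffer.length) wait();  // guard: never overwrite an unconsumed slot
          buffer[head] = o;
          head = (head + 1) % buffer.length;
          count++;
          notifyAll();
      }

      synchronized Object take() throws InterruptedException {
          while (count == 0) wait();              // guard: never read an empty slot
          Object o = buffer[tail];
          tail = (tail + 1) % buffer.length;
          count--;
          notifyAll();
          return o;
      }
  }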

Slide 16

Generalized Symbolic Execution for Model Checking and Testing

Our novel symbolic execution framework:
• extends model checking to programs that have complex inputs with unbounded (very large) data
• automates test input generation

Framework: the input program is instrumented, and the instrumented program is model checked [heap + constraints + thread scheduling] against a correctness specification, with decision procedures deciding whether to continue or backtrack; the output is a set of counterexamples / a test suite. The architecture is modular: different model checkers/decision procedures can be used.

Future mission software:
• is concurrent
• uses complex, dynamically allocated data structures (e.g., lists or trees)
• is highly interactive, with complex inputs and a large environment
• should be extremely reliable

Example: the Rover Executive takes a plan as input (complex input structure), reads environment/rover status and executes actions (large environment data), and uses concurrency and dynamic data (lists, trees):

  void Executive::startExecutive() { runThreads(); … }
  void Executive::executePlan(…) { while (!empty) executeCurrentPlanNode(); } …

Current practice in checking complex software:
• testing: requires manual input; is typically done for a few nominal input cases; is not good at finding concurrency bugs; is not good at dealing with complex data structures
• model checking: automatic and good at finding concurrency bugs, but not good at dealing with complex data structures, and feasible only with a small environment and a small set of input values

Example code:

  class Node {
    int elem;
    Node next;
    Node deleteFirst() {
      if (elem < 10) return next;
      else if (elem < 0) assert(false);
      …
    }
  }

Analysis of deleteFirst with our framework: "simulate" the code using symbolic values instead of program data, and enumerate the input structures lazily (uninitialized values are shown as "unknown yet"; the lazy step is sketched below).

[Figure: symbolic execution tree for deleteFirst under the precondition "acyclic list". Lazy initialization enumerates the input structures (null, a single node e0, nodes e0 → e1, …); decision procedures handle the numeric constraints, lazy initialization the structural constraints. The path constraints e0 < 10 and e0 ≥ 10 are each satisfiable (true), but e0 ≥ 10 ∧ e0 < 0 is unsatisfiable (FALSE), so the assertion cannot fail.]
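To make the lazy-initialization idea concrete, a hedged Java sketch (the helper choose() and the bookkeeping are hypothetical stand-ins, not the actual JPF implementation): when execution first touches an uninitialized reference field, the checker branches nondeterministically over null, a fresh symbolic node, and aliases to nodes it has already materialized.

  import java.util.ArrayList;
  import java.util.List;

  class Node { int elem; Node next; }       // the slide's Node class

  class LazyInit {
      List<Node> seen = new ArrayList<>();  // nodes materialized so far

      Node initField(Node field, boolean initialized) {
          if (initialized) return field;    // already concrete: nothing to do
          int choice = choose(2 + seen.size());
          if (choice == 0) return null;     // case 1: the field is null
          if (choice == 1) {                // case 2: a fresh node; its elem stays
              Node fresh = new Node();      //         symbolic ("unknown yet")
              seen.add(fresh);
              return fresh;
          }
          // case 3: alias to an already-seen node; under the "acyclic list"
          // precondition such choices would close a cycle and are pruned
          return seen.get(choice - 2);
      }

      static int choose(int n) {            // stand-in for the model checker's
          return 0;                         // nondeterministic choice over 0..n-1
      }
  }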

Slide 17

System-Level Verification

[Diagram: a design level (architecture description + module specifications) above an implementation level (code modules), connected by "check" and "monitor" arrows.]

At the design level:
• check (system-level) integration properties based on module specifications
• use the module hierarchy and interfaces for incremental abstraction
• architectural patterns are potentially reusable
• generate module/environment assumptions

At the implementation level:
• check implementation modules against their design specifications
• monitor properties that cannot be verified
• monitor environment assumptions (see the sketch below)
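As an illustration of the "monitor environment assumptions" activity, a minimal hypothetical Java monitor (the names and the 5-second bound are invented for the example): it checks at run time an assumption that design-level verification had to leave open.

  // Hypothetical runtime monitor for the assumption "the environment
  // delivers a status message at least every 5 seconds".
  class AssumptionMonitor {
      private static final long MAX_SILENCE_MS = 5_000;   // assumed bound
      private volatile long lastMessageMs = System.currentTimeMillis();

      void onEnvironmentMessage() {                       // called when evidence arrives
          lastMessageMs = System.currentTimeMillis();
      }

      void check() {                                      // called periodically
          if (System.currentTimeMillis() - lastMessageMs > MAX_SILENCE_MS) {
              // the assumption the verification relied on no longer holds;
              // raise a fault so the system can react (e.g., go to safe mode)
              throw new IllegalStateException("environment assumption violated");
          }
      }
  }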

Slide 18

Module Verification

• Modules may require context information to satisfy a property
• Assumption || Module ╞ Property (assume-guarantee reasoning; stated as a proof rule below)

[Diagram: a Module inside an Environment, with interface actions a, b, c; the Assumption mediates between the Environment and the Property.]

How are assumptions obtained?
• The developer encodes them
• Abstractions of the environment, if known
• Automatically generate the exact assumption A:
  – for any environment E, (E || Module ╞ Property) IFF E ╞ A
• Demonstrated on the Rover example

Automated Software Engineering 2002
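The underlying proof rule, in standard assume-guarantee notation (my rendering; the triple ⟨A⟩ M ⟨P⟩ reads "under assumption A, module M guarantees property P", i.e., the slide's Assumption || Module ╞ Property):

\[
\frac{\langle A \rangle\; M \;\langle P \rangle \qquad \langle \mathit{true} \rangle\; E \;\langle A \rangle}
     {\langle \mathit{true} \rangle\; E \parallel M \;\langle P \rangle}
\]

If the environment E discharges the assumption A, the composed system E || M satisfies P; the "exact assumption" above is the weakest such A.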

Slide 19

Mission Manager Viewpoint

Asking the Right Questions:
• When can we stop testing?
• What process should we use?
• What is the value of formal methods?

Qualitative Correlative Model:
• Peer review is superior to testing for incorrect specs
• Model checking for uncertain environments

Quantitative Predictive Model:
• Mission trade studies: how much cost for acceptable risk
• Development: optimize the use of assurance technologies
• Mission: increase the use of CPU cycles for software monitoring

Slide 20

HDCP Goals

The overall mission of the HDCP (High Dependability Computing Program) project is to increase NASA's ability to engineer highly dependable software systems.

Method:
• Science of Dependability:
  – Develop better ways to measure and predict software dependability: What are the potential measurables for the various attributes? How can we move past the present surrogates and approach the artifact more directly?
  – Empirical evaluation: of NASA and NASA-contractor dependability problems, and of technologies and engineering principles to address those problems
• Testbeds: development of realistic testbeds for empirical evaluation of technologies and attributes
• Intervention technologies

Slide 21

Active MDS Testbed Projects

• Golden Gate Project
  – Demonstrate that RT-Java is suitable for mission systems
  – Drive an MDS/RTSJ rover at JavaOne
  – Collaborators: Champlin, Giovannoni
• SCRover Project
  – Develop a rover testbed
  – Collect defect and process data for an experience base
  – Collaborators: Boehm, Madachy, Medvidovic, Port
• Dependability cases
  – Develop dependability cases for time management and software architectures
  – Collaborators: Goodenough, Weinstock, Maxion, Hudak
• Analysis of MDS architectural style
  – Analysis based on MDS's use of architectural component types
  – Collaborators: Garlan
• Process improvement
  – Data collection from mainline MDS and SCRover development efforts
  – Collaborators: Johnson, Port

Slide 22

MDS in 1 Minute

Approach: product-line practice to exploit commonalities across missions:
• An information and control architecture to which missions/products conform
• A systems engineering process that is analytical, disciplined, and methodical
• Reusable and adaptable framework software

Problem Domain: mission information, control, and operations of physical systems
• Developed for unmanned space science missions
• Scope includes flight, ground, and simulation/test
• Applicable to robots that operate autonomously to achieve goals specified by humans
• Architecturally suited for complex systems where "everything affects everything"

MDS Products:
• Unified flight, ground, and test architecture
• Orderly systems engineering methodology
• Frameworks (C++ and Java)
• Processes, tools, and documentation
• Examples
• Reusable software

Slide 23

Managing Interactions

• Complex interactions make software difficult
  – Elements that work separately often fail to work together
  – The combinatorics of interaction is staggering, so it's not easy to get right
  – This is a major source of unreliability
• There are two approaches to this in MDS:
  – Component-Based Architecture: handles interactions among elements of the system software; inward looking; addresses software engineering issues
  – State-Based Architecture: handles interactions among elements of the system under control; outward looking; addresses systems engineering issues

"A unified approach to managing interactions is essential"

Slide 24

MDS is… a State-Based Architecture

• State variables hold state values, including degree of uncertainty
• Estimators interpret measurement and command evidence to estimate state
• Controllers issue commands, striving to achieve goals
• Hardware proxies provide access to hardware busses, devices, and instruments
• Models express mission-specific relations among states, commands, and measurements
• A goal is a constraint on the value of a state variable over a time interval

Key features (sketched as interfaces below):
• Systems analysis/design organized around states and models
• State control architecturally separated from state determination
• System operated via specifications of intent: goals on state
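A minimal hypothetical sketch of these roles as Java interfaces (illustrative only; the names are mine, and the real MDS frameworks in C++ and Java are far richer):

  import java.util.function.Predicate;

  // A goal is a constraint on the value of a state variable over a time interval.
  class Goal<T> {
      final Predicate<T> constraint;
      final long startTime, endTime;
      Goal(Predicate<T> constraint, long startTime, long endTime) {
          this.constraint = constraint; this.startTime = startTime; this.endTime = endTime;
      }
  }

  // State variables hold state values, including a degree of uncertainty.
  class StateEstimate<T> {
      final T value;
      final double uncertainty;
      StateEstimate(T value, double uncertainty) { this.value = value; this.uncertainty = uncertainty; }
  }

  interface StateVariable<T> { StateEstimate<T> current(); }

  // Estimators interpret measurement and command evidence to estimate state;
  // controllers issue commands, striving to achieve goals. Keeping the two
  // architecturally separate is a key MDS feature.
  interface Estimator { void update(Object evidence); }
  interface Controller<T> { void pursue(Goal<T> goal); }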

Slide 25

From theory to flight… JPL Transition Path

• Mars Smart Lander (MSL) technology infusion
  – Scheduled launch: 2009
  – MSL has baselined MDS technology
    • System engineering
    • Software frameworks
  – MSL technology gates
    • PMSR: August 2004
    • Integrated demo: June 2005
    • PDR: February 2006
• MSL sample technology categories
  – Software architecture with infused technologies
  – Verification and validation tools and methodologies
  – Processes and supporting tools
  – Cost modeling for system engineering, software adaptation, and autonomy validation

MDS-compatible technologies are directly relevant to MSL.

Slide 26

Conclusions

• System-software barrier (verification is easy, validation is hard)
• Software is transparent and malleable in the small…
  – but opaque and brittle in the large
• General-purpose software dependability tools work well in the small
  – but fail to scale to systems in the large

But there is Reason for Optimism:
• Align software architectures with system analysis
• Success of formal methods in the related field of digital hardware
• Scaling through specialization
• Divide and conquer: compositional reasoning
• Beyond correctness: exploiting the lattice between true and false for software understanding
• Providing the research community with realistic experimental testbeds at scale