Skoll: A System for Distributed Continuous Quality Assurance
Atif Memon & Adam Porter
University of Maryland
{atif,aporter}@cs.umd.edu
http://www.cs.umd.edu/{projects/skoll,~atif/GUITARWeb/}
Quality Assurance for Large-Scale Systems
• Modern systems increasingly complex
– Run on numerous platform, compiler & library combinations
– Have 10s, 100s, even 1000s of configuration options
– Are evolved incrementally by geographically-distributed teams
– Run atop other frequently changing systems
– Have multi-faceted quality objectives
• How do you QA systems like this?
Distributed Continuous Quality Assurance
• QA processes conducted around-the-world, around-the-clock on powerful, virtual computing grids
– Grids can be made up of end-user machines, project-wide resources or dedicated computing clusters
• General Approach
– Divide QA processes into numerous tasks
– Intelligently distribute tasks to clients who then execute them
– Merge and analyze incremental results to efficiently complete desired QA process
• Expected benefits
– Massive parallelization allows more, better & faster QA
– Improved access to resources/environs. not readily found in-house
– Carefully coordinated QA efforts enable more sophisticated analyses
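The general approach can be sketched as a divide/distribute/merge loop. This is a minimal illustration, not Skoll's actual API: the task shapes and the pass/fail rule are purely made up.

```python
# Hypothetical sketch of the DCQA divide / distribute / merge loop.
# Task contents and the pass/fail rule are illustrative only.

def divide(configs, n_tasks):
    """Split the QA process (a pool of configurations) into tasks."""
    return [configs[i::n_tasks] for i in range(n_tasks)]

def execute_on_client(task):
    """Stand-in for remote execution on a grid client: run each
    configuration's tests and report a verdict."""
    return {cfg: ("PASS" if cfg % 2 == 0 else "FAIL") for cfg in task}

def dcqa_run(configs, n_clients=4):
    results = {}
    for task in divide(configs, n_clients):
        # in Skoll, clients run concurrently; here we merge sequentially
        results.update(execute_on_client(task))
    return results

results = dcqa_run(list(range(10)))
print(sorted(cfg for cfg, r in results.items() if r == "FAIL"))
```

The point is only the shape: tasks are distributed, and incremental results are merged into one picture of the QA process.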
Collaborators
Sandro Fouché, Alan Sussman, Cemal Yilmaz (now at IBM TJ Watson) & Il-Chul Yoon
Alex Orso & Myra Cohen
Murali Haran, Alan Karr, Mike Last, & Ashish Sanil
Doug Schmidt & Andy Gokhale
Skoll DCQA Infrastructure & Approach
See: A. Porter, C. Yilmaz, A. Memon, A. Nagarajan, D. C. Schmidt, and B. Natarajan, Skoll: A Process and Infrastructure for Distributed Continuous Quality Assurance. IEEE Transactions on Software Engineering, August 2007, 33(8), pp. 510-525.
Clients execute the distributed QA tasks; the server-side process proceeds in five steps:
1. Model
2. Reduce Model
3. Distribution (clients send test requests, receive test resources)
4. Feedback (clients return test results)
5. Steering
The ACE+TAO+CIAO (ATC) System
• ATC characteristics
– 2M+ line open-source CORBA implementation
– maintained by 40+, geographically-distributed developers
– 20,000+ users worldwide
– Product Line Architecture with 500+ configuration options
• runs on dozens of OS and compiler combinations
– Continuously evolving – 200+ CVS commits per week
– Quality concerns include correctness, QoS, footprint, compilation time & more
Define QA Space

Options                       Type               Settings
Operating System              compile-time       {Linux, Windows XP, …}
TAO_HAS_MINIMUM_CORBA         compile-time       {True, False}
ORBCollocation                runtime            {global, per-orb, no}
ORBConnectionPurgingStrategy  runtime            {lru, lfu, fifo, null}
ACE_version                   component version  {v5.4.3, v5.4.4, …}
TAO_version                   component version  {v1.4.3, v1.4.4, …}
run(ORT/run_test.pl)          test case          {True, False}

Constraints
(TAO_HAS_AMI) → (¬TAO_HAS_MINIMUM_CORBA)
run(ORT/run_test.pl) → (¬TAO_HAS_MINIMUM_CORBA)
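The constraints can be checked mechanically against candidate configurations. A minimal sketch, treating each option as boolean and using "not A or B" for implication; `run_ORT_test` is a hypothetical stand-in for run(ORT/run_test.pl):

```python
from itertools import product

# Sketch: encode a fragment of the QA space and its constraints.
# Option names follow the slide; settings reduced to booleans.
OPTIONS = {
    "TAO_HAS_MINIMUM_CORBA": [True, False],
    "TAO_HAS_AMI": [True, False],
    "run_ORT_test": [True, False],   # stand-in for run(ORT/run_test.pl)
}

def implies(a, b):
    return (not a) or b

def valid(cfg):
    # (TAO_HAS_AMI) -> (not TAO_HAS_MINIMUM_CORBA)
    # run(ORT/run_test.pl) -> (not TAO_HAS_MINIMUM_CORBA)
    return (implies(cfg["TAO_HAS_AMI"], not cfg["TAO_HAS_MINIMUM_CORBA"])
            and implies(cfg["run_ORT_test"], not cfg["TAO_HAS_MINIMUM_CORBA"]))

cfgs = [dict(zip(OPTIONS, vals)) for vals in product(*OPTIONS.values())]
valid_cfgs = [c for c in cfgs if valid(c)]
print(len(cfgs), len(valid_cfgs))  # constraints prune the raw space
```

Even this toy space shrinks from 8 raw configurations to 5 valid ones; the real 500+-option space shrinks far more dramatically.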
Nearest Neighbor Search
[Figure: classification tree over options CORBA_MESSAGING, AMI, AMI_CALLBACK & AMI_POLLER (settings 0/1), with leaf verdicts OK, ERR-1, ERR-2, ERR-3]
Fault Characterization
• We used machine learning techniques (classification trees) to model option & setting patterns that predict test failures
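As a much-simplified stand-in for the classification-tree analysis (the actual work builds trees; this toy merely intersects failing configurations), with made-up data echoing the figure's option names:

```python
# Toy fault characterization: find (option, setting) pairs present in
# every failing configuration and in no passing one. Data is illustrative,
# not from the ATC study.

runs = [
    ({"AMI": 1, "CORBA_MESSAGING": 1}, "FAIL"),
    ({"AMI": 1, "CORBA_MESSAGING": 0}, "FAIL"),
    ({"AMI": 0, "CORBA_MESSAGING": 1}, "PASS"),
    ({"AMI": 0, "CORBA_MESSAGING": 0}, "PASS"),
]

def characterize(runs):
    fails = [cfg for cfg, r in runs if r == "FAIL"]
    passes = [cfg for cfg, r in runs if r == "PASS"]
    candidates = set(fails[0].items())
    for cfg in fails[1:]:
        candidates &= set(cfg.items())           # present in all failures
    return {c for c in candidates
            if all(c not in cfg.items() for cfg in passes)}  # absent from passes

print(characterize(runs))
```

Here the failing subspace is characterized by AMI = 1, which is exactly the kind of compact, developer-readable pattern the tree models produce.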
Applications & Feasibility Studies
• Compatibility testing of component-based systems
• Configuration-level fault characterization
• Test case generation & input space exploration
See: I. Yoon, A. Sussman, A. Memon & A. Porter, Direct-Dependency-based Software Compatibility Testing. International Conference on Automated Software Engineering, Nov. 2007 (to appear).
Compatibility Testing of Comp-Based Systems
Goal
• Given a component-based system, identify components & their specific versions that fail to build
Solution Approach
• Sample the configuration space, efficiently test this sample & identify subspaces in which compilation & installation fails
– Initial focus on building & installing components. Later work will add functional and performance testing
The InterComm (IC) Framework
• Middleware for coupling large scientific simulations
– Built from up to 14 other components (e.g., PVM, MPI, GCC, OS)
– Each comp can have several actively maintained versions
– There are complex constraints between components, e.g.,
• Requires GCC version 2.96 or later
• When configured with multiple GNU compilers, all must have the same version number
• When configured with multiple comps that use MPI, all must use the same implementation & version
– http://www.cs.umd.edu/projects/hpsl/chaos/ResearchAreas/ic
• Developers need help to
– Identify working/broken configurations
– Broaden working set (to increase potential user base)
– Rationally manage support activities
Annotated Comp. Dependency Graph
• ACDG = (CDG, Ann)
– CDG: DAG capturing inter-comp deps
– Ann: comp. versions & constraints
• Constraints for each cfg, e.g.,
– ver(gf) = x → ver(gcr) = x
– ver(gf) = 4.1.1 → ver(gmp) ≥ 4.0
• Can generate cfgs from ACDG
– 3552 total cfgs. Takes up to ~10,700 CPU hrs to build all
Improving Test Execution
• Cfgs often share common build subsequences. This build effort should be reusable across cfgs
• Combine all cfgs into a data structure called a prefix tree
• Execute implied test plan across grid by (1) assigning subpaths to clients, (2) building each subcfg in a VM & caching the VMs to enable reuse
• Example: with 8 machines, each able to cache up to 8 VMs, exhaustive testing takes up to 355 hours
[Figure: prefix tree of component builds, labeled with fc 4.0, gcr v4.0.3, gmp v4.2.1 & pf/6.2]
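The prefix-tree idea can be sketched in a few lines. The component names echo the figure labels, and the build counts are illustrative; caching of VM snapshots is abstracted into "one tree node = one build performed":

```python
# Sketch: configurations that share a build prefix share a tree node,
# so the shared prefix is built (and its VM cached) only once.

def build_prefix_tree(cfgs):
    root = {}
    for cfg in cfgs:              # each cfg is an ordered build sequence
        node = root
        for comp in cfg:
            node = node.setdefault(comp, {})
    return root

def count_builds(tree):
    """Nodes in the tree = distinct (sub)builds actually performed."""
    return sum(1 + count_builds(child) for child in tree.values())

cfgs = [
    ("fc-4.0", "gcr-4.0.3", "gmp-4.2.1"),
    ("fc-4.0", "gcr-4.0.3", "gmp-4.1.4"),
    ("fc-4.0", "gcr-4.1.1", "gmp-4.2.1"),
]
naive = sum(len(c) for c in cfgs)     # builds without any sharing
shared = count_builds(build_prefix_tree(cfgs))
print(naive, shared)                  # sharing prefixes saves builds
```

Three configurations naively cost 9 component builds; sharing the common prefixes reduces that to 6, and the savings grow with the number of configurations.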
Direct-Dependency (DD) Coverage
• Hypothesis: A comp’s build process is most likely to be affected by the comps on which it directly depends
– A directly depends on B iff there is a path (in the CDG) from A to B containing no intermediate component nodes
• Sampling approach
– Identify all DDs between every pair of components
– Identify all valid instantiations of these DDs (ver. combs that violate no constraints)
– Select a (small) set of cfgs that cover all valid instantiations of the DDs
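A minimal sketch of the sampling approach, with a hypothetical two-dependency graph and constraint (not InterComm's real ACDG):

```python
from itertools import product

# Sketch of DD-coverage sampling. Components, versions & the constraint
# are illustrative, not InterComm's actual dependency graph.

deps = [("IC", "gcc"), ("IC", "mpi")]           # direct dependencies (A, B)
versions = {"IC": ["1.5"], "gcc": ["3.4", "4.1"], "mpi": ["lam", "open"]}

def ok(cfg):
    # hypothetical constraint: gcc 3.4 incompatible with "open" MPI
    return not (cfg["gcc"] == "3.4" and cfg["mpi"] == "open")

all_cfgs = [dict(zip(versions, vs)) for vs in product(*versions.values())]
valid = [c for c in all_cfgs if ok(c)]

def dd_instances(c):
    """Valid DD instantiations exercised by configuration c."""
    return {(a, c[a], b, c[b]) for a, b in deps}

targets = set().union(*(dd_instances(c) for c in valid))

chosen, covered = [], set()                     # greedy set cover
while covered != targets:
    best = max(valid, key=lambda c: len(dd_instances(c) - covered))
    chosen.append(best)
    covered |= dd_instances(best)

print(len(valid), len(chosen))                  # DD suite smaller than exhaustive
```

Even in this toy space the DD-covering suite (2 configurations) is smaller than the valid exhaustive set (3); at InterComm's scale the same idea cuts 3552 configurations down to 211.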
Executing the DD Coverage Test Suite
• DD test suite much smaller than exhaustive
– 211 cfgs with 649 comps vs 3552 cfgs with 9919 comps
– For IC, no loss of test effectiveness (same build failures exposed)
• Speedups achieved using 8 machines w/ 8 VM cache
– Actual case: 2.54 (18 vs 43 hrs)
– Best case: 14.69 (52 vs 355 hrs)
Summary
• Infrastructure in place & working
– Complete client/server implementation using VMware
– Simulator for large scale tests on limited resources
• Initial results promising, but lots of work remains
• Ongoing activities
– Alternative algorithms & test execution policies
– More theoretical study of sampling & test exec approaches
– Apply to more software systems
See: C. Yilmaz, M. Cohen, A. Porter, Covering Arrays for Efficient Fault Characterization in Complex Configuration Spaces, ISSTA’04, TSE v32 (1)
Configuration-Level Fault Characterization
Goal
• Help developers localize configuration-related faults
Current Solution Approach
• Use covering arrays to sample the cfg space to test for subspaces in which (1) compilation fails or (2) reg. tests fail
• Build models that characterize the configuration options and specific settings that define the failing subspace
Covering Arrays
• Compute test schedule from t-way covering arrays
– a set of configurations in which all ordered t-tuples of option settings appear at least once
• 2-way covering array example:

      C1 C2 C3 C4 C5 C6 C7 C8 C9
O1     0  0  0  1  1  1  2  2  2
O2     0  1  2  0  1  2  0  1  2
O3     0  1  2  1  2  0  2  0  1
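The 2-way property of the example array can be verified mechanically: every pair of options must exhibit all 9 ordered setting pairs across the 9 configurations (exhaustive testing would need 27).

```python
from itertools import combinations, product

# The example array: option -> setting per configuration C1..C9.
array = {
    "O1": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "O2": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "O3": [0, 1, 2, 1, 2, 0, 2, 0, 1],
}

def covers_pairs(array, settings=(0, 1, 2)):
    """Check the 2-way covering property: for every pair of options,
    every ordered pair of settings appears in some configuration."""
    for o1, o2 in combinations(array, 2):
        seen = set(zip(array[o1], array[o2]))
        if seen != set(product(settings, settings)):
            return False
    return True

print(covers_pairs(array))
```

Nine configurations thus cover all 3 × 9 = 27 option-setting pairs of the full 27-configuration space.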
Limitations
• Must choose the covering array's strength before computing it
– No way to know, a priori, what the right value is
– Our experience suggests failure patterns can change over time
• Choose too high:
– Run more tests than necessary
– Testing might not finish before next release
– Non-uniform sample negatively affects classification performance
• Choose too low:
– Non-uniform sample negatively affects classification techniques
– Must repeat process at higher strength
Incremental Covering Arrays
• Start with traditional covering array(s) of low strength (usually 2)
• Execute test schedule & classify observed failures
• If resources allow or classification performance requires
– Increment strength
– Build new covering array using previously run array(s) as seeds
See: S. Fouche, M. Cohen and A. Porter. Towards Incremental Adaptive Covering Arrays, ESEC/FSE 2007, (to appear)
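A toy sketch of the incremental idea, using a greedy cover over four hypothetical binary options. The published algorithm is more sophisticated; this only shows the seeding step: configurations already run at strength t are reused when building the strength-(t+1) array.

```python
from itertools import combinations, product

# Four binary options (illustrative); the universe is the full factorial.
options = {f"O{i}": [0, 1] for i in range(4)}
names = sorted(options)
universe = [dict(zip(names, vs)) for vs in product([0, 1], repeat=4)]

def tuples_covered(cfgs, t):
    return {(opts, tuple(c[o] for o in opts))
            for c in cfgs for opts in combinations(names, t)}

def all_tuples(t):
    return {(opts, vals) for opts in combinations(names, t)
            for vals in product(*(options[o] for o in opts))}

def extend(seed, t):
    """Greedily add configurations to the seed until all t-tuples are
    covered (a simple stand-in for real covering-array construction)."""
    cfgs = list(seed)
    while tuples_covered(cfgs, t) != all_tuples(t):
        best = max(universe, key=lambda c: len(tuples_covered(cfgs + [c], t)))
        cfgs.append(best)
    return cfgs

ca2 = extend([], 2)     # strength-2 array built from scratch
ca3 = extend(ca2, 3)    # strength-3 array seeded with the cfgs already run
print(len(ca2), len(ca3))
```

Because `ca3` is seeded with `ca2`, every configuration executed at strength 2 counts toward strength 3; only the difference needs new test runs.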
Incremental Covering Arrays (cont.)
Multiple CAs at each level of t
• Use t₁ as a seed for the first (t+1)-way array, (t+1)₁
• To create the i-th t-way array tᵢ, build a seed of size |(t−1)₁| using non-seeded cfgs from (t−1)ᵢ
• If |seed| < |(t−1)₁|, complete the seed with cfgs from (t+1)₁
MySQL Case Study
• Project background
– Widely-used, 2M+ line open-source database project
– Cont. evolving & maint. by geographically-distributed developers
– Dozens of cfg opts & runs on dozens of OS/compiler combos
• Case study using release 5.0.24
– Used 13 cfg opts with 2-12 settings each (> 110k unique cfgs)
– 460 tests per config across a grid of 50 machines
– Executed ~50M tests total using ~25 CPU years
Results
• Built 3 trad and incr covering arrays for 2 ≤ t ≤ 4
– Traditional sizes : 108, 324, 870
– Incremental sizes: 113, 336 (223), 932 (596)
• Incr appr exposed & classified the same failures as trad appr
• Costs depend on t & failure patterns
– Failures at level t : Inc > Trad (4-9%)
– Failures at level < t : Inc < Trad (65-87%)
– Failures at level > t : Inc < Trad (28-38%)
Summary
• New application driving infrastructure improvements
• Initial results encouraging
– Applied process to a configuration space with over 110K cfgs
– Found many test failures corresponding to real bugs
– Incremental approach more flexible than traditional approach. Appears to offer substantial savings in best case, while incurring minimal cost in worst case
• Ongoing extensions
– MySQL continuous build process
– Community involvement starting
– Want to volunteer? Go to http://www.cs.umd.edu
GUI Test Case – Executable By A “Robot”
• JFCUnit, Capture/replay – tedious
• Other interactions – exponential with length
• Test "common" sequences – bad idea
• Model-based techniques – GUITAR (guitar.cs.umd.edu)
• Test case generation
– Cover all edges
See: Atif M. Memon and Qing Xie, Studying the Fault-Detection Effectiveness of GUI Test Cases for Rapidly Evolving Software. IEEE Transactions on Software Engineering, vol. 31, no. 10, 2005, pp. 884-896.
Modeling The Event-Interaction Space
• Event flow graph (EFG)
– Nodes: all GUI events (starting events marked)
– Edges: the "follows" relationship
• Reverse engineering
– Obtained automatically
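Edge-covering test generation from an EFG can be sketched in a few lines. Event names here are illustrative, and a real GUI test case must first reach e1 from a starting event; that prefix step is elided.

```python
# Sketch: an EFG as adjacency lists (edges = the "follows" relationship);
# one length-2 test case per edge covers all edges. Events are illustrative.

efg = {
    "File":  ["Open", "Save"],
    "Open":  ["File", "Edit"],
    "Save":  ["File"],
    "Edit":  ["Copy", "Paste"],
    "Copy":  ["Paste"],
    "Paste": ["Edit"],
}

# Each edge (e1, e2) yields one test case executing e2 right after e1.
tests = [(e1, e2) for e1, follows in efg.items() for e2 in follows]
n_edges = sum(len(v) for v in efg.values())
print(len(tests), n_edges)
```

Edge coverage is the adequacy criterion: the suite size is linear in the number of "follows" edges rather than exponential in sequence length.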
Let's See How It Works!
• Point to the CVS head
– Push the button
– Read error report
• What happens
– Gets code from CVS head
– Builds
– Reverse engineers the event-flow graph
– Generates test cases to cover all the edges
• 2-way covering
– Runs them
• SourceForge.net
– Four applications
[Bar chart: number of faults detected (0–9 scale) for CrosswordSage, GanttProject, FreeMind & JMSN]
Digging Deeper!
• Intuition
– Non-interacting events (e.g., Save, Find)
– Interacting events (e.g., Copy, Paste)
• Key Idea
– Identify interacting events
– Mark the EFG edges (annotated graph)
– Generate 3-way, 4-way, … covering test cases for interacting events only
Identifying Interacting Events
• High-level overview of approach
– Observe how events execute on the GUI
– Events interact if they influence one another’s execution
• Execute event e2; execute event sequence <e1, e2>
• Did e1 influence e2’s execution?
• If YES, then they must be tested further; annotate the <e1, e2> edge in graph
• Use feedback
– Generate seed suite
• 2-way covering test cases
– Run test cases
• Need to obtain sets of GUI states
– Collect GUI run-time states as feedback
– Analyze feedback and obtain interacting event sets
– Generate new test cases
• 3-way, 4-way, … covering test cases
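The interaction test at the heart of this loop can be sketched against a toy GUI state. The "GUI" below is a hand-written state machine standing in for a real application under test:

```python
# Sketch: e1 and e2 interact if running <e1, e2> leaves the GUI in a
# different state than running e2 alone. The events and state are toys.

def run(events, state=None):
    state = dict(state or {"clipboard": "", "text": "abc"})
    for e in events:
        if e == "Copy":
            state["clipboard"] = state["text"]
        elif e == "Paste":
            state["text"] += state["clipboard"]
        elif e == "Find":
            pass                       # read-only: influences nothing
    return state

def interacts(e1, e2):
    """Did e1 influence e2's execution? Compare the resulting states."""
    return run([e2]) != run([e1, e2])

print(interacts("Copy", "Paste"), interacts("Find", "Paste"))
```

Copy changes what Paste does, so that pair is marked for higher-strength testing; Find does not, so Find/Paste sequences need no further coverage.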
Did We Do Better?
• Compare feedback-based approach to 2-way coverage
[Bar chart: number of faults detected (0–9 scale) by the feedback-based approach vs. 2-way coverage for CrosswordSage, GanttProject, FreeMind & JMSN]
Summary
• Manually developed test cases
– JFCUnit, Capture/replay
– Can be deployed and executed by a “robot”
– Too many interactions to test
• Exponential
• The GUITAR Approach
– Develop a model of all possible interactions
– Use abstraction techniques to “sample” the model
• Develop adequacy criteria
– Generate an “initial test suite”; Develop an “Execute tests – collect feedback – annotate model – generate tests” cycle
– Feasibility study & results
Future Work
• Need volunteers for MySQL Build Farm Project
– http://skoll.cs.umd.edu
• Looking for more example systems (help!)
• Continue improving Skoll system
• New problem classes
– Performance and robustness optimization
– Improved use of test data
• Test case ROI analysis
• Configuration advice
• Cost-aware testing (e.g., minimize power, network, disk)
• Use source code analysis to further reduce state spaces
• Extend test generation technology outside GUI applications
• QA for distributed systems
The End