Bioinformatics software testing and quality assurance
Joshua W. K. Ho
Head, Bioinformatics and Systems Medicine Laboratory
Victor Chang Cardiac Research Institute
Winter School in Mathematical and Computational Biology, UQ 2016-07-06
How do I know the output of my program is correct?
How do you know an R package is implemented correctly?
Is running the ‘example’ code alone sufficient? If not, how many test cases do we need? How to generate additional test cases? How do we verify the correctness of the outputs?
Nature 2010 (News Feature)
Nature 2013 (News In Focus)
Science 2013 (Policy Forum)
Nature 2015 (World View)
Clinical application: how can we be sure that our variant calling pipeline is implemented correctly?
Challenge: very hard to check the correctness of the output, especially false negatives
Genome Medicine, 2013
Across five commonly employed variant-calling pipelines, SNV concordance was ~57.4% and indel concordance ~26.8% => Need caution when interpreting results in a genomic medicine setting
Genome Medicine, 2014
Compared results from ANNOVAR and VEP (using ENSEMBL transcripts): matching annotation in only 65% of loss-of-function variants, and 87% of all exonic variants
Why is testing challenging in bioinformatics?
• Lack of rigorous review (compared to journal peer review)
• We do not spend enough time and energy on testing
• Misusing external software / components
  • Often caused by not checking the limitations and scope of the software function
• Over-reliance on ‘validation testing’ – merely checking if the results “make sense”
  • Nonetheless it is often hard to exactly check the correctness of the output of complex algorithms
• Often only rely on a very small number of (simple) test cases
  • A software fault may only show up for some inputs, so a failure may not be observed unless we search the input space widely
  • Will need to use diverse and realistic test cases
Main objectives
• Understand the importance and challenges of software testing in bioinformatics
• Understand basic concepts and techniques in software testing
• Understand how we can implement QA in bioinformatics, especially in translational genomic applications
Software testing concepts and techniques
Some definitions in software testing
Error: A defect in the human thought process, e.g. misspecification of the range of a variable.
Fault: Concrete manifestation of an error within the software, e.g. a bug, use of wrong parameters, or an incorrect software dependency.
Failure: Departure of the operational software system behaviour from user expectation.
Test case: Input and execution condition that is developed for verifying the compliance of the program with a specific requirement.
Oracle: A mechanism to check the correctness of a test result from any given test case.
Case study: testing a kNN classifier
Mary is a bioinformatician in a research lab. Her project studies whether promoter DNA sequences in yeast can distinguish among three groups of genes: highly expressed, highly repressed, and dynamically expressed. She wants to build a supervised classifier of promoter DNA sequences for her research. She downloaded a new R package that was described in a recent publication. The algorithm behind the main function of this package, kNN, is described in the paper and the R ‘help’ page. The ‘example’ code can be executed successfully and seems to produce reasonable-looking outputs, even though she is not 100% sure of the expected outputs.
What does this R package do?
Training class label, training sequence:
Class1, TCGATCGATCGGGGATTAGC
Class1, ACGATCGGGGACGAGCTACCCATG
Class1, CATCGATGGGCTAGCT
Class2, ATCGTGGGCTAGCTAGCCCCCC
Class2, GATGCTAAACGGGGATCGATCA
Class3, ACGTGGATCGAAAAAGCTAGC
Class3, GTAGATCGAAATCGATGCATCGAGC
Class3, ATGCTAGGGCTAGCTAC

Test sequences:
TGATGCGACGATCGATCGCATAC
ACGAGGGGCTAGCTACA
GCTAGCCCATCGATCTAGATCGAGCGATCGA
ACGTTGGCTAGCTACG

k-mer counts, training sequences:
         AA  AT  AC  AG  TA  TT  …
Class1,   8   2   4   5   2   1  …
Class1,   3   5   1   6   7   2  …
Class1,   0   2   3   5   2   1  …
Class2,   2   9   4   0   9   1  …
Class2,   1   4   0   3   0   5  …
Class3,   8   1   1   5   1   1  …
Class3,   6   0   3   7   7   7  …
Class3,   5   2   7   1   6   2  …

k-mer counts, test sequences:
         AA  AT  AC  AG  TA  TT  …
          3   5   1   6   7   2  …
          0   2   3   5   2   1  …
          2   9   4   0   9   1  …
          6   0   3   7   7   7  …

[Figure legend: Class1, Class2, Class3, Test data]
The label of the test data is determined by the most frequently occurring class in the k nearest training instances. If k=3, the test data will be classified to be Class 3 in this example. If there is a tie of the most frequently occurring class, our classifier will return ‘Uncertain’.
Calculate Euclidean distance (using k-mer frequency) between the test data and all training data
Calculation of k-mer frequency
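The classifier described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the code of the R package: the names `kmer_counts`, `knn_classify`, `kmer_len` and `n_neighbors` are hypothetical, raw k-mer counts are used (as in the table above), and ties in distance are broken by label order.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, kmer_len=2):
    """Count overlapping k-mers, reported in a fixed order over the alphabet ACGT."""
    kmers = ("".join(p) for p in product("ACGT", repeat=kmer_len))
    counts = Counter(seq[i:i + kmer_len] for i in range(len(seq) - kmer_len + 1))
    return [counts[k] for k in kmers]


def euclidean(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_seqs, train_labels, test_seq, kmer_len=2, n_neighbors=3):
    """Return the majority label among the nearest neighbours, or 'Uncertain' on a tie."""
    test_vec = kmer_counts(test_seq, kmer_len)
    ranked = sorted(
        (euclidean(kmer_counts(s, kmer_len), test_vec), label)
        for s, label in zip(train_seqs, train_labels)
    )
    votes = Counter(label for _, label in ranked[:n_neighbors]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return "Uncertain"
    return votes[0][0]


# Toy data (not from the slides): two classes with two sequences each.
train_seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
train_labels = ["A", "A", "C", "C"]
print(knn_classify(train_seqs, train_labels, "AAAAAA", n_neighbors=3))  # A
print(knn_classify(train_seqs, train_labels, "AAAAAA", n_neighbors=4))  # Uncertain
```

With four neighbours the vote is split two against two, which exercises the ‘Uncertain’ branch.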
Correct implementation
What are good test cases? What is a good oracle for this program?
This version has a fault.
Correct: sqrt(sum((thisTest - thisTrain)^2))
This version has a fault.
Correct: k==kk
[Figure: test cases are selected from the input space, part of which is failure-causing input. Each test case is executed by the Program Under Test (PUT) and its output is verified by the Oracle, giving either a successful test case or a detected failure. Stages: test case selection, test execution, output verification.]
1. Test Case Selection Problem: how to increase the chance of selecting a test case from the failure-causing inputs
2. Oracle Problem: how to decide whether the output of any given test case is correct
Can we learn from the software testing field?
A good software testing strategy should actively reveal as many faults as possible using a selected set of test cases.
Selection of test cases (input): Special test cases? Random test cases? Test based on program flow? Test based on failure pattern?
Test execution and automation: Which test to execute first, what to execute next? When to stop? Can we automate this process?
Verification of test cases: How to check correctness for the output from large and complex software? Simulation program? Machine learning software?
Test reporting and documentation: Reporting testing for validation and verification
Weyuker (1982) Computer Journal
Example: how to test sin(x)?
sin function:
sin(0°) = 0
sin(30°) = 0.5
Suppose the program returns:
sin(29.8°) = 0.51234 (incorrect)
sin(29.8°) = 0.49876 (correct?)
How do I design test cases without knowing the implementation of the program? E.g.,
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots
Three standard techniques
Special test cases / special value testing: Selected cases where the correct output is known (from external experimental validation or simulation)
N-version programming: Check concordance between multiple implementations or variants of the same initial specification
Expected range: Check that the output is within the ‘expected range’ of values. Even though we cannot determine precisely what the value may be, it is often possible to determine an expected range of values the output should fall in.
Three solutions
Solution 1: special test cases, such as x = 0, π/6, π/2, …
Problem: can only test a small subset of inputs
Solution 2: N-version programming, compare the results of multiple independent versions of sin(x)
Problem: what happens when an inconsistency is detected?
Solution 3: check the expected range, such as visual inspection of the plot of sin(x)
Problem: not quantitative
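The three solutions can be sketched against a toy implementation of sin(x). Everything here is illustrative: `taylor_sin` stands in for the program under test, `math.sin` plays the role of the second independent version, and the helper names and tolerances are assumptions. The range check is a programmatic stand-in for visual inspection.

```python
import math


def taylor_sin(x, terms=10):
    """Program under test: sin(x), x in radians, from the truncated Taylor series."""
    return sum((-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range(terms))


def check_special_values(f, tol=1e-9):
    """Solution 1: compare against inputs whose correct output is known."""
    cases = {0.0: 0.0, math.pi / 6: 0.5, math.pi / 2: 1.0}
    return all(abs(f(x) - expected) <= tol for x, expected in cases.items())


def check_n_version(f, g, inputs, tol=1e-9):
    """Solution 2: return the inputs on which two versions disagree."""
    return [x for x in inputs if abs(f(x) - g(x)) > tol]


def check_range(f, inputs, tol=1e-12):
    """Solution 3: the output of sin must stay within [-1, 1]."""
    return all(-1.0 - tol <= f(x) <= 1.0 + tol for x in inputs)


inputs = [i * 0.1 for i in range(-15, 16)]
print(check_special_values(taylor_sin))
print(check_n_version(taylor_sin, math.sin, inputs))
print(check_range(taylor_sin, inputs))
```

A disagreement reported by `check_n_version` says only that at least one version is wrong, not which one, which is the problem noted for Solution 2.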
Compile a test suite of models that have been solved analytically or using numerical methods. Run each stochastic simulator many times, and check that the results do not deviate substantially from the analytical solution.
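As a minimal sketch of this idea, consider a pure decay model A -> 0, for which the expected number of molecules left at time t is n0 * exp(-rate * t). The simulator below is a small Gillespie-style stand-in written for this example, not one of the simulators discussed in the slides; the function names, parameters and the 5-standard-error threshold are assumptions.

```python
import math
import random


def simulate_decay(n0, rate, t_end, rng):
    """Stochastic simulation of A -> 0; returns the molecules left at t_end."""
    t, n = 0.0, n0
    while n > 0:
        t += rng.expovariate(rate * n)  # waiting time to the next decay event
        if t > t_end:
            break
        n -= 1
    return n


def check_against_analytical(n0=100, rate=1.0, t_end=1.0, runs=2000, seed=1):
    """Compare the simulated mean with the analytical mean n0 * exp(-rate * t_end)."""
    rng = random.Random(seed)
    mean = sum(simulate_decay(n0, rate, t_end, rng) for _ in range(runs)) / runs
    p = math.exp(-rate * t_end)          # survival probability of one molecule
    expected = n0 * p
    std_err = math.sqrt(n0 * p * (1 - p) / runs)
    return abs(mean - expected) <= 5 * std_err, mean, expected


ok, mean, expected = check_against_analytical()
print(ok, round(expected, 3))
```

Because the output is random, the check is statistical: the threshold trades false alarms against sensitivity, and a fixed seed keeps the test reproducible.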
Software testing concept: Special test cases
What about systems biology?
Software testing concept: N-version programming
How to determine the correctness of the program output?
Indeed it is very hard to verify the correctness of any given output of these programs (if we knew the expected output, we would not need these programs in the first place!)
Common techniques for dealing with the Oracle problem:
Special test cases: Selected cases where the correct output is known (from external experimental validation or simulation)
N-version programming: Check concordance between multiple implementations or variants of the same initial specification
Expected range: Check that the range of values is within expectation
Metamorphic Testing
The sin function has the following property: sin(x) = sin(x + 360°)
Execute the program with inputs x = 29.8° and x = 389.8°, and check that sin(29.8°) = sin(389.8°).
Key idea: Multiple executions of the same program.
• Identify the expected output of a program from previously executed test cases
Chen et al (2009) BMC Bioinformatics
Core idea: Execute the same program multiple times with slightly modified input, such that their output could be compared to some expected properties
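The relation sin(x) = sin(x + 360°) can be checked without knowing the true value of either output. In this sketch `sin_degrees` is a stand-in program under test and the tolerance is an assumption.

```python
import math


def sin_degrees(x_deg):
    """Stand-in program under test: sine of an angle given in degrees."""
    return math.sin(math.radians(x_deg))


def mr_periodicity_holds(f, x_deg, tol=1e-9):
    """Metamorphic relation: f(x) and f(x + 360) must agree."""
    source_output = f(x_deg)             # source test case
    follow_up_output = f(x_deg + 360.0)  # follow-up test case
    return abs(source_output - follow_up_output) <= tol


print(mr_periodicity_holds(sin_degrees, 29.8))
```

A violated relation reveals a fault; a satisfied relation does not prove correctness, so several relations are normally used together.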
Different MR and test cases have different effectiveness
We tested a network simulator with real and simulated data using 10 Metamorphic Relations on the original and mutant programs
Chen et al (2009) BMC Bioinformatics
Advantages of Metamorphic Testing
• We can use real data (instead of simulated data) as test cases as there is now a mechanism to verify the output
• Usually quite easy to implement if you know some properties of the algorithm
• Can be used in conjunction with other testing techniques, such as special test cases and N-version programming
An example MR for a kNN classifier
Source test case:
• Source input: Train.seq, Train.cls, OneTest.seq, k, kk
• Source output: cls
Follow-up test case (if cls != ‘Uncertain’):
• Follow-up input: (Train.seq, OneTest.seq), (Train.cls, cls), OneTest.seq, k, kk
• Expected follow-up output: cls
Another example MR for a kNN classifier
Source test case:
• Source input: Train.seq, Train.cls, OneTest.seq, k, kk
• Source output: cls
Follow-up test case (if cls != ‘Uncertain’):
• Follow-up input: (Train.seq + duplicate all sequences from class cls), (Train.cls + cls), OneTest.seq, k, kk
• Expected follow-up output: cls
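Both metamorphic relations for the kNN classifier can be written as small checks that take the classifier as an argument. This is a Python sketch, not the R package: `knn_classify` is a hypothetical stand-in for the program under test, and the parameter names `kmer_len` and `n_neighbors` are assumptions that do not map onto the slides' `k` and `kk`.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, kmer_len):
    kmers = ("".join(p) for p in product("ACGT", repeat=kmer_len))
    counts = Counter(seq[i:i + kmer_len] for i in range(len(seq) - kmer_len + 1))
    return [counts[k] for k in kmers]


def knn_classify(train_seqs, train_labels, test_seq, kmer_len=2, n_neighbors=3):
    """Stand-in program under test; any kNN implementation could be swapped in."""
    test_vec = kmer_counts(test_seq, kmer_len)
    ranked = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(kmer_counts(s, kmer_len), test_vec))), label)
        for s, label in zip(train_seqs, train_labels)
    )
    votes = Counter(label for _, label in ranked[:n_neighbors]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return "Uncertain"
    return votes[0][0]


def mr_add_test_instance(classify, train_seqs, train_labels, test_seq):
    """Adding the test sequence, labelled with its own prediction, keeps the prediction."""
    cls = classify(train_seqs, train_labels, test_seq)
    if cls == "Uncertain":
        return True  # the relation makes no claim in this case
    return classify(train_seqs + [test_seq], train_labels + [cls], test_seq) == cls


def mr_duplicate_class(classify, train_seqs, train_labels, test_seq):
    """Duplicating every training sequence of the predicted class keeps the prediction."""
    cls = classify(train_seqs, train_labels, test_seq)
    if cls == "Uncertain":
        return True
    extra = [s for s, label in zip(train_seqs, train_labels) if label == cls]
    follow_up = classify(train_seqs + extra, train_labels + [cls] * len(extra), test_seq)
    return follow_up == cls


# Toy data (not from the slides).
train_seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
train_labels = ["A", "A", "C", "C"]
print(mr_add_test_instance(knn_classify, train_seqs, train_labels, "AAAAAA"))
print(mr_duplicate_class(knn_classify, train_seqs, train_labels, "AAAAAA"))
```

Passing the classifier in as `classify` lets the same checks run against the original program and against mutants, and real sequences can serve as test inputs because no expected label is needed.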
Other useful tips for choosing good test cases
• Test boundary values
• Use diverse test cases
• The order of execution of the test cases matters because failure-causing inputs are generally clustered together in the input space
Failure pattern and test case diversity
Failure-causing pattern fixed but unknown
• Can reduce the number of test cases by up to 50%
• Challenge is to define a good distance measure
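One way to act on test case diversity is to pick each new test case far away from those already executed. The sketch below uses a fixed-size candidate set, as in adaptive random testing. That choice is an assumption, since the slides do not name the selection algorithm, and a one-dimensional numeric input with absolute difference as the distance measure is a simplification.

```python
import random


def diverse_test_inputs(n_tests, n_candidates=10, low=0.0, high=1.0, seed=0):
    """Select test inputs so that each new one is far from all previous ones."""
    rng = random.Random(seed)
    executed = [rng.uniform(low, high)]
    while len(executed) < n_tests:
        candidates = [rng.uniform(low, high) for _ in range(n_candidates)]
        # Keep the candidate whose nearest executed test case is farthest away.
        best = max(candidates, key=lambda c: min(abs(c - e) for e in executed))
        executed.append(best)
    return executed


print(diverse_test_inputs(5))
```

For sequence inputs the same loop applies once `abs(c - e)` is replaced by a distance between sequences, which is the hard part noted above.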
Kamali (2015) Biophysical Reviews
Quality assurance in translational bioinformatics
Motivation: Quality assurance of clinically-oriented bioinformatics pipelines for human genetic mutation identification
Genetic counseling, Inform treatment options
Take blood for DNA sequencing
Identify genetic variants from hundreds of millions of short reads
Motivation – Clinical Guidelines about quality assurance
• RCPA – Massively Parallel Sequencing Implementation Guidelines
• Aimed at diagnostic laboratories implementing next generation sequencing
• “The validation study must establish the analytical validity of the bioinformatics pipeline in terms of being able to correctly detect sequence variants”
• “The laboratory must validate the entire bioinformatics pipeline as a whole, under the given operational environment”
• … but did not specify how
Motivation – Many variant calling pipelines exist, but their results have low concordance…
J. O’Rawe, et al., “Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing,” Genome Med., vol. 5, no. 3, p. 28, Mar. 2013.
A genomic variant calling pipeline
A sequence with length ~3 × 10^9
~200 million sequence reads, each with length 100
~2 million variants
Vision: Validation and Quality Control on the cloud
Framework Overview
Tests

Test   Description
MR0    Deterministic output
MR1    Random permutation of input
MR2    Duplication of reads
MR3    Unmapped reads
MR4    Mapped reads
SI0    Simulated reads – no mutations
SI1    Simulated reads – mutations
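MR0 and MR1 can be illustrated on a toy variant caller. The slides give only the names of the tests, so the expected relations used here (identical output on a rerun, identical output after shuffling the reads) are assumptions, and `call_variants` is a hypothetical stand-in that works on reads that are already aligned, not a real pipeline.

```python
import random
from collections import Counter, defaultdict


def call_variants(reference, reads, min_depth=2):
    """Toy caller: reads are (start, sequence) pairs aligned to the reference."""
    pileup = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        depth = sum(pileup[pos].values())
        for base, count in pileup[pos].items():
            # Call a non-reference base only if it has a strict majority.
            if base != reference[pos] and depth >= min_depth and 2 * count > depth:
                variants.append((pos, reference[pos], base))
    return variants


def mr0_deterministic(reference, reads):
    """MR0: running the same input twice gives the same output."""
    return call_variants(reference, reads) == call_variants(reference, reads)


def mr1_permutation(reference, reads, seed=0):
    """MR1: shuffling the order of the reads does not change the output."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return call_variants(reference, shuffled) == call_variants(reference, reads)


reference = "ACGTACGT"
reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (4, "ACGT")]
print(call_variants(reference, reads))  # [(3, 'T', 'A')]
print(mr0_deterministic(reference, reads), mr1_permutation(reference, reads))
```

The strict-majority rule makes the toy caller independent of read order by construction; real pipelines with multithreading or order-dependent tie-breaking may not be, which is what these relations probe.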
Results
Cost: on-demand vs spot
• 9 x c3.8xlarge instances for 6 hours ($1.68/hr/instance on-demand)
• 76% saving using spot instances
On-Demand: $90.72
Spot: $21.60
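The cost figures follow from simple arithmetic, reproduced here as a check. All numbers are taken from the slide; only the variable names are made up.

```python
instances = 9
hours = 6
on_demand_rate = 1.68  # USD per instance-hour, from the slide
spot_total = 21.60     # USD, from the slide

on_demand_total = instances * hours * on_demand_rate
saving = 1 - spot_total / on_demand_total
print(f"on-demand: ${on_demand_total:.2f}, saving: {saving:.0%}")
# on-demand: $90.72, saving: 76%
```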
Summary
• Understand the importance and challenges of software testing in bioinformatics
• Understand basic concepts and techniques in software testing
• Understand how we can implement QA in bioinformatics, especially in translational genomic applications
http://bioinformatics.victorchang.edu.au
Joint work with
Eleni Giannoulatou (Lab Head)
Michael Troup (RA)
Andrian Yang (PhD student)
Amir Kamali (MPhil student)