26
PathoLogic Machine Learning Methods for Prediction Evaluation Conclusions and Future Directions Machine Learning Methods for Metabolic Pathway Prediction Joseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop August 27, 2009 Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Machine Learning Methods for MetabolicPathway Prediction

Joseph M. Dale, Liviu Popescu, and Peter D. Karp

Pathway Tools WorkshopAugust 27, 2009

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 2: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Outline

1 PathoLogic

2 Machine Learning Methods for Prediction

3 Evaluation

4 Conclusions and Future Directions

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 3: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Pathway Tools Inference Capabilities

Initial construction, update:Enzyme/reaction matchingPathway prediction

Refinement:Transcription unit (operon) predictionTransport inferencePathway hole filling

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 4: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

PathoLogic PGDB Construction

Enzyme names, EC numbers and GO terms from genomeannotation are used to identify matching reactions inMetaCyc.All MetaCyc pathways with at least one reaction present inthe target organism are imported as candidate pathways.Candidate pathways are pruned using an iterativealgorithm.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 5: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

PathoLogic Pathway Prediction

PathoLogic uses an iterative algorithm to prune the initial set ofcandidate pathways:

1 Initialize pathway sets keep = {}, delete = {},undecided = all initial candidates.

2 Apply “keep tests” K1, . . . , Km to undecided pathways; ifany Ki(p) succeeds, move p to keep set.

3 Apply “delete tests” D1, . . . , Dn to undecided pathways; ifany Di(p) succeeds, move p to delete set.

4 If any undecided pathways were moved, update pathwayevidence and go to step 2; otherwise terminate.

keep pathways and remaining undecided pathways (no keep ordelete tests succeeded) are kept in PGDB.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 6: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Examples of Keep Tests

pathway has a unique reaction presentpathway is “mostly present”:

at most one reaction missingmore reactions present than missingevidence not a proper subset of evidence for variantnot a superset of another pathway

pathway evidence is not a subset of evidence for any otherpathway, and pathway is not missing all key reactions(curated in MetaCyc)

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 7: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Examples of Delete Tests

pathway “mostly absent”:at most one reaction presentmore than one reaction missingno unique reactions present

biosynthetic pathway missing final stepsdegradative pathway missing initial stepspathway missing all “key reactions”

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 8: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Limitations of PathoLogic

As MetaCyc grows (currently > 1300 pathways),PathoLogic makes more false positive predictionsOkay for PGDBs that will receive manual curation (this wasintended), but problematic for BioCyc PGDBs that receiveno curationSeveral areas in which PathoLogic is limited:

extensibilitytunabilityexplainability

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 9: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Extensibility

Above description of PathoLogic above is a simplification!The actual logic is more complex, hard-coded, and brittle.Difficult to add new tests (keep and delete rules), specifyinteractions with existing tests.No formal training procedure to incorporate feedback (i.e.,automatically adjust to correct false predictions).

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 10: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Tunability

PathoLogic currently only makes binary predictions(pathway present / absent).Can’t be tuned to trade off sensitivity/specificity,precision/recall – performance is fixed at a single point.Preference for false positives is hard-coded.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 11: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Explainability

Existing confidence scores are coarse: e.g., fraction ofreactions present, number of unique enzymes.Not monotonic: pathway X may have more reactionspresent than pathway Y , but X can be pruned while Y iskept.Users can’t see how evidence was combined: which ruleswere applied to call the pathway present / absent.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 12: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

The Machine Learning Approach

Supervised machine learning:Collect training data:input feature (attribute) vectors X1, . . . , Xnoutput labels y1, . . . , yn

Apply learning algorithm to training data, obtain structure,parameters of function F : X → y .Apply F to new feature vector Xn+1 to yield predictionyn+1 = F (Xn+1)

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 13: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Machine Learning Approach to Pathway Prediction

Collect a “gold standard” set of labeled data for training(and validation): known data on pathwaypresence/absence in various organisms.Define useful features; compute feature values for eachpathway.Input the feature data to domain-independent learningalgorithm to train a model for pathway prediction.Apply the model to new pathway examples when building anew PGDB.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 14: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Can Machine Learning Help?

ML methods have automated training procedures, easy toadd new features and training data.Many ML methods have probabilistic foundations, yieldingnatural confidence scores:Pr(pathway present | evidence).Many ML methods can explain predictions; e.g.,log-likelihood score for each feature, etc.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 15: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Feature Extraction

Features are the primary domain-specific component of MLmodels. Ours fall into several groups:

Reaction evidence: based on matching pathwayreactions to enzymes based on genome annotation; e.g.,fraction of reactions present; number of unique enzymes.Pathway holes: patterns of pathway holes (reactionsmissing enzymes); e.g., biosynthetic pathway missing finalreactions; degradation pathway missing initial reactions.Genome context: e.g., two reactions in pathway encodedby genes adjacent on chromosome?

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 16: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Feature Extraction

More feature groups:Pathway variants: e.g., is the evidence for pathway V1 asubset of the evidence for its variant V2?Taxonomic range: does the expected taxonomic range ofthe pathway (curated in MetaCyc) include the targetorganism?Pathway connectivity: e.g., number of dead endcompounds in the pathway, number of adjacent pathways(via input/output metabolites)Miscellaneous PathoLogic features: other featuresadapted from PathoLogic.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 17: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Feature Selection

In total, 123 features were defined – many slight variations.Multiple redundant features can degrade the performanceof some ML methods.Experimented with various feature selection methods:Akaike information criterion (AIC), Bayes informationcriterion (BIC), cross-validation.Simple hill-climbing on AIC performed as well as moresophisticated (and slower) methods.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 18: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Prediction Methods

Different ML methods perform better on different problems. Weevaluated several methods:

naïve Bayesdecision treeslogistic regressionk nearest neighborsensemble methods:

baggingboostingrandom forests

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 19: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Gold Standard Dataset

Training / validation set based on six curated PGDBs: E.coli, Arabidopsis, yeast, mouse, cattle, Synechococcuselongatus5,610 tuples of the form(organism, pathway , present |absent)Positive (present) examples are those pathways includedin PGDB after curator review.Negative (absent) examples include pathways deleted bycurators and pathways with no reactions present.

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 20: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Gold Standard Dataset

Breakdown of gold standard pathways by organism:

organism positives negatives totalEscherichia coli K-12 MG1655 235 1035 1270Arabidopsis thaliana columbia 297 971 1268Saccharomyces cerevisiae S288c 119 777 896Synechococcus elongatus PCC 7942 171 778 949Mus musculus 203 754 957Bos taurus 151 119 270

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 21: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Validation Methodology

Learning curves: 80%/20% training/test split; selectsubsets of training set, varying size; measure on test setOverall performance measured on random 50%/50%training/test split, repeated and averaged 20xcross-validation for ROC curves

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 22: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

PathoLogic vs. ML, ROC on sensitivity/specificity

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 23: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

PathoLogic vs. ML, ROC on precision/recall

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 24: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

PathoLogic vs. ML, optimal threshold

High-performing ML methods vs. PathoLogic:

method ACC SN SP FM PR RCPathoLogic 0.91 0.793 0.94 0.786 0.779 0.793naïve Bayes(HC-AIC, 15x bagged) 0.909 0.757 0.949 0.78 0.767 0.796logistic regression(HC-BIC, 8x bagged) 0.912 0.744 0.956 0.786 0.763 0.812decision trees(SSMML, 25x bagged) 0.911 0.729 0.961 0.787 0.77 0.808

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 25: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Conclusions

Performance of ML algorithms roughly equals that ofPathoLogicAdvantages of ML methods over PathoLogic:

numerical confidence scorestradeoff between sensitivity/specificity, precision/recalleasily extensibleexplanation of predictions

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction

Page 26: Machine Learning Methods for Metabolic Pathway Predictionbioinformatics.ai.sri.com/.../ML_PathPred_Joseph_Dale.pdfJoseph M. Dale, Liviu Popescu, and Peter D. Karp Pathway Tools Workshop

PathoLogicMachine Learning Methods for Prediction

EvaluationConclusions and Future Directions

Future Work

Integrate into Pathway ToolsImprove enzyme name matchingMore sophisticated prediction algorithms, using:

dependencies between featuresiterative refinementdependencies between pathways

Joseph M. Dale, Liviu Popescu, and Peter D. Karp Machine Learning Methods for Metabolic Pathway Prediction