

An online game for improving human phenotype prediction

Benjamin M Good, Salvatore Loguercio, Andrew I Su

The Scripps Research Institute, La Jolla, California, USA

ABSTRACT

An important goal for biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers for any biological sample. In fact, a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. However, the actual number of samples processed typically remains small in comparison. As a result, attempts to use this data to build predictors often face problems of overfitting. (While a predictive pattern may describe training data very well, it may not reproduce well on other datasets.)

It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.

Following in the line of games that have recently demonstrated success as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations and (2) games like Combo engage players directly in the process of predictor inference.

Play game prototypes at: http://www.genegames.org (Also see Poster I03)

FUNDING

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).

CONTACT

Benjamin Good: bgood@scripps.edu

Salvatore Loguercio: loguerci@scripps.edu

Andrew Su: [email protected]

REFERENCES

1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

2. Winter et al (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology

3. Liu et al (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics

4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology

5. Kawrykow et al (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One

Combo: feature selection with community intelligence

• Goal: pick the best set of genes

• Best: the gene set that produces the best decision tree classifier

• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)

[Figure: a game board and a hand, with the inferred decision tree for Phenotype 1 vs. Phenotype 2]

Game Score: determined by estimating performance of trees constructed using the selected features on training data. Example: "Score: 78 (percent correct)".

Human Guided Forest

Ensemble classifier where components are decision trees constructed using manually selected subsets of features. Adaptation of Network Guided and Random Forests [1].

Feature sets from many individual games are used to create a Decision Tree Forest classifier. (Each tree votes once.)
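The scoring and voting ideas described for Combo and the Human Guided Forest can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the poster's implementation: the gene names (`geneA`, `geneB`, `geneC`), the toy expression values, the Gini-based split rule, the depth limit, and the function names are all assumptions, and the score here is simply the percent of training samples a tree classifies correctly, which may differ from how the game estimates performance.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, genes, depth=0, max_depth=3):
    """Grow a small decision tree that may only split on the given genes."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)            # leaf: predicted phenotype
    best = None
    for gene in genes:
        for cut in sorted({row[gene] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[gene] <= cut]
            right = [i for i, row in enumerate(rows) if row[gene] > cut]
            if not left or not right:
                continue
            impurity = (len(left) * gini([labels[i] for i in left])
                        + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, gene, cut, left, right)
    if best is None:                       # no gene separates the samples
        return majority(labels)
    _, gene, cut, left, right = best
    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          genes, depth + 1, max_depth)
    return (gene, cut, grow(left), grow(right))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's phenotype label."""
    while isinstance(tree, tuple):
        gene, cut, left, right = tree
        tree = left if row[gene] <= cut else right
    return tree

def game_score(rows, labels, hand):
    """Percent of training samples classified correctly using only `hand`."""
    tree = build_tree(rows, labels, hand)
    correct = sum(predict_tree(tree, r) == y for r, y in zip(rows, labels))
    return round(100 * correct / len(rows))

def build_forest(rows, labels, hands):
    """One tree per player-selected gene set (one 'hand' per game)."""
    return [build_tree(rows, labels, hand) for hand in hands]

def predict_forest(forest, row):
    """Each tree votes once; the most common vote wins."""
    return majority([predict_tree(tree, row) for tree in forest])

# Hypothetical toy training data: expression values for three made-up genes.
rows = [
    {"geneA": 0.9, "geneB": 0.1, "geneC": 0.5},
    {"geneA": 0.8, "geneB": 0.3, "geneC": 0.4},
    {"geneA": 0.2, "geneB": 0.7, "geneC": 0.6},
    {"geneA": 0.1, "geneB": 0.9, "geneC": 0.5},
]
labels = ["phenotype1", "phenotype1", "phenotype2", "phenotype2"]

# Hands chosen in three hypothetical games.
hands = [["geneA"], ["geneB"], ["geneA", "geneC"]]
forest = build_forest(rows, labels, hands)
new_sample = {"geneA": 0.85, "geneB": 0.2, "geneC": 0.5}

print("score for hand ['geneA']:", game_score(rows, labels, ["geneA"]))
print("score for hand ['geneC']:", game_score(rows, labels, ["geneC"]))
print("forest prediction:", predict_forest(forest, new_sample))
```

Scoring on the same samples used to build the tree tends to be optimistic; a held-out or cross-validated estimate would be a natural substitute in this sketch.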

Motivation

[Figure: find patterns, then make predictions on new samples (cancer vs. normal)]

Challenge

• With tens of thousands of measurements but only hundreds of samples, many possible patterns are found.

• But which ones are real?

• Using prior biological knowledge, it is possible to identify stronger, more consistent predictive patterns.

• Prior knowledge encoded in protein-protein interaction databases [1,2] and pathway databases [3] has been used to improve phenotype prediction.

[Figure: Network Guided Forest, from Dutkowski and Ideker (2011) [1]]
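The point about many measurements and few samples can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not an analysis from the poster: every "measurement" is a random bit unrelated to the label, the sample and feature counts are much smaller than in real studies so that it runs quickly, and the function names are hypothetical. It picks the single noise feature that best matches the training labels and then checks that same feature on held-out samples.

```python
import random

def best_noise_feature(n_train, n_test, n_features, rng):
    """Pick the random feature that best fits the training labels.

    Returns (training accuracy, held-out accuracy) for that feature.
    Features and labels are independent coin flips, so any apparent
    pattern is spurious.
    """
    n = n_train + n_test
    labels = [rng.getrandbits(1) for _ in range(n)]
    best_train, best_test = -1.0, 0.0
    for _ in range(n_features):
        feature = [rng.getrandbits(1) for _ in range(n)]
        hits = [f == y for f, y in zip(feature, labels)]
        train_acc = sum(hits[:n_train]) / n_train
        if train_acc > best_train:
            best_train = train_acc
            best_test = sum(hits[n_train:]) / n_test
    return best_train, best_test

def average_gap(trials=10, seed=0):
    """Average the best feature's train and held-out accuracy over trials."""
    rng = random.Random(seed)
    results = [best_noise_feature(20, 20, 2000, rng) for _ in range(trials)]
    mean_train = sum(r[0] for r in results) / trials
    mean_test = sum(r[1] for r in results) / trials
    return mean_train, mean_test

mean_train, mean_test = average_gap()
print(f"best noise feature, training accuracy: {mean_train:.2f}")
print(f"same feature, held-out accuracy:       {mean_test:.2f}")
```

With thousands of candidate features and only 20 training samples, some feature is expected to match the training labels well by chance, while its held-out accuracy should hover near one half. This is the sense in which a pattern found in the training data may not be "real".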

Opportunity

• Online games are successfully tapping into the knowledge and reasoning abilities of thousands of people:

    • Devise protein folding algorithms

    • Fix multiple sequence alignments

    • Design RNA molecules

    • Label all images on the Web

• What about knowledge that is not recorded in structured databases?

• COMBO is designed to motivate and enable people to help improve phenotype predictors by selecting predictive gene sets.