
ASHG poster - Games for gene annotation and phenotype classification



[Figure: Genes selected at highest frequency]

ABSTRACT

The Empire State Building was built with 7 million hours of human effort. The Panama Canal took 20 million hours to complete. By comparison, it is estimated that up to 150 billion hours are spent playing games every year (9 billion on Solitaire alone). People play games because they are enjoyable. But aside from that enjoyment, games largely produce no tangible benefit, either to the individual or to society at large.

Recently, several groups have built “games with a purpose”, a class of games that focuses on collaboratively harnessing gamers for productive ends. In biology, games have been built to fold proteins and RNAs, and to perform multiple sequence alignment. Here, we present our efforts to apply games to two critical challenges in genetics.

First, we have built games focused on organizing and structuring gene annotations. With the increasing popularity of genome-scale science, many analysis strategies (including gene set enrichment, pathway analysis, and cross-species comparisons) depend on comprehensive and accurate gene annotations. These structured annotations are mostly the result of centralized manual curation efforts, but these initiatives do not scale well with the explosive growth of the biomedical literature. We describe several games that target working biologists to extract their expert domain knowledge in computable form.

Second, we describe a game for predicting human phenotypes from molecular descriptors. Researchers can now relatively easily characterize any biological sample according to a number of features, including genotype, gene expression, and epigenetics. A key challenge in the field is identifying exactly which of those molecular features can be used to predict a clinical phenotype like disease susceptibility or adverse drug events. While statistical classifiers have been applied to this challenge, they typically do not incorporate prior biological knowledge, and they often fail to replicate in external test populations. Here, we present results from ‘The Cure’, a game to help identify biomarker gene sets that can be used to improve predictions of breast cancer prognosis based on gene expression.

Game 1: Dizeez

Game 3: The Cure

REFERENCES

1. Salvatore Loguercio, Benjamin M. Good, Andrew I. Su (2012) Dizeez: an online game for human gene-disease annotation. In: Bio-Ontologies SIG, ISMB: 15 July 2011, Vienna. http://bio-ontologies.knowledgeblog.org/438

2. Luis von Ahn and Laura Dabbish (2004) Labeling images with a computer game. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

3. Janus Dutkowski and Trey Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

4. Sage Bionetworks: DREAM7 Breast Cancer Prognosis Challenge. http://www.the-dream-project.org/challenges/sage-bionetworks-dream-breast-cancer-prognosis-challenge

Contact and Acknowledgements

Benjamin Good: @bgood, Andrew Su: @andrew.su

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).

• Purpose: identify new gene-disease links
• Rules:
  • Select biological area (e.g. ‘cancer’) to start game.
  • Given a gene, guess the related disease.
  • Points are awarded for correct guesses within one minute.
  • ‘Correct’ answers drawn from text mining
• Data:
  • When several different players suggest the same ‘incorrect’ gene-disease link, we detect a new candidate gene annotation.
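The Dizeez data rule (several different players independently suggesting the same 'incorrect' link) can be illustrated with a short sketch. This is not the authors' implementation: the function name `candidate_annotations`, the `min_players` threshold, the record layout, and the placeholder genes and diseases are all assumptions made for illustration.

```python
from collections import defaultdict


def candidate_annotations(guesses, known_pairs, min_players=2):
    """Return novel gene-disease pairs suggested by at least `min_players`
    distinct players, mapped to the number of players who suggested them.

    guesses: iterable of (player, gene, disease) tuples (hypothetical layout).
    known_pairs: set of (gene, disease) pairs already treated as 'correct'.
    """
    supporters = defaultdict(set)
    for player, gene, disease in guesses:
        pair = (gene, disease)
        if pair not in known_pairs:
            # A set means a repeated guess by the same player counts once.
            supporters[pair].add(player)
    return {
        pair: len(players)
        for pair, players in supporters.items()
        if len(players) >= min_players
    }


# Hypothetical game log; GENE_X, GENE_Z and disease_y are placeholders.
known_pairs = {("BRCA1", "breast cancer")}
guesses = [
    ("p1", "BRCA1", "breast cancer"),  # matches an existing annotation
    ("p1", "GENE_X", "disease_y"),
    ("p2", "GENE_X", "disease_y"),
    ("p2", "GENE_X", "disease_y"),     # same player again, not counted twice
    ("p3", "GENE_Z", "disease_y"),     # only one supporter, below threshold
]
candidates = candidate_annotations(guesses, known_pairs)
print(candidates)
```

Counting distinct players rather than raw guesses keeps one persistent player from creating a candidate alone; the threshold itself is a tunable assumption.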

Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA

Andrew I. Su, Salvatore Loguercio, Benjamin M. Good

Games for gene annotation and phenotype classification

Play these games now!!! at: http://genegames.org

Dizeez Results

• Time frame: 2 months
• Unique players: 230
• Games played: 1,045
• Guesses collected: 8,525
• Unique gene-disease pairs: 6,941
• Guesses that match existing annotation: 4,804 (69%)
• For 14 novel gene-disease pairs guessed by >3 players, 9 (64%) were validated by a literature search
• Player consensus correlates with probability of validation [1]

Game 2: GenESP

Guess what genes your partner is thinking about when they see ‘neuroblastoma’

• Direct reward for consensus formation
• Multiplayer
• Open-ended
• Tested pattern [2]
• Work in progress
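A reward for consensus between two partners could be computed roughly as follows. This is only a sketch of the general pattern: the function `consensus_score`, the `points_per_match` value, and the normalization of gene symbols are assumptions, and the guesses are illustrative player input rather than annotations from the poster.

```python
def consensus_score(guesses_a, guesses_b, points_per_match=10):
    """Score a round by the genes both partners typed for the same prompt.

    Symbols are compared case-insensitively after trimming whitespace.
    Returns the sorted list of matched symbols and the points awarded.
    """
    def normalize(guesses):
        return {g.strip().upper() for g in guesses if g.strip()}

    matched = normalize(guesses_a) & normalize(guesses_b)
    return sorted(matched), points_per_match * len(matched)


# Hypothetical open-ended guesses from two partners for one disease prompt.
matched, score = consensus_score(
    ["MYCN", "alk ", "TP53"],
    ["ALK", "mycn", "PHOX2B"],
)
print(matched, score)
```

Only the intersection earns points, so each player is rewarded for naming genes the partner is also likely to name.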

[Figure: find patterns (cancer vs. normal), then make predictions on new samples (cancer vs. normal)]

The Challenge

• With tens of thousands of measurements but only hundreds of samples, many possible patterns are found.

• But which ones are real?

• Prior knowledge encoded in databases has been used to improve classifiers by guiding the search for predictive gene sets [3]

• What about knowledge that is not recorded in structured databases?

• The Cure is designed to motivate and enable people to help improve the feature selection step for predictor inference.

Gene info. provided from Gene Ontology, GeneRIFs.

Search box highlights genes with annotation match

Decision trees built automatically using genes in player’s hands

Your current ‘hand’. Round ends at 5 cards

• Goal: pick the best set of genes.

• Best: the gene set that produces the best decision tree classifier of breast cancer prognosis.

• Classifier: created using training data and selected genes, used to predict phenotype.

• Score: cross-validation performance of decision tree using selected genes and training data.
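The scoring idea above (cross-validation performance of a decision tree restricted to the genes in a hand) can be sketched in plain Python. This is an assumption-laden illustration, not the game's code: the tiny Gini-based tree, the 5-fold split, the depth limit, the synthetic expression values, and the names `hand_score`, `GENE_A` and `GENE_B` are all hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, genes, depth=0, max_depth=3):
    """Greedy tree over the chosen genes; a leaf is a class label string."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for g in genes:
        values = sorted({r[g] for r in rows})
        for lo_v, hi_v in zip(values, values[1:]):
            t = (lo_v + hi_v) / 2  # threshold midway between observed values
            left = [i for i, r in enumerate(rows) if r[g] <= t]
            right = [i for i, r in enumerate(rows) if r[g] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini([labels[i] for i in left])
                        + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, g, t, left, right)
    if best is None:
        return majority(labels)
    _, g, t, left, right = best
    return (g, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       genes, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       genes, depth + 1, max_depth))


def predict(tree, row):
    while isinstance(tree, tuple):
        g, t, low_branch, high_branch = tree
        tree = low_branch if row[g] <= t else high_branch
    return tree


def hand_score(rows, labels, hand, k=5, seed=0):
    """k-fold cross-validated accuracy of a tree built only from `hand`."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tree = build_tree([rows[i] for i in train],
                          [labels[i] for i in train], hand)
        correct += sum(predict(tree, rows[i]) == labels[i] for i in fold)
    return correct / len(rows)


# Synthetic training data: GENE_A separates the classes, GENE_B is noise.
rng = random.Random(42)
rows, labels = [], []
for i in range(60):
    poor = i % 2 == 0
    rows.append({
        "GENE_A": rng.uniform(0.8, 1.0) if poor else rng.uniform(0.0, 0.2),
        "GENE_B": rng.random(),
    })
    labels.append("poor" if poor else "good")

informative_score = hand_score(rows, labels, ["GENE_A"])
noise_score = hand_score(rows, labels, ["GENE_B"])
print(informative_score, noise_score)
```

Because every tree is scored on samples it was not trained on, a hand of uninformative genes cannot score well simply by memorizing the training data, which is the property that makes the score a useful signal for gene selection.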

RESULTS

• 214 Players registered (125 in 1st week): 40% have a PhD.

• 3,954 games played in 47 days

The Game

• Predictor scored 69% correct on Sage Breast Cancer Prognosis Challenge test set. [4]

• (Best of all submitted predictors scored 72%)

• Awaiting results on external validation set.

• Clinical data (Age, etc.)

http://genegames.org/cure/