Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

  • Published on
    10-May-2015

Transcript

  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). Sanger/EBI, September 7, 2012.

2. Few genes are well annotated. [Figure: PubMed and Gene Ontology annotation counts for 23,278 protein-coding genes, sorted by decreasing counts; labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE; callouts: 59%, 38%.] Data: NCBI gene2pubmed, August 2010.
3. ... because the literature is sparsely curated? [Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis 0 to 1,000,000.]
4. ... because the literature is sparsely curated? [Figure: number of articles read by a scientist (average capacity of a typical human scientist), 1979 to 2009.]
5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
6. "Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation."
7. The Long Tail is a prolific source of content. [Figure: content produced against contributors (sorted), with a Short Head and a Long Tail.] Short head and long tail examples: News: newspapers, blogs. Video: TV/Hollywood, YouTube. Product reviews: Consumer Reports, Amazon reviews. Food reviews: food critics, Yelp. Talent judging: Olympics, American Idol.
8. Wikipedia is reasonably accurate.
9. Wikipedia has breadth and depth. [Figure: articles, words (millions), and words/article for Wikipedia and Britannica Online.] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
11. From crowdsourcing to structured data. The Gene Wiki; Biological Games.
12. Filtering, extracting, and summarizing PubMed. Documents; Concepts.
13. Wiki success depends on a positive feedback. [Figure: loop linking Gene Wiki page utility, number of users, and number of contributors; values shown: 1, 100, 2, 200.]
14. 10,000 gene stubs within Wikipedia. Page elements: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. (Huss, PLoS Biol, 2008)
15. Gene Wiki has a critical mass of readers. Total: 5.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
16. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011)
17. A review article for every gene is powerful. Hyperlinks to related concepts; references to the literature. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002.
18. Making the Gene Wiki more reliable. Example passage: "Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), ..." and "The company name is derived from old Greek, and means "destroyer of birds"."
19. Making the Gene Wiki more reliable. The same Novartis passage, annotated by author trust: 36211 total edits and 36 total edits (High-trust author; Low-trust author). http://www.wikitrust.net/
20. Making the Gene Wiki more computable. Free text; structured annotations.
21. Filling the gaps in gene annotation. NCBI Entrez Gene: 334; Gene Wiki mapping; wikilink; candidate assertion; GO:0006897; GO exact match. 6319 novel GO annotations; 2147 novel DO annotations.
22. TOP 100 GENES
23. Gene Wiki content improves enrichment analysis. GO term: axon guidance (GO:0007411); 811 articles; 264 genes. Workflow labels: PubMed abstracts; concept recognition; gene list; enrichment analysis. P = 1.55E-20.
    Linked genes through PubMed | GO:0007411 Yes | GO:0007411 No
    Yes                         | 13             | 2
    No                          | 251            | 12033
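The 2x2 table on slide 23 can be re-examined with a small enrichment calculation. The sketch below assumes the reported P comes from a one-sided Fisher exact (hypergeometric) test, which the slide does not state; the function name `fisher_one_sided` is illustrative and not taken from the talk. The result does not depend on which variable is read as rows and which as columns.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """Right-tailed Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X follows the hypergeometric distribution
    obtained by holding the row and column totals fixed.
    """
    row1 = a + b              # total of the first row
    col1 = a + c              # total of the first column
    total = a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)   # largest value the top-left cell can take
    # Sum the probabilities of every table at least as extreme as the observed one.
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Counts as printed on slide 23 (assumed test: one-sided Fisher exact).
p_value = fisher_one_sided(13, 2, 251, 12033)
print(f"one-sided P = {p_value:.3g}")
```

Exact integer binomials from `math.comb` avoid the underflow that a floating-point factorial approach would hit at this scale.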
24. Gene Wiki content improves enrichment analysis. GO term: muscle contraction (GO:0006936); 251 articles; 87 genes; + Gene Wiki: 87 articles. Workflow labels: PubMed abstracts; concept recognition; gene list; enrichment analysis. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09.
25. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) plotted against p-value (PubMed only); regions labeled "more significant PubMed only" and "more significant PubMed + GW"; muscle contraction highlighted.]
26. Gene Wiki+: Crowdsourced semantic database. Q: What genes are related to hemolytic anemia?
27. The Long Tail of scientists is a valuable source of information on gene function.
28. From crowdsourcing to structured data. The Gene Wiki; Biological Games.
29. Gene databases are numerous and overlapping (and hundreds more).
30. Community extensibility and user customizability. http://biogps.org
31-36. Utility: A simple and universal plugin interface. Total of 389 gene-centric online databases registered as BioGPS plugins.
37. Users: BioGPS has critical mass. > 4100 registered users; 4000 unique visitors per week; 40,000 page views per week. [Figure: daily pageviews.] Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC.
38. Contributors: Explicit and implicit knowledge. 389 plugins registered (65% publicly shared) by over 75 users, spanning 150+ domains.
39. Mining structured content from HTML.
40. Defining a data extraction template. TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1.
41. The BioGPS Semantic Annotator. http://50.112.124.237
42. The Long Tail of bioinformaticians can collaboratively build a gene portal.
43. From crowdsourcing to structured data. The Gene Wiki; Biological Games.
44. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
45. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
46. 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
47. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011).
48. Using games to fold RNAs. http://eterna.cmu.edu/
49. Using games to align sequences. http://phylo.cs.mcgill.ca
50. Using games to annotate genes? http://genegames.org
51. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
52. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: ? Alzheimer's disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases.
54. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders. 477 diseases!
55. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is right). 3. If it's right, you get points. 4. Then on to the next question. 5. Hurry! 6. Play to win!
56. Dizeez players seem pretty smart. In total (since Dec 2011): 207 unique gamers; 1045 games played; 8525 guesses. [Table columns: # Occurrences, Gene, Disease, PubMed, OMIM, PharmGKB, Gene Wiki; only the first three columns survive in the transcript.]
    # Occurrences | Gene  | Disease
    7             | GAST  | gastrinoma
    7             | RBP3  | retinoblastoma
    7             | SSX1  | synovial sarcoma
    6             | TG    | Graves disease
    6             | CRYGC | Cataract
    6             | SOX8  | mental retardation
    6             | WRN   | Werner syndrome
    6             | ABL1  | leukemia
    6             | MLL3  | leukemia
    6             | SNAI2 | breast carcinoma
57. Dizeez players seem pretty smart. Same totals and table columns as the previous slide.
    # Occurrences | Gene  | Disease
    5             | MECOM | sarcoma
    4             | ATF7  | cancer
    3             | ABCB5 | acute myeloid leukemia
    3             | SART1 | glioblastoma
    3             | NCK1  | leukemia
    3             | NEK1  | cancer
58. Using games to predict phenotype from genotype? The Cure. http://genegames.org
59. Classification problems in genome biology. [Figure: matrix of 100s of samples (cancer, normal) by 100,000s of features; find patterns, then classify new samples as cancer or normal.] Methods: SVM; neural networks; Naive Bayes; KNN.
60. Random forests. Sample subset of cases and features; train decision tree. [Figure: cancer and normal samples; 100,000s features; 100s samples.]
61. Random forests. [Figure: cancer and normal samples; 100,000s features; 100s samples.]
62. Random forests. Classify new samples (cancer or normal). How to interject biological knowledge?
63. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology.
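The random forest recipe on slides 60 to 62 (sample a subset of cases and features, train a tree, let the trees vote) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method from the talk or from Dutkowski & Ideker: one-split decision stumps stand in for full decision trees, the data are invented, and the `weights` parameter is a hypothetical hook showing where prior knowledge could bias which features each tree sees. With uniform weights it reduces to plain random feature sampling.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Pick the (feature, threshold) split among `features` with the fewest errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=50, n_features=2, weights=None, seed=0):
    """Train stumps on bootstrap samples of cases and weighted samples of features."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    weights = weights or [1.0] * n_cols  # uniform weights: ordinary random forest
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw as many cases as there are rows, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Draw candidate features in proportion to their weight, then deduplicate.
        feats = sorted(set(rng.choices(range(n_cols), weights=weights, k=n_features)))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Invented toy data: feature 0 separates the classes, the rest are noise.
rows = [[0.9, 0.1, 0.5, 0.3], [0.8, 0.7, 0.2, 0.9], [0.7, 0.4, 0.9, 0.1],
        [0.2, 0.3, 0.4, 0.8], [0.1, 0.9, 0.6, 0.2], [0.3, 0.5, 0.1, 0.6]]
labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]

# Hypothetical prior: feature 0 is believed to be relevant, so it is sampled more often.
forest = train_forest(rows, labels, weights=[5.0, 1.0, 1.0, 1.0])
print(predict(forest, [0.85, 0.5, 0.5, 0.5]))
```

Raising a feature's weight makes it a candidate in more trees, which is one simple way to answer the question slide 62 poses about interjecting biological knowledge.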
64. Network-guided forests. Sample features by PPI network; train decision tree. [Figure: cancer and normal samples; 100,000s features; 100s samples.]
65. Human-guided forests. Sample features by human intelligence; train decision tree. [Figure: cancer and normal samples; 100,000s features; 100s samples.]
66.
67-72. The Cure: Genomic predictors for disease.
73. Human-guided forests. Classify new samples (cancer or normal).
74. Critical Assessment-style challenge. Will this work? Check our blog after October 15.
75. The Long Tail of gamers can collaboratively build an accurate disease classifier.
76. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Ben Good; Max Nanis; Salvatore Loguercio; Chunlei Wu; Ian Macleod. Contact: http://sulab.org; asu@scripps.edu; @andrewsu; +Andrew Su; @genegame. Recruiting graduate students in quantitative biology! See http://education.scripps.edu/. Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820.