Crowdsourcing Biology: The Gene Wiki, BioGPS and

Given at DBMI seminar series at UCSD.


  • 1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org. Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org). April 5, 2013, UCSD DBMI Seminar.

2. Few genes are well annotated. [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; figure labels 41% and 65%; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF.] Data: NCBI, February 2013.
3. ... because the literature is sparsely curated? [Figure: number of PubMed-indexed articles, 1979 to 2009; y-axis 0 to 1,000,000.]
4. ... because the literature is sparsely curated? [Figure: number of articles read by a typical scientist, 1979 to 2009; y-axis 0 to 20; annotation: "Average capacity of human scientist".]
5. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
6. "Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation."
7. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), split into Short Head and Long Tail.]
     News: newspapers (short head), blogs (long tail)
     Video: TV/Hollywood (short head), YouTube (long tail)
     Product reviews: Consumer Reports (short head), Amazon reviews (long tail)
     Food reviews: food critics (short head), Yelp (long tail)
     Talent judging: Olympics (short head), American Idol (long tail)
8. Wikipedia is reasonably accurate.
9. Wikipedia has breadth and depth. [Figure: articles and words (millions), Wikipedia vs. Britannica Online, July 2008.]
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
11. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
12. Filtering, extracting, and summarizing PubMed. [Diagram: Documents, Concepts, Review article.]
13. Filtering, extracting, and summarizing PubMed. [Diagram: Documents, Concepts.]
14. Wiki success depends on a positive feedback. [Diagram: gene wiki page utility, number of users, number of contributors.]
15. 10,000 gene stubs within Wikipedia. Each stub includes: protein structure, symbols and identifiers, tissue expression pattern, Gene Ontology annotations, links to structured databases, gene summary, protein interactions, linked references. (Huss, PLoS Biol, 2008) [Feedback diagram: Utility, Users, Contributors.]
16. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. (Huss, PLoS Biol, 2008; Good, NAR, 2011)
17. Gene Wiki has a critical mass of editors. Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. (Good, NAR, 2011) [Figure: editor count and edit count.]
18. A review article for every gene is powerful. References to the literature; hyperlinks to related concepts.
     Reelin: 98 editors, 703 edits since July 2002
     Heparin: 358 editors, 654 edits since June 2003
     AMPK: 109 editors, 203 edits since March 2004
     RNAi: 394 editors, 994 edits since October 2002
19. Making the Gene Wiki more computable. [Diagram: Structured annotations, Free text.]
20. Filling the gaps in gene annotation. [Diagram labels: Wikilink, Gene Wiki mapping, NCBI Entrez Gene: 334, GO exact match, GO:0006897, Candidate assertion.] 6319 novel GO annotations; 2147 novel DO annotations.
21. Gene Wiki content improves enrichment analysis. [Diagram: GO term, gene list, PubMed abstracts, concept recognition, enrichment analysis.] Example: axon guidance (GO:0007411), 264 genes; linked genes through PubMed, 811 articles; P = 1.55E-20. Contingency table (row and column labels not preserved beyond Yes/No):
              Yes     No
     Yes       13      2
     No       251  12033
22. Gene Wiki content improves enrichment analysis. [Diagram: GO term, gene list, PubMed abstracts + Gene Wiki, concept recognition, enrichment analysis.] Example: muscle contraction (GO:0006936), 87 genes. Linked genes through PubMed: P = 1.0. Linked genes through PubMed + Gene Wiki: P = 1.22E-09. Figure labels: 251 articles, 87 articles.
23. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed only) vs. p-value (PubMed + GW); regions "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction highlighted.]
24. Making the Gene Wiki more computable. [Diagram: Structured annotations, Free text, Analyses.]
25. Making the Gene Wiki more computable. [Diagram: Structured annotations, Free text, Databases.]
26. Making the Gene Wiki more computable. [Diagram: Databases, Linked Data.]
27. The Long Tail of scientists is a valuable source of information on gene function.
28. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
29. Gene databases are numerous and overlapping ... and hundreds more.
30. Why is there so much redundancy? [Diagram: Users, Requests, Resources, Time, Community development.] BioGPS emphasizes community extensibility.
31. Why do developers define the gene report view? BioGPS emphasizes user customizability.
32. Community extensibility and user customizability. http://biogps.org
33. Utility: A simple and universal plugin interface. [Diagram: URL template + gene entity = rendered URL. Example plugins: KEGG uses {{EntrezGene}}, STRING uses {{EnsemblGene}}, PubMed uses {{Symbol}}.]
34 to 38. Utility: A simple and universal plugin interface.
39. Utility: A simple and universal plugin interface. Total of > 540 gene-centric online databases registered as BioGPS plugins.
40. Users: BioGPS has critical mass. > 6400 registered users; 14,000 unique visitors per month; 155,000 page views per month. [Figure: daily pageviews.]
     Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC
41. Contributors: Explicit and implicit knowledge. 540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains.
42. All resources should provide RDF.
43. Mining structured content from HTML.
44. Defining a data extraction template. Example genes: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1.
45. The BioGPS Semantic Annotator. http://
46. The Long Tail of bioinformaticians can collaboratively build a gene portal.
47. From crowdsourcing to structured data: The Gene Wiki; Biological Games.
48. 48 million human hours
49. Twenty million human hours
50. 150 billion human hours / year
51. Using games to fold proteins: players have successfully
     Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
     Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
     Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
     Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
52. Using games to fold RNAs
53. Using games to align sequences
54. Using games to diagnose malaria infection
55. Using games to map neurons
56. Using games to annotate genes?
57. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
58. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
59. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases; ?????
60. No good gene-disease annotation database. Query: Apolipoprotein E. 477 diseases! Results include: Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction.
61. Play Dizeez to annotate gene-disease links.
     1. Read the clue (gene)
     2. Click the related disease (only one is right)
     3. If it's right, you get points
     4. Then on to the next question
     5. Hurry!
     6. Play to win!
62. Dizeez players seem pretty smart. In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses.
     # Occurrences   Gene     Disease
     11              NBPF3    neuroblastoma
     11              SOX8     mental retardation
     9               ABL1     leukemia
     9               SSX1     synovial sarcoma
     8               APC      colorectal cancer
     8               FES      sarcoma
     8               RBP3     retinoblastoma
     8               GAST     gastrinoma
     8               DCC      colorectal cancer
     8               MAP3K5   cancer
     [Additional table columns: Gene Wiki, OMIM, PharmGKB, PubMed; cell values not preserved.]
63. Using games to predict phenotype from genotype?
64. Classification problems in genome biology. [Diagram: training data of 100s of samples by 100,000s of features, labeled cancer or normal; find patterns; classify new samples as cancer or normal.] Methods: SVM, neural networks, naive Bayes, KNN.
65. Random forests. Sample a subset of cases and features; train a decision tree. [Diagram: cancer vs. normal, 100s of samples, 100,000s of features.]
66. Random forests. [Diagram: cancer vs. normal, 100s of samples, 100,000s of features.]
67. Random forests. Classify new samples as cancer or normal. How to interject biological knowledge?
68. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology.
69. Network-guided forests. Sample features by PPI network; train a decision tree. [Diagram: cancer vs. normal, 100s of samples, 100,000s of features.]
70. Human-guided forests. Sample features by human intelligence; train a decision tree. [Diagram: cancer vs. normal, 100s of samples, 100,000s of features.]
71. [No text on slide.]
72 to 77. The Cure: Genomic predictors for disease.
78. Human-guided forests. Classify new samples as cancer or normal.
79. Critical Assessment-style challenge.
80. Results. 214 registered players; 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%. Player registrations still increasing!
81. The Long Tail of gamers can collaboratively build an accurate disease classifier.
82. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project.
     Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Max Nanis, Chunlei Wu.
     Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco.
     Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820). Contact Su.
83. Doctoral Program in Chemical and Biological Sciences, CALIFORNIA. Office of Graduate Studies, 10550 N. Torrey Pines Road, La Jolla, CA 92037. Email: gradprgrm@scripps.edu. Phone: 858.784.8469.
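
The plugin interface on slide 33 pairs a URL template with a gene entity to produce a rendered URL. A minimal Python sketch of that idea follows. The slide shows only the placeholder names ({{EntrezGene}}, {{EnsemblGene}}, {{Symbol}}), so the `example.org` URLs, the `PLUGINS` registry and the `render_url` helper are all hypothetical and are not BioGPS's actual code or endpoints.

```python
import re
from urllib.parse import quote

# Hypothetical plugin registry. The placeholder names come from slide 33;
# the URL patterns are invented for illustration only.
PLUGINS = {
    "KEGG": "https://example.org/kegg?gene={{EntrezGene}}",
    "STRING": "https://example.org/string?id={{EnsemblGene}}",
    "PubMed": "https://example.org/pubmed?term={{Symbol}}",
}

# Matches placeholders of the form {{Name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(template, gene):
    """Fill every {{Key}} in the template with the gene's identifier for Key."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no identifier named {key!r}")
        # Percent-encode the identifier so it is safe inside a URL.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)


# Gene entity for human TP53 with its public identifiers.
TP53 = {
    "Symbol": "TP53",
    "EntrezGene": 7157,
    "EnsemblGene": "ENSG00000141510",
}

if __name__ == "__main__":
    for name, template in PLUGINS.items():
        print(name, render_url(template, TP53))
```

Under this design, registering a new resource requires only one template string, which is one plausible reading of why the slide calls the interface "simple and universal".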
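
The enrichment example on slide 21 reports a 2x2 table (13, 2, 251, 12033) and P = 1.55E-20, but the transcript does not preserve the table's labels or name the statistical test. The sketch below assumes the rows mean "linked through PubMed" (yes/no), the columns mean "annotated with the GO term" (yes/no), and that the test is a one-sided hypergeometric (Fisher-style) test. All three are assumptions, as is the helper name `hypergeom_tail`. The column sum 13 + 251 = 264 does agree with the 264 genes stated on the slide.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation (one-sided enrichment p-value)."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Cell counts from slide 21, under the assumed row/column labelling.
linked_in_term, linked_not_in_term = 13, 2
unlinked_in_term, unlinked_not_in_term = 251, 12033

N = linked_in_term + linked_not_in_term + unlinked_in_term + unlinked_not_in_term
K = linked_in_term + unlinked_in_term      # genes annotated with the GO term
n = linked_in_term + linked_not_in_term    # genes linked through PubMed

p_value = hypergeom_tail(linked_in_term, K, n, N)

if __name__ == "__main__":
    print(f"N={N}, K={K}, n={n}, p={p_value:.3g}")
```

With these assumptions the result is on the order of 1E-20, the same order of magnitude as the value shown on the slide, which suggests (but does not confirm) that this reading of the table is the intended one.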