The digital revolution in phenotyping

Anika Oellrich*, Nigel Collier*, Tudor Groza*, Dietrich Rebholz-Schuhmann*, Nigam Shah*, Olivier Bodenreider, Mary Regina Boland, Ivo Georgiev, Hongfang Liu, Kevin Livingston, Augustin Luna, Ann-Marie Mallon, Prashanti Manda, Peter N. Robinson, Gabriella Rustici, Michelle Simon, Liqin Wang, Rainer Winnenburg and Michel Dumontier

Corresponding author. Anika Oellrich, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom. E-mail: anika.oellrich@kcl.ac.uk
* These authors contributed equally to this work.

Anika Oellrich is a Senior Bioinformatician at the Wellcome Trust Sanger Institute. Her work focuses on aspects of phenotype mining, in large data sets as well as scientific literature.
Nigel Collier is Principal Research Associate at the University of Cambridge where he is co-head of the Language Technology Laboratory. His research brings together computational techniques for machine understanding of text.
Tudor Groza is the Phenomics Team Leader at the Kinghorn Center for Clinical Genomics, Garvan Institute of Medical Research. His research focuses on phenotype acquisition and application in a clinical setting.
Dietrich Rebholz-Schuhmann is Professor for Informatics at the National University of Ireland and Site Director of the Insight Centre for Data Analytics in Galway. His research focuses on biomedical semantic web technologies.
Nigam Shah is an Assistant Professor of Medicine and Biomedical Informatics at Stanford University. His research focuses on using ontologies and text-mining to enable learning from electronic medical records.
Olivier Bodenreider is a Senior Scientist at the US National Library of Medicine. His research focuses on terminology and ontology in the biomedical domain.
Mary Regina Boland is a PhD fellow in the Department of Biomedical Informatics at Columbia University, New York, NY, USA. Her research involves developing informatics algorithms for epidemiology and genetics.
Ivo Georgiev is a Postdoctoral Fellow at Computational Bioscience Program at the University of Colorado School of Medicine. He researches methods for unsupervised extraction of biological mechanism information from domain corpora of biomedical publications.
Hongfang Liu is an associate professor in biomedical informatics at Mayo College of Medicine, Rochester, MN, USA. Her research areas include biomedical natural language processing and data mining.
Kevin Livingston is a Research Associate at the University of Colorado School of Medicine. His research focuses on knowledge representation and reasoning to support data integration and hypothesis generation.
Augustin Luna is a Research Scholar at Memorial Sloan Kettering Cancer Center. He is a recipient of the NIH NRSA Ruth Kirschstein Postdoctoral Fellowship for drug discovery-related research.
Ann-Marie Mallon is the Head of Bioinformatics at the Medical Research Council Mammalian Genetics Unit. Her research focuses on developing tools for analyzing large-scale mouse functional data from high-throughput phenotyping.
Prashanti Manda is a postdoctoral associate in the Department of Biology at the University of North Carolina, Chapel Hill. Her research interests include Bioinformatics, Bio-ontologies, Semantic Similarity and Data Mining.
Peter Robinson is Professor for Medical Genomics at Charité Universitätsmedizin Berlin and Professor of Bioinformatics at the Free University Berlin. His research interests include computational phenotype analysis, exome and genome sequencing, bioinformatics for next-generation sequencing and Marfan syndrome.
Gabriella Rustici is the Bioinformatics Training Manager at the School of Biological Sciences of the University of Cambridge, UK. She is an active contributor to the development of the Cellular Microscopy Phenotype Ontology for the annotation of cellular phenotypes derived from high-content screening data.
Michelle Simon is a postdoc in the Bioinformatics department of the Mammalian Genetics Unit, MRC, Harwell, UK. Her research areas include phenomics, functional genomics and next-generation sequencing.
Liqin Wang is a PhD fellow in the Department of Biomedical Informatics at University of Utah, Salt Lake City, UT, USA. Her research focuses on medical knowledge acquisition and disease-specific ontologies development and application.
Rainer Winnenburg is a postdoctoral fellow at the Center for Biomedical Informatics Research at Stanford University. He is interested in information extraction from biomedical ontologies, electronic medical records and literature.
Michel Dumontier is an Associate Professor of Medicine at the Center for Biomedical Informatics Research at Stanford University. His research focuses on developing methods for biomedical knowledge discovery.

Submitted: 3 June 2015; Received (in revised form): 4 August 2015

© The Author 2015. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Briefings in Bioinformatics, 2015, 1–12. doi: 10.1093/bib/bbv083. Paper. Advance Access published September 29, 2015.

Abstract

Phenotypes have gained increased notoriety in the clinical and biological domain owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and the data storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state-of-the-art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we provide visions of this field for future research work that could enable better applications of phenotype data.

Key words: phenomics; phenotypes; acquisition; interoperability; semantic representation; knowledge discovery

Introduction

Phenotypes are broadly defined as observable characteristics of organisms and have gained great importance since the discovery of the causative relationship between a given underlying genetic mechanism (e.g. gene expression levels, mutations) and its phenotypic manifestation. Subsequently, diverse initiatives have focused on developing and curating resources that capture this causal relationship at multiple levels and in the context of multiple organisms. Examples include, but are not limited to, the Online Mendelian Inheritance in Man (OMIM) database [1], the Mouse Genome Informatics database (MGD) [2], FlyBase [3] and the Zebrafish Model Organism database (ZFIN) [4].

The increasing development and exploitation of phenotypes has led to a varied range of applications: identifying disease genes [5–9] and characterizing functionally yet unclassified genes [10–12], repurposing drugs [13, 14], pharmacogenomics [15–17] and pharmacovigilance [18], as well as solving evolutionary questions [19, 20].

The goal of this review is to synthesize the state-of-the-art in the past 10 years of Phenome Research and hence provide a unique and broad access point for those interested in studying topics in specific areas of phenomics. The areas covered include both phenotype data evolving from biological experiments and data needed in a clinical environment. This work also presents visions for the coming years in phenomics research, as it derives a series of open challenges based on input collected from the community. We note here that the work addressed in this article focuses on computational phenotyping, i.e. the collection, representation and processing of phenotypes in a computer-interpretable format.

To enable a structured navigation of the field, we map the content of the review onto the four conceptual dimensions of phenomics considered from a computational perspective (depicted in Figure 1): representation, interoperability, acquisition and processing. These four dimensions have been identified and described in an earlier review [21] and are used for simplicity here. Representation focuses on semantic modeling and aspects of knowledge capturing. Interoperability, an orthogonal dimension to representation, aims to facilitate intra- and interspecies phenotype mappings. Acquisition refers to the transformation of the raw data into semi-structured or structured phenotype representations. Processing (or application) uses externalized phenotypes to address fundamental or specific challenges, from variant prioritization and diagnosis to individualized preventive care or drug repurposing.

Each of these dimensions encapsulates a multitude of aspects that reflect, in their aggregated form, the intrinsic complexity of phenotypes. For example, subject to the underlying domain, the granularity of the representation of phenotypes may differ depending on their application. While biologists capture data that may be too detailed for clinical applications, clinicians need solid and thoroughly supported evidence to be able to test hypotheses derived from biological experiments. Furthermore, phenotypes may be defined and represented in the context of different organisms, e.g. decreased bone mineral density (MP:0000063) in the Mammalian Phenotype Ontology (MP) [22] and Osteopenia (HP:0000938) in the Human Phenotype Ontology (HPO) [23], even though they may share an underlying goal from a translational perspective, such as the documentation of a certain gene function. This leads to the need for achieving cross-species interoperability to enable integrated processing. Finally, the phenotype acquisition process poses its own challenges, i.e. technical (process automation or interface usability), social (incentive) and ethical (privacy). In the following, we discuss in depth each of the dimensions introduced above. A summary of all the resources provided in this manuscript is provided in Table 1.
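As a rough illustration of what cross-species interoperability asks for, the Python sketch below stores one link between the two terms named above and looks it up in both directions. Treating the MP and HPO terms as corresponding is an assumption made for illustration only, and the table and function names are hypothetical.

```python
from typing import Dict, Set

# One illustrative cross-species link, using the two identifiers from the text.
# Treating them as corresponding terms is an assumption of this sketch.
MOUSE_TO_HUMAN: Dict[str, Set[str]] = {
    "MP:0000063": {"HP:0000938"},  # decreased bone mineral density -> Osteopenia
}


def human_terms_for(mouse_term: str) -> Set[str]:
    """Return the human phenotype terms linked to a mouse phenotype term."""
    return set(MOUSE_TO_HUMAN.get(mouse_term, set()))


def mouse_terms_for(human_term: str) -> Set[str]:
    """Return the mouse phenotype terms linked to a human phenotype term."""
    return {mp for mp, hps in MOUSE_TO_HUMAN.items() if human_term in hps}
```

A term without a link simply yields an empty set, which is the case that a curated mapping resource would have to fill.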

State-of-the-art phenome research

Representation

A phenotype can be any observation of a normal or abnormal state of an anatomical, physiological or biochemical property of an organism. While phenotypes in a biological domain are recorded as results from biological experiments, phenotypes in a clinical domain are used to report the assessment of patients. Phenotypes span from the molecular level to the organism level [42]. To enable the other three dimensions to reach their full potential, a representation is needed that is well understood by humans and at the same time computer-readable and hence amenable to computational analyses. Such a representation does not only have to cover normal and abnormal phenotypes within a species, but also has to facilitate the bridging across species and enable integration of heterogeneous data at different levels of granularity.

From a computational perspective, phenotypes take diverse representations: (i) free-text descriptions, e.g. as part of the OMIM disease presentations; (ii) vocabularies, e.g. the clinical synopsis in OMIM or the London Dysmorphology Database terminology [25]; and (iii) ontologies, i.e. vocabularies augmented with domain-specific relationships, e.g. HPO or MP. While free text descriptions support a better human understanding, they limit the possibilities for automated data analysis [43]. With the abundance and ever-increasing amount of data, the ultimate goal is to build a uniform and consistent computer-readable representation, one that enables a seamless collection and integration of phenotypes recorded in biological studies as well as in a clinical environment. Ideally, this uniform global representation would also account for both qualified and quantified data and enable flexible conversion where possible. In cases where conversion is not possible, this universal representation would have to be extended with mappings.

Ontologies to represent phenotypes

Currently, the field consists of a varied set of vocabularies and ontologies that support, in various forms, the abovementioned goal. In particular, driven by the wide adoption from the biomedical community, ontologies have become the de facto standard for representing phenotypes. To achieve the goal to its full extent, the community has followed two complementary approaches for modeling and integrating phenotype data: a pre-composed and a post-composed representation (see Figure 2). The pre-composed approach treats each phenotype as an atomic entity, using individual expressions most suitable to general human understanding. For example, an ontology adopting this representation consists of concept definitions like 'erythrocytopenia' or 'deficiency of red blood cells', 'deficiency of erythrocytes'. These concepts are easily understood by humans and also facilitate computational analysis.

Figure 1. The four dimensions of the phenotype development phases. Representation: Subject to the underlying domain and goal, phenotypes may be represented at different granularity levels. Interoperability: Existing ontologies and vocabularies externalize domain-specific phenotype knowledge at different levels of granularity. Acquisition: Capturing and documenting phenotypes in any representational format can be achieved manually (via curation) or automatically (via text mining). Processing: Representing and capturing phenotypes in a structured manner (a form that also enables interoperability) has led to their application in a large variety of domains. The arrows denote direct points of connection between the several phenotyping dimensions. Note that this figure only serves as illustration of the interplay of the four dimensions and thus is not aimed at comprehensiveness (e.g. interoperability could also be achieved with a mapping instead of EQ statements).

The post-composed representation uses elementary phenotypic units from existing ontologies to compose specific complex phenotypes. One mechanism to post-compose phenotypes is the entity-quality (EQ) statement approach [24, 44]. For example, instead of defining 'erythrocytopenia' as an atomic concept, this approach represents the meaning of the phenotype by linking the quality 'deficiency' with the anatomic entity 'red blood cells'. This link is then captured via a logical axiom using concepts introduced by existing ontologies, such as the Gene Ontology (GO) [26] and the Phenotypic quality and Trait Ontology (PATO) [24]. The caveats of the post-composition result from the development overheads in building post-composed statements. Additionally, a number of pre-composed phenotype ontologies still need to be transformed into a post-composed representation.
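To make the contrast between the two representations concrete, the following Python sketch models the 'erythrocytopenia' example both ways. The class names are hypothetical, the identifiers prefixed with EX: are placeholders rather than real ontology identifiers, and the lookup table is only a stand-in for the logical axioms described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Term:
    """An ontology concept: identifier plus human-readable label."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """Post-composed phenotype: an entity linked to a quality."""
    entity: Term
    quality: Term


# Pre-composed: the phenotype is a single atomic concept (placeholder ID).
ERYTHROCYTOPENIA = Term("EX:0000001", "erythrocytopenia")

# Post-composed: the same meaning built from elementary units (placeholder IDs).
RED_BLOOD_CELL = Term("EX:0000002", "red blood cell")
DEFICIENCY = Term("EX:0000003", "deficiency")

# Stand-in for a logical axiom that defines the atomic concept by an EQ pair.
EQUIVALENCES: Dict[Term, EQStatement] = {
    ERYTHROCYTOPENIA: EQStatement(entity=RED_BLOOD_CELL, quality=DEFICIENCY),
}


def decompose(phenotype: Term) -> Optional[EQStatement]:
    """Return the EQ statement for a pre-composed term, if one is defined."""
    return EQUIVALENCES.get(phenotype)
```

A pre-composed term with no entry in the table returns `None`, which mirrors the situation of ontologies that have not yet been transformed into a post-composed form.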

'Normal' and 'abnormal' phenotypes

Some of the existing phenotype representations focus on deviations of phenotypes (i.e. their status or quality from a reference phenotype) [27]. The reference phenotype in the case of model organisms could be either the wild type of the organism or a specific strain from which the mutation has been generated. In the best case, the phenotype representations form the core that enables interoperability of different data repositories, possibly covering different organisms and being collected with different aims in mind. The quality of the phenotypic resource, i.e. the consistency of the phenotypic definitions, the overall structure of the phenotype semantic resource and in particular the completeness of the electronic resource, holds the key to enabling efficient data analysis, interpretation and decision support.

To extend beyond representing 'abnormal', Shimoyama et al. have extended the representation of phenotypes to also incorporate environmental factors as well as methods used to measure the phenotype [45]. The developed ontologies have been used to annotate both rat (http://rgd.mcw.edu/wg/physiology) and human data (http://cover.wustl.edu/Cover). Furthermore, the suggested framework allows for the annotation of quantified phenotypes (e.g. 'blood sugar levels > 8.5 mmol/L') instead of 'abnormal increased blood sugar levels'. While this way of representation could be used to represent the reference phenotype, a data curator is needed to provide this information. Conversions from the different states of 'normal', e.g. similar tests in different species, are not automatically available.

Table 1. Summarizes all resources mentioned throughout the manuscript, together with their URL and reference (where applicable)

Resource | Link | Reference
Online Mendelian Inheritance in Man database | http://omim.org | [1]
Mouse Genome Database | http://informatics.jax.org | [2]
FlyBase | http://flybase.org | [3]
Zebrafish Model Organism Database | http://zfin.org | [4]
Mammalian Phenotype Ontology | http://www.berkeleybop.org/ontologies/mp | [20]
Human Phenotype Ontology | http://purl.obolibrary.org/obo/hp.obo | [21]
London Dysmorphology Database | http://www.lmdatabases.com | [23]
Gene Ontology | http://geneontology.org | [24]
Phenotypic quality and Trait Ontology | http://purl.obolibrary.org/obo/pato | [25]
OrphaNet | http://www.orpha.net/consor/cgi-bin/index.php | [26]
PharmGKB | https://www.pharmgkb.org | [27]
Zebrafish Anatomical Ontology | http://purl.obolibrary.org/obo/zfa | [28]
International Mouse Phenotyping Consortium | http://www.mousephenotype.org | [29, 30]
IMPReSS | https://www.mousephenotype.org/impress |
Phenote | http://www.phenote.org |
PhenoTips | https://phenotips.org | [31]
MetaMap | http://www.nlm.nih.gov/research/umls/implementation_resources/metamap.html | [32]
NCBO Annotator | https://bioportal.bioontology.org/annotator |
cTakes | http://ctakes.apache.org | [33]
ShARe/CLEF 2013 | https://sites.google.com/site/shareclefehealth/data | [34]
DeepPhe | http://cancer.healthnlp.org |
Bio-LarK | http://bio-lark.org | [35]
PhenoMiner | https://sites.google.com/site/nhcollier/projects/PhenoMiner | [36]
Unified Medical Language System | http://www.nlm.nih.gov/research/umls | [37]
Unified Medical Language System Metathesaurus tool | http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html | [38]
UberPheno | http://purl.obolibrary.org/obo/hp/uberpheno | [39]
SNOMED CT | http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html | [40]
AgreementMaker | http://agreementmaker.org | [41]
Zooma | http://www.ebi.ac.uk/fgpt/zooma |
SIDER | http://sideeffects.embl.de | [14]
AVAToL | http://avatol.org | [17]
PhenoScape | http://phenoscape.org | [18]
ORCID | http://orcid.org |
ResearcherID | http://www.researcherid.com |

Representing qualitative and quantitative phenotypes

Other desiderata for phenotype representations are focused on the exploitation of ontologies for efficient propagation of experimental findings in basic phenomics research into the clinical domain and improved research efficiency in both domains (translational medicine). Consequently, phenotype descriptions have to meet clinical needs and cover those diseases that are most relevant to the clinical context. Experimental phenotypic descriptions are detailed and reflect the experimental setup, whereas clinical descriptions suffer from time constraints and thus tend to lack observational detail. Furthermore, experimental and clinical phenotypic descriptions may be organized at diverse levels of granularity and may be biased toward a specific perspective. For example, experimental findings provide the opportunity to capture and represent quantitative traits (e.g. 'blood sugar > 8.5 mmol/L'), which may require adaptation into qualitative terms ('high blood sugar') for clinical purposes. Similarly, from a diagnosis perspective, one may require a complete and individualized view over the phenotypic profile, which may include degrees of severity [28] and longitudinal phenotypes [29], hence adding to the overall complexity of the representation.
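The adaptation of a quantitative trait into a qualitative term can be sketched as a threshold lookup. The Python example below is a minimal illustration under stated assumptions: the cut-off of 8.5 mmol/L is simply the value used in the example above and is not a clinical reference range, and the class, table and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    """A quantified phenotype observation."""
    trait: str
    value: float
    unit: str


# Illustrative cut-off taken from the example in the text. A real reference
# range would depend on species, strain, assay and clinical guideline.
UPPER_LIMITS: Dict[Tuple[str, str], float] = {
    ("blood sugar", "mmol/L"): 8.5,
}


def to_qualitative(m: Measurement) -> Optional[str]:
    """Map a measurement to a qualitative term, or None if it is not elevated.

    Raises ValueError when no threshold is known for the trait and unit,
    i.e. when a conversion is not possible without a further mapping.
    """
    key = (m.trait, m.unit)
    if key not in UPPER_LIMITS:
        raise ValueError(f"no threshold for {m.trait} in {m.unit}")
    return f"high {m.trait}" if m.value > UPPER_LIMITS[key] else None
```

The explicit error for an unknown unit reflects the point made earlier: where conversion is not possible, the representation has to be extended with mappings rather than guessed.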

Representation of phenotypes summary

The resources for representing phenotypes have reached a point where they are able to provide a solid and rich foundation for building advanced acquisition and processing mechanisms. Open challenges still exist, e.g. modeling degrees of severity, normal states or negation (i.e. explicitly mentioning the absence of an abnormality), or mapping quantitative traits to qualitative concepts to provide deep knowledge capturing methodologies.

Acquisition

Acquisition involves the collection and storage of phenotype information from various resources (see Figure 3), such as OMIM or OrphaNet, a rare disease database [30]. While some of these resources are mainly built through manual curation, e.g. MGD, others rely already on (semi-)automated preprocessing to enhance curator throughput. For example, PharmGKB [46] uses an automated classification system to determine relevant publications and extract gene–drug relationships that are then provided to curators for verification [31, 47].

Manual acquisition of phenotypes

Manual acquisition of phenotypes can be done either by curation of the literature or by direct submission from investigators. These two main modes are used to annotate model organism data with phenotype observations and their conceptual descriptions. In the case of MGD, curators provide standard phenotype descriptions from MP along with supporting evidence. ZFIN similarly provides phenotypic data with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.

Figure 2. To date, phenotypes have mostly been captured and defined using a pre-composed and/or a post-composed representation. A pre-composed representation assumes the definition of a phenotype as a monolithic concept, i.e. a concept that captures the essence of the phenotype semantics. The post-composed representation decomposes the phenotype into an Entity-Quality pair, with its individual components being mapped to appropriate ontological concepts. In this case, the phenotype semantics is denoted by the compositional property of the pair. The transition between pre-composed and post-composed is realized via logical axioms. Both forms of representation have been successfully applied across different species.

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term 'abnormal grip strength' (MP:0001515).
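The link between a procedure and the annotation it may induce can be pictured as a lookup keyed by the procedure. The Python sketch below is not the IMPReSS data model; the procedure key `grip_strength` and the function name are hypothetical, and only the MP term and its identifier come from the text.

```python
from typing import Dict, NamedTuple, Optional


class Suggestion(NamedTuple):
    """An ontology term that a procedure may induce."""
    term_id: str
    label: str


# Hypothetical procedure key; the MP term is the one quoted in the text.
SOP_SUGGESTIONS: Dict[str, Suggestion] = {
    "grip_strength": Suggestion("MP:0001515", "abnormal grip strength"),
}


def suggest_annotation(procedure: str, is_abnormal: bool) -> Optional[Suggestion]:
    """Return the suggested term only when the test outcome was abnormal."""
    if not is_abnormal:
        return None
    return SOP_SUGGESTIONS.get(procedure)
```

How an outcome is judged abnormal (for instance by a statistical comparison against a reference strain) is deliberately left outside this sketch.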

In addition to Phenote, mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians to record phenotypic profiles for patients with rare genetic disorders using HP and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state-of-the-art in text processing, the data generated with an automated method may contain some incorrect data.
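The difference between the two modes comes down to whether a verification step sits between extraction and storage. The Python sketch below illustrates this with a single function; the candidate records, the `curator` rule and all names are hypothetical and hard-coded for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional

Candidate = Dict[str, str]  # keys used here: "text" and "term_id"


def acquire(
    candidates: Iterable[Candidate],
    verify: Optional[Callable[[Candidate], bool]] = None,
) -> List[Candidate]:
    """Keep extracted annotations.

    With verify=None the run is fully automated and everything is kept.
    With a verify callback the run is semi-automated: only candidates
    that the curator accepts are kept.
    """
    accepted = []
    for candidate in candidates:
        if verify is None or verify(candidate):
            accepted.append(candidate)
    return accepted


EXTRACTED: List[Candidate] = [
    {"text": "reduced grip strength", "term_id": "MP:0001515"},
    # A plausible false positive: the sentence reports a test, not a finding.
    {"text": "bone mineral density was measured", "term_id": "MP:0000063"},
]


def curator(candidate: Candidate) -> bool:
    """Hypothetical curator decision, hard-coded for this example."""
    return "was measured" not in candidate["text"]
```

Calling `acquire(EXTRACTED)` keeps both records, including the incorrect one, whereas `acquire(EXTRACTED, verify=curator)` keeps only the first.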

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional, fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is, in principle, the only viable solution, it has its own plethora of technical challenges. These include, among others: (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.
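The challenges of boundary detection and negation named in Figure 3 can be made concrete with a small sketch. The Python below is a toy dictionary-based annotator, not the method of MetaMap, the NCBO Annotator or cTAKES: the lexicon, the `EX:` concept identifiers, the list of negation cues and the 30-character look-back window are all assumptions chosen for illustration. It prefers the longest matching term (so that "excessive scarring" wins over "scarring") and flags a mention as negated when a cue word appears shortly before it.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> placeholder concept ID.
LEXICON = {
    "scarring": "EX:0001",
    "excessive scarring": "EX:0002",
    "slow healing": "EX:0003",
}

# A few illustrative negation cues, looked for just before a mention.
NEGATION_CUES = ("no", "without", "denies", "absence of")


@dataclass
class Mention:
    start: int
    end: int
    text: str
    concept_id: str
    negated: bool


def annotate(sentence: str) -> list[Mention]:
    """Return non-overlapping lexicon matches, longest term first."""
    lowered = sentence.lower()
    mentions: list[Mention] = []
    # Longer terms are tried first so the widest span wins (boundary detection).
    for term in sorted(LEXICON, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            overlaps = any(m.start() < x.end and x.start < m.end() for x in mentions)
            if overlaps:
                continue
            # Crude negation check: a cue word within 30 characters before the span.
            window = lowered[max(0, m.start() - 30):m.start()]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", window)
                for cue in NEGATION_CUES
            )
            mentions.append(
                Mention(m.start(), m.end(), sentence[m.start():m.end()],
                        LEXICON[term], negated)
            )
    return sorted(mentions, key=lambda x: x.start)


positive = annotate("Patient shows slow healing and excessive scarring.")
negative = annotate("No excessive scarring was observed.")
```

A fixed look-back window will wrongly negate mentions that follow an unrelated negated clause, which is one reason the interpretation step is listed as an open challenge rather than a solved one.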

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical course face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts, and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
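The trade-off described above is usually quantified by comparing the output of a method against a manually curated gold standard. The following Python sketch computes precision and recall over sets of annotations; the annotation strings are invented placeholders, and real evaluations also have to decide how to score partial span matches, which this sketch ignores.

```python
def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of extracted items that are correct.
    Recall: share of gold-standard items that were found."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder annotations: four curated phenotypes, three extracted ones,
# of which two are correct and one is spurious.
gold = {"doc1:short stature", "doc1:scoliosis", "doc2:cleft palate", "doc2:deafness"}
extracted = {"doc1:short stature", "doc2:cleft palate", "doc2:fever"}

precision, recall = precision_recall(extracted, gold)
```

In this example two of three extracted items are correct (precision 2/3) while two of four curated items are recovered (recall 1/2), which illustrates both failure modes mentioned in the text: a missed relevant result and an incorrect returned result.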

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species, and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see 2.1 representation), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Köhler and co-authors made the UberPheno ontology publicly available [64].
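The idea behind EQ statements can be illustrated with a minimal sketch. The Python below models an EQ statement as an entity-quality pair and links species-specific terms that decompose into the same pair. All identifiers (`HP:EXAMPLE1`, `ANAT:femur`, `QUAL:decreased_length` and so on) are hypothetical placeholders rather than real ontology IDs, and exact matching on the pair is a simplifying assumption: the integration work cited above relies on reasoning over the underlying ontologies, not on string equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    entity: str   # what is affected, e.g. an anatomical structure
    quality: str  # how it is affected, e.g. a term from a quality ontology


# Hypothetical decompositions of pre-composed, species-specific terms.
HUMAN_TERMS = {
    "HP:EXAMPLE1": EQStatement("ANAT:femur", "QUAL:decreased_length"),
    "HP:EXAMPLE2": EQStatement("ANAT:heart", "QUAL:increased_size"),
}
MOUSE_TERMS = {
    "MP:EXAMPLE7": EQStatement("ANAT:femur", "QUAL:decreased_length"),
    "MP:EXAMPLE8": EQStatement("ANAT:tail", "QUAL:kinked"),
}


def cross_species_matches(
    a: dict[str, EQStatement], b: dict[str, EQStatement]
) -> dict[str, list[str]]:
    """Link terms from `a` to terms from `b` that share the same EQ pair."""
    index: dict[EQStatement, list[str]] = {}
    for term, eq in b.items():
        index.setdefault(eq, []).append(term)
    return {term: index[eq] for term, eq in a.items() if eq in index}


matches = cross_species_matches(HUMAN_TERMS, MOUSE_TERMS)
```

The sketch only works because both species describe the affected entity with the same identifier, which is the practical motivation for making EQ statements interoperable with shared anatomy and physiology ontologies.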

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are in progress to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that exist in only one of the species and not the other.
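How a lexical alignment ends up as a 1:n mapping can be shown with a deliberately simple sketch. The Python below is neither AgreementMaker's nor Zooma's algorithm; it is a toy matcher that compares order-independent token sets of labels and synonyms. The identifiers (`S:1`, `T:10`, ...) and labels are invented for illustration.

```python
import re
from collections import defaultdict

STOPWORDS = {"of", "the"}


def normalize(label: str) -> frozenset[str]:
    """Lower-case, split into alphanumeric tokens, ignore order and stop words."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def align(source: dict[str, str], target: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each source term to every target concept with a lexically equal label.

    `source` maps term IDs to one label; `target` maps concept IDs to a list
    of labels and synonyms.
    """
    index: dict[frozenset[str], set[str]] = defaultdict(set)
    for concept_id, labels in target.items():
        for label in labels:
            index[normalize(label)].add(concept_id)
    return {
        term_id: sorted(index.get(normalize(label), ()))
        for term_id, label in source.items()
    }


# Hypothetical vocabularies.
source_terms = {"S:1": "Elevated blood pressure", "S:2": "Kinked tail"}
target_concepts = {
    "T:10": ["blood pressure, elevated"],
    "T:11": ["elevated blood pressure", "hypertension"],
    "T:12": ["short tail"],
}

mapping = align(source_terms, target_concepts)
```

Here `S:1` is associated with two target concepts (a 1:n mapping that still needs a decision by a curator or a further heuristic), while `S:2` finds no counterpart, mirroring the case of a phenotype that exists in only one of the resources.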

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for a unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that needs to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that, in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams of their own. Phenotypes have been used as a unique source of data, for example for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74], or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=458) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has only recently been discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as of the ability to represent states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used to investigate the causal relationship between dosage and effect by Eriksson et al. [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Its ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignments, either owing to their foundational representation or owing to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems, such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com), can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. Recording more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.
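The dependence of an ‘abnormal’ call on recorded ‘normal’ data can be sketched in a few lines. The Python below labels a measurement relative to a reference population using a z-score; the cutoff of two standard deviations, the labels and the example values are assumptions made for illustration and do not describe how any particular phenotyping resource derives its annotations.

```python
from statistics import mean, stdev


def classify(value: float, reference: list[float], z_cutoff: float = 2.0) -> str:
    """Label a measurement relative to values recorded for 'normal' individuals."""
    mu = mean(reference)
    sigma = stdev(reference)  # needs at least two reference values
    if sigma == 0:
        raise ValueError("reference values show no variation")
    z = (value - mu) / sigma
    if z > z_cutoff:
        return "increased"
    if z < -z_cutoff:
        return "decreased"
    return "normal"


# Hypothetical reference measurements from unaffected controls.
controls = [5.0, 5.2, 4.8, 5.1, 4.9]

labels = [classify(v, controls) for v in (5.1, 6.0, 4.0)]
```

Without the `controls` list there is nothing to compare against, which is the practical consequence of ‘normal’ phenotypes being collected only in some cases.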

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address so far missing aspects, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Other aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example how phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
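The spam-filter analogy can be sketched as a registry of phenotype classifiers that are run against each incoming record. The Python below is purely illustrative: the feature names, weights, bias and the 0.9 alert threshold are invented and have no clinical validity, and a logistic score is only one of many model types that could play this role.

```python
import math
from typing import Callable

Record = dict[str, float]


def make_logistic_phenotype(
    weights: dict[str, float], bias: float
) -> Callable[[Record], float]:
    """Build a scoring function that returns a confidence between 0 and 1."""
    def score(record: Record) -> float:
        z = bias + sum(w * record.get(feature, 0.0) for feature, w in weights.items())
        return 1.0 / (1.0 + math.exp(-z))
    return score


# Hypothetical phenotype model; weights are made up for the example.
PHENOTYPES: dict[str, Callable[[Record], float]] = {
    "possible pneumonia": make_logistic_phenotype(
        {"fever": 2.0, "cough": 1.5, "infiltrate_on_xray": 3.0}, bias=-4.0
    ),
}


def alerts(record: Record, threshold: float = 0.9) -> list[str]:
    """Return the names of all phenotype models that match with high confidence."""
    return [name for name, model in PHENOTYPES.items() if model(record) >= threshold]


flagged = alerts({"fever": 1.0, "cough": 1.0, "infiltrate_on_xray": 1.0})
quiet = alerts({"cough": 1.0})
```

Adding a further question such as ‘does the patient have sepsis?’ would amount to registering another entry in `PHENOTYPES`, which is what makes the classifier view attractive for background monitoring.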

With improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools is required to unlock the full potential of phenotype data.

Given the achievements of the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and contribute to evolving and shaping the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data have become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to AO; R01 LM011369 and R01 GM101430 and U54 HG004028 to NHS; T15 LM00707 to MRB; R01-LM008111 to KL; R01 GM102282 and R01 LM011369 to HL; U24 CA143840 to AL; U54HG006370 to AM; U54 HG008033-01 to MD]; the Wellcome Trust [098051 to AO]; a Marie Curie experience researcher fellowship [301806 to NC]; the National Science Foundation [1207592 to IG; DBI-1062404 and DBI-1062542 and EF-0905606 to PM]; the Bundesministerium für Bildung und Forschung [0313911 to PNR]; the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to PNR]; the Systems Microscopy NoE project [grant agreement 258068 to GR]; and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to KL]. TG was supported by the Kinghorn Foundation. OB was supported by the Intramural Research Program of the NIH, National Library of Medicine. MS was supported by the Medical Research Council. LW was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. RW was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabasi AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised data base for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.


36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salantera S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web – ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.

Downloaded from http://bib.oxfordjournals.org/ by guest on January 25, 2016

79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.


Abstract

Phenotypes have gained increased notoriety in the clinical and biological domain owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support ‘bench to bedside’ efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and the data storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state-of-the-art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we provide visions of this field for future research work that could enable better applications of phenotype data.

Key words: phenomics; phenotypes; acquisition; interoperability; semantic representation; knowledge discovery

Introduction

Phenotypes are broadly defined as observable characteristics of organisms and have gained great importance since the discovery of the causative relationship between a given underlying genetic mechanism (e.g. gene expression levels, mutations) and its phenotypic manifestation. Subsequently, diverse initiatives have focused on developing and curating resources that capture this causal relationship at multiple levels and in the context of multiple organisms. Examples include, but are not limited to, the Online Mendelian Inheritance in Man (OMIM) database [1], the Mouse Genome Informatics database (MGD) [2], FlyBase [3] and the Zebrafish Model Organism database (ZFIN) [4].

The increasing development and exploitation of phenotypes has led to a varied range of applications: identifying disease genes [5–9] and characterizing functionally yet unclassified genes [10–12], repurposing drugs [13, 14], pharmacogenomics [15–17] and pharmacovigilance [18], as well as solving evolutionary questions [19, 20].

The goal of this review is to synthesize the state-of-the-art in the past 10 years of Phenome Research and, hence, provide a unique and broad access point for those interested in studying topics in specific areas of phenomics. The areas covered include both phenotype data evolving from biological experiments and data needed in a clinical environment. This work also presents visions for the coming years in phenomics research, as it derives a series of open challenges based on input collected from the community. We note here that the work addressed in this article focuses on computational phenotyping, i.e. the collection, representation and processing of phenotypes in a computer-interpretable format.

To enable a structured navigation of the field, we map the content of the review onto the four conceptual dimensions of phenomics, considered from a computational perspective (depicted in Figure 1): representation, interoperability, acquisition and processing. These four dimensions have been identified and described in an earlier review [21] and are used for simplicity here. Representation focuses on semantic modeling and aspects of knowledge capturing. Interoperability, an orthogonal dimension to representation, aims to facilitate intra- and interspecies phenotype mappings. Acquisition refers to the transformation of the raw data into semi-structured or structured phenotype representations. Processing (or application) uses externalized phenotypes to address fundamental or specific challenges, from variant prioritization and diagnosis to individualized preventive care or drug repurposing.

Each of these dimensions encapsulates a multitude of aspects that reflect, in their aggregated form, the intrinsic complexity of phenotypes. For example, subject to the underlying domain, the granularity of the representation of phenotypes may differ depending on their application. While biologists capture data that may be too detailed for clinical applications, clinicians need solid and thoroughly supported evidence to be able to test hypotheses derived from biological experiments. Furthermore, phenotypes may be defined and represented in the context of different organisms, e.g. decreased bone mineral density (MP:0000063) in the Mammalian Phenotype Ontology (MP) [22] and Osteopenia (HP:0000938) in the Human Phenotype Ontology (HPO) [23], even though they may share an underlying goal from a translational perspective, such as the documentation of a certain gene function. This leads to the need for achieving cross-species interoperability to enable integrated processing. Finally, the phenotype acquisition process poses its own challenges, i.e. technical (process automation or interface usability), social (incentive) and ethical (privacy). In the following, we discuss in depth each of the dimensions introduced above. A summary of all the resources mentioned in this manuscript is provided in Table 1.

State-of-the-art phenome research

Representation

A phenotype can be any observation of a normal or abnormal state of an anatomical, physiological or biochemical property of an organism. While phenotypes in a biological domain are recorded as results from biological experiments, phenotypes in a clinical domain are used to report the assessment of patients. Phenotypes span from the molecular level to the organism level [42]. To enable the other three dimensions to reach their full potential, a representation is needed that is well understood by humans and, at the same time, computer-readable and hence amenable to computational analyses. Such a representation does not only have to cover normal and abnormal phenotypes within a species, but also has to facilitate the bridging across species and enable integration of heterogeneous data at different levels of granularity.

From a computational perspective, phenotypes take diverse representations: (i) free-text descriptions, e.g. as part of the OMIM disease presentations; (ii) vocabularies, e.g. the clinical synopsis in OMIM or the London Dysmorphology Database terminology [25]; and (iii) ontologies, i.e. vocabularies augmented with domain-specific relationships, e.g. HPO or MP. While free-text descriptions support a better human understanding, they limit the possibilities for automated data analysis [43]. With the abundance and ever-increasing amount of data, the ultimate goal is to build a uniform and consistent computer-readable representation, one that enables a seamless collection and integration of phenotypes recorded in biological studies as well as in a clinical environment. Ideally, this uniform global representation would also account for both qualified and quantified data and enable flexible conversion where possible. In cases where conversion is not possible, this universal representation would have to be extended with mappings.

Ontologies to represent phenotypes

Currently, the field consists of a varied set of vocabularies and ontologies that support, in various forms, the abovementioned goal. In particular, driven by the wide adoption from the biomedical community, ontologies have become the de facto standard for representing phenotypes. To achieve the goal to its full extent, the community has followed two complementary approaches for modeling and integrating phenotype data: a pre-composed and a post-composed representation (see Figure 2). The pre-composed approach treats each phenotype as an atomic entity, using individual expressions most suitable to general human understanding. For example, an ontology adopting this representation consists of concept definitions like ‘erythrocytopenia’ or ‘deficiency of red blood cells’, ‘deficiency of erythrocytes’. These concepts are easily understood by humans and also facilitate computational analysis.

Figure 1. The four dimensions of the phenotype development phases. Representation: Subject to the underlying domain and goal, phenotypes may be represented at different granularity levels. Interoperability: Existing ontologies and vocabularies externalize domain-specific phenotype knowledge at different levels of granularity. Acquisition: Capturing and documenting phenotypes in any representational format can be achieved manually (via curation) or automatically (via text mining). Processing: Representing and capturing phenotypes in a structured manner (a form that also enables interoperability) has led to their application in a large variety of domains. The arrows denote direct points of connection between the several phenotyping dimensions. Note that this figure only serves as illustration of the interplay of the four dimensions and, thus, is not aimed at comprehensiveness (e.g. interoperability could also be achieved with a mapping instead of EQ statements).

The post-composed representation uses elementary phenotypic units from existing ontologies to compose specific complex phenotypes. One mechanism to post-compose phenotypes is the entity-quality (EQ) statement approach [24, 44]. For example, instead of defining ‘erythrocytopenia’ as an atomic concept, this approach represents the meaning of the phenotype by linking the quality ‘deficiency’ with the anatomic entity ‘red blood cells’. This link is then captured via a logical axiom using concepts introduced by existing ontologies such as the Gene Ontology (GO) [26] and the Phenotypic quality and Trait Ontology (PATO) [24]. The caveats of the post-composition result from the development overheads in building post-composed statements. Additionally, a number of pre-composed phenotype ontologies still need to be transformed into a post-composed representation.
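
As a minimal illustration of the post-composed idea, the EQ decomposition of ‘erythrocytopenia’ described above can be sketched as a pair of ontology terms. This is a sketch under stated assumptions, not the formalism used by any of the cited ontologies: the class names, the lookup table and the identifiers ENTITY:0001 and QUALITY:0001 are hypothetical placeholders and do not correspond to real GO or PATO identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Term:
    """An ontology concept: an identifier plus a human-readable label."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """A post-composed phenotype: a quality that applies to an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render the pair in the 'quality of entity' reading used in the text.
        return f"{self.quality.label} of {self.entity.label}"


# Hypothetical placeholder identifiers (not real ontology IDs).
RED_BLOOD_CELLS = Term("ENTITY:0001", "red blood cells")
DEFICIENCY = Term("QUALITY:0001", "deficiency")

# Pre-composed label -> post-composed EQ statement.
PRE_TO_POST: Dict[str, EQStatement] = {
    "erythrocytopenia": EQStatement(entity=RED_BLOOD_CELLS, quality=DEFICIENCY),
}


def decompose(pre_composed_label: str) -> Optional[EQStatement]:
    """Return the EQ statement for a pre-composed label, or None if unmapped."""
    return PRE_TO_POST.get(pre_composed_label.lower())
```

Under these assumptions, decompose("erythrocytopenia").describe() yields the string ‘deficiency of red blood cells’, while a label without an entry returns None, which mirrors the remark that many pre-composed terms still lack a post-composed form.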

‘Normal’ and ‘abnormal’ phenotypes

Some of the existing phenotype representations focus on deviations of phenotypes (i.e. their status or quality from a reference phenotype) [27]. The reference phenotype, in the case of model organisms, could be either the wild type of the organism or a specific strain from which the mutation has been generated. In the best case, the phenotype representations form the core that enables interoperability of different data repositories, possibly covering different organisms and being collected with different aims in mind. The quality of the phenotypic resource, i.e. the consistency of the phenotypic definitions, the overall structure of the phenotype semantic resource and, in particular, the completeness of the electronic resource, holds the key to enabling efficient data analysis, interpretation and decision support.

To extend beyond representing ‘abnormal’, Shimoyama et al. have extended the representation of phenotypes to also incorporate environmental factors as well as methods used to measure the phenotype [45]. The developed ontologies have been used to annotate both rat (http://rgd.mcw.edu/wg/physiology) and human data (http://cover.wustl.edu/Cover). Furthermore, the suggested framework allows for the annotation of quantified phenotypes (e.g. ‘blood sugar levels > 8.5 mmol/L’) instead of ‘abnormal increased blood sugar levels’. While this way of representation could be used to represent the reference phenotype, a data curator is needed to provide this information. Conversions from the different states of ‘normal’, e.g. similar tests in different species, are not automatically available.

Table 1. Summary of all resources mentioned throughout the manuscript, together with their URL and reference (where applicable)

Resource | Link | Reference
Online Mendelian Inheritance in Man database | http://omim.org | [1]
Mouse Genome Database | http://informatics.jax.org | [2]
FlyBase | http://flybase.org | [3]
Zebrafish Model Organism Database | http://zfin.org | [4]
Mammalian Phenotype Ontology | http://www.berkeleybop.org/ontologies/mp | [20]
Human Phenotype Ontology | http://purl.obolibrary.org/obo/hp.obo | [21]
London Dysmorphology Database | http://www.lmdatabases.com | [23]
Gene Ontology | http://geneontology.org | [24]
Phenotypic quality and Trait Ontology | http://purl.obolibrary.org/obo/pato | [25]
OrphaNet | http://www.orpha.net/consor/cgi-bin/index.php | [26]
PharmGKB | https://www.pharmgkb.org | [27]
Zebrafish Anatomical Ontology | http://purl.obolibrary.org/obo/zfa | [28]
International Mouse Phenotyping Consortium | http://www.mousephenotype.org | [29, 30]
IMPReSS | https://www.mousephenotype.org/impress |
Phenote | http://www.phenote.org |
PhenoTips | https://phenotips.org | [31]
MetaMap | http://www.nlm.nih.gov/research/umls/implementation_resources/metamap.html | [32]
NCBO Annotator | https://bioportal.bioontology.org/annotator |
cTakes | http://ctakes.apache.org | [33]
ShARe/CLEF 2013 | https://sites.google.com/site/shareclefehealth/data | [34]
DeepPhe | http://cancer.healthnlp.org |
Bio-LarK | http://bio-lark.org | [35]
PhenoMiner | https://sites.google.com/site/nhcollier/projects/PhenoMiner | [36]
Unified Medical Language System | http://www.nlm.nih.gov/research/umls | [37]
Unified Medical Language System Metathesaurus tool | http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html | [38]
UberPheno | http://purl.obolibrary.org/obo/hp/uberpheno | [39]
SNOMED CT | http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html | [40]
AgreementMaker | http://agreementmaker.org | [41]
Zooma | http://www.ebi.ac.uk/fgpt/zooma |
SIDER | http://sideeffects.embl.de | [14]
AVAToL | http://avatol.org | [17]
PhenoScape | http://phenoscape.org | [18]
ORCID | http://orcid.org |
ResearcherID | http://www.researcherid.com |

Representing qualitative and quantitative phenotypes

Other desiderata for phenotype representations are focused on the exploitation of ontologies for efficient propagation of experimental findings in basic phenomics research into the clinical domain and improved research efficiency in both domains (translational medicine). Consequently, phenotype descriptions have to meet clinical needs and cover those diseases that are most relevant to the clinical context. Experimental phenotypic descriptions are detailed and reflect the experimental setup, whereas clinical descriptions suffer from time constraints and thus tend to lack observational detail. Furthermore, experimental and clinical phenotypic descriptions may be organized at diverse levels of granularity and may be biased toward a specific perspective. For example, experimental findings provide the opportunity to capture and represent quantitative traits (e.g. ‘blood sugar > 8.5 mmol/L’), which may require adaptation into qualitative terms (‘high blood sugar’) for clinical purposes. Similarly, from a diagnosis perspective, one may require a complete and individualized view over the phenotypic profile, which may include degrees of severity [28] and longitudinal phenotypes [29], hence adding to the overall complexity of the representation.
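
The adaptation of a quantitative trait into a qualitative term can be illustrated with a small sketch. It assumes a single fixed cutoff, taken from the blood sugar example in the text; the function name and the constant are hypothetical, and real reference ranges depend on species, assay and clinical context.

```python
from typing import Optional

# Cutoff taken from the example in the text (assumption: one fixed threshold).
HIGH_BLOOD_SUGAR_MMOL_PER_L = 8.5


def to_qualitative(blood_sugar_mmol_per_l: float,
                   threshold: float = HIGH_BLOOD_SUGAR_MMOL_PER_L) -> Optional[str]:
    """Map a measurement to a qualitative label, or None if not above threshold."""
    if blood_sugar_mmol_per_l > threshold:
        return "high blood sugar"
    return None
```

The mapping is lossy by design: the qualitative label no longer carries the measured value, which is one reason why conversion in the opposite direction is not generally possible.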

Representation of phenotypes: summary

The resources for representing phenotypes have reached a point where they are able to provide a solid and rich foundation for building advanced acquisition and processing mechanisms. Open challenges still exist, e.g. modeling degrees of severity, normal states or negation (i.e. explicitly mentioning the absence of an abnormality) or mapping quantitative traits to qualitative concepts, to provide deep knowledge capturing methodologies.

Acquisition

Acquisition involves the collection and storage of phenotype information from various resources (see Figure 3), such as OMIM or OrphaNet, a rare disease database [30]. While some of these resources are mainly built through manual curation, e.g. MGD, others rely already on (semi-)automated preprocessing to enhance curator throughput. For example, PharmGKB [46] uses an automated classification system to determine relevant publications and extract gene–drug relationships that are then provided to curators for verification [31, 47].

Manual acquisition of phenotypes

Manual acquisition of phenotypes can be done either by curation of the literature or by direct submission from investigators. These two main modes are used to annotate model organism data with phenotype observations and their conceptual descriptions. In the case of MGD, curators provide standard phenotype descriptions from MP, along with supporting evidence. ZFIN similarly provides phenotypic data with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.

Figure 2. To date, phenotypes have mostly been captured and defined using a pre-composed and/or a post-composed representation. A pre-composed representation assumes the definition of a phenotype as a monolithic concept, i.e. a concept that captures the essence of the phenotype semantics. The post-composed representation decomposes the phenotype into an Entity–Quality pair, with its individual components being mapped to appropriate ontological concepts. In this case, the phenotype semantics is denoted by the compositional property of the pair. The transition between pre-composed and post-composed is realized via logical axioms. Both forms of representation have been successfully applied across different species.

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters, but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term ‘abnormal grip strength’ (MP:0001515).

In addition to Phenote mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians in recording phenotypic profiles for patients with rare genetic disorders using HPO and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state-of-the-art in text processing, the data generated with an automated method may contain some incorrect data.

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional, fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free-text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and the Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is, in principle, the only viable solution, it possesses its own plethora of technical challenges. These include, among others: (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical coarse face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts and how to provide sufficient evidence for human curators.
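
To make the notions of boundary detection and negation more concrete, the following sketch annotates text by exact dictionary lookup and flags a mention as negated when a cue word precedes it in the same sentence. It is a deliberately naive illustration and not the method of MetaMap, the NCBO Annotator, cTAKES, Bio-LarK or PhenoMiner. The lexicon reuses the three term identifiers quoted in this review; the cue list, the function name and the sentence heuristic are assumptions.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # the matched span as written in the input
    concept_id: str   # ontology identifier from the lexicon
    negated: bool     # True if a negation cue precedes it in the sentence


# Labels and identifiers as quoted in this review.
LEXICON = {
    "osteopenia": "HP:0000938",
    "decreased bone mineral density": "MP:0000063",
    "abnormal grip strength": "MP:0001515",
}

# Illustrative cue list (assumption); real systems use far richer rules.
NEGATION_CUES = ("no", "without", "absence of")


def annotate(text: str) -> List[Mention]:
    """Find lexicon labels in text and mark mentions preceded by a negation cue."""
    lowered = text.lower()
    mentions = []
    for label, concept_id in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(label) + r"\b", lowered):
            # Crude sentence boundary: text after the previous full stop.
            sentence_start = lowered.rfind(".", 0, match.start()) + 1
            prefix = lowered[sentence_start:match.start()]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", prefix)
                for cue in NEGATION_CUES
            )
            mentions.append(
                Mention(match.start(), match.end(),
                        text[match.start():match.end()], concept_id, negated)
            )
    return sorted(mentions, key=lambda m: m.start)
```

Exact lookup of this kind misses the cases listed above: a disjoint mention such as ‘blood pressure was observed to be elevated’ or a coordinated term such as ‘slow healing and excessive scarring’ would not match any single lexicon label.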

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
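
The two quantities in this trade-off have standard definitions: precision is the fraction of returned annotations that are correct, and recall is the fraction of correct annotations that are returned. A minimal sketch, assuming that annotations are compared as sets of identifiers against a curated gold standard (the function name is hypothetical):

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of predicted annotations against a gold set."""
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For instance, a method that returns four annotations of which two appear in a gold standard of three has a precision of 0.5 and a recall of two thirds.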

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge, to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species, and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see the Representation section), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Kohler et al. made the UberPheno ontology publicly available [64].

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are in progress to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
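
How a purely lexical alignment produces 1:n mappings, and where it fails, can be shown with a small sketch based on token overlap (Jaccard similarity) between term labels. This is not the algorithm of AgreementMaker or Zooma; the similarity measure, the threshold of 0.5 and the function names are assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, List


def tokens(label: str) -> FrozenSet[str]:
    """Lower-case a label and split it into a set of whitespace-separated tokens."""
    return frozenset(label.lower().split())


def jaccard(label_a: str, label_b: str) -> float:
    """Token-set overlap between two labels, in the range 0.0 to 1.0."""
    tokens_a, tokens_b = tokens(label_a), tokens(label_b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def align(source: Dict[str, str], target: Dict[str, str],
          threshold: float = 0.5) -> Dict[str, List[str]]:
    """Map each source term ID to all target term IDs with similar labels (1:n)."""
    return {
        source_id: [
            target_id
            for target_id, target_label in target.items()
            if jaccard(source_label, target_label) >= threshold
        ]
        for source_id, source_label in source.items()
    }
```

With the source term ‘decreased bone mineral density’ (MP:0000063) and two hypothetical target terms labelled ‘decreased bone density’ and ‘bone mineral density decreased’, both targets pass the threshold and the source term receives a 1:n mapping. The target label ‘Osteopenia’ (HP:0000938), which this review pairs with the same MP term, shares no token with it and is missed, a simple instance of the jargon bottleneck.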

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning across the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for a unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that needs to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that, in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74] or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has been only recently discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as the ability of representing states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used to investigate the causal relationship between dosage and effect by Eriksson et al. [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].
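As a rough illustration of similarity-based mapping between drug effect profiles and model organism phenotypes, the sketch below ranks genes by the overlap of their knockout phenotypes with a drug's effects. It assumes plain Jaccard overlap of label sets as a stand-in for the ontology-based semantic similarity used in the cited work [13]; the gene names and phenotype labels are invented for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_targets(drug_effects: set, model_phenotypes: dict) -> list:
    """Rank genes by how closely their model phenotypes overlap a drug's effects."""
    scores = {
        gene: jaccard(drug_effects, phenotypes)
        for gene, phenotypes in model_phenotypes.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical drug effect profile and mouse model annotations.
drug_effects = {"hypotension", "bradycardia", "fatigue"}
model_phenotypes = {
    "GeneA": {"hypotension", "bradycardia"},
    "GeneB": {"fatigue", "alopecia", "tremor"},
    "GeneC": {"tail kink"},
}
ranking = rank_targets(drug_effects, model_phenotypes)
```

With these inputs, `GeneA` ranks first because two of the three drug effects also appear among its model phenotypes. A realistic pipeline would compare ontology terms rather than raw strings, so that related but non-identical phenotypes still contribute to the score.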

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Their ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety, and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignments, either owing to their foundational representation or owing to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems, such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com), can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. The recording of more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address so far missing aspects, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Other aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example, how phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example, ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
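The spam-filter analogy can be sketched as a scoring function that runs over incoming records and raises an alert above a confidence threshold. The sketch below assumes a hand-set logistic model; the feature names, weights, bias and threshold are hypothetical illustration values, not a validated clinical model.

```python
import math

# Hypothetical weights for a 'possible pneumonia' phenotype model.
WEIGHTS = {"fever": 1.5, "cough": 1.0, "infiltrate_on_xray": 2.5}
BIAS = -3.0


def confidence(record: dict) -> float:
    """Logistic score in (0, 1) computed from binary findings in a record."""
    z = BIAS + sum(weight for name, weight in WEIGHTS.items() if record.get(name))
    return 1.0 / (1.0 + math.exp(-z))


def screen(records: list, threshold: float = 0.8) -> list:
    """Return the ids of records whose score reaches the alert threshold."""
    return [record["id"] for record in records if confidence(record) >= threshold]


# Hypothetical incoming patient records.
records = [
    {"id": "p1", "fever": True, "cough": True, "infiltrate_on_xray": True},
    {"id": "p2", "cough": True},
]
alerts = screen(records)
```

Here only `p1` triggers an alert. In practice the weights would be learned from labeled records, and the threshold would be chosen to balance missed cases against alert fatigue.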

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools is required to unlock the full potential of phenotype data.

Given the achievements in the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and contribute to evolving and shaping the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data have become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O.; R01 LM011369 and R01 GM101430 and U54 HG004028 to N.H.S.; T15 LM00707 to M.R.B.; R01-LM008111 to K.L.; R01 GM102282 and R01 LM011369 to H.L.; U24 CA143840 to A.L.; U54 HG006370 to A.M.; U54 HG008033-01 to M.D.]; the Wellcome Trust [098051 to A.O.]; a Marie Curie experience researcher fellowship [301806 to N.C.]; the National Science Foundation [1207592 to I.G.; DBI-1062404 and DBI-1062542 and EF-0905606 to P.M.]; the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.]; the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to P.N.R.]; the Systems Microscopy NoE project [grant agreement 258068 to G.R.]; and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.
2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.
3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.
4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.
5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.
6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.
7. Smedley D, Oellrich A, Kohler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.
8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.
9. Zhou X, Menche J, Barabasi AL, et al. Human symptoms–disease network. Nat Commun 2014;5.
10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.
11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.
12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.
13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.
14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.
15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.
16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.
17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.
18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.
19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.
20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: the new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.
21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.
22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.
23. Kohler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.
24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.
25. Winter RM, Baraitser M, Douglas JM. A computerised data base for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.
26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.
27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.
28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.
29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.
30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.
31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.
32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.
33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.
34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.
35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.


36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.
37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.
38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.
39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.
40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.
41. Schindelman G, et al. Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.
42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.
43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.
44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.
45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.
46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.
47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.
48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.
49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.
50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.
51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.
52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.
53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.
54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.
55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.
56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.
57. Suominen H, Salantera S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.
58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.
59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.
60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.
61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.
62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.
63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.
64. Kohler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.
65. Sarasua C, Simperl E, Noy NF. Crowdmap: crowdsourcing ontology alignment with microtasks. In: The Semantic Web – ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.
66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.
67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.
68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.
69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.
70. Leonelli S, Ankeny RA. Re-thinking organisms: the impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.
71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.
72. Kohler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.
73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.
74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.
75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.
76. Robinson PN, Kohler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.
77. Zemojtel T, Kohler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.
78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.


79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.
80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.
81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.
82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.
83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.
84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.
85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.


genes [10–12], repurposing drugs [13, 14], pharmacogenomics [15–17] and pharmacovigilance [18], as well as solving evolutionary questions

with domain-specific relationships, e.g. HPO or MP. While free text descriptions support a better human understanding, they limit the possibilities for automated data analysis [43]. With the abundance and ever-increasing amount of data, the ultimate goal is to build a uniform and consistent computer-readable representation, one that enables a seamless collection and integration of phenotypes recorded in biological studies as well as in a clinical environment. Ideally, this uniform global representation would also account for both qualified and quantified data and enable flexible conversion where possible. In cases where conversion is not possible, this universal representation would have to be extended with mappings.

Ontologies to represent phenotypes

Currently, the field consists of a varied set of vocabularies and ontologies that support, in various forms, the abovementioned goal. In particular, driven by the wide adoption from the biomedical community, ontologies have become the de facto standard for representing phenotypes. To achieve the goal to its full extent, the community has followed two complementary approaches for modeling and integrating phenotype data: a pre-composed and a post-composed representation (see Figure 2). The pre-composed approach treats each phenotype as an atomic entity, using individual expressions most suitable to general human understanding. For example, an ontology adopting this representation consists of concept definitions like ‘erythrocytopenia’ or ‘deficiency of red blood cells’, ‘deficiency of erythrocytes’. These concepts are easily understood by humans and also facilitate computational analysis.

Figure 1. The four dimensions of the phenotype development phases. Representation: Subject to the underlying domain and goal, phenotypes may be represented at different granularity levels. Interoperability: Existing ontologies and vocabularies externalize domain-specific phenotype knowledge at different levels of granularity. Acquisition: Capturing and documenting phenotypes in any representational format can be achieved manually (via curation) or automatically (via text mining). Processing: Representing and capturing phenotypes in a structured manner (a form that also enables interoperability) has led to their application in a large variety of domains. The arrows denote direct points of connection between the several phenotyping dimensions. Note that this figure only serves as illustration of the interplay of the four dimensions and thus is not aimed at comprehensiveness (e.g. interoperability could also be achieved with a mapping instead of EQ statements).

The post-composed representation uses elementary phenotypic units from existing ontologies to compose specific complex phenotypes. One mechanism to post-compose phenotypes is the entity-quality (EQ) statement approach [24, 44]. For example, instead of defining ‘erythrocytopenia’ as an atomic concept, this approach represents the meaning of the phenotype by linking the quality ‘deficiency’ with the anatomic entity ‘red blood cells’. This link is then captured via a logical axiom using concepts introduced by existing ontologies, such as the Gene Ontology (GO) [26] and the Phenotypic quality and Trait Ontology (PATO) [24]. The caveats of post-composition result from the development overheads in building post-composed statements. Additionally, a number of pre-composed phenotype ontologies still need to be transformed into a post-composed representation.
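The difference between an atomic label and a post-composed EQ statement can be sketched with a small data structure. The sketch below is only an illustration of the composition idea, not the logical axiom format used by the ontologies; the class names are assumptions and the term identifiers are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A term borrowed from an existing ontology (identifier plus label)."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """A post-composed phenotype: a quality applied to an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        return f"{self.quality.label} of {self.entity.label}"


# Placeholder identifiers, invented for this sketch.
red_blood_cells = Term("EX:0000001", "red blood cells")
deficiency = Term("EX:0000002", "deficiency")

erythrocytopenia = EQStatement(entity=red_blood_cells, quality=deficiency)
same_phenotype = EQStatement(entity=red_blood_cells, quality=deficiency)
```

Because the statement is built from shared entity and quality terms, two resources that compose the same pair arrive at equal statements even if their pre-composed labels differ, which is what makes the approach attractive for cross-species integration.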

‘Normal’ and ‘abnormal’ phenotypes

Some of the existing phenotype representations focus on deviations of phenotypes (i.e. their status or quality from a reference phenotype) [27]. The reference phenotype in the case of model organisms could be either the wild type of the organism or a specific strain from which the mutation has been generated. In the best case, the phenotype representations form the core that enables interoperability of different data repositories, possibly covering different organisms and being collected with different aims in mind. The quality of the phenotypic resource, i.e. the consistency of the phenotypic definitions, the overall structure of the phenotype semantic resource and, in particular, the completeness of the electronic resource, holds the key to enabling efficient data analysis, interpretation and decision support.

To go beyond representing ‘abnormal’ phenotypes, Shimoyama et al. have extended the representation of phenotypes to also incorporate environmental factors as well as methods used to measure the phenotype [45]. The developed ontologies have been used to annotate both rat (http://rgd.mcw.edu/wg/physiology) and human data (http://cover.wustl.edu/Cover). Furthermore, the suggested framework allows for the annotation of quantified phenotypes (e.g. ‘blood sugar levels > 8.5 mmol/L’) instead of ‘abnormal increased blood sugar levels’. While this way of representation could be used to represent the reference phenotype, a data curator is needed to provide this information.

Table 1. Summary of all resources mentioned throughout the manuscript, together with their URL and reference (where applicable)

Resource | Link | Reference
Online Mendelian Inheritance in Man database | http://omim.org | [1]
Mouse Genome Database | http://informatics.jax.org | [2]
FlyBase | http://flybase.org | [3]
Zebrafish Model Organism Database | http://zfin.org | [4]
Mammalian Phenotype Ontology | http://www.berkeleybop.org/ontologies/mp | [20]
Human Phenotype Ontology | http://purl.obolibrary.org/obo/hp.obo | [21]
London Dysmorphology Database | http://www.lmdatabases.com | [23]
Gene Ontology | http://geneontology.org | [24]
Phenotypic quality and Trait Ontology | http://purl.obolibrary.org/obo/pato | [25]
OrphaNet | http://www.orpha.net/consor/cgi-bin/index.php | [26]
PharmGKB | https://www.pharmgkb.org | [27]
Zebrafish Anatomical Ontology | http://purl.obolibrary.org/obo/zfa | [28]
International Mouse Phenotyping Consortium | http://www.mousephenotype.org | [29, 30]
IMPReSS | https://www.mousephenotype.org/impress | -
Phenote | http://www.phenote.org | -
PhenoTips | https://phenotips.org | [31]
MetaMap | http://www.nlm.nih.gov/research/umls/implementation_resources/metamap.html | [32]
NCBO Annotator | https://bioportal.bioontology.org/annotator | -
cTAKES | http://ctakes.apache.org | [33]
ShARe/CLEF 2013 | https://sites.google.com/site/shareclefehealth/data | [34]
DeepPhe | http://cancer.healthnlp.org | -
Bio-LarK | http://bio-lark.org | [35]
PhenoMiner | https://sites.google.com/site/nhcollier/projects/PhenoMiner | [36]
Unified Medical Language System | http://www.nlm.nih.gov/research/umls | [37]
Unified Medical Language System Metathesaurus tool | http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html | [38]
UberPheno | http://purl.obolibrary.org/obo/hp/uberpheno | [39]
SNOMED CT | http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html | [40]
AgreementMaker | http://agreementmaker.org | [41]
Zooma | http://www.ebi.ac.uk/fgpt/zooma | -
SIDER | http://sideeffects.embl.de | [14]
AVAToL | http://avatol.org | [17]
PhenoScape | http://phenoscape.org | [18]
ORCID | http://orcid.org | -
ResearcherID | http://www.researcherid.com | -


Conversions from the different states of ‘normal’, e.g. similar tests in different species, are not automatically available.

Representing qualitative and quantitative phenotypes

Other desiderata for phenotype representations are focused on the exploitation of ontologies for efficient propagation of experimental findings in basic phenomics research into the clinical domain and improved research efficiency in both domains (translational medicine). Consequently, phenotype descriptions have to meet clinical needs and cover those diseases that are most relevant to the clinical context. Experimental phenotypic descriptions are detailed and reflect the experimental setup, whereas clinical descriptions suffer from time constraints and thus tend to lack observational detail. Furthermore, experimental and clinical phenotypic descriptions may be organized at diverse levels of granularity and may be biased toward a specific perspective. For example, experimental findings provide the opportunity to capture and represent quantitative traits (e.g. ‘blood sugar > 8.5 mmol/L’), which may require adaptation into qualitative terms (‘high blood sugar’) for clinical purposes. Similarly, from a diagnosis perspective, one may require a complete and individualized view over the phenotypic profile, which may include degrees of severity [28] and longitudinal phenotypes [29], hence adding to the overall complexity of the representation.
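The adaptation of a quantitative trait into a qualitative term can be illustrated with a minimal sketch. The function name, the label for non-elevated values and the threshold of 8.5 mmol/L are assumptions chosen for illustration only. They are not a clinical cut-off and are not prescribed by any of the ontologies discussed here.

```python
def to_qualitative(blood_sugar_mmol_per_l: float, threshold: float = 8.5) -> str:
    """Map a quantitative blood sugar measurement to a qualitative term.

    The default threshold is an illustrative assumption, not a clinical
    recommendation; callers can pass their own cut-off.
    """
    if blood_sugar_mmol_per_l > threshold:
        return "high blood sugar"
    # Hypothetical label for values at or below the threshold.
    return "blood sugar not elevated"


for value in (5.2, 8.5, 9.1):
    print(value, "->", to_qualitative(value))
```

Note that the mapping is lossy: once only the qualitative term is stored, the measured value and the cut-off that produced it can no longer be recovered, which is one reason why both forms may need to be kept.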

Representation of phenotypes: summary

The resources for representing phenotypes have reached a point where they are able to provide a solid and rich foundation for building advanced acquisition and processing mechanisms. Open challenges still exist, e.g. modeling degrees of severity, normal states or negation (i.e. explicitly mentioning the absence of an abnormality), or mapping quantitative traits to qualitative concepts to provide deep knowledge capturing methodologies.

Acquisition

Acquisition involves the collection and storage of phenotype information from various resources (see Figure 3) such as OMIM or OrphaNet, a rare disease database [30]. While some of these resources are mainly built through manual curation, e.g. MGD, others already rely on (semi-)automated preprocessing to enhance curator throughput. For example, PharmGKB [46] uses an automated classification system to determine relevant publications and extract gene–drug relationships that are then provided to curators for verification [31, 47].

Manual acquisition of phenotypes

Manual acquisition of phenotypes can be done either by curation of the literature or by direct submission from investigators. These two main modes are used to annotate model organism data with phenotype observations and their conceptual descriptions. In the case of MGD, curators provide standard phenotype descriptions from MP along with supporting evidence. ZFIN similarly provides phenotypic data with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.

Figure 2. To date, phenotypes have mostly been captured and defined using a pre-composed and/or a post-composed representation. A pre-composed representation assumes the definition of a phenotype as a monolithic concept, i.e. a concept that captures the essence of the phenotype semantics. The post-composed representation decomposes the phenotype into an Entity–Quality pair, with its individual components being mapped to appropriate ontological concepts. In this case, the phenotype semantics is denoted by the compositional property of the pair. The transition between pre-composed and post-composed is realized via logical axioms. Both forms of representation have been successfully applied across different species.

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters, but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term ‘abnormal grip strength’ (MP:0001515).

In addition to Phenote, mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians in recording phenotypic profiles for patients with rare genetic disorders using HP and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state of the art in text processing, the data generated with an automated method may contain some errors.

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is in principle the only viable solution, it comes with its own plethora of technical challenges. These include, among others, (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical course face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
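The two quantities in this trade-off can be computed against a curated gold standard as sketched below. The function name and the example phenotype mentions are illustrative assumptions; they do not come from any of the corpora or tools discussed here.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare automatically extracted phenotypes with a curated gold standard."""
    true_positives = len(extracted & gold)
    # Precision: share of the extracted phenotypes that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of the gold-standard phenotypes that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical curated annotations and hypothetical system output.
gold = {"abnormal grip strength", "high blood sugar", "erythrocytopenia", "slow healing"}
extracted = {"abnormal grip strength", "high blood sugar", "excessive scarring"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three extracted mentions are correct and two of the four curated mentions are found, so a system tuned to return more candidates would tend to raise recall at the cost of precision.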

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see 2.1 representation), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Köhler et al. made the UberPheno ontology publicly available [64].

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are being developed to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
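How a purely lexical matcher ends up with 1:n mappings can be shown with a small sketch. It uses token overlap (Jaccard similarity) as a stand-in technique; it is not the algorithm used by AgreementMaker or Zooma, and the function names, the threshold and the target labels are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokens(label: str) -> frozenset:
    """Lower-case a label and split it into a set of word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", label.lower()))


def align(source_terms, target_terms, min_jaccard=0.5):
    """Return {source term: [matching target terms]} based on token overlap."""
    mapping = defaultdict(list)
    for source in source_terms:
        source_tokens = tokens(source)
        for target in target_terms:
            target_tokens = tokens(target)
            union = source_tokens | target_tokens
            score = len(source_tokens & target_tokens) / len(union) if union else 0.0
            if score >= min_jaccard:
                mapping[source].append(target)
    return dict(mapping)


# Hypothetical labels from two phenotype vocabularies.
source = ["abnormal grip strength"]
target = ["decreased grip strength", "increased grip strength", "abnormal gait"]

alignment = align(source, target)
print(alignment)
```

Here the single source term is aligned with two target concepts because both share two of their three tokens with it, which is exactly the kind of 1:n result that a curator or additional ontological features would then need to resolve.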

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that needs to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that, in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams on their own. Phenotypes have been used as a unique source of data, for example for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74] or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has only recently been discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as of the ability to represent states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps directly onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used to investigate the causal relationship between dosage and effect in [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Its ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignments, owing either to their foundational representation or to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism akin to citation will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. Recording more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address so far missing aspects such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Other aspects of phenotypes that have not yet been integrated into their representation are causality and temporality, for example how phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
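The spam-filter analogy can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the function names, the record layout, the alert threshold and, in particular, the toy scoring rules, which simply count how many listed findings occur in a record and are not clinical criteria or trained classifiers.

```python
from typing import Callable, Dict, List

Record = Dict[str, List[str]]


def screen_record(record: Record,
                  models: Dict[str, Callable[[Record], float]],
                  threshold: float = 0.9) -> List[str]:
    """Return the questions whose model score reaches the alert threshold."""
    alerts = []
    for question, model in models.items():
        if model(record) >= threshold:
            alerts.append(question)
    return alerts


def fraction_present(expected: set) -> Callable[[Record], float]:
    """Build a toy model: the share of expected findings found in the record."""
    def model(record: Record) -> float:
        return len(expected & set(record["findings"])) / len(expected)
    return model


# Each phenotype is a 'model' that answers one question about a record.
models = {
    "does the patient have pneumonia?": fraction_present(
        {"fever", "cough", "lung infiltrate"}),
    "does the patient have sepsis?": fraction_present(
        {"fever", "low blood pressure", "fast heart rate"}),
}

# Hypothetical incoming patient record.
record = {"findings": ["fever", "cough", "lung infiltrate"]}
alerts = screen_record(record, models)
print(alerts)
```

In a real deployment each toy model would be replaced by a validated classifier, but the surrounding loop stays the same: every incoming record is scored against every phenotype model, and only high-confidence matches raise an alert.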

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools is required to unlock the full potential of phenotype data.

Given the achievements of the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and contribute to evolving and shaping the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data have become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O.; R01 LM011369, R01 GM101430 and U54 HG004028 to N.H.S.; T15 LM00707 to M.R.B.; R01-LM008111 to K.L.; R01 GM102282 and R01 LM011369 to H.L.; U24 CA143840 to A.L.; U54HG006370 to A.M.; U54 HG008033-01 to M.D.]; the Wellcome Trust [098051 to A.O.]; a Marie Curie experience researcher fellowship [301806 to N.C.]; the National Science Foundation [1207592 to I.G.; DBI-1062404, DBI-1062542 and EF-0905606 to P.M.]; the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.]; the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to P.N.R.]; the Systems Microscopy NoE project [grant agreement 258068 to G.R.]; and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: the new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised database for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.

36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: crowdsourcing ontology alignment with microtasks. In: The Semantic Web – ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: the impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.

79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.

genes [10–12], repurposing drugs [13, 14], pharmacogenomics [15–17] and pharmacovigilance [18], as well as solving evolutionary questions

adopting this representation consists of concept definitions like ‘erythrocytopenia’ or ‘deficiency of red blood cells’, ‘deficiency of erythrocytes’. These concepts are easily understood by humans and also facilitate computational analysis.

The post-composed representation uses elementary phenotypic units from existing ontologies to compose specific complex phenotypes. One mechanism to post-compose phenotypes is the entity-quality (EQ) statement approach [24, 44]. For example, instead of defining ‘erythrocytopenia’ as an atomic concept, this approach represents the meaning of the phenotype by linking the quality ‘deficiency’ with the anatomic entity ‘red blood cells’. This link is then captured via a logical axiom using concepts introduced by existing ontologies, such as the Gene Ontology (GO) [26] and the Phenotypic quality and Trait Ontology (PATO) [24]. The caveats of post-composition result from the development overheads in building post-composed statements. Additionally, a number of pre-composed phenotype ontologies still need to be transformed into a post-composed representation.
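The post-composed idea can be sketched as a small data structure. The following is a minimal illustration in Python, not the representation used by any particular ontology or tool: the class names `Term` and `EQStatement`, the lookup table `definitions` and the identifiers `EX:0000001` and `EX:0000002` are all hypothetical placeholders. In practice the entity and the quality would be concepts drawn from existing ontologies such as GO or PATO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A reference to a single ontology concept (identifier plus label)."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """A post-composed phenotype: a quality linked to an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render the pair as a human-readable phrase.
        return f"{self.quality.label} of {self.entity.label}"


# Placeholder identifiers for illustration only, not real ontology IDs.
red_blood_cells = Term("EX:0000001", "red blood cells")
deficiency = Term("EX:0000002", "deficiency")

# Hypothetical lookup from a pre-composed label to its post-composed form.
definitions = {"erythrocytopenia": EQStatement(red_blood_cells, deficiency)}

print(definitions["erythrocytopenia"].describe())
```

Keeping the entity and the quality as separate references is what would allow two phenotype vocabularies to be compared through their shared building blocks rather than through their labels.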

‘Normal’ and ‘abnormal’ phenotypes

Some of the existing phenotype representations focus on deviations of phenotypes (i.e. their status or quality from a reference phenotype) [27]. The reference phenotype in the case of model organisms could be either the wild type of the organism or a specific strain from which the mutation has been generated. In the best case, the phenotype representations form the core that enables interoperability of different data repositories, possibly covering different organisms and being collected with different aims in mind. The quality of the phenotypic resource, i.e. the consistency of the phenotypic definitions, the overall structure of the phenotype semantic resource and, in particular, the completeness of the electronic resource, holds the key to enabling efficient data analysis, interpretation and decision support.

To go beyond representing ‘abnormal’ phenotypes, Shimoyama et al. have extended the representation of phenotypes to also incorporate environmental factors as well as methods used to measure the phenotype [45]. The developed ontologies have been used to annotate both rat (http://rgd.mcw.edu/wg/physiology) and human data (http://cover.wustl.edu/Cover). Furthermore, the suggested framework allows for the annotation of quantified phenotypes (e.g. ‘blood sugar levels > 8.5 mmol/L’) instead of ‘abnormal increased blood sugar levels’. While this way of representation could be used to represent the reference phenotype, a data curator is needed to provide this information.

Table 1. Resources mentioned throughout the manuscript, together with their URL and reference (where applicable)

Resource | Link | Reference
Online Mendelian Inheritance in Man database | http://omim.org | [1]
Mouse Genome Database | http://informatics.jax.org | [2]
FlyBase | http://flybase.org | [3]
Zebrafish Model Organism Database | http://zfin.org | [4]
Mammalian Phenotype Ontology | http://www.berkeleybop.org/ontologies/mp | [20]
Human Phenotype Ontology | http://purl.obolibrary.org/obo/hp.obo | [21]
London Dysmorphology Database | http://www.lmdatabases.com | [23]
Gene Ontology | http://geneontology.org | [24]
Phenotypic quality and Trait Ontology | http://purl.obolibrary.org/obo/pato | [25]
OrphaNet | http://www.orpha.net/consor/cgi-bin/index.php | [26]
PharmGKB | https://www.pharmgkb.org | [27]
Zebrafish Anatomical Ontology | http://purl.obolibrary.org/obo/zfa | [28]
International Mouse Phenotyping Consortium | http://www.mousephenotype.org | [29, 30]
IMPReSS | https://www.mousephenotype.org/impress |
Phenote | http://www.phenote.org |
PhenoTips | https://phenotips.org | [31]
MetaMap | http://www.nlm.nih.gov/research/umls/implementation_resources/metamap.html | [32]
NCBO Annotator | https://bioportal.bioontology.org/annotator |
cTakes | http://ctakes.apache.org | [33]
ShARe/CLEF 2013 | https://sites.google.com/site/shareclefehealth/data | [34]
DeepPhe | http://cancer.healthnlp.org |
Bio-LarK | http://bio-lark.org | [35]
PhenoMiner | https://sites.google.com/site/nhcollier/projects/PhenoMiner | [36]
Unified Medical Language System | http://www.nlm.nih.gov/research/umls | [37]
Unified Medical Language System Metathesaurus tool | http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html | [38]
UberPheno | http://purl.obolibrary.org/obo/hp/uberpheno | [39]
SNOMED CT | http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html | [40]
AgreementMaker | http://agreementmaker.org | [41]
Zooma | http://www.ebi.ac.uk/fgpt/zooma |
SIDER | http://sideeffects.embl.de | [14]
AVAToL | http://avatol.org | [17]
PhenoScape | http://phenoscape.org | [18]
ORCID | http://orcid.org |
ResearcherID | http://www.researcherid.com |

Conversions from the different states of ‘normal’, e.g. similar tests in different species, are not automatically available.

Representing qualitative and quantitative phenotypes

Other desiderata for phenotype representations are focused on the exploitation of ontologies for efficient propagation of experimental findings in basic phenomics research into the clinical domain and improved research efficiency in both domains (translational medicine). Consequently, phenotype descriptions have to meet clinical needs and cover those diseases that are most relevant to the clinical context. Experimental phenotypic descriptions are detailed and reflect the experimental setup, whereas clinical descriptions suffer from time constraints and thus tend to lack observational detail. Furthermore, experimental and clinical phenotypic descriptions may be organized at diverse levels of granularity and may be biased toward a specific perspective. For example, experimental findings provide the opportunity to capture and represent quantitative traits (e.g. ‘blood sugar > 8.5 mmol/L’), which may require adaptation into qualitative terms (‘high blood sugar’) for clinical purposes. Similarly, from a diagnosis perspective, one may require a complete and individualized view over the phenotypic profile, which may include degrees of severity [28] and longitudinal phenotypes [29], hence adding to the overall complexity of the representation.
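The adaptation of a quantitative trait into a qualitative term can be illustrated with a short Python sketch. It reads the threshold in the blood sugar example as 8.5 mmol/L, which is an assumption about the example rather than a clinical cut-off, and the function name `to_qualitative` is hypothetical. The sketch returns no label for values at or below the threshold, because how to represent the ‘normal’ state is itself an open question.

```python
from typing import Optional


def to_qualitative(blood_sugar_mmol_per_l: float,
                   threshold: float = 8.5) -> Optional[str]:
    """Map a quantitative blood sugar measurement to a qualitative label.

    The default threshold mirrors the example in the text and is an
    illustrative assumption, not a clinical rule.
    """
    if blood_sugar_mmol_per_l > threshold:
        return "high blood sugar"
    # No qualitative term is asserted for values at or below the threshold.
    return None


for value in (9.2, 5.0):
    print(value, "->", to_qualitative(value))
```

The mapping is lossy in one direction: the qualitative label can be derived from the measurement, but the measurement cannot be recovered from the label.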

Representation of phenotypes: summary

The resources for representing phenotypes have reached a point where they are able to provide a solid and rich foundation for building advanced acquisition and processing mechanisms. Open challenges still exist, e.g. modeling degrees of severity, normal states or negation (i.e. explicitly mentioning the absence of an abnormality), or mapping quantitative traits to qualitative concepts, to provide deep knowledge capturing methodologies.

Acquisition

Acquisition involves the collection and storage of phenotype information from various resources (see Figure 3), such as OMIM or OrphaNet, a rare disease database [30]. While some of these resources are mainly built through manual curation, e.g. MGD, others already rely on (semi-)automated preprocessing to enhance curator throughput. For example, PharmGKB [46] uses an automated classification system to determine relevant publications and extract gene–drug relationships that are then provided to curators for verification [31, 47].

Manual acquisition of phenotypes

Manual acquisition of phenotypes can be done either by curation of the literature or by direct submission from investigators. These two main modes are used to annotate model organism data with phenotype observations and their conceptual descriptions. In the case of MGD, curators provide standard phenotype descriptions from MP along with supporting evidence. ZFIN similarly provides phenotypic data with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.

Figure 2. To date, phenotypes have mostly been captured and defined using a pre-composed and/or a post-composed representation. A pre-composed representation assumes the definition of a phenotype as a monolithic concept, that is, a concept that captures the essence of the phenotype semantics. The post-composed representation decomposes the phenotype into an Entity–Quality pair, with its individual components being mapped to appropriate ontological concepts. In this case, the phenotype semantics is denoted by the compositional property of the pair. The transition between pre-composed and post-composed is realized via logical axioms. Both forms of representation have been successfully applied across different species.

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term ‘abnormal grip strength’ (MP:0001515).

In addition to Phenote, mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians in recording phenotypic profiles for patients with rare genetic disorders using HP and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state of the art in text processing, the data generated with an automated method may contain some errors.

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional, fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and the Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is in principle the only viable solution, it possesses its own plethora of technical challenges. These include, among others: (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical coarse face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
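The two quantities in this trade-off can be made concrete with a small worked example. The sketch below uses the standard definitions: precision is the fraction of returned results that are correct, and recall is the fraction of relevant results that are returned. The gold-standard and extracted phenotype sets are invented for illustration, and the function name `precision_recall` is hypothetical.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of extracted items against a gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example data: what a curator would annotate (gold) versus
# what an automated method returned (extracted).
gold = {"abnormal grip strength", "high blood sugar",
        "slow healing", "excessive scarring"}
extracted = {"abnormal grip strength", "high blood sugar", "typical course"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three returned phenotypes are correct (precision of about 0.67) and two of the four relevant phenotypes are found (recall of 0.50). This matches the two failure modes named above: some incorrect results are returned, and some relevant results are missed.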

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species, and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see 2.1 Representation), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Köhler and co-authors made the UberPheno ontology publicly available [64].

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are being developed to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
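To make the notion of a 1:n mapping concrete, here is a minimal sketch of a purely lexical aligner based on token overlap between term labels. It is not the method used by AgreementMaker or Zooma; the term IDs, labels, threshold and function names are assumptions chosen for illustration only.

```python
import re
from collections import defaultdict


def tokens(label: str) -> frozenset:
    """Lower-case a label and split it into a set of word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", label.lower()))


def align(source: dict, target: dict, threshold: float = 0.5) -> dict:
    """Map each source ID to every target ID whose label overlaps enough.

    Overlap is the Jaccard index of the two token sets, so one source term
    can end up with several target terms (a 1:n mapping).
    """
    mapping = defaultdict(list)
    for s_id, s_label in source.items():
        s_tok = tokens(s_label)
        for t_id, t_label in target.items():
            t_tok = tokens(t_label)
            union = s_tok | t_tok
            score = len(s_tok & t_tok) / len(union) if union else 0.0
            if score >= threshold:
                mapping[s_id].append(t_id)
    return dict(mapping)


# Hypothetical term labels from two vocabularies.
source_terms = {"A:1": "enlarged heart", "A:2": "short fin"}
target_terms = {
    "B:1": "Enlarged heart",
    "B:2": "heart enlarged, severe",
    "B:3": "abnormal gait",
}
alignment = align(source_terms, target_terms)
```

In this toy input, `A:1` is matched to both `B:1` and `B:2`, while `A:2` has no counterpart, which mirrors the two bottlenecks named above: ambiguous matches and phenotypes that exist in only one resource.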

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human, mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that need to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis and (iv) evolutionary research. It is worth noting that, in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74] or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has been only recently discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as of the ability to represent states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used by Eriksson et al. to investigate the causal relationship between dosage and effect [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].
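The similarity-based mapping between drug effect profiles and model organism phenotypes can be sketched as a ranking problem. The example below uses a plain Jaccard index over sets of phenotype labels, which is a deliberate simplification and not the semantic similarity measure used in [13]; the gene names, phenotype labels and function names are invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(drug_profile: set, gene_profiles: dict) -> list:
    """Rank genes by how similar their phenotype profile is to a drug's effects."""
    scored = [
        (gene, jaccard(drug_profile, profile))
        for gene, profile in gene_profiles.items()
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical drug effect profile and mouse model phenotype profiles.
drug_effects = {"weight_loss", "tremor", "insomnia"}
model_phenotypes = {
    "geneA": {"tremor", "insomnia"},
    "geneB": {"hair_loss"},
    "geneC": {"weight_loss", "tremor", "insomnia", "anemia"},
}
ranking = rank_candidates(drug_effects, model_phenotypes)
```

With these toy profiles, `geneC` ranks first and `geneB` last. A set-based measure ignores the ontology hierarchy, so closely related but non-identical phenotype terms contribute nothing to the score.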

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Their ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignment, owing either to their foundational representation or to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. Recording more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent from the granularity of the data. In addition, resources need to be built that address so far missing aspects such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Another aspect that has not yet been integrated into the representation of phenotypes is causality and temporality, for example, how the phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future, to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example, ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters, always running in the background; when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
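The spam-filter analogy can be sketched as a classifier that scores every incoming record and raises an alert only above a confidence threshold. The feature names, weights, bias and threshold below are hypothetical and have no clinical validity; in practice such a model would be learned from data rather than written by hand.

```python
import math

# Hypothetical, hand-picked weights for a 'does the patient have pneumonia?'
# phenotype model. Illustrative only; not clinically validated.
WEIGHTS = {"fever": 1.2, "cough": 0.8, "infiltrate_on_xray": 2.5}
BIAS = -3.0


def confidence(record: dict) -> float:
    """Logistic score in [0, 1] from the binary findings present in a record."""
    z = BIAS + sum(w for feature, w in WEIGHTS.items() if record.get(feature))
    return 1.0 / (1.0 + math.exp(-z))


def screen(records: list, threshold: float = 0.8) -> list:
    """Return the IDs of records whose confidence reaches the alert threshold."""
    return [r["id"] for r in records if confidence(r) >= threshold]


# Two made-up patient records arriving in the background.
incoming = [
    {"id": "p1", "fever": True, "cough": True, "infiltrate_on_xray": True},
    {"id": "p2", "cough": True},
]
alerts = screen(incoming)
```

Only `p1` triggers an alert here. Keeping the threshold as an explicit parameter reflects the trade-off between missed cases and alert fatigue that any such background classifier has to manage.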

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools are required to unlock the full potential of phenotype data.

Given the achievements in the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and contribute to evolve and shape the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data have become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O., R01 LM011369 and R01 GM101430 and U54 HG004028 to N.H.S., T15 LM00707 to M.R.B., R01-LM008111 to K.L., R01 GM102282 and R01 LM011369 to H.L., U24 CA143840 to A.L., U54HG006370 to A.M., U54 HG008033-01 to M.D.], the Wellcome Trust [098051 to A.O.], a Marie Curie experience researcher fellowship [301806 to N.C.], the National Science Foundation [1207592 to I.G., DBI-1062404 and DBI-1062542 and EF-0905606 to P.M.], the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.], the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL to P.N.R.], the Systems Microscopy NoE project [grant agreement 258068 to G.R.] and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Human Mutation 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Research 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Research 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised database for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.


36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Springer Berlin Heidelberg, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.


79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.


genes [10–12], repurposing drugs [13, 14], pharmacogenomics [15–17] and pharmacovigilance [18], as well as solving evolutionary questions

Conversions from the different states of ‘normal’, e.g. similar tests in different species, are not automatically available.

Representing qualitative and quantitative phenotypes

Other desiderata for phenotype representations are focused on the exploitation of ontologies for efficient propagation of experimental findings in basic phenomics research into the clinical domain and improved research efficiency in both domains (translational medicine). Consequently, phenotype descriptions have to meet clinical needs and cover those diseases that are most relevant to the clinical context. Experimental phenotypic descriptions are detailed and reflect the experimental setup, whereas clinical descriptions suffer from time constraints and thus tend to lack observational detail. Furthermore, experimental and clinical phenotypic descriptions may be organized at diverse levels of granularity and may be biased toward a specific perspective. For example, experimental findings provide the opportunity to capture and represent quantitative traits (e.g. ‘blood sugar > 8.5 mmol/L’), which may require adaptation into qualitative terms (‘high blood sugar’) for clinical purposes. Similarly, from a diagnosis perspective, one may require a complete and individualized view over the phenotypic profile, which may include degrees of severity [28] and longitudinal phenotypes [29], hence adding to the overall complexity of the representation.
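The adaptation of a quantitative trait into a qualitative term can be sketched as a comparison against a reference range. The 8.5 mmol/L upper bound is taken from the example in the text and is not meant as a clinical cutoff; the `ReferenceRange` type, the function name and the wording of the returned labels are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class ReferenceRange(NamedTuple):
    """Bounds against which a measurement is judged; either may be absent."""
    trait: str
    lower: Optional[float]
    upper: Optional[float]


def to_qualitative(value: float, ref: ReferenceRange) -> str:
    """Translate a numeric measurement into a coarse qualitative label."""
    if ref.upper is not None and value > ref.upper:
        return f"high {ref.trait}"
    if ref.lower is not None and value < ref.lower:
        return f"low {ref.trait}"
    return f"normal {ref.trait}"


# Upper bound from the text's example; no lower bound is assumed here.
BLOOD_SUGAR = ReferenceRange(trait="blood sugar", lower=None, upper=8.5)
label = to_qualitative(9.2, BLOOD_SUGAR)
```

The mapping is lossy by design: the numeric value cannot be recovered from the label, and what counts as ‘normal’ depends on the reference range chosen for a given species and test.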

Representation of phenotypes summary

The resources for representing phenotypes have reached a point where they are able to provide a solid and rich foundation for building advanced acquisition and processing mechanisms. Open challenges still exist, e.g. modeling degrees of severity, normal states or negation (i.e. explicitly mentioning the absence of an abnormality), or mapping quantitative traits to qualitative concepts, to provide deep knowledge capturing methodologies.

Acquisition

Acquisition involves the collection and storage of phenotype information from various resources (see Figure 3), such as OMIM or OrphaNet, a rare disease database [30]. While some of these resources are mainly built through manual curation, e.g. MGD, others already rely on (semi-)automated preprocessing to enhance curator throughput. For example, PharmGKB [46] uses an automated classification system to determine relevant publications and extract gene–drug relationships that are then provided to curators for verification [31, 47].

Manual acquisition of phenotypes

Manual acquisition of phenotypes can be done either by curation of the literature or by direct submission from investigators. These two main modes are used to annotate model organism data with phenotype observations and their conceptual descriptions. In the case of MGD, curators provide standard phenotype descriptions from MP along with supporting evidence. ZFIN similarly provides phenotypic data, with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.

Figure 2. To date, phenotypes have mostly been captured and defined using a pre-composed and/or a post-composed representation. A pre-composed representation assumes the definition of a phenotype as a monolithic concept, i.e. a concept that captures the essence of the phenotype semantics. The post-composed representation decomposes the phenotype into an Entity–Quality pair, with its individual components being mapped to appropriate ontological concepts. In this case, the phenotype semantics is denoted by the compositional property of the pair. The transition between pre-composed and post-composed is realized via logical axioms. Both forms of representation have been successfully applied across different species.
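The EQ format and the distinction between pre-composed and post-composed representations can be illustrated with a small data structure. The sketch below is an assumption-laden toy: the class names, the `PRECOMPOSED` lookup and all `EX:` identifiers are placeholders invented for illustration and do not correspond to real PATO, anatomy or phenotype ontology terms. In the ontologies themselves the link between the two forms is expressed through logical axioms, not a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Term:
    """An ontology term; the identifiers used below are placeholders."""
    term_id: str
    label: str


@dataclass(frozen=True)
class EQStatement:
    """Post-composed phenotype: an affected Entity plus a Quality."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        return f"{self.entity.label}: {self.quality.label}"


# Hypothetical stand-in for the logical axioms that relate an EQ pair to
# a single pre-composed phenotype concept.
PRECOMPOSED: Dict[EQStatement, Term] = {}


def to_precomposed(eq: EQStatement) -> Optional[Term]:
    """Return the pre-composed term for an EQ pair, if one is registered."""
    return PRECOMPOSED.get(eq)


# Placeholder terms (not real ontology identifiers).
fin = Term("EX:0000001", "fin")
small = Term("EX:0000002", "decreased size")
small_fin = Term("EX:0000003", "small fin")

eq = EQStatement(entity=fin, quality=small)
PRECOMPOSED[eq] = small_fin

print(eq.describe())
print(to_precomposed(eq))
```

Because the entity and the quality are stored separately, the same quality term can be combined with anatomy terms from different species, which is what makes the post-composed form attractive for cross-species integration.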

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters, but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term ‘abnormal grip strength’ (MP:0001515).

In addition to Phenote, mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians in recording phenotypic profiles for patients with rare genetic disorders using HP and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state-of-the-art in text processing, the output of an automated method may contain some incorrect data.

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional, fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is in principle the only viable solution, it poses its own set of technical challenges. These include, among others: (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and recently ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. ‘typical course face’, ‘heterogeneous ECG abnormalities’), disjoint phenotype mentions (e.g. ‘blood pressure was observed to be elevated’) and coordinated terms (e.g. ‘slow healing and excessive scarring’). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts, and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
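The trade-off can be made concrete by scoring an extraction run against a curated gold standard. The following is a minimal sketch using the standard definitions of precision (the fraction of returned items that are correct) and recall (the fraction of relevant items that were returned); the function name and the two example sets are made up for illustration.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Score automatically extracted phenotype mentions against a gold standard."""
    true_positives = len(extracted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical curator annotations and system output.
gold = {"elevated blood pressure", "slow healing", "excessive scarring", "high blood sugar"}
extracted = {"elevated blood pressure", "slow healing", "typical course"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the system returns one incorrect mention and misses two relevant ones, so neither measure reaches 1.0. Tuning an extractor to return fewer, more certain mentions tends to raise precision at the cost of recall, and vice versa.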

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge, to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species, and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see the Representation section), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Köhler et al. made the UberPheno ontology publicly available [64].

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are in progress to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
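How a purely lexical matcher ends up with 1:n mappings can be seen in a small sketch. It uses string similarity from Python's standard `difflib` module and is not the algorithm of AgreementMaker or Zooma; the `normalise` and `align` functions, the 0.8 threshold and the example labels are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalise(label: str) -> str:
    """Lower-case a label, treat hyphens as spaces and collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def align(source: List[str], target: List[str], threshold: float = 0.8) -> Dict[str, List[str]]:
    """Map each source label to every target label that is lexically similar.

    A source label may receive zero, one or several matches, so the result
    is in general a 1:n mapping that a curator would still need to review.
    """
    mapping: Dict[str, List[str]] = {}
    for s in source:
        mapping[s] = [
            t for t in target
            if SequenceMatcher(None, normalise(s), normalise(t)).ratio() >= threshold
        ]
    return mapping


# Hypothetical labels from two resources.
source_labels = ["abnormal grip strength", "increased heart rate"]
target_labels = ["Abnormal grip-strength", "abnormal grip strengths", "short stature"]

for label, matches in align(source_labels, target_labels).items():
    print(label, "->", matches)
```

The first source label matches two near-identical target labels, while the second matches none. Species-specific jargon produces exactly the second situation: two labels that denote comparable phenotypes but share little surface text will not be aligned by lexical similarity alone.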

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human, mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that need to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that, in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74], or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has been only recently discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as the ability of representing states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps directly onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14], and have been used to investigate the causal relationship between dosage and effect by Eriksson et al. [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Their ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from the missing or incorrect cross-species phenotypes alignment, either owing to their foundational representation or owing to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. Recording more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address so far missing aspects, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Other aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example, how phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example, ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
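The idea of a phenotype as a classifier that screens incoming records can be sketched as follows. Everything in this example is hypothetical: the `PhenotypeModel` class, the `fraction_present` scoring rule, the finding names and the threshold are toy assumptions that only show the control flow of background screening. They are not clinical criteria, and a real model would be learned from and validated on patient data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, bool]  # finding name -> whether it was observed


@dataclass
class PhenotypeModel:
    """A phenotype expressed as a question plus a scoring function."""
    question: str
    score: Callable[[Record], float]  # confidence in [0, 1]
    threshold: float = 0.9


def fraction_present(findings: List[str]) -> Callable[[Record], float]:
    """Toy scorer: the share of listed findings present in the record."""
    def score(record: Record) -> float:
        present = sum(1 for f in findings if record.get(f, False))
        return present / len(findings)
    return score


def screen(record: Record, models: List[PhenotypeModel]) -> List[str]:
    """Return the questions for which the record is a high-confidence match."""
    return [m.question for m in models if m.score(record) >= m.threshold]


# Illustrative model; the findings are placeholders, not diagnostic criteria.
pneumonia = PhenotypeModel(
    question="does the patient have pneumonia?",
    score=fraction_present(["fever", "cough", "chest infiltrate"]),
)

incoming = {"fever": True, "cough": True, "chest infiltrate": True}
print(screen(incoming, [pneumonia]))
```

Several such models can be passed to `screen` at once, which mirrors the spam-filter analogy: each incoming record is checked against every model, and only high-confidence matches raise an alert.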

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools are required to unlock the full potential of phenotype data.

Given the achievements in the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and contribute to evolving and shaping the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data have become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to AO; R01 LM011369 and R01 GM101430 and U54 HG004028 to NHS; T15 LM00707 to MRB; R01-LM008111 to KL; R01 GM102282 and R01 LM011369 to HL; U24 CA143840 to AL; U54HG006370 to AM; U54 HG008033-01 to MD]; the Wellcome Trust [098051 to AO]; a Marie Curie experience researcher fellowship [301806 to NC]; the National Science Foundation [1207592 to IG; DBI-1062404 and DBI-1062542 and EF-0905606 to PM]; the Bundesministerium für Bildung und Forschung [0313911 to PNR]; the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to PNR]; the Systems Microscopy NoE project [grant agreement 258068 to GR]; and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to KL]. TG was supported by the Kinghorn Foundation. OB was supported by the Intramural Research Program of the NIH, National Library of Medicine. MS was supported by the Medical Research Council. LW was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. RW was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References1 Amberger J Bocchini C Hamosh A A new face and new chal-

lenges for Online Mendelian Inheritance in Man (OMIM)Human Mutation 201132564ndash7

2 Blake JA Bult CJ Kadin JA et al The Mouse Genome Database(MGD) premier model organism resource for mammaliangenomics and genetics Nucleic Acids Res 201139D842ndash8

3 Tweedie S Ashburner M Falls K et al FlyBase enhancingDrosophila Gene Ontology annotations Nucleic Acids Research2009 37D555ndash9

4 Howe DG Bradford YM Conlin T et al ZFIN the ZebrafishModel Organism Database increased support for mutantsand transgenics Nucleic Acids Research 2013 41D854ndash60

5 Goh KI Cusick ME Valle D et al The human disease networkProc Natl Acad Sci USA 20071048685ndash90

6 Hoehndorf R Schofield PN Gkoutos GV PhenomeNET awhole-phenome approach to disease gene discovery NucleicAcids Res 201139e119

7 Smedley D Oellrich A Kohler S et al PhenoDigm analyzingcurated annotations to associate animal models with humandiseases Database 2013bat025

8 Washington NL Haendel MA Mungall CJ et al Linkinghuman diseases to animal models using ontology-basedphenotype annotation PLoS Biol 20097e1000247

9 Zhou X Menche J Barabasi AL et al Human symptomsndashdisease network Nat Commun 20145

10Van Driel MA Bruggeman J Vriend G et al A text-mining ana-lysis of the human phenome Eur J Hum Genet 200614535ndash42

11Groth P Pavlova N Kalev I et al PhenomicDB a new cross-species genotypephenotype resource Nucleic Acids Res200735D696ndash9

12Korbel JO Doerks T Jensen LJ et al Systematic association ofgenes to phenotypes by genome and literature mining PLoSBiol 20053e134

13Hoehndorf R Hiebert T Hardy NW et al Mouse model pheno-types provide information about human drug targetsBioinformatics 201330719ndash25

14Kuhn M Campillos M Letunic I et al A side effect resource tocapture phenotypic effects of drugs Mol Syst Biol 20106343

15Eriksson R Werge T Jensen LJ et al Dose-specific adversedrug reaction identification in electronic patient recordstemporal data mining in an inpatient psychiatric populationDrug Safety 201437237ndash47

16Lasko TA Denny JC Levy MA Computational phenotype dis-covery using unsupervised feature learning over noisy sparseand irregular clinical data eng PLoS One 20138e66341

17Schulam P Wigley F Saria S Clustering longitudinal clinicalmarker trajectories from electronic health data applicationsto phenotyping and endotype discovery In Proceedings ofthe Twenty-Ninth AAAI Conference on Artificial IntelligenceAustin Texas USA 2015

18LePendu P Iyer SV Bauer-Mehren A et al Pharmacovigilanceusing clinical notes Clin Pharmacol Ther 201393547ndash55

19Burleigh G Alphonse K Alverson AJ et al Next-generationphenomics for the tree of life PLoS Curr 20135

20Mabee P Balhoff JP Dahdul WM et al 500 000 fish pheno-types The new informatics landscape for evolutionary anddevelopmental biology of the vertebrate skeleton J ApplIchthyol 201228300ndash5

21Collier N Oellrich A Groza T Toward knowledge support foranalysis and interpretation of complex traits Genome Biol201314214

22Smith CL Eppig JT The Mammalian Phenotype Ontology as aunifying standard for experimental and high-throughputphenotyping data Mamm Genome 201323653ndash68

23. Kohler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised database for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.

36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salantera S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Kohler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Kohler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Kohler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Kohler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.

79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.

phenotype descriptions from MP along with supporting evidence. ZFIN similarly provides phenotypic data, with evidence coming from manual curation and individual investigator contributions using the Phenote software (http://www.phenote.org). Phenote allows description of phenotypes in an EQ format, which makes use of any ontology in the Open Biological and Biomedical Ontologies [48] format, including PATO, the Zebrafish Anatomical Ontology [32] and GO.
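To make the EQ format concrete, the following minimal Python sketch pairs an entity term with a quality term. It is illustrative only: the class names `OntologyTerm` and `EQStatement`, the `EX` prefix and the identifiers are hypothetical placeholders, and they do not reflect the Phenote data model or real identifiers from PATO or the Zebrafish Anatomical Ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyTerm:
    """Reference to an ontology term (all values used below are placeholders)."""
    prefix: str    # ontology prefix, e.g. the hypothetical "EX"
    local_id: str  # identifier within that ontology
    label: str     # human-readable name

    @property
    def curie(self) -> str:
        """Compact identifier of the form PREFIX:ID."""
        return f"{self.prefix}:{self.local_id}"


@dataclass(frozen=True)
class EQStatement:
    """A phenotype described as an affected entity (E) plus a quality (Q)."""
    entity: OntologyTerm
    quality: OntologyTerm

    def render(self) -> str:
        return (f"E: {self.entity.label} ({self.entity.curie}); "
                f"Q: {self.quality.label} ({self.quality.curie})")


# Hypothetical example: a fin that is smaller than expected.
statement = EQStatement(
    entity=OntologyTerm("EX", "0000001", "pectoral fin"),
    quality=OntologyTerm("EX", "0000002", "decreased size"),
)
print(statement.render())
```

Keeping entity and quality as separate term references, rather than one pre-composed label, is what allows the same quality to be combined with anatomy terms from different species-specific ontologies.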

The International Mouse Phenotyping Consortium (IMPC) [33, 49] has applied phenotype encoding standards [50] through a set of standard operating procedures (SOPs, as defined in IMPReSS, https://www.mousephenotype.org/impress) for recording high-throughput phenotype measurements in the lab. Each of the SOPs describes not only the experimental setup for the measurement of the required parameters, but also the ontology annotation this test may induce. For example, the SOP designed to assess the grip strength of a mouse includes the suggestion of the MP term ‘abnormal grip strength’ (MP:0001515).

In addition to Phenote, mentioned above, one example of a system that was designed specifically for phenotype capture in a manual mode is PhenoTips [51]. This open-source system assists clinicians in recording phenotypic profiles for patients with rare genetic disorders using HP and OMIM, potentially allowing for diagnosis and comparative phenotype analysis.

Discovering evidence for the causes of human disorders and providing treatment are common goals across the clinical and scientific communities. However, the understanding of phenotypes has traditionally been different between the two communities. Clinicians generally consider phenotypes to be aberrations, i.e. deviations from normal morphology, physiology or behavior [52], while scientists working on biological experiments, such as mutation experiments, have adopted a more pragmatic definition of a selective profile of all the observable characteristics of an organism. This division stems in part from a focus on the overt expression of the syndromes themselves [53] on the one hand, and on the pathway from syndrome to gene expression on the other. Both are crucial to understanding the complex nature of disorders, as Sabb et al. point out [53]. This difference is reflected in the type of data that each community creates and the systems that have been built to support data capture by each.

Semi-automated and automated phenotype acquisition

With the increasing amount of data that is published on a day-to-day basis, manual approaches for data curation become more and more time-demanding and costly, so that computer assistance in screening (document retrieval) and preparing data (information extraction) is unavoidable. The degree to which computer assistance is enabled determines whether the method is semi-automated or automated. While in a semi-automated setting a curator manually verifies the extracted data, in an automated setting no manual input is required. However, given the absence of manual verification and the current state-of-the-art in text processing, the data generated with an automated method may contain some incorrect data.

Figure 3. The increasing amount of data made available over the course of the past years has rendered manual phenotype curation impractical. While automating the process is, in principle, the only viable solution, it possesses its own plethora of technical challenges. These include, among others: (i) boundary detection, i.e. identifying the exact span of text that represents a phenotype candidate; (ii) disambiguation and alignment, subject to the desired level of granularity and the underlying knowledge source; and (iii) interpretation, which covers lack of context, hedging or negation.

The structural and semantic complexity of phenotype terms, coupled with the scale and changing nature of literature-based phenotype descriptions, makes a traditional, fully manual acquisition approach difficult to sustain, leading to potential duplication, inconsistency and sub-optimal coverage. This has created a growing interest in text/data mining techniques. A diverse and growing research community is evolving that aims to exploit biomedical natural language processing for the extraction of structured data from free text and its annotation with the semantic resources that already exist. Although not specifically aimed at phenotypes, knowledge brokering tools such as MetaMap [34], the NCBO Annotator and Apache cTAKES [54] have all been widely used for concept annotation of text to biomedical ontologies and could be used to yield these building blocks. The issue of customizing these generic tools to the extraction of phenotypes in specific disease domains is a key challenge.
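A toy illustration of dictionary-based concept annotation with a simple negation check may help to show what such building blocks look like. The sketch below is not MetaMap, the NCBO Annotator or cTAKES, and it is far simpler than the negation algorithm cited as [35]; the lexicon, the `EX:` identifiers, the cue list and the window size are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple

# Hypothetical lexicon: surface phrase -> placeholder concept identifier.
LEXICON = {
    "grip strength": "EX:0001",
    "elevated blood pressure": "EX:0002",
    "excessive scarring": "EX:0003",
}
# Assumed negation cues; real systems rely on much larger curated lists.
NEGATION_CUES = {"no", "not", "without", "denies"}
WINDOW = 4  # number of preceding tokens inspected for a negation cue


class Mention(NamedTuple):
    start: int     # character offset where the mention begins
    end: int       # character offset just past the mention
    concept: str   # concept identifier from the lexicon
    negated: bool  # True if a negation cue precedes the mention


def annotate(text: str) -> List[Mention]:
    """Find lexicon phrases in text and flag those preceded by a negation cue."""
    lowered = text.lower()
    mentions = []
    for phrase, concept in LEXICON.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Look only at the last WINDOW tokens before the mention.
            preceding = re.findall(r"\w+", lowered[:match.start()])[-WINDOW:]
            negated = any(token in NEGATION_CUES for token in preceding)
            mentions.append(Mention(match.start(), match.end(), concept, negated))
    return sorted(mentions)  # ordered by position in the text


example = "Patient denies excessive scarring but shows elevated blood pressure."
for mention in annotate(example):
    print(example[mention.start:mention.end], mention.concept, mention.negated)
```

Exact phrase matching of this kind addresses only boundary detection for known surface forms; disjoint or coordinated mentions and hedged statements would need additional handling.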

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focuses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical course face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
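The trade-off can be quantified once a manually curated gold standard is available. As a minimal sketch (the function name and the example annotation sets are hypothetical), precision is the fraction of extracted annotations that are correct, and recall is the fraction of gold-standard annotations that were found:

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare automatically extracted annotations against a curated gold standard."""
    true_positives = len(extracted & gold)
    # Precision: how many of the returned annotations are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: how many of the relevant annotations were returned.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotation identifiers for one document.
extracted = {"EX:1", "EX:2", "EX:3", "EX:4"}
gold = {"EX:1", "EX:2", "EX:5"}
precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented example, two of the four extracted annotations are correct and two of the three curated annotations are recovered, so a method tuned to return more candidates would typically raise recall at the cost of precision.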

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge, to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species, and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see 2.1 representation), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Kohler et al. made the UberPheno ontology publicly available [64].

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are in progress to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
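The way a purely lexical matcher arrives at 1:n mappings can be sketched as follows. This is a simplified stand-in based on string similarity from the Python standard library; it is not the algorithm used by AgreementMaker or Zooma, and the identifiers, labels and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalise(label: str) -> str:
    """Lower-case a label, treat hyphens as spaces and collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def align(source: Dict[str, str], target: Dict[str, str],
          threshold: float = 0.8) -> Dict[str, List[str]]:
    """Map each source term to every target term with a similar enough label."""
    mapping: Dict[str, List[str]] = {}
    for source_id, source_label in source.items():
        hits = [
            target_id
            for target_id, target_label in target.items()
            if SequenceMatcher(None, normalise(source_label),
                               normalise(target_label)).ratio() >= threshold
        ]
        if hits:
            mapping[source_id] = hits
    return mapping


# Hypothetical terms from two ontologies, "A" and "B".
source_terms = {"A:1": "abnormal grip strength"}
target_terms = {
    "B:1": "Abnormal grip-strength",
    "B:2": "abnormal grip strengths",
    "B:3": "cleft palate",
}
print(align(source_terms, target_terms))
```

Here the single source term is mapped to two target terms, which is the 1:n situation described above; a curator or an additional structural check would still be needed to decide which candidate, if any, is the intended equivalent.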

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning across the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that need to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that in most cases phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent in practice coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74], or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project), as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58), aim to support the area of Pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that possibly implicate a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has been only recently discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as the ability of representing states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps perfectly onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used by Eriksson et al. to investigate the causal relationship between dosage and effect [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Its ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignment, either owing to their foundational representation or owing to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of Phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analysis of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. The recording of more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address so far missing aspects, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Further aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example, how the phenotypes change over time owing to stimuli in the environment or medication. At the moment, the representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example, ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’. The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
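The spam-filter analogy can be sketched as a classifier that scores each incoming record and raises an alert above a confidence threshold. This is a minimal Python illustration under stated assumptions: the feature names, weights, bias and threshold are invented and carry no clinical meaning, and a logistic score stands in for whatever model a real system would learn from data.

```python
import math
from typing import Dict, List

# Hypothetical weights for a "possible pneumonia" phenotype model.
# The numbers are invented for illustration and have no clinical meaning.
WEIGHTS = {"fever": 1.2, "cough": 0.8, "infiltrate_on_xray": 2.5}
BIAS = -3.0
ALERT_THRESHOLD = 0.8


def phenotype_probability(record: Dict[str, bool]) -> float:
    """Logistic score: probability that the record matches the phenotype."""
    score = BIAS + sum(w for feature, w in WEIGHTS.items() if record.get(feature))
    return 1.0 / (1.0 + math.exp(-score))


def screen(records: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the IDs of records whose probability reaches the alert threshold."""
    return [rid for rid, rec in records.items()
            if phenotype_probability(rec) >= ALERT_THRESHOLD]


incoming = {
    "patient-1": {"fever": True, "cough": True, "infiltrate_on_xray": True},
    "patient-2": {"fever": True, "cough": True},
}
print(screen(incoming))  # only patient-1 reaches the threshold
```

Running `screen` over every new record is what makes the phenotype behave like a background filter rather than a one-off query.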

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time dependencies. In general, a wide range of support and analysis tools are required to unlock the full potential of phenotype data.

Given the achievements in the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and to contribute to evolving and shaping the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data has become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O.; R01 LM011369, R01 GM101430 and U54 HG004028 to N.H.S.; T15 LM00707 to M.R.B.; R01-LM008111 to K.L.; R01 GM102282 and R01 LM011369 to H.L.; U24 CA143840 to A.L.; U54 HG006370 to A.M.; U54 HG008033-01 to M.D.], the Wellcome Trust [098051 to A.O.], a Marie Curie experience researcher fellowship [301806 to N.C.], the National Science Foundation [1207592 to I.G.; DBI-1062404, DBI-1062542 and EF-0905606 to P.M.], the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.], the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to P.N.R.], the Systems Microscopy NoE project [grant agreement 258068 to G.R.] and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

Downloaded from http://bib.oxfordjournals.org/ by guest on January 25, 2016

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised data base for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.

36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.

79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.

extraction of phenotypes in specific disease domains is a key challenge.

Extraction of structured information from electronic health records (EHRs) has a long history of research, e.g. [35, 36, 38, 55]. Progress has been hampered by the balance that needs to be drawn between respecting patient privacy and the need for data to develop comparable gold standards. In the past few years, several initiatives have led the way in making available anonymized collections of EHRs, e.g. Informatics for Integrating Biology and the Bedside (i2b2) [56] and, recently, ShARe/CLEF 2013 [57]. These tasks aim to identify entities of clinical interest, including medical problems, tests and treatments. While neither of these data sets explicitly annotates phenotypes, these entities are highly relevant to phenotype acquisition. Fu et al. suggested an annotation scheme to capture phenotypes for chronic obstructive pulmonary disease in EHRs, which has been implemented in Argo [58] to annotate a corpus of 1000 clinical records [59]. Furthermore, the newly launched DeepPhe project (http://cancer.healthnlp.org) focusses on phenotypes relevant in the cancer genomics domain.

Using the scientific literature as a source, several groups have been active in developing approaches explicitly for phenotypes. These include the Bio-LarK system, which has been applied to skeletal dysplasia [39], and the PhenoMiner system, which has been applied to the cardiovascular and autoimmune systems [60]. Work by Khordad et al. [37] has looked at a more mixed domain using the Unified Medical Language System (UMLS) Metathesaurus tool [40]. Ongoing challenges in processing EHRs are descriptive naming (e.g. typical course face, heterogeneous ECG abnormalities), disjoint phenotype mentions (e.g. blood pressure was observed to be elevated) and coordinated terms (e.g. slow healing and excessive scarring). Harmonization to existing ontologies presents an additional layer of challenge in deciding how to align phenotype mentions that are more or less specific than extant concepts and how to provide sufficient evidence for human curators.

While automated methods are not as thorough as curators, they overcome some of the bottlenecks experienced with manual curation, e.g. high time consumption and low throughput. In general, there is a trade-off between thoroughness (precision) and the amount of acquired data (recall) returned as results from these methods, i.e. automated methods may not return all relevant results and may return some incorrect results.
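The trade-off can be quantified by comparing the mentions returned by an automated method with a manually curated gold standard. The snippet below is a minimal Python sketch; the function name `precision_recall` and the example phenotype strings are illustrative assumptions, not data from any of the cited systems.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare extracted phenotype mentions against curated ones.

    precision = correct extractions / all extractions
    recall    = correct extractions / all curated mentions
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical curated annotations and automated output for one document.
gold = {"short stature", "scoliosis", "cleft palate", "hearing loss"}
predicted = {"short stature", "scoliosis", "fever"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Here one of three extractions is wrong (lower precision) and two of four curated mentions are missed (lower recall), which mirrors the two failure modes named in the text.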

Acquisition of phenotypes: summary

The acquisition and harmonization of phenotypes is an ongoing challenge to be met using evidence from a variety of sources (e.g. EHRs, scientific literature, clinical reports). The key issue for automated approaches involving natural language processing support is to identify and resolve lexical, syntactic and semantic heterogeneity.

Interoperability

The interoperability dimension of phenomics research focuses on making all the available phenotype data integrable with other data sources, e.g. diseases or results from genome analyses. The overarching goal of interoperability is to facilitate translational research and biological discoveries [21]. Current and past work falling into this dimension can be summarized as standardization efforts, alignment of phenotypes within and across species and mapping to other resources. Challenges arise from the many levels of complexity phenotypes can span [42], as well as the development of multiple and mostly disparate reporting schemes [22, 23, 41].

Interoperability through semantic layers

A prerequisite for interoperable phenotype resources is a semantic layer that spans the resources applied and preserves the consistency and specificity contained in each of them. For example, despite standardization efforts such as the Minimal Information for Mouse Phenotyping Procedures [61], the existing landscape of mouse phenotype resources is not fully interoperable and is hard to manage [62]. A similar scenario is seen in hospitals, where different wards use disparate ways of describing a patient. As a consequence, the data for a patient cannot readily be used for further analysis, preventing potential holistic treatment opportunities. However, the need for standardized reporting has been recognized and is implemented through SOPs in the IMPC [33, 49, 50].

While historically there have been different phenotype representations for different species, such as the human, mammalian, fly and worm phenotype ontologies (see 2.1 representation), EQ statements (see Figure 2) have been suggested to integrate phenotypes across different species [24, 44]. In addition, an amendment to the existing EQ statements was suggested to make them interoperable with anatomy and physiology ontologies [63], to extend the links across the different layers of complexity. To make the annotation for three different species (human, mouse and zebrafish) more accessible, Köhler and co-authors made the UberPheno ontology publicly available [64].
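The idea behind EQ statements is that a species-specific, pre-composed term is decomposed into an entity and a quality, so that terms from different species can be compared on those two parts. The following Python sketch illustrates this under strong simplifications: all identifiers (`HUMAN_PHENO:...`, `ANATOMY:...`, `QUALITY:...`) are placeholders rather than real ontology IDs, the table `DECOMPOSITION` is hand-written, and a shared cross-species anatomy vocabulary is simply assumed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EQStatement:
    """Entity-quality decomposition of a phenotype term."""
    entity: str   # what is affected, e.g. an anatomical structure
    quality: str  # how it is affected


# Placeholder identifiers, not real ontology IDs.
DECOMPOSITION: Dict[str, EQStatement] = {
    "HUMAN_PHENO:enlarged heart": EQStatement("ANATOMY:heart", "QUALITY:increased size"),
    "MOUSE_PHENO:increased heart size": EQStatement("ANATOMY:heart", "QUALITY:increased size"),
    "FISH_PHENO:small eye": EQStatement("ANATOMY:eye", "QUALITY:decreased size"),
}


def cross_species_matches(term: str, decomposition: Dict[str, EQStatement]) -> List[str]:
    """Return other terms that share the same entity and quality."""
    target = decomposition[term]
    return sorted(other for other, eq in decomposition.items()
                  if other != term and eq == target)


print(cross_species_matches("HUMAN_PHENO:enlarged heart", DECOMPOSITION))
```

Exact equality is the simplest possible comparison; real integrations typically reason over the ontology hierarchy instead, which this sketch does not attempt.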

Interoperability achieved through mappings (alignment)

Further to the representation of phenotypes with EQ statements, manual [65] and automated methods are in progress to align different pre-composed semantic representations. One example is UMLS [66], which combines over 180 vocabularies, terminologies and ontologies, such as SNOMED CT [67]. The integration of new resources into UMLS is semi-automated. Conflicts between concepts from newly added resources and concepts already in UMLS are manually resolved to ensure a high-quality alignment of all the incorporated terminologies, vocabularies and ontologies.

As an alternative to manual and semi-automated solutions, tools exist that provide fully automated alignments between ontologies, such as AgreementMaker [68] and Zooma (http://www.ebi.ac.uk/fgpt/zooma). While AgreementMaker takes lexical matching and ontological features into account, Zooma uses phonetic matching algorithms for the alignment. In many cases, the resulting alignments associate one term with multiple concepts (1:n mapping). Bottlenecks in the automated alignment are caused by species-specific jargon [69, 70] and by phenotypes that only exist in one of the species and not the other.
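A purely lexical alignment, and the 1:n mappings it can produce, can be illustrated with string similarity from the Python standard library. This sketch is not the algorithm used by AgreementMaker or Zooma; the functions `normalize` and `align`, the threshold of 0.8 and the example labels are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(label: str) -> str:
    """Lower-case a label and unify hyphens and whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def align(source_labels: List[str], target_labels: List[str],
          threshold: float = 0.8) -> Dict[str, List[str]]:
    """Map each source label to every target label that is lexically similar."""
    mapping: Dict[str, List[str]] = {}
    for source in source_labels:
        mapping[source] = [
            target for target in target_labels
            if SequenceMatcher(None, normalize(source), normalize(target)).ratio() >= threshold
        ]
    return mapping


targets = ["increased heart size", "increased heart-size", "decreased body weight"]
print(align(["Increased heart size"], targets))
```

The single source label maps to two target labels, a small instance of a 1:n mapping that a curator or a further disambiguation step would have to resolve.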

In addition to the alignment of multiple resources, mappings are required that would facilitate the integration of diverse resources spanning the different layers of complexity of an organism. While phenotypes in model organisms have been assigned to genes as well as to specific models (determined by allele and background in addition to gene, e.g. MGD), in human, mostly inheritable diseases have been annotated. However, if we were to follow the pathway from the modified gene to the observed phenotypes, a mapping between pathways and phenotypes would also be required [71]. As mentioned above, some integration with other resources has been achieved, e.g. with anatomy and physiology ontologies, but a much larger coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that needs to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent in practice coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74] or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project) as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58) aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that possibly implicate a disease and can provide provenance for results determined through genome-wide association studies.
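The basic shape of a phenome-wide scan is to fix one genetic variant and test it against many phenotypes. The Python sketch below only ranks phenotypes by odds ratio and performs no significance testing or multiple-testing correction, both of which a real PheWAS requires; the function names, the phenotype labels and all counts are invented for illustration.

```python
from typing import Dict, List, Tuple

# (carriers with phenotype, carriers without,
#  non-carriers with phenotype, non-carriers without)
Counts = Tuple[int, int, int, int]


def odds_ratio(carriers_with: int, carriers_without: int,
               noncarriers_with: int, noncarriers_without: int) -> float:
    """Odds ratio of a 2x2 table.

    Adds 0.5 to every cell (Haldane-Anscombe correction) so that
    zero counts do not cause a division by zero.
    """
    a, b, c, d = (x + 0.5 for x in (carriers_with, carriers_without,
                                    noncarriers_with, noncarriers_without))
    return (a * d) / (b * c)


def phewas_scan(counts: Dict[str, Counts]) -> List[Tuple[str, float]]:
    """Rank phenotypes by odds ratio for a single variant, highest first."""
    results = {phenotype: odds_ratio(*cells) for phenotype, cells in counts.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Invented counts for one hypothetical variant.
counts = {
    "phenotype_A": (30, 70, 10, 90),
    "phenotype_B": (20, 80, 20, 80),
}
for phenotype, ratio in phewas_scan(counts):
    print(f"{phenotype}: odds ratio {ratio:.2f}")
```

In this toy table the variant is enriched among individuals with phenotype A and shows no enrichment for phenotype B.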

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has only recently been discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, as well as the inability to represent states of normality and longitudinal phenotypic data resulting from tests and measurements.
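One way to picture the missing alignment between measurements and phenotypic abnormalities is a mapping from a numeric value to a 'decreased', 'normal' or 'increased' label, applied over a series of dated measurements. The Python sketch below is illustrative only: the reference range is an assumed example value rather than clinical guidance, and the names `REFERENCE_RANGES`, `to_phenotype` and `annotate_series` are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

# Illustrative reference range (units: 10^9 cells per litre); an assumption
# for this sketch, not clinical guidance.
REFERENCE_RANGES: Dict[str, Tuple[float, float]] = {
    "platelet count": (150.0, 450.0),
}


def to_phenotype(measure: str, value: float) -> str:
    """Translate a numeric measurement into a qualitative phenotype label."""
    low, high = REFERENCE_RANGES[measure]
    if value < low:
        return f"decreased {measure}"
    if value > high:
        return f"increased {measure}"
    return f"normal {measure}"


def annotate_series(measure: str,
                    series: List[Tuple[date, float]]) -> List[Tuple[date, str]]:
    """Label a longitudinal series of measurements in chronological order."""
    return [(day, to_phenotype(measure, value)) for day, value in sorted(series)]


series = [(date(2015, 6, 1), 120.0), (date(2015, 1, 1), 300.0)]
for day, label in annotate_series("platelet count", series):
    print(day, label)
```

Because 'normal' is an explicit label here, the same record can express both states of normality and change over time, which are the two gaps named in the paragraph.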

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used to investigate the causal relationship between dosage and effect [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Their ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from missing or incorrect cross-species phenotype alignments, owing either to their foundational representation or to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of phenomics to advance, challenges have to be overcome at the universal level as well as at the level of the individual dimensions of phenome research (representation, acquisition, interoperability and processing). The most important universal challenge is the lack of a shared understanding of what a phenotype is among all scientists working with phenotype data. This includes (computational) biologists, regardless of the research questions and/or animal model they are working on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism akin to citation will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analyses of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. Recording more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent from the granularity of the data. In addition, resources need to be built that address aspects that are so far missing, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Further aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example, how phenotypes change over time owing to stimuli in the environment or medication. At the moment, representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.
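One way to picture the missing aspects is a record that carries evidence, provenance and time alongside the phenotype term itself. The sketch below is an assumption made for illustration and not an existing representation model: the class `PhenotypeObservation`, its field names and identifiers such as `EX:0001` are hypothetical. The evidence codes `EXP` (inferred from experiment) and `IEA` (inferred from electronic annotation) follow the GO evidence code convention mentioned above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PhenotypeObservation:
    subject_id: str
    term_id: str                   # ontology term; "EX:..." IDs are placeholders
    observed_on: date              # when the phenotype was observed
    evidence_code: str             # GO-style code, e.g. "EXP" or "IEA"
    source: str                    # provenance: who or what reported it
    trigger: Optional[str] = None  # stimulus or medication, if known


def trajectory(
    observations: Iterable[PhenotypeObservation],
    subject_id: str,
    excluded_evidence: frozenset = frozenset({"IEA"}),
) -> List[PhenotypeObservation]:
    """Return one subject's observations in time order.

    Computationally inferred annotations are dropped by default so that
    predictions are not fed back into an analysis as if they were evidence.
    """
    kept = [
        o for o in observations
        if o.subject_id == subject_id and o.evidence_code not in excluded_evidence
    ]
    return sorted(kept, key=lambda o: o.observed_on)


# Hypothetical records for one subject.
records = [
    PhenotypeObservation("s1", "EX:0002", date(2015, 3, 1), "EXP", "lab-A", "drug-X"),
    PhenotypeObservation("s1", "EX:0001", date(2015, 1, 10), "EXP", "lab-A"),
    PhenotypeObservation("s1", "EX:0003", date(2015, 2, 1), "IEA", "pipeline-B"),
]
for obs in trajectory(records, "s1"):
    print(obs.observed_on, obs.term_id, obs.trigger)
```

Keeping the date and the trigger on each record is what turns a set of snapshots into a sequence that can be queried for change over time.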

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments, to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability and make best use of the integrated resources, there are still challenges that need to be addressed to achieve true interoperability of the existing and newly emerging resources, both phenotype-specific and not. A long-term goal of the dimension of interoperability is direct propagation of experimental findings into prevention and treatment options for patients in a hospital (‘from bench to bedside’). Related to this goal is the aim of personalized treatments by means of building patient-specific models integrating phenotype data that can then be used for simulations of possible treatment outcomes.

In the future, we anticipate a migration toward describing phenotypes as ‘models’ or classifiers that answer a particular question, for example, ‘does the patient have pneumonia?’ or ‘does the patient have sepsis?’ The use of phenotyping will then be analogous to the use of classifiers in spam filters: always running in the background, and when an incoming sample (a patient record, for example) results in a high-confidence match, we would automatically receive an alert that the patient is potentially eligible for a clinical trial, is likely to benefit from a certain therapy or is at increased risk for certain complications.
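The spam-filter analogy can be sketched in a few lines. The example below is an illustration under stated assumptions, not a validated clinical model: the feature names, the weights in `WEIGHTS`, the intercept `BIAS` and the alert threshold of 0.8 are all invented, and a real phenotype model would be trained and evaluated on patient data.

```python
import math

# Invented weights for an illustrative "possible pneumonia" phenotype model.
WEIGHTS = {"fever": 1.2, "cough": 0.9, "infiltrate_on_xray": 2.1}
BIAS = -2.5  # invented intercept


def match_confidence(record):
    """Logistic score in (0, 1) for one patient record (dict of feature flags)."""
    z = BIAS + sum(w for feature, w in WEIGHTS.items() if record.get(feature))
    return 1.0 / (1.0 + math.exp(-z))


def screen(records, threshold=0.8):
    """Return IDs of incoming records whose score reaches the alert threshold."""
    return [pid for pid, rec in records.items() if match_confidence(rec) >= threshold]


# Hypothetical incoming patient records.
incoming = {
    "patient-1": {"fever": True, "cough": True, "infiltrate_on_xray": True},
    "patient-2": {"fever": True, "cough": True},
    "patient-3": {},
}
print(screen(incoming))
```

The threshold is the design choice that matters most here: lowering it produces more alerts and more false positives, the same trade-off a spam filter faces.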

With the improvement in any of the other dimensions, the processing of phenotypes will improve owing to an increase in the quality of data, but at the same time will require extensions that can cope with additional data available in the future, such as ‘normal’ phenotypes, evidence for phenotype data and causal relationships encoded with time-dependencies. In general, a wide range of support and analysis tools are required to unlock the full potential of phenotype data.

Given the achievements of the past years, we look forward to an exciting and promising decade of phenomics, with ample opportunities for researchers to get involved and to help evolve and shape the emerging landscape.

Key Points

• Over the course of the past decade, phenotype data has become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (representation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O.; R01 LM011369 and R01 GM101430 and U54 HG004028 to N.H.S.; T15 LM00707 to M.R.B.; R01-LM008111 to K.L.; R01 GM102282 and R01 LM011369 to H.L.; U24 CA143840 to A.L.; U54HG006370 to A.M.; U54 HG008033-01 to M.D.], the Wellcome Trust [098051 to A.O.], a Marie Curie experience researcher fellowship [301806 to N.C.], the National Science Foundation [1207592 to I.G.; DBI-1062404 and DBI-1062542 and EF-0905606 to P.M.], the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.], the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL to P.N.R.], the Systems Microscopy NoE project [grant agreement 258068 to G.R.] and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised data base for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.


36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.


79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.



coverage is required to unlock the full potential of knowledge discovery using phenotype data.

Interoperability of phenotypes: summary

The need for unified and integrable representation of phenotypes has been recognized, and projects are underway to improve the current situation. However, there is still a huge amount of legacy data that need to be dealt with.

Processing

Processing (or application) is concerned with the use of phenotype data that have either been reported in structured form or extracted with, for example, text mining methods (see acquisition methods of phenotypes). Subject to the target domain, the usage of phenotype data can be classified into four broad categories: (i) clinical research (diagnosis, prognosis, patient match-making, variant prioritization, personalized medicine, drug side effects); (ii) study of genome–phenome interactions to advance the understanding of disease causation or achieve personalized therapies; (iii) cross-resource consistency analysis; and (iv) evolutionary research. It is worth noting that in most cases, phenotype data have been used in a cross-species context, hence transforming the interoperability dimension into a first-class citizen rather than an application scenario.

Application of phenotypes to study the origins and pathology of diseases

The application area of clinical research consists of a varied set of specific goals, which represent, in practice, coherent research streams on their own. Phenotypes have been used as a unique source of data, for example, for disease prediction [72, 73], mining key disease characteristics or characteristic phenotypes [16, 17, 74] or patient match-making [51]. Furthermore, phenotypes have been used to support the understanding of the genetic mechanism of diseases via direct association with genotype data [5, 9, 12], as a mapping bridge across species data [7, 8] and in conjunction with the entire set of OMICS data [75], or to improve variant prioritization for accurate diagnosis [76, 77].
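Disease prediction and patient match-making both rest on comparing phenotype profiles. Reference [72], for instance, relies on semantic similarity searches in ontologies; the sketch below swaps in plain Jaccard overlap between sets of term identifiers to show only the ranking step. The function names, the disease labels and the `EX:` identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of phenotype term IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(query_terms, profiles):
    """Rank candidate profiles (diseases or other patients) by overlap with the query."""
    scored = [(name, jaccard(query_terms, terms)) for name, terms in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder profiles; real profiles would come from curated annotations.
profiles = {
    "disease-A": {"EX:0001", "EX:0002", "EX:0003"},
    "disease-B": {"EX:0003", "EX:0004"},
    "disease-C": {"EX:0005"},
}
patient = {"EX:0001", "EX:0002", "EX:0004"}
for name, score in rank_candidates(patient, profiles):
    print(f"{name}: {score:.2f}")
```

Jaccard overlap on exact identifiers gives no credit for closely related but non-identical terms, which is the gap that ontology-aware similarity measures are designed to close.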

More recently, phenotypes have played a major role as a discovery agent in large-scale genome-wide association studies [78–80]. In particular, projects such as the 100 000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project), as well as the eMerge Network (https://emerge.mc.vanderbilt.edu/?page_id=58), aim to support the area of pharmacogenetics by linking data from EHRs to sequence information from patients to improve diagnosis and treatment. Similarly, Phenome-wide Association Studies (PheWAS) [81–83] allow for the identification of genes that are possibly implicated in a disease and can provide provenance for results determined through genome-wide association studies.

Phenotypes and ethnicity

Generic phenotypes have an immense potential, which has been only recently discovered and exploited. For example, ethnic differences have been shown to explain the optimal states of the human blood glucose levels, expressed via the relationship between insulin sensitivity and insulin response [84]. Similarly, when used in the context of a shared genetic architecture, such data validated the existence of statistically significant relationships between the platelet count and alcohol dependence, or between the alkaline phosphatase level and venous thromboembolism [85]. This application area is characterized by a specific set of challenges emerging from the novel combination of numeric data (from tests and measurement), abnormalities and standard traits. One important and still unsolved problem in the processing of phenotypes is the lack of alignment between measurement representations and phenotypic abnormalities, together with the limited ability to represent states of normality and longitudinal phenotypic data resulting from tests and measurements.

Phenotypes in drug repurposing

Phenotypes expose the effects of drug treatments and hence enable the study of their general effects, the relation between dosage and effects, as well as the interaction between drugs. Existing literature on this topic maps directly onto these three aspects. Phenotypes as adverse drug reactions have been modeled and captured as early as 2010 by the SIDER initiative [14] and have been used by Eriksson et al. to investigate the causal relationship between dosage and effect [15]. In an exercise that combines large-scale acquisition and processing, LePendu et al. have used phenotypes as indicators of adverse drug reactions, as well as signals of adverse events associated with drug–drug interactions [18]. Finally, with the increasing curation, adoption and use of cross-species phenotype data, it has been shown that an effective mapping between model organisms and drug effect profiles based on similarity can be applied to successfully suggest candidate drug targets [13].

Phenotypes in evolutionary studies

In this context, phenotypes have been used to understand patterns of diversification and to gain additional knowledge on trait evolution. Two particular initiatives have focused on this aspect. The AVAToL project [19] represents a collaborative and multidisciplinary effort that combines text mining, image analysis and the wisdom of the crowds to discover and document species phenotypes. Its ultimate goal is to advance phylogenetics research and to enable a faster and more accurate construction of the Tree of Life. The Phenoscape knowledge base [20], on the other hand, integrates phenotype data acquired on over 2500 teleost fishes with structured phenotype data from zebrafish genes to infer candidate genes that explain phenotypic variety and hence enable the formulation of evolutionary/devolutionary hypotheses.

Processing of phenotypes: summary

The quality and range of phenotype applications are curbed only by the quality and availability of the underlying data. Limitations arise, for example, from the missing or incorrect cross-species phenotypes alignment, either owing to their foundational representation or owing to inconsistencies in the level of granularity. Similarly, challenges are encountered when representing and acquiring the more profound dimensions of phenotypes, including degrees of severity, use of ambiguous expressions or temporality (in a longitudinal data sense), which then hamper the development of complex solutions.

A future perspective of phenome research

For the entire field of Phenomics to advance challenges have tobe overcome at the universal level as well as at the level of theindividual dimensions of phenome research (representationacquisition interoperability and processing) The most import-ant universal challenge is the lack of a shared understanding ofwhat a phenotype is among all scientists working with pheno-type data This includes (computational) biologists regardlessof the research questions andor animal model they are working

8 | Oellrich et al

by guest on January 25 2016httpbiboxfordjournalsorg

Dow

nloaded from

on as well as clinicians working in all different areas of humanmedicine To reach a common understanding reporting stand-ards and guidelines need to be derived that facilitate communi-cation across domains as well as large-scale computationalanalysis

Furthermore universal agreement has to be found as towhat additional aspects of phenotypes are relevant and havenot or only sparsely been accounted for in any of the four do-mains Such aspects include types of measurements proven-ance evidence and time To overcome the social barriers toadoption some sort of credit assignment mechanism akin tocitation will be necessary to track the usage of lsquogoodrsquo and reus-able phenotypes Existing author tracking systems such asORCID (httporcidorg) and ResearcherID (httpwwwresearcheridcom) can probably be reused to document author-ship of phenotype models and track usage as well asprovenance

Another important universal issue to overcome is the re-cording of lsquonormalrsquo phenotypes Currently phenomics in thearea of disease gene discovery is geared toward the collectionand analyses of lsquoabnormalrsquo phenotypes eg those resultingfrom diseases or gene modifications However the derivation ofan lsquoabnormalrsquo phenotype is always the result of a comparisonwith what is considered to be lsquonormalrsquo which is collected onlyin some cases [45] The record of more lsquonormalrsquo phenotypeswould allow for more fine-grained analyses and the investiga-tion of causal relationships between different conditions eg incases of comorbidity

From a representational point of view the ultimate futuregoal is to overcome the limitations of species- and domain-specific representations and find a universal way of encodingphenotype data independent from the granularity of the dataIn addition resources need to be built that address so far miss-ing aspects such as evidence or time aspects For example asmall set of evidence codes have been established as part of GOto provide provenance in gene annotations but also to providemeans for computational analysis to avoid data circularityAnother aspect of phenotypes that has not been integrated yetinto the representation of phenotypes are causality and tempor-ality for example how the phenotypes change over time owingto stimuli in the environment or medication While at the mo-ment the representations focus on reporting either a temporarysnapshot of an individual examined in an experiment or as partof a medical investigation the current representation modelsdo not allow for the encoding of phenotype changes over timeas a result to surrounding stimuli

As we extend the scope of our joint understanding of pheno-types and adopt ways to represent this understanding the ac-quisition of phenotypes has to change too Methods have to bedeveloped that can accommodate the recording of additionalaspects such as evidence and time eg when extracting infor-mation from the scientific literature Furthermore more reliableautomated methods are needed that can cope with the com-plexity of free text in clinical settings as well as reporting mech-anisms in wet lab environments to facilitate high-throughputand overcome the need for time- and cost-intensive manuallabor In addition and as mentioned earlier the acquisition di-mension should also address the collection of lsquonormalrsquo pheno-types in the future to improve the results obtained byprocessing the phenotype data

Despite the widespread aims to achieve interoperability andmake best use of the integrated resources there are still chal-lenges that need to be addressed to achieve true interoperabilityof the existing and newly emerging resources both phenotype-

specific and not A long-term goal of the dimension of inter-operability is direct propagation of experimental findings intoprevention and treatment options for patients in a hospital(lsquofrom bench to bedsidersquo) Related to this goal is the aim of per-sonalized treatments by means of building patient-specificmodels integrating phenotype data that can then be used forsimulations of possible treatment outcomes

In the future we anticipate a migration toward describingphenotypes as lsquomodelsrsquo or classifiers that answer a particularquestion For example lsquodoes the patient have pneumoniarsquo orlsquodoes the patient have sepsisrsquo The use of phenotyping willthen be analogous to the use of classifiers in spam filters al-ways running in the background and when an incoming sample(a patient record for example) results in a high confidencematch we would automatically receive an alert that the patientis potentially eligible for a clinical trial is likely to benefit from acertain therapy or is at increased risk for certain complications

With the improvement in any of the other dimensions theprocessing of phenotypes will improve owing to an increase inthe quality of data but at the same time will require extensionsthat can cope with additional data available in the future suchas lsquonormalrsquo phenotypes evidence for phenotype data andcausal relationships encoded with time-dependencies In gen-eral a wide range of support and analysis tools are required tounlock the full potential of phenotype data

Given the achievements in the past years we look forwardtoward an exciting and promising decade of phenomics withample opportunities for researchers to get involved and contrib-ute to evolve and shape the emerging landscape

Key Points

• Over the course of the past decade, phenotype data has become a key factor in analyzing diseases and reporting experimental outcomes.

• Successful applications of phenotypes include the description of experimental outcomes (e.g. the changes in phenotypes owing to gene modifications), computational knowledge discovery (e.g. in determining disease gene candidates) and reporting in clinical environments (e.g. patient monitoring).

• The research field of phenomics can be divided into four main dimensions (presentation, acquisition, interoperability and processing), each of which is dependent to some extent on the others.

• While the development of each of the four dimensions is individual, a common understanding and universal guidelines need to be established on how phenotypes are perceived and how they are used; a synchronization of efforts is needed.

• The future of phenomics research holds exciting challenges and has the potential to create a significant impact on the entire biomedical domain.

Funding

This work was supported by the National Institutes of Health [1 U54 HG006370-01 to A.O.; R01 LM011369, R01 GM101430 and U54 HG004028 to N.H.S.; T15 LM00707 to M.R.B.; R01-LM008111 to K.L.; R01 GM102282 and R01 LM011369 to H.L.; U24 CA143840 to A.L.; U54HG006370 to A.M.; U54 HG008033-01 to M.D.]; the Wellcome Trust [098051 to A.O.]; a Marie Curie experience researcher fellowship [301806 to N.C.]; the National Science Foundation [1207592 to I.G.; DBI-1062404, DBI-1062542 and EF-0905606 to P.M.]; the Bundesministerium für Bildung und Forschung [0313911 to P.N.R.]; the European Community’s Seventh Framework Programme [Grant Agreement 602300, SYBIL, to P.N.R.]; the Systems Microscopy NoE project [grant agreement 258068 to G.R.]; and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to K.L.]. T.G. was supported by the Kinghorn Foundation. O.B. was supported by the Intramural Research Program of the NIH, National Library of Medicine. M.S. was supported by the Medical Research Council. L.W. was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. R.W. was supported by an appointment to the NLM Research Participation Program, administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised data base for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.


36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner Ö, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.


79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.



on, as well as clinicians working in all different areas of human medicine. To reach a common understanding, reporting standards and guidelines need to be derived that facilitate communication across domains as well as large-scale computational analysis.

Furthermore, universal agreement has to be found as to what additional aspects of phenotypes are relevant and have not, or only sparsely, been accounted for in any of the four domains. Such aspects include types of measurements, provenance, evidence and time. To overcome the social barriers to adoption, some sort of credit assignment mechanism, akin to citation, will be necessary to track the usage of ‘good’ and reusable phenotypes. Existing author tracking systems such as ORCID (http://orcid.org) and ResearcherID (http://www.researcherid.com) can probably be reused to document authorship of phenotype models and track usage as well as provenance.

Another important universal issue to overcome is the recording of ‘normal’ phenotypes. Currently, phenomics in the area of disease gene discovery is geared toward the collection and analyses of ‘abnormal’ phenotypes, e.g. those resulting from diseases or gene modifications. However, the derivation of an ‘abnormal’ phenotype is always the result of a comparison with what is considered to be ‘normal’, which is collected only in some cases [45]. The record of more ‘normal’ phenotypes would allow for more fine-grained analyses and the investigation of causal relationships between different conditions, e.g. in cases of comorbidity.
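The dependence of every ‘abnormal’ call on recorded ‘normal’ data can be made concrete with a small sketch. It is an illustration under stated assumptions rather than an established phenotyping procedure: the function name `call_phenotype`, the z-score rule with a cutoff of 2 and the control body weights are all hypothetical.

```python
from statistics import mean, stdev


def call_phenotype(value, normal_values, z_cutoff=2.0):
    """Label a measurement relative to a baseline of 'normal' measurements."""
    if len(normal_values) < 2:
        raise ValueError("need at least two baseline measurements")
    mu, sigma = mean(normal_values), stdev(normal_values)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    z = (value - mu) / sigma
    if z > z_cutoff:
        return "increased"
    if z < -z_cutoff:
        return "decreased"
    return "normal"


# Hypothetical body weights (g) of wild-type control animals.
controls = [24.0, 25.0, 26.0, 25.0, 25.0]

print(call_phenotype(28.0, controls))  # well above the baseline
print(call_phenotype(25.5, controls))  # within the baseline spread
print(call_phenotype(22.0, controls))  # well below the baseline
```

Without the `controls` list no call can be made at all, which is the point of the paragraph above: the baseline has to be collected and stored alongside the abnormal findings.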

From a representational point of view, the ultimate future goal is to overcome the limitations of species- and domain-specific representations and find a universal way of encoding phenotype data, independent of the granularity of the data. In addition, resources need to be built that address aspects missing so far, such as evidence or time. For example, a small set of evidence codes has been established as part of GO to provide provenance in gene annotations, but also to provide means for computational analysis to avoid data circularity. Other aspects that have not yet been integrated into the representation of phenotypes are causality and temporality, for example, how phenotypes change over time owing to stimuli in the environment or medication. At the moment, representations focus on reporting a temporary snapshot of an individual, examined either in an experiment or as part of a medical investigation; the current representation models do not allow for the encoding of phenotype changes over time as a result of surrounding stimuli.
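A minimal sketch of what a representation carrying evidence and time might look like is given below. It is an assumption made for illustration, not an existing standard: the `PhenotypeAnnotation` record, its field names and the example data are hypothetical, and the evidence strings merely borrow the style of GO evidence codes (EXP, IEA).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PhenotypeAnnotation:
    subject: str                     # individual or model identifier (made up)
    phenotype: str                   # ontology term label
    evidence: str                    # evidence code, in the style of GO codes
    onset: date                      # first observation of the phenotype
    resolved: Optional[date] = None  # None means the phenotype is still present
    trigger: Optional[str] = None    # stimulus or medication, if known

    def active_on(self, day: date) -> bool:
        """True if the phenotype was present on the given day."""
        return self.onset <= day and (self.resolved is None or day < self.resolved)


def timeline(annotations, subject, exclude_evidence=frozenset()):
    """Annotations for one subject ordered by onset, skipping unwanted evidence."""
    kept = [
        a for a in annotations
        if a.subject == subject and a.evidence not in exclude_evidence
    ]
    return sorted(kept, key=lambda a: a.onset)


annotations = [
    PhenotypeAnnotation("patient-1", "rash", "EXP",
                        date(2015, 3, 10), date(2015, 3, 20), trigger="drug A"),
    PhenotypeAnnotation("patient-1", "fever", "EXP",
                        date(2015, 3, 1), date(2015, 3, 5)),
    PhenotypeAnnotation("patient-1", "anemia", "IEA", date(2015, 2, 1)),
]

# Drop electronically inferred annotations before analysis to limit circularity.
curated = timeline(annotations, "patient-1", exclude_evidence={"IEA"})
print([a.phenotype for a in curated])
```

Storing an onset, an optional resolution date and an optional trigger turns the single snapshot into an interval that can be ordered and queried, while the evidence field lets an analysis exclude annotations that were themselves derived computationally.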

As we extend the scope of our joint understanding of phenotypes and adopt ways to represent this understanding, the acquisition of phenotypes has to change too. Methods have to be developed that can accommodate the recording of additional aspects such as evidence and time, e.g. when extracting information from the scientific literature. Furthermore, more reliable automated methods are needed that can cope with the complexity of free text in clinical settings, as well as reporting mechanisms in wet lab environments to facilitate high throughput and overcome the need for time- and cost-intensive manual labor. In addition, and as mentioned earlier, the acquisition dimension should also address the collection of ‘normal’ phenotypes in the future to improve the results obtained by processing the phenotype data.

Despite the widespread aims to achieve interoperability andmake best use of the integrated resources there are still chal-lenges that need to be addressed to achieve true interoperabilityof the existing and newly emerging resources both phenotype-

specific and not A long-term goal of the dimension of inter-operability is direct propagation of experimental findings intoprevention and treatment options for patients in a hospital(lsquofrom bench to bedsidersquo) Related to this goal is the aim of per-sonalized treatments by means of building patient-specificmodels integrating phenotype data that can then be used forsimulations of possible treatment outcomes

In the future we anticipate a migration toward describingphenotypes as lsquomodelsrsquo or classifiers that answer a particularquestion For example lsquodoes the patient have pneumoniarsquo orlsquodoes the patient have sepsisrsquo The use of phenotyping willthen be analogous to the use of classifiers in spam filters al-ways running in the background and when an incoming sample(a patient record for example) results in a high confidencematch we would automatically receive an alert that the patientis potentially eligible for a clinical trial is likely to benefit from acertain therapy or is at increased risk for certain complications

With the improvement in any of the other dimensions theprocessing of phenotypes will improve owing to an increase inthe quality of data but at the same time will require extensionsthat can cope with additional data available in the future suchas lsquonormalrsquo phenotypes evidence for phenotype data andcausal relationships encoded with time-dependencies In gen-eral a wide range of support and analysis tools are required tounlock the full potential of phenotype data

Given the achievements in the past years we look forwardtoward an exciting and promising decade of phenomics withample opportunities for researchers to get involved and contrib-ute to evolve and shape the emerging landscape

Key Points

bull Over the course of the past decade phenotype datahas become a key factor in analyzing diseases and re-porting experimental outcomes

bull Successful applications of phenotypes include the de-scription of experimental outcomes (eg the changes inphenotypes owing to gene modifications) computationalknowledge discovery (eg in determining disease genecandidates) and reporting in clinical environments (egpatient monitoring)

bull The research field of phenomics can be divided intofour main dimensions (presentation acquisition inter-operability and processing) each of which is depend-ent to some extent on the others

bull While the development of each of the four dimensionsis individual a common understanding and universalguidelines need to be established on how phenotypesare perceived and how they are used a synchroniza-tion of efforts is needed

bull The future of phenomics research holds exciting chal-lenges and has the potential to create a significant im-pact on the entire biomedical domain

Funding

This work was supported by the National Institutes ofHealth [1 U54 HG006370-01 to AO R01 LM011369 and R01GM101430 and U54 HG004028 to NHS T15 LM00707 toMRB R01-LM008111 to KL R01 GM102282 and R01LM011369 to HL U24 CA143840 to AL U54HG006370 toAM U54 HG008033-01 to MD] the Wellcome Trust [098051to AO] a Marie Curie experience researcher fellowship

The digital revolution in phenotyping | 9

by guest on January 25 2016httpbiboxfordjournalsorg

Dow

nloaded from

[301806 to NC] the National Science Foundation [1207592to IG DBI-1062404 and DBI-1062542 and EF-0905606 toPM] the Bundesministerium fur Bildung und Forschung[0313911 to PNR] the European Communityrsquos SeventhFramework Programme [Grant Agreement 602300 SYBIL toPNR] the Systems Microscopy NoE project [grant agree-ment 258068 to GR] and the Defense Advanced ResearchProjects Agency [W911NF-14-C-0109 to KL] TG was sup-ported by the Kinghorn Foundation OB was supported bythe Intramural Research Program of the NIH NationalLibrary of Medicine MS was supported by the MedicalResearch Council LW was supported by the Homer WarnerCenter for Informatics Research of the IHC Health ServicesRW was supported by an appointment to the NLMResearch Participation Program administered by the OakRidge Institute for Science and Education through an inter-agency agreement between the US Department of Energyand the National Library of Medicine

References1 Amberger J Bocchini C Hamosh A A new face and new chal-

lenges for Online Mendelian Inheritance in Man (OMIM)Human Mutation 201132564ndash7

2 Blake JA Bult CJ Kadin JA et al The Mouse Genome Database(MGD) premier model organism resource for mammaliangenomics and genetics Nucleic Acids Res 201139D842ndash8

3 Tweedie S Ashburner M Falls K et al FlyBase enhancingDrosophila Gene Ontology annotations Nucleic Acids Research2009 37D555ndash9

4 Howe DG Bradford YM Conlin T et al ZFIN the ZebrafishModel Organism Database increased support for mutantsand transgenics Nucleic Acids Research 2013 41D854ndash60

5 Goh KI Cusick ME Valle D et al The human disease networkProc Natl Acad Sci USA 20071048685ndash90

6 Hoehndorf R Schofield PN Gkoutos GV PhenomeNET awhole-phenome approach to disease gene discovery NucleicAcids Res 201139e119

7 Smedley D Oellrich A Kohler S et al PhenoDigm analyzingcurated annotations to associate animal models with humandiseases Database 2013bat025

8 Washington NL Haendel MA Mungall CJ et al Linkinghuman diseases to animal models using ontology-basedphenotype annotation PLoS Biol 20097e1000247

9 Zhou X Menche J Barabasi AL et al Human symptomsndashdisease network Nat Commun 20145

10Van Driel MA Bruggeman J Vriend G et al A text-mining ana-lysis of the human phenome Eur J Hum Genet 200614535ndash42

11Groth P Pavlova N Kalev I et al PhenomicDB a new cross-species genotypephenotype resource Nucleic Acids Res200735D696ndash9

12Korbel JO Doerks T Jensen LJ et al Systematic association ofgenes to phenotypes by genome and literature mining PLoSBiol 20053e134

13Hoehndorf R Hiebert T Hardy NW et al Mouse model pheno-types provide information about human drug targetsBioinformatics 201330719ndash25

14Kuhn M Campillos M Letunic I et al A side effect resource tocapture phenotypic effects of drugs Mol Syst Biol 20106343

15Eriksson R Werge T Jensen LJ et al Dose-specific adversedrug reaction identification in electronic patient recordstemporal data mining in an inpatient psychiatric populationDrug Safety 201437237ndash47

16Lasko TA Denny JC Levy MA Computational phenotype dis-covery using unsupervised feature learning over noisy sparseand irregular clinical data eng PLoS One 20138e66341

17Schulam P Wigley F Saria S Clustering longitudinal clinicalmarker trajectories from electronic health data applicationsto phenotyping and endotype discovery In Proceedings ofthe Twenty-Ninth AAAI Conference on Artificial IntelligenceAustin Texas USA 2015

18LePendu P Iyer SV Bauer-Mehren A et al Pharmacovigilanceusing clinical notes Clin Pharmacol Ther 201393547ndash55

19Burleigh G Alphonse K Alverson AJ et al Next-generationphenomics for the tree of life PLoS Curr 20135

20Mabee P Balhoff JP Dahdul WM et al 500 000 fish pheno-types The new informatics landscape for evolutionary anddevelopmental biology of the vertebrate skeleton J ApplIchthyol 201228300ndash5

21Collier N Oellrich A Groza T Toward knowledge support foranalysis and interpretation of complex traits Genome Biol201314214

22Smith CL Eppig JT The Mammalian Phenotype Ontology as aunifying standard for experimental and high-throughputphenotyping data Mamm Genome 201323653ndash68

23Kohler S Doelken SC Mungall CJ et al The HumanPhenotype Ontology project linking molecular biology anddisease through phenotype data Nucleic Acids Res201442D966ndash74

24Gkoutos GV Mungall C Doelken S et al Entityquality-basedlogical definitions for the human skeletal phenome usingPATO In Proceedings of Annual International Conference ofthe IEEE Engineering in Medicine and Biology SocietyMinneapolis Minnesota USA 2009 7069ndash72

25Winter RM Baraitser M Douglas JM A computerised database for the diagnosis of rare dysmorphic syndromes J MedGenet 198421121ndash3

26Botstein D Cherry JM Ashburner M et al Gene Ontology toolfor the unification of biology Nat Genet 20002525ndash9

27Hancock JM Commentary on Shimoyama et al (2012) threeontologies to define phenotype measurement data FrontGenet 20145

28Boland MR Tatonetti NP Hripcsak G Development and valid-ation of a classification approach for extracting severity auto-matically from electronic health records J Biomed Semant2015614

29Greenaway S Blake A Retha A et al Automatically annotat-ing temporal data from a phenotype-driven mutagenesisscreen In Proceedings of Phenotype Day at ISMB Dublin Ireland2015

30Rath A Olry A Dhombres F et al Representation of rare dis-eases in health information systems the Orphanet approachto serve a wide range of end users Hum Mutat 201233803ndash8

31Thorn CF Klein TE Altman RB PharmGKB Methods Mol Biol2005311179ndash91

32Sprague J Bayraktaroglu L Clements D et al The ZebrafishInformation Network the zebrafish model organism data-base Nucleic Acids Res 200634D581ndash5

33Brown SDM Moore MW Towards an encyclopaedia of mam-malian gene function the International Mouse PhenotypingConsortium Dis Models Mech 20125289ndash92

34Aronson AR Lang FM An overview of MetaMap historicalperspective and recent advances J Am Med Inform Assoc201017229ndash36

35Chapman WW Bridewell W Hanbury P et al A simple algo-rithm for identifying negated findings and diseases in dis-charge summaries J Biomed Inform 200134301ndash10

10 | Oellrich et al

by guest on January 25 2016httpbiboxfordjournalsorg

Dow

nloaded from

36Friedman C Alderson PO Austin JH et al A general natural-language text processor for clinical radiology J Am Med InformAssoc 19941161ndash74

37Khordad M Mercer RE Rogan P Advances in ArtificialIntelligence Berlin Heidelberg Springer 2011 246ndash57

38Hirschman L Sager N Lyman M Automatic application ofhealth care criteria to narrative patient records In Proceedingsof the Annual Symposium on Computer Application in MedicalCare Washington DC USA 1979

39Groza T Hunter J Zankl A Mining skeletal phenotype de-scriptions from scientific literature PLoS One 20138e55656

40Aronson AR Effective mapping of biomedical text to theUMLS Metathesaurus the MetaMap program In Proceedings ofAmerican Medical Informatics Association (AMIA) AnnualSymposium Washington DC USA 2001 17ndash21

41Schindelman G et al Worm Phenotype Ontology Integratingphenotype data within and beyond the C elegans communityBMC Bioinformatics 20111232

42. Freimer N, Sabatti C. The human phenome project. Nat Genet 2003;34:15–21.

43. Oellrich A, Rebholz-Schuhmann D. A classification of existing phenotypical representations and methods for improvement. In: Proceedings of the 2nd OBML Workshop, Mannheim, Germany, 2010.

44. Mungall CJ, Gkoutos GV, Smith CL, et al. Integrating phenotype ontologies across multiple species. Genome Biol 2010;11:R2.

45. Shimoyama M, Nigam R, McIntosh LS, et al. Three ontologies to define phenotype measurement data. Front Genet 2012;3:87.

46. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Res 2002;30:163–5.

47. Rubin DL, Thorn CF, Klein TE, et al. A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge. J Am Med Inform Assoc 2005;12:121–9.

48. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007;25:1251–5.

49. White JK, Gerdin AK, Karp NA, et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 2013;154:452–64.

50. Beck T, Morgan H, Blake A, et al. Practical application of ontologies to annotate and analyse large scale raw mouse phenotype data. BMC Bioinformatics 2009;10:S2.

51. Girdea M, Dumitriu S, Fiume M, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat 2013;34:1057–65.

52. Robinson PN, Webber C. Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 2014;10:e1004268.

53. Sabb FW, Burggren AC, Higier RG, et al. Challenges in phenotype definition in the whole-genome era: multivariate models of memory and intelligence. Neuroscience 2009;164:88–107.

54. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.

55. Friedman C, Shagina L, Lussier Y, et al. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11:392–402.

56. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.

57. Suominen H, Salanterä S, Velupillai S, et al. Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Berlin, Heidelberg: Springer, 2013, 212–31.

58. Rak R, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012:bas010.

59. Fu X, Batista-Navarro RTB, Rak R, et al. A strategy for annotating clinical records with phenotypic information relating to the chronic obstructive pulmonary disease. In: Proceedings of Phenotype Day at ISMB, Boston, Massachusetts, USA, 2014.

60. Collier N, Tran M, Le H, et al. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013;8:e72965.

61. Mouse Phenotype Database Integration Consortium, Hancock JM, Adams NC, et al. Integration of mouse phenome data resources. Mamm Genome 2007;18:157–63.

62. Smedley D, Schofield P, Chen CK, et al. Finding and sharing: new approaches to registries of databases and services for the biomedical sciences. Database 2010:baq014.

63. Hoehndorf R, Oellrich A, Rebholz-Schuhmann D. Interoperability between phenotype and anatomy ontologies. Bioinformatics 2010;26:3112–18.

64. Köhler S, Doelken SC, Ruef BJ, et al. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2013;2:30.

65. Sarasua C, Simperl E, Noy NF. Crowdmap: Crowdsourcing Ontology Alignment with Microtasks. The Semantic Web–ISWC 2012. Berlin, Heidelberg: Springer, 2012, 525–41.

66. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70.

67. Stearns MQ, et al. SNOMED clinical terms: overview of the development process and project status. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, USA, 2001, 662–6.

68. Cruz IF, Antonelli FP, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies. In: Proceedings of the VLDB Endowment, Lyon, France, 2009, Vol. 2, 1586–9.

69. Groth P, Weiss B, Pohlenz HD, et al. Mining phenotypes for gene function prediction. BMC Bioinformatics 2008;9:136.

70. Leonelli S, Ankeny RA. Re-thinking organisms: The impact of databases on model organism biology. Stud Hist Philos Biol Biomed Sci 2012;43:29–36.

71. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semant 2015;6:17.

72. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009;85:457–64.

73. Paul R, Groza T, Hunter J, et al. Decision support methods for finding phenotype–disorder associations in the bone dysplasia domain. PLoS One 2012;7:e50614.

74. Paul R, et al. Inferring characteristic phenotypes via class association rule mining in the bone dysplasia domain. J Biomed Inform 2013;48:73–83.

75. Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293–307.

76. Robinson PN, Köhler S, Oellrich A, et al. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res 2014;24:340–8.

77. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014;6:252ra123.

78. Denny JC, Bastarache L, Ritchie MD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol 2013;31:1102–10.

79. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–7.

80. Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.

81. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205–10.

82. Pendergrass SA, Brown-Gentry K, Dudek SM, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 2013:e1003087.

83. Shameer K, Denny JC, Ding K, et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum Genet 2014;133:95–109.

84. Kodama K, Tojjar D, Yamada S, et al. Ethnic differences in the relationship between insulin sensitivity and insulin response: a systematic review and meta-analysis. Diabetes Care 2013;36:1789–96.

85. Li L, Ruau DJ, Patel CJ, et al. Disease risk factors identified through shared genetic architecture and electronic medical records. Sci Transl Med 2014;6:234ra57.

[301806 to NC], the National Science Foundation [1207592 to IG; DBI-1062404 and DBI-1062542 and EF-0905606 to PM], the Bundesministerium für Bildung und Forschung [0313911 to PNR], the European Community's Seventh Framework Programme [Grant Agreement 602300 SYBIL to PNR], the Systems Microscopy NoE project [grant agreement 258068 to GR] and the Defense Advanced Research Projects Agency [W911NF-14-C-0109 to KL]. TG was supported by the Kinghorn Foundation. OB was supported by the Intramural Research Program of the NIH, National Library of Medicine. MS was supported by the Medical Research Council. LW was supported by the Homer Warner Center for Informatics Research of the IHC Health Services. RW was supported by an appointment to the NLM Research Participation Program administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the US Department of Energy and the National Library of Medicine.

References

1. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2011;32:564–7.

2. Blake JA, Bult CJ, Kadin JA, et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 2011;39:D842–8.

3. Tweedie S, Ashburner M, Falls K, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 2009;37:D555–9.

4. Howe DG, Bradford YM, Conlin T, et al. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res 2013;41:D854–60.

5. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104:8685–90.

6. Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res 2011;39:e119.

7. Smedley D, Oellrich A, Köhler S, et al. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database 2013:bat025.

8. Washington NL, Haendel MA, Mungall CJ, et al. Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 2009;7:e1000247.

9. Zhou X, Menche J, Barabási AL, et al. Human symptoms–disease network. Nat Commun 2014;5.

10. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14:535–42.

11. Groth P, Pavlova N, Kalev I, et al. PhenomicDB: a new cross-species genotype/phenotype resource. Nucleic Acids Res 2007;35:D696–9.

12. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3:e134.

13. Hoehndorf R, Hiebert T, Hardy NW, et al. Mouse model phenotypes provide information about human drug targets. Bioinformatics 2013;30:719–25.

14. Kuhn M, Campillos M, Letunic I, et al. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 2010;6:343.

15. Eriksson R, Werge T, Jensen LJ, et al. Dose-specific adverse drug reaction identification in electronic patient records: temporal data mining in an inpatient psychiatric population. Drug Safety 2014;37:237–47.

16. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 2013;8:e66341.

17. Schulam P, Wigley F, Saria S. Clustering longitudinal clinical marker trajectories from electronic health data: applications to phenotyping and endotype discovery. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

18. LePendu P, Iyer SV, Bauer-Mehren A, et al. Pharmacovigilance using clinical notes. Clin Pharmacol Ther 2013;93:547–55.

19. Burleigh G, Alphonse K, Alverson AJ, et al. Next-generation phenomics for the tree of life. PLoS Curr 2013;5.

20. Mabee P, Balhoff JP, Dahdul WM, et al. 500 000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 2012;28:300–5.

21. Collier N, Oellrich A, Groza T. Toward knowledge support for analysis and interpretation of complex traits. Genome Biol 2013;14:214.

22. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 2013;23:653–68.

23. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.

24. Gkoutos GV, Mungall C, Doelken S, et al. Entity/quality-based logical definitions for the human skeletal phenome using PATO. In: Proceedings of Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, Minnesota, USA, 2009, 7069–72.

25. Winter RM, Baraitser M, Douglas JM. A computerised database for the diagnosis of rare dysmorphic syndromes. J Med Genet 1984;21:121–3.

26. Botstein D, Cherry JM, Ashburner M, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

27. Hancock JM. Commentary on Shimoyama et al. (2012): three ontologies to define phenotype measurement data. Front Genet 2014;5.

28. Boland MR, Tatonetti NP, Hripcsak G. Development and validation of a classification approach for extracting severity automatically from electronic health records. J Biomed Semant 2015;6:14.

29. Greenaway S, Blake A, Retha A, et al. Automatically annotating temporal data from a phenotype-driven mutagenesis screen. In: Proceedings of Phenotype Day at ISMB, Dublin, Ireland, 2015.

30. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat 2012;33:803–8.

31. Thorn CF, Klein TE, Altman RB. PharmGKB. Methods Mol Biol 2005;311:179–91.

32. Sprague J, Bayraktaroglu L, Clements D, et al. The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res 2006;34:D581–5.

33. Brown SDM, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Models Mech 2012;5:289–92.

34. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010;17:229–36.

35. Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.

36. Friedman C, Alderson PO, Austin JH, et al. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1:161–74.

37. Khordad M, Mercer RE, Rogan P. Advances in Artificial Intelligence. Berlin, Heidelberg: Springer, 2011, 246–57.

38. Hirschman L, Sager N, Lyman M. Automatic application of health care criteria to narrative patient records. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, Washington, DC, USA, 1979.

39. Groza T, Hunter J, Zankl A. Mining skeletal phenotype descriptions from scientific literature. PLoS One 2013;8:e55656.

40. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, USA, 2001, 17–21.

41. Schindelman G, et al. Worm Phenotype Ontology: Integrating phenotype data within and beyond the C. elegans community. BMC Bioinformatics 2011;12:32.
