PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes Son Doan, Hyeoneui Kim Division of Biomedical Informatics University of California San Diego Open Access Journal Club, 09/05/2013

PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)

presented at DBMI, UCSD Journal club on 9/5/2013

Page 1: PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)

PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes

Son Doan, Hyeoneui Kim

Division of Biomedical Informatics University of California San Diego

Open Access Journal Club, 09/05/2013

Roadmap to the Presentation

•  Background
   –  dbGaP
   –  Challenges in using dbGaP
   –  pFINDR program
•  PhenDisco development
   –  User requirement analysis for PhenDisco
   –  Data standardization (variables, study metadata)
   –  System development: technical details
•  PhenDisco demo
•  Performance evaluation

Background

Overview of dbGaP

•  Database of Genotypes and Phenotypes
•  Developed by NCBI
•  Stores and distributes the data and outputs of studies on the interactions of genotypes and phenotypes
•  Provides 2 levels of access
   –  Open access: variable information, including summary statistics, and study information
   –  Controlled access: raw data, upon approval by the NIH DAC

A Typical Challenge in Using dbGaP

Potentially, dbGaP is great… it contains so many different types of studies and their data!

However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study-related information such as study design, analysis methods, and analysis data produced by the studies.

Even if I find studies that seem to fit my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need.

Of course, dealing with data values in all sorts of different formats is another challenge to go through…

(Erin Smith, PhD, Division of Genome Information Science, UCSD)

[Three slides of screenshots of the dbGaP web site: http://www.ncbi.nlm.nih.gov/gap]

pFINDR (phenotype Finding IN Data Repositories)

•  Funded by NHLBI
•  To facilitate dbGaP use by improving accuracy and completeness of search returns
   –  Standardized phenotype variables
   –  Searchable study-related information

User Requirement Analysis

Use-Case Driven Development

•  User requirements collected from
   –  Analysis of data use descriptions from data requests available in dbGaP (14,287 requests)
   –  Online user survey (17 users)
   –  User interviews (8 local dbGaP users)
   –  NIH officers/Scientific Advisory Board recommendations and suggestions

Data Request Analysis

[Chart: topics of dbGaP data requests by category.]

•  Labeled disease shares
   –  Neoplasm/Cancer (30%)
   –  Psychiatric Disease (13%)
   –  Genetic Disease Congenital Abnormality (8.6%)
   –  Cardiovascular Disease (8.1%)
•  Category legend: Disease; Chemical or Biological Substance; Therapeutic or Preventive Procedure; Research Activity; Laboratory Procedure or Test; Pathologic Function; Signs or Symptoms; Diagnostic Procedure; Clinical Attributes; Mood, Emotion, and Individual Behavior; Qualitative Concept; Mental Process; Social Behavior; Organism Function; Daily Function or Activity; Health Care Activity; Food; Other

Interviews, Survey and SAB/NIH officers’ feedback

•  Functions that maximize search efficiency
•  Examples
   –  “option to expand search terms through synonyms”
   –  “studies displayed in the order of relevancy”
   –  “select studies from the returned list and save for later review”
   –  “search results organized in a way that supports quick browsing”

Problems We Addressed

•  Focus areas:
   –  Completeness and accuracy of search results
      –  Abbreviation expansion
      –  Concept-based search
   –  Ease of result review
      –  Sorting the results by relevancy
      –  Highlighting search keywords in the retrieved records
   –  Additional functionality
      –  Export of selected study and variable information
      –  Categorization of variables

Data Standardization

•  Variable Standardization
•  Study Level Metadata Generation

Phenotype Variable Standardization

•  Used variable descriptions
•  Focused on identifying
   –  Topic (main theme: “pain”, “walking”)
   –  Subject of information (SOI), i.e., the bearer: “study subject”
•  Mapped the topic and SOI concepts to UMLS Metathesaurus

Variable ID | Variable Name | Variable Description
Phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?

Phenotype Variable Standardization

Variable Descriptions
•  135,608 variables
•  Example: “77 age mom diagnosed – stroke (tia)”

Normalization
•  Spell out abbreviations and shorthand expressions
•  Drop question numbers and other unimportant characters
•  Output: “age mother diagnosed stroke (tia)”

MetaMap Processing
•  Generate CUIs, concept names, semantic types
•  Output: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]

Semantic Role Assignment
•  Semantic types and keyword-based role identification
•  Evaluation from random sample of 500: 73% accuracy
•  Output: C0001779: age, C0038454: Stroke – topic; C0026591: Mother – subject of information

Variable Categorization
•  Semantic types and keyword-based categorization
•  Evaluation from random sample of 500: 71% accuracy
•  Output: family history, demographics
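The normalization and semantic role assignment steps can be sketched in a few lines of Python. This is a minimal illustration, not the PhenDisco code. The abbreviation table and the rule that the "family group" semantic type marks the subject of information are assumptions made for this example. MetaMap is not called here; its output for the slide's example is copied in by hand.

```python
import re

# Hypothetical abbreviation table; the lexicon PhenDisco uses is not shown in the slides.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}

# Hypothetical rule: concepts with these semantic types are subjects of information.
SUBJECT_SEMANTIC_TYPES = {"family group"}


def normalize(description):
    """Drop a leading question number and free-standing dashes, then expand abbreviations."""
    text = description.lower()
    text = re.sub(r"^\s*\d+[.)]?\s+", "", text)  # leading question number
    text = re.sub(r"\s[–-]\s", " ", text)         # dashes surrounded by spaces
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


def assign_roles(concepts):
    """Split (cui, name, semantic_type) triples into topics and subjects of information."""
    roles = {"topic": [], "subject_of_information": []}
    for cui, name, semantic_type in concepts:
        is_subject = semantic_type in SUBJECT_SEMANTIC_TYPES
        key = "subject_of_information" if is_subject else "topic"
        roles[key].append((cui, name))
    return roles


# MetaMap output for the slide's example, entered by hand.
concepts = [
    ("C0001779", "age", "organism attribute"),
    ("C0026591", "Mother", "family group"),
    ("C0038454", "Stroke", "disease or syndrome"),
]

normalized = normalize("77 age mom diagnosed – stroke (tia)")
roles = assign_roles(concepts)
```

With these assumed rules, the example description normalizes to the string shown on the slide, and "Mother" is the only concept assigned the subject-of-information role.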

Category Examples

Variable Description | Topics | Subject of Information | Variable Categories
Gender of the participant | gender | study subject | Demographics
Last known smoking status | smoking | study subject | Smoking History
Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History; Healthcare Activity Finding
Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics; Lab Tests
AGE of living mother | age | mother | Demographics - Family
Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics; Medical History
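A keyword and semantic-type categorizer of the kind described above might look like the following sketch. The rule table is hypothetical and covers only three of the categories in the examples. The slides do not list PhenDisco's actual rules.

```python
import re

# Hypothetical rules: a category applies when the description contains one of its
# keywords or when one of the supplied semantic types is listed for it.
CATEGORY_RULES = {
    "Demographics": {
        "keywords": {"age", "gender"},
        "semantic_types": {"organism attribute"},
    },
    "Smoking History": {
        "keywords": {"smoking", "cigarettes"},
        "semantic_types": set(),
    },
    "Lab Tests": {
        "keywords": {"measurement"},
        "semantic_types": {"laboratory procedure"},
    },
}


def categorize(description, semantic_types=()):
    """Return every category whose keyword or semantic-type rule matches."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    types = {semantic_type.lower() for semantic_type in semantic_types}
    return [
        category
        for category, rule in CATEGORY_RULES.items()
        if words & rule["keywords"] or types & rule["semantic_types"]
    ]
```

A variable can receive more than one category, as in the "uric acid measurement" example. Rules this simple misfire easily, which is consistent with the reported accuracy being well below 100%.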

Phenotype Variable Standardization

The four steps above (Normalization, MetaMap Processing, Semantic Role Assignment, Variable Categorization) are followed by a fifth, still in progress:

Identification of Similar Variables (in progress)
•  Same CUI, similar keywords, and same category
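The three criteria for similar variables (same CUI, similar keywords, same category) can be expressed as a simple predicate. The sketch below is an assumption-laden illustration. Measuring keyword similarity by Jaccard overlap, the 0.5 threshold, and the `Variable` record are all hypothetical choices, since the slides mark this step as in progress and give no details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical record for a standardized variable."""
    variable_id: str
    cuis: frozenset
    keywords: frozenset
    category: str


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def are_similar(x, y, keyword_threshold=0.5):
    """Same CUI set, same category, and keyword overlap at or above the threshold."""
    return (
        x.cuis == y.cuis
        and x.category == y.category
        and jaccard(x.keywords, y.keywords) >= keyword_threshold
    )


# Illustrative variables with made-up identifiers.
var_a = Variable("var_a", frozenset({"C0001779", "C0038454"}),
                 frozenset({"age", "mother", "stroke"}), "Demographics - Family")
var_b = Variable("var_b", frozenset({"C0001779", "C0038454"}),
                 frozenset({"age", "mother", "stroke", "diagnosed"}), "Demographics - Family")
var_c = Variable("var_c", frozenset({"C0001779", "C0038454"}),
                 frozenset({"age", "stroke"}), "Medical History")
```

Here `var_a` and `var_b` count as similar, while `var_c` does not because its category differs.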

Study Level Metadata Annotation

•  Manual annotation of 422 studies (as of 07/31/13)
•  Metadata items generated
   –  Disease topics (encoded with UMLS)
   –  Geographical information (encoded with ISO 3166-2 subdivision code: state and country)
   –  IRB approval (required or not)
   –  Consent type (not restricted, restricted, unspecified)
   –  Sample demographics (race and/or ethnicity, gender, age)
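One way to picture the annotated metadata is as a small record per study that a search can filter on. The following sketch is hypothetical. The field names, the `StudyMetadata` class, and the example study are invented for illustration and do not reflect PhenDisco's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class ConsentType(Enum):
    NOT_RESTRICTED = "not restricted"
    RESTRICTED = "restricted"
    UNSPECIFIED = "unspecified"


@dataclass
class StudyMetadata:
    """Hypothetical container for the study-level items listed above."""
    study_id: str
    disease_cuis: list            # UMLS concept identifiers
    locations: list               # ISO 3166-2 subdivision codes, e.g. "US-CA"
    irb_required: bool
    consent_type: ConsentType = ConsentType.UNSPECIFIED
    demographics: dict = field(default_factory=dict)


def matches(study, *, irb_required=None, gender=None):
    """Apply optional metadata filters; a filter left as None is ignored."""
    if irb_required is not None and study.irb_required != irb_required:
        return False
    if gender is not None and gender not in study.demographics.get("gender", []):
        return False
    return True


# Made-up study record for illustration only.
example = StudyMetadata(
    study_id="example-study",
    disease_cuis=["C0038454"],
    locations=["US-CA"],
    irb_required=False,
    consent_type=ConsentType.RESTRICTED,
    demographics={"gender": ["female", "male"]},
)
```

Filters such as `matches(example, irb_required=False, gender="female")` correspond to the bracketed metadata conditions used in the advanced search evaluation.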

System Development: Integration

PhenDisco: Put-it-all-together

[Architecture diagram; component labels: Free text, Query parser, NLP tools + MetaMap, Information Model Mapping, dbGaP, sdGaP, Relevant studies, BM25 ranking algorithm, Ranked studies]

System Development: Query Parser

Contextual Query Language

•  Query types:
   –  Simple queries: keywords, phrases
   –  Boolean logic: AND, OR, NOT
   –  Can process index values, e.g., age > 40
•  Build a language guideline:
   –  BNF form

BNF form

cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
identifier ::= charString1 | charString2
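To make the grammar concrete, here is a small hand-written recursive-descent parser for a subset of it: `scopedClause`, `searchClause`, the boolean operators, and the comparison symbols. It is only a sketch. PhenDisco's parser is built with pyparsing, whereas this example uses plain Python. Prefix assignments, modifiers, and `sortby` are left out, phrases are assumed to use straight double quotes, and the tuple-based tree format is invented for illustration.

```python
import re

TOKEN = re.compile(r'"[^"]*"|[()]|>=|<=|<>|==|[=<>]|[^\s()=<>"]+')
BOOLEANS = {"and", "or", "not", "prox"}
COMPARATORS = {"=", ">", "<", ">=", "<=", "<>", "=="}


def parse(query):
    """Parse a query string into a nested tuple tree."""
    tree, rest = _scoped_clause(TOKEN.findall(query))
    if rest:
        raise ValueError(f"unexpected token: {rest[0]}")
    return tree


def _scoped_clause(tokens):
    # scopedClause ::= scopedClause booleanGroup searchClause | searchClause
    left, tokens = _search_clause(tokens)
    while tokens and tokens[0].lower() in BOOLEANS:
        operator = tokens[0].lower()
        right, tokens = _search_clause(tokens[1:])
        left = (operator, left, right)  # left-associative, as in the grammar
    return left, tokens


def _search_clause(tokens):
    # searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
    if not tokens:
        raise ValueError("unexpected end of query")
    head = tokens[0]
    if head == ")" or head in COMPARATORS:
        raise ValueError(f"unexpected token: {head}")
    if head == "(":
        tree, rest = _scoped_clause(tokens[1:])
        if not rest or rest[0] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, rest[1:]
    if len(tokens) >= 3 and tokens[1] in COMPARATORS:
        return ("compare", head, tokens[1], tokens[2]), tokens[3:]
    return ("term", head.strip('"')), tokens[1:]
```

For example, `parse('COPD AND (age >= 40 OR smoking)')` yields an `and` node whose right child is an `or` node containing a comparison and a plain term.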

System Development: Study Ranking

BM25 ranking algorithm

•  N: total number of studies
•  n_t: number of studies that contain the term t
•  c: field in study d
•  w_c: boost factor for each field c
•  tf: term frequency
•  idf: inverse document frequency
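The symbols above fit a field-weighted (BM25F-style) score. The slide's formula did not survive extraction, so the sketch below is a common textbook formulation rather than PhenDisco's exact one. The parameters `K1` and `B`, the smoothed idf, and the dictionary-of-token-lists study format are assumptions.

```python
import math

K1, B = 1.2, 0.75  # common BM25 defaults, assumed here; the slide does not give them


def idf(term, studies):
    """Smoothed inverse document frequency from N and n_t."""
    n = len(studies)
    n_t = sum(1 for study in studies
              if any(term in tokens for tokens in study.values()))
    return math.log(1 + (n - n_t + 0.5) / (n_t + 0.5))


def bm25_score(query_terms, study, studies, boosts):
    """Score one study; `study` maps each field c to its list of tokens."""
    avg_len = {c: sum(len(s.get(c, [])) for s in studies) / len(studies)
               for c in boosts}
    score = 0.0
    for term in query_terms:
        weighted_tf = 0.0
        for c, w_c in boosts.items():
            tokens = study.get(c, [])
            tf = tokens.count(term)
            if tf:
                # Length-normalize tf within the field, then apply the boost w_c.
                norm = 1 - B + B * len(tokens) / avg_len[c]
                weighted_tf += w_c * tf / norm
        score += idf(term, studies) * weighted_tf / (K1 + weighted_tf)
    return score


# Toy collection for illustration only.
studies = [
    {"title": ["copd", "cohort"], "description": ["lung", "function", "copd"]},
    {"title": ["breast", "cancer"], "description": ["breast", "density", "study"]},
    {"title": ["asthma"], "description": ["copd", "mentioned", "once", "in", "long", "text"]},
]
boosts = {"title": 2.0, "description": 1.0}
scores = [bm25_score(["copd"], study, studies, boosts) for study in studies]
```

In this toy collection the study with "copd" in its boosted title ranks first, the one that mentions it once in a long description ranks second, and the study without the term scores zero.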

Technical Infrastructure

•  URL: http://pfindr-data.ucsd.edu/_PhDVer1/
•  Linux machine: Ubuntu, 64-bit
•  Memory: 32 GB RAM
•  Database: MySQL 14.14
•  Web server: Apache 2.2.20
•  Programming languages: PHP, Python, JavaScript
•  Python toolkits: pyparsing, Whoosh

System Demonstration

System Evaluation

•  Search Accuracy
•  User Interface

Evaluation on Basic Search (as of July 7, 2013)

Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%

Average F-measure: dbGaP 0.64, PhenDisco 0.89

Evaluation on Advanced Search (as of July 7, 2013)

Advanced Search in PhenDisco | Recall | Precision
“macular degeneration” AND white AND [whole genome genotyping] | 100% | 66.67%
“breast cancer” AND “breast density” AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
cardiomyopathy AND [copy number variant analysis] | 100% | 100%
Average | 100% | 91.67%

Average F-measure: 0.96
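The reported average F-measures are consistent with applying the usual harmonic-mean formula to the averaged precision and recall, rather than averaging per-query F-measures. The short check below uses the per-query percentages from the basic and advanced search evaluations. The function and variable names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_fraction(percentages):
    """Average a list of percentages and return the result as a fraction."""
    return sum(percentages) / len(percentages) / 100


# Per-query values copied from the evaluation tables (percent).
dbgap_f = f_measure(
    mean_fraction([41.67, 42.86, 66.67, 46.88, 35.00]),   # precision
    mean_fraction([100, 100, 100, 100, 100]),             # recall
)
phendisco_f = f_measure(
    mean_fraction([100, 85.71, 100, 92.86, 100]),
    mean_fraction([80.00, 100, 50.00, 86.67, 100]),
)
advanced_f = f_measure(
    mean_fraction([66.67, 100, 100, 100]),
    mean_fraction([100, 100, 100, 100]),
)
```

Rounded to two decimals, these give 0.64 for dbGaP basic search, 0.89 for PhenDisco basic search, and 0.96 for PhenDisco advanced search, matching the slides.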

Feedback on the User Interface (N=6)

Trainees

•  Post-doctoral trainees
   –  Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization, Evaluation)
   –  Mindy Ross, MD, MBA (Study Abstraction, Ontology Building)
   –  Neda Alipanah, PhD (Ontology Building)
   –  Xiaoqian Jiang, PhD (Ranking Algorithm)
   –  Mike Conway, PhD (Study Abstraction)
•  Undergraduate trainees
   –  Alexander Hsieh (Standardization)
   –  Vinay Venkatesh (System Development)
   –  Rafael Talavera (Evaluation)
   –  Karen Truong (Study Abstraction)
   –  Asher Garland (System Development)

Acknowledgements

•  Lucila Ohno-Machado (PI)
•  Collaborator
   –  Hua Xu
•  Other contributions
   –  Jihoon Kim
   –  Wendy Chapman
   –  Melissa Tharp
•  Staff
   –  Stephanie Feudjio Feupe, MS
   –  Seena Farzaneh, MS
   –  Rebecca Walker, BS
•  Funding: UH2HL108785 from NHLBI, NIH

Questions?

Project Homepage: http://pfindr.net

PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1/index.php

Contact: [email protected]

[email protected] [email protected]