Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal


Discovery from Linking Open Data (LOD) Annotated Datasets

Louiqa Raschid, University of Maryland

PAnG/PSL/ANAPSID/Manjal


Agenda

• Motivation
• Challenges
• Solution approaches


• Emergence of biological datasets in the cloud of Linked Data.
• Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED and the NCI Thesaurus.
• Links form a graph that captures meaningful knowledge.
• Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.


Agenda

• Motivation
– Drug re-purposing
– Cross ontology patterns and literature imprint
– Cross genome analysis
• Challenges
• Solution approaches


Signature: the set of mRNAs whose levels increase or decrease in patients, where the change is significant with respect to the general population.
Compute a similarity score in [-1, +1].
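To make the idea of a signature similarity score in [-1, +1] concrete, here is a minimal sketch. It uses a simplified signed-overlap score as a stand-in; it is not the scoring method used by Sirota et al., and the function name `signature_similarity` and the gene labels are hypothetical.

```python
def signature_similarity(sig_a, sig_b):
    """Signed-overlap similarity in [-1, +1] (illustrative stand-in only).

    Each signature maps an mRNA/gene label to +1 (increased) or -1 (decreased).
    A shared gene contributes +1 if both signatures change in the same
    direction and -1 if they change in opposite directions. The sum is
    normalised by the number of genes that appear in either signature.
    """
    genes = set(sig_a) | set(sig_b)
    if not genes:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    total = sum(sig_a[g] * sig_b[g] for g in shared)
    return total / len(genes)


# Hypothetical signatures: the second reverses every change in the first.
first = {"G1": +1, "G2": +1, "G3": -1}
second = {"G1": -1, "G2": -1, "G3": +1}
score = signature_similarity(first, second)  # -1.0: fully opposite
```

Under this toy score, a value near -1 means the two signatures move in opposite directions and a value near +1 means they move together.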


Of 16,000 drug-disease pairings, 2,664 were significant (q < 0.05); half of these had an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.


Sirota et al. Findings
• Efficacy (literature) for 2 drugs: topiramate and prednisolone.
• Evaluated efficacy of cimetidine (over gefitinib) for lung adenocarcinoma.
• Methodology does not provide avenues for explanation, validation or discovery.


Sirota et al.: Identified an anomaly in this cluster


Limitations and Extensions
• Sirota et al.
– Anomaly in drug cluster, but their methodology does not allow further investigation.
• Sims et al.
– Methodology is limited to co-occurrence analysis.
• Cannot exploit heterogeneous evidence from LOD sources.
• Cannot exploit knowledge in ontologies.
• Finding patterns in graph datasets and visualization and explanation.


Agenda
• Challenges
– Exploiting LOD to create datasets.
– Knowledge captured in ontologies.
– Similarity metrics/distances tuned for ontologies.
– Discovering and validating patterns in graphs.
– Literature imprint.
– Heterogeneous evidence.
– Reasoning with uncertainty.


Solution Approaches
• PAnG
• PSL
• Manjal
• ANAPSID

• Thanks to our collaborators / domain experts:
• Olivier Bodenreider, NLM, NIH
• Sherri de Coronado, NCI, NIH
• Andreas Thor, University of Leipzig
• Louiqa Raschid ++ at UMD
• Lise Getoor ++ at UMD
• Padmini Srinivasan ++ University of Iowa
• Maria Esther Vidal ++ Universidad Simon Bolivar


Solution approaches

Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints
[Diagram labels: The Arabidopsis Information Resource; Gene Ontology; ClinicalTrials]
Patterns in ANnotation Graphs
PSL: Annotation computation by knowledge propagation
PAnG: Pattern identification using dense subgraphs and graph summaries.
Manjal – Text Mining for MEDLINE
Annotation Visualizer – Visualize and explore annotations and patterns


Motivation: Gene Annotation Graphs
• Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms
• Prediction of new annotations as hypotheses for experiments
– Link prediction is predicting new functional annotations for a gene

[Figure label: Annotations]


Link Prediction Framework

• Dense Subgraph (optional)
– Focus on highly connected subgraphs
• Graph summarization
– Identify basic pattern (structure) of the graph
• Link Prediction
– Predicted links reinforce underlying graph pattern

[Figure: link prediction pipeline. Tripartite Annotation Graph (TAG) → Dense Subgraph (optional filter; distance restriction) → Graph Summarization (cost model) → graph summary → Link Prediction (scoring function) → ranked list of predicted links.]
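One way to read "predicted links reinforce underlying graph pattern" is that a summary groups nodes into super nodes joined by super edges, and any node pair covered by a super edge but absent from the graph is a candidate link. The sketch below follows that reading. The function name `predict_links`, the scoring rule (the fraction of a super edge's node pairs already present) and the node labels are assumptions for illustration, not the scoring function used in PAnG.

```python
from itertools import product


def predict_links(edges, super_nodes, super_edges):
    """Rank node pairs that a graph summary implies but the graph lacks.

    edges:        set of (u, v) pairs present in the graph
    super_nodes:  dict mapping a super node name to its set of member nodes
    super_edges:  set of (super_u, super_v) name pairs
    Score (an assumption): share of the super edge's pairs already in the graph.
    """
    ranked = []
    for su, sv in super_edges:
        pairs = set(product(super_nodes[su], super_nodes[sv]))
        if not pairs:
            continue
        score = len(pairs & edges) / len(pairs)
        for pair in pairs - edges:
            ranked.append((pair, score))
    # Highest score first; ties broken by pair for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical gene (g*) to term (t*) annotations and a hand-made summary.
edges = {("g1", "t1"), ("g1", "t2"), ("g2", "t1"), ("g3", "t3")}
super_nodes = {"G": {"g1", "g2"}, "T": {"t1", "t2"},
               "H": {"g3", "g4"}, "U": {"t3"}}
super_edges = {("G", "T"), ("H", "U")}
predictions = predict_links(edges, super_nodes, super_edges)
```

Here the pair missing from the denser super edge ranks ahead of the pair missing from the sparser one.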


Dense Subgraph
• Motivation: a graph area that is rich or dense with annotation is an “interesting region”
• Density of a subgraph = number of induced edges / number of vertices
• Tripartite graph with node set (A, B, C) is converted into bipartite graph with (A, C)
– Weighted edges = number of shared b’s
– Apply technique of [1]
• Distance restriction for DSG possible
– Hierarchically arranged ontology terms
– All node pairs of A and C are within a given distance

[1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010
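The tripartite-to-bipartite conversion and the density measure can be sketched as follows. This only illustrates the preprocessing and the density definition; it does not implement the dense subgraph algorithm of [1]. The function names and node labels are hypothetical, node names in A and C are assumed to be distinct, and summing edge weights for the weighted graph is an assumption.

```python
from collections import defaultdict
from itertools import product


def tripartite_to_bipartite(ab_edges, bc_edges):
    """Collapse a tripartite graph (A, B, C) into a weighted bipartite graph (A, C).

    The weight of (a, c) is the number of b nodes linked to both a and c.
    """
    a_by_b = defaultdict(set)
    c_by_b = defaultdict(set)
    for a, b in ab_edges:
        a_by_b[b].add(a)
    for b, c in bc_edges:
        c_by_b[b].add(c)
    weights = defaultdict(int)
    for b, a_nodes in a_by_b.items():
        for a, c in product(a_nodes, c_by_b.get(b, ())):
            weights[(a, c)] += 1
    return dict(weights)


def density(weights, nodes):
    """Induced edge weight divided by the number of vertices in `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    induced = sum(w for (a, c), w in weights.items()
                  if a in nodes and c in nodes)
    return induced / len(nodes)


# Hypothetical example: A = GO terms, B = genes, C = PO terms.
ab_edges = [("go1", "g1"), ("go1", "g2"), ("go2", "g2")]
bc_edges = [("g1", "po1"), ("g2", "po1"), ("g2", "po2")]
weights = tripartite_to_bipartite(ab_edges, bc_edges)
```

In this toy graph, go1 and po1 share two genes, so the pair {go1, po1} has density 2 / 2 = 1.0, while the full node set has density 5 / 4 = 1.25.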


Graph Summarization
• Minimum description length approach [2]
– Loss-free; employs cost model
• Graph summary = Signature + Corrections
• Signature: graph pattern / structure
– Super nodes = complete partitioning of nodes
– Super edges = edges between super nodes = all edges between nodes of super nodes
• Corrections: edges e between individual nodes
– Additions: e ∈ G but e ∉ signature
– Deletions: e ∉ G but e ∈ signature

[2] Navlakha et al. Graph summarization with bounded error. SIGMOD, 2008
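The relationship "graph summary = signature + corrections" can be illustrated with a small sketch. Given a partition into super nodes and a set of super edges, it derives the additions and deletions and checks that the graph can be rebuilt without loss. The function names, the node labels and the simple cost (super edges plus corrections) are assumptions for illustration; the slide does not spell out the cost model, and the sketch does not search for the best summary.

```python
from itertools import product


def expand(super_nodes, super_edges):
    """All node pairs implied by the signature (every pair under a super edge)."""
    implied = set()
    for su, sv in super_edges:
        implied |= set(product(super_nodes[su], super_nodes[sv]))
    return implied


def corrections(edges, super_nodes, super_edges):
    """Return (additions, deletions) relative to the signature.

    additions: edges in G that the signature does not imply
    deletions: pairs the signature implies that are not edges in G
    """
    implied = expand(super_nodes, super_edges)
    return edges - implied, implied - edges


def reconstruct(super_nodes, super_edges, additions, deletions):
    """Rebuild the original edge set from signature plus corrections."""
    return (expand(super_nodes, super_edges) | additions) - deletions


# Hypothetical gene (g*) to term (t*) graph and a hand-made partition.
edges = {("g1", "t1"), ("g1", "t2"), ("g2", "t1"), ("g3", "t3")}
super_nodes = {"G": {"g1", "g2"}, "T": {"t1", "t2"},
               "G3": {"g3"}, "T3": {"t3"}}
super_edges = {("G", "T")}

additions, deletions = corrections(edges, super_nodes, super_edges)
cost = len(super_edges) + len(additions) + len(deletions)  # assumed cost model
```

In this example one super edge plus two corrections (cost 3) describes a graph with four edges, and reconstruction returns exactly the original edge set.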

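[Figure: example graph linking Plant Ontology terms PO_20030, PO_9006, PO_37 and PO_20038 to genes HY5, PHOT1, CIB5, CRY2, COP1 and CRY1, shown next to its summary, in which the genes are grouped as {HY5, PHOT1} and {CIB5, CRY2, COP1, CRY1}.]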


[Figure: the same example graph and summary, followed by a third view of the graph over the same PO terms and genes; labels: DSG+GS, PSL.]


Distance metrics


Distance metrics


Different retrieved sets of lung cancer related clinical trials

• Identify 100 clinical trials using the search keyword “lung cancer” in CONDITION. Retrieve CT, CONDITION and INTERVENTION. Create a dense subgraph (almost a clique; a highly connected subgraph). Create a graph summary to visualize the output.

• Retrieve 100 trials using “lung carcinoma” in the CONDITION field.

• Retrieve 100 trials using “lung carcinoma” in any field.


Retrieve 100 clinical trials using the search keyword “lung cancer”. Create a dense subgraph (almost a clique; a highly connected subgraph). Create a graph summary to visualize the output.


100 clinical trials using the search keyword “lung carcinoma” for CONDITION.


100 clinical trials using the search keyword “lung carcinoma” for ALL FIELDS.


Questions?

PAnG/PSL/ANAPSID/Manjal