Gene Analytics: Discovery and Contextualization of Enriched Gene Groups

N. Lavrač, I. Mozetič, V. Podpečan, P. Kralj Novak (Jožef Stefan Institute, Ljubljana)

H. Motaln, M. Petek, K. Gruden (National Institute of Biology, Ljubljana)

http://www.bisonet.eu BISON Bled WK, Aug. 2009

Talk outline

• Relational data mining and subgroup discovery
• Semantic data mining: Using ontologies in SEGS
• BISON challenge
• Experimental use case: Glioma cancer treatment
• BISON methodology: combining SEGS+Biomine
• Gene analytics services and future work

Data Mining

[Diagram: data → Data Mining → model, patterns, …]

knowledge discovery from data

Given: transaction data table, relational database, text documents, Web pages

Find: a classification model, a set of interesting patterns

Person   Age          Spect. presc.  Astigm.  Tear prod.  Lenses
O1       young        myope          no       reduced     NONE
O2       young        myope          no       normal      SOFT
O3       young        myope          yes      reduced     NONE
O4       young        myope          yes      normal      HARD
O5       young        hypermetrope   no       reduced     NONE
O6-O13   ...          ...            ...      ...         ...
O14      pre-presbyo  hypermetrope   no       normal      SOFT
O15      pre-presbyo  hypermetrope   yes      reduced     NONE
O16      pre-presbyo  hypermetrope   yes      normal      NONE
O17      presbyopic   myope          no       reduced     NONE
O18      presbyopic   myope          no       normal      NONE
O19-O23  ...          ...            ...      ...         ...
O24      presbyopic   hypermetrope   yes      normal      NONE

Subgroup discovery task definition

(Kloesgen, Wrobel 1997)

– Given: a population of individuals and a property of interest (e.g. AML class, in the task of finding genes differentially expressed in AML leukemia as opposed to ALL leukemia)

– Find: 'most interesting' descriptions of population subgroups that
  • are as large as possible (high target class coverage)
  • have the most unusual distribution of the target property (high TP/FP ratio, high significance)
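The two criteria above can be made concrete with a small sketch. It uses only the rows that are listed explicitly in the lens table of the Data Mining slide (the elided rows O6-O13 and O19-O23 are left out), and the candidate subgroup "Tear prod. = reduced" is chosen purely for illustration. The function name `subgroup_stats` is an assumption, not part of any tool mentioned in the talk.

```python
# Rows listed explicitly in the lens table; elided rows are intentionally omitted.
# Columns: person, age, spect_presc, astigm, tear_prod, lenses
ROWS = [
    ("O1", "young", "myope", "no", "reduced", "NONE"),
    ("O2", "young", "myope", "no", "normal", "SOFT"),
    ("O3", "young", "myope", "yes", "reduced", "NONE"),
    ("O4", "young", "myope", "yes", "normal", "HARD"),
    ("O5", "young", "hypermetrope", "no", "reduced", "NONE"),
    ("O14", "pre-presbyo", "hypermetrope", "no", "normal", "SOFT"),
    ("O15", "pre-presbyo", "hypermetrope", "yes", "reduced", "NONE"),
    ("O16", "pre-presbyo", "hypermetrope", "yes", "normal", "NONE"),
    ("O17", "presbyopic", "myope", "no", "reduced", "NONE"),
    ("O18", "presbyopic", "myope", "no", "normal", "NONE"),
    ("O24", "presbyopic", "hypermetrope", "yes", "normal", "NONE"),
]


def subgroup_stats(rows, condition, target):
    """Return (TP, FP, target class coverage) for a subgroup description.

    condition: function row -> bool describing the subgroup
    target:    value of the property of interest (last column)
    """
    covered = [r for r in rows if condition(r)]
    tp = sum(1 for r in covered if r[5] == target)   # covered, target class
    fp = len(covered) - tp                           # covered, other classes
    positives = sum(1 for r in rows if r[5] == target)
    coverage = tp / positives if positives else 0.0
    return tp, fp, coverage


# Hypothetical subgroup: "Tear prod. = reduced", target class Lenses = NONE
tp, fp, coverage = subgroup_stats(ROWS, lambda r: r[4] == "reduced", "NONE")
print(tp, fp, coverage)
```

A subgroup with many true positives and few false positives scores well on both criteria; a real subgroup discovery system would search over many candidate descriptions and also test significance.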

Sample microarray analysis tasks

• Two-class diagnosis problem of distinguishing between acute lymphoblastic leukemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes.

• Multi-class cancer diagnosis problem with 14 different cancer types, in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes.

• SD results in simple IF-THEN rules, interpretable by biologists

IF (KIAA0128_gene DIFF-EXPRESSED) AND (prostaglandin_d2_synthase_gene NOT-DIFF-EXP) THEN Leukemia

Relational Data Mining (Inductive Logic Programming)

[Diagram: data → Relational Data Mining → model, patterns, …]

knowledge discovery from data

Given: a relational database, a set of tables, sets of logical facts, a graph, …

Find: a classification model, a set of interesting patterns

Relational Data Mining (ILP)

• Learning from multiple tables
• Complex relational problems:
  – structured data: representation of molecules and their properties in protein engineering, biochemistry, ...
• Semantic relational data mining
  – Using domain ontologies as background knowledge for relational data mining

Gene Ontology (GO)

• GO is a database of terms for genes:
  – Function - What does the gene product do?
  – Process - Why does it perform these activities?
  – Component - Where does it act?
• Known genes are annotated to GO terms (www.ncbi.nlm.nih.gov)
• Terms are connected as a directed acyclic graph (is_a, part_of)
• Levels represent specificity of the terms

GO term counts: 12093 biological process, 1812 cellular component, 7459 molecular function

Ontology encoded as relational background knowledge

Prolog facts: predicate(geneID, CONSTANT).

interaction(geneID, geneID).

component(2532,'GO:0016020').

component(2532,'GO:0005886').

component(2534,'GO:0008372').

function(2534,'GO:0030554').

function(2534,'GO:0005524').

process(2534,'GO:0007243').

interaction(2534,5155).

interaction(2534,4803).

Basic, plus generalized background knowledge using GO

zinc ion binding -> metal ion binding, ion binding, binding
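The generalization step can be sketched as a walk up the ontology: every annotation of a gene is extended with all of its more general terms. The sketch below uses only the chain given on the slide as a toy hierarchy. The gene identifier `geneX` and the function names are hypothetical, and a real GO hierarchy is a directed acyclic graph in which a term may have several parents.

```python
# Toy parent map built from the slide's example chain (child -> more general terms).
PARENTS = {
    "zinc ion binding": ["metal ion binding"],
    "metal ion binding": ["ion binding"],
    "ion binding": ["binding"],
    "binding": [],
}


def ancestors(term, parents):
    """All more general terms reachable from `term`, nearest first."""
    seen, order = set(), []
    queue = list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # a DAG can reach the same ancestor along several paths
        seen.add(current)
        order.append(current)
        queue.extend(parents.get(current, []))
    return order


def generalize(annotations, parents):
    """Extend each gene's basic annotations with all ancestor terms."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in annotations.items()
    }


basic = {"geneX": {"zinc ion binding"}}   # hypothetical gene
generalized = generalize(basic, PARENTS)
print(ancestors("zinc ion binding", PARENTS))
print(sorted(generalized["geneX"]))
```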

Multi-Relational representation

[Diagram: GENE (main table, class labels) is linked to the GENE-GENE INTERACTION table and, through the GENE-FUNCTION, GENE-PROCESS and GENE-COMPONENT tables, to the FUNCTION, PROCESS and COMPONENT tables; each of these three has is_a and part_of relations]

Propositionalization in RDM

f(7,A):-function(A,'GO:0046872').
f(8,A):-function(A,'GO:0004871').
f(11,A):-process(A,'GO:0007165').
f(14,A):-process(A,'GO:0044267').
f(15,A):-process(A,'GO:0050874').
f(20,A):-function(A,'GO:0004871'), process(A,'GO:0050874').
f(26,A):-component(A,'GO:0016021').
f(29,A):-function(A,'GO:0046872'), component(A,'GO:0016020').
f(122,A):-interaction(A,B), function(B,'GO:0004872').
f(223,A):-interaction(A,B), function(B,'GO:0004871'), process(B,'GO:0009613').
f(224,A):-interaction(A,B), function(B,'GO:0016787'), component(B,'GO:0043231').

Propositionalization through first-order feature construction (KARDIO 1994, LINUS 1991, RSD 2006)

Novelty of SEGS (2008): Feature construction from ontology information, features with support > min_support

(In the interaction-based features, the variable B is existential.)
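A minimal sketch of this kind of feature construction is given below, assuming a feature is a conjunction of annotation literals that is tested either on the gene itself or on some interaction partner B. The annotation and interaction facts are the ones shown on the background knowledge slide, except for one annotation of gene 5155 that is added here as a hypothetical fact so that the interaction feature has something to match. The feature names and the function `frequent_features` are illustrative assumptions, not the SEGS implementation.

```python
# (predicate, geneID, GO term) facts from the background knowledge slide
ANNOT = {
    ("component", 2532, "GO:0016020"),
    ("component", 2532, "GO:0005886"),
    ("component", 2534, "GO:0008372"),
    ("function", 2534, "GO:0030554"),
    ("function", 2534, "GO:0005524"),
    ("process", 2534, "GO:0007243"),
    ("function", 5155, "GO:0004872"),  # hypothetical fact, not from the slides
}
INTERACTS = {(2534, 5155), (2534, 4803)}


def holds_direct(gene, literals):
    """True if every (predicate, term) literal is a known fact for this gene."""
    return all((pred, gene, term) in ANNOT for pred, term in literals)


def holds(gene, feature):
    """feature = (via_interaction, literals)."""
    via_interaction, literals = feature
    if not via_interaction:
        return holds_direct(gene, literals)
    # Existential variable B: some partner of `gene` must satisfy all literals.
    return any(holds_direct(b, literals) for a, b in INTERACTS if a == gene)


def frequent_features(genes, features, min_support):
    """Keep features covering more than `min_support` genes."""
    kept = {}
    for name, feature in features.items():
        support = sum(holds(g, feature) for g in genes)
        if support > min_support:
            kept[name] = support
    return kept


FEATURES = {
    "fa": (False, [("component", "GO:0016020")]),
    "fb": (False, [("function", "GO:0005524"), ("process", "GO:0007243")]),
    "fc": (True, [("function", "GO:0004872")]),
    "fd": (False, [("process", "GO:0009613")]),   # matches no gene here
}
GENES = [2532, 2534, 5155, 4803]
kept = frequent_features(GENES, FEATURES, min_support=0)
print(kept)
```

Pruning by support keeps the number of candidate features manageable before subgroup discovery is run on the resulting table.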

Gene set enrichment analysis with SEGS

• A gene set is enriched if the genes that are members of that gene set are statistically significantly differentially expressed compared to the rest of the genes.

• New gene set enrichment method: SEGS - Searching for Enriched Gene Sets (JSI) (Trajkovski et al. JBI 2008)

• SEGS approach: Using GO, KEGG and ENTREZ ontologies as background knowledge for semantic subgroup discovery

Ontologies

• Gene Ontology (GO): standardized biological terms used to annotate gene products
  – Molecular Function
  – Biological Process
  – Cellular Component

• Kyoto Encyclopedia of Genes and Genomes (KEGG): manually drawn pathway maps representing the knowledge on the molecular interaction and reaction networks

• ENTREZ: gene annotations with GO and KO terms and gene-gene interaction data

Identifying differentially expressed genes in data preprocessing

[Figure: gene expression matrix with genes as rows (Gene i) and samples as columns (Sample j)]

To identify genes that display a large difference in gene expression between groups (class A and class B) and are homogeneous within groups, statistical tests (e.g. t-test) and p-values (e.g. permutation test) are computed. The two-sample t-statistic is used to test the equality of the group means mA and mB.
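As an illustration of this preprocessing step, the sketch below computes a two-sample t-statistic for one gene and estimates a p-value by permuting the class labels. The slides do not say which variant of the t-statistic is used, so the unequal-variance (Welch) form is an assumption here, as are the expression values and the function names.

```python
import math
import random


def t_statistic(a, b):
    """Two-sample t-statistic for the group means mA and mB (Welch form, assumed)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0.0:
        # Degenerate case: no variation within either group.
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / se


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided p-value: share of label permutations with |t| at least as large."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = abs(t_statistic(pooled[:len(a)], pooled[len(a):]))
        if t >= observed - 1e-12:   # tolerance for floating-point rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values of one gene in class A and class B samples
class_a = [5.1, 4.9, 5.4, 5.0, 5.3]
class_b = [2.1, 2.4, 1.9, 2.2, 2.0]
t = t_statistic(class_a, class_b)
p = permutation_p_value(class_a, class_b)
print(t, p)
```

Repeating this for every gene gives the scores by which the genes can be ranked.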

Ranking of differentially expressed genes

The genes can be ordered in a ranked list L, according to their differential expression between the classes.

The challenge is to extract meaning from this list, that is, to describe the genes it contains.

The terms of the Gene Ontology were used as a vocabulary for the description of the genes.

Gene expression data: Positive and negative examples for data mining

fact(class, geneID, weight).

fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
…
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395, 1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
…

Ontology encoded as relational background knowledge + gene expression data

The GO-based background knowledge (Prolog facts predicate(geneID, CONSTANT) and interaction(geneID, geneID), basic plus generalized, e.g. zinc ion binding -> metal ion binding, ion binding, binding) is combined with the gene expression facts fact(class, geneID, weight), both as listed on the preceding slides.

Ontology encoded as relational features + gene expression data

The constructed relational features (f(7,A) through f(224,A), as listed under Propositionalization in RDM) are combined with the gene expression facts fact(class, geneID, weight).

Propositionalization

     f1 f2 f3 f4 f5 f6  …              fn

g1    1  0  0  1  1  1  0  0  1  0  1  1
g2    0  1  1  0  1  1  0  0  0  1  1  0
g3    0  1  1  1  0  0  1  1  0  0  0  1
g4    1  1  1  0  1  1  0  0  1  1  1  0
g5    1  1  1  0  0  1  0  1  1  0  1  0

g1    0  0  1  1  0  0  0  1  0  0  0  1
g2    1  1  0  0  1  1  0  1  0  1  1  1
g3    0  0  0  0  1  0  0  1  1  1  0  0
g4    1  0  1  1  1  0  1  0  0  1  0  1

Propositional subgroup discovery

In the propositionalized table of the preceding slide, the subgroup description "f2 and f3" has coverage [4,0]: it covers four genes of the first group (g2, g3, g4, g5) and none of the genes of the second group.
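The [4,0] coverage can be checked directly against the propositionalized table. The sketch below copies the two groups of rows from the table and counts, for a conjunction of features, how many genes of each group it covers. The names `GROUP_1`, `GROUP_2` and `coverage` are illustrative; features are addressed by column position, so f2 and f3 are columns 1 and 2 when counting from zero.

```python
# Rows of the propositionalized table: first group (g1-g5), second group (g1-g4).
GROUP_1 = {
    "g1": [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1],
    "g2": [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    "g3": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1],
    "g4": [1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
    "g5": [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
}
GROUP_2 = {
    "g1": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1],
    "g2": [1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1],
    "g3": [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
    "g4": [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1],
}


def coverage(columns, group):
    """Number of genes in `group` for which all listed feature columns are 1."""
    return sum(all(row[c] == 1 for c in columns) for row in group.values())


F2_AND_F3 = [1, 2]   # zero-based column positions of f2 and f3
result = [coverage(F2_AND_F3, GROUP_1), coverage(F2_AND_F3, GROUP_2)]
print(result)
```

A propositional subgroup discovery algorithm would evaluate many such conjunctions and keep those that cover many genes of one group and few of the other.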

Summary: SEGS Method and Results

• SEGS method:
  – Through semantic subgroup discovery, SEGS generates candidate gene set descriptions as conjunctions of first-order features, combining individual GO, KEGG and ENTREZ terms
  – SEGS combines Fisher, GSEA and PAGE enrichment tests to select the most interesting groups of differentially expressed genes
• SEGS results:
  – Descriptions of subgroups of genes that are differentially expressed (e.g., belong to class DIFF-EXP of top 300 most differentially expressed genes) in contrast with RANDOM genes (randomly selected genes with low differential expression).
• Sample subgroup description:
  diffexp(A) :- interaction(A,B) & function(B,'GO:0004871') & process(B,'GO:0009613')
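Of the three enrichment tests named above, Fisher's exact test is the simplest to sketch. The example below computes the one-sided p-value as a hypergeometric tail. It is not the SEGS implementation: how SEGS combines the Fisher, GSEA and PAGE scores is not described in these slides, and the function name and example counts are assumptions.

```python
from math import comb


def fisher_enrichment_p(total, diffexp, in_set, overlap):
    """One-sided Fisher exact test for over-representation.

    total:   number of genes considered
    diffexp: number of differentially expressed genes among them
    in_set:  number of genes covered by the gene set description
    overlap: number of covered genes that are differentially expressed

    Returns P(X >= overlap) for X ~ Hypergeometric(total, diffexp, in_set).
    """
    denominator = comb(total, in_set)
    upper = min(in_set, diffexp)
    tail = sum(
        comb(diffexp, k) * comb(total - diffexp, in_set - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denominator


# Toy counts: 9 genes, 5 differentially expressed, and a gene set that
# covers 4 genes, all of them differentially expressed.
p_value = fisher_enrichment_p(total=9, diffexp=5, in_set=4, overlap=4)
print(p_value)
```

With realistic gene counts such p-values are computed for very many candidate gene sets, so a correction for multiple testing would normally be needed as well.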

SEGS implementation

[Screenshots: Query and Results]

BISON project

• The challenge: Support humans to find new, interesting links across domains, named bisociations
  – across different contexts
  – across different types of data and knowledge sources
• Open problems:
  – Fusion of heterogeneous data/knowledge sources into a joint representation format: a large information network named BisoNet (consisting of nodes and relationships between nodes)
  – Finding unexpected, previously unknown links between BisoNet nodes belonging to different contexts

Heterogeneous data sources (BISON, M. Berthold, 2008)

Bridging concepts (BISON, M. Berthold, 2008)

Use Case: Glioma Cancer (investigated at NIB)

• Glioma
  – a type of brain cancer
  – different types
  – Glioblastoma: life expectancy less than 1 year
• Glioma treatment
  – No efficient treatment available
  – Testing new hypotheses for treatment: using stem cells for drug transport to the brain?
  – New insights in stem cell behavior and brain cancer mechanisms?

Glioma treatment

• Biological questions:
  – Are stem cells efficient/effective for drug transport?
  – What are the risks associated?
• Regarding risks: evaluation of BM-hMSC stem cell stability
  – Biological experiments: 4 stem cell lines
  – RNA was isolated and sent for transcriptome analysis
  – hMSC growth curves reveal "fast" and "slow" growing clones
    – Slow: hMSC-1, hMSC-3 (5-6 passages/6 weeks)
    – Fast: hMSC-2, hMSC-4 (8 passages/6 weeks)
  – Risk of malignant transformation:
    – Two lines (hMSC-1 and hMSC-2) transformed into cancer cells
  – Microarray analysis is performed to find groups of differentially expressed genes in several experiments:
    – slow vs. fast growing cell lines
    – normal vs. cancerous cell lines

SEGS+Biomine Methodology

[Diagram: Microarray (e.g. slow-vs-fast cell growth), Gene sets, Contextualization, Exploratory link discovery]

Biomine

• In Biomine (UH) (DILS 2006), data from numerous public databases are merged into a large graph, currently consisting of 1,968,951 vertices and 7,008,607 edges.
  – Vertices correspond to entities and concepts
  – Edges represent known, annotated relationships between vertices.
  – A link (a relation between two entities) is manifested as a path or a subgraph connecting the corresponding vertices.
  – A bisociative link is a path traversing nodes belonging to different domains/contexts
• In Biomine, a method for link discovery between entities in queries was developed for graph exploration.
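The notions of link and bisociative link can be illustrated on a toy graph. The sketch below finds a shortest path between two query vertices with breadth-first search and checks whether the path traverses more than one context. The vertices, contexts and function names are hypothetical, and breadth-first search is a stand-in chosen for brevity, not the link discovery method used in Biomine.

```python
from collections import deque

# Hypothetical vertices with the context (source domain) each belongs to
CONTEXT = {
    "gene:A": "genes",
    "gene:B": "genes",
    "term:X": "ontology",
    "pathway:P": "pathways",
}
# Hypothetical undirected edges (known relationships between vertices)
EDGES = [("gene:A", "gene:B"), ("gene:B", "term:X"), ("term:X", "pathway:P")]


def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the list of vertices or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


def is_bisociative(path, context):
    """A path is bisociative if its vertices belong to more than one context."""
    return len({context[v] for v in path}) > 1


ADJ = build_adjacency(EDGES)
link = shortest_path(ADJ, "gene:A", "pathway:P")
print(link, is_bisociative(link, CONTEXT))
```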

Biomine Information fusion

• Biomine graph integrates numerous databases

SEGS+Biomine Methodology

• Biomine information fusion into a BisoNet information network
• Interesting node discovery and contextualisation with SEGS
  – Information fusion of GO, KEGG, ENTREZ
  – Identify conjunctions of concepts from different domains (ontologies)
• Interesting cross-context link discovery with Biomine
  – create bisociative links as paths in the Biomine subgraph connecting the concepts proposed by SEGS
• BisoNet Exploration/Explanation
  – explore BisoNet paths ranked according to weights/probabilities (as currently implemented in Biomine)
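One way to picture the ranking of paths by probability is sketched below: each edge carries a probability, a path is scored by the product of its edge probabilities, and the paths between two concepts are sorted by that score. The slides do not specify how Biomine computes or combines edge weights, so the scoring rule, the toy graph and the function names are all assumptions.

```python
# Hypothetical undirected edges with probabilities in (0, 1]
WEIGHTED_EDGES = {
    ("s", "a"): 0.9,
    ("a", "t"): 0.8,
    ("s", "b"): 0.5,
    ("b", "t"): 0.5,
    ("s", "t"): 0.6,
}


def build_weighted_adjacency(edges):
    adjacency = {}
    for (u, v), prob in edges.items():
        adjacency.setdefault(u, {})[v] = prob
        adjacency.setdefault(v, {})[u] = prob
    return adjacency


def simple_paths(adjacency, source, target, max_edges=4):
    """Yield (path, probability) for every simple path with at most max_edges edges."""
    stack = [(source, [source], 1.0)]
    while stack:
        node, path, prob = stack.pop()
        if node == target:
            yield path, prob
            continue
        if len(path) - 1 >= max_edges:
            continue  # path is already as long as allowed
        for neighbour, edge_prob in adjacency.get(node, {}).items():
            if neighbour not in path:   # keep the path simple (no repeated vertex)
                stack.append((neighbour, path + [neighbour], prob * edge_prob))


def ranked_paths(adjacency, source, target, max_edges=4):
    """Paths sorted by decreasing probability (product of edge probabilities)."""
    found = list(simple_paths(adjacency, source, target, max_edges))
    return sorted(found, key=lambda item: item[1], reverse=True)


ADJ_W = build_weighted_adjacency(WEIGHTED_EDGES)
ranking = ranked_paths(ADJ_W, "s", "t")
for path, prob in ranking:
    print(path, round(prob, 3))
```

Enumerating all simple paths is only feasible on small subgraphs; on a graph of Biomine's size the candidate paths would have to be restricted first, for example to a neighbourhood of the query vertices.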

Biomine: Bisociative link discovery

[Screenshots: Query and Result]

SEGS+Biomine: Information fusion

SEGS merges GO, KEGG and ENTREZ; BisoNet is used for concept visualization

SEGS+Biomine: Creative knowledge discovery

Identify interesting concepts (BisoNet nodes) from different contexts (different databases)

SEGS+Biomine: Creative link discovery

Create bisociative cross-context links/paths linking BisoNet concepts from different contexts

SEGS+Biomine: Exploration and explanation

Explore and interpret the most interesting cross-context BisoNet links/paths between concepts

Summary

• SEGS discovers interesting descriptions of differentially expressed gene groups as conjunctions of concepts from different contexts
• Biomine finds cross-context links (paths) between concepts discovered by SEGS
• The SEGS+Biomine approach has the potential for creative knowledge and bisociative link discovery
• Preliminary results in stem cell microarray data analysis (EMBC 2009) indicate that the SEGS+Biomine methodology may lead to new insights. In vitro experiments are being planned at NIB to verify and validate the preliminary insights.