29
A Machine Learning approach to generate Gene Interaction Networks Nicola Lazzarini Dr. Jaume Bacardit Prof. Natalio Krasnogor Supervisors :

A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

A Machine Learning approach to generate Gene Interaction

Networks

Nicola Lazzarini

Dr. Jaume BacarditProf. Natalio Krasnogor

Supervisors:

Page 2: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

• Gene Interaction Networks• Co-Prediction• General Pipeline• Co-Prediction evaluation• Co-Expression comparison• Results• Conclusions

Outline

Page 3: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Gene Interaction Networks

A graph where nodes represents genes and e d g e s i n d i c a t e a relationship among them.

There exist many methods to infer GIN: Correlation based, Bayesian Networks, ODEs, etc.

Page 4: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

ML Approach

Machine Learning approach:Use the dataset to learn a model (e.g classification) from the existing samples, and afterwards infer gene interactions from the structure of the model

ML techniques used to infer GINs:

• Model Trees Inferring gene regression networks with model trees (Nepomuceno-Chamorro et. al)

• Rule Based Systems An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier

systems (Urbanowicz et al.)

• Association Rules Inferring gene-gene associations from Quantitative Association Rules (Martinez-Ballestreros et. al)

Our approach: use a rule based classification algorithm to define GINs

Page 5: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Attribute1 Attribute2 score

Attribute1 Attribute4 score

Attribute1 Attribute5 score

... .... ....

Attribute1 score

Attribute2 score

Attribute3 score

.... ....

Core assumption:Genes present within the same rules, due to their co-operation in classification, might be functionally related

List of nodes:

BioHEL is a rule based classification algorithm:rule1: if AttributeX > value1 AND AttributeY < value2 then Cancerrule2: if AttributeX < value3 AND AttributeZ < value4 then Normal ... ... ... ... ... ...rule n: ... ... ... ... ...

Network Generation

List of edges:

1

Improving the scalability of rule-based evolutionary learning (Bacardit et al.)1:

Page 6: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

What is Co-Prediction

This way of inferring GINs is termed: Co-PredictionInteraction between genes that act together for class prediction

We want to generalise this analysis:• Define a general pipeline to create Co-Prediction networks• Test on a set of data (8 datasets)• Compare with other GINs inferring methods (Co-Expression)

Already used in specific studies:- Reduced Neonatal Mortality in Meishan Piglets: A Role for Hepatic Fatty Acids? (H.P. Fainberg et al.)- Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features (Bacardit et al.)- Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets (Bassell et al.)

Page 7: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

General Pipeline

There are different ways of create the Co-Prediction Network

4 different variants/configurations

2 main switches allow different variants of the inferring process

Feature Selection

Rule Base Network Generation

Permutation Test

2nd Training

Rule Base Network Generation

Co-Prediction Network

Yes

Yes No

Original Dataset

No

Page 8: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Feature Selection

Remove irrelevant features that might penalize the classification

Selected method: SVM-RFE (Recursive Feature Elimination)Gene Selection for Cancer Classification using Support Vector Machines (Guyon et al.)

Removes features based on SVM model

Only 10% of features are selected from the original set

Page 9: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Permutation Test

We want to select o n l y t h e m o s t important nodes

Shuffle

Original Dataset

.... }Learning Phase - Network Construction

100 Random Datasets

Original

Y Score Y Score Z Score

... ...

Perm.1 Perm.2 Perm.99 Perm.100

Statistical Test

Strong Nodes

Y Score Y Score Z Score

... ...

Y Score Y Score Z Score

... ...

Y Score Y Score Z Score

... ...

Y Score Y Score Z Score

... ...

....

Page 10: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Permutation Test

Two ways to use strong nodes:

1) Filter the list of edges: keep only the interactions where at least one node is “strong” (consider their neighbours)

2) Further Feature Selection method: consider the “strong” nodes (and their neighbours) for a second rule based network inferring process

Page 11: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Configurations

Configuration Description

Conf1 Feature Selection + 1 Training Phase

Conf3 Original Dataset + 1 Training Phase

Conf5 Feature Selection + 2 Training Phases

Conf7 Original Dataset + 2 Training Phases

Final 4 Configurations used:

Conf2, Conf4, Conf6, Conf8 discarded by previous tests

Page 12: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Datasets

Name Attributes Samples

Leukemia 7129 72

Lung Harvard 12534 181

Lung Michigan 7129 96

CNS 7129 60

dlbcl 2647 77

GSE2191 12625 54

GSE3726 22283 52

ProstateML 12600 136

Cohort of 8 human cancer related datasets

Page 13: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Topological Analysis

Topology: layout of a network, is the arrangement of its various elements

Five different parameters: • Clustering Coefficients (Triangle & Square)• Density• Diameter• Shortest Path Length

Page 14: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Enrichment Analysis

Biological terms overrepresented in a gene setSelected categories: Gene Ontology (GO), KEGG pathways and PubMed

Enrichment Analysis done using DAVID web toolSystematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources (Huang et al.)

Page 15: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Co-Expression Comparison

Pearson Correlation Coefficient (PCC)

P (X1, ..., Xn)

P (X1, ..., Xn) =n�

i=1

P (Xi = xi�Xj = xj , ..., Xj+p = xj+p)

P (A,B) = P (B�A) ∗ P (A) =P (A�B)∗P (B)

P (G | D) = P (D|G)∗P (G)P (D)

P (G)P (D | G)

P (D)

xi(t) = fi (x1, ..., xN , u, θi)

θi xi(t)t xi(t) =

dxidt (t)

fi θi

rx,y =

�i (xi − x) (yi − y)��

i (xi − x)2 ×��

i (yi − y)2

(x, y) xi yix y

Measures how two genes tend to respond in the same direction across different samples

Ranges from -1 to +1

Page 16: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Co-Expression Comparison

Given a Co-Prediction network with N nodes and E edges we build 2 type of Co-Expression networks:

• SE type: with E edges • SN type: with N nodes

Considering the edges with highest PCC

This is done for every Configuration

Page 17: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

MTERF

PAX5

PRKCI

PRTN3BLVRA

TAL1

ALDH3B1RFC1

CTSA

PGD

FSCN1

ATP6V0C

WT1

C10orf116

SERPINB6

ABAT APLNRCSTA

MGP

RPS26

TSTA3SLC14A2

AKR1C1C21orf2

CD36GYPE RBP1

PDLIM7

CST3GCNT2

ANPEP

HSD17BP1

CAPNS1

SMPDL3BMT1BSTOM

S100A4RPS6

FAM190B

PXN

IL8

TEAD1

HIST1H4H

TMED10

CFD

CLC

SERPINA1

SC4MOL

TPSAB1

ALCAM

USP9XARID3A

SNAPC3AGPAT2

ETS1

SNTB2

TTN

MNX1 TRD@

SLC2A5

MMRN1HIST2H2AA3

FKBP2

LEPROTGJB1

NMB

XKPOLB

DNM1

FPR1SLC18A3

PCBD1 CSF1

MUC5AC

SERPING1

RHD

MMP16

DHPS

IGJ

MGST1 MICA

SHC1PRKAR2B

TMEFF1

APOC1NONO

NFIX

MLLT3

RPA1RPL27A

FCER1A

AES

HAPLN1

ACTN1

MAPK4

GLB1

PKNOX1

KRT5ZYX

PTPN2UOX LOC100131311

SERPINB1

ABCC6

CSTF2

SQSTM1

GSTP1

NTRK2AHSG

RAPGEF3

CG030 ACADVL

PTGDSATP6V1F

GNS

CCDC6

ANKS1A

RIPK1

MYO1F

CDKN3PTH2R

GNAI2

ACY1

URODCENPC1 VSNL1

FAH AGPAT1OMPPTGS2

ALDH1A1MPO

DDOST

CCNH

NME4HOXA9

ASAH1

ELANE

GYPA

DRD3LTC4S

RAB3B

PIGA KYNU

SALL1

RGS1RALB

CTSCIGHM

TUBBFASGTF2E1

ZFP36L1

ZNF33BACOT7TMED9

MGST2

SLC2A4

PNMTGATA2

FOSL1

ADAM17ZNF174

MLC1

BGLAPVWF

ABCC2

FLNCMYO7A

CPA3

PABPC1

GOLIM4

TERT

IFITM2RPLP1

CD151

POLR3D

ZNF136

HOXB6CHRFAM7A

P4HB

HSD11B1

TAGLN2

GRPR

SLC17A2

DNAJB4

GJB2

TPM2RND2

GH1

PML

CGREF1

GSTO1

HEXA

ARF1AMHKAT5

RHAG

STAR

MSH2ADRM1

TXN

FOLH1

KLRC3

ABCA4MGLL

RPL31TALDO1KMO

PIK3R2

FMO1FASN POLR2F

SEPP1

PSMC3IP

MBL2

CA9 PRKCSH

FBLN1

RARA

NR1H2

HYAL3

DLD

NDUFV3SREBF2CFH

RBM10

PROS1

ZNF787

TGFB1FANCA

PPP2R5B

ERG

ITGB1BP1

EN2

PPARA

COL4A5DIRAS1

SLC17A1 KCNJ8

ZNF91

SOD3

HSD17B1

ITKNMU

TIMM8ACYBA AMPH TIAM1ARC

MAD1L1

CCR5

LOC96610

GPX4CD302

CITED2

RPS3 EPHA1

SPTLC2FHL1MSTP9

HINT1

GRIK3

ALOX12MAGEA11

MYL4

EEF1A1PPT1

FMO3

IRF5

CA6

BDNF RPL8

EPHA2

HDCCLCN6RAB2A

RPN1ID3

CSRP1 IFITM1

GAD1

CEBPD

CD37

SLCO2A1OPRK1

COL2A1HMGA1

MPST

SEMA3B

HBB

MBOAT7

MUC1

GM2A

PKN1PEX5GATA1

CTNNB1

ETF1

SRGN

TTC35

PTMA

ACSM3LZTR1

GATAD1

CD8A

SPINK2

SEPT7

HMGN1

PTPRD

CYP2A6

KRT18

MCM2

PRKG1 C12orf32

TPM1PPIF

TCF3

SLC3A2

CDC20CUL5

CD33ITGA2B EPCAM

POU2F2PCTK2

EWSR1

AIF1

ITPK1OXCT1 BMP6

TMSB10

NUP88MRPS12 PLD1

IL18

CRYAA

RAB5B FTL

MFSD10

RHOA

PPP3CC

WNT10B

DTYMK

MATN1

CD63

MAPKAPK3GNB3

PRAME

SEPT6

HFE JUNB

VPREB1

COMP

CDKN2A

SEPT5

RIN2

FTH1

MPL

GALC GPX1

CXCR4

TBXAS1

SCN7A

PRF1

RBM

GGT5

KIAA0195

CD7

TPM4

ZNF135

APOC4

RNF113A

S100A2

APLP2

IL7

CYB5R3

LYL1

ZFP36L2

HOXB2

UROS

OS9

MAP2K5NEK3

MAB21L1

RPN2ENO1

EYA2

UQCRH

NQO1

CTRB1

SYMPKME3

GRIK1

EBP

NCSTN

GYPB

FCGR1A

PPIB

ICAM3BUD31

TNNC2

ITGB2

NPC2CFP

POLD2

AHDC1

NAT1 ID2B

LAMB2 APOA1

ACTA2CTSDLOC728849

RNH1

MST1R

CYP24A1

CAT

POU2AF1

NME3 FPGS

1 Connected Component

Dataset: LeukemiaType: Co-Prediction

Configuration: Conf1Nodes: 421

Edges: 1529

Triangle Coeff: 0.712

Square Coeff: 0.013

Density: 0.017

Diameter: 5

Avg. SPL: 2.284

Topology Analysis

Page 18: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

PMS1 C12orf24

IQGAP2

ADH5

STXBP3

FARSA

UTP14CZNF146

NMI

CLNS1APTTG1IPPTDSS1

GTF2A2

HGF

NSMAF

TAF7

SRP54

ZNHIT3

KTN1

UBE2A

SLC35A1

PAFAH1B1

PSMC6

DPM1

ZCCHC11

PSMA3

TUBGCP3

PPWD1

EIF4A2

VRK1

METAP1

PUM1

U2AF1

SFRS5

PNN

PIK3R1

RBM4

CUL3

SAFB

DDX5

RBM39

C1QBP

GLO1

ATP5A1

SEPT7

SNRPG

SNRPF

YWHAQ

GARS

SNRPE

CCT6A

PPP1CC

FXR1

ACP1

MAPRE2

SUMO1

BTF3

CCNG1

UQCRFS1

NAP1L1

TOMM20

BCLAF1

EIF4H

PSMD6

HNRNPK

LOC728844

PSMA1

RPS7

SNRPD2

UQCRC2

CEP57

ATXN10

PRMT1

TGFBR2

COX6C

SSBP1

GCSH

SRP14

COX7C

SHFM1

ATP5C1

NAE1

EBNA1BP2

SET

EIF2S2

DDX1

CCT5

PKLR

DDX18

H3F3A

ATP5G3

COX6A1

NDUFA4

COX11

COX7A2

NACA

R3HDM1

TCEB1

FEZ2

WAPAL

LOC90925

SUZ12

KLRK1

UBXN4

SSB

TCF3

PARP1

TCP1

TOP2B

PRPF18

NOLC1

TCEA1

UBE4A

CCT7

SFRS3

PPP2R5E

HNRNPC

TMEM131

SKIV2L2ZNF638

CRYZ

RRAGA

NARS

MEF2APRDX6

PSMB7

MOBKL1B PSMD14PTPN11

KPNB1

CNBPTDG

HMGN4GLRX

PSMA4

SMAD1AVPSERPINI1 ZBTB16

H19HIST1H2AE

SPOCK2LIG4 ITGAXPOSTNAPCS CCL23 WARSALAS1 AGPSDMPK CALML3 CYP4A11SLC14A2 FOSBMCAMPTPN5 ZFP36MCM3 CCT8RING1 GNA11 SDCBP ATP6V1B2 NPTX2ATP5J PRKCZCRYMINPP5J P2RX5

HPGDHIST1H2BG

HIST1H4E

ADARB1

HIST1H2AIDPP4

FAM3CPTPRZ1

PTPRM

CYFIP1

CTSD

CYP2B7P1

GBA

PLAU

CD33

GCNT1DYRK1A

ADIPOQ CDC7N4BP2L2CTGF ZNF91MS4A1 CTLA4VLDLR

MRPS27

ARCN1 UBL4A

TRA2B

SNRPD3

MDH1

IMMT

CCT3

POLR2G

SRP9

MRPS22

TSNLRPPRC

SNRPD1

APEX1

PUM2

XRCC5

SP3

TCF12

PRKDC

PSMA5

RBMX

SFRS2

CALM2

CCT4

YWHAERBBP4

MRPL3

ACADM

COPS5

FNTACBFB

UBE2V2PWP1

TP53

PRMT5

PRKAR1AXPO1

ACLY

DCK

ERH

IARS EPRS

DARS

HNRNPF

SPCS2C11orf58

NMT1DYNLL1

MRPS31FAM120A

FLI1

C5orf13

GTF2I

NEDD8

FABP5L7

HNRNPU

BCL2

NCL

IRF2

HDAC2SFRS7

LOC728849

LBR

MX1

IRF7

IGBP1

IFI44L

XAF1

SLC25A5

RBBP7

HPRT1

CSTF3HLA-DMA

HLA-DRB1

HLA-DMB

SETDB1HLA-DPA1

GCM1ISG15 ANGPT1IKBKEIFIT1 ASMTDNA2

RPS29

EEF1A1

RPL37A

MT1P2

MT1A

RPL18A

RPL41

HLA-DRA

HLA-DQB1

CYP2D6

WNT2B

CD74

CTSA

MT2A

ADSS

PARP4TARBP1

SNRNP70 TM9SF2

SPG7

RPLP1

RBM25

DCAF8

CAPRIN1TP53BP1

B3GNTL1 DDOST FOXO1RPN2 CD24 TRH MPO IFI44 FUCA1DAD1 PIK3C3IRF9 PSMA6PPP2R5C PRSS3PRSS1INHBACREMPRG2CLCSSR2SND1KIAA0101 NFKBIACD72 H2AFZ PSMB8PSMB9CSRP2MYL6B MSLN CCR2 SLC6A8 ANXA11 LIMS1AQP1MOBP ITGA8CD83

IL8RBINHA

TAF1BNIP1

SLC30A3

PIP4K2B

IFI35

CSNK2A2

FABP3

CHIC1

B4GALT1

TRD@

NEUROG1

TMEM106A

KRT86MAP2K5

FAS

CLCN5

DAZL

P2RY1ANK3

TGFA CYP2E1

TM4SF4IL5

ARHGEF7

ID4

PDGFRA

GNG11

ITGA6

PMP22

GLDC

RIMS3

GLIPR1MAN1A1

GJA1ACP5

CLTC

KIT

NFE2

GATM

DRG2KCNJ4

ADCYAP1

SMARCD1 FOLR1

CYP2C19

BBS9

INA

KYNU

SMA4

IFNA1

FOXF1TCIRG1

PCSK6

CYP3A4

EGF

SILVBNC1

FOXF2

E2F5

LHFPL2

AKAP12

MPPED2

KDRTNFSF9UGP2

CLK3 GP9

NCOR2

DNASE1

MAP3K10

ASNA1

CD68

MCHR1GASTAPOB

MMP14MED22ADAM11

TBX5

TULP1

SMARCD3

ACOX2 NCOA2GOLGA2 S100A7

ZRSR2

PAPPASRPK3

IGL@

LOC91316

IGHG1

DGCR2HLA-C

INHBB

CD2PRTN3

PTP4A2ELANE

BPI DEFA4

FAT1CEACAM8

FXYD2

SEPW1 RHOHLCK ZAP70ITGA1 STAB1CD3DCHI3L2

TRBV5-4

TCF7 FES

FDPS

MLLT11

CD3G

CD3EPLCB2 TM9SF4

ANXA3

ATP6V1C1CDC25B

LTF

UBA7

MMP8

MMP9

CPT1B

LCN2CAMP

RGL2

ICAM1IL1B

IL6 NINJ1G0S2

CCL4 IFIT2

SERPINB2NIACR2

CCL3

PF4

PDE4C

PPBP

P4HB

PLEK

ADAM8

LMNA

CD63

ANXA1

UPP1

SOD2

LRPAP1PLAUR

ACSL1IL1RN

CXCL3

PTX3THBS1

GCA

BTN2A1

CD1BABI2

CD1E

CEACAM6

CR2MAL

S100P

DEFB1

BPGMGYPEGYPB

ALDH1A1BCL2L1

SEPT5ITGA2B

MARCH8

MYL4

CXCL2

XKSNCA

PPIF

BCL2A1NFKB1

EPB42

RHAG

MFGE8

GYPA

SPTA1

HBA2

BCL3

PRDX2

URODIL8

HBZ

SLC2A1

NAMPT

TFRC BSG

HBQ1

BLVRB

HMBS

COL16A1

FECH

LYZ

S100A12

ALAS2

S100A8

SELENBP1

HBB

CTAG1B

RENBP

MYLPF

SIRPD

SLC18A1

PMS2L11

PDE4A

PCDH1

HERC2P3CRABP2

ALPP

MAP1ACCL2

S100A9

GH2AQP2

LYST

NEFL

NCAPD3

MVK

SP1

CST4

AVPR1B COMTAGER

ST14

HADHB

TEC

PDIA2

CCR7

C4orf8

SPRR1A

SLC10A1

SLC17A2

ZWINTAS

CES2

CSN3

EFNA3

CALCRL

PAX8

IFNA21

DHRS2

FEV

MPP2

BCAT2

SLC2A4 ZNF8

KRT31

ERCC4RNF113A

ATP6V0A1

PTPRU

CDH5

MYH6

HLA-DQA1FANCC

CCL22

NELL2ARHGEF2

TIMP3

PDE6B

BCL6

GDF5

ATAD4

PSG7

FMOD

HTR2C

PTGIS

AOC2

GABRA1

NR2C1

COL11A2HMGA2REG1A

HCG8CCR9

GNB3 LPCAT3

GHRHR

CHRNE

MADDAFAP1

ADORA3

GFAP

CD4

SAA1

LIMK2 TGFBI

CTSSCX3CR1

CSF1R

CD1D

FGL2

IFI30

HCK

FCER1G

CTSH

PRKCD

S100A10

NCF2

ANXA2

TNFSF10

PLD3

LGALS2

CCL5

SERPINA1

IL13RA1

CD14CFP

VCAN

CST3

ZYX

TNFAIP2

SPI1

GRN

APLP2

RHOGCD302

ITGB2

ITGAM

TSPO

S100A4

FCGR1A

CSTA

AOAH

ASAH1

S100A11

Dataset: LeukemiaType: Co-Expression

Configuration: Conf1_SE

Nodes: 683

Edges: 1529

Triangle Coeff: 0.333

Square Coeff: 0.14

Density: 0.007

Diameter: 24

Avg. SPL: 8.856

59 Connected Components

Topology Analysis

Page 19: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

SFRS7

LBR

LRPPRC

PSMA5

ATP5A1

SRP9

DARSPTPN11

CNBPRBBP4

MOBKL1B

C11orf58

SNRPF

GCSH

SSBP1

SNRPE

SNRPG

PPP1CCNDUFA4

COX6CNAE1

SUMO1ACP1COX7C GARS

YWHAQ MYLPF

CTAG1BPMS2L11 BCAT2

CSN3

MPP2

SMA4

TM4SF4

CLCN5

IL5

FOXF2

FOXF1

ANK3

EGF

FAS

MAP3K10

SILV

HLA-C

IGHG1

TNFSF9

SRPK3

DNASE1

ITGA1

FDPS

LHFPL2

ID4 ITGA6

ARHGEF7

CLTC

CD1B

PDGFRA

MPPED2

PMP22PLD3

CCL5

E2F5

AKAP12

RIMS3

GLDCGNG11

IL13RA1

LRPAP1

CD63

P4HB

UPP1LMNA

PRTN3

ELANE

BTN2A1

CEACAM6 RBM39

SAFB

UBXN4

PUM1

BPI

CEACAM8S100P

DEFA4

CLC PRG2 CREM INHBA HLA-DRA CD74 HPGD ZBTB16

AVPH19

PSMD6 EIF4H BCLAF1 HNRNPK

DPP4

CTSH

CD14

VCAN

CFP

CSTA

TNFAIP2

GRN

FGL2

HCK

FCGR1A

S100A4

ASAH1TSPO

SERPINA1

LGALS2

FCER1G

IFI30

SPI1

ZYX

S100A11

ITGAM

NCF2

KYNU

APLP2

CTSSCD1D

SMARCD3

CD68

ITGB2

CD302

IGL@LOC91316ANXA3MMP8 PRSS3PRSS1

FEV

CST4

TAF1

EFNA3

CES2

SPRR1ACDH5

BCL2A1CCL22CCR9 SLC2A4

THBS1IL1RN

ACSL1

PLAURNAMPT

AGERLPCAT3PTPRU

ERCC4 MVK REG1A

TIMP3

HMGA2

ADCYAP1CYP2C19

NCAPD3ARHGEF2 AVPR1B

GH2

PAX8

DHRS2

SMARCD1

ZWINTAS

IL8RB

SLC30A3

FOLR1FANCC

TMEM106AGNB3

NEUROG1

MYH6RNF113ACHIC1

ATP6V0A1

PDE6B

IL1B

PDE4C

CCL3IFIT2

IL6 CCL4G0S2

SOD2

PTX3

ICAM1

GABRA1CALCRL

SLC18A1

IFNA21

MAP2K5

IFI35

MADD

GFAPCOL11A2

B4GALT1

MDH1

POLR2G

TCF12

CCT5CRYZ

CCT4

APEX1

GLRX

CBFB

FLI1PARP1

PSMA4

CCT6A

DPM1

CR2FXYD2

MEF2A

LOC90925

IMMT

EPRS

ZNF638

FAM120A

SNRPD1

CHI3L2

SEPW1

CD3G

TCF7

FAT1

CDC25B

PTP4A2

CD3EMAL

LCK

TULP1

ZAP70

TRBV5-4

CD3D

CD1E

MED22

ACOX2GAST

UGP2

APOB

CX3CR1

TBX5

MMP14

SPCS2

TRA2B

UBE2V2

CALM2

CCT3

XPO1

SP3

TSN

RBMX

FNTA

IARS

XRCC5

TMEM131

SSB

H3F3A

WAPAL

PSMC6

PKLR

ADH5

ZNHIT3

NARS

ZNF146STXBP3

FARSA

PMS1

C12orf24

UTP14C

BCL2

DYNLL1

COPS5

ACADM

PRKAR1A

PUM2

NEDD8ERH

HDAC2

DCK

LOC728849

MAN1A1

NFE2

GLIPR1

KIT

PTDSS1

GTF2A2

NSMAF

IQGAP2

ACP5

HGF

GJA1

RHAGGYPB

MYL4

ALDH1A1

GYPE

SEPT5

PTPRZ1

BNC1

TGFA

CYP3A4

CYFIP1

ADARB1

FAM3C

UROD

GYPA

HMBS

TFRC

SLC2A1FECH

BLVRB

PRDX2

HBB

IL8

SELENBP1

ALAS2

BPGM

SPTA1

HBQ1

MARCH8

BSG

NFKB1

EPB42

NEFL

SLC10A1

CXCL3 CXCL2

DYRK1A LCN2 HLA-DRB1

CAMP

RPL18AATP6V1C1

UBA7GCNT1

PLAU

TM9SF4 CYP2B7P1

STAB1CPT1B

MT1P2

MT2A

MT1A

EEF1A1

MMP9

RPL41

LTF

RPS29

RPL37A

GBA

CYP2D6

PPBPSERPINB2BCL6

PSG7

HLA-DMA

NR2C1

HLA-DMB

HLA-DPA1

PF4

ITGA8 RBM25 TARBP1 COX6A1 COX7A2 ATXN10 PRMT1 ANXA2R3HDM1 PPP2R5E PRPF18 MSLN CCR2 NFKBIA CD83 MOBPDAD1S100A9S100A8S100A12HIST1H4EHIST1H2AIHIST1H2AEDMPK ZCCHC11 CD72 CSRP2 KIAA0101 H2AFZ PSMB9 PSMB8 ATP5C1PPP2R5C SFRS3 CCT7 FUCA1 PIK3C3 PSMA6 MYL6B PPWD1 S100A10 SLC6A8 AQP1 CTSA CTSD ANXA11 LIMS1 SND1 SPG7SSR2 ATP5G3 SNRPD2 SNRNP70

PRKCD CST3

APCS POSTN

53 Connected Components

Dataset: LeukemiaType: Co-Expression

Configuration: Conf1_SN

Nodes: 422

Edges: 680

Triangle Coeff: 0.323

Square Coeff: 0.199

Density: 0.008

Diameter: 16

Avg. SPL: 5.644

Topology Analysis

Page 20: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Topology Analysis

Co-Prediction vs. Co-ExpressionConfiguration Description

Conf1 F.S + 1 Train.

Conf3 Orig + 1 Train

Conf5 F.S + 2 Train.

Conf7 Orig + 2 Train.

- +

Nodes

Conf5P

Conf1P

Conf1SE

Conf5SE

Conf3SE

Conf7SE

Conf7P

Conf3P

- +

Triangle Clustering Coefficient

Conf3P

Conf7P

Conf1SE

Conf5SE

Conf3SE

Conf7SE

Conf1P

Conf5P

- +

Edges

Conf5SN

Conf1SN

Conf1P

Conf5P

Conf3P

Conf7P

Conf7SN

Conf3SN

- +

Square Clustering Coefficient

Conf3P

Conf7P

Conf1P

Conf5P

Conf7SE

Conf3SE

Conf1SE

Conf5SE

- +

Diameter

Conf1P

Conf5P

Conf7P

Conf3P

Conf1SE

Conf7SE

Conf3SE

Conf5SE

8 1234567

8 1234567

8 1234567

8 1234567

8 1234567

Page 21: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Co-Prediction configurations diversity

O(Conf1,Conf2) = Commons

Commons + Unique1+ Unique2

GO Conf1 Conf3 Conf5 Conf7

Conf1 - 0.166 0.658 0.190

Conf3 - 0.156 0.571

Conf5 - 0.179

Conf7 -

KEGG Conf1 Conf3 Conf5 Conf7

Conf1 - 0.238 0.690 0.124

Conf3 - 0.127 0.508

Conf5 - 0.139

Conf7 -

Overlap of Enriched terms among different configurations

Page 22: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Generally small overlap

Variants on the inferring process lead to different networks

PubMed Conf1 Conf3 Conf5 Conf7

Conf1 - 0.065 0.720 0.080

Conf3 - 0.062 0.423

Conf5 - 0.077

Conf7 -

Similar configurations (1,5) and (3,7) share more

Co-Prediction configurations diversity

Page 23: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Co-Prediction vs. Co-Expression biological knowledge

Enrichment Score to have unbiased results

ES = # of Enriched Terms

# of Nodes

Average Ranking across 8 datasets

Conf1_P Conf1_ESE Conf3_P Conf3_ESE Conf5_P Conf5_ESE Conf7_P Conf7_ESE

GO 3.000 (2.5) 3.000 (2.5) 6.562 (8) 5.250 (6) 2.250 (1) 4.625 (4) 5.062 (5) 6.250 (7)

KEGG 4.437 (6) 3.125 (1) 6.687 (8) 4.312 (5) 3.625 (2) 4.125 (3) 4.250 (4) 5.437 (7)

PubMed 3.750 (4) 3.375 (1) 5.750 (7) 3.500 (2) 3.625 (3) 4.375 (5) 6.250 (8) 5.375 (6)

Conf1_P Conf1_ESN Conf3_P Conf3_ESN Conf5_P Conf5_ESN Conf7_P Conf7_ESN

GO 3.00 (3) 2.750 (2) 6.187 (7) 7.625 (8) 2.000 (1) 3.375 (4) 4.943 (5) 6.125 (6)

KEGG 4.500 (5) 3.625 (3.5) 6.562 (8) 5.937 (7) 3.625 (3.5) 3.250 (1) 3.562 (2) 4.937 (6)

PubMed 4.000 (4) 3.125 (1) 5.875 (8) 5.000 (5) 3.375 (2) 3.562 (3) 5.75 (7) 5.125 (6)

Co-Expression SE type

Co-Expression SN type

Page 24: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Conf1 Conf3 Conf5 Conf7

GO 0.155 0.325 0.166 0.294

KEGG 0.068 0.256 0.092 0.215

PubMed 0.045 0.120 0.043 0.147

Conf1 Conf3 Conf5 Conf7

GO 0.149 0.469 0.144 0.371

KEGG 0.049 0.421 0.068 0.300

PubMed 0.048 0.270 0.047 0.193

Overlap of Enriched terms: Co-Prediction vs. Co-Expression

Co-Expression SE type

Co-Expression SN type

Co-Prediction vs. Co-Expression biological diversity

Page 25: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Statistical test for performance evaluation

Is one method better than the other?

Consider each configuration as independent algorithm

Friedman testNull-hypothesis: the performances of the algorithms are equivalent

Enrichment Score used as metric

Post-Hoc with Nemenyi testCompare each algorithm vs. each other to check if one is better

Significance level α=0.05

Page 26: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

GO Terms•Null-hypothesis rejected with p-value 5.322e for SN type-07

Conf5_P better than: Conf3_P, Conf7_SN, Conf3_SNConf1_P better than: Conf3_SNConf1_SN and Conf5_SN better than: Conf3_SN

Statistical test for performance evaluation

Conf5_P better than: Conf3_P, Conf7_SE

•Null-hypothesis rejected with p-value 0.001012 for SE type

KEGG and PubMed TermsNull-hypothesis is not rejected

Page 27: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Literature Mining

Manual literature mining for “important” nodes:• Top Hubs (highest degree)• Central Nodes (highest betweenness centrality)

Those important nodes appear to play an important role on the disease studied in the microarray

Example: GSE3726 (Colon vs. Breast Cancer)

6 out 7 “important” genes involved in one of those 2 cancers.

Page 28: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

We presented a ML approach to define GINs called Co-Prediction.•It has general pipeline to infer GINs with 4 different variants•It identifies interactions in a different way compared with statistical methods as Co-Expression•Compared to Co-Expression networks, Co-Prediction GINs:

✦Encode the same amount of biological knowledge ✦Different in: - Topology

- Enriched Terms

•It creates biologically relevant networks

Conclusions

Page 29: A Machine Learning approach to generate Gene Interaction …ico2s.org/data/seminars/2014-03-11-nl.pdf · 2014. 5. 20. · Nicola Lazzarini BNC Seminar 11th March 2014 Gene Interaction

Nicola Lazzarini 11th March 2014BNC Seminar

Thank you!