Mining the functional genomics data III
Data integration:Gene Ontology, PPI, URLMAP
Jaak [email protected]
Havana, Cuba, 21.11.2003
EPCLUST Expression data GENOMES
sequence, function, annotation
SPEXSdiscover patterns
URLMAPprovide links
Components of Expression Profilerhttp://ep.ebi.ac.uk/
Expression data
External data, toolspathways, function,
etc.
PATMATCHvisualise patterns
EP:GOGeneOntology
EP:PPIProt-Prot ia.
SEQLOGO
Expression Profiler: EPCLUST
DATA SELECT/FILTER
FOLDER ANALYZE
A “CLUSTER”
URLMAP
GeneOntologyPathwaysDatabasesSPEXSOther tools
URLMAP
• Given a cluster of genes - many web based tools and databases to consult/follow up. How to link to them?
• How to manage many links, many tools?
• Answer: Centralize that linking
URLMAP - no need to “cut & paste”
KEGG:
SRS/InterPro
• Generates all links/forms dynamically
• Maintain links in one place• Handle renaming of gene
id’s by synonyms• Allow domain-specific link
pages
A Simple Metabolic Pathway
Shoshanna Wodak, Jacques van Helden
Links for each item type
• Yeast S. cerevisiae gene ID-s (ORFname, SP id, SGD ID, …)
• Pattern collections, e.g. substrings to profile generation by SEQLOGO
• Keyword searches from web based search engines
Management of links
• Hierarchies of link collections• One can point to any (sub)hierarchy
directly• LINK =
• URL, title, • form parameters• modifications/code• DB lookups for synonyms
“Screen scraping” – doable with a little perl programming
g1,g2,g356g1,g2,g356
g1,g2,g356
g1,g2,g356
g1
g2
g356
Report
Gene OntologyTM
www.geneontology.org
• GO is a systematic effort for data annotation
• Three independent ontologies• Molecular Function• Biological Process• Cellular component
• How to integrate that into analysis tools?
DAG Structure
Annotate to any level within DAG
mitosisS.c. NNF1
mitotic chromosome condensation
S.c. BRN1, D.m. barren
• Database object: gene or gene product
• GO term ID
• Reference
• publication or computational method
• Evidence supporting annotation
GO Annotation: Data
IDA - Inferred from Direct Assay
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IPI - Inferred from Physical Interaction
IEP - Inferred from Expression Pattern
GO Evidence Codes
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IC - Inferred by Curator
ISS - Inferred from Sequence or structural Similarity
IEA - Inferred from Electronic Annotation
ND - Not Determined
IDA - Inferred from Direct Assay
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IPI - Inferred from Physical Interaction
IEP - Inferred from Expression Pattern
GO Evidence Codes
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IC - Inferred by Curator
ISS - Inferred from Sequence or structural Similarity
IEA - Inferred from Electronic Annotation
ND - Not Determined
From reviews or introductions
From primary literature automated
Example (GoMiner)
EP:GO tool for GeneOntology
• Browse
• Search by keywords; EC, term. etc..
• Get associated genes• Submit associated genes to URLMAP
• Annotate gene clusters using GO terms
URLMAP => Look up expression data
EP:GO EPCLUST
Annotate Clusters (EP:GO)
1
2 3 4
5 6
A,D
B,C
E
F,G,H
J
I
F,G,H
F,G
B,E
B,A
F,G,I
B,E,F,I
Set overlap
GO term CLUSTER
A: |G ∩ C| / min( |G|, |C|)
B: P( choose |C| from N with |G|, observe |G ∩ C|+)
N genes
G ∩ C
Annotation of clusters•
GO:0042254 <U:L> Process: ribosome biogenesis and assembly (+2:15) (depth=7) [sgd:2:187]GO:0042254: 47 from cluster (size 98) vs 187 in this class (including subclasses)
GO:0006364 <U:L> Process: rRNA processing (+3:3) (depth=8) [sgd:50:126]GO:0006364: 35 from cluster (size 98) vs 126 in this class (including subclasses)
GO:0006360 <U:L> Process: transcription from Pol I promoter (+6:14) (depth=8) [sgd:23:155]GO:0006360: 38 from cluster (size 98) vs 155 in this class (including subclasses)
GO:0005730 <U:L> Component: nucleolus (+10:17) (depth=6) [sgd:154:210]GO:0005730: 45 from cluster (size 98) vs 210 in this class (including subclasses)
GO:0030515 <U:L> Function: snoRNA binding (depth=6) [sgd:23:23]GO:0030515: 17 from cluster (size 98) vs 23 in this class (including subclasses)
GO:0030490 <U:L> Process: processing of 20S pre-rRNA (depth=9) [sgd:33:33]GO:0030490: 18 from cluster (size 98) vs 33 in this class (including subclasses)
GO:0005732 <U:L> Component: small nucleolar ribonucleoprotein complex (depth=6) [sgd:30:30]GO:0005732: 16 from cluster (size 98) vs 30 in this class (including subclasses)
GO:0006396 <U:L> Process: RNA processing (+7:52) (depth=7) [sgd:7:370]GO:0006396: 40 from cluster (size 98) vs 370 in this class (including subclasses)
• …
>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)
TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_
...>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_
101 Sequences relative to ORF start
GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29...
GATGAG.TTGAAA..TTT
YGR128C + 100
EP:PPI Protein-protein interaction
• There are high-throughput technologies for identifying hypothetical protein-protein interactions
• Which ones of these are more likely to be true?
• Can these predictions help predicting gene function?
PPI pairs
We have expression data
Cluster
Trust those within the same cluster
PPI are enriched within clusters
Ge, Liu, Church, Vidal: Nature Genetics Nov. 2001
Protein-protein interactions: which to trust more?
Answer: Use the distance measure alone
Kemmeren et.al. Randomized expression data
Yeast 2-hybrid studies
Known (literature) PPI
MPK1 YLR350w SNF4 YCL046WSNF7 YGR122W
Molecular Cell, Vol. 9, 1133–1143, May, 2002
d0
Interacting pairs of proteins A and B; C and DWhich would you trust?
A
B1 0
d13
12
7
C
D
EP:PPI – combine PPI and expression
Results
• Confidence in 973 out of 5342 putative two-hybrid interactions from S. cerevisiae is increased.
• Besides verification, integration of expression and interaction data is employed to provide functional annotation for over 300 previously uncharacterized genes.
• The robustness of these approaches is demonstrated by experiments that test the in silico predictions made.
• This study shows how integration improves the utility of different types of functional genomic data and how well this contributes to functional annotation.
promotercoding DNA
GENE 1 GENE 2 GENE 3 GENE 4DNA
transcriptionfactors
G1
G2 G4
G3
Gene regulation by transcription factors
Networks
• Graphical models
• Directed labelled graph
• Nodes genes
• Arcs/Edges relationships
• Labels types of relationships
Start node (gene)
End node (gene)
Connection weight, w
Graph drawing
A BW
Different interpretation of arcs
• Edges can have different meanings, hence different networks
• Binding site for A is in front of B• Proteins A and B interact• Deletion of gene A affects expression of
B (is somewhere in regulation cascade)• “Literature” mentions genes together
promotercoding DNA
GENE 1 GENE 2 GENE 3 GENE 4DNA
transcriptionfactors
G1
G2 G4
G3
Gene regulation by transcription factors
A B C
gene B
gene C
gene D
gene A
A D
B C
Deletion mutants (gene knockouts)
Hughes, T. R. et al: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.
Green arrows - upregulationRed arrows - downregulationThickness of arrow represents certainty of direction (up/down)
A complete graph
Features/distributions that do not depend on discretisation thresholds
• Visual inspection, biological interpretation
• General statistics and features of the graphs
• Indegree/Outdegree
• Complexity of the networks
• What is the modularity? • How many components? • Deletion of hot-spots, does it break the net?
Filter•choose a list of genes
(MATING, marked in red)•filter for these genes plus
neighbouring genes from the graph
CUP5
AKR1
VMA8
YAR014C
SST2
YEL044W
YER050C
MFA1STE2
BAR1
MFA2
AGA1
AFG3FUS1
FKS1
FUS3
VCX1
ADR1
URA3
ICL1
YGR250CPGU1
YLR042C
YNR067C
HOG1
FIG1
AGA2
KSS1
RAD6
STE6
RAS2
RPD3
CRS4
ASG7
KAR4
NRC465
YIL080W
FUS2
YNL279W
YOL154W
YPL156CYPL192C
YML048W-A
STE11
STE12
GPA1
STE18
STE24
STE4
STE5
STE7
TUP1
YER044C
YJL107C
AFR1
SHE4
CMK2PHO89
RAD16
CYC8 QCR2SWI4NPR2
Mutation network =4
AEP2
AKR1
CMK2
ANP1
RAD16
AFR1
CEM1
CUP5
SST2
DIG1
UBP10
STE2
ERG2
PHO89ERG6
GAS1 PTP2
GYP1
HIR2HPT1
ISW1
FIG1 ISW2
KIN3
MAC1MRPL33
MSU1
NPR2
PET111
RAD57
RIP1
RRP6
ASG7
STE6RTS1
SCS7
SGS1
MFA1
SHE4AGA1
SWI4
FUS1SWI5
VAC8
VMA8
YAL004W
YAR014C
YEL044W
YER050C
FUS3
GPA1
BAR1
MFA2
YER083C
RTT104
YMR014W
YMR029C AGA2YMR031W-A
YMR293C
YOR078W
ADE2
AFG3
BNI1
CLA4
ERG3
FKS1
KAR4
YAR064W
CHS3
VAP1
ICS2
YCLX09W
YDL009C
STP4
PMT1
VCX1HO
THI13
ADR1
YDR249C PAM1
YDR275W
HXT7
HXT6 YDR366CYDR534C
URA3
YEL071W
MNN1
ICL1
RNR1
YER130C
YER135C
SPI1 DMC1
HSP12
NIL1
GSC2
KSS1
MUP1
YGR138C
SKN1
YGR250C
YHR097C YHR116W
YHR122W
YHR145C
YIL060W
YIL096C
YIL117C
RHO3
YIL122W FKH1
NCA3
YJL145W
RPL17B
YJL217W
CYC1
DAN1
PGU1
GFA1
HAP4
RRN3
STE3
PRY2
KTR2
SRL3
YLR040C
YLR042C
SSP120
HSP60
YLR297W
RPS22B YLR413W
HOF1
DDR48
RNA1
YMR266W
YNL078W
SPC98
YNL133C
YNL217W
WSC2YPT11
RFA2
YNR009W
YNR067C
MDH2
YOL154W
NDJ1
WSC3
CDC21
PFY1
RGA1
MSB1
SRL1
YOR248W
YOR296W
YOR338W
GDS1PDE2
FRE5
YPL080C
RPS9A
BBP1
YPL256C
SUA7
MEP3
YPR156C
HMG1
HOG1
MED2
QCR2
RAD6
RAS2
RPD3
RPS24A
CRS4CYC8
YAR031W
YBR012C
HIS7
YCLX07W
YCRX18C PCL2
YDR124W
ECM18APA2
YER024W
HOM3
THI5
YGL053W
NRC465
YGR161C YHR055C
YIL037C
YIL080W
YIL082W
HIS5
YJL037W
SAG1
CPA2
AAD10
HYM1
MET1
MID2
YML047C
KAR5
CIK1
FUS2 SCW10
BOP3
YNL279WTHI12
YOL119C
YOR203W
TEA1
ISU1
YPL156C
YPL192CYPL250C
KAR3YIL082W
-A
YML048W-A
YMR085W
STE11
STE12 STE18
URA1
URA4
STE24
STE4
STE5
STE7SWI6
MAK1
TUP1
YER044C YJL107C
Mutation network =2
lacZ ...Promoter Operator
Repressor
lacIPromoter
Activator
Glucose
Lactose GlucoseGalactose
+
Galactosidase
Lac-Operon
Thomas Schlitt
Gene regulatory networks
• What formalisms to use to describe them?• When does model correspond to biological
reality?• How to simulate models on computer• Is it possible to verify models by
experiments?• How to restore networks from raw data
without knowing the structure or parameters?
Most genes have only a few incoming / outgoing edges, but some have high numbers (>500)
0
20
40
60
80
100
120
1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
number of outgoing edges
coun
t
...
Number of incoming/outgoing edges
ARG5,6(108,28)
SST2(60,25)
TEC1
HPT1
GCN4 ERG3(164,15)GAS1
FUS3ERG28QCR2 YER083C
GLN3 SPF1
MRT4
CLB2
YHL029C
0
5
10
15
20
25
30
35
40
45
020406080100120
outdegree
ind
egre
e
Rank of outdegreeR
ank of indegree
outdegree m n indegree m n2.0 Carbohydrate metabolism 363 4 Amino-acid metabolism 9 194
RNA turnover 353 4 Nucleotide metabolism 6 82Meiosis 244 3 Energy generation 5 242Cellstress 207 9 Small molecule transport 5 343Protein translocation 197 3 Other metabolism 5 148
2.8 RNA turnover 110 4 Amino-acid metabolism 4 167Cellstress 62 8 Nucleotide metabolism 3 67Meiosis 54 3 Energy generation 2 184Proteinsynthesis 53 7 Differentiation 2 43
Cellwallmaintenance 47 6 Small molecule transport 2 286
3.6 RNA turnover 48 4 Small molecule transport 2 230RNA processing/ modification 41 4 Other metabolism 2 96Cellstress 27 8 Nucleotide metabolism 2 58Small molecule transport 19 8 Matingresponse 2 57Cellwallmaintenance 19 6 Amino-acid metabolism 2 133
Cellular role table showing the top 5 groups with the highest median degrees for the networks with =2.0, 2.8 and 3.6 with a minimum group size of 3 for outdegree and 40 for the indegree (m median degree, n number of genes per group)
High outdegree High indegreeR
egul
atio
n Metabolism
• Is there one “big” dominant connected component and possibly a number of small components, or several components of comparable sizes?
• Can the network be broken down in several components of comparable size by removing nodes of high degree (i.e., nodes with many incoming or outgoing edges)?
Network modularity
network modularity
Number of connected components in the networks
Number of connected components in the networks
network modularity
component
full network
1% removed
5% removed
10% removed
2.0 largestsecond
total
5383
1
4707
1
368222
261452
3.0 largestsecond
total
355622
246122
138549
7646
17
4.0 largestsecond
total
235434
120537
5426
22
452851
Number of connected components in the networks
network modularity
• Wagner, Genome Research 2002 – there exist many independent modules
• Featherstone and Broadie, Bioessays 2002 - there is only one giant module
• All depends on the definition of the ‘module’
Modularity
other opinions
Gene disruption network for Saccharomyces cerevisiae
a closer look
Filter•choose a list of genes
(MATING, marked in red)•filter for these genes plus
neighbouring genes from the graph
CUP5
AKR1
VMA8
YAR014C
SST2
YEL044W
YER050C
MFA1STE2
BAR1
MFA2
AGA1
AFG3FUS1
FKS1
FUS3
VCX1
ADR1
URA3
ICL1
YGR250CPGU1
YLR042C
YNR067C
HOG1
FIG1
AGA2
KSS1
RAD6
STE6
RAS2
RPD3
CRS4
ASG7
KAR4
NRC465
YIL080W
FUS2
YNL279W
YOL154W
YPL156CYPL192C
YML048W-A
STE11
STE12
GPA1
STE18
STE24
STE4
STE5
STE7
TUP1
YER044C
YJL107C
AFR1
SHE4
CMK2PHO89
RAD16
CYC8 QCR2SWI4NPR2
Mutation network =4
This subnetwork is the result of filtering the full network at =4.0 for the core set marked in red and their next neighbours (red arcs: downregulation, green arcs: upregulation).
CUP5
AKR1
NPR2
PHO89
SHE4
AFR1
CMK2
SWI4
RAD16
VMA8
YAR014C
YEL044W
YER050C
MFA1
STE2
BAR1
MFA2
AGA1
AFG3
FUS1 FKS1FUS3
VCX1
ADR1
URA3ICL1
YGR250C
PGU1
YLR042C
YNR067CHOG1
FIG1
AGA2
KSS1
QCR2
RAD6
STE6
RAS2
RPD3
CRS4
ASG7
CYC8
KAR4
NRC465
YIL080W
FUS2
YNL279W
YOL154W
YPL192C
YML048W-A
STE11
STE12
GPA1
STE18
STE24
STE4
STE5
STE7TUP1
YER044C
YJL107C
SST2
YPL156C
Mating subnetwork
AEP2
AKR1
CMK2
ANP1
RAD16
AFR1
CEM1
CUP5
SST2
DIG1
UBP10
STE2
ERG2
PHO89ERG6
GAS1 PTP2
GYP1
HIR2HPT1
ISW1
FIG1 ISW2
KIN3
MAC1MRPL33
MSU1
NPR2
PET111
RAD57
RIP1
RRP6
ASG7
STE6RTS1
SCS7
SGS1
MFA1
SHE4AGA1
SWI4
FUS1SWI5
VAC8
VMA8
YAL004W
YAR014C
YEL044W
YER050C
FUS3
GPA1
BAR1
MFA2
YER083C
RTT104
YMR014W
YMR029C AGA2YMR031W-A
YMR293C
YOR078W
ADE2
AFG3
BNI1
CLA4
ERG3
FKS1
KAR4
YAR064W
CHS3
VAP1
ICS2
YCLX09W
YDL009C
STP4
PMT1
VCX1HO
THI13
ADR1
YDR249C PAM1
YDR275W
HXT7
HXT6 YDR366CYDR534C
URA3
YEL071W
MNN1
ICL1
RNR1
YER130C
YER135C
SPI1 DMC1
HSP12
NIL1
GSC2
KSS1
MUP1
YGR138C
SKN1
YGR250C
YHR097C YHR116W
YHR122W
YHR145C
YIL060W
YIL096C
YIL117C
RHO3
YIL122W FKH1
NCA3
YJL145W
RPL17B
YJL217W
CYC1
DAN1
PGU1
GFA1
HAP4
RRN3
STE3
PRY2
KTR2
SRL3
YLR040C
YLR042C
SSP120
HSP60
YLR297W
RPS22B YLR413W
HOF1
DDR48
RNA1
YMR266W
YNL078W
SPC98
YNL133C
YNL217W
WSC2YPT11
RFA2
YNR009W
YNR067C
MDH2
YOL154W
NDJ1
WSC3
CDC21
PFY1
RGA1
MSB1
SRL1
YOR248W
YOR296W
YOR338W
GDS1PDE2
FRE5
YPL080C
RPS9A
BBP1
YPL256C
SUA7
MEP3
YPR156C
HMG1
HOG1
MED2
QCR2
RAD6
RAS2
RPD3
RPS24A
CRS4CYC8
YAR031W
YBR012C
HIS7
YCLX07W
YCRX18C PCL2
YDR124W
ECM18APA2
YER024W
HOM3
THI5
YGL053W
NRC465
YGR161C YHR055C
YIL037C
YIL080W
YIL082W
HIS5
YJL037W
SAG1
CPA2
AAD10
HYM1
MET1
MID2
YML047C
KAR5
CIK1
FUS2 SCW10
BOP3
YNL279WTHI12
YOL119C
YOR203W
TEA1
ISU1
YPL156C
YPL192CYPL250C
KAR3YIL082W
-A
YML048W-A
YMR085W
STE11
STE12 STE18
URA1
URA4
STE24
STE4
STE5
STE7SWI6
MAK1
TUP1
YER044C YJL107C
This subnetwork is the result of filtering the full network at =2.0 for the core set marked in red and their next neighbours (red arcs: down- regulation, green arcs: upregulation).
Mating subnetwork
•more information than randomised networks •no optimal •powerlaw distribution of arcs•no obvious modules•local networks make sense
Conclusion
lacZ ...Promoter Operator
Repressor
lacIPromoter
Activator
Glucose
Lactose GlucoseGalactose
+
Galactosidase
Lac-Operon
Thomas Schlitt
A gene network(?)
b1
b2
b3
F1
F2
r1
r2
Of transcription factors
Of transcription factors and KO’s
Of transcription factors and KO’s
Hughes, T. R. et al: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.
All genes
Effectual set and regulation setAll genes
Transcription factors
Disrupted genes
tRegulation set of t
h Effectual set of h
All genes
Effectual set and regulation setAll genes
Transcription factors
Disrupted genes
g
Regulation set of t
Effectual set of h
How to estimate that the overlap is more than expected by
random?
G
R
E
RE
We assume that the elements of the set E are marked, and pick the set of size |R| at random. Then the size x=|RE| of the intersection are distributed according to hypergeometric distribution. The probability of observing an intersection of size k or larger can be computed according to formula:
k
i R
G
iR
EG
i
E
kxP
0 ||
||
||
||||||
1)(
Data
• Disrupted genes – 263 disrupted genes excluding drug treatments and haploid states (Hughes et al)
• Transcription factor binding sites – 356 binding sites, from these 37 experimentally proved (Pilpel et al, 2001)
Disrupted TF
• Only 5 transcription factors from our set (of known binding sites) were disrupted on the experiments – mbp1, yap1, yaf1, swi5, gcn4
• For three of them – mbp1, yap1, gcn4 –the regulation and effectual sets were highly correlating
• yaf1 is activated with oleate, while in oleate free environment Yaf1 (alias OAF1) disruption does not have significant effect
• swi5 affects only haploid state, while we use only diploid
Effectual sets correlating with other TF binding sites
• From 37 of the experimentally proven binding sites, 20 correlate with one or more effectual sets
• If the disrupted gene correlate with a regulation set of a different gene, the correlation should be explained
Possible explanations why disruption of gene A may correlate with regulation
set of a different gene (TF) T:
• T belongs to the disruption set of A (cascade)
Gene regulation cascade
Possible explanations why disruption of gene A may correlate with regulation
set of a different gene (TF) T:
• T belongs to the disruption set of A (cascade)• T is regulated by A (transcription or translation)
or by a gene on the cascade of A• T is modified (e.g., phosphorylated) by A or a
cascade of A• T and A belongs to the same protein complex
• A and T are functionally related
Binding site/disruption correlation summary
|R| Site |E| Disruption |RE| Description 184 MCB 8 MBP1 5 Part of a DNA binding complex. 78 Yap1 55 YAP1 6
116 Ume6 346 SIN3 20 Interacting proteins. 210 Zap1 3 MSC7 3 Relation via hydrogenases. 243 Ste12 437 DIG1,DIG2 36 DIG1 represses STE12. 153 Ndt80 151 ISW1,ISW2 13 Genetic interaction with ISW2. 180 Rpn4 33 UBR2 17 Similar cellular role. 257 Rap1 121 VAC8 23 Weak link through vacuole 480 Bas1 23 HPT1 11 Adenine response. 149 Stress 126 SIR4 13 Unexplained. 116 Hap234 4 disruptions Unexplained. 151 Gcn4 34 disruptions Central biosynthesis regulator. 89 Leu3 20 disruptions In biosynthesis pathway. 58 Met31-32 16 disruptions In biosynthesis pathway.
188 Aft1 CUP5 MAC1 VMA8 Small molecule transport, ironuptake. 907 rRNA proc. 9 disruptions Ribosomal activity. 514 PAC 9 disruptions Ribosomal activity. 356 ECB 5 Weak disruptions Early Cell-Cycle box 371 PDR 11 Weak disruptions 410 Mcm1 10 disruptions Mcm1p needs coregulators.
Conclusion
• Most of the binding site/disruption set correlations can be explained via • Regulation cascades• Protein complexes
(K. Palin et al, to appear in ECCB 2002, special issue of Bioinformatics)
or
SAME SYMPTOMS
SAME DRUG
RESPONSE VARIATION…
SNP ACCTGACGTGGACCTGTCGTGG
PHARMACOGENETICS = NEW OPPORTUNITIES
SNP = Single nucleotide polymorphisms, 0.1% = 3million
SNP’s make us unique~0.1%, 3.000.000
Goal: Associate SNPs with diseases
A C G T G A C
G T A - AA C T
Genotyping: select fewMeasure
Goal: Associate SNPs with diseases,i.e. identify areas of interest
Association analysis:Identifies MANY, if not
all contributing genesLinks genes to disease
pathways for optimal
target selection
FROM DISEASE GENES TO DRUG TARGETS
InternetInternet
GP
DNA,Plasma,storage
Data + Analysis = Value
LIMS
Informed consentPersonal data
Unique code
Genotypes
Medicalinformation
EGV: Process of data collection and handling
SNP’s
Bioinformatics: Where does IT stand?
• Data modelling, storage, access• Inference from data• Hypotheses generation and testing
• Allow novel types of questions to be asked by providing analysis methods that are able to cope with all the information that is available today
Compute Infrastructure
Bioinformatics: Challenges
• Knowledge representation, data semantics• Data size and its speed of growth• New/emerging data collection technologies• Integration of different data types• Discovery of useful knowledge• Modeling living systems as a whole• Improved health care products• Medical informatics – bringing the
knowledge to doctor’s bench
References for this talkhttp://www.egeen.ee/u/vilo/Publications/
Jaak Vilo, Misha Kapushesky, Patrick Kemmeren, Ugis Sarkans, Alvis Brazma. Expression Profiler. In Parmigiani,G., Garrett,E.S., Irizarry,R. and Zeger,S.L. (eds), The Analysis of Gene Expression Data: Methods and Software, Springer Verlag, New York, NY.
Patrick Kemmeren, Nynke L. van Berkum, Jaak Vilo, Theo Bijma, Rogier Donders, Alvis Brazma, and Frank C.P. Holstege Protein Interaction Verification and Functional Annotation by Integrated Analysis of Genome-Scale Data Molecular Cell 2002, May 24; 9(5) pp. 1133-1143
Johan Rung, Thomas Schlitt, Alvis Brazma, Karlis Freivalds, Jaak Vilo Building and analysing genome-wide gene disruption networks Bioinformatics 2002 Oct;18 Suppl 2:S202-210 European Conference on Computational Biology (ECCB 2002)
Kimmo Palin, Esko Ukkonen, Alvis Brazma, Jaak Vilo Correlating gene promoters and expression in gene disruption experiments Bioinformatics 2002 Oct;18 Suppl 2:S172-180;European Conference on Computational Biology (ECCB, 2002)
Acknowledgements
Alvis Brazma
Patrick Kemmeren, EBI, UMC Utrecht
Frank Holstege, UMC Utrecht
Thomas Schlitt, Johan Rung EBI
Kimmo Palin, Esko Ukkonen, U. Helsinki
+ the rest of the EBI microarray team