Statistical Data Fusion to Prioritize Lists of Genes
Bert Coessens, Stein Aerts
Departement ESAT - SCD, Katholieke Universiteit Leuven
Promotor: Bart De Moor
Assessor: Yves Moreau
Database Issues in Biological Databases (DBiBD), January 8-9, 2005
Context
Linkage analysis / Positional cloning
NEFL
RAB7
GARS
GIB1
LMNA
High-throughput technologies
Concept
Pathology / Biological process / …
- Gene expression
- Literature
- Anatomical expression
- Gene regulation
- Protein domains
- Functional annotation
- Evolutionary conservation
- …
Concept
Model with multiple submodels
- Training genes (training set): choose submodels, then TRAIN
- Candidate genes (test set): SCORE each gene i, giving one ranking for each submodel
- Order statistics: merge the submodel rankings into a combined ranking
Order Statistics
Given a set of n rank ratios for gene i:
- what is the probability of getting these ratios by chance alone?

From the joint probability density function of all n order statistics:

$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$

Recursive evaluation:

$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!} \, r_{n-k+1}^{\,i}$$

Complexity: O(n^2)
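The recursion for V_k can be sketched in a few lines of Python. This is a minimal illustration, not the Endeavour implementation: it assumes the rank ratios are sorted in ascending order (r_1 <= ... <= r_n), that V_0 = 1, and that Q = n! V_n, none of which is spelled out on the slide. The function name `order_statistic_q` is hypothetical.

```python
from math import factorial

def order_statistic_q(rank_ratios):
    """Evaluate Q(r_1, ..., r_n) with the V_k recursion (O(n^2) terms)."""
    r = sorted(rank_ratios)      # r[0] <= ... <= r[n-1], i.e. r_1 ... r_n
    n = len(r)
    v = [1.0]                    # assumption: V_0 = 1
    for k in range(1, n + 1):
        r_k = r[n - k]           # r_{n-k+1} in the slide's 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]   # assumption: Q = n! * V_n

if __name__ == "__main__":
    # A gene ranked in the top 20% by one submodel and the top 50% by another.
    print(order_statistic_q([0.2, 0.5]))
```

For n = 2 the recursion reduces to 2 r_1 r_2 - r_1^2, which agrees with evaluating the double integral directly.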
Statistical Validation: Setup
29 lists of disease genes from OMIM
5 lists of random genes from the human genome

For each disease or random gene set do:
    For each gene in the set do:
        a. Leave one gene out
        b. TRAIN all submodels on the set minus the left-out gene
        c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
        d. SCORE the test set with all trained submodels
        e. RANK the genes in the test set according to their order statistics p-value
    end
end

Calculate for a certain cut-off x the number of:
- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x

Calculate sensitivity and specificity from these values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
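The Rank ROC calculation can be sketched as follows. This is an illustration under stated assumptions: "ranked above x" is read as "rank <= x" with rank 1 the best, every validation run uses a test set of the same size, and the area is computed with the trapezoidal rule. The function names `rank_roc` and `auc` and the example ranks are hypothetical.

```python
def rank_roc(left_out_ranks, test_set_size):
    """ROC points from the rank of the left-out gene in each validation run."""
    runs = len(left_out_ranks)
    points = [(0.0, 0.0)]
    for cutoff in range(1, test_set_size + 1):
        tp = sum(1 for rank in left_out_ranks if rank <= cutoff)
        fn = runs - tp
        fp = cutoff * runs - tp                    # other genes at or above the cut-off
        tn = (test_set_size - cutoff) * runs - fn  # other genes below the cut-off
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1.0 - specificity, sensitivity))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

if __name__ == "__main__":
    # Hypothetical ranks of the left-out gene in five runs with 10-gene test sets.
    print(auc(rank_roc([1, 1, 2, 4, 7], test_set_size=10)))
```

If the left-out gene is always ranked first, the area is 1; if it is always ranked last, the area is 0.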
Statistical Validation: Disease genes
- 29 human diseases (OMIM) = 29 gene sets
- 627 disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes
Statistical Validation: Submodels
Textual data: TXTGate
Sequence similarity: BLAST
- Rank genes according to e-value
- Example: Presenilin 1 vs. Presenilin 2, e-value = 10^-133
Statistical Validation: Submodels
Functional annotation: GO
Functional annotation: Kegg
- Set of genes: GO IDs, observed frequencies
- Full genome: GO IDs, expected frequencies
Statistical Validation: Submodels
Protein information: InterPro
Protein information: BIND
- Training genes + interaction partners
- Test gene + interaction partners
- Overlap?
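The overlap check can be illustrated with plain set operations. The slide does not say how the overlap is turned into a score, so this sketch simply counts shared members and should be read as one possible choice. The function name `interaction_overlap` and the gene labels are hypothetical.

```python
def interaction_overlap(training, test_gene, test_partners):
    """Count members shared by the training side and the test side.

    training: dict mapping each training gene to its set of interaction partners.
    """
    training_side = set(training)              # the training genes themselves
    for partners in training.values():
        training_side.update(partners)         # plus all their partners
    test_side = {test_gene} | set(test_partners)
    return len(training_side & test_side)      # assumption: raw count as score

if __name__ == "__main__":
    training = {"A": {"X", "Y"}, "B": {"Y", "Z"}}   # hypothetical genes
    print(interaction_overlap(training, "T", {"Y", "Z", "Q"}))
```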
Statistical Validation: Submodels
Gene expression: Microarray data
Gene expression: ESTs
- Model is the average expression profile of the training genes
- Score a test gene by calculating the Pearson correlation
Human gene expression atlas: Su et al., 47 normal human tissues
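The expression submodel described above can be sketched as follows, assuming each gene is represented by a list of expression values over the same tissues. The helper names `mean_profile` and `pearson` and the example values are hypothetical.

```python
from math import sqrt

def mean_profile(profiles):
    """Average expression profile of the training genes, tissue by tissue."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)   # undefined for a constant profile

if __name__ == "__main__":
    training = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]   # hypothetical profiles
    model = mean_profile(training)
    print(pearson(model, [10.0, 20.0, 30.0]))
```

A test gene whose profile rises and falls across tissues in step with the training average scores close to 1, and one with the opposite pattern scores close to -1.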
Statistical Validation: Submodels
Cis-regulatory elements: TFBSs
Cis-regulatory elements: TFBS modules
- Check human-mouse CNS blocks in upstream sequence of a test gene
- Compare found motifs with motifs in training set
ModuleSearcher: searches for the best combination of 3 TFs in the 300 bp US (upstream sequence) of the genes in the training set
ModuleScanner: scores a test gene with the model
Statistical Validation: Similarity
Statistical meta-analysis
Fisher's method: assume there are m independent tests of H0.
1. For the i-th test calculate the corresponding p-value, p_i.
2. If each p_i has a uniform distribution on [0,1], then -2 Σ log p_i has a χ² distribution with 2m degrees of freedom.

Vector-based similarity
[Figure: vectors in a space with axes T1, T2, T3]
- Euclidean distance
- Pearson correlation
- Cosine similarity
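Fisher's method from the meta-analysis part of this slide can be sketched in pure Python. The sketch relies on the closed-form tail probability of a χ² distribution with an even number of degrees of freedom, so no statistics library is needed. It assumes all p-values are strictly positive. The function name `fisher_combined_p` is hypothetical.

```python
from math import exp, factorial, log

def fisher_combined_p(p_values):
    """Combine m independent p-values with Fisher's method."""
    m = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)   # ~ chi-square, 2m dof
    half = statistic / 2.0
    # Upper tail of chi-square with 2m degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    return exp(-half) * sum(half ** k / factorial(k) for k in range(m))

if __name__ == "__main__":
    print(fisher_combined_p([0.01, 0.04, 0.30]))
```

With a single test the combined p-value equals the input p-value, which is a quick sanity check on the formula.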
Statistical Validation: Correlation
Statistical Validation: Rank ROC
Statistical Validation: Submodel Rank ROC
Statistical Validation: Bias towards known genes
Endeavour Application: Screenshot
Endeavour Application: Architecture
ESAT Web server
Linux cluster
Java RMI
SOAP messages
Conclusions and Future
Conclusions:
- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance

Future:
- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels
Acknowledgements
Bart De Moor, Stein Aerts, Yves Moreau
Patrick Glenisson, Steven Van Vooren, Joke Allemeersch
Endeavour Application: Demo
1. Load training set
2. Add submodels
3. Train submodels
4. Load candidate genes
5. Score candidate genes with all submodels
6. Results of scoring
7. Ranking visualized in sprintplot