1
Harvard Medical School
Transcriptional Diagnosis by Bayesian Network
Hsun-Hsien Chang and Marco F. Ramoni
Children’s Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 17, 2009
2
Background
• Microarray technology enables profiling expression of thousands of genes in parallel on a single chip.
• Comparative analysis of gene expression across tissue states extracts signature genes for disease diagnosis.
• Challenge:
– The number of variables (i.e., genes) is much greater than the number of observations (i.e., biological samples), which leads to overfitting.
• Existing methods:
– Gene selection: compute statistics (e.g., t-statistics, SNR, PCA) of individual genes and select the high-ranking genes.
– Classification model: create a classification function of the selected genes.
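The conventional per-gene ranking step can be sketched as follows. This is only an illustration on synthetic data, assuming Welch's t-statistic as the ranking score; the names `welch_t`, `rank_genes`, and `top_k` are hypothetical and do not come from the slides.

```python
import math
import random

def welch_t(a, b):
    """Welch's two-sample t-statistic for two lists of expression values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_genes(expr, labels, top_k):
    """Rank genes by |t| between the two tissue states and keep the top_k.

    expr[g] holds the expression values of gene g across samples;
    labels[s] is 0 or 1 for sample s; top_k is the cut-off threshold.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Synthetic data (hypothetical): 20 samples, 200 genes, only "g0" is informative.
random.seed(0)
labels = [0] * 10 + [1] * 10
expr = {f"g{i}": [random.gauss(0, 1) for _ in labels] for i in range(200)}
expr["g0"] = [random.gauss(5.0 * y, 1) for y in labels]
selected = rank_genes(expr, labels, top_k=5)
```

With 200 genes and 20 samples, the remaining four selected genes are pure noise that happened to score well, which illustrates why an arbitrary cut-off on many more variables than observations invites overfitting.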
3
Proposed Approach
• Issues:
– The assumption of gene independence is inadequate.
– Other genes may be collinearly expressed with the signature.
– Selection and classification are two non-integrated steps. A cut-off threshold is needed to select the high-ranking genes.
• Proposed strategies:
– Adopt a systems biology approach to infer the functional dependence among genes.
– Use the dependence network for tissue discrimination.
– Integrate gene selection and the classification model in a Bayesian network framework.
4
Data Representation by Bayesian Network
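[Figure: expression data matrix of genes (Gene 1, Gene 2, ..., Gene N) by cases (Case 1, Case 2, ..., Case M), with cases grouped into tissue state 1 and tissue state 2, shown next to a Bayesian network over the nodes Pheno, G1, G2, ..., GN.]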
• Bayesian networks are directed acyclic graphs where:
– Nodes correspond to random variables.
– Directed arcs encode the conditional probabilities of the target nodes given the source nodes.
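The joint distribution encoded by such a graph factorizes over the nodes. The notation here is standard but is not taken from the slides: X_1, ..., X_n are assumed to denote the network variables (the genes and the phenotype), and Pa(X_i) the parents of X_i in the graph.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```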
5
Gene Selection by Bayes Factor
[Figure: two network diagrams over the nodes Pheno, G1, G2, Gp, Gq, ..., GN, linked by an arrow labeled "gene selection by Bayes factor".]
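The slides do not spell out how the Bayes factor is computed. As one possible reading, the sketch below compares a model in which a gene depends on the phenotype against a model in which it does not, assuming discretized expression levels and a symmetric Dirichlet prior. Both assumptions, and the names `log_marginal` and `log_bayes_factor`, are illustrative rather than the authors' method.

```python
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of category counts under a symmetric
    Dirichlet(alpha) prior (Dirichlet-multinomial, ordered observations)."""
    k, n = len(counts), sum(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out

def log_bayes_factor(gene, pheno, levels=2, states=2):
    """Log Bayes factor of 'gene depends on phenotype' versus
    'gene is independent of phenotype' (assumed model comparison)."""
    # Independent model: one distribution over expression levels.
    pooled = [sum(1 for g in gene if g == l) for l in range(levels)]
    log_m0 = log_marginal(pooled)
    # Dependent model: a separate distribution per phenotype state.
    log_m1 = 0.0
    for s in range(states):
        counts = [sum(1 for g, p in zip(gene, pheno) if p == s and g == l)
                  for l in range(levels)]
        log_m1 += log_marginal(counts)
    return log_m1 - log_m0

# Hypothetical data: 20 samples, two tissue states.
pheno = [0] * 10 + [1] * 10
gene_a = list(pheno)   # expression level tracks the phenotype exactly
gene_b = [0, 1] * 10   # expression level unrelated to the phenotype
lbf_a = log_bayes_factor(gene_a, pheno)
lbf_b = log_bayes_factor(gene_b, pheno)
```

A positive log Bayes factor favors linking the gene to the phenotype and a negative one favors leaving it out, so under this reading the selection follows from model comparison rather than from a ranking cut-off.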
6
Collinearity Elimination via Network Learning
[Figure: network diagrams over the nodes Pheno, G1, G2, Gp, Gq, ..., GN, linked by an arrow labeled "collinearity elimination".]
7
Sample Classification
• The phenotype variable is independent of the blue genes, given the green genes.
• Technically, the green genes form the Markov blanket of the phenotype variable, and they are the signature genes used for phenotype determination.
• Tissue classification:
[Figure: Bayesian network with nodes Pheno, G1, G2, Gp, Gq, GN.]
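The Markov blanket of a node in a directed acyclic graph consists of its parents, its children, and the other parents of its children. A minimal sketch of extracting it is given below; the arc directions in the example network are hypothetical, since the figure did not survive, and `markov_blanket` is an illustrative name.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target`.

    `parents` maps every node of the DAG to the set of its parent nodes.
    """
    children = {node for node, ps in parents.items() if target in ps}
    blanket = set(parents.get(target, set())) | children
    for child in children:
        blanket |= set(parents[child])  # co-parents of each child
    blanket.discard(target)
    return blanket

# Hypothetical arcs: Pheno -> G1, Pheno -> G2, Gq -> G2, G1 -> Gp, Gp -> GN.
parents = {
    "Pheno": set(),
    "G1": {"Pheno"},
    "G2": {"Pheno", "Gq"},
    "Gp": {"G1"},
    "Gq": set(),
    "GN": {"Gp"},
}
signature = markov_blanket(parents, "Pheno")
```

In this toy network the signature is G1, G2, and Gq. Gp and GN fall outside the blanket, so they carry no further information about the phenotype once the signature genes are observed.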
8
Algorithm Summary
[Flowchart: Gene Selection by Bayes Factor; Collinearity Elimination; Sample Classification; Optimize Performance; Optimize Hyperparameters (sensitivity analysis).]
9
Discriminate Lung Carcinoma Subtypes
• Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are major subtypes of lung cancer:
– AC and SCC are distinct in survival, chances of metastasis, and responses to chemotherapy and targeted therapy.
– Physicians lack confidence in correct recognition when there are multiple primary carcinomas.
• Training:
– 58 ACs and 53 SCCs.
– 77 genes selected in the network.
– 25 signature genes.
10
Bayesian Network for Lung Carcinoma
11
Large-Scale Testing on Independent Samples
• 422 samples (232 ACs and 190 SCCs) aggregated from 7 cohorts (including Caucasians, African-Americans, Chinese).
• Accuracy: AUROC = 95.2%.
[Figure: ROC curves, sensitivity versus 1-specificity. Proposed Bayes Net (95.2%).]
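For reference, AUROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores; the function name `auroc` and the example values are illustrative and unrelated to the reported results.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three positives and three negatives.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auroc(scores, labels)  # 8 of the 9 pairs are ordered correctly
```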
12
Comparisons with Other Popular Methods
• Higher classification accuracy.
• Small-sized signature to avoid overfitting.

Method                                        Testing AUROC   p-value   # signature genes
Bayesian Network                              95.2%           ---       25
PCA/LDA                                       91.2%           0.0047    13
PAM (Tibshirani et al., PNAS 2002)            91.0%           0.0014    77
Weighted Voting (Golub et al., Science 1999)  93.4%           0.6240    800
13
KRT6 Family Characterizes the Lung Carcinoma Discrimination
14
KRT6 Family Characterizes the Lung Carcinoma Discrimination
• Keratin-6 family genes (KRT6A, KRT6B, KRT6C) are important for distinguishing lung cancer subtypes.
– Accounting for 95% of the accuracy of the whole 25-gene signature.
– Located on chromosome 12q12-q13.
– A nonlinear, concave discriminative surface.
15
Verification by Chr12q12-q13 Aberrations
• Investigate DNA copy number changes in comparative genomic hybridization (CGH) arrays.
– 12 ACs and 13 SCCs from Vrije University Medical Center, the Netherlands.
– Treat the average CGH values of genes occupying q12, q13, and q12-13, respectively, as three features to construct a Naïve Bayes classifier.
– A dumbbell discriminative surface achieves 80% classification accuracy.
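A three-feature Naïve Bayes classifier of this kind can be sketched as follows. The slides do not state the likelihood model, so the Gaussian class-conditional densities are an assumption, as are the function names and the synthetic feature values, which are not real CGH measurements.

```python
import math

def fit_gaussian_nb(X, y):
    """Fit per-class priors and per-feature Gaussian parameters."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small variance floor keeps the densities well defined.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(zip(*rows), means)]
        model[c] = (math.log(n / len(y)), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances))
    return max(model, key=log_posterior)

# Synthetic features (hypothetical): mean CGH value over q12, q13, q12-13.
X = [[0.1, 0.0, 0.05], [0.0, 0.1, 0.05], [-0.1, 0.0, -0.05],
     [0.9, 1.0, 0.95], [1.1, 0.9, 1.0], [1.0, 1.1, 1.05]]
y = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]
model = fit_gaussian_nb(X, y)
label_low = predict(model, [0.05, 0.05, 0.05])
label_high = predict(model, [1.0, 1.0, 1.0])
```

The class assignments in the toy data are arbitrary and say nothing about which subtype shows copy number gain in the real cohort.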
16
Conclusion
• Reverse engineer regulatory network information for tissue classification.
• Adopt the systems biology approach to infer the gene dependency network.
– Select genes by Bayes factor.
– Eliminate collinearity via network learning.
– Integrate gene selection and the classification model in a single Bayesian network framework.
• Demonstrate the promising translational value of the systems biology approach in clinical studies.