Chip arrays and gene expression data
Motivation
With chip array technology, one can measure the expression of all genes at once (even all exons). This makes it possible to answer questions such as:
1. Which genes are expressed in a muscle cell?
2. Which genes are expressed in the mother during the first week of pregnancy? In the new baby?
3. Which genes are expressed in cancer?
4. If one mutates a TF, which genes are no longer expressed following this change?
5. Which genes are not expressed in the brain of a baby with an intellectual disability?
6. Which genes are expressed when a person is asleep versus when the same person is awake?
Analyzing Output
Output
               w.t    Brain tumor (males)    Brain tumor (females)
Gene 1
Gene 2
Gene 3
...
Gene 25,000
Each cell is either an absolute number or a relative one, depending on the technology used.
Repeats
               w.t    Brain tumor (male1)    Brain tumor (male2)    Brain tumor (female1)
Gene 1
Gene 2
Gene 3
...
Gene 25,000
A repeat can be either a technical repeat (the same sample on a different chip) or a "real" biological repeat (a different sample).
Expression profile
      wt1   wt2   wt3   wt4   bt1   bt2   bt3   bt4
g1      4     3     5     4    15    16    17    23
g2      7     5     4     6     6     3     7     9
g3      2     3     2     5    25    26    30    60
Genes 1 and 3 show the same trend (go both high under the same conditions). That is: they have the same expression profile.
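One way to put a number on "the same trend" is a correlation coefficient between two profiles. The slides do not name a measure, so the choice of Pearson correlation here is an assumption, and the function name `pearson` is ours. A minimal sketch using the values from the table above:

```python
from math import sqrt

# Expression values read from the table above (wt1..wt4, bt1..bt4).
profiles = {
    "g1": [4, 3, 5, 4, 15, 16, 17, 23],
    "g2": [7, 5, 4, 6, 6, 3, 7, 9],
    "g3": [2, 3, 2, 5, 25, 26, 30, 60],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r13 = pearson(profiles["g1"], profiles["g3"])  # genes with the same trend
r12 = pearson(profiles["g1"], profiles["g2"])  # genes with different trends
print(f"g1 vs g3: {r13:.2f}   g1 vs g2: {r12:.2f}")
```

A value close to 1 means the two genes rise and fall together across the conditions; g1 and g3 score much higher than g1 and g2.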
Clustering
      wt1   wt2   wt3   wt4   bt1   bt2   bt3   bt4
g1      4     3     5     4    15    16    17    23
g2      7     5     4     6     6     3     7     9
g3      2     3     2     5    25    26    30    60
In general, we want to find all the genes that share the same expression profile → suggestive of a functional linkage.
There are clustering algorithms, which do exactly that.
Clustering
      wt1   wt2   wt3   wt4   bt1   bt2   bt3   bt4
g1      4     3     5     4     0    22     0    23
g2      7     5     4     6     0     8     0     9
g3      2     3     2     5    60     1    66     1
Clustering of the conditions can suggest two types of brain tumor (bt)
Clustering
      wt1   wt2   wt3   wt4   bt1   bt2   bt3   bt4
g1      4     3     5     4     0    22     0    23
g2      7     5     4     6     0     8     0     9
g3      2     3     2     5     6     1     7     3
Bi-clustering: both on the conditions and the genes.
Applications
Consider growing E. coli at increasing glucose concentrations and running a chip array at each concentration.
One can potentially discover all genes in the glucose pathway.
Knocking out a gene → discover all genes that interact with it.
Applications
Analyzing expression of genes can help reveal the gene network of a given organism.
Gene network
Clinical
Does someone have a brain tumor?
New sample:
g1     11
g2      4
g3      0
Reference data:
      wt1   wt2   wt3   wt4   bt1   bt2   bt3   bt4
g1      4     3     5     4     0    22     0    23
g2      7     5     4     6     0     8     0     9
g3      2     3     2     5    60     1    66     1
MammaPrint
Used to assess the risk that a breast tumor will spread to other parts of the body (metastasis). It is based on the well-known 70-gene breast cancer gene signature.
In February 2007, the FDA cleared the MammaPrint test for use in the U.S.
Sequence by hybridization
It was thought that the following procedure could work for sequencing a genome:
1. Make a chip containing all x-mers (e.g., x = 25).
2. Hybridize a genome to the chip.
3. By analyzing all the hybridizations with their overlaps, assemble the genome.
Problem: it doesn’t work.
ChIP-on-chip : A method for measuring protein-DNA interaction.
Proteins that bind DNA include:
Those responsible for transcription regulation
Transcription factors (TFs)
Replication proteins
Histones…
ChIP-on-chip: the first "ChIP" stands for Chromatin ImmunoPrecipitation and the second "chip" refers to DNA microarrays.
The method is used mostly to detect TF binding sites.
Tiling arrays
Here the chip array should include not only protein coding genes but also control regions, or simply – the entire genome.
Deep sequencing reads
Yoder-Himes D.R. et al. PNAS (2009)
Machine learning
Learning mode on. Bioinfo is great.
Clustering
Clustering (of expression data)
UPGMA is one such direct method, receiving as input a distance matrix and giving as output an ultrametric tree.
It was suggested by Sokal and Michener (1958).
Clustering (of expression data)
Often, there is a one-to-one transformation between the data and points in space.
For example, expression of all genes under a specific condition is a point:
              Condition 1
Gene 1             5
Gene 2             7
Gene 3             2
...
Gene 20000        54
(5, 7, 2, …, 54) is a point in a space of dimension 20,000.
Clustering (of expression data)
Another example, each expression profile is a point in a space whose dimension is the number of conditions
          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1         50            20             4            33
(50, 20, 4, 33) is a point in a space of dimension 4.
In space, each point is a gene.
[Figure: genes (e.g., g1) plotted as points; axes: Condition 1 and Condition 2]
Our goal will be to cluster genes
Genes that are in the same cluster (show similar patterns of expression) are likely to be functionally related.
Distance between two expression profiles
The Euclidean distance:
          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1         50            20             4            33
Gene 2         30            20             3            31
$$d_{1,2} = \sqrt{(50-30)^2 + (20-20)^2 + (4-3)^2 + (33-31)^2} = \sqrt{405} \approx 20.12$$
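The same calculation can be written as a few lines of code. This is a minimal sketch; the function name `euclidean` is ours, and the two profiles are the Gene 1 and Gene 2 rows of the table above.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

gene1 = [50, 20, 4, 33]
gene2 = [30, 20, 3, 31]

d12 = euclidean(gene1, gene2)
print(round(d12, 2))
```

In Python 3.8 and later, the standard library function `math.dist(gene1, gene2)` computes the same quantity.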
Distance between two expression profiles
We can compute the distances between each pair of expression profiles and obtain a distance table.
          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1         50            20             4            33
Gene 2         30            20             3            31
Gene 3         30            20             3            31
Gene 4         30            20             3            31
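Building the distance table amounts to applying the distance function to every pair of profiles. A minimal sketch, assuming Python 3.8+ for `math.dist`; the dictionary layout and the name `distance_table` are illustrative choices, not something the slides prescribe.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# The four profiles shown in the table above.
profiles = {
    "Gene 1": [50, 20, 4, 33],
    "Gene 2": [30, 20, 3, 31],
    "Gene 3": [30, 20, 3, 31],
    "Gene 4": [30, 20, 3, 31],
}

# One entry per unordered pair of genes (the upper triangle of the table).
distance_table = {
    (a, b): dist(profiles[a], profiles[b])
    for a, b in combinations(profiles, 2)
}

for (a, b), d in distance_table.items():
    print(f"{a} - {b}: {d:.2f}")
```

With n genes the table has n(n-1)/2 entries, so for 20,000 genes it holds about 200 million distances.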
The distance table
      g1    g2    g3    g4    g5    g6    g7    g8
g1     0    32    48    51    50    48    98   148
g2           0    26    34    29    33    84   136
g3                 0    42    44    44    92   152
g4                       0    44    38    86   142
g5                             0    24    89   142
g6                                   0    90   142
g7                                         0   148
g8                                               0
Starting tree
The smallest distance in the table is 24, between g5 and g6, so these two genes are joined first.
We call the parent node of g5 and g6 "g56".
Removing the g5 and g6 rows and columns, and adding the g56 row and column:
      g1    g2    g3    g4   g56    g7    g8
g1     0    32    48    51     ?    98   148
g2           0    26    34     ?    84   136
g3                 0    42     ?    92   152
g4                       0     ?    86   142
g56                            0  89.5   142
g7                                   0   148
g8                                         0
Computing distances
      g1    g2    g3    g4    g5    g6    g7    g8
g1     0    32    48    51    50    48    98   148
$$D(g_{56}, g_1) = \frac{1}{2} D(g_5, g_1) + \frac{1}{2} D(g_6, g_1) = 49$$
The updated table. Starting the second iteration…
      g1    g2    g3    g4   g56    g7    g8
g1     0    32    48    51    49    98   148
g2           0    26    34    31    84   136
g3                 0    42    44    92   152
g4                       0    41    86   142
g56                            0  89.5   142
g7                                   0   148
g8                                         0
Building the tree - Continued
The smallest distance is now 26, between g2 and g3. We call the parent node of g2 and g3 "g23".
Tree so far: g56 = (g5, g6); g23 = (g2, g3).
Computing distances
      g1    g2    g3    g4   g56    g7    g8
g56   49    31    44    41     0  89.5   142
$$D(g_{23}, g_{56}) = \frac{1}{2} D(g_2, g_{56}) + \frac{1}{2} D(g_3, g_{56}) = 37.5$$
The updated table. Starting a new iteration…
      g1   g23    g4   g56    g7    g8
g1     0    40    51    49    98   148
g23          0    38  37.5    88   144
g4                 0    41    86   142
g56                      0  89.5   142
g7                             0   148
g8                                   0
Tree
Tree so far: g56 = (g5, g6); g23 = (g2, g3); g2356 = (g23, g56).
Computing distances
      g1   g23    g4   g56    g7    g8
g1     0    40    51    49    98   148
$$D(g_{2356}, g_1) = \frac{1}{2} D(g_{23}, g_1) + \frac{1}{2} D(g_{56}, g_1) = 44.5$$
Starting a new iteration…
         g1   g2356     g4      g7     g8
g1        0    44.5     51      98    148
g2356             0   39.5   88.75    143
g4                       0      86    142
g7                               0    148
g8                                      0
Building the tree
Tree so far: g56 = (g5, g6); g23 = (g2, g3); g2356 = (g23, g56); g23456 = (g2356, g4).
Computing distances
         g1   g2356     g4      g7     g8
g1        0    44.5     51      98    148
$$D(g_{23456}, g_1) = \frac{4}{5} D(g_{2356}, g_1) + \frac{1}{5} D(g_4, g_1) = 45.8$$
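All the distance updates in this walkthrough follow one rule: the distance from a newly joined node to any other node is the average of its two children's distances, weighted by the number of genes under each child. A minimal sketch of that rule (the function name `merged_distance` is ours, not from the slides):

```python
def merged_distance(d_a, d_b, size_a, size_b):
    """Size-weighted average used when clusters a and b are joined.

    d_a, d_b: distances from a and from b to some third node.
    size_a, size_b: number of genes under a and under b.
    """
    return (size_a * d_a + size_b * d_b) / (size_a + size_b)

# g5 and g6 are single genes, so the weights are 1/2 and 1/2.
d_g56_g1 = merged_distance(50, 48, 1, 1)

# g2356 holds four genes and g4 holds one, so the weights are 4/5 and 1/5.
d_g23456_g1 = merged_distance(44.5, 51, 4, 1)

print(d_g56_g1, round(d_g23456_g1, 1))
```

Weighting by cluster size is what makes every original gene contribute equally to the average, however the clusters were built up.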
Starting an additional iteration…
          g1   g23456     g7      g8
g1         0     45.8     98     148
g23456              0   88.2   142.8
g7                         0     148
g8                                 0
Constructing the tree
Tree so far: g56 = (g5, g6); g23 = (g2, g3); g2356 = (g23, g56); g23456 = (g2356, g4); g123456 = (g1, g23456).
One more iteration…
          g123456       g7       g8
g123456         0   89.833   143.66
g7                       0      148
g8                                0
Reconstructing the tree
Tree so far: g56 = (g5, g6); g23 = (g2, g3); g2356 = (g23, g56); g23456 = (g2356, g4); g123456 = (g1, g23456); g1234567 = (g123456, g7).
The new table
           g1234567         g8
g1234567          0   144.2857
g8                           0
Resulting tree
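Final tree: g56 = (g5, g6); g23 = (g2, g3); g2356 = (g23, g56); g23456 = (g2356, g4); g123456 = (g1, g23456); g1234567 = (g123456, g7); the root joins g1234567 and g8.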
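The whole walkthrough can be replayed in code. The following is a minimal sketch of UPGMA under stated assumptions: it takes the upper triangle of the original distance table, repeatedly joins the closest pair, and updates distances with the size-weighted average. The function name `upgma` and the naming of joined nodes as "(a,b)" are ours (the slides call them g56, g23, and so on), and ties between equal distances are broken arbitrarily.

```python
labels = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Upper triangle of the original distance table, row by row (diagonal omitted).
upper = [
    [32, 48, 51, 50, 48, 98, 148],
    [26, 34, 29, 33, 84, 136],
    [42, 44, 44, 92, 152],
    [44, 38, 86, 142],
    [24, 89, 142],
    [90, 142],
    [148],
]

def upgma(labels, upper):
    """Return the list of joins as (child_a, child_b, distance), in order."""
    # dist maps an unordered pair of node names to their distance.
    dist = {}
    for i, row in enumerate(upper):
        for k, d in enumerate(row):
            dist[frozenset((labels[i], labels[i + 1 + k]))] = float(d)
    size = {name: 1 for name in labels}  # genes under each current node
    merges = []
    while len(size) > 1:
        pair = min(dist, key=dist.get)   # closest pair of current nodes
        a, b = sorted(pair)
        d_ab = dist.pop(pair)
        new = f"({a},{b})"
        for c in size:
            if c in pair:
                continue
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted average of the children's distances to c.
            dist[frozenset((new, c))] = (
                size[a] * d_ac + size[b] * d_bc
            ) / (size[a] + size[b])
        size[new] = size.pop(a) + size.pop(b)
        merges.append((a, b, d_ab))
    return merges

merges = upgma(labels, upper)
for a, b, d in merges:
    print(f"join {a} and {b} at distance {d:.4f}")
```

The join order and distances it prints can be compared step by step with the tables above.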
From tree to clusters
[Tree leaves, left to right: g5, g6, g2, g3, g4, g1, g7, g8]
If we want two clusters, we cut the tree just below the root and obtain g8 versus g1-7.
If we want 3 clusters, we cut one join lower and obtain g8, g7, and g1-6.
The 4 clusters are: g8, g7, g1, and g23456.
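Cutting the tree into k clusters is the same as replaying only the first n - k joins, where n is the number of genes, and reading off the groups that exist at that point. A minimal sketch, with the join list copied from the walkthrough; the name "root" for the last node and the function name `cut` are ours.

```python
leaves = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Joins in the order they were made: (child, child, parent).
joins = [
    ("g5", "g6", "g56"),
    ("g2", "g3", "g23"),
    ("g23", "g56", "g2356"),
    ("g2356", "g4", "g23456"),
    ("g1", "g23456", "g123456"),
    ("g123456", "g7", "g1234567"),
    ("g1234567", "g8", "root"),  # "root" is a label chosen here
]

def cut(leaves, joins, k):
    """Return k clusters by applying only the first len(leaves) - k joins."""
    clusters = {leaf: {leaf} for leaf in leaves}
    for a, b, parent in joins[: len(leaves) - k]:
        clusters[parent] = clusters.pop(a) | clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())

for k in (2, 3, 4):
    print(k, cut(leaves, joins, k))
```

This works because the joins are listed from the smallest distance to the largest, so undoing the last ones corresponds to cutting the tree near its root.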
Classification
          Condition 1   Condition 2
Gene 1         50            20
Gene 2         30            20
Gene 3         30            20
Gene 4         30            20
[Figure: samples plotted as points; axes: Gene 1 and Gene 2; "?" marks the sample to classify]
If red = brain tumor and yellow = healthy, do I have a brain tumor?
SVM = support vector machine
In SVM we find a (hyper)plane that divides the space in two.
SVM: confidence in classification
The further the point is from the separating (hyper)plane, the more confident we are in the classification.
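The idea of "distance from the plane as confidence" can be sketched in a few lines. Note the assumptions: the separating line below (weights `w` and offset `b`) is picked by hand for illustration and is not the result of training an SVM, and the colors assigned to each side are arbitrary.

```python
from math import sqrt

# Hypothetical separating line w.x + b = 0 in the (Gene 1, Gene 2) plane.
# A real SVM would learn w and b from the training data.
w = (1.0, -1.0)
b = 0.0

def signed_distance(x):
    """Signed distance of point x from the line; the sign gives the side."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / sqrt(sum(wi * wi for wi in w))

def classify(x):
    """Return (label, confidence), where confidence is the distance from the line."""
    d = signed_distance(x)
    label = "red" if d > 0 else "yellow"
    return label, abs(d)

far_point = classify((5.0, 1.0))   # well inside the "red" side
near_point = classify((2.0, 3.0))  # close to the line, on the "yellow" side
print(far_point, near_point)
```

Both points receive a label, but the one lying further from the line is the one we would trust more.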
SVM: cannot always perfectly classify
Sometimes we cannot perfectly separate the training data. In this case, we will find the best separation.
KNN = k nearest neighbors
KNN is another method for classification. For each point, it looks at its k nearest neighbors. For example, with k=3 the method looks at the point's 3 nearest neighbors to decide how to classify it. If the majority are "red", it classifies the point as red.
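A minimal sketch of KNN as described above, assuming Python 3.8+ for `math.dist`. The training points are made up for illustration and do not come from the slides; the function name `knn_classify` is ours.

```python
from collections import Counter
from math import dist

# Hypothetical labeled samples: ((Gene 1, Gene 2), label).
training = [
    ((1.0, 1.0), "yellow"),
    ((1.5, 2.0), "yellow"),
    ((2.0, 1.0), "yellow"),
    ((5.0, 5.0), "red"),
    ((6.0, 5.5), "red"),
    ((5.5, 6.5), "red"),
]

def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

label_a = knn_classify((5.0, 6.0), training, k=3)
label_b = knn_classify((1.2, 1.5), training, k=1)
print(label_a, label_b)
```

Choosing an odd k avoids tied votes when there are two classes.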
KNN is better than SVM for the above case.
KNN - exercise
In the above example, how will the point be classified by KNN with k=1? By SVM?
Training dataset
The red and yellow points are used to train the classifier.
In general, the more training data one has, the better the classifier will perform.
Test dataset
Usually, some points for which we know the answer are withheld from the classifier and are used to TEST its performance.
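The train/test idea can be sketched as follows. Everything here is an illustrative assumption: the samples are simulated, the 30/10 split is arbitrary, and a 1-nearest-neighbor rule stands in for whatever classifier is being evaluated.

```python
import random
from math import dist

def nearest_neighbor_label(query, training):
    """Label of the single closest training point (KNN with k=1)."""
    return min(training, key=lambda item: dist(query, item[0]))[1]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated samples: two groups of 20 points in the (Gene 1, Gene 2) plane.
samples = [((rng.gauss(2, 0.5), rng.gauss(2, 0.5)), "yellow") for _ in range(20)]
samples += [((rng.gauss(6, 0.5), rng.gauss(6, 0.5)), "red") for _ in range(20)]

# Shuffle, then hold out the last 10 samples as the test set.
rng.shuffle(samples)
train, test = samples[:30], samples[30:]

# The classifier only ever sees `train`; `test` is used to score it.
correct = sum(nearest_neighbor_label(x, train) == label for x, label in test)
accuracy = correct / len(test)
print(f"accuracy on held-out data: {accuracy:.2f}")
```

Scoring on points the classifier has never seen gives a fairer estimate than scoring on the training points themselves.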
Decision tree
Age       Gene1   Gene2   Smoker   Operation
<20       high    high    yes      yes
<20       high    high    yes      yes
<20       low     low     no       no
[20,40]   low     high    yes      yes
[20,40]   high    high    no       yes
[20,40]   high    low     yes      no
>40       low     low     yes      no
>40       high    low     no       no
>40       low     high    no       no
Decision tree
Age > 40?
  Yes: Operation = no
  No: Gene 2?
    high: Operation = yes
    low: Operation = no
Decision trees are automatically built from training data and are used for classification.
They also tell us which features are most important.
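The tree shown above is small enough to write out by hand and check against the training table. This sketch only encodes that tree; it does not build one from data. The function name `predict_operation` is ours, and the first three rows are taken to be the under-20 age group.

```python
# Rows of the training table: (age group, Gene1, Gene2, smoker, operation).
rows = [
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "low", "low", "no", "no"),
    ("[20,40]", "low", "high", "yes", "yes"),
    ("[20,40]", "high", "high", "no", "yes"),
    ("[20,40]", "high", "low", "yes", "no"),
    (">40", "low", "low", "yes", "no"),
    (">40", "high", "low", "no", "no"),
    (">40", "low", "high", "no", "no"),
]

def predict_operation(age, gene1, gene2, smoker):
    """Follow the tree: first split on age, then on Gene 2."""
    if age == ">40":
        return "no"
    return "yes" if gene2 == "high" else "no"

agreement = [predict_operation(*row[:4]) == row[4] for row in rows]
print(f"{sum(agreement)} of {len(rows)} training rows classified correctly")
```

Gene1 and Smoker are never consulted, which is how the tree tells us that Age and Gene 2 are the informative features for this dataset.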
Voting
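Voting uses an array of machine learning algorithms and chooses the classification suggested by most classifiers.
Train: the training data that need a classification algorithm (Yes/No) are given to several classifiers: decision trees, KNN, and SVM.
Test: for a new datum, the three classifiers answer No, Yes, and Yes, so the voted classification is YES.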
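The voting step itself is a simple majority count. A minimal sketch, where the three answers are the ones from the example above and the function name `vote` is ours:

```python
from collections import Counter

def vote(predictions):
    """Return the classification suggested by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Answers of three classifiers for one new datum.
answers = ["No", "Yes", "Yes"]
decision = vote(answers)
print(decision)
```

With two classes, using an odd number of classifiers guarantees that the vote cannot end in a tie.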
Classification is also used outside the scope of bioinformatics, for example in digit recognition.
The distance between the query and each point in the dataset is computed. Based on the identity of the k nearest members, the digit is identified.
*More advanced algorithms allow rotation and enlargement of the digit to be classified.
UPGMA - exercise
In the example below, how will the points be clustered using UPGMA?
Initial table:
      x1    x2    x3    x4
x1     0     2    12    30
x2           0     8    10
x3                 0     4
x4                       0
After joining x1 and x2:
      x12   x3    x4
x12    0    10    20
x3           0     4
x4                 0
After joining x3 and x4:
      x12   x34
x12    0    15
x34          0
Dataset sizes
A classifier is needed to detect “Pupko disease” based on gene expression.
Pupko disease is extremely rare (say, it affects 1 out of 100,000 people).
A classifier was trained on a large volume of samples in which all cases are negative. On a test dataset it correctly classified 99.9% of the cases…
This is no great achievement: the fraction of positive cases in the test data is only ~0.01%.
Take-home message: (1) It is better to train a classifier on roughly equal numbers of "positive" and "negative" cases.
(2) Reporting only the "% accurate classifications" is not enough. One has to report FP, FN, TP, and TN (in this example, all positive cases are false negatives: a false negative rate of 100%).
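The point can be made concrete by counting the four outcomes for a classifier that always answers "negative". The numbers below are hypothetical (one positive case among 10,000 test samples) and are not taken from the slides; the function name `confusion_counts` is ours.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    tn = sum((not a) and (not p) for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    return tp, fp, tn, fn

# Hypothetical test set: 1 sick person among 10,000 (0.01% positives).
actual = [True] + [False] * 9999
# A classifier that has only ever seen negatives and always says "healthy".
predicted = [False] * 10000

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
false_negative_rate = fn / (fn + tp)
print(f"accuracy = {accuracy:.4f}, false negative rate = {false_negative_rate:.0%}")
```

The accuracy looks excellent, yet the classifier misses every single patient, which only the separate counts reveal.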
Exercises - examples
In clustering by UPGMA, I joined gene X and gene Y. The distance between gene X and gene T was 7, and the distance between gene Y and gene T was 9. Which of the following statements is correct?
a. The distance between the group that joins genes X and Y and gene T is 8.
b. The distance between the union of X and Y and gene T cannot be computed, because the distance between X and Y is not given.
c. The distance between gene X and gene Y is smaller than 7.
d. a + b.
e. a + c.
f. b + c.
g. a + b + c.
h. None of the answers is correct.
Exercises - examples
23. Which of the following statements is correct?
a. In SVM, the smaller the distance between the point to be classified and the separating surface, the smaller the chance that the classification is wrong.
b. In SVM, all points of type A are always on one side and all points of type B are on the other side.
c. An SVM can be developed that classifies proteins as transmembrane or not.
d. None of the answers is correct.
Exercises - examples
24. Given the following figure:
[Figure: black and white points plotted by Gene 1 and Gene 2; "?" marks the point to classify]
Which of the following statements is correct?
a. According to (linear) SVM, the point with the question mark will be classified as a black point.
b. According to KNN with the number of neighbors equal to one, the point with the question mark will be classified as a white point.
c. a + b
d. None of the answers is correct.
Legionella pneumophila case study
How did it all begin?
Legionnaires' disease nowadays
Legionella pneumophila
Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology
Identifying the effectors
The features:
Sequence homology
Homology to host proteins
Regulatory elements
Genome proximity to other effectors
Secretion signal
Abundance in Metazoa / Bacteria
GC content
The effectors machine
The big picture
Features: similarity to known effectors, similarity to host proteins, regulatory elements, G-C content, secretory signals, abundance in Metazoa/Bacteria, genome arrangement
Feature selection
Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting
Other elements of the diagram: prior knowledge, validated effectors, non-effectors, trained model, unclassified genes, predicted effectors, predicted non-effectors, experimental validation, newly validated effectors
Does it really work??
Machine learning