Chip arrays and gene expression data. Motivation

Motivation

With chip array technology, one can measure the expression of all genes at once (even of all exons). This makes it possible to answer questions such as:

1. Which genes are expressed in a muscle cell?

2. Which genes are expressed during the first week of pregnancy in the mother? In the new baby?

3. Which genes are expressed in cancer?

4. If one mutates a TF, which genes are no longer expressed following this change?

5. Which genes are not expressed in the brain of a baby with an intellectual disability?

6. Which genes are expressed when one is asleep versus when the same person is awake?

Analyzing Output

Output

              w.t    Brain tumor (males)    Brain tumor (females)
Gene 1
Gene 2
Gene 3
...
Gene 25,000

Each cell is either an absolute number or a relative one, depending on the technology used.

Repeats

              w.t    Brain tumor (male1)    Brain tumor (male2)    Brain tumor (female1)
Gene 1
Gene 2
Gene 3
...
Gene 25,000

A repeat can be either the same sample on a different chip, or a “real” biological repeat, that is, a different sample.

Expression profile

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60

Genes 1 and 3 show the same trend: both are highly expressed under the same conditions. That is, they have the same expression profile.
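One way to make "the same trend" quantitative is to correlate the profiles. The slides do not name a similarity measure here, so the use of Pearson correlation below is an assumption; the values are those of the table above. A minimal sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Expression values across wt1..wt4, bt1..bt4, as read from the table above.
profiles = {
    "g1": [4, 3, 5, 4, 15, 16, 17, 23],
    "g2": [7, 5, 4, 6, 6, 3, 7, 9],
    "g3": [2, 3, 2, 5, 25, 26, 30, 60],
}

r13 = pearson(profiles["g1"], profiles["g3"])
r12 = pearson(profiles["g1"], profiles["g2"])
print(f"g1 vs g3: {r13:.2f}")
print(f"g1 vs g2: {r12:.2f}")
```

The correlation between g1 and g3 comes out much higher than between g1 and g2, which matches the visual impression from the table.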

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60

In general, we want to find all the genes that share the same expression profile, which is suggestive of a functional linkage.

Clustering algorithms do exactly that.

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5   60    1   66    1

Clustering of the conditions can suggest two types of brain tumor (bt)

Clustering

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5    6    1    7    3

Bi-clustering: clustering both the conditions and the genes.

Applications

Consider growing E. coli at increasing glucose concentrations and running a chip array at each concentration.

One can potentially discover all genes in the glucose pathway.

Knocking out a gene → discover all genes that interact with it.

Applications

Analyzing expression of genes can help reveal the gene network of a given organism.

Gene network

Clinical

Does someone have a brain tumor?

New sample:

g1   11
g2    4
g3    0

Reference samples:

      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5   60    1   66    1

MammaPrint

Used to assess the risk that a breast tumor will spread to other parts of the body (metastasis). It is based on the well-known 70-gene breast cancer gene signature.

In February 2007, the FDA cleared the MammaPrint test for use in the U.S.

Sequence by hybridization

It was thought that the following procedure could work for sequencing a genome:

1. Make a chip containing all x-mers (e.g., x = 25).

2. Hybridize a genome to the chip.

3. By analyzing all the hybridizations with their overlaps, assemble the genome.

Problem: it doesn’t work.

ChIP-on-chip: a method for measuring protein-DNA interactions.

Proteins that bind DNA include:

Those responsible for transcription regulation

Transcription factors (TFs)

Replication proteins

Histones…

ChIP-on-chip: the first “ChIP” stands for Chromatin ImmunoPrecipitation, and the second “chip” refers to the DNA microarray.

The method is used mostly to detect TF binding sites.

Tiling arrays

Here the chip array should include not only protein-coding genes but also control regions, or simply the entire genome.

Deep sequencing reads

Yoder-Himes D.R. et al. PNAS (2009)

Machine learning

Learning mode on. Bioinfo is great.

Clustering

Clustering (of expression data)

UPGMA is one such direct method, receiving as input a distance matrix and giving as output an ultrametric tree.

It was suggested by Sokal and Michener (1958).


Often, there is a one-to-one transformation between the data and points in space.

For example, expression of all genes under a specific condition is a point:

              Condition 1
Gene 1             5
Gene 2             7
Gene 3             2
...
Gene 20000        54

(5, 7, 2, …, 54) is a point in a space of dimension 20,000.


Another example: each expression profile is a point in a space whose dimension is the number of conditions.

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33

(50, 20, 4, 33) is a point in a space of dimension 4.

In space: each point is a gene

[Figure: scatter plot of genes (e.g., g1); axes: Condition 1, Condition 2]

Our goal will be to cluster genes

Genes that are in the same cluster (show similar patterns of expression) are likely to be functionally related.

Distance between two expression profiles

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33
Gene 2        30            20             3            31

The Euclidean distance:

$d_{1,2} = \sqrt{(50-30)^2 + (20-20)^2 + (4-3)^2 + (33-31)^2}$
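The same computation can be sketched in a few lines of Python. The function name `euclidean` is only illustrative; the two profiles are those of Gene 1 and Gene 2 above.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

gene1 = [50, 20, 4, 33]
gene2 = [30, 20, 3, 31]

d12 = euclidean(gene1, gene2)   # sqrt(400 + 0 + 1 + 4) = sqrt(405)
print(round(d12, 2))            # 20.12
```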

Distance between two expression profiles

We can compute the distances between each pair of expression profiles and obtain a distance table.

          Condition 1   Condition 2   Condition 3   Condition 4
Gene 1        50            20             4            33
Gene 2        30            20             3            31
Gene 3        30            20             3            31
Gene 4        30            20             3            31

The distance table

      g1   g2   g3   g4   g5   g6   g7   g8
g1     0   32   48   51   50   48   98  148
g2          0   26   34   29   33   84  136
g3               0   42   44   44   92  152
g4                    0   44   38   86  142
g5                         0   24   89  142
g6                              0   90  142
g7                                   0  148
g8                                        0


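Starting tree

The smallest distance in the table is D(g5, g6) = 24, so g5 and g6 are joined first. We call the parent node of g5 and g6 “g56”.

Tree so far: g56 = (g5, g6)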

Removing the g5 and g6 rows and columns, and adding the g56 row and column

      g1   g2   g3   g4   g56    g7    g8
g1     0   32   48   51     ?    98   148
g2          0   26   34     ?    84   136
g3               0   42     ?    92   152
g4                    0     ?    86   142
g56                         0  89.5   142
g7                                0   148
g8                                      0

Computing distances

      g1   g2   g3   g4   g5   g6   g7   g8
g1     0   32   48   51   50   48   98  148

$D(g56, g1) = \frac{1}{2} D(g5, g1) + \frac{1}{2} D(g6, g1) = 49$

The updated table. Starting the second iteration…

      g1   g2   g3   g4   g56    g7    g8
g1     0   32   48   51    49    98   148
g2          0   26   34    31    84   136
g3               0   42    44    92   152
g4                    0    41    86   142
g56                         0  89.5   142
g7                                0   148
g8                                      0

Building the tree - Continued

The smallest distance is now D(g2, g3) = 26. We call the parent node of g2 and g3 “g23”.

Tree so far: g56 = (g5, g6); g23 = (g2, g3)

Computing distances

      g1   g2   g3   g4   g56    g7    g8
g56   49   31   44   41     0  89.5   142

$D(g23, g56) = \frac{1}{2} D(g2, g56) + \frac{1}{2} D(g3, g56) = 37.5$

The updated table. Starting a new iteration…

      g1   g23   g4   g56    g7    g8
g1     0    40   51    49    98   148
g23          0   38  37.5    88   144
g4                0    41    86   142
g56                     0  89.5   142
g7                            0   148
g8                                  0

Tree

Tree so far: g2356 = (g23, g56), where g23 = (g2, g3) and g56 = (g5, g6)

Computing distances

      g1   g23   g4   g56    g7    g8
g1     0    40   51    49    98   148

$D(g2356, g1) = \frac{1}{2} D(g23, g1) + \frac{1}{2} D(g56, g1) = 44.5$

Starting a new iteration…

        g1  g2356    g4     g7    g8
g1       0   44.5    51     98   148
g2356           0  39.5  88.75   143
g4                    0     86   142
g7                           0   148
g8                                 0

Building the tree

Tree so far: g23456 = (g2356, g4), where g2356 = (g23, g56), g23 = (g2, g3), and g56 = (g5, g6)

Computing distances

        g1  g2356    g4     g7    g8
g1       0   44.5    51     98   148

$D(g23456, g1) = \frac{4}{5} D(g2356, g1) + \frac{1}{5} D(g4, g1) = 45.8$
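The weights 1/2, 1/2 and 4/5, 1/5 used in these computations are instances of one size-weighted average. Written in general form, with |A| denoting the number of original genes in cluster A (a notation that the slides themselves do not use):

```latex
D(A \cup B, C) = \frac{|A|}{|A| + |B|}\, D(A, C) + \frac{|B|}{|A| + |B|}\, D(B, C)
```

For g2356 (four genes) and g4 (one gene) this gives the weights 4/5 and 1/5; for two single genes it gives 1/2 and 1/2.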

Starting an additional iteration…

         g1  g23456    g7     g8
g1        0    45.8    98    148
g23456            0  88.2  142.8
g7                      0    148
g8                             0

Constructing the tree

Tree so far: g123456 = (g1, g23456), where g23456 = (g2356, g4), g2356 = (g23, g56), g23 = (g2, g3), and g56 = (g5, g6)

One more iteration…

          g123456      g7       g8
g123456         0  89.833  143.667
g7                      0      148
g8                               0

Reconstructing the tree

Tree so far: g1234567 = (g123456, g7), where g123456 = (g1, g23456), g23456 = (g2356, g4), g2356 = (g23, g56), g23 = (g2, g3), and g56 = (g5, g6)

The new table

           g1234567        g8
g1234567          0  144.2857
g8                          0

Resulting tree

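In nested-parentheses form, the resulting tree is:

((((((g5, g6), (g2, g3)), g4), g1), g7), g8)

The internal nodes, from the leaves to the root, are g56, g23, g2356, g23456, g123456, and g1234567.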
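The whole procedure walked through above can be sketched compactly in Python. This is an illustrative implementation under stated assumptions, not the code behind the slides: the function name `upgma` and the data layout are made up for the example. The input is the original 8-gene distance table. Note that averaging 89 and 90 gives D(g56, g7) = 89.5, which is the value consistent with the later entry 88.75.

```python
from itertools import combinations

# Upper triangle of the original distance table, row by row (diagonal omitted).
labels = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
rows = [
    [32, 48, 51, 50, 48, 98, 148],
    [26, 34, 29, 33, 84, 136],
    [42, 44, 44, 92, 152],
    [44, 38, 86, 142],
    [24, 89, 142],
    [90, 142],
    [148],
]

def upgma(labels, rows):
    """Return the merges, in order, as (members_a, members_b, distance)."""
    # A cluster is a frozenset of gene names; dist is keyed by unordered cluster pairs.
    clusters = [frozenset([name]) for name in labels]
    dist = {}
    for i, row in enumerate(rows):
        for offset, d in enumerate(row):
            dist[frozenset([clusters[i], clusters[i + 1 + offset]])] = float(d)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = a | b
        merges.append((sorted(a), sorted(b), dist[frozenset([a, b])]))
        for c in clusters:
            if c not in (a, b):
                # Size-weighted average of the distances from the two merged clusters.
                dist[frozenset([merged, c])] = (
                    len(a) * dist[frozenset([a, c])] + len(b) * dist[frozenset([b, c])]
                ) / len(merged)
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges

merges = upgma(labels, rows)
for members_a, members_b, d in merges:
    print(members_a, "+", members_b, "at distance", round(d, 3))
```

The merge order it prints (g5+g6, g2+g3, g23+g56, then g4, g1, g7, g8) follows the iterations above.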

From tree to clusters

[Figure: the tree with leaves ordered g5, g6, g2, g3, g4, g1, g7, g8, and a horizontal cut line]

If we want two clusters, we cut the tree just below the root and obtain g8 versus g1-7.

If we want 3 clusters, we cut one level lower and obtain g8, g7, and g1-6.

If we want 4 clusters, we cut one level lower still. The 4 clusters are: g8, g7, g1, and g23456.
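Because UPGMA merges happen at non-decreasing distances, cutting the tree into k clusters amounts to replaying all but the last k-1 merges. A minimal sketch, where `cut_tree` and the `merge_order` encoding (one representative gene per merged cluster) are assumptions made for this example:

```python
leaves = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

# Merge order of the tree built above; each pair names one gene from each merged cluster.
merge_order = [
    ("g5", "g6"),   # g56
    ("g2", "g3"),   # g23
    ("g2", "g5"),   # g23 + g56 -> g2356
    ("g2", "g4"),   # g2356 + g4 -> g23456
    ("g1", "g2"),   # g1 + g23456 -> g123456
    ("g1", "g7"),   # g123456 + g7 -> g1234567
    ("g1", "g8"),   # root
]

def cut_tree(leaves, merge_order, k):
    """Replay merges until k clusters remain and return them as sorted lists."""
    cluster_of = {leaf: {leaf} for leaf in leaves}
    for a, b in merge_order[: len(leaves) - k]:
        merged = cluster_of[a] | cluster_of[b]
        for leaf in merged:
            cluster_of[leaf] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)

for k in (2, 3, 4):
    print(k, cut_tree(leaves, merge_order, k))
```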

Classification

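          Condition 1   Condition 2
Gene 1        50            20
Gene 2        30            20
Gene 3        30            20
Gene 4        30            20

[Figure: samples plotted as red and yellow points, plus one unlabeled point marked “?”; axes: Gene 1, Gene 2]

If red = brain tumor and yellow = healthy, do I have a brain tumor?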

SVM = support vector machine

[Figure: the same plot with a separating line; axes: Gene 1, Gene 2]

In SVM we find a (hyper)plane that divides the space in two.

SVM: confidence in classification

[Figure: the separating line and the point marked “?”; axes: Gene 1, Gene 2]

The further the point is from the separating (hyper)plane, the more confident we are in the classification.
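The "distance from the separating (hyper)plane" can be computed directly once the plane is known. The sketch below does not train an SVM; it assumes a separating line is already given (the weights `w` and offset `b` are hypothetical values chosen for illustration) and only shows how the signed distance of a new point is obtained.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w·x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical separating line in the (Gene 1, Gene 2) plane: Gene 1 + Gene 2 = 50.
w, b = [1.0, 1.0], -50.0

near = signed_distance(w, b, [26.0, 25.0])   # barely on the positive side
far = signed_distance(w, b, [50.0, 40.0])    # deep on the positive side
print(near, far)
```

The sign says on which side of the plane the point falls (the predicted class), and the magnitude plays the role of confidence: the point far from the line gets the larger value.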

SVM: cannot always perfectly classify

[Figure: red and yellow points that no single line separates; axes: Gene 1, Gene 2]

Sometimes we cannot perfectly separate the training data. In this case, we will find the best separation.

KNN = k nearest neighbors

[Figure: red and yellow points and the point marked “?”; axes: Gene 1, Gene 2]

KNN is another method for classification. For each point, it looks at its k nearest neighbors.


[Figure: the point marked “?” and its nearest neighbors; axes: Gene 1, Gene 2]

For example, the method with k=3 looks at the point's 3 nearest neighbors to decide how to classify it. If the majority are “red”, it will classify the point as red.
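The majority rule described here can be sketched in a few lines. The training points below are hypothetical, chosen so that k=1 and k=3 disagree; `knn_classify` is an illustrative name, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical samples in the (Gene 1, Gene 2) plane.
training = [
    ((1.0, 1.0), "red"), ((2.0, 1.5), "red"), ((1.5, 2.0), "red"),
    ((2.1, 2.1), "yellow"),   # a lone yellow point inside the red group
    ((6.0, 6.0), "yellow"), ((7.0, 6.5), "yellow"), ((6.5, 7.0), "yellow"),
]
query = (2.0, 2.0)

print(knn_classify(training, query, k=1))   # nearest single neighbor is the lone yellow point
print(knn_classify(training, query, k=3))   # two of the three nearest neighbors are red
```

With an odd k and two classes there is never a tie in the vote, which is one reason odd values are a common choice.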



[Figure: red and yellow points and the point marked “?”; axes: Gene 1, Gene 2]

KNN is better than SVM for the above case.



KNN - exercise

[Figure: red and yellow points and the point marked “?”; axes: Gene 1, Gene 2]

In the above example, how will the point be classified by KNN with k=1? By SVM?

Training dataset

[Figure: red and yellow points and the point marked “?”; axes: Gene 1, Gene 2]

The red and yellow points are used to train the classifier.

The more training data one has, the better the classifier will generally perform.

Test dataset

[Figure: red and yellow points and the point marked “?”; axes: Gene 1, Gene 2]

Usually, some points for which we know the answer are not given to the classifier and are used to TEST its performance.
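Holding out labeled points for testing can be sketched as follows. Everything here is a toy assumption made for illustration: the one-gene data, the 25% test fraction, and the threshold "classifier" (the midpoint between the two class means).

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the labeled samples and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train):
    """Toy training step: the midpoint between the two class means."""
    healthy = [x for x, label in train if label == "healthy"]
    tumor = [x for x, label in train if label == "tumor"]
    return (sum(healthy) / len(healthy) + sum(tumor) / len(tumor)) / 2

def accuracy(threshold, test):
    """Fraction of held-out samples that the threshold classifies correctly."""
    hits = sum(("tumor" if x > threshold else "healthy") == label for x, label in test)
    return hits / len(test)

# Hypothetical expression levels of a single gene, with known labels.
samples = [(1, "healthy"), (2, "healthy"), (3, "healthy"), (4, "healthy"),
           (11, "tumor"), (12, "tumor"), (13, "tumor"), (14, "tumor")]

train, test = train_test_split(samples)
threshold = train_threshold(train)
acc = accuracy(threshold, test)
print(len(train), len(test), acc)
```

The classifier never sees the test points during training, so the accuracy measured on them is an honest estimate of how it would do on new data.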

Decision tree

Age       Gene1   Gene2   Smoker   Operation
<20       high    high    yes      yes
<20       high    high    yes      yes
<20       low     low     no       no
[20,40]   low     high    yes      yes
[20,40]   high    high    no       yes
[20,40]   high    low     yes      no
>40       low     low     yes      no
>40       high    low     no       no
>40       low     high    no       no

Decision tree

Age > 40?
  Yes: Operation = no
  No:  Gene 2?
         high: Operation = yes
         low:  Operation = no

Decision trees are built automatically from training data and are used for classification.

They also tell us which features are most important.
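As a check, the tree above can be written as a small function and run against every row of the training table. The function below hard-codes the tree rather than learning it, and the first age group is read as "<20" (the extracted slide text shows ">20"), which is an assumption.

```python
# Rows of the training table: (age group, Gene1, Gene2, smoker, operation).
# The first age group is assumed to be "<20".
table = [
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "high", "high", "yes", "yes"),
    ("<20", "low", "low", "no", "no"),
    ("[20,40]", "low", "high", "yes", "yes"),
    ("[20,40]", "high", "high", "no", "yes"),
    ("[20,40]", "high", "low", "yes", "no"),
    (">40", "low", "low", "yes", "no"),
    (">40", "high", "low", "no", "no"),
    (">40", "low", "high", "no", "no"),
]

def predict(age_group, gene1, gene2, smoker):
    """The decision tree above: only age and Gene 2 are consulted."""
    if age_group == ">40":
        return "no"
    return "yes" if gene2 == "high" else "no"

all_correct = all(predict(*row[:4]) == row[4] for row in table)
print(all_correct)
```

Gene1 and Smoker are accepted as arguments but never used, which is the sense in which the tree tells us which features matter most.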

Voting

Voting uses an array of machine learning algorithms and chooses the classification suggested by most classifiers.

Train: decision trees, KNN, and SVM are each trained on the same training data (labels: Yes/No).

New datum (test): decision trees → No, KNN → Yes, SVM → Yes

Final classification: YES
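Majority voting itself is a one-liner once each classifier has produced its label. In the sketch below the three votes are the ones from the slide's example; which classifier cast which vote is an assumption, and `vote` is an illustrative name.

```python
from collections import Counter

def vote(predictions):
    """Return the label predicted by most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Votes for one new datum; the assignment of votes to classifiers is assumed.
predictions = {"decision tree": "No", "KNN": "Yes", "SVM": "Yes"}

final = vote(predictions.values())
print(final)
```

Using an odd number of classifiers avoids ties when there are only two possible labels.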

Classification is also used outside the scope of bioinformatics, for example in digit recognition.

The distance between the query and each point in the dataset is computed. Based on the identity of the k nearest members, the digit is identified.

*More advanced algorithms allow for rotation and scaling of the digit to be classified.

UPGMA - exercise

In the example below, how will the points be clustered using UPGMA?

      x1   x2   x3   x4
x1     0    2   12   30
x2          0    8   10
x3               0    4
x4                    0

After joining x1 and x2:

       x12   x3   x4
x12      0   10   20
x3            0    4
x4                 0

After joining x3 and x4:

       x12   x34
x12      0    15
x34            0

Dataset sizes

A classifier is needed to detect “Pupko disease” based on gene expression.

Pupko disease is extremely rare (say, it afflicts 1 out of 100,000 people).

A classifier was trained on a large volume of samples in which all cases are negative. On a test dataset it correctly classified 99.9% of the cases…

Not impressive: the fraction of positive cases in the test data is only ~0.01%.

Take-home message: (1) It is better to train a classifier on roughly equal numbers of “positive” and “negative” cases.

(2) Reporting only the “% accurate classifications” is not enough. One has to report FP, FN, TP, and TN (in this example, all positive cases are false negatives: a false negative rate of 100%).
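The point can be made concrete by counting the four outcomes for a classifier that always answers "negative". The test set below is hypothetical (100,000 samples, 10 of them positive), chosen only to mimic a very rare disease.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, TN, FN for boolean labels (True = positive)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, fp, tn, fn

# Hypothetical test set: 10 positive cases among 100,000 samples.
truth = [True] * 10 + [False] * 99_990
# A classifier that has only ever seen negatives predicts "negative" for everyone.
predicted = [False] * len(truth)

tp, fp, tn, fn = confusion_counts(truth, predicted)
accuracy = (tp + tn) / len(truth)
false_negative_rate = fn / (fn + tp)
print(tp, fp, tn, fn, accuracy, false_negative_rate)
```

The accuracy is very close to 100%, yet every single positive case is missed, which only the separate TP/FP/TN/FN counts reveal.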

Exercises - examples

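In clustering with UPGMA, I merged gene X and gene Y. The distance between gene X and gene T was 7, and the distance between gene Y and gene T was 9. Which of the following statements is correct?

• (a) The distance between the group that joins genes X and Y and gene T is 8.

• (b) The distance between the union of X and Y and gene T cannot be computed, because the distance between X and Y is not given.

• (c) The distance between gene X and gene Y is smaller than 7.

• a + b.

• a + c.

• b + c.

• a + b + c.

• None of the answers is correct.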




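23. Which of the following statements is correct?

a. In SVM, the smaller the distance between the point to be classified and the separating surface, the smaller the chance that the classification is wrong.

b. In SVM, all points of type A are always on one side and all points of type B are on the other side.

c. An SVM can be developed that classifies proteins as transmembrane or non-transmembrane.

d. None of the answers is correct.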




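24. Given the following figure:

[Figure: black and white points and a point marked “?”; axes: Gene 1, Gene 2]

Which of the following statements is correct?

a. According to (linear) SVM, the point marked with the question mark will be classified as a black point.

b. According to KNN with the number of neighbors equal to one, the point marked with the question mark will be classified as a white point.

c. a + b

d. None of the answers is correct.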



Legionella pneumophila case study

How did it all begin?

Legionnaires' disease nowadays

Legionella pneumophila


Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology

Identifying the effectors

The features

Homology to host proteins

Regulatory elements

Genome proximity to other effectors

Secretion signal

Abundance in Metazoa / Bacteria

GC content

Sequence homology


The effectors machine


The big picture

[Diagram of the effector prediction pipeline. Labels, grouped where the grouping is clear:]

Features: similarity to known effectors, regulatory elements, similarity to host proteins, G-C content, secretory signals, abundance in Metazoa/Bacteria, genome arrangement

Feature selection

Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting

Other labels: prior knowledge, validated effectors, non-effectors, trained model, unclassified genes, predicted effectors, predicted non-effectors, experimental validation, newly validated effectors


Does it really work??
