Gene Array Analysis

Statistical genetics - Class 10

Gene array description

Normalization

Data Analysis

Multiple measurements

What is a gene array

Gene arrays are solid supports upon which a collection of gene-specific nucleic acids has been placed at defined locations, either by spotting or direct synthesis.

In array analysis, a nucleic acid-containing sample is labeled and then allowed to hybridize with the gene-specific targets on the array.

Based on the amount of probe hybridized to each target spot, information is gained about the specific nucleic acid composition of the sample.

The major advantage of gene arrays is that they can provide information on thousands of targets in a single experiment.

Nomenclature

Many terms exist for naming gene arrays, including: biochip, DNA chip, GeneChip (a registered trademark of Affymetrix, Inc.), DNA array, microarray, and macroarray.

Microarray and macroarray may be used to differentiate between spot size or the number of spots on the support.

Glass Support

Experiment

A typical gene array experiment involves:

1. Isolating RNA from the samples to be compared

2. Converting the RNA samples to labeled cDNA via reverse transcription; this step may be combined with aRNA amplification

3. Hybridizing the labeled cDNA to identical membrane or glass slide arrays

4. Removing the unhybridized cDNA

5. Detecting and quantitating the hybridized cDNA

6. Comparing the quantitative data from the various samples

General Picture

Choosing Cell Populations

The goal of comparative cDNA hybridization is to compare gene transcription in two or more different kinds of cells. For example:

Tissue-specific Genes - Cells from two different tissues (say, cardiac muscle and prostate epithelium) are specialized for performing different functions in an organism. Although we can recognize cells from different tissues by their phenotypes, it is not known just what makes one cell function as smooth muscle, another as a neuron, and still another as prostate.

Ultimately, a cell's role is determined by the proteins it produces, which in turn depend on its expressed genes. Comparative hybridization experiments can reveal genes which are preferentially expressed in specific tissues.


Genetic disease is often caused by genes which are inappropriately transcribed -- either too much or too little -- or which are missing altogether.

Such defects are especially common in cancers, which can occur when regulatory genes are deleted, inactivated, or become constitutively active.

Unlike some genetic diseases (e.g. cystic fibrosis) in which a single defective gene is always responsible, cancers which appear clinically similar can be genetically heterogeneous.

For example, prostate cancer (prostatic adenocarcinoma) may be caused by several different, independent regulatory gene defects even in a single patient.


Cell Cycle Variations - Cells undergo DNA replication, mitosis, and eventually death. These activities require quite different gene products, such as DNA polymerases for genome replication or microtubule spindle proteins for mitosis. A cell's genes encode the "programs" for these activities, and gene transcription is required to execute those programs. Comparative hybridization can be used to distinguish genes that are expressed at different times in the cell cycle. In this way, the pathways responsible for controlling basic life processes can be uncovered.

mRNA Extraction

Genes which code for protein are transcribed into messenger RNA's (mRNA's) in the cell nucleus. The mRNA's in turn are translated into proteins by ribosomes in the cytoplasm. The transcription level of a gene is taken to be the amount of its corresponding mRNA present in the cell. Comparative hybridization experiments compare the amounts of many different mRNA's in two cell populations.

To prepare mRNA for use in a microarray assay, it must be purified from total cellular contents. mRNA accounts for only about 3% of all RNA in a cell.

Common mRNA isolation methods take advantage of the fact that most mRNA's have a poly-adenine (poly(A)) tail. These poly(A)+ mRNA's can be purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules bound to a solid support.

Reverse transcription

Captured mRNA's are still difficult to work with because they are prone to being destroyed.

The environment is full of RNA-digesting enzymes, so free RNA is quickly degraded. To prevent the experimental samples from being lost, they are reverse-transcribed back into more stable DNA form. The products of this reaction are called complementary DNA's (cDNA's) because their sequences are the complements of the original mRNA sequences.

A problem with cDNA production is that not all mRNA's are reverse-transcribed with the same efficiency. This fact leads to reverse transcription bias, which can change the relative amounts of different cDNA's measured by the microarray assay.

Reverse transcription bias is not a problem when comparing the same mRNA across two cell populations, unless it causes the mRNA not to be reverse-transcribed at all.

However, the bias does prohibit quantitative comparison between different mRNA's on one array.

Fluorescent labeling of cDNA's

In order to detect cDNA's bound to the microarray, we must label them with a reporter molecule that identifies their presence. The reporters currently used in comparative hybridization to microarrays are fluorescent dyes (fluors).

A differently-colored fluor is used for each sample so that we can tell the two samples apart on the array. The labeled cDNA samples are called probes because they are used to probe the collection of spots on the array.

Fluors do not show their colors unless stimulated with a specific frequency of light by a laser. Even then, the colors are not directly observed; rather, the wavelength of the emitted light is used to tune a detector which measures the fluorescence.

Normalization

The number of fluor molecules which label each cDNA depends on its length and possibly its sequence composition, both of which are often unknown.

This is one more reason that fluorescent intensities for different cDNA's cannot be quantitatively compared. However, identical cDNA's from the two probes are still comparable as long as the same number of label molecules is added to the same DNA sequence in each probe.

To equalize the total concentrations of the two cDNA probes before applying them to the array, the probe solutions are diluted to have the same overall fluorescent intensity.

This procedure makes two possibly unjustified assumptions:

1. that the total amount of mRNA in each cell type being tested is identical

2. that each fluor emits the same amount of light relative to its concentration.
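The same balancing can be mimicked numerically on scanned intensities. The following is a minimal sketch, not the procedure used in the lab: the function name `scale_to_equal_total` and the intensity values are hypothetical, and the rescaling inherits both assumptions listed above.

```python
def scale_to_equal_total(probe_a, probe_b):
    """Rescale probe_b so that its total intensity equals that of probe_a.

    Returns the rescaled probe_b values and the scaling factor used.
    """
    total_a = sum(probe_a)
    total_b = sum(probe_b)
    if total_b == 0:
        raise ValueError("probe_b has zero total intensity")
    factor = total_a / total_b
    return [value * factor for value in probe_b], factor


# Hypothetical spot intensities for the two labeled probes.
probe_a = [100.0, 200.0, 400.0, 800.0]
probe_b = [150.0, 300.0, 600.0, 1200.0]

probe_b_scaled, factor = scale_to_equal_total(probe_a, probe_b)
```

After rescaling, both channels sum to the same total, so a spot whose two intensities are equal is read as equally abundant in both samples.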

Hybridization to a DNA Microarray

The two cDNA probes are tested by hybridizing them to a DNA microarray.

The array holds hundreds or thousands of spots, each of which contains a different DNA sequence.

In this way, every spot on an array is an independent assay for the presence of a different cDNA. There is enough DNA on each spot that both probes can hybridize to it at once without interference.

Microarrays are made from a collection of purified DNA's. A drop of each type of DNA in solution is placed onto a specially-prepared glass microscope slide by an arraying machine. The arraying machine can quickly produce a regular grid of thousands of spots in a square about 2 cm on a side.

Scanning the Hybridized Array

Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array must be scanned to determine how much of each probe is bound to each spot.

The probes are tagged with fluorescent reporter molecules which emit detectable light when stimulated by a laser.

The emitted light is captured by a detector, usually a charge-coupled device (CCD).

Spots with more bound probe will have more reporters and will therefore fluoresce more intensely.

The scanner also records light from a few molecules that hybridized either to the wrong spot or nonspecifically to the glass slide. This extra light becomes the background of the scanned array image.

Affymetrix arrays

• 10^7 copies per oligo in a 24 x 24 µm square

• Use 20 pairs of different 25-mers per gene

• Perfect match and mismatch

Data Analysis

Normalization

Detection of outliers

Clustering

Multiple measurements

False color images of spotted array

Overlay of two scans of the slide

Compares the two samples

Green = less relative expression

Red = more relative expression

Yellow = equal expression

Dimmer colors = lower expression levels
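The color rules above follow directly from placing one scan in the red channel of an image and the other in the green channel. A minimal sketch under that assumption; the function name `false_color`, the 8-bit scaling, and the choice of `max_signal` are illustrative and not taken from any particular scanner software.

```python
def false_color(red_signal, green_signal, max_signal):
    """Return an (R, G, B) tuple in 0..255 for one spot.

    The two scans are written into the red and green channels; blue is unused.
    """
    r = min(255, int(255 * red_signal / max_signal))
    g = min(255, int(255 * green_signal / max_signal))
    return (r, g, 0)


bright_equal = false_color(1000.0, 1000.0, 1000.0)  # both channels saturated: yellow
dim_equal = false_color(100.0, 100.0, 1000.0)       # same balance, lower level: dim yellow
more_red = false_color(1000.0, 0.0, 1000.0)         # red scan only: red
more_green = false_color(0.0, 1000.0, 1000.0)       # green scan only: green
```

Equal signals give equal red and green components, which the eye reads as yellow, and lower signals give smaller components, which appear dimmer.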

Normalizing two-color arrays

[Figure panels: before normalization, after normalization]

The signals for the two colors are rarely “balanced”.

Normalization

[Figure: scatter plot of Cy5 signal (log2) versus Cy3 signal (log2)]

Normalization by iterative linear regression

fit a line (y = mx + b) to the data set

set aside outliers (residuals > 2 x s.e.)

repeat until r^2 changes by < 0.001

then apply slope and intercept to the original dataset

D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
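The iteration described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the cited authors' implementation: "s.e." is taken to be the residual standard error of the fit, the final step is taken to mean mapping Cy5 onto the Cy3 scale as (y - b) / m, and the function names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return m, b, r2


def iterative_linear_normalization(cy3_log, cy5_log, tol=0.001, max_iter=50):
    """Fit, set aside outliers, refit until r^2 is stable, then correct Cy5."""
    xs, ys = list(cy3_log), list(cy5_log)
    m, b, r2 = fit_line(xs, ys)
    for _ in range(max_iter):
        residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
        # Assumption: s.e. is the residual standard error (n - 2 degrees of freedom).
        se = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
        kept = [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) <= 2 * se]
        if len(kept) < 3 or len(kept) == len(xs):
            break  # too few points left, or no outliers to set aside
        xs = [p[0] for p in kept]
        ys = [p[1] for p in kept]
        m_new, b_new, r2_new = fit_line(xs, ys)
        converged = abs(r2_new - r2) < tol
        m, b, r2 = m_new, b_new, r2_new
        if converged:
            break
    # Assumption: "apply slope and intercept" means undoing the fitted line
    # on every original Cy5 value, outliers included.
    normalized = [(y - b) / m for y in cy5_log]
    return normalized, m, b


# Hypothetical log2 signals: Cy5 = 1.2 * Cy3 + 0.5 plus small noise and one outlier.
cy3_log = [float(i) for i in range(6, 16)]
noise = [0.05, -0.05, 0.03, -0.03, 0.04, -0.04, 0.02, -0.02, 0.05, -0.05]
cy5_log = [1.2 * x + 0.5 + e for x, e in zip(cy3_log, noise)]
cy5_log[4] += 3.0  # one strongly changed gene

cy5_normalized, slope, intercept = iterative_linear_normalization(cy3_log, cy5_log)
```

Setting the outlier aside before the final fit keeps a strongly changed gene from pulling the line toward itself, while the correction is still applied to that gene, so its change remains visible after normalization.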

Normalization (Linear)

[Figure: scatter plot of Cy5 signal (log2) versus Cy3 signal (log2)]

[Figure: ratio {log2 (Cy5 / Cy3)} versus average signal {log2 (Cy3 + Cy5)/2}, with LOESS function fit line]

Normalization (Curvilinear)

G Tseng et al., NAR 2001

LOESS function

To use LOESS, the user must specify the degree, d, of the local polynomial to be fit to the data, and the fraction of the data, q, to be used in each fit. In this case, the simplest possible initial function specification is d=1 and q=1. While it is relatively easy to understand how the degree of the local polynomial affects the simplicity of the initial model, it is not as easy to determine how the smoothing parameter affects the function.

The weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to follow the local model best influence the local model parameter estimates the most. Points that are less likely to actually conform to the local model have less influence on the local model parameter estimates. The traditional weight function used for LOESS is the tri-cube weight function, w(x) = (1 - |x|^3)^3 for |x| < 1 and w(x) = 0 for |x| >= 1.
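A local linear fit (d = 1) with tri-cube weights can be sketched in a few lines. This is a minimal illustration, not a replacement for a tested LOESS routine: the function names are hypothetical, the window width is taken as the distance to the k-th nearest point with k = ceil(q * n), and the last function assumes the fitted curve is simply subtracted from the log ratios.

```python
import math


def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0


def loess_point(x0, xs, ys, q=0.5):
    """Local linear (d = 1) estimate at x0 using a fraction q of the data."""
    n = len(xs)
    k = max(2, min(n, int(math.ceil(q * n))))
    # The distance to the k-th nearest point sets the local window width.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    if h == 0.0:
        h = 1e-12
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        return sum(ys) / n  # degenerate window: fall back to the plain mean
    # Weighted least squares line through the points in the window.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess_normalize(average_signal, log_ratio, q=0.5):
    """Subtract the fitted curve from each log ratio (assumed correction step)."""
    return [m - loess_point(a, average_signal, log_ratio, q)
            for a, m in zip(average_signal, log_ratio)]


# Hypothetical data with a purely intensity-dependent bias in the log ratio.
average_signal = [float(i) for i in range(6, 16)]
log_ratio = [0.5 * a - 3.0 for a in average_signal]
corrected = loess_normalize(average_signal, log_ratio, q=0.5)
```

A smaller q follows local curvature more closely at the cost of a noisier curve; q = 1 with d = 1 uses every point in each local fit and gives the smoothest curve this sketch can produce.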

Image Analysis

2 images per array

Super-imposing

Grid on image

Clone Id    Ratio
1           1.5
2           0.8
…           …

Gene Ratios

Gene expression levels determined by intrinsic properties of each gene

[Figure: expression level axis from low to high, showing Gene A and Gene B]

Statistical Analysis

Differences in ratios are due either to random variation or to meaningful changes

Hypothesis testing, with H0: no systematic differences between ratios

Most Basic Statistical Analysis

Assumptions:

‘red’ and ‘green’ intensities at a given gene ~ i.i.N.d. (independent, identically normally distributed) with common variance

constant coefficient of variation over the whole gene set


with $T_k = R_k / G_k$,

$$f_{T_k}(t) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}}\,\exp\left[-\frac{(t-1)^2}{2c^2(1+t^2)}\right]$$

with c: coefficient of variation, estimated from data

According to Chen et al. 1997 (J Biomedical Optics, 2(4):364)


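Classification with hypothesis testing

[Figure: density of the ratio with a tail of area α/2 on each side; the left tail is labeled under-expressed and the right tail over-expressed]

3 classes of genes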
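The two cut-offs that separate the three classes can be found numerically from the ratio density. The sketch below assumes the density of Chen et al. 1997 has the form f(t) = (1+t) sqrt(1+t^2) / (c (1+t^2)^2 sqrt(2 pi)) * exp(-(t-1)^2 / (2 c^2 (1+t^2))). The values c = 0.2 and alpha = 0.05, the function names, and the label "unchanged" for the middle class are all illustrative choices.

```python
import math


def ratio_density(t, c):
    """Assumed null density of the ratio T = R/G for coefficient of variation c."""
    s = 1.0 + t * t
    scale = (1.0 + t) * math.sqrt(s) / (c * s * s * math.sqrt(2.0 * math.pi))
    return scale * math.exp(-((t - 1.0) ** 2) / (2.0 * c * c * s))


def cumulative(t, c, steps=2000):
    """Trapezoid-rule integral of the density from 0 to t."""
    if t <= 0.0:
        return 0.0
    h = t / steps
    total = 0.5 * (ratio_density(0.0, c) + ratio_density(t, c))
    for i in range(1, steps):
        total += ratio_density(i * h, c)
    return total * h


def quantile(p, c, t_max=20.0):
    """Bisection search for the ratio t at which cumulative(t, c) reaches p."""
    lo, hi = 0.0, t_max
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if cumulative(mid, c) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def classify(ratio, lower, upper):
    """Assign a ratio to one of the three classes."""
    if ratio < lower:
        return "under-expressed"
    if ratio > upper:
        return "over-expressed"
    return "unchanged"


c = 0.2        # hypothetical coefficient of variation
alpha = 0.05   # hypothetical significance level, split as alpha/2 per tail
lower = quantile(alpha / 2.0, c)
upper = quantile(1.0 - alpha / 2.0, c)
labels = [classify(r, lower, upper) for r in (0.4, 1.1, 2.5)]
```

Because the density is not symmetric around 1, the two cut-offs are not equidistant from 1 on the linear scale; under this density they come out as approximate reciprocals of each other.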

Fold Change Graphs

How many times did the expression of this gene change in the treated tissue versus the control?

comparison analysis requires experiment vs control

does not apply to absolute analysis

parameter value in one vs another

Avg diff (perfect match vs mismatch)
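As a worked illustration of the quantities named above, the sketch below computes an average difference per gene as a plain mean of (perfect match - mismatch) over its probe pairs, and a fold change as experiment divided by control. Both definitions are simplifying assumptions, and the function names and intensities are hypothetical.

```python
def average_difference(perfect_match, mismatch):
    """Plain mean of (PM - MM) over the probe pairs of one gene."""
    if len(perfect_match) != len(mismatch) or not perfect_match:
        raise ValueError("need equally many PM and MM values, at least one pair")
    diffs = [pm - mm for pm, mm in zip(perfect_match, mismatch)]
    return sum(diffs) / len(diffs)


def fold_change(experiment, control):
    """Ratio experiment / control; None when either value is not positive."""
    if experiment <= 0 or control <= 0:
        return None  # the ratio is not meaningful for non-positive values
    return experiment / control


# Hypothetical probe-pair intensities for one gene in two samples.
control_avg = average_difference([120.0, 150.0, 90.0], [20.0, 50.0, 40.0])
treated_avg = average_difference([320.0, 350.0, 240.0], [120.0, 150.0, 140.0])
change = fold_change(treated_avg, control_avg)
```

Returning `None` for non-positive average differences makes it explicit that a fold change cannot be reported when the mismatch signal equals or exceeds the perfect match signal.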

Fold Change of Average Difference

Noise and Repeats

>90% 2 to 3 fold

Multiplicative noise

Repeat experiments

Log scale

dist(4,2) = dist(2,1)

log – log plot
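The identity dist(4,2) = dist(2,1) holds on the log scale because equal fold changes become equal distances. A short sketch, with hypothetical function names, that also averages repeated ratios on the log scale:

```python
import math


def log_dist(a, b):
    """Distance between two positive values on the log2 scale."""
    return abs(math.log2(a) - math.log2(b))


def mean_log_ratio(ratios):
    """Average of repeated ratio measurements, taken on the log2 scale."""
    return sum(math.log2(r) for r in ratios) / len(ratios)


d_high = log_dist(4.0, 2.0)   # one doubling
d_low = log_dist(2.0, 1.0)    # one doubling
linear_high = abs(4.0 - 2.0)  # linear distances differ for the same fold change
linear_low = abs(2.0 - 1.0)

# A 2-fold increase and a 2-fold decrease in two repeats cancel on the log scale.
balanced = mean_log_ratio([2.0, 0.5])
```

On the linear scale the same two repeats would average to 1.25, which wrongly suggests an increase; this is one reason multiplicative noise is handled on the log scale.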
