EE150a – Genomic Signal and Information Processing
On DNA Microarray Technology
October 12, 2004
Recall the information flow in cells
• Replication of DNA
– {A,C,G,T} to {A,C,G,T}
• Transcription of DNA to mRNA
– {A,C,G,T} to {A,C,G,U}
• Translation of mRNA to proteins
– {A,C,G,U} to {20 amino-acids}
• Interrupt the information flow
and measure gene expression
levels!
http://www-stat.stanford.edu/~susan/courses/s166/central.gif
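The alphabet mappings above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to be the coding strand (so transcription reduces to replacing T with U), and `CODON_TABLE` is a deliberately partial subset of the standard genetic code.

```python
# Minimal sketch of the alphabet mappings in the information flow.
# Assumption: CODON_TABLE is a small illustrative subset of the standard
# genetic code, not the full 64-codon table.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """Map a DNA coding strand over {A,C,G,T} to mRNA over {A,C,G,U}."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read codons left to right; stop at a stop codon or an unknown codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[i:i + 3])
        if amino is None or amino == "Stop":
            break
        protein.append(amino)
    return protein


mrna = transcribe("ATGTTTGGCGCTTAA")
print(mrna)             # AUGUUUGGCGCUUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Ala']
```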
Gene Microarrays
• A medium for matching known and unknown sequences of
nucleotides based on hybridization (base-pairing: A-T, C-G)
• Applications
– identification of a sequence (gene or gene mutation)
– determination of expression level (abundance) of genes
– verification of computationally determined genes
• Enables massively parallel gene expression studies
• Two types of molecules take part in the experiments:
– probes, orderly arranged on an array
– targets, the unknown samples to be detected
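As a small illustration of base-pairing between a probe and a target, the sketch below checks whether a target is the perfect (reverse) complement of a probe. The function names and the example sequences are hypothetical, and both strands are assumed to be written 5' to 3'.

```python
# Watson-Crick pairing used for hybridization: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that base-pairs perfectly with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes_perfectly(probe: str, target: str) -> bool:
    """True if target is exactly the reverse complement of probe."""
    return reverse_complement(probe) == target.upper()


probe = "ATGCGT"
print(reverse_complement(probe))               # ACGCAT
print(hybridizes_perfectly(probe, "ACGCAT"))   # True
print(hybridizes_perfectly(probe, "ACGGAT"))   # False
```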
Microarray Technologies
• Oligonucleotide arrays (Affymetrix GeneChips)
– probes (20-80 nucleotides long) are photo-etched on a chip
– dye-labeled mRNA is hybridized to the chip
– laser scanning is used to detect gene expression levels (i.e., amount of mRNA)
• cDNA arrays
– complementary DNA (cDNA) sequences, 500-1000 nucleotides long, are “spotted” on arrays
– dye-labeled mRNA of two types (control and test) is hybridized to the chip
– laser scanning is used to detect gene expression levels
• There are various hybrids of the two technologies above
Oligonucleotide arrays
Source: Affymetrix website
GeneChip Architecture
Source: Affymetrix website
Hybridization
Source: Affymetrix website
Laser Scanning
Source: Affymetrix website
Sample Image
Source: The Paterson Institute for Cancer Research
Competing Microarray Technologies
• So far considered oligonucleotide arrays:
– automated, on-chip design
– light dispersion may cause problems
– short probes, 20-80 nucleotides
• cDNA microarrays are another technology:
– longer probes obtained via polymerase chain reaction (PCR)
– [sidenote: what is optimal length?]
– probes grown in a lab, robot printing
– two types of targets – control and test
http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg
cDNA Microarrays
Sample cDNA Microarray Image
Some Design Issues
• Photo-etching based design: unwanted light exposure
– border minimization
– the probes are 20-80 nucleotides long
• Hybridization: binding of a target to its perfect complement
• However, when a probe differs from a target by a small
number of bases, it still may bind
• This non-specific binding (cross-hybridization) is a source of
measurement noise
• In special cases (e.g., arrays for gene detection), the designer has a
lot of control over the landscape of the probes on the array
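To make the notion of cross-hybridization concrete, the sketch below counts how many bases of a target differ from a probe's perfect complement. The threshold `max_mismatches` is a hypothetical design parameter chosen for illustration; real binding depends on thermodynamics, not only on the mismatch count.

```python
# Translation table implementing A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mismatches(probe: str, target: str) -> int:
    """Number of positions where target differs from the probe's perfect complement."""
    perfect = probe.upper().translate(COMPLEMENT)[::-1]
    if len(perfect) != len(target):
        raise ValueError("probe and target must have equal length")
    return sum(a != b for a, b in zip(perfect, target.upper()))


def may_cross_hybridize(probe: str, target: str, max_mismatches: int = 2) -> bool:
    """Flag near-complements: at least one mismatch, but no more than the threshold.

    Assumption: max_mismatches is an illustrative cut-off, not a physical constant.
    """
    return 0 < mismatches(probe, target) <= max_mismatches


probe = "ATGCGTAC"
print(mismatches(probe, "GTACGCAT"))           # 0, the perfect complement
print(mismatches(probe, "GTACGGAT"))           # 1, a single-base difference
print(may_cross_hybridize(probe, "GTACGGAT"))  # True
```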
Dealing with Measurement Noise
• Recent models of microarray noise
– measurements reveal signal-dependent noise (i.e., shot-noise)
as the major component
– additional Gaussian-like noise due to sample preparation,
image scanning, etc.
• Image processing assumes image background noise
– attempts to subtract it
– sets up thresholds
• Lack of models of processes on microarrays
Probabilistic DNA Microarray Model
• Consider an $m \times m$ DNA microarray, with $m^2$ unique types of
nucleotide probes
• A total of N molecules of n different types of cDNA targets
with concentrations $c_1, \dots, c_n$
is applied to the microarray
• The measurement is taken after the system has reached chemical
equilibrium
• Our goal: from the scanned image, estimate the concentrations
DNA Microarray Model Cont’d
• Each target may hybridize to only one type of probe
• There are k non-specific bindings
• Model diffusion of unbound molecules by random walk;
distribution of unbound molecules uniform on the array
– justified by reported experimental results
• Assume known probabilities of hybridization and cross-
hybridization
– Theoretically: from melting temperature
– Experimentally: measurements (e.g., from control target
samples)
Markov Chain Model
Modeling transitions between the possible
states of a target:
• one specific binding state
• k=2 non-specific binding states
• $p_n = 1 - k p_c - p_h$ is the probability that an
unbound molecule remains free, where $p_h$ and $p_c$ are the
hybridization and cross-hybridization probabilities
The measurement is taken after the
system has reached chemical
equilibrium, so we need to find the steady state
Markov Chain Model Cont’d
Let $\nu_i = [\nu_{i,1}\ \nu_{i,2}\ \dots\ \nu_{i,k+2}]^T$ be a vector
whose components are the numbers of
type-i targets in each of the k+2
states of the Markov chain
• $\nu_{i,1}$ is the number of hybridized molecules
• $\nu_{i,j}$, $2 < j \le k+2$, is the number of cross-hybridized molecules
Note that $\sum_{j=1}^{k+2} \nu_{i,j} = c_i$ for every i.
Stationary State of the Markov Chain
• In equilibrium, we want to find $\nu_i$ such that $P_i \nu_i = \nu_i$,
where $P_i$ is the transition matrix of the Markov chain
• Clearly, in the stationary state we have
• Finally, the ratio $\nu_i / c_i$ gives the stationary state probabilities
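The stationary-state computation can be sketched numerically. Everything specific here is an assumption made for illustration: the state ordering (hybridized, free, then k cross-hybridized states), a single release probability `p_r` shared by all bound states, a column-stochastic convention, and the numeric values. The slides' own transition matrix may differ.

```python
import numpy as np


def transition_matrix(p_h: float, p_c: float, p_r: float, k: int) -> np.ndarray:
    """Column-stochastic matrix of a hypothetical (k+2)-state chain.

    State 0: hybridized, state 1: free, states 2..k+1: cross-hybridized.
    Assumption: every bound state releases to the free state with probability p_r.
    """
    p_n = 1.0 - k * p_c - p_h  # probability that a free molecule stays free
    if p_n < 0:
        raise ValueError("p_h + k * p_c must not exceed 1")
    P = np.zeros((k + 2, k + 2))
    # Transitions out of the free state (column 1).
    P[0, 1] = p_h
    P[1, 1] = p_n
    P[2:, 1] = p_c
    # Transitions out of each bound state: release or stay bound.
    for j in [0] + list(range(2, k + 2)):
        P[1, j] = p_r
        P[j, j] = 1.0 - p_r
    return P


def stationary(P: np.ndarray) -> np.ndarray:
    """Eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


P = transition_matrix(p_h=0.5, p_c=0.1, p_r=0.02, k=2)
pi = stationary(P)
print(pi)  # stationary probabilities: hybridized, free, cross-hybridized (x2)
```

Under these assumptions, multiplying the stationary probabilities by the number of type-i molecules gives the expected occupancy of each state. For the values above the probabilities work out to 25/36, 1/36, 5/36 and 5/36.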
Linear Microarray Model
• Let matrix Q collect the previously obtained probabilities
• The microarray measurement model can be written as $s = Qc + w + v$,
where $s$ is the vector of measured signals and $c = [c_1\ \dots\ c_n]^T$
• Vector w describes inherent fluctuations in the measured
signal due to hybridization (shot-noise)
• Binding of a j-type target to the i-type probe is a Bernoulli
random variable with variance $q_{i,j}(1-q_{i,j})$
– hence the variance of $w_i$ is given by $\sigma_{w_i}^2 = \sum_{j=1}^{n} c_j \, q_{i,j}(1-q_{i,j})$
• Vector v is comprised of iid Gaussian entries
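A small simulation can illustrate the linear model with signal-dependent shot noise plus Gaussian noise. The matrix `Q`, the target counts `c`, and `sigma_v` below are made-up values for a hypothetical 4-probe, 3-target array. Binding at each probe is drawn as an independent binomial, which is a simplifying assumption.

```python
import numpy as np

# Hypothetical example: Q[i, j] is the probability that a type-j target
# ends up bound at probe i (values invented for illustration).
Q = np.array([
    [0.70, 0.05, 0.00],
    [0.05, 0.70, 0.05],
    [0.00, 0.05, 0.70],
    [0.05, 0.00, 0.05],
])
c = np.array([1000, 2000, 1500])  # number of molecules of each target type
sigma_v = 5.0                     # std of the additive Gaussian noise v


def shot_noise_variance(Q: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Variance of w_i: sum over j of c_j * q_ij * (1 - q_ij)."""
    return (Q * (1.0 - Q)) @ c


def measure(Q: np.ndarray, c: np.ndarray, sigma_v: float,
            rng: np.random.Generator) -> np.ndarray:
    """Draw one readout: binomial binding counts per probe plus Gaussian noise."""
    bound = rng.binomial(c, Q)  # shape (probes, targets); c broadcasts over rows
    return bound.sum(axis=1) + rng.normal(0.0, sigma_v, size=Q.shape[0])


rng = np.random.default_rng(0)
print("mean signal Qc:     ", Q @ c)
print("shot-noise variance:", shot_noise_variance(Q, c))
print("one noisy readout:  ", measure(Q, c, sigma_v, rng))
```

The readout fluctuates around `Q @ c`, and its spread grows with the signal level, which is the signal-dependent behavior the model attributes to hybridization.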
Detection of Gene Expression Levels
• A simple estimate is obtained via the pseudo-inverse, $\hat{c} = Q^{\dagger} s$
• Maximize a posteriori probability p(s|c), which is equivalent to
where the matrix is given by
• Optimization above readily simplifies to
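The pseudo-inverse estimate can be sketched with NumPy. This covers only the simple estimate, not the probabilistic optimization. The matrix `Q`, the true counts `c_true`, and the noise level are hypothetical values chosen for illustration.

```python
import numpy as np


def pinv_estimate(Q: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Least-squares estimate of c from s ~ Q c via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(Q) @ s


# Hypothetical 4-probe, 3-target binding probabilities.
Q = np.array([
    [0.70, 0.05, 0.00],
    [0.05, 0.70, 0.05],
    [0.00, 0.05, 0.70],
    [0.05, 0.00, 0.05],
])
c_true = np.array([1000.0, 2000.0, 1500.0])

# Noise-free readout: the estimate recovers c_true up to rounding error.
s_clean = Q @ c_true
print(pinv_estimate(Q, s_clean))

# Noisy readout: Gaussian noise is used here as a stand-in for w + v.
rng = np.random.default_rng(0)
s_noisy = s_clean + rng.normal(0.0, 20.0, size=s_clean.shape)
print(pinv_estimate(Q, s_noisy))
```

The pseudo-inverse weights all probes equally. The probabilistic formulation differs in that it accounts for noise variance that changes from probe to probe.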
Simulation Results
• Consider an $8 \times 8$ array (m=8)
• Apply n=6 types of targets
• Concentrations: [1e5 2e5 2e5 2e5 1e5 2e5] (N=1e6)
• Assume the following probabilities:
– hybridization – 0.8
– cross-hybridization – 0.1
– release – 0.02
• Let k=3 (number of non-specific bindings)
• Free molecules perform random walk on the array
Simulation Results: Readout Data
Simulation Results: Estimate
Some Comments
• Adopt mean-square error for a measure of performance
• As expected, we observe significant improvement over raw
measurements (improvement in terms of MSE)
• Things to do:
– investigate how to incorporate control sample
measurements
– modification of the technique for very large microarrays is
needed (matrix inversion may be unstable)
• Experimental verification!
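The mean-square-error comparison can be illustrated on a toy example. This is not a reproduction of the simulation reported above. It assumes a hypothetical 3-probe, 3-target array, and it takes the "raw" estimate to be each target's own probe readout rescaled by its specific-binding probability. The second estimate is the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binding probabilities: 0.70 specific, 0.05 cross-hybridization.
Q = np.array([
    [0.70, 0.05, 0.05],
    [0.05, 0.70, 0.05],
    [0.05, 0.05, 0.70],
])
c_true = np.array([1000, 2000, 1500])
sigma_v = 5.0


def mse(estimate, truth) -> float:
    """Mean-square error between an estimate and the true concentrations."""
    return float(np.mean((np.asarray(estimate) - truth) ** 2))


def simulate_readout(rng: np.random.Generator) -> np.ndarray:
    """One noisy readout: binomial binding counts plus Gaussian noise."""
    bound = rng.binomial(c_true, Q)
    return bound.sum(axis=1) + rng.normal(0.0, sigma_v, size=Q.shape[0])


raw_errors, pinv_errors = [], []
for _ in range(200):
    s = simulate_readout(rng)
    raw = s / np.diag(Q)            # ignores cross-hybridization (assumed baseline)
    est = np.linalg.pinv(Q) @ s     # accounts for cross-hybridization
    raw_errors.append(mse(raw, c_true))
    pinv_errors.append(mse(est, c_true))

mse_raw = float(np.mean(raw_errors))
mse_pinv = float(np.mean(pinv_errors))
print(f"MSE raw:  {mse_raw:.1f}")
print(f"MSE pinv: {mse_pinv:.1f}")
```

In this toy setting the raw estimate is biased upward by cross-hybridized molecules, so its MSE is dominated by that bias, while the pseudo-inverse estimate is limited mainly by noise.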
Why is this Estimation Problem Important?
• Microarrays measure expression levels of thousands of genes simultaneously
• Assume that we are taking samples at different times during a biological process
• Cluster data in the expression level space
– relatedness in biological function often implies similarity in expression behavior (and vice versa)
– similar expression behavior indicates co-expression
• Clustering of expression level data heavily depends on the measurements
– better estimation may lead to different functionality conclusions
Summary
• Microarray technologies are becoming increasingly important in
medicine and biology
– understanding how the cell functions, effects on organism
– towards diagnostics, personalized medicine
• Plenty of interesting problems
– combinatorial design techniques
– statistical analysis of the data
– signal processing / estimation