7
248 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005 DNA Microarray Stochastic Model Stephen W. Davies, Senior Member, IEEE, and David A. Seale Abstract—A stochastic model of the DNA microarray image pixels is presented. The model includes spot pixel intensity dis- tribution, interpixel correlations and the intensity distribution of background noise. The data is indicative of a small exponential additive noise process and a larger Gaussian fluctuation that scales with spot intensity. Correlations are observed among pixels in the spot and between test and control images. The correlated fluctuations may be attributed to variations across each spot in the amount of DNA placed on the spot during the array fabrication process. The model may be used in gene expression estimation algorithm development, both to test new algorithms through simu- lation and to develop optimum algorithms. The model should also be easily adapted to new array based technologies in proteomics. Index Terms—DNA microarrays, gene expression, signal model, stochastic model. I. INTRODUCTION D EOXYRIBONUCLEIC acid (DNA) microarrays are used to measure cellular gene expression levels and study gene interactions. A microarray consists of a grid of tiny spots of DNA with each spot usually corresponding to a different gene (Fig. 1). These arrays are usually formed by printing the DNA onto a glass slide by a robot equipped with capillary printing pins [1]; alternative fabrication schemes use photolithography [2] or ink jet printing [3]. The first step in using the array is to extract ribonucleic acid (RNA) from the cells; RNA indicates which genes are currently active. The RNA is processed to form fluorescently labeled single-stranded nucleic acids that stick to their corresponding microarray spot. Typically, control and test RNA samples are processed on the same array using different fluorophores [4]. The microarray is then scanned with a laser confocal microscope. The array analysis software automatically locates each spot in the image and produces an intensity estimate [5], [6]. As each spot appears as many pixels in the scanned image, a method for combining pixel data into an estimate of gene expression ratio (ratio of test to control) is required. Options in doing this include: 1) forming the simple ratio of average pixel values; 2) forming the ratio of pixel medians; 3) forming the average of pixel by pixel ratios; and 4) excluding low pixel values. A detailed analysis of the process of information recovery from microarrays is provided in [7]. The cumulative effect of errors introduced in each stage of the processing is so large that Manuscript received April 21, 2004; revised June 9, 2004. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant 238582-1. S. W. Davies is with the Institute for Biomaterials and Bioengineering and the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G9, Canada (e-mail: [email protected]). D. A. Seale is with ATI Technologies Inc., Markham, ON L3T 7X6, Canada (e-mail: [email protected]). Digital Object Identifier 10.1109/TNB.2005.853665 Fig. 1. cDNA microarray spots. factor of two changes in gene expression are usually required for a change to be considered significant [8]–[10]. Metrics for com- paring arrays have been developed [11] and a statistical model of measured expression levels is available [12]. The “measured expression level” is the sum of the pixel values. A flexible spot simulator is available [13]; however, guidance is needed if it is to achieve fidelity to real signal characteristics. A detailed statistical model of the microarray image pixel data would assist in the further development of this technology. With such a statistical model, performance may be bounded by the Cramer–Rao lower bound [14]. Common simulated data based on the model may be used to allow investigators to compare gene expression analysis algorithm performance. Op- timal (e.g., maximum-likelihood) processors may be developed through derivations based on the model. II. MICROARRAY FEATURES Visual inspection of microarray images reveals that high and low intensity spots have similar characters (Fig. 1). The spot shapes are very similar. While the spots are noisy, the neigh- boring pixels appear correlated in their fluctuations. Spots often demonstrate a dark circular region in their centers. Typical array images often show many other artifacts, in- cluding bright specks and broad regions of elevated background fluorescence. Dust and scratches appear as bright dots, specks, or streaks on the array image. Dust streaks can also deform spots. On some arrays, the spots themselves have bright streaks, usually all spreading in the same direction. Microarray special- ists often refer these to as comet tails. Background fluorescence is present over the whole array image. There is evidence that the background fluorescence in the areas of the image outside of the spots is from a different process than that observed within the spots [15]. The background intensity is low, but still bright enough to affect the analysis of the dimmest of spots. As weakly expressed genes may be controllers of other genes, these dim spots may be important, and therefore the influence of background fluorescence may be significant. 1536-1241/$20.00 © 2005 IEEE

DNA Microarray Stochastic Model

  • Upload
    da

  • View
    218

  • Download
    4

Embed Size (px)

Citation preview

Page 1: DNA Microarray Stochastic Model

248 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005

DNA Microarray Stochastic ModelStephen W. Davies, Senior Member, IEEE, and David A. Seale

Abstract—A stochastic model of the DNA microarray imagepixels is presented. The model includes spot pixel intensity dis-tribution, interpixel correlations and the intensity distribution ofbackground noise. The data is indicative of a small exponentialadditive noise process and a larger Gaussian fluctuation thatscales with spot intensity. Correlations are observed among pixelsin the spot and between test and control images. The correlatedfluctuations may be attributed to variations across each spot in theamount of DNA placed on the spot during the array fabricationprocess. The model may be used in gene expression estimationalgorithm development, both to test new algorithms through simu-lation and to develop optimum algorithms. The model should alsobe easily adapted to new array based technologies in proteomics.

Index Terms—DNA microarrays, gene expression, signal model,stochastic model.

I. INTRODUCTION

DEOXYRIBONUCLEIC acid (DNA) microarrays are usedto measure cellular gene expression levels and study gene

interactions. A microarray consists of a grid of tiny spots ofDNA with each spot usually corresponding to a different gene(Fig. 1). These arrays are usually formed by printing the DNAonto a glass slide by a robot equipped with capillary printingpins [1]; alternative fabrication schemes use photolithography[2] or ink jet printing [3].

The first step in using the array is to extract ribonucleic acid(RNA) from the cells; RNA indicates which genes are currentlyactive. The RNA is processed to form fluorescently labeledsingle-stranded nucleic acids that stick to their correspondingmicroarray spot. Typically, control and test RNA samples areprocessed on the same array using different fluorophores [4]. Themicroarray is then scanned with a laser confocal microscope.

The array analysis software automatically locates each spot inthe image and produces an intensity estimate [5], [6]. As eachspot appears as many pixels in the scanned image, a methodfor combining pixel data into an estimate of gene expressionratio (ratio of test to control) is required. Options in doing thisinclude: 1) forming the simple ratio of average pixel values;2) forming the ratio of pixel medians; 3) forming the averageof pixel by pixel ratios; and 4) excluding low pixel values.

A detailed analysis of the process of information recoveryfrom microarrays is provided in [7]. The cumulative effect oferrors introduced in each stage of the processing is so large that

Manuscript received April 21, 2004; revised June 9, 2004. This work wassupported in part by the Natural Sciences and Engineering Research Council ofCanada (NSERC) under Grant 238582-1.

S. W. Davies is with the Institute for Biomaterials and Bioengineering andthe Department of Electrical and Computer Engineering, University of Toronto,Toronto, ON M5S 3G9, Canada (e-mail: [email protected]).

D. A. Seale is with ATI Technologies Inc., Markham, ON L3T 7X6, Canada(e-mail: [email protected]).

Digital Object Identifier 10.1109/TNB.2005.853665

Fig. 1. cDNA microarray spots.

factor of two changes in gene expression are usually required fora change to be considered significant [8]–[10]. Metrics for com-paring arrays have been developed [11] and a statistical modelof measured expression levels is available [12]. The “measuredexpression level” is the sum of the pixel values. A flexible spotsimulator is available [13]; however, guidance is needed if it isto achieve fidelity to real signal characteristics.

A detailed statistical model of the microarray image pixeldata would assist in the further development of this technology.With such a statistical model, performance may be boundedby the Cramer–Rao lower bound [14]. Common simulateddata based on the model may be used to allow investigators tocompare gene expression analysis algorithm performance. Op-timal (e.g., maximum-likelihood) processors may be developedthrough derivations based on the model.

II. MICROARRAY FEATURES

Visual inspection of microarray images reveals that high andlow intensity spots have similar characters (Fig. 1). The spotshapes are very similar. While the spots are noisy, the neigh-boring pixels appear correlated in their fluctuations. Spots oftendemonstrate a dark circular region in their centers.

Typical array images often show many other artifacts, in-cluding bright specks and broad regions of elevated backgroundfluorescence. Dust and scratches appear as bright dots, specks,or streaks on the array image. Dust streaks can also deformspots. On some arrays, the spots themselves have bright streaks,usually all spreading in the same direction. Microarray special-ists often refer these to as comet tails.

Background fluorescence is present over the whole arrayimage. There is evidence that the background fluorescence inthe areas of the image outside of the spots is from a differentprocess than that observed within the spots [15]. The backgroundintensity is low, but still bright enough to affect the analysis of thedimmest of spots. As weakly expressed genes may be controllersof other genes, these dim spots may be important, and thereforethe influence of background fluorescence may be significant.

1536-1241/$20.00 © 2005 IEEE

Page 2: DNA Microarray Stochastic Model

DAVIES AND SEALE: DNA MICROARRAY STOCHASTIC MODEL 249

To address the appearance of background fluorescence in theoutput images, most scanners provide for user adjustment ofthe photomultiplier tube (PMT) voltage. The PMT is used asthe fluorescence detector. Changing its bias voltage changes thelevel of its response to low intensity fluorescence. This servesto set a low intensity threshold. A high threshold results in animage wherein regions between spots are fully black but reducesthe sensitivity to dim spots.

Despite the automation in array manufacture, spots are notperfectly aligned on a grid. The spots tend to vary from the ex-pected positions by one or two pixels. Also, the device used toscan the arrays has fixed-size pixels—possibly leading to spatialaliasing phenomena.

Experienced technicians and facilities can produce high-quality array images that are relatively free of the aforemen-tioned artifacts. However, fluctuation of the pixels within thespot is still observed.

The model presented here addresses the intensity fluctuationsin high quality spots. All of the analysis is conducted on “clean”array images that are relatively free of dust, scratches, and comettails. The model can be used to infer the ultimate performanceof microarrays under the very best of conditions, thus providinga bound on the performance of arrays under normal conditions.

We anticipate an evolution of microarrays from small-scaleresearch efforts to large-scale use as a medical diagnostic tool.Currently, many researchers use homemade arrays and their re-sults are highly variable. Large-scale manufacturing and auto-mated processing will lead to more consistent results; indeed,this will be necessary for regulatory (e.g., FDA) approval). Webelieve that the spots in these commercial array assays will beconsistent with the clean spots modeled here. Thus, we lay afoundation for future signal processing for clinical purposes.

III. METHODOLOGY

Images from a variety of experiments using two different typesof microarrays have been examined. The first type of array, a“cDNA” array, is a printed array of full-length DNA sequencescomplementary to the expected RNAs. These arrays are from theOntario Cancer Institute’s Microarray Center and are represen-tative of a high quality and high volume microarray service. Thesecond type of array, the “oligo” array, is a printed array of 70base DNA sequences complementary to sections of the expectedRNAs, as opposed to the whole molecule. Short sequences ofDNAareknownasoligonucleotides,hence thearrayname.Thesearrays were produced in an academic lab to serve its own needs.

The cDNA arrays have spots arranged in grids, with 400 spotsper grid in groups that are 20 spots by 20 spots. The 32 gridsare arranged in a four by eight grid pattern. There are a total of12 800 spots on each of the cDNA arrays. Each spot is approxi-mately 10 pixels in diameter, and each pixel is 10 m, resultingin spots with a diameter of 100 m. Spots in a grid are nom-inally 200 m apart from spot center to spot center. Approxi-mately 50 000 spots have been used in this investigation.

The oligo arrays have similar structure. The grids are 17 by17 spots, with only 16 grids on the array arranged in a four byfour grid pattern. This pattern produces arrays with 4624 spots.

A different robot printer produced each type of array, thoughboth robots used a similar set of pins to apply the DNA to the

surface. These pins are standard units used by essentially allprinted microarray facilities. Variations in spot center locationsmay arise from printing pin misalignment, robot positioning in-accuracy, and slide misalignment. These variations have bothsystemic and random components. Commercial scanners wereused to gather the images. Mathematically, we represent the in-tensity of the th horizontal and th vertical pixel by .

The spot center locations were determined through a multi-step search procedure. First grid centers were located. Spot loca-tions were then determined through a search about the expectedspot locations relative to the grid centers. The spot location esti-mator maximized the fluorescence in a circular spot-sized maskabout the hypothesized spot center. The estimated spot position

is represented by the integer vector .For each array, the spots were divided into five groups ac-

cording to a rough estimate of their intensity. This estimate isthe sum of the intensity of the pixels in the region 16 by 16pixels around the spot center. For the th spot, this becomes

where is the estimated center position of the th spot. Thisregion was chosen both to simplify the computation and to en-compass the entirety of spot-related pixels. Each group contains20% of the spots. The first group contains the dimmest 20% ofthe spots, the next group contains the 20th through 40th per-centiles, etc.

For each array, the image data was processed in three dif-ferent computations to estimate: 1) the average spot shape; 2) thepixel intensity distribution, and 3) the average autocorrelationand cross correlation between spot pixels.

The average spot shape was produced for spots in each of thefive brightness groups. Pixels in a 16 by 16 pixel region aroundthe spot center were used in the computation of the spot shapeas in

where the index is used to select the spots in the desired groupand there are such spots.

Before their use in computation of distributions and correla-tions, each selected spot was normalized by dividing each pixelof the spot by the sum of the intensity of the 16 by 16 regionencompassing that spot as in

This allowed spots of different average intensities to contributeequally to the computations. Thus, these statistical measures arerelative to average spot intensity and so their use, for examplein simulations, requires multiplication by the spot intensity.

To avoid fluctuations associated with just the background orthe background to spot transition region, a fraction of the pe-ripheral pixels in the spots were excluded in the computationsof correlations and distributions. The spots were all larger thana circle with a radius of three pixels, so pixels of interest were

Page 3: DNA Microarray Stochastic Model

250 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005

Fig. 2. Correlation data and intensity histograms use only the pixels in theregion of interest shown here.

considered to be the ones that were included in a circular regionof radius three, as shown in Fig. 2.

Pixel intensity distributions were examined through his-tograms of the fluctuation about the expected spot. Thisfluctuation, , was obtained by subtracting the normalizedpeak shape from the normalized spot as in

where the normalized peak shape is given by

Also, a histogram for nonspot regions was produced to permitthe examination of the intensity distribution of the backgroundnoise.

All of the correlations were measured on the second brightestgroup of spots (the 60th to 80th percentiles). This group wasused due to its relative freedom from artifacts due to backgroundor clipping. An estimate of the autocorrelation was producedby taking the average of the estimated autocorrelation of eachindividual spot in the group. The normalized average spot shapewas subtracted from each normalized spot before estimating theautocorrelation. The unbiased estimate of the auto- and crosscorrelation was computed as

where and are the pixels in the normalized spot minusthe normalized average spot shape. The first summation aver-ages over the selected spots. The second summation is over theset of those values of and such that, given the offsets and ,the corresponding pixels are within the circular include region.The size of that set is . The autocorrelation uses pixels froma single fluorophore channel (ie. ), while the crosscorrelation uses pixels from the two channels (ie. is fromthe first fluorophore image, and is from the second fluo-rophore image).

IV. RESULTS

The computations were performed for data from multiplearrays. The resulting data had similar structure in each case,though parameter values were different. The results for a singlerepresentative array are presented here.

The average spot shape was computed for each brightnessgroup. The average spot shapes are presented in Fig. 3. The av-erage computed from the brightest spots, shown in Fig. 3(e), hasa somewhat flat top. Some of the spots in this group had pixelsthat exceeded the maximum brightness that could be measuredby the array scanner; the contribution of their clipped valuesproduced a relatively flat region in this diagram. The backgroundnoise in the image is a significant factor in the dimmest group,resulting in the fluctuations on the surface of the average spotseen in Fig. 3(a). In addition to effects described above, the edgesof the average spots have values that tend smoothly to zero at theperimeter. The average of the second brightest group, shown inFig. 3(d), was selected as being the most representative of typicalspots. Spots in this group do not suffer from clipping due toscanner maximum and/or minimum quantization values and aresufficiently bright to minimize the effects of background noise.

The spot pixel histograms were skewed positive (Fig. 4).Closer inspection revealed that the histogram for the brightestgroup was essentially Gaussian; it had the least skew and avisual comparison with a Gaussian curve revealed only smalldeviation. As one inspected the groups in decreasing order ofintensity, the skew increased significantly. Nonzero probabilityfor negative positions in the histograms is a result of subtractingthe normalized average spot shape. The leftmost peak in the0–20 percentile data is a result of thresholding introduced bythe photomultiplier tube’s voltage setting.

Fig. 5 is a histogram of the intensity of pixels in a nonspotregion. Over 96% of the image data in the nonspot region se-lected had pixel values of zero; this is an artifact of the photo-multiplier tube’s voltage setting. These pixels were not includedin the histogram since they would dominate the figure. Pixelvalues in this image could take values between zero and 65 535.The background noise histogram seems to indicate that the noiseprocess has an exponential distribution with a parameter of ap-proximately 0.006.

Autocorrelations were collected for each array image. Allwere similar in appearance, so only one is displayed in Fig. 6(a).The autocorrelation displayed a strong sharp central peak anda pyramidal structure stretching across the correlation lagspace. As shown in Fig. 6(b), the cross correlation betweencorresponding spots in the test and control fluorophore imagesdisplayed a similar pyramid but without the strong central peak.

V. DISCUSSION

The average spot shapes presented in Fig. 3 are consistentwith the hypothesis that the pattern of DNA fixed to the sur-face of the array determines the shape of the spot, while theamount of labeled sample nucleic acid determines the inten-sity of each pixel. The individual spot shapes in Fig. 3 repre-sent very different intensity ranges yet still present essentiallythe same shape as determined by the printed DNA. The patternof fixed DNA is caused by both the shape of the pin and thesurface properties of the array. The pin determines the initial

Page 4: DNA Microarray Stochastic Model

DAVIES AND SEALE: DNA MICROARRAY STOCHASTIC MODEL 251

Fig. 3. Average spot shapes for the five intensity bins. (a) 0%–20% spots. (b) 20%–40%. (c) 40%–60%. (d) 60%–80%. (e) 80%–100%.

pattern of the DNA droplet, while the surface modifies this ini-tial pattern due to the way the spot beads, spreads, and dries.Regions where more DNA fixes to the array will ultimately leadto greater amounts of fluorescent sample DNA in that region.There are more fixed single-stranded DNA molecules availablein these regions, so sample DNA or RNA passing through thisarea has a greater chance of annealing.

The typical spot shape, Fig. 3(d), demonstrates artifacts ofboth the pin system and the surface properties of the array. Thetops of the spots have a vertical groove running across them,parallel to the axis. The capillary slot in the spotting pin likelycauses this groove. A square-tipped pin was used to produce thisarray and could result in the square shape of the average spots.The spots also have a slight dip in their centers. This is likelycaused by the interaction of the fluid in the sample with the tipof the pin and the surface of the array. This leads to the oftenseen “donut”-shaped spots.

The slope along the edges of the spot is the result of two ef-fects: 1) the spots tend to be dim along their edges and 2) theestimated spot locations are not perfectly accurate. In the firstcase, it is easy to see that a droplet placed on a surface will tendto spread out. This spread droplet will have a thin edge alongits perimeter and it is expected that the spots will demonstratethis effect in some way. The second phenomenon, spot locationerror, leads to misalignment of successive spots in the averagingprocedure. This tends to smear the true spot shape and resultsin the slope along the edge of the average spot being less thanthat of the true spot shape. Even after the above-mentioned spotlocalization procedure, there is sufficient residual spot localiza-tion error to support significant smearing.

The variations in the amount of hybridized labeled nucleicacid lead to variations in fluorescence and nonuniformities inthe spot intensity. The pixel histogram data in Fig. 4 show thatthe normalized intensity histograms have similar width regard-

Page 5: DNA Microarray Stochastic Model

252 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005

Fig. 4. Histograms for normalized intensity values where data has beenpregrouped based on 20 percentile intervals of raw average intensity.Successive intervals have been offset by 1000.

Fig. 5. Histogram of the pixel values in the nonspot region from a typicalmicroarray image. The solid line depicts the histogram values, while the dashedline depicts an exponential distribution with a parameter of 0.006, scaled tomatch the height of the histogram data. Note that more than 96% of the imagecontained zero-valued pixels, which are not included in this figure.

less of the intensity group from which they were formed. If thestandard deviation of the unnormalized data was constant thena considerable broadening of the histograms would be expectedwith decreasing intensity group. The 16 16 summed spot in-tensities were roughly equally distributed over two orders of

Fig. 6. (a) Autocorrelation of 60–80 percentile spots. (b) Cross correlationbetween fluorophores of 60–80 percentile spots.

magnitude. Thus, Fig. 4 is suggests that there is a strong com-ponent of the fluctuation that is from an underlying mechanismthat permits fluctuations to scale directly with spot level.

Fig. 4 also shows that the data for high intensity spots is dis-tributed similarly to a Gaussian random variable. At the otherextreme, for the spot-free region between spots, the distribu-tion of the background fluorescence appears to be exponential(Fig. 5) with a decay constant of 0.006. At spot intensity levelsbetween these extremes, the distributions appear to be skewedtoward the positive. Could these intermediate distributions rep-resent a shift from an exponential distribution to a Gaussian?

Consider a model wherein the signal component of the pixelintensity is a Gaussian process with standard deviation propor-tional to the gene expression level. To this signal is added an ap-proximately exponential noise process, skewed to the positive,with variance independent of signal intensity. The probabilitydensity function (pdf) for each pixel is then the convolution ofthe signal and noise pdfs. This convolution has negligible effectfor high intensity spots as the signal variance is much larger thanthe noise variance. For low intensities, the noise pdf leads to asignificant expansion and skewing of the pixel pdf.

Page 6: DNA Microarray Stochastic Model

DAVIES AND SEALE: DNA MICROARRAY STOCHASTIC MODEL 253

Fig. 7. Histogram of 20–40 percentile data as in Fig. 3 (jagged line) modeledby the convolution of the 80–100 percentile histogram with an exponential withdecay constant of 0.006 (smooth line).

Fig. 7 supports the validity of this model. Here we attemptto reproduce the 20 to 40 percentile histogram through con-volving the brightest groups histogram with a scaled exponentialpdf of decay constant equal to 0.006 normalized intensity units.Clearly, the reproduction is very similar to the observed pdf.

It is worth noting that simulations were undertaken to inves-tigate the impact of the normalization process on the structureof the histograms. For the range of degrees of freedom expected(e.g., on the order of 10) and standard deviations (e.g., 20% ofthe mean), the impact of the normalization was to narrow thehistograms by about 5% while preserving their structural fea-tures. We conclude that, given the magnitude of the features ob-served in our experimental data, the modest corruption do to thenormalization process has minimal impact on the representativenature of the model presented here.

Comparison of the features in the auto- and cross correla-tions (Fig. 6) reveals that they are very similar. Both share abroad main-lobe and demonstrate similar levels of correlationexcept at the zero-lag values. The similar structure in the auto-and cross correlations suggests that there is a responsible mech-anism that is common to both source images. We postulate thatthe mechanism behind these features is fluctuation in the levelof printed DNA. Any such fluctuation would modulate the ap-pearance of the spot in both control and test images. Visually,the correlated process is indicated by spatially large features ineach spot. Typical of these would be the dark central regions insome of the spots of Fig. 1.

On the other hand, the discrepancy between the auto- andcross correlations at zero-lag suggests that, in addition to thecommon component, there is a component of pixel noise that isuncorrelated with that of other pixels.

It should be noted that the normalization procedure shouldattenuate large extent features in the correlation functions. Yetthe observation is of positive nonzero correlation over the entire

Fig. 8. Reproduction of autocorrelation. Values obtained from slice throughcenter of measured autocorrelation of Fig. 6(a) are represented by plus signs.Solid line and crosses represent reproduction of the autocorrelation usingfunction described in text.

Fig. 9. Simulated array image formed using our model.

spot region. The reason for the nonzero values lies in the dif-ferent regions considered in the normalization versus the cor-relation functions. The normalization uses the 16 by 16 regioncentered on the spot while the correlation functions are for theregion of interest defined in Fig. 2. Any fluorescence outside theregion of interest, possibly due to misshapen spots, will bias thenormalization factor and in turn, lead to an offset in the correla-tion function.

A simple model has been found to be very effective in repro-ducing the observed autocorrelation. It consists of three compo-nents: an uncorrelated noise process, a correlated process whosecorrelation decays exponentially with distance between pixels,and a process that is identical across the entire spot. This modelhas been fitted to the data of Fig. 6(a). In the fitted model, the un-correlated process accounted for 15.3% of the variance, the cor-related process accounted for 75.5%, and the “identical” processaccounted for 9.3% of the variance. The exponential decay inthe correlated process was such that the correlation coefficientbetween adjacent pixels was 0.63. Fig. 8 shows a remarkabledegree of agreement between this model and observed autocor-relation. Fig. 9 presents a simulated array image formed usingthis model; its visual similarity in character to that of the realdata in Fig. 1 is very strong.

These observations of distributions and correlations may becombined to form a generic model that may be adapted to other

Page 7: DNA Microarray Stochastic Model

254 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 4, NO. 3, SEPTEMBER 2005

spot-based technologies. The scaled actual analyte (e.g., RNA)level is multiplied by the mean spot shape, then multiplied bya Gaussian process and added to a spatially uncorrelatedexponential noise process to form the observation for eachpixel

The Gaussian process is of unit mean and variance much lessthan one; it should be constrained to be positive (the actual un-derlying process may in fact be Gamma distributed). The fluctu-ation about the mean for this process is represented by . Thisfluctuation is spatially correlated with an exponential decay incorrelation with pixel separation so

where denotes expectation, is the variance of the Gaussianprocess, accounts for the contribution of its spatiallyuncorrelated component, accounts for the spatially correlatedfraction, and the coefficient captures the exponential decay incorrelation with pixel separation. The constant offset reflectsthe “identical” process of Fig. 8. The fluctuation is highly cor-related between channels with cross correlation

where is the corresponding fluctuation process for the otherchannel.

This model is not inconsistent with models of the statistics ofspot levels. Such models work with the sum of pixel values in thespot. By the central limit theorem, the sum of pixel values wouldapproach a Gaussian random variable and thus be consistentwith [12].

VI. CONCLUSION

The model describes the significant statistical properties ofhigh-quality microarray images, providing a probabilistic rela-tion between the observed pixels in test and control images. Itconsists of a multiplicative process that scales the sample nu-cleic acid levels, added to a spatially uncorrelated exponentialnoise process. The multiplicative process is highly correlatedbetween test and control fluorophore images. In each image, themultiplicative process exhibits an exponential decline in corre-lation with distance. The model is likely applicable to other cur-rent and future technologies wherein test and control samplesare annealed to complementary spots.

ACKNOWLEDGMENT

The authors would like to thank the Ontario Cancer InstituteMicroarray Center and Prof. T. Hughes of the Banting and BestInstitute for providing the cDNA and oligo data, respectively.

REFERENCES

[1] C. Xiang and Y. Chen, “cDNA microarray technology and its applica-tions (review),” Biotechnol. Adv., vol. 18, pp. 35–46, 2000.

[2] D. J. Lockhart, H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo,M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, andE. L. Brown, “Expression monitoring by hybridization to high-densityoligonucleotide arrays,” Nature Biotechnol., vol. 14, pp. 1675–80, 1996.

[3] T. M. Harris, A. Massimi, and G. Childs, “Injecting new ideas into mi-croarray printing,” Nature Biotechnol., vol. 18, pp. 384–385, 2000.

[4] H. Yu, J. Chao, D. Patek, R. Mujumdar, S. Mujumdar, and A. S.Waggoner, “Cyanine dye dUTP analogs for enzymatic labeling of DNAprobes,” Nucl. Acids Res., vol. 22, pp. 3226–3232, 1994.

[5] M. Steinfath, W. Wruck, H. Seidel, H. Lehrach, U. Radelof, and J.O’Brien, “Automated image analysis for array hybridization experi-ments,” Bioinformatics, vol. 17, pp. 634–641, 2001.

[6] N. Jain, T. A. Tokuyasu, A. M. Snijders, R. Segraves, D. G. Albertson,and D. Pinkel, “Fully automatic quantification of microarray imagedata,” Genome Res., vol. 12, pp. 325–332, 2002.

[7] A. B. Goryachev, P. F. MacGregor, and A. M. Edwards, “Unfolding ofmicroarray data,” J. Comput. Biol., vol. 8, pp. 443–461, 2001.

[8] Y. Chen, E. R. Dougherty, and M. L. Bittner, “Ratio-based decisions andthe quantitative analysis of cDNA microarray images,” J. Biomed. Opt.,vol. 2, pp. 364–374, 1997.

[9] T. Ideker, V. Thorsson, A. F. Siegel, and L. E. Hood, “Testing for differ-entially-expressed genes by maximum-likelihood analysis of microarraydata,” J. Comput. Biol., vol. 7, pp. 805–817, 2000.

[10] P. Baldi and A. D. Long, “A Bayesian framework for the analysis ofmicroarray expression data: Regularized t-test and statistical inferencesof gene changes,” Bioinformatics, vol. 17, pp. 509–519, 2001.

[11] C. S. Brown, P. C. Goodwin, and P. K. Sorger, “Image metrics in thestatistical analysis of DNA microarray data,” Proc. Nat. Acad. Sci., vol.98, pp. 8944–8949, 2001.

[12] D. M. Rocke and B. Durbin, “A model for measurement error for geneexpression arrays,” J. Comput. Biol., vol. 8, pp. 557–569, 2001.

[13] Y. Balagurunathan, E. R. Dougherty, Y. Chen, M. L. Bittner, and J. M.Trent, “Simulation of cDNA microarrays via a parameterized randomsignal model,” J. Biomed. Opt., vol. 7, no. 3, pp. 507–523, Jul. 2002.

[14] H. L. Van Trees, Detection Estimation and Modulation Theory. NewYork: Wiley, 1968, pt. 1.

[15] H. Hogan, “New technique improves DNA data quality,” Biophoton. Int.,vol. 9, pp. 18–19, 2002.

Stephen W. Davies (S’79–M’85–SM’04) receivedthe B.Sc.E.E. and M.Sc.E.E. degrees from the Uni-versity of New Brunswick, Fredericton, NB, Canada,in 1981 and 1985, respectively, with a master’sthesis entitled “Estimation of myoelectric conductionvelocity distribution,” the M.B.A. degree from Dal-housie University, Halifax, NS, Canada, in 1993, andthe Ph.D. degree for applying communication theoryto automatic DNA sequencing from the Universityof Toronto, Toronto, ON, Canada, in 1999.

In 1982, he developed hardware and software atBell-Northern Research, Ottawa, ON, Canada. From 1984 to 1990, he conductedresearch into sonar signal processing and display for the Defence Research Es-tablishment Atlantic (DREA), Dartmouth, NS, Canada. From 1991 to 1995, hewas a Group Leader at DREA and led the development and trials of a largeairborne sonar. In 1999, he was with Lucent Technologies/Bell Laboratories,Holmdel, NJ; his patents form a key part of the intellectual property of startupFlarion Inc., Bedminster, NJ. Since 2000, he has been an Assistant Professor atthe University of Toronto in both the Department of Electrical and ComputerEngineering and the Institute of Biomaterials and Biomedical Engineering. Hisgroup focuses on genetic circuit design, designing and constructing analogs toelectronic circuits inside bacteria.

David A. Seale received the B.A.Sc. degree incomputer hardware systems engineering and theM.A.Sc. degree in electrical and computer engi-neering from the University of Toronto, Toronto,ON, Canada in 2000 and 2002, respectively. Hismasters thesis, “A statistical model of microarrayimages and an estimator of gene expression ratio,”applied Bayesian techniques to the analysis of DNAmicroarray images.

In 2002, he joined ATI Technologies Inc.,Markham, ON, Canada as a Computer Engineer in

the Integrated Graphics group. He is currently guiding research and develop-ment activities by developing mathematical models to predict performance offuture integrated graphics devices. This work has resulted in several successfulchipset products that can now be found in personal computers worldwide.