[IEEE 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI 2014) - Beijing, China (2014.4.29-2014.5.2)] 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI)

LEARNING AND VISUALIZING STATISTICAL RELATIONSHIPS BETWEEN PROTEIN DISTRIBUTIONS FROM MICROSCOPY IMAGES

Soheil Kolouri1, Saurav Basu1, Gustavo K. Rohde1,2,3

1 Center for Bioimage Informatics, Department of Biomedical Engineering

2 Electrical and Computer Engineering

3 Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA. 15213.

ABSTRACT

Multichannel microscopy has emerged as a technique for imaging multiple targets (molecules, protein distributions, etc.) simultaneously. Discovering the relative changes in these targets (i.e., the distributions of different proteins) is fundamental for understanding cell structure and function. We describe a new method for quantifying and visualizing relationships between multiple targets from a set of segmented multichannel cells. The method combines the canonical correlation analysis technique with a framework for analyzing images based on the concept of optimal mass transportation. We apply the method towards understanding chromatin distribution in cancer nuclei as a function of nuclear envelope shape. We also show that the subcellular distribution of mitochondria can be used to predict the subcellular localization of actin fibers in yeast cells. Finally, we describe the application of the method towards understanding relationships between nuclear and cellular shapes in 2D HeLa cells. We believe that the method could serve as a general tool for mining relationships between different subcellular protein/molecule distributions as well as organelle shapes.

Index Terms— subcellular localization, protein distributions, organelle morphology, canonical correlation analysis, optimal transport

1. INTRODUCTION

Ever since their discovery by Robert Hooke in the 1660s, cells have been extensively studied using ever-improving microscopy imaging techniques. We now know that cells are extremely complex entities, where the interplay between several thousands of different proteins (and other molecules) has a fundamental effect on their structure and function. Given that cells are mostly transparent to visible light, the fluorescence toolbox [1, 2] has greatly enhanced our ability to visualize the subcellular localization of numerous proteins. Simultaneous imaging of a few channels (e.g., DAPI for nuclear DNA, actin, tubulin, etc.) is also possible given the ability to tag proteins with labels of different emission spectra. Methods for extending the number of targets that can be imaged simultaneously to several tens are also available [3].

It is well known that proteins interact with other proteins directly and indirectly. While much is known regarding the biophysical mechanisms of interaction for certain proteins (microtubule formation from tubulin molecules, actin filaments from G-actin monomers, etc.), much remains to be discovered regarding the effect of the subcellular organization of different proteins on other proteins and organelles. High-throughput multichannel imaging can provide information regarding the spatial distribution of proteins and organelles, simultaneously and for numerous cells. However, we currently lack the computational framework for answering questions such as: what determines nuclear shape and size [4]? Is there a function that describes properties of actin cap fiber formation and nuclear organization [5]? A related question is whether a particular drug, for example, disrupts any relationship such as the ones mentioned above. It is fair to say that much of what is known about such relationships has been discovered through visual interpretation of microscopy images. It is our hypothesis that high-throughput, high-content, multichannel image cytometry datasets contain much more information than is possible to capture through the human visual system.

Dr. Basu and Mr. Kolouri contributed equally to this work. Dr. Basu performed this work while employed at Carnegie Mellon University. His current address and contact information is: IBM India Research Laboratory, New Delhi, India 110070. This work was supported, in part, by NIH grant GM090033.

Here we describe a method that can be used by scientists to ‘mine’ information regarding the interplay between the subcellular spatial distributions of different proteins and the placements and configurations of organelles. The method takes as input images of cells for which at least two channels (i.e., proteins, organelles) have been imaged simultaneously, and outputs the linear function that best, in the sense of highest linear correlation coefficient, describes the relationship between the subcellular localization of the two structures. The function we describe can be analyzed for statistical significance using a cross validation approach. In addition, the relationship can be visualized so that biological interpretation is possible in many cases. The idea is based on combining the well-known canonical correlation analysis technique [6] with the linear optimal transport framework for analyzing images [7].

Below we apply the method towards understanding chromatin distribution in cancer nuclei as a function of nuclear envelope shape. In the case of liver cells, an increase in nuclear size is well correlated with a clearing of chromatin in the interior of the nucleus. We also apply the method to understanding relationships between subcellular mitochondrial distributions and actin distributions in yeast cells. Our results show that the distributions of actin and mitochondria in budding yeast are correlated, which is in accordance with the results presented in [8, 9]. More precisely, we show that as the cell grows and the actin distribution becomes more concentrated at the budding site, the mitochondrial distribution moves toward a bipolar configuration. We demonstrate that our algorithm captures the actin and mitochondrial distributions during a budding cycle blindly and fully automatically.

978-1-4673-1961-4/14/$31.00 ©2014 IEEE

2. METHODS

The method we describe below for obtaining statistically significant relationships between the morphologies of different proteins and organelles is based on the canonical correlation analysis (CCA) technique [6]. Given two separate linear embeddings for the two proteins (or organelles) of interest, the CCA method seeks the line passing through each space that gives the best possible linear correlation between the data (morphologies) as projected onto each line (axis). Given a multichannel dataset containing the images of the protein distributions or organelles to be studied, the processing pipeline we propose first segments the desired cells so that the contours of organelles of interest, as well as the subcellular localization of proteins of interest, can be isolated. Once isolated, 2D contours are embedded linearly using the technique described below. Protein distributions are also embedded linearly using the method described in detail in [7] and summarized below. Once their respective linear embeddings are found, the CCA method is applied to a training set so as to obtain the linear relationship with the highest correlation. The value of the correlation is confirmed on a testing set so as to avoid problems due to ‘overfitting.’

2.1. Pre-processing

The microscopy images used in our method must first be segmented. Once segmented, the images are normalized by aligning their centers of mass. In addition, each image is re-oriented using a Hotelling transform so that its major axis is aligned with the y-axis. Finally, for images of protein/molecule distributions, each segmented image is also normalized so that its sum of intensities equals one. This is done to avoid differences in intensities due to uneven imaging procedures (staining amounts, etc., in histopathology images, for example). As a result, the information that is being mined by the method we describe below is essentially the relative placement of proteins (or molecules) within the cell.
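The normalization steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used for the experiments: the function name `normalize_image` is hypothetical, the image is represented by its nonzero pixel coordinates rather than being resampled onto a grid, and the sign of each principal axis is left arbitrary.

```python
import numpy as np

def normalize_image(img):
    """Center, re-orient, and mass-normalize one segmented 2D image.

    Returns (points, weights): points is (n, 2) with columns (x, y) in the
    principal-axis frame, weights is (n,) and sums to one.
    """
    img = np.asarray(img, dtype=float)
    rows, cols = np.nonzero(img)
    w = img[rows, cols]
    w = w / w.sum()                                  # intensities sum to one
    pts = np.column_stack([cols, rows]).astype(float)
    pts -= w @ pts                                   # center of mass at the origin
    cov = (pts * w[:, None]).T @ pts                 # weighted 2x2 covariance
    _, evecs = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    # Hotelling (principal axis) transform: the minor axis becomes x and
    # the major axis becomes y. Axis signs are not fixed in this sketch.
    return pts @ evecs, w
```

Because only the relative placement of mass matters after this step, two images that differ by a translation, a rotation, or a global intensity scaling yield the same normalized point set up to the axis-sign ambiguity noted above.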

2.2. Linear embedding of shape contours

Here we use an approach similar to the one described in [10]. Briefly, let f be a function describing the segmented contour of the structure (e.g., cell) of interest. We compute the linear embedding of the contour by storing 180 coordinates in order of increasing angle. That is, starting at angle 0 with respect to the x-axis, we estimate the (x, y) coordinates of the segmented contour by interpolating f at every two degrees. The set of coordinates stored in this order, which we call g, is a valid linear embedding for 2D shapes whose contours are not too irregular, so that at any angle only one coordinate can be found in f. Given the embeddings for two such contours, g1 and g2, the standard Euclidean distance can be used to measure how close or far these are from each other.
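A minimal sketch of this angular sampling is given below. The function name `contour_embedding`, the centering on the mean of the contour points, and the linear interpolation of the radius as a function of angle are assumptions made for illustration; the text only states that f is interpolated every two degrees.

```python
import numpy as np

def contour_embedding(contour_xy, n_angles=180):
    """Embed a closed 2D contour as a vector of length 2 * n_angles.

    Assumes every ray from the center crosses the contour exactly once.
    """
    pts = np.asarray(contour_xy, dtype=float)
    pts = pts - pts.mean(axis=0)                     # assumed centering
    theta = np.mod(np.arctan2(pts[:, 1], pts[:, 0]), 2 * np.pi)
    radius = np.hypot(pts[:, 0], pts[:, 1])
    order = np.argsort(theta)
    theta, radius = theta[order], radius[order]
    # Sample at 0, 2, 4, ... degrees, starting at angle 0 w.r.t. the x-axis.
    angles = np.deg2rad(np.arange(n_angles) * (360.0 / n_angles))
    r = np.interp(angles, theta, radius, period=2 * np.pi)
    # Store all x coordinates followed by all y coordinates.
    return np.concatenate([r * np.cos(angles), r * np.sin(angles)])
```

With 180 samples the vector has 360 entries, and the distance between two contours is simply `np.linalg.norm(g1 - g2)`.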

2.3. Linear embedding of protein/organelle distributions

We utilize the linear optimal transport (LOT) method [7] for embedding any ‘mass’ distribution (proteins, molecules, etc.) into a linear (Euclidean) space. The idea is to utilize the optimal transport approach we first described in [11] to measure how close or far two mass distributions µ and ν are from each other. Let µ and ν be two images modeled as continuous functions on a closed domain Ω (e.g., the unit square) that integrate to 1 (probability measures). Let π(A0 × A1) be a transportation plan that describes how much ‘mass’ originally in set A0 ⊂ Ω in one image is transported to set A1 ⊂ Ω of another image. Let Π(µ, ν) denote the set of all couplings between µ and ν such that all mass in µ is transported to ν. The optimal transport distance between µ and ν is defined through the following minimization:

$$ d(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} |x - y|^2 \, d\pi \right)^{1/2}. \qquad (1) $$

The minimizer of the above is known to exist, and defines a mathematical distance between the images (measures). When the images are continuous such that dµ = α(x)dx and dν = β(x)dx, the mass at each point x in the domain Ω is sent to a single point ϕ(x) ∈ Ω, where ϕ is called the OT map, which is mass preserving in the sense that $\int_{\varphi^{-1}(A)} \alpha(x)\,dx = \int_{A} \beta(y)\,dy$. In addition, the framework described above also describes a distance on the OT Riemannian manifold. Let $\varphi_t(x) = (1 - t)x + t\varphi(x)$, where ϕ is again the map between µ and ν. Then the geodesic connecting the two measures is given by $\mu_t(A) = \int_{\varphi_t^{-1}(A)} \alpha(x)\,dx$.

In [7] we introduced the concept of the linear optimal transport (LOT) distance between µ and ν, given a reference point σ (with dσ = γ(x)dx). The idea is to compute the projections of µ and ν onto the local tangent plane centered at σ and define the distance between µ and ν as the distance between their projections. Let ψ(x) define the OT map between µ and σ and φ(x) the OT map between ν and σ. Then their projections are P(µ) = ψ(x) − x and P(ν) = φ(x) − x, and the LOT distance between the two measures is given by the weighted Euclidean distance

$$ d_{LOT}(\mu, \nu) = \| P(\mu) - P(\nu) \|_{\sigma}, \qquad (2) $$

where ‖·‖σ is the weighted Euclidean norm.

We compute the necessary transportation maps using the discretized version of the problem described in detail in [11]. For computational savings, we utilize the ‘particle’ approximation approach described in [7] in order to first reduce each microscopy image to a set of particles (800 particles were used per image here). The computation of the projections is done via the solution of the linear program described in [11]. In the end, the linear embedding for µ is therefore $(\psi(x) - x)\sqrt{\sigma(x)}$, and the distance between two distributions µ and ν is computed using equation (2).
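The discrete computation summarized above can be sketched as follows, assuming each image has already been reduced to particle locations and masses. The function names (`ot_plan`, `lot_embedding`, `lot_distance`) are hypothetical, the generic `scipy.optimize.linprog` solver stands in for whatever solver was actually used, and the barycentric averaging of the plan into a map is our reading of the particle formulation in [7]. The dense constraint matrix is only practical for small particle counts; 800 particles per image would call for a sparse or dedicated transport solver.

```python
import numpy as np
from scipy.optimize import linprog

def ot_plan(x, p, y, q):
    """Discrete Kantorovich linear program between two particle sets.

    x: (n, d) locations with masses p (n,); y: (m, d) with masses q (m,).
    Returns the (n, m) plan minimizing sum_ij plan_ij * |x_i - y_j|^2.
    """
    n, m = len(p), len(q)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2).ravel()
    a_rows = np.kron(np.eye(n), np.ones((1, m)))     # row sums equal p
    a_cols = np.kron(np.ones((1, n)), np.eye(m))     # column sums equal q
    res = linprog(cost, A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([p, q]), bounds=(0, None),
                  method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, m)

def lot_embedding(ref_x, ref_q, x, p):
    """Embed the particle set (x, p) relative to the reference (ref_x, ref_q)."""
    plan = ot_plan(ref_x, ref_q, x, p)
    psi = plan @ x / ref_q[:, None]                  # mean destination of each reference particle
    # Displacement weighted by the square root of the reference mass.
    return ((psi - ref_x) * np.sqrt(ref_q)[:, None]).ravel()

def lot_distance(e_mu, e_nu):
    """Equation (2): Euclidean distance between two weighted embeddings."""
    return np.linalg.norm(e_mu - e_nu)
```

With 2D particles and an 800-particle reference, each embedding has 1600 entries. For two translated copies of the reference, the LOT distance computed this way coincides with the length of the translation between them.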

2.4. Finding linear relationships with CCA

Given a data set of N segmented images I1, …, IN, the output of the above procedure is a set of vectors x1, …, xN and y1, …, yN corresponding to the linear embeddings (computed either from the segmented contours or protein distributions) for each channel. We note that these embeddings are related to each other, as they are acquired from the same cell. For example, xk may refer to the linear embedding for the nuclear contour of cell k, while yk refers to the actin distribution for the same cell. We utilize the CCA technique [6] to find the directions within both linear spaces which are most correlated to each other. Before doing so, however, we utilize the standard principal component analysis (PCA) [6] technique to remove from these spaces the dimensions along which the data show little or no variation. That is, for data points in Ωx (the embedding space of the first channel), for example, we compute the covariance matrix $S = \frac{1}{M} \sum_{m=1}^{M} (x_m - \bar{x})(x_m - \bar{x})^T$ with $\bar{x} = \frac{1}{M} \sum_{m=1}^{M} x_m$, where xm is the embedding of the mth image in the set. The principal components are given by the eigenvectors of S, and can be used to compute a ‘projection’ of each original data point onto a reduced space. We select the topmost eigenvectors associated with 90% of the variance in the dataset.
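The PCA reduction can be sketched as below. The function name `pca_reduce` is hypothetical, and the use of a singular value decomposition of the centered data (rather than an explicit eigendecomposition of S) is an implementation choice of this sketch; the two give the same principal components.

```python
import numpy as np

def pca_reduce(X, variance_kept=0.90):
    """Project row-wise embeddings X (M, D) onto the leading principal components.

    Returns (scores, components, mean); scores has one row per image.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Right singular vectors of the centered data are the eigenvectors
    # of S = Xc^T Xc / M, with eigenvalues s**2 / M.
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / X.shape[0]
    cumulative = np.cumsum(var) / var.sum()
    # Smallest number of components whose cumulative variance reaches the target.
    k = min(int(np.searchsorted(cumulative, variance_kept)) + 1, len(s))
    components = vt[:k]
    return Xc @ components.T, components, mean
```

Working from the SVD avoids forming the D × D covariance matrix explicitly, which is convenient when the embedding dimension (for example 1600) exceeds the number of cells.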

Let X = {x1, …, xN} and Y = {y1, …, yN} represent the linear embeddings of each morphological structure in the reduced (PCA) space. The goal of the CCA procedure is to compute a projection over each space, u = a^T x and v = b^T y, such that the correlation coefficient between u and v (computed from the projections of all data points) is maximized. The projection coefficients a and b are obtained from the cross-covariance matrix between X and Y using the approach described in [6].
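For reference, the maximization described above can be written in its standard textbook form. The symbols S_xx and S_yy (covariance matrices of X and Y) and S_xy (their cross-covariance, with S_yx its transpose) are notation introduced here for illustration and do not appear in the original text.

```latex
\rho^{*} = \max_{a,\,b}\;
\frac{a^{T} S_{xy}\, b}{\sqrt{a^{T} S_{xx}\, a}\;\sqrt{b^{T} S_{yy}\, b}},
\qquad
S_{xx}^{-1} S_{xy} S_{yy}^{-1} S_{yx}\, a = \rho^{2} a,
\qquad
b \propto S_{yy}^{-1} S_{yx}\, a .
```

The first canonical direction a is thus the eigenvector associated with the largest eigenvalue ρ², and the remaining eigenvectors give the lower-correlation directions.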

2.5. Cross validation

An important consideration in any statistical analysis procedure in a high-dimensional space is the fact that ‘overfitting’ may occur. That is, given the potentially high dimensionality of each linear embedding (360 for contours, and 1600 for protein distributions), high correlations can often be achieved, though these may in the end have little predictive value. In order to discard these, we employ a simple cross validation strategy. In the results shown below, for each dataset, a portion of the data is first used to compute the canonical correlations as described above. The correlation coefficients reported below are estimated using only testing data which were not used for computing the projections.
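A sketch of this held-out evaluation is given below, assuming the inputs are the PCA-reduced embeddings stored one cell per row. The function names, the small ridge term `reg`, and the whitening-plus-SVD route to the first canonical pair are choices of this sketch rather than details given in the text; the default of 14 repetitions with a half split only mirrors the yeast experiment reported in the results.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First canonical directions (a, b) for row-wise data X (n, p) and Y (n, q)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # ridge term keeps Sxx invertible
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    # Whitened cross-covariance; its top singular pair gives the directions.
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, _, Vt = np.linalg.svd(K)
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return a, b

def held_out_correlation(X, Y, n_repeats=14, train_fraction=0.5, seed=0):
    """Correlation of the projections on test cells not used to fit a and b."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_fraction * n))
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        a, b = cca_first_pair(X[train], Y[train])
        u, v = X[test] @ a, Y[test] @ b
        scores.append(np.corrcoef(u, v)[0, 1])
    return np.array(scores)
```

A correlation that stays high on the held-out cells suggests the relationship is not an artifact of fitting many coefficients to few samples, whereas a correlation that collapses on test data indicates overfitting.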

3. EXPERIMENTAL RESULTS

The examples shown below were computed with three distinct datasets. The first is a nuclear morphometry dataset, the acquisition and pre-processing of which are described in detail in [11]. The data consisted of 857 nuclei from normal and diseased liver cells (combined). These were acquired using transmission light microscopy of Feulgen-stained tissue sections, and segmented as described in [11]. The segmented nuclei were analyzed for relationships between nuclear envelope shape and internal chromatin distribution. In addition, we utilized data from the Yeast Resource Center Public Image Repository [12]. The data we used consisted of fluorescence images of mitochondria and actin, imaged simultaneously from the same cells. We utilized 233 2D images, selected at random and segmented using the software described in [13]. Finally, the third dataset was a set of HeLa cells imaged using fluorescence microscopy [14]. In total, a set of 70 images was segmented semi-automatically using a simple thresholding approach. Here the contours of both the nucleus and the cell membrane were segmented for each cell.

Figure 1 contains the results obtained with the yeast cell dataset described above. Here mitochondrial subcellular location was correlated with subcellular actin placement. A linear function relating these two distributions was computed using the method described above. Given the relatively low number of cells, we utilized a folded cross validation strategy whereby roughly half of the cells were chosen for training, and the remaining cells were used for testing. The experiment was repeated 14 times and the maximum correlation coefficient reported (0.732) was computed using the projected test data only. The evidence indicates correlation between subcellular mitochondrial placement and actin placement. In Figure 1, the two axes correspond to the variations in actin (x-axis) and mitochondrial (y-axis) subcellular placement.

Figure 2 displays the results obtained by analyzing the HeLa cell dataset described earlier. Here the goal is to identify correlations between nuclear shape (second row) and cell shape (first row). The correlation coefficient computed through cross validation was 0.54, indicating that nuclear elongation is mildly correlated with cell elongation, and that the elongation is along the same direction.

Figure 3 contains the results obtained on the liver cancer nuclear structure dataset described above. The first row displays the chromatin distribution variation that best correlates with a nuclear shape direction. The correlation coefficient computed using the cross validation procedure described earlier (here half the data was used for training, and the other half for testing the correlation coefficient) shows that an increase in nuclear size is well correlated with a relative clearing of chromatin in the center of the nucleus. The correlation coefficient computed was 0.86.

Fig. 1. Understanding the subcellular mitochondrial distribution as a function of actin distribution. The average correlation coefficient, computed through cross validation, was 0.732.

Fig. 2. Cell shape (first row) as a function of nuclear shape (second row). The correlation coefficient (computed through cross validation) was 0.54.

Fig. 3. Understanding nuclear chromatin placement as a function of nuclear shape. Results show that an increase in nuclear size is correlated with a relative clearing of chromatin from the center of the nucleus. The correlation coefficient, estimated using cross validation, was 0.86.

4. SUMMARY AND DISCUSSION

Here we described a method for computing a linear function that relates morphological variations between two morphological structures (organelle contours, protein distributions, etc.) observed simultaneously for many cells. The idea is to utilize the CCA technique to eliminate ‘noisy’ biological variations, and mine only those changes which best correlate the two ‘channels’. The method described here utilizes either linear embeddings of shape contours or linear embeddings of protein/molecule distributions. In the case of proteins or molecules, the approach utilizes the linear optimal transport embedding described in [7]. In addition to simply finding evidence for linear correlation, the output of the approach we described can also be visualized, so that biological interpretation is possible. We note that while the figures above only show the first (highest correlation) direction produced by the CCA analysis, the remaining directions can also be useful in elucidating further relationships (which are orthogonal to the first).

We applied the method to identify meaningful correlations in three distinct datasets. We showed that subcellular actin distribution is well correlated with subcellular mitochondrial distribution in yeast cells. We also showed that cellular elongation is mildly correlated with nuclear elongation in 2D HeLa cells. In addition, we showed that nuclear size is well correlated with an interior clearing of nuclear chromatin. Beyond these examples, we postulate that the method could be useful for mining relationships between any subcellular distributions of masses or shapes. We envision that the method could be useful as a generic tool for searching for relationships between different proteins or shapes inside cells.

We mention a few limitations of the approach we have described. The first is that the methodology was implemented in 2D. In theory, however, the methods could be extended to three dimensions as well, provided that appropriate linear embeddings are used. The linear optimal transport method described in [7] can be implemented in 3D, albeit at a higher computational cost. A linear embedding for shapes can also be obtained by utilizing diffeomorphisms for 3D shapes. Yet another limitation of the approach is the fact that only linear relationships can be derived using the CCA approach. Deciphering nonlinear relationships would require alternative regression techniques, utilizing specially tailored kernels or basis functions. Finally, we also mention that the approach for obtaining linear embeddings using the transport methodology can be computationally expensive (relative to the extraction of descriptive features, for example), with computation times being around 30 seconds on modern workstations [7]. Addressing these and other shortcomings will be the subject of future work.

5. ACKNOWLEDGEMENT

This work was partially supported by NIH grant GM090033. The authors acknowledge Wei Wang’s contribution in helping segment the HeLa cell dataset described above.

6. REFERENCES

[1] C. Vonesch, F. Aguet, J.-L. Vonesch, and M. Unser, “The colored revolution of bioimaging,” IEEE Signal Processing Mag., vol. 23, no. 3, pp. 20–31, 2006.

[2] B. N. G. Giepmans, S. R. Adams, M. H. Ellisman, and R. Y. Tsien, “The fluorescent toolbox for assessing protein location and function,” Science, vol. 312, pp. 217–223, April 2006.

[3] Walter Schubert, Bernd Bonnekoh, Ansgar J Pommer, Lars Philipsen, Raik Bockelmann, Yanina Malykh, Harald Gollnick, Manuela Friedenberger, Marcus Bode, and Andreas W M Dress, “Analyzing proteome topology and function by automated multidimensional fluorescence microscopy,” Nat Biotechnol, vol. 24, no. 10, pp. 1270–1278, 2006.

[4] Micah Webster, Keren L Witkin, and Orna Cohen-Fix, “Sizing up the nucleus: nuclear shape, size and nuclear-envelope assembly,” J Cell Sci, vol. 122, no. Pt 10, pp. 1477–1486, 2009.

[5] Shyam B Khatau, Ryan J Bloom, Saumendra Bajpai, David Razafsky, Shu Zang, Anjil Giri, Pei-Hsun Wu, Jorge Marchand, Alfredo Celedon, Christopher M Hale, Sean X Sun, Didier Hodzic, and Denis Wirtz, “The distinct roles of the nucleus and nucleus-cytoskeleton connections in three-dimensional cell migration,” Sci Rep, vol. 2, pp. 488, 2012.

[6] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley Series in Probability and Statistics. Wiley-Interscience, 2003.

[7] W. Wang, D. Slepcev, J. A. Ozolek, S. Basu, and G. K. Rohde, “A linear optimal transportation framework for quantifying and visualizing variations in sets of images,” Intern. J. Comp. Vis., vol. 101, no. 2, pp. 254–269, 2013.

[8] Hyeong-Cheol Yang, Alexander Palazzo, Theresa C. Swayne, and Liza A. Pon, “A retention mechanism for distribution of mitochondria during cell division in budding yeast,” Current Biology, vol. 9, no. 19, pp. 1111 – S2, 1999.

[9] Kammy L. Fehrenbacher, Hyeong-Cheol Yang, Anna Card Gay, Thomas M. Huckaba, and Liza A. Pon, “Live cell imaging of mitochondrial movement along actin cables in budding yeast,” Current Biology, vol. 14, no. 22, pp. 1996–2004, 2004.

[10] Wei Wang, Yilin Mo, John A Ozolek, and Gustavo K Rohde, “Penalized Fisher discriminant analysis and its application to image-based morphometry,” Pattern Recognit Lett, vol. 32, no. 15, pp. 2128–2135, 2011.

[11] Wei Wang, John A Ozolek, Dejan Slepcev, Ann B Lee, Cheng Chen, and Gustavo K Rohde, “An optimal transportation approach for nuclear structure-based pathology,” IEEE Trans Med Imaging, vol. 30, no. 3, pp. 621–631, 2011.

[12] M. Riffle and T. N. Davis, “The yeast resource center public image repository: A large database of fluorescence microscopy images,” BMC Bioinformatics, vol. 11, pp. 263, 2010.

[13] Cheng Chen, Wei Wang, John A Ozolek, and Gustavo K Rohde, “A flexible and robust approach for segmenting cell nuclei from 2D microscopy images using supervised learning and template matching,” Cytometry A, vol. 83, no. 5, pp. 495–507, 2013.

[14] Tao Peng and Robert F Murphy, “Image-derived, three-dimensional generative models of cellular organization,” Cytometry Part A, vol. 79, no. 5, pp. 383–391, 2011.
