
Large Two-way Arrays Douglas M. Hawkins School of Statistics University of Minnesota [email protected]




  • What are large arrays?
    # of rows at least in the hundreds, and/or # of columns at least in the hundreds.

  • Challenges/Opportunities
    The logistics of handling the data are more tedious; standard graphical methods work less well; there is more opportunity for assumptions to fail. But: parameter estimates are more precise, and fewer model assumptions may be possible.

  • Settings
    Microarray data; proteomics data; spectral data (fluorescence, absorption).

  • Common problems seen
    Outliers / heavy-tailed distributions; missing data; a large # of variables hurts some methods.

  • The ovarian cancer data
    The data set as I have it: 15154 variables (M/Z values), with % relative intensity recorded; 91 controls (clinical normals); 162 ovarian cancer patients.

  • The normals
    These give us an array of 15154 rows and 91 columns. Qualifies as large. The spectrum is very busy.

  • ...not to mention outlier-prone
    Subtract off a median for each M/Z and make a normal probability plot of the residuals.

  • Comparing cases, controls
    A first pass at a rule to distinguish normal controls from cancer cases: calculate the two-sample t between the groups for each distinct M/Z.
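The per-M/Z t screen can be sketched in a few lines (a numpy sketch; the function name and the Welch-style unequal-variance form are my assumptions, not from the slides):

```python
import numpy as np

def two_sample_t(controls, cases):
    """Two-sample t statistic for each row (each M/Z value).

    controls: array of shape (p, n1); cases: shape (p, n2).
    Rows are variables, columns are subjects.
    """
    m1, m2 = controls.mean(axis=1), cases.mean(axis=1)
    v1 = controls.var(axis=1, ddof=1) / controls.shape[1]
    v2 = cases.var(axis=1, ddof=1) / cases.shape[1]
    return (m2 - m1) / np.sqrt(v1 + v2)
```

Applied to the 15154 x 91 control array and the 15154 x 162 cancer array, this returns one t per M/Z, which can then be scanned for regions of large separation.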

  • Good news / bad news
    Several places in the spectrum show large separation (t = 24 corresponds to around 3 sigma of separation). Visually these seem to be isolated spikes; this is due to the large # of narrow peaks.

  • Variability also differs

  • Big differences in mean and variability
    These suggest the conventional statistical tools of linear discriminant analysis, logistic regression, and quadratic or regularized discriminant analysis, using a selected set of features. Off-the-shelf software doesn't like 15K variables, but the methods are very doable.

  • Return to beginning
    Are there useful tools for extracting information from these arrays? Robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports).

  • Singular value approximation
    Some philosophy from Bradu (1984). Write X for the n x p data array. First remove any structure you don't want to see. The k-term SVD approximation is x_ij = sum_{t=1..k} r_it c_jt + e_ij.

  • The r_it are row markers; you could use them as plot positions for the proteins. The c_jt are column markers; you could use them as plot positions for the cases. They match their corresponding row markers. The e_ij are error terms; they should mainly be small.
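For reference, the classical (non-robust) k-term approximation and its row and column markers can be read off an ordinary SVD (a numpy sketch; the function name and the choice to absorb the singular values into the row markers are illustrative assumptions):

```python
import numpy as np

def svd_markers(X, k):
    """k-term SVD approximation x_ij ~ sum_t r_it * c_jt.

    Returns row markers R (n x k), unit-length column markers C (p x k),
    and the matrix of error terms E = X - R C'.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    R = U[:, :k] * s[:k]   # absorb the singular values into the row markers
    C = Vt[:k].T           # column markers, each of unit length
    E = X - R @ C.T
    return R, C, E
```

The robust version described next replaces this least-squares fit, which is what makes it usable with outliers and missing cells.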

  • Fitting the SVD
    Conventionally done by principal component analysis. We avoid this for two reasons: PCA is highly sensitive to outliers, and it requires complete data (an issue in many large data sets, if not this one). The standard approach would also use a 15K-square covariance matrix.

  • Alternating robust fit algorithm
    Take trial values for the column markers. Fit the corresponding row markers using robust regression on the available data. Use the resulting row markers to refine the column markers. Iterate to convergence. For the robust regression we use least trimmed squares (LTS).
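A minimal sketch of the alternating fit for a one-term (rank-1) decomposition, using a crude concentration-step approximation in place of a full LTS solver (numpy; the function names, iteration counts, and the 80% trimming fraction are my assumptions, not the authors' implementation):

```python
import numpy as np

def lts_slope(x, y, trim=0.8):
    """Approximate LTS fit of y ~ b*x (no intercept): repeatedly keep the
    fraction `trim` of points with the smallest residuals and refit."""
    b = np.dot(x, y) / np.dot(x, x)          # start from ordinary least squares
    for _ in range(20):
        r = np.abs(y - b * x)
        keep = r <= np.quantile(r, trim)
        b = np.dot(x[keep], y[keep]) / np.dot(x[keep], x[keep])
    return b

def alternating_robust_fit(X, n_iter=10):
    """Rank-1 robust SVD x_ij ~ r_i * c_j by alternating trimmed fits."""
    n, p = X.shape
    c = np.ones(p) / np.sqrt(p)              # trial column markers
    for _ in range(n_iter):
        r = np.array([lts_slope(c, X[i]) for i in range(n)])
        c = np.array([lts_slope(r, X[:, j]) for j in range(p)])
        c /= np.linalg.norm(c)               # fix the arbitrary scale on c
    r = np.array([lts_slope(c, X[i]) for i in range(n)])  # refit r to unit-length c
    return r, c
```

Because each fit trims its largest residuals, a gross outlier in one cell barely moves either marker, whereas the least-squares (PCA) fit would be pulled by it. Missing cells can be handled the same way, by fitting each regression on the available entries only.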

  • Result for the controls
    On the first run I just removed a grand median. Plots of the first few row markers show fine structure like that of the mean spectrum and of the discriminators.

  • But the subsequent terms capture the finer structure

  • Uses for the RSVD
    Instead of feature selection, we can use the cases' c scores as variables in discriminant rules. This can be advantageous in reducing measurement variability, and it avoids feature-selection bias. The scores can also be used as the basis for methods like cluster analysis.
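One way to obtain a case's c scores, given the fitted row markers, is to regress its spectrum on those markers (a least-squares numpy sketch; in keeping with the RSVD a robust regression would be used in practice, and the function name is illustrative):

```python
import numpy as np

def column_scores(R, x):
    """Scores of one case on the k row markers.

    R: row-marker matrix (n_variables x k); x: the case's spectrum
    (length n_variables). The k scores replace the 15K raw variables
    as inputs to a discriminant rule.
    """
    s, *_ = np.linalg.lstsq(R, x, rcond=None)
    return s
```

A discriminant rule (LDA, logistic regression) is then fit on these k scores rather than on thousands of individual M/Z values.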

  • Cluster analysis use
    Consider methods based on the Euclidean distance between cases (k-means / Kohonen follow similar lines).

  • In the notation of the SVD approximation, the squared distance between two case columns splits into three terms: ||x_.j - x_.l||^2 = sum_t ||r_.t||^2 (c_jt - c_lt)^2 + ||e_.j - e_.l||^2 + cross-products. The first term is the sum of squared differences in column markers, weighted by the squared Euclidean norms of the row markers. The second term is noise; it adds no information and detracts from performance. The third term, the cross-product, is approximately zero because of independence.

  • This leads to:
    The r, c scale is arbitrary; make the column lengths 1, absorbing the eigenvalue into c. Replace the squared Euclidean distance between columns with the squared distance between column markers; this removes the random variability. Similarly, for k-means/Kohonen, replace each column profile with its SVD approximation.
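The substitution above, computing pairwise squared distances between the (few) column markers instead of between full 15K-long spectra, can be sketched as (numpy; the function name is illustrative):

```python
import numpy as np

def marker_distances(C):
    """Pairwise squared Euclidean distances between column markers.

    C: column-marker matrix (n_cases x k). Used in place of distances
    between full column profiles, stripping the noise term from the
    clustering metric.
    """
    sq = (C ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * C @ C.T
```

Any distance-based clustering (hierarchical, k-medoids) can then run on this matrix; for k-means/Kohonen one clusters the marker rows of C directly.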

  • Special case
    If a one-term SVD suffices, we get an ordination of the rows and columns. Row ordination doesn't make much sense for spectral data; column ordination orders the subjects rationally.

  • The cancer group
    Carried out an RSVD of just the cancer cases, but this time removed the row median first. This corrects for overall abundance at each M/Z. The robust singular values are 2800, 1850, 1200, suggesting more than one dimension.

  • No striking breaks in the sequence.
    We can cluster, but we get more of a partition of a continuum. This suggests that severity varies smoothly.

  • Back to the two-group setting
    An interesting question (suggested by the Mahalanobis-Taguchi strategy): are the cancer cases all alike? We can address this by an RSVD of the cancer cases and clustering on the column markers, or by using the controls to get a multivariate metric and placing the cancers in this metric.

  • Do a new control RSVD
    Subtract the row medians. Get canonical variates for all cases versus just the controls (or, as we have plenty of cancer cases, conventionally, for cancer versus controls). Plot the two groups.

  • This supports the earlier comment on the lack of big white space in the cancer group: a continuum, not distinct subpopulations. The controls look a lot more homogeneous than the cancer cases.

  • Summary
    Large arrays are both a challenge and an opportunity. They are hard to visualize or graph. Many data sets show outliers, missing data, and very heavy tails. Robust-fit singular value decomposition can handle these, and provides a large condensation of the data.

  • Some references

    Bradu, D. (1984), "Response Surface Model Diagnosis in Two-way Tables," Communications in Statistics, Part A - Theory and Methods, 13, 3059-3106.

    Hawkins, D. M. (2003), Discussion of "A Review and Analysis of the Mahalanobis-Taguchi System," Technometrics, 45, 25-29.

    Hawkins, D. M., Liu, L., and Young, S. S. (2001), "Robust Singular Value Decomposition," Technical Report 122, National Institute of Statistical Sciences.

    Liu, L., Hawkins, D. M., Ghosh, S., and Young, S. S. (2002), "Robust Singular Value Decomposition Analysis of Microarray Data," Technical Report 123, National Institute of Statistical Sciences.