
(Big) data analysis: Lecture 5, going into high dimensional datasets

Amos Tanay

Slides made available as lecture notes – not intended for presentation

Objects and features

Practical problems involve a large number of objects and a large number of features defining the objects.

Given a matrix X = {x_ij}, i=1..N, j=1..M, we can study the object correlation matrix or the feature correlation matrix:

Co = {cor(x_i*, x_j*)}

Cf = {cor(x_*i, x_*j)}

Correlations can be assessed using different approaches, for example using the linear covariance matrix:

XX^T – the object covariance matrix
X^T X – the feature covariance matrix
If we normalize the rows of X such that mean(x_i*) = 0, std(x_i*) = 1, the object covariance equals the object correlation (and similarly for columns and the feature matrix).
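A minimal numpy sketch of these two matrices (a sketch, assuming rows are objects and columns are features; all variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # N=100 objects (rows), M=20 features (columns)

Co = np.corrcoef(X)                  # object correlation matrix (N x N): correlate rows
Cf = np.corrcoef(X, rowvar=False)    # feature correlation matrix (M x M): correlate columns

# After z-scoring the columns, (X^T X)/N reproduces the feature correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(Cf, Z.T @ Z / X.shape[0])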

Correlation matrices as an example of multiple testing

• We perform n*(n-1)/2 tests for correlation of objects
• We perform m*(m-1)/2 tests for correlation of features
• On random data, the correlation p-value should be distributed uniformly in [0,1]
• Which means that there are 0.05 * n*(n-1)/2 pairs with p<0.05… or n*(n-1)/2000 with p<0.001
• This is not a good thing.

Controlling the False Discovery Rate (FDR)

One can control the multiple testing effect by eliminating tests (filtering bad objects, bad features, or focusing on a-priori interesting tests).

More generally, we can compute the false discovery rate of a p-value threshold T as:

q(T) = Pr(p-value on random data < T) / Pr(p-value < T)

For example, q = 0.1 would suggest that only 10% of the tests we report (p-value < T) are expected to be random.

It is common to search for argmax_T(q(T) < FDR) and to use this threshold for reporting results with a controlled FDR.

Note that the p-value on random data is uniformly distributed if the test is well defined, so Pr(p-value on random < T) = T; more complex situations may require resampling.

The FDR concept is strongly based on the notion of test independence. The covariance matrix has very strong dependencies, so FDR cannot be used on it directly.

The FDR concept is originally due to Benjamini and Hochberg (TAU).
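A minimal sketch of this thresholding rule, assuming independent tests and the uniform null Pr(p-value on random < T) = T (the helper name is illustrative):

import numpy as np

def fdr_threshold(pvalues, fdr=0.1):
    # Largest threshold T (taken from the observed p-values) with
    # empirical q(T) = T / Pr(p-value < T) <= fdr, or None if nothing passes.
    p = np.sort(np.asarray(pvalues))
    n = len(p)
    q = p * n / np.arange(1, n + 1)      # Pr(p < T) estimated as the fraction of p-values <= T
    passing = np.nonzero(q <= fdr)[0]
    return p[passing.max()] if len(passing) else None

# Usage: T = fdr_threshold(pvals, fdr=0.1); report all tests with p-value <= T.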

Indirect correlation

• If cor(X,Y) >> 0 and cor(X,Z) >> 0, what can you say about cor(Y,Z)? (nothing in general – find examples)
• However, in many cases Y and Z will be correlated in a way that is explicable by X
• Binormal(X,Y|cov1), Binormal(X,Z|cov2) -> Pr(Y,Z)?
• For binomial or multinomial data: conditional independence: Y indep Z | X (see Bayes net theory)
• How to approach Cor(Y,Z|X) = ?
• Cor(Y-X, Z-X) – bad idea! Cor(Y-X, Z)?
• Cor(Y,Z|X=X0), Cor(Y,Z|X in bin)

Hidden factors

• In many cases major parts of the correlation structure can be dependent on a hidden variable

• Examples: experimental batch, day in week, operator, doctor

• If Cor(Xi, h) > 0 for each i… then Cor(Xi, Xj) > 0 for all pairs…

• In that case, none of the observed variables can eliminate the dependencies: Cor(Y,Z|X) > 0… (find an example – one simulated example is sketched below)
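A small simulation sketch of this effect, assuming a hidden batch-like factor h that additively shifts every observed variable (all names and numbers are illustrative):

import numpy as np

rng = np.random.default_rng(1)
n = 5000
h = rng.normal(size=n)                      # hidden factor (e.g. experimental batch)

# Three observed variables, each loaded on h plus independent noise.
X = h + rng.normal(size=n)
Y = h + rng.normal(size=n)
Z = h + rng.normal(size=n)

print(np.corrcoef([X, Y, Z]))               # all off-diagonal correlations ~0.5

# Regressing out the observed X does not remove the Y-Z dependency,
# because X only partially reflects h.
resid_Y = Y - np.polyval(np.polyfit(X, Y, 1), X)
resid_Z = Z - np.polyval(np.polyfit(X, Z, 1), X)
print(np.corrcoef(resid_Y, resid_Z)[0, 1])  # still clearly positive (~0.33)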

Normalization: linear

• As noted, a common normalization approach is to center and scale the rows or columns of the data matrix (mean = 0, stdev = 1).

• If we want to factor out an effector variable X we can subtract the regression, Y - f(X), but then be aware of the indirect correlation introduced.

• Question – what will happen if we normalize Y by: Ynorm = Y - sample(N(mean = f(X), stdev = 1 - cor(X,Y)^2))?

Normalization: binning

• If X is a variable with strong correlation to many other variables (Y, Z, W, …) we will in many cases wish to stratify these variables given X without assuming any parameterization.

• This can be done by binning all objects into ranges of X values and observing the distributions of Y, Z, W within each bin.

• If bins are sufficiently large we can subtract from each variable its mean, or its linear regression with X, within the bin:

Y'(i) = Y(i) – mean(Y(i) | X(i) in bin b(i)), or
Y'(i) = Y(i) – f(X(i) | X(i) in bin b(i)) (where f is the linear fit in the bin)
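A minimal sketch of mean-subtraction within equal-frequency bins of X (the bin count and names are illustrative):

import numpy as np

def bin_normalize(Y, X, n_bins=20):
    # Assign each object to a quantile (equal-frequency) bin of X.
    edges = np.quantile(X, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(X, edges[1:-1])          # bin index in 0..n_bins-1
    Y_norm = np.asarray(Y, dtype=float).copy()
    # Subtract from Y its mean within each bin.
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            Y_norm[mask] -= Y_norm[mask].mean()
    return Y_norm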

Quantile normalization

Normalization can be done by shifting (adding a constant) and scaling (multiplying by a constant). This is adequate for normally distributed data, less so for other types of data.

A more aggressive scheme is to force the entire distribution of values to be fixed. This can be done using quantile normalization.

Define percentile(X(i), X) as the normalized rank of X(i) within all X values. Define quantile(p, X) to be the value of the p-th percentile in X.

Quantile normalization of Y given X is defined as: Y'(i) = quantile(percentile(Y(i), Y), X)

After this normalization Y and X have the same distribution, not just the same mean or variance.

One can use quantile normalization within binning stratification as in the previous slide (instead of subtracting the mean). The approach can be very problematic with respect to outliers.
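A minimal sketch of quantile-normalizing Y against X (assuming more than one value and ignoring ties; names are illustrative):

import numpy as np

def quantile_normalize(Y, X):
    Y = np.asarray(Y, dtype=float)
    # percentile(Y(i), Y): normalized rank of Y(i) within Y, in [0, 1].
    percentiles = np.argsort(np.argsort(Y)) / (len(Y) - 1)
    # quantile(p, X): the value of the p-th percentile of X.
    return np.quantile(X, percentiles)

# After normalization, the returned values have (up to interpolation) the distribution of X.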

Forcing low dimension: PCA

X^T X – the empirical covariance matrix (if the columns of X are normalized to standard normal (mean = 0, var = 1), scaled by n).

What is the normalized linear combination (in simple words: a weighted average) of features that maximizes the variance?

w – weights (normalized such that Σ w_i^2 = 1)

What is argmax_w(Σ_i (x_i* · w)^2)?

or: w = argmax_w(w^T X^T X w)

From linear algebra we know w is the eigenvector of the largest eigenvalue of the covariance matrix:

(X^T X) w1 = λ1 w1

We can project the matrix on the orthogonal complement of the first eigenvector: X2 = X – X w1 w1^T

Now we can find the leading eigenvector of X2^T X2, defining w2, and so on for w3, …

We are in fact transforming the space to a new basis of eigenvectors. The projection of each data vector on the PCs (the w’s) is called the PC scores

PCA is easy to compute and is used for dimensionality reduction.

To perform PCA, we will usually use Singular Value Decomposition (SVD): X = U S W^T

where U is n by n orthogonal, W is m by m orthogonal, and S is diagonal. Some nice properties of SVD are:

a) X^T X = W S U^T U S W^T = W S^2 W^T
b) SVD is computed efficiently and is usually numerically stable (see NRC 2.6)
c) The PCA projection is computed by XW, but this is just US

We can use the top k PCs to get a data matrix of rank k (so an effective dimension of k) that is maximally similar to the original matrix.
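A minimal sketch of PCA via SVD on a column-normalized matrix; the rank-k reconstruction at the end is the "maximally similar" low-rank approximation (all names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30))                   # n objects x m features

# Normalize columns (features) to mean 0, std 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD: Z = U S W^T (numpy returns the singular values as a vector).
U, s, Wt = np.linalg.svd(Z, full_matrices=False)

scores = U * s                                   # PC scores = Z W = U S
loadings = Wt.T                                  # columns are the PCs w1, w2, ...

# Rank-k approximation using the top k components.
k = 5
Z_k = U[:, :k] @ np.diag(s[:k]) @ Wt[:k, :]
print(np.linalg.norm(Z - Z_k) / np.linalg.norm(Z))   # relative reconstruction error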

PCA can be used on “objects” or “features”

Examples – two possible “orientations” for PCA of a gene expression matrix (tissues X genes)

1) PCA of the gene expression matrix can be done with tissues in the columns. Each gene will then be projected on a reduced "virtual tissues" space. You can imagine such PCs being informative regarding, e.g., tumor/normal behavior.

2) If genes are in the columns, each tissue will be projected on dimensions that combine genes into some "consolidated expression signature". You can imagine such signatures being informative about gene expression modules (groups of genes that work together).

Do not use PCA as your major/first data analysis approach

PCA is very popular – some reasons include:
a) It is easy to compute, and available in matlab/R
b) It generates projections on 2D that are nice to look at
c) It does not require parameters of any kind
d) It has nice theoretical properties, especially when thinking about machine learning applications (e.g. feature selection for classification)

Nevertheless, in the context of data analysis, I am tempted to say that PCA is to be avoided in 99% of the situations.
- The PCs are linear combinations of properties – this is seldom a good assumption
- The PCs are "hiding" the data and generate plots that cannot be directly attributed to the data (e.g. problematic features, outliers, various biases)
- The PCs are very difficult to interpret unless there are 2 strong behaviors in the data – the visual gain is lost when going beyond 2D

It is recommended to approach high dimensional data with many measurements using first an exploratory approach (as we outline below), and to perform dimensionality reduction and high-dimension modeling in a way that is tailored to the problem.

The other off-the-shelf high-dim analysis solution: hierarchical clustering

Given a set of vectors Xi

We build a tree on the objects i=1..n, defined by the relation Pa(i) (parent of i).

Compute distances d(Xi, Xj). Distances can use Pearson, Spearman, Euclidean distance, or essentially anything.
Add i=1..n to the list of objects L. If using weights, set w(i) = 1 for i=1..n.
Find the minimal distance pair i1, i2.

Add a new object n+1, set Pa(i1) = n+1, Pa(i2) = n+1.

Update d(n+1, j) for each j using one policy:
1. Max(d(j,i1), d(j,i2)) or Min(.,.) (maximum/minimum linkage)
2. WeightedMean = 1/(wi1 + wi2) [wi1 d(j,i1) + wi2 d(j,i2)]
3. Set Xn+1 = Xi1 + Xi2, d(n+1, j) = d(Xn+1, Xj) (possibly weighting X using w)

Remove i1, i2 from the list L. If using weights, set wn+1 = wi1 + wi2.

Continue iteratively (adding objects n+2, …, decreasing the size of L by 1 each time until a single object remains).
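A minimal sketch of this procedure using an off-the-shelf implementation (scipy's hierarchical clustering; the linkage method corresponds to the update policies above, e.g. 'complete'/'single' for maximum/minimum linkage and 'average' for the weighted mean):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))            # 50 objects, 10 features

# Pairwise distances: 'correlation' is 1 - Pearson correlation; 'euclidean' is also common.
D = pdist(X, metric="correlation")

# Build the tree; each row of Z records one merge (the Pa relation), its distance and size.
Z = linkage(D, method="average")

# Cut the tree into a flat partition, e.g. 5 clusters.
labels = fcluster(Z, t=5, criterion="maxclust")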

Hclust – caveats

Hclust is popular for visualizing the data in heat maps (ordering the objects or features) – but the tree does _not_ define an order (we can invert the order over nodes).
Effective for detecting big divisions in the data.
Of course – effective when a hierarchical structure is expected. But when the data is not clearly partitioned/divided/hierarchical, hclust will be difficult to understand and work with.
May make more sense for visualizing the correlation matrix than the data matrix.
Can look at the original data, which is a good thing (compare the PCA issues above), but smoothing can introduce visual artifacts that mislead novices.
Limited as a basis for further modeling.

A more organized intro to clustering

Clustering is a fuzzy task! There are no canonical goals or quality metrics.
We partition a group of objects into clusters:
Property 1: maximize the similarity of objects within clusters
Property 2: maximize the separation between clusters
Property 3: use the right number of clusters
The tradeoffs between 1 and 2 are not well defined! (think of examples)

In exploratory data analysis our goals may be a bit different:
Property A: capture all of the data by a relatively small number of well-defined models that are easy to understand/visualize
Property B: the model defining each cluster is well determined
Property C: random data will not contain clusters significantly different from the background distribution of all data

Clustering: k-means

Given a set of vectors Xi

We optimize a clustering solution C1 ∪ C2 ∪ … ∪ Ck = 1..n, with the Ci disjoint, uniquely defining cl(i) = the cluster of element i.

K is fixed and we try to minimize the dispersion of our clusters, defined by:

Center_j = (1/|Cj|) Σ_{i in Cj} Xi

score = Σ_i ||Xi – Center_cl(i)||^2 (other schemes are possible)

This is done by brute-force optimization:
Incremental algorithm: (starting from any solution C) examine all possible cluster substitutions cl(i) = j -> l. If a substitution improves the score, apply it and continue with the new solution iteratively.
Iterative algorithm: (starting from any solution C) compute the centers, then update C such that cl(i) = the nearest center to Xi.

Initialization

K-means, like most algorithms we will learn in the coming lectures, is super-sensitive to initialization.

A large number of random starting points can be attempted, selecting the highest scoring solution.

An effective heuristic for initialization is selecting seeds successively: given i seeds, we rank all objects by their minimal distance to the existing seeds. We sample the (i+1)-th seed from the objects whose minimal distance is within the top 20 percentiles (so preferring new seeds that are far away from existing seeds).

Many other algorithmic variants are possible – replacing centers with medians, changing the distance or scoring scheme, adding new seeds for clusters that become empty, trying to merge or split clusters (change k) during the run.
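A minimal sketch of the iterative algorithm with the seeding heuristic above (the 20% pool follows the text; other names and parameters are illustrative):

import numpy as np

def seed_centers(X, k, rng):
    # Successive seeding: sample each new seed from the 20% of objects
    # farthest (by minimal distance) from the seeds chosen so far.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        pool = np.argsort(d)[-max(1, len(X) // 5):]
        centers.append(X[rng.choice(pool)])
    return np.array(centers)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = seed_centers(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: cl(i) = nearest center to Xi.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        cl = d.argmin(axis=1)
        # Update step: recompute centers (keep the old center if a cluster empties).
        new_centers = np.array([X[cl == j].mean(axis=0) if np.any(cl == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return cl, centers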

Using k-means for data exploration: key points

1. We select a number of clusters that is larger than what we think we need.
2. This is OK as long as clusters do not become too small. The guideline is that we approximate a complex distribution using pieces we can understand.
3. An appropriate size for a cluster depends on the statistics we collect – each feature defines a 1D distribution within each cluster, and this distribution should be appropriately determined (as we covered in the first lectures).
4. Observing several similar clusters (with slight overfitting in each) can be handled in post-processing – do not try to optimize the number of clusters!
5. Frequently most of the data is "boring" (similar to the example of a 2D scatter plot that is concentrated at 0). The goal function of most clustering algorithms will not allow appropriate modelling of strongly imbalanced clusters – it will be important to filter such cases in preprocessing (but take note of their existence and frequency!).
6. Combinatorial effects can be problematic for clustering in general, and for k-means in particular. For example, 4 independent groups of features, each partitioning the objects into two distinct groups, will generate 16 clusters in theory – with no model for the connection between clusters. This should be diagnosed and handled using more sophisticated models.

Back to probabilistic modeling – multivariate normal distribution

Generalizing 2D correlation with a set of pairwise covariances

Σ – the covariance matrix, |Σ| – det(Σ); the density of a d-dimensional multivariate normal is

f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp( -(1/2) (x-μ)^T Σ^(-1) (x-μ) )

Multi-normality is a very strong restriction on data – yet the model has many (~d^2) parameters and is prone to severe overfitting. Restricting the covariance structure can be approached in many ways.

Probabilistic analog of clustering: mixture models

We model uncertainty in the assignment of objects to mixture components

Each mixture component can be based on some appropriate model:
- Independent discrete (binomial, multinomial)
- Independent normals
- Multivariate normals
- More complex dependency models (e.g. tree, Bayesian network, Markov Random Field…)

This is our first model with a hidden variable: T -> X, or P(T,X) = P(T)·P(X|T). The probability of the data is derived by marginalization over the hidden variable: P(X) = Σ_t P(T=t)·P(X|T=t).

Expectation Maximization (EM)
(Working with, e.g., a mixture of independent or correlated normal distributions)

Initialize the mixture component parameters and mixture coefficients: q1, r1

Compute the posterior probabilities: wij = Pr(t = j | Xi)

Set new mixture coefficients: r2_j = (1/N) Σ_i wij

Set new mixture components using the weighted data wij, Xi

Iterate until convergence (no significant change in q_last, r_last)

Updating a mixture component using weighted data – examples:
Independent normals: E(X[k]), E(X^2[k]) are determined by weighted statistics
Full multivariate: all statistics E(X[k]X[u]) are determined by weighted statistics
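A minimal sketch of these EM updates for a one-dimensional mixture of normals (a simplified case of the independent-normals component; all names are illustrative):

import numpy as np

def em_gaussian_mixture(x, k=2, n_iter=100, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # component means
    sigma = np.full(k, x.std())                    # component standard deviations
    rho = np.full(k, 1.0 / k)                      # mixture coefficients
    for _ in range(n_iter):
        # E-step: posteriors w_ij = Pr(t = j | x_i) ~ rho_j * N(x_i; mu_j, sigma_j).
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        w = rho * dens
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new coefficients and weighted means/variances.
        rho = w.mean(axis=0)                       # rho_j = (1/N) sum_i w_ij
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0) + 1e-12)
    return rho, mu, sigma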

Expectation-Maximization

The EM maximization step:

q^(k+1) = argmax_q Q(q | q^k), with Q(q | q^k) = Σ_t Pr(t | X, q^k) log Pr(X, t | q)

Each iteration improves the likelihood since the relative entropy term that separates log Pr(X | q) from Q is >= 0 (Dempster et al.).

The most important property of the EM algorithm is that it is guaranteed to improve the likelihood of the data.

This DOES NOT MEAN it will find a good solution. Bad initialization will converge to a bad outcome.

Here t ranges over all possible assignments of the hidden mixture variable t per sample – so k^m possibilities.

Expectation-Maximization: Mixture model

Q can be brought back to the form of a sum over independent terms: k of them depend only on the parameters of one mixture component, and one depends on the mixture coefficients. Maximization of each of these components can then be shown to be done as in the non-mixture case (e.g. using the empirical covariance matrix for the multi-normal distribution).

KL-divergence

Entropy (Shannon):

H(P) = -Σ_i P(x_i) log P(x_i)

min H(P) = 0 (a point mass); max H(P) = log K, attained by the uniform distribution P_k = 1/K.

Kullback-Leibler divergence (relative entropy):

H(P||Q) = Σ_i P(x_i) log( P(x_i) / Q(x_i) )

Non-negativity follows from log(u) <= u - 1:

-H(P||Q) = Σ_i P(x_i) log( Q(x_i) / P(x_i) ) <= Σ_i P(x_i) ( Q(x_i)/P(x_i) - 1 ) = Σ_i Q(x_i) - Σ_i P(x_i) = 0

Note that H(P||Q) ≠ H(Q||P) – the KL divergence is not a metric!!
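A small numeric sketch of these quantities (natural log, assuming strictly positive probabilities; names are illustrative):

import numpy as np

def entropy(P):
    P = np.asarray(P, dtype=float)
    return -np.sum(P * np.log(P))

def kl_divergence(P, Q):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return np.sum(P * np.log(P / Q))

P = np.array([0.5, 0.25, 0.25])
Q = np.array([1/3, 1/3, 1/3])
print(entropy(P), entropy(Q), np.log(3))          # entropy is maximal (= log 3) for the uniform Q
print(kl_divergence(P, Q), kl_divergence(Q, P))   # both >= 0, and not symmetric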