
Principal Component Analysis

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

G. Jogesh Babu
Principal Component Analysis (PCA)

- Consider the data matrix X (n × p), where each row is one data instance and each column is one measurement.
- Let the rows of X be x_i^t, i = 1, ..., n, with x_i ∈ R^p.
- Assume the mean of each column of X has been removed.
- What can PCA achieve?
  - A linear projection onto a lower-dimensional subspace.
  - It maximizes the variance (total variation) of the projected data.
  - It minimizes the discrepancy between the full-dimensional data and the projection onto the subspace.


Mathematical Formulation

- Consider an orthonormal basis A = (a_1, a_2, ..., a_p), a_j ∈ R^p (a rotation of the coordinates).
- For the k < p dimensional subspace spanned by a_1, ..., a_k:
  - Project x_i onto the subspace: ∑_{j=1}^k 〈x_i, a_j〉 a_j, where 〈·, ·〉 denotes the inner product.
  - Total variation of the projected data (up to a constant factor of n):

      max ∑_{j=1}^k a_j^t X^t X a_j    (1)

    Equivalently, for k = 1, this is the squared length ‖X a_1‖² of a normalized linear combination of the columns X_1, ..., X_p.

- An equivalent criterion for deriving PCA: minimize the discrepancy between the full-dimensional data and their projections onto the subspace:

      min ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²    (2)

- Equivalence of Criteria (1) and (2):

      ∑_{i=1}^n ‖x_i‖² = ∑_{i=1}^n ‖ ∑_{j=1}^p 〈x_i, a_j〉 a_j ‖² = ∑_{j=1}^p a_j^t X^t X a_j

      ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖² = ∑_{i=1}^n ‖ ∑_{j=k+1}^p 〈x_i, a_j〉 a_j ‖² = ∑_{j=k+1}^p a_j^t X^t X a_j

  Since ∑_{i=1}^n ‖x_i‖² is fixed:

      max ∑_{j=1}^k a_j^t X^t X a_j
      ⇐⇒ max ∑_{i=1}^n ‖ ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²
      ⇐⇒ min ∑_{i=1}^n ‖ ∑_{j=k+1}^p 〈x_i, a_j〉 a_j ‖²
      ⇐⇒ min ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²

Solution

- Consider max ∑_{j=1}^k a_j^t X^t X a_j progressively for k = 1, 2, ....
- Let Σ = X^t X.
- Rayleigh-Ritz quotient: R_Σ(a) = 〈Σa, a〉 / 〈a, a〉.
- Suppose v_j, j = 1, ..., p, are the eigenvectors of Σ with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_p. Writing a = ∑_{j=1}^p α_j v_j,

      R_Σ(a) = ( ∑_{j=1}^p λ_j α_j² ) / ( ∑_{j=1}^p α_j² )

  Without loss of generality, assume ∑_{j=1}^p α_j² = 1. Clearly

      R_Σ(a) ≤ λ_1,

  with equality achieved by a = v_1. A more general result is the min-max (Courant-Fischer) theorem.

Solution for PCA

- The ordered eigenvectors of the covariance matrix Σ are the principal component directions.
- Properties guaranteed:
  - The variance of the first principal component is maximized among all linear projections.
  - The variance of the kth principal component is maximized among all directions orthogonal to the previous k − 1 principal component directions.
  - The subspace spanned by v_1, ..., v_k achieves the minimum discrepancy (in L2 norm) from the original data among all k-dimensional subspaces.

Singular Value Decomposition (SVD)

- Alternatively, write X = U D V^T.
- U = (u_1, u_2, ..., u_p) is an N × p orthogonal matrix; u_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the columns of X.
- V = (v_1, v_2, ..., v_p) is a p × p orthogonal matrix; v_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the rows of X.
- D = diag(d_1, d_2, ..., d_p), with d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 the singular values of X.

Principal Components

- The sample covariance matrix of X is S = X^T X / N.
- Eigendecomposition of X^T X:

      X^T X = (U D V^T)^T (U D V^T)
            = V D U^T U D V^T
            = V D² V^T

- The eigenvectors of X^T X, the v_j, are called the principal component directions of X.

- It is easy to see that z_j = X v_j = u_j d_j. Hence u_j, scaled by d_j, is simply the projection of the row vectors of X (the input predictor vectors) onto the direction v_j. For example,

      z_1 = ( X_{1,1} v_{1,1} + X_{1,2} v_{1,2} + ··· + X_{1,p} v_{1,p}
              X_{2,1} v_{1,1} + X_{2,2} v_{1,2} + ··· + X_{2,p} v_{1,p}
              ...
              X_{N,1} v_{1,1} + X_{N,2} v_{1,2} + ··· + X_{N,p} v_{1,p} )

- The principal components of X are z_j = d_j u_j, j = 1, ..., p.
- The first principal component of X, z_1, has the largest sample variance among all normalized linear combinations of the columns of X:

      Var(z_1) = d_1² / N.

- Subsequent principal components z_j have maximum variance d_j² / N, subject to being orthogonal to the earlier ones.
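A quick numerical check of these identities with numpy's SVD (illustrative code, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 6
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                                 # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)       # X = U diag(d) V^T
V = Vt.T

Z = X @ V                                              # principal components z_j = X v_j
print(np.allclose(Z, U * d))                           # z_j = d_j u_j
print(np.allclose(Z.var(axis=0, ddof=0), d**2 / N))    # Var(z_j) = d_j^2 / N
```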

Interpretation of Principal Components

- Loadings: the element v_{j,l} of the jth principal component direction v_j is the loading of the lth original variable in the jth component.
- Scores: the element (X v_j)_i is the score of the jth principal component for the ith instance.
- Simple structure interpretation: prefer loadings close to 1 or 0, that is, prefer each variable to be either irrelevant to a principal component or to contribute to it strongly.
- Classic idea: coordinate rotation (developed in general for factor analysis).

Rotation of Principal Component Directions

- Find an orthonormal basis spanning the same subspace as the first k PCDs under which the loadings are more "extreme".
- The subspace is NOT changed, but the progressive maximum-variance property along the PCDs no longer holds.

Varimax Criterion

- Let T (k × k) be an orthonormal rotation matrix acting within the subspace spanned by the first k PCDs v_1, ..., v_k.
- Let V^(k) = (v_1, v_2, ..., v_k).
- Under the rotated coordinates, the loading matrix becomes V^(k) T.
- Varimax by rows:

      arg max_T ∑_{j=1}^p [ (1/k) ∑_{l=1}^k (V^(k) T)_{j,l}^4 − ( (1/k) ∑_{l=1}^k (V^(k) T)_{j,l}^2 )² ]

- Varimax by columns:

      arg max_T ∑_{l=1}^k [ (1/p) ∑_{j=1}^p (V^(k) T)_{j,l}^4 − ( (1/p) ∑_{j=1}^p (V^(k) T)_{j,l}^2 )² ]

- Intuition: a large variance tends to produce extreme values.
- The sum over rows (or over columns) of the variance of the squared loadings is maximized.
- Recommended by Jolliffe (1989): rotate within subspaces spanned by eigenvectors with similar eigenvalues.
  - Rationale: under the rotated coordinates of the PCDs, the variance along each coordinate is still large.
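Below is a sketch of the commonly used SVD-based iteration for maximizing the column-wise varimax criterion over orthonormal rotations T. It is an illustration under my own naming, not code from the notes or from Kaiser (1958).

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate a loading matrix Phi (p x k) toward the varimax criterion.
    gamma = 1 gives varimax; returns the rotated loadings and the rotation T."""
    p, k = Phi.shape
    T = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = Phi @ T                                   # current rotated loadings
        # target matrix for the orthogonal Procrustes step of the iteration
        G = Phi.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                    # best orthonormal rotation
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return Phi @ T, T
```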

Sparsity in PC Loadings

- Jolliffe, Trendafilov, and Uddin (2003): SCoTLASS.
- To find the kth direction a_k:

      arg max_{a_k}  a_k^t (X^t X) a_k
      s.t.  a_k^t a_k = 1,   a_j^t a_k = 0 for 1 ≤ j < k,   and   ∑_{j=1}^p |a_{k,j}| ≤ t

- Successively maximize the variance under an L1 constraint to achieve sparsity.
- Solved numerically, e.g., by projected gradient methods.

Sparse PCA

- Zou, Hastie, and Tibshirani (JCGS 2006):
- Theorem 3: Suppose we are considering the first k principal components. Let A (p × k) = (α_1, ..., α_k) and B (p × k) = (β_1, ..., β_k). For any λ > 0, let

      (Â, B̂) = arg min_{A,B}  ∑_{i=1}^n ‖x_i − A B^t x_i‖² + λ ∑_{j=1}^k ‖β_j‖²
      subject to  A^t A = I_{k×k}.

  Then β̂_j ∝ v_j, j = 1, 2, ..., k.

- Add a lasso penalty for sparsity:

      (Â, B̂) = arg min_{A,B}  ∑_{i=1}^n ‖x_i − A B^t x_i‖² + λ ∑_{j=1}^k ‖β_j‖² + ∑_{j=1}^k λ_{1,j} ‖β_j‖_1
      subject to  A^t A = I_{k×k}.

- Numerical solution (alternating):
  - B given A: for each j, let Y*_j = X α_j. Solve for B = (β_1, ..., β_k) by the elastic net estimate:

        β_j = arg min_{β_j}  ‖Y*_j − X β_j‖² + λ ‖β_j‖² + λ_{1,j} ‖β_j‖_1

  - A given B: minimize ∑_{i=1}^n ‖x_i − A B^t x_i‖² = ‖X − X B A^t‖², subject to A^t A = I_{k×k}. The solution is given by a reduced-rank form of the Procrustes rotation: compute the SVD (X^t X) B = U D V^t and set A = U V^t.
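As an illustration only, a rough numpy/scikit-learn sketch of this alternating scheme. sklearn's ElasticNet is used as a stand-in for the elastic net step, and the mapping of (λ, λ_{1,j}) onto its (alpha, l1_ratio) parameterization is only approximate; all names and settings are my own.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_pca(X, k, lam=1e-4, lam1=0.1, n_iter=50):
    """Alternating sketch of the sparse PCA procedure described above."""
    n, p = X.shape
    X = X - X.mean(axis=0)
    # initialize A with the ordinary principal component directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T
    B = A.copy()
    for _ in range(n_iter):
        # B-step: elastic net regression of Y*_j = X alpha_j on X, for each j
        for j in range(k):
            y_star = X @ A[:, j]
            enet = ElasticNet(alpha=(lam + lam1) / (2 * n),       # approximate mapping
                              l1_ratio=lam1 / (lam + lam1),
                              fit_intercept=False, max_iter=5000)
            B[:, j] = enet.fit(X, y_star).coef_
        # A-step: reduced-rank Procrustes rotation, (X^t X) B = U D V^t, A = U V^t
        U2, _, Vt2 = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U2 @ Vt2
    # return normalized sparse loadings
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms
```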

References

1. Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 2: 203-214.
2. Jennrich, R. I. (2001), "A Simple General Procedure for Orthogonal Rotation," Psychometrika, 2: 289-306.
3. Jolliffe, I. (1989), "Rotation of Ill-defined Principal Components," Journal of Applied Statistics, 1: 139-147.
4. Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22: 29-35.
5. Jolliffe, I., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12: 531-547.
6. Kaiser, H. (1958), "The Varimax Criterion for Analytic Rotation in Factor Analysis," Psychometrika, 3: 187-200.
7. Zou, H., Hastie, T., and Tibshirani, R. (2006), "Sparse Principal Component Analysis," Journal of Computational and Graphical Statistics, 2: 265-286.

Classification by Penalized Empirical Risk Minimization: SVM, Logistic Regression

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

A General Framework

- Consider training data {(x_i, y_i), i = 1, ..., n}, where x_i is the attribute/feature vector and y_i is the label, y_i ∈ {−1, 1}.
- For a linear classifier f(x) = 〈w, x〉 + b, classify by sign(f(x)):

      y = sign(w^t x + b)

- Let the loss function be L(x, y; w) = L(w^t x + b, y). Usually, for classification, we set z = y(w^t x + b) and write L(w^t x + b, y) = L(z).
- For least squares regression, L(w^t x + b, y) = ‖w^t x + b − y‖².

A General Framework (Continued)

- Penalized empirical risk minimization:

      min_{w,b} R(w, b),   R(w, b) = (1/n) ∑_{i=1}^n L( y_i (w^t x_i + b) ) + λ‖w‖²    (1)

- Logistic regression: λ = 0, with the logistic loss

      L(z) = log(1 + e^{−z})

- Support vector machine: the hinge loss

      L(z) = [1 − z]_+ = max(0, 1 − z)

Optimization Solution

- Gradient descent:

      (w, b) ← (w, b) − η ∇_{w,b} R(w, b),

  where η is the step size.
- Stochastic gradient descent: for each (x_i, y_i),

      (w, b) ← (w, b) − η ∇_{w,b} L( y_i (w^t x_i + b) )

- Technicality: at non-differentiable points of a convex function, use a subgradient instead.
- A vector v is a subgradient of g(x) at x_0 in C if

      g(x) − g(x_0) ≥ v^t (x − x_0),  ∀x ∈ C
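A minimal sketch (mine, not from the notes) of the stochastic subgradient update for the hinge loss, with the L2 penalty folded into each step; the step size and its schedule are arbitrary choices.

```python
import numpy as np

def linear_svm_sgd(X, y, lam=0.01, eta=0.1, n_epochs=20, seed=0):
    """Stochastic subgradient descent for
    (1/n) sum_i max(0, 1 - y_i (w^t x_i + b)) + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            # subgradient of the hinge term: -y_i x_i if margin < 1, else 0
            if margin < 1:
                grad_w = -y[i] * X[i] + 2 * lam * w
                grad_b = -y[i]
            else:
                grad_w = 2 * lam * w
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b
```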

A Geometric View: Maximum-Margin Classifier

- Reference: B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, 2002. The definitions below are from Learning with Kernels.
- Canonical Hyperplane: the pair (w, b) ∈ H × R is called a canonical form of a hyperplane with respect to x_1, ..., x_n ∈ H if it is scaled such that min_{i=1,...,n} |〈w, x_i〉 + b| = 1, which amounts to saying that the point closest to the hyperplane has distance 1/‖w‖.
- Geometrical Margin: for a hyperplane {x ∈ H | 〈w, x〉 + b = 0}, we call

      ρ_{w,b}(x, y) = y(〈w, x〉 + b) / ‖w‖

  the geometrical margin of the point (x, y) ∈ H × {1, −1}. The minimum value ρ_{w,b} = min_{i=1,...,n} ρ_{w,b}(x_i, y_i) is the geometrical margin of the data set.

Optimal Margin Hyperplane

      min_{w∈H, b∈R}  τ(w) = (1/2)‖w‖²    (2)
      s.t.  y_i(〈x_i, w〉 + b) ≥ 1,  ∀i = 1, ..., n

Lagrangian:

      L(w, b, α) = (1/2)‖w‖² − ∑_{i=1}^n α_i ( y_i(〈x_i, w〉 + b) − 1 ),
      α_i ≥ 0,  i = 1, ..., n

Dual Optimization

      max_{α∈R^n}  W(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j 〈x_i, x_j〉    (3)
      s.t.  α_i ≥ 0,  i = 1, ..., n,   and   ∑_{i=1}^n α_i y_i = 0

We have

      w = ∑_{i=1}^n α_i y_i x_i
      b = y_j − ∑_{i=1}^n y_i α_i k(x_j, x_i),   for any j with α_j > 0

Decision function f(x):

      f(x) = sgn( ∑_{i=1}^n α_i y_i 〈x, x_i〉 + b )

Non-linear SV classifier: replace 〈·, ·〉 by a kernel function k(·, ·).

- k(x_i, x_j) = 〈x_i, x_j〉^d : polynomial classifier of degree d
- k(x_i, x_j) = exp(−‖x_i − x_j‖²/c) : radial basis function classifier
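For illustration (not from the notes), Gram matrices for these two kernels and the resulting kernel decision function, assuming the dual variables α_i and the offset b for the support vectors are already available:

```python
import numpy as np

def polynomial_kernel(X1, X2, d=3):
    """Gram matrix for k(x, x') = <x, x'>^d."""
    return (X1 @ X2.T) ** d

def rbf_kernel(X1, X2, c=1.0):
    """Gram matrix for k(x, x') = exp(-||x - x'||^2 / c)."""
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / c)

def decision_function(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_i alpha_i y_i k(x, x_i) + b ) over the support vectors."""
    return np.sign(kernel(x[None, :], X_sv) @ (alpha * y_sv) + b)
```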

Soft Margin Hyperplane

- C-SVM (Cortes & Vapnik):
  - Introduce slack variables ξ_i ≥ 0, i = 1, ..., n, and relax the constraints to y_i(〈x_i, w〉 + b) ≥ 1 − ξ_i, i = 1, ..., n.
  - Penalize large ξ_i:

        min_{w∈H, ξ∈R^n}  τ(w, ξ) = (1/2)‖w‖² + (c/n) ∑_{i=1}^n ξ_i,   c > 0
        s.t.  ξ_i ≥ 0,  i = 1, ..., n
              y_i(〈x_i, w〉 + b) ≥ 1 − ξ_i,  i = 1, ..., n

Dual Optimization

      max_{α∈R^n}  W(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j k(x_i, x_j)
      s.t.  0 ≤ α_i ≤ c/n,  ∀i = 1, ..., n,   and   ∑_{i=1}^n α_i y_i = 0

We have w = ∑_{i=1}^n α_i y_i x_i. For all j such that 0 < α_j < c/n,

      b = y_j − ∑_{i=1}^n y_i α_i k(x_j, x_i)
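For reference, a C-SVM with an RBF kernel can be fit with scikit-learn's SVC. This is only an illustrative usage sketch with arbitrary toy data and parameters; note that sklearn's C multiplies the total slack rather than c/n per sample, so it corresponds to the formulation above only up to scaling.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# C-SVM with an RBF kernel; C plays the role of the slack penalty above.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
print(clf.n_support_)   # number of support vectors per class
```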

Multi-class Classification

- One versus the rest: train M binary classifiers f^1, ..., f^M, with f^j(x) = sgn(g^j(x)).
- Assign the class arg max_{j=1,...,M} g^j(x), where

      g^j(x) = ∑_{i=1}^n y_i^j α_i^j k(x, x_i) + b^j

Hilbert Space

- An inner product on a vector space H is a symmetric bilinear form 〈·, ·〉: H × H → R that is strictly positive definite: ∀x ∈ H, 〈x, x〉 ≥ 0, with equality only if x = 0.
- A Hilbert space is a complete inner product space.

Gram Matrix

- Kernel k: X × X → R; k is assumed symmetric.
- Gram matrix: given a function k: X × X → R and x_1, ..., x_m ∈ X, the m × m matrix K with K_{i,j} = k(x_i, x_j) is called the Gram matrix (or kernel matrix) of k with respect to x_1, ..., x_m.
- If the Gram matrix is positive definite for all x_1, ..., x_m ∈ X, we call k a positive definite kernel.
- If k(x, x′) is a positive definite kernel, then it is the inner product under the reproducing kernel map of x ∈ X:

      R^X := {f : X → R},  the functions mapping X into R
      Φ : X → R^X,  x ↦ k(·, x)

Reproducing Kernel Hilbert Space

- Consider the linear space spanned by the functions k(·, x):

      f(·) = ∑_{i=1}^m α_i k(·, x_i),  ∀m ∈ N, x_1, ..., x_m ∈ X,

  which is clearly a vector space.
- Define 〈f, g〉 = ∑_{i=1}^m ∑_{j=1}^{m′} α_i β_j k(x_i, x′_j), where f = ∑_{i=1}^m α_i k(·, x_i) and g = ∑_{j=1}^{m′} β_j k(·, x′_j). The expansions of f and g may not be unique, but 〈f, g〉 = ∑_{j=1}^{m′} β_j f(x′_j), which does not depend on the expansion of f. Similarly, it does not depend on the expansion of g. Hence 〈f, g〉 is uniquely defined.
- 〈f, f〉 = ∑_{i,j=1}^m α_i α_j k(x_i, x_j) ≥ 0 by the positive definiteness of k.
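A tiny numeric illustration (my own) of this inner product with the RBF kernel: taking g = k(·, x_0) in the definition gives 〈k(·, x_0), f〉 = ∑_i α_i k(x_i, x_0) = f(x_0), the reproducing property used on the next slide.

```python
import numpy as np

def k(x, xp, c=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / c), a positive definite kernel."""
    return np.exp(-np.sum((x - xp) ** 2) / c)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))          # expansion points x_1, ..., x_m
alpha = rng.normal(size=5)            # coefficients alpha_i

def f(x):
    """f = sum_i alpha_i k(., x_i), evaluated at x."""
    return sum(a * k(x, xi) for a, xi in zip(alpha, xs))

# Reproducing property: <k(., x0), f> = sum_i alpha_i k(x_i, x0) = f(x0)
x0 = rng.normal(size=2)
inner = sum(a * k(xi, x0) for a, xi in zip(alpha, xs))
print(np.isclose(inner, f(x0)))       # True
```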

- 〈k(·, x), f〉 = f(x) by definition.
- By the Cauchy-Schwarz inequality,

      |f(x)|² = |〈k(·, x), f〉|² ≤ k(x, x) 〈f, f〉.

  Hence 〈f, f〉 = 0 implies f(x) = 0 for all x, so 〈·, ·〉 is positive definite and thus an inner product.
- 〈Φ(x), Φ(x′)〉 = k(x, x′).
- Conversely, whenever we have a mapping Φ from X into an inner product space, we obtain a positive definite kernel via k(x, x′) := 〈Φ(x), Φ(x′)〉:

      ∑_{i,j} c_i c_j k(x_i, x_j) = 〈 ∑_i c_i Φ(x_i), ∑_j c_j Φ(x_j) 〉 = ‖ ∑_i c_i Φ(x_i) ‖² ≥ 0

- Complete the vector space by the usual technique of Cauchy sequences.
- Reproducing kernel Hilbert space:

      H := closure of span{ k(x, ·) | x ∈ X }

- Φ(x) need not be the only feature map.

Projections in Hilbert Space

Theorem (Projection in Hilbert Space): Let H be a Hilbert space and M a closed subspace. Then every x ∈ H can be written uniquely as x = z + z⊥, where z ∈ M and z⊥ ∈ M⊥, that is, 〈z⊥, t〉 = 0 for all t ∈ M. The vector z is the unique element of M minimizing ‖x − z‖; it is called the projection Px := z of x onto M. The projection operator P is a linear map.

Classification/Decision Trees (I)

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

Tree Structured Classifier

- Reference: Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Chapman & Hall, 1984.
- A medical example (CART):
  - Predict high-risk patients who will not survive at least 30 days on the basis of the initial 24-hour data.
  - 19 variables are measured during the first 24 hours, including blood pressure, age, etc.

[Figure: a tree-structured classification rule for the medical example]

- Denote the feature space by X. The input vector X ∈ X contains p features X_1, X_2, ..., X_p, some of which may be categorical.
- Tree structured classifiers are constructed by repeated splits of subsets of X into two descendant subsets, beginning with X itself.
- Definitions: node, terminal node (leaf node), parent node, child node.
- The union of the regions occupied by two child nodes is the region occupied by their parent node.
- Every leaf node is assigned a class. A query is assigned the class of the leaf node it lands in.

Notation

- A node is denoted by t. Its left child node is denoted by t_L and its right child node by t_R.
- The collection of all nodes is denoted by T; the collection of all leaf nodes by T̃.
- A split is denoted by s. The set of splits is denoted by S.


The Three Elements

- The construction of a tree involves the following three elements:
  1. The selection of the splits.
  2. The decision when to declare a node terminal or to continue splitting it.
  3. The assignment of each terminal node to a class.

- In particular, we need to decide the following:
  1. A set Q of binary questions of the form {Is X ∈ A?}, A ⊆ X.
  2. A goodness-of-split criterion Φ(s, t) that can be evaluated for any split s of any node t.
  3. A stop-splitting rule.
  4. A rule for assigning every terminal node to a class.

Standard Set of Questions

- The input vector X = (X_1, X_2, ..., X_p) contains features of both categorical and ordered types.
- Each split depends on the value of only a single variable.
- For each ordered variable X_j, Q includes all questions of the form

      {Is X_j ≤ c?}

  for all real-valued c.
- Since the training data set is finite, only finitely many distinct splits can be generated by questions of the form {Is X_j ≤ c?}.

- If X_j is categorical, taking values in, say, {1, 2, ..., M}, then Q contains all questions of the form

      {Is X_j ∈ A?},

  where A ranges over all subsets of {1, 2, ..., M}.
- The splits for all p variables constitute the standard set of questions.

Goodness of Split

- The goodness of a split is measured by an impurity function defined for each node.
- Intuitively, we want each leaf node to be "pure", that is, dominated by one class.

The Impurity Function

Definition: an impurity function is a function φ defined on the set of all K-tuples of numbers (p_1, ..., p_K) satisfying p_j ≥ 0, j = 1, ..., K, and ∑_j p_j = 1, with the properties:

1. φ attains its maximum only at the point (1/K, 1/K, ..., 1/K).
2. φ attains its minimum only at the points (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1).
3. φ is a symmetric function of p_1, ..., p_K, i.e., permuting the p_j leaves φ unchanged.

- Definition: given an impurity function φ, define the impurity measure i(t) of a node t as

      i(t) = φ( p(1 | t), p(2 | t), ..., p(K | t) ),

  where p(j | t) is the estimated probability of class j within node t.
- The goodness of a split s for node t, denoted Φ(s, t), is defined by

      Φ(s, t) = Δi(s, t) = i(t) − p_R i(t_R) − p_L i(t_L),

  where p_R and p_L are the proportions of the samples in node t that go to the right node t_R and the left node t_L, respectively.

- Define I(t) = i(t) p(t), that is, the impurity of node t weighted by the estimated proportion of data that go to node t.
- The impurity of a tree T, I(T), is defined by

      I(T) = ∑_{t ∈ T̃} I(t) = ∑_{t ∈ T̃} i(t) p(t),

  where T̃ is the set of leaf nodes of T.
- Note that for any node t the following equations hold:

      p(t_L) + p(t_R) = p(t)
      p_L = p(t_L)/p(t),  p_R = p(t_R)/p(t)
      p_L + p_R = 1

- Define

      ΔI(s, t) = I(t) − I(t_L) − I(t_R)
               = p(t) i(t) − p(t_L) i(t_L) − p(t_R) i(t_R)
               = p(t) ( i(t) − p_L i(t_L) − p_R i(t_R) )
               = p(t) Δi(s, t)

- Possible impurity functions:
  1. Entropy: ∑_{j=1}^K p_j log(1/p_j). If p_j = 0, use the limit lim_{p_j→0} p_j log p_j = 0.
  2. Misclassification rate: 1 − max_j p_j.
  3. Gini index: ∑_{j=1}^K p_j (1 − p_j) = 1 − ∑_{j=1}^K p_j².
- The Gini index seems to work best in practice for many problems.
- The twoing rule: at a node t, choose the split s that maximizes

      (p_L p_R / 4) [ ∑_j | p(j | t_L) − p(j | t_R) | ]²
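A short numpy sketch (illustrative, with my own names) of these impurity functions and of the goodness of split Δi(s, t):

```python
import numpy as np

def entropy(p):
    """Entropy impurity: sum_j p_j log(1/p_j), with 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def misclassification(p):
    """Misclassification impurity: 1 - max_j p_j."""
    return 1.0 - p.max()

def gini(p):
    """Gini index: 1 - sum_j p_j^2."""
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(p_t, p_tL, p_tR, pL, impurity=gini):
    """Delta i(s, t) = i(t) - pL i(tL) - pR i(tR), with pR = 1 - pL."""
    pR = 1.0 - pL
    return impurity(p_t) - pL * impurity(p_tL) - pR * impurity(p_tR)

# example: a 50/50 node split into two perfectly pure children
p_t = np.array([0.5, 0.5]); p_tL = np.array([1.0, 0.0]); p_tR = np.array([0.0, 1.0])
print(goodness_of_split(p_t, p_tL, p_tR, pL=0.5))   # 0.5 for the Gini index
```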

Estimate the Posterior Probabilities of Classes in Each Node

- The total number of samples is N and the number of samples in class j, 1 ≤ j ≤ K, is N_j.
- The number of samples going to node t is N(t); the number of samples of class j going to node t is N_j(t).
- ∑_{j=1}^K N_j(t) = N(t).
- N_j(t_L) + N_j(t_R) = N_j(t).
- For a full (balanced) tree, the sum of N(t) over all nodes t at the same level is N.

- Denote the prior probability of class j by π_j.
  - The priors π_j can be estimated from the data by N_j/N.
  - Sometimes the priors are given beforehand.
- The estimated probability of a sample in class j going to node t is p(t | j) = N_j(t)/N_j.
  - p(t_L | j) + p(t_R | j) = p(t | j).
  - For a full tree, the sum of p(t | j) over all nodes t at the same level is 1.

- The joint probability of a sample being in class j and going to node t is thus

      p(j, t) = π_j p(t | j) = π_j N_j(t)/N_j.

- The probability of any sample going to node t is

      p(t) = ∑_{j=1}^K p(j, t) = ∑_{j=1}^K π_j N_j(t)/N_j.

  Note p(t_L) + p(t_R) = p(t).
- The probability of a sample being in class j given that it goes to node t is

      p(j | t) = p(j, t)/p(t).

  For any t, ∑_{j=1}^K p(j | t) = 1.

- When π_j = N_j/N, we have the following simplifications:
  - p(j | t) = N_j(t)/N(t).
  - p(t) = N(t)/N.
  - p(j, t) = N_j(t)/N.

Stopping Criteria

- A simple criterion: stop splitting a node t when

      max_{s∈S} ΔI(s, t) < β,

  where β is a chosen threshold.
- This stopping criterion is unsatisfactory:
  - A node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits.

Class Assignment Rule

- A class assignment rule assigns a class j ∈ {1, ..., K} to every terminal node t ∈ T̃. The class assigned to node t ∈ T̃ is denoted by κ(t).
- For 0-1 loss, the class assignment rule is

      κ(t) = arg max_j p(j | t).

- The resubstitution estimate r(t) of the probability of misclassification, given that a case falls into node t, is

      r(t) = 1 − max_j p(j | t) = 1 − p(κ(t) | t).

- Denote R(t) = r(t) p(t).
- The resubstitution estimate for the overall misclassification rate R(T) of the tree classifier T is

      R(T) = ∑_{t ∈ T̃} R(t).

- Proposition: for any split of a node t into t_L and t_R,

      R(t) ≥ R(t_L) + R(t_R).

  Proof: denote j* = κ(t). Then

      p(j* | t) = p(j*, t_L | t) + p(j*, t_R | t)
                = p(j* | t_L) p(t_L | t) + p(j* | t_R) p(t_R | t)
                = p_L p(j* | t_L) + p_R p(j* | t_R)
                ≤ p_L max_j p(j | t_L) + p_R max_j p(j | t_R)

  Hence,

      r(t) = 1 − p(j* | t)
           ≥ 1 − [ p_L max_j p(j | t_L) + p_R max_j p(j | t_R) ]
           = p_L (1 − max_j p(j | t_L)) + p_R (1 − max_j p(j | t_R))
           = p_L r(t_L) + p_R r(t_R)

  Finally,

      R(t) = p(t) r(t)
           ≥ p(t) p_L r(t_L) + p(t) p_R r(t_R)
           = p(t_L) r(t_L) + p(t_R) r(t_R)
           = R(t_L) + R(t_R)

Digit Recognition Example (CART)

- The 10 digits are shown by different on-off combinations of seven horizontal and vertical lights.
- Each digit is represented by a 7-dimensional vector of zeros and ones. The ith sample is x_i = (x_{i1}, x_{i2}, ..., x_{i7}). If x_{ij} = 1, the jth light is on; if x_{ij} = 0, the jth light is off.

      Digit  x·1  x·2  x·3  x·4  x·5  x·6  x·7
        1     0    0    1    0    0    1    0
        2     1    0    1    1    1    0    1
        3     1    0    1    1    0    1    1
        4     0    1    1    1    0    1    0
        5     1    1    0    1    0    1    1
        6     1    1    0    1    1    1    1
        7     1    0    1    0    0    1    0
        8     1    1    1    1    1    1    1
        9     1    1    1    1    0    1    1
        0     1    1    1    0    1    1    1

- The data for the example are generated by a malfunctioning calculator.
- Each of the seven lights has probability 0.1 of being in the wrong state, independently.
- The training set contains 200 samples generated according to the specified distribution.

- A tree structured classifier is applied.
- The set of questions Q contains: Is x·j = 0?, j = 1, 2, ..., 7.
- The twoing rule is used for splitting.
- Pruning with cross-validation is used to choose the right sized tree.

- Classification performance:
  - The error rate estimated using a test set of size 5000 is 0.30.
  - The error rate estimated by cross-validation using the training set is 0.30.
  - The resubstitution estimate of the error rate is 0.29.
  - The Bayes error rate is 0.26.
  - There is little room for improvement over the tree classifier.

- Accidentally, every digit occupies exactly one leaf node.
  - In general, one class may occupy any number of leaf nodes, and occasionally no leaf node.
- X·6 and X·7 are never used.

Waveform Example (CART)

- Three functions h_1(τ), h_2(τ), h_3(τ) are shifted versions of each other, as shown in the figure.
- Each h_j is specified by the equal-lateral right triangle function; its values at the integers τ = 1, ..., 21 are measured.

- The three classes of waveforms are random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveform measured at τ = 1, 2, ..., 21.
- To generate a sample in class 1, a random number u uniformly distributed on [0, 1] and 21 random numbers ε_1, ε_2, ..., ε_21, normally distributed with mean zero and variance 1, are generated. Then

      x·j = u h_1(j) + (1 − u) h_2(j) + ε_j,  j = 1, ..., 21.

- To generate a sample in class 2, repeat the above to obtain a random number u and 21 random numbers ε_1, ..., ε_21, and set

      x·j = u h_1(j) + (1 − u) h_3(j) + ε_j,  j = 1, ..., 21.

- Class 3 vectors are generated by

      x·j = u h_2(j) + (1 − u) h_3(j) + ε_j,  j = 1, ..., 21.
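A sketch of the waveform generator (my own code). The notes do not give the triangle functions explicitly, so the peak locations τ = 11, 15, 7 below follow the parameterization commonly used for this example and should be treated as an assumption:

```python
import numpy as np

# Three shifted triangular waveforms on tau = 1, ..., 21 (assumed peak locations).
TAU = np.arange(1, 22)
h1 = np.maximum(6 - np.abs(TAU - 11), 0)
h2 = np.maximum(6 - np.abs(TAU - 15), 0)
h3 = np.maximum(6 - np.abs(TAU - 7), 0)
PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}

def generate_waveform(n=300, seed=0):
    """Random convex combination of two waveforms plus N(0,1) noise,
    with the three classes equally likely."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, 4, size=n)
    X = np.empty((n, 21))
    for i, c in enumerate(labels):
        ha, hb = PAIRS[c]
        u = rng.uniform()
        X[i] = u * ha + (1 - u) * hb + rng.normal(size=21)
    return X, labels

X_train, y_train = generate_waveform(300)
```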

[Figure: example random waveforms from Class 1, Class 2, and Class 3]

- 300 random samples are generated using prior probabilities (1/3, 1/3, 1/3) for training.
- Construction of the tree:
  - The set of questions: {Is x·j ≤ c?} for c ranging over all real numbers and j = 1, ..., 21.
  - The Gini index is used for measuring the goodness of split.
  - The final tree is selected by pruning and cross-validation.
- Results:
  - The cross-validation estimate of the misclassification rate is 0.29.
  - The misclassification rate on a separate test set of size 5000 is 0.28.
  - The Bayes classification rule can be derived; applying it to the test set yields a misclassification rate of 0.14.


Advantages of the Tree-Structured Approach

- Handles both categorical and ordered variables in a simple and natural way.
- Automatic stepwise variable selection and complexity reduction.
- Provides an estimate of the misclassification rate for a query sample.
- Invariant under all monotone transformations of individual ordered variables.
- Robust to outliers and misclassified points in the training set.
- Easy to interpret.

Variable Combinations

- Splits perpendicular to the coordinate axes are inefficient in certain cases.
- Use linear combinations of variables:

      Is ∑_j a_j x·j ≤ c?

- The amount of computation increases significantly.
- Price to pay: model complexity increases.

Missing Values

- Certain variables may be missing in some training samples.
  - This often occurs in gene-expression microarray data.
  - Suppose each variable has a 5% chance of being missing, independently. Then for a training sample with 50 variables, the probability of missing at least one variable is 1 − 0.95^50, as high as 92.3%.
- A query sample to be classified may also have missing variables.
- Remedy: find surrogate splits.
  - Suppose the best split for node t is s, which involves a question on X_m. Find another split s′ on a variable X_j, j ≠ m, which is most similar to s in a certain sense. Similarly, the second best surrogate split, the third, and so on, can be found.

Classification/Decision Trees (II)

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

Right Sized Trees

- Let the expected misclassification rate of a tree T be R*(T).
- Recall that the resubstitution estimate for R*(T) is

      R(T) = ∑_{t ∈ T̃} r(t) p(t) = ∑_{t ∈ T̃} R(t).

- R(T) is biased downward:

      R(t) ≥ R(t_L) + R(t_R).

Digit Recognition Example

      No. Terminal Nodes   R(T)   R^ts(T)
            71             .00     .42
            63             .00     .40
            58             .03     .39
            40             .10     .32
            34             .12     .32
            19             .29     .31
            10             .29     .30
             9             .32     .34
             7             .41     .47
             6             .46     .54
             5             .53     .61
             2             .75     .82
             1             .86     .91

- The estimate R(T) becomes increasingly less accurate as the trees grow larger.
- The test-set estimate R^ts first decreases as the tree becomes larger, hits its minimum at the tree with 10 terminal nodes, and begins to increase as the tree grows further.

Preliminaries for Pruning

- Grow a very large tree T_max:
  1. Split until all terminal nodes are pure (contain only one class) or contain only identical measurement vectors.
  2. Or split until the number of data points in each terminal node is no greater than a certain threshold, say 5, or even 1.
  3. As long as the tree is sufficiently large, the size of the initial tree is not critical.

1. Descendant: a node t′ is a descendant of node t if there is a connected path down the tree leading from t to t′.
2. Ancestor: t is an ancestor of t′ if t′ is its descendant.
3. A branch T_t of T with root node t ∈ T consists of the node t and all descendants of t in T.
4. Pruning a branch T_t from a tree T consists of deleting from T all descendants of t, that is, cutting off all of T_t except its root node. The tree pruned this way is denoted by T − T_t.
5. If T′ is obtained from T by successively pruning off branches, then T′ is called a pruned subtree of T and denoted by T′ ≺ T.

Subtrees

- Even for a moderately sized T_max, there is an enormously large number of subtrees and an even larger number of ways to prune the initial tree down to them.
- A "selective" pruning procedure is needed:
  - The pruning should be optimal in a certain sense.
  - The search over different ways of pruning should have a manageable computational load.

Minimal Cost-Complexity Pruning

- Definition of the cost-complexity measure:
  - For any subtree T ⪯ T_max, define its complexity as |T̃|, the number of terminal nodes of T. Let α ≥ 0 be a real number called the complexity parameter, and define the cost-complexity measure R_α(T) as

        R_α(T) = R(T) + α|T̃|.

- For each value of α, find the subtree T(α) that minimizes R_α(T), i.e.,

      R_α(T(α)) = min_{T ⪯ T_max} R_α(T).

- If α is small, the penalty for having a large number of terminal nodes is small and T(α) tends to be large.
- For α sufficiently large, the minimizing subtree T(α) consists of the root node only.

Page 85: Principal Component Analysis - astrostatistics.psu.edu · Principal Component Analysis Solution for PCA I The ordered eigenvectors of the covariance matrix are the principal component

Classification/Decision Trees (II)

I Since there are at most a finite number of subtrees of Tmax ,Rα(T (α)) yields different values for only finitely many α’s.T (α) continues to be the minimizing tree when α increasesuntil a jump point is reached.

I Two questions:I Is there a unique subtree T � Tmax which minimizes Rα(T )?I In the minimizing sequence of trees T1, T2, ..., is each subtree

obtained by pruning upward from the previous subtree, i.e.,does the nestingT1 � T2 � · · · � {t1} hold?

- Definition: the smallest minimizing subtree T(α) for complexity parameter α is defined by the conditions:
  1. R_α(T(α)) = min_{T ⪯ T_max} R_α(T);
  2. if R_α(T) = R_α(T(α)), then T(α) ⪯ T.
- If the subtree T(α) exists, it must be unique.
- It can be proved that for every value of α, there exists a smallest minimizing subtree.

Page 87: Principal Component Analysis - astrostatistics.psu.edu · Principal Component Analysis Solution for PCA I The ordered eigenvectors of the covariance matrix are the principal component

Classification/Decision Trees (II)

I The starting point for the pruning is not Tmax , but ratherT1 = T (0), which is the smallest subtree of Tmax satisfying

R(T1) = R(Tmax) .

I Let tL and tR be any two terminal nodes in Tmax descendedfrom the same parent node t. If R(t) = R(tL) + R(tR), pruneoff tL and tR .

I Continue the process until no more pruning is possible. Theresulting tree is T1.

- For any branch T_t of T_1, define R(T_t) by

      R(T_t) = ∑_{t′ ∈ T̃_t} R(t′),

  where T̃_t is the set of terminal nodes of T_t.
- For any nonterminal node t of T_1, R(t) > R(T_t).

Weakest-Link Cutting

- For any node t ∈ T_1, set R_α({t}) = R(t) + α.
- For any branch T_t, define R_α(T_t) = R(T_t) + α|T̃_t|.
- When α = 0, R_0(T_t) < R_0({t}). The inequality continues to hold for sufficiently small α. But at some critical value of α the two cost-complexities become equal, and for α exceeding this threshold the inequality is reversed.
- Solving the inequality R_α(T_t) < R_α({t}) gives

      α < ( R(t) − R(T_t) ) / ( |T̃_t| − 1 ).

  The right hand side is always positive.

- Define a function g_1(t), t ∈ T_1, by

      g_1(t) = ( R(t) − R(T_t) ) / ( |T̃_t| − 1 )   if t ∉ T̃_1,
      g_1(t) = +∞                                   if t ∈ T̃_1.

- Define the weakest link t̄_1 in T_1 as the node such that

      g_1(t̄_1) = min_{t ∈ T_1} g_1(t),

  and put α_2 = g_1(t̄_1).
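A minimal sketch (not CART's implementation) of one weakest-link step on a toy tree representation, where each node stores its resubstitution cost R(t) = r(t)p(t) and its children:

```python
# Tree stored as nested dicts: each node has cost 'R' and optional 'left'/'right'.

def branch_cost_and_leaves(node):
    """Return (R(T_t), |T~_t|): total leaf cost and number of leaves of the branch."""
    if "left" not in node:                 # terminal node
        return node["R"], 1
    RL, nL = branch_cost_and_leaves(node["left"])
    RR, nR = branch_cost_and_leaves(node["right"])
    return RL + RR, nL + nR

def weakest_link(node, best=None):
    """Find the internal node minimizing g(t) = (R(t) - R(T_t)) / (|T~_t| - 1)."""
    if "left" not in node:
        return best
    R_branch, n_leaves = branch_cost_and_leaves(node)
    g = (node["R"] - R_branch) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

def prune_weakest_link(root):
    """One pruning step: cut the branch at the weakest link; returns the next alpha."""
    g, t = weakest_link(root)
    del t["left"], t["right"]              # t becomes a terminal node
    return g

# toy example: root with two pure-ish leaves
root = {"R": 0.20, "left": {"R": 0.05}, "right": {"R": 0.10}}
alpha_2 = prune_weakest_link(root)         # (0.20 - 0.15) / (2 - 1) = 0.05
```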

- As α increases, t̄_1 is the first node at which the single node {t̄_1} becomes preferable to the branch T_{t̄_1} descended from it.
- α_2 is the first value after α_1 = 0 that yields a strict subtree of T_1 with a smaller cost-complexity at that complexity parameter. That is, for all α_1 ≤ α < α_2, the tree with the smallest cost-complexity is T_1.
- Let T_2 = T_1 − T_{t̄_1}.

- Repeat the previous steps with T_2 in place of T_1: find the weakest link in T_2 and prune off the branch at the weakest link node.

      g_2(t) = ( R(t) − R(T_{2t}) ) / ( |T̃_{2t}| − 1 )   if t ∈ T_2, t ∉ T̃_2,
      g_2(t) = +∞                                          if t ∈ T̃_2

      g_2(t̄_2) = min_{t ∈ T_2} g_2(t)
      α_3 = g_2(t̄_2)
      T_3 = T_2 − T_{t̄_2}

- If at any stage there are multiple weakest links, for instance if g_k(t̄_k) = g_k(t̄′_k), then define T_{k+1} = T_k − T_{t̄_k} − T_{t̄′_k}.
- Two branches are either nested or share no node.

- A decreasing sequence of nested subtrees is obtained:

      T_1 ⪰ T_2 ⪰ T_3 ⪰ ··· ⪰ {t_1}.

- Theorem: the {α_k} form an increasing sequence, that is, α_k < α_{k+1}, k ≥ 1, where α_1 = 0. For k ≥ 1 and α_k ≤ α < α_{k+1}, T(α) = T(α_k) = T_k.

I At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes. As the tree becomes smaller, it tends to cut off fewer.

I Digit recognition example (number of terminal nodes $|\tilde{T}_k|$ of each subtree in the pruning sequence):

    Tree              T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13
    $|\tilde{T}_k|$   71  63  58  40  34  19  10   9   7    6    5    2    1

Best Pruned Subtree

I Two approaches to choosing the best pruned subtree:
    I Use a test sample set.
    I Use cross-validation.

I Test sample set: use a held-out test set to compute the classification error rate of each minimum cost-complexity subtree, and choose the subtree with the minimum test error rate.

I Cross-validation is trickier because tree structures are not stable: when the training data set changes slightly, there may be large structural changes in the tree.

I It is therefore difficult to match a subtree trained on the entire data set with a subtree trained on only a majority portion of it.

I Instead, focus on choosing the right complexity parameter α.

Pruning by Cross-Validation

I Consider V-fold cross-validation. The original learning sample $L$ is divided by random selection into V subsets $L_v$, $v = 1, \dots, V$. Let the training sample set in each fold be $L^{(v)} = L - L_v$.

I The tree grown on the original set is $T_{\max}$. V accessory trees $T^{(v)}_{\max}$ are grown on the sets $L^{(v)}$.

I For each value of the complexity parameter α, let $T(\alpha)$ and $T^{(v)}(\alpha)$, $v = 1, \dots, V$, be the corresponding minimal cost-complexity subtrees of $T_{\max}$ and $T^{(v)}_{\max}$.

I For each maximum tree, we obtain a sequence of jump points of α: $\alpha_1 < \alpha_2 < \alpha_3 < \cdots < \alpha_k < \cdots$.

I To find the minimal cost-complexity subtree at a given α, find the $\alpha_k$ in the list such that $\alpha_k \leq \alpha < \alpha_{k+1}$; the subtree corresponding to $\alpha_k$ is the subtree for α.
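This lookup is a simple interval search; a small Python sketch (the variable names are illustrative, not from the notes):

    import bisect

    def subtree_for_alpha(alpha, alphas, subtrees):
        """alphas: sorted jump points alpha_1 < alpha_2 < ...;
        subtrees[k] is the minimal cost-complexity subtree for alphas[k].
        Returns T(alpha), i.e., the subtree for the largest alpha_k <= alpha."""
        k = bisect.bisect_right(alphas, alpha) - 1
        return subtrees[k]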

I The cross-validation error rate of $T(\alpha)$ is computed by

    R^{CV}(T(\alpha)) = \frac{1}{V} \sum_{v=1}^{V} \frac{N^{(v)}_{miss}}{N^{(v)}},

where $N^{(v)}$ is the number of samples in the test set $L_v$ of fold $v$, and $N^{(v)}_{miss}$ is the number of misclassified samples in $L_v$ under $T^{(v)}(\alpha)$, a pruned subtree of $T^{(v)}_{\max}$ trained on $L^{(v)}$.

I Although α is continuous, only finitely many minimum cost-complexity trees are grown on $L$.

I Let $T_k = T(\alpha_k)$. To compute the cross-validation error rate of $T_k$, let $\alpha_k' = \sqrt{\alpha_k \alpha_{k+1}}$.

I Let $R^{CV}(T_k) = R^{CV}(T(\alpha_k'))$.

I For the root node tree $\{t_1\}$, $R^{CV}(\{t_1\})$ is set to the resubstitution cost $R(\{t_1\})$.

I Choose the subtree $T_k$ with minimum cross-validation error rate $R^{CV}(T_k)$.

Computation Involved

1. Grow V + 1 maximum trees.

2. For each of the V + 1 trees, find the sequence of subtrees with minimum cost-complexity.

3. Suppose the maximum tree grown on the original data set, $T_{\max}$, has K subtrees in its pruning sequence.

4. For each of the (K − 1) values $\alpha_k'$, compute the misclassification rate on each of the V test sample sets, average the V error rates, and take the mean as the cross-validation error rate.

5. Find the subtree of $T_{\max}$ with minimum $R^{CV}(T_k)$.
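In practice this is close to what scikit-learn provides through its cost-complexity pruning utilities. Below is a sketch (not the notes' implementation) that selects α by V-fold cross-validation over the geometric means $\alpha_k' = \sqrt{\alpha_k \alpha_{k+1}}$; it cross-validates over α directly rather than building explicit accessory trees, which differs slightly from the procedure above.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def best_ccp_alpha(X, y, V=10):
        """Choose the complexity parameter alpha by V-fold cross-validation."""
        # Jump points alpha_1 < alpha_2 < ... computed on the full learning sample.
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
        alphas = path.ccp_alphas
        # Evaluate each subtree at the geometric mean of consecutive jump points.
        alpha_primes = np.sqrt(alphas[:-1] * alphas[1:])
        cv_accuracy = [
            cross_val_score(
                DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=V
            ).mean()
            for a in alpha_primes
        ]
        # Maximum CV accuracy corresponds to minimum CV error rate R^CV(T_k).
        return alpha_primes[int(np.argmax(cv_accuracy))]

    # Final pruned tree, trained on the whole learning sample:
    # tree = DecisionTreeClassifier(ccp_alpha=best_ccp_alpha(X, y)).fit(X, y)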

Bagging and Boosting: Brief Introduction

Overview

I Bagging and boosting are meta-algorithms that pool decisions from multiple classifiers.

I Much information can be found on Wikipedia.

Overview on Bagging

I Invented by Leo Breiman: Bootstrap aggregating.

I L. Breiman, “Bagging predictors,” Machine Learning, 24(2):123-140, 1996.

I Majority vote from classifiers trained on bootstrap samples of the training data.

Overview on Boosting

I Iteratively learn weak classifiers.

I The final result is a weighted sum of the outputs of the weak classifiers.

I There are many different kinds of boosting algorithms; AdaBoost (adaptive boosting), by Y. Freund and R. Schapire, was the first.

Bagging

I Generate B bootstrap samples of the training data: random samples drawn with replacement.

I Train a classifier or a regression function on each bootstrap sample.

I For classification: take a majority vote over the classification results.

I For regression: average the predicted values.

I Bagging reduces variance.

I It improves performance for unstable classifiers, i.e., those that vary significantly with small changes in the data set, e.g., CART.

I Found to improve CART a lot, but not nearest-neighbor classifiers.
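A minimal sketch of bagging for classification (illustrative only; it assumes NumPy and scikit-learn-style estimators with fit/predict, uses a decision tree as the unstable base classifier, and assumes non-negative integer class labels for the vote):

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, base_estimator=None, B=50, seed=0):
        """Train B classifiers, each on a bootstrap sample drawn with replacement."""
        rng = np.random.default_rng(seed)
        base_estimator = base_estimator or DecisionTreeClassifier()
        n = len(y)
        models = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
            models.append(clone(base_estimator).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Majority vote over the B classifiers (for regression, average instead)."""
        votes = np.stack([m.predict(X) for m in models])   # shape (B, n_samples)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), axis=0, arr=votes
        )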

Adaboost for Binary Classification

1. Training data: $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y} = \{-1, 1\}$.

2. Let $w_{1,i} = \frac{1}{n}$, $i = 1, \dots, n$.

3. For $t = 1, \dots, T$:

   3.1 Learn a classifier $f_t : \mathcal{X} \to \mathcal{Y}$ that minimizes the error rate with respect to the distribution $w_{t,i}$ over the $x_i$'s.

   3.2 Let $e_t = \sum_{i=1}^{n} w_{t,i}\, I(y_i \neq f_t(x_i))$.

   3.3 If $e_t > 0.5$, stop.

   3.4 Choose $\alpha_t \in \mathbb{R}$. Usually set $\alpha_t = \frac{1}{2} \log \frac{1 - e_t}{e_t}$.

   3.5 Update $w_{t+1,i} = \frac{w_{t,i}\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t}$, where $Z_t$ is a normalization factor ensuring $\sum_i w_{t+1,i} = 1$.

4. Output the final classifier:

    f(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

Note: the update of $w_{t,i}$ implies that incorrectly classified points receive increased weights in the next round of learning.
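A compact Python sketch of the algorithm above (an illustration under the stated setup, with depth-1 trees from scikit-learn as the weak learners; the function names are ours, and X, y are assumed to be NumPy arrays with y taking values in {-1, +1}):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """AdaBoost for binary labels y in {-1, +1}, following steps 1-4 above."""
        n = len(y)
        w = np.full(n, 1.0 / n)                    # step 2: uniform initial weights
        learners, alphas = [], []
        for _ in range(T):
            # step 3.1: weak learner trained on the weighted sample (a decision stump)
            f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = f.predict(X)
            e = np.sum(w * (pred != y))            # step 3.2: weighted error rate
            if e > 0.5:                            # step 3.3
                break
            e = np.clip(e, 1e-12, None)            # guard against log(0)
            alpha = 0.5 * np.log((1 - e) / e)      # step 3.4
            w = w * np.exp(-alpha * y * pred)      # step 3.5: reweight ...
            w /= w.sum()                           # ... and renormalize (Z_t)
            learners.append(f)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        """Step 4: sign of the weighted sum of the weak classifiers' outputs."""
        score = sum(a * f.predict(X) for f, a in zip(learners, alphas))
        return np.sign(score)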