
Principal Component Analysis

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

G. Jogesh Babu
Principal Component Analysis (PCA)

- Consider the data matrix X (n × p), where each row is one data instance and each column is one measurement.
- Let the rows of X be x_i^t, i = 1, ..., n, with x_i ∈ R^p.
- Assume the mean of each column of X has been removed.
- What can PCA achieve?
  - A linear projection onto a lower-dimensional subspace.
  - It maximizes the variance (total variation) of the projected data.
  - It minimizes the discrepancy between the full-dimensional data and the projection onto the subspace.


Mathematical Formulation

- Consider an orthonormal basis A = (a_1, a_2, ..., a_p), a_j ∈ R^p (a rotation of the coordinates).
- For the k < p dimensional subspace spanned by a_1, ..., a_k:
  - Project x_i onto the subspace: ∑_{j=1}^k 〈x_i, a_j〉 a_j, where 〈·, ·〉 denotes the inner product.
  - Total variation of the projected data (up to a constant factor of n):

      max ∑_{j=1}^k a_j^t X^t X a_j    (1)

    Equivalently, for k = 1, this is the squared length ‖X a_1‖² of a normalized linear combination of the columns X_1, ..., X_p.

- An equivalent criterion for deriving PCA: minimize the discrepancy between the full-dimensional data and their projections onto the subspace:

      min ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²    (2)

- Equivalence of Criteria (1) and (2):

      ∑_{i=1}^n ‖x_i‖² = ∑_{i=1}^n ‖ ∑_{j=1}^p 〈x_i, a_j〉 a_j ‖² = ∑_{j=1}^p a_j^t X^t X a_j

      ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖² = ∑_{i=1}^n ‖ ∑_{j=k+1}^p 〈x_i, a_j〉 a_j ‖² = ∑_{j=k+1}^p a_j^t X^t X a_j

  Since ∑_{i=1}^n ‖x_i‖² is fixed:

      max ∑_{j=1}^k a_j^t X^t X a_j
      ⇐⇒ max ∑_{i=1}^n ‖ ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²
      ⇐⇒ min ∑_{i=1}^n ‖ ∑_{j=k+1}^p 〈x_i, a_j〉 a_j ‖²
      ⇐⇒ min ∑_{i=1}^n ‖ x_i − ∑_{j=1}^k 〈x_i, a_j〉 a_j ‖²

Solution

- Consider max ∑_{j=1}^k a_j^t X^t X a_j progressively for k = 1, 2, ....
- Let Σ = X^t X.
- Rayleigh-Ritz quotient: R_Σ(a) = 〈Σa, a〉 / 〈a, a〉.
- Suppose v_j, j = 1, ..., p, are the eigenvectors of Σ with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_p. Writing a = ∑_{j=1}^p α_j v_j,

      R_Σ(a) = ( ∑_{j=1}^p λ_j α_j² ) / ( ∑_{j=1}^p α_j² )

  Without loss of generality, assume ∑_{j=1}^p α_j² = 1. Clearly

      R_Σ(a) ≤ λ_1,

  with equality achieved by a = v_1. A more general result is the min-max (Courant-Fischer) theorem.

Solution for PCA

- The ordered eigenvectors of the covariance matrix Σ are the principal component directions.
- Properties guaranteed:
  - The variance of the first principal component is maximized among all linear projections.
  - The variance of the kth principal component is maximized among all directions orthogonal to the previous k − 1 principal component directions.
  - The subspace spanned by v_1, ..., v_k achieves the minimum discrepancy (in L2 norm) from the original data among all k-dimensional subspaces.

Singular Value Decomposition (SVD)

- Alternatively, write X = U D V^T.
- U = (u_1, u_2, ..., u_p) is an N × p orthogonal matrix; u_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the columns of X.
- V = (v_1, v_2, ..., v_p) is a p × p orthogonal matrix; v_j, j = 1, ..., p, form an orthonormal basis for the space spanned by the rows of X.
- D = diag(d_1, d_2, ..., d_p), with d_1 ≥ d_2 ≥ ... ≥ d_p ≥ 0 the singular values of X.

Principal Components

- The sample covariance matrix of X is S = X^T X / N.
- Eigendecomposition of X^T X:

      X^T X = (U D V^T)^T (U D V^T)
            = V D U^T U D V^T
            = V D² V^T

- The eigenvectors of X^T X, the v_j, are called the principal component directions of X.

- It is easy to see that z_j = X v_j = u_j d_j. Hence u_j, scaled by d_j, is simply the projection of the row vectors of X (the input predictor vectors) onto the direction v_j. For example,

      z_1 = ( X_{1,1} v_{1,1} + X_{1,2} v_{1,2} + ··· + X_{1,p} v_{1,p}
              X_{2,1} v_{1,1} + X_{2,2} v_{1,2} + ··· + X_{2,p} v_{1,p}
              ...
              X_{N,1} v_{1,1} + X_{N,2} v_{1,2} + ··· + X_{N,p} v_{1,p} )

- The principal components of X are z_j = d_j u_j, j = 1, ..., p.
- The first principal component of X, z_1, has the largest sample variance among all normalized linear combinations of the columns of X:

      Var(z_1) = d_1² / N.

- Subsequent principal components z_j have maximum variance d_j² / N, subject to being orthogonal to the earlier ones.
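A quick numerical check of these identities with numpy's SVD (illustrative code, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 6
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                                 # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)       # X = U diag(d) V^T
V = Vt.T

Z = X @ V                                              # principal components z_j = X v_j
print(np.allclose(Z, U * d))                           # z_j = d_j u_j
print(np.allclose(Z.var(axis=0, ddof=0), d**2 / N))    # Var(z_j) = d_j^2 / N
```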

Interpretation of Principal Components

- Loadings: the element v_{j,l} of the jth principal component direction v_j is the loading of the lth original variable in the jth component.
- Scores: the element (X v_j)_i is the score of the jth principal component for the ith instance.
- Simple structure interpretation: prefer loadings close to 1 or 0, that is, prefer each variable to be either irrelevant to a principal component or to contribute to it strongly.
- Classic idea: coordinate rotation (developed in general for factor analysis).

Rotation of Principal Component Directions

- Find an orthonormal basis spanning the same subspace as the first k PCDs under which the loadings are more "extreme".
- The subspace is NOT changed, but the progressive maximum-variance property along the PCDs no longer holds.

Varimax Criterion

- Let T (k × k) be an orthonormal rotation matrix acting within the subspace spanned by the first k PCDs v_1, ..., v_k.
- Let V^(k) = (v_1, v_2, ..., v_k).
- Under the rotated coordinates, the loading matrix becomes V^(k) T.
- Varimax by rows:

      arg max_T ∑_{j=1}^p [ (1/k) ∑_{l=1}^k (V^(k) T)_{j,l}^4 − ( (1/k) ∑_{l=1}^k (V^(k) T)_{j,l}^2 )² ]

- Varimax by columns:

      arg max_T ∑_{l=1}^k [ (1/p) ∑_{j=1}^p (V^(k) T)_{j,l}^4 − ( (1/p) ∑_{j=1}^p (V^(k) T)_{j,l}^2 )² ]

- Intuition: a large variance tends to produce extreme values.
- The sum over rows (or over columns) of the variance of the squared loadings is maximized.
- Recommended by Jolliffe (1989): rotate within subspaces spanned by eigenvectors with similar eigenvalues.
  - Rationale: under the rotated coordinates of the PCDs, the variance along each coordinate is still large.
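Below is a sketch of the commonly used SVD-based iteration for maximizing the column-wise varimax criterion over orthonormal rotations T. It is an illustration under my own naming, not code from the notes or from Kaiser (1958).

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate a loading matrix Phi (p x k) toward the varimax criterion.
    gamma = 1 gives varimax; returns the rotated loadings and the rotation T."""
    p, k = Phi.shape
    T = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = Phi @ T                                   # current rotated loadings
        # target matrix for the orthogonal Procrustes step of the iteration
        G = Phi.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                    # best orthonormal rotation
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return Phi @ T, T
```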

Sparsity in PC Loadings

- Jolliffe, Trendafilov, and Uddin (2003): SCoTLASS.
- To find the kth direction a_k:

      arg max_{a_k}  a_k^t (X^t X) a_k
      s.t.  a_k^t a_k = 1,   a_j^t a_k = 0 for 1 ≤ j < k,   and   ∑_{j=1}^p |a_{k,j}| ≤ t

- Successively maximize the variance under an L1 constraint to achieve sparsity.
- Solved numerically, e.g., by projected gradient methods.

Sparse PCA

- Zou, Hastie, and Tibshirani (JCGS 2006):
- Theorem 3: Suppose we are considering the first k principal components. Let A (p × k) = (α_1, ..., α_k) and B (p × k) = (β_1, ..., β_k). For any λ > 0, let

      (Â, B̂) = arg min_{A,B}  ∑_{i=1}^n ‖x_i − A B^t x_i‖² + λ ∑_{j=1}^k ‖β_j‖²
      subject to  A^t A = I_{k×k}.

  Then β̂_j ∝ v_j, j = 1, 2, ..., k.

- Add a lasso penalty for sparsity:

      (Â, B̂) = arg min_{A,B}  ∑_{i=1}^n ‖x_i − A B^t x_i‖² + λ ∑_{j=1}^k ‖β_j‖² + ∑_{j=1}^k λ_{1,j} ‖β_j‖_1
      subject to  A^t A = I_{k×k}.

- Numerical solution (alternating):
  - B given A: for each j, let Y*_j = X α_j. Solve for B = (β_1, ..., β_k) by the elastic net estimate:

        β_j = arg min_{β_j}  ‖Y*_j − X β_j‖² + λ ‖β_j‖² + λ_{1,j} ‖β_j‖_1

  - A given B: minimize ∑_{i=1}^n ‖x_i − A B^t x_i‖² = ‖X − X B A^t‖², subject to A^t A = I_{k×k}. The solution is given by a reduced-rank form of the Procrustes rotation: compute the SVD (X^t X) B = U D V^t and set A = U V^t.
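As an illustration only, a rough numpy/scikit-learn sketch of this alternating scheme. sklearn's ElasticNet is used as a stand-in for the elastic net step, and the mapping of (λ, λ_{1,j}) onto its (alpha, l1_ratio) parameterization is only approximate; all names and settings are my own.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_pca(X, k, lam=1e-4, lam1=0.1, n_iter=50):
    """Alternating sketch of the sparse PCA procedure described above."""
    n, p = X.shape
    X = X - X.mean(axis=0)
    # initialize A with the ordinary principal component directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T
    B = A.copy()
    for _ in range(n_iter):
        # B-step: elastic net regression of Y*_j = X alpha_j on X, for each j
        for j in range(k):
            y_star = X @ A[:, j]
            enet = ElasticNet(alpha=(lam + lam1) / (2 * n),       # approximate mapping
                              l1_ratio=lam1 / (lam + lam1),
                              fit_intercept=False, max_iter=5000)
            B[:, j] = enet.fit(X, y_star).coef_
        # A-step: reduced-rank Procrustes rotation, (X^t X) B = U D V^t, A = U V^t
        U2, _, Vt2 = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U2 @ Vt2
    # return normalized sparse loadings
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms
```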

References

1. Cadima, J., and Jolliffe, I. (1995), "Loadings and Correlations in the Interpretation of Principal Components," Journal of Applied Statistics, 2: 203-214.
2. Jennrich, R. I. (2001), "A Simple General Procedure for Orthogonal Rotation," Psychometrika, 2: 289-306.
3. Jolliffe, I. (1989), "Rotation of Ill-defined Principal Components," Journal of Applied Statistics, 1: 139-147.
4. Jolliffe, I. (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied Statistics, 22: 29-35.
5. Jolliffe, I., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the Lasso," Journal of Computational and Graphical Statistics, 12: 531-547.
6. Kaiser, H. (1958), "The Varimax Criterion for Analytic Rotation in Factor Analysis," Psychometrika, 3: 187-200.
7. Zou, H., Hastie, T., and Tibshirani, R. (2006), "Sparse Principal Component Analysis," Journal of Computational and Graphical Statistics, 2: 265-286.

Classification by Penalized Empirical Risk Minimization: SVM, Logistic Regression

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

A General Framework

- Consider training data {(x_i, y_i), i = 1, ..., n}, where x_i is the attribute/feature vector and y_i is the label, y_i ∈ {−1, 1}.
- For a linear classifier f(x) = 〈w, x〉 + b, classify by sign(f(x)):

      y = sign(w^t x + b)

- Let the loss function be L(x, y; w) = L(w^t x + b, y). Usually, for classification, we set z = y(w^t x + b) and write L(w^t x + b, y) = L(z).
- For least squares regression, L(w^t x + b, y) = ‖w^t x + b − y‖².

A General Framework (Continued)

- Penalized empirical risk minimization:

      min_{w,b} R(w, b),   R(w, b) = (1/n) ∑_{i=1}^n L( y_i (w^t x_i + b) ) + λ‖w‖²    (1)

- Logistic regression: λ = 0, with the logistic loss

      L(z) = log(1 + e^{−z})

- Support vector machine: the hinge loss

      L(z) = [1 − z]_+ = max(0, 1 − z)

Optimization Solution

- Gradient descent:

      (w, b) ← (w, b) − η ∇_{w,b} R(w, b),

  where η is the step size.
- Stochastic gradient descent: for each (x_i, y_i),

      (w, b) ← (w, b) − η ∇_{w,b} L( y_i (w^t x_i + b) )

- Technicality: at non-differentiable points of a convex function, use a subgradient instead.
- A vector v is a subgradient of g(x) at x_0 in C if

      g(x) − g(x_0) ≥ v^t (x − x_0),  ∀x ∈ C
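A minimal sketch (mine, not from the notes) of the stochastic subgradient update for the hinge loss, with the L2 penalty folded into each step; the step size and its schedule are arbitrary choices.

```python
import numpy as np

def linear_svm_sgd(X, y, lam=0.01, eta=0.1, n_epochs=20, seed=0):
    """Stochastic subgradient descent for
    (1/n) sum_i max(0, 1 - y_i (w^t x_i + b)) + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            # subgradient of the hinge term: -y_i x_i if margin < 1, else 0
            if margin < 1:
                grad_w = -y[i] * X[i] + 2 * lam * w
                grad_b = -y[i]
            else:
                grad_w = 2 * lam * w
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b
```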

A Geometric View: Maximum-Margin Classifier

- Reference: B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, 2002. The definitions below are from Learning with Kernels.
- Canonical Hyperplane: the pair (w, b) ∈ H × R is called a canonical form of a hyperplane with respect to x_1, ..., x_n ∈ H if it is scaled such that min_{i=1,...,n} |〈w, x_i〉 + b| = 1, which amounts to saying that the point closest to the hyperplane has distance 1/‖w‖.
- Geometrical Margin: for a hyperplane {x ∈ H | 〈w, x〉 + b = 0}, we call

      ρ_{w,b}(x, y) = y(〈w, x〉 + b) / ‖w‖

  the geometrical margin of the point (x, y) ∈ H × {1, −1}. The minimum value ρ_{w,b} = min_{i=1,...,n} ρ_{w,b}(x_i, y_i) is the geometrical margin of the data set.

Optimal Margin Hyperplane

      min_{w∈H, b∈R}  τ(w) = (1/2)‖w‖²    (2)
      s.t.  y_i(〈x_i, w〉 + b) ≥ 1,  ∀i = 1, ..., n

Lagrangian:

      L(w, b, α) = (1/2)‖w‖² − ∑_{i=1}^n α_i ( y_i(〈x_i, w〉 + b) − 1 ),
      α_i ≥ 0,  i = 1, ..., n

Dual Optimization

      max_{α∈R^n}  W(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j 〈x_i, x_j〉    (3)
      s.t.  α_i ≥ 0,  i = 1, ..., n,   and   ∑_{i=1}^n α_i y_i = 0

We have

      w = ∑_{i=1}^n α_i y_i x_i
      b = y_j − ∑_{i=1}^n y_i α_i k(x_j, x_i),   for any j with α_j > 0

Decision function f(x):

      f(x) = sgn( ∑_{i=1}^n α_i y_i 〈x, x_i〉 + b )

Non-linear SV classifier: replace 〈·, ·〉 by a kernel function k(·, ·).

- k(x_i, x_j) = 〈x_i, x_j〉^d : polynomial classifier of degree d
- k(x_i, x_j) = exp(−‖x_i − x_j‖²/c) : radial basis function classifier
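For illustration (not from the notes), Gram matrices for these two kernels and the resulting kernel decision function, assuming the dual variables α_i and the offset b for the support vectors are already available:

```python
import numpy as np

def polynomial_kernel(X1, X2, d=3):
    """Gram matrix for k(x, x') = <x, x'>^d."""
    return (X1 @ X2.T) ** d

def rbf_kernel(X1, X2, c=1.0):
    """Gram matrix for k(x, x') = exp(-||x - x'||^2 / c)."""
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / c)

def decision_function(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    """f(x) = sgn( sum_i alpha_i y_i k(x, x_i) + b ) over the support vectors."""
    return np.sign(kernel(x[None, :], X_sv) @ (alpha * y_sv) + b)
```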

Soft Margin Hyperplane

- C-SVM (Cortes & Vapnik):
  - Introduce slack variables ξ_i ≥ 0, i = 1, ..., n, and relax the constraints to y_i(〈x_i, w〉 + b) ≥ 1 − ξ_i, i = 1, ..., n.
  - Penalize large ξ_i:

        min_{w∈H, ξ∈R^n}  τ(w, ξ) = (1/2)‖w‖² + (c/n) ∑_{i=1}^n ξ_i,   c > 0
        s.t.  ξ_i ≥ 0,  i = 1, ..., n
              y_i(〈x_i, w〉 + b) ≥ 1 − ξ_i,  i = 1, ..., n

Dual Optimization

      max_{α∈R^n}  W(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i α_j y_i y_j k(x_i, x_j)
      s.t.  0 ≤ α_i ≤ c/n,  ∀i = 1, ..., n,   and   ∑_{i=1}^n α_i y_i = 0

We have w = ∑_{i=1}^n α_i y_i x_i. For all j such that 0 < α_j < c/n,

      b = y_j − ∑_{i=1}^n y_i α_i k(x_j, x_i)
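For reference, a C-SVM with an RBF kernel can be fit with scikit-learn's SVC. This is only an illustrative usage sketch with arbitrary toy data and parameters; note that sklearn's C multiplies the total slack rather than c/n per sample, so it corresponds to the formulation above only up to scaling.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# C-SVM with an RBF kernel; C plays the role of the slack penalty above.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
print(clf.n_support_)   # number of support vectors per class
```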

Multi-class Classification

- One versus the rest: train M binary classifiers f^1, ..., f^M, with f^j(x) = sgn(g^j(x)).
- Assign the class arg max_{j=1,...,M} g^j(x), where

      g^j(x) = ∑_{i=1}^n y_i^j α_i^j k(x, x_i) + b^j

Hilbert Space

- An inner product on a vector space H is a symmetric bilinear form 〈·, ·〉: H × H → R that is strictly positive definite: ∀x ∈ H, 〈x, x〉 ≥ 0, with equality only if x = 0.
- A Hilbert space is a complete inner product space.

Gram Matrix

- Kernel k: X × X → R; k is assumed symmetric.
- Gram matrix: given a function k: X × X → R and x_1, ..., x_m ∈ X, the m × m matrix K with K_{i,j} = k(x_i, x_j) is called the Gram matrix (or kernel matrix) of k with respect to x_1, ..., x_m.
- If the Gram matrix is positive definite for all x_1, ..., x_m ∈ X, we call k a positive definite kernel.
- If k(x, x′) is a positive definite kernel, then it is the inner product under the reproducing kernel map of x ∈ X:

      R^X := {f : X → R},  the functions mapping X into R
      Φ : X → R^X,  x ↦ k(·, x)

Reproducing Kernel Hilbert Space

- Consider the linear space spanned by the functions k(·, x):

      f(·) = ∑_{i=1}^m α_i k(·, x_i),  ∀m ∈ N, x_1, ..., x_m ∈ X,

  which is clearly a vector space.
- Define 〈f, g〉 = ∑_{i=1}^m ∑_{j=1}^{m′} α_i β_j k(x_i, x′_j), where f = ∑_{i=1}^m α_i k(·, x_i) and g = ∑_{j=1}^{m′} β_j k(·, x′_j). The expansions of f and g may not be unique, but 〈f, g〉 = ∑_{j=1}^{m′} β_j f(x′_j), which does not depend on the expansion of f. Similarly, it does not depend on the expansion of g. Hence 〈f, g〉 is uniquely defined.
- 〈f, f〉 = ∑_{i,j=1}^m α_i α_j k(x_i, x_j) ≥ 0 by the positive definiteness of k.
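A tiny numeric illustration (my own) of this inner product with the RBF kernel: taking g = k(·, x_0) in the definition gives 〈k(·, x_0), f〉 = ∑_i α_i k(x_i, x_0) = f(x_0), the reproducing property used on the next slide.

```python
import numpy as np

def k(x, xp, c=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / c), a positive definite kernel."""
    return np.exp(-np.sum((x - xp) ** 2) / c)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))          # expansion points x_1, ..., x_m
alpha = rng.normal(size=5)            # coefficients alpha_i

def f(x):
    """f = sum_i alpha_i k(., x_i), evaluated at x."""
    return sum(a * k(x, xi) for a, xi in zip(alpha, xs))

# Reproducing property: <k(., x0), f> = sum_i alpha_i k(x_i, x0) = f(x0)
x0 = rng.normal(size=2)
inner = sum(a * k(xi, x0) for a, xi in zip(alpha, xs))
print(np.isclose(inner, f(x0)))       # True
```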

- 〈k(·, x), f〉 = f(x) by definition.
- By the Cauchy-Schwarz inequality,

      |f(x)|² = |〈k(·, x), f〉|² ≤ k(x, x) 〈f, f〉.

  Hence 〈f, f〉 = 0 implies f(x) = 0 for all x, so 〈·, ·〉 is positive definite and thus an inner product.
- 〈Φ(x), Φ(x′)〉 = k(x, x′).
- Conversely, whenever we have a mapping Φ from X into an inner product space, we obtain a positive definite kernel via k(x, x′) := 〈Φ(x), Φ(x′)〉:

      ∑_{i,j} c_i c_j k(x_i, x_j) = 〈 ∑_i c_i Φ(x_i), ∑_j c_j Φ(x_j) 〉 = ‖ ∑_i c_i Φ(x_i) ‖² ≥ 0

- Complete the vector space by the usual technique of Cauchy sequences.
- Reproducing kernel Hilbert space:

      H := closure of span{ k(x, ·) | x ∈ X }

- Φ(x) need not be the only feature map.

Projections in Hilbert Space

Theorem (Projection in Hilbert Space): Let H be a Hilbert space and M a closed subspace. Then every x ∈ H can be written uniquely as x = z + z⊥, where z ∈ M and z⊥ ∈ M⊥, that is, 〈z⊥, t〉 = 0 for all t ∈ M. The vector z is the unique element of M minimizing ‖x − z‖; it is called the projection Px := z of x onto M. The projection operator P is a linear map.

Classification/Decision Trees (I)

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

Tree Structured Classifier

- Reference: Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Chapman & Hall, 1984.
- A medical example (CART):
  - Predict high-risk patients who will not survive at least 30 days on the basis of the initial 24-hour data.
  - 19 variables are measured during the first 24 hours, including blood pressure, age, etc.

[Figure: a tree-structured classification rule for the medical example]

- Denote the feature space by X. The input vector X ∈ X contains p features X_1, X_2, ..., X_p, some of which may be categorical.
- Tree structured classifiers are constructed by repeated splits of subsets of X into two descendant subsets, beginning with X itself.
- Definitions: node, terminal node (leaf node), parent node, child node.
- The union of the regions occupied by two child nodes is the region occupied by their parent node.
- Every leaf node is assigned a class. A query is assigned the class of the leaf node it lands in.

Notation

- A node is denoted by t. Its left child node is denoted by t_L and its right child node by t_R.
- The collection of all nodes is denoted by T; the collection of all leaf nodes by T̃.
- A split is denoted by s. The set of splits is denoted by S.


The Three Elements

- The construction of a tree involves the following three elements:
  1. The selection of the splits.
  2. The decision when to declare a node terminal or to continue splitting it.
  3. The assignment of each terminal node to a class.

- In particular, we need to decide the following:
  1. A set Q of binary questions of the form {Is X ∈ A?}, A ⊆ X.
  2. A goodness-of-split criterion Φ(s, t) that can be evaluated for any split s of any node t.
  3. A stop-splitting rule.
  4. A rule for assigning every terminal node to a class.

Standard Set of Questions

- The input vector X = (X_1, X_2, ..., X_p) contains features of both categorical and ordered types.
- Each split depends on the value of only a single variable.
- For each ordered variable X_j, Q includes all questions of the form

      {Is X_j ≤ c?}

  for all real-valued c.
- Since the training data set is finite, only finitely many distinct splits can be generated by questions of the form {Is X_j ≤ c?}.

- If X_j is categorical, taking values in, say, {1, 2, ..., M}, then Q contains all questions of the form

      {Is X_j ∈ A?},

  where A ranges over all subsets of {1, 2, ..., M}.
- The splits for all p variables constitute the standard set of questions.

Goodness of Split

- The goodness of a split is measured by an impurity function defined for each node.
- Intuitively, we want each leaf node to be "pure", that is, dominated by one class.

The Impurity Function

Definition: an impurity function is a function φ defined on the set of all K-tuples of numbers (p_1, ..., p_K) satisfying p_j ≥ 0, j = 1, ..., K, and ∑_j p_j = 1, with the properties:

1. φ attains its maximum only at the point (1/K, 1/K, ..., 1/K).
2. φ attains its minimum only at the points (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1).
3. φ is a symmetric function of p_1, ..., p_K, i.e., permuting the p_j leaves φ unchanged.

- Definition: given an impurity function φ, define the impurity measure i(t) of a node t as

      i(t) = φ( p(1 | t), p(2 | t), ..., p(K | t) ),

  where p(j | t) is the estimated probability of class j within node t.
- The goodness of a split s for node t, denoted Φ(s, t), is defined by

      Φ(s, t) = Δi(s, t) = i(t) − p_R i(t_R) − p_L i(t_L),

  where p_R and p_L are the proportions of the samples in node t that go to the right node t_R and the left node t_L, respectively.

- Define I(t) = i(t) p(t), that is, the impurity of node t weighted by the estimated proportion of data that go to node t.
- The impurity of a tree T, I(T), is defined by

      I(T) = ∑_{t ∈ T̃} I(t) = ∑_{t ∈ T̃} i(t) p(t),

  where T̃ is the set of leaf nodes of T.
- Note that for any node t the following equations hold:

      p(t_L) + p(t_R) = p(t)
      p_L = p(t_L)/p(t),  p_R = p(t_R)/p(t)
      p_L + p_R = 1

- Define

      ΔI(s, t) = I(t) − I(t_L) − I(t_R)
               = p(t) i(t) − p(t_L) i(t_L) − p(t_R) i(t_R)
               = p(t) ( i(t) − p_L i(t_L) − p_R i(t_R) )
               = p(t) Δi(s, t)

- Possible impurity functions:
  1. Entropy: ∑_{j=1}^K p_j log(1/p_j). If p_j = 0, use the limit lim_{p_j→0} p_j log p_j = 0.
  2. Misclassification rate: 1 − max_j p_j.
  3. Gini index: ∑_{j=1}^K p_j (1 − p_j) = 1 − ∑_{j=1}^K p_j².
- The Gini index seems to work best in practice for many problems.
- The twoing rule: at a node t, choose the split s that maximizes

      (p_L p_R / 4) [ ∑_j | p(j | t_L) − p(j | t_R) | ]²
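A short numpy sketch (illustrative, with my own names) of these impurity functions and of the goodness of split Δi(s, t):

```python
import numpy as np

def entropy(p):
    """Entropy impurity: sum_j p_j log(1/p_j), with 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def misclassification(p):
    """Misclassification impurity: 1 - max_j p_j."""
    return 1.0 - p.max()

def gini(p):
    """Gini index: 1 - sum_j p_j^2."""
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(p_t, p_tL, p_tR, pL, impurity=gini):
    """Delta i(s, t) = i(t) - pL i(tL) - pR i(tR), with pR = 1 - pL."""
    pR = 1.0 - pL
    return impurity(p_t) - pL * impurity(p_tL) - pR * impurity(p_tR)

# example: a 50/50 node split into two perfectly pure children
p_t = np.array([0.5, 0.5]); p_tL = np.array([1.0, 0.0]); p_tR = np.array([0.0, 1.0])
print(goodness_of_split(p_t, p_tL, p_tR, pL=0.5))   # 0.5 for the Gini index
```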

Estimate the Posterior Probabilities of Classes in Each Node

- The total number of samples is N and the number of samples in class j, 1 ≤ j ≤ K, is N_j.
- The number of samples going to node t is N(t); the number of samples of class j going to node t is N_j(t).
- ∑_{j=1}^K N_j(t) = N(t).
- N_j(t_L) + N_j(t_R) = N_j(t).
- For a full (balanced) tree, the sum of N(t) over all nodes t at the same level is N.

- Denote the prior probability of class j by π_j.
  - The priors π_j can be estimated from the data by N_j/N.
  - Sometimes the priors are given beforehand.
- The estimated probability of a sample in class j going to node t is p(t | j) = N_j(t)/N_j.
  - p(t_L | j) + p(t_R | j) = p(t | j).
  - For a full tree, the sum of p(t | j) over all nodes t at the same level is 1.

- The joint probability of a sample being in class j and going to node t is thus

      p(j, t) = π_j p(t | j) = π_j N_j(t)/N_j.

- The probability of any sample going to node t is

      p(t) = ∑_{j=1}^K p(j, t) = ∑_{j=1}^K π_j N_j(t)/N_j.

  Note p(t_L) + p(t_R) = p(t).
- The probability of a sample being in class j given that it goes to node t is

      p(j | t) = p(j, t)/p(t).

  For any t, ∑_{j=1}^K p(j | t) = 1.

- When π_j = N_j/N, we have the following simplifications:
  - p(j | t) = N_j(t)/N(t).
  - p(t) = N(t)/N.
  - p(j, t) = N_j(t)/N.

Stopping Criteria

- A simple criterion: stop splitting a node t when

      max_{s∈S} ΔI(s, t) < β,

  where β is a chosen threshold.
- This stopping criterion is unsatisfactory:
  - A node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits.

Class Assignment Rule

- A class assignment rule assigns a class j ∈ {1, ..., K} to every terminal node t ∈ T̃. The class assigned to node t ∈ T̃ is denoted by κ(t).
- For 0-1 loss, the class assignment rule is

      κ(t) = arg max_j p(j | t).

- The resubstitution estimate r(t) of the probability of misclassification, given that a case falls into node t, is

      r(t) = 1 − max_j p(j | t) = 1 − p(κ(t) | t).

- Denote R(t) = r(t) p(t).
- The resubstitution estimate for the overall misclassification rate R(T) of the tree classifier T is

      R(T) = ∑_{t ∈ T̃} R(t).

- Proposition: for any split of a node t into t_L and t_R,

      R(t) ≥ R(t_L) + R(t_R).

  Proof: denote j* = κ(t). Then

      p(j* | t) = p(j*, t_L | t) + p(j*, t_R | t)
                = p(j* | t_L) p(t_L | t) + p(j* | t_R) p(t_R | t)
                = p_L p(j* | t_L) + p_R p(j* | t_R)
                ≤ p_L max_j p(j | t_L) + p_R max_j p(j | t_R)

  Hence,

      r(t) = 1 − p(j* | t)
           ≥ 1 − [ p_L max_j p(j | t_L) + p_R max_j p(j | t_R) ]
           = p_L (1 − max_j p(j | t_L)) + p_R (1 − max_j p(j | t_R))
           = p_L r(t_L) + p_R r(t_R)

  Finally,

      R(t) = p(t) r(t)
           ≥ p(t) p_L r(t_L) + p(t) p_R r(t_R)
           = p(t_L) r(t_L) + p(t_R) r(t_R)
           = R(t_L) + R(t_R)

Digit Recognition Example (CART)

- The 10 digits are shown by different on-off combinations of seven horizontal and vertical lights.
- Each digit is represented by a 7-dimensional vector of zeros and ones. The ith sample is x_i = (x_{i1}, x_{i2}, ..., x_{i7}). If x_{ij} = 1, the jth light is on; if x_{ij} = 0, the jth light is off.

      Digit  x·1  x·2  x·3  x·4  x·5  x·6  x·7
        1     0    0    1    0    0    1    0
        2     1    0    1    1    1    0    1
        3     1    0    1    1    0    1    1
        4     0    1    1    1    0    1    0
        5     1    1    0    1    0    1    1
        6     1    1    0    1    1    1    1
        7     1    0    1    0    0    1    0
        8     1    1    1    1    1    1    1
        9     1    1    1    1    0    1    1
        0     1    1    1    0    1    1    1

- The data for the example are generated by a malfunctioning calculator.
- Each of the seven lights has probability 0.1 of being in the wrong state, independently.
- The training set contains 200 samples generated according to the specified distribution.

- A tree structured classifier is applied.
- The set of questions Q contains: Is x·j = 0?, j = 1, 2, ..., 7.
- The twoing rule is used for splitting.
- Pruning with cross-validation is used to choose the right sized tree.

- Classification performance:
  - The error rate estimated using a test set of size 5000 is 0.30.
  - The error rate estimated by cross-validation using the training set is 0.30.
  - The resubstitution estimate of the error rate is 0.29.
  - The Bayes error rate is 0.26.
  - There is little room for improvement over the tree classifier.

- Accidentally, every digit occupies exactly one leaf node.
  - In general, one class may occupy any number of leaf nodes, and occasionally no leaf node.
- X·6 and X·7 are never used.

Waveform Example (CART)

- Three functions h_1(τ), h_2(τ), h_3(τ) are shifted versions of each other, as shown in the figure.
- Each h_j is specified by the equal-lateral right triangle function; its values at the integers τ = 1, ..., 21 are measured.

- The three classes of waveforms are random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveform measured at τ = 1, 2, ..., 21.
- To generate a sample in class 1, a random number u uniformly distributed on [0, 1] and 21 random numbers ε_1, ε_2, ..., ε_21, normally distributed with mean zero and variance 1, are generated. Then

      x·j = u h_1(j) + (1 − u) h_2(j) + ε_j,  j = 1, ..., 21.

- To generate a sample in class 2, repeat the above to obtain a random number u and 21 random numbers ε_1, ..., ε_21, and set

      x·j = u h_1(j) + (1 − u) h_3(j) + ε_j,  j = 1, ..., 21.

- Class 3 vectors are generated by

      x·j = u h_2(j) + (1 − u) h_3(j) + ε_j,  j = 1, ..., 21.
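A sketch of the waveform generator (my own code). The notes do not give the triangle functions explicitly, so the peak locations τ = 11, 15, 7 below follow the parameterization commonly used for this example and should be treated as an assumption:

```python
import numpy as np

# Three shifted triangular waveforms on tau = 1, ..., 21 (assumed peak locations).
TAU = np.arange(1, 22)
h1 = np.maximum(6 - np.abs(TAU - 11), 0)
h2 = np.maximum(6 - np.abs(TAU - 15), 0)
h3 = np.maximum(6 - np.abs(TAU - 7), 0)
PAIRS = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}

def generate_waveform(n=300, seed=0):
    """Random convex combination of two waveforms plus N(0,1) noise,
    with the three classes equally likely."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(1, 4, size=n)
    X = np.empty((n, 21))
    for i, c in enumerate(labels):
        ha, hb = PAIRS[c]
        u = rng.uniform()
        X[i] = u * ha + (1 - u) * hb + rng.normal(size=21)
    return X, labels

X_train, y_train = generate_waveform(300)
```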

[Figure: example random waveforms from Class 1, Class 2, and Class 3]

- 300 random samples are generated using prior probabilities (1/3, 1/3, 1/3) for training.
- Construction of the tree:
  - The set of questions: {Is x·j ≤ c?} for c ranging over all real numbers and j = 1, ..., 21.
  - The Gini index is used for measuring the goodness of split.
  - The final tree is selected by pruning and cross-validation.
- Results:
  - The cross-validation estimate of the misclassification rate is 0.29.
  - The misclassification rate on a separate test set of size 5000 is 0.28.
  - The Bayes classification rule can be derived; applying it to the test set yields a misclassification rate of 0.14.


Advantages of the Tree-Structured Approach

- Handles both categorical and ordered variables in a simple and natural way.
- Automatic stepwise variable selection and complexity reduction.
- Provides an estimate of the misclassification rate for a query sample.
- Invariant under all monotone transformations of individual ordered variables.
- Robust to outliers and misclassified points in the training set.
- Easy to interpret.

Variable Combinations

- Splits perpendicular to the coordinate axes are inefficient in certain cases.
- Use linear combinations of variables:

      Is ∑_j a_j x·j ≤ c?

- The amount of computation increases significantly.
- Price to pay: model complexity increases.

Missing Values

- Certain variables may be missing in some training samples.
  - This often occurs in gene-expression microarray data.
  - Suppose each variable has a 5% chance of being missing, independently. Then for a training sample with 50 variables, the probability of missing at least one variable is 1 − 0.95^50, as high as 92.3%.
- A query sample to be classified may also have missing variables.
- Remedy: find surrogate splits.
  - Suppose the best split for node t is s, which involves a question on X_m. Find another split s′ on a variable X_j, j ≠ m, which is most similar to s in a certain sense. Similarly, the second best surrogate split, the third, and so on, can be found.

Classification/Decision Trees (II)

Jia Li

Department of Statistics
The Pennsylvania State University

Email: [email protected]
http://www.stat.psu.edu/~jiali

Right Sized Trees

- Let the expected misclassification rate of a tree T be R*(T).
- Recall that the resubstitution estimate for R*(T) is

      R(T) = ∑_{t ∈ T̃} r(t) p(t) = ∑_{t ∈ T̃} R(t).

- R(T) is biased downward:

      R(t) ≥ R(t_L) + R(t_R).

Digit Recognition Example

      No. Terminal Nodes   R(T)   R^ts(T)
            71             .00     .42
            63             .00     .40
            58             .03     .39
            40             .10     .32
            34             .12     .32
            19             .29     .31
            10             .29     .30
             9             .32     .34
             7             .41     .47
             6             .46     .54
             5             .53     .61
             2             .75     .82
             1             .86     .91

- The estimate R(T) becomes increasingly less accurate as the trees grow larger.
- The test-set estimate R^ts first decreases as the tree becomes larger, hits its minimum at the tree with 10 terminal nodes, and begins to increase as the tree grows further.

Preliminaries for Pruning

- Grow a very large tree T_max:
  1. Split until all terminal nodes are pure (contain only one class) or contain only identical measurement vectors.
  2. Or split until the number of data points in each terminal node is no greater than a certain threshold, say 5, or even 1.
  3. As long as the tree is sufficiently large, the size of the initial tree is not critical.

1. Descendant: a node t′ is a descendant of node t if there is a connected path down the tree leading from t to t′.
2. Ancestor: t is an ancestor of t′ if t′ is its descendant.
3. A branch T_t of T with root node t ∈ T consists of the node t and all descendants of t in T.
4. Pruning a branch T_t from a tree T consists of deleting from T all descendants of t, that is, cutting off all of T_t except its root node. The tree pruned this way is denoted by T − T_t.
5. If T′ is obtained from T by successively pruning off branches, then T′ is called a pruned subtree of T and denoted by T′ ≺ T.

Subtrees

- Even for a moderately sized T_max, there is an enormously large number of subtrees and an even larger number of ways to prune the initial tree down to them.
- A "selective" pruning procedure is needed:
  - The pruning should be optimal in a certain sense.
  - The search over different ways of pruning should have a manageable computational load.

Minimal Cost-Complexity Pruning

- Definition of the cost-complexity measure:
  - For any subtree T ⪯ T_max, define its complexity as |T̃|, the number of terminal nodes of T. Let α ≥ 0 be a real number called the complexity parameter, and define the cost-complexity measure R_α(T) as

        R_α(T) = R(T) + α|T̃|.

- For each value of α, find the subtree T(α) that minimizes R_α(T), i.e.,

      R_α(T(α)) = min_{T ⪯ T_max} R_α(T).

- If α is small, the penalty for having a large number of terminal nodes is small and T(α) tends to be large.
- For α sufficiently large, the minimizing subtree T(α) consists of the root node only.

Page 85: Principal Component Analysis - astrostatistics.psu.edu · Principal Component Analysis Solution for PCA I The ordered eigenvectors of the covariance matrix are the principal component

Classification/Decision Trees (II)

I Since there are at most a finite number of subtrees of Tmax ,Rα(T (α)) yields different values for only finitely many α’s.T (α) continues to be the minimizing tree when α increasesuntil a jump point is reached.

I Two questions:I Is there a unique subtree T � Tmax which minimizes Rα(T )?I In the minimizing sequence of trees T1, T2, ..., is each subtree

obtained by pruning upward from the previous subtree, i.e.,does the nestingT1 � T2 � · · · � {t1} hold?

- Definition: the smallest minimizing subtree T(α) for complexity parameter α is defined by the conditions:
  1. R_α(T(α)) = min_{T ⪯ T_max} R_α(T);
  2. if R_α(T) = R_α(T(α)), then T(α) ⪯ T.
- If the subtree T(α) exists, it must be unique.
- It can be proved that for every value of α, there exists a smallest minimizing subtree.

Page 87: Principal Component Analysis - astrostatistics.psu.edu · Principal Component Analysis Solution for PCA I The ordered eigenvectors of the covariance matrix are the principal component

Classification/Decision Trees (II)

I The starting point for the pruning is not Tmax , but ratherT1 = T (0), which is the smallest subtree of Tmax satisfying

R(T1) = R(Tmax) .

I Let tL and tR be any two terminal nodes in Tmax descendedfrom the same parent node t. If R(t) = R(tL) + R(tR), pruneoff tL and tR .

I Continue the process until no more pruning is possible. Theresulting tree is T1.

- For any branch T_t of T_1, define R(T_t) by

      R(T_t) = ∑_{t′ ∈ T̃_t} R(t′),

  where T̃_t is the set of terminal nodes of T_t.
- For any nonterminal node t of T_1, R(t) > R(T_t).

Weakest-Link Cutting

- For any node t ∈ T_1, set R_α({t}) = R(t) + α.
- For any branch T_t, define R_α(T_t) = R(T_t) + α|T̃_t|.
- When α = 0, R_0(T_t) < R_0({t}). The inequality continues to hold for sufficiently small α. But at some critical value of α the two cost-complexities become equal, and for α exceeding this threshold the inequality is reversed.
- Solving the inequality R_α(T_t) < R_α({t}) gives

      α < ( R(t) − R(T_t) ) / ( |T̃_t| − 1 ).

  The right hand side is always positive.

- Define a function g_1(t), t ∈ T_1, by

      g_1(t) = ( R(t) − R(T_t) ) / ( |T̃_t| − 1 )   if t ∉ T̃_1,
      g_1(t) = +∞                                   if t ∈ T̃_1.

- Define the weakest link t̄_1 in T_1 as the node such that

      g_1(t̄_1) = min_{t ∈ T_1} g_1(t),

  and put α_2 = g_1(t̄_1).
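A minimal sketch (not CART's implementation) of one weakest-link step on a toy tree representation, where each node stores its resubstitution cost R(t) = r(t)p(t) and its children:

```python
# Tree stored as nested dicts: each node has cost 'R' and optional 'left'/'right'.

def branch_cost_and_leaves(node):
    """Return (R(T_t), |T~_t|): total leaf cost and number of leaves of the branch."""
    if "left" not in node:                 # terminal node
        return node["R"], 1
    RL, nL = branch_cost_and_leaves(node["left"])
    RR, nR = branch_cost_and_leaves(node["right"])
    return RL + RR, nL + nR

def weakest_link(node, best=None):
    """Find the internal node minimizing g(t) = (R(t) - R(T_t)) / (|T~_t| - 1)."""
    if "left" not in node:
        return best
    R_branch, n_leaves = branch_cost_and_leaves(node)
    g = (node["R"] - R_branch) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

def prune_weakest_link(root):
    """One pruning step: cut the branch at the weakest link; returns the next alpha."""
    g, t = weakest_link(root)
    del t["left"], t["right"]              # t becomes a terminal node
    return g

# toy example: root with two pure-ish leaves
root = {"R": 0.20, "left": {"R": 0.05}, "right": {"R": 0.10}}
alpha_2 = prune_weakest_link(root)         # (0.20 - 0.15) / (2 - 1) = 0.05
```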

- As α increases, t̄_1 is the first node at which the single node {t̄_1} becomes preferable to the branch T_{t̄_1} descended from it.
- α_2 is the first value after α_1 = 0 that yields a strict subtree of T_1 with a smaller cost-complexity at that complexity parameter. That is, for all α_1 ≤ α < α_2, the tree with the smallest cost-complexity is T_1.
- Let T_2 = T_1 − T_{t̄_1}.

- Repeat the previous steps with T_2 in place of T_1: find the weakest link in T_2 and prune off the branch at the weakest link node.

      g_2(t) = ( R(t) − R(T_{2t}) ) / ( |T̃_{2t}| − 1 )   if t ∈ T_2, t ∉ T̃_2,
      g_2(t) = +∞                                          if t ∈ T̃_2

      g_2(t̄_2) = min_{t ∈ T_2} g_2(t)
      α_3 = g_2(t̄_2)
      T_3 = T_2 − T_{t̄_2}

- If at any stage there are multiple weakest links, for instance if g_k(t̄_k) = g_k(t̄′_k), then define T_{k+1} = T_k − T_{t̄_k} − T_{t̄′_k}.
- Two branches are either nested or share no node.

- A decreasing sequence of nested subtrees is obtained:

      T_1 ⪰ T_2 ⪰ T_3 ⪰ ··· ⪰ {t_1}.

- Theorem: the {α_k} form an increasing sequence, that is, α_k < α_{k+1}, k ≥ 1, where α_1 = 0. For k ≥ 1 and α_k ≤ α < α_{k+1}, T(α) = T(α_k) = T_k.

I At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes. As the tree becomes smaller, it tends to cut off fewer.

I Digit recognition example (number of terminal nodes $|\tilde{T}_k|$ of each subtree in the pruning sequence):

    Tree              T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13
    $|\tilde{T}_k|$   71  63  58  40  34  19  10   9   7    6    5    2    1

Best Pruned Subtree

I Two approaches to choosing the best pruned subtree:
    I Use a test sample set.
    I Use cross-validation.

I Test sample set: use a held-out test set to compute the classification error rate of each minimum cost-complexity subtree, and choose the subtree with the minimum test error rate.

I Cross-validation is trickier because tree structures are not stable: when the training data set changes slightly, there may be large structural changes in the tree.

I It is therefore difficult to match a subtree trained on the entire data set with a subtree trained on only a majority portion of it.

I Instead, focus on choosing the right complexity parameter α.

Pruning by Cross-Validation

I Consider V-fold cross-validation. The original learning sample $L$ is divided by random selection into V subsets $L_v$, $v = 1, \dots, V$. Let the training sample set in each fold be $L^{(v)} = L - L_v$.

I The tree grown on the original set is $T_{\max}$. V accessory trees $T^{(v)}_{\max}$ are grown on the sets $L^{(v)}$.

I For each value of the complexity parameter α, let $T(\alpha)$ and $T^{(v)}(\alpha)$, $v = 1, \dots, V$, be the corresponding minimal cost-complexity subtrees of $T_{\max}$ and $T^{(v)}_{\max}$.

I For each maximum tree, we obtain a sequence of jump points of α: $\alpha_1 < \alpha_2 < \alpha_3 < \cdots < \alpha_k < \cdots$.

I To find the minimal cost-complexity subtree at a given α, find the $\alpha_k$ in the list such that $\alpha_k \leq \alpha < \alpha_{k+1}$; the subtree corresponding to $\alpha_k$ is the subtree for α.
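This lookup is a simple interval search; a small Python sketch (the variable names are illustrative, not from the notes):

    import bisect

    def subtree_for_alpha(alpha, alphas, subtrees):
        """alphas: sorted jump points alpha_1 < alpha_2 < ...;
        subtrees[k] is the minimal cost-complexity subtree for alphas[k].
        Returns T(alpha), i.e., the subtree for the largest alpha_k <= alpha."""
        k = bisect.bisect_right(alphas, alpha) - 1
        return subtrees[k]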

I The cross-validation error rate of $T(\alpha)$ is computed by

    R^{CV}(T(\alpha)) = \frac{1}{V} \sum_{v=1}^{V} \frac{N^{(v)}_{miss}}{N^{(v)}},

where $N^{(v)}$ is the number of samples in the test set $L_v$ of fold $v$, and $N^{(v)}_{miss}$ is the number of misclassified samples in $L_v$ under $T^{(v)}(\alpha)$, a pruned subtree of $T^{(v)}_{\max}$ trained on $L^{(v)}$.

I Although α is continuous, only finitely many minimum cost-complexity trees are grown on $L$.

I Let $T_k = T(\alpha_k)$. To compute the cross-validation error rate of $T_k$, let $\alpha_k' = \sqrt{\alpha_k \alpha_{k+1}}$.

I Let $R^{CV}(T_k) = R^{CV}(T(\alpha_k'))$.

I For the root node tree $\{t_1\}$, $R^{CV}(\{t_1\})$ is set to the resubstitution cost $R(\{t_1\})$.

I Choose the subtree $T_k$ with minimum cross-validation error rate $R^{CV}(T_k)$.

Computation Involved

1. Grow V + 1 maximum trees.

2. For each of the V + 1 trees, find the sequence of subtrees with minimum cost-complexity.

3. Suppose the maximum tree grown on the original data set, $T_{\max}$, has K subtrees in its pruning sequence.

4. For each of the (K − 1) values $\alpha_k'$, compute the misclassification rate on each of the V test sample sets, average the V error rates, and take the mean as the cross-validation error rate.

5. Find the subtree of $T_{\max}$ with minimum $R^{CV}(T_k)$.
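In practice this is close to what scikit-learn provides through its cost-complexity pruning utilities. Below is a sketch (not the notes' implementation) that selects α by V-fold cross-validation over the geometric means $\alpha_k' = \sqrt{\alpha_k \alpha_{k+1}}$; it cross-validates over α directly rather than building explicit accessory trees, which differs slightly from the procedure above.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def best_ccp_alpha(X, y, V=10):
        """Choose the complexity parameter alpha by V-fold cross-validation."""
        # Jump points alpha_1 < alpha_2 < ... computed on the full learning sample.
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
        alphas = path.ccp_alphas
        # Evaluate each subtree at the geometric mean of consecutive jump points.
        alpha_primes = np.sqrt(alphas[:-1] * alphas[1:])
        cv_accuracy = [
            cross_val_score(
                DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=V
            ).mean()
            for a in alpha_primes
        ]
        # Maximum CV accuracy corresponds to minimum CV error rate R^CV(T_k).
        return alpha_primes[int(np.argmax(cv_accuracy))]

    # Final pruned tree, trained on the whole learning sample:
    # tree = DecisionTreeClassifier(ccp_alpha=best_ccp_alpha(X, y)).fit(X, y)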

Bagging and Boosting: Brief Introduction

Overview

I Bagging and boosting are meta-algorithms that pool decisions from multiple classifiers.

I Much information can be found on Wikipedia.

Overview on Bagging

I Invented by Leo Breiman: Bootstrap aggregating.

I L. Breiman, “Bagging predictors,” Machine Learning, 24(2):123-140, 1996.

I Majority vote from classifiers trained on bootstrap samples of the training data.

Overview on Boosting

I Iteratively learn weak classifiers.

I The final result is a weighted sum of the outputs of the weak classifiers.

I There are many different kinds of boosting algorithms; AdaBoost (adaptive boosting), by Y. Freund and R. Schapire, was the first.

Bagging

I Generate B bootstrap samples of the training data: random samples drawn with replacement.

I Train a classifier or a regression function on each bootstrap sample.

I For classification: take a majority vote over the classification results.

I For regression: average the predicted values.

I Bagging reduces variance.

I It improves performance for unstable classifiers, i.e., those that vary significantly with small changes in the data set, e.g., CART.

I Found to improve CART a lot, but not nearest-neighbor classifiers.
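A minimal sketch of bagging for classification (illustrative only; it assumes NumPy and scikit-learn-style estimators with fit/predict, uses a decision tree as the unstable base classifier, and assumes non-negative integer class labels for the vote):

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, base_estimator=None, B=50, seed=0):
        """Train B classifiers, each on a bootstrap sample drawn with replacement."""
        rng = np.random.default_rng(seed)
        base_estimator = base_estimator or DecisionTreeClassifier()
        n = len(y)
        models = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
            models.append(clone(base_estimator).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Majority vote over the B classifiers (for regression, average instead)."""
        votes = np.stack([m.predict(X) for m in models])   # shape (B, n_samples)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), axis=0, arr=votes
        )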

Adaboost for Binary Classification

1. Training data: $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y} = \{-1, 1\}$.

2. Let $w_{1,i} = \frac{1}{n}$, $i = 1, \dots, n$.

3. For $t = 1, \dots, T$:

   3.1 Learn a classifier $f_t : \mathcal{X} \to \mathcal{Y}$ that minimizes the error rate with respect to the distribution $w_{t,i}$ over the $x_i$'s.

   3.2 Let $e_t = \sum_{i=1}^{n} w_{t,i}\, I(y_i \neq f_t(x_i))$.

   3.3 If $e_t > 0.5$, stop.

   3.4 Choose $\alpha_t \in \mathbb{R}$. Usually set $\alpha_t = \frac{1}{2} \log \frac{1 - e_t}{e_t}$.

   3.5 Update $w_{t+1,i} = \frac{w_{t,i}\, e^{-\alpha_t y_i f_t(x_i)}}{Z_t}$, where $Z_t$ is a normalization factor ensuring $\sum_i w_{t+1,i} = 1$.

4. Output the final classifier:

    f(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

Note: the update of $w_{t,i}$ implies that incorrectly classified points receive increased weights in the next round of learning.
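A compact Python sketch of the algorithm above (an illustration under the stated setup, with depth-1 trees from scikit-learn as the weak learners; the function names are ours, and X, y are assumed to be NumPy arrays with y taking values in {-1, +1}):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """AdaBoost for binary labels y in {-1, +1}, following steps 1-4 above."""
        n = len(y)
        w = np.full(n, 1.0 / n)                    # step 2: uniform initial weights
        learners, alphas = [], []
        for _ in range(T):
            # step 3.1: weak learner trained on the weighted sample (a decision stump)
            f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = f.predict(X)
            e = np.sum(w * (pred != y))            # step 3.2: weighted error rate
            if e > 0.5:                            # step 3.3
                break
            e = np.clip(e, 1e-12, None)            # guard against log(0)
            alpha = 0.5 * np.log((1 - e) / e)      # step 3.4
            w = w * np.exp(-alpha * y * pred)      # step 3.5: reweight ...
            w /= w.sum()                           # ... and renormalize (Z_t)
            learners.append(f)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        """Step 4: sign of the weighted sum of the weak classifiers' outputs."""
        score = sum(a * f.predict(X) for f, a in zip(learners, alphas))
        return np.sign(score)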