Hindawi Publishing Corporation
Abstract and Applied Analysis
Volume 2013, Article ID 540725, 11 pages
http://dx.doi.org/10.1155/2013/540725

Research Article: Kernel Sliced Inverse Regression: Regularization and Consistency

Qiang Wu,¹ Feng Liang,² and Sayan Mukherjee³

¹ Department of Mathematical Sciences, Middle Tennessee State University, Murfreesboro, TN 37130, USA
² Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
³ Departments of Statistical Science, Mathematics, and Computer Science, Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
Correspondence should be addressed to Qiang Wu; qwu@mtsu.edu
Received 3 May 2013; Accepted 14 June 2013

Academic Editor: Yiming Ying

Copyright © 2013 Qiang Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Kernel sliced inverse regression (KSIR) is a natural framework for nonlinear dimension reduction using the mapping induced by kernels. However, there are numeric, algorithmic, and conceptual subtleties in making the method robust and consistent. We apply two types of regularization in this framework to address computational stability and generalization performance. We also provide an interpretation of the algorithm and prove consistency. The utility of this approach is illustrated on simulated and real data.
1. Introduction
The goal of dimension reduction in the standard regression/classification setting is to summarize the information in the $p$-dimensional predictor variable $X$ relevant to predicting the univariate response variable $Y$. The summary $S(X)$ should have $d \ll p$ variates and ideally should satisfy the following conditional independence property:

$$Y \perp X \mid S(X). \quad (1)$$

Thus any inference of $Y$ involves only the summary statistic $S(X)$, which is of much lower dimension than the original data $X$.
Linear methods for dimension reduction focus on linear summaries of the data, $S(X) = (\beta_1^T X, \ldots, \beta_d^T X)$. The $d$-dimensional subspace $\mathcal{S} = \mathrm{span}(\beta_1, \ldots, \beta_d)$ is defined as the effective dimension reduction (e.d.r.) space in [1], since $\mathcal{S}$ summarizes all the predictive information on $Y$. A key result in [1] is that, under some mild conditions, the e.d.r. directions $\{\beta_j\}_{j=1}^{d}$ correspond to the eigenvectors of the matrix

$$T = [\mathrm{cov}(X)]^{-1} \mathrm{cov}[\mathbb{E}(X \mid Y)]. \quad (2)$$

Thus the e.d.r. directions or subspace can be estimated via an eigenanalysis of the matrix $T$, which is the foundation of the sliced inverse regression (SIR) algorithm proposed in [1, 2]. Further developments include sliced average variance estimation (SAVE) [3] and principal Hessian directions (PHD) [4]. The aforementioned algorithms cannot be applied in high-dimensional settings where the number of covariates $p$ is greater than the number of observations $n$, since the sample covariance matrix is singular. Recently an extension of SIR has been proposed in [5] which can handle the case $p > n$, based on the idea of partial least squares.
A common premise in high-dimensional data analysis is that the intrinsic structure of the data is in fact low dimensional; for example, the data is concentrated on a manifold. Linear methods such as SIR often fail to capture this nonlinear low-dimensional structure. However, there may exist a nonlinear embedding of the data into a Hilbert space where a linear method can capture the low-dimensional structure. The basic idea in applying kernel methods is the application of a linear algorithm to the data mapped into a feature space induced by a kernel function. If projections onto this low-dimensional structure can be computed by inner products in this Hilbert space, the so-called kernel trick [6, 7] can be used to obtain simple and efficient algorithms. Since the embedding is nonlinear, linear directions in the feature space correspond to nonlinear directions in the original data space. Nonlinear extensions of some classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel independent correlation analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear e.d.r. directions.
There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR the $p$-dimensional data are projected into a Hilbert space $\mathcal{H}$ through a feature map $\phi: \mathcal{X} \to \mathcal{H}$, and the nonlinear features are supposed to be recovered by the eigenfunctions of the following operator:

$$T = [\mathrm{cov}(\phi(X))]^{-1} \mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)]. \quad (3)$$

However, this operator $T$ is actually not well defined in general, especially when $\mathcal{H}$ is infinite dimensional and the covariance operator $\mathrm{cov}(\phi(X))$ is not invertible. In addition, the key utility of representer theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space $\mathcal{H}$ reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11], the representer theorem has been implicitly used and seems to work well in empirical studies. However, it is not theoretically justifiable, since $T$ is estimated empirically via observed data. Moreover, the computation of eigenfunctions of an empirical estimate of $T$ from observations is often ill-conditioned and results in computational instability. In [11] a low rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes that overcome the computational difficulties arising in KSIR. Consistency is proven, and the practical advantages of regularization are demonstrated via empirical experiments.
2. Mercer Kernels and Nonlinear e.d.r. Directions
The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and in particular Mercer kernels [12].

Given predictor variables $X \in \mathcal{X} \subseteq \mathbb{R}^p$, a Mercer kernel is a continuous positive semidefinite function $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following spectral decomposition:

$$k(x, z) = \sum_{j} \lambda_j \phi_j(x) \phi_j(z), \quad (4)$$

where the $\phi_j$ are the eigenfunctions and the $\lambda_j$ are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel $k$ uniquely corresponds to an RKHS as follows:
$$\mathcal{H} = \left\{ f \;\middle|\; f(x) = \sum_{j \in \Lambda} a_j \phi_j(x) \text{ with } \sum_{j \in \Lambda} \frac{a_j^2}{\lambda_j} < \infty \right\}, \quad (5)$$

where the cardinality of $\Lambda = \{ j : \lambda_j > 0 \}$ is the dimension of the RKHS, which may be infinite [12, 13].

Given a Mercer kernel, there exists a unique map or embedding $\phi$ from $\mathcal{X}$ to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel. The map takes the following form:

$$\phi(x) = \left( \sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots, \sqrt{\lambda_{|\Lambda|}}\,\phi_{|\Lambda|}(x) \right). \quad (6)$$

The Hilbert space induced by this map with the standard inner product $k(x, z) = \langle \phi(x), \phi(z) \rangle$ is isomorphic to the RKHS (5), and we will denote both Hilbert spaces as $\mathcal{H}$ [13]. In the case where $k$ is infinite dimensional, $\phi: \mathcal{X} \to \ell^2$.
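As a concrete illustration of the inner-product identity $k(x, z) = \langle \phi(x), \phi(z) \rangle$ (a sketch of our own, not from the paper; it assumes NumPy), the following writes out an explicit finite-dimensional feature map for the second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$ in two dimensions and checks the identity numerically:

```python
import numpy as np

def poly2_feature_map(x):
    """Explicit feature map phi for k(x, z) = (1 + x.z)^2 when x is 2-dimensional."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1, np.sqrt(2.0) * x2,  # linear terms
                     x1 ** 2, x2 ** 2,                      # pure quadratic terms
                     np.sqrt(2.0) * x1 * x2])               # cross term

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

k_direct = (1.0 + x @ z) ** 2                            # kernel evaluated directly
k_via_map = poly2_feature_map(x) @ poly2_feature_map(z)  # inner product in feature space
# the two values agree up to floating-point error
```

For richer kernels, such as the Gaussian kernel, the feature map is infinite dimensional and cannot be written out this way, which is exactly why the kernel trick matters.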
The random variable $X \in \mathcal{X}$ induces a random element $\phi(X)$ in the RKHS. Throughout this paper we will use Hilbert space valued random variables, so we now recall some basic facts. Let $Z$ be a random element in $\mathcal{H}$ with $\mathbb{E}\|Z\| < \infty$, where $\|\cdot\|$ denotes the norm in $\mathcal{H}$ induced by its inner product $\langle \cdot, \cdot \rangle$. The expectation $\mathbb{E}(Z)$ is defined to be an element in $\mathcal{H}$ satisfying $\langle a, \mathbb{E}(Z) \rangle = \mathbb{E}\langle a, Z \rangle$ for all $a \in \mathcal{H}$. If $\mathbb{E}\|Z\|^2 < \infty$, then the covariance operator of $Z$ is defined as $\mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)]$, where

$$(a \otimes b) f = \langle b, f \rangle a \quad \text{for any } f \in \mathcal{H}. \quad (7)$$
Let $P$ denote the measure for the random variable $X$. Throughout we assume the following conditions.

Assumption 1.
(1) For all $x \in \mathcal{X}$, $k(x, \cdot)$ is $P$-measurable.
(2) There exists $M > 0$ such that $k(X, X) \le M$ (a.s.) with respect to $P$.
Under Assumption 1, the random element $\phi(X)$ has a well-defined mean and a covariance operator, because $\|\phi(x)\|^2 = k(x, x)$ is bounded (a.s.). Without loss of generality we assume $\mathbb{E}\phi(X) = 0$, where $0$ is the zero element in $\mathcal{H}$. The boundedness also implies that the covariance operator $\Sigma = \mathbb{E}[\phi(X) \otimes \phi(X)]$ is compact and has the following spectral decomposition:

$$\Sigma = \sum_{i=1}^{\infty} w_i\, e_i \otimes e_i, \quad (8)$$

where $w_i$ and $e_i \in \mathcal{H}$ are the eigenvalues and eigenfunctions, respectively.

We assume the following model for the relationship between $Y$ and $X$:

$$Y = F(\langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle, \varepsilon), \quad (9)$$
with $\beta_j \in \mathcal{H}$, and the distribution of $\varepsilon$ is independent of $X$. This model implies that the response variable $Y$ depends on $X$ only through a $d$-dimensional summary statistic:

$$S(X) = (\langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle). \quad (10)$$

Although $S(X)$ is a linear summary statistic in $\mathcal{H}$, it extracts nonlinear features in the space of the original predictor variables $X$. We call $\{\beta_j\}_{j=1}^{d}$ the nonlinear e.d.r. directions and $\mathcal{S} = \mathrm{span}(\beta_1, \ldots, \beta_d)$ the nonlinear e.d.r. space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.
Proposition 2. Assume the following linear design condition for $\mathcal{H}$: for any $f \in \mathcal{H}$ there exists a vector $b \in \mathbb{R}^d$ such that

$$\mathbb{E}[\langle f, \phi(X) \rangle \mid S(X)] = b^T S(X), \quad \text{with } S(X) = (\langle \beta_1, \phi(X) \rangle, \ldots, \langle \beta_d, \phi(X) \rangle)^T. \quad (11)$$

Then, for the model specified in (9), the inverse regression curve $\mathbb{E}[\phi(X) \mid Y]$ is contained in the span of $(\Sigma\beta_1, \ldots, \Sigma\beta_d)$, where $\Sigma$ is the covariance operator of $\phi(X)$.
Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that even if the linear design condition is violated, the bias of the SIR variates is usually not large.
An immediate consequence of Proposition 2 is that the nonlinear e.d.r. directions are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigendecomposition problem:

$$\Gamma\beta = \lambda\Sigma\beta, \quad \text{where } \Sigma = \mathrm{cov}[\phi(X)],\; \Gamma = \mathrm{cov}[\mathbb{E}(\phi(X) \mid Y)], \quad (12)$$

or, equivalently, from an eigenanalysis of the operator $T = \Sigma^{-1}\Gamma$. In the infinite dimensional case a technical difficulty arises, since the operator

$$\Sigma^{-1} = \sum_{i=1}^{\infty} w_i^{-1}\, e_i \otimes e_i \quad (13)$$

is not defined on the entire Hilbert space $\mathcal{H}$. So, for the operator $T$ to be well defined, we need to show that the range of $\Gamma$ is indeed in the range of $\Sigma$. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In those analyses extra conditions are needed for operators like $T$ to be well defined. In KSIR this issue is resolved automatically by the linear design condition, and extra conditions are not required, as stated by the following theorem; see Appendix A for the proof.
Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold:

(i) The operator $\Gamma$ is of finite rank $d_\Gamma \le d$. Consequently it is compact and has the following spectral decomposition:

$$\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i, \quad (14)$$

where $\tau_i$ and $u_i$ are the eigenvalues and eigenvectors, respectively. Moreover, $u_i \in \mathrm{range}(\Sigma)$ for all $i = 1, \ldots, d_\Gamma$.

(ii) The eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator $T$, which takes the following form:

$$T = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes \Sigma^{-1}(u_i). \quad (15)$$
3. Regularized Kernel Sliced Inverse Regression
The discussion in Section 2 implies that nonlinear e.d.r. directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.
3.1. Estimating the Nonlinear e.d.r. Directions. Given $n$ observations $(x_1, y_1), \ldots, (x_n, y_n)$, our objective is to obtain an estimate of the e.d.r. directions $(\beta_1, \ldots, \beta_d)$. We first formulate a procedure almost identical to the standard SIR procedure except that it operates in the feature space $\mathcal{H}$. This highlights the immediate relation between the SIR and KSIR procedures.

(1) Without loss of generality, we assume that the mapped predictor variables are mean zero, that is, $\sum_{i=1}^{n} \phi(x_i) = 0$; otherwise we can subtract $\bar{\phi} = (1/n)\sum_{i=1}^{n} \phi(x_i)$ from each $\phi(x_i)$. The sample covariance is estimated by

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i). \quad (16)$$
(2) Bin the $Y$ variables into $H$ slices $G_1, \ldots, G_H$ and compute the mean vectors of the corresponding mapped predictor variables for each slice:

$$\hat{\psi}_h = \frac{1}{n_h} \sum_{i \in G_h} \phi(x_i), \quad h = 1, \ldots, H. \quad (17)$$

Compute the sample between-group covariance matrix

$$\hat{\Gamma} = \sum_{h=1}^{H} \frac{n_h}{n}\, \hat{\psi}_h \otimes \hat{\psi}_h. \quad (18)$$
(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem

$$\hat{\Gamma}\beta = \lambda\hat{\Sigma}\beta. \quad (19)$$
This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions themselves but the projections onto these directions, that is, the $d$ summary statistics

$$v_1 = \langle \beta_1, \phi(x) \rangle, \ldots, v_d = \langle \beta_d, \phi(x) \rangle, \quad (20)$$

which we call the KSIR variates. As in other kernel algorithms, the kernel trick enables the KSIR variates to be efficiently computed from only the kernel $k$, not the map $\phi$.

The key quantity in this alternative formulation is the centred Gram matrix $K$ defined by the kernel function $k(\cdot,\cdot)$, where

$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j). \quad (21)$$
Note that the rank of $K$ is less than $n$, so $K$ is always singular.

Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:

$$KJKc = \lambda K^2 c, \quad (22)$$

where $c$ denotes the $n$-dimensional generalized eigenvector, and $J$ denotes an $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i$ and $j$ are in the $m$th group consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \ldots, v_d)$.

Proposition 4. Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, let $(\beta_1, \ldots, \beta_d)$ and $(c_1, \ldots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then for any $x \in \mathcal{X}$ and $j = 1, \ldots, d$ the following holds:

$$v_j = \langle \beta_j, \phi(x) \rangle = c_j^T K_x, \quad K_x = (k(x, x_1), \ldots, k(x, x_n))^T, \quad (23)$$

provided that $\hat{\Sigma}$ is invertible. When $\hat{\Sigma}$ is not invertible, the conclusion holds modulo the null space of $\hat{\Sigma}$.

This result was proven in [10], in which the algorithm was further reduced to solving

$$JKc = \lambda Kc \quad (24)$$

by canceling $K$ from both sides of (22).
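To make the matrix formulation concrete, here is a minimal sketch (our own illustration, assuming NumPy/SciPy; function and variable names are hypothetical) of KSIR via the generalized eigenproblem (22), with the slice matrix $J$ built by sorting the responses into equal-size slices. Because $K^2$ is singular, a tiny jitter is added just to make the dense solver run; this ill-conditioning is exactly what motivates the regularization of the next section:

```python
import numpy as np
from scipy.linalg import eigh

def ksir_directions(K, y, n_slices, d):
    """Sketch of KSIR: solve K J K c = lambda K^2 c (eq. (22)) for the top-d eigenvectors."""
    n = K.shape[0]
    # center the Gram matrix, mimicking mean-zero features: K <- H K H, H = I - (1/n) 1 1^T
    Hc = np.eye(n) - np.ones((n, n)) / n
    Kc = Hc @ K @ Hc
    # slice matrix J: J_ij = 1/n_m when observations i and j fall in the same slice m
    J = np.zeros((n, n))
    for idx in np.array_split(np.argsort(y), n_slices):
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    # K^2 is singular (rank(K) < n), so add a tiny jitter for solvability
    lam, C = eigh(Kc @ J @ Kc, Kc @ Kc + 1e-8 * np.eye(n))
    return C[:, np.argsort(lam)[::-1][:d]]  # columns are the generalized eigenvectors c_j
```

By (23), the KSIR variates at the training points are then the columns of `Kc @ C`.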
Remark 5. It is important to remark that when $\hat{\Sigma}$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat{\Sigma}$; that is a requirement of the representer theorem, namely, that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).

Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta_\star$, with $\beta_0$ belonging to the null space and $\beta_\star$ orthogonal to the null space. Then

$$\mathbb{E}[\langle \beta, \phi(x) \rangle^2] = \langle \beta, \Sigma\beta \rangle = \langle \beta_\star, \Sigma\beta_\star \rangle = \mathbb{E}[\langle \beta_\star, \phi(x) \rangle^2], \quad (25)$$

that is, $\langle \beta, \phi(x) \rangle = \langle \beta_\star, \phi(x) \rangle$ (a.s.). It means that $\beta_0$ does not contribute to the KSIR variates and thus could be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta_\star$ is its orthogonal component relative to the null space of $\hat{\Sigma}$, the identity $\langle \beta, \phi(x) \rangle = \langle \beta_\star, \phi(x) \rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.
3.2. Regularization and Stability. Apart from the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding eigenvalues of the estimated covariance operator $\hat{\Sigma}$ or by adding a regularization term to (22) or (24).
We motivate two types of regularization schemes. The first is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem

$$\hat{\Gamma}\beta = \lambda(\hat{\Sigma} + sI)\beta. \quad (26)$$

Here and in the sequel, $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as

$$JKc = \lambda(K + nsI)c. \quad (27)$$

Another type of regularization is to regularize (22) directly:

$$KJKc = \lambda(K^2 + n^2 sI)c. \quad (28)$$

The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of

$$(\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma}. \quad (29)$$
Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:

$$v_j = c_j^T (k(x, x_1), \ldots, k(x, x_n))^T = \langle \beta_j, \phi(x) \rangle. \quad (30)$$
This algorithm is termed the Tikhonov regularization. For linear SIR it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.

Besides computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable via a justifiable representer theorem.
Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \ldots, n$.

The conclusion follows from the observation that $\beta_j = (1/s)((1/\lambda_j)\hat{\Gamma}\beta_j - \hat{\Sigma}\beta_j)$ for the ridge regularization and $\beta_j = (1/s)((1/\lambda_j)\hat{\Sigma}\hat{\Gamma}\beta_j - \hat{\Sigma}^2\beta_j)$ for the Tikhonov regularization; in both cases the right-hand side lies in the span of the $\phi(x_i)$.
To close, we remark that KSIR is computationally advantageous even in the case of linear models when $p \gg n$, since the eigendecomposition problem involves $n \times n$ matrices rather than the $p \times p$ matrices of the standard SIR formulation.
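As a sketch of how the Tikhonov scheme changes the computation (our own illustration, assuming NumPy/SciPy; names are hypothetical), the only modification relative to the unregularized eigenproblem is the regularizer $n^2 s I$ on the right-hand side of (28), which makes that side positive definite for any $s > 0$:

```python
import numpy as np
from scipy.linalg import eigh

def rksir_directions(K, y, n_slices, d, s):
    """Sketch of Tikhonov-regularized KSIR: K J K c = lambda (K^2 + n^2 s I) c (eq. (28))."""
    n = K.shape[0]
    Hc = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = Hc @ K @ Hc                              # centered Gram matrix
    J = np.zeros((n, n))                          # slice-averaging matrix
    for idx in np.array_split(np.argsort(y), n_slices):
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    # right-hand side is positive definite for any s > 0: no ill-conditioning
    lam, C = eigh(Kc @ J @ Kc, Kc @ Kc + n ** 2 * s * np.eye(n))
    return C[:, np.argsort(lam)[::-1][:d]]
```

Per (30), the regularized KSIR variate of a new point $x$ is then $c_j^T (k(x, x_1), \ldots, k(x, x_n))^T$, modulo centering.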
3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.
Note that various consistency results are available for linear SIR [22–24]. These results hold only for the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results from finite rank operators. Their proof of consistency requires stronger and more complicated conditions than ours. Consistency for functional SIR with ridge regularization is proven in [19], but it is of a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.

In the following we state the consistency results for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.
Theorem 9. Assume $\mathbb{E}\,k(X, X)^2 < \infty$, $\lim_{n \to \infty} s(n) = 0$, and $\lim_{n \to \infty} s\sqrt{n} = \infty$. Then

$$\left| \langle \hat{\beta}_j, \phi(\cdot) \rangle - \langle \beta_j, \phi(\cdot) \rangle \right| = o_p(1), \quad j = 1, \ldots, d_\Gamma, \quad (31)$$

where $d_\Gamma$ is the rank of $\Gamma$, $\langle \beta_j, \phi(\cdot) \rangle$ is the projection onto the $j$th e.d.r. direction, and $\langle \hat{\beta}_j, \phi(\cdot) \rangle$ is the projection onto the $j$th e.d.r. direction as estimated by regularized KSIR.

If the e.d.r. directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O_p(n^{-1/4})$.
This theorem is a direct corollary of the following theorem, which is proven in Appendix C.
Theorem 10. Define the projection operator and its complement for each $N \ge 1$:

$$\Pi_N = \sum_{i=1}^{N} e_i \otimes e_i, \quad \Pi_N^{\perp} = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i \otimes e_i, \quad (32)$$

where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with the corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}\,k(X, X)^2 < \infty$. For each $N \ge 1$ the following holds:

$$\left\| (\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma} - T \right\|_{HS} = O_p\!\left( \frac{1}{s\sqrt{n}} \right) + \sum_{j=1}^{d_\Gamma} \left( \frac{s}{w_N^2} \left\| \Pi_N(\tilde{\beta}_j) \right\| + \left\| \Pi_N^{\perp}(\tilde{\beta}_j) \right\| \right), \quad (33)$$

where $\tilde{\beta}_j = \Sigma^{-1} u_j$, the $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt{n} \to \infty$ as $n \to \infty$, then

$$\left\| (\hat{\Sigma}^2 + sI)^{-1}\hat{\Sigma}\hat{\Gamma} - T \right\|_{HS} = o_p(1). \quad (34)$$
4. Application to Simulated and Real Data
In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve the prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true e.d.r. directions or subspace. So in that case we will use the prediction accuracy to evaluate the goodness of RKSIR.
4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.
The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$, with each one following a standard normal distribution, $X_i \sim N(0, 1)$. A univariate response is given as

$$Y = (\sin(X_1) + \sin(X_2))(1 + \sin(X_3)) + \varepsilon, \quad (35)$$

with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to compute the e.d.r. directions. In RKSIR we used an additive Gaussian kernel:

$$k(x, z) = \sum_{j=1}^{d} \exp\!\left( -\frac{(x_j - z_j)^2}{2\sigma^2} \right), \quad (36)$$

where $\sigma = 2$. After projecting the training samples onto the estimated e.d.r. directions, we train a Gaussian kernel regression model based on the new variates. Then the mean square error is computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.

[Figure 1: Dimension reduction for model (35). (a) Mean square error for RKSIR with different regularization parameters $\log_{10}(s)$, for $d = 1$ and $d = 2$. (b) Mean square error for RKSIR, SIR, SAVE, and PHD as a function of the number of e.d.r. directions; the blue line represents the mean square error without using dimension reduction. (c)-(d) KSIR variates for the first and second nonlinear e.d.r. directions versus the nonlinear factors $\sin(X_1) + \sin(X_2)$ and $1 + \sin(X_3)$.]
Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter well: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation agrees with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.
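The simulation setup and the additive Gaussian kernel (36) can be sketched as follows (our own code, assuming NumPy; the Gram-matrix routine is one plausible way to implement (36), not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))                      # X_i ~ N(0, 1), model (35)
Y = (np.sin(X[:, 0]) + np.sin(X[:, 1])) * (1.0 + np.sin(X[:, 2])) \
    + 0.1 * rng.standard_normal(n)                   # noise eps ~ N(0, 0.1^2)

def additive_gaussian_gram(A, B, sigma=2.0):
    """Gram matrix of the additive kernel (36): sum_j exp(-(x_j - z_j)^2 / (2 sigma^2))."""
    diff = A[:, None, :] - B[None, :, :]             # pairwise coordinate differences
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).sum(axis=2)

K = additive_gaussian_gram(X, X)                     # n x n; note K_ii = p exactly
```

Note that $k(x, x) = d$ for this kernel, so Assumption 1(2) holds with $M = d$.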
4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by

$$Y = X_1 + X_2^2 + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad (37)$$

where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q \Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \mathrm{diag}(1^\theta, 2^\theta, \ldots, 10^\theta)$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct e.d.r. directions.
For this model it is known that SIR will miss the direction along the second variable, $X_2$. So we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space

$$\Phi(X) = (1,\; X_i,\; (X_i X_j)_{i \le j}), \quad i, j = 1, \ldots, 10, \quad (38)$$

then $X_1 + X_2^2$ can be captured in one e.d.r. direction. Ideally, the first KSIR variate $v = \langle \beta_1, \phi(X) \rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:

$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}(X_1 + X_2^2). \quad (39)$$

So, for this example, given estimates of the KSIR variates at the $n$ data points, $\{\hat{v}_i\}_{i=1}^{n} = \{\langle \hat{\beta}_1, \phi(x_i) \rangle\}_{i=1}^{n}$, the error of the first e.d.r. direction can be measured by the least squares fit of $\hat{v}$ to $X_1 + X_2^2$:

$$\text{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \left( \hat{v}_i - (a(x_{i1} + x_{i2}^2) + b) \right)^2. \quad (40)$$
We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The mean and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as $\theta$ increases, and that regularization helps to reduce this instability.
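The error measure (40) is an ordinary least squares fit and can be sketched as follows (our own code, assuming NumPy; the function name is hypothetical):

```python
import numpy as np

def first_direction_error(v_hat, X):
    """Least squares error (40) of the estimated first KSIR variate against X_1 + X_2^2."""
    t = X[:, 0] + X[:, 1] ** 2                    # target nonlinear factor
    A = np.column_stack([t, np.ones_like(t)])     # design matrix for the min over (a, b)
    (a, b), *_ = np.linalg.lstsq(A, v_hat, rcond=None)
    return np.mean((v_hat - (a * t + b)) ** 2)    # mean squared residual
```

A variate that equals $a(X_1 + X_2^2) + b$ exactly gives zero error; the curves in Figure 2 average this quantity over the 100 repetitions.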
4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For the case of multiclass classification, it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.
The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.
[Figure 2: Error in the e.d.r. direction as a function of $\theta$, for KSIR and RKSIR.]
We compared regularized SIR (RSIR), as in (26), KSIR, and RKSIR on these data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a $k$-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
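The bandwidth rule used here, the median pairwise distance between observations, is a common heuristic and can be sketched as follows (our own code, assuming NumPy/SciPy; the function name is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_bandwidth(X):
    """Gaussian-kernel bandwidth set to the median pairwise Euclidean distance."""
    # pdist returns the n(n-1)/2 upper-triangle distances in condensed form
    return float(np.median(pdist(X)))
```

With this bandwidth, the Gram matrix would be $K_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ or a close variant; the paper does not spell out the exact parameterization.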
The means and standard deviations of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for the digits 2, 3, 5, and 8.
5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework; therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
Table 1: Means and standard deviations for the error rates in the classification of digits.

Digit      RKSIR             KSIR              RSIR              kNN
0          0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1          0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2          0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3          0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4          0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5          0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6          0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7          0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8          0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9          0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average    0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)

RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore their method estimates linear edr subspaces, while in [10, 11] and in this paper the RKHS is used to model the edr subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well studied for linear dimension reduction, little is known for nonlinear dimension reduction; we leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from those for functional SIR.
Appendices
A. Proof of Theorem 3

Under the assumption of Proposition 2, for each Y = y,

E[ϕ(X) | Y = y] ∈ span{Σβ_i : i = 1, …, d}.  (A.1)
Since Γ = cov[E(ϕ(X) | Y)], the rank of Γ (i.e., the dimension of the image of Γ) is at most d, which implies that Γ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist d_Γ positive eigenvalues {τ_i}_{i=1}^{d_Γ} and eigenvectors {u_i}_{i=1}^{d_Γ} such that Γ = ∑_{i=1}^{d_Γ} τ_i u_i ⊗ u_i.
Recall that for any f ∈ H,

Γf = E[⟨E[ϕ(X) | Y], f⟩ E[ϕ(X) | Y]]  (A.2)

also belongs to

span{Σβ_i : i = 1, …, d} ⊂ range(Σ)  (A.3)

because of (A.1), so

u_i = (1/τ_i) Γu_i ∈ range(Σ).  (A.4)
This proves (i).
Since Γf ∈ range(Σ) for each f ∈ H, the operator T = Σ⁻¹Γ is well defined over the whole space. Moreover,

Tf = Σ⁻¹(∑_{i=1}^{d_Γ} τ_i ⟨u_i, f⟩ u_i) = ∑_{i=1}^{d_Γ} τ_i ⟨u_i, f⟩ Σ⁻¹(u_i) = (∑_{i=1}^{d_Γ} τ_i Σ⁻¹(u_i) ⊗ u_i) f.  (A.5)

This proves (ii).
B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to the operators, where d_K is infinite and a matrix form does not make sense.
Let Φ = [ϕ(x_1), …, ϕ(x_n)]. It has the following SVD decomposition:

Φ = [u_1 ⋯ u_{d_K}] ( D_{d×d}         0_{d×(n−d)}
                      0_{(d_K−d)×d}   0_{(d_K−d)×(n−d)} ) [v_1 ⋯ v_n]ᵀ = UDVᵀ,  (B.1)

where U = [u_1, …, u_d], V = [v_1, …, v_d], and D = D_{d×d} is a diagonal matrix of dimension d ≤ n.
We need to show the KSIR variates

v_j = c_jᵀ [k(x, x_1), …, k(x, x_n)]ᵀ = c_jᵀ Φᵀ ϕ(x) = ⟨Φc_j, ϕ(x)⟩ = ⟨β_j, ϕ(x)⟩.  (B.2)

It suffices to prove that if (λ, c) is a solution to (28), then (λ, β) is also a pair of eigenvalue and eigenvector of (Σ̂² + γI)⁻¹Σ̂Γ̂, and vice versa, where c and β are related by

β = Φc,  c = VD⁻¹Uᵀβ.  (B.3)
Noting the facts Σ̂ = (1/n)ΦΦᵀ, Γ̂ = (1/n)ΦJΦᵀ, and K = ΦᵀΦ = VD²Vᵀ, the argument may be made as follows:

Σ̂Γ̂β = λ(Σ̂² + γI)β
⟺ ΦΦᵀΦJΦᵀΦc = λ(ΦΦᵀΦΦᵀΦc + n²γΦc)
⟺ ΦKJKc = λΦ(K² + n²γ)c
(⟺ VVᵀKJKc = λVVᵀ(K² + n²γI)c)
⟺ KJKc = λ(K² + n²γI)c.  (B.4)
Note that the implication in the third step is necessary only in the ⇒ direction; it is obtained by multiplying both sides by VD⁻¹Uᵀ and using the fact that UᵀU = I_d. For the last step, since VᵀV = I_d, we use the following facts:
VVᵀK = VVᵀVD²Vᵀ = VD²Vᵀ = K,
VVᵀc = VVᵀVD⁻¹Uᵀβ = VD⁻¹Uᵀβ = c.  (B.5)
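The two identities in (B.5) are easy to verify numerically for a finite-dimensional feature matrix. A small illustrative sketch, in which the rank-deficient Φ is randomly generated rather than built from a kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
dK, n, d = 50, 20, 8                                      # feature dim, sample size, rank d < n
Phi = rng.normal(size=(dK, d)) @ rng.normal(size=(d, n))  # rank-d stand-in for [phi(x_1) ... phi(x_n)]

# thin SVD truncated at the rank d: Phi = U D V^T
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
U, s, V = U[:, :d], s[:d], Vt[:d].T
K = Phi.T @ Phi                                           # Gram matrix; K = V D^2 V^T

# a coefficient vector in the row space of Phi (any component in null(Phi) is lost anyway)
c = V @ (V.T @ rng.normal(size=n))
beta = Phi @ c                                            # beta = Phi c
c_back = V @ (np.diag(1.0 / s) @ (U.T @ beta))            # c = V D^{-1} U^T beta
```

Because d < n here, V Vᵀ is not the identity, yet V VᵀK = K and the coefficient c is recovered exactly, which is the content of (B.5).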
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define Φ, Φᵀ, and the SVD of Φ when d_K is infinite. For the infinite dimensional case, Φ is the operator from ℝⁿ to H_K defined by Φv = ∑_{i=1}^n v_i ϕ(x_i) for v = (v_1, …, v_n)ᵀ ∈ ℝⁿ, and Φᵀ is its adjoint, the operator from H_K to ℝⁿ such that Φᵀf = (⟨ϕ(x_1), f⟩_K, …, ⟨ϕ(x_n), f⟩_K)ᵀ for f ∈ H_K. The notions U and Uᵀ are similarly defined.
The above formulation of Φ and Φᵀ coincides with the definition of Σ̂ as a covariance operator. Since the rank of Σ̂ is less than n, it is compact and has the following representation:

Σ̂ = ∑_{i=1}^{d_K} σ_i u_i ⊗ u_i = ∑_{i=1}^{d} σ_i u_i ⊗ u_i,  (B.6)

where d ≤ n is the rank and σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_d > σ_{d+1} = ⋯ = 0. This implies that each ϕ(x_i) lies in span(u_1, …, u_d), and hence we can write ϕ(x_i) = Uτ_i, where U = (u_1, …, u_d) should be considered as an operator from ℝᵈ to H_K and τ_i ∈ ℝᵈ. Denote Υ = (τ_1, …, τ_n)ᵀ ∈ ℝ^{n×d}. It is easy to check that ΥᵀΥ = diag(nσ_1, …, nσ_d). Let D = D_{d×d} = diag(√(nσ_1), …, √(nσ_d)) and V = ΥD⁻¹. Then we obtain the SVD for Φ as Φ = UDVᵀ, which is well defined.
C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27–29] and the references therein.
Given a separable Hilbert space H of dimension p_H, a linear operator L on H is said to belong to the Hilbert-Schmidt class if

‖L‖²_HS = ∑_{i=1}^{p_H} ‖Le_i‖²_H < ∞,  (C.1)
where {e_i} is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm ‖·‖_HS.
Given a bounded operator S on H, the operators SL and LS both belong to the Hilbert-Schmidt class, and the following holds:

‖SL‖_HS ≤ ‖S‖ ‖L‖_HS,  ‖LS‖_HS ≤ ‖L‖_HS ‖S‖,  (C.2)

where ‖·‖ denotes the default operator norm

‖L‖² = sup_{f∈H} ‖Lf‖² / ‖f‖².  (C.3)
Let Z be a random vector taking values in H satisfying E‖Z‖² < ∞. The covariance operator

Σ = E[(Z − EZ) ⊗ (Z − EZ)]  (C.4)

is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
A well-known result from perturbation theory for linear operators states that if a sequence of linear operators T_n converges to T in the Hilbert-Schmidt norm and the eigenvalues of T are nondegenerate, then the eigenvalues and eigenvectors of T_n converge to those of T with the same rate of convergence as that of the operators.
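This fact is easy to illustrate numerically with matrices standing in for finite-rank operators; the sketch below perturbs a symmetric matrix by E/√n, so that ‖T_n − T‖_HS = O(1/√n), and watches the top eigenpair converge (all names are ours, and the example is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.normal(size=(p, p))
T = A + A.T                                   # symmetric matrix with (generically) simple spectrum
E = rng.normal(size=(p, p))
E = E + E.T                                   # a fixed symmetric perturbation direction

def top_eigpair(M):
    w, v = np.linalg.eigh(M)                  # eigh returns eigenvalues in ascending order
    return w[-1], v[:, -1]

lam, vec = top_eigpair(T)
errs = []
for n in (10**2, 10**4, 10**6):
    Tn = T + E / np.sqrt(n)                   # ||T_n - T||_HS = O(1/sqrt(n))
    lam_n, vec_n = top_eigpair(Tn)
    if float(vec_n @ vec) < 0:                # align the sign before comparing eigenvectors
        vec_n = -vec_n
    errs.append((abs(lam_n - lam), float(np.linalg.norm(vec_n - vec))))
```

The eigenvalue and eigenvector errors in `errs` shrink at the same 1/√n rate as the operator perturbation, as the perturbation result asserts.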
C.2. Proof of Theorem 10. We will use the following result from [14]:

‖Σ̂ − Σ‖_HS = O_p(1/√n),  ‖Γ̂ − Γ‖_HS = O_p(1/√n).  (C.5)
To simplify the notation, we denote T̂_s = (Σ̂² + sI)⁻¹Σ̂Γ̂. Also define

T_1 = (Σ̂² + sI)⁻¹ΣΓ,  T_2 = (Σ² + sI)⁻¹ΣΓ.  (C.6)

Then

‖T̂_s − T‖_HS ≤ ‖T̂_s − T_1‖_HS + ‖T_1 − T_2‖_HS + ‖T_2 − T‖_HS.  (C.7)
For the first term, observe that

‖T̂_s − T_1‖_HS ≤ ‖(Σ̂² + sI)⁻¹‖ ‖Σ̂Γ̂ − ΣΓ‖_HS = O_p(1/(s√n)).  (C.8)

For the second term, note that

T_1 = ∑_{j=1}^{d_Γ} τ_j ((Σ̂² + sI)⁻¹ Σu_j) ⊗ u_j,
T_2 = ∑_{j=1}^{d_Γ} τ_j ((Σ² + sI)⁻¹ Σu_j) ⊗ u_j.  (C.9)
Therefore,

‖T_1 − T_2‖_HS ≤ ∑_{j=1}^{d_Γ} τ_j ‖((Σ̂² + sI)⁻¹ − (Σ² + sI)⁻¹) Σu_j‖.  (C.10)

Since u_j ∈ range(Σ), there exists η_j such that u_j = Ση_j. Then

((Σ̂² + sI)⁻¹ − (Σ² + sI)⁻¹) Σu_j = (Σ̂² + sI)⁻¹ (Σ² − Σ̂²) (Σ² + sI)⁻¹ Σ²η_j,  (C.11)
which implies

‖T_1 − T_2‖_HS ≤ ∑_{j=1}^{d_Γ} τ_j ‖(Σ̂² + sI)⁻¹‖ ‖Σ̂² − Σ²‖_HS ‖(Σ² + sI)⁻¹Σ²‖ ‖η_j‖ = O_p(1/(s√n)).  (C.12)
For the third term, the following holds:

‖T_2 − T‖²_HS = ∑_{j=1}^{d_Γ} τ_j² ‖((Σ² + sI)⁻¹Σ − Σ⁻¹) u_j‖²,  (C.13)
and for each j = 1, …, d_Γ,

‖(Σ² + sI)⁻¹Σu_j − Σ⁻¹u_j‖
≤ ‖(Σ² + sI)⁻¹Σ²η_j − η_j‖
= ‖∑_{i=1}^∞ (w_i²/(s + w_i²) − 1) ⟨η_j, e_i⟩ e_i‖
= (∑_{i=1}^∞ (s²/(s + w_i²)²) ⟨η_j, e_i⟩²)^{1/2}
≤ (s/w_N²) (∑_{i=1}^N ⟨η_j, e_i⟩²)^{1/2} + (∑_{i=N+1}^∞ ⟨η_j, e_i⟩²)^{1/2}
= (s/w_N²) ‖Π_N(η_j)‖ + ‖Π⊥_N(η_j)‖  (C.14)

(here Π_N denotes the orthogonal projection onto span{e_1, …, e_N} and Π⊥_N = I − Π_N).
Combining these terms results in (33).
Since ‖Π⊥_N(η_j)‖ → 0 as N → ∞, we consequently have

‖T̂_s − T‖_HS = o_p(1)  (C.15)

if s → 0 and s√n → ∞.
If all the edr directions β_i depend only on a finite number of eigenvectors of the covariance operator, then there exists some N > 1 such that S* = span{Σe_i : i = 1, …, N}. This implies that

η_j = Σ⁻¹u_j ∈ Σ⁻¹(S*) ⊂ span{e_i : i = 1, …, N}.  (C.16)

Therefore Π⊥_N(η_j) = 0. Letting s = O(n^{−1/4}), the rate is O_p(n^{−1/4}).
Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.
References
[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.
space. Nonlinear extensions of some classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel independent component analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear edr directions.
There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR, the p-dimensional data are projected into a Hilbert space H through a feature map ϕ: 𝒳 → H, and the nonlinear features are supposed to be recovered by the eigenfunctions of the following operator:

T = [cov(ϕ(X))]⁻¹ cov[E(ϕ(X) | Y)].  (3)

However, this operator T is actually not well defined in general, especially when H is infinite dimensional and the covariance operator cov(ϕ(X)) is not invertible. In addition, the key utility of representer theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space H reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11], the representer theorem has been implicitly used and seems to work well in empirical studies. However, it is not theoretically justifiable, since T is estimated empirically via observed data. Moreover, the computation of eigenfunctions of an empirical estimate of T from observations is often ill-conditioned and results in computational instability. In [11] a low-rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes to overcome the computational difficulties arising in KSIR. Consistency is proven, and the practical advantages of regularization are demonstrated via empirical experiments.
2. Mercer Kernels and Nonlinear edr Directions

The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs), and in particular Mercer kernels [12].
Given predictor variables X ∈ 𝒳 ⊆ ℝᵖ, a Mercer kernel is a continuous, positive semidefinite function k(·, ·): 𝒳 × 𝒳 → ℝ with the following spectral decomposition:

k(x, z) = ∑_j λ_j φ_j(x) φ_j(z),  (4)

where the φ_j are the eigenfunctions and the λ_j are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel k uniquely corresponds to an RKHS as follows:

H = {f | f(x) = ∑_{j∈Λ} a_j φ_j(x) with ∑_{j∈Λ} a_j²/λ_j < ∞},  (5)
where the cardinality of Λ = {j : λ_j > 0} is the dimension of the RKHS, which may be infinite [12, 13].
Given a Mercer kernel, there exists a unique map or embedding ϕ from 𝒳 to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel. The map takes the following form:

ϕ(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), …, √λ_{|Λ|} φ_{|Λ|}(x)).  (6)

The Hilbert space induced by this map with the standard inner product, k(x, z) = ⟨ϕ(x), ϕ(z)⟩, is isomorphic to the RKHS (5), and we will denote both Hilbert spaces as H [13]. In the case where H is infinite dimensional, ϕ: 𝒳 → ℓ₂.
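On a finite sample, the decomposition (4) and the embedding (6) have empirical analogues obtained from the eigendecomposition of the kernel matrix. A short sketch (the Gaussian kernel is an illustrative choice; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))

# Gram matrix of a Gaussian kernel k(x, z) = exp(-||x - z||^2 / 2)
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
K = np.exp(-d2 / 2.0)

# empirical analogue of (4): K = sum_j lam_j phi_j phi_j^T
lam, eigfns = np.linalg.eigh(K)               # column j ~ eigenfunction phi_j at the sample points
lam = np.maximum(lam, 0.0)                    # clip tiny negative roundoff

# empirical analogue of (6): row i is the feature vector
# phi(x_i) = (sqrt(lam_1) phi_1(x_i), sqrt(lam_2) phi_2(x_i), ...)
features = eigfns * np.sqrt(lam)

# inner products of the feature vectors recover the kernel values
K_rec = features @ features.T                 # <phi(x_i), phi(x_j)> = K_ij
```

This is exactly the finite-sample version of the identity k(x, z) = ⟨ϕ(x), ϕ(z)⟩ stated above.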
The random variable X ∈ 𝒳 induces a random element ϕ(X) in the RKHS. Throughout this paper we will use Hilbert space valued random variables, so we now recall some basic facts. Let Z be a random element in H with E‖Z‖ < ∞, where ‖·‖ denotes the norm in H induced by its inner product ⟨·, ·⟩. The expectation E(Z) is defined to be the element in H satisfying ⟨a, E(Z)⟩ = E⟨a, Z⟩ for all a ∈ H. If E‖Z‖² < ∞, then the covariance operator of Z is defined as E[(Z − EZ) ⊗ (Z − EZ)], where

(a ⊗ b) f = ⟨b, f⟩ a for any f ∈ H.  (7)
Let 𝒫 denote the measure for the random variable X. Throughout we assume the following conditions.

Assumption 1.
(1) For all x ∈ 𝒳, k(x, ·) is 𝒫-measurable.
(2) There exists M > 0 such that k(X, X) ≤ M almost surely with respect to 𝒫.
Under Assumption 1, the random element ϕ(X) has a well-defined mean and covariance operator because ‖ϕ(x)‖² = k(x, x) is bounded (a.s.). Without loss of generality, we assume Eϕ(X) = 0, where 0 is the zero element in H. The boundedness also implies that the covariance operator Σ = E[ϕ(X) ⊗ ϕ(X)] is compact and has the following spectral decomposition:

Σ = ∑_{i=1}^∞ w_i e_i ⊗ e_i,  (8)
where the w_i and e_i ∈ H are the eigenvalues and eigenfunctions, respectively.
We assume the following model for the relationship between Y and X:

Y = F(⟨β_1, ϕ(X)⟩, …, ⟨β_d, ϕ(X)⟩, ε),  (9)

with β_j ∈ H and the distribution of ε independent of X. This model implies that the response variable Y depends on X only through the d-dimensional summary statistic

S(X) = (⟨β_1, ϕ(X)⟩, …, ⟨β_d, ϕ(X)⟩).  (10)

Although S(X) is a linear summary statistic in H, it extracts nonlinear features in the space of the original predictor
variables X. We call {β_j}_{j=1}^d the nonlinear edr directions and S = span(β_1, …, β_d) the nonlinear edr space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.
Proposition 2. Assume the following linear design condition for H: for any f ∈ H there exists a vector b ∈ ℝᵈ such that

E[⟨f, ϕ(X)⟩ | S(X)] = bᵀS(X), with S(X) = (⟨β_1, ϕ(X)⟩, …, ⟨β_d, ϕ(X)⟩)ᵀ.  (11)

Then for the model specified in (9), the inverse regression curve E[ϕ(X) | Y] is contained in the span of (Σβ_1, …, Σβ_d), where Σ is the covariance operator of ϕ(X).
Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that even if the linear design condition is violated, the bias of the SIR variates is usually not large.
An immediate consequence of Proposition 2 is that the nonlinear edr directions are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigendecomposition problem:

Γβ = λΣβ, where Σ = cov[ϕ(X)], Γ = cov[E(ϕ(X) | Y)],  (12)

or, equivalently, from an eigenanalysis of the operator T = Σ⁻¹Γ. In the infinite dimensional case a technical difficulty arises, since the operator

Σ⁻¹ = ∑_{i=1}^∞ w_i⁻¹ e_i ⊗ e_i  (13)
is not defined on the entire Hilbert space H. So for the operator T to be well defined, we need to show that the range of Γ is indeed contained in the range of Σ. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In those analyses extra conditions are needed for operators like T to be well defined. In KSIR this issue is resolved automatically by the linear design condition, and extra conditions are not required, as stated by the following theorem; see Appendix A for the proof.
Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold:

(i) the operator Γ is of finite rank d_Γ ≤ d. Consequently it is compact and has the following spectral decomposition:

Γ = ∑_{i=1}^{d_Γ} τ_i u_i ⊗ u_i,  (14)

where the τ_i and u_i are the eigenvalues and eigenvectors, respectively. Moreover, u_i ∈ range(Σ) for all i = 1, …, d_Γ;

(ii) the eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator T, which takes the following form:

T = ∑_{i=1}^{d_Γ} τ_i Σ⁻¹(u_i) ⊗ u_i.  (15)
3. Regularized Kernel Sliced Inverse Regression

The discussion in Section 2 implies that the nonlinear edr directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.
3.1. Estimating the Nonlinear edr Directions. Given n observations (x_1, y_1), …, (x_n, y_n), our objective is to obtain an estimate of the edr directions (β_1, …, β_d). We first formulate a procedure almost identical to the standard SIR procedure, except that it operates in the feature space H. This highlights the immediate relation between the SIR and KSIR procedures.
(1) Without loss of generality, we assume that the mapped predictor variables are mean zero, that is, ∑_{i=1}^n ϕ(x_i) = 0; otherwise we can subtract ϕ̄ = (1/n)∑_{i=1}^n ϕ(x_i) from each ϕ(x_i). The sample covariance is estimated by

Σ̂ = (1/n) ∑_{i=1}^n ϕ(x_i) ⊗ ϕ(x_i).  (16)
(2) Bin the Y variables into H slices G_1, …, G_H and compute the mean vectors of the corresponding mapped predictor variables for each slice:

ψ_h = (1/n_h) ∑_{i∈G_h} ϕ(x_i), h = 1, …, H.  (17)

Compute the sample between-group covariance matrix:

Γ̂ = ∑_{h=1}^H (n_h/n) ψ_h ⊗ ψ_h.  (18)
(3) Estimate the SIR directions β_j by solving the generalized eigendecomposition problem

Γ̂β = λΣ̂β.  (19)
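When the feature map is finite dimensional and computable, steps (1)-(3) can be run directly. The sketch below uses the identity feature map ϕ(x) = x, so it reduces to ordinary SIR, on synthetic data drawn from model (9) with a single direction; all names are ours and the example is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 2000, 5
X = rng.normal(size=(n, p))                      # here phi(x) = x, so this is ordinary SIR
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta) ** 3 + 0.1 * rng.normal(size=n)   # Y = F(<beta, phi(X)>, eps)

# step (1): center and form the sample covariance
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / n

# step (2): slice Y into H slices and form the between-slice covariance
H = 10
edges = np.quantile(y, np.linspace(0.0, 1.0, H + 1))
Gamma = np.zeros((p, p))
for h in range(H):
    hi = edges[h + 1] if h < H - 1 else np.inf
    mask = (y >= edges[h]) & (y < hi)
    psi = Xc[mask].mean(axis=0)                  # slice mean psi_h
    Gamma += (mask.sum() / n) * np.outer(psi, psi)

# step (3): generalized eigenproblem Gamma b = lambda Sigma b
vals, vecs = np.linalg.eig(np.linalg.solve(Sigma, Gamma))
b = np.real(vecs[:, np.argmax(np.real(vals))])
b /= np.linalg.norm(b)                           # estimated edr direction
```

The top generalized eigenvector aligns closely with the true direction β, which is the behavior the procedure is designed to deliver.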
This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the edr directions but the projections onto these directions, that is, the d summary statistics

v_1 = ⟨β_1, ϕ(x)⟩, …, v_d = ⟨β_d, ϕ(x)⟩,  (20)

which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be efficiently computed from only the kernel k, not the map ϕ.
The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot,\cdot)$, where
$$K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = k(x_i, x_j). \qquad (21)$$
Note that the rank of $K$ is less than $n$, so $K$ is always singular. Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:
$$KJKc = \lambda K^2 c, \qquad (22)$$
where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes the $n\times n$ matrix with $J_{ij} = 1/n_m$ if $i, j$ are in the $m$th group consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \dots, v_d)$.
Proposition 4. Given the observations $(x_1, y_1), \dots, (x_n, y_n)$, let $(\beta_1, \dots, \beta_d)$ and $(c_1, \dots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then for any $x\in\mathcal{X}$ and $j = 1,\dots,d$ the following holds:
$$v_j = \langle\beta_j, \phi(x)\rangle = c_j^T K_x,\qquad K_x = \left(k(x, x_1), \dots, k(x, x_n)\right)^T, \qquad (23)$$
provided that $\hat\Sigma$ is invertible. When $\hat\Sigma$ is not invertible, the conclusion holds modulo the null space of $\hat\Sigma$.
This result was proven in [10], in which the algorithm was further reduced to solving
$$JKc = \lambda Kc \qquad (24)$$
by canceling $K$ from both sides of (22).
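For concreteness, (24) can be sketched in a few lines of numpy: build the slice matrix $J$ from the slice memberships and solve $JKc = \lambda Kc$, adding a small jitter $\varepsilon I$ to $K$ since, as noted above, $K$ is always singular. The helper names are ours and this is an illustration under those choices, not the authors' implementation:

```python
import numpy as np

def slice_matrix(slice_ids):
    # J[i, j] = 1/n_m when observations i and j fall in the same slice m,
    # and 0 otherwise, matching the definition below (22).
    slice_ids = np.asarray(slice_ids)
    J = np.zeros((slice_ids.size, slice_ids.size))
    for m in np.unique(slice_ids):
        idx = np.flatnonzero(slice_ids == m)
        J[np.ix_(idx, idx)] = 1.0 / idx.size
    return J

def ksir_variates(K, slice_ids, d, eps=1e-8):
    # Solve J K c = lambda K c via (K + eps I)^{-1} J K c = lambda c;
    # the jitter eps makes the singular Gram matrix invertible.
    n = K.shape[0]
    J = slice_matrix(slice_ids)
    vals, vecs = np.linalg.eig(np.linalg.solve(K + eps * np.eye(n), J @ K))
    top = np.argsort(-vals.real)[:d]
    C = vecs[:, top].real          # columns are c_1, ..., c_d
    return K @ C                   # in-sample variates v_j = K c_j
```

For a new point $x$, the $j$th variate is $c_j^T K_x$ as in (23). The need for the jitter here is exactly the instability that motivates the regularization schemes of the next subsection.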
Remark 5. It is important to remark that when $\hat\Sigma$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat\Sigma$; that is, it requires the representer theorem, namely that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).
Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta^\star$, with $\beta_0$ belonging to the null space and $\beta^\star$ orthogonal to it. Then
$$\mathbb{E}\left[\langle\beta, \phi(x)\rangle^2\right] = \langle\beta, \Sigma\beta\rangle = \langle\beta^\star, \Sigma\beta^\star\rangle = \mathbb{E}\left[\langle\beta^\star, \phi(x)\rangle^2\right], \qquad (25)$$
that is, $\langle\beta, \phi(x)\rangle = \langle\beta^\star, \phi(x)\rangle$ (a.s.). This means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta^\star$ is its component orthogonal to the null space of $\hat\Sigma$, the identity $\langle\beta, \phi(x)\rangle = \langle\beta^\star, \phi(x)\rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not natural to assume the representer theorem, although it works well in practice. In this sense the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.
3.2. Regularization and Stability. Beyond these theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the edr space. This can be addressed either by thresholding the eigenvalues of the estimated covariance matrix $\hat\Sigma$ or by adding a regularization term to (22) or (24).
We motivate two types of regularization schemes. The first is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem
$$\hat\Gamma\beta = \lambda\left(\hat\Sigma + sI\right)\beta. \qquad (26)$$
Here and in the sequel, $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as
$$JKc = \lambda\left(K + nsI\right)c. \qquad (27)$$
Another type of regularization is to regularize (22) directly:
$$KJKc = \lambda\left(K^2 + n^2 sI\right)c. \qquad (28)$$
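In matrix terms, the two regularized problems (27) and (28) differ only in the linear system being solved; each reduces to an ordinary eigenproblem. A hedged numpy sketch (our naming, not the authors' code):

```python
import numpy as np

def rksir_directions(K, J, d, s, scheme="tikhonov"):
    # Ridge (27):    (K + n s I)^{-1}     J K   c = lambda c
    # Tikhonov (28): (K^2 + n^2 s I)^{-1} K J K c = lambda c
    n = K.shape[0]
    if scheme == "ridge":
        A = np.linalg.solve(K + n * s * np.eye(n), J @ K)
    else:
        A = np.linalg.solve(K @ K + n ** 2 * s * np.eye(n), K @ J @ K)
    vals, vecs = np.linalg.eig(A)
    top = np.argsort(-vals.real)[:d]
    return vecs[:, top].real       # columns are c_1, ..., c_d
```

Either regularizer turns the singular pencil into a well-posed problem; the tuning parameter `s` trades stability against bias, and in the experiments below it is chosen by cross-validation.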
The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of
$$\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma. \qquad (29)$$
Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:
$$v_j = c_j^T\left[k(x, x_1), \dots, k(x, x_n)\right]^T = \langle\beta_j, \phi(x)\rangle. \qquad (30)$$
This algorithm is termed Tikhonov regularization. For linear SIR, it was shown in [21] that Tikhonov regularization is more efficient than ridge regularization.
Besides computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable via a justifiable representer theorem.
Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1,\dots,n$.
The conclusion follows from the observation that $\beta_j = (1/(\lambda_j s))(\hat\Gamma\beta_j - \lambda_j\hat\Sigma\beta_j)$ for the ridge regularization and $\beta_j = (1/(\lambda_j s))(\hat\Sigma\hat\Gamma\beta_j - \lambda_j\hat\Sigma^2\beta_j)$ for the Tikhonov regularization; in both cases the right-hand side lies in the span of the $\phi(x_i)$, since both $\hat\Sigma$ and $\hat\Gamma$ map into that span.
To close, we remark that KSIR is computationally advantageous even in the case of linear models when $p \gg n$, because the eigendecomposition problem involves $n\times n$ matrices rather than the $p\times p$ matrices of the standard SIR formulation.
3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the edr directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the edr directions depends on the contribution of the small principal components: the rate can be arbitrarily slow if the edr space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.
Note that various consistency results are available for linear SIR [22–24]. These results hold only in the finite-dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results for finite-rank operators; their proof of consistency requires stronger and more complicated conditions than ours. Consistency of functional SIR with ridge regularization is proven in [19], but in a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.
In the following we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.
Theorem 9. Assume $\mathbb{E}\,k(X,X)^2 < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt{n} = \infty$. Then
$$\left|\langle\hat\beta_j, \phi(\cdot)\rangle - \langle\beta_j, \phi(\cdot)\rangle\right| = o_p(1),\qquad j = 1,\dots,d_\Gamma, \qquad (31)$$
where $d_\Gamma$ is the rank of $\Gamma$, $\langle\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction, and $\langle\hat\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction as estimated by regularized KSIR.
If the edr directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O_p(n^{-1/4})$.
This theorem is a direct corollary of the following theorem, which is proven in Appendix C.
Theorem 10. Define the projection operator and its complement for each $N \ge 1$:
$$\Pi_N = \sum_{i=1}^{N} e_i\otimes e_i,\qquad \Pi_N^\perp = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i\otimes e_i, \qquad (32)$$
where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with corresponding eigenvalues $w_i$.
Assume $\mathbb{E}\,k(X,X)^2 < \infty$. For each $N \ge 1$ the following holds:
$$\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma - T\right\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right) + \sum_{j=1}^{d_\Gamma}\left(\frac{s}{w_N^2}\left\|\Pi_N(\ell_j)\right\| + \left\|\Pi_N^\perp(\ell_j)\right\|\right), \qquad (33)$$
where $\ell_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert–Schmidt norm of a linear operator.
If $s = s(n)$ satisfies $s\to 0$ and $s\sqrt{n}\to\infty$ as $n\to\infty$, then
$$\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma - T\right\|_{HS} = o_p(1). \qquad (34)$$
4. Application to Simulated and Real Data
In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve prediction accuracy?
We remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true edr directions or subspace, so in that case we use prediction accuracy to evaluate the goodness of RKSIR.
4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.
The regression model has ten predictor variables $X = (X_1, \dots, X_{10})$, each following a normal distribution $X_i \sim N(0,1)$. A univariate response is given as
$$Y = \left(\sin(X_1) + \sin(X_2)\right)\left(1 + \sin(X_3)\right) + \varepsilon, \qquad (35)$$
with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to
[Figure 1 appears here. Caption: Dimension reduction for model (35). (a) Mean square error for RKSIR with different regularization parameters. (b) Mean square error for various dimension reduction methods; the blue line represents the mean square error without dimension reduction. (c)-(d) RKSIR variates versus the nonlinear factors $\sin(X_1)+\sin(X_2)$ and $1+\sin(X_3)$.]
compute the edr directions. In RKSIR we used an additive Gaussian kernel,
$$k(x, z) = \sum_{j=1}^{d}\exp\!\left(-\frac{(x_j - z_j)^2}{2\sigma^2}\right), \qquad (36)$$
where $\sigma = 2$. After projecting the training samples onto the estimated edr directions, we train a Gaussian kernel regression model on the new variates. The mean square error is then computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.
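The simulation setup is easy to reproduce in outline. A sketch of the data-generating model (35) and the additive Gaussian kernel (36), with $\sigma = 2$ and $d = 10$ as in the text (variable names are ours):

```python
import numpy as np

def additive_gaussian_kernel(X, Z, sigma=2.0):
    # k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2)), as in (36)
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).sum(axis=2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # ten N(0, 1) predictors
eps = rng.normal(scale=0.1, size=100)           # noise with sd 0.1
Y = (np.sin(X[:, 0]) + np.sin(X[:, 1])) * (1.0 + np.sin(X[:, 2])) + eps
K = additive_gaussian_kernel(X, X)              # 100 x 100 Gram matrix
print(K.shape)  # → (100, 100)
```

Note that $k(x, x) = d$ for this kernel, so the diagonal of the Gram matrix equals 10 here; each coordinate contributes its own Gaussian similarity, which matches the additive structure of (35).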
Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.
RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear edr directions. This observation agrees with the model in (35); indeed, Figures 1(c) and 1(d) show that the first two edr directions from RKSIR estimate the two nonlinear factors well.
4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.
The regression model has ten predictor variables $X = (X_1, \dots, X_{10})$ and a univariate response specified by
$$Y = X_1 + X_2^2 + \varepsilon,\qquad \varepsilon \sim N(0, 0.1^2), \qquad (37)$$
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \mathrm{diag}(1^\theta, 2^\theta, \dots, 10^\theta)$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct edr directions.
For this model it is known that SIR will miss the direction along the second variable $X_2$, so we focus on the comparison of KSIR and RKSIR in this example.
If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space
$$\Phi(X) = \left\{1,\ X_i,\ (X_i X_j)_{i\le j}\right\},\qquad i, j = 1, \dots, 10, \qquad (38)$$
then $X_1 + X_2^2$ can be captured in one edr direction. Ideally, the first KSIR variate $v = \langle\beta_1, \phi(X)\rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}\left(X_1 + X_2^2\right). \qquad (39)$$
So for this example, given estimates of the KSIR variates at the $n$ data points, $\{\hat v_i\}_{i=1}^{n} = \{\langle\hat\beta_1, \phi(x_i)\rangle\}_{i=1}^{n}$, the error of the first edr direction can be measured by the least squares fit of $\hat v$ with respect to $X_1 + X_2^2$:
$$\mathrm{error} = \min_{a, b\in\mathbb{R}}\ \frac{1}{n}\sum_{i=1}^{n}\left(\hat v_i - \left(a\left(x_{i1} + x_{i2}^2\right) + b\right)\right)^2. \qquad (40)$$
We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The means and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as $\theta$ increases, and that regularization helps to reduce this instability.
4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.
The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28\times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.
[Figure 2 appears here. Caption: Error in edr as a function of $\theta$, comparing KSIR and RKSIR.]
We compared regularized SIR (RSIR), as in (26), KSIR, and RKSIR on these data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 edr directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the edr directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set to the median pairwise distance between observations. The regularization parameter was set by cross-validation.
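The classification step after dimension reduction is plain kNN on the projected variates. A self-contained sketch of that final step under our own naming (not the authors' implementation):

```python
import numpy as np

def knn_predict(train_v, train_labels, test_v, k=5):
    # vote among the k nearest training variates in Euclidean distance
    d2 = ((test_v[:, None, :] - train_v[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = np.asarray(train_labels)[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])
```

Here `train_v` would hold the projections of the 1000 training images onto the top 10 edr directions and `test_v` the projections of the test images, with `k=5` as in the experiment.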
The means and standard deviations of the classification error over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for digits 2, 3, 5, and 8.
5. Discussion
The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods were developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
RKHS has also been introduced into supervised dimension reduction in [26], where the conditional covariance
Table 1: Means and standard deviations of error rates in classification of digits.

Digit     RKSIR             KSIR              RSIR              kNN
0         0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1         0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2         0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3         0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4         0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5         0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6         0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7         0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8         0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9         0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average   0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)
operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Their method therefore estimates linear edr subspaces, while in [10, 11] and in this paper the RKHS is used to model the edr subspace itself, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, of the regularization parameter, and of the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these questions are well understood for linear dimension reduction, little is known for nonlinear dimension reduction; we leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR differ from those for functional SIR.
Appendices
A. Proof of Theorem 3
Under the assumption of Proposition 2, for each $Y = y$,
$$\mathbb{E}\left[\phi(X)\mid Y = y\right]\in\mathrm{span}\left\{\Sigma\beta_i,\ i = 1,\dots,d\right\}. \qquad (A.1)$$
Since $\Gamma = \mathrm{cov}[\mathbb{E}(\phi(X)\mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma}\tau_i u_i\otimes u_i$.
Recall that for any $f\in\mathcal{H}$,
$$\Gamma f = \mathbb{E}\left[\langle\mathbb{E}[\phi(X)\mid Y], f\rangle\,\mathbb{E}[\phi(X)\mid Y]\right] \qquad (A.2)$$
also belongs to
$$\mathrm{span}\left\{\Sigma\beta_i,\ i = 1,\dots,d\right\}\subset\mathrm{range}(\Sigma) \qquad (A.3)$$
because of (A.1), so
$$u_i = \frac{1}{\tau_i}\Gamma u_i\in\mathrm{range}(\Sigma). \qquad (A.4)$$
This proves (i).
Since $\Gamma f\in\mathrm{range}(\Sigma)$ for each $f\in\mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined on the whole space. Moreover,
$$Tf = \Sigma^{-1}\left(\sum_{i=1}^{d_\Gamma}\langle u_i, f\rangle u_i\right) = \sum_{i=1}^{d_\Gamma}\langle u_i, f\rangle\,\Sigma^{-1}(u_i) = \left(\sum_{i=1}^{d_\Gamma}\Sigma^{-1}(u_i)\otimes u_i\right)f. \qquad (A.5)$$
This proves (ii).
B. Proof of Proposition 7
We first prove the proposition for matrices to simplify the notation; we then extend the result to operators for the case in which $d_K$ is infinite and a matrix form does not make sense.
Let $\Phi = [\phi(x_1), \dots, \phi(x_n)]$. It has the SVD
$$\Phi = \left[u_1\cdots u_{d_K}\right]\begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)}\\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)}\end{bmatrix}\begin{bmatrix} v_1^T\\ \vdots\\ v_n^T\end{bmatrix} = UDV^T, \qquad (B.1)$$
where $U = [u_1, \dots, u_d]$, $V = [v_1, \dots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d\le n$.
We need to show that the KSIR variates satisfy
$$v_j = c_j^T\left[k(x, x_1), \dots, k(x, x_n)\right]^T = c_j^T\Phi^T\phi(x) = \langle\Phi c_j, \phi(x)\rangle = \langle\beta_j, \phi(x)\rangle. \qquad (B.2)$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is a pair of eigenvalue and eigenvector of $(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c,\qquad c = VD^{-1}U^T\beta. \qquad (B.3)$$
Noting the facts $\hat\Sigma = (1/n)\Phi\Phi^T$, $\hat\Gamma = (1/n)\Phi J\Phi^T$, and $K = \Phi^T\Phi = VD^2V^T$, the argument may be made as follows:
$$\hat\Sigma\hat\Gamma\beta = \lambda\left(\hat\Sigma^2 + sI\right)\beta$$
$$\iff \Phi\Phi^T\Phi J\Phi^T\Phi c = \lambda\left(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2 s\,\Phi c\right)$$
$$\iff \Phi KJKc = \lambda\,\Phi\left(K^2 + n^2 sI\right)c$$
$$\left(\iff VV^T KJKc = \lambda\,VV^T\left(K^2 + n^2 sI\right)c\right)$$
$$\iff KJKc = \lambda\left(K^2 + n^2 sI\right)c. \qquad (B.4)$$
Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $VD^{-1}U^T$ and using the fact that $U^TU = I_d$. For the last step, since $V^TV = I_d$, we use the facts
$$VV^TK = VV^TVD^2V^T = VD^2V^T = K,\qquad VV^Tc = VV^TVD^{-1}U^T\beta = VD^{-1}U^T\beta = c. \qquad (B.5)$$
For this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n}v_i\phi(x_i)$ for $v = (v_1,\dots,v_n)^T\in\mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1), f\rangle_K, \dots, \langle\phi(x_n), f\rangle_K)^T$ for $f\in\mathcal{H}_K$. The notions $U$ and $U^T$ are similarly defined.
This formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\hat\Sigma$ as a covariance operator. Since the rank of $\hat\Sigma$ is less than $n$, it is compact and has the representation
$$\hat\Sigma = \sum_{i=1}^{d_K}\hat\sigma_i u_i\otimes u_i = \sum_{i=1}^{d}\hat\sigma_i u_i\otimes u_i, \qquad (B.6)$$
where $d\le n$ is the rank and $\hat\sigma_1\ge\hat\sigma_2\ge\cdots\ge\hat\sigma_d>\hat\sigma_{d+1}=\cdots=0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1,\dots,u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1,\dots,u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i\in\mathbb{R}^d$. Denote $\Upsilon = (\tau_1,\dots,\tau_n)^T\in\mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \mathrm{diag}(n\hat\sigma_1,\dots,n\hat\sigma_d)$. Let $D_{d\times d} = \mathrm{diag}(\sqrt{n\hat\sigma_1},\dots,\sqrt{n\hat\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = UDV^T$, which is well defined.
C. Proof of Consistency
C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert–Schmidt operators, of covariance operators for Hilbert-space-valued random variables, and of the perturbation theory of linear operators. In this subsection we provide a brief introduction; for details see [27–29] and the references therein.
Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal H}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert–Schmidt class if
$$\|L\|_{HS}^2 = \sum_{i=1}^{p_{\mathcal H}}\|Le_i\|_{\mathcal H}^2 < \infty, \qquad (C.1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert–Schmidt class forms a Hilbert space with norm $\|\cdot\|_{HS}$.
Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert–Schmidt class, and the following holds:
$$\|SL\|_{HS}\le\|S\|\,\|L\|_{HS},\qquad \|LS\|_{HS}\le\|L\|_{HS}\,\|S\|, \qquad (C.2)$$
where $\|\cdot\|$ denotes the operator norm
$$\|L\|^2 = \sup_{f\in\mathcal H}\frac{\|Lf\|^2}{\|f\|^2}. \qquad (C.3)$$
Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = \mathbb{E}\left[(Z - \mathbb{E}Z)\otimes(Z - \mathbb{E}Z)\right] \qquad (C.4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert–Schmidt class.
A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert–Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$, with the same rate of convergence as that of the operators.
C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\left\|\hat\Sigma - \Sigma\right\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right),\qquad \left\|\hat\Gamma - \Gamma\right\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right). \qquad (C.5)$$
To simplify the notation, we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define
$$T_1 = \left(\hat\Sigma^2 + sI\right)^{-1}\Sigma\Gamma,\qquad T_2 = \left(\Sigma^2 + sI\right)^{-1}\Sigma\Gamma. \qquad (C.6)$$
Then
$$\left\|\hat T_s - T\right\|_{HS}\le\left\|\hat T_s - T_1\right\|_{HS} + \left\|T_1 - T_2\right\|_{HS} + \left\|T_2 - T\right\|_{HS}. \qquad (C.7)$$
For the first term, observe that
$$\left\|\hat T_s - T_1\right\|_{HS}\le\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\right\|\,\left\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\right\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \qquad (C.8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma}\tau_j\left(\left(\hat\Sigma^2 + sI\right)^{-1}\Sigma u_j\right)\otimes u_j,\qquad T_2 = \sum_{j=1}^{d_\Gamma}\tau_j\left(\left(\Sigma^2 + sI\right)^{-1}\Sigma u_j\right)\otimes u_j. \qquad (C.9)$$
Therefore
$$\left\|T_1 - T_2\right\|_{HS}\le\sum_{j=1}^{d_\Gamma}\tau_j\left\|\left(\left(\hat\Sigma^2 + sI\right)^{-1} - \left(\Sigma^2 + sI\right)^{-1}\right)\Sigma u_j\right\|. \qquad (C.10)$$
Since $u_j\in\mathrm{range}(\Sigma)$, there exists $\ell_j$ such that $u_j = \Sigma\ell_j$. Then
$$\left(\left(\hat\Sigma^2 + sI\right)^{-1} - \left(\Sigma^2 + sI\right)^{-1}\right)\Sigma u_j = \left(\hat\Sigma^2 + sI\right)^{-1}\left(\Sigma^2 - \hat\Sigma^2\right)\left(\Sigma^2 + sI\right)^{-1}\Sigma^2\ell_j, \qquad (C.11)$$
which implies
$$\left\|T_1 - T_2\right\|_{HS}\le\sum_{j=1}^{d_\Gamma}\tau_j\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\right\|\,\left\|\Sigma^2 - \hat\Sigma^2\right\|_{HS}\,\left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma^2\right\|\,\left\|\ell_j\right\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \qquad (C.12)$$
For the third term, the following holds:
$$\left\|T_2 - T\right\|_{HS}^2 = \sum_{j=1}^{d_\Gamma}\tau_j^2\left\|\left(\left(\Sigma^2 + sI\right)^{-1}\Sigma - \Sigma^{-1}\right)u_j\right\|^2, \qquad (C.13)$$
and for each $j = 1,\dots,d_\Gamma$,
$$\left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma u_j - \Sigma^{-1}u_j\right\| = \left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma^2\ell_j - \ell_j\right\| = \left\|\sum_{i=1}^{\infty}\left(\frac{w_i^2}{s + w_i^2} - 1\right)\langle\ell_j, e_i\rangle e_i\right\| = \left(\sum_{i=1}^{\infty}\frac{s^2}{\left(s + w_i^2\right)^2}\langle\ell_j, e_i\rangle^2\right)^{1/2} \le \frac{s}{w_N^2}\left(\sum_{i=1}^{N}\langle\ell_j, e_i\rangle^2\right)^{1/2} + \left(\sum_{i=N+1}^{\infty}\langle\ell_j, e_i\rangle^2\right)^{1/2} = \frac{s}{w_N^2}\left\|\Pi_N(\ell_j)\right\| + \left\|\Pi_N^\perp(\ell_j)\right\|. \qquad (C.14)$$
Combining these three bounds (and absorbing the constants $\tau_j$) results in (33).
Since $\|\Pi_N^\perp(\ell_j)\|\to 0$ as $N\to\infty$, we consequently have
$$\left\|\hat T_s - T\right\|_{HS} = o_p(1) \qquad (C.15)$$
if $s\to 0$ and $s\sqrt{n}\to\infty$.
If all the edr directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $S^\ast = \mathrm{span}\{\Sigma e_i,\ i = 1,\dots,N\}$. This implies that
$$\ell_j = \Sigma^{-1}u_j\in\Sigma^{-1}(S^\ast)\subset\mathrm{span}\left\{e_i,\ i = 1,\dots,N\right\}. \qquad (C.16)$$
Therefore $\Pi_N^\perp(\ell_j) = 0$. Letting $s = O(n^{-1/4})$ balances the two remaining terms, and the rate is $O_p(n^{-1/4})$.
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
References

[1] K. C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. D. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferré and A. F. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferré and A. F. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.
variables $X$. We call $\{\beta_j\}_{j=1}^{d}$ the nonlinear edr directions and $\mathcal{S} = \operatorname{span}(\beta_1, \ldots, \beta_d)$ the nonlinear edr space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.
Proposition 2. Assume the following linear design condition for $\mathcal{H}$: for any $f \in \mathcal{H}$ there exists a vector $b \in \mathbb{R}^d$ such that
\[
\mathbb{E}\left[\langle f, \phi(X)\rangle \mid S(X)\right] = b^T S(X), \quad \text{with } S(X) = \left(\langle\beta_1, \phi(X)\rangle, \ldots, \langle\beta_d, \phi(X)\rangle\right)^T. \tag{11}
\]
Then, for the model specified in (9), the inverse regression curve $\mathbb{E}[\phi(X) \mid Y]$ is contained in the span of $(\Sigma\beta_1, \ldots, \Sigma\beta_d)$, where $\Sigma$ is the covariance operator of $\phi(X)$.
Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that, even if the linear design condition is violated, the bias of the SIR variates is usually not large.
An immediate consequence of Proposition 2 is that the nonlinear edr directions are the eigenvectors corresponding to the largest eigenvalues of the generalized eigendecomposition problem
\[
\Gamma\beta = \lambda\Sigma\beta, \quad \text{where } \Sigma = \operatorname{cov}[\phi(X)], \; \Gamma = \operatorname{cov}\left[\mathbb{E}(\phi(X) \mid Y)\right], \tag{12}
\]
or, equivalently, from an eigenanalysis of the operator $T = \Sigma^{-1}\Gamma$. In the infinite dimensional case a technical difficulty arises, since the operator
\[
\Sigma^{-1} = \sum_{i=1}^{\infty} w_i^{-1}\, e_i \otimes e_i \tag{13}
\]
is not defined on the entire Hilbert space $\mathcal{H}$. So, for the operator $T$ to be well defined, we need to show that the range of $\Gamma$ is indeed contained in the range of $\Sigma$. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17], where extra conditions are needed for operators like $T$ to be well defined. In KSIR this issue is resolved automatically by the linear design condition, and no extra conditions are required, as stated by the following theorem; see Appendix A for the proof.
Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold.

(i) The operator $\Gamma$ is of finite rank $d_\Gamma \le d$. Consequently it is compact and has the spectral decomposition
\[
\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i, \tag{14}
\]
where $\tau_i$ and $u_i$ are the eigenvalues and eigenvectors, respectively. Moreover, $u_i \in \operatorname{range}(\Sigma)$ for all $i = 1, \ldots, d_\Gamma$.

(ii) The eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator $T$, which takes the form
\[
T = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes \Sigma^{-1}(u_i). \tag{15}
\]
3. Regularized Kernel Sliced Inverse Regression

The discussion in Section 2 implies that the nonlinear edr directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.
3.1. Estimating the Nonlinear edr Directions. Given $n$ observations $(x_1, y_1), \ldots, (x_n, y_n)$, our objective is to obtain an estimate of the edr directions $(\beta_1, \ldots, \beta_d)$. We first formulate a procedure almost identical to the standard SIR procedure, except that it operates in the feature space $\mathcal{H}$. This highlights the immediate relation between the SIR and KSIR procedures.

(1) Without loss of generality, we assume the mapped predictor variables are mean zero, that is, $\sum_{i=1}^{n}\phi(x_i) = 0$; otherwise we can subtract $\bar\phi = (1/n)\sum_{i=1}^{n}\phi(x_i)$ from each $\phi(x_i)$. The sample covariance is estimated by
\[
\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i). \tag{16}
\]

(2) Bin the $Y$ variables into $H$ slices $G_1, \ldots, G_H$ and compute the mean vector of the mapped predictor variables for each slice:
\[
\psi_h = \frac{1}{n_h}\sum_{i \in G_h} \phi(x_i), \quad h = 1, \ldots, H. \tag{17}
\]
Compute the sample between-group covariance matrix
\[
\hat\Gamma = \sum_{h=1}^{H} \frac{n_h}{n}\, \psi_h \otimes \psi_h. \tag{18}
\]

(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem
\[
\hat\Gamma\beta = \lambda\hat\Sigma\beta. \tag{19}
\]
This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the edr directions themselves but the projections onto these directions, that is, the $d$ summary statistics
\[
v_1 = \langle\beta_1, \phi(x)\rangle, \ldots, v_d = \langle\beta_d, \phi(x)\rangle, \tag{20}
\]
which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from the kernel $k$ alone, not the map $\phi$.

The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot,\cdot)$, where
\[
K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = k(x_i, x_j). \tag{21}
\]
Note that the rank of $K$ is less than $n$, so $K$ is always singular.

Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:
\[
KJKc = \lambda K^2 c, \tag{22}
\]
where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes an $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i$ and $j$ are both in the $m$th group, consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \ldots, v_d)$.

Proposition 4. Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, let $(\beta_1, \ldots, \beta_d)$ and $(c_1, \ldots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then, for any $x \in \mathcal{X}$ and $j = 1, \ldots, d$, the following holds:
\[
v_j = \langle\beta_j, \phi(x)\rangle = c_j^T K_x, \quad K_x = \left(k(x, x_1), \ldots, k(x, x_n)\right)^T, \tag{23}
\]
provided that $\hat\Sigma$ is invertible. When $\hat\Sigma$ is not invertible, the conclusion holds modulo the null space of $\hat\Sigma$.

This result was proven in [10], in which the algorithm was further reduced to solving
\[
JKc = \lambda Kc \tag{24}
\]
by canceling $K$ from both sides of (22).
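As a concrete illustration of the quantities entering (22) and (24), the sketch below (Python/NumPy; ours, not part of the original paper) assembles a centered Gram matrix $K$ and the slice-indicator matrix $J$. The Gaussian kernel and the equal-frequency slicing of the sorted response are illustrative choices, not mandated by the text.

```python
import numpy as np

def centered_gram(X, sigma=1.0):
    """Gaussian Gram matrix K_ij = k(x_i, x_j), centered in feature space:
    K <- H K H with H = I - (1/n) 11^T, which corresponds to subtracting
    the mean feature vector phi_bar from each phi(x_i) without forming phi."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def slice_matrix(y, n_slices):
    """J_ij = 1/n_m if observations i and j fall in the same slice G_m
    (slices formed here from the sorted response), and 0 otherwise."""
    n = len(y)
    order = np.argsort(y)
    J = np.zeros((n, n))
    for g in np.array_split(order, n_slices):
        J[np.ix_(g, g)] = 1.0 / len(g)
    return J
```

Centering via $H K H$ implements step (1) of the procedure implicitly, which is why $K$ in (21)–(24) is the centered Gram matrix.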
Remark 5. It is important to note that, when $\hat\Sigma$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\hat\Sigma$, that is, under the requirement of the representer theorem that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).
Remark 6. It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta_\star$, with $\beta_0$ belonging to the null space and $\beta_\star$ orthogonal to the null space. Then
\[
\mathbb{E}\left[\langle\beta, \phi(x)\rangle^2\right] = \langle\beta, \Sigma\beta\rangle = \langle\beta_\star, \Sigma\beta_\star\rangle = \mathbb{E}\left[\langle\beta_\star, \phi(x)\rangle^2\right], \tag{25}
\]
that is, $\langle\beta, \phi(x)\rangle = \langle\beta_\star, \phi(x)\rangle$ (a.s.). This means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta_\star$ is its component orthogonal to the null space of $\hat\Sigma$, the identity $\langle\beta, \phi(x)\rangle = \langle\beta_\star, \phi(x)\rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.
3.2. Regularization and Stability. Beyond the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the edr space. This can be addressed either by thresholding eigenvalues of the estimated covariance matrix $\hat\Sigma$ or by adding a regularization term to (22) or (24).
We consider two types of regularization schemes. The first is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem
\[
\Gamma\beta = \lambda\left(\Sigma + sI\right)\beta. \tag{26}
\]
Here and in the sequel, $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as
\[
JKc = \lambda\left(K + nsI\right)c. \tag{27}
\]
Another type of regularization is to regularize (22) directly:
\[
KJKc = \lambda\left(K^2 + n^2 sI\right)c. \tag{28}
\]
The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of
\[
\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma. \tag{29}
\]

Proposition 7. Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:
\[
v_j = c_j^T\left(k(x, x_1), \ldots, k(x, x_n)\right)^T = \langle\beta_j, \phi(x)\rangle. \tag{30}
\]
This algorithm is termed the Tikhonov regularization. For linear SIR, it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.
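A minimal sketch of the Tikhonov-regularized problem (28) follows (ours, for illustration; the function names are not from the paper). Since $KJK$ is symmetric and $K^2 + n^2 s I$ is symmetric positive definite, the generalized symmetric eigensolver `scipy.linalg.eigh` applies directly.

```python
import numpy as np
from scipy.linalg import eigh

def rksir_tikhonov(K, J, s, d):
    """Solve K J K c = lambda (K^2 + n^2 s I) c, eq. (28).
    Both sides are symmetric and the right side is positive definite,
    so a generalized symmetric-definite eigensolver applies.
    Returns the top-d generalized eigenvectors c_1, ..., c_d as columns."""
    n = K.shape[0]
    A = K @ J @ K
    B = K @ K + (n ** 2) * s * np.eye(n)
    w, C = eigh(A, B)             # eigenvalues in ascending order
    return C[:, ::-1][:, :d]      # keep the top-d eigenvectors

def ksir_variates(C, K_new):
    """Regularized KSIR variates of eq. (30): v_j(x) = c_j^T (k(x, x_1), ..., k(x, x_n))^T.
    K_new has one row (k(x, x_1), ..., k(x, x_n)) per new point x."""
    return K_new @ C
```

`eigh` returns eigenvectors that are orthonormal with respect to the right-hand matrix, which is a convenient normalization for the variates.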
Besides computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable through a justifiable representer theorem.
Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \ldots, n$.

The conclusion follows from the observation that $\beta_j = (1/(s\lambda_j))(\hat\Gamma\beta_j - \lambda_j\hat\Sigma\beta_j)$ for the ridge regularization and $\beta_j = (1/(s\lambda_j))(\hat\Sigma\hat\Gamma\beta_j - \lambda_j\hat\Sigma^2\beta_j)$ for the Tikhonov regularization; in both cases the right-hand side lies in the span of $\phi(x_1), \ldots, \phi(x_n)$.
To close, we remark that KSIR is computationally advantageous even in the case of linear models when $p \gg n$, because the eigendecomposition is for $n \times n$ matrices rather than the $p \times p$ matrices of the standard SIR formulation.
3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the edr directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the edr directions depends on the contribution of the small principal components; the rate can be arbitrarily slow if the edr space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.

Note that various consistency results are available for linear SIR [22–24]. These results hold only in the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results for finite rank operators; their proof of consistency requires stronger and more complicated conditions than ours. Consistency for functional SIR with ridge regularization is proven in [19], but in a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.
In the following we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.
Theorem 9. Assume $\mathbb{E}[k(X,X)^2] < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt{n} = \infty$. Then
\[
\left|\langle\hat\beta_j, \phi(\cdot)\rangle - \langle\beta_j, \phi(\cdot)\rangle\right| = o_p(1), \quad j = 1, \ldots, d_\Gamma, \tag{31}
\]
where $d_\Gamma$ is the rank of $\Gamma$, $\langle\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction, and $\langle\hat\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction as estimated by regularized KSIR.

If the edr directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O_p(n^{-1/4})$.
This theorem is a direct corollary of the following theorem, which is proven in Appendix C.
Theorem 10. Define the projection operator and its complement for each $N \ge 1$:
\[
\Pi_N = \sum_{i=1}^{N} e_i \otimes e_i, \quad \Pi_N^{\perp} = I - \Pi_N = \sum_{i=N+1}^{\infty} e_i \otimes e_i, \tag{32}
\]
where $\{e_i\}_{i=1}^{\infty}$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}[k(X,X)^2] < \infty$. For each $N \ge 1$ the following holds:
\[
\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma - T\right\|_{HS}
= O_p\!\left(\frac{1}{s\sqrt{n}}\right) + \sum_{j=1}^{d_\Gamma}\left(\frac{s}{w_N^2}\left\|\Pi_N(\mathfrak{b}_j)\right\| + \left\|\Pi_N^{\perp}(\mathfrak{b}_j)\right\|\right), \tag{33}
\]
where $\mathfrak{b}_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{HS}$ denotes the Hilbert–Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt{n} \to \infty$ as $n \to \infty$, then
\[
\left\|\left(\hat\Sigma^2 + sI\right)^{-1}\hat\Sigma\hat\Gamma - T\right\|_{HS} = o_p(1). \tag{34}
\]
4. Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve the prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping $\phi$ for an RKHS is not available, we do not know the true edr directions or subspace, so in that case we use the prediction accuracy to evaluate the goodness of RKSIR.
4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization in RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$, each following a standard normal distribution $X_i \sim N(0, 1)$. A univariate response is given as
\[
Y = \left(\sin(X_1) + \sin(X_2)\right)\left(1 + \sin(X_3)\right) + \varepsilon, \tag{35}
\]
with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to
[Figure 1: Dimension reduction for model (35). (a) Mean square error of RKSIR versus the regularization parameter ($\log_{10}(s)$), for $d = 1$ and $d = 2$. (b) Mean square error versus the number of edr directions for RKSIR, SIR, SAVE, and PHD; the blue line represents the mean square error without dimension reduction. (c) KSIR variates for the first nonlinear edr direction versus $\sin(X_1) + \sin(X_2)$. (d) KSIR variates for the second nonlinear edr direction versus $1 + \sin(X_3)$.]
compute the edr directions. In RKSIR we used the additive Gaussian kernel
\[
k(x, z) = \sum_{j=1}^{d} \exp\left(-\frac{(x_j - z_j)^2}{2\sigma^2}\right), \tag{36}
\]
where $\sigma = 2$. After projecting the training samples onto the estimated edr directions, we train a Gaussian kernel regression model on the new variates. The mean square error is then computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.
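The additive Gaussian kernel (36) and the simulation model (35) are straightforward to reproduce. The following sketch (ours, for illustration; the random seed and function names are our choices) generates the data and evaluates the kernel on all pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_gaussian_kernel(X, Z, sigma=2.0):
    """Additive Gaussian kernel of eq. (36):
    k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2)),
    evaluated for all pairs of rows of X and Z."""
    diff = X[:, None, :] - Z[None, :, :]          # shape (n, m, p)
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).sum(axis=-1)

def simulate_model(n, noise=0.1):
    """Draw (X, Y) from model (35): ten i.i.d. N(0,1) predictors and
    Y = (sin X1 + sin X2)(1 + sin X3) + eps, eps ~ N(0, 0.1^2)."""
    X = rng.standard_normal((n, 10))
    Y = (np.sin(X[:, 0]) + np.sin(X[:, 1])) * (1.0 + np.sin(X[:, 2])) \
        + noise * rng.standard_normal(n)
    return X, Y
```

Note that, with $p$ input coordinates, the diagonal of the kernel matrix equals $p$ (each summand is $\exp(0) = 1$), so this kernel is bounded and the moment condition $\mathbb{E}[k(X,X)^2] < \infty$ of Theorems 9 and 10 holds trivially.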
Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter; KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear edr directions, an observation that agrees with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two edr directions from RKSIR estimate the two nonlinear factors well.
4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.
The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by
\[
Y = X_1 + X_2^2 + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \tag{37}
\]
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \operatorname{diag}(1^\theta, 2^\theta, \ldots, 10^\theta)$. Increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct edr directions.

For this model it is known that SIR will miss the direction along the second variable $X_2$, so we focus on the comparison of KSIR and RKSIR in this example.

If we use the second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space
\[
\Phi(X) = \left(1,\ X_i,\ (X_i X_j)_{i \le j}\right), \quad i, j = 1, \ldots, 10, \tag{38}
\]
then $X_1 + X_2^2$ can be captured by one edr direction. Ideally, the first KSIR variate $v = \langle\beta_1, \phi(X)\rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
\[
v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}\left(X_1 + X_2^2\right). \tag{39}
\]
So, for this example, given the estimated KSIR variates at the $n$ data points, $\{\hat v_i\}_{i=1}^{n} = \{\langle\hat\beta_1, \phi(x_i)\rangle\}_{i=1}^{n}$, the error of the first edr direction can be measured by the least squares fit of $\hat v$ with respect to $X_1 + X_2^2$:
\[
\text{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n}\left(\hat v_i - \left(a\left(x_{i1} + x_{i2}^2\right) + b\right)\right)^2. \tag{40}
\]
We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The means and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as $\theta$ increases and that regularization helps to reduce this instability.
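The error criterion (40) is an ordinary least squares fit of the first estimated variate against $X_1 + X_2^2$; it can be sketched as follows (ours, for illustration).

```python
import numpy as np

def edr_error(v_hat, X):
    """Criterion of eq. (40): fit v_hat_i ~ a * (x_i1 + x_i2^2) + b by
    least squares and return the mean squared residual."""
    target = X[:, 0] + X[:, 1] ** 2
    A = np.column_stack([target, np.ones(len(target))])   # design for (a, b)
    coef, *_ = np.linalg.lstsq(A, v_hat, rcond=None)
    resid = v_hat - A @ coef
    return np.mean(resid ** 2)
```

A variate that is exactly an affine transform of $X_1 + X_2^2$, as in (39), yields an error of zero, so the criterion is invariant to the shift and scale left unidentified by the model.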
4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR, treating each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits 0, 1, 2, …, 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.
[Figure 2: Error in edr as a function of $\theta$, for KSIR and RKSIR.]
We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on these data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 edr directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the edr directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
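Two ingredients of this experiment are easy to make concrete: the median-pairwise-distance bandwidth heuristic and the kNN classifier applied to the projected variates. The sketch below is our illustration, not the authors' code.

```python
import numpy as np

def median_bandwidth(X):
    """Median pairwise distance between observations, used here as the
    Gaussian kernel bandwidth (the heuristic described in the text)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))

def knn_predict(train_Z, train_y, test_Z, k=5):
    """Plain k-nearest-neighbor classification (majority vote) in the
    space of projected variates Z; labels must be nonnegative integers."""
    d2 = np.sum((test_Z[:, None, :] - train_Z[None, :, :]) ** 2, axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[idx]
    return np.array([np.bincount(v).argmax() for v in votes])
```

In the experiment, `train_Z` and `test_Z` would be the 10-dimensional edr projections of the training and test images; the baseline runs the same classifier on the raw 784-dimensional pixels.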
The means and standard deviations of the error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further; the improvement is dramatic for digits 2, 3, 5, and 8.
5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework; therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
RKHS has also been introduced into supervised dimen-sion reduction in [26] where the conditional covariance
Table 1: Means and standard deviations of error rates in the classification of digits.

Digit     RKSIR             KSIR              RSIR              kNN
0         0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1         0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2         0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3         0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4         0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5         0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6         0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7         0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8         0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9         0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average   0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)
operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear edr subspaces, while in [10, 11] and in this paper the RKHS is used to model the edr subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction; we leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR differ from those for functional SIR.
Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,
\[
\mathbb{E}\left[\phi(X) \mid Y = y\right] \in \operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\}. \tag{A.1}
\]
Since $\Gamma = \operatorname{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i u_i \otimes u_i$.

Recall that, for any $f \in \mathcal{H}$,
\[
\Gamma f = \mathbb{E}\left[\langle\mathbb{E}[\phi(X) \mid Y], f\rangle\, \mathbb{E}[\phi(X) \mid Y]\right] \tag{A.2}
\]
also belongs to
\[
\operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\} \subset \operatorname{range}(\Sigma) \tag{A.3}
\]
because of (A.1), so
\[
u_i = \frac{1}{\tau_i}\Gamma u_i \in \operatorname{range}(\Sigma). \tag{A.4}
\]
This proves (i).

Since $\Gamma f \in \operatorname{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,
\[
Tf = \Sigma^{-1}\left(\sum_{i=1}^{d_\Gamma} \tau_i\langle u_i, f\rangle u_i\right)
= \sum_{i=1}^{d_\Gamma} \tau_i\langle u_i, f\rangle\,\Sigma^{-1}(u_i)
= \left(\sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes \Sigma^{-1}(u_i)\right)f. \tag{A.5}
\]
This proves (ii).
B Proof of Proposition 7
We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.

Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the SVD decomposition

$$\Phi = \bar U \bar D \bar V^T = [u_1 \cdots u_{d_K}] \begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix} = U D V^T, \tag{B1}$$

where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.
Abstract and Applied Analysis 9
We need to show that the KSIR variates satisfy

$$v_j = c_j^T\,[k(x, x_1), \ldots, k(x, x_n)]^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x)\rangle = \langle \beta_j, \phi(x)\rangle. \tag{B2}$$

It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is a pair of eigenvalue and eigenvector of $(\Sigma^2 + sI)^{-1}\Sigma\Gamma$, and vice versa, where $c$ and $\beta$ are related by

$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \tag{B3}$$

Noting the facts $\Sigma = (1/n)\Phi\Phi^T$, $\Gamma = (1/n)\Phi J \Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:

$$\begin{aligned}
\Sigma\Gamma\beta = \lambda(\Sigma^2 + sI)\beta
&\Longleftrightarrow \Phi\Phi^T\Phi J \Phi^T \Phi c = \lambda\big(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2 s\,\Phi c\big) \\
&\Longleftrightarrow \Phi K J K c = \lambda\,\Phi(K^2 + n^2 s I) c \\
&\ \big(\Longleftrightarrow V V^T K J K c = \lambda\, V V^T (K^2 + n^2 s I) c\big) \\
&\Longleftrightarrow K J K c = \lambda (K^2 + n^2 s I) c.
\end{aligned} \tag{B4}$$

Note that the implication in the third step is necessary only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:

$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \tag{B5}$$
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. For the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^n v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1), f\rangle_K, \ldots, \langle\phi(x_n), f\rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are similarly defined.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is at most $n$, it is compact and has the representation

$$\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \tag{B6}$$

where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\operatorname{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \operatorname{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \operatorname{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^T$, which is well defined.
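In a finite-dimensional feature space, the equivalence proved above in matrix form can be checked numerically. The sketch below is our own illustration, not part of the paper: it builds an explicit quadratic feature map, solves the kernel-side problem (28) and the feature-side operator (29), and confirms that the leading eigenvalues agree and that the variates $Kc$ and $\Phi^T\beta$ coincide up to sign and scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.standard_normal((n, p))
y = (X[:, 0] > 0).astype(int)                      # two slices

def phi(x):
    # explicit feature map of the quadratic kernel k(x, z) = (1 + x.z)^2
    quad = np.array([(np.sqrt(2.0) if i < j else 1.0) * x[i] * x[j]
                     for i in range(p) for j in range(i, p)])
    return np.concatenate([[1.0], np.sqrt(2.0) * x, quad])

Phi = np.stack([phi(x) for x in X], axis=1)        # d_K x n feature matrix
Phi = Phi - Phi.mean(axis=1, keepdims=True)        # center, so Sigma = Phi Phi^T / n

K = Phi.T @ Phi                                     # centered Gram matrix
J = np.zeros((n, n))                                # slice matrix of (22)
for m in np.unique(y):
    idx = np.where(y == m)[0]
    J[np.ix_(idx, idx)] = 1.0 / len(idx)

s = 0.1
# kernel-side problem (28): K J K c = lam (K^2 + n^2 s I) c
lam, C = np.linalg.eig(np.linalg.solve(K @ K + n**2 * s * np.eye(n), K @ J @ K))
c1 = C[:, np.argmax(lam.real)].real
lam1 = lam.real.max()

# feature-side operator (29): (Sigma^2 + s I)^{-1} Sigma Gamma
Sigma = Phi @ Phi.T / n
Gamma = Phi @ J @ Phi.T / n
d_K = Sigma.shape[0]
mu, B = np.linalg.eig(np.linalg.solve(Sigma @ Sigma + s * np.eye(d_K), Sigma @ Gamma))
b1 = B[:, np.argmax(mu.real)].real
mu1 = mu.real.max()

# the top eigenvalues agree, and K c is proportional to Phi^T beta
v_kernel, v_feature = K @ c1, Phi.T @ b1
corr = abs(np.corrcoef(v_kernel, v_feature)[0, 1])
```

Spurious eigenvectors of the kernel problem lie in the null space of $\Phi$ and carry eigenvalue zero, so only the nonzero part of the spectrum is compared.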
C Proof of Consistency
C1 Preliminaries In order to prove Theorems 9 and 10, we use the properties of Hilbert-Schmidt operators, covariance operators for Hilbert-space-valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details see [27-29] and the references therein.
Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal H}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if

$$\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal H}} \|L e_i\|_{\mathcal H}^2 < \infty, \tag{C1}$$

where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$.

Given a bounded operator $S$ on $\mathcal H$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:

$$\|SL\|_{\mathrm{HS}} \le \|S\|\, \|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\, \|S\|, \tag{C2}$$

where $\|\cdot\|$ denotes the usual operator norm

$$\|L\|^2 = \sup_{f \in \mathcal H} \frac{\|Lf\|^2}{\|f\|^2}. \tag{C3}$$
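In finite dimensions the Hilbert-Schmidt norm is the Frobenius norm, so the inequalities (C2) are easy to check numerically; the following sketch (our own, for illustration only) does so with random matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal((6, 6))
S = rng.standard_normal((6, 6))

hs = lambda A: np.linalg.norm(A, 'fro')   # Hilbert-Schmidt norm = Frobenius norm
op = lambda A: np.linalg.norm(A, 2)       # operator (spectral) norm

# the inequalities in (C2)
assert hs(S @ L) <= op(S) * hs(L) + 1e-12
assert hs(L @ S) <= hs(L) * op(S) + 1e-12
```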
Let $Z$ be a random vector taking values in $\mathcal H$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator

$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)] \tag{C4}$$

is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
C2 Proof of Theorem 10 We will use the following result from [14]:

$$\big\|\hat\Sigma - \Sigma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt n}\Big), \qquad \big\|\hat\Gamma - \Gamma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt n}\Big). \tag{C5}$$

To simplify the notation, we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define

$$T_1 = (\hat\Sigma^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \tag{C6}$$

Then

$$\big\|\hat T_s - T\big\|_{\mathrm{HS}} \le \big\|\hat T_s - T_1\big\|_{\mathrm{HS}} + \big\|T_1 - T_2\big\|_{\mathrm{HS}} + \big\|T_2 - T\big\|_{\mathrm{HS}}. \tag{C7}$$
For the first term, observe that

$$\big\|\hat T_s - T_1\big\|_{\mathrm{HS}} \le \big\|(\hat\Sigma^2 + sI)^{-1}\big\|\, \big\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{s\sqrt n}\Big). \tag{C8}$$

For the second term, note that

$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\hat\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j. \tag{C9}$$
Therefore,

$$\big\|T_1 - T_2\big\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \Big\|\big((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j\Big\|. \tag{C10}$$

Since $u_j \in \operatorname{range}(\Sigma)$, there exists $\xi_j$ such that $u_j = \Sigma\xi_j$. Then

$$\big((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j = (\hat\Sigma^2 + sI)^{-1}(\Sigma^2 - \hat\Sigma^2)(\Sigma^2 + sI)^{-1}\Sigma^2 \xi_j, \tag{C11}$$

which implies

$$\big\|T_1 - T_2\big\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|(\hat\Sigma^2 + sI)^{-1}\big\|\, \big\|\hat\Sigma^2 - \Sigma^2\big\|_{\mathrm{HS}}\, \big\|(\Sigma^2 + sI)^{-1}\Sigma^2\big\|\, \|\xi_j\| = O_p\Big(\frac{1}{s\sqrt n}\Big). \tag{C12}$$
For the third term, the following holds:

$$\big\|T_2 - T\big\|^2_{\mathrm{HS}} = \sum_{j=1}^{d_\Gamma} \tau_j^2 \Big\|\big((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\big)u_j\Big\|^2, \tag{C13}$$

and for each $j = 1, \ldots, d_\Gamma$,

$$\begin{aligned}
\Big\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\Big\|
&\le \Big\|(\Sigma^2 + sI)^{-1}\Sigma^2 \xi_j - \xi_j\Big\| \\
&= \Bigg\|\sum_{i=1}^\infty \Big(\frac{w_i^2}{s + w_i^2} - 1\Big)\langle\xi_j, e_i\rangle e_i\Bigg\| \\
&= \Bigg(\sum_{i=1}^\infty \frac{s^2}{(s + w_i^2)^2}\langle\xi_j, e_i\rangle^2\Bigg)^{1/2} \\
&\le \frac{s}{w_N^2}\Bigg(\sum_{i=1}^N \langle\xi_j, e_i\rangle^2\Bigg)^{1/2} + \Bigg(\sum_{i=N+1}^\infty \langle\xi_j, e_i\rangle^2\Bigg)^{1/2} \\
&= \frac{s}{w_N^2}\big\|\Pi_N(\xi_j)\big\| + \big\|\Pi_N^\perp(\xi_j)\big\|. 
\end{aligned} \tag{C14}$$
Combining these three terms results in (33).

Since $\|\Pi_N^\perp(\xi_j)\| \to 0$ as $N \to \infty$, we consequently have

$$\big\|\hat T_s - T\big\|_{\mathrm{HS}} = o_p(1) \tag{C15}$$

if $s \to 0$ and $s\sqrt n \to \infty$.

If all the edr directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $S^* = \operatorname{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$. This implies that

$$\xi_j = \Sigma^{-1}u_j \in \Sigma^{-1}(S^*) \subset \operatorname{span}\{e_i,\ i = 1, \ldots, N\}. \tag{C16}$$

Therefore $\Pi_N^\perp(\xi_j) = 0$. Letting $s = O(n^{-1/4})$, the rate is $O_p(n^{-1/4})$.
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
References
[1] K.-C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316-342, 1991.
[2] N. Duan and K.-C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505-530, 1991.
[3] R. D. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328-332, 1991.
[4] K.-C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025-1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615-625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583-588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1-48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590-610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590-1603, 2009.
[12] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441-458, pp. 415-446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Basel, Switzerland, 1986.
[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475-488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867-889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54-77, 2003.
[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665-683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169-4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807-823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124-131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85-98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040-1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141-2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727-736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628-635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871-1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259-294, 2007.
(3) Estimate the SIR directions $\beta_j$ by solving the generalized eigendecomposition problem

$$\Gamma\beta = \lambda\Sigma\beta. \tag{19}$$

This procedure is computationally infeasible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the edr directions themselves but the projections onto these directions, that is, the $d$ summary statistics

$$v_1 = \langle\beta_1, \phi(x)\rangle, \ \ldots, \ v_d = \langle\beta_d, \phi(x)\rangle, \tag{20}$$

which we call the KSIR variates. As in other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from the kernel $k$ alone, without the feature map $\phi$.
The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k(\cdot,\cdot)$, where

$$K_{ij} = \langle\phi(x_i), \phi(x_j)\rangle = k(x_i, x_j). \tag{21}$$
Note that the rank of $K$ is less than $n$, so $K$ is always singular.

Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates:

$$K J K c = \lambda K^2 c, \tag{22}$$

where $c$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes the $n \times n$ matrix with $J_{ij} = 1/n_m$ if $i$ and $j$ are in the $m$th group, consisting of $n_m$ observations, and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems (22) and (19) in the recovery of the KSIR variates $(v_1, \ldots, v_d)$.
Proposition 4 Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, let $(\beta_1, \ldots, \beta_d)$ and $(c_1, \ldots, c_d)$ denote the generalized eigenvectors of (19) and (22), respectively. Then, for any $x \in \mathcal X$ and $j = 1, \ldots, d$, the following holds:

$$v_j = \langle\beta_j, \phi(x)\rangle = c_j^T K_x, \qquad K_x = \big(k(x, x_1), \ldots, k(x, x_n)\big)^T, \tag{23}$$

provided that $\Sigma$ is invertible. When $\Sigma$ is not invertible, the conclusion holds modulo the null space of $\Sigma$.
This result was proven in [10], in which the algorithm was further reduced to solving

$$J K c = \lambda K c \tag{24}$$

by canceling $K$ from both sides of (22).
Remark 5 It is important to remark that, when $\Sigma$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\Sigma$; that is, it requires the representer theorem, namely that the $\beta_j$ are linear combinations of the $\phi(x_i)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).
Remark 6 It is necessary to clarify the difference between (12) and (19). In (12) it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, let $\beta = \beta_0 + \beta^\star$, with $\beta_0$ belonging to the null space and $\beta^\star$ orthogonal to the null space. Then

$$\mathbb{E}\big[\langle\beta, \phi(x)\rangle^2\big] = \langle\beta, \Sigma\beta\rangle = \langle\beta^\star, \Sigma\beta^\star\rangle = \mathbb{E}\big[\langle\beta^\star, \phi(x)\rangle^2\big]; \tag{25}$$

that is, $\langle\beta, \phi(x)\rangle = \langle\beta^\star, \phi(x)\rangle$ (a.s.). It means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to $0$. However, in (19), if $\beta$ is an eigenvector and $\beta^\star$ is its component orthogonal to the null space of $\Sigma$, the identity $\langle\beta, \phi(x)\rangle = \langle\beta^\star, \phi(x)\rangle$ is in general not true for a new point $x$ different from the observations. Thus, from a theoretical perspective, it is not natural to assume the representer theorem, although it works well in practice. In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.
3.2 Regularization and Stability Beyond the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill conditioned, resulting in overfitting as well as numerically unstable estimates of the edr space. This can be addressed either by thresholding eigenvalues of the estimated covariance matrix $\hat\Sigma$ or by adding a regularization term to (22) or (24).

We motivate two types of regularization schemes. The first one is the traditional ridge regularization. It is used in both linear SIR and functional SIR [18-20] and solves the eigendecomposition problem

$$\Gamma\beta = \lambda(\Sigma + sI)\beta. \tag{26}$$

Here and in the sequel, $I$ denotes the identity operator and $s$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as

$$J K c = \lambda(K + nsI)c. \tag{27}$$
Another type of regularization is to regularize (22) directly:

$$K J K c = \lambda(K^2 + n^2 sI)c. \tag{28}$$
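The pencil in (28) pairs a positive semidefinite matrix with a positive definite one, so it can be reduced to an ordinary symmetric eigenproblem. The sketch below is our own minimal implementation (the function name and the Cholesky reduction are our choices, not the paper's):

```python
import numpy as np

def rksir_directions(K, J, s, d):
    """Tikhonov-regularized KSIR per (28): K J K c = lam (K^2 + n^2 s I) c."""
    n = K.shape[0]
    A = K @ J @ K                              # symmetric positive semidefinite
    B = K @ K + (n ** 2) * s * np.eye(n)       # symmetric positive definite
    # reduce the symmetric-definite pencil (A, B) to an ordinary eigenproblem
    R = np.linalg.cholesky(B)                  # B = R R^T
    Rinv = np.linalg.inv(R)
    lam, W = np.linalg.eigh(Rinv @ A @ Rinv.T)
    C = Rinv.T @ W                             # generalized eigenvectors of (A, B)
    order = np.argsort(-lam)[:d]
    return lam[order], C[:, order]

# toy usage with a centered Gaussian Gram matrix and two slices
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = (X[:, 0] > 0).astype(int)
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
H = np.eye(30) - np.ones((30, 30)) / 30.0
K = H @ G @ H
J = np.zeros((30, 30))
for m in np.unique(y):
    idx = np.where(y == m)[0]
    J[np.ix_(idx, idx)] = 1.0 / len(idx)
lam, C = rksir_directions(K, J, s=0.01, d=2)
```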
The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of

$$(\Sigma^2 + sI)^{-1}\Sigma\Gamma. \tag{29}$$

Proposition 7 Let $c_j$ be the eigenvectors of (28) and let $\beta_j$ be the eigenvectors of (29). Then the following holds for the regularized KSIR variates:

$$v_j = c_j^T\,[k(x, x_1), \ldots, k(x, x_n)]^T = \langle\beta_j, \phi(x)\rangle. \tag{30}$$
This algorithm is termed the Tikhonov regularization. For linear SIR, it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.

Beyond computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable via a justifiable representer theorem.
Proposition 8 For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_i)$, $i = 1, \ldots, n$.

The conclusion follows from the observation that $\beta_j = \frac{1}{s\lambda_j}(\Gamma\beta_j - \lambda_j\Sigma\beta_j)$ for the ridge regularization and $\beta_j = \frac{1}{s\lambda_j}(\Sigma\Gamma\beta_j - \lambda_j\Sigma^2\beta_j)$ for the Tikhonov regularization.
To close, we remark that KSIR is computationally advantageous even in the case of linear models when $p \gg n$, since the eigendecomposition involves $n \times n$ matrices rather than the $p \times p$ matrices of the standard SIR formulation.
3.3 Consistency of Regularized KSIR In this subsection we prove the asymptotic consistency of the edr directions estimated by regularized KSIR and provide conditions under which the rate of convergence is $O_p(n^{-1/4})$. An important observation from the proof is that the rate of convergence of the edr directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the edr space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.
Note that various consistency results are available for linear SIR [22-24]. These results hold only in the finite-dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results for finite-rank operators; their proof of consistency requires stronger and more complicated conditions than ours. Consistency of functional SIR with ridge regularization is proven in [19], but in a weaker form than our result. We remark that the consistency results for functional SIR can be improved by an argument similar to the one in this paper.
In the following we state the consistency results for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.
Theorem 9 Assume $\mathbb{E}[k(X, X)^2] < \infty$, $\lim_{n\to\infty} s(n) = 0$, and $\lim_{n\to\infty} s\sqrt n = \infty$; then

$$\big|\langle\hat\beta_j, \phi(\cdot)\rangle - \langle\beta_j, \phi(\cdot)\rangle\big| = o_p(1), \qquad j = 1, \ldots, d_\Gamma, \tag{31}$$

where $d_\Gamma$ is the rank of $\Gamma$, $\langle\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction, and $\langle\hat\beta_j, \phi(\cdot)\rangle$ is the projection onto the $j$th edr direction as estimated by regularized KSIR.

If the edr directions $\{\beta_j\}_{j=1}^{d_\Gamma}$ depend only on a finite number of eigenvectors of the covariance operator $\Sigma$, the rate of convergence is $O_p(n^{-1/4})$.
This theorem is a direct corollary of the following theorem, which is proven in Appendix C.
Theorem 10 Define the projection operator and its complement for each $N \ge 1$:

$$\Pi_N = \sum_{i=1}^N e_i \otimes e_i, \qquad \Pi_N^\perp = I - \Pi_N = \sum_{i=N+1}^\infty e_i \otimes e_i, \tag{32}$$

where $\{e_i\}_{i=1}^\infty$ are the eigenvectors of the covariance operator $\Sigma$ as defined in (8), with the corresponding eigenvalues denoted by $w_i$.

Assume $\mathbb{E}[k(X, X)^2] < \infty$. For each $N \ge 1$ the following holds:

$$\Big\|(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T\Big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{s\sqrt n}\Big) + \sum_{j=1}^{d_\Gamma}\Big(\frac{s}{w_N^2}\big\|\Pi_N(\xi_j)\big\| + \big\|\Pi_N^\perp(\xi_j)\big\|\Big), \tag{33}$$

where $\xi_j = \Sigma^{-1}u_j$, $\{u_j\}_{j=1}^{d_\Gamma}$ are the eigenvectors of $\Gamma$ as defined in (14), and $\|\cdot\|_{\mathrm{HS}}$ denotes the Hilbert-Schmidt norm of a linear operator.

If $s = s(n)$ satisfies $s \to 0$ and $s\sqrt n \to \infty$ as $n \to \infty$, then

$$\Big\|(\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma - T\Big\|_{\mathrm{HS}} = o_p(1). \tag{34}$$
4 Application to Simulated and Real Data

In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve prediction accuracy?

We remark that assessing nonlinear dimension reduction methods can be more difficult than assessing linear ones. When the feature mapping $\phi$ of an RKHS is not available, we do not know the true edr directions or subspace, so in that case we use the prediction accuracy to evaluate the goodness of RKSIR.
4.1 Importance of Nonlinearity and Regularization Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$, each following a standard normal distribution $X_i \sim N(0, 1)$. A univariate response is given as

$$Y = \big(\sin(X_1) + \sin(X_2)\big)\big(1 + \sin(X_3)\big) + \varepsilon \tag{35}$$

with noise $\varepsilon \sim N(0, 0.1^2)$. We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to
[Figure 1. Dimension reduction for model (35): (a) mean square error for RKSIR as a function of the regularization parameter $\log_{10}(s)$, for $d = 1$ and $d = 2$; (b) mean square error for the various dimension reduction methods (RKSIR, SIR, SAVE, PHD) versus the number of edr directions, where the blue line represents the mean square error without dimension reduction; (c)-(d) KSIR variates for the first and second nonlinear edr directions plotted against the nonlinear factors $\sin(X_1) + \sin(X_2)$ and $1 + \sin(X_3)$.]
compute the edr directions. In RKSIR we used an additive Gaussian kernel,

$$k(x, z) = \sum_{j=1}^{d} \exp\Big(-\frac{(x_j - z_j)^2}{2\sigma^2}\Big), \tag{36}$$

where $\sigma = 2$. After projecting the training samples onto the estimated edr directions, we train a Gaussian kernel regression model on the new variates. The mean square error is then computed on 1000 independent test samples. This experiment is repeated 100 times, with all parameters set by cross-validation. The results are summarized in Figure 1.
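The additive Gaussian kernel (36) sums a one-dimensional Gaussian kernel over coordinates and is straightforward to vectorize. The sketch below is our own helper (with $\sigma = 2$ as the default, matching the experiment):

```python
import numpy as np

def additive_gaussian_kernel(X, Z, sigma=2.0):
    """k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2)), the kernel in (36)."""
    diff = X[:, None, :] - Z[None, :, :]          # pairwise coordinatewise differences
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).sum(axis=-1)

# for identical 10-dimensional points each coordinate term contributes exp(0) = 1,
# so every kernel value equals 10
K = additive_gaussian_kernel(np.zeros((2, 10)), np.zeros((3, 10)))
```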
Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear edr directions. This observation agrees with the model in (35); indeed, Figures 1(c) and 1(d) show that the first two edr directions from RKSIR estimate the two nonlinear factors well.
4.2 Effect of Regularization This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.
The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by

$$Y = X_1 + X_2^2 + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \tag{37}$$

where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \operatorname{diag}(1^{-\theta}, 2^{-\theta}, \ldots, 10^{-\theta})$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct edr directions.
For this model it is known that SIR will miss the direction along the second variable $X_2$, so we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$, which corresponds to the feature space

$$\Phi(X) = \big\{1,\ X_i,\ (X_i X_j)_{i\le j},\ i, j = 1, \ldots, 10\big\}, \tag{38}$$
then X_1 + X_2² can be captured in one edr direction. Ideally the first KSIR variate v = ⟨β_1, φ(X)⟩ should be equivalent to X_1 + X_2² modulo shift and scale:

v − Ev ∝ X_1 + X_2² − E(X_1 + X_2²). (39)
So for this example, given estimates of the KSIR variates at the n data points, {v_i}_{i=1}^{n} = {⟨β_1, φ(x_i)⟩}_{i=1}^{n}, the error of the first edr direction can be measured by the least squares fitting of v with respect to (X_1 + X_2²):

error = min_{a,b ∈ ℝ} (1/n) Σ_{i=1}^{n} ( v_i − (a(x_{i1} + x_{i2}²) + b) )². (40)
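The criterion (40) is an ordinary least squares fit; a direct sketch (names are ours):

```python
import numpy as np

def edr_error(v, x):
    """Least squares fit of the first KSIR variate v against
    x[:, 0] + x[:, 1]**2, returning the criterion in (40)."""
    f = x[:, 0] + x[:, 1] ** 2
    A = np.column_stack([f, np.ones_like(f)])   # design columns for a and b
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    resid = v - A @ coef
    return float(np.mean(resid ** 2))
```

When v is exactly an affine function of x_1 + x_2², the error is zero, consistent with (39).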
We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods, KSIR and RKSIR. The mean and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as θ increases and that regularization helps to reduce this instability.
4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.
The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, …, 9 as training data and 10,000 images as test data. Each image consists of p = 28 × 28 = 784 gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.
Figure 2: Error in edr as a function of θ (error rate versus the parameter θ for KSIR and RKSIR).
We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on these data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 edr directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the edr directions and used a k-nearest neighbor (kNN) classifier with k = 5 to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
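The classification step on the projected variates can be sketched as follows (the projection onto the edr directions is omitted; names are ours):

```python
import numpy as np

def knn_predict(Z_train, labels, Z_test, k=5):
    """Majority-vote kNN in the space of the projected (edr) variates."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest training points
    votes = labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])
```

With k = 5 and the 10-dimensional RKSIR variates as Z, this reproduces the classifier used in the experiment.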
The mean and standard deviation of the classification error rates over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for the digits 2, 3, 5, and 8.
5. Discussion
The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework; therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
RKHS has also been introduced into supervised dimension reduction in [26], where the conditional covariance
Table 1: Mean and standard deviations for error rates in classification of digits.

Digit     RKSIR              KSIR               RSIR               kNN
0         0.0273 (0.0089)    0.0472 (0.0191)    0.0487 (0.0128)    0.0291 (0.0071)
1         0.0150 (0.0049)    0.0177 (0.0051)    0.0292 (0.0113)    0.0052 (0.0012)
2         0.1039 (0.0207)    0.1475 (0.0497)    0.1921 (0.0238)    0.2008 (0.0186)
3         0.0845 (0.0208)    0.1279 (0.0494)    0.1723 (0.0283)    0.1092 (0.0130)
4         0.0784 (0.0240)    0.1044 (0.0461)    0.1327 (0.0327)    0.1617 (0.0213)
5         0.0877 (0.0209)    0.1327 (0.0540)    0.2146 (0.0294)    0.1419 (0.0193)
6         0.0472 (0.0108)    0.0804 (0.0383)    0.0816 (0.0172)    0.0446 (0.0081)
7         0.0887 (0.0169)    0.1119 (0.0357)    0.1354 (0.0172)    0.1140 (0.0125)
8         0.0981 (0.0259)    0.1490 (0.0699)    0.1981 (0.0286)    0.1140 (0.0156)
9         0.0774 (0.0251)    0.1095 (0.0398)    0.1533 (0.0212)    0.2006 (0.0153)
Average   0.0708 (0.0105)    0.1016 (0.0190)    0.1358 (0.0093)    0.1177 (0.0039)
operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore their method estimates linear edr subspaces, while in [10, 11] and in this paper RKHS is used to model the edr subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferre and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR differ from those for functional SIR.
Appendices
A. Proof of Theorem 3
Under the assumption of Proposition 2, for each Y = y,

E[φ(X) | Y = y] ∈ span{Σβ_i, i = 1, …, d}. (A1)
Since Γ = cov[E(φ(X) | Y)], the rank of Γ (i.e., the dimension of the image of Γ) is no more than d, which implies that Γ is compact. Together with the fact that it is symmetric and positive semidefinite, there exist d_Γ positive eigenvalues {τ_i}_{i=1}^{d_Γ} and eigenvectors {u_i}_{i=1}^{d_Γ} such that Γ = Σ_{i=1}^{d_Γ} τ_i u_i ⊗ u_i.
Recall that for any f ∈ H,

Γf = E[ ⟨E[φ(X) | Y], f⟩ E[φ(X) | Y] ] (A2)

also belongs to

span{Σβ_i, i = 1, …, d} ⊂ range(Σ) (A3)

because of (A1), so

u_i = (1/τ_i) Γu_i ∈ range(Σ). (A4)
This proves (i).

Since for each f ∈ H, Γf ∈ range(Σ), the operator T = Σ^{−1}Γ is well defined on the whole space. Moreover,
Tf = Σ^{−1}( Σ_{i=1}^{d_Γ} τ_i ⟨u_i, f⟩ u_i ) = Σ_{i=1}^{d_Γ} τ_i ⟨u_i, f⟩ Σ^{−1}(u_i) = ( Σ_{i=1}^{d_Γ} τ_i Σ^{−1}(u_i) ⊗ u_i ) f. (A5)

This proves (ii).
B. Proof of Proposition 7
We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, for the case where d_K is infinite and a matrix form does not make sense.

Let Φ = [φ(x_1), …, φ(x_n)]. It has the following SVD decomposition:
Φ = UDV^⊤ = [u_1 ⋯ u_{d_K}] [ D_{d×d}, 0_{d×(n−d)} ; 0_{(d_K−d)×d}, 0_{(d_K−d)×(n−d)} ] [v_1, …, v_n]^⊤ = ŪD̄V̄^⊤, (B1)

where Ū = [u_1, …, u_d], V̄ = [v_1, …, v_d], and D̄ = D_{d×d} is a diagonal matrix of dimension d ≤ n.
We need to show that the KSIR variates satisfy

v_j = c_j^⊤ [k(x, x_1), …, k(x, x_n)]^⊤ = c_j^⊤ Φ^⊤ φ(x) = ⟨Φc_j, φ(x)⟩ = ⟨β_j, φ(x)⟩. (B2)
It suffices to prove that if (λ, c) is a solution to (28), then (λ, β) is also an eigenvalue-eigenvector pair of (Σ² + γI)^{−1}ΣΓ, and vice versa, where c and β are related by

β = Φc,  c = V̄D̄^{−1}Ū^⊤β. (B3)

Noting the facts Σ = (1/n)ΦΦ^⊤, Γ = (1/n)ΦJΦ^⊤, and K = Φ^⊤Φ = V̄D̄²V̄^⊤, the argument may be made as follows:
ΣΓβ = λ(Σ² + γI)β
⟺ ΦΦ^⊤ΦJΦ^⊤Φc = λ(ΦΦ^⊤ΦΦ^⊤Φc + n²γΦc)
⟺ ΦKJKc = λΦ(K² + n²γI)c
(⟺ V̄V̄^⊤KJKc = λV̄V̄^⊤(K² + n²γI)c)
⟺ KJKc = λ(K² + n²γI)c. (B4)
Note that the implication in the third step is necessary only in the ⇒ direction; it is obtained by multiplying both sides by V̄D̄^{−1}Ū^⊤ and using the fact Ū^⊤Ū = I_d. For the last step, since V̄^⊤V̄ = I_d, we use the following facts:
V̄V̄^⊤K = V̄V̄^⊤V̄D̄²V̄^⊤ = V̄D̄²V̄^⊤ = K,
V̄V̄^⊤c = V̄V̄^⊤V̄D̄^{−1}Ū^⊤β = V̄D̄^{−1}Ū^⊤β = c. (B5)
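The finite-dimensional chain of equivalences in (B4) can be checked numerically for an explicit feature matrix; the following sketch (a random Φ and a symmetric positive semidefinite J standing in for the slice matrix; all names are ours) verifies that an eigenpair of the n × n problem yields an eigenpair of (Σ² + γI)^{−1}ΣΓ via β = Φc:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dK, gamma = 8, 20, 0.1
Phi = rng.normal(size=(dK, n))              # columns play the role of phi(x_i)
J = rng.normal(size=(n, n)); J = J @ J.T    # symmetric PSD stand-in for the slice matrix
Sigma = Phi @ Phi.T / n                     # (1/n) Phi Phi^T
Gamma = Phi @ J @ Phi.T / n                 # (1/n) Phi J Phi^T
K = Phi.T @ Phi                             # kernel matrix

# Solve the n x n problem K J K c = lam (K^2 + n^2 gamma I) c.
M = np.linalg.solve(K @ K + n**2 * gamma * np.eye(n), K @ J @ K)
vals, vecs = np.linalg.eig(M)
i = int(np.argmax(vals.real))
lam, c = vals.real[i], vecs[:, i].real

# beta = Phi c should satisfy Sigma Gamma beta = lam (Sigma^2 + gamma I) beta.
beta = Phi @ c
lhs = Sigma @ (Gamma @ beta)
rhs = lam * (Sigma @ (Sigma @ beta) + gamma * beta)
```

The agreement of lhs and rhs is exactly the ⇒ direction of (B4).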
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define Φ, Φ^⊤, and the SVD of Φ when d_K is infinite. In the infinite dimensional case, Φ is the operator from ℝⁿ to H_K defined by Φv = Σ_{i=1}^{n} v_i φ(x_i) for v = (v_1, …, v_n)^⊤ ∈ ℝⁿ, and Φ^⊤ is its adjoint, the operator from H_K to ℝⁿ such that Φ^⊤f = (⟨φ(x_1), f⟩_K, …, ⟨φ(x_n), f⟩_K)^⊤ for f ∈ H_K. The notions U and U^⊤ are similarly defined.

The above formulation of Φ and Φ^⊤ coincides with the definition of Σ as a covariance operator. Since the rank of Σ is less than n, it is compact and has the following representation:

Σ = Σ_{i=1}^{d_K} σ_i u_i ⊗ u_i = Σ_{i=1}^{d} σ_i u_i ⊗ u_i, (B6)

where d ≤ n is the rank and σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_d > σ_{d+1} = ⋯ = 0. This implies that each φ(x_i) lies in span(u_1, …, u_d), and hence we can write φ(x_i) = Uτ_i, where U = (u_1, …, u_d) should be considered as an operator from ℝ^d to H_K and τ_i ∈ ℝ^d. Denote Υ = (τ_1, …, τ_n)^⊤ ∈ ℝ^{n×d}. It is easy to check that Υ^⊤Υ = diag(nσ_1, …, nσ_d). Let D_{d×d} = diag(√(nσ_1), …, √(nσ_d)) and V = ΥD^{−1}. Then we obtain the SVD for Φ as Φ = UDV^⊤, which is well defined.
C. Proof of Consistency
C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to these topics; for details, see [27-29] and the references therein.
Given a separable Hilbert space H of dimension p_H, a linear operator L on H is said to belong to the Hilbert-Schmidt class if

‖L‖²_HS = Σ_{i=1}^{p_H} ‖Le_i‖²_H < ∞, (C1)

where {e_i} is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with the norm ‖·‖_HS.

Given a bounded operator S on H, the operators SL and LS both belong to the Hilbert-Schmidt class, and the following holds:

‖SL‖_HS ≤ ‖S‖ ‖L‖_HS,  ‖LS‖_HS ≤ ‖L‖_HS ‖S‖, (C2)

where ‖·‖ denotes the usual operator norm

‖L‖² = sup_{f∈H} ‖Lf‖² / ‖f‖². (C3)
Let Z be a random vector taking values in H satisfying E‖Z‖² < ∞. The covariance operator

Σ = E[(Z − EZ) ⊗ (Z − EZ)] (C4)

is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
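In a finite dimensional space the Hilbert-Schmidt norm (C1) is the Frobenius norm and the operator norm (C3) is the spectral norm, so the inequalities in (C2) can be sanity-checked numerically. This small illustration is ours and is not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 6))
L = rng.normal(size=(6, 6))

hs = lambda A: np.linalg.norm(A, 'fro')   # Hilbert-Schmidt norm, as in (C1)
op = lambda A: np.linalg.norm(A, 2)       # operator norm, as in (C3)

# The two inequalities in (C2):
sl_ok = hs(S @ L) <= op(S) * hs(L) + 1e-12
ls_ok = hs(L @ S) <= hs(L) * op(S) + 1e-12
```

Note also that ‖L‖ ≤ ‖L‖_HS always holds, since the spectral norm is the largest singular value while the HS norm sums all of them.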
A well-known result from perturbation theory for linear operators states that if a sequence of linear operators T_n converges to T in the Hilbert-Schmidt norm and the eigenvalues of T are nondegenerate, then the eigenvalues and eigenvectors of T_n converge to those of T at the same rate as the convergence of the operators.
C.2. Proof of Theorem 10. We will use the following result from [14]:

‖Σ̂ − Σ‖_HS = O_p(1/√n),  ‖Γ̂ − Γ‖_HS = O_p(1/√n). (C5)
To simplify the notation, we denote T̂_s = (Σ̂² + sI)^{−1}Σ̂Γ̂. Also define

T_1 = (Σ̂² + sI)^{−1}ΣΓ,  T_2 = (Σ² + sI)^{−1}ΣΓ. (C6)

Then

‖T̂_s − T‖_HS ≤ ‖T̂_s − T_1‖_HS + ‖T_1 − T_2‖_HS + ‖T_2 − T‖_HS. (C7)
For the first term, observe that

‖T̂_s − T_1‖_HS ≤ ‖(Σ̂² + sI)^{−1}‖ ‖Σ̂Γ̂ − ΣΓ‖_HS = O_p(1/(s√n)). (C8)

For the second term, note that
T_1 = Σ_{j=1}^{d_Γ} τ_j ((Σ̂² + sI)^{−1}Σu_j) ⊗ u_j,
T_2 = Σ_{j=1}^{d_Γ} τ_j ((Σ² + sI)^{−1}Σu_j) ⊗ u_j. (C9)
Therefore,

‖T_1 − T_2‖_HS ≤ Σ_{j=1}^{d_Γ} τ_j ‖((Σ̂² + sI)^{−1} − (Σ² + sI)^{−1})Σu_j‖. (C10)

Since u_j ∈ range(Σ), there exists ξ_j such that u_j = Σξ_j. Then
((Σ̂² + sI)^{−1} − (Σ² + sI)^{−1})Σu_j = (Σ̂² + sI)^{−1}(Σ² − Σ̂²)(Σ² + sI)^{−1}Σ²ξ_j, (C11)
which implies

‖T_1 − T_2‖ ≤ Σ_{j=1}^{d_Γ} τ_j ‖(Σ̂² + sI)^{−1}‖ ‖Σ² − Σ̂²‖_HS ‖(Σ² + sI)^{−1}Σ²‖ ‖ξ_j‖ = O_p(1/(s√n)). (C12)
For the third term, the following holds:

‖T_2 − T‖²_HS = Σ_{j=1}^{d_Γ} τ_j² ‖((Σ² + sI)^{−1}Σ − Σ^{−1})u_j‖², (C13)
and for each j = 1, …, d_Γ,

‖(Σ² + sI)^{−1}Σu_j − Σ^{−1}u_j‖ ≤ ‖(Σ² + sI)^{−1}Σ²ξ_j − ξ_j‖
 = ‖ Σ_{i=1}^{∞} ( w_i²/(s + w_i²) − 1 ) ⟨ξ_j, e_i⟩ e_i ‖
 = ( Σ_{i=1}^{∞} s²/(s + w_i²)² ⟨ξ_j, e_i⟩² )^{1/2}
 ≤ (s/w_N²) ( Σ_{i=1}^{N} ⟨ξ_j, e_i⟩² )^{1/2} + ( Σ_{i=N+1}^{∞} ⟨ξ_j, e_i⟩² )^{1/2}
 = (s/w_N²) ‖Π_N(ξ_j)‖ + ‖Π^⊥_N(ξ_j)‖. (C14)
Combining these terms results in (33).

Since ‖Π^⊥_N(ξ_j)‖ → 0 as N → ∞, we consequently have

‖T̂_s − T‖_HS = o_p(1) (C15)

if s → 0 and s√n → ∞.

If all the edr directions β_i depend only on a finite number of eigenvectors of the covariance operator, then there exists some N > 1 such that S* = span{Σe_i, i = 1, …, N}. This implies that

ξ_j = Σ^{−1}u_j ∈ Σ^{−1}(S*) ⊂ span{e_i, i = 1, …, N}. (C16)

Therefore Π^⊥_N(ξ_j) = 0. Letting s = O(n^{−1/4}), the rate is O(n^{−1/4}).
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.
References
[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316-342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505-530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328-332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025-1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615-625, 2007.
[6] B. Scholkopf, A. J. Smola, and K. Muller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583-588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1-48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590-610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590-1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441-458, pp. 415-446, 1909.
[13] H. Konig, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferre and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475-488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867-889, 1993.
[16] G. He, H.-G. Muller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54-77, 2003.
[17] L. Ferre and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665-683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169-4175, 2005.
[19] L. Ferre and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807-823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124-131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85-98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040-1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141-2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727-736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628-635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871-1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259-294, 2007.
This algorithm is termed the Tikhonov regularization. For linear SIR, it is shown in [21] that Tikhonov regularization is more efficient than ridge regularization.
Beyond computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable via a justifiable representer theorem.
Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions β_j are linear combinations of φ(x_i), i = 1, …, n.
The conclusion follows from the observation that β_j = (1/(sλ_j))(Γβ_j − λ_jΣβ_j) for the ridge regularization and β_j = (1/(sλ_j))(ΣΓβ_j − λ_jΣ²β_j) for the Tikhonov regularization, and in both cases the right-hand side lies in the span of the φ(x_i).
To close, we remark that KSIR is computationally advantageous even in the case of linear models when p ≫ n, due to the fact that the eigendecomposition is performed on n × n matrices rather than the p × p matrices of the standard SIR formulation.
3.3. Consistency of Regularized KSIR. In this subsection we prove the asymptotic consistency of the edr directions estimated by regularized KSIR and provide conditions under which the rate of convergence is O_p(n^{−1/4}). An important observation from the proof is that the rate of convergence of the edr directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the edr space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.
Note that various consistency results are available for linear SIR [22-24]. These results hold only in the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14] a thresholding method is considered, which selects a finite number of eigenvectors and uses results from finite rank operators; their proof of consistency requires stronger and more complicated conditions than ours. The consistency of functional SIR with ridge regularization is proven in [19], but in a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.
In the following we state the consistency results for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.
Theorem 9. Assume E[k(X, X)²] < ∞, lim_{n→∞} s(n) = 0, and lim_{n→∞} s√n = ∞; then

|⟨β̂_j, φ(·)⟩ − ⟨β_j, φ(·)⟩| = o_p(1),  j = 1, …, d_Γ, (31)

where d_Γ is the rank of Γ, ⟨β_j, φ(·)⟩ is the projection onto the jth edr direction, and ⟨β̂_j, φ(·)⟩ is the projection onto the jth edr direction as estimated by regularized KSIR.

If the edr directions {β_j}_{j=1}^{d_Γ} depend only on a finite number of eigenvectors of the covariance operator Σ, the rate of convergence is O(n^{−1/4}).
This theorem is a direct corollary of the following theorem, which is proven in Appendix C.
Theorem 10. Define the projection operator and its complement for each N ≥ 1:

Π_N = Σ_{i=1}^{N} e_i ⊗ e_i,  Π^⊥_N = I − Π_N = Σ_{i=N+1}^{∞} e_i ⊗ e_i, (32)

where {e_i}_{i=1}^{∞} are the eigenvectors of the covariance operator Σ as defined in (8), with the corresponding eigenvalues denoted by w_i.

Assume E[k(X, X)²] < ∞. For each N ≥ 1 the following holds:

‖(Σ̂² + sI)^{−1}Σ̂Γ̂ − T‖_HS = O_p(1/(s√n)) + Σ_{j=1}^{d_Γ} ( (s/w_N²) ‖Π_N(ξ_j)‖ + ‖Π^⊥_N(ξ_j)‖ ), (33)

where ξ_j = Σ^{−1}u_j, {u_j}_{j=1}^{d_Γ} are the eigenvectors of Γ as defined in (14), and ‖·‖_HS denotes the Hilbert-Schmidt norm of a linear operator.

If s = s(n) satisfies s → 0 and s√n → ∞ as n → ∞, then

‖(Σ̂² + sI)^{−1}Σ̂Γ̂ − T‖_HS = o_p(1). (34)
4. Application to Simulated and Real Data
In this section we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve prediction accuracy?
We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping φ for an RKHS is not available, we do not know the true edr directions or subspace; so in that case we use prediction accuracy to evaluate the goodness of RKSIR.
4.1. Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and the regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables X = (X_1, …, X_{10}), with each one following a normal distribution X_i ~ N(0, 1). A univariate response is given as

Y = (sin(X_1) + sin(X_2))(1 + sin(X_3)) + ε, (35)

with noise ε ~ N(0, 0.1²). We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to
Figure 1: Dimension reduction for model (35). (a) Mean square error for RKSIR with different regularization parameters. (b) Mean square error for various dimension reduction methods; the blue line represents the mean square error without using dimension reduction. (c)-(d) RKSIR variates versus the nonlinear factors.
compute the edr directions. In RKSIR we used an additive Gaussian kernel as follows:

k(x, z) = Σ_{j=1}^{d} exp( −(x_j − z_j)² / (2σ²) ), (36)
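A direct implementation of the additive kernel (36), assuming the two inputs are vectors of equal length (the function name is ours):

```python
import numpy as np

def additive_gaussian_kernel(x, z, sigma=2.0):
    """Additive Gaussian kernel of (36): a sum of one-dimensional
    Gaussian kernels over the coordinates."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return float(np.exp(-(x - z) ** 2 / (2.0 * sigma ** 2)).sum())
```

Each coordinate contributes at most 1, so k(x, x) = d, and the kernel is symmetric and positive.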
where 120590 = 2 After projecting the training samples on theestimated edr directions we train a Gaussian kernel regres-sion model based on the new variatesThen the mean squareerror is computed on 1000 independent test samples Thisexperiment is repeated 100 times with all parameters set bycross-validation The results are summarized in Figure 1
Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter: KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.
RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear edr directions. This observation agrees with the model in (35): indeed, Figures 1(c) and 1(d) show that the first two edr directions from RKSIR estimate the two nonlinear factors well.
4.2. Effect of Regularization. This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.
The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by
$$Y = X_1 + X_2^2 + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad (37)$$
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q \Delta Q^T$, with $Q$ a randomly chosen orthogonal matrix and $\Delta = \mathrm{diag}(1^\theta, 2^\theta, \ldots, 10^\theta)$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which in turn increases the difficulty of identifying the correct edr directions.
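A sketch of this data-generating process (our own helper name, not the authors' code); the condition number of $\Sigma_X$ is $10^\theta$ by construction, which is how the anisotropy is controlled:

```python
import numpy as np

def sample_model_37(n, theta, rng):
    """Draw n observations from model (37) with anisotropy parameter theta >= 0."""
    # Random orthogonal Q via the QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
    Delta = np.diag(np.arange(1, 11, dtype=float) ** theta)
    Sigma_X = Q @ Delta @ Q.T
    # X ~ N(0, Sigma_X) via the square-root factor Q Delta^{1/2}
    X = rng.standard_normal((n, 10)) @ (Q @ np.sqrt(Delta)).T
    Y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
    return X, Y, Sigma_X

rng = np.random.default_rng(0)
X, Y, S = sample_model_37(200, theta=2.0, rng=rng)
# Condition number of Sigma_X is 10^theta, i.e. 100 for theta = 2
print(round(np.linalg.cond(S)))  # 100
```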
For this model it is known that SIR will miss the direction along the second variable $X_2$, so we focus on the comparison of KSIR and RKSIR in this example.
If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$ that corresponds to the feature space
$$\Phi(X) = \{1,\ X_i,\ (X_i X_j)_{i \le j},\ i, j = 1, \ldots, 10\}, \quad (38)$$
then $X_1 + X_2^2$ can be captured in one edr direction. Ideally, the first KSIR variate $V = \langle \beta_1, \phi(X) \rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
$$V - EV \propto X_1 + X_2^2 - E(X_1 + X_2^2). \quad (39)$$
So, for this example, given estimates of the KSIR variates at the $n$ data points, $\{V_i\}_{i=1}^{n} = \{\langle \beta_1, \phi(x_i) \rangle\}_{i=1}^{n}$, the error of the first edr direction can be measured by the least squares fit of $V$ with respect to $(X_1 + X_2^2)$:
$$\text{error} = \min_{a, b \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \left( V_i - \left( a \left( x_{i1} + x_{i2}^2 \right) + b \right) \right)^2. \quad (40)$$
We drew 200 observations from the model specified in (37) and applied the two dimension reduction methods, KSIR and RKSIR. The mean and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as $\theta$ increases, and that regularization helps to reduce this instability.
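The error criterion (40) is an ordinary least squares fit of the estimated variate on the true factor. A sketch (our own helper name, not the authors' code):

```python
import numpy as np

def first_direction_error(V, x):
    """Least squares fitting error (40) of the estimated first KSIR variate V
    against the true factor X_1 + X_2^2; x is the (n, 10) design matrix."""
    target = x[:, 0] + x[:, 1] ** 2
    # Fit V_i ~ a * target_i + b by ordinary least squares
    A = np.column_stack([target, np.ones_like(target)])
    coef, *_ = np.linalg.lstsq(A, V, rcond=None)
    resid = V - A @ coef
    return np.mean(resid ** 2)

# Sanity check: a variate that is an exact affine transform of X_1 + X_2^2
# should have essentially zero error, as (39) requires.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 10))
V = 3.0 * (x[:, 0] + x[:, 1] ** 2) - 1.0
print(first_direction_error(V, x))
```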
4.3. Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example we show that adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set: the classification of handwritten digits.
The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist) contains 60,000 images of handwritten digits 0, 1, 2, ..., 9 as training data and 10,000 images as test data. Each image consists of $p = 28 \times 28 = 784$ gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.
[Figure 2: Error in edr as a function of $\theta$, comparing KSIR and RKSIR.]
We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on this data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 edr directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the edr directions and used a k-nearest neighbor (kNN) classifier with $k = 5$ to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
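Two ingredients of this evaluation pipeline, the median heuristic for the kernel bandwidth and the $k = 5$ kNN classifier, can be sketched as follows (a minimal illustration on toy data; the helper names are ours, and this is not the authors' experiment code):

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: set the Gaussian kernel bandwidth to the median
    pairwise Euclidean distance between observations."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    iu = np.triu_indices_from(d2, k=1)  # distinct pairs only
    return np.median(np.sqrt(d2[iu]))

def knn_predict(X_train, y_train, X_test, k=5):
    """Plain k-nearest-neighbor classifier with majority vote (k = 5 here)."""
    d2 = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
X0 = rng.standard_normal((50, 2))        # class 0
X1 = rng.standard_normal((50, 2)) + 4.0  # class 1, well separated
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 50)
acc = (knn_predict(X, y, X, k=5) == y).mean()
print(median_bandwidth(X) > 0, acc)  # accuracy is high on this easy toy data
```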
The mean and standard deviation of the classification performance over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases accuracy further. This improvement is dramatic for the digits 2, 3, 5, and 8.
5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
Table 1: Mean and standard deviations for error rates in classification of digits.

Digit     RKSIR             KSIR              RSIR              kNN
0         0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1         0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2         0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3         0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4         0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5         0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6         0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7         0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8         0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9         0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average   0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)

RKHS has also been introduced into supervised dimension reduction in [26], where the conditional covariance
operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Their method therefore estimates linear edr subspaces, while in [10, 11] and in this paper the RKHS is used to model the edr subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structure. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR differ from those for functional SIR.
Appendices
A. Proof of Theorem 3

Under the assumption of Proposition 2, for each $Y = y$,
$$E[\phi(X) \mid Y = y] \in \mathrm{span}\{\Sigma \beta_i,\ i = 1, \ldots, d\}. \quad (A.1)$$
Since $\Gamma = \mathrm{cov}[E(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is at most $d$, which implies that $\Gamma$ is compact. With the fact that it is symmetric and positive semidefinite, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.
Recall that for any $f \in \mathcal{H}$,
$$\Gamma f = E\big[ \langle E[\phi(X) \mid Y], f \rangle\, E[\phi(X) \mid Y] \big] \quad (A.2)$$
also belongs to
$$\mathrm{span}\{\Sigma \beta_i,\ i = 1, \ldots, d\} \subset \mathrm{range}(\Sigma) \quad (A.3)$$
because of (A.1), so
$$u_i = \frac{1}{\tau_i} \Gamma u_i \in \mathrm{range}(\Sigma). \quad (A.4)$$
This proves (i).
Since $\Gamma f \in \mathrm{range}(\Sigma)$ for each $f \in \mathcal{H}$, the operator $T = \Sigma^{-1} \Gamma$ is well defined on the whole space. Moreover,
$$T f = \Sigma^{-1}\left( \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f \rangle u_i \right) = \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f \rangle\, \Sigma^{-1}(u_i) = \left( \sum_{i=1}^{d_\Gamma} \tau_i\, \Sigma^{-1}(u_i) \otimes u_i \right) f. \quad (A.5)$$
This proves (ii).
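The finite-rank representation in (A.5) is easy to verify numerically in finite dimensions. This is a sketch with arbitrary test matrices, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d_gamma = 8, 3

# An invertible, symmetric positive definite Sigma
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)

# A rank-d_gamma symmetric PSD Gamma = sum_i tau_i u_i u_i^T
U, _ = np.linalg.qr(rng.standard_normal((p, d_gamma)))
tau = np.array([3.0, 2.0, 1.0])
Gamma = (U * tau) @ U.T

# T = Sigma^{-1} Gamma versus the finite-rank form sum_i tau_i Sigma^{-1}(u_i) (x) u_i
T = np.linalg.solve(Sigma, Gamma)
T_rank = sum(tau[i] * np.outer(np.linalg.solve(Sigma, U[:, i]), U[:, i])
             for i in range(d_gamma))
print(np.allclose(T, T_rank))  # True
```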
B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.
Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the following SVD decomposition:
$$\Phi = [u_1 \cdots u_{d_K}] \begin{bmatrix} D_{d \times d} & 0_{d \times (n-d)} \\ 0_{(d_K - d) \times d} & 0_{(d_K - d) \times (n-d)} \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix} = U D V^T, \quad (B.1)$$
where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d \times d}$ is a diagonal matrix of dimension $d \le n$.
We need to show that the KSIR variates satisfy
$$V_j = c_j^T [k(x, x_1), \ldots, k(x, x_n)]^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x) \rangle = \langle \beta_j, \phi(x) \rangle. \quad (B.2)$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also an eigenvalue-eigenvector pair of $(\Sigma^2 + \gamma I)^{-1} \Sigma \Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \quad (B.3)$$
Noting the facts $\Sigma = (1/n) \Phi \Phi^T$, $\Gamma = (1/n) \Phi J \Phi^T$, and $K = \Phi^T \Phi = V D^2 V^T$, the argument may be made as follows:
$$\begin{aligned}
\Sigma \Gamma \beta = \lambda (\Sigma^2 + \gamma I) \beta
&\Longleftrightarrow \Phi \Phi^T \Phi J \Phi^T \Phi c = \lambda \left( \Phi \Phi^T \Phi \Phi^T \Phi c + n^2 \gamma \Phi c \right) \\
&\Longleftrightarrow \Phi K J K c = \lambda \Phi (K^2 + n^2 \gamma I) c \\
&\left( \Longleftrightarrow V V^T K J K c = \lambda V V^T (K^2 + n^2 \gamma I) c \right) \\
&\Longleftrightarrow K J K c = \lambda (K^2 + n^2 \gamma I) c.
\end{aligned} \quad (B.4)$$
Note that the implication in the third step is needed only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:
$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \quad (B.5)$$
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^{n} v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle \phi(x_1), f \rangle_K, \ldots, \langle \phi(x_n), f \rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are defined similarly.
The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the representation
$$\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \quad (B.6)$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U \tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n \times d}$. It is easy to check that $\Upsilon^T \Upsilon = \mathrm{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d \times d} = \mathrm{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^T$, which is well defined.
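In finite dimensions this construction reduces to ordinary linear algebra and is easy to check numerically (a sketch with random data, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d_K, n = 5, 8
Phi = rng.standard_normal((d_K, n))   # columns phi(x_i); here the rank is d = d_K < n

Sigma = Phi @ Phi.T / n               # covariance-type operator (1/n) Phi Phi^T
sig, U = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
sig, U = sig[::-1], U[:, ::-1]        # sort descending: sigma_1 >= ... >= sigma_d

Tau = U.T @ Phi                       # tau_i = U^T phi(x_i), stacked as columns
Upsilon = Tau.T                       # Upsilon = (tau_1, ..., tau_n)^T, an n x d matrix
D = np.diag(np.sqrt(n * sig))
V = Upsilon @ np.linalg.inv(D)

print(np.allclose(Upsilon.T @ Upsilon, np.diag(n * sig)))  # Upsilon^T Upsilon = diag(n sigma_i)
print(np.allclose(V.T @ V, np.eye(d_K)))                   # V has orthonormal columns
print(np.allclose(U @ D @ V.T, Phi))                       # recovered SVD: Phi = U D V^T
```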
C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, of covariance operators for Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we give a brief introduction; for details, see [27–29] and the references therein.
Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \quad (C.1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$.
Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class and the following holds:
$$\|SL\|_{\mathrm{HS}} \le \|S\|\, \|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\, \|S\|, \quad (C.2)$$
where $\|\cdot\|$ denotes the default operator norm,
$$\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|L f\|^2}{\|f\|^2}. \quad (C.3)$$
Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $E\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = E[(Z - EZ) \otimes (Z - EZ)] \quad (C.4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
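In finite dimensions the Hilbert-Schmidt norm is the Frobenius norm, so the inequalities in (C.2) and the stated properties of a covariance operator can be sanity-checked directly (a small numpy sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
S = rng.standard_normal((p, p))
L = rng.standard_normal((p, p))

def hs(A):
    return np.linalg.norm(A, 'fro')  # HS norm = Frobenius norm in finite dimension

def op(A):
    return np.linalg.norm(A, 2)      # operator (spectral) norm

# Inequalities (C.2): ||SL||_HS <= ||S|| ||L||_HS and ||LS||_HS <= ||L||_HS ||S||
print(hs(S @ L) <= op(S) * hs(L) + 1e-12)  # True
print(hs(L @ S) <= hs(L) * op(S) + 1e-12)  # True

# A sample covariance matrix, as in (C.4), is symmetric and positive semidefinite
Z = rng.standard_normal((100, p))
Sigma = np.cov(Z, rowvar=False)
print(np.allclose(Sigma, Sigma.T), np.all(np.linalg.eigvalsh(Sigma) >= -1e-12))
```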
A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\|\hat{\Sigma} - \Sigma\|_{\mathrm{HS}} = O_p\left(\frac{1}{\sqrt{n}}\right), \qquad \|\hat{\Gamma} - \Gamma\|_{\mathrm{HS}} = O_p\left(\frac{1}{\sqrt{n}}\right). \quad (C.5)$$
To simplify the notation, denote $\hat{T}_s = (\hat{\Sigma}^2 + sI)^{-1} \hat{\Sigma} \hat{\Gamma}$. Also define
$$T_1 = (\hat{\Sigma}^2 + sI)^{-1} \Sigma \Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1} \Sigma \Gamma. \quad (C.6)$$
Then
$$\|\hat{T}_s - T\|_{\mathrm{HS}} \le \|\hat{T}_s - T_1\|_{\mathrm{HS}} + \|T_1 - T_2\|_{\mathrm{HS}} + \|T_2 - T\|_{\mathrm{HS}}. \quad (C.7)$$
For the first term, observe that
$$\|\hat{T}_s - T_1\|_{\mathrm{HS}} \le \|(\hat{\Sigma}^2 + sI)^{-1}\|\, \|\hat{\Sigma}\hat{\Gamma} - \Sigma\Gamma\|_{\mathrm{HS}} = O_p\left(\frac{1}{s\sqrt{n}}\right). \quad (C.8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \left( (\hat{\Sigma}^2 + sI)^{-1} \Sigma u_j \right) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \left( (\Sigma^2 + sI)^{-1} \Sigma u_j \right) \otimes u_j. \quad (C.9)$$
Therefore,
$$\|T_1 - T_2\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \left\| \left( (\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1} \right) \Sigma u_j \right\|. \quad (C.10)$$
Since $u_j \in \mathrm{range}(\Sigma)$, there exists $\eta_j$ such that $u_j = \Sigma \eta_j$. Then
$$\left( (\hat{\Sigma}^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1} \right) \Sigma u_j = (\hat{\Sigma}^2 + sI)^{-1} \left( \Sigma^2 - \hat{\Sigma}^2 \right) (\Sigma^2 + sI)^{-1} \Sigma^2 \eta_j, \quad (C.11)$$
which implies
$$\|T_1 - T_2\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \|(\hat{\Sigma}^2 + sI)^{-1}\|\, \|\hat{\Sigma}^2 - \Sigma^2\|_{\mathrm{HS}}\, \|(\Sigma^2 + sI)^{-1} \Sigma^2\|\, \|\eta_j\| = O_p\left(\frac{1}{s\sqrt{n}}\right). \quad (C.12)$$
For the third term, the following holds:
$$\|T_2 - T\|_{\mathrm{HS}}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \left\| \left( (\Sigma^2 + sI)^{-1} \Sigma - \Sigma^{-1} \right) u_j \right\|^2, \quad (C.13)$$
and for each $j = 1, \ldots, d_\Gamma$, with $e_i$ the $i$th eigenvector of $\Sigma$ and $w_i$ the corresponding eigenvalue,
$$\begin{aligned}
\left\| (\Sigma^2 + sI)^{-1} \Sigma u_j - \Sigma^{-1} u_j \right\|
&\le \left\| (\Sigma^2 + sI)^{-1} \Sigma^2 \eta_j - \eta_j \right\| \\
&= \left\| \sum_{i=1}^{\infty} \left( \frac{w_i^2}{s + w_i^2} - 1 \right) \langle \eta_j, e_i \rangle e_i \right\| \\
&= \left( \sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2} \langle \eta_j, e_i \rangle^2 \right)^{1/2} \\
&\le \frac{s}{w_N^2} \left( \sum_{i=1}^{N} \langle \eta_j, e_i \rangle^2 \right)^{1/2} + \left( \sum_{i=N+1}^{\infty} \langle \eta_j, e_i \rangle^2 \right)^{1/2} \\
&= \frac{s}{w_N^2} \left\| \Pi_N(\eta_j) \right\| + \left\| \Pi_N^{\perp}(\eta_j) \right\|.
\end{aligned} \quad (C.14)$$
Combining these terms results in (33).
Since $\|\Pi_N^{\perp}(\eta_j)\| \to 0$ as $N \to \infty$, we consequently have
$$\|\hat{T}_s - T\|_{\mathrm{HS}} = o_p(1) \quad (C.15)$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.
If all the edr directions $\beta_i$ depend on only a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $S^{\ast} = \mathrm{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$. This implies that
$$\eta_j = \Sigma^{-1} u_j \in \Sigma^{-1}(S^{\ast}) \subset \mathrm{span}\{e_i,\ i = 1, \ldots, N\}, \quad (C.16)$$
and therefore $\Pi_N^{\perp}(\eta_j) = 0$. Letting $s = O(n^{-1/4})$, the rate is $O(n^{-1/4})$.
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
References
[1] K. C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K. C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. D. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the International Conference on Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.
6 Abstract and Applied Analysis
minus6 minus5 minus4 minus3 minus2 minus1 0 1 202
03
04
05
06
07
08
09
1M
ean
squa
re er
ror
d = 1
d = 2
log10
(s)
(a)
1 2 3 4 5 6 7 8 9 100
02
04
06
08
1
12
14
Number of edr directions
Mea
n sq
uare
erro
r
RKSIRSIR
SAVEPHD
(b)
minus2 minus15 minus1 minus05 0 05 1 15 2minus2
minus15
minus1
minus05
0
05
1
15
2
KSIR variates for the first nonlinear edr direction
sin(X
1)+
sin(X
2)
(c)
2
18
16
14
12
1
08
06
04
02
0minus15 minus1 minus05 0 05 1 15
KSIR variates for the second nonlinear edr direction
1+
sin(X
3)
(d)
Figure 1 Dimension reduction for model (35) (a) mean square error for RKSIR with different regularization parameters (b) mean squareerror for various dimension reduction methods The blue line represents the mean square error without using dimension reduction (c)-(d)RKSIR variates versus nonlinear factors
compute the edr directions In RKSIR we used an additiveGaussian kernel as follows
119896 (119909 119911) =
119889
sum
119895=1
exp(minus(119909119895minus 119911119895)2
21205902) (36)
where 120590 = 2 After projecting the training samples on theestimated edr directions we train a Gaussian kernel regres-sion model based on the new variatesThen the mean squareerror is computed on 1000 independent test samples Thisexperiment is repeated 100 times with all parameters set bycross-validation The results are summarized in Figure 1
Figure 1(a) displays the accuracy for RKSIR as a functionof the regularization parameter illustrating the importance of
selecting regularization parameters KSIRwithout regulariza-tion performs much worse than the RKSIR Figure 1(b) dis-plays the prediction accuracy of various dimension reductionmethods
RKSIR outperforms all the linear dimension reductionmethods which illustrates the power of nonlinearity intro-duced in RKSIR It also suggests that there are essentially twononlinear edr directions This observation seems to agreewith the model in (35) Indeed Figures 1(c) and 1(d) showthat the first two edr directions from RKSIR estimate thetwo nonlinear factors well
42 Effect of Regularization This example illustrates theeffect of regularization on the performance of KSIR as afunction of the anisotropy of the predictors
Abstract and Applied Analysis 7
The regression model has ten predictor variables 119883 =
(1198831 119883
10) and a univariate response specified by
119884 = 1198831+ 1198832
2+ 120576 120576 sim 119873 (0 01
2) (37)
where 119883 sim 119873(0 Σ119883) and Σ
119883= 119876Δ119876 with 119876 a randomly
chosen orthogonal matrix and Δ = diag(1120579 2120579 10120579) Wewill see that increasing the parameter 120579 isin [0infin) increasesthe anisotropy of the data that increases the difficulty ofidentifying the correct edr directions
For thismodel it is known that SIRwillmiss the directionalong the second variable119883
2 So we focus on the comparison
of KSIR and RKSIR in this exampleIf we use a second-order polynomial kernel 119896(119909 119911) =
(1 + 119909119879119911)2 that corresponds to the feature space
Φ (119883) = 1119883119894 (119883119894119883119895)119894le119895 119894 119895 = 1 10 (38)
then 1198831+ 1198832
2can be captured in one edr direction Ideally
the first KSIR variate V = ⟨1205731 120601(119883)⟩ should be equivalent to
1198831+ 1198832
2modulo shift and scale
V minus EV prop 1198831+ 1198832
2minus E (119883
1+ 1198832
2) (39)
So for this example given estimates of KSIR variates at the 119899data points V
119894119899
119894=1= ⟨1205731 120601(119909119894)⟩119899
119894=1 the error of the first edr
direction can bemeasured by the least squares fitting of Vwithrespect to (119883
1+ 1198832
2)
error = min119886119887isinR
1
119899
119899
sum
119894=1
(V119894minus (119886 (119909
1198941+ 1199092
1198942) + 119887))
2
(40)
We drew 200 observations from the model specifiedin (37) and then we applied the two dimension reductionmethods KSIR and RKSIR The mean and standard errorsof 100 repetitions of this procedure are reported in Figure 2The result shows that KSIR becomes more andmore unstableas 120579 increases and the regularization helps to reduce thisinstability
43 Importance of Nonlinearity and Regularization in RealData When SIR is applied to classification problems it isequivalent to a Fisher discriminant analysis For the case ofmulticlass classification it is natural to use SIR and considereach class as a slice Kernel forms of Fisher discriminantanalysis (KFDA) [8] have been used to construct nonlineardiscriminant surfaces and the regularization has improvedperformance of KFDA [25] In this example we show thatthis idea of adding a nonlinearity and a regularization termimproves predictive accuracy in a realmulticlass classificationdata set the classification of handwritten digits
The MNIST data set (Y LeCun httpyannlecuncomexdbmnist) contains 60 000 images of handwritten digits0 1 2 9 as training data and 10 000 images as test dataEach image consists of 119901 = 28 times 28 = 784 gray-scalepixel intensities It is commonly believed that there is clearnonlinear structure in this 784-dimensional space
minus05 0 05 1 15 2 250
100
200
300
400
500
600
700
120579 parameter
Erro
r rat
e
KSIRRKSIR
Figure 2 Error in edr as a function of 120579
We compared regularized SIR (RSIR) as in (26) KSIRand RKSIR on this data to examine the effect of regulariza-tion and nonlinearity Each draw of the training set consistedof 100 observations of each digit We then computed thetop 10 edr directions using these 1000 observations and10 slices one for each digit We projected the 10 000 testobservations onto the edr directions and used a k-nearestneighbor (kNN) classifier with 119896 = 5 to classify the test dataThe accuracy of the kNN classifier without dimensionreduction was used as a baseline For KSIR and RKSIR weused a Gaussian kernel with the bandwith parameter setas the median pairwise distance between observations Theregularization parameter was set by cross-validation
The mean and standard deviation of the classificationaccuracy over 100 iterations of this procedure are reportedin Table 1 The first interesting observation is that the lin-ear dimension reduction does not capture discriminativeinformation as the classification accuracywithout dimensionreduction is better Nonlinearity does increase classifica-tion accuracy and coupling regularization with nonlinearityincreases accuracymoreThis improvement is dramatic for 23 5 and 8
5 Discussion
The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
Table 1: Mean and standard deviations for error rates in classification of digits.

Digit   RKSIR             KSIR              RSIR              kNN
0       0.0273 (0.0089)   0.0472 (0.0191)   0.0487 (0.0128)   0.0291 (0.0071)
1       0.0150 (0.0049)   0.0177 (0.0051)   0.0292 (0.0113)   0.0052 (0.0012)
2       0.1039 (0.0207)   0.1475 (0.0497)   0.1921 (0.0238)   0.2008 (0.0186)
3       0.0845 (0.0208)   0.1279 (0.0494)   0.1723 (0.0283)   0.1092 (0.0130)
4       0.0784 (0.0240)   0.1044 (0.0461)   0.1327 (0.0327)   0.1617 (0.0213)
5       0.0877 (0.0209)   0.1327 (0.0540)   0.2146 (0.0294)   0.1419 (0.0193)
6       0.0472 (0.0108)   0.0804 (0.0383)   0.0816 (0.0172)   0.0446 (0.0081)
7       0.0887 (0.0169)   0.1119 (0.0357)   0.1354 (0.0172)   0.1140 (0.0125)
8       0.0981 (0.0259)   0.1490 (0.0699)   0.1981 (0.0286)   0.1140 (0.0156)
9       0.0774 (0.0251)   0.1095 (0.0398)   0.1533 (0.0212)   0.2006 (0.0153)
Average 0.0708 (0.0105)   0.1016 (0.0190)   0.1358 (0.0093)   0.1177 (0.0039)

RKHS has also been introduced into supervised dimension reduction in [26], where the conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore their method estimates linear edr subspaces, while in [10, 11] and in this paper the RKHS is used to model the edr subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear edr directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood in linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferre and his coauthors in a series of papers [14, 17, 19]. In functional SIR the observable data are functions, and the goal is to find linear edr directions for functional data analysis. In KSIR the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS, or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from those for functional SIR.
Appendices
A Proof of Theorem 3
Under the assumption of Proposition 2, for each $Y = y$,
$$\mathbb{E}[\phi(X) \mid Y = y] \in \operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\}. \quad (A.1)$$
Since $\Gamma = \operatorname{cov}[\mathbb{E}(\phi(X) \mid Y)]$, the rank of $\Gamma$ (i.e., the dimension of the image of $\Gamma$) is less than $d$, which implies that $\Gamma$ is compact. Together with the fact that it is symmetric and semipositive, there exist $d_\Gamma$ positive eigenvalues $\{\tau_i\}_{i=1}^{d_\Gamma}$ and eigenvectors $\{u_i\}_{i=1}^{d_\Gamma}$ such that $\Gamma = \sum_{i=1}^{d_\Gamma} \tau_i\, u_i \otimes u_i$.
Recall that for any $f \in \mathcal{H}$,
$$\Gamma f = \mathbb{E}\,\langle \mathbb{E}[\phi(X) \mid Y], f\rangle\, \mathbb{E}[\phi(X) \mid Y] \quad (A.2)$$
also belongs to
$$\operatorname{span}\{\Sigma\beta_i,\ i = 1, \ldots, d\} \subset \operatorname{range}(\Sigma) \quad (A.3)$$
because of (A.1), so
$$u_i = \frac{1}{\tau_i}\,\Gamma u_i \in \operatorname{range}(\Sigma). \quad (A.4)$$
This proves (i).
Since for each $f \in \mathcal{H}$, $\Gamma f \in \operatorname{range}(\Sigma)$, the operator $T = \Sigma^{-1}\Gamma$ is well defined over the whole space. Moreover,
$$T f = \Sigma^{-1}\Big(\sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f\rangle u_i\Big) = \sum_{i=1}^{d_\Gamma} \tau_i \langle u_i, f\rangle\, \Sigma^{-1}(u_i) = \Big(\sum_{i=1}^{d_\Gamma} \tau_i\, \Sigma^{-1}(u_i) \otimes u_i\Big) f. \quad (A.5)$$
This proves (ii).
B Proof of Proposition 7
We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where $d_K$ is infinite and a matrix form does not make sense.
Let $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. It has the following SVD decomposition:
$$\Phi = [u_1 \cdots u_{d_K}] \begin{bmatrix} D_{d\times d} & 0_{d\times(n-d)} \\ 0_{(d_K-d)\times d} & 0_{(d_K-d)\times(n-d)} \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix} = U D V^T, \quad (B.1)$$
where $U = [u_1, \ldots, u_d]$, $V = [v_1, \ldots, v_d]$, and $D = D_{d\times d}$ is a diagonal matrix of dimension $d \le n$.
We need to show that the KSIR variates satisfy
$$v_j = c_j^T\,[k(x, x_1), \ldots, k(x, x_n)]^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x)\rangle = \langle \beta_j, \phi(x)\rangle. \quad (B.2)$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also a pair of eigenvalue and eigenvector of $(\Sigma^2 + \gamma I)^{-1}\Sigma\Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \quad (B.3)$$
Noting the facts $\Sigma = (1/n)\Phi\Phi^T$, $\Gamma = (1/n)\Phi J \Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:
$$\begin{aligned}
\Sigma\Gamma\beta = \lambda(\Sigma^2 + \gamma I)\beta
&\Longleftrightarrow \Phi\Phi^T\Phi J \Phi^T \Phi c = \lambda\,(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2\gamma\,\Phi c) \\
&\Longleftrightarrow \Phi K J K c = \lambda\,\Phi(K^2 + n^2\gamma I)c \\
&\ (\Longleftrightarrow V V^T K J K c = \lambda\, V V^T (K^2 + n^2\gamma I)c) \\
&\Longleftrightarrow K J K c = \lambda\,(K^2 + n^2\gamma I)c.
\end{aligned} \quad (B.4)$$
Note that the implication in the third step is necessary only in the $\Rightarrow$ direction; it is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:
$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \quad (B.5)$$
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^n v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1), f\rangle_K, \ldots, \langle\phi(x_n), f\rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are similarly defined.
The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the following representation:
$$\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \quad (B.6)$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\operatorname{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be considered as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n\times d}$. It is easy to check that $\Upsilon^T\Upsilon = \operatorname{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \operatorname{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD for $\Phi$ as $\Phi = U D V^T$, which is well defined.
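The matrix form of this equivalence is easy to check numerically. The sketch below is illustrative, not the paper's implementation: it uses a linear kernel so that $\Phi$ is an explicit $p \times n$ matrix, together with a hypothetical two-slice averaging matrix $J$, and verifies that the nonzero eigenvalues of the kernel problem in (B.4) agree with those of $(\Sigma^2 + \gamma I)^{-1}\Sigma\Gamma$.

```python
import numpy as np
from scipy.linalg import eig, solve

rng = np.random.default_rng(0)
n, p, gamma = 40, 5, 1e-2
Phi = rng.standard_normal((p, n))      # Phi = [phi(x_1), ..., phi(x_n)]

J = np.zeros((n, n))                   # symmetric slice-averaging matrix
for block in (range(0, 20), range(20, 40)):
    J[np.ix_(block, block)] = 1.0 / len(block)

Sigma = Phi @ Phi.T / n                # Sigma = (1/n) Phi Phi^T
Gamma = Phi @ J @ Phi.T / n            # Gamma = (1/n) Phi J Phi^T
K = Phi.T @ Phi                        # kernel matrix K = Phi^T Phi

# Feature-space problem: eigenvalues of (Sigma^2 + gamma I)^{-1} Sigma Gamma
M = solve(Sigma @ Sigma + gamma * np.eye(p), Sigma @ Gamma)
lam_feat = np.sort(np.linalg.eigvals(M).real)[::-1]

# Kernel-space problem (B.4): K J K c = lambda (K^2 + n^2 gamma I) c
lam_ker_all, _ = eig(K @ J @ K, K @ K + n**2 * gamma * np.eye(n))
lam_ker = np.sort(lam_ker_all.real)[::-1]

# The nonzero eigenvalues agree; the kernel problem has n - p extra zeros
# (eigenvectors c in the null space of Phi give lambda = 0).
agree = np.allclose(lam_feat[:p], lam_ker[:p], atol=1e-8)
```

Since rank(Γ) equals the number of slices minus the overlap structure (here 2), only two eigenvalues are nonzero on both sides.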
C Proof of Consistency
C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert space valued random variables, and the perturbation theory of linear operators. In this subsection we provide a brief introduction to them; for details see [27-29] and the references therein.
Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \quad (C.1)$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$.
Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
$$\|SL\|_{\mathrm{HS}} \le \|S\|\,\|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\,\|S\|, \quad (C.2)$$
where $\|\cdot\|$ denotes the default operator norm
$$\|L\|^2 = \sup_{f\in\mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \quad (C.3)$$
Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = \mathbb{E}[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)] \quad (C.4)$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
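For finite-dimensional operators the Hilbert-Schmidt norm is the Frobenius norm, so the facts (C.1)-(C.4) can be sanity-checked numerically. The sketch below is purely illustrative, using random matrices rather than objects from the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal((6, 6))
S = rng.standard_normal((6, 6))

hs = lambda A: np.linalg.norm(A, "fro")   # Hilbert-Schmidt (Frobenius) norm
op = lambda A: np.linalg.norm(A, 2)       # operator (spectral) norm

# The inequalities in (C.2):
ok_left = hs(S @ L) <= op(S) * hs(L) + 1e-12
ok_right = hs(L @ S) <= hs(L) * op(S) + 1e-12

# A covariance operator built from centered samples, as in (C.4),
# is symmetric and positive semidefinite.
Z = rng.standard_normal((100, 6))
Sigma = np.cov(Z, rowvar=False)           # estimate of E[(Z-EZ) x (Z-EZ)]
eigs = np.linalg.eigvalsh(Sigma)
```

For a symmetric matrix the Hilbert-Schmidt norm also equals the root sum of squared eigenvalues, which is what makes the class a Hilbert space under this norm.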
A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ with the same rate of convergence as that of the operators.
C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\big\|\widehat\Sigma - \Sigma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt{n}}\Big), \qquad \big\|\widehat\Gamma - \Gamma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{\sqrt{n}}\Big). \quad (C.5)$$
To simplify the notation, we denote $\widehat{T}_s = (\widehat\Sigma^2 + sI)^{-1}\widehat\Sigma\widehat\Gamma$. Also define
$$T_1 = (\widehat\Sigma^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \quad (C.6)$$
Then
$$\big\|\widehat{T}_s - T\big\|_{\mathrm{HS}} \le \big\|\widehat{T}_s - T_1\big\|_{\mathrm{HS}} + \big\|T_1 - T_2\big\|_{\mathrm{HS}} + \big\|T_2 - T\big\|_{\mathrm{HS}}. \quad (C.7)$$
For the first term, observe that
$$\big\|\widehat{T}_s - T_1\big\|_{\mathrm{HS}} \le \big\|(\widehat\Sigma^2 + sI)^{-1}\big\|\,\big\|\widehat\Sigma\widehat\Gamma - \Sigma\Gamma\big\|_{\mathrm{HS}} = O_p\Big(\frac{1}{s\sqrt{n}}\Big). \quad (C.8)$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\widehat\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \big((\Sigma^2 + sI)^{-1}\Sigma u_j\big) \otimes u_j. \quad (C.9)$$
Therefore
$$\big\|T_1 - T_2\big\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|\big((\widehat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j\big\|. \quad (C.10)$$
Since $u_j \in \operatorname{range}(\Sigma)$, there exists $\eta_j$ such that $u_j = \Sigma\eta_j$. Then
$$\big((\widehat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\big)\Sigma u_j = (\widehat\Sigma^2 + sI)^{-1}\big(\Sigma^2 - \widehat\Sigma^2\big)(\Sigma^2 + sI)^{-1}\Sigma^2\eta_j, \quad (C.11)$$
which implies
$$\big\|T_1 - T_2\big\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \big\|(\widehat\Sigma^2 + sI)^{-1}\big\|\,\big\|\Sigma^2 - \widehat\Sigma^2\big\|_{\mathrm{HS}}\,\big\|(\Sigma^2 + sI)^{-1}\Sigma^2\big\|\,\big\|\eta_j\big\| = O_p\Big(\frac{1}{s\sqrt{n}}\Big). \quad (C.12)$$
For the third term, the following holds:
$$\big\|T_2 - T\big\|^2_{\mathrm{HS}} = \sum_{j=1}^{d_\Gamma} \tau_j^2 \big\|\big((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\big)u_j\big\|^2, \quad (C.13)$$
and for each $j = 1, \ldots, d_\Gamma$, writing $\{(w_i^2, e_i)\}$ for the eigenvalues and eigenvectors of $\Sigma^2$ and $\Pi_N$ for the projection onto $\operatorname{span}\{e_1, \ldots, e_N\}$,
$$\begin{aligned}
\big\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\big\| &\le \big\|(\Sigma^2 + sI)^{-1}\Sigma^2\eta_j - \eta_j\big\| = \Big\|\sum_{i=1}^{\infty}\Big(\frac{w_i^2}{s + w_i^2} - 1\Big)\langle\eta_j, e_i\rangle e_i\Big\| \\
&= \Big(\sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2}\langle\eta_j, e_i\rangle^2\Big)^{1/2} \\
&\le \frac{s}{w_N^2}\Big(\sum_{i=1}^{N}\langle\eta_j, e_i\rangle^2\Big)^{1/2} + \Big(\sum_{i=N+1}^{\infty}\langle\eta_j, e_i\rangle^2\Big)^{1/2} \\
&= \frac{s}{w_N^2}\big\|\Pi_N(\eta_j)\big\| + \big\|\Pi_N^{\perp}(\eta_j)\big\|. \quad (C.14)
\end{aligned}$$
Combining these terms results in (33).
Since $\Pi_N^{\perp}(\eta_j) \to 0$ as $N \to \infty$, we consequently have
$$\big\|\widehat{T}_s - T\big\|_{\mathrm{HS}} = o_p(1) \quad (C.15)$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.
If all the edr directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $\mathcal{S}_* = \operatorname{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$. This implies that
$$\eta_j = \Sigma^{-1}u_j \in \Sigma^{-1}(\mathcal{S}_*) \subset \operatorname{span}\{e_i,\ i = 1, \ldots, N\}. \quad (C.16)$$
Therefore $\Pi_N^{\perp}(\eta_j) = 0$. Letting $s = O(n^{-1/4})$, the rate is $O(n^{-1/4})$.
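The chain of bounds in (C.14) can be checked numerically in the eigenbasis of $\Sigma$, where $(\Sigma^2 + sI)^{-1}\Sigma^2$ acts coordinatewise as $w_i^2/(w_i^2 + s)$. The sketch below is illustrative only; the spectrum `w` and the coefficient vector `eta` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N, s = 50, 10, 1e-3
w = np.sort(rng.uniform(0.1, 2.0, m))[::-1]  # eigenvalues w_1 >= ... >= w_m
eta = rng.standard_normal(m)                 # coordinates of eta_j in the basis

# Left side of (C.14): ||(Sigma^2 + sI)^{-1} Sigma^2 eta - eta||
lhs = np.linalg.norm((w**2 / (w**2 + s) - 1.0) * eta)

# Right side: (s / w_N^2) ||Pi_N eta|| + ||Pi_N^perp eta||
rhs = (s / w[N - 1] ** 2) * np.linalg.norm(eta[:N]) + np.linalg.norm(eta[N:])

print(lhs <= rhs)  # -> True
```

The bound holds because $s/(s + w_i^2) \le s/w_N^2$ for $i \le N$ and $s/(s + w_i^2) \le 1$ for $i > N$.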
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or the NIH.
References
[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316-342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505-530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328-332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025-1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615-625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583-588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1-48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590-610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590-1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441-458, pp. 415-446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferre and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475-488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867-889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54-77, 2003.
[17] L. Ferre and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665-683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169-4175, 2005.
[19] L. Ferre and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807-823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124-131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85-98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040-1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141-2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727-736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628-635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871-1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259-294, 2007.
The regression model has ten predictor variables $X = (X_1, \ldots, X_{10})$ and a univariate response specified by
$$Y = X_1 + X_2^2 + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \quad (37)$$
where $X \sim N(0, \Sigma_X)$ and $\Sigma_X = Q\Delta Q^T$ with $Q$ a randomly chosen orthogonal matrix and $\Delta = \operatorname{diag}(1^\theta, 2^\theta, \ldots, 10^\theta)$. We will see that increasing the parameter $\theta \in [0, \infty)$ increases the anisotropy of the data, which increases the difficulty of identifying the correct edr directions.
For this model it is known that SIR will miss the direction along the second variable, $X_2$. So we focus on the comparison of KSIR and RKSIR in this example.
If we use a second-order polynomial kernel $k(x, z) = (1 + x^T z)^2$ that corresponds to the feature space
$$\Phi(X) = \big(1,\ X_i,\ (X_i X_j)_{i \le j},\ i, j = 1, \ldots, 10\big), \quad (38)$$
then $X_1 + X_2^2$ can be captured in one edr direction. Ideally, the first KSIR variate $v = \langle\beta_1, \phi(X)\rangle$ should be equivalent to $X_1 + X_2^2$ modulo shift and scale:
$$v - \mathbb{E}v \propto X_1 + X_2^2 - \mathbb{E}(X_1 + X_2^2). \quad (39)$$
So for this example, given estimates of the KSIR variates at the $n$ data points, $\{v_i\}_{i=1}^n = \{\langle\beta_1, \phi(x_i)\rangle\}_{i=1}^n$, the error of the first edr direction can be measured by the least squares fitting of $v$ with respect to $X_1 + X_2^2$:
$$\text{error} = \min_{a,b\in\mathbb{R}} \frac{1}{n}\sum_{i=1}^{n}\Big(v_i - \big(a(x_{i1} + x_{i2}^2) + b\big)\Big)^2. \quad (40)$$
We drew 200 observations from the model specified in (37), and then we applied the two dimension reduction methods, KSIR and RKSIR. The means and standard errors over 100 repetitions of this procedure are reported in Figure 2. The result shows that KSIR becomes more and more unstable as $\theta$ increases, and the regularization helps to reduce this instability.
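A minimal sketch of this simulation and the error measure (40) follows. Fitting KSIR itself is omitted; the variate `v` is a hypothetical stand-in equal to the target up to shift and scale plus noise, so the code illustrates only the model (37) and the least squares error (40).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 200, 10, 1.5

# Model (37): X ~ N(0, Q Delta Q^T) with Delta = diag(1^theta, ..., 10^theta)
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))      # random orthogonal Q
Delta = np.diag(np.arange(1, p + 1) ** float(theta))
Sigma_X = Q @ Delta @ Q.T
X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
Y = X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.1, n)

# Stand-in first KSIR variate (hypothetical): the truth modulo shift and
# scale plus a little noise, in place of an actual fitted <beta_1, phi(x)>.
target = X[:, 0] + X[:, 1] ** 2
v = 2.0 * target - 1.0 + rng.normal(0.0, 0.05, n)

# Error (40): least squares fit of v by a * (x_i1 + x_i2^2) + b
A = np.column_stack([target, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, v, rcond=None)
error = np.mean((v - A @ coef) ** 2)
```

With an exact recovery modulo shift and scale, `error` reduces to the residual noise variance; an unstable fit at large θ would inflate it.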
43 Importance of Nonlinearity and Regularization in RealData When SIR is applied to classification problems it isequivalent to a Fisher discriminant analysis For the case ofmulticlass classification it is natural to use SIR and considereach class as a slice Kernel forms of Fisher discriminantanalysis (KFDA) [8] have been used to construct nonlineardiscriminant surfaces and the regularization has improvedperformance of KFDA [25] In this example we show thatthis idea of adding a nonlinearity and a regularization termimproves predictive accuracy in a realmulticlass classificationdata set the classification of handwritten digits
The MNIST data set (Y LeCun httpyannlecuncomexdbmnist) contains 60 000 images of handwritten digits0 1 2 9 as training data and 10 000 images as test dataEach image consists of 119901 = 28 times 28 = 784 gray-scalepixel intensities It is commonly believed that there is clearnonlinear structure in this 784-dimensional space
minus05 0 05 1 15 2 250
100
200
300
400
500
600
700
120579 parameter
Erro
r rat
e
KSIRRKSIR
Figure 2 Error in edr as a function of 120579
We compared regularized SIR (RSIR) as in (26) KSIRand RKSIR on this data to examine the effect of regulariza-tion and nonlinearity Each draw of the training set consistedof 100 observations of each digit We then computed thetop 10 edr directions using these 1000 observations and10 slices one for each digit We projected the 10 000 testobservations onto the edr directions and used a k-nearestneighbor (kNN) classifier with 119896 = 5 to classify the test dataThe accuracy of the kNN classifier without dimensionreduction was used as a baseline For KSIR and RKSIR weused a Gaussian kernel with the bandwith parameter setas the median pairwise distance between observations Theregularization parameter was set by cross-validation
The mean and standard deviation of the classificationaccuracy over 100 iterations of this procedure are reportedin Table 1 The first interesting observation is that the lin-ear dimension reduction does not capture discriminativeinformation as the classification accuracywithout dimensionreduction is better Nonlinearity does increase classifica-tion accuracy and coupling regularization with nonlinearityincreases accuracymoreThis improvement is dramatic for 23 5 and 8
5 Discussion
The interest in manifold learning and nonlinear dimensionreduction in both statistics and machine learning has led toa variety of statistical models and algorithms However mostof these methods are developed in the unsupervised learningframework Therefore the estimated dimensions may notbe optimal for the regression models Our work incorpo-rates nonlinearity and regularization to inverse regressionapproaches and results in a robust response driven nonlineardimension reduction method
RKHS has also been introduced into supervised dimen-sion reduction in [26] where the conditional covariance
8 Abstract and Applied Analysis
Table 1 Mean and standard deviations for error rates in classification of digits
Digit RKSIR KSIR RSIR kNN0 00273 (00089) 00472 (00191) 00487 (00128) 00291 (00071)1 00150 (00049) 00177 (00051) 00292 (00113) 00052 (00012)2 01039 (00207) 01475 (00497) 01921 (00238) 02008 (00186)3 00845 (00208) 01279 (00494) 01723 (00283) 01092 (00130)4 00784 (00240) 01044 (00461) 01327 (00327) 01617 (00213)5 00877 (00209) 01327 (00540) 02146 (00294) 01419 (00193)6 00472 (00108) 00804 (00383) 00816 (00172) 00446 (00081)7 00887 (00169) 01119 (00357) 01354 (00172) 01140 (00125)8 00981 (00259) 01490 (00699) 01981 (00286) 01140 (00156)9 00774 (00251) 01095 (00398) 01533 (00212) 02006 (00153)Average 00708 (00105) 01016 (00190) 01358 (00093) 01177 (00039)
operators on such kernel spaces are used to characterize theconditional independence between linear projections and theresponse variable Therefore their method estimates linearedr subspaces while in [10 11] and in this paper RKHS isused to model the edr subspace which leads to nonlineardimension reduction
There are several open issues in regularized kernel SIRmethod such as the selection of kernels regularizationparameters and number of dimensions A direct assessmentof the nonlinear edr directions is expected to reduce thecomputational burden in procedures based on cross valida-tion While these are well established in linear dimensionreduction however little is known for nonlinear dimensionreduction We would like to leave them for future research
There are some interesting connections between KSIRand functional SIR which are developed by Ferre and hiscoauthors in a series of papers [14 17 19] In functional SIRthe observable data are functions and the goal is to find linearedr directions for functional data analysis In KSIR theobservable data are typically not functions but mapped into afunction space in order to characterize nonlinear structuresThis suggests that computations involved in functional SIRcan be simplified by a parametrization with respect to anRKHS or using a linear kernel in the parametrized functionspace On the other hand from a theoretical point of viewKSIR can be viewed as a special case of functional SIRalthough our current theoretical results on KSIR are differentfrom the ones for functional SIR
Appendices
A Proof of Theorem 3
Under the assumption of Proposition 2 for each 119884 = 119910
E [120601 (119883) | 119884 = 119910] isin span Σ120573119894 119894 = 1 119889 (A1)
Since Γ = cov[E(120601(119883) | 119884)] the rank of Γ (ie thedimension of the image of Γ) is less than 119889 which impliesthat Γ is compact With the fact that it is symmetric andsemipositive there exist 119889
Γpositive eigenvalues 120591
119894119889Γ
119894=1and
eigenvectors 119906119894119889Γ
119894=1 such that Γ = sum119889Γ
119894=1120591119894119906119894otimes 119906119894
Recall that for any 119891 isinH
Γ119891 = E ⟨E [120601 (119883) | 119884] 119891⟩E [120601 (119883) | 119884] (A2)
also belongs to
span Σ120573119894 119894 = 1 119889 sub range (Σ) (A3)
because of (A1) so
119906119894=1
120591119894
Γ119906119894isin range (Σ) (A4)
This proves (i)Since for each 119891 isin H Γ119891 isin range (Σminus1) the operator
119879 = Σminus1Γ is well defined over the whole space Moreover
119879119891 = Σminus1(
119889Γ
sum
119894=1
⟨119906119894 119891⟩ 119906119894)
=
119889Γ
sum
119894=1
⟨119906119894 119891⟩ Σ
minus1(119906119894) = (
119889Γ
sum
119894=1
Σminus1(119906119894) otimes 119906119894)119891
(A5)
This proves (ii)
B Proof of Proposition 7
We first prove the proposition for matrices to simplify thennotation we then extend the result to the operators where119889119870is infinite and a matrix form does not make senseLet Φ = [120601(119909
1) 120601(119909
119899)] It has the following SVD
decomposition
Φ = 119880119863119881119879
= [1199061sdot sdot sdot 119906119889119870] [
119863119889times119889
0119889times(119899minus119889)
0(119889119870minus119889)times119889
0(119889119870minus119889)times(119899minus119889)
][[
[
V1198791
V119879119899
]]
]
= 119880119863119881119879
(B1)
where 119880 = [1199061 119906
119889] 119881 = [V
1 V
119889] and 119863 = 119863
119889times119889is a
diagonal matrix of dimension 119889 le 119899
Abstract and Applied Analysis 9
We need to show the KSIR variates
V119895= 119888119879
119895[119896 (119909 119909
1) 119896 (119909 119909
119899)]
= 119888119879
119895Φ119879Φ (119909) = ⟨Φ119888119895 120601 (119909)⟩ = ⟨120573119895 120601 (119909)⟩
(B2)
It suffices to prove that if (120582 119888) is a solution to (28) then (120582 120573)is also a pair of eigenvalue and eigenvector of (Σ2 + 120574119868)minus1ΣΓand vice versa where 119888 and 120573 are related by
120573 = Φ119888 119888 = 119881119863119880119879
120573 (B3)
Noting that facts Σ = (1119899)ΦΦ119879 Γ = (1119899)Φ119869Φ119879 and 119870 =Φ119879Φ = 119881119863
2
119881119879 the argument may be made as follows
ΣΓ120573 = 120582 (Σ2+ 120574119868) 120573
lArrrArr ΦΦ119879Φ119869Φ119879Φ119888 = 120582 (ΦΦ
119879ΦΦ119879Φ119888 + 119899
2120574Φ119888)
lArrrArr Φ119870119869119870119888 = 120582Φ (1198702+ 1198992120574) 119888
(lArrrArr 119881119881119879
119870119869119870119888 = 120582119881119881119879
(1198702+ 1198992120574119868) 119888)
lArrrArr 119870119869119870119888 = 120582 (1198702+ 1198992120574119868) 119888
(B4)
Note that the implication in the third step is necessary only intherArr direction which is obtained by multiplying both sides119881119863minus1
119880119879 and using the facts119880119879119880 = 119868
119889 For the last step since
119881119879
119881 = 119868119889 we use the following facts
119881119881119879
119870 = 119881119881119879
1198811198632
119881119879
= 1198811198632
119881119879
= 119870
119881119881119879
119888 = 119881119881119879
119881119863minus1
119880119879
120573 = 119881119863minus1
119880119879
120573 = 119888
(B5)
In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^n v_i \phi(x_i)$ for $v = (v_1, \ldots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$ such that $\Phi^T f = (\langle\phi(x_1), f\rangle_K, \ldots, \langle\phi(x_n), f\rangle_K)^T$ for $f \in \mathcal{H}_K$. The notations $U$ and $U^T$ are defined similarly.
The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the following representation:
\[
\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \tag{B.6}
\]
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1, \ldots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \ldots, u_d)$ should be regarded as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \ldots, \tau_n)^T \in \mathbb{R}^{n \times d}$. It is easy to check that $\Upsilon^T\Upsilon = \mathrm{diag}(n\sigma_1, \ldots, n\sigma_d)$. Let $D_{d\times d} = \mathrm{diag}(\sqrt{n\sigma_1}, \ldots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD of $\Phi$ as $\Phi = U D V^T$, which is well defined.
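In the finite-dimensional case this construction is easy to verify numerically. The sketch below (random features, an illustrative assumption) builds $\tau_i$, $\Upsilon$, $D$, and $V$ exactly as described and recovers $\Phi = U D V^T$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_K = 30, 8
Phi = rng.standard_normal((d_K, n))          # column i is phi(x_i)

Sigma = Phi @ Phi.T / n
sigma, U = np.linalg.eigh(Sigma)             # ascending order
sigma, U = sigma[::-1], U[:, ::-1]           # sigma_1 >= ... >= sigma_d > 0

Tau = (U.T @ Phi).T                          # row i is tau_i: phi(x_i) = U tau_i
Upsilon = Tau
# Upsilon^T Upsilon = diag(n sigma_1, ..., n sigma_d)
print(np.allclose(Upsilon.T @ Upsilon, np.diag(n * sigma)))

D = np.diag(np.sqrt(n * sigma))
V = Upsilon @ np.linalg.inv(D)
# V has orthonormal columns and Phi = U D V^T, as claimed.
print(np.allclose(V.T @ V, np.eye(d_K)))
print(np.allclose(U @ D @ V.T, Phi))
```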
C. Proof of Consistency
C.1. Preliminaries. In order to prove Theorems 9 and 10, we use properties of Hilbert-Schmidt operators, covariance operators for Hilbert-space-valued random variables, and the perturbation theory of linear operators. In this subsection we provide a brief introduction to them; for details, see [27-29] and the references therein.
Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
\[
\|L\|_{HS}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \tag{C.1}
\]
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{HS}$.
Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
\[
\|SL\|_{HS} \le \|S\|\,\|L\|_{HS}, \qquad \|LS\|_{HS} \le \|L\|_{HS}\,\|S\|, \tag{C.2}
\]
where $\|\cdot\|$ denotes the operator norm
\[
\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \tag{C.3}
\]
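In finite dimensions the Hilbert-Schmidt norm is the Frobenius norm, so the inequalities in (C.2) can be checked directly; the random matrices below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 10
L = rng.standard_normal((p, p))
S = rng.standard_normal((p, p))

hs = lambda A: np.linalg.norm(A, 'fro')  # Hilbert-Schmidt (Frobenius) norm
op = lambda A: np.linalg.norm(A, 2)      # operator (spectral) norm

print(hs(S @ L) <= op(S) * hs(L))  # True
print(hs(L @ S) <= hs(L) * op(S))  # True
```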
Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
\[
\Sigma = \mathbb{E}\left[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)\right] \tag{C.4}
\]
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.
A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
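A small numerical illustration of this stability (the symmetric operator, its unit spectral gaps, and the perturbation level are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
# Symmetric T with well-separated (nondegenerate) eigenvalues 1, ..., p.
Q = np.linalg.qr(rng.standard_normal((p, p)))[0]
T = Q @ np.diag(np.arange(p, 0, -1.0)) @ Q.T

eps = 1e-4
E = rng.standard_normal((p, p))
E = (E + E.T) / 2
E *= eps / np.linalg.norm(E, 'fro')      # ||T_n - T||_HS = eps
Tn = T + E

w, V = np.linalg.eigh(T)
wn, Vn = np.linalg.eigh(Tn)
# Eigenvalues move by at most ||E|| (Weyl's inequality) ...
print(np.max(np.abs(wn - w)) <= eps)
# ... and, since all spectral gaps equal 1, eigenvectors move at O(eps).
v, vn = V[:, -1], Vn[:, -1]
vn = vn * np.sign(vn @ v)                # fix the sign ambiguity
print(np.linalg.norm(vn - v) <= 10 * eps)
```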
C.2. Proof of Theorem 10. We will use the following result from [14]:
\[
\|\hat\Sigma - \Sigma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \|\hat\Gamma - \Gamma\|_{HS} = O_p\!\left(\frac{1}{\sqrt{n}}\right). \tag{C.5}
\]
To simplify the notation, we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Also define
\[
T_1 = (\hat\Sigma^2 + sI)^{-1}\Sigma\Gamma, \qquad T_2 = (\Sigma^2 + sI)^{-1}\Sigma\Gamma. \tag{C.6}
\]
Then
\[
\|\hat T_s - T\|_{HS} \le \|\hat T_s - T_1\|_{HS} + \|T_1 - T_2\|_{HS} + \|T_2 - T\|_{HS}. \tag{C.7}
\]
For the first term, observe that
\[
\|\hat T_s - T_1\|_{HS} \le \left\|(\hat\Sigma^2 + sI)^{-1}\right\| \left\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\right\|_{HS} = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C.8}
\]
For the second term, note that
\[
T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \left((\hat\Sigma^2 + sI)^{-1}\Sigma u_j\right) \otimes u_j, \qquad
T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \left((\Sigma^2 + sI)^{-1}\Sigma u_j\right) \otimes u_j. \tag{C.9}
\]
Therefore,
\[
\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \left\| \left((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\right) \Sigma u_j \right\|. \tag{C.10}
\]
Since $u_j \in \mathrm{range}(\Sigma)$, there exists $h_j$ such that $u_j = \Sigma h_j$. Then
\[
\left((\hat\Sigma^2 + sI)^{-1} - (\Sigma^2 + sI)^{-1}\right)\Sigma u_j = (\hat\Sigma^2 + sI)^{-1}\left(\Sigma^2 - \hat\Sigma^2\right)(\Sigma^2 + sI)^{-1}\Sigma^2 h_j, \tag{C.11}
\]
which implies
\[
\|T_1 - T_2\|_{HS} \le \sum_{j=1}^{d_\Gamma} \tau_j \left\|(\hat\Sigma^2 + sI)^{-1}\right\| \left\|\hat\Sigma^2 - \Sigma^2\right\|_{HS} \left\|(\Sigma^2 + sI)^{-1}\Sigma^2\right\| \left\|h_j\right\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C.12}
\]
For the third term, the following holds:
\[
\|T_2 - T\|_{HS}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \left\| \left((\Sigma^2 + sI)^{-1}\Sigma - \Sigma^{-1}\right) u_j \right\|^2, \tag{C.13}
\]
and for each $j = 1, \ldots, d_\Gamma$,
\[
\begin{aligned}
\left\|(\Sigma^2 + sI)^{-1}\Sigma u_j - \Sigma^{-1}u_j\right\|
&\le \left\|(\Sigma^2 + sI)^{-1}\Sigma^2 h_j - h_j\right\|
= \left\| \sum_{i=1}^{\infty} \left(\frac{w_i^2}{s + w_i^2} - 1\right) \langle h_j, e_i\rangle\, e_i \right\| \\
&= \left( \sum_{i=1}^{\infty} \frac{s^2}{(s + w_i^2)^2}\, \langle h_j, e_i\rangle^2 \right)^{1/2} \\
&\le \frac{s}{w_N^2} \left( \sum_{i=1}^{N} \langle h_j, e_i\rangle^2 \right)^{1/2} + \left( \sum_{i=N+1}^{\infty} \langle h_j, e_i\rangle^2 \right)^{1/2} \\
&= \frac{s}{w_N^2}\left\|\Pi_N(h_j)\right\| + \left\|\Pi_N^{\perp}(h_j)\right\|,
\end{aligned} \tag{C.14}
\]
where $w_i$ and $e_i$ denote the eigenvalues and eigenvectors of $\Sigma$ and $\Pi_N$ is the orthogonal projection onto $\mathrm{span}\{e_1, \ldots, e_N\}$.
Combining these terms results in (33)
Since $\Pi_N^{\perp}(h_j) \to 0$ as $N \to \infty$, we consequently have
\[
\|\hat T_s - T\|_{HS} = o_p(1) \tag{C.15}
\]
if $s \to 0$ and $s\sqrt{n} \to \infty$.
If all the e.d.r. directions $\beta_i$ depend on only a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $S^{*} = \mathrm{span}\{\Sigma e_i,\ i = 1, \ldots, N\}$. This implies that
\[
h_j = \Sigma^{-1} u_j \in \Sigma^{-1}(S^{*}) \subset \mathrm{span}\{e_i,\ i = 1, \ldots, N\}. \tag{C.16}
\]
Therefore $\Pi_N^{\perp}(h_j) = 0$. Taking $s = O(n^{-1/4})$, the rate becomes $O_p(n^{-1/4})$.
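The two-part bound in (C.14), and why the finite-spectrum assumption removes the tail term, can be illustrated numerically; the spectrum $w_i$ and the vector $h$ below are arbitrary choices made for the sketch:

```python
import numpy as np

# Assumed spectrum w_i of Sigma and a coefficient vector h supported on
# the first N = 5 eigenvectors, so that Pi_N^perp(h) = 0.
w = 1.0 / np.arange(1, 51)
N = 5
h = np.zeros(50)
h[:N] = 1.0

def bias(s):
    # ||(Sigma^2 + s I)^{-1} Sigma^2 h - h||, computed mode by mode.
    return np.sqrt(np.sum((w**2 / (w**2 + s) - 1.0) ** 2 * h**2))

for s in [1e-1, 1e-2, 1e-3, 1e-4]:
    bound = (s / w[N - 1] ** 2) * np.linalg.norm(h[:N])  # first term of (C.14)
    print(s, bias(s) <= bound)   # the bound holds; the bias vanishes as s -> 0
```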
Acknowledgments
The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50-GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
References
[1] K. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316-342, 1991.
[2] N. Duan and K. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505-530, 1991.
[3] R. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328-332, 1991.
[4] K. C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025-1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615-625, 2007.
[6] B. Schölkopf, A. J. Smola, and K. Müller, "Kernel principal component analysis," in Proceedings of the International Conference on Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583-588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385-2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1-48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590-610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590-1603, 2009.
[12] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441-458, pp. 415-446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Basel, Switzerland, 1986.
[14] L. Ferré and A. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475-488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867-889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54-77, 2003.
[17] L. Ferré and A. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665-683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169-4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807-823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124-131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85-98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040-1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141-2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727-736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628-635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871-1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259-294, 2007.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
We need to show that the KSIR variates are
$$V_j = c_j^T \left[k(x, x_1), \dots, k(x, x_n)\right]^T = c_j^T \Phi^T \phi(x) = \langle \Phi c_j, \phi(x) \rangle = \langle \beta_j, \phi(x) \rangle. \tag{B2}$$
It suffices to prove that if $(\lambda, c)$ is a solution to (28), then $(\lambda, \beta)$ is also an eigenvalue-eigenvector pair of $(\Sigma^2 + \gamma I)^{-1} \Sigma \Gamma$, and vice versa, where $c$ and $\beta$ are related by
$$\beta = \Phi c, \qquad c = V D^{-1} U^T \beta. \tag{B3}$$
Noting the facts $\Sigma = \frac{1}{n}\Phi\Phi^T$, $\Gamma = \frac{1}{n}\Phi J \Phi^T$, and $K = \Phi^T\Phi = V D^2 V^T$, the argument may be made as follows:
$$\begin{aligned}
\Sigma\Gamma\beta = \lambda\left(\Sigma^2 + \gamma I\right)\beta
&\iff \Phi\Phi^T\Phi J \Phi^T \Phi c = \lambda\left(\Phi\Phi^T\Phi\Phi^T\Phi c + n^2\gamma\,\Phi c\right) \\
&\iff \Phi K J K c = \lambda\, \Phi \left(K^2 + n^2\gamma I\right) c \\
&\left(\iff V V^T K J K c = \lambda\, V V^T \left(K^2 + n^2\gamma I\right) c\right) \\
&\iff K J K c = \lambda \left(K^2 + n^2\gamma I\right) c.
\end{aligned} \tag{B4}$$
Note that the implication in the third step is necessary only in the $\Rightarrow$ direction, which is obtained by multiplying both sides by $V D^{-1} U^T$ and using the fact $U^T U = I_d$. For the last step, since $V^T V = I_d$, we use the following facts:
$$V V^T K = V V^T V D^2 V^T = V D^2 V^T = K, \qquad V V^T c = V V^T V D^{-1} U^T \beta = V D^{-1} U^T \beta = c. \tag{B5}$$

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define $\Phi$, $\Phi^T$, and the SVD of $\Phi$ when $d_K$ is infinite. In the infinite-dimensional case, $\Phi$ is an operator from $\mathbb{R}^n$ to $\mathcal{H}_K$ defined by $\Phi v = \sum_{i=1}^n v_i \phi(x_i)$ for $v = (v_1, \dots, v_n)^T \in \mathbb{R}^n$, and $\Phi^T$ is its adjoint, an operator from $\mathcal{H}_K$ to $\mathbb{R}^n$, such that $\Phi^T f = (\langle \phi(x_1), f \rangle_K, \dots, \langle \phi(x_n), f \rangle_K)^T$ for $f \in \mathcal{H}_K$. The notions $U$ and $U^T$ are similarly defined.

The above formulation of $\Phi$ and $\Phi^T$ coincides with the definition of $\Sigma$ as a covariance operator. Since the rank of $\Sigma$ is less than $n$, it is compact and has the representation
$$\Sigma = \sum_{i=1}^{d_K} \sigma_i\, u_i \otimes u_i = \sum_{i=1}^{d} \sigma_i\, u_i \otimes u_i, \tag{B6}$$
where $d \le n$ is the rank and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > \sigma_{d+1} = \cdots = 0$. This implies that each $\phi(x_i)$ lies in $\mathrm{span}(u_1, \dots, u_d)$, and hence we can write $\phi(x_i) = U\tau_i$, where $U = (u_1, \dots, u_d)$ should be regarded as an operator from $\mathbb{R}^d$ to $\mathcal{H}_K$ and $\tau_i \in \mathbb{R}^d$. Denote $\Upsilon = (\tau_1, \dots, \tau_n)^T \in \mathbb{R}^{n \times d}$. It is easy to check that $\Upsilon^T \Upsilon = \mathrm{diag}(n\sigma_1, \dots, n\sigma_d)$. Let $D = \mathrm{diag}(\sqrt{n\sigma_1}, \dots, \sqrt{n\sigma_d})$ and $V = \Upsilon D^{-1}$. Then we obtain the SVD of $\Phi$ as $\Phi = U D V^T$, which is well defined.
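The finite-dimensional equivalence in (B4) can be checked numerically. The sketch below is illustrative only (it is not the paper's code): it builds an explicit feature map $\phi(x) = (x, x^2, x^3)$, forms $\Sigma$, $\Gamma$, and $K$ as defined above with an arbitrary symmetric positive semidefinite matrix $J$ standing in for the slice-averaging matrix, and checks that the nonzero generalized eigenvalues of $KJKc = \lambda(K^2 + n^2\gamma I)c$ match those of $\Sigma\Gamma\beta = \lambda(\Sigma^2 + \gamma I)\beta$, with $\beta = \Phi c$ recovering a feature-space eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma = 40, 3, 1e-3

# Explicit finite-dimensional feature map: column i of Phi is phi(x_i).
x = rng.normal(size=n)
Phi = np.vstack([x, x**2, x**3])                  # p x n

# Any symmetric PSD J works for the algebraic identity; a random one
# stands in for the slice-averaging matrix of SIR.
W = rng.normal(size=(n, 5))
J = W @ W.T / n

Sigma = Phi @ Phi.T / n                           # Sigma = (1/n) Phi Phi^T
Gamma = Phi @ J @ Phi.T / n                       # Gamma = (1/n) Phi J Phi^T
K = Phi.T @ Phi                                   # Gram matrix K = Phi^T Phi

# Feature-space problem: Sigma Gamma beta = lambda (Sigma^2 + gamma I) beta
Bf = Sigma @ Sigma + gamma * np.eye(p)
lam_f = np.linalg.eigvals(np.linalg.solve(Bf, Sigma @ Gamma))
top_f = np.sort(lam_f.real)[::-1]

# Kernel-space problem: K J K c = lambda (K^2 + n^2 gamma I) c
Bk = K @ K + n**2 * gamma * np.eye(n)
lam_k, C = np.linalg.eig(np.linalg.solve(Bk, K @ J @ K))
top_k = np.sort(lam_k.real)[::-1][:p]             # only p nonzero eigenvalues

# beta = Phi c solves the feature-space problem for the same eigenvalue.
i = int(np.argmax(lam_k.real))
beta = Phi @ C[:, i].real
resid = np.linalg.norm(Sigma @ Gamma @ beta - lam_k[i].real * (Bf @ beta))
rel_resid = resid / np.linalg.norm(Sigma @ Gamma @ beta)
```

Because $KJK$ is symmetric and $K^2 + n^2\gamma I$ is symmetric positive definite, the kernel-space eigenvalues are real and nonnegative; the feature-space spectrum agrees on the nonzero part, as the proof asserts.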
C. Proof of Consistency

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use the properties of Hilbert-Schmidt operators, covariance operators for Hilbert-space-valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details, see [27–29] and the references therein.

Given a separable Hilbert space $\mathcal{H}$ of dimension $p_{\mathcal{H}}$, a linear operator $L$ on $\mathcal{H}$ is said to belong to the Hilbert-Schmidt class if
$$\|L\|_{\mathrm{HS}}^2 = \sum_{i=1}^{p_{\mathcal{H}}} \|L e_i\|_{\mathcal{H}}^2 < \infty, \tag{C1}$$
where $\{e_i\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|\cdot\|_{\mathrm{HS}}$. Given a bounded operator $S$ on $\mathcal{H}$, the operators $SL$ and $LS$ both belong to the Hilbert-Schmidt class, and the following holds:
$$\|SL\|_{\mathrm{HS}} \le \|S\|\, \|L\|_{\mathrm{HS}}, \qquad \|LS\|_{\mathrm{HS}} \le \|L\|_{\mathrm{HS}}\, \|S\|, \tag{C2}$$
where $\|\cdot\|$ denotes the default operator norm
$$\|L\|^2 = \sup_{f \in \mathcal{H}} \frac{\|Lf\|^2}{\|f\|^2}. \tag{C3}$$

Let $Z$ be a random vector taking values in $\mathcal{H}$ satisfying $\mathbb{E}\|Z\|^2 < \infty$. The covariance operator
$$\Sigma = \mathbb{E}\left[(Z - \mathbb{E}Z) \otimes (Z - \mathbb{E}Z)\right] \tag{C4}$$
is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from the perturbation theory for linear operators states that if a sequence of linear operators $T_n$ converges to $T$ in the Hilbert-Schmidt norm and the eigenvalues of $T$ are nondegenerate, then the eigenvalues and eigenvectors of $T_n$ converge to those of $T$ at the same rate as the convergence of the operators.
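In finite dimension the Hilbert-Schmidt norm is simply the Frobenius norm, so the properties above can be checked directly. The snippet below is an illustrative sanity check (not from the paper): it verifies (C1), the basis independence of $\|\cdot\|_{\mathrm{HS}}$, the inequalities (C2), and that an empirical version of the covariance operator (C4) is self-adjoint and positive.

```python
import numpy as np

rng = np.random.default_rng(1)
pH = 6
L = rng.normal(size=(pH, pH))
S = rng.normal(size=(pH, pH))

hs = lambda A: np.linalg.norm(A, 'fro')    # Hilbert-Schmidt norm, as in (C1)
op = lambda A: np.linalg.norm(A, 2)        # operator (spectral) norm, (C3)

# (C1): sum of ||L e_i||^2 over the standard orthonormal basis
sum_sq = sum(np.linalg.norm(L[:, i]) ** 2 for i in range(pH))

# Basis independence: any orthonormal basis Q gives the same value.
Q, _ = np.linalg.qr(rng.normal(size=(pH, pH)))
hs_Q = hs(L @ Q)

# (C4): empirical covariance operator of centered H-valued samples Z
Z = rng.normal(size=(500, pH)) @ rng.normal(size=(pH, pH))
Zc = Z - Z.mean(axis=0)
Sigma = Zc.T @ Zc / len(Z)                 # symmetric and positive semidefinite
```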
C.2. Proof of Theorem 10. We will use the following result from [14]:
$$\left\|\hat\Sigma - \Sigma\right\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \left\|\hat\Gamma - \Gamma\right\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{\sqrt{n}}\right). \tag{C5}$$
To simplify the notation, we denote $\hat T_s = (\hat\Sigma^2 + sI)^{-1}\hat\Sigma\hat\Gamma$. Here $T = \Sigma^{-1}\Gamma$, and $u_j$, $\tau_j$ ($j = 1, \dots, d_\Gamma$) denote the eigenvectors and eigenvalues of $\Gamma$. Also define
$$T_1 = \left(\hat\Sigma^2 + sI\right)^{-1}\Sigma\Gamma, \qquad T_2 = \left(\Sigma^2 + sI\right)^{-1}\Sigma\Gamma. \tag{C6}$$
Then
$$\left\|\hat T_s - T\right\|_{\mathrm{HS}} \le \left\|\hat T_s - T_1\right\|_{\mathrm{HS}} + \left\|T_1 - T_2\right\|_{\mathrm{HS}} + \left\|T_2 - T\right\|_{\mathrm{HS}}. \tag{C7}$$
For the first term, observe that
$$\left\|\hat T_s - T_1\right\|_{\mathrm{HS}} \le \left\|\left(\hat\Sigma^2 + sI\right)^{-1}\right\| \left\|\hat\Sigma\hat\Gamma - \Sigma\Gamma\right\|_{\mathrm{HS}} = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C8}$$
For the second term, note that
$$T_1 = \sum_{j=1}^{d_\Gamma} \tau_j \left(\left(\hat\Sigma^2 + sI\right)^{-1}\Sigma u_j\right) \otimes u_j, \qquad T_2 = \sum_{j=1}^{d_\Gamma} \tau_j \left(\left(\Sigma^2 + sI\right)^{-1}\Sigma u_j\right) \otimes u_j. \tag{C9}$$
Therefore,
$$\left\|T_1 - T_2\right\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \left\|\left(\left(\hat\Sigma^2 + sI\right)^{-1} - \left(\Sigma^2 + sI\right)^{-1}\right)\Sigma u_j\right\|. \tag{C10}$$
Since $u_j \in \mathrm{range}(\Sigma)$, there exists $\ell_j$ such that $u_j = \Sigma \ell_j$. Then
$$\left(\left(\hat\Sigma^2 + sI\right)^{-1} - \left(\Sigma^2 + sI\right)^{-1}\right)\Sigma u_j = \left(\hat\Sigma^2 + sI\right)^{-1}\left(\Sigma^2 - \hat\Sigma^2\right)\left(\Sigma^2 + sI\right)^{-1}\Sigma^2 \ell_j, \tag{C11}$$
which implies
$$\left\|T_1 - T_2\right\|_{\mathrm{HS}} \le \sum_{j=1}^{d_\Gamma} \tau_j \left\|\left(\hat\Sigma^2 + sI\right)^{-1}\right\| \left\|\Sigma^2 - \hat\Sigma^2\right\|_{\mathrm{HS}} \left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma^2\right\| \left\|\ell_j\right\| = O_p\!\left(\frac{1}{s\sqrt{n}}\right). \tag{C12}$$
For the third term, the following holds:
$$\left\|T_2 - T\right\|_{\mathrm{HS}}^2 = \sum_{j=1}^{d_\Gamma} \tau_j^2 \left\|\left(\left(\Sigma^2 + sI\right)^{-1}\Sigma - \Sigma^{-1}\right)u_j\right\|^2, \tag{C13}$$
and for each $j = 1, \dots, d_\Gamma$, writing $\Sigma e_i = w_i e_i$,
$$\begin{aligned}
\left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma u_j - \Sigma^{-1}u_j\right\|
&= \left\|\left(\Sigma^2 + sI\right)^{-1}\Sigma^2 \ell_j - \ell_j\right\| \\
&= \left\|\sum_{i=1}^\infty \left(\frac{w_i^2}{s + w_i^2} - 1\right) \langle \ell_j, e_i \rangle\, e_i\right\| \\
&= \left(\sum_{i=1}^\infty \frac{s^2}{\left(s + w_i^2\right)^2} \langle \ell_j, e_i \rangle^2\right)^{1/2} \\
&\le \frac{s}{w_N^2}\left(\sum_{i=1}^N \langle \ell_j, e_i \rangle^2\right)^{1/2} + \left(\sum_{i=N+1}^\infty \langle \ell_j, e_i \rangle^2\right)^{1/2} \\
&= \frac{s}{w_N^2}\left\|\Pi_N(\ell_j)\right\| + \left\|\Pi_N^\perp(\ell_j)\right\|.
\end{aligned} \tag{C14}$$
Combining these terms results in (33).

Since $\|\Pi_N^\perp(\ell_j)\| \to 0$ as $N \to \infty$, we consequently have
$$\left\|\hat T_s - T\right\|_{\mathrm{HS}} = o_p(1) \tag{C15}$$
if $s \to 0$ and $s\sqrt{n} \to \infty$.

If all the e.d.r. directions $\beta_i$ depend only on a finite number of eigenvectors of the covariance operator, then there exists some $N > 1$ such that $S^* = \mathrm{span}\{\Sigma e_i,\ i = 1, \dots, N\}$. This implies that
$$\ell_j = \Sigma^{-1} u_j \in \Sigma^{-1}(S^*) \subset \mathrm{span}\{e_i,\ i = 1, \dots, N\}. \tag{C16}$$
Therefore $\Pi_N^\perp(\ell_j) = 0$. Taking $s = O(n^{-1/4})$, the rate is $O_p(n^{-1/4})$.
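The root-$n$ rate (C5) that drives the proof can be illustrated by a small simulation. The sketch below is a finite-dimensional illustration (not from the paper): the averaged Hilbert-Schmidt error of the empirical covariance operator should roughly halve each time the sample size is multiplied by four, consistent with $O_p(1/\sqrt{n})$.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
A = rng.normal(size=(p, p))
Sigma_true = A @ A.T / p                       # population covariance operator

def hs_err(n, reps=50):
    """Average Hilbert-Schmidt (Frobenius) error of the empirical covariance."""
    errs = []
    for _ in range(reps):
        Z = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)
        Zc = Z - Z.mean(axis=0)
        errs.append(np.linalg.norm(Zc.T @ Zc / n - Sigma_true, 'fro'))
    return np.mean(errs)

errs = {n: hs_err(n) for n in (100, 400, 1600, 6400)}
# Successive ratios should hover near 2 if the error decays like n^{-1/2}.
ratios = [errs[100] / errs[400], errs[400] / errs[1600], errs[1600] / errs[6400]]
```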
Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50GM081883). Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.
References

[1] K.-C. Li, "Sliced inverse regression for dimension reduction (with discussion)," Journal of the American Statistical Association, vol. 86, no. 414, pp. 316–342, 1991.
[2] N. Duan and K.-C. Li, "Slicing regression: a link-free regression method," The Annals of Statistics, vol. 19, no. 2, pp. 505–530, 1991.
[3] R. D. Cook and S. Weisberg, "Comment," Journal of the American Statistical Association, vol. 86, no. 414, pp. 328–332, 1991.
[4] K.-C. Li, "On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma," Journal of the American Statistical Association, vol. 87, no. 420, pp. 1025–1039, 1992.
[5] L. Li, R. D. Cook, and C.-L. Tsai, "Partial inverse regression," Biometrika, vol. 94, no. 3, pp. 615–625, 2007.
[6] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Kernel principal component analysis," in Proceedings of the Artificial Neural Networks (ICANN '97), W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, Eds., vol. 1327 of Lecture Notes in Computer Science, pp. 583–588, Springer, Lausanne, Switzerland, October 1997.
[7] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[8] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[9] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, no. 1, pp. 1–48, 2002.
[10] H.-M. Wu, "Kernel sliced inverse regression with applications to classification," Journal of Computational and Graphical Statistics, vol. 17, no. 3, pp. 590–610, 2008.
[11] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, "Nonlinear dimension reduction with kernel sliced inverse regression," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1590–1603, 2009.
[12] J. Mercer, "Functions of positive and negative type and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London A, vol. 209, no. 441–458, pp. 415–446, 1909.
[13] H. König, Eigenvalue Distribution of Compact Operators, vol. 16 of Operator Theory: Advances and Applications, Birkhäuser, Basel, Switzerland, 1986.
[14] L. Ferré and A.-F. Yao, "Functional sliced inverse regression analysis," Statistics, vol. 37, no. 6, pp. 475–488, 2003.
[15] P. Hall and K.-C. Li, "On almost linearity of low-dimensional projections from high-dimensional data," The Annals of Statistics, vol. 21, no. 2, pp. 867–889, 1993.
[16] G. He, H.-G. Müller, and J.-L. Wang, "Functional canonical analysis for square integrable stochastic processes," Journal of Multivariate Analysis, vol. 85, no. 1, pp. 54–77, 2003.
[17] L. Ferré and A.-F. Yao, "Smoothed functional inverse regression," Statistica Sinica, vol. 15, no. 3, pp. 665–683, 2005.
[18] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu, "RSIR: regularized sliced inverse regression for motif discovery," Bioinformatics, vol. 21, no. 22, pp. 4169–4175, 2005.
[19] L. Ferré and N. Villa, "Multilayer perceptron with functional inputs: an inverse regression approach," Scandinavian Journal of Statistics, vol. 33, no. 4, pp. 807–823, 2006.
[20] L. Li and X. Yin, "Sliced inverse regression with regularizations," Biometrics, vol. 64, no. 1, pp. 124–131, 2008.
[21] C. Bernard-Michel, L. Gardes, and S. Girard, "Gaussian regularized sliced inverse regression," Statistics and Computing, vol. 19, no. 1, pp. 85–98, 2009.
[22] T. Hsing and R. J. Carroll, "An asymptotic theory for sliced inverse regression," The Annals of Statistics, vol. 20, no. 2, pp. 1040–1061, 1992.
[23] J. Saracco, "An asymptotic theory for sliced inverse regression," Communications in Statistics, vol. 26, no. 9, pp. 2141–2171, 1997.
[24] L. X. Zhu and K. W. Ng, "Asymptotics of sliced inverse regression," Statistica Sinica, vol. 5, no. 2, pp. 727–736, 1995.
[25] T. Kurita and T. Taguchi, "A kernel-based Fisher discriminant analysis for face detection," IEICE Transactions on Information and Systems, vol. 88, no. 3, pp. 628–635, 2005.
[26] K. Fukumizu, F. R. Bach, and M. I. Jordan, "Kernel dimension reduction in regression," The Annals of Statistics, vol. 37, no. 4, pp. 1871–1905, 2009.
[27] T. Kato, Perturbation Theory for Linear Operators, Springer, Berlin, Germany, 1966.
[28] F. Chatelin, Spectral Approximation of Linear Operators, Academic Press, 1983.
[29] G. Blanchard, O. Bousquet, and L. Zwald, "Statistical properties of kernel principal component analysis," Machine Learning, vol. 66, no. 2-3, pp. 259–294, 2007.
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of