Projection Pursuit Indices Based On Orthonormal Function Expansions

Dianne Cook†‡, Andreas Buja†, Javier Cabrera‡

† Bellcore, 445 South St, Morristown, NJ 07962-1910
‡ Dept of Statistics, Hill Cntr, Busch Campus, Rutgers University, New Brunswick, NJ 08904
Abstract
Projection pursuit describes a procedure for searching high dimensional data for "interesting" low dimensional projections, via the optimization of a criterion function called the projection pursuit index. By empirically examining the optimization process for several projection pursuit indices, we observed differences in the types of structure that maximized each index. We were especially curious about differences between two indices based on expansions in terms of orthogonal polynomials, the Legendre index (Friedman, 1987) and the Hermite index (Hall, 1989). Being fast to compute, these indices are ideally suited for dynamic graphics implementations.
Both the Legendre and Hermite indices are weighted $L_2$-distances between the density of the projected data and a standard normal density. A general form for this type of index is introduced, which encompasses both of these. The form clarifies the effects of the weight function on the index's sensitivity to differences from normality, highlighting some conceptual problems with the Legendre and Hermite indices. A new index, called the Natural Hermite index, which alleviates some of these problems, is introduced.
As proposed by Friedman (1987) and also used by Hall (1989), a polynomial expansion of the data density reduces the form of the index to a sum of squares of the coefficients used in the expansion. This drew our attention to examining these coefficients as indices in their own right. We found that the first two coefficients, and the lowest order indices produced by them, are the most useful ones for practical data exploration, since they respond to structure that can be analytically identified, and because they have "long-sighted" vision which enables them to "see" large structure from a distance. Complementing this low order behaviour, the higher order indices are "short-sighted". They are able to "see" intricate structure but only when close to it.
We also show some practical use of projection pursuit using the polynomial indices,
including a discovery of previously unseen structure in a set of telephone usage data, and
two cautionary examples which illustrate that structure found is not always meaningful.
1 Introduction
The term "projection pursuit" was coined by Friedman and Tukey (1974) to describe a procedure for searching high (p) dimensional data for "interesting" low (k = 1 or 2 usually, maybe 3) dimensional projections. The procedure, originally suggested by Kruskal (1969), involves defining a criterion function, or index, which measures the "interestingness" of each k-dimensional projection of p-dimensional data. This criterion function is optimized over the space of all k-dimensional projections of p-space, searching for both global and local maxima. The resulting solutions hopefully reveal low dimensional structure in the data not necessarily found by methods such as principal component analysis.
Searching for interesting projections can be equated to searching for the most non-normal projections. One reason is that for many high dimensional data sets most projections look approximately Gaussian, so to find the unusual projections one should search for non-normality (see Andrews et al (1971) and Diaconis and Freedman (1984) for further discussion). Huber (1985) equates interesting to structured or non-random, and discusses entropy (measured by $\int f(x) \log f(x)\,dx$, for example) as a measure of randomness: the lower the value, the more randomness. Notably, this also suggests searching away from normality because entropy, as defined above with fixed scale, is minimized by the Gaussian distribution.
Using normality as the null model is highlighted in indices proposed by Jones (1983, e.g., moment index) and Huber (1985, e.g., negative Shannon entropy, Fisher information). This approach also suggests discarding structure such as location, scale and covariance, which are found reasonably well with more conventional multivariate methods, by sphering the data before beginning projection pursuit. Consequently we have a framework for considering a family of projection pursuit indices based on a description of the data in terms of an $\mathbb{R}^p$-valued random vector, $Z$, satisfying $EZ = 0$ and $\mathrm{Cov}\,Z = I_p$.
We want to construct a k-dimensional projection pursuit index, that is, a real-valued function on the set of all k-dimensional projections of $\mathbb{R}^p$. For simplicity, let k = 1, so we consider all 1-dimensional projections of Z,
$$Z \longrightarrow X = \alpha' Z \in \mathbb{R} \qquad (\alpha \in S^{p-1}),$$
where $S^{p-1}$ is the unit $(p-1)$-sphere in $\mathbb{R}^p$, and X is a real-valued random variable. (We use $\alpha \in S^{p-1}$ because only direction is important and we search over all possible directions in $\mathbb{R}^p$.) In the null case, if $Z \stackrel{dis}{\sim} N(0, I_p)$ then $X \stackrel{dis}{\sim} N(0, 1)$. Let the random variable X have distribution function F(x) and density f(x). An index, I, can be constructed by measuring the distance of f(x) from the standard normal density.
A practical index of this kind was proposed by Friedman (1987), although he detoured from the above route by first mapping X into the bounded interval $[-1, 1]$ by the transformation $Y = 2\Phi(X) - 1$, where $\Phi$ is the distribution function of a standard normal. By doing this he hoped to concentrate attention on differences in the center, producing an index robust to tail fluctuations. In the null case, if $X \stackrel{dis}{\sim} N(0, 1)$ then $Y \stackrel{dis}{\sim} U[-1, 1]$. In general, let Y have distribution function G(y) and density g(y). Friedman's proposed index is an $L_2$-distance of g(y) from the density of $U[-1, 1]$:
$$I_L = \int_{-1}^{1} \left\{ g(y) - \frac{1}{2} \right\}^2 dy$$
We call this the Legendre index because g(y) will later be expanded in terms of the natural polynomial basis with respect to $U[-1, 1]$, namely the Legendre polynomials.
This is the starting point for the work presented in this paper, but before we continue we note two basic details about the use of an index, I, in projection pursuit:
1. I is a functional of f (in Friedman's case $I_L$ is a functional of g).
2. f (or g) depends on the projection vector, $\alpha$, so that projection pursuit entails the search for local maxima of I over all possible $\alpha$.
2 Transformation Approach
Friedman's detour can be generalized by considering an arbitrary strictly monotone and smooth transformation $T: \mathbb{R} \to \mathbb{R}$ on the random variable X, so that $Y = T(X)$. Then if X has distribution function F(x) and density f(x), let Y have distribution function G(y) and density g(y). Given that the null version of the density f(x) is $\phi(x)$, we denote the null version of the density g(y) by $\gamma(y)$. A general family of indices is now defined by
$$I = \int_{\mathbb{R}} \{g(y) - \gamma(y)\}^2\,\gamma(y)\,dy \qquad (1)$$
which specializes to $\frac{1}{2} I_L$ for $T(X) = 2\Phi(X) - 1$. (Note that we integrate with respect to $\gamma(y)\,dy$, which becomes $\frac{1}{2}dy$ for the Legendre index, $I_L$.) This family of indices incorporates the idea of a distance computation between "observed" f(x) or g(y) and "null" $\phi(x)$ or $\gamma(y)$ under the assumption of the null distribution. The transformation, T, can be considered to transform the index into a form suitable for estimation by an alternative orthonormal basis (see below), and to adjust the index's sensitivities to particular structures.
In somewhat reverse logic, now start with I, defined in its transformed state, and map it back through the inverse transformation:
$$I = \int_{\mathbb{R}} \left\{ \frac{f(x)}{T'(x)} - \frac{\phi(x)}{T'(x)} \right\}^2 \phi(x)\,dx
= \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\, \frac{\phi(x)}{T'(x)^2}\,dx \qquad (2)$$
This form clearly shows that the index is a weighted distance between f(x) and a standard normal density, with weighting function $\phi(x)/T'(x)^2$.
Using this formulation, the Legendre index, $I_L$, proposed by Friedman (1987) becomes:
$$I_L = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\, \frac{1}{2\phi(x)}\,dx, \qquad (3)$$
since $T(X) = 2\Phi(X) - 1 \Rightarrow T'(X) = 2\phi(X)$. Ironically, the mapping proposed by Friedman to reduce the influence of tail fluctuations does exactly the opposite. The term $1/\phi(x)$ effectively upweights tail observations, leaving the Legendre index very sensitive to differences from normality in the tails of f(x). This is more a conceptual stumbling block than a practical deficiency because the problem is somewhat moot for finite function expansions. Just the same, equation (3) illustrates an unintended effect of an otherwise innocuous-looking data transformation.
Through different considerations, Hall (1989) also observed the same phenomenon of upweighted tails in the Legendre index. It motivated him to propose an alternative index that measures the $L_2$-distance between f(x) and the standard normal density with respect to Lebesgue measure:
$$I_H = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\,dx$$
Interestingly, this index is also a member of the family of indices (1), as it can be obtained through a suitable transformation. Equating the implicit weight 1 with $\phi(x)/T'(x)^2$ in (2), we find $T'(x)^2 = \phi(x)$, or $T'(x) = \sqrt{\phi(x)}$, and hence $T(X) \propto \Phi_{\sigma=\sqrt{2}}(X)$. Such a transformation seems unnatural at first, and may not contribute any additional insight beyond the obvious one that $I_H$ gives equal weight to all differences along $\mathbb{R}$. Aside from this, Hall's motivation for the design of the index is from an established approach in density estimation.
We return then to Friedman's original idea of giving more weight to differences in the center. Going a step beyond Hall's approach, we propose to use $T(X) = X$, the identity transformation, giving:
$$I_N = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\,\phi(x)\,dx \qquad (4)$$
We call this index the "Natural Hermite" index, and Hall's index the "Hermite" index, because both use Hermite polynomials in the expansion of f(x), but $I_N$ is "natural" because the distance from the normal density is taken with respect to normal measure.
The class of T that we have allowed is flexibly broad, to entertain various constructions which may not be entirely sensible for practical purposes. We make use only of the following one-parameter family of transformations,
$$T_\alpha(X) = \sqrt{2\pi}\,\alpha\left(\Phi_{\sigma=\alpha}(X) - \tfrac{1}{2}\right)$$
which allows us to see the three indices in a natural order. The elements of the family are scaled to achieve $T'_\alpha(0) = 1$ for all $\alpha > 0$. The limit for $\alpha \to \infty$ is $T_{\alpha\to\infty}(X) = X$.
For $T_{\alpha=1}$ we get essentially the Legendre index, $I_L$; for $T_{\alpha=\sqrt{2}}$ we have an index proportional to the Hermite, $I_H$; and for $T_\infty$ we get the Natural Hermite index, $I_N$. The proper way to interpret these transformations is according to their ability to thin out the tail weight for increasing $\alpha$. The smaller $\alpha$, the more the tail weight is inflated and allowed to exert influence on the projection pursuit index.

As mentioned above, the problem of tail weight is more conceptual than practical. The upshot of this section then is conceptual consistency in a framework that allowed us to devise a new index which is simple and more radical in its treatment of tail weight. In the following sections, we will mostly work with our new index and give cursory attention to the Legendre and Hermite indices when appropriate.
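The tail-weight ordering can be checked numerically. The sketch below is our own illustration (not code from the paper): it evaluates the implied weight function $\phi(x)/T'_\alpha(x)^2 = \phi(x)e^{x^2/\alpha^2}$, using $T'_\alpha(x) = e^{-x^2/(2\alpha^2)}$, for the three values of $\alpha$:

```python
import numpy as np

def phi(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def weight(x, alpha):
    """Weight phi(x)/T_alpha'(x)^2, with T_alpha'(x) = exp(-x^2/(2*alpha**2))."""
    return phi(x) * np.exp(x**2 / alpha**2)

x = np.array([0.0, 3.0])            # compare the center with a tail point
w_leg = weight(x, 1.0)              # alpha = 1: proportional to 1/phi(x)
w_her = weight(x, np.sqrt(2.0))     # alpha = sqrt(2): constant (Lebesgue) weight
w_nat = phi(x)                      # alpha -> infinity: normal measure

print(w_leg[1] / w_leg[0])  # much greater than 1: tails upweighted
print(w_her[1] / w_her[0])  # exactly 1: flat weight
print(w_nat[1] / w_nat[0])  # much less than 1: tails downweighted
```

The three ratios make the ordering concrete: the Legendre-type weight explodes in the tails, the Hermite weight is flat, and the Natural Hermite weight decays with $\phi$.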
3 Density Estimation
For the purposes of projection pursuit index estimation, the empirical data distribution needs to be mapped into a density estimate to which the definition of an index can be applied. A natural approach in this context is polynomial expansion (Friedman, 1987). Density estimates obtained in this way are usually not very pleasing for graphical purposes since the non-negativity constraint is impossible to enforce. However, in the present context no graphical presentation of such estimates is intended. In addition, considerable analytical simplicity and computational efficiency is achieved by this approach.
In each of the indices described above, the density f(x) (or g(y) in the transformed version) is expanded using orthonormal functions:
$$f(x) = \sum_{i=0}^{\infty} a_i\,p_i(x)$$
In the Natural Hermite index, $\{p_i(x),\ i = 0, 1, \dots\}$ is the set of standardized Hermite polynomials orthonormal with respect to (on. wrt) $\phi(x)$. (Note that $\phi(x)$ is also called the weight function of the polynomial basis. In the notation of Thisted (1988), $p_i(x) = (i!)^{-1/2} He_i(x)$; $p_0(x) = 1$, $p_1(x) = x$, $p_2(x) = (x^2 - 1)/\sqrt{2}$. The subscript "e" is a convention used to distinguish this Hermite polynomial basis from the basis on. wrt $\phi^2(x)$.)
In addition, to estimate $I_N$, $\phi(x)$ is expanded as $\sum_{i=0}^{\infty} b_i\,p_i(x)$. Inserting both expansions into $I_N$ (4) gives:
$$I_N = \int_{\mathbb{R}} \left\{ \sum_{i=0}^{\infty} a_i\,p_i(x) - \sum_{i=0}^{\infty} b_i\,p_i(x) \right\}^2 \phi(x)\,dx
= \int_{\mathbb{R}} \left\{ \sum_{i=0}^{\infty} (a_i - b_i)\,p_i(x) \right\}^2 \phi(x)\,dx
= \sum_{i=0}^{\infty} (a_i - b_i)^2$$
since the $p_i$'s are on. wrt $\phi(x)$.
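The orthonormality that collapses the integral to a sum of squares can be verified numerically. The sketch below is ours, using NumPy's HermiteE utilities (the `hermegauss` nodes integrate exactly against the weight $e^{-x^2/2}$, which becomes $\phi$ after dividing by $\sqrt{2\pi}$):

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

# Gauss quadrature for the weight exp(-x^2/2); dividing by sqrt(2*pi)
# turns it into integration against the standard normal density phi.
nodes, wts = He.hermegauss(30)

def p(i, x):
    """Standardized Hermite polynomial p_i(x) = He_i(x)/sqrt(i!)."""
    c = np.zeros(i + 1)
    c[i] = 1.0
    return He.hermeval(x, c) / math.sqrt(math.factorial(i))

# Gram matrix: the integral of p_i(x) p_j(x) phi(x) dx should be the identity.
gram = np.array([[np.sum(wts * p(i, nodes) * p(j, nodes)) / math.sqrt(2 * math.pi)
                  for j in range(5)] for i in range(5)])
print(np.round(gram, 6))
```

With 30 quadrature nodes the rule is exact for the polynomial degrees involved, so the Gram matrix comes out as the identity to floating-point accuracy.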
The Fourier coefficients of the expansion, $a_i$ and $b_i$, are as follows:
$$a_i = \int_{\mathbb{R}} f(x)\,p_i(x)\,\phi(x)\,dx = \int_{\mathbb{R}} p_i(x)\,\phi(x)\,dF(x)$$
$$b_i = \int_{\mathbb{R}} \phi(x)\,p_i(x)\,\phi(x)\,dx$$
The coefficients, $b_i$, can be analytically calculated from Abramowitz and Stegun (1972, 22.5.18, 22.5.19):
$$b_{2i} = \frac{(-1)^i}{\sqrt{\pi}}\,\frac{((2i)!)^{1/2}}{i!}\,\frac{1}{2^{2i+1}}, \qquad b_{2i+1} = 0, \quad i = 0, 1, 2, \dots$$
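As a sanity check on the closed form, the $b_i$ can also be computed by quadrature. This is our own sketch (not from the paper), again using NumPy's HermiteE module:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

nodes, wts = He.hermegauss(40)  # quadrature for the weight exp(-x^2/2)

def p(i, x):
    """Standardized Hermite polynomial p_i(x) = He_i(x)/sqrt(i!)."""
    c = np.zeros(i + 1)
    c[i] = 1.0
    return He.hermeval(x, c) / math.sqrt(math.factorial(i))

phi = lambda x: np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# Numerical b_i = integral of phi(x) p_i(x) phi(x) dx.
b_num = [np.sum(wts * phi(nodes) * p(i, nodes)) / math.sqrt(2 * math.pi)
         for i in range(4)]

def b_closed(i):
    """Closed form from Abramowitz and Stegun (22.5.18, 22.5.19)."""
    if i % 2 == 1:
        return 0.0
    k = i // 2
    return ((-1)**k / math.sqrt(math.pi)
            * math.sqrt(math.factorial(2 * k)) / math.factorial(k)
            / 2**(2 * k + 1))

print(b_num)  # b_0 = 1/(2 sqrt(pi)) ~ 0.2821, b_1 = 0, b_2 ~ -0.0997
```

The quadrature is no longer exact here (the integrand contains an extra Gaussian factor), but 40 nodes agree with the closed form to many digits.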
Because of its dependence on f, the coefficient $a_i$ is unknown and must be estimated in order to estimate $I_N$. Reinterpreting $a_i$ as an expectation,
$$a_i = E_F\{p_i(X)\,\phi(X)\}$$
leads to the obvious sample estimate:
$$\hat a_i = \frac{1}{n} \sum_{j=1}^{n} p_i(x_j)\,\phi(x_j)$$
The index $I_N$ is estimated by using $\hat a_i$ and truncating the sum at M terms,
$$\hat I_M^N = \sum_{i=0}^{M} (\hat a_i - b_i)^2.$$
The asymptotic theory for the choice of M as a function of sample size for $I_L$ and $I_H$ is the subject of Hall's (1989) paper. Using Hall's approach we find that the choice of M for $I_N$ is the same as that for $I_H$. Note also that the truncation at M constitutes a smoothing of the true index, $I_N$.
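The estimator is straightforward to implement. The sketch below is our own minimal version: for a sample with equal mass at $\pm 1$ (the type CH distribution of Section 5.1), $\hat a_0 = \phi(1) = 1/\sqrt{2\pi e}$, so the order-0 index equals $(1/\sqrt{2\pi e} - 1/(2\sqrt{\pi}))^2 \approx 0.0016$:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def p(i, x):
    """Standardized Hermite polynomial p_i(x) = He_i(x)/sqrt(i!)."""
    c = np.zeros(i + 1)
    c[i] = 1.0
    return He.hermeval(x, c) / math.sqrt(math.factorial(i))

def b(i):
    """Null-case coefficients b_i (Abramowitz and Stegun 22.5.18-19)."""
    if i % 2 == 1:
        return 0.0
    k = i // 2
    return ((-1)**k / math.sqrt(math.pi)
            * math.sqrt(math.factorial(2 * k)) / math.factorial(k)
            / 2**(2 * k + 1))

def natural_hermite_index(x, M):
    """Truncated Natural Hermite index: sum over i <= M of (a_hat_i - b_i)^2."""
    x = np.asarray(x, dtype=float)
    phi = np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
    return sum((np.mean(p(i, x) * phi) - b(i))**2 for i in range(M + 1))

# A type CH sample (equal mass at +-1) versus a central-mass sample (all at 0).
i_ch = natural_hermite_index([-1.0, 1.0], M=0)
i_cm = natural_hermite_index([0.0, 0.0], M=0)
print(i_ch, i_cm)
```

The two values anticipate Section 5.1: both departures from normality register, with the central mass giving the larger order-0 index.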
The approximations in both the Legendre and Hermite indices are similarly constructed. In the Legendre index, the expansion is made on g(y), after X is transformed from $\mathbb{R}$ to $[-1, 1]$ by $Y = 2\Phi(X) - 1$, with $\{p_i(y),\ i = 0, 1, \dots\}$ being the set of standardized Legendre polynomials (Friedman, 1987). The Hermite index uses Hermite polynomials on. wrt $\phi^2(x)$ (Hall, 1989).
4 Structure Detection
Our interest in the structure sensitivity of the indices stems from implementing projection pursuit dynamically (Cook et al, 1991) in XGobi, which is dynamic graphics software being developed by Swayne et al (1991). Included in the implementation are controls for steepest ascent optimization of a variety of 2-dimensional projection pursuit indices. The main feature is that the procedure is visualized by sequential plotting of the projected data as the optimization ensues, and the interactive nature of the implementation enables the optimization to be readily started from multiple points.

For many long but interesting hours we watched projections of various types of data as they were steered into local maxima of different projection pursuit indices. In the course we found that indices truncated as low as M = 0 or M = 1 were the most interesting and also useful. This flies in the face of the natural idea that M should be chosen as large as possible, within limits dictated by the sample size. The usefulness of these low order indices arises from a "long-sighted" quality which enables them to see large structure, such as clusters, from a distance. Specifically, we found that low order Hermite and Natural Hermite indices often find projections with a "hole" in the center, whilst the low order Legendre index tends to find projections containing skewness. Higher order (4, 5, ...) indices lose the long-sightedness and become "short-sighted". They are receptive to finer detail in the projected data although they need starting points much closer to the structure to find it. The intrigue induced by observing these behaviours led to the qualitative results in the next few sections.
5 One-dimensional Index
For simplicity we begin with the 1-dimensional index, using the Natural Hermite as an example and then extending the results to both the Legendre and Hermite indices. We are interested in maxima of $I_0^N = (a_0 - b_0)^2$, $I_1^N = (a_0 - b_0)^2 + (a_1 - b_1)^2$ and its second term $(a_1 - b_1)^2$. Because each $(a_i - b_i)^2$ involves a quadratic in $a_i$, it is maximized by minimizing or maximizing $a_i$. Write $a_i$ as $E_F\{p_i(X)\phi(X)\}$ and the problem reduces to finding the types of distribution functions, F(x), which minimize or maximize this expectation.
Now F(x) needs to be absolutely continuous for the integral form (4) of the index, $I_N$, to exist, but once the expansion is truncated this restriction may be dropped and F(x) may be discrete. In the execution of projection pursuit, F(x) is restricted to the set of distribution functions of all 1-dimensional projections of Z:
$$F(x) \in \mathcal{F}_Z = \{F(x) : X = \alpha' Z,\ \alpha \in S^{p-1}\}$$
and $EZ = 0$ and $\mathrm{Cov}\,Z = I_p$ are assumed as usual. However, to understand the general types of distributions to which $a_i$ responds, consider F(x) belonging to the expanded set:
$$\mathcal{F} = \{F(x) : E_F X = 0,\ E_F X^2 = 1\}$$
Now $\mathcal{F}$ is convex, but it is not closed in the weak topology. In order to consider minimizing or maximizing $a_i$ over $\mathcal{F}$ we need to consider all F(x) in the weak closure of $\mathcal{F}$, $\bar{\mathcal{F}}$, which happens to be:
$$\bar{\mathcal{F}} = \{F(x) : E_F X = 0,\ \mathrm{Var}_F X \le 1\} \qquad (*)$$
This set is furthermore weakly compact (**). (Proofs of these two statements and Lemma 5.1, Propositions 5.2 and 5.3 are left to the Appendix.) Since $\bar{\mathcal{F}}$ is also convex, this is a natural domain for optimizing the projection pursuit coefficients, $a_i$. They are weakly continuous linear functionals, and their extrema in $\bar{\mathcal{F}}$ are taken on at the extremal elements:

Lemma 5.1: The extremal elements of $\bar{\mathcal{F}}$ are the union of:
(i) the 3-point masses, F, satisfying $E_F X = 0$, $E_F X^2 = 1$;
(ii) the 2-point masses, F, satisfying $E_F X = 0$, $E_F X^2 \le 1$;
(iii) the 1-point mass $F = \delta_0$
(where $\delta_x$ denotes a unit point mass at x).

Thus, the set of extremals is not very expressive, but it is sufficient to give insight into the behaviours of the lower order coefficients, $a_0$ and $a_1$. While it is true that $a_i$ takes on its extrema on these elements for all $i = 0, 1, 2, 3, \dots$, none of its ability to respond to high frequency structure in F is revealed by studying this extremal behaviour for larger orders of i.
5.1 Truncation at First Term - $I_0^N$

Consider the simplest but, in our experience, most useful index $I_0^N = (a_0 - b_0)^2$, where
$$a_0 = \int_{\mathbb{R}} \phi(x)\,dF(x) \quad \text{since } p_0(x) \equiv 1$$
$$b_0 = \frac{1}{2\sqrt{\pi}} \quad (= a_0 \text{ when } f \equiv \phi)$$
so that $I_0^N = (a_0 - 1/(2\sqrt{\pi}))^2$. As mentioned earlier, we should expect two types of (local) maxima for $I_0^N$: one for a minimum of $a_0$ and one for a maximum.

Proposition 5.2:
(i) $a_0$ is minimized, with a value $1/\sqrt{2\pi e}$, by a distribution with equal masses of weight 0.5 at $\pm 1$. (Call this distribution type CH, or a "central hole".)
(ii) $a_0$ is maximized, with a value $1/\sqrt{2\pi}$, by a point mass at 0. This distribution actually maximizes $I_0^N$ with a value of $(1 - 1/\sqrt{2})^2/2\pi$. (Call this distribution type CM, or a "central mass" concentration.)
Figure 1: Symmetric interpolation (5) between distribution types CH and CM, showing $I_0^N$.
Intuitively, (i) says that to minimize $a_0$, mass should be placed as far out as possible, because of the shape of $\phi(x)$, but the mean and variance constraints impose limits on how far the mass can be from zero. Conversely, (ii) says that because $\phi(x)$ is unimodal with maximum at zero, to maximize $a_0$, place all mass at zero. This is not a proper element of $\mathcal{F}$, but it is clear that $a_0$ does not take on its maximum in $\mathcal{F}$. For example, the distribution
$$F = \gamma\,\delta_0 + \frac{1-\gamma}{2}\,\delta_{-x} + \frac{1-\gamma}{2}\,\delta_x, \qquad \gamma = 1 - \frac{1}{x^2} \qquad (5)$$
is centered, has unit variance, and $F \xrightarrow{w} \delta_0$ as $x \to \infty$. This simple family of discrete distributions serves as an interpolation between the type CM distribution for $\gamma \to 1$, $x \to \infty$, and the type CH distribution for $\gamma = 0$, $x = 1$. Figure 1 shows a plot of $I_0^N$ for this interpolating family. Despite the greater relative magnitude of $I_0^N$ for central mass concentration shown in this figure, the nature of the intermediate dip in the curve demonstrates that $I_0^N$ will also respond to central holes. In fact, $I_0^N$ will more often respond to central holes since the range of $\gamma$-values which ascend to type CH is larger (about 0.6) than that for type CM (about 0.4).

The Hermite index of order 0, $I_0^H$, behaves identically to $I_0^N$ with the exception of a constant factor. The Legendre index, on the other hand, doesn't have an equivalent term: $I_0^L = 0$, always.
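The curve in Figure 1 is easy to reproduce, since $I_0^N$ has a closed form along family (5). The grid search below is our own sketch; it recovers the CH value at $\gamma = 0$, the larger CM value as $\gamma \to 1$, and the intermediate dip to zero:

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
b0 = 1 / (2 * np.sqrt(np.pi))

def i0(gamma):
    """I_0^N for family (5): mass gamma at 0, (1-gamma)/2 at +-x, x = 1/sqrt(1-gamma)."""
    x = 1 / np.sqrt(1 - gamma)
    a0 = gamma * phi(0.0) + (1 - gamma) * phi(x)
    return (a0 - b0)**2

g = np.linspace(0.0, 0.999, 1000)
vals = i0(g)
g_dip = g[np.argmin(vals)]
print(vals[0], vals[-1], g_dip)  # CH value, near-CM value, dip location
```

The dip sits near $\gamma \approx 0.6$, which is why the basin of attraction leading to type CH is the larger of the two.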
5.2 Truncation at Second Term - $I_1^N$

5.2.1 Second Term Alone

The second term is $(a_1 - b_1)^2$, where
$$a_1 = \int_{\mathbb{R}} x\,\phi(x)\,dF(x) \quad \text{since } p_1(x) \equiv x$$
$$b_1 = 0 \quad (= a_1 \text{ when } f \equiv \phi)$$
which reduces to $a_1^2$. For this quantity we need only consider maximizing $a_1$, because the skew symmetry of $a_1$ about $x = 0$ implies that the minimal value of $a_1$ will be equal in magnitude to its maximal value, and obtained by a reflection through $x = 0$ of the maximal distribution.
Proposition 5.3: The second coefficient, $a_1$, is maximized by the two-point distribution with mass $\gamma$, $(1-\gamma)$ at $\sqrt{(1-\gamma)/\gamma}$, $-\sqrt{\gamma/(1-\gamma)}$, respectively, where $\gamma$ is found by maximizing $\sqrt{\gamma(1-\gamma)}\left(e^{-(1-\gamma)/2\gamma} - e^{-\gamma/2(1-\gamma)}\right)$. ($\gamma$ approximately equals 0.838.)

The maximum value of $I_1^N$ is achieved equally by this right-skewed distribution and the left-skewed distribution produced by its reflection through $x = 0$. Call these distributions type SK. As above, it is useful to embed the distributions of interest in a one-parameter family:
$$F = (1-\gamma)\,\delta_x + \gamma\,\delta_y, \qquad x = -\sqrt{\gamma/(1-\gamma)}, \quad y = \sqrt{(1-\gamma)/\gamma} \qquad (6)$$
The members of this family are again centered and scaled to unit variance. They are skewed, except for the type CH distribution at $\gamma = 0.5$, and for the type CM distribution obtained when $\gamma \to 1$. The type SK maximum occurs in between the two extremes, at approximately $\gamma = 0.838$. Figure 2(a) plots $a_1$ for this family as a function of $\gamma$, on the interval (0.5, 1). The single mode of the curve occurs at the distribution given in Proposition 5.3.
The second terms of the Hermite and Legendre indices behave very much like these. In the Legendre index this is the lowest order index, because $I_0^L = 0$, so that $I_1^L$ responds exclusively to skewness in the data and the next section does not apply.
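The value of $\gamma$ in Proposition 5.3 can be recovered numerically. Along family (6), $a_1 = \sqrt{\gamma(1-\gamma)}\,\{\phi(\sqrt{(1-\gamma)/\gamma}) - \phi(\sqrt{\gamma/(1-\gamma)})\}$; the grid search below (our sketch) locates its maximizer near 0.838:

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def a1(gamma):
    """a_1 for the two-point family (6)."""
    y = np.sqrt((1 - gamma) / gamma)    # mass gamma at y
    x = np.sqrt(gamma / (1 - gamma))    # mass (1 - gamma) at -x
    return np.sqrt(gamma * (1 - gamma)) * (phi(y) - phi(x))

g = np.linspace(0.501, 0.999, 4981)     # grid with step 0.0001
gamma_star = g[np.argmax(a1(g))]
print(gamma_star)
```

The maximizer agrees with the value $\gamma \approx 0.838$ quoted in the proposition.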
5.2.2 Piecing First and Second Terms Together

Truncating the sum at two terms gives the order 1 index, $I_1^N = (a_0 - 1/(2\sqrt{\pi}))^2 + a_1^2$, so the behaviour of $I_1^N$ depends on the interactive behaviour of the two terms. Figure 2(b) plots $I_1^N$ for the skewed interpolation of form (6). The distribution which maximizes $I_1^N$ is of type SK, but not exactly the same as that which maximizes $a_1^2$ alone, because the interaction with the first term draws it towards a type CM distribution. It is characterized by having approximate masses 0.13 and 0.87 at $-2.59$ and 0.387, respectively. The result is intuitive. The maximum value of $a_1^2$ is greater than the maximum value of the first term, and the distribution of type CM which maximizes the first term has no skewness, so the maximal distribution cannot be type CM. On the other hand, the SK type distributions that are "close" to maximizing $a_1^2$ also get a contribution from the first term, which favours all mass at 0.

Figure 2: Skewed interpolation (6) between distribution types CH and CM, showing (a) $a_1$ and (b) $I_1^N$.

Clearly the behaviour of $I_1^N$ is to respond to skewness when it is present. This behaviour is also seen in the Hermite index, $I_1^H$.
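The interaction between the two terms can be checked by a grid search over family (6). This is our own sketch; the maximizer of $I_1^N$ sits near $\gamma \approx 0.87$, matching the approximate masses 0.13 and 0.87 at $-2.59$ and 0.387 quoted above:

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
b0 = 1 / (2 * np.sqrt(np.pi))

def i1(gamma):
    """I_1^N = (a_0 - b_0)^2 + a_1^2 for the two-point family (6)."""
    y = np.sqrt((1 - gamma) / gamma)    # mass gamma at y
    x = np.sqrt(gamma / (1 - gamma))    # mass (1 - gamma) at -x
    a0 = gamma * phi(y) + (1 - gamma) * phi(x)
    a1 = np.sqrt(gamma * (1 - gamma)) * (phi(y) - phi(x))
    return (a0 - b0)**2 + a1**2

g = np.linspace(0.501, 0.999, 4981)
gamma_star = g[np.argmax(i1(g))]
loc_neg = -np.sqrt(gamma_star / (1 - gamma_star))
loc_pos = np.sqrt((1 - gamma_star) / gamma_star)
print(gamma_star, loc_neg, loc_pos)
```

Note the shift from $\gamma \approx 0.838$ (second term alone) to roughly 0.87: the first term pulls the maximizer towards the central-mass end of the family.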
5.3 Higher Order Indices

There is a natural trend in that odd order indices respond to skewness, type SK distributions, whilst even order indices respond to central mass concentration, type CM distributions. As the order increases, the type CM distribution tends to dominate for the Natural Hermite index. This is also true for very high order Hermite and Legendre indices, although it is not as easy to see due to the increased variation and number of inflection points, at least in orders 3 to 5. This is shown in Figure 3, where the values of the (a) Natural Hermite, (b) Hermite and (c) Legendre indices are plotted for orders 0 to 5 for the skewed interpolation of form (6). (Note that an increase in order means an increase in index value, hence the lowest line (solid) is order 0 and the highest line (dash-dot) is order 5 in each plot.) The extremal behaviour of the higher order indices is not so interesting because we know we can collect these extremal types of structure easily with the order 0, 1 indices. The practical power of high order indices stems from the increased modality of the polynomials, which enables them to detect higher-frequency deviations from normality, that is, finer structure. There is a tradeoff because the increased modality makes it necessary to be very close to the structure to find it by projection pursuit optimization.
Figure 3: Skewed interpolation (6) for indices of orders k = 0, 1, ..., 5: (a) Natural Hermite, (b) Hermite, (c) Legendre.
6 Illustrations of 1-dimensional Projection Pursuit

6.1 Low Order Indices

Now that we have examined some aspects of the behaviour of the polynomial indices over the ideal expanded set, $\bar{\mathcal{F}}$, we would like to examine their behaviour in practice, that is, over the set, $\mathcal{F}_Z$, of distributions of 1-dimensional projections of Z. For most dimensions, p, of Z this is not possible, but it is practical for the simple case of p = 2. The following type of plot has been used by Huber (1990): project a two-dimensional distribution onto lines parametrized by angles of unit vectors in the plane,
$$\alpha_\theta = (\cos\theta, \sin\theta), \qquad \theta = 0°, 1°, \dots, 179°$$
and calculate $\hat I_0^N$ for each $\theta$, and display its values radially as a function of $\theta$.
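This sweep is easy to simulate. The sketch below is our own toy construction (not the paper's data): X is exactly type CH (values $\pm 1$) and Y is identically 0 (an extreme central mass), so $\hat I_0^N$ should have its global maximum at $\theta = 90°$ (projection onto Y) and a local maximum at $\theta = 0°$ (projection onto X, the central hole):

```python
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
b0 = 1 / (2 * np.sqrt(np.pi))

# Toy bivariate sample: X of type CH (+-1), Y all zero (extreme type CM).
xy = np.array([[1.0, 0.0], [-1.0, 0.0]] * 20)

thetas = np.arange(180)
vals = []
for t in np.radians(thetas):
    proj = xy @ np.array([np.cos(t), np.sin(t)])   # 1-d projection alpha' Z
    a0_hat = np.mean(phi(proj))                    # sample estimate of a_0
    vals.append((a0_hat - b0)**2)                  # order-0 index estimate
vals = np.array(vals)

print(thetas[np.argmax(vals)])  # global max at 90 (type CM projection)
```

The sweep reproduces the two-maxima pattern described next for Figure 4: a large central-mass peak and a smaller central-hole peak a quarter turn away.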
Figure 4(a) shows an example of this for a bivariate distribution whose X and Y variables are independent and distributed according to a near type CM and an exact type CH distribution, respectively. The data points (diamonds) are plotted in the center of the figure. (Note that overplotting obscures the relative frequency at each point. In fact, the two points on X = 0 each contain 36 observations whilst the other four points contain 1 observation each.) The solid line represents the index value, $\hat I_0^N$, plotted in relative distances from the center, in both positive and negative directions, along the projection vector, $\alpha_\theta$. The dotted circle is a guidance line plotted at the median index value. The maximum value is attained at $\theta = 0°$, that is, the projection of the data onto the dashed line at Y = 0, which is the near type CM (Figure (b)). However, from this global maximum the index dips low and rises to a smaller maximum at $\theta = 90°$, that is, the projection of the data onto the dotted line at X = 0, which corresponds to the type CH distribution (Figure (c)). Each of these maxima tells us something interesting about the data, so ideally we would want the optimizer to return both. (Note that all projections between $\theta = 0°$ and $\theta = 90°$ constitute interpolations between the near type CM distribution and the type CH distribution.)

Figure 4: Bivariate distribution with X near type CM, Y exactly type CH, independently distributed: (a) $\hat I_0^N$, (b) projection (X) which maximizes $\hat I_0^N$, distribution near type CM, (c) projection (Y) corresponding to a local maximum of $\hat I_0^N$, distribution type CH.

Figure 5: Two dimensions of the flea beetle data: (a) $\hat I_0^N$, (b) projection which maximizes $\hat I_0^N$, (c) projection corresponding to local maximum of $\hat I_0^N$, (d) $\hat I_1^N$, (e) projection which maximizes $\hat I_1^N$, (f) projection corresponding to first local maximum of $\hat I_1^N$.
Figure 5(a) contains a similar plot of 2-dimensional data generated by taking a 2-dimensional projection of 6-dimensional data. The 6-dimensional data consists of 6 measurements on each of 74 flea beetles comprising 3 different species (Lubischew, 1969). The 2-dimensional data shown is the projection of the 6-dimensional principal component space that best separates the three species. Figure (a) shows $\hat I_0^N$ calculated for each 1° incremental projection, and Figure (b) contains a histogram of the projected data corresponding to the maximum index value. In this case it is a near CH type distribution. Figure (d) shows $\hat I_1^N$, the order 1 index, calculated for each 1° incremental projection. The maximizing distribution, which separates one cluster from the other two, is near SK type, seen in the histogram in plot (e). There are two other local maxima, but the angular difference between these and the global maximum is close enough to 60° to see that these are also near SK type distributions, each separating off one of the three clusters. This example illustrates the power of the long-sightedness of the order 0 and 1 indices for separating gross clusters. An order 0 index will capture projections with two relatively equal clusters whilst an order 1 index will capture projections with one large cluster and a smaller cluster (perhaps obtained by projection of 2 clusters on top of each other and a single cluster, as in this example).
6.2 High Order Indices

To illustrate the sensitivity of higher order indices to fine structure we switch to the Legendre index, because it needs considerably fewer terms than the Hermite and Natural Hermite indices to capture the same depth of structure. A rationale for this is given by asymptotic results in Hall (1989) (from bounds on each term given in Sansone, 1959, pp. 199, 369). The data used for this example is generated by the infamous random number generator, RANDU, which is based on the multiplicative congruential scheme: $x_{n+1} = (2^{16} + 3)x_n \pmod{2^{31}}$. RANDU was widely used in the 1970s but unfortunately fails most 3-dimensional criteria for randomness (Knuth, 1981). Points generated by RANDU lie on 15 parallel planes defined by $9x_n - 6x_{n+1} + x_{n+2} \equiv 0 \pmod{2^{31}}$, when sequentially placed into a 3-dimensional cube. The 2-dimensional data used for the analysis is obtained by projecting the 3-dimensional
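The lattice structure of RANDU is easy to verify directly. The sketch below (our code) iterates the recurrence from an arbitrary odd seed and checks the plane identity for every consecutive triple:

```python
M = 2**31

def randu(seed, n):
    """Generate n values from RANDU: x_{k+1} = (2**16 + 3) * x_k mod 2**31."""
    x, out = seed, []
    for _ in range(n):
        x = (65539 * x) % M
        out.append(x)
    return out

xs = randu(seed=1, n=1000)
# Every consecutive triple satisfies 9 x_n - 6 x_{n+1} + x_{n+2} = 0 (mod 2**31),
# since 65539**2 = 6 * 65539 - 9 (mod 2**31).
ok = all((9 * xs[i] - 6 * xs[i + 1] + xs[i + 2]) % M == 0 for i in range(998))
print(ok)  # True
```

The identity holds exactly because $65539^2 \equiv 6 \cdot 65539 - 9 \pmod{2^{31}}$, which is what confines all triples to the 15 planes.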
14
Figure 6: Legendre index on RANDU data: (a) $I^L_2$, (b) Projection which maximizes
$I^L_2$, (c) $I^L_{25}$, (d) Projection which maximizes $I^L_{25}$.
data into a plane containing the normal vector $(9, -6, 1)/\|(9, -6, 1)\|$, so that the
planar structure in the points is visible. Figure 6(a) displays the order 2 Legendre
index, $\hat I^L_2$. Its long-sighted quality is apparent in that it simply finds the "edge
effects" of the square, where the projected data looks like a sample from a uniform
distribution. In contrast, the global maximum of the order 25 index, $\hat I^L_{25}$, corresponds
to the planar structure, but because it is such a narrow spike it would be difficult to
find by optimization. This is the tradeoff which must be borne with the higher order
indices: the fine structure can be found, but one needs to be much closer to see it.
(This higher order behaviour was first observed by Huber (1990).)
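The lattice defect itself is easy to verify directly. The following sketch (ours, not part of the original analysis; the seed and stream length are arbitrary choices) generates a RANDU stream and checks the plane equation for every consecutive triple:

```python
# Verify RANDU's planar defect: x_{n+1} = (2^16 + 3) x_n (mod 2^31), and
# every consecutive triple satisfies 9 x_n - 6 x_{n+1} + x_{n+2} = 0 (mod 2^31).
def randu(seed, n):
    m, a = 2**31, 2**16 + 3
    x, out = seed, []
    for _ in range(n):
        x = (a * x) % m
        out.append(x)
    return out

xs = randu(1, 3000)   # seed 1 is an arbitrary choice
m = 2**31
assert all((9 * xs[i] - 6 * xs[i + 1] + xs[i + 2]) % m == 0
           for i in range(len(xs) - 2))
```

The identity holds because $(2^{16}+3)^2 \equiv 6 \cdot (2^{16}+3) - 9 \pmod{2^{31}}$, so $x_{n+2} \equiv 6x_{n+1} - 9x_n \pmod{2^{31}}$: each term is a fixed linear combination of the previous two.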
7 Two-dimensional Index

Our experience with the implementation of projection pursuit in XGobi is with 2-
dimensional indices. This is the most natural lower dimensional projection to examine
for dynamic graphics implementations, and perhaps for general human visual perception,
so it is particularly pertinent to extend the results of the 1-dimensional indices to the
2-dimensional indices. The construction of the 2-dimensional Natural Hermite index
follows closely the 1-dimensional construction, so for brevity we will
only point to the differences and refer the reader back to Sections 1-3 for the complete
treatment.
Consider a bivariate projection of $Z$,

$$Z \longrightarrow (X, Y) = (\alpha' Z, \beta' Z) \in \mathbb{R}^2 \qquad (\alpha, \beta \in S^{p-1},\ \alpha'\beta = 0)$$
then the Natural Hermite index has the following form:

$$I^N = \int_{\mathbb{R}^2} \{f(x, y) - \phi(x, y)\}^2\, \phi(x, y)\, dx\, dy$$

Bivariate Hermite polynomials that are orthonormal with respect to $\phi(x, y)$ (Jackson,
1936) are used to expand $f(x, y)$ and $\phi(x, y)$, giving:

$$I^N = \int_{\mathbb{R}^2} \Bigl\{ \sum_{i,j=0}^{\infty} (a_{ij} - b_{ij})\, p_i(x)\, p_j(y) \Bigr\}^2 \phi(x, y)\, dx\, dy = \sum_{i,j=0}^{\infty} (a_{ij} - b_{ij})^2,$$

since the $p_i(x)\, p_j(y)$ are orthonormal with respect to $\phi(x, y)$,
where the $p$'s are as previously defined, and $a_{ij}$, $b_{ij}$ are the usual Fourier coefficients for
the expansion. (Note that $a_{ij}$ depends on $f$ and consequently is unknown, whilst
$b_{ij} = b_i b_j$, where $b_i$, $b_j$ are defined as in the univariate index.)
Estimating the index involves estimating $a_{ij}$ and truncating the expansion. The
natural sample estimate of $a_{ij}$ is the sample average, $\hat a_{ij} = \frac{1}{n} \sum_{k=1}^{n} p_i(x_k)\, p_j(y_k)\, \phi(x_k, y_k)$. Truncating the sum is a little more complicated than in the 1-dimensional
case. To maintain affine invariance in the estimation, the summation needs to be done
on a triangle as follows:

$$\hat I^N_M = \sum_{i, j \ge 0,\ i + j \le M} (\hat a_{ij} - b_{ij})^2$$

This ensures that $\hat I^N_M$ is the same for all 2-dimensional rotations $(\cos\theta \cdot X + \sin\theta \cdot Y,\ -\sin\theta \cdot X + \cos\theta \cdot Y)$ of $(X, Y)$. (This property is lacking in Friedman's 2-dimensional
version of the Legendre index.) Projection pursuit entails the optimization of the
index over all 2-dimensional projections of $Z$, but in order to understand the possible
behaviour of the index we consider elements of the closure of the set of all possible
2-dimensional distribution functions with zero mean and identity covariance.
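To make the construction concrete, the following sketch (ours, not the XGobi code; the quadrature grid for the $b_i$ and the sample handling are arbitrary choices) computes $\hat I^N_M$ with the triangular truncation, building the orthonormal polynomials from the recurrence $He_{n+1}(x) = x\,He_n(x) - n\,He_{n-1}(x)$ with $p_n = He_n/\sqrt{n!}$:

```python
import numpy as np

def hermite_orthonormal(x, M):
    # p_0..p_M: Hermite polynomials orthonormal w.r.t. the N(0,1) density,
    # via He_{n+1}(x) = x He_n(x) - n He_{n-1}(x) and p_n = He_n / sqrt(n!).
    P = [np.ones_like(x)]
    if M >= 1:
        P.append(x.astype(float))
    for n in range(1, M):
        P.append(x * P[n] - n * P[n - 1])
    out, fact = [], 1.0
    for n, He in enumerate(P):
        if n > 0:
            fact *= n
        out.append(He / np.sqrt(fact))
    return out

def natural_hermite_index(x, y, M):
    # \hat I^N_M = sum over i, j >= 0 with i + j <= M of (a_ij - b_i b_j)^2,
    # where a_ij is estimated by the sample mean of p_i(x) p_j(y) phi(x, y).
    phi2 = np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)   # bivariate N(0, I) density
    Px, Py = hermite_orthonormal(x, M), hermite_orthonormal(y, M)
    # b_i = integral of phi(t)^2 p_i(t) dt, by a simple Riemann sum; the
    # grid [-8, 8] with 8001 points is an arbitrary accuracy choice.
    t = np.linspace(-8.0, 8.0, 8001)
    dt = t[1] - t[0]
    phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    b = np.array([np.sum(phi**2 * p) * dt for p in hermite_orthonormal(t, M)])
    return sum((np.mean(Px[i] * Py[j] * phi2) - b[i] * b[j]) ** 2
               for i in range(M + 1) for j in range(M + 1 - i))
```

On a large sphered normal sample the estimate is close to 0, while a degenerate sample at the origin returns approximately $(1/(2\pi) - 1/(4\pi))^2$, the order 0 "central mass" value.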
7.1 Truncation at M = 0: $I^N_0$

Again, to understand the index we begin with the simplest case, the order 0
index, $I^N_0 = (a_{00} - b_{00})^2$, where

$$a_{00} = \int_{\mathbb{R}^2} \phi(x, y)\, dF(x, y) \qquad \mbox{since } p_0(x) = 1,\ p_0(y) = 1,$$

$$b_{00} = \int_{\mathbb{R}^2} \phi(x, y)^2\, dx\, dy = \frac{1}{4\pi} \qquad (= a_{00} \mbox{ when } f \equiv \phi),$$

so $I^N_0 = (a_{00} - 1/(4\pi))^2$.
Convexity and Jensen's Inequality can be used to show that $a_{00}$ is minimized by
any distribution with all mass placed on the circle $x^2 + y^2 = 2$ (the radius that gives
identity covariance). This constitutes the bivariate generalization of the CH type
distribution, and examples are:

(1) Uniform on the circle $x^2 + y^2 = 2$, or

(2) 2 replications of a Bernoulli experiment, in which $X$ and $Y$ independently have an
equal chance of being $\pm 1$, that is, equal mass at the vertices of a square.

The value of $a_{00}$ for any distribution of CH type is $1/(2\pi e)$, leading to $I^N_0 = (1/(2\pi e) - 1/(4\pi))^2$. Conversely, to maximize $a_{00}$, all mass needs to be at 0. This distribution,
the bivariate "central mass", maximizes $I^N_0$, since $a_{00} = 1/(2\pi)$ and then $I^N_0 = (1/(2\pi) - 1/(4\pi))^2$. While this constitutes the global maximum of $I^N_0$, another local maximum
is produced by the bivariate "central hole" distribution, so in doing projection pursuit
with $\hat I^N_0$ one is likely to find both types of deviations from normality.

The bivariate Hermite index, $I^H_0$, has similar properties, but the Legendre index
satisfies $I^L_0 = 0$ for all $F$.
7.2 Truncation at M = 1: $I^N_1$

An index of order 1 contains three sums of squares, $I^N_1 = (a_{00} - b_{00})^2 + (a_{01} - b_{01})^2 + (a_{10} - b_{10})^2$, where $b_{01} = b_{10} = 0$. The contribution $a_{01}^2 + a_{10}^2$ to $I^N_1$ forms a
rotation-invariant index in its own right. It is tailored to respond to skewness at any
rotation in the projection plane. Similar to its 1-dimensional analog, $I^N_1$ has a mixed
response pattern, with skewness dominating, but bivariate "central hole" and "central
mass" distributions also throw in their weight through the presence of $(a_{00} - b_{00})^2$.

The Hermite index, $I^H_1$, responds similarly, and the Legendre index, $I^L_1$, responds
only to skewness.
7.3 Higher Order Indices

The same sorts of principles as for the 1-dimensional higher order indices apply,
except that each increment in order adds (order + 1) extra terms to the summation.
The global maximal behaviour of the indices is not as interesting as their ability to
respond to fine structure.
8 Illustrations of 2-dimensional Projection Pursuit

8.1 Low Order Indices

Although for most dimensions, $p$, of $Z$ it is not possible to visualize the entire
2-dimensional projection pursuit function, to a limited extent we can visualize 2-
dimensional projection pursuit on 4-dimensional data by constraining the 2-planes to
a manageable 2-parameter family. Given a 4-dimensional space, consider the family
$\{(\alpha, \beta): \alpha' = (\cos\theta, 0, \sin\theta, 0),\ \beta' = (0, \cos\phi, 0, \sin\phi),\ \theta, \phi \in (-\pi/2, \pi/2)\}$. Each
Figure 7: Two-dimensional indices in four dimensions of the flea beetle data: (a)
Contour of $I^N_0$, (b) Perspective plot of $I^N_0$, (c) Contour of $I^N_1$, (d) Perspective plot of
$I^N_1$.

Figure 8: Matrix of pairwise plots of four dimensions of the flea beetle data.

Figure 9: Two-dimensional Legendre index on 4-dimensional spiral data: (a), (b)
$\hat I^L_2$, (c), (d) $\hat I^L_{15}$.

Figure 10: Matrix of pairwise plots of 4-dimensional spiral data.
pair of angles $(\theta, \phi)$ specifies a 2-projection whose index can be plotted as a function
of $(\theta, \phi)$ over the square $(-\pi/2, \pi/2) \times (-\pi/2, \pi/2)$. This is done in Figure 7 for
four dimensions of the flea beetle data, both as a contour plot and a perspective plot.
(Recall that we used a 2-dimensional projection of the flea beetle data, which contains
6 measurements, in the second example of the 1-dimensional index, Figure 5.) Nine
points in the square corresponding to values $-\pi/2, 0, +\pi/2$ for $\theta$ and $\phi$ are landmarks
with simple interpretations: when $|\theta| = |\phi| = \pi/2$ the projection is of variables 3
and 4, whilst when $\theta = 0, |\phi| = \pi/2$ it corresponds to variables 1 and 4, and to
variables 2 and 3 if $|\theta| = \pi/2, \phi = 0$. Figures 7(a) and (b) have a high peak centered
at $(0, 0)$ (corresponding to a projection onto the variable 1 and 2 axes), which means
that $\hat I^N_0$ responds strongly to the projection which separates the three species (Figure
8). In contrast, the main peaks in $\hat I^N_1$ (Figures 7(c) and (d)) are along the axes near
either $(0, \pm\pi/2)$ or $(\pm\pi/2, 0)$, which correspond to projections in which one cluster is
separated off from the other two. So the separating power of these low order indices
carries over to two dimensions. The order 0 index, with its sensitivity to holes, near
type CM distribution, finds projections with equal three or four group separations,
whilst the order 1 index, with its sensitivity to skewness, finds separations with large
mass off-center.
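The grid evaluation behind these contour and perspective plots can be sketched as follows (our code; a random normal sample stands in for the flea beetle data, which is not reproduced here, and only the order 0 term of the Natural Hermite index is evaluated):

```python
import numpy as np

def frame(theta, phi):
    # alpha' = (cos t, 0, sin t, 0), beta' = (0, cos p, 0, sin p): an
    # orthonormal pair for every (theta, phi), spanning one 2-plane.
    alpha = np.array([np.cos(theta), 0.0, np.sin(theta), 0.0])
    beta = np.array([0.0, np.cos(phi), 0.0, np.sin(phi)])
    return alpha, beta

def index0(x, y):
    # Order-0 Natural Hermite term (a_00 - 1/(4 pi))^2, with a_00
    # estimated by the sample mean of the bivariate normal density.
    a00 = np.mean(np.exp(-(x**2 + y**2) / 2) / (2 * np.pi))
    return (a00 - 1 / (4 * np.pi)) ** 2

# Stand-in data: 500 sphered points in 4 dimensions (zero mean,
# identity covariance in expectation).
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 4))

grid = np.linspace(-np.pi / 2, np.pi / 2, 41)
surface = np.array([[index0(Z @ frame(th, ph)[0], Z @ frame(th, ph)[1])
                     for ph in grid] for th in grid])
```

Plotting `surface` against the $(\theta, \phi)$ grid reproduces the kind of contour and perspective views shown in Figures 7 and 9.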
8.2 High Order Indices

Similarly, the sensitivity to fine structure of high order indices extends from 1
dimension to 2 dimensions. Again we use the Legendre index to illustrate this. The
data, plotted in a matrix of pairwise plots in Figure 10, is formed from a bivariate
spiral in the first two variables and samples from a standard normal distribution in
the second two variables. Figure 9 contains contour and perspective plots of the
Legendre index: (a), (b) order 2, $\hat I^L_2$, and (c), (d) order 15, $\hat I^L_{15}$. The center of each
plot, $(0, 0)$, corresponds to the spiral in the first two variables. Interestingly, $\hat I^L_2$ is
very smooth but finds very little, whilst $\hat I^L_{15}$ is much noisier but clearly responds to the
spiral. (Huber (1990) used this data to demonstrate the inability of the 1-dimensional
Legendre index, used in a sequential manner, to find fully 2-dimensional structure.)
9 Exploring Data with 2-dimensional Projection Pursuit

9.1 Discovery of Structure in Telephone Usage Data

The first example of exploratory projection pursuit arises from a study of weekly
intra-LATA (within-state toll calls) telephone usage data conducted by Martin Koschat
and Deborah Swayne (1992a) at Bellcore using XGobi. The data contains 438 residential
customers and 52 weeks of measurements of total usage. Some usual cleaning
of the data, including taking transformations and normal scores, was done prior to
the projection pursuit analysis, but we don't expand on this because it is not directly
pertinent to the illustration. The Hermite(1) index, used on the first 5 principal
components of the cleaned data, finds a projection containing a "V" shape, which
Figure 11: Local maximum of the Hermite(1) index on weekly intra-LATA telephone
usage data.
is plotted in Figure 11 (the "o"'s and "x"'s represent different local telephone
exchanges). Further examination of associated demographic variables revealed that
the arms of the "V" were caused by missing measurements in one of the two local
telephone exchanges, and these had been overlooked earlier in the study. (There is
video footage available of this exploratory study and discovery; Koschat and Swayne
(1992b).)
9.2 Cautionary Examples

Figure 12 provides a small example of exploratory 2-dimensional projection pursuit,
which illustrates the danger of analyzing high dimensional data containing only
a small number of points. Plots (a), (b), and (c) are three local maxima of the
Hermite(0) index on the modified Wood Specific Gravity data (Rousseeuw and Leroy,
1987). The data contains 20 points with 5 measurements each, and observations
4, 6, 8, 19 have been contaminated to make them outliers. Rousseeuw and Leroy
originally used this data (including a sixth variable, wood specific gravity) to illustrate
both the inability of classical least squares (LS) regression diagnostics and the
contrasting capabilities of resistant Least Median of Squares (LMS) regression diagnostics
in finding the artificial outliers. In their analysis the LMS diagnostics reveal the
expected outliers, 4, 6, 8, 19, whereas the LS diagnostics suggest other points, for
example 11, 12, 16, are outlying. From the projection pursuit solutions (a), (b), it is
evident that all of these points may be influential outliers. In fact plot (c) shows that
all points lie near a convex shell leaving an "empty center". (The authors have since
acknowledged the shortcomings of using this data to illustrate resistant methods.)

Figure 12: Three local maxima of the Hermite(0) index on modified Wood Specific
Gravity data.

Figure 13: Random structure in a 5-dimensional normal sample, size 100: (a) $I^H_0 =
0.000449$, (b) $I^H_0 = 0.00123$.
To further illustrate the danger of putting too much emphasis on structures found in
sparse high dimensional data, Figure 13 displays two local maxima of the Hermite(0)
index on a sample of size 100 from a 5-dimensional standard normal distribution.
The projection in plot (a) contains a very small central hole, and plot (b) shows a
central mass concentration (plus a small separated cluster!). Looking at the index
values, 0.000449 and 0.00123, we see the values are quite small compared to the
absolute maximum that can be achieved (0.00556 for type CH, and 0.0796 for type
CM). So we issue a word of warning: exploratory projection pursuit will always find
structure, albeit weak, but care must be taken when emphasizing the significance
of this structure (see Sun (1991) for a discussion of p-values for the 1-dimensional
Legendre index).
10 Discussion

There are a few matters related to our analysis that warrant attention.

The first is the significance of the individual terms of the expansion as indices in
themselves. Notably, each is minimized by the normal distribution and is sensitive to
particular differences. We saw this in looking separately at the first and second terms.
They could be considered to be "template"-type indices because of their particular
sensitivity, and this leads to another approach in constructing projection pursuit
indices: construct a measure for a particular type of structure. An example of this
would be to reformulate the first term of the Hermite indices by dropping the square
and simply using the negative coefficient. This measure would specifically respond to
"central holes". Also, recent developments in using wavelets as a tool for density
estimation may lead to even better tailoring for specific structure.

The other matters concern related work. Gnanadesikan (1977, p. 142-3) proposes
a 1-parameter family of methods for finding directions of non-normality in multivariate
data. Opposite ends of the parameter scale correspond to methods sensitive
to outlying observations and inlying clusters, respectively. Jee (1985) conducts an
empirical comparison of projection pursuit indices based on non-normality measures
such as information and entropy using histogram density estimation. Morton (1989),
in her work on modifying projection pursuit to trade accuracy for interpretability,
adapts the bivariate Legendre index by transformations into a Fourier index. This
uses both Laguerre polynomials and trigonometric functions to expand the bivariate
data density. We suspect that it behaves like the bivariate Hermite indices.
Appendix

Proofs of Results in Section 5:

($*$) Proof that $\bar{\mathcal F} = \{F(x): E_F X = 0,\ \mathrm{Var}_F X \le 1\}$:

"$\supseteq$": Take any $F$ s.t. $E_F X = 0$, $\mathrm{Var}_F X \le 1$, and let $H = (1 - \gamma) \cdot F + \gamma \cdot \delta_x/2 + \gamma \cdot \delta_{-x}/2$, where $\gamma = (1 - \mathrm{Var}_F X)/(x^2 - \mathrm{Var}_F X)$, so that $H$ satisfies $E_H X = 0$, $\mathrm{Var}_H X = 1$.
As $x \to \infty$, we have $\gamma \to 0$, and hence $H \stackrel{w}{\to} F$.

"$\subseteq$": Let $F_n\,(\in \mathcal F) \stackrel{w}{\to} G$. Use Skorohod representations $X_n \sim F_n$ and $Y \sim G$
s.t. $X_n \stackrel{a.s.}{\to} Y$. The sequence $X_n$ is uniformly integrable since, by a Chebychev-type
argument:

$$E\{I_{|X_n| \ge a} \cdot |X_n|\} \le \frac{E X_n^2}{a} \le \frac{1}{a} \downarrow 0 \quad \mbox{as } a \uparrow \infty.$$

Thus $X_n \stackrel{L_1}{\to} Y$; in particular, $|E X_n - E Y| \le E|X_n - Y| \to 0$, hence $E Y = 0$.
Finally,

$$E Y^2 = E \liminf_n X_n^2 \stackrel{\mathrm{Fatou}}{\le} \liminf_n E X_n^2 \le 1. \qquad \Box$$

($**$) Proof of compactness of $\bar{\mathcal F}$:

We have that $\bar{\mathcal F}$ is weakly closed, so for compactness we need tightness, which
is immediate from the Chebychev Inequality:

$$P\{|X| \ge a\} \le \frac{E X^2}{a^2} \le \frac{1}{a^2} \downarrow 0 \quad \mbox{uniformly in } P \in \bar{\mathcal F} \mbox{ as } a \uparrow \infty. \qquad \Box$$
Proof of Lemma 5.1:

There are two parts to the proof: (1) the list contains only extremals, and (2) the
list is exhaustive.

(1) First, we note that for $F = \lambda G + (1 - \lambda) H$ $(0 < \lambda < 1)$, the support of $F$
is the closure of the union of the supports of $G$ and $H$. Thus, when decomposing $F$
into convex constituents $G$ and $H$, we need only consider $G$ and $H$ whose support is
contained in the support of $F$.

(iii) Clearly $\delta_0$ is extremal. (ii) Next, 2-point masses satisfying $EX = 0$, $EX^2 \le 1$
are extremal because a given 2-point support allows at most one probability measure
with $EX = 0$. (i) Finally, a 3-point mass satisfying $EX = 0$ and $EX^2 = 1$ is also
uniquely determined by its support, because for 3 different values $x_1, x_2, x_3$ the $3 \times 3$
matrix

$$A = \left( \begin{array}{ccc} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \end{array} \right)$$

is of the Vandermonde type, hence has maximal rank and allows only one solution
$\gamma = (\gamma_1, \gamma_2, \gamma_3)'$ of $A\gamma = (1, 0, 1)'$.
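This step can be confirmed numerically. In the sketch below (ours, with hypothetical support points), the system $A\gamma = (1, 0, 1)'$ is solved for the unique weights; note that for an arbitrary support the solution need not be nonnegative, i.e. not every 3-point support carries such a probability measure:

```python
import numpy as np

# Hypothetical support points x1 < x2 < x3; the Vandermonde-type matrix A
# has full rank, so A @ gamma = (1, 0, 1)' has exactly one solution gamma,
# the weights with total mass 1, mean 0 and second moment 1.
x = np.array([-1.5, 0.2, 1.8])
A = np.vstack([np.ones(3), x, x**2])
gamma = np.linalg.solve(A, np.array([1.0, 0.0, 1.0]))
assert np.allclose(A @ gamma, [1.0, 0.0, 1.0])
```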
(2) We show that (i) a convex mixture $F$ of 4 linearly independent components
cannot be extremal if $E_F X = 0$ and $E_F X^2 = 1$, and (ii) a convex mixture $F$ of 3 linearly
independent components cannot be extremal if $E_F X = 0$ and $0 < E_F X^2 < 1$, while
case (iii) is trivial: $E_F X = 0$, $E_F X^2 = 0 \Rightarrow F = \delta_0$.

(i) Let $F = \sum_{i=1}^4 \lambda_i G_i$, $E_F X = 0$, $E_F X^2 = 1$.
The set $A = \{(\lambda_1, \lambda_2, \lambda_3, \lambda_4): \lambda_i \ge 0,\ \sum_{i=1}^4 \lambda_i = 1\}$ is a 3-dimensional simplex whose
faces are described by one of $\lambda_i = 0$. The subset of $A$ satisfying $\sum_{i=1}^4 \lambda_i E_{G_i} X = 0$ and
$\sum_{i=1}^4 \lambda_i E_{G_i} X^2 = 1$ forms at least a 1-dimensional segment through $A$ which intersects
with a face on both ends. This implies that $F$ is not extremal, unless it already lies
on a face $\lambda_i = 0$, that is, it is really a 3-component mixture.

(ii) This case is solved by considering linearly independent 3-component mixtures
$F$ satisfying $E_F X = 0$, $E_F X^2 < 1$. The set $A = \{(\lambda_1, \lambda_2, \lambda_3): \lambda_i \ge 0,\ \sum_{i=1}^3 \lambda_i = 1\}$ is
a 2-dimensional simplex with faces $\lambda_i = 0$, and also $\sum_{i=1}^3 \lambda_i E_{G_i} X^2 = 1$. The condition
$\sum_{i=1}^3 \lambda_i E_{G_i} X = 0$ traces a segment out of $A$ which intersects with a face $\lambda_i = 0$ or
$\sum_{i=1}^3 \lambda_i E_{G_i} X^2 = 1$. Thus, for $F$ to be extremal it would have to be at $\lambda_i = 0$, that is, a
2-component mixture, or $F$ would have to satisfy the unit variance constraint, which
is excluded by assumption. $\Box$
Proof of Proposition 5.2:

(i) Let $F \in \bar{\mathcal F}$ be arbitrary and $G = \delta_1/2 + \delta_{-1}/2$:

$$a_0(F) = \frac{1}{\sqrt{2\pi}} E_F\, e^{-\frac{1}{2}X^2} \ge \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} E_F X^2} \quad \mbox{by Jensen's inequality}$$
$$\ge \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}} \quad \mbox{since } E_F X^2 \le 1$$
$$= \frac{1}{\sqrt{2\pi}} E_G\, e^{-\frac{1}{2}X^2} = a_0(G)$$

(ii) Let $F \in \bar{\mathcal F}$ be arbitrary and $G = \delta_0$:

$$a_0(F) = E_F\, \phi(X) \le \phi(0) = E_G\, \phi(X) = a_0(G) \qquad \Box$$
Figure 14: (a) At most two possible contact points between $h(x)$ and $p(x)$. (b) Two
possibilities for $p'(x)$: one intersects $h'(x)$ 3 times and the other 4 times.

Proof of Proposition 5.3:

Let $h(x) = x\phi(x)$ and $p(x) = b_0 + b_1 x + b_2 x^2$, where $b_0, b_1, b_2$ are arbitrary constants,
and let $y_1 = 0$, $y_2 = 1$. Then, by a result in Kemperman (1987) (originally from an
article in German by Richter, 1957):
$$\sup_F \{E_F h(X) : E_F X = y_1,\ E_F X^2 = y_2\} = \inf_{b_0, b_1 \in \mathbb{R},\ b_2 \in \mathbb{R}^+} \{b_0 + b_1 y_1 + b_2 y_2 : h(x) \le p(x),\ \forall x \in \mathbb{R}\}.$$
That is, in order to maximize $E_F h(X)$ subject to the two moment constraints, it is
sufficient to bound $h(x)$ by a quadratic, $p(x)$, and minimize $b_0 + b_1 y_1 + b_2 y_2$ over all
possible values of $b_0, b_1 \in \mathbb{R}$, $b_2 \in \mathbb{R}^+$. We don't use this result directly, but rather to
establish that we need only consider the set of distribution functions having exactly
two point masses. Kemperman (1987) shows that the class of distributions to be
studied reduces to those with support being the set of points of contact between $h(x)$
and $p(x)$. These sets contain at most 2 points, which is graphically obvious in Figure
14(a) but requires the following arguments.
If $h(x^*) = p(x^*)$ then $h'(x^*) = p'(x^*)$, since $h(x) \le p(x)$ everywhere. From Figure
14(b) it can be seen that there may be 1, 2, 3, 4 or even 5 intersections between $h'(x)$
and $p'(x)$, some or all of which may be equality points between $h(x)$ and $p(x)$. With
4 or 5 intersections between $h'(x)$ and $p'(x)$, all are at locations where $h'(x) < 0$ (see
plot (b)). That is, at least two intersections are to the left of $-1$ and at least two are
to the right of $+1$. By the convexity of $p(x)$ there cannot be contact points between
$p(x)$ and $h(x)$ in both regions, so there are at most 3 contact points.

Consider the case of 3 equality points, $x_0 < x_1 < x_2$. Note:

(i) $x_2 < 1$, else $x_0, x_1, x_2 > 0$, due to a maximum at $+1$, which violates the assumption
$E_F X = 0$.

(ii) For 3 intersection points to be possible, $x_0 < -\sqrt{3}$, a minimum of $h'(x)$.

(iii)
$$0 > h'(x) > p'(x), \quad x < x_0$$
$$h'(x) < p'(x), \quad x_0 < x < x_1 \qquad (7)$$
$$h'(x) > p'(x), \quad x_1 < x < x_2$$
$$h'(x) < p'(x), \quad x > x_2$$

Assume that $x_0$ is a contact point, that is, $h(x_0) = p(x_0)$, and write $h(x_1) = h(x_0) +
\int_{x_0}^{x_1} h'(x)\, dx$ and $p(x_1) = p(x_0) + \int_{x_0}^{x_1} p'(x)\, dx$. Then $h(x_1) - p(x_1) = \int_{x_0}^{x_1} (h'(x) -
p'(x))\, dx < 0$ by (7). Therefore $h(x_1) \ne p(x_1)$, so not all three equality points can be
contact points: there are at most two contact points.

The set of all two point mass distributions can be completely characterized by one
unknown parameter, because of the mean and variance constraints. $\Box$
Acknowledgements

We would like to thank our colleagues at Bellcore and Rutgers, particularly Martin
Maechler, Diane Duffy, David Tyler, Cunhui Zhang, Jop Kemperman and Andrew
McDougall, for many useful and informative discussions. In addition we appreciate
the comments of three anonymous referees.

We are very grateful to Martin Koschat and Deborah Swayne for allowing us to
use their telephone usage data and for providing us with an example of projection
pursuit being responsible for a practical discovery.
References

Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions.
Dover, New York.

Andrews, D. F., Gnanadesikan, R., and Warner, J. L. (1971). Transformations of
Multivariate Data. Biometrics, 27:825-840.

Cook, D., Buja, A., and Cabrera, J. (1991). Direction and Motion Control in the
Grand Tour. In Keramidas, E., editor, Proc. of the 23rd Symp. on the Interface
between Comput. Sci. and Statist., Seattle, WA. Interface Foundation of North
America.

Friedman, J. H. (1987). Exploratory Projection Pursuit. J. Amer. Statist. Assoc.,
82:249-266.

Friedman, J. H. and Tukey, J. W. (1974). A Projection Pursuit Algorithm for
Exploratory Data Analysis. IEEE Trans. Comput. C, 23:881-889.

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate
Observations. Wiley, New York.

Hall, P. (1989). Polynomial Projection Pursuit. Ann. Statist., 17:589-605.

Huber, P. J. (1985). Projection Pursuit (with discussion). Ann. Statist., 13:435-525.

Huber, P. J. (1990). Data Analysis and Projection Pursuit. Technical Report PJH-
90-1, M.I.T.

Jackson, D. (1936). Formal Properties of Orthogonal Polynomials in Two Variables.
Ann. Math., 37:423-434.

Jee, J. R. (1985). A Study of Projection Pursuit Methods. Technical Report TR
776-311-4-85, Rice University.

Jones, M. C. (1983). The Projection Pursuit Algorithm for Exploratory Data Analysis.
PhD thesis, University of Bath.

Jones, M. C. and Sibson, R. (1987). What is Projection Pursuit? (with discussion).
J. Roy. Statist. Soc. Ser. A, 150:1-36.

Kemperman, J. H. B. (1987). Geometry of the Moment Problem. In Landau, H. J.,
editor, Proc. of the Symp. in App. Math., San Antonio, TX. A.M.S.

Knuth, D. E. (1981). The Art of Computer Programming. Vol. 2: Seminumerical
Algorithms. Addison-Wesley, Reading, Mass.

Koschat, M. A. and Swayne, D. F. (1992a). Visualizing Panel Data. Technical
Memorandum, Bellcore, Morristown, N.J.

Koschat, M. A. and Swayne, D. F. (1992b). Visualizing Panel Data. Video available
by request to [email protected].

Kruskal, J. B. (1969). Toward a practical method which helps uncover the structure
of a set of observations by finding the line transformation which optimizes a new
"index of condensation". In Milton, R. C. and Nelder, J. A., editors, Statistical
Computation, pages 427-440. Academic Press, New York.

Lubischew, A. A. (1962). On the Use of Discriminant Functions in Taxonomy.
Biometrics, 18:455-477.

Morton, S. C. (1989). Interpretable Projection Pursuit. Technical Report 106, Lab.
for Computat. Statist., Stanford Univ.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection.
Wiley, New York.

Sansone, G. (1959). Orthogonal Functions. Interscience Publ. Inc., New York.

Sun, J. (1991). Significance Levels in Exploratory Projection Pursuit. Biometrika,
78(4):759-769.

Swayne, D. F., Cook, D., and Buja, A. (1991). XGobi: Interactive Dynamic Graphics
in the X Window System with a Link to S. In ASA Proc. of the Section on
Statistical Graphics, pages 1-8.

Thisted, R. A. (1988). Elements of Statistical Computing. Chapman and Hall, New
York.

Uspensky, J. V. (1927). On the Development of Arbitrary Functions in Series of
Hermite and Laguerre's Polynomials. Ann. Math., 28:593-619.