
Supplementary Material of “Modern Statistical

Data Modeling with Goodness of fit: LP Mechanics”

Subhadeep Mukhopadhyay Emanuel Parzen

Department of Statistics Department of Statistics

Temple University Texas A&M University

Philadelphia, PA, USA College Station, TX, USA

[email protected] [email protected]

Abstract

In this note we will apply the general theory developed in the article “Modern Statistical Data Modeling with Goodness of fit: LP Mechanics” to solve many applied statistical problems using real and simulated data sets. The focus here is the utility of our proposed unified statistical modeling tool and how ‘LP Statistical Learning’ allows researchers to systematically uncover patterns from different data types and data structures. The following table summarizes the real data sets that we will analyze using LP united statistical algorithms.

Table 1: The data sets analyzed, categorized according to data structure and data type.

Data structure     Data type                Data set
Univariate X       continuous               Buffalo snowfall data
Univariate X       discrete                 Online rating data; Jaynes’ dice problem
Bivariate (X, Y )  discrete, discrete       Fisher hair & eye color data; Age-IQ sparse table data
Bivariate (X, Y )  continuous, continuous   Ripley GAG urine data
Bivariate (X, Y )  continuous, discrete     Space shuttle challenger data


Contents

1 Buffalo Snowfall Data
2 Online Rating Data
3 Jaynes’ Dice Problem
4 Fisher Hair & Eye Color Data
5 Ripley GAG Urine Data
6 Space Shuttle Challenger Data
7 Age-IQ Sparse Contingency Table Data
8 Empirical Power Comparison: Goodness of fit
9 Empirical Power Comparison: LPINFOR

1 Buffalo Snowfall Data

Figure 1(c) shows the Buffalo snowfall data, previously analyzed by Parzen (1979). We select the normal distribution (with estimated mean 80.29 and standard deviation 23.72) as our baseline density g(x). Fig 1(b) shows the QQ plot of the estimated LP-coefficients LP[j;G, F ], for j = 1, 2, . . . , 15. Other than one prominent outlier, LP[6;G, F ], most of the points fall close to the line y = x. This diagnostic indicates that the normal distribution does not produce an adequate fit to the data. The QIQ plot of the raw data and normal distribution in Fig 1(a) further strengthens this conclusion. At the final step we proceed to nonparametrically estimate the underlying density. Data-driven AIC selects d(u) = 1 − .337 S6(u). The resulting LP Skew estimate shown in Fig 1 strongly suggests a tri-modal density.
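The mechanics behind these LP-coefficient diagnostics can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it computes empirical LP[j;G, F ] coefficients as sample means of normalized shifted Legendre scores evaluated at the baseline CDF, on a synthetic normal sample rather than the actual snowfall data.

```python
import numpy as np
from scipy.stats import norm

def legendre_score(u, j):
    # S_j(u): shifted Legendre polynomial on [0,1], scaled by sqrt(2j+1)
    # so that it has mean 0 and variance 1 under U ~ Uniform(0,1).
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(2 * u - 1, c)

def lp_coefficients(x, baseline_cdf, m=15):
    # LP[j; G, F] = (1/n) * sum_i S_j(G(x_i)): empirical Fourier
    # coefficients of the comparison density with respect to baseline G.
    u = baseline_cdf(x)
    return np.array([legendre_score(u, j).mean() for j in range(1, m + 1)])

rng = np.random.default_rng(7)
x = rng.normal(80.29, 23.72, size=5000)   # synthetic sample, NOT the snowfall data
lp = lp_coefficients(x, lambda t: norm.cdf(t, loc=80.29, scale=23.72))
# When the baseline is correct, all LP[j; G, F] hover near zero; a large
# isolated coefficient (like LP[6; G, F] above) flags lack of fit.
```

With a correctly specified baseline, each coefficient has standard error roughly 1/√n, which is what makes the QQ-plot-against-normal diagnostic of Fig 1(b) work.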

2 Online Rating Data

Fig 2 shows discrete online rating data from Tripadvisor.com for the Druk hotel, with baseline model g(x) a Poisson distribution with fitted λ = 3.82. Data-driven AIC selects d(u) = 1 − .55 S2(u) − .4 S3(u) + .23 S4(u). The estimated LP Skew probability estimate p(x;X) captures the overall shape, including the two modes, which reflects the mixed customer reviews: (i) it detects the sharp drop in the probability of a ‘Poor’ review (the elbow shape); (ii) it captures both peaks at the boundary; and (iii) it reproduces the “flat-tailed” shape over the last two categories (‘Very Good’ and ‘Excellent’). Compare our findings with Fig 12 of Sur et al. (2013), who used a substantially more complex and computationally intensive model (a Conway–Maxwell–Poisson mixture model) and still failed to satisfactorily capture the shape of the discrete data.

3 Jaynes’ Dice Problem

The celebrated Jaynes’ dice problem (Jaynes, 1962) is as follows: suppose a die has been tossed N times (N unknown) and we are told only that the average number of the faces was not 3.5, as we might expect from a fair die, but 4.5. Given this information (and nothing else), the goal is to determine the probability assignment: what is the probability that the next throw will result in face k, for k = 1, . . . , 6?

The LP solution to the Jaynes’ dice problem is presented here. The estimated distribution p(x;X) is displayed in Fig 3. G is selected as the discrete uniform distribution g(x;X) =

Figure 1: (a) QIQ plot of the raw data and normal distribution; (b) QQ plot of the estimated LP-coefficients LP[j;G, F ], for j = 1, 2, . . . , 15; (c) LP Skew density estimate of the Buffalo snowfall data. The baseline measure G = N (µ = 80.29, σ = 23.72) is shown as a blue dotted line.


Figure 2: LP Skew density estimates (red dotted line) of the online rating data, with baseline measure a Poisson distribution with λ = 3.82 (blue dots).

1/6 for x = 1, . . . , 6, which reflects the null hypothesis of a ‘fair’ die. The non-zero LP[1;G, F ] indicates a loaded die. The L2 LP skew probability estimate is given by

p(x;X) = (1/6) [1 + .585 T1(x;G)],  where the sufficient statistic T1(x;G) = 2√(3/35) (x − 3.5).   (3.1)

The exponential LP skew model is given by

p(x;X) = (1/6) exp{−.193 + .634 T1(x;X)}.   (3.2)

It is worthwhile to note that (3.2) exactly reproduces Jaynes’ answer (see Table 2).

4 Fisher Hair & Eye Color Data

Table 3 reports the joint probabilities and the marginal probabilities of the Fisher data. The singular values of the estimated LP-comoment matrix are 0.446, 0.173, and 0.029. The rank-two approximation of the LP canonical copula can adequately model (99.6% of) the


Table 2: The estimated LP skew distribution for the Jaynes’ dice problem.

Face x                   1     2     3     4     5     6
g(x;X)                  1/6   1/6   1/6   1/6   1/6   1/6
p(x;X) (L2)            .025  .080  .140  .195  .250  .310
p(x;X) (Exponential)   .054  .079  .114  .165  .240  .347
Jaynes’                .054  .079  .114  .165  .240  .347
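The exponential model (3.2) and its agreement with Jaynes' solution can be checked numerically. A minimal sketch in plain Python, using only the quantities reported above:

```python
import math

# T1(x; G): the standardized linear score under the fair-die baseline
# (mean 3.5, variance 35/12), i.e. T1(x) = (x - 3.5)/sqrt(35/12).
def T1(x):
    return 2 * math.sqrt(3 / 35) * (x - 3.5)

# Exponential LP skew model (3.2): the fitted tilt .634*T1 enforces the
# mean constraint E[X] = 4.5, and -.193 is the log-normalizing constant.
p = [math.exp(-0.193 + 0.634 * T1(x)) / 6 for x in range(1, 7)]

print([round(px, 3) for px in p])                    # close to Jaynes' row of Table 2
print(sum(x * px for x, px in zip(range(1, 7), p)))  # close to 4.5
```

The probabilities sum to one and reproduce the constrained mean 4.5 to within rounding of the reported coefficients.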

Figure 3: LP Skew distribution (L2: red dotted line; exponential: blue dotted line) for the Jaynes’ dice problem, with baseline measure the discrete uniform distribution (black dots). Our exponential LP skew density matches Jaynes’ solution.

dependence. Table 3 also reports the row and column scores computed using the aforementioned algorithm. Fig 5 shows the bivariate LP correspondence analysis of the Fisher data, which displays the association between hair color and eye color.
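The 99.6% adequacy figure follows directly from the reported singular values; a one-line check:

```python
import numpy as np

# Singular values of the estimated LP-comoment matrix reported above.
sv = np.array([0.446, 0.173, 0.029])

# Share of the total dependence (sum of squared singular values)
# captured by the rank-two canonical-copula approximation.
adequacy = (sv[:2] ** 2).sum() / (sv ** 2).sum()
print(round(adequacy, 3))   # 0.996
```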

Table 3: Fisher data: joint probabilities Pr(X = x, Y = y); marginal probabilities Pr(X = x), Pr(Y = y); the corresponding mid-distributions; and coefficients of the conditional comparison densities d(v;Y, Y |X = x) and d(u;X,X|Y = y).

                          Hair Color
Eye Color     Fair    Red   Medium   Dark   Black   p(x;X)  Fmid(x;X)   λ1φ1     λ2φ2
Blue          .061   .007    .045    .020    .001     .14      .07     -0.400   -0.165
Light         .128   .021    .108    .035    .001     .29      .28     -0.441   -0.088
Medium        .064   .016    .168    .076    .005     .33      .59      0.034    0.245
Dark          .018   .009    .075    .126    .016     .24      .88      0.703   -0.134
p(y;Y )        .27    .05     .40     .26     .02
Fmid(y;Y )    .135   .295    .52     .85     .99
λ1ψ1        -0.544 -0.233  -0.042   0.589   1.094
λ2ψ2        -0.174 -0.048   0.208  -0.104  -0.286

For the (X, Y ) discrete Fisher data (a 4 × 5 contingency table), the resulting smooth nonparametric checkerboard copula estimate is given by

cop(u, v;X, Y ) = 1 − 0.423 S1(u)S1(v) + 0.157 S2(u)S2(v) + 0.115 S2(u)S1(v),   (4.1)

shown in Fig 4. Note the piecewise-constant bivariate discrete smooth structure of the estimated copula density.
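The mid-distribution values Fmid reported in Table 3 come from the transform Fmid(x) = F(x) − ½ p(x), which underlies the discrete LP scores. A minimal sketch, verified against the hair-color marginals of Table 3:

```python
import numpy as np

def mid_distribution(p):
    # F_mid(x) = F(x) - 0.5 * p(x): the mid-distribution transform
    # used to construct LP scores for discrete variables.
    p = np.asarray(p, dtype=float)
    return np.cumsum(p) - 0.5 * p

# Marginal hair-color probabilities p(y;Y) from Table 3.
p_hair = [0.27, 0.05, 0.40, 0.26, 0.02]
print(mid_distribution(p_hair))   # matches the Fmid(y;Y) row of Table 3
```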

Our L2 canonical copula-based row and column scoring (using the LP algorithm) reproduces the classical correspondence analysis pioneered by Hirschfeld (1935) and Benzecri (1969). Alternatively, we could have used the LP canonical exponential copula model to calculate the row and column profiles via µ′ik = γk φk(ui;X) and ν′jk = γk ψk(vj ;Y ), known as Goodman’s profile coordinates, pioneered in Goodman (1981) and Goodman (1985), which leads to the graphical display sometimes known as a weighted log-ratio map (Greenacre, 2010) or the spectral map (SM) (Lewi, 1998). Our LP theory unifies the ‘two cultures’ of correspondence analysis, Anglo multivariate analysis (usual normal theory) and French multivariate analysis (matrix-algebra data analysis), using the LP representation theory of the discrete checkerboard copula density.

Our L2 canonical copula density (5.1) is sometimes known as the correlation model or CA (Correspondence Analysis) model, and the exponential copula model (5.2) as the RC (Row–Column) association model (Goodman, 1991, 1996). Since our LP-scores satisfy E[Tj(X;X)] = Σi Tj(xi;X) p(xi) = 0 and Var[Tj(X;X)] = Σi Tj²(xi;X) p(xi) = 1 by


Figure 4: LP nonparametric copula density estimate for the Fisher data.

construction, this association model is also known as the weighted RC model.

5 Ripley GAG Urine Data

This data set was introduced by Ripley (2004). The study collected the concentration of the chemical GAG in the urine of n = 314 children aged from zero to seventeen years. The aim of the study was to produce a chart to help paediatricians assess whether a child’s GAG concentration is ‘normal.’

For Y = GAG level in the Ripley data, we would like to investigate whether a parametric Gamma model fits the data. We select the Gamma distribution with shape 2.5 and rate parameter .19, estimated from the data, as our baseline density g(x). The estimated smooth comparison density is d(u;G,F ) = 1 + .16 S3(u), which indicates that the underlying probability model is slightly more skewed than the proposed parametric model. Fig 6(b) shows the final estimated LP Skew model. For X = Age in the Ripley data, we choose the exponential distribution with estimated mean parameter 5.28 as our g(x). The estimated d(u;G,F ) = 1 + .32 S2(u) − .25 S3(u) − .22 S5(u) − .39 S7(u), with LP goodness-of-fit statistic ∫ d² − 1 = 0.365422. Fig 6(a) shows that the exponential model is inadequate for capturing the plausible multi-modality and heavy tail of the age


Figure 5: (color online) LP correspondence map for the Fisher hair and eye color data.

Figure 6: Goodness-of-fit of the marginal distributions of the Ripley data: (a) fitted exponential and LP Skew for Age; (b) fitted Gamma and LP Skew for GAG levels.

distribution.

Eq. (5.1) shows the empirical LP-comoment matrix of order m = 4 between X (Age) and Y (GAG level) for the Ripley data:

LP[X, Y ] =
  −0.908*  −0.010    0.011   0.035
   0.032    0.716*  −0.071   0.028
   0.064    0.015   −0.590*  0.117
  −0.046   −0.085   −0.060   0.425*     (5.1)

A quick graphical diagnostic for checking the goodness of the “null” independence model is to draw the Q-Q plot of the LP-comoments, as shown in Fig 7. Under H0 all of the LP-comoments should lie roughly on the line y = x. The outliers in Fig 7 indicate dependence. In the next step we use the AIC model selection criterion to pick the most significant ‘non-zero’ LP-comoments (denoted by ‘*’ in Eq. (5.1)) to estimate the copula density.
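A minimal sketch of how an empirical LP-comoment LP[j, k;X, Y ] = (1/n) Σi Tj(xi;X) Tk(yi;Y ) could be computed, using mid-rank scores (assuming no ties) on synthetic data rather than the Ripley sample:

```python
import numpy as np

def lp_score(x, j):
    # Empirical orthonormal score T_j(x;X): shifted Legendre polynomial
    # evaluated at the mid-rank transform u_i = (rank_i - 0.5)/n (no ties).
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 0.5) / n
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(2 * u - 1, c)

def lp_comoment(x, y, j, k):
    # LP[j,k; X,Y] = (1/n) * sum_i T_j(x_i;X) * T_k(y_i;Y)
    return float(np.mean(lp_score(x, j) * lp_score(y, k)))

rng = np.random.default_rng(3)
x = rng.normal(size=500)
print(lp_comoment(x, x, 1, 1))                       # near 1: monotone dependence
print(lp_comoment(x, rng.normal(size=500), 1, 1))    # near 0: independence
```

Because the scores are rank-based, LP[1, 1] is essentially the Spearman correlation, while higher-order entries pick up non-linear and tail dependence.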

Figure 7: LP-comoment based nonparametric goodness-of-fit diagnostic for independence (GAG data).

Fig 8 displays the LP nonparametric copula estimate for the (X, Y ) Ripley data, given by

cop(u, v;X, Y ) = 1 − 0.91 S1(u)S1(v) + 0.71 S2(u)S2(v) − 0.59 S3(u)S3(v) + 0.42 S4(u)S4(v).   (5.2)


Figure 8: LP nonparametric copula density estimate for the Ripley data.

The nonparametric copula-based conditional mean and conditional quantile modeling algorithms will be illustrated with the Ripley data. Our algorithm is based on the following result.

Theorem 1. The conditional expectation E[Y |X = x] can be approximated by a linear combination of orthonormal LP-score functions {Tj(X;X)}1≤j≤m with coefficients defined as E[E[Y |X] Tj(X;X)] = E[Y Tj(X;X)] = LP(j, 0;X, Y ), because of the representation:

E[Y | X = x] − E[Y ] = Σ_{j>0} Tj(x;X) LP(j, 0;X, Y ).   (5.3)

A direct proof uses E[(Y − E[Y |X]) Tj(X;X)] = 0 for all j, together with Hilbert space theory. An alternative derivation of this representation is given below, which elucidates the connection with copula-based nonlinear regression. Using the LP-representation of the copula density (4.3), we can express the conditional expectation as

E[Y |X = x] = ∫ y cop(F (x), F (y)) dF (y)
            = ∫ y [1 + Σ_{j,k>0} LP[j, k;X, Y ] Tj(x;X) Tk(y;Y )] dF (y)
            = E[Y ] + Σ_{j>0} Tj(x;X) { Σ_{k>0} LP[j, k;X, Y ] LP[k;Y ] }.   (5.4)


Complete the proof by recognizing

LP[j, 0;X, Y ] = E[ E[Tj(X;X)|Y ] Y ] = Σ_{k>0} LP[j, k;X, Y ] LP[k;Y ].

Remark 1. Our approach is an alternative to parametric copula-based regression models (Noh et al., 2013, Nikoloulopoulos and Karlis, 2010). Dette et al. (2013) aptly demonstrated the danger of a “wrong” specification of a parametric family, even with one- or two-dimensional predictors. They noted that commonly used parametric copula families are not flexible enough to produce non-monotone regression functions, which makes them unsuitable for many practical uses. Our LP nonparametric regression algorithm yields the easy-to-compute closed-form expression (5.3) for the conditional mean function, which bypasses the problem of misspecification bias and can adapt to any nonlinear shape. Moreover, our approach can easily be integrated with data-driven model selection criteria such as AIC or Mallows’ Cp to produce a parsimonious model.
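Representation (5.3) translates almost line-for-line into code. A minimal sketch on synthetic data (mid-rank scores, no ties; `lp_conditional_mean` and the sine example are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def lp_score(x, j):
    # Orthonormal empirical LP score T_j(x;X) via mid-ranks and
    # normalized shifted Legendre polynomials (assumes no ties).
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 0.5) / n
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(2 * u - 1, c)

def lp_conditional_mean(x, y, m=4):
    # Eq. (5.3): E[Y|X=x] ~ E[Y] + sum_j T_j(x;X) * LP(j,0;X,Y),
    # with LP(j,0;X,Y) estimated by (1/n) * sum_i T_j(x_i;X) * y_i.
    fitted = np.full(len(y), y.mean())
    for j in range(1, m + 1):
        t = lp_score(x, j)
        fitted += t * np.mean(t * (y - y.mean()))
    return fitted

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=2000)
yhat = lp_conditional_mean(x, y, m=6)
# The rank-based fit tracks the non-monotone sine shape that, per
# Dette et al. (2013), parametric copula families struggle with.
```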

Figure 9: The shape of the copula density slices, i.e., the conditional comparison density d(v;Y, Y |X = Q(u;X)) at u = .25, .50, .75, for the GAG data.

Our goal is to answer the scientific question of the Ripley data, based on the n = 314 sample: what are normal levels of GAG in children of each age between 1 and 18 years? In the following, we describe the steps of the LPSmoothing(X, Y ) algorithm.

Step A. Ripley (2004) discusses various traditional regression modeling approaches using polynomials, splines, and local polynomials. Figure 10(a,b) displays the estimated LP nonparametric (copula-based) conditional mean function E[Y |X = x] from (5.3), given by:

E[Y |X = x] = 13.1 − 7.32 T1(x) + 2.20 T2(x).


Figure 10: (color online) Smooth nonparametric copula-based nonlinear regression: (a, b) conditional mean; (c, d) conditional quantile curves Q(v;Y |X) for v = .1, .25, .50, .75, .9, plotted against F(Age) and Age.


Step B. We simulate the conditional quantiles from the estimated conditional comparison densities (slices of the copula density) shown in Fig. 9. The resulting smooth nonlinear conditional quantile curves Q(v;Y, Y |X = Q(u;X)) are shown in Fig 10(c,d) for v = .1, .25, .5, .75, .9. Recall from Fig 6 that the marginal distributions of X (Age) and Y (GAG levels) are highly skewed and heavy-tailed, which creates prominent sparse regions in the “tail” areas. Our algorithm does a satisfactory job of nonparametrically estimating the extreme quantile curves under this challenging scenario. We argue that these estimated quantile regression curves provide the statistical solution to the scientific question: what are “normal levels” of GAG in children of each age between 1 and 18?

Step C. The conditional LPINFOR curve is shown in Fig 11(a), which can be considered the LPINFOR measure for the different conditional comparison density curves (Fig. 9). The decomposition of the conditional LPINFOR into its components describes how the shape of the conditional distribution changes, as shown in Fig 11(b). For example, LP[1;Y, Y |X = Q(u;X)] describes the conditional mean, LP[2;Y, Y |X = Q(u;X)] the conditional variance, and so on. The tail behaviour of the conditional distributions seems to change rapidly, as captured by LP[4;Y, Y |X = Q(u;X)] in Fig 11(b).

Step D. The estimated conditional distributions f(y;Y |X = Q(u;X)) are displayed in Fig 12 for u = .1, .25, .75, .9. The conditional density shows an interesting bimodal shape at the extreme quantiles. It is also evident from Figure 12 that the classical location-scale shift regression model is inappropriate for this example, which necessitates going beyond the conditional mean description and adopting a model that can tackle the non-Gaussian, heavy-tailed data.

6 Space Shuttle Challenger Data

On January 28, 1986, the NASA Space Shuttle Challenger (OV-099) disintegrated over the Atlantic Ocean 73 seconds into its flight. The accident was believed to be caused by the below-freezing temperature (31°F) at the time of launch, which might have damaged the rockets’ O-rings, the rubber rings that seal the joints. Based on data from n = 23 previous shuttle flights, where the binary variable Y denotes success/failure of the O-rings and X is the temperature at the time of launch, our goal is to model the probability of O-ring damage as a function of temperature, Pr(Y = 1|X = x): a mixed-data modeling problem.


Define the copula density for mixed variables (Y discrete, X continuous) using the conditional comparison density as follows:

cop(u, v;X, Y ) = d(u;X,X|Y = Q(v;Y )) = d(v;Y, Y |X = Q(u;X)),   (6.1)

where the second equality follows from Bayes’ theorem. LP nonparametric modeling seeks to estimate the conditional comparison density d(u;X,X|Y = Q(v;Y )), which captures how the distribution f(x;X|Y = 1) differs from the pooled f(x;X). The following result yields an LP-comoment based fast computing formula for the Fourier coefficients LP[j;X,X|Y = 1] of the conditional comparison density d(u;X,X|Y = 1).

Theorem 2. Two equivalent expressions of the LP-Fourier coefficients of d(u;X,X|Y = 1):

(i) Conditional mean representation:

LP[j;X,X|Y = 1] = ∫₀¹ Sj(u;X) dD(u;X,X|Y = 1) = E[Tj(X;X)|Y = 1].   (6.2)

(ii) LP-comoment representation:

LP[j, 1;X, Y ] = E[Tj(X;X) T1(Y ;Y )] = √(odds(Pr[Y = 1])) LP[j;X,X|Y = 1].   (6.3)

We start our analysis by checking the ‘null’ independence model Y ⊥ X. Fig 13 shows the Q-Q plot of the LP-comoments. The two clearly visible outliers (at the bottom left corner) are the significant LP-comoments LP[1, 1;X, Y ] = −.45 and LP[3, 1;X, Y ] = −.42, which indicate that temperature is an important factor for O-ring damage. The L2 estimated conditional comparison density shown in Fig 14(a) is given by

d(u;X,X|Y = 1) = 1 − .67 S1(u;X) − .64 S3(u;X).   (6.4)

Our LP nonparametric comparison analysis indicates that the null hypothesis of equal distributions is false. It further shows that f(x;X|Y = 1) and f(x;X) differ in location and skewness. The estimated LP conditional probability is shown in Fig 14(b), computed via Pr(Y = 1|X = x) = Pr(Y = 1) d(F (x;X);X,X|Y = 1).
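The plug-in estimate Pr(Y = 1|X = x) = Pr(Y = 1) d(F (x);X,X|Y = 1), with Fourier coefficients estimated via the conditional-mean form (6.2), can be sketched on synthetic mixed data. Everything below (the −2 shift for failure cases, the choice of m = 3) is an illustrative assumption, not the Challenger analysis:

```python
import numpy as np

def lp_score(x, j):
    # Orthonormal empirical LP score T_j(x;X) from mid-ranks (no ties).
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 0.5) / n
    c = np.zeros(j + 1)
    c[j] = 1.0
    return np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(2 * u - 1, c)

rng = np.random.default_rng(5)
n, p1 = 1000, 0.3
ybin = rng.uniform(size=n) < p1
x = rng.normal(size=n) - 2.0 * ybin     # failures concentrated at lower x

# Eq. (6.2): LP[j; X,X|Y=1] = E[T_j(X;X) | Y = 1], estimated by the
# sample mean of the pooled scores over the Y = 1 cases.
scores = {j: lp_score(x, j) for j in (1, 2, 3)}
lp_cond = {j: scores[j][ybin].mean() for j in (1, 2, 3)}

# Pr(Y=1|X=x) = Pr(Y=1) * d(F(x); X,X|Y=1), with an L2 comparison density.
# (L2 estimates are not guaranteed to stay inside [0,1]; a known trade-off.)
prob = ybin.mean() * (1 + sum(lp_cond[j] * scores[j] for j in (1, 2, 3)))
```

By construction the estimated probabilities average back to the marginal rate Pr(Y = 1), and a strongly negative LP[1;X,X|Y = 1] plays the same role as the −.67 S1 term in (6.4).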

7 Age-IQ Sparse Contingency Table Data

Consider the example shown in Table 4. Our goal is to understand the association between Age and Intelligence Quotient (IQ). It is well known that the accuracy of the chi-squared approximation depends on both the sample size n and the number of cells p. Classical association mining tools for contingency tables, like the Pearson chi-squared statistic and the likelihood ratio statistic, break down for sparse tables (where many cells have observed zero counts; see Haberman (1988)): a prototypical example of the “large p, small n” problem. The classical chi-squared test yields test statistic = 60, df = 56, p-value = 0.3329, failing to capture the pattern in the sparse table. Eq. (7.1) shows the LP-comoment matrix. The significant element is LP[2, 1] = −0.617, with associated p-value 0.0167. LPINFOR thus adapts successfully to sparse tables.

Table 4: Wechsler Adult Intelligence Scale (WAIS) scores for males of various ages. Is there any pattern? Data described in Mack and Wolfe (1981) and Rayner and Best (1996).

              Intelligence Scale
Age group   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
16–19       0  1  0  0  0  0  0  1  0  1  0  0  0  0  0
20–34       0  0  0  0  0  1  0  0  0  0  1  0  0  1  0
35–54       0  0  0  0  0  0  0  0  1  0  0  0  1  0  1
55–69       0  0  1  0  0  0  1  0  0  0  0  1  0  0  0
69+         1  0  0  1  1  0  0  0  0  0  0  0  0  0  0
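The classical chi-squared computation reported above can be reproduced directly from the counts in Table 4 (using `scipy.stats.chi2` for the p-value):

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from Table 4 (5 age groups x 15 IQ-scale columns):
# each tuple lists the columns holding a single observation.
O = np.zeros((5, 15))
for r, cols in enumerate([(2, 8, 10), (6, 11, 14), (9, 13, 15),
                          (3, 7, 12), (1, 4, 5)]):
    for c in cols:
        O[r, c - 1] = 1

E = O.sum(1, keepdims=True) * O.sum(0) / O.sum()   # expected counts
stat = ((O - E) ** 2 / E).sum()
df = (O.shape[0] - 1) * (O.shape[1] - 1)
pval = chi2.sf(stat, df)
print(stat, df, round(pval, 4))   # statistic 60, df 56, p-value near 0.33
```

With only n = 15 observations spread over 75 cells, the expected counts are all 0.2, which is exactly the regime where the chi-squared approximation loses power.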

LP[X = Age, Y = IQ] =
  −0.316    0.173   0.168  −0.114
  −0.618*  −0.031  −0.101   0.068
   0.087    0.136   0.077   0.037
   0.165    0.215   0.042   0.289     (7.1)

The significant higher-order comoment LP[2, 1;X, Y ] = E[T2(X;X) T1(Y ;Y )] not only indicates strong evidence of non-linearity but also gives a great deal of information regarding the functional relationship between age and IQ. Our analysis suggests that IQ is associated with T2(X;X), a quadratic (more precisely, inverted U-shaped) age effect, which agrees with previous findings that IQ increases as a function of age up to a certain point and then declines with increasing age. This example also demonstrates how the LP-comoment based approach nonparametrically finds the optimal non-linear transformations for two-way contingency tables, extending Breiman and Friedman (1985):

ρ(Ψ∗(Y ), Υ∗(X)) = max_{Ψ,Υ} ρ(Ψ(Y ), Υ(X)).

8 Empirical Power Comparison: Goodness of fit

We investigate the performance of the LP goodness-of-fit statistic under a ‘tail alternative’ (detecting lack of fit “in the tail”). Consider the testing problem H0 : F = Φ versus the Gaussian contamination model H1 : F = (1 − π)Φ + πΦ(· − µ), for π ∈ (0, 1) and µ > 0, which is a critical component of multiple testing. It has been noted that traditional tests like Anderson–Darling (AD), Kolmogorov–Smirnov (KS) and Cramér–von Mises (CVM) lack power in this setting. We compare the performance of our statistic with four other statistics: AD, CVM, KS and Higher Criticism (Donoho and Jin, 2008). Fig 15 shows the power curves under the sparsity levels π = {.01, .02, .05, .09}, where the signal strength µ varies from 0 to 3.5 under each setting, based on a sample size of n = 5000. We used 500 simulated null data sets to estimate the 95% rejection cutoff for all methods at the significance level 0.05, and 500 data sets simulated from the alternative to approximate the power. The LP goodness-of-fit statistic does a satisfactory job of detecting deviations in the tails under a broad range of (µ, π) values, which is desirable as the true sparsity level and signal strength are unknown quantities. This idea has recently been extended to modeling the local false discovery rate in Mukhopadhyay (2013).
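The cutoff-and-power recipe described above can be sketched as follows. As a stand-in for the LP goodness-of-fit statistic we use the closely related Neyman-smooth-type statistic n Σj LP[j]², with π taken as the contamination fraction; replication counts and sample size are reduced for speed, so this is a sketch of the procedure, not a reproduction of Fig 15.

```python
import numpy as np
from scipy.stats import norm

def smooth_stat(x, m=8):
    # Neyman-smooth-type statistic n * sum_j LP[j]^2, where LP[j] is the
    # mean of the j-th normalized Legendre score of Phi(x_i).
    u = 2 * norm.cdf(x) - 1
    n = len(x)
    s = 0.0
    for j in range(1, m + 1):
        c = np.zeros(j + 1)
        c[j] = 1.0
        lp = np.sqrt(2 * j + 1) * np.polynomial.legendre.legval(u, c).mean()
        s += n * lp ** 2
    return s

rng = np.random.default_rng(42)
n, reps, pi, mu = 500, 200, 0.1, 3.0

# Step 1: 95% rejection cutoff from data simulated under H0: F = Phi.
null_stats = [smooth_stat(rng.normal(size=n)) for _ in range(reps)]
cutoff = np.quantile(null_stats, 0.95)

# Step 2: power under the contamination alternative (1-pi)*Phi + pi*Phi(.-mu).
def draw_alt():
    x = rng.normal(size=n)
    x[rng.uniform(size=n) < pi] += mu
    return x

power = np.mean([smooth_stat(draw_alt()) > cutoff for _ in range(reps)])
```

Calibrating the cutoff by simulation rather than by the asymptotic chi-squared law mirrors the procedure used for all five competing methods in the study.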

9 Empirical Power Comparison: LPINFOR

LPINFOR is a computationally fast copula-based nonparametric dependence measure. We compare the power of our LPINFOR statistic with distance correlation (Szekely and Rizzo, 2009), the maximal information coefficient (Reshef et al., 2011), and Spearman and Pearson correlation for testing independence under six different functional relationships (for brevity we restrict our attention to the relationships shown in Fig 16) and four different types of noise settings: (i) E1: Gaussian noise N (0, σ), where σ varies from 0 to 3; (ii) E2: contaminated Gaussian (1 − η)N (0, 1) + ηN (1, 3), where the contamination parameter η varies from 0 to .4; (iii) E3: bad leverage points are introduced into the data via (1 − η)N (0, 1) + ηN (µ, 1), where µ = {±20, ±40} with probability 1/4 each and η varies from 0 to .4; (iv) E4: heavy-tailed noise is introduced via the Cauchy distribution with scale parameter varying from 0 to 2.


Table 5: Performance summary on the basis of empirical power among six competing methods, compared over six bivariate relationships and four different noise settings.

                       Functional Relation
Noise  Performance  Linear    Quadratic  Lissajous  W-shaped  Sine     Circle
E1     Winner       Pearson   LPINFOR    LPINFOR    Dcor      Dcor     LPINFOR
       Runner-up    Spearman  Dcor       MIC        LPINFOR   MIC      MIC
E2     Winner       Spearman  LPINFOR    LPINFOR    LPINFOR   MIC      LPINFOR
       Runner-up    Dcor      MIC        MIC        Dcor      Dcor     MIC
E3     Winner       Spearman  LPINFOR    LPINFOR    LPINFOR   MIC      LPINFOR
       Runner-up    LPINFOR   MIC        MIC        MIC       LPINFOR  MIC
E4     Winner       Spearman  LPINFOR    LPINFOR    LPINFOR   MIC      LPINFOR
       Runner-up    LPINFOR   MIC        Dcor       MIC       LPINFOR  MIC

The simulation is carried out with sample size n = 300. We used 250 simulated null data sets to estimate the 95% rejection cutoffs for all methods at the significance level 0.05, and 200 data sets simulated from the alternative to approximate the power, similar to Noah and Tibshirani (2011). We fixed m = 4 to compute LPINFOR in all of our simulations. Figs 10–12 show the power curves for the bivariate relations under the four different noise settings. We summarize the findings in Table 5, which provides strong evidence that LPINFOR performs remarkably well across a wide range of settings.

Computational Considerations. We investigate the issue of scalability to large data sets (big-n problems) for the three competing nonlinear correlation detectors: LPINFOR, distance correlation (Dcor), and the maximal information coefficient (MIC). We generate (X, Y ) ∼ Uniform[0, 1]² with sample sizes n = 20, 50, 100, 500, 1000, 5000, 10,000. Table 6 reports the average time in seconds (over 50 runs) required to compute the corresponding statistic. It is important to note the software platforms used for the three methods: Dcor is written in Java, MIC is written in C, and we have used (unoptimized) R for


Table 6: Computational Complexity: uniformly distributed, independent samples of

size n, averaged over 50 runs based on Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz 2

Core(s) processor. Timings are reported in seconds.

Methods Size of the data sets

n = 100 n = 500 n = 1000 n = 2500 n = 5000 n = 10, 000

LPINFOR 0.002 (.004) 0.001 (.004) 0.002 (.005) 0.005 (.007) 0.011 (.007) 0.018 (.007)

Dcor 0.003 (.007) 0.094 (.013) 0.371 (.013) 1.773 (.450) 7.756 (.810) 44.960 (12.64)

MIC 0.312 (.007) 0.628 (.008) 1.357 (.035) 5.584 (.110) 19.052 (.645) 65.170 (2.54)

LPINFOR. It is evident that for large sample sizes both Dcor and MIC become almost

infeasible - not to mention computationally extremely expensive. On the other hand,

LPINFOR emerges clearly as the computationally most efficient and scalable approach

for large data sets.
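For concreteness, here is a minimal sketch of an LP-comoment statistic of the kind timed above, written in Python rather than the (unoptimized) R used in the study. The score construction (Gram-Schmidt orthonormalization of powers of the rank transform, then n times the sum of squared LP-comoments) follows our reading of the main article; the function names are ours and details such as ties handling are simplified.

```python
import numpy as np

def lp_scores(x, m=4):
    """Orthonormal rank-polynomial score functions T_1,...,T_m evaluated at
    the data (a sketch of the LP basis; ties handling simplified)."""
    n = len(x)
    u = (np.argsort(np.argsort(x, kind="mergesort")) + 0.5) / n  # rank transform
    V = np.vander(u, m + 1, increasing=True)[:, 1:]   # columns u, u^2, ..., u^m
    T = np.empty_like(V)
    for j in range(m):
        t = V[:, j] - V[:, j].mean()                  # center
        for k in range(j):
            t = t - (t @ T[:, k] / n) * T[:, k]       # orthogonalize
        T[:, j] = t / np.sqrt(np.mean(t ** 2))        # unit empirical variance
    return T

def lpinfor(x, y, m=4):
    """LPINFOR-style statistic: n times the sum of squared LP-comoments."""
    n = len(x)
    LP = lp_scores(x, m).T @ lp_scores(y, m) / n      # m x m LP-comoment matrix
    return n * float(np.sum(LP ** 2)), LP

# Quick demonstration on simulated independent data.
rng = np.random.default_rng(7)
x, y = rng.normal(size=200), rng.normal(size=200)
T = lp_scores(x)
stat, LP = lpinfor(x, y)
```

Since only sorting and a few O(nm) passes are needed, the whole computation is O(n log n), which is consistent with the near-linear growth of the LPINFOR timings in Table 6, in contrast to the pairwise-distance computations underlying Dcor.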

References

Benzecri, J. P. (1969). Statistical analysis as a tool to make patterns emerge from data. New York: Academic Press.

Breiman, L. and J. H. Friedman (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80(391), 580–598.

Dette, H., R. Van Hecke, and S. Volgushev (2013). Misspecification in copula-based regression. arXiv preprint arXiv:1310.8037.

Donoho, D. and J. Jin (2008). Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105, 14790–14795.

Goodman, L. A. (1981). Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association 76(374), 320–334.

Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. The Annals of Statistics 13, 10–69.

Goodman, L. A. (1991). Measures, models, and graphical displays in the analysis of cross-classified data (with discussion). Journal of the American Statistical Association 86(416), 1085–1111.

Goodman, L. A. (1996). A single general method for the analysis of cross-classified data: reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. Journal of the American Statistical Association 91(433), 408–428.

Greenacre, M. (2010). Correspondence analysis in practice. CRC Press.

Haberman, S. J. (1988). A warning on the use of chi-squared statistics with frequency tables with small expected cell counts. Journal of the American Statistical Association 83(402), 555–560.

Hirschfeld, H. O. (1935). A connection between correlation and contingency. Proceedings of the Cambridge Philosophical Society 31, 520–524.

Jaynes, E. T. (1962). Information theory and statistical mechanics (notes by the lecturer). Statistical Physics 3, Lectures from Brandeis Summer Institute 1962. New York: W. A. Benjamin, Inc.

Lewi, P. (1998). Analysis of contingency tables. Handbook of Chemometrics and Qualimetrics (Part B), 161–206.

Mack, G. A. and D. A. Wolfe (1981). K-sample rank tests for umbrella alternatives. Journal of the American Statistical Association 76(373), 175–181.

Mukhopadhyay, S. (2013). Cdfdr: A comparison density approach to local false discovery rate estimation. arXiv:1308.2403.

Nikoloulopoulos, A. K. and D. Karlis (2010). Regression in a copula model for bivariate count data. Journal of Applied Statistics 37(9), 1555–1568.

Noah, S. and R. Tibshirani (2011). Comment on detecting novel associations in large data sets. Science Comment.

Noh, H., A. E. Ghouch, and T. Bouezmarni (2013). Copula-based regression estimation and inference. Journal of the American Statistical Association 108(502), 676–688.

Parzen, E. (1979). Nonparametric statistical data modeling (with discussion). Journal of the American Statistical Association 74, 105–131.

Rayner, J. and D. Best (1996). Smooth extensions of Pearson's product moment correlation and Spearman's rho. Statistics & Probability Letters 30(2), 171–177.

Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, and M. Mitzenmacher (2011). Detecting novel associations in large data sets. Science 334(6062), 1518–1524.

Ripley, B. D. (2004). Selecting amongst large classes of models. Symposium in Honour of David Cox's 80th birthday. URL: http://www.stats.ox.ac.uk/~ripley/Nelder80.pdf.

Sur, P., G. Shmueli, S. Bose, and P. Dubey (2013). Modeling bimodal discrete data using Conway-Maxwell-Poisson mixture models. arXiv:1309.0579.

Szekely, G. J. and M. L. Rizzo (2009). Brownian distance covariance. The Annals of Applied Statistics 3(4), 1236–1265.

[Figure: conditional LPINFOR curve plotted against the quantiles of X = Age, with its component curves LP[1;Y,Y|X], LP[2;Y,Y|X], LP[3;Y,Y|X], and LP[4;Y,Y|X] marked at u = .1, .25, .50, .75, .9.]

Figure 11: (color online) Conditional LPINFOR curve and its component decomposition for Ripley data.

[Figure: four estimated conditional density panels, f(y;Y |X = Q(u;X)) for u = .1, .25, .75, .9, each plotted over 0 ≤ y ≤ 60.]

Figure 12: The estimated nonparametric conditional distributions f(y;Y |X = Q(u;X)) for u = .1, .25, .75, .9.

[Figure: Q-Q plot of the sample quantiles of the LP-comoments against theoretical normal quantiles.]

Figure 13: LP-comoment based nonparametric goodness-of-fit diagnostic for independence of Y binary and X continuous (O-ring Challenger data).

[Figure: top panel, estimated comparison density over (0, 1]; bottom panel, estimated LP conditional probability of O-ring failure versus temperature (55–80 °F).]

Figure 14: (a) Top plot: The estimated nonparametric conditional comparison density for the O-ring space shuttle data. The departure from uniformity indicates the significance of the variable 'Temperature' for O-ring failure. (b) Bottom plot: Lower temperatures are associated with a higher probability of O-ring damage, suggesting that launching the NASA Space Shuttle orbiter Challenger at 31°F (on January 28, 1986) was highly risky; the launch ended in disaster, claiming the lives of all seven astronauts aboard.

[Figure: four power-versus-signal-strength (mu) panels, one per sparsity level, comparing the LP statistic with the KS, AD, CVM, and HC statistics.]

Figure 15: Power comparison of the LP goodness-of-fit statistic under tail alternatives (Section 8). The sparsity level varies over π ∈ {.01, .02, .05, .09}.

[Figure: scatterplots of the simulated bivariate patterns.]

Figure 16: Various functional relationships considered in our simulation study in Section 9.

[Figure: power versus noise level or contamination proportion for Dcor, LPINFOR, MIC, Pearson, and Spearman under four settings: robustness under Gaussian error, sensitivity towards outliers, stability under bad leverage data, and resistance to heavy-tailed error.]

Figure 17: Estimated power curves for the linear and quadratic patterns under four types of noise settings; setup described in Section 6.3.2.

[Figure: power versus noise level or contamination proportion for Dcor, LPINFOR, MIC, Pearson, and Spearman under the same four noise settings as Figure 17.]

Figure 18: Estimated power curves for the Lissajous curve and W-pattern under four types of noise settings; setup described in Section 6.3.2.

[Figure: power versus noise level or contamination proportion for Dcor, LPINFOR, MIC, Pearson, and Spearman under the same four noise settings as Figure 17.]

Figure 19: Estimated power curves for the sine curve and circle pattern under four types of noise settings; setup described in Section 6.3.2.