A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling

A Copula Approach to Test Asymmetric Information with

Applications to Predictive Modeling

PENG SHI

Division of StatisticsNorthern Illinois University

DeKalb, Illinois 60115email: [email protected]

EMILIANO A. VALDEZ

Department of MathematicsUniversity of Connecticut

Storrs, Connecticut 06269-3009email: [email protected]

August 31, 2010

Abstract

In this article, we provide an alternative evidence of asymmetric information in automobile in-surance based on a copula model. We use the Frank’s copula to jointly model the type of policycoverage chosen and the number of accidents, with the dependence parameter providing for ev-idence of the relationship between the choice of coverage and the frequency of accidents. Thisdependence therefore provides an indication of the presence (or absence) of asymmetric informa-tion. The type of coverage is in some sense ordered so that coverage with higher ordinals indicatethe most comprehensive coverage. Henceforth, a positive relationship would indicate that morecoverage is chosen for those with higher frequency of accidents, and vice versa. This presence ofasymmetric information could be due to either adverse selection or moral hazard, a distinctionoften made in the economics of insurance literature, or both. We calibrated our copula modelusing a one-year cross-sectional observation of claims arising from a major automobile insurerin Singapore. Our estimation results indicate a strong evidence of the presence of asymmetricinformation, and we further used our estimated model for other possible actuarial applications.In particular, we are able to demonstrate the effect of coverage choice on the incidence of acci-dents, and based on which, a posterior pure premium is derived. In general, a positive margin isobserved when compared with a prior premium available in our empirical data base.

Keywords: Adverse selection, Moral hazard, Insurance, Copula models, Predictive models.

1

A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling 2

1 Background and motivation

In contract theory, asymmetric information is defined to be the situation when one party of an

economic transaction possesses information that is not available to the other contractual party. A

clear example can be found in the market for used items or goods such as those found on eBay where

there is transfer of ownership from one person to another. In such transactions, there are variables,

usually observable only during the period of ownership of the goods and oftentimes difficult to trace,

to help assist the buyer in assessing the quality of the goods being purchased. As a consequence, the

original owner knows the history of the performance of the goods during its period of wear and tear

which maybe unavailable to the buyer. A classic example can also be found in the market for “used

cars”. Indeed, in the seminal work of Akerlof (1970), the economist George A. Akerlof explained

the problems arising from information asymmetry based on the “used car” market and refered to

defective used cars as “lemons”.

The presence of asymmetric information is no stranger in the insurance market, and this is true in

all forms of insurance (e.g. life, medical, dental, homeowners, and automobile). There is additional

information about the probability of a loss unavailable to the insurance company, preventing it

from creating a portfolio of homogeneous risks and thus making it more difficult for the insurer to

effectively price-discriminate buyers of insurance policies. The economics literature distinguishes

asymmetric information due to adverse selection and due to moral hazard. Within the context of

insurance, there is presence of adverse selection when an insurer does not have enough information

to assess those buyers who are considered ‘high risks’ and who are more likely to purchase insurance.

On the other hand, there is presence of moral hazard when the behavior of policyholders is altered

because they have insurance coverage. Prominent examples of this are in medical insurance when

policyholders tend to be less careful of their health and in automobile insurance when policyholders

tend to behave recklessly while driving on the road. In this article, because our data makes it difficult

to make the distinction, our model for investigating the presence of asymmetric information does

not allow for such distinction.

The implications of the presence of asymmetric information can have terrifying effects on the

insurance market such as possible deterioration of premium adequacy which may lead to insurer in-

solvency and subsequent market collapse. In combating for asymmetric information, insurers have

tended to modify insurance policy designs within the limits of the law. Some of these popular mod-

ifications are the introduction of policy deductibles, coinsurance, and policy limits; all tend to have

the characteristics that result in the sharing of the risks between the insured and the insurer. In the

pioneering work of Rothschild and Stiglitz (1976), the authors demonstrated that when an insur-

ance company offers a basket of products with varying levels of coverage, a separating equilibrium

results, in which optimally, as a result, the higher risk chooses a higher level of coverage and pays a

higher premium while the lower risk chooses a lower level of coverage and pays a lower premium.

As discussed earlier, this phenomenon is called adverse selection. A possible implication of the work

of Rothschild and Stiglitz (1976) is that the existence of adverse selection could be tested empiri-

cally by examining whether there is a positive relationship between the risk of policyholders and

their subsequent selection of the level of insurance coverage.

In this article, we focus on the automobile insurance market primarily motivated by the available

data we have for empirical investigation and by the fact that there have been several empirical inves-

tigations along this line making it easy to compare our results with existing ones. Although theory

generally predicts for the presence of asymmetric information, earlier empirical investigations, how-

ever, have rather been inconclusive in the sense that some studies have found strong evidence while

others have suggested no evidence at all. Several of these studies share many model characteristics


such as the use of a policyholder’s claim history to measure the risk level of the driver in conjunction

with the effects of the choice of policy coverage (the presence of deductibles, coinsurance and/or

policy limits). In addition, several of the empirical models control for certain observable character-

istics that allow the insurer to classify the risk level of the policyholders. However, several of these

studies are distinct in many aspects. First is the choice of the risk level of the driver: some treat the

likelihood of the event that an accident, or subsequently a claim, occurs while others investigate the

number of times an insured has made a claim as the primary variable of interest. Next of course are

the differences in the empirical data used in the investigation itself where other controlling effects

make the comparison non-uniform. Finally and obviously, the model choice, which ranges from us-

ing logit or probit models for the probability of a claim to using count data regression models when

using the number of accidents or claims.

To illustrate, using an ordered logit model, Puelz and Snow (1994) found a strong positive corre-

lation between the number of times an insured has made a claim together with the level of coverage

purchased accounting for the risk class to which the insured has been classified. Based on the notion

of conditional dependence, Dionne et al. (2001) suggested that the presence of adverse selection

vanishes when “the nonlinearity of the risk classification variables” are taken into account. Using

a portfolio of contracts from a French insurer, Chiappori and Salanie (2000) also concluded no ev-

idence of asymmetric information in the automobile insurance market. The authors proposed the

use of a bivariate probit regression to jointly model the choice of a better coverage and the occur-

rence of an accident. The rationale is that the influence between the occurrence of accidents and the

policy selection could go in either direction. Such method has also been employed by Cohen (2005)

using data from an insurer that operates in Israel, and by Saito (2006) using data from a Japanese

automobile insurer, to test the asymmetric information in the automobile insurance market. These

articles surprisingly have opposing results; Cohen (2005) concluded with evidence of informational

asymmetry and Saito (2006) found no adverse selection or moral hazard in this market.

As already alluded earlier, the adverse selection effect implies that a higher risk, one with a higher

chance of accident occurrence, ends up with a more comprehensive and better coverage. However,

the occurrence of accident could be the result of a moral hazard effect, that is, the selection of a more

comprehensive and better coverage could provide a motivation for a careless or reckless driving

leading back to a higher chance of getting into an accident. Consequently, in the literature, the

coverage-occurrence relation has been attributed to a combined effect of adverse selection and moral

hazard. Recently, Kim et al. (2009) pointed out that dichotomous indicators might not completely

reveal the policyholder’s selection decision. Henceforth, a multinomial measure was suggested for

the policyholder’s coverage selection. Similar to Richaudeau (1999), these authors considered a

count data regression model to examine the effect of policy selection on the occurrence of accidents.

Another factor that may affect the coverage-occurrence relationship is the degree of risk aversion

of policyholders. An individual with a higher risk aversion is more likely to purchase better coverage

and even drive more carefully. In contrast with adverse selection and moral hazard, this observa-

tion implies a negative relationship between occurrence of accident and choice of coverage. Such

negative correlation could offset the positive correlation implied by the traditional adverse selection.

Research work along this line include de Meza and Webb (2001) and Chiappori et al. (2006), where

the authors argue that the absence of the positive correlation between the level of risk and the choice

of coverage could be consistent with the presence of asymmetric information. Thus, the empirical

results in our study represent a combined effect of various types of asymmetric information.

Motivated by examining the asymmetric information in the automobile insurance market, we

propose to use a copula function to jointly model the policyholder’s choice of coverage and occur-

rence of accident. Our approach differs from existing studies in several respects. Unlike linear


correlation models used in previous studies, the copula approach allows to capture both linear and

non-linear coverage-occurrence relationships. As a consequence, our method avoids the potential

endogeneity issue that possibly arises when examining the association between a multinomial cov-

erage selection measure and the number of accidents. We note a common limitation shared by our

method and other empirical studies, that the coverage-occurrence relationship represents a com-

bined effect of various types of asymmetric information. As pointed out by Abbring et al. (2003),

a longitudinal observation is required to distinguish between adverse selection and moral hazard.

We leave this investigation for further work. Although we find the proposed method is motivated to

empirically examine asymmetric information, it is found useful for other actuarial applications. As

we additionally demonstrate in the paper, the copula model could be used to calculate conditional

probabilities of accident, to examine the effects of covariates on accident occurrence, as well as to

derive posterior pure premium. These posterior pure premiums can further assist the insurer to

investigate premium adequacy.

For the rest of the article, we have organized it as follows. Section 2 introduces the bivariate cop-

ula model for testing asymmetric information. Section 3 describes the empirical data and calibrates

the model using this data. In addition, the section discusses procedures used to examine model

goodness-of-fit and its implications. A robust test of model implications is provided in Appendix.

Two actuarial applications are demonstrated in Section 4: the calculations of conditional accident

probability applying Bayes’ rule and the derivation of posterior pure premiums. Section 5 concludes

the paper with a discussion of additional further work.

2 Specification of the copula model

In examining the presence of asymmetric information in the automobile insurance market, we fol-

low a similar argument made in Chiappori and Salanie (2000) by considering the joint distribution

model of some multinomial measure of coverage selection and the number of accidents. This mo-

tivates us to exploit the inherent flexibility of copula models in this work. We demonstrate the

suitability of these types of models for testing the presence (or absence) of asymmetric information

for a portfolio of automobile insurance contracts and subsequently, for other possible actuarial ap-

plications in predictive modeling. The copula approach used here has been inspired by the work of

Zimmer and Trivedi (2006), although in a different context, where a trivariate copula was employed

to jointly model demands for family health care and decisions regarding arrangements or purchase

of health insurance. Our bivariate random vector clearly involves discrete data for which there is the

additional difficulty of interpreting and applying copulas because of its non-uniqueness according

to Sklar’s representation. This type of model has been explored and applied to count data in the

actuarial literature; see Genest and Neslehova (2007). Other works that explore the use of copulas

for discrete data have appeared in Prieger (2002) and Smith (2003).

Various types of coverage are typically available to the consumer in an automobile insurance

market. An policyholder’s coverage selection is commonly described by a dichotomous measure,

such as responsabilite civile v.s. assurance tours risques, based on the French car insurance data

used in Chiappori and Salanie (2000). However, as pointed out by Kim et al. (2009), a multinomial

indicator of coverage choice provides extra information for a policyholder’s decision. For the same

argument then, we consider a multinomial measure and examine three types of coverage, ordinally

ranking them from lowest to highest: third party only, third party fire and theft, and comprehensive.

‘Third party only’ refers to third party liability coverage giving protection in the event of an accident

where the policyholder is at fault and this is usually the minimum liability coverage required by law

in many jurisdictions for automobile insurance. ‘Third party fire and theft’ provides for additional


coverage in case of fire and theft. Finally, a ‘comprehensive’ provides for further additional coverage

of damaged property in the event of an accident.

With a cross-sectional set of observations, we begin by letting yi1 and yi2 indicate the choice

of coverage and the number of accidents, respectively, for policyholder i. Here, yi1, with possible

values of 1, 2, or 3, represents the choice of third party, third party fire and theft, and comprehen-

sive coverages, respectively. Note that yi1 and yi2 are the observed variables whose values will be

determined according to the corresponding latent variables defined by y∗i1 and y∗i2, respectively. One

could view y∗i1 as the policyholder’s preferred policy coverage and y∗i2 as the inherent risk level of

the policyholder. As explained in Section 1, a policyholder’s optimal coverage choice and the inher-

ent degree of risk are to be jointly determined. Thus, estimating the two outcomes simultaneously

should deliver more and better information compared with a separate estimation.

A linear relationship can be assumed for both underlying latent variables according to the fol-

lowing specification:

y∗i1 = x′

iβ + νi + εi1

and

y∗i2 = z′

iγ + νi + εi2.

Here, xi and zi denote the vectors of covariates that explain the policy selection and the underlying

risk of the ith policyholder, respectively. The coefficient vectors β and γ are the corresponding

parameters to be estimated. The common term νi represents the unobserved characteristics of the ithpolicyholder that could possibly jointly affect the preferred coverage choice and degree of risk, and

thus introduces the association between the two processes. The error terms εi1 and εi2 are assumed

to be independent with νi. The above model could be calibrated by specifying the distributions of νiand error terms, or by specifying the covariance structure of the latent variables. As an alternative,

we choose to model the observable variables yi1 and yi2 with a parametric copula to be denoted by

C(·, ·). Then the joint probability mass function of yi1 and yi2 could be expressed as:

fi(yi1, yi2) = Prob(Yi1 = yi1, Yi2 = yi2)

= C(Fi1(yi1), Fi2(yi2)) − C(Fi1(yi1 − 1), Fi2(yi2))

− C(Fi1(yi1), Fi2(yi2 − 1)) + C(Fi1(yi1 − 1), Fi2(yi2 − 1)), (1)

where Fi1 and Fi2 are the cumulative distribution functions of yi1 and yi2, respectively.

Due to the parametric feature of the copula model, one needs specifications of the distribution

functions Fi1 and Fi2 for model identification. The coverage choice is measured on an ordinal scale.

Thus an ordered multinomial model is used to describe the relationship between the response yi1and the latent variable y∗i1:

yi1 =

1, if y∗i1 ≤ α1

2, if α1 < y∗i1 ≤ α2

3, if y∗i1 > α2

(2)

where α1 and α2 are unknown thresholds to be additionally estimated. The distribution function of

yi1 can thus be expressed as:

Fi1(yi1) = Prob(Yi1 ≤ yi1) =

π(α1 − x′

iβ), if yi1 = 1

π(α2 − x′

iβ), if yi1 = 2

1, if yi1 = 3

(3)


Commonly used regression techniques for an ordinal response include ordered probit model with

π(a) = Φ(a), where Φ(·) is the cumulative distribution function of a standard normal random vari-

able, and ordered logit model with π(a) = 1/[1 + exp(−a)]. These two methods typically provide

consistent results and the selection between the two rests on the user’s preference. For our purposes,

we considered an ordered logistic regression model in the estimation. Henceforth, we use

Fi1(yi1) = Prob(Yi1 ≤ yi1) =

1

1+exp(−(α1−x′

iβ)), if yi1 = 1

11+exp(−(α2−x

′

iβ)), if yi1 = 2

1, if yi1 = 3

(4)

The number of accidents yi2 is specified using a Negative Binomial regression model. More

specifically, its probability mass function is expressed as

fi2(yi2) = Prob(Yi2 = yi2) =Γ(yi2 + ψ)

Γ(ψ)Γ(yi2+1)

(

ψ

λi + ψ

)ψ (

λiλi + ψ

)yi2

, (5)

where a log link function used for its conditional mean is given by

E(yi2|zi) = λi = ωi exp(z′

iγ)

with ωi the weight parameter for policyholder i. This weight parameter has the interpretation of the

length of time to which the policyholder is covered by the policy for the calendar year in considera-

tion. The dispersion parameter ψ in the conditional variance expressed as

Var(yi2|zi) = λi + λ2i /ψ

provides additional flexibility to fit the claims data, especially being able to accommodate either

overdispersion or underdispersion often encountered with claims count data. See, for example,

Cameron and Trivedi (1986).

The model specified in this section by its nature is fully parametric and can therefore be easily

estimated using likelihood-based methods. The log of the likelihood function can be derived by

adding up the logarithm of the joint probability function expressed in (1) for all policyholders with

our set of observable data given by

(yi1, yi2,xi,zi)

for each policyholder i. In the log-likelihood function, Fi1(yi1) is replaced with expression (4), and

Fi2(yi2) =∑

fi2(yi2), with fi2(yi2) specified by (5). To accommodate the fact that the choice of

coverage and the frequency of accidents could possibly be either positively or negatively associated,

we consider the Frank copula which permits such flexibility:

C(u1, u2; θ) = −1

θlog

[

1 +(e−θu1 − 1)(e−θu2 − 1)

e−θ − 1

]

, (6)

where θ is the dependence parameter that captures the association between the two responses. The

flexibility of allowing for either direction of association has been one of the primary reason for its

popularity in applications in insurance, finance and medical statistics. It is rather straightforward

to show that when θ → 0, the Frank copula yields to the product copula, indicating the case of

independence. Furthermore, the case where θ > 0 indicates a positive association between the two

responses; the reverse is true when θ < 0. These relationships allow us to understand the type

of association that we (empirically) derive from our data. Additional statistical properties of the

Frank’s family of copulas in (6) have been explored in Genest (1987). We refer the reader to this

very interesting article to further understand this family of copulas.


3 Calibrating the model

3.1 Data characteristics

Data used to calibrate the model specified in Section 2 was drawn from a portfolio of automobile

insurance policies of a major insurer in Singapore. In particular, we use the observations in calendar

year 2001 for this insurer where we have a total of 44,226 policies that were recorded in the portfo-

lio. Similar to several jurisdictions worldwide, Singapore requires drivers to have, at the minimum,

a third party liability coverage to be able to drive a vehicle on the road, and at the same time, drivers

have the liberty to choose beyond this minimum level of coverage. Our dataset indeed comes from

a subsample of studies done in Frees and Valdez (2008) and in Frees et al. (2009) where the source

of the data was also revealed.

Table 1 provides a summary of the frequency statistics for our two primary variables of interest:

policy choice and claim count for the year. As indicated in this table, the majority of the policyhold-

ers, roughly 72% (31,648 contracts) in this case, choose for a comprehensive coverage and roughly

about 11% (4,896 contracts) of the policyholders choose for a third party only coverage. In some

sense, this is not at all surprising to find: owning and driving a car in Singapore is quite expensive.

The government has put up measures such as vehicle quotas to be on the road to manage the level

of car ownership. This expensive ownership could have possibly led car drivers to have insurance

protection that also additionally covers for property damage, as a comprehensive coverage would.

In terms of the frequency of accidents, as we normally would expect in portfolios of automobile

insurance policies, most policyholders do not incur an accident within a one-year time frame. For

our data, roughly 8% of the policyholders had a single accident and less than one percent had 2, 3 or

4 accidents. No single policyholder in our portfolio incurred more than 4 accidents during the year.

It is worth pointing out here the difference between an accident and when an accident becomes

potentially a claim. The reporting of an accident to the insurer that could eventually becomes a

claim may be influenced by the behavior of the policyholder. The policyholder might reluctantly

report an accident to the insurer under certain circumstances. For instance, for fear of increase

in future premiums, the policyholder may find it more economically beneficial not to report an

accident, especially if the loss amount from the damage or injury is small. Thus, in the records of

insurance companies, the number of accidents reported may actually be lower due to the presence of

underreporting. Because we use the data that was reported to the insurance company, we measure

accident frequency based on the number of claims recorded.

Further analysis reveals that among the 11% policyholders who selected the lowest coverage

(third party), 3.57% had at least one accident, among the 17% policyholders who selected third

party fire and theft, 4.69% occurred at least one accidents, and among 72% policyholders who

selected the highest coverage (comprehensive), 10.22% occurred at least one accidents. The above

results already foreshadow some presence of moral hazard, i.e., it is more likely to have accidents

for policyholders who purchase a higher coverage, though this positive correlation is largely due

to the common exogenous variables. This coverage-occurrence relationship empirically revealed

could also be explained by the existence of adverse selection, i.e., those who have more accidents

purchase higher coverage because they believe themselves to be worse drivers. In fact, among 91%

policyholders who did not have any accident, 11.67% purchased third party, 18.10% purchased third

party fire and theft, and 70.23% purchased comprehensive. In contrast, among 9% policyholders

who had accidents, 4.64% purchased third party, 9.55% purchased third party fire and theft, and

85.81% purchased comprehensive.

The covariates (or explanatory variables) used in our study can be found in Table 2. Each of


Table 1: Frequency statistics of policy choice and claim counts

Variables Value Frequency (%)

Policy Choice 1 third party only 11.07

2 third party fire and theft 17.37

3 comprehensive (reference level) 71.56

Claim Count 0 91.48

1 7.79

2 0.66

3 0.07

4 0.00

these variables can be categorized as either a driver or a vehicle characteristic. Driver characteristics

include the age of the insured, gender, marital status, years of driving experience, and the No Claims

Discount (NCD) enjoyed by the insured. Vehicle characteristics include the age of the vehicle in

years, the purpose or class (whether for private, business or other purposes), the size capacity,

and the brand of the vehicle. Several of these variables are categorical. For categorical variables,

the descriptive statistics shows the proportion of a value within each category. For variables with

continuous values, we provide both the mean and its standard deviation. Only two of these variables

(experience and vage) are continuous. Note that there are variables that could be treated as either

categorical or continuous, for instance, a policyholder’s age. Both specifications are found in the

literature. We selectively categorize some continuous variables to reduce the heterogeneity among

policyholders and capture the potential non-linear effects.

An important explanatory variable noteworthy is the NCD factor. In the empirical test of asym-

metric information, it is crucial to control for the policyholder’s past driving history. As pointed out

in Boyer and Dionne (1989), a policyholder’s past driving record affects his/her current propensity

for accidents, and thus is a significant risk factor in the determination of pure premium. Because

driving records are typically public information to insurance companies, failing to control for such

heterogeneity could overestimate the level of asymmetric information by treating public information

as private. The data set available to us contains an NCD that reflects past accidents. The no claims

discount scheme is introduced to encourage good driving. An NCD increases 10% after a year with

claims; the maximum NCD level is 50%. If the NCD is at 50% or 40%, a claim during one year will

reduce it to 30% or 20%, respectively. Two or more claims in one year can cause a complete lost

of the policyholder’s current level of NCD. If the NCD is at 30% or lower levels, the entire discount

is forfeited in case of accidents. Because clearly the NCD reveals the policyholder’s past driving

records, we use it as a proxy to measure accident history.

According to these descriptive statistics, we may conclude that a typical observation in our data

would be one who is middle aged (between 36 and 45) and a male driver who is married with

roughly 12 years of driving experience and zero claims discount. The typical vehicle driven in our

portfolio is roughly 8 years old, a private car with medium capacity whose brand is either a Toyota

or something else other than those listed in the table.


Table

2:

Desc

ripti

ve

stati

stic

sof

the

expla

nato

ryvari

able

su

sed

inth

em

od

el

cali

bra

tion

Vari

able

Valu

e/D

esc

ripti

on

Mean

Std

Dev

Dri

ver

chara

cteri

stic

s

agecl

ass

=1

ifin

sure

dage

isle

ssth

an

25

2.9

7%

=2

ifin

sure

dage

isbetw

een

26

an

d3

53

1.9

0%

=3

ifin

sure

dage

isbetw

een

36

an

d4

53

5.3

3%

=4

ifin

sure

dage

isbetw

een

46

an

d5

52

1.8

9%

=5

ifin

sure

dage

isbetw

een

56

an

d6

56

.61

%

=6

ifin

sure

dage

isgre

ate

rth

an

65

(refe

ren

cele

vel)

sexin

sure

d=

1if

the

poli

cyh

old

er

isfe

male

,0

ifm

ale

17

.23

%

mari

tal

=1

ifth

epoli

cyh

old

er

ism

arr

ied

,0

ifsi

ngle

83

.90

%

experi

en

cele

ngth

of

dri

vin

gexperi

en

ceof

the

insu

red

(in

years

)1

1.9

68

.13

NC

DN

oC

laim

sD

isco

un

ten

joyed

by

the

insu

red

=1

if0

perc

en

t3

1.9

4%

=2

if1

0perc

en

t1

4.8

9%

=3

if2

0perc

en

t1

1.1

7%

=4

if3

0perc

en

t7

.20

%

=5

if4

0perc

en

t6

.23

%

=6

if5

0perc

en

t(r

efe

ren

cele

vel)

Veh

icle

chara

cteri

stic

s

vage

the

age

of

the

insu

red

veh

icle

(in

years

)7

.55

6.4

8

veh

icle

class

=1

ifth

eveh

icle

isa

pri

vate

car

86

.29

%

=2

ifth

eveh

icle

isa

good

sveh

icle

13

.13

%

=3

ifoth

ers

(refe

ren

cele

vel)

capaci

tycl

ass

=1

ifpett

yca

rs(c

ubic

capaci

tyis

less

than

10

00

or

ton

nage

isle

ssth

an

1)

11

.27

%

=2

ifsm

all

cars

(cu

bic

capaci

tyis

betw

een

10

00

an

d1

50

0or

ton

nage

isbetw

een

1an

d2

)3

3.2

5%

=3

ifm

ed

ium

cars

(cu

bic

capaci

tyis

betw

een

15

00

an

d2

00

0or

ton

nage

isbetw

een

2an

d3

)4

8.9

6%

=4

ifla

rge

cars

(cu

bic

capaci

tyis

gre

ate

rth

an

20

00

or

ton

nage

isgre

ate

rth

an

3)

(refe

ren

cele

vel)

bra

nd

class

=1

ifToyota

18

.91

%

=2

ifH

on

da

13

.77

%

=3

ifN

issa

n1

6.8

8%

=4

ifM

isu

bis

hi

10

.34

%

=5

ifM

aza

da

4.7

1%

=6

ifoth

er

Japan

ese

car

4.1

0%

=7

ifK

ore

an

car

6.8

7%

=8

ifE

uro

pean

car

19

.92

%

=9

ifoth

ers

(refe

ren

cele

vel)


3.2 Estimation results and discussion

The resulting (maximum likelihood) estimates for the copula model described in Section 2 are pre-

sented in Table 3. Here in this table, we separate the effects of the marginals, with Choice referring

to the coverage selection and Risk referring to the number of claims in the year. As shown on the

table, the resulting effects of most of the covariates on both Choice and Risk are rather intuitive.

To illustrate, the insured age shows a nonlinear effect on coverage choice. Generally, we believe

that the younger driver is typically less risk averse and thus tends to more likely choose a lesser cov-

erage. However, the driving experience typically accumulates as drivers age. This might additionally

explain the nonlinear relationship between the age and policy choice. With respect to the effect of

age on the level of risk, we observe a statistically significant effect of the driver’s age on the number

of accidents. Expectedly, we also observe some nonlinear pattern in the relationship. Generally,

younger drivers tend to have more accidents than older ones. On the other hand, despite several

more years of driving experience, the older drivers tend to have slightly more accidents when com-

pared to medium age drivers. This phenomenon is not surprising at all with older drivers generally

believed to be more frail and more incapacitated when it comes to receptiveness during accidents.

Although the effects are not statistically significant according to our estimates, the female and

married drivers tend to have preference for better (that is, more comprehensive) coverage. At the

same time, both groups of drivers tend to have fewer accidents and are therefore considered safer

drivers. This could possibly be well explained by the higher degree of risk aversion for female and

married individuals as many statistical studies indicate.

In examining the effect of vehicle characteristics, we find that the age of the vehicle exhibits

significant effect on both policy choice and accident occurrence. As a vehicle depreciates over time,

a driver have the tendency to purchase less coverage. This phenomenon may well even be better

explained in Singapore. A major component of the price or value of a vehicle in Singapore is what

is referred to as the Certificate of Entitlement (COE) which as the name implies, entitles the driver

to purchase a vehicle. This entitlement is limited to about 10 years and hence its value even more

rapidly depreciates over time. On the other hand, the effect of vehicle age on the number of acci-

dents appears to be nonlinear. On the average, we observe a small negative effect. One possible

explanation could be that an older vehicle is owned by an older and possibly more experienced

driver. The coefficient of the driver’s experience in the cumulative logit model is -0.0019, indicating

then that policyholders with more years of driving experience tend to buy less coverage for their

vehicles. This observation confirms the nonlinear effect of the insured’s age as earlier. Intuitively,

the more experienced driver is less likely to have accidents.

Another explanatory variable that is worth making an observation is the NCD factor. First, there

is a significant effect of NCD on policy choice in the sense that a policyholder with a high NCD tends

to purchase better insurance coverage on its vehicle. This is because for the same level of discount,

the policyholder has the potential to even save more on a more expensive comprehensive coverage.

On the other hand, the NCD is a good indicator of a policyholder’s accident history and it reflects

the propensity for accidents and thus the policyholder’s risk type. Consistently, a driver with a lower

NCD tends to have more accidents.


Table

3:

Resu

ltin

gest

imate

sof

the

copu

lam

od

el

Ch

oic

e-

Cu

mu

lati

ve

Logit

Ris

k-

Negati

ve

Bin

om

ial

Est

imate

Std

Err

p-v

alu

eE

stim

ate

Std

Err

p-v

alu

e

CH

OIC

EIN

T1

-4.1

93

70

.22

93

0.0

00

0R

ISK

INT

-3.5

86

60

.47

01

0.0

00

0

CH

OIC

EIN

T2

-0.8

47

50

.22

69

0.0

00

2R

ISK

AG

EC

LA

SS

10

.70

33

0.1

99

90

.00

04

CH

OIC

EA

GE

CLA

SS

10

.66

71

0.1

51

00

.00

00

RIS

KA

GE

CLA

SS

20

.40

80

0.1

80

30

.02

36

CH

OIC

EA

GE

CLA

SS

20

.95

27

0.1

30

90

.00

00

RIS

KA

GE

CLA

SS

30

.20

59

0.1

77

30

.24

56

CH

OIC

EA

GE

CLA

SS

30

.83

21

0.1

27

00

.00

00

RIS

KA

GE

CLA

SS

40

.24

49

0.1

77

50

.16

77

CH

OIC

EA

GE

CLA

SS

40

.68

06

0.1

26

90

.00

00

RIS

KA

GE

CLA

SS

50

.29

63

0.1

85

00

.10

92

CH

OIC

EA

GE

CLA

SS

50

.43

00

0.1

34

20

.00

14

RIS

KS

EX

INS

UR

ED

F-0

.06

65

0.0

42

80

.12

04

CH

OIC

ES

EX

INS

UR

ED

F0

.05

23

0.0

43

30

.22

68

RIS

KM

AR

ITA

LM

-0.0

46

20

.04

66

0.3

22

2

CH

OIC

EM

AR

ITA

LM

0.0

48

40

.04

56

0.2

89

5R

ISK

VA

GE

-0.0

52

00

.00

31

0.0

00

0

CH

OIC

EVA

GE

-0.4

73

20

.00

41

0.0

00

0R

ISK

VE

HIC

LE

CLA

SS

11

.39

81

0.4

25

40

.00

10

CH

OIC

EV

EH

ICLE

CLA

SS

13

.69

21

0.1

73

00

.00

00

RIS

KV

EH

ICLE

CLA

SS

21

.34

91

0.4

26

90

.00

16

CH

OIC

EV

EH

ICLE

CLA

SS

23

.03

26

0.1

75

70

.00

00

RIS

KC

APA

CIT

YC

LA

SS

1-0

.10

69

0.0

91

40

.24

23

CH

OIC

EC

APA

CIT

YC

LA

SS

10

.36

66

0.0

76

90

.00

00

RIS

KC

APA

CIT

YC

LA

SS

20

.07

71

0.0

75

00

.30

39

CH

OIC

EC

APA

CIT

YC

LA

SS

20

.17

92

0.0

68

60

.00

89

RIS

KC

APA

CIT

YC

LA

SS

30

.16

62

0.0

71

70

.02

05

CH

OIC

EC

APA

CIT

YC

LA

SS

30

.53

42

0.0

66

50

.00

00

RIS

KB

RA

ND

CLA

SS

1-0

.02

84

0.0

91

70

.75

68

CH

OIC

EB

RA

ND

CLA

SS

1-0

.09

80

0.0

83

40

.24

00

RIS

KB

RA

ND

CLA

SS

20

.21

64

0.0

92

20

.01

89

CH

OIC

EB

RA

ND

CLA

SS

20

.04

75

0.0

85

00

.57

57

RIS

KB

RA

ND

CLA

SS

30

.02

10

0.0

90

20

.81

63

CH

OIC

EB

RA

ND

CLA

SS

30

.14

01

0.0

86

70

.10

63

RIS

KB

RA

ND

CLA

SS

40

.08

77

0.0

94

20

.35

18

CH

OIC

EB

RA

ND

CLA

SS

4-0

.12

86

0.0

91

60

.16

04

RIS

KB

RA

ND

CLA

SS

50

.03

29

0.1

07

90

.76

05

CH

OIC

EB

RA

ND

CLA

SS

5-0

.13

48

0.1

08

00

.21

23

RIS

KB

RA

ND

CLA

SS

6-0

.12

63

0.1

25

10

.31

27

CH

OIC

EB

RA

ND

CLA

SS

6-0

.33

39

0.1

02

10

.00

11

RIS

KB

RA

ND

CLA

SS

70

.05

45

0.0

99

50

.58

40

CH

OIC

EB

RA

ND

CLA

SS

70

.21

31

0.1

14

60

.06

29

RIS

KB

RA

ND

CLA

SS

80

.11

91

0.0

91

20

.19

16

CH

OIC

EB

RA

ND

CLA

SS

80

.35

27

0.0

87

40

.00

00

RIS

KE

XP

ER

IEN

CE

-0.0

05

60

.00

25

0.0

24

7

CH

OIC

EE

XP

ER

IEN

CE

-0.0

01

90

.00

22

0.3

85

2R

ISK

NC

D0

0.3

73

40

.04

84

0.0

00

0

CH

OIC

EN

CD

0-0

.62

56

0.0

45

20

.00

00

RIS

KN

CD

10

0.2

98

40

.05

54

0.0

00

0

CH

OIC

EN

CD

10

-0.5

82

20

.05

22

0.0

00

0R

ISK

NC

D2

00

.14

85

0.0

60

90

.01

47

CH

OIC

EN

CD

20

-0.3

92

50

.05

70

0.0

00

0R

ISK

NC

D3

00

.19

72

0.0

67

50

.00

35

CH

OIC

EN

CD

30

-0.0

68

60

.06

88

0.3

19

1R

ISK

NC

D4

00

.18

89

0.0

71

10

.00

79

CH

OIC

EN

CD

40

0.0

56

80

.07

70

0.4

60

9D

ISP

ER

SIO

N2

.04

22

0.3

41

70

.00

00

DE

PE

ND

EN

CE

1.4

45

70

.14

37

0.0

00

0

Log-l

ikeli

hood

-29

02

6.1

4

AIC

58

17

0.2

8

Spearm

an

sR

ho

0.2

3

Ch

i-sq

uare

test

10

8.3


3.3 Quality of fit tests

Goodness-of-fit tests are performed for the marginals as well as for the copula. For marginal distribu-

tions, we exhibit the observed and predicted frequencies for both the policy choice and the number

of accidents in Table 4. Because of the possible heterogeneity among subjects, the mass probabilities

for both discrete random variables are different across policyholders. We calculate the probabilities

at yi1 = 1, 2, 3 and at yi2 = 0, 1, 2, 3, 4 for each policyholder and present the average probabilities

over all policyholders in the table. The consistency between the actual and predicted frequencies

suggests very satisfactory fit for both marginals.

Table 4: Goodness-of-fit tests of the marginals

Choice Risk

Value Observed Predicted Value Observed Predicted

1 11.07% 10.72% 0 91.48% 91.49%

2 17.37% 16.61% 1 7.79% 7.77%

3 71.56% 72.67% 2 0.66% 0.68%

3 0.07% 0.06%

4 0.00% 0.01%

According to Table 3, the estimate for the dependence parameter in the Frank copula is roughly

1.5, which translates to a Spearman’s rho coefficient of roughly 23%. This provides an evidence of

the positive association between the policy choice and level of risk of the policyholder. In effect, we

find that drivers with better coverage tend to be more prone to accidents.

As already alluded in the previous section, the Frank copula allows for both positive and negative

association. This advantage also makes the goodness-of-fit test rather straightforward because the

statistical test of the dependence parameter avoids the problem often encountered with boundary

hypothesis. A t-statistic of 10.06 strongly supports that the dependence parameter is statistically

significantly different from zero, the case of independence, implying a significant positive association

between the two variables. In addition, one could perform a likelihood ratio test to examine the

quality of fit of the Frank copula. Under the standard set-up, the chi-square statistics follows a χ2

distribution which in this case has one degree of freedom. The observed value of the Chi-square

statistic and its corresponding p-value are presented at the bottom of Table 3. In agreement with

the t-statistic, we find that after controlling for the effects of the information on risk factors (some

driver and vehicle characteristics), a strong evidence of private asymmetric information (the positive

relationship between policy choice and risk) can be observed.

A final examination of the copula model is a robustness test. The prior knowledge about the

coverage-occurrence relationship suggests either a positive or a negative relationship between the

policy choice and number of claims. Because of such a priori information, our initial instinct suggests

to fit a model based on the Frank copula which allows for this flexibility. Due to the non-linear

optimization required in the evaluation of the likelihood function, we are interested in the robustness

of the estimation results by examining other copula specifications. In doing so, we re-calibrated the

copula model under two other customarily used Archimedean-type copulas, the Gumbel copula and

the Clayton copula with the following specifications:

Gumbel: C(u1, u2; θ) = exp[

−((− log u1)θ + (− log u2)

θ)1/θ]

, θ ≥ 1 (7)

Clayton: C(u1, u2; θ) =(

u−θ1 + u−θ2 − 1)−1/θ

, θ > 0 (8)


where θ represents the dependence parameter in both copulas. Note that the Gumbel and Clayton

copulas could not accommodate negative association. The estimated positive relationship that we

already observed based on the Frank copula between policy choice and risk in Section 3 suggests

both copulas are eligible to test for possible robustness. It is easy to see that we have the case of

independence when θ = 1 in the case of the Gumbel copula, and when θ = 0 in the case of the

Clayton copula.

The parameter estimates for the Gumbel and Clayton copula models are displayed in Appendix

A.1. Compared with Table 3, we find the coefficient estimates and p-values for all covariates to be in

agreement with the Frank copula model. The dependence parameter of the Gumbel copula is 1.13

that translates to a Spearman’s rho of 0.17. The dependence parameter of the Clayton copula is 0.32

that corresponds to a Spearman’s rho of 0.21. Both models suggest a positive association between the

policy selection and risk level of the policyholder. Although the t-test for the dependence parameter

in both models involve boundary hypothesis, the small standard errors suggest that the dependence

parameter of the Clayton copula is significantly greater than 0, and the dependence parameter of

the Gumbel copula is significantly greater than 1. The statistical significance is further confirmed by

a likelihood ratio test, with a Chi-square statistic of 94.74 and 110.02 for the Clayton and Clayton

copula model, respectively.

4 Applications to predictive modeling

Based on the empirical data on automobile insurance from a Singapore insurer, we calibrated and

analyzed the proposed copula model, and found a strong evidence of a significant positive associa-

tion between the policy choice and the level of risk of the policyholder as measured by the number

of times a claim has been made during a year. This positive association implies the existence of some

type of asymmetric information, although our findings are unable to distinguish between adverse

selection and moral hazard. A possible application of the copula model constructed in this study is

to further examine this asymmetric information in the automobile insurance market. However, as

pointed out in Section 1, a longitudinal data is needed to separate the effects of the two types of

asymmetric information, which is outside the scope of this study. Instead, we show that our cop-

ula model could also be used in several predictive modeling contexts. Two actuarial applications,

calculating accident probabilities and pure premiums, are presented in this section.

4.1 Accident probability

According to the copula model, we can additionally develop a predictive model that will analyze the

effect of a policyholder’s choice of coverage on the number of accidents. More specifically, consider

the probability distribution of Yi2, conditional on Yi1, which can easily be derived by applying Bayes’

rule:

Prob(Yi2 = yi2|Yi1 = yi1) = fi2|1(yi2|yi1,x,z) =fi(yi1, yi2|x,z)

fi1(yi1|x). (9)

This equation allows one to predict the likelihood of the number of claims, given the policy choice.

On the right hand side of equation (9), we can calculate the probability in the numerator according

to the copula expression discussed in Section 2 and the probability in the denominator based on the

marginal distribution of Yi1. Using our entire dataset, this conditional probability is computed for

each policyholder and then aggregated, given each type of coverage considered in this study. The


results of these calculations are summarized in Table 5. Due to the heterogeneity across the policy-

holders in the portfolio, we choose to present the mean and standard deviation of the conditional

probability by the type of coverage choice. According to the table, the largest posterior accident

probability is observed for the third party coverage and the lowest is observed for the comprehen-

sive coverage.

Table 5: Conditional probability of the number of accidents, given the policy coverage

Number of Third Party Only Fire and Theft Comprehensive

Accidents Probability StdDev Probability StdDev Probability StdDev

0 0.9564 0.0254 0.9478 0.0301 0.9035 0.0519

1 0.0401 0.0221 0.0482 0.0268 0.0883 0.0446

2 0.0032 0.0031 0.0037 0.0033 0.0075 0.0070

3+ 0.0003 0.0005 0.0003 0.0005 0.0007 0.0011

Additionally, using formula (9), we are able to derive the conditional expected number of claims

E(Yi2|Yi1) for each policyholder. To examine the effect of policy selection on the number of accidents,

we perform a pairwise comparison for the three types of coverage. Specifically, we examine the

quantities of the difference in expectations: E(Yi2|Yi1 = 2) − E(Yi2|Yi1 = 1) and E(Yi2|Yi1 = 3) −E(Yi2|Yi1 = 2). Each difference in expectations provides a measure of the average difference in the

number of claims between a particular level of coverage and another coverage that is considered

worse. Descriptive statistics of these two quantities are summarized in Table 6. Consistently, the

expected number of accidents increases with better insurance coverage.

Table 6: The effect of coverage selection on the number of accidents

Mean StdDev 1st Quartile 2nd Quartile 3rd Quartile

E(Yi2|Yi1 = 2) − E(Yi2|Yi1 = 1) 0.009 0.015 0.000 0.002 0.011

E(Yi2|Yi1 = 3) − E(Yi2|Yi1 = 2) 0.049 0.031 0.022 0.049 0.071

Similarly, one could investigate the effects of covariates on the accident probability without and

with the information on the coverage choice for each single policyholder. Using the covariate NCD

as an example, we are able to compare a prior and a posterior accident probability for a given NCD

level. To be more specific, we will examine the following two probabilities:

Prior: Prob(Yi2 = yi2) = fi2(yi2|yi1,z,NCD = k) (10)

Posterior: Prob(Yi2 = yi2|Yi1 = yi1) = fi2|1(yi2|yi1,x,z,NCD = k) (11)

To illustrate, we consider a 30-year old female policyholder who drives a 5-year small car made

by Toyota. The driver has 5 years driving experience and enjoys a no claim discount of 20% cur-

rently. We calculate the prior and posterior accident probabilities for this particular policyholder.

The prior accident probability is calculated without the knowledge of the coverage selected by the

policyholder. The posterior accident probability is the one incorporating the information of coverage

choice. In principle, the probabilities (10) and (11) could be derived for all covariates. Here, we

choose to demonstrate the effects only of some important explanatory variables to illustrate predic-

tive modeling capabilities of our copula approach. Specifically, we examine the effects of age class,

gender, marital status, and the level of NCD. For each of these covariates, the accident probabilities


Table 7: Accident probability for different age classes

Prior Posterior

Third Party Only Fire and Theft Comprehensive

Age ≤ 25

0 0.8008 0.8970 0.8930 0.7951

1 0.1686 0.0890 0.0924 0.1733

2 0.0264 0.0121 0.0126 0.0273

3+ 0.0042 0.0019 0.0020 0.0044

25<Age ≤ 35

0 0.8457 0.9229 0.9206 0.8422

1 0.1361 0.0689 0.0709 0.1391

2 0.0163 0.0074 0.0076 0.0167

3+ 0.0019 0.0009 0.0009 0.0020

36<Age ≤ 45

0 0.8712 0.9368 0.9347 0.8678

1 0.1162 0.0575 0.0594 0.1192

2 0.0115 0.0052 0.0054 0.0119

3+ 0.0011 0.0005 0.0005 0.0011

46<Age ≤ 55

0 0.8666 0.9343 0.9317 0.8626

1 0.1199 0.0596 0.0619 0.1234

2 0.0123 0.0056 0.0058 0.0127

3+ 0.0012 0.0006 0.0006 0.0013

56<Age ≤ 65

0 0.8603 0.9309 0.9274 0.8550

1 0.1248 0.0624 0.0655 0.1295

2 0.0135 0.0061 0.0064 0.0140

3+ 0.0014 0.0006 0.0007 0.0015

Age > 65

0 0.8932 0.9484 0.9444 0.8870

1 0.0982 0.0477 0.0514 0.1038

2 0.0080 0.0036 0.0039 0.0085

3+ 0.0006 0.0003 0.0003 0.0007

are calculated for each hypothetical category holding the other covariates equal to the observed

value. The comparison results are displayed in Tables 7, 8, and 9, respectively.

The same observation about the effect of coverage choice on the accident probability can be said

from the results summarized in these tables. The prior probability is the same for three types of

coverage, because such information is not incorporated into the calculation of the accident prob-

ability. After observing the choice of the policyholder, the accident probability is updated based

on the coverage selection. The probability of no accident is the largest for the highest coverage

(comprehensive) and the smallest for the lowest coverage (third party). The prior probability is

somewhere in between. As discussed before, this could be explained by the presence of additional

private information held only by the policyholder. A policyholder may decide to choose a high cover-

age because she believes she is a risky driver, or that she drives less carefully after purchasing a high

coverage policy. The effects of these covariates on the posterior probability appear to be intuitively


Table 8: Accident probability for gender and marital status

Gender

Prior Posterior

Third Party Fire and Theft Comprehensive

Female Male Female Male Female Male Female Male

0 0.8457 0.8364 0.9229 0.9177 0.9206 0.9151 0.8422 0.8326

1 0.1361 0.1431 0.0689 0.0730 0.0709 0.0753 0.1391 0.1464

2 0.0163 0.0182 0.0074 0.0083 0.0076 0.0085 0.0167 0.0187

3+ 0.0019 0.0023 0.0009 0.0010 0.0009 0.0011 0.0020 0.0024

Marital Status

Prior Posterior

y1=1 y1=2 y1=3

Single Married Single Married Single Married Single Married

0 0.8393 0.8457 0.9193 0.9229 0.9168 0.9206 0.8355 0.8422

1 0.1409 0.1361 0.0717 0.0689 0.0740 0.0709 0.1442 0.1391

2 0.0176 0.0163 0.0080 0.0074 0.0083 0.0076 0.0181 0.0167

3+ 0.0022 0.0019 0.0010 0.0009 0.0010 0.0009 0.0022 0.0020

predictable. The effect of age on accident probability is non-linear; see, for example, Frees et al.

(2009). The accident probability is the highest for young drivers (age ≤ 25), and then decreases

within older age groups. Probability of accidents slightly rises for the old drivers (age > 65). Fixing

other factors equal, a male driver is more likely to incur an accident than a female driver, and a mar-

ried policyholder appears to drive more carefully than a single driver. The effect of NCD is somehow

non-linear. On one hand, the NCD represents the past driving record and thus is an indicator of the

risk level of the policyholder. On the other hand, a high NCD might give the policyholder economic

incentive to purchase higher coverage which might result in less than careful driving.

4.2 Pure Premium

An additional important actuarial application of our copula model is to calculate the posterior pure

premium. In this section, we show how an actuary can evaluate the pure premium by incorporat-

ing the information on coverage choice after observing the accidents for the year. In automobile

insurance, a common approach for price determination is the two-part model, where the accident

probability (typically called claims frequency) and the amount of claims (typically called severity)

are modeled separately. The so-called pure premium is then set to be the expected loss amount that is

equal to the product of the expected number of accidents and the average claims for each accident,

assuming the number of accidents and the amount of losses are independent. The independence

assumption between frequency and severity is commonplace in the literature and in practice.

As demonstrated in the previous section, the proposed copula model can be used to calculate the

accident probability. To derive the pure premium, one needs to fit a model for the severity compo-

nent. For simplicity, we focus on the average claims during the observation year. To accommodate

the typical long-tail nature of insurance claims data, a Generalized Beta distribution of the second

kind (GB2) regression model is employed to model the average claims for each of the three types of

coverage. The GB2 is a four-parameter distributional family with density function:

fGB2(c;µ, σ, p, q) =exp (pz)

c|σ|B(p, q)[1 + exp(z)]p+q, c > 0 (12)


Table 9: Accident probability for different NCDs

Prior Posterior

Third Party Only Fire and Theft Comprehensive

NCD=0

0 0.8352 0.9170 0.9144 0.8314

1 0.1440 0.0736 0.0758 0.1473

2 0.0185 0.0084 0.0087 0.0190

3+ 0.0024 0.0011 0.0011 0.0024

NCD=10

0 0.8457 0.9229 0.9206 0.8422

1 0.1361 0.0689 0.0709 0.1391

2 0.0163 0.0074 0.0076 0.0167

3+ 0.0019 0.0009 0.0009 0.0020

NCD=20

0 0.8650 0.9335 0.9318 0.8624

1 0.1211 0.0602 0.0617 0.1234

2 0.0126 0.0057 0.0059 0.0129

3+ 0.0013 0.0006 0.0006 0.0013

NCD=30

0 0.8589 0.9302 0.9290 0.8570

1 0.1259 0.0629 0.0641 0.1276

2 0.0137 0.0062 0.0063 0.0139

3+ 0.0015 0.0007 0.0007 0.0015

NCD=40

0 0.8600 0.9308 0.9297 0.8583

1 0.1250 0.0624 0.0634 0.1266

2 0.0135 0.0061 0.0062 0.0137

3+ 0.0014 0.0006 0.0006 0.0014

NCD=50

0 0.8819 0.9426 0.9416 0.8804

1 0.1075 0.0526 0.0536 0.1089

2 0.0098 0.0044 0.0045 0.0099

3+ 0.0008 0.0004 0.0004 0.0009

where z = (ln c− µ)/σ and B(p, q) is the usual Euler beta function. The GB2 is a member of the log

location-scale family with the parameters µ as the location parameter, σ as the scale parameter, and

p and q as shape parameters.

The GB2 model is believed to be first introduced in a seminal work by McDonald (1984) to

understand income distribution. Additional applications have further been found in the economics

literature, including McDonald (1987) and McDonald and Butler (1987), and in the insurance lit-

erature such as Cummins et al. (1990). With four parameters, the GB2 provides great flexibility to

model heavy-tailed and skewed data. Many long-tailed distributions are nested as special or limit-

ing cases of the GB2 distribution (see McDonald and Xu (1995)). Despite the flexibility of GB2 in

modeling heavy-tailed data, its application in a regression context has rarely been found in the lit-

erature. McDonald and Butler (1990) first employed the GB2 regression to investigate the duration

of welfare spells. More recently in the actuarial literature, Sun et al. (2008) applied the GB2 in the


prediction of nursing home utilization with longitudinal data. Additionally, Frees and Valdez (2008)

and Frees et al. (2009) used the GB2 to capture the long-tail nature of automobile insurance data in

a hierarchical insurance claims model.

Following Sun et al. (2008), we reparameterize the location parameter µ as a linear function of

the covariates expressed as

µi = l′

iη.

Here, li represents the vector of explanatory variables for the ith policyholder while the vector η

denotes the corresponding coefficients to be estimated. In our context, a GB2 regression model is

calibrated for the average claim amount for third party, third party fire and theft, and comprehensive

policies, respectively. Hence we have in total three sets of parameters (η(j), σ(j), p(j), q(j)) to be

estimated, where j = 1, 2, 3 indicates the three different types of coverage. The likelihood-based

estimation method is used to estimate the three sets of parameters, and the estimation results are

summarized in Appendix A.2 for convenience. To demonstrate the quality of the fit of the models,

we exhibit the qq-plots of the resulting residuals from the three GB2 regression models in Figure 1.

Here we define the residual for the ith policyholder in the GB2 regression model as

ri = exp(ln ci − liη)/σ.

Straightforward calculations show that ri follows i.i.d from GB2(·; 1, 1, p, q). These qq-plots strongly

support the adequacy of the GB2 regression models for all three types of coverage.

−30 −20 −10 0 10

−20

−15

−10

−50

510

Third Party

Empirical Quantile

Theo

retic

al Qu

antile

of G

B2

−10 −8 −6 −4 −2 0

−10

−8−6

−4−2

0

Fire and Theft

Empirical Quantile

Theo

retic

al Qu

antile

of G

B2

−5.0 −4.0 −3.0 −2.0

−5.0

−4.5

−4.0

−3.5

−3.0

−2.5

−2.0

Comprehensive

Empirical Quantile

Theo

retic

al Qu

antile

of G

B2

Figure 1: GB2 qq-plots of regression residuals.

Combining the frequency part and the severity part, one could then derive the posterior pure

premium for the ith policyholder given his/her coverage choice yi1 using

Pure Premiumi

=E(Yi2|Yi1 = yi1) × E(Ci|Yi1 = yi1)

=

+∞∑

yi2=0

yi2fi2|1(y2i|y1i,x,z) × exp(liη(yi1))B(p(yi1) + σ(yi1), q(yi1) − σ(yi1))/B(p(yi1), q(yi1)) (13)

where B(p, q) = Γ(p)Γ(q)/Γ(p + q) and Γ(·) is the Gamma function.


Using the above formula (13), the posterior pure premium is calculated for each policyholder in

our sample. It is interesting to compare the posterior premiums with the (a priori) gross premiums

available in the original data set. Table 10 displays the correlation coefficients between the actual

gross premium and the estimated (posterior) premium for each type of coverage. The strong posi-

tive correlation suggests that the posterior calculation is in line with the premium actually charged

for each policyholder. In addition, we summarize descriptive statistics of the actual and posterior

premiums for each type of coverage as shown in Table 11. The positive margin as indicated by the

difference between the two premiums is anticipated because the gross premium actually charged

to the policyholder typically includes margins to cover expenses, profits, taxes and other possible

contingencies. This positive margin also indicates the adequacy of the premium assessed in the face

of asymmetric information.

Table 10: Correlation between actual and posterior premiums

Correlation p-value

Third Party 0.58282 <0.0001

Third Party Fire and Theft 0.62215 <0.0001

Comprehensive 0.80632 <0.0001

Table 11: Comparison of actual and posterior premiums


Mean StdDev Mean StdDev Mean StdDev

Actual 204.29 197.87 272.21 234.17 492.22 442.15

Posterior 129.54 106.76 208.69 147.40 420.80 263.08

5 Concluding remarks

This paper is primarily motivated to extend the literature in examining and identifying the presence

of asymmetric information in the automobile insurance market. We propose a bivariate copula re-

gression method to jointly examine the policyholder’s coverage choice and the level of risk. This

allows us to examine information asymmetry controlling for the presence of policyholder and vehi-

cle characteristics important to incorporate the heterogeneity of the policyholders in an insurance

portfolio. To test for the presence of asymmetric information, the policyholder’s coverage selection

was measured by an ordinal categorical variable and the degree of risk was approximated by an

ex post risk measure, the number of times a policyholder has claimed in a calendar year. Existing

methods have difficulty in modeling the two discrete variables simultaneously and thus are subject

to the endogeneity criticisms often made with models to test information asymmetry. Our proposed

copula model renders flexibility to partially address these criticisms while at the same time, captures

both linear and non-linear relationships between the policyholder’s coverage choice and his/her

level of risk, something often missing in the literature on information asymmetry. To calibrate the

copula model, we used a cross-sectional empirical observation of an insurance portfolio from a ma-

jor automobile insurer in Singapore. After controlling for the risk factors (policyholder and vehicle

characteristics) observed by the insurer, we found evidence of a strong positive coverage-occurrence

association, which suggests the possible existence of private information by the policyholders.

Beyond testing the asymmetric information, our copula model found extensive applications in


the predictive modeling context. To demonstrate, we showed the calculation of posterior accident

probability and pure premium based on the results of our model estimates. The results re-affirmed

the effect of coverage choice on the likelihood of incurring accidents. The availability of the gross

premium in the original data base allowed us to compare the posterior calculation with the a priori.

We observed a positive margin that can be attributed to the covering of expenses, profits, taxes,

among other things.

A limitation of our analysis is that we focused on the policyholder’s behavior over a cross-

sectional set of observation. A repeated observation over time that is believed to be more informative

in understanding the policyholder’s behavior is not given attention in this article. Thus, the evidence

of asymmetric information is limited to a combined effect of adverse selection and moral hazard. As

discussed in the text, the availability of longitudinal data provides the possibility of distinguishing

the two types of private information. We leave this as one direction of future research we intend to

pursue. The primary focus of this paper is on the ingenuity of the copula construction to examine

asymmetric information, and ancillary to the model construction is the set of additional extensive

applications in predictive modeling.

References

Jaap H. Abbring, James J. Heckman, Pierre-Andre Chiappori, and Jean Pinquet. Adverse selection

and moral hazard in insurance: can dynamic data help to distinguish? Journal of the European

Economic Association, 1(2-3):512–521, 2003.

George A. Akerlof. The market for ‘lemons’: quality uncertainty and the market mechanism. The

Quarterly Journal of Economics, 84(3):488–500, 1970.

Marcel Boyer and Georges Dionne. An empirical analysis of moral hazard and experience rating.

Review of Economics and Statistics, 71(1):128–134, 1989.

A. Colin Cameron and Pravin K. Trivedi. Econometric models based on count data: comparisons and

applications of some estimators. Journal of Applied Econometrics, 1(1):29–53, 1986.

Pierre-Andre Chiappori and Bernard Salanie. Testing for asymmetric information in insurance mar-

ket. Journal of Political Economy, 108(1):56–78, 2000.

Pierre-Andre Chiappori, Bruno Jullien, Bernard Salanie, and Francois Salanie. Asymmetric infor-

mation in insurance: general testable implications. RAND Journal of Economics, 37(4):783–798,

2006.

Alma Cohen. Asymmetric information and learning: evidence from the automobile insurance mar-

ket. The Review of Economics and Statistics, 87(2):197–207, 2005.

J. David Cummins, Georges Dionne, James B. McDonald, and B. Michael Pritchett. Applications of

the gb2 family of distributions in the modeling insurance loss processes. Insurance: Mathematics

and Economics, 9(4):257–272, 1990.

David de Meza and David C. Webb. Advantageous selection in insurance markets. RAND Journal of

Economics, 32(2):249–262, 2001.

Georges Dionne, Christian Gourieroux, and Charles Vanasse. Testing for evidence of adverse se-

lection in the automobile insurance market: a comment. Journal of Political Economy, 109(2):

444–453, 2001.


Edward W. Frees and Emiliano A. Valdez. Hierarchical insurance claims modeling. Journal of the

American Statistical Association, 103(484):1457–1469, 2008.

Edward W. Frees, Peng Shi, and Emiliano A. Valdez. Actuarial applications of a hierarchical claims

model. ASTIN Bulletin, 39(1):165–197, 2009.

Christian Genest. Frank’s family of bivariate distributions. Biometrika, 74(3):549–555, 1987.

Christian Genest and Johanna Neslehova. A primer on copulas for count data. ASTIN Bulletin, 37

(2):475–515, 2007.

Hyojoung Kim, Doyoung Kim, Subin Im, and James W. Hardin. Evidence of asymmetric information

in the automobile insurance market: dichotomous versus multinomial measurement of insurance

coverage. The Journal of Risk and Insurance, 76(2):343–366, 2009.

James B. McDonald. Some generalized functions for the size distribution of income. Econometrica,

52(3):647–663, 1984.

James B. McDonald. Model selection: some generalized distributions. Communications in Statistics,

16(4):1049–1074, 1987.

James B. McDonald and Richard J. Butler. Some generalized mixture distributions with an applica-

tion to unemployment duration. Review of Economics and Statistics, 69(2):232–240, 1987.

James B. McDonald and Richard J. Butler. Regression models for positive random variables. Journal

of Econometrics, 43(1-2):227–251, 1990.

James B. McDonald and Yexiao J. Xu. A generalization of the beta distribution with applications.

Journal of Econometrics, 66(1-2):133–152, 1995.

James E. Prieger. A flexible parametric selection model for non-normal data with application to

health care usage. Journal of Applied Econometrics, 17(4):367–392, 2002.

Robert Puelz and Arthur Snow. Evidence on adverse selection: equilibrium signaling and cross-

subsidization in the insurance market. Journal of Political Economy, 102(2):236–257, 1994.

Didier Richaudeau. Automobile insurance contracts and risk of accident: an empirical test using

french individual data. The Geneva Papers on Risk and Insurance - Theory, 24(1):97–114, 1999.

Michael D. Rothschild and Joseph E. Stiglitz. Equilibrium in competitive insurance markets: the

economics of markets with imperfect information. The Quarterly Journal of Economics, 90(4):

629–650, 1976.

Kuniyoshi Saito. Testing for asymmetric information in the automobile insurance market under rate

regulation. The Journal of Risk and Insurance, 73(2):335–356, 2006.

Murray D. Smith. Modelling sample selection using archimedean copulas. The Econometrics Journal,

6(1):99–123, 2003.

Jiafeng Sun, Edward W. Frees, and Marjorie A. Rosenberg. Heavy-tailed longitudinal data modeling

using copulas. Insurance Mathematics and Economics, 42(2):817–830, 2008.

David M. Zimmer and Pravin K. Trivedi. Using trivariate copulas to model sample selection and

treatment effects. Journal of Business and Economic Statistics, 24(1):63–76, 2006.


Appendix: Additional calibration results

A.1. Estimates of the bivariate model using alternative copulas

Clayton Gumbel

Estimate StdError p-value Estimate StdError p-value

CHOICE INT1 -4.2001 0.2357 0.0000 -4.1606 0.2248 0.0000

CHOICE INT2 -0.8588 0.2333 0.0002 -0.8200 0.2223 0.0002

CHOICE AGECLASS1 0.6587 0.1547 0.0000 0.6852 0.1500 0.0000

CHOICE AGECLASS2 0.9532 0.1341 0.0000 0.9712 0.1299 0.0000

CHOICE AGECLASS3 0.8314 0.1299 0.0000 0.8473 0.1260 0.0000

CHOICE AGECLASS4 0.6826 0.1297 0.0000 0.6981 0.1259 0.0000

CHOICE AGECLASS5 0.4247 0.1370 0.0019 0.4501 0.1333 0.0007

CHOICE SEXINSUREDF 0.0541 0.0433 0.2116 0.0530 0.0432 0.2198

CHOICE MARITALM 0.0479 0.0457 0.2946 0.0482 0.0455 0.2900

CHOICE VAGE -0.4735 0.0041 0.0000 -0.4719 0.0041 0.0000

CHOICE VEHICLECLASS1 3.6852 0.1721 0.0000 3.6974 0.1722 0.0000

CHOICE VEHICLECLASS2 3.0209 0.1747 0.0000 3.0445 0.1749 0.0000

CHOICE CAPACITYCLASS1 0.3710 0.0765 0.0000 0.3583 0.0764 0.0000



CHOICE BRANDCLASS1 -0.0958 0.0840 0.2546 -0.1017 0.0821 0.2153

CHOICE BRANDCLASS2 0.0476 0.0855 0.5781 0.0435 0.0837 0.6031







CHOICE EXPERIENCE -0.0020 0.0022 0.3638 -0.0018 0.0022 0.4160

CHOICE NCD0 -0.6273 0.0452 0.0000 -0.6206 0.0451 0.0000

CHOICE NCD10 -0.5827 0.0522 0.0000 -0.5800 0.0521 0.0000

CHOICE NCD20 -0.3950 0.0571 0.0000 -0.3901 0.0569 0.0000

CHOICE NCD30 -0.0751 0.0689 0.2759 -0.0679 0.0687 0.3230

CHOICE NCD40 0.0528 0.0770 0.4927 0.0580 0.0769 0.4506

RISK INT -3.4658 0.4286 0.0000 -4.0221 0.5086 0.0000

RISK AGECLASS1 0.6939 0.1899 0.0003 0.7244 0.1948 0.0002

RISK AGECLASS2 0.4002 0.1701 0.0186 0.4271 0.1754 0.0149

RISK AGECLASS3 0.2008 0.1672 0.2299 0.2265 0.1722 0.1886

RISK AGECLASS4 0.2428 0.1675 0.1472 0.2634 0.1722 0.1261

RISK AGECLASS5 0.2952 0.1737 0.0893 0.3116 0.1794 0.0823

RISK SEXINSUREDF -0.0650 0.0429 0.1300 -0.0647 0.0426 0.1292

RISK MARITALM -0.0480 0.0464 0.3016 -0.0430 0.0464 0.3538

RISK VAGE -0.0517 0.0031 0.0000 -0.0479 0.0031 0.0000

RISK VEHICLECLASS1 1.2856 0.4124 0.0018 1.8009 0.4749 0.0001

RISK VEHICLECLASS2 1.2321 0.4149 0.0030 1.7646 0.4767 0.0002

RISK CAPACITYCLASS1 -0.1033 0.0901 0.2514 -0.1183 0.0900 0.1886

RISK CAPACITYCLASS2 0.0754 0.0745 0.3111 0.0621 0.0739 0.4013

RISK CAPACITYCLASS3 0.1626 0.0713 0.0227 0.1538 0.0709 0.0300

RISK BRANDCLASS1 -0.0371 0.0923 0.6878 -0.0279 0.0903 0.7573

RISK BRANDCLASS2 0.2008 0.0927 0.0304 0.2222 0.0907 0.0143

RISK BRANDCLASS3 0.0180 0.0908 0.8424 0.0287 0.0889 0.7472

RISK BRANDCLASS4 0.0825 0.0947 0.3837 0.0913 0.0928 0.3255

RISK BRANDCLASS5 0.0260 0.1084 0.8108 0.0425 0.1065 0.6900

RISK BRANDCLASS6 -0.1229 0.1259 0.3292 -0.1443 0.1244 0.2459

RISK BRANDCLASS7 0.0521 0.1001 0.6030 0.0578 0.0984 0.5571

RISK BRANDCLASS8 0.1133 0.0913 0.2147 0.1182 0.0892 0.1854

RISK EXPERIENCE -0.0055 0.0025 0.0277 -0.0054 0.0025 0.0307

RISK NCD0 0.3779 0.0481 0.0000 0.3774 0.0481 0.0000

RISK NCD10 0.2980 0.0554 0.0000 0.3015 0.0552 0.0000

RISK NCD20 0.1505 0.0609 0.0135 0.1529 0.0606 0.0116

RISK NCD30 0.1971 0.0676 0.0036 0.1995 0.0672 0.0030

RISK NCD40 0.1897 0.0711 0.0076 0.1888 0.0710 0.0078

RISK DISPERSION 2.0226 0.3323 0.0000 1.9656 0.3113 0.0000

DEPENDENCE 0.3243 0.0402 0.0000 1.1339 0.0152 0.0000


A.2. Model estimates of the GB2 regressions for the three types of coverage


Estimate StdError Estimate StdError Estimate StdError

INT 4.67758 20.22671 5.04693 7.25633 16.6576 1.88469

AGECLASS1 0.08522 0.80147 0.30522 0.70918 -0.26675 0.23958

AGECLASS2 0.44681 0.6633 0.36807 0.67441 -0.28167 0.20914

AGECLASS3 0.18414 0.59053 -0.11574 0.6555 -0.30778 0.2063

AGECLASS4 0.40905 0.56369 0.19679 0.64948 -0.14535 0.20668

AGECLASS5 -0.28227 0.53849 -0.02156 0.68233 -0.1464 0.21601

SEXINSUREDF -0.13257 0.24567 0.15353 0.14397 0.07577 0.04895

MARITALM -0.01563 0.28268 -0.02837 0.15677 -0.06854 0.05506

VAGE -0.00048 0.01643 -0.0117 0.01479 -0.00886 0.00642

VEHICLECLASS1 2.71887 26.06902 3.17122 7.32979 -0.19153 0.4419

VEHICLECLASS2 2.40576 26.08081 3.30577 7.32863 -0.31469 0.44478

CAPACITYCLASS1 -1.01233 0.3859 0.41478 0.27417 -0.146 0.11073

CAPACITYCLASS2 -0.6866 0.34172 0.276 0.22547 -0.09755 0.08905

CAPACITYCLASS3 -0.6488 0.33846 0.23297 0.2436 -0.01115 0.08472

BRANDCLASS1 1.59989 8.6066 0.68757 0.4083 -0.18176 0.10178

BRANDCLASS2 1.3008 8.61842 0.91015 0.40603 -0.04009 0.10291

BRANDCLASS3 1.24008 8.60906 0.78001 0.40809 -0.13743 0.09939

BRANDCLASS4 1.25225 8.59816 0.47597 0.42724 -0.00862 0.10427

BRANDCLASS5 1.36969 8.61839 1.30815 0.44762 -0.01205 0.1224

BRANDCLASS6 0.26607 8.61422 -0.46972 0.58085 -0.25945 0.14099

BRANDCLASS7 2.41992 8.63265 0.65263 0.61413 -0.40162 0.1098

BRANDCLASS8 1.29999 8.59116 1.08444 0.43066 -0.06013 0.10041

EXPERIENCE 0.01878 0.0107 0.01852 0.00917 -0.00574 0.00289

NCD0 -0.45909 0.30802 0.0243 0.21019 0.06888 0.05652

NCD10 0.05255 0.35219 -0.08583 0.2274 0.11249 0.06363

NCD20 -0.67075 0.32677 0.0245 0.26651 0.08566 0.06818

NCD30 -0.98917 0.35421 0.14669 0.33584 0.08696 0.0763

NCD40 -0.37407 0.42586 -0.04255 0.35143 -0.01861 0.07809

σ 2.06419 1.02818 1.17141 0.2744 0.61004 0.02914

p 0.27252 0.29498 0.79842 0.45819 6.74983 1.30931

q 0.49532 0.63701 6.1853 12.48712 134.5066 92.77842

Documents

A Copula Approach to Test Asymmetric Information with Applications to Predictive Modeling