Non-parametric estimation of price elasticities:

A heterogeneous treatment effect approach

Jingbo Wang and Yufeng Huang ∗

November 28, 2019

Abstract

We revisit the classical problem of price elasticity estimation from a causal perspective. When the price is perceived as a continuous but endogenous treatment, the flexible estimation of price elasticity can be turned into the estimation of heterogeneous treatment effects. To this end, we develop a control function approach to deal with treatment endogeneity when estimating heterogeneous treatment effects. This strategy works by breaking the estimation of price elasticity into several intermediate problems of point-wise expectation estimation, where modern machine learning methods, such as deep neural networks and random forests, can be used for prediction. In addition, we prove that if we use the bagged nearest neighbors for point-wise prediction, the standard bootstrap procedure can be directly employed to derive inference for the price elasticity estimates. Finally, we apply our method to the IRI academic dataset on two national brands of yogurt. We find that the competitor's yogurt in a similar size is a closer substitute than the own brand's yogurt in a different size.

Key words: heterogeneous treatment effects; causal inference; price elasticity; demand analysis; nonparametric IV; bagged nearest neighbors; control function; the bootstrap

∗Jingbo Wang is a Ph.D. candidate in the department of economics at the University of Southern California;

Email: [email protected]. Yufeng Huang is an assistant professor in the department of marketing at the Simon

Business School, University of Rochester; Email: [email protected]. We thank participants in

seminars at USC and the University of Rochester for their valuable feedback. In particular, Jingbo Wang

thanks Professor Cheng Hsiao for his generous and continuous support.

1 Introduction

Price elasticity, the percentage change in sales due to a one percent change in price, is of paramount

importance in economics and marketing. As a primary measure of market structure, the precise

estimation of price elasticity is central in many situations, such as when economists conduct welfare analysis, evaluate the price effects of mergers, and estimate the pass-through of taxes or tariffs. In business economics and marketing, price elasticities can provide rich information

about the behaviors of current and potential customers. This information can assist firms in

making pivotal market competition decisions, such as pricing and targeting.

Despite its general usefulness, the estimation of price elasticity faces two main challenges.

The first challenge is price endogeneity. Observed prices in observational studies are often determined simultaneously by market demand and supply. If not properly addressed, this endogeneity can make elasticity estimates inconsistent and sometimes gives confusing results, such as an upward-sloping demand curve. The second challenge is the flexible estimation of price elasticity. The responsiveness to prices is heterogeneous by nature. Price elasticity estimates should ideally vary across price levels, allow rich substitution patterns, and capture meaningful behavioral changes along demographics.

Current popular approaches mostly use a parametric framework to work on these difficulties.

On the one hand, a parametric setup of the indirect utility enables one to back out the unobserved product characteristics. This advantage clears the way to use the generalized method of moments to deal with price endogeneity. On the other hand, a parametric setup of

the distribution of random coefficients can help overcome the independence of irrelevant alter-

natives. This ingenious feature offers a route to allow flexible substitution patterns. However,

a parametric setup can also be a double-edged sword. First, the consumer choice process can

be complicated and oftentimes involve choices in multiple goods, of multiple quantities, under

limited consideration, and with search friction. Parameterizing the choice process in these cases

has the risk of misspecification. Second, a parametric model also relies on the assumptions of

the preference heterogeneity. It often becomes pivotal to correctly specify these taste distribu-

tions. The estimation result can be sensitive to the specification and sometimes may experience

a fundamental change under a different specification. In these situations, it is strongly desirable

that we also have a model that does not rest on parametric assumptions.

With this motivation, we propose a nonparametric method to estimate price elasticities with

aggregate market-level data in differentiated products markets. In particular, we investigate the

problem of price elasticity estimation from a causal perspective. When market price is perceived

as a continuous but endogenous market policy, the estimation of price elasticity becomes a policy

evaluation problem. If we further require the price elasticity to be flexible, the problem thereby

turns into the estimation of treatment effects conditioning on different values of control variables.

In this way, we can borrow recent developments in the heterogeneous treatment effect literature

to estimate price elasticities in a fully nonparametric fashion.

To be specific, we will adapt classical nonparametric IV models (Blundell and Powell, 2003;

Newey and Powell, 2003; Hall and Horowitz, 2005) and deliberately shift our interest from struc-

tural functions to point-wise estimations. In particular, this paper will make an adaptation of

the triangular control function approach in Newey et al. (1999) as our starting point, since it

can provide a straightforward economic interpretation. We will first specify a triangular simulta-

neous equation system, where the first equation specifies a nonparametric relationship between

demand and prices and the second the relationship between prices and exogenous instrumen-

tal variables. It can be shown that, under commonly made control function assumptions, the

point-wise price slope can be identified as a combination of several point-wise conditional ex-

pectations. As a result, we can then estimate point-wise price elasticities from the estimations

of several intermediate point-wise conditional expectations. To the best of our knowledge, this estimation route has not yet been explored in empirical studies, and this paper is the first to use this result to flexibly estimate price elasticity.

The difficulty in the implementation of this strategy mainly lies in the estimation of point-

wise conditional expectations. In this paper, we propose the use of modern machine learning

methods to deal with this problem. The reason why machine learning methods are preferred is

that classical nonparametric methods, such as kernel and nearest neighbors estimators, suffer

the curse of dimensionality. When the dimension of conditioning covariates turns large, classical

nonparametric estimation can become unstable in practice. By contrast, modern

machine learning methods are often empowered with data-driven algorithms and have been

shown to work well with a relatively large dimension of conditioning variables. In this way, we

are able to accommodate many covariates in a fully nonparametric fashion and have more

leverage to pursue heterogeneity behind all these dimensions.

However, the disadvantage of using popular machine learning methods lies in statistical inference. While machine learning methods have been widely used in business practice, less attention has been paid to how one can conduct valid statistical inference with them. In this paper, we

prove that, if we use the bootstrap averaged (bagged) nearest neighbors estimator (Biau et al.,

2010; Fan et al., 2018) for point-wise prediction, the standard bootstrap procedure can be di-

rectly used for its inference. Since the bootstrap commutes with smooth functions, we are then

able to conveniently use the bootstrap to directly derive inference for price elasticity estimates.

However, it is worth noting that when statistical inference is not a concern, other popular ma-

chine learning methods, such as deep neural networks and random forests, can also be employed

for conditional predictions within our framework.

We will first conduct Monte Carlo simulations to test the working validity of our approach.

In this paper, we demonstrate two settings. The first setting follows a conventional reduced form

setup. We show that our approach recovers flexible shapes of price slopes well despite the contamination from price endogeneity. We also derive confidence intervals using the bootstrap along with our estimates. In our simulations, these confidence intervals cover the true values well. The second Monte Carlo setting uses the standard BLP model (Berry et al., 1995)

as the true model. It is shown that our approach can be compatible with the BLP model and

approximately applies to market level data aggregated from BLP-type individual choices.

With this confidence, our method is then applied to the yogurt industry to estimate the

price elasticities of two leading national brands, Yoplait and Dannon, using the IRI academic

dataset. Our main focus is on the package size of yogurt. We estimate the own and cross price

elasticities for small-sized and large-sized yogurt for both Yoplait and Dannon. We find that the competing brand's yogurt in a similar package size is a closer substitute than the own brand's yogurt in a different size. This result is obtained without a priori structural assumptions on

consumer preferences. We also trace the price elasticity changes across own price levels.

Our paper contributes to the literature on demand estimation. In particular, we revisit the

problem of price elasticity estimation from a causal perspective and in a nonparametric fashion.

Compared to the models based on multinomial choices, our framework imposes fewer parametric assumptions and can thus be more flexible in various situations, for example, when there are concerns about multiple purchasing quantities and endogenous consideration sets. Compared to the

existing nonparametric IV approaches (Blundell et al., 2012, 2016), our interest is mostly on the

point-wise heterogeneity and we incorporate modern machine learning algorithms to empower

classical nonparametric strategies, instead of imposing additional economic and econometric

constraints to regularize classical estimates.

Our paper also contributes to the emerging literature on heterogeneous treatment effects

(Athey and Imbens, 2016; Wager and Athey, 2018). In particular, we adapt the control function

approach in Newey et al. (1999) and shift its focus to point-wise estimations. This adaptation

offers a simple and intuitive pathway to deal with treatment endogeneity in the context of

heterogeneous treatment effects. We enrich the theoretical results in Fan et al. (2018) and prove

that the standard bootstrap procedure can be directly adopted for statistical inference, if the

bagged nearest neighbors estimator is used for intermediate predictions.

Our paper makes an empirical contribution as well. We find that price elasticities exhibit

a very different pattern for yogurt in small and large package sizes. However, it is common

in empirical studies that yogurt of different sizes for the same brand are first pooled together

for further analysis. Our result raises a concern for this conventional practice since yogurt of

different sizes can be intrinsically very different products. We also argue that the heterogeneity

patterns we have found can serve as a primitive for sophisticated structural modeling.

The remainder of the paper is organized as follows. Section 2 gives a brief review of related

literature. We formally introduce our theoretical framework in Section 3. Section 4 gives more

details on our estimation and inference strategy. We further justify our method with a Monte

Carlo simulation in Section 5. Section 6 applies our method to estimate price elasticities for

yogurt with the IRI academic dataset. Section 7 is devoted to a discussion.

2 Related literature

Demand analysis is one of the oldest problems in economics. Working (1927), Stone (1954), and Deaton and Muellbauer (1980) studied this problem repeatedly with new insights. More recently, the BLP model (Berry et al., 1995) has become the leading workhorse model in structural demand analysis. It makes important changes to the multinomial choice model (McFadden, 1973) and provides a seminal framework to deal with endogeneity and flexibility (Nevo, 2001). Meanwhile, the reduced-form approaches are also quickly

evolving. This line of research works on relaxing the linear demand system in traditional analy-

sis (Hausman and Newey, 1995; Banks et al., 1997; Hausman and Newey, 2015) and improving

nonparametric estimates with additional theoretical or empirical constraints (Haag et al., 2009;

Blundell et al., 2012; Dette et al., 2016; Blundell et al., 2016). Interestingly, the seemingly un-

related structural perspective and the reduced form perspective can actually be reconciled under

a general nonparametric setting. Berry and Haile (2014) have ingeniously shown that with connected substitutes (Berry et al., 2013), an index restriction is enough to transform the discrete

choice demand model into a nonparametric IV problem. Compiani (2019) further shows that

this nonparametric approach is able to give better shapes of demand curves when consumers ex-

perience inattention or loss aversion. Our paper contributes to the demand literature by offering

another route to flexibly estimate price elasticity, where machine learning methods can be used

to empower nonparametric regressions.

Our paper is related to the literature of heterogeneous treatment effects. The concept of treat-

ment effect heterogeneity is deeply rooted in economics and possesses great potential in empiri-

cal studies (Heckman et al., 1997; Athey and Imbens, 2016; Wager and Athey, 2018; Fan et al.,

2018). While previous studies mostly focus on exogenous treatments, our framework in this pa-

per offers a simple pathway to deal with treatment endogeneity in the context of heterogeneous

treatment effects. The simplicity of our method comes from adapting classical nonparametric IV

models and focusing on point-wise estimation. Compared to the random forests local generalized

method of moments approach (Athey et al., 2019), our framework is fully nonparametric, open

to the use of various machine learning methods, and enjoys an easy economic interpretation.

Our paper uses the control function approach. As far as we know, the control function approach was introduced to economics as a generalization of the linear instrumental variable approach (Heckman and Robb, 1985). For these early insights, see, for example, Smith and Blundell (1986); Rivers and Vuong (1988). For more recent developments on nonparametric IV models, see, for example, Chesher (2003); Matzkin (2003); Blundell and Powell

(2003); Newey and Powell (2003); Das et al. (2003); Hall and Horowitz (2005); Blundell et al.

(2007); Matzkin (2008); Imbens and Newey (2009); Blundell et al. (2013); Chen et al. (2014);

Matzkin (2015); Hahn and Ridder (2017, 2018). Different from most of the literature, our empir-

ical strategy follows a point-wise approach. In this way, we can uncover observed heterogeneity

and many modern machine learning methods can then be incorporated into the conventional

control function framework for the first time. It is also worth mentioning that Chen and Pouzo (2015) obtain point-wise bootstrap confidence bands for linear and nonlinear functionals of nonparametric IV and nonparametric quantile IV estimators, and Chen and Christensen (2018) obtain bootstrap uniform confidence bands for linear and nonlinear functionals of sieve nonparametric IV estimators.

Our paper makes use of the bootstrap averaged (bagged) nearest neighbors method. The

bagged nearest neighbors regression estimator, as far as we know, first appears in Biau et al. (2010) as a special case of averaging nearest neighbor estimators over subsamples drawn without replacement. More recently, Fan et al. (2018) derives the same estimator independently

from a panel perspective, proves its point-wise asymptotic normality, and proposes to use the

generalized jackknife to analytically remove its higher order bias. The generalized jackknife

procedure, when working together with the bagging algorithm (Breiman, 1996), turns out to

be very powerful and can effectively mitigate the curse of dimensionality. In this paper, however, we would like to perform the generalized jackknife in a flexible fashion and combine intermediate estimates to form the final parameters of interest, so the analytical variance formula becomes less useful for inference. We therefore complement existing results and prove that the

bootstrap can be directly used for the bagged nearest neighbors estimator. This bootstrap result

overcomes the obstacles to conducting inference when using the bagged nearest neighbors.
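As a rough illustration of the bagged nearest neighbors idea described above, the sketch below averages one-nearest-neighbor predictions over subsamples drawn without replacement. It is a Monte Carlo approximation for exposition only: the function name `bagged_1nn`, the subsample size, the number of draws, and the simulated data are all assumptions of this sketch, and it omits the generalized jackknife bias correction of Fan et al. (2018).

```python
import numpy as np

def bagged_1nn(x_train, y_train, x0, subsample_size, n_draws, rng):
    """Average 1-NN predictions at x0 over subsamples drawn without replacement."""
    n = len(y_train)
    # Distance from every training point to the evaluation point x0.
    dist = np.linalg.norm(x_train - x0, axis=1)
    preds = np.empty(n_draws)
    for b in range(n_draws):
        # Draw a subsample of indices without replacement.
        idx = rng.choice(n, size=subsample_size, replace=False)
        # The 1-NN prediction is the outcome of the closest subsampled point.
        nearest = idx[np.argmin(dist[idx])]
        preds[b] = y_train[nearest]
    return preds.mean()

# Hypothetical data: E[y | x] = x^2, evaluated at x0 = 0.5 (true value 0.25).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(2000, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.1, size=2000)
x0 = np.array([0.5])
estimate = bagged_1nn(x, y, x0, subsample_size=50, n_draws=200, rng=rng)
```

A smaller subsample size spreads the weight over more neighbors of `x0`, which lowers variance at the cost of more smoothing bias; the values used here are arbitrary.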

Machine learning methods have been attracting increasing interest in the business world as well

as in economic research (Mullainathan and Spiess, 2017). One way to accommodate machine

learning methods into economic research is to use the penalized regressions, such as LASSO,

for variable selection, and then apply conventional econometric methods to this reduced set of

variables. This convenient strategy can build on the assumption of ignorable approximation

errors (Belloni et al., 2014a,b), or more recently the orthogonality features in many economic

problems (Chernozhukov et al., 2017). Our paper demonstrates another possibility to introduce

machine learning methods into economics. Our philosophy is to transform well-known economic

and econometric models into a combination of intermediate conditional expectation problems

and then utilize modern machine learning methods to work on these conditional expectations.

We treat machine learning as an extension of traditional nonparametric methods.

Our paper connects to the literature of empirical industrial organization. While the current

canonical demand estimation approach builds on the discrete choice framework (McFadden,

1973) and the BLP model (Berry et al., 1995; Nevo, 2001; Petrin, 2002), empirical studies for

many industries have also offered various alternatives that can relax different restrictions of the

canonical model. These extensions include, but are not limited to, cases when consumers pur-

chase multiple goods or multiple units (Hendel, 1999; Dube et al., 2018; Kim et al., 2002), make

decisions under limited consideration (Goeree, 2008) or search frictions (De Los Santos et al.,

2012), choose among geographically-differentiated options (Houde, 2012), and purchase comple-

mentary products (Gentzkow, 2007). Our approach complements the above literature and can

be useful for a wide range of empirical challenges like these.

Our empirical study builds on previous findings in the yogurt industry. The landscape of the

yogurt industry can be simplified as several major producers, a few grocery chains, local stores,

and a vast number of individual consumers. A typical consumer usually purchases multiple units

of yogurt in multiple flavors from different product lines of a single brand (Kim et al., 2002).

The conventional pricing practice for grocery stores is that different product lines are priced

differently, but prices are uniform across flavors (Draganska and Jain, 2006; Draganska et al.,

2009). The vertical relationship between grocery chains and yogurt producers is known to be

important and can affect the offerings of yogurt in local grocery stores (Villas-Boas, 2007). It

is also found that consumers can exhibit brand inertia (Pavlidis and Ellickson, 2018) and have

fixed purchasing costs (Huang and Bronnenberg, 2018). It is interesting that the package size of

yogurt, which is a classic price discrimination device, has received little attention. In this paper,

we offer a novel analysis on the difference between yogurts in different package sizes.

3 Theoretical framework

In this section, we will introduce our theoretical framework to estimate price elasticity. We

will first demonstrate our problem in a nonparametric model, explain how this problem can be

approached from a heterogeneous treatment effect perspective, list a set of assumptions we need,

show how price slopes can be identified, and finally provide an intuition to the identification using

the directed acyclic graphs from Pearl (2009).

3.1 Model setup

We assume that product $j$ at market $t$ has sales $s_{jt}$ and price $p_{jt}$, for $j = 1, 2, \dots, J$ and $t = 1, 2, \dots, T$. Let $\mathbf{p}_t = (p_{1t}, p_{2t}, \dots, p_{Jt})^{\top}$ denote the price vector for products $\{1, 2, \dots, J\}$ at market $t$. Here $(\cdot)^{\top}$ is the transpose. Similarly, let $\mathbf{x}_{jt}$ denote the vector of observed product characteristics for product $j$ at market $t$ and $\mathbf{x}_t = (\mathbf{x}_{1t}^{\top}, \mathbf{x}_{2t}^{\top}, \dots, \mathbf{x}_{Jt}^{\top})^{\top}$ the stacked vector of observed product characteristics at market $t$. We assume that for $j = 1, 2, \dots, J$ and $t = 1, 2, \dots, T$, the structural relationship between market sales, market prices, and the observed product characteristics is that

$$s_{jt} = f_j(\mathbf{p}_t, \mathbf{x}_t) + \epsilon_{jt}, \qquad (1)$$

where $\epsilon_{jt}$ is the aggregate unobserved shock to the demand of product $j$ at market $t$, assumed to have mean zero. The pivotal problem here is that the disturbance $\epsilon_{jt}$ can be correlated with $p_{jt}$, which raises the concern of price endogeneity. The source of price endogeneity can be very general: it may come from the omission of a confounding variable unobserved by econometricians (such as unobserved product characteristics), from measurement error in prices, or from sample selection.

The demand for product j at market t depends on the prices and observed product char-

acteristics of all products at market t in a nonlinear fashion. In our assumption, this demand

function for product j, fj(·), is structural and unchanged across markets. The variation of the

demand of product j across different markets comes from the variation of the prices of all prod-

ucts, the product characteristics of all products, and the realized unobserved shocks in different

markets. Our setup can further allow for more market specific effects if we introduce one more

product with market features as its product characteristics. The model can also incorporate

infinite dimensions of unobserved heterogeneity across geographic locations and time if we allow

the demand function fj(·) to be location specific and time specific, that is, we can split the data

into different subsamples and estimate a demand function for each subsample. However, this flexibility comes at the cost of substantially fewer observations for each estimation. We do not pursue these extensions in this paper since they are not our main focus.

3.2 Heterogeneous treatment effect

We can look at the estimation of price elasticity from the potential outcomes perspective. When

price is perceived as a continuous market policy, the demand at each price level is the potential outcome, as if observed in parallel universes. When our goal is the causal effect of market price on demand,

the problem becomes the evaluation of a market price policy. As a result, the estimation of

price slope is then the estimation of the treatment effect of price on demand. By contrast, the structural approach directly models the preferences of individuals and then uses the stability of these preferences to derive the price slope. We argue that the recovery of human nature is actually a

more fundamental and complex issue. If our goal is solely on price slopes, the causal perspective

is possibly more straightforward.

Price elasticity, which measures normalized price sensitivity, should be heterogeneous by nature. Price sensitivity should change at different price levels, vary for different income

groups, and evolve over age. Flexibility is therefore one of the central concerns when estimating

price elasticity. In this context, we are eager to learn more about the heterogeneity of price

treatment effects on demand. The heterogeneous treatment effect offers one way to allow flexibility and capture heterogeneity. As proposed in Athey and Imbens (2016) and Wager and Athey (2018), it is defined as the treatment effect conditional on a fixed value of all control variables, that is,

$$E[Y(1) - Y(0) \mid X = x],$$

where $Y(1)$ is the potential outcome when treated and $Y(0)$ the potential outcome when untreated. The difference between

heterogeneous treatment effect and the conditional average treatment effect is that here the

control variables X can be potentially of high dimension. In our context, when we consider the

price as a treatment, the partial derivative $\partial_{\mathbf{p}} f_j(\mathbf{p}_t, \mathbf{x}_t)$ is just the continuous-treatment version of the heterogeneous treatment effect conditional on $(\mathbf{p}_t, \mathbf{x}_t)$. This point-wise strategy can help us recover

the heterogeneity of treatment effects because we can deliberately make repeated estimations

at different points and observe the change in estimates. For some problems, knowledge of price slopes suffices. When it does not, we need to further normalize the obtained price slope to form the price elasticity. The normalization is also carried out point-wise.
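To make the point-wise normalization concrete, the sketch below converts a matrix of estimated price slopes into own- and cross-price elasticities using the textbook definition, slope times price divided by quantity. The function name `elasticity_matrix` and the numbers are hypothetical, and whether the quantity in the denominator is observed sales or fitted demand at the evaluation point is not specified in this section, so that choice is an assumption here.

```python
import numpy as np

def elasticity_matrix(slopes, prices, sales):
    """Entry (j, k) is d f_j / d p_k * p_k / s_j at one evaluation point."""
    slopes = np.asarray(slopes, dtype=float)   # (J, J): row j, column k
    prices = np.asarray(prices, dtype=float)   # (J,)
    sales = np.asarray(sales, dtype=float)     # (J,)
    # Scale column k by price k and row j by 1 / sales j.
    return slopes * prices[np.newaxis, :] / sales[:, np.newaxis]

# Hypothetical slopes, prices, and sales for J = 2 products at one point.
slopes = [[-2.0, 0.5],
          [0.4, -1.0]]
prices = [3.0, 2.0]
sales = [12.0, 4.0]
elasticities = elasticity_matrix(slopes, prices, sales)
```

Repeating this calculation at several evaluation points traces how the elasticities change across price levels and covariates.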

The remaining challenge is price endogeneity. When the disturbance $\epsilon$ contains a confounder that simultaneously affects price and demand, we will not be able to directly estimate the heterogeneous treatment effect $\partial_{\mathbf{p}} f_j(\mathbf{p}_t, \mathbf{x}_t)$. The intuition is that we cannot distinguish whether

the demand change comes from the price change or the change in the confounder, which is beyond

our control and moves simultaneously with price. To overcome this empirical hurdle, this paper

proposes a control function approach to estimate heterogeneous treatment effect in the presence

of treatment endogeneity. We will show that the problem can be conveniently transformed by

adapting classical nonparametric IV models. In this paper, we use the triangular simultaneous

equation system in Newey, Powell, and Vella (1999) as our starting point.

3.3 Control function approach

As is common in situations with endogeneity, we need some additional machinery to work on

the problem. It is further assumed that $\mathbf{p}_t$ is related to a vector of instrumental variables $\mathbf{z}_t$,

$$\mathbf{p}_t = g(\mathbf{z}_t) + \mathbf{u}_t, \qquad (2)$$

where $g(\cdot)$ denotes a nonparametric relationship between price and the instrumental variables. In a reduced-form interpretation, $g(\mathbf{z}_t)$ can be the conditional expectation of $\mathbf{p}_t$ given $\mathbf{z}_t$. In this way, it can be assumed without loss of generality that

$$E(\mathbf{u}_t \mid \mathbf{z}_t) = 0. \qquad (3)$$

However, if we try to understand Equation (2) from a structural setup, Equation (3) implies that

the endogeneity issue is confined in Equation (1) and will not continue in Equation (2) for the

exogenous instrumental variables zt. In both situations, the disturbance ut can be interpreted

as the aggregation of all factors that affect price pt beyond and orthogonal to instrumental

variables $\mathbf{z}_t$. The above model setup follows the triangular control function approach in Newey et al. (1999). If $f(\cdot)$ and $g(\cdot)$ are replaced with linear functions, the setup reduces to the classical two-step model in, for example, Heckman and Robb (1985). It is worth noting that here we allow possible overlap between the instrumental

variables zt and the observed product characteristics xt. The famous BLP instrument in the

demand literature is actually the observed characteristics of other products. To make this

point explicit, we will introduce a notation to decompose $\mathbf{z}_t$ into $(\mathbf{z}_t^{(1)}, \mathbf{z}_t^{(2)})$, where $\mathbf{z}_t^{(1)}$ are the instrumental variables that overlap with $\mathbf{x}_t$, and $\mathbf{z}_t^{(2)}$ the instrumental variables excluded from observed product characteristics $\mathbf{x}_t$. The dimension of excluded $\mathbf{z}_t^{(2)}$ is $d_z$.
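The linear special case of the triangular system can be simulated in a few lines. The sketch below is only an illustration under assumed values: a single product, a single excluded instrument, linear $f$ and $g$, and made-up coefficients. It compares a naive regression of sales on price with the two-step control function regression that adds the first-stage residual as a regressor.

```python
import numpy as np

def ols(y, *regressors):
    """Least-squares coefficients of y on a constant and the given regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data-generating process (all numbers are assumptions).
rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)                             # excluded instrument
u = rng.normal(size=n)                             # first-stage disturbance
eps = 0.8 * u + rng.normal(scale=0.5, size=n)      # demand shock, correlated with u
p = 1.0 + 0.7 * z + u                              # Equation (2) with linear g
true_slope = -1.5
s = 5.0 + true_slope * p + eps                     # Equation (1) with linear f

# Naive regression ignores endogeneity and is biased toward zero here.
naive_slope = ols(s, p)[1]

# Step 1: estimate g and form the residual u_hat.
first_stage = ols(p, z)
u_hat = p - (first_stage[0] + first_stage[1] * z)

# Step 2: include u_hat as the control function in the demand regression.
cf_slope = ols(s, p, u_hat)[1]
```

In the nonparametric setting of this paper, both steps would be replaced by flexible point-wise predictors rather than linear regressions.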

Before we move on, we compare the above setup with the multinomial choice model. The multinomial choice model typically features an indirect utility that is linear in product characteristics and a Type I extreme value error. In this case, the aggregated market share has a logit form. To our knowledge, this logit demand function cannot be decomposed into an additive function of the unobserved product characteristics. As a result, the market demand from the random coefficients multinomial choice model, which integrates the logit market demand over realizations of the random coefficients, is also not in general additively separable in unobserved product characteristics. In other words, the triangular

nonparametric model in this paper does not nest the standard random coefficients multinomial choice model as a special case, and the reverse is also true. However, we argue in the appendix that if we follow a point-wise strategy and take a point-wise Taylor expansion of the logit demand function, the demand function from the standard multinomial choice model is approximately additively separable in unobserved product characteristics.
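The approximate separability can be checked numerically in the simplest case. The sketch below uses a plain logit with an outside good and hypothetical mean utilities, and compares exact market shares with a first-order Taylor expansion in the unobserved characteristics around zero. The appendix argument is not reproduced in this section, so the expansion point and all numbers are assumptions of this illustration.

```python
import numpy as np

def logit_shares(delta):
    """Logit market shares with an outside good whose mean utility is zero."""
    expd = np.exp(delta)
    return expd / (1.0 + expd.sum())

# Hypothetical mean utilities without the unobserved characteristics xi.
delta0 = np.array([0.5, -0.2])
# Small hypothetical unobserved product characteristics.
xi = np.array([0.01, -0.02])

s0 = logit_shares(delta0)
# Logit Jacobian: d s_j / d xi_k = s_j * (1{j == k} - s_k).
jac = np.diag(s0) - np.outer(s0, s0)

# First-order expansion: additive (linear) in xi at this evaluation point.
approx = s0 + jac @ xi
exact = logit_shares(delta0 + xi)
max_error = np.abs(exact - approx).max()
```

The approximation error grows with the size of `xi`, so the additive form should be read as a local statement around the evaluation point.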

3.4 Assumptions

Now we are ready to formally list the assumptions we will need for estimation and inference.

The most important assumption that we are going to make is the exclusion assumption.

Assumption 1. For $j = 1, 2, \dots, J$, it holds that

$$E(\epsilon_{jt} \mid \mathbf{x}_t, \mathbf{z}_t, \mathbf{u}_t) = E(\epsilon_{jt} \mid \mathbf{u}_t).$$

Assumption 1 states that once the orthogonal shock $\mathbf{u}_t$ is known, the conditional mean of the unobserved disturbance $\epsilon_{jt}$ depends only on $\mathbf{u}_t$. This assumption is pivotal in the triangular simultaneous equations model (Newey et al., 1999). We can understand it in this way. First of all, the controls $\mathbf{x}_t$ are exogenous, so they can be safely dropped from the conditioning set. However, since $\mathbf{z}_t$ and $\mathbf{u}_t$ are components of prices, they must be correlated with disturbance $\epsilon_{jt}$. The essence of Assumption 1 is that $\mathbf{z}_t$ has no impact on demand except through its effects on price and $\mathbf{u}_t$. This point will be more apparent in Figure 1 when we discuss the intuition behind identification. Beyond this requirement, we also need to ensure that $\mathbf{z}_t$ and $\mathbf{u}_t$ are uncorrelated and can be separated. This is ensured by the conditional independence assumption.

Assumption 2. For the instrumental variables $\mathbf{z}_t$, it holds that

$$E(\mathbf{u}_t \mid \mathbf{z}_t) = 0.$$

Another way to understand Assumption 2 is that there is no direct link between zt and ut.

This point will be further explained in Figure 1. We have made two modeling assumptions by

now. They are crucial to make our strategy work. The coming Assumption 3 and Assumption

4 are regularity conditions to ensure a well-defined solution.

Assumption 3. There are no fewer excluded instrumental variables than endogenous variables, that is, $d_z \ge J$.

Assumption 4. $f_j(\mathbf{p}_t, \mathbf{x}_t)$ for $j = 1, 2, \dots, J$, $E(\epsilon_{jt} \mid \mathbf{u}_t)$ for $j = 1, 2, \dots, J$, and $g(\mathbf{z}_t)$ are continuously differentiable to first order with respect to all arguments. Moreover, the Jacobian matrix of $g(\mathbf{z}_t)$ with respect to the excluded instruments $\mathbf{z}_t^{(2)}$ is of full column rank.

Assumption 3 is commonly made in the instrumental variables literature. It implies that we

will require enough sources of variations to deal with endogeneity. The convenience it brings

will become more explicit when we derive the identification result. Assumption 4 is to ensure

that the partial derivatives exist and have desirable properties. It is a technical condition.

3.5 Identification

This section uses some algebra to derive the identification of the heterogeneous treatment effect ∂pfj(pt,xt) in the presence of endogeneity. The relationship we arrive at has already appeared in Newey, Powell, and Vella (1999). However, to the best of our knowledge, it has not yet been explored as an identification and estimation route. Our paper may be the first to give this result an interpretation in the context of heterogeneous treatment effects and to use it in practice for estimation. Before we proceed, we first introduce some new notation. Let

\[ h_j(p_t, x_t, z_t^{(2)}) := E(s_{jt} \mid p_t, x_t, z_t), \]

which is the conditional expectation of demand given the levels of prices, product characteristics, and instrumental variables, and let

\[ \lambda(u_t) := E(\epsilon_{jt} \mid p_t, x_t, z_t) = E(\epsilon_{jt} \mid u_t), \]

where the second equality comes directly from Assumption 1. When we take conditional expectations on both sides of Equation (1), it can be shown without much effort that

\[ h_j(p_t, x_t, z_t^{(2)}) = f_j(p_t, x_t) + \lambda(u_t), \tag{4} \]

which states that the conditional demand function hj is the sum of the structural demand function fj and the control function λ(·). As the classical argument of the control function approach goes, the equation implies that, once the control function is included, the endogeneity problem turns into an omitted variable problem. In other words, price endogeneity is no longer a concern once we condition on prices, product characteristics, and the newly included control function.
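The step from Equation (1) to Equation (4) can be spelled out as follows. This is only a sketch under two assumptions that are not restated in this section: that Equation (1) has the additive form $s_{jt} = f_j(p_t, x_t) + \epsilon_{jt}$, and that the first stage is $p_t = g(z_t) + u_t$, so that conditioning on $(p_t, x_t, z_t)$ is equivalent to conditioning on $(x_t, z_t, u_t)$.

```latex
\begin{aligned}
h_j(p_t, x_t, z_t^{(2)})
  &= E(s_{jt} \mid p_t, x_t, z_t) \\
  &= f_j(p_t, x_t) + E(\epsilon_{jt} \mid p_t, x_t, z_t)
     && \text{(assumed additive form of Equation (1))} \\
  &= f_j(p_t, x_t) + E(\epsilon_{jt} \mid x_t, z_t, u_t)
     && \text{(since } u_t = p_t - g(z_t) \text{ under the assumed first stage)} \\
  &= f_j(p_t, x_t) + E(\epsilon_{jt} \mid u_t)
     && \text{(Assumption 1)} \\
  &= f_j(p_t, x_t) + \lambda(u_t).
\end{aligned}
```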

Thanks to the regularity conditions in Assumptions 3 and 4, we can take partial derivatives on both sides of Equation (4) with respect to pt and z(2)t . By the chain rule, we obtain

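\[
\underbrace{\partial_{p_t} h_j(p_t, x_t, z_t^{(2)})}_{J \times 1}
= \underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
+ \underbrace{\partial_{u_t} \lambda(u_t)}_{J \times 1}, \tag{5}
\]

\[
\underbrace{\partial_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)})}_{d_z \times 1}
= - \underbrace{\partial_{z_t^{(2)}} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{u_t} \lambda(u_t)}_{J \times 1}. \tag{6}
\]

Here $\partial_{p_t} h_j(p_t, x_t, z_t^{(2)})$ is the Jacobian matrix of the conditional demand function $h_j$ with respect to the price vector $p_t$, evaluated at the point $(p_t, x_t, z_t^{(2)})$. The other terms are defined in the same fashion. We explicitly list the dimensions of these Jacobian matrices underneath to avoid potential confusion over the various definitions of the Jacobian matrix.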

Our goal is the heterogeneous treatment effect ∂pfj(pt,xt). We can rearrange Equation (5)

and Equation (6) and get

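\[
\underbrace{\partial_{z_t^{(2)}} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)})}_{d_z \times 1}
+ \underbrace{\partial_{z_t^{(2)}} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{p_t} h_j(p_t, x_t, z_t^{(2)})}_{J \times 1}, \tag{7}
\]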

which gives us a system of dz linear equations in J unknowns, whose solution is to be discussed

in the following two scenarios.

When dz > J , that is, when the number of excluded instrumental variables is larger than the number of endogenous prices, Equation (7) is an over-identified system. Since $\partial_{z_t^{(2)}} g(z_t)$ has full column rank by Assumption 4, we are able to obtain a minimum distance solution

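\[
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{p_t} h_j(p_t, x_t, z_t^{(2)})}_{J \times 1}
+ \Big( \underbrace{\partial_{z_t^{(2)}} g(z_t)^{T}}_{J \times d_z}\,
        \underbrace{\partial_{z_t^{(2)}} g(z_t)}_{d_z \times J} \Big)^{-1}
  \underbrace{\partial_{z_t^{(2)}} g(z_t)^{T}}_{J \times d_z}\,
  \underbrace{\partial_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)})}_{d_z \times 1}.
\]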

When the number of excluded instruments is equal to the number of endogenous prices, that

is, when dz = J , Equation (7) is just identified. In this case, full column rank can ensure that

∂z(2)tg(zt) is invertible and we can get

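\[
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{p_t} h_j(p_t, x_t, z_t^{(2)})}_{J \times 1}
+ \underbrace{\big[\partial_{z_t^{(2)}} g(z_t)\big]^{-1}}_{J \times J}\,
  \underbrace{\partial_{z_t^{(2)}} h_j(p_t, x_t, z_t^{(2)})}_{J \times 1}, \tag{8}
\]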

which is a simplification of the over-identified solution.

Now we take a closer look at Equation (8). If different endogenous prices do not share the same excluded instrument, for example, when each product price has only its own Hausman price IV, the Jacobian matrix $\partial_{z_t^{(2)}} g(z_t)$ will be a diagonal matrix. In this case, each diagonal element of the inverse of $\partial_{z_t^{(2)}} g(z_t)$ is the reciprocal of the corresponding diagonal element of $\partial_{z_t^{(2)}} g(z_t)$, which implies that we can further simplify the equation and deal with price endogeneity for each price separately. However, if there exists a shared excluded instrument, such as a common cost shifter, the matrix $\partial_{z_t^{(2)}} g(z_t)$ is no longer diagonal and its inverse has to be solved for jointly.
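The algebra in Equations (5) to (8) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `structural_price_slope` and all numbers are hypothetical, and the three derivative inputs are taken as given rather than estimated. It relies on the fact that a least-squares solve returns the minimum distance solution when dz > J and the exact inverse solution when dz = J.

```python
import numpy as np


def structural_price_slope(dh_dp, dh_dz, dg_dz):
    """Solve Equation (7) for the structural slope of f_j with respect to prices.

    dh_dp : (J,)    derivative of h_j with respect to prices
    dh_dz : (dz,)   derivative of h_j with respect to the excluded instruments
    dg_dz : (dz, J) derivative of g with respect to the excluded instruments
    """
    dh_dp = np.asarray(dh_dp, dtype=float)
    dh_dz = np.asarray(dh_dz, dtype=float)
    dg_dz = np.asarray(dg_dz, dtype=float)
    # lstsq returns (G'G)^{-1} G' dh_dz when dz > J and G^{-1} dh_dz when dz = J.
    correction, *_ = np.linalg.lstsq(dg_dz, dh_dz, rcond=None)
    return dh_dp + correction


# Hypothetical over-identified example with J = 2 prices and dz = 3 instruments.
rng = np.random.default_rng(0)
G = rng.normal(size=(3, 2))            # stands in for the Jacobian of g
true_slope = np.array([-1.5, 0.4])     # stands in for the slope of f_j
dlam = np.array([0.7, -0.2])           # stands in for the slope of lambda
dh_dp = true_slope + dlam              # Equation (5)
dh_dz = -G @ dlam                      # Equation (6)
recovered = structural_price_slope(dh_dp, dh_dz, G)

# Hypothetical just-identified example with a diagonal Jacobian:
# each price can then be corrected separately by a scalar division.
G_diag = np.diag([0.8, 1.25])
dh_dz_diag = -G_diag @ dlam
recovered_diag = structural_price_slope(dh_dp, dh_dz_diag, G_diag)
separate = dh_dp + dh_dz_diag / np.diag(G_diag)
```

In the diagonal case the joint solve and the price-by-price division agree, which is the simplification described in the text.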

3.6 An explanation

As we have mentioned, the above relationship has appeared in Newey, Powell, and Vella (1999) with a slightly different notation. However, we are going to give the equation an interpretation in the context of heterogeneous treatment effects. We will use directed acyclic graphs (Pearl, 2009) for illustration, and our interpretation need not be the only one. Before that, to make our argument easier, let us further simplify Equation (8) to the case where there is only one endogenous price and one excluded instrumental variable, that is,

\[ \partial_p f_j(p_t, x_t) = \partial_p h_j(p_t, x_t, z_t) + [\partial_z g(z_t)]^{-1} \partial_z h_j(p_t, x_t, z_t). \tag{9} \]

Figure 1: An explanation on identification

Note: This figure depicts a simplified relationship between demand, price, unobserved confounder, and instrumental variable using the directed acyclic graphs (Pearl, 2009).

Figure 1 depicts the endogenous situation with a directed acyclic graph (Pearl, 2009). We hope to evaluate the causal effect of price on demand, which is Channel 2 in Figure 1. It is the causal effect because it compares demand across parallel universes where the price differs but the unobserved confounder is held fixed. However, the unobserved confounder affects price and demand simultaneously through Channel 3 and Channel 4. As a result, when we move the price a little, the change in demand comes from two sources. One comes directly from the change in price, which is Channel 2, while the other comes from the co-movement of the unobserved confounder with price, which is Channel 4 − 3. What we can observe in the data is the total effect of price on demand, but what we are interested in is the partial effect of price on demand, which is also the causal effect of price on demand. With access to an instrumental variable, however, we are able to decompose the total effect of price on demand into these two parts.


For this decomposition to work, the instrumental variable needs to satisfy two conditions. First, the instrumental variable must not affect the demand directly, which means that there is no direct link between the instrumental variable and the demand. This is what Assumption 1 requires. When there is a change in the instrumental variable, the change in demand can only come from its effect on the price and the unobserved confounder. Second, there should not be a direct link between the instrumental variable and the unobserved confounder. They are two orthogonal determinants of the price. This point is guaranteed by Assumption 2, where the unobserved confounder and the instrumental variable are set to be orthogonal.

Now let us take a step back and look at what the data can identify. First, the data inform us about the total effect of price on demand, which includes the part from the unobserved confounder. It is ∂phj(pt,xt, zt) in Equation (9). Second, we also know how the instrumental variable changes the price, which is Channel 1 in Figure 1 and ∂zg(zt) in Equation (9). Third, we know how the instrumental variable affects demand when the price is held constant. The instrumental variable has such an effect precisely because the price is held constant: the instrumental variable and the unobserved confounder are the two determinants of the price, so a change in the instrumental variable must be offset by an opposite change in the unobserved confounder. This is Channel 1 − 3 + 4 in Figure 1 and ∂zhj(pt,xt, zt) in Equation (9). In this way, the indirect effect of price on demand, which is Channel 4 − 3, can be worked out by dividing the effect of the instrumental variable on demand by the effect of the instrumental variable on the price and reversing the sign, which gives −[∂zg(zt)]⁻¹ ∂zhj(pt,xt, zt). Finally, the total effect minus the indirect effect gives us the direct partial effect of price on demand, as in Equation (9). It is also the heterogeneous treatment effect of price on demand.

4 Estimation and inference

This section is devoted to the implementation details of our nonparametric strategy. We first discuss possible approaches to estimating the point-wise conditional expectations, including classical nonparametric regressions and modern machine learning methods. Once it is clear how to obtain point-wise predictions, the next problem is how to obtain partial derivatives. In this paper, we use finite differences to numerically approximate partial derivatives. Finally, we discuss how to conduct statistical inference. We propose the use of the bagged nearest neighbors method for point-wise prediction and formally prove that the bootstrap can be directly used for inference when the bagged nearest neighbors estimator is used.

4.1 Estimation

The major difference between our method and the nonparametric IV literature is that we have deliberately transformed the problem into point-wise estimations, while current approaches in the nonparametric IV literature have focused mostly on the estimation and inference of structural functions. This point-wise transformation brings two changes. First, we are now capable of capturing the observed heterogeneity of interest. For example, if our goal is to explore heterogeneous treatment effects, we can repeatedly change the values of the conditioning variables and compare the treatment effects at these points. As a result, the point-wise approach gives us leverage to recover heterogeneity. Second, when the economic and econometric problem is transformed into point-wise predictions, economists gain the possibility of using modern learning methods. Although not fully understood, machine learning methods, such as deep neural networks, have demonstrated superior performance in making predictions in a nonlinear fashion. While the current discussion of machine learning in econometrics is mostly about variable selection, this paper shows that these machine learning methods can be conveniently combined with the well-studied nonparametric IV models.

A class of modern machine learning methods can be viewed as algorithm-enhanced classical nonparametric methods. For example, the random forest regression, in its essence, has a convenient representation as a nearest neighbor method, but with adaptive and data-driven local weights (Wager and Athey, 2018). Deep neural networks, at a high level, can be seen as a sophisticated functional approximation using sieves (Chen, 2007), which are already familiar to economists. The novelty is that the basis of the sieves is now implicitly chosen by layers of linear combinations and nonlinear activations. The bagged nearest neighbor method, to be elaborated in the next subsection, can also be seen as a matching estimator (Abadie and Imbens, 2006), which has already been widely used in causal studies, enhanced by the bagging algorithm (Breiman, 1996). Despite these similarities to classical methods, modern machine learning methods turn out to work fairly well in practice, especially when there is a relatively large dimension of control variables. The reason for this surprising difference is still a mystery in probability and statistical theory. One possible explanation is that the algorithms in the machine learning methods help alleviate the curse of dimensionality.

In this paper, our framework allows the general use of any method that delivers consistent conditional predictions. When the dimension of the conditioning variables is low, traditional nonparametric methods, such as the sieve estimator and kernel methods, can also be used within our framework. In this case, the sieve estimator possesses an extra advantage. Our intermediate goals are partial derivatives of conditional expectations. Since the sieve estimator gives us an approximation of the structural function, we can differentiate the basis functions and directly obtain an approximation of the partial derivatives. This strategy is straightforward and can work very well when there are few conditioning variables. When the dimension of the conditioning variables is relatively large, we conjecture that the commonly used deep neural network can enjoy the same advantage. However, since it is not the focus of this paper, we do not explore this direction further. In general, we can use numerical methods to obtain partial derivatives. The most popular method in this domain is perhaps the finite difference method, which has been widely used in modern numerical analysis. For details, see, for example, Strikwerda (2004). The finite difference method is a discretization method, in which finite differences are used to approximate derivatives. By definition, the partial derivative is the limit of a difference quotient, so we can take a small difference and use the difference quotient as an approximation. At the implementation level, the difference quotient can be constructed under various schemes, such as the forward scheme, the backward scheme, and the central scheme. In this paper, for simplicity, we will mostly use the forward scheme, that is, we will use the difference quotient [g(z + δ) − g(z)]/δ to approximate the partial derivative gz(z).
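The forward scheme described above can be sketched in a few lines. The helper names and the two test functions below are hypothetical and serve only to illustrate the difference quotient; the step size is an assumption suited to smooth, noise-free functions. When the function being differenced is itself a noisy estimate of a conditional expectation, a much larger step would typically be needed.

```python
def forward_difference(func, z, delta=1e-4):
    """Forward scheme: [func(z + delta) - func(z)] / delta."""
    return (func(z + delta) - func(z)) / delta


def central_difference(func, z, delta=1e-4):
    """Central scheme, shown only for comparison with the forward scheme."""
    return (func(z + delta) - func(z - delta)) / (2.0 * delta)


def forward_partial(func, point, index, delta=1e-4):
    """Forward-difference partial derivative of func at `point` along coordinate `index`."""
    shifted = list(point)
    shifted[index] += delta
    return (func(shifted) - func(list(point))) / delta


# Hypothetical smooth function with known derivative 3 z^2 + 2.
def cubic(z):
    return z ** 3 + 2.0 * z


# Hypothetical two-argument function; its partial derivative in p[0] is 2 * p[0].
def surface(p):
    return p[0] ** 2 - 3.0 * p[1]


fd = forward_difference(cubic, 0.5)                 # exact derivative is 2.75
cd = central_difference(cubic, 0.5)                 # exact derivative is 2.75
pd = forward_partial(surface, [0.4, 1.0], index=0)  # exact partial derivative is 0.8
```

For smooth functions the forward scheme has an approximation error of order δ, while the central scheme has an error of order δ², at the cost of one extra function evaluation per derivative.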

4.2 Inference

Machine learning methods can give fairly good point-wise predictions. The remaining difficulty is statistical inference. To address this difficulty, we propose to use the bootstrap averaged (bagged) nearest neighbors to estimate point-wise conditional expectations. We formally prove in this paper that the bootstrap can be directly used for inference with the bagged nearest neighbors estimators. Since the bootstrap commutes with smooth functions, we can therefore directly use the bootstrap to conduct inference on the price elasticity estimates. It is worth noting that the price elasticity estimates are not necessarily asymptotically normal.

The bootstrap averaged (bagged) nearest neighbors regression estimator, to the best of our knowledge, first appears in Biau et al. (2010) as a special case of averaging nearest neighbors estimators over subsamples drawn without replacement. The nearest neighbors estimator here is more commonly known in economics as the matching estimator (Abadie and Imbens, 2006). More recently, Fan et al. (2018) obtain the same estimator independently from a panel perspective. The idea is that we can construct an artificial panel structure in the data and average the nearest neighbors from each cross-section. This strategy is in effect equivalent to subsampling, and the subsample size in subsample bootstrapping thereby becomes the counterpart of the time dimension in panel data models (Hsiao, 2014). As a result, the panel jackknife method (Arellano and Hahn, 2013; Dhaene and Jochmans, 2015) can be readily applied to the bagged nearest neighbors estimators to remove higher order bias. In other words, the bagged nearest neighbors estimator can be understood as the joint product of the familiar matching estimators, the bootstrap averaging algorithm (Breiman, 1996), and the generalized jackknife procedure. Surprisingly, this hybrid, like other machine learning methods, turns out to enjoy desirable theoretical properties and has demonstrated fairly good performance with a relatively large dimension of control variables. We give a formal and concise introduction to the bagged nearest neighbors in the appendix.

Since we need to combine several predictions to form price elasticities, statistical inference for these final quantities of interest is not a trivial problem. The obstacles are that 1) we may not have asymptotic normality, 2) the variance formula is not easy to obtain, and 3) the variance formula may not be accurate enough in practice if many plug-in estimates are required. To overcome these obstacles, we prove in this paper that the bootstrap can be used for inference with bagged nearest neighbors. Since the bootstrap commutes with smooth functions, it follows that the bootstrap can be directly used for inference on the price elasticity estimates, if the bagged nearest neighbors estimators are used for point-wise predictions. Our Monte Carlo simulations suggest that the bootstrap strategy works well in practice and offers valid inference with workable precision. Our formal theorem and its proof for the bootstrap result are relegated to the appendix, after the introduction to the bagged nearest neighbors. As far as we know, this is the first time this property has been established for the bagged nearest neighbors. Our proof is novel and becomes mathematically neat with the introduction of the Hoeffding decomposition (Hoeffding, 1948) and the Mallows distance (Bickel and Freedman, 1981).

To conclude this section, our implementation strategy is to first obtain point-wise conditional predictions at various points, use them to numerically derive partial derivatives, and then combine all relevant ingredients to obtain our final estimate of price elasticities. For inference, we repeat the above process on bootstrapped samples of the full data. The distribution of the final estimates from the bootstrapped samples approximates the asymptotic distribution of the elasticity estimator, and we can use this distribution for inference. Moreover, in many cases, we are not satisfied with estimating the price elasticity at only one point. The same estimation and inference procedure can be repeated at all other points of interest. In particular, it is interesting to deliberately change the value of one or two conditioning variables while holding all the others constant. In this way, we can trace the change of price elasticities along one or more dimensions of interest.
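The estimation and inference loop summarized above can be sketched on toy data. This is a minimal illustration under loud assumptions: it uses a generic bagged 1-nearest-neighbor predictor (the 1-NN response averaged over subsamples drawn without replacement) and a percentile bootstrap. It does not reproduce the estimator, the jackknife bias correction, or the theorem in the paper's appendix. All function names, the subsample size, the number of bags, the step size, and the toy data are hypothetical, and the statistic is a plain slope rather than an elasticity.

```python
import numpy as np


def bagged_nn_predict(X, y, x0, subsample_size, n_bags, rng):
    """Average the 1-nearest-neighbor response over subsamples drawn without replacement."""
    n = X.shape[0]
    dist = np.linalg.norm(X - x0, axis=1)  # distance of every observation to the query point
    preds = np.empty(n_bags)
    for b in range(n_bags):
        idx = rng.choice(n, size=subsample_size, replace=False)
        preds[b] = y[idx[np.argmin(dist[idx])]]  # response of the nearest subsampled point
    return preds.mean()


def slope_statistic(X, y, x0, delta, subsample_size=40, n_bags=200, seed=0):
    """Forward-difference slope in the first coordinate from two bagged predictions."""
    base = np.array(x0, dtype=float)
    shifted = base.copy()
    shifted[0] += delta
    # Re-seeding gives both predictions the same subsamples (an illustrative choice).
    low = bagged_nn_predict(X, y, base, subsample_size, n_bags, np.random.default_rng(seed))
    high = bagged_nn_predict(X, y, shifted, subsample_size, n_bags, np.random.default_rng(seed))
    return (high - low) / delta


def bootstrap_interval(X, y, statistic, n_boot, rng, level=0.95):
    """Percentile interval from recomputing `statistic` on rows resampled with replacement."""
    n = X.shape[0]
    draws = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[r] = statistic(X[idx], y[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lo, hi, draws


# Hypothetical toy data: y = 2 x + noise, so the true slope is 2 everywhere.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(4000, 1))
y = 2.0 * X[:, 0] + 0.2 * rng.standard_normal(4000)

x0 = np.array([0.0])
slope_hat = slope_statistic(X, y, x0, delta=0.5)
lo, hi, draws = bootstrap_interval(
    X, y, lambda Xb, yb: slope_statistic(Xb, yb, x0, delta=0.5), n_boot=40, rng=rng
)
```

To trace heterogeneity along one dimension, the same two calls would simply be repeated over a grid of values for `x0`.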

5 Monte Carlo simulation

We conduct Monte Carlo simulations in this section to show that our empirical strategy works well in practice. Our method is applied to two parametric settings where we can conveniently derive analytical solutions. We then compare our estimates and their analytical counterparts at various given points. This design aims to show that our method can capture the heterogeneity of price slopes well, even in the presence of endogeneity. In particular, we will use the following data generating process in our simulation,

\[
\begin{aligned}
s_1 &= g(p_1, p_2, p_3, p_4) + \epsilon, \\
\epsilon &= e - 2u, \\
p_1 &= z + u,
\end{aligned}
\]

where p2, p3, p4, u, e and z are independent and follow the standard normal distribution. Here u can be interpreted as an unobserved common shock that affects both p1 and s1. The presence of the common shock u raises the concern of endogeneity. For the other variables, s1 can be interpreted as the demand for product 1, p1, p2, p3, p4 as the prices for product 1 to product 4, and e as some random demand shock. Moreover, z can be interpreted as a product-specific cost component and can then serve as an instrument for p1.

For the demand function g, we use two model specifications. In particular, for Model 1,

\[ g(p_1, p_2, p_3, p_4) = 5 + p_1^2 + 2p_1 - 3p_2 + p_3 - p_4^2, \]

and for Model 2,

\[ g(p_1, p_2, p_3, p_4) = 5 + p_1^3 + 2p_1 - 3p_2 + p_3 - p_4^2, \]

where the difference is only on the higher order terms of p1. The reason we make this distinction

is that the two models will have a linear and a quadratic price slope along p1, respectively. To

be specific, it is easy to verify that for Model 1, the price slope

\[ \frac{\partial g}{\partial p_1}(p_1, 0, 0, 0) = 2p_1 + 2, \]

and for Model 2,

\[ \frac{\partial g}{\partial p_1}(p_1, 0, 0, 0) = 3p_1^2 + 2. \]
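As a worked check, Equation (9) can be applied by hand to this data generating process. Note the change of notation: in this section $g$ denotes the demand function, while the first stage is $p_1 = z + u$, whose derivative with respect to $z$ is 1. The derivation below is a sketch that uses only the independence and mean-zero properties of $e$ stated above; the symbol $h$ for the conditional demand follows Section 3.5.

```latex
\begin{aligned}
E(\epsilon \mid p_1, p_2, p_3, p_4, z) &= E(e - 2u \mid u) = -2u = -2(p_1 - z), \\
h(p_1, p_2, p_3, p_4, z) &= g(p_1, p_2, p_3, p_4) - 2(p_1 - z), \\
\partial_{p_1} h &= \partial_{p_1} g - 2, \qquad
\partial_{z} h = 2, \qquad
\partial_{z}(z + u) = 1, \\
\partial_{p_1} h + 1^{-1} \cdot \partial_{z} h &= (\partial_{p_1} g - 2) + 2 = \partial_{p_1} g .
\end{aligned}
```

The total slope $\partial_{p_1} h$ understates the structural slope by 2, and the instrument term in Equation (9) adds exactly that amount back.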

Figure 2 depicts our estimation results for Model 1 and Model 2 from one simulated sample

with 10,000 independent observations. The estimations are conducted point by point where p1

moves gradually from −0.8 to 0.8 and the other variables are held at 0. We also computed

the 95% confidence intervals using the bootstrap. They are presented using the dashed lines in

Figure 2. From Figure 2, we can see that even in the presence of endogeneity, our method can

still well capture the price slopes at different price levels. In other words, the heterogeneity of

the price treatment effects has been fully recovered.
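The data generating process for Model 1 can be simulated to see the correction in Equation (9) at work. The sketch below is not the paper's estimator: as a simplifying assumption it fits the conditional demand by ordinary least squares on the known polynomial terms plus z, whereas the paper estimates the conditional expectations nonparametrically. The variable names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 10_000
z, u, e = rng.standard_normal((3, n))
p2, p3, p4 = rng.standard_normal((3, n))
p1 = z + u                                   # first stage: p1 = z + u
s1 = 5 + p1**2 + 2*p1 - 3*p2 + p3 - p4**2 + (e - 2*u)  # Model 1 with epsilon = e - 2u

# Naive fit that ignores the instrument: the coefficient on p1 is contaminated by u.
X_naive = np.column_stack([np.ones(n), p1**2, p1, p2, p3, p4**2])
b_naive = np.linalg.lstsq(X_naive, s1, rcond=None)[0]

# Fit of the conditional demand h(p, z): same terms plus the instrument z.
X_h = np.column_stack([X_naive, z])
b_h = np.linalg.lstsq(X_h, s1, rcond=None)[0]

# Slope of the first stage with respect to z (equal to 1 in this design).
first_stage = np.linalg.lstsq(np.column_stack([np.ones(n), z]), p1, rcond=None)[0][1]


def naive_slope(p1_value):
    """Slope in p1 from the naive fit, ignoring endogeneity."""
    return 2.0 * b_naive[1] * p1_value + b_naive[2]


def corrected_slope(p1_value):
    """Equation (9): slope of h in p1 plus (slope of h in z) / (first-stage slope)."""
    dh_dp = 2.0 * b_h[1] * p1_value + b_h[2]
    dh_dz = b_h[6]
    return dh_dp + dh_dz / first_stage


slope_at_04 = corrected_slope(0.4)   # analytical value is 2 * 0.4 + 2 = 2.8
naive_at_04 = naive_slope(0.4)       # biased downward because E(u | p1) = p1 / 2
```

In this design the naive slope is off by roughly one unit, while the corrected slope should be close to the analytical value 2p1 + 2.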

Figure 2: Simulation 1

[Two panels, Model 1 and Model 2. Horizontal axis: p1 from −0.8 to 0.8. Vertical axis ticks from 0 to 6. Legend: analytical price slope, estimated price slope, 95% confidence interval.]

Note: This figure plots the estimations of our method for one simulated sample with 10,000 independent observations. The estimations are conducted point by point with p1 varying from −0.8 to 0.8 and the other variables at 0.

We further repeat the above simulation exercise 500 times. Table 1 summarizes the results. In Table 1, Column p1 gives the values of p1 at the points where we make estimations. The other variables are held at 0, which is both their mean and their median. The two Slope columns give the analytical values of the price slopes at these points from our theoretical derivation. As a comparison, our price slope estimates are presented in the neighboring Est. Slope columns. The two Variance columns are the variances of the respective price slope estimates across the 500 simulations. Meanwhile, the two Est. Variance columns are the means of the respective 500 simulated bootstrap variances.

It can be seen from Table 1 that our bootstrap inference procedure can give a reliable and

convenient estimate of the true variance of the price slope estimator. This feature becomes

very important if statistical inference is also one of the concerns. The trajectories of the slope

estimates can well mimic their analytical truth, which will be very useful in many economic

research problems, such as the taxation or tariff pass-through. In the appendix, we will also

conduct a Monte Carlo simulation when the sample is generated from a random coefficient

multinomial model. It seems that our method can be compatible with this important case.

To conclude, we have employed a simple Monte Carlo simulation to demonstrate the potential

usefulness of our estimation and inference strategy. It shows that our method is able to recover

heterogeneity in the presence of endogeneity and give a convenient and valid inference.

6 Empirical Application

In this section, we demonstrate the applicability of our approach by estimating the price elas-

ticities among yogurt products. Yogurt is a widely-studied product category in empirical in-

dustrial organization and marketing. However, the yogurt literature has emphasized different

aspects of consumer behavior, and consequently, has adopted different and sometimes mutually-

incompatible demand models. In this paper, we will estimate own- and cross- price elasticities

among popular yogurt products without imposing a priori structural assumptions on consumer

behavior. We believe that the empirical elasticity patterns we find will be 1) directly informa-

tive of product markup and price pass-through, and 2) instructive of more reasonable modeling

assumptions if structural approaches are to be used to characterize demand.

The rich empirical literature on yogurt offers us a myriad of different characterizations of the

demand function that go beyond the “baseline” random coefficient demand system (Berry et al.,

1995). In particular, Villas-Boas (2007) use the conventional random coefficient discrete choice

model but ingeniously point out the importance of the interaction between retailer and manu-

facturer. Meanwhile, various other studies focus on different aspects of the consumer choice

process that are beyond the scope of a conventional discrete choice model. For example,

Kim et al. (2002) study consumers’ choices of a variety of products and in multiple quantities. Pavlidis and Ellickson (2018) study consumer switching costs across products and brands and discuss their implications for dynamic pricing. Huang and Bronnenberg (2018) study the costly consideration-set formation in a demand system, where consumers can choose a variety of products in multiple quantities.

Table 1: Simulation results

          Model 1                                       Model 2
   p1     Slope  Est. Slope  Variance  Est. Variance    Slope  Est. Slope  Variance  Est. Variance
  -0.80   0.40   0.34        0.0603    0.0694           3.92   3.75        0.1390    0.1498
  -0.72   0.56   0.51        0.0607    0.0696           3.56   3.39        0.1293    0.1358
  -0.64   0.72   0.68        0.0613    0.0691           3.23   3.06        0.1169    0.1229
  -0.56   0.88   0.85        0.0629    0.0702           2.94   2.77        0.1092    0.1146
  -0.48   1.04   1.02        0.0660    0.0712           2.69   2.52        0.1071    0.1080
  -0.40   1.20   1.20        0.0668    0.0705           2.48   2.31        0.0980    0.1022
  -0.32   1.36   1.36        0.0679    0.0717           2.31   2.13        0.0932    0.0959
  -0.24   1.52   1.53        0.0706    0.0716           2.17   1.99        0.0893    0.0934
  -0.16   1.68   1.70        0.0700    0.0739           2.08   1.89        0.0863    0.0908
  -0.08   1.84   1.88        0.0702    0.0756           2.02   1.84        0.0838    0.0895
   0.00   2.00   2.05        0.0743    0.0775           2.00   1.82        0.0867    0.0902
   0.08   2.16   2.22        0.0770    0.0784           2.02   1.84        0.0880    0.0903
   0.16   2.32   2.39        0.0772    0.0808           2.08   1.91        0.0883    0.0915
   0.24   2.48   2.56        0.0776    0.0842           2.17   2.01        0.0871    0.0946
   0.32   2.64   2.73        0.0786    0.0868           2.31   2.15        0.0883    0.0987
   0.40   2.80   2.90        0.0819    0.0899           2.48   2.33        0.0932    0.1022
   0.48   2.96   3.07        0.0832    0.0931           2.69   2.55        0.0969    0.1115
   0.56   3.12   3.25        0.0866    0.0956           2.94   2.80        0.1054    0.1175
   0.64   3.28   3.42        0.0921    0.0999           3.23   3.09        0.1190    0.1275
   0.72   3.44   3.59        0.0946    0.1029           3.56   3.43        0.1311    0.1395
   0.80   3.60   3.76        0.1016    0.1089           3.92   3.79        0.1508    0.1529

Note: This table summarizes our simulation results from 500 simulations. Est. is short for Estimated. For each simulation, we generate a sample with 10,000 independent observations and make estimations point by point with the values of p1 varying from −0.8 to 0.8 and the other variables at 0. For each estimation, the estimated variance is obtained by the bootstrap. Meanwhile, variance is the variance of the estimated slopes from 500 simulations.

In this section, we use the IRI academic dataset (Bronnenberg et al., 2008) to characterize elasticities across yogurt products. This dataset has been commonly used in marketing and provides weekly sales information on yogurt for various store chains from 2001 to 2007 in several states of the United States. Using this dataset, we highlight two features that have received little emphasis in the existing literature. First, as in many consumer-packaged-goods categories, yogurt products come in various package sizes, such as 8 oz or 32 oz. Classical microeconomic theory might view these size offerings as a form of second-degree price discrimination. However, in empirical studies, consumer preference heterogeneity across package sizes is often abstracted away. Commonly, yogurts of various package sizes are pooled within the brand for structural analyses, possibly due to the complexity of specifying the preference distribution across sizes. In this paper, we analyze package sizes without a priori structural assumptions. Second, in our sample period, one yogurt brand is consistently priced at a higher level than the other. Such persistent differences might be an outcome of cost differences. Our empirical analysis aims to speak to these two features of demand.

6.1 Sample construction and implementation

We first describe the landscape of total yogurt sales during our sample period. Summary statistics are provided in Table 2. During our sample period, the most popular brand is Yoplait. Dannon follows as the second most popular brand, but with only about two-thirds of the sales of Yoplait. The third most popular brand is the private label yogurt provided by stores. The private label is technically not a brand, since private label yogurts can be very different and often of distinct quality from store to store. For Yoplait and Dannon, we decompose their sales further into popular sizes and report the size-associated total sales. The numbers show that both Yoplait and Dannon offer two vertically differentiated product lines. One is a regular line at around $1.7 per pound and the other is a premium line at around $2.4 per pound. In this paper, we will focus on the regular product line, since in total it sells the most units and can best reflect the competitive interaction between Yoplait and Dannon.

We will define Dannon 0.375 pounds (6 oz) and 0.5 pounds (8 oz) as Dannon small and Dannon 1.5 pounds (24 oz) and 2 pounds (32 oz) as Dannon large. For Yoplait, Yoplait 0.375 pounds (6 oz) is defined to be Yoplait small and Yoplait 1.5 pounds (24 oz) Yoplait large. Since Yoplait offers more at the premium line, we define Yoplait 0.25 pounds (4 oz), 0.6875 pounds (11 oz) and 1.125 pounds (18 oz) as Yoplait premium and use it as a control. For the Private label, we do not further distinguish sizes and use the average price as a control. We compute the average price by dividing the total revenue by the total pounds sold.

Table 2: Yogurt brand and size

  Brand           Total sales    Size     Individual sales   Price
  Yoplait         626,760,636    0.375    490,723,711        $1.66
                                 0.25      63,490,752        $2.55
                                 1.5       26,982,361        $1.73
                                 1.125     23,543,991        $2.37
                                 0.6875    11,123,974        $2.31
                                 0.5        6,517,007        $2.18
                                 3          1,884,085        $1.52
  Dannon          397,986,186    0.375    199,607,994        $1.58
                                 0.5       71,257,840        $1.40
                                 1         33,874,083        $2.24
                                 2         23,157,036        $1.41
                                 1.5       14,539,104        $1.63
  Private label   277,347,934    0.5      211,898,981        $0.98
                                 0.375     42,254,002        $1.13
                                 2         13,661,875        $0.97

Note: This table summarizes yogurt sales in popular brands and sizes from our dataset. The sales are in units, the sizes are in pounds, and the price has been normalized to dollars per pound. Only most popular brands and sizes have been listed.

Table 3: Yogurt brand and size

  Group             Variable   Mean     S.D.     Min     Q1       Median   Q3       Max       Obs.
  Yoplait small     Price        0.92     0.17    0.10     0.80     0.92     1.05      1.58   156,580
                    Sales      416.13   272.49   66.00   207.75   346.13   556.88   1343.25   156,580
                    IV           0.84     0.05    0.72     0.81     0.84     0.87      0.97   156,580
  Yoplait large     Price        0.92     0.12    0.17     0.83     0.92     1.00      1.46   156,580
                    Sales       89.65    61.50   15.00    43.50    72.00   120.00    301.50   156,580
                    IV           0.86     0.02    0.75     0.84     0.86     0.88      0.91   156,580
  Yoplait premium   Price        0.64     0.08    0.15     0.58     0.63     0.70      0.98   156,580
  Dannon small      Price        0.83     0.18    0.11     0.69     0.80     0.94      1.36   156,580
                    Sales      216.38   180.58   20.25    83.25   160.50   291.75    915.00   156,580
                    IV           0.77     0.06    0.58     0.73     0.77     0.81      0.90   156,580
  Dannon large      Price        0.78     0.11    0.27     0.69     0.77     0.86      1.30   156,580
                    Sales      131.13    93.37   18.00    62.00   106.00   172.50    502.00   156,580
                    IV           0.73     0.02    0.64     0.72     0.73     0.74      0.80   156,580
  Private label     Price        0.54     0.09    0.18     0.48     0.53     0.60      1.46   156,580
  Other controls
    Store           ACV          0.22     0.09    0.04     0.16     0.20     0.27      1.00   156,580
    Chain           Shelf.1      0.25     0.10    0.04     0.15     0.22     0.33      0.49   156,580
                    Shelf.2      0.31     0.10    0.03     0.25     0.33     0.40      0.63   156,580
    Time            Week         0.51     0.29    0.02     0.27     0.52     0.75      1.00   156,580
                    Year         0.58     0.28    0.17     0.33     0.67     0.83      1.00   156,580

Note: This table summarizes yogurt sales in popular brands and sizes from our dataset. Prices reported here are per 8 oz, except that the premium Yoplait price is per 4 oz.

With the above definitions, we eventually arrive at a sample of 156, 580 observations at the

store-week level. For each observation, we have the prices for Yoplait small, Yoplait large, Yoplait

premium, Dannon small, Dannon large, and the Private label. We instrument for the potentially

endogenous prices using the Hausman instruments (Hausman et al., 1994; Berry et al., 1995),

that is, prices of the focal product in other geographic markets. We further control for fixed

effects at the store, the chain, and the time level. We use the all-commodity volume (ACV)

provided by the IRI dataset as a control for store level fixed effect. It is a weighted measure of

the product availability based on store aggregate sales. The chain level fixed effects are to be

represented by two variables, Shelf.1 and Shelf.2. They are the ratios of the Dannon sales and

the private label sales to the total sales of yogurt in the grocery chain. They reflect the chain

level preference over Dannon, Yoplait, and the private label. One interpretation is that they can

reflect the shelf space allocation. We also add the number of the week in a year and the number

of the year in our sample to further control the time level fixed effects. Both variables have been

normalized to [0, 1]. Note that although our analysis uses proxies to control for fixed effects, we

do not make restrictions on the functional forms, thereby permitting the controls to involve in

a flexible manner. We provide the summary statistics on all these variables in Table 3.
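As an illustration of the two constructions described above, the following sketch computes a volume-weighted average price (total revenue divided by total pounds sold) and a leave-one-market-out instrument of the Hausman type. The record layout, the function names `average_price` and `hausman_instrument`, and the use of an unweighted mean over the other markets are assumptions made for illustration only; the exact construction applied to the IRI data may differ.

```python
from collections import defaultdict

def average_price(revenues, pounds):
    """Average price per pound: total revenue divided by total pounds sold."""
    return sum(revenues) / sum(pounds)

def hausman_instrument(records):
    """Leave-one-market-out mean price for each (market, product, week).

    records: list of dicts with keys 'market', 'product', 'week', 'price'
    (a hypothetical layout, one record per market-product-week).
    """
    totals = defaultdict(float)  # sum of prices across all markets
    counts = defaultdict(int)    # number of markets observed
    for r in records:
        key = (r["product"], r["week"])
        totals[key] += r["price"]
        counts[key] += 1

    iv = {}
    for r in records:
        key = (r["product"], r["week"])
        n_other = counts[key] - 1
        if n_other > 0:
            # Mean price of the same product-week in all OTHER markets.
            iv[(r["market"], r["product"], r["week"])] = (
                (totals[key] - r["price"]) / n_other
            )
    return iv

records = [
    {"market": "A", "product": "Dannon small", "week": 1, "price": 0.80},
    {"market": "B", "product": "Dannon small", "week": 1, "price": 0.90},
    {"market": "C", "product": "Dannon small", "week": 1, "price": 0.70},
]
iv = hausman_instrument(records)
avg = average_price([10.0, 30.0], [10.0, 50.0])
```

The instrument for a market never uses that market's own price, which is the property that motivates using prices from other geographic markets.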

6.2 Empirical findings

Our empirical findings are presented in Figures 3–6. In each of the four figures, we report the price elasticities of all four focal yogurt products with respect to a change in the price of one of them (e.g., Dannon small). We report estimated price elasticities at each of the 50 price levels between $0.70 and $1.00 per 8 oz, while holding all other prices, market, and control variables at their median levels. The bootstrap 95% confidence intervals for these point estimates are shown as dashed lines. We also report the own and cross elasticity estimates at an own price of $0.85 per 8 oz, with all other variables at their median levels, in Table 4.

First of all, both Dannon's and Yoplait's small-sized products exhibit elastic demand, and their own-price elasticities increase in magnitude with the price. The downward-sloping own-price elasticity is consistent with Marshall's second law of demand, which is commonly assumed in theoretical IO. Second, when Dannon small is priced at $0.85, it has an own elasticity of about −3, whereas Yoplait small has a cross elasticity of about 1.6. In contrast, when Yoplait small is priced at $0.85, it has an own elasticity of about −5, whereas Dannon small has a cross elasticity of about 4. This suggests that Yoplait small consumers are more sensitive to price. Third, we also observe a different cross elasticity pattern for Dannon small and Yoplait small.


Figure 3: Own and cross price elasticities with respect to the price of Dannon small

[Figure: elasticity (vertical axis) plotted against the price of Dannon small (horizontal axis, 0.7 to 1), with one curve each for Dannon small, Dannon large, Yoplait small, and Yoplait large.]

Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and Yoplait large with respect to the price of Dannon small. These elasticities are evaluated at 30 price levels of Dannon small, from 0.7 to 1 dollars per 8 oz.


Figure 4: Own and cross price elasticities with respect to the price of Yoplait small

[Figure: elasticity (vertical axis) plotted against the price of Yoplait small (horizontal axis, 0.7 to 1), with one curve each for Dannon small, Dannon large, Yoplait small, and Yoplait large.]

Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and Yoplait large with respect to the price of Yoplait small. These elasticities are evaluated at 30 price levels of Yoplait small, from 0.7 to 1 dollars per 8 oz.


Figure 5: Own and cross price elasticities with respect to the price of Dannon large

[Figure: elasticity (vertical axis) plotted against the price of Dannon large (horizontal axis, 0.7 to 1), with one curve each for Dannon small, Dannon large, Yoplait small, and Yoplait large.]

Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and Yoplait large with respect to the price of Dannon large. These elasticities are evaluated at 30 price levels of Dannon large, from 0.7 to 1 dollars per 8 oz.


Figure 6: Own and cross price elasticities with respect to the price of Yoplait large

[Figure: elasticity (vertical axis) plotted against the price of Yoplait large (horizontal axis, 0.7 to 1), with one curve each for Dannon small, Dannon large, Yoplait small, and Yoplait large.]

Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait small, and Yoplait large with respect to the price of Yoplait large. These elasticities are evaluated at 30 price levels of Yoplait large, from 0.7 to 1 dollars per 8 oz.


Table 4: Price elasticities at representative points

                     Dannon small   Dannon large   Yoplait small   Yoplait large
(A) Dannon small     -2.961         0.268          1.621           0.957
(B) Dannon large     0.122          -1.371         0.048           0.970
(C) Yoplait small    3.880          0.649          -4.927          1.367
(D) Yoplait large    1.413          0.204          -0.294          -1.576

Note: This table presents the price elasticity estimates when the A (B, C, D) product is priced at $0.85 per 8 oz and all the other products at their median price levels.

As the price of Dannon small increases, the cross elasticity of Yoplait small slowly increases. However, as the price of Yoplait small increases, the cross elasticity of Dannon small slowly decreases. Considering that Yoplait is mostly priced a bit higher than Dannon in our sample, an immediate explanation for this difference is that competition between the two products is more intense when their price levels are closer. Fourth, we observe very similar patterns for Yoplait large and Dannon large when we change the price of Yoplait small or Dannon small. This gives the impression that the large-sized yogurts are specialized for their targeted consumer groups, to whom the brand of small-sized yogurt is less relevant. Finally, when we change the price of Dannon large, the strongest substitution comes from Yoplait large, which is consistent with our findings for small-sized yogurt. However, the pattern does not hold when we change the price of Yoplait large. For Yoplait large, we also fail to observe a downward-sloping own elasticity curve. In addition, the price elasticities for the other three products, i.e., Dannon small, Yoplait small, and Dannon large, are estimated with relatively tight confidence intervals. In contrast, the price elasticities of Yoplait large are estimated with much wider confidence intervals, and over a large price region the cross elasticities for Dannon large and Yoplait small are not distinguishable from zero. All these facts prevent us from reaching a confident interpretation for Yoplait large.

Since substitution among yogurt products appears to be more prevalent within package size than within brand, let us focus solely on the small-sized yogurt and, for now, ignore the other products. This simplification can shed some light on the cost structure of Yoplait and Dannon. From our estimation, at the median price levels of Yoplait and Dannon, ($0.92, $0.80), the own-price elasticities are about −6 for Yoplait and about −2.5 for Dannon. If this is the pricing equilibrium and both firms have constant marginal costs, we can derive the first-order optimality condition for each firm. Plugging in the numbers, a simple computation shows that the marginal cost is about $0.77 for Yoplait and $0.48 for Dannon. Conversely, if we have information on marginal costs, our point-wise price elasticity estimates can also inform optimal pricing.
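The marginal cost figures above can be reproduced with a short calculation. The sketch below assumes the standard single-product first-order condition with constant marginal cost, (p − c)/p = −1/e, where e is the own-price elasticity; this is one reading of the optimality condition mentioned in the text, and the function name `implied_marginal_cost` is hypothetical.

```python
def implied_marginal_cost(price, own_elasticity):
    """Marginal cost implied by the single-product first-order condition.

    With constant marginal cost c, profit maximization gives
    (p - c) / p = -1 / e, hence c = p * (1 + 1 / e), which requires e < -1.
    """
    if own_elasticity >= -1:
        raise ValueError("the condition requires elastic demand (e < -1)")
    return price * (1.0 + 1.0 / own_elasticity)

# Median prices and approximate own-price elasticities reported in the text.
mc_yoplait = implied_marginal_cost(0.92, -6.0)   # about 0.77
mc_dannon = implied_marginal_cost(0.80, -2.5)    # about 0.48
```

Read in the other direction, the same condition gives the price implied by a known marginal cost, p = c · e / (1 + e), evaluated at the point-wise elasticity estimate.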

To conclude, our results show that small-sized and large-sized yogurts are very different products by nature, and that competition among yogurt products exists more within the same package size than within the same brand. A number of empirical studies first aggregate data across all package sizes within a brand and then conduct further analysis. While this approach simplifies the product space, it might also abstract away important dimensions of product differentiation and substitution. It is worth noting that our results build only on minimal assumptions on functional forms, functional differentiability, and the validity of instrumental variables, regardless of the source of endogeneity. The heterogeneity patterns we find are obtained before imposing any a priori structural assumptions on individuals and markets. We believe that such information can be valuable for its own sake and constructive for further structural modeling.

7 Discussion

In this paper, we have proposed a point-wise approach to flexibly estimate price elasticity in the presence of endogeneity. Our framework is open to the use of modern machine learning methods to estimate point-wise conditional expectations. In this way, the curse of dimensionality can be mitigated and the practical performance of our approach can be improved. In particular, if bagged nearest neighbors are used for point-wise prediction, we prove that the standard bootstrap procedure can be directly employed for inference. We believe that our flexible price elasticity estimates can be very useful in a wide range of economic problems, including welfare analysis, firm pricing, and tax incidence studies.

Since this paper has focused on the estimation of and inference on price elasticities, we have not given an explicit discussion of how counterfactual analysis can be conducted in a non-parametric setup. In a parametric model, the conventional assumption for counterfactual analysis is that the structural parameters are stable in and out of sample: they reflect deep preferences and will not change in the new scenario. In this way, structural models can provide predictions and counterfactual analysis even when the market becomes totally different. We argue that if similar assumptions are made, counterfactuals can also be obtained in our framework. However, instead of assuming constant price coefficients, we can impose other restrictions, such as shape restrictions on how price elasticities should evolve. This feature can be important when the new policy is believed to have a significant impact on individual preferences, that is, when structural parameters are most likely to be misspecified.


We are also curious about the potential future use of deep neural networks in our framework. Deep neural networks have been widely used in business practice to deal with big data, where they have demonstrated superior reliability and practical usefulness. For our problem, one particularly interesting direction is how deep neural networks can be used to predict the slopes directly. If there is a convenient and fast algorithm to achieve this, we are curious to know how far it can push the dimensionality limit. Moreover, if unstructured data, such as satellite images, product pictures, and social network texts, can also be utilized to answer economic questions, we wonder what new insights such a change would bring.

References

Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average

treatment effects. Econometrica 74 (1), 235–267.

Arcones, M. A. and E. Gine (1992). On the bootstrap of U and V statistics. The Annals of

Statistics 20 (2), 655–674.

Arellano, M. and J. Hahn (2013). Understanding Bias in Nonlinear Panel Models: Some Recent

Developments, Volume 3, pp. 381–409. Cambridge University Press.

Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), 7353–7360.

Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of

Statistics 47 (2), 1148–1178.

Banks, J., R. Blundell, and A. Lewbel (1997). Quadratic engel curves and consumer demand.

The Review of Economics and Statistics 79 (4), 527–539.

Belloni, A., V. Chernozhukov, and C. Hansen (2014a). High-dimensional methods and inference

on structural and treatment effects. Journal of Economic Perspectives 28 (2), 29–50.

Belloni, A., V. Chernozhukov, and C. Hansen (2014b). Inference on treatment effects after

selection among high-dimensional controls. The Review of Economic Studies 81 (2), 608–650.

Berry, S., A. Gandhi, and P. Haile (2013). Connected substitutes and invertibility of demand.

Econometrica 81 (5), 2087–2111.

Berry, S., J. Levinsohn, and A. Pakes (1995). Automobile prices in market equilibrium. Econometrica 63 (4), 841–890.


Berry, S. T. (1994). Estimating Discrete-Choice models of product differentiation. The RAND

Journal of Economics 25 (2), 242.

Berry, S. T. and P. A. Haile (2014). Identification in differentiated products markets using

market level data. Econometrica 82 (5), 1749–1797.

Biau, G., F. Cerou, and A. Guyader (2010). On the rate of convergence of the bagged nearest neighbor estimate. Journal of Machine Learning Research 11 (3), 687–712.

Biau, G. and L. Devroye (2015). Lectures on the Nearest Neighbor Method. Springer.

Bickel, P. J. and D. A. Freedman (1981). Some asymptotic theory for the bootstrap. The Annals

of Statistics 9 (6), 1196–1217.

Blundell, R., X. Chen, and D. Kristensen (2007). Semi-Nonparametric IV Estimation of Shape-

Invariant Engel Curves. Econometrica 75, 1613–1669.

Blundell, R., J. Horowitz, and M. Parey (2016). Nonparametric estimation of a nonseparable

demand function under the slutsky inequality restriction. The Review of Economics and

Statistics 99 (2), 291–304.

Blundell, R., J. L. Horowitz, and M. Parey (2012). Measuring the price responsiveness of gasoline

demand: Economic shape restrictions and nonparametric demand estimation. Quantitative

Economics 3 (1), 29–51.

Blundell, R., D. Kristensen, and R. L. Matzkin (2013). Control functions and simultaneous

equations methods. American Economic Review 103 (3), 563–69.

Blundell, R. and J. L. Powell (2003). Endogeneity in Nonparametric and Semiparametric Regression Models, Volume 2, pp. 312–357. Cambridge University Press.

Breiman, L. (1996). Bagging predictors. Machine Learning 24 (2), 123–140.

Bronnenberg, B. J., M. W. Kruger, and C. F. Mela (2008). Database Paper-The IRI marketing

data set. Marketing Science 27 (4), 745–748.

Chen, X. (2007). Chapter 76 Large Sample Sieve Estimation of Semi-Nonparametric Models,

Volume 6, pp. 5549. Elsevier.

Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2014). Local Identification of Nonparametric and Semiparametric Models. Econometrica 82, 785–809.


Chen, X. and T. M. Christensen (2018). Optimal sup-norm rates and uniform inference on

nonlinear functionals of nonparametric iv regression. Quantitative Economics 9 (1), 39–84.

Chen, X. and D. Pouzo (2015). Sieve Wald and QLR Inferences on Semi/Nonparametric Conditional Moment Models. Econometrica 83 (3), 1013–1079.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey (2017). Double/Debiased/Neyman machine learning of treatment effects. American Economic Review 107 (5), 261–65.

Chesher, A. (2003). Identification in nonseparable models. Econometrica 71 (5), 1405–1441.

Compiani, G. (2019). Market counterfactuals and the specification of multi-product demand: A nonparametric approach. Working paper.

Das, M., W. K. Newey, and F. Vella (2003). Nonparametric estimation of sample selection

models. The Review of Economic Studies 70 (1), 33–58.

De Los Santos, B., A. Hortacsu, and M. R. Wildenbeest (2012, May). Testing models of consumer search using data on web browsing and purchasing behavior. American Economic Review 102 (6), 2955–80.

Deaton, A. and J. Muellbauer (1980). An almost ideal demand system. The American Economic

Review 70 (3), 312–326.

Dette, H., S. Hoderlein, and N. Neumeyer (2016). Testing multivariate economic restrictions using quantiles: The example of slutsky negative semidefiniteness. Journal of Econometrics 191 (1), 129–144.

Dhaene, G. and K. Jochmans (2015). Split-panel jackknife estimation of fixed-effect models.

The Review of Economic Studies 82 (3), 991–1030.

Draganska, M. and D. C. Jain (2006). Consumer preferences and Product-Line pricing strategies:

An empirical analysis. Marketing Science 25 (2), 164–174.

Draganska, M., M. Mazzeo, and K. Seim (2009). Beyond plain vanilla: Modeling joint product

assortment and pricing decisions. Quantitative Marketing and Economics 7 (2), 105–146.

Dube, J., G. J. Hitsch, and P. E. Rossi (2018). Income and wealth effects on Private-Label

demand: Evidence from the great recession. Marketing Science 37 (1), 22–53.


Efron, B. and C. Stein (1981). The jackknife estimate of variance. The Annals of Statistics 9 (3),

586–596.

Fan, Y., J. Lv, and J. Wang (2018). DNN: A two-scale distributional tale of heterogeneous treatment effect inference. Working paper.

Gentzkow, M. (2007). Valuing new goods in a model with complementarity: Online newspapers.

American Economic Review 97 (3), 713–744.

Goeree, M. S. (2008). Limited information and advertising in the U.S. personal computer industry. Econometrica 76 (5), 1017–1074.

Haag, B. R., S. Hoderlein, and K. Pendakur (2009). Testing and imposing slutsky symmetry in

nonparametric demand systems. Journal of Econometrics 153 (1), 33–50.

Hahn, J. and G. Ridder (2017). Instrumental variable estimation of nonlinear models with

nonclassical measurement error using control variables. Journal of Econometrics 200, 238–

250.

Hahn, J. and G. Ridder (2018). Three-stage semi-parametric inference: Control variables and differentiability. Journal of Econometrics.

Hall, P. and J. L. Horowitz (2005). Nonparametric methods for inference in the presence of

instrumental variables. The Annals of Statistics 33, 2904–2929.

Hausman, J., G. Leonard, and D. J. Zona (1994). Competitive analysis with differenciated

products. Annales d’Economie et de Statistique (34), 159–180.

Hausman, J. A. and W. K. Newey (1995). Nonparametric estimation of exact consumers surplus

and deadweight loss. Econometrica 63 (6), 1445–1476.

Hausman, J. A. and W. K. Newey (2015). Nonparametric welfare analysis. Annual Review of

Economics 9 (1), 1–26.

Heckman, J. and R. Robb (1985). Alternative methods for evaluating the impact of interventions:

An overview. Journal of Econometrics 30 (1), 239–267.

Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an econometric evaluation

estimator: Evidence from evaluating a job training programme. The Review of Economic

Studies 64 (4), 605–654.


Hendel, I. (1999). Estimating multiple-discrete choice models: An application to computerization

returns. The Review of Economic Studies 66 (2), 423–446.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals

of Mathematical Statistics 19 (3), 293–325.

Houde, J. (2012). Spatial differentiation and vertical mergers in retail markets for gasoline.

American Economic Review 102 (5), 2147–82.

Hsiao, C. (2014). Analysis of Panel Data. Cambridge University Press.

Huang, Y. and B. J. Bronnenberg (2018). Pennies for your thoughts: Costly product consideration and purchase quantity thresholds. Marketing Science 37 (6), 1009–1028.

Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous

equations models without additivity. Econometrica 77 (5), 1481–1512.

Kim, J., G. M. Allenby, and P. E. Rossi (2002). Modeling consumer demand for variety. Marketing Science 21 (3), 229–250.

Mack, Y. (1981). Local properties of k-NN regression estimates. SIAM Journal on Algebraic

Discrete Methods 2 (3), 311–323.

Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica 71 (5), 1339–1375.

Matzkin, R. L. (2008). Identification in nonparametric simultaneous equations models. Econometrica 76 (5), 945–978.

Matzkin, R. L. (2015). Estimation of nonparametric models with simultaneity. Econometrica 83 (1), 1–66.

McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior, pp. 105–142.

Academic Press.

Mullainathan, S. and J. Spiess (2017). Machine learning: An applied econometric approach.

Journal of Economic Perspectives 31 (2), 87–106.

Nevo, A. (2001). A practitioner’s guide to estimation of Random-Coefficients logit models of

demand. Journal of Economics & Management Strategy 9 (4), 513–548.

Newey, W. K. and J. L. Powell (2003). Instrumental Variable Estimation of Nonparametric

Models. Econometrica 71, 1565–1578.


Newey, W. K., J. L. Powell, and F. Vella (1999). Nonparametric estimation of triangular simultaneous equations models. Econometrica 67 (3), 565–603.

Pavlidis, P. and P. B. Ellickson (2018). Implications of parent brand inertia for multiproduct

pricing. Quantitative Marketing and Economics 15 (4), 369–407.

Pearl, J. (2009). Causality: Models, reasoning and inference. Cambridge University Press.

Petrin, A. (2002). Quantifying the benefits of new products: The case of the minivan. Journal

of Political Economy 110 (4), 705–729.

Rivers, D. and Q. H. Vuong (1988). Limited information estimators and exogeneity tests for

simultaneous probit models. Journal of Econometrics 39 (3), 347–366.

Smith, R. and R. Blundell (1986). An exogeneity test for a simultaneous equation tobit model

with an application to labor supply. Econometrica 54 (3), 679–685.

Stone, R. (1954). Linear expenditure systems and demand analysis: An application to the

pattern of british demand. The Economic Journal 64 (255), 511–527.

Strikwerda, J. (2004). Finite Difference Schemes and Partial Differential Equations (Second

Edition ed.). Society for Industrial and Applied Mathematics.

Villas-Boas, S. (2007). Vertical relationships between manufacturers and retailers: Inference

with limited data. The Review of Economic Studies 74 (2), 625–652.

Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects

using random forests. Journal of the American Statistical Association 113 (523), 1228–1242.

Working, E. (1927). What do statistical “Demand curves” show? The Quarterly Journal of

Economics 41 (2), 212–235.


Appendix

A Monte Carlo simulation revisited

The BLP model (Berry et al., 1995), or the random coefficients multinomial choice model, has been the workhorse model of demand analysis for decades. This section shows that our approach is compatible with the standard random coefficients multinomial choice model.

A commonly used random coefficients multinomial choice model first specifies the indirect utility from individual consumption: individual $i$ choosing product $j$ in market $t$ enjoys the indirect utility

$$u_{ijt} = \alpha_i + \beta_i p_{jt} + \xi_j + \epsilon_{ijt}, \tag{10}$$

where $p_{jt}$ is the price of product $j$ in market $t$ and $\xi_j$ collects the other characteristics of product $j$. It is usually assumed that the price $p_{jt}$ is product and market specific, while $\xi_j$ is only product specific. However, the crucial difference between $p_{jt}$ and $\xi_j$ is that $\xi_j$ is observed only by individual consumers and not by econometricians. Since prices are often correlated with product characteristics, the omitted product characteristics become the source of price endogeneity in this standard model. In other settings, price endogeneity can also arise from measurement error in prices, endogenous product choice sets, or sample selection.

In Equation (10), $\alpha_i$ and $\beta_i$ represent individual preferences, where $\alpha_i$ is the individual fixed effect and $\beta_i$ is the individual sensitivity to the product price. The difference between this model and the multinomial choice model (McFadden, 1973) is that $\alpha_i$ and $\beta_i$ are assumed to be random coefficients. This feature is introduced to avoid the independence of irrelevant alternatives (IIA) property of multinomial choice models. In this way, the substitution patterns between products can depend on demographic variables, and thus the price elasticity is allowed to be flexible. In our simulation, we explicitly assume

$$\alpha_i = \alpha_1 \nu_{i1}, \qquad \beta_i = \beta_0 + \beta_1 \nu_{i1} + \beta_2 \nu_{i2},$$

where $\nu_{i1}$ and $\nu_{i2}$ represent demographic variables. They are assumed to be independent and to follow a standard normal distribution in this simulation. The other preference parameters, $\alpha_1$, $\beta_0$, $\beta_1$, and $\beta_2$, are set to the pre-determined fixed values $(0.8, -3, 0.5, 0.5)$.

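The error term $\epsilon_{ijt}$ is often assumed to follow the i.i.d. Type I extreme value distribution. In this case, the market share $s_{jt}$ of product $j$ in market $t$ can be written as

$$s_{jt} = \int\!\!\int \frac{\exp(\beta_0 p_{jt} + \xi_j + \alpha_1 \nu_{i1} + \beta_1 \nu_{i1} p_{jt} + \beta_2 \nu_{i2} p_{jt})}{1 + \sum_{q=1}^{J} \exp(\beta_0 p_{qt} + \xi_q + \alpha_1 \nu_{i1} + \beta_1 \nu_{i1} p_{qt} + \beta_2 \nu_{i2} p_{qt})} \, d\Phi(\nu_{i1}) \, d\Phi(\nu_{i2}), \tag{11}$$

where $\Phi$ denotes the standard normal distribution function.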
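To make Equation (11) concrete, the sketch below approximates the double integral by averaging logit choice probabilities over simulated draws of the two demographic variables. It is only an illustration: the use of plain Monte Carlo integration, the function names `market_shares` and `own_elasticity`, the number of draws, and the finite-difference step are assumptions, and the example prices and product characteristics are simply the mean values of the simulation design in this appendix.

```python
import math
import random

# Preference parameters (alpha_1, beta_0, beta_1, beta_2) stated in the text.
ALPHA1, BETA0, BETA1, BETA2 = 0.8, -3.0, 0.5, 0.5

def market_shares(prices, xi, n_draws=5000, seed=0):
    """Monte Carlo approximation of Equation (11) for one market."""
    rng = random.Random(seed)
    shares = [0.0] * len(prices)
    for _ in range(n_draws):
        nu1, nu2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        beta_i = BETA0 + BETA1 * nu1 + BETA2 * nu2
        # Utility index of each inside good for this simulated consumer.
        util = [ALPHA1 * nu1 + beta_i * p + x for p, x in zip(prices, xi)]
        # Shift by the largest index (outside good has index 0) for stability.
        shift = max(0.0, max(util))
        expu = [math.exp(u - shift) for u in util]
        denom = math.exp(-shift) + sum(expu)
        for j, e in enumerate(expu):
            shares[j] += e / denom
    return [s / n_draws for s in shares]

def own_elasticity(prices, xi, j, h=1e-3):
    """Central finite-difference own-price elasticity of product j."""
    up, down = list(prices), list(prices)
    up[j] += h
    down[j] -= h
    # The same seed is reused, so the three evaluations share their draws.
    s_up = market_shares(up, xi)[j]
    s_down = market_shares(down, xi)[j]
    s_mid = market_shares(prices, xi)[j]
    return (s_up - s_down) / (2.0 * h) * prices[j] / s_mid

prices = [0.5, 0.5, 0.75, 1.0]
xi = [5.0, 6.0, 7.0, 8.0]
shares = market_shares(prices, xi)
```

Reusing the same random draws across the perturbed prices keeps the finite-difference quotient smooth, which matters more here than the accuracy of any single share.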


At a high level, this particular form of the market share function comes from the linear specification of indirect utility and the i.i.d. Type I extreme value error. If these fail to hold, the aggregated market share function turns into a general nonlinear function of product prices, observed characteristics, and unobserved characteristics. If the unobserved product characteristics are the only source of endogeneity, our model setup in the paper requires that they enter the demand function in an additively separable way, which may not be true for Equation (11) above. However, as we illustrate at the end of this section, this additivity can be approximately true at each evaluated point. There exist other modeling routes. For example, Berry and Haile (2014) propose to impose an index restriction.

Back to the standard model, the unobserved product characteristics $\xi_j$ can be backed out to deal with endogeneity. Berry (1994) has shown that, if there is an outside good with market share $s_{0t}$, $\log s_{jt} - \log s_{0t}$ can be used to back out the unobserved product characteristics $\xi_j$. The backed-out unobserved product characteristics can then be used to form moment conditions if instrumental variables are available. As a result, the preference parameters in Equation (11) can be recovered using the generalized method of moments. If it is further assumed that these parameters are stable in and out of sample, their estimates can be used to conduct counterfactual analysis.

In our case, we know the true data generating process since this is a Monte Carlo simulation. We assume that

$$p_t = (0.5, 0.5, 0.75, 1)^T + 0.5 u_t + 0.5 z_t, \qquad \xi_t = (5, 6, 7, 8)^T + u_t,$$

where $u_t$ is a common shock to the price $p_t$ and the unobserved product characteristics $\xi_t$, while $z_t$ affects only $p_t$ and can serve as instruments. $u_t$ and $z_t$ are independent random vectors, and each element follows an independent uniform distribution on $[-0.5, 0.5]$. In total, we assume 100,000 independent markets with 4 products. Figure 7 gives our estimation results.
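The role of the common shock and the instrument in this design can be checked numerically. The sketch below draws markets from the stated process and computes sample covariances for the first product; the helper names `draw_market` and `covariance` are hypothetical, and 20,000 markets are used instead of 100,000 only to keep the example fast.

```python
import random

P_MEAN = [0.5, 0.5, 0.75, 1.0]
XI_MEAN = [5.0, 6.0, 7.0, 8.0]

def draw_market(rng):
    """Draw (p_t, xi_t, z_t) for one market with 4 products."""
    u = [rng.uniform(-0.5, 0.5) for _ in P_MEAN]  # common shock
    z = [rng.uniform(-0.5, 0.5) for _ in P_MEAN]  # instrument, shifts price only
    p = [m + 0.5 * a + 0.5 * b for m, a, b in zip(P_MEAN, u, z)]
    xi = [m + a for m, a in zip(XI_MEAN, u)]
    return p, xi, z

def covariance(a, b):
    """Sample covariance with divisor n."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
markets = [draw_market(rng) for _ in range(20000)]
p1 = [mkt[0][0] for mkt in markets]
xi1 = [mkt[1][0] for mkt in markets]
z1 = [mkt[2][0] for mkt in markets]

cov_p_xi = covariance(p1, xi1)  # theory: 0.5 * Var(u) = 1/24, price is endogenous
cov_z_xi = covariance(z1, xi1)  # theory: 0, instrument is excluded
cov_z_p = covariance(z1, p1)    # theory: 0.5 * Var(z) = 1/24, instrument is relevant
```

Since a uniform variable on [−0.5, 0.5] has variance 1/12, the price and the unobserved characteristic have covariance 1/24, while the instrument is correlated with the price but not with the unobserved characteristic.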

We have argued that Equation (11) cannot in general be guaranteed to be expressed as an additive model in the unobserved product characteristics $\xi$. However, consider a Taylor expansion at a fixed point $(x_t^*, p_t^*, \xi_t^*)$,

$$s_{jt}(x_t, p_t, \xi_t) = s_{jt}(x_t^*, p_t^*, \xi_t^*) + A^T (x_t - x_t^*) + B^T (p_t - p_t^*) + C^T (\xi_t - \xi_t^*) + o(\cdot),$$

where

$$A = \nabla_{x_t} s_{jt}(x_t^*, p_t^*, \xi_t^*), \qquad B = \nabla_{p_t} s_{jt}(x_t^*, p_t^*, \xi_t^*), \qquad C = \nabla_{\xi_t} s_{jt}(x_t^*, p_t^*, \xi_t^*).$$

Hence it can be approximately true that $s_{jt}$ is additive in $\xi$ locally at each point.


Figure 7: Simulation with BLP as the true model

[Figure: two panels, "Own elasticity" and "Cross elasticity", each plotting the true elasticity and the estimated elasticity against p1 from 0.2 to 0.8.]

Note: This figure plots the estimations of our method when the true model is the BLP model. The sample size is 100,000. The estimations are conducted point by point with p1 varying from 0.2 to 0.8 and the other variables at their median levels.


B Bagged nearest neighbors

In this section, we provide a brief but rigorous introduction to the bagged nearest neighbors method. We assume a standard nonparametric regression setup,

$$y_i = g(x_i) + \epsilon_i,$$

where $y$ is a scalar and $x \in \mathbb{R}^d$ with $d$ fixed but potentially large. The function $g(x)$ can be $E(y_i \mid x_i = x)$, and the disturbance $\epsilon_i$ is independent of $x$. Our primary goal here is to estimate and conduct inference on $E(y_i \mid x_i = x_0)$ at a given $x_0$, which differs from the conventional pursuit of the whole function $g$. We further denote $z_i = (x_i, y_i)$ and maintain the following two commonly made assumptions.

Assumption 5. We have an i.i.d. sample, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Assumption 6. The density $f(\cdot)$ of $x$ is bounded away from $0$ and $\infty$, $f(\cdot)$ and $g(\cdot)$ are both twice continuously differentiable with bounded second derivatives in a neighborhood of $x$, and $y$ has finite second moment, $E y^2 < \infty$. $\epsilon$ has zero mean and finite variance $\sigma^2 > 0$.

Definition

Before we proceed, we introduce some extra notation. Let $\{i_1, \ldots, i_m\}$ with $i_1 < i_2 < \cdots < i_m$ and $m \le n$ be a subset of size $m$ from the index set $\{1, \ldots, n\}$. $\Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m})$ is defined to be the 1-nearest neighbor estimator for $x = x_0$ in the subsample $\{z_{i_j}\}_{j=1}^{m}$, that is,

$$\Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m}) = y_{(1)}(z_{i_1}, z_{i_2}, \ldots, z_{i_m}),$$

where $y_{(1)}(z_{i_1}, z_{i_2}, \ldots, z_{i_m})$ is the $y$ associated with the closest point among $(x_{i_1}, \ldots, x_{i_m})$ to the fixed and given point $x_0$ in $\mathbb{R}^d$ in terms of the Euclidean distance.

The definition for a bagged nearest neighbors estimator is

$$\tau_n(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_m \le n} \Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m}). \tag{12}$$

There is an equivalent L-statistic representation,

$$\tau_n(m)(x_0) = \binom{n}{m}^{-1} \left\{ \binom{n-1}{m-1} y_{(1)} + \binom{n-2}{m-1} y_{(2)} + \cdots + \binom{m-1}{m-1} y_{(n-m+1)} \right\}. \tag{13}$$
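The equivalence between Equations (12) and (13) can be verified on a small sample. In the sketch below, y_(i) in Equation (13) is taken to be the response of the i-th nearest neighbor of x0 in the full sample, and the i-th nearest neighbor is the closest point in exactly C(n − i, m − 1) of the C(n, m) subsamples. The function names and the toy data are hypothetical, and the brute-force version is only feasible for very small n.

```python
import math
from itertools import combinations

def bagged_1nn(x0, xs, ys, m):
    """Bagged 1-NN estimate at x0 using the L-statistic weights of Equation (13)."""
    n = len(xs)
    # Indices of the sample points ordered by Euclidean distance to x0.
    order = sorted(range(n), key=lambda i: math.dist(xs[i], x0))
    total = math.comb(n, m)
    return sum(
        math.comb(n - rank, m - 1) / total * ys[i]
        for rank, i in enumerate(order, start=1)
        if rank <= n - m + 1
    )

def bagged_1nn_bruteforce(x0, xs, ys, m):
    """Average of the 1-NN predictions over all size-m subsamples, Equation (12)."""
    preds = []
    for subset in combinations(range(len(xs)), m):
        nearest = min(subset, key=lambda i: math.dist(xs[i], x0))
        preds.append(ys[nearest])
    return sum(preds) / len(preds)

# Toy data with no ties in the distances to x0.
xs = [(0.1, 0.2), (0.4, 0.4), (0.9, 0.1), (0.3, 0.8), (0.6, 0.5), (0.2, 0.6)]
ys = [1.0, 2.0, 0.5, 3.0, 2.5, 1.5]
x0 = (0.5, 0.5)

fast = bagged_1nn(x0, xs, ys, m=3)
slow = bagged_1nn_bruteforce(x0, xs, ys, m=3)
```

The L-statistic form needs one sort of the sample rather than an enumeration of all subsamples, which is what makes the estimator practical for large n.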

Before introducing the properties of the bagged nearest neighbors estimator, we briefly review its development. As far as we know, the bagged nearest neighbors estimator was first proposed in Biau et al. (2010) as a special case of averaging (Breiman, 1996) nearest neighbor estimators (Mack, 1981) over subsamples drawn without replacement. In their paper, Biau et al. (2010) derive the universal consistency of the bagged nearest neighbors. More recently, Fan et al. (2018) derive the same estimator independently from a panel perspective. However, their focus is on the point-wise heterogeneous treatment effect, and they propose a two-scale approach to further remove the higher order bias term. This generalized jackknife procedure, when combined with the bagging algorithm, can substantially improve the performance of the bagged nearest neighbors. Fan et al. (2018) also prove the point-wise asymptotic normality of the bagged nearest neighbors. In this paper, we enrich these results and prove that the bootstrap (Efron and Stein, 1981) can be conveniently employed for inference on bagged nearest neighbors. Our result is particularly useful when the parameter of interest is a smooth function of bagged nearest neighbors estimates.

Theorem 1. (Biau and Devroye (2015)) Given $x_0 \in \mathrm{supp}(x)$, under Assumptions 5–6,

$$E\, \tau_n(m)(x_0) = g(x_0) + B(m),$$

$$B(m) = \Gamma(2/d + 1) \, \frac{f(x_0) \, \mathrm{tr}(g''(x_0)) + 2 \, g'(x_0)^T f'(x_0)}{2 \, d \, V_d^{2/d} \, f(x_0)^{1 + 2/d}} \, m^{-2/d} + o(m^{-2/d}), \tag{14}$$

where $V_d = \pi^{d/2} / \Gamma(1 + d/2)$, $\Gamma(\cdot)$ denotes the Gamma function, $f'(x_0)$ and $g'(x_0)$ are the first order gradients at $x_0$ for $f(x)$ and $g(x)$, respectively, $g''(x_0)$ is the Hessian matrix of $g(\cdot)$ at $x_0$, and $\mathrm{tr}(\cdot)$ gives the trace.

This result comes from Biau and Devroye (2015), who derive it for the k-nearest neighbors estimator. Equation (14) gives the closed form of the asymptotically diminishing bias term. However, when the dimensionality $d$ is relatively large, the rate at which the bias term diminishes can be very slow. This is the curse of dimensionality. If we can effectively get rid of the first order bias term, then the curse of dimensionality can be largely mitigated.
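A quick numerical illustration of how slowly the leading bias factor in Equation (14) shrinks as the dimension grows is given below. Only the factor m^(−2/d) is evaluated; the constant in front of it depends on the unknown f and g and is ignored here, and the function name `bias_rate` and the chosen values of m and d are illustrative assumptions.

```python
def bias_rate(m, d):
    """Leading-order decay factor m^(-2/d) of the bias term in Equation (14)."""
    return m ** (-2.0 / d)

# Same subsample size, increasing dimension.
rates = {d: bias_rate(1000, d) for d in (2, 5, 10)}
```

With m = 1000 the factor is 0.001 when d = 2 but only about 0.25 when d = 10, which is why removing the first order term matters in higher dimensions.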

Generalized Jackknife

Consider two bagged nearest neighbors estimators with different subsampling scales m1 and m2

(m1 6= m2). Their asymptotic biases have the following forms

τn(m1)(x0) = g(x0) + cm−2/d1 + o(m

−2/d1 ),

τn(m2)(x0) = g(x0) + cm−2/d2 + o(m

−2/d2 ).

We then solve the following system of linear equations

w1 + w2 = 1, (15)

w1 m−2/d1 + w2 m

−2/d2 = 0, (16)

and obtain the solution $(w_1^*, w_2^*)$. Equation (15) ensures that the weighted combination still targets $g(x_0)$, and Equation (16) removes the first order bias. Then $w_1^*\tau_n(m_1)(x_0) + w_2^*\tau_n(m_2)(x_0)$ is free from the first order bias term and thus the curse of dimensionality can be mitigated. From simulations in Fan et al. (2018), this bias reduction can substantially improve estimation precision measured in mean squared error. In this paper, we choose $m$ and $2m$ to reduce the first order bias. For inference, Fan et al. (2018) have shown that the bagged nearest neighbors estimator is asymptotically normal under mild conditions.
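The two-equation system (15)–(16) has a simple closed-form solution. The sketch below is a minimal illustration in Python; the function names `two_scale_weights` and `combine` are hypothetical and not taken from the paper or from Fan et al. (2018).

```python
def two_scale_weights(m1, m2, d):
    """Solve w1 + w2 = 1 and w1*m1^(-2/d) + w2*m2^(-2/d) = 0 for (w1, w2)."""
    if m1 == m2:
        raise ValueError("m1 and m2 must differ")
    a1 = m1 ** (-2.0 / d)
    a2 = m2 ** (-2.0 / d)
    w1 = a2 / (a2 - a1)
    w2 = -a1 / (a2 - a1)
    return w1, w2

def combine(tau1, tau2, m1, m2, d):
    """Weighted combination of two estimates computed at scales m1 and m2."""
    w1, w2 = two_scale_weights(m1, m2, d)
    return w1 * tau1 + w2 * tau2

if __name__ == "__main__":
    # If each estimate were exactly g + c * m^(-2/d), the combination returns g.
    g, c, d, m = 1.5, 0.7, 3, 50
    tau1 = g + c * m ** (-2.0 / d)
    tau2 = g + c * (2 * m) ** (-2.0 / d)
    print(two_scale_weights(m, 2 * m, d), combine(tau1, tau2, m, 2 * m, d))
```

For the choice $m_1 = m$ and $m_2 = 2m$, the weights do not depend on $m$: $w_1^* = 1/(1 - 2^{2/d})$ and $w_2^* = 1/(1 - 2^{-2/d})$, which gives $(-1, 2)$ when $d = 2$. One weight is always negative, so the combination extrapolates rather than averages.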

Theorem 2. (Fan et al. (2018)) Given $x_0 \in \mathrm{supp}(v)$, under Assumptions 1–2, and assuming $m \to \infty$ and $m/n \to 0$, then for some positive $\sigma_n$ with $\sigma_n^2 = O(m/n)$,

$$\frac{\tau_n(m)(x_0) - g(x_0) - B(m)}{\sigma_n} \xrightarrow{D} N(0, 1). \tag{17}$$

The expression of $\sigma_n^2$ is given in Lemma 3 of Fan et al. (2018). However, since bagged nearest neighbors estimators with different scales are correlated, different jackknife choices in general require different variance formulas. In practice, it is desirable to conduct inference regardless of the jackknife procedure. Moreover, when the parameter of interest is a smooth and possibly nonlinear function of several bagged nearest neighbors estimates, it is no longer necessarily an L-statistic and may not even be asymptotically normal. In these cases, the use of variance formulas becomes inconvenient. There is also a practical concern: variance formulas often rely on intermediate plug-in estimates, which can weaken the validity of the inference. Our solution to all these concerns is the bootstrap (Efron and Stein, 1981). It is well known that the bootstrap commutes with smooth functions (Bickel and Freedman, 1981), so we can conveniently conduct inference in many general situations. We prove that the standard bootstrap procedure applies to the bagged nearest neighbors estimator. The result is an implication of the following theorem.

Theorem 3. Let $G_n$ be the empirical distribution of our sample $(z_1, z_2, \ldots, z_n)$. Given $(z_1, z_2, \ldots, z_n)$, let $(z_1^*, \ldots, z_n^*)$ be conditionally independent, with common distribution $G_n$. The bagged nearest neighbors estimator defined on this sample is then

$$\tau_n^*(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_m \le n} \Phi(x_0; z_{i_1}^*, z_{i_2}^*, \ldots, z_{i_m}^*).$$

Given $x_0 \in \mathrm{supp}(v)$, and $\sigma_n$ in Equation (17), under Assumptions 1–2, and further assuming $m \to \infty$, $m/n \to 0$, then for almost all sample sequences,

$$\frac{\tau_n^*(m)(x_0) - E^*\tau_n^*(m)(x_0)}{\sigma_n} \xrightarrow{D} N(0, 1), \tag{18}$$

where $E^*\tau_n^*(m)(x_0)$ is the mean from the bootstrap.

In other words, the theorem suggests the following procedure. We first generate many bootstrap samples of size $n$ by repeatedly resampling our dataset with replacement. From each bootstrap sample, we compute the bagged nearest neighbors estimator and obtain one estimate. The distribution of these estimates, centered at the bootstrap mean, approximates the distribution of the bagged nearest neighbors estimator around its own mean. For the case with the jackknife, we conduct the same jackknife procedure on each of the bootstrap samples and obtain the bagged nearest neighbors estimate after the jackknife.
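The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the kernel $\Phi$ returns the response of the single nearest neighbor to $x_0$ within the subsample, as in Fan et al. (2018), and it uses the fact that the $i$-th nearest point in the full sample is the nearest one in exactly $\binom{n-i}{m-1}$ of the $\binom{n}{m}$ subsamples, so no subsamples need to be enumerated. All function names are hypothetical, and the jackknife uses scales $m$ and $2m$.

```python
import math
import random

def bagged_nn(x0, xs, ys, m):
    """Bagged 1-NN estimate at x0 via its weighted-order-statistic form."""
    n = len(xs)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    # Indices sorted by distance to x0; ties keep index order (sorted is stable).
    order = sorted(range(n), key=lambda i: math.dist(xs[i], x0))
    total = math.comb(n, m)
    # Weight of the i-th nearest point (i = 1..n): C(n - i, m - 1) / C(n, m).
    return sum(math.comb(n - rank, m - 1) / total * ys[i]
               for rank, i in enumerate(order, start=1))

def two_scale(x0, xs, ys, m):
    """Jackknifed estimate that combines subsampling scales m and 2m."""
    d = len(x0)
    a1, a2 = m ** (-2.0 / d), (2 * m) ** (-2.0 / d)
    w1, w2 = a2 / (a2 - a1), -a1 / (a2 - a1)
    return w1 * bagged_nn(x0, xs, ys, m) + w2 * bagged_nn(x0, xs, ys, 2 * m)

def bootstrap_estimates(x0, xs, ys, m, n_boot=200, seed=0):
    """Recompute the jackknifed estimate on n_boot resamples of size n."""
    rng = random.Random(seed)
    n = len(xs)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        draws.append(two_scale(x0, [xs[i] for i in idx],
                               [ys[i] for i in idx], m))
    return draws

def standard_error(draws):
    """Sample standard deviation of the bootstrap estimates."""
    mean = sum(draws) / len(draws)
    return math.sqrt(sum((t - mean) ** 2 for t in draws) / (len(draws) - 1))
```

Covariates are passed as tuples, for example `xs = [(0.1,), (0.2,), ...]` with `x0 = (0.34,)`, and `2 * m` must not exceed the sample size. Duplicated observations in a bootstrap sample create distance ties, but tied duplicates carry the same response, so the tie-breaking rule does not change the estimate. For a smooth function of several estimates, the same function would be applied to each bootstrap draw before taking quantiles or the standard deviation.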

B.1 Proof of Theorem 1 and Theorem 2

The proof of Theorem 1 can be found in Biau and Devroye (2015). Fan, Lv, and Wang (2018) provide another proof of Theorem 1 that introduces spherical coordinates, which can simplify the derivation of the higher-order bias terms. The proof of Theorem 2 can be found in Fan, Lv, and Wang (2018).

B.2 Proof of Theorem 3

We will prove Theorem 3 from the U-statistics perspective instead of the L-statistics perspective. In the statistics literature, Bickel and Freedman (1981) and Arcones and Giné (1992) have derived the properties of bootstrapped U-statistics under different settings. However, their results do not yet cover the case when the subsampling scale $m$ is allowed to diverge with $n$. In addition, our proof becomes neat with the introduction of the Hoeffding decomposition (Hoeffding, 1948) and the Mallows distance (Bickel and Freedman, 1981).

We first review some results that we will need. They can also be seen as a sketch of the proof of Theorem 2. From Lemma 3 and Theorem 2 in Fan et al. (2018), we have

• The bagged nearest neighbors estimator can be decomposed as

$$\tau_n(m) - E\,\tau_n(m) = \frac{m}{n}\sum_{i=1}^{n} g(z_i) + \Delta_n(m),$$

where $g(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i) - E\,\Phi(z_1, z_2, \ldots, z_n)$ is the canonical Hájek projection of the kernel $\Phi$ onto $z_i$. The expectation $E$ is with respect to $G$, the distribution of $z$.

• For some finite positive variance $\sigma^2$,

$$\sigma_n^2 = \mathrm{var}\Big[\frac{m}{n}\sum_{i=1}^{n} g(z_i)\Big] = \frac{m^2}{n(2m-1)}\,\sigma^2.$$

• When $m/n \to 0$ and $n \to \infty$,

$$\Big(\frac{\Delta_n(m)}{\sigma_n}\Big)^2 \to 0. \tag{19}$$

• By the Lindeberg–Lévy Central Limit Theorem, we have

$$\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i)}{\sigma_n} \xrightarrow{D} N(0, 1).$$

Let $G_n$ be the empirical distribution of $(z_1, z_2, \ldots, z_n)$. Given $(z_1, z_2, \ldots, z_n)$, let $(z_1^*, \ldots, z_n^*)$ be conditionally independent, with common distribution $G_n$. The bagged nearest neighbors estimator defined on $(z_1^*, \ldots, z_n^*)$ is then

$$\tau_n^*(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_m \le n} \Phi(x_0; z_{i_1}^*, z_{i_2}^*, \ldots, z_{i_m}^*).$$

Similarly, we have

• For the new distribution $G_n$,

$$\tau_n^*(m) - E_n\,\tau_n^*(m) = \frac{m}{n}\sum_{i=1}^{n} g(z_i^*) + \Delta_n^*(m),$$

where the expectation $E_n$ is with respect to $G_n$.

• When $m/n \to 0$ and $n \to \infty$, for $\sigma_n^2$ defined above,

$$\Big(\frac{\Delta_n^*(m)}{\sigma_n}\Big)^2 \to 0.$$

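With a slight abuse of notation, let $\Rightarrow$ denote weak convergence and $\mathcal{L}(\cdot)$ the law of a random variable. It can be shown that

$$\mathcal{L}\Big(\frac{\tau_n^*(m) - E_n\,\tau_n^*(m)}{\sigma_n}\Big) \Rightarrow \mathcal{L}\Big(\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i^*, G_n)}{\sigma_n}\Big),$$

which comes from the convergence of the remainder term.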

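To establish Theorem 3, we still need to prove

$$\mathcal{L}\Big(\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i^*)}{\sigma_n}\Big) \Rightarrow \mathcal{L}\Big(\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i)}{\sigma_n}\Big).$$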

We will use the Mallows distance introduced in Bickel and Freedman (1981). Before we proceed, we list the properties of the Mallows distance that we will use. Let $M_p$ be the Mallows distance with $p \in [1, \infty)$, and assume all distributions have finite $p$-th moments.

• If $F$ and $G$ are distributions on the real line, then

$$M_p(F, G) = \Big\{\int_0^1 \big|F^{-1}(t) - G^{-1}(t)\big|^p\,dt\Big\}^{1/p}.$$

• If $X_1, X_2, \ldots, X_n$ are independent observations from a distribution $F$, and $F_n$ is their empirical distribution, then almost everywhere,

$$M_p(F_n, F) \to 0.$$

• For any scalar $a$,

$$M_p(aU, aV) = |a| \cdot M_p(U, V).$$

• If the $U_i$ are independent, likewise for the $V_i$, and $E\,U_i = E\,V_i$, then

$$M_2^2\Big(\sum_{i=1}^{n} U_i, \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n} M_2^2(U_i, V_i).$$
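For intuition, the quantile representation in the first property can be evaluated exactly when both distributions are empirical distributions of samples of the same size: the two quantile functions are step functions on the same grid, so the integral reduces to an average over matched order statistics. The following Python sketch illustrates this special case only; the function name `mallows_distance` is hypothetical.

```python
def mallows_distance(u, v, p=2):
    """M_p between the empirical distributions of two equal-size real samples."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    us, vs = sorted(u), sorted(v)
    n = len(us)
    # The i-th order statistics are the quantiles on ((i-1)/n, i/n], so the
    # integral over (0, 1) becomes an average of |u_(i) - v_(i)|^p.
    return (sum(abs(a - b) ** p for a, b in zip(us, vs)) / n) ** (1.0 / p)

if __name__ == "__main__":
    u = [0.3, -1.2, 2.5, 0.9]
    v = [1.1, 0.4, -0.7, 3.0]
    print(mallows_distance(u, v), mallows_distance([3 * a for a in u],
                                                   [3 * b for b in v]))
```

The second printed value is three times the first, in line with the scaling property $M_p(aU, aV) = |a| \cdot M_p(U, V)$.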

Now we are ready. Let $Z(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i)$; then we have

$$g(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i) - E\,\Phi(z_1, z_2, \ldots, z_n) = Z(z_i) - E\,Z(z_i).$$

First, since $G_n$ is the empirical distribution of a sample from $G$,

$$M_2(z^*, z) \to 0.$$

Since $Z(\cdot)$ is continuous and bounded, we further have

$$M_2\big(\sqrt{m}\,g(z_i), \sqrt{m}\,g(z_i^*)\big) \to 0.$$

By the convolution property of the Mallows distance,

$$M_2\Big(\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i^*)}{\sigma_n}, \frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i)}{\sigma_n}\Big) \le \frac{m}{n\,\sigma_n}\,\sqrt{n}\,M_2\big(g(z_i), g(z_i^*)\big),$$

where $\sqrt{n}\,M_2\big(g(z_i), g(z_i^*)\big) = O(1)$ since $Z$ and $G$ are both continuous and bounded.

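When $m/n \to 0$, the expression for $\sigma_n^2$ above gives $m/(n\,\sigma_n) = \sigma^{-1}\sqrt{(2m-1)/n} \to 0$, so

$$M_2\Big(\frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i^*)}{\sigma_n}, \frac{\frac{m}{n}\sum_{i=1}^{n} g(z_i)}{\sigma_n}\Big) \to 0,$$

which completes our proof of Theorem 3.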
