Non-parametric estimation of price elasticities:
A heterogeneous treatment effect approach
Jingbo Wang and Yufeng Huang ∗
November 28, 2019
Abstract
We revisit the classical problem of price elasticity estimation from a causal perspec-
tive. When the price is perceived as a continuous but endogenous treatment, the flexible
estimation of price elasticity can be turned into the estimation of heterogeneous treatment
effects. To this end, we develop a control function approach to deal with treatment endo-
geneity when estimating heterogeneous treatment effects. This strategy works by breaking
the estimation of price elasticity into several intermediate problems of point-wise expecta-
tion estimation, where modern machine learning methods, such as deep neural networks
and random forests, can be used for prediction. In addition, we prove that if we use the
bagged nearest neighbors for point-wise prediction, the standard bootstrap procedure can
be directly employed to derive inference for the price elasticity estimates. Finally, we apply
our method to the IRI academic dataset on two national brands of yogurt. We find that
a competitor’s yogurt of a similar size is a closer substitute than the own brand’s
yogurt of a different size.
Key words : heterogeneous treatment effects; causal inference; price elasticity; demand anal-
ysis; nonparametric IV; bagged nearest neighbors; control function; the bootstrap
∗Jingbo Wang is a Ph.D. candidate in the department of economics at the University of Southern California;
Email: [email protected]. Yufeng Huang is an assistant professor in the department of marketing at the Simon
Business School, University of Rochester; Email: [email protected]. We thank participants in
seminars at USC and the University of Rochester for their valuable feedback. In particular, Jingbo Wang
thanks Professor Cheng Hsiao for his generous and continuous support.
1 Introduction
Price elasticity, the percentage change in sales due to a percent change in price, is of paramount
importance in economics and marketing. As a primary measure of market structure, the precise
estimation of price elasticity is central in many situations, such as when economists conduct welfare
analysis, evaluate the price effects of mergers, and estimate the pass-through of taxes
or tariffs. In business economics and marketing, price elasticities can provide rich information
about the behaviors of current and potential customers. This information can assist firms in
making pivotal market competition decisions, such as pricing and targeting.
Despite its general usefulness, the estimation of price elasticity faces two main challenges.
The first challenge is price endogeneity. Prices in observational data are often determined
simultaneously by market demand and supply, which inevitably raises the concern
of price endogeneity. Price endogeneity, if not properly addressed, can render
elasticity estimates inconsistent and sometimes gives confusing results, such as an upward-sloping
demand curve. The second challenge is the flexible estimation of price elasticity. The
responsiveness to prices is heterogeneous by nature. Price elasticity estimates ideally should ad-
just differently to different price levels, allow rich substitution patterns, and capture meaningful
behavioral changes along demographics.
Current popular approaches mostly use a parametric framework to work on these difficulties.
On the one hand, a parametric setup of the indirect utility enables one to back out the unobserved
omitted product characteristics. This advantage clears the way to use the generalized
method of moments to deal with price endogeneity. On the other hand, a parametric setup of
the distribution of random coefficients can help overcome the independence of irrelevant alter-
natives. This ingenious feature offers a route to allow flexible substitution patterns. However,
a parametric setup can also be a double-edged sword. First, the consumer choice process can
be complicated and oftentimes involve choices in multiple goods, of multiple quantities, under
limited consideration, and with search friction. Parameterizing the choice process in these cases
has the risk of misspecification. Second, a parametric model also relies on assumptions about
preference heterogeneity. It often becomes pivotal to correctly specify these taste distributions.
The estimation result can be sensitive to the specification and sometimes may experience
a fundamental change under a different specification. In these situations, it is strongly desirable
that we also have a model that does not rest on parametric assumptions.
With this motivation, we propose a nonparametric method to estimate price elasticities with
aggregate market-level data in differentiated products markets. In particular, we investigate the
problem of price elasticity estimation from a causal perspective. When market price is perceived
as a continuous but endogenous market policy, the estimation of price elasticity becomes a policy
evaluation problem. If we further require the price elasticity to be flexible, the problem thereby
turns into the estimation of treatment effects conditioning on different values of control variables.
In this way, we can borrow recent developments in the heterogeneous treatment effect literature
to estimate price elasticities in a fully nonparametric fashion.
To be specific, we will adapt classical nonparametric IV models (Blundell and Powell, 2003;
Newey and Powell, 2003; Hall and Horowitz, 2005) and deliberately shift our interest from struc-
tural functions to point-wise estimations. In particular, this paper will make an adaptation of
the triangular control function approach in Newey et al. (1999) as our starting point, since it
can provide a straightforward economic interpretation. We will first specify a triangular simulta-
neous equation system, where the first equation specifies a nonparametric relationship between
demand and prices and the second the relationship between prices and exogenous instrumen-
tal variables. It can be shown that, under commonly made control function assumptions, the
point-wise price slope can be identified as a combination of several point-wise conditional ex-
pectations. As a result, we can then estimate point-wise price elasticities from the estimations
of several intermediate point-wise conditional expectations. This estimation route, to the best
of our knowledge, has not been explored in empirical studies. This paper is also the
first to use this result to estimate price elasticity flexibly.
The difficulty in the implementation of this strategy mainly lies in the estimation of point-
wise conditional expectations. In this paper, we propose the use of modern machine learning
methods to deal with this problem. The reason why machine learning methods are preferred is
that classical nonparametric methods, such as kernel and nearest neighbors estimators, suffer from
the curse of dimensionality. When the dimension of the conditioning covariates grows large, classical
nonparametric estimation can become unstable in practice. By contrast, modern
machine learning methods are often equipped with data-driven algorithms and have been
shown to work well with a relatively large number of conditioning variables. In this way, we
are able to accommodate many covariates in a fully nonparametric fashion and have more
leverage to pursue heterogeneity along all these dimensions.
However, the disadvantage of using popular machine learning methods lies in statistical inference.
While machine learning methods have been widely used in business practice, less attention has been
paid to how one can conduct valid statistical inference with them. In this paper, we
prove that, if we use the bootstrap averaged (bagged) nearest neighbors estimator (Biau et al.,
2010; Fan et al., 2018) for point-wise prediction, the standard bootstrap procedure can be di-
rectly used for its inference. Since the bootstrap commutes with smooth functions, we are then
able to conveniently use the bootstrap to directly derive inference for price elasticity estimates.
However, it is worth noting that when statistical inference is not a concern, other popular ma-
chine learning methods, such as deep neural networks and random forests, can also be employed
for conditional predictions within our framework.
We first conduct Monte Carlo simulations to test the validity of our approach.
In this paper, we demonstrate two settings. The first setting follows a conventional reduced-form
setup. It is shown that our approach can recover flexible shapes of price slopes well despite
the contamination of price endogeneity. We also derive confidence intervals using the bootstrap
along with our estimates. In our simulations, these confidence intervals cover the
true values well. The second Monte Carlo setting uses the standard BLP model (Berry et al., 1995)
as the true model. It is shown that our approach can be compatible with the BLP model and
approximately applies to market level data aggregated from BLP-type individual choices.
With this confidence, our method is then applied to the yogurt industry to estimate the
price elasticities of two leading national brands, Yoplait and Dannon, using the IRI academic
dataset. Our main focus is on the package size of yogurt. We estimate the own and cross price
elasticities for small-sized and large-sized yogurt for both Yoplait and Dannon. We find that
the competing brand’s yogurt in a similar package size is a closer substitute than the own brand’s
yogurt in a different size. This result is obtained without a priori structural assumptions on
consumer preferences. We also trace the price elasticity changes across own price levels.
Our paper contributes to the literature on demand estimation. In particular, we revisit the
problem of price elasticity estimation from a causal perspective and in a nonparametric fashion.
Compared to the models based on multinomial choices, our framework imposes fewer parametric
assumptions and can thus be more flexible in various situations, for example, when there are
concerns about multiple purchase quantities and endogenous consideration sets. Compared to the
existing nonparametric IV approaches (Blundell et al., 2012, 2016), our interest is mostly on the
point-wise heterogeneity and we incorporate modern machine learning algorithms to empower
classical nonparametric strategies, instead of imposing additional economic and econometric
constraints to regularize classical estimates.
Our paper also contributes to the emerging literature on heterogeneous treatment effects
(Athey and Imbens, 2016; Wager and Athey, 2018). In particular, we adapt the control function
approach in Newey et al. (1999) and shift its focus to point-wise estimations. This adaptation
offers a simple and intuitive pathway to deal with treatment endogeneity in the context of
heterogeneous treatment effects. We enrich the theoretical results in Fan et al. (2018) and prove
that the standard bootstrap procedure can be directly adopted for statistical inference, if the
bagged nearest neighbors estimator is used for intermediate predictions.
Our paper makes an empirical contribution as well. We find that price elasticities exhibit
a very different pattern for yogurt in small and large package sizes. However, it is common
in empirical studies that yogurt of different sizes for the same brand are first pooled together
for further analysis. Our result raises a concern for this conventional practice since yogurt of
different sizes can be intrinsically very different products. We also argue that the heterogeneity
patterns we have found can serve as a primitive for sophisticated structural modeling.
The rest of the paper is organized as follows. Section 2 gives a brief review of related
literature. We formally introduce our theoretical framework in Section 3. Section 4 gives more
details on our estimation and inference strategy. We further justify our method with a Monte
Carlo simulation in Section 5. Section 6 applies our method to estimate price elasticities for
yogurt with the IRI academic dataset. Section 7 is devoted to a discussion.
2 Related literature
Demand analysis is one of the oldest problems in economics. Working (1927), Stone (1954), and
Deaton and Muellbauer (1980) studied this problem repeatedly with new insights. More
recently, the BLP model (Berry et al., 1995) has become the leading workhorse model in structural
demand analysis. The BLP model makes important changes to the
multinomial choice model (McFadden, 1973) and provides a seminal framework to deal with endogeneity
and flexibility (Nevo, 2001). Meanwhile, the reduced-form approaches are also quickly
evolving. This line of research works on relaxing the linear demand system in traditional analy-
sis (Hausman and Newey, 1995; Banks et al., 1997; Hausman and Newey, 2015) and improving
nonparametric estimates with additional theoretical or empirical constraints (Haag et al., 2009;
Blundell et al., 2012; Dette et al., 2016; Blundell et al., 2016). Interestingly, the seemingly unrelated
structural and reduced-form perspectives can actually be reconciled under
a general nonparametric setting. Berry and Haile (2014) have shown that with connected
substitutes (Berry et al., 2013), an index restriction is enough to transform the discrete
choice demand model into a nonparametric IV problem. Compiani (2019) further shows that
this nonparametric approach is able to give better shapes of demand curves when consumers ex-
perience inattention or loss aversion. Our paper contributes to the demand literature by offering
another route to flexibly estimate price elasticity, where machine learning methods can be used
to empower nonparametric regressions.
Our paper is related to the literature of heterogeneous treatment effects. The concept of treat-
ment effect heterogeneity is deeply rooted in economics and possesses great potential in empiri-
cal studies (Heckman et al., 1997; Athey and Imbens, 2016; Wager and Athey, 2018; Fan et al.,
2018). While previous studies mostly focus on exogenous treatments, our framework in this pa-
per offers a simple pathway to deal with treatment endogeneity in the context of heterogeneous
treatment effects. The simplicity of our method comes from adapting classical nonparametric IV
models and focusing on point-wise estimation. Compared to the random forests local generalized
method of moments approach (Athey et al., 2019), our framework is fully nonparametric, open
to the use of various machine learning methods, and enjoys an easy economic interpretation.
We use the control function approach in our paper. As far as we know, the control
function approach was introduced to economics as a generalization of the linear instrumental variable
approach (Heckman and Robb, 1985). For these early insights, please see, for example,
Smith and Blundell (1986); Rivers and Vuong (1988). For more recent developments on non-
parametric IV models, see, for example, Chesher (2003); Matzkin (2003); Blundell and Powell
(2003); Newey and Powell (2003); Das et al. (2003); Hall and Horowitz (2005); Blundell et al.
(2007); Matzkin (2008); Imbens and Newey (2009); Blundell et al. (2013); Chen et al. (2014);
Matzkin (2015); Hahn and Ridder (2017, 2018). Different from most of the literature, our empir-
ical strategy follows a point-wise approach. In this way, we can uncover observed heterogeneity
and many modern machine learning methods can then be incorporated into the conventional
control function framework for the first time. It is also worth mentioning that Chen and Pouzo
(2015) obtain point-wise bootstrap confidence bands for linear and nonlinear functionals
of nonparametric IV and nonparametric quantile IV estimators, and Chen and Christensen
(2018) provide bootstrap uniform confidence bands for linear and nonlinear functionals of the sieve
nonparametric IV estimator.
Our paper makes use of the bootstrap averaged (bagged) nearest neighbors method. The
bagged nearest neighbors regression estimator, as far as we know, first appears in Biau et al. (2010) as a special case
of averaging nearest neighbor estimators over subsamples drawn without replacement.
More recently, Fan et al. (2018) derive the same estimator independently
from a panel perspective, prove its point-wise asymptotic normality, and propose to use the
generalized jackknife to analytically remove its higher order bias. The generalized jackknife
procedure, when working together with the bagging algorithm (Breiman, 1996), turns out to
be very powerful and can effectively mitigate the curse of dimensionality. For the purpose of
this paper, however, we would like to perform the generalized jackknife in a flexible fashion and combine
intermediate estimates to form the final parameters of interest, so the variance formula
becomes less useful for inference. In this paper, we complement existing results and prove that the
bootstrap can be directly used for the bagged nearest neighbors estimator. This bootstrap result
overcomes the obstacles to conducting inference when using the bagged nearest neighbors.
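The bagged nearest neighbors idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it averages 1-nearest-neighbor predictions over subsamples drawn without replacement, and it omits the generalized jackknife bias correction entirely. The function name `bagged_nn_predict`, the subsample size, the number of draws, and the simulated data are all hypothetical choices made for this example, not the authors' implementation.

```python
import numpy as np

def bagged_nn_predict(x_train, y_train, x0, subsample_size, n_draws=200, seed=0):
    """Average 1-NN predictions at x0 over subsamples drawn without replacement."""
    rng = np.random.default_rng(seed)
    y_train = np.asarray(y_train, dtype=float)
    x_train = np.asarray(x_train, dtype=float).reshape(len(y_train), -1)
    x0 = np.asarray(x0, dtype=float).reshape(-1)
    # Euclidean distance from every training point to the evaluation point.
    dist = np.linalg.norm(x_train - x0, axis=1)
    preds = np.empty(n_draws)
    for b in range(n_draws):
        # Draw a subsample without replacement and keep its nearest neighbor.
        idx = rng.choice(len(y_train), size=subsample_size, replace=False)
        nearest = idx[np.argmin(dist[idx])]
        preds[b] = y_train[nearest]
    return preds.mean()

# Hypothetical data: a smooth regression function plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, 2000)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 2000)
estimate = bagged_nn_predict(x, y, x0=0.25, subsample_size=50)
print(f"bagged 1-NN estimate at x0 = 0.25: {estimate:.3f}")
```

The subsample size plays the role of a smoothing parameter in this sketch: smaller subsamples spread weight over more neighbors, while larger ones concentrate weight on the closest observations.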
Machine learning methods have been attracting increasing interest in the business world as well
as in economic research (Mullainathan and Spiess, 2017). One way to accommodate machine
learning methods into economic research is to use the penalized regressions, such as LASSO,
for variable selection, and then apply conventional econometric methods to this reduced set of
variables. This convenient strategy can build on the assumption of ignorable approximation
errors (Belloni et al., 2014a,b), or more recently the orthogonality features in many economic
problems (Chernozhukov et al., 2017). Our paper demonstrates another possibility to introduce
machine learning methods into economics. Our philosophy is to transform well-known economic
and econometric models into a combination of intermediate conditional expectation problems
and then utilize modern machine learning methods to work on these conditional expectations.
We treat machine learning as an extension of traditional nonparametric methods.
Our paper connects to the literature of empirical industrial organization. While the current
canonical demand estimation approach builds on the discrete choice framework (McFadden,
1973) and the BLP model (Berry et al., 1995; Nevo, 2001; Petrin, 2002), empirical studies for
many industries have also offered various alternatives that can relax different restrictions of the
canonical model. These extensions include, but are not limited to, cases when consumers pur-
chase multiple goods or multiple units (Hendel, 1999; Dube et al., 2018; Kim et al., 2002), make
decisions under limited consideration (Goeree, 2008) or search frictions (De Los Santos et al.,
2012), choose among geographically-differentiated options (Houde, 2012), and purchase comple-
mentary products (Gentzkow, 2007). Our approach complements the above literature and can
be useful for a wide range of empirical challenges like these.
Our empirical study builds on previous findings in the yogurt industry. The landscape of the
yogurt industry can be simplified as several major producers, a few grocery chains, local stores,
and a vast number of individual consumers. A typical consumer usually purchases multiple units
of yogurt in multiple flavors from different product lines of a single brand (Kim et al., 2002).
The conventional pricing practice for grocery stores is that different product lines are priced
differently, but prices are uniform across flavors (Draganska and Jain, 2006; Draganska et al.,
2009). The vertical relationship between grocery chains and yogurt producers is known to be
important and can affect the offerings of yogurt in local grocery stores (Villas-Boas, 2007). It
is also found that consumers can exhibit brand inertia (Pavlidis and Ellickson, 2018) and have
fixed purchasing cost (Huang and Bronnenberg, 2018). It is interesting that the package size of
yogurt, which is a classic price discrimination device, has received little attention. In this paper,
we offer a novel analysis on the difference between yogurts in different package sizes.
3 Theoretical framework
In this section, we will introduce our theoretical framework to estimate price elasticity. We
will first demonstrate our problem in a nonparametric model, explain how this problem can be
approached from a heterogeneous treatment effect perspective, list a set of assumptions we need,
show how price slopes can be identified, and finally provide intuition for the identification using
the directed acyclic graphs of Pearl (2009).
3.1 Model setup
We assume that product j at market t has sales $s_{jt}$ and price $p_{jt}$, for $j = 1, 2, \cdots, J$ and
$t = 1, 2, \cdots, T$. Let $p_t = (p_{1t}, p_{2t}, \cdots, p_{Jt})^{\top}$ denote the price vector for products
$\{1, 2, \cdots, J\}$ at market t. Here $(\cdot)^{\top}$ is the transpose. Similarly, let $x_{jt}$ denote the vector of observed product
characteristics for product j at market t and $x_t = (x_{1t}^{\top}, x_{2t}^{\top}, \cdots, x_{Jt}^{\top})^{\top}$ the stacked vector of
observed product characteristics at market t. We assume that for j = 1, 2, · · · , J , and t =
1, 2, · · · , T , the structural relationship between market sales, market prices, and the observed
product characteristics is that
$$s_{jt} = f_j(p_t, x_t) + \epsilon_{jt}, \tag{1}$$
where $\epsilon_{jt}$ is the aggregate unobserved shock to the demand of product j at market t, and it
is assumed to have mean zero. The pivotal problem here is that the disturbance $\epsilon_{jt}$ can be
potentially correlated with pjt, which raises the concern of price endogeneity. However, the
source of price endogeneity can be very general. The endogeneity may come from the omis-
sion of a confounding variable unobserved by econometricians, such as the unobserved product
characteristics, measurement error on prices, or sample selection issues.
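To make the endogeneity concern in Equation (1) concrete, the following minimal Python sketch simulates a single-product market in which the demand shock is correlated with price. Everything in it is an assumption made for illustration: the linear demand function, the true price slope of -2, the cost shifter `z`, and the function names `simulate_market_data` and `ols_price_slope` do not come from the paper.

```python
import numpy as np

def simulate_market_data(n_markets=5000, seed=0):
    """Draw (s, p, x, z) from a toy single-product version of Equation (1)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_markets)   # observed product characteristic
    z = rng.uniform(1.0, 2.0, n_markets)   # hypothetical cost shifter
    u = rng.normal(0.0, 0.3, n_markets)    # price shock
    # The demand shock shares a component with the price shock (endogeneity).
    eps = 0.8 * u + rng.normal(0.0, 0.1, n_markets)
    p = z + u                              # observed price
    s = 10.0 - 2.0 * p + x + eps           # demand; the true price slope is -2
    return s, p, x, z

def ols_price_slope(s, p, x):
    """Least-squares coefficient on price, ignoring endogeneity."""
    design = np.column_stack([np.ones_like(p), p, x])
    coef, *_ = np.linalg.lstsq(design, s, rcond=None)
    return coef[1]

s, p, x, z = simulate_market_data()
naive_slope = ols_price_slope(s, p, x)
print(f"true slope: -2.000, naive least-squares slope: {naive_slope:.3f}")
```

Because the demand shock moves with price in this design, the naive regression attributes part of the shock to price and the estimated slope is pulled toward zero.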
The demand for product j at market t depends on the prices and observed product char-
acteristics of all products at market t in a nonlinear fashion. In our assumption, this demand
function for product j, fj(·), is structural and unchanged across markets. The variation of the
demand of product j across different markets comes from the variation of the prices of all prod-
ucts, the product characteristics of all products, and the realized unobserved shocks in different
markets. Our setup can further allow for more market specific effects if we introduce one more
product with market features as its product characteristics. The model can also incorporate
infinite dimensions of unobserved heterogeneity across geographic locations and time if we allow
the demand function fj(·) to be location specific and time specific, that is, we can split the data
into different subsamples and we estimate a demand function for each subsample. However, the
flexibility comes at the price that there will be substantially fewer observations for each estimation.
We will not make these extensions in this paper since it is not our main focus.
3.2 Heterogeneous treatment effect
We can look at the estimation of price elasticity from the potential outcomes perspective. When
price is perceived as a continuous market policy, its corresponding demand is then the observed
outcome in parallel universes. When our goal is the causal effect of market price on demand,
the problem becomes the evaluation of a market price policy. As a result, the estimation of
price slope is then the estimation of the treatment effect of price on demand. By contrast, the
structural approach directly models the preferences of individuals and then uses the stability of
these preferences to derive the price slope. We argue that the recovery of human nature is actually a
more fundamental and complex issue. If our goal is solely on price slopes, the causal perspective
is possibly more straightforward.
Price elasticity, which measures normalized price sensitivity, should be heterogeneous by
nature. Price sensitivity should change across price levels, vary across income
groups, and evolve with age. Flexibility is therefore one of the central concerns when estimating
price elasticity. In this context, we are eager to learn more about the heterogeneity of price
treatment effects on demand. The heterogeneous treatment effect provides us one solution
to allow flexibility and capture heterogeneity. The heterogeneous treatment effect, proposed
in Athey and Imbens (2016); Wager and Athey (2018), is defined as the conditional treatment
effect on a fixed value of all control variables, which is,
$$E[Y(1) - Y(0) \mid X = x],$$
where Y (1) is the potential outcome when treated and Y (0) untreated. The difference between
heterogeneous treatment effect and the conditional average treatment effect is that here the
control variables X can be potentially of high dimension. In our context, when we consider the
price as a treatment, the partial derivative $\partial_{p} f_j(p_t, x_t)$ is just the continuous version of the heterogeneous
treatment effect conditional on $(p_t, x_t)$. This point-wise strategy can help us recover
the heterogeneity of treatment effects because we can deliberately make repeated estimations
at different points and observe the change in estimates. For some problems, knowledge of
price slopes suffices. When this is not the case, we need to further normalize the
obtained price slope to form the price elasticity. The normalization is also carried out point-wise.
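A minimal sketch of this point-wise normalization follows, assuming the usual definition of elasticity as slope times price divided by quantity. The linear demand curve, the finite-difference step, and the function names are hypothetical and serve only to show that a constant slope can still imply an elasticity that varies across price levels.

```python
def pointwise_slope(demand, price, step=1e-5):
    """Central finite-difference approximation of the price slope at one point."""
    return (demand(price + step) - demand(price - step)) / (2.0 * step)

def pointwise_elasticity(demand, price, step=1e-5):
    """Normalize the point-wise slope by price over quantity at the same point."""
    return pointwise_slope(demand, price, step) * price / demand(price)

def linear_demand(price):
    """Hypothetical demand curve with a constant slope of -2."""
    return 10.0 - 2.0 * price

# The slope is -2 everywhere, yet the elasticity changes with the price level.
for p0 in (1.0, 2.0, 4.0):
    print(p0, round(pointwise_elasticity(linear_demand, p0), 4))
```

Repeating the calculation at several evaluation points mirrors the point-wise strategy: the heterogeneity is read off from how the estimates change across points.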
The remaining challenge is the price endogeneity. When there is a confounder in the disturbance $\epsilon$ that
simultaneously affects price and demand, we will not be able to estimate the
heterogeneous treatment effect $\partial_{p} f_j(p_t, x_t)$. The intuition is that we cannot distinguish whether
the demand change comes from the price change or the change in the confounder, which is beyond
our control and moves simultaneously with price. To overcome this empirical hurdle, this paper
proposes a control function approach to estimate heterogeneous treatment effect in the presence
of treatment endogeneity. We will show that the problem can be conveniently transformed by
adapting classical nonparametric IV models. In this paper, we use the triangular simultaneous
equation system in Newey, Powell, and Vella (1999) as our starting point.
3.3 Control function approach
As is common in situations with endogeneity, we need some additional machinery to work on
the problem. It is further assumed that $p_t$ is related to a vector of instrumental variables $z_t$,
$$p_t = g(z_t) + u_t, \tag{2}$$
where $g(\cdot)$ denotes a nonparametric relationship between price and the instrumental variables.
In a reduced-form interpretation, $g(z_t)$ can be the conditional expectation of $p_t$ given $z_t$. In this
way, it can be assumed without loss of generality that
$$E(u_t \mid z_t) = 0. \tag{3}$$
However, if we try to understand Equation (2) from a structural setup, Equation (3) implies that
the endogeneity issue is confined in Equation (1) and will not continue in Equation (2) for the
exogenous instrumental variables zt. In both situations, the disturbance ut can be interpreted
as the aggregation of all factors that affect price pt beyond and orthogonal to instrumental
variables zt. The above model setup follows the triangular control function
approach in Newey et al. (1999). If f(·) and g(·) are replaced with linear functions, we can
see its immediate origin in the classical two-step model in, for example, Heckman and Robb
(1985). It is worth noting that here we allow for possible overlap between the instrumental
variables zt and the observed product characteristics xt. The famous BLP instrument in the
demand literature is actually the observed characteristics of other products. To make this
point explicit, we will introduce a notation to decompose $z_t$ into $(z_t^{(1)}, z_t^{(2)})$, where $z_t^{(1)}$ are the
instrumental variables that overlap with $x_t$, and $z_t^{(2)}$ the instrumental variables excluded from the
observed product characteristics $x_t$. The dimension of the excluded $z_t^{(2)}$ is $d_z$.
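As a rough illustration of the linear special case mentioned above, the two-step control function logic can be sketched in Python. This is a sketch under strong assumptions: both f and g are taken to be linear, there is a single product and a single excluded instrument, and all coefficient values and the helper name `least_squares` are hypothetical.

```python
import numpy as np

def least_squares(columns, y):
    """Least-squares coefficients with an intercept; columns is a list of 1-D arrays."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical linear triangular system: p = g(z) + u and s = f(p) + eps.
rng = np.random.default_rng(1)
n = 5000
z = rng.uniform(1.0, 2.0, n)                    # excluded instrument
u = rng.normal(0.0, 0.3, n)                     # first-stage disturbance
eps = 0.8 * u + rng.normal(0.0, 0.1, n)         # demand shock correlated with u
p = 0.5 + 1.0 * z + u                           # analogue of Equation (2)
s = 10.0 - 2.0 * p + eps                        # analogue of Equation (1), slope -2

# Step 1: regress price on the instrument and keep the residual as the control.
first = least_squares([z], p)
u_hat = p - (first[0] + first[1] * z)

# Step 2: regress demand on price and the estimated control function.
cf_slope = least_squares([p, u_hat], s)[1]
naive_slope = least_squares([p], s)[1]
print(f"naive slope: {naive_slope:.3f}, control function slope: {cf_slope:.3f}")
```

In this linear design the second-step coefficient on price coincides with the two-stage least squares estimate, which is one way to see the link to the classical two-step model.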
Before we move on, we spend some time comparing the above
setup with the multinomial choice model. The multinomial choice model often features
an indirect utility that is linear in product characteristics and a Type I extreme value error. In this case, the
aggregated market share has a logistic form. To our knowledge, this logistic demand function
cannot be decomposed into an additive function of the unobserved product characteristics. As
a result, the market demand from the random coefficients multinomial choice model, which
integrates the logistic market demand over the realizations of random coefficients, is also not in
general additively separable in unobserved product characteristics. In other words, the triangular
nonparametric model in this paper does not nest the standard random coefficients multinomial
choice model as a special case, and the reverse is also true. However, we argue in the appendix
that if we follow a point-wise strategy and take a point-wise Taylor expansion of the logistic
demand function, it can be approximately true that the demand function from the standard
multinomial choice model is additively separable in unobserved product characteristics.
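For reference, the textbook logit market share can be written as below. The notation is an assumption introduced for this illustration and is not taken from the paper: $\beta$ and $\alpha$ are taste parameters, $\xi_{jt}$ is the unobserved product characteristic, and the outside good's utility is normalized to zero.

```latex
s_{jt} \;=\; \frac{\exp\left(x_{jt}^{\top}\beta - \alpha p_{jt} + \xi_{jt}\right)}
                 {1 + \sum_{k=1}^{J} \exp\left(x_{kt}^{\top}\beta - \alpha p_{kt} + \xi_{kt}\right)}
```

Since each $\xi_{kt}$ enters inside the exponentials of both the numerator and the denominator, the share cannot be written exactly as a function of prices and characteristics plus a separate term in the unobserved characteristics, which is the sense in which additive separability fails.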
3.4 Assumptions
Now we are ready to formally list the assumptions we will need for estimation and inference.
The most important assumption that we are going to make is the exclusion assumption.
Assumption 1. For j = 1, 2, · · · , J , it holds that
$$E(\epsilon_{jt} \mid x_t, z_t, u_t) = E(\epsilon_{jt} \mid u_t).$$
Assumption 1 states that if the orthogonal shock $u_t$ is known, the error resulting from the
unobserved disturbance $\epsilon_{jt}$ only comes from the $u_t$ side. This assumption is pivotal in the
triangular simultaneous equations (Newey et al., 1999). We can understand it in this way. First
of all, the controls $x_t$ are exogenous, so they can be safely dropped from the conditioning set. However, since $z_t$
and $u_t$ are components of prices, they must be correlated with the disturbance $\epsilon_{jt}$. The essence in
Assumption 1 is actually that zt do not have an impact on the demand except through the effects
on price and ut. This point will be more apparent in Figure 1 when we talk about the intuition
behind identification. Beyond this requirement, we also need to ensure that zt and ut are
uncorrelated and can be separated. This is ensured by the conditional mean independence assumption.
Assumption 2. For the instrumental variables zt, it holds that
$$E(u_t \mid z_t) = 0.$$
Another way to understand Assumption 2 is that there is no direct link between zt and ut.
This point will be further explained in Figure 1. We have made two modeling assumptions by
now. They are crucial to make our strategy work. The coming Assumption 3 and Assumption
4 are regularity conditions to ensure a well-defined solution.
Assumption 3. There are no fewer excluded instrumental variables than
endogenous variables, that is, $d_z \geq J$.
Assumption 4. $f_j(p_t, x_t)$ for $j = 1, 2, \cdots, J$, $E(\epsilon_{jt} \mid u_t)$ for $j = 1, 2, \cdots, J$, and $g(z_t)$ are
first-order continuously differentiable with respect to all arguments. Moreover, the Jacobian matrix
of $g(z_t)$ with respect to the excluded instruments $z_t^{(2)}$ is of full column rank.
Assumption 3 is commonly made in the instrumental variables literature. It implies that we
will require enough sources of variations to deal with endogeneity. The convenience it brings
will become more explicit when we derive the identification result. Assumption 4 is to ensure
that the partial derivatives exist and have desirable properties. It is a technical condition.
3.5 Identification
This section derives the identification of the heterogeneous treatment
effect $\partial_{p} f_j(p_t, x_t)$ in the presence of endogeneity. The result we are about to derive has already
appeared in Newey, Powell, and Vella (1999). Interestingly, to the best of our
knowledge, this identification and estimation route has not been explored so far. Our
paper may be the first to interpret this result in the context of heterogeneous
treatment effects and to use it in practice for estimation. Before we proceed, we
first introduce some notation. Let
hj(pt,xt, z(2)t ) := E(sjt|pt,xt, zt),
which is the conditional expectation of demand given the levels of prices, products characteristics
and instrumental variables, and let
λ(ut) := E(εjt|pt,xt, zt) = E(εjt|ut),
where the second equality comes directly from Assumption 1. When we take conditional expec-
tations on both sides of Equation (1), it can be shown without much effort that
hj(pt,xt, z(2)t ) = fj(pt,xt) + λ(ut), (4)
which states that the conditional demand function hj is the sum of the structural
demand function fj and the control function λ(·). As the classical control function argument
goes, once the control function is included, the endogeneity problem turns into an omitted
variable problem. In other words, price endogeneity is no longer a concern once we condition
on prices, product characteristics, and the newly included control function.
Thanks to the regularity conditions in Assumptions 3 and 4, we can take partial
derivatives on both sides of Equation (4) with respect to pt and z(2)t. By the chain rule,
we obtain
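$$
\underbrace{\partial_{p_t} h_j(p_t, x_t, z^{(2)}_t)}_{J \times 1}
= \underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
+ \underbrace{\partial_{u_t} \lambda(u_t)}_{J \times 1}, \tag{5}
$$
$$
\underbrace{\partial_{z^{(2)}_t} h_j(p_t, x_t, z^{(2)}_t)}_{d_z \times 1}
= - \underbrace{\partial_{z^{(2)}_t} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{u_t} \lambda(u_t)}_{J \times 1}. \tag{6}
$$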
Here ∂pthj(pt,xt, z(2)t) is the Jacobian matrix of the conditional demand function hj with
respect to the price vector pt, evaluated at the point (pt,xt, z(2)t). The other terms are defined
in the same fashion. We explicitly list the dimensions of these Jacobian matrices underneath to
avoid confusion over the various conventions for defining a Jacobian matrix.
Our goal is the heterogeneous treatment effect ∂pfj(pt,xt). We can rearrange Equation (5)
and Equation (6) and get
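$$
\underbrace{\partial_{z^{(2)}_t} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{z^{(2)}_t} h_j(p_t, x_t, z^{(2)}_t)}_{d_z \times 1}
+ \underbrace{\partial_{z^{(2)}_t} g(z_t)}_{d_z \times J}\,
\underbrace{\partial_{p_t} h_j(p_t, x_t, z^{(2)}_t)}_{J \times 1}, \tag{7}
$$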
which gives us a system of dz linear equations in J unknowns, whose solution is to be discussed
in the following two scenarios.
When dz > J, that is, when the number of excluded instrumental variables is larger than
the number of endogenous prices, Equation (7) is an over-identified system. Since ∂z(2)tg(zt) has
full column rank by Assumption 4, we are able to obtain a minimum distance solution
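$$
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{p_t} h_j(p_t, x_t, z^{(2)}_t)}_{J \times 1}
+ \Big( \underbrace{\partial_{z^{(2)}_t} g(z_t)^{T}}_{J \times d_z}\,
\underbrace{\partial_{z^{(2)}_t} g(z_t)}_{d_z \times J} \Big)^{-1}
\underbrace{\partial_{z^{(2)}_t} g(z_t)^{T}}_{J \times d_z}\,
\underbrace{\partial_{z^{(2)}_t} h_j(p_t, x_t, z^{(2)}_t)}_{d_z \times 1}.
$$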
When the number of excluded instruments is equal to the number of endogenous prices, that
is, when dz = J , Equation (7) is just identified. In this case, full column rank can ensure that
∂z(2)tg(zt) is invertible and we can get
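$$
\underbrace{\partial_{p_t} f_j(p_t, x_t)}_{J \times 1}
= \underbrace{\partial_{p_t} h_j(p_t, x_t, z^{(2)}_t)}_{J \times 1}
+ \underbrace{\big[\partial_{z^{(2)}_t} g(z_t)\big]^{-1}}_{J \times J}\,
\underbrace{\partial_{z^{(2)}_t} h_j(p_t, x_t, z^{(2)}_t)}_{J \times 1}, \tag{8}
$$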
which is a simplification of the over-identified solution.
Now we take a closer look at Equation (8). If different endogenous prices do not share the
same excluded instrument, for example, when each product price has only its own Hausman
price IV, the Jacobian matrix ∂z(2)tg(zt) is diagonal. In this case, each diagonal element of
the inverse of ∂z(2)tg(zt) is the reciprocal of the corresponding diagonal element of ∂z(2)tg(zt),
which implies that we can further simplify the equation and deal with the endogeneity of each
price separately. However, if there is a shared excluded instrument, such as a common cost
shifter, the matrix ∂z(2)tg(zt) is no longer diagonal and the system has to be solved jointly.
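As a minimal numerical sketch of Equations (7) and (8), the snippet below solves for the structural slope once the three Jacobians are available. The function name `structural_slope` and the numbers in the usage example are hypothetical and chosen only for illustration; a least-squares solve is used so that the same call covers both the over-identified and the just-identified case.

```python
import numpy as np


def structural_slope(dp_h, dz_h, dz_g):
    """Solve Equation (7) for the structural price slope of f_j.

    dp_h : (J,)     derivative of h_j with respect to prices
    dz_h : (dz,)    derivative of h_j with respect to excluded instruments
    dz_g : (dz, J)  derivative of the first stage g with respect to instruments
    """
    dp_h = np.asarray(dp_h, dtype=float)
    dz_h = np.asarray(dz_h, dtype=float)
    dz_g = np.asarray(dz_g, dtype=float)
    # Least squares returns (G'G)^{-1} G' dz_h when dz > J
    # and G^{-1} dz_h when dz = J and G is invertible.
    correction, *_ = np.linalg.lstsq(dz_g, dz_h, rcond=None)
    return dp_h + correction


# Hypothetical check: build the Jacobians from a known slope via (5) and (6).
true_slope = np.array([-1.5, 0.4])        # assumed structural slope
control_slope = np.array([0.7, -0.2])     # assumed derivative of lambda
G = np.array([[1.0, 0.2], [0.0, 0.8], [0.5, 0.5]])  # dz = 3, J = 2
recovered = structural_slope(true_slope + control_slope, -G @ control_slope, G)
```

In this constructed example `recovered` coincides with `true_slope` up to floating-point error, because the inputs satisfy Equations (5) and (6) exactly. With estimated Jacobians the two would differ by estimation error.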
3.6 An explanation
As we have mentioned, the above relationship has appeared in Newey, Powell, and Vella (1999)
with a slightly different notation. However, we are going to give the equation an interpretation
Figure 1: An explanation on identification
Note: This figure depicts a simplified relationship between demand, price, unobserved confounder, and
instrumental variable using the directed acyclic graphs (Pearl, 2009).
in the context of heterogeneous treatment effects. We will use directed acyclic graphs
(Pearl, 2009) for illustration; ours need not be the only interpretation. Before
that, to make our argument easier, let us further simplify Equation (8) to the case where there is
only one endogenous price and one excluded instrumental variable, that is,
$$
\partial_p f_j(p_t, x_t) = \partial_p h_j(p_t, x_t, z_t) + [\partial_z g(z_t)]^{-1}\, \partial_z h_j(p_t, x_t, z_t). \tag{9}
$$
Figure 1 depicts the endogenous situation with a directed acyclic graph (Pearl, 2009). We
hope to evaluate the causal effect of price on demand, which is Channel 2 in Figure 1. It is
the causal effect because it compares demand across parallel universes where the price differs
but the unobserved confounder is held fixed. However, the unobserved confounder affects price
and demand simultaneously through Channel 3 and Channel 4. As a result, when we move the
price a little, the change in demand comes from two sources. One comes directly from the change
in price, which is Channel 2, while the other comes from the co-movement of the unobserved
confounder with price, which is Channel 4 − 3. What we observe in the data is the total effect
of price on demand, but what we are interested in is the partial effect of price on demand,
which is also the causal effect. Nevertheless, if we have access to an instrumental variable,
we are able to decompose the total effect of price on demand.
For this to work, the instrumental variable needs to satisfy two conditions. First, the
instrumental variable must not affect demand directly, which implies that there is no direct
link between the instrumental variable and demand. This is what Assumption 1 requires. When
the instrumental variable changes, the change in demand can only come from its effect on the
price and the unobserved confounder. Second, there should be no direct link between the
instrumental variable and the unobserved confounder. They are two orthogonal determinants of
the price. This is guaranteed by Assumption 2, where the unobserved confounder and the
instrumental variable are set to be orthogonal.
Now let us take a step back and look at what the data can identify. First of all, the data
inform us of the total effect of price on demand, which includes the part coming from the
unobserved confounder. This is ∂phj(pt,xt, zt) in Equation (9). Second, we also know how the
instrumental variable changes the price, which is Channel 1 in Figure 1 and ∂zg(zt) in
Equation (9). Moreover, we know how the instrumental variable affects demand when the price is
held constant. The instrumental variable has such an effect precisely because the price is held
constant: the instrumental variable and the unobserved confounder are the two determinants of
the price, so with the price fixed, a change in the instrumental variable must be offset by an
opposite change in the unobserved confounder. This is Channel 1 − 3 + 4 in Figure 1. In this
way, the indirect effect of price on demand, which is Channel 4 − 3, can be worked out by
dividing the effect of the instrumental variable on demand by its effect on the price and
flipping the sign, which is −∂zg(zt)^{−1}∂zhj(pt,xt, zt), the negative of the second term in
Equation (9). Finally, the total effect minus the indirect effect gives us the direct partial
effect of price on demand. It is also the heterogeneous treatment effect of price on demand.
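As a worked illustration of Equation (9), consider a hypothetical linear special case, assumed here only for illustration and with the controls suppressed: the first stage is p = πz + u with π ≠ 0, and the control function is E(ε|u) = ρu. The symbols π and ρ are not part of the model above.

```latex
\begin{aligned}
h(p, z) &= f(p) + \rho\,(p - \pi z), \\
\partial_p h(p, z) &= f'(p) + \rho, \qquad
\partial_z h(p, z) = -\rho\pi, \qquad
\partial_z g(z) = \pi, \\
\partial_p h(p, z) + [\partial_z g(z)]^{-1}\,\partial_z h(p, z)
  &= f'(p) + \rho + \pi^{-1}(-\rho\pi) = f'(p).
\end{aligned}
```

In this special case the total effect is f'(p) + ρ, the indirect effect is −[∂zg]^{−1}∂zh = ρ, and subtracting the latter from the former leaves the structural slope f'(p).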
4 Estimation and inference
This section is devoted to the implementation details of our nonparametric strategy.
We first discuss possible approaches to estimating the point-wise conditional expectations,
including classical nonparametric regressions and modern machine learning methods. Once
point-wise predictions are available, the next problem is how to obtain partial derivatives.
In this paper, we use finite differences to approximate partial derivatives numerically.
Finally, we discuss how to conduct statistical inference. We propose the bagged nearest
neighbors method for point-wise prediction and formally prove that the bootstrap can be used
directly for inference when the bagged nearest neighbors estimator is used.
4.1 Estimation
The major difference between our method and the nonparametric IV literature is that we have
deliberately transformed the problem into point-wise estimations, while current approaches in
the nonparametric IV literature have focused mostly on the estimation and inference of structural
functions. This point-wise transformation brings two appealing changes. First, we are now able
to capture the observed heterogeneity of interest. For example, if our goal is to explore
heterogeneous treatment effects, we can repeatedly change the conditioning values and compare
the treatment effects at these points. The point-wise approach thus gives us leverage to
recover heterogeneity. Second, once the economic and econometric problem is transformed into
point-wise predictions, economists gain the possibility of using modern learning methods.
Although not fully understood, machine learning methods, such as deep neural networks, have
demonstrated superior performance in making nonlinear predictions. While the current discussion
of machine learning in econometrics is more on the variable selection side, this paper shows
that these methods can be conveniently combined with the well-studied nonparametric IV models.
A class of modern machine learning methods can be viewed as algorithm-enhanced classical
nonparametric methods. For example, random forest regression, in essence, has a convenient
representation as a nearest neighbor method, but with adaptive and data-driven local
weights (Wager and Athey, 2018). Deep neural networks, at a high level, can be seen as a
sophisticated functional approximation using sieves (Chen, 2007), which are already
familiar to economists. The novelty is that the basis of the sieves is now implicitly chosen by
layers of linear combinations and nonlinear activations. The bagged nearest neighbors method, to
be elaborated in the next subsection, can also be seen as a matching estimator (Abadie and
Imbens, 2006), which is already widely used in causal studies, enhanced by a bagging algorithm
(Breiman, 1996). However, despite these similarities, modern machine learning methods turn
out to work fairly well in practice, especially when there is a relatively large number of
control variables. The reason for this surprising difference is still an open question in
probability and statistical theory. One possible explanation is that the algorithms in machine
learning methods help alleviate the curse of dimensionality.
In this paper, our framework allows any method to be used for conditional prediction
as long as it is consistent. When the dimension of the conditioning variables is low,
traditional nonparametric methods, such as sieve estimators and kernel methods, can
also be used within our framework. In this case, the sieve estimator actually has an extra
advantage. Our intermediate goals are partial derivatives of conditional expectations. Since the
sieve estimator gives us an approximation of the structural function, we can differentiate
the basis functions and directly obtain an approximation of the partial derivatives.
This strategy is straightforward and can work very well when there are few conditioning
variables. When the dimension of the conditioning variables is relatively large, we conjecture that
the commonly used deep neural network can enjoy the same advantage. However, since it is not
the focus of this paper, we do not explore this direction further. In general, we can
use numerical methods to obtain partial derivatives. The most popular method in this domain is
perhaps the finite difference method, which is widely used in numerical analysis.
For details, see, for example, Strikwerda (2004). The finite difference method is a
discretization method in which finite differences are used to approximate derivatives. By
definition, the partial derivative is the limit of a difference quotient, so we can take a
small step and use the difference quotient as an approximation. At the implementation
level, the difference quotient can be constructed under various schemes, such as the forward
scheme, the backward scheme, and the central scheme. In this paper, for simplicity, we mostly
use the forward scheme, that is, we use the difference quotient [g(z + δ) − g(z)]/δ to
approximate the partial derivative gz(z).
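The forward scheme can be sketched in a few lines. The snippet below is only an illustration: `predict` stands for any point-wise prediction function, `toy_predict` is a hypothetical smooth function with a known derivative, and the step size `delta` is an arbitrary choice. The central scheme is included for comparison.

```python
import numpy as np


def forward_difference(predict, point, index, delta=1e-4):
    """Forward scheme: [m(x + delta * e_i) - m(x)] / delta."""
    point = np.asarray(point, dtype=float)
    shifted = point.copy()
    shifted[index] += delta
    return (predict(shifted) - predict(point)) / delta


def central_difference(predict, point, index, delta=1e-4):
    """Central scheme: [m(x + delta * e_i) - m(x - delta * e_i)] / (2 * delta)."""
    point = np.asarray(point, dtype=float)
    up, down = point.copy(), point.copy()
    up[index] += delta
    down[index] -= delta
    return (predict(up) - predict(down)) / (2.0 * delta)


def toy_predict(v):
    # Hypothetical prediction function; its derivative in v[0] is 2 * v[0] + 2.
    return v[0] ** 2 + 2.0 * v[0] - 3.0 * v[1]


slope_forward = forward_difference(toy_predict, [0.5, 0.0], index=0)
slope_central = central_difference(toy_predict, [0.5, 0.0], index=0)
```

Both quotients are close to the exact derivative of 3 at this point. When `predict` is itself an estimate with sampling noise, a very small `delta` amplifies that noise, so a larger step than the one used here is likely to be needed in practice.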
4.2 Inference
Machine learning methods can give fairly good point-wise predictions. The remaining difficulty
is statistical inference. To address this difficulty, we propose to use the bootstrap averaged
(bagged) nearest neighbors to estimate point-wise conditional expectations. We formally prove in
this paper that the bootstrap can be directly used for inference for the bagged nearest neighbors
estimators. Since the bootstrap commutes with smooth functions, we can therefore directly use
the bootstrap to derive inference for price elasticity estimates. It is worth noting that the price
elasticity estimates are not necessarily asymptotically normal.
The bootstrap averaged (bagged) nearest neighbors regression estimator, to the best of our
knowledge, first appears in Biau et al. (2010) as a special case of averaging nearest neighbors
estimators over subsamples drawn without replacement. The nearest neighbors estimator
here is more commonly known in economics as the matching estimator (Abadie and Imbens, 2006).
More recently, Fan et al. (2018) obtain the same estimator independently from
a panel perspective. The idea is that we can construct an artificial panel structure in the data
and average the nearest neighbors from each cross-section. This strategy is in effect equivalent
to subsampling, and the subsample size in subsample bootstrapping thereby becomes the
counterpart of the time dimension in panel data models (Hsiao, 2014). As a result, the panel
jackknife method (Arellano and Hahn, 2013; Dhaene and Jochmans, 2015) can be readily applied to
the bagged nearest neighbors estimators to remove higher order bias. In other words, the bagged
nearest neighbors estimator can be understood as the joint product of the familiar matching
estimators, the bootstrap averaging algorithm (Breiman, 1996), and the generalized jackknife
procedure. Surprisingly, this hybrid, like other machine learning methods, turns out to enjoy
desirable theoretical properties and has demonstrated fairly good working performance with a
relatively large dimension of control variables. We give a formal and concise introduction to the
bagged nearest neighbors in the appendix.
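The basic idea of averaging nearest neighbors over subsamples can be sketched as follows. This is a simplified illustration and not the estimator defined in the appendix: it uses the single nearest neighbor in Euclidean distance, approximates the average with a finite number of random subsamples drawn without replacement, and omits the jackknife bias correction. The function name, tuning values, and toy data are all hypothetical.

```python
import numpy as np


def bagged_nn_predict(X, y, x0, subsample_size, n_subsamples=300, seed=0):
    """Average the 1-nearest-neighbor prediction at x0 over random subsamples."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    n = X.shape[0]
    # Squared distances to the query point are computed once and reused.
    dist = np.sum((X - x0) ** 2, axis=1)
    preds = np.empty(n_subsamples)
    for b in range(n_subsamples):
        idx = rng.choice(n, size=subsample_size, replace=False)
        nearest = idx[np.argmin(dist[idx])]  # nearest neighbor within the subsample
        preds[b] = y[nearest]
    return preds.mean()


# Toy data (assumed): a noise-free linear surface on the square [-1, 1]^2.
rng = np.random.default_rng(1)
X_toy = rng.uniform(-1.0, 1.0, size=(5000, 2))
y_toy = X_toy[:, 0] + 2.0 * X_toy[:, 1]
pred = bagged_nn_predict(X_toy, y_toy, [0.2, -0.1], subsample_size=200)
```

The true value of the toy surface at the query point is 0, and `pred` should be close to it. The subsample size governs the usual trade-off: smaller subsamples average over more distant neighbors, which lowers variance and raises bias.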
Since we need to combine several predictions to form price elasticities, statistical inference
for these final quantities of interest is not trivial. The obstacles are that 1) we
may not have asymptotic normality, 2) the variance formula is not easy to obtain, and 3) the
variance formula may not be precise enough to be useful if many plug-in estimates are required.
To overcome these hurdles, we prove in this paper that the bootstrap can be used for inference
for bagged nearest neighbors. Since the bootstrap commutes with smooth functions, it follows
that the bootstrap can be directly used for inference on the price elasticity estimates,
provided the bagged nearest neighbors estimators are used for point-wise predictions. Our Monte
Carlo simulations suggest that the bootstrap strategy works well in practice and offers valid
inference with useful precision. Our formal theorem and its proof for the bootstrap result are
relegated to the appendix, after the introduction to the bagged nearest neighbors. As far as we
know, this is the first time this property has been established for the bagged nearest
neighbors. Our proof is novel and becomes mathematically neat with the introduction of the
Hoeffding decomposition (Hoeffding, 1948) and the Mallows distance (Bickel and Freedman, 1981).
To conclude this section, our implementation strategy is to first obtain point-wise
conditional predictions at various points, use them to numerically derive partial derivatives,
and then combine all relevant ingredients to obtain our final estimate of price elasticities.
For inference, we repeat the above process on bootstrapped samples of the full data. The
distribution of the final estimates across bootstrapped samples approximates the asymptotic
distribution of the elasticity estimator, and we can use this distribution for inference.
Moreover, in many cases, we are not satisfied with estimating the price elasticity at only one
point. The same estimation and inference procedure can be repeated at all other points of
interest. In particular, it is interesting to deliberately change the value of one or two
conditioning variables while holding all the others constant. In this way, we can trace the
change of price elasticities along one or more dimensions of interest.
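The resampling loop described above can be sketched generically. In the snippet below, `estimator` stands for the whole point-wise pipeline (predictions, finite differences, and their combination); here it is replaced by a hypothetical stand-in, an ordinary least squares slope, purely to keep the example short. Summarizing the bootstrap distribution with a percentile interval is one possible choice and an assumption of this sketch.

```python
import numpy as np


def bootstrap_distribution(data, estimator, n_boot=200, seed=0):
    """Re-run `estimator` on rows of `data` resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = estimator(data[idx])
    return draws


def percentile_interval(draws, level=0.95):
    """Percentile interval from the bootstrap draws."""
    alpha = 1.0 - level
    lower, upper = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return lower, upper


def toy_estimator(d):
    # Stand-in for the full estimation pipeline: OLS slope of column 1 on column 0.
    return np.polyfit(d[:, 0], d[:, 1], 1)[0]


# Toy data (assumed): y = 2x + noise.
rng = np.random.default_rng(2)
x = rng.standard_normal(2000)
toy_data = np.column_stack([x, 2.0 * x + rng.standard_normal(2000)])
point_estimate = toy_estimator(toy_data)
draws = bootstrap_distribution(toy_data, toy_estimator)
lower, upper = percentile_interval(draws)
```

The variance of `draws` plays the role of the bootstrap variance estimate, and repeating the call at different evaluation points traces out point-wise confidence bands.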
5 Monte Carlo simulation
We conduct Monte Carlo simulations in this section to show that our empirical strategy works
well in practice. Our method is applied to two parametric settings where we can conveniently
derive analytical solutions. We then compare our estimates with their analytical counterparts
at various given points. This design aims to show that our method can capture the
heterogeneity of price slopes, even in the presence of endogeneity. In particular, we will use the
following data generating process in our simulation,
$$
\begin{aligned}
s_1 &= g(p_1, p_2, p_3, p_4) + \varepsilon, \\
\varepsilon &= e - 2u, \\
p_1 &= z + u,
\end{aligned}
$$
where p2, p3, p4, u, e and z are independent and follow the standard normal distribution. Here
u can be interpreted as an unobserved common shock that affects both p1 and s1. The presence of
the common shock u raises the concern of endogeneity. As for the other variables, s1 can be
interpreted as the demand for product 1, p1, p2, p3, p4 as the prices of product 1 to product 4,
and e as a random demand shock. Moreover, z can be interpreted as a product-specific cost
component and can therefore serve as an instrument for p1.
For the demand function g, we use two model specifications. In particular, for Model 1,
$$
g(p_1, p_2, p_3, p_4) = 5 + p_1^2 + 2p_1 - 3p_2 + p_3 - p_4^2,
$$
and for Model 2,
$$
g(p_1, p_2, p_3, p_4) = 5 + p_1^3 + 2p_1 - 3p_2 + p_3 - p_4^2,
$$
where the difference is only in the higher order term of p1. The reason we make this distinction
is that the two models have a linear and a quadratic price slope along p1, respectively. To
be specific, it is easy to verify that for Model 1, the price slope is
$$
\frac{\partial g}{\partial p_1}(p_1, 0, 0, 0) = 2p_1 + 2,
$$
and for Model 2,
$$
\frac{\partial g}{\partial p_1}(p_1, 0, 0, 0) = 3p_1^2 + 2.
$$
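The data generating process and the analytical slopes can be reproduced with a short simulation. The sketch below takes p1 = z + u as the defining equation for p1 and draws all remaining variables as independent standard normals; the function names and the seed are our own illustrative choices. The disturbance is returned only so that the endogeneity can be checked; it would not be observed in practice.

```python
import numpy as np


def simulate(n, model=1, seed=0):
    """Draw one sample of size n from Model 1 or Model 2."""
    rng = np.random.default_rng(seed)
    z, u, e, p2, p3, p4 = rng.standard_normal((6, n))
    p1 = z + u             # price with the common shock u
    eps = e - 2.0 * u      # disturbance, correlated with p1 through u
    power = 2 if model == 1 else 3
    s1 = 5.0 + p1 ** power + 2.0 * p1 - 3.0 * p2 + p3 - p4 ** 2 + eps
    return {"s1": s1, "p1": p1, "p2": p2, "p3": p3, "p4": p4, "z": z, "eps": eps}


def analytical_slope(p1, model=1):
    """Analytical price slope of g along p1."""
    return 2.0 * p1 + 2.0 if model == 1 else 3.0 * p1 ** 2 + 2.0


sample = simulate(10_000, model=1)
# The covariance between price and disturbance is -2 in population.
cov_p1_eps = np.cov(sample["p1"], sample["eps"])[0, 1]
grid = np.linspace(-0.8, 0.8, 21)
slopes_model_1 = analytical_slope(grid, model=1)
slopes_model_2 = analytical_slope(grid, model=2)
```

The 21-point grid has a spacing of 0.08, which matches the evaluation points listed in Table 1.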
Figure 2 depicts our estimation results for Model 1 and Model 2 from one simulated sample
with 10,000 independent observations. The estimations are conducted point by point where p1
moves gradually from −0.8 to 0.8 and the other variables are held at 0. We also computed
the 95% confidence intervals using the bootstrap. They are presented using the dashed lines in
Figure 2. From Figure 2, we can see that even in the presence of endogeneity, our method can
still well capture the price slopes at different price levels. In other words, the heterogeneity of
the price treatment effects has been fully recovered.
We further repeat the above simulation exercise 500 times. Table 1 summarizes the
results. In Table 1, Column p1 gives the values of p1 at the points where we make estimations.
The other variables are held at 0, which is both their mean and their median. The two Slope columns give us
Figure 2: Simulation 1
[Two panels, Model 1 and Model 2. Horizontal axis: p1 from −0.8 to 0.8; vertical axis from 0 to 6. Each panel shows the analytical price slope, the estimated price slope, and the 95% confidence interval.]
Note: This figure plots the estimations of our method for one simulated sample with 10,000 independent
observations. The estimations are conducted point by point with p1 varying from −0.8 to 0.8 and the
other variables at 0.
analytical values of the price slopes at these points from our theoretical derivation. As a
comparison, our price slope estimates are presented in the neighboring Est. Slope columns. The
two Variance columns are the variances of the respective price slope estimates across the 500
simulations. Meanwhile, the two Est. Variance columns are the means of the respective 500
simulated bootstrap variances.
It can be seen from Table 1 that our bootstrap inference procedure gives a reliable and
convenient estimate of the true variance of the price slope estimator. This feature is
important whenever statistical inference is a concern. The trajectories of the slope
estimates track their analytical counterparts well, which is useful in many economic
research problems, such as tax or tariff pass-through. In the appendix, we also
conduct a Monte Carlo simulation in which the sample is generated from a random coefficient
multinomial model. Our method appears to be compatible with this important case.
To conclude, we have employed a simple Monte Carlo simulation to demonstrate the potential
usefulness of our estimation and inference strategy. It shows that our method is able to recover
heterogeneity in the presence of endogeneity and provides convenient and valid inference.
6 Empirical Application
In this section, we demonstrate the applicability of our approach by estimating the price elas-
ticities among yogurt products. Yogurt is a widely-studied product category in empirical in-
dustrial organization and marketing. However, the yogurt literature has emphasized different
aspects of consumer behavior and, consequently, has adopted different and sometimes mutually
incompatible demand models. In this paper, we will estimate own- and cross-price elasticities
among popular yogurt products without imposing a priori structural assumptions on consumer
behavior. We believe that the empirical elasticity patterns we find will be 1) directly informa-
tive of product markup and price pass-through, and 2) instructive of more reasonable modeling
assumptions if structural approaches are to be used to characterize demand.
The rich empirical literature on yogurt offers us a myriad of different characterizations of the
demand function that go beyond the “baseline” random coefficient demand system (Berry et al.,
1995). In particular, Villas-Boas (2007) use the conventional random coefficient discrete choice
model but ingeniously point out the importance of the interaction between retailer and manu-
facturer. Meanwhile, various other studies focus on different aspects of the consumer choice
process that are beyond the scope of a conventional discrete choice model. For example,
Kim et al. (2002) study consumers’ choices of a variety of products and in multiple quanti-
ties. Pavlidis and Ellickson (2018) study consumer switching costs across products and brands
Table 1: Simulation results
         Model 1                                       Model 2
   p1    Slope  Est. Slope  Variance  Est. Variance    Slope  Est. Slope  Variance  Est. Variance
-0.80     0.40        0.34    0.0603         0.0694     3.92        3.75    0.1390         0.1498
-0.72     0.56        0.51    0.0607         0.0696     3.56        3.39    0.1293         0.1358
-0.64     0.72        0.68    0.0613         0.0691     3.23        3.06    0.1169         0.1229
-0.56     0.88        0.85    0.0629         0.0702     2.94        2.77    0.1092         0.1146
-0.48     1.04        1.02    0.0660         0.0712     2.69        2.52    0.1071         0.1080
-0.40     1.20        1.20    0.0668         0.0705     2.48        2.31    0.0980         0.1022
-0.32     1.36        1.36    0.0679         0.0717     2.31        2.13    0.0932         0.0959
-0.24     1.52        1.53    0.0706         0.0716     2.17        1.99    0.0893         0.0934
-0.16     1.68        1.70    0.0700         0.0739     2.08        1.89    0.0863         0.0908
-0.08     1.84        1.88    0.0702         0.0756     2.02        1.84    0.0838         0.0895
 0.00     2.00        2.05    0.0743         0.0775     2.00        1.82    0.0867         0.0902
 0.08     2.16        2.22    0.0770         0.0784     2.02        1.84    0.0880         0.0903
 0.16     2.32        2.39    0.0772         0.0808     2.08        1.91    0.0883         0.0915
 0.24     2.48        2.56    0.0776         0.0842     2.17        2.01    0.0871         0.0946
 0.32     2.64        2.73    0.0786         0.0868     2.31        2.15    0.0883         0.0987
 0.40     2.80        2.90    0.0819         0.0899     2.48        2.33    0.0932         0.1022
 0.48     2.96        3.07    0.0832         0.0931     2.69        2.55    0.0969         0.1115
 0.56     3.12        3.25    0.0866         0.0956     2.94        2.80    0.1054         0.1175
 0.64     3.28        3.42    0.0921         0.0999     3.23        3.09    0.1190         0.1275
 0.72     3.44        3.59    0.0946         0.1029     3.56        3.43    0.1311         0.1395
 0.80     3.60        3.76    0.1016         0.1089     3.92        3.79    0.1508         0.1529
Note: This table summarizes our simulation results from 500 simulations. Est. is short for Estimated.
For each simulation, we generate a sample with 10,000 independent observations and make estimations
point by point with the values of p1 varying from −0.8 to 0.8 and the other variables at 0. For each
estimation, the estimated variance is obtained by the bootstrap. Meanwhile, variance is the variance
of the estimated slopes from 500 simulations.
and discuss their implications for dynamic pricing. Huang and Bronnenberg (2018) study costly
consideration-set formation in a demand system where consumers can choose a variety of
products in multiple quantities.
In this section, we use the IRI academic dataset (Bronnenberg et al., 2008) to characterize
elasticities across yogurt products. This dataset is commonly used in marketing and
provides weekly sales information on yogurt for various store chains from 2001 to 2007 in several
states of the United States. Using this dataset, we highlight two interesting features that
are not much emphasized in the existing literature. First, as in many consumer-packaged-goods
categories, yogurt products come in various package sizes, such as 8 oz or 32 oz.
Classical microeconomic theory might view these size offerings as a form of second-degree
price discrimination. However, in empirical studies, consumer preference heterogeneity across
package sizes is often abstracted away. Commonly, yogurts of various package sizes are pooled
within the brand for structural analyses, possibly due to the complexity of specifying the
preference distribution across sizes. In this paper, we analyze package sizes without a priori
structural assumptions. Second, in our sample period, one yogurt brand is consistently priced at
a higher level than the other. Such persistent differences might be an outcome of cost differences.
Our empirical analysis aims to speak to these two features of demand.
6.1 Sample construction and implementation
We first describe the landscape of total yogurt sales during our sample period. Summary
statistics are provided in Table 2. During our sample period, the most popular brand is Yoplait.
Dannon follows as the second most popular brand, but with only about two-thirds of the sales of
Yoplait. The third most popular brand is the private label yogurt provided by stores. The
private label is technically not a brand, since private label yogurt can be very different and
often of distinct quality from store to store. For Yoplait and Dannon, we decompose their sales
further into popular sizes and report the total sales associated with each size. The numbers
show that both Yoplait and Dannon offer two vertically differentiated product lines. One is
priced at around $1.7 per pound and the other is premium, at around $2.4 per pound. In this
paper, we focus on the regular product line, since in total it sells the most units and can best
reflect the competitive interaction between Yoplait and Dannon.
We will define Dannon 0.375 pounds (6 oz) and 0.5 pounds (8 oz) as Dannon small and
Dannon 1.5 pounds (24 oz) and 2 pounds (32 oz) as Dannon large. For Yoplait, Yoplait 0.375
pounds (6 oz) is defined to be Yoplait small and Yoplait 1.5 pounds (24 oz) Yoplait large. Since
Yoplait offers more at the premium line, we define Yoplait 0.25 pounds (4 oz), 0.6875 pounds (11
oz) and 1.125 pounds (18 oz) as Yoplait premium and use it as a control. For the Private label,
Table 2: Yogurt brand and size
Brand          Total sales (units)  Size (lb)  Individual sales (units)  Price ($/lb)
Yoplait        626,760,636          0.375      490,723,711               $1.66
                                    0.25        63,490,752               $2.55
                                    1.5         26,982,361               $1.73
                                    1.125       23,543,991               $2.37
                                    0.6875      11,123,974               $2.31
                                    0.5          6,517,007               $2.18
                                    3            1,884,085               $1.52
Dannon         397,986,186          0.375      199,607,994               $1.58
                                    0.5         71,257,840               $1.40
                                    1           33,874,083               $2.24
                                    2           23,157,036               $1.41
                                    1.5         14,539,104               $1.63
Private label  277,347,934          0.5        211,898,981               $0.98
                                    0.375       42,254,002               $1.13
                                    2           13,661,875               $0.97
Note: This table summarizes yogurt sales in popular brands and sizes from our dataset.
The sales are in units, the sizes are in pounds, and the price has been normalized to
dollars per pound. Only most popular brands and sizes have been listed.
Table 3: Yogurt brand and size
Group            Variable    Mean    S.D.    Min     Q1   Median      Q3      Max     Obs.
Yoplait small    Price       0.92    0.17   0.10   0.80     0.92    1.05     1.58  156,580
                 Sales     416.13  272.49  66.00 207.75   346.13  556.88  1343.25  156,580
                 IV          0.84    0.05   0.72   0.81     0.84    0.87     0.97  156,580
Yoplait large    Price       0.92    0.12   0.17   0.83     0.92    1.00     1.46  156,580
                 Sales      89.65   61.50  15.00  43.50    72.00  120.00   301.50  156,580
                 IV          0.86    0.02   0.75   0.84     0.86    0.88     0.91  156,580
Yoplait premium  Price       0.64    0.08   0.15   0.58     0.63    0.70     0.98  156,580
Dannon small     Price       0.83    0.18   0.11   0.69     0.80    0.94     1.36  156,580
                 Sales     216.38  180.58  20.25  83.25   160.50  291.75   915.00  156,580
                 IV          0.77    0.06   0.58   0.73     0.77    0.81     0.90  156,580
Dannon large     Price       0.78    0.11   0.27   0.69     0.77    0.86     1.30  156,580
                 Sales     131.13   93.37  18.00  62.00   106.00  172.50   502.00  156,580
                 IV          0.73    0.02   0.64   0.72     0.73    0.74     0.80  156,580
Private label    Price       0.54    0.09   0.18   0.48     0.53    0.60     1.46  156,580
Other controls
Store            ACV         0.22    0.09   0.04   0.16     0.20    0.27     1.00  156,580
Chain            Shelf.1     0.25    0.10   0.04   0.15     0.22    0.33     0.49  156,580
                 Shelf.2     0.31    0.10   0.03   0.25     0.33    0.40     0.63  156,580
Time             Week        0.51    0.29   0.02   0.27     0.52    0.75     1.00  156,580
                 Year        0.58    0.28   0.17   0.33     0.67    0.83     1.00  156,580
Note: This table summarizes yogurt sales in popular brands and sizes from our dataset.
Prices reported here are per 8 oz, except that the premium Yoplait price is per 4 oz.
we do not further distinguish sizes and use the average price as a control. The average price
is computed as total revenue divided by total pounds sold.
With the above definitions, we eventually arrive at a sample of 156,580 observations at the
store-week level. For each observation, we have the prices for Yoplait small, Yoplait large,
Yoplait premium, Dannon small, Dannon large, and the private label. We instrument for the
potentially endogenous prices using Hausman instruments (Hausman et al., 1994; Berry et al.,
1995), that is, prices of the focal product in other geographic markets. We further control for
fixed effects at the store, chain, and time levels. We use the all-commodity volume (ACV)
provided by the IRI dataset as a control for the store-level fixed effect. It is a weighted
measure of product availability based on aggregate store sales. The chain-level fixed effects
are represented by two variables, Shelf.1 and Shelf.2. They are the ratios of Dannon sales and
private label sales, respectively, to the total yogurt sales in the grocery chain. They reflect
the chain-level preference over Dannon, Yoplait, and the private label. One interpretation is
that they reflect shelf space allocation. We also add the week of the year and the year of the
sample to control for time-level fixed effects. Both variables have been normalized to [0, 1].
Note that although our analysis uses proxies to control for fixed effects, we do not restrict
the functional form, thereby permitting the controls to enter in a flexible manner. We provide
summary statistics on all these variables in Table 3.
6.2 Empirical findings
Our empirical findings are presented in Figures 3–6. In each of the four figures, we report the
price elasticities of all four focal yogurt products with respect to the price of one of
them (e.g., Dannon small). We report estimated price elasticities at each of the 50 price levels
between $0.7 and $1 per 8 oz, while holding all other prices, market, and control variables at
their median levels. The bootstrap 95% confidence intervals for these point estimates are
shown as dashed lines. We also report the own and cross elasticity estimates at an own price of
$0.85 per 8 oz, with the other prices at their median levels, in Table 4.
First, both Dannon's and Yoplait's small-sized products exhibit elastic demand, and
their own-price elasticities increase in magnitude with the price. The downward-sloping
own-price elasticity is consistent with Marshall's second law of demand, which is commonly assumed
in theoretical IO. Second, when Dannon small is priced at $0.85, its own elasticity is
about −3, whereas the cross elasticity of Yoplait small is about 1.6. In contrast, when
Yoplait small is priced at $0.85, its own elasticity is about −5, whereas the cross elasticity
of Dannon small is about 4. This suggests that Yoplait small consumers are more sensitive to price.
Third, we also observe a different cross elasticity pattern for Dannon small and Yoplait small.
Figure 3: Own and cross price elasticities with respect to the price of Dannon small
[Plot: elasticity (vertical axis) against the price of Dannon small (horizontal axis, 0.7 to 1); lines for Dannon small, Dannon large, Yoplait small, and Yoplait large, with 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait
small, and Yoplait large with respect to the price of Dannon small. These elasticities are evaluated
at 30 price levels of Dannon small, from 0.7 to 1 dollars per 8 oz.
Figure 4: Own and cross price elasticities with respect to the price of Yoplait small
[Plot: elasticity (vertical axis) against the price of Yoplait small (horizontal axis, 0.7 to 1); lines for Dannon small, Dannon large, Yoplait small, and Yoplait large, with 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait
small, and Yoplait large with respect to the price of Yoplait small. These elasticities are evaluated
at 30 price levels of Yoplait small, from 0.7 to 1 dollars per 8 oz.
Figure 5: Own and cross price elasticities with respect to the price of Dannon large
[Plot: elasticity (vertical axis) against the price of Dannon large (horizontal axis, 0.7 to 1); lines for Dannon small, Dannon large, Yoplait small, and Yoplait large, with 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait
small, and Yoplait large with respect to the price of Dannon large. These elasticities are evaluated
at 30 price levels of Dannon large, from 0.7 to 1 dollars per 8 oz.
Figure 6: Own and cross price elasticities with respect to the price of Yoplait large
[Plot: elasticity (vertical axis) against the price of Yoplait large (horizontal axis, 0.7 to 1); lines for Dannon small, Dannon large, Yoplait small, and Yoplait large, with 95% confidence intervals.]
Note: This figure provides the price elasticity estimates of Dannon small, Dannon large, Yoplait
small, and Yoplait large with respect to the price of Yoplait large. These elasticities are evaluated
at 30 price levels of Yoplait large, from 0.7 to 1 dollars per 8 oz.
Table 4: Price elasticities at representative points
                    Dannon small   Dannon large   Yoplait small   Yoplait large
(A) Dannon small       -2.961          0.268          1.621           0.957
(B) Dannon large        0.122         -1.371          0.048           0.970
(C) Yoplait small       3.880          0.649         -4.927           1.367
(D) Yoplait large       1.413          0.204         -0.294          -1.576
Note: Each row presents the price elasticity estimates of the four products (columns) when the
row product (A, B, C, or D) is priced at $0.85 per 8 oz and all the other products are at their
median price levels.
When we slightly change the price of Dannon small, the cross elasticity of Yoplait small slowly
increases with this change. However, when we slightly change the price of Yoplait small, the cross
elasticity of Dannon small slowly decreases. Considering that Yoplait is mostly priced
a bit higher than Dannon in our sample, an immediate explanation for this difference
is that competition between the two products is more intense when their price levels are closer.
Fourth, we observe very similar patterns for Yoplait large and Dannon large when we change the
price of Yoplait small or Dannon small. This gives the impression that the large-sized yogurts are
specialized for their targeted consumer groups, to whom the brand of small-sized yogurt is less
relevant. Finally, when we slowly change the price of Dannon large, the strongest substitution
comes from Yoplait large, which is consistent with our findings for small-sized yogurt. However,
this pattern does not hold when we change the price of Yoplait large. For Yoplait large, we also
fail to observe a downward-sloping own elasticity curve. In addition, the price elasticities for
the other three products, i.e., Dannon small, Yoplait small, and Dannon large, are estimated
with relatively tight confidence intervals. In contrast, the price elasticities of Yoplait large are
estimated with much wider confidence intervals and, over a large price region, the cross elasticities
for Dannon large and Yoplait small are not distinguishable from zero. All these facts prevent us
from reaching a confident interpretation for Yoplait large.
Since substitution among yogurt products appears to be more prevalent within
package size than within brand, let us focus solely on the small-sized yogurt and, for now,
ignore other products. This simplification can shed some light on the cost structure of Yoplait
and Dannon. From our estimation, at the median price levels of Yoplait and Dannon, which are
($0.92, $0.80), the own-price elasticities are about −6 for Yoplait and about −2.5 for Dannon. If
this is the pricing equilibrium and both firms have constant marginal costs, we can conveniently
derive the first order optimality condition for each firm. When we plug in the numbers, a simple
computation gives a marginal cost of about $0.77 for Yoplait and $0.48 for Dannon.
Conversely, if we have information on marginal costs, our point-wise price elasticity estimates
can also inform optimal pricing.
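The computation above can be reproduced with a short sketch. It assumes the single-product
first order condition (p − c)/p = −1/ε with constant marginal cost c, so that c = p(1 + 1/ε);
the text does not write this condition out, and the function name `implied_marginal_cost` is
ours and purely illustrative.

```python
def implied_marginal_cost(price, own_elasticity):
    """Marginal cost implied by the single-product pricing condition.

    Assumes (p - c) / p = -1 / elasticity with constant marginal cost c,
    which is only meaningful for elastic demand (own_elasticity < -1).
    """
    if own_elasticity >= -1:
        raise ValueError("the pricing condition requires own_elasticity < -1")
    return price * (1.0 + 1.0 / own_elasticity)


# Median prices and approximate own-price elasticities quoted in the text.
mc_yoplait = implied_marginal_cost(0.92, -6.0)
mc_dannon = implied_marginal_cost(0.80, -2.5)
print(round(mc_yoplait, 2), round(mc_dannon, 2))
```

Under these assumptions the two values round to the $0.77 and $0.48 reported in the text. The
same function can be inverted to obtain an implied price once a marginal cost is known.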
To conclude, our results show that small-sized yogurt and large-sized yogurt are very different
products by nature and that competition among yogurt products exists more within the same
package size than within the same brand. An array of empirical studies first
aggregate data across all package sizes within a brand and then conduct further analysis. It seems
that, whereas this approach can simplify the product space, it might also abstract away
important dimensions of product differentiation and substitution. It is worth noting that our
results build only on minimal assumptions on functional forms, functional differentiability, and
the validity of the instrumental variables, regardless of the source of endogeneity. The heterogeneity
patterns we find are obtained before any a priori structural assumptions on individuals and
markets. We believe that such information can be valuable for its own sake and constructive for
further structural modeling.
7 Discussion
In this paper, we have proposed a point-wise approach to flexibly estimate price elasticity in
the presence of endogeneity. Our framework is open to the use of modern machine learning
methods to estimate point-wise conditional expectations. In this way, the curse of dimensionality
can be mitigated and the practical performance of our approach can be improved. In particular, if
the bagged nearest neighbors are used for point-wise prediction, we prove that the standard
bootstrap procedure can be directly employed for inference. We believe that our flexible price
elasticity estimates can be very useful in a wide range of economic problems, including welfare
analysis, firm pricing, and tax incidence studies.
Since this paper has focused on the estimation and inference of price elasticities, we have not
given an explicit discussion of how counterfactual analysis can be conducted in a non-parametric
setup. In a parametric model, the conventional assumption for counterfactual analysis is that the
structural parameters are stable in and out of sample. They reflect deep preferences
and will not change in the new scenario. In this way, structural models can provide
predictions and counterfactual analysis even when the market becomes totally different. We argue
that if similar assumptions are made, counterfactuals can also be obtained in our framework.
However, instead of assuming constant price coefficients, we can impose other restrictions, such
as shape restrictions on how price elasticities should evolve. This feature can be important when
the new policy is believed to have a significant impact on individual preferences, that is, when
structural parameters are most likely to be misspecified.
We are also curious about the potential future use of deep neural networks in our
framework. Deep neural networks have been widely used in business practice to deal with big data,
where they have demonstrated superior reliability and practical usefulness. For our problem, one
particularly interesting direction is how we can use deep neural networks to predict
the slopes directly. If there is a convenient and fast algorithm to achieve this, we are curious
to know how far it can push the dimensionality limit. Moreover, if unstructured data, such
as satellite images, product pictures, and social network texts, can also be utilized to answer
economic questions, we wonder what new insights we can get from such a change.
References
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average
treatment effects. Econometrica 74 (1), 235–267.
Arcones, M. A. and E. Gine (1992). On the bootstrap of U and V statistics. The Annals of
Statistics 20 (2), 655–674.
Arellano, M. and J. Hahn (2013). Understanding Bias in Nonlinear Panel Models: Some Recent
Developments, Volume 3, pp. 381–409. Cambridge University Press.
Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Pro-
ceedings of the National Academy of Sciences 113 (27), 7353–7360.
Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of
Statistics 47 (2), 1148–1178.
Banks, J., R. Blundell, and A. Lewbel (1997). Quadratic engel curves and consumer demand.
The Review of Economics and Statistics 79 (4), 527–539.
Belloni, A., V. Chernozhukov, and C. Hansen (2014a). High-dimensional methods and inference
on structural and treatment effects. Journal of Economic Perspectives 28 (2), 29–50.
Belloni, A., V. Chernozhukov, and C. Hansen (2014b). Inference on treatment effects after
selection among high-dimensional controls. The Review of Economic Studies 81 (2), 608–650.
Berry, S., A. Gandhi, and P. Haile (2013). Connected substitutes and invertibility of demand.
Econometrica 81 (5), 2087–2111.
Berry, S., J. Levinsohn, and A. Pakes (1995). Automobile prices in market equilibrium. Econo-
metrica 63 (4), 841–890.
Berry, S. T. (1994). Estimating Discrete-Choice models of product differentiation. The RAND
Journal of Economics 25 (2), 242.
Berry, S. T. and P. A. Haile (2014). Identification in differentiated products markets using
market level data. Econometrica 82 (5), 1749–1797.
Biau, G., F. Cerou, and A. Guyader (2010). On the rate of convergence of the bagged nearest
neighbor estimate. Journal of Machine Learning Research 11 (3), 687–712.
Biau, G. and L. Devroye (2015). Lectures on the Nearest Neighbor Method. Springer.
Bickel, P. J. and D. A. Freedman (1981). Some asymptotic theory for the bootstrap. The Annals
of Statistics 9 (6), 1196–1217.
Blundell, R., X. Chen, and D. Kristensen (2007). Semi-Nonparametric IV Estimation of Shape-
Invariant Engel Curves. Econometrica 75, 1613–1669.
Blundell, R., J. Horowitz, and M. Parey (2016). Nonparametric estimation of a nonseparable
demand function under the slutsky inequality restriction. The Review of Economics and
Statistics 99 (2), 291–304.
Blundell, R., J. L. Horowitz, and M. Parey (2012). Measuring the price responsiveness of gasoline
demand: Economic shape restrictions and nonparametric demand estimation. Quantitative
Economics 3 (1), 29–51.
Blundell, R., D. Kristensen, and R. L. Matzkin (2013). Control functions and simultaneous
equations methods. American Economic Review 103 (3), 563–69.
Blundell, R. and J. L. Powell (2003). Endogeneity in Nonparametric and Semiparametric Re-
gression Models, Volume 2, pp. 312–357. Cambridge University Press.
Breiman, L. (1996). Bagging predictors. Machine Learning 24 (2), 123–140.
Bronnenberg, B. J., M. W. Kruger, and C. F. Mela (2008). Database Paper-The IRI marketing
data set. Marketing Science 27 (4), 745–748.
Chen, X. (2007). Chapter 76 Large Sample Sieve Estimation of Semi-Nonparametric Models,
Volume 6, pp. 5549. Elsevier.
Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2014). Local Identification of Nonpara-
metric and Semiparametric Models. Econometrica 82, 785–809.
Chen, X. and T. M. Christensen (2018). Optimal sup-norm rates and uniform inference on
nonlinear functionals of nonparametric iv regression. Quantitative Economics 9 (1), 39–84.
Chen, X. and D. Pouzo (2015). Sieve Wald and QLR Inferences on Semi/Nonparametric Con-
ditional Moment Models. Econometrica 83 (3), 1013–1079.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey (2017).
Double/Debiased/Neyman machine learning of treatment effects. American Economic Re-
view 107 (5), 261–65.
Chesher, A. (2003). Identification in nonseparable models. Econometrica 71 (5), 1405–1441.
Compiani, G. (2019). Market counterfactuals and the specification of multi-product demand:
A nonparametric approach. Working paper.
Das, M., W. K. Newey, and F. Vella (2003). Nonparametric estimation of sample selection
models. The Review of Economic Studies 70 (1), 33–58.
De Los Santos, B., A. Hortacsu, and M. R. Wildenbeest (2012). Testing models of consumer
search using data on web browsing and purchasing behavior. American Economic
Review 102 (6), 2955–80.
Deaton, A. and J. Muellbauer (1980). An almost ideal demand system. The American Economic
Review 70 (3), 312–326.
Dette, H., S. Hoderlein, and N. Neumeyer (2016). Testing multivariate economic restrictions
using quantiles: The example of slutsky negative semidefiniteness. Journal of Economet-
rics 191 (1), 129–144.
Dhaene, G. and K. Jochmans (2015). Split-panel jackknife estimation of fixed-effect models.
The Review of Economic Studies 82 (3), 991–1030.
Draganska, M. and D. C. Jain (2006). Consumer preferences and Product-Line pricing strategies:
An empirical analysis. Marketing Science 25 (2), 164–174.
Draganska, M., M. Mazzeo, and K. Seim (2009). Beyond plain vanilla: Modeling joint product
assortment and pricing decisions. Quantitative Marketing and Economics 7 (2), 105–146.
Dube, J., G. J. Hitsch, and P. E. Rossi (2018). Income and wealth effects on Private-Label
demand: Evidence from the great recession. Marketing Science 37 (1), 22–53.
Efron, B. and C. Stein (1981). The jackknife estimate of variance. The Annals of Statistics 9 (3),
586–596.
Fan, Y., J. Lv, and J. Wang (2018). DNN: A two-scale distributional tale of heterogeneous
treatment effect inference. Working paper.
Gentzkow, M. (2007). Valuing new goods in a model with complementarity: Online newspapers.
American Economic Review 97 (3), 713–744.
Goeree, M. S. (2008). Limited information and advertising in the U.S. personal computer in-
dustry. Econometrica 76 (5), 1017–1074.
Haag, B. R., S. Hoderlein, and K. Pendakur (2009). Testing and imposing slutsky symmetry in
nonparametric demand systems. Journal of Econometrics 153 (1), 33–50.
Hahn, J. and G. Ridder (2017). Instrumental variable estimation of nonlinear models with
nonclassical measurement error using control variables. Journal of Econometrics 200, 238–
250.
Hahn, J. and G. Ridder (2018). Three-stage semi-parametric inference: Control variables and
differentiability. Journal of Econometrics.
Hall, P. and J. L. Horowitz (2005). Nonparametric methods for inference in the presence of
instrumental variables. The Annals of Statistics 33, 2904–2929.
Hausman, J., G. Leonard, and D. J. Zona (1994). Competitive analysis with differenciated
products. Annales d’Economie et de Statistique (34), 159–180.
Hausman, J. A. and W. K. Newey (1995). Nonparametric estimation of exact consumers surplus
and deadweight loss. Econometrica 63 (6), 1445–1476.
Hausman, J. A. and W. K. Newey (2015). Nonparametric welfare analysis. Annual Review of
Economics 9 (1), 1–26.
Heckman, J. and R. Robb (1985). Alternative methods for evaluating the impact of interventions:
An overview. Journal of Econometrics 30 (1), 239–267.
Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an econometric evaluation
estimator: Evidence from evaluating a job training programme. The Review of Economic
Studies 64 (4), 605–654.
Hendel, I. (1999). Estimating multiple-discrete choice models: An application to computerization
returns. The Review of Economic Studies 66 (2), 423–446.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals
of Mathematical Statistics 19 (3), 293–325.
Houde, J. (2012). Spatial differentiation and vertical mergers in retail markets for gasoline.
American Economic Review 102 (5), 2147–82.
Hsiao, C. (2014). Analysis of Panel Data. Cambridge University Press.
Huang, Y. and B. J. Bronnenberg (2018). Pennies for your thoughts: Costly product consider-
ation and purchase quantity thresholds. Marketing Science 37 (6), 1009–1028.
Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous
equations models without additivity. Econometrica 77 (5), 1481–1512.
Kim, J., G. M. Allenby, and P. E. Rossi (2002). Modeling consumer demand for variety. Mar-
keting Science 21 (3), 229–250.
Mack, Y. (1981). Local properties of k-NN regression estimates. SIAM Journal on Algebraic
Discrete Methods 2 (3), 311–323.
Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Economet-
rica 71 (5), 1339–1375.
Matzkin, R. L. (2008). Identification in nonparametric simultaneous equations models. Econo-
metrica 76 (5), 945–978.
Matzkin, R. L. (2015). Estimation of nonparametric models with simultaneity. Economet-
rica 83 (1), 1–66.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior, pp. 105–142.
Academic Press.
Mullainathan, S. and J. Spiess (2017). Machine learning: An applied econometric approach.
Journal of Economic Perspectives 31 (2), 87–106.
Nevo, A. (2001). A practitioner’s guide to estimation of Random-Coefficients logit models of
demand. Journal of Economics & Management Strategy 9 (4), 513–548.
Newey, W. K. and J. L. Powell (2003). Instrumental Variable Estimation of Nonparametric
Models. Econometrica 71, 1565–1578.
Newey, W. K., J. L. Powell, and F. Vella (1999). Nonparametric estimation of triangular simul-
taneous equations models. Econometrica 67 (3), 565–603.
Pavlidis, P. and P. B. Ellickson (2018). Implications of parent brand inertia for multiproduct
pricing. Quantitative Marketing and Economics 15 (4), 369–407.
Pearl, J. (2009). Causality: Models, reasoning and inference. Cambridge University Press.
Petrin, A. (2002). Quantifying the benefits of new products: The case of the minivan. Journal
of Political Economy 110 (4), 705–729.
Rivers, D. and Q. H. Vuong (1988). Limited information estimators and exogeneity tests for
simultaneous probit models. Journal of Econometrics 39 (3), 347–366.
Smith, R. and R. Blundell (1986). An exogeneity test for a simultaneous equation tobit model
with an application to labor supply. Econometrica 54 (3), 679–685.
Stone, R. (1954). Linear expenditure systems and demand analysis: An application to the
pattern of british demand. The Economic Journal 64 (255), 511–527.
Strikwerda, J. (2004). Finite Difference Schemes and Partial Differential Equations (Second
Edition ed.). Society for Industrial and Applied Mathematics.
Villas-Boas, S. (2007). Vertical relationships between manufacturers and retailers: Inference
with limited data. The Review of Economic Studies 74 (2), 625–652.
Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association 113 (523), 1228–1242.
Working, E. (1927). What do statistical “Demand curves” show? The Quarterly Journal of
Economics 41 (2), 212–235.
Appendix
A Monte Carlo simulation revisited
The BLP model (Berry et al., 1995), or the random coefficients multinomial choice model, has
been the workhorse model of demand analysis for decades. This section shows that our
approach is compatible with the standard random coefficients multinomial choice model.
A commonly used random coefficients multinomial choice model first explicitly specifies the
indirect utility from individual consumption, that is, individual $i$ choosing product $j$ in market
$t$ enjoys the indirect utility
\[ u_{ijt} = \alpha_i + \beta_i p_{jt} + \xi_j + \epsilon_{ijt}, \tag{10} \]
where $p_{jt}$ is the price of product $j$ in market $t$ and $\xi_j$ collects the other characteristics of
product $j$. It is usually assumed that the price $p_{jt}$ is product and market specific, while $\xi_j$ is only
product specific. However, the crucial difference between $p_{jt}$ and $\xi_j$ is that $\xi_j$ is observed by
individual consumers but not by econometricians. Since prices are often correlated with product
characteristics, the omitted product characteristics become the source of price endogeneity in
this standard model. In other settings, price endogeneity can also arise from
measurement error in prices, endogenous product choice sets, or sample selection.
In Equation (10), $\alpha_i$ and $\beta_i$ represent individual preferences, where $\alpha_i$ is the individual fixed
effect and $\beta_i$ is the individual sensitivity to product price. The difference between this model
and the multinomial choice model (McFadden, 1973) is that $\alpha_i$ and $\beta_i$ are assumed to be
random coefficients. This ingenious feature is introduced to avoid the independence of
irrelevant alternatives (IIA) property of multinomial choice models. In this way, the substitution patterns
between products can depend on demographic variables and thus the price elasticity is allowed
to be flexible. In our simulation, we explicitly assume
\[ \alpha_i = \alpha_1 \nu_{i1}, \qquad \beta_i = \beta_0 + \beta_1 \nu_{i1} + \beta_2 \nu_{i2}, \]
where $\nu_{i1}$ and $\nu_{i2}$ represent demographic variables. They are assumed to be independent and
to follow a standard normal distribution in this simulation. The other preference parameters, $\alpha_1$, $\beta_0$,
$\beta_1$, and $\beta_2$, are set to the pre-determined fixed values $(0.8, -3, 0.5, 0.5)$.
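The error term $\epsilon_{ijt}$ is often assumed to follow the i.i.d. Type I extreme value distribution. In this
case, the market share $s_{jt}$ of product $j$ in market $t$ has a closed form expression,
\[ s_{jt} = \int\!\!\int \frac{\exp(\beta_0 p_{jt} + \xi_j + \alpha_1 \nu_{i1} + \beta_1 \nu_{i1} p_{jt} + \beta_2 \nu_{i2} p_{jt})}{1 + \sum_{q=1}^{J} \exp(\beta_0 p_{qt} + \xi_q + \alpha_1 \nu_{i1} + \beta_1 \nu_{i1} p_{qt} + \beta_2 \nu_{i2} p_{qt})} \, d\Phi(\nu_{i1}) \, d\Phi(\nu_{i2}), \tag{11} \]
where $\Phi$ denotes the standard normal distribution function.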
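The double integral in Equation (11) can be approximated by averaging the logit choice
probabilities over simulated draws of the two demographic variables. The following is a minimal
sketch under that assumption, using the parameter values (0.8, −3, 0.5, 0.5) stated above; the
function name `simulated_shares` and the arguments `n_draws` and `seed` are illustrative and
not part of the original simulation code.

```python
import math
import random


def simulated_shares(prices, xi, n_draws=2000, seed=0,
                     alpha1=0.8, beta0=-3.0, beta1=0.5, beta2=0.5):
    """Approximate the shares in Equation (11) by averaging over draws of (nu1, nu2)."""
    rng = random.Random(seed)
    n_products = len(prices)
    shares = [0.0] * n_products
    for _ in range(n_draws):
        nu1, nu2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        beta_i = beta0 + beta1 * nu1 + beta2 * nu2
        # Utility index of each inside good for this simulated consumer.
        utils = [alpha1 * nu1 + beta_i * p + x for p, x in zip(prices, xi)]
        # Subtract the maximum (the outside good has utility 0) for numerical stability.
        u_max = max(0.0, max(utils))
        expu = [math.exp(u - u_max) for u in utils]
        denom = math.exp(-u_max) + sum(expu)
        for j in range(n_products):
            shares[j] += expu[j] / denom / n_draws
    return shares


shares = simulated_shares([0.5, 0.5, 0.75, 1.0], [5.0, 6.0, 7.0, 8.0])
outside_share = 1.0 - sum(shares)
print(shares, outside_share)
```

Fixing `seed` keeps the draws identical across price vectors, so that differences in simulated
shares reflect price changes rather than simulation noise.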
At a high level, this particular form of the market share function comes from the linear
specification of indirect utility and the i.i.d. Type I extreme value error. If they fail to hold, the
aggregate market share function turns into a general nonlinear function of product prices, observed
characteristics, and unobserved characteristics. If the unobserved product characteristics are the only
source of endogeneity, our model setup in the paper actually requires that the unobserved product
characteristics enter the demand function in an additively separable way, which may not be true for
Equation (11). However, as we illustrate at the end of this section, this additivity can be
approximately true at each evaluated point. There exist other modeling routes. For example,
Berry and Haile (2014) propose to impose an index restriction.
Back to the standard model, the unobserved product characteristics $\xi_j$ can be backed out to
deal with endogeneity. Berry (1994) has shown that, if there is an outside good with market
share $s_{0t}$, $\log s_{jt} - \log s_{0t}$ can be used to back out the unobserved product characteristics $\xi_j$. The
backed-out unobserved product characteristics can then be used to form moment conditions if
instrumental variables are available. As a result, the preference parameters in Equation (11) can be
recovered using the generalized method of moments. If it is further assumed that these parameters are
stable in and out of sample, their estimates can be used to conduct counterfactual analysis.
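To make the inversion concrete, the sketch below considers the special case without random
coefficients ($\alpha_i = 0$ and $\beta_i = \beta_0$), in which the shares are plain logit and
$\log s_{jt} - \log s_{0t} = \beta_0 p_{jt} + \xi_j$ holds exactly. This simplification, the function
names, and the treatment of $\beta_0$ as known are assumptions made for illustration only; with
random coefficients the inversion has to be carried out numerically.

```python
import math


def logit_shares(prices, xi, beta0=-3.0):
    """Plain logit shares with the outside good's utility normalised to zero."""
    expu = [math.exp(beta0 * p + x) for p, x in zip(prices, xi)]
    denom = 1.0 + sum(expu)
    return [e / denom for e in expu], 1.0 / denom


def invert_xi(shares, outside_share, prices, beta0=-3.0):
    """Recover xi_j from log(s_j) - log(s_0) = beta0 * p_j + xi_j."""
    return [math.log(s) - math.log(outside_share) - beta0 * p
            for s, p in zip(shares, prices)]


prices, xi_true = [0.5, 0.5, 0.75, 1.0], [5.0, 6.0, 7.0, 8.0]
shares, s0 = logit_shares(prices, xi_true)
xi_hat = invert_xi(shares, s0, prices)
print(xi_hat)
```

In applied work $\beta_0$ is unknown, so the log share difference serves as the dependent variable
of an instrumental variables regression on price rather than being inverted directly.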
In our case, we know the true data generating process since this is a Monte Carlo simulation.
We have assumed that
\[ p_t = (0.5, 0.5, 0.75, 1)^T + 0.5\,u_t + 0.5\,z_t, \qquad \xi_t = (5, 6, 7, 8)^T + u_t, \]
where $u_t$ is a common shock to the price $p_t$ and the unobserved product characteristics $\xi_t$, while
$z_t$ only affects $p_t$ and can serve as an instrument. $u_t$ and $z_t$ are independent random vectors, and each
of their elements follows an independent uniform distribution on $[-0.5, 0.5]$. In total, we assume
100,000 independent markets with 4 products. Figure 7 gives our estimation results.
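The design above can be simulated in a few lines. The sketch below draws prices, unobserved
characteristics, and instruments as described, and checks that prices are correlated with $\xi_t$
(the source of endogeneity) while the instruments are not. The helper names `simulate_markets`
and `correlation` are ours, and the smaller number of markets in the usage lines is chosen only
to keep the example fast.

```python
import random


def simulate_markets(n_markets=100_000, seed=0):
    """Draw (p_t, xi_t, z_t) for each market following the design in the text."""
    rng = random.Random(seed)
    p_base = [0.5, 0.5, 0.75, 1.0]
    xi_base = [5.0, 6.0, 7.0, 8.0]
    markets = []
    for _ in range(n_markets):
        u = [rng.uniform(-0.5, 0.5) for _ in p_base]  # common shock to p and xi
        z = [rng.uniform(-0.5, 0.5) for _ in p_base]  # instrument, affects p only
        p = [b + 0.5 * uj + 0.5 * zj for b, uj, zj in zip(p_base, u, z)]
        xi = [b + uj for b, uj in zip(xi_base, u)]
        markets.append((p, xi, z))
    return markets


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


markets = simulate_markets(n_markets=10_000)
p1 = [p[0] for p, xi, z in markets]
xi1 = [xi[0] for p, xi, z in markets]
z1 = [z[0] for p, xi, z in markets]
print(correlation(p1, xi1), correlation(z1, xi1))
```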
We have argued that Equation (11) cannot in general be expressed as an
additive model in the unobserved product characteristics $\xi$. However, if we take a Taylor expansion
at a fixed point $(x_t^*, p_t^*, \xi_t^*)$,
\[ s_{jt}(x_t, p_t, \xi_t) = s_{jt}(x_t^*, p_t^*, \xi_t^*) + A^T (x_t - x_t^*) + B^T (p_t - p_t^*) + C^T (\xi_t - \xi_t^*) + o(\cdot), \]
where
\[ A = \nabla_{x_t} s_{jt}(x_t^*, p_t^*, \xi_t^*), \quad B = \nabla_{p_t} s_{jt}(x_t^*, p_t^*, \xi_t^*), \quad C = \nabla_{\xi_t} s_{jt}(x_t^*, p_t^*, \xi_t^*), \]
it seems that $s_{jt}$ can be approximately additive in $\xi$ locally at each point.
Figure 7: Simulation with BLP as the true model
[Two panels, "Own elasticity" and "Cross elasticity", each showing the true elasticity, the estimated elasticity, and the 95% confidence interval over the range 0.2 to 0.8.]
Note: This figure plots the estimates of our method when the true model is the BLP model. The
sample size is 100,000. The estimation is conducted point by point with p1 varying from 0.2 to 0.8
and the other variables at their median levels.
B Bagged nearest neighbors
In this section, we provide a brief but rigorous introduction to the bagged nearest neighbors
method. We assume a standard nonparametric regression setup,
\[ y_i = g(x_i) + \epsilon_i, \]
where $y_i$ is a scalar and $x_i \in \mathbb{R}^d$ with $d$ fixed but potentially large. The function $g(x)$ can be
taken to be $E(y_i \mid x_i = x)$ and the disturbance $\epsilon_i$ is independent of $x_i$. Our primary goal here is to
estimate and conduct inference on $E(y_i \mid x_i = x_0)$ given $x_0$, which differs from the conventional
pursuit of the function $g$. We further denote $z_i = (x_i, y_i)$ and maintain the following two commonly made assumptions.
Assumption 5. We have an i.i.d. sample, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Assumption 6. The density $f(\cdot)$ of $x$ is bounded away from $0$ and $\infty$, $f(\cdot)$ and $g(\cdot)$ are both
twice continuously differentiable with bounded second derivatives in a neighborhood of $x$, and $y$
has a finite second moment, $E\,y^2 < \infty$. $\epsilon$ has zero mean and finite variance $\sigma^2 > 0$.
Definition
Before we proceed, we introduce some extra notation. Let $\{i_1, \ldots, i_m\}$ with $i_1 < i_2 < \cdots < i_m$ and
$m \le n$ be a subset of size $m$ of the index set $\{1, \ldots, n\}$. $\Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m})$ is defined to
be the 1-nearest neighbor estimator for $x = x_0$ in the subsample $\{z_{i_j}\}_{j=1}^{m}$, that is,
\[ \Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m}) = y_{(1)}(z_{i_1}, z_{i_2}, \ldots, z_{i_m}), \]
where $y_{(1)}(z_{i_1}, z_{i_2}, \ldots, z_{i_m})$ is the $y$ associated with the closest point among $(x_{i_1}, \ldots, x_{i_m})$ to the
fixed and given point $x_0$ in $\mathbb{R}^d$ in terms of the Euclidean distance.
The definition of the bagged nearest neighbors estimator is
\[ \tau_n(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_m \le n} \Phi(x_0; z_{i_1}, z_{i_2}, \ldots, z_{i_m}). \tag{12} \]
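There is an equivalent L-statistic representation,
\[ \tau_n(m)(x_0) = \binom{n}{m}^{-1} \left\{ \binom{n-1}{m-1} y_{(1)} + \binom{n-2}{m-1} y_{(2)} + \cdots + \binom{m-1}{m-1} y_{(n-m+1)} \right\}, \tag{13} \]
where $y_{(k)}$ denotes the $y$ associated with the $k$-th closest point to $x_0$ in the full sample.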
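The equivalence between Equations (12) and (13) can be checked numerically on a small sample.
The sketch below is illustrative only: the function names are ours, and the weights
$\binom{n-k}{m-1} / \binom{n}{m}$ are computed recursively (starting from $m/n$) rather than from
the binomial coefficients directly, which is an implementation choice to avoid very large
integers.

```python
import math
import random
from itertools import combinations


def bagged_nn(x0, xs, ys, m):
    """Bagged 1-NN estimate at x0 via the L-statistic form in Equation (13)."""
    n = len(xs)
    # Indices ordered by the Euclidean distance of their covariates to x0.
    order = sorted(range(n), key=lambda i: math.dist(xs[i], x0))
    weight = m / n  # equals comb(n - 1, m - 1) / comb(n, m)
    estimate = 0.0
    for k in range(1, n - m + 2):
        estimate += weight * ys[order[k - 1]]
        if k < n - m + 1:
            # Ratio comb(n - k - 1, m - 1) / comb(n - k, m - 1).
            weight *= (n - k - m + 1) / (n - k)
    return estimate


def bagged_nn_brute_force(x0, xs, ys, m):
    """Same estimate from the definition in Equation (12); feasible only for tiny n."""
    values = []
    for subset in combinations(range(len(xs)), m):
        nearest = min(subset, key=lambda i: math.dist(xs[i], x0))
        values.append(ys[nearest])
    return sum(values) / len(values)


rng = random.Random(0)
xs = [(rng.random(), rng.random()) for _ in range(10)]
ys = [x[0] + x[1] + rng.gauss(0.0, 0.1) for x in xs]
print(bagged_nn((0.5, 0.5), xs, ys, 4), bagged_nn_brute_force((0.5, 0.5), xs, ys, 4))
```

The L-statistic form only requires one sort of the sample, whereas the definition enumerates all
$\binom{n}{m}$ subsamples, which is why the former is the practical route for large $n$.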
Before introducing the properties of the bagged nearest neighbors estimator, let us briefly
review its development. As far as we know, the bagged nearest neighbors
estimator was first proposed in Biau et al. (2010) as a special case of bagging (Breiman, 1996),
obtained by averaging nearest neighbor estimators (Mack, 1981) over subsamples drawn without
replacement. In their paper, Biau et al. (2010) derive the universal consistency of the bagged
nearest neighbors. More recently, Fan et al. (2018) derive the same estimator independently from a
panel perspective. However, their focus is on the point-wise heterogeneous treatment effect and they
propose a two-scale approach to further remove the higher order bias term. This generalized jackknife
procedure, when combined with the bagging algorithm, can substantially improve the performance of
the bagged nearest neighbors. Fan et al. (2018) also prove the point-wise asymptotic normality
of the bagged nearest neighbors. In this paper, we enrich these results and prove that the
bootstrap (Efron and Stein, 1981) can be conveniently employed for inference on bagged nearest
neighbors. Our result is particularly useful when the parameter of interest is a smooth
function of bagged nearest neighbors estimates.
Theorem 1. (Biau and Devroye (2015)) Given $x_0 \in \mathrm{supp}(x)$, under Assumptions 5–6,
\[ E\,\tau_n(m)(x_0) = g(x_0) + B(m), \]
\[ B(m) = \Gamma(2/d + 1)\, \frac{f(x_0)\, \mathrm{tr}(g''(x_0)) + 2\, g'(x_0)^T f'(x_0)}{2\, d\, V_d^{2/d}\, f(x_0)^{1+2/d}}\, m^{-2/d} + o(m^{-2/d}), \tag{14} \]
where $V_d = \pi^{d/2} / \Gamma(1 + d/2)$, $\Gamma(\cdot)$ denotes the Gamma function, $f'(x_0)$ and $g'(x_0)$ are the first order
gradients at $x_0$ of $f(x)$ and $g(x)$, respectively, $g''(x_0)$ is the Hessian matrix of $g(\cdot)$ at $x_0$, and
$\mathrm{tr}(\cdot)$ gives the trace.
This result comes from Biau and Devroye (2015) for the k-nearest neighbors estimator. Equation
(14) gives the closed form of the asymptotically diminishing bias term. However, when the
dimensionality $d$ is relatively large, the bias term can diminish very slowly. This
is the curse of dimensionality. If we can effectively remove the first order bias term, then the
curse of dimensionality can be largely mitigated.
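To give a sense of how slowly the bias vanishes in high dimensions, the short sketch below
evaluates $V_d$ and the rate factor $m^{-2/d}$ from Equation (14) for a few values of $d$. The
function names and the choice $m = 1000$ are illustrative assumptions; the constant multiplying
$m^{-2/d}$ depends on $f$ and $g$ and is not computed here.

```python
import math


def unit_ball_volume(d):
    """V_d = pi^(d/2) / Gamma(1 + d/2), the volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(1 + d / 2)


def leading_bias_rate(m, d):
    """The factor m^(-2/d) governing how fast the bias in Equation (14) vanishes."""
    return m ** (-2.0 / d)


for d in (1, 2, 5, 10, 20):
    print(d, unit_ball_volume(d), leading_bias_rate(1000, d))
```

For a fixed subsample size, the rate factor moves toward one as $d$ grows, which is the sense in
which the first order bias becomes hard to remove by increasing $m$ alone.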
Generalized Jackknife
Consider two bagged nearest neighbors estimators with different subsampling scales $m_1$ and $m_2$
($m_1 \neq m_2$). By Theorem 1, their expectations have the following forms
\[ E\,\tau_n(m_1)(x_0) = g(x_0) + c\, m_1^{-2/d} + o(m_1^{-2/d}), \]
\[ E\,\tau_n(m_2)(x_0) = g(x_0) + c\, m_2^{-2/d} + o(m_2^{-2/d}). \]
We then solve the following system of linear equations
\[ w_1 + w_2 = 1, \tag{15} \]
\[ w_1 m_1^{-2/d} + w_2 m_2^{-2/d} = 0, \tag{16} \]
and obtain the solution $(w_1^*, w_2^*)$. Equation (15) ensures unbiasedness and Equation (16) the
removal of the first order bias. Then $w_1^* \tau_n(m_1)(x_0) + w_2^* \tau_n(m_2)(x_0)$ is free from the first order
bias term and thus the curse of dimensionality can be mitigated. In the simulations in Fan et al.
(2018), this bias reduction substantially improves estimation precision measured in mean squared
error. In this paper, we choose $m$ and $2m$ to reduce the first order bias. For inference,
Fan et al. (2018) have shown that the bagged nearest neighbors estimator is asymptotically normal
under mild conditions.
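The system in Equations (15) and (16) has a closed-form solution, which the sketch below
computes. The function name `jackknife_weights` is ours; the example uses the scales $m$ and
$2m$ mentioned in the text with an illustrative dimension $d = 2$.

```python
def jackknife_weights(m1, m2, d):
    """Solve w1 + w2 = 1 and w1 * m1^(-2/d) + w2 * m2^(-2/d) = 0 for (w1, w2)."""
    if m1 == m2:
        raise ValueError("the two subsampling scales must differ")
    a1, a2 = m1 ** (-2.0 / d), m2 ** (-2.0 / d)
    w1 = a2 / (a2 - a1)
    w2 = -a1 / (a2 - a1)
    return w1, w2


m, d = 10, 2
w1, w2 = jackknife_weights(m, 2 * m, d)
print(w1, w2)
```

With scales $m$ and $2m$ the weights depend only on $d$: the estimator at the larger scale
receives a weight above one and the estimator at the smaller scale a negative weight, so the
combination extrapolates away the $m^{-2/d}$ term.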
Theorem 2. (Fan et al. (2018)) Given $x_0 \in \mathrm{supp}(x)$, under Assumptions 5–6, and assuming
$m \to \infty$ and $m/n \to 0$, then for some positive $\sigma_n$ with $\sigma_n^2 = O(m/n)$,
\[ \frac{\tau_n(m)(x_0) - g(x_0) - B(m)}{\sigma_n} \xrightarrow{D} N(0, 1). \tag{17} \]
The expression for $\sigma_n^2$ is given in Lemma 3 of Fan et al. (2018). However, since bagged nearest
neighbors estimators with different scales are correlated, different jackknife choices in general
require different variance formulas. In practice, it is desirable to conduct inference regardless
of the jackknife procedure. Moreover, when the parameter of interest is a smooth and possibly
nonlinear function of several bagged nearest neighbor estimates, it is no longer necessarily an
L-statistic and may not even be asymptotically normal. In these cases, the use of variance formulas
becomes inconvenient. There is also a practical concern. Variance formulas often rely
on intermediate plug-in estimates, which can weaken the validity of
inference. Our solution to all these concerns is the bootstrap (Efron and Stein, 1981), and it is well
known that the bootstrap commutes with smooth functions (Bickel and Freedman, 1981). In
this way, we can conveniently conduct inference in many general situations. We prove that the
standard bootstrap procedure applies to the bagged nearest neighbors estimator. The result is
an implication of the following theorem.
Theorem 3. Let $G_n$ be the empirical distribution of our sample $(z_1, z_2, \ldots, z_n)$. Given
$(z_1, z_2, \ldots, z_n)$, let $(z_1^*, \ldots, z_n^*)$ be conditionally independent, with common distribution $G_n$. The
bagged nearest neighbors estimator defined on this sample is then
$$\tau_n^*(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \ldots < i_m \le n} \Phi(x_0; z_{i_1}^*, z_{i_2}^*, \ldots, z_{i_m}^*).$$
Given $x_0 \in \mathrm{supp}(v)$, and $\sigma_n$ in Equation (17), under Assumptions 1–2, and further assuming
$m \to \infty$, $m/n \to 0$, then for almost all sample sequences,
$$\frac{\tau_n^*(m)(x_0) - E^*\tau_n^*(m)(x_0)}{\sigma_n} \xrightarrow{D} N(0, 1), \tag{18}$$
where $E^*\tau_n^*(m)(x_0)$ is the mean from the bootstrap.
In other words, the theorem suggests that we can first generate many bootstrap samples of
size n by repeatedly resampling our dataset with replacement. From each bootstrap
sample, we then compute the bagged nearest neighbors estimator and obtain one
estimate. The distribution of these estimates across the bootstrap samples converges to
the distribution of the bagged nearest neighbors estimator. For the case with the jackknife, we
apply the same jackknife procedure to each bootstrap sample and obtain the bagged
nearest neighbors estimate after the jackknife.
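The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the paper: we assume the kernel Φ returns the response of the nearest neighbor of x0 within each subsample, and we use the fact that the average over all size-m subsamples then reduces to a weighted sum of the distance-ordered responses. All function names and the simulated data are hypothetical.

```python
import math
import random
import statistics


def bagged_1nn(xs, ys, x0, m):
    """Bagged 1-NN estimate at x0 with subsampling scale m (weighted-sum form)."""
    n = len(ys)
    # Indices ordered by the distance of their covariates to x0.
    order = sorted(range(n), key=lambda i: math.dist(xs[i], x0))
    total = math.comb(n, m)
    # The i-th nearest point (1-indexed) is the nearest neighbor in
    # comb(n - i, m - 1) of the comb(n, m) subsamples of size m.
    return sum(math.comb(n - i, m - 1) / total * ys[j]
               for i, j in enumerate(order, start=1))


def two_scale(xs, ys, x0, m, d):
    """Jackknifed estimate combining scales m and 2m."""
    a1, a2 = m ** (-2.0 / d), (2 * m) ** (-2.0 / d)
    w1, w2 = a2 / (a2 - a1), -a1 / (a2 - a1)
    return w1 * bagged_1nn(xs, ys, x0, m) + w2 * bagged_1nn(xs, ys, x0, 2 * m)


def bootstrap_estimates(xs, ys, x0, m, d, n_boot, rng):
    """Recompute the jackknifed estimate on n_boot resamples of size n."""
    n = len(ys)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        draws.append(two_scale([xs[i] for i in idx], [ys[i] for i in idx], x0, m, d))
    return draws


# Simulated data, for illustration only.
rng = random.Random(0)
n, d = 200, 2
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
ys = [x[0] ** 2 + x[1] + rng.gauss(0, 0.1) for x in xs]
x0 = (0.0, 0.0)

point = two_scale(xs, ys, x0, m=10, d=d)
draws = bootstrap_estimates(xs, ys, x0, m=10, d=d, n_boot=200, rng=rng)
se = statistics.stdev(draws)  # bootstrap standard error
```

The spread of `draws` around their mean stands in for the sampling distribution of the estimator. Using the bootstrap standard deviation `se` with a normal approximation is one possible choice here; a percentile interval computed from `draws` is another.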
B.1 Proof of Theorem 1 and Theorem 2
The proof of Theorem 1 can be found in Biau and Devroye (2015). Fan, Lv, and Wang (2018)
provide another proof of Theorem 1 that introduces spherical coordinates, which
simplifies the derivation of the higher-order bias terms. The proof of Theorem 2 can be found in
Fan, Lv, and Wang (2018).
B.2 Proof of Theorem 3
We will prove Theorem 3 from the perspective of U-statistics instead of L-statistics. In the
statistics literature, Bickel and Freedman (1981) and Arcones and Gine (1992) have derived the
properties of bootstrapped U-statistics under different settings. However, their results do
not yet cover the case in which the subsampling scale m is allowed to diverge with n. In addition,
our proof is simplified by the use of the Hoeffding decomposition (Hoeffding, 1948)
and the Mallows distance (Bickel and Freedman, 1981).
We first review some results that we will need. They can also be read as a sketch of the proof
of Theorem 2. From Lemma 3 and Theorem 2 in Fan et al. (2018), we have
• The bagged nearest neighbors estimator can be decomposed as
$$\tau_n(m) - E\,\tau_n(m) = \frac{m}{n}\sum_{i=1}^{n} g(z_i) + \Delta_n(m),$$
where $g(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i) - E\,\Phi(z_1, z_2, \ldots, z_n)$ is the canonical Hajek projection of
the kernel $\Phi$ onto $z_i$. The expectation $E$ is with respect to $G$, the distribution of $z$.
• For some finite positive variance $\sigma^2$,
$$\sigma_n^2 = \mathrm{var}\Big[\frac{m}{n}\sum_{i=1}^{n} g(z_i)\Big] = \frac{m^2}{n(2m-1)}\,\sigma^2.$$
• When $m/n \to 0$ and $n \to \infty$,
$$\Big(\frac{\Delta_n(m)}{\sigma_n}\Big)^2 \to 0. \tag{19}$$
• By the Lindeberg–Levy Central Limit Theorem, we have
$$\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i)}{\sigma_n} \xrightarrow{D} N(0, 1).$$
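As a quick check on orders of magnitude, the variance expression above can be unpacked as follows. This is a sketch that uses only the independence of the $z_i$ and the stated formula for $\sigma_n^2$; it is not taken from Fan et al. (2018). Since the $g(z_i)$ are i.i.d. with mean zero,
$$\sigma_n^2 = \frac{m^2}{n^2}\cdot n\,\mathrm{var}[g(z_1)] = \frac{m^2}{n}\,\mathrm{var}[g(z_1)],$$
so the stated formula corresponds to $\mathrm{var}[g(z_1)] = \sigma^2/(2m-1)$. As $m \to \infty$,
$$\sigma_n^2 = \frac{m}{n}\cdot\frac{m}{2m-1}\,\sigma^2 \approx \frac{m}{2n}\,\sigma^2, \qquad \frac{m}{n\sigma_n} \approx \frac{\sqrt{2}}{\sigma}\sqrt{\frac{m}{n}}.$$
Hence $\sigma_n^2$ is of order $m/n$, and the factor $m/(n\sigma_n)$ vanishes whenever $m/n \to 0$.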
Let $G_n$ be the empirical distribution of $(z_1, z_2, \ldots, z_n)$. Given $(z_1, z_2, \ldots, z_n)$, let $(z_1^*, \ldots, z_n^*)$
be conditionally independent, with common distribution $G_n$. The bagged nearest neighbors
estimator defined on $(z_1^*, \ldots, z_n^*)$ is then
$$\tau_n^*(m)(x_0) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < i_2 < \ldots < i_m \le n} \Phi(x_0; z_{i_1}^*, z_{i_2}^*, \ldots, z_{i_m}^*).$$
Similarly, we have
• For the new distribution $G_n$,
$$\tau_n^*(m) - E_n\,\tau_n^*(m) = \frac{m}{n}\sum_{i=1}^{n} g(z_i^*) + \Delta_n^*(m),$$
where the expectation $E_n$ is with respect to $G_n$.
• When $m/n \to 0$ and $n \to \infty$, for the $\sigma_n^2$ in (19),
$$\Big(\frac{\Delta_n^*(m)}{\sigma_n}\Big)^2 \to 0.$$
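With a slight abuse of notation, let $\Rightarrow$ denote weak convergence. It can be shown that
$$\mathcal{L}\Big(\frac{\tau_n^*(m) - E_n\,\tau_n^*(m)}{\sigma_n}\Big) \Rightarrow \mathcal{L}\Big(\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i^*, G_n)}{\sigma_n}\Big),$$
which follows from the convergence of the remainder term.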
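To establish Theorem 3, we still need to prove
$$\mathcal{L}\Big(\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i^*)}{\sigma_n}\Big) \Rightarrow \mathcal{L}\Big(\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i)}{\sigma_n}\Big).$$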
We will use the Mallows distance introduced in Bickel and Freedman (1981). Before we
proceed, we list the properties of the Mallows distance that we will use. Let $M_p$ be the Mallows
distance with $p \in [1, \infty)$, and assume that all distributions have finite p-th moments.
• If $F$ and $G$ are distributions on the real line, then
$$M_p(F, G) = \Big\{\int_0^1 |F^{-1}(t) - G^{-1}(t)|^p\,dt\Big\}^{1/p}.$$
• If $X_1, X_2, \ldots, X_n$ are independent observations from a distribution $F$, and $F_n$ is their
empirical distribution, then almost surely,
$$M_p(F_n, F) \to 0.$$
• For any scalar $a$,
$$M_p(aU, aV) = |a| \cdot M_p(U, V).$$
• If the $U_i$ are independent, likewise for the $V_i$, and $E\,U_i = E\,V_i$, then
$$M_2^2\Big(\sum_{i=1}^{n} U_i, \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n} M_2^2(U_i, V_i).$$
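To make the quantile-function formula concrete, the sketch below computes the Mallows distance between two empirical distributions with the same number of observations. In that case both quantile functions are step functions with jumps at 1/n, 2/n, ..., so the integral reduces to an average over matched order statistics. The function name `mallows` and the simulated samples are hypothetical.

```python
import random


def mallows(xs, ys, p=2):
    """Mallows distance M_p between two equal-size empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(xs)
    # Match the i-th smallest value of xs with the i-th smallest value of ys.
    gaps = (abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (sum(gaps) / n) ** (1.0 / p)


# Simulated samples, for illustration only.
rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(500)]
v = [rng.gauss(0.5, 1.0) for _ in range(500)]

# Numerical check of the scaling property M_p(aU, aV) = |a| * M_p(U, V).
a = -3.0
lhs = mallows([a * x for x in u], [a * y for y in v])
rhs = abs(a) * mallows(u, v)
```

Up to floating-point error, `lhs` and `rhs` agree, including for negative `a`, because multiplying both samples by the same scalar preserves the matching of order statistics.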
Now we are ready. Let $Z(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i)$; then we have
$$g(z_i) = E\,\Phi(z_1, z_2, \ldots, z_n \mid z_i) - E\,\Phi(z_1, z_2, \ldots, z_n) = Z(z_i) - E\,Z(z_i).$$
First, since $G_n$ is the empirical distribution of a sample from $G$,
$$M_2(z^*, z) \to 0.$$
Since $Z(\cdot)$ is continuous and bounded, we further have
$$M_2(\sqrt{m}\,g(z_i), \sqrt{m}\,g(z_i^*)) \to 0.$$
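By the convolution property of the Mallows distance,
$$M_2\Big(\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i^*)}{\sigma_n},\ \frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i)}{\sigma_n}\Big) \le \frac{m}{n\sigma_n}\sqrt{n}\,M_2(g(z_i), g(z_i^*)),$$
where $\sqrt{n}\,M_2(g(z_i), g(z_i^*)) = O(1)$ since $Z$ and $G$ are both continuous and bounded.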
When $m/n \to 0$,
$$M_2\Big(\frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i^*)}{\sigma_n},\ \frac{m}{n}\sum_{i=1}^{n}\frac{g(z_i)}{\sigma_n}\Big) \to 0,$$
which completes our proof of Theorem 3.