
Maximizing Welfare with Incentive-Aware Evaluation Mechanisms*

Nika Haghtalab1, Nicole Immorlica2, Brendan Lucier2, and Jack Z. Wang1

1 Cornell University, (nika,[email protected])
2 Microsoft Research, (nicimm,[email protected])

Abstract

Motivated by applications such as college admission and insurance rate determination, we propose an evaluation problem where the inputs are controlled by strategic individuals who can modify their features at a cost. A learner can only partially observe the features, and aims to classify individuals with respect to a quality score. The goal is to design an evaluation mechanism that maximizes the overall quality score, i.e., welfare, in the population, taking any strategic updating into account.

We further study the algorithmic aspect of finding the welfare maximizing evaluation mechanism under two specific settings in our model. When scores are linear and mechanisms use linear scoring rules on the observable features, we show that the optimal evaluation mechanism is an appropriate projection of the quality score. When mechanisms must use linear thresholds, we design a polynomial time algorithm with a (1/4)-approximation guarantee when the underlying feature distribution is sufficiently smooth and admits an oracle for finding dense regions. We extend our results to settings where the prior distribution is unknown and must be learned from samples.

*Published in IJCAI 2020.

1 Introduction

Automatic decision making is increasingly used to identify qualified individuals in areas such as education, hiring, and public health. This has inspired a line of work aimed at improving the performance and interpretability of classifiers for identifying qualification and excellence within a society given access to limited visible attributes of individuals. As these classifiers become widely deployed at a societal level, they can take on the additional role of defining excellence and qualification. That is, classifiers encourage people who seek to be “identified as qualified” to acquire attributes that are “deemed to be qualified” by the classifier. For example, a college admission policy that heavily relies on SAT scores will naturally encourage students to increase their SAT scores, which they might do by getting a better understanding of the core material, taking SAT prep lessons, cheating, etc. Progress in machine learning has not fully leveraged classifiers’ role in incentivizing people to change their feature attributes, and at times has even considered it an inconvenience to the designer who must now take steps to ensure that their classifier cannot be “gamed” [Meir et al., 2012, Chen et al., 2018, Hardt et al., 2016, Dekel et al., 2010, Cai et al., 2015]. One of the motivations for such strategy-proof classification is Goodhart’s law, which states “when a measure becomes a target, it ceases to be a good measure.” Taking Goodhart’s law to heart, one might view an individual’s attributes to be immutable, and any strategic response to a classifier (changes in one’s attributes) only serves to mask an agent’s true qualification and thereby degrade the usefulness of the classifier.

What this narrative does not address is that in many applications of classification, one’s qualifications can truly be improved by changing one’s attributes. For example, students who improve their understanding of core material truly become more qualified for college education. These changes have the potential to raise the overall level of qualification and excellence in a society and should be encouraged by a social planner. In this work, we focus on this powerful and under-utilized role of classification in machine learning and ask how to

design an evaluation mechanism on visible features that incentivizes individuals to improve a desired quality.

For instance, in college admissions, the planner might wish to maximize the “quality” of candidates. Quality is a function of many features, such as persistence, creativity, GPA, past achievements, only a few of which may be directly observable by the planner. Nevertheless the planner designs an admission test on visible features to identify qualified individuals. To pass the test, ideally candidates improve their features and truly increase their quality as a result. In another motivating example, consider a designer who wants to increase average driver safety, which can depend on many features detailing every aspect of a driver’s habits. The designer may only have access to a set of visible features such as age, driver training, or even driving statistics like acceleration/deceleration speed (as recorded by cars’ GPS devices). Then a scoring rule (that can affect a driver’s insurance premium) attempts to estimate and abstract a notion of “safety” from these features. Drivers naturally adjust their behavior to maximize their score. In both cases, the mechanism does not just pick out high quality candidates or safe drivers in the underlying population, but it actually causes their distributions to change.

Modeling and Contributions. Motivated by these observations, we introduce the following general problem. Agents are described by feature vectors in a high-dimensional feature space, and can change their innate features at a cost. There is a true function mapping feature vectors to a true quality score (binary or real-valued). The planner observes a low-dimensional projection of agents’ features and chooses an evaluation mechanism, from a fixed class of mechanisms which map this low-dimensional space to observed scores (binary or real-valued). Agents get value from having a high observed score (e.g., getting admitted to university or having a low car insurance premium), whereas the planner wishes to maximize welfare, i.e., the average true score of the population.

To further study this model, we analyze the algorithmic aspects of maximizing welfare in two specific instantiations. In Section 3, we show that when the true quality function is linear and the mechanism class is the set of all linear functions, the optimal mechanism is a projection of the true quality function onto the visible subspace. Our most interesting algorithmic results (Section 4) arise from the case when the true function is linear and the mechanism class is the set of all linear thresholds. In this case, a simple projection does not work: we need to consider the distribution of agents (projected on the visible feature space) when choosing the mechanism. For this case, we provide polynomial time approximation algorithms for finding the optimal linear threshold. In Section 5, we also provide sample complexity guarantees for learning the optimal mechanism from samples only.

Prior Work. Our work builds upon the strategic machine learning literature introduced by Hardt et al. [2016]. As in our work, agents are represented by feature vectors which can be manipulated at a cost. Hardt et al. [2016] design optimal learning algorithms in the presence of these costly strategic manipulations. Hu et al. [2019] and Milli et al. [2019] extend [Hardt et al., 2016] by assuming different groups of agents have different costs of manipulation and study the disparate impact on their outcomes. Dong et al. [2018] consider a setting in which the learner does not know the distribution of agents’ features or costs but must learn them through revealed preference. Importantly, in all these works, the manipulations do not change the underlying features of the agent and hence purely disadvantage the learning algorithm. Kleinberg and Raghavan [2019] introduce a different model in which manipulations do change the underlying features. Some changes are advantageous, and the designer chooses a rule that incentivizes these while discouraging disadvantageous changes. Their main result is that simple linear mechanisms suffice for a single known agent (i.e., known initial feature vector). In contrast, we study a population of agents with a known distribution of feature vectors and optimize over the class of linear, or linear threshold, mechanisms. Alon et al. [2020] also study extensions of Kleinberg and Raghavan [2019] to multiple agents. In that work, agents differ in how costly it is for them to manipulate their features but they all have the same starting feature representation; in our work, agents differ in their starting features while facing the same manipulation cost.

As noted by Kleinberg and Raghavan [2019], their model is closely related to the field of contract design in economics. The canonical principal-agent model (see, e.g., [Grossman and Hart, 1983, Ross, 1973]) involves a single principal and a single agent. There is a single-dimensional output, say the crop yield of a farm, and the principal wishes to incentivize the agent to expend costly effort to maximize output. However, the principal can only observe output, and the mapping of effort to output is noisy. Under a variety of conditions, the optimal contract in such settings pays the agent a function of output [Carroll, 2015, Dutting et al., 2019, Holmstrom and Milgrom, 1987], although the optimal contract in general settings can be quite complex [Mcafee and McMillan, 1986]. Our model differs from this canonical literature in that both effort and output are multi-dimensional. In this regard, the work of Holmstrom and Milgrom [1991] is closest to ours. They also study a setting with a multi-dimensional feature space in which the principal observes only a low-dimensional representation. Important differences include that they only study one type of agent whereas we allow agents to have different starting feature vectors, and they assume transferable utility whereas in our setting payments are implicit and do not reduce the utility of the principal.

2 Preliminaries

The Model. We denote the true features of an individual by x ∈ R^n, where the feature space R^n encodes all relevant features of a candidate, e.g., health history, biometrics, vaccination record, exercise and dietary habits. We denote by f : R^n → [0, 1] the mapping from one’s true features to their true quality. For example, a real-valued f(x) ∈ [0, 1] can express the overall quality of candidates and a binary-valued f(x) can determine whether x meets a certain qualification level.

These true features of an individual may not be visible to the designer. Instead there is an n × n projection matrix P of rank k, i.e., P² = P, such that for any x, Px represents the visible representation of the individual, such as vaccination record and health history, but not exercise and dietary habits. We define by Img(P) = {z ∈ R^n | z = Pz} the set of all feature representations that are visible to the designer. We denote by g : R^n → R a mechanism whose outcome for any individual x depends only on Px, i.e., the visible features of x.¹ E.g., g(x) ∈ R can express the payoff an individual receives from the mechanism, g(x) ∈ [0, 1] can express the probability that x is accepted by a randomized mechanism g, or a binary-valued g(x) ∈ {0, 1} can express whether x is accepted or rejected deterministically.

Let cost(x, x′) represent the cost that an individual incurs when changing their features from x to x′. We consider cost(x, x′) = c‖x − x′‖₂ for some c > 0. Given mechanism g, the total payoff x receives from changing its feature representation to x′ is U_g(x, x′) = g(x′) − cost(x, x′). Let δ_g : R^n → R^n denote the best response of an individual to g, i.e., δ_g(x) = argmax_{x′} U_g(x, x′). We consider a distribution D over feature vectors in R^n, representing individuals. Our goal is to design a mechanism g such that, when individuals from D best respond to it, it yields the highest quality individuals on average. That is, to find a mechanism g ∈ G that maximizes

    Val(g) = E_{x∼D}[f(δ_g(x))].    (1)

We often consider the gain in the quality of individuals compared to the average quality before deploying any mechanism, i.e., the baseline E[f(x)], defined by

    Gain(g) = Val(g) − E_{x∼D}[f(x)].    (2)

¹ Equivalently, g(x) = g_∥(Px) for an unrestricted g_∥ : R^n → R.

We denote the mechanism with highest gain by g_opt ∈ G. For example, f can indicate the overall health of an individual and G the set of governmental policies on how to set insurance premiums based on the observable features of individuals, e.g., setting lower premiums for those who received preventative care in the previous year. Then, g_opt corresponds to the policy that leads to the healthiest society on average. In Sections 3 and 4, we work with different design choices for f and G, and show how to find (approximately) optimal mechanisms.
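To make the objective concrete, the following minimal sketch (not part of the paper) estimates Val(g) and Gain(g) by Monte Carlo from samples of D; the helper names quality_f and best_response are hypothetical stand-ins for f and δ_g.

```python
import numpy as np

def estimate_gain(samples, quality_f, best_response):
    """Estimate Val(g) and Gain(g) from i.i.d. samples x ~ D.

    samples:       (m, n) array of feature vectors drawn from D
    quality_f:     callable returning the true quality f(x) in [0, 1]
    best_response: callable returning delta_g(x), the agent's utility-maximizing move
    """
    baseline = np.mean([quality_f(x) for x in samples])            # E[f(x)]
    val = np.mean([quality_f(best_response(x)) for x in samples])  # Val(g) = E[f(delta_g(x))]
    return val, val - baseline                                     # (Val(g), Gain(g))
```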

Types of Mechanisms. More specifically, we consider a known true quality function f that is a linear function w_f · x − b_f for some vector w_f ∈ R^n and b_f ∈ R. Without loss of generality we assume that the domain and cost multiplier c are scaled so that w_f is a unit vector.

In Section 3, we consider the class of linear mechanisms G_lin, i.e., any g ∈ G_lin is represented by g(x) = w_g · x − b_g, for a scalar b_g ∈ R and vector w_g ∈ Img(P) of length ‖w_g‖₂ ≤ R for some R ∈ R₊. In Section 4, we consider the class of linear threshold (halfspace) mechanisms G_0-1, where a g ∈ G_0-1 is represented by g(x) = sign(w_g · x − b_g) for some unit vector w_g ∈ Img(P) and scalar b_g ∈ R.

Other Notation. ‖·‖ is the L2 norm of a vector unless otherwise stated. For ℓ ≥ 0, the ℓ-margin density of a halfspace is

    Den^ℓ_D(w, b) = Pr_{x∼D}[w · x − b ∈ [−ℓ, 0]].

This is the total density D assigns to points that are at distance at most ℓ from being included in the positive side of the halfspace. The soft ℓ-margin density of this halfspace is defined as

    S-Den^ℓ_D(w, b) = E_{x∼D}[(b − w · x) 1(w · x − b ∈ [−ℓ, 0])].

We suppress ℓ and D when the context is clear.

We assume that individuals have features that are within a ball of radius r of the origin, i.e., D is only supported on X = {x | ‖x‖ ≤ r}. In Section 4, we work with a distribution D that is additionally σ-smooth. D is a σ-smoothed distribution if there is a corresponding distribution P over X such that to sample x′ ∼ D one first samples x ∼ P and then sets x′ = x + N(0, σ²I). Smoothing is a common assumption in theoretical computer science, where N(0, σ²I) models uncertainties in real-life measurements. To ensure that the noise in measurements is small compared to the radius of the domain r, we assume that σ ∈ O(r).
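As a concrete reference for these definitions, here is a minimal sketch (not from the paper) of the empirical ℓ-margin density and soft ℓ-margin density on a sample matrix X, assuming numpy and a unit vector w.

```python
import numpy as np

def margin_density(X, w, b, ell):
    """Empirical ell-margin density: fraction of rows x with w.x - b in [-ell, 0]."""
    s = X @ w - b
    return np.mean((s >= -ell) & (s <= 0))

def soft_margin_density(X, w, b, ell):
    """Empirical soft ell-margin density: average of (b - w.x) over rows in the margin."""
    s = X @ w - b
    in_margin = (s >= -ell) & (s <= 0)
    return np.mean(np.where(in_margin, -s, 0.0))
```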


3 Linear Mechanisms

In this section, we show how the optimal linear mechanism in G_lin is characterized by g_opt(x) = w_g · x for w_g that is (the largest vector) in the direction of Pw_f. This leads to an algorithm with O(n) runtime for finding the optimal mechanism. At a high level, this result shows that when the true quality function f is linear, the optimal linear mechanism is in the direction of the closest vector to w_f in the visible feature space. Indeed, this characterization extends to any true quality function that is a monotonic transformation of a linear function.

Theorem 1 (Linear Mechanisms). Let f(x) = h(w_f · x − b_f) for some monotonic function h : R → R. Let g(x) = (R(Pw_f)/‖Pw_f‖₂) · x. Then, g has the optimal gain.
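In code, the characterization of Theorem 1 amounts to projecting w_f and rescaling; a minimal sketch assuming numpy, with P given as an n × n projection matrix:

```python
import numpy as np

def optimal_linear_mechanism(P, w_f, R):
    """Theorem 1: the optimal linear mechanism points in the direction of P w_f,
    scaled to the maximum allowed length R; returns w_g with g(x) = w_g . x."""
    proj = P @ w_f                          # closest visible direction to w_f
    return R * proj / np.linalg.norm(proj)
```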

It is interesting to note that designing an optimal linear mechanism (Theorem 1) does not use any information about the distribution of instances in D, rather it directly projects w_f on the subspace of visible features. We see in Section 4 that this is not the case for linear threshold mechanisms, where the distribution of feature attributes plays a central role in choosing the optimal mechanism.

4 Linear Threshold Mechanisms

In this section, we consider the class of linear threshold mechanisms G_0-1 and explore the computational aspects of finding g_opt ∈ G_0-1. Note that any g(x) = sign(w_g · x − b_g) corresponds to the transformation of a linear mechanism from G_lin that only rewards those whose quality passes a threshold. As in the case of linear mechanisms, individuals only move in the direction of w_g and every unit of their movement improves the payoff of the mechanism by w_g · w_f. However, an individual’s payoff from the mechanism improves only if its movement pushes the feature representation from the 0-side of the linear threshold to the +1-side. Therefore only those individuals whose feature representations are close to the decision boundary move.

Lemma 1. For g(x) = sign(w_g · x − b_g) ∈ G_0-1,

    δ_g(x) = x − 1(w_g · x − b_g ∈ [−1/c, 0]) (w_g · x − b_g) w_g.
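A minimal sketch of this best response, assuming numpy and that w_g is a unit vector (names are illustrative, not from the paper):

```python
import numpy as np

def best_response_threshold(x, w_g, b_g, c):
    """Lemma 1: an agent moves onto the boundary only if it is within
    distance 1/c of the positive side, i.e., w_g . x - b_g in [-1/c, 0]."""
    s = np.dot(w_g, x) - b_g
    if -1.0 / c <= s <= 0:
        return x - s * w_g       # project onto the hyperplane w_g . x = b_g
    return x                      # otherwise staying put is optimal
```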

This leads to very different dynamics and challenges compared to the linear mechanism case. For example, w_g ∝ Pw_f no longer leads to a good mechanism, as it may only incentivize a small fraction of individuals, as shown in the following example.

Figure 1: Mechanism g′(x) = sign(w′_g · x − b_g) for any w′_g ∝ Pw_f is far from optimal.

Example 1. As in Figure 1, consider w_f = (1/√3, 1/√3, 1/√3) and P that projects a vector on its second and third coordinates. Consider a distribution D that consists of x = (0, 0, x₃) for x₃ ∼ Unif[−r, r]. Any g only incentivizes individuals who are at distance at most ℓ = 1/c from the decision boundary, highlighted by the shaded regions. Mechanism g(x) = sign(x₂ − ℓ) incentivizes everyone to move ℓ units in the direction of x₂ and leads to total utility of E[f(δ_g(x))] = ℓ/√3. But g′(x) = sign(w′_g · x − b′_g) for unit vector w′_g ∝ Pw_f only incentivizes a √2ℓ/(2r) fraction of the individuals, and on average each moves only (ℓ/2)·√(2/3) units in the direction of w_f. Therefore, Gain(g′) ≤ ℓ²/(2r√3) ≪ Gain(g) when the unit cost c ≫ 1/r.

Using the characterization of an individual’s best response to g from Lemma 1, for any true quality function f(x) = w_f · x − b_f and g(x) = sign(w_g · x − b_g), we have

    Gain(g) = (w_g · w_f) · S-Den^{1/c}(w_g, b_g).    (3)

While mechanisms with w_g ∝ Pw_f can be far from the optimal, one can still achieve a 1/(4rc)-approximation to the optimal mechanism by using such mechanisms and optimizing over the bias term b_g.

Theorem 2 (1/(4rc)-approximation). Consider the polynomial time algorithm that returns the best g from G = {sign((Pw_f/‖Pw_f‖) · x − b_g) | b_g = i/(2c), ∀i ∈ [⌈2rc⌉ + 1]}. Then,

    Gain(g) ≥ (1/(4rc)) Gain(g_opt).
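A minimal sketch of this search, assuming numpy, the empirical soft_margin_density helper sketched in Section 2, and a sample matrix X drawn from D; each candidate bias is scored with Equation (3):

```python
import numpy as np

def best_projected_threshold(X, P, w_f, c, r):
    """Theorem 2 sketch: fix the direction P w_f / ||P w_f|| and grid-search the
    bias b_g over multiples of 1/(2c), scoring candidates with Equation (3)."""
    w = P @ w_f
    w = w / np.linalg.norm(w)
    ell = 1.0 / c
    best, best_gain = None, -np.inf
    for i in range(int(np.ceil(2 * r * c)) + 2):
        b = i / (2.0 * c)
        gain = np.dot(w, w_f) * soft_margin_density(X, w, b, ell)   # Equation (3)
        if gain > best_gain:
            best, best_gain = (w, b), gain
    return best, best_gain
```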

This is a good approximation when the cost unit c is not too large compared to the radius of the domain r. However, in most cases the cost to change one’s features is much larger than a constant fraction of the radius. In this case, to find g_opt ∈ G_0-1 we need to simultaneously optimize the total density of instances that fall within the margin of the classifier, their average distance to the decision boundary, and the correlation between w_g and w_f.

One of the challenges involved here is finding a halfspace whose margin captures a large fraction of instances, i.e., has large margin density. Many variants of this problem have been studied in the past and are known to be hard. For example, the densest subspace, the densest halfspace, densest cube, and the densest ball are all known to be hard to approximate [Ben-David et al., 2002, Hardt and Moitra, 2013, Johnson and Preparata, 1978]. Yet, finding dense regions is a routine unsupervised learning task for which existing optimization tools are known to perform well in practice. Therefore, we assume that we have access to such an optimization tool, which we call a density optimization oracle.

Definition 1 (Density Optimization Oracle). Oracle O takes any distribution D, margin ℓ, and a set K ⊆ R^n, takes O(1) time, and returns

    O(D, ℓ, K) ∈ argmax_{w∈K, ‖w‖=1, b∈R} Den^ℓ_D(w, b).    (4)

Another challenge is that Gain(·) is a non-smooth function. As a result, there are distributions for which small changes to (w, b), e.g., to improve w · w_f, could result in a completely different gain. However, one of the properties of real-life distributions is that there is some amount of noise in the data, e.g., because measurements are not perfect. This is modeled by smoothed distributions, as described in Section 2. Smooth distributions provide an implicit regularization that smooths the expected loss function over the space of all solutions.

Using these two assumptions, i.e., access to a density optimization oracle and smoothness of the distribution, we show that there is a polynomial time algorithm that achieves a 1/4 approximation to the gain of g_opt. At a high level, smoothness of D allows us to limit our search to those w_g ∈ R^n for which w_f · w_g = v for a small set of discrete values v. For each v, we use the density oracle to search over all w_g such that w_f · w_g = v and return a candidate with high margin density. We show how a mechanism with high margin density will lead us to (a potentially different) mechanism with high soft-margin density and, as a result, a near optimal gain. This approach is illustrated in more detail in Algorithm 1.

Theorem 3 ((Main) 1/4-approximation). Consider a distribution over X and the corresponding σ-smoothed distribution D for some σ ∈ O(r). Then, for any small enough ε, Algorithm 1 runs in time poly(d, 1/ε), makes O(1/ε⁴ + r²/(ε²σ²)) oracle calls, and returns g ∈ G_0-1 such that

    Gain(g) ≥ (1/4) Gain(g_opt) − ε.

Algorithm 1: (1/4 − ε)-Approximation for G_0-1

Input: σ-smoothed distribution D with radius r before perturbation, vector w_f, projection matrix P, desired accuracy ε.
Output: a linear threshold mechanism g.
  Let ε′ = min{ε⁴, ε²σ⁴/r⁴} and C = ∅.
  for η = 0, ε′‖Pw_f‖, 2ε′‖Pw_f‖, . . . , ‖Pw_f‖ do
      Let K_η = Img(P) ∩ {w | w · w_f = η}.
      Use the density optimization oracle to compute (w_η, b_η) ← O(D, 1/c, K_η).
      Let C ← C ∪ {(w_η, b_η), (w_η, b_η + 1/(2c))}.
  end for
  return g(x) = sign(w · x − b), where (w, b) ← argmax_{(w,b)∈C} (w · w_f) S-Den^{1/c}_D(w, b).
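The following is a minimal sketch of Algorithm 1, assuming numpy, the empirical soft_margin_density helper from Section 2, and a user-supplied density_oracle(X, ell, eta) standing in for the oracle O of Definition 1 (it should return a unit vector w ∈ Img(P) with w · w_f = eta and a bias b approximately maximizing the empirical margin density):

```python
import numpy as np

def algorithm1(X, P, w_f, c, r, sigma, eps, density_oracle):
    """Sketch of Algorithm 1: enumerate grid values eta of w . w_f, query the
    density oracle for each, add the shifted-bias variant, and return the
    candidate maximizing (w . w_f) * soft margin density."""
    ell = 1.0 / c
    eps_p = min(eps**4, eps**2 * sigma**4 / r**4)
    norm_pwf = np.linalg.norm(P @ w_f)
    candidates = []
    eta = 0.0
    while eta <= norm_pwf:
        w, b = density_oracle(X, ell, eta)
        candidates += [(w, b), (w, b + 1.0 / (2 * c))]
        eta += eps_p * norm_pwf
    score = lambda wb: np.dot(wb[0], w_f) * soft_margin_density(X, wb[0], wb[1], ell)
    return max(candidates, key=score)
```

The loop size matches the ε′ discretization in the pseudocode, so this sketch is only practical for moderate accuracy parameters.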

Proof Sketch of Theorem 3. Recall from Equation 3 that Gain(g_opt) is the product of two values, (w_{g_opt} · w_f) and S-Den(w_{g_opt}, b_{g_opt}). The first lemma shows that we can do a grid search over the value of (w_{g_opt} · w_f). In other words, there is a predefined grid on the values of w · w_f for which there is a w with w · w_f ≈ w_{g_opt} · w_f. This is demonstrated in Figure 2.

Lemma 2 (Discretization). For any two unit vectors w₁, w₂ ∈ Img(P) and any ε such that w₁ · w₂ ≤ 1 − 2ε, there is a unit vector w ∈ Img(P) such that w · w₂ = w₁ · w₂ + ε and w · w₁ ≥ 1 − ε.

The second technical lemma shows that approximating w_g by a close vector w, and b_g by a close scalar b, only has a small effect on its soft margin density. That is, the soft margin density is Lipschitz. For this claim to hold, it is essential for the distribution to be smooth. The key property of a σ-smoothed distribution D, corresponding to the original distribution P, is that Den^ℓ_D(w_g, b_g) includes instances x ∼ P that are not in the margin of (w_g, b_g). As shown in Figure 2, these instances also contribute to the soft margin density of any other w and b, as long as the distances of x to the halfplanes w · x = b and w_g · x = b_g are comparable. So, it is sufficient to show that the distance of any x to two halfplanes with a small angle is approximately the same. Here, the angle between two unit vectors is defined as θ(w, w′) = arccos(w · w′). Lemma 3 leverages this fact to prove that soft margin density is Lipschitz smooth.

Figure 2: The value w_g · w_f can be increased by any small amount ε to match a specific value, when w_f is shifted by an angle ≤ ε to vector w. When a distribution is smooth, the margin densities of mechanisms (w, b) and (w_g, b_g) are close when θ(w, w_g) and |b − b_g| are small. The lightly shaded area shows that instances outside of the margin contribute to (soft-)margin density.

Lemma 3 (Soft Margin Lipschitzness). For any distribution over X and its corresponding σ-smoothed distribution D, for ε ≤ 1/(3R²), any ν < 1, R = 2r + σ√(2 ln(2/ν)), and any ℓ ≤ O(R), if θ(w₁, w₂) ≤ εσ² and |b₁ − b₂| ≤ εσ, we have

    |S-Den^ℓ_D(w₁, b₁) − S-Den^ℓ_D(w₂, b₂)| ≤ O(ν + εR²).

The final step to bring these results together is to show that we can find a mechanism (among those on the grid) that has a large soft margin density. To do this, we show that the mechanism (with potentially small changes to it) that has the highest margin density has at least 1/4 of the soft margin density of the optimal mechanism. The proof of this lemma analyzes whether more than half of the instances within an ℓ-margin of a mechanism are at distance at most ℓ/2 from the edge of the margin, in which case the soft margin density of the mechanism is at least 1/4 of its margin density. Otherwise, shifting the bias of the mechanism by ℓ/2 results in a mechanism whose soft margin density is at least 1/4 of the original margin density.

Lemma 4 (Margin Density Approximates Soft-Margin Density). For any distribution D over R^n, a class of unit vectors K ⊆ R^n, margin ℓ > 0, and a set of values N, let (w_η, b_η) = O(D, ℓ, K ∩ {w · w_f = η}) for η ∈ N, and let (w, b) be the mechanism with maximum η Den^ℓ_D(w_η, b_η) among these options. Then,

    max{Gain(w, b), Gain(w, b + ℓ/2)} ≥ (1/4) max_{w∈K, w·w_f∈N, b∈R} Gain(w, b).

Putting these lemmas together, one can show that Algorithm 1 finds a 1/4 approximation.

5 Learning Optimal Mechanisms From Samples

Up to this point, we have assumed that distribution D is known to the mechanism designer and all computations, such as measuring margin density and soft margin density, can be directly performed on D. However, in most applications the underlying distribution over feature attributes is unknown or the designer only has access to a small set of historic data. In this section, we show that all computations that are performed in this paper can be done instead on a small set of samples that are drawn i.i.d. from distribution D. Note that (since no mechanism is yet deployed at this point) these samples represent the initial non-strategic feature attributes.

We first note that the characterization of the optimal linear mechanism (Theorem 1) is independent of distribution D; that is, even with no samples from D we can still design the optimal linear mechanism. On the other hand, the optimal linear threshold mechanism (Theorem 3) heavily depends on the distribution of instances, e.g., through the margin and soft-margin density of D. To address the sample complexity of learning a linear threshold mechanism, we use the concept of pseudo-dimension. This is the analog of VC-dimension for real-valued functions.

Definition 2 (Pseudo-dimension). Consider a set of real-valued functions F on the instance space X. A set of instances x⁽¹⁾, . . . , x⁽ⁿ⁾ ∈ X is shattered by F if there are values v⁽¹⁾, . . . , v⁽ⁿ⁾ such that for any subset T ⊆ [n] there is a function f_T ∈ F for which f_T(x⁽ⁱ⁾) ≥ v⁽ⁱ⁾ if and only if i ∈ T. The size of the largest set that is shattered by F is the pseudo-dimension of F.

It is well-known that the pseudo-dimension of a function class closely (via upper and lower bounds) controls the number of samples that are needed for learning a good function. This is characterized by the uniform convergence property shown below.

Theorem 4 (Uniform Convergence, Pollard [2011]). Consider a class of functions F such that f : X → [0, H] with pseudo-dimension d. For any distribution D and a set S of m = ε⁻²H²(d + ln(1/δ)) i.i.d. randomly drawn samples from D, with probability 1 − δ, for all f ∈ F,

    | (1/m) Σ_{x∈S} f(x) − E_{x∼D}[f(x)] | ≤ ε.

Figure 3: Gain_x(g) is the minimum of h¹_g(x) and h²_g(x).

As is commonly used in machine learning, uniform convergence implies that choosing any mechanism based on its performance on the sample set leads to a mechanism whose expected performance is within 2ε of the optimal mechanism for the underlying distribution.
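A minimal sketch of this sample-based selection, assuming numpy; the per-sample gain expression below is the one used in the proof of Lemma 5, and candidates is any finite set of (w, b) pairs (for example, the set C built by Algorithm 1):

```python
import numpy as np

def empirical_gain(X, w, b, w_f, c):
    """Empirical gain of g(x) = sign(w.x - b) on samples X:
    average of 1{w.x - b in [-1/c, 0]} * (b - w.x) * (w_f . w)."""
    s = X @ w - b
    in_margin = (s >= -1.0 / c) & (s <= 0)
    return np.dot(w, w_f) * np.mean(np.where(in_margin, -s, 0.0))

def select_from_samples(X, candidates, w_f, c):
    """Return the candidate with the largest empirical gain; by uniform convergence
    its expected gain is close to that of the best candidate in the set."""
    return max(candidates, key=lambda wb: empirical_gain(X, wb[0], wb[1], w_f, c))
```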

Lemma 5 (Pseudo-dimension). For each g = sign(w_g · x − b_g) and x, let Gain_x(g) be the gain of g just on instance x and let Den^ℓ_x(g) = 1(w_g · x − b_g ∈ [−ℓ, 0]). The class of real-valued functions {Gain_x(g)}_{g∈G_0-1} has a pseudo-dimension that is at most O(rank(P)). Moreover, the class of functions {x → Den^ℓ_x(g)}_{g∈G_0-1} has a pseudo-dimension (equivalently, VC dimension) of O(rank(P)).

Proof. For g ∈ G_0-1, note that

    Gain_x(g) = 1(w_g · x − b_g ∈ [−1/c, 0]) (b_g − w_g · x)(w_f · w_g).

Note that we can write Gain_x(g) = min{h¹_g(x), h²_g(x)} for h¹_g(x) := 1(w · x − b ≤ 0)(b − w · x)(w · w_f) and h²_g(x) := 1(w · x − b ≥ −ℓ)(w · w_f). As shown in Figure 3, h¹_g and h²_g are both monotone functions of w · x − b.

Pollard [2011] shows that the set of all linear functions in a rank-k subspace has pseudo-dimension k. Pollard [2011] also shows that the set of monotone transformations of these functions has pseudo-dimension O(k). Therefore, the sets of functions {h¹_g(x)}_{g∈G_0-1} and {h²_g(x)}_{g∈G_0-1} each have pseudo-dimension O(rank(P)). It is well-known that the set of all minimums of two functions from two classes, each with pseudo-dimension d, has pseudo-dimension O(d). Therefore, {x → Gain_x(g)}_{g∈G_0-1} has pseudo-dimension at most O(rank(P)). A similar construction shows that {x → Den^ℓ_x(g)}_{g∈G_0-1} has VC dimension at most O(rank(P)).

We give a bound on the number of samples sufficient for finding a near optimal mechanism in G_0-1.

Theorem 5 (Sample Complexity). For any small enough ε and δ, there is

    m ∈ O((1/(c²ε²))(rank(P) + ln(1/δ)))

such that for S ∼ D^m i.i.d. samples, with probability 1 − δ, the g ∈ G_0-1 that optimizes the empirical gain on S has Gain(g) ≥ Gain(g_opt) − ε. Furthermore, with probability 1 − δ, when Algorithm 1 is run on S the outcome g has Gain(g) ≥ (1/4) Gain(g_opt) − ε.

6 Discussion

In this work, we focused on increasing the expected quality of the population, i.e., welfare, through classification. There are, however, other reasonable objectives that one may want to consider. In some cases, our goal may be to design evaluation mechanisms that lead to the highest quality accepted individuals, not accounting for those who are rejected by the mechanism. It would be interesting to study these objectives under our model as well. An example of this is to consider, for a set of boolean valued functions G,

    max_{g∈G} E_{x∼D}[f(δ_g(x)) | g(x) = 1].

Throughout this work, we focused on the L2 cost function (i.e., cost(x, x′) = c‖x − x′‖₂). This specification models scenarios in which an individual can improve multiple features most efficiently by combining effort rather than working on each feature individually (L1 norm). For example, simultaneously improving the writing, vocabulary, and critical analysis of a student, say by reading novels, is more effective than spending effort to improve vocabulary by, say, memorizing vocab lists, and then improving the other attributes. It would be interesting to analyze the algorithmic aspects of alternative cost functions, such as L1 cost or even non-metric costs (e.g., ones with learning curves whereby the first 10% improvement is cheaper than the next 10%), and different costs for different types of individuals.

Finally, we have assumed we know the true mapping of features to qualities (i.e., f). In many settings, one might not know this mapping, or even the full set of features. Instead, the designer only observes the quality of individuals after they respond to their incentives (i.e., f(δ_g(x))) and the projection of their new feature set (i.e., Pδ_g(x)). Existing works on Stackelberg games and strategic classification have considered the use of learning tools for designing optimal mechanisms without knowledge of the underlying function [Blum et al., 2014, Haghtalab et al., 2016, Miller et al., 2020]. It would be interesting to characterize how the nature of observations and interventions available in our setting specifically affects this learning process.

References

Tal Alon, Magdalen Dobson, Ariel D. Procaccia, Inbal Talgam-Cohen, and Jamie Tucker-Foltz. Multiagent evaluation mechanisms. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.

Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. The computational complexity of densest region detection. Journal of Computer and System Sciences, 64(1):22–47, 2002.

Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 1826–1834, 2014.

Yang Cai, Constantinos Daskalakis, and Christos Papadimitriou. Optimum statistical estimation with strategic data sources. In Proceedings of the 28th Conference on Computational Learning Theory (COLT), pages 280–296, 2015.

Gabriel Carroll. Robustness and linear contracts. American Economic Review, 105(2):536–563, 2015.

Yiling Chen, Chara Podimata, Ariel D. Procaccia, and Nisarg Shah. Strategyproof linear regression in high dimensions. In Proceedings of the 19th ACM Conference on Economics and Computation (EC), pages 9–26, 2018.

Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, 2010.

Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 19th ACM Conference on Economics and Computation (EC), pages 55–70, 2018.

Paul Dutting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pages 369–387, 2019.

Sanford J. Grossman and Oliver D. Hart. An analysis of the principal-agent problem. Econometrica, 51(1):7–45, 1983.

Nika Haghtalab, Fei Fang, Thanh Hong Nguyen, Arunesh Sinha, Ariel D. Procaccia, and Milind Tambe. Three strategies to success: Learning adversary models in security games. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 308–314, 2016.

Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In Conference on Computational Learning Theory (COLT), pages 354–375, 2013.

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 7th ACM Conference on Innovations in Theoretical Computer Science (ITCS), pages 111–122, 2016.

Bengt Holmstrom and Paul Milgrom. Aggregation and linearity in the provision of intertemporal incentives. Econometrica, 55(2):303–328, 1987.

Bengt Holmstrom and Paul Milgrom. Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7:24–52, 1991.

Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency (FAT*), pages 259–268, 2019.

David S. Johnson and Franco P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.

Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically? In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pages 825–844, 2019.

R. Preston McAfee and John McMillan. Bidding for contracts: A principal-agent analysis. The RAND Journal of Economics, 17(3):326–338, 1986.

Reshef Meir, Ariel D. Procaccia, and Jeffrey S. Rosenschein. Algorithms for strategyproof classification. Artificial Intelligence, 186:123–156, 2012.

John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. arXiv preprint arXiv:1910.10362, 2020.

Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239, 2019.

D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics, 2011.

Stephen A. Ross. The economic theory of agency: The principal’s problem. American Economic Review, 63(2):134–139, 1973.

A Useful Properties of Smoothed Distributions

Lemma 6. For any σ-smoothed distribution D, any unit vector w, and any range [a, b],

    Pr_{x∼D}[w · x ∈ [a, b]] ≤ |b − a| / (σ√(2π)).

Proof. First note that for any Gaussian distribution with variance σ²I over x, the distribution of w · x is a 1-dimensional Gaussian with variance σ². Since any σ-smoothed distribution is a mixture of such Gaussians, the distribution of w · x is also a mixture of 1-dimensional Gaussians. It is not hard to see that the probability mass a one-dimensional Gaussian assigns to the range [a, b] is maximized when it is centered at (a + b)/2. Therefore,

    Pr_{x∼D}[w · x ∈ [a, b]] ≤ Pr_{x∼N(0,σ²)}[(a − b)/2 ≤ x ≤ (b − a)/2]
                            ≤ (2/(σ√(2π))) ∫₀^{(b−a)/2} exp(−z²/(2σ²)) dz
                            ≤ (b − a)/(σ√(2π)).

Lemma 7. For any distribution over {x | ‖x‖ ≤ r} and its corresponding σ-smoothed distribution D, and any H ≥ 2r + σ√(2 ln(1/ε)), we have

    Pr_{x∼D}[‖x‖ ≥ H] ≤ ε.

Proof. Consider the Gaussian distribution N(c, σ²I) for some ‖c‖ ≤ r. Then for any ‖x‖ ≥ H, it must be that the distance of x to c is at least H − 2r. Therefore,

    Pr_{x∼D}[‖x‖ ≥ H] ≤ Pr_{x∼N(0,σ²)}[x ≥ H − 2r] ≤ exp(−(H − 2r)²/(2σ²)) ≤ ε.

Since D is a mixture of such Gaussians, the claim follows.

Lemma 8. Consider any distribution over {x | ‖x‖ ≤ r} and let p(·) denote the density of the corresponding σ-smoothed distribution. Let x and x′ be such that ‖x − x′‖ ≤ D. Then,

    p(x)/p(x′) ≤ exp(D(2r + ‖x‖ + ‖x′‖)/(2σ²)).

Proof. Consider the distribution N(c, σ²I) for some ‖c‖ ≤ r. Without loss of generality assume that ‖c − x‖ ≤ ‖c − x′‖, else p(x)/p(x′) ≤ 1. We have

    p(x)/p(x′) = exp(−‖c − x‖²/(2σ²)) / exp(−‖c − x′‖²/(2σ²))
               = exp((‖c − x′‖² − ‖c − x‖²)/(2σ²))
               = exp((‖c − x′‖ − ‖c − x‖)(‖c − x′‖ + ‖c − x‖)/(2σ²))
               ≤ exp(D(2r + ‖x‖ + ‖x′‖)/(2σ²)).

Since D is a σ-smoothed distribution, it is a mixture of Gaussians with variance σ²I centered within the L2 ball of radius r. Summing over the densities of all these Gaussians proves the claim.


B Proofs from Section 3

Lemma 9. For any linear g(x) = w_g · x − b_g and cost function cost(x, x′) = c‖x − x′‖₂,

    δ_g(x) = x + αw_g,

where α = 0 if c > ‖w_g‖ and α = ∞ if c ≤ ‖w_g‖.

Proof. Considering the decomposition of a vector onto w_g and its orthogonal space, note that any x′ can be represented by x′ − x = αw_g + z, where z · w_g = 0 and α ∈ R. Then, the payoff that a player with features x gets from playing x′ is

    U_g(x, x′) = w_g · x′ − b_g − c‖x − x′‖
               = w_g · x + α‖w_g‖² − b_g − c‖αw_g + z‖
               = w_g · x + α‖w_g‖² − b_g − c√(α²‖w_g‖² + ‖z‖²),

where the last transition is due to the fact that z is orthogonal to w_g.

It is clear from the above equation that the player’s utility is maximized only if z = 0 and α ≥ 0, that is, δ_g(x) = x + αw_g for some α ≥ 0. Therefore,

    δ_g(x) = argmax_{x′} (w_g · x′ − b_g − c‖x − x′‖) = x + (argmax_{α≥0} α‖w_g‖(‖w_g‖ − c)) w_g.

Note that α‖w_g‖(‖w_g‖ − c) is maximized at α = 0 if c > ‖w_g‖ and is maximized at α = ∞ if c ≤ ‖w_g‖.

B.1 Proof of Theorem 1

For ease of exposition we refer to the parameters of g_opt by w∗ and b∗. Let w_g = R(Pw_f)/‖Pw_f‖₂ and g(x) = w_g · x. By the choice of w_g and the definition of G, we have ‖w_g‖ = R ≥ ‖w∗‖ and w_g ∈ Img(P).

By Lemma 9 and the fact that ‖w_g‖ ≥ ‖w∗‖, there are α_g, α_{g_opt} ∈ {0, ∞} such that α_g ≥ α_{g_opt}, δ_{g_opt}(x) = x + α_{g_opt} w∗, and δ_g(x) = x + α_g w_g. Furthermore, we have

    w_f · w∗ ≤ ‖Pw_f‖‖w∗‖ ≤ ‖Pw_f‖R = w_f · R(Pw_f)/‖Pw_f‖ = w_f · w_g,

where the first transition is by Cauchy-Schwarz (using w∗ ∈ Img(P)), the second transition is by the definition of G, and the last two transitions are by the definition of w_g. Using the monotonicity of h and the facts that α_g ≥ α_{g_opt} and w_f · w_g ≥ w_f · w∗, we have

    E_{x∼D}[f(δ_{g_opt}(x))] = E_{x∼D}[h(w_f · x + α_{g_opt} w_f · w∗ − b_f)]
                            ≤ E_{x∼D}[h(w_f · x + α_g w_f · w_g − b_f)]
                            = E_{x∼D}[f(δ_g(x))].

This proves the claim.

C Proofs from Section 4

C.1 Proof of Lemma 1

By definition, the closest x′ to x such that g(x′) = 1 is the projection of x onto the decision boundary of g, given by x′ = x − (w · x − b)w. By definition,

    cost(x, x′) = c‖x − x′‖₂ = c|w · x − b|.

The claim follows by noting that a player with features x would report δ_g(x) = x′ only if cost(x, x′) ≤ 1; otherwise δ_g(x) = x.


C.2 Proof of Lemma 2

The proof idea is quite simple: only a small move in angle is required to achieve an ε change in the projection onto w₂. Here, we prove this rigorously. Let w = αw₁ + βw₂ for α and β that we will describe shortly. Note that Img(P) includes any linear combination of w₁ and w₂; therefore, w ∈ Img(P). Let ω = w₁ · w₂,

    α = √((1 − (ω + ε)²)/(1 − ω²)),

and

    β = √(1 − α² + α²ω²) − αω.

By the definition of α and β, ‖w‖ = √(α² + β² + 2αβω) = 1. That is, w is indeed a unit vector. Next, we show that w · w₂ = w₁ · w₂ + ε. We have

    w · w₂ = αω + β = √( ω²(1 − (ε + ω)²)/(1 − ω²) − (1 − (ε + ω)²)/(1 − ω²) + 1 ) = ω + ε = w₁ · w₂ + ε.

It remains to show that w₁ · w ≥ 1 − ε. By the definition of w, we have

    w · w₁ = α + βω = (1 − ω²) √(((ε + ω)² − 1)/(ω² − 1)) + ω(ε + ω).

One can show that the right-hand side is monotonically decreasing in ω. Since ω ≤ 1 − 2ε, we have that

    w · w₁ ≥ (1 − (1 − 2ε)²) √((1 − (1 − ε)²)/(1 − (1 − 2ε)²)) + (1 − 2ε)(1 − ε)
           = (1 − ε)(2ε(√((2 − ε)/(1 − ε)) − 1) + 1)
           ≥ 1 − ε.

This completes the proof.

C.3 Proof of Lemma 3

Without loss of generality, let the angle between w₁ and w₂ be exactly β = εσ². Also without loss of generality assume b₁ > b₂ and let w₁ = (1, 0, . . . , 0) and w₂ = (cos(β), sin(β), 0, . . . , 0). Consider the following rotation matrix

    M = [  cos(β)   sin(β)   0   ⋯   0
          −sin(β)   cos(β)   0   ⋯   0
             0        0      1   ⋯   0
             ⋮        ⋮           ⋱   ⋮
             0        0      0   ⋯   1 ]

that rotates every vector by angle β in the plane of w₁, w₂. Note that x → Mx is a bijection. Furthermore, for any x, w₁ · x = w₂ · (Mx) and ‖x‖ = ‖Mx‖.

We use the mapping x → Mx to upper bound the difference between the soft-margin densities of (w₁, b₁) and (w₂, b₂). At a high level, we show that the total contribution of x to the soft-margin density of (w₁, b₁) is close to the total contribution of Mx to the soft-margin density of (w₂, b₂).

For this to work, we first show that almost all instances x in the ℓ-margin of (w₁, b₁) translate to instances Mx in the ℓ-margin of (w₂, b₂), and vice versa. That is,

    Pr_{x∼D}[w₁ · x − b₁ ∈ [−ℓ, 0] but w₂ · Mx − b₂ ∉ [−ℓ, 0], ‖x‖ ≤ R]
        ≤ Pr_{x∼D}[w₁ · x − b₁ ∈ [−σε, 0]] ≤ ε/√(2π),    (5)

where the first transition is due to the fact that w₁ · x = w₂ · Mx, so only instances for which w₁ · x − b₁ ∈ [−σε, 0] contribute to the above event. The last transition is by Lemma 6. Similarly,

    Pr_{x∼D}[w₁ · x − b₁ ∉ [−ℓ, 0] but w₂ · Mx − b₂ ∈ [−ℓ, 0], ‖x‖ ≤ R] ≤ ε/√(2π).    (6)

Next, we show that the contribution of x to the soft margin of (w₁, b₁) is close to the contribution of Mx to the soft margin of (w₂, b₂):

    |(b₁ − w₁ · x) − (b₂ − w₂ · Mx)| ≤ |b₁ − b₂| ≤ σε.    (7)

Moreover, x is also close to Mx, since M rotates any vector by angle β only. That is, for all x such that ‖x‖ ≤ R, we have

    ‖x − Mx‖ ≤ ‖x‖ ‖I − M‖_F ≤ 2√2 sin(β/2) R ≤ √2 βR.

Now consider the density of distribution D, denoted by p(·). By Lemma 8 and the fact that ‖x‖ = ‖Mx‖, we have that for any ‖x‖ ≤ R,

    p(x)/p(Mx) ≤ exp( √2 βR (2r + ‖x‖ + ‖Mx‖) / (2σ²) ) ≤ exp(3βR²/σ²) < 1 + 6βR²/σ² = 1 + 6εR²,    (8)

where the penultimate transition is by the fact that exp(x) ≤ 1 + 2x for x < 1 and 3βR²/σ² ≤ 1 for ε ≤ 1/(3R²). Lastly, by Lemma 7, all but a ν fraction of the points in D are within distance R of the origin. Putting these together, for Ω = {x | ‖x‖ ≤ R}, we have

    |S-Den_D(w₁, b₁) − S-Den_D(w₂, b₂)|
      = | E_{x∼D}[(b₁ − w₁ · x) 1(w₁ · x − b₁ ∈ [−ℓ, 0])] − E_{Mx∼D}[(b₂ − w₂ · Mx) 1(w₂ · Mx − b₂ ∈ [−ℓ, 0])] |
      ≤ Pr_{x∼D}[x ∉ Ω] + ℓ Pr_{x∼D}[w₁ · x − b₁ ∈ [−ℓ, 0] but w₂ · Mx − b₂ ∉ [−ℓ, 0]]
        + ℓ Pr_{Mx∼D}[w₁ · x − b₁ ∉ [−ℓ, 0] but w₂ · Mx − b₂ ∈ [−ℓ, 0]]
        + | ∫_Ω ( p(x)(b₁ − w₁ · x) − p(Mx)(b₂ − w₂ · Mx) ) dx |
      ≤ ν + 2ℓε/√(2π) + 2 ∫_Ω max{ |p(x) − p(Mx)|, |(b₁ − w₁ · x) − (b₂ − w₂ · Mx)| } dx
      ≤ ν + 2εℓ/√(2π) + 6R²ε + σε
      ≤ O(ν + εR²),

where the last transition is by the assumption that σ ∈ O(R) and ℓ ∈ O(R).

C.4 Proof of Lemma 4

We suppress D and ℓ in the notation of margin density and soft-margin density below. Let (w∗, b∗) be the optimal solution to the weighted soft-margin density problem, i.e.,

    (w∗ · w_f) · S-Den^ℓ_D(w∗, b∗) = max_{η∈N} η · max_{w·w_f=η, b∈R} S-Den(w, b).

Note that for any x ∼ D whose contribution to S-Den(w∗, b∗) is non-zero, it must be that w∗ · x − b∗ ∈ [−ℓ, 0]. By the definitions of margin and soft-margin densities and the choice of (w, b), we have

    (w∗ · w_f) · S-Den(w∗, b∗) ≤ ℓ (w∗ · w_f) · Den(w∗, b∗) ≤ ℓ (w · w_f) · Den(w, b).    (9)

Consider the margin density of (w, b). There are two cases:

1. At least half of the points x in the ℓ-margin are at distance at least ℓ/2 from the boundary, i.e., w · x − b ∈ [−ℓ, −ℓ/2]. In this case, we have

    (1/2)(ℓ/2) Den(w, b) ≤ S-Den(w, b).    (10)

2. At least half of the points x in the ℓ-margin are at distance at most ℓ/2 from the boundary, i.e., w · x − b ∈ [−ℓ/2, 0]. Note that all such points are in the ℓ-margin of the halfspace defined by (w, b + ℓ/2). Additionally, these points are at distance at least ℓ/2 from this halfspace’s boundary. Therefore,

    (1/2)(ℓ/2) Den(w, b) ≤ S-Den(w, b + ℓ/2).    (11)

The claim is proved using Equations 9, 10, and 11.

C.5 Proof of Theorem 3

For ease of exposition, let ε′ = min{ε⁴, ε²σ⁴/r⁴}, ν = 1/exp(1/ε), and R = 2r + σ√(2 ln(1/ν)). Let us start by reformulating the optimization problem as follows:

    max_{g∈G_0-1} Gain(g) = max_{w∈Img(P), b∈R} (w · w_f) S-Den^{1/c}_D(w, b) = max_{η∈[0,1]} η max_{w∈Img(P), w·w_f=η, b∈R} S-Den^{1/c}_D(w, b).    (12)

Let (w∗, b∗) be the optimizer of Equation (12). Next, we show that the value of the solution (w∗, b∗) can be approximated well by a solution (w, b) where w · w_f and b belong to a discrete set of values. Let N := {0, ε′‖Pw_f‖, 2ε′‖Pw_f‖, . . . , ‖Pw_f‖}. Let η∗ = w∗ · w_f and define i such that η∗ ∈ (iε′‖Pw_f‖, (i+1)ε′‖Pw_f‖]. If i ≥ ⌊1/ε′⌋ − 2, then let w = Pw_f/‖Pw_f‖. Otherwise, let w be the unit vector defined by Lemma 2 when w₁ = w∗ and w₂ = Pw_f/‖Pw_f‖. Then,

    w · w_f = jε′‖Pw_f‖,  where  j = ⌊1/ε′⌋ if i ≥ ⌊1/ε′⌋ − 2, and j = i + 1 otherwise.

By these definitions and Lemma 2, we have that w · w∗ ≥ 1 − 2ε′‖Pw_f‖, which implies that θ(w∗, w) ≤ √(2ε′‖Pw_f‖).

Next, we use the Lipschitzness of the soft-margin density to show that the soft-margin densities of (w∗, b∗) and (w, b) are close. Let b be the closest multiple of ε′‖Pw_f‖ to b∗. Using Lemma 3, we have that

    |S-Den_D(w∗, b∗) − S-Den_D(w, b)| ≤ O(ν + (R²/σ²)√(2ε′‖Pw_f‖)).

Note that for any |a₁ − a₂| ≤ ε′ and |b₁ − b₂| ≤ ε′, |a₁b₁ − a₂b₂| ≤ (a₁ + b₁)ε′ + ε′², therefore,

    (w · w_f) S-Den^ℓ_D(w, b) ≥ (w∗ · w_f) S-Den^ℓ_D(w∗, b∗) − O(ν + (R²/σ²)√(2ε′‖Pw_f‖)).    (13)

By Lemma 4, the outcome (w̄, b̄) of the algorithm has the property that Gain(w̄, b̄) ≥ (1/4) Gain(w, b). Putting this all together, we have

    Gain(w̄, b̄) ≥ (1/4) max_{g∈G_0-1} Gain(g) − O(ν + (R²/σ²)√(2ε′‖Pw_f‖)).

Now, replacing back ε′ = min{ε⁴, ε²σ⁴/r⁴}, ν = 1/exp(1/ε), and R = 2r + σ√(2 ln(1/ν)), we have

    ν + (R²/σ²)√(2ε′‖Pw_f‖) ≤ O(ε + (r²/σ²)√ε′ + (1/ε)√ε′) ≤ O(ε).

D Proof of Theorem 5

The first claim directly follows from Lemma 5, Theorem 4, and the facts that Gain_x(g) ∈ [0, ℓ] and Gain(g) = E_{x∼D}[Gain_x(g)].

For the second claim, note that Algorithm 1 accesses distribution D only in steps that maximize the margin density of a classifier (the calls to the oracle O(D, 1/c, K_η)) or maximize the gain of a classifier. By Theorem 4, Lemma 5, and the fact that Den(g) = E_{x∼D}[Den_x(g)], we have that the density of any outcome of the oracle O(S, 1/c, K_η) is optimal up to O(ε/c) for cost unit c. Lemma 4 then shows that the soft-margin density, and hence the gain, is also within O(ε) of the 1/4 approximation.

E Threshold Mechanisms when All Features are Visible

Consider the special case where the projection matrix P is the identity matrix (or, more generally, has full rank), and the mechanism is tasked with implementing a desired threshold rule. In this case, we want to maximize the number of individuals whose features satisfy some arbitrary condition, and the features are fully visible to the mechanism. Suppose that f : R^n → {0, 1} is an arbitrary binary quality metric, and the mechanism design space is over all functions g : R^n → [0, 1] (even non-binary functions). Then in particular it is possible to set g = f, and we observe that this choice of g is always optimal.

Proposition 1. For any f : R^n → {0, 1}, if G is the set of all functions g : R^n → [0, 1], then f ∈ argmax_{g∈G} Val(g).

Proof. Fix f, and write S_f = {x ∈ R^n | f(x) = 1} for the set of feature vectors selected by f. Let S̄_f = {x ∈ R^n | ∃y s.t. f(y) = 1, cost(x, y) < 1}. That is, S̄_f contains all points of S_f, plus all points that lie within distance 1/c of any point in S_f. Write Φ_f = Pr_{x∼D}[x ∈ S̄_f]. Then note that Φ_f is an upper bound on Val(g) = E_{x∼D}[f(δ_g(x))] for any g, since Φ_f is the probability mass of all individuals who could possibly move their features to lie in S_f at a total cost of at most 1, even if compelled to make such a move by a social planner, and 1 is the maximum utility that can be gained by any move.

We now argue that setting g = f achieves this bound of Φ_f, which completes the proof. Indeed, by definition, if x ∈ S̄_f then there exists some y ∈ S_f such that cost(x, y) < 1. So in particular U_f(x, y) > 0, and hence U_f(x, δ_f(x)) > 0. We must therefore have f(δ_f(x)) = 1 for all x ∈ S̄_f, since U_f(x, x′) ≤ 0 whenever f(x′) = 0. We conclude that Val(f) = E_{x∼D}[f(δ_f(x))] = Φ_f, as required.

This proposition demonstrates that our problem is made interesting in scenarios where the projection matrix P is not full-rank, so that some aspects of the feature space are hidden from the mechanism.

We note that Proposition 1 does not hold if we allow f to be an arbitrary non-binary quality metric f : R^n → [0, 1]. For example, suppose f(x) = 0.1 when x₁ ≥ 0 and f(x) = 0 otherwise, c = 1, and D is the uniform distribution over [−1, 1]^n. Then setting g = f results in all agents with x₁ ≥ −0.1 contributing positively to the objective, since agents with x₁ ∈ (−0.1, 0) will gain positive utility by shifting to x′₁ = 0. However, if we instead set g such that g(x) = 1 when x₁ ≥ 0 and g(x) = 0 otherwise, then all agents will contribute positively to the objective, for a strictly greater aggregate quality.
