Collaborative Pure Exploration in Kernel Bandit
YIHAN DU, IIIS, Tsinghua University, China
WEI CHEN, Microsoft Research, China
YUKO KUROKI, The University of Tokyo / RIKEN, Japan
LONGBO HUANG, IIIS, Tsinghua University, China
In this paper, we formulate a Collaborative Pure Exploration in Kernel Bandit problem (CoPE-KB), which provides a novel model for
multi-agent multi-task decision making under limited communication and general reward functions, and is applicable to many online
learning tasks, e.g., recommendation systems and network scheduling. We consider two settings of CoPE-KB, i.e., Fixed-Confidence
(FC) and Fixed-Budget (FB), and design two optimal algorithms CoopKernelFC (for FC) and CoopKernelFB (for FB). Our algorithms
are equipped with innovative and efficient kernelized estimators to simultaneously achieve computation and communication efficiency.
Matching upper and lower bounds under both the statistical and communication metrics are established to demonstrate the optimality
of our algorithms. The theoretical bounds successfully quantify the influences of task similarities on learning acceleration and only
depend on the effective dimension of the kernelized feature space. Our analytical techniques, including data dimension decomposition,
linear structured instance transformation and (communication) round-speedup induction, are novel and applicable to other bandit
problems. Empirical evaluations are provided to validate our theoretical results and demonstrate the performance superiority of our
algorithms.
Additional Key Words and Phrases: collaborative pure exploration, kernel bandit, communication round, learning speedup
ACM Reference Format: Yihan Du, Wei Chen, Yuko Kuroki, and Longbo Huang. 2021. Collaborative Pure Exploration in Kernel Bandit. In Woodstock '21: ACM Symposium on Neural Gaze Detection, June 03-05, 2021, Woodstock, NY. ACM, New York, NY, USA, 44 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION
Pure exploration [3, 9, 12, 18, 21, 24] is a fundamental online learning problem in multi-armed bandits, where an agent
sequentially chooses options (often called arms) and observes random feedback, with the objective of identifying the
best option (arm). This problem finds various applications such as recommendation systems [31], online advertising [37]
and neural architecture search [17]. However, the traditional single-agent pure exploration problem [3, 9, 12, 18, 21, 24]
cannot be directly applied to many real-world distributed online learning platforms, which often face a large volume of
user requests and need to coordinate multiple distributed computing devices to process the requests, e.g., geographically
distributed data centers [30] and Web servers [44]. These computing devices communicate with each other to exchange
information in order to attain globally optimal performance.
To handle such distributed pure exploration problem, prior works [20, 22, 38] have developed the Collaborative Pure
Exploration (CoPE) model, where there are multiple agents that communicate and cooperate in order to identify the best
arm with learning speedup. Yet, existing results only investigate the classic multi-armed bandit (MAB) setting [3, 18, 21],
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2021 Association for Computing Machinery.
Manuscript submitted to ACM
arXiv:2110.15771v1 [cs.LG] 29 Oct 2021
and focus only on the fully-collaborative setting, i.e., the agents aim to solve a common task. However, in many
real-world applications such as recommendation systems [31], it is often the case that different computing devices
face different but correlated recommendation tasks. Moreover, there usually exists some structured dependency of user
utilities on the recommended items. In such applications, it is important to develop a more general CoPE model that
allows heterogeneous tasks and complex reward structures, and quantitatively investigate how task similarities impact
learning acceleration.
Motivated by the above facts, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem,
which generalizes traditional single-task CoPE problems [20, 22, 38] to the multi-task setting. It also generalizes the
classic MAB model to allow general (linear or nonlinear) reward structures via the powerful kernel representation.
Specifically, each agent is given a set of arms, and the expected reward of each arm is generated by a task-dependent
reward function with a low norm in a high-dimensional (possibly infinite-dimensional) Reproducing Kernel Hilbert
Space (RKHS) [33, 42], by which we can represent real-world nonlinear reward dependency as some linear function in
a high-dimensional space, and can go beyond linear rewards as commonly done in the literature, e.g., [10, 13, 14, 35, 40].
Each agent sequentially chooses arms to sample and observes noisy outcomes. The agents can broadcast and receive
messages to/from others in communication rounds, so that they can exploit the task similarity and collaborate to
expedite learning processes. The task of each agent is to find the best arm that maximizes the expected reward among
her arm set.
Our CoPE-KB formulation can handle different tasks in parallel and characterize the dependency of rewards on
options, which provides a more general and flexible model for real-world applications. For example, in distributed
recommendation systems [31], different computing devices can face different tasks, and it is inefficient to learn the
reward of each option individually. Instead, CoPE-KB enables us to directly learn the relationship between option
features and user utilities, and exploit the similarity of such relationship among different tasks to accelerate learning.
There are also many other applications, such as clinical trials [41], where we conduct multiple clinical trials in parallel
and utilize the common useful information to accelerate drug development, and neural architecture search [17], where
we simultaneously run different tests of neural architectures under different environmental setups to expedite search
processes.
We consider two important pure exploration settings under the CoPE-KB model, i.e., Fixed-Confidence (FC), where
agents aim to minimize the number of used samples under a given confidence, and Fixed-Budget (FB), where the goal is
to minimize the error probability under a given sample budget. Note that due to the high dimension (possibly infinite)
of the RKHS, it is highly non-trivial to simplify the burdensome computation and communication in the RKHS, and to
derive theoretical bounds only dependent on the effective dimension of the kernelized feature space. To tackle the above
challenges, we adopt efficient kernelized estimators and design novel algorithms CoopKernelFC and CoopKernelFB for
the FC and FB settings, respectively, which only cost Poly(nV) computation and communication complexity instead of Poly(dim(H_K)) as in [10, 43], where n is the number of arms, V is the number of agents, and H_K is the high-dimensional
RKHS. We also establish matching upper and lower bounds in terms of sampling and communication complexity to
demonstrate the optimality of our algorithms (within logarithmic factors).
Our work distinguishes itself from prior CoPE works, e.g., [20, 22, 38], in the following aspects: (i) Prior works [20, 22,
38] only consider the classic MAB setting, while we adopt a high-dimensional RKHS to allow more general real-world
reward dependency on option features. (ii) Unlike [20, 22, 38] which restrict tasks (given arm sets and rewards) among
agents to be the same, we allow different tasks for different agents, and explicitly quantify how task similarities impact
learning acceleration. (iii) In lower bound analysis, prior works [20, 38] mainly focus on a 2-armed case, whereas we
derive a novel lower bound analysis for general multi-armed cases with high-dimensional linear reward structures.
Moreover, when reducing CoPE-KB to prior CoPE with classic MAB setting (all agents are solving the same classic
MAB task) [20, 38], our lower and upper bounds also match the existing state-of-the-art results in [38].
The contributions of this paper are summarized as follows:
โข We formulate a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem, which models
distributed multi-task decision making problems with general reward functions, and finds applications in many
real-world online learning tasks, and study two settings of CoPE-KB, i.e., CoPE-KB with fixed-confidence (FC)
and CoPE-KB with fixed-budget (FB).
• For CoPE-KB with fixed-confidence (FC), we propose an algorithm CoopKernelFC, which adopts an efficient kernelized estimator to significantly reduce the computation and communication complexity from the existing Poly(dim(H_K)) to only Poly(nV). We derive matching upper and lower bounds of sample complexity Õ((ρ*/V) log δ⁻¹) and communication rounds O(log Δ_min⁻¹). Here ρ* is the problem hardness (defined in Section 4.2), and Δ_min is the minimum reward gap.
• For CoPE-KB with fixed-budget (FB), we design an efficient algorithm CoopKernelFB with error probability Õ(exp(−TV/ρ*) · n²V) and communication rounds O(log(ω(X̃))). A matching lower bound on communication rounds is also established to validate the communication optimality of CoopKernelFB (within double-logarithmic factors). Here T is the sample budget, X̃ is the set of arms, and ω(X̃) is the principal dimension of the data projections in X̃ onto the RKHS (defined in Section 5.1.1).
• Our results explicitly quantify the impacts of task similarities on learning acceleration. Our novel analytical techniques, including data dimension decomposition, linear structured instance transformation and round-speedup induction, can be of independent interest and are applicable to other bandit problems.
Due to the space limit, we defer all proofs to the Appendix.
2 RELATED WORK
This work falls in the literature of multi-armed bandits [4, 8, 27, 28, 39]. Here we mainly review the three most related lines of research: collaborative pure exploration, collaborative regret minimization and kernel bandits.
Collaborative Pure Exploration (CoPE). The collaborative pure exploration literature is initiated by [20], which
considers the classic MAB and fully-collaborative settings, and designs fixed-confidence algorithms based on majority
vote with upper bound analysis. Tao et al. [38] further develop a fixed-budget algorithm by calling conventional
single-agent fixed-confidence algorithms, and completes the analysis of round-speedup lower bounds. Karpov et al.
[22] extend the formulation of [20, 38] to the best๐ arm identification problem, and designs fixed-confidence and
fixed-budget algorithms with tight round-speedup upper and lower bounds, which give a strong separation between
best arm identification and the extended best ๐ arm identification. Our CoPE-KB model encompasses the classic
MAB and fully-collaborative settings in the above works [20, 22, 38], but faces unique challenges on computation and
communication efficiency due to the high-dimensional reward structures.
Collaborative Regret Minimization. There are other works studying collaborative (distributed) bandit with the
regret minimization objective. Bistritz and Leshem [5], Liu and Zhao [29], Rosenski et al. [32] study the multi-player
bandit with collisions motivated by cognitive radio networks, where multiple players simultaneously choose arms from
the same set and receive no reward if more than one player chooses the same arm (i.e., a collision happens). Bubeck and
Budzinski [6], Bubeck et al. [7] investigate a variant of the multi-player bandit problem where players cannot communicate but
have access to shared randomness, and they propose algorithms that achieve nearly optimal regrets without collisions.
Chakraborty et al. [11] introduce another distributed bandit problem, where each agent decides either to pull an arm
or to broadcast a message in order to maximize the total reward. Korda et al. [25], Szorenyi et al. [36] adapt bandit
algorithms to peer-to-peer networks, where the peers pick arms from the same set and can only communicate with a
few random others along network links. The above works consider different learning objectives and communication
protocols from ours, and do not involve the challenges of simultaneously handling multiple different tasks and analyzing
the relationship between communication rounds and learning speedup.
Kernel Bandit. There are a number of works on kernel bandits with the regret minimization objective. Srinivas et al.
[35] study the Gaussian process bandit problem with RKHS, which is the Bayesian version of kernel bandits, and
design an Upper Confidence Bound (UCB) style algorithm. Chowdhury and Gopalan [13] further improve the regret
results of [35] by constructing tighter kernelized confidence intervals. Valko et al. [40] consider kernel bandits from
the frequentist perspective and provide an alternative regret analysis based on the effective dimension. Deshmukh et al.
[14], Krause and Ong [26] study multi-task kernel bandits, where the kernel function of the RKHS is composed of two
components, task similarities and arm features. Dubey et al. [16] investigate the multi-agent kernel bandit with a
local communication protocol, where the learning objective is to reduce the average regret suffered per agent. For
kernel bandits with the pure exploration objective, there are only two works [10, 43] to the best of our knowledge. Camilleri
et al. [10] design a single-agent algorithm which uses a robust inverse propensity score estimator to reduce the sample
complexity incurred by rounding procedures. Zhu et al. [43] propose a variant of [10] which applies neural networks to
approximate nonlinear reward functions. All of these works consider either regret minimization or the single-agent setting,
which largely differs from our problem, and they do not investigate distributed decision making or the (communication)
round-speedup trade-off. Thus, their algorithms and analysis cannot be applied to solve our CoPE-KB problem.
3 COLLABORATIVE PURE EXPLORATION IN KERNEL BANDIT (COPE-KB)
In this section, we present the formal formulation of the Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem, and
discuss the two important settings under CoPE-KB that will be investigated.
Agents and rewards. There are V agents [V] = {1, . . . , V}, who collaborate to solve different but possibly related instances (tasks) of the pure exploration in kernel bandit (PE-KB) problem. Each agent v ∈ [V] is given a set of n arms X_v = {x_{v,1}, . . . , x_{v,n}} ⊂ R^{d_X}, where d_X is the dimension of arm feature vectors. The expected reward of each arm x ∈ X_v is f_v(x), where f_v : X_v → R is an unknown reward function. Let X = ∪_{v∈[V]} X_v. Following the kernel bandit literature [14, 26, 35, 40], we assume that for any v ∈ [V], f_v has a bounded norm in a Reproducing Kernel Hilbert Space (RKHS) specified by a kernel K_X : X × X → R (see below for more details). At each timestep t, each agent v pulls an arm x_{v,t} ∈ X_v and observes a random reward y_{v,t} = f_v(x_{v,t}) + η_{v,t}, where η_{v,t} is an independent, zero-mean and 1-sub-Gaussian noise (without loss of generality).¹ We assume that the best arms x_{v,*} = argmax_{x∈X_v} f_v(x) are unique for all v ∈ [V], which is a common assumption in the pure exploration literature, e.g., [3, 12, 18, 24].
Multi-Task Kernel Composition. We assume that the functions f_v are parametric functionals of a global function F : X × Z → R, which satisfies that, for each agent v ∈ [V], there exists a task feature vector z_v ∈ Z such that

f_v(x) = F(x, z_v), ∀x ∈ X_v.   (1)

Here X and Z denote the arm feature space and task feature space, respectively. Eq. (1) allows tasks to be different for different agents, whereas prior CoPE works [20, 22, 38] restrict the tasks (X_v and f_v) to be the same for all agents v ∈ [V].

¹A random variable η is called 1-sub-Gaussian if it satisfies E[exp(λη − λE[η])] ≤ exp(λ²/2) for any λ ∈ R.
Denote X̃ = X × Z and x̃ = (x, z_v). As a standard assumption in kernel bandits [14, 16, 26, 35], we assume that F has a bounded norm in a global RKHS H_K specified by a kernel K : X̃ × X̃ → R, i.e., there exist a feature mapping φ : X̃ → H_K and an unknown parameter θ* ∈ H_K such that

F(x̃) = φ(x̃)⊤θ*, ∀x̃ ∈ X̃,  and  K(x̃, x̃′) = φ(x̃)⊤φ(x̃′), ∀x̃, x̃′ ∈ X̃.

Here ‖θ*‖ := √(θ*⊤θ*) ≤ B for some known constant B > 0. K : X̃ × X̃ → R is a product composite kernel, which satisfies that, for any z, z′ ∈ Z and x, x′ ∈ X,

K((x, z), (x′, z′)) = K_Z(z, z′) · K_X(x, x′),

where K_X is the arm feature kernel that depicts the feature structure of arms, and K_Z is the task feature kernel that measures the similarity of the functions f_v. For example, in the fully-collaborative setting, all agents solve a common task, and we have z_v = 1 for all v ∈ [V], K_Z(z, z′) = 1 for all z, z′ ∈ Z, and K = K_X. On the contrary, if all tasks are different, then rank(K_Z) = V. Thus K_Z allows us to characterize the influence of task similarities (1 ≤ rank(K_Z) ≤ V) on learning.
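To make the product composition concrete, the following is a small illustrative sketch (with made-up RBF arm features and toy task-similarity matrices, none of which come from the paper) showing how the rank of K_Z interpolates between the fully-collaborative and fully-independent cases:

```python
import numpy as np

# Illustrative sketch (toy values, not from the paper): the composite kernel
# K((x, z), (x', z')) = K_Z(z, z') * K_X(x, x') for V = 2 tasks.
def k_x(x, x_prime, bandwidth=1.0):
    """Arm feature kernel K_X: an RBF kernel over arm features."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * bandwidth ** 2))

def k_composite(x, z, x_prime, z_prime, K_Z):
    """Product composite kernel for arm-task pairs (x, z) and (x', z')."""
    return K_Z[z, z_prime] * k_x(x, x_prime)

# Fully collaborative: K_Z is all-ones (rank 1), so K reduces to K_X.
K_Z_shared = np.ones((2, 2))
# Fully independent tasks: K_Z is the identity (rank 2 = V).
K_Z_indep = np.eye(2)

x, x_p = np.array([0.3, 0.1]), np.array([0.2, 0.4])
same_task = k_composite(x, 0, x_p, 0, K_Z_indep)
cross_task = k_composite(x, 0, x_p, 1, K_Z_indep)
shared = k_composite(x, 0, x_p, 1, K_Z_shared)
# With independent tasks the cross-task kernel vanishes, so no information
# is shared; with a common task it equals the plain arm kernel K_X(x, x').
```

Intermediate choices of K_Z (neither all-ones nor identity) give 1 < rank(K_Z) < V and partial information sharing across tasks.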
We give a simple 2-agent (2-task) illustrating example in Figure 1. Agent 1 is given Items 1 and 2 with expected rewards μ_1 and μ_2, respectively, denoted by X_1 = {x_{1,1}, x_{1,2}}. Agent 2 is given Items 2 and 3 with expected rewards μ_2 and μ_3, respectively, denoted by X_2 = {x_{2,1}, x_{2,2}}. Here x_{1,2} = x_{2,1} is the same item. In this case, φ(x̃_{1,1}) = [1, 0, 0]⊤, φ(x̃_{1,2}) = φ(x̃_{2,1}) = [0, 1, 0]⊤, φ(x̃_{2,2}) = [0, 0, 1]⊤, and θ* = [μ_1, μ_2, μ_3]⊤. The two agents can share the learned information on the second dimension of θ* to accelerate their learning processes.
Fig. 1. Illustrating example with two tasks over three items: Task 1 has arms x_{1,1} (Item 1, reward μ_1) and x_{1,2} (Item 2, reward μ_2); Task 2 has arms x_{2,1} = x_{1,2} (Item 2, reward μ_2) and x_{2,2} (Item 3, reward μ_3). The arm-task pairs are x̃_{v,i} = (z_v, x_{v,i}), with φ(x̃_{1,1}) = [1, 0, 0]⊤, φ(x̃_{1,2}) = φ(x̃_{2,1}) = [0, 1, 0]⊤, φ(x̃_{2,2}) = [0, 0, 1]⊤, and θ* = [μ_1, μ_2, μ_3]⊤.
Note that the RKHS H_K can have infinite dimensions, and any direct operation on H_K, e.g., the calculation of φ(x̃) or an explicit expression of the estimate of θ*, is impracticable. In this paper, all our algorithms only query the kernel function K(·, ·) instead of directly operating on H_K, and φ(x̃) and θ* are only used in our theoretical analysis, which is different from existing works, e.g., [10, 43].
Communication. Following the popular communication protocol in existing CoPE works [20, 22, 38], we allow the V agents to exchange information via communication rounds, in which each agent can broadcast messages to and receive messages from the others. While we do not restrict the exact length of a message, for practical implementation it should be bounded by O(n) bits. Here n is the number of arms for each agent, and we count the number of bits for representing a real number as a constant.

In the CoPE-KB problem, our goal is to design computation- and communication-efficient algorithms that coordinate the agents to simultaneously complete multiple tasks in collaboration, and to characterize how the task similarities impact the learning speedup.
Fixed-Confidence and Fixed-Budget. We consider two versions of the CoPE-KB problem, one with fixed-confidence (FC) and the other with fixed-budget (FB). Specifically, in the FC setting, given a confidence parameter δ ∈ (0, 1), the agents aim to identify x_{v,*} for all v ∈ [V] with probability at least 1 − δ and to minimize the average number of samples used by each agent. In the FB setting, on the other hand, the agents are given an overall sample budget T · V (T average samples per agent), and aim to use at most T · V samples to identify x_{v,*} for all v ∈ [V] while minimizing the error probability. In both the FC and FB settings, agents are requested to minimize the number of communication rounds.
To evaluate the learning acceleration of our algorithms, following the CoPE literature, e.g., [20, 22, 38], we also define a speedup metric. For a CoPE-KB instance I, let T_{A_M,I} denote the average number of samples used by each agent in a multi-agent algorithm A_M to identify x_{v,*} for all v ∈ [V], and let T_{A_S,I} denote the average number of samples used per task by a single-agent algorithm A_S that sequentially (without communication) identifies x_{v,*} for all v ∈ [V]. Then, the speedup of A_M on instance I is formally defined as

β_{A_M,I} = inf_{A_S} T_{A_S,I} / T_{A_M,I}.   (2)

It can be seen that 1 ≤ β_{A_M,I} ≤ V, where β_{A_M,I} = 1 corresponds to the case where all tasks are different, and β_{A_M,I} can approach V for a fully-collaborative instance. By taking T_{A_M,I} and T_{A_S,I} as the smallest numbers of samples needed to meet the confidence constraint, the definition of β_{A_M,I} extends similarly to error probability results.
In particular, when all agents v ∈ [V] have the same arm set X_v = X = {e_1, . . . , e_n} (i.e., the standard basis vectors in Rⁿ) and the same reward function f_v(x) = f(x) = x⊤θ* for any x ∈ X, all agents are solving a common classic MAB task, and then the task feature space is Z = {1} and K_Z(z, z′) = 1 for any z, z′ ∈ Z. In this case, our CoPE-KB problem reduces to prior CoPE with the classic MAB setting [20, 38].
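As a quick sanity check of this reduction (an illustrative sketch under the stated assumptions, not code from the paper): with standard-basis arms and a constant task kernel, the composite kernel of two arm-task pairs is simply the indicator that they are the same arm, so the Gram matrix is the identity and each arm must be estimated from its own samples, exactly as in classic MAB:

```python
import numpy as np

# Sketch of the classic-MAB reduction: arms are standard basis vectors e_i in
# R^n, all tasks share z = 1 with K_Z == 1, so
# K((e_i, z), (e_j, z')) = 1 * <e_i, e_j> = 1{i == j}.
n = 4
arms = np.eye(n)  # rows are e_1, ..., e_n

def k_classic(i, j):
    """Composite kernel under the reduction: K_Z * K_X = 1 * <e_i, e_j>."""
    return 1.0 * float(arms[i] @ arms[j])

gram = np.array([[k_classic(i, j) for j in range(n)] for i in range(n)])
# The Gram matrix is the identity: each arm carries information only about
# itself, so no structure can be exploited across arms.
```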
4 FIXED-CONFIDENCE COPE-KB
We start with the fixed-confidence (FC) setting and propose the CoopKernelFC algorithm. We explicitly quantify how
task similarities impact learning acceleration, and provide sample complexity and round-speedup lower bounds to
demonstrate the optimality of CoopKernelFC.
4.1 Algorithm CoopKernelFC
4.1.1 Algorithm. CoopKernelFC has three key components: (i) maintain alive arm sets for all agents, (ii) perform
sampling individually according to the globally optimal sample allocation, and (iii) exchange the distilled observa-
tion information to estimate reward gaps and eliminate sub-optimal arms, via efficient kernelized computation and
communication schemes.
The procedure of CoopKernelFC (Algorithm 1) for each agent v is as follows. Agent v maintains alive arm sets B^{(t)}_{v′} for all v′ ∈ [V] by successively eliminating sub-optimal arms in each phase. In phase t, she solves a global min-max optimization, which takes into account the objectives and available arm sets of all agents, to obtain the optimal sample allocation λ*_t ∈ Δ_X̃ and the optimal value ρ*_t (Line 4). Here Δ_X̃ is the collection of all distributions on X̃, and ξ_t is a regularization parameter such that

√ξ_t · max_{x̃_i,x̃_j∈X̃_v, v∈[V]} ‖φ(x̃_i) − φ(x̃_j)‖_{(ξ_t I + Σ_{x̃∈X̃} (1/(nV)) φ(x̃)φ(x̃)⊤)^{−1}} ≤ 1 / ((1 + ε)B · 2^{t+1}),   (3)

which ensures that the estimation bias for reward gaps is bounded by 2^{−(t+1)} and can be efficiently checked by a kernelized transformation (specified in Section 4.1.2). Then, agent v uses ρ*_t to compute the number of required samples N^{(t)}, which guarantees that the confidence radius of the reward-gap estimates is within 2^{−t} (Line 5). In algorithm CoopKernelFC, we use a rounding procedure ROUND^ε(λ, N) with approximation parameter ε from [2, 10], which rounds the sample
Algorithm 1: Distributed Algorithm CoopKernelFC: for Agent v (v ∈ [V])
Input: δ, X̃_1, . . . , X̃_V, K(·, ·) : X̃ × X̃ → R, B, rounding procedure ROUND^ε(·, ·) with approximation parameter ε.
1: Initialization: B^{(1)}_{v′} ← X_{v′} for all v′ ∈ [V]; t ← 1 ; // initialize alive arm sets B^{(1)}_{v′}
2: while ∃v′ ∈ [V], |B^{(t)}_{v′}| > 1 do
3:   δ_t ← δ/(2t²);
4:   Let λ*_t and ρ*_t be the optimal solution and optimal value of min_{λ∈Δ_X̃} max_{x̃_i,x̃_j∈B^{(t)}_{v′}, v′∈[V]} ‖φ(x̃_i) − φ(x̃_j)‖²_{(ξ_t I + Σ_{x̃∈X̃} λ_{x̃} φ(x̃)φ(x̃)⊤)^{−1}}, where ξ_t is a regularization parameter that satisfies Eq. (3) ; // compute the optimal sample allocation
5:   N^{(t)} ← ⌈8(2^t)²(1 + ε)² ρ*_t log(2n²V/δ_t)⌉ ; // compute the number of required samples
6:   (s̃_1, . . . , s̃_{N^{(t)}}) ← ROUND^ε(λ*_t, N^{(t)});
7:   Let s̃^{(t)}_v be the sub-sequence of (s̃_1, . . . , s̃_{N^{(t)}}) which only contains the arms in X̃_v ; // generate the sample sequence for agent v
8:   Pull arms s̃^{(t)}_v and observe random rewards;
9:   Broadcast {(α^{(t)}_{v,i}, ȳ^{(t)}_{v,i})}_{i∈[n]}, where α^{(t)}_{v,i} is the number of samples and ȳ^{(t)}_{v,i} is the average observed reward on arm x̃_{v,i};
10:  Receive {(α^{(t)}_{v′,i}, ȳ^{(t)}_{v′,i})}_{i∈[n]} from all other agents v′ ∈ [V] \ {v};
11:  For notational simplicity, combine the subscripts v′, i in x̃_{v′,i}, α^{(t)}_{v′,i}, ȳ^{(t)}_{v′,i} into x̃_{(v′−1)n+i}, α^{(t)}_{(v′−1)n+i}, ȳ^{(t)}_{(v′−1)n+i}, respectively. Then k_t(x̃) ← [√α^{(t)}_1 K(x̃, x̃_1), . . . , √α^{(t)}_{nV} K(x̃, x̃_{nV})]⊤ for any x̃ ∈ X̃; K^{(t)} ← [√(α^{(t)}_i α^{(t)}_j) K(x̃_i, x̃_j)]_{i,j∈[nV]}; ȳ^{(t)} ← [√α^{(t)}_1 ȳ^{(t)}_1, . . . , √α^{(t)}_{nV} ȳ^{(t)}_{nV}]⊤ ; // organize overall observation information
12:  for all v′ ∈ [V] do
13:    Δ̂_t(x̃_i, x̃_j) ← (k_t(x̃_i) − k_t(x̃_j))⊤ (K^{(t)} + N^{(t)} ξ_t I)^{−1} ȳ^{(t)}, ∀x̃_i, x̃_j ∈ B^{(t)}_{v′} ; // estimate the reward gap between x̃_i and x̃_j
14:    B^{(t+1)}_{v′} ← B^{(t)}_{v′} \ {x̃ ∈ B^{(t)}_{v′} | ∃x̃′ ∈ B^{(t)}_{v′} : Δ̂_t(x̃′, x̃) ≥ 2^{−t}} ; // discard sub-optimal arms
15:  t ← t + 1;
16: return B^{(t)}_1, . . . , B^{(t)}_V
allocation λ ∈ Δ_X̃ into integer numbers of samples κ ∈ N^{|X̃|}, such that Σ_{x̃∈X̃} κ_{x̃} = N and

max_{x̃_i,x̃_j∈B^{(t)}_{v′}, v′∈[V]} ‖φ(x̃_i) − φ(x̃_j)‖²_{(Nξ_t I + Σ_{x̃∈X̃} κ_{x̃} φ(x̃)φ(x̃)⊤)^{−1}} ≤ (1 + ε) max_{x̃_i,x̃_j∈B^{(t)}_{v′}, v′∈[V]} ‖φ(x̃_i) − φ(x̃_j)‖²_{(Nξ_t I + Σ_{x̃∈X̃} Nλ_{x̃} φ(x̃)φ(x̃)⊤)^{−1}}.
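For intuition only, a naive largest-remainder rounding (which is NOT the ROUND^ε procedure of [2, 10]; that procedure additionally controls the approximation error of the resulting design matrix) already illustrates the basic interface, turning a distribution λ into integer counts that sum to N:

```python
import numpy as np

# Naive largest-remainder rounding (illustration only; NOT the ROUND^eps
# procedure of [2, 10]): turn a sample allocation lam (a distribution over
# arms) into integer counts kappa with sum(kappa) == N.
def naive_round(lam, N):
    lam = np.asarray(lam, dtype=float)
    ideal = lam * N                      # ideal fractional sample counts
    kappa = np.floor(ideal).astype(int)  # round everything down first
    remainder = N - kappa.sum()          # samples still to hand out
    # give the remaining samples to the arms with the largest fractional parts
    order = np.argsort(-(ideal - kappa))
    kappa[order[:remainder]] += 1
    return kappa

lam = np.array([0.5, 0.3, 0.2])
kappa = naive_round(lam, 7)  # ideal counts [3.5, 2.1, 1.4] -> [4, 2, 1]
```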
By calling ROUND^ε(λ*_t, N^{(t)}), agent v generates an overall sample sequence (s̃_1, . . . , s̃_{N^{(t)}}) according to λ*_t, and extracts the sub-sequence s̃^{(t)}_v that only contains the arms in X̃_v to sample (Lines 6-8). After sampling, she only communicates the number of samples α^{(t)}_{v,i} and the average observed reward ȳ^{(t)}_{v,i} for each arm to the other agents (Lines 9-10). With the overall observation information, she estimates the reward gap Δ̂_t(x̃_i, x̃_j) between any arm pair x̃_i, x̃_j ∈ B^{(t)}_{v′} for all v′ ∈ [V] and discards sub-optimal arms (Lines 13-14).
4.1.2 Computation and Communication Efficiency. Here we explain the efficiency of CoopKernelFC. Note that in
CoPE-KB, due to its high-dimensional reward structures, directly using the empirical mean to estimate rewards will
cause loose sample complexity, and naively calculating and transmitting infinite-dimensional parameter \โ will incur
huge computation and communication costs. As a result, we cannot directly compute and communicate scalar empirical
rewards as in prior CoPE with classic MAB works [20, 22, 38].
Computation Efficiency. CoopKernelFC uses three efficient kernelized operations: the optimization solver (Line 4), the condition for the regularization parameter ξ_t (Eq. (3)) and the estimator of reward gaps (Line 13). Unlike prior kernel bandit algorithms [10, 43], which explicitly compute φ(x̃) and maintain an estimate of θ* in the infinite-dimensional RKHS, CoopKernelFC only queries the kernel function K(·, ·) and significantly reduces the computation (memory) costs from Poly(dim(H_K)) to only Poly(nV).

Below we give the formal expressions of these operations and defer their detailed derivations to Appendix A.1.
Kernelized Estimator. We first introduce the kernelized estimator of reward gaps (Line 13). Following the standard estimation procedure in linear/kernel bandits [10, 19, 23, 43], we consider the following regularized least squares estimator of the underlying reward parameter θ*:

θ̂_t = ( N^{(t)} ξ_t I + Σ_{j=1}^{N^{(t)}} φ(s̃_j) φ(s̃_j)⊤ )^{−1} Σ_{j=1}^{N^{(t)}} φ(s̃_j) y_j.

Note that this form of θ̂_t has N^{(t)} terms in each summation, which are cumbersome to compute and communicate. Since the samples (s̃_1, . . . , s̃_{N^{(t)}}) are composed of the arms x̃_1, . . . , x̃_{nV}, we merge the repeated computations for the same arms in the summations and obtain (for notational simplicity, we combine the subscripts v′, i in x̃_{v′,i}, α^{(t)}_{v′,i}, ȳ^{(t)}_{v′,i} into x̃_{(v′−1)n+i}, α^{(t)}_{(v′−1)n+i}, ȳ^{(t)}_{(v′−1)n+i}, respectively)

θ̂_t (a)= ( N^{(t)} ξ_t I + Σ_{i=1}^{nV} α^{(t)}_i φ(x̃_i) φ(x̃_i)⊤ )^{−1} Σ_{i=1}^{nV} α^{(t)}_i φ(x̃_i) ȳ^{(t)}_i (b)= Φ⊤_t ( N^{(t)} ξ_t I + K^{(t)} )^{−1} ȳ^{(t)}.   (4)

Here α^{(t)}_i is the number of samples and ȳ^{(t)}_i is the average observed reward on arm x̃_i for any i ∈ [nV]; Φ_t = [√α^{(t)}_1 φ(x̃_1)⊤; . . . ; √α^{(t)}_{nV} φ(x̃_{nV})⊤] is the empirically weighted feature matrix; K^{(t)} = Φ_t Φ⊤_t = [√(α^{(t)}_i α^{(t)}_j) K(x̃_i, x̃_j)]_{i,j∈[nV]} is the kernel matrix; and ȳ^{(t)} = [√α^{(t)}_1 ȳ^{(t)}_1, . . . , √α^{(t)}_{nV} ȳ^{(t)}_{nV}]⊤ collects the weighted average observations. Equality (a) rearranges the summation according to the distinct chosen arms, and (b) follows from the kernel transformation.

Then, by multiplying by (φ(x̃_i) − φ(x̃_j))⊤, we obtain the estimator of the reward gap f(x̃_i) − f(x̃_j) as

Δ̂_t(x̃_i, x̃_j) = (φ(x̃_i) − φ(x̃_j))⊤ θ̂_t = (k_t(x̃_i) − k_t(x̃_j))⊤ ( N^{(t)} ξ_t I + K^{(t)} )^{−1} ȳ^{(t)},   (5)

where k_t(x̃) = Φ_t φ(x̃) = [√α^{(t)}_1 K(x̃, x̃_1), . . . , √α^{(t)}_{nV} K(x̃, x̃_{nV})]⊤ for any x̃ ∈ X̃. This estimator not only transforms heavy operations on the infinite-dimensional RKHS into efficient ones that only query the kernel function, but also merges the repeated computations for the same arms (equality (a)), so that all calculations depend only on nV.
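The primal/dual equivalence behind equalities (a) and (b) can be checked numerically on a toy instance with an explicit finite-dimensional feature map (an illustrative sketch with made-up data, not from the paper):

```python
import numpy as np

# Toy check of the kernelized estimator (Eqs. (4)-(5)): with a finite feature
# map phi, the kernel-only gap estimate
#   (k(x_i) - k(x_j))^T (N xi I + K)^{-1} ybar_w
# must equal (phi(x_i) - phi(x_j))^T theta_hat from explicit ridge regression.
rng = np.random.default_rng(0)
d, n_arms, xi = 4, 5, 0.1

Phi = rng.normal(size=(n_arms, d))     # rows are phi(x_1), ..., phi(x_n)
theta_star = rng.normal(size=d)
alpha = np.array([3, 1, 4, 2, 5])      # samples per arm
N = alpha.sum()
ybar = Phi @ theta_star                # noiseless per-arm average rewards

# Primal ridge estimator: (N xi I + sum_i a_i phi_i phi_i^T)^{-1} sum_i a_i phi_i ybar_i
A = N * xi * np.eye(d) + Phi.T @ (alpha[:, None] * Phi)
theta_hat = np.linalg.solve(A, Phi.T @ (alpha * ybar))

# Dual (kernelized) form: only kernel values K(x_i, x_j) = phi_i^T phi_j are used.
sq = np.sqrt(alpha)
Kmat = (sq[:, None] * (Phi @ Phi.T)) * sq[None, :]       # K^(t)
y_w = sq * ybar                                          # weighted observations
w = np.linalg.solve(N * xi * np.eye(n_arms) + Kmat, y_w)

# Gap estimate between arms 0 and 1, computed both ways.
k0, k1 = sq * (Phi @ Phi[0]), sq * (Phi @ Phi[1])
gap_dual = (k0 - k1) @ w
gap_primal = (Phi[0] - Phi[1]) @ theta_hat
assert np.isclose(gap_dual, gap_primal)
```

The dual form never materializes d-dimensional quantities, which is what lets the algorithm work when the feature map lives in an infinite-dimensional RKHS.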
Kernelized Optimization Solver / Condition for the Regularization Parameter. Now we introduce the optimization solver (Line 4) and the condition for the regularization parameter ξ_t (Eq. (3)).

For the kernelized optimization solver, we solve the min-max optimization in Line 4 by projected gradient descent, following the procedure in [10]. Specifically, let A(ξ, λ) = ξI + Σ_{x̃∈X̃} λ_{x̃} φ(x̃)φ(x̃)⊤ for any ξ > 0 and λ ∈ Δ_X̃. We define the function g(λ) = max_{x̃_i,x̃_j∈B^{(t)}_v, v∈[V]} ‖φ(x̃_i) − φ(x̃_j)‖²_{A(ξ_t,λ)^{−1}}, and denote the maximizers in g(λ) by x̃*_i(λ), x̃*_j(λ). Then, the gradient of g(λ) is given by

[∇_λ g(λ)]_{x̃} = −( (φ(x̃*_i(λ)) − φ(x̃*_j(λ)))⊤ A(ξ_t, λ)^{−1} φ(x̃) )²,  ∀x̃ ∈ X̃,   (6)

which can be efficiently calculated by the following kernel transformation:

(φ(x̃*_i(λ)) − φ(x̃*_j(λ)))⊤ A(ξ_t, λ)^{−1} φ(x̃) = ξ_t^{−1} ( K(x̃*_i(λ), x̃) − K(x̃*_j(λ), x̃) − (k_λ(x̃*_i(λ)) − k_λ(x̃*_j(λ)))⊤ (ξ_t I + K_λ)^{−1} k_λ(x̃) ),   (7)

where k_λ(x̃) = [√λ_1 K(x̃, x̃_1), . . . , √λ_{nV} K(x̃, x̃_{nV})]⊤ and K_λ = [√(λ_i λ_j) K(x̃_i, x̃_j)]_{i,j∈[nV]}.
For the condition Eq. (3) on the regularization parameter $\xi_t$, we can transform it to

\max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \sqrt{ K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j) - \|k_{\lambda_u}(x_i) - k_{\lambda_u}(x_j)\|^2_{(\xi_t I + K_{\lambda_u})^{-1}} } \leq \frac{1}{(1 + \varepsilon) B \cdot 2^{t+1}},    (8)

where $\lambda_u = [\frac{1}{nV}, \dots, \frac{1}{nV}]^\top$ is the uniform distribution on $\tilde{\mathcal{X}}$, $k_{\lambda_u}(x) = [\frac{1}{\sqrt{nV}} K(x, x_1), \dots, \frac{1}{\sqrt{nV}} K(x, x_{nV})]^\top$ and $K_{\lambda_u} = [\frac{1}{nV} K(x_i, x_j)]_{i, j \in [nV]}$.
Both the kernelized optimization solver and the condition for $\xi_t$ avoid inefficient operations directly on the infinite-dimensional RKHS by querying the kernel function, and cost only $\mathrm{Poly}(nV)$ computation and memory (Eqs. (7),(8) involve only the scalars $K(x_i, x_j)$, the $nV$-dimensional vectors $k_\lambda$, and the $nV \times nV$ matrix $K_\lambda$).
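A toy version of the projected gradient descent solver, for the linear-kernel case where $A(\xi, \lambda)$ is available explicitly; the simplex projection, the diminishing step size and the best-iterate tracking are standard subgradient-method choices, not details taken from [10].

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def h_and_grad(lam, X, pairs, xi):
    """h(lam) = max over pairs of ||x_i - x_j||^2_{A(xi,lam)^{-1}} and its
    gradient per Eq. (6), for the linear kernel phi(x) = x."""
    d = X.shape[1]
    A_inv = np.linalg.inv(xi * np.eye(d) + (X.T * lam) @ X)
    vals = [(X[i] - X[j]) @ A_inv @ (X[i] - X[j]) for i, j in pairs]
    i, j = pairs[int(np.argmax(vals))]
    grad = -np.square(X @ (A_inv @ (X[i] - X[j])))  # one entry per arm, Eq. (6)
    return max(vals), grad

def solve_allocation(X, pairs, xi=0.1, eta=0.1, iters=300):
    lam = np.full(len(X), 1.0 / len(X))
    best_lam, best_h = lam, h_and_grad(lam, X, pairs, xi)[0]
    for k in range(iters):
        h, grad = h_and_grad(lam, X, pairs, xi)
        if h < best_h:
            best_h, best_lam = h, lam
        lam = project_simplex(lam - eta / np.sqrt(k + 1.0) * grad)
    return best_lam
```

Since $h$ is a maximum of convex functions of $\lambda$, tracking the best iterate guarantees the returned allocation is no worse than the uniform starting point.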
Communication Efficiency. By taking advantage of the kernelized estimator (Eq. (5)), CoopKernelFC merges repeated computations for the same arms and only transmits the $nV$ scalar tuples $\{(N^{(t)}_{v,i}, \bar{y}^{(t)}_{v,i})\}_{i \in [n], v \in [V]}$ among agents, instead of transmitting all $N^{(t)}$ samples as in [16]. This significantly reduces the communication cost from $O(N^{(t)})$ bits to $O(nV)$ bits (Lines 9-10).
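The per-arm sufficient statistics that are broadcast can be sketched as follows; the message format is illustrative.

```python
def compress(samples):
    """Collapse a list of (arm_index, reward) samples into the per-arm
    tuples (N_i, y_bar_i) that CoopKernelFC broadcasts (Lines 9-10)."""
    stats = {}
    for arm, r in samples:
        n, s = stats.get(arm, (0, 0.0))
        stats[arm] = (n + 1, s + r)
    return {arm: (n, s / n) for arm, (n, s) in stats.items()}
```

The message size is proportional to the number of distinct arms, not to the number of raw samples, which is exactly the $O(N^{(t)})$-to-$O(nV)$ reduction claimed above.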
4.2 Theoretical performance of CoopKernelFC
Define the problem hardness of identifying the best arms $x^*_v$ for all $v \in [V]$ as

\rho^* = \min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x} \in \mathcal{X}_v, v \in [V]} \frac{ \|\phi(x^*_v) - \phi(\tilde{x})\|^2_{(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}'} \phi(\tilde{x}') \phi(\tilde{x}')^\top)^{-1}} }{ (f(x^*_v) - f(\tilde{x}))^2 },    (9)

where $\xi_* = \min_{t \geq 1} \xi_t$. $\rho^*$ is the information-theoretic lower bound of the CoPE-KB problem, adapted from pure exploration in linear/kernel bandits [10, 19, 23, 43]. Let $N$ denote the per-agent sample complexity, i.e., the average number of samples used by each agent in algorithm CoopKernelFC.
The sample complexity and number of communication rounds of CoopKernelFC are as follows.
Theorem 1 (Fixed-Confidence Upper Bound). With probability at least $1 - \delta$, algorithm CoopKernelFC returns the correct answers $x^*_v$ for all $v \in [V]$, with per-agent sample complexity

N = O\Big( \frac{\rho^*}{V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \Big)

and communication rounds $O(\log \Delta_{\min}^{-1})$.
Remark 1. $\rho^*$ comprises two sources of problem hardness: one from handling different tasks, and one from distinguishing different arms (we decompose the sample complexity into these two parts in Corollary 1(c)). The sample complexity of CoopKernelFC matches the lower bound up to logarithmic factors. For fully-collaborative instances, where single-agent algorithms [19, 23] have $\tilde{O}(\rho^* \log \delta^{-1})$ sample complexity, our CoopKernelFC achieves the maximum $V$-speedup (i.e., enjoys $\tilde{O}(\frac{\rho^*}{V} \log \delta^{-1})$ sample complexity) using only logarithmic communication rounds.
Interpretation. We further interpret Theorem 1 via standard expressive tools in kernel bandits [14, 35, 40], i.e.,
effective dimension and maximum information gain, to characterize the relationship between sample complexity and
data structures, and demonstrate how task similarity influences learning performance.
To this end, define the maximum information gain over all sample allocations $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ as

\Upsilon = \max_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big( I + \xi_*^{-1} K_\lambda \big).

Denote $\lambda^* = \mathrm{argmax}_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det( I + \xi_*^{-1} K_\lambda )$ and let $\alpha_1 \geq \dots \geq \alpha_{nV}$ be the eigenvalues of $K_{\lambda^*}$, and define the effective dimension of $K_{\lambda^*}$ as

d_{\mathrm{eff}} = \min\Big\{ k : k \, \xi_* \log(nV) \geq \sum_{i=k+1}^{nV} \alpha_i \Big\}.
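Both quantities can be computed directly from the eigenvalues of a weighted kernel matrix $K_\lambda$; the sketch below takes $K_\lambda$ as given (e.g., at the uniform allocation) instead of maximizing over $\lambda$, so it evaluates the information gain of one allocation rather than $\Upsilon$ itself.

```python
import numpy as np

def info_gain(K_lam, xi):
    # log det(I + xi^{-1} K_lambda)
    return float(np.linalg.slogdet(np.eye(len(K_lam)) + K_lam / xi)[1])

def effective_dimension(K_lam, xi):
    # smallest k with k * xi * log(nV) >= sum of the eigenvalues beyond k
    alpha = np.sort(np.linalg.eigvalsh(K_lam))[::-1]
    nV = len(alpha)
    for k in range(1, nV + 1):
        if k * xi * np.log(nV) >= alpha[k:].sum():
            return k
    return nV
```

For a low-rank kernel matrix the tail eigenvalues vanish, so the effective dimension stays near the rank even when $nV$ is large, which is the intuition behind Corollary 1(b).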
We then have the following corollary.
Corollary 1. The per-agent sample complexity of algorithm CoopKernelFC, denoted by $N$, can also be bounded as follows:

(a) $N = O\big( \frac{\Upsilon}{\Delta_{\min}^2 V} \cdot \zeta(\Delta_{\min}, \delta) \big)$, where $\Upsilon$ is the maximum information gain.

(b) $N = O\Big( \frac{d_{\mathrm{eff}}}{\Delta_{\min}^2 V} \cdot \log\big( nV \cdot \big( 1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d_{\mathrm{eff}}} \big) \big) \, \zeta(\Delta_{\min}, \delta) \Big)$, where $d_{\mathrm{eff}}$ is the effective dimension.

(c) $N = O\Big( \frac{\mathrm{rank}(K_z) \cdot \mathrm{rank}(K_x)}{\Delta_{\min}^2 V} \cdot \log\Big( \frac{\mathrm{Trace}(I + \xi_*^{-1} K_{\lambda^*})}{\mathrm{rank}(K_{\lambda^*})} \Big) \, \zeta(\Delta_{\min}, \delta) \Big)$.

Here $\zeta(\Delta_{\min}, \delta) = \log \Delta_{\min}^{-1} \big( \log\big( \frac{nV}{\delta} \big) + \log\log \Delta_{\min}^{-1} \big)$.
Remark 2. Corollary 1(a) shows that our sample complexity can be bounded by the maximum information gain of any sample allocation on $\tilde{\mathcal{X}}$, which extends conventional information-gain-based results for regret minimization in kernel bandits [13, 16, 35] to the pure exploration setting, from the viewpoint of experimental (allocation) design.

In terms of dimension dependency, Corollary 1(b) demonstrates that our result depends only on the effective dimension of the kernel representation, i.e., the number of principal directions over which the data projections in the RKHS spread.

We also provide, in Corollary 1(c), a fundamental decomposition of the sample complexity into two components, arising from task similarities and arm features: the more similar the tasks, the fewer samples we need to accomplish all of them. For example, when the tasks are the same (fully-collaborative), i.e., $\mathrm{rank}(K_z) = 1$, each agent spends only a $\frac{1}{V}$ fraction of the samples used by single-agent algorithms [10, 43]. Conversely, when the tasks are totally different, i.e., $\mathrm{rank}(K_z) = V$, no advantage can be attained by multi-agent deployment, since the information from neighboring agents is useless for solving local tasks.
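The rank decomposition behind Corollary 1(c) reflects the fact that the multi-task kernel factors into a task kernel $K_z$ and an arm kernel $K_x$, so the joint Gram matrix is a Kronecker product whose rank is the product of the factor ranks; a quick numerical check with synthetic PSD factors (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def psd(dim, rank):
    # random positive semi-definite matrix of the given rank
    B = rng.normal(size=(dim, rank))
    return B @ B.T

Kz = psd(5, 2)          # task-similarity kernel, rank 2
Kx = psd(6, 3)          # arm-feature kernel, rank 3
K = np.kron(Kz, Kx)     # joint kernel over (task, arm) pairs

r = np.linalg.matrix_rank
assert r(K) == r(Kz) * r(Kx)    # rank factorizes: 2 * 3 = 6
```

In the fully-collaborative extreme $\mathrm{rank}(K_z) = 1$ the joint rank collapses to $\mathrm{rank}(K_x)$, matching the $\frac{1}{V}$ sample saving discussed above.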
4.3 Lower Bound for Fixed-Confidence Setting
We now present lower bounds on the sample complexity and on the round-speedup for fully-collaborative instances, using novel measure transformation techniques. These bounds validate the optimality of CoopKernelFC in both sampling and communication. Specifically, Theorems 2 and 3 below formally state our bounds. In the theorems, we call a distributed algorithm $\mathcal{A}$ for CoPE-KB $\delta$-correct if it returns the correct answers $x^*_v$ for all $v \in [V]$ with probability at least $1 - \delta$.
Theorem 2 (Fixed-Confidence Sample Complexity Lower Bound). Consider the fixed-confidence collaborative pure exploration in kernel bandit problem with Gaussian noise $\eta_{v,t}$. Given any $\delta \in (0, 1)$, a $\delta$-correct distributed algorithm $\mathcal{A}$ must have per-agent sample complexity $\Omega(\frac{\rho^*}{V} \log \delta^{-1})$.
Remark 3. Theorem 2 shows that even if agents are allowed to share samples without limitation, each agent still requires at least $\tilde{\Omega}(\frac{\rho^*}{V})$ samples on average. Together with Theorem 1, this shows that CoopKernelFC is within logarithmic factors of optimal sampling.
Theorem 3 (Fixed-Confidence Round-Speedup Lower Bound). There exists a fully-collaborative instance of the fixed-confidence CoPE-KB problem with multi-armed and linear reward structures for which, given any $\delta \in (0, 1)$, a $\delta$-correct and $\beta$-speedup distributed algorithm $\mathcal{A}$ must use

\Omega\Big( \frac{\log \Delta_{\min}^{-1}}{\log(1 + \frac{V}{\beta}) + \log\log \Delta_{\min}^{-1}} \Big)

communication rounds in expectation. In particular, when $\beta = V$, $\mathcal{A}$ must use $\Omega\big( \frac{\log \Delta_{\min}^{-1}}{\log\log \Delta_{\min}^{-1}} \big)$ communication rounds in expectation.
Remark 4. Theorem 3 shows that logarithmic communication rounds are indispensable for achieving the full speedup, which validates that CoopKernelFC is near-optimal in communication. Moreover, when CoPE-KB reduces to the prior CoPE problem in the classic MAB setting [20, 38], i.e., all agents solve the same classic MAB task, our upper and lower bounds (Theorems 1 and 3) match the state-of-the-art results in [38].
Novel Analysis for the Fixed-Confidence Round-Speedup Lower Bound. Our round-speedup lower bound analysis for the FC setting has the following novel aspects. (i) Unlike prior CoPE work [38], which focuses on a preliminary 2-armed case without reward structures, we investigate multi-armed instances with high-dimensional linear reward structures. (ii) We develop a linear structured progress lemma (Lemma 3 in Appendix A.5), which handles the challenges caused by the different possible sample allocations over multiple arms and derives the required number of communication rounds under linear reward structures. (iii) We propose multi-armed measure transformation and linear structured instance transformation lemmas (Lemmas 4, 5 in Appendix A.5), which bound the change of probability measures under instance transformations with multiple arms and high-dimensional linear rewards, and serve as basic analytical tools in our proof.
5 FIXED-BUDGET COPE-KB
We now turn to the fixed-budget (FB) setting and design an efficient algorithm CoopKernelFB. We also establish a
fixed-budget round-speedup lower bound to validate its communication optimality.
Algorithm 2: Distributed Algorithm CoopKernelFB: for Agent $v$ ($v \in [V]$)
Input: per-agent budget $T$, arm sets $\tilde{\mathcal{X}}_1, \dots, \tilde{\mathcal{X}}_V$, kernel $K(\cdot, \cdot) : \tilde{\mathcal{X}} \times \tilde{\mathcal{X}} \mapsto \mathbb{R}$, regularization parameter $\xi_*$, rounding procedure $\mathrm{ROUND}_{\varepsilon}(\cdot, \cdot)$ with approximation parameter $\varepsilon$.
1 Initialization: $R \leftarrow \lceil \log_2(d(\tilde{\mathcal{X}})) \rceil$. $N \leftarrow \lfloor TV/R \rfloor$. $\mathcal{B}^{(1)}_{v'} \leftarrow \mathcal{X}_{v'}$ for all $v' \in [V]$. $t \leftarrow 1$; // pre-determine the number of phases and the number of samples for each phase
2 while $t \leq R$ and $\exists v' \in [V], |\mathcal{B}^{(t)}_{v'}| > 1$ do
3   Let $\lambda^*_t$ and $\rho^*_t$ be the optimal solution and optimal value of $\min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}^{(t)}_{v'}, v' \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x}) \phi(\tilde{x})^\top)^{-1}}$; // compute the optimal sample allocation
4   $(s_1, \dots, s_{N^{(t)}}) \leftarrow \mathrm{ROUND}_{\varepsilon}(\lambda^*_t, N^{(t)})$;
5   Let $\bar{s}^{(t)}_v$ be the sub-sequence of $(s_1, \dots, s_{N^{(t)}})$ which only contains the arms in $\tilde{\mathcal{X}}_v$; // generate the sample sequence for agent $v$
6   Pull arms $\bar{s}^{(t)}_v$ and observe random rewards $r^{(t)}_v$;
7   Broadcast $\{(N^{(t)}_{v,i}, \bar{y}^{(t)}_{v,i})\}_{i \in [n]}$;
8   Receive $\{(N^{(t)}_{v',i}, \bar{y}^{(t)}_{v',i})\}_{i \in [n]}$ from all other agents $v' \in [V] \setminus \{v\}$;
9   $k_t(x) \leftarrow [\sqrt{N^{(t)}_1} K(x, x_1), \dots, \sqrt{N^{(t)}_{nV}} K(x, x_{nV})]^\top$ for any $x \in \tilde{\mathcal{X}}$. $K^{(t)} \leftarrow [\sqrt{N^{(t)}_i N^{(t)}_j} K(x_i, x_j)]_{i, j \in [nV]}$. $\bar{y}^{(t)} \leftarrow [\sqrt{N^{(t)}_1} \bar{y}^{(t)}_1, \dots, \sqrt{N^{(t)}_{nV}} \bar{y}^{(t)}_{nV}]^\top$; // organize overall observation information
10  for all $v' \in [V]$ do
11    $\hat{f}_t(x) \leftarrow k_t(x)^\top (K^{(t)} + N^{(t)} \xi_* I)^{-1} \bar{y}^{(t)}$ for all $x \in \mathcal{B}^{(t)}_{v'}$; // estimate the rewards of alive arms
12    Sort all $x \in \mathcal{B}^{(t)}_{v'}$ by $\hat{f}_t(x)$ in decreasing order, and let $x_{(1)}, \dots, x_{(|\mathcal{B}^{(t)}_{v'}|)}$ denote the sorted arm sequence;
13    Let $i_{t+1}$ be the largest index such that $d(\{x_{(1)}, \dots, x_{(i_{t+1})}\}) \leq d(\mathcal{B}^{(t)}_{v'})/2$;
14    $\mathcal{B}^{(t+1)}_{v'} \leftarrow \{x_{(1)}, \dots, x_{(i_{t+1})}\}$; // cut the alive arm set down to half its dimension
15  $t \leftarrow t + 1$;
16 return $\mathcal{B}^{(t)}_1, \dots, \mathcal{B}^{(t)}_V$;
5.1 Algorithm CoopKernelFB
5.1.1 Algorithm. CoopKernelFB consists of three key steps: (i) pre-determine the numbers of phases and samples
according to data dimension, (ii) maintain alive arm sets for all agents, plan a globally optimal sample allocation, (iii)
communicate observation information and cut down alive arms to a half in the dimension sense.
The procedure of CoopKernelFB is given in Algorithm 2. During initialization, we determine the number of phases $R$ and the number of samples for each phase $N$ according to the principle dimension $d(\tilde{\mathcal{X}})$ (Line 1), defined for any $\tilde{\mathcal{S}} \subseteq \tilde{\mathcal{X}}$ as

d(\tilde{\mathcal{S}}) = \min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{S}}} \|\phi(x_i) - \phi(x_j)\|^2_{(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x}) \phi(\tilde{x})^\top)^{-1}},

i.e., the principle dimension of the data projections of $\tilde{\mathcal{S}}$ in the RKHS. In each phase $t$, each agent $v$ maintains alive arm sets $\mathcal{B}^{(t)}_{v'}$ for all agents $v' \in [V]$, and solves an integrated optimization to obtain a globally optimal sample allocation $\lambda^*_t$ (Line 3). Then, she generates a sample sequence $(s^{(t)}_1, \dots, s^{(t)}_N)$ according to $\lambda^*_t$, and selects the sub-sequence $\bar{s}^{(t)}_v$ that contains only her available arms to perform sampling (Lines 4-5). During communication, she sends and receives only the number of samples $N^{(t)}_{v,i}$ and the average observed reward $\bar{y}^{(t)}_{v,i}$ for each arm to and from the other agents (Lines 7-8). Using
the shared information, she estimates the rewards of the alive arms and selects only the best half of them, in the dimension sense, to enter the next phase (Lines 11-14).
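The elimination loop of CoopKernelFB can be skeletonized as follows; for brevity the principle dimension $d(\mathcal{S})$ is replaced by the set size $|\mathcal{S}|$ (a deliberate simplification, since computing $d(\mathcal{S})$ requires the min-max design of Line 3), and the per-phase reward estimates are passed in as a function.

```python
import math

def coop_halving(arms, estimates_for_phase):
    """Successive-halving skeleton of Algorithm 2, with d(S) simplified to |S|.

    arms:                list of arm ids
    estimates_for_phase: function phase -> dict arm -> estimated reward
                         (in the real algorithm, Eq. (5) on that phase's data)
    """
    alive = list(arms)
    R = math.ceil(math.log2(len(arms)))          # number of phases (Line 1)
    for t in range(1, R + 1):
        if len(alive) <= 1:
            break
        est = estimates_for_phase(t)
        alive.sort(key=lambda a: est[a], reverse=True)   # Lines 12-14
        alive = alive[: max(1, len(alive) // 2)]
    return alive
```

With exact estimates, the best arm is never eliminated and therefore survives all $R$ phases.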
5.1.2 Computation and Communication Efficiency. CoopKernelFB also adopts the efficient kernelized optimization solver (Eqs. (6),(7)) to solve the min-max optimization in Line 3, and employs the kernelized estimator (Eq. (4)) to estimate the rewards in Line 11. Moreover, CoopKernelFB spends only $\mathrm{Poly}(nV)$ computation time and $O(nV)$-bit communication cost.
5.2 Theoretical performance of CoopKernelFB
We present the error probability of CoopKernelFB in the following theorem, where $\lambda_u = \frac{1}{nV} \mathbf{1}$.
Theorem 4 (Fixed-Budget Upper Bound). Suppose that $d(\{x^*_v, x\}) \geq 1$ for any $x \in \mathcal{X}_v \setminus \{x^*_v\}$, $v \in [V]$, that $T = \Omega(\rho^* \log(d(\tilde{\mathcal{X}})))$, and that the regularization parameter $\xi_* > 0$ satisfies $\sqrt{\xi_*} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{A(\xi_*, \lambda_u)^{-1}} \leq \frac{\Delta_{\min}}{2(1+\varepsilon)B}$. With at most $T$ samples per agent, CoopKernelFB returns the correct answers $x^*_v$ for all $v \in [V]$, with error probability

Err = O\Big( n^2 V \log(d(\tilde{\mathcal{X}})) \cdot \exp\Big( -\frac{T V}{\rho^* \log(d(\tilde{\mathcal{X}}))} \Big) \Big)

and communication rounds $O(\log(d(\tilde{\mathcal{X}})))$.
Remark 5. Theorem 4 implies that, to guarantee an error probability $\delta$, CoopKernelFB requires only $O\big( \frac{\rho^* \log(d(\tilde{\mathcal{X}}))}{V} \log\big( \frac{n^2 V \log(d(\tilde{\mathcal{X}}))}{\delta} \big) \big)$ samples, which matches the sample complexity lower bound (Theorem 2) up to logarithmic factors. In addition, CoopKernelFB attains the maximum $V$-speedup for fully-collaborative instances with only logarithmic communication rounds, which also matches the round-speedup lower bound (Theorem 5) within double logarithmic factors.
Technical Novelty in the Error Probability Analysis. Our analysis extends the prior single-agent analysis [23] to the multi-agent setting. The single-agent analysis in [23] uses only a single universal Gaussian-process concentration bound. Instead, we establish novel estimate concentrations and high-probability events for each arm pair and each agent to handle the distributed environment, and build a connection between the principle dimension $d(\mathcal{B}^{(t)}_v)$ and the problem hardness $\rho^*$ via the elimination rules (Lines 13-14) to guarantee identification correctness.
Interpretation. Similar to Corollary 1, we can also interpret the error probability result with the standard tools of maximum information gain and effective dimension in kernel bandits [14, 35, 40], and decompose the error probability into two components arising from task similarities and arm features.
Corollary 2. The error probability of algorithm CoopKernelFB, denoted by $Err$, can also be bounded as follows:

(a) $Err = O\Big( \exp\Big( -\frac{T V \Delta_{\min}^2}{\Upsilon \log(d(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(d(\tilde{\mathcal{X}})) \Big)$, where $\Upsilon$ is the maximum information gain.

(b) $Err = O\Big( \exp\Big( -\frac{T V \Delta_{\min}^2}{d_{\mathrm{eff}} \log\big( nV \cdot \big( 1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d_{\mathrm{eff}}} \big) \big) \log(d(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(d(\tilde{\mathcal{X}})) \Big)$, where $d_{\mathrm{eff}}$ is the effective dimension.

(c) $Err = O\Big( \exp\Big( -\frac{T V \Delta_{\min}^2}{\mathrm{rank}(K_z) \cdot \mathrm{rank}(K_x) \log\big( \frac{\mathrm{Trace}(I + \xi_*^{-1} K_{\lambda^*})}{\mathrm{rank}(K_{\lambda^*})} \big) \log(d(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(d(\tilde{\mathcal{X}})) \Big)$.
Remark 6. Corollaries 2(a) and 2(b) bound the error probability by the maximum information gain and the effective dimension, respectively, which capture the essential structure of tasks and arm features and depend only on the effective dimension of the kernelized feature space. Furthermore, Corollary 2(c) exhibits how task similarities influence the error probability. For example, in the fully-collaborative case, where $\mathrm{rank}(K_z) = 1$, the error probability enjoys an exponential decay factor of $V$ compared to conventional single-agent results [23] (i.e., achieves a $V$-speedup). Conversely, when the tasks are totally different, with $\mathrm{rank}(K_z) = V$, the error probability degenerates to conventional single-agent results [23], since information sharing brings no benefit in this case.
5.3 Lower Bound for Fixed-Budget Setting
In this subsection, we establish a round-speedup lower bound for the FB setting.
Theorem 5 (Fixed-Budget Round-Speedup Lower Bound). There exists a fully-collaborative instance of the fixed-budget CoPE-KB problem with multi-armed and linear reward structures for which, given any $\beta \in [\frac{V}{\log(d(\tilde{\mathcal{X}}))}, V]$, a $\beta$-speedup distributed algorithm $\mathcal{A}$ must use

\Omega\Big( \frac{\log(d(\tilde{\mathcal{X}}))}{\log(\frac{V}{\beta}) + \log\log(d(\tilde{\mathcal{X}}))} \Big)

communication rounds in expectation. In particular, when $\beta = V$, $\mathcal{A}$ must use $\Omega\big( \frac{\log(d(\tilde{\mathcal{X}}))}{\log\log(d(\tilde{\mathcal{X}}))} \big)$ communication rounds in expectation.
Remark 7. Theorem 5 shows that, under the FB setting, achieving the full speedup requires at least logarithmically many communication rounds with respect to the principle dimension $d(\tilde{\mathcal{X}})$, which validates the communication optimality of CoopKernelFB. In the degenerate case where all agents solve the same unstructured pure exploration problem, as in the prior classic MAB setting [20, 38], both our upper (Theorem 4) and lower (Theorem 5) bounds match the state-of-the-art results in [38].
Novel Analysis for the Fixed-Budget Round-Speedup Lower Bound. Different from the FC setting, here we adapt the proof idea of prior limited-adaptivity work [1] to establish a non-trivial lower bound analysis in a Bayesian environment, and perform instance transformations by changing the data dimension instead of tuning the reward gaps. In our analysis, we employ novel techniques to calculate the information entropy and support size of posterior reward distributions, in order to build an induction over rounds and derive the required number of communication rounds.
6 EXPERIMENTS
In this section, we conduct experiments to validate the empirical performance of our algorithms. In our experiments, we set $V = 5$, $n = 4$, $d = 6$, $\delta = 0.005$ and $\phi(x) = Ix$ for any $x \in \tilde{\mathcal{X}}$. The entries of $\theta^*$ form an arithmetic sequence starting from $0.1$ with common difference $\Delta_{\min}$, i.e., $\theta^* = [0.1, 0.1 + \Delta_{\min}, \dots, 0.1 + (d-1)\Delta_{\min}]^\top$. For the FC setting, we vary the gap $\Delta_{\min} \in [0.1, 0.8]$ to generate different instances (points), and run 50 independent simulations to plot the average sample complexity with 95% confidence intervals. For the FB setting, we vary the budget $T \in [7000, 300000]$ to obtain different instances, and perform 100 independent runs to report the error probability across runs. The specific values of the gap $\Delta_{\min}$ and the budget $T$ are shown on the X-axes of the figures.
Fixed-Confidence. In the FC setting (Figures 2(a)-2(c)), we compare CoopKernelFC with five baselines: CoopKernel-IndAlloc is an ablation variant of CoopKernelFC which individually calculates sample allocations for different agents. IndRAGE [19], IndALBA [34] and IndPolyALBA [15] are single-agent algorithms, which use $V$ copies of the single-agent RAGE [19],
[Figure 2 here: six panels. (a) FC, $\mathrm{rank}(K_z) = 1$ (fully-collaborative); (b) FC, $1 < \mathrm{rank}(K_z) < V$; (c) FC, $\mathrm{rank}(K_z) = V$: sample complexity vs. gap $\Delta_{\min}$ for CoopKernel (ours), CoopKernel-IndAlloc, IndRAGE, IndRAGE/$V$, IndALBA, IndPolyALBA. (d) FB, $\mathrm{rank}(K_z) = 1$ (fully-collaborative); (e) FB, $1 < \mathrm{rank}(K_z) < V$; (f) FB, $\mathrm{rank}(K_z) = V$: error probability vs. budget $T$ for CoopKernelFB (ours), CoopKernelFB-IndAlloc, IndRAGE-FB, IndUniformFB.]
Fig. 2. Experimental results for FC and FB settings.
ALBA [34] and PolyALBA [15] policies to solve the $V$ tasks independently. IndRAGE/$V$ is a $V$-speedup baseline, which divides the sample complexity of the best single-agent algorithm, IndRAGE, by $V$. One can see that CoopKernelFC achieves the best sample complexity in Figures 2(a),2(b), which demonstrates the effectiveness of our sample allocation and cooperation scheme. Moreover, the empirical results reflect the impact of task similarities on the learning speedup and are consistent with our theoretical analysis. Specifically, in the fully-collaborative case (Figure 2(a)), CoopKernelFC matches IndRAGE/$V$, since it attains the $V$-speedup; in the intermediate case ($1 < \mathrm{rank}(K_z) < V$, Figure 2(b)), the curve of CoopKernelFC lies between IndRAGE/$V$ and IndRAGE, since it achieves a speedup smaller than $V$ due to the decreased task similarity; in the totally-different-task case ($\mathrm{rank}(K_z) = V$, Figure 2(c)), CoopKernelFC performs similarly to the single-agent algorithm IndRAGE, since information sharing among agents brings no advantage in this case.
Fixed-Budget. In the FB setting (Figures 2(d)-2(f)), we compare CoopKernelFB with three baselines: CoopKernelFB-IndAlloc is an ablation variant of CoopKernelFB in which agents calculate and use different sample allocations. IndPeaceFB [23] and IndUniformFB solve the $V$ tasks independently by calling $V$ copies of the single-agent PeaceFB [23] and uniform sampling policies, respectively. As shown in Figures 2(d),2(e), our CoopKernelFB enjoys a lower error probability than all other algorithms. In addition, these empirical results also validate the influence of task similarities on learning performance and match our theoretical analysis. Specifically, as the task similarity decreases from Figure 2(d) to Figure 2(f), the error probability of CoopKernelFB gets closer to that of the single-agent baselines, due to the slow-down of its learning speedup.
7 CONCLUSION
In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem with Fixed-
Confidence (FC) and Fixed-Budget (FB) settings. CoPE-KB aims to coordinate multiple agents to identify best arms with
general reward functions. We design two computation- and communication-efficient algorithms, CoopKernelFC and CoopKernelFB, based on novel kernelized estimators. Matching upper and lower bounds are established to demonstrate the statistical and communication optimality of our algorithms. Our theoretical results explicitly characterize the impact of task similarities on the learning speedup and avoid heavy dependence on the high dimension of the kernelized feature space. In our analysis, we also develop novel analytical techniques, including data dimension decomposition, linear structured instance transformation and (communication) round-speedup induction, which are applicable to other bandit problems and may be of independent interest.
REFERENCES
[1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. 2017. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory. PMLR, 39-75.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. 2021. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming 186, 1 (2021), 439-478.
[3] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. 2010. Best arm identification in multi-armed bandits. In Conference on Learning Theory. Citeseer, 41-53.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235-256.
[5] Ilai Bistritz and Amir Leshem. 2018. Distributed multi-player bandits - a Game of Thrones approach. In Advances in Neural Information Processing Systems. 7222-7232.
[6] Sébastien Bubeck and Thomas Budzinski. 2020. Coordination without communication: optimal regret in two players multi-armed bandits. In Conference on Learning Theory. PMLR, 916-939.
[7] Sébastien Bubeck, Thomas Budzinski, and Mark Sellke. 2021. Cooperative and stochastic multi-player multi-armed bandit: Optimal regret with neither communication nor collisions. In Conference on Learning Theory. PMLR, 821-822.
[8] Sébastien Bubeck and Nicolo Cesa-Bianchi. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5, 1 (2012), 1-122.
[9] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. 2013. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning. PMLR, 258-265.
[10] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. 2021. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning. PMLR, 1227-1237.
[11] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. 2017. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In Proceedings of the International Joint Conference on Artificial Intelligence. 164-170.
[12] Lijie Chen, Jian Li, and Mingda Qiao. 2017. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory. PMLR, 535-592.
[13] Sayak Ray Chowdhury and Aditya Gopalan. 2017. On kernelized multi-armed bandits. In International Conference on Machine Learning. PMLR, 844-853.
[14] Aniket Anand Deshmukh, Urun Dogan, and Clayton Scott. 2017. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems. 4851-4859.
[15] Yihan Du, Yuko Kuroki, and Wei Chen. 2021. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 7262-7270.
[16] Abhimanyu Dubey et al. 2020. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning. 2740-2750.
[17] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. The Journal of Machine Learning Research 20, 1 (2019), 1997-2017.
[18] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. 2006. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research 7, 6 (2006).
[19] Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. 2019. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems 32 (2019), 10667-10677.
[20] Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. 2013. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems, Vol. 26. 854-862.
[21] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. 2012. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, Vol. 12. 655-662.
[22] Nikolai Karpov, Qin Zhang, and Yuan Zhou. 2020. Collaborative top distribution identifications with limited interaction. In IEEE 61st Annual Symposium on Foundations of Computer Science. 160-171.
[23] Julian Katz-Samuels, Lalit Jain, Kevin G Jamieson, et al. 2020. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. In Advances in Neural Information Processing Systems, Vol. 33.
[24] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. 2016. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research 17, 1 (2016), 1-42.
[25] Nathan Korda, Balazs Szorenyi, and Shuai Li. 2016. Distributed clustering of linear bandits in peer to peer networks. In International Conference on Machine Learning. PMLR, 1301-1309.
[26] Andreas Krause and Cheng Soon Ong. 2011. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems. 2447-2455.
[27] Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1 (1985), 4-22.
[28] Tor Lattimore and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press.
[29] Keqin Liu and Qing Zhao. 2010. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58, 11 (2010), 5667-5681.
[30] Zhenhua Liu, Minghong Lin, Adam Wierman, Steven H Low, and Lachlan LH Andrew. 2011. Geographical load balancing with renewables. 39, 3 (2011), 62-66.
[31] Trung T. Nguyen. 2021. On the edge and cloud: Recommendation systems with distributed machine learning. In 2021 International Conference on Information Technology (ICIT). 929-934. https://doi.org/10.1109/ICIT52682.2021.9491121
[32] Jonathan Rosenski, Ohad Shamir, and Liran Szlak. 2016. Multi-player bandits - a musical chairs approach. In International Conference on Machine Learning. PMLR, 155-163.
[33] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[34] Marta Soare, Alessandro Lazaric, and Rémi Munos. 2014. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, Vol. 27. 828-836.
[35] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
[36] Balazs Szorenyi, Róbert Busa-Fekete, István Hegedus, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. 2013. Gossip-based distributed stochastic bandit algorithms. In International Conference on Machine Learning. PMLR, 19-27.
[37] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. 2013. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 1587-1594.
[38] Chao Tao, Qin Zhang, and Yuan Zhou. 2019. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In IEEE 60th Annual Symposium on Foundations of Computer Science. 126-146.
[39] William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285-294.
[40] Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. 2013. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence.
[41] Sofía S Villar, Jack Bowden, and James Wason. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: A Review Journal of the Institute of Mathematical Statistics 30, 2 (2015), 199.
[42] Grace Wahba. 1990. Spline Models for Observational Data. SIAM.
[43] Yinglun Zhu, Dongruo Zhou, Ruoxi Jiang, Quanquan Gu, Rebecca Willett, and Robert Nowak. 2021. Pure exploration in kernel and neural bandits. arXiv preprint arXiv:2106.12034 (2021).
[44] Ling Zhuo, Cho-Li Wang, and Francis C.M. Lau. 2003. Document replication and distribution in extensible geographically distributed web servers. J. Parallel and Distrib. Comput. 63, 10 (2003), 927-944.
APPENDIX
A PROOFS FOR THE FIXED-CONFIDENCE SETTING
A.1 Kernelized Computations in Algorithm CoopKernelFC
Kernelized Condition for Regularization Parameter $\xi_t$. We first introduce how to compute the condition Eq. (3) for the regularization parameter $\xi_t$ in Line 4 of Algorithm 1.

Let $\Phi_\lambda = [\sqrt{\lambda_1}\,\phi(x_1)^\top; \dots; \sqrt{\lambda_{nV}}\,\phi(x_{nV})^\top]$ and $K_\lambda = \Phi_\lambda \Phi_\lambda^\top = [\sqrt{\lambda_i \lambda_j}\, K(x_i, x_j)]_{i,j \in [nV]}$ for any $\lambda \in \triangle_{\tilde{\mathcal{X}}}$. Let $k_\lambda(x) = \Phi_\lambda \phi(x) = [\sqrt{\lambda_1}\, K(x, x_1), \dots, \sqrt{\lambda_{nV}}\, K(x, x_{nV})]^\top$ for any $\lambda \in \triangle_{\tilde{\mathcal{X}}}$, $x \in \tilde{\mathcal{X}}$. Since
$$\big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)\,\phi(x) = \xi_t \phi(x) + \Phi_\lambda^\top k_\lambda(x)$$
for any $x \in \tilde{\mathcal{X}}$, we have
\begin{align*}
\phi(x) &= \xi_t \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \Phi_\lambda^\top k_\lambda(x) \\
&= \xi_t \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \Phi_\lambda^\top \big(\xi_t I + K_\lambda\big)^{-1} k_\lambda(x).
\end{align*}
Thus,
$$\phi(x_i) - \phi(x_j) = \xi_t \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \big(\phi(x_i) - \phi(x_j)\big) + \Phi_\lambda^\top \big(\xi_t I + K_\lambda\big)^{-1} \big(k_\lambda(x_i) - k_\lambda(x_j)\big).$$
Multiplying $\big(\phi(x_i) - \phi(x_j)\big)^\top$ on both sides, we have
$$\big(\phi(x_i) - \phi(x_j)\big)^\top \big(\phi(x_i) - \phi(x_j)\big) = \xi_t \big(\phi(x_i) - \phi(x_j)\big)^\top \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \big(\phi(x_i) - \phi(x_j)\big) + \big(k_\lambda(x_i) - k_\lambda(x_j)\big)^\top \big(\xi_t I + K_\lambda\big)^{-1} \big(k_\lambda(x_i) - k_\lambda(x_j)\big).$$
Thus,
\begin{align*}
\|\phi(x_i) - \phi(x_j)\|^2_{(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)^{-1}} &= \big(\phi(x_i) - \phi(x_j)\big)^\top \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \big(\phi(x_i) - \phi(x_j)\big) \\
&= \xi_t^{-1} \big(\phi(x_i) - \phi(x_j)\big)^\top \big(\phi(x_i) - \phi(x_j)\big) - \xi_t^{-1} \big(k_\lambda(x_i) - k_\lambda(x_j)\big)^\top \big(\xi_t I + K_\lambda\big)^{-1} \big(k_\lambda(x_i) - k_\lambda(x_j)\big) \\
&= \xi_t^{-1} \big( K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) \big) - \xi_t^{-1} \|k_\lambda(x_i) - k_\lambda(x_j)\|^2_{(\xi_t I + K_\lambda)^{-1}}.
\end{align*}
Let $\lambda^u = \frac{1}{nV} \mathbf{1}$ be the uniform distribution on $\tilde{\mathcal{X}}$. Then, the condition Eq. (3) for the regularization parameter $\xi_t$,
$$\sqrt{\xi_t} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{\big(\xi_t I + \sum_{x \in \tilde{\mathcal{X}}} \frac{1}{nV} \phi(x)\phi(x)^\top\big)^{-1}} \leq \frac{1}{(1+\varepsilon)B \cdot 2^{t+1}},$$
is equivalent to the following efficient kernelized statement:
$$\max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \sqrt{K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) - \|k_{\lambda^u}(x_i) - k_{\lambda^u}(x_j)\|^2_{(\xi_t I + K_{\lambda^u})^{-1}}} \leq \frac{1}{(1+\varepsilon)B \cdot 2^{t+1}}.$$
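The identity behind this equivalence can be sanity-checked numerically. The sketch below is not from the paper: it uses an explicit finite-dimensional feature map with a linear kernel and made-up sizes, and verifies that $\xi_t \|\phi(x_i)-\phi(x_j)\|^2_{(\xi_t I + \Phi_\lambda^\top\Phi_\lambda)^{-1}}$ equals the kernelized right-hand side.

```python
import numpy as np

# Hypothetical finite-dimensional check of the kernelized identity:
# xi * ||phi(x_i)-phi(x_j)||^2_{(xi I + Phi^T Phi)^{-1}}
#   = K(x_i,x_i) + K(x_j,x_j) - 2 K(x_i,x_j)
#     - ||k_lam(x_i) - k_lam(x_j)||^2_{(xi I + K_lam)^{-1}}
rng = np.random.default_rng(0)
n, d, xi = 6, 4, 0.3
X = rng.normal(size=(n, d))            # rows play the role of phi(x_i)
lam = np.full(n, 1.0 / n)              # uniform allocation lambda^u
Phi = np.sqrt(lam)[:, None] * X        # Phi_lambda
K = X @ X.T                            # linear kernel K(x_i, x_j)
K_lam = Phi @ Phi.T                    # [sqrt(lam_i lam_j) K(x_i, x_j)]
i, j = 0, 1
diff = X[i] - X[j]
lhs = xi * diff @ np.linalg.solve(xi * np.eye(d) + Phi.T @ Phi, diff)
k_diff = Phi @ diff                    # k_lam(x_i) - k_lam(x_j)
rhs = (K[i, i] + K[j, j] - 2 * K[i, j]
       - k_diff @ np.linalg.solve(xi * np.eye(n) + K_lam, k_diff))
assert np.isclose(lhs, rhs)
```

The agreement rests on the matrix identity $\xi(\xi I + A^\top A)^{-1} = I - A^\top(\xi I + AA^\top)^{-1}A$, which is exactly what the derivation above exploits.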
Kernelized Optimization Solver. Now we present the efficient kernelized solver for the following convex optimization in Line 4 of Algorithm 1:
$$\min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi_t, \lambda)^{-1}}, \qquad (10)$$
where $A(\xi, \lambda) = \xi I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}}\, \phi(x)\phi(x)^\top$ for any $\xi > 0$, $\lambda \in \triangle_{\tilde{\mathcal{X}}}$.

Define the function $h(\lambda) = \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi_t, \lambda)^{-1}}$, and define $x_i^*(\lambda), x_j^*(\lambda)$ as the optimal solution of $h(\lambda)$. Then, the gradient of $h(\lambda)$ with respect to $\lambda$ is
$$[\nabla_\lambda h(\lambda)]_{\tilde{x}} = -\Big( \big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top A(\xi_t, \lambda)^{-1} \phi(x) \Big)^2, \quad \forall \tilde{x} \in \tilde{\mathcal{X}}. \qquad (11)$$
Next, we show how to efficiently compute the gradient $[\nabla_\lambda h(\lambda)]_{\tilde{x}}$ with the kernel function $K(\cdot, \cdot)$. Since $(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)\phi(x) = \xi_t \phi(x) + \Phi_\lambda^\top k_\lambda(x)$ for any $x \in \tilde{\mathcal{X}}$, we have
$$\phi(x) = \xi_t \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \Phi_\lambda^\top \big(\xi_t I + K_\lambda\big)^{-1} k_\lambda(x).$$
Multiplying $\big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top$ on both sides, we have
$$\big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \phi(x) = \xi_t \big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) + \big(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\big)^\top \big(\xi_t I + K_\lambda\big)^{-1} k_\lambda(x).$$
Then,
\begin{align*}
&\big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \big(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\big)^{-1} \phi(x) \\
&= \xi_t^{-1} \big(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\big)^\top \phi(x) - \xi_t^{-1} \big(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\big)^\top \big(\xi_t I + K_\lambda\big)^{-1} k_\lambda(x) \\
&= \xi_t^{-1} \Big( K(x_i^*(\lambda), x) - K(x_j^*(\lambda), x) - \big(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\big)^\top \big(\xi_t I + K_\lambda\big)^{-1} k_\lambda(x) \Big). \qquad (12)
\end{align*}
Therefore, we can compute gradient โ_โ(_) (Eq. (11)) using the equivalent kernelized expression Eq. (12), and then the
optimization (Eq. (10)) can be efficiently solved by projected gradient descent.
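The projected-gradient scheme can be sketched as follows. This is a minimal finite-dimensional surrogate (explicit feature vectors in place of the kernel trick, illustrative step size and iteration count, and a textbook simplex projection), not the paper's implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1))[0][-1]
    tau = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def solve_design(X, xi=0.1, iters=200, step=0.05):
    """Minimize h(lam) = max_{i!=j} ||x_i - x_j||^2_{A(xi,lam)^{-1}} over the simplex."""
    n, d = X.shape
    lam = np.full(n, 1.0 / n)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for _ in range(iters):
        A_inv = np.linalg.inv(xi * np.eye(d) + X.T @ (lam[:, None] * X))
        # worst-case pair (x_i^*(lam), x_j^*(lam)) attaining h(lam)
        vals = [(X[i] - X[j]) @ A_inv @ (X[i] - X[j]) for i, j in pairs]
        i_s, j_s = pairs[int(np.argmax(vals))]
        w = A_inv @ (X[i_s] - X[j_s])
        grad = -(X @ w) ** 2          # Eq. (11): one entry per arm
        lam = project_simplex(lam - step * grad)
    return lam
```

In the kernelized setting, the inner products $(X @ w)$ would instead be computed through Eq. (12), so the iteration never touches the feature space explicitly.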
Innovative Kernelized Estimator. Finally, we explicate the innovative kernelized estimator of reward gaps in Line 13 of Algorithm 1, which plays an important role in boosting computation and communication efficiency.

Let $\hat{\theta}_t$ denote the minimizer of the following regularized least squares loss function:
$$\mathcal{L}(\theta) = N^{(t)} \xi_t \|\theta\|^2 + \sum_{j=1}^{N^{(t)}} \big( y_j - \phi(s_j)^\top \theta \big)^2.$$
Setting the derivative of $\mathcal{L}(\theta)$ equal to zero, we have
$$N^{(t)} \xi_t \hat{\theta}_t + \sum_{j=1}^{N^{(t)}} \phi(x_j)\phi(x_j)^\top \hat{\theta}_t = \sum_{j=1}^{N^{(t)}} \phi(x_j)\, y_j.$$
Rearranging the summation, we can obtain
$$N^{(t)} \xi_t \hat{\theta}_t + \Big( \sum_{i=1}^{nV} N_i^{(t)} \phi(x_i)\phi(x_i)^\top \Big) \hat{\theta}_t = \sum_{i=1}^{nV} N_i^{(t)} \phi(x_i)\, \bar{y}_i^{(t)}, \qquad (13)$$
where $N_i^{(t)}$ is the number of samples and $\bar{y}_i^{(t)}$ is the average observation on arm $x_i$ for any $i \in [nV]$. Let $\Phi_t = [\sqrt{N_1^{(t)}}\,\phi(x_1)^\top; \dots; \sqrt{N_{nV}^{(t)}}\,\phi(x_{nV})^\top]$, $K^{(t)} = \Phi_t \Phi_t^\top = [\sqrt{N_i^{(t)} N_j^{(t)}}\, K(x_i, x_j)]_{i,j \in [nV]}$ and $\bar{y}^{(t)} = [\sqrt{N_1^{(t)}}\, \bar{y}_1^{(t)}, \dots, \sqrt{N_{nV}^{(t)}}\, \bar{y}_{nV}^{(t)}]^\top$. Then, we can write Eq. (13) as
$$\big( N^{(t)} \xi_t I + \Phi_t^\top \Phi_t \big) \hat{\theta}_t = \Phi_t^\top \bar{y}^{(t)}.$$
Since $\big( N^{(t)} \xi_t I + \Phi_t^\top \Phi_t \big) \succ 0$ and $\big( N^{(t)} \xi_t I + \Phi_t \Phi_t^\top \big) \succ 0$,
\begin{align*}
\hat{\theta}_t &= \big( N^{(t)} \xi_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{y}^{(t)} \\
&= \Phi_t^\top \big( N^{(t)} \xi_t I + \Phi_t \Phi_t^\top \big)^{-1} \bar{y}^{(t)} \\
&= \Phi_t^\top \big( N^{(t)} \xi_t I + K^{(t)} \big)^{-1} \bar{y}^{(t)}.
\end{align*}
Let $k_t(x) = \Phi_t \phi(x) = [\sqrt{N_1^{(t)}}\, K(x, x_1), \dots, \sqrt{N_{nV}^{(t)}}\, K(x, x_{nV})]^\top$ for any $x \in \mathcal{X}$. Then, we obtain the efficient kernelized estimators of $f(x_i)$ and $f(x_i) - f(x_j)$ as
$$\hat{f}(x_i) = \phi(x_i)^\top \hat{\theta}_t = k_t(x_i)^\top \big( N^{(t)} \xi_t I + K^{(t)} \big)^{-1} \bar{y}^{(t)},$$
$$\hat{\Delta}(x_i, x_j) = \big( k_t(x_i) - k_t(x_j) \big)^\top \big( N^{(t)} \xi_t I + K^{(t)} \big)^{-1} \bar{y}^{(t)}.$$
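The gap estimator above can be sketched in a few lines. This is an illustrative implementation (function and variable names are ours, not the paper's): given the kernel matrix, per-arm sample counts $N_i^{(t)}$, and per-arm average observations $\bar{y}_i^{(t)}$, it computes $\hat{\Delta}(x_i, x_j)$ without ever forming $\hat{\theta}_t$ explicitly.

```python
import numpy as np

def gap_estimate(K, N_counts, ybar, xi_t, i, j):
    """Return Delta_hat(x_i,x_j) = (k_t(x_i)-k_t(x_j))^T (N xi_t I + K^(t))^{-1} ybar^(t)."""
    s = np.sqrt(N_counts.astype(float))
    K_t = np.outer(s, s) * K                 # K^(t) = [sqrt(N_i N_j) K(x_i, x_j)]
    y_t = s * ybar                           # ybar^(t) = [sqrt(N_i) ybar_i]
    N_total = N_counts.sum()
    k_diff = s * (K[i] - K[j])               # k_t(x_i) - k_t(x_j)
    return k_diff @ np.linalg.solve(N_total * xi_t * np.eye(len(s)) + K_t, y_t)
```

Note that only the $nV \times nV$ system $(N^{(t)}\xi_t I + K^{(t)})$ is solved, which is what makes the estimator dimension-free in the feature space.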
A.2 Proof of Theorem 1
Our proof of Theorem 1 adapts the analysis procedure of [19, 23] to the multi-agent setting.
For any $\lambda \in \triangle_{\tilde{\mathcal{X}}}$, let $\Phi_\lambda = [\sqrt{\lambda_1}\,\phi(x_1)^\top; \dots; \sqrt{\lambda_{nV}}\,\phi(x_{nV})^\top]$ and $\Phi_\lambda^\top \Phi_\lambda = \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}}\, \phi(x)\phi(x)^\top$. In order to prove Theorem 1, we first introduce the following lemmas.
Lemma 1 (Concentration). Defining the event
$$\mathcal{G} = \bigg\{ \Big| \big( \hat{f}_t(x_i) - \hat{f}_t(x_j) \big) - \big( f(x_i) - f(x_j) \big) \Big| < (1+\varepsilon) \cdot \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \cdot \bigg( \sqrt{\frac{2 \log\big( 2n^2 V / \delta_t \big)}{N_t}} + \sqrt{\xi_t} \cdot \|\theta^*\|_2 \bigg) \leq 2^{-t}, \ \forall x_i, x_j \in \mathcal{B}_v^{(t)}, \ \forall v \in [V], \ \forall t \geq 1 \bigg\},$$
we have
$$\Pr[\mathcal{G}] \geq 1 - \delta.$$
Proof of Lemma 1. Let $\hat{\theta}_t$ be the regularized least squares estimator of $\theta^*$ with samples $x_1, \dots, x_{N_t}$ and $\gamma_t = N_t \xi_t$. Recall that $\Phi_t = [\sqrt{N_1^{(t)}}\,\phi(x_1)^\top; \dots; \sqrt{N_{nV}^{(t)}}\,\phi(x_{nV})^\top]$ and $\Phi_t^\top \Phi_t = \sum_{i=1}^{nV} N_i^{(t)} \phi(x_i)\phi(x_i)^\top$. In addition, $\bar{y}^{(t)} = [\sqrt{N_1^{(t)}}\,\bar{y}_1^{(t)}, \dots, \sqrt{N_{nV}^{(t)}}\,\bar{y}_{nV}^{(t)}]^\top$ and $\hat{\theta}_t = \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{y}^{(t)}$.

Let $\bar{\eta}^{(t)} = [\sqrt{N_1^{(t)}}\,\bar{\eta}_1^{(t)}, \dots, \sqrt{N_{nV}^{(t)}}\,\bar{\eta}_{nV}^{(t)}]^\top$, where $\bar{\eta}_i^{(t)} = \bar{y}_i^{(t)} - \phi(x_i)^\top \theta^*$ denotes the average noise of the $N_i^{(t)}$ pulls on arm $x_i$ for any $i \in [nV]$. Then,
\begin{align*}
&\big( \hat{f}_t(x_i) - \hat{f}_t(x_j) \big) - \big( f(x_i) - f(x_j) \big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \hat{\theta}_t - \theta^* \big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \Big( \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{y}^{(t)} - \theta^* \Big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \Big( \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \big( \Phi_t \theta^* + \bar{\eta}^{(t)} \big) - \theta^* \Big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \Big( \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \Phi_t \theta^* + \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{\eta}^{(t)} - \theta^* \Big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \Big( \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \big( \Phi_t^\top \Phi_t + \gamma_t I \big) \theta^* + \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{\eta}^{(t)} - \theta^* - \gamma_t \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \theta^* \Big) \qquad (14) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{\eta}^{(t)} - \gamma_t \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \theta^*. \qquad (15)
\end{align*}
Since the mean of the first term is zero and its variance is bounded by
\begin{align*}
&\big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \Phi_t \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \big( \phi(x_i) - \phi(x_j) \big) \\
&\leq \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \big( \gamma_t I + \Phi_t^\top \Phi_t \big) \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \big( \phi(x_i) - \phi(x_j) \big) \\
&= \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \big( \phi(x_i) - \phi(x_j) \big) \\
&= \|\phi(x_i) - \phi(x_j)\|^2_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}},
\end{align*}
using the Hoeffding inequality, we have that with probability at least $1 - \frac{\delta_t}{n^2 V}$,
$$\Big| \big( \phi(x_i) - \phi(x_j) \big)^\top \big( \gamma_t I + \Phi_t^\top \Phi_t \big)^{-1} \Phi_t^\top \bar{\eta}^{(t)} \Big| < \|\phi(x_i) - \phi(x_j)\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \sqrt{2 \log\Big( \frac{2n^2 V}{\delta_t} \Big)}.$$
Thus, with probability at least $1 - \frac{\delta_t}{n^2 V}$,
\begin{align*}
&\Big| \big( \hat{f}_t(x_i) - \hat{f}_t(x_j) \big) - \big( f(x_i) - f(x_j) \big) \Big| \\
&< \|\phi(x_i) - \phi(x_j)\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \sqrt{2 \log\Big( \frac{2n^2 V}{\delta_t} \Big)} + \gamma_t \|\phi(x_i) - \phi(x_j)\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \|\theta^*\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \\
&\leq \|\phi(x_i) - \phi(x_j)\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \sqrt{2 \log\Big( \frac{2n^2 V}{\delta_t} \Big)} + \sqrt{\gamma_t} \cdot \|\phi(x_i) - \phi(x_j)\|_{(\gamma_t I + \Phi_t^\top \Phi_t)^{-1}} \|\theta^*\|_2 \\
&\overset{\text{(a)}}{\leq} (1+\varepsilon) \cdot \frac{\|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}}}{\sqrt{N_t}} \sqrt{2 \log\Big( \frac{2n^2 V}{\delta_t} \Big)} + \sqrt{\xi_t N_t} \cdot \frac{(1+\varepsilon) \cdot \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}}}{\sqrt{N_t}} \cdot \|\theta^*\|_2 \\
&= (1+\varepsilon) \cdot \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \sqrt{\frac{2 \log\big( 2n^2 V / \delta_t \big)}{N_t}} + (1+\varepsilon) \sqrt{\xi_t} \cdot \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \|\theta^*\|_2 \\
&\leq (1+\varepsilon) \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{B}}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \sqrt{\frac{2 \log\big( 2n^2 V / \delta_t \big)}{N_t}} \\
&\quad + (1+\varepsilon) \sqrt{\xi_t} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{B}}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \|\theta^*\|_2,
\end{align*}
where (a) is due to the rounding procedure.

According to the choice of $\xi_t$, it holds that
\begin{align*}
(1+\varepsilon) \sqrt{\xi_t} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{B}}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \|\theta^*\|_2 &\leq (1+\varepsilon) \sqrt{\xi_t} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{B}}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda^u}^\top \Phi_{\lambda^u})^{-1}} \|\theta^*\|_2 \\
&\leq (1+\varepsilon) \sqrt{\xi_t} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda^u}^\top \Phi_{\lambda^u})^{-1}} \cdot B \\
&\leq \frac{1}{2^{t+1}}.
\end{align*}
Thus, with probability at least $1 - \frac{\delta_t}{n^2 V}$,
\begin{align*}
\Big| \big( \hat{f}_t(x_i) - \hat{f}_t(x_j) \big) - \big( f(x_i) - f(x_j) \big) \Big| &< (1+\varepsilon) \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{B}}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{(\xi_t I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}} \sqrt{\frac{2 \log\big( 2n^2 V / \delta_t \big)}{N_t}} + \frac{1}{2^{t+1}} \\
&= \sqrt{\frac{2 (1+\varepsilon)^2 \rho_t^* \log\big( 2n^2 V / \delta_t \big)}{N_t}} + \frac{1}{2^{t+1}} \\
&\leq \frac{1}{2^{t+1}} + \frac{1}{2^{t+1}} = \frac{1}{2^t}.
\end{align*}
By a union bound over arms $x_i, x_j$, agents $v$ and phases $t$, we have
$$\Pr[\mathcal{G}] \geq 1 - \delta. \qquad \square$$
For any $t > 1$ and $v \in [V]$, let $\mathcal{S}_v^{(t)} = \{x \in \tilde{\mathcal{X}}_v : f(x_v^*) - f(x) \leq 2^{-t+2}\}$.

Lemma 2. Assume that event $\mathcal{G}$ occurs. Then, for any phase $t > 1$ and agent $v \in [V]$, we have that $x_v^* \in \mathcal{B}_v^{(t)}$ and $\mathcal{B}_v^{(t)} \subseteq \mathcal{S}_v^{(t)}$.
Proof of Lemma 2. We prove the first statement by induction.
To begin, for any $v \in [V]$, $x_v^* \in \mathcal{B}_v^{(1)}$ trivially holds.

Suppose that $x_v^* \in \mathcal{B}_v^{(t)}$ holds for any $v \in [V]$, and that there exists some $v' \in [V]$ such that $x_{v'}^* \notin \mathcal{B}_{v'}^{(t+1)}$. According to the elimination rule of algorithm CoopKernelFC, there exists some $x' \in \mathcal{B}_{v'}^{(t)}$ such that
$$\hat{f}_t(x') - \hat{f}_t(x_{v'}^*) \geq 2^{-t}.$$
Using Lemma 1, we have
$$f(x') - f(x_{v'}^*) > \hat{f}_t(x') - \hat{f}_t(x_{v'}^*) - 2^{-t} \geq 0,$$
which contradicts the definition of $x_{v'}^*$. Thus, we have that for any $v \in [V]$, $x_v^* \in \mathcal{B}_v^{(t+1)}$, which completes the proof of the first statement.
Now, we prove the second statement also by induction.
To begin, we prove that for any $v \in [V]$, $\mathcal{B}_v^{(2)} \subseteq \mathcal{S}_v^{(2)}$. Suppose that there exists some $v' \in [V]$ such that $\mathcal{B}_{v'}^{(2)} \nsubseteq \mathcal{S}_{v'}^{(2)}$. Then, there exists some $x' \in \mathcal{B}_{v'}^{(2)}$ such that $f(x_{v'}^*) - f(x') > 2^{-2+2} = 1$. Using Lemma 1, we have that at phase $t = 1$,
$$\hat{f}_t(x_{v'}^*) - \hat{f}_t(x') \geq f(x_{v'}^*) - f(x') - 2^{-1} > 1 - 2^{-1} = 2^{-1},$$
which implies that $x'$ should have been eliminated in phase $t = 1$, which gives a contradiction.

Suppose that $\mathcal{B}_v^{(t)} \subseteq \mathcal{S}_v^{(t)}$ ($t > 1$) holds for any $v \in [V]$, and that there exists some $v' \in [V]$ such that $\mathcal{B}_{v'}^{(t+1)} \nsubseteq \mathcal{S}_{v'}^{(t+1)}$. Then, there exists some $x' \in \mathcal{B}_{v'}^{(t+1)}$ such that $f(x_{v'}^*) - f(x') > 2^{-(t+1)+2} = 2 \cdot 2^{-t}$. Using Lemma 1, we have that at phase $t$,
$$\hat{f}_t(x_{v'}^*) - \hat{f}_t(x') \geq f(x_{v'}^*) - f(x') - 2^{-t} > 2 \cdot 2^{-t} - 2^{-t} = 2^{-t},$$
which implies that $x'$ should have been eliminated in phase $t$, which gives a contradiction. Thus, we complete the proof of Lemma 2. $\square$
Now we prove Theorem 1.
Proof of Theorem 1. We first prove the correctness.
Let $t^* = \lceil \log_2 \Delta_{\min}^{-1} \rceil + 1$ be the index of the last phase of algorithm CoopKernelFC. According to Lemma 2, when $t = t^*$, $\mathcal{B}_v^{(t)} = \{x_v^*\}$ holds for any $v \in [V]$, and thus algorithm CoopKernelFC returns the correct answer $x_v^*$ for all $v \in [V]$.

Next, we prove the sample complexity. In algorithm CoopKernelFC, the computation of $\lambda_t^*$, $\rho_t^*$ and $N_t$ is the same for all agents, and each agent $v$ just generates the partial samples that belong to its arm set $\tilde{\mathcal{X}}_v$ out of the total $N_t$ samples. Hence, to bound the overall sample complexity, it suffices to bound $\sum_{t=1}^{t^*} N_t$, and then we can obtain the per-agent sample complexity by dividing by $V$. Let $\varepsilon = 0.1$. We have
\begin{align*}
\sum_{t=1}^{t^*} N_t &= \sum_{t=1}^{t^*} \bigg( 8 (2^t)^2 (1+\varepsilon)^2 \rho_t^* \log\Big( \frac{2n^2 V}{\delta_t} \Big) + 1 \bigg) \\
&= \sum_{t=2}^{t^*} 8 (2^t)^2 \big( 2^{-t+2} \big)^2 \cdot \frac{(1+\varepsilon)^2 \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}_v^{(t)}, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(2^{-t+2})^2} \log\Big( \frac{4nV^2 t^2}{\delta} \Big) + c_1 + t^* \\
&\leq \sum_{t=2}^{t^*} \Bigg( 128 (1+\varepsilon)^2 \frac{\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \mathcal{B}_v^{(t)}, v \in [V]} \|\phi(x_v^*) - \phi(x)\|^2_{(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(2^{-t+2})^2} \log\Big( \frac{4nV^2 (t^*)^2}{\delta} \Big) \Bigg) + c_1 + t^* \\
&\leq \sum_{t=2}^{t^*} \Bigg( 128 (1+\varepsilon)^2 \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \mathcal{B}_v^{(t)}, v \in [V]} \frac{\|\phi(x_v^*) - \phi(x)\|^2_{(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(f(x_v^*) - f(x))^2} \log\Big( \frac{4nV^2 (t^*)^2}{\delta} \Big) \Bigg) + c_1 + t^* \\
&\leq \sum_{t=2}^{t^*} \Bigg( 128 (1+\varepsilon)^2 \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v, v \in [V]} \frac{\|\phi(x_v^*) - \phi(x)\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(f(x_v^*) - f(x))^2} \log\Big( \frac{4nV^2 (t^*)^2}{\delta} \Big) \Bigg) + c_1 + t^* \\
&\leq t^* \cdot \bigg( 128 (1+\varepsilon)^2 \rho^* \log\Big( \frac{4nV^2 (t^*)^2}{\delta} \Big) \bigg) + c_1 + t^* \\
&= O\bigg( \rho^* \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg).
\end{align*}
Thus, the per-agent sample complexity is bounded by
$$O\bigg( \frac{\rho^*}{V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg).$$
Since algorithm CoopKernelFC has at most $t^* = \lceil \log_2 \Delta_{\min}^{-1} \rceil + 1$ phases, the number of communication rounds is bounded by $O(\log \Delta_{\min}^{-1})$. $\square$
A.3 Proof of Corollary 1
Proof of Corollary 1. Recall that $K_\lambda = \Phi_\lambda \Phi_\lambda^\top$ and $\lambda^* = \operatorname{argmax}_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big( I + \xi_*^{-1} K_\lambda \big)$. We have that $\log\det\big( I + \xi_*^{-1} K_\lambda \big) = \log\det\big( I + \xi_*^{-1} \Phi_\lambda^\top \Phi_\lambda \big) = \log\det\big( I + \xi_*^{-1} \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}'}\, \phi(x')\phi(x')^\top \big)$. Then,
\begin{align*}
\rho^* &= \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v, v \in [V]} \frac{\|\phi(x_v^*) - \phi(x)\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(f(x_v^*) - f(x))^2} \\
&\leq \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v, v \in [V]} \frac{\|\phi(x_v^*) - \phi(x)\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{\Delta_{\min}^2} \\
&= \frac{1}{\Delta_{\min}^2} \cdot \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v, v \in [V]} \|\phi(x_v^*) - \phi(x)\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}} \\
&\leq \frac{1}{\Delta_{\min}^2} \cdot \min_{\lambda \in \triangle_{\mathcal{X}}} \Big( 2 \max_{\tilde{x} \in \tilde{\mathcal{X}}} \|\phi(x)\|_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}} \Big)^2 \\
&= \frac{4}{\Delta_{\min}^2} \cdot \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}} \\
&= \frac{4}{\Delta_{\min}^2} \cdot \max_{\tilde{x} \in \tilde{\mathcal{X}}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}} \\
&\overset{\text{(b)}}{=} \frac{4}{\Delta_{\min}^2} \cdot \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}},
\end{align*}
where (b) is due to Lemma 9.
Since $\lambda^*_{\tilde{x}} \|\phi(x)\|^2_{(\xi_* I + \sum_{\tilde{x}'} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top)^{-1}} \leq 1$ for any $x \in \mathcal{X}$,
\begin{align*}
\sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}'} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}} &\leq 2 \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \log\bigg( 1 + \lambda^*_{\tilde{x}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}'} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}} \bigg) \\
&\overset{\text{(c)}}{\leq} 2 \log \frac{\det\big( \xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}}\, \phi(x)\phi(x)^\top \big)}{\det(\xi_* I)} \\
&= 2 \log\det\Big( I + \xi_*^{-1} \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}}\, \phi(x)\phi(x)^\top \Big) \\
&= 2 \log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big),
\end{align*}
where (c) comes from Lemma 10.
Thus, we have
\begin{align*}
\rho^* &= \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v, v \in [V]} \frac{\|\phi(x_v^*) - \phi(x)\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(f(x_v^*) - f(x))^2} \\
&\leq \frac{4}{\Delta_{\min}^2} \cdot \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \|\phi(x)\|^2_{\big(\xi_* I + \sum_{\tilde{x}'} \lambda^*_{\tilde{x}'} \phi(x')\phi(x')^\top\big)^{-1}} \\
&\leq \frac{8}{\Delta_{\min}^2} \cdot \log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big). \qquad (16)
\end{align*}
In the following, we interpret the term $\log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big)$ using two standard expressive tools, i.e., the maximum information gain and the effective dimension, respectively.

Maximum Information Gain. Recall that the maximum information gain over all sample allocations $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ is defined as
$$\Upsilon = \max_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big( I + \xi_*^{-1} K_\lambda \big).$$
Then, using Eq. (16) and the definition of $\lambda^*$, the per-agent sample complexity is bounded by
\begin{align*}
O\bigg( \frac{\rho^*}{V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg) &= O\bigg( \frac{\log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big)}{\Delta_{\min}^2 V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg) \\
&= O\bigg( \frac{\Upsilon}{\Delta_{\min}^2 V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg).
\end{align*}
Effective Dimension. Recall that $\alpha_1 \geq \dots \geq \alpha_{nV}$ denote the eigenvalues of $K_{\lambda^*}$ in decreasing order. The effective dimension of $K_{\lambda^*}$ is defined as
$$d^{\mathrm{eff}} = \min\Big\{ k : k \xi_* \log(nV) \geq \sum_{i=k+1}^{nV} \alpha_i \Big\},$$
and it holds that $d^{\mathrm{eff}} \xi_* \log(nV) \geq \sum_{i=d^{\mathrm{eff}}+1}^{nV} \alpha_i$. Let $\varepsilon = d^{\mathrm{eff}} \xi_* \log(nV) - \sum_{i=d^{\mathrm{eff}}+1}^{nV} \alpha_i$, and thus $\varepsilon \leq d^{\mathrm{eff}} \xi_* \log(nV)$. Then, we have $\sum_{i=1}^{d^{\mathrm{eff}}} \alpha_i = \mathrm{Trace}(K_{\lambda^*}) - \sum_{i=d^{\mathrm{eff}}+1}^{nV} \alpha_i = \mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon$ and $\sum_{i=d^{\mathrm{eff}}+1}^{nV} \alpha_i = d^{\mathrm{eff}} \xi_* \log(nV) - \varepsilon$.
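The effective dimension is straightforward to compute from the eigenvalues of $K_{\lambda^*}$. The sketch below is illustrative (names are ours): it returns the smallest $k$ with $k\,\xi_*\log(nV) \geq \sum_{i>k}\alpha_i$.

```python
import numpy as np

def effective_dimension(eigvals, xi, nV):
    """Smallest k with k * xi * log(nV) >= sum of eigenvalues beyond the k-th."""
    a = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # alpha_1 >= ... >= alpha_nV
    tail = a.sum() - np.cumsum(a)                         # tail[k-1] = sum_{i=k+1} alpha_i
    for k in range(1, len(a) + 1):
        if k * xi * np.log(nV) >= tail[k - 1]:
            return k
    return len(a)
```

A large regularizer $\xi_*$ or a fast-decaying spectrum both shrink $d^{\mathrm{eff}}$, which is exactly the dependence the bound below exploits.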
\begin{align*}
\log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big) &= \log\Big( \Pi_{i=1}^{nV} \big( 1 + \xi_*^{-1} \alpha_i \big) \Big) \\
&= \log\Big( \Pi_{i=1}^{d^{\mathrm{eff}}} \big( 1 + \xi_*^{-1} \alpha_i \big) \cdot \Pi_{i=d^{\mathrm{eff}}+1}^{nV} \big( 1 + \xi_*^{-1} \alpha_i \big) \Big) \\
&\leq \log\Bigg( \Big( 1 + \xi_*^{-1} \cdot \frac{\mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon}{d^{\mathrm{eff}}} \Big)^{d^{\mathrm{eff}}} \Big( 1 + \xi_*^{-1} \cdot \frac{d^{\mathrm{eff}} \xi_* \log(nV) - \varepsilon}{nV - d^{\mathrm{eff}}} \Big)^{nV - d^{\mathrm{eff}}} \Bigg) \\
&\leq d^{\mathrm{eff}} \log\Big( 1 + \xi_*^{-1} \cdot \frac{\mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon}{d^{\mathrm{eff}}} \Big) + \log\Big( 1 + \frac{d^{\mathrm{eff}} \log(nV)}{nV - d^{\mathrm{eff}}} \Big)^{nV - d^{\mathrm{eff}}} \\
&= d^{\mathrm{eff}} \log\Big( 1 + \xi_*^{-1} \cdot \frac{\mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon}{d^{\mathrm{eff}}} \Big) + \log\Big( 1 + \frac{d^{\mathrm{eff}} \log(nV - d^{\mathrm{eff}} + d^{\mathrm{eff}})}{nV - d^{\mathrm{eff}}} \Big)^{nV - d^{\mathrm{eff}}} \\
&\overset{\text{(d)}}{\leq} d^{\mathrm{eff}} \log\Big( 1 + \xi_*^{-1} \cdot \frac{\mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon}{d^{\mathrm{eff}}} \Big) + \log\Big( 1 + \frac{d^{\mathrm{eff}} \log(nV + d^{\mathrm{eff}})}{nV} \Big)^{nV} \\
&= d^{\mathrm{eff}} \log\Big( 1 + \xi_*^{-1} \cdot \frac{\mathrm{Trace}(K_{\lambda^*}) - d^{\mathrm{eff}} \xi_* \log(nV) + \varepsilon}{d^{\mathrm{eff}}} \Big) + nV \log\Big( 1 + \frac{d^{\mathrm{eff}} \log(nV + d^{\mathrm{eff}})}{nV} \Big) \\
&\leq d^{\mathrm{eff}} \log\Big( 1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d^{\mathrm{eff}}} \Big) + d^{\mathrm{eff}} \log(nV + d^{\mathrm{eff}}) \\
&\leq d^{\mathrm{eff}} \log\bigg( 2nV \cdot \Big( 1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d^{\mathrm{eff}}} \Big) \bigg),
\end{align*}
where inequality (d) is due to the fact that $\big( 1 + \frac{d^{\mathrm{eff}} \log(x + d^{\mathrm{eff}})}{x} \big)^x$ is monotonically increasing with respect to $x \geq 1$.
Then, using Eq. (16), the per-agent sample complexity is bounded by
\begin{align*}
O\bigg( \frac{\rho^*}{V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg) &= O\bigg( \frac{\log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big)}{\Delta_{\min}^2 V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg) \\
&= O\bigg( \frac{d^{\mathrm{eff}}}{\Delta_{\min}^2 V} \cdot \log\bigg( nV \cdot \Big( 1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d^{\mathrm{eff}}} \Big) \bigg) \cdot \log \Delta_{\min}^{-1} \Big( \log\Big( \frac{nV}{\delta} \Big) + \log\log \Delta_{\min}^{-1} \Big) \bigg).
\end{align*}
Decomposition. Let $K = [K(x_i, x_j)]_{i,j \in [nV]}$, $K_z = [K_z(z_v, z_{v'})]_{v,v' \in [V]}$ and $K_x = [K_x(x_i, x_j)]_{i,j \in [nV]}$. Since the kernel function $K(\cdot, \cdot)$ is a Hadamard composition of $K_z(\cdot, \cdot)$ and $K_x(\cdot, \cdot)$, it holds that $\mathrm{rank}(K_{\lambda^*}) = \mathrm{rank}(K) \leq \mathrm{rank}(K_z) \cdot \mathrm{rank}(K_x)$.
\begin{align*}
\log\det\big( I + \xi_*^{-1} K_{\lambda^*} \big) &= \log\Big( \Pi_{i=1}^{nV} \big( 1 + \xi_*^{-1} \alpha_i \big) \Big) \\
&= \log\Big( \Pi_{i=1}^{\mathrm{rank}(K_{\lambda^*})} \big( 1 + \xi_*^{-1} \alpha_i \big) \Big) \\
&\leq \log\Bigg( \bigg( \frac{\sum_{i=1}^{\mathrm{rank}(K_{\lambda^*})} \big( 1 + \xi_*^{-1} \alpha_i \big)}{\mathrm{rank}(K_{\lambda^*})} \bigg)^{\mathrm{rank}(K_{\lambda^*})} \Bigg) \\
&= \mathrm{rank}(K_{\lambda^*}) \log\bigg( \frac{\sum_{i=1}^{\mathrm{rank}(K_{\lambda^*})} \big( 1 + \xi_*^{-1} \alpha_i \big)}{\mathrm{rank}(K_{\lambda^*})} \bigg) \\
&\leq \mathrm{rank}(K_z) \cdot \mathrm{rank}(K_x) \log\bigg( \frac{\mathrm{Trace}\big( I + \xi_*^{-1} K_{\lambda^*} \big)}{\mathrm{rank}(K_{\lambda^*})} \bigg).
\end{align*}
๐
(๐โ
๐ยท logฮโ1
min
(log
(๐๐
๐ฟ
)+ log logฮโ1
min
))=๐
ยฉยญยญยซlog det
(๐ผ + bโโ1๐พ_โ
)ฮ2
min๐
ยท logฮโ1
min
(log
(๐๐
๐ฟ
)+ log logฮโ1
min
)ยชยฎยฎยฌ=๐
(rank(๐พ๐ง) ยท rank(๐พ๐ฅ )
ฮ2
min๐
ยท log
(Trace
(๐ผ + bโ1
โ ๐พ_โ)
rank(๐พ_โ )
)ยท logฮโ1
min
(log
(๐๐
๐ฟ
)+ log logฮโ1
min
))โก
A.4 Proof of Theorem 2
Proof of Theorem 2. Our proof of Theorem 2 adapts the analysis procedure of [19] to the multi-agent setting.
Suppose that $\mathcal{A}$ is a $\delta$-correct algorithm for CoPE-KB. For any $i \in [nV]$, let $\nu_{\theta^*, i} = \mathcal{N}(\phi(x_i)^\top \theta^*, 1)$ denote the reward distribution of arm $x_i$, and let $\tau_i$ denote the number of times arm $x_i$ is pulled by algorithm $\mathcal{A}$. Let $\Theta = \{\theta \in \mathcal{H}_K : \exists v \in [V], \exists x \in \tilde{\mathcal{X}}_v \setminus \{x_v^*\}, \ (\phi(x_v^*) - \phi(x))^\top \theta < 0\}$.

Since $\mathcal{A}$ is $\delta$-correct, according to the "Change of Distribution" lemma (Lemma A.3) in [24], we have that for any $\theta \in \Theta$,
$$\sum_{i=1}^{nV} \mathbb{E}[\tau_i] \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i}) \geq \log(1/2.4\delta).$$
Thus, we have
$$\min_{\theta \in \Theta} \sum_{i=1}^{nV} \mathbb{E}[\tau_i] \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i}) \geq \log(1/2.4\delta).$$
Let $\tau^*$ be the optimal solution of the following optimization problem:
$$\min \sum_{i=1}^{nV} t_i \quad \text{s.t.} \quad \min_{\theta \in \Theta} \sum_{i=1}^{nV} t_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i}) \geq \log(1/2.4\delta).$$
Then, we have
$$\min_{\theta \in \Theta} \sum_{i=1}^{nV} \frac{t_i^*}{\sum_{i=1}^{nV} t_i^*} \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i}) \geq \frac{\log(1/2.4\delta)}{\sum_{i=1}^{nV} t_i^*} \geq \frac{\log(1/2.4\delta)}{\sum_{i=1}^{nV} \mathbb{E}[\tau_i]}.$$
Since $\sum_{i=1}^{nV} \frac{t_i^*}{\sum_{i=1}^{nV} t_i^*} = 1$,
$$\max_{\lambda \in \triangle_{\mathcal{X}}} \min_{\theta \in \Theta} \sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i}) \geq \frac{\log(1/2.4\delta)}{\sum_{i=1}^{nV} \mathbb{E}[\tau_i]}.$$
Thus, we have
$$\sum_{i=1}^{nV} \mathbb{E}[\tau_i] \geq \frac{\log(1/2.4\delta)}{\max_{\lambda \in \triangle_{\mathcal{X}}} \min_{\theta \in \Theta} \sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i})} = \log(1/2.4\delta) \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\theta \in \Theta} \frac{1}{\sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i})}. \qquad (17)$$
For $\lambda \in \triangle_{\mathcal{X}}$, let $A(\xi_*, \lambda) = \xi_* I + \sum_{i=1}^{nV} \lambda_i \phi(x_i)\phi(x_i)^\top$. For any $\lambda \in \triangle_{\mathcal{X}}$, $v \in [V]$, $j \in [nV]$ such that $x_j \in \tilde{\mathcal{X}}_v \setminus \{x_v^*\}$, define
$$\theta_j(\lambda) = \theta^* - \big( 2 (\phi(x_v^*) - \phi(x_j))^\top \theta^* \big) \cdot \frac{A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))}{(\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))}.$$
Here $(\phi(x_v^*) - \phi(x_j))^\top \theta_j(\lambda) = -(\phi(x_v^*) - \phi(x_j))^\top \theta^* < 0$, and thus $\theta_j(\lambda) \in \Theta$. The KL divergence between $\nu_{\theta^*, i}$ and $\nu_{\theta_j(\lambda), i}$ is
$$\mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta_j(\lambda), i}) = \frac{1}{2} \big( \phi(x_i)^\top (\theta^* - \theta_j(\lambda)) \big)^2 = \frac{2 \big( (\phi(x_v^*) - \phi(x_j))^\top \theta^* \big)^2 \big( (\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} \phi(x_i) \big)^2}{\big( (\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j)) \big)^2},$$
and thus,
\begin{align*}
\sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta_j(\lambda), i}) &= \frac{2 \big( (\phi(x_v^*) - \phi(x_j))^\top \theta^* \big)^2 (\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} \big( \sum_{i=1}^{nV} \lambda_i \phi(x_i)\phi(x_i)^\top \big) A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))}{\big( (\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j)) \big)^2} \\
&\leq \frac{2 \big( (\phi(x_v^*) - \phi(x_j))^\top \theta^* \big)^2}{(\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))} = \frac{2 \big( f(x_v^*) - f(x_j) \big)^2}{(\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))}. \qquad (18)
\end{align*}
Let $\mathcal{J} = \{j \in [nV] : x_j \neq x_v^*, \forall v \in [V]\}$ denote the set of indices of all sub-optimal arms. Then, plugging Eq. (18) into Eq. (17), we have
\begin{align*}
\sum_{i=1}^{nV} \mathbb{E}[\tau_i] &\geq \log(1/2.4\delta) \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\theta \in \Theta} \frac{1}{\sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta, i})} \\
&\geq \log(1/2.4\delta) \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{j \in \mathcal{J}} \frac{1}{\sum_{i=1}^{nV} \lambda_i \cdot \mathrm{KL}(\nu_{\theta^*, i}, \nu_{\theta_j(\lambda), i})} \\
&\geq \log(1/2.4\delta) \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x}_j \in \tilde{\mathcal{X}}_v \setminus \{x_v^*\}, v \in [V]} \frac{(\phi(x_v^*) - \phi(x_j))^\top A(\xi_*, \lambda)^{-1} (\phi(x_v^*) - \phi(x_j))}{2 \big( f(x_v^*) - f(x_j) \big)^2} \\
&= \frac{1}{2} \log(1/2.4\delta) \rho^*,
\end{align*}
which completes the proof of Theorem 2. $\square$
A.5 Proof of Theorem 3
Our proof of Theorem 3 generalizes the 2-armed lower bound analysis in [38] to the multi-armed case with linear reward structures. We first introduce some notation and definitions.

Consider a fully-collaborative instance $\mathcal{I}(\mathcal{X}, \theta^*)$ of the CoPE-KB problem, where $\tilde{\mathcal{X}} = \mathcal{X} = \mathcal{X}_v$, $x^* = x_v^*$ for all $v \in [V]$, and $\phi(x_i)^\top \theta^*$ is equal for all $x_i \neq x^*$. Let $\Delta = \Delta_{\min} = (\phi(x^*) - \phi(x_i))^\top \theta^*$ and $c = \frac{\phi(x_i)^\top \theta^*}{\Delta}$ for any $x_i \neq x^*$, where $c > 0$ is a constant. Then, we have $\frac{\phi(x^*)^\top \theta^*}{\Delta} = \frac{\phi(x_i)^\top \theta^* + \Delta}{\Delta} = 1 + c$.

For any integer $\alpha \geq 0$, let $\mathcal{E}(\alpha, m)$ be the event that $\mathcal{A}$ uses at least $\alpha$ communication rounds and at most $m$ samples before the end of the $\alpha$-th round, and let $\mathcal{E}_{+1}(\alpha, m)$ be the event that $\mathcal{A}$ uses at least $\alpha + 1$ communication rounds and at most $m$ samples before the end of the $\alpha$-th round. Let $T_{\mathcal{A}}$ and $T_{\mathcal{A}, x_i}$ denote the expected number of samples used by $\mathcal{A}$, and the expected number of samples used on arm $x_i$ by $\mathcal{A}$, respectively. Let $\lambda$ be the sample allocation of $\mathcal{A}$, i.e., $\lambda_i = \frac{T_{\mathcal{A}, x_i}}{T_{\mathcal{A}}}$. Let $c(\mathcal{X}, \theta^*) = \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{x \in \mathcal{X} \setminus \{x^*\}} \frac{\|\phi(x^*) - \phi(x)\|^2_{(\xi_* I + \sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top)^{-1}}}{(f(x^*) - f(x))^2}$ and $c(\mathcal{X}) = \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{x \in \mathcal{X} \setminus \{x^*\}} \|\phi(x^*) - \phi(x)\|^2_{(\xi_* I + \sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top)^{-1}}$. Then, we have $c(\mathcal{X}, \theta^*) = \frac{c(\mathcal{X})}{\Delta^2}$.

In order to prove Theorem 3, we first prove the following lemmas.
In order to prove Theorem 3, we first prove the following lemmas.
Lemma 3 (Linear Structured Progress Lemma). For any integer ๐ผ โฅ 0 and any ๐ โฅ 1, we have
Pr
I(X,\ โ)
[E+1
(๐ผ,๐ (X, \โ)๐๐
)]โฅ Pr
I(X,\ โ)
[E
(๐ผ,๐ (X, \โ)๐๐
)]โ 2๐ฟ โ 1
โ๐.
Proof of Lemma 3. Let $\mathcal{F}$ be the event that $\mathcal{A}$ uses exactly $\alpha$ communication rounds and at most $\frac{c(\mathcal{X}, \theta^*)}{qV}$ samples before the end of the $\alpha$-th round. Then, we have
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] \geq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}].$$
Thus, to prove Lemma 3, it suffices to prove
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}] \leq 2\delta + \frac{1}{\sqrt{q}}. \qquad (19)$$
We can decompose $\mathcal{F}$ as
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}] = \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}, \mathcal{A} \text{ returns } x^*] + \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}, \mathcal{A} \text{ does not return } x^*] \leq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}, \mathcal{A} \text{ returns } x^*] + \delta. \qquad (20)$$
Let $y_j = \phi(x^*) - \phi(x_j)$ for any $j \in [n]$. Let $\theta(\xi_*, \lambda) = \theta^* - \frac{2 (y_j^\top \theta^*)\, A(\xi_*, \lambda)^{-1} y_j}{y_j^\top A(\xi_*, \lambda)^{-1} y_j}$, where $A(\xi_*, \lambda) = \xi_* I + \sum_{i=1}^n \lambda_i \phi(x_i)\phi(x_i)^\top$ and $j = \operatorname{argmax}_{i \in [n]} \frac{y_i^\top A(\xi_*, \lambda)^{-1} y_i}{(y_i^\top \theta^*)^2}$. Let $\mathcal{I}(\mathcal{X}, \theta(\lambda))$ denote the instance where the underlying parameter is $\theta(\lambda)$. Under $\mathcal{I}(\mathcal{X}, \theta(\lambda))$, it holds that $y_j^\top \theta(\lambda) = -y_j^\top \theta^* < 0$, and thus $x^*$ is sub-optimal. Let $\mathcal{D}_{\mathcal{I}}$ denote the product distribution of instance $\mathcal{I}$ with at most $\frac{c(\mathcal{X}, \theta^*)}{q}$ samples over all agents.

Using Pinsker's inequality (Lemma 11) and the Gaussian KL divergence computation, we have
\begin{align*}
\|\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)} - \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta(\lambda))}\|_{TV} &\leq \sqrt{\frac{1}{2} \mathrm{KL}\big( \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)} \,\|\, \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta(\lambda))} \big)} \\
&\leq \sqrt{\frac{1}{4} \sum_{i \in [n]} \big( \phi(x_i)^\top (\theta^* - \theta(\lambda)) \big)^2 \cdot \lambda_i T_{\mathcal{A}}} \\
&= \sqrt{\frac{1}{4} \sum_{i \in [n]} \frac{4 (y_j^\top \theta^*)^2 \cdot y_j^\top A(\xi_*, \lambda)^{-1} \phi(x_i)\phi(x_i)^\top A(\xi_*, \lambda)^{-1} y_j}{(y_j^\top A(\xi_*, \lambda)^{-1} y_j)^2} \cdot \lambda_i T_{\mathcal{A}}} \\
&= \sqrt{\frac{T_{\mathcal{A}} (y_j^\top \theta^*)^2 \cdot y_j^\top A(\xi_*, \lambda)^{-1} \big( \sum_{i \in [n]} \lambda_i \phi(x_i)\phi(x_i)^\top \big) A(\xi_*, \lambda)^{-1} y_j}{(y_j^\top A(\xi_*, \lambda)^{-1} y_j)^2}} \\
&\leq \sqrt{\frac{T_{\mathcal{A}} (y_j^\top \theta^*)^2 \cdot y_j^\top A(\xi_*, \lambda)^{-1} \big( \xi_* I + \sum_{i \in [n]} \lambda_i \phi(x_i)\phi(x_i)^\top \big) A(\xi_*, \lambda)^{-1} y_j}{(y_j^\top A(\xi_*, \lambda)^{-1} y_j)^2}} \\
&= \sqrt{T_{\mathcal{A}} \cdot \frac{(y_j^\top \theta^*)^2}{y_j^\top A(\xi_*, \lambda)^{-1} y_j}} \\
&\leq \sqrt{\frac{c(\mathcal{X}, \theta^*)}{q} \cdot \frac{1}{\frac{y_j^\top A(\xi_*, \lambda)^{-1} y_j}{(y_j^\top \theta^*)^2}}} \\
&\leq \sqrt{\frac{c(\mathcal{X}, \theta^*)}{q} \cdot \frac{1}{\min_{\lambda \in \triangle_{\mathcal{X}}} \frac{y_j^\top A(\xi_*, \lambda)^{-1} y_j}{(y_j^\top \theta^*)^2}}} = \frac{1}{\sqrt{q}}. \qquad (21)
\end{align*}
Since $x^*$ is sub-optimal under $\mathcal{I}(\mathcal{X}, \theta(\lambda))$, using the measure change technique, we have
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}, \mathcal{A} \text{ returns } x^*] \leq \Pr_{\mathcal{I}(\mathcal{X}, \theta(\lambda))}[\mathcal{F}, \mathcal{A} \text{ returns } x^*] + \|\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)} - \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta(\lambda))}\|_{TV} \leq \delta + \frac{1}{\sqrt{q}}.$$
Plugging the above inequality into Eq. (20), we have
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{F}] \leq 2\delta + \frac{1}{\sqrt{q}},$$
which completes the proof of Lemma 3. $\square$
Let $\mathcal{I}(\mathcal{X}, \theta/\kappa)$ denote the instance where the underlying parameter is $\theta/\kappa$. Under $\mathcal{I}(\mathcal{X}, \theta/\kappa)$, the reward gap is $(\phi(x^*) - \phi(x_i))^\top \theta/\kappa = \frac{1}{\kappa}\Delta$ and the sample complexity is $c(\mathcal{X}, \theta/\kappa) = \frac{\kappa^2 c(\mathcal{X})}{\Delta^2}$. Let $\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}$ and $\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}$ denote the product distributions of instances $\mathcal{I}(\mathcal{X}, \theta^*)$ and $\mathcal{I}(\mathcal{X}, \theta/\kappa)$ with $T_{\mathcal{A}}$ samples, respectively.

Lemma 4 (Multi-armed Measure Transformation Lemma). Suppose that algorithm $\mathcal{A}$ uses $T_{\mathcal{A}} = \frac{c(\mathcal{X}, \theta^*)}{\zeta}$ samples over all agents on instance $\mathcal{I}(\mathcal{X}, \theta^*)$, where $\zeta \geq 100$. Then, for any event $E$ on $\mathcal{I}(\mathcal{X}, \theta^*)$ and any $m \geq \zeta$, we have
$$\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E] \leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E] \cdot \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nm)}{\zeta}} \bigg) + \frac{1}{nm^2}.$$
Proof. For any $i \in [n]$, let $W_{i,1}, \dots, W_{i, T_{\mathcal{A}, x_i}}$ denote the observed $T_{\mathcal{A}, x_i}$ samples on arm $x_i$, and define the event
$$L_i = \bigg\{ \sum_{t=1}^{T_{\mathcal{A}, x_i}} W_{i,t} \geq T_{\mathcal{A}, x_i} \cdot \phi(x_i)^\top \theta/\kappa + \frac{z}{\phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^*} \bigg\},$$
where $z \geq 0$ is a parameter specified later. We also define the event $L = \cap_{i \in [n]} L_i$. Then,
$$\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E] \leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E, L] + \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[\neg L].$$
Using the measure change technique, we bound the term $\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E, L]$ as
\begin{align*}
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E, L] &= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( -\frac{1}{2} \sum_{i \in [n]} \sum_{t=1}^{T_{\mathcal{A}, x_i}} \Big( \big( W_{i,t} - \phi(x_i)^\top \theta/\kappa \big)^2 - \big( W_{i,t} - \phi(x_i)^\top \theta^* \big)^2 \Big) \bigg) \\
&= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( -\frac{1}{2} \sum_{i \in [n]} \Big( \big( (\phi(x_i)^\top \theta/\kappa)^2 - (\phi(x_i)^\top \theta^*)^2 \big) T_{\mathcal{A}, x_i} - 2 \big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big) \sum_{t=1}^{T_{\mathcal{A}, x_i}} W_{i,t} \Big) \bigg) \\
&\leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( -\frac{1}{2} \sum_{i \in [n]} \Big( \big( (\phi(x_i)^\top \theta/\kappa)^2 - (\phi(x_i)^\top \theta^*)^2 \big) T_{\mathcal{A}, x_i} - 2 \big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big) \Big( T_{\mathcal{A}, x_i} \cdot \phi(x_i)^\top \theta/\kappa + \frac{z}{\phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^*} \Big) \Big) \bigg) \\
&= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( -\frac{1}{2} \sum_{i \in [n]} \Big( \big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big) \big( \phi(x_i)^\top \theta/\kappa + \phi(x_i)^\top \theta^* - 2\, \phi(x_i)^\top \theta/\kappa \big) T_{\mathcal{A}, x_i} - 2z \Big) \bigg) \\
&= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( -\frac{1}{2} \sum_{i \in [n]} \Big( -\big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big)^2 T_{\mathcal{A}, x_i} - 2z \Big) \bigg) \\
&= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( \sum_{i \in [n]} \Big( \frac{1}{2} \big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big)^2 T_{\mathcal{A}, x_i} + z \Big) \bigg) \\
&= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( \sum_{i \in [n]} \Big( \frac{1}{2} \Big( 1 - \frac{1}{\kappa} \Big)^2 \big( \phi(x_i)^\top \theta^* \big)^2 \cdot \frac{c(\mathcal{X})}{\Delta^2 \zeta} \cdot \lambda_i + z \Big) \bigg) \\
&\leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( \frac{(1+c)^2}{2} \cdot \frac{c(\mathcal{X})}{\zeta} + nz \bigg).
\end{align*}
Next, using the Chernoff-Hoeffding inequality, we bound the second term as
\begin{align*}
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[\neg L] &\leq \sum_{i \in [n]} \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[\neg L_i] \\
&\leq \sum_{i \in [n]} \exp\bigg( -\frac{2 z^2}{\big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big)^2} \cdot \frac{\zeta}{\lambda_i\, c(\mathcal{X}, \theta^*)} \bigg) \\
&= \sum_{i \in [n]} \exp\bigg( -\frac{2 z^2}{\big( 1 - \frac{1}{\kappa} \big)^2 \big( \phi(x_i)^\top \theta^* \big)^2} \cdot \frac{\Delta^2 \zeta}{\lambda_i\, c(\mathcal{X})} \bigg) \\
&\leq \sum_{i \in [n]} \exp\bigg( -\frac{2}{(1+c)^2} \cdot \frac{z^2 \zeta}{c(\mathcal{X})} \bigg) = n \cdot \exp\bigg( -\frac{2}{(1+c)^2} \cdot \frac{z^2 \zeta}{c(\mathcal{X})} \bigg).
\end{align*}
Thus,
$$\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E] \leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E, L] + \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[\neg L] \leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( \frac{(1+c)^2}{2} \cdot \frac{c(\mathcal{X})}{\zeta} + nz \bigg) + n \cdot \exp\bigg( -\frac{2}{(1+c)^2} \cdot \frac{z^2 \zeta}{c(\mathcal{X})} \bigg).$$
Let $z = \sqrt{\frac{(1+c)^2 c(\mathcal{X}) \log(nm)}{\zeta}}$. Then, we have
\begin{align*}
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}}[E] &\leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E, L] \cdot \exp\bigg( \frac{(1+c)^2}{2} \cdot \frac{c(\mathcal{X})}{\zeta} + n \sqrt{\frac{(1+c)^2 c(\mathcal{X}) \log(nm)}{\zeta}} \bigg) + n \cdot \exp\big( -2 \log(nm) \big) \\
&\leq \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)}}[E] \cdot \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nm)}{\zeta}} \bigg) + \frac{1}{nm^2}. \qquad \square
\end{align*}
Lemma 5 (Linear Structured Instance Transformation Lemma). For any integer $\alpha \geq 0$, $q \geq 100$ and $\kappa \geq 1$, we have
$$\Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}\bigg[ \mathcal{E}\Big( \alpha + 1, \frac{c(\mathcal{X}, \theta^*)}{qV} + \frac{c(\mathcal{X}, \theta^*)}{\beta} \Big) \bigg] \geq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] - \delta - \sqrt{\frac{(1+c)^2}{4} \cdot \frac{c(\mathcal{X})}{q}} - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV}.$$
Proof. Let $W = \{W_{i,1}, \dots, W_{i, T_{\mathcal{A}, x_i}}\}_{i \in [n]}$ denote the $T_{\mathcal{A}}$ samples of algorithm $\mathcal{A}$ on instance $\mathcal{I}(\mathcal{X}, \theta^*)$. Let $\mathcal{S}$ denote the set of all possible $w$, conditioned on which $\mathcal{E}_{+1}\big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \big)$ holds. Then, we have
$$\sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] = \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg].$$
For any agent $v \in [V]$, let $\mathcal{K}_v$ be the event that agent $v$ uses more than $\frac{c(\mathcal{X}, \theta^*)}{\beta}$ samples during the $(\alpha+1)$-st round. Conditioned on $w \in \mathcal{S}$, $\mathcal{K}_v$ only depends on the samples of agent $v$ during the $(\alpha+1)$-st round, and is independent of the other agents.

Using the facts that $\mathcal{A}$ is $\delta$-correct and $\beta$-speedup and that all agents have the same performance on fully-collaborative instances, we have
\begin{align*}
\delta &\geq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{A} \text{ uses more than } \frac{c(\mathcal{X}, \theta^*)}{\beta} \text{ samples} \bigg] \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_1 \vee \dots \vee \mathcal{K}_V \mid W = w] \\
&= \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \bigg( 1 - \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) \bigg) \\
&= \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] - \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big).
\end{align*}
Rearranging the above inequality, we have
$$\sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) \geq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] - \delta. \qquad (22)$$
Using Lemma 4 with $\zeta = \beta$ and $m = V$, we have
\begin{align*}
&\Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}\bigg[ \mathcal{E}\Big( \alpha + 1, \frac{c(\mathcal{X}, \theta^*)}{qV} + \frac{c(\mathcal{X}, \theta^*)}{\beta} \Big) \bigg] \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[\neg\mathcal{K}_1 \wedge \dots \wedge \neg\mathcal{K}_V \mid W = w] \\
&= \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[\mathcal{K}_v \mid W = w] \Big) \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \prod_{v \in [V]} \max\bigg\{ 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \cdot \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - \frac{1}{nV^2}, \ 0 \bigg\} \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \bigg( \prod_{v \in [V]} \max\bigg\{ 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \cdot \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg), \ 0 \bigg\} - \frac{1}{nV} \bigg) \\
&= \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \bigg( \prod_{v \in [V]} \max\bigg( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \cdot \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg), \ 0 \bigg) - \frac{1}{nV} \bigg) \\
&\overset{\text{(e)}}{\geq} \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \bigg( \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV} \bigg) \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV} \\
&\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) - \sum_{w \in \mathcal{S}} \Big| \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \Big| - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV},
\end{align*}
where (e) comes from Lemma 12.
Using Pinsker's inequality (Lemma 11), we have
\begin{align*}
\sum_{w \in \mathcal{S}} \Big| \Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}[W = w] - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \Big| &\leq \sqrt{\frac{1}{2} \mathrm{KL}\big( \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}, \mathcal{D}_{\mathcal{I}(\mathcal{X}, \theta^*)} \big)} \qquad (23) \\
&\leq \sqrt{\frac{1}{4} \sum_{i \in [n]} \big( \phi(x_i)^\top \theta/\kappa - \phi(x_i)^\top \theta^* \big)^2 \cdot \lambda_i T_{\mathcal{A}}} \\
&= \sqrt{\frac{1}{4} \sum_{i \in [n]} \Big( 1 - \frac{1}{\kappa} \Big)^2 \big( \phi(x_i)^\top \theta^* \big)^2 \cdot \frac{c(\mathcal{X})}{\Delta^2 q}\, \lambda_i} \\
&\leq \sqrt{\frac{(1+c)^2}{4} \cdot \frac{c(\mathcal{X})}{q}}. \qquad (24)
\end{align*}
Thus, using Eqs. (22) and (24), we have
\begin{align*}
\Pr_{\mathcal{I}(\mathcal{X}, \theta/\kappa)}\bigg[ \mathcal{E}\Big( \alpha + 1, \frac{c(\mathcal{X}, \theta^*)}{qV} + \frac{c(\mathcal{X}, \theta^*)}{\beta} \Big) \bigg] &\geq \sum_{w \in \mathcal{S}} \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[W = w] \cdot \prod_{v \in [V]} \Big( 1 - \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}[\mathcal{K}_v \mid W = w] \Big) - \sqrt{\frac{(1+c)^2}{4} \cdot \frac{c(\mathcal{X})}{q}} \\
&\quad - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV} \\
&\geq \Pr_{\mathcal{I}(\mathcal{X}, \theta^*)}\bigg[ \mathcal{E}_{+1}\Big( \alpha, \frac{c(\mathcal{X}, \theta^*)}{qV} \Big) \bigg] - \delta - \sqrt{\frac{(1+c)^2}{4} \cdot \frac{c(\mathcal{X})}{q}} - \bigg( \exp\bigg( (1+c)^2 c(\mathcal{X})\, n \sqrt{\frac{\log(nV)}{\beta}} \bigg) - 1 \bigg) - \frac{1}{nV}. \qquad \square
\end{align*}
Now we prove Theorem 3.

Proof of Theorem 3. Combining Lemmas 3 and 5, we have

$$\Pr_{\mathcal{I}(X,\theta/\kappa)}\Big[\mathcal{E}\Big(q+1,\ \frac{\rho(X,\theta^*)}{nV}+\frac{\rho(X,\theta^*)}{\beta}\Big)\Big]\geq \Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}'\Big(q,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]-\delta-\sqrt{\frac{(1+\omega)^2}{4}\cdot\frac{\rho(X)}{q}}-\Bigg(\exp\Bigg(\frac{(1+\omega)^2\rho(X)\sqrt{\log(nV)}}{q\beta}\Bigg)-1\Bigg)-\frac{1}{nV}$$
$$\geq \Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}\Big(q,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]-3\delta-\sqrt{\frac{(1+\omega)^2\rho(X)}{q}}-\Bigg(\exp\Bigg(\frac{(1+\omega)^2\rho(X)\sqrt{\log(nV)}}{q\beta}\Bigg)-1\Bigg)-\frac{1}{nV}.$$
Let $\kappa=\sqrt{1+\frac{nV}{\beta}}$. Then, we have

$$\Pr_{\mathcal{I}(X,\theta/\kappa)}\Big[\mathcal{E}\Big(q+1,\ \frac{\rho(X,\theta/\kappa)}{nV}\Big)\Big]=\Pr_{\mathcal{I}(X,\theta/\kappa)}\Big[\mathcal{E}\Big(q+1,\ \frac{\kappa^2\,\rho(X,\theta^*)}{nV}\Big)\Big]=\Pr_{\mathcal{I}(X,\theta/\kappa)}\Big[\mathcal{E}\Big(q+1,\ \frac{\rho(X,\theta^*)}{nV}+\frac{\rho(X,\theta^*)}{\beta}\Big)\Big]$$
$$\geq \Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}\Big(q,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]-3\delta-\sqrt{\frac{(1+\omega)^2\rho(X)}{q}}-\Bigg(\exp\Bigg(\frac{(1+\omega)^2\rho(X)\sqrt{\log(nV)}}{q\beta}\Bigg)-1\Bigg)-\frac{1}{nV}. \qquad (25)$$
Let $\mathcal{I}(X,\theta_0)$ be the basic instance of the induction, where the reward gap is $\Delta_0=(\phi(x^*)-\phi(x_i))^\top\theta_0=1$ for any $i\in[n]$. Let $t_0$ be the largest integer such that

$$\Delta\cdot\Big(1+\frac{nV}{\beta}\Big)^{\frac{t_0}{2}}\leq 1,$$

where $q=1000\,t_0^2$. Then, we have

$$t_0=\Omega\Bigg(\frac{\log\frac{1}{\Delta}}{\log\big(1+\frac{nV}{\beta}\big)+\log\log\frac{1}{\Delta}}\Bigg).$$
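As a quick numerical sanity check of this choice of $t_0$ (illustrative parameter values, not taken from the paper): the largest integer $t_0$ with $\Delta\,(1+nV/\beta)^{t_0/2}\leq 1$ equals $\lfloor 2\log(1/\Delta)/\log(1+nV/\beta)\rfloor$, which matches the $\Omega(\cdot)$ rate above up to the $\log\log(1/\Delta)$ slack.

```python
import math

def t0(delta, c):
    # largest integer t with delta * (1 + c)**(t / 2) <= 1, where c = nV / beta
    t = 0
    while delta * (1 + c) ** ((t + 1) / 2) <= 1:
        t += 1
    return t

delta, c = 1e-3, 3.0
t = t0(delta, c)
closed_form = math.floor(2 * math.log(1 / delta) / math.log(1 + c))
assert t == closed_form
# t is feasible, t + 1 is not
assert delta * (1 + c) ** (t / 2) <= 1 < delta * (1 + c) ** ((t + 1) / 2)
```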
Starting from $\mathcal{I}(X,\theta_0)$, we repeatedly apply Eq. (25) $t_0$ times to switch to $\mathcal{I}(X,\theta^*)$, where the reward gap is $\Delta$. Since $\Pr_{\mathcal{I}(X,\theta_0)}\big[\mathcal{E}\big(0,\frac{\rho(X,\theta_0)}{nV}\big)\big]=1$, by induction we have

$$\Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}\Big(t_0,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]\geq 1-\Bigg(3\delta+\sqrt{\frac{(1+\omega)^2\rho(X)}{q}}+\Bigg(\exp\Bigg(\frac{(1+\omega)^2\rho(X)\sqrt{\log(nV)}}{q\beta}\Bigg)-1\Bigg)+\frac{1}{nV}\Bigg)\cdot t_0.$$

When $\delta=O\big(\frac{1}{nV}\big)$ and $q,n,\beta$ are large enough, we have

$$\Bigg(3\delta+\sqrt{\frac{(1+\omega)^2\rho(X)}{q}}+\Bigg(\exp\Bigg(\frac{(1+\omega)^2\rho(X)\sqrt{\log(nV)}}{q\beta}\Bigg)-1\Bigg)+\frac{1}{nV}\Bigg)\cdot t_0\leq\frac{1}{2},$$

and then

$$\Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}\Big(t_0,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]\geq\frac{1}{2}.$$

Thus,

$$\Pr_{\mathcal{I}(X,\theta^*)}\Bigg[\mathcal{A}\ \text{uses}\ \Omega\Bigg(\frac{\log\frac{1}{\Delta}}{\log\big(1+\frac{nV}{\beta}\big)+\log\log\frac{1}{\Delta}}\Bigg)\ \text{communication rounds}\Bigg]\geq\Pr_{\mathcal{I}(X,\theta^*)}\Big[\mathcal{E}\Big(t_0,\ \frac{\rho(X,\theta^*)}{nV}\Big)\Big]\geq\frac{1}{2},$$

which completes the proof of Theorem 3. □
B PROOFS FOR THE FIXED-BUDGET SETTING
B.1 Proof of Theorem 4
Proof of Theorem 4. Our proof of Theorem 4 adapts the error probability analysis in [23] to the multi-agent setting. Since the number of samples used over all agents in each phase is $N=\lfloor TV/R\rfloor$, the total number of samples used by algorithm CoopKernelFB is at most $TV$, and the number of samples used per agent is at most $T$.
Now we prove the error probability upper bound.
Recall that for any $\lambda\in\triangle_{\tilde{X}}$, $\Phi_\lambda=\big[\sqrt{\lambda_1}\,\phi(x_1)^\top;\ \ldots;\ \sqrt{\lambda_{nV}}\,\phi(x_{nV})^\top\big]$ and $\Phi_\lambda^\top\Phi_\lambda=\sum_{x\in\tilde{X}}\lambda_x\phi(x)\phi(x)^\top$. Let $\gamma^*=\xi^* N$.
For any $x_i,x_j\in\mathcal{B}_v^{(t)}$, $v\in[V]$, $t\in[R]$, define

$$\tilde{\Delta}_{t,x_i,x_j}=\inf\Bigg\{\Delta>0:\ \frac{\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\Delta^2}\leq 2\rho^*\Bigg\}$$

and event

$$\mathcal{J}_{t,x_i,x_j}=\Big\{\big|\big(\hat{f}_t(x_i)-\hat{f}_t(x_j)\big)-\big(f(x_i)-f(x_j)\big)\big|<\tilde{\Delta}_{t,x_i,x_j}\Big\}.$$

In the following, we prove $\Pr\big[\neg\mathcal{J}_{t,x_i,x_j}\big]\leq 2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$.
Similar to the analysis procedure of Lemma 1, we have

$$\big(\hat{f}_t(x_i)-\hat{f}_t(x_j)\big)-\big(f(x_i)-f(x_j)\big)=\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\bar{\eta}^{(t)}-\gamma^*\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*,$$

where the first term has mean zero and variance bounded by

$$\|\phi(x_i)-\phi(x_j)\|^2_{(\gamma^* I+\Phi_t^\top\Phi_t)^{-1}}\leq\frac{(1+\varepsilon)\cdot\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{N}.$$
Using the Hoeffding inequality, we have

$$\Pr\Bigg[\Big|\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\bar{\eta}_v^{(t)}\Big|\geq\frac{1}{2}\tilde{\Delta}_{t,x_i,x_j}\Bigg]$$
$$\leq 2\exp\Bigg(-\frac{2\cdot\frac{1}{4}\tilde{\Delta}^2_{t,x_i,x_j}}{(1+\varepsilon)\cdot\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}/N}\Bigg)=2\exp\Bigg(-\frac{N}{2(1+\varepsilon)}\cdot\frac{\tilde{\Delta}^2_{t,x_i,x_j}}{\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}\Bigg)\leq 2\exp\Big(-\frac{N}{2(1+\varepsilon)\rho^*}\Big).$$

Thus, with probability at least $1-2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$, we have

$$\Big|\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\bar{\eta}_v^{(t)}\Big|<\frac{1}{2}\tilde{\Delta}_{t,x_i,x_j}.$$
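The Hoeffding step can be illustrated at toy scale. The sketch below is generic (bounded i.i.d. noise, not the paper's kernelized estimator): averaging $N$ bounded terms drives the deviation probability below a $2\exp(-cN)$-type bound.

```python
import math
import random

random.seed(0)
N, trials, thr = 100, 20000, 0.5
# Hoeffding for i.i.d. X_i in [-1, 1] with mean 0:
# P(|mean| >= thr) <= 2 * exp(-N * thr**2 / 2)
bound = 2 * math.exp(-N * thr ** 2 / 2)
hits = sum(
    abs(sum(random.uniform(-1, 1) for _ in range(N)) / N) >= thr
    for _ in range(trials)
)
assert hits / trials <= bound
```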
Recall that $\xi^*$ satisfies $(1+\varepsilon)\sqrt{\xi^*}\max_{x_i,x_j\in\tilde{X}_v,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi^* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}\cdot B\leq\frac{1}{2}\tilde{\Delta}_{t,x_i,x_j}$. Then, we bound the regularization (bias) term:

$$\Big|\gamma^*\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*\Big|\leq\gamma^*\|\phi(x_i)-\phi(x_j)\|_{(\gamma^* I+\Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_{(\gamma^* I+\Phi_t^\top\Phi_t)^{-1}}$$
$$\leq\sqrt{\gamma^*}\cdot\|\phi(x_i)-\phi(x_j)\|_{(\gamma^* I+\Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_2\leq\sqrt{\xi^* N}\cdot\frac{\sqrt{1+\varepsilon}\cdot\|\phi(x_i)-\phi(x_j)\|_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\sqrt{N}}\cdot\|\theta^*\|_2$$
$$\leq(1+\varepsilon)\sqrt{\xi^*}\max_{x_i,x_j\in\tilde{\mathcal{B}}_v^{(t)},v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}\|\theta^*\|_2\leq(1+\varepsilon)\sqrt{\xi^*}\max_{x_i,x_j\in\tilde{\mathcal{B}}_v^{(t)},v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi^* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}\|\theta^*\|_2$$
$$\leq(1+\varepsilon)\sqrt{\xi^*}\max_{x_i,x_j\in\tilde{X}_v,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{(\xi^* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}\cdot B\leq\frac{1}{2}\tilde{\Delta}_{t,x_i,x_j}.$$
Thus, with probability at least $1-2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$, we have

$$\big|\big(\hat{f}_t(x_i)-\hat{f}_t(x_j)\big)-\big(f(x_i)-f(x_j)\big)\big|\leq\Big|\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\bar{\eta}_v^{(t)}\Big|+\Big|\gamma^*\big(\phi(x_i)-\phi(x_j)\big)^\top\big(\gamma^* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*\Big|<\tilde{\Delta}_{t,x_i,x_j},$$

which completes the proof of $\Pr\big[\neg\mathcal{J}_{t,x_i,x_j}\big]\leq 2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$.
Define event

$$\mathcal{J}=\Big\{\big|\big(\hat{f}_t(x_i)-\hat{f}_t(x_j)\big)-\big(f(x_i)-f(x_j)\big)\big|<\tilde{\Delta}_{t,x_i,x_j},\ \forall x_i,x_j\in\mathcal{B}_v^{(t)},\ \forall v\in[V],\ \forall t\in[R]\Big\}.$$

By a union bound over $x_i,x_j\in\mathcal{B}_v^{(t)}$, $v\in[V]$ and $t\in[R]$, we have

$$\Pr[\neg\mathcal{J}]\leq 2n^2V\log(\rho(\tilde{X}))\cdot\exp\Big(-\frac{N}{2(1+\varepsilon)\rho^*}\Big)=O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV}{\rho^*\log(\rho(\tilde{X}))}\Bigg)\Bigg).$$

In order to prove Theorem 4, it suffices to prove that, conditioning on $\mathcal{J}$, algorithm CoopKernelFB returns the correct
answers $x^*_v$ for all $v\in[V]$.

Suppose that there exist $v\in[V]$ and $t\in[R]$ such that $x^*_v$ is eliminated in phase $t$. Define

$$\mathcal{B}'^{(t)}_v=\big\{x\in\mathcal{B}^{(t)}_v:\hat{f}_t(x)>\hat{f}_t(x^*_v)\big\},$$

which denotes the subset of arms that are ranked before $x^*_v$ by the estimated rewards in $\mathcal{B}^{(t)}_v$. According to the elimination rule, we have

$$\rho\big(\mathcal{B}'^{(t)}_v\cup\{x^*_v\}\big)>\frac{1}{2}\rho\big(\mathcal{B}^{(t)}_v\big)=\frac{1}{2}\max_{x_i,x_j\in\mathcal{B}^{(t)}_v}\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}. \qquad (26)$$
Define $x_0=\mathrm{argmax}_{x\in\mathcal{B}'^{(t)}_v}\Delta_x$. We have

$$\frac{1}{2}\cdot\frac{\|\phi(x^*_v)-\phi(x_0)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\Delta^2_{x_0}}\leq\frac{1}{2}\cdot\frac{\max_{x_i,x_j\in\mathcal{B}^{(t)}_v}\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\Delta^2_{x_0}}\ \overset{\text{(f)}}{<}\ \frac{\rho\big(\mathcal{B}'^{(t)}_v\cup\{x^*_v\}\big)}{\Delta^2_{x_0}}$$
$$\overset{\text{(g)}}{=}\min_{\lambda\in\triangle_X}\max_{x_i,x_j\in\mathcal{B}'^{(t)}_v\cup\{x^*_v\}}\frac{\|\phi(x_i)-\phi(x_j)\|^2_{(\xi^* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\Delta^2_{x_0}}\ \overset{\text{(h)}}{\leq}\ \min_{\lambda\in\triangle_X}\max_{x\in\mathcal{B}'^{(t)}_v}\frac{\|\phi(x^*_v)-\phi(x)\|^2_{(\xi^* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\Delta^2_x}$$
$$\leq\min_{\lambda\in\triangle_X}\max_{x\in X_v\setminus\{x^*_v\},v\in[V]}\frac{\|\phi(x^*_v)-\phi(x)\|^2_{(\xi^* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\Delta^2_x}=\rho^*,$$

where (f) can be obtained by dividing Eq. (26) by $\Delta^2_{x_0}$, (g) comes from the definition of $\rho(\cdot)$, and (h) is due to the definition of $x_0$.
According to the definition

$$\tilde{\Delta}_{t,x^*_v,x_0}=\inf\Bigg\{\Delta>0:\ \frac{\|\phi(x^*_v)-\phi(x_0)\|^2_{(\xi^* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\Delta^2}\leq 2\rho^*\Bigg\},$$

we have $\tilde{\Delta}_{t,x^*_v,x_0}\leq\Delta_{x_0}$. Conditioning on $\mathcal{J}$, we have

$$\big|\big(\hat{f}_t(x^*_v)-\hat{f}_t(x_0)\big)-\big(f(x^*_v)-f(x_0)\big)\big|<\tilde{\Delta}_{t,x^*_v,x_0}\leq\Delta_{x_0}.$$

Then, we have

$$\hat{f}_t(x^*_v)-\hat{f}_t(x_0)>\big(f(x^*_v)-f(x_0)\big)-\Delta_{x_0}=0,$$

which contradicts the definition of $x_0$, which satisfies $\hat{f}_t(x_0)>\hat{f}_t(x^*_v)$. Thus, for any $v\in[V]$, $x^*_v$ will never be eliminated.
Since $\rho(\{x^*_v,x\})\geq 1$ for any $x\in X_v\setminus\{x^*_v\}$, $v\in[V]$, and $R=\big\lceil\log_2(\rho(\tilde{X}))\big\rceil\geq\big\lceil\log_2(\rho(X_v))\big\rceil$ for any $v\in[V]$, we have that, conditioning on $\mathcal{J}$, algorithm CoopKernelFB returns the correct answers $x^*_v$ for all $v\in[V]$.

For communication rounds, since algorithm CoopKernelFB has at most $R=\big\lceil\log_2(\rho(\tilde{X}))\big\rceil$ phases, the number of communication rounds is bounded by $O(\log(\rho(\tilde{X})))$. □
B.2 Proof of Corollary 2
Proof of Corollary 2. Following the analysis procedure of Corollary 1, we have

$$\rho^*=\min_{\lambda\in\triangle_X}\max_{x\in\tilde{X}_v,v\in[V]}\frac{\|\phi(x^*_v)-\phi(x)\|^2_{(\xi^* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{(f(x^*_v)-f(x))^2}\leq\frac{8}{\Delta^2_{\min}}\cdot\log\det\big(I+\xi^{*-1}K_{\lambda^*}\big).$$
Maximum Information Gain. Recall that the maximum information gain over all sample allocations $\lambda\in\triangle_{\tilde{X}}$ is defined as

$$\Upsilon=\max_{\lambda\in\triangle_{\tilde{X}}}\log\det\big(I+\xi^{*-1}K_\lambda\big).$$

Then, the error probability is bounded by

$$O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV}{\rho^*\log(\rho(\tilde{X}))}\Bigg)\Bigg)=O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV\Delta^2_{\min}}{\Upsilon\log(\rho(\tilde{X}))}\Bigg)\Bigg).$$

Effective Dimension. Recall that $\alpha_1\geq\cdots\geq\alpha_{nV}$ denote the eigenvalues of $K_{\lambda^*}$ in decreasing order. The effective dimension of $K_{\lambda^*}$ is defined as

$$d_{\mathrm{eff}}=\min\Bigg\{j:\ j\,\xi^*\log(nV)\geq\sum_{i=j+1}^{nV}\alpha_i\Bigg\},$$

and we have

$$\log\det\big(I+\xi^{*-1}K_{\lambda^*}\big)\leq d_{\mathrm{eff}}\log\Bigg(2nV\cdot\Big(1+\frac{\mathrm{Trace}(K_{\lambda^*})}{\xi^* d_{\mathrm{eff}}}\Big)\Bigg).$$

Then, the error probability is bounded by

$$O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV}{\rho^*\log(\rho(\tilde{X}))}\Bigg)\Bigg)=O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV\Delta^2_{\min}}{d_{\mathrm{eff}}\log\Big(nV\cdot\big(1+\frac{\mathrm{Trace}(K_{\lambda^*})}{\xi^* d_{\mathrm{eff}}}\big)\Big)\log(\rho(\tilde{X}))}\Bigg)\Bigg).$$
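The effective-dimension bound can be checked numerically on a toy diagonal kernel matrix (all values below are illustrative assumptions, not from the paper):

```python
import math

xi = 1.0                         # regularization parameter xi_*
alphas = [4.0, 2.0, 0.1, 0.05]   # eigenvalues of K_{lambda*}, decreasing
nV = len(alphas)

# effective dimension: smallest j with j * xi * log(nV) >= sum of tail eigenvalues
d_eff = next(j for j in range(1, nV + 1)
             if j * xi * math.log(nV) >= sum(alphas[j:]))

# for a diagonal kernel, log det(I + K / xi) is a sum over eigenvalues
log_det = sum(math.log(1 + a / xi) for a in alphas)
trace = sum(alphas)
bound = d_eff * math.log(2 * nV * (1 + trace / (xi * d_eff)))
assert log_det <= bound
```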
Decomposition. Recall that $K=[K(x_i,x_j)]_{i,j\in[nV]}$, $K_z=[K_z(z_v,z_{v'})]_{v,v'\in[V]}$, $K_x=[K_x(x_i,x_j)]_{i,j\in[nV]}$, and $\mathrm{rank}(K_{\lambda^*})=\mathrm{rank}(K)\leq\mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)$. We have

$$\log\det\big(I+\xi^{*-1}K_{\lambda^*}\big)\leq\mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)\cdot\log\Bigg(\frac{\mathrm{Trace}\big(I+\xi^{*-1}K_{\lambda^*}\big)}{\mathrm{rank}(K_{\lambda^*})}\Bigg).$$

Then, the error probability is bounded by

$$O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV}{\rho^*\log(\rho(\tilde{X}))}\Bigg)\Bigg)=O\Bigg(n^2V\log(\rho(\tilde{X}))\cdot\exp\Bigg(-\frac{TV\Delta^2_{\min}}{\mathrm{rank}(K_z)\,\mathrm{rank}(K_x)\log\Big(\frac{\mathrm{Trace}(I+\xi^{*-1}K_{\lambda^*})}{\mathrm{rank}(K_{\lambda^*})}\Big)\log(\rho(\tilde{X}))}\Bigg)\Bigg).$$
Therefore, we complete the proof of Corollary 2. โก
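The rank bound $\mathrm{rank}(K)\leq\mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)$ is an instance of the Hadamard-product rank inequality when the multi-task kernel factorizes as a product of a task kernel and an arm kernel; the numpy sketch below assumes that product structure and uses synthetic low-rank factors (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 3, 4            # V tasks, n arms per task; nV = 12 points in total
r_z, r_x = 2, 3        # ranks of the task kernel K_z and the arm kernel K_x

# low-rank PSD factors for the task and arm kernels
Az = rng.standard_normal((V, r_z))
Kz = Az @ Az.T
Ax = rng.standard_normal((V * n, r_x))
Kx = Ax @ Ax.T

# expand K_z to all nV points: point i belongs to task i // n,
# then take the elementwise (Hadamard) product with K_x
task = np.repeat(np.arange(V), n)
K = Kz[np.ix_(task, task)] * Kx

assert np.linalg.matrix_rank(Kz) == r_z
assert np.linalg.matrix_rank(Kx) == r_x
assert np.linalg.matrix_rank(K) <= r_z * r_x
```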
B.3 Proof of Theorem 5
Our proof of Theorem 5 follows the analysis procedure in [1].
We first introduce some definitions from information theory. For any random variable $A$, let $H(A)$ denote the Shannon entropy of $A$; if $A$ is uniformly distributed on its support, $H(A)=\log|\mathrm{supp}(A)|$. For any $p\in(0,1)$, let $H_2(p)=-p\log_2 p-(1-p)\log_2(1-p)$ denote the binary entropy, so that $H_2(p)=H(A)$ for $A\sim\mathrm{Bernoulli}(p)$. For any random variables $A,B$, let $H(A;B)=H(A)-H(A|B)=H(B)-H(B|A)$ denote the mutual information of $A$ and $B$.
Consider the following fully-collaborative instance $\mathcal{D}_{\Delta,n}$: for a uniformly drawn index $i^*\in[n]$, $\theta^*_{i^*}=c+\Delta$ and $\theta^*_i=c$ for all $i\neq i^*$. The arm set is $X=\{x\in\{0,1\}^n:\sum_{i=1}^n x(i)=1\}$, and the feature mapping is $\phi(x)=Ix$ for all $x\in X$. Under instance $\mathcal{D}_{\Delta,n}$, we have $\rho(\tilde{X})=\rho(X)=n$.
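A minimal sketch of this hard instance (illustrative code; $c$ denotes the baseline reward): with one-hot features, $\phi(x)^\top\theta^*$ reads off a single coordinate of $\theta^*$, so identifying the best arm is exactly identifying the index $i^*$.

```python
import numpy as np

n, Delta, c = 8, 0.1, 0.5
i_star = 3                      # the uniformly drawn index in the lower bound

theta = np.full(n, c)
theta[i_star] += Delta          # theta*_{i*} = c + Delta, theta*_i = c otherwise

X = np.eye(n)                   # arm set: all one-hot vectors; phi(x) = I x = x
rewards = X @ theta
# best arm <-> index i*, and the gap to every other arm is exactly Delta
assert int(np.argmax(rewards)) == i_star
assert np.isclose(rewards.max() - np.partition(rewards, -2)[-2], Delta)
```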
Let $\delta^*>0$ be a small constant that we will specify later. There exists a single-agent algorithm $\mathcal{A}_s$ (e.g., [23]) that uses at most $T$ samples and guarantees $O\big(\log n\cdot\exp\big(-\frac{\Delta^2 T}{n\log n}\big)\big)$ error probability on instance $\mathcal{D}_{\Delta,n}$ for any $n>1$. Restricting the error probability to the constant $\delta^*$, we have that, for any $n>1$, $\mathcal{A}_s$ uses at most $T=O\big(\frac{n}{\Delta^2}\log n\cdot\log\log n\big)$ samples to guarantee $\delta^*$ error probability on instance $\mathcal{D}_{\Delta,n}$.
Let $\alpha=\frac{V}{\beta}$ and $1\leq\alpha\leq\log n$. According to the definition of speedup, a $\frac{V}{\alpha}$-speedup distributed algorithm $\mathcal{A}$ must satisfy that, for any $n>1$, $\mathcal{A}$ uses at most $T=O\big(\frac{n}{\Delta^2\cdot V/\alpha}\cdot V\log n\cdot\log\log n\big)=O\big(\frac{\alpha n}{\Delta^2}\log n\cdot\log\log n\big)$ samples over all agents and guarantees $\delta^*$ error probability on instance $\mathcal{D}_{\Delta,n}$.
Main Proof. In order to prove Theorem 5, we first prove the following Lemma 6, which relaxes the sample budget to within logarithmic factors.

Lemma 6. For any $n>1$, any distributed algorithm $\mathcal{A}$ that can use at most $O\big(\frac{\alpha n(\log\alpha+\log\log n)^2}{\Delta^2(\log n)^2}\big)$ samples over all agents and guarantee $\delta^*$ error probability on instance $\mathcal{D}_{\Delta,n}$ needs $\Omega\big(\frac{\log n}{\log\alpha+\log\log n}\big)$ communication rounds.

Below we prove Lemma 6 by induction (Lemmas 7 and 8).
Lemma 7 (Basic Step). For any $n>1$ and $1\leq\alpha\leq\log n$, there is no 1-round algorithm $\mathcal{A}_1$ that can use $O\big(\frac{\alpha n}{\Delta^2}\big)$ samples and guarantee $\delta_1$ error probability, for some constant $\delta_1\in(0,1)$, on instance $\mathcal{D}_{\Delta,n}$.
Proof of Lemma 7. Let $I$ denote the random variable of the index $i^*$. Since $I$ is uniformly distributed on $[n]$, $H(I)=\log n$. Let $S$ denote the sample profile of $\mathcal{A}_1$ on instance $\mathcal{D}_{\Delta,n}$. According to the definitions of $X$ and $\theta^*$, for instance $\mathcal{D}_{\Delta,n}$, identifying the best arm is equivalent to identifying $I$. Suppose that $\mathcal{A}_1$ returns the best arm with error probability $\delta_1$. Using Fano's inequality (Lemma 13), we have

$$H(I|S)\leq H_2(\delta_1)+\delta_1\log n. \qquad (27)$$

Using Lemma 14, we have

$$H(I|S)=H(I)-H(I;S)=\log n-O\Big(\frac{\alpha n}{\Delta^2}\cdot\frac{\Delta^2}{n}\Big)=\log n-O(\alpha).$$

Then, for some small enough constant $\delta_1\in(0,1)$, Eq. (27) cannot hold. Thus, for any $n>1$ and $1\leq\alpha\leq\log n$, there is no 1-round algorithm $\mathcal{A}_1$ that can use $O\big(\frac{\alpha n}{\Delta^2}\big)$ samples and guarantee $\delta_1$ error probability on instance $\mathcal{D}_{\Delta,n}$. □
Lemma 8 (Induction Step). Suppose that $1\leq\alpha\leq\log n$ and $\delta\in(0,1)$. If, for any $n>1$, there is no $(R-1)$-round algorithm $\mathcal{A}_{R-1}$ that can use $O\big(\frac{\alpha n}{\Delta^2(R-1)^2}\big)$ samples and guarantee $\delta$ error probability on instance $\mathcal{D}_{\Delta,n}$, then, for any $n>1$, there is no $R$-round algorithm $\mathcal{A}_R$ that can use $O\big(\frac{\alpha n}{\Delta^2R^2}\big)$ samples and guarantee $\delta-O\big(\frac{1}{R^2}\big)$ error probability on instance $\mathcal{D}_{\Delta,n}$.
Proof of Lemma 8. We prove this lemma by contradiction. Suppose that for some $n>1$ there exists an $R$-round algorithm $\mathcal{A}_R$ that can use $O\big(\frac{\alpha n}{\Delta^2R^2}\big)$ samples and guarantee $\delta$ error probability on instance $\mathcal{D}_{\Delta,n}$. In the following, we show how to construct an $(R-1)$-round algorithm $\mathcal{A}_{R-1}$ that can use $O\big(\frac{\alpha\tilde{n}}{\Delta^2(R-1)^2}\big)$ samples and guarantee at most $\delta+O\big(\frac{1}{R^2}\big)$ error probability on instance $\mathcal{D}_{\Delta,\tilde{n}}$ for some $\tilde{n}$.

Let $S_1$ denote the sample profile of $\mathcal{A}_R$ in the first round on instance $\mathcal{D}_{\Delta,n}$. Since the number of samples of $\mathcal{A}_R$ is bounded by $O\big(\frac{\alpha n}{\Delta^2R^2}\big)$, using Lemma 14, we have

$$H(I|S_1)=H(I)-H(I;S_1)=\log n-O\Big(\frac{\alpha n}{\Delta^2R^2}\cdot\frac{\Delta^2}{n}\Big)=\log n-O\Big(\frac{\alpha}{R^2}\Big).$$

Let $\mathcal{D}_{\Delta,n}|S_1$ denote the posterior of $\mathcal{D}_{\Delta,n}$ after observing the sample profile $S_1$. Using Lemma 15 on the random variable $I|S_1$ with parameters $\gamma=O\big(\frac{\alpha}{R^2}\big)$ and $\varepsilon=O\big(\frac{1}{R^2}\big)$, we can write $\mathcal{D}_{\Delta,n}|S_1$ as a convex combination of distributions $\mathcal{Q}_0,\mathcal{Q}_1,\ldots,\mathcal{Q}_\ell$, i.e., $\mathcal{D}_{\Delta,n}|S_1=\sum_{j=0}^{\ell}p_j\mathcal{Q}_j$, such that $p_0=O\big(\frac{1}{R^2}\big)$ and, for any $j\geq1$,

$$|\mathrm{supp}(\mathcal{Q}_j)|\geq\exp\Big(\log n-\frac{\gamma}{\varepsilon}\Big)=\exp\big(\log n-O(\alpha)\big)=\frac{n}{e^{O(\alpha)}},$$

and

$$\|\mathcal{Q}_j-\mathcal{U}_j\|_{\mathrm{TV}}=O\Big(\frac{1}{R^2}\Big),$$

where $\mathcal{U}_j$ is the uniform distribution on $\mathrm{supp}(\mathcal{Q}_j)$.

Since

$$\Pr\big[\mathcal{A}_R\ \text{has an error}\mid\mathcal{D}_{\Delta,n}|S_1\big]=\sum_{j=0}^{\ell}p_j\Pr\big[\mathcal{A}_R\ \text{has an error}\mid\mathcal{Q}_j\big]\leq\delta,$$

by an averaging argument there exists a distribution $\mathcal{Q}_j$ for some $j\geq1$ such that

$$\Pr\big[\mathcal{A}_R\ \text{has an error}\mid\mathcal{Q}_j\big]\leq\delta.$$

Let $\tilde{n}=|\mathrm{supp}(\mathcal{Q}_j)|\geq\frac{n}{e^{O(\alpha)}}$. Since $\|\mathcal{Q}_j-\mathcal{U}_j\|_{\mathrm{TV}}=O\big(\frac{1}{R^2}\big)$ and $\mathcal{U}_j$ is equivalent to $\mathcal{D}_{\Delta,\tilde{n}}$, we have

$$\Pr\big[\mathcal{A}_R\ \text{has an error}\mid\mathcal{D}_{\Delta,\tilde{n}}\big]\leq\delta+O\Big(\frac{1}{R^2}\Big).$$

Under instance $\mathcal{D}_{\Delta,\tilde{n}}$, the sample budget for $(R-1)$-round algorithms is $O\big(\frac{\alpha\tilde{n}}{\Delta^2(R-1)^2}\big)$, and it holds that

$$O\Big(\frac{\alpha\tilde{n}}{\Delta^2(R-1)^2}\Big)\geq O\Big(\frac{\alpha n\,e^{-O(\alpha)}}{\Delta^2(R-1)^2}\Big)\geq O\Big(\frac{\alpha n^{1-o(1)}}{\Delta^2(R-1)^2}\Big)\geq O\Big(\frac{\alpha n}{\Delta^2R^2}\Big).$$

Then, we can construct an $(R-1)$-round algorithm $\mathcal{A}_{R-1}$ using algorithm $\mathcal{A}_R$ from the second round on. The constructed $\mathcal{A}_{R-1}$ uses at most $O\big(\frac{\alpha n}{\Delta^2R^2}\big)\leq O\big(\frac{\alpha\tilde{n}}{\Delta^2(R-1)^2}\big)$ samples and guarantees $\delta+O\big(\frac{1}{R^2}\big)$ error probability.

The specific procedure of $\mathcal{A}_{R-1}$ is as follows: if $\mathcal{A}_R$ samples some dimension $i\in\mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{R-1}$ also samples $i$; otherwise, if $\mathcal{A}_R$ samples some dimension $i\in[n]\setminus\mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{R-1}$ samples $\mathrm{Bernoulli}(c)$ and feeds the outcome to $\mathcal{A}_R$. Finally, if $\mathcal{A}_R$ returns some dimension $i\in\mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{R-1}$ also returns $i$; otherwise, if $\mathcal{A}_R$ returns some dimension $i\in[n]\setminus\mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{R-1}$ returns an arbitrary dimension in $[n]\setminus\mathrm{supp}(\mathcal{Q}_j)$ (the error case). □
Proof of Lemma 6. Let $R_*=\frac{\log n}{\log\alpha+\log\log n}$. Combining Lemmas 7 and 8, we obtain that there is no $R_*$-round algorithm $\mathcal{A}$ that can use $O\big(\frac{\alpha n}{\Delta^2R_*^2}\big)$ samples and guarantee $\delta_1-\sum_{R=2}^{R_*}O\big(\frac{1}{R^2}\big)$ error probability on instance $\mathcal{D}_{\Delta,n}$ for any $n>1$. Let $\delta^*<\delta_1-\sum_{R=2}^{R_*}O\big(\frac{1}{R^2}\big)$. Thus, for any $n>1$ and $1\leq\alpha\leq\log n$, any distributed algorithm $\mathcal{A}$ that can use $O\big(\frac{\alpha n}{\Delta^2R_*^2}\big)$ samples and guarantee $\delta^*$ error probability on instance $\mathcal{D}_{\Delta,n}$ must cost

$$R_*=\Omega\Bigg(\frac{\log n}{\log\big(\frac{V}{\beta}\big)+\log\log n}\Bigg)$$

communication rounds. □
Therefore, for any $n>1$ and $\frac{V}{\log n}\leq\beta\leq V$, a $\beta$-speedup distributed algorithm $\mathcal{A}$ needs at least

$$\Omega\Bigg(\frac{\log\rho(\tilde{X})}{\log\big(\frac{V}{\beta}\big)+\log\log\rho(\tilde{X})}\Bigg)$$

communication rounds under instance $\mathcal{D}_{\Delta,n}$, which completes the proof of Theorem 5.
C TECHNICAL TOOLS
Lemma 9 (Lemma 15 in [10]). For $\lambda^*=\mathrm{argmax}_{\lambda\in\triangle_{\tilde{X}}}\log\det\big(I+\xi^{*-1}\sum_{x'\in\tilde{X}}\lambda_{x'}\phi(x')\phi(x')^\top\big)$, we have

$$\max_{x\in\tilde{X}}\|\phi(x)\|^2_{\big(\xi^* I+\sum_{x'\in\tilde{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top\big)^{-1}}=\sum_{x\in\tilde{X}}\lambda^*_x\|\phi(x)\|^2_{\big(\xi^* I+\sum_{x'\in\tilde{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top\big)^{-1}}.$$
Lemma 10. For $\lambda^*=\mathrm{argmax}_{\lambda\in\triangle_{\tilde{X}}}\log\det\big(I+\xi^{*-1}\sum_{x'\in\tilde{X}}\lambda_{x'}\phi(x')\phi(x')^\top\big)$, we have

$$\sum_{x\in\tilde{X}}\log\Big(1+\lambda^*_x\|\phi(x)\|^2_{\big(\xi^* I+\sum_{x'\in\tilde{X}}\lambda^*_{x'}\phi(x')\phi(x')^\top\big)^{-1}}\Big)\leq\log\frac{\det\big(\xi^* I+\sum_{x\in\tilde{X}}\lambda^*_x\phi(x)\phi(x)^\top\big)}{\det(\xi^* I)}.$$
Proof of Lemma 10. For any $j\in[nV]$, let $M_j=\xi^* I+\sum_{i\in[j]}\lambda^*_i\phi(x_i)\phi(x_i)^\top$. Then

$$\det\Big(\xi^* I+\sum_{x\in\tilde{X}}\lambda^*_x\phi(x)\phi(x)^\top\Big)=\det\Big(\xi^* I+\sum_{i\in[nV-1]}\lambda^*_i\phi(x_i)\phi(x_i)^\top+\lambda^*_{nV}\phi(x_{nV})\phi(x_{nV})^\top\Big)$$
$$=\det(M_{nV-1})\det\Big(I+\lambda^*_{nV}\cdot M_{nV-1}^{-\frac{1}{2}}\phi(x_{nV})\big(M_{nV-1}^{-\frac{1}{2}}\phi(x_{nV})\big)^\top\Big)=\det(M_{nV-1})\det\Big(I+\lambda^*_{nV}\cdot\phi(x_{nV})^\top M_{nV-1}^{-1}\phi(x_{nV})\Big)$$
$$=\det(M_{nV-1})\Big(1+\lambda^*_{nV}\|\phi(x_{nV})\|^2_{M_{nV-1}^{-1}}\Big)=\det(\xi^* I)\prod_{i=1}^{nV}\Big(1+\lambda^*_i\|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big).$$

Thus,

$$\frac{\det\big(\xi^* I+\sum_{x\in\tilde{X}}\lambda^*_x\phi(x)\phi(x)^\top\big)}{\det(\xi^* I)}=\prod_{i=1}^{nV}\Big(1+\lambda^*_i\|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big).$$

Taking the logarithm on both sides, we have

$$\log\frac{\det\big(\xi^* I+\sum_{x\in\tilde{X}}\lambda^*_x\phi(x)\phi(x)^\top\big)}{\det(\xi^* I)}=\sum_{i=1}^{nV}\log\Big(1+\lambda^*_i\|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big)\geq\sum_{i=1}^{nV}\log\Big(1+\lambda^*_i\|\phi(x_i)\|^2_{M_{nV}^{-1}}\Big),$$

where the last step uses $M_{nV}\succeq M_{i-1}$, so that $\|\phi(x_i)\|^2_{M_{nV}^{-1}}\leq\|\phi(x_i)\|^2_{M_{i-1}^{-1}}$. This completes the proof of Lemma 10. □
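The telescoping identity behind this proof — each rank-one update multiplies the determinant by $\big(1+\lambda^*_i\|\phi(x_i)\|^2_{M_{i-1}^{-1}}\big)$ — can be verified numerically on synthetic features (dimensions and the design below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, xi = 4, 6, 0.7
phis = rng.standard_normal((m, d))       # feature vectors phi(x_i)
lam = rng.dirichlet(np.ones(m))          # a design lambda* on the m points

M = xi * np.eye(d)
prod = 1.0
for lam_i, phi in zip(lam, phis):
    # matrix determinant lemma:
    # det(M + lam_i phi phi^T) = det(M) * (1 + lam_i * ||phi||^2_{M^{-1}})
    prod *= 1 + lam_i * phi @ np.linalg.solve(M, phi)
    M += lam_i * np.outer(phi, phi)

ratio = np.linalg.det(M) / np.linalg.det(xi * np.eye(d))
assert np.isclose(ratio, prod)
```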
Lemma 11 (Pinsker's inequality). If $P$ and $Q$ are two probability distributions on a measurable space $(X,\Sigma)$, then for any measurable event $A\in\Sigma$, it holds that

$$|P(A)-Q(A)|\leq\sqrt{\frac{1}{2}\mathrm{KL}(P\|Q)}.$$
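For Bernoulli distributions the inequality can be checked directly, since the largest gap $|P(A)-Q(A)|$ over all events is exactly $|p-q|$ (a small illustrative sketch):

```python
import math

def kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Pinsker: |p - q| <= sqrt(KL / 2) for every pair on the grid
for p in (0.1, 0.3, 0.5, 0.7):
    for q in (0.2, 0.4, 0.6, 0.8):
        assert abs(p - q) <= math.sqrt(0.5 * kl(p, q)) + 1e-12
```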
Lemma 12 (Lemma 29 in [38]). For any $\gamma_1,\ldots,\gamma_K\in[0,1]$ and $x\geq0$, it holds that

$$\prod_{i=1}^{K}\max\{1-\gamma_i-\gamma_i x,\ 0\}\geq\prod_{i=1}^{K}(1-\gamma_i)-x.$$
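Lemma 12 is elementary and easy to probe numerically (random instances, illustrative only):

```python
import random

def sides(gammas, x):
    """Return (LHS, RHS) of Lemma 12 for gammas in [0, 1] and x >= 0."""
    lhs = 1.0
    rhs = 1.0
    for g in gammas:
        lhs *= max(1 - g - g * x, 0.0)
        rhs *= 1 - g
    return lhs, rhs - x

random.seed(0)
for _ in range(1000):
    K = random.randint(1, 6)
    gammas = [random.random() for _ in range(K)]
    x = 2 * random.random()
    lhs, rhs = sides(gammas, x)
    assert lhs >= rhs
```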
Lemma 13 (Fano's inequality). Let $A,B$ be random variables and $f$ be a function that, given $A$, predicts a value for $B$. If $\Pr(f(A)\neq B)\leq\delta$, then $H(B|A)\leq H_2(\delta)+\delta\cdot\log|\mathrm{supp}(B)|$.
Lemma 14. For the instance $\mathcal{D}_{\Delta,n}$ with sample profile $S$, we have $H(I;S)=O\big(|S|\cdot\frac{\Delta^2}{n}\big)$.

The proof of Lemma 14 follows the same analysis procedure as that of Lemma 7 in [1].
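The $O\big(|S|\cdot\Delta^2/n\big)$ scaling reflects that each sample carries only $O(\Delta^2)$ nats of information about $I$: the KL divergence between the shifted and unshifted per-sample observation distributions is $O(\Delta^2)$. For Bernoulli observations with an illustrative baseline $c=1/2$, this follows from the standard chi-square upper bound $\mathrm{KL}(\mathrm{Bern}(c+\Delta)\,\|\,\mathrm{Bern}(c))\leq\Delta^2/(c(1-c))$:

```python
import math

def kl_bern(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

c = 0.5
for Delta in (0.2, 0.1, 0.05, 0.01):
    # chi-square bound: KL(Bern(c + Delta) || Bern(c)) <= Delta^2 / (c (1 - c))
    assert kl_bern(c + Delta, c) <= Delta ** 2 / (c * (1 - c))
```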
Lemma 15 (Lemma 8 in [1]). Let $A\sim\mathcal{D}$ be a random variable on $[n]$ with $H(A)\geq\log n-\gamma$ for some $\gamma\geq1$. For any $\varepsilon>\exp(-\gamma)$, there exist $\ell+1$ distributions $\mathcal{D}_0,\mathcal{D}_1,\ldots,\mathcal{D}_\ell$ on $[n]$, along with $\ell+1$ probabilities $p_0,p_1,\ldots,p_\ell$ ($\sum_{j=0}^{\ell}p_j=1$), for some $\ell=O(\gamma/\varepsilon^3)$, such that $\mathcal{D}=\sum_{j=0}^{\ell}p_j\mathcal{D}_j$, $p_0=O(\varepsilon)$, and for any $j\geq1$:

1. $\log|\mathrm{supp}(\mathcal{D}_j)|\geq\log n-\gamma/\varepsilon$.
2. $\|\mathcal{D}_j-\mathcal{U}_j\|_{\mathrm{TV}}=O(\varepsilon)$, where $\mathcal{U}_j$ denotes the uniform distribution on $\mathrm{supp}(\mathcal{D}_j)$.