arXiv:2110.15771v1 [cs.LG] 29 Oct 2021

Collaborative Pure Exploration in Kernel Bandit

YIHAN DU, IIIS, Tsinghua University, China

WEI CHEN, Microsoft Research, China

YUKO KUROKI, The University of Tokyo / RIKEN, Japan

LONGBO HUANG, IIIS, Tsinghua University, China

In this paper, we formulate a Collaborative Pure Exploration in Kernel Bandit problem (CoPE-KB), which provides a novel model for

multi-agent multi-task decision making under limited communication and general reward functions, and is applicable to many online

learning tasks, e.g., recommendation systems and network scheduling. We consider two settings of CoPE-KB, i.e., Fixed-Confidence

(FC) and Fixed-Budget (FB), and design two optimal algorithms CoopKernelFC (for FC) and CoopKernelFB (for FB). Our algorithms

are equipped with innovative and efficient kernelized estimators to simultaneously achieve computation and communication efficiency.

Matching upper and lower bounds under both the statistical and communication metrics are established to demonstrate the optimality

of our algorithms. The theoretical bounds successfully quantify the influences of task similarities on learning acceleration and only

depend on the effective dimension of the kernelized feature space. Our analytical techniques, including data dimension decomposition,

linear structured instance transformation and (communication) round-speedup induction, are novel and applicable to other bandit

problems. Empirical evaluations are provided to validate our theoretical results and demonstrate the performance superiority of our

algorithms.

Additional Key Words and Phrases: collaborative pure exploration, kernel bandit, communication round, learning speedup

ACM Reference Format: Yihan Du, Wei Chen, Yuko Kuroki, and Longbo Huang. 2021. Collaborative Pure Exploration in Kernel Bandit. In Woodstock '21: ACM Symposium on Neural Gaze Detection, June 03–05, 2021, Woodstock, NY. ACM, New York, NY, USA, 44 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION

Pure exploration [3, 9, 12, 18, 21, 24] is a fundamental online learning problem in multi-armed bandits, where an agent

sequentially chooses options (often called arms) and observes random feedback, with the objective of identifying the

best option (arm). This problem finds various applications such as recommendation systems [31], online advertising [37]

and neural architecture search [17]. However, the traditional single-agent pure exploration problem [3, 9, 12, 18, 21, 24]

cannot be directly applied to many real-world distributed online learning platforms, which often face a large volume of

user requests and need to coordinate multiple distributed computing devices to process the requests, e.g., geographically

distributed data centers [30] and Web servers [44]. These computing devices communicate with each other to exchange

information in order to attain globally optimal performance.

To handle such distributed pure exploration problem, prior works [20, 22, 38] have developed the Collaborative Pure

Exploration (CoPE) model, where there are multiple agents that communicate and cooperate in order to identify the best

arm with learning speedup. Yet, existing results only investigate the classic multi-armed bandit (MAB) setting [3, 18, 21],

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not

made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components

of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to

redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ยฉ 2021 Association for Computing Machinery.

Manuscript submitted to ACM


and focus only on the fully-collaborative setting, i.e., the agents aim to solve a common task. However, in many

real-world applications such as recommendation systems [31], it is often the case that different computing devices

face different but correlated recommendation tasks. Moreover, there usually exists some structured dependency of user

utilities on the recommended items. In such applications, it is important to develop a more general CoPE model that

allows heterogeneous tasks and complex reward structures, and quantitatively investigate how task similarities impact

learning acceleration.

Motivated by the above facts, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem,

which generalizes traditional single-task CoPE problems [20, 22, 38] to the multi-task setting. It also generalizes the

classic MAB model to allow general (linear or nonlinear) reward structures via the powerful kernel representation.

Specifically, each agent is given a set of arms, and the expected reward of each arm is generated by a task-dependent

reward function with a low norm in a high-dimensional (possibly infinite-dimensional) Reproducing Kernel Hilbert

Space (RKHS) [33, 42], by which we can represent real-world nonlinear reward dependency as some linear function in

a high-dimensional space, and can go beyond linear rewards as commonly done in the literature, e.g., [10, 13, 14, 35, 40].

Each agent sequentially chooses arms to sample and observes noisy outcomes. The agents can broadcast and receive

messages to/from others in communication rounds, so that they can exploit the task similarity and collaborate to

expedite learning processes. The task of each agent is to find the best arm that maximizes the expected reward among

her arm set.

Our CoPE-KB formulation can handle different tasks in parallel and characterize the dependency of rewards on

options, which provides a more general and flexible model for real-world applications. For example, in distributed

recommendation systems [31], different computing devices can face different tasks, and it is inefficient to learn the

reward of each option individually. Instead, CoPE-KB enables us to directly learn the relationship between option

features and user utilities, and exploit the similarity of such relationship among different tasks to accelerate learning.

There are also many other applications, such as clinical trials [41], where we conduct multiple clinical trials in parallel

and utilize the common useful information to accelerate drug development, and neural architecture search [17], where

we simultaneously run different tests of neural architectures under different environmental setups to expedite search

processes.

We consider two important pure exploration settings under the CoPE-KB model, i.e., Fixed-Confidence (FC), where

agents aim to minimize the number of used samples under a given confidence, and Fixed-Budget (FB), where the goal is

to minimize the error probability under a given sample budget. Note that due to the high dimension (possibly infinite)

of the RKHS, it is highly non-trivial to simplify the burdensome computation and communication in the RKHS, and to

derive theoretical bounds only dependent on the effective dimension of the kernelized feature space. To tackle the above

challenges, we adopt efficient kernelized estimators and design novel algorithms CoopKernelFC and CoopKernelFB for

the FC and FB settings, respectively, which only cost Poly(nV) computation and communication complexity instead of Poly(dim(H_K)) as in [10, 43], where n is the number of arms, V is the number of agents, and H_K is the high-dimensional RKHS. We also establish matching upper and lower bounds in terms of sampling and communication complexity to demonstrate the optimality of our algorithms (within logarithmic factors).

Our work distinguishes itself from prior CoPE works, e.g., [20, 22, 38], in the following aspects: (i) Prior works [20, 22,

38] only consider the classic MAB setting, while we adopt a high-dimensional RKHS to allow more general real-world

reward dependency on option features. (ii) Unlike [20, 22, 38] which restrict tasks (given arm sets and rewards) among

agents to be the same, we allow different tasks for different agents, and explicitly quantify how task similarities impact

learning acceleration. (iii) In lower bound analysis, prior works [20, 38] mainly focus on a 2-armed case, whereas we


derive a novel lower bound analysis for general multi-armed cases with high-dimensional linear reward structures.

Moreover, when reducing CoPE-KB to prior CoPE with classic MAB setting (all agents are solving the same classic

MAB task) [20, 38], our lower and upper bounds also match the existing state-of-the-art results in [38].

The contributions of this paper are summarized as follows:

• We formulate a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem, which models distributed multi-task decision making with general reward functions and finds applications in many real-world online learning tasks. We study two settings of CoPE-KB, i.e., CoPE-KB with fixed-confidence (FC) and CoPE-KB with fixed-budget (FB).

• For CoPE-KB with fixed-confidence (FC), we propose an algorithm CoopKernelFC, which adopts an efficient kernelized estimator to significantly reduce computation and communication complexity from the existing Poly(dim(H_K)) to only Poly(nV). We derive matching upper and lower bounds of sample complexity Õ((ρ*/V) log δ^{-1}) and communication rounds O(log Δ_min^{-1}). Here ρ* is the problem hardness (defined in Section 4.2), and Δ_min is the minimum reward gap.

• For CoPE-KB with fixed-budget (FB), we design an efficient algorithm CoopKernelFB with error probability Õ(exp(−TV/ρ*) · n²V) and communication rounds O(log(ω(X̃))). A matching lower bound on communication rounds is also established to validate the communication optimality of CoopKernelFB (within double-logarithmic factors). Here T is the sample budget, X̃ is the set of arms, and ω(X̃) is the principal dimension of the data projections in X̃ onto the RKHS (defined in Section 5.1.1).

• Our results explicitly quantify the impacts of task similarities on learning acceleration. Our novel analytical techniques, including data dimension decomposition, linear structured instance transformation and round-speedup induction, can be of independent interest and are applicable to other bandit problems.

Due to the space limit, we defer all proofs to the Appendix.

2 RELATED WORK

This work falls in the literature of multi-armed bandits [4, 8, 27, 28, 39]. Here we mainly review the three most related lines of research, i.e., collaborative pure exploration, collaborative regret minimization, and kernel bandits.

Collaborative Pure Exploration (CoPE). The collaborative pure exploration literature was initiated by [20], which considers the classic MAB and fully-collaborative settings, and designs fixed-confidence algorithms based on majority vote with upper bound analysis. Tao et al. [38] further develop a fixed-budget algorithm by calling conventional single-agent fixed-confidence algorithms, and complete the analysis of round-speedup lower bounds. Karpov et al. [22] extend the formulation of [20, 38] to the best-m-arm identification problem, and design fixed-confidence and fixed-budget algorithms with tight round-speedup upper and lower bounds, which give a strong separation between best arm identification and the extended best-m-arm identification.

MAB and fully-collaborative settings in the above works [20, 22, 38], but faces unique challenges on computation and

communication efficiency due to the high-dimensional reward structures.

Collaborative Regret Minimization. There are other works studying collaborative (distributed) bandit with the

regret minimization objective. Bistritz and Leshem [5], Liu and Zhao [29], Rosenski et al. [32] study the multi-player

bandit with collisions motivated by cognitive radio networks, where multiple players simultaneously choose arms from the same set and receive no reward if more than one player chooses the same arm (i.e., a collision happens). Bubeck and Budzinski [6], Bubeck et al. [7] investigate a variant of the multi-player bandit problem where players cannot communicate but


have access to shared randomness, and they propose algorithms that achieve nearly optimal regrets without collisions.

Chakraborty et al. [11] introduce another distributed bandit problem, where each agent decides either to pull an arm

or to broadcast a message in order to maximize the total reward. Korda et al. [25], Szorenyi et al. [36] adapt bandit

algorithms to peer-to-peer networks, where the peers pick arms from the same set and can only communicate with a

few random others along network links. The above works consider different learning objectives and communication

protocols from ours, and do not involve the challenges of simultaneously handling multiple different tasks and analyzing

the relationship between communication rounds and learning speedup.

Kernel Bandit. There are a number of works on kernel bandits with the regret minimization objective. Srinivas et al. [35] study the Gaussian process bandit problem with RKHS, which is the Bayesian version of kernel bandits, and design an Upper Confidence Bound (UCB) style algorithm. Chowdhury and Gopalan [13] further improve the regret results of [35] by constructing tighter kernelized confidence intervals. Valko et al. [40] consider kernel bandits from the frequentist perspective and provide an alternative regret analysis based on the effective dimension. Deshmukh et al. [14], Krause and Ong [26] study multi-task kernel bandits, where the kernel function of the RKHS is composed of two parts, from task similarities and arm features. Dubey et al. [16] investigate the multi-agent kernel bandit with a local communication protocol, where the learning objective is to reduce the average regret per agent. For kernel bandits with the pure exploration objective, there are, to the best of our knowledge, only two works [10, 43]. Camilleri

et al. [10] design a single-agent algorithm which uses a robust inverse propensity score estimator to reduce the sample

complexity incurred by rounding procedures. Zhu et al. [43] propose a variant of [10] which applies neural networks to

approximate nonlinear reward functions. All of these works consider either regret minimization or the single-agent setting, which largely differs from our problem, and they do not investigate distributed decision making or the (communication) round-speedup trade-off. Thus, their algorithms and analyses cannot be applied to solve our CoPE-KB problem.

3 COLLABORATIVE PURE EXPLORATION IN KERNEL BANDIT (COPE-KB)

In this section, we present the formal formulation of the Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem, and discuss the two important settings under CoPE-KB that will be investigated.

Agents and rewards. There are V agents [V] = {1, ..., V}, who collaborate to solve different but possibly related instances (tasks) of the pure exploration in kernel bandit (PE-KB) problem. Each agent v ∈ [V] is given a set of n arms X_v = {x_{v,1}, ..., x_{v,n}} ⊆ ℝ^{d_X}, where d_X is the dimension of arm feature vectors. The expected reward of each arm x ∈ X_v is f_v(x), where f_v : X_v ↦ ℝ is an unknown reward function. Let X = ∪_{v∈[V]} X_v. Following the literature on kernel bandits [14, 26, 35, 40], we assume that for any v ∈ [V], f_v has a bounded norm in a Reproducing Kernel Hilbert Space (RKHS) specified by kernel K_X : X × X ↦ ℝ (see below for more details). At each timestep t, each agent v pulls an arm x_{v,t} ∈ X_v and observes a random reward y_{v,t} = f_v(x_{v,t}) + η_{v,t}, where η_{v,t} is an independent, zero-mean 1-sub-Gaussian noise (without loss of generality).¹ We assume that the best arms x_{v,*} = argmax_{x ∈ X_v} f_v(x) are unique for all v ∈ [V], which is a common assumption in the pure exploration literature, e.g., [3, 12, 18, 24].
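As a minimal illustration of this observation model, the sketch below simulates arm pulls with standard Gaussian noise (which is 1-sub-Gaussian); the agent/arm indices and reward values are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-agent instance in the spirit of the model: f[v][i] is the
# expected reward f_v(x_{v,i}) of agent v's i-th arm (values are made up).
f = {1: [0.3, 0.8],
     2: [0.8, 0.5]}

def pull(v, i):
    """Agent v pulls her i-th arm and observes f_v(x_{v,i}) plus independent
    zero-mean 1-sub-Gaussian noise (here: a standard Gaussian)."""
    return f[v][i] + rng.normal(0.0, 1.0)

# Averaging repeated pulls concentrates around the expected reward.
samples = [pull(1, 1) for _ in range(20000)]
print(abs(np.mean(samples) - 0.8) < 0.05)
```

Any zero-mean noise with tails no heavier than a standard Gaussian (e.g., bounded noise in [−1, 1]) would also satisfy the 1-sub-Gaussian assumption.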

Multi-Task Kernel Composition. We assume that the functions f_v are parametric functionals of a global function F : X × Z ↦ ℝ, which satisfies that, for each agent v ∈ [V], there exists a task feature vector z_v ∈ Z such that

f_v(x) = F(x, z_v), ∀x ∈ X_v. (1)

Here X and Z denote the arm feature space and task feature space, respectively. Eq. (1) allows tasks to be different for agents, whereas prior CoPE works [20, 22, 38] restrict the tasks (X_v and f_v) to be the same for all agents v ∈ [V].

¹A random variable η is called 1-sub-Gaussian if it satisfies E[exp(λη − λE[η])] ≤ exp(λ²/2) for any λ ∈ ℝ.


Denote X̃ = X × Z and x̃ = (x, z_v). As a standard assumption in kernel bandits [14, 16, 26, 35], we assume that F has a bounded norm in a global RKHS H_K specified by kernel K : X̃ × X̃ ↦ ℝ, and there exists a feature mapping φ : X̃ ↦ H_K and an unknown parameter θ* ∈ H_K such that

F(x̃) = φ(x̃)^⊤ θ*, ∀x̃ ∈ X̃, and K(x̃, x̃') = φ(x̃)^⊤ φ(x̃'), ∀x̃, x̃' ∈ X̃.

Here ‖θ*‖ := √(θ*^⊤ θ*) ≤ B for some known constant B > 0. K : X̃ × X̃ ↦ ℝ is a product composite kernel, which satisfies that for any z, z' ∈ Z and x, x' ∈ X,

K((x, z), (x', z')) = K_Z(z, z') · K_X(x, x'),

where K_X is the arm feature kernel that depicts the feature structure of arms, and K_Z is the task feature kernel that measures the similarity of the functions f_v. For example, in the fully-collaborative setting, all agents solve a common task, and we have z_v = 1 for all v ∈ [V], K_Z(z, z') = 1 for all z, z' ∈ Z, and K = K_X. On the contrary, if all tasks are different, then rank(K_Z) = V. K_Z allows us to characterize the influences of task similarities (1 ≤ rank(K_Z) ≤ V) on learning.
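A product composite kernel of this form can be sketched in a few lines. The concrete choices below, an RBF kernel for K_X and a linear kernel for K_Z, are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def k_x(x, xp):
    # Arm feature kernel K_X: an RBF kernel (an assumed, illustrative choice).
    d = np.asarray(x) - np.asarray(xp)
    return float(np.exp(-0.5 * np.dot(d, d)))

def k_z(z, zp):
    # Task feature kernel K_Z: a linear kernel on task features (also assumed).
    return float(np.dot(z, zp))

def k(xz, xzp):
    # Product composite kernel: K((x, z), (x', z')) = K_Z(z, z') * K_X(x, x').
    (x, z), (xp, zp) = xz, xzp
    return k_z(z, zp) * k_x(x, xp)

# Fully-collaborative setting: every task feature z_v = 1, so K_Z is
# identically 1 and the composite kernel K reduces to the arm kernel K_X.
x, xp = [0.0, 1.0], [1.0, 0.0]
print(np.isclose(k((x, [1.0]), (xp, [1.0])), k_x(x, xp)))
```

At the other extreme, orthogonal task features make K_Z(z, z') = 0 for z ≠ z', so observations from other tasks carry no information, matching the rank(K_Z) = V case.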

We give a simple 2-agent (2-task) illustrating example in Figure 1. Agent 1 is given Items 1 and 2 with expected rewards μ_1 and μ_2, respectively, denoted by X_1 = {x_{1,1}, x_{1,2}}. Agent 2 is given Items 2 and 3 with expected rewards μ_2 and μ_3, respectively, denoted by X_2 = {x_{2,1}, x_{2,2}}. Here x_{1,2} = x_{2,1} is the same item. In this case, φ(x̃_{1,1}) = [1, 0, 0]^⊤, φ(x̃_{1,2}) = φ(x̃_{2,1}) = [0, 1, 0]^⊤, φ(x̃_{2,2}) = [0, 0, 1]^⊤, and θ* = [μ_1, μ_2, μ_3]^⊤. The two agents can share the learned information on the second dimension of θ* to accelerate their learning processes.

Fig. 1. Illustrating example. Task 1 holds x_{1,1} (Item 1, reward μ_1) and x_{1,2} (Item 2, reward μ_2); Task 2 holds x_{2,1} = x_{1,2} (Item 2, reward μ_2) and x_{2,2} (Item 3, reward μ_3). Here x̃_{v,i} = (z_v, x_{v,i}), φ(x̃_{1,1}) = [1, 0, 0]^⊤, φ(x̃_{1,2}) = φ(x̃_{2,1}) = [0, 1, 0]^⊤, φ(x̃_{2,2}) = [0, 0, 1]^⊤, and θ* = [μ_1, μ_2, μ_3]^⊤.

Note that the RKHS H_K can have infinitely many dimensions, and any direct operation on H_K, e.g., the calculation of φ(x̃) or an explicit expression of the estimate of θ*, is impracticable. In this paper, all our algorithms only query the kernel function K(·, ·) instead of directly operating on H_K, and φ(x̃) and θ* are only used in our theoretical analysis, which is different from existing works, e.g., [10, 43].

Communication. Following the popular communication protocol in existing CoPE works [20, 22, 38], we allow the V agents to exchange information via communication rounds, in which each agent can broadcast messages to and receive messages from the others. While we do not restrict the exact length of a message, for practical implementation it should be bounded by O(n) bits. Here n is the number of arms for each agent, and we treat the number of bits for representing a real number as a constant.

In the CoPE-KB problem, our goal is to design computation and communication efficient algorithms that coordinate the agents to complete multiple tasks simultaneously in collaboration, and to characterize how the task similarities impact the learning speedup.

Fixed-Confidence and Fixed-Budget. We consider two versions of the CoPE-KB problem, one with fixed-confidence (FC) and the other with fixed-budget (FB). Specifically, in the FC setting, given a confidence parameter δ ∈ (0, 1), the agents aim to identify x_{v,*} for all v ∈ [V] with probability at least 1 − δ and to minimize the average number of samples used by each agent. In the FB setting, on the other hand, the agents are given an overall sample budget T · V (T samples per agent on average), and aim to use at most T · V samples to identify x_{v,*} for all v ∈ [V] while minimizing the error probability. In both the FC and FB settings, agents are requested to minimize the number of communication rounds.

To evaluate the learning acceleration of our algorithms, following the CoPE literature, e.g., [20, 22, 38], we also define a speedup metric. For a CoPE-KB instance I, let T_{A_M, I} denote the average number of samples used by each agent in a multi-agent algorithm A_M to identify x_{v,*} for all v ∈ [V], and let T_{A_S, I} denote the average number of samples used per task by a single-agent algorithm A_S to sequentially (without communication) identify x_{v,*} for all v ∈ [V]. Then, the speedup of A_M on instance I is formally defined as

β_{A_M, I} = inf_{A_S} T_{A_S, I} / T_{A_M, I}. (2)

It can be seen that 1 ≤ β_{A_M, I} ≤ V, where β_{A_M, I} = 1 for the case where all tasks are different, and β_{A_M, I} can approach V for a fully-collaborative instance. By taking T_{A_M, I} and T_{A_S, I} as the smallest numbers of samples needed to meet the confidence constraint, β_{A_M, I} is defined similarly for error probability results.

In particular, when all agents v ∈ [V] have the same arm set X_v = X = {e_1, ..., e_n} (i.e., the standard bases in ℝⁿ) and the same reward function f_v(x) = f(x) = x^⊤ θ* for any x ∈ X, all agents are solving a common classic MAB task; then the task feature space is Z = {1} and K_Z(z, z') = 1 for any z, z' ∈ Z. In this case, our CoPE-KB problem reduces to prior CoPE with the classic MAB setting [20, 38].

4 FIXED-CONFIDENCE COPE-KB

We start with the fixed-confidence (FC) setting and propose the CoopKernelFC algorithm. We explicitly quantify how

task similarities impact learning acceleration, and provide sample complexity and round-speedup lower bounds to

demonstrate the optimality of CoopKernelFC.

4.1 Algorithm CoopKernelFC

4.1.1 Algorithm. CoopKernelFC has three key components: (i) maintain alive arm sets for all agents, (ii) perform sampling individually according to the globally optimal sample allocation, and (iii) exchange the distilled observation information to estimate reward gaps and eliminate sub-optimal arms, via efficient kernelized computation and communication schemes.

The procedure of CoopKernelFC (Algorithm 1) for each agent v is as follows. Agent v maintains alive arm sets B_{v'}^{(t)} for all v' ∈ [V] by successively eliminating sub-optimal arms in each phase. In phase t, she solves a global min-max optimization, which takes into account the objectives and available arm sets of all agents, to obtain the optimal sample allocation λ*_t ∈ △_{X̃} and optimal value ρ*_t (Line 4). Here △_{X̃} is the collection of all distributions on X̃, and ξ_t is a regularization parameter such that

√ξ_t · max_{x̃_i, x̃_j ∈ X̃_v, v ∈ [V]} ‖φ(x̃_i) − φ(x̃_j)‖_{(ξ_t I + Σ_{x̃ ∈ X̃} (1/(nV)) φ(x̃)φ(x̃)^⊤)^{−1}} ≤ 1 / ((1 + ε) B · 2^{t+1}), (3)

which ensures the estimation bias for reward gaps to be bounded by 2^{−(t+1)}, and can be efficiently computed by a kernelized transformation (specified in Section 4.1.2). Then, agent v uses ρ*_t to compute the number of required samples N^{(t)}, which guarantees that the confidence radius of the reward gap estimates is within 2^{−t} (Line 5). In algorithm CoopKernelFC, we use a rounding procedure ROUND_ε(λ, N) with approximation parameter ε from [2, 10], which rounds the sample


Algorithm 1: Distributed Algorithm CoopKernelFC: for Agent v (v ∈ [V])

Input: δ, X̃_1, ..., X̃_V, K(·, ·) : X̃ × X̃ ↦ ℝ, B, rounding procedure ROUND_ε(·, ·) with approximation parameter ε.

1: Initialization: B_{v'}^{(1)} ← X_{v'} for all v' ∈ [V]; t ← 1 ; // initialize alive arm sets B_{v'}^{(1)}
2: while ∃v' ∈ [V], |B_{v'}^{(t)}| > 1 do
3:   δ_t ← δ/(2t²) ;
4:   Let λ*_t and ρ*_t be the optimal solution and optimal value of min_{λ ∈ △_{X̃}} max_{x̃_i, x̃_j ∈ B_{v'}^{(t)}, v' ∈ [V]} ‖φ(x̃_i) − φ(x̃_j)‖²_{(ξ_t I + Σ_{x̃ ∈ X̃} λ_{x̃} φ(x̃)φ(x̃)^⊤)^{−1}}, where ξ_t is a regularization parameter that satisfies Eq. (3) ; // compute the optimal sample allocation
5:   N^{(t)} ← ⌈8(2^t)² (1 + ε)² ρ*_t log(2n²V/δ_t)⌉ ; // compute the number of required samples
6:   (s_1, ..., s_{N^{(t)}}) ← ROUND_ε(λ*_t, N^{(t)}) ;
7:   Let s̃_v^{(t)} be the sub-sequence of (s_1, ..., s_{N^{(t)}}) which only contains the arms in X̃_v ; // generate the sample sequence for agent v
8:   Pull arms s̃_v^{(t)} and observe random rewards y_v^{(t)} ;
9:   Broadcast {(N_{v,i}^{(t)}, ȳ_{v,i}^{(t)})}_{i ∈ [n]}, where N_{v,i}^{(t)} is the number of samples and ȳ_{v,i}^{(t)} is the average observed reward on arm x̃_{v,i} ;
10:  Receive {(N_{v',i}^{(t)}, ȳ_{v',i}^{(t)})}_{i ∈ [n]} from all other agents v' ∈ [V] \ {v} ;
11:  For notational simplicity, combine the subscripts v', i in x̃_{v',i}, N_{v',i}^{(t)}, ȳ_{v',i}^{(t)} into x̃_{(v'−1)n+i}, N_{(v'−1)n+i}^{(t)}, ȳ_{(v'−1)n+i}^{(t)}, respectively. Then, k_t(x̃) ← [√(N_1^{(t)}) K(x̃, x̃_1), ..., √(N_{nV}^{(t)}) K(x̃, x̃_{nV})]^⊤ for any x̃ ∈ X̃ ; K^{(t)} ← [√(N_i^{(t)} N_j^{(t)}) K(x̃_i, x̃_j)]_{i,j ∈ [nV]} ; ȳ^{(t)} ← [√(N_1^{(t)}) ȳ_1^{(t)}, ..., √(N_{nV}^{(t)}) ȳ_{nV}^{(t)}]^⊤ ; // organize overall observation information
12:  for all v' ∈ [V] do
13:    Δ̂_t(x̃_i, x̃_j) ← (k_t(x̃_i) − k_t(x̃_j))^⊤ (K^{(t)} + N^{(t)} ξ_t I)^{−1} ȳ^{(t)}, ∀x̃_i, x̃_j ∈ B_{v'}^{(t)} ; // estimate the reward gap between x̃_i and x̃_j
14:    B_{v'}^{(t+1)} ← B_{v'}^{(t)} \ {x̃ ∈ B_{v'}^{(t)} | ∃x̃' ∈ B_{v'}^{(t)} : Δ̂_t(x̃', x̃) ≥ 2^{−t}} ; // discard sub-optimal arms
15:  t ← t + 1 ;
16: return B_1^{(t)}, ..., B_V^{(t)} ;

allocation $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ into integer numbers of samples $\kappa \in \mathbb{N}^{|\tilde{\mathcal{X}}|}$, such that $\sum_{\tilde{x} \in \tilde{\mathcal{X}}} \kappa_{\tilde{x}} = N$ and
$$\max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}^{(t)}_{v'}, v' \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(N \xi_t I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \kappa_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top)^{-1}} \leq (1+\varepsilon) \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}^{(t)}_{v'}, v' \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(N \xi_t I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} N \lambda_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top)^{-1}}.$$

By calling $\mathrm{ROUND}_\varepsilon(\lambda^*_t, N^{(t)})$, agent $v$ generates an overall sample sequence $(s_1, \ldots, s_{N^{(t)}})$ according to $\lambda^*_t$, and extracts the sub-sequence $\tilde{s}^{(t)}_v$ that only contains the arms in $\tilde{\mathcal{X}}_v$ to sample (Lines 6-8). After sampling, she communicates only the number of samples $N^{(t)}_{v,i}$ and the average observed reward $\bar{y}^{(t)}_{v,i}$ for each arm with the other agents (Lines 9-10). With the overall observation information, she estimates the reward gap $\hat{\Delta}_t(x_i, x_j)$ between any arm pair $x_i, x_j \in \mathcal{B}^{(t)}_{v'}$ for all $v' \in [V]$ and discards sub-optimal arms (Lines 13-14).

4.1.2 Computation and Communication Efficiency. Here we explain the efficiency of CoopKernelFC. Note that in CoPE-KB, due to its high-dimensional reward structure, directly using the empirical mean to estimate rewards would cause loose sample complexity, and naively calculating and transmitting the infinite-dimensional parameter $\theta^*$ would incur huge computation and communication costs. As a result, we cannot directly compute and communicate scalar empirical rewards as in prior CoPE works with classic MAB [20, 22, 38].

Computation Efficiency. CoopKernelFC uses three efficient kernelized operations: the optimization solver (Line 4), the condition for the regularization parameter $\xi_t$ (Eq. (3)), and the estimator of reward gaps (Line 13). Unlike prior kernel bandit algorithms [10, 43], which explicitly compute $\phi(x)$ and maintain an estimate of $\theta^*$ in the infinite-dimensional RKHS, CoopKernelFC only queries the kernel function $K(\cdot, \cdot)$, which significantly reduces the computation (memory) cost from $\mathrm{Poly}(\dim(\mathcal{H}_K))$ to only $\mathrm{Poly}(nV)$. Below we give the formal expressions of these operations and defer their detailed derivations to Appendix A.1.

Kernelized Estimator. We first introduce the kernelized estimator of reward gaps (Line 13). Following the standard estimation procedure in linear/kernel bandits [10, 19, 23, 43], we consider the following regularized least square estimator of the underlying reward parameter $\theta^*$:
$$\hat{\theta}_t = \Big( N^{(t)} \xi_t I + \sum_{j=1}^{N^{(t)}} \phi(s_j) \phi(s_j)^\top \Big)^{-1} \sum_{j=1}^{N^{(t)}} \phi(s_j) y_j.$$
Note that this form of $\hat{\theta}_t$ has $N^{(t)}$ terms in the summation, which are cumbersome to compute and communicate. Since the samples $(s_1, \ldots, s_{N^{(t)}})$ are composed of the arms $x_1, \ldots, x_{nV}$, we merge repetitive computations for the same arms in the summation and obtain (for notational simplicity, we combine the subscripts $v', i$ in $x_{v',i}, N^{(t)}_{v',i}, \bar{y}^{(t)}_{v',i}$ by using $x_{(v'-1)n+i}, N^{(t)}_{(v'-1)n+i}, \bar{y}^{(t)}_{(v'-1)n+i}$, respectively)
$$\hat{\theta}_t \overset{(a)}{=} \Big( N^{(t)} \xi_t I + \sum_{i=1}^{nV} N^{(t)}_i \phi(x_i) \phi(x_i)^\top \Big)^{-1} \sum_{i=1}^{nV} N^{(t)}_i \phi(x_i) \bar{y}^{(t)}_i \overset{(b)}{=} \Phi_t^\top \big( N^{(t)} \xi_t I + K^{(t)} \big)^{-1} \bar{y}^{(t)}. \quad (4)$$

Here ๐‘(๐‘ก )๐‘–

is the number of samples and ๐‘ฆ(๐‘ก )๐‘–

is the average observed reward on arm ๐‘ฅ๐‘– for any ๐‘– โˆˆ [๐‘›๐‘‰ ]. ฮฆ๐‘ก =

[โˆš๏ธƒ๐‘(๐‘ก )1๐œ™ (๐‘ฅ1)โŠค; . . . ;

โˆš๏ธƒ๐‘(๐‘ก )๐‘›๐‘‰๐œ™ (๐‘ฅ๐‘›๐‘‰ )โŠค] is the empiricallyweighted feature vector,๐พ (๐‘ก ) = ฮฆ๐‘กฮฆ

โŠค๐‘ก = [

โˆš๏ธƒ๐‘(๐‘ก )๐‘–๐‘(๐‘ก )๐‘—๐พ (๐‘ฅ๐‘– , ๐‘ฅ ๐‘— )]๐‘–, ๐‘— โˆˆ[๐‘›๐‘‰ ]

is the kernel matrix, and ๐‘ฆ (๐‘ก ) = [โˆš๏ธƒ๐‘(๐‘ก )1๐‘ฆ(๐‘ก )1, . . . ,

โˆš๏ธƒ๐‘(๐‘ก )๐‘›๐‘‰๐‘ฆ(๐‘ก )๐‘›๐‘‰]โŠค is the average observations. Equality (a) rearranges the

summation according to different chosen arms, and (b) follows from kernel transformation.

Then, by multiplying with $(\phi(x_i) - \phi(x_j))^\top$, we obtain the estimator of the reward gap $f(x_i) - f(x_j)$ as
$$\hat{\Delta}_t(x_i, x_j) = (\phi(x_i) - \phi(x_j))^\top \hat{\theta}_t = (k_t(x_i) - k_t(x_j))^\top \big( N^{(t)} \xi_t I + K^{(t)} \big)^{-1} \bar{y}^{(t)}, \quad (5)$$
where $k_t(x) = \Phi_t \phi(x) = [\sqrt{N^{(t)}_1} K(x, x_1), \ldots, \sqrt{N^{(t)}_{nV}} K(x, x_{nV})]^\top$ for any $x \in \tilde{\mathcal{X}}$. This estimator not only transforms heavy operations on the infinite-dimensional RKHS into efficient ones that only query the kernel function, but also merges repetitive computations for the same arms (equality (a)), so all calculations have cost depending only on $nV$.
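To sanity-check the kernel transformation in Eqs. (4)-(5), the following sketch compares the primal ridge estimate with its kernelized form using a small finite-dimensional feature map standing in for $\phi$; the dimensions, counts and rewards are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nV = 3, 5
Phi0 = rng.normal(size=(nV, d))      # rows: phi(x_i); finite-dim stand-in for the RKHS map
counts = np.array([4, 2, 3, 1, 5])   # N_i^{(t)}: per-arm sample counts
ybar = rng.normal(size=nV)           # per-arm average observed rewards
N, xi = counts.sum(), 0.1            # N^{(t)} and regularization parameter xi_t

# Primal ridge estimator: (N xi I + sum_i N_i phi_i phi_i^T)^{-1} sum_i N_i phi_i ybar_i
A = N * xi * np.eye(d) + Phi0.T @ (counts[:, None] * Phi0)
theta = np.linalg.solve(A, Phi0.T @ (counts * ybar))

# Kernelized form, Eq. (4): theta = Phi_t^T (N xi I + K^{(t)})^{-1} ybar^{(t)}
w = np.sqrt(counts)
Phi_t = w[:, None] * Phi0            # rows: sqrt(N_i) phi(x_i)
K = Phi_t @ Phi_t.T                  # [sqrt(N_i N_j) K(x_i, x_j)]
y = w * ybar                         # [sqrt(N_i) ybar_i]
theta_k = Phi_t.T @ np.linalg.solve(N * xi * np.eye(nV) + K, y)

# Gap estimate, Eq. (5): only kernel products k_t(x) = Phi_t phi(x) are needed
gap = (Phi_t @ Phi0[0] - Phi_t @ Phi0[1]) @ np.linalg.solve(N * xi * np.eye(nV) + K, y)
```

Both routes produce the same $\hat{\theta}_t$, but the kernelized route only ever touches $nV$-dimensional quantities, which is what makes the RKHS case tractable.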

Kernelized Optimization Solver/Condition for Regularization Parameter. Now we introduce the optimization solver (Line 4) and the condition for the regularization parameter $\xi_t$ (Eq. (3)).
For the kernelized optimization solver, we solve the min-max optimization in Line 4 by projected gradient descent, which follows the procedure in [10]. Specifically, let $A(\xi, \lambda) = \xi I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x}) \phi(\tilde{x})^\top$ for any $\xi > 0$, $\lambda \in \triangle_{\tilde{\mathcal{X}}}$. We define the function $h(\lambda) = \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}^{(t)}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{A(\xi_t, \lambda)^{-1}}$, and denote the optimal solution of $h(\lambda)$ by $x^*_i(\lambda), x^*_j(\lambda)$. Then, the gradient of $h(\lambda)$ is given by
$$[\nabla_\lambda h(\lambda)]_{\tilde{x}} = -\Big( \big( \phi(x^*_i(\lambda)) - \phi(x^*_j(\lambda)) \big)^\top A(\xi_t, \lambda)^{-1} \phi(x) \Big)^2, \quad \forall x \in \tilde{\mathcal{X}}, \quad (6)$$
which can be efficiently calculated by the following kernel transformation:
$$\big( \phi(x^*_i(\lambda)) - \phi(x^*_j(\lambda)) \big)^\top A(\xi_t, \lambda)^{-1} \phi(x) = \xi_t^{-1} \Big( K(x^*_i(\lambda), x) - K(x^*_j(\lambda), x) - \big( k_\lambda(x^*_i(\lambda)) - k_\lambda(x^*_j(\lambda)) \big)^\top (\xi_t I + K_\lambda)^{-1} k_\lambda(x) \Big), \quad (7)$$
where $k_\lambda(x) = [\sqrt{\lambda_1} K(x, x_1), \ldots, \sqrt{\lambda_{nV}} K(x, x_{nV})]^\top$ and $K_\lambda = [\sqrt{\lambda_i \lambda_j} K(x_i, x_j)]_{i,j \in [nV]}$.

For condition Eq. (3) on the regularization parameter $\xi_t$, we can transform it to
$$\max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \sqrt{ K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) - \|k_{\lambda_u}(x_i) - k_{\lambda_u}(x_j)\|^2_{(\xi_t I + K_{\lambda_u})^{-1}} } \;\leq\; \frac{1}{(1+\varepsilon) B \cdot 2^{t+1}}, \quad (8)$$
where $\lambda_u = [\frac{1}{nV}, \ldots, \frac{1}{nV}]^\top$ is the uniform distribution on $\tilde{\mathcal{X}}$, $k_{\lambda_u}(x) = [\frac{1}{\sqrt{nV}} K(x, x_1), \ldots, \frac{1}{\sqrt{nV}} K(x, x_{nV})]^\top$ and $K_{\lambda_u} = [\frac{1}{nV} K(x_i, x_j)]_{i,j \in [nV]}$.

Both the kernelized optimization solver and the condition for $\xi_t$ avoid inefficient operations directly on the infinite-dimensional RKHS by querying the kernel function, and only cost $\mathrm{Poly}(nV)$ computation (memory) complexity: Eqs. (7) and (8) contain only the scalars $K(x_i, x_j)$, the $nV$-dimensional vector $k_\lambda$ and the $nV \times nV$ matrix $K_\lambda$.
Communication Efficiency. By taking advantage of the kernelized estimator (Eq. (5)), CoopKernelFC merges repetitive computations for the same arms and only transmits the $nV$ scalar tuples $\{(N^{(t)}_{v,i}, \bar{y}^{(t)}_{v,i})\}_{i \in [n], v \in [V]}$ among agents, instead of transmitting all $N^{(t)}$ samples as in [16]. This significantly reduces the communication cost from $O(N^{(t)})$ bits to $O(nV)$ bits (Lines 9-10).
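The broadcast message can be sketched as follows: raw pulls are collapsed into one (count, average) pair per arm, so the message size is independent of $N^{(t)}$. This is a minimal illustration of the compression idea, not the full protocol.

```python
from collections import defaultdict

def compress_observations(pulls):
    """Collapse a list of (arm_index, reward) pulls into per-arm
    (sample count, average reward) tuples -- the only payload broadcast."""
    acc = defaultdict(lambda: [0, 0.0])
    for arm, r in pulls:
        acc[arm][0] += 1
        acc[arm][1] += r
    return {arm: (c, total / c) for arm, (c, total) in acc.items()}

msg = compress_observations([(0, 1.0), (0, 3.0), (2, 5.0)])
# msg == {0: (2, 2.0), 2: (1, 5.0)}: three pulls compressed to two tuples
```

The kernelized estimator (Eq. (5)) needs exactly these per-arm sufficient statistics, which is why no per-sample information has to cross the network.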

4.2 Theoretical performance of CoopKernelFC
Define the problem hardness of identifying the best arms $x^*_v$ for all $v \in [V]$ as
$$\rho^* = \min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{x \in \tilde{\mathcal{X}}_v \setminus \{x^*_v\}, v \in [V]} \frac{\|\phi(x^*_v) - \phi(x)\|^2_{(\xi^* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top)^{-1}}}{(f(x^*_v) - f(x))^2}, \quad (9)$$
where $\xi^* = \min_{t \geq 1} \xi_t$. $\rho^*$ is the information-theoretic lower bound of the CoPE-KB problem, which is adapted from linear/kernel bandit pure exploration [10, 19, 23, 43]. Let $S$ denote the per-agent sample complexity, i.e., the average number of samples used by each agent in algorithm CoopKernelFC.

The sample complexity and number of communication rounds of CoopKernelFC are as follows.
Theorem 1 (Fixed-Confidence Upper Bound). With probability at least $1 - \delta$, algorithm CoopKernelFC returns the correct answers $x^*_v$ for all $v \in [V]$, with per-agent sample complexity
$$S = O\Big( \frac{\rho^*}{V} \cdot \log \Delta_{\min}^{-1} \Big( \log\Big(\frac{nV}{\delta}\Big) + \log\log \Delta_{\min}^{-1} \Big) \Big)$$
and communication rounds $O(\log \Delta_{\min}^{-1})$.


Remark 1. $\rho^*$ comprises two sources of problem hardness, one due to handling different tasks and the other due to distinguishing different arms (we decompose the sample complexity into these two parts in Corollary 1(c)). We see that the sample complexity of CoopKernelFC matches the lower bound (up to logarithmic factors). For fully-collaborative instances, where single-agent algorithms [19, 23] have $\tilde{O}(\rho^* \log \delta^{-1})$ sample complexity, our CoopKernelFC achieves the maximum $V$-speedup (i.e., enjoys $\tilde{O}(\frac{\rho^*}{V} \log \delta^{-1})$ sample complexity) using only logarithmic communication rounds.

Interpretation. We further interpret Theorem 1 via standard expressive tools in kernel bandits [14, 35, 40], i.e., effective dimension and maximum information gain, to characterize the relationship between sample complexity and data structures, and to demonstrate how task similarity influences learning performance.
To this end, define the maximum information gain over all sample allocations $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ as
$$\Upsilon = \max_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log \det \big( I + (\xi^*)^{-1} K_\lambda \big).$$
Denote $\lambda^* = \operatorname{argmax}_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log \det ( I + (\xi^*)^{-1} K_\lambda )$, let $\alpha_1 \geq \cdots \geq \alpha_{nV}$ be the eigenvalues of $K_{\lambda^*}$, and define the effective dimension of $K_{\lambda^*}$ as
$$d_{\mathrm{eff}} = \min \Big\{ j : j \, \xi^* \log(nV) \geq \sum_{i=j+1}^{nV} \alpha_i \Big\}.$$
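Both quantities are straightforward to compute from the spectrum of $K_{\lambda^*}$; the matrix and $\xi^*$ in the sketch below are arbitrary illustrative values.

```python
import numpy as np

def info_gain_and_deff(K_lam, xi, nV):
    """Information gain log det(I + K_lam / xi) for this K_lam, and the
    effective dimension d_eff, per the definitions in the text."""
    alpha = np.sort(np.linalg.eigvalsh(K_lam))[::-1]     # eigenvalues, decreasing
    upsilon = float(np.sum(np.log1p(alpha / xi)))
    tails = [alpha[j:].sum() for j in range(1, len(alpha) + 1)]  # sum_{i > j} alpha_i
    d_eff = next(j for j in range(1, len(alpha) + 1)
                 if j * xi * np.log(nV) >= tails[j - 1])
    return upsilon, d_eff

upsilon, d_eff = info_gain_and_deff(np.diag([4.0, 1.5, 0.1]), xi=1.0, nV=3)
# d_eff == 2: two directions carry nearly all of the spectrum, the tail is tiny
```

The fast spectral decay typical of smooth kernels is exactly what keeps $d_{\mathrm{eff}}$ far below $nV$.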

We then have the following corollary.
Corollary 1. The per-agent sample complexity of algorithm CoopKernelFC, denoted by $S$, can also be bounded as follows:
(a) $S = O\big( \frac{\Upsilon}{\Delta_{\min}^2 V} \cdot g(\Delta_{\min}, \delta) \big)$, where $\Upsilon$ is the maximum information gain.
(b) $S = O\big( \frac{d_{\mathrm{eff}}}{\Delta_{\min}^2 V} \cdot \log\big( nV \cdot \big(1 + \frac{\operatorname{Trace}(K_{\lambda^*})}{\xi^* d_{\mathrm{eff}}}\big) \big) \, g(\Delta_{\min}, \delta) \big)$, where $d_{\mathrm{eff}}$ is the effective dimension.
(c) $S = O\big( \frac{\operatorname{rank}(K_z) \cdot \operatorname{rank}(K_x)}{\Delta_{\min}^2 V} \cdot \log\big( \frac{\operatorname{Trace}(I + (\xi^*)^{-1} K_{\lambda^*})}{\operatorname{rank}(K_{\lambda^*})} \big) \, g(\Delta_{\min}, \delta) \big)$.
Here $g(\Delta_{\min}, \delta) = \log \Delta_{\min}^{-1} \big( \log\big(\frac{nV}{\delta}\big) + \log\log \Delta_{\min}^{-1} \big)$.

Remark 2. Corollary 1(a) shows that our sample complexity can be bounded by the maximum information gain over all sample allocations on $\tilde{\mathcal{X}}$, which extends conventional information-gain-based results for regret minimization in kernel bandits [13, 16, 35] to the pure exploration setting, viewed through experimental (allocation) design.
In terms of dimension dependency, Corollary 1(b) demonstrates that our result only depends on the effective dimension of the kernel representation, i.e., the number of principal directions along which the data projections in the RKHS spread.
We also provide a fundamental decomposition of the sample complexity into two components, from task similarities and arm features, in Corollary 1(c), which shows that the more similar the tasks are, the fewer samples we need to accomplish all tasks. For example, when the tasks are the same (fully-collaborative), i.e., $\operatorname{rank}(K_z) = 1$, each agent only spends a $\frac{1}{V}$ fraction of the samples used by single-agent algorithms [10, 43]. Conversely, when the tasks are totally different, i.e., $\operatorname{rank}(K_z) = V$, no advantage can be attained by multi-agent deployment, since the information from neighboring agents is useless for solving local tasks.


4.3 Lower Bound for Fixed-Confidence Setting
We now present lower bounds on the sample complexity and a round-speedup lower bound for fully-collaborative instances, using a novel measure transformation technique. These bounds validate the optimality of CoopKernelFC in both sampling and communication. Specifically, Theorems 2 and 3 below formally present our bounds. In the theorems, we call a distributed algorithm $\mathcal{A}$ for CoPE-KB $\delta$-correct if it returns the correct answers $x^*_v$ for all $v \in [V]$ with probability at least $1 - \delta$.
Theorem 2 (Fixed-Confidence Sample Complexity Lower Bound). Consider the fixed-confidence collaborative pure exploration in kernel bandit problem with Gaussian noise $\eta_{v,t}$. Given any $\delta \in (0, 1)$, a $\delta$-correct distributed algorithm $\mathcal{A}$ must have per-agent sample complexity $\Omega(\frac{\rho^*}{V} \log \delta^{-1})$.

Remark 3. Theorem 2 shows that even if the agents are allowed to share samples without limitation, each agent still requires at least $\tilde{\Omega}(\frac{\rho^*}{V})$ samples on average. Together with Theorem 1, one sees that CoopKernelFC is within logarithmic factors of the optimal sample complexity.

Theorem 3 (Fixed-Confidence Round-Speedup Lower Bound). There exists a fully-collaborative instance of the fixed-confidence CoPE-KB problem with multi-armed and linear reward structures for which, given any $\delta \in (0, 1)$, a $\delta$-correct and $\beta$-speedup distributed algorithm $\mathcal{A}$ must use
$$\Omega\Big( \frac{\log \Delta_{\min}^{-1}}{\log(1 + \frac{V}{\beta}) + \log\log \Delta_{\min}^{-1}} \Big)$$
communication rounds in expectation. In particular, when $\beta = V$, $\mathcal{A}$ must use $\Omega\big( \frac{\log \Delta_{\min}^{-1}}{\log\log \Delta_{\min}^{-1}} \big)$ communication rounds in expectation.

Remark 4. Theorem 3 shows that logarithmically many communication rounds are indispensable for achieving the full speedup, which validates that CoopKernelFC is near-optimal in communication. Moreover, when CoPE-KB reduces to the prior CoPE setting with classic MAB [20, 38], i.e., all agents solve the same classic MAB task, our upper and lower bounds (Theorems 1 and 3) match the state-of-the-art results in [38].

Novel Analysis for Fixed-Confidence Round-Speedup Lower Bound. We highlight that our round-speedup lower bound analysis for the FC setting has the following novel aspects. (i) Unlike prior CoPE work [38], which focuses on a preliminary 2-armed case without considering reward structures, we investigate multi-armed instances with high-dimensional linear reward structures. (ii) We develop a linear structured progress lemma (Lemma 3 in Appendix A.5), which effectively handles the challenges caused by the different possible sample allocations over multiple arms and derives the required number of communication rounds under linear reward structures. (iii) We propose multi-armed measure transformation and linear structured instance transformation lemmas (Lemmas 4 and 5 in Appendix A.5), which bound the change of probability measures under instance transformation with multiple arms and high-dimensional linear rewards, and serve as basic analytical tools in our proof.

5 FIXED-BUDGET COPE-KB

We now turn to the fixed-budget (FB) setting and design an efficient algorithm CoopKernelFB. We also establish a

fixed-budget round-speedup lower bound to validate its communication optimality.


Algorithm 2: Distributed Algorithm CoopKernelFB: for Agent $v$ ($v \in [V]$)
Input: Per-agent budget $T$, $\tilde{\mathcal{X}}_1, \ldots, \tilde{\mathcal{X}}_V$, $K(\cdot, \cdot): \tilde{\mathcal{X}} \times \tilde{\mathcal{X}} \mapsto \mathbb{R}$, regularization parameter $\xi^*$, rounding procedure $\mathrm{ROUND}_\varepsilon(\cdot, \cdot)$ with approximation parameter $\varepsilon$.
1 Initialization: $R \leftarrow \lceil \log_2(\omega(\tilde{\mathcal{X}})) \rceil$; $N \leftarrow \lfloor TV/R \rfloor$; $\mathcal{B}^{(1)}_{v'} \leftarrow \tilde{\mathcal{X}}_{v'}$ for all $v' \in [V]$; $t \leftarrow 1$; // pre-determine the number of phases and the number of samples for each phase
2 while $t \leq R$ and $\exists v' \in [V], |\mathcal{B}^{(t)}_{v'}| > 1$ do
3 Let $\lambda^*_t$ and $\rho^*_t$ be the optimal solution and optimal value of
$$\min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x}_i, \tilde{x}_j \in \mathcal{B}^{(t)}_{v'}, v' \in [V]} \|\phi(x_i) - \phi(x_j)\|^2_{(\xi^* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top)^{-1}};$$ // compute the optimal sample allocation
4 $(s_1, \ldots, s_N) \leftarrow \mathrm{ROUND}_\varepsilon(\lambda^*_t, N)$;
5 Let $\tilde{s}^{(t)}_v$ be the sub-sequence of $(s_1, \ldots, s_N)$ which only contains the arms in $\tilde{\mathcal{X}}_v$; // generate the sample sequence for agent $v$
6 Pull arms $\tilde{s}^{(t)}_v$ and observe random rewards $\boldsymbol{y}^{(t)}_v$;
7 Broadcast $\{(N^{(t)}_{v,i}, \bar{y}^{(t)}_{v,i})\}_{i \in [n]}$;
8 Receive $\{(N^{(t)}_{v',i}, \bar{y}^{(t)}_{v',i})\}_{i \in [n]}$ from all other agents $v' \in [V] \setminus \{v\}$;
9 $k_t(x) \leftarrow [\sqrt{N^{(t)}_1} K(x, x_1), \ldots, \sqrt{N^{(t)}_{nV}} K(x, x_{nV})]^\top$ for any $x \in \tilde{\mathcal{X}}$; $K^{(t)} \leftarrow [\sqrt{N^{(t)}_i N^{(t)}_j} K(x_i, x_j)]_{i,j \in [nV]}$; $\bar{y}^{(t)} \leftarrow [\sqrt{N^{(t)}_1} \bar{y}^{(t)}_1, \ldots, \sqrt{N^{(t)}_{nV}} \bar{y}^{(t)}_{nV}]^\top$; // organize overall observation information
10 for all $v' \in [V]$ do
11 $\hat{f}_t(x) \leftarrow k_t(x)^\top (K^{(t)} + N \xi^* I)^{-1} \bar{y}^{(t)}$ for all $x \in \mathcal{B}^{(t)}_{v'}$; // estimate the rewards of alive arms
12 Sort all $x \in \mathcal{B}^{(t)}_{v'}$ by $\hat{f}_t(x)$ in decreasing order. Let $x_{(1)}, \ldots, x_{(|\mathcal{B}^{(t)}_{v'}|)}$ denote the sorted arm sequence;
13 Let $i_{t+1}$ be the largest index such that $\omega(\{x_{(1)}, \ldots, x_{(i_{t+1})}\}) \leq \omega(\mathcal{B}^{(t)}_{v'})/2$;
14 $\mathcal{B}^{(t+1)}_{v'} \leftarrow \{x_{(1)}, \ldots, x_{(i_{t+1})}\}$; // cut down the alive arm set to half dimension
15 $t \leftarrow t + 1$;
16 return $\mathcal{B}^{(t)}_1, \ldots, \mathcal{B}^{(t)}_V$;

5.1 Algorithm CoopKernelFB
5.1.1 Algorithm. CoopKernelFB consists of three key steps: (i) pre-determine the numbers of phases and samples according to the data dimension; (ii) maintain alive arm sets for all agents and plan a globally optimal sample allocation; (iii) communicate observation information and cut down the alive arms to half in the dimension sense.

The procedure of CoopKernelFB is given in Algorithm 2. During initialization, we determine the number of phases $R$ and the number of samples for each phase $N$ according to the principle dimension $\omega(\tilde{\mathcal{X}})$ (Line 1), defined as
$$\omega(\tilde{\mathcal{S}}) = \min_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{S}}} \|\phi(x_i) - \phi(x_j)\|^2_{(\xi^* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top)^{-1}}, \quad \forall \tilde{\mathcal{S}} \subseteq \tilde{\mathcal{X}},$$
i.e., the principle dimension of the data projections of $\tilde{\mathcal{S}}$ into the RKHS. In each phase $t$, each agent $v$ maintains alive arm sets $\mathcal{B}^{(t)}_{v'}$ for all agents $v' \in [V]$, and solves an integrated optimization to obtain a globally optimal sample allocation $\lambda^*_t$ (Line 3). Then, she generates a sample sequence $(s^{(t)}_1, \ldots, s^{(t)}_N)$ according to $\lambda^*_t$, and selects the sub-sequence $\tilde{s}^{(t)}_v$ that only contains her available arms to perform sampling (Lines 4-5). During communication, she only sends and receives the number of samples $N^{(t)}_{v,i}$ and the average observed reward $\bar{y}^{(t)}_{v,i}$ for each arm to and from the other agents (Lines 7-8). Using the shared information, she estimates the rewards of the alive arms and only selects the best half of them, in the dimension sense, to enter the next phase (Lines 11-14).
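The phase schedule (Line 1) and one elimination step (Lines 12-14) can be sketched as below; the `omega` argument stands in for the principle dimension $\omega(\cdot)$, and we pass plain set size here purely for illustration (the real algorithm solves the min-max optimization above).

```python
import math

def fb_schedule(T, V, omega_X):
    """Line 1 of Algorithm 2: number of phases R and per-phase budget N."""
    R = math.ceil(math.log2(omega_X))
    N = (T * V) // R
    return R, N

def halve(alive, f_hat, omega):
    """Lines 12-14: sort alive arms by estimated reward and keep the largest
    prefix whose principle dimension is at most half of omega(alive)."""
    ranked = sorted(alive, key=f_hat, reverse=True)
    target = omega(alive) / 2
    keep = 1  # always keep at least the empirically best arm
    for i in range(1, len(ranked) + 1):
        if omega(ranked[:i]) <= target:
            keep = i
    return ranked[:keep]

R, N = fb_schedule(T=100, V=5, omega_X=16)   # R = 4 phases, N = 125 samples each
survivors = halve([0, 1, 2, 3], f_hat=lambda x: -x, omega=len)
# survivors == [0, 1]: the better half under the (toy) reward estimate
```

With the true $\omega$, halving the dimension rather than the cardinality is what yields the $O(\log(\omega(\tilde{\mathcal{X}})))$ round count.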

5.1.2 Computation and Communication Efficiency. CoopKernelFB also adopts the efficient kernelized optimization solver (Eqs. (6), (7)) to solve the min-max optimization in Line 3, and employs the kernelized estimator (Eq. (4)) to estimate the rewards in Line 11. Moreover, CoopKernelFB only spends $\mathrm{Poly}(nV)$ computation time and $O(nV)$-bit communication costs.
5.2 Theoretical performance of CoopKernelFB
We present the error probability of CoopKernelFB in the following theorem, where $\lambda_u = \frac{1}{nV} \mathbf{1}$.

Theorem 4 (Fixed-Budget Upper Bound). Suppose $\omega(\{x^*_v, x\}) \geq 1$ for any $x \in \tilde{\mathcal{X}}_v \setminus \{x^*_v\}$, $v \in [V]$, and $T = \Omega(\rho^* \log(\omega(\tilde{\mathcal{X}})))$, and that the regularization parameter $\xi^* > 0$ satisfies $\sqrt{\xi^*} \max_{\tilde{x}_i, \tilde{x}_j \in \tilde{\mathcal{X}}_v, v \in [V]} \|\phi(x_i) - \phi(x_j)\|_{A(\xi^*, \lambda_u)^{-1}} \leq \frac{\Delta_{\min}}{2(1+\varepsilon)B}$. With at most $T$ samples per agent, CoopKernelFB returns the correct answers $x^*_v$ for all $v \in [V]$, with error probability
$$Err = O\Big( n^2 V \log(\omega(\tilde{\mathcal{X}})) \cdot \exp\Big( -\frac{TV}{\rho^* \log(\omega(\tilde{\mathcal{X}}))} \Big) \Big)$$
and communication rounds $O(\log(\omega(\tilde{\mathcal{X}})))$.

Remark 5. Theorem 4 implies that, to guarantee an error probability $\delta$, CoopKernelFB only requires $O\big( \frac{\rho^* \log(\omega(\tilde{\mathcal{X}}))}{V} \log\big( \frac{n^2 V \log(\omega(\tilde{\mathcal{X}}))}{\delta} \big) \big)$ samples, which matches the sample complexity lower bound (Theorem 2) up to logarithmic factors. In addition, CoopKernelFB attains the maximum $V$-speedup for fully-collaborative instances with only logarithmic communication rounds, which also matches the round-speedup lower bound (Theorem 5) within double logarithmic factors.

Technical Novelty in Error Probability Analysis. Our analysis extends the prior single-agent analysis [23] to the multi-agent setting. The single-agent analysis in [23] only uses a single universal Gaussian-process concentration bound. Instead, we establish novel estimate concentrations and high-probability events for each arm pair and each agent to handle the distributed environment, and build a connection between the principle dimension $\omega(\mathcal{B}^{(t)}_v)$ and the problem hardness $\rho^*$ via the elimination rules (Lines 13-14) to guarantee identification correctness.
Interpretation. Similar to Corollary 1, we can also interpret the error probability result with the standard tools of maximum information gain and effective dimension in kernel bandits [14, 35, 40], and decompose the error probability into two components, from task similarities and arm features.

Corollary 2. The error probability of algorithm CoopKernelFB, denoted by $Err$, can also be bounded as follows:
(a) $Err = O\Big( \exp\Big( -\frac{TV \Delta_{\min}^2}{\Upsilon \log(\omega(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(\omega(\tilde{\mathcal{X}})) \Big)$, where $\Upsilon$ is the maximum information gain.
(b) $Err = O\Big( \exp\Big( -\frac{TV \Delta_{\min}^2}{d_{\mathrm{eff}} \log\big( nV \cdot \big( 1 + \frac{\operatorname{Trace}(K_{\lambda^*})}{\xi^* d_{\mathrm{eff}}} \big) \big) \log(\omega(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(\omega(\tilde{\mathcal{X}})) \Big)$, where $d_{\mathrm{eff}}$ is the effective dimension.
(c) $Err = O\Big( \exp\Big( -\frac{TV \Delta_{\min}^2}{\operatorname{rank}(K_z) \cdot \operatorname{rank}(K_x) \log\big( \frac{\operatorname{Trace}(I + (\xi^*)^{-1} K_{\lambda^*})}{\operatorname{rank}(K_{\lambda^*})} \big) \log(\omega(\tilde{\mathcal{X}}))} \Big) \cdot n^2 V \log(\omega(\tilde{\mathcal{X}})) \Big)$.


Remark 6. Corollaries 2(a) and 2(b) bound the error probability by the maximum information gain and the effective dimension, respectively, which capture essential structures of the tasks and arm features and only depend on the effective dimension of the kernelized feature space. Furthermore, Corollary 2(c) exhibits how task similarities influence the error probability. For example, in the fully-collaborative case where $\operatorname{rank}(K_z) = 1$, the error probability enjoys an exponential decay factor of $V$ compared to conventional single-agent results [23] (i.e., achieves a $V$-speedup). Conversely, when the tasks are totally different, with $\operatorname{rank}(K_z) = V$, the error probability degenerates to the conventional single-agent results [23], since in this case information sharing brings no benefit.

5.3 Lower Bound for Fixed-Budget Setting
In this subsection, we establish a round-speedup lower bound for the FB setting.
Theorem 5 (Fixed-Budget Round-Speedup Lower Bound). There exists a fully-collaborative instance of the fixed-budget CoPE-KB problem with multi-armed and linear reward structures for which, given any $\beta \in [\frac{V}{\log(\omega(\tilde{\mathcal{X}}))}, V]$, a $\beta$-speedup distributed algorithm $\mathcal{A}$ must use
$$\Omega\Big( \frac{\log(\omega(\tilde{\mathcal{X}}))}{\log(\frac{V}{\beta}) + \log\log(\omega(\tilde{\mathcal{X}}))} \Big)$$
communication rounds in expectation. In particular, when $\beta = V$, $\mathcal{A}$ must use $\Omega\big( \frac{\log(\omega(\tilde{\mathcal{X}}))}{\log\log(\omega(\tilde{\mathcal{X}}))} \big)$ communication rounds in expectation.

Remark 7. Theorem 5 shows that, under the FB setting, achieving the full speedup requires at least logarithmically many communication rounds with respect to the principle dimension $\omega(\tilde{\mathcal{X}})$, which validates the communication optimality of CoopKernelFB. In the degenerate case where all agents solve the same non-structured pure exploration problem, as in the prior classic MAB setting [20, 38], both our upper (Theorem 4) and lower (Theorem 5) bounds match the state-of-the-art results in [38].

Novel Analysis for Fixed-Budget Round-Speedup Lower Bound. Different from the FC setting, here we adapt the proof idea of prior limited-adaptivity work [1] to establish a non-trivial lower bound analysis under Bayesian environments, and perform instance transformation by changing the data dimension instead of tuning the reward gaps. In our analysis, we employ novel techniques to calculate the information entropy and support size of the posterior reward distributions, in order to build an induction among different rounds and derive the required number of communication rounds.

6 EXPERIMENTS
In this section, we conduct experiments to validate the empirical performance of our algorithms. In our experiments, we set $V = 5$, $d = 4$, $n = 6$, $\delta = 0.005$ and $\phi(x) = I x$ for any $x \in \tilde{\mathcal{X}}$. The entries of $\theta^*$ form an arithmetic sequence that starts from $0.1$ with common difference $\Delta_{\min}$, i.e., $\theta^* = [0.1, 0.1 + \Delta_{\min}, \ldots, 0.1 + (d-1)\Delta_{\min}]^\top$. For the FC setting, we vary the gap $\Delta_{\min} \in [0.1, 0.8]$ to generate different instances (points), and run 50 independent simulations to plot the average sample complexity with 95% confidence intervals. For the FB setting, we change the budget $T \in [7000, 300000]$ to obtain different instances, and perform 100 independent runs to show the error probability across runs. The specific values of the gap $\Delta_{\min}$ and the budget $T$ are shown on the x-axes of the figures.
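The instance construction above is simple to reproduce; with the identity feature map, the expected reward of an arm $x$ is just $\theta^{*\top} x$. The following is a minimal sketch of the setup (the unit-vector arm is an illustrative choice, not the full experiment harness):

```python
import numpy as np

d, delta_min = 4, 0.1
# theta*: arithmetic sequence starting at 0.1 with common difference Delta_min
theta_star = 0.1 + delta_min * np.arange(d)

# with phi(x) = I x, the expected reward of an arm x is theta*^T x
arm = np.eye(d)[3]                  # a unit-vector arm (illustrative choice)
reward = float(theta_star @ arm)    # 0.1 + 3 * Delta_min
```

Increasing $\Delta_{\min}$ widens all pairwise reward gaps simultaneously, which is what sweeps the hardness of each FC instance.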

Fixed-Confidence. In the FC setting (Figures 2(a)-2(c)), we compare CoopKernelFC with five baselines: CoopKernel-IndAlloc is an ablation variant of CoopKernelFC which individually calculates sample allocations for different agents; IndRAGE [19], IndALBA [34] and IndPolyALBA [15] are single-agent algorithms, which use $V$ copies of the single-agent RAGE [19],

Collaborative Pure Exploration in Kernel Bandit Woodstock โ€™21, June 03โ€“05, 2021, Woodstock, NY

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50Gap

103

104

105

106

107Sa

mpl

e Co

mpl

exity

CoopKernel (ours)CoopKernel-IndAllocIndRAGEIndRAGE-VIndALBAIndPolyALBA

(a) FC, ๐พZ = 1 (fully-collaborative)

0.10 0.15 0.20 0.25 0.30Gap

103

104

105

106

Sam

ple

Com

plex

ity

CoopKernel (ours)CoopKernel-IndAllocIndRAGEIndRAGE-VIndALBAIndPolyALBA

(b) FC, 1 < ๐พZ < ๐‘‰

0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70Gap

103

104

105

106

Sam

ple

Com

plex

ity

CoopKernel (ours)CoopKernel-IndAllocIndRAGEIndRAGE-VIndALBAIndPolyALBA

(c) FC, ๐พZ = ๐‘‰

5000 5500 6000 6500 7000 7500 8000 8500 9000Budget

0.00.10.20.30.40.50.60.70.80.91.0

Erro

r Pro

babi

lity

CoopKernelFB (ours)CoopKernelFB-IndAllocIndRAGE-FBIndUniformFB

(d) FB, ๐พZ = 1 (fully-collaborative)

5000 5500 6000 6500 7000 7500 8000 8500 9000Budget

0.00.10.20.30.40.50.60.70.80.91.0

Erro

r Pro

babi

lity

CoopKernelFB (ours)CoopKernelFB-IndAllocIndRAGE-FBIndUniformFB

(e) FB, 1 < ๐พZ < ๐‘‰

6000 6500 7000 7500 8000 8500 9000 950010000Budget

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Erro

r Pro

babi

lity

CoopKernelFB (ours)CoopKernelFB-IndAllocIndRAGE-FBIndUniformFB

(f) FB, ๐พZ = ๐‘‰

Fig. 2. Experimental results for FC and FB settings.

ALBA [34] and PolyALBA [15] policies to solve the $V$ tasks independently; IndRAGE/$V$ is a $V$-speedup baseline, which divides the sample complexity of the best single-agent algorithm, IndRAGE, by $V$. One can see that CoopKernelFC achieves the best sample complexity in Figures 2(a) and 2(b), which demonstrates the effectiveness of our sample allocation and cooperation scheme. Moreover, the empirical results also reflect the impact of task similarities on learning speedup, and are consistent with our theoretical analysis. Specifically, in the fully-collaborative case (Figure 2(a)), CoopKernelFC matches IndRAGE/$V$, since it attains the $V$-speedup; in the intermediate case ($1 < K_Z < V$, Figure 2(b)), the curve of CoopKernelFC lies between IndRAGE/$V$ and IndRAGE, since it achieves less than a $V$-speedup due to the decrease in task similarity; in the totally-different-task case ($K_Z = V$, Figure 2(c)), CoopKernelFC performs similarly to the single-agent algorithm IndRAGE, since information sharing among agents brings no advantage in this case.

Fixed-Budget. In the FB setting (Figures 2(d)-2(f)), we compare CoopKernelFB with three baselines: CoopKernelFB-IndAlloc is an ablation variant of CoopKernelFB where agents calculate and use different sample allocations; IndPeaceFB [23] and IndUniformFB solve the $V$ tasks independently by calling $V$ copies of the single-agent PeaceFB [23] and uniform sampling policies, respectively. As shown in Figures 2(d) and 2(e), our CoopKernelFB enjoys a lower error probability than all other algorithms. In addition, these empirical results also validate the influence of task similarities on learning performance, and match our theoretical analysis. Specifically, as the task similarity decreases from Figure 2(d) to Figure 2(f), the error probability of CoopKernelFB gets closer to that of the single-agent baseline IndRAGE-FB, due to the slow-down of its learning speedup.

7 CONCLUSION
In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit (CoPE-KB) problem with Fixed-Confidence (FC) and Fixed-Budget (FB) settings. CoPE-KB aims to coordinate multiple agents to identify the best arms under general reward functions. We design two computation- and communication-efficient algorithms, CoopKernelFC and CoopKernelFB, based on novel kernelized estimators. Matching upper and lower bounds are established to demonstrate the statistical and communication optimality of our algorithms. Our theoretical results explicitly characterize the impacts of task similarities on learning speedup and avoid heavy dependency on the high dimension of the kernelized feature space. In our analysis, we also develop novel analytical techniques, including data dimension decomposition, linear structured instance transformation and (communication) round-speedup induction, which are applicable to other bandit problems and can be of independent interest.

REFERENCES

[1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. 2017. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory. PMLR, 39-75.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. 2021. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming 186, 1 (2021), 439-478.
[3] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. 2010. Best arm identification in multi-armed bandits. In Conference on Learning Theory. Citeseer, 41-53.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235-256.
[5] Ilai Bistritz and Amir Leshem. 2018. Distributed Multi-Player Bandits: a Game of Thrones Approach. In Advances in Neural Information Processing Systems. 7222-7232.
[6] Sébastien Bubeck and Thomas Budzinski. 2020. Coordination without communication: optimal regret in two players multi-armed bandits. In Conference on Learning Theory. PMLR, 916-939.
[7] Sébastien Bubeck, Thomas Budzinski, and Mark Sellke. 2021. Cooperative and stochastic multi-player multi-armed bandit: Optimal regret with neither communication nor collisions. In Conference on Learning Theory. PMLR, 821-822.
[8] Sébastien Bubeck and Nicolo Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Machine Learning 5, 1 (2012), 1-122.
[9] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. 2013. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning. PMLR, 258-265.
[10] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. 2021. High-Dimensional Experimental Design and Kernel Bandits. In International Conference on Machine Learning. PMLR, 1227-1237.
[11] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. 2017. Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits. In Proceedings of the International Joint Conference on Artificial Intelligence. 164-170.
[12] Lijie Chen, Jian Li, and Mingda Qiao. 2017. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory. PMLR, 535-592.
[13] Sayak Ray Chowdhury and Aditya Gopalan. 2017. On kernelized multi-armed bandits. In International Conference on Machine Learning. PMLR, 844-853.
[14] Aniket Anand Deshmukh, Urun Dogan, and Clayton Scott. 2017. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems. 4851-4859.
[15] Yihan Du, Yuko Kuroki, and Wei Chen. 2021. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 7262-7270.
[16] Abhimanyu Dubey et al. 2020. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning. 2740-2750.
[17] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. The Journal of Machine Learning Research 20, 1 (2019), 1997-2017.
[18] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. 2006. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research 7, 6 (2006).
[19] Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. 2019. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems 32 (2019), 10667-10677.
[20] Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. 2013. Distributed Exploration in Multi-Armed Bandits. In Advances in Neural Information Processing Systems, Vol. 26. 854-862.
[21] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. 2012. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, Vol. 12. 655-662.
[22] Nikolai Karpov, Qin Zhang, and Yuan Zhou. 2020. Collaborative top distribution identifications with limited interaction. In IEEE 61st Annual Symposium on Foundations of Computer Science. 160-171.
[23] Julian Katz-Samuels, Lalit Jain, Kevin G. Jamieson, et al. 2020. An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits. In Advances in Neural Information Processing Systems, Vol. 33.
[24] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. 2016. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research 17, 1 (2016), 1-42.
[25] Nathan Korda, Balazs Szorenyi, and Shuai Li. 2016. Distributed clustering of linear bandits in peer to peer networks. In International Conference on Machine Learning. PMLR, 1301-1309.
[26] Andreas Krause and Cheng Soon Ong. 2011. Contextual Gaussian Process Bandit Optimization. In Advances in Neural Information Processing Systems. 2447-2455.
[27] Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1 (1985), 4-22.
[28] Tor Lattimore and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press.
[29] Keqin Liu and Qing Zhao. 2010. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58, 11 (2010), 5667-5681.
[30] Zhenhua Liu, Minghong Lin, Adam Wierman, Steven H. Low, and Lachlan L. H. Andrew. 2011. Geographical load balancing with renewables. 39, 3 (2011), 62-66.
[31] Trung T. Nguyen. 2021. On the Edge and Cloud: Recommendation Systems with Distributed Machine Learning. In 2021 International Conference on Information Technology (ICIT). 929-934. https://doi.org/10.1109/ICIT52682.2021.9491121
[32] Jonathan Rosenski, Ohad Shamir, and Liran Szlak. 2016. Multi-player bandits: a musical chairs approach. In International Conference on Machine Learning. PMLR, 155-163.
[33] Bernhard Schölkopf, Alexander J. Smola, Francis Bach, et al. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[34] Marta Soare, Alessandro Lazaric, and Rémi Munos. 2014. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, Vol. 27. 828-836.
[35] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. 2010. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.
[36] Balazs Szorenyi, Róbert Busa-Fekete, István Hegedus, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. 2013. Gossip-based distributed stochastic bandit algorithms. In International Conference on Machine Learning. PMLR, 19-27.
[37] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. 2013. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 1587-1594.
[38] Chao Tao, Qin Zhang, and Yuan Zhou. 2019. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In IEEE 60th Annual Symposium on Foundations of Computer Science. 126-146.
[39] William R. Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285-294.
[40] Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. 2013. Finite-Time Analysis of Kernelised Contextual Bandits. In Uncertainty in Artificial Intelligence.
[41] Sofía S. Villar, Jack Bowden, and James Wason. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: A Review Journal of the Institute of Mathematical Statistics 30, 2 (2015), 199.
[42] Grace Wahba. 1990. Spline Models for Observational Data. SIAM.
[43] Yinglun Zhu, Dongruo Zhou, Ruoxi Jiang, Quanquan Gu, Rebecca Willett, and Robert Nowak. 2021. Pure Exploration in Kernel and Neural Bandits. arXiv preprint arXiv:2106.12034 (2021).
[44] Ling Zhuo, Cho-Li Wang, and Francis C. M. Lau. 2003. Document replication and distribution in extensible geographically distributed web servers. J. Parallel and Distrib. Comput. 63, 10 (2003), 927-944.


APPENDIX

A PROOFS FOR THE FIXED-CONFIDENCE SETTING

A.1 Kernelized Computations in Algorithm CoopKernelFC

Kernelized Condition for Regularization Parameter $\xi_t$. We first introduce how to compute the condition Eq. (3) for the regularization parameter $\xi_t$ in Line 4 of Algorithm 1.

For any $\lambda \in \triangle_{\tilde{\mathcal{X}}}$, let $\Phi_\lambda = [\sqrt{\lambda_1}\,\phi(x_1)^\top; \dots; \sqrt{\lambda_{nV}}\,\phi(x_{nV})^\top]$ and $K_\lambda = \Phi_\lambda \Phi_\lambda^\top = [\sqrt{\lambda_i \lambda_j}\, K(x_i, x_j)]_{i,j \in [nV]}$. For any $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ and $x \in \tilde{\mathcal{X}}$, let $k_\lambda(x) = \Phi_\lambda \phi(x) = [\sqrt{\lambda_1} K(x, x_1), \dots, \sqrt{\lambda_{nV}} K(x, x_{nV})]^\top$. Since
$$\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)\phi(x) = \xi_t \phi(x) + \Phi_\lambda^\top k_\lambda(x)$$
for any $x \in \tilde{\mathcal{X}}$, we have
$$\phi(x) = \xi_t\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\phi(x) + \left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\Phi_\lambda^\top k_\lambda(x) = \xi_t\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\phi(x) + \Phi_\lambda^\top\left(\xi_t I + K_\lambda\right)^{-1}k_\lambda(x).$$
Thus,
$$\phi(x_i) - \phi(x_j) = \xi_t\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\left(\phi(x_i) - \phi(x_j)\right) + \Phi_\lambda^\top\left(\xi_t I + K_\lambda\right)^{-1}\left(k_\lambda(x_i) - k_\lambda(x_j)\right).$$
Multiplying both sides by $\left(\phi(x_i) - \phi(x_j)\right)^\top$, we have
$$\left(\phi(x_i) - \phi(x_j)\right)^\top\left(\phi(x_i) - \phi(x_j)\right) = \xi_t\left(\phi(x_i) - \phi(x_j)\right)^\top\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\left(\phi(x_i) - \phi(x_j)\right) + \left(k_\lambda(x_i) - k_\lambda(x_j)\right)^\top\left(\xi_t I + K_\lambda\right)^{-1}\left(k_\lambda(x_i) - k_\lambda(x_j)\right).$$
Thus,
$$\begin{aligned}
\|\phi(x_i) - \phi(x_j)\|^2_{(\xi_t I + \Phi_\lambda^\top \Phi_\lambda)^{-1}} &= \left(\phi(x_i) - \phi(x_j)\right)^\top\left(\xi_t I + \Phi_\lambda^\top \Phi_\lambda\right)^{-1}\left(\phi(x_i) - \phi(x_j)\right)\\
&= \xi_t^{-1}\left(\phi(x_i) - \phi(x_j)\right)^\top\left(\phi(x_i) - \phi(x_j)\right) - \xi_t^{-1}\left(k_\lambda(x_i) - k_\lambda(x_j)\right)^\top\left(\xi_t I + K_\lambda\right)^{-1}\left(k_\lambda(x_i) - k_\lambda(x_j)\right)\\
&= \xi_t^{-1}\left(K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)\right) - \xi_t^{-1}\,\|k_\lambda(x_i) - k_\lambda(x_j)\|^2_{(\xi_t I + K_\lambda)^{-1}}.
\end{aligned}$$
Let $\lambda^u = \frac{1}{nV}\mathbf{1}$ be the uniform distribution on $\tilde{\mathcal{X}}$. Then, the condition Eq. (3) for the regularization parameter $\xi_t$,
$$\sqrt{\xi_t}\,\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{X}}_v,\,v\in[V]}\|\phi(x_i) - \phi(x_j)\|_{\left(\xi_t I + \sum_{x\in\tilde{\mathcal{X}}}\frac{1}{nV}\phi(x)\phi(x)^\top\right)^{-1}} \le \frac{1}{(1+\varepsilon)B\cdot 2^{t+1}},$$
is equivalent to the following efficient kernelized statement:
$$\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{X}}_v,\,v\in[V]}\sqrt{\left(K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)\right) - \|k_{\lambda^u}(x_i) - k_{\lambda^u}(x_j)\|^2_{(\xi_t I + K_{\lambda^u})^{-1}}} \le \frac{1}{(1+\varepsilon)B\cdot 2^{t+1}}.$$

Kernelized Optimization Solver. Now we present the efficient kernelized solver for the following convex optimization in Line 4 of Algorithm 1:
$$\min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\ \max_{\tilde{x}_i,\tilde{x}_j\in\mathcal{B}_v^{(t)},\,v\in[V]}\|\phi(x_i) - \phi(x_j)\|^2_{A(\xi_t,\lambda)^{-1}}, \qquad (10)$$
where $A(\xi,\lambda) = \xi I + \sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda_{\tilde{x}}\,\phi(x)\phi(x)^\top$ for any $\xi > 0$, $\lambda\in\triangle_{\tilde{\mathcal{X}}}$.

Define the function $h(\lambda) = \max_{\tilde{x}_i,\tilde{x}_j\in\mathcal{B}_v^{(t)},\,v\in[V]}\|\phi(x_i) - \phi(x_j)\|^2_{A(\xi_t,\lambda)^{-1}}$, and let $x_i^*(\lambda), x_j^*(\lambda)$ denote the optimal solution of $h(\lambda)$. Then, the gradient of $h(\lambda)$ with respect to $\lambda$ is
$$[\nabla_\lambda h(\lambda)]_{\tilde{x}} = -\left(\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top A(\xi_t,\lambda)^{-1}\phi(x)\right)^2, \quad \forall x\in\tilde{\mathcal{X}}. \qquad (11)$$
Next, we show how to efficiently compute the gradient $[\nabla_\lambda h(\lambda)]_{\tilde{x}}$ with the kernel function $K(\cdot,\cdot)$. Since $\left(\xi_t I + \Phi_\lambda^\top\Phi_\lambda\right)\phi(x) = \xi_t\phi(x) + \Phi_\lambda^\top k_\lambda(x)$ for any $x\in\tilde{\mathcal{X}}$, we have
$$\phi(x) = \xi_t\left(\xi_t I + \Phi_\lambda^\top\Phi_\lambda\right)^{-1}\phi(x) + \Phi_\lambda^\top\left(\xi_t I + K_\lambda\right)^{-1}k_\lambda(x).$$
Multiplying both sides by $\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top$, we have
$$\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top\phi(x) = \xi_t\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top\left(\xi_t I + \Phi_\lambda^\top\Phi_\lambda\right)^{-1}\phi(x) + \left(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\right)^\top\left(\xi_t I + K_\lambda\right)^{-1}k_\lambda(x).$$
Then,
$$\begin{aligned}
\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top\left(\xi_t I + \Phi_\lambda^\top\Phi_\lambda\right)^{-1}\phi(x) &= \xi_t^{-1}\left(\phi(x_i^*(\lambda)) - \phi(x_j^*(\lambda))\right)^\top\phi(x) - \xi_t^{-1}\left(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\right)^\top\left(\xi_t I + K_\lambda\right)^{-1}k_\lambda(x)\\
&= \xi_t^{-1}\left(K(x_i^*(\lambda),x) - K(x_j^*(\lambda),x) - \left(k_\lambda(x_i^*(\lambda)) - k_\lambda(x_j^*(\lambda))\right)^\top\left(\xi_t I + K_\lambda\right)^{-1}k_\lambda(x)\right). \qquad (12)
\end{aligned}$$
Therefore, we can compute the gradient $\nabla_\lambda h(\lambda)$ (Eq. (11)) using the equivalent kernelized expression Eq. (12), and the optimization (Eq. (10)) can then be efficiently solved by projected gradient descent.
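A minimal sketch of this loop on a toy instance (our own illustration, not the paper's code; the features, step size and iteration count are arbitrary, and explicit features stand in for the kernelized gradient of Eq. (12)) might look like:

```python
import numpy as np

# Sketch: minimize h(lam) = max over pairs of the A(xi, lam)^{-1}-weighted
# squared distance, over the probability simplex, by projected (sub)gradient
# descent with the gradient form of Eq. (11).

rng = np.random.default_rng(1)
n, d, xi = 5, 3, 0.1
X = rng.normal(size=(n, d))                       # rows are phi(x), toy features
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

def h_and_grad(lam):
    A = xi * np.eye(d) + X.T @ (lam[:, None] * X)  # A(xi, lambda)
    Ainv = np.linalg.inv(A)
    vals = [(X[i] - X[j]) @ Ainv @ (X[i] - X[j]) for i, j in pairs]
    i, j = pairs[int(np.argmax(vals))]             # maximizing pair x_i^*, x_j^*
    grad = -(X @ (Ainv @ (X[i] - X[j]))) ** 2      # Eq. (11), one entry per arm
    return max(vals), grad

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based method)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

lam = np.full(n, 1.0 / n)                          # start from the uniform allocation
h0, _ = h_and_grad(lam)
best = h0
for _ in range(300):
    h_val, grad = h_and_grad(lam)
    best = min(best, h_val)
    lam = project_simplex(lam - 0.05 * grad)       # projected (sub)gradient step
```

In the paper's setting the pair maximization runs over the active sets $\mathcal{B}_v^{(t)}$ and the inner product with $A(\xi_t,\lambda)^{-1}$ is evaluated via the kernelized expression Eq. (12); the toy Euclidean features here only illustrate the projected-gradient loop itself.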

Innovative Kernelized Estimator. Finally, we explicate the innovative kernelized estimator of reward gaps in Line 13 of Algorithm 1, which plays an important role in boosting computation and communication efficiency.

Let $\hat{\theta}_t$ denote the minimizer of the following regularized least squares loss function:
$$\mathcal{L}(\theta) = N^{(t)}\xi_t\|\theta\|^2 + \sum_{j=1}^{N^{(t)}}\left(y_j - \phi(s_j)^\top\theta\right)^2.$$
Setting the derivative of $\mathcal{L}(\theta)$ to zero, we have
$$N^{(t)}\xi_t\hat{\theta}_t + \sum_{j=1}^{N^{(t)}}\phi(s_j)\phi(s_j)^\top\hat{\theta}_t = \sum_{j=1}^{N^{(t)}}\phi(s_j)y_j.$$
Grouping the summation by arm, we obtain
$$N^{(t)}\xi_t\hat{\theta}_t + \left(\sum_{i=1}^{nV}N_i^{(t)}\phi(x_i)\phi(x_i)^\top\right)\hat{\theta}_t = \sum_{i=1}^{nV}N_i^{(t)}\phi(x_i)y_i^{(t)}, \qquad (13)$$
where $N_i^{(t)}$ is the number of samples and $y_i^{(t)}$ is the average observation on arm $x_i$ for any $i\in[nV]$. Let $\Phi_t = [\sqrt{N_1^{(t)}}\,\phi(x_1)^\top;\dots;\sqrt{N_{nV}^{(t)}}\,\phi(x_{nV})^\top]$, $K^{(t)} = \Phi_t\Phi_t^\top = [\sqrt{N_i^{(t)}N_j^{(t)}}\,K(x_i,x_j)]_{i,j\in[nV]}$ and $y^{(t)} = [\sqrt{N_1^{(t)}}\,y_1^{(t)},\dots,\sqrt{N_{nV}^{(t)}}\,y_{nV}^{(t)}]^\top$. Then, we can write Eq. (13) as
$$\left(N^{(t)}\xi_t I + \Phi_t^\top\Phi_t\right)\hat{\theta}_t = \Phi_t^\top y^{(t)}.$$
Since $\left(N^{(t)}\xi_t I + \Phi_t^\top\Phi_t\right)\succ 0$ and $\left(N^{(t)}\xi_t I + \Phi_t\Phi_t^\top\right)\succ 0$,
$$\hat{\theta}_t = \left(N^{(t)}\xi_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top y^{(t)} = \Phi_t^\top\left(N^{(t)}\xi_t I + \Phi_t\Phi_t^\top\right)^{-1}y^{(t)} = \Phi_t^\top\left(N^{(t)}\xi_t I + K^{(t)}\right)^{-1}y^{(t)}.$$
Let $k_t(x) = \Phi_t\phi(x) = [\sqrt{N_1^{(t)}}\,K(x,x_1),\dots,\sqrt{N_{nV}^{(t)}}\,K(x,x_{nV})]^\top$ for any $x\in\mathcal{X}$. Then, we obtain the efficient kernelized estimators of $f(x_i)$ and $f(x_i)-f(x_j)$ as
$$\hat{f}(x_i) = \phi(x_i)^\top\hat{\theta}_t = k_t(x_i)^\top\left(N^{(t)}\xi_t I + K^{(t)}\right)^{-1}y^{(t)}, \qquad \hat{\Delta}(x_i,x_j) = \left(k_t(x_i)-k_t(x_j)\right)^\top\left(N^{(t)}\xi_t I + K^{(t)}\right)^{-1}y^{(t)}.$$

A.2 Proof of Theorem 1

Our proof of Theorem 1 adapts the analysis procedure of [19, 23] to the multi-agent setting.

For any _ โˆˆ โ–ณหœX , let ฮฆ_ = [

โˆš_1๐œ™ (๐‘ฅโˆ—๐‘ฃ )โŠค; . . . ;

โˆš๏ธ_๐‘›๐‘‰๐œ™ (๐‘ฅ๐‘›๐‘‰ )โŠค] and ฮฆโŠค

_ฮฆ_ =

โˆ‘๏ฟฝฬƒ๏ฟฝ โˆˆ หœX _๏ฟฝฬƒ๏ฟฝ๐œ™ (๐‘ฅ)๐œ™ (๐‘ฅ)

โŠค. In order to prove

Theorem 1, we first introduce the following lemmas.

Lemma 1 (Concentration). Defining the event
$$\mathcal{G} = \left\{\left|\left(\hat{f}_t(x_i)-\hat{f}_t(x_j)\right)-\left(f(x_i)-f(x_j)\right)\right| < (1+\varepsilon)\cdot\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\cdot\left(\sqrt{\frac{2\log\left(2n^2V/\delta_t\right)}{N_t}} + \sqrt{\xi_t}\,\|\theta^*\|_2\right) \le 2^{-t},\ \forall x_i,x_j\in\mathcal{B}_v^{(t)},\ \forall v\in[V],\ \forall t\ge 1\right\},$$
we have
$$\Pr\left[\mathcal{G}\right] \ge 1-\delta.$$

Proof of Lemma 1. Let $\hat{\theta}_t$ be the regularized least squares estimator of $\theta^*$ with samples $x_1,\dots,x_{N_t}$, and let $\gamma_t = N_t\xi_t$. Recall that $\Phi_t = [\sqrt{N_1^{(t)}}\,\phi(x_1)^\top;\dots;\sqrt{N_{nV}^{(t)}}\,\phi(x_{nV})^\top]$ and $\Phi_t^\top\Phi_t = \sum_{i=1}^{nV}N_i^{(t)}\phi(x_i)\phi(x_i)^\top$. In addition, $y^{(t)} = [\sqrt{N_1^{(t)}}\,y_1^{(t)},\dots,\sqrt{N_{nV}^{(t)}}\,y_{nV}^{(t)}]^\top$ and
$$\hat{\theta}_t = \left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top y^{(t)}.$$
Let $\bar{\eta}^{(t)} = [\sqrt{N_1^{(t)}}\,\bar{\eta}_1^{(t)},\dots,\sqrt{N_{nV}^{(t)}}\,\bar{\eta}_{nV}^{(t)}]^\top$, where $\bar{\eta}_i^{(t)} = y_i^{(t)} - \phi(x_i)^\top\theta^*$ denotes the average noise of the $N_i^{(t)}$ pulls on arm $x_i$ for any $i\in[nV]$. Then,
$$\begin{aligned}
\left(\hat{f}_t(x_i)-\hat{f}_t(x_j)\right)-\left(f(x_i)-f(x_j)\right) &= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\hat{\theta}_t-\theta^*\right)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top y^{(t)} - \theta^*\right)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\left(\Phi_t\theta^* + \bar{\eta}^{(t)}\right) - \theta^*\right)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\Phi_t\theta^* + \left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\bar{\eta}^{(t)} - \theta^*\right)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\left(\Phi_t^\top\Phi_t + \gamma_t I\right)\theta^* + \left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\bar{\eta}^{(t)} - \theta^* - \gamma_t\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\theta^*\right) \qquad (14)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\bar{\eta}^{(t)} - \gamma_t\left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\theta^*. \qquad (15)
\end{aligned}$$
Since the mean of the first term is zero and its variance is bounded by
$$\begin{aligned}
\left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\Phi_t\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\left(\phi(x_i)-\phi(x_j)\right) &\le \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\left(\gamma_t I + \Phi_t^\top\Phi_t\right)\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\left(\phi(x_i)-\phi(x_j)\right)\\
&= \left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\left(\phi(x_i)-\phi(x_j)\right) = \|\phi(x_i)-\phi(x_j)\|^2_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}},
\end{aligned}$$
using the Hoeffding inequality, we have that with probability at least $1 - \frac{\delta_t}{n^2V}$,
$$\left|\left(\phi(x_i)-\phi(x_j)\right)^\top\left(\gamma_t I + \Phi_t^\top\Phi_t\right)^{-1}\Phi_t^\top\bar{\eta}^{(t)}\right| < \|\phi(x_i)-\phi(x_j)\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\sqrt{2\log\left(\frac{2n^2V}{\delta_t}\right)}.$$
Thus, with probability at least $1 - \frac{\delta_t}{n^2V}$,
$$\begin{aligned}
\left|\left(\hat{f}_t(x_i)-\hat{f}_t(x_j)\right)-\left(f(x_i)-f(x_j)\right)\right| &< \|\phi(x_i)-\phi(x_j)\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\sqrt{2\log\left(\frac{2n^2V}{\delta_t}\right)} + \gamma_t\|\phi(x_i)-\phi(x_j)\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\\
&\le \|\phi(x_i)-\phi(x_j)\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\sqrt{2\log\left(\frac{2n^2V}{\delta_t}\right)} + \sqrt{\gamma_t}\cdot\|\phi(x_i)-\phi(x_j)\|_{(\gamma_t I + \Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_2\\
&\overset{(a)}{\le} (1+\varepsilon)\cdot\frac{\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}}{\sqrt{N_t}}\sqrt{2\log\left(\frac{2n^2V}{\delta_t}\right)} + \sqrt{\xi_t N_t}\cdot(1+\varepsilon)\cdot\frac{\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}}{\sqrt{N_t}}\cdot\|\theta^*\|_2\\
&= (1+\varepsilon)\cdot\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\sqrt{\frac{2\log\left(2n^2V/\delta_t\right)}{N_t}} + (1+\varepsilon)\sqrt{\xi_t}\cdot\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\|\theta^*\|_2\\
&\le (1+\varepsilon)\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\sqrt{\frac{2\log\left(2n^2V/\delta_t\right)}{N_t}} + (1+\varepsilon)\sqrt{\xi_t}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\|\theta^*\|_2,
\end{aligned}$$
where (a) is due to the rounding procedure.

According to the choice of $\xi_t$, it holds that
$$\begin{aligned}
(1+\varepsilon)\sqrt{\xi_t}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\|\theta^*\|_2 &\le (1+\varepsilon)\sqrt{\xi_t}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda^u}^\top\Phi_{\lambda^u}\right)^{-1}}\|\theta^*\|_2\\
&\le (1+\varepsilon)\sqrt{\xi_t}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{X}}_v,\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda^u}^\top\Phi_{\lambda^u}\right)^{-1}}\cdot B \le \frac{1}{2^{t+1}}.
\end{aligned}$$
Thus, with probability at least $1 - \frac{\delta_t}{n^2V}$,
$$\begin{aligned}
\left|\left(\hat{f}_t(x_i)-\hat{f}_t(x_j)\right)-\left(f(x_i)-f(x_j)\right)\right| &< (1+\varepsilon)\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\,v\in[V]}\|\phi(x_i)-\phi(x_j)\|_{\left(\xi_t I + \Phi_{\lambda_t^*}^\top\Phi_{\lambda_t^*}\right)^{-1}}\sqrt{\frac{2\log\left(2n^2V/\delta_t\right)}{N_t}} + \frac{1}{2^{t+1}}\\
&= \sqrt{\frac{2(1+\varepsilon)^2\rho_t^*\log\left(2n^2V/\delta_t\right)}{N_t}} + \frac{1}{2^{t+1}} \le \frac{1}{2^{t+1}} + \frac{1}{2^{t+1}} = \frac{1}{2^t}.
\end{aligned}$$
By a union bound over arms $x_i, x_j$, agents $v$ and phases $t$, we have that
$$\Pr\left[\mathcal{G}\right] \ge 1-\delta. \qquad \Box$$

For any ๐‘ก > 1 and ๐‘ฃ โˆˆ [๐‘‰ ], let S (๐‘ก )๐‘ฃ = {๐‘ฅ โˆˆ หœX๐‘ฃ : ๐‘“ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐‘“ (๐‘ฅ) โ‰ค 2โˆ’๐‘ก+2}.

Lemma 2. Assume that event G occurs. Then, for any phase ๐‘ก > 1 and agent ๐‘ฃ โˆˆ [๐‘‰ ], we have that ๐‘ฅโˆ—๐‘ฃ โˆˆ B(๐‘ก )๐‘ฃ and

B (๐‘ก )๐‘ฃ โŠ† S (๐‘ก )๐‘ฃ .

Proof of Lemma 2. We prove the first statement by induction.

To begin, for any $v\in[V]$, $x_v^*\in\mathcal{B}_v^{(1)}$ trivially holds.

Suppose that $x_v^*\in\mathcal{B}_v^{(t)}$ holds for every $v\in[V]$, and that there exists some $v'\in[V]$ such that $x_{v'}^*\notin\mathcal{B}_{v'}^{(t+1)}$. According to the elimination rule of algorithm CoopKernelFC, there exists some $x'\in\mathcal{B}_{v'}^{(t)}$ such that
$$\hat{f}_t(x') - \hat{f}_t(x_{v'}^*) \ge 2^{-t}.$$
Using Lemma 1, we have
$$f(x') - f(x_{v'}^*) > \hat{f}_t(x') - \hat{f}_t(x_{v'}^*) - 2^{-t} \ge 0,$$
which contradicts the definition of $x_{v'}^*$. Thus, for any $v\in[V]$, $x_v^*\in\mathcal{B}_v^{(t+1)}$, which completes the proof of the first statement.

Now, we prove the second statement, also by induction.

To begin, we prove that for any $v\in[V]$, $\mathcal{B}_v^{(2)}\subseteq\mathcal{S}_v^{(2)}$. Suppose that there exists some $v'\in[V]$ such that $\mathcal{B}_{v'}^{(2)}\nsubseteq\mathcal{S}_{v'}^{(2)}$. Then, there exists some $x'\in\mathcal{B}_{v'}^{(2)}$ such that $f(x_{v'}^*) - f(x') > 2^{-2+2} = 1$. Using Lemma 1, we have that at phase $t = 1$,
$$\hat{f}_t(x_{v'}^*) - \hat{f}_t(x') \ge f(x_{v'}^*) - f(x') - 2^{-1} > 1 - 2^{-1} = 2^{-1},$$
which implies that $x'$ should have been eliminated in phase $t = 1$, a contradiction.

Suppose that $\mathcal{B}_v^{(t)}\subseteq\mathcal{S}_v^{(t)}$ ($t > 1$) holds for every $v\in[V]$, and that there exists some $v'\in[V]$ such that $\mathcal{B}_{v'}^{(t+1)}\nsubseteq\mathcal{S}_{v'}^{(t+1)}$. Then, there exists some $x'\in\mathcal{B}_{v'}^{(t+1)}$ such that $f(x_{v'}^*) - f(x') > 2^{-(t+1)+2} = 2\cdot 2^{-t}$. Using Lemma 1, we have that at phase $t$,
$$\hat{f}_t(x_{v'}^*) - \hat{f}_t(x') \ge f(x_{v'}^*) - f(x') - 2^{-t} > 2\cdot 2^{-t} - 2^{-t} = 2^{-t},$$
which implies that $x'$ should have been eliminated in phase $t$, a contradiction. This completes the proof of Lemma 2. □

Now we prove Theorem 1.

Proof of Theorem 1. We first prove the correctness.

Let $t^* = \left\lceil\log_2\Delta_{\min}^{-1}\right\rceil + 1$ be the index of the last phase of algorithm CoopKernelFC. According to Lemma 2, when $t = t^*$, $\mathcal{B}_v^{(t)} = \{x_v^*\}$ holds for any $v\in[V]$, and thus algorithm CoopKernelFC returns the correct answer $x_v^*$ for all $v\in[V]$.

Next, we prove the sample complexity.

In algorithm CoopKernelFC, the computation of $\lambda_t^*$, $\rho_t^*$ and $N_t$ is the same for all agents, and each agent $v$ only generates the partial samples that belong to its arm set $\tilde{\mathcal{X}}_v$ out of the total $N_t$ samples. Hence, to bound the overall sample complexity, it suffices to bound $\sum_{t=1}^{t^*}N_t$, and we can then obtain the per-agent sample complexity by dividing by $V$. Let $\varepsilon = 0.1$. We have

๐‘กโˆ—โˆ‘๏ธ๐‘ก=1

๐‘๐‘ก

=

๐‘กโˆ—โˆ‘๏ธ๐‘ก=1

(8(2๐‘ก )2 (1 + Y)2๐œŒโˆ—๐‘ก log

(2๐‘›2๐‘‰

๐›ฟ๐‘ก

)+ 1

)

=

๐‘กโˆ—โˆ‘๏ธ๐‘ก=2

8(2๐‘ก )2(2โˆ’๐‘ก+2

)2

(1 + Y)2min_โˆˆโ–ณX max

๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— โˆˆB (๐‘ก )๐‘ฃ ,๐‘ฃโˆˆ[๐‘‰ ] โˆฅ๐œ™ (๐‘ฅ๐‘– ) โˆ’ ๐œ™ (๐‘ฅ ๐‘— )โˆฅ2(b๐‘ก ๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1(2โˆ’๐‘ก+2)2

log

(4๐‘‰๐‘›2๐‘ก2

๐›ฟ

)+ ๐‘1 + ๐‘กโˆ—

โ‰ค๐‘กโˆ—โˆ‘๏ธ๐‘ก=2

ยฉยญยญยญยซ128(1 + Y)2min_โˆˆโ–ณX max

๏ฟฝฬƒ๏ฟฝ โˆˆB (๐‘ก )๐‘ฃ ,๐‘ฃโˆˆ[๐‘‰ ] โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ)โˆฅ2(

b๐‘ก ๐ผ+ฮฆโŠค_ฮฆ_)โˆ’1(

2โˆ’๐‘ก+2)2

log

(4๐‘‰๐‘›2 (๐‘กโˆ—)2

๐›ฟ

)ยชยฎยฎยฎยฌ + ๐‘1 + ๐‘กโˆ—

23

Woodstock โ€™21, June 03โ€“05, 2021, Woodstock, NY

โ‰ค๐‘กโˆ—โˆ‘๏ธ๐‘ก=2

ยฉยญยญยญยซ128(1 + Y)2 min

_โˆˆโ–ณXmax

๏ฟฝฬƒ๏ฟฝ โˆˆB (๐‘ก )๐‘ฃ ,๐‘ฃโˆˆ[๐‘‰ ]

โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ)โˆฅ2(b๐‘ก ๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1

(๐‘“ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐‘“ (๐‘ฅ))2log

(4๐‘‰๐‘›2 (๐‘กโˆ—)2

๐›ฟ

)ยชยฎยฎยฎยฌ + ๐‘1 + ๐‘กโˆ—

โ‰ค๐‘กโˆ—โˆ‘๏ธ๐‘ก=2

ยฉยญยญยญยซ128(1 + Y)2 min

_โˆˆโ–ณXmax

๏ฟฝฬƒ๏ฟฝ โˆˆ หœX๐‘ฃ ,๐‘ฃโˆˆ[๐‘‰ ]

โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ)โˆฅ2(bโˆ—๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1

(๐‘“ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐‘“ (๐‘ฅ))2log

(4๐‘‰๐‘›2 (๐‘กโˆ—)2

๐›ฟ

)ยชยฎยฎยฎยฌ + ๐‘1 + ๐‘กโˆ—

โ‰ค๐‘กโˆ— ยท(128(1 + Y)2๐œŒโˆ— log

(4๐‘‰๐‘›2 (๐‘กโˆ—)2

๐›ฟ

))+ ๐‘1 + ๐‘กโˆ—

=๐‘‚

(๐œŒโˆ— ยท logฮ”โˆ’1

min

(log

(๐‘‰๐‘›

๐›ฟ

)+ log logฮ”โˆ’1

min

))Thus, the per-agent sample complexity is bounded by

๐‘‚

(๐œŒโˆ—

๐‘‰ยท logฮ”โˆ’1

min

(log

(๐‘‰๐‘›

๐›ฟ

)+ log logฮ”โˆ’1

min

)).

Since algorithm CoopKernelFC has at most ๐‘กโˆ— =โŒˆlog

2ฮ”โˆ’1

min

โŒ‰+ 1 phases, the number of communication rounds is

bounded by ๐‘‚ (logฮ”โˆ’1

min). โ–ก

A.3 Proof of Corollary 1

Proof of Corollary 1. Recall that $K_\lambda = \Phi_\lambda\Phi_\lambda^\top$ and $\lambda^* = \operatorname{argmax}_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\log\det\left(I+\xi_*^{-1}K_\lambda\right)$. We have that $\log\det\left(I+\xi_*^{-1}K_\lambda\right) = \log\det\left(I+\xi_*^{-1}\Phi_\lambda^\top\Phi_\lambda\right) = \log\det\left(I+\xi_*^{-1}\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda_{\tilde{x}'}\phi(x')\phi(x')^\top\right)$. Then,
$$\begin{aligned}
\rho^* &= \min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\max_{\tilde{x}\in\tilde{\mathcal{X}}_v,\,v\in[V]}\frac{\|\phi(x_v^*)-\phi(x)\|^2_{(\xi_* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\left(f(x_v^*)-f(x)\right)^2}\\
&\le \min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\max_{\tilde{x}\in\tilde{\mathcal{X}}_v,\,v\in[V]}\frac{\|\phi(x_v^*)-\phi(x)\|^2_{(\xi_* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\Delta_{\min}^2}\\
&= \frac{1}{\Delta_{\min}^2}\cdot\min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\max_{\tilde{x}\in\tilde{\mathcal{X}}_v,\,v\in[V]}\|\phi(x_v^*)-\phi(x)\|^2_{(\xi_* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}\\
&\le \frac{1}{\Delta_{\min}^2}\cdot\min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\left(2\max_{\tilde{x}\in\tilde{\mathcal{X}}}\|\phi(x)\|_{(\xi_* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}\right)^2\\
&= \frac{4}{\Delta_{\min}^2}\cdot\min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\max_{\tilde{x}\in\tilde{\mathcal{X}}}\|\phi(x)\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}}\\
&= \frac{4}{\Delta_{\min}^2}\cdot\max_{\tilde{x}\in\tilde{\mathcal{X}}}\|\phi(x)\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}}\\
&\overset{(b)}{=} \frac{4}{\Delta_{\min}^2}\cdot\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}}\|\phi(\tilde{x})\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}},
\end{aligned}$$
where (b) is due to Lemma 9.


Since $\lambda^*_{\tilde{x}}\|\phi(\tilde{x})\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}} \le 1$ for any $\tilde{x}\in\tilde{\mathcal{X}}$,
$$\begin{aligned}
\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}}\|\phi(\tilde{x})\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}} &\le 2\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\log\left(1+\lambda^*_{\tilde{x}}\|\phi(\tilde{x})\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}}\right)\\
&\overset{(c)}{\le} 2\log\frac{\det\left(\xi_* I+\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}}\phi(\tilde{x})\phi(\tilde{x})^\top\right)}{\det\left(\xi_* I\right)}\\
&= 2\log\det\left(I+\xi_*^{-1}\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}}\phi(\tilde{x})\phi(\tilde{x})^\top\right) = 2\log\det\left(I+\xi_*^{-1}K_{\lambda^*}\right),
\end{aligned}$$
where (c) comes from Lemma 10.

Thus, we have
$$\rho^* = \min_{\lambda\in\triangle_{\tilde{\mathcal{X}}}}\max_{\tilde{x}\in\tilde{\mathcal{X}}_v,\,v\in[V]}\frac{\|\phi(x_v^*)-\phi(x)\|^2_{(\xi_* I+\Phi_\lambda^\top\Phi_\lambda)^{-1}}}{\left(f(x_v^*)-f(x)\right)^2} \le \frac{4}{\Delta_{\min}^2}\cdot\sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}}\|\phi(\tilde{x})\|^2_{\left(\xi_* I+\sum_{\tilde{x}'\in\tilde{\mathcal{X}}}\lambda^*_{\tilde{x}'}\phi(\tilde{x}')\phi(\tilde{x}')^\top\right)^{-1}} \le \frac{8}{\Delta_{\min}^2}\cdot\log\det\left(I+\xi_*^{-1}K_{\lambda^*}\right). \qquad (16)$$

In the following, we interpret the term $\log\det\left(I+\xi_*^{-1}K_{\lambda^*}\right)$ using two standard tools, namely the maximum information gain and the effective dimension.

Maximum Information Gain. Recall that the maximum information gain over all sample allocation _ โˆˆ โ–ณหœX is defined

as

ฮฅ = max

_โˆˆโ–ณ หœXlog det

(๐ผ + bโˆ—โˆ’1๐พ_

).
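The log-determinant above factorizes over the eigenvalues of the kernel matrix, i.e., $\log\det(I+\xi^{-1}K) = \sum_i \log(1+\xi^{-1}\alpha_i)$; a small numerical sanity check of this identity on a hand-picked $2\times 2$ PSD matrix (toy values, not the paper's code):

```python
import math

# Check det(I + xi^{-1} K) = prod_i (1 + xi^{-1} alpha_i) on
# K = [[2, 1], [1, 2]], whose eigenvalues are 3 and 1.
xi = 0.5
K = [[2.0, 1.0], [1.0, 2.0]]
alphas = [3.0, 1.0]

M = [[1 + K[0][0] / xi, K[0][1] / xi],
     [K[1][0] / xi, 1 + K[1][1] / xi]]
det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]

lhs = math.log(det_M)
rhs = sum(math.log(1 + a / xi) for a in alphas)
assert abs(lhs - rhs) < 1e-9
```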

Then, using Eq. (16) and the definition of $\lambda^*$, the per-agent sample complexity is bounded by
$$
O\Big(\frac{\rho^*}{V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
= O\Big(\frac{\log\det\big(I+\xi_*^{-1}K_{\lambda^*}\big)}{\Delta_{\min}^2 V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
= O\Big(\frac{\Upsilon}{\Delta_{\min}^2 V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big).
$$


Effective Dimension. Recall that $\alpha_1\ge\cdots\ge\alpha_{nV}$ denote the eigenvalues of $K_{\lambda^*}$ in decreasing order. The effective dimension of $K_{\lambda^*}$ is defined as
$$
d_{\mathrm{eff}} = \min\Big\{j : j\,\xi_*\log(nV) \ge \sum_{i=j+1}^{nV}\alpha_i\Big\},
$$
and it holds that $d_{\mathrm{eff}}\,\xi_*\log(nV) \ge \sum_{i=d_{\mathrm{eff}}+1}^{nV}\alpha_i$.

Let $\varepsilon = d_{\mathrm{eff}}\,\xi_*\log(nV) - \sum_{i=d_{\mathrm{eff}}+1}^{nV}\alpha_i$, and thus $\varepsilon \le d_{\mathrm{eff}}\,\xi_*\log(nV)$. Then, we have $\sum_{i=1}^{d_{\mathrm{eff}}}\alpha_i = \mathrm{Trace}(K_{\lambda^*}) - \sum_{i=d_{\mathrm{eff}}+1}^{nV}\alpha_i = \mathrm{Trace}(K_{\lambda^*}) - d_{\mathrm{eff}}\,\xi_*\log(nV) + \varepsilon$ and $\sum_{i=d_{\mathrm{eff}}+1}^{nV}\alpha_i = d_{\mathrm{eff}}\,\xi_*\log(nV) - \varepsilon$.
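A direct transcription of this definition of the effective dimension (a hypothetical helper, with $\xi_*$ and the eigenvalues supplied as toy inputs):

```python
import math

def effective_dimension(alphas, xi_star, nV):
    """Smallest j with j * xi_star * log(nV) >= sum of eigenvalues alpha_{j+1..nV}."""
    alphas = sorted(alphas, reverse=True)   # alpha_1 >= ... >= alpha_{nV}
    for j in range(nV + 1):
        if j * xi_star * math.log(nV) >= sum(alphas[j:]):
            return j
    return nV

# Toy spectrum: fast decay after the first two eigenvalues.
alphas = [8.0, 4.0, 0.01, 0.01, 0.01, 0.01]
d_eff = effective_dimension(alphas, xi_star=1.0, nV=6)
assert d_eff == 2
# The defining inequality indeed holds at d_eff.
assert d_eff * 1.0 * math.log(6) >= sum(alphas[d_eff:])
```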

$$
\log\det\big(I+\xi_*^{-1}K_{\lambda^*}\big)
= \log\Big(\prod_{i=1}^{nV}\big(1+\xi_*^{-1}\alpha_i\big)\Big)
= \log\Big(\prod_{i=1}^{d_{\mathrm{eff}}}\big(1+\xi_*^{-1}\alpha_i\big)\cdot\prod_{i=d_{\mathrm{eff}}+1}^{nV}\big(1+\xi_*^{-1}\alpha_i\big)\Big)
$$
$$
\le \log\Bigg(\Big(1+\xi_*^{-1}\cdot\frac{\mathrm{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\,\xi_*\log(nV)+\varepsilon}{d_{\mathrm{eff}}}\Big)^{d_{\mathrm{eff}}}\Big(1+\xi_*^{-1}\cdot\frac{d_{\mathrm{eff}}\,\xi_*\log(nV)-\varepsilon}{nV-d_{\mathrm{eff}}}\Big)^{nV-d_{\mathrm{eff}}}\Bigg)
$$
$$
\le d_{\mathrm{eff}}\log\Big(1+\xi_*^{-1}\cdot\frac{\mathrm{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\,\xi_*\log(nV)+\varepsilon}{d_{\mathrm{eff}}}\Big)+\log\Big(1+\frac{d_{\mathrm{eff}}\log(nV)}{nV-d_{\mathrm{eff}}}\Big)^{nV-d_{\mathrm{eff}}}
$$
$$
= d_{\mathrm{eff}}\log\Big(1+\xi_*^{-1}\cdot\frac{\mathrm{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\,\xi_*\log(nV)+\varepsilon}{d_{\mathrm{eff}}}\Big)+\log\Big(1+\frac{d_{\mathrm{eff}}\log(nV-d_{\mathrm{eff}}+d_{\mathrm{eff}})}{nV-d_{\mathrm{eff}}}\Big)^{nV-d_{\mathrm{eff}}}
$$
$$
\overset{(d)}{\le} d_{\mathrm{eff}}\log\Big(1+\xi_*^{-1}\cdot\frac{\mathrm{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\,\xi_*\log(nV)+\varepsilon}{d_{\mathrm{eff}}}\Big)+\log\Big(1+\frac{d_{\mathrm{eff}}\log(nV+d_{\mathrm{eff}})}{nV}\Big)^{nV}
$$
$$
= d_{\mathrm{eff}}\log\Big(1+\xi_*^{-1}\cdot\frac{\mathrm{Trace}(K_{\lambda^*})-d_{\mathrm{eff}}\,\xi_*\log(nV)+\varepsilon}{d_{\mathrm{eff}}}\Big)+nV\log\Big(1+\frac{d_{\mathrm{eff}}\log(nV+d_{\mathrm{eff}})}{nV}\Big)
$$
$$
\le d_{\mathrm{eff}}\log\Big(1+\frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_*\, d_{\mathrm{eff}}}\Big)+d_{\mathrm{eff}}\log(nV+d_{\mathrm{eff}})
\le d_{\mathrm{eff}}\log\Big(2nV\cdot\Big(1+\frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_*\, d_{\mathrm{eff}}}\Big)\Big),
$$
where inequality (d) is due to the fact that $\big(1+\frac{d_{\mathrm{eff}}\log(x+d_{\mathrm{eff}})}{x}\big)^x$ is monotonically increasing with respect to $x\ge 1$.
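The monotonicity claim behind step (d) can be probed numerically; the sketch below evaluates $g(x) = \big(1+d\log(x+d)/x\big)^x$ on a grid for a few values of $d$ (a numerical spot check under toy parameters, not a proof):

```python
import math

def g(x: float, d: float) -> float:
    # g(x) = (1 + d*log(x+d)/x)^x, claimed monotonically increasing for x >= 1
    return (1 + d * math.log(x + d) / x) ** x

for d in (1.0, 3.0, 10.0):
    values = [g(1 + 0.5 * k, d) for k in range(40)]
    assert all(a <= b + 1e-9 for a, b in zip(values, values[1:]))
```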

Then, using Eq. (16), the per-agent sample complexity is bounded by
$$
O\Big(\frac{\rho^*}{V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
= O\Big(\frac{\log\det\big(I+\xi_*^{-1}K_{\lambda^*}\big)}{\Delta_{\min}^2 V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
$$
$$
= O\Big(\frac{d_{\mathrm{eff}}}{\Delta_{\min}^2 V}\cdot\log\Big(nV\cdot\Big(1+\frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_*\, d_{\mathrm{eff}}}\Big)\Big)\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big).
$$

Decomposition. Let $K = [K(x_i,x_j)]_{i,j\in[nV]}$, $K_z = [K_z(z_v,z_{v'})]_{v,v'\in[V]}$ and $K_x = [K_x(x_i,x_j)]_{i,j\in[nV]}$. Since the kernel function $K(\cdot,\cdot)$ is a Hadamard composition of $K_z(\cdot,\cdot)$ and $K_x(\cdot,\cdot)$, it holds that $\mathrm{rank}(K_{\lambda^*}) = \mathrm{rank}(K) \le \mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)$.
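The rank bound $\mathrm{rank}(K_z \circ K_x) \le \mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)$ for entry-wise products rests on the fact that the Hadamard product of two rank-one Gram matrices $uu^\top$ and $ww^\top$ is again rank one, with factor $u\circ w$; a toy check of this mechanism (illustrative values only):

```python
# For rank-one matrices, (u u^T) o (w w^T) = (u o w)(u o w)^T, so the Hadamard
# product of a rank-r and a rank-s matrix is a sum of r*s rank-one terms.
u = [1.0, 2.0, -1.0]
w = [0.5, -3.0, 2.0]

A = [[ui * uj for uj in u] for ui in u]        # u u^T, rank 1
B = [[wi * wj for wj in w] for wi in w]        # w w^T, rank 1
H = [[A[i][j] * B[i][j] for j in range(3)] for i in range(3)]

uw = [ui * wi for ui, wi in zip(u, w)]
expected = [[uw[i] * uw[j] for j in range(3)] for i in range(3)]
assert H == expected
```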


$$
\log\det\big(I+\xi_*^{-1}K_{\lambda^*}\big)
= \log\Big(\prod_{i=1}^{nV}\big(1+\xi_*^{-1}\alpha_i\big)\Big)
= \log\Big(\prod_{i=1}^{\mathrm{rank}(K_{\lambda^*})}\big(1+\xi_*^{-1}\alpha_i\big)\Big)
\le \log\Bigg(\frac{\sum_{i=1}^{\mathrm{rank}(K_{\lambda^*})}\big(1+\xi_*^{-1}\alpha_i\big)}{\mathrm{rank}(K_{\lambda^*})}\Bigg)^{\mathrm{rank}(K_{\lambda^*})}
$$
$$
= \mathrm{rank}(K_{\lambda^*})\log\Bigg(\frac{\sum_{i=1}^{\mathrm{rank}(K_{\lambda^*})}\big(1+\xi_*^{-1}\alpha_i\big)}{\mathrm{rank}(K_{\lambda^*})}\Bigg)
\le \mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)\log\Bigg(\frac{\mathrm{Trace}\big(I+\xi_*^{-1}K_{\lambda^*}\big)}{\mathrm{rank}(K_{\lambda^*})}\Bigg).
$$
Then, using Eq. (16), the per-agent sample complexity is bounded by
$$
O\Big(\frac{\rho^*}{V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
= O\Big(\frac{\log\det\big(I+\xi_*^{-1}K_{\lambda^*}\big)}{\Delta_{\min}^2 V}\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big)
$$
$$
= O\Big(\frac{\mathrm{rank}(K_z)\cdot\mathrm{rank}(K_x)}{\Delta_{\min}^2 V}\cdot\log\Bigg(\frac{\mathrm{Trace}\big(I+\xi_*^{-1}K_{\lambda^*}\big)}{\mathrm{rank}(K_{\lambda^*})}\Bigg)\cdot\log\Delta_{\min}^{-1}\Big(\log\Big(\frac{Vn}{\delta}\Big)+\log\log\Delta_{\min}^{-1}\Big)\Big). \qquad\square
$$
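The first inequality in the chain above is the AM-GM bound $\log\prod_i(1+\xi^{-1}\alpha_i) \le r\log\big(\frac{1}{r}\sum_i(1+\xi^{-1}\alpha_i)\big)$ over the $r$ retained eigenvalues; a quick numerical sanity check (arbitrary toy eigenvalues):

```python
import math

# AM-GM: geometric mean <= arithmetic mean, hence the log-product bound.
xi = 0.5
alphas = [4.0, 2.5, 1.0, 0.2, 0.0, 3.3, 0.7]   # stand-in eigenvalues
terms = [1 + a / xi for a in alphas]
r = len(terms)

lhs = sum(math.log(t) for t in terms)
rhs = r * math.log(sum(terms) / r)
assert lhs <= rhs + 1e-12
```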

A.4 Proof of Theorem 2

Proof of Theorem 2. Our proof of Theorem 2 adapts the analysis procedure of [19] to the multi-agent setting.

Suppose that $\mathcal{A}$ is a $\delta$-correct algorithm for CoPE-KB. For any $i\in[nV]$, let $\nu_{\theta^*,i} = \mathcal{N}(\phi(x_i)^\top\theta^*, 1)$ denote the reward distribution of arm $x_i$, and let $T_i$ denote the number of times arm $x_i$ is pulled by algorithm $\mathcal{A}$. Let $\Theta = \{\theta\in\mathcal{H}_K : \exists v, \exists \tilde{x}\in\tilde{\mathcal{X}}_v\setminus\{\tilde{x}^*_v\}, \big(\phi(\tilde{x}^*_v)-\phi(\tilde{x})\big)^\top\theta < 0\}$.

Since $\mathcal{A}$ is $\delta$-correct, according to the "Change of Distribution" lemma (Lemma A.3) in [24], we have that for any $\theta\in\Theta$,
$$
\sum_{i=1}^{nV}\mathbb{E}[T_i]\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i}) \ge \log(1/2.4\delta).
$$
Thus, we have
$$
\min_{\theta\in\Theta}\sum_{i=1}^{nV}\mathbb{E}[T_i]\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i}) \ge \log(1/2.4\delta).
$$

Let ๐’•โˆ— be the optimal solution of the following optimization problem

min

๐‘›๐‘‰โˆ‘๏ธ๐‘–=1

๐‘ก๐‘–

27

Woodstock โ€™21, June 03โ€“05, 2021, Woodstock, NY

๐‘  .๐‘ก . min

\ โˆˆH๐พ

๐‘›๐‘‰โˆ‘๏ธ๐‘–=1

๐‘ก๐‘– ยท KL(a\ โˆ—,๐‘– , a\,๐‘– ) โ‰ฅ log(1/2.4๐›ฟ) .

Then, we have

min

\ โˆˆH๐พ

๐‘›๐‘‰โˆ‘๏ธ๐‘–=1

๐‘กโˆ—๐‘–โˆ‘๐‘›๐‘‰๐‘—=1

๐‘กโˆ—๐‘—

ยท KL(a\ โˆ—,๐‘– , a\,๐‘– ) โ‰ฅlog(1/2.4๐›ฟ)โˆ‘๐‘›๐‘‰

๐‘—=1๐‘กโˆ—๐‘—

โ‰ฅ log(1/2.4๐›ฟ)โˆ‘๐‘›๐‘‰๐‘—=1E[๐‘‡๐‘— ]

.

Since $\sum_{i=1}^{nV}\frac{t^*_i}{\sum_{j=1}^{nV} t^*_j} = 1$,
$$
\max_{\lambda\in\triangle_{\mathcal{X}}}\min_{\theta\in\Theta}\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i}) \ge \frac{\log(1/2.4\delta)}{\sum_{i=1}^{nV}\mathbb{E}[T_i]}.
$$
Thus, we have
$$
\sum_{i=1}^{nV}\mathbb{E}[T_i] \ge \frac{\log(1/2.4\delta)}{\max_{\lambda\in\triangle_{\mathcal{X}}}\min_{\theta\in\Theta}\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i})}
= \log(1/2.4\delta)\,\min_{\lambda\in\triangle_{\mathcal{X}}}\max_{\theta\in\Theta}\frac{1}{\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i})}. \tag{17}
$$

For $\lambda\in\triangle_{\mathcal{X}}$, let $A(\xi_*,\lambda) = \xi_* I + \sum_{i=1}^{nV}\lambda_i\phi(x_i)\phi(x_i)^\top$. For any $\lambda\in\triangle_{\mathcal{X}}$, $v\in[V]$, $j\in[nV]$ such that $x_j\in\tilde{\mathcal{X}}_v\setminus\{x^*_v\}$, define
$$
\theta_j(\lambda) = \theta^* - \big(2(\phi(x^*_v)-\phi(x_j))^\top\theta^*\big)\cdot\frac{A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}{(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}.
$$
Here $(\phi(x^*_v)-\phi(x_j))^\top\theta_j(\lambda) = -(\phi(x^*_v)-\phi(x_j))^\top\theta^* < 0$, and thus $\theta_j(\lambda)\in\Theta$. The KL-divergence between $\nu_{\theta^*,i}$ and $\nu_{\theta_j(\lambda),i}$ is
$$
\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta_j(\lambda),i})
= \frac{1}{2}\big(\phi(x_i)^\top(\theta^*-\theta_j(\lambda))\big)^2
= \frac{2\big((\phi(x^*_v)-\phi(x_j))^\top\theta^*\big)^2\,(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}\phi(x_i)\phi(x_i)^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}{\big((\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))\big)^2},
$$
and thus,
$$
\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta_j(\lambda),i})
= \frac{2\big((\phi(x^*_v)-\phi(x_j))^\top\theta^*\big)^2\,(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}\big(\sum_{i=1}^{nV}\lambda_i\phi(x_i)\phi(x_i)^\top\big)A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}{\big((\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))\big)^2}
$$
$$
\le \frac{2\big((\phi(x^*_v)-\phi(x_j))^\top\theta^*\big)^2}{(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}
= \frac{2(f(x^*_v)-f(x_j))^2}{(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}. \tag{18}
$$


Let $\mathcal{J} = \{j\in[nV] : x_j \ne x^*_v, \forall v\in[V]\}$ denote the set of indices of all sub-optimal arms. Then, plugging Eq. (18) into Eq. (17), we have
$$
\sum_{i=1}^{nV}\mathbb{E}[T_i] \ge \log(1/2.4\delta)\,\min_{\lambda\in\triangle_{\mathcal{X}}}\max_{\theta\in\Theta}\frac{1}{\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta,i})}
\ge \log(1/2.4\delta)\,\min_{\lambda\in\triangle_{\mathcal{X}}}\max_{j\in\mathcal{J}}\frac{1}{\sum_{i=1}^{nV}\lambda_i\cdot\mathrm{KL}(\nu_{\theta^*,i},\nu_{\theta_j(\lambda),i})}
$$
$$
\ge \log(1/2.4\delta)\,\min_{\lambda\in\triangle_{\mathcal{X}}}\max_{\tilde{x}_j\in\tilde{\mathcal{X}}_v\setminus\{\tilde{x}^*_v\},\, v\in[V]}\frac{(\phi(x^*_v)-\phi(x_j))^\top A(\xi_*,\lambda)^{-1}(\phi(x^*_v)-\phi(x_j))}{2(f(x^*_v)-f(x_j))^2}
= \frac{1}{2}\log(1/2.4\delta)\,\rho^*,
$$
which completes the proof of Theorem 2. $\square$

A.5 Proof of Theorem 3

Our proof of Theorem 3 generalizes the 2-armed lower bound analysis in [38] to the multi-armed case with linear reward structures. We first introduce some notation and definitions.

Consider a fully-collaborative instance $\mathcal{I}(\mathcal{X},\theta^*)$ of the CoPE-KB problem, where $\tilde{\mathcal{X}} = \mathcal{X} = \mathcal{X}_v$, $f = f_v$ for all $v\in[V]$, and $\phi(x_i)^\top\theta^*$ is equal for all $x_i\ne x^*$. Let $\Delta = \Delta_{\min} = (\phi(x^*)-\phi(x_i))^\top\theta^*$ and $c = \frac{\phi(x_i)^\top\theta^*}{\Delta}$ for any $x_i\ne x^*$, where $c>0$ is a constant. Then, we have $\frac{\phi(x^*)^\top\theta^*}{\Delta} = \frac{\phi(x_i)^\top\theta^*+\Delta}{\Delta} = 1+c$.

For any integer $\alpha\ge 0$, let $\mathcal{E}(\alpha,T)$ be the event that $\mathcal{A}$ uses at least $\alpha$ communication rounds and at most $T$ samples before the end of the $\alpha$-th round, and let $\mathcal{E}_{+1}(\alpha,T)$ be the event that $\mathcal{A}$ uses at least $\alpha+1$ communication rounds and at most $T$ samples before the end of the $\alpha$-th round. Let $T_{\mathcal{A}}$ and $T_{\mathcal{A},x_i}$ denote the expected number of samples used by $\mathcal{A}$, and the expected number of samples used on arm $x_i$ by $\mathcal{A}$, respectively. Let $\lambda$ be the sample allocation of $\mathcal{A}$, i.e., $\lambda_i = \frac{T_{\mathcal{A},x_i}}{T_{\mathcal{A}}}$. Let $\rho(\mathcal{X},\theta^*) = \min_{\lambda\in\triangle_{\mathcal{X}}}\max_{x\in\mathcal{X}\setminus\{x^*\}}\frac{\|\phi(x^*)-\phi(x)\|^2_{(\xi_* I+\sum_{x\in\mathcal{X}}\lambda_x\phi(x)\phi(x)^\top)^{-1}}}{(f(x^*)-f(x))^2}$ and $d(\mathcal{X}) = \min_{\lambda\in\triangle_{\mathcal{X}}}\max_{x\in\mathcal{X}\setminus\{x^*\}}\|\phi(x^*)-\phi(x)\|^2_{(\xi_* I+\sum_{x\in\mathcal{X}}\lambda_x\phi(x)\phi(x)^\top)^{-1}}$. Then, we have $\rho(\mathcal{X},\theta^*) = \frac{d(\mathcal{X})}{\Delta^2}$.

In order to prove Theorem 3, we first prove the following lemmas.

Lemma 3 (Linear Structured Progress Lemma). For any integer $\alpha\ge 0$ and any $q\ge 1$, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] \ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] - 2\delta - \frac{1}{\sqrt{q}}.
$$

Proof of Lemma 3. Let $\mathcal{F}$ be the event that $\mathcal{A}$ uses exactly $\alpha$ communication rounds and at most $\frac{\rho(\mathcal{X},\theta^*)}{Vq}$ samples before the end of the $\alpha$-th round. Then, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] \ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] - \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F}].
$$
Thus, to prove Lemma 3, it suffices to prove
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F}] \le 2\delta + \frac{1}{\sqrt{q}}. \tag{19}
$$
We can decompose $\mathcal{F}$ as
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F}] = \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F},\,\mathcal{A}\text{ returns }x^*] + \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F},\,\mathcal{A}\text{ does not return }x^*]
\le \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F},\,\mathcal{A}\text{ returns }x^*] + \delta. \tag{20}
$$

Let๐‘ฆ๐‘– = ๐œ™ (๐‘ฅโˆ—)โˆ’๐œ™ (๐‘ฅ๐‘– ) for any ๐‘– โˆˆ [๐‘›]. Let\ (bโˆ—, _) = \โˆ—โˆ’2(๐‘ฆโŠค๐‘— \ โˆ—)๐ด(bโˆ—,_)โˆ’1๐‘ฆ ๐‘—

๐‘ฆโŠค๐‘—๐ด(bโˆ—,_)โˆ’1๐‘ฆ ๐‘—

, where๐ด(bโˆ—, _) = bโˆ—๐ผ+โˆ‘๐‘›๐‘–=1

_๐‘–๐œ™ (๐‘ฅ๐‘– )๐œ™ (๐‘ฅ๐‘– )โŠค

and ๐‘— = argmax๐‘–โˆˆ[๐‘›]๐‘ฆโŠค๐‘– ๐ด(bโˆ—,_)โˆ’1๐‘ฆ๐‘–

(๐‘ฆโŠค๐‘–\ โˆ—)2 . Let I(X, \ (_)) denote the instance where the underlying parameter is \ (_). Under

I(X, \ (_)), it holds that ๐‘ฆโŠค๐‘—\ (_) = โˆ’๐‘ฆโŠค

๐‘—\โˆ— < 0, and thus ๐‘ฅโˆ— is sub-optimal. Let DI denote the product distribution of

instance I with at most๐œŒ (X,\ โˆ—)

๐‘ž samples over all agents.

Using Pinsker's inequality (Lemma 11) and the Gaussian KL-divergence computation, we have
$$
\|\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)} - \mathcal{D}_{\mathcal{I}(\mathcal{X},\theta(\lambda))}\|_{\mathrm{TV}}
\le \sqrt{\frac{1}{2}\mathrm{KL}\big(\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}\,\|\,\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta(\lambda))}\big)}
\le \sqrt{\frac{1}{4}\sum_{i\in[n]}\big(\phi(x_i)^\top(\theta^*-\theta(\lambda))\big)^2\cdot\lambda_i T_{\mathcal{A}}}
$$
$$
= \sqrt{\frac{1}{4}\sum_{i\in[n]}\frac{4(y_j^\top\theta^*)^2\cdot y_j^\top A(\xi_*,\lambda)^{-1}\phi(x_i)\phi(x_i)^\top A(\xi_*,\lambda)^{-1}y_j}{(y_j^\top A(\xi_*,\lambda)^{-1}y_j)^2}\cdot\lambda_i T_{\mathcal{A}}}
= \sqrt{\frac{T_{\mathcal{A}}(y_j^\top\theta^*)^2\cdot y_j^\top A(\xi_*,\lambda)^{-1}\big(\sum_{i\in[n]}\lambda_i\phi(x_i)\phi(x_i)^\top\big)A(\xi_*,\lambda)^{-1}y_j}{(y_j^\top A(\xi_*,\lambda)^{-1}y_j)^2}}
$$
$$
\le \sqrt{\frac{T_{\mathcal{A}}(y_j^\top\theta^*)^2\cdot y_j^\top A(\xi_*,\lambda)^{-1}\big(\xi_* I+\sum_{i\in[n]}\lambda_i\phi(x_i)\phi(x_i)^\top\big)A(\xi_*,\lambda)^{-1}y_j}{(y_j^\top A(\xi_*,\lambda)^{-1}y_j)^2}}
= \sqrt{T_{\mathcal{A}}\cdot\frac{(y_j^\top\theta^*)^2}{y_j^\top A(\xi_*,\lambda)^{-1}y_j}}
$$
$$
\le \sqrt{\frac{\rho(\mathcal{X},\theta^*)}{q}\cdot\frac{1}{\frac{y_j^\top A(\xi_*,\lambda)^{-1}y_j}{(y_j^\top\theta^*)^2}}}
\le \sqrt{\frac{\rho(\mathcal{X},\theta^*)}{q}\cdot\frac{1}{\min_{\lambda\in\triangle_{\mathcal{X}}}\frac{y_j^\top A(\xi_*,\lambda)^{-1}y_j}{(y_j^\top\theta^*)^2}}}
= \frac{1}{\sqrt{q}}. \tag{21}
$$
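The Pinsker step above, $\|P-Q\|_{\mathrm{TV}} \le \sqrt{\mathrm{KL}(P\|Q)/2}$, can be checked in closed form for two unit-variance Gaussians, for which the total variation distance is $\mathrm{erf}(|\mu_1-\mu_2|/(2\sqrt{2}))$ and the KL divergence is $(\mu_1-\mu_2)^2/2$ (a numerical illustration only, not the proof's argument):

```python
import math

def tv_gauss(mu1: float, mu2: float) -> float:
    # Total variation distance between N(mu1, 1) and N(mu2, 1)
    return math.erf(abs(mu1 - mu2) / (2 * math.sqrt(2)))

def kl_gauss(mu1: float, mu2: float) -> float:
    # KL divergence between N(mu1, 1) and N(mu2, 1)
    return (mu1 - mu2) ** 2 / 2

# Pinsker: TV <= sqrt(KL / 2), checked over a range of mean gaps.
for d in (0.1, 0.5, 1.0, 3.0):
    assert tv_gauss(0.0, d) <= math.sqrt(kl_gauss(0.0, d) / 2) + 1e-12
```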

Since ๐‘ฅโˆ— is sub-optimal under I(X, \ (_)), using the measure change technique, we have

Pr

I(X,\ โˆ—)[F ,A returns ๐‘ฅโˆ—] โ‰ค Pr

I(X,\ (_))[F ,A returns ๐‘ฅโˆ—]

+ โˆฅDI(X,\ โˆ—) โˆ’ DI(X,\ (_)) โˆฅTV

โ‰ค๐›ฟ + 1

โˆš๐‘ž


Plugging the above inequality into Eq. (20), we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{F}] \le 2\delta + \frac{1}{\sqrt{q}},
$$
which completes the proof of Lemma 3. $\square$

Let $\mathcal{I}(\mathcal{X},\theta^*/\kappa)$ denote the instance where the underlying parameter is $\theta^*/\kappa$. Under $\mathcal{I}(\mathcal{X},\theta^*/\kappa)$, the reward gap is $(\phi(x^*)-\phi(x_i))^\top\theta^*/\kappa = \frac{\Delta}{\kappa}$ and the sample complexity is $\rho(\mathcal{X},\theta^*/\kappa) = \frac{\kappa^2 d(\mathcal{X})}{\Delta^2}$. Let $\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}$ and $\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}$ denote the product distributions of instances $\mathcal{I}(\mathcal{X},\theta^*)$ and $\mathcal{I}(\mathcal{X},\theta^*/\kappa)$ with $T_{\mathcal{A}}$ samples, respectively.

Lemma 4 (Multi-armed Measure Transformation Lemma). Suppose that algorithm $\mathcal{A}$ uses $T_{\mathcal{A}} = \frac{\rho(\mathcal{X},\theta^*)}{\zeta}$ samples over all agents on instance $\mathcal{I}(\mathcal{X},\theta^*)$, where $\zeta \ge 100$. Then, for any event $E$ on $\mathcal{I}(\mathcal{X},\theta^*)$ and any $Q \ge \zeta$, we have
$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E] \le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E]\cdot\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nQ)}{\zeta}}\Bigg) + \frac{1}{nQ^2}.
$$

Proof. For any $i\in[n]$, let $z_{i,1},\ldots,z_{i,T_{\mathcal{A},x_i}}$ denote the observed $T_{\mathcal{A},x_i}$ samples on arm $x_i$, and define the event
$$
L_i = \Bigg\{\sum_{t=1}^{T_{\mathcal{A},x_i}} z_{i,t} \ge T_{\mathcal{A},x_i}\cdot\phi(x_i)^\top\theta^*/\kappa + \frac{z}{\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*}\Bigg\},
$$
where $z \ge 0$ is a parameter specified later. We also define the event $L = \cap_{i\in[n]}L_i$. Then,
$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E] \le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E,L] + \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[\neg L].
$$

Using the measure change technique, we bound the term $\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E,L]$ as
$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E,L]
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\sum_{t=1}^{T_{\mathcal{A},x_i}}\Big(\big(z_{i,t}-\phi(x_i)^\top\theta^*/\kappa\big)^2 - \big(z_{i,t}-\phi(x_i)^\top\theta^*\big)^2\Big)\Bigg)
$$
$$
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\sum_{t=1}^{T_{\mathcal{A},x_i}}\Big(\big(\phi(x_i)^\top\theta^*/\kappa\big)^2 - \big(\phi(x_i)^\top\theta^*\big)^2 - 2z_{i,t}\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)\Big)\Bigg)
$$
$$
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\Bigg(\Big(\big(\phi(x_i)^\top\theta^*/\kappa\big)^2 - \big(\phi(x_i)^\top\theta^*\big)^2\Big)\cdot T_{\mathcal{A},x_i} - 2\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)\cdot\sum_{t=1}^{T_{\mathcal{A},x_i}} z_{i,t}\Bigg)\Bigg)
$$
$$
\le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\Bigg(\Big(\big(\phi(x_i)^\top\theta^*/\kappa\big)^2 - \big(\phi(x_i)^\top\theta^*\big)^2\Big)\cdot T_{\mathcal{A},x_i} - 2\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)\cdot\Big(T_{\mathcal{A},x_i}\cdot\phi(x_i)^\top\theta^*/\kappa + \frac{z}{\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*}\Big)\Bigg)\Bigg)
$$
$$
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\Bigg(\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)\Big(\phi(x_i)^\top\theta^*/\kappa\cdot T_{\mathcal{A},x_i} + \phi(x_i)^\top\theta^*\cdot T_{\mathcal{A},x_i} - 2\,\phi(x_i)^\top\theta^*/\kappa\cdot T_{\mathcal{A},x_i}\Big) - 2z\Bigg)\Bigg)
$$
$$
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(-\frac{1}{2}\sum_{i\in[n]}\Big(-\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)^2\cdot T_{\mathcal{A},x_i} - 2z\Big)\Bigg)
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(\sum_{i\in[n]}\Big(\frac{1}{2}\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)^2\cdot T_{\mathcal{A},x_i} + z\Big)\Bigg)
$$
$$
= \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(\sum_{i\in[n]}\Big(\frac{1}{2}\Big(1-\frac{1}{\kappa}\Big)^2\big(\phi(x_i)^\top\theta^*\big)^2\cdot\frac{d(\mathcal{X})}{\Delta^2\zeta}\cdot\lambda_i + z\Big)\Bigg)
\le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(\frac{(1+c)^2}{2}\cdot\frac{d(\mathcal{X})}{\zeta} + nz\Bigg).
$$
Next, using the Chernoff-Hoeffding inequality, we bound the second term as

$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[\neg L] \le \sum_{i\in[n]}\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[\neg L_i]
\le \sum_{i\in[n]}\exp\Bigg(-\frac{2z^2}{\big(\phi(x_i)^\top\theta^*/\kappa - \phi(x_i)^\top\theta^*\big)^2}\cdot\frac{\zeta}{\lambda_i\,\rho(\mathcal{X},\theta^*)}\Bigg)
$$
$$
= \sum_{i\in[n]}\exp\Bigg(-\frac{2z^2}{\big(1-\frac{1}{\kappa}\big)^2\big(\phi(x_i)^\top\theta^*\big)^2}\cdot\frac{\Delta^2\zeta}{\lambda_i\, d(\mathcal{X})}\Bigg)
\le \sum_{i\in[n]}\exp\Bigg(-\frac{2}{(1+c)^2}\cdot\frac{z^2\zeta}{d(\mathcal{X})}\Bigg)
= n\cdot\exp\Bigg(-\frac{2}{(1+c)^2}\cdot\frac{z^2\zeta}{d(\mathcal{X})}\Bigg).
$$
Thus,
$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E] \le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E,L] + \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[\neg L]
\le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(\frac{(1+c)^2}{2}\cdot\frac{d(\mathcal{X})}{\zeta} + nz\Bigg) + n\cdot\exp\Bigg(-\frac{2}{(1+c)^2}\cdot\frac{z^2\zeta}{d(\mathcal{X})}\Bigg).
$$
Let $z = \sqrt{\frac{(1+c)^2 d(\mathcal{X})\log(nQ)}{\zeta}}$. Then, we have
$$
\Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}}[E] \le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E,L]\cdot\exp\Bigg(\frac{(1+c)^2}{2}\cdot\frac{d(\mathcal{X})}{\zeta} + n\sqrt{\frac{(1+c)^2 d(\mathcal{X})\log(nQ)}{\zeta}}\Bigg) + n\cdot\exp(-2\log(nQ))
$$
$$
\le \Pr_{\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}}[E]\cdot\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nQ)}{\zeta}}\Bigg) + \frac{1}{nQ^2}. \qquad\square
$$

Lemma 5 (Linear Structured Instance Transformation Lemma). For any integer $\alpha\ge 0$, $q\ge 100$ and $\kappa\ge 1$, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*)}{Vq}+\frac{\rho(\mathcal{X},\theta^*)}{\beta}\Big)\Big]
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] - \delta - \sqrt{\frac{(1+c)^2}{4}\cdot\frac{d(\mathcal{X})}{q}} - \Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg) - \frac{1}{nV}.
$$

Proof. Let $\ell = \{z_{i,1},\ldots,z_{i,T_{\mathcal{A},x_i}}\}_{i\in[n]}$ denote the $T_{\mathcal{A}}$ samples of algorithm $\mathcal{A}$ on instance $\mathcal{I}(\mathcal{X},\theta^*)$. Let $\mathcal{S}$ denote the set of all possible $\ell$, conditioned on which $\mathcal{E}_{+1}\big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\big)$ holds. Then, we have
$$
\sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s] = \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big].
$$
For any agent $v\in[V]$, let $\mathcal{K}_v$ be the event that agent $v$ uses more than $\frac{\rho(\mathcal{X},\theta^*)}{\beta}$ samples during the $(\alpha+1)$-st round. Conditioned on $s\in\mathcal{S}$, $\mathcal{K}_v$ only depends on the samples of agent $v$ during the $(\alpha+1)$-st round, and is independent of the other agents.

Using the facts that $\mathcal{A}$ is $\delta$-correct and $\beta$-speedup and that all agents have the same performance on fully-collaborative instances, we have
$$
\delta \ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{A}\text{ uses more than }\frac{\rho(\mathcal{X},\theta^*)}{\beta}\text{ samples}\Big]
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_1\vee\cdots\vee\mathcal{K}_V\,|\,\ell=s]
$$
$$
= \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\Bigg(1-\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big)\Bigg)
= \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] - \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big).
$$
Rearranging the above inequality, we have
$$
\sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big) \ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] - \delta. \tag{22}
$$

Using Lemma 4 with $\zeta=\beta$ and $Q=V$, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*)}{Vq}+\frac{\rho(\mathcal{X},\theta^*)}{\beta}\Big)\Big]
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\neg\mathcal{K}_1\wedge\cdots\wedge\neg\mathcal{K}_V\,|\,\ell=s]
= \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\mathcal{K}_v\,|\,\ell=s]\Big)
$$
$$
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\prod_{v\in[V]}\max\Bigg\{1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\cdot\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-\frac{1}{nV^2},\,0\Bigg\}
$$
$$
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\Bigg(\prod_{v\in[V]}\max\Bigg\{1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\cdot\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg),\,0\Bigg\}-\frac{1}{nV}\Bigg)
$$
$$
= \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\Bigg(\prod_{v\in[V]}\max\Bigg(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\cdot\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg),\,0\Bigg)-\frac{1}{nV}\Bigg)
$$
$$
\overset{(e)}{\ge} \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\Bigg(\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big)-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}\Bigg)
$$
$$
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big)-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}
$$
$$
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big)-\sum_{s\in\mathcal{S}}\Big|\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\Big|-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV},
$$
where (e) comes from Lemma 12.

Using Pinsker's inequality (Lemma 11), we have
$$
\sum_{s\in\mathcal{S}}\Big|\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}[\ell=s]-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\Big|
\le \sqrt{\frac{1}{2}\mathrm{KL}\big(\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)},\mathcal{D}_{\mathcal{I}(\mathcal{X},\theta^*)}\big)} \tag{23}
$$
$$
\le \sqrt{\frac{1}{4}\sum_{i\in[n]}\big(\phi(x_i)^\top\theta^*/\kappa-\phi(x_i)^\top\theta^*\big)^2\cdot\lambda_i T_{\mathcal{A}}}
= \sqrt{\frac{1}{4}\sum_{i\in[n]}\Big(1-\frac{1}{\kappa}\Big)^2\big(\phi(x_i)^\top\theta^*\big)^2\cdot\frac{d(\mathcal{X})}{\Delta^2 q}\,\lambda_i}
\le \sqrt{\frac{(1+c)^2}{4}\cdot\frac{d(\mathcal{X})}{q}}. \tag{24}
$$

Thus, using Eqs. (22) and (24), we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*)}{Vq}+\frac{\rho(\mathcal{X},\theta^*)}{\beta}\Big)\Big]
\ge \sum_{s\in\mathcal{S}}\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\ell=s]\cdot\prod_{v\in[V]}\Big(1-\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}[\mathcal{K}_v\,|\,\ell=s]\Big)-\sqrt{\frac{(1+c)^2}{4}\cdot\frac{d(\mathcal{X})}{q}}-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}
$$
$$
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]-\delta-\sqrt{\frac{(1+c)^2}{4}\cdot\frac{d(\mathcal{X})}{q}}-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}. \qquad\square
$$

Now we prove Theorem 3.

Proof of Theorem 3. Combining Lemmas 3 and 5, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*)}{Vq}+\frac{\rho(\mathcal{X},\theta^*)}{\beta}\Big)\Big]
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}_{+1}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]-\delta-\sqrt{\frac{(1+c)^2}{4}\cdot\frac{d(\mathcal{X})}{q}}-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}
$$
$$
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]-3\delta-\sqrt{\frac{(1+c)^2 d(\mathcal{X})}{q}}-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}.
$$

Let $\kappa = \sqrt{1+\frac{Vq}{\beta}}$. Then, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*/\kappa)}{Vq}\Big)\Big]
= \Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\kappa^2\cdot\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]
= \Pr_{\mathcal{I}(\mathcal{X},\theta^*/\kappa)}\Big[\mathcal{E}\Big(\alpha+1,\frac{\rho(\mathcal{X},\theta^*)}{Vq}+\frac{\rho(\mathcal{X},\theta^*)}{\beta}\Big)\Big]
$$
$$
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(\alpha,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]-3\delta-\sqrt{\frac{(1+c)^2 d(\mathcal{X})}{q}}-\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)-\frac{1}{nV}. \tag{25}
$$

Let $\mathcal{I}(\mathcal{X},\theta_0)$ be the basic instance of the induction, where the reward gap is $\Delta_0 = (\phi(x^*)-\phi(x_i))^\top\theta_0 = 1$ for any $i\in[n]$. Let $t_0$ be the largest integer such that
$$
\Delta\cdot\Big(1+\frac{Vq}{\beta}\Big)^{\frac{t_0}{2}} \le 1,
$$
where $q = 1000\,t_0^2$. Then, we have
$$
t_0 = \Omega\Bigg(\frac{\log(\frac{1}{\Delta})}{\log(1+\frac{V}{\beta})+\log\log(\frac{1}{\Delta})}\Bigg).
$$

Starting from $\mathcal{I}(\mathcal{X},\theta_0)$, we repeatedly apply Eq. (25) $t_0$ times to switch to $\mathcal{I}(\mathcal{X},\theta^*)$, where the reward gap is $\Delta$. Since $\Pr_{\mathcal{I}(\mathcal{X},\theta_0)}\big[\mathcal{E}\big(0,\frac{\rho(\mathcal{X},\theta_0)}{Vq}\big)\big] = 1$, by induction, we have
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(t_0,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big]
\ge 1-\Bigg(3\delta+\sqrt{\frac{(1+c)^2 d(\mathcal{X})}{q}}+\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)+\frac{1}{nV}\Bigg)\cdot t_0.
$$
When $\delta = O(\frac{1}{nV})$ and $q, V, \beta$ are large enough, we can have

$$
\Bigg(3\delta+\sqrt{\frac{(1+c)^2 d(\mathcal{X})}{q}}+\Bigg(\exp\Bigg((1+c)^2 d(\mathcal{X})\,n\sqrt{\frac{\log(nV)}{\beta}}\Bigg)-1\Bigg)+\frac{1}{nV}\Bigg)\cdot t_0 \le \frac{1}{2},
$$
and then
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(t_0,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] \ge \frac{1}{2}.
$$

Thus,
$$
\Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Bigg[\mathcal{A}\text{ uses }\Omega\Bigg(\frac{\log(\frac{1}{\Delta})}{\log(1+\frac{V}{\beta})+\log\log(\frac{1}{\Delta})}\Bigg)\text{ communication rounds}\Bigg]
\ge \Pr_{\mathcal{I}(\mathcal{X},\theta^*)}\Big[\mathcal{E}\Big(t_0,\frac{\rho(\mathcal{X},\theta^*)}{Vq}\Big)\Big] \ge \frac{1}{2},
$$
which completes the proof of Theorem 3. $\square$

B PROOFS FOR THE FIXED-BUDGET SETTING

B.1 Proof of Theorem 4

Proof of Theorem 4. Our proof of Theorem 4 adapts the error probability analysis in [23] to the multi-agent setting.

Since the number of samples used over all agents in each phase is $N = \lfloor TV/R\rfloor$, the total number of samples used by algorithm CoopKernelFB is at most $TV$ and the total number of samples used per agent is at most $T$.

Now we prove the error probability upper bound.

Recall that for any $\lambda\in\triangle_{\tilde{\mathcal{X}}}$, $\Phi_\lambda = [\sqrt{\lambda_1}\phi(\tilde{x}_1)^\top;\ldots;\sqrt{\lambda_{nV}}\phi(\tilde{x}_{nV})^\top]$ and $\Phi_\lambda^\top\Phi_\lambda = \sum_{\tilde{x}\in\tilde{\mathcal{X}}}\lambda_{\tilde{x}}\phi(\tilde{x})\phi(\tilde{x})^\top$. Let $\gamma_* = \xi_* N$.
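The defining identity of the reweighted feature matrix $\Phi_\lambda$ can be checked directly in a small finite-dimensional example (stand-in features and allocation, illustrative only):

```python
import math

# Phi_lambda stacks sqrt(lambda_i) * phi(x_i)^T as rows, so that
# Phi_lambda^T Phi_lambda = sum_i lambda_i phi(x_i) phi(x_i)^T.
phis = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]   # stand-in features phi(x_i)
lam = [0.25, 0.25, 0.5]                         # a sample allocation in the simplex

rows = [[math.sqrt(l) * p for p in phi] for l, phi in zip(lam, phis)]
lhs = [[sum(rows[i][a] * rows[i][b] for i in range(3)) for b in range(2)] for a in range(2)]
rhs = [[sum(l * phi[a] * phi[b] for l, phi in zip(lam, phis)) for b in range(2)] for a in range(2)]
assert all(abs(lhs[a][b] - rhs[a][b]) < 1e-12 for a in range(2) for b in range(2))
```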


For any ๐‘ฅ๐‘– , ๐‘ฅ ๐‘— โˆˆ B (๐‘ก )๐‘ฃ , ๐‘ฃ โˆˆ [๐‘‰ ], ๐‘ก โˆˆ [๐‘…], define

ฮ”๐‘ก,๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— = inf

ฮ”>0

โˆฅ๐œ™ (๐‘ฅ๐‘– ) โˆ’ ๐œ™ (๐‘ฅ ๐‘— )โˆฅ2(bโˆ—๐ผ+ฮฆโŠค

_โˆ—๐‘กฮฆ_โˆ—๐‘ก)โˆ’1

ฮ”2โ‰ค 2๐œŒโˆ—

and event

J๐‘ก,๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— ={๏ฟฝ๏ฟฝ๏ฟฝ( ห†๐‘“๐‘ก (๐‘ฅ๐‘– ) โˆ’ ห†๐‘“๐‘ก (๐‘ฅ ๐‘— )

)โˆ’

(๐‘“ (๐‘ฅ๐‘– ) โˆ’ ๐‘“ (๐‘ฅ ๐‘— )

) ๏ฟฝ๏ฟฝ๏ฟฝ < ฮ”๐‘ก,๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘—

}.

In the following, we prove Pr

[ยฌJ๐‘ก,๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘—

]โ‰ค 2 exp

(โˆ’ ๐‘

2(1+Y)๐œŒโˆ—).

Similar to the analysis procedure of Lemma 1, we have
$$
\big(\hat{f}_t(\tilde{x}_i)-\hat{f}_t(\tilde{x}_j)\big)-\big(f(\tilde{x}_i)-f(\tilde{x}_j)\big)
= \big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\bar{\eta}^{(t)} - \gamma_*\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*,
$$
where the mean of the first term is zero and its variance is bounded by
$$
\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|^2_{(\gamma_* I+\Phi_t^\top\Phi_t)^{-1}} \le \frac{(1+\varepsilon)\cdot\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|^2_{(\xi_* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{N}.
$$

Using the Hoeffding inequality, we have
$$
\Pr\Bigg[\Big|\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\eta_v^{(t)}\Big| \ge \frac{1}{2}\Delta_{t,\tilde{x}_i,\tilde{x}_j}\Bigg]
\le 2\exp\Bigg(-\frac{2\cdot\frac{1}{4}\Delta^2_{t,\tilde{x}_i,\tilde{x}_j}}{(1+\varepsilon)\cdot\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|^2_{(\xi_* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}/N}\Bigg)
$$
$$
\le 2\exp\Bigg(-\frac{1}{2}\cdot\frac{N}{(1+\varepsilon)\cdot\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|^2_{(\xi_* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}/\Delta^2_{t,\tilde{x}_i,\tilde{x}_j}}\Bigg)
\le 2\exp\Bigg(-\frac{N}{2(1+\varepsilon)\rho^*}\Bigg).
$$
Thus, with probability at least $1-2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$, we have
$$
\Big|\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\eta_v^{(t)}\Big| < \frac{1}{2}\Delta_{t,\tilde{x}_i,\tilde{x}_j}.
$$

Recall that $\xi_*$ satisfies $(1+\varepsilon)\sqrt{\xi_*}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{X}}_v,\, v\in[V]}\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\xi_* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}B \le \frac{1}{2}\Delta_{t,\tilde{x}_i,\tilde{x}_j}$. Then, we bound the regularization (bias) term:
$$
\Big|\gamma_*\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*\Big|
\le \gamma_*\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\gamma_* I+\Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_{(\gamma_* I+\Phi_t^\top\Phi_t)^{-1}}
\le \sqrt{\gamma_*}\cdot\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\gamma_* I+\Phi_t^\top\Phi_t)^{-1}}\|\theta^*\|_2
$$
$$
\le \sqrt{\xi_* N}\cdot\frac{(1+\varepsilon)\cdot\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\xi_* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}}{\sqrt{N}}\cdot\|\theta^*\|_2
\le (1+\varepsilon)\sqrt{\xi_*}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\, v\in[V]}\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\xi_* I+\Phi_{\lambda^*_t}^\top\Phi_{\lambda^*_t})^{-1}}\|\theta^*\|_2
$$
$$
\le (1+\varepsilon)\sqrt{\xi_*}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{B}}_v^{(t)},\, v\in[V]}\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\xi_* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}\|\theta^*\|_2
\le (1+\varepsilon)\sqrt{\xi_*}\max_{\tilde{x}_i,\tilde{x}_j\in\tilde{\mathcal{X}}_v,\, v\in[V]}\|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|_{(\xi_* I+\Phi_{\lambda_u}^\top\Phi_{\lambda_u})^{-1}}\cdot B
\le \frac{1}{2}\Delta_{t,\tilde{x}_i,\tilde{x}_j}.
$$

Thus, with probability at least $1-2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$, we have
$$
\Big|\big(\hat{f}_t(\tilde{x}_i)-\hat{f}_t(\tilde{x}_j)\big)-\big(f(\tilde{x}_i)-f(\tilde{x}_j)\big)\Big|
\le \Big|\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\Phi_t^\top\eta_v^{(t)}\Big| + \Big|\gamma_*\big(\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\big)^\top\big(\gamma_* I+\Phi_t^\top\Phi_t\big)^{-1}\theta^*\Big|
< \Delta_{t,\tilde{x}_i,\tilde{x}_j},
$$
which completes the proof of $\Pr\big[\neg\mathcal{J}_{t,\tilde{x}_i,\tilde{x}_j}\big] \le 2\exp\big(-\frac{N}{2(1+\varepsilon)\rho^*}\big)$.

Define event

J =

{๏ฟฝ๏ฟฝ๏ฟฝ( ห†๐‘“๐‘ก (๐‘ฅ๐‘– ) โˆ’ ห†๐‘“๐‘ก (๐‘ฅ ๐‘— ))โˆ’

(๐‘“ (๐‘ฅ๐‘– ) โˆ’ ๐‘“ (๐‘ฅ ๐‘— )

) ๏ฟฝ๏ฟฝ๏ฟฝ < ฮ”๐‘ก,๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— , โˆ€๐‘ฅ๐‘– , ๐‘ฅ ๐‘— โˆˆ B(๐‘ก )๐‘ฃ ,โˆ€๐‘ฃ โˆˆ [๐‘‰ ],โˆ€๐‘ก โˆˆ [๐‘…]

},

By a union bound over ๐‘ฅ๐‘– , ๐‘ฅ ๐‘— โˆˆ B (๐‘ก )๐‘ฃ , ๐‘ฃ โˆˆ [๐‘‰ ] and ๐‘ก โˆˆ [๐‘…], we have

Pr [ยฌJ] โ‰ค2๐‘›2๐‘‰ log(๐œ” ( หœX)) ยท exp

(โˆ’ ๐‘

2(1 + Y)๐œŒโˆ—

)=๐‘‚

(๐‘›2๐‘‰ log(๐œ” ( หœX)) ยท exp

(โˆ’ ๐‘‡๐‘‰

๐œŒโˆ— log(๐œ” ( หœX))

))In order to prove Theorem 4, it suffices to prove that conditioning on J , algorithm CoopKernelFB returns the correct

answers ๐‘ฅโˆ—๐‘ฃ for all ๐‘ฃ โˆˆ [๐‘‰ ].Suppose that there exist ๐‘ฃ โˆˆ [๐‘‰ ] and ๐‘ก โˆˆ [๐‘…] such that ๐‘ฅโˆ—๐‘ฃ is eliminated in phase ๐‘ก . Define

Bโ€ฒ (๐‘ก )๐‘ฃ = {๐‘ฅ โˆˆ B (๐‘ก )๐‘ฃ :ห†๐‘“๐‘ก (๐‘ฅ) > ห†๐‘“๐‘ก (๐‘ฅโˆ—๐‘ฃ )},

which denotes the subset of arms in $\mathcal{B}^{(t)}_v$ that are ranked above $x^*_v$ by the estimated rewards. According to the elimination rule, we have

$$
\omega(\mathcal{B}'^{(t)}_v \cup \{x^*_v\}) > \frac{1}{2}\,\omega(\mathcal{B}^{(t)}_v) = \frac{1}{2} \max_{\tilde{x}_i,\tilde{x}_j \in \mathcal{B}^{(t)}_v} \|\phi(\tilde{x}_i)-\phi(\tilde{x}_j)\|^2_{(\xi_* I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}}. \quad (26)
$$

Define ๐‘ฅ0 = argmax๏ฟฝฬƒ๏ฟฝ โˆˆBโ€ฒ (๐‘ก )๐‘ฃ

ฮ”๏ฟฝฬƒ๏ฟฝ . We have

1

2

โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ0)โˆฅ (bโˆ—๐ผ+ฮฆโŠค_โˆ—๐‘กฮฆ_โˆ—๐‘ก)โˆ’1

ฮ”2

๏ฟฝฬƒ๏ฟฝ0

38

Collaborative Pure Exploration in Kernel Bandit Woodstock โ€™21, June 03โ€“05, 2021, Woodstock, NY

โ‰ค 1

2

max

๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— โˆˆB (๐‘ก )๐‘ฃ

โˆฅ๐œ™ (๐‘ฅ๐‘– ) โˆ’ ๐œ™ (๐‘ฅ ๐‘— )โˆฅ (bโˆ—๐ผ+ฮฆโŠค_โˆ—๐‘กฮฆ_โˆ—๐‘ก)โˆ’1

ฮ”2

๏ฟฝฬƒ๏ฟฝ0

(f)

<๐œ” (Bโ€ฒ (๐‘ก )๐‘ฃ โˆช {๐‘ฅโˆ—๐‘ฃ })

ฮ”2

๏ฟฝฬƒ๏ฟฝ0

(g)

= min

_โˆˆโ–ณXmax

๏ฟฝฬƒ๏ฟฝ๐‘– ,๏ฟฝฬƒ๏ฟฝ ๐‘— โˆˆBโ€ฒ (๐‘ก )๐‘ฃ โˆช{๏ฟฝฬƒ๏ฟฝโˆ—๐‘ฃ }

โˆฅ๐œ™ (๐‘ฅ๐‘– ) โˆ’ ๐œ™ (๐‘ฅ ๐‘— )โˆฅ (bโˆ—๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1

ฮ”2

๏ฟฝฬƒ๏ฟฝ0

(h)

โ‰ค min

_โˆˆโ–ณXmax

๏ฟฝฬƒ๏ฟฝ โˆˆBโ€ฒ (๐‘ก )๐‘ฃ

โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ)โˆฅ (bโˆ—๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1

ฮ”2

๏ฟฝฬƒ๏ฟฝ

โ‰ค min

_โˆˆโ–ณXmax

๏ฟฝฬƒ๏ฟฝ โˆˆX๐‘ฃ\{๏ฟฝฬƒ๏ฟฝโˆ—๐‘ฃ },๐‘ฃโˆˆ[๐‘‰ ]

โˆฅ๐œ™ (๐‘ฅโˆ—๐‘ฃ ) โˆ’ ๐œ™ (๐‘ฅ)โˆฅ (bโˆ—๐ผ+ฮฆโŠค_ฮฆ_

)โˆ’1

ฮ”2

๏ฟฝฬƒ๏ฟฝ

=๐œŒโˆ—,

where (f) can be obtained by dividing Eq. (26) by ฮ”2

๏ฟฝฬƒ๏ฟฝ0

, (g) comes from the definition of ๐œ” (ยท), and (h) is due to the

definition of ๐‘ฅ0.

According to the definition

$$
\Delta_{t,\tilde{x}^*_v,\tilde{x}_0} = \inf\Big\{\Delta > 0 : \frac{\|\phi(x^*_v)-\phi(x_0)\|^2_{(\xi_* I + \Phi_{\lambda_t^*}^\top \Phi_{\lambda_t^*})^{-1}}}{\Delta^2} \le 2\rho^*\Big\},
$$

we have $\Delta_{t,\tilde{x}^*_v,\tilde{x}_0} \le \Delta_{\tilde{x}_0}$.

Conditioning on $\mathcal{J}$, we have

$$
\big|(\hat{f}_t(x^*_v)-\hat{f}_t(x_0)) - (f(x^*_v)-f(x_0))\big| < \Delta_{t,\tilde{x}^*_v,\tilde{x}_0} \le \Delta_{\tilde{x}_0}.
$$

Then, we have

$$
\hat{f}_t(x^*_v) - \hat{f}_t(x_0) > (f(x^*_v)-f(x_0)) - \Delta_{\tilde{x}_0} = 0,
$$

which contradicts the definition of $x_0$, which satisfies $\hat{f}_t(x_0) > \hat{f}_t(x^*_v)$. Thus, for any $v\in[V]$, $x^*_v$ is never eliminated.

Since ๐œ” ({๐‘ฅโˆ—๐‘ฃ , ๐‘ฅ}) โ‰ฅ 1 for any ๐‘ฅ โˆˆ X๐‘ฃ \ {๐‘ฅโˆ—๐‘ฃ }, ๐‘ฃ โˆˆ [๐‘‰ ] and ๐‘… =โŒˆlog

2(๐œ” ( หœX))

โŒ‰โ‰ฅ

โŒˆlog

2(๐œ” (X๐‘ฃ))

โŒ‰for any ๐‘ฃ โˆˆ [๐‘‰ ], we

have that conditioning on J , algorithm CoopKernelFB returns the correct answers ๐‘ฅโˆ—๐‘ฃ for all ๐‘ฃ โˆˆ [๐‘‰ ].For communication rounds, since algorithm CoopKernelFB has at most ๐‘… =

โŒˆlog

2(๐œ” ( หœX))

โŒ‰phases, the number of

communication rounds is bounded by ๐‘‚ (log(๐œ” ( หœX))). โ–ก

B.2 Proof of Corollary 2

Proof of Corollary 2. Following the analysis procedure of Corollary 1, we have

$$
\rho^* = \min_{\lambda \in \triangle_{\mathcal{X}}} \max_{\tilde{x} \in \tilde{\mathcal{X}}_v,\, v\in[V]} \frac{\|\phi(x^*_v)-\phi(\tilde{x})\|^2_{(\xi_* I + \Phi_\lambda^\top \Phi_\lambda)^{-1}}}{(f(x^*_v)-f(\tilde{x}))^2}
\le \frac{8}{\Delta^2_{\min}} \cdot \log\det\big(I + \xi_*^{-1} K_{\lambda^*}\big).
$$

Maximum Information Gain. Recall that the maximum information gain over all sample allocations $\lambda \in \triangle_{\tilde{\mathcal{X}}}$ is defined as

$$
\Upsilon = \max_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big(I + \xi_*^{-1} K_\lambda\big).
$$
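As a numerical sanity check of this definition (with made-up features and a made-up allocation, not data from the paper), the quantity $\log\det(I + \xi_*^{-1} K_\lambda)$ can be evaluated directly, taking $K_\lambda$ to be the $\lambda$-weighted Gram matrix $\operatorname{diag}(\sqrt{\lambda}) K \operatorname{diag}(\sqrt{\lambda})$:

```python
import numpy as np

# Illustrative check (made-up features and allocation): evaluate
# log det(I + xi^{-1} K_lambda) for one fixed design lambda.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 3))      # 6 hypothetical arms, 3-dim features
lam = np.full(6, 1.0 / 6.0)        # uniform allocation on the simplex
xi = 1.0                           # regularization parameter xi_*

K = Phi @ Phi.T
K_lam = np.sqrt(lam)[:, None] * K * np.sqrt(lam)[None, :]
info_gain = np.linalg.slogdet(np.eye(6) + K_lam / xi)[1]

# Sylvester's identity: the kernel-space and feature-space log-dets agree,
# log det(I_n + xi^{-1} Phi_lam Phi_lam^T) = log det(I_d + xi^{-1} Phi_lam^T Phi_lam).
Phi_lam = np.sqrt(lam)[:, None] * Phi
primal = np.linalg.slogdet(np.eye(3) + Phi_lam.T @ Phi_lam / xi)[1]
assert info_gain >= 0.0 and abs(info_gain - primal) < 1e-8
```

The agreement between the two log-determinants is what lets the kernelized estimators work without forming the (possibly infinite-dimensional) feature matrix explicitly.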

Then, the error probability is bounded by

$$
O\Big(n^2 V \log(\omega(\tilde{\mathcal{X}})) \cdot \exp\Big(-\frac{TV}{\rho^* \log(\omega(\tilde{\mathcal{X}}))}\Big)\Big)
= O\Big(n^2 V \log(\omega(\tilde{\mathcal{X}})) \cdot \exp\Big(-\frac{TV\Delta^2_{\min}}{\Upsilon \log(\omega(\tilde{\mathcal{X}}))}\Big)\Big).
$$

Effective Dimension. Recall that $\alpha_1 \ge \cdots \ge \alpha_{nV}$ denote the eigenvalues of $K_{\lambda^*}$ in decreasing order. The effective dimension of $K_{\lambda^*}$ is defined as

$$
d_{\mathrm{eff}} = \min\Big\{j : j\,\xi_* \log(nV) \ge \sum_{i=j+1}^{nV} \alpha_i\Big\},
$$

and we have

$$
\log\det\big(I + \xi_*^{-1} K_{\lambda^*}\big) \le d_{\mathrm{eff}} \log\Big(2nV \cdot \Big(1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d_{\mathrm{eff}}}\Big)\Big).
$$
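The effective-dimension definition and the accompanying log-det bound can be checked numerically on a toy spectrum (the eigenvalues and $\xi_*$ below are made-up illustrative values, not from the paper):

```python
import numpy as np

# Illustrative check of the effective-dimension definition and log-det bound.
alphas = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.125])  # alpha_1 >= ... >= alpha_{nV}
nV, xi = len(alphas), 1.0

def effective_dimension(alphas, xi, nV):
    """Smallest j such that j * xi * log(nV) >= sum_{i > j} alpha_i."""
    for j in range(1, nV + 1):
        if j * xi * np.log(nV) >= alphas[j:].sum():
            return j
    return nV

d_eff = effective_dimension(alphas, xi, nV)

# For K_{lambda*} = diag(alphas), verify
# log det(I + xi^{-1} K) <= d_eff * log(2 nV (1 + Trace(K) / (xi * d_eff))).
logdet = np.sum(np.log(1 + alphas / xi))
bound = d_eff * np.log(2 * nV * (1 + alphas.sum() / (xi * d_eff)))
assert logdet <= bound
```

With this fast-decaying spectrum, only the first two eigenvalues dominate the tail, so `d_eff` comes out as 2 even though the ambient dimension is 6.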

Then, the error probability is bounded by

$$
O\Big(n^2 V \log(\omega(\tilde{\mathcal{X}})) \cdot \exp\Big(-\frac{TV}{\rho^* \log(\omega(\tilde{\mathcal{X}}))}\Big)\Big)
= O\Bigg(n^2 V \log(\omega(\tilde{\mathcal{X}})) \cdot \exp\Bigg(-\frac{TV\Delta^2_{\min}}{d_{\mathrm{eff}} \log\big(nV \cdot \big(1 + \frac{\mathrm{Trace}(K_{\lambda^*})}{\xi_* d_{\mathrm{eff}}}\big)\big) \log(\omega(\tilde{\mathcal{X}}))}\Bigg)\Bigg).
$$

Decomposition. Recall that ๐พ = [๐พ (๐‘ฅ๐‘– , ๐‘ฅ ๐‘— )]๐‘–, ๐‘— โˆˆ[๐‘›๐‘‰ ] , ๐พ๐‘ง = [๐พ๐‘ง (๐‘ง๐‘ฃ, ๐‘ง๐‘ฃโ€ฒ)]๐‘ฃ,๐‘ฃโ€ฒโˆˆ[๐‘‰ ] , ๐พ๐‘ฅ = [๐พ๐‘ฅ (๐‘ฅ๐‘– , ๐‘ฅ ๐‘— )]๐‘–, ๐‘— โˆˆ[๐‘›๐‘‰ ] , andrank(๐พ_โˆ— ) = rank(๐พ) โ‰ค rank(๐พ๐‘ง) ยท rank(๐พ๐‘ฅ ). We have

log det

(๐ผ + bโˆ’1

โˆ— ๐พ_โˆ—)โ‰ค rank(๐พ๐‘ง) ยท rank(๐พ๐‘ฅ ) log

(Trace

(๐ผ + bโˆ’1

โˆ— ๐พ_โˆ—)

rank(๐พ_โˆ— )

)Then, the error probability is bounded by

๐‘‚

(๐‘›2๐‘‰ log(๐œ” ( หœX)) ยท exp

(โˆ’ ๐‘‡๐‘‰

๐œŒโˆ— log(๐œ” ( หœX))

))

=๐‘‚ยฉยญยญยซ๐‘›2๐‘‰ log(๐œ” ( หœX)) ยท exp

ยฉยญยญยซโˆ’๐‘‡๐‘‰ฮ”2

min

rank(๐พ๐‘ง) ยท rank(๐พ๐‘ฅ ) log

(Trace(๐ผ+bโˆ’1

โˆ— ๐พ_โˆ— )rank(๐พ_โˆ— )

)log(๐œ” ( หœX))

ยชยฎยฎยฌยชยฎยฎยฌ

Therefore, we complete the proof of Corollary 2. โ–ก
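The decomposition step rests on an AM-GM-type bound: for a PSD perturbation $M = I + \xi_*^{-1}K$ with $r = \mathrm{rank}(K)$, at most $r$ eigenvalues of $M$ exceed 1, so $\log\det(M) \le r \log(\mathrm{Trace}(M)/r)$. A quick numerical check on a made-up low-rank PSD matrix (illustrative only):

```python
import numpy as np

# Illustrative check: log det(M) <= r * log(Trace(M) / r)
# for M = I + xi^{-1} K with r = rank(K).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))
K = A @ A.T                        # PSD with rank 2
xi = 0.5
M = np.eye(5) + K / xi

r = np.linalg.matrix_rank(K)
logdet = np.linalg.slogdet(M)[1]
bound = r * np.log(np.trace(M) / r)
assert logdet <= bound + 1e-9
```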

B.3 Proof of Theorem 5

Our proof of Theorem 5 follows the analysis procedure in [1].


We first introduce some definitions from information theory. For any random variable $A$, let $H(A)$ denote the Shannon entropy of $A$. If $A$ is uniformly distributed on its support, $H(A) = \log|A|$. For any $p\in(0,1)$, let $H_2(p) = -p\log_2 p - (1-p)\log_2(1-p)$ denote the binary entropy, so that $H_2(p) = H(A)$ for $A \sim \mathrm{Bernoulli}(p)$. For any random variables $A, B$, let $H(A;B) = H(A) - H(A|B) = H(B) - H(B|A)$ denote the mutual information of $A$ and $B$.

Consider the following fully-collaborative instance $\mathcal{D}^{\Delta,p}_d$: for a uniformly drawn index $i^* \in [d]$, $\theta^*_{i^*} = p + \Delta$ and $\theta^*_j = p$ for all $j \ne i^*$. The arm set is $\mathcal{X} = \{x \in \{0,1\}^d : \sum_{i=1}^d x(i) = 1\}$, and the feature mapping is $\phi(x) = x$ for all $x\in\mathcal{X}$. Under instance $\mathcal{D}^{\Delta,p}_d$, we have $\omega(\tilde{\mathcal{X}}) = \omega(\mathcal{X}) = d$.

Let $\delta^* > 0$ be a small constant that we specify later. There exists a single-agent algorithm $\mathcal{A}_S$ (e.g., [23]) that uses at most $T$ samples and guarantees $O\big(\log d \cdot \exp\big(-\frac{\Delta^2 T}{d \log d}\big)\big)$ error probability on instance $\mathcal{D}^{\Delta,p}_d$ for any $d > 1$. Restricting the error probability to the constant $\delta^*$, we have that for any $d > 1$, $\mathcal{A}_S$ uses at most $T = O\big(\frac{d}{\Delta^2} \log d \cdot \log\log d\big)$ samples to guarantee $\delta^*$ error probability on instance $\mathcal{D}^{\Delta,p}_d$.

Let ๐›ผ = ๐‘‰๐›ฝand 1 โ‰ค ๐›ผ โ‰ค log๐‘‘ . According to the definition of speedup, a

๐‘‰๐›ผ -speedup distributed algorithm A must

satisfy that for any ๐‘‘ > 1, A uses at most ๐‘‡ = ๐‘‚ (๐‘›

ฮ”2

๐‘‰๐›ผ

ยท ๐‘‰ log๐‘‘ ยท log log๐‘‘) = ๐‘‚ ( ๐›ผ๐‘‘ฮ”2

log๐‘‘ ยท log log๐‘‘) samples over all

agents and guarantees ๐›ฟโˆ— error probability on instance Dฮ”,๐‘๐‘‘

.

Main Proof. Now, in order to prove Theorem 5, we first prove the following Lemma 6, which relaxes the sample budget within logarithmic factors.

Lemma 6. For any $d > 1$, any distributed algorithm $\mathcal{A}$ that uses at most $O\big(\frac{\alpha d (\log\alpha + \log\log d)^2}{\Delta^2 (\log d)^2}\big)$ samples over all agents and guarantees $\delta^*$ error probability on instance $\mathcal{D}^{\Delta,p}_d$ needs $\Omega\big(\frac{\log d}{\log\alpha + \log\log d}\big)$ communication rounds.

Below we prove Lemma 6 by induction (Lemmas 7 and 8).

Lemma 7 (Base Step). For any $d > 1$ and $1 \le \alpha \le \log d$, there is no 1-round algorithm $\mathcal{A}_1$ that uses $O\big(\frac{\alpha d}{\Delta^2}\big)$ samples and guarantees $\delta_1$ error probability for some constant $\delta_1 \in (0,1)$ on instance $\mathcal{D}^{\Delta,p}_d$.

Proof of Lemma 7. Let $I$ denote the random variable of the index $i^*$. Since $I$ is uniformly distributed on $[d]$, $H(I) = \log d$. Let $S$ denote the sample profile of $\mathcal{A}_1$ on instance $\mathcal{D}^{\Delta,p}_d$. According to the definitions of $\mathcal{X}$ and $\theta^*$, for instance $\mathcal{D}^{\Delta,p}_d$, identifying the best arm is equivalent to identifying $I$. Suppose that $\mathcal{A}_1$ returns the best arm with error probability $\delta_1$. Using Fano's inequality (Lemma 13), we have

$$
H(I|S) \le H_2(\delta_1) + \delta_1 \log d. \quad (27)
$$

Using Lemma 14, we have

$$
H(I|S) = H(I) - H(I;S) = \log d - O\Big(\frac{\alpha d}{\Delta^2} \cdot \frac{\Delta^2}{d}\Big) = \log d - O(\alpha).
$$

Then, for a small enough constant $\delta_1 \in (0,1)$, Eq. (27) cannot hold. Thus, for any $d > 1$ and $1 \le \alpha \le \log d$, there is no 1-round algorithm $\mathcal{A}_1$ that uses $O\big(\frac{\alpha d}{\Delta^2}\big)$ samples and guarantees $\delta_1$ error probability on instance $\mathcal{D}^{\Delta,p}_d$. □
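The contradiction is purely arithmetic, and can be illustrated with hypothetical constants (the values of $c$, $\delta_1$, $d$, and $\alpha$ below are made up for illustration, not taken from the paper): Lemma 14 pins $H(I|S)$ down to $\log d - c\alpha$, while Fano caps it at $H_2(\delta_1) + \delta_1 \log d$, and for small $\delta_1$ and large $d$ the two are incompatible.

```python
import math

# Illustrative arithmetic behind the contradiction (hypothetical constants).
def h2(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)  # entropy in nats

d, alpha, c, delta1 = 10**6, 4.0, 1.0, 0.05
lower = math.log(d) - c * alpha               # H(I|S) forced by Lemma 14
fano = h2(delta1) + delta1 * math.log(d)      # right-hand side of Eq. (27)
assert lower > fano                           # so Eq. (27) cannot hold
```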

Lemma 8 (Induction Step). Suppose that $1 \le \alpha \le \log d$ and $\delta \in (0,1)$. If for any $d > 1$ there is no $(r-1)$-round algorithm $\mathcal{A}_{r-1}$ that uses $O\big(\frac{\alpha d}{\Delta^2 (r-1)^2}\big)$ samples and guarantees $\delta$ error probability on instance $\mathcal{D}^{\Delta,p}_d$, then for any $d > 1$ there is no $r$-round algorithm $\mathcal{A}_r$ that uses $O\big(\frac{\alpha d}{\Delta^2 r^2}\big)$ samples and guarantees $\delta - O\big(\frac{1}{r^2}\big)$ error probability on instance $\mathcal{D}^{\Delta,p}_d$.

Proof of Lemma 8. We prove this lemma by contradiction. Suppose that for some $d > 1$ there exists an $r$-round algorithm $\mathcal{A}_r$ that uses $O\big(\frac{\alpha d}{\Delta^2 r^2}\big)$ samples and guarantees $\delta$ error probability on instance $\mathcal{D}^{\Delta,p}_d$. In the following, we show how to construct an $(r-1)$-round algorithm $\mathcal{A}_{r-1}$ that uses $O\big(\frac{\alpha \tilde{d}}{\Delta^2 (r-1)^2}\big)$ samples and guarantees at most $\delta + O\big(\frac{1}{r^2}\big)$ error probability on instance $\mathcal{D}^{\Delta,p}_{\tilde{d}}$ for some $\tilde{d}$.

Let $S_1$ denote the sample profile of $\mathcal{A}_r$ in the first round on instance $\mathcal{D}^{\Delta,p}_d$. Since the number of samples of $\mathcal{A}_r$ is bounded by $O\big(\frac{\alpha d}{\Delta^2 r^2}\big)$, using Lemma 14, we have

$$
H(I|S_1) = H(I) - H(I;S_1) = \log d - O\Big(\frac{\alpha d}{\Delta^2 r^2} \cdot \frac{\Delta^2}{d}\Big) = \log d - O\Big(\frac{\alpha}{r^2}\Big).
$$

Let Dฮ”,๐‘๐‘‘ |๐‘†1

denote the posterior of Dฮ”,๐‘๐‘‘

after observing the sample profile ๐‘†1.

Using Lemma 15 on random variable ๐ผ |๐‘†1 with parameters ๐›พ = ๐‘‚ ( ๐›ผ๐‘Ÿ 2) and ๐œ– = ๐‘œ ( 1

๐‘Ÿ 2), we can write Dฮ”,๐‘

๐‘‘ |๐‘†1

as a convex

combination of distributions Q0,Q1, . . . ,Qโ„“ , i.e., Dฮ”,๐‘๐‘‘ |๐‘†1

=โˆ‘โ„“๐‘—=0

๐‘ž ๐‘—Q ๐‘— such that ๐‘ž0 = ๐‘œ ( 1

๐‘Ÿ 2), and for any ๐‘— โ‰ฅ 1,

|supp(Q ๐‘— ) | โ‰ฅ exp

(log๐‘‘ โˆ’ ๐›พ

๐œ–

)= exp (log๐‘‘ โˆ’ ๐‘œ (๐›ผ))

=๐‘‘

๐‘’๐‘œ (๐›ผ),

and

โˆฅQ ๐‘— โˆ’U๐‘— โˆฅTV = ๐‘œ ( 1

๐‘Ÿ2),

whereU๐‘— is the uniform distribution on supp(Q ๐‘— ).Since

Pr[A๐‘Ÿ has an error|Dฮ”,๐‘๐‘‘ |๐‘†1

] =โ„“โˆ‘๏ธ๐‘—=0

๐‘ž ๐‘— Pr[A๐‘Ÿ has an error|Q ๐‘— ] โ‰ค ๐›ฟ,

using an average argument, there exists a distribution Q ๐‘— for some ๐‘— โ‰ฅ 1 such that

Pr[A๐‘Ÿ has an error|Q ๐‘— ] โ‰ค ๐›ฟ.

Letหœ๐‘‘ = |supp(Q ๐‘— ) | โ‰ฅ ๐‘‘

๐‘’๐‘œ (๐›ผ ). Since โˆฅQ ๐‘— โˆ’U๐‘— โˆฅTV = ๐‘œ ( 1

๐‘Ÿ 2) andU๐‘— is equivalent to Dฮ”,๐‘

หœ๐‘‘, we have

Pr[A๐‘Ÿ has an error|Dฮ”,๐‘หœ๐‘‘] โ‰ค ๐›ฟ + ๐‘œ ( 1

๐‘Ÿ2) .

Under instance $\mathcal{D}^{\Delta,p}_{\tilde{d}}$, the sample budget for $(r-1)$-round algorithms is $O\big(\frac{\alpha \tilde{d}}{\Delta^2 (r-1)^2}\big)$, and it holds that

$$
O\Big(\frac{\alpha \tilde{d}}{\Delta^2 (r-1)^2}\Big) \ge O\Big(\frac{\alpha d}{e^{o(\alpha)} \Delta^2 (r-1)^2}\Big) \ge O\Big(\frac{\alpha d^{1-o(1)}}{\Delta^2 (r-1)^2}\Big) \ge O\Big(\frac{\alpha d}{\Delta^2 r^2}\Big).
$$

Then, we can construct an $(r-1)$-round algorithm $\mathcal{A}_{r-1}$ by running algorithm $\mathcal{A}_r$ from its second round. The constructed $\mathcal{A}_{r-1}$ uses at most $O\big(\frac{\alpha d}{\Delta^2 r^2}\big) \le O\big(\frac{\alpha \tilde{d}}{\Delta^2 (r-1)^2}\big)$ samples and guarantees $\delta + o\big(\frac{1}{r^2}\big)$ error probability.

The specific procedure of $\mathcal{A}_{r-1}$ is as follows: if $\mathcal{A}_r$ samples some dimension $i \in \mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{r-1}$ also samples $i$; otherwise, if $\mathcal{A}_r$ samples some dimension $i \in [d] \setminus \mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{r-1}$ samples $\mathrm{Bernoulli}(p)$ and feeds the outcome to $\mathcal{A}_r$. Finally, if $\mathcal{A}_r$ returns some dimension $i \in \mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{r-1}$ also returns $i$; otherwise, if $\mathcal{A}_r$ returns some dimension $i \in [d] \setminus \mathrm{supp}(\mathcal{Q}_j)$, then $\mathcal{A}_{r-1}$ returns an arbitrary dimension in $[d] \setminus \mathrm{supp}(\mathcal{Q}_j)$ (the error case). □

Proof of Lemma 6. Let $r_* = \frac{\log d}{\log\alpha + \log\log d}$. Combining Lemmas 7 and 8, we obtain that there is no $r_*$-round algorithm $\mathcal{A}$ that uses $O\big(\frac{\alpha d}{\Delta^2 r_*^2}\big)$ samples and guarantees $\delta_1 - \sum_{r=2}^{r_*} o\big(\frac{1}{r^2}\big)$ error probability on instance $\mathcal{D}^{\Delta,p}_d$ for any $d > 1$.

Let $\delta^* < \delta_1 - \sum_{r=2}^{r_*} o\big(\frac{1}{r^2}\big)$. Thus, for any $d > 1$ and $1 \le \alpha \le \log d$, any distributed algorithm $\mathcal{A}$ that uses $O\big(\frac{\alpha d}{\Delta^2 r_*^2}\big)$ samples and guarantees $\delta^*$ error probability on instance $\mathcal{D}^{\Delta,p}_d$ must cost $r_* = \Omega\Big(\frac{\log d}{\log(\frac{V}{\beta}) + \log\log d}\Big)$ communication rounds. □

Therefore, for any๐‘‘ > 1 and๐‘‰

log๐‘‘โ‰ค ๐›ฝ โ‰ค ๐‘‰ , a ๐›ฝ-speedup distributed algorithmA needs at leastฮฉ

(log๐œ” ( หœX)

log( ๐‘‰๐›ฝ)+log log๐œ” ( หœX)

)communication rounds under instance Dฮ”,๐‘

๐‘‘, which completes the proof of Theorem 5.

C TECHNICAL TOOLS

Lemma 9 (Lemma 15 in [10]). For $\lambda^* = \operatorname{argmax}_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big(I + \xi_*^{-1} \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}'} \phi(\tilde{x}')\phi(\tilde{x}')^\top\big)$, we have

$$
\max_{\tilde{x} \in \tilde{\mathcal{X}}} \|\phi(\tilde{x})\|^2_{(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}'} \phi(\tilde{x}')\phi(\tilde{x}')^\top)^{-1}}
= \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \|\phi(\tilde{x})\|^2_{(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}'} \phi(\tilde{x}')\phi(\tilde{x}')^\top)^{-1}}.
$$

Lemma 10. For $\lambda^* = \operatorname{argmax}_{\lambda \in \triangle_{\tilde{\mathcal{X}}}} \log\det\big(I + \xi_*^{-1} \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda_{\tilde{x}'} \phi(\tilde{x}')\phi(\tilde{x}')^\top\big)$, we have

$$
\sum_{\tilde{x} \in \tilde{\mathcal{X}}} \log\Big(1 + \lambda^*_{\tilde{x}} \|\phi(\tilde{x})\|^2_{(\xi_* I + \sum_{\tilde{x}' \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}'} \phi(\tilde{x}')\phi(\tilde{x}')^\top)^{-1}}\Big)
\le \log \frac{\det\big(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top\big)}{\det(\xi_* I)}.
$$

Proof of Lemma 10. For any $j \in [nV]$, let $M_j = \xi_* I + \sum_{i \in [j]} \lambda^*_i \phi(x_i)\phi(x_i)^\top$. We have

$$
\det\Big(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top\Big)
= \det\Big(\xi_* I + \sum_{i \in [nV-1]} \lambda^*_i \phi(x_i)\phi(x_i)^\top + \lambda^*_{nV} \phi(x_{nV})\phi(x_{nV})^\top\Big)
$$
$$
= \det(M_{nV-1}) \det\Big(I + \lambda^*_{nV} \cdot M_{nV-1}^{-\frac{1}{2}} \phi(x_{nV}) \big(M_{nV-1}^{-\frac{1}{2}} \phi(x_{nV})\big)^\top\Big)
= \det(M_{nV-1}) \Big(1 + \lambda^*_{nV} \cdot \phi(x_{nV})^\top M_{nV-1}^{-1} \phi(x_{nV})\Big)
$$
$$
= \det(M_{nV-1}) \Big(1 + \lambda^*_{nV} \|\phi(x_{nV})\|^2_{M_{nV-1}^{-1}}\Big)
= \det(\xi_* I) \prod_{i=1}^{nV} \Big(1 + \lambda^*_i \|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big).
$$

Thus,

$$
\frac{\det\big(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top\big)}{\det(\xi_* I)} = \prod_{i=1}^{nV} \Big(1 + \lambda^*_i \|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big).
$$

Taking the logarithm on both sides, we have

$$
\log \frac{\det\big(\xi_* I + \sum_{\tilde{x} \in \tilde{\mathcal{X}}} \lambda^*_{\tilde{x}} \phi(\tilde{x})\phi(\tilde{x})^\top\big)}{\det(\xi_* I)}
= \sum_{i=1}^{nV} \log\Big(1 + \lambda^*_i \|\phi(x_i)\|^2_{M_{i-1}^{-1}}\Big)
\ge \sum_{i=1}^{nV} \log\Big(1 + \lambda^*_i \|\phi(x_i)\|^2_{M_{nV}^{-1}}\Big),
$$

where the last inequality follows since $M_{i-1} \preceq M_{nV}$ implies $M_{nV}^{-1} \preceq M_{i-1}^{-1}$, which completes the proof of Lemma 10. □
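The key step above is the rank-one determinant update (matrix determinant lemma), which can be verified numerically on a made-up positive definite matrix (illustrative only; `c` plays the role of $\lambda^*_i$):

```python
import numpy as np

# Numerical check of the rank-one determinant update:
# det(M + c * phi phi^T) = det(M) * (1 + c * phi^T M^{-1} phi).
rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
M = B @ B.T + np.eye(4)                  # positive definite
phi = rng.normal(size=4)
c = 0.7

lhs = np.linalg.det(M + c * np.outer(phi, phi))
rhs = np.linalg.det(M) * (1 + c * phi @ np.linalg.inv(M) @ phi)
assert np.isclose(lhs, rhs)
```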

Lemma 11 (Pinsker's Inequality). If $P$ and $Q$ are two probability distributions on a measurable space $(X, \Sigma)$, then for any measurable event $A \in \Sigma$, it holds that

$$
|P(A) - Q(A)| \le \sqrt{\frac{1}{2}\mathrm{KL}(P \| Q)}.
$$

Lemma 12 (Lemma 29 in [38]). For any $\gamma_1, \ldots, \gamma_K \in [0,1]$ and $x \ge 0$, it holds that

$$
\prod_{i=1}^{K} \max\{1 - \gamma_i - \gamma_i x,\, 0\} \ge \prod_{i=1}^{K} (1 - \gamma_i) - x.
$$

Lemma 13 (Fano's Inequality). Let $A, B$ be random variables and $f$ be a function that, given $A$, predicts a value for $B$. If $\Pr(f(A) \ne B) \le \delta$, then $H(B|A) \le H_2(\delta) + \delta \cdot \log|B|$.

Lemma 14. For the instance $\mathcal{D}^{\Delta,p}_d$ with sample profile $S$, we have $H(I;S) = O\big(|S| \cdot \frac{\Delta^2}{d}\big)$.

The proof of Lemma 14 follows the same analysis procedure as Lemma 7 in [1].

Lemma 15 (Lemma 8 in [1]). Let $A \sim \mathcal{D}$ be a random variable on $[d]$ with $H(A) \ge \log d - \gamma$ for some $\gamma \ge 1$. For any $\epsilon > \exp(-\gamma)$, there exist $\ell+1$ distributions $\psi_0, \psi_1, \ldots, \psi_\ell$ on $[d]$ along with $\ell+1$ probabilities $p_0, p_1, \ldots, p_\ell$ ($\sum_{i=0}^{\ell} p_i = 1$) for some $\ell = O(\gamma/\epsilon^3)$ such that $\mathcal{D} = \sum_{i=0}^{\ell} p_i \psi_i$, $p_0 = O(\epsilon)$, and for any $i \ge 1$:

1. $\log|\mathrm{supp}(\psi_i)| \ge \log d - \gamma/\epsilon$.
2. $\|\psi_i - \mathcal{U}_i\|_{\mathrm{TV}} = O(\epsilon)$, where $\mathcal{U}_i$ denotes the uniform distribution on $\mathrm{supp}(\psi_i)$.
