arXiv:1912.03283v2 [quant-ph] 5 Apr 2020

A quantum active learning algorithm for sampling against adversarial attacks

P. A. M. Casares1, ∗ and M. A. Martin-Delgado1, †

1 Departamento de Física Teórica, Universidad Complutense de Madrid. (Dated: April 7, 2020)

Adversarial attacks represent a serious menace for learning algorithms and may compromise the security of future autonomous systems. A theorem by Khoury and Hadfield-Menell (KH) provides sufficient conditions to guarantee the robustness of active learning algorithms, but comes with a caveat: it is crucial to know the smallest distance among the classes of the corresponding classification problem. We propose a theoretical framework that allows us to think of active learning as sampling the most promising new points to be classified, so that the minimum distance between classes can be found and the KH theorem applied. Additionally, we introduce a quantum active learning algorithm that makes use of this framework and whose complexity is polylogarithmic in the dimension of the space, m, and the size of the initial training data, n, provided qRAMs are available, and polynomial in the precision, achieving an exponential speedup in n and m over the equivalent classical algorithm.

Keywords: Quantum algorithm; active learning; pool-based sampling; expected error reduction sampling; quantum machine learning; adversarial examples; Support Vector Machine; quantum theory.

I. INTRODUCTION

Supervised learning is one of the subareas of machine learning [1–3] that consists of techniques to learn to classify new data taking a training set as example. More specifically, the computer is given a training set X, consisting of n pairs of point and label, (x, y). With this information, the computer is supposed to extract or infer the conditional probability distribution p(y|x) and use it to classify new points x. This paradigm is in contrast with unsupervised learning, which, as in the case of clustering, attempts to find structure in a set of points without labels; and reinforcement learning [4], where an agent has to figure out the best policy or action for each situation it may face.

An important kind of supervised learning is what is usually called active learning [5]. To introduce this concept, suppose that we have a supervised learning algorithm with its corresponding training set. However, instead of directly trying to predict the label of new points, we give the classifier the option to pose us interesting questions in order to reduce the uncertainty in p(y|x). In this setting, the algorithm will add to its training set new points in areas where it has a lot of uncertainty.

To explain the concept better, let us give an example. Suppose we have an image classifier used for a self-driving car, which has to distinguish between cars, pedestrians and other objects. Internally, images are decomposed into pixels that can be characterised by their combination of red, blue and green. Thus, an image

∗ [email protected]
† [email protected]

can be expressed as an array of 3-dimensional vectors, or as a gigantic vector if we flatten the array. Any image is then a vector which can be identified with a point of a high-dimensional space containing all images. The set of images of the different classes forms a single manifold M embedded in that high-dimensional space. In particular, the dimension of the space is 3n_p, for n_p the number of pixels. The key idea of active learning is an algorithm that is able to understand on what kind of images it has the most uncertainty, and request additional examples to be labeled and added to its training set.
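As a concrete illustration of this selection step, here is a minimal classical sketch, not taken from the paper: each image is a flattened pixel vector, and we pick the pool point whose neighbourhood labels disagree the most. The k-nearest-neighbour vote and all names are our own illustrative choices.

```python
import numpy as np

def most_uncertain(pool, labeled_x, labeled_y, k=3):
    """Return the pool point whose k nearest labeled neighbours
    disagree the most, i.e. where p(y|x) is closest to 0.5."""
    scores = []
    for x in pool:
        d = np.linalg.norm(labeled_x - x, axis=1)   # distances to labeled set
        votes = labeled_y[np.argsort(d)[:k]]        # labels (0/1) of k nearest
        scores.append(abs(votes.mean() - 0.5))      # 0 = maximally uncertain
    return pool[int(np.argmin(scores))]
```

An active learner would send the returned point to a human oracle for labeling and add the pair to its training set.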

Another important concept in the context of supervised learning is that of adversarial attacks or adversarial examples, the name given to a phenomenon whereby a trained and accurate classifier (usually a neural network) can be misled into a wrong classification by a carefully chosen and slightly modified version of a point that is classified well. Adversarial examples were discovered quite recently [6, 7], and have received a lot of attention, leading to models robust to particular attacks [8–11].

After the discovery of such adversarial examples, considerable effort has been put into explaining why they happen and how they can be avoided. One of the theoretical reasons given links their existence to a high codimension: the difference between the dimension of the classes we are trying to separate and that of the high-dimensional space in which they are embedded [12]. Intuitively, this means that the dimension of the space of images of cars, for example, is much lower than that of the entire space of all possible images, of dimension 3n_p as said earlier, since we need to impose a lot of constraints for an image to be indeed the image of a car, even if such constraints are not easily definable. This suggests a strategy, proposed in theorem 5 of



[12], and explained below. But first, we need to understand the concepts of a δ-cover of a manifold and the ε_0-neighbourhood of a submanifold, and finally what an adversarial example is.

Informally stated, a δ-cover of a manifold is a set of points ~x_i of the manifold such that the set of balls with centers ~x_i and radius δ contains the manifold. In other words, any point of the manifold is at most δ-far from one of the ~x_i. Relatedly, an ε_0-neighbourhood of a given submanifold is composed of all those points in the manifold at most ε_0-far from the submanifold. In both cases we are assuming a p-norm.

An example of a δ-cover is depicted in figure 1 in green, and will be a key ingredient to avoid adversarial examples. With perhaps an abuse of notation, we will call a δ-cover both the set of points ~x_i that are the centers of these balls, and the balls themselves. This means that, for δ the parameter that controls how coarse or fine the sampling is, a δ-cover is a coarse-grained sampling of each class. Following our previous example, a δ-cover could be a set of images of cars such that any possible image of a car is no farther than δ from one in the training set. Notice that one can measure the distance between two images as the distance, in a p-norm, between the vectors containing the amount of green, blue and red of each pixel.
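A δ-cover is easy to verify numerically. The following sketch is our own illustration, not part of the paper's algorithm: it checks that every sampled manifold point lies within δ, in a p-norm, of some center ~x_i.

```python
import numpy as np

def is_delta_cover(centers, manifold_pts, delta, p=2):
    """True iff every manifold point is at most delta-far (p-norm)
    from some center, i.e. the delta-balls cover the samples."""
    for x in manifold_pts:
        dists = np.linalg.norm(centers - x, ord=p, axis=1)
        if dists.min() > delta:
            return False
    return True
```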

For the next definition we will need the notion of a classifier: a function f that, given a point x, is able to predict a label y. An ε_0-neighbourhood is also depicted in figure 1, in dashed lines around the red and blue submanifolds. Here, ε_0 has the meaning of the robustness against perturbations. For example, take an image of a car and its corresponding vector: ε_0 is the amount one can perturb the vector without fooling the classifier. Then, Mε_0 is the space of such perturbations of size ε_0 of each class within the submanifold M that contains the different classes.

Given the previous definitions, an ε_0-adversarial example can be described as a point in the manifold that, being ε_0-close to a class or submanifold, is mistakenly classified as an example of another class. For this definition to make sense, ε_0 must be smaller than the distance r_p between two classes; otherwise, a given point in the region Mε_0 could be ε_0-close to two classes at the same time. An adversarial example is pictured in figure 1: the blue point is ε_0-close to the red class but classified as blue.

The intuition of [12] to avoid ε_0-adversarial examples is to cover all classes with a δ-cover such that any point in the ε_0-neighbourhood of the classes is δ-close to a correctly classified point. What we want to find out is how big δ can be in order to maintain protection against adversarial examples. Such δ will

FIG. 1. Submanifold M made of two classes separated by a distance r_p. The space near each class is Mε_0. We also represent a δ-cover of the blue class. The red point is in the red class but, when ε_0-shifted, can be mistakenly classified as blue; it would be an adversarial example that we are trying to avoid.

depend crucially on the minimum distance of separation between classes, δ < r_p − ε_0, and thus our main objective in this article is to sample efficiently in an active learning setting to find this minimum distance r_p. The precise statement of the theorem can be found in appendix A, but intuitively we are trying to cover the part of Mε_0 corresponding to each class with sets of balls, such that the sets do not overlap but the balls are as large as possible, to avoid sampling more than is needed.

Therefore, if we knew the minimum distance of separation of the two classes, r_p, we would be able to produce a cover of the two classes that avoids adversarial examples. However, finding r_p is not easy, because we only have an upper bound on it: the minimum distance between the sampled points of the two classes. The generalisation to a small arbitrary number of classes is usually done pairwise. The problem we aim to solve in this paper is finding this minimum distance r_p between the two classes, because if we overestimate δ we would not be able to use Theorem 1, and if we underestimate it, the δ-cover would be more expensive to establish.
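From finite samples one only gets the upper bound mentioned above: the smallest pairwise distance between points of the two classes. A minimal sketch, our own illustration rather than the paper's method:

```python
import numpy as np

def rp_upper_bound(class_a, class_b, p=2):
    """Smallest p-norm distance between sampled points of two classes.
    The true separation r_p can only be smaller than or equal to this."""
    return min(np.linalg.norm(a - b, ord=p)
               for a in class_a for b in class_b)
```

Once a good estimate of r_p is available, δ can be chosen to satisfy δ < r_p − ε_0.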

Given these definitions and the previous theorem, we can think of this as a procedure to establish an ε_0-tolerance to perturbations. The robustness provided by the δ-cover qualitatively means that, if we provide a cover of the class with balls of size δ, then any point ε_0-near the class, in Mε_0, will be correctly classified. Thus, our classifier will be robust to perturbations.


With this purpose in mind, in this article we present a quantum active learning algorithm that allows for fast sampling of the most informative points that could be added to the training set X_L to find out this minimum distance r_p. We will later define what we mean exactly by ‘informativeness’. Also, notice that this means that the level of robustness ε_0 is something we choose, and r_p is the unknown we are looking for that would allow us to calculate δ.

The quantum algorithm will allow us to perform this sampling very efficiently, in polylogarithmic time in the dimension of the space m and the number of already classified points n, and polynomial cost in all other variables. To achieve such complexity we will need to use qRAMs [13] and techniques from [14] and [15]. By comparison, the linear algebra operations require polynomial cost in n and m when performed using classical computing.

The origin of this advantage is that in order to calculate such ‘informativeness’ we need to solve the Quantum Support Vector Machine, but we will not need to read out the solution. Rather, we will operate quantumly with the output to calculate the scalar value of ‘informativeness’, which we define formally in the next section. Notice that most quantum linear algebra algorithms are better suited for the cases where one does not have to read out the solution entirely, but rather calculate some expected value. This is partially the intuition that motivated our research.

We also hope that the framing of this article will highlight to the quantum machine learning community, often focused mostly on obtaining quantum advantages, the necessity of designing systems that are not only efficient and capable, but also robust and reliable.

The structure of the remaining sections of the article is the following. In section II we explain how to model the problem of finding the minimum distance between classes as an active learning problem. We also introduce some key concepts, the main algorithm we will be using, and some background on Support Vector Machines. Sections III, IV and V constitute the main corpus of the text. To perform the active learning algorithm, in section III we explain how to obtain P_c(~x), the probability that an arbitrary point of the space is in a given class, which is one of the main components of the ‘informativeness’ of such a point. The other main component is described in the next two sections. In section IV we review the main results that we will be using from our reference [14], and in section V we use the result of the previous section to calculate |~w_{n+1}|, the second main component of the ‘informativeness’ of ~x. Section VI explains the strategy to select a point that with high probability improves the current estimate of the SVM, and section VII is dedicated to the calculation of the complexity of our algorithm. Finally, in section VIII we explain our conclusions, and we also include a subsection for those interested in learning more about adversarial examples, active learning, and quantum machine learning, where we also briefly mention the differences between adversarial examples and generative adversarial networks. In the appendices we include definitions, theorems from other articles that we use, and some technical results.

II. MODELLING THE ACTIVE LEARNING PROBLEM

The problem we want to solve is the following:

Problem 1. Informative Points Sampling Problem. Suppose we are given a set of points S = {(~x_i, ~c_i)}, i ∈ {1, ..., n}, where ~x_i ∈ R^m are the n data points already classified, and ~c_i are label vectors that indicate the probability of membership in each of the possible classes where the point might be classified. Suppose also that these memberships sum up to at most 1 for each point, and for simplicity assume that there are just two classes. The problem is: what are the most informative points to be added to S in order to learn the Support Vector Machine, and in particular r_p, with better precision?

This problem makes reference to two key concepts, Support Vector Machine and ‘informativeness’, which we explain now.

Definition 1. Support Vector Machine (SVM). Let M be an m-dimensional manifold with two submanifolds or classes and a set of points already classified, (~x_i, y_i), with ~x_i the point and y_i the label. Then a Support Vector Machine is a hyperplane in the manifold separating both classes such that the minimum distance between the ~x_i and the hyperplane is as large as possible. If the SVM is linear it may be described by the equation

~w · ~x − b = 0, (1)

and the size of the margin is 1/|~w|, which would be equal to r_p/2 if the SVM were perfect. One may also define a non-linear SVM using a non-linear kernel for the dot product.
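For intuition, a hard-margin SVM in one dimension can be worked out by hand; the numbers below are our own toy example. With the closest points of the classes at −1 and 3, the conditions ~w · ~x − b = ±1 at the support vectors fix w and b, and the margin 1/|~w| equals half the gap between the classes, i.e. r_p/2.

```python
import numpy as np

neg = np.array([-2.0, -1.0])          # class y = -1
pos = np.array([3.0, 4.0])            # class y = +1
gap = pos.min() - neg.max()           # distance between closest points: r_p = 4

# w*x - b = +1 at x = pos.min() and w*x - b = -1 at x = neg.max() give:
w = 2.0 / gap
b = w * (pos.min() + neg.max()) / 2.0
margin = 1.0 / abs(w)                 # = gap / 2 = r_p / 2
```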

We highlight that the name Support Vector Machine (SVM) will refer both to the algorithm and to the separation hyperplane, the decision boundary.

In our particular problem, we are interested in finding the minimum distance between two classes,


in order to provide the required covering of the two classes that avoids adversarial attacks. However, it might be the case that both classes have some points in common, e.g. (~c_i)_j, (~c_i)_k > 0, where i indicates the point and j and k two classes. In order to avoid this, we establish that a point ~x_i is in class j if (~c_i)_j > 0.8; since Σ_j (~c_i)_j = 1 for any point ~x_i, this will avoid any overlap between classes. This value is arbitrary but should be strictly greater than 0.5. This way we clearly separate the two sets, and the minimum distance between classes is greater than 0. With this condition we make the classes clearly separated, since otherwise the concept of an adversarial example, which will be central to our discussion, does not make sense. If there are more than two classes, one can generalise the previous setting by making comparisons on all possible pairs of classes, which scales quadratically with the number of classes.

Now let us look at figure 2. We can see that the initially calculated SVM has a greater margin than the SVM we would have if we were able to know both classes perfectly. If we used that margin in Theorem 1, we would not find a cover that avoids adversarial examples. Thus, we are interested in minimizing the maximum margin, which is given by the true SVM we would obtain if we knew the actual classes and not just some samples.

Our problem is, however, in contrast with usual pool-based sampling (see section VIII A for more information). In that setting the algorithm usually seeks to classify new points near the SVM. This is not the case for our problem, since we want points that have a membership to a class greater than 0.8. Points that have no clear membership to a single class are not so interesting, and we do not include them in our data set.

Now we turn to the second concept, ‘informativeness’. How do we measure how interesting it could be to classify an arbitrary point? A good heuristic for our problem is to look for points that with high probability would decrease the margin 1/|~w| a lot, to get a better estimate of r_p. This is because there are two competing conditions on r_p. On the one hand, in order to fulfill the condition of the problem, we cannot take r_p larger than it really is, as that would make us choose a δ that does not fulfill the conditions of the theorem. On the other, the smaller δ is, the more expensive it is to establish the δ-cover of the classes.

The previous paragraph suggests measuring the ‘informativeness’ of a point by dividing the problem in two parts: we first find the probability that a given point ~x_{n+1} is in class c, and then multiply this probability by the inverse of the margin distance of the updated SVM, |~w_{n+1}|, taking into account this would-be newly classified point.


FIG. 2. Example of an SVM, characterised by the equation ~w · ~x − b = 0. The points depicted are the initial training set, with colors according to their classification; those in black do not belong to any class according to our criterion that, for a point ~x_i to be in class j, (~c_i)_j > 0.8. Initially we have the SVM in black, but we want to obtain the green one, which is more accurate. The dotted lines are the ones that we cannot see initially and must find out. The equation of the SVM is the one depicted, and the margin 1/|~w| is chosen to be equal to r_p. This is achieved when we make the two parallel hyperplanes that indicate the margin fulfill the equations ~w · ~x − b = ±1. Since making the SVM more precise implies making the margin smaller, we want to find an SVM that maximizes |~w|.

The ‘informativeness’ will thus measure an expected value of the information gained if a given point were in a given class. As such, it will be the product of the probability of ~x being in class c, P_c(~x), times the information we gain, which we measure as |~w|. The reason for this choice is that we want to add points to the training set that give a more precise account of the classes. That is, we wish to minimize the margin of size 1/|~w|, and therefore we want to maximize |~w|.

Definition 2. We will call ‘informativeness’ the metric that measures the expected value of the information we get by adding a point ~x_i to the training set. We therefore define it as the product

P_c(~x_i) · |~w_{~x_i}|, (2)

where P_c(~x_i) is the probability that ~x_i is in class c, and 1/|~w_{~x_i}| is the size of the margin of the resulting SVM if ~x_i really were in such class.
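Continuing the 1-D toy picture (our own illustration, with a hypothetical membership probability p_c supplied by the caller), Definition 2 can be computed by refitting the hard-margin separator as if the candidate belonged to class c:

```python
import numpy as np

def informativeness(p_c, neg, pos, x, cls):
    """P_c(x) times |w| of a 1-D hard-margin SVM refit with candidate x
    assigned to class cls (+1 or -1).  |w| = 2/gap, margin = gap/2."""
    neg = np.append(neg, x) if cls == -1 else neg
    pos = np.append(pos, x) if cls == +1 else pos
    gap = pos.min() - neg.max()       # updated estimate of r_p
    return p_c * (2.0 / gap)          # larger |w| means more informative
```

A candidate that lands inside the current margin shrinks the gap and therefore scores higher than one far from the boundary.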


A. Main results

In this article we propose a theoretical framework that allows us to think of active learning as sampling the most promising new points to be classified, so that the minimum distance between classes can be found and Theorem 1 used. We also propose a quantum algorithm that allows us to perform the sampling efficiently, with an exponential advantage over what is achieved classically.

In the quantum algorithm we do not use neural networks but rather an SVM [16], although it might be possible to use the general strategy in other setups using quantum neural networks.

Let us explain further the complexity comparison with the classical algorithms. One may argue that a good quantum strategy to solve Problem 1 would be to use Amplitude Estimation and bisection search to find a threshold for the quantile 1 − 1/C; for example, for the 1% most relevant points, C = 100. Then, one may use Amplitude Amplification to find one of the points in such a quantile. This achieves an exponential advantage over the classical case when one wants to find such points with certainty using classical computing. In the usual case of Amplitude Amplification and Amplitude Estimation we have an oracle that tells us whether an element is marked or not, and this yields a quadratic advantage with respect to the classical case. However, in our situation, being a good point or not depends on its relative ‘informativeness’ with respect to other points. This means that classically, in order to assert with certainty that a given point is within the top 1/C = 1% quantile, one should first calculate the informativeness of 0.99N points, out of N points. This is clearly prohibitive, since N = O(l^m), where m is the dimension of the space and l its discretisation. Notice for example that for an n_0 × n_0 image, the dimension would be m = 3n_0^2, due to the three colours or channels needed to define each pixel.

In contrast, if we want to solve this problem using Amplitude Estimation, we do not incur such a cost. What we do is find, using bisection and Amplitude Estimation, an informativeness threshold above which lies only the fraction 1/C of most informative points. Once that is the case, we can mark those points and use Amplitude Amplification to find them [17]. The cost will then be O(Cε^{-1}β) for a given precision ε in the threshold and success probability 1 − e^{−β}, with, crucially, C independent of N. Amplitude Amplification also bears a cost O(√C).

On the other hand, one can also find probabilistic classical strategies that output a point that is in the 1 − 1/C quantile with exponentially high probability, 1 − e^{−β}, and the complexity would be polynomial in the parameters C and β. Since our quantum strategy would also have some exponentially small probability of failure, and its complexity would also be polynomial in those parameters, the advantage would be unclear. We discuss this point in depth in appendix B, for a problem that generalises our own, and find that this advantage would be quadratic in some parameters.
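The classical probabilistic baseline is easy to check numerically: with T = Cβ uniform draws, the probability that none lands in the top 1/C quantile is (1 − 1/C)^{Cβ} ≤ e^{−β}, a standard bound; the concrete numbers below are our own example.

```python
import math

C, beta = 100, 5                     # top 1% quantile, failure probability e^-5
trials = C * beta
p_fail = (1.0 - 1.0 / C) ** trials   # chance that no draw is in the top 1/C
assert p_fail <= math.exp(-beta)     # follows from (1 - 1/C)^C <= e^-1
```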

However, the algorithm that we present here achieves an exponential advantage over its classical counterpart. The speedup is due to a faster calculation of the ‘informativeness’ of a given point, thanks to the use of qRAMs, the algorithm for the quantum SVM [14], and an approach to perform matrix-vector products using Chebyshev polynomials, taking advantage of techniques developed in [18] and [15]; in conclusion, a better use of quantum linear algebra techniques. Our algorithm has overall complexity O(Cε^{-1}β(o‖F‖^2 κ_eff^{-3} ε^{-3} + poly log(κ_M ε^{-1}(n + m)))), where ‖F‖ is the Frobenius norm of the matrix in (8), o the order of the kernel, κ_eff = O(1) an effective condition number that we choose, ε the precision on the ‘informativeness’ of a given point, 1 − e^{−β} the success probability with which one wants to be sure that the chosen point is in the 1 − 1/C quantile, and κ_M the condition number of the matrix M appearing in (25).

In contrast, a classical algorithm would have polynomial complexity in n and m, due for instance to the matrix-vector product (25), or the final calculation of the norm |~w|, central to our discussion.

In the next subsection we lay out the general strat-egy of our paper to solve this problem.

B. Algorithm strategy and background on Support Vector Machines

As we have seen, our aim is to sample points to be added to the training set, ~x_{n+1}, with the highest possible informativeness, defined as P_c(~x_{n+1})|~w_{n+1}|, where the first term indicates the probability that a given point is in a class, and the second measures how much that point would improve the classifier.

Thus, we will employ the strategy explained in Algorithm 1 to sample from the 1/C most relevant points.

Since we will rely on them somewhat heavily, it is useful to recall some results from [14], which explains how a quantum SVM algorithm works. The authors assume that each point ~x_i is labeled with a single class y_i, and there are only two classes, y_i = ±1. Given the pairs of data and label (~x_i, y_i), i ∈ {1, ..., n}, the authors state that calculating the SVM is equivalent, in the dual formulation, to maximizing over the multipliers


Algorithm 1 Active learning against adversarial examples.

1: procedure Active learning against adversarial examples
2: To find a point ~x that is in the 1 − 1/C quantile of ‘informativeness’ with probability 1 − e^{−β}, iterate O(Cβ) times:
3: Uniformly at random, sample a point ~x_{n+1} in the space.
4: Evaluate, as explained in section III, the probability that the point is in a given class c. We get P_c(~x_{n+1}).
5: Solve the linear system of equations (8) using the procedure of [14], explained in section IV, to obtain the state |b, ~α⟩_{n+1}.
6: Perform (24), using the Chebyshev approach taken from [15], to calculate |~w_{n+1}⟩. Then use Amplitude Estimation to estimate |~w_{n+1}|. The procedure to calculate |~w_{n+1}⟩ is described in section V, and its norm in appendix C.
7: Calculate the informativeness P_c(~x_{n+1}) · |~w_{n+1}| of the point ~x_{n+1}. If it is higher than the previous best ‘informativeness’, save the pair (~x_{n+1}, P_c(~x_{n+1}) · |~w_{n+1}|) to memory, replacing the previous best pair of values.
8: After the O(Cβ) iterations, output the saved best pair.
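Classically, the outer loop of Algorithm 1 is repeated uniform sampling with a running best. The following sketch is our own, with a caller-supplied `informativeness` function standing in for the quantum subroutines of steps 4 to 6:

```python
import math
import random

def best_of_random_draws(candidates, informativeness, C=10, beta=3.0):
    """O(C*beta) uniform draws, keeping the best-scoring candidate.
    The result is in the top 1/C quantile with probability >= 1 - e^-beta."""
    best_x, best_score = None, -math.inf
    for _ in range(math.ceil(C * beta)):
        x = random.choice(candidates)
        score = informativeness(x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score
```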

~α of the Lagrangian

L(~α) = Σ_{i=1}^{n} α_i y_i − (1/2) Σ_{i,k=1}^{n} α_i K_{ik} α_k, (3)

with constraints Σ_i α_i = 0 and α_i y_i ≥ 0 ∀i. After this, the result of the classifier is given, for a new point ~x, as

y(~x) = sgn( Σ_{i=1}^{n} α_i ( Σ_{l,k=1}^{m} K_{l,k} x_{i,l} x_k ) + b ), (4)

with K_{l,k} = ~x_l · ~x_k, and may be rewritten as

y(~x) = sgn(K(~w, ~x) + b), ~w := Σ_i α_i ~x_i. (5)

Notice that the matrix K_{l,k} is a kernel matrix that defines a dot product. Since the margins have length at least 1, this means that for the training data

y_i(~w · ~x_i + b) ≥ 1. (6)

Since there are only two classes, y_i = ±1, and so y_i^2 = 1. We will later say, with a bit of abuse of notation, that a given point is in class c to mean either of the two possible values of y_i. Additionally, we can transform the previous inequality into an equality by adding slack variables e_i,

(~w · ~x_i + b) = y_i − y_i e_i, (7)

such that if we allow e_i > 0, we allow for a soft margin, that is, some points may not fulfill (6). In that case we add to the Lagrangian (3) a term (γ/2) ∑_j e_j², where γ is specified by the user, to penalise any violation of (6). Then, minimising the Lagrangian is equivalent to solving the following least-squares approximation [14],

F (b, ~α)^T = [[0, ~1^T], [~1, K + γ^{−1}1]] (b, ~α)^T = (0, ~y)^T.    (8)
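As a purely classical sanity check of the formulation above, the least-squares system (8) can be built and solved directly with numpy. The sketch below uses toy data and a linear kernel; all names and parameter values are ours, not from [14]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: n points in m dimensions with labels +-1.
n, m, gamma = 8, 3, 10.0
X = rng.normal(size=(n, m))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))

# Linear kernel matrix K_ij = x_i . x_j, as below Eq. (4).
K = X @ X.T

# Build the bordered matrix F of Eq. (8) and solve F (b, alpha) = (0, y).
F = np.zeros((n + 1, n + 1))
F[0, 1:] = 1.0
F[1:, 0] = 1.0
F[1:, 1:] = K + np.eye(n) / gamma
b_alpha = np.linalg.solve(F, np.concatenate(([0.0], y)))
b, alpha = b_alpha[0], b_alpha[1:]

# The first row of F enforces the constraint sum_i alpha_i = 0.
assert abs(alpha.sum()) < 1e-9

# Classify a new point with Eq. (5): y(x) = sgn(w . x + b), w = sum_i alpha_i x_i.
w = alpha @ X
x_new = rng.normal(size=m)
label = np.sign(w @ x_new + b)
```

Note how the border of zeros and ones in F encodes the multiplier constraint, while the γ^{−1} term on the diagonal implements the soft margin.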

Here is where [14] uses quantum linear algebra techniques to solve (8), as we will see later, and obtain the result |b, ~α⟩. To classify a new point |~x⟩, one constructs, making use of qRAMs,

|u⟩ := (1/√N_u) ( b|0⟩|0⟩ + ∑_{k=1}^n α_k |~x_k| |k⟩|~x_k⟩ )    (9)

and

|x⟩ := (1/√N_x) ( |0⟩|0⟩ + ∑_{k=1}^n |~x| |k⟩|~x⟩ ).    (10)

Remember that qRAMs perform ∑_k β_k |k⟩ → ∑_k β_k |k⟩|~x_k⟩, for arbitrary amplitudes β_k. Using the qRAM proposed in [13], only O(log n) gates are activated, although O(n) must be present in the circuit.

If the previous is possible, one can also prepare the state |ψ⟩ = (1/√2)(|0⟩|u⟩ + |1⟩|x⟩) and measure the probability of the ancilla being in state |−⟩ = (1/√2)(|0⟩ − |1⟩),

P = (1/2)(1 − ⟨u|x⟩).    (11)

This is equivalent to performing a Hadamard gate on the first ancilla register and measuring the probability of obtaining |1⟩, and is called the Swap Test [19]. If P ≥ 1/2, then y(~x) = 1; otherwise y(~x) = −1.
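Numerically, the decision rule below Eq. (11) reduces to a sign test on the overlap ⟨u|x⟩. A minimal classical emulation of this test (the helper name is ours) is:

```python
import numpy as np

rng = np.random.default_rng(1)

def swap_test_probability(u, x):
    """Probability of measuring the ancilla in |-> after the swap test
    of Eq. (11): P = (1 - <u|x>) / 2, for normalized real state vectors."""
    return 0.5 * (1.0 - np.dot(u, x))

u = rng.normal(size=4); u /= np.linalg.norm(u)
x = rng.normal(size=4); x /= np.linalg.norm(x)

P = swap_test_probability(u, x)
label = 1 if P >= 0.5 else -1   # decision rule below Eq. (11)
```

Since ⟨u|x⟩ ∈ [−1, 1] for normalized real vectors, P always lies in [0, 1], with P = 1/2 marking the decision boundary.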

III. CALCULATING Pc(~xn+1)

The first thing we should care about is calculating Pc(~x_{n+1}), the probability that point ~x_{n+1} is in class c.

The simplest way to calculate the probability would be to solve the SVM, calculate the distance from the point ~x_{n+1} to the decision boundary, and


apply an activation function that converts the distance into a probability. Solving the SVM can be done using [14], reading each entry with Amplitude Estimation. This returns the vector (b, ~α), which can be used to build ~w = ∑_j α_j ~x_j, and therefore the SVM.

A second, more elegant, solution is the following. In [14], the authors propose a method to estimate, using the Swap Test and the output state |b, ~α⟩, to which class a given point |~x⟩ belongs. They calculate that the probability of measuring |−⟩ in the ancilla is

P = (1/2)(1 − ⟨u|x⟩),    (12)

for |u⟩ and |x⟩ defined as in (9) and (10). The interesting thing to notice is that if P > 1/2 the point is classified in one class, and if P < 1/2 in the other. We modify this protocol slightly: we perform Amplitude Estimation on this result instead of repeatedly measuring the expected value, which improves the complexity from O(ε^{−2}) to O(ε^{−1}). This allows us to obtain the amplitude A_{~x_{n+1}} = √P. Then, one may use P as Pc(~x_{n+1}), or apply whatever activation function we consider appropriate to shape Pc(~x_{n+1}), such as a sigmoid.
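For instance, a sigmoid centred at P = 1/2 is one possible choice of activation; the steepness k below is an arbitrary illustrative parameter, not something fixed by the text:

```python
import numpy as np

def shaped_probability(P, k=4.0):
    """One possible activation turning the swap-test probability P in [0, 1]
    into Pc: a sigmoid centred so that P = 1/2 maps to total uncertainty,
    Pc = 1/2. The steepness k is a free design choice."""
    return 1.0 / (1.0 + np.exp(-k * (P - 0.5)))

P = 0.73                    # e.g. an Amplitude Estimation output
Pc = shaped_probability(P)
assert 0.5 < Pc < 1.0       # P > 1/2 yields confidence above 1/2
```

Any monotone map of P with fixed point at 1/2 would serve equally well here; the algorithm only needs the relative ordering of candidate points.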

IV. QUANTUM SUPPORT VECTOR MACHINE

Now we focus on the main subroutine of our algorithm, which is mostly based on [14], and which outputs |b, ~α⟩, the solution to (8). This subroutine will be key to calculating |~w_{n+1}|, as one can use (5) to calculate ~w, whose norm we will estimate later. In this section we will review how [14] performs Hamiltonian simulation, and its role in the HHL algorithm [20], as well as some variations of that algorithm. We will also show the low-rank approximation needed to achieve the exponential speedup, and the role of the condition number as a normalization factor in the estimation of ||(b, ~α)||. Finally, we will review the complexity of this calculation, which relies on the use of qRAMs to achieve an exponential speedup.

Thus, the first step is solving (8), which [14] does by using the HHL algorithm [20] with a different Hamiltonian simulation. This is because, in general, the kernel matrix K is dense, and the Hamiltonian simulation of [20] is better suited for the sparse case.

Rather, the authors of [14] use a technique they had previously developed in [21] that allows for efficient Hamiltonian simulation of dense low-rank matrices. Let us explain how to do it. Suppose we have

a given quantum state σ, and we are able to prepare T copies of a density matrix ρ. One would like to perform the Hamiltonian simulation

e^{−itρ} σ e^{itρ}.    (13)

Repeated application of

tr_1(e^{−i∆tS} ρ⊗σ e^{i∆tS}) = σ − i∆t[ρ, σ] + O(∆t²),    (14)

with S the Swap operator, approximates

e^{−iT∆tρ} σ e^{iT∆tρ},    (15)

with T indicating the number of times we apply (14). The authors of [21] observe that to simulate (13) with precision ε one has to repeat the process O(t²ε^{−1}) times, each repetition taking time O(log m), since that is the time it takes to prepare ρ using a qRAM.
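Equation (14) can be verified directly on small matrices: since the Swap operator satisfies S² = 1, the exponential e^{−i∆tS} is simply cos(∆t)1 − i sin(∆t)S. A numpy check, in our own notation:

```python
import numpy as np

rng = np.random.default_rng(8)

def random_density_matrix(d):
    """Random d x d density matrix: positive semidefinite, unit trace."""
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = M @ M.conj().T
    return rho / np.trace(rho)

d = 2
rho, sigma = random_density_matrix(d), random_density_matrix(d)

# Swap operator on C^d (x) C^d; S^2 = 1 gives exp(-i dt S) in closed form.
S = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        S[i * d + j, j * d + i] = 1.0
dt = 1e-3
U = np.cos(dt) * np.eye(d * d) - 1j * np.sin(dt) * S

evolved = U @ np.kron(rho, sigma) @ U.conj().T
# Partial trace over the first register (indices [i1, j1, i2, j2]).
tr1 = evolved.reshape(d, d, d, d).trace(axis1=0, axis2=2)

# Right-hand side of Eq. (14), up to the O(dt^2) remainder.
approx = sigma - 1j * dt * (rho @ sigma - sigma @ rho)
assert np.max(np.abs(tr1 - approx)) < 10 * dt ** 2
```

The residual scales as ∆t², as (14) promises, which is why O(t²ε^{−1}) repetitions suffice for total error ε over time t.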

In our case ρ is the matrix F, represented by the operator F = (J + K + γ^{−1}1)/trF, where

J = [[0, ~1^T], [~1, 0]].    (16)

Then, e^{i∆tF} = e^{i∆tJ/trF} e^{i∆tK/trF} e^{i∆tγ^{−1}1/trF}. Simulating J is described in [22], and the matrix γ^{−1}1 is also easy. An important insight of [14] is how to prepare a state with density matrix K/trF. If we were able to prepare K/trK, then one would only need to correct for a scalar term trK/trF = O(1) in the simulation time. In appendix A of [14] it is explained how to estimate trK, which can be used to estimate trF = trK + γ^{−1} tr1.

So, let us show how [14] prepares a density matrix K/trK. To do so, prepare, using a qRAM,

|χ⟩ = (1/√N_χ) ∑_i |~x_i| |i⟩|~x_i⟩.    (17)

Tracing out the second register prepares

tr_2(|χ⟩⟨χ|) = (1/N_χ) ∑_{i,j=1}^n ⟨~x_i|~x_j⟩ |~x_i||~x_j| |i⟩⟨j| = K/trK.    (18)

The previous method allows us to prepare the density matrix K/trK needed to implement the Hamiltonian simulation of (14), with time complexity O(t²ε^{−1}). Therefore, we can efficiently perform the Hamiltonian simulation necessary to implement the HHL algorithm.
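The identity (18) is easy to verify classically: the amplitudes of |χ⟩ form, up to normalization, the flattened data matrix, and contracting the second register reproduces K/trK. A small numpy check, with notation of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3
X = rng.normal(size=(n, m))              # training points x_i as rows

# Amplitudes of |chi> in Eq. (17): the |i>|x_i> component carries
# |x_i| * (x_i / |x_i|) = x_i, so |chi> is the flattened, normalised X.
chi = X.flatten() / np.linalg.norm(X)

# Partial trace over the second register: rho_ij = sum_l chi_{il} chi_{jl}.
rho = chi.reshape(n, m) @ chi.reshape(n, m).T

K = X @ X.T                              # kernel matrix K_ij = x_i . x_j
assert np.allclose(rho, K / np.trace(K))  # Eq. (18)
```

The check works because trK = ∑_i |~x_i|², which is exactly the squared Frobenius norm used to normalise |χ⟩.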

The HHL algorithm consists of the following steps. First, one formally decomposes the state |0, ~y⟩ of (8) in eigenvectors of the Hamiltonian defined by the matrix F in (8), |0, ~y⟩ = ∑_j β_j |u_j⟩. Next, one phase-estimates the eigenvalues, using a Hamiltonian simulation algorithm such as the one explained above, obtaining ∑_j β_j |u_j⟩|λ_j⟩. Finally, one performs the controlled rotations

∑_j β_j |u_j⟩|λ_j⟩|0⟩ → ∑_j β_j |u_j⟩|λ_j⟩ ( (1/(λ_j κ)) |1⟩ + √(1 − 1/(λ_j²κ²)) |0⟩ ),    (19)

and postselects on ancilla state |1⟩ to obtain the solution to the system of equations [20].

The stated complexity of the original HHL algorithm is O(κ²ε^{−1} poly log n), with κ the condition number. The complexity of the originally used Hamiltonian simulation algorithm is O(dε^{−1} poly log n), where d is the sparsity, in the sparse oracle-access model [20]. The dependence on the condition number has been lowered to O(κ), as shown using the technique of Variable Time Amplitude Estimation in [23], which is optimal in this parameter, as also proved in that article. On the other hand, relying on more efficient Hamiltonian simulation techniques, [15] proposed a variation of the HHL algorithm that does not require Amplitude Estimation, thus reducing the complexity from O(ε^{−1}) to O(poly log ε^{−1}). Unfortunately, the Hamiltonian simulation described above still has complexity O(ε^{−1}), making it impossible to reduce the complexity of the overall method further.

There is an alternative to the Hamiltonian simulation method that we have described. In [24], a technique is introduced that allows simulating a dense Hamiltonian of any rank with complexity O(√n poly log ε^{−1}), and it can be used with the techniques from [15] to reduce the complexity in the precision from linear to polylogarithmic. The caveat is that it requires a special quantum-accessible data structure, explained in appendix A, that plays the role of a Quantum Read-Only Memory and has the complexity stated above, polynomial in n.

Coming back to our main discussion, once we have the state |b, ~α⟩ we would like to recover the norm of the vector ~w. As a first step, we would like to obtain the norm ||(b, ~α)||. Suppose we are trying to solve Ax = b, where the largest eigenvalue of A is less than or equal to 1; otherwise, see the end of appendix C for a minor correction. As explained in appendix A of [25], we can calculate the norm of the solution ||x|| using

||x|| = κ ||b|| √p_1,    (20)

where √p_1 represents the amplitude of the postselection ancilla of the HHL algorithm being in the correct state, usually |1⟩. This ancilla is used to perform the non-unitary part of the algorithm via a measurement. The reason for (20) is that the acceptance probability of HHL scales as

p_1 = ||A^{−1}b||² / (||b||² κ²).    (21)
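Equations (20) and (21) can be checked numerically: on each eigenvector, the |1⟩ branch of the rotation (19) carries amplitude β_j/(λ_jκ), so the acceptance probability encodes the solution norm. A classical sketch in our own notation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
# Random symmetric A, rescaled so its largest eigenvalue (in modulus) is 1.
A = rng.normal(size=(n, n)); A = (A + A.T) / 2
A /= np.abs(np.linalg.eigvalsh(A)).max()

b = rng.normal(size=n)
lam, U = np.linalg.eigh(A)
kappa = 1.0 / np.abs(lam).min()          # condition number (sigma_max = 1)

# HHL-style amplitudes: beta_j -> beta_j / (lam_j * kappa) on the |1> branch,
# as in the rotation of Eq. (19).
beta = U.T @ (b / np.linalg.norm(b))
p1 = np.sum((beta / (lam * kappa)) ** 2)  # acceptance probability, Eq. (21)

x_norm = kappa * np.linalg.norm(b) * np.sqrt(p1)       # Eq. (20)
assert np.isclose(x_norm, np.linalg.norm(np.linalg.solve(A, b)))
```

The factor κ||b|| simply undoes the normalizations that the quantum state and the rotation (19) impose on the amplitudes.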

Estimating the amplitude √p_1 thus requires Amplitude Estimation [17], at cost O(ε^{−1}).

Now let us turn to the condition number of the system of equations that appears in the SVM we are solving, that is, the condition number of the matrix F in (8). Recall that the condition number is defined as

κ = σ_max / σ_min,    (22)

where σ_min and σ_max are the minimum and maximum singular values, respectively. The condition number will be important to correct the norm of the solution to the system of equations, as can be seen from (20). However, calculating it with the Quantum Singular Value Estimation technique from [26] might be too expensive.

On the other hand, the authors of the quantum SVM article [14] propose that when the kernel matrix has O(1) eigenvalues of size O(1) and O(n) eigenvalues of size O(1/n), as is the case here, we can choose an effective condition number κ_eff = O(1), at the price of an additional error of order O(1/√n) on top of ε.

This low-rank approximation means that the algorithm in [14] only takes into account the eigenvalues λ_i such that ε_K ≤ λ_i ≤ 1. The main idea of the low-rank approximation is filtering out the eigenvalues λ < κ_eff^{−1}, so the final rotation of the HHL algorithm, indicated in (19), is performed only if λ_j > κ_eff^{−1}, which imposes κ = κ_eff = O(1).

In appendix C of the supplementary material of [14] the authors show that ||K − K_eff|| = √(∑_{λ_i=O(1/n)} λ_i²), with K_eff the 'filtered' low-rank approximation of K, K_eff = ∑_{λ_i ≥ κ_eff^{−1}} λ_i |u_i⟩⟨u_i|. Since there are O(n) eigenvalues of size O(1/n), the induced error is of order O(n^{−1/2}). Then, one can use Theorem 1 in the appendix of the HHL article [20] to show that, if |b, ~α⟩ is the exact solution of the system of equations and |b, ~α⟩_eff the approximate one after postselecting on the subspace spanned by the eigenvalues λ_j ≥ κ_eff^{−1} and the ancilla in state |1⟩, then || |b, ~α⟩_eff − |b, ~α⟩ || = O(ε + n^{−1/2}). Thus, for relatively large n the error induced by this approximation is small.
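The effect of the κ_eff filter can be illustrated classically: with a spectrum of a few O(1) eigenvalues and O(n) eigenvalues of size O(1/n), discarding everything below κ_eff^{−1} perturbs K by O(n^{−1/2}) in Frobenius norm. A sketch with parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
# Spectrum as assumed in [14]: a few O(1) eigenvalues and
# O(n) eigenvalues of size O(1/n).
lam = np.concatenate(([1.0, 0.8, 0.5],
                      rng.uniform(0.5 / n, 1.5 / n, n - 3)))
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
K = Q @ np.diag(lam) @ Q.T

kappa_eff = 10.0                          # keep only lambda >= 1/kappa_eff
keep = lam >= 1.0 / kappa_eff
K_eff = Q[:, keep] @ np.diag(lam[keep]) @ Q[:, keep].T

# Frobenius norm of the discarded part equals sqrt(sum of small lambda^2).
err = np.linalg.norm(K - K_eff)
assert np.isclose(err, np.sqrt(np.sum(lam[~keep] ** 2)))
assert err < 2.0 / np.sqrt(n)             # the O(n^{-1/2}) scaling
```

With O(n) eigenvalues of size O(1/n), the sum of their squares is O(1/n), so the induced error indeed shrinks as n^{−1/2}.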

Overall, this implies that instead of (20) we will have

||x|| = κ_eff ||b|| √p_1.    (23)

Notice that in this case we have imposed an effective condition number for all candidate points ~x_{n+1}: κ_{~x_{n+1}} = κ_eff, so we no longer have to care about the condition number: it is the same scale factor for all points ~x_{n+1}, and thus irrelevant. The same happens for ||b|| in (23), since b = (0, y_1, ..., y_{n+1})^T, with y_{n+1} the same for all candidate points. Since we suppose n is large enough, the additional additive error we introduce will be small, of order O(n^{−1/2}), according to appendix C of the supplementary material of [14]. Thus, the complexity of the algorithm is O(ε^{−3} κ_eff³ log(mn)), where m is the dimension of the space.

To finish this section, let us recall how to calculate the complexity of the algorithm of [14]. The authors claim that (14) has error ε = O(||F||²∆t²), with ||F|| the Frobenius norm of F, and it is repeated for T periods. Thus, ε = O(||F||²∆t²T) = O(||F||²t²/T), for ∆t = t/T. This implies a simulation cost T = O(t²||F||²ε^{−1}), when implementing (14) is taken at unit cost.

On the other hand, HHL requires phase-estimating the eigenvalues, so the relative error of λ^{−1} can be written as ε = O(1/(λt)) ≤ O(κ_eff/t). Therefore t = O(κ_eff ε^{−1}), and substituting into the previous paragraph, T = O(||F||² κ_eff² ε^{−3} log(mn)). An additional κ_eff is needed to perform postselection on the result. Thus, the overall complexity of the algorithm is O(o ||F||² κ_eff³ ε^{−3} log(mn)), for a kernel of order o.

If instead of using the basic HHL technique we had decided to use more advanced ones, such as the Fourier approach described in [15] (see appendix A), the complexity would have been O(Tα), for T = O(||F||²t²ε^{−1}), t = O(κ) and α = O(κ), ignoring polylogarithmic factors [15]. Overall, the complexity would be O(||F||²κ³ε^{−1}).

V. THE NORM OF |~w_{n+1}|

The aim of the previous section was to obtain |b, ~α⟩ and its norm. The aim of this one is to calculate ~w from that result. We will see that one may do so using a Fourier expansion, at cost O(ε^{−1}); or, better, using Chebyshev series as in [15], at cost polylogarithmic in all variables, provided the use of qRAMs. In appendix C we explain how to calculate the norm |~w_{n+1}| from ~w, which only requires performing Amplitude Estimation once and normalizing with some factors.

The first, trivial, idea to calculate |~w_{n+1}⟩ is to read all the entries of |b, ~α⟩ using Amplitude Estimation. Then one may use classical computing to obtain ~w = ∑_{i=1}^{n+1} α_i ~x_i, and finally its norm |~w_{n+1}| = √(∑_{i=1}^m (∑_{j=1}^{n+1} α_j x_{j,i})²), where the α_j are those obtained for candidate point ~x_{n+1}. Amplitude Estimation does not immediately recover the sign of each α_i, but finding it is not complicated either. Once we know |α_i| and |α_j| we can prepare the state C_{ij}(|α_i| |j⟩ + |α_j| |i⟩) and perform a Swap Test with the solution vector |b, ~α⟩. If the relative sign of entries i and j is the same, we will get a nonzero result proportional to 2C_{ij}α_iα_j, but if the relative sign is opposite, the dot product will cancel out [27]. Establishing the relative signs of pairs of entries is enough, since the norm does not depend on the global sign.

However, the procedure described above is time consuming, as one needs to prepare the solution of the system of equations (8) O(n+1) times in order to read all the entries of the solution and calculate their relative signs. Also, calculating ~w takes complexity O(nm) using classical computing. Instead, we will try to prepare |~w_{n+1}⟩ from |b, ~α⟩, and estimate the change in the norm of the result. Notice in the first place that the quantum SVM article [14] is able to classify a point without first calculating |~w_{n+1}⟩, as indicated in (11) and the subsequent paragraph. Also, it is not clear how to prepare |~w_{n+1}⟩ from the state in (9).

Instead, since ~w = ∑_i α_i ~x_i, this suggests multiplying the solution vector from the quantum SVM, |b, ~α⟩, by a matrix operator whose entries are the coordinates x_{j,i}. Explicitly,

A |b, ~α⟩ = [[0, x_{1,1}, ..., x_{n+1,1}], ..., [0, x_{1,m}, ..., x_{n+1,m}]] (b, α_1, ..., α_{n+1})^T = (∑_{j=1}^{n+1} α_j x_{j,1}, ..., ∑_{j=1}^{n+1} α_j x_{j,m})^T = ~w.    (24)

Performing this operation classically requires O(nm) operations. Can we perform it in a quantum way?
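Classically, (24) is an ordinary matrix-vector product; the following sketch (names are ours) checks that it reproduces ~w = ∑_j α_j ~x_j:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 5, 3
X = rng.normal(size=(n + 1, m))          # n+1 training points, x_j in R^m
b = 0.3
alpha = rng.normal(size=n + 1)

# The matrix of Eq. (24): a zero first column (so b drops out of the
# product) followed by the point coordinates.
A = np.hstack([np.zeros((m, 1)), X.T])   # shape (m, n + 2)
w = A @ np.concatenate(([b], alpha))

assert np.allclose(w, alpha @ X)         # w = sum_j alpha_j x_j, Eq. (5)
w_norm = np.linalg.norm(w)               # the quantity |w_{n+1}| we want
```

The zero column is what makes the offset b inert in the product, which is exactly the role it plays in the quantum version below.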

Our strategy to calculate the previous matrix-vector product takes inspiration from the algorithms presented in [15] to simulate A^{−1}. The fact that in our case A is not necessarily square poses no problem, since one can write

[[0, A^T], [A, 0]] ((b, ~α)^T, ~0_{n+1})^T = (~0_{m+1}, ~w)^T.    (25)

Let us call M the matrix in the previous equation. The first thing we have to do is attach a register to |b, ~α⟩ that we set to |0⟩ for the entry b (that is, when the original register is in state |0⟩) and to |1⟩ otherwise. This will allow us to substitute any of the 0s in the matrix in (24) while avoiding interference of the entry b with those of ~α. The aim of this is to avoid M being ill-conditioned, although it is only a technical detail. Then all the singular values fall in [1/κ_M, 1], and we can rewrite

(b, ~α, ~0_{n+1})^T → (b, ~0, ~0_{n+1})^T ⊗ |0⟩ + (0, ~α, ~0_{n+1})^T ⊗ |1⟩.    (26)

Notice that, by linearity of quantum mechanics, the unitary we were going to apply to the left-hand side is applied individually to each of the two terms of the decomposition. Later we will only care about the case in which the additional ancilla is in state |1⟩.

With this remark, the question becomes twofold: how to decompose M into a Linear Combination of Unitaries (LCU), as is done with A^{−1} in [15], and how to simulate those linear operators efficiently. Notice that now e^{−iM} is a unitary operator.

The first question is how to decompose M into a linear combination of unitaries. If we choose, as is frequently done, unitary operators of the form e^{−iMt}, we can write M = ∑_i β_i e^{−iMt_i}. Then, as both sides of the equality are diagonal in the same basis, one can consider this equivalent to writing x = ∑_i β_i e^{−ixt_i}. The first idea is then to decompose x as a Fourier series. Specifically, decomposing x gives us

x = ∑_{t=1}^∞ (−1)^{t+1} (2/t) sin(tx) = i ∑_{t=1}^∞ ((−1)^{t+1}/t) (e^{−itx} − e^{itx}).    (27)

Since we cannot perform the entire summation, the question is how to truncate the series while maintaining a given precision ε. Call S_t(x) the series truncated at t. Since all the singular values will be in [1/κ_M, 1], with κ_M the condition number of M, we want to study the series in the range [0, 1]. Notice that f(x) = x is Lipschitz-continuous with constant 1, so we can use a result from [28] to bound the error incurred when truncating the series. According to Corollary I in [28], |f(x) − S_t(x)| ≤ (a log t)/t, where a is a constant. Thus, if we want |x − S_t(x)| ≤ ε, it is enough to choose t/log t = O(ε^{−1}). So the complexity is almost linear in ε^{−1}, as we have to Hamiltonian-simulate M during time t. Furthermore, whichever Hamiltonian simulation algorithm we choose, since M is dense the cost will be relatively large, O(√(n+m)) if we were to use [24], for instance. In any case we would not obtain an exponential advantage, since Hamiltonian simulation of dense Hamiltonians is not expected to have complexity log n in general [29]. Also notice that we cannot use the trick from the previous section, since that requires M to be a density matrix, and trM = 0 implies that M is not a density matrix.
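The truncation behaviour of the Fourier series (27) is easy to observe numerically; the helper S below is our own:

```python
import numpy as np

def S(x, t_max):
    """Truncated Fourier series of f(x) = x on (-pi, pi), Eq. (27)."""
    t = np.arange(1, t_max + 1)
    return np.sum(2.0 * (-1.0) ** (t + 1) / t * np.sin(np.outer(x, t)),
                  axis=1)

# Singular values of M lie in [1/kappa_M, 1], so study x in that range.
x = np.linspace(0.01, 1.0, 200)
for t_max in (10, 100, 1000):
    err = np.max(np.abs(x - S(x, t_max)))
    print(t_max, err)
```

The maximum error decays roughly like (log t)/t on this interval, consistent with the bound from [28]: halving the target precision roughly doubles the required truncation order.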

A second, better approach, also inspired by [15], is to decompose the matrix in Chebyshev polynomials, x = ∑_i γ_i T_i(x). In fact, this decomposition is particularly simple, since the first Chebyshev polynomial is

T_1(x) = x,    (28)

so the expansion has a single term and is exact. Let us now look at the cost of simulating a Chebyshev polynomial for an n × n d-sparse matrix M. One does that via quantum walks. Define a quantum walk as a set of states |ψ_j⟩ ∈ C^{2n} ⊗ C^{2n}, j ∈ [n] [15]:

|ψ_j⟩ := |j⟩ ⊗ (1/√d) ∑_{k∈[n]; A_{jk}≠0} ( √(A*_{jk}) |k⟩ + √(1 − |A_{jk}|) |k+n⟩ ).    (29)

Then, one defines the isometry

T = ∑_{j∈[n]} |ψ_j⟩⟨j|,    (30)

and S the Swap operator, S : |j, k⟩ → |k, j⟩. The walk operator is W := S(2TT† − 1). With these definitions, and using Lemma 15 of [15], enunciated in appendix A, one can prove that for any state |ψ⟩, substituting T by its corresponding unitary T_U [15],

T_U† W T_U |0^{⌈log 2n⌉+1}⟩|ψ⟩ = |0^{⌈log 2n⌉+1}⟩ T_1(M̄)|ψ⟩ + |Φ⊥⟩ = |0^{⌈log 2n⌉+1}⟩ M̄|ψ⟩ + |Φ⊥⟩,    (31)

for M̄ = M/d, and |Φ⊥⟩ such that (|0^{⌈log 2n⌉+1}⟩⟨0^{⌈log 2n⌉+1}| ⊗ 1)|Φ⊥⟩ = 0.

Let us first study the cost of applying T. According to Lemma 10 in [18], to implement T one must apply log d Hadamard gates, followed by O(1) calls to an oracle O_F that outputs the column index of the l-th nonzero element of row j of the matrix, O_F : |j, l⟩ → |j, f(l, j)⟩; and the sparse-access oracle O_M, which outputs entry M_{jk} on input (j, k). For simplicity, and without loss of generality, let us assume that in (25) all entries of A are nonzero. In that case, the cost of the oracles is not high either. The first oracle outputs O_F : |j, l⟩ → |j, l⟩ or O_F : |j, l⟩ → |j, l + n + 1⟩, depending on whether we are in the last n + 1 rows or in the first m + 1, respectively. The second oracle performs O_M : |j, k⟩|z⟩ → |j, k⟩|z ⊕ M_{jk}⟩, which is in any case no more restrictive than the qRAM model that we were already using to solve the quantum SVM.


What, then, is the cost of implementing W? In the proof of their Theorem 4, [15] cites Lemma 10 from [18] to explicitly state that if we want to simulate W with error ε′, its complexity, and the complexity of (31), is O(log n + log^{2.5}(κd/ε′)). This is not surprising, since the cost only depends on T, its inverse T†, and a Swap gate. Notice that our matrix is dense, so d = O(n+m), but still the cost of this process is polylogarithmic in all variables! In fact, this method seems applicable to any matrix M, sparse or not, whose entries are accessible through a sparse-access oracle in the form of a qRAM. One may even say that the qRAM is hiding the cost, since it requires time to prepare the data and also has O(n) quantum gates, even if it only activates a small number of them. Notice also that the linear dependence on d and κ of the Quantum Linear System Algorithm presented in [15] does not come from implementing W, but from the fact that one has to apply it O(dκ) times in the series expansion of A^{−1}. In our case we only have to apply W once, so we do not have such a linear complexity term.

Finally, we measure the amplitude of the |1⟩ state of the ancilla that we attached to mark the entry of b after (24). After taking into account a normalization factor for M and ||(b, ~α)||, the amplitude of that state is precisely |~w_{n+1}| = √(∑_{i=1}^m (∑_{j=1}^{n+1} α_j x_{j,i})²), which is what we were looking for. We can measure that amplitude using Amplitude Estimation.

In appendix C we explain in greater detail how to correct the value extracted with Amplitude Estimation to find the norm |~w_{n+1}|, for either the Linear Combination of Unitaries with the Fourier series or the Chebyshev polynomial approach. The cost of Amplitude Estimation is O(ε^{−1}). Having calculated Pc(~x_{n+1}) and |~w_{n+1}|, the informativeness of a given point is just the product of both.

VI. FINDING THE TARGET POINT ~x_{n+1}

Now that we know how to calculate the 'informativeness' of a given point, let us explain why Algorithm 1 finds, with probability 1 − e^{−β}, a point that is in the quantile 1 − 1/C of informativeness. The arguments presented here are the same as those in appendix B.

If we sample uniformly at random and calculate the informativeness of c points, the probability that none of those points is in the quantile 1 − 1/C is clearly

p = (1 − 1/C)^c.    (32)

We want to make this probability p < e^{−β}, so that, with probability at least 1 − e^{−β}, at least one point is in the 1 − 1/C quantile. That means

c > −β / log(1 − 1/C).    (33)

Since

lim_{C→∞} [−1 / log(1 − 1/C)] / C = 1,    (34)

it is clear that the number of sampled points should be c = O(βC). This explains why our procedure is efficient in these variables. More information on alternative strategies can be found in appendix B.
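The bound (33) is straightforward to evaluate and to check by Monte Carlo; the helper name below is ours:

```python
import numpy as np

def num_samples(C, beta):
    """Smallest c with (1 - 1/C)^c < exp(-beta), from Eq. (33)."""
    return int(np.ceil(-beta / np.log(1.0 - 1.0 / C)))

C, beta = 20, 5.0
c = num_samples(C, beta)
print(c)   # -> 98, close to beta * C = 100, as Eq. (34) predicts

# Empirical check: probability that c uniform 'informativeness' scores
# all miss the top 1/C quantile should be below exp(-beta).
rng = np.random.default_rng(6)
trials = 20000
misses = sum(rng.random(c).max() < 1.0 - 1.0 / C for _ in range(trials))
assert misses / trials < np.exp(-beta) * 1.5
```

For β = 5 and C = 20 the failure probability is below e^{−5} ≈ 0.7%, at the cost of only 98 evaluations of the informativeness subroutine.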

VII. COMPLEXITY

In this section we give a calculation of the complexity of the proposed quantum algorithm.

The first step is calculating the probability that a given point of the high-dimensional space is in a certain class, Pc(~x_{n+1}). This implies solving a quantum SVM [14], with complexity O(o||F||² ε^{−3} κ_eff³ log(mn)), where o is the order of the kernel, ||F|| is the Frobenius norm of the matrix that appears in (8), m the dimension of the space, n the number of already classified points, and κ_eff an effective condition number, which can be taken O(1) for this problem.

Reading out the probability using Amplitude Estimation in the Swap Test costs an additional multiplicative O(ε^{−1}), so the overall complexity of this step is O(o||F||² ε^{−4} κ_eff³ log(mn)). Since we assume that the matrix has been rescaled so that its largest eigenvalue equals 1, the procedure of [14] implies that the smallest eigenvalue taken into consideration is λ = κ_eff^{−1}. The low-rank approximation induces an additional error of O(n^{−1/2}), as explained in the supplementary material of the original article [14].

The next step is calculating the quantum SVM for the system with the added point, again with complexity O(o||F||² ε^{−3} κ_eff³ log(mn)). Then, preparing |~w_{n+1}⟩ has complexity O(poly log(κ_M ε^{−1}(n+m))), as reported in the corresponding section. Calculating |~w_{n+1}| also has an additional multiplicative complexity of O(ε^{−1}), due to Amplitude Estimation, given the procedure in appendix C.

Finally, if we are interested in finding a point in the 1 − 1/C quantile of 'informativeness' with probability 1 − e^{−β}, the procedure explained in section VI implies iterating over O(Cβ) points and selecting the best.


Thus, in general, the complexity will be O(Cβ ε^{−1} (o||F||² κ_eff³ ε^{−3} + poly log(κ_M ε^{−1}(n+m)))), with error O(ε + n^{−1/2}). The n^{−1/2} comes from the low-rank approximation of [14], as explained in its appendix C.

VIII. CONCLUSION

In the previous sections we have seen that, provided the use of qRAMs, it could be possible to establish a polynomial-complexity sampling procedure that lets us know which are the most promising points to be added to the training set in order to get a better approximation of the minimum distance between classes, r_p. Recall that knowing r_p is a prerequisite for applying a δ-cover and, using Theorem 1, avoiding adversarial examples. This protocol is heuristic, which means that in order to check its actual performance we would have to run it on a real quantum computer.

In this paper we have presented a procedure that allows us to solve this problem efficiently, in time polylogarithmic in the dimension of the space, m, and the number of already classified points, n; and polynomial in C, β and the precision ε^{−1} of the informativeness estimate.

There are nevertheless two important shortcomings of our algorithm. The first is that, since we rely heavily on reference [14], and they use qRAMs to achieve their speedup, we also need qRAMs. The second is that some minor parts of the general algorithm will necessarily have polynomial complexity in the dimension of the space m: sampling the initial points, and loading their values into the qRAM. We believe this should nevertheless not be a problem. The sampling procedure, for instance, will hardly be inefficient and can easily be carried out while the previous point is being processed. Loading into the qRAM is perhaps more expensive, but may be done while a human is asked to classify the point. Therefore we believe that the complexity result remains valid in practice.

Finally, we want to recall that our work relies heavily on reference [14]. At the time that article was written, there was no known classical method to solve low-rank linear systems of equations in polylogarithmic time. However, [30] has recently proposed a method that achieves precisely that, with complexity O(||F||^6 k^6 κ^6 ε^{−6}), k the rank. It relies on the techniques developed by Ewin Tang [31], who found a classical algorithm taking inspiration from [32]. Therefore, one could also solve the linear system of equations that appears in our case in time polylogarithmic in the dimension n. There are two reasons why we have not explored that route here. Firstly, although [30] is efficient in n, it behaves rather badly in other parameters, such as the error ε. Secondly, and perhaps more importantly, even though the most complicated part of the algorithm, the solution to (8), can be performed with this quantum-inspired algorithm in polylogarithmic complexity, solving (25), calculating the norm |~w_{n+1}|, and performing the inner product needed to calculate Pc(~x) are, to the best knowledge of the authors of this article, not possible in polylogarithmic time. Thus, even though we think the exponential advantage could perhaps be reduced to a 'mere' polynomial one using classical computing, we do not think there is an explicit method to do it, only a line of research that may achieve it in the future.

In conclusion, in this article we have framed a possible solution to adversarial examples using active learning, and a sampling methodology to find the most important points that could be added to the training set. We have also presented an algorithm that provides an exponential advantage over its classical counterpart, provided the use of qRAMs.

A. Related work: other classical approaches to adversarial examples and active learning, and comments on quantum machine learning

Adversarial examples are a danger for any classifier that needs to be robust to perturbations. For instance, adversarial examples can be dangerous when an autonomous car has to recognise traffic signals. Since adversarial examples were discovered [6], there has been a lot of work to explain why they happen [7] and also to obtain provably robust models [8]. In particular, some of the most promising ideas to avoid them are related to adversarial training: training against adversarial examples before an actual adversary has time to pose them to the classifier. This is, for instance, the model explained in [8], where projected gradient descent is used to minimize the maximum expected loss

min_θ ( E_{(x,y)∼D} [ max_{ε∈S} L(θ, x + ε, y) ] ),    (35)

where D is the initial population of points x with true labels y, and the perturbations ε are taken from a small set S. L is the loss function, i.e. the function that measures the difference between the predicted and the actual classification, with adjustable parameters θ, and E indicates expected value. The maximisation represents the work of the adversary, whereas the minimization represents the work needed to make the classifier robust. This setup certainly works, as long as one can minimize (35), and efforts have been put


forward to perform this adversarial training more efficiently [33]. However, as pointed out in [12], in order to work perfectly it would require adding to the training set a number of adversarial examples exponential in the dimension of the problem. Thus, additional strategies are worth exploring.
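For a linear model and an ℓ∞ perturbation set S, the inner maximisation of (35) has a closed-form 'fast gradient sign' solution, which allows a compact classical sketch of adversarial training. All names and hyperparameters here are illustrative choices of ours, not taken from [8]:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: two well-separated Gaussian classes, labels y in {0, 1}.
n, m = 200, 2
X = np.vstack([rng.normal(-1, 0.5, (n // 2, m)),
               rng.normal(1, 0.5, (n // 2, m))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(m)
eps, lr = 0.3, 0.1
for _ in range(300):
    # Inner max of Eq. (35): for logistic loss and a linear model,
    # grad_x L = (sigma(x.theta) - y) * theta, and the worst l_inf
    # perturbation of size eps is eps * sign(grad_x L).
    grad_x = (sigmoid(X @ theta) - y)[:, None] * theta[None, :]
    X_adv = X + eps * np.sign(grad_x)
    # Outer min: one gradient-descent step on the adversarial loss.
    grad_theta = X_adv.T @ (sigmoid(X_adv @ theta) - y) / n
    theta -= lr * grad_theta

acc = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
assert acc > 0.9
```

The min-max structure is visible in the loop: the attacker moves each point by ε in the worst direction, and the learner then descends on the loss at those perturbed points.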

It is also worth noticing that adversarial attacks are gaining attention in the quantum community. Recently, it has been shown that this phenomenon is also present in the case of quantum classifiers [34], where the dimension plays a very important role: the higher the dimension, the easier it is to carry out adversarial attacks. Some experimental work along this line is done in [35].

Our work is additionally strongly related to several forms of active learning algorithms. Active learning algorithms are those where the classifier can ask for new points to be classified and added to its training data base. This field can be divided into two main branches [36]: query synthesis, where new examples are created, and sampling. The latter is subdivided into stream-based and pool-based sampling. In stream-based sampling one selects one item at a time and decides whether it is worth the cost of classifying it [37]. In pool-based sampling, examples are sampled from a large pool of unlabelled data [36]. As we will see, our algorithm can be used both for pool-based sampling and for query synthesis, depending on the points used. The most typical strategies to select the samples are uncertainty sampling, which selects points with maximum uncertainty about which class they belong to [38], and query-by-committee, where the space of classifiers that agree with the data is halved sequentially [39]. Finally, there is also the strategy of expected error reduction [40], which selects those points that, on average, weighted according to probability, make the loss function as low as possible. This last procedure is similar to ours, except that instead of minimizing the loss function, we select points that in expectation make the margin of the SVM as low as possible.

Finally, let us make a brief introduction to quantum algorithms used for quantum machine learning, since the quantum SVM that we have used throughout the text is just one of the earlier works in this field. The possibility of applying quantum techniques in the machine learning and artificial intelligence domain has attracted much interest in the last decade, and a review of these efforts is [41]. For example, following the work on quantum SVMs, various kernel methods have been developed to improve Support Vector Machines, such as [42] or [43]. Most of these efforts have focused on approaches that can be implemented on NISQ devices. For instance, the latter reference, [43], is specially focused on constructing a universal quantum classifier with a minimal amount of quantum resources, using a technique called data re-uploading.

Other machine learning techniques that have been quantized include Bayesian learning [44], where a polynomial speedup is expected thanks to the use of linear algebra techniques such as variations of HHL.

Finally, although considerable effort has been put into quantizing machine learning algorithms such as Support Vector Machines and kernel methods, given the success of Deep Learning (i.e., neural network) techniques in tackling many problems, including image recognition and text translation, significant effort has also been put into quantizing them. One example of this is [45], as is the recent publication of TensorFlow Quantum [46], the TensorFlow library for working with quantum neural networks. One particular Deep Learning technique that has received much attention in both the classical and quantum machine learning communities is the generative adversarial network [47]. These networks are composed of two competing parts, a generator and a discriminator, together with a distribution D. The generator tries to generate a distribution resembling D as closely as possible, and the discriminator tries to differentiate between the actual D and the one generated by the generator. The game ends when the generator has fully learned D, and precisely this is the reason why generative adversarial networks have been devised as a method to prepare quantum states efficiently.

However, we would like to emphasise that, although somewhat related to adversarial examples via adversarial training, the setup in the latter case is different. For example, in the setting of adversarial examples our classifier does not need to know how to generate any state of the distribution, but only how to classify an input into one of the classes. In this respect, there is no generative device in adversarial examples, except perhaps the adversary who is trying to fool the classifier, whereas in the generative adversarial setup there is a single distribution, and therefore a single class. Perhaps more importantly, adversarial examples seem to be a general and somewhat unfortunate feature of many different kinds of classifiers, whereas generative adversarial networks are a particular, although useful, technique.

ACKNOWLEDGEMENTS

We would like to thank Santiago Varona for useful comments on the manuscript, as well as Jaime Sevilla, Nikolas Bernaola and Javier Prieto for pointing us to useful statistics results for Appendix B. We acknowledge financial support from the Spanish MINECO grants MINECO/FEDER Projects FIS2017-91460-EXP, PGC2018-099169-B-I00 FIS-2018 and from CAM/FEDER Project No. S2018/TCS-4342 (QUITEMAD-CM). The research of M.A.M.-D. has been partially supported by the U.S. Army Research Office through Grant No. W911NF-14-1-0103. P. A. M. C. thanks the support of a FPU MECD Grant.

[1] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum algorithms for supervised and unsupervised machine learning,” arXiv preprint arXiv:1307.0411, 2013.

[2] M. Schuld, I. Sinayskiy, and F. Petruccione, “An introduction to quantum machine learning,” Contemporary Physics, vol. 56, no. 2, pp. 172–185, 2015.

[3] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, “Quantum machine learning,” Nature, vol. 549, no. 7671, pp. 195–202, 2017.

[4] G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, “Quantum speedup for active learning agents,” Physical Review X, vol. 4, no. 3, p. 031002, 2014.

[5] A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, “Active learning machine learns to create new quantum experiments,” Proceedings of the National Academy of Sciences, vol. 115, no. 6, pp. 1221–1226, 2018.

[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.

[7] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[8] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, “Adversarially robust generalization requires more data,” in Advances in Neural Information Processing Systems, pp. 5019–5031, 2018.

[9] A. Sinha, H. Namkoong, and J. Duchi, “Certifying some distributional robustness with principled adversarial training,” arXiv preprint arXiv:1710.10571, 2017.

[10] E. Wong and J. Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” arXiv preprint arXiv:1711.00851, 2017.

[11] A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” arXiv preprint arXiv:1801.09344, 2018.

[12] M. Khoury and D. Hadfield-Menell, “On the geometry of adversarial examples,” arXiv preprint arXiv:1811.00525, 2018.

[13] V. Giovannetti, S. Lloyd, and L. Maccone, “Quantum random access memory,” Physical Review Letters, vol. 100, no. 16, p. 160501, 2008.

[14] P. Rebentrost, M. Mohseni, and S. Lloyd, “Quantum support vector machine for big data classification,” Physical Review Letters, vol. 113, no. 13, p. 130503, 2014.

[15] A. M. Childs, R. Kothari, and R. D. Somma, “Quantum algorithm for systems of linear equations with exponentially improved dependence on precision,” SIAM Journal on Computing, vol. 46, no. 6, pp. 1920–1950, 2017.

[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[17] G. Brassard, P. Hoyer, M. Mosca, and A. Tapp, “Quantum amplitude amplification and estimation,” Contemporary Mathematics, vol. 305, pp. 53–74, 2002.

[18] D. W. Berry, A. M. Childs, and R. Kothari, “Hamiltonian simulation with nearly optimal dependence on all parameters,” in 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 792–809, IEEE, 2015.

[19] M. Schuld and F. Petruccione, Supervised Learning with Quantum Computers, vol. 17. Springer, 2018.

[20] A. W. Harrow, A. Hassidim, and S. Lloyd, “Quantum algorithm for linear systems of equations,” Physical Review Letters, vol. 103, no. 15, p. 150502, 2009.

[21] S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum principal component analysis,” Nature Physics, vol. 10, no. 9, pp. 631–633, 2014.

[22] A. M. Childs, “On the relationship between continuous- and discrete-time quantum walk,” Communications in Mathematical Physics, vol. 294, no. 2, pp. 581–603, 2010.

[23] A. Ambainis, “Variable time amplitude amplification and quantum algorithms for linear algebra problems,” in STACS’12 (29th Symposium on Theoretical Aspects of Computer Science), vol. 14, pp. 636–647, LIPIcs, 2012.

[24] C. Wang and L. Wossnig, “A quantum algorithm for simulating non-sparse Hamiltonians,” arXiv preprint arXiv:1803.08273, 2018.

[25] A. Montanaro and S. Pallister, “Quantum algorithms and the finite element method,” Physical Review A, vol. 93, no. 3, p. 032324, 2016.

[26] I. Kerenidis and A. Prakash, “Quantum recommendation systems,” in Proceedings of the 8th Innovations in Theoretical Computer Science Conference, 2017.

[27] P. Casares and M. Martin-Delgado, “A quantum IP predictor-corrector algorithm for linear programming,” arXiv preprint arXiv:1902.06749, 2019.

[28] D. Jackson, The Theory of Approximation, vol. 11. American Mathematical Soc., 1930.

[29] A. M. Childs and R. Kothari, “Limitations on the simulation of non-sparse Hamiltonians,” arXiv preprint arXiv:0908.4398, 2009.

[30] J. M. Arrazola, A. Delgado, B. R. Bardhan, and S. Lloyd, “Quantum-inspired algorithms in practice,” arXiv preprint arXiv:1905.10415, 2019.

[31] E. Tang, “A quantum-inspired classical algorithm for recommendation systems,” arXiv preprint arXiv:1807.04271, 2018.

[32] I. Kerenidis and A. Prakash, “Quantum recommendation systems,” arXiv preprint arXiv:1603.08675, 2016.

[33] C. Qin, J. Martens, S. Gowal, D. Krishnan, K. Dvijotham, A. Fawzi, S. De, R. Stanforth, and P. Kohli, “Adversarial robustness through local linearization,” in Advances in Neural Information Processing Systems, pp. 13824–13833, 2019.

[34] N. Liu and P. Wittek, “Vulnerability of quantum classification to adversarial perturbations,” arXiv preprint arXiv:1905.04286, 2019.

[35] S. Lu, L.-M. Duan, and D.-L. Deng, “Quantum adversarial machine learning,” arXiv preprint arXiv:2001.00030, 2019.

[36] L. Wang, X. Hu, B. Yuan, and J. Lu, “Active learning via query synthesis and nearest neighbour search,” Neurocomputing, vol. 147, pp. 426–434, 2015.

[37] L. E. Atlas, D. A. Cohn, and R. E. Ladner, “Training connectionist networks with queries and selective sampling,” in Advances in Neural Information Processing Systems, pp. 566–573, 1990.

[38] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in SIGIR ’94, pp. 3–12, Springer, 1994.

[39] P. Melville and R. J. Mooney, “Diverse ensembles for active learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, p. 74, ACM, 2004.

[40] N. Roy and A. McCallum, “Toward optimal active learning through Monte Carlo estimation of error reduction,” ICML, Williamstown, pp. 441–448, 2001.

[41] V. Dunjko and H. J. Briegel, “Machine learning & artificial intelligence in the quantum domain: a review of recent progress,” Reports on Progress in Physics, vol. 81, no. 7, p. 074001, 2018.

[42] M. Schuld and N. Killoran, “Quantum machine learning in feature Hilbert spaces,” Physical Review Letters, vol. 122, no. 4, p. 040504, 2019.

[43] A. Perez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, “Data re-uploading for a universal quantum classifier,” Quantum, vol. 4, p. 226, 2020.

[44] Z. Zhao, A. Pozas-Kerstjens, P. Rebentrost, and P. Wittek, “Bayesian deep learning on a quantum computer,” Quantum Machine Intelligence, vol. 1, no. 1-2, pp. 41–51, 2019.

[45] E. Farhi and H. Neven, “Classification with quantum neural networks on near term processors,” arXiv preprint arXiv:1802.06002, 2018.

[46] M. Broughton, G. Verdon, T. McCourt, A. J. Martinez, J. H. Yoo, S. V. Isakov, P. Massey, M. Y. Niu, R. Halavati, E. Peters, et al., “TensorFlow Quantum: A software framework for quantum machine learning,” arXiv preprint arXiv:2003.02989, 2020.

[47] C. Zoufal, A. Lucchi, and S. Woerner, “Quantum generative adversarial networks for learning and loading random distributions,” npj Quantum Information, vol. 5, no. 1, pp. 1–9, 2019.

[48] S. Chakraborty, A. Gilyen, and S. Jeffery, “The power of block-encoded matrix powers: improved regression techniques via faster Hamiltonian simulation,” arXiv preprint arXiv:1804.01973, 2018.

[49] E. W. Weisstein, “Normal difference distribution.” http://mathworld.wolfram.com/NormalDifferenceDistribution.html.

[50] “Distribution free confidence intervals for any percentile.” https://online.stat.psu.edu/stat414/node/317/.

[51] D. Nagaj, P. Wocjan, and Y. Zhang, “Fast amplification of QMA,” arXiv preprint arXiv:0904.1549, 2009.

Appendix A: Definitions and some useful results

In the main text we gave informal definitions of the concepts needed to understand Theorem 1. Here we give them rigorously. The first concept we need to introduce is that of a δ-cover of a manifold, which will be used in that theorem.

Definition 3. Given a manifold M, a δ-cover of such manifold is a set of balls of radius δ, ⋃_{i∈I} B(x_i, δ), such that ∀x ∈ M ∃j ∈ I with x ∈ B(x_j, δ).

Additionally, in order to understand the meaning of an adversarial example, we need to define an ε-neighbour:

Definition 4. Given a manifold M ⊂ R^m, an ε-neighbour of M in the norm p, M_ε, is the set of points x ∈ R^m such that the p-distance of x to M fulfils d_p(x, M) ≤ ε.

Finally, an adversarial example is defined as

Definition 5. Let M ⊂ R^m be a manifold containing several disjoint parts called classes C_i, separated by a distance r_p > ε in the norm p, and let f be a classifier reasonably well trained to distinguish between those classes. An ε-adversarial example of such classifier in the norm p is a point x whose p-distance to a given class C_0 is smaller than ε, but which is classified to be in class C_1. That is, d_p(x, C_0) ≤ ε and f(x) = C_1.

With all the previous definitions, we can state Theorem 1, which guarantees resistance to adversarial examples.

Theorem 1. (Khoury and Hadfield-Menell) [12] Let M ⊂ R^m be a k-dimensional manifold that contains each of the classes, and let r_p be the minimum separation distance between two classes in norm p, r_p > ε. Let L be a learning algorithm, and f_L the classifier it produces. Assume that for any point x in the training set X_L with label y, and any point x̂ ∈ B(x, r_p), the learning algorithm classifies f_L(x̂) = f_L(x) = y.

We then have the following guarantee: if X_L is a δ-cover for δ < r_p − ε, then f_L correctly classifies M_ε, that is, an ε-neighbour of M.

In this appendix we also give an overview of the quantum-accessible data structure that we use to perform the Hamiltonian simulation of dense matrices.

Theorem 2. [26, 48] Let M ∈ R^{n′×n′} be a matrix, and let w be the number of nonzero entries. Then there exists a quantum-accessible data structure of size O(w log²(n′²)), which takes time O(log(n′²)) to store or update a single entry. Using this data structure, there is a quantum algorithm able to perform the following maps with error ε:

U_M : |i⟩|0⟩ → (1/||M_{i·}||) Σ_j M_{ij} |ij⟩;  (A1)

U_N : |0⟩|j⟩ → (1/||M||_F) Σ_i ||M_{i·}|| |ij⟩;  (A2)

where ||M_{i·}|| is the l₂-norm of row i of M. The complexity of this algorithm is O(poly log(n′/ε)). This means in particular that given a vector v in this data structure, we can prepare an ε-approximation of it, (1/||v||₂) Σ_i v_i |i⟩, in time O(poly log(n′/ε)).

The previous data structure can be used for dense Hamiltonian simulation [24], which in turn feeds into the main algorithm of [15].

Theorem 3. [24] Hamiltonian simulation of dense Hamiltonians. Let H be a 2^n-dimensional Hamiltonian stored in the quantum-accessible data structure. Then there is a quantum algorithm able to simulate it in complexity

O( tΛn log^{5/2}(tΛ/ε) · log(t||H||/ε) / log log(t||H||/ε) ),  (A3)

for t the simulation time, ε the precision and Λ = max{||H||, ||H||₁}, taking into account that ||H||₁ ≤ √(2^n) ||H||, but taking ||H||, the spectral norm, as the measure of cost.

Hamiltonian simulation is very useful for solving linear systems of equations. The first and most famous of those algorithms is usually called HHL after its discoverers Harrow, Hassidim and Lloyd.

Theorem 4. [20] (HHL) Let M be an n′ × n′ Hermitian matrix (if the matrix is not Hermitian it can be included as a submatrix of a Hermitian one) with condition number κ and sparsity d (at most d nonzero entries in each row).

Let b be an n′-dimensional unit vector, and assume that there is an oracle P_b which produces the state |b⟩, and another P_M which, taking (r, i) as input, outputs the location and value of the ith nonzero entry in row r of M. Let

x = M⁻¹b,  |x⟩ = x/||x||.  (A4)

Then, there is an algorithm that outputs |x⟩ in complexity O(dκ²ε⁻¹ poly log n′).

However, since it was published, more efficient methods have appeared. In particular, [23] and [15] achieved improvements in the condition number and the precision, respectively. We present here the main result of the latter.

Theorem 5. [15] Let M be an n′ × n′ Hermitian matrix (if the matrix is not Hermitian it can be included as a submatrix of a Hermitian one) with condition number κ and sparsity d (at most d nonzero entries in each row).

Let b be an n′-dimensional unit vector, and assume that there is an oracle P_b which produces the state |b⟩, and another P_M which, taking (r, i) as input, outputs the location and value of the ith nonzero entry in row r of M. Let

x = M⁻¹b,  |x⟩ = x/||x||.  (A5)

Then, [15] construct an algorithm relying on Hamiltonian simulation that outputs the state |x⟩ up to precision ε, with constant probability of failure (i.e., independent of the problem parameters), and makes

O(dκ log^{2.5}(κ/ε))  (A6)

uses of P_M and

O(κ √(log(κ/ε)))  (A7)

uses of P_b; and has overall time complexity

O(dκ poly(log(n′dκ/ε))).  (A8)

Also in the same article, [15], the authors present Lemmas 15 and 16, which we use.


Lemma 1. [15] Let |λ⟩ be an eigenvector of H = M/d, with M an n × n matrix with sparsity d, and let λ ∈ (−1, +1) be the corresponding eigenvalue. Within the invariant subspace spanned by T|λ⟩ and ST|λ⟩, the walk operator W has block form

( λ          −√(1−λ²) )
( √(1−λ²)    λ        ).  (A9)

In particular, this has the consequence that

W^j T|λ⟩ = T_j(λ) T|λ⟩ + √(1−λ²) U_{j−1}(λ) |⊥_λ⟩,  (A10)

where T_j is the jth Chebyshev polynomial of the first kind and U_{j−1} is of the second kind, and |⊥_λ⟩ is a state in the span of T|λ⟩ and ST|λ⟩ perpendicular to T|λ⟩.

Notice that this lemma is the origin of (31). Finally, let us state the Amplitude Estimation theorem, which we use throughout the text.

Theorem 6. (Amplitude Estimation) [17] For any positive integer k, the Amplitude Estimation algorithm outputs an estimate 0 ≤ ã ≤ 1 of the desired amplitude a such that

|ã − a| ≤ 2πk √(a(1−a))/J + k² π²/J²,  (A11)

with success probability at least 8/π² for k = 1, and with success probability greater than 1 − 1/(2(k−1)) for k ≥ 2. Here J is defined as the number of implementations we need of the oracle that tells whether an element is marked. Also, if a = 0 then ã = 0, and if a = 1 and J is even, then ã = 1.

Appendix B: Comparison with other probabilistic classical strategies

In the main text we have focused on the particular problem of sampling the most relevant points to determine the distance between different classes in the context of classification. Here we focus on a more general setting, where one is given an oracle scoring function s and wants to sample the points of the space with the highest score, with exponentially high probability. We will see that some quadratic advantage is possible for non-deterministic functions s.

Imagine we have an m-dimensional space S, which we discretise with n₀ points per dimension so that in the end it contains N = n₀^m points. This is for example what happens when we have the space of images of n₀ = n_p × n_p pixels, each displaying 256 possible values. If the image is in colour, then each pixel is assigned three such values ranging from 0 to 255, indicating the coordinates (red, green, blue), so that the number of possible points or images in the space would be N = 256^{3n₀} = 256^{3n_p²}.

Another example could be, in the setting of reinforcement learning, the range of policies parametrised by n parameters, each of which can again take a range of discrete values. The policy of an agent is a function that takes as input the state of the agent, or the observation it makes, and outputs an action which seeks to maximize the reward the agent gets. In this case the space S would be the space of possible policies. As we can see, in both cases the number of points in the space is exponential in the dimension.

Suppose that we also have some kind of scoring function, which we shall call s : S → R, which may or may not be deterministic, and is given to us as an oracle. The problem we aim to solve is finding a point which is in the top 1/C fraction of points attaining the highest score, with some exponentially high probability. That is, we want to find one point x₀ such that P(s(x₀) in top 1/C) = 1 − p with p = e^{−β}. Thus, C and β are the parameters we want to vary.

If the function s is probabilistic, when evaluated on a point x it will output a score that follows a probability distribution D(µ_x, σ²) with average µ_x and variance σ². In this case the objective is to achieve the same guarantees for the average value of the evaluation of x₀. In other words, P(µ_{x₀} in top 1/C) = 1 − p with p = e^{−β}. Let us then summarise the problem.

Problem 2. (Sampling with statistical guarantees). Let S be an m-dimensional discrete space. Let s : S → R be a scoring function that is given to us as an oracle. The problem is to return a point x that, with probability 1 − e^{−β}, is in the percentile p_C = 1 − 1/C of s. If s is non-deterministic, but rather s(x) ∼ D(µ_x, σ²), with D a probability distribution, then return a point x such that µ_x is in the previously mentioned percentile.

In the main text we initially compared our strategy with the most naive classical one: if we want certainty that the point we choose really is in the top 1/C, we have to sample O(N) of the points. However, one could argue that an exponentially small probability of the point not being in the top 1/C of the most interesting ones is in fact a better comparison against our algorithm. After all, our algorithm does also achieve an exponentially small probability of choosing the wrong point.

This appendix aims to propose a quantum strategy that allows us to efficiently sample one such point even if the variance σ² is large when compared against the range of values that µ_x may take, and to compare it against classical strategies. We will see that the quantum strategy obtains a quadratic speedup on some parameters, due to the use of Amplitude Estimation and Amplitude Amplification [17].

Let us start with the most naive of the classical strategies. If we try to sample points from the space until we get certainty that the best point of those sampled is one of the sought ones, the cost will be exponential even in the deterministic case, taking O(N(1 − 1/C)) evaluations of s.

But if we only want statistical guarantees then one can do much better. The first of the probabilistic classical strategies can be called ‘greedy’.

1. The classical greedy strategy

Suppose that we are in the deterministic case. The greedy strategy consists of the following process:

1. Uniformly at random, sample a single point x from the entire space S of possible candidate points.

2. Evaluate s(x).

3. If s(x) is higher than any of the previously evaluated scores, substitute the previous maximum by the pair (x, s(x)).
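The three steps above can be sketched in a few lines; this is only an illustrative sketch in which the scoring oracle `s` and the candidate `space` are placeholders.

```python
import random

def greedy_best(s, space, c):
    """Greedy strategy: sample c points uniformly at random from `space`
    and keep the one with the highest (deterministic) score s(x)."""
    best_x, best_score = None, float("-inf")
    for _ in range(c):
        x = random.choice(space)          # step 1: uniform sample
        score = s(x)                      # step 2: evaluate the oracle
        if score > best_score:            # step 3: keep the running maximum
            best_x, best_score = x, score
    return best_x, best_score

# With c = O(beta * C) samples, the best of them fails to be in the top
# 1/C fraction with probability at most (1 - 1/C)^c ~ e^{-beta}.
random.seed(0)
print(greedy_best(lambda x: x, list(range(1000)), c=200))
```

Each sample independently lands in the top 1/C with probability 1/C, which is exactly the failure analysis carried out below.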

By hypothesis we want to get a point that is in the top 1/C, which means that each time we sample a point, we have a 1/C probability of spotting one. As this process keeps the best of c candidates, the probability that the best one is not in the top 1/C decreases as (1 − 1/C)^c. Complexity-wise, this means that we want to find c such that

p > (1 − 1/C)^c,  (B1)

with p the probability of outputting a wrong point. Thus

c > log p / log(1 − 1/C).  (B2)

Obviously both the numerator and the denominator are negative. Let us analyse this complexity by parts. First notice that

lim_{C→∞} [−1/log(1 − 1/C)] / C = 1,  (B3)

which means that −1/log(1 − 1/C) = O(C), matching the complexity of our algorithm for the C variable. On the other hand, as we make p → 0, −log p goes very quickly to infinity. However, if we make p exponentially small, p = e^{−β}, then −log p = −log e^{−β} = β. This implies that c = O(βC).

Let us now consider the case where s is not deterministic but rather follows a distribution D(µ_x, σ²). In such a case the greedy procedure becomes more complicated.

Let us introduce some notation. Assume that we name the points x₀, ..., x_c such that for the first evaluation s(x₀) ≥ ... ≥ s(x_c). Also denote by m_i the number of times x_i has been evaluated, since we will need to evaluate the same x_i several times.

In order to analyse the case of a non-deterministic function s, we will make use of the Central Limit Theorem:

Theorem 7. Central Limit Theorem. Let X be a random variable that follows some probability distribution with average µ and variance σ² < ∞. If we sample n times independently from such probability distribution, the sample average µ̄ will approximately follow, for n large, a distribution with average µ and variance σ²/n.
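The σ²/n variance reduction of Theorem 7 is easy to check numerically; the snippet below is only an empirical illustration, with all parameters chosen for the demonstration.

```python
import random
import statistics

# Empirical check of Theorem 7: the m-sample average of a distribution
# with variance sigma^2 has variance close to sigma^2 / m.
random.seed(1)
sigma, m, trials = 2.0, 50, 2000
averages = [statistics.fmean(random.gauss(0.0, sigma) for _ in range(m))
            for _ in range(trials)]
var_of_avg = statistics.pvariance(averages)
print(var_of_avg, sigma ** 2 / m)  # the two numbers should be close
```

This is the mechanism the greedy strategy relies on: repeated evaluations of the same point shrink the uncertainty of its estimated average score.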

Now, suppose that we have called the oracle function s m₀ times for x₀, and m₁ times for x₁. That means that the sample averages µ̄₀ and µ̄₁ will follow distributions with averages µ₀ and µ₁ and variances σ²/m₀ and σ²/m₁. Notice that the probability distribution of the difference of two normal distributions is again a normal distribution, with mean µ₀ − µ₁ and variance σ²/m₀ + σ²/m₁ [49]. Thus, the probability that in fact µ₀ ≥ µ₁ is given by

∫₀^∞ N(µ₀ − µ₁, σ²/m₀ + σ²/m₁),  (B4)

with N the normal distribution. The previous expression is equal to

[1/2 + (1/2) erf((z − (µ₀ − µ₁)) / √(σ²/m₀ + σ²/m₁))]₀^∞,  (B5)

with z the variable and erf the error function. Simplifying, as the error function is antisymmetric,

P₁ = 1/2 + (1/2) erf((µ₀ − µ₁) / √(σ²/m₀ + σ²/m₁)).  (B6)

One would like this to be larger than 1 − p′ such that 1 − p ≤ (1 − p′)^c and 1 − p = 1 − e^{−β}, so

1/2 + (1/2) erf((µ₀ − µ₁) / √(σ²/m₀ + σ²/m₁)) ≥ 1 − p′  (B7)

implies

erf((µ₀ − µ₁) / √(σ²/m₀ + σ²/m₁)) ≥ 1 − 2p′,  (B8)

and

σ²/m₀ + σ²/m₁ ≤ ((µ₀ − µ₁) / erf⁻¹(1 − 2p′))².  (B9)

Notice that when p → 0, the denominator goes to infinity. Since p′ ≥ 1 − (1 − p)^{1/c}, that means that erf⁻¹(1 − 2p′) = erf⁻¹(−1 + 2(1 − p)^{1/c}). Then, if we take c = O(Cβ),

lim_{β→∞} erf⁻¹(−1 + 2(1 − e^{−β})^{1/c}) / √β = lim_{β→∞} erf⁻¹(−1 + 2(1 − e^{−β}/c)) / √β = 1.  (B10)

Notice however that the above expression does not depend on c. If m₀ = m₁, then erf⁻¹(1 − 2p′) = O(√β) and

m₁ = m₀ ≥ (2σ² / (µ₀ − µ₁)²) O(β).  (B11)

Notice that we have to check this for every pair (m₀, m_i) for all i ∈ {1, ..., c}, and as we had said that c = O(βC), the overall complexity is lower bounded by Ω(β²C). This result holds even if m₀ ≠ m₁, as can be easily checked. In contrast, we shall see that using quantum procedures we can achieve a complexity O(βC).

Another interesting point to extract from the previous equation is that the complexity is greatly reduced whenever σ² ≪ (µ₀ − µ₁)². In particular, when σ² = 0 we recover the limit of the oracle function s being deterministic.

But one may argue that one could actually generalise the strategy. With this we mean that the objective is checking that µ₀ is in the top 1/C, not that it is better than any other µ_i. Thus, the way to think about it is the probability of at least one of the c points being in the top 1/C, times the probability of such point being x₀; plus the probability of at least two points being in the top 1/C, times the probability of x₀ being the second best of the c points sampled, and so on. Thus, calling P_i the value of (B6) when taken for points x₀ and x_i,

1 − p ≤ Σ_{n=1}^{c} (c choose n) (1/C)^n (1 − 1/C)^{c−n} · Σ_{j₁<...<j_{n−1}} Π_{k=1}^{n} (1 − P_{j_k}) Π_{i≠j₁,...,j_{n−1}} P_i.  (B12)

The idea here would be to calculate c, m₀, ..., m_c fulfilling the previous equation while minimising Σ_{i=0}^{c} m_i. However, optimizing m₁, ..., m_c over the previous set can be complicated. This suggests using a different approach, which we review in the following section.

2. Threshold strategy

In the previous section we followed a greedy strategy that was quite efficient when the function s was deterministic, but became more involved when the output of s followed a distribution with variance σ² relatively big compared to the distance (µ₀ − µ₁)².

Thus, in this last case, another good strategy could be:

1. Repeatedly sample points x ∈ S uniformly at random and apply s(x), in order to estimate the percentile 1 − 1/C of the evaluations of s, p_C, with error ε and exponentially good probability 1 − p = 1 − e^{−β}.

2. Find a single point x and evaluate s(x) m_x times to make sure that, with exponentially high probability, x is above the 1 − 1/C quantile.

Thus, the first step is estimating a percentile. Suppose we sample n points x₁, ..., x_n and evaluate s(x_i) for each of them, such that s(x₁) ≤ ... ≤ s(x_n). Then the best estimate of the value at the 1 − 1/C percentile is the weighted average between x_{⌈(n+1)(1−1/C)⌉} and x_{⌊(n+1)(1−1/C)⌋}, with weights |⌈(n+1)(1−1/C)⌉ − (n+1)(1−1/C)| and |⌊(n+1)(1−1/C)⌋ − (n+1)(1−1/C)| respectively. To clarify, suppose for example that we sample 3 points s(x₁) < s(x₂) < s(x₃) from a uniform distribution and we want to calculate the 1/3 quantile. As the previous calculation and our intuition indicate, it should be the (weighted) average of x₁ and x₂, because that leaves 1/3 of the points below it.
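Under a standard linear-interpolation reading of the weights above, the percentile estimate can be sketched as follows (the function name is ours, and the clamping of out-of-range ranks is an implementation choice):

```python
import math

def percentile_estimate(scores, C):
    """Estimate the 1 - 1/C percentile of the scores as the weighted
    average of the two order statistics bracketing the fractional
    rank (n + 1)(1 - 1/C)."""
    xs = sorted(scores)
    n = len(xs)
    pos = (n + 1) * (1 - 1 / C)            # fractional rank, 1-indexed
    lo = min(max(math.floor(pos), 1), n)
    hi = min(max(math.ceil(pos), 1), n)
    w = pos - math.floor(pos)              # weight of the upper order statistic
    return (1 - w) * xs[lo - 1] + w * xs[hi - 1]

# The three-point example from the text: C = 1.5 encodes the 1/3 quantile,
# which falls between the first and second sampled points.
print(percentile_estimate([1.0, 2.0, 3.0], C=1.5))
```

For three points and the 1/3 quantile this interpolates between x₁ and x₂, as the text's example demands.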

More generally, the index i of the order statistic closest to p_C follows, for large n, a known probability distribution [50]: a normal distribution with average µ = n p_C and variance σ² = n p_C(1 − p_C). In other words, the variable

z = (i − n p_C) / √(n p_C (1 − p_C))  (B13)

follows the normal distribution N(µ = 0, σ² = 1). Since we want to estimate p_C with error ε′ = ε/C


and exponentially high probability 1 − p, this means

P( ((1 − ε′)p_C n − p_C n)/√(n p_C(1 − p_C)) ≤ z ≤ ((1 + ε′)p_C n − p_C n)/√(n p_C(1 − p_C)) ) ≥ 1 − p.  (B14)

The reason for choosing ε′ = ε/C is to be able to compare with the quantum algorithm. Simplifying,

P( −ε′√n √(p_C/(1 − p_C)) ≤ z ≤ ε′√n √(p_C/(1 − p_C)) ) ≥ 1 − p.  (B15)

Calculating the left-hand side of the equation, as µ_z = 0 and σ_z = 1,

[1/2 + (1/2) erf((z − µ_z)/(√2 σ_z))]_{−ε′√n √(p_C/(1−p_C))}^{+ε′√n √(p_C/(1−p_C))}
= (1/2) [erf(ε′√(n p_C)/√(2(1 − p_C))) − erf(−ε′√(n p_C)/√(2(1 − p_C)))]
= erf(ε′√(n p_C)/√(2(1 − p_C))) ≥ 1 − e^{−β},  (B16)

the last equality due to the error function being odd. This means we need the number of samples to grow like

n ≥ 2 ((1 − p_C)/p_C) (1/ε′²) (erf⁻¹(1 − e^{−β}))²,  (B17)

where erf⁻¹ is the inverse error function, and (1 − p_C)/p_C = (C − 1)⁻¹.

As

lim_{β→∞} erf⁻¹(1 − e^{−β}) / √β = 1,  (B18)

and ε′ = ε/C, this means that n = O(ε⁻²βC). As a second step we need to calculate the number of times we have to call the sample function s on a candidate in order to certify that, with exponentially high probability, it is above the percentile 1 − 1/C. Using the central limit theorem again, the m-sample average µ̄_x follows a normal distribution N(µ_x, σ²/m). So, given a threshold T which indicates the upper part of the confidence interval of the percentile 1 − 1/C, and supposing µ_x > T, we want

∫_T^∞ N(µ_x, σ²/m) ≥ 1 − p = 1 − e^{−β}.  (B19)

The left-hand side can be written as

[1/2 + (1/2) erf((z − µ_x)√m/(√2 σ))]_T^∞ = 1/2 − (1/2) erf((T − µ_x)√m/(√2 σ)).  (B20)

From this it can be calculated that

m ≥ (2σ²/(µ_x − T)²) (erf⁻¹(1 − 2e^{−β}))²,  (B21)

and as

lim_{β→∞} erf⁻¹(1 − 2e^{−β}) / √β = 1,  (B22)

we have m = O(β). Over how many points x do we have to repeat this procedure until we make sure that x is over the threshold? As described just after (B3), one would need to repeat this over O(C) points, although in practice it may be much less given the information we have from the previous steps. This step then has complexity O(βC). Thus, the overall complexity of this procedure is O(ε⁻²βC).
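The bounds (B17) and (B21) can be evaluated numerically. The sketch below expresses erf⁻¹ through the standard normal quantile via the identity erf⁻¹(x) = Φ⁻¹((1+x)/2)/√2; the function names and the example parameter values are ours.

```python
import math
from statistics import NormalDist

def erfinv(x):
    # erf^{-1}(x) = Phi^{-1}((1 + x) / 2) / sqrt(2), with Phi the standard
    # normal CDF, so the stdlib normal quantile suffices.
    return NormalDist().inv_cdf((1 + x) / 2) / math.sqrt(2)

def samples_for_percentile(C, eps, beta):
    # Bound (B17): n >= 2 ((1 - pC)/pC) (1/eps'^2) erfinv(1 - e^{-beta})^2,
    # with pC = 1 - 1/C and eps' = eps / C, i.e. n = O(eps^{-2} beta C).
    pC, eps_p = 1 - 1 / C, eps / C
    return 2 * (1 - pC) / pC * erfinv(1 - math.exp(-beta)) ** 2 / eps_p ** 2

def evaluations_per_point(sigma, gap, beta):
    # Bound (B21): m >= 2 sigma^2 / (mu_x - T)^2 * erfinv(1 - 2 e^{-beta})^2,
    # so m = O(beta) for a fixed gap mu_x - T.
    return 2 * sigma ** 2 / gap ** 2 * erfinv(1 - 2 * math.exp(-beta)) ** 2

print(math.ceil(samples_for_percentile(C=10, eps=0.1, beta=5)))
print(math.ceil(evaluations_per_point(sigma=1.0, gap=0.5, beta=5)))
```

Both quantities grow only linearly in β, in line with the O(ε⁻²βC) total cost derived above.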

3. Quantum Threshold Strategy

Can we do better if we use quantum methods? The answer is that we can get a quadratic advantage in some parameters: we will have an overall complexity of $O(\varepsilon^{-1}\beta C)$.

Before explaining how to do it, we need to define the equivalent version of the probabilistic oracle $s$. In this case $s: |x\rangle \to \sum_{s(x)} \sqrt{p_{x,s(x)}}\,|x\rangle|s(x)\rangle$, where the $p_{x,s(x)}$ are the probabilities, which follow the distribution $D(\mu_x, \sigma^2)$.

The required steps are the following:

1. Use bisection search, Amplitude Estimation [17] and the median lemma from [51] to estimate, with success probability $1-p = 1-2^{-\beta}$ and error $\varepsilon$, a threshold $T$ for the percentile $1-1/C$ of the image of $s$.

2. Use Amplitude Estimation [17] to mark any $|x\rangle$ for which the amplitude over threshold $T+\varepsilon$ is greater than $1/\sqrt{2}+\varepsilon$. Use the median lemma from [51] to achieve an exponentially high success probability $1-2^{-\beta}$. The time complexity would be $O(\beta\varepsilon^{-1})$.

3. Use Amplitude Amplification [17] to find all such points in time $O(\sqrt{C})$.

Since the median lemma is not widely known, let us state it here without proof.

Lemma 2 (Median Lemma [51]). Consider a sequence $\phi'_k$ for $k = 1,\dots,\beta$. Let the probability that $\phi'_k$ does not belong to $(\phi_L, \phi_R)$ be smaller than $\delta < 1/2$. Then the probability that the median of the $\phi'_k$ falls out of $(\phi_L, \phi_R)$ can be bounded above by
\[
p_{\text{fail}} \le \frac{1}{2}\left(2\sqrt{\delta(1-\delta)}\right)^{\beta} \le 2^{-\beta-1}. \tag{B23}
\]
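As a sanity check of Lemma 2, the following Python snippet (a toy Monte-Carlo experiment; the parameters $\delta = 0.1$, $\beta = 9$ are our own illustrative choices) simulates $\beta$ independent estimates, each escaping the good interval with probability $\delta$, and compares the empirical failure rate of their median with the bound (B23):

```python
import math
import random
import statistics

def median_failure_rate(delta, beta, trials=20000, seed=0):
    """Fraction of trials in which the median of beta estimates escapes the
    good interval, when each estimate escapes with probability delta.
    Good estimates sit at 0.0; bad ones are pushed to 1.0 (worst case:
    all outliers on the same side)."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        draws = [1.0 if rng.random() < delta else 0.0 for _ in range(beta)]
        if statistics.median(draws) > 0.5:  # median escaped the interval
            fails += 1
    return fails / trials

delta, beta = 0.1, 9
bound = 0.5 * (2 * math.sqrt(delta * (1 - delta))) ** beta
rate = median_failure_rate(delta, beta)
```

The empirical rate stays well below the bound, since the median only fails when a majority of the $\beta$ estimates fail simultaneously.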


Complexity            Deterministic                  Non-deterministic
Classical Greedy      $O(\beta C)$                   $\Omega(\beta^2 C)$
Classical Threshold   $O(\varepsilon^{-2}\beta C)$   $O(\varepsilon^{-2}\beta C)$
Quantum Threshold     $O(\varepsilon^{-1}\beta C)$   $O(\varepsilon^{-1}\beta C)$

TABLE I. Complexities of the different algorithms. Notice that the step of determining the threshold dominates in the second and third strategies. However, there is also a difference in the step of determining a point over the threshold: classically, the complexity would be $O(\beta C)$, whereas quantumly it would be $O(\varepsilon^{-1}\beta\sqrt{C})$.

This lemma can be used to reduce, in linear time $O(\beta)$, the probability that Amplitude Estimation with error $\varepsilon$ fails. On the other hand, the complexity of Amplitude Estimation is $O(\varepsilon^{-1})$ for a given error $\varepsilon$.

This means that, given an error tolerance $\varepsilon/C$ and a probability of failure $2^{-\beta}$, we can perform Amplitude Estimation with complexity $O(\varepsilon^{-1}\beta C)$. Finally, to fully understand step 1 above, we need to explain what we mean by bisection search. The idea is to propose a trial threshold $T'$ and check whether the percentage of points for which $s(x)$ is above $T'$ is greater or smaller than $1/C$. With that, one may refine the initial guess of $T'$ until we know that the percentage of points for which $s(x) > T$ is between $(1-\varepsilon)/C$ and $(1+\varepsilon)/C$ with exponentially high probability $1-e^{-\beta}$. The cost of bisection search is well known to be logarithmic in the precision.
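The classical counterpart of this bisection search is easy to state. The sketch below is illustrative only: it assumes noiseless access to the scores $s(x)$ of a finite pool, unlike the sampled setting of the text. It bisects a trial threshold $T'$ until roughly a fraction $1/C$ of the pool lies above it:

```python
def bisect_threshold(scores, C, tol=1e-6):
    """Bisection search for a threshold T such that about a fraction 1/C
    of `scores` lies above T, i.e. the 1 - 1/C percentile."""
    lo, hi = min(scores), max(scores)
    while hi - lo > tol:
        T = (lo + hi) / 2
        frac_above = sum(s > T for s in scores) / len(scores)
        if frac_above > 1 / C:
            lo = T  # too many points above: raise the threshold
        else:
            hi = T  # few enough points above: lower the threshold
    return (lo + hi) / 2
```

Each iteration halves the search interval, so the cost is logarithmic in the desired precision, as stated above.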

Overall, the complexity of each strategy can be seen in Table I.

Appendix C: Estimating the norm of a vector after the application of a linear combination of unitaries or after the application of a Chebyshev polynomial

In appendix A of [25] it is explained how to estimate the norm of the solution vector of the HHL algorithm, as we already indicated in equation (20). However, at the end of that appendix the authors indicate that they were not sure how to estimate the norm of the solution using other improvements to HHL such as [15, 23]. Here we explain how to estimate the norm of any vector $M|v\rangle$, for $M$ any matrix that we can decompose as a linear combination of unitaries, $M = \sum_i \alpha_i U_i$, and $|v\rangle$ a pure quantum state; or when we apply the Chebyshev approach indicated in the main text.

Firstly, let us briefly review the simplest case of the LCU technique. Let $M = U_0 + U_1$. Then we can perform
\[
|0\rangle|v\rangle \xrightarrow{H} \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)|v\rangle. \tag{C1}
\]

Now we perform the so-called multi-$U$ operator, $\sum_i |i\rangle\langle i| \otimes U_i$,
\[
\frac{1}{\sqrt{2}}(|0\rangle U_0 + |1\rangle U_1)|v\rangle, \tag{C2}
\]

and performing a second Hadamard on the first register we get
\[
\frac{1}{2}\left[|0\rangle(U_0 + U_1) + |1\rangle(U_0 - U_1)\right]|v\rangle. \tag{C3}
\]

Postselecting on measuring $|0\rangle$ on the first register, we get a state proportional to $M|v\rangle$. To calculate the probability, let us write it in density-matrix form:
\[
\begin{split}
\rho = \frac{1}{4}\big[&|0\rangle\langle 0|\,(U_0+U_1)|v\rangle\langle v|(U_0+U_1)^\dagger \\
+ &|0\rangle\langle 1|\,(U_0+U_1)|v\rangle\langle v|(U_0-U_1)^\dagger \\
+ &|1\rangle\langle 0|\,(U_0-U_1)|v\rangle\langle v|(U_0+U_1)^\dagger \\
+ &|1\rangle\langle 1|\,(U_0-U_1)|v\rangle\langle v|(U_0-U_1)^\dagger\big].
\end{split} \tag{C4}
\]

To calculate the probability of measuring $0$, we need the trace of $P_0\rho P_0$, where $P_0 = |0\rangle\langle 0|\otimes\mathbb{1}$, which happens to be $\langle v|(U_0+U_1)^\dagger(U_0+U_1)|v\rangle/4$. This is the same calculation used to obtain the probability of measuring a $|0\rangle$ in the Swap Test.
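The postselection probability above can be checked directly with a few lines of Python. The snippet below is a toy example of our own choosing, taking $U_0 = \mathbb{1}$ and $U_1 = X$ (Pauli-X) on a single qubit; it computes the amplitude block attached to $|0\rangle$ after the circuit (C1)-(C3):

```python
def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Illustrative single-qubit unitaries
I2 = [[1.0, 0.0], [0.0, 1.0]]   # identity
X  = [[0.0, 1.0], [1.0, 0.0]]   # Pauli-X

def lcu_prob0(U0, U1, v):
    """Probability of measuring |0> on the ancilla after the circuit
    Hadamard -> multi-U -> Hadamard; the |0> block carries (U0 + U1)|v>/2."""
    block = [(a + b) / 2 for a, b in zip(matvec(U0, v), matvec(U1, v))]
    return sum(abs(a) ** 2 for a in block)

p = lcu_prob0(I2, X, [1.0, 0.0])   # <v|(U0+U1)†(U0+U1)|v>/4 = 2/4 = 0.5
```

Here $(I + X)|0\rangle = |0\rangle + |1\rangle$ has squared norm 2, so the postselection succeeds with probability $2/4 = 0.5$, in agreement with the trace computation above.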

Let us use this to generalise to the setting where $M = \sum_i \alpha_i U_i$. The first step, analogous to (C1), is applying an operator $V$ that prepares the coefficients,
\[
|0\rangle|v\rangle \xrightarrow{V\otimes\mathbb{1}} \frac{1}{\sqrt{\sum_i \alpha_i^2}}\sum_i \alpha_i |i\rangle|v\rangle. \tag{C5}
\]

This implies that we will have to correct the norm we measure by $\sqrt{\sum_i \alpha_i^2}$. Then we apply the multi-$U$, $\sum_i |i\rangle\langle i|\otimes U_i$:
\[
\frac{1}{\sqrt{\sum_i \alpha_i^2}}\sum_i \alpha_i |i\rangle U_i|v\rangle. \tag{C6}
\]

Next we perform a Hadamard transform, $H|i\rangle = \sum_j (-1)^{i\cdot j}|j\rangle$ (up to normalisation), over the first register,
\[
\frac{1}{\sqrt{\sum_i \alpha_i^2}}\sum_{i,j} (-1)^{i\cdot j}\alpha_i |j\rangle U_i|v\rangle. \tag{C7}
\]

Using the same argument as in the simpler case, the amplitude of the first register being in state $|0\rangle$ is
\[
A_0 = \frac{1}{\sqrt{\sum_i \alpha_i^2}}\left\|\Big(\sum_i \alpha_i U_i\Big)|v\rangle\right\|. \tag{C8}
\]


Therefore, in the same way that the norm of the solution can be calculated using (20), we can calculate the norm of the solution in this case using
\[
\|Mv\| = A_0\,\|v\|\sqrt{\sum_i \alpha_i^2}. \tag{C9}
\]

Clearly, estimating a given amplitude requires the Amplitude Estimation procedure, with cost $O(\varepsilon^{-1})$, and this can be applied, for example, to the Quantum Linear System Algorithm of [15].

Next we want to calculate the factor by which to correct the norm of a vector $|v\rangle$ after applying a Chebyshev polynomial according to (31). First notice that if $|v\rangle = |\lambda\rangle$, for $\lambda = 1$ the largest eigenvalue of $M$, then using (31) will leave the state $|\lambda\rangle$ unchanged, due to Lemma 1. This implies that we will have to correct the norm of the solution by the largest eigenvalue $\lambda_{\max}$, which divides $M$ in order to ensure that the largest eigenvalue of $M/\lambda_{\max}$ is 1.

In such a case, one can calculate the equivalent of (C9) as
\[
\|Mv\| = A_0\,\|v\|\,\lambda_{\max}, \tag{C10}
\]
where $A_0$ is the estimated amplitude of the correct state in (31).
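For a small symmetric matrix, the rescaling by $\lambda_{\max}$ can be illustrated directly. The Python fragment below is a minimal sketch with a $2\times 2$ example of our own choosing: it computes $\lambda_{\max}$ in closed form and checks that $M/\lambda_{\max}$ has spectral radius 1, as Lemma 1 requires:

```python
import math

def eigvals_sym2(M):
    """Closed-form eigenvalues of a real symmetric 2x2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    mean = (a + c) / 2
    dev = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean - dev, mean + dev

M = [[2.0, 1.0], [1.0, 3.0]]                      # illustrative matrix
lam_max = max(abs(l) for l in eigvals_sym2(M))    # largest |eigenvalue|
M_scaled = [[x / lam_max for x in row] for row in M]
# the spectral radius of M_scaled is now 1
```

After estimating the amplitude $A_0$ on the rescaled matrix, multiplying back by $\lambda_{\max}$ recovers $\|Mv\|$ as in (C10).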

In general, though, $\lambda_{\max}$ is not known. In that case, [15] proposes making all entries of the matrix $M$ smaller than $1/d$, for $d$ the sparsity. Then, they argue, the maximum eigenvalue will lie in the interval $(-1, 1)$ and Lemma 1 can still be used. In that case we will use (C10), but substituting $\lambda_{\max}$ by the factor that makes all entries smaller than $1/d$. Notice also that when estimating the norm of the solution of the HHL algorithm, we assumed that the maximum eigenvalue was no larger than 1; therefore, a similar treatment to the one given here should also be applied there.