
Strange Bedfellows: Quantum Mechanics and Data Mining

Marvin Weinstein*

SLAC National Accelerator Laboratory, Stanford, CA, USA

*This work was supported by the U.S. DOE, Contract No. DE-AC02-76SF00515.

Last year, in 2008, I gave a talk titled Quantum Calisthenics. This year I am going to tell you about how the work I described then has spun off in a most unlikely direction. What I am going to talk about is how one maps the problem of finding clusters in a given data set into a problem in quantum mechanics. I will then use the tricks I described to let quantum evolution bring the clusters together on their own.

1. What Is This About?

Since I am talking to you at Light Cone 2009, it is fun for me to point out that a good deal of the material I will present is a direct spin-off of material I presented last year in my talk, Quantum Calisthenics. What is new (and surprising to me) is that the ideas for solving problems in quantum mechanics would prove extremely useful in dealing with data-mining, a problem in computer science that, at first glance, does not appear to have anything to do with quantum physics, or for that matter, with any kind of physics. In the next few minutes I hope to convince you that despite appearances data-mining and quantum physics are a match made in heaven.

1.1. What Is The Problem And How Does It Affect You?

As one wag said, the problem of data-mining can be summarized by, "If a supermarket customer buys formula and diapers, how likely are they to buy beer?". Clearly, if it is possible to give a good answer to this question, a supermarket manager can decide if it is better to put the beer next to the diapers, or to put it at the other end of the store to encourage impulse buying.

As amusing as this simple summary may be, there is a bit more to the story. For example, other businesses that engage in data-mining are Amazon and NETFLIX. Both companies have large databases containing the information about each customer's previous purchases. Their goal in mining this data is to suggest the next book a given customer might like to purchase, or movie she might like to rent. They base their suggestion on what they believe customers like her have purchased or rented. The key question is: how do they determine who is a customer like her? In general, the examination of unstructured data to find clusters of similar items is a key element of data-mining. The search for these structures in a data-set is imaginatively referred to as clustering.

While the use of clustering by Amazon and NETFLIX may improve the shopping experience, data-mining affects all of us in ways that impact our existence at a much more important level. This happens when banks and insurance companies apply it to the problem of scoring. Basically, the idea is to take all of the data they have about people and identify clusters of similar people, so as to predict how likely it is that an applicant will default on a mortgage or loan, or how likely they are to be robbed, die, etc. To do this they take their data and try, in various ways, to find clusters of people that they can define as similar. For better or worse, we are all affected by these practices.

Finally, to give one of a myriad of examples that have nothing to do with questions of business, let us consider the case of gene chips as a tool for medical diagnosis and treatment planning. Gene chips are a technology for measuring which genes are being under- or over-expressed by a cell. The idea is that understanding the different patterns of gene expression and how they relate to specific diseases and drug resistance will allow a doctor to both diagnose a disease and prescribe the correct drug for treating it. Later I will talk about a specific example of this sort of study for the case of patients with Type A or Type B Leukemia.

Summarizing, we can say that many fields - physics, biology, medicine, intelligence and homeland security, finance, insurance, and diplomacy - all collect large databases of information. Sorting through such data and searching for previously unknown, interesting and relevant structures is referred to as data-mining. Clearly, from the breadth of topics one might guess that no precise definition of data-mining exists. However, although we can't make the definition of what we wish to do totally specific, it is intuitively clear that it is a useful notion. In what follows I will be describing a clustering technique that my colleague David Horn and I call DQC (Dynamic Quantum Clustering). I hope that the specific examples that I discuss will give you a flavor for the general problem. A more complete exposition of this material is given in Ref. [1].

2. The Data Miner's Lament: The Curse of High Dimension

The general problem is that we are given a database with a large number of entries. Associated with each entry is a record that contains all of the information that has been collected about that entry; each such item of information is called a feature. Clearly, the number of features and the type of information stored in each feature varies from database to database. As an example, let's talk about NETFLIX. Assume that they have a library of 10,000 films that can be rented. Then each entry in the database might correspond to a customer, and associated to each customer would be 10,000 features (in numerical format) that indicate which films the customer has rented and, if available, his/her rating of each of those films. The number of features is called the dimension of the problem. Thus, we see that finding clusters of people with similar tastes in the NETFLIX problem involves searching for clusters in a space of very high dimension. It should not surprise you that this is computationally difficult.

Because it is so difficult to search for clusters in spaces of high dimension, people try to find a way to reduce the dimension of the problem without compromising the information too badly. The tool most commonly used to accomplish this is SVD, or Singular Value Decomposition. Clearly, given the time available to me, I can't go into the use of SVD too deeply, but I would like to take a few moments to give you a feeling for what it does and doesn't do.

2.1. Singular Value Decomposition: The Swiss Army Knife of Data Mining

SVD, or Singular Value Decomposition, is the statement that any n×m-matrix can be brought to an almost diagonal form. The first time I was introduced to this theorem, many years ago, I couldn't believe it was really good for anything. That just goes to show what I knew. In fact, nowadays SVD has crept into all sorts of areas of physics and data-mining. It is extensively used in solid state physics computations based on the DMRG (or density matrix renormalization group) method, in image compression, in data-mining to do - among other things - dimensional reduction, and in a host of other places. There is much I could say to try and expose you to its many applications, but time being what it is, I will limit myself to giving a brief introduction to the SVD decomposition and an example of how we will use it.

Basically, SVD is useful to us because all of the data we will mine will be presented in the form of an n×m-matrix of numerical values. Each row of this matrix will correspond to an entry in the database (i.e., a customer, an event in a physics experiment, the specific cell culture corresponding to a given gene-chip map, etc.). The columns of the matrix represent the features that have been measured for each entry in the database. The theorem establishing the SVD decomposition says that for any n×m-matrix M, there exists a unitary n×n-matrix U, an n×m-matrix S and a unitary m×m-matrix V, such that

\[ M = U\, S\, V^\dagger , \tag{1} \]

where the matrix S only has non-vanishing entries along the diagonal. Consider the explicit example

\[
\begin{pmatrix}
 8.3 &  5.8 \\
-4.5 & -4.3 \\
 6.8 & -8.5
\end{pmatrix}
=
\begin{pmatrix}
-0.85  &  0.077 & -0.522 \\
 0.51  & -0.133 & -0.85  \\
-0.135 & -0.988 &  0.074
\end{pmatrix}
\cdot
\begin{pmatrix}
11.875 & 0      \\
0      & 10.907 \\
0      & 0
\end{pmatrix}
\cdot
\begin{pmatrix}
-0.864 & -0.503 \\
-0.503 &  0.864
\end{pmatrix} ,
\tag{2}
\]

where M is the matrix on the left hand side of the equation and U, S and V† are the three matrices on the right hand side of the equation.

Now, to get a feeling for what this decomposition is doing for us, let us think of the rows of M as the coordinates of three different points in two dimensions. Clearly these three vectors cannot all be linearly independent, as there can only be two linearly independent vectors in 2-d. The SVD decomposition gives us a coordinate system for plotting these points that is adapted to the way the data is spread out. To be more specific, we note that the two rows of V† are two orthonormal vectors that span the space containing the points. From the formula for M we see that each of the original points can be written as a linear combination of the two rows of V†. The coefficients of these two basis vectors are given by the entries in the first two columns of U times the corresponding elements of S. From this we see that the rows of U serve as coordinates for each of the data points, where the data has been mapped onto the unit sphere (since each row of U is a unit vector). If one wishes to dimensionally reduce the clustering problem, one can use a smaller number of columns of the U-matrix to label the points. In this case we typically rescale each row to have unit length. In all of the examples that follow we have done this step before starting the DQC procedure.
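To make this concrete, here is a minimal numpy sketch (mine, not from the original paper) that reproduces the decomposition of equation 2 and the rescaling step just described; note that numpy may return U and V with columns sign-flipped relative to the matrices quoted above, which changes nothing in what follows.

    import numpy as np

    # The 3 x 2 data matrix from the worked example in equation 2:
    # three entries (rows), two features (columns).
    M = np.array([[ 8.3,  5.8],
                  [-4.5, -4.3],
                  [ 6.8, -8.5]])

    # Full SVD: M = U @ S @ Vt, singular values on the diagonal of S.
    U, s, Vt = np.linalg.svd(M, full_matrices=True)
    S = np.zeros_like(M)
    S[:len(s), :len(s)] = np.diag(s)

    assert np.allclose(U @ S @ Vt, M)   # reconstruction check
    print(s)                            # approximately [11.875, 10.907]

    # Dimensional reduction: keep the first k columns of U as coordinates
    # for the data points and rescale each row to unit length -- the step
    # performed before starting the DQC procedure.
    k = 2
    coords = U[:, :k] / np.linalg.norm(U[:, :k], axis=1, keepdims=True)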

Another use of SVD is to reduce the amount of data used to describe a picture, either as a way of filtering out noise or as a first step towards image recognition. To see how this works, assume that a picture is given by a matrix, M, of pixels. Another way to write the SVD decomposition of this matrix is

\[ M_{ij} = \sum_{\nu=1}^{N_{\rm max}} S_{\nu\nu}\, t^{\nu}_{ij} , \tag{3} \]

where the S_νν run over the non-zero entries on the diagonal of S (in decreasing order) up to a maximum, N_max, and the matrices t^ν_{ij} are given by

\[ t^{\nu}_{ij} = U_{i\nu}\, V^\dagger_{\nu j} . \tag{4} \]

This is a useful way of looking at things because the diagonal entries in S tend to decrease rapidly. Thus, since U and V† (and thus V) are unitary matrices, we see that

\[ \sum_{i,j} M_{ij}^2 = \sum_{\nu} S_{\nu\nu}^2 . \tag{5} \]

If we approximate the original picture by a sum over the N largest entries in S, i.e.,

\[ M^{\rm approx}_{ij} = \sum_{\nu=1}^{N} S_{\nu\nu}\, t^{\nu}_{ij} , \tag{6} \]

then it follows that

\[ \sum_{i,j} \left( M - M^{\rm approx} \right)_{ij}^2 = \sum_{\nu=N+1}^{N_{\rm max}} S_{\nu\nu}^2 , \tag{7} \]

and so we see that the sum of the squares of all of the errors in the pixels is given by the sum of the squares of the singular values S_νν that weren't included in the approximation.
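As an illustration of equations 6 and 7, the following sketch (again mine, not the paper's) builds a rank-N approximation of an arbitrary matrix and checks that the total squared pixel error equals the sum of the squared singular values that were dropped.

    import numpy as np

    def rank_n_approx(M, n):
        """Keep only the n largest singular values, as in equation 6."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :], s

    # A stand-in "image"; any numerical matrix behaves the same way.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((64, 64))

    N = 10
    M_approx, s = rank_n_approx(M, N)

    # Equation 7: the total squared error equals the sum of the squared
    # singular values that were not included in the approximation.
    assert np.isclose(np.sum((M - M_approx) ** 2), np.sum(s[N:] ** 2))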

2.2. What About Clustering?

To this point I have only been talking about using the SVD decomposition to plot the data in a coordinate system that is adapted to the natural structure of the data. Thus the first axis is the one along which the data has the largest variance, i.e. the biggest difference between the largest and smallest coordinate values; the second axis is the direction in which the data has the next largest variance, and so on. While this step is useful, and dimensionally reducing the data by keeping only the coordinates corresponding to the first few directions can lead to plots that in some loose sense maximally separate data belonging to different clusters, in general this is only a first step.

In the generic situation, especially for data that cannot usefully be reduced to two or three dimensions, clusters cannot be easily identified. The situation is even more complicated if the data does not group into separated globular clusters, but instead is concentrated on some other geometrical structure, say, for example, a ring. In this case SVD followed by strong dimensional reduction can incorrectly lead one to conclude there are clusters where none exist. This is where DQC comes into play. DQC allows one to explore more dimensions of the original data, and it can clearly indicate the presence of higher-dimensional geometries when they exist.

3. DQC: What Is Dynamic Quantum Clustering?

In Dynamic Quantum Clustering, DQC, we map the problem of finding clusters into a problem in quantum mechanics and then use the techniques of quantum mechanics to allow the clusters, or over-dense regions in the data, to reveal themselves.

3.1. The Parzen Window Estimator

An old approach in clustering was the so-called Parzen Window Estimator. The crux of this idea was that given n-dimensional data, denoted by points x_i, one could construct a function on the n-dimensional Euclidean space as follows:

\[ \psi(\vec x) = \sum_i e^{-\frac{1}{2\sigma^2}\,(\vec x - \vec x_i)\cdot(\vec x - \vec x_i)} . \tag{8} \]

In other words, the Parzen estimator is simply a sum of n-dimensional Gaussians with a common width determined by the parameter σ. The notion was that in regions where the data was denser, the Parzen estimator would have a relative maximum. The hope was that clusters could be identified by finding these maxima and selecting a region about each one, so that the points in that region would be said to belong to a cluster. There were several problems with the Parzen estimator approach. First, finding the maxima of a complicated function in an n-dimensional space is computationally time consuming. Second, the structure of the function proved to depend sensitively on the choice of σ. For a value of σ that is too small, one obtains a function with many local maxima and very small clusters. If σ is too large the maxima all coalesce and no distinct clusters can be found.
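A few lines of numpy make equation 8, and its sensitivity to σ, easy to experiment with; this is an illustrative sketch, and the two-blob test data below is my own choice rather than anything from the paper.

    import numpy as np

    def parzen(x, data, sigma):
        """Equation 8: a sum of Gaussians of common width sigma centered
        on the data points; x has shape (d,), data has shape (n, d)."""
        d2 = np.sum((data - x) ** 2, axis=1)
        return np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))

    # Two well-separated 2-d blobs: the estimator is large near a cluster
    # center and small in between; shrinking sigma fragments the maxima.
    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                      rng.normal(3.0, 0.3, (50, 2))])
    print(parzen(np.zeros(2), data, sigma=0.5))      # near a cluster
    print(parzen(np.array([1.5, 1.5]), data, 0.5))   # between clusters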

3.2. Quantum Clustering

Quantum clustering (QC) begins with the construction of a Parzen estimator, but instead of using the Parzen estimator to find clusters directly, we use it to construct a potential function whose minima are related to the clusters. The construction of this function was first introduced in the paper of Horn and Gottlieb [2], where it was defined by the condition that it is the function V(x) for which the Parzen estimator satisfies the n-dimensional time-independent Schrödinger equation

\[ H\psi = \left( -\frac{\sigma^2}{2}\,\vec\nabla^2 + V(\vec x) \right)\psi = 0 . \tag{9} \]

That such a V(x) always exists is clear; all one has to do is solve equation 9 for V(x), i.e.,

\[ V(\vec x) = \frac{\sigma^2}{2\,\psi(\vec x)}\,\vec\nabla^2 \psi(\vec x) . \tag{10} \]

The intuition behind this prescription is based upon the fact that in the quantum problem local maxima in the ground state wave-function correspond to local minima in the potential; moreover, such minima are likely to be a good deal sharper and deeper than the corresponding maxima of the wave-function, due to the fact that the kinetic term, −∇², causes the wave-function to be more spread out than the features of the potential would suggest. A sample of such a potential function will be shown in the next section, when I discuss the example of Ripley's Crab Data.
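Because ψ is a sum of Gaussians, equation 10 can be evaluated in closed form. The sketch below (an illustration of mine, using the fact that the Laplacian of each Gaussian g_i is g_i(r_i²/σ⁴ − d/σ²) in d dimensions) computes the potential at any point:

    import numpy as np

    def quantum_potential(x, data, sigma):
        """Equation 10: V(x) = (sigma^2 / (2 psi)) * laplacian(psi),
        evaluated analytically for the Parzen estimator of equation 8."""
        d = data.shape[1]
        r2 = np.sum((data - x) ** 2, axis=1)
        g = np.exp(-r2 / (2.0 * sigma ** 2))          # the Gaussians g_i
        lap_psi = np.sum(g * (r2 / sigma ** 4 - d / sigma ** 2))
        return 0.5 * sigma ** 2 * lap_psi / np.sum(g)

At a maximum of ψ the Laplacian is negative, so V dips there, consistent with the statement that maxima of the wave-function sit over minima of the potential.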

In the original quantum clustering approach [2], points lying in the basin of attraction of a particular minimum were identified as a single cluster. One way of determining which minimum was closest to a given point was to classically roll the points downhill using the gradient descent method. The disadvantages of this method are: first, it can be computationally very expensive to carry out when there are a large number of points and a large number of features involved; second, there can be many instances where a slightly too small value for σ can lead to a potential with several nearby shallow minima contained in a single larger minimum. When this happens the classical prescription will lead to several small clusters and not show the existence of the larger cluster. As we will see, by appropriately choosing one of the two parameters that control the DQC evolution, this sensitivity to the choice of σ can be greatly reduced, and the structure of such nearby minima can be explored even in problems with a large number of features.
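For comparison with what follows, here is a sketch of the classical gradient-descent prescription just described. It reuses the quantum_potential function from the previous sketch and differentiates it numerically; the step size eta and iteration count are illustrative choices of mine, and a real implementation would use the analytic gradient.

    import numpy as np

    def grad_V(x, data, sigma, eps=1e-5):
        """Central-difference gradient of the quantum potential."""
        g = np.zeros_like(x)
        for k in range(len(x)):
            e = np.zeros_like(x)
            e[k] = eps
            g[k] = (quantum_potential(x + e, data, sigma)
                    - quantum_potential(x - e, data, sigma)) / (2 * eps)
        return g

    def roll_downhill(data, sigma, eta=0.05, steps=200):
        """Original QC: gradient-descend each point into a basin of V;
        points ending near the same minimum form one cluster."""
        pts = data.astype(float).copy()
        for _ in range(steps):
            pts -= eta * np.array([grad_V(p, data, sigma) for p in pts])
        return pts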

3.3. Ripley's Crab Data

As a simple example of the construction of the potential associated with a data set, let us consider the data on the morphology of rock crabs that was used in the book Pattern Recognition and Neural Networks by B. D. Ripley [3]. The problem in this exercise is to see how well one can classify the members of a collection of 200 crabs; the collection contains 100 crabs of each of two different species, and each set of 100 is further subdivided into 50 crabs of each sex. The only information one has consists of five measurements of carapace size and claw size. In the picture shown (Fig. 1) I have plotted a potential in two dimensions. The colored dots correspond to the real classes. To construct this potential we did an SVD decomposition of the data matrix and then used the second and third principal components to construct the potential. As you can see from the picture, the potential has four local minima and, except for a few outliers, the different species and sexes are identified by the valley in which they lie.

3.4. DQC: Dynamic Quantum Clustering

DQC begins by constructing the same potential function used in quantum clustering; it differs in how one handles the problem of identifying data points with the local minima of this function in N dimensions. We simply exploit the fact that from the outset we are dealing with a quantum problem and a well defined Hamiltonian.

Figure 1. This is a plot of the quantum potential function for the two-dimensional problem of the Ripley's Crab Data, where the coordinates of the data points are chosen to be given by the second and third principal components. The four known classes of data points are shown in different colors and are placed upon the potential surface at their original locations.

To roll the different data points downhill we simply evolve each Gaussian wave-function using the time development operator

\[ U(t) = e^{-i H t} , \tag{11} \]

so that the evolution of any state ψ_i(x) is defined to be

\[ \psi_i(\vec x, t) = e^{-i H t}\, \psi_i(\vec x) . \tag{12} \]

This time-evolved state is the solution to the time-dependent Schrödinger equation

\[ i\,\frac{\partial \psi_i(\vec x, t)}{\partial t} = H \psi_i(\vec x, t) = \left( -\frac{\vec\nabla^2}{2m} + V(\vec x) \right) \psi_i(\vec x, t) . \tag{13} \]

The important feature of quantum dynamics, which makes the evolution so useful in the clustering problem, is that according to Ehrenfest's theorem, the time-dependent expectation value

\[ \langle \psi_i(t) |\, \vec x \,| \psi_i(t) \rangle = \int d\vec x\; \psi_i^*(\vec x, t)\, \vec x\, \psi_i(\vec x, t) \tag{14} \]

satisfies the equation

\[ \frac{d^2 \langle \vec x(t) \rangle}{dt^2} = -\frac{1}{m} \int d\vec x\; \psi_i^*(\vec x, t)\, \vec\nabla V(\vec x)\, \psi_i(\vec x, t) \tag{15} \]
\[ \phantom{\frac{d^2 \langle \vec x(t) \rangle}{dt^2}} = -\frac{1}{m}\, \langle \psi_i(t) |\, \vec\nabla V(\vec x) \,| \psi_i(t) \rangle . \tag{16} \]

If ψ_i(x) is a narrow Gaussian, this is equivalent to saying that the center of each wave-function rolls towards the nearest minimum of the potential according to the classical Newton's law of motion. This means we can explore the relation of each data point to the minima of V(x) by following the time-dependent trajectory ⟨x_i(t)⟩ = ⟨ψ_i(t)| x |ψ_i(t)⟩. Clearly, given Ehrenfest's theorem, we expect to see any points located in, or near, the same local minimum of V(x) oscillate about that minimum, coming together and moving apart. In our numerical solutions we generate animations which display this dynamics for a finite time. This allows us to visually trace the clustering of points associated with each one of the potential minima.

At first glance this approach would seem to be a step backwards, since we have replaced solving ordinary classical differential equations by the problem of solving a more complicated partial differential equation, but this is incorrect. The trick is to think of the problem reduced to the subspace of the full Hilbert space spanned by the original data points. While this is not completely equivalent to the original problem, for our purposes the accuracy of the approximation is good enough to capture the same clusters.

Clearly, these states are not orthogonal to one another; however, we can choose an orthonormal set of states that spans the subspace by computing the matrix

\[ N_{ij} = \langle \psi_i(\vec x) | \psi_j(\vec x) \rangle \tag{17} \]

and then computing its eigenvectors. The eigenvectors corresponding to non-vanishing eigenvalues form a set of orthogonal linear combinations of the original states, and they are normalized by dividing each such state by the square root of its eigenvalue. The fact that the initial states are all Gaussians makes computing N_ij analytically a trivial exercise. Furthermore, the Gaussian nature of the original states also makes it simple to calculate the matrix elements of the Hamiltonian and the position operators X_k (these operators act on our wavefunctions by multiplying them by the x_k coordinate):

\[ H_{ij} = \langle \psi_i(\vec x) |\, H \,| \psi_j(\vec x) \rangle , \tag{18} \]

where H is defined to be

\[ H = \frac{\vec p^{\,2}}{2m} + V(\vec x) = -\frac{\vec\nabla^2}{2m} + V(\vec x) , \tag{19} \]

and

\[ X^k_{ij} = \langle \psi_i(\vec x) |\, X_k \,| \psi_j(\vec x) \rangle . \tag{20} \]

(Note: I have introduced the parameter m into the Hamiltonian used to evolve the states. This is different from the σ used to construct the potential. I do this so that by choosing m < 1 I can increase the effect of quantum tunneling and decrease the sensitivity to σ by causing points in nearby minima to merge.)

Given the analytical expressions for these matrix elements between the Gaussian states, it is a simple matter to compute the same operators in the orthonormal basis defined by N_ij and to then exponentiate the Hamiltonian in this basis. In this way the apparently difficult problem of solving the time-dependent Schrödinger equation is reduced to the computation of simple closed-form expressions followed by numerical evolution in the truncated Hilbert space. This trick reduces the problem to dealing with matrices whose size is determined by the number of data points and not the dimension of the data-set (i.e., the number of features associated with each data point). Of course, when there are a large number of data points this might seem to be an intractable problem; fortunately, there is a simple trick for dealing with that situation too.
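Putting the pieces together, here is a one-dimensional numerical sketch of the whole DQC evolution. It is mine, not the paper's code: matrix elements are computed by quadrature on a grid rather than from the closed-form Gaussian expressions, but the logic (build N_ij, H_ij and X_ij, orthonormalize, exponentiate H, and follow the Ehrenfest trajectories <x_i(t)>) is the one described above, and the parameter values are illustrative.

    import numpy as np
    from scipy.linalg import expm

    def dqc_evolve_1d(data, sigma=0.5, m=0.2, dt=0.1, t_steps=20):
        """Evolve Gaussians centered on 1-d data points in the truncated
        Hilbert space they span; returns the trajectories <x_i(t)>."""
        x = np.linspace(data.min() - 4 * sigma, data.max() + 4 * sigma, 4000)
        dx = x[1] - x[0]

        # Rows of g: the Gaussian states psi_i; their sum is the Parzen psi.
        r2 = (x[None, :] - data[:, None]) ** 2
        g = np.exp(-r2 / (2 * sigma ** 2))
        lap_g = g * (r2 / sigma ** 4 - 1 / sigma ** 2)  # analytic Laplacian

        # Quantum potential of equation 10 on the grid (d = 1).
        V = 0.5 * sigma ** 2 * lap_g.sum(0) / g.sum(0)

        # Overlap, Hamiltonian and position matrices, equations 17-20.
        N = g @ g.T * dx
        H = -(g @ lap_g.T) * dx / (2 * m) + (g * V) @ g.T * dx
        H = (H + H.T) / 2                   # symmetrize quadrature error
        X = (g * x) @ g.T * dx

        # Orthonormal basis from eigenvectors of N with non-small eigenvalues.
        w, U = np.linalg.eigh(N)
        keep = w > 1e-9 * w.max()
        B = U[:, keep] / np.sqrt(w[keep])
        Ho, Xo = B.T @ H @ B, B.T @ X @ B

        Ut = expm(-1j * Ho * dt)            # time development, equation 11
        c = B.T @ N                         # column i: coefficients of psi_i
        traj = []
        for _ in range(t_steps):
            c = Ut @ c
            norm = np.sum(np.conj(c) * c, axis=0).real
            traj.append(np.sum(np.conj(c) * (Xo @ c), axis=0).real / norm)
        return np.array(traj)

    # Two 1-d clusters: each <x_i(t)> oscillates about the nearer minimum.
    rng = np.random.default_rng(2)
    data = np.concatenate([rng.normal(0, 0.2, 15), rng.normal(4, 0.2, 15)])
    print(dqc_evolve_1d(data)[-1])          # grouped near the two centers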


Applying these ideas to Ripley's crab data, reduced to the first three principal components, is very instructive. (The evolution of the full five-dimensional data-set shows exactly the same behavior but is more complicated to plot.) In the following pictures (Fig. 2) I show two steps in the time evolution of the initial distribution. The first picture is a snapshot of the original data before evolution has begun. In the next picture we see what happens after a short time. As expected, the points begin to approach one another. The last picture shows what happens if we stop the points where they are and restart DQC using the new points. The clustering is quite easy to see, and it is easy to understand why the data is clustering as it is.

4. Exploiting DQC

Obviously, in this brief overview of DQC, I have not given you all the details of how to implement the algorithm, but I have given you all of the essentials. Now I want to point out that although the colors in Fig. 2 were there so that you could see the clusters form and tell how well DQC was working, the fact that colors can be used in the visualization step can play an additional role in data mining. The crab problem was an example of a blind search for clusters. What I want to concentrate on now is a different sort of problem; namely, a situation where we start with a classification of the data into clusters, but we do not know how the measured features in the data-set relate to the classification. As an example, consider Affymetrix gene chip data for a set of Leukemia patients. The Affymetrix gene chip is a silicon device to which strands of RNA are attached in a regular matrix. The setup is made so that, if a protein binds to a strand of RNA that codes for that protein, the spot to which the strand is bound fluoresces when light shines on the chip. Such a gene chip can measure the expression of over 7000 genes (by seeing the proteins that are coded for by each gene). Now, in the case of ALL and AML leukemia cells we start with a clinical classification based upon a pathological examination of the cells. So we know how to color spots in the DQC picture according to type.

Figure 2. The left hand plot shows the three-dimensional distribution of the original data points before quantum evolution. The middle plot shows the same distribution after quantum evolution. The right hand plot shows the results of an additional iteration of DQC. The values of the parameters used to construct the Hamiltonian and evolution operator are σ = 0.07 and m = 0.2. Colors indicate the expert classification of the data into four classes, unknown to the clustering algorithm. Note, small modifications of the parameters lead to the same results.

What we don't know is how to use the gene chip information to identify these cells; that is what we want to do using DQC (or any other clustering method). Two problems that one has, in addition to the clustering problem, are that the binding of a protein to a strand of RNA is not all that specific, and that most of the genes being measured have nothing to do with cancer. That is where the coloring comes in. If one applies DQC to such a data set and sees clusters of the correct color form, then one knows that the gene chip data contains the information we need. The next step is to eliminate features (i.e., measurements of particular genes) without hurting the clustering. (There are various schemes for doing this, including the brute force approach.) Clearly, if we pursue this process and weed out genes that have nothing to do with the clustering, then we go a long way towards both developing a diagnostic tool and obtaining insight into which genes are related to the specific cancer.

I will show you some results for a data set by Golub et al. [4]. This set contains gene chip measurements on cells from 72 leukemia patients with two different types of Leukemia, ALL and AML. The expert identification of the classes in this data set is based upon dividing the ALL set into two subsets corresponding to T-cell and B-cell Leukemia. The AML set is divided into patients who underwent treatment and those who did not. In total, the Affymetrix GeneChip used in this experiment measured the expression of 7129 genes. The specific feature-filtering method we employed in this analysis was based on SVD-entropy, and is a simple modification of a method introduced by Varshavsky et al. [5] and applied to the same data. For our purposes it doesn't matter what the feature-filtering method was; I want you to see the difference removing features makes in clustering. I also want you to recognize how the DQC visualization makes it easy to quickly identify the effect of removing features from the data.
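For readers who want the flavor of the filtering step, the following is a sketch of leave-one-out SVD-entropy in the spirit of Varshavsky et al. [5]. The keep/discard threshold is an illustrative choice of mine, not the rule used in this analysis, and for 7129 genes one would batch the leave-one-out loop rather than run it naively.

    import numpy as np

    def svd_entropy(A):
        """Normalized entropy of the squared singular-value spectrum."""
        s = np.linalg.svd(A, compute_uv=False)
        v = s ** 2 / np.sum(s ** 2)
        v = v[v > 0]
        return -np.sum(v * np.log(v)) / np.log(len(s))

    def feature_contributions(A):
        """Leave-one-out contribution of each feature (column) to the
        spectral entropy of the data matrix."""
        E = svd_entropy(A)
        return np.array([E - svd_entropy(np.delete(A, i, axis=1))
                         for i in range(A.shape[1])])

    def filter_once(A):
        """One filtering stage: keep the high-contribution features."""
        c = feature_contributions(A)
        return A[:, c > c.mean() + c.std()]   # threshold choice is ours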

Figure 3 displays the raw data in the 3-dimensional space defined by the second to fourth principal components, and the effect that DQC has on these data. In Figure 4 we see the result of applying feature filtering to the original data, represented in the same 3 dimensions, followed by DQC evolution. Applying a single stage of filtering has a dramatic effect upon clustering, even before DQC evolution. The latter helps sharpen the cluster separation. Figure 5 shows the results of removing many features (using the SVD-entropy method) before and after DQC evolution. These plots, especially the after-DQC pictures, show dramatic clustering, especially for the blue points.

Figure 3. The left hand picture is the raw data from the Affymetrix Chip plotted for principal components 2, 3, 4. Clearly, without the coloring it would be hard to identify clusters. The right hand picture is the same data after DQC evolution, using σ = 0.2 and a mass m = 0.01. The different classes are shown as blue, red, green and orange.

Figure 4. The left hand plot is the Golub data after one stage of SVD-entropy based filtering, but before DQC evolution. The right hand plot is the same data after DQC evolution.

With each stage of filtering we see that the blue points cluster better and better, in that the single red outlier separates from the cluster and the cluster separates more and more from the other points. The blue points are what I refer to as an obviously robust cluster identified in early stages of feature filtering. If one continues removing features, however, the clear separation of the blue points from the others begins to diminish.


Figure 5. The left hand plot is the data after three stages of SVD-entropy based filtering, but before DQC evolution. The right hand plot is the same data after DQC evolution.

This tells us that we have gone too far with the blue points; we are now removing features that matter for their classification. This is, of course, just what we are looking for: a way of identifying those features which are important to the existing biological clustering. We could, at this juncture, search among the most recently eliminated 278 features to isolate those most responsible for the separation of the blue cluster from the others. Now, however, I want to make another point. Since the blue cluster is so robust and easily identified, let us remove the blue cluster from the original data and repeat the same process without this cluster. The idea here is that now the SVD-entropy based filtering will not be pulled by the blue cluster, and so it will do a better job of sorting out the red, green and orange clusters. I want you to see that this is in fact the case. In Figure 6 we see a plot of what the starting configurations look like if one takes the original data, removes the blue cluster and re-sorts the reduced data set according to the SVD-entropy based filtering rules. The left hand plot is what happens if one filters a single time. The right hand plot shows what happens if one repeats the filtering procedure two more times. It is clear from the plots that each iteration of the filtering step improves the separation of the starting clusters. By the time we have done five filtering steps the red, green and orange clusters are distinct, if not obviously separated.

Figure 6. The left hand plot is what the starting data looks like if one first removes the blue points and does one stage of SVD-entropy based filtering. The right hand plot is what the starting data looks like after three stages of filtering.

Finally, to complete our discussion, we show Figure 7. This figure shows the results of doing five iterations of the SVD-entropy based filtering and following that with three stages of DQC evolution. The dramatic clustering accomplished by DQC evolution makes it easy to extract clusters. Note, however, that in the second plot we see what we have seen throughout: the red points first form two distinct sub-clusters, which only merge after two more stages of DQC evolution. This constant repetition of the same phenomenon, which is only made more apparent by SVD-entropy based filtering, is certainly a real feature of the data. It presumably says that what appears to be a sample of a single type of cell at the biological level is in reality two somewhat different types of cells when one looks at gene expression. A measure of the success of clustering is given by the Jaccard score, which for this result is 0.762, higher than the value 0.707 obtained by [5].

5. DQC and Large Data Sets

There are many scientific and commercial fields, such as cosmology, epidemiology, particle physics, risk-management, etc., where one deals with very large data sets, often in large numbers of dimensions. DQC, by its nature, doesn't have trouble with large dimensions.


Figure 7. The left hand plot is what the starting data looks like if one first removes the blue points and does five stages of SVD-entropy based filtering. The right hand plot is what happens after one stage of DQC evolution. The bottom plot is the final result after iterating the DQC evolution step two more times. At this point the clusters are trivially extracted.

In general it allows one to use the SVD decomposition with much less severe dimensional reductions than other methods. However, dealing with a large number of points requires some thought.

Since the method for evolving points requires diagonalizing the truncated Hamiltonian, and since diagonalizing such a matrix on a PC becomes difficult for matrices that are larger than 2000 × 2000, it is obvious that using brute force methods to evolve sets of data having tens or hundreds of thousands of points simply won't work. The solution to this problem lies in the fact that the SVD decomposition maps the data into an N-dimensional unit cube, and the fact that the data points are represented by states in Hilbert space rather than N-tuples of real numbers.

The key observation is that, since Gaussian wavefunctions whose centers lie within a given cube have non-vanishing overlaps, as one chooses more and more Gaussians one eventually arrives at a situation where the states become what we will refer to as essentially linearly dependent. In other words, we arrive at a stage at which any new wave-function added to the set can, to some predetermined accuracy, be expressed as a linear combination of the wave-functions we already have. Of course, since quantum mechanical time evolution is a linear process, this means that these additional states can be evolved by expressing them as linear combinations of the previously selected states and using the evolution of those states to evolve the extra states. Since the overlap of two Gaussians can be computed analytically, determining which points define the maximal set of essentially linearly independent states for the problem is easy. Typically, even for data sets with 100,000 points, this is of the order of 1600 points. The small number works because, as we have already noted, we don't need high accuracy for DQC evolution. The quality of the clustering degrades very slowly with loss in accuracy. Thus, it is possible to compute the time evolution operator in terms of a well chosen subset of the data and then apply it to the whole set of points.
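The selection itself is simple to sketch. Using the fact that two normalized Gaussians of common width σ centered at x_i and x_j have overlap exp(−|x_i − x_j|²/4σ²), the greedy loop below (an illustration of mine; a production version would update a Cholesky factor instead of re-solving the Gram system, and the tolerance eps is an arbitrary choice) accepts a point only when its state has an appreciable component orthogonal to the span of the states already chosen.

    import numpy as np

    def select_basis(data, sigma, eps=1e-3):
        """Greedily pick a maximal set of 'essentially linearly
        independent' Gaussian states centered on the data points."""
        def overlap(a, b):
            return np.exp(-np.sum((a - b) ** 2, axis=-1) / (4 * sigma ** 2))

        sel = [0]
        for i in range(1, len(data)):
            v = overlap(data[sel], data[i])   # overlaps with chosen states
            G = overlap(data[sel][:, None, :], data[sel][None, :, :])
            resid = 1.0 - v @ np.linalg.solve(G, v)
            if resid > eps:                   # appreciable new component
                sel.append(i)
        return sel

The states that are not selected are then expanded in the chosen ones by solving the same Gram system, so the full data set rides along with the evolution of the small basis.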


6. Afterword

Well, this is about all I can cover in my allotted time. I hope I have convinced you that, strange as it may seem, quantum mechanics and data mining are related to one another. In fact, there is enough interest in this work in the real world that Stanford (SLAC) and Tel-Aviv University are patenting the technology, and some companies have already expressed an interest in using it.

REFERENCES

1. Marvin Weinstein and David Horn. Dynamic quantum clustering: a method for visual exploration of structures in data. SLAC-PUB-13759, arXiv:0908.2644v1 [physics.data-an], 18 Aug 2009 (submitted to Phys. Rev. E).

2. D. Horn and A. Gottlieb. Algorithm for Data Clustering in Pattern Recognition Problems Based on Quantum Mechanics. Phys. Rev. Lett. 88, 018702 (2001).

3. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK, 1996.

4. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531 (1999).

5. R. Varshavsky, A. Gottlieb, M. Linial and D. Horn. Novel Unsupervised Feature Filtering of Biological Data. Bioinformatics 22, no. 14 (2006), e507-e513.
