
A Fault-tolerant, Ion-trap-based Architecture for the Quantum Simulation Algorithm

Guoming Wang, Computer Science Division, UC Berkeley

[email protected]

Oleg Khainovski, Seismological Laboratory, UC Berkeley

[email protected]

ABSTRACT

Efficient simulation of quantum mechanical systems is an important application of quantum computing. We build a fault-tolerant, ion-trap-based architecture for the quantum simulation algorithm, using the quantum CAD flow developed at Berkeley. First, we develop software that, given the description of a quantum simulation problem, generates the solution circuit and optimizes it with a layering technique that increases its parallelism and reduces its latency. This optimized circuit is then fed into the quantum CAD flow. Second, we improve the datapath synthesis and mapping stages of the quantum CAD flow by reducing the number of long-range communications. Third, we present a new version of the selective error correction scheme, which distinguishes between bit-flip and phase-flip errors and estimates error propagation more accurately than its predecessor. Finally, we evaluate our design with the ADCR metric and demonstrate the effectiveness of our optimizations.

1. INTRODUCTION

The simulation of quantum mechanical systems is important for many scientific studies and applications, such as the study of superconductivity, DNA structures, and nuclear reactions, or the development of new materials. The state of a quantum system is described by a vector whose length is exponential in the number of particles it contains. This makes large-scale quantum systems extremely difficult to simulate on a classical computer: there are too many variables to record, not to mention the exponentially many differential equations to solve. Recognizing this difficulty, Feynman [8] proposed in 1982 to build quantum computers, which he believed capable of simulating these systems efficiently. His belief was confirmed by Lloyd [16], who gave an efficient algorithm for simulating local quantum systems. This work was later extended to larger classes of quantum systems [1, 4, 5]. Efficient simulation of quantum systems is believed to be one of the most practical applications of quantum computers.

1.1 Quantum CAD Flow

We study the architecture for the quantum simulation algorithm using the quantum computer-aided design (CAD) flow developed by the authors of [32]. It was designed to automate the synthesis and layout of quantum circuits, and it has been used to build a fault-tolerant, area-efficient architecture for Shor's factoring algorithm [33]. Figure 1 outlines its basic structure.

The quantum CAD flow receives as input a quantum circuit specified in quantum assembly language (QASM). In the first stage, it encodes the circuit and inserts the corrections necessary to make the circuit fault-tolerant. Then it synthesizes the datapath without the communication network. In the next stage, it maps the circuit to the datapath; that is, it assigns each circuit operation to an appropriate location in the datapath and inserts the necessary movement operations. After that, an appropriate long-distance communication network is synthesized to support these circuit and movement operations. In the final stage, the CAD flow uses the fully specified datapath and list of events to compute the success probability for this circuit and layout. If the overall quality of area, latency, and success probability, measured by the Area-Delay-to-Correct-Result (ADCR) metric [33], is unsatisfactory, the flow goes back to the beginning and re-synthesizes until a satisfactory layout is obtained.

1.2 Ion Trap Technology

The quantum CAD flow currently uses ion traps [24] as its substrate technology. In this technology, qubits are represented by ions suspended in channels between electrodes. Gates are performed by firing lasers at trapped ions. The layout consists of a tiling of macroblocks, which are shown in Figure 2. Each macroblock has one or more ports through which ions may enter and exit. Gates are performed only at certain locations, so the qubits involved must move to a valid gate location and remain there for the duration of the gate. Trapped ions are moved along channels via pulses applied to electrodes. Gates, movements, and other basic operations are performed with high accuracy, as shown in Table 1. In addition, ion traps have long decoherence times and show good scalability [9, 28]. As a result, the ion trap is considered one of the most promising technologies for implementing a scalable, universal quantum computer.

Physical Operation        Error, Set 1 [17]   Error, Set 2 [27]   Latency in µs [22]
One-Qubit Gate            10^{-6}             10^{-4}             1
Two-Qubit Gate            10^{-6}             10^{-4}             10
Measurement               10^{-6}             10^{-4}             50
Zero Prepare              10^{-6}             10^{-4}             51
Straight Move (~30 µm)    10^{-8}             10^{-6}             1
90 Degree Turn            10^{-8}             10^{-6}             10
Idle (per µs)             10^{-10}            10^{-8}             N/A

Table 1: Error probabilities and latency values for basic physical operations.


Figure 1: Overview of the Quantum CAD Flow and Our Work (the blue parts).

Figure 2: Basic macroblocks for ion trap layouts (Dead End Gate, Straight Channel Gate, Straight Channel, Turn, Three-Way Intersection, Four-Way Intersection). Black boxes are gate locations, gray boxes are abstract electrodes, and wide white channels are valid paths for qubit movement. (Borrowed from [33].)

1.3 Our Contributions

We build a fault-tolerant, ion-trap-based architecture for the quantum simulation algorithm, using the quantum CAD flow described above. Specifically:

• We develop software that, given the description of a quantum simulation problem, generates the solution circuit and optimizes it with a layering technique that increases its parallelism and reduces its latency. This optimized circuit is then fed into the quantum CAD flow.

• We improve the datapath synthesis and mapping stages of the quantum CAD flow by using a partitioning-and-favorite-qubit strategy to reduce the number of expensive long-range communications.

• We introduce the selective XZ error correction scheme, which improves upon its predecessor [33] by discriminating between bit-flip and phase-flip errors and thereby estimating error propagation more accurately.

• Finally, we evaluate our design with the ADCR metric and show the effectiveness of our optimizations.

The remainder of this paper is organized as follows. In Section 2, we describe the quantum simulation algorithm and our implementation details. In Section 3, we present our optimizations to the datapath synthesis and mapping stages of the quantum CAD flow. In Section 4, we propose the selective XZ error correction scheme. In Section 5, we present the evaluation results for our designs. Section 6 concludes the paper.

2. QUANTUM SIMULATION ALGORITHM

In this section, we describe the quantum simulation algorithm and its implementation. We assume familiarity with the fundamentals of quantum computation; readers new to the subject may consult the excellent introductory chapters of [19].

2.1 Preliminaries

The dynamical behavior of a quantum system is governed by Schrödinger's equation

\[ i\hbar \frac{d|\psi(t)\rangle}{dt} = H(t)|\psi(t)\rangle, \]

where $\hbar$ is the reduced Planck constant, $|\psi(t)\rangle$ is the state of the system at time $t$, and $H(t)$ is the Hamiltonian of the system at time $t$. Formally, $H(t)$ is a Hermitian operator acting on the Hilbert space of the system, and it corresponds to the total energy of the system. In general, $H(t)$ may change with time, but in this paper we focus on time-independent Hamiltonians, i.e., $H(t) \equiv H$. In this case, Schrödinger's equation has the analytical solution

\[ |\psi(t)\rangle = e^{-iHt/\hbar}|\psi(0)\rangle. \]

So the evolution of the system from time 0 to $t$ can be described by the unitary transformation $U = e^{-iHt/\hbar}$.¹

For most physical systems, the Hamiltonian can be written as a sum over many local interactions:

\[ H = \sum_{j=1}^{L} H_j, \]

where each $H_j$ acts on at most $k$ particles of the system, for some constant $k$. In this case, we say that $H$ is $k$-local. Such locality is physically reasonable, because most interactions fall off with increasing distance, and thus a particle is strongly correlated only with its nearest neighbors. For many systems, $H$ is just 2-local. One example is the famous Ising Model, in which particles are placed on a line (for the 1D Ising Model) or on a square lattice (for the 2D Ising Model), and the Hamiltonian is the sum of local interactions between each pair of neighbors.

¹ Without loss of generality, from now on we absorb the constant $\hbar$ into the Hamiltonian $H$.


2.2 Quantum Simulation Algorithm

To study a quantum system, we usually prepare it in a certain state, let it evolve for a given time $t$, and extract some information from the final state. For some systems, however, this method is too expensive or even prohibited, as with the explosion of nuclear weapons. In such situations, simulation becomes very useful. We can choose a representation system and build a quantum circuit that approximates the representation of the unitary transformation $U = e^{-iHt}$ to some desired accuracy. If this circuit contains a polynomial number of gates from a universal set,² then we say that the simulation is efficient.

Local Hamiltonians can be simulated efficiently [16]. At the heart of the quantum simulation algorithm is the following approximation theorem:

Theorem 2.1. If $H = \sum_{j=1}^{L} H_j$, then

\[ e^{-iHt} = \lim_{n \to \infty} \left( e^{-iH_1 t/n} \, e^{-iH_2 t/n} \cdots e^{-iH_L t/n} \right)^{n}. \tag{1} \]

Note that since quantum operations do not commute in general, the statement "$e^{-iHt} = e^{-iH_1 t} e^{-iH_2 t} \cdots e^{-iH_L t}$" is generally wrong. However, Theorem 2.1 tells us that, in the limit, we can still simulate a Hamiltonian by concatenating the simulations of its local terms, provided that they are interleaved in sufficiently small time slices. Moreover, by choosing a sufficiently large number of iterations $n$, we can achieve arbitrary approximation accuracy.
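The convergence promised by Theorem 2.1 can be checked numerically on a toy case. The following is a minimal sketch, not part of the software described in this paper: it takes the hypothetical two-qubit Hamiltonian $H = Z \otimes Z + X \otimes I$ (our choice for illustration) and compares the product in Eq. (1) for finite $n$ against the exact evolution. All helper names are assumptions, and the closed form used for the exponentials holds only for operators that square to the identity.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(A, B):
    """Kronecker (tensor) product of two square matrices."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def exp_pauli(P, a):
    """exp(-i a P) for any P with P*P = I, i.e. cos(a) I - i sin(a) P."""
    n = len(P)
    return [[(math.cos(a) if i == j else 0.0) - 1j * math.sin(a) * P[i][j]
             for j in range(n)] for i in range(n)]

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

H1 = kron(Z, Z)   # first local term (illustrative choice)
H2 = kron(X, I2)  # second local term; it anticommutes with H1
T = 1.0           # total evolution time

# H1 and H2 anticommute and each squares to I, so (H1 + H2)^2 = 2 I.
# Hence (H1 + H2)/sqrt(2) squares to I and the closed form applies.
H_NORMALIZED = [[(H1[i][j] + H2[i][j]) / math.sqrt(2) for j in range(4)]
                for i in range(4)]
EXACT = exp_pauli(H_NORMALIZED, math.sqrt(2) * T)

def trotter(n):
    """The finite-n product (e^{-i H1 T/n} e^{-i H2 T/n})^n from Eq. (1)."""
    step = matmul(exp_pauli(H1, T / n), exp_pauli(H2, T / n))
    U = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(n):
        U = matmul(U, step)
    return U

def trotter_error(n):
    """Largest entrywise deviation between the n-step product and EXACT."""
    approx = trotter(n)
    return max(abs(approx[i][j] - EXACT[i][j])
               for i in range(4) for j in range(4))
```

Evaluating `trotter_error(n)` for n = 1, 10, 100 shows the deviation shrinking roughly in proportion to 1/n, which is the behavior one expects from a first-order product formula.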

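Equipped with Theorem 2.1, we only need to figure out how to implement each $e^{-iH_j t/n}$. The key observation is that, since each local term $H_j$ acts on only a few qubits, it can be implemented by a small circuit. Specifically, without loss of generality, we can assume that each $H_j$ is a real multiple of a Pauli group operator, since the Pauli group operators form a linearly independent basis of the matrix space. Let us suppose that $H$ is 2-local. Then, up to a permutation of qubits, there are at most 6 possibilities for each $H_j$:

\[ H_j = aX \otimes X,\ aX \otimes Y,\ aX \otimes Z,\ aY \otimes Y,\ aY \otimes Z,\ aZ \otimes Z, \]

for some real number $a$. For each of these possibilities, we can construct a corresponding circuit based on the following equations:

\[
\begin{aligned}
e^{-iaX \otimes X} &= \mathrm{CNOT} \cdot (e^{-iaX} \otimes I) \cdot \mathrm{CNOT}, \\
e^{-iaX \otimes Y} &= (I \otimes S) \cdot \mathrm{CNOT} \cdot (e^{-iaX} \otimes I) \cdot \mathrm{CNOT} \cdot (I \otimes S^{\dagger}), \\
e^{-iaX \otimes Z} &= (I \otimes H) \cdot \mathrm{CNOT} \cdot (e^{-iaX} \otimes I) \cdot \mathrm{CNOT} \cdot (I \otimes H), \\
e^{-iaY \otimes Y} &= (I \otimes S) \cdot \mathrm{CNOT} \cdot (e^{-iaY} \otimes I) \cdot \mathrm{CNOT} \cdot (I \otimes S^{\dagger}), \\
e^{-iaY \otimes Z} &= (I \otimes H) \cdot \mathrm{CNOT} \cdot (e^{-iaY} \otimes I) \cdot \mathrm{CNOT} \cdot (I \otimes H), \\
e^{-iaZ \otimes Z} &= \mathrm{CNOT} \cdot (I \otimes e^{-iaZ}) \cdot \mathrm{CNOT}.
\end{aligned}
\]

In these equations, $H$ denotes the Hadamard gate, $S$ the phase gate, and the first qubit is the control of each CNOT.

² A set of gates is said to be universal for quantum computation if any unitary operation can be approximated to arbitrary precision by a quantum circuit involving only those gates. An example of such a set is the Hadamard, phase, CNOT, and π/8 gates.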

The implementation of the single-qubit operations $e^{-iaX}$, $e^{-iaY}$, and $e^{-iaZ}$ is described in the next subsection. For the $k$-local case, we can build the corresponding circuits similarly.
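The six two-qubit decompositions listed above can be verified numerically. The sketch below is an illustration under stated assumptions rather than our implementation: it assumes the first qubit is the CNOT control and is the most significant index of the tensor product, picks an arbitrary angle, and uses hypothetical helper names throughout.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def chain(*mats):
    """Left-to-right product of several matrices."""
    out = mats[0]
    for m in mats[1:]:
        out = matmul(out, m)
    return out

def kron(A, B):
    """Kronecker (tensor) product of two square matrices."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def exp_pauli(P, a):
    """exp(-i a P) for any P with P*P = I, i.e. cos(a) I - i sin(a) P."""
    n = len(P)
    return [[(math.cos(a) if i == j else 0.0) - 1j * math.sin(a) * P[i][j]
             for j in range(n)] for i in range(n)]

R = 1 / math.sqrt(2)
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
S = [[1, 0], [0, 1j]]        # phase gate
SDG = [[1, 0], [0, -1j]]     # its inverse
HAD = [[R, R], [R, -R]]      # Hadamard gate
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = qubit 1

def check_identities(a=0.37):
    """Return the largest entrywise deviation for each of the six identities."""
    I4 = kron(I2, I2)
    eX, eY, eZ = exp_pauli(X, a), exp_pauli(Y, a), exp_pauli(Z, a)
    # name: (P, Q, left factor, middle factor, right factor)
    cases = {
        "XX": (X, X, I4, kron(eX, I2), I4),
        "XY": (X, Y, kron(I2, S), kron(eX, I2), kron(I2, SDG)),
        "XZ": (X, Z, kron(I2, HAD), kron(eX, I2), kron(I2, HAD)),
        "YY": (Y, Y, kron(I2, S), kron(eY, I2), kron(I2, SDG)),
        "YZ": (Y, Z, kron(I2, HAD), kron(eY, I2), kron(I2, HAD)),
        "ZZ": (Z, Z, I4, kron(I2, eZ), I4),
    }
    deviations = {}
    for name, (P, Q, left, middle, right) in cases.items():
        lhs = exp_pauli(kron(P, Q), a)
        rhs = chain(left, CNOT, middle, CNOT, right)
        deviations[name] = max(abs(lhs[i][j] - rhs[i][j])
                               for i in range(4) for j in range(4))
    return deviations
```

Each deviation returned by `check_identities()` should be at the level of floating-point rounding, which confirms that every two-qubit term costs two CNOTs, one single-qubit rotation, and at most two additional single-qubit gates.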

The formal description of the quantum simulation algorithm is as follows:

Algorithm 2.2. Quantum Simulation [16]

• Inputs: (1) a Hamiltonian $H = \sum_{j=1}^{L} H_j$ acting on an $N$-dimensional system, where each $H_j$ acts on a subsystem of size independent of $N$; (2) an initial state $|\psi_0\rangle$ at time 0; (3) a time $t$ at which the evolved state is desired; (4) an accuracy $\epsilon > 0$.

• Outputs: a state $|\psi\rangle$ such that $|\langle\psi|e^{-iHt}|\psi_0\rangle| > 1 - \epsilon$.

• Procedure:

  1. Choose a representation system of $m = \mathrm{poly}(\log N)$ qubits.
  2. Choose a sufficiently large $n$, the number of iterations in Eq. (1), so that the expected error is acceptable.
  3. for j = 1 to L do
       Construct a quantum circuit $U_j$ that approximates $e^{-iH_j t/n}$ well enough.
  4. $|\psi\rangle \leftarrow |\psi_0\rangle$;
  5. for iteration = 1 to n do
       for j = L to 1 do
         $|\psi\rangle \leftarrow U_j|\psi\rangle$;

The complexity of this algorithm is $O(L \cdot \mathrm{poly}(1/\epsilon))$, assuming that the executions of the $U_j$'s are serialized. Better accuracy requires a longer running time. Likewise, the more local terms a Hamiltonian contains (i.e., the larger $L$ is), the longer the algorithm runs.

2.3 Implementation of Single-qubit Operations

In this subsection, we describe how to implement the single-qubit operations $e^{-iaX}$, $e^{-iaY}$, and $e^{-iaZ}$. We need to transform them into short sequences of basic instructions, say, the Hadamard gate $H$, the π/8 gate $T$, and its inverse $T^{\dagger}$.

A natural method is to enumerate all short sequences of $H$, $T$, and $T^{\dagger}$ and pick the closest one. However, this method is extremely inefficient. Instead, we implement and use the Solovay-Kitaev algorithm, which is derived from the constructive proof of the famous Solovay-Kitaev theorem:

Theorem 2.3. (Solovay-Kitaev Theorem [25, 13]) Let $G$ be a finite set of elements in $SU(2)$ containing its own inverses, such that $\langle G \rangle$ is dense in $SU(2)$.³ Then for any $U \in SU(2)$ and any $\epsilon > 0$, it is possible to approximate $U$ to precision $\epsilon$ using $O(\log^{c}(1/\epsilon))$ gates from $G$. The constant $c$ is approximately 2.

³ $\langle G \rangle$ is the set of all elements in $SU(2)$ that can be written as $g_1 g_2 \cdots g_m$ where $g_j \in G$. A set $A$ is dense in $SU(2)$ if every element of $SU(2)$ can be approximated to arbitrary precision by an element of $A$.

This theorem guarantees the existence of a short and accurate sequence of basic instructions for approximating any single-qubit operation. Fortunately, the proof of this theorem is constructive, yielding the following algorithm:

Algorithm 2.4. Solovay-Kitaev [6]

• Inputs: (1) a gate $U \in SU(2)$; (2) a depth $n$.

• Outputs: $U_n \in \langle G \rangle$ such that $\|U - U_n\| \le \epsilon(n)$.⁴

• Procedure:

  if n == 0 then
    return Basic Approximation to U;
  else
    Set $U_{n-1}$ = Solovay-Kitaev(U, n − 1);
    Find V, W s.t. $V W V^{\dagger} W^{\dagger} = U U_{n-1}^{\dagger}$;⁵
    Set $V_{n-1}$ = Solovay-Kitaev(V, n − 1);
    Set $W_{n-1}$ = Solovay-Kitaev(W, n − 1);
    return $U_n = V_{n-1} W_{n-1} V_{n-1}^{\dagger} W_{n-1}^{\dagger} U_{n-1}$.

This algorithm is recursive. At the bottom level, it uses some basic method, such as exhaustive search, to find an approximation with a large but bounded error. It then recursively improves the quality of the approximation by trying to close the gap between the current approximation and $U$, using the theory of group commutators. The larger $n$ is, the more accurate the final approximation. Specifically, to achieve an approximation precision $\epsilon$, we need to set $n = O(\log\log(1/\epsilon))$, which grows very slowly. In practice, we found that $n = 2$ or 3 is usually sufficient for our purposes.
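The price of each extra recursion level can be read off the last step of Algorithm 2.4: the returned word concatenates five depth-(n − 1) words, built from three recursive calls. The sketch below only counts gates and base-case lookups; it assumes that every depth-(n − 1) word has the same length and that the basic approximation has a fixed, hypothetical length `base_length`. It does not compute any actual gate sequence.

```python
def sk_cost(depth, base_length):
    """Return (gate_count, base_lookups) for one call at the given depth.

    gate_count   -- length of the returned gate sequence
    base_lookups -- number of depth-0 basic approximations performed
    """
    if depth == 0:
        return base_length, 1
    length, lookups = sk_cost(depth - 1, base_length)
    # U_n = V W V^dagger W^dagger U_{n-1}: five words of the previous depth,
    # obtained from three recursive calls (for U, V and W).
    return 5 * length, 3 * lookups
```

For example, with a hypothetical basic approximation of 10 gates, `sk_cost(3, 10)` gives a sequence of 1250 gates from 27 base lookups. The factor of 5 per level is why keeping the depth at 2 or 3 matters for circuit latency.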

2.4 Optimization by Layering

In the previous presentation of the quantum simulation algorithm, we assumed that the simulations of the local terms must be serialized. This is actually not necessary, and it usually leads to long latency. Note that if two local terms $H_j$ and $H_{j+1}$ act on disjoint subsystems, then the operations $e^{-iH_j t/n}$ and $e^{-iH_{j+1} t/n}$ are independent and hence can be applied simultaneously. As a result, the total latency can be reduced.

We find that the ordering of the simulated local terms can significantly influence the parallelism available in the resulting circuit. As shown in Figure 3(a), if the terms are not properly ordered, every two consecutive operations may depend on each other and consequently have to execute one after another.

⁴ The approximation precision depends on $n$.

⁵ It can be proved that as long as $U U_{n-1}^{\dagger}$ is sufficiently close to $I$, it always has this group commutator decomposition.

Figure 3: Illustration of Layering. Panels: (a) circuit before layering; (b) circuit after layering.

We separate all the local terms of a Hamiltonian into different "layers", as shown in Figure 3(b). All the local terms in the same layer act on disjoint subsystems and thus can be simulated at the same time. As a result, the simulation proceeds in a layer-by-layer fashion. By doing this, we greatly increase the average width of the simulation circuit and greatly reduce its depth, so that it has more parallelism and much less latency.⁶

We develop a greedy algorithm that attempts to partition the local terms of a given Hamiltonian into as few layers as possible. It works as follows:

Algorithm 2.5. Greedy Layering

• Inputs: a set of local terms $\{H_1, H_2, \ldots, H_L\}$.

• Outputs: a partition (i.e., layers) of $\{H_1, H_2, \ldots, H_L\}$.

• Procedure:

  1. $S = \{H_1, H_2, \ldots, H_L\}$, $U = \emptyset$.
  2. while $S \neq \emptyset$ do
       $T = \emptyset$; // start to generate a new layer
       while there exists a term in S that does not intersect with any term in T do
         select one such term with the largest size,⁷ add it to T, delete it from S; // add terms to the new layer until no more fit
       $S = S - T$, $U = U \cup \{T\}$;
  3. return U.

⁶ It is worth noting that this layering technique is very specific to the quantum simulation problem. The validity of permuting the quantum operations is guaranteed by Theorem 2.1. In general, quantum (and even classical) operations do not commute, and hence this kind of permutation would change the behavior of the original circuit.

⁷ Here we define the size of a local term to be the size of the subsystem it acts on.

Currently we do not have a theoretical bound on the worst-case approximation factor of this algorithm, but experimental results show that it works quite well and usually yields the optimum number of layers. It remains open to prove its optimality or to give better algorithms for layering.

For systems with some kind of geometric symmetry, this layering technique is particularly effective. For example, consider the Ising Model. Any 1D Ising Model can always be separated into 2 layers, while any 2D Ising Model can always be divided into 4 layers (see Figure 4). The number of layers does not depend on the number of particles in the system, only on its dimension.
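The greedy procedure of Algorithm 2.5 can be sketched in a few lines of Python. This is an illustrative version, not our production code: each local term is represented only by the set of qubit indices it acts on, and ties between terms of equal size are broken by input order, which is an assumption the algorithm itself leaves open.

```python
from typing import FrozenSet, Iterable, List

Term = FrozenSet[int]  # a local term, identified by the qubits it acts on

def greedy_layering(terms: Iterable[Iterable[int]]) -> List[List[Term]]:
    """Partition local terms into layers of pairwise disjoint terms."""
    # Largest terms first; the stable sort keeps input order among equals.
    remaining = sorted((frozenset(t) for t in terms), key=len, reverse=True)
    layers: List[List[Term]] = []
    while remaining:
        layer: List[Term] = []
        used = set()        # qubits already touched by the current layer
        leftover: List[Term] = []
        for term in remaining:
            if used.isdisjoint(term):
                layer.append(term)   # term fits: add it to the new layer
                used |= term
            else:
                leftover.append(term)  # conflicts: defer to a later layer
        layers.append(layer)
        remaining = leftover
    return layers
```

Because the set of used qubits only grows while a layer is built, a single pass over the size-sorted list makes the same choices as repeatedly selecting the largest non-conflicting term. On an open 1D chain with nearest-neighbor terms, e.g. `greedy_layering([(i, i + 1) for i in range(9)])`, the sketch returns two layers, in line with the 1D Ising case above. For 2D lattices the number of layers it finds can depend on the input order.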

3. DATAPATH SYNTHESIS AND MAPPING

3.1 Datapath Synthesis

Figure 5 shows the three major types of datapath organization proposed for quantum computing so far: QLA [18], CQLA [30], and Qalypso [33]. They differ in their configuration of compute and memory regions, ancilla generation, and teleportation network resources.

QLA can be viewed as the quantum analogue of an FPGA. It contains a sea of identical compute regions, each of which contains enough resources to perform a two-qubit operation, along with fixed resources for ancilla generation and teleportation routing. CQLA improves upon QLA by making use of memory regions: compute regions are surrounded by memory regions. Each region still has a fixed size and fixed resources for ancilla generation and routing. In addition, [33] introduced LQLA and CQLA+, which are variants of QLA and CQLA with improved ancilla generators from [15] and [12], respectively.

Qalypso offers the most flexibility. It allows variable-sized compute and memory regions and variable resources in the ancilla generators and teleportation network [11]. It determines these parameters only after analyzing the given circuit. In addition, it uses the improved pipelined ancilla factory from [12]. Qalypso was shown to be far superior to QLA, LQLA, CQLA, and CQLA+ with respect to the ADCR metric for random circuits, adders, and Shor's factoring algorithm [33].

The quantum CAD flow synthesizes a Qalypso datapath essentially as follows. It first traverses the dataflow graph and estimates the start and finish time of each gate. Then it divides the whole computation period into disjoint time intervals, and for each interval it counts the number of active gates. From these statistics, it decides how many functional and memory units should be used in total. It also defines a maximum acceptable area for each compute or memory region. Then it fits these functional and memory units into as few regions as possible. Finally, it embeds these regions into a mesh grid network.

We note that it is very important to reduce the number of expensive long-range communications. More remote communications have three consequences. First, the latency becomes longer. Second, the link capacity needs to be larger to avoid congestion, which in turn implies more ancillae and more area. Third, more errors can be introduced during teleportations, even though the purification procedure tries to suppress them. Because minimizing the number of remote moves is so important, we believe that it should be taken into consideration at the datapath synthesis stage.

Our idea is to first traverse the dataflow graph and build a weighted undirected graph that reflects the strength of the correlation between each pair of data qubits. The vertices of this graph are the data qubits, and the weight of each edge is the number of gates that act on both of its incident vertices. The next step is to find a relatively balanced and small cut (i.e., a sparse cut) of this graph. That is, we divide the vertices into several parts of approximately equal size, such that few edges cross between different parts. After that, we assign each part of the data qubits to a particular compute region as its "favorite" qubits. The intuition is that, since there are few interactions between different parts, these qubits will stay in their region most of the time and will rarely need to be moved out. The number of parts should be chosen appropriately. If there are very few parts, then each compute region is too large and not enough parallelism is exploited. If there are too many parts, then each part becomes too small and there will still be many inter-region communications.

Since the problem of finding the sparsest cut [3] of a graph is NP-hard, we can in general only hope to find an approximation to the sparsest cut. We implement a variant of the algorithm from [29]. It works by repeatedly "contracting" edges. At each step, it selects a random edge and contracts it, replacing its two incident vertices with one (i.e., combining the two parts to which the two vertices currently belong), unless this would lead to an imbalanced partition. This procedure is repeated until only a few vertices (each representing a part) are left. Experimental results show that this algorithm works quite well. It would be interesting to implement more sophisticated algorithms, such as the ones from [3, 2, 20], and compare their effectiveness.
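The two steps just described, building the interaction graph and contracting random edges under a balance constraint, can be sketched as follows. This is a simplified illustration and not the exact variant of [29] that we implement: the balance rule (a hard cap `max_part_size`), the weight-proportional edge order, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def interaction_graph(two_qubit_gates):
    """Edge weight = number of gates acting on both endpoints."""
    weights = Counter()
    for a, b in two_qubit_gates:
        weights[(min(a, b), max(a, b))] += 1
    return weights

def contract_partition(n_qubits, edge_weights, target_parts, max_part_size, seed=0):
    """Merge qubits along randomly ordered edges until target_parts remain.

    A contraction is skipped if the merged part would exceed max_part_size
    (our stand-in for "would lead to an imbalanced partition").
    """
    rng = random.Random(seed)
    parent = list(range(n_qubits))
    size = [1] * n_qubits

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    # Repeat each edge according to its weight so that heavily interacting
    # pairs tend to be contracted early (assumes integer weights).
    pool = [edge for edge, w in edge_weights.items() for _ in range(w)]
    rng.shuffle(pool)

    parts = n_qubits
    for u, v in pool:
        if parts <= target_parts:
            break
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_part_size:
            parent[ru] = rv
            size[rv] += size[ru]
            parts -= 1

    groups = {}
    for q in range(n_qubits):
        groups.setdefault(find(q), []).append(q)
    return list(groups.values())
```

The sketch may stop with more than `target_parts` parts when every remaining contraction would violate the size cap; in that case one can rerun it with a different seed or a looser cap and keep the partition with the smallest cut weight.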

After partitioning the data qubits, we proceed as the original quantum CAD flow does to synthesize the Qalypso datapath. This time, however, we count the number of active gates acting on each part of the qubits, and hence learn exactly how many functional units each compute region needs. By doing this, we achieve a fairer distribution of computational resources among the compute regions.

3.2 Mapping

After synthesizing the datapath, we need to map the circuit to it. For every circuit gate, taken in program order, the mapper decides which functional unit executes it and inserts the necessary data movements. If the chosen functional unit is currently busy, we need to wait until it finishes its current operation. If the qubits involved are not currently in the chosen functional unit, we need to move them there from their current locations.

Figure 4: A 2D Ising Model can always be divided into 4 layers. Panels: (a) 2D Ising Model; (b) Layer 1; (c) Layer 2; (d) Layer 3; (e) Layer 4.

Figure 5: Three Major Datapath Organizations (borrowed from [33]). Panels: (a) QLA; (b) CQLA; (c) Qalypso.

We evaluate all functional units and find the best one. Every functional unit gets a score based on three factors: (1) the waiting cost: the longer we need to wait for the functional unit to finish its current operation, the larger the penalty it gets; (2) the communication cost: the longer the distance the qubits involved must travel to reach this functional unit, the larger the penalty it gets, and if a teleportation is involved, it gets a very large penalty; (3) the favorite qubits of the compute region in which it lies: if one of the qubits involved is not a favorite qubit of that compute region, the unit gets a very large penalty. We then choose the functional unit with the highest score to execute the gate. Under this rule, each data qubit eventually lives in one particular compute region (the one whose favorite qubits include it) most of the time, and moves out only when necessary (when executing a gate that acts on two qubits in two different parts).

This partitioning-and-favorite-qubit strategy is particularly effective for Hamiltonian simulation circuits. In most physical systems, particles interact only with their nearest neighbors. Our logical partitioning thus corresponds to a geometric partitioning of the original physical system. The number of interactions across the boundaries is small, so the number of inter-region communications is also small.
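The three-factor scoring rule used by the mapper can be written down compactly. The sketch below is only an illustration of the rule's shape: the `Candidate` record, the linear combination of penalties, and the numeric weights are all hypothetical and are not the values used in our mapper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Sequence

@dataclass
class Candidate:
    """One functional unit considered for executing a gate (hypothetical record)."""
    name: str
    wait_time: float        # time until the unit finishes its current operation
    move_distance: float    # distance the operand qubits must travel to reach it
    needs_teleport: bool    # True if reaching it requires a teleportation
    region_favorites: FrozenSet[int]  # favorite qubits of the unit's compute region

def score(c: Candidate, qubits: Iterable[int],
          w_wait: float = 1.0, w_move: float = 1.0,
          teleport_penalty: float = 1000.0,
          non_favorite_penalty: float = 1000.0) -> float:
    """Higher is better; every factor contributes a penalty (negative score)."""
    penalty = w_wait * c.wait_time + w_move * c.move_distance
    if c.needs_teleport:
        penalty += teleport_penalty          # factor (2): very large penalty
    if any(q not in c.region_favorites for q in qubits):
        penalty += non_favorite_penalty      # factor (3): very large penalty
    return -penalty

def choose_unit(candidates: Sequence[Candidate], qubits: Sequence[int]) -> Candidate:
    """Pick the functional unit with the highest score for this gate."""
    return max(candidates, key=lambda c: score(c, qubits))
```

With these illustrative weights, a busy unit in the home region of both operands is preferred over an idle unit that can only be reached by teleportation, which is the behavior that keeps qubits in their favorite regions.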

4. OPTIMIZING QEC

Quantum error correction is crucial for the fault tolerance of quantum circuits. Figure 6 shows a standard error correction circuit.⁸ It is composed of two stages: one for correcting X (bit-flip) errors and another for correcting Z (phase-flip) errors. In each stage, we prepare ancillae in certain states, apply CNOT gates transversally to propagate the data qubits' errors to the ancillae, perform appropriate measurements on the ancillae to obtain the syndrome bits, and finally perform the corresponding corrections on the data qubits. Note that the placement of the CNOTs differs between the two stages. This is because X errors and Z errors behave differently with respect to CNOT, as shown in Figure 7. An X error propagates only from the control qubit to the target qubit, not in the opposite direction; for a Z error, the situation is exactly the opposite.

⁸ This is Steane's error correction [26], which works only for CSS codes. Another type, Knill's error correction [14], which is based on quantum teleportation, works for any stabilizer code.

The traditional brute force method [23] applies error correction after every gate to ensure that qubit errors do not propagate to other qubits. This method has two drawbacks. First, it is extremely expensive: under this scheme, more than 90% of physical operations are used for error correction rather than useful computation [31]. Second, although this scheme tries to suppress errors, it involves so many gates and data movements that it is likely to introduce new errors.

Figure 6: A Standard Quantum Error Correction Circuit. mX and mZ denote measurement in the X and Z basis, respectively.

Figure 7: Propagation of X and Z errors with respect to a CNOT.

Recognizing these disadvantages, the authors of [21, 33] proposed and applied an improved method named selective error correction. Its main idea is to analyze the circuit, estimate its critical error paths, and focus error correction resources only on these paths. Specifically, it assigns an error distance (EDist) to each edge of the dataflow graph. Any gate increases EDist by 1. In addition, all qubits interacting in a gate acquire the maximum EDist among them. When an edge's EDist reaches a certain threshold, we place a correction on it, reset its EDist to 0 after the correction, and continue. See Figure 8 for an example. This strategy dramatically reduces the total gate count while still maintaining a satisfactory success probability [33].
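The EDist bookkeeping described above is easy to express in code. The following is a minimal sketch under our reading of the rule (take the maximum EDist over the gate's operands, add 1, then correct and reset any operand that has reached the threshold); it represents a circuit simply as a list of tuples of qubit indices and is not the implementation of [21, 33].

```python
def selective_corrections(n_qubits, gates, threshold):
    """Return the corrections inserted by selective error correction.

    gates     -- list of tuples of qubit indices, in program order
    threshold -- EDist value at which a correction is placed
    Result    -- list of (gate_index, qubit): a correction on `qubit`
                 right after gate number `gate_index`.
    """
    edist = [0] * n_qubits
    corrections = []
    for index, qubits in enumerate(gates):
        # All operands acquire the maximum EDist among them, plus 1 for the gate.
        new_distance = max(edist[q] for q in qubits) + 1
        for q in qubits:
            edist[q] = new_distance
            if edist[q] >= threshold:
                corrections.append((index, q))
                edist[q] = 0  # a correction resets the error distance
    return corrections
```

For the four-gate example `[(0,), (0, 1), (1, 2), (2,)]`, a threshold of 3 places only two corrections (on qubits 1 and 2 after the third gate), whereas a threshold of 1, which reproduces the brute force scheme, places six.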

4.1 Selective XZ Error Correction

We observe that the selective error correction approach does not distinguish between X and Z errors. As a result, it conservatively assumes that any multi-qubit gate propagates its inputs' errors to every output. As Figure 7 shows, this does not necessarily happen. If we discriminate between X errors and Z errors and consider the different effects that different gates have on them, we obtain a more accurate estimate of error propagation and can hence further reduce the gate count. We call this method selective XZ error correction. Specifically, we assign an X error distance (XEDist) and a Z error distance (ZEDist) to each edge of the dataflow graph. Their update rules respect the specific influence of each type of gate on X and Z errors. For example, since a Hadamard gate converts an X error into a Z error and vice versa (i.e., HXH = Z and HZH = X), we exchange XEDist and ZEDist after this gate; for CNOT and Control-Z, the update rules are based on the propagation behaviors shown in Figures 7 and 9. As in selective error correction, we place an X correction or a Z correction on any edge whose XEDist or ZEDist reaches a certain threshold, reset that distance to 0 after the correction, and move on.
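To make the update rules concrete, the sketch below tracks XEDist and ZEDist for circuits built from Hadamard, CNOT, and Control-Z gates. The exact arithmetic is our assumption for illustration: every gate adds 1 to both distances of each operand, a Hadamard swaps the two distances, a CNOT passes the X distance from control to target and the Z distance from target to control, and a Control-Z feeds each operand's X distance into the other operand's Z distance. Other gate types are rejected rather than guessed.

```python
def selective_xz_corrections(n_qubits, gates, x_threshold, z_threshold):
    """Return (gate_index, qubit, "X" or "Z") corrections.

    gates -- list of ("H", q), ("CNOT", control, target) or ("CZ", a, b)
    """
    x = [0] * n_qubits  # XEDist per qubit
    z = [0] * n_qubits  # ZEDist per qubit
    corrections = []
    for index, (kind, *qs) in enumerate(gates):
        if kind == "H":
            (q,) = qs
            x[q], z[q] = z[q] + 1, x[q] + 1          # H exchanges X and Z errors
        elif kind == "CNOT":
            c, t = qs
            x[c], x[t] = x[c] + 1, max(x[c], x[t]) + 1   # X: control -> target
            z[c], z[t] = max(z[c], z[t]) + 1, z[t] + 1   # Z: target -> control
        elif kind == "CZ":
            a, b = qs
            # An X error on one operand becomes a Z error on the other.
            z[a], z[b] = max(z[a], x[b]) + 1, max(z[b], x[a]) + 1
            x[a], x[b] = x[a] + 1, x[b] + 1
        else:
            raise ValueError(f"no update rule assumed for gate {kind!r}")
        for q in qs:
            if x[q] >= x_threshold:
                corrections.append((index, q, "X"))
                x[q] = 0
            if z[q] >= z_threshold:
                corrections.append((index, q, "Z"))
                z[q] = 0
    return corrections
```

For the chain `[("CNOT", 0, 1), ("CNOT", 1, 2), ("CNOT", 2, 3)]` with both thresholds set to 3, the sketch inserts X corrections on qubits 2 and 3 after the last gate and no Z corrections at all, because Z errors flow from target to control, against the direction of the chain. The separate `x_threshold` and `z_threshold` parameters also allow the two error types to be treated asymmetrically.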

Figure 9: Propagation of X and Z errors with respect to a Control-Z.

Figure 10 shows the corrected version of the circuit in Figure 8(a) using selective XZ error correction. Comparing it with Figure 8(c), one finds that our method saves one X correction and two Z corrections even for this small circuit.

In addition to reducing the gate count and therefore the latency, our selective XZ error correction method has another potential advantage over the selective error correction method. Most of the literature on quantum fault tolerance assumes the depolarization noise model, in which the probabilities of bit flips and phase flips are the same. This makes the theoretical analysis much easier, but it may not be faithful to real systems. A system might be more vulnerable to X errors than to Z errors, or vice versa. If, say, X errors are more frequent than Z errors, then we can make the XEDist threshold smaller than the ZEDist threshold, and as a result more X corrections will be inserted. In other words, our scheme can adapt more easily to non-uniform noise models.

In both the selective error correction and the selective XZ error correction approaches, we need to choose an appropriate error distance threshold. If it is too small, the corrected circuit contains too many physical gates. If it is too large, then, although the gate count is significantly reduced, the success probability becomes too low. We need to evaluate different thresholds and find one that balances cost against success probability.

5. EXPERIMENTAL RESULTS

In this section, we evaluate the designs described in the previous sections. Our benchmarks are mainly random local Hamiltonians that we generate. We also examine the effect of layering for the Ising Model, since it is representative of physical systems with geometric symmetry. In addition, we test our optimizations to the quantum CAD flow on random circuits in order to assess their generality.

Since LQLA and CQLA+ are the optimized versions of QLA and CQLA, respectively, we choose to examine four types of datapath organization: LQLA, CQLA+, Qalypso, and our improved Qalypso using our mapper, which we call iQalypso.

We use the [[7, 1, 3]] Steane code to encode our circuit. We assume the depolarization noise model and use Error Set 1 of Table 1 unless otherwise stated. The success probability is calculated by the vector Monte Carlo method [31].


Figure 8: Illustration of Brute Force and Selective Error Correction Methods. Panels: (a) uncorrected circuit; (b) brute force error correction; (c) selective error correction with EDist threshold = 3.

Figure 10: Illustration of Selective XZ Error Correction Method. Panels: (a) insertion of X corrections with XEDist threshold = 3; (b) insertion of Z corrections with ZEDist threshold = 3.

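We use the comprehensive metric Area-Delay-to-Correct-Result (ADCR), defined as follows, to evaluate the quality of the final layout:

\[
\begin{aligned}
\mathrm{ADCR} &= \mathrm{Area} \times E(\mathrm{Latency}_{\mathrm{total}}) \\
&= \mathrm{Area} \times \sum_{n=1}^{\infty} n \cdot \mathrm{Latency}_{\mathrm{single}} \times P_{\mathrm{success}} (1 - P_{\mathrm{success}})^{n-1} \\
&= \mathrm{Area} \times \frac{\mathrm{Latency}_{\mathrm{single}}}{P_{\mathrm{success}}}.
\end{aligned}
\]

Here $n$ is the number of runs needed until the first successful one, which occurs with probability $P_{\mathrm{success}}(1 - P_{\mathrm{success}})^{n-1}$ and costs a total latency of $n \cdot \mathrm{Latency}_{\mathrm{single}}$.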

For ADCR, lower is better. We also report the area, latency, success probability, number of physical gates, and number of long-range communications.
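As a quick sanity check of the ADCR definition, the sketch below computes the metric from its closed form and, separately, evaluates a truncated version of the geometric series for the expected total latency. The function names and the truncation length are our own choices for illustration; units are whatever the area and latency are measured in.

```python
def adcr(area, latency_single, p_success):
    """Area x expected latency until the first correct result."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must lie in (0, 1]")
    return area * latency_single / p_success

def expected_latency_series(latency_single, p_success, terms=1000):
    """Truncated sum over n of n * latency * P * (1 - P)^(n - 1).

    The truncation is accurate only when (1 - P)^terms is negligible.
    """
    return sum(n * latency_single * p_success * (1.0 - p_success) ** (n - 1)
               for n in range(1, terms + 1))
```

For instance, with a single-run latency of 10 and a success probability of 0.5, the truncated series converges to 20, which matches `latency_single / p_success`; a layout of area 4 then has an ADCR of 80.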

5.1 Effect of Layering

Figures 11 and 12 show the effect of layering for general Hamiltonians and 1D Ising Models. In both cases, layering reduces ADCR substantially. For general Hamiltonians, layering decreases ADCR by nearly an order of magnitude when the number of local terms approaches 1000. For Ising Models, the reduction in ADCR is more dramatic: nearly two orders of magnitude when the number of local terms approaches 1000. This is consistent with our earlier prediction.

Figure 11: Effect of Layering for General Hamiltonian

Figure 12: Effect of Layering for Ising Model

5.2 Comparison of Datapath Organizations

Figures 13 and 14 show ADCR for general Hamiltonian simulations and random circuits using the LQLA, CQLA+, Qalypso, and iQalypso architectures without optimization to QEC. In all these cases, Qalypso and iQalypso significantly outperform LQLA and CQLA+. Moreover, iQalypso beats Qalypso by a factor of 2 to 3 for general Hamiltonian simulations, and by a factor of nearly 2 for random circuits. During these experiments, we also observed that iQalypso has far fewer inter-region communications than the others. So our partitioning-and-favorite-qubit strategy indeed works and plays an important role in reducing ADCR. It is also understandable that the advantage of iQalypso over Qalypso is smaller for random circuits than for Hamiltonian simulations: a random circuit does not have good locality, and no matter how we partition it, the cut is always relatively large.
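The locality argument can be illustrated with a small count of inter-region gates. The sketch below is a toy model only: it assumes qubits are assigned to equal-sized contiguous regions, and it uses a deterministic long-range gate pattern as a stand-in for a circuit with poor locality. The function names are hypothetical and this is not the partitioner used in our mapper.

```python
def contiguous_partition(num_qubits, num_regions):
    """Assign qubits 0..n-1 to equal-sized contiguous regions."""
    block = -(-num_qubits // num_regions)  # ceiling division
    return {q: q // block for q in range(num_qubits)}


def count_inter_region_gates(gates, region_of):
    """Count two-qubit gates whose operands sit in different regions."""
    return sum(1 for a, b in gates if region_of[a] != region_of[b])


if __name__ == "__main__":
    n, regions = 16, 4
    region_of = contiguous_partition(n, regions)

    # Local circuit: nearest-neighbour gates along a chain.
    local_gates = [(q, q + 1) for q in range(n - 1)]
    # Poor locality: every gate pairs qubit q with qubit q + n/2.
    long_range_gates = [(q, q + n // 2) for q in range(n // 2)]

    print(count_inter_region_gates(local_gates, region_of))       # 3
    print(count_inter_region_gates(long_range_gates, region_of))  # 8
```

In this toy setting only 3 of the 15 local gates cross a region boundary, whereas all 8 long-range gates do, which is the effect that limits the benefit of partitioning for circuits without locality.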

Figure 13: Comparison of Datapaths for Hamilto-nian Simulations

5.3 Comparison of Error Correction Methods

Figure 15 shows the effect of the EDist threshold on success probability and number of physical gates for a random circuit, with both selective error correction and our selective XZ error correction scheme. As the EDist threshold grows, both the success probability and the number of physical gates drop. Observe that even setting the EDist threshold to 3 saves a large number of physical gates. Moreover, for the same EDist threshold, our selective XZ error correction uses fewer physical gates and achieves a slightly higher success probability than selective error correction. This confirms that by distinguishing between X errors and Z errors, we get a better estimate of error propagation and thus insert fewer corrections. A reduced correction count means fewer gates and data movements, and hence fewer errors introduced by the corrections themselves, which explains the slightly higher success probability of our method.
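The following toy model illustrates why tracking X and Z errors separately can lead to fewer inserted corrections. It is a simplified sketch, not the EDist computation used in our tool: the distance update rules are assumptions made for this example (a single-qubit gate adds one to both distances; a CNOT spreads X-type distance from control to target and Z-type distance from target to control), and all names are hypothetical.

```python
def selective_corrections(gates, num_qubits, threshold):
    """Single distance per qubit; a full (X and Z) correction is
    inserted whenever the distance reaches the threshold."""
    dist = [0] * num_qubits
    inserted = []  # (gate index, qubit)
    for i, gate in enumerate(gates):
        qubits = gate[1:]
        worst = max(dist[q] for q in qubits) + 1
        for q in qubits:
            dist[q] = worst
            if dist[q] >= threshold:
                inserted.append((i, q))
                dist[q] = 0
    return inserted


def selective_xz_corrections(gates, num_qubits, threshold):
    """Separate X and Z distances; only the half-correction whose
    distance reaches the threshold is inserted."""
    xdist = [0] * num_qubits
    zdist = [0] * num_qubits
    inserted = []  # (gate index, qubit, "X" or "Z")
    for i, gate in enumerate(gates):
        if gate[0] == "cnot":
            _, c, t = gate
            new_xt = max(xdist[c], xdist[t]) + 1  # X: control -> target
            new_zc = max(zdist[c], zdist[t]) + 1  # Z: target -> control
            xdist[c] += 1
            zdist[t] += 1
            xdist[t] = new_xt
            zdist[c] = new_zc
            touched = (c, t)
        else:
            _, q = gate
            xdist[q] += 1
            zdist[q] += 1
            touched = (q,)
        for q in touched:
            if xdist[q] >= threshold:
                inserted.append((i, q, "X"))
                xdist[q] = 0
            if zdist[q] >= threshold:
                inserted.append((i, q, "Z"))
                zdist[q] = 0
    return inserted


if __name__ == "__main__":
    circuit = [("u", 0), ("u", 0), ("cnot", 0, 1), ("cnot", 1, 2)]
    print(selective_corrections(circuit, 3, threshold=3))
    print(selective_xz_corrections(circuit, 3, threshold=3))
```

On this four-gate circuit the single-distance variant inserts two full corrections (four half-corrections in total), while the XZ variant inserts three half-corrections, because the target of the first CNOT has accumulated X-type distance but little Z-type distance.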

Figure 14: Comparison of Datapaths for Random Circuits

Figure 15: Effect of EDist Threshold on Success Probability and Gate Counts for a random 3000-gate circuit. Error Set 2 is used.

Figure 16 shows ADCR for random circuits using the brute-force error correction, selective error correction, and selective XZ error correction schemes. Our scheme beats the brute-force scheme by a factor of 3 to 4 on average, and reduces the ADCR of the selective error correction scheme by about 20% on average.

5.4 Overall ADCR, Latency and Area

Figure 17 shows ADCR for general Hamiltonian simulation using the LQLA, CQLA+, Qalypso, and iQalypso architectures with selective XZ error correction. Again, Qalypso and iQalypso are far superior to LQLA and CQLA+, and iQalypso beats Qalypso by a factor of 2 to 3. Comparing Figure 17 with Figure 13 shows that selective XZ error correction reduces ADCR for all four datapath organizations.

Figures 18 and 19 show the latency and area for general Hamiltonian simulation. Both grow slowly. Figures 18 and 19 also show that optimizing QEC reduces both the latency and the area.

Figure 16: Comparison of QEC Schemes. Error Set 2 is used.

Figure 17: ADCR for Hamiltonian Simulation

6. CONCLUSION

To summarize, we have designed and evaluated a fault-tolerant, ion-trap-based architecture for the quantum simulation algorithm, using the quantum CAD flow. We developed software that, given a quantum simulation problem, generates the simulation circuit and optimizes it with a layering technique, which greatly increases its parallelism. We optimized the datapath synthesis and mapping stages of the quantum CAD flow so that the number of expensive remote communications is decreased. We also introduced the selective XZ error correction scheme, which uses fewer gates than its predecessor while maintaining its success probability. Experimental results indicate that these optimizations to the quantum CAD flow work not only for the quantum simulation algorithm but also for general circuits.

Figure 18: Latency for Hamiltonian Simulation

Figure 19: Area for Hamiltonian Simulation

There are many directions that deserve further investigation. Here we point out a few that we are particularly interested in.

• Is Algorithm 2.5 optimal? Or can we give a better algorithm for layering?

• As mentioned in Section 3.1, can we use more sophisticated algorithms, such as the ones from [2, 3, 20], to get better cuts and thus further reduce the number of remote moves?

• Due to time limits, we did not improve the routing or ancilla generation parts of the quantum CAD flow. It would be beneficial to optimize those parts for either the quantum simulation algorithm or general circuits.

• In this paper, we have only considered the simulation of time-independent Hamiltonians. A natural and important extension would be to consider the simulation of time-dependent Hamiltonians. In particular, adiabatic algorithms [7] can be viewed as a special kind of time-dependent Hamiltonian that changes linearly with time: $H(t) = \frac{t}{T} H_0 + \left(1 - \frac{t}{T}\right) H_1$. So the simulation of time-dependent Hamiltonians would lead to a new way to implement adiabatic algorithms. It can also be used to study the ground state of any Hamiltonian, by making use of an adiabatic process.


• An efficient algorithm for simulating sparse Hamiltonians was presented in [4]. This algorithm was then used by [10] as a subroutine of a quantum algorithm that can solve sparse linear systems exponentially faster than any classical algorithm. Since sparse linear systems are so important and pervasive in scientific computing, it would be meaningful to study architectures for these algorithms.
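For the linearly interpolated Hamiltonian H(t) = (t/T) H0 + (1 − t/T) H1 mentioned in the list above, one possible route (an assumption for illustration, not something evaluated in this paper) is to approximate the time-dependent evolution by a sequence of time-independent Hamiltonians, one per time slice. The sketch below only builds that sequence; matrices are plain nested lists, and the function names and the midpoint sampling rule are hypothetical choices.

```python
def interpolate_hamiltonian(h0, h1, t, total_time):
    """Entrywise H(t) = (t/T) * H0 + (1 - t/T) * H1."""
    if not 0.0 <= t <= total_time:
        raise ValueError("t must lie in [0, T]")
    s = t / total_time
    return [
        [s * a + (1.0 - s) * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(h0, h1)
    ]


def piecewise_constant_schedule(h0, h1, total_time, steps):
    """Sample H(t) at the midpoint of each of `steps` equal slices."""
    dt = total_time / steps
    return [
        interpolate_hamiltonian(h0, h1, (k + 0.5) * dt, total_time)
        for k in range(steps)
    ]


if __name__ == "__main__":
    pauli_z = [[1, 0], [0, -1]]  # plays the role of H0
    pauli_x = [[0, 1], [1, 0]]   # plays the role of H1
    print(interpolate_hamiltonian(pauli_z, pauli_x, 5.0, 10.0))
    for h in piecewise_constant_schedule(pauli_z, pauli_x, 10.0, 4):
        print(h)
```

With this convention H(0) = H1 and H(T) = H0, matching the formula as written. Each sampled matrix could then be handed to a time-independent simulation routine for a duration of T/steps.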

7. ACKNOWLEDGEMENT

We would like to thank John Kubiatowicz for his encouragement on this project and for helpful discussions. We would also like to thank his group for generously sharing their quantum CAD toolset.

8. REFERENCES

[1] D. Aharonov and A. Ta-Shma. Adiabatic Quantum State Generation and Statistical Zero Knowledge. In Proc. ACM Symposium on Theory of Computing, 2003.

[2] S. Arora, E. Hazan, and S. Kale. $O(\sqrt{\log n})$ Approximation to SPARSEST CUT in $\tilde{O}(n^2)$ Time. In Proc. IEEE Symposium on Foundations of Computer Science, 2004.

[3] S. Arora, S. Rao, and U.V. Vazirani. Expander Flows, Geometric Embeddings and Graph Partitioning. In Proc. ACM Symposium on Theory of Computing, 2004.

[4] D.W. Berry, G. Ahokas, R. Cleve, and B.C. Sanders. Efficient quantum algorithms for simulating sparse Hamiltonians. Arxiv preprint quant-ph/0508139, 2005.

[5] A. Childs. On the relationship between continuous- and discrete-time quantum walk. Arxiv preprint 0810.0312, 2008.

[6] C.M. Dawson and M.A. Nielsen. The Solovay-Kitaev algorithm. Quantum Information and Computation, 6(1):81-95, 2006.

[7] E. Farhi, J. Goldstone, S. Gutmann, and M. Sipser. Quantum Computation by Adiabatic Evolution. Arxiv preprint quant-ph/0001106, 2000.

[8] R.P. Feynman. Simulating Physics with Computers. International Journal of Theoretical Physics, 21:467-488, 1982.

[9] H. Häffner et al. Scalable multiparticle entanglement of trapped ions. Nature, 438:643-646, 2005.

[10] A.W. Harrow, A. Hassidim, and S. Lloyd. Quantum Algorithm for Linear Systems of Equations. Phys. Rev. Lett., 103:150502, 2009.

[11] N. Isailovic, Y. Patel, M. Whitney, and J. Kubiatowicz. Interconnection Networks for Scalable Quantum Computers. In Intl. Symp. on Computer Architecture, 2006.

[12] N. Isailovic, M. Whitney, Y. Patel, and J. Kubiatowicz. Running a quantum circuit at the speed of data. In Intl. Symp. on Computer Architecture, 2008.

[13] A. Kitaev. Quantum Computations: algorithms and error correction. Russ. Math. Surv., 52(6):1191-1249, 1997.

[14] E. Knill. Quantum Computing with Realistically Noisy Devices. Nature, 434:39-44, 2005.

[15] L. Kreger-Stickles and M. Oskin. Microcoded Architectures for Ion-Trap Quantum Computers. In Intl. Symp. on Computer Architecture, 2008.

[16] S. Lloyd. Universal Quantum Simulators. Science, 273(5278):1073-1078, 1996.

[17] M.J. Madsen et al. Planar ion trap geometry for microfabrication. Applied Phys. B: Lasers and Optics, 78:639-651, 2004.

[18] T.S. Metodi et al. A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation. In Intl. Symp. on Microarchitecture, 2005.

[19] M.A. Nielsen and I.L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

[20] L. Orecchia, L.J. Schulman, U.V. Vazirani, and N.K. Vishnoi. On partitioning graphs via single commodity flows. In Proc. ACM Symposium on Theory of Computing, 2008.

[21] M. Oskin, F.T. Chong, and I.L. Chuang. A practical architecture for reliable quantum computers. Computer, 35(1):79-87, 2002.

[22] R. Ozeri et al. Hyperfine Coherence in the Presence of Spontaneous Photon Scattering. Phys. Rev. Lett., 95:030403.

[23] J. Preskill. Fault-tolerant quantum computation. Arxiv preprint quant-ph/9712048, 1997.

[24] S. Seidelin et al. Microfabricated surface-electrode ion trap for scalable quantum information processing. Phys. Rev. Lett., 96(25):253003, 2006.

[25] R. Solovay. Unpublished manuscript, 1995.

[26] A.M. Steane. Active stabilization, quantum computation, and quantum state synthesis. Phys. Rev. Lett., 78:2252, 1997.

[27] A.M. Steane. How to build a 300 bit, 1 Gop quantum computer. Arxiv preprint quant-ph/0412165, 2004.

[28] D. Stick et al. Ion trap in a semiconductor chip. Nature Physics, 2:36-39, 2006.

[29] M. Stoer and F. Wagner. A Simple Min-Cut Algorithm. Journal of the ACM, 44(4):585-591, 1997.

[30] D.D. Thaker et al. Quantum Memory Hierarchies: Efficient Designs to Match Available Parallelism in Quantum Computing. In Intl. Symp. on Computer Architecture, 2006.

[31] M. Whitney. Practical Fault Tolerance for Quantum Circuits. Ph.D. thesis, UC Berkeley, 2009.

[32] M. Whitney, N. Isailovic, Y. Patel, and J. Kubiatowicz. Automated Generation of Layout and Control for Quantum Circuits. In Intl. Conf. on Computing Frontiers, 2007.

[33] M.G. Whitney, N. Isailovic, Y. Patel, and J. Kubiatowicz. A Fault Tolerant, Area Efficient Architecture for Shor's Factoring Algorithm. In Intl. Symp. on Computer Architecture, 2009.
