
Verifiable ASICs: trustworthy hardware with untrusted components

Riad S. Wahby◦⋆, Max Howald†⋆, Siddharth Garg⋆, abhi shelat‡, and Michael Walfish⋆

◦Stanford University  ⋆New York University  †The Cooper Union  ‡The University of Virginia

June 10th, 2016


Setting: ASICs with mutually distrusting designer, manufacturer

[Diagram: the Principal (government, chip designer) sends a chip design to the Manufacturer (“foundry” or “fab”)]

Here we are thinking about ASICs, not CPUs:

[Diagram: a CPU (RAM, cache, register file, ALU) vs. an ASIC (inputs in[0] … in[n] feeding logic and D/Q flip-flops)]


Untrusted manufacturers can craft hardware Trojans

e.g., a network firewall appliance with a custom chip for packet processing

What if our packet-processing chip has a back door?

Threat: incorrect execution of the packet filter

(Other concerns, e.g., secret state, are important but orthogonal)

US DoD controls its supply chain with trusted foundries.


Trusted fabs are the only way to get strong guarantees

For example, stealthy Trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]

But trusted fabrication is not a panacea:

✗ Only 5 countries have cutting-edge fabs on-shore

✗ Building a new fab takes $$$$$$ and years of R&D

✗ Semiconductor scaling: chip area and energy go with the square and cube of transistor length (“critical dimension”); see the sketch below

✗ So using an old fab means an enormous performance hit, e.g., India’s best on-shore fab is 10^8× behind the state of the art

Can we get trust more cheaply?
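To make the scaling bullet concrete, here is a back-of-the-envelope sketch in Python. The node sizes are hypothetical examples, not figures from the talk; the square and cube exponents are the ones stated above.

```python
# Back-of-the-envelope sketch of the scaling bullet above: chip area
# grows with the square, and energy with the cube, of the critical
# dimension. The node sizes below are hypothetical examples.

def penalty(old_nm: float, new_nm: float) -> tuple[float, float]:
    """(area, energy) cost of building at old_nm instead of new_nm."""
    ratio = old_nm / new_nm
    return ratio ** 2, ratio ** 3

area_x, energy_x = penalty(old_nm=350.0, new_nm=14.0)
print(f"area: {area_x:.0f}x, energy: {energy_x:.0f}x")  # area: 625x, energy: 15625x
```

Even this modest 25× gap in critical dimension compounds into orders of magnitude of area and energy, which is why an old trusted fab is such an expensive hedge.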


Verifiable ASICs

Principal: F → designs for P, V

Untrusted fab (fast) builds P

Trusted fab (slow) builds V

Integrator assembles V and P

[Diagram: V sends input x to P; P returns output y and a proof that y = F(x)]
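This shape suggests a thin interface between the two chips. Below is a minimal sketch (Python, with hypothetical names throughout; nothing here is Zebra’s actual API) of how an integrator might compose a trusted V with an untrusted P:

```python
# Minimal sketch of the deployment shape above (hypothetical interfaces;
# not Zebra's actual API). The integrator wires a slow trusted V to a
# fast untrusted P and releases an output only if V accepts the proof.
from dataclasses import dataclass
from typing import Protocol, Tuple

class Prover(Protocol):
    """Built by the fast, untrusted fab."""
    def run(self, x: int) -> Tuple[int, object]: ...   # returns (y, proof)

class Verifier(Protocol):
    """Built by the slow, trusted fab."""
    def check(self, x: int, y: int, proof: object) -> bool: ...

@dataclass
class Integrator:
    v: Verifier
    p: Prover

    def compute(self, x: int) -> int:
        y, proof = self.p.run(x)
        if not self.v.check(x, y, proof):
            raise RuntimeError("proof rejected: P misbehaved")
        return y  # correct with high probability, even if P is Trojaned
```

The point of the division of labor is that V can be small and slow (built at a trusted, older node) while P is large and fast (built at an advanced, untrusted node).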

Can we build Verifiable ASICs?

[Diagram: V + P, with input x, output y, and a proof that y = F(x), vs. simply building F in a trusted fab]

Makes sense if V + P are cheaper than trusted F

[Figure (screenshot from a survey of built systems): worker’s cost normalized to native C, log scale from 10^0 to 10^13, for matrix multiplication (m=128) and PAM clustering (m=20, d=128), across Pepper, Ginger, Pinocchio, Zaatar, CMT, native C, Allspice, TinyRAM, and Thaler, annotated “10^8×”. Caption: “Figure 5—Prover overhead normalized to native execution cost for two computations. Prover overheads are generally enormous.” Backdrop: a wall of citations from theory (Babai85, GMR85, BCC86, BFLS91, FGLSS91, Kilian92, ALMSS92, AS92, Micali94, BG02, GOS06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR13, BCCT13, KRR14, …) and systems (SBW11, CMT12, SMBW12, TRMP12, …, KZMQCPPsS15)]

Reasons for hope:

• Running time of V < F (asymptotically)

• Implementations exist

• P overheads are massive, but using an advanced fab might offset these costs


Reasons for caution:

• Theory is silent about feasibility

• Onus is heavier than in prior work

• Hardware issues: energy, chip area

• Need a physically realizable circuit design

• Need V to save costs at plausible computation sizes


A qualified success

Zebra: a hardware design that saves costs . . . sometimes.

Probabilistic proof protocols, briefly

[Diagram: V sends input x; P returns output y and a proof that y = F(x)]

F must be expressed as an arithmetic circuit (AC): a generalized Boolean circuit over F_p (∧ → ×, ∨ → +)

AC satisfiable ⇐⇒ F was executed correctly

P convinces V that the AC is satisfiable

What about other schemes? e.g., FHE [GGP10], MIP+FHE [BC12], MIP [BTWV14], PCIP [RRR16], IOP [BCS16], PIR [BHK16], … These all seem a bit further from practicality.
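To ground the AC representation, here is a minimal sketch (Python) of evaluating a layered arithmetic circuit over F_p. The gate encoding is a hypothetical toy format, not the one any of these systems use, and the satisfiability and proof machinery is omitted entirely.

```python
# Minimal sketch: evaluating a layered arithmetic circuit over F_p.
# The ("add"/"mul", left, right) gate encoding is a toy format.
P = 2**61 - 1  # prime field modulus (illustrative choice)

def eval_layer(gates, wires):
    """Each gate reads two wire indices from the layer below."""
    out = []
    for op, left, right in gates:
        if op == "add":          # ∨ maps to + in the generalized circuit
            out.append((wires[left] + wires[right]) % P)
        else:                    # "mul": ∧ maps to ×
            out.append((wires[left] * wires[right]) % P)
    return out

def eval_circuit(layers, inputs):
    wires = list(inputs)
    for gates in layers:         # evaluate layer by layer, inputs upward
        wires = eval_layer(gates, wires)
    return wires

# F(x0, x1, x2, x3) = (x0 + x1) * (x2 * x3) as a depth-2 layered AC
layers = [
    [("add", 0, 1), ("mul", 2, 3)],  # layer nearest the inputs
    [("mul", 0, 1)],                 # output layer
]
print(eval_circuit(layers, [3, 4, 2, 5]))  # -> [70]
```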

Two protocol families:

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark:
+ Nondeterministic ACs, arbitrary connectivity
+ Few rounds (≤ 3)
✗ Unsuited to hardware implementation

IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice:
– Deterministic ACs; layered, low depth
– Many rounds
✓ Suited to hardware implementation


Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

F must be expressed as a layered arithmetic circuit.

[Diagram: V sends x; P (thinking…) returns y; then a sum-check [LFKN90], followed by more sum-checks, round by round]

1. V sends inputs

2. P evaluates, returns output y

3. V constructs a polynomial relating y to the last layer’s input wires

4. V engages P in a sum-check (sketched below), gets a claim about the second-to-last layer

5. V iterates, gets a claim about the inputs, which it can check
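The sum-check protocol does the heavy lifting in steps 4 and 5. Here is a minimal, self-contained sketch (Python) of the [LFKN90] sum-check over F_p for a toy polynomial; it is illustrative only. The polynomial g is a stand-in, not a GKR layer polynomial, and an honest prover would compute its round messages far more cheaply than this brute-force summation.

```python
# Minimal sum-check sketch [LFKN90] over F_p, the core subroutine in
# GKR-style IPs. Illustrative only: g is a toy polynomial, and a real
# prover never evaluates its messages by brute-force summation.
import itertools, random

P = 2**61 - 1  # prime field modulus (2^61 - 1 is a Mersenne prime)

def g(x):
    # Toy 3-variate polynomial standing in for a GKR layer polynomial.
    return (x[0] * x[1] + 2 * x[1] * x[2] + x[0]) % P

N = 3  # number of variables

def partial_sum(prefix, t):
    """Honest prover's round message h_i(t): fix earlier variables to
    the verifier's challenges (prefix), set the current variable to t,
    and sum g over boolean settings of the remaining variables."""
    total = 0
    for tail in itertools.product((0, 1), repeat=N - len(prefix) - 1):
        total = (total + g(list(prefix) + [t] + list(tail))) % P
    return total

# Prover's overall claim: H = sum of g over the boolean cube {0,1}^3.
H = sum(g(list(x)) for x in itertools.product((0, 1), repeat=N)) % P

claim, challenges = H, []
for i in range(N):
    h = lambda t: partial_sum(challenges, t)   # prover's round message
    assert (h(0) + h(1)) % P == claim          # verifier's consistency check
    r = random.randrange(P)                    # verifier's random challenge
    claim = h(r)
    challenges.append(r)

# Final check: one evaluation of g at a random point. (In GKR, this very
# evaluation becomes a claim about the next layer down, chaining the
# sum-checks in step 5 above.)
assert g(challenges) % P == claim
print("sum-check accepted; H =", H)
```

Each round replaces a sum over one Boolean variable with an evaluation at a random field element; soundness rests on the fact that a lying prover must forge a low-degree univariate polynomial that survives the verifier’s check at a random point.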


4. V engages P in a sum-check, getsclaim about second-last layer

5. V iterates

, gets claim aboutinputs, which it can check

V Px

thinking...

y

thinking...

... sum-check[LFKN90]

more sum-checks

Page 44: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

1. V sends inputs

2. P evaluates, returns output y

3. V constructs polynomial relatingy to last layer’s input wires

4. V engages P in a sum-check, getsclaim about second-last layer

5. V iterates, gets claim aboutinputs, which it can check

V Px

thinking...

y

thinking...

... sum-check[LFKN90]

more sum-checks
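A minimal sketch of the verifier's side of one sum-check round (illustrative interfaces, not Zebra's; assumes a degree-2 round polynomial H_j reported by its values at 0, 1, 2, as in the per-layer computation later in the deck):

    import random

    p = 2**61 - 1  # illustrative prime modulus

    def eval_deg2(h0, h1, h2, r):
        """Evaluate the degree-2 poly through (0,h0), (1,h1), (2,h2) at r, mod p."""
        inv2 = pow(2, -1, p)                      # modular inverse of 2
        l0 = (r - 1) * (r - 2) % p * inv2 % p     # / (0-1)(0-2) = / 2
        l1 = r * (r - 2) % p * (p - 1) % p        # / (1-0)(1-2) = / (-1)
        l2 = r * (r - 1) % p * inv2 % p           # / (2-0)(2-1) = / 2
        return (h0 * l0 + h1 * l1 + h2 * l2) % p

    def sumcheck_verify(claim, num_rounds, prover_round):
        """prover_round(j, challenges) returns H_j(0), H_j(1), H_j(2)."""
        challenges = []
        for j in range(num_rounds):
            h0, h1, h2 = prover_round(j, challenges)
            if (h0 + h1) % p != claim % p:        # consistency with running claim
                raise ValueError(f"sum-check failed at round {j}")
            r = random.randrange(p)               # V's random coin
            claim = eval_deg2(h0, h1, h2, r)      # new claim: the value H_j(r)
            challenges.append(r)
        return claim, challenges                  # final claim checked against inputs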

Pages 45–47:

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

Soundness error ∝ 1/p

Cost to execute F directly: O(depth · width)

V's sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries)

P's sequential running time: O(depth · width · log width)
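To make the asymptotics concrete, an illustrative instance (numbers chosen for this example, not taken from the paper): a layered AC with depth d = 40 and width w = 2^10.

\[
  \underbrace{d \cdot w}_{\text{direct execution}} \approx 4.1\times10^{4},
  \qquad
  \underbrace{d \cdot w \log w}_{\mathcal{P}\text{'s time}} \approx 4.1\times10^{5},
  \qquad
  \underbrace{d \log w + |x| + |y|}_{\mathcal{V}\text{'s time}} \approx 400 + |x| + |y|.
\]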

Pages 48–51:

Extracting parallelism in Zebra

P executing the AC: layers are sequential, but all gates at a layer can be executed in parallel.

Proving step: can V and P interact about all of F's layers at once?

No. V must ask questions in order, or soundness is lost.

But: there is still parallelism to be extracted. . .

Pages 52–59:

Extracting parallelism in Zebra's P

V questions P about F(x1)'s output layer; simultaneously, P returns F(x2).

Next, V questions P about F(x1)'s next layer and F(x2)'s output layer; meanwhile, P returns F(x3).

This process continues until V and P interact about every layer simultaneously, but for different computations.

V and P can complete one proof in each time step.

Pages 60–61:

Extracting parallelism in Zebra's P with pipelining

[Diagram: input x enters a chain of sub-provers, one per layer (layer 0, layer 1, . . . , layer d−1), producing output y; each sub-prover exchanges queries and responses with V.]

This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged.

There are other opportunities to leverage the protocol's structure.
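A toy model of this schedule (illustrative Python, not Zebra's RTL): at time step t, the sub-prover for layer L works on instance t − L, so once the pipeline fills, one proof completes per step.

    def pipeline_schedule(d, steps):
        """Print which (instance, layer) pairs are active at each time step."""
        for t in range(steps):
            active = [(t - layer, layer) for layer in range(d) if t - layer >= 0]
            completed = max(t - d + 2, 0)   # instance i finishes at step i + d - 1
            print(f"step {t}: active={active}, proofs completed={completed}")

    pipeline_schedule(d=3, steps=6)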

Pages 62–72:

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

    H[k] = Σ_{g ∈ layer} δ(g, k)

In software (a runnable Python rendering appears below):

    // compute H[0], H[1], H[2]
    for k ∈ {0, 1, 2}:
        H[k] ← 0
        for g ∈ layer:
            H[k] ← H[k] + δ(g, k)    // δ uses state[g]

    // update lookup table with V's random coin
    for g ∈ layer:
        state[g] ← δ(g, r_j)

In hardware: one gate prover per gate evaluates δ(g, 0), δ(g, 1), δ(g, 2), and δ(g, r_j) in parallel; an adder tree sums the per-gate terms, and each gate prover holds state[g] in a local register instead of in RAM.
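A runnable rendering of the software pseudocode above (Python; delta and state are illustrative stand-ins for the protocol's actual per-gate polynomial and lookup table):

    p = 2**61 - 1  # illustrative prime modulus

    def sumcheck_round(layer, state, delta, r_j):
        """One round: compute H[0..2], then refresh state with V's coin r_j."""
        H = [0, 0, 0]
        for k in range(3):                       # compute H[0], H[1], H[2]
            for g in layer:
                H[k] = (H[k] + delta(g, k, state)) % p   # delta uses state[g]
        # update the lookup table with V's random coin
        new_state = {g: delta(g, r_j, state) for g in layer}
        return H, new_state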

Pages 73–75:

Zebra's design approach

✓ Extract parallelism
    e.g., pipelined proving
    e.g., parallel evaluation of δ by gate provers

✓ Exploit locality: distribute data and control
    e.g., no RAM: data is kept close to the places it is needed
    e.g., latency-insensitive design: localized control

✓ Reduce, reuse, recycle
    e.g., computation: save energy by adding memoization to P (see the sketch below)
    e.g., hardware: save chip area by reusing the same circuits
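A sketch of the memoization idea (illustrative; Zebra memoizes intermediate terms in hardware, not via a Python cache): many gate provers share equality-polynomial terms, so each distinct term can be computed once.

    from functools import lru_cache

    p = 2**61 - 1  # illustrative prime modulus

    @lru_cache(maxsize=None)
    def chi(bit, r):
        """chi(b, r) = r if b = 1, else 1 - r; shared by many gates."""
        return r % p if bit else (1 - r) % p

    def eq(bits, rs):
        """Multilinear equality polynomial, reusing cached chi terms."""
        acc = 1
        for b, r in zip(bits, rs):
            acc = acc * chi(b, r) % p
        return acc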

Pages 76–81:

Architectural challenges

Interaction between V and P requires a lot of bandwidth
✗ V and P on a circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [VSBW13]
✓ Zebra amortizes precomputations over many V-P pairs

Precomputations need secrecy, integrity
✗ Give V trusted storage? Cost would be prohibitive
✓ Zebra uses untrusted storage + authenticated encryption (sketched below)

[Diagram: rather than holding the i-th precomputation prei in trusted storage, V fetches an authenticated ciphertext Ek(prei) from untrusted storage, then interacts with P on input x, output y, and a proof that y = F(x).]
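A software sketch of that last point, using AES-GCM from the pyca/cryptography library (illustrative: the store/load helpers and index-binding choice are ours, not necessarily Zebra's exact scheme):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=128)   # provisioned inside trusted V
    aead = AESGCM(key)

    def store(i, pre_i):
        """Encrypt precomputation pre_i for untrusted storage."""
        nonce = os.urandom(12)
        # bind the blob to its index i so blobs cannot be swapped or replayed
        ct = aead.encrypt(nonce, pre_i, str(i).encode())
        return nonce, ct

    def load(i, nonce, ct):
        """Raises InvalidTag if untrusted storage tampered with the blob."""
        return aead.decrypt(nonce, ct, str(i).encode())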

Page 82:

Implementation

Zebra's implementation includes:

• a compiler that produces synthesizable Verilog for P
• two V implementations: hardware (Verilog) and software (C++)
• a library to generate V's precomputations
• Verilog simulator extensions to model software or hardware V's interactions with P

Pages 83–84:

. . . and it seemed to work really well! Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof!

But that's not a serious evaluation. . .

Pages 85–88:

Evaluation method

[Diagram: Zebra's V-P pair (input x, output y, proof that y = F(x)) versus a native implementation of F.]

Baseline: direct implementation of F in the same technology as V

Metrics: energy and chip size per throughput (discussed in paper)

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models

Charge for: V, P, and communication; retrieving and decrypting precomputations; PRNG; operator communicating with V

Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; 200 mm² max chip area; 150 W max total power

350 nm: 1997 (Pentium II); 7 nm: ≈ 2017 [TSMC]; ≈ 20-year gap between trusted and untrusted fab

Page 89:

Application #1: number theoretic transform

NTT: a Fourier transform over F_p

Widely used, e.g., in computer algebra
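A minimal recursive NTT sketch in Python (illustrative field and root of unity; the deck does not specify Zebra's parameters):

    p = 2**64 - 2**32 + 1   # an NTT-friendly prime ("Goldilocks")

    def ntt(a, omega):
        """NTT of a (length a power of 2) with primitive len(a)-th root omega."""
        n = len(a)
        if n == 1:
            return list(a)
        even = ntt(a[0::2], omega * omega % p)
        odd = ntt(a[1::2], omega * omega % p)
        out, w = [0] * n, 1
        for i in range(n // 2):
            t = w * odd[i] % p
            out[i] = (even[i] + t) % p
            out[i + n // 2] = (even[i] - t) % p
            w = w * omega % p
        return out

    n = 8
    omega = pow(7, (p - 1) // n, p)   # 7 generates this field's multiplicative group
    y = ntt(list(range(n)), omega)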

Page 90:

Application #1: number theoretic transform

[Plot: ratio of baseline energy to Zebra energy (log scale, 0.1–3; higher is better) versus log2(NTT size), from 6 to 13.]

Page 91:

Application #2: Curve25519 point multiplication

Curve25519: a commonly-used elliptic curve

Point multiplication: the primitive underlying, e.g., ECDH
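For reference, the software primitive being accelerated, via the pyca/cryptography library (usage sketch only; Zebra implements the point multiplication itself as an AC):

    from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

    alice = X25519PrivateKey.generate()
    bob = X25519PrivateKey.generate()

    # each exchange() performs one Curve25519 point multiplication
    shared_a = alice.exchange(bob.public_key())
    shared_b = bob.exchange(alice.public_key())
    assert shared_a == shared_b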

Page 92:

Application #2: Curve25519 point multiplication

[Plot: ratio of baseline energy to Zebra energy (log scale, 0.1–3; higher is better) versus the number of parallel Curve25519 point multiplications: 84, 170, 340, 682, 1147.]

Page 93:

A qualified success

Zebra: a hardware design that saves costs. . .

. . . sometimes.

Pages 94–95:

Summary of Zebra's applicability

Applies to IPs, but not arguments:
1. Computation F must have a layered, shallow, deterministic AC
2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
3. Amortizes precomputations over many instances

Common to essentially all built proof systems:
4. Computation F must be very large for V to save work
5. Computation F must be efficient as an arithmetic circuit

[Figure 5, reproduced from a CACM survey: prover overhead normalized to native execution cost for matrix multiplication (m=128) and PAM clustering (m=20, d=128) across Pepper, Ginger, Pinocchio, Zaatar, CMT, Allspice, TinyRAM, and Thaler; overheads range up to ~10^13 and are generally enormous.]

System                               | Amortization regime                     | Advice
Zebra                                | many V-P pairs                          | short
Allspice [VSBW13]                    | batch of instances of a particular F    | short
Bootstrapped SNARKs [BCTV14a, CTV15] | all computations                        | long
BCTV [BCTV14b]                       | all computations of the same length     | long
Pinocchio [PGHR13]                   | all future instances of a particular F  | long
Zaatar [SBVBPW13]                    | batch of instances of a particular F    | long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:

V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU
⇒ break-even point ≥ 16 × 10^6 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16 × 10^6 gates
⇒ breaking even requires > 1 CPU op per AC gate, e.g., computations over F_p rather than machine integers
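The break-even arithmetic behind those figures (our rough rendering of the slide's numbers):

\[
  6\,\mathrm{ms} \times 2.7\,\mathrm{GHz} \approx 1.6 \times 10^{7}\ \text{CPU ops},
\]

so V saves work only if executing F natively costs more than about 16 million operations; and since 32 GB of RAM caps the AC at about $16 \times 10^{6}$ gates, each gate must replace more than one native CPU op.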

Pages 96–99:

Arguments versus IPs, redux

Design principle       | IPs [GKR08, CMT12, VSBW13] | Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism    | ✓                          | ✓
Exploit locality       | ✓                          | ✗
Reduce, reuse, recycle | ✓                          | ✗

Argument protocols seem unfriendly to hardware:

P computes over the entire AC at once ⇒ needs RAM

P does crypto for every gate in the AC ⇒ needs special crypto circuits

. . . but we hope these issues are surmountable!

Page 100: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. By leveraging Pinocchio, Pantry has experimented with simple applications that hide the prover’s state from the verifier, but there is more work to be done here and other notions of privacy that are worth providing. For example, we can provide verifiability while concealing the program that is executed (by composing techniques from Pantry, Pinocchio, and TinyRAM). A speculative application is to produce versions of Bitcoin in which transactions can be conducted anonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this research area really had nothing to do with practice. For example, the PCP theorem is a landmark achievement of complexity theory, but if we were to implement the theory as proposed, generating the proofs, even for simple computations, would have taken longer than the age of the universe. In contrast, the projects described in this article have not only built systems from this theory but also performed experimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, is that these systems are basically toys. Part of the reason that we are willing to label them near-practical is painful experience with what the theory used to cost. (As a rough analogy, imagine a graduate student’s delight in discovering hexadecimal machine code after years spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that a remotely deployed machine is executing correctly. In the streaming context, the verifier may not have space to compute locally, so we could use CMT [21] to check that the outputs are correct, in concert with Thaler’s refinements [57] to make the prover truly inexpensive. Finally, data-parallel cloud computations (like MapReduce jobs) perfectly match the regimes in which the general-purpose schemes perform well: abundant CPU cycles for the prover and many instances of the same computation with different inputs.

Moreover, the gap separating the performance of the current research prototypes and plausible deployment in the cloud is a few orders of magnitude—which is certainly daunting, but, given the current pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead, the effects will go far beyond verifying cloud computations: we will have new ways of building systems. In any situation in which one module performs a task for another, the delegating module will be able to check the answers. This could apply at the micro level (if the CPU could check the results of the GPU, this would expose hardware errors) and the macro level (distributed systems could be built under very different trust assumptions).

But even if none of the above comes to pass, there are exciting intellectual currents here. Across computer systems, we are starting to see a new style of work: reducing sophisticated cryptography and other achievements of theoretical computer science to practice [32, 43, 47, 60]. These developments are likely a product of our times: the preoccupation with strong security of various kinds, and the computers powerful enough to run previously “paper-only” algorithms. Whatever the cause, proof-based verifiable computation is an excellent example of this tendency: not only does it compose theoretical refinements with systems techniques, it also raises research questions in other sub-disciplines of Computer Science. This cross-pollination is the best news of all.

Acknowledgments

We thank Srinath Setty both for his help with this article, including the experimental aspect, and for his deep influence on our understanding of this area. This article has also benefited from many productive conversations with Justin Thaler, whose patient explanations have been most helpful. This draft was improved by experimental assistance from Riad Wahby; by detailed feedback from Alexis Gallagher and the anonymous CACM reviewers; and by comments from Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai, Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish. This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System                                 Amortization regime                       Advice
Zebra                                  many V-P pairs                            short
Allspice [VSBW13]                      batch of instances of a particular F      short
Bootstrapped SNARKs [BCTV14a, CTV15]   all computations                          long
BCTV [BCTV14b]                         all computations of the same length       long
Pinocchio [PGHR13]                     all future instances of a particular F    long
Zaatar [SBVBPW13]                      batch of instances of a particular F      long

Exception: [CMT12] with logspace-uniform ACs
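To read the table concretely: if a system's setup costs S seconds, amortized over N instances with per-instance cost c, the verifier's effective cost is S/N + c. A tiny Python sketch of this arithmetic, with invented numbers:

# Illustrative arithmetic only (numbers invented for the example): a verifier
# pays setup cost S once per amortization unit and c per instance, so its
# effective per-instance cost is S/N + c.
def amortized_cost(setup_s, per_instance_s, n_instances):
    return setup_s / n_instances + per_instance_s

# A long-advice system with 100 s of setup and 10 ms per instance only
# approaches its per-instance cost once N is large:
for n in (1, 100, 10_000):
    print(n, round(amortized_cost(100.0, 0.010, n), 4), "s/instance")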

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:

V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16 × 10^6 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16 × 10^6 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g., computations over F_p rather than machine integers
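As a sanity check on these figures, a small Python sketch (our arithmetic, using only the numbers quoted above): at 2.7 GHz, the 6 ms fixed verification cost alone corresponds to roughly 16 × 10^6 CPU operations.

# Back-of-the-envelope check using only the numbers quoted above
# (2.7 GHz clock, 6 ms fixed verification cost, 3 us per I/O element).
CLOCK_HZ = 2.7e9
FIXED_S = 6e-3
PER_IO_S = 3e-6

def break_even_ops(io_elements):
    # Native CPU ops that F must cost before checking a proof beats rerunning F.
    return (FIXED_S + io_elements * PER_IO_S) * CLOCK_HZ

print(f"{break_even_ops(0):.2e}")     # 1.62e+07, i.e. >= 16 x 10^6 ops
print(f"{break_even_ops(1000):.2e}")  # 2.43e+07: grows with |x| + |y|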

Page 101: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 102: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 103: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 104: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 105: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 106: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages.For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be ableto co-design the language, computational model, and verification ma-chinery: many of the protocols naturally work with parallel computa-tions, and the current verification machinery is already amenable toa parallel implementation. Another systems question is to develop arealistic database application, which requires concurrency, relationalstructures, etc. More generally, an important test for this area—sofar unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. Byleveraging Pinocchio, Pantry has experimented with simple applica-tions that hide the prover’s state from the verifier, but there is morework to be done here and other notions of privacy that are worthproviding. For example, we can provide verifiability while conceal-ing the program that is executed (by composing techniques fromPantry, Pinocchio, and TinyRAM). A speculative application is toproduce versions of Bitcoin in which transactions can be conductedanonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this researcharea really had nothing to do with practice. For example, the PCPtheorem is a landmark achievement of complexity theory, but if wewere to implement the theory as proposed, generating the proofs,even for simple computations, would have taken longer than theage of the universe. In contrast, the projects described in this articlehave not only built systems from this theory but also performedexperimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, isthat these systems are basically toys. Part of the reason that we arewilling to label them near-practical is painful experience with whatthe theory used to cost. (As a rough analogy, imagine a graduatestudent’s delight in discovering hexadecimal machine code afteryears spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that aremotely deployed machine is executing correctly. In the streamingcontext, the verifier may not have space to compute locally, so wecould use CMT [21] to check that the outputs are correct, in concertwith Thaler’s refinements [57] to make the prover truly inexpensive.Finally, data parallel cloud computations (like MapReduce jobs)perfectly match the regimes in which the general-purpose schemesperform well: abundant CPU cycles for the prover and many in-stances of the same computation with different inputs.

Moreover, the gap separating the performance of the currentresearch prototypes and plausible deployment in the cloud is a feworders of magnitude—which is certainly daunting, but, given thecurrent pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead,the effects will go far beyond verifying cloud computations: we willhave new ways of building systems. In any situation in which onemodule performs a task for another, the delegating module will beable to check the answers. This could apply at the micro level (ifthe CPU could check the results of the GPU, this would exposehardware errors) and the macro level (distributed systems could bebuilt under very different trust assumptions).

But even if none of the above comes to pass, there are excitingintellectual currents here. Across computer systems, we are startingto see a new style of work: reducing sophisticated cryptographyand other achievements of theoretical computer science to prac-tice [32, 43, 47, 60]. These developments are likely a product ofour times: the preoccupation with strong security of various kinds,and the computers powerful enough to run previously “paper-only”algorithms. Whatever the cause, proof-based verifiable computationis an excellent example of this tendency: not only does it composetheoretical refinements with systems techniques, it also raises re-search questions in other sub-disciplines of Computer Science. Thiscross-pollination is the best news of all.

AcknowledgmentsWe thank Srinath Setty both for his help with this article, includingthe experimental aspect, and for his deep influence on our under-standing of this area. This article has also benefited from many pro-ductive conversations with Justin Thaler, whose patient explanationshave been most helpful. This draft was improved by experimen-tal assistance from Riad Wahby; by detailed feedback from AlexisGallagher and the anonymous CACM reviewers; and by commentsfrom Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai,Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish.This work was supported by AFOSR grant FA9550-10-1-0073; NSF

System Amortization regime Advice

Zebra many V-P pairs short

Allspice[VSBW13]

batch of instancesof a particular F

short

BootstrappedSNARKs[BCTV14a,CTV15]

all computations long

BCTV[BCTV14b]

all computationsof the same length

long

Pinocchio[PGHR13]

all future instancesof a particular F

long

Zaatar[SBVBPW13]

batch of instancesof a particular F

long

Exception: [CMT12] with logspace-uniform ACs

For example, libsnark [BCTV14b], a highly optimized implementa-tion of [GGPR13] and Pinocchio [PGHR13]:

V ’s work: 6 ms + (|x |+ |y |) · 3 µs on a 2.7 GHz CPU

⇒ break-even point ≥ 16× 106 CPU ops

With 32 GB RAM, libsnark handles ACs with ≤ 16× 106 gates

⇒ breaking even requires > 1 CPU op per AC gate, e.g.,computations over Fp rather than machine integers

Page 107: Verifiable ASICs: trustworthy hardware with untrusted ... · [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013] But

Summary of Zebra’s applicability

Applies to IPs, but not arguments

1. Computation F must have a layered, shallow, deterministic AC

2. Must have a wide gap between cutting-edge fab (for P)and trusted fab (for V)

3. Amortizes precomputations over many instances

4. Computation F must be very large for V to save work

5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

101

105

0

109

103

107

1011

wor

ker’

s co

st

norm

aliz

ed to

nat

ive

C

matrix multiplication (m=128) PAM clustering (m=20, d=128)

N/A

1013

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Pepp

erG

inge

r

Pin

occh

io

Zaa

tar

CM

T

nativ

e C

Alls

pice

Tin

yRA

M

Tha

ler

Figure 5—Prover overhead normalized to native execution cost for twocomputations. Prover overheads are generally enormous.

you’re sending Will Smith and Jeff Goldblum into space to saveEarth; then you care a lot more about correctness than costs (a calcu-lus that applies to ordinary satellites too). More prosaically, there isa scenario with an abundance of server CPU cycles, many instancesof the same computation to verify, and remotely stored inputs: data-parallel cloud computing. Verifiable MapReduce [15] is thereforean encouraging application.

OPEN QUESTIONS AND NEXT STEPSThe main issue in this area is performance, and the biggest problemis the prover’s overhead. The verifier’s costs are also quantitativelyhigher than ideal; qualitatively, we would ideally eliminate the veri-fier’s setup phase while retaining expressivity (such schemes existin theory [11] but have very high overheads, even in principle).

The computational model is also a critical area of focus. OnlyTinyRAM handles data-dependent loops, only Pantry handles re-motely stored inputs, and only TinyRAM and Pantry handle com-putations that work with RAM. Unfortunately, TinyRAM adds highoverhead to the circuit representation for every operation in the givencomputation; Pantry, on the other hand, adds even higher overheadto its constraint representation but only to operations that interactwith state. While improving the overhead of either representationwould be worthwhile, a more general research direction is to movebeyond the circuit and constraint model.

There are also questions in systems and programming languages. For instance, can we develop programming languages that are well-tailored to the circuit or constraint formalism? We might also be able to co-design the language, computational model, and verification machinery: many of the protocols naturally work with parallel computations, and the current verification machinery is already amenable to a parallel implementation. Another systems question is to develop a realistic database application, which requires concurrency, relational structures, etc. More generally, an important test for this area—so far unmet—is to run experiments at realistic scale.

Another interesting area of investigation concerns privacy. By leveraging Pinocchio, Pantry has experimented with simple applications that hide the prover’s state from the verifier, but there is more work to be done here and other notions of privacy that are worth providing. For example, we can provide verifiability while concealing the program that is executed (by composing techniques from Pantry, Pinocchio, and TinyRAM). A speculative application is to produce versions of Bitcoin in which transactions can be conducted anonymously, in contrast to the status quo [18, 22].

REFLECTIONS AND PREDICTIONS

It is worth recalling that the intellectual foundations of this research area really had nothing to do with practice. For example, the PCP theorem is a landmark achievement of complexity theory, but if we were to implement the theory as proposed, generating the proofs, even for simple computations, would have taken longer than the age of the universe. In contrast, the projects described in this article have not only built systems from this theory but also performed experimental evaluations that terminate before publication deadlines.

So that’s the encouraging news. The sobering news, of course, is that these systems are basically toys. Part of the reason that we are willing to label them near-practical is painful experience with what the theory used to cost. (As a rough analogy, imagine a graduate student’s delight in discovering hexadecimal machine code after years spent programming one-tape Turing machines.)

Still, these systems are arguably useful in some scenarios. In high-assurance contexts, we might be willing to pay a lot to know that a remotely deployed machine is executing correctly. In the streaming context, the verifier may not have space to compute locally, so we could use CMT [21] to check that the outputs are correct, in concert with Thaler’s refinements [57] to make the prover truly inexpensive. Finally, data-parallel cloud computations (like MapReduce jobs) perfectly match the regimes in which the general-purpose schemes perform well: abundant CPU cycles for the prover and many instances of the same computation with different inputs.

Moreover, the gap separating the performance of the current research prototypes and plausible deployment in the cloud is a few orders of magnitude—which is certainly daunting, but, given the current pace of improvement, might be bridged in a few years.

More speculatively, if the machinery becomes truly low overhead, the effects will go far beyond verifying cloud computations: we will have new ways of building systems. In any situation in which one module performs a task for another, the delegating module will be able to check the answers. This could apply at the micro level (if the CPU could check the results of the GPU, this would expose hardware errors) and the macro level (distributed systems could be built under very different trust assumptions).

But even if none of the above comes to pass, there are exciting intellectual currents here. Across computer systems, we are starting to see a new style of work: reducing sophisticated cryptography and other achievements of theoretical computer science to practice [32, 43, 47, 60]. These developments are likely a product of our times: the preoccupation with strong security of various kinds, and the computers powerful enough to run previously “paper-only” algorithms. Whatever the cause, proof-based verifiable computation is an excellent example of this tendency: not only does it compose theoretical refinements with systems techniques, it also raises research questions in other sub-disciplines of Computer Science. This cross-pollination is the best news of all.

Acknowledgments

We thank Srinath Setty both for his help with this article, including the experimental aspect, and for his deep influence on our understanding of this area. This article has also benefited from many productive conversations with Justin Thaler, whose patient explanations have been most helpful. This draft was improved by experimental assistance from Riad Wahby; by detailed feedback from Alexis Gallagher and the anonymous CACM reviewers; and by comments from Boaz Barak, William Blumberg, Oded Goldreich, Yuval Ishai, Guy Rothblum, Riad Wahby, Eleanor Walfish, and Mark Walfish. This work was supported by AFOSR grant FA9550-10-1-0073; NSF


Page 108

Recap

[Diagram: V sends input x to P; P returns output y along with a proof that y = F(x)]

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model

+ First hardware design for a probabilistic proof protocol

+ Improves performance compared to trusted baseline

– Improvement compared to the baseline is modest

– Applicability is limited:
    precomputations must be amortized
    computation needs to be “big enough”
    large gap between trusted and untrusted technology
    does not apply to all computations

Bottom line: Zebra is plausible—when it applies

https://www.pepper-project.org/
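
In code form, the recap diagram amounts to the interface below (hypothetical names; real systems add setup phases and protocol-specific proof formats):

```python
from typing import Any, Callable, Tuple

def prove(F: Callable[[Any], Any], x: Any) -> Tuple[Any, Any]:
    """Untrusted P: run F on input x, emit output y and a proof."""
    y = F(x)
    proof = None  # placeholder for a protocol-specific transcript
    return y, proof

def verify(x: Any, y: Any, proof: Any) -> bool:
    """Trusted V: accept y only if the proof convinces V that
    y = F(x), using far less work than running F itself."""
    raise NotImplementedError("protocol-specific checks go here")
```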
