How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining
Matthias Függer1, Andreas Dielacher2 and Ulrich Schmid1
1 Vienna University of Technology, Embedded Computing Systems Group
{fuegger, s}@ecs.tuwien.ac.at
2 RUAG Space, Vienna
Outline
1. Fault-tolerant SoCs
2. Asynchronous fault-tolerant clock generation algorithm
3. Making it faster
4. Proving it correct
5. FPGA implementation
Making SoCs fault-tolerant
System Level Approach
• replication of functional units
• communication between units necessary to maintain consistency
• problems are analogous to those of replicated state machines in distributed systems!
Fault-tolerant SoC needs Common Time
• precision: at any time t, π(t) is bounded
• accuracy: l(Δ) < #ticks in any interval of length Δ < u(Δ)
[Figure: tick sequences tick(2) ... tick(5) of nodes p and q, illustrating π(t) = 2 and #ticks(Δ) = 3 within q's local clock domain.]
Common time eases (allows) solving problems of replica determinism (atomic broadcast).
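The two properties above can be made concrete with a small sketch. It assumes that π(t) is the difference between the numbers of ticks two nodes have generated by time t, and that accuracy is checked by counting the ticks of one node inside a window; the function names and the sample tick times are illustrative and not taken from the slides.

```python
from bisect import bisect_right


def ticks_by(tick_times, t):
    """Number of ticks generated up to and including time t (tick_times sorted)."""
    return bisect_right(tick_times, t)


def precision(p_ticks, q_ticks, t):
    """Assumed reading of pi(t): tick-count difference between p and q at time t."""
    return abs(ticks_by(p_ticks, t) - ticks_by(q_ticks, t))


def ticks_in_window(tick_times, start, length):
    """Number of ticks in the half-open window (start, start + length]."""
    return ticks_by(tick_times, start + length) - ticks_by(tick_times, start)


# Illustrative tick times (arbitrary time units), not measured data.
p = [0.0, 1.0, 2.0, 3.0, 4.0]
q = [0.5, 2.5, 3.5, 4.5]

print(precision(p, q, 2.2))          # p has ticked 3 times, q once
print(ticks_in_window(p, 0.5, 3.0))  # ticks of p in (0.5, 3.5]
```

Precision bounds the first quantity at every t; accuracy bounds the second from below and above for every window of a given length.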
Clocking a fault-tolerant SoC
classical synchronous SoC (single Oscillator, Clock Tree):
(-) single point of failure
(+) common time across chip (< 1 tick)
GALS (one Oscillator per functional unit):
(+) no single point of failure
(-) no common time across chip: synchronizer overhead & metastability
globally coordinated clock generation: DARTS (TG-Algs connected by a TG-Net):
(+) no single point of failure
(+) common time across chip (< small # of ticks)
[Figure: in all three variants, functional units Fu1, Fu2 and Fu3 are connected by a Data Bus.]
DARTS High-level Algorithm
(1) Initially:
(2)   send tick(1) to all; clock := 1;
(3) If received tick(m) from at least f+1 remote nodes and m > clock:
(4)   send tick(clock+1), …, tick(m) to all; clock := m;
(5) If received tick(m) from at least 2f+1 remote nodes and m >= clock:
(6)   send tick(m+1) to all; clock := m+1;
[Figure: example execution with n = 5, f = 1, showing ticks k and k+1.]
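The six-line algorithm can be exercised with a minimal event-driven sketch. It rests on several assumptions that are not in the slides: every message takes the same fixed delay, local processing takes no time, a faulty node is modelled as simply silent, and "received tick(m) from a node" is read as "the highest tick number seen from that node is at least m". The names `DartsNode` and `simulate` are hypothetical.

```python
import heapq


class DartsNode:
    def __init__(self, n, f, me):
        self.f = f
        self.me = me
        self.clock = 1          # line (2): tick(1) has been sent
        self.seen = [0] * n     # highest tick number received per sender

    def receive(self, sender, m):
        """Process tick(m) from sender; return the tick numbers to broadcast."""
        if sender == self.me:   # the rules only count remote nodes
            return []
        self.seen[sender] = max(self.seen[sender], m)
        ranked = sorted(self.seen, reverse=True)
        sent = []
        m_catchup = ranked[self.f]         # largest m seen from >= f+1 nodes
        if m_catchup > self.clock:         # lines (3)-(4)
            sent.extend(range(self.clock + 1, m_catchup + 1))
            self.clock = m_catchup
        m_advance = ranked[2 * self.f]     # largest m seen from >= 2f+1 nodes
        if m_advance >= self.clock:        # lines (5)-(6)
            self.clock = m_advance + 1
            sent.append(self.clock)
        return sent


def simulate(n=5, f=1, delay=1.0, until=5.0, silent=()):
    """Return the clocks of all non-silent nodes after time `until`."""
    nodes = [DartsNode(n, f, i) for i in range(n)]
    events = []                 # (arrival time, sequence no., sender, receiver, m)
    seq = 0

    def broadcast(now, sender, m):
        nonlocal seq
        if sender in silent:    # assumed fault model: a faulty node sends nothing
            return
        for receiver in range(n):
            if receiver != sender:
                heapq.heappush(events, (now + delay, seq, sender, receiver, m))
                seq += 1

    for i in range(n):
        broadcast(0.0, i, 1)    # lines (1)-(2)
    while events and events[0][0] <= until:
        now, _, sender, receiver, m = heapq.heappop(events)
        for tick in nodes[receiver].receive(sender, m):
            broadcast(now, receiver, tick)
    return [node.clock for i, node in enumerate(nodes) if i not in silent]


print(simulate(silent={4}))     # n = 5, f = 1, one silent node
```

Under these assumptions each correct node advances by exactly one tick per message delay, whether or not one node stays silent.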
DARTS Hardware Implementation
[Figure: block diagram of node p. Counter Modules 1 to n-1, each with Remote Inputs rrem and Local Inputs rloc, provide the m > clock and m >= clock status to the Threshold Modules (f+1 and 2f+1), which drive TickGen. Ports: remote clk_in, clk_out.]
Common time property proved in [EDCC06].
DARTS Performance
Performance
Obtained frequency: 1/Δ, depends on end-to-end delay Δ
[Figure: the lock-step case (Δ = 1): ticks k, k+1, k+2 are spaced Δ apart.]
Making DARTS faster: Pipelining
Pipelined Performance
Idea: Let tick k+X+1 depend on tick k.
Obtained frequency: (X+1)/Δ, maximum depends on local delays
[Figure: the lock-step case (Δ = 1) with X = 4: ticks k, k+X+1, k+2X+2 are spaced Δ apart, with X+1 ticks generated per Δ.]
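The frequency formulas 1/Δ and (X+1)/Δ can be checked with a short calculation. The sketch below uses the delay values quoted on the FPGA prototype slide (Δ about 125 ns, Δloc about 25 ns). The cap at 1/Δloc is an assumption of this sketch, one possible reading of "maximum depends on local delays", and the function name is hypothetical.

```python
def pipelined_frequency_mhz(delta_ns, x=0, delta_loc_ns=None):
    """Clock frequency in MHz for pipeline depth x (x = 0 is plain DARTS).

    delta_ns:     end-to-end delay in nanoseconds
    delta_loc_ns: optional local delay; ASSUMPTION: it caps the rate at 1/delta_loc
    """
    freq = (x + 1) * 1000 / delta_ns
    if delta_loc_ns is not None:
        freq = min(freq, 1000 / delta_loc_ns)
    return freq


print(pipelined_frequency_mhz(125))           # DARTS, X = 0
print(pipelined_frequency_mhz(125, x=4))      # pDARTS, X = 4
print(pipelined_frequency_mhz(125, x=9, delta_loc_ns=25))  # capped by the local delay
```

With these numbers, X = 4 is the depth at which (X+1)/Δ reaches the assumed local limit of 1/Δloc.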
Making DARTS faster: Algorithm Adaptations
DARTS:
(1) Initially:
(2)   send tick(1) to all; clock := 1;
(3) If received tick(m) from at least f+1 remote nodes and m > clock:
(4)   send tick(clock+1), …, tick(m) to all; clock := m;
(5) If received tick(m) from at least 2f+1 remote nodes and m >= clock:
(6)   send tick(m+1) to all; clock := m+1;
pDARTS:
(1) Initially:
(2)   send tick(1), ..., tick(X+1) to all; clock := X+1;
(3) If received tick(m) from at least f+1 remote nodes and m > clock:   [not changed]
(4)   send tick(clock+1), …, tick(m) to all; clock := m;   [not changed]
(5) If received tick(m) from at least 2f+1 remote nodes and m + X >= clock:   [allows sending k+X+1 based on k]
(6)   send tick(m+1) to all; clock := m+1;
Small change in algorithm
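The difference between the two variants lies in the guard of line (5). A minimal sketch of the two guards as predicates follows; only the guards are modelled, not the send actions, and the function names are hypothetical.

```python
def catch_up_enabled(m, clock):
    """Guard of line (3), identical in DARTS and pDARTS: m > clock."""
    return m > clock


def advance_enabled(m, clock, x=0):
    """Guard of line (5): m >= clock for DARTS (x = 0), m + X >= clock for pDARTS."""
    return m + x >= clock


# A node whose clock is 7 looks at tick(3), assumed received from 2f+1 remote nodes.
for x in (0, 4):
    print(f"X = {x}: advance guard holds: {advance_enabled(3, 7, x)}")
```

With X = 0 the pDARTS guard reduces to the DARTS guard; with X = 4 a node may run up to four ticks ahead of the tick it is reacting to.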
Is pDARTS correct?!
[Figure: example execution with n = 5, f = 1, showing ticks k-X and k+1.]
Easy to prove in classical systems (synchronous, Θ-model).
pDARTS Hardware Implementation
[Figure: block diagram of node p. Counter Modules 1 to n-1, each with Remote Inputs rrem and Local Inputs rloc, feed the Threshold Modules (f+1 and 2f+1), which drive TickGen. Ports: remote clk_in, clk_out.]
The Counter Modules provide the m > clock status and the m + X >= clock status.
pDARTS Hardware Implementation
[Figure: gate-level schematic of node p with Counter Modules 1 of 3f+1 to 3f+1 of 3f+1. Each Counter Module contains a Remote Pipe and a Local Pipe built from C elements (inputs Rremote,in and Rlocal,in), a Diff-Gate, and Pipe Compare Signal Generators for GEQ (outputs GEQe, GEQo) and GR (outputs GRe, GRo). Threshold Gates (≥ f+1 and ≥ 2f+1) feed Tick Generation, which outputs LocalClk.]
The Counter Modules provide the m > clock status and the m + X >= clock status.
Is pDARTS still correct?!
Correctness Proof
• High-level algorithm: yes (proof gap).
• Low-level pDARTS: has far more complex proofs than DARTS, and queuing effects inside the Counter Modules cannot be neglected.
  Hence: a formal framework tied to the hardware; therein prove it correct.
The formal Framework
Ingredients
• Classical models: step-based (state machines)
• Modules with signal ports
• Signal’s behavior specified by
  • event trace: (t,x) in S
  • status function: S(t) = x
  • counting function: #S(t) = k
• Basic/Compound modules
  • their behavior is specified by relations on the port behavior
[Figure: example module with input port I and output port O, delay [Δ-, Δ+], initially 0.]
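The three views of a signal can be sketched as follows. The sketch assumes that the status function returns the value of the latest event at or before t (the initial value before the first event) and that the counting function returns the number of events up to and including t; the class name `Signal` and the sample trace are illustrative only.

```python
from bisect import bisect_right


class Signal:
    def __init__(self, trace, initial=0):
        # Event trace: pairs (t, x), meaning the signal takes value x at time t.
        self.trace = sorted(trace)
        self.times = [t for t, _ in self.trace]
        self.initial = initial

    def status(self, t):
        """S(t): value of the signal at time t."""
        i = bisect_right(self.times, t)
        return self.trace[i - 1][1] if i else self.initial

    def count(self, t):
        """#S(t): number of events up to and including time t."""
        return bisect_right(self.times, t)


s = Signal([(1.0, 1), (2.0, 0), (3.5, 1)])
print(s.status(0.5), s.status(2.7), s.count(2.7))
```

A module specification then becomes a relation between such functions on its input and output ports.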
The formal Framework
Diff-Gate Module (Counting Function model)
When to remove tick k from both local and remote pipe:
For k = 0: If
(1) received tick 1 in remote pipe at t, and
(2) received tick 1 in local pipe at t’,
remove tick 0 from both pipes within max(t,t’) + [Δ-_Diff, Δ+_Diff]
For k > 0: If
(1) received tick k+1 in remote pipe at t,
(2) received tick k+1 in local pipe at t’, and
(3) removed tick k-1 at t’’,
remove tick k from both pipes within max(t,t’,t’’) + [Δ-_Diff, Δ+_Diff]
Active signal only if exactly 1 tick in local pipe
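The removal rule can be turned into a small interval computation. The sketch below propagates the earliest and latest possible removal time of each tick, on the assumption that the earliest removal of tick k follows from the earliest removal of tick k-1 and likewise for the latest; the function name, the list-based input format and the sample times are hypothetical.

```python
def removal_windows(remote_arrival, local_arrival, d_min, d_max):
    """Return [(earliest, latest)] removal times for ticks 0, 1, 2, ...

    remote_arrival[k], local_arrival[k]: time at which tick k entered the
    remote or local pipe. d_min, d_max: Diff-Gate delay bounds.
    """
    windows = []
    earliest_prev = latest_prev = float("-inf")   # no predecessor for k = 0
    usable = min(len(remote_arrival), len(local_arrival)) - 1
    for k in range(usable):
        # Conditions (1) and (2): tick k+1 is present in both pipes.
        ready = max(remote_arrival[k + 1], local_arrival[k + 1])
        # Condition (3): tick k-1 has been removed (vacuous for k = 0).
        earliest = max(ready, earliest_prev) + d_min
        latest = max(ready, latest_prev) + d_max
        windows.append((earliest, latest))
        earliest_prev, latest_prev = earliest, latest
    return windows


remote = [0, 10, 12, 30]   # illustrative arrival times
local = [0, 8, 13, 20]
print(removal_windows(remote, local, d_min=1, d_max=3))
```

In this example tick 1 is ready at time 13, exactly when tick 0 may still be in the pipe, so its latest removal time is set by condition (3) as much as by the arrivals.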
Proof Results
Precision
Accuracy: L(t2-t1) ≤ #ticks in any interval of length (t2-t1) ≤ U(t2-t1)
Bounded Queue Sizes
depends on
FPGA prototype implementation
X = 0 (conventional DARTS)
maximum of X = 4 (stabilizes)
APEX EP20K1000 FPGA
Slow Δ compared to Δloc: Δ about 125 ns, Δloc about 25 ns
Conclusions
• Replication to make SoCs fault-tolerant.
• Clocking a replicated state machine is non-trivial, but possible.
• Unfortunately: slow!
• Apply pipelining idea to make it faster.
• Formal analysis with hardware-inspired formal framework.
• Proved it correct & implemented FPGA prototype.
Spreading effect of Ticks
Tends to spread out ticks evenly after an initial phase.