Predictable Programming on a Precision Timed Architecture
Hiren D. Patel, UC Berkeley
Joint work with: Ben Lickly, Isaac Liu, Edward A. Lee - UC Berkeley
Sungjun Kim, Stephen A. Edwards - Columbia University
Patel, UC Berkeley, PRET 2
Edwards and Lee - Case for PRET
• 2007: Edwards and Lee made a case for precision timed computers (PRET machines)
– Predictability
– Repeatability
S. A. Edwards and E. A. Lee, The case for the precision timed (PRET) machine. In Proceedings of the 44th Annual Conference on Design Automation (San Diego, California, June 04 - 08, 2007). DAC '07. ACM, New York, NY, 264-265.
Edwards and Lee - Case for PRET
• Unpredictability
– Difficulty in determining timing behavior through analysis
• Non-repeatability
– Lack of guarantee that every execution yields the same timing behavior
• Brittleness
– Small changes have big effects on timing behavior
Brittleness
• Expensive affair
• Tight coupling of software and hardware
• Reliance on testing for validation
• Upgrading difficult
• Solution: stockpile
Source: www.skycontrol.net
But wait …
• Real-time scheduling
– Worst-case execution time
  • Detailed model of hardware
  • Large engineering effort
  • Valid for particular hardware models
– Interrupts, inter-process communication, locks …
• Bench testing
– Brittle
Sebastian Altmeyer, Christian Hümbert, Björn Lisper, and Reinhard Wilhelm. Parametric Timing Analysis for Complex Architectures. In Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'08), pages 367-376, Kaohsiung, Taiwan, August 2008. IEEE Computer Society.
Precise Timing and High Performance
Traditional | Alternative
Caches | Scratchpads
Deep out-of-order pipelines | Thread-interleaved pipelines
Function-only ISAs | ISAs with timing instructions
Function-only languages | Languages and programming models with timing
Best-effort communication | Fixed-latency communication
Time-sharing | Multiple independent processors
Outline
• Introduction
• Related Work
• PRET Machine
• Programming Example
• Future Work
• Conclusion
Related Work
• Java Optimized Processor
– Schoeberl et al. [2003]
• Timing instructions
– Ip and Edwards [2006]
• Reactive processors
– Von Hanxleden et al. [2005]
– Salcic et al. [2005]
• Virtual Simple Architecture
– Mueller et al. [2003]
Semantics of Timing Instructions
• Deadline instructions
– Denote the required execution time of a block
• When decoded
– Stall instruction if timer value is not 0
– Otherwise set timer value to new value
deadi $t0, 10
…
deadi $t0, 8
…
deadi $t0, 0
…
L0:
…
deadi $t0, 10
b L0
…
Regions of the listing, top to bottom: Straight Line Block 0, Straight Line Block 1, Loop Block
Tracing A Program Fragment
A: deadi $t0, 6
B: sethi %hi(0x3f800000), %g1
C: or %g1, 0x200, %g1
D: st %g1, [ %fp + -12 ]
E: deadi $t0, 8
F: …
$t0 value on successive cycles: 0, 6, 5, 4, 3, 2, 1, 0, 8
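The deadline semantics and the trace above can be reproduced with a small cycle-level model. The sketch below is in Python rather than the slide's assembly. It assumes the traced thread executes one instruction per step and that the timer decrements by one per step, which matches the $t0 sequence on the slide but abstracts away the real pipeline. The function name `trace_deadlines` and the tuple encoding of instructions are illustrative and not part of PRET.

```python
def trace_deadlines(program, timer=0):
    """Return the $t0 value observed at each step of a straight-line program.

    program: list of ("deadi", n) or ("op",) tuples (hypothetical encoding).
    Assumption: one step per cycle; the timer decrements by 1 each cycle
    until it reaches 0.
    """
    history = [timer]
    pc = 0
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "deadi":
            if timer != 0:
                # Stall: the deadline instruction replays; only the timer moves.
                timer -= 1
            else:
                # Timer expired: load the new deadline and move on.
                timer = instr[1]
                pc += 1
        else:
            # Ordinary instruction: executes while the timer counts down.
            timer = max(timer - 1, 0)
            pc += 1
        history.append(timer)
    return history


# Instructions A through E from the slide; B, C, D are ordinary operations.
fragment = [("deadi", 6), ("op",), ("op",), ("op",), ("deadi", 8)]
print(trace_deadlines(fragment))
```

In this model, instruction E stalls for three steps because the block started by A finished early; the block between A and E therefore always occupies the six timer ticks that A requested.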
Precision Timed Architecture
Thread-interleaved pipeline
Scratchpad memories
Time-triggered main memory access
Round-robin thread scheduling
Memory Hierarchy
• Clocks
– Main clock
– Derived clocks
• Instruction and data scratchpad memories
– 1 cycle access latency
• Main memory
– 16 MB size
– Latency of 50 ns
– Frequency: 250 MHz
  • ~13 cycles latency
[Figure: core with instruction and data scratchpad memories (SPM), DMA, and main memory]
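The "~13 cycles" figure follows from the two numbers above it. A minimal sketch of the arithmetic, assuming the 250 MHz frequency is the clock in which the latency is counted (the function name is illustrative):

```python
import math


def latency_in_cycles(latency_ns, clock_mhz):
    """Return (exact, rounded-up) latency in clock cycles."""
    cycles = latency_ns * clock_mhz / 1000.0  # ns * MHz = 1e-3 cycles
    return cycles, math.ceil(cycles)


exact, whole = latency_in_cycles(50, 250)
print(exact, whole)
```

A 50 ns access spans 12.5 periods of a 4 ns clock, so it has to be budgeted as 13 whole cycles.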
Thread-interleaved Pipeline
• Thread stalls
– Main memory access
– Multi-cycle operations
– Deadline instructions
• Replay mechanism
– Execute same PC next iteration
– Multi-cycle ALU ops replay instructions
[Figure: pipeline stages Fetch, Decode, Reg. Access, Execute, Memory, WriteBack, separated by pipeline registers F/D, D/R, R/E, E/M, M/W. Annotations: Decrement Deadline Timers; Stall if Deadline Instruction; Increment PC; Check main memory access.]
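The replay mechanism can be illustrated with a toy issue model. This is a sketch under stated assumptions, not the PRET simulator: threads are selected round-robin, one per cycle, and a stalled instruction is simply issued again at the same PC on the thread's next turn. The per-instruction "passes" encoding and all names are hypothetical.

```python
def interleave(costs_per_thread, n_cycles):
    """Round-robin issue log for a thread-interleaved pipeline.

    costs_per_thread[t][pc] is the number of pipeline passes that
    instruction pc of thread t needs (1 = no stall). Hypothetical encoding.
    Returns a list of (cycle, thread, pc) tuples.
    """
    n = len(costs_per_thread)
    pc = [0] * n
    passes_done = [0] * n
    log = []
    for cycle in range(n_cycles):
        t = cycle % n                      # round-robin thread selection
        if pc[t] >= len(costs_per_thread[t]):
            continue                       # thread finished; its slot is idle
        log.append((cycle, t, pc[t]))
        passes_done[t] += 1
        if passes_done[t] >= costs_per_thread[t][pc[t]]:
            pc[t] += 1                     # instruction completes, advance PC
            passes_done[t] = 0
        # Otherwise the same PC is replayed on this thread's next turn.
    return log


# Six threads; thread 0's second instruction needs three passes.
costs = [[1, 3, 1]] + [[1, 1, 1]] * 5
log = interleave(costs, 30)
print([(c, p) for c, t, p in log if t == 0])
print([(c, p) for c, t, p in log if t == 1])
```

The point of the design shows in the second print: thread 1 issues on cycles 1, 7, and 13 regardless of how long thread 0 stalls, because a stall only consumes the stalled thread's own slots.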
Time-Triggered Access through Memory Wheel
• Decouple thread’s access pattern
• Time-triggered access
• Best-case access time
– If accessed 1st cycle
• Worst-case access time
– If accessed 2nd cycle of window
[Figure: memory wheel with windows for thread0, thread1, thread2, thread3, thread4, thread5, then thread0 again. Labels: "90 cycles until thread0 completes"; "On time" for each of the five other windows.]
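The best and worst cases can be checked with a short calculation. The sketch below assumes each of the six threads owns one 13-cycle window per rotation (13 taken from the main memory latency on the Memory Hierarchy slide) and that an access may only start on the first cycle of the thread's own window. Both are assumptions; they are chosen because they reproduce the 90-cycle figure on the slide. The function name is illustrative.

```python
def wheel_latency(request_cycle, thread, window=13, n_threads=6):
    """Cycles from the request until the access completes.

    Assumptions: each thread owns one `window`-cycle slot per rotation,
    and an access must start on the first cycle of the thread's own slot.
    """
    rotation = window * n_threads
    slot_start = thread * window
    wait = (slot_start - request_cycle) % rotation  # cycles until the slot opens
    return wait + window


print(wheel_latency(0, thread=0))  # request on the 1st cycle of the window
print(wheel_latency(1, thread=0))  # request on the 2nd cycle of the window
```

Under these assumptions the first call gives 13 cycles and the second gives 90: the thread waits 77 cycles for its window to come around again and then spends 13 cycles on the access. The latency depends only on when the thread asks, never on what the other threads are doing.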
Tool Flow
• GCC 3.4.4, SystemC 2.2, Python 2.4
[Figure: tool flow. Labels: Boot code; C programs, timing instructions; GCC to compile boot code and program code; Motorola SREC files]
Simple Mutual Exclusion Example
• Producer followed by Consumer and Observer
– Consumer and Observer execute together
• Loop rate of two rotations of memory wheel
– 1st for Producer to write
– 2nd for Consumer and Observer to read
[Figure: timeline. Labels: Write to shared data; Read from shared data; Write to output]
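The schedule described above provides mutual exclusion through timing alone. A minimal timeline sketch, assuming a rotation length of 78 cycles (a hypothetical value) and that each actor acts at the start of its rotation; the event format and function name are illustrative:

```python
def schedule(iterations, rotation=78):
    """Return (time, actor, action, item) events for the producer/consumer loop.

    Assumptions: `rotation` is the length of one memory-wheel rotation in
    cycles (hypothetical value). The producer writes at the start of the
    first rotation of each period; the readers read at the start of the
    second rotation.
    """
    period = 2 * rotation
    events = []
    for k in range(iterations):
        base = k * period
        events.append((base, "producer", "write", k))
        events.append((base + rotation, "consumer", "read", k))
        events.append((base + rotation, "observer", "read", k))
    return events


events = schedule(3)
for event in events:
    print(event)
```

Every read of item k falls strictly between the write of item k and the write of item k+1, so no lock is needed as long as each actor meets its deadline.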
Video Game Example
[Figure: video game block diagram]
Threads: Main-Control Thread, Graphic Thread, VGA-Driver Thread
Queues: Odd Queue, Even Queue (Command)
Buffers: Even Buffer, Odd Buffer (Pixel Data)
Signals:
– Update Screen (Sync request)
– Swap (When sync requested and when Odd Queue empty)
– Sync (After queue swapped)
– Refresh (Sync request)
– Swap (When sync requested and when Vertical blank)
– Sync (After buffer swapped)
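The buffer swap rule in the diagram ("when sync requested and when vertical blank") is ordinary double buffering. A minimal sketch of that rule, with a hypothetical class and method names, and an arbitrary choice of which buffer starts in front:

```python
class DoubleBuffer:
    """Two buffers: the writer fills `back` while the reader scans `front`."""

    def __init__(self):
        # Assumption: the even buffer is displayed first.
        self.front, self.back = "even", "odd"
        self.sync_requested = False

    def request_sync(self):
        # The producing side asks for a swap once its frame is complete.
        self.sync_requested = True

    def try_swap(self, vertical_blank):
        """Swap only when a sync was requested and the display is in
        vertical blank. Returns True if the swap happened, which stands
        in for the 'Sync (After buffer swapped)' signal."""
        if self.sync_requested and vertical_blank:
            self.front, self.back = self.back, self.front
            self.sync_requested = False
            return True
        return False


db = DoubleBuffer()
db.request_sync()
print(db.try_swap(vertical_blank=False), db.front)
print(db.try_swap(vertical_blank=True), db.front)
```

Deferring the swap to vertical blank means the buffer being scanned out is never replaced in the middle of a frame.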
Timing Requirements
Signal | Timing Requirement | Pixel Cycles
V. Sync | 64 µs | 1611
V. Back-porch | 1.02 ms | 25679
Draw 480 lines | 15.25 ms |
V. Front-porch | 350 µs | 8811
H. Sync | 3.77 µs | 96
H. Back-porch | 1.89 µs | 48
Draw 640 pixels | 25.42 µs |
H. Front-porch | 0.64 µs | 16
Timing Implementation
• Pixel-clock using derived clock
– 25.175 MHz
– ~39.72 ns cycle period
• Drawing 16 pixels
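The pixel-clock numbers can be cross-checked against the timing requirements. The sketch below assumes that one scan line is the sum of the four horizontal entries in pixel cycles (sync, back-porch, 640 visible pixels, front-porch) and that drawing 640 pixels takes 640 pixel cycles; neither is stated explicitly on the slides. Names are illustrative.

```python
PIXEL_CLOCK_HZ = 25.175e6  # pixel clock frequency from the slide


def pixels_to_ns(pixels):
    """Duration of `pixels` pixel-clock cycles in nanoseconds."""
    return pixels * 1e9 / PIXEL_CLOCK_HZ


period_ns = pixels_to_ns(1)
line_pixels = 96 + 48 + 640 + 16   # H. sync + back-porch + visible + front-porch
draw_480_lines_ms = 480 * pixels_to_ns(line_pixels) / 1e6

print(round(period_ns, 2), "ns per pixel")
print(round(pixels_to_ns(16), 1), "ns for 16 pixels")
print(round(pixels_to_ns(640) / 1e3, 2), "us for 640 pixels")
print(round(draw_480_lines_ms, 2), "ms for 480 lines")
```

Under these assumptions the computed period, the 640-pixel draw time, and the 480-line draw time agree with the 39.72 ns, 25.42 µs, and 15.25 ms values on the slides.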
Future Work
• Architecture
– DMA
– DDR2 main memory model
– Thread synchronization primitives
– Shared data between threads
• Real-time Benchmarks
– With timing requirements
• Programming models
– Memory allocation schemes
– Synchronizations
Conclusion
• What we want …
– Time as a first class citizen of embedded computing
– Predictability
– Repeatability
• Where we are at …
– PRET cycle-accurate simulator
– Release …
Extras
More on Brittleness
• Small changes may have big effects on timing behavior
Theorem (Richard’s anomalies): If a task set with fixed priorities, execution times, and precedence constraints is optimally scheduled on a fixed number of processors, then increasing the number of processors, reducing execution times, or weakening precedence constraints can increase the schedule length.
Richard L. Graham, “Bounds on the performance of scheduling algorithms”, in E. G. Coffman, Jr.(ed.), Computer and Job-Shop Scheduling Theory, John Wiley, New York, 1975.
Richard’s Anomalies
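[Figure: precedence graph of tasks 1 to 9 and the resulting schedule. Task/execution time: T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9. Time axis labels: 0, 3, 12.]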
• 9 tasks, 3 processors, priority list, precedence order, execution times.
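Richard’s Anomalies: Reducing Execution Times
• eTime’ = eTime - 1
[Figure: precedence graph of tasks 1 to 9 and the resulting schedule. Task/execution time: T1/2, T2/1, T3/1, T4/1, T5/3, T6/3, T7/3, T8/3, T9/8. Time axis labels: 0, 3, 12.]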
Richard’s Anomalies: More Processors
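• 4 processors
[Figure: precedence graph of tasks 1 to 9 and the resulting schedule. Task/execution time: T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9. Time axis labels: 0, 3, 12, 15.]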
Richard’s Anomalies: Changing Priority List
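[Figure: precedence graph and the schedule under the changed priority list. Task/execution time: T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9. Time axis labels: 0, 3, 12.]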
• L = (T1,T2,T4,T5,T6,T3,T9,T7,T8)
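The anomalies on the preceding slides can be reproduced with a few lines of list scheduling. The precedence edges are not recoverable from the slide text; the sketch below assumes the edges of Graham's classic example (T1 before T9; T4 before T5, T6, T7, and T8), which yields the schedule lengths 12 and 15 shown on the slides. The scheduler is a generic non-preemptive list scheduler written for illustration, and its function name is hypothetical.

```python
def list_schedule(times, preds, priority, n_procs):
    """Schedule length under non-preemptive list scheduling.

    times: {task: execution time}; preds: {task: set of predecessors};
    priority: tasks in priority order; n_procs: number of processors.
    Whenever a processor is idle it takes the first ready task in the list.
    """
    finish = {}                 # completed task -> finish time
    running = {}                # task in flight -> finish time
    remaining = list(priority)
    now = 0
    while remaining or running:
        # Retire tasks that have finished by `now`.
        for task in [t for t, f in running.items() if f <= now]:
            finish[task] = running.pop(task)
        # Fill idle processors with ready tasks, in priority order.
        for task in list(remaining):
            if len(running) == n_procs:
                break
            if all(p in finish for p in preds.get(task, ())):
                running[task] = now + times[task]
                remaining.remove(task)
        if running:
            now = min(running.values())   # jump to the next completion
    return max(finish.values())


names = ["T%d" % i for i in range(1, 10)]
times = dict(zip(names, [3, 2, 2, 2, 4, 4, 4, 4, 9]))
# Assumed edges (Graham's classic example); not present in the slide text.
preds = {"T9": {"T1"}, "T5": {"T4"}, "T6": {"T4"}, "T7": {"T4"}, "T8": {"T4"}}

baseline = list_schedule(times, preds, names, 3)
shorter_tasks = list_schedule({t: c - 1 for t, c in times.items()}, preds, names, 3)
more_procs = list_schedule(times, preds, names, 4)
new_list = ["T1", "T2", "T4", "T5", "T6", "T3", "T9", "T7", "T8"]
reordered = list_schedule(times, preds, new_list, 3)

print(baseline, shorter_tasks, more_procs, reordered)  # prints "12 13 15 14"
```

Each change looks like an improvement or a neutral reordering, yet every variant finishes later than the 12-unit baseline.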
Brittleness Again…
• In general, all task scheduling strategies are brittle