CEC 450 Real-Time Systems – Lecture 6: Accounting for I/O Latency (Sam Siewert, September 29, 2020)



  • September 29, 2020 Sam Siewert

    CEC 450 Real-Time Systems

    Lecture 6 – Accounting for I/O Latency

  • Upcoming Events

    Quiz #2 – part of your preparation for Exam #1
    Exercise #3 – final exercise before Exam #1
    Study for and take Exam #1 on October 15th, anytime between 8am and 11:59pm


  • Exam #1

    Practice with Quiz #2, October 8th to October 13th, before class

    Thursday, October 15th, anytime that day – start during class time, with access to NUC systems or remotely with your own system, and finish up anytime later that day

    Exam is available from 11:30 am until 11:59pm

    Should take 1.25 hours, and at most 2.5 hours, to complete; however, you have the full day


  • Code and Timing Analysis Walkthroughs

    POSIX RT Threads – Simple Thread, Improved Thread

    POSIX RT Examples – Clock, Linux demo, Message Queues

    RT Clock

    Better, More Generic Sequencer [sequencer_generic]

    Synch

    Videos on Code and Timing Analysis (TA) – cec450/video/Coursera-rough-cuts/


    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_simplethread/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_thread_improved/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/POSIX-Examples/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/RT-Clock/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/sequencer_generic/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/example-sync-updated-2/pthread3.c
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/video/Coursera-rough-cuts/

  • A Service Release and Response – Ci WCET, Input/Output Latency, Interference Time

    [Timeline figure: Event Sensed → Input-Latency → Interrupt → Dispatch-Latency → Dispatch → Execution → Interference (Preemption, Dispatch) → Execution → Completion (I/O Queued) → Output-Latency → Actuation (I/O Completion)]

    Response Time = Time_Actuation – Time_Sensed (from Release to Response)
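
    The response-time definition above can be instrumented directly. Below is a minimal sketch (mine, not the course code) that timestamps the sensed event and the actuation completion with the POSIX CLOCK_MONOTONIC clock; the service body in between is a hypothetical placeholder:

    #include <stdio.h>
    #include <time.h>

    /* Elapsed time (b - a) in seconds. */
    static double dt_sec(struct timespec a, struct timespec b)
    {
        return (double)(b.tv_sec - a.tv_sec) +
               (double)(b.tv_nsec - a.tv_nsec) / 1.0e9;
    }

    int main(void)
    {
        struct timespec t_sensed, t_actuation;

        clock_gettime(CLOCK_MONOTONIC, &t_sensed);    /* event sensed (release)   */

        /* ... input latency, dispatch, execution, interference ... */

        clock_gettime(CLOCK_MONOTONIC, &t_actuation); /* I/O completion (response) */

        printf("Response time = %.9f sec\n", dt_sec(t_sensed, t_actuation));
        return 0;
    }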

  • Service Response Timeline (With Intermediate Blocking)

    [Timeline figure: Event Sensed → Input-Latency → Interrupt → Dispatch-Latency → Dispatch → Execution → BLOCKING → Yield → Dispatch → Execution → Interference (Preemption, Dispatch) → Execution → Completion (I/O Queued) → Output-Latency → Actuation (I/O Completion)]

    Response Time = Time_Actuation – Time_Sensed (from Event to Response)

  • Service Response Timeline - WCET (With Intermediate I/O – Impacts BCET to Cause WCET)

    [Timeline figure: as above, but Execution is interleaved with MMIO (Stalls) segments, stretching best-case execution time toward worst case]

    Response Time = Time_Actuation – Time_Sensed (from Event to Response)

  • Services and I/O

    Initial Input I/O from Sensors
    – Low-Rate Input: Byte or Word Input from Memory-Mapped Registers or I/O Ports (sketched below)
    – High-Rate Input: Block Input from DMA Channels and Block Reads from FIFOs

    Final Response I/O to Actuators
    – Low-Rate Output: Byte or Word Writes Posted to Output FIFO or Bus Interface Queue
    – High-Rate Output: Block Output on DMA Channels and Block Writes

    Intermediate I/O
    – During Service Execution, Memory-Mapped Register I/O
    – External Memory I/O
    – Cache Loads and Write-Backs
    – In-Memory Input/Output from Service to Service
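
    For the low-rate input case, a minimal sketch of a memory-mapped register read on Linux follows (an illustration of the technique, not course code; it requires root, many kernels restrict /dev/mem, and REG_PHYS_ADDR is a hypothetical register address):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REG_PHYS_ADDR 0xFE200000UL  /* hypothetical device register page */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map one page of physical I/O space into this process. */
        volatile uint32_t *reg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, REG_PHYS_ADDR);
        if (reg == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        uint32_t word = reg[0];  /* low-rate input: single word read; may stall the CPU */
        printf("register value = 0x%08x\n", word);

        munmap((void *)reg, 4096);
        close(fd);
        return 0;
    }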

  • Service Integration Concepts

    Initial Sensor Input and Final Actuator Response Output
    – Device Interface Drivers

    Simple Service Has No Intermediate I/O Latency, No Synchronization

    [Block diagram: Environment → Sensor Electronics → Sensor Input Interface → Digital Control Law Service → Actuator Output Interface → Actuator Electronics → Effectors → Environment]

  • Complex Multi-Service Systems

    Multiple Services
    Synchronization Between Services
    Communication Between Services
    Multiple Sensor Input and Actuator Output Interfaces
    Intermediate I/O, Shared Memory, Messaging

  • Multi-Service Pipelines

    [Pipeline diagram, HW/SW boundary with Serial A/B, ISA, and ethernet interfaces: a Bt878 PCI NTSC decoder (30 Hz) raises a frameRdy interrupt releasing tBtvid (30 Hz), which fills an RGB32 buffer; a grayscale buffer feeds tFrmDisp (10 Hz) for a Remote Display; tFrmLnk (2 Hz) and tNet (2 Hz) forward frames as TCP/IP pkts; RT Management services tOpnav (15 Hz), tTlmLnk (5 Hz), tThrustCtl (15 Hz), and tCamCtl (3 Hz) at priorities P0–P3 are driven by event releases / service control, separating the control plane from the data plane]

  • Pipelined Architecture Review

    Recall that Pipeline Yields CPI of 1 or Less
    Instruction Completed Each CPU Clock
    Unless Pipeline Stalls!

    [Figure: six overlapped instructions, each flowing through IF, ID, Execute, Write-Back stages, staggered one clock apart]

  • Service Execution

    Efficiency of Execution

    Memory Access for Inter-Service Communication

    Bounded Intermediate I/O

    Ideally Lock Data and Code into Cache or Use Tightly Coupled Memory [Scratch Pad, Zero Wait State]

    WCET = [(CPI_best-case × Longest_Path_Inst_Count) + Stall_Cycles] × Clk_Period
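
    A worked example with illustrative numbers (mine, not from the slides): a 50,000-instruction longest path at best-case CPI = 1, plus 10,000 stall cycles, on a 1 GHz core clock:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers, not from the lecture. */
        double cpi_best   = 1.0;      /* best-case CPI, no stalls          */
        double path_insts = 50000.0;  /* longest-path instruction count    */
        double stalls     = 10000.0;  /* pipeline stall cycles on the path */
        double clk_period = 1.0e-9;   /* 1 GHz core clock                  */

        double wcet = (cpi_best * path_insts + stalls) * clk_period;
        printf("WCET = %.1f usec\n", wcet * 1.0e6);  /* prints 60.0 usec */
        return 0;
    }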

  • Service Efficiency

    Path Length for a Release
    – Instruction Count
    – Longest Path Given Algorithm (Branches, Loop Iterations, Data Driven?)

    Path Execution
    – Number of Cycles to Complete Path
    – Clocks Per Instruction
    – Number of Stall Cycles for Pipeline: Data Dependencies (Intermediate IO), Cache Misses (Intermediate Memory Access), Branch Mis-predictions (Small Penalty)

  • Hiding Intermediate IO Latency (Overlapping CPU and Bus I/O)

    ICT = Instruction Count Time
    – (CPI=1) x Inst_Cnt x CPU_Clk_Period
    – Time to execute a block of instructions with no stalls, hence CPI=1
    – CPU Clk Cycles x CPU Clock Period

    IOT = Bus Interface IO Time
    – Bus IO Cycles x Bus Clock Period
    – Bus Clock Period / CPU_Clk_Period = Wait Clocks

    OR = Overlap Required
    – Percentage of CPU cycles that must be concurrent with I/O cycles

    NOA = Non-Overlap Allowable for Si to meet Di
    – Percentage of CPU cycles that can be in addition to IO cycle time without missing the service deadline

    Di = Deadline for Service Si relative to release
    – interrupt or system call initiates Si request and Si execution

    CPI = Clocks Per Instruction for a block of instructions
    – IPC (Instructions Per Clock) is also used; just the inverse (Superscalar, Pipelined)

  • Processing and I/O Time Overlap

    If Di < IOT, Si is IO-Bound

    If Di < ICT, Si is CPU-Bound

    If Di < (IOT + ICT) where (Di < IOT AND Di < ICT), deadline Di can't be met regardless of overlap; Si is both IO-Bound and CPU-Bound, so hardware re-design is REQUIRED

    If Di >= (IOT + ICT), no overlap of IOT with ICT is required – no code I/O optimization required

    If Di < (IOT + ICT) where (Di >= IOT and Di >= ICT), overlap of IOT with ICT is required – code I/O optimization required

    CPI_ICT = 1 by definition for ICT alone (no stalls, best-case efficiency)

    CPI_actual = f_efficiency x CPI_ICT
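
    These cases reduce to a simple feasibility check. The sketch below (illustrative, with hypothetical times) follows the conditions above, with the added observation that even full overlap still needs max(ICT, IOT) of elapsed time:

    #include <stdio.h>

    /* Classify a service per the overlap cases above; times in seconds. */
    static const char *classify(double Di, double ICT, double IOT)
    {
        if (Di < ICT && Di < IOT)
            return "CPU- and IO-bound: hardware re-design REQUIRED";
        if (Di < ICT || Di < IOT)   /* full overlap still needs max(ICT, IOT) */
            return "can't be met: bound by the larger of ICT and IOT";
        if (Di >= ICT + IOT)
            return "meets deadline with no overlap (no code I/O optimization)";
        return "meets deadline only with overlap (code I/O optimization required)";
    }

    int main(void)
    {
        /* Illustrative numbers, not from the lecture. */
        printf("%s\n", classify(0.010, 0.006, 0.005)); /* overlap required  */
        printf("%s\n", classify(0.012, 0.006, 0.005)); /* no overlap needed */
        printf("%s\n", classify(0.004, 0.006, 0.005)); /* redesign required */
        return 0;
    }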

  • Service Execution Efficiency – Waiting on I/O from MMIO Bus and Memory

    CPI_worst-case = (ICT + IOT) / ICT

    CPI_best-case = max(ICT, IOT) / ICT

    CPI_required = Di / ICT

    OR = 1 - [(Di - IOT) / ICT]

    CPI_required = [ICT(1 - OR) + IOT] / ICT

    NOA = (Di – IOT) / ICT;  OR + NOA = 1 (by definition)

    [Figure: bar diagrams showing ICT followed by IOT (no overlap), IOT fully overlapping ICT, and partial overlap with the OR fraction of ICT concurrent with IOT]
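
    Plugging illustrative numbers into these formulas (mine, not from the slides): with Di = 10 msec, ICT = 6 msec, and IOT = 5 msec, only one sixth of the CPU time must overlap the I/O time:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers, not from the lecture (times in msec). */
        double Di  = 10.0;  /* deadline relative to release       */
        double ICT = 6.0;   /* instruction count time at CPI = 1  */
        double IOT = 5.0;   /* bus interface I/O time             */

        double NOA = (Di - IOT) / ICT;   /* non-overlap allowable      */
        double OR  = 1.0 - NOA;          /* overlap required           */
        double cpi_req   = Di / ICT;     /* CPI budget over deadline   */
        double cpi_worst = (ICT + IOT) / ICT;

        printf("OR = %.3f, NOA = %.3f\n", OR, NOA);  /* 0.167, 0.833 */
        printf("CPI required = %.3f, CPI worst-case = %.3f\n",
               cpi_req, cpi_worst);                   /* 1.667, 1.833 */
        return 0;
    }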

  • CPI Units

    CPI_required = Di / ICT = Time/Time?
    – Di = Clocks x Clock_T = Time [N x X nanoseconds]
    – ICT = (CPI=1) x Num_instructions x Clock_T = Time [N x X nanoseconds]

    Di / ICT = (Clocks x Clock_T) / (1 x Num_instructions x Clock_T)
    – Clocks / Num_instructions = CPI_required over deadline Di

  • Synchronization and Message Passing

    Message Queues
    – Provide Communication
    – Provide Synchronization

    Traditional Message Queue
    – Message Size
    – Internal Fragmentation

    Heap Message Queue
    – Messages Are Pointers
    – Message Data in Heap

    [Figure: S1 sends Msg to S2 through a traditional queue; in the heap variant S1 sends a Ptr to S2, with allocation of the message from a Heap Buffer Pool and de-allocation back to the Heap]

    ENEA OSE
    • Message Passing RTOS
    • Advertised Advantages
    http://www.enea.com/solutions/rtos/ose/
    http://www.enea.com/Documents/Resources/Whitepapers/Advantages%20of%20Enea%20OSE%20-%20The%20Architectural%20Advantages%20of%20%20021061.pdf
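
    The traditional message queue pattern maps naturally onto POSIX message queues. The sketch below is a minimal single-process illustration (my assumption of a demo, not the linked course example code; the queue name /cec450_demo is arbitrary; link with -lrt):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    #define QNAME "/cec450_demo"  /* hypothetical queue name */

    int main(void)
    {
        /* Fixed-size messages: shorter payloads waste space (internal fragmentation). */
        struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
        mqd_t mq = mq_open(QNAME, O_CREAT | O_RDWR, 0666, &attr);
        if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

        /* S1: enqueue a message (send copies the payload into the queue). */
        const char *msg = "frame 1 ready";
        mq_send(mq, msg, strlen(msg) + 1, 0);

        /* S2: dequeue; the buffer must be at least mq_msgsize bytes. */
        char buf[64];
        if (mq_receive(mq, buf, sizeof(buf), NULL) >= 0)
            printf("received: %s\n", buf);

        mq_close(mq);
        mq_unlink(QNAME);
        return 0;
    }

    In the heap variant, the queued payload would instead be just a pointer into a shared buffer pool, avoiding both the copy and the internal fragmentation of fixed-size messages.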

  • What About Storage I/O?

    Much Higher Latency than Off-Chip Memory or Bus MMIO
    Milliseconds of Latency Compared to GHz Core Clock Rates – 1 million to 1!
    Flash Better, But Still Slower than SRAM or DRAM

    Option #1 – Avoid Disk Drives and Flash I/O

    Option #2 – Pre-Buffer Storage I/O (Read-Ahead Cache), Post Write-Backs (Write-Back Cache), MFU/MRU (Most Frequently Used / Most Recently Used) Kept in Read Cache

  • Parallel Processing Speed-up (Makes I/O Latency Worse?)

    Grid Data Processing Speed-up
    1. Multi-Core, Multi-Threaded, Macro-Blocks/Frames
    2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size)
    3. Co-Processor Operates in Parallel to CPU(s)

    SPMD – GPU or GP-GPU Co-Processor
    – PCI-Express Bus Interfaces
    – Transfer Program and Data to Co-Processor
    – Threads and Blocks to Transform Data Concurrently

    Image Data Processing – Few Data Dependencies
    – Good Speed-up by Amdahl's Law
    – P = Parallel Portion
    – (1-P) = Sequential Portion
    – S = # of Cores (Concurrency)
    – Overhead for Co-Processor
    – IO for Co-Processing

    Multicore_Speed_Up = 1 / ((1 - P) + P/S)

    Max_Speed_Up = 1 / (1 - P)   (S is infinite here, so P/S → 0)
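
    A quick check of these formulas in C (illustrative numbers): for P = 0.95, speed-up saturates at 20x no matter how many cores are added:

    #include <stdio.h>

    /* Amdahl's law: speed-up for parallel portion P on S cores. */
    static double speedup(double P, double S)
    {
        return 1.0 / ((1.0 - P) + P / S);
    }

    int main(void)
    {
        double P = 0.95;  /* 95% parallel portion */
        printf("S=4:   %.2fx\n", speedup(P, 4.0));   /* ~3.48x  */
        printf("S=32:  %.2fx\n", speedup(P, 32.0));  /* ~12.55x */
        printf("S=inf: %.2fx\n", 1.0 / (1.0 - P));   /* 20.00x  */
        return 0;
    }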

  • Amdahl’s Law – Infinite Cores

    Maximum Speed-up Driven by Sequential and Parallel Portions of Program
    – P = Parallel Portion
    – (1-P) = Sequential Portion
    – Speed-up for a Given Multi-Core Architecture is a Function of # of Cores (Speed-up in Parallel Portions)

    [Plot: "Amdahl's Law Max Speed-up (Any Number of Processor Cores)" – Algorithm Speed-Up (0–20) vs. Sequential Portion (% computation in sequential vs. parallel execution, 0–1); annotations: All Code Parallel (Infinite Speed-up), 95% Parallel (20x Speed-up), No Parallel Portion / All Sequential (No Speed-up)]

  • Multi-Core Speed-Up

    [Plot: "Amdahl's Law - Speed-up with # Cores and Parallel Portion" – Speed-up (0–20) vs. Sequential Portion of Algorithm (0–1); curves for Max Speed-up and 2, 4, 8, 12, and 32 cores; a 95% Parallel Program is marked]

  • Hiding Storage I/O Latency – Overlapping with Processing

    Simple Design – Each Thread has READ, PROCESS, WRITE-BACK Execution

    Frame rate is READ + PROCESS + WRITE latency – e.g., 10 fps for 100 milliseconds
    – If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK is 20 msec, the predominant time is I/O time, not processing
    – A disk drive with a 100 MB/sec READ rate can only read 16 fps – 62.5 msec READ latency (implying roughly 6.25 MB per frame)

    [Timeline: READ F(1) → Process F(1) → Write-back F(1) → READ F(2) → ...]

  • Hiding Storage I/O Latency – Schedule Multiple Overlapping Threads?

    Requires Nthreads = Nstages x Ncores
    1.5 to 2x the Number of Threads for SMT (Hyper-Threading)
    For IO Stage Duration Similar to Processing Time
    More Threads if IO Time (Read + WB + Read) >> 3 x Processing Time

    [Timeline per core – start-up, then continuous processing (same pattern on Core #1 and Core #2); a thread-per-stage sketch follows below:
    READ F1 | Process F1 | Write-back F1 | READ F4 | Process F4 | Write-back F4
    READ F2 | Process F2 | Write-back F2 | READ F5 | Process F5 | ...
    READ F3 | Process F3 | Write-back F3 | READ F6 | ...]
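
    A minimal sketch of the overlapped-thread idea using POSIX threads (illustrative, not the course code): three workers per core each loop over READ, PROCESS, WRITE-BACK, so one worker's I/O wait can overlap another's processing on the same core (build with -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3  /* Nstages x Ncores for a 3-stage pipeline on one core */
    #define NFRAMES  6

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_frame = 1;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int f = (next_frame <= NFRAMES) ? next_frame++ : 0;
            pthread_mutex_unlock(&lock);
            if (f == 0) break;

            /* The READ and WRITE-BACK stages would block on storage I/O here;
             * while this thread waits, another thread's PROCESS uses the core. */
            printf("READ F%d; PROCESS F%d; WRITE-BACK F%d\n", f, f, f);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }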

  • Hiding Latency – Dedicated I/O – Schedule Reads Ahead of Processing

    Requires Nthreads = 2 + Ncores
    Synchronize Frame Ready/Write-Backs
    Balance Stage Read/Write-Back Latency to Processing
    1.5 to 2x Threads for SMT (Hyper-Threading)

    [Timeline, dual-core concurrent processing to completion:
    Reader:     Read F1 | Read F2 | Read F3 | Read F4 | Read F5 | Read F6 | Read F7 | Read F8
    Core #1:    Wait | Process F1 | Process F3 | Process F5 | ...
    Core #2:    Wait | Process F2 | Process F4 | Process F6
    Write-back: Wait ... | WB F1 | WB F2 | WB F3 | WB F4 | WB F5 | WB F6]

  • Processing Latency Alone

    Write Code with Memory-Resident Frames
    – Load Frames in Advance
    – Process In-Memory Frames Over and Over
    – Do No IO During Processing
    – Provides Baseline Measurement of Processing Latency per Frame Alone
    – Provides a Method of Optimizing Processing Without IO Latency

  • I/O Latency Alone

    Comment Out Frame Transformation Code or Call a Stubbed NULL Function
    – Provides Measurement of IO Frame Rate Alone
    – Essentially Zero-Latency Transform
    – No Change Between Input Frames and Output Frames
    – Allows for Tuning of the IO Scheduler and Threading

  • Tips for IO Scheduling

    blockdev --getra /dev/sda
    – Should return 256 (sectors), meaning reads read-ahead up to 128K
    – Function calls – read, fread – should request as much as possible
    – Check "actual bytes read" and re-read as needed in a loop (see the sketch after this list)

    blockdev --setra 16384 /dev/sda (sets 8 MB read-ahead)

    Switch CFQ to Deadline
    – Use "lsscsi" to verify your disk is /dev/sda; substitute the block driver interface used for the file system if not sda
    – cat /sys/block/sda/queue/scheduler
    – echo deadline > /sys/block/sda/queue/scheduler
    – Options are noop, cfq, deadline
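
    The "check actual bytes read" advice corresponds to the standard short-read loop below (a common pattern, not course code; the helper name read_full is mine):

    #include <errno.h>
    #include <unistd.h>

    /* Read exactly 'count' bytes unless EOF or error: read() may return
     * fewer bytes than requested, so check the actual count and loop. */
    ssize_t read_full(int fd, void *buf, size_t count)
    {
        size_t done = 0;
        while (done < count) {
            ssize_t n = read(fd, (char *)buf + done, count - done);
            if (n < 0) {
                if (errno == EINTR) continue;  /* interrupted: retry */
                return -1;                     /* real error         */
            }
            if (n == 0) break;                 /* EOF                */
            done += (size_t)n;
        }
        return (ssize_t)done;
    }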


  • Outline

    CEC 450 Real-Time Systems · Upcoming Events · Exam #1 · Code and Timing Analysis Walkthroughs · A Service Release and Response · Service Response Timeline (With Intermediate Blocking) · Service Response Timeline - WCET (With Intermediate I/O – Impacts BCET to Cause WCET) · Services and I/O · Service Integration Concepts · Complex Multi-Service Systems · Multi-Service Pipelines · Pipelined Architecture Review · Service Execution · Service Efficiency · Hiding Intermediate IO Latency (Overlapping CPU and Bus I/O) · Processing and I/O Time Overlap · Service Execution Efficiency – Waiting on I/O from MMIO Bus and Memory · CPI Units · Synchronization and Message Passing · What About Storage I/O · Parallel Processing Speed-up (Makes I/O Latency Worse?) · Amdahl's Law – Infinite Cores · Multi-Core Speed-Up · Hiding Storage I/O Latency – Overlapping with Processing · Hiding Storage I/O Latency · Hiding Latency – Dedicated I/O · Processing Latency Alone · I/O Latency Alone · Tips for IO Scheduling