CEC 450 Real-Time Systems – Lecture 6: Accounting for I/O Latency (Sam Siewert, September 29, 2020)



  • September 29, 2020 Sam Siewert

    CEC 450 Real-Time Systems

    Lecture 6 – Accounting for I/O Latency

  • Upcoming Events

    Quiz #2 – part of your preparation for Exam #1
    Exercise #3 – final exercise before Exam #1
    Study for and take Exam #1 on October 15th, anytime between 8am and 11:59pm


  • Exam #1

    Practice with Quiz #2, October 8th to October 13th, before class

    Thursday, October 15th, anytime that day – start during class time, with access to NUC systems or remotely with your own system, and finish up anytime later that day

    Exam is available from 11:30 am until 11:59pm

    Should take 1.25 hours, and at most 2.5 hours, to complete; however, you have the full day


  • Code and Timing Analysis Walkthroughs

    POSIX RT Threads – Simple Thread, Improved Thread

    POSIX RT Examples – Clock, Linux demo, Message Queues

    RT Clock

    Better, More Generic Sequencer [sequencer_generic]

    Synch

    Videos on Code and Timing Analysis (TA) – cec450/video/Coursera-rough-cuts/


    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_simplethread/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_thread_improved/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/POSIX-Examples/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/RT-Clock/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/sequencer_generic/
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/example-sync-updated-2/pthread3.c
    http://mercury.pr.erau.edu/%7Esiewerts/cec450/video/Coursera-rough-cuts/

  • A Service Release and Response – Ci WCET, Input/Output Latency, Interference Time

    [Timeline figure: Event Sensed → Input-Latency → Interrupt → Dispatch-Latency → Dispatch → Execution → Interference (Preemption, Dispatch) → Execution → Completion (I/O Queued) → Output-Latency → Actuation (I/O Completion)]

    Response Time = Time_Actuation – Time_Sensed (from Release to Response)
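
    The response-time definition above can be instrumented directly. Below is a minimal sketch (mine, not the course code) that timestamps the sensed event and the actuation completion with the POSIX CLOCK_MONOTONIC clock; the service body in between is a hypothetical placeholder:

    #include <stdio.h>
    #include <time.h>

    /* Elapsed time (b - a) in seconds. */
    static double dt_sec(struct timespec a, struct timespec b)
    {
        return (double)(b.tv_sec - a.tv_sec) +
               (double)(b.tv_nsec - a.tv_nsec) / 1.0e9;
    }

    int main(void)
    {
        struct timespec t_sensed, t_actuation;

        clock_gettime(CLOCK_MONOTONIC, &t_sensed);    /* event sensed (release)   */

        /* ... input latency, dispatch, execution, interference ... */

        clock_gettime(CLOCK_MONOTONIC, &t_actuation); /* I/O completion (response) */

        printf("Response time = %.9f sec\n", dt_sec(t_sensed, t_actuation));
        return 0;
    }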

  • Service Response Timeline (With Intermediate Blocking)

    [Timeline figure: Event Sensed → Input-Latency → Interrupt → Dispatch-Latency → Dispatch → Execution → BLOCKING → Yield → Dispatch → Execution → Interference (Preemption, Dispatch) → Execution → Completion (I/O Queued) → Output-Latency → Actuation (I/O Completion)]

    Response Time = Time_Actuation – Time_Sensed (from Event to Response)

  • Service Response Timeline - WCET (With Intermediate I/O – Impacts BCET to Cause WCET)

    [Timeline figure: as above, but Execution is interleaved with MMIO (Stalls) segments, stretching best-case execution time toward worst case]

    Response Time = Time_Actuation – Time_Sensed (from Event to Response)

  • Services and I/O

    Initial Input I/O from Sensors
    – Low-Rate Input: Byte or Word Input from Memory-Mapped Registers or I/O Ports (sketched below)
    – High-Rate Input: Block Input from DMA Channels and Block Reads from FIFOs

    Final Response I/O to Actuators
    – Low-Rate Output: Byte or Word Writes Posted to Output FIFO or Bus Interface Queue
    – High-Rate Output: Block Output on DMA Channels and Block Writes

    Intermediate I/O
    – During Service Execution, Memory-Mapped Register I/O
    – External Memory I/O
    – Cache Loads and Write-Backs
    – In-Memory Input/Output from Service to Service
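
    For the low-rate input case, a minimal sketch of a memory-mapped register read on Linux follows (an illustration of the technique, not course code; it requires root, many kernels restrict /dev/mem, and REG_PHYS_ADDR is a hypothetical register address):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REG_PHYS_ADDR 0xFE200000UL  /* hypothetical device register page */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map one page of physical I/O space into this process. */
        volatile uint32_t *reg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, REG_PHYS_ADDR);
        if (reg == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        uint32_t word = reg[0];  /* low-rate input: single word read; may stall the CPU */
        printf("register value = 0x%08x\n", word);

        munmap((void *)reg, 4096);
        close(fd);
        return 0;
    }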

  • Service Integration Concepts

    Initial Sensor Input and Final Actuator Response Output
    – Device Interface Drivers

    Simple Service Has No Intermediate I/O Latency, No Synchronization

    [Block diagram: Environment → Sensor Electronics → Sensor Input Interface → Digital Control Law Service → Actuator Output Interface → Actuator Electronics → Effectors → Environment]

  • Complex Multi-Service Systems

    Multiple Services
    Synchronization Between Services
    Communication Between Services
    Multiple Sensor Input and Actuator Output Interfaces
    Intermediate I/O, Shared Memory, Messaging

  • Multi-Service Pipelines

    [Pipeline diagram, HW/SW boundary with Serial A/B, ISA, and ethernet interfaces: a Bt878 PCI NTSC decoder (30 Hz) raises a frameRdy interrupt releasing tBtvid (30 Hz), which fills an RGB32 buffer; a grayscale buffer feeds tFrmDisp (10 Hz) for a Remote Display; tFrmLnk (2 Hz) and tNet (2 Hz) forward frames as TCP/IP pkts; RT Management services tOpnav (15 Hz), tTlmLnk (5 Hz), tThrustCtl (15 Hz), and tCamCtl (3 Hz) at priorities P0–P3 are driven by event releases / service control, separating the control plane from the data plane]

  • Pipelined Architecture Review

    Recall that Pipeline Yields CPI of 1 or Less
    Instruction Completed Each CPU Clock
    Unless Pipeline Stalls!

    [Figure: six overlapped instructions, each flowing through IF, ID, Execute, Write-Back stages, staggered one clock apart]

  • Service Execution

    Efficiency of Execution

    Memory Access for Inter-Service Communication

    Bounded Intermediate I/O

    Ideally Lock Data and Code into Cache or Use Tightly Coupled Memory [Scratch Pad, Zero Wait State]

    WCET = [(CPI_best-case × Longest_Path_Inst_Count) + Stall_Cycles] × Clk_Period
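
    A worked example with illustrative numbers (mine, not from the slides): a 50,000-instruction longest path at best-case CPI = 1, plus 10,000 stall cycles, on a 1 GHz core clock:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers, not from the lecture. */
        double cpi_best   = 1.0;      /* best-case CPI, no stalls          */
        double path_insts = 50000.0;  /* longest-path instruction count    */
        double stalls     = 10000.0;  /* pipeline stall cycles on the path */
        double clk_period = 1.0e-9;   /* 1 GHz core clock                  */

        double wcet = (cpi_best * path_insts + stalls) * clk_period;
        printf("WCET = %.1f usec\n", wcet * 1.0e6);  /* prints 60.0 usec */
        return 0;
    }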

  • Service Efficiency

    Path Length for a Release
    – Instruction Count
    – Longest Path Given Algorithm (Branches, Loop Iterations, Data Driven?)

    Path Execution
    – Number of Cycles to Complete Path
    – Clocks Per Instruction
    – Number of Stall Cycles for Pipeline: Data Dependencies (Intermediate IO), Cache Misses (Intermediate Memory Access), Branch Mis-predictions (Small Penalty)

  • Hiding Intermediate IO Latency (Overlapping CPU and Bus I/O)

    ICT = Instruction Count Time
    – (CPI=1) x Inst_Cnt x CPU_Clk_Period
    – Time to execute a block of instructions with no stalls, hence CPI=1
    – CPU Clk Cycles x CPU Clock Period

    IOT = Bus Interface IO Time
    – Bus IO Cycles x Bus Clock Period
    – Bus Clock Period / CPU_Clk_Period = Wait Clocks

    OR = Overlap Required
    – Percentage of CPU cycles that must be concurrent with I/O cycles

    NOA = Non-Overlap Allowable for Si to meet Di
    – Percentage of CPU cycles that can be in addition to IO cycle time without missing the service deadline

    Di = Deadline for Service Si relative to release
    – interrupt or system call initiates Si request and Si execution

    CPI = Clocks Per Instruction for a block of instructions
    – IPC (Instructions Per Clock) is also used; just the inverse (Superscalar, Pipelined)

  • Processing and I/O Time Overlap

    If Di < IOT, Si is IO-Bound

    If Di < ICT, Si is CPU-Bound

    If Di < (IOT + ICT) where (Di < IOT AND Di < ICT), deadline Di can't be met regardless of overlap; Si is both IO-Bound and CPU-Bound, so hardware re-design is REQUIRED

    If Di >= (IOT + ICT), no overlap of IOT with ICT is required – no code I/O optimization required

    If Di < (IOT + ICT) where (Di >= IOT and Di >= ICT), overlap of IOT with ICT is required – code I/O optimization required

    CPI_ICT = 1 by definition for ICT alone (no stalls, best-case efficiency)

    CPI_actual = f_efficiency x CPI_ICT
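
    These cases reduce to a simple feasibility check. The sketch below (illustrative, with hypothetical times) follows the conditions above, with the added observation that even full overlap still needs max(ICT, IOT) of elapsed time:

    #include <stdio.h>

    /* Classify a service per the overlap cases above; times in seconds. */
    static const char *classify(double Di, double ICT, double IOT)
    {
        if (Di < ICT && Di < IOT)
            return "CPU- and IO-bound: hardware re-design REQUIRED";
        if (Di < ICT || Di < IOT)   /* full overlap still needs max(ICT, IOT) */
            return "can't be met: bound by the larger of ICT and IOT";
        if (Di >= ICT + IOT)
            return "meets deadline with no overlap (no code I/O optimization)";
        return "meets deadline only with overlap (code I/O optimization required)";
    }

    int main(void)
    {
        /* Illustrative numbers, not from the lecture. */
        printf("%s\n", classify(0.010, 0.006, 0.005)); /* overlap required  */
        printf("%s\n", classify(0.012, 0.006, 0.005)); /* no overlap needed */
        printf("%s\n", classify(0.004, 0.006, 0.005)); /* redesign required */
        return 0;
    }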

  • Service Execution Efficiency – Waiting on I/O from MMIO Bus and Memory

    CPI_worst-case = (ICT + IOT) / ICT

    CPI_best-case = max(ICT, IOT) / ICT

    CPI_required = Di / ICT

    OR = 1 - [(Di - IOT) / ICT]

    CPI_required = [ICT(1 - OR) + IOT] / ICT

    NOA = (Di – IOT) / ICT;  OR + NOA = 1 (by definition)

    [Figure: bar diagrams showing ICT followed by IOT (no overlap), IOT fully overlapping ICT, and partial overlap with the OR fraction of ICT concurrent with IOT]
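
    Plugging illustrative numbers into these formulas (mine, not from the slides): with Di = 10 msec, ICT = 6 msec, and IOT = 5 msec, only one sixth of the CPU time must overlap the I/O time:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers, not from the lecture (times in msec). */
        double Di  = 10.0;  /* deadline relative to release       */
        double ICT = 6.0;   /* instruction count time at CPI = 1  */
        double IOT = 5.0;   /* bus interface I/O time             */

        double NOA = (Di - IOT) / ICT;   /* non-overlap allowable      */
        double OR  = 1.0 - NOA;          /* overlap required           */
        double cpi_req   = Di / ICT;     /* CPI budget over deadline   */
        double cpi_worst = (ICT + IOT) / ICT;

        printf("OR = %.3f, NOA = %.3f\n", OR, NOA);  /* 0.167, 0.833 */
        printf("CPI required = %.3f, CPI worst-case = %.3f\n",
               cpi_req, cpi_worst);                   /* 1.667, 1.833 */
        return 0;
    }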

  • CPI Units

    CPI_required = Di / ICT = Time/Time?
    – Di = Clocks x Clock_T = Time [N x X nanoseconds]
    – ICT = (CPI=1) x Num_instructions x Clock_T = Time [N x X nanoseconds]

    Di / ICT = (Clocks x Clock_T) / (1 x Num_instructions x Clock_T)
    – Clocks / Num_instructions = CPI_required over deadline Di

  • Synchronization and Message Passing

    Message Queues
    – Provide Communication
    – Provide Synchronization

    Traditional Message Queue
    – Message Size
    – Internal Fragmentation

    Heap Message Queue
    – Messages Are Pointers
    – Message Data in Heap

    [Figure: S1 sends Msg to S2 through a traditional queue; in the heap variant S1 sends a Ptr to S2, with allocation of the message from a Heap Buffer Pool and de-allocation back to the Heap]

    ENEA OSE
    • Message Passing RTOS
    • Advertised Advantages
    http://www.enea.com/solutions/rtos/ose/
    http://www.enea.com/Documents/Resources/Whitepapers/Advantages%20of%20Enea%20OSE%20-%20The%20Architectural%20Advantages%20of%20%20021061.pdf
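
    The traditional message queue pattern maps naturally onto POSIX message queues. The sketch below is a minimal single-process illustration (my assumption of a demo, not the linked course example code; the queue name /cec450_demo is arbitrary; link with -lrt):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    #define QNAME "/cec450_demo"  /* hypothetical queue name */

    int main(void)
    {
        /* Fixed-size messages: shorter payloads waste space (internal fragmentation). */
        struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
        mqd_t mq = mq_open(QNAME, O_CREAT | O_RDWR, 0666, &attr);
        if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

        /* S1: enqueue a message (send copies the payload into the queue). */
        const char *msg = "frame 1 ready";
        mq_send(mq, msg, strlen(msg) + 1, 0);

        /* S2: dequeue; the buffer must be at least mq_msgsize bytes. */
        char buf[64];
        if (mq_receive(mq, buf, sizeof(buf), NULL) >= 0)
            printf("received: %s\n", buf);

        mq_close(mq);
        mq_unlink(QNAME);
        return 0;
    }

    In the heap variant, the queued payload would instead be just a pointer into a shared buffer pool, avoiding both the copy and the internal fragmentation of fixed-size messages.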

  • What About Storage I/O?

    Much Higher Latency than Off-Chip Memory or Bus MMIO
    Milliseconds of Latency Compared to GHz Core Clock Rates – 1 million to 1!
    Flash Better, But Still Slower than SRAM or DRAM

    Option #1 – Avoid Disk Drives and Flash I/O

    Option #2 – Pre-Buffer Storage I/O (Read-Ahead Cache), Post Write-Backs (Write-Back Cache), MFU/MRU (Most Frequently Used / Most Recently Used) Kept in Read Cache

  • Parallel Processing Speed-up (Makes I/O Latency Worse?)

    Grid Data Processing Speed-up
    1. Multi-Core, Multi-Threaded, Macro-Blocks/Frames
    2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size)
    3. Co-Processor Operates in Parallel to CPU(s)

    SPMD – GPU or GP-GPU Co-Processor
    – PCI-Express Bus Interfaces
    – Transfer Program and Data to Co-Processor
    – Threads and Blocks to Transform Data Concurrently

    Image Data Processing – Few Data Dependencies
    – Good Speed-up by Amdahl's Law
    – P = Parallel Portion
    – (1-P) = Sequential Portion
    – S = # of Cores (Concurrency)
    – Overhead for Co-Processor
    – IO for Co-Processing

    Multicore_Speed_Up = 1 / ((1 - P) + P/S)

    Max_Speed_Up = 1 / (1 - P)   (S is infinite here, so P/S → 0)
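
    A quick check of these formulas in C (illustrative numbers): for P = 0.95, speed-up saturates at 20x no matter how many cores are added:

    #include <stdio.h>

    /* Amdahl's law: speed-up for parallel portion P on S cores. */
    static double speedup(double P, double S)
    {
        return 1.0 / ((1.0 - P) + P / S);
    }

    int main(void)
    {
        double P = 0.95;  /* 95% parallel portion */
        printf("S=4:   %.2fx\n", speedup(P, 4.0));   /* ~3.48x  */
        printf("S=32:  %.2fx\n", speedup(P, 32.0));  /* ~12.55x */
        printf("S=inf: %.2fx\n", 1.0 / (1.0 - P));   /* 20.00x  */
        return 0;
    }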

  • Amdahl’s Law – Infinite Cores

    Maximum Speed-up Driven by Sequential and Parallel Portions of Program
    – P = Parallel Portion
    – (1-P) = Sequential Portion
    – Speed-up for a Given Multi-Core Architecture is a Function of # of Cores (Speed-up in Parallel Portions)

    [Plot: "Amdahl's Law Max Speed-up (Any Number of Processor Cores)" – Algorithm Speed-Up (0–20) vs. Sequential Portion (% computation in sequential vs. parallel execution, 0–1); annotations: All Code Parallel (Infinite Speed-up), 95% Parallel (20x Speed-up), No Parallel Portion / All Sequential (No Speed-up)]

  • Multi-Core Speed-Up

    [Plot: "Amdahl's Law - Speed-up with # Cores and Parallel Portion" – Speed-up (0–20) vs. Sequential Portion of Algorithm (0–1); curves for Max Speed-up and 2, 4, 8, 12, and 32 cores; a 95% Parallel Program is marked]

  • Hiding Storage I/O Latency – Overlapping with Processing

    Simple Design – Each Thread has READ, PROCESS, WRITE-BACK Execution

    Frame rate is READ + PROCESS + WRITE latency – e.g., 10 fps for 100 milliseconds
    – If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK is 20 msec, the predominant time is I/O time, not processing
    – A disk drive with a 100 MB/sec READ rate can only read 16 fps – 62.5 msec READ latency (implying roughly 6.25 MB per frame)

    [Timeline: READ F(1) → Process F(1) → Write-back F(1) → READ F(2) → ...]

  • Hiding Storage I/O Latency – Schedule Multiple Overlapping Threads?

    Requires Nthreads = Nstages x Ncores
    1.5 to 2x the Number of Threads for SMT (Hyper-Threading)
    For IO Stage Duration Similar to Processing Time
    More Threads if IO Time (Read + WB + Read) >> 3 x Processing Time

    [Timeline per core – start-up, then continuous processing (same pattern on Core #1 and Core #2); a thread-per-stage sketch follows below:
    READ F1 | Process F1 | Write-back F1 | READ F4 | Process F4 | Write-back F4
    READ F2 | Process F2 | Write-back F2 | READ F5 | Process F5 | ...
    READ F3 | Process F3 | Write-back F3 | READ F6 | ...]
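
    A minimal sketch of the overlapped-thread idea using POSIX threads (illustrative, not the course code): three workers per core each loop over READ, PROCESS, WRITE-BACK, so one worker's I/O wait can overlap another's processing on the same core (build with -lpthread):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3  /* Nstages x Ncores for a 3-stage pipeline on one core */
    #define NFRAMES  6

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_frame = 1;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int f = (next_frame <= NFRAMES) ? next_frame++ : 0;
            pthread_mutex_unlock(&lock);
            if (f == 0) break;

            /* The READ and WRITE-BACK stages would block on storage I/O here;
             * while this thread waits, another thread's PROCESS uses the core. */
            printf("READ F%d; PROCESS F%d; WRITE-BACK F%d\n", f, f, f);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }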

  • Hiding Latency – Dedicated I/O – Schedule Reads Ahead of Processing

    Requires Nthreads = 2 + Ncores
    Synchronize Frame Ready/Write-Backs
    Balance Stage Read/Write-Back Latency to Processing
    1.5 to 2x Threads for SMT (Hyper-Threading)

    [Timeline, dual-core concurrent processing to completion:
    Reader:     Read F1 | Read F2 | Read F3 | Read F4 | Read F5 | Read F6 | Read F7 | Read F8
    Core #1:    Wait | Process F1 | Process F3 | Process F5 | ...
    Core #2:    Wait | Process F2 | Process F4 | Process F6
    Write-back: Wait ... | WB F1 | WB F2 | WB F3 | WB F4 | WB F5 | WB F6]

  • Processing Latency Alone

    Write Code with Memory-Resident Frames
    – Load Frames in Advance
    – Process In-Memory Frames Over and Over
    – Do No IO During Processing
    – Provides Baseline Measurement of Processing Latency per Frame Alone
    – Provides a Method of Optimizing Processing Without IO Latency

  • I/O Latency Alone

    Comment Out Frame Transformation Code or Call a Stubbed NULL Function
    – Provides Measurement of IO Frame Rate Alone
    – Essentially Zero-Latency Transform
    – No Change Between Input Frames and Output Frames
    – Allows for Tuning of the IO Scheduler and Threading

  • Tips for IO Scheduling

    blockdev --getra /dev/sda
    – Should return 256 (sectors), meaning reads read-ahead up to 128K
    – Function calls – read, fread – should request as much as possible
    – Check "actual bytes read" and re-read as needed in a loop (see the sketch after this list)

    blockdev --setra 16384 /dev/sda (sets 8 MB read-ahead)

    Switch CFQ to Deadline
    – Use "lsscsi" to verify your disk is /dev/sda; substitute the block driver interface used for the file system if not sda
    – cat /sys/block/sda/queue/scheduler
    – echo deadline > /sys/block/sda/queue/scheduler
    – Options are noop, cfq, deadline
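
    The "check actual bytes read" advice corresponds to the standard short-read loop below (a common pattern, not course code; the helper name read_full is mine):

    #include <errno.h>
    #include <unistd.h>

    /* Read exactly 'count' bytes unless EOF or error: read() may return
     * fewer bytes than requested, so check the actual count and loop. */
    ssize_t read_full(int fd, void *buf, size_t count)
    {
        size_t done = 0;
        while (done < count) {
            ssize_t n = read(fd, (char *)buf + done, count - done);
            if (n < 0) {
                if (errno == EINTR) continue;  /* interrupted: retry */
                return -1;                     /* real error         */
            }
            if (n == 0) break;                 /* EOF                */
            done += (size_t)n;
        }
        return (ssize_t)done;
    }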


  • Outline

    CEC 450 Real-Time Systems · Upcoming Events · Exam #1 · Code and Timing Analysis Walkthroughs · A Service Release and Response · Service Response Timeline (With Intermediate Blocking) · Service Response Timeline - WCET (With Intermediate I/O – Impacts BCET to Cause WCET) · Services and I/O · Service Integration Concepts · Complex Multi-Service Systems · Multi-Service Pipelines · Pipelined Architecture Review · Service Execution · Service Efficiency · Hiding Intermediate IO Latency (Overlapping CPU and Bus I/O) · Processing and I/O Time Overlap · Service Execution Efficiency – Waiting on I/O from MMIO Bus and Memory · CPI Units · Synchronization and Message Passing · What About Storage I/O · Parallel Processing Speed-up (Makes I/O Latency Worse?) · Amdahl's Law – Infinite Cores · Multi-Core Speed-Up · Hiding Storage I/O Latency – Overlapping with Processing · Hiding Storage I/O Latency · Hiding Latency – Dedicated I/O · Processing Latency Alone · I/O Latency Alone · Tips for IO Scheduling