Runahead Execution:
An Alternative to Very Large Instruction Windows for Out-of-order
Processors
Onur Mutlu, The University of Texas at Austin
Jared Stark, Microprocessor Research, Intel Labs
Chris Wilkerson, Desktop Platforms Group, Intel Corp
Yale N. Patt, The University of Texas at Austin
Presented by: Mark Teper
Outline
- The Problem
- Related Work
- The Idea: Runahead Execution
- Details
- Results
- Issues
Brief Overview
Instruction window: the set of in-order instructions that have not yet been committed.
Scheduling window: the set of unexecuted instructions that have not yet been selected for execution.
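The two windows above can be illustrated with a minimal sketch (a hypothetical model, not code from the paper): the instruction window holds all in-flight instructions in program order, and commit is in order, so the oldest unfinished instruction blocks everything behind it.

```python
# Minimal sketch (assumed model): an in-order instruction window.
from collections import deque

class Window:
    def __init__(self, size):
        self.size = size
        self.slots = deque()            # program order, oldest first

    def full(self):
        return len(self.slots) >= self.size

    def insert(self, insn):
        assert not self.full()
        self.slots.append(insn)

    def commit_ready(self):
        # Commit is in order: only the oldest instruction may leave,
        # and only once it has finished executing.
        committed = []
        while self.slots and self.slots[0]["done"]:
            committed.append(self.slots.popleft())
        return committed

window = Window(4)
for i in range(4):
    window.insert({"pc": i, "done": i != 1})   # pc=1 is a long-latency load
print(window.full())                            # True: window is full
print([x["pc"] for x in window.commit_ready()]) # [0]: commit blocks at pc=1
```

The unfinished instruction at pc=1 stalls commit even though younger instructions have finished, which is exactly the problem the next slides describe.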
What can go wrong?
[Diagram: program flow feeding the instruction window, scheduling windows, and execution units]
The Problem
A long-running instruction (e.g., a load that misses to memory) cannot commit, so the instruction window fills up behind it and the processor stalls.
[Diagram: program flow through the instruction window, showing unexecuted, executing, long-running, and committed instructions]
Filling the Instruction Window
[Chart: larger instruction windows yield better IPC]
Related Work
Caches: alter the size and structure of caches; attempt to reduce unnecessary memory reads.
Prefetching: attempt to fetch data into a nearby cache before it is needed; both hardware and software techniques exist.
Other techniques: the waiting instruction buffer (WIB); long-latency block retirements.

Approximate access latencies:
- CPU to L1 cache: 1 cycle
- CPU to L2 cache: 10 cycles
- CPU to memory: 1000 cycles
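The latencies above show why even a large window struggles: a back-of-the-envelope calculation (window size and fill rate are illustrative assumptions, not figures from the paper) makes the gap concrete.

```python
# Back-of-the-envelope (illustrative numbers): cycles a full instruction
# window can hide versus the cost of one main-memory access.
MEM_LATENCY = 1000    # cycles for a memory access (from the slide)
WINDOW_SIZE = 128     # assumed instruction window size
FILL_RATE   = 1       # assumed instructions fetched per cycle

cycles_hidden = WINDOW_SIZE // FILL_RATE   # cycles until the window is full
stall_cycles  = MEM_LATENCY - cycles_hidden
print(stall_cycles)   # 872 cycles during which the processor does nothing
```

Under these assumptions, most of the miss latency is pure stall, which motivates either a much larger window or an alternative such as runahead.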
RunAhead Execution
Continue executing instructions during long stalls; disregard the results once the stalled data is available.
[Diagram: program flow through the instruction window, with runahead execution proceeding past the long-running instruction]
Benefits
- Acts as a high-accuracy prefetcher: software prefetchers have less information, and hardware prefetchers cannot analyze the code as well.
- Trains (biases) branch predictors ahead of time.
- Makes use of cycles that are otherwise wasted.
Entering RunAhead
- The processor can enter runahead mode at any point; the paper uses L2 cache misses as the trigger.
- The architecture must be able to checkpoint and restore register state, including the branch-history register and the return address stack.
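The checkpoint/restore requirement can be sketched as follows (a minimal model with assumed names, not the paper's hardware design): on entry, snapshot the architectural state; on exit, throw away everything runahead produced and restore the snapshot.

```python
# Hypothetical sketch of entering/exiting runahead mode: checkpoint the
# architectural state (registers, branch-history register, return address
# stack) on entry, and restore it on exit. All names are assumptions.
import copy

class CpuState:
    def __init__(self):
        self.regs = {f"r{i}": 0 for i in range(8)}
        self.branch_history = 0
        self.return_stack = []

class RunaheadCpu:
    def __init__(self):
        self.state = CpuState()
        self.checkpoint = None
        self.runahead = False

    def enter_runahead(self):
        self.checkpoint = copy.deepcopy(self.state)  # snapshot everything
        self.runahead = True

    def exit_runahead(self):
        self.state = self.checkpoint    # discard all runahead-mode results
        self.checkpoint = None
        self.runahead = False

cpu = RunaheadCpu()
cpu.state.regs["r1"] = 42
cpu.enter_runahead()
cpu.state.regs["r1"] = 7            # speculative update during runahead
cpu.exit_runahead()
print(cpu.state.regs["r1"])         # 42: checkpointed value restored
```

Real hardware would checkpoint with shadow register files rather than a deep copy, but the invariant is the same: no runahead-mode update survives the restore.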
Handling the Avoided Read
The load that triggers runahead returns immediately; its value is marked INV, and the processor continues fetching and executing instructions.

  ld  r1, [r2]     ; misses in L2 -> r1 marked INV
  add r3, r2, r2   ; both sources valid -> r3 valid
  add r3, r1, r2   ; r1 is INV -> r3 marked INV
  mov r1, 0        ; immediate write -> r1 valid again
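The INV-propagation rules in the snippet above can be sketched as a small dataflow pass (an assumed model for illustration, not the paper's implementation):

```python
# Sketch (assumed model): each register carries an INV bit during runahead.
# A load that misses sets INV; any op with an INV source produces an INV
# result; writing a known value (e.g. an immediate) clears INV.
def run(insns):
    inv = set()                       # registers currently marked INV
    for op, dst, *srcs in insns:
        if op == "ld_miss":           # L2 miss: result is bogus
            inv.add(dst)
        elif any(s in inv for s in srcs):
            inv.add(dst)              # INV propagates through dataflow
        else:
            inv.discard(dst)          # result is computable, so valid
    return inv

program = [
    ("ld_miss", "r1", "r2"),        # ld  r1, [r2] -> r1 INV
    ("add",     "r3", "r2", "r2"),  # r3 valid
    ("add",     "r3", "r1", "r2"),  # r1 INV -> r3 INV
    ("mov",     "r1"),              # mov r1, 0 -> r1 valid again
]
print(sorted(run(program)))         # ['r3']
```

Note that INV is not sticky per register: overwriting an INV register with a computable value makes it valid again, as the final `mov` shows.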
Executing Instructions in RunAhead
- Instructions are fetched and executed as normal.
- Instructions are pseudo-retired out of the instruction window in program order.
- If an instruction's source registers are INV, it can be retired without executing.
- No data is ever made observable outside the CPU.
Branches during RunAhead
Divergence points: an incorrect branch prediction on an INV value.
Does the branch depend on an INV value?
- Yes: assume the predictor is correct and continue execution.
- No: evaluate the branch. Was the branch predictor correct?
  - Yes: continue execution.
  - No: flush the instruction queue.
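The decision tree above can be written out directly (a sketch of the slide's flowchart; the function name and signature are assumptions):

```python
# Hypothetical sketch of the branch-handling decision tree during runahead.
def handle_branch(depends_on_inv, predicted_taken, actual_taken):
    """Return the action the runahead pipeline takes for one branch."""
    if depends_on_inv:
        # Outcome is unknowable: trust the predictor and keep going.
        return "follow prediction"
    if predicted_taken == actual_taken:
        return "continue"             # prediction verified against outcome
    return "flush"                    # mispredict: flush the instruction queue

print(handle_branch(True,  True,  False))  # follow prediction
print(handle_branch(False, True,  True))   # continue
print(handle_branch(False, True,  False))  # flush
```

A branch that depends on an INV value is a potential divergence point: if the predictor was wrong there, runahead silently wanders down the wrong path, which is why such branches cannot be verified.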
Exiting RunAhead
- Occurs when the stalling memory access finally returns.
- The checkpointed architectural state is restored, and all instructions in the machine are flushed.
- The processor starts fetching again at the instruction that caused runahead execution; the paper presents an optimization in which fetching restarts slightly before the stalled access returns.
Biasing Branch Predictors
RunAhead can cause a branch predictor to be trained twice on the same branch. Several alternatives:
(1) Always train the branch predictor.
(2) Never train the branch predictor.
(3) Keep a list of branches predicted during runahead.
(4) Use a separate branch predictor for runahead mode.
RunAhead Cache
RunAhead execution disregards stores: they must not produce externally observable results. However, store data is needed for store-to-load communication, as in the loop below. Solution: the runahead cache.

Loop:
  …
  store r1, [r2]     ; the later load must see this value
  add   r1, r3, r1
  store r1, [r4]
  load  r1, [r2]     ; reads the value stored above
  bne   r1, r5, Loop
Stores and Loads in RunAhead
Loads:
1. If the address is INV, the data is automatically INV.
2. Otherwise, look in: (1) the store buffer, (2) the runahead cache.
3. Finally, go to memory: on a cache hit, treat the data as valid; on a miss, treat it as INV and do not stall.
Stores:
1. Use the store buffer as usual.
2. On pseudo-retirement: if the address is INV, ignore the store; otherwise, write the data to the runahead cache.
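The load lookup order and store handling above can be sketched as a small model (class and method names are assumptions for illustration; real hardware uses tagged cache structures, not dictionaries):

```python
# Sketch of the slide's load/store rules in runahead mode (assumed model).
INV = object()   # sentinel for an invalid address or value

class RunaheadMemory:
    def __init__(self, memory):
        self.memory = memory          # backing memory (addr -> value)
        self.store_buffer = {}        # in-flight stores, as usual
        self.runahead_cache = {}      # pseudo-retired runahead stores

    def load(self, addr):
        if addr is INV:
            return INV                # INV address -> data automatically INV
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        if addr in self.runahead_cache:
            return self.runahead_cache[addr]
        # Go to memory: a hit is valid data; a miss is INV, never a stall.
        return self.memory.get(addr, INV)

    def store(self, addr, value):
        if addr is not INV:           # INV address: ignore the store
            self.store_buffer[addr] = value

    def pseudo_retire(self, addr):
        # On runahead pseudo-retirement, move the store into the runahead
        # cache so later runahead loads can still observe it.
        if addr in self.store_buffer:
            self.runahead_cache[addr] = self.store_buffer.pop(addr)

mem = RunaheadMemory({0x10: 5})
mem.store(0x20, 9)
print(mem.load(0x20))          # 9: forwarded from the store buffer
mem.pseudo_retire(0x20)
print(mem.load(0x20))          # 9: now served by the runahead cache
print(mem.load(0x10))          # 5: memory hit, valid data
print(mem.load(0x30) is INV)   # True: miss treated as INV, no stall
```

The key property is the last line: a load that would normally stall the pipeline instead returns INV immediately, so runahead keeps making forward progress.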
Run-Ahead Cache Results
Not passing data from stores to loads resulted in poor performance: a significant number of loads returned INV results.
[Chart: performance with and without the runahead cache]
Details: Architecture
Results
[Chart]
Results (2)
[Chart]
Issues
- Some wrong assumptions about future machines: the paper's future baseline corresponds poorly to modern architectures.
- Not many details on the architectural requirements of the technique, which increases hardware area and power requirements.