
Energy Efficient Computing through Compiler Assisted Dynamic Specialization

Venkatraman Govindaraju

Advisor: Karthikeyan Sankaralingam

(Defense: 7/29/2014)

Why energy efficiency?

[Figure: transistors (in 100K), power (W), performance (GOPS), and efficiency (GOPS/W) plotted against year, 1985 to 2010, on a log scale from 0.001 to 100000. Annotations on the curves: "Moore’s Law is still valid", "Limited by heat", "Because of diminishing returns", "Performance stagnates".]

Simplified graph from “The Free Lunch Is Over”, Herb Sutter, DDJ, March 2005.

We must improve energy efficiency to scale performance.

Where is energy consumed?

[Figure: energy breakdown for the FabScalar and OpenSPARC processors.]

Actual execution consumes only a fraction of the energy.

Reduce overhead energy to improve overall energy efficiency.

Data is from “Power Balanced Pipelines”, Sartori et al., HPCA 2012.

How to get efficiency?

Use accelerators or specialization.

[Figure: design space with the axes "Efficiency" and "Generality, Compiler Effectiveness". Points shown: general purpose processor (GPP), SIMD, ASIC.]

Flexible as a GPP but with ASIC efficiency?

DySER: Compiler Assisted Hardware Specialization

[Figure: the three design goals: Efficiency, Design Complexity, Generality.]

• Efficiency
– Use specialized hardware for hot regions
• Generality
– Reconfigurable at run-time
– Use encodings generated at compile-time
• Design Complexity
– Decoupled Access/Execute
– Use the original core for uncommon tasks

Evolution of DySER

[Figure: efficiency versus compiler effectiveness. Points shown: general purpose processor, SIMD (SSE), ASIC, DySER, DySER + DLP, DySER + DLP + Slicer.]

• DySER [HPCA 2011]
– Dynamically specialized datapath
– DSL programming
• DySER + DLP [IEEE Micro, 2012]
– Exploits DLP and vectorization for high efficiency
– DSL programming
• DySER + DLP + Slicer [PACT 2013]
– AEPDG, a new IR to model DySER
– Compiles automatically and directly from C/C++ to DySER

What’s New?

• Architecture
– Preliminary Exam (8/12): Basic DySER ISA, Vector DySER ISA (preliminary)
– Defense (7/14): Vector DySER ISA, ISA for irregular workloads
• Compiler
– Preliminary Exam (8/12): Preliminary design, partial implementation
– Defense (7/14): Complete design, source code released
• Evaluation
– Preliminary Exam (8/12): Used high-level pipeline models (SPEC INT, PARSEC)
– Defense (7/14): Accurate simulator models (SPEC INT, PARSEC, throughput kernels, Parboil, database)
• Publications
– Preliminary Exam (8/12): Architecture (HPCA 2011), Prototype (HPCA 2012)
– Defense (7/14): DySER+DLP (IEEE Micro 2012), Compiler (PACT 2013), Integration (HotChips 2012), Modeling (Micro 2014, in submission)

Outline

• Introduction
• DySER: Architecture
• Intermediate Representation: Access/Execute PDG
• Slicer: Compiler
• Evaluation and Results
• Conclusion

DySER Overview

DySER
• Circuit-switched array of functional units
• Integrated into the processor pipeline
• Dynamically creates specialized datapaths

[Figure: processor pipeline (Fetch, Decode, Execute, Memory, WriteBack) with I$, D$, register file, decode, and execution units, and DySER integrated into it.]

DySER Datapath

DySER Configuration

• Use the same network for configuration bits
• Configure once, reuse many times

DySER Execution Model: Decoupled Access/Execute Model

• Memory access instructions execute in the processor pipeline
– Address calculation, loads, and stores
– Configure DySER
– Send data to DySER
– Receive data from DySER
– Loop control
• Computation executes in DySER

[Figure: the processor instruction stream (Config, send/receive instructions, JMP LOOP) next to the computation graph of ×, -, and + operations that runs on DySER.]

Execution Example

[Figure: DySER array of functional units (FU) and switches (S), with input FIFO ports IP0 to IP3, output FIFO ports OP0 to OP3, and the configuration path.]

DySER Program

//Vector Dot Product
DyINIT (0xABCD)
DyINIT (0xEF00)
…
SUM = [0,0];
for (int i = 0; i < LEN; i += 2) {
  DySend_Vec(SUM, IP0);
  DyLoad_Vec(a[i:i+1], IP1);
  DyLoad_Vec(b[i:i+1], IP2);
  DyRecv_Vec(OP2, SUM);
}
sum = accum(SUM);
//(last iteration here)
return sum;
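The dot-product program above can be mimicked in plain C to make the division of labor concrete. This is only a sketch under stated assumptions: `dyser_mul_acc` is a hypothetical software stand-in for the configured datapath (it is not part of any DySER toolchain), the vector width of 2 mirrors the `i += 2` stride in the slide, and `LEN` is assumed to be even.

```c
#include <assert.h>
#include <stddef.h>

/* Two-lane value, mirroring SUM = [0,0] in the slide. */
typedef struct { float lane[2]; } vec2;

/* Hypothetical stand-in for one pass through the configured datapath:
   consumes the running sums (IP0) and two input pairs (IP1, IP2),
   and produces the updated sums (OP2). */
static vec2 dyser_mul_acc(vec2 sum, const float *a, const float *b) {
    vec2 out;
    for (int k = 0; k < 2; ++k)
        out.lane[k] = sum.lane[k] + a[k] * b[k];
    return out;
}

/* The "access" side: loop control, address calculation, and loads stay
   on the core; only the multiply-accumulate is handed to the datapath. */
float dot_product(const float *a, const float *b, size_t len) {
    assert(len % 2 == 0);              /* assumption: no leftover iteration */
    vec2 sum = { {0.0f, 0.0f} };
    for (size_t i = 0; i < len; i += 2)
        sum = dyser_mul_acc(sum, &a[i], &b[i]);
    return sum.lane[0] + sum.lane[1];  /* accum(SUM) in the slide */
}
```

The final reduction of the two lanes happens on the core, which matches the `accum(SUM)` step after the loop in the slide.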

Execution Example (continued)

[Figure: the same array after configuration. One functional unit is configured as × and one as +; the remaining elements are switches (S). The DySER program is the same as on the previous slide.]

Why does it work?

• Applications execute in phases
• Applications follow the 90-10 rule
– 10% of the code regions contribute 90% of the run time
• Creating specialization for such code regions amortizes the overheads

Where does performance come from?

• Removing instructions from the main pipeline
– Less use of the instruction queue, ROB, and register file
– Effectively larger instruction window
• Decoupled execution
– Concurrency between the main processor and DySER
– Many FUs -> high potential ILP
• Benefits of vectorization
– Fewer memory access instructions
– Explicit pipelining of DySER

Energy Savings?

• Eliminates per-instruction overheads
– No fetch, decode, etc.
– No expensive register reads, etc.
• High performance itself leads to energy savings
– No additional power-hungry structures


Compiler Intermediate Representation

Makes it easier to optimize for the target architecture.

A suitable IR should
– Model the architecture, accurately if possible
– Capture the dependencies between the operations
– Generate code for the architecture with ease

DySER Architecture: Configurable Datapath

• Configure switches and functional units to create different datapaths
• Can specialize the datapath
– For ILP
– For DLP
• Allows acceleration of a variety of computation patterns

[Figure: three example configurations of functional units and switches (S): Mul-Accumulate (× and + units), Sum of Abs. Differences (-, &, >, ?:, and + units), and 3x3 Convolution (× and + units).]

Compiler IR for DySER: Modeling Configurable Datapath

• Graph based
– Nodes represent the operations/instructions
– Edges represent dependences between the operations
• Easier to map computation to DySER

for (i = 0; i < N; ++i) C[i] += A[i] * B[i]

[Figure: dependence graph of the loop body (LD, LD, ×, +2, ST) and its computation part (in1, in2, ×, +2, out) mapped onto the DySER array of functional units and switches.]

DySER Architecture: Control Flow Mapping

• Predication
– Predicates the output
– A metabit in the datapath propagates the validity of the data
• “Select” functional unit (PHI function)
– Selects the valid input and forwards it as its output

[Figure: DySER array configured with >, +, -, and φ units, and a PHI unit with inputs in0 and in1, a predicate input, valid (V) bits, and output Out, shown forwarding in1.]

Native control flow mapping allows accelerating code with arbitrary control flow.

Compiler IR for DySER: Modeling Control Flow Mapping

• Special edges represent control dependence
• A special node models the PHI instruction

for (i = 0; i < N; ++i):
  if b[i] < 0:
    a = b[i] + 5;
  else:
    a = b[i] - 5;
  b[i] = a;

[Figure: graph with the nodes b+i, LD, <, +, -, φ, and ST.]
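A small C sketch may help show how the branch in the loop above turns into predication plus a PHI (select) node. The `token` type, `predicated`, and `phi` are hypothetical names introduced only for this illustration; the valid flag models the metabit described on the control flow mapping slide, and the real hardware encoding is not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A data value together with its validity metabit. */
typedef struct { int value; bool valid; } token;

/* Predication: the output is valid only when the predicate holds. */
static token predicated(int value, bool pred) {
    token t = { value, pred };
    return t;
}

/* PHI ("select") node: forwards whichever input is valid. */
static token phi(token in0, token in1) {
    assert(in0.valid != in1.valid);   /* exactly one path was taken */
    return in0.valid ? in0 : in1;
}

/* The loop from the slide with both branch arms computed and the
   result selected by the PHI node instead of a jump. */
void add_or_sub_5(int *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        bool neg = b[i] < 0;                         /* the "<" node */
        token t_then = predicated(b[i] + 5, neg);    /* the "+" node */
        token t_else = predicated(b[i] - 5, !neg);   /* the "-" node */
        b[i] = phi(t_then, t_else).value;            /* φ, then ST   */
    }
}
```

Both arms are evaluated on every iteration, which is the trade that lets the region run without a data-dependent branch in the core.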

DySER Architecture: Decoupled Access/Execute Execution

[Figure: DySER array of switches (S) with input FIFO ports IP0 to IP3 and output FIFO ports OP0 to OP3.]

• Processor sends data to DySER through its input FIFOs (input ports)
• DySER computes in dataflow fashion
• Processor receives data from DySER through its output FIFOs (output ports)
• Allows DySER to consume data in a different order than it is stored

Compiler IR for DySER: Modeling Decoupled Access/Execute Execution

Explicitly partitioned into an Access PDG and an Execute PDG.

for (i = 0; i < N; ++i):
  if b[i] < 0:
    a = b[i] + 5;
  else:
    a = b[i] - 5;
  b[i] = a;

[Figure: the full graph (b+i, LD, <, +, -, φ, ST) and, separated out, the Execute PDG (<, +, -, φ).]

DySER Architecture: Flexible Vector Interface

struct vec { float x, y, z; float q; }
vec A[], B[];
float *a = A, *b = B;
float dot[];
for (int i = 0; i < LEN; i += 1) {
  dot[i] = A[i].x*B[i].x + A[i].y*B[i].y + A[i].z*B[i].z;
}

[Figure: dataflow graph with three × nodes feeding two + nodes. Inputs a[i], a[i+1], a[i+2] and b[i+0], b[i+1], b[i+2]; output dot[i].]

DySER Architecture: Flexible Vector Interface

[Figure: DySER array with three × units and two + units. Iteration 1 consumes a[0], a[1], a[2]; iteration 2 consumes a[4], a[5], a[6]. Ports are shown only for a[]. The code is the same as on the previous slide.]

How do we get this access pattern?

DySER Architecture: Flexible Vector Interface

• A flexible mechanism to map contiguous inputs to arbitrary DySER I/Os
• Add a “Vector Port” before the FIFOs
• Add a “Vector Map” which tells how data should be transferred
• A state machine processes the data when it arrives

[Figure: a vector port holding elements 0 to 7; the vector port map P0 P1 P2 x P0 P1 P2 x; input FIFO ports P0 to P3, where P0 receives elements 0 and 4, P1 receives 1 and 5, and P2 receives 2 and 6, feeding three × units.]

DySER Architecture: Flexible Vector Interface

[Figure: elements 0 to 7 from memory or a vector register routed to the input ports IP0 to IP3 under several different mappings.]

“Vector Port Mapping” allows accelerating code regions with different memory access patterns (e.g., strided).
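The vector port and vector map described on the preceding slides can be modeled in a few lines of C. This is a sketch under assumptions: the port count, FIFO depth, the `SKIP` marker for the "x" entries, and the names `port_fifos` and `vector_port_send` are all hypothetical and chosen for illustration; the hardware state machine is reduced to a simple loop.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PORTS  4     /* P0..P3, as drawn on the slides */
#define FIFO_DEPTH 8     /* hypothetical depth */
#define SKIP       (-1)  /* models an "x" entry in the vector port map */

/* One FIFO per input port. */
typedef struct {
    float  data[NUM_PORTS][FIFO_DEPTH];
    size_t count[NUM_PORTS];
} port_fifos;

/* Distribute a contiguous vector to the ports named by the map.
   map[k] is the destination port of element k, or SKIP to drop it. */
void vector_port_send(port_fifos *f, const float *vec,
                      const int *map, size_t n) {
    for (size_t k = 0; k < n; ++k) {
        int p = map[k];
        if (p == SKIP)
            continue;
        assert(p >= 0 && p < NUM_PORTS);
        assert(f->count[p] < FIFO_DEPTH);
        f->data[p][f->count[p]++] = vec[k];
    }
}
```

With the map `{0, 1, 2, SKIP, 0, 1, 2, SKIP}`, one contiguous 8-element load feeds two iterations of the struct dot product: x, y, and z go to P0, P1, and P2, and the unused q field is dropped.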

Compiler IR for DySER: Modeling Flexible Vector Interface

[Figure in three stages.
Original AEPDG: inputs a[i], a[i+1], a[i+2] feed three × nodes and two + nodes, producing out[i].
Unrolled AEPDG: a second copy consumes a[i+3], a[i+4], a[i+5] and produces out[i+1].
Vector Map Generation (Load/Store Coalescing): the vector port for a[] holds elements 0 to 7; port P0 receives elements 0 and 3, P1 receives 1 and 4, P2 receives 2 and 5. The vector port map is filled in step by step, starting from all x entries and ending as P0 P1 P2 P0 P1 P2 x x.]

Each edge on the interface knows its order.

Compiler IR for DySER: Modeling Flexible Vector Interface

[Figure in three stages.
Original AEPDG: inputs a[i], a[i+1], a[i+2] feed three × nodes and two + nodes, producing out[i].
Unrolled AEPDG: the inputs are coalesced as a[i:i+5] and the outputs as out[i:i+1].
Vector Map Generation: the vector port for a[] holds elements 0 to 7; port P0 receives elements 0 and 4, P1 receives 1 and 5, P2 receives 2 and 6. The vector port map is filled in step by step, starting from all x entries.]

Compiler IR: Access Execute Program Dependence Graph (AEPDG)

• A variant of the PDG
• Nodes represent operations
• Edges represent both data and control dependence
• Explicitly partitioned into access-PDG and execute-PDG subgraphs
• Edges between the access-PDG and the execute-PDG are augmented with temporal information

[Figure: the full graph (b+i, LD, <, +, -, φ, ST) and the execute-PDG (<, +, -, φ).]


Compilation Tasks

• Identify code regions/loops to specialize
• Construct the AEPDG
– Access PDG
– Execute PDG
• Perform vectorization/optimizations
• Schedule
– Execute PDG to DySER
– Access PDG to core

[Figure: compiler flow: Application -> Identification & Construct AEPDG -> Access PDG and Execute PDG -> Vectorization, Optimization -> Scheduling -> Core and DySER.]

Region Identification

• Identify code regions to specialize
– Path profiling
– Utilize loops
• Need a single-entry/single-exit region

[Figure: control flow graph with the specialization region marked, shown next to the compiler flow diagram.]

Construct AEPDG

• Build the Program Dependence Graph
• Separate memory access from computation
– Loads/stores and all dependent computation are access

[Figure, built up over four slides: example graph with address calculation (a+i, b+i, c+i), loads (a[i], b[i]), computation (×, +2), and store (c[i]). The address calculations, loads, and store are marked as access; the × and +2 nodes form the execute subregion.]
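The access/execute split can be sketched as a backward slice over a small graph. This is one reading of the rule on the slide, stated as an assumption: loads and stores are access nodes, everything their addresses depend on is access, and whatever remains is the execute subregion. The `node` layout and the function names are hypothetical and are not taken from the released compiler.

```c
#include <assert.h>
#include <stdbool.h>

enum kind { OP_ADDR, OP_LOAD, OP_STORE, OP_MUL, OP_ADD };

#define MAX_SRC 2
#define NONE    (-1)

typedef struct {
    enum kind kind;
    int addr;           /* node producing the address (loads/stores), or NONE */
    int src[MAX_SRC];   /* nodes producing data operands, or NONE */
} node;

/* Mark a node and everything it depends on as part of the access PDG. */
static void mark_access(const node *g, int id, bool *is_access) {
    if (id == NONE || is_access[id])
        return;
    is_access[id] = true;
    mark_access(g, g[id].addr, is_access);
    for (int k = 0; k < MAX_SRC; ++k)
        mark_access(g, g[id].src[k], is_access);
}

/* Loads/stores and the slice feeding their addresses are access;
   the value operand of a store is deliberately not followed, so the
   computation that produces it stays in the execute subregion. */
void partition_aepdg(const node *g, int n, bool *is_access) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        is_access[i] = false;
    for (int i = 0; i < n; ++i) {
        if (g[i].kind == OP_LOAD || g[i].kind == OP_STORE) {
            is_access[i] = true;
            mark_access(g, g[i].addr, is_access);
        }
    }
}
```

For the example on the slide, this marks a+i, b+i, c+i, both loads, and the store as access, and leaves × and +2 for DySER.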

Vectorization

• Similar to SIMD techniques, loops must have:
– Independent iterations
– No store/load aliasing
• Memory access: no gather/scatter
• Perform loop control
– Modify trip count/peel scalar loop

[Figure, over two slides: the scalar graph (a[i], b[i], ×, +2, c[i]) becomes the vectorized graph (a[i:i+3], b[i:i+3], ×, +2, c[i:i+3]).]

Data is pipelined through DySER.
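The loop-control step (modify the trip count, peel a scalar loop) can be illustrated as follows. This is a sketch, not the compiler's output: `dyser_mul_add2` is a hypothetical stand-in for one vector-width pass through the datapath, the width of 4 follows the a[i:i+3] notation in the figure, and reading the "+2" node as "add the constant 2" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4   /* vector width, matching a[i:i+3] in the figure */

/* Hypothetical stand-in for one invocation of the execute subregion. */
static void dyser_mul_add2(const float *a, const float *b, float *c) {
    for (int k = 0; k < VLEN; ++k)
        c[k] = a[k] * b[k] + 2.0f;
}

void mul_add2(const float *a, const float *b, float *c, size_t n) {
    /* Simplified stand-in for the "no store/load aliasing" requirement. */
    assert(a != c && b != c);

    /* Modified trip count: the largest multiple of VLEN not exceeding n. */
    size_t vec_end = n - (n % VLEN);
    size_t i = 0;
    for (; i < vec_end; i += VLEN)
        dyser_mul_add2(&a[i], &b[i], &c[i]);

    /* Peeled scalar loop for the leftover iterations. */
    for (; i < n; ++i)
        c[i] = a[i] * b[i] + 2.0f;
}
```

The accesses are unit-stride and contiguous, which is what the "no gather/scatter" condition asks for.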

Scheduling

• Map the execute subregion to DySER
– Sort nodes in dataflow order
– Greedily place each node to minimize the total routes

[Figure, built up over five slides: the execute graph (in1, in2, ×, +2, out) next to the DySER array of functional units and switches. The nodes are placed one at a time in dataflow order: in1 and in2, then ×, then +2, then out.]
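A greedy placement of this kind can be sketched in C as below. Several simplifications are assumptions made for illustration: the array is a 4x4 grid of functional units, every unit can execute every operation, route cost is approximated by Manhattan distance to already-placed producers, and switches are not modeled. The names `sched_node` and `greedy_place` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define GRID     4      /* hypothetical 4x4 array of functional units */
#define MAX_PRED 2
#define NONE     (-1)

typedef struct {
    int pred[MAX_PRED];   /* indices of producer nodes, or NONE */
    int row, col;         /* filled in by the scheduler */
} sched_node;

/* Nodes must already be sorted in dataflow (topological) order.
   Returns the total estimated route length. */
int greedy_place(sched_node *nodes, int n) {
    int used[GRID][GRID] = {{0}};
    int total = 0;
    assert(n <= GRID * GRID);
    for (int i = 0; i < n; ++i) {
        int best = INT_MAX, br = 0, bc = 0;
        for (int r = 0; r < GRID; ++r) {
            for (int c = 0; c < GRID; ++c) {
                if (used[r][c])
                    continue;
                /* Cost of this slot: distance to every placed producer. */
                int cost = 0;
                for (int k = 0; k < MAX_PRED; ++k) {
                    int p = nodes[i].pred[k];
                    if (p == NONE)
                        continue;
                    assert(p < i);   /* producers come earlier in the order */
                    cost += abs(r - nodes[p].row) + abs(c - nodes[p].col);
                }
                if (cost < best) { best = cost; br = r; bc = c; }
            }
        }
        used[br][bc] = 1;
        nodes[i].row = br;
        nodes[i].col = bc;
        total += best;
    }
    return total;
}
```

Being greedy, the placement of an early node is never revisited, so the result is not guaranteed to be optimal.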

Case Study: Loop Dependence

//Needleman Wunsch
int a[], b[]; //initialize

for (int i = 1; i < NCOLS; ++i) {
  for (int j = 1; j < NROWS; ++j) {
    a[i][j] = max(a[i-1][j-1] + b[i][j], a[i-1][j], a[i][j-1]);
  }
}

Outer iterations are dependent, too.

[Figure: array a[] with the dependence chain marked, and two copies of the + / max / max graph. The first takes a[i-1][j-1], a[i-1][j], and a[i][j-1] and produces a[i][j]; the second takes a[i-1][j] and a[i-1][j+1], uses the result of the previous iteration, and produces a[i][j+1].]

Vectorizable!
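The case study can be written out in C with the inner loop unrolled by two, so that the second cell consumes the first cell's result directly instead of going back through memory. This is an illustrative sketch: the array sizes and the names `cell` and `nw_fill` are hypothetical, and the three-way `max` from the slide is expressed as two nested two-way maxima, matching the two max nodes in the figure.

```c
#include <assert.h>

#define NCOLS 5   /* hypothetical sizes for the illustration */
#define NROWS 6

static int max2(int x, int y) { return x > y ? x : y; }

/* One cell: the "+" node and the two "max" nodes from the figure. */
static int cell(int diag, int up, int left, int score) {
    return max2(max2(diag + score, up), left);
}

void nw_fill(int a[NCOLS][NROWS], int b[NCOLS][NROWS]) {
    assert(a != b);
    for (int i = 1; i < NCOLS; ++i) {
        int j = 1;
        for (; j + 1 < NROWS; j += 2) {
            /* Everything read from row i-1 is available up front. */
            int r0 = cell(a[i-1][j-1], a[i-1][j],   a[i][j-1], b[i][j]);
            /* The second cell uses r0 directly: the dependence chain
               stays inside the datapath. */
            int r1 = cell(a[i-1][j],   a[i-1][j+1], r0,        b[i][j+1]);
            a[i][j]   = r0;
            a[i][j+1] = r1;
        }
        for (; j < NROWS; ++j)   /* leftover cell when the count is odd */
            a[i][j] = cell(a[i-1][j-1], a[i-1][j], a[i][j-1], b[i][j]);
    }
}
```

The loads of row i-1 and b are contiguous and independent of the chain, which is what makes the region a candidate for vector loads even though the results are produced in sequence.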


Evaluation Methodology

• Simulation framework
– gem5 + DySERsim for performance
– McPAT for energy
• Compiler implementation
– Leverages the LLVM compilation framework
– Constructs the AEPDG from LLVM IR
– Generates binaries for x86 and SPARC
• Benchmarks
– Throughput workloads: Intel TPT kernels, Parboil benchmark suite
– General purpose workloads: SPEC 2006, PARSEC
– Database: operators and primitives, query

Evaluation

• What are the performance and energy benefits?
– DLP workloads
– General purpose or irregular workloads
• How effective is the compiler?
• How effective is DySER on database query processing?
– Both DLP and irregular code in the same application

DySER vs. Superscalar: DLP

[Figure: per-benchmark speedup (scale 0 to 10) and energy reduction (%, scale 0 to 100) for CONV, MERGE, NBODY, RADAR, TrSRCH, VR, CUTCP, FFT, KMEANS, LBM, MM, MRI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, and the geometric mean (GM). Annotations on the chart: "Control flow in memory access"; "Multiple configurations: configuration cost starts to dominate"; "Indirect memory access, loop-carried dependences".]

DySER performs on average 3.4x better than the baseline, with a 53% reduction in energy consumption.

DySER vs. Superscalar: General Purpose

[Figure: per-benchmark speedup (scale 0% to 20%) and energy reduction (%, scale 0 to 30) for ASTAR, BZIP2, H264, HMMER, LIBQUANTUM, MCF, BLACKSCHOLES, FLUIDANIMATE, FREQMINE, SWAPTIONS, STREAMCLUSTER, and the geometric mean (GM).]

• Data-dependent branches are mapped to DySER, which leads to fewer pipeline flushes
• Exploits the available DLP, but control-dependent stores prevent large gains

DySER provides an 8% mean speedup with an 11% reduction in energy consumption.

Where does the efficiency come from?

[Figure: effective IPC (scale 0 to 10), shown as DySER IPC, core IPC, and baseline IPC, for CONV, MERGE, NBODY, RADAR, TrSRCH, VR, CUTCP, FFT, KMEANS, LBM, MM, MRI-Q, NEEDLE, NNW, SPMV, STENCIL, TPACF, and GM.]

DySER emulates a wider-issue processor than the baseline processor.

DySER vs. Superscalar: Summary

[Figure: speedup (scale 0 to 5) and energy reduction (%, scale 0 to 100) for TPT Kernels, Parboil, SPECINT, and PARSEC. Value labels shown on the chart: 11%, 20%, 10%, 11%.]

• On DLP workloads, DySER provides significant improvements
• On irregular workloads, DySER provides modest improvements

Performance: SSE/AVX vs. DySER

[Figure: speedup over SSE (scale 0 to 5, with one bar labeled 13x) for SSE, AVX, and DySER on CONV, MERGE, NBODY, RADAR, TREESEARCH, VR, CutCP, FFT, KMEANS, LBM, MM, MRI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, and the harmonic mean (HM).]

• DySER is bottlenecked by the FDIV/FSQRT units
• When DLP is readily available, both SIMD and DySER perform better
• With control-intensive code, DySER performs better

DySER performs on average 1.8x better than SSE/AVX.

Why is DySER more efficient than SIMD?

• SIMD vectorizes either inside the loop (superword-level parallelism)
• Or SIMD vectorizes across loop iterations
• DySER can vectorize both simultaneously

[Figure: three panels: SIMD (SLP), SIMD (“Do Across”), and DySER.]

Programmer Optimized vs. Compiler Optimized

[Figure: performance of compiler-generated code relative to programmer-optimized code (scale 0 to 1) for CONV, MERGE, NBODY, RADAR, TREESEARCH, VR, CutCP, FFT, KMEANS, LBM, MM, MRI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, and the harmonic mean (HM). Annotations on the chart: "Outer loop transformations"; "Different strategy for reduction"; "Constant table lookup".]

The slowdown of compiler-generated code is only 30%.

Why Database?

• DySER
– Energy-efficient in-core accelerator
– Dynamically specializes frequently executing code
• Database
– Energy management is emerging as a primary goal
– Query processing with database kernels/primitives

Simplified TPC-H Query 1

SELECT
  sum(quantity),
  sum(price * (1-disc)),
  sum(price * (1-disc) * (1+tax)),
  count(*)
FROM lineitem
WHERE ship_date <= XXXX
GROUP BY returnflag, linestatus

Query

[The query from the previous slide is repeated over four slides, each highlighting one operator.]

• Projection: highly data parallel
• SCAN: highly data parallel
• HASH: data parallel with control
• AGGR: limited DLP

Query Processing Implementations

• JIT
– Whole query is processed in a single loop
– No intermediate results materialized
• Vectorized
– Query is processed in blocks and data is accessed in columnar fashion
– Intermediate results materialized
– Better for SIMD and exploits cache locality
• Hybrid
– Partitions the query to utilize the available DLP without materializing many intermediate results
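The difference between the JIT and vectorized styles can be sketched on a cut-down version of the query. This is an illustration only: the `lineitem` column layout, the precomputed `group` id per row (standing in for the hash of returnflag and linestatus), the block size, and the function names are all assumptions, and the tax term is omitted to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK   4   /* hypothetical block size for the vectorized style */
#define NGROUPS 4   /* hypothetical number of groups */

typedef struct {
    const int *ship_date, *group;        /* group: precomputed group id */
    const double *qty, *price, *disc;
    size_t n;
} lineitem;

typedef struct { double sum_qty, sum_disc_price; long count; } agg;

static void accumulate(agg *out, int g, double qty, double disc_price) {
    assert(g >= 0 && g < NGROUPS);
    out[g].sum_qty += qty;
    out[g].sum_disc_price += disc_price;
    out[g].count += 1;
}

/* JIT style: one loop over the rows, nothing materialized. */
void q1_jit(const lineitem *t, int cutoff, agg *out) {
    for (size_t i = 0; i < t->n; ++i)
        if (t->ship_date[i] <= cutoff)
            accumulate(out, t->group[i], t->qty[i],
                       t->price[i] * (1.0 - t->disc[i]));
}

/* Vectorized style: one operator at a time over a block of rows,
   with the selection vector and projection materialized per block. */
void q1_vectorized(const lineitem *t, int cutoff, agg *out) {
    size_t sel[BLOCK];
    double disc_price[BLOCK];
    for (size_t base = 0; base < t->n; base += BLOCK) {
        size_t len = (t->n - base < BLOCK) ? t->n - base : BLOCK;
        size_t nsel = 0;
        for (size_t k = 0; k < len; ++k)          /* scan */
            if (t->ship_date[base + k] <= cutoff)
                sel[nsel++] = base + k;
        for (size_t s = 0; s < nsel; ++s)         /* projection */
            disc_price[s] = t->price[sel[s]] * (1.0 - t->disc[sel[s]]);
        for (size_t s = 0; s < nsel; ++s)         /* aggregation */
            accumulate(out, t->group[sel[s]], t->qty[sel[s]], disc_price[s]);
    }
}
```

Both functions produce the same aggregates; they differ in where the intermediate results live, which is the trade-off the Hybrid approach tries to balance.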

Result: TPC-H Query 1

[Figure: speedup (scale 0 to 4) of the JIT, Vectorized, and Hybrid implementations on Scalar, SIMD, and DySER.]

• Since no DLP is available, SIMD performs poorly
• DySER achieves speedups of more than 2.5x by exploiting pipeline parallelism

Hardware/software codesign improves query processing significantly.

How about design complexity?

• We (five graduate students) implemented a prototype of DySER integrated into OpenSPARC
• Prototype mapped onto a Xilinx Virtex 5 FPGA board
• Boots unmodified Ubuntu 7.10 Linux
• DySER is not in the critical path!

“Design, Integration, and Implementation of the DySER Hardware Accelerator into OpenSPARC”, in HPCA 2012.

DySER is indeed a non-intrusive design and is easy to integrate into a commercial processor.

Conclusion

• Must rethink and co-design architecture, microarchitecture, and compilers
– Make energy a primary constraint
– Incremental evolution of historical accelerators has produced diminishing returns
• Compiler-assisted hardware specialization
– Provides energy efficiency without loss of generality and with low design complexity

Page 68: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

68

Publications

[GNS PACT 2013] Breaking SIMD Shackles with an Exposed Flexible Microarchitecture and the Access Execute PDG. In PACT 2013.

[GHNCSSK IEEE Micro 2012] DySER: Unifying Functionality and Parallelism Specialization for Energy Efficient Computing. IEEE Micro Sep/Oct 2012.

[BCFGHMNS HotChips 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC, In HotChips 2012.

[BCFHGNS HPCA 2012] Design, Integration and Implementation of DySER Hardware Accelerator into OpenSPARC, In HPCA 2012.

[NSHGDS ISCA 2011] Sampling + DMR: Practical and Low-overhead Permanent Fault Detection, In ISCA 2011

[GHS HPCA 2011] Dynamically Specialized Datapaths for Energy Efficient Computing. In HPCA 2011

[GDSVM Micro 2008] Toward A Multicore Architecture for Real-time Ray-tracing. In Micro 2008

Page 69: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

69

Acknowledgements

Prof. Karu Sankaralingam

Marc de Kruijf

Tony Nowatzki

Lena Olson

DySER Team: Chen-Han, Tony, Chris, Ryan, Zach, Jesse

Page 70: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

?70

Page 71: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

71

Backup Slides

Page 72: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER Datapath

72

• Ready (R) – for flow control (forward)
• Credit (C) – for flow control (backward)
• Valid (V) – to support control flow

[Figure: a datapath link carrying data together with the Valid (V), Ready (R), and Credit (C) signals]

Page 73: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Processor Integration: in-order

DySER interface: FIFO

73

[Figure: in-order pipeline stages Fetch, Decode, Execute, Memory, and WriteBack, with I$, D$, Register File, Decode, Exec Units, and DySER blocks]

Page 74: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Out-of-Order Integration

Out-of-order core integration

DySER itself maintains no architectural state

Use buffers to keep the state for speculative execution

74

Page 75: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

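Small loops

Leverage loop properties.

Simply unroll the loop further, "cloning" the region

[Figure: Before, a single multiply (×) and add (+) region receives a[0..3] and b[0..3] from the input FIFO and sends c[0..3] to the output FIFO. After, two cloned regions execute side by side: one handles a[0], a[2], b[0], b[2] and produces c[0], c[2]; the other handles a[1], a[3], b[1], b[3] and produces c[1], c[3].]

Also uses flexible I/O

75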
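A minimal sketch in C of what cloning means at the source level. The slide shows only a multiply and an add per element, so the kernel `c[i] = a[i] * b[i] + bias`, the function names, and the `bias` operand are all hypothetical. It illustrates the unrolling only and does not model the FIFOs or the DySER configuration.

```c
#include <stddef.h>

/* Hypothetical kernel: one multiply and one add per element. */
void mul_add_scalar(const int *a, const int *b, int *c, size_t n, int bias)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i] + bias;
}

/* Same kernel unrolled by two. The loop body now contains two independent
 * copies ("clones") of the multiply-add region: one for even elements and
 * one for odd elements. */
void mul_add_cloned(const int *a, const int *b, int *c, size_t n, int bias)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        c[i]     = a[i]     * b[i]     + bias;  /* clone 0: even element */
        c[i + 1] = a[i + 1] * b[i + 1] + bias;  /* clone 1: odd element  */
    }
    for (; i < n; i++)                          /* leftover element, if any */
        c[i] = a[i] * b[i] + bias;
}
```

Both functions produce the same output array; the cloned version simply exposes two independent operations per iteration, which is what lets a small region occupy more of the available functional units.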

Page 76: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Large Loops: Sub-graph Matching

Subgraph Matching
  Find identical computations, split them out

Region Splitting
  Configure multiple regions, quickly switch between them

76

[Figure: Large Region, Subgraph Matching, Region Virtualization]

Page 77: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

77

Results: Configuration Cost

Programs follow 90/10 rule

[Figure: percentage of code regions contributing to 90% of running time (axis 0% to 20%) for blackscholes, canneal, fluidanimate, streamcluster, bzip2, mcf, h264ref, soplex, sphinx3, and the mean]

Page 78: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

78

Energy: SSE/AVX Vs. DySER

[Figure: % energy reduction (axis 0 to 100) of SSE and DySER for CONV, MERGE, NBODY, RADAR, TREESEARCH, VR, CutCP, FFT, KMEANS, LBM, MM, MRI-Q, SPMV, STENCIL, TPACF, NNW, NEEDLE, and the geometric mean (GM)]

Page 79: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

79

Related Work: Architecture

Reconfigurable system
  FPGA: high software cost

Coarse grain reconfigurable system
  Beret: uses a predesigned set of SEBs (Micro 2011)
  C-Cores: uses a set of c-cores to accelerate functions (ASPLOS 2010)
  VEAL and CCA: loop accelerators (ISCA 2008)

Other reconfigurable coprocessor approaches
  Garp, Tartan, Chimera, etc.

Page 80: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

80

Related Work: Beret

An energy efficient coprocessor

No internal control-flow

Uses a set of SEB (subgraph execution block)

Page 81: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

81

Related work: Conservation Cores

A set of specialized units that accelerates a whole function.

Slow, no pipelining support

Page 82: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

82

Other Publications

Reliability

Sampling + DMR: Practical and Low-overhead Permanent Fault Detection, In ISCA 2011.

Specialized Architecture

Toward A Multicore Architecture for Real-time Ray-tracing, In Micro 2008

Page 83: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

83

Database Backup Slides

Page 84: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

84

Evaluation Methodology

Implemented optimized versions
  Baseline: C (no special operations)
  SSE: manually optimized with compiler intrinsics
  DySER: manually optimized with DySER instructions
  AutoDySER: automatically DySERized using a compiler

Evaluated using a gem5-based simulator
  x86, an out-of-order CPU model

Page 85: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

85

SCAN

Scans a table with equality predicate

High data level parallelism

kernel:
// inputs: in_mask: bitvector, col, key
// output: out_mask: bitvector

for (i = 0; i < LEN / SZ; ++i):
    out = 0
    for (j = 0; j < SZ; ++j):
        out |= (col[i*SZ+j] == key) << j
    out_mask[i] = in_mask[i] & out
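The SCAN kernel can be written out as a small C function. This is a sketch under assumptions the slide does not state: 32 rows per mask word (`SZ`), 32-bit integer column values, and a row count that is a multiple of `SZ`. The name `scan_eq` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SZ = 32 };  /* rows covered by one mask word (assumption) */

/* For every mask word w, build a bitvector whose bit j is set when row
 * w*SZ + j of `col` equals `key`, then AND it with the incoming mask. */
void scan_eq(const int32_t *col, size_t len, int32_t key,
             const uint32_t *in_mask, uint32_t *out_mask)
{
    assert(len % SZ == 0);                      /* sketch handles full words only */
    for (size_t w = 0; w < len / SZ; w++) {
        uint32_t out = 0;                       /* reset for every mask word */
        for (size_t j = 0; j < SZ; j++)
            out |= (uint32_t)(col[w * SZ + j] == key) << j;
        out_mask[w] = in_mask[w] & out;
    }
}
```

Every comparison in the inner loop is independent of the others, which is the data-level parallelism the slide refers to.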

Page 86: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

86

SCAN: Results

If DLP is available, both DySER and SIMD perform well

Page 87: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

87

Aggregation

Kernel:

Indirect memory access

Represents worst case for DySER

for (i = 0; i < LEN; i++):
    key = k[i]
    A[key] += V[i]

Page 88: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

88

Why it is hard for DySER

Mostly address calculation

Computation is just one instruction

[Figure: dataflow graph for iteration i, with loads (LD) from K, V, and A, additions (+), and a store (ST)]

Page 90: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

90

Why it is hard for DySER

Mostly address calculation

Computation is just one instruction

Aliasing prevents loop unrolling

[Figure: dataflow graph for iteration i, with loads (LD) from K, V, and A, additions (+), and a store (ST)]

Page 91: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

91

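Why it is hard for DySER

[Figure: the dataflow graphs for iterations i and i+1 side by side, each with loads (LD) from K, V, and A, additions (+), and a store (ST); a "may dependence" is marked between the two iterations]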

Page 92: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

92

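Solution: Alias Checking in DySER

[Figure: the dataflow graphs for iterations i and i+1 side by side, each with loads (LD) from K, V, and A, additions (+), and a store (ST); an equality check (== ?) is added between the two iterations]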
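A software analogue of the alias check, as a sketch only. The slides place the check inside DySER and do not show how it is implemented, so the two-way unrolling, the combine-on-equal strategy, and the function names below are assumptions made for illustration.

```c
#include <stddef.h>

/* Reference kernel from the Aggregation slide: A[k[i]] += V[i]. */
void aggregate_scalar(long *A, const int *k, const long *V, size_t len)
{
    for (size_t i = 0; i < len; i++)
        A[k[i]] += V[i];
}

/* Unrolled by two, with an explicit check of whether the two keys alias. */
void aggregate_checked(long *A, const int *k, const long *V, size_t len)
{
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        int k0 = k[i];
        int k1 = k[i + 1];
        if (k0 == k1) {
            /* Same slot: fold both values into a single update. */
            A[k0] += V[i] + V[i + 1];
        } else {
            /* Distinct slots: the two updates are independent. */
            A[k0] += V[i];
            A[k1] += V[i + 1];
        }
    }
    for (; i < len; i++)                        /* leftover iteration */
        A[k[i]] += V[i];
}
```

Without the comparison, the second iteration's load of `A[k1]` could not safely be issued before the first iteration's store to `A[k0]`, which is the "may dependence" that blocks unrolling.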

Page 93: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

93

Aggregation Results

With an out-of-order processor, DySER provides speedup, but with an in-order processor it performs poorly.

Page 94: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

94

Database Kernels

DB kernels with data-level parallelism
  SCAN
  SORT

DB kernels with DLP and control
  SCAN on RLE
  HASH
  STRCMP (variable length)

Data-level parallelism not readily available
  Aggregation

Page 95: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

95

Results

X86 Inorder

Page 96: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

96

Overview

Database Kernels Characterization and Evaluation

Codesigning DySER/DB

Conclusion

Page 97: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

97

Codesigning DySER/DB

DySER's effectiveness drops when memory operations dominate
  Integrate load and store with DySER
  Memory Access Dataflow (MAD) architecture

Is vectorized query processing a problem for DySER?
  Compute/memory ratio is low
  JIT query processing may enable DySER to exploit pipeline parallelism better

Page 98: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

100

A Simple Query

SELECT price * (1-disc), price * (1-disc) * (1+tax)

FROM lineitem

Page 99: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

101

Query 1 Implementation

Query Processing:

Out_1 = price * (1-disc)
Out_2 = Out_1 * (1+tax)

Vectorized Query Processing:

tmp_out1 = (1-disc)
tmp_out2 = (1+tax)
out1 = price * tmp_out1
out2 = out1 * tmp_out2

[Figure labels: Inputs, Outputs]
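The two styles can be contrasted in C. This is a sketch, not the code evaluated in the talk: the function names, the use of `double`, and the caller-provided scratch arrays `tmp1` and `tmp2` are assumptions.

```c
#include <stddef.h>

/* Row-at-a-time style: both outputs are produced in a single pass and no
 * temporary columns are materialized. */
void project_fused(const double *price, const double *disc, const double *tax,
                   double *out1, double *out2, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out1[i] = price[i] * (1.0 - disc[i]);
        out2[i] = out1[i] * (1.0 + tax[i]);
    }
}

/* Vectorized style: one primitive operation per loop, with intermediate
 * columns materialized in tmp1 and tmp2 (each of length n). */
void project_vectorized(const double *price, const double *disc,
                        const double *tax, double *out1, double *out2,
                        double *tmp1, double *tmp2, size_t n)
{
    for (size_t i = 0; i < n; i++) tmp1[i] = 1.0 - disc[i];
    for (size_t i = 0; i < n; i++) tmp2[i] = 1.0 + tax[i];
    for (size_t i = 0; i < n; i++) out1[i] = price[i] * tmp1[i];
    for (size_t i = 0; i < n; i++) out2[i] = out1[i] * tmp2[i];
}
```

Each loop in the vectorized version is trivially data parallel, at the cost of writing and re-reading the temporaries; the fused version keeps intermediates in registers.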

Page 100: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

102

Result: Query 1

When the query is fully data parallel, both SIMD and DySER perform well

Page 101: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

103

Slightly Complex Query (TPC-H Q1)

SELECT sum(quantity), sum(price * (1-disc)), sum(price * (1-disc) * (1+tax)), count(*)

FROM lineitem

WHERE ship_date <= XXXX

GROUP BY returnflag, linestatus

Page 102: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

104

Query

SELECT sum(quantity), sum(price * (1-disc)), sum(price * (1-disc) * (1+tax)), count(*)

FROM lineitem

WHERE ship_date <= XXXX

GROUP BY returnflag, linestatus

Projection: Highly data parallel

Page 103: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

105

Query

SELECT sum(quantity), sum(price * (1-disc)), sum(price * (1-disc) * (1+tax)), count(*)

FROM lineitem

WHERE ship_date <= XXXX

GROUP BY returnflag, linestatus

SCAN: Highly data parallel

Page 104: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

106

Query

SELECT sum(quantity), sum(price * (1-disc)), sum(price * (1-disc) * (1+tax)), count(*)

FROM lineitem

WHERE ship_date <= XXXX

GROUP BY returnflag, linestatus

HASH: Data parallel with control

Page 105: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

107

Query

SELECT sum(quantity), sum(price * (1-disc)), sum(price * (1-disc) * (1+tax)), count(*)

FROM lineitem

WHERE ship_date <= XXXX

GROUP BY returnflag, linestatus

AGGR: No DLP

Page 106: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

108

Implementations

JIT
  Whole query is processed in a single loop
  No intermediate results materialized

Vectorized
  Query is processed in blocks and data is accessed in columnar fashion
  Intermediate results materialized
  Better for SIMD and exploits cache locality

Hybrid
  Partition the query to utilize the available DLP without materializing many intermediate results
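To make the JIT and Hybrid strategies concrete, here is a heavily simplified C sketch. It keeps only one aggregate of the query, `sum(price * (1-disc))`, replaces the GROUP BY hash with a precomputed `group` index per row, and uses a hypothetical block size of 4. None of these choices come from the slides; they are assumptions chosen to keep the example short.

```c
#include <stddef.h>

enum { BLOCK = 4 };  /* hypothetical block size */

/* JIT style: filter, project, and aggregate one row at a time in one loop. */
void q1_fused(const int *ship_date, const double *price, const double *disc,
              const int *group, size_t n, int cutoff, double *sum)
{
    for (size_t i = 0; i < n; i++)
        if (ship_date[i] <= cutoff)
            sum[group[i]] += price[i] * (1.0 - disc[i]);
}

/* Hybrid style: a data-parallel projection over a small block, followed by
 * a scalar filter-and-aggregate pass over the same block. Only BLOCK
 * intermediate values are materialized at any time. */
void q1_hybrid(const int *ship_date, const double *price, const double *disc,
               const int *group, size_t n, int cutoff, double *sum)
{
    double proj[BLOCK];
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t m = (n - base < BLOCK) ? n - base : BLOCK;
        for (size_t j = 0; j < m; j++)          /* no cross-row dependence */
            proj[j] = price[base + j] * (1.0 - disc[base + j]);
        for (size_t j = 0; j < m; j++)          /* indirect update stays scalar */
            if (ship_date[base + j] <= cutoff)
                sum[group[base + j]] += proj[j];
    }
}
```

Both functions accumulate into `sum`, so the caller is expected to zero it first. The hybrid version computes the projection for rows that the filter later rejects; that is the usual price of separating the data-parallel part from the control-dependent part.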

Page 107: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

109

Result: TPC-H Query 1

If no DLP is available, SIMD performs poorly. DySER achieves a speedup of more than 2.5x by exploiting pipeline parallelism.

Page 108: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

110

Database Conclusion

DySER exploits both pipeline parallelism and DLP
  When DLP is present, DySER provides >2x speedup, and so does SIMD
  DySER can provide speedup even when aliasing or control flow is present

For kernels with a low computation/memory ratio
  Integrating LD/ST units with DySER may help
  But explicit aliasing and high bandwidth are required

Combining multiple database kernels to exploit pipeline parallelism in DySER improves performance
  Requires careful looping strategies to utilize DySER better

Page 109: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

111

Support for Irregular Workloads

Page 110: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

112

Outline

Introduction

Architecture Changes

Compiler Changes

Evaluation

Page 111: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular code: 462.libquantum

for(i=0; i<reg->size; i++)
{
    if(reg->node[i].state & ((MAX_UNSIGNED) 1 << control1))
        if(reg->node[i].state & ((MAX_UNSIGNED) 1 << control2))
            reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
}

Loop Invariants
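The three shifted masks do not change across iterations, which is what makes them loop invariants. A minimal C sketch with the masks hoisted out of the loop is shown below. It uses a flat array of state words instead of the `reg->node[i].state` structure, assumes `MAX_UNSIGNED` is a 64-bit unsigned type, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t MAX_UNSIGNED;  /* assumption: 64-bit state word */

/* Flip bit `target` of every state word in which both control bits are set.
 * The three masks are computed once, before the loop. */
void flip_if_both_set(MAX_UNSIGNED *state, size_t size,
                      int control1, int control2, int target)
{
    const MAX_UNSIGNED c1 = (MAX_UNSIGNED)1 << control1;
    const MAX_UNSIGNED c2 = (MAX_UNSIGNED)1 << control2;
    const MAX_UNSIGNED t  = (MAX_UNSIGNED)1 << target;

    for (size_t i = 0; i < size; i++)
        if ((state[i] & c1) && (state[i] & c2))
            state[i] ^= t;
}
```

In the DySER version of this loop, each of these invariant values has to reach the accelerator as an input, so sending them again on every iteration is pure overhead.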

113

Page 112: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular code: 462.libquantum

Scalar Code

    ...
Loop:
    load   reg->node[i], %r1
    andcc  %r1, Ctrl1
    bz     NextIter
    andcc  %r1, Ctrl2
    bz     NextIter
    xor    %r1, Tgt, %r2
    st     %r2, reg->node[i]
NextIter:
    ...
    b      Loop
    ...

DySER Code

    ...
Loop:
    Dload  reg->node[i], p0
    Dsend  Ctrl1, p1
    Dsend  Ctrl2, p2
    Dsend  Tgt, p3
    Drecv  p4, valid
    cmp    valid, 0
    bz     NoStore
    Dstore p5, reg->node[i]
    b      Merge
NoStore:
    Drecv  p5, dummy
Merge:
    ...
    b      Loop

Issue: Invariant Sends

114

Page 113: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular code: 462.libquantum

Scalar Code

    ...
Loop:
    load   reg->node[i], %r1
    andcc  %r1, Ctrl1
    bz     NextIter
    andcc  %r1, Ctrl2
    bz     NextIter
    xor    %r1, Tgt, %r2
    st     %r2, reg->node[i]
NextIter:
    ...
    b      Loop

DySER Code

    ...
Loop:
    Dload  reg->node[i], p0
    Dsend  Ctrl1, p1
    Dsend  Ctrl2, p2
    Dsend  Tgt, p3
    Drecv  p4, valid
    cmp    valid, 0
    bz     NoStore
    Dstore p5, reg->node[i]
    b      Merge
NoStore:
    Drecv  p5, dummy
Merge:
    ...
    b      Loop

Issue: Branch on DRECV is Expensive

115

Page 114: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular code: 462.libquantum

Scalar Code

    ...
Loop:
    load   reg->node[i], %r1
    andcc  %r1, Ctrl1
    bz     NextIter
    andcc  %r1, Ctrl2
    bz     NextIter
    xor    %r1, Tgt, %r2
    st     %r2, reg->node[i]
NextIter:
    ...
    b      Loop

DySER Code

    ...
Loop:
    Dload  reg->node[i], p0
    Dsend  Ctrl1, p1
    Dsend  Ctrl2, p2
    Dsend  Tgt, p3
    Drecv  p4, valid
    cmp    valid, 0
    bz     NoStore
    Dstore p5, reg->node[i]
    b      Merge
NoStore:
    Drecv  p5, dummy
Merge:
    ...
    b      Loop

Issue: Receives need to drain even invalid outputs

116

Page 115: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular Code: 429.MCF

for( ; arc < stop_arcs; arc += nr_group )
{
    if( arc->ident > BASIC )
    {
        red_cost = arc->cost - arc->tail->potential + arc->head->potential;
        if( (red_cost < 0 && arc->ident == AT_LOWER)
            || (red_cost > 0 && arc->ident == AT_UPPER) )
        {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}

Access Code

Issue: Control Dependent memory

117

Page 116: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER ISA for Irregular Workloads

DySER Send Invariant Instruction
  dysendinv <reg>, <port>

DySER invocation started Instruction
  dystart

DySER Branch Instruction
  dybz <port>, Label
  dybnz <port>, Label

118

Page 117: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER Output Interface

DySER

“Invalid” Data

119

Page 118: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER Output Interface

DySER

Mark it as aborted and discard the

value

120

Page 119: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Outline

Issues with Irregular code

Simulator Fixes

Architecture Changes

Compiler Changes

Evaluation and Results

121

Page 120: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Compiler Changes

Slicing: Do not back slice through the control edges.
  Reason: DySER branch instruction
  Offloads more instructions to DySER

Code generator changes
  Emit new DySER instructions
  No need to insert Dummy Insert Instructions

122

Page 121: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Outline

Issues with Irregular code

Simulator Fixes

Architecture Changes

Compiler Changes

Evaluation and Results

123

Page 122: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular code: 462.libquantum

Scalar Code

    ...
Loop:
    load   reg->node[i], %r1
    andcc  %r1, Ctrl1
    bz     NextIter
    andcc  %r1, Ctrl2
    bz     NextIter
    xor    %r1, Tgt, %r2
    st     %r2, reg->node[i]
NextIter:
    ...
    b      Loop

DySER Code

    ...
    Dsendinv Ctrl1, p1
    Dsendinv Ctrl2, p2
    Dsendinv Tgt, p3
Loop:
    Dstart
    Dload  reg->node[i], p0
    Dbrz   p4, NextIter
    Dstore p5, reg->node[i]
NextIter:
    ...
    b      Loop

Scalar Code DySER Code

124

Page 123: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

DySER and Irregular Code: 429.MCF

for( ; arc < stop_arcs; arc += nr_group )
{
    if( arc->ident > BASIC )
    {
        red_cost = arc->cost - arc->tail->potential + arc->head->potential;
        if( (red_cost < 0 && arc->ident == AT_LOWER)
            || (red_cost > 0 && arc->ident == AT_UPPER) )
        {
            basket_size++;
            perm[basket_size]->a = arc;
            perm[basket_size]->cost = red_cost;
            perm[basket_size]->abs_cost = ABS(red_cost);
        }
    }
}

Access Code

Execute Code

125

Page 124: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Performance Results

[Figure: speedup over the respective scalar version (axis 0.85 to 1.2) for bzip2, hmmer, libquantum, mcf, and h264ref; series: In-order, 2-OOO, 4-OOO]

126

Page 125: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

Energy Results

[Figure: energy reduction (%, axis 0% to 25%) for bzip2, hmmer, libquantum, mcf, and h264ref; series: In-order, 2-OOO, 4-OOO]

127

Page 126: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

128

Future Directions

Page 127: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

129

Future Research Directions

How can we make legacy code energy efficient?
  Use JIT compilation to target accelerators dynamically
    If source code is available, from a specialized IR
    Otherwise, from the binary itself
  Binary rewriters to target accelerators statically

Challenges:
  Analysis to identify acceleratable sequences of instructions
    Lightweight analysis for JIT
    Static analysis on the compiled binary
  Specialized IR design

Page 128: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

130

Future Research Directions

Energy efficient memory hierarchy (EEMH)
  Moving data burns most of the energy
  Filtering data or performing operations in the hierarchy itself will help reduce energy

Challenges
  Design: How to perform computation efficiently in memory?
  Programming model: How to program the EEMH?
  Compiler: What compiler algorithms or transformations are needed for EEMH?

Page 129: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

132

DySER vs. Superscalar: Irregular

[Figure: speedup (axis 0% to 20%) and energy reduction (%, axis 0 to 30) for ASTAR, GCC, H264, LIBQUANTUM, OMNETPP, SJENG, FLUIDANIMATE, and SWAPTIONS]

Page 130: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

133

Opportunities in Database

Traditional Query Processing:

[Figure: one record passes through SCAN, PROJECT, HASH, and AGGREGATE, producing the output for one record]

Vectorized Query Processing:

[Figure: multiple records pass through SCAN, PROJECT, HASH, and AGGREGATE, producing the output for multiple records]

Page 131: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

134

Database Kernels

DB kernels with data-level parallelism
  SCAN
  SORT

DB kernels with DLP and control
  SCAN on RLE
  HASH
  STRCMP

Data-level parallelism not readily available
  Aggregation

Page 132: Energy Efficient Computing through Compiler Assisted Dynamic Specialization Venkatraman Govindaraju Advisor: Karthikeyan Sankaralingam (Defense: 7/29/2014)

135

Database Kernels: Performance

[Figure: speedup (axis 0 to 7) of Scalar, SSE, DySER, and AutoDySER for SCAN, SCAN+RLE, SORT, HASH, STRCMP, AGGR, and the geometric mean (GM)]

Highly Data parallel

Provides speedup even with data intensive code