High-Level Synthesis with LegUp A Crash Course for Users and Researchers Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi 11 February 2013 ACM FPGA Symposium Monterey, CA Dept. of Electrical and Computer Engineering University of Toronto


High-Level Synthesis with LegUp: A Crash Course for Users and Researchers

Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi

11 February 2013, ACM FPGA Symposium, Monterey, CA

Dept. of Electrical and Computer Engineering, University of Toronto


Tutorial Outline

• Overview of LegUp and its algorithms (60 min)
• Labs (“hands on” via VirtualBox)
  – Lab 1: Using the LegUp Framework (30 min)
  – Break
  – Lab 2: Adding resource constraints (30 min)
  – Lab 3: Changing how LegUp implements hardware (30 min)

Project Motivation

• Hardware design has advantages over software:
  – Speed
  – Energy efficiency
• Hardware design is difficult and skills are rare:
  – 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers

*US Bureau of Labor Statistics ‘08

Top-Level Vision

[Figure: LegUp design flow]
1. Program code is compiled by the C compiler and run on a self-profiling MIPS processor.
2. Profiling data: execution cycles, power, cache misses.
3. The profiling data suggests program segments to target to HW.
4. High-level synthesis turns those segments into hardened program segments in the FPGA fabric, next to the processor.
5. The altered SW binary calls the HW accelerators.

Example program code:

    int FIR(int ntaps, int sum) {
        int i;
        for (i=0; i < ntaps; i++)
            sum += h[i] * z[i];
        return (sum);
    }
    ....

LegUp: Key Features

• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
  – Like ABC (Synthesis) or VPR (Place & Route)
  – 600+ downloads since March 2011
  – http://legup.eecg.utoronto.ca

System Architecture

[Figure: system architecture. On the FPGA (Cyclone II or Stratix IV, on an Altera DE2 or DE4 board), the MIPS processor and the hardware accelerators connect through the Avalon interface to a memory controller with an on-chip cache, which connects to off-chip memory. Memory blocks are also drawn inside the hardware accelerators.]

High-Level Synthesis Framework

• Leverage LLVM compiler infrastructure:
  – Language support: C/C++
  – Standard compiler optimizations
  – More on this shortly
• We support a large subset of ANSI C:

Supported             Unsupported
Functions             Dynamic Memory
Arrays, Structs       Recursion
Global Variables
Pointer Arithmetic
Floating Point

Hardware Profiler Architecture

[Figure: profiler datapath. An op decoder watches the instructions passing between the MIPS processor and its instruction cache and flags call and ret. On a call, the target address is hashed to a function number (F#) and pushed on the call stack; on a ret, the F# is popped. A data counter for the current function increments when the PC changes, and a counter storage memory holds the counts for all functions.]

Address Hash (in hardware):

    tAddr += V1
    tAddr += (tAddr << 8)
    tAddr ^= (tAddr >> 4)
    b = (tAddr >> B1) & B2
    a = (tAddr + (tAddr << A1)) >> A2
    fNum = (a ^ tab[b])

• Monitor the instruction bus to detect function calls and returns.
• Call: hash (in HW) from function address to index; push to stack.
• Ret: pop function index from stack.
• Use function indexes to associate profiling data (e.g. cycles, power) with counters.

See paper: IEEE ASAP’11
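
The call/return bookkeeping in the bullets above can be modelled in a few lines of C. This is only a software sketch of the idea, not LegUp's hardware: the names profiler_t, prof_init, prof_call, prof_ret and prof_tick and the two size limits are made up for illustration, and the function index is assumed to have been hashed already.

```c
#include <assert.h>
#include <string.h>

#define MAX_FUNCS 16   /* assumed number of profiled functions */
#define MAX_DEPTH 32   /* assumed maximum call depth */

typedef struct {
    unsigned long counts[MAX_FUNCS]; /* counter storage, one per function */
    int stack[MAX_DEPTH];            /* call stack of function indexes */
    int depth;                       /* number of entries on the stack */
} profiler_t;

/* Start with function index `top` (for example main) as the current one. */
void prof_init(profiler_t *p, int top) {
    assert(top >= 0 && top < MAX_FUNCS);
    memset(p, 0, sizeof *p);
    p->stack[0] = top;
    p->depth = 1;
}

/* Call detected: push the (already hashed) index of the callee. */
void prof_call(profiler_t *p, int fidx) {
    assert(fidx >= 0 && fidx < MAX_FUNCS && p->depth < MAX_DEPTH);
    p->stack[p->depth++] = fidx;
}

/* Return detected: pop, so the caller becomes current again. */
void prof_ret(profiler_t *p) {
    assert(p->depth > 1);
    p->depth--;
}

/* One profiled event (such as a cycle) is charged to the current function. */
void prof_tick(profiler_t *p) {
    p->counts[p->stack[p->depth - 1]]++;
}
```

Calling prof_tick once per event between prof_call and prof_ret charges those events to the callee only, which is what associating counters with function indexes amounts to.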

Processor/Accelerator Hybrid Flow

    int main () {
        …
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        …
        for (i=0; i<N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }

Software wrapper for the accelerated function:

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;  /* pass the argument */
        *dotproduct_STATUS = 1;               /* write the status register */
        return *dotproduct_DATA;              /* read back the result */
    }

The function to accelerate is named with:

    set_accelerator_function "dotproduct"

HLS compiles dotproduct into a HW accelerator.

In main, the call to dotproduct is replaced by a call to the wrapper legup_dotproduct:

    int main () {
        …
        sum = legup_dotproduct(N);
        ...
    }

The altered software (main with the call sum = legup_dotproduct(N), plus the legup_dotproduct wrapper) runs as SW on the MIPS processor.

How Does LegUp Handle Memory and Pointers?

• LegUp stores each array in a separate FPGA BRAM
• BRAM data width matches the width of the data in the array
• Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and array index:

    | 9-bit Tag (bits 31 to 23) | 23-bit Index (bits 22 to 0) |

• A shared memory controller uses the tag bits to determine which BRAM to read from or write to
• The array index is the address passed to the BRAM

Pointer Example

• We have two arrays in the C function:
  – int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: Tag=2 (bits 31 to 23), Index=3 (bits 22 to 0)
• Address of B[7]: Tag=3 (bits 31 to 23), Index=7 (bits 22 to 0)

Shared Memory Controller

• Both arrays A and B have 100-element BRAMs
• Load from pointer D: Tag=2 (bits 31 to 23), Index=13 (bits 22 to 0)

[Figure: shared memory controller. The index 13 addresses both BRAMs: BRAM Tag=2 holds A[0] to A[99] and BRAM Tag=3 holds B[0] to B[99]. The tag value 2 selects the 32-bit output of BRAM Tag=2, so the load returns A[13]. FF marks registers in the diagram.]
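
The tag/index address format and the controller's load path can be mimicked in software. The sketch below assumes only the layout stated in the slides (9-bit tag in bits 31 to 23, 23-bit index in bits 22 to 0, tag 2 for A and tag 3 for B); the names make_addr, addr_tag, addr_index and mem_load are illustrative and are not LegUp identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BITS   9
#define INDEX_BITS 23
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

/* Pack a 9-bit RAM tag and a 23-bit array index into a 32-bit address. */
uint32_t make_addr(uint32_t tag, uint32_t index) {
    assert(tag < (1u << TAG_BITS) && index <= INDEX_MASK);
    return (tag << INDEX_BITS) | index;
}

uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }

int A[100];  /* stands in for the BRAM with tag 2 */
int B[100];  /* stands in for the BRAM with tag 3 */

/* Software stand-in for the controller's load path:
 * the tag selects the BRAM, the index addresses it. */
int mem_load(uint32_t addr) {
    uint32_t idx = addr_index(addr);
    assert(idx < 100);
    switch (addr_tag(addr)) {
    case 2:  return A[idx];
    case 3:  return B[idx];
    default: /* tag 0 (NULL) and tag 1 (off-chip) are not modelled here */
        assert(!"unsupported tag");
        return 0;
    }
}
```

With this layout, make_addr(2, 3) is 0x01000003 for A[3] and make_addr(3, 7) is 0x01800007 for B[7].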

Core Benchmarks (+Many More)

• 12 CHStone Benchmarks (JIP’09) and Dhrystone
  – Too large/complex for academic HLS tools
• Include golden input/output test vectors
• Not supported by academic tools

Category     Benchmarks                                     Lines of C code
Arithmetic   64-bit double precision: add, mult, div, sin   376 – 755
Encryption   AES, Blowfish, SHA                             716 – 1,406
Processor    MIPS processor                                 232
Media        JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
General      Dhrystone                                      491

Experimental Results

LegUp 1.0 (2011) for Cyclone II

1. Pure software on MIPS

Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2 but with the most compute-intensive function

4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)

Experimental Results

[Figure: bar chart of execution time (geometric mean) and # of LEs (geometric mean) for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW and eXCite-HW.]

Comparison: LegUp vs eXCite

• Benchmarks compiled to hardware
• eXCite: commercial high-level synthesis tool
  – Couldn’t compile Dhrystone

Geomean                LegUp    eXCite   LegUp/eXCite
Circuit Runtime (μs)   292      357      0.82 (1.22x)
Logic Elements         15,646   13,101   1.19
Area-Delay Product     4.57M    4.68M    0.98
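
The ratio column and the area-delay row can be rechecked from the two geomean columns. The helpers below are a small sketch for that arithmetic only; ratio and area_delay are illustrative names, and the inputs are the geomean values quoted in the comparison.

```c
#include <assert.h>

/* LegUp/eXCite ratio for one metric; below 1.0 means LegUp is lower. */
double ratio(double legup, double excite) {
    assert(excite > 0.0);
    return legup / excite;
}

/* Area-delay product: logic elements times circuit runtime in microseconds. */
double area_delay(double logic_elements, double runtime_us) {
    return logic_elements * runtime_us;
}
```

For example, area_delay(15646, 292) is 4,568,632 (about 4.57M) and area_delay(13101, 357) is 4,677,057 (about 4.68M), giving a ratio of about 0.98; the runtime ratio 292/357 is about 0.82, that is, eXCite circuits take about 1.22x as long.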

Energy Consumption

[Figure: bar chart of energy in μJ (geometric mean) for MIPS-SW, LegUp-Hybrid2, LegUp-Hybrid1, LegUp-HW and eXCite-HW.]

18x less energy than software

Current Release: LegUp 3.0

• Loop pipelining
• Dual and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads & OpenMP

Results are now considerably better than the LegUp 1.0 release.

LegUp 3.0 vs. LegUp 1.0

[Figure: LegUp 3.0/LegUp 1.0 ratio of wall-clock time, cycles, Fmax and LEs for each CHStone benchmark circuit (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, dhrystone, gsm, jpeg, mips, motion, sha) and their geomean; the ratio axis runs from 0.4 to 1.6.]

• Wall-clock time: 16% better
• Cycle latency: 31% better
• Fmax: 18% worse
• LEs (area): 28% better

LLVM Compiler and HLS Algorithms


LLVM Compiler

• Open-source compiler framework.
  – http://llvm.org
• Used by Apple, NVIDIA, AMD, others.
• Competitive quality with gcc.
• LegUp HLS is a “back-end” of LLVM.
• LLVM: low-level virtual machine.

LLVM Compiler

• LLVM will compile C code into a control flow graph (CFG)
• LLVM will perform standard optimizations
  – 50+ different optimizations in LLVM

C Program:

    int FIR(int ntaps, int sum) {
        int i;
        for (i=0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }
    ....

[Figure: the C program passes through the LLVM compiler, which produces a CFG with basic blocks BB0, BB1 and BB2.]

Control Flow Graph

• Control flow graph is composed of basic blocks
• Basic block: a sequence of instructions terminated with exactly one branch
  – Can be represented by an acyclic data flow graph

[Figure: CFG with basic blocks BB0, BB1 and BB2; one basic block is expanded into a data flow graph made of three loads, two adds and a store.]

LLVM Details

• Instructions in basic blocks are primitive computational operations:
  – shift, add, divide, xor, and, etc.
• Or are control-flow operations:
  – branch, call, etc.
• The CDFG is represented in LLVM’s intermediate representation (IR)
  – IR is machine-independent assembly code.

High-Level Synthesis Flow

[Figure: HLS flow]
C Program → C Compiler (LLVM) → Optimized LLVM IR → Allocation → Scheduling → Binding → RTL Generation → Synthesizable Verilog

Side inputs to the flow:
• Target H/W Characterization
• User Constraints
  – Timing
  – Resource

Scheduling

• Scheduling: the task of assigning operations to clock cycles, sequenced by a finite state machine

[Figure: FSM schedule of the data flow graph. State 0: load, load. State 1: +, load. State 2: +. State 3: store.]

Binding

• Binding: the task of assigning scheduled operations to functional units in the datapath

[Figure: the schedule (load, load; +, load; +; store) is mapped onto a datapath consisting of a 2-port RAM, an adder (+) and a register (FF).]

High-Level Synthesis: Scheduling


SDC Scheduling

• SDC: System of Difference Constraints
  – Cong, Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”. DAC 2006: 433-438.
• Basic idea: formulate scheduling as a mathematical optimization problem
  – Linear objective function + linear constraints (==, <=, >=).
• The problem is a linear program (LP)
  – Solvable in polynomial time with standard solvers

Define Variables

• For each operation i to schedule, create a variable t_i.
• The t_i’s will hold the cycle # in which each op is scheduled.
• Here we have:
  – t_add, t_shift, t_sub

[Figure: data flow graph (DFG) in which an add (+) and a shift (<<) feed a subtract (-). The DFG is already accessible in LLVM.]

Dependency Constraints

• In this example, the subtract can only happen after the add and shift.
• t_sub - t_add >= 0
• t_sub - t_shift >= 0
• Hence the name difference constraints.

[Figure: DFG in which add and shift feed sub.]

Handling Clock Period Constraints

• Target period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D (using LegUp’s delay models)
  – E.g.: D from mod -> or = 23 ns.
• Compute: R = ceiling(D/P) - 1
  – E.g.: R = ceiling(23/10) - 1 = 2
• Add the difference constraint:
  – t_or - t_mod >= 2

[Figure: chain of dependent operations mod, xor, shr, or.]
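
The clock-period rule reduces to one line of integer arithmetic. The helper below is a sketch of that formula only; the name chain_latency and the choice of whole nanoseconds for D and P are assumptions made so that the ceiling is exact.

```c
#include <assert.h>

/* Minimum cycle gap R between the first and last operation of a chain
 * with path delay d_ns under target period p_ns: R = ceiling(D/P) - 1.
 * Both delays are whole nanoseconds, so the ceiling uses integer math. */
int chain_latency(int d_ns, int p_ns) {
    assert(d_ns > 0 && p_ns > 0);
    return (d_ns + p_ns - 1) / p_ns - 1;
}
```

For the mod -> or chain, chain_latency(23, 10) gives 2, which is the right-hand side of the constraint t_or - t_mod >= 2. A chain that fits within one period, such as chain_latency(10, 10), gives 0.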

Resource Constraints

• Restriction on the # of operations of a given type that can execute in a cycle
• Why do we need it?
  – Want to use dual-port RAMs in the FPGA
    • Allow up to 2 load/store operations in a cycle
  – Floating point
    • Do not want to instantiate many FP cores of a given type, probably just one
    • Scheduling must honour the # of FP cores available

Resource Constraints in SDC

• Resource-constrained scheduling is NP-hard.
• Implemented the approach in [Cong & Zhang DAC2006]

[Figure: data flow graph of eight additions, labelled A to H.]

Say we want to schedule with only 2 adders in the HW (lab #2).

Add SDC Constraints

• Generate a topological ordering of the resource-constrained operations:

    A B C E F D G H

• Say constrained to 2 adders in HW.
• Starting at C in the ordering, create a constraint: t_C - t_A > 0
• Next consider E, add constraint: t_E - t_B > 0
• Continue to the end
• Resulting schedule will have <= 2 adds / cycle
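
The "continue to the end" step can be written as a loop over the ordering. The sketch below reads the rule as "each operation must start strictly after the one `units` positions before it in the ordering", which matches the two constraints given (C after A, E after B) but is otherwise an interpretation; res_constraint_t and add_resource_constraints are illustrative names.

```c
#include <assert.h>

typedef struct {
    char later;    /* operation that must start strictly later */
    char earlier;  /* operation it must follow */
} res_constraint_t;

/* Walk a topological ordering of n same-type operations and, for a limit
 * of `units` functional units, emit t_later - t_earlier > 0 between each
 * operation and the one `units` positions before it.
 * Returns the number of constraints written to out (n - units, or 0). */
int add_resource_constraints(const char *order, int n, int units,
                             res_constraint_t *out) {
    assert(units > 0);
    int count = 0;
    for (int i = units; i < n; i++) {
        out[count].later = order[i];
        out[count].earlier = order[i - units];
        count++;
    }
    return count;
}
```

For the ordering A B C E F D G H with 2 adders this yields six constraints: C after A, E after B, F after C, D after E, G after F and H after D. Any three operations then contain two that are chained by these constraints, so no cycle holds more than two adds.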

ASAP Objective Function

• Minimize the sum of the variables: minimize Σ_i t_i
• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time
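
For constraints of the form t_j - t_i >= c with every t_i >= 0, the as-soon-as-possible schedule can also be found without an LP solver. The sketch below swaps in a Bellman-Ford style longest-path relaxation, which yields the smallest feasible value of every variable and therefore the smallest sum; it is not LegUp's implementation, and diff_constraint_t and sdc_asap are illustrative names.

```c
#include <assert.h>

typedef struct {
    int later;    /* index j */
    int earlier;  /* index i */
    int gap;      /* constraint: t[j] - t[i] >= gap */
} diff_constraint_t;

/* Earliest cycle for each of n operations under m difference constraints,
 * with every t >= 0. Each pass raises a variable only as far as some
 * constraint forces it, so the fixed point is the ASAP schedule.
 * Returns 0 on success, -1 if the constraints contradict each other. */
int sdc_asap(int n, const diff_constraint_t *c, int m, int *t) {
    assert(n > 0 && m >= 0);
    for (int i = 0; i < n; i++) t[i] = 0;
    for (int pass = 0; pass <= n; pass++) {
        int changed = 0;
        for (int k = 0; k < m; k++) {
            int need = t[c[k].earlier] + c[k].gap;
            if (t[c[k].later] < need) {
                t[c[k].later] = need;
                changed = 1;
            }
        }
        if (!changed) return 0;
    }
    return -1;  /* still changing after n+1 passes: a positive cycle */
}
```

With operations numbered add=0, shift=1, sub=2, mod=3 and or=4, the constraints t_sub - t_add >= 0, t_sub - t_shift >= 0 and t_or - t_mod >= 2 give t = {0, 0, 0, 0, 2}. A strict constraint such as t_C - t_A > 0 can be passed as a gap of 1, since cycle numbers are integers.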

High-Level Synthesis: Binding


High-Level Synthesis: Binding

• Weighted bipartite matching-based binding
  – Huang, Chen, Lin, Hsu, “Data path allocation based on bipartite weighted matching”. DAC 1990: 499-504.
• Finds the minimum weighted matching of a bipartite graph at each step
  – Solve using the Hungarian Method (polynomial)

[Figure: bipartite graph with operations on one side, hardware functional units on the other, and edge costs on the edges between them.]
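
A minimum weighted matching for one step can be illustrated on a tiny cost matrix. The sketch below deliberately uses exhaustive search instead of the Hungarian Method, which is only reasonable for a handful of functional units; the names min_cost_binding and search, the limit MAX_N and the example costs are all hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define MAX_N 8  /* exhaustive search is only practical for small n */

/* Try every assignment of operations op..n-1 to the units not in `used`. */
static void search(int n, int cost[][MAX_N], int op, int used,
                   int so_far, int *cur, int *best_cost, int *best) {
    if (op == n) {
        if (so_far < *best_cost) {
            *best_cost = so_far;
            for (int i = 0; i < n; i++) best[i] = cur[i];
        }
        return;
    }
    for (int fu = 0; fu < n; fu++) {
        if (used & (1 << fu)) continue;  /* unit already taken */
        cur[op] = fu;
        search(n, cost, op + 1, used | (1 << fu),
               so_far + cost[op][fu], cur, best_cost, best);
    }
}

/* Assign each of n operations (rows) to a distinct functional unit
 * (columns) with minimum total edge cost. best[op] receives the unit
 * chosen for operation op; the return value is the total cost. */
int min_cost_binding(int n, int cost[][MAX_N], int *best) {
    assert(n >= 1 && n <= MAX_N);
    int cur[MAX_N];
    int best_cost = INT_MAX;
    search(n, cost, 0, 0, 0, cur, &best_cost, best);
    return best_cost;
}
```

With the made-up costs {4, 1, 3}, {2, 0, 5}, {3, 2, 2} for three operations and three units, the cheapest binding is operation 0 to unit 1, operation 1 to unit 0 and operation 2 to unit 2, with total cost 5.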

Binding

• Bind the following scheduled program
  [Figure: scheduled program spanning State 0 to State 3.]
• Resource sharing: requires 3 multipliers
• The program is bound one cycle at a time. Numbers shown beside the three functional units after each step:

  Step                     Functional units
  Bind the first cycle     1, 1, 1
  Bind the second cycle    2, 2, 1
  Bind the third cycle     2, 2, 2
  Bind the fourth cycle    3, 2, 2

• Required multiplexing at the functional units: 3, 2, 2

High-Level Synthesis: Challenges

• Easy to extract instruction-level parallelism using dependencies within a basic block
• But C code is inherently sequential and it is difficult to extract higher-level parallelism
• Coarse-grained parallelism:
  – function pipelining
• Fine-grained parallelism:
  – loop pipelining

Loop Pipelining


Motivating Example

    for (int i = 0; i < N; i++) {
        sum[i] = a + b + c + d;
    }

[Figure: the three additions form a chain: a + b in cycle 1, + c in cycle 2, + d in cycle 3.]

• Cycles: 3N
• Adders: 3
• Utilization: 33%

Loop Pipelining

Cycle    1  2  3  4  5  …  N  N+1  N+2
i=0      +  +  +
i=1         +  +  +
i=2            +  +  +
…
i=N-2                   +  +  +
i=N-1                      +  +    +

(Steady state: the cycles in which three iterations overlap.)

• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state
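
The two cycle counts follow from a general pattern: without overlap every iteration pays the full depth, while with pipelining only the last iteration does. The helpers below are a sketch of that arithmetic; sequential_cycles and pipelined_cycles are illustrative names, and `ii` is the number of cycles between the starts of successive iterations (1 in this example).

```c
#include <assert.h>

/* Cycles for n iterations of `depth` cycles each, run back to back. */
long sequential_cycles(long n, long depth) {
    assert(n > 0 && depth > 0);
    return n * depth;
}

/* Cycles when a new iteration starts every `ii` cycles: the last
 * iteration starts after (n - 1) * ii cycles and needs `depth` more. */
long pipelined_cycles(long n, long depth, long ii) {
    assert(n > 0 && depth > 0 && ii > 0);
    return (n - 1) * ii + depth;
}
```

For N = 100 and a depth of 3 additions, sequential_cycles gives 300 (3N) and pipelined_cycles with ii = 1 gives 102 (N+2).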

Loop Pipelining Example

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

• Each iteration requires:
  – 2 loads from memory
  – 1 store
• No dependencies between iterations

• Cycle latency of operations:
  – Load: 2 cycles
  – Store: 1 cycle
  – Add: 1 cycle
• Single memory port

LLVM Instructions

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
    %scevgep5 = getelementptr %b, %i.04
    %0 = load %scevgep5
    %scevgep6 = getelementptr %c, %i.04
    %1 = load %scevgep6
    %2 = add nsw i32 %1, %0
    %scevgep = getelementptr %a, %i.04
    store %2, %scevgep
    %3 = add %i.04, 1
    %exitcond = eq %3, 100
    br %exitcond, %bb2, %bb


Scheduling LLVM Instructions

[Figure: cycle-by-cycle schedule of the loop’s LLVM instructions, with a memory port conflict highlighted.]

Loop Pipelining Example

• Initiation Interval (II)
  – Constant time interval between starting successive iterations of the loop
• The loop requires 6 cycles per iteration (II=6)
• Can we do better?

Minimum Initiation Interval

• Resource minimum II:
  – Due to limited # of functional units
  – ResMII = (uses of functional unit) / (# of functional units)
• Recurrence minimum II:
  – Due to loop carried dependencies
• Minimum II = max(ResMII, RecMII)
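
The two bounds combine as shown in this small sketch. Rounding ResMII up to a whole number of cycles is an assumption added here, since the formula above is written as a plain fraction; res_mii and min_ii are illustrative names.

```c
#include <assert.h>

/* Resource minimum II for one resource type: uses per iteration divided
 * by the number of units, rounded up (assumption: an II is a whole
 * number of cycles). */
int res_mii(int uses, int units) {
    assert(uses >= 0 && units > 0);
    return (uses + units - 1) / units;
}

/* Minimum II = max(ResMII, RecMII). */
int min_ii(int res, int rec) {
    return res > rec ? res : rec;
}
```

For the a[i] = b[i] + c[i] loop, each iteration uses the single memory port three times (2 loads and 1 store), so res_mii(3, 1) is 3; with no loop carried dependencies, min_ii(3, 0) is also 3.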

Resource Constraints

• Assume unlimited functional units (adders, …)
• Only constraint: single-ported memory controller
• Reservation table: [Figure: reservation table]
• The resource minimum initiation interval is 3

Iterative Modulo Scheduling

• There are no loop carried dependencies, so Minimum II = ResMII = 3
• Iterative: it is not always possible to schedule the loop for the minimum II

[Figure: flowchart]
1. Set II = minII.
2. Attempt to modulo schedule the loop with II.
3. On failure, set II = II + 1 and repeat step 2. On success, stop.

Iterative Modulo Scheduling

• Operations in the loop that execute in cycle: i
• Must also execute in cycles: i + k*II, for k = 0 to N-1
• Therefore, to detect resource conflicts, look in the reservation table under cycle: (i-1) mod II + 1
• Hence the name “modulo scheduling”

New Pipelined Schedule


Modulo Reservation Table

• Store couldn’t be scheduled in cycle 6
• Slot = (6-1) mod 3 + 1 = 3
• Already taken by an earlier load
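
The slot lookup can be sketched as a tiny modulo reservation table for the single memory port. The names mrt_t, mrt_init, mrt_slot and mrt_reserve and the limit MAX_II are illustrative, and the cycle numbers used in the note after the code are assumptions chosen to be consistent with the slides, not the slide's actual schedule.

```c
#include <assert.h>
#include <string.h>

#define MAX_II 16  /* assumed upper bound on the II being tried */

typedef struct {
    int ii;                /* initiation interval being attempted */
    int busy[MAX_II + 1];  /* busy[s] != 0 if slot s (1..ii) is taken */
} mrt_t;

void mrt_init(mrt_t *m, int ii) {
    assert(ii >= 1 && ii <= MAX_II);
    memset(m, 0, sizeof *m);
    m->ii = ii;
}

/* Slot used by an operation in `cycle` (1-based): (cycle-1) mod II + 1. */
int mrt_slot(const mrt_t *m, int cycle) {
    assert(cycle >= 1);
    return (cycle - 1) % m->ii + 1;
}

/* Try to reserve the memory port in `cycle`.
 * Returns 1 on success, 0 on a resource conflict. */
int mrt_reserve(mrt_t *m, int cycle) {
    int s = mrt_slot(m, cycle);
    if (m->busy[s]) return 0;
    m->busy[s] = 1;
    return 1;
}
```

Assuming the two loads issue in cycles 1 and 3 with II = 3, they occupy slots 1 and 3. A store attempted in cycle 6 maps to slot 3 and is rejected, cycle 7 maps to slot 1 and is rejected as well, and cycle 8 maps to the free slot 2.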

Iterative Modulo Scheduling

• Now we have a valid schedule for II=3
• We need to construct the loop kernel, prologue, and epilogue
• The loop kernel is what is executed when the pipeline is in steady state
  – The kernel is executed every II cycles
• First we divide the schedule into stages of II cycles each

Pipeline Stages

[Figure: the schedule divided into pipeline stages 1, 2 and 3.]

Pipelined Loop Iterations

Each column is one step of 3 cycles (one stage):

Step       1     2     3     4     5     6     7
Stage 1    i=0   i=1   i=2   i=3   i=4
Stage 2          i=0   i=1   i=2   i=3   i=4
Stage 3                i=0   i=1   i=2   i=3   i=4

Steps 1 and 2 form the prologue, steps 3 to 5 the kernel (steady state), and steps 6 and 7 the epilogue.

Loop Dependencies

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[j] = b[i] + a[j-1];   /* a[j-1] depends on the previous iteration */

• May cause non-zero recurrence min II.
• Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies

Limitations and Current Research


LegUp HLS Limitations

• HLS will likely do better for datapath-oriented parts of a design.

• Results likely quite sensitive to how loops are structured in your C code.

• Difficult for HLS to “beat” optimized structured HW design.


FPGA/Altera-Specific Aspects of LegUp

• Memory
  – On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)
• IP cores
  – Divider, floating point units
• On-chip SOC interconnect
  – Avalon interface
• LegUp-generated Verilog is fairly FPGA-agnostic:
  – Not difficult to migrate to target ASICs

Current Research Work

• Impact of compiler optimizations on HLS
• Enhanced parallel accelerator support
  – Combining Pthreads+OpenMP
• Smaller processor
• Improved loop pipelining
• Software fallback for bitwidth-optimized accelerators
• Enhanced GUI to display CDFG connected with the schedule

Current Work: PCIe Support

• Enable use of LegUp-generated accelerators in an HPC environment
  – Communicating with an x86 processor via PCIe
• Message passing or memory transfers
  – Software API for fpga_malloc, fpga_free, send, receive
• DE4 / Stratix IV support in next LegUp release

On to the Labs!