A Code Refinement Methodology for Performance-Improved Synthesis from C

Greg Stitt, Frank Vahid*, Walid Najjar

Department of Computer Science and Engineering

University of California, Riverside

*Also with the Center for Embedded Computer Systems, UC Irvine

This research is supported in part by the National Science Foundation and the Semiconductor Research Corporation

2/23

Introduction

[Figure: The H.264 decoder functions (motionComp(), filterLuma(), filterChroma(), deblocking(), . . .) are compiled for the microprocessor (uP); selected critical regions, such as motionComp() and deblocking(), are synthesized for the FPGA.]

Previous work: In-depth hw/sw partitioning study of H.264 decoder

Collaboration with Freescale

3/23

Introduction

[Figure: Speedup (0 to 10) versus Number of Functions in Hardware (1 to 52). Series: Ideal Speedup (Zero-time Hw Execution), Speedup from C Partitioning.]

Large gap between ideal and actual speedup
Obtained 2.5x speedup

4/23

Introduction

Noticed coding constructs/practices limited hw speed
Identified problematic coding constructs
Developed simple coding guidelines
  Dozens of lines of code
  Minutes per guideline
Refined critical regions using guidelines

[Figure: Guidelines are applied to the critical regions motionComp() and deblocking(), yielding motionComp'() and deblocking'(); filterLuma() and filterChroma() are unchanged. The refined code is then passed to hw/sw partitioning.]

5/23

Introduction

[Figure: Speedup (0 to 10) versus number of functions in hardware (1 to 51). Series: Ideal Speedup (Zero-time Hw Execution), Speedup After Rewrite (C Partitioning), Speedup from C Partitioning.]

Simple guidelines increased speedup to 6.5x

Can simple coding guidelines show similar improvements on other applications?

6/23

Coding Guidelines

Analyzed dozens of benchmarks
Identified common problems related to synthesis
Developed 10 guidelines to fix problems
Although some are well known, analysis shows they are rarely applied
Automation unlikely or impossible in many cases

Coding Guidelines
  Conversion to Constants (CC)
  Conversion to Fixed Point (CF)
  Conversion to Explicit Data Flow (CEDF)
  Conversion to Explicit Memory Accesses (CEMA)
  Function Specialization (FS)
  Constant Input Enumeration (CIE)
  Loop Rerolling (LR)
  Conversion to Explicit Control Flow (CECF)
  Algorithmic Specialization (AS)
  Pass-By-Value Return (PVR)

7/23

Fast Refinement

Several dozen lines of code provide most performance improvement
Refining takes minutes/hours

[Figure: Sample application consisting of Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), . . .]

Profiling Results

  Name          %Time
  IDCT          65%
  Compress      11%
  FIR           10%
  Dequantize    3%
  Sort          2%
  Memset        1%
  Matrix        1%
  Search        0.5%
  ReadInput     0.5%
  WriteOutput   0.5%
  Brev          0.5%
  . . .         . . .

Only several performance critical regions
Apply guidelines to only the critical regions
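The profile also bounds how much speedup partitioning can deliver. The following is a minimal sketch, assuming that the "ideal speedup" plotted in the earlier charts means the regions moved to hardware execute in zero time, so the bound is 1 / (1 - fraction of time moved). The function name `ideal_speedup` and the percentage-array layout are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Ideal speedup when the first n profiled regions move to hardware and
 * are assumed to run in zero time. pct_time[] holds each region's share
 * of total software execution time, in percent.
 * Assumption: the regions do not overlap and sum to less than 100%. */
double ideal_speedup(const double pct_time[], size_t n)
{
    double moved = 0.0;
    for (size_t i = 0; i < n; i++)
        moved += pct_time[i];
    assert(moved < 100.0);   /* some software time must remain */
    return 100.0 / (100.0 - moved);
}
```

With the sample profile, moving only IDCT (65%) gives a bound of roughly 2.9x, while moving IDCT, Compress, and FIR (86% in total) raises it to roughly 7.1x, which is why only the few critical regions are worth refining.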

8/23

Conversion to Constants (CC)

Problem: Arrays of constants commonly not specified as constants
  Initialized at runtime
Guideline: Use constant wrapper function
  Specifies array constant for all future functions
Automation
  Difficult, requires global def-use/alias analysis

Original code:

    int coef[100];
    void initCoef() {
      // initialize coef
    }
    void fir() {
      // fir filter using coef
    }
    void f() {
      initCoef();
      // other code
      fir();
    }

Refined code:

    int coef[100];
    void initCoef() {
      // initialize coef
    }
    void fir(const int array[100]) {
      // fir filter using const array
    }
    void constWrapper(const int array[100]) {
      prefetchArray( array );
      // misc code . . .
      fir(array);
    }
    void f() {
      initCoef();
      constWrapper(coef);
    }

Can also enable constant folding
Array can't change, prefetching won't violate dependencies
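As a compilable illustration of the constant-wrapper idea, here is a minimal sketch. The array size `NUM_TAPS`, the placeholder coefficient values, the `sample` parameter, and the return values are assumptions made so the example is self-contained; the slides elide the filter body and use a 100-element array.

```c
#include <assert.h>

#define NUM_TAPS 4   /* hypothetical size; the slide uses 100 */

int coef[NUM_TAPS];

/* Coefficients are still initialized at run time, as in the original code. */
void initCoef(void)
{
    for (int i = 0; i < NUM_TAPS; i++)
        coef[i] = i + 1;   /* placeholder values */
}

/* The const qualifier states that fir() never writes the coefficients. */
int fir(const int array[NUM_TAPS], const int sample[NUM_TAPS])
{
    assert(array && sample);
    int acc = 0;
    for (int i = 0; i < NUM_TAPS; i++)
        acc += array[i] * sample[i];
    return acc;
}

/* Wrapper: every function called from here sees the array as constant. */
int constWrapper(const int array[NUM_TAPS], const int sample[NUM_TAPS])
{
    return fir(array, sample);
}

int f(const int sample[NUM_TAPS])
{
    initCoef();                          /* last write to coef */
    return constWrapper(coef, sample);   /* read-only from here on */
}
```

The design point is that all writes to `coef` happen before the wrapper is entered, so a tool analyzing only `constWrapper()` and its callees can treat the array as read-only without whole-program analysis.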

9/23

Conversion to Explicit Data Flow (CEDF)

Problem: Global variables make determination of parallelism difficult
  Requires global def-use/alias analysis
Guideline: Replace globals with extra parameters
  Makes data flow explicit
  Simpler analysis may expose parallelism
Automation
  Been proposed [Lee01]
  But, difficult because of aliases

Original code:

    int array[100];
    void a() {
      for (int i=0; i < 100; i++)
        array[i] = . . . . . ;
    }
    void b() {
      for (int i=0; i < 100; i++)
        array[i] = array[i]+f(i);
    }
    int c() {
      int temp = 0;
      for (int i=0; i < 100; i++)
        temp += array[i];
      return temp;
    }
    void d() {
      for (. . . . . ) {
        a(); b(); c();
      }
    }

a(), b(), c() must execute sequentially because of global array dependencies

Refined code:

    void a(int array[100]) {
      for (int i=0; i < 100; i++)
        array[i] = . . . . . ;
    }
    void b(int array1[100], int array2[100]) {
      for (int i=0; i < 100; i++)
        array2[i] = array1[i]+f(i);
    }
    int c(int array[100]) {
      int temp = 0;
      for (int i=0; i < 100; i++)
        temp += array[i];
      return temp;
    }
    void d() {
      int array1[100], array2[100];
      for (. . . . . ) {
        a(array1);
        b(array1, array2);
        c(array2);
      }
    }

a() and c() can execute in parallel after 1st iteration
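A self-contained version of the refined code is sketched below so the data flow can be checked by hand. The array size `N`, the `seed` parameter, the `2 * i` term standing in for `f(i)`, and the accumulated return value of `d()` are all assumptions; the slides elide these details.

```c
#include <assert.h>

#define N 8   /* hypothetical size; the slide uses 100 */

/* Producer: writes only its parameter. */
void a(int out[N], int seed)
{
    for (int i = 0; i < N; i++)
        out[i] = seed + i;   /* placeholder for the elided computation */
}

/* Transformer: reads in[], writes out[]; the two must be distinct arrays. */
void b(const int in[N], int out[N])
{
    assert(in != out);
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 2 * i;   /* 2 * i stands in for f(i) */
}

/* Consumer: reads only its parameter. */
int c(const int in[N])
{
    int temp = 0;
    for (int i = 0; i < N; i++)
        temp += in[i];
    return temp;
}

/* Every array dependence is now visible in the argument lists. */
int d(int iterations)
{
    int array1[N], array2[N];
    int total = 0;
    for (int k = 0; k < iterations; k++) {
        a(array1, k);         /* writes array1 */
        b(array1, array2);    /* reads array1, writes array2 */
        total += c(array2);   /* reads array2 */
    }
    return total;
}
```

Because `a()` touches only `array1` and `c()` touches only `array2`, the call to `a()` for iteration k+1 has no dependence on the call to `c()` for iteration k, which is the overlap the slide refers to.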

10/23

Constant Input Enumeration (CIE)

Problem: Function parameters may limit parallelism
Guideline: Create enum for possible values
  Synthesis can create specialized functions
Automation
  In some cases, def-use analysis may identify all inputs
  In general, difficult due to aliases

Original code:

    void f(int a, int b) {
      . . . .
      for (i=0; i < a; i++) {
        for (j=0; j < b; j++) {
          c[i][j]=i+j;
        }
      }
    }

Bounds not known, hard to unroll
[Figure: A single adder computes c[i][j] from i and j, one iteration at a time.]

Refined code:

    enum PRM { VAL1=2, VAL2=4 };
    void f(enum PRM a, enum PRM b) {
      . . . .
      for (i=0; i < a; i++) {
        for (j=0; j < b; j++) {
          c[i][j]=i+j;
        }
      }
    }

Specialized Versions: f(2,2), f(2,4), f(4,2), f(4,4)
Iterations can be parallelized in each version
[Figure: Separate adders compute c[0][0], c[0][1], c[0][2], . . . in parallel.]
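The enumeration can be written as compilable C as follows. This is a sketch under assumptions: the slides leave `c` as an implicit global, so passing it as a parameter and sizing it with `MAX_DIM` are illustrative choices, and the assertions only document the intended input set.

```c
#include <assert.h>

enum PRM { VAL1 = 2, VAL2 = 4 };   /* the only legal loop bounds */

#define MAX_DIM 4   /* largest enumerated value (assumption) */

/* Fill c[i][j] = i + j for the enumerated bounds a and b.
 * With only two legal values per parameter, a tool needs at most four
 * fixed-bound variants, each of which can be fully unrolled. */
void f(enum PRM a, enum PRM b, int c[MAX_DIM][MAX_DIM])
{
    assert(a == VAL1 || a == VAL2);
    assert(b == VAL1 || b == VAL2);
    for (int i = 0; i < (int)a; i++)
        for (int j = 0; j < (int)b; j++)
            c[i][j] = i + j;
}
```

Note that C itself does not restrict an `enum PRM` argument to the listed values; the enum communicates intent to the synthesis tool and to readers, which is why the sketch also checks the values at run time.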

11/23

Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control flow analysis
Guideline: Replace function pointer with if-else, static calls
  Makes possible targets explicit
Automation
  In general, is impossible
  Equivalent to halting problem

Original code:

    void f( int (*fp) (int) ) {
      . . . . .
      for (i=0; i < 10; i++) {
        a[i] = fp(i);
      }
    }

Synthesis unlikely to determine possible targets of function pointer
[Figure: Synthesized hardware with an unknown block ("?") producing a[i].]

Refined code:

    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      . . . . .
      for (i=0; i < 10; i++) {
        if (fp == FUNC1)
          a[i] = f1(i);
        else if (fp == FUNC2)
          a[i] = f2(i);
        else
          a[i] = f3(i);
      }
    }

[Figure: Synthesized hardware in which f1(i), f2(i), and f3(i) feed a 3x1 multiplexer, selected by fp, that produces a[i].]
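A runnable version of the refined code is sketched below. The bodies of `f1`, `f2`, and `f3` are hypothetical stand-ins, since the slides do not define them, and `a` is passed as a parameter here rather than being an implicit global.

```c
#include <assert.h>

enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical targets; the slides do not define f1, f2, f3. */
int f1(int i) { return i + 1; }
int f2(int i) { return 2 * i; }
int f3(int i) { return i * i; }

/* All possible callees are named statically, so a tool can instantiate
 * each one and select among the results with the value of fp. */
void f(enum Target fp, int a[10])
{
    assert(fp == FUNC1 || fp == FUNC2 || fp == FUNC3);
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)
            a[i] = f1(i);
        else if (fp == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

Callers that previously passed `&f1` now pass `FUNC1`, so the change is local to the call sites of `f()`.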

12/23

Algorithmic Specialization (AS)

Algorithms targeting sw may not be fast in hw
  Sequential vs. parallel
C code generally uses sw algorithms
Guideline: Specialize critical functions with hw algorithms
Automation
  Requires higher level specification
  Intrinsics

Original code:

    int search(int a[], int k, int l, int r) {
      while (l <= r) {
        int mid = (l+r)/2;
        if (k > a[mid]) l = mid+1;
        else if (k < a[mid]) r = mid-1;
        else return mid;
      }
      return -1;
    }

Refined code:

    int search(int a[], int k, const int s) {
      for (int i=0; i < s; i++) {
        if (a[i] == k) return i;
      }
      return -1;
    }

Can be parallelized in hardware
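The two search variants can be placed side by side to confirm that they return the same index on a sorted array with distinct elements. This is a sketch; the names `search_sw` and `search_hw` are introduced here only so both versions can coexist in one file, whereas the slides call both `search`.

```c
#include <assert.h>

/* Software-oriented: binary search over a sorted array. It needs few
 * comparisons, but each one depends on the result of the previous one. */
int search_sw(const int a[], int k, int l, int r)
{
    while (l <= r) {
        int mid = l + (r - l) / 2;   /* avoids overflow of l + r */
        if (k > a[mid])
            l = mid + 1;
        else if (k < a[mid])
            r = mid - 1;
        else
            return mid;
    }
    return -1;
}

/* Hardware-oriented: every comparison is independent of the others,
 * so with a known bound s the loop can be unrolled into s comparators. */
int search_hw(const int a[], int k, int s)
{
    assert(s >= 0);
    for (int i = 0; i < s; i++) {
        if (a[i] == k)
            return i;
    }
    return -1;
}
```

In software the linear version does more work per lookup, which is the overhead the methodology weighs before applying AS.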

13/23

Pass-By-Value Return (PVR)

Problem: Array parameters cannot be prefetched due to potential aliases
  Designer may know aliases don't exist
Guideline: Use pass-by-value-return
Automation
  Requires global alias analysis

Original code:

    void f(int *a, int *b, int array[16]) {
      … // unrelated computation
      g(array);
      … // unrelated computation
    }
    int g(int array[16]) {
      // computation done on array
    }

Can't prefetch array for g(), may be aliased

Refined code:

    void f(int *a, int *b, int array[16]) {
      int localArray[16];
      memcpy(localArray, array, 16*sizeof(int));
      … // misc computation
      g(localArray);
      … // misc computation
      memcpy(array, localArray, 16*sizeof(int));
    }
    int g(int array[16]) {
      // computation done on array
    }

Local array can't be aliased, can prefetch
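The copy-in/copy-out pattern can be exercised with the following sketch. The body of `g()`, the increments of `*a` and `*b` that stand in for the elided computation, and the `int` return value of `f()` are assumptions made for testability; the slide's `f()` returns `void`.

```c
#include <assert.h>
#include <string.h>

/* Placeholder computation on the array; the slides elide g()'s body. */
int g(int array[16])
{
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        array[i] *= 2;
        sum += array[i];
    }
    return sum;
}

int f(int *a, int *b, int array[16])
{
    assert(a && b && array);
    int localArray[16];
    memcpy(localArray, array, 16 * sizeof(int));   /* pass by value */
    *a += 1;                                       /* stand-in for misc computation */
    int result = g(localArray);                    /* local copy cannot alias *a or *b */
    *b += 1;                                       /* stand-in for misc computation */
    memcpy(array, localArray, 16 * sizeof(int));   /* return */
    return result;
}
```

The transformation preserves behavior only when `a` and `b` do not point into `array`: a write through an aliasing pointer between the two copies would be overwritten by the copy back. That is the knowledge the slide says the designer may have and the tool cannot easily derive.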

14/23

Why Synthesis From C?

Why not use HDL?
  HDL may yield better results
C is mainstream language
  Acceptable performance in many cases
  Learning HDL is large overhead
Approaches are orthogonal
  This work focuses on improving mainstream
Guidelines common for HDL
  Can also be applied to algorithmic HDL

15/23

Software Overhead

Refined regions may not be partitioned to hardware

Partitioner may select non-refined regions

OS may select software or hardware implementation

Based on state of FPGA

Coding guidelines have potential software overhead

[Figure: After hw/sw partitioning of motionComp'(), filterLuma(), filterChroma(), deblocking'(), . . ., functions are divided between the uP and the FPGA; the refined functions motionComp'() and deblocking'() may end up on the uP.]

Problem: Refined code mapped to software

16/23

Refinement Methodology

Considerations
  Reduce software overhead
  Reduce refinement time
Methodology
  Profile
  Iterative-improvement
  Determine critical region
  Apply all except PVR/AS
    Minimal overhead
  Apply PVR if overhead acceptable
  Apply AS if known algorithm and overhead acceptable

[Flowchart:
  1. Profile
  2. Determine Critical Region
  3. Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR
  4. Is overhead of copying array acceptable? If yes, apply PVR
  5. Does suitable hw algorithm exist and have acceptable sw performance? If yes, apply AS
  6. Repeat until performance acceptable]
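The decision part of the flowchart can be sketched as a small selection function. The bit-flag encoding and the names `select_guidelines` and `has_guideline` are illustrative assumptions; the slides describe a manual process carried out by the designer, not a tool.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per guideline, using the abbreviations from the slides. */
enum Guideline {
    CC   = 1 << 0, CF   = 1 << 1, CEMA = 1 << 2, CIE = 1 << 3,
    CEDF = 1 << 4, CECF = 1 << 5, FS   = 1 << 6, LR  = 1 << 7,
    PVR  = 1 << 8, AS   = 1 << 9
};

/* Guidelines to apply to one critical region in one iteration.
 * The eight low-overhead guidelines are always applied; PVR and AS
 * are added only when the designer accepts their software cost. */
unsigned select_guidelines(bool copy_overhead_ok, bool hw_algorithm_ok)
{
    unsigned set = CC | CF | CEMA | CIE | CEDF | CECF | FS | LR;
    if (copy_overhead_ok)
        set |= PVR;
    if (hw_algorithm_ok)
        set |= AS;
    return set;
}

/* True if guideline g (a single flag) is in the set. */
bool has_guideline(unsigned set, enum Guideline g)
{
    unsigned bit = (unsigned)g;
    assert(bit != 0 && (bit & (bit - 1)) == 0);   /* exactly one bit set */
    return (set & bit) != 0;
}
```

The outer loop of the methodology (profile, pick the next critical region, repeat until performance is acceptable) stays with the designer and is not modeled here.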

17/23

Experimental Setup

Benchmark suite
  MediaBench, Powerstone
Manually applied guidelines
  1-2 hours
  23 additional lines/benchmark, on average
Target Architecture
  Xilinx Virtex-II FPGA with ARM9 uP
Hardware/software partitioning
  Selects critical regions for hardware
Synthesis
  High-level synthesis tool
    ~30,000 lines of C code
    Outputs register-transfer level (RTL) VHDL
  RTL Synthesis using Xilinx ISE
Compilation
  Gcc with -O1 optimizations

[Figure: Tool flow. Benchmarks -> Manual Refinement -> Refined Code -> Hw/Sw Partitioning -> Compilation (Sw, for the ARM9) and Synthesis (Hw, bitfile for the Virtex-II).]

18/23

Speedups from Guidelines

[Figure: Bar chart of speedup (0 to 20) for g3fax. Series: Sw, Hw/sw with original code, Hw/sw with guidelines.]

g3fax
  No guidelines
    Speedup: 2x
  Guidelines applied: Conversion to constants, Explicit Dataflow + Algorithmic Specialization
    Speedup: 3.6x
    Total Time: 5 minutes

19/23

[Figure: Speedups from Guidelines. Bar chart of speedup (0 to 20) for g3fax and crc. Series: Sw, Hw/sw with original code, Hw/sw with guidelines.]

crc
  No Guidelines
    Speedup: 8.6x
  Guidelines applied: Conversion to Constants, Input Enumeration, Algorithmic Specialization
    Speedup: 19x
    Time: 30 minutes
    Sw Overhead: 6000%

20/23

Speedups from Guidelines

[Figure: Bar chart of speedup (0 to 20) for g3fax, crc, jpeg, brev, fir, mpeg2. Series: Sw, Hw/sw with original code, Hw/sw with guidelines. Off-scale bars are labeled 573 and 842.]

Original code
  Speedups range from 1x (no speedup) to 573x
  Average: 2.6x (excludes brev)
Refined code with guidelines
  Average: 8.4x (excludes brev)
  3.5x average improvement compared to original code

21/23


Guidelines move speedups closer to ideal
  Almost identical for mpeg2, fir
Several examples still far from ideal
  May imply new guidelines needed

[Figure: Bar chart of speedup (0 to 20) for g3fax, crc, jpeg, brev, fir, mpeg2. Series: Sw, Hw/sw with original code, Hw/sw with guidelines, Ideal. Off-scale bar labels as extracted: 84267 1090.]

22/23

Guideline SW Overhead/Improvement

Average Sw performance overhead: -15.7% (improvement)
  -1.1% excluding brev
  3 examples improved
Average Sw size overhead (lines of C code)
  8.4% excluding brev

[Figure: Bar chart of Performance Overhead and Size Overhead (-30% to 30%; positive values are overhead, negative values are improvement) for g3fax, mpeg2, jpeg, brev, fir, crc. Off-scale bars are labeled -88% and -47%.]

23/23

Summary

Simple coding guidelines significantly improve synthesis from C
  3.5x speedup compared to Hw/Sw synthesized from unrefined code
Major rewrites may not be necessary
  Between 1-2 hours
Refinement Methodology
  Reduces software size/performance overhead
  In some cases, improvement
Future Work
  Test on commercial synthesis tools
  New guidelines for different domains
