Automatic Generation of Efficient Accelerator Designs for ... Talks/Raghu_Prabhakar.pdf · DRAM A B Design Space Example: Dot Product ... VHDL Verilog LegUp VivadoHLS OpenCLSDK Aladdin

Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware

Raghu Prabhakar, Stanford University

Pervasive Parallelism Lab

The Team

Stefan Hadjis, Tian Zhao, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun, David Koeplinger, Yaqi Zhang, Matt Feldman

FPGAs in Data Centers
- Increasing interest in the use of FPGAs as application accelerators in data centers

Key advantage: Performance/Watt

Problem #1: Programmability
- Verilog and VHDL are too low level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, pragmas requiring hardware knowledge
  - Limited in exploiting data locality
  - Difficult to synthesize complex data paths with nested parallelism

Problem #2: Large Design Spaces
- Design spaces grow exponentially with the number of parameters
- Parameters can change runtime by orders of magnitude
- Parameters depend on each other
- Manual exploration is tedious and suboptimal
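To make the growth concrete, here is a minimal sketch that counts design points. The parameter ranges and the divisibility rule are hypothetical, chosen only for illustration; they are not taken from the talk.

```scala
object DesignSpaceSize {
  // Hypothetical parameter ranges for a small accelerator (illustrative only).
  val tileSizes: Seq[Int] = Seq(64, 96, 128, 256, 512)
  val parFactors: Seq[Int] = Seq(1, 2, 3, 4, 8)
  val pipelineToggles: Seq[Boolean] = Seq(false, true)

  // The raw space is the Cartesian product of all parameter ranges,
  // so its size is the product of the range lengths.
  def spaceSize(rangeLengths: Seq[Int]): BigInt =
    rangeLengths.map(BigInt(_)).product

  // Parameters depend on each other: in this sketch a parallelization
  // factor is only legal if it divides the tile size evenly.
  def legalPoints: Seq[(Int, Int, Boolean)] =
    for {
      t <- tileSizes
      p <- parFactors
      if t % p == 0
      m <- pipelineToggles
    } yield (t, p, m)
}
```

With these ranges the raw space has 5 x 5 x 2 = 50 points, of which 42 satisfy the constraint. Ten parameters with four options each already give 4^10 = 1,048,576 points.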

Our Approach

[Diagram: compilation flow]
Parallel Patterns
  -> Pattern Transformations (Tiling)
  -> Tiled Parallel Patterns
  -> Hardware Generation (Metapipeline Analysis)
  -> Spatial
  -> Design Space Exploration (Latency, Area Estimation)
  -> MaxJ
  -> Bitstream Generation
  -> FPGA Configuration

Generating Configurable Hardware from Parallel Patterns, ASPLOS '16
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun

Automatic Generation of Efficient Accelerators for Reconfigurable Hardware, ISCA '16
David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun

Parallel Patterns
- Constructs with special properties with respect to parallelism and memory access

[Figure: the four patterns map, zip, reduce, and groupBy (with buckets key1, key2, key3)]

Parallel Patterns: Map

val vectorB = vectorA map { v => v + 1 }

Perform the given function on the i'th element of the input.
The result forms the i'th element of the output.

[Figure: input 3 8 1 4 2 6 5 1, with +1 applied to every element]

Parallel Patterns: Zip

val vectorC = vectorA zip (vectorB) { _ + _ }

Perform the given function on the i'th element of the inputs.
The result forms the i'th element of the output.

[Figure: inputs 3 8 1 4 2 and 5 9 4 3 2, added element by element]

Parallel Patterns: Reduce

val sum = vectorA reduce { (a,b) => a + b }

Combine the elements of the input using the given function.
Note: the function must be associative.

[Figure: input 3 8 1 4 combined by a two-level tree of adders]
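The associativity requirement follows from the tree shape of the reduction: the hardware combines neighbouring pairs rather than folding left to right. A minimal sketch of such a pairwise reduction in plain Scala (the helper name `treeReduce` is hypothetical):

```scala
object TreeReduce {
  // Pairwise (tree-shaped) reduction, the shape an adder tree takes.
  // Requires a non-empty input.
  @annotation.tailrec
  def treeReduce[A](xs: Vector[A])(f: (A, A) => A): A = {
    require(xs.nonEmpty, "cannot reduce an empty collection")
    if (xs.length == 1) xs.head
    else {
      // Combine neighbours; an odd element at the end is carried over unchanged.
      val next = xs.grouped(2).map {
        case Vector(a, b) => f(a, b)
        case Vector(a)    => a
      }.toVector
      treeReduce(next)(f)
    }
  }
}
```

For addition, `treeReduce(Vector(3, 8, 1, 4))(_ + _)` agrees with a left-to-right reduce. For subtraction, which is not associative, the tree computes (3 - 8) - (1 - 4) while a left-to-right reduce computes ((3 - 8) - 1) - 4, so the results differ.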

Parallel Patterns: GroupBy

val bins = vectorA groupBy { v => v/3 }

Group the elements of the given collection based upon the given function.

[Figure: input 3 8 1 4 2 6 5 1; the key v / 3 is computed for every element and the elements are collected into buckets for keys 0, 1, and 2]

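Parallel Patterns: Filter

val vectorB = vectorA filter { v => v % 2 == 1 }

Produce a collection (here empty or a single element) for the i'th element of the input.
The output is the concatenation of all produced collections.

[Figure: input 3 8 1 4 2 6 5 1; the test v % 2 is applied to every element; output 3 1 5 1]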
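The description on this slide is that of a flat map: every input element yields a small collection and the results are concatenated. A short sketch in plain Scala collections (the helper name `filterViaFlatMap` is hypothetical) shows filter expressed that way:

```scala
object FilterAsFlatMap {
  val vectorA: Vector[Int] = Vector(3, 8, 1, 4, 2, 6, 5, 1)

  // Each element yields a collection of zero or one values;
  // flatMap concatenates those collections in input order.
  def filterViaFlatMap[A](xs: Vector[A])(keep: A => Boolean): Vector[A] =
    xs.flatMap(v => if (keep(v)) Vector(v) else Vector.empty[A])

  val odds: Vector[Int] = filterViaFlatMap(vectorA)(v => v % 2 == 1)
}
```

Here `odds` is `Vector(3, 1, 5, 1)`, the same result as `vectorA.filter(_ % 2 == 1)`.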

K-means: Parallel Patterns

// Group points by closest centroid:
val groups = points groupBy { point =>
  // Compute the squared distance to each centroid
  val dists = guesses map { guess =>
    guess.zip(point){ (a,b) => (a - b) * (a - b) } reduce { (a,b) => a + b }
  }
  // Find the index of the closest centroid
  (0 until dists.length) reduce { (i,j) =>
    if (dists(i) < dists(j)) i else j
  }
}
// Average each group
val newKmeans = groups map { g =>
  val sum = g reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  sum map { a => a / g.size }
}
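The slide code uses the pattern DSL (for example, `zip` taking a combining function). For readers who want to run it, here is a rough equivalent of one k-means iteration in plain Scala collections. The object and method names are hypothetical, and leaving empty clusters unchanged is an assumption of this sketch rather than something the talk specifies.

```scala
object KMeansStep {
  type Point = Vector[Double]

  // Squared Euclidean distance: zip, map, then reduce (sum).
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the point.
  def closest(point: Point, centroids: Vector[Point]): Int = {
    val dists = centroids.map(c => dist2(c, point))
    dists.indices.reduce((i, j) => if (dists(i) <= dists(j)) i else j)
  }

  // One iteration: group points by closest centroid, then average each group.
  // Centroids that attract no points are left unchanged (assumption).
  def step(points: Vector[Point], centroids: Vector[Point]): Vector[Point] = {
    val groups = points.groupBy(p => closest(p, centroids))
    centroids.indices.toVector.map { k =>
      groups.get(k) match {
        case Some(g) =>
          val sum = g.reduce((v1, v2) => v1.zip(v2).map { case (a, b) => a + b })
          sum.map(_ / g.size)
        case None => centroids(k)
      }
    }
  }
}
```

Calling `step` repeatedly until the centroids stop moving gives the usual iterative algorithm.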

Design Space Example: Dot Product

Algorithm: Dot Product of Vectors A and B

[Figure: vectors A and B in DRAM are loaded into scratchpads TileA and TileB on the FPGA; a multiplier and an adder accumulate into register acc. Key: scratchpad, register, operator]

Small and simple, but slow!

Important Parameters: Tile Sizes
- Increases length of DRAM accesses (Runtime)
- Increases exploited spatial locality (Runtime)
- Increases local memory sizes (Area)

Important Parameters: Pipelining
- Overlaps memory and compute (Runtime)
- Increases local memory sizes (Area)
- Adds synchronization logic (Area)

[Figure: the dot product design split into Stage 1 (loading TileA and TileB from DRAM) and Stage 2 (multiply and accumulate into acc). Key: double buffer, register, operator]

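Important Parameters: Parallelization
- Improves element throughput (Runtime)
- Duplicates compute resources (Area)

[Figure: TileA and TileB feed several multipliers in parallel, whose products are combined by a tree of adders into acc. Key: scratchpad, register, operator]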

Hardware Language Requirements

Tools compared: VHDL, Verilog; LegUp, Vivado HLS, OpenCL SDK; Aladdin; Spatial

Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores design space automatically
- Generates synthesizable code

[Table: one column per tool group, one row per requirement]

The Spatial Language
- Includes a variety of parameterized templates
  - Parallel patterns with implicit parallelization factors
  - Pipeline constructs for pipelining at arbitrary levels
  - Explicit size parameters for loop step size and buffer sizes
- All parameters are exposed to the compiler
- Compiler includes latency and area models for quick design evaluation
- Compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration

Dot Product in Spatial: Diagram

[Diagram: A and B in DRAM are loaded into TileA and TileB; an inner Reduce (multiply, add) is nested in an outer Reduce (add) that writes out. Annotated parameters: tile size (B), parallelism factors #1, #2, and #3, and a pipelining toggle]

Dot Product in Spatial

val output = Reg[Float]
val vectorA = OffChipMem[Float](N)
val vectorB = OffChipMem[Float](N)

Reduce(N by B)(output){ i =>
  val tileA = Scratchpad[Float](B)
  val tileB = Scratchpad[Float](B)
  val acc = Reg[Float]
  tileA load vectorA(i :: i+B)
  tileB load vectorB(i :: i+B)

  Reduce(B by 1)(acc){ j =>
    tileA(j) * tileB(j)
  }{a, b => a + b}
}{a, b => a + b}

Annotated parameters: tile size (B), parallelism factors #1, #2, and #3, pipelining toggle

val output = vectorA * vectorB // User's code
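The nested Reduce structure can be emulated in plain Scala to check that tiling does not change the result. This is only a software sketch: `tileSize` plays the role of B, arrays stand in for the off-chip vectors, and the name `TiledDotProduct` is hypothetical. It does not use the Spatial API.

```scala
object TiledDotProduct {
  // Plain-Scala emulation of the two nested Reduce loops.
  def dot(a: Array[Float], b: Array[Float], tileSize: Int): Float = {
    require(a.length == b.length, "vectors must have equal length")
    require(tileSize > 0, "tile size must be positive")
    var output = 0.0f                          // outer accumulator (Reg)
    for (i <- a.indices by tileSize) {         // Reduce(N by B)
      val end = math.min(i + tileSize, a.length)
      val tileA = a.slice(i, end)              // tileA load vectorA(i :: i+B)
      val tileB = b.slice(i, end)              // tileB load vectorB(i :: i+B)
      var acc = 0.0f                           // inner accumulator (Reg)
      for (j <- tileA.indices)                 // Reduce(B by 1)
        acc += tileA(j) * tileB(j)
      output += acc
    }
    output
  }
}
```

In hardware the tile size changes DRAM burst length and scratchpad area, not the value computed. Floating-point rounding can in general differ between tile sizes because the summation order changes.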

Spatial Design Parameters

Type           Example                            Description                       Parameter
Primitives     +, -, *, /, ...                    Basic math, logic, and control    Vector width
               Scratchpad Load, Scratchpad Store  Load/store from on-chip memories  Vector width, stride
Memories       OffChipMem                         N-dimensional off-chip array      Dimensions
               Scratchpad                         On-chip scratchpad                Size, buffering, banking
               Reg                                Accumulator register              Buffering
Controllers    Counter                            Loop indices                      Parallelization, pattern
               Pipe                               Pipelined inner-loop body         Parallelization, pattern
               MetaPipe                           Coarse-grained pipeline           Parallelization, pattern
Data Transfer  TileLoad, TileStore                Load/store from off-chip arrays   Tile size, load rate

Spatial Enables Fast DSE

[Diagram: properties of a Spatial program and what they enable for fast design space exploration]
- Parameterized templates: simple linear models
- Concise IR: fast estimation, no unrolling, no scheduling
- Easily derived space constraints: space pruning, smaller spaces

Latency Modeling
- Analytical model
  - Uses depth-first search to get the critical path of pipelines
  - Accurate estimation requires data size annotations
- Main-memory model
  - Mathematical model fit to observed runtimes
  - Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length
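The critical-path step of the analytical model can be sketched as a longest-path search over a pipeline's dataflow graph. The graph encoding, node names, and latencies below are hypothetical; the talk does not show how the compiler represents them.

```scala
object CriticalPath {
  // Dataflow node: latency in cycles plus the names of its successors.
  final case class Node(latency: Int, succs: List[String])

  // Longest latency from `start` to any sink, by memoised depth-first search.
  // Assumes the graph is acyclic.
  def criticalPath(graph: Map[String, Node], start: String): Int = {
    val memo = scala.collection.mutable.Map.empty[String, Int]
    def dfs(n: String): Int = memo.get(n) match {
      case Some(v) => v
      case None =>
        val node = graph(n)
        val tail = if (node.succs.isEmpty) 0 else node.succs.map(dfs).max
        val total = node.latency + tail
        memo(n) = total
        total
    }
    dfs(start)
  }

  // Hypothetical multiply-accumulate body: two loads feed a multiplier, then an adder.
  val example: Map[String, Node] = Map(
    "start" -> Node(0, List("ldA", "ldB")),
    "ldA"   -> Node(1, List("mul")),
    "ldB"   -> Node(2, List("mul")),
    "mul"   -> Node(3, List("add")),
    "add"   -> Node(2, Nil)
  )
}
```

For the example graph the critical path runs through `ldB`, `mul`, and `add`, for 2 + 3 + 2 = 7 cycles.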

Area Modeling
- Analytical model
  - Simple summation of the area of each template
  - Includes estimates for delay lines, banked memories
- Neural network models
  - Model routing costs and memory duplication
  - Simple, 3-layer networks suffice here (we use 11-6-1)
  - Trained on a set of about 200 characterization designs
- Total area = analytical area + neural net area
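A rough sketch of how the two parts combine, assuming an 11-6-1 fully connected network. The choice of ReLU on the hidden layer and a linear output, and all names below, are assumptions of this sketch; the talk does not give the activation functions, the input features, or the trained weights.

```scala
object AreaModel {
  // Analytical part: sum of per-template area estimates.
  def analyticalArea(templateAreas: Seq[Double]): Double = templateAreas.sum

  // One fully connected layer: out(j) = act(bias(j) + sum_i w(j)(i) * in(i)).
  def layer(in: Vector[Double], w: Vector[Vector[Double]], bias: Vector[Double],
            act: Double => Double): Vector[Double] =
    w.zip(bias).map { case (row, b) =>
      act(b + row.zip(in).map { case (x, y) => x * y }.sum)
    }

  // 11-6-1 network: 11 input features, 6 hidden units, 1 output.
  // ReLU hidden layer and linear output are assumptions.
  def neuralArea(features: Vector[Double],
                 w1: Vector[Vector[Double]], b1: Vector[Double],
                 w2: Vector[Vector[Double]], b2: Vector[Double]): Double = {
    require(features.length == 11 && w1.length == 6 && w2.length == 1)
    val hidden = layer(features, w1, b1, x => math.max(0.0, x))
    layer(hidden, w2, b2, identity).head
  }

  // Total area = analytical area + neural net area.
  def totalArea(templateAreas: Seq[Double], nnArea: Double): Double =
    analyticalArea(templateAreas) + nnArea
}
```

In practice the weights would come from training on the characterization designs; any values passed in here are placeholders.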

Evaluation
- Accuracy: How accurate are the models, compared to observations?

Experimental Setup
- Board:
  - Altera Stratix V
  - 48 GB DDR3 DRAM, 6 memory channels
  - Board connected to host via PCI-e
- Execution time = FPGA execution time
  - Does not include CPU <-> FPGA communication or configuration time

Results: Model Accuracy (Area)

Area models follow important trends and are accurate enough to drive automatic design space exploration.

[Figure: modeled vs. synthesized resource usage (%) of ALMs, BRAMs, and DSPs for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm]

Results: Model Accuracy (Latency)

Latency models follow important trends and are accurate enough to drive automatic design space exploration.

[Figure: average error (%) of the latency model for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm. Bar labels as extracted: 2.8%, 1.3%, 3.1%, 3.4%, 6.7%, 7%, 18.4%]

Evaluation
- Speed: How fast are the predictions, compared to commercial tools?

Results: Prediction Speed

DHDL:

Benchmark        Designs   Search Time
Dot Product        5,426   5.3 ms/design
Outer Product      1,702   30 ms/design
TPCHQ6             5,426   8.2 ms/design
Blackscholes         572   27 ms/design
Matrix Multiply   70,740   11 ms/design
K-Means           75,200   20 ms/design
GDA               42,800   17 ms/design

Vivado HLS:

Benchmark        Designs   Search Time
GDA                  250   1.85 min/design

6533x Speedup Over HLS!

Evaluation
- Space: Do the design parameters help capture an interesting space?

Results: GDA Design Space

[Figure: three scatter plots of cycles (log scale, 10^7 to 10^10) against resource usage (% of maximum) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point]

- Performance limited by available BRAMs
- Space for GDA spans four orders of magnitude
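Picking out the Pareto-optimal designs from a set of estimated points can be sketched as follows. This simplification uses a single area fraction instead of the three resources (ALMs, DSPs, BRAMs) in the plots, and the type and method names are hypothetical.

```scala
object Pareto {
  // A design point summarised by estimated runtime and area; lower is better for both.
  final case class Design(name: String, cycles: Long, area: Double)

  // `a` dominates `b` if it is no worse in both metrics and strictly better in one.
  def dominates(a: Design, b: Design): Boolean =
    a.cycles <= b.cycles && a.area <= b.area &&
      (a.cycles < b.cycles || a.area < b.area)

  // Keep designs that fit on the device and are not dominated by another fitting design.
  def frontier(designs: Seq[Design], maxArea: Double): Seq[Design] = {
    val valid = designs.filter(_.area <= maxArea)
    valid.filter(d => !valid.exists(o => dominates(o, d)))
  }
}
```

This quadratic scan is fine for an illustration; for spaces with tens of thousands of points a sort-based sweep would be the more usual choice.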

Evaluation
- Performance: How good is the best generated design?

Evaluation: Multi-Core Comparison
- FPGA
  - Altera Stratix V (28 nm)
  - 150 MHz clock
  - Peak main memory bandwidth of 37.5 GB/sec
- Multi-core CPU
  - Intel Xeon E5-2630 (32 nm)
  - 2.3 GHz
  - Peak main memory bandwidth of 42.6 GB/sec
  - 6 cores, 6 threads
  - Multi-threaded C++ code generated from Delite
- Execution time = FPGA execution time
  - Does not include CPU <-> FPGA communication or configuration time

Results: Comparison with Multi-Core

[Figure: speedup of the FPGA design over the multi-core CPU, with benchmarks grouped as memory-bound and compute-bound]

Benchmark      Speedup
dotproduct     1.07
outerprod      2.42
tpchq6         1.11
blackscholes   16.73
gda            4.55
kmeans         1.15
gemm           0.10

Gemm uses multi-threaded OpenBLAS on the CPU.

Summary
- Tiling and metapipelining capture locality and nested parallelism
- Spatial captures a large design space
- Fast, accurate estimators and DSE tools enable rapid design space exploration
- Up to 16.7x speedup over multi-core CPU benchmarks
- Up to 6533x faster DSE compared to Vivado HLS

Tiling Transformation 1: Strip Mining
- Transforms a single pattern into a set of nested patterns
  - Strip mined patterns enable computation reordering
- Insert copies for predictable accesses to enhance locality
  - Copies guide creation of on-chip buffers

Parallel Patterns     Strip Mined Patterns
Map(D)(f)             Map(D/B) Map(B)(f')
GroupBy(D)(k)(v)      GroupBy(D/B)(k') GroupBy(B)(k')(v')
FlatMap(D)(f)         FlatMap(D/B) FlatMap(B)(f')
x(i)                  x.copy(i to i+B); x(ii)
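The Map row of the table can be illustrated in plain Scala: a flat map over D elements becomes an outer loop over D/B tiles with an inner map over each tile, and the explicit slice stands in for `x.copy`. The names below are hypothetical, and the sketch uses ordinary collections rather than the compiler's intermediate representation.

```scala
object StripMining {
  // Original pattern: Map(D)(f)
  def flat(x: Vector[Int])(f: Int => Int): Vector[Int] = x.map(f)

  // Strip-mined pattern: Map(D/B){ Map(B)(f) } with an explicit tile copy.
  def stripMined(x: Vector[Int], b: Int)(f: Int => Int): Vector[Int] = {
    require(b > 0, "tile size must be positive")
    (0 until x.length by b).toVector.flatMap { i =>
      val tile = x.slice(i, i + b) // x.copy(i to i+B): candidate for an on-chip buffer
      tile.map(f)                  // inner Map(B) reads the tile, not x
    }
  }
}
```

Both versions compute the same output; the strip-mined form only makes the tile boundary and the copy explicit so that later passes can reorder loops and allocate buffers.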

Tiling Transformation 2: Interchange
- Reorder nested patterns
  - Move 'copy' operations out toward outer pattern(s)
  - Improves locality and reuse of on-chip memory

Strip Mined Patterns:

multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), map(b0,b1){i,j =>
    multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, multiFold(b2){ k =>
        (0, xTl(i,j) * yTl(j,k))
      }{(a,b) => a + b})
    }{(a,b) => a + b}
  })
}

Interchanged Patterns:

multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), multiFold(p/b2){kk =>
    yTl = y.copy(b1+jj, b2+kk)
    (0, map(b0,b1){i,j =>
      (0, multiFold(b2){ k =>
        (0, xTl(i,j) * yTl(j,k))
      }{(a,b) => a + b})
    })
  }{(a,b) => map(b0,b1){i,j =>
    a(i,j) + b(i,j) }})
}

Metapipelining
- Coarse-grained pipelining: A "pipeline of pipelines"
  - Exploits nested parallelism
- Uses asynchronous handshaking signals between stages
  - Allows stages to have variable execution times
  - Does not require complete unrolling of inner patterns
  - No need to calculate initiation interval (II) statically
- Intermediate data between stages stored in double buffers
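The runtime benefit of overlapping stages can be estimated with a simple idealised model. It assumes each stage takes a fixed number of cycles per outer iteration and ignores handshaking overhead; real metapipeline stages may vary in time, which is why the hardware uses handshaking rather than a static schedule. The names and stage latencies are hypothetical.

```scala
object MetapipelineModel {
  // Cycles when every outer iteration runs all stages back to back.
  def sequentialCycles(stageCycles: Seq[Long], iterations: Long): Long =
    stageCycles.sum * iterations

  // Idealised cycles when stages overlap across iterations:
  // one pass to fill the pipeline, then one result per slowest-stage time.
  def metapipelinedCycles(stageCycles: Seq[Long], iterations: Long): Long =
    if (iterations <= 0 || stageCycles.isEmpty) 0L
    else stageCycles.sum + (iterations - 1) * stageCycles.max
}
```

For three stages of 100, 40, and 60 cycles over 10 iterations, the sequential schedule takes 2000 cycles while the overlapped one takes 200 + 9 x 100 = 1100 cycles, bounded by the slowest stage.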

Metapipelining: Intuition

map(N) { r =>
  row = matrix.slice(r)
  diff = map(D) { i =>
    row(i) - sub(i)
  }
  vprod = map(D,D) { (i,j) =>
    diff(i) * diff(j)
  }
  vprod
}

Metapipeline: 4 stages

[Diagram: the body of map(N) becomes a metapipeline. Pipe1 is a TileMemController that loads row; Pipe2 loads row and sub, subtracts, and stores diff; Pipe3 loads diff twice, multiplies, and stores vprod; the result vprod is then written out. Successive frames show different values of r occupying different stages at the same time]

K-means: Generated Hardware

[Diagram: generated k-means accelerator. Blocks: kmeans tile load, samples tile load, parallel Vector Dist (Norm) units, Scalar Dist (Tree +), (MinDist, Idx), Inc, +, / (New), kmeans tile store. Buffers: kmeans block buffer, samples block double buffers, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer]

1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011
