Automatic Generation of Efficient Accelerator Designs for Reconfigurable Hardware
Raghu Prabhakar, Stanford University
Pervasive Parallelism Lab

The Team
Stefan Hadjis, Tian Zhao, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun, David Koeplinger, Yaqi Zhang, Matt Feldman

FPGAs in Data Centers
- Increasing interest in the use of FPGAs as application accelerators in data centers
- Key advantage: performance per watt

Problem #1: Programmability
- Verilog and VHDL are too low-level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, with pragmas that require hardware knowledge
  - Limited in exploiting data locality
  - Difficult to synthesize complex data paths with nested parallelism

Problem #2: Large Design Spaces
- Design spaces grow exponentially with the number of parameters
- Parameters can change runtime by orders of magnitude
- Parameters depend on each other
- Manual exploration is tedious and suboptimal

Our Approach
[Figure: compilation flow] Parallel Patterns → Pattern Transformations (Tiling) → Tiled Parallel Patterns → Hardware Generation (Metapipeline Analysis) → Spatial → Design Space Exploration (Latency, Area Estimation) → MaxJ → Bitstream Generation → FPGA Configuration
Generating Configurable Hardware from Parallel Patterns, ASPLOS '16. Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun
Automatic Generation of Efficient Accelerators for Reconfigurable Hardware, ISCA '16. David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, Kunle Olukotun

Parallel Patterns
- Constructs with special properties with respect to parallelism and memory access
- Examples: map, zip, reduce, groupBy

Parallel Patterns: Map
val vectorB = vectorA map { v => v + 1 }
Perform the given function on the i'th element of the input. The result forms the i'th element of the output.
Example: 3 8 1 4 2 6 5 1 → (+1 on each element) → 4 9 2 5 3 7 6 2

Parallel Patterns: Zip
val vectorC = vectorA zip (vectorB) { _ + _ }
Perform the given function on the i'th element of the inputs. The result forms the i'th element of the output.
Example: 3 8 1 4 2 and 5 9 4 3 2 → (element-wise +) → 8 17 5 7 4

Parallel Patterns: Reduce
val sum = vectorA reduce { (a,b) => a + b }
Combine the elements of the input using the given function. The function must be associative.
Example: 3 8 1 4 → (3 + 8) + (1 + 4) → 16

Parallel Patterns: GroupBy
val bins = vectorA groupBy { v => v/3 }
Group the elements of the given collection based upon the given function.
Example: 3 8 1 4 2 6 5 1 → (key = v / 3) → bin 0: 1 2 1; bin 1: 3 4 5; bin 2: 8 6

Parallel Patterns: Filter
val vectorB = vectorA filter {v => v%2 == 1}
Produce a collection for the i'th element of the input. The output is the concatenation of all produced collections.
Example: 3 8 1 4 2 6 5 1 → (keep v where v % 2 == 1) → 3 1 5 1
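
The five patterns above can be tried with ordinary Scala collections. This is a minimal sketch, not the DSL from the slides: it assumes `Vector` as a stand-in for the parallel collection type, and the object name `PatternExamples` is made up for illustration. Filter is written as a `flatMap` to mirror the "one collection per element, then concatenate" description.

```scala
object PatternExamples {
  // Input values taken from the slide examples.
  val vectorA: Vector[Int] = Vector(3, 8, 1, 4, 2, 6, 5, 1)

  // Map: output element i depends only on input element i.
  val mapped: Vector[Int] = vectorA.map(v => v + 1)

  // Zip: combine element i of two inputs (stdlib zip yields pairs, so map over them).
  val zipped: Vector[Int] =
    Vector(3, 8, 1, 4, 2).zip(Vector(5, 9, 4, 3, 2)).map { case (a, b) => a + b }

  // Reduce: the combining function must be associative.
  val sum: Int = Vector(3, 8, 1, 4).reduce((a, b) => a + b)

  // GroupBy: bucket elements by a key function.
  val bins: Map[Int, Vector[Int]] = vectorA.groupBy(v => v / 3)

  // Filter expressed as flatMap: each element produces zero or one outputs.
  val odds: Vector[Int] =
    vectorA.flatMap(v => if (v % 2 == 1) Vector(v) else Vector.empty[Int])

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(zipped)
    println(sum)
    println(bins)
    println(odds)
  }
}
```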

K-means: Parallel Patterns
// Group points by closest centroid:
val groups = points groupBy { point =>
  // Compute the squared distance to each centroid
  val dists = guesses map { guess =>
    guess.zip(point){ (a,b) => (a - b) * (a - b) } reduce { (a,b) => a + b }
  }
  // Find the index of the closest centroid
  (0 until dists.length) reduce { (i,j) =>
    if (dists(i) < dists(j)) i else j
  }
}
// Average each group
val newKmeans = groups map { g =>
  val sum = g reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
  sum map { a => a / g.size }
}
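
For readers who want to run the k-means step, the same structure can be written with standard Scala collections. This is a sketch under assumptions: points are `Vector[Double]`, and the names `KMeansStep`, `sqDist`, `closest`, and `step` are illustrative, not part of the system described in the slides. A centroid that attracts no points is simply absent from the result.

```scala
object KMeansStep {
  type Point = Vector[Double]

  // Squared Euclidean distance: zip, map, then reduce with +.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid, found with a reduce over indices.
  def closest(point: Point, centroids: Vector[Point]): Int = {
    val dists = centroids.map(c => sqDist(c, point))
    dists.indices.reduce((i, j) => if (dists(i) < dists(j)) i else j)
  }

  // One iteration: group points by closest centroid, then average each group.
  def step(points: Vector[Point], centroids: Vector[Point]): Map[Int, Point] =
    points.groupBy(p => closest(p, centroids)).map { case (k, group) =>
      val sum = group.reduce((v1, v2) => v1.zip(v2).map { case (a, b) => a + b })
      k -> sum.map(_ / group.size)
    }

  def main(args: Array[String]): Unit = {
    val points = Vector(Vector(0.0, 0.0), Vector(0.0, 2.0), Vector(10.0, 10.0), Vector(10.0, 12.0))
    val centroids = Vector(Vector(1.0, 1.0), Vector(9.0, 9.0))
    println(step(points, centroids))
  }
}
```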
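
Design Space Example: Dot Product
Algorithm: dot product of vectors A and B
[Figure: vectors A and B reside in DRAM; on the FPGA, scratchpads TileA and TileB feed a multiplier and an adder that accumulate into register acc. Key: scratchpad, register, operation.]
Small and simple, but slow!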

Important Parameters: Tile Sizes
Algorithm: dot product of vectors A and B
- Increases length of DRAM accesses (affects runtime)
- Increases exploited spatial locality (affects runtime)
- Increases local memory sizes (affects area)
[Figure: same dot product design, with TileA and TileB highlighted. Key: scratchpad, register, operation.]

Important Parameters: Pipelining
Algorithm: dot product of vectors A and B
- Overlaps memory and compute (affects runtime)
- Increases local memory sizes (affects area)
- Adds synchronization logic (affects area)
[Figure: Stage 1 loads TileA and TileB from DRAM into double buffers; Stage 2 multiplies and accumulates into acc. Key: double buffer, register, operation.]
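
Important Parameters: Parallelization
Algorithm: dot product of vectors A and B
- Improves element throughput (affects runtime)
- Duplicates compute resources (affects area)
[Figure: TileA and TileB feed several multipliers in parallel, whose products are summed by adders into acc. Key: scratchpad, register, operation.]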
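
Hardware Language Requirements
Tools compared: VHDL and Verilog; LegUp, Vivado HLS, and OpenCL SDK; Aladdin; Spatial
Requirements:
- Targets FPGAs
- Enables pipelining at arbitrary loop levels
- Exposes design parameters to the compiler
- Evaluates designs prior to synthesis
- Explores design space automatically
- Generates synthesizable code
[Table: the per-tool check marks are not recoverable from the extracted text.]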

The Spatial Language
- Includes a variety of parameterized templates
  - Parallel patterns with implicit parallelization factors
  - Pipeline constructs for pipelining at arbitrary levels
  - Explicit size parameters for loop step size and buffer sizes
- All parameters are exposed to the compiler
- Compiler includes latency and area models for quick design evaluation
- Compiler automatically explores the design space
- Generates synthesizable MaxJ HGL after exploration
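
Dot Product in Spatial: Diagram
[Figure: vectors A and B in DRAM feed scratchpads TileA and TileB; an inner Reduce multiplies and accumulates within a tile, and an outer Reduce adds the per-tile results into out. Annotated parameters: tile size (B), pipelining toggle, and parallelism factors #1, #2, and #3.]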

Dot Product in Spatial
val output = Reg[Float]
val vectorA = OffChipMem[Float](N)
val vectorB = OffChipMem[Float](N)

Reduce(N by B)(output){ i =>
  val tileA = Scratchpad[Float](B)
  val tileB = Scratchpad[Float](B)
  val acc = Reg[Float]
  tileA load vectorA(i :: i+B)
  tileB load vectorB(i :: i+B)
  Reduce(B by 1)(acc){ j =>
    tileA(j) * tileB(j)
  }{a, b => a + b}
}{a, b => a + b}

Annotated design parameters: tile size (B), pipelining toggle, parallelism factors #1, #2, and #3.

val output = vectorA * vectorB // User's code
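
The nested Reduce structure can be modeled in software to see what the tile size does functionally. The sketch below is plain Scala, not Spatial: `TiledDotProduct`, `dot`, and `tileSize` are illustrative names, `slice` stands in for the tile loads, and the last tile is allowed to be short (an assumption the hardware version may not share).

```scala
object TiledDotProduct {
  // Software model of the two-level reduction: the outer loop walks tiles,
  // the inner loop multiplies and accumulates within one tile.
  def dot(a: Vector[Float], b: Vector[Float], tileSize: Int): Float = {
    require(a.length == b.length, "vectors must have equal length")
    require(tileSize > 0, "tile size must be positive")
    (0 until a.length by tileSize).map { i =>
      val tileA = a.slice(i, i + tileSize) // models: tileA load vectorA(i :: i+B)
      val tileB = b.slice(i, i + tileSize) // models: tileB load vectorB(i :: i+B)
      tileA.indices.map(j => tileA(j) * tileB(j)).foldLeft(0f)(_ + _) // inner reduce
    }.foldLeft(0f)(_ + _) // outer reduce
  }

  def main(args: Array[String]): Unit = {
    val a = Vector(1f, 2f, 3f, 4f, 5f)
    val b = Vector(1f, 1f, 1f, 1f, 1f)
    println(dot(a, b, tileSize = 2))
  }
}
```

Changing `tileSize` changes the order in which partial sums are combined, so for general floating-point inputs the results can differ slightly between tile sizes; the small integer-valued inputs here avoid that.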

Spatial Design Parameters
Type | Example | Description | Parameters
Primitives | +, -, *, /, ... | Basic math, logic, and control | Vector width
Primitives | Scratchpad Load, Scratchpad Store | Load/store from on-chip memories | Vector width, stride
Memories | OffChipMem | N-dimensional off-chip array | Dimensions
Memories | Scratchpad | On-chip scratchpad | Size, buffering, banking
Memories | Reg | Accumulator register | Buffering
Controllers | Counter | Loop indices | Parallelization, pattern
Controllers | Pipe | Pipelined inner-loop body | Parallelization, pattern
Controllers | MetaPipe | Coarse-grained pipeline | Parallelization, pattern
Data Transfer | Tile Load, Tile Store | Load/store from off-chip arrays | Tile size, load rate
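
Spatial Enables Fast DSE
[Figure: how properties of a Spatial program lead to fast design space exploration. Labels: Spatial program; parameterized templates; concise IR; simple linear models; fast estimation (no unrolling, no scheduling); easily derived space constraints; space pruning; smaller spaces; fast design space exploration.]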
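
To make "easily derived space constraints" and "space pruning" concrete, here is a small sketch of enumerating a dot-product design space and discarding illegal points before any estimation. Everything in it is an assumption for illustration: the `Design` fields, the candidate values, and the divisibility rules in `isLegal` are hypothetical, not the constraints the actual compiler derives.

```scala
object DesignSpaceSketch {
  // One point in a hypothetical dot-product design space.
  final case class Design(tileSize: Int, innerPar: Int, outerPar: Int, metaPipe: Boolean)

  // Full cross product of parameter values: grows multiplicatively with each parameter.
  def candidates(tileSizes: Seq[Int], pars: Seq[Int]): Seq[Design] =
    for {
      b  <- tileSizes
      ip <- pars
      op <- pars
      mp <- Seq(false, true)
    } yield Design(b, ip, op, mp)

  // Assumed legality rules: the tile size divides the vector length, and each
  // parallelization factor divides the trip count of the loop it applies to.
  def isLegal(n: Int, d: Design): Boolean =
    n % d.tileSize == 0 &&
      d.tileSize % d.innerPar == 0 &&
      (n / d.tileSize) % d.outerPar == 0

  def pruned(n: Int, tileSizes: Seq[Int], pars: Seq[Int]): Seq[Design] =
    candidates(tileSizes, pars).filter(d => isLegal(n, d))

  def main(args: Array[String]): Unit = {
    val all = candidates(Seq(8, 16, 32, 64), Seq(1, 2, 4))
    val legal = pruned(96, Seq(8, 16, 32, 64), Seq(1, 2, 4))
    println(s"${all.size} candidates, ${legal.size} legal")
  }
}
```

Because the constraints only involve the parameters themselves, pruning needs no synthesis or scheduling, which is the point the slide makes about smaller spaces.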

Latency Modeling
- Analytical model
  - Uses depth-first search to get the critical path of pipelines
  - Accurate estimation requires data size annotations
- Main-memory model
  - Mathematical model fit to observed runtimes
  - Parameterized by:
    - Number of contending readers/writers
    - Number of commands issued in sequence
    - Command length
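
The critical-path step of the analytical model can be illustrated with a memoized depth-first search over a dataflow graph. This is a generic sketch, not the compiler's implementation: the `Node` type, the node names, and the latencies are hypothetical, and the graph is assumed to be acyclic.

```scala
object CriticalPath {
  // A pipeline node with its own latency (in cycles) and the nodes it feeds.
  final case class Node(latency: Int, succs: List[String])

  // Longest latency-weighted path through an acyclic graph, via memoized DFS.
  def criticalPath(graph: Map[String, Node]): Int = {
    val memo = scala.collection.mutable.Map.empty[String, Int]

    def longestFrom(name: String): Int = memo.get(name) match {
      case Some(cached) => cached
      case None =>
        val node = graph(name)
        val tail = if (node.succs.isEmpty) 0 else node.succs.map(longestFrom).max
        val total = node.latency + tail
        memo(name) = total
        total
    }

    if (graph.isEmpty) 0 else graph.keys.map(longestFrom).max
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical inner pipeline of a dot product: two loads, a multiply, an add.
    val pipe = Map(
      "ldA" -> Node(1, List("mul")),
      "ldB" -> Node(1, List("mul")),
      "mul" -> Node(3, List("add")),
      "add" -> Node(2, Nil)
    )
    println(criticalPath(pipe))
  }
}
```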

Area Modeling
- Analytical model
  - Simple summation of the area of each template
  - Includes estimates for delay lines and banked memories
- Neural network models
  - Model routing costs and memory duplication
  - Simple 3-layer networks suffice here (we use 11-6-1)
  - Trained on a set of about 200 characterization designs
- Total area = analytical area + neural net area
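
The combination "total area = analytical area + neural net area" can be sketched as follows. Only the 11-6-1 layer sizes and the additive structure come from the slide; the ReLU hidden activation, the linear output, the feature vector, and all weights below are placeholders chosen for illustration, since the slides do not specify them.

```scala
object AreaModelSketch {
  // Analytical part: sum of per-template area estimates.
  def analyticalArea(templateAreas: Seq[Double]): Double = templateAreas.sum

  // One fully connected layer: each row of weights produces one output.
  final case class Layer(weights: Vector[Vector[Double]], bias: Vector[Double]) {
    def forward(in: Vector[Double], act: Double => Double): Vector[Double] =
      weights.zip(bias).map { case (row, b) =>
        act(row.zip(in).map { case (w, x) => w * x }.sum + b)
      }
  }

  val relu: Double => Double = x => math.max(0.0, x) // assumed activation

  // 11 inputs -> 6 hidden units -> 1 output (routing and duplication overhead).
  def neuralArea(features: Vector[Double], hidden: Layer, out: Layer): Double = {
    require(features.length == 11, "the 11-6-1 network expects 11 features")
    out.forward(hidden.forward(features, relu), x => x).head
  }

  def totalArea(templateAreas: Seq[Double], features: Vector[Double],
                hidden: Layer, out: Layer): Double =
    analyticalArea(templateAreas) + neuralArea(features, hidden, out)

  def main(args: Array[String]): Unit = {
    // Placeholder weights, not trained values.
    val hidden = Layer(Vector.fill(6)(Vector.fill(11)(0.1)), Vector.fill(6)(0.0))
    val out = Layer(Vector(Vector.fill(6)(1.0)), Vector(0.0))
    println(totalArea(Seq(100.0, 50.0), Vector.fill(11)(1.0), hidden, out))
  }
}
```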

Evaluation
- Accuracy: How accurate are the models, compared to observations?

Experimental Setup
- Board:
  - Altera Stratix V
  - 48 GB DDR3 DRAM, 6 memory channels
  - Board connected to host via PCI-e
- Execution time = FPGA execution time
  - Does not include CPU ↔ FPGA communication or configuration time

Results: Model Accuracy (Area)
[Figure: modeled vs. synthesized resource usage (% of ALMs, BRAMs, and DSPs) for dotproduct, outerprod, tpchq6, blackscholes, gda, kmeans, and gemm.]
Area models follow important trends and are accurate enough to drive automatic design space exploration.

Results: Model Accuracy (Latency)
[Figure: average latency error (%) per benchmark, on an axis from 0% to 20%.]
Benchmark      Average error
dotproduct     2.8%
outerprod      1.3%
tpchq6         3.1%
blackscholes   3.4%
gda            6.7%
kmeans         7%
gemm           18.4%
Latency models follow important trends and are accurate enough to drive automatic design space exploration.

Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?

Results: Prediction Speed
DHDL:
Benchmark         Designs   Search Time
Dot Product       5,426     5.3 ms/design
Outer Product     1,702     30 ms/design
TPCHQ6            5,426     8.2 ms/design
Blackscholes      572       27 ms/design
Matrix Multiply   70,740    11 ms/design
K-Means           75,200    20 ms/design
GDA               42,800    17 ms/design

Vivado HLS:
Benchmark         Designs   Search Time
GDA               250       1.85 min/design

Results: Prediction Speed
6533x speedup over HLS!
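
As a sanity check on the headline number, the per-design times for GDA can be compared directly. The sketch below uses only the rounded figures from the table (1.85 min/design for Vivado HLS, 17 ms/design for DHDL), so it lands near, not exactly on, the reported 6533x; the object and value names are illustrative.

```scala
object DseSpeedup {
  // Per-design evaluation times for GDA, taken from the table (rounded values).
  val hlsSecondsPerDesign: Double = 1.85 * 60 // 1.85 min/design
  val dhdlSecondsPerDesign: Double = 17e-3    // 17 ms/design

  // Ratio of per-design times.
  val speedup: Double = hlsSecondsPerDesign / dhdlSecondsPerDesign

  // Total search time for the number of designs each tool actually evaluated.
  val dhdlTotalMinutes: Double = 42800 * dhdlSecondsPerDesign / 60
  val hlsTotalHours: Double = 250 * hlsSecondsPerDesign / 3600

  def main(args: Array[String]): Unit = {
    println(f"speedup: $speedup%.0fx")
    println(f"DHDL, 42,800 designs: $dhdlTotalMinutes%.1f minutes")
    println(f"Vivado HLS, 250 designs: $hlsTotalHours%.1f hours")
  }
}
```

With these rounded inputs the ratio comes out at roughly 6,500x, and the totals show DHDL covering 42,800 designs in about 12 minutes while 250 HLS evaluations take several hours.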

Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?

Results: GDA Design Space
[Figure: three scatter plots of cycles (log scale, 10^7 to 10^10) against resource usage (% of maximum, 20% to 100%) for ALMs, DSPs, and BRAMs. Legend: valid design point, invalid design point, Pareto-optimal design, synthesized Pareto design point.]

Results: GDA Design Space
- Performance limited by available BRAMs
- Space for GDA spans four orders of magnitude
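
The Pareto-optimal points in such a plot are the valid designs that no other valid design beats on both area and cycles. A minimal sketch of that selection, assuming a single area metric per design and using made-up sample points (the `Point` type and its fields are illustrative):

```scala
object ParetoSketch {
  // One estimated design: resource usage (% of maximum), cycles, and validity.
  final case class Point(area: Double, cycles: Double, valid: Boolean)

  // a dominates b if it is no worse on both metrics and strictly better on one.
  def dominates(a: Point, b: Point): Boolean =
    a.area <= b.area && a.cycles <= b.cycles && (a.area < b.area || a.cycles < b.cycles)

  // Keep valid points that are not dominated by any other valid point (quadratic scan).
  def paretoFront(points: Seq[Point]): Seq[Point] = {
    val valid = points.filter(_.valid)
    valid.filter(p => !valid.exists(q => dominates(q, p)))
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(
      Point(20, 1e10, valid = true),
      Point(60, 1e8, valid = true),
      Point(60, 1e9, valid = true),   // dominated by the point above
      Point(100, 1e7, valid = true),
      Point(120, 1e6, valid = false)  // exceeds the device, so excluded
    )
    paretoFront(points).foreach(println)
  }
}
```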

Evaluation
- Accuracy: How accurate are the models, compared to observations?
- Speed: How fast are the predictions, compared to commercial tools?
- Space: Do the design parameters help capture an interesting space?
- Performance: How good is the best generated design?

Evaluation: Multi-Core Comparison
- FPGA
  - Altera Stratix V (28 nm)
  - 150 MHz clock
  - Peak main memory bandwidth of 37.5 GB/sec
- Multi-core CPU
  - Intel Xeon E5-2630 (32 nm)
  - 2.3 GHz
  - Peak main memory bandwidth of 42.6 GB/sec
  - 6 cores, 6 threads
  - Multi-threaded C++ code generated from Delite
- Execution time = FPGA execution time
  - Does not include CPU ↔ FPGA communication or configuration time

Results: Comparison with Multi-Core
[Figure: speedup over the multi-core CPU per benchmark, with benchmarks grouped into memory-bound and compute-bound.]
Benchmark      Speedup
dotproduct     1.07
outerprod      2.42
tpchq6         1.11
blackscholes   16.73
gda            4.55
kmeans         1.15
gemm           0.10
Gemm uses multi-threaded OpenBLAS on the CPU.

Summary
- Tiling and metapipelining capture locality and nested parallelism
- Spatial captures a large design space
- Fast, accurate estimators and DSE tools enable rapid design space exploration
- Up to 16.7x speedup over multi-core CPU benchmarks
- Up to 6533x faster DSE compared to Vivado HLS

Tiling Transformation 1: Strip Mining
- Transforms a single pattern into a set of nested patterns
- Strip-mined patterns enable computation reordering
- Insert copies for predictable accesses to enhance locality
  - Copies guide creation of on-chip buffers

Parallel Patterns     Strip-Mined Patterns
Map(D)(f)             Map(D/B) Map(B)(f')
GroupBy(D)(k)(v)      GroupBy(D/B)(k') GroupBy(B)(k')(v')
FlatMap(D)(f)         FlatMap(D/B) FlatMap(B)(f')
x(i)                  x.copy(i to i+B); x(ii)

Tiling Transformation 2: Interchange
- Reorder nested patterns
- Move 'copy' operations out toward outer pattern(s)
  - Improves locality and reuse of on-chip memory

Strip-Mined Patterns:
multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), map(b0,b1){i,j =>
    multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, multiFold(b2){ k =>
        (0, xTl(i,j) * yTl(j,k))
      }{(a,b) => a + b})
    }{(a,b) => a + b}
  })
}

Interchanged Patterns:
multiFold(m/b0,n/b1){ii,jj =>
  xTl = x.copy(b0+ii, b1+jj)
  ((ii,jj), multiFold(p/b2){kk =>
    yTl = y.copy(b1+jj, b2+kk)
    (0, map(b0,b1){i,j =>
      (0, multiFold(b2){ k =>
        (0, xTl(i,j) * yTl(j,k))
      }{(a,b) => a + b})
    })
  }{(a,b) => map(b0,b1){i,j =>
    a(i,j) + b(i,j) }})
}
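
Strip mining a Map, the first row of the strip-mining table, can be checked in plain Scala: the nested version walks tiles of size B, copies each tile, and applies the function within the tile, yet produces the same result as the flat Map. The names `StripMining`, `mapFlat`, and `mapStripMined` are illustrative, and `slice` is an assumed stand-in for the `copy` that guides on-chip buffer creation.

```scala
object StripMining {
  // Original pattern: Map(D)(f).
  def mapFlat[A, B](x: Vector[A])(f: A => B): Vector[B] = x.map(f)

  // Strip-mined pattern: Map(D/B) over tiles, each containing a Map(B)(f).
  def mapStripMined[A, B](x: Vector[A], b: Int)(f: A => B): Vector[B] = {
    require(b > 0, "tile size must be positive")
    (0 until x.length by b).toVector.flatMap { ii =>
      val tile = x.slice(ii, ii + b) // models x.copy(ii to ii+B): a local tile buffer
      tile.map(f)                    // inner Map over one tile
    }
  }

  def main(args: Array[String]): Unit = {
    val x = Vector(3, 8, 1, 4, 2, 6, 5, 1)
    println(mapFlat(x)(_ + 1))
    println(mapStripMined(x, 3)(_ + 1))
  }
}
```

The transformation changes only the loop structure and where data is staged, which is what makes the later interchange step legal to apply.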

Metapipelining
- Coarse-grained pipelining: a "pipeline of pipelines"
  - Exploits nested parallelism
- Uses asynchronous handshaking signals between stages
  - Allows stages to have variable execution times
  - Does not require complete unrolling of inner patterns
  - No need to calculate the initiation interval (II) statically
- Intermediate data between stages is stored in double buffers
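
A back-of-the-envelope timing model shows why overlapping stages helps. This sketch assumes fixed per-stage latencies and ideal handshaking with no stalls, which is a simplification: the slide notes that real stages may have variable execution times. The stage latencies in `main` are made up.

```scala
object MetapipelineTiming {
  // Without pipelining, every iteration runs all stages back to back.
  def sequentialCycles(stages: Seq[Long], iterations: Long): Long =
    iterations * stages.sum

  // With an ideal metapipeline, the first iteration fills the pipeline and
  // each later iteration costs only as much as the slowest stage.
  def metapipelinedCycles(stages: Seq[Long], iterations: Long): Long =
    if (iterations == 0 || stages.isEmpty) 0L
    else stages.sum + (iterations - 1) * stages.max

  def main(args: Array[String]): Unit = {
    // Hypothetical 4-stage metapipeline: load, compute, compute, store.
    val stages = Seq(100L, 20L, 40L, 100L)
    println(sequentialCycles(stages, 10))
    println(metapipelinedCycles(stages, 10))
  }
}
```

In this model the steady-state cost is set by the slowest stage, so balancing stage latencies matters as much as adding stages.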

Metapipelining: Intuition
map(N) { r =>
  row = matrix.slice(r)
  diff = map(D) { i =>
    row(i) - sub(i)
  }
  vprod = map(D,D) { (i,j) =>
    diff(i) * diff(j)
  }
  vprod
}

Metapipelining: Intuition
[Figure: the body of map(N) mapped to hardware for one iteration r. Pipe1: TileMemController loads row. Pipe2: loads row and sub, subtracts, stores diff. Pipe3: loads diff, multiplies, stores vprod. Pipe4: TileMemController stores vprod.]

Metapipelining: Intuition
Metapipeline: 4 stages
[Figure: the same four stages (Pipe1 to Pipe4) shown working on two different iterations r at the same time, with row, diff, and vprod held in buffers between stages.]

K-means: Generated Hardware
[Figure: generated design in three steps. 1. Load kmeans. 2. Metapipeline: calculate sum and count. 3. Metapipeline: calculate new kmeans, store results. Blocks: samples tile load, kmeans tile load, vector distance (norm) units, scalar distance (tree +), (MinDist, Idx), increment, add, divide, new kmeans tile store. Buffers: kmeans block buffer, samples block double buffers, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer.]
Similar to (and more general than) hand-written designs [1]
[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011