MapReduce on the Cell
Broadband Engine Architecture
Marc de Kruijf
Overview
- Motivation
- MapReduce
- Cell BE Architecture
- Design
- Performance Analysis
- Implementation Status
- Future Work
What is MapReduce?
- A parallel programming model for large-scale data processing
- Simple, abstract interface
  - Runtime handles all synchronization, communication, and scheduling.
- Implementations exist for:
  - Large distributed clusters, as in Google's original MapReduce
  - Shared memory multiprocessors, as in Stanford's Phoenix
- The Cell processor is neither of these...
What is the Cell then?
- The Cell is:
  - "A single chip multiprocessor with nine processors operating on a shared, coherent memory"
- So what's the difference?
  - From a programming perspective, the Cell is much more like a "cluster-on-a-chip"
- That sounds hard...
  - It's not easy.
Motivation
- Programming the Cell is hard... yet...
- Distributed (shared?) memory single-chip multiprocessors are the way of the future. (Opinion.)
  - Corollary: Shared memory is out. Message passing is in. (More opinion.)
- What's missing are the right runtime and abstraction layers to enable the scalability potential of these and future systems.
- MapReduce is just such an abstraction.
MapReduce Refresher
MapReduce on Cell
- Design
- Execution variants:
  - MapReduce, sorted
    - Phases 1-5
  - MapReduce, no sort
    - Phases 1-4
  - Map only, sorted
    - Phases 1, 2, and 5
  - Map only, no sort
    - Phase 1 only (no hash)
Design Highlights
- Output is pre-allocated.
  - Enhances performance as well as output locality.
- Work queue allows dynamic scheduling of tasks among SPEs for load balancing.
  - Adding priorities allows pipelining of computation to maximize resource utilization.
- Outside of DMA transfers, there is no data copying.
Performance Analysis
- Assumptions:
  - While there is scheduled work for the SPEs, the SPEs are always the bottleneck.
- Execution time:
  - Fixed runtime startup cost = unknown (can be amortized) +
  - Map execution time = (# map keys * map execution time) +
  - Sort time = n log(n)/2, where n = (# intermediate keys / hash range) +
  - Reduce execution time = (# reduce keys * reduce execution time) +
  - Sort time = n log(n)/2, where n = (# reduce keys / output buffer size)
- Observations:
  - Buffer management and partitioning is a necessary part of programming the SPEs and is not considered overhead.
  - Sorting is the dominant overhead... let's examine it further.
Performance Analysis
- Sorting
  - Q: I hear the SPEs are not good at control tasks, but sorting is a control task, isn't it?
  - A: You are right; let's analyze the sorting performance of the SPEs vs. a typical superscalar.
- Assumptions
  - Let's assume our key comparison function is string compare.
  - Furthermore, assume that our input strings are uniformly distributed.
/* strcmp */
int keyCompare(const void *one, const void *two)
{
    const char *a1 = (const char *)one;
    const char *a2 = (const char *)two;
    int i;
    for (i = 0;
         __builtin_expect(a1[i] == a2[i], 1) &&
         __builtin_expect(a1[i] != 0, 1);
         i++);
    if (__builtin_expect(a1[i] == a2[i], 0))
        return 0;
    else if (a1[i] > a2[i])
        return 1;
    else
        return -1;
}
- SPE strcmp() average exec time / comparison: 36 cycles
  - Assuming 3.2 GHz: 11.25 ns/comp
- x86 strcmp() average inst / comparison: 28 inst
  - Assuming IPC of 1.5 @ 3.2 GHz: 5.8 ns/comp
- Bottom line: we have 8 SPEs to 2 (maybe 4) x86 cores:
  - 11.25 / 8 = 1.406 ns/comp (Cell)
  - 5.8 / 4 = 1.45 ns/comp (Kentsfield)
Implementation Status
- Very simple test application runs to completion.
  - Larger test program does not return. Debugging using the simulator is erratic and frequently ends with a timeout.
- Overestimated the capabilities of the simulator.
  - Cell Blade time needed, but too late to bear fruit.
- ~2500 LOC vs. ~1200 for Phoenix (shared memory implementation)
  - 21 threads total, though synchronization is not too difficult.
  - Buffer management is the hard part...
Future Work
- Wrap up implementation
  - Add performance counters to quantify overhead.
- Perform application-agnostic runtime analysis
  - Combine with static analysis to determine performance bottlenecks.
  - Determine if hash should be on the PPE or SPE.
- Perform application-specific runtime analysis
- What happened to Marching Cubes?
  - Actually mostly done, but no opportunity to test.