MapReduce on the Cell
Broadband Engine Architecture
Marc de Kruijf
Overview
- Motivation
- MapReduce
- Cell BE Architecture
- Design
- Performance Analysis
- Implementation Status
- Future Work
What is MapReduce?
- A parallel programming model for large-scale data processing
- Simple, abstract interface
  - Runtime handles all synchronization, communication, and scheduling.
- Implementations exist for:
  - Large distributed clusters, as in Google's original MapReduce
  - Shared memory multiprocessors, as in Stanford's Phoenix
- The Cell processor is neither of these...
What is the Cell then?
- The Cell is:
  - "A single chip multiprocessor with nine processors operating on a shared, coherent memory"
- So what's the difference?
  - From a programming perspective, the Cell is much more like a "cluster-on-a-chip"
- That sounds hard...
  - It's not easy.
Motivation
- Programming the Cell is hard... yet...
- Distributed (shared?) memory single-chip multiprocessors are the way of the future. (Opinion.)
  - Corollary: Shared memory is out. Message passing is in. (More opinion.)
- What's missing are the right runtime and abstraction layers to enable the scalability potential of these and future systems.
- MapReduce is just such an abstraction.
MapReduce Refresher
MapReduce on Cell
- Design
- Execution variants:
  - MapReduce, sorted
    - Phases 1-5
  - MapReduce, no sort
    - Phases 1-4
  - Map only, sorted
    - Phases 1, 2, and 5
  - Map only, no sort
    - Phase 1 only (no hash)
Design Highlights
- Output is pre-allocated.
  - Enhances performance as well as output locality.
- Work queue allows dynamic scheduling of tasks among SPEs for load balancing.
  - Adding priorities allows pipelining of computation to maximize resource utilization.
- Outside of DMA transfers, there is no data copying.
Performance Analysis
- Assumptions:
  - While there is scheduled work for the SPEs, the SPEs are always the bottleneck.
- Execution time:
  - Fixed runtime startup cost = unknown (can be amortized) +
  - Map execution time = (# map keys * map execution time) +
  - Sort time = n log(n)/2, where n = (# intermediate keys / hash range) +
  - Reduce execution time = (# reduce keys * reduce execution time) +
  - Sort time = n log(n)/2, where n = (# reduce keys / output buffer size)
- Observations:
  - Buffer management and partitioning is a necessary part of programming the SPEs and is not considered overhead.
  - Sorting is the dominant overhead... let's examine it further.
Performance Analysis
- Sorting
  - Q: I hear the SPEs are not good at control tasks, but sorting is a control task, isn't it?
  - A: You are right; let's analyze the sorting performance of the SPEs vs. a typical superscalar.
- Assumptions
  - Let's assume our key comparison function is string compare.
  - Furthermore, assume that our input strings are uniformly distributed.
/* strcmp */
int keyCompare(const void *one, const void *two)
{
    const char *a1 = (const char *)one;
    const char *a2 = (const char *)two;
    int i;
    for (i = 0;
         __builtin_expect(a1[i] == a2[i], 1) &&
         __builtin_expect(a1[i] != 0, 1);
         i++);
    if (__builtin_expect(a1[i] == a2[i], 0))
        return 0;
    else if (a1[i] > a2[i])
        return 1;
    else
        return -1;
}
- SPE strcmp() average exec time / comparison: 36 cycles
  - Assuming 3.2 GHz: 11.25 ns/comp
- x86 strcmp() average inst / comparison: 28 inst
  - Assuming IPC of 1.5 @ 3.2 GHz: 5.8 ns/comp
- Bottom line: we have 8 SPEs to 2 (maybe 4) x86 cores:
  - 11.25 / 8 = 1.406 ns/comp (Cell)
  - 5.8 / 4 = 1.45 ns/comp (Kentsfield)
Implementation Status
- Very simple test application runs to completion.
  - Larger test program does not return. Debugging using the simulator is erratic and frequently ends with a timeout.
- Overestimated the capabilities of the simulator.
  - Cell Blade time needed, but too late to bear fruit.
- ~2500 LOC vs. ~1200 for Phoenix (shared memory implementation)
  - 21 threads total, though synchronization is not too difficult.
  - Buffer management is the hard part...
Future Work
- Wrap up implementation
  - Add performance counters to quantify overhead.
- Perform application-agnostic runtime analysis
  - Combine with static analysis to determine performance bottlenecks.
  - Determine if hash should be on the PPE or SPE.
- Perform application-specific runtime analysis
- What happened to Marching Cubes?
  - Actually mostly done, but no opportunity to test.