Vivek Sarkar
Department of Computer Science, Rice University
[email protected]
September 10, 2007
COMP 635: Seminar on Heterogeneous Processors
Lecture 2: Introduction to the Cell Processor
www.cs.rice.edu/~vsarkar/comp635
COMP 635, Fall 2007 (V. Sarkar)
Announcements

• Class dates (REMINDER)
  — 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
  — No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007)
  — No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
  — Time & Place
    – Default: Mondays, 3:30pm - 4:30pm, DH 1075
    – Exception: 9/20 (Thurs) lecture, 3:30pm - 4:30pm, DH 3076
    – 30 minutes reserved after each lecture for discussion (optional)
  — Office Hours (DH 3131)
    – 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• Volunteers needed to lead discussion of papers in next lecture (9/20)
  1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006
  2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al., PPoPP 2007 Best Paper, March 2007
• CELL HACK-A-THON II, Austin, September 22 - 25
  — See http://www.hpc-consortium.net/events/200709/ for details
  — Contact me if you’re interested in attending so as to work on a class project
Acknowledgments
• MIT 6.189 IAP 2007, Jan 2007, Lecture 2, “Introduction to the Cell Processor”, Michael Perrone, http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Georgia Tech, Sony/Toshiba/IBM Workshop on Software and Applications for the Cell/B.E. processor, June 18-19, 2007, http://sti.cc.gatech.edu/program.html
  — Code and Data Partitioning for the Local Stores on the Cell/B.E. processor, Kevin O'Brien, Kathryn O'Brien, Zehra Sura, Tao Zhang and Tong Chen, http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• U. Penn. Systems Seminar on “Cell Processor,” Diana Palsetia, 11/21/2006, www.cis.upenn.edu/~palsetia/cellproc.ppt
Outline
• Cell Processor and Software Environment
  — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Code and Data Partitioning for the Local Stores on the Cell/BE processor
  — http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• Yuan Zhao -- Experiences with compiling for Cell
Cell History

• IBM, SCEI/Sony, Toshiba Alliance formed in 2000
• Design Center opened in March 2001
  — Based in Austin, Texas
• Single Cell BE operational Spring 2004
• 2-way SMP operational Summer 2004
• February 7, 2005: First technical disclosures
• October 6, 2005: Mercury announces Cell Blade
• November 9, 2005: Open source SDK & simulator published
• November 14, 2005: Mercury announces Turismo Cell offering
• February 8, 2006: IBM announces Cell Blade
Cell Chip
Cell Features

• Heterogeneous multicore system architecture
  — Power Processor Element for control tasks
  — Synergistic Processor Elements for data-intensive processing
• Synergistic Processor Element (SPE) consists of
  — Synergistic Processor Unit (SPU)
  — Synergistic Memory Flow Control (MFC)
    – Data movement and synchronization
    – Interface to the high-performance Element Interconnect Bus
[Figure: Cell chip block diagram. The PPE (PPU with PXU and L1, plus L2; 64-bit Power Architecture with VMX) and eight SPEs (each an SPU/SXU with Local Store (LS) and MFC) attach to the Element Interconnect Bus (EIB, up to 96B/cycle) at 16B/cycle per port, with the PPU-to-L2 path at 32B/cycle. The MIC connects to dual XDR™ memory and the BIC to FlexIO™, each at 16B/cycle (2x for FlexIO).]
SPE Block Diagram

[Figure: SPE block diagram. The SPU contains a Permute Unit, Load-Store Unit, Floating-Point Unit, Fixed-Point Unit, Branch Unit, and Channel Unit, connected through result forwarding/staging and the register file, fed by an instruction issue unit / instruction line buffer. The 256kB Local Store is a single-port SRAM with 128B read and 128B write ports. A DMA unit connects the SPE to the on-chip coherent bus; datapath widths range from 8 to 128 bytes/cycle.]
SPU Details

• Synergistic Processor Element (SPE)
  — ISA influenced by VMX and PS2’s Emotion Engine
• User-mode architecture
  — No translation/protection within SPE
  — DMA is full PowerPC protect/xlate
• Direct programmer control
  — DMA/DMA-list
  — Branch hint
  — No dynamic prediction
  — In-order execution
• VMX-like SIMD dataflow
  — Graphics SP-Float
  — No saturate arith, some byte
  — IEEE DP-Float (BlueGene-like)
• Unified register file
  — 128 entries x 128 bits
• 256KB Local Store
  — Combined I & D
  — 16B/cycle L/S bandwidth
  — 128B/cycle DMA bandwidth
• DMA unit w/ Memory Flow Control (MFC) commands
  — MFC’s MMU allows a consistent interface to the system storage map for all processors despite the heterogeneous structure
• SPU units (pipelined)
  — Simple (FXU even): add/compare, rotate, logical, count leading zeros
  — Permute (FXU odd): permute, table-lookup
  — FPU (single/double precision)
  — Control (SCN): dual issue, load/store, ECC handling
  — Channel (SSC): interface to MFC
  — Register file (GPR/FWD)
• SPU latencies
  — Simple fixed point: 2 cycles*
  — Complex fixed point: 4 cycles*
  — Load: 6 cycles*
  — Single-precision (ER) float: 6 cycles*
  — Integer multiply: 7 cycles*
  — Branch miss (no penalty for correct hint): 20 cycles
  — DP (IEEE) float (partially pipelined): 13 cycles*
  — Enqueue DMA command: 20 cycles*
Memory Flow Controller Commands

Command parameters: LSA - Local Store Address (32 bit); EA - Effective Address (32 or 64 bit); TS - Transfer Size (16 bytes to 16KB); LS - DMA List Size (8 bytes to 16KB); TG - Tag Group (5 bit); CL - Cache Management / Bandwidth Class

DMA Commands
  Put    - Transfer from Local Store to EA space
  Puts   - Transfer and start SPU execution
  Putr   - Put result (arch.: scarf into L2)
  Putl   - Put using DMA list in Local Store
  Putrl  - Put result using DMA list in LS (arch.)
  Get    - Transfer from EA space to Local Store
  Gets   - Transfer and start SPU execution
  Getl   - Get using DMA list in Local Store
  Sndsig - Send signal to SPU
  Command modifiers <f,b>:
    f: embedded tag-specific fence - command will not start until all previous commands in the same tag group have completed
    b: embedded tag-specific barrier - command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed

SL1 Cache Management Commands
  sdcrt   - Data cache region touch (DMA Get hint)
  sdcrtst - Data cache region touch for store (DMA Put hint)
  sdcrz   - Data cache region zero
  sdcrs   - Data cache region store
  sdcrf   - Data cache region flush

Synchronization Commands
  Lockline (atomic update) commands:
    getllar - DMA 128 bytes from EA to LS and set reservation
    putllc  - Conditionally DMA 128 bytes from LS to EA
    putlluc - Unconditionally DMA 128 bytes from LS to EA
  barrier  - All previous commands complete before subsequent commands are started
  mfcsync  - Results of all previous commands in tag group are remotely visible
  mfceieio - Results of all preceding Put commands in the same group are visible with respect to succeeding Get commands
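The lockline commands above (getllar then putllc, retried on failure) are a load-reserve / store-conditional pattern. As a host-side illustration only — standard C11 atomics, not Cell intrinsics — the same retry idiom for an atomic add might be sketched as:

```c
#include <stdatomic.h>

/* Atomically add `delta` to `*word` with a compare-exchange retry loop:
   the same shape as getllar (load + set reservation) followed by putllc
   (conditional store), retried until the conditional store succeeds. */
static int atomic_add(_Atomic int *word, int delta) {
    int old = atomic_load(word);  /* "getllar": read the current value */
    while (!atomic_compare_exchange_weak(word, &old, old + delta)) {
        /* "putllc" failed: reservation lost; `old` now holds the fresh value */
    }
    return old + delta;
}
```

On success, `old` holds the value observed just before the exchange, so `old + delta` is the value actually stored.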
PPE Structure

Power Processor Element (PPE):
  — General-purpose, 64-bit RISC processor (Power/PowerPC binary compatible)
  — In-order, dual issue, dual threaded
  — L1: 32KB I; 32KB D
  — L2: 512KB
  — Coherent load/store
  — VMX-32
  — Realtime controls
    – Locking L2 cache & TLB
    – Software/hardware managed TLB
    – Bandwidth/resource reservation
    – Mediated interrupts
Element Interconnect Bus

• EIB data ring for internal communication
  — Four unidirectional 16-byte data rings, supporting multiple transfers
    – 2 clockwise, 2 anti-clockwise; worst-case latency is half the ring length
  — 96B/cycle peak bandwidth
  — Over 100 outstanding requests
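Since half the rings run clockwise and half anti-clockwise, a transfer can always take the shorter direction, which is why the worst case is half the ring length. A small model of that hop count (using the 12 ramps shown in the next slide's diagram; this is an illustration, not the hardware arbiter):

```c
/* Minimum hop count between two of the 12 EIB ramps, choosing the
   shorter of the clockwise and anti-clockwise directions.  The worst
   case is NUM_RAMPS/2 = 6 hops: half the ring length. */
#define NUM_RAMPS 12

static int ring_hops(int src, int dst) {
    int cw  = (dst - src + NUM_RAMPS) % NUM_RAMPS;  /* clockwise distance */
    int ccw = NUM_RAMPS - cw;                       /* anti-clockwise distance */
    return cw < ccw ? cw : ccw;
}
```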
Example of Eight Concurrent Transactions

[Figure: twelve ramps (0-11), each with its own controller, connect the PPE, MIC, BIF/IOIF0, IOIF1, and SPE0-SPE7 to the four data rings (Ring0-Ring3, two per direction). A central data arbiter controls the rings, and each ring can carry simultaneous transfers — eight concurrent transactions in this example.]
Theoretical Peak Operations

[Chart comparing theoretical peak operations of the Freescale MPC8641D (1.5 GHz), AMD Athlon™ 64 X2 (2.4 GHz), PowerPC® 970MP (2.5 GHz), Cell Broadband Engine™ (3.2 GHz), and Intel Pentium D® (3.2 GHz).]
Cell BE Performance

• BE can outperform a P4/SSE2 at the same clock rate by 3x to 18x (assuming linear scaling) in various types of application workloads

  Type              Algorithm                     3 GHz GPP              3 GHz BE             BE Perf Advantage
  video processing  mpeg2 decoder (sdtv)          200 fps (IA32)         290 fps (per SPE)    12x
  communication     EEMBC Telemark                501 (1.4GHz mpc7447)   770 (per SPE)        12x
  security          SHA-1                         0.85 Gbps (IA32)       1.98 Gbps (per SPE)  18x
                    MD-5                          2.68 Gbps (IA32)       2.3 Gbps (per SPE)   6x
                    TDES                          0.12 Gbps (IA32)       0.16 Gbps (per SPE)  10x
                    AES                           1.1 Gbps (IA32)        2 Gbps (per SPE)     14x
  graphics          TRE                           1.6 fps (G5/VMX)       24 fps (BE)          15x
                    transform-light               160 MVPS (G5/VMX)      240 MVPS (per SPE)   12x
  bioinformatics    smith-waterman                570 Mcups (IA32)       420 Mcups (per SPE)  6x
  HPC               Linpack (D.P.)                6 GFlops (IA32)        12 GFlops (BE)       2x
                    Linpack (S.P.)                18 GFlops (IA32)       150 GFlops (BE)      8x
                    Matrix Multiplication (S.P.)  25 GFlops              190 GFlops (8 SPEs)  8x
CELL Software Design Considerations

• Four levels of parallelism
  — Blade level: two Cell processors per blade
  — Chip level: 9 cores run independent tasks
  — Instruction level: dual-issue pipelines on each SPE
  — Register level: native SIMD on SPE and PPE VMX
• 256KB local store per SPE: data + code + stack
• Communication
  — DMA and bus bandwidth
    – DMA granularity: 128 bytes
    – DMA bandwidth between LS and system memory
  — Traffic control
    – Exploit computational complexity and data locality to lower data traffic requirements
  — Shared memory / message passing abstraction overhead
  — Synchronization
  — DMA latency handling
Typical CELL Software Development Flow
• Algorithm complexity study
• Data layout/locality and Data flow analysis
• Experimental partitioning and mapping of the algorithm andprogram structure to the architecture
• Develop PPE Control, PPE Scalar code
• Develop PPE Control, partitioned SPE scalar code—Communication, synchronization, latency handling
• Transform SPE scalar code to SPE SIMD code
• Re-balance the computation / data movement
• Other optimization considerations—PPE SIMD, system bottleneck, load balance
Programming the Cell is challenging

Issues
• Dividing a program among the different cores
• Creating instructions in a different language for the 8 SPEs than for the PowerPC core
• Need to think in terms of the SIMD nature of dataflow to get maximum performance from the SPUs
• The SPU local store needs coherent DMA access to reach system memory
Cell Software Environment

[Diagram: the development tools stack (code dev tools, debug tools, performance tools, miscellaneous tools) and standards (language extensions, ABI) sit alongside the execution environment (samples, workloads, demos, application libs, SPE management lib), all above Linux PPC64 with Cell extensions and an optional verification hypervisor, running on hardware or the system-level simulator — spanning the development environment, programmer experience, and end-user experience.]
Manually compiling and binding a Cell BE program

[Figure: build flow diagram; copyright IBM]
Outline
• Cell Processor and Software Environment
  — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Code and Data Partitioning for the Local Stores on the Cell/BE processor
  — http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• Yuan Zhao -- Experiences with compiling for Cell
Shared Memory Processor

• The CBE can be explicitly programmed as a shared-memory multiprocessor using two different instruction sets
• The SPEs and the PPE can be programmed to fully inter-operate in a cache-coherent shared-memory multiprocessor model
  — Cache-coherent DMA operations for SPEs
  — DMA operations use effective addresses common to the PPE and all SPEs
  — SPE shared-memory store instructions are replaced by
    – A store from the register file to the LS
    – A DMA operation from LS to shared memory
  — SPE shared-memory load instructions are replaced by
    – A DMA operation from shared memory to LS
    – A load from LS to the register file
• Of course … a compiler could provide much of this functionality.
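The load/store replacement can be modeled on a host machine with memcpy standing in for the MFC DMA transfer. This is a sketch under stated assumptions — `ls_buf`, `dma_get`, and `dma_put` are illustrative names, not Cell SDK APIs:

```c
#include <string.h>

/* "System memory" is any ordinary buffer; ls_buf models the SPE local
   store; memcpy stands in for the MFC DMA engine. */
enum { LS_BYTES = 256 * 1024 };
static unsigned char ls_buf[LS_BYTES];  /* the modeled 256KB local store */

static void dma_get(void *ls, const void *ea, size_t n) { memcpy(ls, ea, n); }
static void dma_put(void *ea, const void *ls, size_t n) { memcpy(ea, ls, n); }

/* A shared-memory load becomes: DMA into LS, then load from LS. */
static int shared_load(const int *ea) {
    dma_get(ls_buf, ea, sizeof(int));
    return *(int *)ls_buf;
}

/* A shared-memory store becomes: store into LS, then DMA out of LS. */
static void shared_store(int *ea, int value) {
    *(int *)ls_buf = value;
    dma_put(ea, ls_buf, sizeof(int));
}
```

This also makes the cost argument of the next slides concrete: each "load" or "store" is a whole transfer round-trip, which is why naive per-reference DMA is too slow.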
Compiling a single source file for the Cell (w/o buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

The compiler outlines the parallel loop into foo3(LB,UB):
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier
and clones it for the SPE as foo3_SPU(LB,UB).

The PPE main program becomes:
    foo1 ();
    Runtime distribution of work: invoke foo3, for i=[0,N)
    Runtime barrier
    foo2 ();

In SPE code, A, B, and x are shared.
Compiling a single source file for the Cell (w/ buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

The PPE main program becomes:
    foo1 ();
    Runtime distribution of work: invoke foo3 and foo3_SPU, for i=[0,N)
    Runtime barrier
    foo2 ();

The outlined PPE version, foo3(LB,UB):
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

The SPE clone, foo3_SPU(LB,UB), stages data through local-store buffers:
    /* buffers A´[M], B´[M] */
    for (k=LB; k < UB; k+=M) {
      DMA M elements of B into B´
      for (j=0; j<M; j++) {
        A´[j] = cache_lookup(x) * B´[j];
      }
      DMA M elements of A out of A´
    }
    Runtime barrier
Data Partitioning

• Single Source assumption: all data lives in system memory
• Naïve implementation: every load and store requires a DMA operation
  — Too costly (~700 cycles per load or store)
  — MP will require locking on every reference
• What can be done to make this acceptable?
Prefetching

• Example:

Original code:
    for(i=0;i<100000;i++)
      a[i]=b[i]+c[i];

Blocked, with prefetch:
    for(i=0;i<100000;i+=100) {
      dma_get(b’,b[i],400);
      dma_get(c’,c[i],400);
      for(ii=0;ii<100;ii++)
        a’[ii]=b’[ii]+c’[ii];
      dma_put(a[i],a’,400);
    }

Software-pipelined prefetch:
    dma_get(b’,b[0],400);
    dma_get(c’,c[0],400);
    for(i=0;i<99900;i+=100) {
      dma_get(b”,b[i+100],400);
      dma_get(c”,c[i+100],400);
      for(ii=0;ii<100;ii++)
        a’[ii]=b’[ii]+c’[ii];
      dma_put(a[i],a’,400);
      swap(a’,a”);
      swap(b’,b”);
      swap(c’,c”);
    }
    for(ii=0;ii<100;ii++)
      a’[ii]=b’[ii]+c’[ii];
    dma_put(a[99900],a’,400);
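The software-pipelined version is a double-buffering scheme: while the SPU computes on the current chunk, the DMA engine fetches the next. A runnable scaled-down model (memcpy standing in for dma_get/dma_put; N=1000 instead of 100000; `a` is written directly rather than staged through an output buffer, to keep the sketch short):

```c
#include <string.h>

enum { N = 1000, CHUNK = 100 };

/* Computes a[i] = b[i] + c[i] in CHUNK-sized blocks with double
   buffering: the "next" buffers (bn/cn) are filled while the "current"
   buffers (bp/cp) are consumed, then the pointers are swapped. */
static void add_blocked(float *a, const float *b, const float *c) {
    float b0[CHUNK], b1[CHUNK], c0[CHUNK], c1[CHUNK];
    float *bp = b0, *bn = b1, *cp = c0, *cn = c1;

    memcpy(bp, b, sizeof b0);                  /* prologue: fetch chunk 0 */
    memcpy(cp, c, sizeof c0);
    for (int i = 0; i < N - CHUNK; i += CHUNK) {
        memcpy(bn, b + i + CHUNK, sizeof b0);  /* "prefetch" next chunk */
        memcpy(cn, c + i + CHUNK, sizeof c0);
        for (int ii = 0; ii < CHUNK; ii++)     /* compute on current chunk */
            a[i + ii] = bp[ii] + cp[ii];
        float *t;                              /* swap current/next buffers */
        t = bp; bp = bn; bn = t;
        t = cp; cp = cn; cn = t;
    }
    for (int ii = 0; ii < CHUNK; ii++)         /* epilogue: last chunk */
        a[N - CHUNK + ii] = bp[ii] + cp[ii];
}
```

On real hardware the payoff comes from the copies being asynchronous DMA that overlaps the compute loop; memcpy here only models the data movement, not the overlap.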
Irregular Accesses

• Example:

    for(i=0;i<100000;i++)
      a[i]=b[i]+c[i]*d[f(i)];

• b and c can be prefetched, but d has an irregular access pattern, so we cannot predict which elements of d are required
• We seem to be thrown back on the naïve implementation: d[f(i)] must be fetched on each iteration, with a consequent large slowdown of the loop
• Observation: it’s as if every access to d incurred a cache miss

What do we do about this?
Software Caching

Original code:
    for(i=0;i<100000;i++)
      … = … d[f(i)];

Code with explicit cache lookup:
    for(i=0;i<100000;i++) {
      t=cache_lookup(d[f(i)]);
      … = … t;
    }

    inline vector cache_lookup(addr) {
      if (cache_directory[addr&key_mask] != (addr&tag_mask))
        miss_handler(addr);
      return cache_data[addr&key_mask][addr&offset_mask];
    }

• The miss handler will DMA the required data, and some suitable quantity of surrounding data
• Higher degrees of associativity can be supported, possibly for little extra cost on a SIMD processor
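A runnable direct-mapped version of the lookup above, with "system memory" as a plain array and memcpy as the miss handler's DMA. The geometry (16 lines of 32 bytes) and all names are illustrative, not the actual compiler runtime:

```c
#include <string.h>
#include <stdint.h>

enum { LINE_BYTES = 32, NUM_LINES = 16 };
enum { INTS_PER_LINE = LINE_BYTES / sizeof(int) };

static _Alignas(LINE_BYTES) int backing[4096];  /* "system memory" */
static int      cache_data[NUM_LINES][INTS_PER_LINE];
static intptr_t cache_tags[NUM_LINES];          /* -1 = invalid line */

static void cache_init(void) {
    for (int i = 0; i < NUM_LINES; i++) cache_tags[i] = -1;
}

/* Direct-mapped lookup: the line-aligned address is the tag, and the
   line number modulo NUM_LINES selects the set.  On a miss, the whole
   surrounding line is copied in ("DMA"), as the slide describes. */
static int cache_lookup(const int *addr) {
    intptr_t a    = (intptr_t)addr;
    intptr_t line = a & ~(intptr_t)(LINE_BYTES - 1);
    int      set  = (int)((line / LINE_BYTES) % NUM_LINES);
    if (cache_tags[set] != line) {              /* miss: fetch the line */
        memcpy(cache_data[set], (const void *)line, LINE_BYTES);
        cache_tags[set] = line;
    }
    int offset = (int)((a - line) / (intptr_t)sizeof(int));
    return cache_data[set][offset];
}
```

Consecutive accesses within a line hit after the first miss, which is exactly the locality the miss handler's "surrounding data" fetch is banking on.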
Combining Prefetch with Software Cache

• We may already have b[f(i)] in local store as a result of prefetching
• In this example, the only effect is to cause unnecessary misses
• But if we substitute a[f(i)] for the last term …

Original code:
    for(i=0;i<100000;i++)
      a[i]=b[i]+c[i]*b[f(i)];

Prefetching and caching:
    for(i=0;i<100000;i+=100) {
      dma_get(b’,b[i],400);
      dma_get(c’,c[i],400);
      for(ii=0;ii<100;ii++) {
        t=cache_lookup(b[f(i+ii)]);
        a’[ii]=b’[ii]+c’[ii]*t;
      }
      dma_put(a[i],a’,400);
    }

• Prefetching must also update the cache directory, and
• Miss handling must not evict prefetched data
Coherence Problem

• SPE accesses data in global memory through two mechanisms:
  — Software controlled cache
  — Static buffers
• Incorrect values may be used or generated if coherence is not maintained. Examples:
  — Two copies of the data, one in the software controlled cache and one in a static buffer: one changes the value and the other may read a stale value
  — Multiple copies of the data in different static buffers
• Approaches:
  — Compiler: no runtime overhead
  — Runtime: more powerful but complicated
Solution Overview
• Combine two approaches for an optimal solution
  — Try to apply the compiler solution as much as possible
  — Resort to the runtime solution if necessary
• Components
  — Local coherence simplification
  — Global coherence avoidance analysis
  — Dynamic coherence maintenance
Local Coherence Simplification

• References are put into a static buffer in a loop only when there is no data dependence between the reference and any other reference accessed by the software controlled cache or another static buffer in the loop
• Then there is no coherence problem for this static buffer within the loop
• Runtime coherence maintenance is needed only
  — At the entry of the loop: DMA read, and check whether the software controlled cache has updated data
  — At the exit of the loop:
    – Write-through: update the hit cache line and DMA write
    – Write-back: put the static buffer content into the cache
• Pros/Cons
  — Requires only local data dependence info, which is more likely to be available
  — The structure of the software controlled cache remains unchanged
  — The coherence maintenance can be overlapped with DMA operations
  — Candidates for static buffers may be lost if the data dependence information is too conservative
Global Coherence Avoidance Analysis

• Runtime coherence maintenance can be avoided by compiler analysis
  — At entry: if there is no updated cache line for this static buffer
  — At exit: if there is no cache line for this static buffer already in the cache that will be referenced later
• How the compiler predicts cache contents
  — No lines are in the cache after a flush
  — If data is carefully aligned or padded, the compiler can assume different variables will never share a cache line
  — Replacement cannot be predicted: a line is assumed to stay in the cache until a flush
Optimization with Flushes

• When runtime coherence maintenance is needed by the previous analysis, it may be profitable to insert extra cache flushes to avoid the coherence maintenance
• A flush can target one variable, or several can be combined into a flush-all
• The previous analysis can provide information about possible insertion points for flushes
  — Move them in the control flow graph to reduce the overhead
  — Similar to the algorithm for partial redundancy elimination
  — Branch profiling may help
SPU Code Partition Manager Overview

[Diagram: a long-resident partition manager on the SPU processor holds the active partition m; when that code makes a call into partition n, the manager loads partition n from the table of partitions (1, 2, …, n, …) kept in main memory.]
OVERLAY command effect: Binary View

[Diagram: in the program binary image, the header is followed by partitions 1, 2, …, n at successive file offsets (0x1000, 0x2000, 0x3000, …).]
OVERLAY command effect: Execution View

[Diagram: in the program memory image, partitions 1, 2, …, n all overlay the same region after the header — each partition is loaded at the same virtual address, 0x1000.]
Call Graph Partitioning Algorithm

• Build an affinity graph from the global call graph
  — Each call graph node becomes a node in the affinity graph and costs some memory
  — Each call graph edge becomes an edge in the affinity graph
• Each call graph edge is weighted
  — Estimated through static program analysis
  — Profiling
• Apply a maximum spanning tree algorithm to the affinity graph
  — Process edges in order of decreasing weight
  — If merging the two endpoints of an edge does not exceed the memory limit, merge them, and so on
• Each (merged) node left is a program partition
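The greedy merge step above can be sketched with a union-find structure: sort edges by decreasing affinity weight and union the endpoints whenever the combined code size stays within the memory limit. All data shapes here are illustrative, not the actual compiler's:

```c
#include <stdlib.h>

enum { MAX_NODES = 64 };

struct edge { int a, b, weight; };

static int parent[MAX_NODES];
static int size[MAX_NODES];  /* code size of each merged partition */

static int find(int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
}

static int cmp_edge(const void *p, const void *q) {
    return ((const struct edge *)q)->weight - ((const struct edge *)p)->weight;
}

/* Greedily merges nodes along the heaviest affinity edges, skipping
   merges that would exceed `limit`; returns the partition count. */
static int partition(int n, const int *node_size,
                     struct edge *edges, int m, int limit) {
    int parts = n;
    for (int i = 0; i < n; i++) { parent[i] = i; size[i] = node_size[i]; }
    qsort(edges, m, sizeof *edges, cmp_edge);   /* heaviest affinity first */
    for (int i = 0; i < m; i++) {
        int ra = find(edges[i].a), rb = find(edges[i].b);
        if (ra != rb && size[ra] + size[rb] <= limit) {
            parent[rb] = ra;                    /* merge the two partitions */
            size[ra] += size[rb];
            parts--;
        }
    }
    return parts;
}
```

With nodes of size 300 each, edges (0,1,1000), (1,2,200), (2,3,10), and a limit of 1000, the first two merges go through (600, then 900) and the third is rejected, leaving two partitions — the same skip-if-too-big behavior as the worked example on the next slides.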
An Example of Call Graph Partitioning

[Diagram sequence, assuming a memory limit of 1000: an affinity graph of functions (code sizes mostly 300, one 100) connected by edges with weights 1000, 300, 200, 100, 50, 50, and 10. Successive slides show the heaviest edges being merged first (300+300 → 600, then 600+300 → 900), skipping any merge that would exceed the limit, until three partitions remain (sizes 900, 900, and 150 in the final slide).]
Optimizations

• Profiling to get accurate call edge frequencies
  — Especially in the presence of many indirect calls through function pointers
• Getting accurate function code sizes
  — Currently estimated; conservative and very rough
• Leaf function duplication
  — Some leaf functions are referenced by two non-coalescable partitions
  — It may be beneficial to duplicate the function
• Double buffering
  — Relies on good prefetching to be beneficial
  — Prefetching is a difficult problem
Performance Normalized to one SPU

• NAS and SPEC OMP benchmarks, speedups against 1 SPE
• Scalability is generally very good
  — IS and equake are not as good due to non-parallelized loops

[Chart: speedup (1-8) for 1, 2, 4, and 8 SPUs across CG, EP, FT, IS, MG, equake, and swim.]
Outline
• Cell Processor and Software Environment
  — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
• Code and Data Partitioning for the Local Stores on the Cell/BE processor
  — http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf
• Yuan Zhao -- Experiences with compiling for Cell
Special Notices

© Copyright International Business Machines Corporation 2006. All Rights Reserved.

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.

All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
Special Notices (Cont.) -- Trademarks

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both.
Rambus is a registered trademark of Rambus, Inc.
XDR and FlexIO are trademarks of Rambus, Inc.
UNIX is a registered trademark in the United States, other countries or both.
Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Fedora is a trademark of Redhat, Inc.
Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.
Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.
AMD Opteron is a trademark of Advanced Micro Devices, Inc.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC).
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
AltiVec is a trademark of Freescale Semiconductor, Inc.
PCI-X and PCI Express are registered trademarks of PCI SIG.
InfiniBand™ is a trademark of the InfiniBand® Trade Association.
Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006
(c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States April 2005.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture.

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Microelectronics Division, 1580 Route 52, Bldg. 504, Hopewell Junction, NY 12533-6351
The IBM home page is http://www.ibm.com
The IBM Microelectronics Division home page is http://www.chips.ibm.com