Embedded Computer Architecture 5KK73
MPSoC Platforms
Part2: Cell
Bart Mesman and Henk Corporaal
The Complexity Crisis
I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone.
--Bjarne Stroustrup
04/21/23 2
The Software Crisis
The first SW crisis
• Time Frame: ’60s and ’70s
• Problem: Assembly Language Programming
  – Computers could handle larger, more complex programs
• Needed to get Abstraction and Portability without losing Performance
• Solution:
  – High-level languages for von Neumann machines: FORTRAN and C
The second SW crisis
• Time Frame: ’80s and ’90s
• Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  – Computers could handle larger, more complex programs
• Needed to get Composability and Maintainability
  – High performance was not an issue: left for Moore’s Law
Solution
• Object Oriented Programming
  – C++, C# and Java
• Also…
  – Better tools
    • Component libraries, Purify
  – Better software engineering methodology
    • Design patterns, specification, testing, code reviews
Today: Programmers are Oblivious to Processors
• Solid boundary between Hardware and Software
• Programmers don’t have to know anything about the processor
  – High-level languages abstract away the processors
    • Ex: Java bytecode is machine independent
  – Moore’s law does not require the programmers to know anything about the processors to get good speedups
• Programs are oblivious of the processor -> work on all processors
  – A program written in the ’70s using C still works and is much faster today
• This abstraction provides a lot of freedom for the programmers
The third crisis: Powered by PlayStation
Contents
• Hammer your head against 4 walls
  – Or: Why Multi-Processor
• Cell Architecture
• Programming and porting
  – plus case-study
Moore’s Law
Single Processor SPECint Performance
What’s stopping them?
• General-purpose uni-cores have stopped historic performance scaling
  – Power consumption
  – Wire delays
  – DRAM access latency
  – Diminishing returns of more instruction-level parallelism
Power density
Power Efficiency (Watts/Spec)
1 clock cycle wire range
Global wiring delay becomes dominant over gate delay
[Figure: Gate delay vs. wire delay. x-axis: technology (micron), from 0.5 down to 0.1; y-axis: delay in ps, 0 to 400; series: wire delay (ps/mm) and gate delay (ps).]
Memory
[Figure: Processor-memory performance gap (Patterson). Performance (log scale, 1 to 1000) vs. time (1980 to 2005). CPU (µProc, “Moore’s Law”): 55%/year; DRAM: 7%/year; the processor-memory performance gap grows 50% per year.]
Now what?
• Latest research drained
• Tried every trick in the book
So: We’re fresh out of ideas
Multi-processor is all that’s left!
Low power through parallelism
• Sequential Processor
  – Switching capacitance $C$
  – Frequency $f$
  – Voltage $V$
  – $P = f C V^2$
• Parallel Processor (two times the number of units)
  – Switching capacitance $2C$
  – Frequency $f/2$
  – Voltage $V' < V$
  – $P = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
Architecture methods
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];
c = a + 5*b
Assembly:
set   vl,64
ldv   v1,0(r2)
mulvi v2,v1,5
ldv   v1,0(r1)
addv  v3,v1,v2
stv   v3,0(r3)
Architecture methods
Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on restricted scale
  – Used for multimedia instructions
  – Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
  – MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II
  – Example: $\sum_{i=1}^{4} |a_i - b_i|$
MPSoC Issues
• Homogeneous vs Heterogeneous
• Shared memory vs local memory
• Topology
• Communication (Bus vs. Network)
• Granularity (many small vs few large)
• Mapping
  – Automatic vs manual parallelization
  – TLP vs DLP
  – Parallel vs Pipelined
Multi-core
Cell
What can it do?
Cell/B.E. - the history
• Sony/Toshiba/IBM consortium
  – Austin, TX – March 2001
  – Initial investment: $400,000,000
• Official name: STI Cell Broadband Engine
  – Also goes by Cell BE, STI Cell, Cell
• In production for:
  – PlayStation 3 from Sony
  – Mercury’s blades
Cell blade
Cell/B.E. – the architecture
• 1 x PPE: 64-bit PowerPC
  – L1: 32 KB I$ + 32 KB D$
  – L2: 512 KB
• 8 x SPE cores
  – Local store: 256 KB
  – 128 x 128-bit vector registers
• Hybrid memory model
  – PPE: Rd/Wr
  – SPEs: Asynchronous DMA
• EIB: 205 GB/s sustained aggregate bandwidth
• Processor-to-memory bandwidth: 25.6 GB/s
• Processor-to-processor: 20 GB/s in each direction
Cell chip
SPE
SPE
SPE pipeline
Communication
8 parallel transactions
C++ on Cell
1. Send the code of the function to be run on SPE
2. Send address to fetch the data
3. DMA data in LS from the main memory
4. Run the code on the SPE
5. DMA data out of LS to the main memory
6. Signal the PPE that the SPE has finished the function
Conclusions
• Multi-processors inevitable
• Huge performance increase, but…
• Hell to program
  – Got to be an architecture expert
  – Portability?