Note (Burroughs B5500 multiprocessor): these machines were designed to support HLLs such as Algol. They used a stack architecture, but part of the stack was also addressable as registers.
COMP 740: Computer Architecture and Implementation

Montek Singh
Thu, April 2, 2009

Topic: Multiprocessors I
Uniprocessor Performance (SPECint)

[Figure: performance relative to the VAX-11/780, log scale from 1 to 10000, over 1978-2006, showing growth trends of 25%/year, 52%/year, and ??%/year; the post-2002 flattening leaves roughly a 3X gap versus the earlier trend.]

- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Déjà vu all over again?

"… today's processors … are nearing an impasse as technologies approach the speed of light."
— David Mitchell, The Transputer: The Time Is Now (1989)

- Transputer had bad timing: uniprocessor performance kept climbing
  - Procrastination rewarded: 2X sequential performance every 1.5 years

"We are dedicating all of our future product development to multicore designs. … This is a sea change in computing."
— Paul Otellini, President, Intel (2005)

- All microprocessor companies have switched to MP (2X CPUs every 2 years)
  - Procrastination penalized: 2X sequential performance every 5 years

Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'05
Processors/chip          2          2          2         8
Threads/processor        1          2          2         4
Threads/chip             2          4          4        32
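The bottom row is simply the product of the two rows above it; e.g., for the Sun part:

$$\text{Threads/chip} = \text{Processors/chip} \times \text{Threads/processor} = 8 \times 4 = 32$$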
Other Factors → Multiprocessors

- Growth in data-intensive applications
  - Databases, file servers, …
- Growing interest in servers and server performance
- Increasing desktop performance is less important
  - Outside of graphics
- Improved understanding of how to use multiprocessors effectively
  - Especially servers, where there is significant natural TLP
- Advantage of leveraging design investment by replication
  - Rather than unique design
Flynn's Taxonomy

- Flynn classified machines by their data and control streams in 1966
- SIMD → Data-Level Parallelism
- MIMD → Thread-Level Parallelism
- MIMD popular because
  - Flexible: runs N programs, or 1 multithreaded program
  - Cost-effective: same MPU in desktop & MIMD

                        Single Data              Multiple Data
Single Instruction      SISD (uniprocessor)      SIMD (single PC: vector, CM-2)
Multiple Instruction    MISD (????)              MIMD (clusters, SMP servers)
M.J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
Back to Basics

- Parallel Architecture = Computer Architecture + Communication Architecture
- 2 classes of multiprocessors with respect to memory:
  1. Centralized Memory Multiprocessor
     - < few dozen processor chips (and < 100 cores) in 2006
     - Small enough to share a single, centralized memory
  2. Physically Distributed-Memory Multiprocessor
     - Larger number of chips and cores than #1
     - BW demands → memory distributed among processors
Centralized vs. Distributed Memory

[Figure: block diagrams of the two organizations, centralized memory vs. distributed memory.]
Centralized Memory Multiprocessor

- Also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors
- Large caches and a single memory can satisfy the memory demands of a small number of processors
  - Can scale to a few dozen processors by using a switch instead of a bus, and many memory banks
  - Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
Distributed Memory Multiprocessor

- Pros:
  - Cost-effective way to scale memory bandwidth, if most accesses are to local memory
  - Reduces latency of local memory accesses
- Cons:
  - Communicating data between processors is more complex
  - Software must change to take advantage of the increased memory BW
2 Models for Communication and Memory Architecture

1. Communication occurs explicitly, by passing messages among the processors: message-passing multiprocessors
2. Communication occurs implicitly, through a shared address space (via loads and stores): shared-memory multiprocessors; either
   - UMA (Uniform Memory Access time): shared address space, centralized memory MP
   - NUMA (Non-Uniform Memory Access time): shared address space, distributed memory MP

Note: In the past there was confusion over whether "sharing" means sharing physical memory (symmetric MP) or sharing the address space. (A minimal sketch of the two models follows below.)
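To make the distinction concrete, here is a minimal sketch (my illustration, not from the slides) using Python's standard library; the worker names are hypothetical. Threads communicate implicitly through a shared address space, while processes communicate explicitly by passing messages:

```python
import threading
import multiprocessing

shared = [0]                     # data visible to all threads (shared memory)
lock = threading.Lock()

def shared_memory_worker():
    with lock:                   # synchronization is explicit even though
        shared[0] += 1           # communication (the store) is implicit

def message_passing_worker(inbox, outbox):
    value = inbox.get()          # receive an explicit message
    outbox.put(value + 1)        # send an explicit message back

if __name__ == "__main__":
    # Shared-memory model: communicate via loads/stores to `shared`
    t = threading.Thread(target=shared_memory_worker)
    t.start()
    t.join()
    print("shared-memory result:", shared[0])       # -> 1

    # Message-passing model: no shared data, only queues
    inbox, outbox = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=message_passing_worker,
                                args=(inbox, outbox))
    p.start()
    inbox.put(41)
    print("message-passing result:", outbox.get())  # -> 42
    p.join()
```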
Challenges of Parallel Processing

- First challenge is Amdahl's Law: what % of the program is inherently sequential?
- Suppose we see an 80X speedup from 100 processors. What fraction of the original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%
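The answer slide is not part of this excerpt, but the arithmetic follows directly from Amdahl's Law with sequential fraction $f_{\text{seq}}$:

$$80 = \frac{1}{f_{\text{seq}} + \frac{1 - f_{\text{seq}}}{100}} \;\Longrightarrow\; 99\,f_{\text{seq}} = \frac{100}{80} - 1 = 0.25 \;\Longrightarrow\; f_{\text{seq}} \approx 0.25\%$$

So the answer is (d): less than 1% of the program can be sequential.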
Challenges of Parallel Processing

- Second challenge: long latency to remote memory
- Suppose: 32-CPU MP, 2 GHz clock, 200 ns remote memory access, all local accesses hit in the memory hierarchy, base CPI of 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.)
- What is the performance impact if 0.2% of instructions involve a remote access?
  a. 1.5X
  b. 2.0X
  c. 2.5X
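Again the answer slide is omitted, but the CPI arithmetic is short:

$$\text{CPI} = 0.5 + 0.2\% \times 400 = 0.5 + 0.8 = 1.3, \qquad \frac{1.3}{0.5} = 2.6$$

So the machine with no remote accesses is roughly 2.6X faster; (c) is the closest answer.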
Challenges of Parallel Processing

1. Application parallelism: addressed primarily via new algorithms that have better parallel performance
2. Long remote latency impact: for example, reduce the frequency of remote accesses, either by
   - Caching shared data (HW)
   - Restructuring the data layout to make more accesses local (SW); see the sketch below
- We'll look at reducing latency via caches
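As a concrete illustration of the SW option, here is a minimal sketch (my example, not from the slides) of how the choice of data layout changes locality; `N`, `P`, and the index assignments are hypothetical:

```python
N, P = 16, 4   # array elements, processors

# Strided assignment: processor p touches indices p, p+P, p+2P, ...
# One processor's accesses are spread across the whole array, so on a
# distributed-memory machine most of them would land in remote memory.
strided = {p: list(range(p, N, P)) for p in range(P)}

# Blocked assignment: processor p touches one contiguous chunk, which can
# be placed entirely in that processor's local memory.
chunk = N // P
blocked = {p: list(range(p * chunk, (p + 1) * chunk)) for p in range(P)}

print("strided, proc 0:", strided[0])   # [0, 4, 8, 12] -> mostly remote
print("blocked, proc 0:", blocked[0])   # [0, 1, 2, 3]  -> all local
```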
T1 ("Niagara")

- Target: commercial server applications
  - High thread-level parallelism (TLP): large numbers of parallel client requests
  - Low instruction-level parallelism (ILP): high cache miss rates, many unpredictable branches, frequent load-load dependencies
- Power, cooling, and space are major concerns for data centers
- Metric: Performance/Watt/Sq. Ft.
- Approach: multicore; fine-grained multithreading; simple pipeline; small L1 caches; shared L2
T1 Architecture

- Also ships with 6 or 4 processors
T1 Pipeline

- Single-issue, in-order, 6-deep pipeline: F (fetch), S (thread select), D (decode), E (execute), M (memory), W (writeback)
- 3 clock delays for loads & branches
- Shared units: L1 caches, L2, TLB
T1 Fine-Grained Multithreading

- Each core:
  - supports four threads
  - has its own level-one caches (16 KB instruction and 8 KB data)
- Switches to a new thread on each clock cycle
- Idle threads are bypassed in the scheduling
  - e.g., those waiting due to a pipeline delay or cache miss
- Processor is idle only when all 4 threads are idle or stalled
- Both loads and branches incur a 3-cycle delay that can only be hidden by other threads (a minimal scheduling sketch follows this list)
- A single set of floating-point functional units is shared by all 8 cores
  - Floating-point performance was not a focus for T1
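The sketch below is a simplification under the assumptions above (round-robin selection, a uniform 3-cycle stall), not the actual T1 thread-select logic; `select_thread` and `stall_until` are hypothetical names:

```python
# Fine-grained multithreading: each cycle, pick the next thread round-robin,
# bypassing any thread still stalled (e.g. on the 3-cycle load/branch delay).
THREADS = 4

def select_thread(last, stall_until, cycle):
    """Return the next ready thread after `last`, or None if all are stalled."""
    for offset in range(1, THREADS + 1):
        t = (last + offset) % THREADS
        if stall_until[t] <= cycle:
            return t
    return None   # all 4 threads stalled: the core issues nothing this cycle

# Tiny demo: thread 1 is stalled until cycle 3 (a 3-cycle load delay).
stall_until = [0, 3, 0, 0]
last = 0
for cycle in range(6):
    t = select_thread(last, stall_until, cycle)
    if t is not None:
        last = t
    print(f"cycle {cycle}: issue from thread {t}")
```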
Conclusion

- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory
  - Centralized suits small MPs; distributed offers lower latency and larger BW for larger MPs
- Message passing vs. shared address space
  - Uniform access time vs. non-uniform access time
- Caches are critical
- Next:
  - Review of caching (App. C)
  - Methods to ensure cache consistency in SMPs