EEC 581
Computer Architecture
Multicore Architecture
Department of Electrical Engineering and Computer Science
Cleveland State University
Multiprocessor Architectures
Late 1950s - one general-purpose processor and one
or more special-purpose processors for input and
output operations
Early 1960s - multiple complete processors, used for
program-level concurrency
Mid-1960s - multiple partial processors, used for
instruction-level concurrency
Single-Instruction Multiple-Data (SIMD) machines
Multiple-Instruction Multiple-Data (MIMD) machines
A primary focus of this chapter is shared memory
MIMD machines (multiprocessors)
Thread Level Parallelism (TLP)
• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads
Instruction and Data Streams
• Taxonomy due to M. Flynn
SISD (single instruction stream, single data stream): Intel Pentium 4
SIMD (single instruction stream, multiple data streams): SSE instructions of x86
MISD (multiple instruction streams, single data stream): no examples today
MIMD (multiple instruction streams, multiple data streams): Intel Xeon e5345
Example: Multithreading (MT) in a single address space
Recall The Executable Format
header
text
static data
reloc
symbol table
debug
Object file ready to be linked and loaded
Linker Loader
Static Libraries
An executable instance, or Process
What does a loader do?
Process
• A process is a running program with state
! Stack, memory, open files
! PC, registers
• The operating system keeps track of the state of all processes
! E.g., for scheduling processes
• There can be many processes for the same application
! E.g., web browser
• See an operating systems class for details

(figure: process address space with Code, Static data, Heap, Stack, DLL’s)
Process Level Parallelism
Process Process Process
• Parallel processes and throughput computing
• Each process itself does not run any faster
From Processes to Threads
• Switching processes on a core is expensive
! A lot of state information to be managed
• If I want concurrency, launching a process is expensive
• How about splitting up a single process into parallel computations?
→ Lightweight processes or threads!
Process
Categories of Concurrency
Physical concurrency - multiple independent processors (multiple threads of control)
Logical concurrency - the appearance of physical concurrency is presented by time-sharing one processor (software can be designed as if there were multiple threads of control)
Coroutines (quasi-concurrency) have a single thread of control
A thread of control in a program is the sequence of program points reached as control flows through the program
Motivations for the Use of Concurrency
Multiprocessor computers capable of physical
concurrency are now widely used
Even if a machine has just one processor, a program
written to use concurrent execution can be faster than
the same program written for nonconcurrent execution
Involves a different way of designing software that can
be very useful—many real-world situations involve
concurrency
Many program applications are now spread over
multiple machines, either locally or over a network
Introduction to Subprogram-Level
Concurrency
A task or process or thread is a program unit that can be in concurrent execution with other program units
Tasks differ from ordinary subprograms in that:
A task may be implicitly started
When a program unit starts the execution of a task, it is not necessarily suspended
When a task’s execution is completed, control may not return to the caller
Tasks usually work together
Two General Categories of Tasks
Heavyweight tasks execute in their own address
space
Lightweight tasks all run in the same address space –
more efficient
A task is disjoint if it does not communicate with or
affect the execution of any other task in the program
in any way
Task Synchronization
A mechanism that controls the order in which tasks execute
Two kinds of synchronization:
Cooperation synchronization
Competition synchronization
Task communication is necessary for synchronization, provided by:
- Shared nonlocal variables
- Parameters
- Message passing
Kinds of synchronization
Cooperation: Task A must wait for task B to complete
some specific activity before task A can continue its
execution, e.g., the producer-consumer problem
Competition: Two or more tasks must use some
resource that cannot be simultaneously used, e.g., a
shared counter
Competition is usually provided by mutually exclusive access
(approaches are discussed later)
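The producer-consumer problem mentioned above is the classic example of cooperation synchronization. A minimal sketch in Python (the names `producer` and `consumer` and the buffer size are illustrative, not from the slides): the consumer must wait until the producer has put something into the bounded buffer, and the producer must wait when the buffer is full.

```python
# Cooperation synchronization: producer-consumer on a bounded buffer.
# queue.Queue does the waiting for us: put() blocks when full,
# get() blocks when empty.
import queue
import threading

buf = queue.Queue(maxsize=4)   # bounded buffer shared by both tasks
results = []

def producer():
    for i in range(8):
        buf.put(i)             # blocks if the buffer is full
    buf.put(None)              # sentinel: nothing more to produce

def consumer():
    while True:
        item = buf.get()       # blocks until an item is available
        if item is None:
            break
        results.append(item * 2)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(results)                 # items consumed in FIFO order
```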
Thread Parallel Execution
(figure: a process containing multiple threads)
A Thread
• A separate, concurrently executable instruction stream within a process
• Minimum amount of state to execute on a core
! Program counter, registers, stack
! Remaining state shared with the parent process
o Memory and files
• Support for creating threads
• Support for merging/terminating threads
• Support for synchronization between threads
! In accesses to shared data
Our datapath so far!
TLP
Exploiting the ILP of a single program is hard
Large amounts of ILP are far-flung across the program
We are human after all, and program with a sequential mind
Reality: running multiple threads or programs
Thread Level Parallelism
Time Multiplexing
Throughput computing
Multiple program workloads
Multiple concurrent threads
Helper threads to improve single program performance
Single and Multithreaded Processes
A Simple Example
Data Parallel Computation
Thread Execution: Basics

(figure: a process whose static data and heap are shared by two threads; each thread has its own stack and its own PC, registers, and stack pointer; Thread #1 runs funcA(), Thread #2 runs funcB())

create_thread(funcA)
create_thread(funcB)
funcA() and funcB() run concurrently
WaitAllThreads()
each thread finishes with end_thread()
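The create_thread/WaitAllThreads pseudocode above maps onto any real threading API. A minimal sketch using Python's threading module (funcA/funcB here are stand-ins for the per-thread work):

```python
# create_thread / WaitAllThreads sketch with Python's threading module:
# spawn two threads, run funcA and funcB concurrently, then wait for both.
import threading

log = []   # heap data shared by both threads

def funcA():
    log.append("A done")   # thread #1's work

def funcB():
    log.append("B done")   # thread #2's work

tA = threading.Thread(target=funcA)   # create_thread(funcA)
tB = threading.Thread(target=funcB)   # create_thread(funcB)
tA.start()
tB.start()
tA.join()                             # WaitAllThreads()
tB.join()
print(sorted(log))
```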
Examples of Threads
A web browser
One thread displays images
One thread retrieves data from network
A word processor
One thread displays graphics
One thread reads keystrokes
One thread performs spell checking in the background
A web server
One thread accepts requests
When a request comes in, separate thread is created to service
Many threads to support thousands of client requests
RPC or RMI (Java)
One thread receives message
Message service uses another thread
Threads Execution on a Single Core
• Hardware threads
! Each thread has its own hardware state
• Switching between threads on each cycle to share the core pipeline – why?
IF ID EX MEM WB

Thread #1:
lw $t0, label($0)
lw $t1, label1($0)
and $t2, $t0, $t1
andi $t3, $t1, 0xffff
srl $t2, $t2, 12
……

Thread #2:
lw $t3, 0($t0)
add $t2, $t2, $t3
addi $t0, $t0, 4
addi $t1, $t1, -1
bne $t1, $zero, loop
…….

(figure: the two threads’ instructions enter the pipeline in alternating cycles)
No pipeline stall on load-to-use hazard!
Interleaved execution improves utilization!
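The interleaving idea above can be modeled in a few lines. A toy sketch (the one-cycle load-use penalty and the instruction mix are my assumptions, not the slide's exact pipeline): running a thread alone pays a bubble for every load-to-use pair, while round-robin interleaving of two ready threads fills every bubble with the other thread's instruction.

```python
# Toy model: fine-grained multithreading hides load-to-use stalls by
# fetching from a different thread each cycle.

def cycles_single(instrs):
    """Cycles to run one thread alone; +1 bubble after each load-use pair."""
    cycles = 0
    prev_was_load = False
    for op, uses_prev in instrs:
        if prev_was_load and uses_prev:
            cycles += 1          # bubble: load-to-use hazard
        cycles += 1
        prev_was_load = (op == "lw")
    return cycles

def cycles_interleaved(t1, t2):
    """Round-robin issue: the other thread's instruction fills each bubble,
    so no stall cycles remain (assuming >= 2 ready threads)."""
    return len(t1) + len(t2)

# (opcode, depends_on_previous_instruction)
t1 = [("lw", False), ("add", True), ("lw", False), ("add", True)]
t2 = [("lw", False), ("add", True), ("lw", False), ("add", True)]

alone = cycles_single(t1) + cycles_single(t2)   # stalls included
mt = cycles_interleaved(t1, t2)                 # stalls hidden
print(alone, mt)
```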
An Example Datapath
From Poonacha Kongetira, Microarchitecture of the UltraSPARC T1 CPU
Execution Model: Multithreading
• Fine-grain multithreading
! Switch threads after each cycle
! Interleave instruction execution
! If one thread stalls (e.g., I/O), others are executed
• Coarse-grain multithreading
! Only switch on long stall (e.g., L2-cache miss)
! Simplifies hardware, but does not hide short stalls (e.g., data hazards)
Simultaneous Multithreading
• In multiple-issue dynamically scheduled processors ! Instruction-level parallelism across threads
! Schedule instructions from multiple threads
! Instructions from independent threads execute when function units are available
• Example: Intel Pentium-4 HT ! Two threads: duplicated registers, shared function
units and caches
! Known as Hyperthreading in Intel terminology
Threads vs. Processes
Thread
A thread has no data segment or heap
A thread cannot live on its own; it must live within a process
There can be more than one thread in a process; the first thread calls main and has the process’s stack
Inexpensive creation
Inexpensive context switching
If a thread dies, its stack is reclaimed by the process

Process
A process has code/data/heap and other segments
There must be at least one thread in a process
Threads within a process share code/data/heap and share I/O, but each has its own stack and registers
Expensive creation
Expensive context switching
If a process dies, its resources are reclaimed and all threads die
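The sharing relationship above is easy to demonstrate: threads see each other's writes to heap data, while each keeps its own stack and registers. A small sketch in Python (the names are illustrative):

```python
# Threads share the process's code/data/heap, so one thread's writes to a
# heap object are visible to the others; a lock provides the mutually
# exclusive access needed for competition synchronization.
import threading

shared = {"hits": 0}           # heap object shared by all threads
lock = threading.Lock()        # protects the shared counter

def worker():
    for _ in range(1000):
        with lock:             # mutual exclusion on the shared data
            shared["hits"] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                   # wait for every thread to finish

print(shared["hits"])          # all 4 threads updated the same object
```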
Thread Implementation
Process defines the address space
Threads share the address space

Process Control Block (PCB) contains process-specific info
PID, owner, heap pointer, active threads and pointers to thread info

Thread Control Block (TCB) contains thread-specific info
Stack pointer, PC, thread state, registers, …

(figure: process address space with CODE, initialized data, heap, one stack per thread, DLL’s, and a reserved region; a TCB per thread holds $pc, $sp, state, and registers)
Benefits
Responsiveness
When one thread is blocked, your browser still responds
E.g. download images while allowing your interaction
Resource Sharing
Share the same address space
Reduce overhead (e.g. memory)
Economy
Creating a new process costs memory and resources
E.g. in Solaris, creating a process is 30 times slower than creating a thread
Utilization of MP Architectures
Threads can be executed in parallel on multiple processors
Increase concurrency and throughput
User-level Threads
Thread management done by a user-level threads library
Thread switching is similar to calling a procedure
Thread management is done by the thread library in user space
User can control thread scheduling (without disturbing the underlying OS scheduler)
No OS kernel support needed, hence more portable
Low overhead when switching threads
Three primary thread libraries:
POSIX Pthreads
Java threads
Win32 threads
Kernel Threads
A.k.a. lightweight process in the literature
Supported by the Kernel
Thread scheduling is fairer
Examples
Windows XP/2000
Solaris
Linux
Tru64 UNIX
Mac OS X
Multithreading Models
Many-to-One
One-to-One
Many-to-Many
Many-to-One
Many user-level threads mapped to a single kernel thread
The entire process will block if a thread makes a
blocking system call
Cannot run threads in parallel on multiprocessors
Examples
Solaris Green Threads
GNU Portable Threads
Many-to-One Model
One-to-One
Each user-level thread maps to a kernel thread
A blocking system call in one thread does not block the other threads
Enables parallel execution in an MP system
Downside:
Performance/memory overheads of creating kernel threads
Restricts the number of threads that can be supported
Examples
Windows NT/XP/2000
Linux
Solaris 9 and later
One-to-one Model
Many-to-Many Model
Allows many user level threads to be mapped to many
kernel threads
Allows the operating system to create a sufficient
number of kernel threads
Threads are multiplexed to a smaller (or equal)
number of kernel threads which is specific to a
particular application or a particular machine
Solaris prior to version 9
Windows NT/2000 with the ThreadFiber package
Many-to-Many Model
Pipeline Hazards

Multithreading
Multi-Tasking Paradigm
Virtual memory makes it easy
Context switch could be expensive or requires extra HW
VIVT cache
VIPT cache
TLBs

(figure: execution time quantum vs. function units FU1-FU4; a conventional superscalar single-threaded core time-multiplexes Threads 1-5, leaving many issue slots unused)
Multi-threading Paradigm
(figure: execution time vs. function units FU1-FU4 for Threads 1-5 under five designs: Conventional Superscalar Single Threaded, Fine-grained Multithreading (cycle-by-cycle interleaving), Coarse-grained Multithreading (block interleaving), Simultaneous Multithreading (SMT), and Chip Multiprocessor (CMP or MultiCore))
Conventional Multithreading
Zero-overhead context switch
Duplicated contexts for threads
(figure: a register file holding four thread contexts, 0:r0-r7 through 3:r0-r7, selected by a context pointer CtxtPtr; memory is shared by the threads)
Cycle Interleaving MT
Per-cycle, Per-thread instruction fetching
Examples: HEP, Horizon, Tera MTA, MIT M-machine
Interesting questions to consider
Does it need a sophisticated branch predictor?
Or does it need any speculative execution at all?
Get rid of “branch prediction”?
Get rid of “predication”?
Does it need any out-of-order execution capability?
Block Interleaving MT
Context switch on a specific event (dynamic pipelining)
Explicit switching: implemented by a switch instruction
Implicit switching: triggered when a specific instruction class is fetched
Static switching (switch upon fetching)
Switch-on-memory-instructions: Rhamma processor
Switch-on-branch or switch-on-hard-to-predict-branch
The trigger can be an implicit or explicit instruction
Dynamic switching
Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife’s node), Rhamma processor
Switch-on-use (lazy strategy of switch-on-cache-miss): wait until the last minute
A valid bit is needed for each register: cleared when the load issues, set when data returns
Switch-on-signal (e.g. interrupt)
Predicated switch instruction based on conditions
No need to support a large number of threads
Simultaneous Multithreading (SMT)
SMT: name first used by UW; earlier versions from UCSB [Nemirovsky, HICSS‘91] and [Hirata et al., ISCA-92]
Intel’s HyperThreading (2-way SMT)
IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores, each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 in-order cores
Basic ideas: Conventional MT + simultaneous issue + sharing common resources
(figure: SMT datapath: a fetch unit with one PC per thread reads the I-CACHE; decode feeds per-thread register renamers and register files into a shared RS & ROB plus physical register file; shared function units include ALU1, ALU2, Load/Store (variable), FAdd (2 cyc), FMult (4 cycles), and an unpipelined Fdiv (16 cycles), with a shared D-CACHE)
Instruction Fetching Policy
FIFO, Round Robin: simple but may be too naive
Adaptive Fetching Policies
BRCOUNT (reduce wrong-path issuing)
Count # of branch instructions in decode/rename/IQ stages
Give top priority to the thread with the least BRCOUNT
MISSCOUNT (reduce IQ clog)
Count # of outstanding D-cache misses
Give top priority to the thread with the least MISSCOUNT
ICOUNT (reduce IQ clog)
Count # of instructions in decode/rename/IQ stages
Give top priority to the thread with the least ICOUNT
IQPOSN (reduce IQ clog)
Give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues
Threads with the oldest instructions are most prone to IQ clog
No counter needed
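ICOUNT, for instance, reduces to a one-line selection once per-thread front-end counts are available. A toy sketch (the counts are made up; a real implementation lives in the fetch-stage hardware):

```python
# ICOUNT fetch policy sketch: give fetch priority to the thread with the
# fewest instructions in the decode/rename/issue-queue stages, which keeps
# one slow thread from clogging the shared issue queue.

def icount_pick(front_end_counts):
    """Return the thread id with the fewest in-flight front-end instructions."""
    return min(front_end_counts, key=front_end_counts.get)

# hypothetical per-thread counts of instructions in decode/rename/IQ
counts = {0: 12, 1: 3, 2: 7, 3: 9}
print(icount_pick(counts))   # thread 1 gets fetch priority this cycle
```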
Resource Sharing
Could be tricky when threads compete for the resources
Static
Less complexity
Could penalize threads (e.g. instruction window size)
P4’s Hyperthreading
Dynamic
Complex
What is fair? How to quantify fairness?
A growing concern in Multi-core processors
Shared L2, Bus bandwidth, etc.
Issues
Fairness
Mutual thrashing
Hyper-threading
• Implementation of Hyper-threading adds less than 5% to the chip area
• Principle: share major logic components (functional units) and improve utilization
• Architecture State: the per-thread state (e.g., registers and program counter) a core must hold to execute a thread
(figure: 2 CPUs without Hyper-threading pair one Arch State with each set of Processor Execution Resources; 2 CPUs with Hyper-threading share each set of Processor Execution Resources between two Arch States)
Multithreading with ILP: Examples
P4 HyperThreading Resource Partitioning
TC (or UROM) is alternately accessed per cycle by each logical processor, unless one is stalled due to a TC miss
µop queue halved after fetch from the TC
ROB (126/2)
LB (48/2)
SB (24/2) (32/2 for Prescott)
General µop queue and memory µop queue halved
TLB (½?) as there is no PID
Retirement: alternating between the 2 logical processors
Alpha 21464 (EV8) Processor
Technology
Leading edge process technology – 1.2 ~ 2.0GHz
0.125µm CMOS
SOI-compatible
Cu interconnect
low-k dielectrics
Chip characteristics
~1.2V Vdd
~250 Million transistors
~1100 signal pins in flip chip packaging
Alpha 21464 (EV8) Processor
Architecture
Enhanced out-of-order execution (that giant 2Bc-gskew
predictor we discussed before is here)
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA for up to 512-way
SMP
8-wide superscalar
4-way simultaneous multithreading (SMT)
Total die overhead ~ 6% (allegedly)
SMT Pipeline
(figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; the fetch stage holds per-thread PCs and reads the Icache, the map stage uses the register map, and replicated register files feed the execute stage and Dcache)
Source: A company once called Compaq
EV8 SMT
In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
Replicated hardware contexts
Program counter
Architected registers (actually just the renaming table since architected registers and rename registers come from the same physical pool)
Shared resources
Rename register pool (larger than needed by 1 thread)
Instruction queue
Caches
TLB
Branch predictors
Deceased before seeing the daylight.
Reality Check, circa 200x
Conventional processor designs run out of steam
Power wall (thermal)
Complexity (verification)
Physics (CMOS scaling)
(figure: power density in watts/cm² on a log scale from 1 to 1000, climbing from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, and Pentium III processors; the curve passes the power density of a hot plate and heads toward that of a nuclear reactor, a rocket nozzle, and the Sun’s surface)

“Surpassed hot-plate power density in 0.5µm; not too long to reach nuclear reactor,” Former Intel Fellow Fred Pollack.
Latest Power Density Trend
Yeo and Lee, “Peeling the Power Onion of Data Centers,” in Energy Efficient Thermal Management of Data Centers, Springer, to appear 2011
Reality Check, circa 200x
Conventional processor designs run out of steam
Power wall (thermal)
Complexity (verification)
Physics (CMOS scaling)
Unanimous direction: Multi-core
Simple cores (massive number)
Keep wire communication on a leash
Keep Gordon Moore happy (Moore’s Law)
Architects’ menace: kick the ball to the other side of the court?
What do you (or your customers) want?
Performance (and/or availability)
Throughput > latency (turnaround time)
Total cost of ownership (performance per dollar)
Energy (performance per watt)
Reliability and dependability, SPAM/spy free
Multi-core Processor Gala
Intel’s Multicore Roadmap
To extend Moore’s Law
To delay the ultimate limit of physics
By 2010
all Intel processors delivered will be multicore
Intel’s 80-core processor (FPU array)
Source: Adapted from Tom’s Hardware
(figure: Intel roadmap tables for 2006-2008 covering desktop, mobile, and enterprise processors: desktop moves from single-core 1MB and dual-core 2MB parts to dual-core 2/4MB shared and 3/6MB shared cache at 45nm; mobile from dual-core 2/4MB to dual-core 3/6MB shared at 45nm; enterprise from single-core 512KB-2MB and dual-core 2-16MB parts to quad-core 4MB and 8/16MB shared, and 8-core 12MB shared at 45nm)
Is a Multi-core really better off?
Well, it is hard to say in Computing World
If you were plowing a field,
which would you rather use:
Two strong oxen or 1024 cores?
--- Seymour Cray
Q1. For a PIPT cache with virtual memory support, three possible events can be triggered
during an instruction fetch: (1) a cache lookup, (2) a TLB miss, (3) a page fault. Please
order these events in the correct order of their occurrences.
(2) (3) (1): In a PIPT cache, address translation takes place prior to the cache lookup. The hardware first searches for a match in the TLB; therefore, the TLB miss (if any) takes place first. A page table walk is then initiated; if the page has not been allocated, a page fault follows. The OS then allocates the page, fills in the page table entry, and fills the translation into the TLB, followed by the cache lookup.
Q2. Given a 256Meg x4 DRAM chip which consists of 2 banks, with 14-bit row
addresses. (256Meg indicates the number of addresses.) What is the row buffer size for
each bank?
256M → 28 address bits needed. One bit is used for the bank index; hence, the column address = 28 − 1 − 14 = 13 bits.
As the DRAM is a “x4” configuration:
Row buffer of a bank = 2^13 × (4 bits) = 32 Kbits = 4 KB
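The same derivation can be checked numerically; a small sketch (Python used purely as a calculator):

```python
# Re-deriving the Q2 row-buffer size: 256Meg addresses -> 28 address bits;
# 1 bank bit + 14 row bits leave the column bits; each column supplies
# 4 bits in a "x4" part.
import math

addresses = 256 * 2**20                 # 256 Meg addressable locations
addr_bits = int(math.log2(addresses))   # 28 address bits
bank_bits, row_bits = 1, 14
col_bits = addr_bits - bank_bits - row_bits   # 13 column bits

row_buffer_bits = (2**col_bits) * 4     # x4 chip: 4 bits per column
print(addr_bits, col_bits, row_buffer_bits // 8 // 1024, "KB")
```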
Q3. Assume an Inverted Page Table (8-entry IPT) is used by a 32-bit OS. The memory
page size is 256KB. The complete IPT content is shown below. The Physical Page Number
(PPN) starts from 0 to 7 from the top of the table. There are three active processes, P1
(PID=1), P2 (PID=2) and P3 (PID=3) running in the system and the IPT holds the translation
for the entire physical memory. Answer the following questions.
Based on the size of the Inverted Page Table above, what is the size of the physical memory?
There are 8 entries in the IPT. As each page is 256KB, the size of the physical memory = 8 × 256KB = 2MB
IBM Watson Jeopardy! Competition
POWER7 chips (2,880 cores) + 16TB memory
Massively parallel processing
Combine: Processing power, Natural language processing, AI, Search, Knowledge extraction
Major Challenges for Multi-Core Designs
Communication
Memory hierarchy
Data allocation (you have a large shared L2/L3 now)
Interconnection network
AMD HyperTransport
Intel QPI
Scalability
Bus Bandwidth, how to get there?
Power-Performance — Win or lose?
Borkar’s multicore argument: a 15% per-core performance drop buys a 50% power saving
A giant, single core wastes power when the task is small
How about leakage?
Process variation and yield
Programming Model