14
© 2003 IBM Corporation Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM Corporation

Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor

Ron Kalla, Balaram Sinharoy, Joel TendlerIBM Systems Group

Page 2: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20032

SMT Implementation in POWER5

Outline

§ Motivation§ Background§ Threading Fundamentals§ Enhanced SMT POWER5 Implentation§ Additional SMT Considerations§ Summary

Page 3: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20033

SMT Implementation in POWER5

Microprocessor Design Optimization Focus Areas

§ Memory latency4 Increased processor speeds make memory appear further away

4 Longer stalls possible§ Branch processing

4 Mispredict more costly as pipeline depth increases resulting in stalls and wasted power

4 Predication drives increased power and larger chip area§ Execution Unit Utilization

4 Currently 20-25% execution unit utilization common§ Simultaneous multi-threading (SMT) and POWER architecture

address these areas

Page 4: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20034

SMT Implementation in POWER5

POWER4 --- Shipped in Systems December 2001

§ Technology: 180nm lithography, Cu, SOI4 POWER4+ shipping in 130nm today

§ Dual processor core§ 8-way superscalar

4 Out of Order execution

4 2 Load / Store units

4 2 Fixed Point units

4 2 Floating Point units

4 Logical operations on Condition Register

4 Branch Execution unit§ > 200 instructions in flight§ Hardware instruction and data

prefetch

L3

Dir

ecto

ry/C

on

tro

lL2 L2 L2

LSU LSUIFUBXU

IDU IDU

IFUBXU

FPU FPU

FX

U

FXUISU ISU

Page 5: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20035

SMT Implementation in POWER5

POWER5 --- The Next Step

§ Technology: 130nm lithography, Cu, SOI§ Dual processor core§ 8-way superscalar§ Simultaneous multithreaded

(SMT) core4 Up to 2 virtual processors per

real processor

4 24% area growth per core for SMT

4 Natural extension to POWER4 design

Page 6: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20036

SMT Implementation in POWER5

Multi-threading Evolution

Thread 0 ExecutingThread 0 Executing No Thread Executing

FX0FX1FP0FP1LS0LS1BRXCRL

Single ThreadFX0FX1FP0FP1LS0LS1BRXCRL

Coarse Grain Threading

FX0FX1FP0FP1LS0LS1BRXCRL

Fine Grain Threading

FX0FX1FP0FP1LS0LS1BRXCRL

Simultaneous Multi-Threading

Page 7: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20037

SMT Implementation in POWER5

Changes Going From ST to SMT Core§ SMT easily added to Superscalar Micro-architecture

4 Second Program Counter (PC) added to share I-fetch bandwidth4 GPR/FPR rename mapper expanded to map second set of registers

(High order address bit indicates thread)4 Completion logic replicated to track two threads4 Thread bit added to most address/tag buses

FetchUnit

I-Cache

Decode RegisterRename

IntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

CR, LR,CTR

FPRs

GPRs

BR, CRLUnits

FPUsFPUs

FXUs,LSUs

DataCache

PCPC

RegisterRename

IntegerIssue Qs

FPIssue Qs

BR/CRUIssue Qs

CR, LR,CTR

FPRs

GPRs

BR, CRLUnits

FPUsFPUs

FXUs,LSUs

Page 8: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20038

SMT Implementation in POWER5

Resource Sizes

Results based on simulation of an online transaction processing applicationVertical axis does not originate at 0

~~

§ Analysis done to optimize every micro-architectural resource size4GPR/FPR rename pool size

4I-fetch buffers

4Reservation Station

4SLB/TLB/ERAT

4I-cache/D-cache§Many Workloads examined§ Associativity also examined

50 60 70 80 90 100 110 120 13080 120

Number of GPR Renames

IPC

ST SMT

Page 9: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 20039

SMT Implementation in POWER5

Resource Sharing

Results based on simulation of an online transaction processing application

§ Threads share many resources

4 Global Completion Table, BHT, TLB, . . .

§ Higher performance realized when resources balanced across threads

4 Tendency to drift toward extremes accompanied by reduced performance

§ Solution: Dynamically adjust resource utilization

0

2

4

6

8

10

Rel

ativ

e O

ccu

rren

ce

0 5 10 15 20

05

1015

20

Thread 0

Threa

d 1

With dynamic resource utilization

adjustment

0

2

4

6

8

10

Rel

ativ

e O

ccu

rren

ce

0 5 10 15 20

05

1015

20

Thread 0

Threa

d 1

Without dynamic resource utilization

adjustment

Global Completion Table Occupancy

Page 10: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 200310

SMT Implementation in POWER5

Thread Priority§ Instances when unbalanced

execution desirable4 No work for opposite thread

4 Thread waiting on lock

4 Software determined non uniform balance

4 Power management

4 …

§ Solution: Control instruction decode rate

4 Software/hardware controls 8 priority levels for each thread

0

0

0

1

1

1

1

1

2

2

IPC

-5 -3 -1 0 1 3 50,7 7,0 1,1

Thread 1 Priority - Thread 0 Priority

Thread 0 IPC Thread 1 IPCPower Save Mode

Single Thread Mode

Page 11: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 200311

SMT Implementation in POWER5

Dynamic Thread Switching

§ Used if no task ready for second thread to run§ Allocates all machine resources to

one thread§ Initiated by software§ Dormant thread wakes up on:

4 External interrupt

4 Decrementer interrupt

4 Special instruction from active thread

Active

Dormant

Null

software

hardware or software

software

software

Thread States

Page 12: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 200312

SMT Implementation in POWER5

Single Thread Operation

§ Advantageous for execution unit limited applications

4 Floating or fixed point intensive workloads

§ Execution unit limited applications provide minimal performance leverage for SMT4 Extra resources necessary for SMT

provide higher performance benefit when dedicated to single thread

§ Determined dynamically on a per processor basis

PO

WE

R4+

PO

WE

R5

ST

Matrix Multiply

IPC

PO

WE

R5

SM

T

Page 13: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 200313

SMT Implementation in POWER5

Other SMT Considerations

§ Power Management4 SMT Increases execution unit utilization

4 Dynamic power management does not impact performance§ Debug tools / Lab bring-up

4 Instruction tracing

4 Hang detection

4 Forward progress monitor§ Performance Monitoring§ Serviceability

Page 14: Simultaneous Multi-threading Implementation in POWER5 ...pages.cs.wisc.edu/~markhill/cs838-david/reader/power5.pdf · POWER5 Hot Chips 15 Presentation Keywords: POWER5, SMT Created

© 2003 IBM CorporationHotchips 15, August 200314

SMT Implementation in POWER5

Summary

§ POWER5 SMT implementation is more than SMT4 Good ROI for silicon area: Performance gain > Area increase

4 Resource sizes optimized

4 Dynamic feedback enhances instruction throughput

4 Software controlled priority exploits machine architecture

4 Dynamic ST to/from SMT mode capability optimizes system resources

§ SMT impacts pervasive throughout chip§ Operating in laboratory

4 AIX, Linux and OS/400 booted and running