Reconfigurable Microprocessors Lih Wen Koh 05s1 COMP4211 presentation 18 May 2005

Reconfigurable Microprocessors

Lih Wen Koh

05s1 COMP4211 presentation

18 May 2005

2

Presentation Overview

Current Research Direction

Related Work

Experiments

What Next?

3


Execute

Address + ALU1 ALU2 FP + FP *, ÷, √

Hardware Components of MIPS R10000

Fetch

Instruction CacheInstruction Predecode

Branch History Table

InstructionTLB

Decode

Instruction DecodeActive List(32 entries)

Free Register Lists (1 for Int, 1 For FP)

Register Map Tables(1 for Int, 1 for FP)

Integer Queue(16 entries)

FP Queue(16 entries)

Mem Queue(16 entries)

Integer Registers / Bypass64 x 64 bits

FP Registers / Bypass64 x 64 bits

Issue

Write

Data TLB

Data Cache

[Yeager96]

Wide superscalar, out-of-order execution processor core

Exploits ILP

But true data dependencies are inherent in application programs

MIPS R10k, NetBurst, AMD etc. use bypass network to forward just-computed result allow back-to-back issue of dependent instructions

Complexity of bypass network grows quadratic w.r.t. issue width

4


Observation 1: Multi-cycle broadcast

Wire delays – accounted for in Intel NetBurst

Allows higher processor clock frequency at the cost of reduced IPC

Observation 2: FP execution unit is idle most of the time, even in FP-intensive applications (5-10%)

[Sassone04]

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

MediaBench Applications

Proportion of Functional Unit Type Requested

Rd/Wr Ports

FP_MULT/DIV

FP_ALU

Int_MULT/DIV

Int_ALU0%

10%20%

30%40%

50%60%

70%80%

90%100%

164.

gzip.g

raph

ic

164.

gzip.lo

g

164.

gzip.p

rogr

am

164.

gzip.ra

ndom

164.

gzip.so

urce

176.

gcc

181.

mcf

197.

parse

r

168.

wupwise

171.

swim

173.

applu

179.

art

183.

equak

e

188.

amm

p

301.

apsi

SPEC2000 Applications

Proportion of Functional Unit Type Requested

Rd/Wr Ports

FP_MULT/DIV

FP_ALU

Int_MULT/DIV

Int_ALU

5

Literature Survey

[Epalza04]: Dynamic Allocation of Functional Units in Superscalar Processors

Switch the execution mode of idle floating-point units to four additional integer ALUs

Addition of bypass networks add 1 cycle latency to the modified FPU.

19% speedup for SPECint2000

3.5% speedup for SPECfp2000

Issues:

Need to improve control for mode switching

6

Plans

Other patterns:

y1 y3

x

y2

Possible 2nd input for node y1/y2/y3:1. from register file2. from node x

x1 x2

z

y

Possible 2nd input for node y:1. from register file2. from node x13. from node x2

Possible 2nd input for node z:1. from register file2. from node x13. from node x24. from node y

y1 y2

z

xPossible 2nd input for node y1:1. from register file2. from node x

Possible 2nd input for node y2:1. from register file2. from node x

Possible 2nd input for node z:1. from register file2. from node x3. from node y14. from node y2

z1 z2

y

xPossible 2nd input for node y:1. from register file2. from node x

Possible 2nd input for node z1:1. from register file2. from node x3. from node y

Possible 2nd input for node z2:1. from register file2. from node x3. from node y

x1 x2

y2y1

Possible 2nd input for node y1:1. from register file2. from node x13. from node x2

Possible 2nd input for node y2:1. from register file2. from node x13. from node x2

Possible 1st input for node y2:1. from node x12. from node x2

Possible inputs:1. from register file2. from node x13. from node x24. from node x3

x1 x2 x3

y

7

Related Work

[Palacharla97] Dependence-based (FIFO queues + clustered execution units)

8

Related Work

Extension to rePLay framework [Yehia04]

9

Experiment : Chaining pairs of dependent instructions

[Intel01] Double-speed ALUs

from Register File

NormalInteger

ALU

3-1 InterlockCollapsing

ALU

Result of first instruction in

dependent sequence

Result of second instruction in dependent sequence

Carry Lookahead

Adder

LogicOperations

mux

4 stages

1 stage

Carry-Save Adder

LogicOperations

Control

Carry-Lookahead Adder + Logic

Operations4 stages

1 stage

[Vassiliadis96] 3-1 Interlock

Collapsing ALUs

10


Instruction Fetch Queue(IFQ)

Load/Store Queue(LSQ)

Register Update Unit(RUU)

Ready Queue

Operands ready EA ready

F_MEM

IntALUs Int Mult/Div Rd/Wr Ports FP Adders FP Mult/Div Chained ALU

Event Queue

Instruction WriteBack(Broadcast/Bypass Logic)

Branch Misprediction?If so, recover

Instruction Commit

Issue if requested functional unit is not busy

if the requested functional unit is IntALU && the list of in-flight instructions waiting only on the result of this instruction is non-empty && the chained ALU is not busy => schedule this instruction and the first obtained dependent instruction to the chained ALU

ruu_fetch()

ruu_dispatch()

ruu_issue()

ruu_writeback()

ruu_commit()

Modifications to

sim-outorder for

SimpleScalar

PISA.

11


2 CIALUs sufficient

IPC improvement of ~8%, solely due to

savings of broadcast cycles

Reduces utlization of IALUs by ~50%

Reduces up to 45% of queue entries

waiting for result

Up to 25% speedup as broadcast cycles

= 4

0%

5%

10%

15%

20%

25%


Speedup on IPC for MediaBench Applications(fetch-decode-issue-commit w idth = 8, ruu:size = 32, #ialu = 8, #cialu = 2)

broadcast_delay = 1

broadcast_delay = 2

broadcast_delay = 3

broadcast_delay = 4

0%

5%

10%

15%

20%

25%

rawcaudio

rawdaudioepic

unepicencode

decodecjpeg

djpeg

mpeg2encode

mpeg2decode

pegwitenc

pegwitdec


Speedup on IPC for MediaBench Applications(fetch-decode-issue-commit width = 8, ruu:size = 32, broadcast delay = 1 cycle)

#IntALU = 2, #CIntALU = 1












12

What Next?

Chaining sequence of 3 dependent instructions, other patterns out of the 80.

Architectural impact of adding chained units

complexity of local bypass network etc.

Replace chained units by xALUs converted from the CSA trees in a FP

multiply/divide unit

Need to explore the hardware circuits of FP multiply/divide

Develop an adaptive configuration scheme – to best match the interconnections of

the swappable xALUs to the patterns of in-flight instructions.

Need to determine the most frequent subset of patterns

References[Vassiliadis96] High-Performance 3-1 Interlock Collapsing ALUs. James Phillips and Stamatis

Vassiliadis.

[Yeager96] The MIPS R10000 Superscalar Microprocessor. Kenneth C. Yeager. IEEE Micro 1996.

[Palacharla97] Subbarao Palacharla, Norman P. Jouppi, J.E. Smith. Complexity-Effective Superscalar Processor. ISCA 1997.

[Intel01] The Microarchitecture of the Pentium® 4 Processor. Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel Intel Technology Journal Q1. 2001.

[Epalza04] Dynamic Reallocation of Functional Units In Superscalar Processors. Marc Epalza, Paolo Ienne, Daniel Mlynek. In the 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC), 2004.

[Yehia04] From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation. Sami Yehia and Olivier Temam.

[Sassone04] Multicycle Broadcast Bypass: Too Readily Overlooked. Peter G. Sassone and D. Scott Wills, Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.

Thank You

15

Overview of Research TopicGoal of this research:

“investigate the feasibility and potential benefit of effective, automated runtime compilation and execution of software binaries on reconfigurable microprocessors”

Software binaries executing only on superscalar processor

Profile committed instructions to identify critical code regions Identify and extract suitable

instructions from critical code regions for collapsing into complex, atomic instructions

Assembly-to-hardware mapping of collapsed instructions

On the next execution of the transformed critical region, load configuration for the reconfigurable logic

Transfer execution from superscalar processor to the reconfigurable unit

Monitor the execution of the coupled system

Continue normal execution of binary code following the transformed critical code region.

superscalarprocessor

reconfigurablelogic

Software Binaries

Reconfigurable Microprocessor

16

Motivations

Improved execution performance by exploiting parallelism and redundancy in

hardware.

Adaptation of hardware resources based on the dynamic behaviour of programs.

Availability of runtime profile allows exploitation of runtime optimizations otherwise difficult to exploit at compile time.

Compilation at the binary level allows execution of legacy software binaries.

Runtime compilation allows transparent migration of software code to hardware.

Documents

Reconfigurable Microprocessors Lih Wen Koh 05s1 COMP4211 presentation 18 May 2005