Multithreaded ASC Kevin Schaffer and Robert A. Walker ASC Processor Group Computer Science...

Preview:

Citation preview

Multithreaded ASC

Kevin Schaffer and Robert A. Walker

ASC Processor GroupComputer Science Department

Kent State University

Organization of an ASC Computer

Processing Element 1

Bro

adca

st/R

edu

ctio

n

Net

wo

rk

Control Unit(Scalar Processor)

Processing Element 2

Processing Element N

Bro

adca

st/R

edu

ctio

n

Net

wo

rk

……

Broadcast/Reduction Bottleneck

Time to perform a broadcast or reduction increases as the number of PEs increases

Even for a moderate number of PEs, this time can dominate the machine cycle time

Pipelining reduces the cycle time but increases the latency

Additional latency causes pipeline hazards

Instruction Types

Scalar instructions Execute entirely within the control unit

Broadcast/Parallel instructions Execute within the PE array Use the broadcast network to transfer instruction and data

Reduction instructions Execute within the PE array Use the broadcast network to transfer instruction and data Use the reduction network to combine data from PEs

Scalar Pipeline

Instruction Fetch (IF)

Instruction Decode (ID)

Execute (EX)

Memory Access (M)

Write Back (W)

Hazards in a Scalar Pipeline

Unified SIMD Pipeline

Broadcast (B1...Bn)

Reduction (R1...Rn)

Number of stages is variable

All instructions go through every stage

Diversified SIMD Pipeline

Separate paths for each instruction type so instructions only go through stages that they use

Stalls less often than a unified pipeline organization

Hazards

Multithreading

Pipelining alone cannot eliminate hazards caused by broadcast and reduction latencies

Solution: use instructions from multiple threads to keep the pipeline full

Instructions from different threads are independent so they cannot generate stalls due to data dependencies

As long as there are a sufficient number of threads, it is possible to fill any number of stall cycles

Types of Multithreading

Coarse-grain multithreading switches to a new thread when the current thread encounters a high latency operation

Fine-grain multithreading switches to a new thread every clock cycle

Simultaneous multithreading can issue instructions from multiple threads in the same clock cycle

For a SIMD processor, fine-grain or simultaneous multithreading is necessary as pipeline stalls are relatively short and occur frequently

Multithreaded Control Unit

Fetch Unit

InstructionCache

Decode Unit 1

Decode Unit 2

Decode Unit 3

Decode Unit N

Instruction Status Table

Sch

edu

ler

Th

read

Sta

tus

Tab

le

Reduction Hazard with a Single Thread

Reduction Hazard with Multiple Threads

Execution Time vs. Latency

0

50

100

150

200

250

300

350

400

450

1 2 3 4 5 6 7 8

Communication Latency (cycles)

Exe

cuti

on

Tim

e (c

ycle

s)

ASC Multithreaded ASC MASC

Throughput vs. Latency

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1 2 3 4 5 6 7 8

Communication Latency (cycles)

No

rmal

ized

Th

rou

gh

pu

t (i

nst

ruct

ion

s/cy

cle)

ASC Multithreaded ASC MASC

Multithreaded ASC vs. MASC

A multithreaded ASC computer can execute at most one instruction in a cycle

A MASC computer with j instruction streams can execute up to j instructions in a cycle

In multithread ASC each thread can access every PE

In MASC each instruction stream can only access its partition of PEs

A multithreaded MASC computer could combine the advantages of both

ASC

Multithreaded ASC

MASC

Multithreaded MASC

Multithreaded ASC Processor

In order to validate simulation results and estimate hardware costs, a prototype processor was developed

Targeted for an Altera Cyclone II (EP2C35) FPGA

Using an FPGA makes it possible to get detailed measurements of speed and hardware cost

Additional Enhancements

Flags (logical values) are a first-class data types with their own set of registers and instructions

Extra reduction operators Count Responders Sum

Hardware semaphores for thread synchronization

Synthesis Results

Targeted for an Altera Cyclone II FPGA (EP2C35)

16 x 16-bit PEs

16 hardware threads

Clock speed: 75 MHz