The Stanford Hydra CMP

Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun

Presented by Jason Davis


Introduction

Hydra is a CMP with 4 MIPS processors
An L1 cache for each CPU, and a shared L2 cache that holds the permanent state

Why?
– Moore's law is reaching its end
– Finite amount of ILP
– TLP (thread-level parallelism) vs. ILP in pipelined architectures
– A CMP can exploit ILP as well (TLP and ILP are orthogonal)
– Wire delay
– Design time (the CPU core doesn't need to be redesigned; just increase the number of cores)

Problems
– Integration densities are only now giving reasons to consider new models
– Difficult to convert uniprocessor code
– Multiprogramming is hard

Base Design

4 MIPS cores (250 MHz)
– Each core has its own L1 data cache and L1 instruction cache
– All cores share a single L2 cache
– Virtual buses (pipelined with repeaters)

Read bus (256 bits)
– Acts as the general-purpose system bus for moving data between the CPUs, the L2, and external memory
– Wide enough to carry an entire cache line (an explicit gain for a CMP; a multiprocessor system would require too many pins)

Write bus (64 bits)
– Carries writes directly from the 4 CPUs to the L2
– Pipelined to allow single-cycle occupancy (not a bottleneck)
– Uses simple invalidation for cache coherence (each write is broadcast and invalidates the line in all other L1s)

L2 cache
– The point of communication between CPUs (10-20 cycles)
– The buses are sufficient for 4-8 MIPS cores; more cores would need larger system buses
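The write-through L1s with broadcast invalidation can be sketched in software. This is a toy illustration under stated assumptions, not Hydra's actual design: the class and method names are invented, and real cache lines, bus arbitration, and timing are omitted.

```python
# Toy model (invented names, not Hydra RTL): write-through L1 caches, a
# shared L2 holding the permanent state, and broadcast invalidation of
# every other CPU's L1 copy whenever one CPU writes.

class ToyCMP:
    def __init__(self, n_cpus=4):
        self.l2 = {}                               # shared L2: addr -> value
        self.l1 = [dict() for _ in range(n_cpus)]  # one L1 per CPU

    def read(self, cpu, addr):
        # L1 hit, else refill from L2 (a 10-20 cycle penalty in the real design)
        if addr not in self.l1[cpu]:
            self.l1[cpu][addr] = self.l2.get(addr, 0)
        return self.l1[cpu][addr]

    def write(self, cpu, addr, value):
        # Write-through to L2 over the write bus; the broadcast invalidates
        # the line in every other CPU's L1, keeping the caches coherent.
        self.l1[cpu][addr] = value
        self.l2[addr] = value
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(addr, None)

cmp_ = ToyCMP()
cmp_.write(0, 0x100, 42)    # CPU 0 writes; CPUs 1-3 lose any copy of 0x100
print(cmp_.read(1, 0x100))  # CPU 1 misses in L1, refills from L2 -> 42
```

The point of the sketch is that invalidation, not update, keeps the protocol simple: other CPUs drop their copy and refetch from the L2 on their next read.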

Base Design

Parallel Software Performance

Thread Speculation

Takes the sequence of instructions in a normal program and arbitrarily breaks it into a sequenced group of threads
– Hardware must track all interthread dependencies to ensure the program behaves the same way
– Code that follows a data violation on a true dependency must be re-executed

Advantages:
– Does not require synchronization (unlike enforcing dependencies on multiprocessor systems)
– Dynamic (done at runtime), so the programmer only needs to think about it to get maximum performance
– Conventional parallelizing compilers miss a lot of TLP, because synchronization points must be inserted wherever dependencies can happen, not just where they do happen
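The last advantage is worth a concrete example. This is an invented illustration (the loop and index values are mine, not from the presentation): with indirect indexing, a dependence can occur on any iteration, so a static compiler must synchronize all of them, yet at runtime only one iteration actually conflicts.

```python
# Invented example: iteration i reads a[i] and writes a[idx[i]]. A compiler
# cannot prove idx[i] never targets a later iteration's read, so it must
# synchronize every iteration. At runtime, only iteration 3 really does.

a = [10, 20, 30, 40, 50, 60, 70, 80]
idx = [0, 1, 2, 7, 4, 5, 6, 3]   # only idx[3] == 7 targets a later iteration

def true_raw_conflicts(idx):
    # a real cross-iteration RAW dependence exists only when iteration i
    # writes a slot j = idx[i] that a later iteration j > i will read
    return [(i, idx[i]) for i in range(len(idx)) if idx[i] > i]

print(true_raw_conflicts(idx))   # -> [(3, 7)]: 1 of 8 iterations conflicts
```

Speculation pays the restart cost only on that one iteration; compiler-inserted synchronization would serialize all eight.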

5 issues to address:

Thread Speculation

1. Forward data between parallel threads
2. Detect when reads occur too early (RAW hazards)
3. Safely discard speculative state after violations

Thread Speculation

4. Retire speculative writes in correct order (WAW hazards)
5. Provide memory renaming (WAR hazards)
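Issues 1-3 can be illustrated with a toy model. This is a hypothetical software sketch, not the Hydra hardware: the `SpecThread` class, its read/write sets, and the hand-written event order are all invented for illustration.

```python
# Hypothetical sketch of issues 1-3: forwarding between threads, detecting
# a too-early read (RAW violation), and discarding speculative state.

memory = {"x": 0}                # permanent state (the L2 analogue)

class SpecThread:
    def __init__(self, tid):
        self.tid = tid
        self.read_set = set()    # "read bits" marking possible violations
        self.buffer = {}         # speculative writes, not yet in memory

    def read(self, addr, earlier):
        self.read_set.add(addr)
        # issue 1: forward from own buffer, else the nearest earlier
        # thread's buffer, else permanent state
        for t in [self] + earlier[::-1]:
            if addr in t.buffer:
                return t.buffer[addr]
        return memory[addr]

    def write(self, addr, value, later):
        self.buffer[addr] = value
        # issue 2: any later thread that already read addr read too early
        return [t for t in later if addr in t.read_set]

    def restart(self):
        # issue 3: drop speculative state only; memory is untouched
        self.read_set.clear()
        self.buffer.clear()

t0, t1 = SpecThread(0), SpecThread(1)
t1.read("x", earlier=[t0])                # t1 speculatively reads x -> 0
violators = t0.write("x", 7, later=[t1])  # t0's write exposes a RAW violation
for t in violators:
    t.restart()
print(t1.read("x", earlier=[t0]))         # re-executed read now forwards 7
```

Note that detection is lazy: the violation surfaces only when an earlier thread's write actually hits an address a later thread has marked as read, matching the "restart only on true dependencies" idea above.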

Hydra Speculation Implementation

Takes care of the 5 issues:
– Forward data between parallel threads:
  When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
  On a miss in L1, the L2 is accessed, and the write buffers of the current or older threads replace the data returned from the L2 byte by byte
– Detect when a read occurs too early:
  Primary-cache bits are set to mark possible violations; if an earlier thread writes to that address, a violation is detected and the thread is restarted
– Safely discard speculative state after a violation:
  Permanent state is kept in the L2; any L1 lines holding speculative data are invalidated, and the thread's L2 buffer is discarded (the permanent state is not affected)

Hydra Speculation Implementation

– Place speculative writes in memory in correct order:
  A separate speculative-data L2 buffer is kept for each thread
  The buffers must be drained into the L2 in the original program sequence
  The thread sequencing system also sequences the buffer draining
– Memory renaming:
  Each CPU can only read data written by itself or by earlier threads
  Writes from later threads don't cause immediate invalidations (writes from those threads should not be visible yet)
  Ignored invalidations are recorded with a pre-invalidate bit
  A thread accessing the L2 must only see data from itself or from earlier threads' L2 buffers
  When the current thread completes, all pre-invalidated lines are checked against future threads for violations
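Issues 4-5 can be sketched the same way. This is again a hypothetical illustration with invented names: the real mechanism is per-thread hardware buffers plus pre-invalidate bits, not Python dictionaries.

```python
# Hypothetical sketch of issues 4-5: per-thread speculative write buffers
# retired into the L2 in program order (WAW), and memory renaming so each
# thread sees only its own and earlier threads' writes (WAR).

l2 = {"y": 1}                       # permanent state
buffers = [{}, {}, {}]              # one speculative write buffer per thread

def spec_write(tid, addr, value):
    # the write stays in the thread's own buffer; later threads' cached
    # copies would only be pre-invalidated, not dropped immediately
    buffers[tid][addr] = value

def spec_read(tid, addr):
    # renaming: search own buffer, then earlier threads' buffers, then L2;
    # later threads' buffers are deliberately invisible
    for t in range(tid, -1, -1):
        if addr in buffers[t]:
            return buffers[t][addr]
    return l2[addr]

def commit_in_order():
    # drain buffers into the L2 in the original thread sequence, so the
    # logically last write wins (correct WAW ordering)
    for buf in buffers:
        l2.update(buf)
        buf.clear()

spec_write(2, "y", 9)               # a *later* thread writes y
print(spec_read(1, "y"))            # thread 1 must not see it -> 1
spec_write(0, "y", 5)
print(spec_read(1, "y"))            # an earlier thread's write is visible -> 5
commit_in_order()
print(l2["y"])                      # thread 2's write retires last -> 9
```

The ordered drain is what makes the WAW case safe: even though thread 2 wrote first in real time, its buffer is retired last, so the L2 ends up as sequential execution would have left it.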

Hydra Speculation Implementation

Hydra Speculation Implementation

Speculation Performance

Prototype

MIPS-based RC32364 cores
SRAM macro cells
8-Kbyte L1 data and instruction caches
128-Kbyte L2 cache
Die is 90 mm², 0.25-micron process
A Verilog model exists; moving to physical design using synthesis
Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals

Prototype

Prototype

Conclusion

Hydra CMP
– High performance
– A cost-effective alternative to large single-chip processors
– With a similar die area, it can achieve performance similar to a uniprocessor on integer programs by using thread speculation
– On multiprogrammed or highly parallel workloads, it can do better than a single processor
– Hardware thread speculation is not cost-intensive and can give great performance gains

Questions