DAP Spr.‘98 ©UCB 1
CS 203A
Lecture 16: Review for Test 2
Project and Test 2
• Project: think about what you can change in the CPU or cache architecture to speed up execution of network applications. Modify that part in SimpleScalar, rerun your applications, and compare with your results from Project 1.
• Test 2: 40 points in 80 minutes => about 2 minutes per point, which gives you an idea of how long to spend on each question. The test has 4 questions with several parts, and the points are noted. Answer precisely and briefly.
Minimizing Stalls Technique 1: Compiler Optimization
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1

Swap BNEZ and SD by changing the address of SD:
1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,8
3       ADDD F4,F0,F2
4       stall
5       BNEZ R1,Loop   ;delayed branch
6       SD   8(R1),F4  ;address altered from 0(R1) to 8(R1) when moved past SUBI

Scheduled loop: 6 clocks per iteration
Compiler Technique 2: Loop Unrolling
 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2     ;1 cycle delay *
 3       SD   0(R1),F4     ;drop SUBI & BNEZ – 2 cycles delay *
 4       LD   F6,-8(R1)
 5       ADDD F8,F6,F2     ;1 cycle delay
 6       SD   -8(R1),F8    ;drop SUBI & BNEZ – 2 cycles delay
 7       LD   F10,-16(R1)
 8       ADDD F12,F10,F2   ;1 cycle delay
 9       SD   -16(R1),F12  ;drop SUBI & BNEZ – 2 cycles delay
10       LD   F14,-24(R1)
11       ADDD F16,F14,F2   ;1 cycle delay
12       SD   -24(R1),F16  ;2 cycles delay
13       SUBI R1,R1,#32    ;alter to 4*8; 1 cycle delay
14       BNEZ R1,LOOP      ;delayed branch
15       NOP
* 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP op; 1 cycle delay after SUBI.
15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration
Loop unrolling is essential for ILP processors (why?), but it increases code size and the number of registers needed.
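The cycle arithmetic above can be checked with a small script (a sketch; the latency parameters reflect the stall assumptions stated on this slide, and the function name is made up):

```python
# Cycle count for the unrolled loop above.
# Assumed latencies (from the slide): 1 stall for an FP op after a load,
# 2 stalls for a store after an FP op, 1 stall after SUBI.

def unrolled_cycles(bodies=4, instrs=15, load_use_stall=1, fp_store_stall=2,
                    subi_stall=1):
    # each of the 4 unrolled bodies pays the load-use and FP-to-store stalls;
    # the SUBI stall is paid once at the end
    return instrs + bodies * (load_use_stall + fp_store_stall) + subi_stall

total = unrolled_cycles()
print(total, total / 4)   # 28 cycles, 7.0 per original iteration
```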
Minimize Stall + Loop Unrolling
• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register
– OK to move loads before stores: do we still get the right data?
– When is it safe for the compiler to make such changes?
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP    ;delayed branch
14       SD   8(R1),F16  ;8-32 = -24
14 clock cycles, or 3.5 per iteration
Very Long Instruction Word:VLIW Architectures
• Wide-issue processor that relies on the compiler to
– Pack together independent instructions to be issued in parallel
– Schedule code to minimize hazards and stalls
• Very long instruction words (3 to 8 operations)
– Can be issued in parallel without checks
– If the compiler cannot find independent operations, it inserts nops
• Advantage: simpler HW for wide issue
– Faster clock cycle
– Lower design & verification cost
• Disadvantages:
– Code size
– Requires aggressive compilation technology
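The nop-insertion behaviour described above can be sketched as a toy bundle packer (illustrative only: the issue width, op names, and dependence representation are assumptions, not a real VLIW ISA):

```python
# Toy VLIW bundle formation: pack independent ops into fixed-width bundles,
# padding unused slots with nops. A fully dependent chain cannot be packed,
# so every bundle carries nops and code size grows.

ISSUE_WIDTH = 3

def pack_bundles(ops, deps):
    """ops: op names in program order; deps: op -> set of its producer ops."""
    bundles, placed = [], set()
    while len(placed) < len(ops):
        bundle = []
        for op in ops:
            if op in placed or len(bundle) == ISSUE_WIDTH:
                continue
            if deps.get(op, set()) <= placed:   # all producers in earlier bundles
                bundle.append(op)
        placed |= set(bundle)
        bundle += ["nop"] * (ISSUE_WIDTH - len(bundle))   # pad to full width
        bundles.append(bundle)
    return bundles

# a load -> add -> store chain: one real op per bundle, two nops each
bundles = pack_bundles(["ld", "add", "st"], {"add": {"ld"}, "st": {"add"}})
```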
VLIW and Superscalar
• Sequential stream of long instruction words
• Instructions scheduled statically by the compiler
• Number of simultaneously issued instructions is fixed at compile time
• Instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events, e.g. cache misses, with the same flexibility as superscalars
• The number of instructions in a VLIW instruction word is usually fixed
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met; this increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed
• VLIW is an architectural technique, whereas superscalar is a microarchitecture technique
• VLIW processors take advantage of spatial parallelism
Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
– One way is to interleave execution of instructions from different program threads on the same pipeline – micro context switching
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
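The point of the interleaving above can be put numerically (a simplified model, not from the slides): with 4 threads issued round-robin, consecutive instructions of any one thread enter the pipe 4 cycles apart, enough to cover the read-after-write window of a non-bypassed 5-stage pipe.

```python
# Round-robin multithreaded issue: instruction i+1 of a thread enters the
# pipe n_threads cycles after instruction i, so the earlier instruction has
# already written back by the time the later one reads its registers.

PIPE_DEPTH = 5

def same_thread_spacing(n_threads):
    return n_threads   # cycles between successive instructions of one thread

# 4 threads give a spacing of 4 cycles, covering the 4-cycle IF..WB gap
print(same_thread_spacing(4) >= PIPE_DEPTH - 1)   # True
```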
Simple Multithreaded Pipeline
• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
Comparison of Issue Capabilities
(Courtesy of Susan Eggers; used with permission)
From Superscalar to SMT
• Small items
– per-thread program counters
– per-thread return stacks
– per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
– thread identifiers, e.g., with BTB & TLB entries
Typical NP Architecture
[Block diagram: a Network Processor containing multi-threaded processing elements and a co-processor, connected over buses to SDRAM (packet buffer) and SRAM (routing table), with input ports and output ports.]
Why Network Processors
• Current situation
– Data rates are increasing
– Protocols are becoming more dynamic and sophisticated
– Protocols are being introduced more rapidly
• Processing elements
– GP (General-Purpose Processor)
» Programmable, but not optimized for networking applications
– ASIC (Application-Specific Integrated Circuit)
» High processing capacity, but long time to develop and lacks flexibility
– NP (Network Processor)
» Achieves high processing performance
» Programming flexibility
» Cheaper than GP
IXP1200 Block Diagram
• StrongARM processing core
• Microengines introduce a new ISA
• I/O
– PCI
– SDRAM
– SRAM
– IX: PCI-like packet bus
• On-chip FIFOs
– 16 entries, 64B each
IXP1200 Microengine
• 4 hardware contexts
– Single-issue processor
– Explicit optional context switch on SRAM access
• Registers
– All are single ported
– Separate GPRs
– 256 × 6 = 1536 registers total
• 32-bit ALU
– Can access GPR or XFER registers
• Shared hash unit
– 1/2/3 values – 48b/64b
– For IP routing hashing
• Standard 5-stage pipeline
• 4KB SRAM instruction store – not a cache!
• Barrel shifter
Ref: [NPT]
IXP2400
[Block diagram: Intel® XScale™ core (32K IC, 32K DC), eight microengines (MEv2 1-8), two QDR SRAM channels (18b each), DDRAM (64b) via gasket, 16KB scratch memory, hash unit (64/48/128), Rbuf and Tbuf (64 entries @ 128B each), PCI (64b, 66 MHz), SPI3 or CSIX media interface (32b), and CSRs: Fast_wr, UART, timers, GPIO, BootROM/slow port.]
IXP2800
[Block diagram: Intel® XScale™ core (32K IC, 32K DC), sixteen microengines (MEv2 1-16), four QDR SRAM channels (18b each), three RDRAM channels, gasket, 16KB scratch memory, hash unit (48/64/128), Rbuf and Tbuf (64 entries @ 128B each), PCI (64b, 66 MHz), SPI4 or CSIX media interface (16b) with stripe unit, and CSRs: Fast_wr, UART, timers, GPIO, BootROM/slow port.]
Memory Hierarchy
Goal: the illusion of large, fast, cheap memory
• Fact: large memories are slow; fast memories are small
• How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy of levels
– Uses smaller and faster memory technologies close to the processor
– Fast access time in the highest level of the hierarchy
– Cheap, slow memory furthest from the processor
• The aim of memory hierarchy design is an access time close to that of the highest level and a size equal to that of the lowest level
Introduction to Caches
• Cache
– is a small, very fast memory (SRAM, expensive)
– contains copies of the most recently accessed memory locations (data and instructions): temporal locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory locations: spatial locality
– unit of transfer to/from main memory (or L2) is the cache block
• General structure
– n blocks per cache, organized in s sets
– b bytes per block
– total cache size n*b bytes
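The geometry above (n blocks, s sets, b bytes per block) can be captured in a small helper (a sketch; the example cache below is hypothetical):

```python
# Cache geometry: n blocks in s sets of b-byte blocks gives capacity n*b
# bytes and associativity n/s ways.

def cache_geometry(n_blocks, n_sets, block_bytes):
    assert n_blocks % n_sets == 0, "blocks must divide evenly into sets"
    return {"capacity_bytes": n_blocks * block_bytes,
            "ways": n_blocks // n_sets}

# hypothetical example: 1024 blocks, 256 sets, 32B blocks
g = cache_geometry(1024, 256, 32)   # 32 KB, 4-way set associative
```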
Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how to find it?
• Answers to (1) and (2) depend on the type or organization of the cache
• Direct mapped cache: each memory address is associated with one possible block within the cache
– Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
• Fully associative cache: a block can be placed anywhere, but complex in design
• N-way set associative: N cache blocks for each cache index
– Like having N direct-mapped caches operating in parallel
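Questions (1) and (2) reduce to splitting the address into tag, index, and offset fields; a sketch (power-of-two geometry assumed; the example address is arbitrary):

```python
# Address decomposition: the offset selects a byte within the block, the
# index selects the set, and the stored tag is compared to answer "is it here?".
# Direct mapped is the 1-way case; fully associative is the 1-set case.

def split_address(addr, n_sets, block_bytes):
    offset_bits = block_bytes.bit_length() - 1
    index_bits = n_sets.bit_length() - 1
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# direct mapped, 256 sets of 32B blocks: one candidate location per address
tag, index, offset = split_address(0x12345, 256, 32)
```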
Review: Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
– Random, LRU
• Q4: What happens on a write? (Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
Misses per instruction = Memory accesses per instruction × Miss rate
CPUtime = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
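The CPUtime equation above can be exercised directly (a sketch; every number in the example is hypothetical):

```python
# CPUtime = IC x (CPI_execution + accesses/instr x miss rate x miss penalty)
#           x clock cycle time

def cpu_time(ic, cpi_exec, mem_per_instr, miss_rate, miss_penalty, cycle_time):
    memory_stall_cpi = mem_per_instr * miss_rate * miss_penalty
    return ic * (cpi_exec + memory_stall_cpi) * cycle_time

# 1M instructions, base CPI 1.0, 1.5 accesses/instr, 2% miss rate,
# 50-cycle miss penalty, 1 ns clock: memory stalls add 1.5 CPI
t = cpu_time(1_000_000, 1.0, 1.5, 0.02, 50, 1e-9)
```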
Where do misses come from?
• Classifying Misses: 3 Cs
– Compulsory — the first access to a block can never hit, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
– Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of size X)
– Conflict — if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
• 4th "C":
– Coherence — misses caused by cache coherence.
4: Add a second-level cache
• L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
– Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU
– The global miss rate is what matters
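The two-level equations above, with the local/global distinction, can be sketched as follows (all rates and latencies below are hypothetical):

```python
# AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + LocalMissRate_L2 x MissPenalty_L2)

def amat_two_level(hit_l1, mr_l1, hit_l2, local_mr_l2, penalty_l2):
    miss_penalty_l1 = hit_l2 + local_mr_l2 * penalty_l2
    return hit_l1 + mr_l1 * miss_penalty_l1

# L1: 1-cycle hit, 5% miss; L2: 10-cycle hit, 20% local miss, 100-cycle memory
a = amat_two_level(1, 0.05, 10, 0.20, 100)   # 1 + 0.05*(10 + 0.20*100)
global_mr_l2 = 0.05 * 0.20   # fraction of all CPU accesses that reach memory
```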
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler-Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Subblock Placement                       +    +    1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second-Level Caches                      +         2
(miss-rate techniques above the line, miss-penalty techniques below)
Main Memory Background
• Random Access Memory (vs. Serial Access Memory)
• Different flavors at different levels
– Physical makeup (CMOS, DRAM)
– Low-level architectures (FPM, EDO, BEDO, SDRAM)
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Size: DRAM/SRAM 4-8; Cost/cycle time: SRAM/DRAM 8-16
• Main memory is DRAM: Dynamic Random Access Memory
– Dynamic since it needs to be refreshed periodically
– Addresses divided into 2 halves (memory as a 2D matrix):
» RAS or Row Access Strobe
» CAS or Column Access Strobe
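The RAS/CAS split above is just a partition of the address bits; a sketch (the 16-column geometry is illustrative):

```python
# DRAM addressing: the address is sent in two halves over the same pins --
# the row half with RAS, then the column half with CAS.

def ras_cas(addr, n_cols):
    col_bits = n_cols.bit_length() - 1   # assumes a power-of-two column count
    row = addr >> col_bits               # sent first, with RAS
    col = addr & (n_cols - 1)            # sent second, with CAS
    return row, col

row, col = ras_cas(0b1101_0110, 16)      # 4 column bits: row 0b1101, col 0b0110
```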
Main Memory Organizations
[Figure: three organizations, left to right – (a) one-word-wide: CPU, cache, bus, memory; (b) wide: CPU, cache, multiplexor, bus, wide memory; (c) interleaved: CPU, cache, bus, memory banks 0-3.]
DRAM access time >> bus transfer time
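The interleaved organization works because consecutive words fall in different banks, so their access times overlap; a sketch (the bank count is illustrative):

```python
# Interleaved memory: word address modulo the bank count picks the bank,
# so a sequential block transfer keeps all banks busy at once.

N_BANKS = 4

def bank_of(word_addr):
    return word_addr % N_BANKS

banks = [bank_of(w) for w in range(8)]   # [0, 1, 2, 3, 0, 1, 2, 3]
```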
Virtual Memory
• Idea 1: many programs share DRAM memory so that context switches can occur
• Idea 2: allow a program to be written without memory constraints – the program can exceed the size of main memory
• Idea 3: relocation – parts of the program can be placed at different locations in memory instead of in one big chunk
• Virtual memory:
(1) DRAM memory holds many programs running at the same time (processes)
(2) use DRAM memory as a kind of "cache" for disk
Mapping Virtual to Physical Address
[Figure: with a 1KB page size, the page offset (bits 9..0) passes through translation unchanged; the virtual page number (bits 31..10 of the virtual address) is translated into the physical page number (bits 29..10 of the physical address).]
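With 1KB pages, translation passes the 10-bit offset through and maps only the page number; a sketch (the page-table contents are hypothetical):

```python
# Virtual-to-physical translation with 1KB pages: offset = low 10 bits,
# virtual page number = the remaining high bits, mapped by the page table.

PAGE_BITS = 10                            # 1KB page -> 10-bit offset

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = page_table[vpn]                 # KeyError here would be a page fault
    return (ppn << PAGE_BITS) | offset

paddr = translate(0x0C05, {3: 7})         # VPN 3 -> PPN 7, offset kept: 0x1C05
```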
How to Translate Fast?
• Problem: virtual memory requires two memory accesses!
– one to translate the virtual address into a physical address (page table lookup)
– one to transfer the actual data (cache hit)
– but the page table is in physical memory! => 2 main memory accesses!
• Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
• Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
• For historical reasons, such a "page table cache" is called a Translation Lookaside Buffer, or TLB
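The TLB behaves like a small dictionary of recent translations consulted before the page table; a toy sketch (the capacity and replacement policy are assumptions):

```python
# Toy TLB: cache VPN -> PPN translations; a hit avoids the page-table access,
# a miss walks the table and caches the entry (FIFO-ish eviction when full).

class TinyTLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}                       # VPN -> PPN

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            return self.entries[vpn], True      # TLB hit: no table access
        ppn = page_table[vpn]                   # TLB miss: walk the page table
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # evict oldest entry
        self.entries[vpn] = ppn
        return ppn, False

tlb = TinyTLB()
first = tlb.lookup(3, {3: 7})    # (7, False) -- miss, filled from the table
second = tlb.lookup(3, {3: 7})   # (7, True)  -- hit
```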
Translation Look-Aside Buffers
• TLB is usually small, typically 32-4,096 entries
• Like any other cache, the TLB can be fully associative, set associative, or direct mapped
[Figure: the processor presents a virtual address to the TLB; on a TLB hit, the physical address goes to the cache and, on a cache miss, on to main memory; on a TLB miss, the page table is consulted; a page fault or protection violation invokes the OS fault handler, which brings the page in from disk.]