DAP Spr.‘98 ©UCB 1
CS 203A
Lecture 16: Review for Test 2
Project and Test 2
• Project: think about what you can change in the CPU or cache architecture to speed up execution of network applications. Modify that part in SimpleScalar, rerun your applications, and compare with your results from Project 1.
• Test 2: 40 points in 80 minutes => about 2 minutes per point, which gives you an idea of how long to spend on each question. The test has 4 questions with several parts, and the points are noted. Answer precisely and briefly.
Minimizing Stalls Technique 1: Compiler Optimization
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1

Swap BNEZ and SD by changing the address of SD:
1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,8
3       ADDD F4,F0,F2
4       stall
5       BNEZ R1,Loop   ;delayed branch
6       SD   8(R1),F4  ;address altered from 0(R1) to 8(R1) when moved past SUBI

Scheduled loop: 6 clocks per iteration
Compiler Technique 2: Loop Unrolling
 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2     ;1 cycle delay *
 3       SD   0(R1),F4     ;drop SUBI & BNEZ – 2 cycles delay *
 4       LD   F6,-8(R1)
 5       ADDD F8,F6,F2     ;1 cycle delay
 6       SD   -8(R1),F8    ;drop SUBI & BNEZ – 2 cycles delay
 7       LD   F10,-16(R1)
 8       ADDD F12,F10,F2   ;1 cycle delay
 9       SD   -16(R1),F12  ;drop SUBI & BNEZ – 2 cycles delay
10       LD   F14,-24(R1)
11       ADDD F16,F14,F2   ;1 cycle delay
12       SD   -24(R1),F16  ;2 cycles delay
13       SUBI R1,R1,#32    ;alter to 4*8; 1 cycle delay
14       BNEZ R1,LOOP      ;delayed branch
15       NOP
* 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP op; 1 cycle delay after SUBI.
15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration
Loop unrolling is essential for ILP processors (why?), but it increases code size and the number of registers needed.
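The cycle arithmetic above can be checked with a small script (a sketch; the latency parameters reflect the stall assumptions stated on this slide, and the function name is made up):

```python
# Cycle count for the unrolled loop above.
# Assumed latencies (from the slide): 1 stall for an FP op after a load,
# 2 stalls for a store after an FP op, 1 stall after SUBI.

def unrolled_cycles(bodies=4, instrs=15, load_use_stall=1, fp_store_stall=2,
                    subi_stall=1):
    # each of the 4 unrolled bodies pays the load-use and FP-to-store stalls;
    # the SUBI stall is paid once at the end
    return instrs + bodies * (load_use_stall + fp_store_stall) + subi_stall

total = unrolled_cycles()
print(total, total / 4)   # 28 cycles, 7.0 per original iteration
```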
Minimize Stall + Loop Unrolling
• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register
– OK to move loads before stores: do we still get the right data?
– When is it safe for the compiler to make such changes?
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP    ;delayed branch
14       SD   8(R1),F16  ;8-32 = -24
14 clock cycles, or 3.5 per iteration
Very Long Instruction Word:VLIW Architectures
• Wide-issue processor that relies on the compiler to
– Pack together independent instructions to be issued in parallel
– Schedule code to minimize hazards and stalls
• Very long instruction words (3 to 8 operations)
– Can be issued in parallel without checks
– If the compiler cannot find independent operations, it inserts nops
• Advantage: simpler HW for wide issue
– Faster clock cycle
– Lower design & verification cost
• Disadvantages:
– Code size
– Requires aggressive compilation technology
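The nop-insertion behaviour described above can be sketched as a toy bundle packer (illustrative only: the issue width, op names, and dependence representation are assumptions, not a real VLIW ISA):

```python
# Toy VLIW bundle formation: pack independent ops into fixed-width bundles,
# padding unused slots with nops. A fully dependent chain cannot be packed,
# so every bundle carries nops and code size grows.

ISSUE_WIDTH = 3

def pack_bundles(ops, deps):
    """ops: op names in program order; deps: op -> set of its producer ops."""
    bundles, placed = [], set()
    while len(placed) < len(ops):
        bundle = []
        for op in ops:
            if op in placed or len(bundle) == ISSUE_WIDTH:
                continue
            if deps.get(op, set()) <= placed:   # all producers in earlier bundles
                bundle.append(op)
        placed |= set(bundle)
        bundle += ["nop"] * (ISSUE_WIDTH - len(bundle))   # pad to full width
        bundles.append(bundle)
    return bundles

# a load -> add -> store chain: one real op per bundle, two nops each
bundles = pack_bundles(["ld", "add", "st"], {"add": {"ld"}, "st": {"add"}})
```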
VLIW and Superscalar
• Sequential stream of long instruction words
• Instructions scheduled statically by the compiler
• Number of simultaneously issued instructions is fixed at compile time
• Instruction issue is less complicated than in a superscalar processor
• Disadvantage: VLIW processors cannot react to dynamic events, e.g. cache misses, with the same flexibility as superscalars
• The number of instructions in a VLIW instruction word is usually fixed
• Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be met; this increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed
• VLIW is an architectural technique, whereas superscalar is a microarchitecture technique
• VLIW processors take advantage of spatial parallelism
Multithreading
• How can we guarantee no dependencies between instructions in a pipeline?
– One way is to interleave execution of instructions from different program threads on the same pipeline – micro context switching
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
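The point of the interleaving above can be put numerically (a simplified model, not from the slides): with 4 threads issued round-robin, consecutive instructions of any one thread enter the pipe 4 cycles apart, enough to cover the read-after-write window of a non-bypassed 5-stage pipe.

```python
# Round-robin multithreaded issue: instruction i+1 of a thread enters the
# pipe n_threads cycles after instruction i, so the earlier instruction has
# already written back by the time the later one reads its registers.

PIPE_DEPTH = 5

def same_thread_spacing(n_threads):
    return n_threads   # cycles between successive instructions of one thread

# 4 threads give a spacing of 4 cycles, covering the 4-cycle IF..WB gap
print(same_thread_spacing(4) >= PIPE_DEPTH - 1)   # True
```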
Simple Multithreaded Pipeline
• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
Comparison of Issue Capabilities
(Courtesy of Susan Eggers; used with permission)
From Superscalar to SMT
• Small items
– per-thread program counters
– per-thread return stacks
– per-thread bookkeeping for instruction retirement, trap & instruction dispatch queue flush
– thread identifiers, e.g., with BTB & TLB entries
Typical NP Architecture
[Block diagram: a Network Processor containing multi-threaded processing elements and a co-processor, connected over buses to SDRAM (packet buffer) and SRAM (routing table), with input ports and output ports.]
Why Network Processors
• Current situation
– Data rates are increasing
– Protocols are becoming more dynamic and sophisticated
– Protocols are being introduced more rapidly
• Processing elements
– GP (General-Purpose Processor)
» Programmable, but not optimized for networking applications
– ASIC (Application-Specific Integrated Circuit)
» High processing capacity, but long time to develop and lacks flexibility
– NP (Network Processor)
» Achieves high processing performance
» Programming flexibility
» Cheaper than GP
IXP1200 Block Diagram
• StrongARM processing core
• Microengines introduce a new ISA
• I/O
– PCI
– SDRAM
– SRAM
– IX: PCI-like packet bus
• On-chip FIFOs
– 16 entries, 64B each
IXP1200 Microengine
• 4 hardware contexts
– Single-issue processor
– Explicit optional context switch on SRAM access
• Registers
– All are single ported
– Separate GPRs
– 256 × 6 = 1536 registers total
• 32-bit ALU
– Can access GPR or XFER registers
• Shared hash unit
– 1/2/3 values – 48b/64b
– For IP routing hashing
• Standard 5-stage pipeline
• 4KB SRAM instruction store – not a cache!
• Barrel shifter
Ref: [NPT]
IXP2400
[Block diagram: Intel® XScale™ core (32K IC, 32K DC), eight microengines (MEv2 1-8), two QDR SRAM channels (18b each), DDRAM (64b) via gasket, 16KB scratch memory, hash unit (64/48/128), Rbuf and Tbuf (64 entries @ 128B each), PCI (64b, 66 MHz), SPI3 or CSIX media interface (32b), and CSRs: Fast_wr, UART, timers, GPIO, BootROM/slow port.]
IXP2800
[Block diagram: Intel® XScale™ core (32K IC, 32K DC), sixteen microengines (MEv2 1-16), four QDR SRAM channels (18b each), three RDRAM channels, gasket, 16KB scratch memory, hash unit (48/64/128), Rbuf and Tbuf (64 entries @ 128B each), PCI (64b, 66 MHz), SPI4 or CSIX media interface (16b) with stripe unit, and CSRs: Fast_wr, UART, timers, GPIO, BootROM/slow port.]
Memory Hierarchy
Goal: the illusion of large, fast, cheap memory
• Fact: large memories are slow; fast memories are small
• How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy of levels
– Uses smaller and faster memory technologies close to the processor
– Fast access time in the highest level of the hierarchy
– Cheap, slow memory furthest from the processor
• The aim of memory hierarchy design is an access time close to that of the highest level and a size equal to that of the lowest level
Introduction to Caches
• Cache
– is a small, very fast memory (SRAM, expensive)
– contains copies of the most recently accessed memory locations (data and instructions): temporal locality
– is fully managed by hardware (unlike virtual memory)
– storage is organized in blocks of contiguous memory locations: spatial locality
– unit of transfer to/from main memory (or L2) is the cache block
• General structure
– n blocks per cache, organized in s sets
– b bytes per block
– total cache size n*b bytes
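The geometry above (n blocks, s sets, b bytes per block) can be captured in a small helper (a sketch; the example cache below is hypothetical):

```python
# Cache geometry: n blocks in s sets of b-byte blocks gives capacity n*b
# bytes and associativity n/s ways.

def cache_geometry(n_blocks, n_sets, block_bytes):
    assert n_blocks % n_sets == 0, "blocks must divide evenly into sets"
    return {"capacity_bytes": n_blocks * block_bytes,
            "ways": n_blocks // n_sets}

# hypothetical example: 1024 blocks, 256 sets, 32B blocks
g = cache_geometry(1024, 256, 32)   # 32 KB, 4-way set associative
```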
Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how to find it?
• Answers to (1) and (2) depend on the type or organization of the cache
• Direct mapped cache: each memory address is associated with one possible block within the cache
– Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
• Fully associative cache: a block can be placed anywhere, but complex in design
• N-way set associative: N cache blocks for each cache index
– Like having N direct-mapped caches operating in parallel
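Questions (1) and (2) reduce to splitting the address into tag, index, and offset fields; a sketch (power-of-two geometry assumed; the example address is arbitrary):

```python
# Address decomposition: the offset selects a byte within the block, the
# index selects the set, and the stored tag is compared to answer "is it here?".
# Direct mapped is the 1-way case; fully associative is the 1-set case.

def split_address(addr, n_sets, block_bytes):
    offset_bits = block_bytes.bit_length() - 1
    index_bits = n_sets.bit_length() - 1
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# direct mapped, 256 sets of 32B blocks: one candidate location per address
tag, index, offset = split_address(0x12345, 256, 32)
```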
Review: Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
– Random, LRU
• Q4: What happens on a write? (Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
Misses per instruction = Memory accesses per instruction × Miss rate
CPUtime = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
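The CPUtime equation above can be exercised directly (a sketch; every number in the example is hypothetical):

```python
# CPUtime = IC x (CPI_execution + accesses/instr x miss rate x miss penalty)
#           x clock cycle time

def cpu_time(ic, cpi_exec, mem_per_instr, miss_rate, miss_penalty, cycle_time):
    memory_stall_cpi = mem_per_instr * miss_rate * miss_penalty
    return ic * (cpi_exec + memory_stall_cpi) * cycle_time

# 1M instructions, base CPI 1.0, 1.5 accesses/instr, 2% miss rate,
# 50-cycle miss penalty, 1 ns clock: memory stalls add 1.5 CPI
t = cpu_time(1_000_000, 1.0, 1.5, 0.02, 50, 1e-9)
```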
Where do misses come from?
• Classifying Misses: 3 Cs
– Compulsory — the first access to a block can never hit, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
– Capacity — if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of size X)
– Conflict — if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
• 4th "C":
– Coherence — misses caused by cache coherence.
4: Add a second-level cache
• L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
– Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
– Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU
– The global miss rate is what matters
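The two-level equations above, with the local/global distinction, can be sketched as follows (all rates and latencies below are hypothetical):

```python
# AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + LocalMissRate_L2 x MissPenalty_L2)

def amat_two_level(hit_l1, mr_l1, hit_l2, local_mr_l2, penalty_l2):
    miss_penalty_l1 = hit_l2 + local_mr_l2 * penalty_l2
    return hit_l1 + mr_l1 * miss_penalty_l1

# L1: 1-cycle hit, 5% miss; L2: 10-cycle hit, 20% local miss, 100-cycle memory
a = amat_two_level(1, 0.05, 10, 0.20, 100)   # 1 + 0.05*(10 + 0.20*100)
global_mr_l2 = 0.05 * 0.20   # fraction of all CPU accesses that reach memory
```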
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
Larger Block Size                   +    –         0
Higher Associativity                +         –    1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler-Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Subblock Placement                       +    +    1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second-Level Caches                      +         2
(miss-rate techniques above the line, miss-penalty techniques below)
Main Memory Background
• Random Access Memory (vs. Serial Access Memory)
• Different flavors at different levels
– Physical makeup (CMOS, DRAM)
– Low-level architectures (FPM, EDO, BEDO, SDRAM)
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Size: DRAM/SRAM 4-8; Cost/cycle time: SRAM/DRAM 8-16
• Main memory is DRAM: Dynamic Random Access Memory
– Dynamic since it needs to be refreshed periodically
– Addresses divided into 2 halves (memory as a 2D matrix):
» RAS or Row Access Strobe
» CAS or Column Access Strobe
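The RAS/CAS split above is just a partition of the address bits; a sketch (the 16-column geometry is illustrative):

```python
# DRAM addressing: the address is sent in two halves over the same pins --
# the row half with RAS, then the column half with CAS.

def ras_cas(addr, n_cols):
    col_bits = n_cols.bit_length() - 1   # assumes a power-of-two column count
    row = addr >> col_bits               # sent first, with RAS
    col = addr & (n_cols - 1)            # sent second, with CAS
    return row, col

row, col = ras_cas(0b1101_0110, 16)      # 4 column bits: row 0b1101, col 0b0110
```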
Main Memory Organizations
[Figure: three organizations, left to right – (a) one-word-wide: CPU, cache, bus, memory; (b) wide: CPU, cache, multiplexor, bus, wide memory; (c) interleaved: CPU, cache, bus, memory banks 0-3.]
DRAM access time >> bus transfer time
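The interleaved organization works because consecutive words fall in different banks, so their access times overlap; a sketch (the bank count is illustrative):

```python
# Interleaved memory: word address modulo the bank count picks the bank,
# so a sequential block transfer keeps all banks busy at once.

N_BANKS = 4

def bank_of(word_addr):
    return word_addr % N_BANKS

banks = [bank_of(w) for w in range(8)]   # [0, 1, 2, 3, 0, 1, 2, 3]
```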
Virtual Memory
• Idea 1: many programs share DRAM memory so that context switches can occur
• Idea 2: allow a program to be written without memory constraints – the program can exceed the size of main memory
• Idea 3: relocation – parts of the program can be placed at different locations in memory instead of in one big chunk
• Virtual memory:
(1) DRAM memory holds many programs running at the same time (processes)
(2) use DRAM memory as a kind of "cache" for disk
Mapping Virtual to Physical Address
[Figure: with a 1KB page size, the page offset (bits 9..0) passes through translation unchanged; the virtual page number (bits 31..10 of the virtual address) is translated into the physical page number (bits 29..10 of the physical address).]
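With 1KB pages, translation passes the 10-bit offset through and maps only the page number; a sketch (the page-table contents are hypothetical):

```python
# Virtual-to-physical translation with 1KB pages: offset = low 10 bits,
# virtual page number = the remaining high bits, mapped by the page table.

PAGE_BITS = 10                            # 1KB page -> 10-bit offset

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = page_table[vpn]                 # KeyError here would be a page fault
    return (ppn << PAGE_BITS) | offset

paddr = translate(0x0C05, {3: 7})         # VPN 3 -> PPN 7, offset kept: 0x1C05
```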
How to Translate Fast?
• Problem: virtual memory requires two memory accesses!
– one to translate the virtual address into a physical address (page table lookup)
– one to transfer the actual data (cache hit)
– but the page table is in physical memory! => 2 main memory accesses!
• Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
• Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
• For historical reasons, such a "page table cache" is called a Translation Lookaside Buffer, or TLB
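The TLB behaves like a small dictionary of recent translations consulted before the page table; a toy sketch (the capacity and replacement policy are assumptions):

```python
# Toy TLB: cache VPN -> PPN translations; a hit avoids the page-table access,
# a miss walks the table and caches the entry (FIFO-ish eviction when full).

class TinyTLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}                       # VPN -> PPN

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            return self.entries[vpn], True      # TLB hit: no table access
        ppn = page_table[vpn]                   # TLB miss: walk the page table
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # evict oldest entry
        self.entries[vpn] = ppn
        return ppn, False

tlb = TinyTLB()
first = tlb.lookup(3, {3: 7})    # (7, False) -- miss, filled from the table
second = tlb.lookup(3, {3: 7})   # (7, True)  -- hit
```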
Translation Look-Aside Buffers
• TLB is usually small, typically 32-4,096 entries
• Like any other cache, the TLB can be fully associative, set associative, or direct mapped
[Figure: the processor presents a virtual address to the TLB; on a TLB hit, the physical address goes to the cache and, on a cache miss, on to main memory; on a TLB miss, the page table is consulted; a page fault or protection violation invokes the OS fault handler, which brings the page in from disk.]