The AMD K8 Processor Architecture

December 14th 2006


K7 vs K8

K7: 3 x86 decoding units, 3 integer units (ALU), 3 floating point units (FPU), 128KB L1 cache

K8: 3 decoders (16 bytes of instructions per clock cycle)

x86 instructions are decoded into fixed-length micro-operations (µOPs). Complex instructions are decoded into 2 or more µOPs.

FastPath: certain µOPs are packed together.

µOPs are then dispatched to the execution units.

3 Address Generation Units (AGU) for loads and stores

3 integer units (ALU): most µOPs execute in one cycle; multiplication has a 3-cycle latency in 32 bits and a 5-cycle latency in 64 bits

3 floating point units (FPU) that handle x87, MMX, 3DNow!, SSE and SSE2 instructions

Load/Store stage: the L1 is dual-ported, which means it can handle two 64-bit reads or writes each clock cycle
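The dual-port figure above translates directly into a peak L1 bandwidth. The sketch below works through that arithmetic; the function name is hypothetical and the 2.0 GHz clock is an assumed value for illustration, not a number from the slides.

```python
def l1_peak_bandwidth(ports: int, bits_per_access: int, clock_hz: float) -> float:
    """Return peak bytes per second for a multi-ported L1 cache."""
    bytes_per_cycle = ports * bits_per_access // 8
    return bytes_per_cycle * clock_hz


# Figures from the text: 2 ports, 64 bits per access.
bytes_per_cycle = 2 * 64 // 8

# Assumed clock frequency (illustrative only): 2.0 GHz.
peak = l1_peak_bandwidth(ports=2, bits_per_access=64, clock_hz=2.0e9)

print(f"{bytes_per_cycle} bytes per cycle, {peak / 1e9:.1f} GB/s peak")
```

This is an upper bound: it assumes both ports are busy on every cycle, which real code rarely achieves.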

K8 Hammer Microarchitecture

K7 vs K8 Pipelines

K8 L1 and L2 Cache

The L1 cache

| CPU | K8 | Athlon XP | Pentium 4 Northwood | Pentium 4 Prescott |
|---|---|---|---|---|
| Size | code: 64KB, data: 64KB | code: 64KB, data: 64KB | TC: 12Kµops, data: 8KB | TC: 12Kµops, data: 16KB |
| Associativity | code: 2 way, data: 2 way | code: 2 way, data: 2 way | TC: 8 way, data: 4 way | TC: 8 way, data: 8 way |
| Cache line size | code: 64 bytes, data: 64 bytes | code: 64 bytes, data: 64 bytes | TC: n.a., data: 64 bytes | TC: n.a., data: 64 bytes |
| Write policy | Write Back | Write Back | Write Through | Write Through |
| Latency | 3 cycles | 3 cycles | 2 cycles | 4 cycles |

TC: trace cache
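The size, associativity and line size listed for the L1 data caches determine how many sets each cache has and how an address splits into offset and index bits. The following is a minimal sketch of that arithmetic, assuming power-of-two geometries; the function names are hypothetical.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets = size / (associativity * line size)."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


def offset_and_index_bits(size_bytes: int, ways: int, line_bytes: int) -> tuple[int, int]:
    """Return (offset bits, index bits), assuming power-of-two values."""
    sets = cache_sets(size_bytes, ways, line_bytes)
    return line_bytes.bit_length() - 1, sets.bit_length() - 1


KB = 1024
# L1 data cache parameters as listed for each CPU.
k8_sets = cache_sets(64 * KB, ways=2, line_bytes=64)
northwood_sets = cache_sets(8 * KB, ways=4, line_bytes=64)
prescott_sets = cache_sets(16 * KB, ways=8, line_bytes=64)
k8_bits = offset_and_index_bits(64 * KB, ways=2, line_bytes=64)

print(k8_sets, northwood_sets, prescott_sets, k8_bits)
```

With these inputs the K8 data cache works out to 512 sets (6 offset bits, 9 index bits), while both Pentium 4 data caches have 32 sets.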

The L2 cache

| CPU | K8 | Athlon XP | Pentium 4 Northwood | Pentium 4 Prescott |
|---|---|---|---|---|
| Size | 512KB (Newcastle), 1024KB (Hammer) | 256 and 512KB | 512KB | 1024KB |
| Associativity | 16 way | 16 way | 8 way | 8 way |
| Cache line size | 64 bytes | 64 bytes | 64 bytes | 64 bytes |
| Latency (given by manufacturer) | ? | 8 cycles | 7 cycles | 11 cycles |
| Bus width | 128 bits | 64 bits | 256 bits | 256 bits |
| L1 relationship | exclusive | exclusive | inclusive | inclusive |

Exclusive vs Inclusive Cache

Exclusive L1-L2

In an exclusive design, a cache line (instructions or data) is held in L1 or in L2, never in both at once.

Positive: No constraint on the L2 size (it can be small). Total cache size is the sum of the sizes of the levels.

Negative: L2 performance is impaired (latency). A victim buffer is needed.

Inclusive L1-L2

An inclusive design duplicates the content of the L1 cache in the L2 cache.

Positive: L2 performance is improved.

Negative: Constraint on the L1/L2 size ratio (the L2 must be relatively large). Total cache size may be smaller.
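The capacity trade-off between the two designs can be expressed as a short calculation. The sketch below is illustrative only: the function name is hypothetical, and the inclusive example counts just the 8KB Northwood data L1 because the trace cache is sized in µOPs, not bytes.

```python
def effective_capacity_kb(l1_kb: int, l2_kb: int, relationship: str) -> int:
    """Unique cache capacity in KB for a two-level hierarchy."""
    if relationship == "exclusive":
        # No line lives in both levels, so the sizes add up.
        return l1_kb + l2_kb
    if relationship == "inclusive":
        # Every L1 line is also in L2, so L1 adds no unique capacity.
        return l2_kb
    raise ValueError(f"unknown relationship: {relationship}")


# K8: 64KB code + 64KB data L1, exclusive L2 (sizes from the tables).
k8_newcastle = effective_capacity_kb(128, 512, "exclusive")
k8_hammer = effective_capacity_kb(128, 1024, "exclusive")

# Pentium 4 Northwood: 8KB data L1, inclusive 512KB L2.
northwood = effective_capacity_kb(8, 512, "inclusive")

print(k8_newcastle, k8_hammer, northwood)
```

Under these assumptions the K8 reaches 640KB (Newcastle) or 1152KB (Hammer) of unique capacity, while the inclusive Northwood hierarchy stays at the 512KB of its L2.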

K8 Athlon 64

Athlon 64 Operating Modes

Opteron vs Xeon
