
AMD K7 Processor Architecture

Page 1: AMD K7 Processor Architecture

AMD K7 Processor Architecture

CMPE 511

prepared by Özsun S. Sönmez

Page 2: AMD K7 Processor Architecture

Introduction

• AMD K7 is the first 7th-generation PC CPU. The first six generations were the 8086, 80286, 80386, 80486, Pentium (AMD K5/K6) and Pentium II (AMD K6-2/K6-3). It is designed to operate above 500 MHz.

• AMD K7, also known as AMD Athlon, was introduced in the first half of 1999, and its architecture forms the basis for the subsequent Athlon XP versions until the release of K8 (AMD Hammer).

• Its competitor, the Intel Pentium III, was also released in the same year, and the two processors are compared whenever possible throughout the presentation.

Page 3: AMD K7 Processor Architecture

Main Features

• Out-of-order, 3-way superscalar x86 microprocessor

• 9 independent execution pipelines, with a 10-stage integer and a 15-stage FP pipeline:
– 3 Integer Execution Units
– 3 Address Calculation Units
– 3 Floating Point Execution Units

• 64 kB instruction and 64 kB data L1 caches

• Integrated L2 cache controller supporting up to 8 MB

• Extended 3DNow! instructions

Page 4: AMD K7 Processor Architecture

Main Features

• K7 uses the Digital™ Alpha™ EV6 system bus interface. This is probably the most important architectural difference from the previous generations. EV6 provides:

- Use of both rising and falling clock edges, resulting in doubled bus speed

- Scalability beyond 200 MHz (beyond 400 MHz bus speed)

- Highest bandwidth of that time: Athlon using 100 MHz (x2): 1.60 GB/s; PIII using 133 MHz: 1.01 GB/s

- 72-bit (64 + 8 ECC) data bus

- Independent address bus able to address 8 terabytes

- Independent snoop bus
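The bandwidth figures above follow from clock rate, transfers per clock and bus width. The sketch below is only an illustration of that arithmetic; the function name and the convention of 1 GB = 10^9 bytes are assumptions, not taken from the slides. With this convention the Athlon figure comes out at 1.60 GB/s, while the PIII figure comes out at about 1.06 GB/s, slightly above the 1.01 GB/s quoted above, which may reflect rounding or a different unit convention.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, data_bits=64):
    """Peak bus bandwidth in GB/s, assuming 1 GB = 10**9 bytes.

    data_bits counts only the payload (64 bits); the 8 ECC bits carry no data.
    """
    bytes_per_transfer = data_bits // 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# EV6 transfers data on both clock edges, so two transfers per clock.
athlon_ev6 = peak_bandwidth_gb_s(100, 2)
# The PIII bus is modelled here with one transfer per clock.
piii_bus = peak_bandwidth_gb_s(133, 1)

print(f"Athlon EV6: {athlon_ev6:.2f} GB/s, PIII: {piii_bus:.2f} GB/s")
```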

Page 5: AMD K7 Processor Architecture

Main Features – EV6 cont.

- Low-voltage signaling for low-cost motherboard implementations. Motherboards with GeForce, Dolby and Ethernet are available below $80.

- Point-to-point topology with clock forwarding for scalable multiprocessing.

Page 6: AMD K7 Processor Architecture

AMD K7 Processor Block Diagram

Page 7: AMD K7 Processor Architecture

Cache Architecture

• Separate L1 instruction and data caches

• Both are 64 kB, 64-bit, 2-way set associative and dual ported, with a 24-entry L1 TLB (32-entry for the data cache, DC) and a 256-entry L2 TLB.

• The instruction cache (IC) stores predecode information to assist the multiple instruction decoders.

• The L2 cache controller can interface to up to 8 MB of industry-standard SDR or DDR SRAMs and provides a full tag for a 512 kB cache or a partial tag for larger caches. The interface is 64 + 8 ECC bits wide.

Page 8: AMD K7 Processor Architecture

Cache Competition

• AMD Athlon (1999):
– 2x64 kB, 64-bit, 2-way, 3~ L1 cache with 64-byte lines
– 512 kB, 64-bit, 2-way, 18~ off-chip L2 with 64-byte lines

• Intel PIII Katmai (1999):
– 2x16 kB, 64-bit, 4-way, 3~ L1 cache with 32-byte lines
– 512 kB, 64-bit, 2-way, 21~ off-chip L2 with 32-byte lines

• Intel PIII Coppermine (1999): L2 changed to
– 256 kB, 256-bit, 8-way, 4~ on-chip

• AMD Athlon Thunderbird (2000): L2 changed to
– 256 kB, 64-bit, 16-way, 7~ on-chip
– Exclusive cache structure, meaning that the data in the L1 and L2 caches are different
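The size, associativity and line-size figures above determine how many sets each cache has (size divided by ways times line size). The sketch below works this out for the caches whose line size is stated on the slide; the function and dictionary names are illustrative assumptions. It also shows the usual consequence of an exclusive L2: since L1 and L2 hold different data, the unique on-chip capacity is the sum of the two.

```python
KB = 1024


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    assert size_bytes % (ways * line_bytes) == 0, "geometry must divide evenly"
    return size_bytes // (ways * line_bytes)


# (size, ways, line size) taken from the figures listed above.
caches = {
    "Athlon L1 (each)": (64 * KB, 2, 64),
    "Athlon off-chip L2": (512 * KB, 2, 64),
    "PIII Katmai L1 (each)": (16 * KB, 4, 32),
    "PIII Katmai off-chip L2": (512 * KB, 2, 32),
}
sets = {name: num_sets(*geometry) for name, geometry in caches.items()}

# Exclusive L2 (Thunderbird): no line is duplicated between L1 and L2,
# so unique capacity is both L1 caches plus the L2.
thunderbird_unique_kb = 2 * 64 + 256

for name, count in sets.items():
    print(f"{name}: {count} sets")
print(f"Thunderbird unique on-chip capacity: {thunderbird_unique_kb} kB")
```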

Page 9: AMD K7 Processor Architecture

Cache Competition

Page 10: AMD K7 Processor Architecture

Pipeline Architecture - Decoders

- The 3-way decoders convert instructions into fixed-length “Macro-Ops” (MOPs) and send them to the Instruction Control Unit (ICU)

- The ICU contains 72 entries vs. 20 entries in the PIII, which gives superior out-of-order execution performance

Page 11: AMD K7 Processor Architecture

Pipeline – Integer Execution Units

- 3 IEUs, 3 AGUs

- 15-entry integer scheduler

- 24-entry, 32-bit register file with 9 read and 8 write ports

Page 12: AMD K7 Processor Architecture

Pipeline - Floating Point Unit

- Floating Point Units execute MMX, x87 (FP) and 3DNow! instructions

- 36-entry FP scheduler

- 88-entry, 90-bit register file with 5 read and 5 write ports

- Some stages of the MUL pipeline may be unused during DIV/Sqrt iterations. The ICU informs the FP scheduler in such cases so that there is sufficient time to schedule independent MULs in the unused cycles.

- DIV by an exact 2^n or by zero takes 11~

DIV and Sqrt timing (latency/throughput, in cycles):

        Single   Double   Extended
DIV     16/13    20/17    24/21
Sqrt    19/16    27/24    35/32

Page 13: AMD K7 Processor Architecture

Pipeline – Load/Store Unit

• 44-entry Load/Store queue

• Data forwarding from stores to dependent loads
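Store-to-load forwarding means a load can take its value from a store that is still waiting in the queue instead of waiting for it to reach the cache. The toy model below sketches only that idea; the class, its method names and the dictionary used as memory are hypothetical, and it ignores the 44-entry limit, access sizes and partial overlaps that real hardware has to handle.

```python
class StoreQueue:
    """Toy model: pending stores are kept in program order, oldest first."""

    def __init__(self):
        self.pending = []  # list of (address, value) not yet written to memory

    def store(self, address, value):
        self.pending.append((address, value))

    def load(self, address, memory):
        """Return (value, forwarded) for a load that follows all pending stores."""
        # Search youngest-first so the load sees the most recent matching store.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value, True
        # No matching store in flight: read from cache/memory instead.
        return memory.get(address, 0), False

    def retire_oldest(self, memory):
        """Write the oldest pending store to memory and drop it from the queue."""
        address, value = self.pending.pop(0)
        memory[address] = value


memory = {0x100: 1}
queue = StoreQueue()
queue.store(0x100, 7)
queue.store(0x100, 9)
forwarded_result = queue.load(0x100, memory)   # served from the queue
memory_result = queue.load(0x200, memory)      # no pending store, read memory
print(forwarded_result, memory_result)
```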

Page 14: AMD K7 Processor Architecture

Pipeline - Stages

Stage   Integer                Floating Point
1       Fetch                  Fetch
2       Scan                   Scan
3       Align1                 Align1
4       Align2                 Align2
5       Early Decode (EDEC)    Early Decode (EDEC)
6       IDEC                   IDEC
7       Schedule               Stack Rename
8       Execute                Register Rename
9       Address Generation     Schedule Write
10      Data Cache Access      Schedule
11                             Register File Read
12                             Floating-point execution
13                             Floating-point execution
14                             Floating-point execution
15                             Floating-point execution

Page 15: AMD K7 Processor Architecture

Branch Prediction

• Dynamic branch prediction logic composed of:

– Branch prediction table (BPT): two-way, 2048-entry (512 for PIII). The BPT stores prediction information that is used for predicting the direction of conditional branches.

– Branch target address table: stores target addresses of conditional and unconditional branches.

Page 16: AMD K7 Processor Architecture

Branch Prediction

– Return address stack: 12-entry, optimizes CALL/RET instruction pairs

– The BPT is accessed during the Fetch stage and the prediction is made during the Scan stage using the Smith prediction algorithm (2-bit counters)

– Misprediction penalty is 10 cycles

• Approximate correct branch predictions:
– AMD Athlon: 95%
– Intel Pentium III: 90-92%
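The 2-bit counter scheme mentioned above can be sketched as a table of saturating counters, where values 0-1 predict not taken and 2-3 predict taken. This is a minimal illustration under stated assumptions: indexing the table by the branch address modulo the table size, the initial counter value and all names are choices made for this sketch, and the two-way organisation of the real table is not modelled.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (Smith-style direction predictor)."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Assumed initial state: 1 = weakly not taken.
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumed indexing scheme: low bits of the branch address.
        return pc % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(predictor, pc, outcomes):
    """Predict, then train, for each outcome; return the number of hits."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop branch: taken 9 times, then falls through, repeated 3 times.
outcomes = ([True] * 9 + [False]) * 3
correct = count_correct(TwoBitPredictor(), 0x4000, outcomes)
print(f"{correct} of {len(outcomes)} predictions correct")

# Average cost per branch at 95% accuracy with a 10-cycle penalty.
average_penalty_cycles = (1 - 0.95) * 10
print(f"about {average_penalty_cycles:.1f} cycles lost per branch")
```

Because the counter needs two wrong outcomes in a row to flip its prediction, the single loop exit costs one misprediction per pass rather than two.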

Page 17: AMD K7 Processor Architecture

3DNow! Technology

• 3DNow! is a set of SIMD instructions designed to accelerate FP-intensive multimedia applications.

• Instructions operate on two packed single-precision 32-bit doublewords simultaneously:
Dst[63:32] = Dst[63:32] op Src[63:32]
Dst[31:00] = Dst[31:00] op Src[31:00]
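The two-lane semantics above can be emulated in software to make the bit ranges concrete. The sketch below treats a 64-bit register as a Python integer holding two single-precision values; the helper names are hypothetical, and the arithmetic is done in Python floats and rounded to single precision on packing, so it only approximates hardware behaviour for values that are not exactly representable.

```python
import operator
import struct


def unpack2(reg64):
    """Split a 64-bit integer into (bits [31:0], bits [63:32]) as floats."""
    return struct.unpack("<2f", reg64.to_bytes(8, "little"))


def pack2(low, high):
    """Pack two floats into a 64-bit integer, low lane in bits [31:0]."""
    return int.from_bytes(struct.pack("<2f", low, high), "little")


def packed_op(dst, src, op):
    """Apply op lane by lane: Dst[lane] = Dst[lane] op Src[lane]."""
    dst_low, dst_high = unpack2(dst)
    src_low, src_high = unpack2(src)
    return pack2(op(dst_low, src_low), op(dst_high, src_high))


dst = pack2(1.5, 2.0)
src = pack2(0.25, 4.0)
packed_sum = unpack2(packed_op(dst, src, operator.add))
packed_product = unpack2(packed_op(dst, src, operator.mul))
print(packed_sum, packed_product)
```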

Page 18: AMD K7 Processor Architecture

3DNow! Technology

• With significant code analysis, AMD engineers found that there are two compelling implementation alternatives:

- extending MMX with 3DNow! instructions

- using wide registers separate from MMX, a 4-operand instruction format and support for MAC (multiply-accumulate)

- Anything in between requires significantly greater hardware area or complexity without providing a corresponding performance benefit.

• AMD chose the first one, which achieves most of the performance benefit with significantly less area and power. Since no additional registers are used, no new state is introduced, which preserves compatibility with existing OSs.

• The second choice is implemented in the PowerPC G4 under the name AltiVec.

Page 19: AMD K7 Processor Architecture

3DNow! Technology

• Instead of division and sqrt, reciprocal and reciprocal sqrt are implemented in AMD K7, since they are encountered more often in multimedia applications.

• MMX and 3DNow! instructions have at most 4-cycle latency (only for 3DNow! Add and Mul) and 1-cycle throughput. This is much faster than single-precision FP division (13~) and sqrt (16~).

• Using 2 FP pipelines simultaneously, maximum throughput is 4 FPops/~.
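The 4 FPops per cycle figure follows from two pipelines each completing one two-lane packed operation per cycle. The short sketch below spells out that arithmetic; the function name is illustrative, and the 500 MHz clock is just the design point mentioned in the introduction, used here as an example.

```python
def peak_fp_ops_per_second(clock_hz, pipelines=2, lanes=2):
    """Peak packed FP operations per second at 1 instruction/cycle/pipeline."""
    return clock_hz * pipelines * lanes


ops_per_cycle = peak_fp_ops_per_second(1)            # per-cycle figure
gflops_at_500mhz = peak_fp_ops_per_second(500e6) / 1e9

print(f"{ops_per_cycle} FP ops per cycle, {gflops_at_500mhz:.1f} GFLOPS at 500 MHz")
```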

Page 20: AMD K7 Processor Architecture

Integer Performance of AMD Athlon

Page 21: AMD K7 Processor Architecture

Floating Point Performance of AMD Athlon

Page 22: AMD K7 Processor Architecture
Page 23: AMD K7 Processor Architecture

Conclusion

• Being the first 7th-generation CPU, AMD K7 was a major leap forward in CPU history.

• It had both performance and cost benefits compared to the Intel PIII, and it started the competition that led to today’s AMD Athlon XP and P4 processors.

Page 25: AMD K7 Processor Architecture

Questions?