Intel Pentium 4
ENCM 515 - 2002
Jonathan Bienert
Tyson Marchuk
Overview:
• Product review
• Specialized architectural features (NetBurst)
• SIMD instructional capabilities (MMX, SSE2)
• SHARC 2106x comparison
Intel Pentium 4
• Reworked micro-architecture for high-bandwidth applications
• Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multi-media, and multi-tasking user environments
• These are DSP-intensive applications!
– What about uses other than in a PC?
Hardware Features: (NetBurst micro-architecture)
• Hyper pipelined technology
• Advanced dynamic execution
• Cache (data, L1, L2)
• Rapid ALU execution engines
• 400 MHz bus
• Out-of-order execution (OOE)
• Microcode ROM
Hyper Pipeline
• 20-stage pipeline!!!
• breaks down complex CISC instructions
– sub-stages mimic RISC
– faster execution
Filling the pipeline...
• Review of next 126 instructions to be executed
• Branch prediction
– on a mispredict, must flush the 20-stage pipeline!!!
– branch target buffer (BTB)
– 4K branch history table (BHT)
– assembly instruction hints
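To make the role of a branch history table concrete, here is a minimal sketch of a generic textbook predictor built from 2-bit saturating counters. The slides do not describe the Pentium 4's actual prediction scheme, so the counter design, the 4096-entry size (a reading of "4K"), the indexing by address modulo table size, and the branch address are all assumptions for illustration.

```python
class BranchHistoryTable:
    """Generic 2-bit saturating-counter predictor (illustrative, not the P4's real design)."""

    def __init__(self, entries: int = 4096):
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * entries

    def _index(self, address: int) -> int:
        # Hypothetical indexing: low bits of the branch address.
        return address % self.entries

    def predict(self, address: int) -> bool:
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredicts(bht, address, outcomes):
    """Run a sequence of branch outcomes through the predictor and count the misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(address) != taken:
            misses += 1
        bht.update(address, taken)
    return misses


# A loop branch that is taken 9 times and then falls through, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
mispredicts = count_mispredicts(BranchHistoryTable(), 0x401000, loop_outcomes)
print(mispredicts, "mispredicts out of", len(loop_outcomes))
```

Under these assumptions the loop mispredicts 3 times in 20 branches: once while the counter warms up, and once at each loop exit. Each miss is where the 20-stage flush would be paid.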
Cache
• 8KB Data Cache
• L1 Execution Trace Cache
– stores 12K previously decoded micro-instructions
– saves having to translate them again
• L2 Advanced Transfer Cache
– 256KB for data
– 256-bit transfer every cycle
– allows ~77GB/s data transfer at 2.4GHz
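The 77GB/s figure follows directly from the bus width and clock rate quoted on the slide. A small sketch of the arithmetic (the function name is just for illustration, and GB is taken as 10^9 bytes):

```python
BUS_WIDTH_BITS = 256   # bits transferred per clock cycle (from the slide)
CLOCK_HZ = 2.4e9       # 2.4 GHz core clock (from the slide)


def peak_bandwidth_gb_per_s(width_bits: int, clock_hz: float) -> float:
    """Bytes per cycle times cycles per second, in units of 10^9 bytes/s."""
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * clock_hz / 1e9


l2_bandwidth = peak_bandwidth_gb_per_s(BUS_WIDTH_BITS, CLOCK_HZ)
print(f"{l2_bandwidth:.1f} GB/s")  # 32 bytes/cycle * 2.4 GHz = 76.8 GB/s, rounded to 77 on the slide
```

This is a peak figure: it assumes a transfer on every single cycle.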
Rapid ALU Execution Engines
• 2 ALUs
– allow parallel operations
• Many arithmetic operations take 1/2 cycle
– each 2X ALU can have 2 operations per cycle
Software Features:
• Multimedia Extensions (MMX)
– 8 MMX registers
• Streaming SIMD Extensions 2 (SSE2)
– 8 SSE/SSE2 registers
• Standard x86 Registers
– EAX, EBX, ECX, EDX, ESI, etc.
– Register renaming to over 100 physical registers
MMX (Multimedia Extensions)
• Accelerated performance through SIMD
• multimedia, communication, internet applications
• 64-bit packed INTEGER data
– signed/unsigned
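As an illustration of what "64-bit packed integer data" means, the sketch below models one MMX register as a 64-bit integer holding eight unsigned bytes and adds two such registers lane by lane. The two add flavours mirror the behaviour of the MMX instructions PADDB (wraparound) and PADDUSB (unsigned saturation). The helper names and sample data are made up for the example.

```python
LANES = 8                          # eight 8-bit lanes fill one 64-bit MMX register
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1   # 0xFF


def pack_bytes(values):
    """Pack 8 unsigned bytes into one 64-bit integer, lane 0 in the low byte."""
    assert len(values) == LANES
    word = 0
    for lane, v in enumerate(values):
        word |= (v & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its 8 byte lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def packed_add_wrap(a, b):
    """Lane-wise add with wraparound, modelled on PADDB."""
    return pack_bytes([(x + y) & LANE_MASK
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def packed_add_unsigned_saturate(a, b):
    """Lane-wise add clamped at 255, modelled on PADDUSB."""
    return pack_bytes([min(x + y, LANE_MASK)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


pixels = pack_bytes([10, 50, 100, 150, 200, 250, 255, 0])
boost = pack_bytes([60] * 8)
wrapped = unpack_bytes(packed_add_wrap(pixels, boost))
saturated = unpack_bytes(packed_add_unsigned_saturate(pixels, boost))
```

Saturation is usually what image and audio code wants: a bright pixel plus 60 stays at 255 instead of wrapping around to a dark value.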
SSE2 (Streaming SIMD Extensions 2)
• Accelerate a broad range of applications
– video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
• 128-bit SIMD instruction formats
– 4 single precision FP values
– 2 double precision FP values
– 16 byte values
– 8 word values
– 4 double word values
– 2 quad word values
– 1 128-bit integer value
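Each count in the list above is simply the 128-bit register width divided by the element width. A short sketch that reproduces the list (the dictionary labels are informal names, not official Intel type names):

```python
REGISTER_BITS = 128  # width of one SSE2 (XMM) register

# Element width in bits for each packed format named on the slide.
ELEMENT_BITS = {
    "single precision FP": 32,
    "double precision FP": 64,
    "byte": 8,
    "word": 16,
    "double word": 32,
    "quad word": 64,
    "128-bit integer": 128,
}

# Number of elements processed by one packed instruction.
lanes_per_register = {name: REGISTER_BITS // bits for name, bits in ELEMENT_BITS.items()}

for name, lanes in lanes_per_register.items():
    print(f"{lanes:2d} x {name}")
```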
SIMD Example (16-tap FIR filter - Real numbers)
• Applications for real FIR filters
• general purpose filters in image processing, audio, and communication algorithms
• Will utilize SSE2 SIMD instruction set
Thinking about SIMD
• SSE2 instruction format is 128 bits
• 128-bit SSE2 registers
• Many data formats!
• What precision do we want?
• Let's use 32-bit floating point for coefficients, input, output
– 4 data elements x 32 bits = 128 bits
Parallelizing
• Require many single multiplications (coefficients x inputs), then add the results for output!
• Multiplications…
• then need to perform additions...
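The multiply-then-add structure described above is the standard FIR sum y[n] = h[0]·x[n] + h[1]·x[n-1] + … + h[15]·x[n-15]. Below is a plain scalar reference sketch with no SIMD. The function names are illustrative, and samples before the start of the signal are assumed to be zero.

```python
def fir_output(coeffs, window):
    """One output sample: multiply each coefficient by its input, then add the products."""
    products = [c * x for c, x in zip(coeffs, window)]  # the multiplications
    return sum(products)                                # then the additions


def fir_filter(coeffs, samples):
    """Filter a whole signal, treating samples before the start as zero."""
    taps = len(coeffs)
    padded = [0.0] * (taps - 1) + list(samples)
    out = []
    for n in range(len(samples)):
        # Newest sample first: x[n], x[n-1], ..., x[n-taps+1]
        window = padded[n:n + taps][::-1]
        out.append(fir_output(coeffs, window))
    return out


# Feeding a unit impulse through the filter returns the coefficients themselves.
coeffs = [float(k) for k in range(16)]
impulse = [1.0] + [0.0] * 15
impulse_response = fir_filter(coeffs, impulse)
```

Every output sample needs 16 independent multiplications, which is why the filter maps well onto SIMD.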
Using SSE2 format
• Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
• 4 single precision floating point ops per cycle (32-bit)
Additions...
• In both registers, now have 4 32-bit results
– First add the results into an accumulator register
• 4 single precision floating point ops per cycle (32-bit)
Additions...
• In a register, now have 4 32-bit results
– however, NO SSE2 instruction to add these 4!
– But can use other instructions
• Some BIT INTERTWINING (e.g. shuffle/unpack instructions)…then add
– This will give results for several output values!
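The sequence on the last few slides (packed multiply, packed accumulate, then rearrange lanes and add) can be modelled without real SIMD hardware. In this sketch a 128-bit register is a Python list of 4 floats, and all helper names are hypothetical. It computes a single output sample, whereas the slides mention producing several outputs at once, and Python floats are 64-bit, so rounding will differ from 32-bit hardware.

```python
LANES = 4  # four 32-bit floats fill one 128-bit register


def load4(array, offset):
    """Model loading 4 consecutive floats into one register."""
    return list(array[offset:offset + LANES])


def mul4(a, b):
    """Packed multiply: 4 independent products in one step."""
    return [x * y for x, y in zip(a, b)]


def add4(a, b):
    """Packed add: 4 independent sums in one step."""
    return [x + y for x, y in zip(a, b)]


def horizontal_sum(r):
    """Add the 4 lanes of one register using two rearrange-and-add steps."""
    swapped_halves = [r[2], r[3], r[0], r[1]]        # bring the high pair onto the low pair
    pair_sums = add4(r, swapped_halves)              # lane 0 = r0+r2, lane 1 = r1+r3
    swapped_neighbours = [pair_sums[1], pair_sums[0], pair_sums[3], pair_sums[2]]
    return add4(pair_sums, swapped_neighbours)[0]    # lane 0 = r0+r1+r2+r3


def fir16_simd_model(coeffs, window):
    """One output of a 16-tap FIR: 4 packed multiply-accumulates, then one horizontal sum."""
    assert len(coeffs) == 16 and len(window) == 16
    acc = [0.0] * LANES                               # accumulator register
    for offset in range(0, 16, LANES):
        acc = add4(acc, mul4(load4(coeffs, offset), load4(window, offset)))
    return horizontal_sum(acc)


result = fir16_simd_model([1.0] * 16, [float(i) for i in range(16)])
```

The point of the model is the operation count: 16 scalar multiplies and 15 scalar adds collapse into 4 packed multiplies, 4 packed adds and a short reduction.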
ADI SHARC 21k vs. P4
Disadvantages
• Slower clock speed (40MHz vs 2400MHz)
• Fewer opportunities for parallelism (5 vs 11)
• Much less memory (Cache and System)
– Limited algorithm applicability
– Limited applications
• Older (Less support – compiler)
– 1994 vs 2001
ADI SHARC 21k vs. P4
Advantages
• Hardware loops
• Easier to program for optimal speed
• Cheaper
• Lower power consumption
• Runs cooler
FIR Performance
• Hard to obtain P4 performance numbers
• Can estimate based on 2 FP multiplies per clock, clock rate and assumption that pipeline can be kept full.
– 2 * 2.4GHz ~ 4.8 billion multiplies per second
– If ~4 multiplies per element & 44000 samples/s
– FIR length > ~25k taps
• SHARC => ~ 200 taps (Lab 4)
• Factor of ~125x
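The estimate above can be reproduced in a few lines. This sketch reads "~4 multiplies per element" as a budget of about 4 multiply slots per tap per sample, which is an interpretation rather than something the slide spells out. All other figures come from the slide.

```python
FP_MULTIPLIES_PER_CLOCK = 2   # from the slide
CLOCK_HZ = 2.4e9              # 2.4 GHz
MULTIPLIES_PER_TAP = 4        # assumed reading of "~4 multiplies per element"
SAMPLE_RATE_HZ = 44_000       # samples per second
SHARC_TAPS = 200              # SHARC result quoted from Lab 4

multiplies_per_second = FP_MULTIPLIES_PER_CLOCK * CLOCK_HZ            # 4.8e9
multiplies_per_sample = multiplies_per_second / SAMPLE_RATE_HZ        # budget for each output sample
max_taps = multiplies_per_sample / MULTIPLIES_PER_TAP                 # roughly 27,000 taps
speedup_vs_sharc = max_taps / SHARC_TAPS

print(f"max FIR length ~ {max_taps:,.0f} taps")
print(f"ratio to SHARC ~ {speedup_vs_sharc:.0f}x")
```

The raw arithmetic gives a little over 27k taps. The slide rounds down to 25k, and 25,000 / 200 is where the ~125x factor comes from. Both are upper bounds, because they assume the pipeline never stalls.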
IIR Performance
• Hard to obtain P4 performance numbers
• No hardware circular buffers
• Does have BTB, BHT, etc.
• Prefetches ~256 bytes ahead of current position in code.
FFT Performance
• Hard to obtain P4 performance numbers
• Prime95 uses FFTs to perform the Lucas-Lehmer test for Mersenne primes
– Involves FFT, squaring and iFFT, etc.
• 256k points on P4 2.3GHz ~ 10.517ms
• Compare to SHARC 2048 point FFT ~0.37ms
• If SHARC could do 256k points, scaling linearly gives ~46.25ms (But…)
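The 46.25ms figure matches the SHARC time scaled linearly in the number of points, with 256k read as 256,000 (a factor of 125). FFT cost normally grows as N·log2(N), so the sketch below also shows that scaling for comparison. The N·log2(N) model and the 262,144-point reading of "256k" are assumptions, not numbers from the slides.

```python
import math

SHARC_POINTS = 2048
SHARC_TIME_MS = 0.37      # from the slide
P4_TIME_MS = 10.517       # 256k points on the P4, from the slide


def scale_linear(time_ms, n_from, n_to):
    """Assume run time grows in proportion to the number of points."""
    return time_ms * n_to / n_from


def scale_n_log_n(time_ms, n_from, n_to):
    """Assume run time grows as N * log2(N), the usual FFT operation count."""
    return time_ms * (n_to * math.log2(n_to)) / (n_from * math.log2(n_from))


slide_estimate_ms = scale_linear(SHARC_TIME_MS, SHARC_POINTS, 256_000)      # about 46.25 ms
n_log_n_estimate_ms = scale_n_log_n(SHARC_TIME_MS, SHARC_POINTS, 262_144)   # about 77.5 ms

print(f"linear scaling:  {slide_estimate_ms:.2f} ms, {slide_estimate_ms / P4_TIME_MS:.1f}x the P4 time")
print(f"N log N scaling: {n_log_n_estimate_ms:.1f} ms, {n_log_n_estimate_ms / P4_TIME_MS:.1f}x the P4 time")
```

Either way the comparison is hypothetical, since it ignores whether a 256k-point transform would fit in the SHARC's memory at all.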
Optimization Example
• Hard to optimize Pentium 4 assembly
• Example of multiplying by a constant, 10
• Taken mainly from: www.emulators.com/docs/pentium_1.htm
Multiplying by 10
• Slowest way:
– IMUL EAX, 10
• Usually optimal way (Visual C++ 6.0)
– LEA EAX, [EAX+EAX*4]
– SHL EAX, 1
– Shift, add, shift
– On most x86 processors takes 2 cycles
– Pentium MMX and before 3 cycles
– On Pentium 4 takes 6 cycles!
Multiplying by 10
• Optimal for Pentium 4
– LEA ECX, [EAX + EAX]
– LEA EAX, [ECX+EAX*8]
– On most x86 still takes 2 cycles
– On Pentium 4 takes ~ 3 cycles (OOE - Ops)
– But on older processors (Pentium MMX and before) this now takes 4 cycles!
Multiplying by 10
• Best generic case
– LEA EAX, [EAX + EAX*4]
– ADD EAX, EAX
– On most x86 still takes 2 cycles
– On older processors (Pentium MMX and before) this now takes 3 cycles again
– On Pentium 4 this takes 4 cycles
• Obviously really hard to optimize
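All of the instruction sequences on the multiply-by-10 slides compute the same value and differ only in cycle counts. The sketch below models each one on 32-bit registers to check that equivalence. The `lea` helper and the function names are illustrative, and nothing here models timing.

```python
MASK32 = 0xFFFFFFFF  # results wrap to 32 bits, as in EAX


def lea(base, index=0, scale=1):
    """Model LEA dst, [base + index*scale]; x86 allows scales of 1, 2, 4 or 8."""
    assert scale in (1, 2, 4, 8)
    return (base + index * scale) & MASK32


def times10_imul(eax):
    return (eax * 10) & MASK32           # IMUL EAX, 10


def times10_lea_shl(eax):
    eax = lea(eax, eax, 4)               # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax << 1) & MASK32           # SHL EAX, 1            -> 10x


def times10_two_lea(eax):
    ecx = lea(eax, eax, 1)               # LEA ECX, [EAX+EAX]    -> 2x
    return lea(ecx, eax, 8)              # LEA EAX, [ECX+EAX*8]  -> 2x + 8x


def times10_lea_add(eax):
    eax = lea(eax, eax, 4)               # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax + eax) & MASK32          # ADD EAX, EAX          -> 10x


test_values = [0, 1, 7, 123456789, 0x7FFFFFFF, 0xFFFFFFFF]
all_agree = all(
    times10_imul(v) == times10_lea_shl(v) == times10_two_lea(v) == times10_lea_add(v)
    for v in test_values
)
```

Since the results are identical, picking between the variants is purely a per-processor timing decision, which is the difficulty the slides point out.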
REFERENCES
• Intel application note: AP 809 - Real and Complex Filter Using Streaming SIMD Extensions
• graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html