Haswell
Thomas Shull, Bhargava Reddy Gopi Reddy, Raghavendra Pradyumna Pothukuchi
RISC-y
• Find each instruction, i.e. decode its length.
• Each x86 instruction (Macro Op) is chopped into “µOps”.
• Some Macro Op combos can be treated as 1 instruction. Pack them together.
  • CMP <–> JUMP IF; ± <–> TEST
• Some µOps are packed into 1 µOp and are later implicitly broken up.
  • ADD [EBX] EAX -> MOV ECX [EBX]; ADD ECX EAX -> ADD [EBX] EAX
www.realworldtech.com/haswell-cpu/
Fetch and Decode
• Multicycle, power-hungry decode.
• µOps are cached.
  • 32 sets, 8 ways, 6 µOps per line
  • A 32B window (18 µOps at maximum) is inserted at once
    • If a 32B window has more than 18 µOps, it is not inserted.
  • Delivers at most 4 µOps on a “full hit”
  • Double bandwidth (32B vs. 16B) on a hit.
Why? AVX!
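The geometry quoted above implies a total capacity and a per-window line count that are easy to check by hand. The following is a small arithmetic sketch only; the function names are made up for illustration, and it assumes a 32B window simply occupies as many 6-µOp lines as it needs.

```python
# Toy arithmetic for the µOp cache figures quoted on the slide.
SETS = 32
WAYS = 8
UOPS_PER_LINE = 6
MAX_UOPS_PER_WINDOW = 18  # limit per 32B window, from the slide


def total_capacity_uops():
    """Total number of µOps the cache can hold."""
    return SETS * WAYS * UOPS_PER_LINE


def lines_needed(uops_in_window):
    """Lines needed to cache one 32B window, or None if it is not inserted."""
    if uops_in_window > MAX_UOPS_PER_WINDOW:
        return None  # more than 18 µOps: do not insert
    return -(-uops_in_window // UOPS_PER_LINE)  # ceiling division


print(total_capacity_uops())  # 1536
print(lines_needed(18))       # 3
print(lines_needed(19))       # None
```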
Renaming and Oh Oh Oh!
• Renaming: map from logical registers to physical registers (PRF) and allocate resources.
• ROB is a placeholder.
• Break the fused µOps into simpler ops.
www.realworldtech.com/haswell-cpu/
Scheduler
• 8 issue ports
  • 1 WB per port
• INT, FP, SIMD networks + MEM
  • More penalty for inter-network data forwarding.
• Register-register moves are folded by just changing the PRF map.
  • Extra pipeline stage for dereferencing links
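Folding a register-register move into the rename map can be sketched with a toy renamer. This is a simplified model, not Haswell's actual rename logic: the class and method names are hypothetical, and the reference counting real hardware needs before freeing a shared physical register is deliberately left out.

```python
class Renamer:
    """Toy register renamer with move elimination (illustrative only)."""

    def __init__(self, logical_regs, num_physical):
        # Initially logical register i maps to physical register i.
        self.map = {reg: i for i, reg in enumerate(logical_regs)}
        self.free = list(range(len(logical_regs), num_physical))
        self.issued = []  # µOps that actually reach the scheduler

    def rename(self, op, dst, srcs):
        phys_srcs = tuple(self.map[s] for s in srcs)
        if op == "MOV" and len(srcs) == 1:
            # Move elimination: point dst at the source's physical
            # register instead of issuing a µOp to an execution port.
            self.map[dst] = phys_srcs[0]
            return None
        new_phys = self.free.pop(0)  # allocate a fresh physical register
        self.map[dst] = new_phys
        uop = (op, new_phys, phys_srcs)
        self.issued.append(uop)
        return uop


r = Renamer(["EAX", "EBX", "ECX"], num_physical=8)
r.rename("MOV", "ECX", ["EAX"])         # folded: no µOp issued
r.rename("ADD", "EAX", ["EAX", "EBX"])  # issued, EAX gets a new register
```

After the ADD, ECX still points at the old physical register, so it keeps the value EAX had before the addition.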
Execution Units
[Figure: a 60-entry unified scheduler feeding Port 0 through Port 7; the execution units shown include Int, FMA, Vector, Branch, Div, Mem, and Store.]
Did we forget something? Branch Predictor!!
• More entries in BTB (less per entry!)
  • Entries with fewer offset bits
• Use the space saved for global branch prediction
  • 2-level global predictor? 1-bit entries?
• 14–17 cycles of misprediction penalty.
• 56-entry µOp buffer for identifying small loops
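The slide only guesses at a two-level global predictor with 1-bit entries. To make that guess concrete, here is a generic textbook-style sketch of such a predictor; it is not a description of Haswell's predictor, and the class name and the history length are assumptions chosen for illustration.

```python
class GlobalPredictor:
    """Toy two-level global predictor with 1-bit pattern table entries."""

    def __init__(self, history_bits=2):
        self.history_bits = history_bits
        self.history = 0  # level 1: global history of recent outcomes
        # Level 2: one bit per history pattern (the last outcome seen).
        self.table = [False] * (1 << history_bits)

    def predict(self):
        return self.table[self.history]

    def update(self, taken):
        self.table[self.history] = taken
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-like pattern: taken, taken, not taken, repeated.
pattern = [True, True, False] * 10
print(count_mispredictions(GlobalPredictor(history_bits=2), pattern))  # 3
```

All three mispredictions happen during warm-up; once each history pattern has been seen, two bits of history are enough to predict this pattern exactly.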
Big Picture: 14-stage pipeline
www.realworldtech.com/haswell-cpu/
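Memory Hierarchy – For Data
[Figure: data-side memory hierarchy, from the unified scheduler down to the L3/LLC]
• Unified scheduler feeds Port 2 and Port 3 (64-bit AGUs), Port 4 (Store Data), and Port 7 (Store AGU)
• Load Buffer and Store Buffer
• 32 KB L1 D Cache (8-way): 2x32B loads and 32B store per cycle
• L1 TLB (4-way): 4K – 64 entries, 2M/4M – 32 entries, 1G – 4 entries
• L2 TLB: 1024 entries, shared, 8-way
• 256 KB L2 Cache (8-way), 64B per cycle to the L1 D Cache
• L3/LLC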
L3 (Also Last Level Cache)
• Banked structure, one bank per core
• Shared and fully inclusive
• Separate tag arrays
  • One for data requests
  • One for prefetches and coherency requests
• Point of coherence
• Separate frequency domain from CPU
  • Helps to run CPU, GPU and LLC at different speeds as necessary
[Figure: ring layout with the System Agent, Core0 to Core3 each paired with an L3 bank, and the GPU]
The Ring
• Ring stops
  • Core/L3 bank (Cachebox) can send/receive two packets on the ring each cycle
    • Up direction
    • Down direction
  • GPU and System Agent can send only one per cycle
• Ring actually consists of 4 rings
Memory Controller
• 2 clock domains
  • DCLK – DDR command clock
  • QCLK – DDR data clock
• Requested 32B are returned first
• Maintains page table information and the corresponding requests
• Page hits are given priority -> increases the bandwidth
• Reads are given priority
• Write Data Buffer to maintain writes
• Write merging can happen in the Write Data Buffer
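The priority rules and write merging above can be illustrated with a toy scheduler. This is a sketch under stated assumptions: the slide does not say how read priority and page-hit priority rank against each other, so the ordering here (reads first, then page hits, then oldest) is a guess, and `Request`, `pick_next`, and `merge_write` are hypothetical names.

```python
from collections import namedtuple

# kind is "R" (read) or "W" (write); row is the DRAM page being accessed.
Request = namedtuple("Request", "kind row addr")

LINE_BYTES = 64  # assumed merge granularity for the write data buffer


def pick_next(queue, open_row):
    """Index of the next request: reads first, then page hits, then oldest."""
    def priority(item):
        index, req = item
        is_write = req.kind == "W"
        is_page_miss = req.row != open_row
        return (is_write, is_page_miss, index)
    return min(enumerate(queue), key=priority)[0]


def merge_write(write_buffer, addr, data):
    """Merge a write into the buffer; later bytes overwrite earlier ones."""
    for i, byte in enumerate(data):
        line, offset = divmod(addr + i, LINE_BYTES)
        write_buffer.setdefault(line, {})[offset] = byte


queue = [Request("W", 1, 0), Request("R", 2, 64), Request("R", 1, 128)]
print(pick_next(queue, open_row=1))  # 2: a read that hits the open page
```

Two writes that touch the same 64B line end up as a single buffer entry, so only one write needs to go out to DRAM.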
System Agent
• Contains
  • Memory Controller
  • PCI Express Controller
  • DMI Controller
  • Display Engine
  • Power Control Unit
  • I/O
[Figure: Core0 to Core3 with their L3 banks and the GPU, plus the System Agent containing the Display Engine, Memory Controller, PCU, PCIE, and DMI]
Multithreading
• Use atomic operations to control access to items used by multiple threads
  • Obtain and release locks for critical sections
• Intel currently supports making the following operations atomic by adding a “LOCK” prefix:
  • ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG
• MOV is also atomic on aligned accesses (LEAL only computes an address and does not access memory)
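The lock-based critical section described above can be sketched at a high level. Python does not expose the LOCK prefix, so this example only models the acquire/release pattern with `threading.Lock`; the function names are made up for illustration.

```python
import threading

counter = 0
mutex = threading.Lock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        with mutex:       # acquire the lock
            counter += 1  # critical section: read-modify-write
        # the lock is released on leaving the with-block


def run(num_threads=4, iterations=10_000):
    """Run the workers and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run())  # 40000
```

In native code, a single shared increment like this would typically not need a mutex at all, since a LOCK-prefixed ADD, INC, or XADD performs the whole read-modify-write atomically.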
Transactional Memory
• Main idea: try to run critical sections without locks and monitor for conflicts
• Use read and write sets to log memory accesses in transactional sections
• If conflicts occur, abort and revert register state to the beginning of the transaction
• If successful, commit the changes to memory so they are visible to other threads
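The read-set/write-set idea above can be modelled in software. This is a toy that validates versions at commit time and assumes a single thread drives both transactions; it does not describe how Haswell detects conflicts in hardware, and all class names are hypothetical. Writes are buffered, so an abort simply discards them.

```python
class Abort(Exception):
    """Raised when a transaction detects a conflict at commit time."""


class Memory:
    def __init__(self):
        self.data = {}     # addr -> committed value
        self.version = {}  # addr -> number of committed writes


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}   # addr -> version observed on first read
        self.write_set = {}  # addr -> buffered (uncommitted) value

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self.read_set.setdefault(addr, self.mem.version.get(addr, 0))
        return self.mem.data.get(addr, 0)

    def write(self, addr, value):
        self.write_set[addr] = value  # invisible to others until commit

    def commit(self):
        # Conflict: something we read was committed by someone else.
        for addr, seen in self.read_set.items():
            if self.mem.version.get(addr, 0) != seen:
                raise Abort(addr)
        for addr, value in self.write_set.items():
            self.mem.data[addr] = value
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1


mem = Memory()
t1, t2 = Transaction(mem), Transaction(mem)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 1)
t1.commit()      # succeeds: x becomes 1
try:
    t2.commit()  # conflicts with t1's committed write
except Abort:
    print("t2 aborted; x is still", mem.data["x"])
```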
Restricted Transactional Memory
• Haswell is the first Intel mainstream processor to include Transactional Memory
  • Added Transactional Synchronization eXtensions (TSX)
• New instructions for Restricted Transactional Memory
  • XBEGIN – indicates start of transaction
  • XEND – indicates end of transaction
  • XABORT – used for testing; aborts transaction
  • XTEST – indicates whether execution is inside a transactional region
• XBEGIN must be given a pointer to the code that runs upon an abort
• Requires code to be rewritten using transactional sections
Integrated Graphics
• Supports 3 simultaneous displays, HDMI
• Scalable architecture: different versions of the processor (GT1, GT2, GT3) offer different numbers of Execution Units (EUs), among other upgrades
• Multiple video encoding and decoding support in hardware.
  • Supported encodings include MPEG4, MPEG2, SVC
• Supports OpenCL 1.1, OpenGL 4.0
Figure taken from “Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell” Presentation. Intel Developers Forum, San Francisco, 2012
Power Management
• Three voltage domains
  • Allows the screen to be updated while the processor is turned off
• Voltage regulators are on chip
• Power gating
• New power-saving states
  • S0ix idle states
  • Recommends power levels and response times for vendors
  • Uses 20x less power than the previous S0 state
Figure taken from “Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations” Presentation. Intel Developers Forum, San Francisco, 2012
Recap:
• 14-stage pipeline
• 4 cores, SMT machine
• In-order issue, out-of-order execution, in-order commit.
• Wider data paths and an extra Store AGU to provide more bandwidth in AVX2 computations
• LLC/Ring is the point of coherence and distributed arbitration of requests.
• Intel TSX
  • Added support for Restricted Transactional Memory
• Integrated graphics and improved power management
  • Power efficiency is a huge emphasis
Resources
General Information
• Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell. Presented at IDF 2012 by Tom Piazza, Hong Jiang, Per Hammarlund, Ronak Singhal
• Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations. Presented at IDF 2012 by Robert Chappell, Bret Toll, Ronak Singhal
• Kanter, David. Intel’s Haswell CPU Microarchitecture. November 13, 2012. www.realworldtech.com/haswell-cpu/
• Kanter, David. Analysis of Haswell’s Transactional Memory. February 15, 2012. www.realworldtech.com/haswell-tm/
• Lal Shimpi, Anand. Intel’s Haswell Architecture Analyzed: Building a New PC and a New Intel. October 5, 2012. www.anandtech.com/show/6355/intels-haswell-architecture
• Introducing Sandy Bridge. Presented at IDF 2010 by Bob Valentine.
• Sandy Bridge Spans Generations. Microprocessor Report. September 2010
Resources
Processor Core
• Fog, Agner. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering
• Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-026. April 2012
Transactional Memory
• Intel Transactional Synchronization Extensions. Presented at IDF 2012 by Ravi Rajwar, Martin Dixon
• Intel Architecture Instruction Set Extensions Programming Reference Manual. Order Number: 319433-012A. February 2012
• Gelas, J and Hamm, C. Making Sense of the Intel Haswell Transactional Synchronization eXtensions. September 15, 2012. www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions
Extra Slides
Current Locking Strategies
acquire_lock(mutex)
release_lock(mutex)
Scalability Issues
As core count increases, efficiency is drastically reduced!
Figure taken from “Making Sense of the Intel Haswell Transactional Synchronization eXtensions”.
Lock Elision
• Idea introduced by Ravi Rajwar and James R. Goodman in 2001
  • Remove locks, run code as a transaction
  • If there are conflicts, abort and rerun the code with locks intact
  • On success, commit the transaction’s writes to memory
• To other threads the lock still remains available
  • Reduces execution time if conflicts do not occur
• Guarantees correctness by using the transactional memory
• Have new instructions to implement Lock Elision
  • XACQUIRE: denotes start of lock elision section
  • XRELEASE: denotes end of lock elision section
• These options are added as prefixes to existing instructions
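The control flow of lock elision (try without the lock, fall back to the lock on a conflict) can be sketched as follows. This is a software stand-in, not the XACQUIRE/XRELEASE mechanism itself: `try_transactionally` is a hypothetical hook that plays the role of the hardware transaction and signals a conflict by raising `TxnAbort`.

```python
import threading


class TxnAbort(Exception):
    """Raised by the simulated transactional path when a conflict occurs."""


def run_elided(lock, critical_section, try_transactionally, max_attempts=1):
    """Try the section without the lock; on abort, rerun it with the lock."""
    for _ in range(max_attempts):
        try:
            return try_transactionally(critical_section), "elided"
        except TxnAbort:
            continue  # conflict detected: discard the attempt
    with lock:        # fallback path: the lock is really acquired
        return critical_section(), "locked"


# Two stand-in "hardware" behaviours for demonstration purposes.
def no_conflict(section):
    return section()


def always_conflict(section):
    raise TxnAbort()


mutex = threading.Lock()
print(run_elided(mutex, lambda: 42, no_conflict))      # (42, 'elided')
print(run_elided(mutex, lambda: 42, always_conflict))  # (42, 'locked')
```

The caller writes ordinary lock-protected code either way, which is why the change can live inside a locking library rather than in user code.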
Lock Elision
acquire_lock(mutex)
release_lock(mutex)
Changes can be made in library functions. The user does not have to adopt a new programming paradigm.
Performance Benefits
Intel says using TSX helps!
Software Transactional Memory has been researched, but the overhead in software negated performance benefits
Figure taken from “Intel Transactional Synchronization Extensions” Presentation. Intel Developers Forum, San Francisco, 2012