Haswell

Thomas Shull, Bhargava Reddy Gopi Reddy, Raghavendra Pradyumna Pothukuchi

RISC-y

• Find each instruction, i.e. decode its length.
• Each x86 instruction (macro-op) is chopped into µOps.
• Some macro-op combinations can be treated as one instruction and are packed together (macro-fusion).
  • CMP <–> JUMP IF; ± <–> TEST
• Some µOps are packed into one µOp and are later implicitly broken up (micro-fusion).
  • ADD [EBX] EAX -> MOV ECX [EBX]; ADD ECX EAX -> ADD [EBX] EAX

www.realworldtech.com/haswell-cpu/

Fetch and Decode

• Multicycle, power-hungry decode.
• µOps are cached.
  • 32 sets × 8 ways × 6 µOps per line
  • A 32B window (18 µOps at maximum) is inserted at once.
  • If a 32B window has more than 18 µOps, do not insert it.
  • Deliver at most 4 µOps on a “full hit”.
  • Double bandwidth (32B vs. 16B) on a hit.

Why? AVX!
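The µOp cache figures on this slide (32 sets, 8 ways, 6 µOps per line, at most 18 µOps per 32B window) imply a total capacity and a per-window line count. The sketch below only does that arithmetic; the function names are illustrative and not taken from any Intel document.

```python
# Geometry figures quoted on the slide.
SETS = 32
WAYS = 8
UOPS_PER_LINE = 6
MAX_UOPS_PER_WINDOW = 18   # limit for one 32-byte fetch window


def uop_cache_capacity(sets=SETS, ways=WAYS, uops_per_line=UOPS_PER_LINE):
    """Total µOps the cache can hold when every line is full."""
    return sets * ways * uops_per_line


def lines_needed(uop_count, uops_per_line=UOPS_PER_LINE):
    """Lines needed to hold uop_count µOps (ceiling division)."""
    return -(-uop_count // uops_per_line)


def can_cache_window(uop_count, limit=MAX_UOPS_PER_WINDOW):
    """A 32B window is inserted only if it decodes to at most `limit` µOps."""
    return uop_count <= limit


capacity = uop_cache_capacity()                         # 32 * 8 * 6
max_lines_per_window = lines_needed(MAX_UOPS_PER_WINDOW)
```

With these numbers the cache holds 1536 µOps, and a single 32B window can occupy up to 3 lines.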

Renaming and Oh Oh Oh!
• Renaming – map from logical registers to physical registers (PRF) and allocate resources.
• ROB is a placeholder.
• Break the fused µOps into simpler ops.


Scheduler
• 8 issue ports
  • 1 WB per port
• INT, FP, SIMD networks + MEM
  • More penalty for inter-network data forwarding.
• Register-register moves are folded by just changing the PRF map.
  • Extra pipeline stage for dereferencing links
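Folding a register-register move into the rename map can be illustrated with a toy renamer. This is a minimal sketch under simplifying assumptions: the class and method names are hypothetical, and freeing of physical registers (which real hardware must track, for example with reference counts) is left out.

```python
class RenameTable:
    """Toy register renamer: logical register name -> physical register index."""

    def __init__(self, logical_regs, num_physical):
        self.free = list(range(num_physical))          # free physical registers
        self.map = {r: self.free.pop(0) for r in logical_regs}
        self.issued = []                               # µOps sent to the scheduler

    def rename_op(self, opcode, dst, srcs):
        """Ordinary µOp: read source mappings, allocate a new physical dst."""
        phys_srcs = [self.map[s] for s in srcs]
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        self.issued.append((opcode, phys_dst, phys_srcs))

    def rename_move(self, dst, src):
        """Register-register move: alias dst to src's physical register.
        Nothing is sent to the scheduler."""
        self.map[dst] = self.map[src]


rt = RenameTable(["eax", "ebx", "ecx"], num_physical=8)
rt.rename_op("add", "eax", ["eax", "ebx"])   # executes on a port
rt.rename_move("ecx", "eax")                 # handled entirely at rename
```

After the move, `ecx` and `eax` point at the same physical register and only the `add` has been issued.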

Execution Units

[Figure: a 60-entry unified scheduler issues to eight ports, Port 0 through Port 7. Unit labels in the figure: Int, FMA, Vector, Branch, Div, Mem, Store.]

Did we forget something? Branch Predictor!!

• More entries in BTB (less per entry!)
  • Entries with fewer offset bits
• Use the space saved for global branch prediction
  • 2-level global predictor? 1-bit entries?
  • 14-17 cycles of misprediction penalty.
• 56-entry µOp buffer for identifying small loops
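The slide only speculates about a 2-level global predictor with 1-bit entries, so the following is a generic textbook-style sketch of that idea and not a description of Haswell's actual predictor. The class name, the 4-bit history length, and the table layout are all assumptions made for illustration.

```python
class GlobalPredictor:
    """Two-level predictor: a global history register indexes a table of 1-bit entries."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                              # last N outcomes, newest in bit 0
        self.table = [False] * (1 << history_bits)    # 1 bit each: last outcome seen

    def predict(self):
        return self.table[self.history]

    def update(self, taken):
        # Record the outcome for this history, then shift it into the history.
        self.table[self.history] = taken
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of branch outcomes and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


pattern = [True, True, False] * 10            # a short repeating loop-like pattern
predictor = GlobalPredictor(history_bits=4)
warmup_misses = count_mispredictions(predictor, pattern)
steady_misses = count_mispredictions(predictor, pattern)
```

Once the history register has seen the repeating pattern, each history value maps to a single outcome, so the second pass over the same pattern incurs no mispredictions.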

Big Picture: 14-stage pipeline


Memory Hierarchy – For Data

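[Figure: data-side memory hierarchy]
• Unified scheduler, Load Buffer, Store Buffer
• Port 2: 64-bit AGU; Port 3: 64-bit AGU; Port 4: Store Data; Port 7: Store AGU
• 32 KB L1 D Cache (8-way)
• L1 TLB: 4K pages – 64 entries; 2M/4M pages – 32 entries; 1G pages – 4 entries; 4-way
• L2 TLB: 1024 entries, shared, 8-way
• 256 KB L2 Cache (8-way)
• Data-path widths labeled in the figure: 32B, 2x32B, 64B
• L3/LLC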
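The TLB sizes in the figure (read here as 64 entries for 4K pages, 32 for 2M/4M pages, 4 for 1G pages in the L1 TLB, and 1024 entries in the L2 TLB) translate into how much memory can be mapped without a TLB miss. The sketch below is only that multiplication; treating every L2 TLB entry as a 4K page is an assumption for the example.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# (page size in bytes, entries) for the L1 data TLB, as read from the figure.
L1_DTLB = {
    "4K": (4 * KIB, 64),
    "2M": (2 * MIB, 32),
    "1G": (1 * GIB, 4),
}


def tlb_reach(page_size, entries):
    """Bytes of address space covered when every entry maps a distinct page."""
    return page_size * entries


l1_reach = {name: tlb_reach(size, entries) for name, (size, entries) in L1_DTLB.items()}

# Assumption for illustration: all 1024 L2 TLB entries hold 4K pages.
l2_reach_4k = tlb_reach(4 * KIB, 1024)
```

Under these readings the L1 TLB covers 256 KiB with 4K pages, 64 MiB with 2M pages, and 4 GiB with 1G pages, while the L2 TLB covers 4 MiB of 4K pages.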

L3 (Also Last Level Cache)
• Banked structure, one bank per core
• Shared and fully inclusive
• Separate tag arrays
  • One for data requests
  • One for prefetches and coherency requests
• Point of coherence
• Separate frequency domain from CPU
  • Helps to run CPU, GPU and LLC at different speeds as necessary

[Figure: ring layout with Core0 to Core3, one L3 bank per core, the System Agent, and the GPU]

The Ring
• Ring stops
  • Core/L3 bank (Cachebox) can send/receive two packets on the ring each cycle
    • Up direction
    • Down direction
  • GPU and System Agent can send only one per cycle
• Ring actually consists of 4 rings


Memory Controller
• 2 clock domains
  • DCLK – DDR command clock
  • QCLK – DDR data clock
• Requested 32B are returned first
• Maintains page table information and the corresponding requests
  • Page hits are given priority -> increases bandwidth
  • Reads are given priority
• Write Data Buffer to maintain writes
  • Write merging can happen in the Write Data Buffer
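The two priority rules above (reads first, page hits first) can be sketched as a sort key over pending requests. The slide does not say how the two rules rank against each other, so putting reads ahead of page hits, and breaking ties by age, is an assumption; `Request`, `pick_next`, and `open_rows` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int      # lower value = older request
    is_read: bool
    bank: int
    row: int          # DRAM page (row) the request targets


def pick_next(queue, open_rows):
    """Pick the next request: reads first, then page hits, then oldest.

    open_rows maps bank -> row currently open in that bank.
    """
    def priority(req):
        page_hit = open_rows.get(req.bank) == req.row
        # False sorts before True, so preferred requests get the smaller key.
        return (not req.is_read, not page_hit, req.arrival)

    return min(queue, key=priority)


queue = [
    Request(arrival=0, is_read=False, bank=0, row=5),   # old write, page hit
    Request(arrival=1, is_read=True, bank=0, row=7),    # read, page miss
    Request(arrival=2, is_read=True, bank=0, row=5),    # read, page hit
]
chosen = pick_next(queue, open_rows={0: 5})
```

Here the youngest request is chosen because it is both a read and a hit on the open page.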

System Agent
• Contains
  • Memory Controller
  • PCI Express Controller
  • DMI Controller
  • Display Engine
  • Power Control Unit
  • I/O

[Figure: ring with Core0 to Core3, their L3 banks, and the GPU; the System Agent block shows the Display Engine, Memory Controller, PCU, PCIe, and DMI]

Multithreading
• Use atomic operations to control access to items used by multiple threads
  • Obtain and release locks for critical sections
• Intel currently supports making the following operations atomic by adding a “LOCK” prefix:
  • ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG
• MOV is also atomic on aligned accesses (LEAL only computes an address and does not access memory)
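Python cannot emit LOCK-prefixed instructions, so the sketch below shows the other option named on this slide: obtaining and releasing a lock around a critical section. The function names and thread counts are illustrative; on x86 the same increment could instead be a single LOCK-prefixed instruction such as `LOCK XADD`.

```python
import threading

counter = 0
mutex = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, one locked critical section each."""
    global counter
    for _ in range(n):
        with mutex:          # acquire the lock; released when the block exits
            counter += 1     # critical section: read-modify-write of shared state


def run(num_threads=4, increments=10_000):
    """Start num_threads workers and return the final counter value."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=add_many, args=(increments,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every increment happens while holding `mutex`, `run(4, 1000)` always returns 4000 regardless of how the threads interleave.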

Transactional Memory
• Main idea: try to run critical sections without locks and monitor for conflicts
• Use read and write sets to log memory accesses in transactional sections
• If conflicts occur, abort and revert register state to the beginning of the transaction
• If successful, commit the changes to memory so they are visible to other threads
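The read-set and write-set bookkeeping above can be modeled in a few lines. This is a single-threaded software model with hypothetical names, not how the hardware is built: conflicts are checked once at commit against a list of addresses written by others, whereas hardware would detect them as they occur.

```python
class Transaction:
    """Toy transaction over a shared dict: address -> value."""

    def __init__(self, memory):
        self.memory = memory
        self.read_set = set()
        self.write_buffer = {}        # speculative writes; its keys are the write set

    def read(self, addr):
        if addr in self.write_buffer:         # see our own speculative write
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value       # not visible to others until commit

    def conflicts_with(self, other_writes):
        """Conflict if another thread wrote anything we read or wrote."""
        touched = self.read_set | set(self.write_buffer)
        return bool(touched & set(other_writes))

    def commit(self, other_writes=()):
        """Publish buffered writes, or discard them on conflict. True on commit."""
        if self.conflicts_with(other_writes):
            self.write_buffer.clear()
            self.read_set.clear()
            return False
        self.memory.update(self.write_buffer)
        return True


memory = {"x": 1, "y": 2}

t1 = Transaction(memory)
t1.write("x", t1.read("x") + 1)
committed = t1.commit()                       # no conflicting writes

t2 = Transaction(memory)
t2.write("y", t2.read("x"))
aborted = not t2.commit(other_writes=["x"])   # someone else wrote x, which t2 read
```

The first transaction commits and its write becomes visible; the second aborts and leaves memory untouched.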

Restricted Transactional Memory
• Haswell is the first Intel mainstream processor to include Transactional Memory
  • Added Transactional Synchronization Extensions (TSX)
• New instructions for Restricted Transactional Memory
  • XBEGIN – indicates start of transaction
  • XEND – indicates end of transaction
  • XABORT – used for testing; aborts transaction
  • XTEST – indicates whether executing in a transactional region
• Must have pointer to code that runs upon an abort
• Requires code to be rewritten using transactional sections

Integrated Graphics
• Supports 3 simultaneous displays, HDMI
• Scalable architecture: different versions of the processor (GT1, GT2, GT3) offer different numbers of Execution Units (EUs), among other upgrades
• Multiple video encoding and decoding support in hardware
  • Supported encodings include MPEG4, MPEG2, SVC
• Supports OpenCL 1.1, OpenGL 4.0

Figure taken from “Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell” Presentation. Intel Developers Forum, San Francisco, 2012

Power Management
• Three voltage domains
• Allows the screen to be updated while the processor is turned off
• Voltage regulators are on chip
• Power gating
• New power saving states
  • S0ix idle states
  • Recommends power levels and response times for vendors
  • Uses 20x less power than previous S0 state

Figure taken from “Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations” Presentation. Intel Developers Forum, San Francisco, 2012

Recap:
• 14-stage pipeline
• 4 cores, SMT machine
• In-order issue, out-of-order execution, in-order commit
• Wider data paths and extra store AGU to provide more bandwidth in AVX2 computations
• LLC/Ring is the point of coherence and distributed arbitration of requests
• Intel TSX
  • Added support for Restricted Transactional Memory
• Integrated graphics and improved power management
  • Power efficiency is a huge emphasis

Resources
General Information
• Technology Insight: Intel Next Generation Microarchitecture Code Name Haswell. Presented at IDF 2012 by Tom Piazza, Hong Jiang, Per Hammarlund, Ronak Singhal
• Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations. Presented at IDF 2012 by Robert Chappell, Bret Toll, Ronak Singhal
• Kanter, David. Intel’s Haswell CPU Microarchitecture. November 13, 2012. www.realworldtech.com/haswell-cpu/
• Kanter, David. Analysis of Haswell’s Transactional Memory. February 15, 2012. www.realworldtech.com/haswell-tm/
• Shimpi, Anand Lal. Intel’s Haswell Architecture Analyzed: Building a New PC and a New Intel. October 5, 2012. www.anandtech.com/show/6355/intels-haswell-architecture
• Introducing Sandy Bridge. Presented at IDF 2010 by Bob Valentine.
• Sandy Bridge Spans Generations. Microprocessor Report. September 2010


Resources
Processor Core
• Fog, Agner. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Copenhagen University College of Engineering
• Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-026. April 2012

Transactional Memory
• Intel Transactional Synchronization Extensions. Presented at IDF 2012 by Ravi Rajwar, Martin Dixon
• Intel Architecture Instruction Set Extensions Programming Reference Manual. Order Number: 319433-012A. February 2012
• Gelas, J and Hamm, C. Making Sense of the Intel Haswell Transactional Synchronization eXtensions. September 15, 2012. www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions


Extra Slides

Current Locking Strategies

acquire_lock(mutex)

release_lock(mutex)

Scalability Issues

As core count increases, efficiency is drastically reduced!

Figure taken from “Making Sense of the Intel Haswell Transactional Synchronization eXtensions”.

Lock Elision
• Idea introduced by Ravi Rajwar and James R. Goodman in 2001
  • Remove locks, run code as a transaction
  • If there are conflicts, abort and rerun code with locks intact
  • On success, commit the transaction’s writes to memory
• To other threads the lock still remains available
  • Reduces execution time if conflicts do not occur
• Guarantees correctness by using the transactional memory
• Have new instructions to implement Lock Elision
  • XACQUIRE: denotes start of lock elision section
  • XRELEASE: denotes end of lock elision section
• These options are added as prefixes to existing instructions
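The elide-then-fall-back flow described on this slide can be modeled in software. This is a single-threaded illustration with hypothetical names: `conflict_detected` is a stand-in callback for the hardware's conflict detection, and the private copy stands in for speculative state. It does not use the XACQUIRE/XRELEASE prefixes themselves.

```python
import threading


class ElidedLock:
    """Toy model of lock elision: try without the lock, fall back to it on conflict."""

    def __init__(self):
        self._lock = threading.Lock()
        self.elided = 0       # sections that committed without taking the lock
        self.fallbacks = 0    # sections rerun with the lock held

    def run(self, critical_section, memory, conflict_detected):
        # Speculative attempt: work on a private copy and leave the lock free.
        snapshot = dict(memory)
        critical_section(snapshot)
        if not conflict_detected():
            memory.update(snapshot)      # commit the transaction's writes
            self.elided += 1
            return
        # Abort: discard the speculative copy and rerun with the lock intact.
        with self._lock:
            critical_section(memory)
            self.fallbacks += 1


def increment(m):
    m["n"] += 1


shared = {"n": 0}
lock = ElidedLock()
lock.run(increment, shared, conflict_detected=lambda: False)   # elided path
lock.run(increment, shared, conflict_detected=lambda: True)    # fallback path
```

Both calls apply the increment exactly once; only the second one actually takes the lock.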

Lock Elision

acquire_lock(mutex)

release_lock(mutex)

Changes can be made in library functions. User does not have to adopt new programming paradigm

Performance Benefits

Intel says using TSX helps!

Software Transactional Memory has been researched, but the overhead in software negated performance benefits

Figure taken from “Intel Transactional Synchronization Extensions” Presentation. Intel Developers Forum, San Francisco, 2012
