Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan

Larrabee

Eric Jogerst

Cortlandt Schoonover

Francis Tan

Larrabee

• Intel’s new approach to a GPU

• Considered to be a hybrid between a multi-core CPU and a GPU

• Combines functions of a multi-core CPU with the functions of a GPU

FETCH

Fetch

• Utilizes a hardware prefetcher
• Supports four threads of execution
– Separate register files for each thread
– Switches threads to cover cases where the compiler is unable to schedule code without stalls, or where the prefetcher has not yet received new instructions
– Inactive thread data is written to the core’s local L2 cache

PIPELINE ORGANIZATION

Pipeline

• Pipeline derived from the dual-issue Pentium processor, which has a 5-stage pipeline
– Short, inexpensive execution pipeline
• Pairing rules for primary and secondary instruction pipes are deterministic
– Allows compilers to perform offline analysis with a wide scope

Pipeline

• All instructions can be issued on the primary pipeline
– Minimizes the combinatorial problems for a compiler
• Secondary pipeline can execute a large subset of the x86 instruction set
– Small and cheap
– Power wasted by failing to dual-issue on every cycle is minimal
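Because the pairing rules are deterministic, a compiler can predict offline which adjacent instructions will dual-issue. The following is a minimal sketch of that idea. The opcode set `SECONDARY_OK`, the `Instr` type, and the hazard rule are hypothetical choices made for illustration, since the slides do not list Larrabee's actual pairing rules.

```python
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical set of opcodes the secondary pipe accepts (an assumption for
# illustration; the real pairing rules are not given in these slides).
SECONDARY_OK = {"load", "store", "add", "branch"}


class Instr(NamedTuple):
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...] = ()


def can_pair(first: Instr, second: Instr) -> bool:
    """True if `second` may issue on the secondary pipe alongside `first`."""
    if second.op not in SECONDARY_OK:
        return False
    # Assumed hazard rule: the pair must not share a destination, and the
    # second instruction must not read what the first one writes.
    if first.dst is not None and (
        first.dst in second.srcs or first.dst == second.dst
    ):
        return False
    return True


def schedule(program: List[Instr]) -> List[Tuple[Instr, ...]]:
    """Greedy in-order issue: each cycle issues one or two instructions."""
    cycles = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            cycles.append((program[i], program[i + 1]))
            i += 2
        else:
            cycles.append((program[i],))
            i += 1
    return cycles
```

Since every instruction can go down the primary pipe, the only decision in this sketch is whether the next instruction may join it on the secondary pipe, which keeps the search space small.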

Pipeline

• Each core has its own pipeline
– Based upon the 5-stage Pentium
– Dual-issues instructions
– In-order execution
• Pipeline is shared between threads
– Hardware can switch between threads that have instructions ready to execute
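Sharing one pipeline among several hardware threads can be sketched as a per-cycle selection among the threads that are ready to issue. The round-robin policy and the function name `pick_thread` are assumptions for illustration. The slides only say that the hardware switches between ready threads, not how it chooses among them.

```python
from typing import List, Optional


def pick_thread(ready: List[bool], last: int) -> Optional[int]:
    """Round-robin choice among hardware threads with a ready instruction.

    `ready[i]` says whether thread i can issue this cycle; `last` is the
    thread that issued most recently. Returns None if every thread is stalled.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None
```

With four threads, a stall in one thread (for example, while it waits on a cache miss) costs nothing as long as at least one other thread is ready.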

Pipeline

• Software-rendering pipeline is designed to minimize the number of locks and other synchronization events
• Graphics-rendering pipeline is written with high-level languages and tools
– Enables developers to add innovative rendering capabilities

SIMD ORGANIZATION

Vector Processor Unit

• 16-wide vector processor unit (VPU)
– Executes integer, single-precision float, and double-precision float instructions
– VPU and its registers are approximately one-third the area of the processor core
• Tradeoff
– Increased computational density
– Wider VPUs are harder to keep highly utilized
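The utilization side of this tradeoff can be sketched numerically: when the amount of independent work is not a multiple of the vector width, the last batch leaves lanes idle, and wider units leave more of them idle. The function `lane_utilization` and its simple batching model are assumptions for illustration, not a description of how Larrabee schedules work.

```python
import math


def lane_utilization(work_items: int, vpu_width: int) -> float:
    """Fraction of VPU lanes doing useful work when `work_items` independent
    elements are processed in batches of `vpu_width` lanes."""
    batches = math.ceil(work_items / vpu_width)
    return work_items / (batches * vpu_width)


# Example: 20 independent elements on VPUs of increasing width.
for width in (4, 16, 64):
    print(width, lane_utilization(20, width))
```

Under this model a 4-wide unit stays fully busy on 20 elements, while 16-wide and 64-wide units spend part of their lanes on padding.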

Vector Processor Unit

• VPU instructions can be predicated by a mask register
• Mask controls which parts of a vector register or memory location are written and which are left untouched
• Advantages
– Reduces branch misprediction penalties
– Gives instruction scheduler greater freedom
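Mask predication can be illustrated with a scalar emulation of a vector operation. In this sketch, both arms of a per-lane conditional are executed and complementary masks decide which lanes each arm may write, so no branch is needed. The names `masked_sub` and `abs_diff` are hypothetical and do not correspond to actual Larrabee instructions.

```python
from typing import List, Sequence


def masked_sub(dst: List[float], a: Sequence[float], b: Sequence[float],
               mask: Sequence[bool]) -> None:
    """Write a[i] - b[i] into dst[i] only where mask[i] is set.

    Lanes whose mask bit is clear are left untouched.
    """
    for i, enabled in enumerate(mask):
        if enabled:
            dst[i] = a[i] - b[i]


def abs_diff(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Branch-free |a - b| per lane: both arms run, masks pick the writes."""
    mask = [x >= y for x, y in zip(a, b)]
    out = [0.0] * len(a)
    masked_sub(out, a, b, mask)                    # lanes where a >= b
    masked_sub(out, b, a, [not m for m in mask])   # remaining lanes
    return out
```

A real VPU would apply the mask to all lanes in a single instruction; the loop here only models which lanes get written.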

Number of Cores

• Many-core processor
– Planned to have 24 to 48 cores

SYSTEM ON-CHIP COMPONENTS

System On-Chip Components

• x86 CPU cores
– Dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions
– Connected to the ring network, with a high-bandwidth connection to the adjacent L2 cache subset

System On-Chip Components

• L2 cache subsets
– High-bandwidth access to the adjacent CPU core
– Connected directly to the ring network
– Coherent cache; uses the ring network to check coherency when allocating new cache lines

System On-Chip Components

• Ring network nodes
– Simple bi-directional routers with a 512-bit data path in each direction (1024 bits in total across both directions)
– Organized in rings of 8-16 cores and other devices
– Interconnected with other rings
– All data moved between cores and fixed-function units passes through the ring network

System On-Chip Components

• Fixed-function logic components
– Provide rasterization, interpolation, and other commonly needed functions
– Directly connected to the ring network
– Will be spread among the cores to provide lower latency and load balancing on the ring network

System On-Chip Components

• Memory & I/O interface
– Provides and manages communication between the ring network and off-chip devices
– Manages initial routing and tasking of cores

MEMORY HIERARCHY

Memory Interface

ON-CHIP INTERCONNECT

On-Chip Interconnect

• Ring interconnect bus
• Similar to the interconnect of the Sony Cell processor

Ring Bus

Ring Bus Features

• Bi-directional
• 512 bits in each direction
• Presumably running at core speed
• Each element can accept a message from one direction on odd clock cycles and from the other direction on even clock cycles
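The ring behavior listed above can be sketched in a few lines. Two things here are assumptions made for illustration and are not stated in the slides: that a sender picks the direction with fewer hops, and that clockwise traffic is the one accepted on even clocks. The names `route` and `accepts` are hypothetical.

```python
from typing import Tuple


def route(src: int, dst: int, ring_size: int) -> Tuple[str, int]:
    """Pick the ring direction with fewer hops; returns (direction, hops).

    Assumption: the sender chooses the shorter way around the ring.
    """
    clockwise = (dst - src) % ring_size
    counter = (src - dst) % ring_size
    if clockwise <= counter:
        return ("cw", clockwise)
    return ("ccw", counter)


def accepts(direction: str, clock: int) -> bool:
    """Assumed convention: a node accepts clockwise traffic on even clocks
    and counter-clockwise traffic on odd clocks."""
    if direction == "cw":
        return clock % 2 == 0
    return clock % 2 == 1
```

Alternating directions by clock parity means a node never has to accept two messages in the same cycle, which keeps each router simple.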

Ring Bus Comparisons

• Compared to AMD’s R600/RV670 ring bus, it is half the bit-width
• The higher clock speed of Larrabee’s bus should make up for the difference in bandwidth
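The bandwidth argument is simple arithmetic: peak bandwidth is bus width times clock rate, so a bus with half the width needs twice the clock to break even. The sketch below works this out. The clock values in the example are placeholders chosen for round numbers, not published figures for either chip.

```python
def bus_bandwidth_gb_s(width_bits: int, clock_ghz: float) -> float:
    """Peak one-direction bandwidth in GB/s: bytes per clock times clock rate."""
    return (width_bits / 8) * clock_ghz


def break_even_clock_ghz(narrow_bits: int, wide_bits: int,
                         wide_clock_ghz: float) -> float:
    """Clock the narrower bus needs to match the wider bus's peak bandwidth."""
    return wide_clock_ghz * wide_bits / narrow_bits


# Hypothetical example: a 1024-bit bus at 0.75 GHz versus a 512-bit bus.
needed = break_even_clock_ghz(512, 1024, 0.75)
print(needed, bus_bandwidth_gb_s(512, needed), bus_bandwidth_gb_s(1024, 0.75))
```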

Ring Bus Tradeoff Analysis

Pros:
• Straightforward, not complex
• Able to deliver high bandwidth
• Great performance if memory clients need high bandwidth

Cons:
• Waste of chip area if most applications don’t need high memory bandwidth
• That area could be spent elsewhere to increase performance in a different way

MULTITHREADING ORGANIZATION

Multithreading Organization

• Superscalar
• In-order
• Four threads of execution
• Dual-issue (with a vector processing unit)

Comparison to Out-of-Order Execution

                     Out-of-order CPU    In-order CPU
# CPU cores:         2 out-of-order      10 in-order
Instruction issue:   4 per clock         2 per clock
VPU per core:        4-wide SSE          16-wide
L2 cache size:       4 MB                4 MB
Single-stream:       4 per clock         2 per clock
Vector throughput:   8 per clock         160 per clock
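The vector throughput row follows from the core count and VPU width rows, assuming each core issues one vector instruction per clock. A minimal check of that arithmetic (the function name `vector_throughput` is just for illustration):

```python
def vector_throughput(cores: int, vpu_width: int) -> int:
    """Peak vector operations per clock, assuming one VPU instruction
    per core per clock."""
    return cores * vpu_width


out_of_order = vector_throughput(cores=2, vpu_width=4)    # 2 cores x 4-wide SSE
in_order = vector_throughput(cores=10, vpu_width=16)      # 10 cores x 16-wide
print(out_of_order, in_order, in_order // out_of_order)
```

The in-order design gives up half the single-stream issue rate in exchange for a 20x gain in peak vector throughput.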

Larrabee Vector Processor

Scheduling Policy

• Software controlled
• More flexible than a typical GPU because scheduling is software controlled

Software Controlled Scheduling

Pros
• Flexible: can choose the scheduler to suit the application
• Worst case won’t be so bad (as compared to a hardware-encoded scheduling policy)

Cons
• Overhead of scheduler takes a bite out of performance
• Programmer overhead of selecting the correct scheduler
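Choosing a scheduler to suit the application can be sketched as two interchangeable policies behind the same interface. Everything here is hypothetical: the class names, the `push`/`pop` interface, and the example task names are illustrations, not part of any Larrabee software stack.

```python
import heapq
from collections import deque


class FifoScheduler:
    """Runs tasks in submission order and ignores priority."""

    def __init__(self):
        self._q = deque()

    def push(self, task, priority=0):
        self._q.append(task)

    def pop(self):
        return self._q.popleft()

    def __len__(self):
        return len(self._q)


class PriorityScheduler:
    """Runs the task with the lowest priority number first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker keeps insertion order for equal priorities

    def push(self, task, priority=0):
        heapq.heappush(self._heap, (priority, self._count, task))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def run(scheduler, tasks):
    """Drain (name, priority) pairs through the scheduler; return run order."""
    for name, priority in tasks:
        scheduler.push(name, priority)
    order = []
    while len(scheduler):
        order.append(scheduler.pop())
    return order
```

The same task list produces a different execution order depending on which scheduler object is passed to `run`, which is the flexibility listed under Pros. The extra bookkeeping in each `push` and `pop` is the overhead listed under Cons.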

Criticism

• NVIDIA
– “like a GPU from 2006”
– Unrealistic performance projections
– Motivated by interest to retain market share

Possible Market

• DreamWorks Animation
• Xbox / PlayStation
• Scientific research

Questions?