COSC 6385 Computer Architecture - Data Level Parallelism ...gabriel/courses/cosc6385_s20/CA_16_DLP_3.pdf · Inter-Processor Ring Network • Bi-directional ring network • 512 bits-wide

COSC 6385

Computer Architecture

- Data Level Parallelism (III)

The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

Edgar Gabriel

Fall 2020

References

• Intel Larrabee:

[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15.
http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

• IBM Cell processor:

[2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519.
http://www.research.ibm.com/journal/rd/515/johns.pdf

[3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, vol. 26, no. 3, pp. 10-23.
http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf


Larrabee Motivation

• Comparison of two architectures with the same number of transistors
– Half the single-stream performance for the simplified core (2 vs. 4 per clock)
– 20x increase in peak vector throughput for multi-stream executions (160 vs. 8 per clock)

                     2 out-of-order cores   10 in-order cores
Instruction issue    4                      2
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock
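The ratios between the two designs can be recomputed from the table's own figures. The following is a minimal sketch; the dictionary layout and variable names are illustrative and not taken from [1].

```python
# Figures copied from the comparison table above; the layout is illustrative.
designs = {
    "2 out-of-order cores": {"cores": 2, "vpu_width": 4,
                             "single_stream": 4, "vector_throughput": 8},
    "10 in-order cores": {"cores": 10, "vpu_width": 16,
                          "single_stream": 2, "vector_throughput": 160},
}

# Peak vector throughput per clock is cores x VPU width.
for name, d in designs.items():
    assert d["cores"] * d["vpu_width"] == d["vector_throughput"], name

ooo = designs["2 out-of-order cores"]
ino = designs["10 in-order cores"]

# Ratios of the in-order design relative to the out-of-order design.
single_stream_ratio = ino["single_stream"] / ooo["single_stream"]
vector_ratio = ino["vector_throughput"] / ooo["vector_throughput"]

print(f"single-stream ratio:     {single_stream_ratio}")
print(f"vector throughput ratio: {vector_ratio}")
```

With the table's numbers, the in-order design delivers half the single-stream rate and 20 times the peak vector throughput per clock.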

Larrabee Overview

• Many-core visual computing architecture

• Based on x86 CPU cores

– Extended version of the regular x86 instruction set

– Supports subroutines and page faulting

• Number of x86 cores can vary depending on the

implementation and processor version

• Fixed functional units for texture filtering

– Other graphical operations such as rasterization or post-

shader blending done in software


Larrabee Overview (II)

Image Source: [1]

Overview of a Larrabee Core (I)

Image Source: [1]


Overview of a Larrabee Core (II)

• x86 core derived from the Pentium processor
– No out-of-order execution
• Standard Pentium instruction set with the addition of
– 64-bit instructions
– Instructions for pre-fetching data into the L1 and L2 caches
– Support for 4 simultaneous threads, with separate registers for each thread
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
– Coherent L2 cache across all cores

Vector Processing Unit in Larrabee

• 16-wide VPU executing integer, single-precision and double-precision floating-point operations
• VPU supports gather-scatter operations
– The 16 elements can be loaded from, or stored to, up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)
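Gather-scatter and mask-based predication can be modeled in a few lines of Python. This is only a behavioural sketch on plain lists: the helper names (`gather`, `scatter`, `masked_select`) are hypothetical and do not correspond to Larrabee instructions or intrinsics.

```python
VLEN = 16  # vector width of the VPU, as stated on the slide


def gather(memory, indices):
    """Load one element per lane from up to VLEN different addresses."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values, mask=None):
    """Store one element per lane; lanes whose mask bit is False are skipped."""
    if mask is None:
        mask = [True] * len(indices)
    for i, v, m in zip(indices, values, mask):
        if m:
            memory[i] = v


def masked_select(mask, then_vals, else_vals):
    """Predicated if-then-else: both branches exist, the mask picks per lane."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


memory = list(range(100))
indices = [(7 * lane) % 100 for lane in range(VLEN)]  # 16 distinct addresses
x = gather(memory, indices)

# Scalar loop body "if x is even: y = 2 * x, else: y = -x", applied to all lanes.
mask = [v % 2 == 0 for v in x]
y = masked_select(mask, [2 * v for v in x], [-v for v in x])
scatter(memory, indices, y)
```

Both branches are evaluated for every lane and the mask decides which result is kept, which is how an if-then-else statement maps onto a vector unit without branching.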


Inter-Processor Ring Network

• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network
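A minimal sketch of what deciding the route before injection can look like on a bi-directional ring. The shortest-direction policy, the function name `route` and the ring size of 16 agents used in the example are assumptions for illustration; the slide does not state which policy Larrabee uses.

```python
RING_BITS_PER_DIRECTION = 512  # from the slide
BYTES_PER_CLOCK_PER_DIRECTION = RING_BITS_PER_DIRECTION // 8  # 64 bytes


def route(src, dst, num_agents):
    """Choose direction and hop count at the source, before injection.

    Assumed policy: take the direction with fewer hops, clockwise on a tie.
    """
    if src == dst:
        return ("none", 0)
    cw = (dst - src) % num_agents   # hops when travelling clockwise
    ccw = (src - dst) % num_agents  # hops when travelling counter-clockwise
    return ("cw", cw) if cw <= ccw else ("ccw", ccw)


# Example on a hypothetical ring with 16 agents.
for src, dst in [(0, 3), (0, 13), (2, 10)]:
    direction, hops = route(src, dst, 16)
    print(f"{src} -> {dst}: {direction}, {hops} hops")
```

Because the direction is fixed at the source, intermediate agents only forward the message and never have to make a routing decision themselves.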

Larrabee Programming Models

• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
– API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
– Intel Threading Building Blocks
– Larrabee-specific OpenMP directives
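The idea of creating threads and binding each one to a core can be sketched with Python's standard library. This is not the Larrabee API: the slide does not name the affinity extension, so `os.sched_setaffinity` (available on Linux only) serves as a stand-in, and the helper names are hypothetical. Pinning is best-effort and silently skipped where unsupported.

```python
import os
import threading


def pin_to_cpu(cpu):
    """Best-effort pinning of the caller to one CPU; returns True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # platform without an affinity interface
    try:
        os.sched_setaffinity(0, {cpu})  # 0 selects the caller
        return True
    except OSError:
        return False


def worker(cpu, data, out, slot):
    pin_to_cpu(cpu)
    out[slot] = sum(v * v for v in data)


def parallel_sum_of_squares(values, num_threads=4):
    """Split the work over threads, each bound to one of the allowed CPUs."""
    try:
        cpus = sorted(os.sched_getaffinity(0))
    except (AttributeError, OSError):
        cpus = [0]
    out = [0] * num_threads
    threads = []
    for t in range(num_threads):
        chunk = values[t::num_threads]  # interleaved work distribution
        th = threading.Thread(target=worker,
                              args=(cpus[t % len(cpus)], chunk, out, t))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    return sum(out)


print(parallel_sum_of_squares(list(range(10))))
```

On a core with 4 hardware threads, such as the Larrabee core, affinity lets the programmer decide which software threads share a core and therefore its L1 cache.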


Larrabee Performance

Image Source: [1]

Intel Xeon Phi Processor

• First generation of the Intel MIC (Many Integrated Core) architecture
• 60 cores / 1.0 GHz
• 512-bit wide vector engine
• 32 KB L1 I/D cache
• 512 KB L2 cache (per core)
• Up to 1 TFLOPS double-precision performance
• 8 GB GDDR5 memory and 320 GB/s bandwidth
• Standard PCIe x16 form factor
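The peak performance figure can be approximated from the other numbers on the slide. The sketch below assumes that every vector lane completes one fused multiply-add (2 floating-point operations) per cycle; that factor is an assumption of this example and is not stated on the slide.

```python
cores = 60
clock_ghz = 1.0
vector_bits = 512

dp_lanes = vector_bits // 64  # 8 double-precision elements per vector
sp_lanes = vector_bits // 32  # 16 single-precision elements per vector
flops_per_lane_per_cycle = 2  # assumption: one fused multiply-add per cycle

peak_dp_gflops = cores * clock_ghz * dp_lanes * flops_per_lane_per_cycle
peak_sp_gflops = cores * clock_ghz * sp_lanes * flops_per_lane_per_cycle

# Memory bandwidth available per double-precision operation at peak.
bandwidth_gb_s = 320
bytes_per_dp_flop = bandwidth_gb_s / peak_dp_gflops

print(f"peak DP: {peak_dp_gflops} GFLOPS")
print(f"peak SP: {peak_sp_gflops} GFLOPS")
print(f"bytes per DP flop: {bytes_per_dp_flop:.3f}")
```

Under these assumptions the double-precision peak is 960 GFLOPS, in line with the "up to 1 TFLOPS" figure, and only about a third of a byte of memory bandwidth is available per operation.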


IBM Cell Overview (I)

• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
– E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
– IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
– one (or more) general-purpose PowerPC Processor Elements (PPEs) and
– one or more Synergistic Processor Elements (SPEs)


Cell Architecture block diagram

Image Source: [2]

• Two generations available so far:

– Cell BE:

• 204.8 GFLOPS single precision peak performance

• 14.6 GFLOPS double precision peak performance

– PowerXCell 8i (2008):

• 204.8 GFLOPS single precision peak performance

• 102.4 GFLOPS double precision peak performance

– Both have 1 PPE and 8 SPEs
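The peak numbers above can be reproduced with simple arithmetic. The sketch relies on assumptions that are not stated on the slide: a 3.2 GHz clock, one fused multiply-add (2 operations) per lane, and a first-generation double-precision unit that can start a new instruction only every 7 cycles. Only the SPEs are counted.

```python
SPES = 8             # from the slide
REGISTER_BITS = 128  # SPE register width
CLOCK_GHZ = 3.2      # assumption: not stated on the slide
FMA_FLOPS = 2        # assumption: fused multiply-add counts as 2 operations


def peak_gflops(element_bits, issue_interval_cycles=1):
    """Peak GFLOPS of all SPEs for a given element width.

    issue_interval_cycles models a unit that can start a new SIMD
    instruction only every n-th cycle (assumption of this sketch).
    """
    lanes = REGISTER_BITS // element_bits
    return SPES * CLOCK_GHZ * lanes * FMA_FLOPS / issue_interval_cycles


sp_peak = peak_gflops(32)                                 # both generations
dp_peak_powerxcell = peak_gflops(64)                      # PowerXCell 8i
dp_peak_cell_be = peak_gflops(64, issue_interval_cycles=7)  # Cell BE

print(f"single precision:         {sp_peak:.1f} GFLOPS")
print(f"double precision (8i):    {dp_peak_powerxcell:.1f} GFLOPS")
print(f"double precision (BE):    {dp_peak_cell_be:.1f} GFLOPS")
```

Under these assumptions the three results match the 204.8, 102.4 and 14.6 GFLOPS listed above.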


General Purpose Processor (PPE)

• Based on the IBM PowerPC processor
– Supports multiple simultaneous operating environments (virtualization)
– E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
• Performs management and application control functions

Synergistic Processor Element (SPE)

• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from the local storage
– Current versions of the Cell processors: 256 KB local storage
• The local storage is connected to the main memory through a Memory Flow Controller (MFC)
– MFC moves data between main memory and local storage or between two SPEs
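Because an SPE can only compute on data in its local storage, a kernel has to stage data in and out explicitly. The sketch below models this with Python lists. The functions `dma_get` and `dma_put` are hypothetical stand-ins for MFC transfers, and the 16 KB buffer size and 4-byte element size are illustrative choices.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local storage size from the slide
ELEMENT_BYTES = 4               # assumption: 32-bit elements


def dma_get(main_memory, start, count):
    """Stand-in for an MFC transfer from main memory to local storage."""
    return main_memory[start:start + count]  # slicing creates a copy


def dma_put(main_memory, start, block):
    """Stand-in for an MFC transfer from local storage to main memory."""
    main_memory[start:start + len(block)] = block


def spe_scale(main_memory, factor, buffer_bytes=16 * 1024):
    """Scale a large array chunk by chunk through a small local buffer."""
    if buffer_bytes > LOCAL_STORE_BYTES:
        raise ValueError("buffer does not fit into the local storage")
    chunk = buffer_bytes // ELEMENT_BYTES
    transfers = 0
    for start in range(0, len(main_memory), chunk):
        local = dma_get(main_memory, start, chunk)  # stage in
        local = [factor * v for v in local]         # compute locally
        dma_put(main_memory, start, local)          # stage out
        transfers += 1
    return transfers


data = list(range(10000))
num_transfers = spe_scale(data, 2)
print(num_transfers, data[:4], data[-1])
```

The local storage also has to hold the program code, so real kernels use buffers much smaller than 256 KB, often two of them so that one transfer can overlap with computation on the other.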


MFC commands

Image Source: [2]

Synergistic Processor Element (SPE) (II)

• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
– Sixteen 8-bit integers or
– Eight 16-bit integers or
– Four 32-bit integers or single-precision floating-point numbers or
– Two 64-bit integers or double-precision floating-point numbers
• Most instructions supported by the synergistic processor unit operate on all elements of a register -> SIMD
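The different ways of partitioning a 128-bit register can be illustrated with Python's `struct` module, which reinterprets the same 16 bytes under different element widths. The function `simd_add_u32` is a hypothetical helper that mimics a lane-wise 32-bit add; it is not an SPE instruction name. Big-endian byte order is used here as an illustrative choice.

```python
import struct

register = bytes(range(16))  # one 128-bit register as 16 raw bytes

# The same 16 bytes viewed with the element widths listed above.
views = {
    "16 x 8-bit int": struct.unpack(">16B", register),
    "8 x 16-bit int": struct.unpack(">8H", register),
    "4 x 32-bit int": struct.unpack(">4I", register),
    "4 x float32": struct.unpack(">4f", register),
    "2 x 64-bit int": struct.unpack(">2Q", register),
    "2 x float64": struct.unpack(">2d", register),
}


def simd_add_u32(a, b):
    """Add two registers lane by lane as four wrapping 32-bit integers."""
    lanes_a = struct.unpack(">4I", a)
    lanes_b = struct.unpack(">4I", b)
    total = [(x + y) & 0xFFFFFFFF for x, y in zip(lanes_a, lanes_b)]
    return struct.pack(">4I", *total)


a = struct.pack(">4I", 1, 2, 3, 0xFFFFFFFF)
b = struct.pack(">4I", 10, 20, 30, 1)
result = struct.unpack(">4I", simd_add_u32(a, b))
print(result)
```

A single instruction processes all four lanes, and an overflow in one lane (the last one in this example) does not carry into its neighbour.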


Simplified representation of a current Cell processor

Image Source: [3]

Element Interconnect Bus

• PPE and SPEs communicate through the Element Interconnect Bus (EIB)
– Contains a shared command bus
• Sets up end-to-end transactions
• Used for coherence protocols
– Point-to-point data interconnect
• Four 16-byte-wide rings, two used for clockwise data transfers, two for counter-clockwise data transfers
• Each ring transfers 128-byte packets (= cache block size of an SPE)
• Communication costs between two SPEs can vary between 1 hop and 6 hops
– Overall bandwidth: 204.8 GB/s
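The EIB figures can be related to each other with a short calculation. The ring width, packet size and hop range come from the slide. The 1.6 GHz bus clock, the limit of one new 128-byte transfer per bus cycle on the shared command bus, and the count of 12 elements on the ring are assumptions of this sketch, chosen because they reproduce the numbers above.

```python
RING_WIDTH_BYTES = 16  # from the slide
PACKET_BYTES = 128     # from the slide
NUM_RINGS = 4          # from the slide

BUS_CLOCK_GHZ = 1.6    # assumption: not stated on the slide
RING_ELEMENTS = 12     # assumption: number of units attached to the ring

# A 128-byte packet occupies a 16-byte-wide ring for 8 bus cycles.
cycles_per_packet = PACKET_BYTES // RING_WIDTH_BYTES

# Assumption: the command bus starts at most one 128-byte transfer per cycle,
# which bounds the sustained data bandwidth.
peak_bandwidth_gb_s = PACKET_BYTES * BUS_CLOCK_GHZ

# If a transfer always takes the shorter direction, it never travels more
# than halfway around the ring.
max_hops = RING_ELEMENTS // 2

print(f"cycles per packet: {cycles_per_packet}")
print(f"peak bandwidth:    {peak_bandwidth_gb_s:.1f} GB/s")
print(f"maximum hops:      {max_hops}")
```

Under these assumptions the calculation yields 8 cycles per packet, 204.8 GB/s and at most 6 hops.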


Comparison IBM Cell and Intel Larrabee

• Both use a large number of small and simple cores
• Both use a high-bandwidth ring bus to communicate between the cores
• Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
• IBM Cell requires data to be moved explicitly to the ‘local store’, while Larrabee can address any memory area
– Programs for the Cell have to be written taking the limited amount of memory available to an SPE into account
