Bringing Heterogeneous Multiprocessors Into the Mainstream
July 27, 2009
Brian Flachs, Cell/B.E. Architect
Godfried Goldrian, Peter Hofstee, Jörg-Stephan Vogt, IBM Systems and Technology Group
Page 2
Microprocessor Trends

[Chart: Log(Performance) over time (2005, 2015(?), 2025(??), 2035(???)), with successive curves for Single Thread, Multi-Core, Hybrid, and Special Purpose designs, each labeled "More active transistors, higher frequency".]
Page 3
Heterogeneity
Spreading tasks within a multiprocessor; the hope is to divide tasks according to their characteristics:
Dynamic branch prediction valuable vs. static prediction + conditional move
Caches effective vs. memory latency tolerance
Tight data dependencies vs. control flow/address stream that is data independent
Design processors differently for efficiency and for accelerators:
Control plane:
– High voltage, small register file, short pipeline, low frequency, with dynamic branch prediction and large caches
Data plane:
– Low voltage, large register file, long pipeline, high frequency, with memory latency tolerance
Page 4
SPE Highlights
User-mode architecture: no need to run the OS; no translation/protection within the SPU; DMA uses the full Power Architecture protection/translation
Not just a coprocessor: has its own PC; RISC-like organization; 32-bit fixed-width instructions; dual issue, 11-FO4 design
Broad set of operations (8/16/32/64-bit); VMX-like SIMD dataflow; graphics SP-float, IEEE DP-float
Large unified register file: 128 entries x 128 bits (integer & FP); deep unrolling to cover unit latencies
256 KB local store; flexible DMA engine (see the sketch below)
Improves effective memory bandwidth by improving latency tolerance and the utilization of moved data
Vector load/store with scatter/gather
14.5 mm2 (90 nm SOI)
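A minimal sketch of how an SPU program uses the DMA engine to pull data into local store, assuming the Cell SDK's spu_mfcio.h; the effective address, size, and tag choice are illustrative.

    #include <spu_mfcio.h>

    #define TAG 0   /* DMA tag group (0..31); choice is illustrative */

    /* Pull "size" bytes from main memory (effective address "ea") into
       a local-store buffer, then block until the transfer completes. */
    void fetch(volatile void *ls_buf, unsigned long long ea, unsigned int size)
    {
        /* Addresses and size must be at least 16-byte aligned;
           128-byte alignment gives the best bandwidth. */
        mfc_get(ls_buf, ea, size, TAG, 0, 0);

        /* Wait for all outstanding DMAs in this tag group. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }

In practice a program issues several such GETs with different tags and overlaps computation with the transfers; that overlap is the latency tolerance referred to above.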
Page 5
Cell Broadband Engine Microprocessor Highlights
SMP on a chip:
1 PowerPC core – L1 (32 kB + 32 kB) – L2 (512 kB)
8 SPUs – 256 kB local store each – 4-way single-precision FP
High-bandwidth memory support: 25.6 GB/s (data), dual Rambus XDR DRAM at 3.2 Gbit/s
Two configurable I/O interfaces at 5 Gbit/s: up to 35 GB/s out, up to 25 GB/s in; coherent interface or I/O
Page 6
Memory Managing Processor vs. Traditional General Purpose Processor

[Figure: IBM, AMD, and Intel general-purpose processors compared with the Cell/B.E.]
Page 7
The Making of Roadrunner – Key Building Blocks

Hybrid systems: hybrid and heterogeneous hardware, built around BladeCenter and industry-standard IB-DDR
New hardware: the TriBlade (LS21 + LS21 expansion + two QS22 blades); systems integration; bring-up, test, and benchmarking
New software: SDK 3.1 (ALF, DaCS, and other libraries), TriBlade Linux, firmware, device drivers; open source
A new programming model extended from standard cluster computing (innovation derived through close collaboration with Los Alamos National Lab)
Scale: 17 Connected Units – 278 racks
Page 8
Efficiency Benefits
More efficient use of memory bandwidth
More performance per area
More performance per watt
Fewer modules, boards, racks, etc.
Larger portion of the interconnect on chip
More computation per soft fail / checkpoint
Page 9
System Level Efficiency
Power efficiency is an important problem
Key areas of attack:
The most efficient ops/watt operating point
– Voltage/frequency scaling
– Match the critical functional path to the low-speed Vmin
– Low-FO4 design maximizes performance per unit of area and leakage
In-system voltage selection – reduce the voltage
Lower-temperature operating point
– Improves circuit performance, reduces leakage
Page 10
Efficient Operating Point
[Graph not reproduced: on the left side, voltage is fixed; on the right side, voltage increases to support the operating frequency.]
We want to optimize the different processor types separately.
Page 11
In-System Voltage Determination
Power systems have margin:
Voltage is time variant; voltage differs from the voltage set point
The chip has a voltage guard band: noise, end-of-life, system-to-tester correlation margin
We don't want to pay for the sum of these margins. Instead: build the card including the VRM and processor, set the desired operating frequency, start at a high voltage, and successively reduce the voltage until a challenging workload stops running; then add the necessary guard band (see the sketch below).
This saves 16 W per processor.
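A hypothetical sketch of that search loop; set_vrm_voltage() and run_stress_workload() stand in for platform-specific hooks the deck does not name.

    #include <stdbool.h>

    extern void set_vrm_voltage(double volts);   /* hypothetical VRM hook */
    extern bool run_stress_workload(void);       /* true if the workload passes */

    /* Lower the supply in small steps until the challenging workload
       first fails, then settle at the last passing voltage plus guard band. */
    double find_operating_voltage(double v_start, double v_step, double guard)
    {
        double v = v_start;
        set_vrm_voltage(v);
        while (run_stress_workload()) {
            v -= v_step;
            set_vrm_voltage(v);
        }
        /* v just failed, so the last passing point was v + v_step. */
        set_vrm_voltage(v + v_step + guard);
        return v + v_step + guard;
    }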
Page 12
QPACE Processor Node Card

[Board photo with labeled components: DDR2 memory (1 Gbit each), IBM PowerXCell(TM) 8i, PMC Sierra PHYs, Xilinx Virtex 5 FPGA, three VRMs. Dimensions: 330 mm x 155 mm.]
Page 13
QPACE Processor Node Cooling

TIMs (thermal interface materials) for the memory modules, the FPGA, the PHYs, and the VRMs.
A height-adjustable heat path compensates for tolerances; TIM is applied on the processor module and on the small side of the heat path.
A TIM area forms the thermal interface between the processor node and the cold plate, through which the heat flows.
Page 14
QPACE Cold Plate

[Diagram: water flows in and out of a cold plate that serves 32 processor nodes; 16 processor nodes and 1 root card are mounted on each side of the cold plate.]
Page 15
QPACE Rack
The QPACE project is a research collaboration between IBM Development and European universities and research institutes, with the goal of building a prototype of a Cell processor-based supercomputer.
The major part of this project is funded by the German Research Foundation (DFG – Deutsche Forschungsgemeinschaft) as part of a Collaborative Research Center (SFB – Sonderforschungsbereich TR55).
Page 16
Green 500

Water cooling: the temperature drop from processor junction to warm water is <30°C, so no chiller is required.
Estimated power efficiency of QPACE: rack power 30 kW, Linpack 18 TFLOPS => 600 MFLOPS/W.
Page 17
Future Compilation Strategy for Accelerators
Idle accelerators are not efficient.
Programmer effort can limit the utilization of accelerators, which in turn limits the specialization of, and investment in, accelerators.
We want to support standard languages: C, C++, Fortran, OpenMP, OpenCL.
We want the approach to be mostly transparent while honoring programmer directives; it is OK for programmers to sacrifice portability for better results in high-value codes.
Assume each architecture/microarchitecture benefits from, or requires, different binaries: instruction scheduling, execution latencies, special instructions and operations.
Compile everything for all cores; analyze the output for core-type assignment metrics; provide source-level pragmas; there might be different sources for different processor types.
Fat binaries can run anything anywhere, at least via RPC/migration, with cost metrics for execution migration and a load-balancing task-queue architecture (see the sketch below).
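One hypothetical shape for such a fat-binary task-queue entry; the names and the two core types are illustrative, not from the deck.

    #include <stdint.h>

    enum core_type { CORE_GPP, CORE_ACCEL, NUM_CORE_TYPES };

    /* A task carries one entry point per core type plus a cost metric,
       so a scheduler can load-balance across core types and fall back
       to RPC/migration when a core type has no suitable binary (NULL). */
    struct task {
        void    (*body[NUM_CORE_TYPES])(void *args);
        uint32_t  cost[NUM_CORE_TYPES];   /* estimated cycles per core type */
        void     *args;
    };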
Page 18
Heterogeneous Code-Gen Problems
Function pointers: need a translation table (see the sketch after this list)
– Maps the function-pointer address space to the actual entry points
– All constructor code runs the same on all cores
– Adds an extra path to C++ virtual method calls
Self-modifying code? – architecture independent: LLVM
Data pointers: shared address space; coherency management
Data formats: little endian vs. big endian, etc.; conversions probably impact source
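A minimal sketch of the function-pointer translation table; names are illustrative. A function "pointer" becomes an index into a shared table, so constructor code that stores pointers behaves identically on every core, at the price of one extra lookup on indirect (including C++ virtual) calls.

    #include <stdint.h>

    enum core_type { CORE_GPP, CORE_ACCEL, NUM_CORE_TYPES };

    typedef void (*entry_t)(void *);

    /* One row per function in the shared function-pointer address space;
       each column holds that core type's actual entry point. */
    extern entry_t xlate_table[][NUM_CORE_TYPES];

    /* An "indirect call" dereferences the table instead of the pointer. */
    static inline void call_indirect(uintptr_t fp, enum core_type me, void *args)
    {
        xlate_table[fp][me](args);
    }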
Page 19
Making Cell Easier to Program
Software ICache
Standard language support: OpenMP, OpenCL
Page 20
Soft I-Cache Summary
Up to ½ GB of code; normal tool-chain flow.
No detailed knowledge required on the part of the developer.
'Small' changes to the ABI – good interoperability with old source.
New virtual address space for code – 32-bit function pointers – indirect branches require a tag check.
The runtime edits branches to go directly to their targets when they are in cache – no overhead in the hit case.
Less than 10% total runtime penalty for running in small caches; still working to improve this.
Verified on QS22 & PXCAB hardware. Support code resides outside of the cache structure.
[Tool-chain flow: .c → compiler → .s → assembler → .o → linker → exe/lib, backed by the runtime system.]
The compiler divides code into blocks smaller than the line size, each ending with a branch-always, and annotates branches with importance and return-stack information.
The linker analyzes the whole program to determine a cache layout that minimizes cache conflicts on the important paths.
Page 21
A Cache Block
The linker creates a "stub" at the end of the cache line for each immediate branch.
Each stub has four word-sized fields: the address of the target, a pointer to the branch, a trampoline, and an XOR pattern.
When a line is brought from memory into the cache, its immediate branches are redirected to a brsl (the trampoline, T) in the stub, which jumps to the runtime entry point. The brsl records a pointer to the stub (see the sketch below).

[Line layout: code…, constants, stub(s); each stub: brsl, xor, ptr, isa.]
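A sketch of one stub as a C struct, following the field order in the diagram; the field names are illustrative.

    #include <stdint.h>

    /* One stub occupies a single 16-byte quadword at the end of the line. */
    struct icache_stub {
        uint32_t brsl;         /* trampoline: brsl into the runtime miss handler */
        uint32_t xor_pattern;  /* turns branch-to-stub into branch-to-target */
        uint32_t branch_ptr;   /* local-store address of the branch instruction */
        uint32_t target;       /* instruction-space address of the branch target */
    };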
Page 22
Rewriting
The runtime patches the branch to point at its target:
– Load and rotate the XOR pattern (it converts a branch-to-stub into a branch-to-target)
– Load the quadword pointed to by the stub (it contains the branch)
– XOR
– Store the quadword with the rewritten branch back
The next time, the branch goes directly to its intended target with no extra delay (see the sketch below).
This is easy because the cache is direct mapped, and branch hints work. The branch must be un-rewritten if its target is evicted.
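A sketch of the patch step under the stub layout above. On real SPU hardware the load/XOR/store covers the whole 16-byte quadword (hence the rotate of the pattern into the branch's slot), since local store is quadword-accessed; the word-sized version below shows just the core idea.

    #include <stdint.h>

    /* Patch the in-cache branch so it jumps straight to its target:
       (branch-to-stub instruction) ^ pattern == (branch-to-target instruction).
       XOR-ing again un-rewrites the branch when the target line is evicted. */
    static void rewrite_branch(uint32_t *branch, uint32_t xor_pattern)
    {
        *branch ^= xor_pattern;
    }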
Page 23
Performance Relative to LS Static
[Chart not reproduced.] Miss penalty = a 400-cycle access to main memory, in parallel with the cache-management software.
Page 24
OpenMP and OpenCL

OpenMP – targets CPUs: MIMD, scalar code bases, parallel for loops, shared memory model
Pros: single source; annotated scalar code base; established in the market
Cons: doesn't always scale well; only supports shared memory models; relies heavily on the compiler

OpenCL – targets GPUs: SIMD, scalar/vector code bases, data parallel, distributed shared memory model
Pros: single source; portable vector language; explicit memory control
Cons: unproven; host API calls; GPU texture engine exposed

Both must ultimately cover CPU, GPU, and SPE (see the sketches below).
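For flavor, a minimal OpenMP sketch of the single-source, annotated-scalar style; saxpy is an illustrative kernel, not taken from the deck (its OpenCL counterpart follows the memory-model slide below).

    #include <omp.h>

    /* y = a*x + y over shared memory; the pragma is the only annotation. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }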
Page 25
OpenCL Memory Model
Shared memory model with release consistency.
Multiple distinct address spaces; address spaces can be collapsed depending on the device's memory subsystem.
Address qualifiers: __private, __local, __constant, and __global. Example: __global float4 *p;
[Diagram: a compute device with compute units 1…N; each compute unit holds work-items 1…M, each with private memory, plus a shared local memory; a global/constant memory data cache sits in front of the global memory in the compute device memory.]
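The OpenCL counterpart of the earlier OpenMP sketch, illustrating the address qualifiers above; saxpy is again an illustrative example, not from the deck.

    /* One work-item per element; x and y live in __global memory,
       and locals such as i are __private by default. */
    __kernel void saxpy(const float a,
                        __global const float *x,
                        __global float *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }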
Page 26
Power with VMX
Unified host and device memory – zero copies between them. Local and private memory are collapsed; a locked cache range backs local memory; global memory is cached.

[Diagram: Power cores 1…N as compute units, one work-item each, each with a global/constant memory data cache; host memory and device global memory coincide in system memory.]
Page 27
PCIe-Attached GPU
Separate host, device, and local address spaces – explicit copies between them.

[Diagram: an x86 host with host memory in system memory, connected over PCIe to the compute device; device memory holds the device global memory, and each compute unit's work-items have private and local memory on the device.]
Page 28
Cell Broadband Engine
Unified host and device memory – zero copies between them.

[Diagram: the PPE as host, with host memory and device global memory in system memory; SPEs 1…N as compute units, one work-item each, with private and local memory held in each SPE's local store.]
Page 29
Summary
What we have done:
Broken new barriers in performance and efficiency – Top 500 #1, Green 500 #1
Proven commercial viability – Repsol, Mentor Graphics, Broadcast International, Hitachi, Hoplon, …
What is next:
Continued focus on efficiency
Increasing focus on standards-based programming – Software ICache for Cell/B.E.; OpenMP & OpenCL for Cell/B.E. and other processors; …
Increasing focus on ease of use – make accelerators "invisible" for most customers; commercial applications, not just HPC; not an easy thing to do
Continue to broaden the application reach of Cell and hybrid systems
Page 30
Next Era of Innovation – Hybrid Computing
The Next Bold Step in Innovation & Integration

[Roadmap: from the Symmetric Multiprocessing Era (today: p6, Cell, BlueGene; driven by cores/threads; "technology out") to the Hybrid Computing Era (pNext 1.0, pNext 2.0, p7; driven by workload consolidation; "market in"), spanning traditional, throughput, and computational workloads.]
All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.
Page 31
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
Special notices
Page 32
The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L(logo), alphaWorks, AS/400, BladeCenter, Blue Gene, Blue Lightning, C Set++, CICS, CICS/6000, ClusterProven, CT/2, DataHub, DataJoiner, DB2, DEEP BLUE, developerWorks, DirectTalk, Domino, DYNIX, DYNIX/ptx, e business(logo), e(logo)business, e(logo)server, Enterprise Storage Server, ESCON, FlashCopy, GDDM, i5/OS, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), Informix, IntelliStation, IQ-Link, LANStreamer, LoadLeveler, Lotus, Lotus Notes, Lotusphere, Magstar, MediaStreamer, Micro Channel, MQSeries, Net.Data, Netfinity, NetView, Network Station, Notes, NUMA-Q, OpenPower, Operating System/2, Operating System/400, OS/2, OS/390, OS/400, Parallel Sysplex, PartnerLink, PartnerWorld, Passport Advantage, POWERparallel, Power PC 603, Power PC 604, PowerPC, PowerPC(logo), Predictive Failure Analysis, pSeries, PTX, ptx/ADMIN, RETAIN, RISC System/6000, RS/6000, RT Personal Computer, S/390, Scalable POWERparallel Systems, SecureWay, Sequent, ServerProven, SpaceBall, System/390, The Engines of e-business, THINK, Tivoli, Tivoli(logo), Tivoli Management Environment, Tivoli Ready(logo), TME, TotalStorage, TURBOWAYS, VisualAge, WebSphere, xSeries, z/OS, zSeries.
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: Advanced Micro-Partitioning, AIX 5L, AIX PVMe, AS/400e, Chiphopper, Chipkill, Cloudscape, DB2 OLAP Server, DB2 Universal Database, DFDSM, DFSORT, DS4000, DS6000, DS8000, e-business(logo), e-business on demand, eServer, Express Middleware, Express Portfolio, Express Servers, Express Servers and Storage, General Purpose File System, GigaProcessor, GPFS, HACMP, HACMP/6000, IBM TotalStorage Proven, IBMLink, IMS, Intelligent Miner, iSeries, Micro-Partitioning, NUMACenter, On Demand Business logo, POWER, PowerExecutive, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, PowerPC 603, PowerPC 603e, PowerPC 604, PowerPC 750, POWER2, POWER2 Architecture, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, pure XML, Redbooks, Sequent (logo), SequentLINK, Server Advantage, ServeRAID, Service Director, SmoothStart, SP, System i, System i5, System p, System p5, System Storage, System z, System z9, S/390 Parallel Enterprise Server, Tivoli Enterprise, TME 10, TotalStorage Proven, Ultramedia, VideoCharger, Virtualization Engine, Visualization Data Explorer, X-Architecture, z/Architecture, z/9.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. UNIX is a registered trademark of The Open Group in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Itanium, Pentium are registered trademarks and Xeon is a trademark of Intel Corporation or its subsidiaries in the United States, other countries or both. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries or both. TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). NetBench is a registered trademark of Ziff Davis Media in the United States, other countries or both. AltiVec is a trademark of Freescale Semiconductor, Inc. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product and service names may be trademarks or service marks of others.
Special notices (cont.)