Bringing Heterogeneous Multiprocessors Into the Mainstream
July 27, 2009
Brian Flachs, Cell/B.E. Architect
Godfried Goldrian, Peter Hofstee, Jörg-Stephan Vogt, IBM Systems and Technology Group
Page 2
Microprocessor Trends

[Chart: Log(Performance) over time (2005, 2015(?), 2025(??), 2035(???)), with successive curves for Single Thread, Multi-Core, Hybrid, and Special Purpose designs, each labeled "More active transistors, higher frequency".]
Page 3
Heterogeneity
Spreading tasks within a multiprocessor; the hope is to divide tasks according to their characteristics:
Dynamic branch prediction valuable vs. static prediction + conditional move
Caches effective vs. memory latency tolerance
Tight data dependencies vs. control flow/address stream that is data independent
Design processors differently for efficiency and for accelerators:
Control plane:
– High voltage, small register file, short pipeline, low frequency, with dynamic branch prediction and large caches
Data plane:
– Low voltage, large register file, long pipeline, high frequency, with memory latency tolerance
Page 4
SPE Highlights
User-mode architecture: no need to run the OS; no translation/protection within the SPU; DMA uses the full Power Architecture protection/translation
Not just a coprocessor: has its own PC; RISC-like organization; 32-bit fixed-width instructions; dual issue, 11-FO4 design
Broad set of operations (8/16/32/64-bit); VMX-like SIMD dataflow; graphics SP-float, IEEE DP-float
Large unified register file: 128 entries x 128 bits (integer & FP); deep unrolling to cover unit latencies
256 KB local store; flexible DMA engine (see the sketch below)
Improves effective memory bandwidth by improving latency tolerance and the utilization of moved data
Vector load/store with scatter/gather
14.5 mm2 (90 nm SOI)
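A minimal sketch of how an SPU program uses the DMA engine to pull data into local store, assuming the Cell SDK's spu_mfcio.h; the effective address, size, and tag choice are illustrative.

    #include <spu_mfcio.h>

    #define TAG 0   /* DMA tag group (0..31); choice is illustrative */

    /* Pull "size" bytes from main memory (effective address "ea") into
       a local-store buffer, then block until the transfer completes. */
    void fetch(volatile void *ls_buf, unsigned long long ea, unsigned int size)
    {
        /* Addresses and size must be at least 16-byte aligned;
           128-byte alignment gives the best bandwidth. */
        mfc_get(ls_buf, ea, size, TAG, 0, 0);

        /* Wait for all outstanding DMAs in this tag group. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }

In practice a program issues several such GETs with different tags and overlaps computation with the transfers; that overlap is the latency tolerance referred to above.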
Page 5
Cell Broadband Engine Microprocessor Highlights
SMP on a chip:
1 PowerPC core – L1 (32 kB + 32 kB) – L2 (512 kB)
8 SPUs – 256 kB local store each – 4-way single-precision FP
High-bandwidth memory support: 25.6 GB/s (data), dual Rambus XDR DRAM at 3.2 Gbit/s
Two configurable I/O interfaces at 5 Gbit/s: up to 35 GB/s out, up to 25 GB/s in; coherent interface or I/O
Page 6
Memory Managing Processor vs. Traditional General Purpose Processor

[Figure: IBM, AMD, and Intel general-purpose processors compared with the Cell/B.E.]
Page 7
The Making of Roadrunner – Key Building Blocks

Hybrid systems: hybrid and heterogeneous hardware, built around BladeCenter and industry-standard IB-DDR
New hardware: the TriBlade (LS21 + LS21 expansion + two QS22 blades); systems integration; bring-up, test, and benchmarking
New software: SDK 3.1 (ALF, DaCS, and other libraries), TriBlade Linux, firmware, device drivers; open source
A new programming model extended from standard cluster computing (innovation derived through close collaboration with Los Alamos National Lab)
Scale: 17 Connected Units – 278 racks
Page 8
Efficiency Benefits
More efficient use of memory bandwidth
More performance per area
More performance per watt
Fewer modules, boards, racks, etc.
Larger portion of the interconnect on chip
More computation per soft fail / checkpoint
Page 9
System Level Efficiency
Power efficiency is an important problem
Key areas of attack:
The most efficient ops/watt operating point
– Voltage/frequency scaling
– Match the critical functional path to the low-speed Vmin
– Low-FO4 design maximizes performance per unit of area and leakage
In-system voltage selection – reduce the voltage
Lower-temperature operating point
– Improves circuit performance, reduces leakage
Page 10
Efficient Operating Point
[Graph not reproduced: on the left side, voltage is fixed; on the right side, voltage increases to support the operating frequency.]
We want to optimize the different processor types separately.
Page 11
In-System Voltage Determination
Power systems have margin:
Voltage is time variant; voltage differs from the voltage set point
The chip has a voltage guard band: noise, end-of-life, system-to-tester correlation margin
We don't want to pay for the sum of these margins. Instead: build the card including the VRM and processor, set the desired operating frequency, start at a high voltage, and successively reduce the voltage until a challenging workload stops running; then add the necessary guard band (see the sketch below).
This saves 16 W per processor.
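A hypothetical sketch of that search loop; set_vrm_voltage() and run_stress_workload() stand in for platform-specific hooks the deck does not name.

    #include <stdbool.h>

    extern void set_vrm_voltage(double volts);   /* hypothetical VRM hook */
    extern bool run_stress_workload(void);       /* true if the workload passes */

    /* Lower the supply in small steps until the challenging workload
       first fails, then settle at the last passing voltage plus guard band. */
    double find_operating_voltage(double v_start, double v_step, double guard)
    {
        double v = v_start;
        set_vrm_voltage(v);
        while (run_stress_workload()) {
            v -= v_step;
            set_vrm_voltage(v);
        }
        /* v just failed, so the last passing point was v + v_step. */
        set_vrm_voltage(v + v_step + guard);
        return v + v_step + guard;
    }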
Page 12
QPACE Processor Node Card

[Board photo with labeled components: DDR2 memory (1 Gbit each), IBM PowerXCell(TM) 8i, PMC Sierra PHYs, Xilinx Virtex 5 FPGA, three VRMs. Dimensions: 330 mm x 155 mm.]
Page 13
QPACE Processor Node Cooling

TIMs (thermal interface materials) for the memory modules, the FPGA, the PHYs, and the VRMs.
A height-adjustable heat path compensates for tolerances; TIM is applied on the processor module and on the small side of the heat path.
A TIM area forms the thermal interface between the processor node and the cold plate, through which the heat flows.
Page 14
QPACE Cold Plate

[Diagram: water flows in and out of a cold plate that serves 32 processor nodes; 16 processor nodes and 1 root card are mounted on each side of the cold plate.]
Page 15
QPACE Rack
The QPACE project is a research collaboration between IBM Development and European universities and research institutes, with the goal of building a prototype of a Cell processor-based supercomputer.
The major part of this project is funded by the German Research Foundation (DFG – Deutsche Forschungsgemeinschaft) as part of a Collaborative Research Center (SFB – Sonderforschungsbereich TR55).
Page 16
Green 500

Water cooling: the temperature drop from processor junction to warm water is <30°C, so no chiller is required.
Estimated power efficiency of QPACE: rack power 30 kW, Linpack 18 TFLOPS => 600 MFLOPS/W.
Page 17
Future Compilation Strategy for Accelerators
Idle accelerators are not efficient.
Programmer effort can limit the utilization of accelerators, which in turn limits the specialization of, and investment in, accelerators.
We want to support standard languages: C, C++, Fortran, OpenMP, OpenCL.
We want the approach to be mostly transparent while honoring programmer directives; it is OK for programmers to sacrifice portability for better results in high-value codes.
Assume each architecture/microarchitecture benefits from, or requires, different binaries: instruction scheduling, execution latencies, special instructions and operations.
Compile everything for all cores; analyze the output for core-type assignment metrics; provide source-level pragmas; there might be different sources for different processor types.
Fat binaries can run anything anywhere, at least via RPC/migration, with cost metrics for execution migration and a load-balancing task-queue architecture (see the sketch below).
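One hypothetical shape for such a fat-binary task-queue entry; the names and the two core types are illustrative, not from the deck.

    #include <stdint.h>

    enum core_type { CORE_GPP, CORE_ACCEL, NUM_CORE_TYPES };

    /* A task carries one entry point per core type plus a cost metric,
       so a scheduler can load-balance across core types and fall back
       to RPC/migration when a core type has no suitable binary (NULL). */
    struct task {
        void    (*body[NUM_CORE_TYPES])(void *args);
        uint32_t  cost[NUM_CORE_TYPES];   /* estimated cycles per core type */
        void     *args;
    };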
Page 18
Heterogeneous Code-Gen Problems
Function pointers: need a translation table (see the sketch after this list)
– Maps the function-pointer address space to the actual entry points
– All constructor code runs the same on all cores
– Adds an extra path to C++ virtual method calls
Self-modifying code? – architecture independent: LLVM
Data pointers: shared address space; coherency management
Data formats: little endian vs. big endian, etc.; conversions probably impact source
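A minimal sketch of the function-pointer translation table; names are illustrative. A function "pointer" becomes an index into a shared table, so constructor code that stores pointers behaves identically on every core, at the price of one extra lookup on indirect (including C++ virtual) calls.

    #include <stdint.h>

    enum core_type { CORE_GPP, CORE_ACCEL, NUM_CORE_TYPES };

    typedef void (*entry_t)(void *);

    /* One row per function in the shared function-pointer address space;
       each column holds that core type's actual entry point. */
    extern entry_t xlate_table[][NUM_CORE_TYPES];

    /* An "indirect call" dereferences the table instead of the pointer. */
    static inline void call_indirect(uintptr_t fp, enum core_type me, void *args)
    {
        xlate_table[fp][me](args);
    }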
Page 19
Making Cell Easier to Program
Software ICache
Standard language support: OpenMP, OpenCL
Page 20
Soft I-Cache Summary
Up to ½ GB of code; normal tool-chain flow.
No detailed knowledge required on the part of the developer.
'Small' changes to the ABI – good interoperability with old source.
New virtual address space for code – 32-bit function pointers – indirect branches require a tag check.
The runtime edits branches to go directly to their targets when they are in cache – no overhead in the hit case.
Less than 10% total runtime penalty for running in small caches; still working to improve this.
Verified on QS22 & PXCAB hardware. Support code resides outside of the cache structure.
[Tool-chain flow: .c → compiler → .s → assembler → .o → linker → exe/lib, backed by the runtime system.]
The compiler divides code into blocks smaller than the line size, each ending with a branch-always, and annotates branches with importance and return-stack information.
The linker analyzes the whole program to determine a cache layout that minimizes cache conflicts on the important paths.
Page 21
A Cache Block
The linker creates a "stub" at the end of the cache line for each immediate branch.
Each stub has four word-sized fields: the address of the target, a pointer to the branch, a trampoline, and an XOR pattern.
When a line is brought from memory into the cache, its immediate branches are redirected to a brsl (the trampoline, T) in the stub, which jumps to the runtime entry point. The brsl records a pointer to the stub (see the sketch below).

[Line layout: code…, constants, stub(s); each stub: brsl, xor, ptr, isa.]
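A sketch of one stub as a C struct, following the field order in the diagram; the field names are illustrative.

    #include <stdint.h>

    /* One stub occupies a single 16-byte quadword at the end of the line. */
    struct icache_stub {
        uint32_t brsl;         /* trampoline: brsl into the runtime miss handler */
        uint32_t xor_pattern;  /* turns branch-to-stub into branch-to-target */
        uint32_t branch_ptr;   /* local-store address of the branch instruction */
        uint32_t target;       /* instruction-space address of the branch target */
    };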
Page 22
Rewriting
The runtime patches the branch to point at its target:
– Load and rotate the XOR pattern (it converts a branch-to-stub into a branch-to-target)
– Load the quadword pointed to by the stub (it contains the branch)
– XOR
– Store the quadword with the rewritten branch back
The next time, the branch goes directly to its intended target with no extra delay (see the sketch below).
This is easy because the cache is direct mapped, and branch hints work. The branch must be un-rewritten if its target is evicted.
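A sketch of the patch step under the stub layout above. On real SPU hardware the load/XOR/store covers the whole 16-byte quadword (hence the rotate of the pattern into the branch's slot), since local store is quadword-accessed; the word-sized version below shows just the core idea.

    #include <stdint.h>

    /* Patch the in-cache branch so it jumps straight to its target:
       (branch-to-stub instruction) ^ pattern == (branch-to-target instruction).
       XOR-ing again un-rewrites the branch when the target line is evicted. */
    static void rewrite_branch(uint32_t *branch, uint32_t xor_pattern)
    {
        *branch ^= xor_pattern;
    }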
Page 23
Performance Relative to LS Static
[Chart not reproduced.] Miss penalty = a 400-cycle access to main memory, in parallel with the cache-management software.
Page 24
OpenMP and OpenCL

OpenMP – targets CPUs: MIMD, scalar code bases, parallel for loops, shared memory model
Pros: single source; annotated scalar code base; established in the market
Cons: doesn't always scale well; only supports shared memory models; relies heavily on the compiler

OpenCL – targets GPUs: SIMD, scalar/vector code bases, data parallel, distributed shared memory model
Pros: single source; portable vector language; explicit memory control
Cons: unproven; host API calls; GPU texture engine exposed

Both must ultimately cover CPU, GPU, and SPE (see the sketches below).
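For flavor, a minimal OpenMP sketch of the single-source, annotated-scalar style; saxpy is an illustrative kernel, not taken from the deck (its OpenCL counterpart follows the memory-model slide below).

    #include <omp.h>

    /* y = a*x + y over shared memory; the pragma is the only annotation. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }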
Page 25
OpenCL Memory Model
Shared memory model with release consistency.
Multiple distinct address spaces; address spaces can be collapsed depending on the device's memory subsystem.
Address qualifiers: __private, __local, __constant, and __global. Example: __global float4 *p;
[Diagram: a compute device with compute units 1…N; each compute unit holds work-items 1…M, each with private memory, plus a shared local memory; a global/constant memory data cache sits in front of the global memory in the compute device memory.]
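The OpenCL counterpart of the earlier OpenMP sketch, illustrating the address qualifiers above; saxpy is again an illustrative example, not from the deck.

    /* One work-item per element; x and y live in __global memory,
       and locals such as i are __private by default. */
    __kernel void saxpy(const float a,
                        __global const float *x,
                        __global float *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }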
Page 26
Power with VMX
Unified host and device memory – zero copies between them. Local and private memory are collapsed; a locked cache range backs local memory; global memory is cached.

[Diagram: Power cores 1…N as compute units, one work-item each, each with a global/constant memory data cache; host memory and device global memory coincide in system memory.]
Page 27
PCIe-Attached GPU
Separate host, device, and local address spaces – explicit copies between them.

[Diagram: an x86 host with host memory in system memory, connected over PCIe to the compute device; device memory holds the device global memory, and each compute unit's work-items have private and local memory on the device.]
Page 28
Cell Broadband Engine
Unified host and device memory – zero copies between them.

[Diagram: the PPE as host, with host memory and device global memory in system memory; SPEs 1…N as compute units, one work-item each, with private and local memory held in each SPE's local store.]
Page 29
Summary
What we have done:
Broken new barriers in performance and efficiency – Top 500 #1, Green 500 #1
Proven commercial viability – Repsol, Mentor Graphics, Broadcast International, Hitachi, Hoplon, …
What is next:
Continued focus on efficiency
Increasing focus on standards-based programming – Software ICache for Cell/B.E.; OpenMP & OpenCL for Cell/B.E. and other processors; …
Increasing focus on ease of use – make accelerators "invisible" for most customers; commercial applications, not just HPC; not an easy thing to do
Continue to broaden the application reach of Cell and hybrid systems
Page 30
Next Era of Innovation – Hybrid Computing
The Next Bold Step in Innovation & Integration

[Roadmap: from the Symmetric Multiprocessing Era (today: p6, Cell, BlueGene; driven by cores/threads; "technology out") to the Hybrid Computing Era (pNext 1.0, pNext 2.0, p7; driven by workload consolidation; "market in"), spanning traditional, throughput, and computational workloads.]
All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.
Page 31
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
Special notices
Page 32
The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L(logo), alphaWorks, AS/400, BladeCenter, Blue Gene, Blue Lightning, C Set++, CICS, CICS/6000, ClusterProven, CT/2, DataHub, DataJoiner, DB2, DEEP BLUE, developerWorks, DirectTalk, Domino, DYNIX, DYNIX/ptx, e business(logo), e(logo)business, e(logo)server, Enterprise Storage Server, ESCON, FlashCopy, GDDM, i5/OS, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), Informix, IntelliStation, IQ-Link, LANStreamer, LoadLeveler, Lotus, Lotus Notes, Lotusphere, Magstar, MediaStreamer, Micro Channel, MQSeries, Net.Data, Netfinity, NetView, Network Station, Notes, NUMA-Q, OpenPower, Operating System/2, Operating System/400, OS/2, OS/390, OS/400, Parallel Sysplex, PartnerLink, PartnerWorld, Passport Advantage, POWERparallel, Power PC 603, Power PC 604, PowerPC, PowerPC(logo), Predictive Failure Analysis, pSeries, PTX, ptx/ADMIN, RETAIN, RISC System/6000, RS/6000, RT Personal Computer, S/390, Scalable POWERparallel Systems, SecureWay, Sequent, ServerProven, SpaceBall, System/390, The Engines of e-business, THINK, Tivoli, Tivoli(logo), Tivoli Management Environment, Tivoli Ready(logo), TME, TotalStorage, TURBOWAYS, VisualAge, WebSphere, xSeries, z/OS, zSeries.
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: Advanced Micro-Partitioning, AIX 5L, AIX PVMe, AS/400e, Chiphopper, Chipkill, Cloudscape, DB2 OLAP Server, DB2 Universal Database, DFDSM, DFSORT, DS4000, DS6000, DS8000, e-business(logo), e-business on demand, eServer, Express Middleware, Express Portfolio, Express Servers, Express Servers and Storage, General Purpose File System, GigaProcessor, GPFS, HACMP, HACMP/6000, IBM TotalStorage Proven, IBMLink, IMS, Intelligent Miner, iSeries, Micro-Partitioning, NUMACenter, On Demand Business logo, POWER, PowerExecutive, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, PowerPC 603, PowerPC 603e, PowerPC 604, PowerPC 750, POWER2, POWER2 Architecture, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, pure XML, Redbooks, Sequent (logo), SequentLINK, Server Advantage, ServeRAID, Service Director, SmoothStart, SP, System i, System i5, System p, System p5, System Storage, System z, System z9, S/390 Parallel Enterprise Server, Tivoli Enterprise, TME 10, TotalStorage Proven, Ultramedia, VideoCharger, Virtualization Engine, Visualization Data Explorer, X-Architecture, z/Architecture, z/9.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. UNIX is a registered trademark of The Open Group in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Itanium, Pentium are registered trademarks and Xeon is a trademark of Intel Corporation or its subsidiaries in the United States, other countries or both. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries or both. TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). NetBench is a registered trademark of Ziff Davis Media in the United States, other countries or both. AltiVec is a trademark of Freescale Semiconductor, Inc. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product and service names may be trademarks or service marks of others.
Special notices (cont.)