Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelerate your business

Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.9.0

Heterogeneous computing on POWER

IBM and OpenPOWER technologies to accelerate your business

César Diniz MacielExecutive IT SpecialistIBM Corporate Strategy

Session objectives

In this session we present how accelerators providesignificant application performance improvementsand how they can be deployed in a solution. Weevaluate different types of accelerators, and focuson the latest accelerator technologies available forIBM Power Systems and OpenPOWER solutions.

Acknowledgements

I would like to thank the following people for providinginvaluable information on POWER8, CAPI and FPGAs• Jeff Stuecheli, POWER Hardware Architect• Bill Starke, DE, POWER Server Nest Architect• Bruce Wile, STG Hardware Design and Verification• Jonathan Dement, Program Director, • Power Systems and OpenPower Innovation • IBM US

What is heterogeneous computing?From Wikipedia “Heterogeneous computing refers to systems that use more than one kind of processor. These are systems that gain performance not just by adding the same type of processors, but by adding dissimilar processors, usually incorporating specialized processing capabilities to handle particular tasks.“

What is heterogeneous computing?

Not a new concept, widely used in the industry

On-chip accelerators: Cryptography and compression accelerators inside the

POWER8 processor

PCIe based accelerators GPGPUs, such as Nvidia Tesla PCIe adapters for SSL acceleration, cryptography and

compression CAPI adapters for POWER8 systems

Appliance-based accelerators

Netezza appliances to accelerate queries on the IBM DB2 Analytics Accelerator

Why heterogeneous computing? Applications are becoming more complex and demanding more

computing resources Application speedup limited to performance of the slowest algorithm

If algorithm execution can be accelerated, application runs faster – code optimization

If algorithm can be built on silicon, execution speeds up – ASICs, FPGA If algorithm can be broken in multiple pieces, and these can run simultaneously,

application runs faster – parallelization

“Heterogeneous (or asymmetric) chip multiprocessors present uniqueopportunities for improving system throughput, reducing processor power,and mitigating Amdahl’s law. On-chip heterogeneity allows the processor to better match execution resources to each application’s needs and toaddress a much wider spectrum of system loads—from low to high threadparallelism—with high efficiency.”IEEE Computer, November 2005,

Why heterogeneous computing?

Scaling & power Wall Issues, Chip Design and Fabrication Economics, and Time To Market Demands are all intersecting enabling/requiring change to Business As Usual

On-chip accelerators

• Excellent performance and integration• Accelerators are part of the microprocessor design and share

the same silicon die.• Algorithms are implemented in silicon

• Fastest performance• Application transparency – Hypervisor/OS abstract the

accelerator so that applications do not need to be modified• Close integration with processor design and capabilities

• However, on-chip accelerators use same silicon space that could be used for caches, processor features, etc• Tradeoff between accelerators and core features• Once built, they cannot be changed/updated

Examples of performance benefits of acceleratorsPOWER7+ and POWER8 processors include memory compression accelerators, and cryptographic accelerators

Memory Compression

AsymmetricMathematical

Functions

CryptographicEngines

The on-chip loosely coupled accelerators are part of the processor chip and managed by the Power Hypervisor. Multiple partitions can use the accelerators and the Power Hypervisor manages the QoS and address mapping.

Examples of performance benefits of acceleratorsPerformance comparison of Active Memory Expansion (AME) on an SAP workload (SD 2-Tier benchmark) on POWER7 and POWER7+ (output of amepat command)

Significant reduction in core consumption

PCIe-based accelerators

• Provide great flexibility for adding capability to existing systems• Allow many more options in terms of types of accelerators, algorithms, performance

characteristics and implementation devices• GPUs• FPGAs• ASICs

• Easy to replace with newer/faster accelerators• However, PCIe-based accelerators are seen as an I/O device, and therefore

communicate with the processor and main memory through an I/O subsystem• Programming model needs to incorporate the I/O device• Need of OS device drivers for the accelerator• Data needs to be copied from/to main memory to accelerator memory• Data transfer performance limited by the I/O subsystem

Java gzip PCIe-based accelerator

• Feature code #EJ12/ #EJ13 - PCIe3 FPGA Accelerator Adapter• Also known as Generic Work Queue Engine (GenWQE) accelerator• Hardware compression accelerator for AIX and Linux• FPGA-based PCIe adapter, provides high performance, low latency compression

without significant CPU overhead•

• Based on the standard compression library – zlib• Widely used open source C library that provides compression and decompression• zlib supports RFC1950, RFC1951, and RFC1952•

• Enabled transparently in IBM Java 7.1.1 release and later on Linux and AIX on POWER8 processor-based systems

• HW offload enabled by setting • env variables:

• ZLIB_INFLATE_IMPL = 1• ZLIB_DEFLATE_IMPL = 1

Java gzip PCIe-based accelerator - Performance

IBM internal testing

GPUs

From NVIDIA:• • “GPU-accelerated computing is the use of a graphics processing unit (GPU)

together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications...• …GPU-accelerated computing offers unprecedented application performance by

offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster.”

General Purpose GPUs

GPUs are well suited for parallelprocessing tasks. They have thousands of cores than work in parallel

Coupled with a high performanceprocessor and a GPU programming model, significant application acceleration can be achieved

16

NVIDIA K40 GPU

• Systems• Up to 2 K40 GPU in S824L

• GPU Spec• Kepler-2 architecture GPU • ASIC: GK110B

• PCIe interface• PCIe Gen3 x16• Full length / double wide PCIe form factor• Plugs in using existing double wide cassette

• Power• 235W Max power draw :75W via PCIe slot

plus 160W via 8-pin Aux. cable.• OS support on POWER

• Ubuntu 14.10 or later

http://www.nvidia.com/object/tesla-servers.html

16

NVIDIA and POWER8

From http://www.ecmwf.int/sites/default/files/HPC-WS-Appleyard.pdf

NVIDIA on Power

“ The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the highest performance we have ever seen in individual cores, and the close integration with accelerators is outstanding for heterogeneous parallelization. Thanks to the little endian chip and standard CUDA environment it took us less than 24 hours to port and accelerate GROMACS.”

Erik Lindahl, Professor of Biophysics at the Science for Life Laboratory, Stockholm University & KTH. Developer of GROMACS

GPUs are not only for technical computing

“ Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs - A Technology Demonstration”http://on-demand.gputechconf.com/gtc/2015/presentation/S5835-Sina-Meraji.pdf

Java acceleration with GPUs“IBM’s POWER group has partnered with NVIDIA to make GPUs available on a high-performance server platform, promising the next generation of parallel performance for Java applications.”The Next Wave of Enterprise Performance with Java, POWER Systems and NVIDIA GPUs

Performance of java.util.Arrays#sort(int[]) on NVIDIA Tesla K40 GPU (ECC enabled) and IBM Power8 CPU.

KeplerCUDA 5.5 – 7.0

Unified Memory

Buffered Memory

POWER8

PCIe

Current

PascalCUDA 8

Full GPU Paging

Pascal

POWER8+

Near Future

NVLink

SXM2

VoltaCUDA 9

Cache Coherent

POWER9

More in the future

NVLink 2.0

SXM2

VoltaKepler

NVIDIA Roadmap on POWER

FPGA as an Accelerator

• FPGA: Field Programmable Gate Array

• From Wikipedia: “A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a

customer or a designer after manufacturing – hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). “

• It implements the algorithm as a hardware logic circuit• Not as optimized as an ASIC, but faster than running a software-based algorithm

• It can run fast (cycle times of 250 – 500 MHz or more)• It has industry standard interfaces like PCIe Gen3• The major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation members

gzip Encrypt

MonteCarlo

FPGA Library

Source code for FPGAs has traditionallybeen written in RTL* (VHDL** or Verilog).Now, we also have OpenCL, a more programmer friendly language.

*RTL = Register Transfer Level**VHDL = VHSIC*** Hardware Description Language***VHSIC = Very High Speed Integrated Circuit

Why FPGAs

• Transistor Efficiency & Extreme Parallelism• Bit-level operations• Variable-precision floating point• Compare divides on GPU vs. FPGA

• Power-Performance Advantage • >2x compared to a general multicore processor or GPGPU• Unused lookup tables (LUTs) are powered off

• Technology Scaling better than CPU/GPU• FPGAs are not frequency or power limited yet• 3D has great potential

• Dynamic reconfiguration• Flexibility for application tuning at run-time vs. compile-time

• Additional advantages when FPGAs are network connected• allows network as well as compute specialization

Several IT companies looking at FPGA

Baidu FPGA accelerator

Microsoft Bing accelerator

Combining the best of both architectures

Coherent Accelerator base principles

Function Based Acceleration• Main Application executed on Host Processor

Computational heavy functions on Accelerator• Single binary image encapsulates both HW and SW

version of accelerated functions• Application calls Accelerator for common or custom

libraries • Enable an application to work with or without available

Accelerator• Accelerator call faults when function not available• Host processor executes function when Accelerator

not available• No special requirements on data structures• System software virtualization of accelerator’s function(s)

Full Peer to Processor• IBM-designed processor interface

• Maintains a trusted coherent interface to system• Direct communications with application• Accelerator function(s) use an unmodified EA

• Full access to real address space• Utilize processor’s page tables directly• Page faults handled by system software

• Multiple functions can exist in a single accelerator

Customizable HardwareApplication Accelerator • Specific system SW, middleware, or user

application• Written to durable interface provided by PSL

Virtual Addressing• Accelerator can work with same memory

addresses that the processors use• Pointers de-referenced same as the host

application• Removes OS & device driver overhead

Hardware Managed Cache Coherence• Enables the accelerator to participate in “Locks”

as a normal thread Lowers Latency over IO communication model

POWER8 CAPI (Coherent Accelerator Processor Interface)

Ecosystem

CAPP

PCIe

Power Processor

FPGA

I/OEthernet, DASD, etc…

Function n

Function

0

Function

1

Function 2

CAPI

IBM Supplied POWER Service Layer

Coherent Accelerator Processor Interface (CAPI) overview

CAPP PCIe

POWER8 Processor

FPGA

Accelerator Function Unit (AFU)

CAPI

IBM-Supplied POWER Service Layer

Typical I/O model flow

Flow with a coherent model

Shared MemoryNotify Accelerator Acceleration Shared Memory

Completion

DD Call Copy or PinSource Data

MMIO NotifyAccelerator Acceleration Poll / Interrupt

CompletionCopy or Unpin

Result DataReturn From DD

Completion

Advantages of coherent attachment over I/O attachment

Virtual addressing and data caching– Shared memory

– Lower latency for highly referenced data

Easier, more natural programming model– Traditional thread-level programming– Long latency of I/O typically requires

restructuring of application

Enables applications not possible on I/O– Pointer chasing, and so on

Typical I/O Model Flow:

Flow with a Coherent Model:

Shared Mem. Notify Accelerator Acceleration Shared Memory

Completion

DD Call Copy or PinSource Data

MMIO NotifyAccelerator Acceleration Poll / Interrupt

CompletionCopy or UnpinResult Data

Ret. From DDCompletion

ApplicationDependent, butEqual to below

ApplicationDependent, butEqual to above

300 Instructions 10,000 Instructions 3,000 Instructions1,000 Instructions

1,000 Instructions

7.9µs 4.9µs

Total ~13µs for data prep

<400 Instructions <100 Instructions

0.3µs 0.06µs

Total 0.36µs

CAPI vs. I/O Device Driver: Data Prep

Time spent by a GPU-accelerated application

“ Figure 7 shows that the processing phases ofthe FD-OCT algorithm (DC-Removal, Resample, DispersionCompensation, FFT and Logarithmic Scaling) account for 40% of ∼the total run-time time, while the memory (data)transfers (host-to-device and device-to-host) require approximately 60% of the time.”

From: Scalable, High Performance Fourier Domain Optical Coherence Tomography: why FPGAs and not GPGPUsJian Li, Marinko V. Sarunic, and Lesley ShannonSchool of Engineering ScienceSimon Fraser University, Burnaby BC, Canada

strategy ( )

CAPI Attached Flash Optimization

• Attach IBM FlashSystem to POWER8 via CAPI coherent Attach• Issues Read/Write Commands from applications to eliminate 97% of code

pathlength• Saves 20-30 cores per 1M IOPs

Pin buffers, Translate, Map DMA, Start I/O

Application

Read/Write Syscall

Interrupt, unmap, unpin,Iodone scheduling

20K instructions reduced to <500

Disk and Adapter DD

strategy ( ) iodone ( )

FileSystem

Application

User Library

Posix Async I/O Style API

Shared Memory Work Queue

aio_read()aio_write()

iodone ( )

LVM

Innovative “In-Memory” NoSQL/KVSIBM Data Engine for NoSQL

24:1 Reduction in infrastructure

2.4x Price reduction

12x Less Energy

6x Less rack space

40TB of extended memory

4U

Demonstrating the Value of CAPI AttachmentIBM Data Engine for NoSQL

Comparison of FFT on three paradigms

Same FFT Algorithm as CAPI, but large latency impactP8 Core working much harder to deliver data to FPGA as device driver code is invoked

Small FFTs can be implemented in a fully streamed fashion on FPGAPerformance scales with host-AFU bandwidthUse radix-2 pipeline for up to 4 GB/sRadix-2 needs 2 complex samples per cycle → 16B / 250MHz cycle → 4 GB/sSwitch to (more efficient) radix-4 for up to 8 GB/s

Software FFT from the IBM Engineering and Scientific Subroutine Library (ESSL)Poor performance for small FFTs using multi-threaded softwareLittle data reuse and strided access patterns lead to software inefficiencies

P8Core

P8Core FPGA

P8Core

FPGA

CAPI

FFT Results

CAPI projects – PDT planning

• FPGA-CAPI accelerator for PDT (Photodynamic Therapy)• Monte-Carlo analysis of light scattering in tissue• Used to plan for non-invasive cancer treatment• Project led by Jeff Cassidy (University of Toronto) and • Lothat Lilge (Princess Margaret Cancer Centre), Canada,• with IBM Austin support.

CAPI projects – Text Analytics

Query A Query B Query C Query D Query E Query F0

100

200

300

400

500

600

700

800

SW 1 thread

FPGA 4 streams

Throughput [MB/s]

SW 1 thread SW 64 threads FPGA 4 streams

Performance

Additional CAPI projects

• Algo acceleration for risk analysis for High Performance Trading• Biometric (image and voice recognition) acceleration for secure• identification for banking system• Flash acceleration (NVMe)• Video transcoding and compression for IPTV/VoD• Video analytics• • .. And many more• • • • If you have an application, or know an application that can benefit from FPGA and CAPI...

Talk to IBM!!!

References

First Annual OpenPOWER Summit – presentations and solutions on CAPI and GPU accelerators

Porting GPU-Accelerated Applications to POWER8 Systems

POWER8 Coherent Accelerator Processor Interface (CAPI)

IBM Data Engine for NoSQL – Power Systems Edition

Accelerating performance with the Generic Work Queue Engine (GenWQE)

© Copyright IBM Corporation 2015

Continue growing your IBM skills

ibm.com/training provides acomprehensive portfolio of skills and careeraccelerators that are designed to meet all your training needs.

• Training in cities local to you - where and when you need it, and in the format you want– Use IBM Training Search to locate public training classes

near to you with our five Global Training Providers– Private training is also available with our Global Training Providers

• Demanding a high standard of quality – view the paths to success – Browse Training Paths and Certifications to find the

course that is right for you

• If you can’t find the training that is right for you with our Global Training Providers, we can help.– Contact IBM Training at [email protected]

44

Global Skills Initiative

http://www-304.ibm.com/jct03001c/services/learning/ites.wss/zz/en?pageType=tp_search_new

http://www-304.ibm.com/services/learning/ites.wss/zz/en?pageType=page&c=a0003096

http://www-03.ibm.com/certify/

mailto:[email protected]

Devices & Hardware

Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelerate your business