Upload
cesar-maciel
View
766
Download
1
Embed Size (px)
Citation preview
Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.9.0
Heterogeneous computing on POWER
IBM and OpenPOWER technologies to accelerate your business
César Diniz MacielExecutive IT SpecialistIBM Corporate Strategy
Session objectives
In this session we present how accelerators providesignificant application performance improvementsand how they can be deployed in a solution. Weevaluate different types of accelerators, and focuson the latest accelerator technologies available forIBM Power Systems and OpenPOWER solutions.
Acknowledgements
I would like to thank the following people for providinginvaluable information on POWER8, CAPI and FPGAs• Jeff Stuecheli, POWER Hardware Architect• Bill Starke, DE, POWER Server Nest Architect• Bruce Wile, STG Hardware Design and Verification• Jonathan Dement, Program Director, • Power Systems and OpenPower Innovation • IBM US
What is heterogeneous computing?From Wikipedia “Heterogeneous computing refers to systems that use more than one kind of processor. These are systems that gain performance not just by adding the same type of processors, but by adding dissimilar processors, usually incorporating specialized processing capabilities to handle particular tasks.“
What is heterogeneous computing?
Not a new concept, widely used in the industry
On-chip accelerators: Cryptography and compression accelerators inside the
POWER8 processor
PCIe based accelerators GPGPUs, such as Nvidia Tesla PCIe adapters for SSL acceleration, cryptography and
compression CAPI adapters for POWER8 systems
Appliance-based accelerators
Netezza appliances to accelerate queries on the IBM DB2 Analytics Accelerator
Why heterogeneous computing? Applications are becoming more complex and demanding more
computing resources Application speedup limited to performance of the slowest algorithm
If algorithm execution can be accelerated, application runs faster – code optimization
If algorithm can be built on silicon, execution speeds up – ASICs, FPGA If algorithm can be broken in multiple pieces, and these can run simultaneously,
application runs faster – parallelization
“Heterogeneous (or asymmetric) chip multiprocessors present uniqueopportunities for improving system throughput, reducing processor power,and mitigating Amdahl’s law. On-chip heterogeneity allows the processor to better match execution resources to each application’s needs and toaddress a much wider spectrum of system loads—from low to high threadparallelism—with high efficiency.”IEEE Computer, November 2005,
Why heterogeneous computing?
Scaling & power Wall Issues, Chip Design and Fabrication Economics, and Time To Market Demands are all intersecting enabling/requiring change to Business As Usual
On-chip accelerators
• Excellent performance and integration• Accelerators are part of the microprocessor design and share
the same silicon die.• Algorithms are implemented in silicon
• Fastest performance• Application transparency – Hypervisor/OS abstract the
accelerator so that applications do not need to be modified• Close integration with processor design and capabilities
• However, on-chip accelerators use same silicon space that could be used for caches, processor features, etc• Tradeoff between accelerators and core features• Once built, they cannot be changed/updated
Examples of performance benefits of acceleratorsPOWER7+ and POWER8 processors include memory compression accelerators, and cryptographic accelerators
Memory Compression
AsymmetricMathematical
Functions
CryptographicEngines
The on-chip loosely coupled accelerators are part of the processor chip and managed by the Power Hypervisor. Multiple partitions can use the accelerators and the Power Hypervisor manages the QoS and address mapping.
Examples of performance benefits of acceleratorsPerformance comparison of Active Memory Expansion (AME) on an SAP workload (SD 2-Tier benchmark) on POWER7 and POWER7+ (output of amepat command)
Significant reduction in core consumption
PCIe-based accelerators
• Provide great flexibility for adding capability to existing systems• Allow many more options in terms of types of accelerators, algorithms, performance
characteristics and implementation devices• GPUs• FPGAs• ASICs
• Easy to replace with newer/faster accelerators• However, PCIe-based accelerators are seen as an I/O device, and therefore
communicate with the processor and main memory through an I/O subsystem• Programming model needs to incorporate the I/O device• Need of OS device drivers for the accelerator• Data needs to be copied from/to main memory to accelerator memory• Data transfer performance limited by the I/O subsystem
Java gzip PCIe-based accelerator
• Feature code #EJ12/ #EJ13 - PCIe3 FPGA Accelerator Adapter• Also known as Generic Work Queue Engine (GenWQE) accelerator• Hardware compression accelerator for AIX and Linux• FPGA-based PCIe adapter, provides high performance, low latency compression
without significant CPU overhead•
• Based on the standard compression library – zlib• Widely used open source C library that provides compression and decompression• zlib supports RFC1950, RFC1951, and RFC1952•
• Enabled transparently in IBM Java 7.1.1 release and later on Linux and AIX on POWER8 processor-based systems
• HW offload enabled by setting • env variables:
• ZLIB_INFLATE_IMPL = 1• ZLIB_DEFLATE_IMPL = 1
Java gzip PCIe-based accelerator - Performance
IBM internal testing
GPUs
From NVIDIA:• • “GPU-accelerated computing is the use of a graphics processing unit (GPU)
together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications...• …GPU-accelerated computing offers unprecedented application performance by
offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster.”
General Purpose GPUs
GPUs are well suited for parallelprocessing tasks. They have thousands of cores than work in parallel
Coupled with a high performanceprocessor and a GPU programming model, significant application acceleration can be achieved
16
NVIDIA K40 GPU
• Systems• Up to 2 K40 GPU in S824L
• GPU Spec• Kepler-2 architecture GPU • ASIC: GK110B
• PCIe interface• PCIe Gen3 x16• Full length / double wide PCIe form factor• Plugs in using existing double wide cassette
• Power• 235W Max power draw :75W via PCIe slot
plus 160W via 8-pin Aux. cable.• OS support on POWER
• Ubuntu 14.10 or later
http://www.nvidia.com/object/tesla-servers.html
16
NVIDIA and POWER8
From http://www.ecmwf.int/sites/default/files/HPC-WS-Appleyard.pdf
NVIDIA on Power
“ The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the highest performance we have ever seen in individual cores, and the close integration with accelerators is outstanding for heterogeneous parallelization. Thanks to the little endian chip and standard CUDA environment it took us less than 24 hours to port and accelerate GROMACS.”
Erik Lindahl, Professor of Biophysics at the Science for Life Laboratory, Stockholm University & KTH. Developer of GROMACS
GPUs are not only for technical computing
“ Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs - A Technology Demonstration”http://on-demand.gputechconf.com/gtc/2015/presentation/S5835-Sina-Meraji.pdf
Java acceleration with GPUs“IBM’s POWER group has partnered with NVIDIA to make GPUs available on a high-performance server platform, promising the next generation of parallel performance for Java applications.”The Next Wave of Enterprise Performance with Java, POWER Systems and NVIDIA GPUs
Performance of java.util.Arrays#sort(int[]) on NVIDIA Tesla K40 GPU (ECC enabled) and IBM Power8 CPU.
KeplerCUDA 5.5 – 7.0
Unified Memory
Buffered Memory
POWER8
PCIe
Current
PascalCUDA 8
Full GPU Paging
Pascal
POWER8+
Near Future
NVLink
SXM2
VoltaCUDA 9
Cache Coherent
POWER9
More in the future
NVLink 2.0
SXM2
VoltaKepler
NVIDIA Roadmap on POWER
FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
• From Wikipedia: “A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a
customer or a designer after manufacturing – hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). “
• It implements the algorithm as a hardware logic circuit• Not as optimized as an ASIC, but faster than running a software-based algorithm
• It can run fast (cycle times of 250 – 500 MHz or more)• It has industry standard interfaces like PCIe Gen3• The major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation members
gzip Encrypt
MonteCarlo
FPGA Library
Source code for FPGAs has traditionallybeen written in RTL* (VHDL** or Verilog).Now, we also have OpenCL, a more programmer friendly language.
*RTL = Register Transfer Level**VHDL = VHSIC*** Hardware Description Language***VHSIC = Very High Speed Integrated Circuit
Why FPGAs
• Transistor Efficiency & Extreme Parallelism• Bit-level operations• Variable-precision floating point• Compare divides on GPU vs. FPGA
• Power-Performance Advantage • >2x compared to a general multicore processor or GPGPU• Unused lookup tables (LUTs) are powered off
• Technology Scaling better than CPU/GPU• FPGAs are not frequency or power limited yet• 3D has great potential
• Dynamic reconfiguration• Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected• allows network as well as compute specialization
Several IT companies looking at FPGA
Baidu FPGA accelerator
Microsoft Bing accelerator
Combining the best of both architectures
Coherent Accelerator base principles
Function Based Acceleration• Main Application executed on Host Processor
Computational heavy functions on Accelerator• Single binary image encapsulates both HW and SW
version of accelerated functions• Application calls Accelerator for common or custom
libraries • Enable an application to work with or without available
Accelerator• Accelerator call faults when function not available• Host processor executes function when Accelerator
not available• No special requirements on data structures• System software virtualization of accelerator’s function(s)
Full Peer to Processor• IBM-designed processor interface
• Maintains a trusted coherent interface to system• Direct communications with application• Accelerator function(s) use an unmodified EA
• Full access to real address space• Utilize processor’s page tables directly• Page faults handled by system software
• Multiple functions can exist in a single accelerator
Customizable HardwareApplication Accelerator • Specific system SW, middleware, or user
application• Written to durable interface provided by PSL
Virtual Addressing• Accelerator can work with same memory
addresses that the processors use• Pointers de-referenced same as the host
application• Removes OS & device driver overhead
Hardware Managed Cache Coherence• Enables the accelerator to participate in “Locks”
as a normal thread Lowers Latency over IO communication model
POWER8 CAPI (Coherent Accelerator Processor Interface)
Ecosystem
CAPP
PCIe
Power Processor
FPGA
I/OEthernet, DASD, etc…
Function n
Function
0
Function
1
Function 2
CAPI
IBM Supplied POWER Service Layer
Coherent Accelerator Processor Interface (CAPI) overview
CAPP PCIe
POWER8 Processor
FPGA
Accelerator Function Unit (AFU)
CAPI
IBM-Supplied POWER Service Layer
Typical I/O model flow
Flow with a coherent model
Shared MemoryNotify Accelerator Acceleration Shared Memory
Completion
DD Call Copy or PinSource Data
MMIO NotifyAccelerator Acceleration Poll / Interrupt
CompletionCopy or Unpin
Result DataReturn From DD
Completion
Advantages of coherent attachment over I/O attachment
Virtual addressing and data caching– Shared memory
– Lower latency for highly referenced data
Easier, more natural programming model– Traditional thread-level programming– Long latency of I/O typically requires
restructuring of application
Enables applications not possible on I/O– Pointer chasing, and so on
Typical I/O Model Flow:
Flow with a Coherent Model:
Shared Mem. Notify Accelerator Acceleration Shared Memory
Completion
DD Call Copy or PinSource Data
MMIO NotifyAccelerator Acceleration Poll / Interrupt
CompletionCopy or UnpinResult Data
Ret. From DDCompletion
ApplicationDependent, butEqual to below
ApplicationDependent, butEqual to above
300 Instructions 10,000 Instructions 3,000 Instructions1,000 Instructions
1,000 Instructions
7.9µs 4.9µs
Total ~13µs for data prep
<400 Instructions <100 Instructions
0.3µs 0.06µs
Total 0.36µs
CAPI vs. I/O Device Driver: Data Prep
Time spent by a GPU-accelerated application
“ Figure 7 shows that the processing phases ofthe FD-OCT algorithm (DC-Removal, Resample, DispersionCompensation, FFT and Logarithmic Scaling) account for 40% of ∼the total run-time time, while the memory (data)transfers (host-to-device and device-to-host) require approximately 60% of the time.”
From: Scalable, High Performance Fourier Domain Optical Coherence Tomography: why FPGAs and not GPGPUsJian Li, Marinko V. Sarunic, and Lesley ShannonSchool of Engineering ScienceSimon Fraser University, Burnaby BC, Canada
strategy ( )
CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI coherent Attach• Issues Read/Write Commands from applications to eliminate 97% of code
pathlength• Saves 20-30 cores per 1M IOPs
Pin buffers, Translate, Map DMA, Start I/O
Application
Read/Write Syscall
Interrupt, unmap, unpin,Iodone scheduling
20K instructions reduced to <500
Disk and Adapter DD
strategy ( ) iodone ( )
FileSystem
Application
User Library
Posix Async I/O Style API
Shared Memory Work Queue
aio_read()aio_write()
iodone ( )
LVM
Innovative “In-Memory” NoSQL/KVSIBM Data Engine for NoSQL
24:1 Reduction in infrastructure
2.4x Price reduction
12x Less Energy
6x Less rack space
40TB of extended memory
4U
Demonstrating the Value of CAPI AttachmentIBM Data Engine for NoSQL
Comparison of FFT on three paradigms
Same FFT Algorithm as CAPI, but large latency impactP8 Core working much harder to deliver data to FPGA as device driver code is invoked
Small FFTs can be implemented in a fully streamed fashion on FPGAPerformance scales with host-AFU bandwidthUse radix-2 pipeline for up to 4 GB/sRadix-2 needs 2 complex samples per cycle → 16B / 250MHz cycle → 4 GB/sSwitch to (more efficient) radix-4 for up to 8 GB/s
Software FFT from the IBM Engineering and Scientific Subroutine Library (ESSL)Poor performance for small FFTs using multi-threaded softwareLittle data reuse and strided access patterns lead to software inefficiencies
P8Core
P8Core FPGA
P8Core
FPGA
CAPI
FFT Results
CAPI projects – PDT planning
• FPGA-CAPI accelerator for PDT (Photodynamic Therapy)• Monte-Carlo analysis of light scattering in tissue• Used to plan for non-invasive cancer treatment• Project led by Jeff Cassidy (University of Toronto) and • Lothat Lilge (Princess Margaret Cancer Centre), Canada,• with IBM Austin support.
CAPI projects – Text Analytics
Query A Query B Query C Query D Query E Query F0
100
200
300
400
500
600
700
800
SW 1 thread
FPGA 4 streams
Throughput [MB/s]
SW 1 thread SW 64 threads FPGA 4 streams
Performance
Additional CAPI projects
• Algo acceleration for risk analysis for High Performance Trading• Biometric (image and voice recognition) acceleration for secure• identification for banking system• Flash acceleration (NVMe)• Video transcoding and compression for IPTV/VoD• Video analytics• • .. And many more• • • • If you have an application, or know an application that can benefit from FPGA and CAPI...
Talk to IBM!!!
References
First Annual OpenPOWER Summit – presentations and solutions on CAPI and GPU accelerators
Porting GPU-Accelerated Applications to POWER8 Systems
POWER8 Coherent Accelerator Processor Interface (CAPI)
IBM Data Engine for NoSQL – Power Systems Edition
Accelerating performance with the Generic Work Queue Engine (GenWQE)
© Copyright IBM Corporation 2015
Continue growing your IBM skills
ibm.com/training provides acomprehensive portfolio of skills and careeraccelerators that are designed to meet all your training needs.
• Training in cities local to you - where and when you need it, and in the format you want– Use IBM Training Search to locate public training classes
near to you with our five Global Training Providers– Private training is also available with our Global Training Providers
• Demanding a high standard of quality – view the paths to success – Browse Training Paths and Certifications to find the
course that is right for you
• If you can’t find the training that is right for you with our Global Training Providers, we can help.– Contact IBM Training at [email protected]
44
Global Skills Initiative