Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs
Chris Mason, Acceleware Product Manager
March 5, 2014
©2014 Acceleware Ltd. All rights reserved.

Accelerating Full Waveform Inversion via OpenCL on AMD GPUs - Case Study / Webinar


DESCRIPTION

To view the corresponding video, please visit: http://bit.ly/1iBiW17

This webinar takes you through a case study of accelerating a seismic algorithm on a cluster of AMD GPU compute nodes for a geophysical software provider. Acceleware Product Manager Chris Mason presents a programming example, step-by-step project phase profiling, optimization techniques, a look at the strategy behind taking advantage of the massively parallel GPU architecture, and run-time performance results.

Chris has eight years of experience developing commercial applications for GPUs and multi-core CPUs. His previous experience also includes parallelization of algorithms on digital signal processors (DSPs) for cellular phones and base stations. His specialty is in electromagnetic simulations, medical imaging, signal processing and linear algebra.

Sign up for the developer newsletter and learn about future webinars here: http://bit.ly/176wril
For more training options from Acceleware, visit http://bit.ly/MRn6Gn
Share your ideas with other developers at http://bit.ly/P5ohUo


Case Study: Accelerating Full Waveform Inversion

via OpenCL™ on AMD GPUs

©2014 Acceleware Ltd. All rights reserved.

Chris Mason, Acceleware Product Manager

March 5, 2014


About Acceleware

Software and services company specializing in HPC product development, developer training and consulting services

OpenCL training for AMD GPUs

– Progressive lectures and hands-on lab exercises

– Experienced instructors

– Delivered worldwide

– Find out more

High performance consulting

– Feasibility studies

– Porting and optimization

– Code commercialization

– Find out more


Acceleware Software

Seismic Applications

– Survey design and 3D modeling

– Reverse Time Migration

Electromagnetics

– FDTD Solver

Radio Frequency Heating

– Simulation application for the RF heating of hydrocarbon reserves


Outline

Watch the recording of this webinar

What is Full Waveform Inversion?

The Project

OpenCL

Optimizations

– Coalescing

– Iterative kernel for stencil operations

– Fusing kernels together to eliminate redundant memory accesses

Key Performance Results


What is Full Waveform Inversion?

Seismic inversion technique

Used to build Earth models from recorded seismic data

Uses a finite-difference solution to the acoustic wave equation

Computationally expensive


What is FWI?

From a basic starting point...

... to an accurate velocity model


FWI Algorithm

1) Initial model estimate

2) Forward propagate source → residuals

3) Back propagate residuals → gradient

4) Forward propagation(s) → step length

5) Update model

6) Increase frequency

Loops in the flowchart:

– Loop over shots

– Loop over frequencies

– Loop until convergence
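The loop structure is what makes FWI expensive: every shot needs at least one forward and one back propagation per iteration, and the step-length search adds more forward runs. The sketch below only counts propagations. It assumes one plausible nesting (frequencies outermost, then iterations, then shots), a fixed iteration count standing in for "until convergence", and a hypothetical parameter n_step_trials for the number of forward propagations per shot in the step-length search. None of these details are stated on the slide.

```c
#include <assert.h>

/* Propagation counts for one FWI run under an ASSUMED loop nesting:
 *   for each frequency band
 *     repeat n_iterations times (stand-in for "until convergence")
 *       for each shot: 1 forward + 1 back propagation
 *       step length: n_step_trials forward propagations per shot
 */
typedef struct {
    long forward;
    long backward;
} fwi_propagation_count;

fwi_propagation_count fwi_count_propagations(int n_frequencies,
                                             int n_iterations,
                                             int n_shots,
                                             int n_step_trials)
{
    assert(n_frequencies >= 0 && n_iterations >= 0);
    assert(n_shots >= 0 && n_step_trials >= 0);

    fwi_propagation_count count = {0, 0};
    for (int f = 0; f < n_frequencies; ++f) {
        for (int it = 0; it < n_iterations; ++it) {
            for (int s = 0; s < n_shots; ++s) {
                count.forward  += 1;  /* forward propagate source -> residuals */
                count.backward += 1;  /* back propagate residuals -> gradient  */
            }
            /* step-length search: extra forward propagations for each shot */
            count.forward += (long)n_step_trials * n_shots;
            /* the model update would happen here */
        }
    }
    return count;
}
```

With 3 frequency bands, 10 iterations, 15 shots and 2 step-length trials, this counts 1350 forward and 450 back propagations, each of which is a full finite-difference wave simulation.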


FWI Compute Cost

Cluster size of 10s to 100s of CPU nodes

Many days of runtime

Accuracy and quality reduced to keep runtime acceptable


The Project

GeoTomo develops high-end geophysical software products that help geophysicists around the world image the subsurface

GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution

GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly

– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen


Why use GPUs? Performance!

                                     AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                     59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                 ~410                  4000                5910
Peak Gflops (double)                 ~205                  1000                1480
Total Memory                         >>6 GB                6 GB                6 GB
Power Consumption                    140 W                 274 W               375 W
Gflops per Watt (single precision)   <3                    14.59               15.76
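The last row of the table follows from the two rows above it: peak single-precision Gflops divided by power consumption. A minimal sketch of that arithmetic, where the function name gflops_per_watt is illustrative:

```c
#include <assert.h>

/* Peak throughput per watt: peak Gflops divided by power draw in watts. */
double gflops_per_watt(double peak_gflops, double power_watts)
{
    assert(power_watts > 0.0);
    return peak_gflops / power_watts;
}
```

For example, gflops_per_watt(5910.0, 375.0) gives the 15.76 figure quoted for the S10000, and the CPU's ~410 Gflops at 140 W comes out just under 3.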


OpenCL Overview

Parallel computing architecture standardized by the Khronos Group

OpenCL:

– Is a royalty free standard

– Provides an API to coordinate parallel computation across heterogeneous processors

Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads

– Defines a cross-platform programming language

– Used on handheld/embedded devices through supercomputers


OpenCL Programming Model

Heterogeneous model, including provisions for a host connected to one or more devices

– Example: GPUs, CPUs

[Diagram: a host connected to Device 1 (GPU), Device 2 (GPU), … Device N (GPU)]


The OpenCL Programming Model

Data-parallel portions of an algorithm are executed on the device as kernels

– Kernels are C functions with some restrictions and a few language extensions

– Many (parallel) work-items execute the kernel

The host executes serial code between device kernel launches

– Memory management

– Data exchange to/from device (usually)

– Error handling

[Diagram: execution alternates between host and device. Each kernel launch on the device runs over an ND range of work-groups: Work-Group (0,0) to (1,2) in the first launch, and Work-Group (0,0) to (2,1) in the second.]
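The relationship between work-groups and work-items in an ND range can be written out as plain index arithmetic. This is a host-side C sketch for a single dimension with no global offset. In a real kernel the same values come from the OpenCL built-ins get_group_id(), get_local_size(), get_local_id() and get_global_id(). The function names here are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Global index of a work-item along one dimension of the ND range
 * (no global offset): group_id * local_size + local_id. */
size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Number of work-groups along one dimension, assuming the global size
 * is an exact multiple of the work-group size. */
size_t num_groups_1d(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For instance, work-item 5 of work-group 2 with a work-group size of 64 has global index 133, and a global size of 256 with the same work-group size yields 4 work-groups.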


OpenCL Memory Model

OpenCL kernels have access to four distinct memory regions:

– Global

Allows read/write access from all work-items in all work-groups

Persistent across kernels

– Local

Memory that is local to all work-items within a work-group

– Constant

Region of memory that remains constant (read-only) during the execution of a kernel

– Private

Memory that is private to a work-item

OpenCL vendors map memory regions into physical resources

– Local/constant/private memory usually several orders of magnitude lower capacity but orders of magnitude faster than global memory


OpenCL Syntax – Memory Spaces

Host and device have separate memory spaces

– Data is explicitly moved between them

Typically over PCIe bus

Host functions to allocate, copy, and free memory on device, e.g.

– clCreateBuffer()

– clEnqueueReadBuffer()

– clEnqueueWriteBuffer()

– clReleaseMemObject()


Putting It All Together

Operation: Cx = Ax + Bx, with one work-item per element

    A0 A1 A2 A3 A4 A5 A6 A7
    B0 B1 B2 B3 B4 B5 B6 B7
    C0 C1 C2 C3 C4 C5 C6 C7

    __kernel void VectorAdd(__global float* a,
                            __global float* b,
                            __global float* c)
    {
        int idx = get_global_id(0);
        c[idx] = a[idx] + b[idx];
    }

Each work-item has a unique index, typically used to index into arrays


Vector Add – Host Code

    void VectorAdd(float* aH, float* bH, float* cH, int N)
    {
        size_t N_BYTES = N * sizeof(float);
        size_t globalSize = N;  // global work size is passed as a size_t

        // Device management code

        // Allocate memory on device
        cl_mem aD = clCreateBuffer(…, N_BYTES, …);
        cl_mem bD = clCreateBuffer(…, N_BYTES, …);
        cl_mem cD = clCreateBuffer(…, N_BYTES, …);

        // Transfer input arrays to device
        clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
        clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

        // Pass kernel arguments and launch kernel
        clEnqueueNDRangeKernel(…, &globalSize, …);

        // Transfer output array to host
        clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);

        // Free device memory
        clReleaseMemObject(aD);
        clReleaseMemObject(bD);
        clReleaseMemObject(cD);
    }
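One practical detail the host code glosses over: in OpenCL 1.x, when an explicit work-group size is given, the global work size must be a multiple of it. A common approach is to round the global size up and have the kernel skip the padding work-items. The sketch below emulates that on the CPU in plain C so it can be run without an OpenCL device. The bounds-checked kernel with an extra n argument is an assumed variant of the slide's VectorAdd, not the original, and all function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of local_size, so the padded value can be
 * used as a global work size together with that work-group size. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* CPU stand-in for one work-item of a bounds-checked VectorAdd kernel.
 * idx plays the role of get_global_id(0). */
static void vector_add_work_item(size_t idx, const float *a, const float *b,
                                 float *c, size_t n)
{
    if (idx < n) {  /* padding work-items beyond n do nothing */
        c[idx] = a[idx] + b[idx];
    }
}

/* Serial emulation of the kernel launch: visits every index in the padded
 * range, just as the device would schedule every work-item. */
void vector_add_emulated(const float *a, const float *b, float *c,
                         size_t n, size_t local_size)
{
    size_t global_size = round_up_global_size(n, local_size);
    for (size_t idx = 0; idx < global_size; ++idx) {
        vector_add_work_item(idx, a, b, c, n);
    }
}
```

With n = 1000 and a work-group size of 64, the padded global size is 1024, and the 24 extra work-items fall through the bounds check.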


Project Steps

1) Profiling

– Acquired code, datasets and reference benchmarks from GeoTomo

– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers

– Augmented code with timers to determine time spent in parallel regions, areas of interest


Project Steps

2) Feasibility Analysis

– Investigated memory footprint for FWI jobs

GPU memory limited to 6GB per card

– Investigated potential speedup / time to port code

Maximum speed up determined by time spent in parallel regions (Amdahl’s Law)

Time to port dependent on feature set

– E.g. domain decomposition across multiple GPUs
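Both feasibility questions reduce to short calculations. Amdahl's Law gives the overall speedup when a fraction p of the runtime is accelerated by a factor s, namely 1 / ((1 - p) + p / s), and the memory check compares the total size of the single-precision volumes against the card's capacity. In this sketch the function names and the idea of counting a fixed number of equally sized volumes are illustrative assumptions. The actual footprint model used in the project is not given in the slides.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction of the runtime
 * (parallel_fraction) is accelerated by parallel_speedup. */
double amdahl_speedup(double parallel_fraction, double parallel_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(parallel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / parallel_speedup);
}

/* Upper bound on overall speedup as the accelerated part becomes free. */
double amdahl_limit(double parallel_fraction)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}

/* Rough footprint check: do n_volumes single-precision volumes of
 * nx * ny * nz cells fit in capacity_bytes of device memory?
 * Overflow of the product is not handled in this sketch. */
int volumes_fit(unsigned long long nx, unsigned long long ny,
                unsigned long long nz, unsigned long long n_volumes,
                unsigned long long capacity_bytes)
{
    unsigned long long bytes = nx * ny * nz * n_volumes * sizeof(float);
    return bytes <= capacity_bytes;
}
```

If 95% of the runtime sits in parallel regions and the GPU runs those regions 20 times faster, the overall speedup is only about 10.3x, and it can never exceed 20x however fast the GPU is. Ten 512 x 512 x 512 float volumes need about 5 GiB and fit on a 6 GB card; thirteen do not.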


Project Steps

3) Implementation

– Creating testing harnesses

– Kernel implementation

– Resolving hardware driver issues

– Enabling multi-GPU device support

– Optimization iterations

4) Wrapup

– Delivery of port, along with installation documentation

– Trained GeoTomo developer on OpenCL


Key GeoTomo Optimizations

1) Coalescing

– Changing memory access patterns in the kernels to those best suited for GPUs

Global memory is accessed via a request for a multi-byte word

Combine load/store requests from consecutive work-items to reduce the number of requested words

– Fewer requests → less contention for global memory

Make one big multi-word burst request to global memory whenever possible

– Contiguous bursts → less global memory overhead
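The effect of the access pattern can be illustrated by counting how many aligned memory segments a group of consecutive work-items touches. In the sketch below, work-item i reads element i * stride_elems. The segment size is an assumed parameter for illustration, not a figure from the slides or a property of any specific AMD GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct aligned memory segments touched when n_items
 * consecutive work-items each load one element, with work-item i reading
 * element index i * stride_elems. segment_bytes is an ASSUMED transaction
 * size used only for illustration. */
size_t segments_touched(size_t n_items, size_t stride_elems,
                        size_t elem_bytes, size_t segment_bytes)
{
    assert(stride_elems > 0 && elem_bytes > 0 && segment_bytes > 0);
    size_t count = 0;
    size_t last_segment = 0;
    for (size_t i = 0; i < n_items; ++i) {
        size_t segment = (i * stride_elems * elem_bytes) / segment_bytes;
        /* addresses only increase, so comparing with the previous
         * segment is enough to detect a new one */
        if (i == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With 64 work-items reading 4-byte floats and an assumed 64-byte segment, unit stride touches 4 segments while a stride of 512 elements touches 64. In a 3D volume stored with x fastest, this is the difference between mapping consecutive work-items to x and mapping them to a slower dimension.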


Key GeoTomo Optimizations

2) Iterative kernel for stencil operations

[Diagram: input volumes combined (*) with stencil kernels]

– Outputs are weighted combinations of surrounding elements from input volumes

– Off-axis weights are zero

Acknowledgement: Paulius Micikevicius, 2009
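A plain C reference version of such a stencil makes the access pattern concrete. The sketch assumes a half-width (RADIUS) of 4, symmetric weights shared by all three axes, and a volume stored with x fastest. None of these specifics come from the slides. It also counts input reads for a naive implementation where every output point fetches all of its neighbours itself.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 4  /* assumed stencil half-width along each axis */

/* Linear index into a volume stored with x fastest, then y, then z. */
static size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Axis-aligned stencil at one interior point: weighted centre plus
 * symmetric neighbours along x, y and z only (off-axis weights are zero).
 * w[0] is the centre weight, w[r] the weight at distance r. */
float stencil_point(const float *in, const float w[RADIUS + 1],
                    size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    float sum = w[0] * in[idx3(x, y, z, nx, ny)];
    for (size_t r = 1; r <= RADIUS; ++r) {
        sum += w[r] * (in[idx3(x - r, y, z, nx, ny)] + in[idx3(x + r, y, z, nx, ny)]
                     + in[idx3(x, y - r, z, nx, ny)] + in[idx3(x, y + r, z, nx, ny)]
                     + in[idx3(x, y, z - r, nx, ny)] + in[idx3(x, y, z + r, nx, ny)]);
    }
    return sum;
}

/* Apply the stencil to every interior point; boundary cells of out are left
 * untouched. Returns the number of input reads a naive implementation makes:
 * 6 * RADIUS + 1 per output point. */
size_t stencil_volume(const float *in, float *out, const float w[RADIUS + 1],
                      size_t nx, size_t ny, size_t nz)
{
    assert(nx > 2 * RADIUS && ny > 2 * RADIUS && nz > 2 * RADIUS);
    size_t reads = 0;
    for (size_t z = RADIUS; z < nz - RADIUS; ++z)
        for (size_t y = RADIUS; y < ny - RADIUS; ++y)
            for (size_t x = RADIUS; x < nx - RADIUS; ++x) {
                out[idx3(x, y, z, nx, ny)] = stencil_point(in, w, x, y, z, nx, ny);
                reads += 6 * RADIUS + 1;
            }
    return reads;
}
```

With RADIUS 4, every output costs 25 input reads, and neighbouring outputs re-read mostly the same inputs. That redundancy is what a GPU implementation tries to remove.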


Key GeoTomo Optimizations

Naïve implementation would have each work-item read all of its neighboring elements directly from global memory

– Possible to hit maximum GPU memory bandwidth but redundant reads hurt performance


Key GeoTomo Optimizations

Alternative: Iterating over 2D slices along slowest dimension

– A single work-item is responsible for a column of the output array

– Work-group caches 2D plane of input in local memory

– Work-items store inputs in direction of iteration in registers

– Reduces required number of global memory reads significantly

[Diagram: single work-item view, showing which inputs are held in registers and which in local memory]

Acknowledgement: Paulius Micikevicius, 2009
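The register part of this scheme can be emulated on the CPU for a single work-item. The sketch below marches one column along the slowest (z) dimension and keeps a sliding window of 2 * RADIUS + 1 values, so each input element in the column is read exactly once. It covers only the terms along the direction of iteration. The in-plane terms, which the work-group would take from its local memory tile, are left out. RADIUS, the function name, and the read counter are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 4  /* assumed stencil half-width */

/* March one column of length nz along the slowest dimension.
 * column[z] stands for in(x, y, z) with x and y fixed by the work-item;
 * window[] plays the role of the work-item's registers.
 * Returns the number of reads from column, a stand-in for global reads. */
size_t march_column(const float *column, float *out,
                    const float w[RADIUS + 1], size_t nz)
{
    assert(nz > 2 * RADIUS);
    float window[2 * RADIUS + 1];
    size_t reads = 0;

    /* Prime the window with the first 2 * RADIUS values, stored one slot
     * to the right so the first shift in the loop moves them into place. */
    for (size_t i = 0; i < 2 * RADIUS; ++i) {
        window[i + 1] = column[i];
        ++reads;
    }

    for (size_t z = RADIUS; z < nz - RADIUS; ++z) {
        /* Shift the window by one and read the single new value. */
        for (size_t i = 0; i < 2 * RADIUS; ++i) {
            window[i] = window[i + 1];
        }
        window[2 * RADIUS] = column[z + RADIUS];
        ++reads;

        /* window[RADIUS] now holds column[z]. */
        float sum = w[0] * window[RADIUS];
        for (size_t r = 1; r <= RADIUS; ++r) {
            sum += w[r] * (window[RADIUS - r] + window[RADIUS + r]);
        }
        out[z] = sum;
    }
    return reads;
}
```

For a column of nz elements the window version performs nz reads, whereas fetching every neighbour for every output would take (nz - 2 * RADIUS) * (2 * RADIUS + 1).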


Key GeoTomo Optimizations

3) Kernel Fusion

– Reduce redundant memory accesses by fusing kernels that operate on the same volume together

– Improves performance by reducing redundant global memory reads

4) Kernel Fission

– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification

– Allows for more work-items to run concurrently on GPU, improving masking of global memory latency
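Kernel fusion can be illustrated with two CPU loops standing in for two kernels that walk the same volume. The operations here (scaling a field, then accumulating its square) are hypothetical placeholders, not the actual GeoTomo kernels. The returned counter tracks loads of the field array as a stand-in for global memory reads.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes over the same volume, like two separate kernels.
 * Returns the number of loads from field. */
size_t scale_then_accumulate_unfused(float *field, float *energy,
                                     float scale, size_t n)
{
    size_t loads = 0;
    for (size_t i = 0; i < n; ++i) {   /* "kernel" 1: scale the field */
        field[i] = field[i] * scale;
        ++loads;
    }
    for (size_t i = 0; i < n; ++i) {   /* "kernel" 2: re-reads the field */
        energy[i] += field[i] * field[i];
        ++loads;
    }
    return loads;
}

/* Fused: one pass; the scaled value stays in a local variable (a register
 * on the GPU) instead of being read back from memory. */
size_t scale_and_accumulate_fused(float *field, float *energy,
                                  float scale, size_t n)
{
    size_t loads = 0;
    for (size_t i = 0; i < n; ++i) {
        float v = field[i] * scale;
        ++loads;
        field[i] = v;
        energy[i] += v * v;
    }
    return loads;
}
```

Both versions produce the same arrays, but the fused one loads each field element once instead of twice. The trade-off named under kernel fission applies in the other direction: a fused kernel needs more registers per work-item, which can lower occupancy.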


Performance Results

FWI 15 Hz, 15 shots

– GPU version 7997 seconds

– CPU (5 cores per shot) 67086 seconds [8.4X]

– CPU (30 cores per shot) 166948 seconds [20.9X]

GPU: Sapphire Radeon HD 7970 GHz Edition

– 6GB model
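The bracketed factors are the CPU runtimes divided by the GPU runtime. A minimal sketch of that arithmetic, with an illustrative function name:

```c
#include <assert.h>

/* Speedup of the accelerated run relative to a baseline run. */
double speedup(double baseline_seconds, double accelerated_seconds)
{
    assert(accelerated_seconds > 0.0);
    return baseline_seconds / accelerated_seconds;
}
```

Here speedup(67086.0, 7997.0) is about 8.39 and speedup(166948.0, 7997.0) is about 20.88, which round to the 8.4X and 20.9X quoted on the slide.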


Performance Results

“Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”

James Jackson, President, GeoTomo


Questions?

Contact Us

– Tel: +1 403.249.9099

– Email: [email protected]

OpenCL Courses

– June 3-6, 2014, Calgary, Canada

– Private onsite classes also available

– Find out more

OpenCL Consulting

– Feasibility studies

– Code commercialization

– Porting and optimization

– Mentoring

– Find out more

Watch the recording of this webinar
