Accelerating Full Waveform Inversion via OpenCL on AMD GPUs - Case Study / Webinar

Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs

©2014 Acceleware Ltd. All rights reserved.

Chris Mason, Acceleware Product Manager

March 5, 2014

About Acceleware

Software and services company specializing in HPC product development, developer training and consulting services

OpenCL training for AMD GPUs

– Progressive lectures and hands-on lab exercises

– Experienced instructors

– Delivered worldwide

– Find out more: http://acceleware.com/opencl-training?source=amd-fwi-webinar

High performance consulting

– Feasibility studies

– Porting and optimization

– Code commercialization

– Find out more: http://acceleware.com/services?source=amd-fwi-webinar

Acceleware Software

Seismic Applications

– Survey design and 3D modeling: http://acceleware.com/seismic-forward-modeling?source=amd-fwi-webinar

– Reverse Time Migration: http://acceleware.com/rtm?source=amd-fwi-webinar

Electromagnetics

– FDTD Solver: http://acceleware.com/fdtd-solvers?source=amd-fwi-webinar

Radio Frequency Heating

– Simulation application for the RF heating of hydrocarbon reserves: http://acceleware.com/rf-heating?source=amd-fwi-webinar

Outline

Watch the recording of this webinar: http://acceleware.com/blog/webinar-accelerating-fwi-opencl-amd-gpus?source=amd-fwi-webinar

What is Full Waveform Inversion?

The Project

OpenCL

Optimizations

– Coalescing

– Iterative kernel for stencil operations

– Fusing kernels together to eliminate redundant memory accesses

Key Performance Results

What is Full Waveform Inversion?

Seismic inversion technique

Used to build Earth models from recorded seismic data

Uses a finite-difference solution to the acoustic wave equation

Computationally expensive

What is FWI?

[Figures: from a basic starting point... to an accurate velocity model]

FWI Algorithm

Initial Model Estimate

Forward Propagate Source → Residuals

Back Propagate Residuals → Gradient

Forward Propagation(s) → Step Length

Update Model

Increase Frequency

Loop over shots

Loop over frequencies

Loop until convergence
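The flowchart steps above can be read as a nest of three loops. The following is only a control-flow skeleton under stated assumptions: the slide names the loops but the nesting order used here (frequencies outermost, convergence iterations in the middle, shots innermost) is an assumption, as is running the step-length propagations once per shot. The type `FwiCost`, the function `fwi_cost_skeleton` and all of its parameters are hypothetical; the function performs no physics and only counts wave propagations so that the cost structure is visible.

```c
/* Hypothetical counters standing in for real wave propagations. */
typedef struct {
    long forward;   /* forward propagations */
    long backward;  /* back propagations */
    long updates;   /* model updates */
} FwiCost;

/* Skeleton of the FWI loop nest. num_iters stands in for "loop until
 * convergence"; line_search_trials is the assumed number of extra forward
 * propagations per shot used to pick the step length. */
FwiCost fwi_cost_skeleton(int num_freqs, int num_iters, int num_shots,
                          int line_search_trials)
{
    FwiCost cost = {0, 0, 0};
    for (int f = 0; f < num_freqs; ++f) {            /* loop over frequencies */
        for (int it = 0; it < num_iters; ++it) {     /* loop until convergence */
            for (int s = 0; s < num_shots; ++s) {    /* loop over shots */
                cost.forward += 1;                   /* source -> residuals */
                cost.backward += 1;                  /* residuals -> gradient */
                cost.forward += line_search_trials;  /* step length */
            }
            cost.updates += 1;                       /* update model */
        }
        /* "increase frequency" is the next pass of the outer loop */
    }
    return cost;
}
```

Under these assumptions the number of propagations grows as frequencies × iterations × shots, which is why the propagation kernels dominate the runtime.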

FWI Compute Cost

Requires clusters of tens to hundreds of CPU nodes

Many days of runtime

Accuracy and quality are reduced to keep runtime acceptable

The Project

GeoTomo (https://www.geotomo.com/) develops high-end geophysical software products that help geophysicists around the world image the subsurface

GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution

GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly

– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen



Why use GPUs? Performance!

                                      AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                      59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                  ~410                  4000                5910
Peak Gflops (double)                  ~205                  1000                1480
Total Memory                          >>6 GB                6 GB                6 GB
Power Consumption                     140 W                 274 W               375 W
Gflops per Watt (single precision)    <3                    14.59               15.76
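The efficiency row of the table follows from the peak and power rows. A minimal sketch of that arithmetic is below; the function names `gflops_per_watt` and `ratio_to_cpu` are illustrative only, and the inputs are the peak figures quoted in the table, not measured values.

```c
/* Peak single-precision Gflops per watt of rated power. */
double gflops_per_watt(double peak_gflops, double power_watts)
{
    return peak_gflops / power_watts;
}

/* How many times larger a GPU figure is than the matching CPU figure. */
double ratio_to_cpu(double gpu_value, double cpu_value)
{
    return gpu_value / cpu_value;
}
```

For example, `gflops_per_watt(5910.0, 375.0)` reproduces the 15.76 entry for the S10000, `gflops_per_watt(4000.0, 274.0)` agrees with the W9000 entry to within rounding, and `ratio_to_cpu(480.0, 59.7)` shows the S10000 has roughly eight times the memory bandwidth of the Opteron.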

OpenCL Overview

Parallel computing architecture standardized by the Khronos Group (http://www.khronos.org/)

OpenCL:

– Is a royalty-free standard

– Provides an API to coordinate parallel computation across heterogeneous processors

Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads

– Defines a cross-platform programming language

– Is used on everything from handheld/embedded devices to supercomputers



OpenCL Programming Model

Heterogeneous model, including provisions for a host connected to one or more devices

– Example: GPUs, CPUs

[Diagram: a host connected to Device 1 (GPU), Device 2 (GPU), … Device N (GPU)]

The OpenCL Programming Model

Data-parallel portions of an algorithm are executed on the device as kernels

– Kernels are C functions with some restrictions and a few language extensions

– Many (parallel) work-items execute the kernel

The host executes serial code between device kernel launches

– Memory management

– Data exchange to/from device (usually)

– Error handling

[Diagram: host code alternating with two device kernel launches; each launch is an ND range of work-groups, one laid out 2×3 (Work-Group (0,0) to Work-Group (1,2)) and one laid out 3×2 (Work-Group (0,0) to Work-Group (2,1))]

OpenCL Memory Model

OpenCL kernels have access to four distinct memory regions:

– Global: allows read/write access from all work-items in all work-groups; persistent across kernels

– Local: memory shared by all work-items within a work-group

– Constant: region of memory that remains constant (read-only) during the execution of a kernel

– Private: memory that is private to a work-item

OpenCL vendors map memory regions into physical resources

– Local/constant/private memory usually has several orders of magnitude lower capacity than global memory but is orders of magnitude faster

OpenCL Syntax – Memory Spaces

Host and device have separate memory spaces

– Data is explicitly moved between them, typically over the PCIe bus

Host functions allocate, copy, and free memory on the device, e.g.

– clCreateBuffer()

– clEnqueueReadBuffer()

– clEnqueueWriteBuffer()

– clReleaseMemObject()

Putting It All Together

Operation: Cx = Ax + Bx, with one work-item per element

    A0 A1 A2 A3 A4 A5 A6 A7
    B0 B1 B2 B3 B4 B5 B6 B7
    C0 C1 C2 C3 C4 C5 C6 C7

    __kernel void VectorAdd(__global const float* a,
                            __global const float* b,
                            __global float* c)
    {
        // Global index of this work-item in dimension 0
        int idx = get_global_id(0);
        c[idx] = a[idx] + b[idx];
    }

Each work-item has a unique index, typically used to index into arrays
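To make the one-work-item-per-element mapping concrete without an OpenCL runtime, the kernel can be mimicked in plain C. This is a serial stand-in, not OpenCL: the loop variable `gid` plays the role of `get_global_id(0)`, and the names `vector_add_work_item` and `vector_add_emulated` are hypothetical. On a GPU the iterations of the loop would run in parallel as separate work-items.

```c
#include <stddef.h>

/* Body of the kernel for a single work-item; gid stands in for
 * get_global_id(0). */
static void vector_add_work_item(size_t gid, const float *a, const float *b,
                                 float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Serial stand-in for a 1D ND range of n work-items. */
void vector_add_emulated(const float *a, const float *b, float *c, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        vector_add_work_item(gid, a, b, c);
}
```

Calling `vector_add_emulated(a, b, c, 8)` covers the eight-element case drawn on the slide.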

Vector Add – Host Code

    #include <CL/cl.h>

    void VectorAdd(float* aH, float* bH, float* cH, int N)
    {
        size_t N_BYTES = (size_t)N * sizeof(float);
        size_t globalSize = (size_t)N;  // one work-item per element

        // Device management code
        …

        // Allocate memory on device
        cl_mem aD = clCreateBuffer(…, N_BYTES, …);
        cl_mem bD = clCreateBuffer(…, N_BYTES, …);
        cl_mem cD = clCreateBuffer(…, N_BYTES, …);

        // Transfer input arrays to device
        clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
        clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

        // Pass kernel arguments and launch kernel
        …
        clEnqueueNDRangeKernel(…, &globalSize, …);

        // Transfer output array to host
        clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);

        // Free device memory
        clReleaseMemObject(aD);
        clReleaseMemObject(bD);
        clReleaseMemObject(cD);
    }

Project Steps

1) Profiling

– Acquired code, datasets and reference benchmarks from GeoTomo

– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers

– Augmented code with timers to determine time spent in parallel regions and other areas of interest

2) Feasibility Analysis

– Investigated memory footprint for FWI jobs

GPU memory limited to 6 GB per card

– Investigated potential speedup / time to port code

Maximum speedup determined by time spent in parallel regions (Amdahl’s Law)

Time to port dependent on feature set

– E.g. domain decomposition across multiple GPUs
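Amdahl's Law, referenced in the feasibility analysis above, bounds the overall speedup by the fraction of runtime spent in the regions being accelerated. The sketch below states the standard formula, speedup = 1 / ((1 - p) + p / s); the function names are illustrative, and the 95% figure in the usage note is a hypothetical value, not a number from the GeoTomo profile.

```c
#include <assert.h>

/* Overall speedup when a fraction p of the runtime (0 <= p <= 1) is
 * accelerated by a factor s (s > 0) and the rest is unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the speedup as s grows without limit: 1 / (1 - p).
 * Requires p < 1. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

If, hypothetically, timers showed 95% of the runtime in parallel regions, `amdahl_limit(0.95)` caps the achievable speedup at 20X no matter how fast the GPU kernels become, and `amdahl_speedup(0.95, 10.0)` gives about 6.9X for kernels that run ten times faster.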

3) Implementation

– Creating test harnesses

– Kernel implementation

– Resolving hardware driver issues

– Enabling multi-GPU device support

– Optimization iterations

4) Wrap-up

– Delivery of port, along with installation documentation

– Trained GeoTomo developer on OpenCL

Key GeoTomo Optimizations

1) Coalescing

– Changing memory access patterns in the kernels to those best suited for GPUs

Global memory is accessed via a request for a multi-byte word

Combine load/store requests from consecutive work-items to reduce the number of requested words

– Fewer requests -> less contention for global memory

Make one big multi-word burst request to global memory whenever possible

– Contiguous bursts -> less global memory overhead
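The effect of the access pattern on the number of memory requests can be illustrated with a simplified counting model. This is a sketch under stated assumptions, not a description of how any particular AMD GPU coalesces: it assumes memory is fetched in aligned segments of a fixed size and that each work-item reads one 4-byte `float` at element index `gid * stride_elems`. The function name `segments_touched` and the 64-byte segment size in the usage note are illustrative choices.

```c
#include <stddef.h>

/* Number of distinct aligned segments of segment_bytes that are touched when
 * num_items consecutive work-items each read one float at element index
 * gid * stride_elems. Fewer segments means fewer memory requests. */
size_t segments_touched(size_t num_items, size_t stride_elems,
                        size_t segment_bytes)
{
    size_t count = 0;
    size_t last_segment = (size_t)-1;
    for (size_t gid = 0; gid < num_items; ++gid) {
        size_t byte_addr = gid * stride_elems * sizeof(float);
        size_t segment = byte_addr / segment_bytes;
        /* Addresses never decrease with gid, so comparing against the
         * previous segment is enough to detect a new one. */
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With 64 work-items and 64-byte segments, unit stride touches 4 segments, a stride of 2 touches 8, and a stride of 16 touches 64, one per work-item. Arranging the kernels so that consecutive work-items read consecutive elements keeps the count at the low end.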

2) Iterative kernel for stencil operations

[Figure: Input Volumes * Stencil Kernels]

– Outputs are weighted combinations of surrounding elements from input volumes

– Off-axis weights are zero

Acknowledgement: Paulius Micikevicius, 2009

Naïve implementation would have each work-item read all of its neighboring elements directly from global memory

– Possible to hit maximum GPU memory bandwidth but redundant reads hurt performance

Alternative: Iterating over 2D slices along slowest dimension

– A single work-item is responsible for a column of the output array

– Work-group caches 2D plane of input in local memory

– Work-items store inputs in direction of iteration in registers

– Reduces required number of global memory reads significantly
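The saving from keeping inputs in registers along the direction of iteration can be seen in a one-column serial sketch. It is a plain C stand-in for a single work-item marching along z, with several simplifying assumptions: only the z-axis terms of the stencil are computed (the local-memory caching of the 2D plane for x and y neighbours is omitted), the radius is fixed at 2, and a counter stands in for global-memory traffic. All names here (`RADIUS`, `counted_load`, `stencil_z_naive`, `stencil_z_marching`) are hypothetical.

```c
#include <stddef.h>

#define RADIUS 2

/* Read counter standing in for global-memory traffic. */
static size_t g_global_reads;

static float counted_load(const float *in, size_t i)
{
    ++g_global_reads;
    return in[i];
}

/* Naive: every output re-reads its whole z neighbourhood. Returns reads. */
size_t stencil_z_naive(const float *in, float *out, size_t nz,
                       const float w[RADIUS + 1])
{
    g_global_reads = 0;
    for (size_t z = RADIUS; z + RADIUS < nz; ++z) {
        float acc = w[0] * counted_load(in, z);
        for (size_t k = 1; k <= RADIUS; ++k)
            acc += w[k] * (counted_load(in, z - k) + counted_load(in, z + k));
        out[z] = acc;
    }
    return g_global_reads;
}

/* Marching: the 2*RADIUS+1 window lives in a small array (registers on a
 * GPU); each step shifts it by one and loads a single new value. */
size_t stencil_z_marching(const float *in, float *out, size_t nz,
                          const float w[RADIUS + 1])
{
    float q[2 * RADIUS + 1];
    g_global_reads = 0;
    if (nz < 2 * RADIUS + 1)
        return 0;
    for (size_t i = 0; i < 2 * RADIUS; ++i)      /* prime the window */
        q[i + 1] = counted_load(in, i);
    for (size_t z = RADIUS; z + RADIUS < nz; ++z) {
        for (size_t i = 0; i < 2 * RADIUS; ++i)  /* shift by one element */
            q[i] = q[i + 1];
        q[2 * RADIUS] = counted_load(in, z + RADIUS);
        float acc = w[0] * q[RADIUS];
        for (size_t k = 1; k <= RADIUS; ++k)
            acc += w[k] * (q[RADIUS - k] + q[RADIUS + k]);
        out[z] = acc;
    }
    return g_global_reads;
}
```

For a column of 16 elements the naive version performs 60 reads and the marching version 16, while both produce the same outputs. In the GPU kernel the x and y neighbours would additionally be served from the 2D plane cached in local memory.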

[Figure: single work-item view, showing values held in registers and values held in local memory]

Acknowledgement: Paulius Micikevicius, 2009

3) Kernel Fusion

– Fuse kernels that operate on the same volume into a single kernel

– Improves performance by eliminating redundant global memory reads

4) Kernel Fission

– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification

– Allows for more work-items to run concurrently on GPU, improving masking of global memory latency
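Kernel fusion can be illustrated with two passes over the same input volume. The sketch below is a serial C stand-in with made-up operations (scaling by 2 and adding 1); the function names and the load counter are hypothetical and only serve to show that the fused version reads each input element once instead of twice.

```c
#include <stddef.h>

/* Counts reads of the shared input volume. */
static size_t g_volume_loads;

static float load_volume(const float *v, size_t i)
{
    ++g_volume_loads;
    return v[i];
}

/* Two separate passes ("kernels"): each one re-reads the volume. */
size_t scale_then_offset_unfused(const float *v, float *scaled,
                                 float *shifted, size_t n)
{
    g_volume_loads = 0;
    for (size_t i = 0; i < n; ++i)
        scaled[i] = 2.0f * load_volume(v, i);
    for (size_t i = 0; i < n; ++i)
        shifted[i] = load_volume(v, i) + 1.0f;
    return g_volume_loads;
}

/* One fused pass: each element is read once and used for both outputs. */
size_t scale_and_offset_fused(const float *v, float *scaled,
                              float *shifted, size_t n)
{
    g_volume_loads = 0;
    for (size_t i = 0; i < n; ++i) {
        float x = load_volume(v, i);
        scaled[i] = 2.0f * x;
        shifted[i] = x + 1.0f;
    }
    return g_volume_loads;
}
```

The trade-off against fission is visible even here: the fused loop body holds more live values at once, which on a GPU translates into higher register use per work-item, so the two techniques have to be balanced kernel by kernel.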

Performance Results

FWI 15 Hz, 15 shots

– GPU version: 7997 seconds

– CPU (5 cores per shot): 67086 seconds [GPU is 8.4X faster]

– CPU (30 cores per shot): 166948 seconds [GPU is 20.9X faster]

GPU: Sapphire Radeon HD 7970 GHz Edition

– 6GB model
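The bracketed factors are the ratio of each CPU runtime to the GPU runtime. A minimal sketch of that arithmetic, with illustrative function names, using only the runtimes quoted above:

```c
/* Speedup of an accelerated run relative to a baseline run. */
double speedup(double baseline_seconds, double accelerated_seconds)
{
    return baseline_seconds / accelerated_seconds;
}

/* Convenience conversion for reading the runtimes in hours. */
double seconds_to_hours(double seconds)
{
    return seconds / 3600.0;
}
```

`speedup(67086.0, 7997.0)` is about 8.4 and `speedup(166948.0, 7997.0)` is about 20.9. In hours, the GPU run takes roughly 2.2 hours against roughly 18.6 and 46.4 hours for the two CPU configurations.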

“Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”

James Jackson, President, GeoTomo

Questions?

Contact Us

– Tel: +1 403.249.9099

– Email: [email protected]

OpenCL Courses

– June 3-6, 2014, Calgary, Canada

– Private onsite classes also available

– Find out more: http://acceleware.com/opencl-training?source=amd-fwi-webinar

OpenCL Consulting

– Feasibility studies

– Code commercialization

– Porting and optimization

– Mentoring

– Find out more: http://acceleware.com/services?source=amd-fwi-webinar

Watch the recording of this webinar: http://acceleware.com/blog/webinar-accelerating-fwi-opencl-amd-gpus?source=amd-fwi-webinar