34

Click here to load reader

CSIRO and OpenCL

Embed Size (px)

DESCRIPTION

Following our successful participation at SIGGRAPH Asia 2012 in Singapore, the Khronos Group is excited to demonstrate and educate about Khronos APIs at SIGGRAPH Asia 2013 in Hong Kong. This presentation covers OpenCL an Accelerated Science, by Tomasz Benarz, Computational Research Scientist at CSIRO

Citation preview

Page 1: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 1

Accelerated Science – use of OpenCL in Land Down Under

Tomasz Bednarz Computational Research Scientist, CSIRO

Page 2: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 2

Outline • CSIRO - Positive Impact

• Bragg – the CSIRO GPU Cluster

• CSIRO Accelerated Service

• MASSIVE

• Radiation therapy

• Level set segmentation

• CFD

• Cloud based imaging

• Other projects

• OpenCL for Game AI Pro

• Sydney Khronos Chapter

• ACW at OzViz 2013

Page 3: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 3

CSIRO - Positive Impact • In the world’s Top 10 institutions for 3

research fields.

• Top 1% of global research institutions

in 14 of 22 research fields.

• Trusted advisor, innovative.

• 2000 doctorates, 500 masters.

Page 4: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 4

Bragg – the CSIRO GPU Cluster • A bit of history

- Heterogeneous compute cluster (CPUs + GPUs)

- Funded by CSS TCP (2009) - first of its kind in AU

• Continuously evolving:

- GPUs upgraded in 2010 to NVIDIA Fermi Teslas

- CPUs upgraded in 2012 to Intel Sandy Bridge and 3rd

GPU added to each node

- GPUs upgraded in 2013 to NVIDIA Kepler Teslas

• Spec:

- 128 Dual Xeon 8-core E5-2650 (= 2048 cores)

- 128 GB RAM, FDR10 InfiniBand Interconnect

- 132 Fermi Tesla M2050 GPUs (= 59,136 cores)

- 254 Kepler Tesla M20 GPUs (= 633,984 cores)

- 16 Intel SC5110P Xenon PHI (= 960 cores)

- 289 on the Top500 list (as per June 2013)

Page 5: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 5

CSIRO Accelerated Computing Service • About the Service

- Supported by ASC Applications and Novel Technology Staff

- Not limited to GPU support

- Delivers: - Accelerated Computing Projects

- HPC Workshops and Training

- Benchmarking

• Activities include

- Porting and Parallelising - Targeting HPC clusters, GPUs

- Profiling

- Performance Tuning

- Debugging

- Numerical Error Analysis

- Algorithmic Improvements Sam Moskwa – ACW at OzViz 2011

OpenCL CUDA Python R OpenMP MPI C/C++ OpenACC Fortran

Page 6: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 6

MASSIVE - Two high performance computing

facilities, located at the Australian

Synchrotron and Monash University.

- Specialised imaging and visualisation

software and databases.

- Expertise in visualisation, image

processing, image analysis, HPC and

GPU computing.

- Training and Outreach program.

- NCI Specialised Facility for Imaging

and Visualisation.

For more info contact Dr Wojtek James Goscinski

Page 7: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 7

Scientific Visualisation – CT reconstruction

Insect CT scan, rendered using Drishi (http://anusf.anu.edu.au/Vizlab/drishti/) by Sherry Mayo (CSIRO)

Page 8: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 8

Radiation therapy applications • Modern radiation therapy is to a large extent a computational discipline and can

greatly benefit from use of task- and data-parallelism. Some applications were

demonstrated on GPUs already:

- CT reconstructions

- Image registrations

- Treatment planning

- Dose computations (e.g. X Gu, U Jelen et al 2011 PMB 56)

• Need for speed: imaging and treatment verification can be used as feedback to improve the

treatment (adaptive radiotherapy), currently offline (mostly population-based), one day online.

• Particle (proton/carbon ion) therapy with raster scanning @ University of Marburg:

- most precise external beam technique (only 5 centers worldwide: 3 active, 2 to start)

- increased precision = increased need for verification (more computations)

- longer computational times (small head case: 1 hour on single-thread)

• Collaborative project between CSIRO and University of Marburg

(Adapted from Schlegel and Mahr 2001)

Ammazzalorso, Bednarz, Jelen

Page 9: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 9

Plan robustness in radiation therapy • Automatic discovery of robust beam setups.

• Results (mean and sd for a single beam):

- 4-core Intel Xeon W3530 2.8GHz 12GB RAM + NVIDIA Tesla C2050 3GB RAM

- 10 skull base cases, 42 beams directions (10 runs each for timing stats)

- 4k-40k pencils of 120-350 samples, 2 mm analysis radius (0.5 mm step)

- Single-precision floating-point operations only (sufficient precision)

mean(sd) ms P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 Pool

Native (1 thread)

21299 (6628)

9891 (2837)

6258 (1485)

15768 (4959)

4342 (1136)

10888 (3179)

10117 (2849)

5464 (1470)

8155 (2195)

11388 (3936)

10357 (5941)

GPU OpenCL

219 (109)

122 (51)

88 (38)

148 (56)

61 (24)

160 (65)

151 (64)

52 (22)

109 (46)

126 (61)

124 (75)

Gain 119 x (36)

98 x (34)

87 x (30)

123 x (36)

83 x (25)

81 x (24)

82 x (30)

124 x (42)

90 x (31)

106 x (29)

99 x (36)

CPU OpenCL

6498 (1996)

2552 (615)

1898 (438)

4810 (1495)

1324 (331)

3280 (944)

3051 (841)

1396 (310)

2481 (649)

2935 (818)

3022 (1798)

Gain 3.3 x (0.0)

3.8 x (0.4)

3.3 x (0.0)

3.3 x (0.0)

3.3 x (0.0)

3.3 x (0.0)

3.3 x (0.0)

3.9 x (0.4)

3.3 x (0.0)

3.8 x (0.4)

3.5 x (0.3)

F. Ammazzalorso (Uni-Marburg), T. Bednarz (CSIRO) and U. Jelen (Uni-Marburg)

- Oral poster presentation at Int. Conference on the Use of Computers in Radiation Therapy ICCR2013 (Melbourne, 6-9 May 2013)

- Accepted for journal publication in IOP JPCS (upcoming)

Page 10: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 10

Fast particle therapy dose computation • How OpenCL 2.0 can help?

- Radiation therapy (RT) has need for both task- and data-parallelism and it represents a

perfect potential application field for heterogeneous parallel systems.

- OpenCL 2.0 can help exploiting these system in RT applications and open the way to

new/simpler usage scenarios.

- The robust planning approach requires continuous exchange of large dataset: for now

latency can be hidden with ad-hoc synchronization

→ Great opportunity with OCL2.0 shared virtual memory (SVM).

- Dose computation (especially for ions, where biological weighing is involved) is a more

complex problem, which OCL2.0 dynamic parallelism (DP) can help breaking into simpler

blocks, enabling scheduling of different kernels without host-side and communication

overhead.

- A combination of SVM and DP can open the way to treatment plan optimization on massively

parallel architectures: large working datasets so far unsuitable for GPU computation (at

decent resolutions) and complex scheduling to enable e.g. intermediate dose computation at

each optimization step.

Page 11: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 11

Image segmentation • What is image segmentation?

- Wiki: Segmentation is the process of partitioning a digital image into multiple segments

(sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or

change the representation of an image into something that is more meaningful and easier to

analyze.

• Our Goal: fast, interactive and accurate segmentations even with noisy data

HCA-Vision

Page 12: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 12

Multiple Myeloma deadly cancer of blood plasma cells • Facts

- Despite a rash of new drugs and advances in stem-cell therapy, this

rare bloodborne cancer is still an almost certain death sentence

- A cure remains a long way off

- Plasma cell cancer

- Collections of abnormal cells accumulate in bones, where they cause

bone lesions (abnormal areas of tissue), and in the bone marrow

where they interfere with the production of normal blood cells

- The disease develops in 1–4 per 100,000 people per year.

Micro-CT scans reveal bone

damage from myeloma in a

5T2MM-bearing mouse.

PETER CROUCHER & KARIN VANDERKERKEN

Page 13: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 13

Level sets segmentation - applications

Page 14: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 14

Segmentation using Level Sets • Embed a seed surface in an image

• Iteratively deform the surface along normal according to local properties of the

surface and the underlying image

• Good: competitive accuracy compared to manual segmentation.

• Advantages: the arbitrary complex shapes can be modeled and topological changes such as

merges and splitting are handled implicitly.

iterations

Page 15: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 15

Level Set Workflow

Initialize f0 to Signed Euclidean Distance Transform from mask m

Calculate Data Speed Term D(I)

Calculate First and Second Order Spatial Derivatives

Calculate Curvature Terms

Calculate Gradient of f

Calculate Speed Term F

Update Time Step, and Reinitialize f if needed

Rep

eat

Un

til C

on

verg

ed

• Governing equation ¶f

¶t= - Ñf aD(x )+ (1-a)Ñ×

Ñf

Ñf

é

ëêê

ù

ûúú

the velocity term the mean curvature

(smoothing term)

Page 16: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 16

Input Image Applied Mask f0 as SDF

iterations

Page 17: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 17

Level Set 2D Effect of a

• Governing equation:

¶f

¶t= - Ñf aD(x )+ (1-a)Ñ×

Ñf

Ñf

é

ëêê

ù

ûúú

900 iterations, a = 0.06 900 iterations, a = 0.95 1800 iterations, a = 0.001

0

20

40

60

80

100

120

140

160

C2050 opt C2050 Quadro FX580

Xeon E5520

152.24 sec

15.43 sec 8.94 sec 2.12 sec

Max speedup ~70 X

Page 18: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 18

Level Set Method • Computational Problems

- Computationally intensive for moderate sized data sets

- SDF is not maintained

- Explicit schemes give CFL time step restriction

• Current Apprach

- OpenCL to accelerate adaptive timestepping PDE solver

- OpenMPI to distribute processing to engage multiple GPUs

- Qt – interactive user interface

- Liar/C2Liar – Quantitative Imaging toolbox (CSIRO internal)

- OpenGL + OpenCL for interactive real time volume rendering

• Further Implementation Notes

- The Level Set PDE solver mapped across multiple GPUs using OpenCL and MPI

- 3D volume split evenly among MPI processors

- Ghost slices transmitted between processes at each iteration – minimal overhead

- Simple boundary extension is performed at each iteraction

“Safe Interactions” with bacteria. level set segmentation researched also to track biofilm formation.

Page 19: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 19

Bone Model C4-data-set

Interactive volume visualisation – OpenGL + OpenCL

Page 20: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 20

Bone Model C4-data-set

Interactive volume visualisation – OpenGL + OpenCL

Page 21: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 21

GPU Speed-up Results

0

50

100

150

200

250

0

0.2

0.4

0.6

0.8

1

1.2

3 6 9 12

Number Of Processes

Execution Time (sec per 50 Iterations)

GPU

CPU

Page 22: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 22

Cloud based image analysis and processing •Available now: http://cloudimaging.net.au, see our demos

Page 23: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 23

Cloud based image analysis and processing • Currently using WebGL based 3D viewer Slice:Drop http://slicedrop.com

• Need for WebCL/WebGL for interactive parameters tuning

Page 24: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 24

Experiments with Fluids

Courtesy of High Field Magnet Laboratory, NL

Air Air

N2

gas

Wakayama Jet

Ozoe Lab, Candle

Ozoe Lab, 10T magnet

Page 25: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 25

Verification – Physics vs Simulations

PIT

numerical

Page 26: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 26

Methodology • Governing equations Discretization Implementation Verification Results

Transfer CPU code to heterogeneous architectures Verification Results

• Method tested

- Finite Difference (for discretization), HSMAC Method (for mutual iteration of pressure and

velocity fields), BFC (to solve N-S on complex geometry), Upwind and UTOPIA (for getting

more stable and accurate results), OpenCL (speedup)

• Speedups achieved: ~200-230x

• Easily extended, 2D, 3D, applied to simulate different cases, e.g. smoke, free

surface flows, ventilation, turbulent flows, etc.

• See: https://www.massive.org.au/images/stories/workshops-and-tutorials/computational-fluid-dynamics.pdf

Boundary Fitted Coordinates

Page 27: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 27

Ground Penetrating Radar and FDTD • Finite-difference time-domain method (FDTD) is a computational technique used to model electromagnetic

fields (in our case, radio waves). It provides an approximation to both the spatial and temporal derivatives

that appear in Maxwell’s equations.

• FDTD models how radio waves propagate through space and reflect at boundaries depending on the

properties of the material.

• FDTD code is used to generate synthetic data for a ground

penetrating radar (GPR) system.

• FDTD simulated data can be compared to a real radar system

data from a test environment with known geometry.

With an aim to get modelled information to match the real

system so that predictions of unknown geometry can be made.

Implementation: Based on original code by Andrew Strange. OpenCL version used mixed precision code with single precision image buffers used for constant matrices.

Contact: Josh Bowden CSIRO ASC

Page 28: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 28

Other projects using OpenCL

PCA NIPALS

algorithm –

rapid estimation of

wood properties

Sparse

Hydrodynamic

Ocean Code

(SHOC)

Finite Difference

Time Domain

(FDTD) - Ground

Penetrating Radar

PCA

AWDA-DA

Ground water data

assimilation

Contact: Josh Bowden CSIRO ASC

Page 29: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 29

GPGPU for Games • GPU heavily used for game physics

- Performance vs Accuracy

• Nvidia PhysX and FLEX

- Particles, Fluids and Rigid Bodies on

the GPU

• Bullet Physics

- Version 3.x to have 100% GPU support

• Not all games require cutting edge graphics, which leaves processing time

available for other areas

- Audio Processing, Artificial Intelligence

Conan Bourke AIE

Page 30: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 30

GPGPU Artificial Intelligence • Game Artificial Intelligence doesn’t easily lend itself to massively

parallel implementations

- Inter-Agent Communication

- Complex decision making typically controlled by a non-programmer

designer resulting in lots of dynamic branching

- Numerous levels of world state data required

• Some aspects do…

- Flow Fields

- Influence Maps

- Path Finding

- Steering Behaviors

Conan Bourke, Tomasz Bednarz

Page 31: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 31

GPGPU Steering Behavior • Steering Behaviors are a set of

commonly used algorithms for agent motion - Craig Reynolds “Steering Behaviors for

Autonomous Characters” GDC1999

• Easily adaptable to parallel implementations due to close similarities to particle physics - Position, Velocity etc

• Simple brute-force algorithm can be implemented in OpenCL within a couple hours for vast performance gains - Conan Bourke & Tomasz Bednarz

“Introduction to GPGPU for AI”, Game AI Pro 2013 0

500

1000

1500

2000

2500

3000

3500

4000

51

2

10

24

40

96

81

92

16

38

4

Intel i7-2670QM

NvidiaGTX560

NvidiaTeslaC2050

Conan Bourke, Tomasz Bednarz

Page 32: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 32

Acknowledgments • Collaborators, coworkers, coauthors:

- John Taylor

- Josh Bowden

- Luke Domanski

- Filippo Ammazzarolso

- Urszula Jelen

- Marc Piggott

- Wojtek Goscinski

- Conan Bourke

- Tim Gureyev

- Darren Thompson

- NeCTAR RT035 Project team

- Sam Moskwa

- Matt Adcock

- and many others…

Page 33: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 33

Sydney Khronos Chapter http://www.meetup.com/Sydney-GPU-Users

Last meetup, 17th Oct Tomasz Bednarz: Welcome, the Khronos Group Conan Bourke: The Fall and Rise of OpenGL for Games Rob Manson: The Augmented Web Visit to the CSIRO Vis Lab Pizza

Page 34: CSIRO and OpenCL

© Copyright Khronos Group & CSIRO, 2013 - Page 34

Accelerated Computing Workshop @ OzViz • ACW and OzViz 2013 to be held on 8th-

10th December 2013 in Melbourne

- Technologies: OpenCL, CUDA, etc.

- Hardware architectures: GPU, APU, MIC,

etc.

• Previous events

- OpenCL Workshop at OzViz 2010, Brisbane - Presenters: Mike Houston, Mark Harris, Derek

Gerstmann, Tomasz Bednarz, Craig James,

Con Caris, John Taylor

- ACW at OzViz 2011, Sydney - Keynote: Prof. Takayuki Aoki

- ACW at OzViz 2012, Perth

https://sites.google.com/site/ozvizworkshop/ozviz-2013