29
PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE GPUS Fanny Nina-Paravecino, Leming Yu, Qianqian Fang*, David Kaeli Department of Electrical and Computer Engineering Department of Bioengineering*

PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PORTABLE PERFORMANCE FOR MONTE CARLO

SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND

MULTIPLE GPUS Fanny Nina-Paravecino,

Leming Yu, Qianqian Fang*, David Kaeli Department of Electrical and Computer Engineering

Department of Bioengineering*

Page 2: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

SIMULATION OF PHOTON TRANSPORT INSIDE HUMAN BRAIN

•  Photon migration in 3D turbid media

•  Prediction of experimental outcomes

•  Simulation is a time-consuming task

2 GTC April 4-7, 2016 | Silicon Valley

Page 3: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

GTC April 4-7, 2016 | Silicon Valley 3

MCX.SPACE

Page 4: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX AROUND THE WORLD

� Over 30,000 unique visits made from 148 countries

� Accumulative download is over 12,000 worldwide

� Over 900 registered users, from more than 350 institutions/companies around the world

GTC April 4-7, 2016 | Silicon Valley 4

Page 5: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX STATISTICS

GTC April 4-7, 2016 | Silicon Valley 5

Page 6: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

OUTLINE

�  Portable Performance Monte Carlo Extreme (MCX) �  MCX in CUDA �  Persistent Threads in CUDA (MCX) �  Portable Performance MCX �  Other enhacements �  Results

�  MCX on multiple GPUs �  Performance Model �  Partitioning Schemes �  Performance Results

6 GTC April 4-7, 2016 | Silicon Valley

Page 7: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PORTABLE PERFORMANCE MCX

Photons initialization

3D voxelated

media

7 GTC April 4-7, 2016 | Silicon Valley

Page 8: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MONTE CARLO EXTREME (MCX) �  Estimates the 3D light (fluence) distribution by

simulating a large number of independent photons

�  Most accurate algorithm for a wide ranges of optical properties, including low-scattering/voids, high absorption and short source-detector separation

�  Computationally intensive, so a great target for GPU acceleration

�  Widely adopted for bio-optical imaging applications: �  Optical brain functional imaging �  Fluorescence imaging of small animals for drug development �  Gold stand for validating new optical imaging instrumentation

designs and algorithms

8 GTC April 4-7, 2016 | Silicon Valley

Page 9: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX APPLICATIONS

Imaging of bone marrow in the tibia

Imaging of a complex mouse model using Monte Carlo simulations

Simulation of photons inside human brain

9 GTC April 4-7, 2016 | Silicon Valley

Page 10: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX IN CUDA [1]

Thread i Thread i+1 …

Launch a new photon

Compute a new scattering length

Propagate photon until cross voxel boundary

Compute attenuation based on absorption

Accumulate photon energy loss to the

volume

End of scattering

path?

Total photon # reached?

Terminate thread

Exceeding time gate?

Compute a new scattering direction vector

Global Memory

y

y

n n

y n

Seed GPU RNG with CPU RNG

(optional) Repetition complete?

Retrieve solution

Normalize & save solution

CPU GPU

Start

End of simulation

Loop of repetitions

[1] Q. Fang and D. A. Boas. "Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units." Optics express 17.22 (2009): 20178-20190.

10 GTC April 4-7, 2016 | Silicon Valley

Page 11: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERSISTENT THREADS (PT) IN MCX �  PT kernels alter the notion of a virtual thread lifetime,

treating those threads as physical hardware threads �  PT kernels provide a view that threads are active for

the entire duration of the kernel �  We schedule only as many threads as the GPU SMs can

concurrently run �  The threads remain active until end of kernel execution

Worker thread

Thread initializes and enter thread

loop Thread loops continuously

Thread exits loop, clean up, and shut down

11 GTC April 4-7, 2016 | Silicon Valley

Page 12: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PORTABLE PERFORMANCE MCX

autoBlock =MaxThreadsPerMP /MaxBlocksPerMPautoThread = autoBlock *MaxBlocksPerMP*MP

Feature Fermi Kepler Maxwell MaxThreadBlocks/

MP 8 16 32

Maxthreads/MP 1536 2048 2058 MP 16 14 22

CUDA cores/MP 32 192 128

12 GTC April 4-7, 2016 | Silicon Valley

Page 13: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

OTHER ENHANCEMENTS

� Autopilot improvement � Developed customized operation such

as: � mcx_nextafter

� Reduced the use of SharedMemory �  Enables more threads to be launch

� Avoided branch divergence by using indexes

13 GTC April 4-7, 2016 | Silicon Valley

Page 14: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

IMPROVEMENT PER ENHANCEMENT

14 GTC April 4-7, 2016 | Silicon Valley

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

GK110

980Ti

Autopilot

Reducing Shared Memory

Increasing Local Memory/ Hide Latency

Avoid branch divergence/ Customized function

Overall Performance

2.4x

1.4x

Page 15: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERFORMANCE MCX - RESULTS

Arch GPU Photons/ms (Baseline)

Photons/ms Speedup

Fermi GTX 590 2044.99 2901.92 1.4x Kepler GT 730 529.89 1263.74 2.4x Kepler GK110 2383.22 5238.34 2.2x Maxwell 980Ti 12268.98 19157.09 1.4x

0

5000

10000

15000

20000

25000

GTX 590 GT 730 GK110 980Ti

Phot

ons

GPUs

Performance (photons/ms)

15 GTC April 4-7, 2016 | Silicon Valley

�  Baseline: MCX version Sep 12, 2015

Page 16: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX AS A BENCHMARK

GTC April 4-7, 2016 | Silicon Valley 16

MCX_core.ptx

MCX_core.sass 0.00

200.00 400.00 600.00 800.00

1000.00 1200.00 1400.00 1600.00 1800.00

Baseline After Improvement

After Improvement with Hack

0.14 0.14

0.14

1799.76

1550.63

1368.96

Size

(KB

)

CUDA 7.5 - Maxwell Compute 5.2 (980Ti)

+10x

-10x

Performance is changing dramatically •  Same input •  Same code of sequence

Page 17: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX ON MULTIPLE GPUS

Page 18: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MOTIVATION

�  Monte Carlo eXtreme (MCX) simulation in OpenCL � Distribute workloads among different devices

� NVIDIA GPUs / AMD GPUs / CPUs

thread

thread

thread

GPU1

MCXCL

GPU2

GPU3

Platform Partitioning Scheme

18 GTC April 4-7, 2016 | Silicon Valley

Page 19: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

METHODOLOGY

� Predict the kernel execution time �  Evaluate the kernel runtime � Develop the performance model

� Partitioning Schemes

Core-based

The number of parallel compute

units

Throughput

Application throughput

(photons/ms)

Iterative

Throughput-based iterative

partitioning

Fminimax

Nonlinear linear programming solution for

minimax problem

19 GTC April 4-7, 2016 | Silicon Valley

Page 20: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERFORMANCE MODEL �  Measure the kernel execution time on various devices

�  Simulate 1M to 25M photon migrations

20 GTC April 4-7, 2016 | Silicon Valley

Page 21: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERFORMANCE MODEL �  Given n devices: D1, D2, … Dn �  Given linear performance for each device �  Given the performance for 1M and 2M for each device �  We can obtain the linear equation for each device as

follows:

y1 = a1x1 + c1Device 1 :

.

. . .

Device 2 : 2222 cxay +=

Device n : nnnn cxay +=

21 GTC April 4-7, 2016 | Silicon Valley

Page 22: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PARTITIONING SCHEME ELABORATION

Core-based Initialization

Iteratively evaluate throughput-based

partitioning

Stop when achieving the

max throughput

Iterative Approximation

ComputeUnitsiComputeUnitsi∑

ThroughputiThroughputi∑

22 GTC April 4-7, 2016 | Silicon Valley

Page 23: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERFORMANCE RESULTS

0

5000

10000

15000

20000

25000

30000

10M 100M 10M 100M

GTX 980 Ti + GTX 590 + GT 730 K40c + K20c

Core-based Throughput Iterative Fminimax

Max throughput 30323 photons/ms Max throughput 9688 photons/ms

10M 100M Core-based 35.01% 41.65%

Throughput 59.31% 93.42%

Iterative 68.85% 93.77%

Fminimax 68.85% 93.77%

10M 100M Core-based 85.31% 97.56%

Throughput 80.39% 87.89%

Iterative 80.39% 87.89% Fminimax 80.39% 87.89%

Throughput Utilization Throughput Utilization

23 GTC April 4-7, 2016 | Silicon Valley

Page 24: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

PERFORMANCE RESULTS

0 500

1000 1500 2000 2500 3000 3500 4000 4500

10M 100M 10M 100M

AMD 7970M + Intel i7-3740QM AMD 7970 + Fiji + Intel i7-4770

Core-based Throughput Iterative Fminimax

Max throughput 4529 photons/ms Max throughput 19176 photons/ms

10M 100M Core-based 19.32% 18.69%

Throughput 18.81% 27.14%

Iterative 18.78% 27.91%

Fminimax 18.78% 27.91%

10M 100M Core-based 15.10% 19.06%

Throughput 16.38% 21.10%

Iterative 16.38% 21.10% Fminimax 16.38% 21.10%

Throughput Utilization Throughput Utilization

24 GTC April 4-7, 2016 | Silicon Valley

Page 25: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

SUMMARY �  We have improved the performance of MCX

across a range of NVIDIA GPU architectures �  We have showed how to exploit Persistent Thread

kernel to automatically tune MCX kernel �  We developed an iterative scheme to search the

best partition to run MCX on multiple accelerators �  We obtained an 24% and 44% throughput

utilization improvement (Iterative vs Core-based) for 10M and 100M photon simulations, respectively

25 GTC April 4-7, 2016 | Silicon Valley

Page 26: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

FUTURE WORK �  Instrumentation of MCX

�  Leverage SASSI to instrument MCX and better characterize the behavior of a kernel to guide auto-tuning

� MCX on Multiple GPUs �  Evaluate our partitioning optimization for

multiple devices

26 GTC April 4-7, 2016 | Silicon Valley

Page 27: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

MCX CHALLENGE �  Interested in improving performance of

MCX over 40% compared to current version? � Monetary reward will be announced soon.

Stay tuned to mcx.space

27 GTC April 4-7, 2016 | Silicon Valley

Page 28: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

ACKNOWLEDGEMENT �  This project is funded by the NIH/NIGMS

under the grant R01-GM114365 � We would like to acknowledge NVIDIA

for their support for this work through the NVIDIA Research Center program

28 GTC April 4-7, 2016 | Silicon Valley

Page 29: PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF … · 2016. 4. 13. · PORTABLE PERFORMANCE FOR MONTE CARLO SIMULATION OF PHOTON MIGRATION IN 3D TURBID MEDIA FOR SINGLE AND MULTIPLE

THANK YOU! QUESTIONS?

[email protected] [email protected]