ORNL is managed by UT-Battelle
for the US Department of Energy
Exploratory Visualization of Petascale Particle Data in Nvidia DGX-1
Benjamin Hernandez, [email protected]
Advanced Data and Workflows Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Oak Ridge Leadership Computing Facility (OLCF)
• Mission: Provide the computational and data resources required to solve the most challenging problems.
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive 10x to 100x more resources than at other generally available centers.
• We partner with users to enable science & engineering breakthroughs.
Sight: Exploratory Visualization of Scientific Data
• Client/server architecture to provide high-end visualization on laptops, desktops, and powerwalls.
• Heterogeneous scientific visualization
– Take advantage of both CPU & GPU resources within a node: DGX-1 use case.
– Advanced shading to enable new insights into data exploration.
• Parallel I/O & Data Staging
– “Pluggable” for in-situ visualization
• Lightweight tool
– Load your data
– Perform exploratory analysis
– Visualize/Save results
Sight System Architecture (in progress)
[Architecture diagram] Server (DGX-1 or multi-GPU node): CPU cores run OSPRay and VTK-m; the GPUs run NVIDIA OptiX and VTK-m. Data is staged from the local/parallel file system or an HPC cluster through the ADIOS I/O system; rendered visualization frames are compressed and streamed over WebSockets to an HTML client.
*OSPRay and NVIDIA OptiX are finely tuned ray tracing libraries for multicore and manycore architectures, respectively.
ADIOS
• ADIOS is an I/O framework
– Provides multiple methods to stage data to a staging area (on node, off node, off machine)
– Data output can be anything one wants
• Different methods allow for different types of data movement, aggregation, and arrangement to the storage system, or for streaming over local nodes, LAN, or WAN
– Provides its own file format (ADIOS-BP) if you choose to use it
– Compresses/decompresses data in parallel
– Contains mechanisms to index and query data
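As an illustration of the XML-driven setup, a minimal ADIOS 1.x configuration might look like the sketch below; the group name "atoms" and its variables are hypothetical, not taken from the talk:

```xml
<adios-config host-language="C">
  <!-- variables the application will write -->
  <adios-group name="atoms">
    <var name="natoms"    type="integer"/>
    <var name="positions" type="real" dimensions="natoms,3"/>
  </adios-group>
  <!-- the transport method can be swapped (POSIX, MPI, DATASPACES, ...)
       without changing application code -->
  <method group="atoms" method="MPI"/>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>
```

Switching the `method` attribute is how the same application code targets a file system, a staging area, or a remote stream.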
First Approach: OpenGL
• VBO (points)
• Vertex shader: apply transfer function
• Geometry shader: expand each point into a quad with texture coordinates
• Fragment shader: sphere generation and shading
OpenGL Bindless Graphics
• Initialization: query the GPU address pointer of the VBO
• Display: vertices span from vboAddress to vboAddress + sizeof(float) * size
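The two stages can be sketched with the NV bindless extension entry points (NV_shader_buffer_load / NV_vertex_buffer_unified_memory); this is untested pseudocode, and vboAddress, size, and numParticles are assumed variables:

```cpp
// Initialization: make the VBO resident and query its GPU address
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddress);

// Display: bind the raw address range instead of a buffer object;
// vertices span vboAddress .. vboAddress + sizeof(float) * size
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 0);
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddress, sizeof(float) * size);
glDrawArrays(GL_POINTS, 0, numParticles);
```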
Fragment Shader Sphere Generation
[Figure: screen-space quad with texture coordinates at corners (-1,-1), (1,-1), (-1,1), (1,1)]
• Sphere equation:
r² = (x − x₀)² + (y − y₀)² + (z − z₀)²
// unit sphere: r = 1.0, centered at the origin
x = texCoord.x
y = texCoord.y
zz = 1.0 - x*x - y*y   // zz = z²
if (zz <= 0.0)         // removes fragments outside the circle
    discard;
z = sqrt(zz)
// scale to the desired radius
// calculate diffuse illumination
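The discard test and depth reconstruction above are plain arithmetic, so they can be checked on the CPU; a minimal C++ sketch (the function name sphereDepth is hypothetical):

```cpp
#include <cmath>

// Reconstruct the impostor sphere depth from quad texture coordinates
// in [-1, 1]; returns false when the fragment lies outside the unit
// circle and would be discarded by the fragment shader.
bool sphereDepth(double x, double y, double& z) {
    double zz = 1.0 - x * x - y * y;  // z² from r² = x² + y² + z² with r = 1
    if (zz <= 0.0)
        return false;                 // outside the silhouette: discard
    z = std::sqrt(zz);                // front-facing depth
    return true;
}
```

At the quad center the depth is 1 (the sphere pole); exactly on the silhouette (x² + y² = 1) the fragment is discarded.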
Results
OpenGL Multi-GPU Rendering
• One MPI task for each device
– Easy to implement
– Each device initializes its own GLX/EGL context
• Multi-threading: one thread per device
– EGL makes this possible:
• Create the main context in the main thread:
mainCtx = eglCreateContext(display, config, 0, contextAttrs);
• Each additional thread creates a shared context:
lclThrdCtx = eglCreateContext(display, config, mainCtx, contextAttrs);
• Use mutexes/semaphores to synchronize any updates
– Vulkanize your viz!
• Devices are aware of other devices and can coordinate with each other
– That's precisely what NVIDIA OptiX can do
Second Approach: Nvidia Optix Ray Tracing Engine
• “The OptiX API is an application framework for achieving optimal ray tracing performance on the GPU. It provides a simple, recursive, and flexible pipeline for accelerating ray tracing algorithms.”
• “Similar to OpenGL in doing the “heavy lifting” of ray tracing and leaving capability and technique to the developers”
– Plus it can use all GPUs available in your system
– Naturally fits material appearance and scene illumination
Nvidia Optix Programming Model
• Optix provides eight programmable components, some of which are:
1. Ray generation
2. Intersection
3. Shading (closest hit)
4. Shadows (any hit)
5. Selector
• Shaders use CUDA-like syntax
Nvidia Optix Graph Nodes
• Geometry is defined by graph nodes.
• A tree-like hierarchy where:
– Nodes at the bottom describe geometric objects.
– Nodes at the top describe collections of geometric objects.
[Figure: example graph — a root Group with Transform and Selector children leading to GeometryGroups; each GeometryGroup holds GeometryInstances that reference Geometry objects, with Acceleration structures attached to the groups]
Nvidia Optix Graph Nodes
• “Keep the hierarchy as flat as possible…”
[Figure: a single GeometryGroup with one GeometryInstance holding the particles and one Acceleration structure]
• But not too flat!
[Figure: a root Group with its own Acceleration, containing several GeometryGroups; each holds a GeometryInstance with a subset of the particles and its own Acceleration structure]
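Using the OptiX C++ wrapper API, the flat layout could be assembled roughly as follows (an untested sketch; the variable names, the single-material setup, and the "top_object" variable from the OptiX samples are assumptions):

```cpp
optix::Context ctx = optix::Context::create();

optix::Geometry particles = ctx->createGeometry();   // holds particle buffers,
particles->setPrimitiveCount(numParticles);          // bounds + intersection programs

optix::GeometryInstance gi =
    ctx->createGeometryInstance(particles, &material, &material + 1);

optix::GeometryGroup gg = ctx->createGeometryGroup();
gg->setChildCount(1);
gg->setChild(0, gi);
gg->setAcceleration(ctx->createAcceleration("Trbvh", "Bvh"));  // builder used in the tests

ctx["top_object"]->set(gg);
```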
Results: Test Systems
• Workstation
– CPU: Intel Xeon, 20 cores… 512 GB RAM
– GPU: Titan Z (2 GeForce Kepler GPUs, 2x6 GB VRAM)
– Ubuntu 16, Nvidia driver
• Rhea node
– CPU: Intel Xeon …
– GPU: 2x Tesla K80 (4 Tesla Kepler GPUs, 2x24 GB VRAM)
– RedHat 7, Nvidia driver
• DGX-1
– CPU: Intel …
– GPU: 8x Tesla Pascal SMX, 8x16 GB VRAM
– Ubuntu 16, Nvidia driver …
All systems:
– CUDA 8.0
– OptiX 4.0.2
– Acceleration structure: Trbvh
– Image resolution: 1080p
– Shading: Phong illumination & ambient occlusion
Results: how fast is the acceleration structure built? (lower is better)
[Bar chart: build time in ms, log scale (1–1000 ms), for 1 million, 10 million, and 20 million particles on the Workstation, Rhea node, and DGX-1]
Results: Performance (lower is better)
[Two bar charts: ms per frame for 30, 59, 127, and 600 million particles on the Workstation, Rhea node, and DGX-1; left, worst-case frame rate (0–275 ms scale); right, best-case frame rate (0–30 ms scale); 30 fps and 60 fps reference lines shown]
Results
Discussion
• DGX-1 can handle particle systems up to 10x larger in our test environment.
• For particle systems of the same size, DGX-1 is 10x faster than the workstation and 4.6x faster than a Rhea node.
– We expect the DGX-1 speedup to increase at larger image resolutions.
• Our preliminary tests showed DGX-1 has enough compute power to drive a powerwall
– 3840 x 1080 @ 60–120 fps
• Next: test larger resolutions
– Researchers are usually happy when they can explore datasets even at 1 fps
Discussion
• Nvidia Optix provides multi-GPU support with no hassle
– Test whether Nvidia Optix leaves free resources for analysis tasks.
• Paging was removed in Optix 4.x
– DGX-1 includes 40 CPU cores and 512 GB RAM
– Using Nvidia Optix & the OSPRay library together will allow allocating the full system to handle larger particle systems.
• Summit is likely to support EGL through the Nvidia GPU drivers (do not take this as a fact, nor as an alternative fact!)
– Best if (pre)exascale visualization tools are 100% CUDA compliant.
Questions?
Benjamin Hernandez, [email protected]
Advanced Data and Workflows Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Acknowledgements:
Dylan Lacewell and the Nvidia Optix Team for their technical support.
Datasets provided by Cheng-Yu Shi and Leonid Zhigilei from the Computational Materials Group at University of Virginia.
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Further Reading
• Tom True, Alina Alt (2013) “Configuring, Programming and Debugging Applications for Multiple GPUs” GTC 2013
• Wil Braithwaite “Multi-GPU Programming for Visual Computing” SIGGRAPH 2013
Available in GTC on Demand:
– http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
• Adios Manual
– http://users.nccs.gov/~pnorbert/ADIOS-UsersManual-1.11.0.pdf
• Optix Tutorial Talks
– http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=optix&searchItems=&sessionTopic=&sessionEvent=&sessionFormat=&submit=&select=+
2018 INCITE Call for Proposals
• The 2018 INCITE Call for Proposals opened April 17, 2017 and closes June 23, 2017.
• Features large allocations of computer time and supporting resources at the Argonne and Oak Ridge Leadership Computing Facility (LCF) centers, operated by the US Department of Energy (DOE) Office of Science.
• Soliciting research proposals for awards of time on the 27-petaflop Cray XK7, Titan, and the 10-petaflop IBM Blue Gene/Q, Mira. In addition, certain 2018 INCITE awards will receive time on Argonne’s new Intel/Cray system, a 9.65-petaflop machine called Theta.
• The INCITE program seeks research proposals for capability computing
– Production simulations, including ensembles, that use a large fraction of the LCF systems, or
– Proposals that require the unique LCF architectural infrastructure for high-performance computing projects that cannot be performed elsewhere
• The INCITE program is open to US and non-US based researchers.
• The INCITE program invites you to participate in an INCITE Proposal Writing Webinar, offered on April 19, May 18, and June 6.
• For more information visit http://www.doeleadershipcomputing.org/
Results: how fast is the acceleration structure built? (nvprof output)

Workstation:
Time(%) Time Calls Avg Min Max Name
49.63% 53.663ms 111 483.45us 1.3430us 2.5468ms [CUDA memcpy HtoD]
22.64% 24.478ms 10 2.4478ms 2.4320ms 2.4768ms Megakernel_CUDA_1
20.43% 22.090ms 22 1.0041ms 3.7440us 2.2334ms [CUDA memcpy DtoH]
6.74% 7.2833ms 1 7.2833ms 7.2833ms 7.2833ms Megakernel_CUDA_0
0.30% 324.41us 1 324.41us 324.41us 324.41us [CUDA memcpy HtoA]
0.21% 231.71us 23 10.074us 1.2480us 150.49us [CUDA memset]
0.05% 55.903us 44 1.2700us 1.2150us 1.6320us [CUDA memcpy DtoD]

Rhea node:
Time(%) Time Calls Avg Min Max Name
50.29% 43.010ms 133 323.38us 1.2800us 1.0984ms [CUDA memcpy HtoD]
34.61% 29.596ms 44 672.63us 3.3280us 1.0593ms [CUDA memcpy DtoH]
9.48% 8.1082ms 10 810.82us 623.39us 1.2392ms Megakernel_CUDA_1
5.03% 4.2986ms 1 4.2986ms 4.2986ms 4.2986ms Megakernel_CUDA_0
0.36% 303.77us 23 13.207us 1.1520us 211.29us [CUDA memset]
0.17% 148.67us 1 148.67us 148.67us 148.67us [CUDA memcpy HtoA]
0.06% 51.008us 44 1.1590us 1.1200us 1.1840us [CUDA memcpy DtoD]

DGX-1:
Time(%) Time Calls Avg Min Max Name
43.50% 35.509ms 133 266.99us 1.1200us 1.1818ms [CUDA memcpy HtoD]
30.77% 25.122ms 44 570.95us 1.2800us 1.1629ms [CUDA memcpy DtoH]
18.13% 14.800ms 44 336.36us 1.7920us 371.84us [CUDA memcpy PtoP]
6.16% 5.0304ms 10 503.04us 332.39us 889.03us Megakernel_CUDA_1
1.11% 904.00us 1 904.00us 904.00us 904.00us Megakernel_CUDA_0
0.14% 112.03us 1 112.03us 112.03us 112.03us [CUDA memcpy HtoA]
0.13% 108.03us 23 4.6970us 1.0560us 60.032us [CUDA memset]
0.06% 47.264us 44 1.0740us 1.0560us 1.3440us [CUDA memcpy DtoD]
ADIOS I/O
• Abstracts metadata, data types, and dimensions from the source code into an XML file (C and Fortran APIs)
• Transport methods: POSIX, MPI, MPI_LUSTRE, PHDF5, …, DATASPACES, DIMES, FLEXPATH, ICEE
• Compression: zlib, bzip2, szip, zfp, isobar
• Indexing/query: Alacrity
• All data in adios_write() calls are buffered before writing to the file system
ADIOS I/O
• Generate the C code from the XML file:
gpp.py atoms.xml
produces gwrite_Atoms.ch and gread_Atoms.ch
• Both files contain code to write and read ADIOS files
– You only need to modify your XML file and generate new *.ch files
– The main code remains the same
ADIOS – Write/Read Example
[Code screenshots: write example (left), read example (right)]
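In the spirit of the ADIOS 1.x C API, the write and read sides might look like the untested sketch below; the group "atoms", the variable names, and groupsize are illustrative only:

```cpp
// Write
adios_init("atoms.xml", comm);
int64_t fd;
adios_open(&fd, "atoms", "atoms.bp", "w", comm);
uint64_t total;
adios_group_size(fd, groupsize, &total);   // bytes this rank will write
adios_write(fd, "natoms", &natoms);
adios_write(fd, "positions", positions);   // buffered until close
adios_close(fd);
adios_finalize(rank);

// Read
ADIOS_FILE* f = adios_read_open_file("atoms.bp", ADIOS_READ_METHOD_BP, comm);
adios_schedule_read(f, NULL, "positions", 0, 1, positions);  // NULL selection: whole array
adios_perform_reads(f, 1);                                   // 1 = blocking
adios_read_close(f);
```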