More Charm++/TAU examples

Applications: NAMD and the Parallel Framework for Unstructured Meshing (ParFUM)

Features:
• Profile snapshots
  • Captures the application's runtime, segmented into user-specified intervals.
• CUDA profiling
  • Tracks time spent in CUDA kernel routines.
  • Shows scaling behavior for an experiment that varies the number of devices used.
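The snapshot idea can be modeled in a few lines. The sketch below is a toy Python model, not TAU's API: the class and method names (`SnapshotProfiler`, `enter`, `exit`, `snapshot`) are hypothetical. Time between `enter` and `exit` accumulates per routine in the current interval, and each `snapshot` call closes that interval under a user-chosen label and starts an empty one.

```python
import time
from collections import defaultdict


class SnapshotProfiler:
    """Toy model of snapshot profiling (hypothetical; not TAU's API).

    Time measured between enter() and exit() accumulates per routine in
    the current interval; snapshot() closes that interval under a
    user-chosen label and starts an empty one.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock                 # injectable, e.g. for testing
        self._current = defaultdict(float)  # routine -> seconds this interval
        self._open = {}                     # routine -> start timestamp
        self.snapshots = []                 # [(label, {routine: seconds})]

    def enter(self, routine):
        self._open[routine] = self._clock()

    def exit(self, routine):
        # Simplification: a routine still running when snapshot() is called
        # is charged entirely to the interval in which it exits.
        start = self._open.pop(routine)
        self._current[routine] += self._clock() - start

    def snapshot(self, label):
        # Freeze the finished interval and begin a fresh one.
        self.snapshots.append((label, dict(self._current)))
        self._current = defaultdict(float)
```

Calling `snapshot` at each phase boundary (for instance, just before and after a load-balancing step) yields one per-routine profile per interval, which can then be compared side by side.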

NAMD snapshot profile of over 800 s on 2,048 processors

[Figure: panels "Mean Exclusive Time" and "Standard Deviation" for enqueueSelfA, enqueueSelfB, enqueueWorkA, enqueueWorkB, Main, and Idle, with the load-balancing phases marked.]
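The two statistics in this profile can be recomputed from per-processor data. A minimal sketch, assuming the exclusive times of one snapshot interval are available as a mapping from routine name to a list with one value per processor (a hypothetical layout, not TAU's file format); it uses the population standard deviation, which is an assumption about how the tool computes it.

```python
from statistics import mean, pstdev


def summarize_snapshot(exclusive_times):
    """Return {routine: (mean, std_dev)} over processors for one interval.

    exclusive_times: {routine: [seconds on proc 0, proc 1, ...]}
    (hypothetical layout). Uses the population standard deviation.
    """
    return {
        routine: (mean(values), pstdev(values))
        for routine, values in exclusive_times.items()
    }
```

A standard deviation that is large relative to the mean indicates that a routine's work is spread unevenly across processors, which is the kind of imbalance a load-balancing phase is meant to reduce.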

NAMD CUDA events

GPU efficiency when the number of GPUs is doubled from 16 to 32. Events are broken down by routine and by device number.

[Figure: CUDA events by routine and device number (Device #0 labeled); annotations mark regions of ~100% and ~50% efficiency.]
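The slide does not define efficiency; assuming the usual strong-scaling definition, with T_p the time an event takes on p GPUs, the 16-to-32 comparison reads:

```latex
E_{16 \to 32} = \frac{16\,T_{16}}{32\,T_{32}} = \frac{T_{16}}{2\,T_{32}}
```

Under that assumption, ~100% efficiency means the event's time roughly halves (T_32 ≈ T_16 / 2), while ~50% means doubling the GPUs leaves it roughly unchanged (T_32 ≈ T_16).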

NAMD CUDA scaling

Scaling by event and device number: the non-bonded calculations scale well. The sum-forces calculations scale less well, but their overall time is only a few microseconds.

[Figure: panels "Non-Bonded Calculations" and "Sum Forces Calculations"; scaling efficiency vs. number of devices.]
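Per-event scaling curves like these can be derived from event timings. A minimal sketch, assuming strong-scaling efficiency measured relative to the smallest device count for each event; the input layout and the event names used with it are hypothetical.

```python
def scaling_efficiency(event_times):
    """Strong-scaling efficiency per event.

    event_times: {event: {num_devices: time}}  (hypothetical layout)
    Returns {event: {num_devices: efficiency}}, where efficiency on p
    devices is (p0 * T(p0)) / (p * T(p)) and p0 is that event's smallest
    measured device count. 1.0 means perfect scaling.
    """
    result = {}
    for event, times in event_times.items():
        p0 = min(times)
        base_work = p0 * times[p0]  # device-time at the baseline count
        result[event] = {p: base_work / (p * t) for p, t in sorted(times.items())}
    return result
```

Efficiency alone can overstate a problem: an event that scales poorly but takes only microseconds, like the sum-forces step above, contributes little to total runtime, so it is worth reading these curves alongside absolute event times.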

ParFUM CUDA speedup

[Figure: 128x8x8 mesh; bars for total time using only a CPU, total time with CUDA acceleration, and time spent in the CUDA kernel.]

Performance on a single CPU versus a single GPU for a 128x8x8 mesh. With GPU acceleration enabled, ParFUM spent 9 seconds in the CUDA kernel routines.
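The three bars relate through two ratios: the speedup of the accelerated run over the CPU-only run, and the share of the accelerated run spent inside CUDA kernels. A small sketch (the function name `gpu_summary` is hypothetical); only the 9-second kernel time is stated on the slide, so no other values are filled in here.

```python
def gpu_summary(cpu_total, gpu_total, kernel_time):
    """Compare a CPU-only run with a CUDA-accelerated run.

    All three times must use the same unit. Returns (speedup, kernel_fraction):
      speedup         = cpu_total / gpu_total
      kernel_fraction = share of the accelerated run spent in CUDA kernels
    """
    if gpu_total <= 0:
        raise ValueError("gpu_total must be positive")
    if not 0 <= kernel_time <= gpu_total:
        raise ValueError("kernel_time must lie between 0 and gpu_total")
    return cpu_total / gpu_total, kernel_time / gpu_total
```

A kernel fraction well below 1 means most of the accelerated run is spent outside the kernels, so further kernel tuning would have limited effect on total time.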