A High Performance C++ Generic Benchmark for Computational Epidemiology

Aniket Pugaonkar, Sandeep Gupta, Keith R. Bisset and Madhav V. Marathe {aniketnp, sandeep, kbisset, mmarathe}@vbi.vt.edu

The Network Dynamics and Simulation Science Laboratory, Virginia Tech, USA

The HPL (Top500) benchmark is the most widely recognized and discussed metric for high performance computing systems. Other widely known benchmarks include NPB, HPCC, SPEC, and EuroBen.

Application-specific benchmarking is essential for two reasons: (a) to find a better correlation between an application and the machine running it, and (b) to help choose the most appropriate hardware-software configuration for given application and system parameters.

The Graph 500 [3] benchmark was developed for applications with graphs as their core analytical workloads.

The Boost Graph Library (BGL) [4] is built on the Boost C++ framework and follows generic programming principles. PBGL [6] is a distributed graph library built by lifting, i.e., providing distributed implementations of the various interfaces and operators in BGL.

INTRODUCTION

CONTAGION MODEL

CHALLENGES AND GOALS

Kernel 0: Create Person-Location Activity List.

Kernel 1: Construct a Person-Location Graph.

Kernel 2: Construct a Person-Person Graph.

Kernel 3: Assign locations to location groups in Person-Location Graph.

Kernel 4: Run the activity based contagion process over Person-Location graph. The computational complexity of the kernel is comparable to that of EpiSimdemics Algorithm [1].

Kernel 5: Run the contact based contagion process over person-person graph. The computational complexity of this kernel is comparable to that of EpiFast Algorithm [2].

[1] Christopher L. Barrett, et al. EpiSimdemics: An efficient algorithm for simulating the spread of infectious disease over large realistic social networks. (SC 2008).

[2] Keith R. Bisset, et al. EpiFast: A fast algorithm for large scale realistic epidemic simulations on distributed memory systems. (ICS '09). ACM, New York, NY, USA, 430-439.

[3] Richard C. Murphy, et al. Introducing the Graph 500. Cray User's Group (2010).

[4] Lie-Quan Lee and Andrew Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley Professional, 2002.

[5] Chuck Pheatt. Intel® Threading Building Blocks. J. Comput. Sci. Coll. 23, 4 (April 2008), 298-298.

[6] Douglas Gregor, et al. The Parallel Boost Graph Library. The Trustees of Indiana University (2005).

[7] M. Heroux and J. Dongarra. Towards a New Metric for Ranking High Performance Computing Systems. UTK EECS and Sandia National Labs Report SAND2013-4744, June 2013.

We design a benchmark consisting of several kernels which capture the essential compute, communication, and data access patterns of high performance contagion-diffusion simulations used in computational networked epidemiology. The goal is to (a) derive alternative implementations for computing the contagion by combining different implementations of the kernels, and (b) evaluate which combination of implementation, runtime, and hardware is most effective in running large-scale contagion-diffusion simulations.

Our proposed benchmark is designed using C++ generic programming primitives and by lifting sequential strategies for parallel computations. Together these lead to a succinct description of the benchmark and significant code reuse when deriving strategies for new hardware. These aspects are crucial for an effective benchmark because the number of potential combinations of hardware and runtimes is growing rapidly, making it infeasible to write an optimized strategy for the complete contagion diffusion from the ground up for each compute system.

Overall Metric: Total Interactions Per Second (TIPS)

• The number of interactions for a given network is independent of disease parameters.

• $$\mathrm{TIPS} = \frac{\text{Total number of interactions in the graph}}{\text{Total time required (using } i \text{ threads)}}$$

Strong Scaling (Speedup) for kernels 4 and 5:

• $$\mathrm{Speedup} = \frac{\text{Time to run contagion on the graph using } 1 \text{ thread}}{\text{Time to run contagion on the graph using } n \text{ threads}}$$

Overall time : Complete running time of benchmark

• The time to run the complete benchmark with a particular strategy for a particular hardware and runtime.
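The two ratio metrics above can be written as small helper functions. This is only a sketch: the function names `tips` and `speedup` are hypothetical and not taken from the benchmark code, and any timings passed to them are assumed to be wall-clock seconds measured by the caller.

```cpp
#include <cassert>
#include <cstdint>

// Total Interactions Per Second: interactions in the graph divided by the
// wall-clock time (in seconds) of a run that used i threads.
// (Hypothetical helper name; not part of the benchmark's published code.)
inline double tips(std::uint64_t total_interactions, double seconds_with_i_threads) {
    assert(seconds_with_i_threads > 0.0);
    return static_cast<double>(total_interactions) / seconds_with_i_threads;
}

// Strong-scaling speedup: time with 1 thread divided by time with n threads
// on the same graph.
inline double speedup(double seconds_with_1_thread, double seconds_with_n_threads) {
    assert(seconds_with_n_threads > 0.0);
    return seconds_with_1_thread / seconds_with_n_threads;
}
```

For instance, a graph with 10 million interactions processed in a made-up 2 seconds would score 5 million TIPS, and a run that drops from 8 seconds on one thread to 2 seconds on n threads has a speedup of 4.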

PERFORMANCE

RELATED STUDY

CHALLENGES

Complex algorithms used in real applications cannot be used as benchmarks because of intricate application parameters.

Designing and implementing strategies for new hardware limits code reuse and generic programming.

GOALS

Design kernels which capture the essential computation, communication and data access patterns in tools used to simulate spread of infectious disease through contagion models.

Develop and implement different evaluation strategies for kernels for existing and emerging hardware.

Evaluate the most effective combination of implementation, runtime and hardware for a given contagion.

REFERENCES

BENCHMARK METRICS

HIGH LEVEL SPECS

TASK BASED PARALLELISM

COMPLEX INTERVENTIONS

KERNEL FLOW DIAGRAMS

A four node and five contact edge contact network with ΔtE = 0, ΔtI = 2 (in days) for each node and transmission probability 0.5 for each day. Node A is infectious at start.

i. Day 0: A transmits the disease to B but not D.

ii. Day 1: Both A and B are infectious. A transmits the disease to D; B infects C but not D.

iii. Day 2: A is removed, all others are infectious with no susceptible nodes.

iv. Nodes are removed gradually and on day 4, the system enters a fixed point and stops evolving.

The state transitions are one-way (from susceptible to exposed to infected to recovered) with no other possible transitions.
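The timeline in the example can be replayed with a small state function. This is a sketch under one assumed timing convention (not stated on the poster): a node infected on day d is still susceptible on day d, spends the next ΔtE days exposed and the following ΔtI days infectious, and is removed afterwards. A node that is infectious at the start is modeled as infected on day -1. The names `State`, `kNeverInfected`, and `state_on_day` are hypothetical.

```cpp
#include <cassert>
#include <limits>

enum class State { Susceptible, Exposed, Infectious, Removed };

// Sentinel for a node that never acquires the infection (hypothetical helper).
const int kNeverInfected = std::numeric_limits<int>::max();

// State of a node on `day`, given the day on which it acquired the infection.
// Transitions are one-way: susceptible -> exposed -> infectious -> removed.
State state_on_day(int infection_day, int day, int dt_e, int dt_i) {
    assert(dt_e >= 0 && dt_i >= 0);
    if (day <= infection_day) return State::Susceptible;
    const int elapsed = day - infection_day - 1;  // full days since infection
    if (elapsed < dt_e) return State::Exposed;
    if (elapsed < dt_e + dt_i) return State::Infectious;
    return State::Removed;
}
```

With ΔtE = 0 and ΔtI = 2, node A (infected on day -1) is infectious on days 0 and 1 and removed on day 2; B (infected on day 0) is infectious on days 1 and 2; C and D (infected on day 1) are infectious on days 2 and 3, so every node is removed by day 4, matching the fixed point described above.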

Node Interventions (NI) – Alter Vertex Properties of Persons

• Antiviral, vaccinations etc.

• Change the infectivity or susceptibility of the person (see fig below).

Edge Interventions (EI) – Alter Edge Properties (Activities)

• Location Closure – redirected to alternate locations.

• Activity Modification – Can alter the duration of activity or contact period.
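One way to picture the two intervention types in generic C++ is as function objects applied over an iterator range of vertex or edge properties. The sketch below is illustrative only: the property structs, the functor names, and `apply_intervention` are assumptions made for this example and do not come from the benchmark's code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical vertex and edge property records; the actual types used by
// the benchmark are not given in the text.
struct PersonProps {
    double infectivity;
    double susceptibility;
};

struct ActivityProps {
    int location;
    int start;     // start time of the visit
    int duration;  // length of the visit
};

// Node intervention: scale a person's susceptibility (e.g. vaccination).
struct ScaleSusceptibility {
    double factor;
    void operator()(PersonProps& p) const {
        assert(factor >= 0.0);
        p.susceptibility *= factor;
    }
};

// Edge intervention: location closure, redirecting visits to an alternate
// location.
struct CloseLocation {
    int closed;
    int alternate;
    void operator()(ActivityProps& a) const {
        if (a.location == closed) a.location = alternate;
    }
};

// Apply any intervention to every element of an iterator range.
template <typename Iterator, typename Intervention>
void apply_intervention(Iterator first, Iterator last, Intervention intervene) {
    for (; first != last; ++first) intervene(*first);
}
```

Because `apply_intervention` only requires that the functor accept the range's element type, the same driver serves both node and edge interventions, which is one plausible way for the intervention type to act as a benchmark parameter.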

Intervention Type is also a Benchmark Parameter

Contagion Model is a Benchmark Parameter

• SEIR, SIR, SIS, SEIS etc.

r_i = Infectivity of person i

s_j = Susceptibility of person j

t = Disease transmissibility

Pr(i->j) = Prob. of infection from i to j

CONCLUSIONS

Compute Platform      Shadowfax           BlueRidge
Processor             Xeon E7-4860        Xeon E5-2670
L3 Cache              24 MB               20 MB
# Cores               10                  8
# Threads             20                  16
Clock                 2.26 GHz            2.6 GHz
QPI Speed             6.4 GT/s (1 link)   8 GT/s (2 links)
# Sockets per node    4                   2
Total Cores           40 (4x10)           16 (2x8)

Person-Location Graph        Graph 1    Graph 2    Graph 3    Graph 4
# Persons                    2^15       2^16       2^17       2^18
# Locations                  8169       16321      32339      64421
# Edges (activities)         163464     327314     654690     1308386
# Interactions (millions)    10         20.5       41.2       82.7

Person-Person Graph          Graph 1    Graph 2    Graph 3
# Persons                    2^17       2^18       2^19
# Edges (activities)         2839404    5676741    11363809
# Interactions (millions)    17         34.5       69.4

[Figure: Strong Scaling: Kernel 4 Performance and Strong Scaling: Kernel 5 Performance. Four panels plot speedup against number of threads: two for BlueRidge (1 to 16 threads) and two for Shadowfax (1 to 32 threads). For each machine, one panel shows Graphs 1 to 4 and the other shows Graphs 1 to 3.]

[Figure: Two panels plot speedup for i threads against number of threads for Shadowfax and BlueRidge. One panel is labeled with graph sizes 2^15 to 2^18 (6, 12, 16 and 24 threads); the other is labeled with graph sizes 2^17 to 2^19 (6, 12 and 16 threads).]

CONTRIBUTIONS

Develop kernel specifications and metrics for our benchmark.

Provide generic implementation of kernels in C++.

Provide generic kernels for agent based and contact based contagion models. The kernels capture the computational complexity (and not semantics) of algorithms [1],[2].

Develop scalable shared and distributed memory generic implementation of kernels using task based parallelism and message passing interface.

*Images adapted from [2]

Compute Platform – a stack of hardware, runtime and approach.

Contagion-diffusion – an agent based simulation.

Person – models the agent.

Location – spatial region.

Activity – a person visiting a location for a time period.

Interaction – two people at same location with overlap in time periods.

Contagion – spread of disease among agents.

State – health state of a person (susceptible, infected etc.)

Graph – data structure which encodes the relationships such as visits or interactions.

Intervention – alter the activity and state of a person.
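The definitions of Activity and Interaction above translate directly into an interval-overlap test. The following is a minimal sketch, assuming integer time stamps and half-open visit intervals; the `Activity` struct and both function names are hypothetical, and the all-pairs loop is chosen for clarity rather than performance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical activity record: a person visiting a location for a period.
struct Activity {
    int person;
    int location;
    int start;  // inclusive
    int end;    // exclusive
};

// Two activities interact when different people are at the same location
// and their time periods overlap.
bool interacts(const Activity& a, const Activity& b) {
    return a.person != b.person && a.location == b.location &&
           std::max(a.start, b.start) < std::min(a.end, b.end);
}

// Count interactions by checking every pair of activities. This is
// quadratic in the number of activities and meant for illustration only.
std::size_t count_interactions(const std::vector<Activity>& acts) {
    assert(acts.size() < 100000);  // guard against accidental large inputs
    std::size_t count = 0;
    for (std::size_t i = 0; i < acts.size(); ++i)
        for (std::size_t j = i + 1; j < acts.size(); ++j)
            if (interacts(acts[i], acts[j])) ++count;
    return count;
}
```

A practical implementation would more likely group activities by location first, so that only visits to the same location are compared.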

CONCEPTS

Disease Parameters
Model                SEIR
Transmissibility     0.00003
Incubation period    2 days
Infectious period    4 days
Interventions        NI, EI
Location Groups      100
Infectivity          1
Susceptibility       1

Weak Scaling

In this work, we present a suite of kernels that together form a benchmark for contagion-diffusion simulation. We provide an encoding of the benchmark specifications using C++11 templates and iterators that is generic and composable, i.e., different implementations of kernels can be composed together to arrive at alternative implementations of the benchmark.
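To make the idea of composable kernels concrete, the sketch below shows one way a driver could be templated on its kernel implementations using C++11 templates and iterators. Everything here is an assumption for illustration: `Visit`, `AdjacencyListBuilder`, `PairCounter`, and `run_benchmark` are hypothetical names, and the processing step merely counts co-located person pairs (ignoring visit times) rather than running a contagion process.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical activity record: (person, location).
typedef std::pair<int, int> Visit;

// One possible construction kernel: location -> list of visiting persons.
struct AdjacencyListBuilder {
    typedef std::vector<std::vector<int> > Graph;

    template <typename Iterator>
    Graph operator()(Iterator first, Iterator last, std::size_t num_locations) const {
        Graph g(num_locations);
        for (; first != last; ++first) {
            assert(first->second >= 0 &&
                   static_cast<std::size_t>(first->second) < num_locations);
            g[first->second].push_back(first->first);
        }
        return g;
    }
};

// One possible processing kernel: count co-located person pairs per location.
struct PairCounter {
    template <typename Graph>
    std::size_t operator()(const Graph& g) const {
        std::size_t pairs = 0;
        for (std::size_t l = 0; l < g.size(); ++l) {
            const std::size_t n = g[l].size();
            pairs += n * (n - 1) / 2;
        }
        return pairs;
    }
};

// The driver is generic over both kernels, so alternative implementations
// can be swapped in without changing this code.
template <typename Iterator, typename BuildKernel, typename ProcessKernel>
std::size_t run_benchmark(Iterator first, Iterator last, std::size_t num_locations,
                          BuildKernel build, ProcessKernel process) {
    return process(build(first, last, num_locations));
}
```

Swapping `AdjacencyListBuilder` for a builder backed by a different graph library, or `PairCounter` for a parallel kernel, would leave `run_benchmark` untouched, which is the kind of reuse the composable encoding aims for.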

The benchmark is used to evaluate the performance of two classes of machines based upon the TIPS metric. Preliminary results indicate that the BlueRidge system is more scalable than Shadowfax.

Ongoing work: Our current work is focused on two major aspects:

(1) Standardization – developing code for our benchmark so that it is compatible with any standard graph library which implements its basic specifications.

(2) Distributed/Shared memory implementation – we aim to lift the sequential implementation to large scale shared memory and distributed memory implementations without affecting the genericness and simplicity of the benchmark.

Serial Performance

Experimental Setup

Kernel 5

Kernel 4

Parallel Performance

[Figure: Serial performance of Kernel 2. Time in seconds plotted against scale (in powers of 2) for scales 17, 18 and 19.]

Note:

The graph above shows the serial performance of kernel 2 (Person-Person Graph).

The performance of kernels 0, 1 and 3 is not reported here because their running times are very low (and partly because of space restrictions).

The benchmark ranks different compute platforms based upon the TIPS metric, from more scalable to less scalable.

Compiler GCC 4.7.2

TBB 4.2

BOOST 1.55

[Figure: Node Interventions. Less Infectious: the properties of a person (vertex) with State: Infected change from Infectivity: 1 to Infectivity: 0.5. Less Susceptible: the properties of a person (vertex) with State: Susceptible change from Susceptibility: 1 to Susceptibility: 0.5.]

ACKNOWLEDGEMENT

We thank the members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their contributions, useful discussions, and comments.

TASK BASED WORK STEALING MECHANISM

Worker – Intel TBB Thread or MPI Process

Coarse Grained – Every worker handles a location group and creates location group tasks (T1,…, Tn)

Fine Grained – Every location group task further spawns sub-location tasks (T11, T12,…,Txx)

Stealing – Idle Workers steal the task from task pool (T1,T2,…,T11, Txx)

[Figure: Shared Memory. Location groups (LG1, LG2, …, LGn) are handled by workers (W1, W2, …, Wi) drawn from a worker pool. Coarse-grained tasks (T1, T2, …, Tn) and fine-grained tasks (T11, T12, …) for locations (L1, L2, …, Ln) are placed in a task pool.]

[Figure: Distributed Memory. Location groups are mapped to MPI processes (coarse grained). Each MPI Process spawns TBB threads and creates its local task pool; one task per location (fine grained). Example: two locations for LG1.]