Topology Aware Mapping

Abhinav Bhatele

Charm++ Tutorial
Computer Network Information Center, Chinese Academy of Sciences

Source: charm.cs.illinois.edu/newPapers/10-39/talk.pdf




December 16th, 2010 CNIC @ CAS © Abhinav Bhatele

Motivation

• Running a parallel application on a linear array of processors:


• Typical communication is between random pairs of processors simultaneously


Benchmark: Creating Artificial Contention

• Pair each processor with a partner that is n hops away

(Figure: processor pairs at distances of 1 hop, 2 hops and 3 hops on the linear array.)
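A minimal sketch of how such pairs can be formed (an illustration, not the original benchmark code): within each block of 2n consecutive ranks, the first n ranks each send to the rank n hops away, so every message travels exactly n links and all pairs communicate simultaneously.

```python
def make_pairs(num_procs, hops):
    """Pair each rank with a partner exactly `hops` links away on a
    linear array: within each block of 2*hops ranks, rank i talks to
    rank i + hops.  All pairs communicate at once, so messages at
    larger distances share more links and contend more."""
    pairs = []
    for block in range(0, num_procs, 2 * hops):
        for i in range(block, min(block + hops, num_procs)):
            partner = i + hops
            if partner < num_procs:
                pairs.append((i, partner))
    return pairs

# 8 ranks, partners 2 hops away: (0,2), (1,3), (4,6), (5,7)
print(make_pairs(8, 2))
```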


Results: Contention

(Figure: effect of distance on latencies on Blue Gene/P (torus, 8 x 8 x 16) — latency (us) vs. message size (4 bytes to 1 MB), one curve per distance from 1 hop to 8 hops.)

Bhatele A., Kale L. V., Quantifying Network Contention on Large Parallel Machines, Parallel Processing Letters (Special Issue on Large-Scale Parallel Processing), 2009. Best Poster Award, ACM Student Research Competition, Supercomputing 2008, Austin, TX.


Results: Contention

(Figure: effect of distance on latencies on Cray XT4 (torus, 8 x 8 x 16) — latency (us) vs. message size (4 bytes to 1 MB), one curve per distance from 1 hop to 8 hops.)


Interconnect Topologies

(Embedded slides: Roadrunner Technical Seminar Series, March 13th 2008, Ken Koch, LANL; and a Blue Gene architecture slide, 16 May 2008.)

Blue Gene architecture:

• Designed to support efficient execution of massively parallel MPI programs

• Compute nodes organized as a 3D torus

• Main feature: every node is connected to its six neighbour nodes through bidirectional links

• To maintain application performance, correct mapping of MPI tasks onto the torus network is a critical factor
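The six-neighbour wraparound structure is easy to state in code; the sketch below (an illustration, not system software) computes the neighbours of a node on a 3D torus:

```python
def torus_neighbors(node, dims):
    """Six neighbours of `node` = (x, y, z) on a 3D torus of shape
    `dims` = (X, Y, Z); wraparound links close each dimension."""
    x, y, z = node
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# A corner node of an 8 x 8 x 16 torus still has six neighbours,
# thanks to the wraparound links:
print(torus_neighbors((0, 0, 0), (8, 8, 16)))
```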

• Three-dimensional meshes

• 3D Torus: Blue Gene/L, Blue Gene/P, Cray XT4/5

• Trees

• Fat-trees (InfiniBand) and Clos networks (Federation)

• Dense Graphs

• Kautz Graph (SiCortex), Hypercubes

• Future Topologies?

• Blue Waters, Blue Gene/Q


Application Topologies

(Figures: example application topologies — patches, computes and proxies; sources below.)

http://wrf-model.org/plots/realtime_main.php

http://math.lanl.gov/Research/Projects/meshing.shtml

http://www.ks.uiuc.edu/Gallery/Science/


We want to map communicating objects closer to one another



The Mapping Problem

• Applications have a communication topology and processors have an interconnect topology

• Definition: Given a set of communicating parallel “entities”, map them onto physical processors to optimize communication

• Goals:

• Minimize communication traffic and hence contention

• Balance computational load (when n > p)

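For intuition, the objective can be stated as minimizing the total hops over all communicating pairs. The toy sketch below brute-forces it for four objects on a 2x2 mesh; real mappings must use heuristics, since the search space grows as n!:

```python
import itertools

def manhattan(a, b):
    """Mesh distance between two processor coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_hops(mapping, edges, coords):
    """Cost of a mapping: sum over communicating object pairs of the
    mesh distance between the processors they are placed on."""
    return sum(manhattan(coords[mapping[u]], coords[mapping[v]])
               for u, v in edges)

# Four objects in a ring, mapped onto a 2x2 mesh of processors.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
coords = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

best = min(itertools.permutations(range(4)),
           key=lambda m: total_hops(m, edges, coords))
# An optimal mapping bends the ring around the square: every edge
# becomes one hop, cost 4, vs. cost 6 for the naive row-major mapping.
print(best, total_hops(best, edges, coords))
```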


Scope of this work

• Currently we are focused on 3D mesh/torus machines

• For certain classes of applications

(Figure: a 2 x 2 classification of applications — communication bound vs. computation bound, latency tolerant vs. latency sensitive.)


Outline

• Case studies:

• OpenAtom

• NAMD

• Automatic Mapping Framework

• Pattern matching

• Heuristics for Regular Graphs

• Heuristics for Irregular Graphs



Case Study I: OpenAtom

(Figure: OpenAtom performance on Blue Gene/L — time per step (s) vs. number of cores, 512 to 8192, with the default mapping.)


Diagnosis


Timeline view (OpenAtom on 8,192 cores of BG/L) using the performance visualization tool, Projections


Mapping of OpenAtom Arrays

A. Bhatele, E. Bohm, and L. V. Kale. A Case Study of Communication Optimizations on 3D Mesh Interconnects. In Euro-Par, LNCS 5704, pages 1015–1028, 2009. Distinguished Paper Award, Feng Chen Memorial Best Paper Award

(Figure: mapping of OpenAtom arrays.)


PairCalculator and GSpace have plane-wise communication

RealSpace and GSpace have state-wise communication
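The mapping exploits this structure by aligning each communication direction with a torus dimension, so both patterns stay local. A toy placement illustrating the idea (the function name and layout are assumptions for illustration, not OpenAtom's actual mapping code):

```python
def place(state, plane, dims):
    """Toy placement of a (state, plane) object on a 3D torus,
    assuming n_states <= X and n_planes <= Y*Z: states go along X,
    planes fill the Y-Z cross-section."""
    X, Y, Z = dims
    y, z = divmod(plane, Z)
    return (state % X, y % Y, z)

# Objects of one state (plane-wise traffic) share an X coordinate, so
# their messages stay within a single Y-Z plane of the torus;
# objects of one plane (state-wise traffic) share (y, z), so their
# messages travel along a single X line.
print(place(2, 5, (8, 8, 16)), place(2, 7, (8, 8, 16)))
```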



Performance Benefits from Mapping

(Figure: OpenAtom performance on Blue Gene/L — time per step (s) vs. number of cores, 512 to 8192; default mapping vs. topology mapping.)


Diagnosis of Improvement

Timeline view, using the performance visualization tool Projections, of one iteration of OpenAtom running WATER_256M_70Ry on 8192 cores of BG/L


OpenAtom Performance on Blue Gene/P

(Figure: application performance — OpenAtom time per step (s) vs. number of cores, 1024 to 8192, on Blue Gene/P; default mapping vs. topology mapping.)

(Figure: performance counters — system bandwidth (GB/step) vs. number of cores, 1024 to 8192, on Blue Gene/P; default mapping vs. topology mapping.)



OpenAtom Performance on Cray XT3


• Cray XT3:

• Link bandwidth: 3.8 GB/s (XT3), 0.425 GB/s (BG/P), 0.175 GB/s (BG/L)

• Bytes per flop: 8.77 (XT3), 0.375 (BG/P and BG/L)


• Job schedulers on Cray are not topology aware


• Performance Benefit at 2048 cores: 40% (XT3), 45% (BG/P), 41% (BG/L)

(Figure: OpenAtom time per step (s) vs. number of cores, 512 to 2048, on Cray XT3; default mapping vs. topology mapping.)


Case Study II: NAMD

Patches are assigned statically to processors during program start-up. Computes, on the other hand, can be moved around to balance load across processors. If a patch communicates with more than one compute on a processor, a proxy is placed on this processor for the patch. The proxy receives the message from the patch and then forwards it to the computes internally (Figure 7.7). This avoids adding new communication paths when new computes for the same patch are added on a processor.

Figure 7.7: Placement of patches, computes and proxies on a 2D mesh of processors

The number of computes on a processor and their individual computational loads determine its computational load; the number of proxies on a processor indicates its communication load. Load balancing in NAMD is measurement-based. This assumes that load patterns tend to persist over time and that even if they change, the change is gradual (referred to as the principle of persistence). The load balancing framework records information about object (compute) loads for some time steps. It also records the communication graph between the patches and proxies. This information is collected on one processor and, based on the instrumentation data, a load balancing phase is executed. Decisions are then sent to all processors. The current strategy is centralized; we shall later discuss future work to make it fully distributed.

The two patches that interact with a compute define a smaller brick within the 3D torus (shown in dark grey in the figure). The sum of distances from any processor within this brick to the two patches is minimum. Hence, if we find two processors with proxies for both patches, we give preference to the processor which is within this inner brick defined by the patches.

Figure 7.9: Topological placement of a compute on a 3D torus/mesh of processors (inner brick, outer brick, Patch 1, Patch 2)

Step II: Likewise, in this case too, we give preference to a processor with one proxy or patch which is within the brick defined by the two patches that interact with the compute.

Step III: If Steps I and II fail, we look for any underloaded processor to place the compute on. Under the modified scheme, we first try to find an underloaded processor within the brick; if there is no suitable processor, we spiral around the brick to find the first underloaded one.

To implement these new topology aware schemes in the existing load balancers, we build two preference tables (similar to Figure 7.8) instead of one. The first preference table contains processors which are topologically close to the patches in consideration (within the brick); the second contains the remaining processors.

A. Bhatele, L. V. Kale and S. Kumar, Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications, In 23rd ACM International Conference on Supercomputing (ICS), 2009.

Communication between patches and computes

Topology aware placement of computes
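A sketch of the inner-brick test described above (a simplification for illustration — `inner_brick` and its short-way-around handling of wraparound are assumptions, not the NAMD load balancer's code):

```python
from itertools import product

def inner_brick(p1, p2, dims):
    """Processors inside the brick spanned by two patch coordinates on
    a torus, going the short way around each dimension.  Any processor
    in this brick minimises the sum of shortest-path distances to both
    patches, so it is preferred when placing a compute."""
    ranges = []
    for a, b, d in zip(p1, p2, dims):
        lo, hi = min(a, b), max(a, b)
        if hi - lo <= d - (hi - lo):           # direct span is shorter
            ranges.append(list(range(lo, hi + 1)))
        else:                                  # wrap around the torus
            ranges.append([c % d for c in range(hi, lo + d + 1)])
    return set(product(*ranges))

# Patches at (1,1,1) and (3,2,1) on an 8 x 8 x 16 torus span a
# 3 x 2 x 1 inner brick of six candidate processors.
print(sorted(inner_brick((1, 1, 1), (3, 2, 1), (8, 8, 16))))
```

The spiral fallback of Step III would then scan processors just outside this set, in increasing distance order, for the first underloaded one.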


NAMD Performance on Blue Gene/P

• Evaluation Metric: Hop-bytes

• Indicates amount of traffic and hence contention on the network

• Previously used metric: maximum dilation

(Figure: measured hop-bytes (MB per iteration) vs. number of cores, 512 to 4096, for Topology Oblivious, TopoAware Patches and TopoAware Computes; d_i = distance, b_i = bytes, n = no. of messages.)

5 Hop-bytes as an Evaluation Metric

The volume of inter-processor communication can be characterized by the hop-bytes metric, which is the weighted sum of message sizes where the weights are the number of hops (links) traveled by the respective messages. Hop-bytes can be calculated by the equation

HB = sum_{i=1}^{n} d_i × b_i    (5.1)

where d_i is the number of links traversed by message i, b_i is the message size in bytes for message i, and the summation is over all messages sent.

Hop-bytes is an indication of the average communication load on each link on the network. This assumes that the application generates nearly uniform traffic over all links in the partition. The metric does not give an indication of hot-spots generated on specific links on the network, but it is an easily derivable metric and correlates well with actual application performance.

In VLSI circuit design and early parallel computing work, emphasis was placed on another metric called maximum dilation, which is defined as

d(e) = max{d_i | e_i ∈ E}    (5.2)

where d_i is the dilation of the edge e_i. This metric aims at minimizing the longest length of the wire in a circuit. We claim that reducing the largest number of links traveled by any message is not as critical as reducing the average hops across all messages.
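Both metrics can be computed directly from a message log. The sketch below is illustrative (it assumes shortest-path routing on a 3D torus, and the function names are mine, not the framework's):

```python
def torus_hops(src, dst, dims):
    """Shortest-path link count between two nodes of a 3D torus;
    each dimension can be traversed in either direction."""
    return sum(min(abs(a - b), d - abs(a - b))
               for a, b, d in zip(src, dst, dims))

def hop_bytes(messages, dims):
    """HB = sum over messages of (links traversed) x (bytes)."""
    return sum(torus_hops(s, d, dims) * size for s, d, size in messages)

def max_dilation(messages, dims):
    """The older metric: the largest hop count over all messages."""
    return max(torus_hops(s, d, dims) for s, d, _ in messages)

msgs = [((0, 0, 0), (1, 0, 0), 1024),   # 1 hop, 1 KB
        ((0, 0, 0), (4, 4, 8), 512)]    # 4+4+8 = 16 hops, 512 B
print(hop_bytes(msgs, (8, 8, 16)))      # 1*1024 + 16*512 = 9216
print(max_dilation(msgs, (8, 8, 16)))   # 16
```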



(Figure: application performance — NAMD time per step (ms) vs. number of cores, 512 to 16384, for Topology Oblivious, TopoAware Patches and TopoAware Computes.)



Improvements: 6%, 13%, 28%




Automatic Mapping Framework

(Figure 8.1: Schematic of the automatic mapping framework. Processor topology information and the application communication graph feed a pattern matching framework, which separates regular graphs (2D and 3D object graphs, mapped with and without coordinate information by heuristics such as MXOVLP, MXOV_AL, EXC, COCE and AFFN) from irregular graphs (handled via BFT, MHT, structure inference, AFFN, COCE and COCE+MHT). The best heuristic is chosen based on hop-bytes, and the output is a mapping file used for the next run.)

8.1 Communication Graph: Identifying Patterns

Automatic topology aware mapping, as we shall see in the next few sections, uses heuristics for fast scalable runtime solutions. Heuristics can yield more efficient solutions if we can derive concrete information about the communication graph of the application and exploit it. For this, we need to look for identifiable communication patterns, if any, in the object graph. Many parallel applications have relatively simple and easily identifiable 2D, 3D or 4D communication patterns. If we can identify such patterns, then we can apply better suited heuristic techniques for such graphs.
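A simplified stand-in for this pattern matching step might test whether an object graph matches some 2D near-neighbour grid (this is my sketch of the idea, not the framework's actual algorithm):

```python
def looks_like_2d_grid(n_objects, edges):
    """Guess whether an object graph is a regular 2D near-neighbour
    grid: try each factorization n = rows x cols and check that every
    edge connects objects exactly one row or one column apart."""
    edge_set = {frozenset(e) for e in edges}
    for rows in range(2, int(n_objects ** 0.5) + 1):
        if n_objects % rows:
            continue
        cols = n_objects // rows
        ok = all(
            abs(a // cols - b // cols) + abs(a % cols - b % cols) == 1
            for a, b in (tuple(e) for e in edge_set)
        )
        if ok:
            return (rows, cols)
    return None

# A 2 x 3 grid numbered row-major: 0-1-2 over 3-4-5.
grid_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
print(looks_like_2d_grid(6, grid_edges))  # (2, 3)
```

Once such a pattern is recognized, the framework can hand the graph to a heuristic specialized for regular 2D graphs rather than a generic one.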


patterns, if any, in the object graph. Many parallel applications have relatively

simple and easily identifiable 2D, 3D or 4D communication patterns. If we can

identify such patterns, then we can apply better suited heuristic techniques for such

73

Relieve the application developer of the mapping burden

Page 39:

Automatic Mapping Framework

• Relieve the application developer of the mapping burden
• No change to the application code

[Figure: the application communication graph and processor topology information feed a pattern matching step that classifies the graph as regular or irregular. Regular 2D/3D object graphs with coordinate information use MXOVLP, MXOV+AL, EXC, COCE and AFFN; without coordinate information, EXC, COCE and AFFN. Irregular graphs use BFT, MHT (infer structure), AFFN, COCE and COCE+MHT. The best heuristic is chosen based on hop-bytes; the output is a mapping file used for the next run.]

Figure 8.1: Schematic of the automatic mapping framework

identifying regular patterns in communication graphs.

8.1 Communication Graph: Identifying Patterns

Automatic topology aware mapping, as we shall see in the next few sections, uses heuristics for fast scalable runtime solutions. Heuristics can yield more efficient solutions if we can derive concrete information about the communication graph of the application and exploit it. For this, we need to look for identifiable communication patterns, if any, in the object graph. Many parallel applications have relatively simple and easily identifiable 2D, 3D or 4D communication patterns. If we can identify such patterns, then we can apply better suited heuristic techniques for such

Page 40:

Topology Discovery

• Topology Manager API: for 3D interconnects (Blue Gene, XT)
• Information required for mapping:
  • Physical dimensions of the allocated job partition
  • Mapping of ranks to physical coordinates and vice versa
• On Blue Gene machines such information is available and the API is a wrapper
• On Cray XT machines, one has to jump through several hoops to get this information and make it available through the same API
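The two conversions the API must provide can be sketched in a few lines. This is a hypothetical Python illustration (the class name and the row-major rank ordering are assumptions; the real Topology Manager queries machine-specific system interfaces):

```python
class TopoManager:
    """Sketch of a topology manager for a 3D mesh/torus partition.

    Assumes ranks are laid out in row-major order over (x, y, z);
    on real machines these conversions come from system interfaces.
    """

    def __init__(self, dim_x, dim_y, dim_z):
        # physical dimensions of the allocated job partition
        self.dims = (dim_x, dim_y, dim_z)

    def rank_to_coords(self, rank):
        """Physical (x, y, z) coordinates of a rank."""
        _, dim_y, dim_z = self.dims
        z = rank % dim_z
        y = (rank // dim_z) % dim_y
        x = rank // (dim_z * dim_y)
        return (x, y, z)

    def coords_to_rank(self, x, y, z):
        """Inverse conversion: rank at physical coordinates (x, y, z)."""
        _, dim_y, dim_z = self.dims
        return (x * dim_y + y) * dim_z + z
```

With an 8 × 8 × 16 partition, `rank_to_coords` and `coords_to_rank` are inverses of each other for every rank in the allocation.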

Page 41:

Application communication graph

• Several ways to obtain the graph
• MPI applications:
  • Profiling tools (IBM's HPCT tools)
  • Collect information using the PMPI interface
  • Manually provided by the application end user
• Charm++ applications:
  • Instrumentation at runtime
  • Profiling tools (HPCT): when n = p

Page 42:

Pattern Matching

• We want to identify regular 2D/3D communication patterns

Algorithm 8.1 Pseudo-code for identifying regular communication graphs
Input: CM_{n,n} (communication matrix)
Output: isRegular (boolean, true if communication is regular)
        dims[] (dimensions of the regular communication graph)

for i = 1 to n do
    find the maximum number of neighbors for any rank in CM_{i,n}
end for
if max neighbors ≤ 5 then
    // this might be a case of regular 2D communication
    select an arbitrary rank startpe and find its distance from its neighbors
    dist = difference between ranks of startpe and its top or bottom neighbor
    for i = 1 to n do
        if distance of all ranks from their neighbors == 1 or dist then
            isRegular = true
            dim[0] = dist
            dim[1] = n/dist
        end if
    end for
end if

computation, but the algorithm can be enhanced so that it can identify other regular patterns such as communication with all 8 neighbors around a rank in 2D. The algorithms for identifying 3D and 4D near-neighbor patterns are similar. Once the information about communicating neighbors has been extracted and identified, mapping algorithms can use it to map communicating neighbors on nearby physical processors.

The pattern matching algorithms were tested with three different applications which are known to have regular communication: MILC, POP and WRF. The communication patterns and the size of each dimension were correctly identified as shown in Table 8.1.

Application   No. of cores   Dimensionality   Size of dimensions
MILC          256            4-dimensional    4 × 4 × 4 × 4
POP           256            2-dimensional    8 × 32
POP           512            2-dimensional    32 × 16
WRF           256            2-dimensional    16 × 16
WRF           512            2-dimensional    32 × 32

Table 8.1: Pattern identification of communication in MILC, POP and WRF
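The 2D branch of Algorithm 8.1 can be sketched directly from the pseudo-code. This is an illustrative Python version under stated assumptions (boolean communication matrix, row-major rank layout, 5-point stencil); the helper `make_stencil_2d` is hypothetical and only builds test input:

```python
def make_stencil_2d(rows, cols):
    """Build the communication matrix of a rows x cols 5-point stencil."""
    n = rows * cols
    cm = [[False] * n for _ in range(n)]
    for i in range(n):
        r, c = divmod(i, cols)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < rows and 0 <= c + dc < cols:
                cm[i][(r + dr) * cols + (c + dc)] = True
    return cm

def detect_regular_2d(cm):
    """Sketch of Algorithm 8.1: is cm a regular 2D near-neighbor pattern?

    Returns (isRegular, dims) with dims = (dist, n // dist), following
    the pseudo-code's dim[0] = dist, dim[1] = n / dist convention.
    """
    n = len(cm)
    neighbors = [[j for j in range(n) if cm[i][j] and j != i] for i in range(n)]
    if max(len(nb) for nb in neighbors) > 5:
        return False, None  # too many partners for a 5-point stencil
    dist = None  # rank distance to a top/bottom neighbor = row length
    for i in range(n):
        strides = {abs(j - i) for j in neighbors[i]} - {1}
        if strides:
            d = strides.pop()
            if strides or (dist is not None and d != dist):
                return False, None  # inconsistent strides: not regular
            dist = d
    if dist is None or n % dist != 0:
        return False, None  # purely 1D chains are not classified here
    return True, (dist, n // dist)
```

`detect_regular_2d(make_stencil_2d(3, 3))` reports a regular 3 × 3 pattern, while a fully connected graph is rejected by the neighbor-count test.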



Page 46:

Example

• WRF running on 32 cores of Blue Gene/P

Pattern matching to identify regular communication patterns such as 2D/3D near-neighbor graphs

[Figure: communication among processors 0 to 31]


Page 48:

Communication Graphs

• Regular communication:
  • POP (Parallel Ocean Program): 2D stencil-like computation
  • WRF (Weather Research and Forecasting model): 2D stencil
  • MILC (MIMD Lattice Computation): 4D near-neighbor
• Irregular communication:
  • Unstructured mesh computations: FLASH, CPSD code
  • Many other classes of applications

Page 49:

Outline

• Case studies:
  • OpenAtom
  • NAMD
• Automatic Mapping Framework
  • Pattern matching
  • Heuristics for Regular Graphs
  • Heuristics for Irregular Graphs


Page 56:

Mapping Regular Graphs (2D)

Object Graph: 9 × 8, Processor Graph: 12 × 6

• Maximum Overlap (MXOVLP)
• Maximum Overlap with Alignment (MXOV+AL)
  • Alignment at each recursive call
• Expand from Corner (EXCO)
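The slides name EXCO but do not spell it out; the following is one plausible expand-from-corner sketch, a hypothetical illustration rather than the paper's algorithm. Both grids are walked in anti-diagonal order from corner (0, 0) and paired up, so objects near the object-grid corner land on processors near the mesh corner:

```python
def expand_from_corner(obj_dims, proc_dims):
    """Hypothetical expand-from-corner mapping sketch.

    Walks the object grid and the processor mesh in the same
    anti-diagonal order from (0, 0) and pairs cells up in order.
    Returns {object (x, y): processor (x, y)}.
    """
    def diagonal_order(dims):
        w, h = dims
        cells = [(x, y) for x in range(w) for y in range(h)]
        # sweep anti-diagonals outward from the corner
        cells.sort(key=lambda c: (c[0] + c[1], c[0]))
        return cells

    objs = diagonal_order(obj_dims)
    procs = diagonal_order(proc_dims)
    assert len(objs) <= len(procs), "need one processor per object"
    return dict(zip(objs, procs))
```

For the slide's 9 × 8 object graph and 12 × 6 processor graph, `expand_from_corner((9, 8), (12, 6))` places object (0, 0) on processor (0, 0) and assigns all 72 objects to distinct processors.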


Page 66:

More heuristics ...

• Corners to Center (COCE)
  • Start simultaneously from all corners
• Affine Mapping (AFFN)

the basic technique in a simpler context.

2. The second scenario is where we have a two-dimensional array of objects where each object communicates with two immediate neighbors in its row and column. We wish to map this group of objects on to a 2D mesh of processors.

12.1 Mapping of a 1D Ring

Problem: Load balancing a 1D array of v objects which communicate in a ring pattern to a 1D linear array of p processors.

Solution: We want to map these objects on to processors while considering the load of each object and the communication patterns among the objects. In order to optimize communication, we want to place objects next to each other on the same processor as much as possible and cross processor boundaries only for ensuring load balance. We assume that the IDs of objects denote the nearness in terms of who communicates with whom. Hence the problem reduces to finding contiguous groups of objects in the 1D array such that the load on all processors is nearly the same.

We arrange the objects virtually by their IDs and perform a prefix sum in parallel between them based on the object loads. At the conclusion of the prefix sum, every object knows the sum of loads of all objects that appear before it (Figure 12.1). Then the last object broadcasts the sum of loads of all objects so that every object knows the global load of the system. Each object i can calculate its destination processor (d_i), based on the total load of all objects (L_v), the prefix sum of loads up to it (L_i), its load (l_i) and the total number of processors (p), by this equation:

d_i = ⌊p (L_i − l_i/2) / L_v⌋    (12.1)

(x, y) → (⌊P_x · x / O_x⌋, ⌊P_y · y / O_y⌋)    (12.2)

Abhinav Bhatele, Gagan Gupta, Laxmikant V. Kale and I-Hsin Chung, "Automated Mapping of Regular Communication Graphs on Mesh Interconnects", Proceedings of International Conference on High Performance Computing (HiPC), 2010.
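Equations 12.1 and 12.2 translate almost line-for-line into code. A minimal sketch (the function names are illustrative):

```python
from itertools import accumulate

def ring_destinations(loads, p):
    """Eq. 12.1: destination processor for each object in a 1D ring.

    loads[i] is l_i; the inclusive prefix sums give L_i, their total L_v.
    d_i = floor(p * (L_i - l_i / 2) / L_v) assigns the midpoint of each
    object's load interval to a processor, yielding contiguous groups.
    """
    total = sum(loads)                # L_v
    prefix = list(accumulate(loads))  # L_i (inclusive prefix sums)
    return [int(p * (L - l / 2) // total) for L, l in zip(prefix, loads)]

def affine_map(x, y, obj_dims, proc_dims):
    """Eq. 12.2 (AFFN): scale object (x, y) into the processor grid.

    obj_dims = (O_x, O_y), proc_dims = (P_x, P_y).
    """
    o_x, o_y = obj_dims
    p_x, p_y = proc_dims
    return (p_x * x // o_x, p_y * y // o_y)
```

Eight unit-load objects on two processors split evenly ([0, 0, 0, 0, 1, 1, 1, 1]), and in the 9 × 8 to 12 × 6 example the affine map sends object (8, 7) to processor (10, 5).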



Page 70:

Running Time

• Pairwise Exchanges (PAIRS) - Bokhari, Lee et al.

[Plot: running time (ms, log scale from 0.01 to 100) of AFFN, COCE, MXOVLP, MXOV+AL and EXCO for 1k to 64k nodes]

[Plot: hops per byte (1.34 to 1.43) versus time (3.4 to 340 s), labeled "Hops for 4k nodes"]

Page 71:

Example Mapping

Object Graph: 9 × 8, Processor Graph: 12 × 6

Aleliunas, R. and Rosenberg, A. L. On Embedding Rectangular Grids in Square Grids. IEEE Trans. Comput., 31(9):907–913, 1982.


Page 79:

Mapping of 9x8 graph to 12x6 mesh

MXOVLP: 1.66   MXOV+AL: 1.65   EXCO: 2.31   COCE: 1.91


Page 81: Abhinav Bhatele - University Of Illinoischarm.cs.illinois.edu › newPapers › 10-39 › talk.pdfDecember 16th, 2010 CNIC @ CAS © Abhinav Bhatele Results: Contention 1 4 16 64 256

December 16th, 2010 CNIC @ CAS © Abhinav Bhatele 33

Mapping of 9x8 graph to 12x6 mesh

STEP: 1.39 AFFN1: 1.77 AFFN2: 1.53 AFFN3: 1.91
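The per-heuristic scores above are in hops per byte, derived from the hop-bytes metric: every byte sent is weighted by the number of network hops it travels. A minimal sketch of the metric (hypothetical argument names; assumes a 2D mesh with dimension-ordered routing, so hop count is Manhattan distance):

```python
def hop_bytes(edges, mapping, cols):
    """Total hop-bytes for a mapping on a 2D mesh of width `cols`.

    edges:   iterable of (obj_a, obj_b, nbytes) communication edges
    mapping: dict from object id to linear processor rank
    """
    total = 0
    for a, b, nbytes in edges:
        pa, pb = mapping[a], mapping[b]
        ax, ay = pa % cols, pa // cols
        bx, by = pb % cols, pb // cols
        hops = abs(ax - bx) + abs(ay - by)  # Manhattan distance on the mesh
        total += hops * nbytes
    return total
```

Dividing this total by the total bytes sent gives the average hops-per-byte figure reported on these slides.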



Evaluation

[Figure: hops per byte (log scale, 1 to 100) for mappings 27x44 to 36x33, 100x40 to 125x32, 128x128 to 512x32, and 320x200 to 125x512 (~1k, ~4k, ~16k, ~64k nodes), comparing MXOVLP, MXOV+AL, EXCO, COCE, AFFN, PAIRS]


Mapping 2D Graphs to 3D

• Map a two-dimensional object graph to a three-dimensional processor graph

• Divide the object graph into as many subgraphs as there are processor planes

• Stacking

• Folding

• Best 2D to 2D heuristic chosen based on hop-bytes
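The stacking/folding step above can be sketched as follows (a hypothetical illustration, not the framework's code): split the object-graph rows into one band per processor plane; stacking keeps each band's orientation, while folding reverses alternate bands so that rows adjacent across a seam occupy the same position in consecutive planes.

```python
def split_into_planes(rows, nplanes, fold=False):
    """Divide a list of row indices into nplanes bands (one per Z plane).

    fold=False -> stacking: each band keeps its original row order.
    fold=True  -> folding: alternate bands are reversed, so rows that
                  meet at a seam (e.g. rows 3 and 4 for 8 rows, 2 planes)
                  sit at the same index in consecutive planes.
    """
    per = (len(rows) + nplanes - 1) // nplanes
    bands = [rows[i * per:(i + 1) * per] for i in range(nplanes)]
    if fold:
        bands = [b[::-1] if i % 2 else b for i, b in enumerate(bands)]
    return bands
```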


Results: 2D Stencil on Blue Gene/P

[Figures: Performance — time per step (ms, 400 to 470) vs. number of cores (512 to 16384); Hop-bytes — hops per byte (0 to 20) vs. number of cores; Default Mapping vs. Topology Mapping]


Increasing communication

• With faster processors and constant link bandwidths

• computation is becoming cheap

• communication is a bottleneck

• Trend for bytes per flop

• XT3: 8.77

• XT4: 1.357

• XT5: 0.23

[Figure: 2D Stencil on BG/P — time per step (s, log scale, 0.1 to 100) vs. message size (512 B to 128 KB), Default Mapping vs. Topology Mapping]


Results: WRF on Blue Gene/P

• Performance improvement negligible on 256 and 512 cores

• On 1024 nodes:

• Hops reduce by 63%

• Time for communication reduces by 11%

• Performance improves by 17%

[Figure: average hops per byte (from IBM HPCT) vs. number of nodes (256 to 4096), Default vs. Topology mapping, with lower bound; annotated improvements: 17%, 5%]


Outline

• Case studies:

• OpenAtom

• NAMD

• Automatic Mapping Framework

• Pattern matching

• Heuristics for Regular Graphs

• Heuristics for Irregular Graphs


Mapping Irregular Graphs

Object Graph: 90 nodes, Processor Mesh: 10 x 9


Two different scenarios

• There is no spatial information associated with the nodes

• Option 1: Work without it

• Option 2: If we know that the simulation has a geometric configuration, try to infer the structure of the graph

• We have geometric coordinate information for each node

• Use coordinate information to avoid crossing of edges and for other optimizations

No coordinate information

• Breadth first traversal (BFT)

• Start with a random node and one end of the processor mesh

• Map nodes as you encounter them close to their parent

• Max heap traversal (MHT)

• Start with a random node and one end/center of the mesh

• Put neighbors of a mapped node into the heap (node at the top is the one with maximum number of mapped neighbors)

• Map elements in the heap one by one around the centroid of their mapped neighbors
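The BFT heuristic above can be sketched in a few lines (hypothetical names; `nearest_free` stands in for the spiral search around a desired processor described later in the talk):

```python
from collections import deque

def bft_map(adj, start, nearest_free):
    """Breadth-first mapping: place each object near its already-mapped parent.

    adj:          dict object -> list of neighbouring objects
    start:        object to begin the traversal from (e.g. a random node)
    nearest_free: callable(desired_proc) -> nearest available processor
    """
    mapping = {start: nearest_free(0)}   # anchor at one end of the mesh
    queue = deque([start])
    while queue:
        parent = queue.popleft()
        for child in adj[parent]:
            if child not in mapping:
                # place the child as close to its parent as possible
                mapping[child] = nearest_free(mapping[parent])
                queue.append(child)
    return mapping
```

MHT differs only in the traversal order: instead of a FIFO queue it keeps a max heap keyed on the number of already-mapped neighbours, and places each object near the centroid of those neighbours.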


Mapping visualization

BFT: 2.89 MHT: 2.69

Inferring the spatial placement

• Graph layout algorithms

• Force-based layout to reduce the total energy in the system

• Use the graphviz library to obtain coordinates of the nodes
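The energy-reduction idea behind such layouts, as a toy sketch (this is not graphviz's algorithm; the constants and names are illustrative): springs pull connected nodes together while a weaker repulsive term pushes all pairs apart.

```python
import random

def force_layout(adj, iters=300, k=0.1, rep=0.01, seed=0):
    """Crude force-directed layout: returns {node: (x, y)} coordinates."""
    rnd = random.Random(seed)
    pos = {v: [rnd.random(), rnd.random()] for v in adj}
    for _ in range(iters):
        disp = {v: [0.0, 0.0] for v in adj}
        for v in adj:                         # pairwise repulsion
            for u in adj:
                if u == v:
                    continue
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d2 = dx * dx + dy * dy + 1e-9
                disp[v][0] += rep * dx / d2
                disp[v][1] += rep * dy / d2
        for v in adj:                         # spring attraction along edges
            for u in adj[v]:
                disp[v][0] -= k * (pos[v][0] - pos[u][0])
                disp[v][1] -= k * (pos[v][1] - pos[u][1])
        for v in adj:
            pos[v][0] += disp[v][0]
            pos[v][1] += disp[v][1]
    return {v: tuple(p) for v, p in pos.items()}
```

The resulting coordinates can then feed the coordinate-based heuristics on the next slides.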


With coordinate information

• Affine Mapping (AFFN)

• Stretch/shrink the object graph (based on coordinates of nodes) to map it on to the processor grid

• In case of conflicts for the same processor, spiral around that processor

• Corners to Center (COCE)

• Use four corners of the object graph based on coordinates

• Start mapping simultaneously from all sides

• Place nodes encountered during a BFT close to their parents
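The stretch/shrink step of AFFN can be sketched like this (hypothetical names; resolving conflicts by spiraling around an occupied processor is a separate step):

```python
def affine_target(coord, obj_bbox, grid_dims):
    """Scale an object's (x, y) coordinate onto processor-grid indices.

    coord:     (x, y) position of the object
    obj_bbox:  ((xmin, ymin), (xmax, ymax)) bounding box of all objects
    grid_dims: (px, py) processor grid dimensions
    """
    (xmin, ymin), (xmax, ymax) = obj_bbox
    px, py = grid_dims
    # stretch/shrink each axis independently onto [0, p-1]
    gx = round((coord[0] - xmin) / (xmax - xmin) * (px - 1))
    gy = round((coord[1] - ymin) / (ymax - ymin) * (py - 1))
    return int(gx), int(gy)
```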


Mapping visualization

AFFN: 3.17 COCE: 2.88


• COCE+MHT Hybrid:

• We fix four nodes at geometric corners of the mesh to four processors in 2D

• Put neighbors of these nodes into a max heap

• Map from all sides inwards

• Starting from centroid of mapped neighbors


COCE: 2.78

Time Complexity

• All algorithms discussed above choose a desired processor and spiral around it to find the nearest available processor

• Heuristics generally applicable to any topology

• Depending on the running time of findNext:

BFT: O(n) / O(n (log n)²)
COCE: O(n) / O(n (log n)²)
AFFN: O(n) / O(n (log n)²)
MHT: O(n log n) / O(n (log n)²)
COCE+MHT: O(n log n) / O(n (log n)²)
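The shared findNext step can be sketched for a 2D mesh like this (hypothetical interface): scan outward ring by ring from the desired processor until a free one turns up. Rings here use Chebyshev distance for simplicity, which is a close stand-in for "nearest available" on a mesh.

```python
def find_next(desired, available, dims):
    """Spiral around `desired` = (x, y) to find the nearest free processor.

    available: set of free (x, y) processors (candidates off the mesh are
               simply never in this set)
    dims:      (cols, rows) of the mesh
    """
    cols, rows = dims
    x0, y0 = desired
    for r in range(max(cols, rows) + 1):      # rings of growing radius
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                if max(abs(dx), abs(dy)) != r:
                    continue                  # only the ring boundary
                cand = (x0 + dx, y0 + dy)
                if cand in available:
                    available.remove(cand)
                    return cand
    raise RuntimeError("no free processor left")
```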


Running Time

[Figure: mapping running time (ms, log scale, 0.1 to 10000) vs. number of nodes (256 to 16384) for COCE+MHT, COCE, BFT, AFFN, MHT]


Results: simple2D

[Figure: hops per byte (0 to 15) vs. number of nodes in communication graph (256 to 16384) for Default, BFT, MHT, COCE, COCE+MHT, AFFN, PAIRS]


Summary

• Contention in modern day supercomputers can impact performance: makes mapping important

• Certain classes of applications (latency sensitive, communication bound) benefit most

• OpenAtom shows performance improvements of up to 50%

• NAMD: improvements beyond 4k cores

• Developing an automatic mapping framework

• Relieve the application developer of the mapping burden


Summary

• Topology discovery: Topology Manager API

• Object Communication Graph: Profiling, Instrumentation

• Pattern matching

• Regular graphs

• Irregular graphs

• Suite of heuristics for mapping

• Distributed strategies with global view


Future Work

• More sophisticated algorithms for pattern matching and mapping

• Multicast and many-to-many patterns

• Handling multiple communication graphs

• Simultaneous or occurring in different phases

• Extension of the work on distributed load balancing


Thanks

Charm++ Tutorial, Computer Network Information Center, Chinese Academy of Sciences

Abhinav Bhatele. Automating Topology Aware Mapping for Supercomputers, PhD Thesis, Department of Computer Science, University of Illinois. http://hdl.handle.net/2142/16578