PGAS languages
The facts, the myths and the requirements

Dr Michèle Weiland
m.weiland@epcc.ed.ac.uk

Monday, 1 October 12

Workshop on Tools for Exascale, CEA, 01/10/2012

What is PGAS?

• a model, not a language!

• based on principle of “partitioned global address space”

• many different implementations exist

• new languages, language extensions, libraries

• world of PGAS is rather complex and murky...

Some implementations

Chapel, X10, Titanium, Fortress

Unified Parallel C, Coarray Fortran

Global Arrays, OpenSHMEM

Important point to keep in mind

unfortunately, there isn’t really such a thing as “a typical PGAS language”...

... there are many programming languages that implement the PGAS model in very different, even opposing, ways.

PGAS example: The UPC model

[Figure: threads 0 to 3, one per cpu, over a global partitioned address space with a shared region and a private region.]
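
To make the partitioning concrete, here is a minimal Python sketch of how a UPC-style shared array could be laid out across threads. It assumes the round-robin block layout that UPC applies to declarations of the form `shared [B] int a[N]` (block size 1 by default). The function name `owner_thread` is invented for illustration and is not part of any UPC API.

```python
def owner_thread(index, threads, block_size=1):
    """Return the thread with affinity to element `index` of a shared array.

    Blocks of `block_size` consecutive elements are dealt out to the
    threads round-robin (assumed layout, see the lead-in above).
    """
    return (index // block_size) % threads


# With 4 threads and the default block size of 1, the layout is cyclic.
layout = [owner_thread(i, threads=4) for i in range(8)]
print(layout)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Everything outside the shared region stays private to its thread, so only shared elements need an owner calculation of this kind.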

SPMD and Global - two different views

• UPC and CAF take classic fragmented SPMD approach

• all processes execute same program

• Chapel and X10 take a global view

• they are able to dynamically spawn processes, as and when required

• advantage: (in principle) no redundant serial computation
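
As an illustration of the fragmented SPMD view, the sketch below shows the index arithmetic each process typically performs for itself: every process runs the same function and differs only in its rank. The helper `local_range` and the block decomposition are assumptions made for this example; in a global-view language the programmer would instead write one loop over the whole index space and leave the distribution to the runtime.

```python
def local_range(rank, nprocs, n):
    """Indices of a length-n problem owned by `rank` in a block decomposition."""
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Every "process" executes the same code and differs only in its rank.
parts = [list(local_range(r, nprocs=3, n=10)) for r in range(3)]
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```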

A model for the future...?

• single-sided communication and built-in parallelism attractive concepts

• manipulate remote memory directly

• complex communications patterns easy(ish) to implement

• parallelism explicitly supported

• most implementations can be used standalone or alongside other models

• learning curve is low compared to MPI

User Adoption

• PGAS will only play a role in Exascale computing if user adoption is improved

• there is a lot of scepticism in the user community

• this can only happen if

• performance is able to match that of established models, i.e. MPI

• there is support in the form of benchmark suites, libraries and debugging/performance tools

Performance

• depends on the quality of runtime and compiler

• not a problem for CAF, UPC or Chapel... if you own a Cray!

• other vendors now starting to catch up

• also depends on the quality of the implementation (of course)

• this is where the HPC user community and tools providers come into play

Possible performance gains - IFS

[Figure: T2047L137 model performance on HECToR (Cray XE6), RAPS12 IFS (CY37R3), cce=7.4.4, April 2012. Y axis: Forecast Days / Day (0 to 450). X axis: Number of Cores (0 to 70000). Curves: Ideal, LCOARRAYS=T, LCOARRAYS=F, ORIGINAL; the operational performance requirement is marked.]

F - includes MPI optimisations to wave model + other opts
T - includes above & Legendre transform coarray optimization

Image courtesy of George Mozdzynski (ECMWF) and the CRESTA project

Distributed Hash Table

... GHz Interlagos processors, with a total node memory of 32 GBytes per node. The total number of cores is 90,112. The interconnect uses the Cray Gemini communications chips. The Cray compiler (version 8) is used to compile the benchmark, as this has support for both UPC and CAF. Cray SHMEM is used for the SHMEM version, and the MPI library is MPICH2. The XMP compiler is Omni XMP 0.5.4 (modified) using the GASNet Gemini conduit (GASNet version 1.18.2). The HA-PACS system is the high performance cluster at the University of Tsukuba. Its compute nodes consist of two Intel E5 (Sandy Bridge EP) 2.6 GHz 8-core processors, with 128 GByte memory per node. The interconnect is Infiniband QDR with two rails. The compiler used is gcc 4.7.0 for XMP, UPC and MPI, and gcc 4.4.5 for SHMEM. The XMP compiler is Omni XMP compiler 0.5.4 (modified). Berkeley UPC⁷ 2.14.0, OpenSHMEM⁸ 1.0c, and OpenUH⁹ 3.0.13 are used for UPC, SHMEM and CAF respectively. The communication library is mvapich2-1.7a for MPI and GASNet 1.18.0 for the others.

⁷ http://upc.lbl.gov/
⁸ http://openshmem.org/
⁹ http://www2.cs.uh.edu/~openuh/

3.2 Results

In previous work [5], results for one-sided MPI and UPC on Phase 2b of HECToR, a Cray XE6 with 24 core Magny-Cours processors, were presented. The results for one-sided MPI, UPC and SHMEM on HECToR phase 3, with 32 core Interlagos processors, were originally presented in [4]. In this work, these are extended to include new results for CAF and XMP, and are compared to performance results on HA-PACS.

For all five programming models the size of the integer object is eight bytes and two passes of the hash table are made, once to populate and once to visit. Five different volumes or local hash table sizes are used, namely 10, 20, 40, 80 and 160 thousand entries per process. All results are presented in terms of local hash table size, which corresponds to weak scaling analysis. For all results one process runs on a single core. For HECToR we present results on 32 (one node) up to 16384 cores, and on HA-PACS on 16 cores (one node) up to 512 cores. Wall clock time is used to measure the performance as both set up and run-time are important. All data for these results are shown in Tables 1 and 2 in the appendix. We do not present the results for OpenSHMEM on HA-PACS as this is a reference only implementation without optimisation with correspondingly poor performance.

The test code used as the benchmark performs very little computation and is effectively bound by memory transfer. Increasing the number of processes increases the number of workers, but in the weak scaling analysis also increases the amount of work. Moreover, the amount of remote memory transfers that are performed also increases. Consequently, increasing the number of processes does not make the code execute faster. Effectively, the benchmark is measuring the performance of one-sided any-to-any communication.

Shown in Figure 1 is a log-log plot of the performance of the benchmark for the different models at fixed local volume on HECToR. One-sided MPI has a fast execution time for small numbers of cores, but it does not scale well to large numbers of cores. UPC, CAF and SHMEM show good scaling out to large core counts, with modest degrading of performance after 4096 cores. XMP appears similar to one-sided MPI in that it has fast execution for small numbers of cores, but the execution time increases at large core counts.

[Figure 1: Wall clock time versus core count at fixed local hash table size of 40000 elements, on HECToR. Log-log axes: time (seconds) against number of cores (32 to 16384). Series: MPI, UPC, SHMEM, CAF, XMP.]

[Figure 2: Wall clock time versus local hash table size at fixed core count of 2048 on HECToR. Log-log axes: time (seconds) against local hash table size (10 to 160 thousand). Series: MPI, UPC, SHMEM, CAF, XMP.]

Shown in Figure 2 is a log-log plot of the performance of the benchmark for the different sizes of local hash table at fixed core count on HECToR. UPC, CAF and SHMEM all have similar performance and scale well with volume. As the problem size is increased, the amount of RMA is increased proportionally. For one-sided MPI and XMP, the overall trend is the same, except that they scale much less efficiently with volume, such that for the largest volumes they are an order of magnitude slower than UPC, CAF and SHMEM.

We do not show plots for other fixed volumes and core counts as the trends are established in the plots and all the data is available in the appendix.

Image courtesy of the

HPCGAP project
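
The excerpt does not show the benchmark code, so the following is only a rough single-process Python model of a distributed hash table with one-sided style inserts. The key-to-owner mapping (`key % nprocs`), the slot mapping, the absence of collision handling and the names `locate` and `populate` are all assumptions made for illustration; they are not taken from the benchmark.

```python
def locate(key, nprocs, local_size):
    """Map a key to (owning process, slot in that process's local table)."""
    owner = key % nprocs
    slot = (key // nprocs) % local_size
    return owner, slot


def populate(keys, nprocs, local_size):
    """Insert keys with one-sided style writes; return tables and remote count."""
    tables = [[None] * local_size for _ in range(nprocs)]
    remote = 0
    for writer, key in enumerate(keys):
        origin = writer % nprocs       # process issuing the write
        owner, slot = locate(key, nprocs, local_size)
        tables[owner][slot] = key      # the owner takes no part in this write
        remote += owner != origin      # count writes that leave the origin
    return tables, remote


keys = [k * k for k in range(16)]
tables, remote = populate(keys, nprocs=4, local_size=8)
# Weak scaling: the global table size is local_size * nprocs.
global_size = 8 * 4
```

Because the local table size is fixed, adding processes grows both the global table and the number of remote writes, which is why the excerpt describes the benchmark as a measure of one-sided any-to-any communication rather than of speed-up.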

The truth about PGAS

• although in general easy to learn...

• simple codes can be parallelised very quickly

• ... difficult to use on real codes!

• a lot of functionality hidden from the user; often implicit communication and parallelism

• hidden functionality may be root cause of poor performance

Common bottleneck: data access

• Shared data objects can be accessed directly

• cost of access depends on where data resides

• is it in shared cache? on a memory bank attached to a processor in another cabinet?

• Deceptively simple operation

• but implications for performance are huge

Also: synchronisation

• important for memory consistency, avoidance of data races

• implicit nature of communication makes this surprisingly difficult to get right

• especially problematic in large codes

• common approach: “if in doubt, synchronise!”

• the result is correct but badly performing code that spends most of its time waiting for things to happen
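
A small sketch of the point about synchronisation, using Python threads as a stand-in for PGAS threads (an assumption; no PGAS runtime is involved). Each thread writes its own slot of a shared list and then reads its neighbour's slot. Exactly one barrier is needed between the write and the read; leaving it out allows a read of a slot that has not been written yet, and adding more would only add waiting time.

```python
import threading


def neighbour_exchange(nthreads):
    """Each thread publishes a value, then reads its right neighbour's value."""
    shared = [None] * nthreads        # one "shared" slot per thread
    received = [None] * nthreads
    barrier = threading.Barrier(nthreads)

    def worker(me):
        shared[me] = me * 10          # write my own slot
        barrier.wait()                # all writes complete before any read
        received[me] = shared[(me + 1) % nthreads]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(neighbour_exchange(4))  # [10, 20, 30, 0]
```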

What needs to happen? An ideal world

• common misplaced belief that PGAS is “easy” needs to be addressed

• not a quick fix for performance and scaling problems!

• stories of success and failure need to be told

• what works? what doesn’t?

• and finally: programmers need help, writing code without the support of tools is like shooting in the dark...

Debugging

• main focus needs to be on the principal feature of PGAS and the unwanted side-effects of RDMA

• memory consistency is the key

• detect data races: is a memory location “safe” to use?

• help with resolution of data races, e.g. atomic operations, synchronisation, critical sections, ...
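
As a rough idea of what data race detection could look like, the sketch below scans a log of memory accesses and reports conflicting pairs. It assumes a deliberately simplified model in which barriers are the only synchronisation, so an "epoch" is just the number of barriers a thread has passed; locks and atomic operations are ignored. The log format and the name `find_races` are hypothetical.

```python
from itertools import combinations


def find_races(log):
    """Return conflicting access pairs from (thread, epoch, location, kind) tuples.

    Two accesses conflict if they touch the same location in the same
    synchronisation epoch from different threads and at least one is a write.
    """
    races = []
    for a, b in combinations(log, 2):
        same_place = a[2] == b[2] and a[1] == b[1]
        if same_place and a[0] != b[0] and "w" in (a[3], b[3]):
            races.append((a, b))
    return races


log = [
    (0, 0, "x", "w"),
    (1, 0, "x", "r"),   # races with the write above: same epoch
    (1, 1, "x", "r"),   # safe: a barrier separates it from the write
    (2, 0, "y", "r"),
]
print(len(find_races(log)))  # 1
```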

Debugging (2)

• ensure synchronisations are correct

• too few, and the code will break;

• too many, and the code will perform badly...

a debugging tool could for example

visually match synchronisation points and

give advice based on data race detection
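
One very simple form of matching synchronisation points is sketched below: compare the sequence of barriers each thread reached against a reference thread and report the threads that differ. The trace format (a dictionary from thread id to a list of barrier labels) and the function name `barrier_mismatches` are assumptions for this example, and thread 0 is assumed to be present as the reference.

```python
def barrier_mismatches(traces):
    """Return the threads whose barrier sequence differs from thread 0's."""
    reference = traces[0]
    return {thread: seq for thread, seq in traces.items() if seq != reference}


traces = {
    0: ["init", "step"],
    1: ["init", "step"],
    2: ["init"],          # this thread never reached the second barrier
}
print(barrier_mismatches(traces))  # {2: ['init']}
```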

(Micro-)Benchmarks

• performance characteristics need to be quantifiable

• runtime overheads, communication costs, parallel constructs

• allows programmers to model and analyse the performance of their code and make intelligent decisions regarding implementation
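
To show how micro-benchmark numbers can feed a performance model, here is a sketch that fits the common linear cost model, time = latency + size / bandwidth, to measured transfer times by least squares. The model choice and the function name are assumptions, and the timings are synthetic values generated from the model itself, not measurements.

```python
def fit_latency_bandwidth(sizes, times):
    """Least-squares fit of time = latency + size / bandwidth."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    slope = cov / var                    # seconds per byte
    latency = mean_t - slope * mean_s    # seconds
    return latency, 1.0 / slope          # (seconds, bytes per second)


# Synthetic timings generated from 2 microseconds latency and 1 GB/s.
sizes = [8, 1024, 65536, 1048576]
times = [2e-6 + s / 1e9 for s in sizes]
latency, bandwidth = fit_latency_bandwidth(sizes, times)
```

With the two fitted parameters a programmer can estimate, for example, whether many small remote accesses or one aggregated transfer is likely to be cheaper.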

Interoperability tools

• focus on language interoperability

• aim is to enable code written in one language to be called directly from code written in another language

• encourage and enable code reuse ➟ libraries

notable effort here is Babel (out of LLNL)

C, C++, Fortran 77-2008, Python, Java and now also Chapel, UPC and X10

though the latter three are still experimental

Performance profiling

• get information on hotspots and breakdown of timings

• how much time is spent waiting for data to arrive at the processing core?

• how much time is spent on memory management?

• lower-level information such as cache reuse, memory bandwidth, cycles per instruction, etc.
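
A breakdown of timings can be approximated by hand with wall clock timers, as in the sketch below. It is only an illustration of the kind of information a profiler would report: the category names are invented, and `time.sleep` stands in for waiting on remote data.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)


@contextmanager
def timed(category):
    """Accumulate wall clock time spent inside the block under `category`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[category] += time.perf_counter() - start


with timed("compute"):
    sum(i * i for i in range(10000))
with timed("wait"):
    time.sleep(0.01)   # stand-in for waiting on remote data

# Share of the total measured time per category.
breakdown = {k: v / sum(totals.values()) for k, v in totals.items()}
```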

Visualising data locality

• accessing memory not uniformly expensive

• important to keep data on memory infrastructure close to processing core that will operate on it

tool should highlight poor data locality, based on memory access patterns
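
A minimal sketch of the metric such a tool might compute: the fraction of accesses in a trace that touch data owned by another thread. It assumes a cyclic layout in which element i lives on thread i % threads; the trace format and the name `remote_fraction` are hypothetical.

```python
def remote_fraction(accesses, threads):
    """Fraction of (accessing thread, element index) pairs that are remote."""
    remote = sum(1 for thread, index in accesses if index % threads != thread)
    return remote / len(accesses)


threads = 4
own = [(i % threads, i) for i in range(16)]            # touch own elements
stencil = [(i % threads, i + 1) for i in range(16)]    # touch right neighbour
print(remote_fraction(own, threads), remote_fraction(stencil, threads))  # 0.0 1.0
```

The two traces differ by a single `+ 1` in the index, yet under this layout one is entirely local and the other entirely remote.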

Visualising “communications”

• related to data locality on previous slide

• communication is implicit in most (though not all) PGAS languages

• remote direct memory access

• difficult to gain a clear understanding of the communication patterns

• optimising these patterns important for performance
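
One common way to expose communication patterns is a communication matrix that sums the bytes moved between each pair of processes. The sketch below builds one from a list of transfers; the `(source, destination, bytes)` record format is an assumption, since how a real tool would capture implicit remote accesses is not specified here.

```python
def communication_matrix(transfers, nprocs):
    """Sum bytes moved between each (source, destination) pair."""
    matrix = [[0] * nprocs for _ in range(nprocs)]
    for src, dst, nbytes in transfers:
        matrix[src][dst] += nbytes
    return matrix


transfers = [(0, 1, 8), (0, 1, 8), (2, 0, 64), (3, 3, 16)]
matrix = communication_matrix(transfers, nprocs=4)
for row in matrix:
    print(row)
```

Entries on the diagonal are local accesses; large off-diagonal entries point at the pairs whose traffic is worth optimising.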

What is the reality?

• some of this functionality would be extremely beneficial, but does not even exist for shared-memory programming

• e.g. data locality visualiser for multi-core processors?

• what chance is there for PGAS tools?

• need to support a myriad of different programming, memory and execution models...

Will PGAS play a role in Exascale?

• not all of the PGAS languages will survive

• they will suffer the same fate as HPF

• the ones (or even the one?) that will remain won’t necessarily be the best implementations of PGAS

• but those that got the most support and managed to pick up momentum

Conclusions

• PGAS is in principle an attractive model

• but there are too many disparate implementations

• this makes community support difficult and may even be the downfall of the PGAS implementations

• only time will tell!

Questions?
